Workshop 2
COMP90051 Natural Language Processing Semester 1, 2020
Jun Wang
About me
• Jun Wang
• I was a tutor for SML last semester
• I’m tutoring SML and NLP this semester
• jun5@unimelb.edu.au
Materials
• Download files
  • Workshop-02.pdf
  • 01-preprocessing.ipynb
  • 02-bpe.ipynb
• From Canvas – Modules – Workshops – Materials
Learning Outcomes
• Discuss text preprocessing
  • Tokenisation
  • Normalisation
• Byte-pair Encoding
  • Algorithm
  • Implementation exercise
• Porter Stemmer
Text processing application
• Google translate
• Grammarly
• Spelling correction
• Spam filter for emails
Word Tokenisation
• What is tokenisation?
  • Segmenting text into tokens (words)
• Example
  ‘Topics to be covered include part-of-speech tagging, n-gram language modelling, syntactic parsing and deep learning.’
• Tokenised
  [‘Topics’, ‘to’, ‘be’, ‘covered’, ‘include’, ‘part-of-speech’, ‘tagging,’, ‘n-gram’, ‘language’, ‘modelling,’, ‘syntactic’, ‘parsing’, ‘and’, ‘deep’, ‘learning.’]
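The example above keeps punctuation attached to words (‘tagging,’, ‘learning.’), which is what plain whitespace splitting gives. As a minimal sketch, the code below compares whitespace splitting with a simple regular-expression tokeniser. The function name `regex_tokenise` and the pattern are illustrative assumptions, not the method used in 01-preprocessing.ipynb.

```python
import re

text = ("Topics to be covered include part-of-speech tagging, n-gram language "
        "modelling, syntactic parsing and deep learning.")

# Whitespace tokenisation: punctuation stays glued to the neighbouring word.
whitespace_tokens = text.split()


def regex_tokenise(sentence):
    """Hypothetical helper: keep hyphenated words whole and emit each
    punctuation mark as its own token."""
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", sentence)


regex_tokens = regex_tokenise(text)

print(whitespace_tokens)
print(regex_tokens)
```

The regular-expression version separates ‘tagging’ from the comma but still treats ‘part-of-speech’ and ‘n-gram’ as single tokens. Whether that is the right choice depends on the task.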
• Why do we need tokenisation?
  • A text contains too much information to process as a single unit
  • Humans can break it into individual components
  • Machines should do the same
• Subword tokenisation
  • Byte-pair encoding (BPE)
  • We have BPE implementation exercises (02-bpe.ipynb)
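As a rough sketch of how BPE learns its subword vocabulary: start from single characters, count adjacent symbol pairs weighted by word frequency, merge the most frequent pair, and repeat. The function names, the `</w>` end-of-word marker and the toy word counts below are assumptions for illustration. The version in 02-bpe.ipynb may be organised differently.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge operations from a {word: frequency} dict."""
    # Each word starts as a tuple of characters plus an end-of-word marker (assumed "</w>").
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Ties are broken by first occurrence; other implementations may differ.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges, vocab


# Toy corpus (made-up counts).
toy_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(toy_freqs, num_merges=3)
print(merges)
print(vocab)
```

On this toy corpus the frequent ending "est" ends up as a single subword after three merges, while rarer words stay split into smaller pieces.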
Token Normalisation
• What is word normalisation?
  • Putting words into a standard format
• What can you do when normalising?
  • Case folding
  • Correcting spelling
  • Expanding abbreviations
  • Removing morphology
    • Stemming
    • Lemmatisation
Why preprocess
• Find the sentence most similar to “The cat eats the rat” among the following sentences
  A. Cats eat rats.
  B. The dog eats the meat.
  C. The rat eats cheese.
• Let’s consider only the number of identical words in the sentences (usually we don’t do this).
• If we don’t preprocess the sentences:
  • A: 0 identical words
  • B: 3 identical words
  • C: 3 identical words
• B and C?
Why preprocess
• If we lowercase and lemmatise the tokens:
  • Query sentence: the cat eat the rat
  • A: cat eat rat
  • B: the dog eat the meat
  • C: the rat eat cheese
• Now:
  • A: 3 identical tokens
  • B: 3 identical tokens
  • C: 3 identical tokens
• All of them?
• What else can we do?
Stop words
• A list of unwanted words
• Closed-class or function words
• High frequency words
• Not appropriate when sequence is important
Why preprocess
• If we lowercase and lemmatise the tokens and remove stop words:
  • Query sentence: cat eat rat
  • A: cat eat rat
  • B: dog eat meat
  • C: rat eat cheese
• You could also remove “eat” because it has a high frequency.
• Now:
  • A: 3 identical tokens
  • B: 1 identical token
  • C: 2 identical tokens
• A!
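The three rounds of counting in this example can be reproduced with a short sketch. The lemma table and stop-word list below are tiny hand-written assumptions that only cover these four sentences, and the overlap is counted as a multiset intersection, so a repeated “the” counts twice.

```python
from collections import Counter

QUERY = "The cat eats the rat"
CANDIDATES = {
    "A": "Cats eat rats.",
    "B": "The dog eats the meat.",
    "C": "The rat eats cheese.",
}

# Toy resources (assumptions): a real system would use a proper lemmatiser
# and a much longer stop-word list.
LEMMAS = {"cats": "cat", "rats": "rat", "eats": "eat"}
STOP_WORDS = {"the"}


def raw_tokens(sentence):
    """No preprocessing: split on whitespace only."""
    return sentence.split()


def lemma_tokens(sentence):
    """Strip punctuation, lowercase, then look each token up in the lemma table."""
    tokens = [t.strip(".,").lower() for t in sentence.split()]
    return [LEMMAS.get(t, t) for t in tokens]


def content_tokens(sentence):
    """Lowercase and lemmatise, then drop stop words."""
    return [t for t in lemma_tokens(sentence) if t not in STOP_WORDS]


def overlap(tokens_a, tokens_b):
    """Number of shared tokens, counting repeats (multiset intersection)."""
    return sum((Counter(tokens_a) & Counter(tokens_b)).values())


def scores(preprocess):
    query = preprocess(QUERY)
    return {label: overlap(query, preprocess(s)) for label, s in CANDIDATES.items()}


print(scores(raw_tokens))      # no preprocessing
print(scores(lemma_tokens))    # lowercased and lemmatised
print(scores(content_tokens))  # stop words removed as well
```

Only the last setting separates A from B and C, which is the point of the example.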
Stemming and lemmatisation
• What are stemming and lemmatisation?
  • Lemmatisation: removing any inflection to reach the uninflected form, the lemma
  • Stemming: stripping off all suffixes, leaving a stem
Stemming and lemmatisation
• Similarities and differences between stemming and lemmatisation
  Property                                          Stemming    Lemmatisation
  Garbage token (output may not be a real word)     Yes         No
  Remove derivational morphology                    Usually     Doesn’t usually
  Remove inflectional morphology                    Yes         Yes
  Works with a lexicon                              No          Yes
  Remove or replace affixes (primarily suffixes)    Yes         Yes
  Transform a token into a normalised form          Yes         Yes
Inflectional morphology
• Inflectional morphology is the systematic process by which tokens are altered to conform to certain grammatical constraints
• Cat -> cats
• Eat -> eats, ate
• Teach -> teaching
Derivational morphology
• Derivational morphology is the (semi-)systematic process by which we transform terms of one class into a different class.
• Teach -> teacher
• Personal -> personally
Inflectional and derivational
• Computers
• -s: inflectional morpheme
• -er: derivational morpheme
• Lemmatisation: Computers -> Computer
• Stemming: Computers -> comput
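The contrast on this slide can be mimicked with a deliberately tiny sketch: a lemmatiser that looks words up in a lexicon, and a stemmer that blindly strips suffixes. Both the lexicon entries and the suffix list are made-up assumptions for illustration, and both functions lowercase their input. Real lemmatisers and stemmers are far more elaborate.

```python
# Hypothetical mini-lexicon mapping inflected forms to lemmas.
LEXICON = {"computers": "computer", "cats": "cat", "ate": "eat"}

# Hypothetical suffix list, tried in this order.
SUFFIXES = ["s", "er"]


def lemmatise(token):
    """Lexicon lookup: only inflection listed in the lexicon is removed."""
    token = token.lower()
    return LEXICON.get(token, token)


def toy_stem(token):
    """Rule-based stripping: removes suffixes even if the result is not a word."""
    stem = token.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if stem.endswith(suffix) and len(stem) > len(suffix) + 2:
            stem = stem[: -len(suffix)]
    return stem


print(lemmatise("Computers"))  # inflectional -s removed, a real word remains
print(toy_stem("Computers"))   # -s and derivational -er removed
```

The lemmatiser returns a dictionary word because it consults the lexicon. The stemmer also removes the derivational -er and is left with a stem that is not a word.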
Porter Stemmer
• c = a single consonant, e.g. s, d, g
• v = a single vowel, e.g. a, o
• C = a sequence of consonants, e.g. s, ss, tr
• V = a sequence of vowels, e.g. a, ao, oo
Porter Stemmer
• Measure
  • A word can be represented as [C](VC)^m[V], where m is the measure
• Example: PRIVATE
  PR I V A T E
  C  V C V C V
  [C](VC)^2[V], so m = 2
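The measure can be computed by collapsing a word into its C/V pattern and counting the VC pairs. The sketch below assumes the usual Porter convention that y counts as a vowel when it follows a consonant. The function names are illustrative.

```python
VOWELS = set("aeiou")


def is_consonant(word, i):
    """True if word[i] is a consonant; 'y' is a vowel when it follows a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def cv_pattern(word):
    """Collapse a word into alternating C and V groups, e.g. 'private' -> 'CVCVCV'."""
    word = word.lower()
    pattern = []
    for i in range(len(word)):
        kind = "C" if is_consonant(word, i) else "V"
        if not pattern or pattern[-1] != kind:
            pattern.append(kind)
    return "".join(pattern)


def measure(word):
    """Number of VC pairs in the collapsed pattern, i.e. m in [C](VC)^m[V]."""
    return cv_pattern(word).count("VC")


print(cv_pattern("PRIVATE"), measure("PRIVATE"))
```

For PRIVATE this gives the pattern CVCVCV and m = 2, matching the slide.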
Porter Stemmer
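• Rule format: (condition) S1 -> S2
  • The condition is tested on the stem that remains once the suffix S1 is removed
• e.g. (m>1) EMENT -> null
  • Replacement -> Replac (the stem REPLAC has m = 2)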
Porter Stemmer
• Rule format: (condition) S1 -> S2
• e.g. (*v*) ING -> null, followed by (m=1 and *o) -> E
  • Filing -> Fil -> File
  • Failing -> Fail
• *v* – the stem contains a vowel
• *o – the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP)
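The example rules from the Porter Stemmer slides can be tried out with the sketch below. It implements only the rules quoted here, (m>1) EMENT -> null, (*v*) ING -> null and (m=1 and *o) -> E. The full Porter stemmer has many more steps, so treat the function names and structure as assumptions for illustration.

```python
VOWELS = set("aeiou")


def is_consonant(word, i):
    """True if word[i] is a consonant; 'y' is a vowel when it follows a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """m in [C](VC)^m[V]: count the transitions from a vowel to a consonant."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


def contains_vowel(stem):
    """Condition *v*."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_cvc(stem):
    """Condition *o: ends consonant-vowel-consonant, last consonant not w, x or y."""
    n = len(stem)
    return (n >= 3
            and is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1)
            and stem[-1] not in "wxy")


def strip_ement(word):
    """(m>1) EMENT -> null"""
    if word.endswith("ement") and measure(word[:-5]) > 1:
        return word[:-5]
    return word


def strip_ing(word):
    """(*v*) ING -> null, then (m=1 and *o) -> E on the remaining stem."""
    if word.endswith("ing") and contains_vowel(word[:-3]):
        stem = word[:-3]
        if measure(stem) == 1 and ends_cvc(stem):
            stem += "e"
        return stem
    return word


for w in ["replacement", "filing", "failing", "sing"]:
    print(w, "->", strip_ement(strip_ing(w)))
```

"sing" is left unchanged because its stem "s" contains no vowel, and "failing" does not get an E restored because "fail" ends in vvc rather than cvc.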