COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE
COMP90042
Natural Language Processing
Lecture 2: Text Preprocessing
Semester 1 2021, Week 1
Jey Han Lau
Definitions
• Word: a sequence of characters with a meaning and/or function
• Sentence
‣ E.g. “The student is enrolled at the University of Melbourne.”
• Document: one or more sentences.
• Corpus: a collection of documents.
• Word token: each instance of a word.
‣ E.g. 9 word tokens in the example sentence.
• Word type: distinct words.
‣ Lexicon (“dictionary”): a group of word types.
‣ E.g. 8 word types in the example sentence.
PollEv.com/jeyhanlau569
How many words (types) are there in English?
• < 10K
• < 50K
• < 100K
• < 500K
• ???
How Many Unique Words?

Corpus                            #Tokens (N)    #Types (|V|)
Switchboard phone conversations   2.4 million    20 thousand
Shakespeare                       800 thousand   31 thousand
Google N-grams                    1 trillion     13 million

Church and Gale (1990): |V| > O(N^(1/2))
Why Preprocess?
• Most NLP applications have documents as inputs:
‣ “This movie is so great!!! U should definitely watch it in the theater! Best sci-fi eva!” → 😀
‣ “Eu estive em Melbourne no ano passado.” → “I was in Melbourne last year.”
• Key point: language is compositional. As humans, we can break these documents into individual components. To understand language, a computer should do the same.
• Preprocessing is the first step.
Preprocessing Steps
1. Remove unwanted formatting (e.g. HTML)
2. Sentence segmentation: break documents into sentences
3. Word tokenisation: break sentences into words
4. Word normalisation: transform words into canonical forms
5. Stopword removal: delete unwanted words

Example, one step at a time (a code sketch of steps 2–5 follows):
“Hi there. I’m TARS.” (after removing formatting)
→ [“Hi there.”, “I’m TARS.”]
→ [[“Hi”, “there”, “.”], [“I”, “’m”, “TARS”, “.”]]
→ [[“hi”, “there”, “.”], [“i”, “am”, “tars”, “.”]]
→ [[], [“tars”]]
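Below is a toy end-to-end sketch of steps 2–5 on this example in Python. The regexes, clitic table, and stopword list are illustrative choices, not part of the lecture; for simplicity the clitic “’m” is expanded during normalisation rather than split off at tokenisation.

```python
import re

doc = "Hi there. I'm TARS."
STOPWORDS = {"hi", "there", "i", "am"}               # illustrative list

# 2. Sentence segmentation (naive: split after sentence punctuation)
sentences = re.split(r"(?<=[.?!])\s+", doc)

# 3. Word tokenisation (keep sentence punctuation as its own token)
tokens = [re.findall(r"[\w']+|[.?!]", s) for s in sentences]

# 4. Word normalisation (lowercase, expand one clitic)
CLITICS = {"i'm": ["i", "am"]}
norm = [[w for t in sent for w in CLITICS.get(t.lower(), [t.lower()])]
        for sent in tokens]

# 5. Stopword removal (also dropping punctuation)
print([[t for t in sent if t not in STOPWORDS and t.isalnum()]
       for sent in norm])                            # [[], ['tars']]
```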
Sentence Segmentation
“Hi there. I’m TARS.” → [“Hi there.”, “I’m TARS.”]
Sentence Segmentation
• Naïve approach: break on sentence punctuation ([.?!])
‣ But sentence punctuation also appears mid-sentence (U.S. dollar, “…”, Yahoo! as a word)
• Second try: use a regex that also requires a following capital ([.?!] [A-Z]), illustrated below
‣ But abbreviations are often followed by names (Mr. Brown)
• Better yet: use lexicons
‣ But it is difficult to enumerate all names and abbreviations
• State-of-the-art systems use machine learning, not rules
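A quick demonstration of the problem with Python's re, using the “second try” pattern above:

```python
import re

text = "Hello! Mr. Brown paid U.S. $100. He left."

# Split where sentence punctuation is followed by whitespace and a capital.
# It still wrongly splits after "Mr." because "Brown" is capitalised.
print(re.split(r"(?<=[.?!])\s+(?=[A-Z])", text))
# ['Hello!', 'Mr.', 'Brown paid U.S. $100.', 'He left.']
```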
Binary Classifier
• Looks at every “.” and decides whether it is the end of a sentence
‣ E.g. decision trees, logistic regression
• Features (a toy sklearn sketch follows)
‣ The words before and after “.”
‣ Word shapes:
– Uppercase, lowercase, ALL_CAPS, number
– Character length
‣ Part-of-speech tags:
– Determiners tend to start a sentence
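A minimal sketch of such a classifier with scikit-learn; the feature set and the tiny hand-labelled dataset are illustrative assumptions, not from the lecture:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def features(prev_word, next_word):
    """Features of one '.' occurrence, from its neighbouring words."""
    return {
        "prev": prev_word.lower(),            # word before the "."
        "next": next_word.lower(),            # word after the "."
        "prev_short": len(prev_word) <= 3,    # abbreviations tend to be short
        "next_cap": next_word[:1].isupper(),  # sentences start with capitals
    }

# 1 = the "." ends a sentence, 0 = it does not (toy labels)
X = [features("Mr", "Brown"), features("year", "The"),
     features("U.S", "dollar"), features("TARS", "I")]
y = [0, 1, 0, 1]

vec = DictVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(X), y)
print(clf.predict(vec.transform([features("Dr", "Smith")])))
```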
Word Tokenisation
[“Hi there.”, “I’m TARS.”] → [[“Hi”, “there”, “.”], [“I”, “’m”, “TARS”, “.”]]
Word Tokenisation: English
• Naïve approach: separate out alphanumeric strings (\w+), demonstrated below
• Abbreviations (U.S.A.)
• Hyphens (merry-go-round vs. well-respected vs. yes-but)
• Numbers (1,000,000.01)
• Dates (3/1/2016)
• Clitics (n’t in can’t)
• Internet language (http://www.google.com, #metoo, :-))
• Multiword units (New Zealand)
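Trying the naive \w+ approach on a few of the tricky cases above shows where it breaks:

```python
import re

sent = "I can't afford $1,000,000.01 in the U.S.A. #metoo"

# \w+ grabs runs of alphanumerics: the clitic, the abbreviation,
# the comma-separated number, and the hashtag all come apart.
print(re.findall(r"\w+", sent))
# ['I', 'can', 't', 'afford', '1', '000', '000', '01',
#  'in', 'the', 'U', 'S', 'A', 'metoo']
```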
Word Tokenisation: Chinese
• Some Asian languages are written without spaces between words
• In Chinese, words often correspond to more than one character

墨大      的   学生       与众不同
Unimelb   ’s   students   (are) special
Word Tokenisation: Chinese
• The standard approach assumes an existing vocabulary
• MaxMatch algorithm (implemented in the sketch below)
‣ Greedily match the longest word in the vocabulary

V = {墨, 大, 的, 学, 生, 与, 众, 不, 同, 墨大, 学生, 不同, 与众不同}

Input: 墨大的学生与众不同
‣ match 墨大, move to 的
‣ match 的, move to 学
‣ match 学生, move to 与
‣ match 与众不同, done
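A short Python sketch of MaxMatch as described, falling back to a single character when nothing in the vocabulary matches:

```python
def max_match(text, vocab):
    """Greedy left-to-right longest-match tokenisation."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

vocab = {"墨", "大", "的", "学", "生", "与", "众", "不", "同",
         "墨大", "学生", "不同", "与众不同"}
print(max_match("墨大的学生与众不同", vocab))
# ['墨大', '的', '学生', '与众不同']
```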
Word Tokenisation: Chinese
• But how do we know what the vocabulary is?
• And greedy matching doesn’t always work:

去 买 新西兰 花
go buy New Zealand flowers

去 买 新 西兰花
go buy new broccoli
Word Tokenisation: German
• Lebensversicherungsgesellschaftsangestellter
‣ = “life insurance company employee”
• Requires a compound splitter
Subword Tokenisation
• Colourless green ideas sleep furiously →
[colour] [less] [green] [idea] [s] [sleep] [furious] [ly]
• One popular algorithm: byte-pair encoding (BPE)
• Core idea: iteratively merge frequent pairs of characters
• Advantages:
‣ Data-informed tokenisation
‣ Works for different languages
‣ Deals better with unknown words
Byte-Pair Encoding
• Corpus
‣ [5] l o w _
‣ [2] l o w e s t _
‣ [6] n e w e r _
‣ [3] w i d e r _
‣ [2] n e w _
• Vocabulary (initially the individual characters)
‣ _, d, e, i, l, n, o, r, s, t, w
Byte-Pair Encoding
• Corpus
‣ [5] l o w _
‣ [2] l o w e s t _
‣ [6] n e w e r_
‣ [3] w i d e r_
‣ [2] n e w _
• Vocabulary (after merging r + _ → r_)
‣ _, d, e, i, l, n, o, r, s, t, w, r_
Byte-Pair Encoding
• Corpus
‣ [5] l o w _
‣ [2] l o w e s t _
‣ [6] n e w er_
‣ [3] w i d er_
‣ [2] n e w _
• Vocabulary (after merging e + r_ → er_)
‣ _, d, e, i, l, n, o, r, s, t, w, r_, er_
Byte-Pair Encoding
• Corpus
‣ [5] l o w _
‣ [2] l o w e s t _
‣ [6] n ew er_
‣ [3] w i d er_
‣ [2] n ew _
• Vocabulary (after merging e + w → ew)
‣ _, d, e, i, l, n, o, r, s, t, w, r_, er_, ew
Byte-Pair Encoding
• Corpus
‣ [5] l o w _
‣ [2] l o w e s t _
‣ [6] new er_
‣ [3] w i d er_
‣ [2] new _
• Vocabulary (after merging n + ew → new)
‣ _, d, e, i, l, n, o, r, s, t, w, r_, er_, ew, new
Byte-Pair Encoding
• Vocabulary (continuing with the merges lo, low, newer_, low_)
‣ _, d, e, i, l, n, o, r, s, t, w, r_, er_, ew, new
‣ _, d, e, i, l, n, o, r, s, t, w, r_, er_, ew, new, lo
‣ _, d, e, i, l, n, o, r, s, t, w, r_, er_, ew, new, lo, low
‣ _, d, e, i, l, n, o, r, s, t, w, r_, er_, ew, new, lo, low, newer_
‣ _, d, e, i, l, n, o, r, s, t, w, r_, er_, ew, new, lo, low, newer_, low_
Byte-Pair Encoding
• In practice BPE will run with thousands of merges, creating a large vocabulary
• The most frequent words will be represented as full words
• Rarer words will be broken into subwords
• In the worst case, unknown words in test data will be broken into individual letters

A sketch of the merge loop follows.
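This is a compact sketch of the merge loop on the corpus above. When two pairs tie in frequency it simply takes the first one seen, so intermediate merges can differ from the slides; the naive string replace is also only safe for toy data like this.

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    """Toy BPE: corpus maps space-separated symbol strings
    (ending in the boundary symbol '_') to word frequencies."""
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in corpus.items():
            syms = word.split()
            for pair in zip(syms, syms[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        a, b = max(pairs, key=pairs.get)      # most frequent pair
        # Merge every occurrence of that pair into a single symbol.
        corpus = {w.replace(f"{a} {b}", f"{a}{b}"): f
                  for w, f in corpus.items()}
    return corpus

corpus = {"l o w _": 5, "l o w e s t _": 2, "n e w e r _": 6,
          "w i d e r _": 3, "n e w _": 2}
print(bpe_merges(corpus, 4))
# {'l o w _': 5, 'l o w e s t _': 2, 'new er_': 6, 'w i d er_': 3, 'new _': 2}
```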
PollEv.com/jeyhanlau569
What are the disadvantages of subword tokenisation?
• Language-dependent
• Larger vocabulary
• Creates non-sensical subwords
• Unable to handle misspellings
Word Normalisation
[[“Hi”, “there”, “.”], [“I”, “’m”, “TARS”, “.”]] → [[“hi”, “there”, “.”], [“i”, “am”, “tars”, “.”]]
Word Normalisation
• Lower casing (Australia → australia)
• Removing morphology (cooking → cook)
• Correcting spelling (definately → definitely)
• Expanding abbreviations (U.S.A → USA)
• Goal:
‣ Reduce the vocabulary
‣ Map variant forms to the same word type
Inflectional Morphology
• Inflectional morphology creates grammatical variants
• English inflects nouns, verbs, and adjectives
‣ Nouns: number of the noun (-s)
‣ Verbs: number of the subject (-s), the aspect (-ing) of the action, and the tense (-ed) of the action
‣ Adjectives: comparatives (-er) and superlatives (-est)
• Many languages have much richer inflectional morphology than English
‣ E.g. French inflects nouns for gender (un chat, une chatte)
Lemmatisation
• Lemmatisation means removing any inflection to reach the uninflected form, the lemma
‣ speaking → speak
• In English, there are irregularities that prevent a trivial solution:
‣ poked → poke (not pok)
‣ stopping → stop (not stopp)
‣ watches → watch (not watche)
‣ was → be (not wa)
• A lexicon of lemmas is needed for accurate lemmatisation (see the NLTK sketch below)
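NLTK's WordNet-based lemmatiser handles these irregular cases through exception lists; a quick check (the wordnet data must be downloaded once with nltk.download("wordnet")):

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# pos="v" tells the lemmatiser to treat each word as a verb
for word in ["poked", "stopping", "watches", "was"]:
    print(word, "->", lemmatizer.lemmatize(word, pos="v"))
# poked -> poke, stopping -> stop, watches -> watch, was -> be
```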
Derivational Morphology
• Derivational morphology creates distinct words
• English derivational suffixes often change the lexical category, e.g.
‣ -ly (personal → personally)
‣ -ise (final → finalise)
‣ -er (write → writer)
• English derivational prefixes often change the meaning without changing the lexical category
‣ write → rewrite
‣ healthy → unhealthy
Stemming
• Stemming strips off all suffixes, leaving a stem
‣ E.g. automate, automatic, automation → automat
‣ The stem is often not an actual lexical item
• Results in even less lexical sparsity than lemmatisation
• Popular in information retrieval
• The stem is not always interpretable
The Porter Stemmer
• The most popular stemmer for English
• Applies rewrite rules in stages
‣ First strip inflectional suffixes
– E.g. -ies → -i
‣ Then derivational suffixes
– E.g. -isation → -ise → -i
The Porter Stemmer
• c (lowercase) = consonant; e.g. ‘b’, ‘c’, ‘d’
• v (lowercase) = vowel; e.g. ‘a’, ‘e’, ‘i’, ‘o’, ‘u’
• C = a sequence of consonants
‣ s, ss, tr, bl
• V = a sequence of vowels
‣ o, oo, ee, io
The Porter Stemmer
• A word has one of four forms:
‣ CVCV … C
‣ CVCV … V
‣ VCVC … C
‣ VCVC … V
• Which can be represented as:
‣ [C]VCVC … [V]
‣ [C](VC)^m [V]
‣ m = the “measure” of the word
[C](VC)^m [V]
• TREE
‣ = CV
‣ = C(VC)^0 V
‣ [m=0]
• TREES
‣ = CVC
‣ = C(VC)^1
‣ [m=1]
• TROUBLES
‣ = CVCVC
‣ = C(VC)^2
‣ [m=2]
Examples
• m=0: TR, EE, TREE, Y, BY
• m=1: TROUBLE, OATS, TREES, IVY
• m=2: TROUBLES, PRIVATE, OATEN, ORRERY
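A small function to compute m from this definition. As a simplification it treats ‘y’ as a consonant everywhere, whereas Porter counts ‘y’ as a vowel after a consonant; the examples above come out the same either way.

```python
import re

def porter_measure(word):
    """Count VC sequences in the [C](VC)^m[V] decomposition."""
    form = re.sub(r"[aeiou]+", "V", word.lower())  # vowel runs -> V
    form = re.sub(r"[^V]+", "C", form)             # consonant runs -> C
    return form.count("VC")

for w in ["tree", "trees", "troubles", "ivy", "orrery"]:
    print(w, porter_measure(w))
# tree 0, trees 1, troubles 2, ivy 1, orrery 2
```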
The Porter Stemmer
• Rule format: (condition) S1 → S2
• E.g. (m > 1) EMENT → null
‣ REPLACEMENT → REPLAC
‣ (the stem REPLAC = CVCVC = C(VC)^2, so m=2 > 1 and the rule applies)
• Always use the longest matching S1. Given the rules SSES → SS, IES → I, SS → SS, S → null:
‣ CARESSES → CARESS (SSES matches)
‣ CARESS → CARESS (SS matches)
‣ CARES → CARE (S matches)
The Porter Stemmer
• Step 1: plurals and inflectional morphology

Step  Rule                  Positive example         Negative example
1a    SSES → SS             caresses → caress
      IES → I               ponies → poni
      SS → SS               caress → caress
      S → null              cats → cat
1b    (m>0) EED → EE        agreed → agree           feed → feed
      (*v*) ED → null       plastered → plaster      bled → bled
      (*v*) ING → null      motoring → motor         sing → sing
1b+   AT → ATE              conflat(ed) → conflate
1c    (*v*) Y → I           happy → happi

(*v* = the stem contains a vowel. For EED → EE, the stem of agreed is agr = (VC)^1, so m=1 > 0 and the rule fires; the stem of feed is f = C(VC)^0, so m=0 and it does not.)
The Porter Stemmer
• Steps 2, 3, 4: derivational suffixes

Step  Rule                   Positive example
2     (m>0) ATIONAL → ATE    relational → relate
      (m>0) TIONAL → TION    conditional → condition
      (m>0) ENCI → ENCE      valenci → valence
      (m>0) ANCI → ANCE      hesitanci → hesitance
3     (m>0) ICATE → IC       triplicate → triplic
      (m>0) ATIVE → null     formative → form
      (m>0) ALIZE → AL       formalize → formal
4     (m>1) AL → null        revival → reviv
      (m>1) ER → null        airliner → airlin
      (m>1) ATE → null       activate → activ
The Porter Stemmer
• Step 5: tidying up

Step  Rule                                       Positive example
5     (m>1) E → null                             probate → probat
      (m>1 and *d and *L) null → single letter   controll → control

(*d = stem ends with a double consonant, e.g. -TT; *L = stem ends with ‘l’)
The Porter Stemmer
• computational → comput
‣ step 2: ATIONAL → ATE: computate
‣ step 4: ATE → null: comput
• computer → comput
‣ step 4: ER → null: comput
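NLTK ships an implementation of the Porter stemmer, which is handy for checking these worked examples (its default mode includes some extensions, so a few words may differ from the original 1980 rules):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for w in ["computational", "computer", "caresses", "ponies", "replacement"]:
    print(w, "->", stemmer.stem(w))
# computational -> comput, computer -> comput,
# caresses -> caress, ponies -> poni, replacement -> replac
```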
Fixing Spelling Errors
• Why fix them?
‣ Spelling errors create new, rare types
‣ They disrupt various kinds of linguistic analysis
‣ They are very common in internet corpora
‣ In web search, correction is particularly important for queries
• How?
‣ String distance (Levenshtein, etc.; a sketch follows)
‣ Modelling of error types (phonetic, typing, etc.)
‣ Using an n-gram language model (next lecture!)
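A standard dynamic-programming sketch of Levenshtein distance, the first of the string-distance options above:

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b."""
    # dist[i][j] = distance between a[:i] and b[:j]
    dist = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dist[i][0] = i                                  # delete all of a[:i]
    for j in range(len(b) + 1):
        dist[0][j] = j                                  # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution
    return dist[-1][-1]

print(levenshtein("definately", "definitely"))  # 1
```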
Other Word Normalisation
• Normalising spelling variations
‣ Normalize → Normalise (or vice versa)
‣ U r so coool! → you are so cool
• Expanding abbreviations
‣ US, U.S. → United States
‣ imho → in my humble opinion
Stopword Removal
[[“hi”, “there”, “.”], [“i”, “am”, “tars”, “.”]] → [[], [“tars”]]
Stop Words
• Definition: a list of words to be removed from the document
‣ Typical in bag-of-words (BOW) representations
‣ Not appropriate when word sequence is important
• How to choose them?
‣ All closed-class or function words
– E.g. the, a, of, for, he, …
‣ Any high-frequency words
‣ Use the lists shipped with NLP toolkits such as NLTK and spaCy (see the sketch below)
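Filtering the running example with NLTK's English stopword list (downloaded once via nltk.download("stopwords")). Note that NLTK's list does not include "hi", so the result differs slightly from the earlier slide, which assumed both "hi" and "there" are stopwords:

```python
from nltk.corpus import stopwords

stops = set(stopwords.words("english"))
tokens = [["hi", "there", "."], ["i", "am", "tars", "."]]
print([[t for t in sent if t not in stops and t.isalnum()]
       for sent in tokens])
# [['hi'], ['tars']]
```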
A Final Word
• Preprocessing is unavoidable in text analysis
• It can have a major effect on downstream applications
• Exact steps may vary depending on the corpus and task
• Simple rule-based systems work well, but rarely perfectly
• Preprocessing is language-dependent
Further Reading
• J&M3 Ch 2.4
• Details on the Porter Stemmer algorithm: http://snowball.tartarus.org/algorithms/porter/stemmer.html