
Text Preprocessing
COMP90042
Natural Language Processing
Lecture 2
Semester 1 2021 Week 1 Jey Han Lau
COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE
Definitions
• Words
‣ Sequence of characters with a meaning and/or function
• Sentence
‣ “The student is enrolled at the University of Melbourne.”
• Document: one or more sentences.
• Corpus: a collection of documents.
• Word token: each instance of a word.
‣ E.g. 9 word tokens in the example sentence.
• Word type: distinct words.
‣ Lexicon (“dictionary”): a group of word types.
‣ E.g. 8 word types in the example sentence.
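These counts can be checked mechanically. A minimal sketch, assuming the final period is not counted as a token and case is folded so that “The” and “the” are one type:

```python
sentence = "The student is enrolled at the University of Melbourne"
tokens = sentence.lower().split()   # word tokens (case-folded, no period)
types = set(tokens)                 # distinct word types: "the" counted once
```

With these assumptions, `len(tokens)` is 9 and `len(types)` is 8, matching the example.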
How many words (types) are there in English?
• <10K
• <50K
• <100K
• <500K
• ???
PollEv.com/jeyhanlau569

How Many Unique Words?

                                 #Tokens (N)     #Types (|V|)
Switchboard phone conversation   2.4 million     20 thousand
Shakespeare                      800 thousand    31 thousand
Google N-gram                    1 trillion      13 million

Church and Gale (1990): |V| > O(N^(1/2))
Why Preprocess?
• Most NLP applications have documents as inputs:
‣ “This movie is so great!!! U should definitely watch it in the theater! Best sci-fi eva!” → 😀
‣ “Eu estive em Melbourne no ano passado.” → “I was in Melbourne last year.”
• Key point: language is compositional. As humans, we can break these documents into individual components. To understand language, a computer should do the same.
• Preprocessing is the first step.
Preprocessing Steps
1. Remove unwanted formatting (e.g. HTML)
2. Sentence segmentation: break documents into sentences
3. Word tokenisation: break sentences into words
4. Word normalisation: transform words into canonical forms
5. Stopword removal: delete unwanted words

“Hi there. I’m TARS.”
→ [“Hi there.”, “I’m TARS.”]
→ [[“Hi”, “there”, “.”], [“I”, “’m”, “TARS”, “.”]]
→ [[“hi”, “there”, “.”], [“i”, “am”, “tars”, “.”]]
→ [[], [“tars”]]
Sentence Segmentation
“Hi there. I’m TARS.” → [“Hi there.”, “I’m TARS.”]
Sentence Segmentation
Naïve approach: break on sentence punctuation ([.?!])
‣ But periods are used for abbreviations!
 (U.S. dollar, …, Yahoo! as a word)
Second try: use regex to require capital ([.?!] [A-Z])
‣ But abbreviations often followed by names (Mr. Brown)
Better yet: have lexicons
‣ But difficult to enumerate all names and abbreviations
State-of-the-art uses machine learning, not rules
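The naïve approach and the second try can be sketched as regexes; these patterns are illustrative, not the lecture's exact rules, and the second one still fails on abbreviations followed by names:

```python
import re

def naive_split(text):
    # Naïve approach: break after sentence punctuation [.?!].
    return [s for s in re.split(r"(?<=[.?!])\s+", text) if s]

def capital_split(text):
    # Second try: only split when the punctuation is followed by a capital.
    # Still wrongly splits "Mr. Brown".
    return [s for s in re.split(r"(?<=[.?!])\s+(?=[A-Z])", text) if s]
```

For example, `capital_split("Ask Mr. Brown. He knows.")` splits after "Mr.", exactly the failure mode the slide describes.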
Binary Classifier
• Looks at every “.” and decides whether it is the end of a sentence.
‣ Decision trees, logistic regression
• Features
‣ Look at the words before and after “.”
‣ Word shapes:
– Uppercase, lowercase, ALL_CAPS, number
– Character length
‣ Part-of-speech tags:
– Determiners tend to start a sentence
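A sketch of the feature extraction such a classifier might use; the function and feature names, and the shape buckets, are illustrative choices, not from the lecture:

```python
def word_shape(token):
    # Map a token to a coarse shape feature, as in the slide.
    if token.isdigit():
        return "number"
    if token.isupper():
        return "ALL_CAPS"
    if token[:1].isupper():
        return "Uppercase"
    return "lowercase"

def period_features(tokens, i):
    # Features for deciding whether tokens[i] == "." ends a sentence:
    # the surrounding words, their shapes, and their lengths.
    prev_w = tokens[i - 1] if i > 0 else "<s>"
    next_w = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return {
        "prev_word": prev_w.lower(),
        "next_word": next_w.lower(),
        "prev_shape": word_shape(prev_w),
        "next_shape": word_shape(next_w),
        "prev_len": len(prev_w),
    }
```

These dictionaries would then be fed to a decision tree or logistic regression (part-of-speech features would need a tagger and are omitted here).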
Word Tokenisation
[“Hi there.”, “I’m TARS.”] → [[“Hi”, “there”, “.”], [“I”, “’m”, “TARS”, “.”]]
Word Tokenisation: English
• Naïve approach: separate out alphabetic strings (\w+)
• Abbreviations (U.S.A.)
• Hyphens (merry-go-round vs. well-respected vs. yes-but)
• Numbers (1,000,000.01)
• Dates (3/1/2016)
• Clitics (n’t in can’t)
• Internet language (http://www.google.com, #metoo, :-))
• Multiword units (New Zealand)
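A sketch contrasting the naïve \w+ approach with a slightly richer pattern; the improved regex is an illustrative assumption, not a standard tokeniser, and it still mishandles clitics, smilies, and multiword units:

```python
import re

def naive_tokenise(text):
    # Naïve approach: alphabetic/numeric runs only (\w+) —
    # it shreds abbreviations, hyphens, and numbers.
    return re.findall(r"\w+", text)

# Richer pattern: URLs, hashtags, words with internal
# hyphens/periods/commas, and punctuation as separate tokens.
TOKEN = re.compile(r"https?://\S+|#\w+|\w+(?:[-./,]\w+)*|[^\w\s]")

def tokenise(text):
    return TOKEN.findall(text)
```

Compare `naive_tokenise("U.S.A.")`, which gives `["U", "S", "A"]`, with `tokenise("U.S.A.")`, which keeps the abbreviation together.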


Word Tokenisation: Chinese
• Some Asian languages are written without spaces between words
• In Chinese, words often correspond to more than one character
墨大 的 学生 与众不同
Unimelb ’s students (are) special


Word Tokenisation: Chinese
• Standard approach assumes an existing vocabulary
• MaxMatch algorithm
‣ Greedily match longest word in the vocabulary
V = {墨, 大, 的, 学, 生, 与, 众, 不, 同, 墨大, 学生, 不同, 与众不同}
墨大的学生与众不同
match 墨大, move to 的; match 的, move to 学; match 学生, move to 与; match 与众不同, done
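A minimal MaxMatch sketch. It backs off to a single character when nothing in the vocabulary matches, an assumption the slide does not spell out:

```python
def max_match(text, vocab):
    # Greedily match the longest vocabulary word at each position;
    # fall back to a single character when no word matches.
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):          # try longest span first
            if text[i:j] in vocab or j == i + 1:   # single char as fallback
                tokens.append(text[i:j])
                i = j
                break
    return tokens
```

Running it on the slide's example reproduces the trace: 墨大 / 的 / 学生 / 与众不同.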
Word Tokenisation: Chinese
• But how do we know what the vocabulary is?
• And it doesn’t always work:
去买 新西兰 花 → go buy New Zealand flowers
去买 新 西兰花 → go buy new broccoli
Word Tokenisation: German
• Lebensversicherungsgesellschaftsangestellter = life insurance company employee
• Requires a compound splitter
Subword Tokenisation
• Colourless green ideas sleep furiously →
 [colour] [less] [green] [idea] [s] [sleep] [furious] [ly]
• One popular algorithm: byte-pair encoding (BPE)
• Core idea: iteratively merge frequent pairs of characters
• Advantage:
‣ Data-informed tokenisation
‣ Works for different languages
‣ Deals better with unknown words
Byte-Pair Encoding
• Corpus
‣ [5]low_
‣ [2]lowest_
‣ [6]newer_
‣ [3]wider_
‣ [2]new_
• Vocabulary
‣ _,d,e,i,l,n,o,r,s,t,w
Byte-Pair Encoding
• Corpus
‣ [5]low_
‣ [2]lowest_
‣ [6]newer_
‣ [3]wider_
‣ [2]new_
• Vocabulary
‣ _,d,e,i,l,n,o,r,s,t,w,r_
Byte-Pair Encoding
• Corpus
‣ [5]low_
‣ [2]lowest_
‣ [6]newer_
‣ [3]wider_
‣ [2]new_
• Vocabulary
‣ _,d,e,i,l,n,o,r,s,t,w,r_,er_
Byte-Pair Encoding
• Corpus
‣ [5]low_
‣ [2]lowest_
‣ [6]newer_
‣ [3]wider_
‣ [2]new_
• Vocabulary
‣ _,d,e,i,l,n,o,r,s,t,w,r_,er_,ew
Byte-Pair Encoding
• Corpus
‣ [5]low_
‣ [2]lowest_
‣ [6] new er_
‣ [3]wider_
‣ [2] new _
• Vocabulary
‣ _,d,e,i,l,n,o,r,s,t,w,r_,er_,ew,new
Byte-Pair Encoding
• Vocabulary
‣ _,d,e,i,l,n,o,r,s,t,w,r_,er_,ew,new
‣ _,d,e,i,l,n,o,r,s,t,w,r_,er_,ew,new,lo
‣ _,d,e,i,l,n,o,r,s,t,w,r_,er_,ew,new,lo,low
‣ _,d,e,i,l,n,o,r,s,t,w,r_,er_,ew,new,lo,low,newer_
‣ _,d,e,i,l,n,o,r,s,t,w,r_,er_,ew,new,lo,low,newer_,low_
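The merge loop above can be sketched as follows. The corpus representation (a list of symbol-sequence/frequency pairs) is an illustrative choice, and when two pairs tie in frequency this sketch simply takes the first one seen, so its merge order can differ from the slide's:

```python
from collections import Counter

def merge_word(symbols, pair):
    # Replace every adjacent occurrence of `pair` with the merged symbol.
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def learn_bpe(corpus, num_merges):
    # corpus: list of (symbol_list, frequency) pairs, e.g. (["l","o","w","_"], 5).
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        corpus = [(merge_word(s, best), f) for s, f in corpus]
    return merges
```

On the slide's corpus, the first two learned merges involve e, r, and _, matching the r_/er_ region of the worked example (tie-breaking aside).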


Byte-Pair Encoding
• In practice BPE will run with thousands of merges, creating a large vocabulary
• Most frequent words will be represented as full words
• Rarer words will be broken into subwords
• In the worst case, unknown words in test data will be broken into individual letters
What are the disadvantages of subword tokenisation?
• Language-dependent
• Larger vocabulary
• Creates non-sensical subwords
• Unable to handle misspellings
PollEv.com/jeyhanlau569
Word Normalisation
[[“Hi”, “there”, “.”], [“I”, “’m”, “TARS”, “.”]]
→ [[“hi”, “there”, “.”], [“i”, “am”, “tars”, “.”]]
Word Normalisation
• Lower casing (Australia → australia)
• Removing morphology (cooking → cook)
• Correcting spelling (definately → definitely)
• Expanding abbreviations (U.S.A → USA)
• Goal:
‣ Reduce vocabulary
‣ Map words into the same type
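A toy normaliser combining these steps; the spelling and abbreviation tables here are tiny illustrative samples, where real systems use curated resources:

```python
def normalise(token):
    # Illustrative lookup tables, not real resources.
    spelling = {"definately": "definitely"}
    abbrev = {"u.s.a": "usa", "u.s.a.": "usa"}
    t = token.lower()          # lower casing
    t = spelling.get(t, t)     # correct spelling
    t = abbrev.get(t, t)       # expand/normalise abbreviations
    return t
```

For example, `normalise("Australia")` gives `"australia"` and `normalise("definately")` gives `"definitely"`; removing morphology is the subject of the next slides.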
Inflectional Morphology
• Inflectional morphology creates grammatical variants
• English inflects nouns, verbs, and adjectives
‣ Nouns: number of the noun (-s)
‣ Verbs: number of the subject (-s), the aspect (-ing) of the
action and the tense (-ed) of the action
‣ Adjectives: comparatives (-er) and superlatives (-est)
• Manylanguageshavemuchricherinflectionalmorphology than English
‣ E.g. French inflects nouns for gender (un chat, une chatte)
Lemmatisation
• Lemmatisation means removing any inflection to reach the uninflected form, the lemma
‣ speaking → speak
• In English, there are irregularities that prevent a trivial solution:
‣ poked → poke (not pok)
‣ stopping → stop (not stopp)
‣ watches → watch (not watche)
‣ was → be (not wa)
• A lexicon of lemmas is needed for accurate lemmatisation
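A sketch of lexicon-backed lemmatisation. The irregular table, suffix list, and candidate-repair strategy are illustrative fragments of what a real lemmatiser (e.g. in NLTK or spaCy) encodes much more carefully:

```python
def lemmatise(word, lemma_lexicon, suffixes=("ing", "ed", "es", "s")):
    # Irregular forms come from a (tiny, illustrative) lookup table.
    irregular = {"was": "be", "were": "be", "spoke": "speak"}
    if word in irregular:
        return irregular[word]
    if word in lemma_lexicon:
        return word
    # Otherwise strip a common inflectional suffix and keep the first
    # candidate repair that is a known lemma (handles poke/stop/watch).
    for suf in suffixes:
        if word.endswith(suf):
            stem = word[: -len(suf)]
            for candidate in (stem, stem + "e", stem[:-1]):
                if candidate in lemma_lexicon:
                    return candidate
    return word
```

With a lexicon containing poke, stop, watch, and speak, this reproduces the slide's examples, including was → be via the irregular table.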
Derivational Morphology
• Derivational morphology creates distinct words
• English derivational suffixes often change the lexical category, e.g.
‣ -ly (personal → personally)
‣ -ise (final → finalise)
‣ -er (write → writer)
• English derivational prefixes often change the meaning without changing the lexical category
‣ write → rewrite
‣ healthy → unhealthy

Stemming
• Stemming strips off all suffixes, leaving a stem
‣ E.g. automate, automatic, automation → automat
‣ Often not an actual lexical item
• Even less lexical sparsity than lemmatisation
• Popular in information retrieval
• Stem not always interpretable
The Porter Stemmer
• Most popular stemmer for English
• Applies rewrite rules in stages
‣ First strip inflectional suffixes, e.g. -ies → -i
‣ Then derivational suffixes, e.g. -isation → -ise → -i
The Porter Stemmer
• c (lowercase) = consonant; e.g. ‘b’, ‘c’, ‘d’
• v (lowercase) = vowel; e.g. ‘a’, ‘e’, ‘i’, ‘o’, ‘u’
• C = a sequence of consonants
‣ s, ss, tr, bl
• V = a sequence of vowels
‣ o, oo, ee, io
The Porter Stemmer
• A word has one of the four forms:
‣ CVCV…C
‣ CVCV…V
‣ VCVC…C
‣ VCVC…V
• Which can be represented as:
‣ [C]VCVC…[V]
‣ [C](VC)^m [V]
‣ m = measure
[C] (VC)^m [V]
• TREE
‣ = CV
‣ = C(VC)^0 V
‣ [m=0]
• TREES
‣ = CVC
‣ = C(VC)^1
‣ [m=1]
• TROUBLES
‣ = CVCVC
‣ = C(VC)^2
‣ [m=2]
Examples
• m=0: TR, EE, TREE, Y, BY
• m=1: TROUBLE, OATS, TREES, IVY
• m=2: TROUBLES, PRIVATE, OATEN, ORRERY
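The measure m can be computed by collapsing a word to its C/V form and counting VC pairs. This sketch makes one simplification: ‘y’ is always treated as a consonant, whereas the real Porter stemmer treats y as a vowel after a consonant:

```python
import re

def measure(word):
    # Collapse runs of vowels to V and everything else to C,
    # then m = number of VC pairs in the collapsed form.
    form = re.sub(r"[aeiou]+", "V", word.lower())
    form = re.sub(r"[^V]+", "C", form)
    return form.count("VC")
```

For example, "tree" collapses to CV (m=0), "trees" to CVC (m=1), and "troubles" to CVCVC (m=2), matching the slides.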
The Porter Stemmer
• Rules format: (condition) S1 → S2
• e.g. (m>1) EMENT → null
‣ REPLACEMENT → REPLAC
 (REPLAC = CVCVC = C(VC)^2, so m=2)
• Always use the longest matching S1, e.g. with the rules SSES → SS, IES → I, SS → SS, S → null:
‣ CARESSES → CARESS
‣ CARESS → CARESS
‣ CARES → CARE
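Longest-match rule application can be sketched by listing rules longest-first and taking the first that matches, shown here for the plural rules above (a sketch of one rule group, not the full stemmer):

```python
def apply_rules(word, rules):
    # Rules are (S1, S2) pairs ordered longest-S1-first, so the first
    # matching suffix is the longest matching S1.
    for s1, s2 in rules:
        if word.endswith(s1):
            return word[: len(word) - len(s1)] + s2
    return word

STEP_1A = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]
```

`apply_rules("caresses", STEP_1A)` gives "caress" via SSES → SS, while "caress" is untouched by SS → SS and "cares" loses only its final S, reproducing the slide's three cases.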

The Porter Stemmer
Step 1: plurals and inflectional morphology

     Rule                 Positive Example          Negative Example
a    SSES → SS            caresses → caress
     IES → I              ponies → poni
     SS → SS              caress → caress
     S → null             cats → cat
b    (m>0) EED → EE       agreed → agree            feed → feed
     (*v*) ED → null      plastered → plaster       bled → bled
     (*v*) ING → null     motoring → motor          sing → sing
b+   AT → ATE             conflat(ed) → conflate
c    (*v*) Y → I          happy → happi

(*v* = stem contains a vowel)

The Porter Stemmer
Step 2, 3, 4: derivational suffixes

     Rule                     Positive Example
2    (m>0) ATIONAL → ATE      relational → relate
     (m>0) TIONAL → TION      conditional → condition
     (m>0) ENCI → ENCE        valenci → valence
     (m>0) ANCI → ANCE        hesitanci → hesitance
3    (m>0) ICATE → IC         triplicate → triplic
     (m>0) ATIVE → null       formative → form
     (m>0) ALIZE → AL         formalize → formal
4    (m>1) AL → null          revival → reviv
     (m>1) ER → null          airliner → airlin
     (m>1) ATE → null         activate → activ

The Porter Stemmer
Step 5: tidying up

     Rule                                        Positive Example
5    (m>1) E → null                              probate → probat
     (m>1 and *d and *L) → single letter         controll → control

(*d = stem ends with a double consonant, e.g. -TT; *L = stem ends with ‘l’)

The Porter Stemmer
• computational → comput
‣ step 2: ATIONAL → ATE: computate
‣ step 4: ATE → null: comput
• computer → comput
‣ step 4: ER → null: comput
Fixing Spelling Errors
• Why fix them?
‣ Spelling errors create new, rare types
‣ Disrupt various kinds of linguistic analysis
‣ Very common in internet corpora
‣ In web search, particularly important in queries
• How?
‣ String distance (Levenshtein, etc.)
‣ Modelling of error types (phonetic, typing etc.)
‣ Use an n-gram language model (next lecture!)
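The string-distance option can be sketched with the classic dynamic-programming form of Levenshtein distance, where insertions, deletions, and substitutions each cost 1:

```python
def levenshtein(a, b):
    # Row-by-row DP: prev[j] holds the distance between a[:i-1] and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]
```

A speller would suggest the dictionary word at minimal distance from the misspelling, e.g. "definately" is one substitution away from "definitely".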

Other Word Normalisation
• Normalising spelling variations
‣ Normalize → Normalise (or vice versa)
‣ U r so coool! → you are so cool
• Expanding abbreviations
‣ US, U.S. → United States
‣ imho → in my humble opinion
Stopword Removal
[[“hi”, “there”, “.”], [“i”, “am”, “tars”, “.”]] → [[], [“tars”]]
Stop Words
• Definition: a list of words to be removed from the document
‣ Typical in bag-of-words (BOW) representations
‣ Not appropriate when sequence is important
• How to choose them?
‣ All closed-class or function words
– E.g. the, a, of, for, he, …
‣ Any high-frequency words
‣ NLTK, spaCy NLP toolkits
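A sketch of stopword filtering on the running example. The stopword list here is a tiny illustrative set (including punctuation); real lists come from toolkits such as NLTK or spaCy:

```python
# Illustrative stopword set, not a real toolkit list.
STOPWORDS = {"i", "am", "hi", "there", "the", "a", "of", "for", "he", "."}

def remove_stopwords(sentences):
    # Drop stopwords from each tokenised, normalised sentence.
    return [[t for t in sent if t not in STOPWORDS] for sent in sentences]
```

Applied to the normalised example, only "tars" survives, reproducing [[], [“tars”]] from the pipeline slide.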
A Final Word
• Preprocessing is unavoidable in text analysis
• Can have a major effect on downstream applications
• Exact steps may vary depending on corpus and task
• Simple rule-based systems work well, but rarely perfectly
• Language-dependent
Further Reading
• J&M3 Ch 2.4
• Details on the Porter Stemmer algorithm: http://snowball.tartarus.org/algorithms/porter/stemmer.html