
Text Preprocessing
COMP90042
Natural Language Processing Lecture 2
COPYRIGHT 2020, THE UNIVERSITY OF MELBOURNE
Definitions
• Corpus: a collection of documents.
• Document: one or more sentences.
• Sentence
‣ “The student is enrolled at the University of Melbourne.”
• Words
‣ Sequence of characters with a meaning and/or function
• Word token: each instance of a word, e.g. each occurrence of “the” in the sentence above.
‣ E.g. 9 word tokens in the example sentence.
• Word type: the distinct word, e.g. the single type “the”.
‣ Lexicon (“dictionary”): a group of word types.
‣ E.g. 8 word types in the example sentence.

How Many Unique Words?
Corpus                          #Tokens (N)    #Types (|V|)
Switchboard phone conversation  2.4 million    20 thousand
Shakespeare                     800 thousand   31 thousand
Google N-gram                   1 trillion     13 million

Church and Gale (1990): |V| > O(N^(1/2))

Why Preprocess?
• Most NLP applications have documents as inputs:
‣ “This movie is so great!!! U should definitely watch it in the theater! Best sci-fi eva!” → 😀
‣ “Eu estive em Melbourne no ano passado.” → “I was in Melbourne last year.”
• Key point: language is compositional. As humans, we can break these documents into individual components. To understand language, a computer should do the same.
• Preprocessing is the first step.

Preprocessing Steps
1. Remove unwanted formatting (e.g. HTML)
2. Sentence segmentation: break documents into sentences
3. Word tokenisation: break sentences into words
4. Word normalisation: transform words into canonical forms
5. Stopword removal: delete unwanted words

“Hi there. I’m TARS.” (original text, formatting removed)
→ [“Hi there.”, “I’m TARS.”] (sentence segmentation)
→ [[“Hi”, “there”, “.”], [“I”, “’m”, “TARS”, “.”]] (word tokenisation)
→ [[“hi”, “there”, “.”], [“i”, “am”, “tars”, “.”]] (word normalisation)
→ [[], [“tars”]] (stopword removal)
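The five steps above can be sketched as a minimal pipeline. The regexes, the clitic map, and the tiny stopword list here are illustrative assumptions, not what a production tool would use:

```python
import re

# illustrative stopword list (a real one would come from NLTK or spaCy)
STOPWORDS = {"hi", "there", "i", "am", "."}
CLITICS = {"'m": "am"}  # expand clitics during normalisation

def segment(text):
    # naive: split after sentence punctuation followed by whitespace
    return re.split(r"(?<=[.?!])\s+", text)

def tokenise(sentence):
    # words, clitics like 'm, or single punctuation marks
    return re.findall(r"[A-Za-z]+|'[A-Za-z]+|[^\w\s]", sentence)

def normalise(token):
    return CLITICS.get(token.lower(), token.lower())

def preprocess(text):
    sentences = segment(text)
    tokens = [[normalise(t) for t in tokenise(s)] for s in sentences]
    return [[t for t in sent if t not in STOPWORDS] for sent in tokens]

preprocess("Hi there. I'm TARS.")
```

On the slide’s example this reproduces the final output: the first sentence is entirely stopwords, leaving `[[], ["tars"]]`.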
Sentence Segmentation
Sentence Segmentation
• Naïve approach: break on sentence punctuation ([.?!])
‣ But periods are used for abbreviations (U.S. dollar, …, Yahoo! as a word)
• Second try: use a regex to require a following capital ([.?!] [A-Z])
‣ But abbreviations are often followed by names (Mr. Brown)
• Better yet: have lexicons
‣ But difficult to enumerate all names and abbreviations
• State-of-the-art uses machine learning, not rules
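The second approach can be sketched in one regex; the Mr. Brown case shows why requiring a capital is not enough:

```python
import re

def naive_segment(text):
    # split after [.?!] when followed by whitespace and a capital letter
    return re.split(r"(?<=[.?!])\s+(?=[A-Z])", text)

naive_segment("I met Mr. Brown. He waved.")
# wrongly splits after "Mr." because "Brown" begins with a capital
```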


Binary Classifier
• Looks at every “.” and decides whether it is the end of a sentence.
‣ Decision trees, logistic regression
• Features
‣ Look at the words before and after “.”
‣ Word shapes:
– Uppercase, lowercase, ALL_CAPS, number
– Character length
‣ Part-of-speech tags:
– Determiners tend to start a sentence
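The features above can be sketched as a feature-extraction function. The feature names and the `<s>`/`</s>` boundary markers are arbitrary illustrative choices, and POS tags are omitted since they would need a tagger:

```python
def period_features(tokens, i):
    """Features for classifying whether tokens[i] == '.' ends a sentence."""
    prev_w = tokens[i - 1] if i > 0 else "<s>"
    next_w = tokens[i + 1] if i + 1 < len(tokens) else "</s>"

    def shape(w):
        if w.isupper():
            return "ALL_CAPS"
        if w[:1].isupper():
            return "Upper"
        if w.isdigit():
            return "number"
        return "lower"

    return {
        "prev_word": prev_w.lower(),
        "next_word": next_w.lower(),
        "prev_shape": shape(prev_w),
        "next_shape": shape(next_w),
        "prev_len": len(prev_w),  # character length of the preceding word
    }

period_features(["Mr", ".", "Brown"], 1)
```

A classifier (e.g. logistic regression) would then be trained on such feature dictionaries with gold end-of-sentence labels.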
Word Tokenisation
Word Tokenisation: English
• Naïve approach: separate out alphabetic strings (\w+)
• Abbreviations (U.S.A.)
• Hyphens (merry-go-round vs. well-respected vs. yes-but)
• Numbers (1,000,000.01)
• Dates (3/1/2016)
• Clitics (n’t in can’t)
• Internet language (http://www.google.com, #metoo, :-))
• Multiword units (New Zealand)
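To see why `\w+` is naïve, compare it with a slightly smarter pattern that keeps the n’t clitic as its own token; both patterns are toy examples, not a full tokeniser:

```python
import re

def naive_tokens(text):
    return re.findall(r"\w+", text)

def clitic_tokens(text):
    # split "can't" into "ca" + "n't"; otherwise words and punctuation
    return re.findall(r"\w+(?=n't)|n't|\w+|[^\w\s]", text)

naive_tokens("U.S.A. can't")   # abbreviation and clitic both mangled
clitic_tokens("can't stop")
```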


Word Tokenisation: Chinese
• Some Asian languages are written without spaces between words
• In Chinese, words often correspond to more than one character
墨大 的 学生 与众不同
Unimelb ’s students (are) special


Word Tokenisation: Chinese
• Standard approach assumes an existing vocabulary
• MaxMatch algorithm
‣ Greedily match the longest word in the vocabulary
V = {墨, 大, 的, 学, 生, 与, 众, 不, 同, 墨大, 学生, 不同, 与众不同}
墨大的学生与众不同
match 墨大, move to 的
match 的, move to 学
match 学生, move to 与
match 与众不同, done
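MaxMatch can be sketched in a few lines, using the vocabulary from this slide; falling back to a single character when nothing matches is an assumption about out-of-vocabulary input:

```python
def max_match(text, vocab):
    """Greedily match the longest vocabulary word at each position."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest span first
            if text[i:j] in vocab or j == i + 1:  # single char as fallback
                tokens.append(text[i:j])
                i = j
                break
    return tokens

vocab = {"墨", "大", "的", "学", "生", "与", "众", "不", "同",
         "墨大", "学生", "不同", "与众不同"}
max_match("墨大的学生与众不同", vocab)
```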
Word Tokenisation: Chinese
• But how do we know what the vocabulary is?
• And it doesn’t always work:
去买新西兰花
去买 新西兰 花
go buy New Zealand flowers
去买 新 西兰花
go buy new broccoli
Word Tokenisation: German
• Lebensversicherungsgesellschaftsangestellter = life insurance company employee
• Requires a compound splitter
Subword Tokenisation
• Colourless green ideas sleep furiously →
[colour] [less] [green] [idea] [s] [sleep] [furious] [ly]
• One popular algorithm: byte-pair encoding (BPE)
• Core idea: iteratively merge frequent pairs of characters
• Advantages:
‣ Data-informed tokenisation
‣ Works for different languages
‣ Deals better with unknown words
Byte-Pair Encoding
• Dictionary
‣ [5] low_
‣ [2] lowest_
‣ [6] newer_
‣ [3] wider_
‣ [2] new_
• Vocabulary
‣ _, d, e, i, l, n, o, r, s, t, w
Byte-Pair Encoding
• Dictionary
‣ [5] low_
‣ [2] lowest_
‣ [6] newer_
‣ [3] wider_
‣ [2] new_
• Vocabulary (after merging ‘r’ ‘_’ → ‘r_’)
‣ _, d, e, i, l, n, o, r, s, t, w, r_
Byte-Pair Encoding
• Dictionary
‣ [5] low_
‣ [2] lowest_
‣ [6] newer_
‣ [3] wider_
‣ [2] new_
• Vocabulary (after merging ‘e’ ‘r_’ → ‘er_’)
‣ _, d, e, i, l, n, o, r, s, t, w, r_, er_
Byte-Pair Encoding
• Dictionary
‣ [5] low_
‣ [2] lowest_
‣ [6] newer_
‣ [3] wider_
‣ [2] new_
• Vocabulary (after merging ‘e’ ‘w’ → ‘ew’)
‣ _, d, e, i, l, n, o, r, s, t, w, r_, er_, ew
Byte-Pair Encoding
• Vocabulary (growing with further merges)
‣ _, d, e, i, l, n, o, r, s, t, w, r_, er_, ew
‣ _, d, e, i, l, n, o, r, s, t, w, r_, er_, ew, new
‣ _, d, e, i, l, n, o, r, s, t, w, r_, er_, ew, new, lo
‣ _, d, e, i, l, n, o, r, s, t, w, r_, er_, ew, new, lo, low
‣ _, d, e, i, l, n, o, r, s, t, w, r_, er_, ew, new, lo, low, newer_
‣ _, d, e, i, l, n, o, r, s, t, w, r_, er_, ew, new, lo, low, newer_, low_
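The merges above can be reproduced with a short BPE learner. This is a sketch, not an optimised implementation; ties between equally frequent pairs are broken arbitrarily here, so the exact merge order can differ from the slides, but the same subwords emerge:

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """corpus: dict mapping word (ending in '_') to frequency."""
    words = {tuple(w): f for w, f in corpus.items()}
    vocab = {ch for w in words for ch in w}
    for _ in range(num_merges):
        # count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for w, f in words.items():
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += f
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        vocab.add(a + b)
        # apply the merge everywhere
        merged_words = {}
        for w, f in words.items():
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == (a, b):
                    out.append(a + b)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            merged_words[tuple(out)] = f
        words = merged_words
    return vocab

corpus = {"low_": 5, "lowest_": 2, "newer_": 6, "wider_": 3, "new_": 2}
bpe_vocab = learn_bpe(corpus, 8)
```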


• In practice BPE will run with thousands of merges, creating a large vocabulary
• Most frequent words will be represented as full words
• Rarer words will be broken into subwords
• In the worst case, unknown words in test data will be broken into individual letters
Word Normalisation
Word Normalisation
• Lower casing (Australia → australia)
• Removing morphology
• Correcting spelling
• Expanding abbreviations (U.S.A → USA)
• Goal:
‣ Reduce vocabulary
‣ Map words into the same type
Inflectional Morphology
• Inflectional morphology creates grammatical variants
• English inflects nouns, verbs, and adjectives
‣ Nouns: number of the noun (-s)
‣ Verbs: number of the subject (-s), the aspect (-ing) of the action and the tense (-ed) of the action
‣ Adjectives: comparatives (-er) and superlatives (-est)
• Many languages have much richer inflectional morphology than English
‣ E.g. French inflects nouns for gender (un chat, une chatte)
Lemmatisation
• Lemmatisation means removing any inflection to reach the uninflected form, the lemma
‣ speaking → speak
• In English, there are irregularities that prevent a trivial solution:
‣ poked → poke (not pok)
‣ stopping → stop (not stopp)
‣ watches → watch (not watche)
‣ was → be (not wa)
• A lexicon of lemmas is needed for accurate lemmatisation
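The need for a lexicon can be shown with a toy comparison; the `LEXICON` below holds only this slide’s irregular forms and is purely illustrative (a real lemmatiser would use a full lexicon such as WordNet’s):

```python
# naive suffix stripping fails on irregular forms; a lexicon overrides it
LEXICON = {"poked": "poke", "stopping": "stop", "watches": "watch", "was": "be"}

def naive_lemma(word):
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

def lemmatise(word):
    return LEXICON.get(word, naive_lemma(word))

naive_lemma("poked")   # "pok" — wrong
lemmatise("poked")     # "poke" — the lexicon wins
lemmatise("speaking")  # "speak" via the regular rule
```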
Derivational Morphology
• Derivational morphology creates distinct words
• English derivational suffixes often change the lexical category, e.g.
‣ -ly (personal → personally)
‣ -ise (final → finalise)
‣ -er (write → writer)
• English derivational prefixes often change the meaning without changing the lexical category
‣ write → rewrite
‣ healthy → unhealthy

Stemming
• Stemming strips off all suffixes, leaving a stem
‣ E.g. automate, automatic, automation → automat
‣ Often not an actual lexical item
• Even less lexical sparsity than lemmatisation
• Popular in information retrieval
• Stem not always interpretable
The Porter Stemmer
• Most popular stemmer for English
• Applies rewrite rules in stages
‣ First strip inflectional suffixes, e.g. -ies → -i
‣ Then derivational suffixes, e.g. -isation → -ise → -i
The Porter Stemmer
• c (lowercase) = consonant; e.g. ‘b’, ‘c’, ‘d’
• v (lowercase) = vowel; e.g. ‘a’, ‘e’, ‘i’, ‘o’, ‘u’
• C = a sequence of consonants
‣ e.g. s, ss, tr, bl
• V = a sequence of vowels
‣ e.g. o, oo, ee, io
The Porter Stemmer
• A word has one of the four forms:
‣ CVCV…C
‣ CVCV…V
‣ VCVC…C
‣ VCVC…V
• Which can be represented as:
‣ [C]VCVC…[V]
‣ [C](VC)^m [V]
‣ m = measure
The Porter Stemmer
• m=0: TR, EE, TREE, Y, BY
• m=1: TROUBLE, OATS, TREES, IVY
• m=2: TROUBLES, PRIVATE, OATEN, ORRERY
‣ TREE = C(VC)^0 V
‣ TREES = C(VC)^1
‣ TROUBLES = C(VC)^2
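The measure m can be computed by collapsing vowel and consonant runs. This sketch treats only a/e/i/o/u as vowels and ignores Porter’s rule that ‘y’ after a consonant counts as a vowel:

```python
import re

def measure(stem):
    """m in the [C](VC)^m[V] form: the number of VC sequences."""
    form = re.sub(r"[aeiou]+", "V", stem.lower())  # vowel runs -> V
    form = re.sub(r"[^V]+", "C", form)             # consonant runs -> C
    return form.count("VC")

measure("tree"), measure("trees"), measure("troubles")  # (0, 1, 2)
```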
The Porter Stemmer
• Rules format: (condition) S1 → S2
‣ e.g. (m > 1) EMENT → null
‣ REPLACEMENT → REPLAC
• Always use the longest matching S1
‣ Rules: SSES → SS, IES → I, SS → SS, S → null
‣ CARESSES → CARESS
‣ CARESS → CARESS
‣ CARES → CARE

The Porter Stemmer
Step 1: plurals and past participles

    Rule                                  Positive example         Negative example
1a  SSES → SS                             caresses → caress
    IES → I                               ponies → poni
    SS → SS                               caress → caress
    S → null                              cats → cat
1b  (m>0) EED → EE                        agreed → agree           feed → feed
    (*v*) ED → null                       plastered → plaster      bled → bled
    (*v*) ING → null                      motoring → motor         sing → sing
1b+ AT → ATE                              conflat(ed) → conflate
1c  (*v*) Y → I                           happy → happi

(*v* = stem contains a vowel)

The Porter Stemmer
Steps 2, 3, 4: derivational suffixes

   Rule                     Positive example
2  (m>0) ATIONAL → ATE      relational → relate
   (m>0) TIONAL → TION      conditional → condition
   (m>0) ENCI → ENCE        valenci → valence
   (m>0) ANCI → ANCE        hesitanci → hesitance
3  (m>0) ICATE → IC         triplicate → triplic
   (m>0) ATIVE → null       formative → form
   (m>0) ALIZE → AL         formalize → formal
4  (m>1) AL → null          revival → reviv
   (m>1) ER → null          airliner → airlin
   (m>1) ATE → null         activate → activ

The Porter Stemmer
Step 5: tidying up

   Rule                                   Positive example
5  (m>1) E → null                         probate → probat
   (m=1 and not *o) E → null              cease → ceas
   (m>1 and *d and *L) → single letter    controll → control

(*o = stem ends cvc, and the second c is not w, x or y (e.g. -WIL, -HOP);
 *d = stem ends with a double consonant (e.g. -TT); *L = stem ends with ‘l’)

The Porter Stemmer
• computational → comput
‣ step 2: ATIONAL → ATE: computate
‣ step 4: ATE → null: comput
• computer → comput
‣ step 4: ER → null: comput
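These two derivations can be traced with a tiny subset of the rules (just the step 2 and step 4 rules used here). This is not the full Porter algorithm, and the measure computation ignores the y-as-vowel subtlety:

```python
import re

def m(stem):
    # Porter's measure: the number of VC sequences in the stem
    form = re.sub(r"[aeiou]+", "V", stem.lower())
    form = re.sub(r"[^V]+", "C", form)
    return form.count("VC")

def porter_subset(word):
    w = word.lower()
    if w.endswith("ational") and m(w[:-7]) > 0:   # step 2: ATIONAL -> ATE
        w = w[:-7] + "ate"
    if w.endswith("ate") and m(w[:-3]) > 1:       # step 4: ATE -> null
        w = w[:-3]
    elif w.endswith("er") and m(w[:-2]) > 1:      # step 4: ER -> null
        w = w[:-2]
    return w

porter_subset("computational"), porter_subset("computer")  # ('comput', 'comput')
```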
Fixing Spelling Errors
• Why fix them?
‣ Spelling errors create new, rare types
‣ Disrupt various kinds of linguistic analysis
‣ Very common in internet corpora
‣ In web search, particularly important in queries
• How?
‣ String distance (Levenshtein, etc.)
‣ Modelling of error types (phonetic, typing, etc.)
‣ Use an n-gram language model
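Levenshtein (edit) distance is the standard string-distance choice; a compact dynamic-programming sketch:

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]

levenshtein("definately", "definitely")  # 1 (one substitution)
```

A spelling corrector would rank candidate corrections by this distance (often combined with a language model, as the slide notes).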

Other Word Normalisation
• Normalising spelling variations
‣ Normalize → Normalise (or vice versa)
‣ U r so coool! → you are so cool
• Expanding abbreviations
‣ US, U.S. → United States
‣ imho → in my humble opinion
Stopword Removal
Stop Words
• Definition: a list of words to be removed from the document
‣ Typical in bag-of-words (BOW) representations
‣ Not appropriate when sequence is important
• How to choose them?
‣ All closed-class or function words
– E.g. the, a, of, for, he, …
‣ Any high-frequency words
‣ NLTK, spaCy NLP toolkits
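A minimal bag-of-words stopword filter; the list below is a tiny illustrative sample, not NLTK’s or spaCy’s actual list:

```python
STOPWORDS = {"the", "a", "of", "for", "he", "is", "at"}

def remove_stopwords(tokens):
    return [t for t in tokens if t.lower() not in STOPWORDS]

remove_stopwords(["The", "student", "is", "at", "the", "university"])
# ['student', 'university']
```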
A Final Word
• Preprocessing is unavoidable in text analysis
• Can have a major effect on downstream applications
• Exact steps may vary depending on corpus and task
• Simple rule-based systems work well, but rarely perfectly
• Language-dependent


Further Reading
• J&M3 Ch. 2 on normalisation (includes a review of regex and Levenshtein distance)
• Details on the Porter Stemmer algorithm: http://snowball.tartarus.org/algorithms/porter/stemmer.html