PowerPoint Presentation
LECTURE 2
Text Preprocessing: Segmentaton, Normalisaton, Stemming
Arkaitz Zubiaga, 10th January, 2018
2
LECTURE 2: CONTENTS
Text preprocessing.
Word tokenisaton.
Text normalisaton.
Lemmatsaton and stemming.
Sentence segmentaton.
3
TEXT NORMALISATION
Every NLP task needs to do text preprocessing:
Segmentng/tokenising words in running text.
Normalising word formats.
Segmentng sentences in running text.
4
CORPORA
Corpus: collecton of text or speech. One or more documents,
e.g. Brown corpus: 500 texts from diferent genres: newswire,
fcton, non-fcton.
We can split each document/text into sentences, and these
sentences into words.
5
SENTENCES
Sentences: shortest sequence of words that are grouped
together to convey some grammatcally correct self-contained
meaning.
For practcal purposes, sequence between full stops or “?|!”.
6
HOW MANY WORDS IN A SENTENCE?
It can be as simple as splitng by spaces, but it depends, e.g.:
I do uh main- mainly business data processing.
How many words in the following?
Seuss’s cat in the hat is diferent from other cats.
7
WORDS AND LEMMAS
Seuss’s cat in the hat is diferent from other cats.
Cat and cats both have the same lemma (cat), but two diferent
wordforms:
Cat: cat (lemma)
Cats: cat (lemma) + s (sufx)
8
HOW MANY WORDS?
They lay back on the Leamington Spa grass and look at the stars.
Type: an element of the vocabulary.
Token: an instance of that type in the running text.
9
HOW MANY TOKENS?
They lay back on the Leamington Spa grass and look at the stars. (13)
They lay back on the Leamington Spa grass and look at the stars. (12)
10
HOW MANY TYPES?
They lay back on the Leamington Spa grass and look at the stars. (11)
2 the
1 them
1 lay
1 back
1 on
1 Leamington Spa
1 grass
1 and
1 look
1 at
1 star
11
CORPORA
N = number of tokens
V = vocabulary = set of types
|V| is the size of the vocabulary
Tokens = N Types = |V|
Switchboard phone
conversatons
2.4 million 20 thousand
Shakespeare 884,000 31 thousand
Google N-grams 1 trillion 13 million
12
WORD TOKENISATION
The Complete Works of William Shakespeare:
13
SIMPLE TOKENISATION IN UNIX
tr -sc ’A-Za-z’ ’\n’ < shakes.txt Having shakes.txt as input (< shakes.txt) Convert all non-alphabetc characters (-sc ‘A-Za-z’) Into new lines (‘\n’) 14 STEP 1: TOKENISING > tr -sc ‘A-Za-z’ ‘\n’ < shakes.txt | head The Project Gutenberg EBook of The Complete Works of ... 15 STEP 2: SORTING > tr -sc ‘A-Za-z’ ‘\n’ < shakes.txt | sort | head a a a a a a a a a ... 16 STEP 3: COUNTING > tr -sc ‘A-Za-z’ ‘\n’ < shakes.txt | sort | uniq -c | head 12851 a 1949 A 25 Aaron 72 AARON 1 abaissiez 10 abandon 2 abandoned 2 abase 1 abash ... 17 STEP 3: SORT BY COUNT > tr -sc ‘A-Za-z’ ‘\n’ < shakes.txt | sort | uniq -c | sort -r -n | head 23455 the 22225 I 18715 and 16433 to 15830 of 12851 a 12236 you 10840 my 10074 in 8954 d ... What happened here? 18 STEP 4: LOWERCASING TEXT > tr -sc ‘A-Za-z’ ‘\n’ < shakes.txt | tr 'A-Z' 'a-z' | sort | uniq -c | sort -r -n | head 27843 the 26847 and 22538 i 19882 to 18307 of 14800 a 13928 you 12490 my 11563 that 11183 in ... 19 ISSUES IN TOKENISATION England’s capital → England, Englands, England’s what’re, I’m, isn’t → What are, I am, is not Hewlet-Packard → Hewlet Packard state-of-the-art → state of the art ? Lowercase → lower-case lowercase lower case ? Leamington Spa → one token or two? U.S.A., USA → ?? 20 PENN TREEBANK TOKENISATION “The Leamington Spa-based restaurant,” they said, “doesn’t charge £10”. “ The Leamington Spa-based restaurant , “ they said , “ does n’t charge £ 10 “ . Uses regular expressions: ftp://ftp.cis.upenn.edu/pub/treebank/public_hhtml/tokenizaton.html ftp://ftp.cis.upenn.edu/pub/treebank/public_html/tokenization.html 21 TOKENISATION: LANGUAGE ISSUES French: L'ensemble → one token or two? L ? L’ ? Le ? We want l’ensemble to match other instances of ensemble German noun compounds are not segmented: Lebensversicherungsgesellschafsangestellter ‘life insurance company employee’ German informaton retrieval needs compound spliter 22 TOKENISATION IN CHINESE Also called Word Segmentaton. Chinese words are composed of characters: Characters are generally 1 syllable and 1 morpheme. Average word is 2.4 characters long. Standard baseline segmentaton algorithm: Maximum Matching (also called Greedy) 23 MAXIMUM MATCHING ALGORITHM Given a wordlist (dictonary) of Chinese, and a string as input: 1) Start a pointer at the beginning of the string. 2) Find the longest word in dictonary that matches the string startng at pointer. 3) Move the pointer over the word in string. 4) Go to 2. 24 MAXIMUM MATCHING: EXAMPLE IN ENGLISH Thecatnthehat The cat in the hat Thetabledownthere Theta bled own there It’s quite a bad algorithm for English! 25 MAXIMUM MATCHING: EXAMPLE IN CHINESE But it’s actually very good for Chinese: 莎拉波娃现在居住在美国东南部的佛罗里达。 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达 Modern probabilistc segmentaton algorithms are even beter. WORD NORMALISATION AND STEMMING 27 NORMALISATION Symmetric normalisaton: e.g. We want to match U.S.A. and USA so we delete periods in a term. Assymetric normalisaton: Enter: windows Search: Windows, windows, window Enter: Windows Search: Windows More complicated, less efcient. 28 CASE FOLDING In applicatons like Informaton Retrieval (search): Reduce all to lowercase. Possible excepton: upper case in mid-sentence? e.g., General Motors In other applicatons like sentment analysis, machine translaton: Case can be important (e.g. US vs us) 29 LEMMATISATION Reduce infectons or variant forms to base form: am, are, is → be car, cars, car's, cars' → car the boy's cars are diferent colours → the boy car be diferent colour Lemmatsaton: need to fnd correct dictonary headword form. 30 MORPHOLOGY Morphemes: The small meaningful units that make up words. Stems: The core meaning-bearing units. Afxes: Bits and pieces that adhere to stems. Often with grammatcal functons. 31 PORTER: BEST-KNOWN ENGLISH STEMMER Step 1a sses → ss posesses → posess ies → i ponies → poni ss → ss posess → posess s → ø cats → cat Step 1b (*v*)ing → ø walking → walk sing → sing (*v*)ed → ø plastered → plaster … https://tartarus.org/martin/PorterStemmer/ https://tartarus.org/martin/PorterStemmer/ 32 PORTER: BEST-KNOWN ENGLISH STEMMER (*v*)ing → ø having → hav, living → liv, studying → study king → ø, sing → ø, thing → ø something → someth, morning → morn 33 LEMMATISER VS STEMMER for example compressed and compression are both accepted as equivalent to compress. STEMMER: for exampl compress and compress ar both accept as equival to compress LEMMATISER: for example compress and compress be both accept as equivalent to compress SENTENCE SEGMENTATION 35 SENTENCE SEGMENTATION !, ? are relatvely unambiguous But period “.” is much more ambiguous Sentence boundary Abbreviatons like Inc., etc. or PhD. Numbers like .02% or 4.3 36 SENTENCE SEGMENTATION So how do we deal with this ambiguity? We can build a binary classifer: Look at occurrences of ‘.’ Classifes EndOfSentence vs NotEndOfSentence. How? Hand-writen rules (if-then). Regular expressions. Machine learning. 37 SENTENCE SEGMENTATION: A DECISION TREE Deciding if a word is at the end of a sentence. 38 SENTENCE SEGMENTATION: MORE SOPHISTICATED Case of word with “.”: Upper, Lower, Cap, Number Case of word after “.”: Upper, Lower, Cap, Number Numeric features: Length of word with “.” Probability (word with “.” occurs at end-of-sentence) Probability (word after “.” occurs at beginning-of-sentence) 39 IMPLEMENTING DECISION TREES Decision tree = if-then-else statements. What’s interestng is to choose the features. PROBLEM: Very difcult to implement manually, it can become complex! PROBLEM: For numeric features, it’s too hard to pick each threshold. SOLUTION: Instead, structure usually learned by machine learning from a training corpus. 40 DETERMINING FEATURES FOR A CLASSIFIER We can think of features based on questons we’d make for a decision tree. Space after? What’s after? What’s before? How many periods? Then we can build a classifer that uses them, e.g. SVM. 41 SENTENCE SEGMENTATION: REFERENCES Stamatatos, E., Fakotakis, N., & Kokkinakis, G. (1999). Automatc extracton of rules for sentence boundary disambiguaton. In Proceedings of the Workshop on Machine Learning in Human Language Technology (pp. 88-92). Reynar, J. C., & Ratnaparkhi, A. (1997, March). A maximum entropy approach to identfying sentence boundaries. In Proceedings of the ffth conference on Applied natural language processing (pp. 16-19). Associaton for Computatonal Linguistcs. 42 RESOURCES Jurafsky, Daniel, and James H. Martn. 2009. Speech and Language Processing: An Introducton to Natural Language Processing, Speech Recogniton, and Computatonal Linguistcs. 3rd editon. Chapter 2.2-2.5. Bird Steven, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O’Reilly Media, Inc., 2009. Chapters 1-3. Slide 1 Slide 2 Slide 3 Slide 4 Slide 5 Slide 6 Slide 7 Slide 8 Slide 9 Slide 10 Slide 11 Slide 12 Slide 13 Slide 14 Slide 15 Slide 16 Slide 17 Slide 18 Slide 19 Slide 20 Slide 21 Slide 22 Slide 23 Slide 24 Slide 25 Slide 26 Slide 27 Slide 28 Slide 29 Slide 30 Slide 31 Slide 32 Slide 33 Slide 34 Slide 35 Slide 36 Slide 37 Slide 38 Slide 39 Slide 40 Slide 41 Slide 42