COMP90042 Workshop Week 2
Introduction and Pre-processing
Zenan Zhai
The University of Melbourne
9 March 2020
Table of Contents
Introduction
Pre-processing
Contact
Canvas – Discussion Board
Subject Coordinator
Dr. Jey Han Lau (laujh@unimelb.edu.au)
Me
Zenan Zhai (zenan.zhai@unimelb.edu.au)
Workshop slides available at
https://zenanz.github.io/comp90042-2020/
Table of Contents
Introduction
Pre-processing
Pre-processing Pipeline
1. Formatting
2. Sentence Segmentation
3. Tokenisation
4. Normalisation
5. Stopword Removal
Formatting
▸ What you see: the rendered web page
▸ What you get: the HTML source code
Off-the-shelf packages can help (e.g. BeautifulSoup)
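To keep this sketch self-contained we use the standard library's `html.parser` instead of BeautifulSoup (which is the more robust choice in practice); the HTML snippet is a made-up example:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document, skipping tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Keep only non-whitespace text nodes.
        if data.strip():
            self.parts.append(data.strip())

extractor = TextExtractor()
extractor.feed("<html><body><h1>Coronavirus</h1><p>Lowest virus cases.</p></body></html>")
print(" ".join(extractor.parts))
# → Coronavirus Lowest virus cases.
```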
Word-level Tokenisation
‘Coronavirus: Lowest virus cases in China since crisis began.’
⇓
[‘Coronavirus’, ‘:’, ‘Lowest’, ‘virus’, ‘cases’, ‘in’, ‘China’, ‘since’, ‘crisis’, ‘began’, ‘.’]
Tokenisation
▸ Rule-based / Machine learning
▸ Subject to language/domain (e.g. medicine, chemistry)
Off-the-shelf implementations
▸ NLTK
https://www.nltk.org/
▸ OpenNLP https://opennlp.apache.org/
▸ StanfordNLP https://stanfordnlp.github.io/stanfordnlp/
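As a sketch of the rule-based approach, a single regular expression can approximate word-level tokenisation for simple English text; real tokenisers such as the ones above also handle contractions, abbreviations and URLs:

```python
import re

def tokenise(text):
    # A toy rule: a token is either a run of word characters or a
    # single punctuation mark. Mishandles "don't", "U.S.", URLs, etc.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenise("Coronavirus: Lowest virus cases in China since crisis began."))
# → ['Coronavirus', ':', 'Lowest', 'virus', 'cases', 'in', 'China',
#    'since', 'crisis', 'began', '.']
```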
Wait, why did you skip sentence segmentation?
Byte-pair encoding
Why tokenisation?
Easier for machines to understand.
Is there a better unit of text than the word?
Let’s take the word ‘Coronavirus’ as an example.
▸ Did you know this word before the outbreak?
▸ If not, could you still understand it when you first saw it?
Byte-pair encoding
Okay, now I know sub-words are fantastic. How do I get a sub-word vocabulary?
Byte-pair Encoding
1. Break the entire text into single-character tokens.
2. Count the frequency of each pair of adjacent tokens.
3. Merge the most frequent pair into a single token.
4. Repeat from step 2.
BPE in action
Coronavirus: Lowest virus cases in China since crisis began.
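The four steps above can be sketched in a few lines of Python. This is a toy implementation on a made-up corpus ("low lower lowest"); real BPE implementations work over a word-frequency dictionary and use end-of-word markers:

```python
from collections import Counter

def bpe_merges(text, num_merges):
    """Learn BPE merges: repeatedly merge the most frequent adjacent
    token pair. Words are treated independently (no cross-word pairs)."""
    # Step 1: break each word into single-character tokens.
    words = [list(w) for w in text.split()]
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent token pairs across all words.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        # Step 3: merge the most frequent pair everywhere it occurs.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words  # Step 4: repeat with the updated tokens.
    return merges, words

merges, tokens = bpe_merges("low lower lowest", 3)
print(merges)  # first merges: ('l', 'o'), then ('lo', 'w')
```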
Normalisation
Normalisation techniques
▸ Lower casing
▸ Spelling correction
▸ Abbreviation expansion
▸ Removing morphology
▸ …
What is normalisation?
Converting words to a standard form
Why do we want normalisation?
▸ Reduce noise
▸ Reduce data sparsity
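Two of the techniques above (lower casing and abbreviation expansion) can be sketched directly; the abbreviation table here is a made-up example, not from the slides:

```python
# Hypothetical abbreviation table for illustration only.
ABBREVIATIONS = {"u.s.": "united states", "dr.": "doctor"}

def normalise(tokens):
    out = []
    for tok in tokens:
        tok = tok.lower()                  # lower casing
        tok = ABBREVIATIONS.get(tok, tok)  # abbreviation expansion
        out.append(tok)
    return out

print(normalise(["Dr.", "Smith", "visited", "the", "U.S."]))
# → ['doctor', 'smith', 'visited', 'the', 'united states']
```

After normalisation, ‘Dr.’ and ‘doctor’ map to the same type, which is exactly the data-sparsity reduction motivated above.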
Morphology
Inflectional morphology
▸ Grammatical variants of the same word
Derivational morphology
▸ Forms another word with a different meaning
Examples
▸ Inflectional: began → begin, cases → case
▸ Derivational: Ethiopia → Ethiopian
Lemmatisation and Stemming
Rule-based deterministic algorithms for normalisation
Lemmatisation
▸ Removes all inflections
▸ Matches against a lexicon
▸ Product: lemma
Stemming
▸ Removes all suffixes
▸ No lexicon matching required
▸ Product: stem
Porter Stemmer
Symbols (case sensitive)
▸ V → sequence of vowels; C → sequence of consonants
▸ v → a single vowel; c → a single consonant
Measure
1. Write the stem of the word in the form [C](VC)^m[V]
2. Take m as the measure
Rules
Example: (m > 0 and not *o) E → NULL
*o: the stem ends cvc, where the second c is not w, x or y (e.g. -WIL, -HOP)
Porter Stemmer – Exercise
Rules
1. (m > 0) ational → ate
2. (m > 1) ate → NULL
Word: computational

Step  Rule  Stem    Form       m  Result
1     1     comput  [C](VC)^2  2  computate
2     2     comput  [C](VC)^2  2  comput

What about national?
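The measure computation and the two exercise rules can be sketched as follows. This is a simplification, not the full Porter stemmer: it treats ‘y’ as a consonant everywhere, whereas the real algorithm treats ‘y’ as a vowel after a consonant.

```python
def measure(stem):
    """Count m in the form [C](VC)^m[V], with a/e/i/o/u as vowels."""
    vowels = "aeiou"
    # Collapse the stem into a V/C pattern, merging adjacent runs.
    pattern = ""
    for ch in stem.lower():
        cls = "V" if ch in vowels else "C"
        if not pattern or pattern[-1] != cls:
            pattern += cls
    return pattern.count("VC")

def apply_rules(word):
    # Rule 1: (m > 0) ational -> ate
    if word.endswith("ational"):
        stem = word[:-len("ational")]
        if measure(stem) > 0:
            word = stem + "ate"
    # Rule 2: (m > 1) ate -> NULL
    if word.endswith("ate"):
        stem = word[:-len("ate")]
        if measure(stem) > 1:
            word = stem
    return word

print(apply_rules("computational"))  # → comput
print(apply_rules("national"))       # → national (stem "n" has m = 0)
```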
Stopword Removal
Stopwords are short function words that are very common.
Examples (NLTK): me, what, by, with, into, above …
How would stopwords affect text classification?
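Stopword removal is a simple filter. The list below is a hand-picked subset for illustration; NLTK's full list is available via `nltk.corpus.stopwords.words('english')`:

```python
# Hand-picked subset of common English stopwords (illustrative only).
STOPWORDS = {"me", "what", "by", "with", "into", "above", "in", "since", "the"}

def remove_stopwords(tokens):
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords(["Lowest", "virus", "cases", "in", "China",
                        "since", "crisis", "began"]))
# → ['Lowest', 'virus', 'cases', 'China', 'crisis', 'began']
```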
Take away
1. Pre-processing pipeline
2. Tokenisation
▸ word level
▸ sub-word level (BPE)
3. Normalisation
▸ Morphology (inflectional vs. derivational)
▸ Lemmatisation vs. Stemming