
COMP90042 Workshop Week 2
Introduction and Pre-processing
Zenan Zhai
The University of Melbourne
9 March 2020

Table of Contents
Introduction
Pre-processing

Contact
Canvas – Discussion Board
Subject Coordinator
Dr. Jey Han Lau (laujh@unimelb.edu.au)
Me
Zenan Zhai (zenan.zhai@unimelb.edu.au)
Workshop slides available at
https://zenanz.github.io/comp90042-2020/

Pre-processing Pipeline
1. Formatting
2. Sentence Segmentation
3. Tokenisation
4. Normalisation
5. Stopword Removal

Formatting
The web page you see… [screenshot of a rendered page]

Formatting
…and the source code you actually get. [screenshot of raw HTML]
Off-the-shelf packages can help with extraction (e.g. BeautifulSoup).
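
As a minimal sketch (not from the slides; the sample HTML string is made up), BeautifulSoup can strip the markup and keep only the visible text:

from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = "<html><body><h1>Coronavirus</h1><p>Lowest virus cases in China.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")
# Remove all tags and keep only the visible text.
print(soup.get_text(separator=" ", strip=True))
# Coronavirus Lowest virus cases in China.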

Word-level Tokenisation
‘Coronavirus: Lowest virus cases in China since crisis began.’

[‘Coronavirus’, ‘:’, ‘Lowest’, ‘virus’, ‘cases’, ‘in’, ‘China’, ‘since’, ‘crisis’, ‘began’, ‘.’]
Tokenisation
▸ Rule-based / machine learning approaches
▸ Language- and domain-dependent (e.g. medicine, chemistry)
Off-the-shelf implementations
▸ NLTK
https://www.nltk.org/
▸ OpenNLP https://opennlp.apache.org/
▸ StanfordNLP https://stanfordnlp.github.io/stanfordnlp/
Wait, why did you skip sentence segmentation?
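
We didn't. As a minimal sketch (assuming NLTK with its 'punkt' models downloaded), sentence segmentation and word-level tokenisation chain naturally: segment into sentences first, then tokenise each sentence.

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# nltk.download('punkt')  # one-off model download
text = "Coronavirus: Lowest virus cases in China since crisis began."
for sentence in sent_tokenize(text):  # sentence segmentation
    print(word_tokenize(sentence))    # word-level tokenisation
# ['Coronavirus', ':', 'Lowest', 'virus', 'cases', 'in', 'China',
#  'since', 'crisis', 'began', '.']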

Byte-pair encoding
Why tokenisation?
It is easier for machines to understand.
Is there a better unit of text than the word?
Let’s take the word ‘Coronavirus’ as an example.
▸ Did you know this word before the outbreak?
▸ If not, could you still understand it when you first saw it?

Byte-pair encoding
Okay, now I know sub-words are fantastic. So how do we build a sub-word vocabulary?
Byte-pair Encoding
1. Break the entire text into single-character tokens.
2. Count the frequency of each adjacent pair of tokens.
3. Merge the most frequent pair into a single token.
4. Repeat from step 2.

BPE in action
Coronavirus: Lowest virus cases in China since crisis began.
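
A minimal sketch of the merge loop from the previous slide, applied to this sentence. The num_merges parameter and the whole-string treatment (which allows merges across spaces) are simplifications for illustration; practical BPE implementations usually merge within word boundaries only.

from collections import Counter

def byte_pair_encode(text, num_merges=10):
    # Step 1: break the text into single-character tokens.
    tokens = list(text)
    for _ in range(num_merges):
        # Step 2: count the frequency of each adjacent token pair.
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        # Step 3: merge the most frequent pair into one token.
        (a, b), freq = pairs.most_common(1)[0]
        if freq < 2:
            break  # nothing left worth merging
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged  # Step 4: repeat from step 2.
    return tokens

print(byte_pair_encode("Coronavirus: Lowest virus cases in China since crisis began."))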

Normalisation
What is normalisation?
Converting words to a standard format
Normalisation techniques
▸ Lower casing
▸ Spelling correction
▸ Abbreviation expansion
▸ Removing morphology
▸ …
Why do we want normalisation?
▸ Reduce noise
▸ Reduce data sparsity
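
A minimal sketch combining two of the techniques above, lower casing and abbreviation expansion; the abbreviation table is a made-up example, not from the slides.

ABBREVIATIONS = {"u.s.": "united states", "dr.": "doctor"}  # hypothetical table

def normalise(tokens):
    out = []
    for token in tokens:
        token = token.lower()                        # lower casing
        out.append(ABBREVIATIONS.get(token, token))  # abbreviation expansion
    return out

print(normalise(["Dr.", "Smith", "visited", "the", "U.S."]))
# ['doctor', 'smith', 'visited', 'the', 'united states']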

Morphology
Inflectional morphology
▸ Grammatical variants of the same word
Derivational morphology
▸ Another word with a different meaning

Inflectional morphology    Derivational morphology
began → begin              Ethiopia → Ethiopian
cases → case

Lemmatisation and Stemming
Rule-based deterministic algorithms for normalisation
Lemmatisation
▸ Removes all inflections
▸ Matches against a lexicon
▸ Product: lemma
Stemming
▸ Removes all suffixes
▸ No matching required
▸ Product: stem
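
A hedged comparison using NLTK's off-the-shelf implementations (the lemmatiser needs the WordNet data via nltk.download('wordnet')). Note how the lexicon lets the lemmatiser handle the irregular inflection 'began', while the stemmer cannot:

from nltk.stem import WordNetLemmatizer, PorterStemmer

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

print(lemmatizer.lemmatize("began", pos="v"))  # 'begin' -- lemma found in WordNet
print(lemmatizer.lemmatize("cases"))           # 'case'
print(stemmer.stem("began"))                   # 'began' -- no suffix rule fires
print(stemmer.stem("cases"))                   # 'case'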

Porter Stemmer
Symbols (case-sensitive)
▸ V → sequence of vowels
▸ C → sequence of consonants
▸ v → a single vowel
▸ c → a single consonant
Measure
1. Write the stem of the word in the form [C](VC)^m[V]
2. Take m as the measure
Rules
Example: (m > 0 and not *o) e → NULL
*o = the stem ends cvc, where the second c is not w, x or y (e.g. -WIL, -HOP)
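
A small sketch of computing the measure, under the simplifying assumption that y is always a consonant (the full Porter definition treats y as a vowel when it follows a consonant):

def measure(stem: str) -> int:
    # Map each letter to v (vowel) or c (consonant).
    vowels = set("aeiou")
    pattern = "".join("v" if ch in vowels else "c" for ch in stem.lower())
    # Collapse runs of identical symbols, e.g. "cvccvc" -> "cvcvc",
    # which gives the form [C](VC)^m[V].
    collapsed = ""
    for symbol in pattern:
        if not collapsed or collapsed[-1] != symbol:
            collapsed += symbol
    # m is the number of VC pairs in the collapsed form.
    return collapsed.count("vc")

print(measure("comput"))  # 2 -> form [C](VC)^2
print(measure("n"))       # 0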

Porter Stemmer – Exercise
Rules
1. (m > 0) ational → ate
2. (m > 1) ate → NULL
computational

Step  Rule  Stem    Form        m  Result
1     1     comput  [C](VC)^2   2  computate
2     2     comput  [C](VC)^2   2  comput

What about national?
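
To check the exercise against NLTK's Porter stemmer: rule 1 cannot fire on 'national' (the stem 'n' has m = 0), but the full algorithm still strips the '-al' suffix in a later step:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("computational"))  # 'comput'
print(stemmer.stem("national"))       # 'nation' -- '-al' removed by a later rule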

Stopwords
Short function words that are very common
Examples (NLTK): me, what, by, with, into, above, …
How would stopwords affect text classification?
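
A minimal sketch of stopword removal with NLTK's English list (requires nltk.download('stopwords')):

from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words("english"))
tokens = ["lowest", "virus", "cases", "in", "china", "since", "crisis", "began"]
# Keep only the tokens that are not stopwords ('in' is filtered out).
print([t for t in tokens if t not in STOPWORDS])
# ['lowest', 'virus', 'cases', 'china', 'since', 'crisis', 'began']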

Take away
1. Pre-processing pipeline
2. Tokenisation
▸ word level
▸ sub-word level (BPE)
3. Normalisation
▸ Morphology (inflectional vs. derivational)
▸ Lemmatisation vs. stemming