COMP5046
Natural Language Processing
Lecture 6: Part of Speech Tagging
Dr. Caren Han
Semester 1, 2021
School of Computer Science, University of Sydney
The course topics
What will you learn in this course?
Week 1: Introduction to Natural Language Processing (NLP)
Week 2: Word Embeddings (Word Vector for Meaning)
Week 3: Word Classification with Machine Learning I
Week 4: Word Classification with Machine Learning II
NLP and Machine Learning
Week 5: Language Fundamental
Week 6: Part of Speech Tagging
Week 7: Dependency Parsing
Week 8: Language Model and Natural Language Generation
NLP Techniques
Week 9: Information Extraction: Named Entity Recognition
Advanced Topic
Week 10: Advanced NLP: Attention and Reading Comprehension
Week 11: Advanced NLP: Transformer and Machine Translation
Week 12: Advanced NLP: Pretrained Model in NLP
Week 13: Future of NLP and Exam Review
LECTURE PLAN
Lecture 6: Part of Speech Tagging
1. Part-of-Speech Tagging
2. Baseline Approaches
   1. Rule-based Model
   2. Look-up Table Model
   3. N-Gram Model
3. Probabilistic Approaches
   1. Hidden Markov Model
   2. Conditional Random Field
4. Deep Learning Approaches
Part-of-Speech Tagging
Parts of Speech (or word classes)
A class of words based on the word's function, i.e. the way it works in a sentence. Eight parts of speech have been commonly listed for about 2,000 years (starting with Aristotle; codified by Dionysius Thrax of Alexandria, c. 100 BCE).
• Thrax (c. 100 BCE): Nouns, Verbs, Pronouns, Prepositions, Adverbs, Conjunctions, Participles, Articles
• Now (school grammar): Nouns, Verbs, Pronouns, Prepositions, Adverbs, Conjunctions, Adjectives, Interjections (plus articles or determiners)
Part-of-Speech (English)
One basic kind of linguistic structure: syntactic word classes.
• Open classes (e.g. nouns, verbs): typically large, have fluid membership, and are often stable under translation.
• Closed classes (e.g. pronouns, prepositions): relatively fixed membership, and the repertoire differs more from language to language.
Part-of-Speech Tag sets – Modern English
In modern (English) NLP, larger and more fine-grained tag sets are preferred. Examples:
• Penn Treebank: 45 tags (http://bit.ly/1gwbird)
• Brown Corpus: 87 tags (https://bit.ly/2FGtdLd)
• C7 Tagset: 146 tags (http://bit.ly/1Mh36KX)
There is a trade-off between complexity and precision, and whatever tag set we use, some words will be hard to classify.
POS tags in Penn Treebank
Marcus, et al. (1993)
Criteria for part-of-speech tagging
Three different criteria might be considered:
• Distributional criteria: Where can the word occur?
• Morphological criteria: What form does the word have (e.g. -tion, -ize)? What affixes can it take (e.g. -s, -ing, -est)?
• Notional (or semantic) criteria: What sort of concept does the word refer to? (E.g. nouns often refer to 'people, places or things'.) More problematic, and less useful for us.
Criteria for part-of-speech tagging: Nouns
• Distributional criteria: Where can nouns appear? For example, nouns can appear with possessives: "his car", "her idea".
• Morphological criteria: Suffixes such as -ness, -tion, -ity, and -ance tend to indicate nouns (happiness, exertion, levity, significance).
• Notional (or semantic) criteria: Nouns generally refer to living things (mouse), places (Sydney), non-living things (computer), or concepts (marriage).
Criteria for part-of-speech tagging: Verbs
• Distributional criteria: Where can verbs appear? Different types of verbs have different distributional properties. For example, base-form verbs can appear as infinitives: "to jump", "to learn".
• Morphological criteria: Words that end in -ate or -ize tend to be verbs, and ones that end in -ing are often the present participle of a verb (automate, equalize; rising, washing).
• Notional (or semantic) criteria: Verbs refer to actions (observe, think, give).
Example of POS inference
Emma  has  a   beautiful  flower
NNP   VBZ  DT  JJ         NN
https://parts-of-speech.info/ , StanfordNLP(2018)
POS Tagging: Issue
Given an input text, tag each word correctly:
There/ was/ still/ lemonade/ in/ the/ bottle/
▪ (Tag sets are quite counterintuitive!)
• In the above, "bottle" is a noun, not a verb. But how does our tagger tell?
• "still" could be an adjective or an adverb. Which seems more likely?
The purpose of POS Tagging
Essential ingredient in natural language applications
▪ Useful in and of itself (more than you’d think)
• Text-to-speech: record, lead
• Lemmatization: saw[v] → see, saw[n] → saw
• Linguistically motivated word clustering
• Useful as a pre-processing step for parsing
• Useful as features to downstream systems.
LECTURE PLAN
Lecture 6: Part of Speech Tagging
1. Part-of-Speech Tagging
2. Baseline Approaches
   1. Rule-based Model
   2. Look-up Table Model
   3. N-Gram Model
3. Probabilistic Approaches
   1. Hidden Markov Model
   2. Conditional Random Field
4. Deep Learning Approaches
Baseline Approaches
Part of Speech Tagging
N     V    D  A          N
Emma  has  a  beautiful  flower
NOTE: The example uses the coarse, high-level classes from the PoS tag set.
Rule-based POS Tagging
Basic idea:
Old POS taggers used to work in two stages, based on hand-written rules:
• the first stage identifies a set of possible POS tags for each word in the sentence (based on a lexicon), and
• the second uses a set of hand-crafted rules to select a single POS from each word's list.
IF condition THEN conclusion
Rule-based POS Tagging
Basic idea:
• Assign each token all its possible tags.
• Apply rules that eliminate all tags for a token that are inconsistent with its context.
• Assign any unknown word tokens a tag that is consistent with its context (e.g., the most frequent tag).
Rule-based POS Tagging
• Rule-based tagging often used a large set of hand-crafted context-sensitive rules.
Example (schematic):
if (-1 DT)      /* if the previous word is a determiner */
    elim MD, VB /* eliminate modals and base verbs */
"Cannot eliminate all POS ambiguity."
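The two-stage idea can be sketched in a few lines of Python. The lexicon and the single elimination rule below are hypothetical illustrations, not the rules of any real tagger:

```python
# Minimal sketch of two-stage rule-based tagging.
# LEXICON and the rule are hypothetical examples for illustration.
LEXICON = {"the": {"DT"}, "can": {"MD", "VB", "NN"}, "run": {"VB", "NN"}}

def rule_based_tag(words):
    # Stage 1: assign every possible tag from the lexicon.
    candidates = [set(LEXICON.get(w.lower(), {"NN"})) for w in words]
    # Stage 2: context-sensitive elimination, e.g. "if the previous word
    # is a determiner, eliminate modals and base verbs".
    for i in range(1, len(words)):
        if "DT" in candidates[i - 1]:
            pruned = candidates[i] - {"MD", "VB"}
            if pruned:          # never eliminate every remaining tag
                candidates[i] = pruned
    return candidates

print(rule_based_tag(["the", "can"]))  # 'can' after 'the' keeps only NN
```

Note that, as the slide says, such rules cannot eliminate all ambiguity: a word may keep several candidate tags after every rule has fired.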
Part of Speech Tagging
Emma John Will
Emma likes John
Part of Speech Tagging
Emma John Will
Emma likes John
Database
John likes Will
Will likes Emma
Part of Speech Tagging: Lookup Table
Emma likes John
Word  | N | V
------+---+---
John  | 1 | 0
likes | 0 | 2
Will  | 2 | 0
Emma  | 1 | 0
Database
John likes Will
Will likes Emma
Part of Speech Tagging: Lookup Table
Pick the largest number of the corresponding row
N     V      N
Emma  likes  John
Word  | N | V
------+---+---
John  | 1 | 0
likes | 0 | 2
Will  | 2 | 0
Emma  | 1 | 0
Database
John likes Will
Will likes Emma
What about more complicated sentences?
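The lookup-table baseline can be built directly from the toy tagged database in the slides (the exact word spellings and tag names below are assumptions for illustration):

```python
from collections import Counter, defaultdict

# Toy "database" of tagged sentences, as in the slides.
tagged_corpus = [
    [("John", "N"), ("likes", "V"), ("Will", "N")],
    [("Will", "N"), ("likes", "V"), ("Emma", "N")],
]

# Count how often each word occurs with each tag.
counts = defaultdict(Counter)
for sentence in tagged_corpus:
    for word, tag in sentence:
        counts[word.lower()][tag] += 1

def lookup_tag(words):
    # Pick the tag with the largest count for each word, ignoring context.
    return [counts[w.lower()].most_common(1)[0][0] for w in words]

print(lookup_tag(["Emma", "likes", "John"]))  # ['N', 'V', 'N']
```

Because each word is tagged in isolation, this baseline cannot use context, which is exactly the failure the next slides demonstrate.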
Part of Speech Tagging
Emma John Will
Emma will meet Will
Database
John will meet Will
Emma will meet John
Will will meet Emma
Part of Speech Tagging
Pick the largest number of the corresponding row
N     M     V     M
Emma  will  meet  Will
Word | N | V | M
-----+---+---+---
John | 2 | 0 | 0
meet | 0 | 3 | 0
Will | 2 | 0 | 3
Emma | 2 | 0 | 0
Database
John will meet Will
Emma will meet John
Will will meet Emma
Part of Speech Tagging
Pick the largest number of the corresponding row
N     M     V     N
Emma  will  meet  Will
Better Solution? What about considering the Neighbors?
Part of Speech Tagging: N-gram
A contiguous sequence of N items from a given sample of text:
N=1 (unigram): Emma | will | meet | Will
N=2 (bigram): Emma will | will meet | meet Will
N=3 (trigram): Emma will meet | will meet Will
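Extracting n-grams is a one-liner; a small sketch (function name is my own) for the example sentence:

```python
def ngrams(tokens, n):
    """Contiguous sequences of n items from the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["Emma", "will", "meet", "Will"]
print(ngrams(tokens, 2))
# [('Emma', 'will'), ('will', 'meet'), ('meet', 'Will')]
```

The same function applied to tag sequences yields the tag-pair counts used in the bigram lookup table on the next slide.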
Part of Speech Tagging: N-gram
Bigram counts by tag pair (columns reconstructed from the database as N-M, M-V, V-N):

Bigram    | N-M | M-V | V-N
----------+-----+-----+----
john-will |  1  |  0  |  0
will-meet |  0  |  3  |  0
meet-will |  0  |  0  |  1
emma-will |  1  |  0  |  0
meet-john |  0  |  0  |  1
will-will |  1  |  0  |  0
meet-emma |  0  |  0  |  1
Emma will meet Will
Database
John will meet Will
Emma will meet John
Will will meet Emma
Part of Speech Tagging: N-gram
With the bigram counts above, the tagger now picks:
N     M     V     N
Emma  will  meet  Will
Part of Speech Tagging: N-gram
What if a bigram is not in the database?
Part of Speech Tagging: N-gram
Emma will see Will
Database
John will meet Will
Emma will meet John
Will will meet Emma
What if we don't have these bigrams in the database?
LECTURE PLAN
Lecture 6: Part of Speech Tagging
1. Part-of-Speech Tagging
2. Baseline Approaches
   1. Rule-based Model
   2. Look-up Table Model
   3. N-Gram Model
3. Probabilistic Approaches
   1. Hidden Markov Model
   2. Conditional Random Field
4. Deep Learning Approaches
Probabilistic Approaches
Hidden Markov Model: Idea
N     M     V     N
Emma  will  meet  John
Hidden Markov Model: Idea
Transition Probabilities
N     M     V     N
Emma  will  meet  John
Emission Probabilities
Hidden Markov Model (HMM)
Hidden Markov Model = "Hidden" + "Markov Model"
What is 'hidden'? What is a 'Markov Model'?
Markov Model
Andrei Andreyevich Markov introduced the Markov chain in "An example of statistical investigation in the text of 'Eugene Onyegin' illustrating coupling of 'tests' in chains."
• A stochastic model used to model a randomly changing system.
• A process has the Markov property if the conditional probability distribution of future states depends only upon the present state, not on the events that occurred before it.
Markov Model (MM): K-Order Markov Property
• Assumption: the last k states are sufficient.
• (k=1) First-order Markov process (most commonly used): P(St | St−1, …, S0) = P(St | St−1)
• (k=2) Second-order Markov process: P(St | St−1, …, S0) = P(St | St−1, St−2)
Markov Model (MM): Example
Let's predict tomorrow's weather in Sydney. Assume we have three classes:
Class 1: Rainy   Class 2: Cloudy   Class 3: Sunny
NOTE: Tomorrow's weather depends only on today's!
First order Markov Model
We found the weather change pattern based on the 1-year data.
Today \ Tomorrow | Rainy | Cloudy | Sunny
-----------------+-------+--------+------
Rainy            |  0.4  |  0.3   |  0.3
Cloudy           |  0.2  |  0.6   |  0.2
Sunny            |  0.1  |  0.1   |  0.8
Sij = P(St = j| St−1 = i)
Markov Model (MM): Example
Using the same three classes and transition table:
If it is raining today, what will the weather be tomorrow?
Sij = P(St = j | St−1 = i)
S(rainy→rainy) = 0.4   S(rainy→cloudy) = 0.3   S(rainy→sunny) = 0.3
Markov Model (MM): Example
If it is raining today, what will the weather be tomorrow? Rainy!
S(rainy→rainy) = 0.4 is the largest of the three probabilities.
Markov Model (MM): Example
Transition Probabilities
We found the weather change pattern based on the 1-year data:
March 29th: rainy → (0.3) → March 30th: cloudy → (0.2) → March 31st: rainy → …
Markov Model (MM): Example
Visual illustration with a diagram:
• Each state corresponds to one observation.
• The sum of outgoing edge weights is one.
(State diagram: 1:rainy, 2:cloudy, 3:sunny, with self-loops 0.4, 0.6, 0.8 and cross edges matching the table above.)
Markov Model (MM): Example
State Transition Matrix
Sij = P(St = j | St−1 = i),  Sij ≥ 0,  1 ≤ i, j ≤ N

      | S11 S12 … S1N |
      | S21 S22 … S2N |
S =   | S31 S32 … S3N |
      |  :   :     :  |
      | SN1 SN2 … SNN |
Markov Model (MM): Example
Sequence Probability
Q: What is the probability that the weather for the next 7 days will be "sun-sun-rain-rain-sun-cloudy-sun" if it is sunny today?
P(S3, S3, S3, S1, S1, S3, S2, S3 | model)
= P(S3) · P(S3|S3) · P(S3|S3) · P(S1|S3) · P(S1|S1) · P(S3|S1) · P(S2|S3) · P(S3|S2)
= 1 × (0.8)(0.8)(0.1)(0.4)(0.3)(0.1)(0.2)
= 1.536 × 10⁻⁴
Sij = P(St = j | St−1 = i)
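The same calculation can be checked in a few lines of Python, using the transition table from the weather example (state names are my own keys):

```python
# First-order Markov chain for the weather example; the table follows
# the slide's transition matrix (today -> tomorrow).
S = {
    "rainy":  {"rainy": 0.4, "cloudy": 0.3, "sunny": 0.3},
    "cloudy": {"rainy": 0.2, "cloudy": 0.6, "sunny": 0.2},
    "sunny":  {"rainy": 0.1, "cloudy": 0.1, "sunny": 0.8},
}

def sequence_probability(states):
    """Product of transition probabilities along the state sequence."""
    p = 1.0
    for prev, cur in zip(states, states[1:]):
        p *= S[prev][cur]
    return p

# "sun-sun-rain-rain-sun-cloudy-sun" given that today is sunny:
seq = ["sunny", "sunny", "sunny", "rainy", "rainy", "sunny", "cloudy", "sunny"]
print(sequence_probability(seq))  # ~1.536e-04
```

The prior P(S3) = 1 because we condition on today being sunny; only the seven transitions contribute.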
Hidden Markov Model (HMM)
Hidden Markov Model = "Hidden" + "Markov Model"
Now we know what a 'Markov Model' is. What is 'hidden'?
Hidden Markov Model (HMM)
Hidden Markov Models (HMMs) are a class of probabilistic graphical model that allow us to predict a sequence of unknown (hidden) variables from a set of observed variables.
(Trellis diagram: hidden states X1, X2, X3, each emitting an observation Y1, Y2, Y3.)
• x: states (hidden)
• y: possible observations
• a: state transition probabilities
• b: output (emission) probabilities
• States are hidden
• Observable outcomes are linked to states
• Each state has observation probabilities that determine the observable event
Hidden Markov Model (HMM)
Hidden Markov Models (HMMs) are a class of probabilistic graphical model that allow us to predict a sequence of unknown (hidden) variables from a set of observed variables.
Here the hidden states x are the weather: sunny, cloudy, rainy.
Hidden Markov Model (HMM)
Predicting the weather (state: hidden variable) based on the type of clothes that the person wears (observed event)
• Weather (hidden variable): sunny, cloudy, rainy
• Observed variables: the type of clothing the person wears
The arrows represent:
• Transition probabilities: from a hidden state to another hidden state
• Emission probabilities: from a hidden state to an observed variable
(Diagram: hidden weather states for Yesterday, Today, Tomorrow over time, each emitting a type of clothing.)
One or more observations allow us to make an inference about a sequence of hidden states.
Hidden Markov Model (HMM)
Predicting the weather (state: hidden variable) based on the type of clothes that the person wears (observed event)
Start → sunny / cloudy / rainy
In order to compute the joint probability of a sequence of hidden states, we need to assemble three types of information:
1. Initial state information (a.k.a. prior probabilities): the initial probability of transitioning to a hidden state.
2. Transition probabilities: the probability of transitioning to a new state conditioned on the present state.
3. Emission probabilities: the probability of transitioning to an observed state conditioned on a hidden state.
Hidden Markov Model (HMM)
Predicting the weather (hidden variable) based on the type of clothes that someone wears (observed)

Priors:
Rainy: 0.6   Cloudy: 0.3   Sunny: 0.1

Transitions:
Today \ Tomorrow | Rainy | Cloudy | Sunny
-----------------+-------+--------+------
Rainy            |  0.6  |  0.3   |  0.1
Cloudy           |  0.4  |  0.3   |  0.3
Sunny            |  0.1  |  0.4   |  0.5

Emissions:
       | Shirts | Jacket | Hoodies
-------+--------+--------+--------
Rainy  |  0.8   |  0.19  |  0.01
Cloudy |  0.5   |  0.4   |  0.1
Sunny  |  0.01  |  0.2   |  0.79
Hidden Markov Model (HMM)
We have the list of clothes that Caren wore for three days:
Day 1: Shirts   Day 2: Hoodie   Day 3: Hoodie
First, just calculate for the weather sequence 'cloudy-cloudy-sunny' (a random selection):
1. Calculate the probability that Caren could wear that clothing, given the weather sequence 'cloudy-cloudy-sunny':
P(shirts|cloudy) · P(hoodie|cloudy) · P(hoodie|sunny)
2. Calculate the probability that the weather was 'cloudy-cloudy-sunny':
P(prior_cloudy) · P(cloudy|cloudy) · P(sunny|cloudy)
Multiplying the two gives the joint probability when we assume the weather sequence (cloudy-cloudy-sunny):
P(shirts|cloudy) · P(hoodie|cloudy) · P(hoodie|sunny) · P(prior_cloudy) · P(cloudy|cloudy) · P(sunny|cloudy)
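Using the prior, transition, and emission tables above, this joint probability is easy to compute. A small sketch (the lowercase clothing keys are simplified names of my own):

```python
# Joint probability of one assumed weather sequence and the observed
# clothes, using the prior/transition/emission tables from the slides.
prior = {"rainy": 0.6, "cloudy": 0.3, "sunny": 0.1}
trans = {
    "rainy":  {"rainy": 0.6, "cloudy": 0.3, "sunny": 0.1},
    "cloudy": {"rainy": 0.4, "cloudy": 0.3, "sunny": 0.3},
    "sunny":  {"rainy": 0.1, "cloudy": 0.4, "sunny": 0.5},
}
emit = {
    "rainy":  {"shirts": 0.8,  "jacket": 0.19, "hoodie": 0.01},
    "cloudy": {"shirts": 0.5,  "jacket": 0.4,  "hoodie": 0.1},
    "sunny":  {"shirts": 0.01, "jacket": 0.2,  "hoodie": 0.79},
}

def joint(states, observations):
    # prior * emission for the first day, then transition * emission after.
    p = prior[states[0]] * emit[states[0]][observations[0]]
    for i in range(1, len(states)):
        p *= trans[states[i - 1]][states[i]] * emit[states[i]][observations[i]]
    return p

obs = ["shirts", "hoodie", "hoodie"]
p = joint(["cloudy", "cloudy", "sunny"], obs)
print(p)  # ~0.0010665
```

Repeating this for every one of the 27 possible weather sequences and taking the maximum is the brute-force decoding the next slides discuss.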
Hidden Markov Model (HMM)
The previous value was only the probability when we assume the weather sequence (cloudy-cloudy-sunny).
This is the complete set of 3³ = 27 cases of weather states for three days:
{x1=s1=sunny, x2=s1=sunny, x3=s1=sunny}, {x1=s1=sunny, x2=s1=sunny, x3=s2=cloudy}, {x1=s1=sunny, x2=s1=sunny, x3=s3=rainy}, {x1=s1=sunny, x2=s2=cloudy, x3=s1=sunny}, {x1=s1=sunny, x2=s2=cloudy, x3=s2=cloudy}, {x1=s1=sunny, x2=s2=cloudy, x3=s3=rainy}, {x1=s1=sunny, x2=s3=rainy, x3=s1=sunny}, {x1=s1=sunny, x2=s3=rainy, x3=s2=cloudy}, {x1=s1=sunny, x2=s3=rainy, x3=s3=rainy}, {x1=s2=cloudy, x2=s1=sunny, x3=s1=sunny}, {x1=s2=cloudy, x2=s1=sunny, x3=s2=cloudy}, {x1=s2=cloudy, x2=s1=sunny, x3=s3=rainy}, {x1=s2=cloudy, x2=s2=cloudy, x3=s1=sunny}, {x1=s2=cloudy, x2=s2=cloudy, x3=s2=cloudy}, {x1=s2=cloudy, x2=s2=cloudy, x3=s3=rainy}, {x1=s2=cloudy, x2=s3=rainy, x3=s1=sunny}, {x1=s2=cloudy, x2=s3=rainy, x3=s2=cloudy}, {x1=s2=cloudy, x2=s3=rainy, x3=s3=rainy}, {x1=s3=rainy, x2=s1=sunny, x3=s1=sunny}, {x1=s3=rainy, x2=s1=sunny, x3=s2=cloudy}, {x1=s3=rainy, x2=s1=sunny, x3=s3=rainy}, {x1=s3=rainy, x2=s2=cloudy, x3=s1=sunny}, {x1=s3=rainy, x2=s2=cloudy, x3=s2=cloudy}, {x1=s3=rainy, x2=s2=cloudy, x3=s3=rainy}, {x1=s3=rainy, x2=s3=rainy, x3=s1=sunny}, {x1=s3=rainy, x2=s3=rainy, x3=s2=cloudy}, {x1=s3=rainy, x2=s3=rainy, x3=s3=rainy}.
Easy but slow solution: Exhaustive enumeration!
Hidden Markov Model (HMM): Evaluation
Do we need to calculate this much (all 3³ = 27 weather sequences) all the time?
Any efficient solution? Yes: a dynamic programming technique.
Now, let's apply the HMM to the Part of Speech Tagging task!
Transition Probabilities
N     M     V     N
Emma  will  meet  John
Emission Probabilities
Part of Speech Tagging: with HMM
Emma John Will Pin
Database
Emma, John can meet Will
Pin will meet Emma
Will John pin Emma
Emma will pat Pin
Part of Speech Tagging: with HMM
Emma John Will Pin
Word counts per tag (from the database):
Word | N | V | M
-----+---+---+---
Emma | 4 | 0 | 0
John | 2 | 0 | 0
Will | 1 | 0 | 3
Pin  | 2 | 1 | 0
Can  | 0 | 0 | 1
Meet | 0 | 2 | 0
Pat  | 0 | 1 | 0
Database
Emma, John can meet Will
Pin will meet Emma
Will John pin Emma
Emma will pat Pin
Part of Speech Tagging: with HMM
Emission Probabilities
Word | N   | V   | M
-----+-----+-----+----
Emma | 4/9 | 0   | 0
John | 2/9 | 0   | 0
Will | 1/9 | 0   | 3/4
Pin  | 2/9 | 1/4 | 0
Can  | 0   | 0   | 1/4
Meet | 0   | 2/4 | 0
Pat  | 0   | 1/4 | 0
(Diagram: states N, M, V emitting their words with these probabilities, e.g. N emits Emma 4/9, John 2/9, Pin 2/9, Will 1/9; M emits will 3/4, can 1/4; V emits meet 1/2, pin 1/4, pat 1/4.)
Part of Speech Tagging: with HMM
Transition Probabilities
Transition counts (with start token <s> and end token <e>):
From \ To | N | V | M | <e>
----------+---+---+---+----
<s>       | 3 | 0 | 1 | 0
N         | 1 | 1 | 3 | 4
V         | 4 | 0 | 0 | 0
M         | 1 | 3 | 0 | 0
Database
Emma, John can meet Will
Pin will meet Emma
Will John pin Emma
Emma will pat Pin
Part of Speech Tagging: with HMM
Transition Probabilities
From \ To | N   | V   | M   | <e>
----------+-----+-----+-----+----
<s>       | 3/4 | 0   | 1/4 | 0
N         | 1/9 | 1/9 | 3/9 | 4/9
V         | 4/4 | 0   | 0   | 0
M         | 1/4 | 3/4 | 0   | 0
Database
Emma, John can meet Will
Pin will meet Emma
Will John pin Emma
Emma will pat Pin
Part of Speech Tagging: with HMM
Transition Probabilities
(Diagram: states N, M, V connected by the transition probabilities above.)
Part of Speech Tagging: with HMM
Let’s combine this with emission probabilities!
(Diagram: states N, M, V, each emitting its words: N → emma, john, will, pin; M → will, can; V → pin, meet, pat.)
Part of Speech Tagging: with HMM
Let's combine this! For "John will Pin Will" tagged N-M-V-N:
P = P(N|<s>)·P(John|N) · P(M|N)·P(will|M) · P(V|M)·P(Pin|V) · P(N|V)·P(Will|N) · P(<e>|N)
  = (3/4)(2/9) · (1/3)(3/4) · (3/4)(1/4) · (4/4)(1/9) · (4/9)
  = 0.0003858
Part of Speech Tagging: with HMM
Let's combine this! Comparing two taggings of "John will Pin Will":
N-M-V-N: (3/4)(2/9) · (1/3)(3/4) · (3/4)(1/4) · (4/4)(1/9) · (4/9) = 0.0003858
N-N-N-N: (3/4)(2/9) · (1/9)(1/9) · (1/9)(2/9) · (1/9)(1/9) · (4/9) = 0.000000278
Part of Speech Tagging: with HMM
Question!
How many possibilities do we have for the sentence ‘John will Pin Will’?
Part of Speech Tagging: with HMM
Question!
How many possibilities do we have for the sentence ‘John will Pin Will’?
3⁴ = 81 (three candidate tags for each of the four words)
Hidden Markov Model (HMM) with POS Tagging
Emissions – Real World Example
POS Tagging: with HMM
John will Pin Will
Tag lattice (emission probabilities per word):
    John | will | Pin | Will
N:  2/9  | 1/9  | 2/9 | 1/9
M:  0    | 3/4  | 0   | 3/4
V:  0    | 0    | 1/4 | 0
POS Tagging: with HMM
Beam Search
Get rid of unlikely candidates
POS Tagging: with HMM
After pruning zero-probability candidates:
    John | will | Pin | Will
N:  2/9  | 1/9  | 2/9 | 1/9
M:  -    | 3/4  | -   | 3/4
V:  -    | -    | 1/4 | -
Beam Search
Get rid of unlikely candidates
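A sketch of beam search over this lattice. The probabilities are the emission and transition values derived from the four-sentence database; the end-of-sentence transition is omitted here for simplicity, so the final score is 1/1152 rather than the 0.0003858 that includes P(<e>|N) = 4/9:

```python
# Beam search over the tag lattice for "John will Pin Will",
# keeping only the k highest-scoring partial paths at each word.
emis = {
    "John": {"N": 2/9, "M": 0.0, "V": 0.0},
    "will": {"N": 1/9, "M": 3/4, "V": 0.0},
    "Pin":  {"N": 2/9, "M": 0.0, "V": 1/4},
    "Will": {"N": 1/9, "M": 3/4, "V": 0.0},
}
trans = {
    "<s>": {"N": 3/4, "M": 1/4, "V": 0.0},
    "N":   {"N": 1/9, "M": 3/9, "V": 1/9},
    "M":   {"N": 1/4, "M": 0.0, "V": 3/4},
    "V":   {"N": 4/4, "M": 0.0, "V": 0.0},
}

def beam_search(words, k=2):
    beams = [([], 1.0, "<s>")]  # (path so far, score, last tag)
    for w in words:
        expanded = [
            (path + [t], score * trans[prev][t] * emis[w][t], t)
            for path, score, prev in beams
            for t in "NMV"
        ]
        # prune: keep only the k best partial paths
        beams = sorted(expanded, key=lambda b: -b[1])[:k]
    return beams[0][0], beams[0][1]

path, score = beam_search(["John", "will", "Pin", "Will"])
print(path)  # ['N', 'M', 'V', 'N']
```

Beam search is fast but approximate: with a small beam it can drop the partial path that would later have become the global best.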
POS Tagging: with HMM
Question! After pruning, how many possibilities do we have for the sentence 'John will Pin Will' now?
POS Tagging: with HMM
Which Path is the most likely?
POS Tagging: with HMM
The most likely path for "John will Pin Will" is N-M-V-N, with probability 0.000385.
POS Tagging: with HMM
OK, I got it. But is there any better algorithm?
Viterbi Algorithm!
Assume we have only these options now.
(Lattice fragment: several partial paths converge on the same node; the key idea is that only the highest-scoring path into each node needs to be kept.)
Viterbi Algorithm!
Assume we have only these options now.
(Lattice fragment: two partial paths, with scores 0.000051 and 0.0023, reach the same state; we keep only the better one and discard the other.)
Viterbi Algorithm
John will Pin Will
(Full trellis: initial transitions <s>→N = 3/4, <s>→M = 1/4, <s>→V = 0, combined with the per-word emission lattice above.)
Viterbi Algorithm
John will Pin Will
(Step 1, "John": N = 3/4 · 2/9 = 1/6; M = 1/4 · 0 = 0; V = 0 · 0 = 0.)
Viterbi Algorithm
John will Pin Will
(Step 2, "will", tag N: the best incoming path is from N: 1/6 · 1/9 · 1/9 = 1/486; paths through M and V score 0.)
Viterbi Algorithm
John will Pin Will
(Step 2, "will", tag M: the best incoming path is from N: 1/6 · 1/3 · 3/4 = 1/24; tag V scores 0.)
Viterbi Algorithm
John will Pin Will
Completed trellis (best score for each tag at each word):
    John | will  | Pin   | Will
N:  1/6  | 1/486 | 1/432 | 1/1152
M:  0    | 1/24  | 0     | 1/1728
V:  0    | 0     | 1/128 | 0
Viterbi Algorithm
John will Pin Will
Backtracking from the best final score: N (1/6) → M (1/24) → V (1/128) → N (1/1152). Multiplying by the end transition P(<e>|N) = 4/9 gives 1/2592 ≈ 0.000386, matching the earlier answer.
Viterbi Algorithm
https://web.stanford.edu/~jurafsky/slp3/8.pdf
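A compact Viterbi decoder for this toy example. The probability tables are the transition/emission values derived from the four-sentence database; this is a sketch of the algorithm, not the reference implementation from Jurafsky and Martin:

```python
# Viterbi decoding for the toy POS HMM. The end-of-sentence transition
# (P(<e>|N) = 4/9) is applied separately to match the 0.0003858 figure.
emis = {
    "John": {"N": 2/9, "M": 0.0, "V": 0.0},
    "will": {"N": 1/9, "M": 3/4, "V": 0.0},
    "Pin":  {"N": 2/9, "M": 0.0, "V": 1/4},
    "Will": {"N": 1/9, "M": 3/4, "V": 0.0},
}
trans = {
    "<s>": {"N": 3/4, "M": 1/4, "V": 0.0},
    "N":   {"N": 1/9, "M": 3/9, "V": 1/9},
    "M":   {"N": 1/4, "M": 0.0, "V": 3/4},
    "V":   {"N": 4/4, "M": 0.0, "V": 0.0},
}
TAGS = ("N", "M", "V")

def viterbi(words):
    # best[t] = (score of the best path ending in tag t, that path)
    best = {t: (trans["<s>"][t] * emis[words[0]][t], [t]) for t in TAGS}
    for w in words[1:]:
        new = {}
        for t in TAGS:
            # keep only the highest-scoring way of reaching tag t
            p = max(TAGS, key=lambda q: best[q][0] * trans[q][t])
            new[t] = (best[p][0] * trans[p][t] * emis[w][t], best[p][1] + [t])
        best = new
    final = max(TAGS, key=lambda q: best[q][0])
    return best[final][1], best[final][0]

path, score = viterbi(["John", "will", "Pin", "Will"])
print(path)  # ['N', 'M', 'V', 'N']
```

Because only the best path into each (word, tag) cell is kept, the cost is O(n·|T|²) instead of the O(|T|ⁿ) of exhaustive enumeration.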
HMM Tagger Issue: #1. Unknown (OOV) Words
How do we handle unknown (out-of-vocabulary) words?
Solution 1: Use N-grams to predict the correct tag.
Solution 2: Use morphology (prefixes, suffixes) or hyphenation.
HMM Tagger Issue: #2. Independency Problem
An HMM depends only on each hidden state and its corresponding observed word. Sequence labeling, however, also depends on other aspects of the input, such as the length of the observed sequence and the surrounding word context.
(Trellis diagram: hidden states X1, X2, X3 with transition probabilities a and emission probabilities b to observations Y1, Y2, Y3.)
Advanced HMM (MEMM or CRF)
The CRF model addresses the label bias issue and eliminates two unreasonable independence assumptions made by the HMM.
MEMM adopts local normalization, while CRF adopts global normalization.
HMM → Maximum-entropy Markov model (MEMM) → Conditional random field (CRF)
MEMM Labeling Bias
Conditional Random Field: Advantages
• Compared with HMM: since CRF does not make independence assumptions as strict as HMM's, it can accommodate arbitrary context information.
• Compared with MEMM: since CRF computes the conditional probability of the globally optimal output sequence, it overcomes MEMM's label bias problem.
MEMM suffers from the label bias problem: the transition probabilities for leaving a given state are normalized over only that state.
However, CRF is computationally expensive at the training stage, which makes it difficult to re-train the model when newer data becomes available.
LECTURE PLAN
Lecture 6: Part of Speech Tagging
1. Part-of-Speech Tagging
2. Baseline Approaches
   1. Rule-based Model
   2. Look-up Table Model
   3. N-Gram Model
3. Probabilistic Approaches
   1. Hidden Markov Model
   2. Conditional Random Field
4. Deep Learning Approaches
Deep Learning Approaches
RNN/LSTM/GRU in Part of Speech Tagging
I/PRP  am/VBP  crazy/JJ  in/IN  love/NN
(Diagram: word embeddings x1 … xn for "I am crazy in love" feed a bidirectional LSTM; a softmax over each time step's hidden states outputs the tag for that word.)
Do LSTMs really work so well for PoS tagging?
(Horsmann and Zesch, 2017)
Do LSTMs really work so well for PoS tagging?
Corpora used in the experiments
Do LSTMs really work so well for PoS tagging?
Coarse-grained PoS tag distribution of corpora by language group
Do LSTMs really work so well for PoS tagging?
(Horsmann and Zesch, 2017)
HMM POS Tagger
LSTM-based POS Tagging
Illustration of LSTM-based joint POS tagging and graph-based dependency parsing (Nguyen et al., 2017).
Reference Summary
• Marcus, M., Santorini, B., & Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313–330.
• Nguyen, D. Q., Dras, M., & Johnson, M. (2017). A novel neural network model for joint POS tagging and graph-based dependency parsing. arXiv preprint arXiv:1705.05952.
• Ling, W., Luís, T., Marujo, L., Astudillo, R. F., Amir, S., Dyer, C., … & Trancoso, I. (2015). Finding function in form: Compositional character models for open vocabulary word representation. arXiv preprint arXiv:1508.02096.
• Horsmann, T., & Zesch, T. (2017). Do LSTMs really work so well for PoS tagging? A replication study. In Proceedings of EMNLP 2017 (pp. 727–736).
• Jurafsky, D., & Martin, J. H. (2019). Speech and Language Processing (3rd ed. draft, October 2, 2019).