Sequence Tagging: Hidden Markov Models
COMP90042
Natural Language Processing
Lecture 6
Semester 1 2021 Week 3 Jey Han Lau
COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE
POS Tagging Recap

• Janet will back the bill
• Janet/NNP will/MD back/VB the/DT bill/NN
• Local classifier: prone to error propagation
• What about treating the full sequence as a "class"?
  ‣ Output: "NNP_MD_VB_DT_NN"
• Problems:
  ‣ Exponentially many combinations: |Tags|^M, for length M
  ‣ How to tag sequences of different lengths?
A Better Approach

• Tagging is a sentence-level task, but as humans we decompose it into small word-level tasks:
  ‣ Janet/NNP will/MD back/VB the/DT bill/NN
• Solution:
  ‣ Define a model that decomposes the process into individual word-level steps
  ‣ But that takes into account the whole sequence when learning and predicting (no error propagation)
• This is the idea of sequence labelling, and more generally, structured prediction.
A Probabilistic Model

• Goal: obtain the best tag sequence t from the sentence w
  ‣ t̂ = argmax_t P(t | w)
  ‣ t̂ = argmax_t P(w | t) P(t) / P(w) = argmax_t P(w | t) P(t)   [Bayes]
• Let's decompose:
  ‣ P(w | t) = ∏_{i=1}^{n} P(w_i | t_i)   [prob. of a word depends only on its tag]
  ‣ P(t) = ∏_{i=1}^{n} P(t_i | t_{i-1})   [prob. of a tag depends only on the previous tag]
• This is a Hidden Markov Model (HMM)
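To make the decomposition concrete, the sketch below scores one candidate tag sequence as P(w | t) P(t) = ∏ P(w_i | t_i) P(t_i | t_{i-1}). The probability tables are the small "they fish" transition and emission tables used later in this lecture, "<s>" is the sentence-start symbol introduced on the following slides, and the function name is my own:

# P(tag | previous tag); "<s>" marks the start of the sentence
transition = {("<s>", "N"): 0.6, ("<s>", "V"): 0.4,
              ("N", "N"): 0.8, ("N", "V"): 0.2,
              ("V", "N"): 0.7, ("V", "V"): 0.3}
# P(word | tag)
emission = {("N", "they"): 0.7, ("N", "fish"): 0.3,
            ("V", "they"): 0.4, ("V", "fish"): 0.6}

def hmm_score(words, tags):
    """P(w | t) * P(t) under the two HMM assumptions."""
    score, prev = 1.0, "<s>"
    for w, t in zip(words, tags):
        score *= emission[(t, w)] * transition[(prev, t)]
        prev = t
    return score

print(hmm_score(["they", "fish"], ["N", "N"]))   # 0.7*0.6 * 0.3*0.8 ≈ 0.1008
print(hmm_score(["they", "fish"], ["N", "V"]))   # 0.7*0.6 * 0.6*0.2 ≈ 0.0504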
Two Assumptions of HMM

t̂ = argmax_t P(w | t) P(t)
P(w | t) = ∏_{i=1}^{n} P(w_i | t_i)
P(t) = ∏_{i=1}^{n} P(t_i | t_{i-1})

• Output independence
  ‣ An observed event (word) depends only on the hidden state (tag)
• Markov assumption
  ‣ The current state (tag) depends only on the previous state
HMMs – Training

• Parameters are the individual probabilities
  ‣ P(w_i | t_i) = emission (O) probabilities
  ‣ P(t_i | t_{i-1}) = transition (A) probabilities
• Training uses Maximum Likelihood Estimation (MLE)
  ‣ This is done by simply counting word frequencies according to their tags (just like N-gram LMs!)
  ‣ P(like | VB) = count(VB, like) / count(VB)
  ‣ P(NN | DT) = count(DT, NN) / count(DT)
HMMs – Training

• What about the first tag?
• Assume we have a symbol "<s>" that represents the start of your sentence
  ‣ P(NN | <s>) = count(<s>, NN) / count(<s>)
• What about unseen (word, tag) and (tag, previous_tag) combinations?
  ‣ Smoothing techniques
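The counting described on the last two slides can be sketched in a few lines of Python. This is a minimal illustration only (the function name and data layout are my own, and smoothing is omitted):

from collections import defaultdict

def train_hmm(tagged_sentences):
    """MLE training. tagged_sentences is a list of [(word, tag), ...] lists."""
    tag_count = defaultdict(int)      # count(tag); "<s>" is counted once per sentence
    trans_count = defaultdict(int)    # count(previous_tag, tag)
    emit_count = defaultdict(int)     # count(tag, word)
    for sent in tagged_sentences:
        prev = "<s>"
        tag_count[prev] += 1
        for word, tag in sent:
            trans_count[(prev, tag)] += 1
            emit_count[(tag, word)] += 1
            tag_count[tag] += 1
            prev = tag
    # relative frequencies, e.g. P(NN | DT) = count(DT, NN) / count(DT)
    A = {(p, t): c / tag_count[p] for (p, t), c in trans_count.items()}
    O = {(t, w): c / tag_count[t] for (t, w), c in emit_count.items()}
    return A, O

A, O = train_hmm([[("they", "N"), ("fish", "V")]])
print(A[("<s>", "N")], O[("N", "they")])   # both 1.0 for this tiny one-sentence corpus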
Transition Matrix
Emission (Observation) Matrix
HMMs – Prediction (Decoding)

t̂ = argmax_t P(w | t) P(t)
  = argmax_t ∏_{i=1}^{n} P(w_i | t_i) P(t_i | t_{i-1})

• Simple idea: for each word, take the tag that maximises P(w_i | t_i) P(t_i | t_{i-1}). Do it left-to-right, in a greedy fashion.
• This is wrong! We are looking for the argmax over the whole sequence t, not individual argmax_{t_i} terms.
  ‣ This is a local classifier: error propagation
• Correct way: consider all possible tag combinations, evaluate them, take the max
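To see why this is expensive, here is a brute-force decoder that literally enumerates every possible tag combination, reusing the hypothetical hmm_score function from the earlier sketch. It is only feasible for toy inputs:

from itertools import product

def brute_force_decode(words, tagset, score):
    """Try every possible tag sequence and keep the highest-scoring one."""
    best_tags, best_score = None, -1.0
    for tags in product(tagset, repeat=len(words)):   # all tag combinations
        s = score(words, tags)                        # P(w | t) * P(t)
        if s > best_score:
            best_tags, best_score = list(tags), s
    return best_tags, best_score

# e.g. brute_force_decode(["they", "fish"], ["N", "V"], hmm_score) -> (['N', 'N'], ≈0.1008)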
What is the complexity of such a system that considers all possible tag combinations?

If sentence length = N and tagset size = T:
• N·T
• N^T
• T^N
• N

PollEv.com/jeyhanlau569
The Viterbi Algorithm

• Dynamic programming to the rescue!
  ‣ We can still proceed sequentially, as long as we are careful.
• POS tag: "can play"
• Best tag for "can" is easy: argmax_t P(can | t) P(t | <s>)
• Suppose the best tag for "can" is NN. To get the tag for "play", we could take argmax_t P(play | t) P(t | NN), but this is wrong.
• Instead, we keep track of scores for each tag for "can", and check them against the different tags of "play".
The Viterbi Algorithm

Set up a matrix with one column per word (Janet, will, back, the, bill) and one row per candidate tag (NNP, MD, VB, JJ, NN, RB, DT).
The Viterbi Algorithm

Fill the first column ("Janet"): each cell is emission times initial transition.
  ‣ NNP: P(Janet | NNP) * P(NNP | <s>)
  ‣ MD, VB, JJ, NN, RB, DT: likewise, P(Janet | t) * P(t | <s>) for each tag t
P(Janet | NNP) = ?;  P(NNP | <s>) = ?
The Viterbi Algorithm

Column "Janet":
  ‣ NNP: P(Janet | NNP) * P(NNP | <s>) = 0.000032 * 0.2767 = 8.8544e-06
  ‣ MD: P(Janet | MD) * P(MD | <s>)
  ‣ VB, JJ, NN, RB, DT: …
P(Janet | MD) = ?;  P(MD | <s>) = ?
The Viterbi Algorithm

        Janet           will   back   the   bill
NNP     8.8544e-06 ●
MD      0
VB      0
JJ      0
NN      0
RB      0
DT      0
P(Janet | VB/JJ/etc) = ?
The Viterbi Algorithm

Column "will", row NNP: P(will | NNP) * P(NNP | t_Janet) * s(t_Janet | Janet).
Consider all possible previous tags, and take the maximum:

s(NNP | will) = max( P(will | NNP) * P(NNP | NNP) * s(NNP | Janet),
                     P(will | NNP) * P(NNP | MD) * s(MD | Janet),
                     …
                     P(will | NNP) * P(NNP | DT) * s(DT | Janet) )
              = P(will | NNP) * P(NNP | NNP) * 8.8544e-06
P(will | NNP) = ?; P(NNP | NNP) = ?
The Viterbi Algorithm

        Janet           will   back   the   bill
NNP     8.8544e-06 ●    0
MD      0               …
VB      0               …
JJ      0               …
NN      0               …
RB      0               …
DT      0               …
The Viterbi Algorithm

Column "will", row MD: P(will | MD) * P(MD | t_Janet) * s(t_Janet | Janet)

s(MD | will) = max( P(will | MD) * P(MD | NNP) * s(NNP | Janet),
                    P(will | MD) * P(MD | MD) * s(MD | Janet),
                    …
                    P(will | MD) * P(MD | DT) * s(DT | Janet) )
             = P(will | MD) * P(MD | NNP) * 8.8544e-06
The Viterbi Algorithm

s(MD | will) = P(will | MD) * P(MD | NNP) * 8.8544e-06 = 3.004e-08

        Janet           will        back   the   bill
NNP     8.8544e-06 ●    0
MD      0               3.004e-08
VB      0               …
JJ      0               …
NN      0               …
RB      0               …
DT      0               …
The Viterbi Algorithm

        Janet           will        back   the   bill
NNP     8.8544e-06 ●    0
MD      0               3.004e-08
VB      0               2.231e-13
JJ      0               0
NN      0               1.034e-10
RB      0               0
DT      0               0
Transition and Emission Matrix
The Viterbi Algorithm

Column "back", row VB: P(back | VB) * P(VB | t_will) * s(t_will | will)

        Janet           will        back   the   bill
NNP     8.8544e-06 ●    0           0
MD      0               3.004e-08   0
VB      0               2.231e-13
JJ      0               0
NN      0               1.034e-10
RB      0               0
DT      0               0
The Viterbi Algorithm

Column "back", row VB: candidates from the previous tags whose "will" scores are non-zero:
  ‣ via MD: 1.6e-11
  ‣ via VB: 7.5e-19
  ‣ via NN: 9.7e-17
Take the maximum: s(VB | back) = 1.6e-11, coming from MD.
The Viterbi Algorithm

        Janet           will        back      the   bill
NNP     8.8544e-06 ●    0           0
MD      0               3.004e-08   0
VB      0               2.231e-13   1.6e-11
JJ      0               0
NN      0               1.034e-10
RB      0               0
DT      0               0
The Viterbi Algorithm

        Janet           will        back      the   bill
NNP     8.8544e-06 ●    0           0
MD      0               3.004e-08   0
VB      0               2.231e-13   1.6e-11
JJ      0               0           5.1e-15
NN      0               1.034e-10   5.4e-15
RB      0               0           5.3e-11
DT      0               0           0
The Viterbi Algorithm

        Janet           will        back      the       bill
NNP     8.8544e-06 ●    0           0         2.5e-17
MD      0               3.004e-08   0         0
VB      0               2.231e-13   1.6e-11   0
JJ      0               0           5.1e-15   0
NN      0               1.034e-10   5.4e-15   0
RB      0               0           5.3e-11   0
DT      0               0           0         1.8e-12
Done!

        Janet           will        back      the       bill
NNP     8.8544e-06 ●    0           0         2.5e-17   0
MD      0               3.004e-08   0         0         0
VB      0               2.231e-13   1.6e-11   0         1.0e-20
JJ      0               0           5.1e-15   0         0
NN      0               1.034e-10   5.4e-15   0         2.0e-15
RB      0               0           5.3e-11   0         0
DT      0               0           0         1.8e-12   0
Going Backwards To Get The Best Tag Sequence

Start from the highest-scoring cell in the last column (NN, 2.0e-15) and follow the backpointers from right to left.

        Janet           will        back      the       bill
NNP     8.8544e-06 ●    0           0         2.5e-17   0
MD      0               3.004e-08   0         0         0
VB      0               2.231e-13   1.6e-11   0         1.0e-20
JJ      0               0           5.1e-15   0         0
NN      0               1.034e-10   5.4e-15   0         2.0e-15
RB      0               0           5.3e-11   0         0
DT      0               0           0         1.8e-12   0
Final Tagging Results

Janet/NNP will/MD back/VB the/DT bill/NN

(The matrix is unchanged from above; following the backpointers from the best final cell gives NNP, MD, VB, DT, NN.)
What is the optimal POS tag sequence for:
THEY FISH

Transition Probabilities (row = previous tag, column = current tag):
        N     V
<s>     0.6   0.4
N       0.8   0.2
V       0.7   0.3

Word Emission Probabilities (row = tag, column = word):
        they   fish
N       0.7    0.3
V       0.4    0.6

PollEv.com/jeyhanlau569
THEY FISH

        they   fish
N
V

PollEv.com/jeyhanlau569
THEY FISH

N for "they": P(they | N) x P(N | <s>) = 0.7 x 0.6 = 0.42
V for "they": P(they | V) x P(V | <s>) = 0.4 x 0.4 = 0.16

N for "fish": max( P(fish | N) x P(N | N) x 0.42 = 0.3 x 0.8 x 0.42 ≈ 0.10,
                   P(fish | N) x P(N | V) x 0.16 = 0.3 x 0.7 x 0.16 ≈ 0.03 ) = 0.10
V for "fish": max( P(fish | V) x P(V | N) x 0.42 = 0.6 x 0.2 x 0.42 ≈ 0.05,
                   P(fish | V) x P(V | V) x 0.16 = 0.6 x 0.3 x 0.16 ≈ 0.03 ) = 0.05

Answer: N N
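The hand computation above can be checked with a few lines of Python; this standalone sketch (variable names are mine) mirrors it step by step:

tags = ["N", "V"]
pi = {"N": 0.6, "V": 0.4}                          # P(tag | <s>)
A = {("N", "N"): 0.8, ("N", "V"): 0.2,             # P(current | previous)
     ("V", "N"): 0.7, ("V", "V"): 0.3}
O = {("N", "they"): 0.7, ("N", "fish"): 0.3,       # P(word | tag)
     ("V", "they"): 0.4, ("V", "fish"): 0.6}

# first column: "they"
score = {t: pi[t] * O[(t, "they")] for t in tags}  # {'N': 0.42, 'V': 0.16}
# second column: "fish", with backpointers
new_score, back = {}, {}
for t in tags:
    cands = {prev: O[(t, "fish")] * A[(prev, t)] * score[prev] for prev in tags}
    best_prev = max(cands, key=cands.get)
    new_score[t], back[t] = cands[best_prev], best_prev
last = max(new_score, key=new_score.get)
print([back[last], last])    # ['N', 'N'], with scores ≈ 0.1008 (N) and 0.0504 (V)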
The Viterbi Algorithm

• Complexity: O(T²N), where T is the size of the tagset and N is the length of the sequence.
  ‣ T * N matrix, and each cell performs T operations.
• Why does it work?
  ‣ Because of the independence assumptions that decompose the problem.
  ‣ Without these assumptions, we cannot apply DP.
Viterbi Pseudocode
import numpy as np

# alpha[i, t]: probability of the best tag sequence for words w[0..i] that ends in tag t
# back[i, t]:  the previous tag on that best path (backpointer)
alpha = np.zeros((M, T))
back = np.zeros((M, T), dtype=int)

# first word: initial transition (pi) times emission
for t in range(T):
    alpha[0, t] = pi[t] * O[w[0], t]

# remaining words: extend the best path ending in each possible previous tag
for i in range(1, M):
    for t_i in range(T):
        for t_last in range(T):          # t_last means t_{i-1}
            s = alpha[i-1, t_last] * A[t_last, t_i] * O[w[i], t_i]
            if s > alpha[i, t_i]:
                alpha[i, t_i] = s
                back[i, t_i] = t_last

best = np.argmax(alpha[M-1, :])          # tag index of the best final cell
return backtrace(best, back)
• Good practice: work with log probabilities to prevent underflow (multiplications become sums)
• Vectorisation (use matrix-vector operations)
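A self-contained sketch combining both suggestions, i.e. Viterbi in log space with the inner loops vectorised. The array layout (log_A[prev, cur], log_O[word, tag]) mirrors the pseudocode above; the function itself is my own illustration, not a reference implementation:

import numpy as np

def viterbi_log(w, log_pi, log_A, log_O):
    """w: list of word ids; log_pi: (T,) initial log-probabilities;
    log_A: (T, T) transition log-probs, log_A[prev, cur];
    log_O: (V, T) emission log-probs, log_O[word, tag]."""
    M, T = len(w), len(log_pi)
    alpha = np.full((M, T), -np.inf)     # best log score for each (word, tag) cell
    back = np.zeros((M, T), dtype=int)   # backpointers

    alpha[0] = log_pi + log_O[w[0]]      # first column
    for i in range(1, M):
        # scores[prev, cur] = alpha[i-1, prev] + log_A[prev, cur] + log_O[w[i], cur]
        scores = alpha[i - 1][:, None] + log_A + log_O[w[i]][None, :]
        back[i] = scores.argmax(axis=0)  # best previous tag for each current tag
        alpha[i] = scores.max(axis=0)

    tags = [int(alpha[-1].argmax())]     # best final tag, then follow backpointers
    for i in range(M - 1, 0, -1):
        tags.append(int(back[i, tags[-1]]))
    return tags[::-1]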
HMMs In Practice

• We saw HMM taggers based on bigrams (first order HMM).
  ‣ I.e. the current tag depends only on the immediately previous tag.
• State-of-the-art taggers use tag trigrams (second order HMM).
  ‣ P(t) = ∏_{i=1}^{n} P(t_i | t_{i-1}, t_{i-2})
  ‣ Viterbi is now O(T³N)
• Need to deal with sparsity: some tag trigram sequences might not be present in the training data.
  ‣ Interpolation: P(t_i | t_{i-1}, t_{i-2}) = λ3·P̂(t_i | t_{i-1}, t_{i-2}) + λ2·P̂(t_i | t_{i-1}) + λ1·P̂(t_i), with λ1 + λ2 + λ3 = 1 (a small sketch follows below)
• With additional features, reaches 96.5% accuracy on the Penn Treebank (Brants, 2000).
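A minimal sketch of that interpolated estimate, assuming hypothetical unigram/bigram/trigram tag-count dictionaries (c1, c2, c3) and illustrative λ values; none of these names come from the lecture:

def interp_trigram(t, prev1, prev2, c1, c2, c3, n_tag_tokens, lambdas=(0.1, 0.3, 0.6)):
    """Interpolated P(t | prev1, prev2); prev1 = t_{i-1}, prev2 = t_{i-2}."""
    l1, l2, l3 = lambdas                 # (λ1, λ2, λ3), assumed to sum to 1
    p1 = c1.get(t, 0) / n_tag_tokens
    p2 = c2.get((prev1, t), 0) / c1[prev1] if c1.get(prev1) else 0.0
    p3 = c3.get((prev2, prev1, t), 0) / c2[(prev2, prev1)] if c2.get((prev2, prev1)) else 0.0
    return l3 * p3 + l2 * p2 + l1 * p1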
Generative vs. Discriminative Taggers

• HMM is generative

    T̂ = argmax_T P(T | W)
      = argmax_T P(W | T) P(T)
      = argmax_T ∏_i P(w_i | t_i) P(t_i | t_{i-1})

  ‣ a trained HMM can generate data (sentences)!
  ‣ allows for unsupervised HMMs: learn a model without any tagged data!
Generative vs. Discriminative Taggers

• Discriminative models describe P(T | W) directly

    T̂ = argmax_T P(T | W)
      = argmax_T ∏_i P(t_i | w_i, t_{i-1})

  ‣ supports a richer feature set, generally better accuracy when trained over large supervised datasets, e.g. P(t_i | w_i, t_{i-1}, x_i, y_i)
  ‣ E.g., Maximum Entropy Markov Model (MEMM), Conditional Random Field (CRF)
  ‣ Most deep learning models of sequences are discriminative
HMMs in NLP

• HMMs are highly effective for part-of-speech tagging
  ‣ trigram HMM gets 96.5% accuracy
  ‣ related models are state of the art
    ‣ MEMMs: 97%
    ‣ CRFs: 97.6%
    ‣ Deep CRF: 97.9%
  ‣ English Penn Treebank tagging accuracy: https://aclweb.org/aclwiki/index.php?title=POS_Tagging_(State_of_the_art)
• Apply out-of-the-box to other sequence labelling tasks
  ‣ named entity recognition (lecture 18), shallow parsing, …
  ‣ In other fields: DNA sequences, protein sequences, …
A Final Word

• HMMs are a simple, yet effective way to perform sequence labelling.
• Can still be competitive, and fast. A natural baseline for other sequence labelling models.
• Main drawback: not very flexible in terms of feature representation, compared to MEMMs and CRFs.
Readings

• JM3, Appendix A, sections A.1–A.2 and A.4
• See also E18, Chapter 7.3
• References:
  ‣ Rabiner's HMM tutorial: http://tinyurl.com/2hqaf8
  ‣ Lafferty et al. (2001), "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data"