Sequence Tagging: Hidden Markov Models
COMP90042 Natural Language Processing, Lecture 6
Semester 1 2021, Week 3
Jey Han Lau
COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE
POS Tagging Recap
• Janet will back the bill
• Janet/NNP will/MD back/VB the/DT bill/NN
• Local classifier: prone to error propagation
• What about treating the full sequence as a "class"?
‣ Output: "NNP_MD_VB_DT_NN"
• Problems:
‣ Exponentially many combinations: |Tags|^M for a sequence of length M
‣ How to tag sequences of different lengths?
A Better Approach
• Tagging is a sentence-level task, but as humans we decompose it into small word-level tasks:
‣ Janet/NNP will/MD back/VB the/DT bill/NN
• Solution:
‣ Define a model that decomposes the process into individual word-level steps
‣ But that takes into account the whole sequence when learning and predicting (no error propagation)
• This is the idea of sequence labelling, and more generally, structured prediction.
A Probabilistic Model
• Goal: obtain the best tag sequence t for a sentence w:
‣ t̂ = argmax_t P(t | w)
‣ t̂ = argmax_t P(w | t) P(t) / P(w) = argmax_t P(w | t) P(t)   [Bayes]
• Let's decompose:
‣ P(w | t) = ∏_{i=1}^{n} P(w_i | t_i)   [prob. of a word depends only on its tag]
‣ P(t) = ∏_{i=1}^{n} P(t_i | t_{i-1})   [prob. of a tag depends only on the previous tag]
• This is a Hidden Markov Model (HMM)
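• For concreteness, this decomposition scores the tagged sentence from the recap slide, Janet/NNP will/MD back/VB the/DT bill/NN, as:

   P(w | t) P(t) = P(Janet | NNP) P(will | MD) P(back | VB) P(the | DT) P(bill | NN)
                 × P(NNP | <s>) P(MD | NNP) P(VB | MD) P(DT | VB) P(NN | DT)

  where <s> is the start-of-sentence symbol introduced under training below.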
Two Assumptions of HMM
• Output independence
‣ An observed event (word) depends only on the hidden state (tag):
   P(w | t) = ∏_{i=1}^{n} P(w_i | t_i)
• Markov assumption
‣ The current state (tag) depends only on the previous state:
   P(t) = ∏_{i=1}^{n} P(t_i | t_{i-1})
• Together: t̂ = argmax_t P(w | t) P(t)
HMMs – Training
• The parameters are the individual probabilities:
‣ P(w_i | t_i) = emission (O) probabilities
‣ P(t_i | t_{i-1}) = transition (A) probabilities
• Training uses Maximum Likelihood Estimation (MLE)
‣ This is done by simply counting word frequencies according to their tags (just like n-gram LMs!)
‣ P(like | VB) = count(VB, like) / count(VB)
‣ P(NN | DT) = count(DT, NN) / count(DT)
HMMs – Training
• What about the first tag?
‣ Assume a symbol "<s>" that represents the start of the sentence:
   P(NN | <s>) = count(<s>, NN) / count(<s>)
• What about unseen (word, tag) and (tag, previous tag) combinations?
‣ Smoothing techniques
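To make the counting concrete, here is a minimal sketch in Python (the two-sentence tagged corpus is invented purely for illustration, and no smoothing is applied):

import numpy  # not needed here; the estimates are plain counts
from collections import Counter, defaultdict

# A toy tagged corpus: each sentence is a list of (word, tag) pairs
corpus = [[("they", "PRP"), ("like", "VB"), ("the", "DT"), ("fish", "NN")],
          [("they", "PRP"), ("fish", "VB")]]

emission_counts   = defaultdict(Counter)   # tag -> word counts
transition_counts = defaultdict(Counter)   # previous tag -> tag counts

for sentence in corpus:
    prev = "<s>"                            # start-of-sentence symbol
    for word, tag in sentence:
        emission_counts[tag][word]   += 1
        transition_counts[prev][tag] += 1
        prev = tag

def emission_prob(word, tag):               # P(word | tag) = count(tag, word) / count(tag)
    return emission_counts[tag][word] / sum(emission_counts[tag].values())

def transition_prob(tag, prev):             # P(tag | prev) = count(prev, tag) / count(prev)
    return transition_counts[prev][tag] / sum(transition_counts[prev].values())

print(emission_prob("like", "VB"))          # count(VB, like) / count(VB) = 1/2
print(transition_prob("NN", "DT"))          # count(DT, NN) / count(DT) = 1/1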
Transition Matrix (figure)
Emission (Observation) Matrix (figure)
HMMs – Prediction (Decoding)
• Simple idea: for each word, take the tag that maximises P(w_i | t_i) P(t_i | t_{i-1}). Do it left to right, in greedy fashion.
• This is wrong! We are looking for argmax_t over the whole sequence, not individual argmax_{t_i} terms.
‣ This is a local classifier: error propagation
• Correct way: consider all possible tag combinations, evaluate them, take the max:
   t̂ = argmax_t P(w | t) P(t) = argmax_t ∏_{i=1}^{n} P(w_i | t_i) P(t_i | t_{i-1})
PollEv.com/jeyhanlau569
What is the complexity of such a system that considers all possible tag combinations,
if sentence length = N and tag set size = T?
• N^T
• NT
• T^N
• N
The Viterbi Algorithm
• Dynamic Programming to the rescue!
‣ We can still proceed sequentially, as long as we are careful.
• POS tag: "can play"
• The best tag for "can" is easy: argmax_t P(can | t) P(t | <s>)
• Suppose the best tag for "can" is NN. To get the tag for "play", we could take argmax_t P(play | t) P(t | NN), but this is wrong.
• Instead, we keep track of the scores for each tag of "can" and check them against the different tags of "play".
The Viterbi Algorithm: Worked Example
• We fill a matrix of scores with one row per tag (NNP, MD, VB, JJ, NN, RB, DT) and one column per word ("Janet will back the bill"), working left to right. Each individual probability, e.g. P(Janet | NNP) or P(NNP | <s>), is looked up in the emission and transition matrices.

• Column 1 ("Janet"): for each tag t, the score is s(t | Janet) = P(Janet | t) * P(t | <s>).
‣ NNP: P(Janet | NNP) * P(NNP | <s>) = 0.000032 * 0.2767 = 8.8544e-06
‣ MD, VB, JJ, NN, RB, DT: P(Janet | tag) = 0, so their scores are all 0.

       Janet
  NNP  8.8544e-06
  MD   0
  VB   0
  JJ   0
  NN   0
  RB   0
  DT   0

• Column 2 ("will"): for each tag, consider all possible previous tags and take the maximum.

  s(NNP | will) = max( P(will | NNP) * P(NNP | NNP) * s(NNP | Janet),
                       P(will | NNP) * P(NNP | MD) * s(MD | Janet),
                       …
                       P(will | NNP) * P(NNP | DT) * s(DT | Janet) )
                = P(will | NNP) * P(NNP | NNP) * 8.8544e-06 = 0

  s(MD | will)  = max( P(will | MD) * P(MD | NNP) * s(NNP | Janet),
                       P(will | MD) * P(MD | MD) * s(MD | Janet),
                       …
                       P(will | MD) * P(MD | DT) * s(DT | Janet) )
                = P(will | MD) * P(MD | NNP) * 8.8544e-06 = 3.004e-08

  Doing the same for the remaining tags completes the column:

       Janet        will
  NNP  8.8544e-06   0
  MD   0            3.004e-08
  VB   0            2.231e-13
  JJ   0            0
  NN   0            1.034e-10
  RB   0            0
  DT   0            0

• Column 3 ("back"): e.g. for VB, the candidate scores coming through the non-zero predecessors are 1.6e-11 (via MD), 7.5e-19 (via VB) and 9.7e-17 (via NN), so s(VB | back) = 1.6e-11. Filling in the remaining columns the same way gives:

       Janet        will        back      the       bill
  NNP  8.8544e-06   0           0         2.5e-17   0
  MD   0            3.004e-08   0         0         0
  VB   0            2.231e-13   1.6e-11   0         1.0e-20
  JJ   0            0           5.1e-15   0         0
  NN   0            1.034e-10   5.4e-15   0         2.0e-15
  RB   0            0           5.3e-11   0         0
  DT   0            0           0         1.8e-12   0

Going Backwards To Get The Best Tag Sequence
• Done! Take the highest score in the last column (NN for "bill", 2.0e-15) and follow the backpointers through the matrix to recover the best tag for each word.

Final Tagging Results
• Janet/NNP will/MD back/VB the/DT bill/NN
PollEv.com/jeyhanlau569
What is the optimal POS tag sequence for: "they fish"?

Word emission probabilities:
        they   fish
  N     0.7    0.3
  V     0.4    0.6

Transition probabilities:
        N      V
  <s>   0.6    0.4
  N     0.8    0.2
  V     0.7    0.3
THEY FISH: Worked Answer
• "they":
‣ N: P(they | N) * P(N | <s>) = 0.7 * 0.6 = 0.42
‣ V: P(they | V) * P(V | <s>) = 0.4 * 0.4 = 0.16
• "fish":
‣ N: max( P(fish | N) * P(N | N) * 0.42 = 0.3 * 0.8 * 0.42 ≈ 0.10,
          P(fish | N) * P(N | V) * 0.16 = 0.3 * 0.7 * 0.16 ≈ 0.03 ) ≈ 0.10
‣ V: max( P(fish | V) * P(V | N) * 0.42 = 0.6 * 0.2 * 0.42 ≈ 0.05,
          P(fish | V) * P(V | V) * 0.16 = 0.6 * 0.3 * 0.16 ≈ 0.03 ) ≈ 0.05
• Answer: N N
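As a quick sanity check, the arithmetic above can be reproduced in a few lines of Python (a minimal sketch; the probabilities are exactly the ones from the poll tables, and the dictionary layout is just one convenient choice):

# Hand Viterbi for "they fish" with tags N, V (values from the tables above)
emis  = {("they", "N"): 0.7, ("fish", "N"): 0.3,
         ("they", "V"): 0.4, ("fish", "V"): 0.6}       # P(word | tag)
trans = {("<s>", "N"): 0.6, ("<s>", "V"): 0.4,
         ("N", "N"): 0.8, ("N", "V"): 0.2,
         ("V", "N"): 0.7, ("V", "V"): 0.3}             # P(tag | previous tag)

they = {t: emis[("they", t)] * trans[("<s>", t)] for t in "NV"}
fish = {t: max(emis[("fish", t)] * trans[(prev, t)] * they[prev] for prev in "NV")
        for t in "NV"}
print(they)   # approx {'N': 0.42, 'V': 0.16}
print(fish)   # approx {'N': 0.1008, 'V': 0.0504}  ->  best sequence is N N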
The Viterbi Algorithm
• Complexity: O(T²N), where T is the size of the tagset and N is the length of the sequence.
‣ We fill a T × N matrix, and each cell performs T operations.
• Why does it work?
‣ Because the independence assumptions decompose the problem.
‣ Without them, we cannot apply DP.
Viterbi Pseudocode
• Good practice: work with log probabilities to prevent underflow (multiplications become sums)
• Vectorisation (use matrix-vector operations)

import numpy as np

def viterbi(w, pi, A, O):                  # w: word ids; pi, A, O: start, transition, emission probs
    M, T = len(w), len(pi)
    alpha = np.zeros((M, T))               # best score for each (position, tag)
    back = np.zeros((M, T), dtype=int)     # backpointer to the best previous tag
    for t in range(T):
        alpha[0, t] = pi[t] * O[w[0], t]
    for i in range(1, M):
        for t_i in range(T):
            for t_last in range(T):        # t_last means t_{i-1}
                s = alpha[i-1, t_last] * A[t_last, t_i] * O[w[i], t_i]
                if s > alpha[i, t_i]:
                    alpha[i, t_i] = s
                    back[i, t_i] = t_last
    best = np.argmax(alpha[M-1, :])        # index of the best final tag
    return backtrace(best, back)           # follow backpointers to recover the full tag sequence
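For example, here is a small self-contained log-space version (following the first bullet above), a sketch rather than a reference implementation, using the variable names from the pseudocode (pi, A, O, w, back) and the toy "they fish" model from the earlier poll:

import numpy as np

tags  = ["N", "V"]
words = ["they", "fish"]
pi = np.log([0.6, 0.4])                # P(tag | <s>)
A  = np.log([[0.8, 0.2],               # A[prev, cur] = P(cur tag | prev tag)
             [0.7, 0.3]])
O  = np.log([[0.7, 0.4],               # O[word, tag] = P(word | tag)
             [0.3, 0.6]])
w  = [0, 1]                            # word indices for "they fish"

M, T = len(w), len(tags)
alpha = np.full((M, T), -np.inf)       # best log score per (position, tag)
back  = np.zeros((M, T), dtype=int)    # backpointers
alpha[0] = pi + O[w[0]]                # first word: log probs add
for i in range(1, M):
    for t in range(T):
        scores = alpha[i-1] + A[:, t] + O[w[i], t]
        back[i, t]  = np.argmax(scores)
        alpha[i, t] = np.max(scores)

# backtrace from the best final tag
seq = [int(np.argmax(alpha[M-1]))]
for i in range(M-1, 0, -1):
    seq.append(back[i, seq[-1]])
print([tags[t] for t in reversed(seq)])

Running this prints ['N', 'N'], matching the hand-worked poll answer.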
HMMs In Practice
• We saw HMM taggers based on bigrams (first-order HMM).
‣ I.e. the current tag depends only on the immediately previous tag.
• State-of-the-art HMM taggers use tag trigrams (second-order HMM):
‣ P(t) = ∏_{i=1}^{n} P(t_i | t_{i-1}, t_{i-2})
‣ Viterbi is now O(T³N)
• Need to deal with sparsity: some tag trigram sequences might not be present in the training data
‣ Interpolation: P(t_i | t_{i-1}, t_{i-2}) = λ3 P̂(t_i | t_{i-1}, t_{i-2}) + λ2 P̂(t_i | t_{i-1}) + λ1 P̂(t_i)
‣ λ1 + λ2 + λ3 = 1
• With additional features, this reaches 96.5% accuracy on the Penn Treebank (Brants, 2000)
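A minimal sketch of this interpolation (the count-dictionary layout and the lambda values are placeholders for illustration; in practice the lambdas are tuned, e.g. by deleted interpolation):

def mle(joint_count, context_count):
    # MLE estimate, defined as 0 when the context was never seen
    return joint_count / context_count if context_count > 0 else 0.0

def interpolated_tag_prob(ti, t1, t2, c, lambdas=(0.1, 0.3, 0.6)):
    # P(t_i | t_{i-1}, t_{i-2}) = l3 * trigram MLE + l2 * bigram MLE + l1 * unigram MLE
    # `c` is assumed to map tag tuples to counts, e.g. c[(t2, t1, ti)];
    # c[()] holds the total number of tag tokens.  l1 + l2 + l3 = 1.
    l1, l2, l3 = lambdas
    uni = mle(c.get((ti,), 0),        c.get((), 0))
    bi  = mle(c.get((t1, ti), 0),     c.get((t1,), 0))
    tri = mle(c.get((t2, t1, ti), 0), c.get((t2, t1), 0))
    return l3 * tri + l2 * bi + l1 * uni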
Generative vs. Discriminative Taggers
• HMM is generative:
‣ T̂ = argmax_T P(T | W)
‣   = argmax_T P(W | T) P(T)
‣   = argmax_T ∏_i P(w_i | t_i) P(t_i | t_{i-1})
• A trained HMM can generate data (sentences)!
• This allows for unsupervised HMMs: learn a model without any tagged data!
Generative vs. Discriminative Taggers
• Discriminative models describe P(T | W) directly
‣ Supports a richer feature set, e.g. P(t_i | w_i, t_{i-1}, x_i, y_i) with additional features x_i, y_i; generally better accuracy when trained over large supervised datasets
‣ T̂ = argmax_T P(T | W) = argmax_T ∏_i P(t_i | w_i, t_{i-1})
‣ E.g., Maximum Entropy Markov Model (MEMM), Conditional Random Field (CRF)
‣ Most deep learning models of sequences are discriminative
HMMs in NLP
• HMMs are highly effective for part-of-speech tagging
‣ trigram HMM gets 96.5% accuracy
‣ related models are state of the art:
‣ MEMMs 97%
‣ CRFs 97.6%
‣ Deep CRF 97.9%
‣ English Penn Treebank tagging accuracy: https://aclweb.org/aclwiki/index.php?title=POS_Tagging_(State_of_the_art)
• They apply out-of-the-box to other sequence labelling tasks
‣ named entity recognition (lecture 18), shallow parsing, …
‣ In other fields: DNA, protein sequences, …
A Final Word
• HMMs are a simple, yet effective way to perform
sequence labelling.
• Can still be competitive, and fast. Natural baseline
for other sequence labelling models.
• Main drawback: not very flexible in terms of feature
representation, compared to MEMMs and CRFs.
Readings
• JM3 Appendix A A.1-A.2, A.4
• See also E18 Chapter 7.3
• References:
‣ Rabiner’s HMM tutorial http://tinyurl.com/2hqaf8
‣ Lafferty et al., Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data (2001)