Video 1: Viterbi algorithm motivation
ANLP Week 4/Unit 2: Algorithms for HMMs
Sharon Goldwater
(based on slides by Philipp Koehn)
But… tagging?

Hidden Markov Model (HMM)

An HMM is actually a very general model for sequences. Elements of an HMM:
• a set of states (here: the tags)
• an output alphabet (here: words)
• an initial state (here: beginning of sentence)
• state transition probabilities (here: p(ti|ti−1))
• symbol emission probabilities (here: p(wi|ti))

Normally, we want to use the model to find the best tag sequence for an untagged sentence.
• Thus, the name of the model: hidden Markov model
  – Markov: because of the Markov assumption (each tag/state depends only on the immediately previous tag/state).
  – hidden: because we only observe the words/emissions; the tags/states are hidden (or latent) variables.
• FSM view: given a sequence of words, what is the most probable state path that generated them?
Formalizing the tagging problem

Normally, we want to use the model to find the best tag sequence T for an untagged sentence S:

  argmax_T p(T|S)

• Bayes' rule gives us:

  p(T|S) = \frac{p(S|T) p(T)}{p(S)}

• We can drop p(S) if we are only interested in argmax_T:

  argmax_T p(T|S) = argmax_T p(S|T) p(T)

Decomposing the model
Now we need to compute P(S|T) and P(T) (actually, their product P(S|T) P(T) = P(S,T)).
• We already defined how!
• P(T) is the probability of the tag (state) transition sequence:

  P(T) = \prod_i P(t_i|t_{i-1})

• P(S|T) is the product of the emission probabilities:

  P(S|T) = \prod_i P(w_i|t_i)
Search for the best tag sequence

• We have defined a model, but how do we use it?
  – given: word sequence S
  – wanted: best tag sequence T*
• For any specific tag sequence T, it is easy to compute P(S,T) = P(S|T) P(T):

  P(S|T) P(T) = \prod_i P(w_i|t_i) P(t_i|t_{i-1})

• So, can't we just enumerate all possible T, compute their probabilities, and choose the best one?
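To make this product concrete, here is a minimal Python sketch of the computation for one specific tag sequence; the function name and all probability values below are invented purely for illustration and are not taken from the slides.

```python
def joint_prob(words, tags, trans, emit, start="<s>"):
    """P(S, T) = prod_i P(w_i | t_i) * P(t_i | t_{i-1}), with t_0 = <s>.
    (An end-of-sentence transition P(</s> | t_n) could be multiplied in as well.)"""
    p, prev = 1.0, start
    for w, t in zip(words, tags):
        p *= trans[(prev, t)] * emit[(t, w)]
        prev = t
    return p

# Invented toy probabilities, purely for illustration:
trans = {("<s>", "CD"): 0.1, ("CD", "NN"): 0.5, ("NN", "VBD"): 0.2}
emit = {("CD", "one"): 0.05, ("NN", "dog"): 0.001, ("VBD", "bit"): 0.01}

print(joint_prob(["one", "dog", "bit"], ["CD", "NN", "VBD"], trans, emit))
```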
Enumeration won't work

• Suppose we have c possible tags for each of the n words in the sentence.
• How many possible tag sequences?
• There are c^n possible tag sequences: the number grows exponentially in the length n.
• For all but small n, too many sequences to efficiently enumerate.
Finding the best path
• The Viterbi algorithm finds this path without explicitly enumerating all paths.
• Our second example of a dynamic programming (or memoization) algorithm.
• Like min. edit distance, the algorithm stores partial results in a chart to avoid recomputing them.
Video 2: Viterbi algorithm intuition
Recap: HMM
• Elements of HMM:
– Set of states (tags)
– Output alphabet (word types)
– Start state (beginning of sentence)
– State transition probabilities
– Output probabilities from each state
More general notation
• Previous lecture:
– Sequence of tags T = t1…tn
– Sequence of words S = w1…wn
• This lecture:
– Sequence of states Q = q1 … qT
– Sequence of outputs O = o1 … oT
– So t is now a time step, not a tag! And T is the sequence length.
Recap: HMM
• Given a sentence O = o1 … oT with tags Q = q1 … qT, compute P(O,Q) as:

  P(O,Q) = \prod_{t=1}^{T} P(o_t|q_t) P(q_t|q_{t-1})

• But we want to find argmax_Q P(Q|O) without enumerating all possible Q.
  – Use the Viterbi algorithm to store partial computations.
Today’s lecture

• What algorithms can we use to
– Efficiently compute the most probable tag sequence for a given word sequence?
– Efficiently compute the likelihood for an HMM (probability it outputs a given sequence s)?
– Learn the parameters of an HMM given unlabelled training data?
• What are the properties of these algorithms (complexity, convergence, etc)?
Tagging example

  Words:          one            dog       bit
  Possible tags:  CD, NN, PRP    NN, VB    NN, VBD
  (ordered by frequency for each word)

• Choosing the best tag for each word independently gives the wrong answer (CD NN NN).
• P(VBD|bit) < P(NN|bit), but VBD may still yield a better sequence (CD NN VBD)
  – because P(VBD|NN) and P(</s>|VBD) are high.
Viterbi: intuition

(Words and candidate tags as in the tagging example above.)

• Suppose we have already computed
  a) the best tag sequence for … bit that ends in NN, or
  b) the best tag sequence for … bit that ends in VBD.
• Then, the best full sequence would be either
  – sequence (a) extended to include </s>, or
  – sequence (b) extended to include </s>.
Viterbi: intuition

(Words and candidate tags as in the tagging example above.)

• But similarly, to get
  a) the best tag sequence for … bit that ends in NN,
• we could extend one of:
  – the best tag sequence for … dog that ends in NN, or
  – the best tag sequence for … dog that ends in VB.
• And so on…
Viterbi: high-level picture

• Intuition: the best path of length t ending in state q must include the best path of length t−1 to the previous state. (t is now a time step, not a tag.) So,
  – Find the best path of length t−1 to each state.
  – Consider extending each of those by 1 step, to state q.
  – Take the best of those options as the best path to state q.
Video 3: Viterbi algorithm details
Notation
• Sequence of observations over time o1, o2, …, oT
– here, words in sentence
• Vocabulary size V of possible observations
• Set of possible states q1, q2, …, qN (see note on next slide)
  – here, tags
• A, an NxN matrix of transition probabilities
– aij: the prob of transitioning from state i to j. (JM3 Fig 8.7)
• B, an NxV matrix of output probabilities
– bi(ot): the prob of emitting ot from state i. (JM3 Fig 8.8)
Note on notation
• J&M use q1, q2, …, qN for set of states, but also use
q1, q2, …, qT for state sequence over time.
– So, just seeing q1 is ambiguous (though usually disambiguated from context).
– I’ll instead use qi for state names, and qt for state at time t.
– So we could have qt = qi, meaning: the state we’re in at time t is qi.
HMM example w/ new notation

[Figure: a two-state HMM with start state <s> and states q1, q2; the arcs show the transition probabilities between states and each state's output distribution over x, y, z (the numbers appear in the tables on the next slide). Adapted from Manning & Schuetze, Fig 9.2.]

• States {q1, q2} (or {<s>, q1, q2})
• Output alphabet {x, y, z}
Transition and Output Probabilities

• Transition matrix A: a_ij = P(qj|qi)

          q1     q2
  <s>     1      0
  q1      .7     .3
  q2      .5     .5

• Output matrix B: b_i(o) = P(o|qi), for output o

          x      y      z
  q1      .6     .1     .3
  q2      .1     .7     .2
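For later reference, these two tables can be written down directly in Python; the names (START, STATES, A, B) are my own, but the numbers are exactly those in the tables above.

```python
# The example HMM's parameters as plain Python dicts (numbers from the tables above).
START = "<s>"
STATES = ["q1", "q2"]

A = {  # transition probabilities a_ij = P(q_j | q_i)
    (START, "q1"): 1.0, (START, "q2"): 0.0,
    ("q1", "q1"): 0.7,  ("q1", "q2"): 0.3,
    ("q2", "q1"): 0.5,  ("q2", "q2"): 0.5,
}

B = {  # output probabilities b_i(o) = P(o | q_i)
    ("q1", "x"): 0.6, ("q1", "y"): 0.1, ("q1", "z"): 0.3,
    ("q2", "x"): 0.1, ("q2", "y"): 0.7, ("q2", "z"): 0.2,
}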
Joint probability of (states, outputs)

• Let λ = (A, B) be the parameters of our HMM.
• Using our new notation, given state sequence Q = (q1 … qT) and output sequence O = (o1 … oT), we have:

  P(O,Q|λ) = \prod_{t=1}^{T} P(o_t|q_t) P(q_t|q_{t-1})

• Or:

  P(O,Q|λ) = \prod_{t=1}^{T} b_{q_t}(o_t) \, a_{q_{t-1} q_t}

• Example:

  P(O = (y,z), Q = (q1,q1) | λ) = b_1(y) \cdot b_1(z) \cdot a_{<s>,1} \cdot a_{11} = (.1)(.3)(1)(.7)
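A quick check of this worked example in code; the parameter dictionaries from the earlier sketch are repeated here so the snippet stands alone.

```python
# Example HMM parameters, repeated so the snippet is self-contained.
START = "<s>"
A = {(START, "q1"): 1.0, (START, "q2"): 0.0,
     ("q1", "q1"): 0.7, ("q1", "q2"): 0.3,
     ("q2", "q1"): 0.5, ("q2", "q2"): 0.5}
B = {("q1", "x"): 0.6, ("q1", "y"): 0.1, ("q1", "z"): 0.3,
     ("q2", "x"): 0.1, ("q2", "y"): 0.7, ("q2", "z"): 0.2}

def joint_prob(outputs, states):
    """P(O, Q | lambda) = prod_t b_{q_t}(o_t) * a_{q_{t-1} q_t}, with q_0 = <s>."""
    p, prev = 1.0, START
    for o, q in zip(outputs, states):
        p *= A[(prev, q)] * B[(q, o)]
        prev = q
    return p

print(joint_prob(["y", "z"], ["q1", "q1"]))   # (.1)(.3)(1)(.7) = 0.021
```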
Viterbi: high-level picture

• Want to find argmax_Q P(Q|O).
• Intuition: the best path of length t ending in state q must include the best path of length t−1 to the previous state. So,
  – Find the best path of length t−1 to each state.
  – Consider extending each of those by 1 step, to state q.
  – Take the best of those options as the best path to state q.
Viterbi algorithm

• Use a chart to store partial results as we go
  – an N×T table, where v(j,t) is the probability* of the best state sequence for o1…ot that ends in state j.
• Fill in columns from left to right, with

  v(j,t) = \max_{i=1}^{N} v(i,t-1) \cdot a_{ij} \cdot b_j(o_t)

• Store a backtrace to show, for each cell, which state at t−1 we came from.

*Specifically, v(j,t) stores the max of the joint probability P(o1…ot, q1…q_{t−1}, q_t = j | λ).
Example

• Suppose O = x z y. Our initially empty table:

          o1=x    o2=z    o3=y
  q1
  q2

Filling the first column

          o1=x    o2=z    o3=y
  q1      .6
  q2      0

  v(1,1) = a_{<s>1} \cdot b_1(x) = 1(.6) = .6
  v(2,1) = a_{<s>2} \cdot b_2(x) = 0(.1) = 0
Starting the second column

          o1=x    o2=z    o3=y
  q1      .6      .126
  q2      0

  v(1,2) = \max_{i=1}^{N} v(i,1) \cdot a_{i1} \cdot b_1(z)
         = max{ v(1,1) \cdot a_{11} \cdot b_1(z) = (.6)(.7)(.3),
                v(2,1) \cdot a_{21} \cdot b_1(z) = (0)(.5)(.3) }
         = .126
Finishing the second column

          o1=x    o2=z    o3=y
  q1      .6      .126
  q2      0       .036

  v(2,2) = \max_{i=1}^{N} v(i,1) \cdot a_{i2} \cdot b_2(z)
         = max{ v(1,1) \cdot a_{12} \cdot b_2(z) = (.6)(.3)(.2),
                v(2,1) \cdot a_{22} \cdot b_2(z) = (0)(.5)(.2) }
         = .036
Third column

          o1=x    o2=z    o3=y
  q1      .6      .126    .00882
  q2      0       .036    .02646

• Exercise: make sure you get the same results!
Best Path

          o1=x    o2=z    o3=y
  q1      .6      .126    .00882
  q2      0       .036    .02646

• Choose the best final state: \max_{i=1}^{N} v(i,T)
• Follow the backtraces to find the best full sequence: q1 q1 q2.
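Putting the recurrence, the chart, and the backtrace together, here is a minimal Viterbi sketch for the running example (variable and function names are my own); running it on O = x z y should reproduce the chart and best path above.

```python
# Minimal Viterbi sketch; names are mine, the numbers come from the example HMM above.
START = "<s>"
STATES = ["q1", "q2"]
A = {(START, "q1"): 1.0, (START, "q2"): 0.0,
     ("q1", "q1"): 0.7, ("q1", "q2"): 0.3,
     ("q2", "q1"): 0.5, ("q2", "q2"): 0.5}
B = {("q1", "x"): 0.6, ("q1", "y"): 0.1, ("q1", "z"): 0.3,
     ("q2", "x"): 0.1, ("q2", "y"): 0.7, ("q2", "z"): 0.2}

def viterbi(obs):
    """Return (probability of the best path, best state sequence) for outputs obs."""
    # v[t][j]: probability of the best state sequence for the first t+1 outputs, ending in state j
    v = [{j: A[(START, j)] * B[(j, obs[0])] for j in STATES}]
    back = [{}]                      # back[t][j]: best previous state for cell (j, t)
    for t in range(1, len(obs)):
        v.append({})
        back.append({})
        for j in STATES:
            best_i = max(STATES, key=lambda i: v[t - 1][i] * A[(i, j)])
            v[t][j] = v[t - 1][best_i] * A[(best_i, j)] * B[(j, obs[t])]
            back[t][j] = best_i
    # Choose the best final state, then follow the backtraces.
    last = max(STATES, key=lambda j: v[-1][j])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return v[-1][last], path[::-1]

print(viterbi(["x", "z", "y"]))      # ≈ (0.02646, ['q1', 'q1', 'q2']), as in the chart
```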
Video 4: Forward algorithm
HMMs: what else?
• Using Viterbi, we can find the best tags for a sentence (decoding), and get P(O,Q|λ).
• We might also want to
  – compute the likelihood P(O|λ), i.e., the probability of a sentence regardless of tags (a language model!)
  – learn the best set of parameters λ = (A, B) given only an unannotated corpus of sentences.
Computing the likelihood
• From probability theory, we know that

  P(O|λ) = \sum_{Q} P(O,Q|λ)

• There are an exponential number of Qs.
• Again, by computing and storing partial results, we can solve this efficiently.
• (The next slides show the algorithm, but I'll likely skip them.)
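For a sequence as short as the running example, we can actually check this sum by brute force, enumerating all 2³ state sequences; the forward algorithm on the next slides computes the same value without the exponential enumeration. Names below are my own.

```python
# Brute-force P(O | lambda) = sum over all Q of P(O, Q | lambda); only viable for tiny examples.
from itertools import product

START = "<s>"
STATES = ["q1", "q2"]
A = {(START, "q1"): 1.0, (START, "q2"): 0.0,
     ("q1", "q1"): 0.7, ("q1", "q2"): 0.3,
     ("q2", "q1"): 0.5, ("q2", "q2"): 0.5}
B = {("q1", "x"): 0.6, ("q1", "y"): 0.1, ("q1", "z"): 0.3,
     ("q2", "x"): 0.1, ("q2", "y"): 0.7, ("q2", "z"): 0.2}

def joint_prob(obs, states):
    p, prev = 1.0, START
    for o, q in zip(obs, states):
        p *= A[(prev, q)] * B[(q, o)]
        prev = q
    return p

obs = ["x", "z", "y"]
print(sum(joint_prob(obs, q_seq) for q_seq in product(STATES, repeat=len(obs))))
# ≈ 0.04968, which the forward algorithm reproduces efficiently
```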
Forward algorithm
• Use a table with cells α(j,t): the probability of being in state j after seeing o1…ot (the forward probability).

  α(j,t) = P(o_1, o_2, … o_t, q_t = j | λ)

• Fill in columns from left to right, with

  α(j,t) = \sum_{i=1}^{N} α(i,t-1) \cdot a_{ij} \cdot b_j(o_t)

  – Same as Viterbi, but sum instead of max (and no backtrace).
Example

• Suppose O = x z y. Our initially empty table:

          o1=x    o2=z    o3=y
  q1
  q2

Filling the first column

          o1=x    o2=z    o3=y
  q1      .6
  q2      0

  α(1,1) = a_{<s>1} \cdot b_1(x) = 1(.6) = .6
  α(2,1) = a_{<s>2} \cdot b_2(x) = 0(.1) = 0
Starting the second column

          o1=x    o2=z    o3=y
  q1      .6      .126
  q2      0

  α(1,2) = \sum_{i=1}^{N} α(i,1) \cdot a_{i1} \cdot b_1(z)
         = α(1,1) \cdot a_{11} \cdot b_1(z) + α(2,1) \cdot a_{21} \cdot b_1(z)
         = (.6)(.7)(.3) + (0)(.5)(.3)
         = .126
Finishing the second column

          o1=x    o2=z    o3=y
  q1      .6      .126
  q2      0       .036

  α(2,2) = \sum_{i=1}^{N} α(i,1) \cdot a_{i2} \cdot b_2(z)
         = α(1,1) \cdot a_{12} \cdot b_2(z) + α(2,1) \cdot a_{22} \cdot b_2(z)
         = (.6)(.3)(.2) + (0)(.5)(.2)
         = .036
Third column and finish

          o1=x    o2=z    o3=y
  q1      .6      .126    .01062
  q2      0       .036    .03906

• Add up all probabilities in the last column to get the probability of the entire sequence:

  P(O|λ) = \sum_{i=1}^{N} α(i,T)
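A minimal forward-algorithm sketch for the same example (names are my own); it reproduces the chart above and sums the last column to get P(O|λ).

```python
# Minimal forward-algorithm sketch; alpha[t][j] is the forward probability for cell (j, t).
START = "<s>"
STATES = ["q1", "q2"]
A = {(START, "q1"): 1.0, (START, "q2"): 0.0,
     ("q1", "q1"): 0.7, ("q1", "q2"): 0.3,
     ("q2", "q1"): 0.5, ("q2", "q2"): 0.5}
B = {("q1", "x"): 0.6, ("q1", "y"): 0.1, ("q1", "z"): 0.3,
     ("q2", "x"): 0.1, ("q2", "y"): 0.7, ("q2", "z"): 0.2}

def forward(obs):
    """Return the table of forward probabilities for the output sequence obs."""
    alpha = [{j: A[(START, j)] * B[(j, obs[0])] for j in STATES}]
    for t in range(1, len(obs)):
        alpha.append({j: sum(alpha[t - 1][i] * A[(i, j)] for i in STATES) * B[(j, obs[t])]
                      for j in STATES})
    return alpha

alpha = forward(["x", "z", "y"])
print(alpha[-1])                  # ≈ {'q1': 0.01062, 'q2': 0.03906}, as in the chart
print(sum(alpha[-1].values()))    # P(O | lambda) ≈ 0.04968
```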
Video 5: Forward-backward algorithm
Learning
• Given only the output sequence, learn the best set of
parameters λ = (A, B).
• Assume ‘best’ = maximum-likelihood.
• Other definitions are possible, won’t discuss here.
Unsupervised learning
• Training an HMM from an annotated corpus is simple.
– Supervised learning: we have examples labelled with the right ‘answers’ (here, tags): no hidden variables in training.
• Training from unannotated corpus is trickier.
– Unsupervised learning: we have no examples labelled with the right ‘answers’: all we see are outputs, state sequence is hidden.
Circularity
• If we know the state sequence, we can find the best λ.
  – E.g., use MLE:  P(q_j|q_i) = \frac{C(q_i \to q_j)}{C(q_i)}
• If we know λ, we can find the best state sequence.
  – use Viterbi
• But we don't know either!
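The MLE step is just counting and dividing; a tiny sketch, using an invented "known" tag sequence purely for illustration:

```python
# MLE transition estimates from a known state sequence: count, then divide.
from collections import Counter

tags = ["<s>", "DT", "NN", "VBD", "DT", "NN"]     # an invented tagged example
bigram_counts = Counter(zip(tags, tags[1:]))      # C(q_i -> q_j)
unigram_counts = Counter(tags[:-1])               # C(q_i), counted as the left element of a bigram

P = {(qi, qj): c / unigram_counts[qi] for (qi, qj), c in bigram_counts.items()}
print(P[("DT", "NN")])    # C(DT -> NN) / C(DT) = 2/2 = 1.0 here
```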
Expectation-maximization (EM)
Essentially, a bootstrapping algorithm.
• Initialize parameters λ(0)
• At each iteration k,
– E-step: Compute expected counts using λ(k-1)
– M-step: Set λ(k) using MLE on the expected counts
• Repeat until λ doesn’t change (or other stopping criterion).
Expected counts??

Counting transitions from qi → qj:
• Real counts:
  – count 1 each time we see qi → qj in the true tag sequence.
• Expected counts:
  – With the current λ, compute the probabilities of all possible tag sequences.
  – If sequence Q has probability p, count p for each qi → qj in Q.
  – Add up these fractional counts across all possible sequences.
Example

• Notionally, we compute expected counts as follows:

  Possible sequence    Probability of sequence
  Q1 = q1 q1 q1        p1
  Q2 = q1 q2 q1        p2
  Q3 = q1 q1 q2        p3
  Q4 = q1 q2 q2        p4
  Observs:  x  z  y

  \hat{C}(q1 \to q1) = 2 p_1 + p_3
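One way to realize this notional computation in code is to enumerate every state sequence (feasible only for tiny examples) and weight its transition counts by the sequence's posterior probability P(Q|O,λ); I take the slide's p's to be these posteriors. Names and structure here are my own.

```python
# Expected transition counts by explicit enumeration, weighting each sequence Q
# by its posterior P(Q | O, lambda). Only viable for very small examples.
from collections import defaultdict
from itertools import product

START = "<s>"
STATES = ["q1", "q2"]
A = {(START, "q1"): 1.0, (START, "q2"): 0.0,
     ("q1", "q1"): 0.7, ("q1", "q2"): 0.3,
     ("q2", "q1"): 0.5, ("q2", "q2"): 0.5}
B = {("q1", "x"): 0.6, ("q1", "y"): 0.1, ("q1", "z"): 0.3,
     ("q2", "x"): 0.1, ("q2", "y"): 0.7, ("q2", "z"): 0.2}

def joint_prob(obs, states):
    p, prev = 1.0, START
    for o, q in zip(obs, states):
        p *= A[(prev, q)] * B[(q, o)]
        prev = q
    return p

obs = ["x", "z", "y"]
joints = {q: joint_prob(obs, q) for q in product(STATES, repeat=len(obs))}
total = sum(joints.values())                       # = P(O | lambda)

expected = defaultdict(float)
for q_seq, p in joints.items():
    weight = p / total                             # posterior P(Q | O, lambda)
    for qi, qj in zip((START,) + q_seq, q_seq):    # includes the <s> -> q_1 transition
        expected[(qi, qj)] += weight

print(expected[("q1", "q1")])   # the slide's 2*p1 + p3, with the p's taken as posteriors
```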
Forward-Backward algorithm
• As usual, avoid enumerating all possible sequences.
• The Forward-Backward (Baum-Welch) algorithm computes expected counts using forward probabilities and backward probabilities:

  β(j,t) = P(o_{t+1}, o_{t+2}, … o_T | q_t = j, λ)
– Details, see J&M 6.5
• EM idea is much more general: can use for many latent variable models.
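The slides defer the details to J&M, but purely as a sketch of what the backward probabilities look like, here is the backward recursion mirroring the forward one (my own names; it assumes no separate end-of-sentence state, matching how the forward example above was set up).

```python
# Sketch of backward probabilities: beta[t][j] = P(o_{t+1} .. o_T | q_t = j, lambda),
# filled right-to-left. Assumes no explicit end-of-sentence state.
START = "<s>"
STATES = ["q1", "q2"]
A = {(START, "q1"): 1.0, (START, "q2"): 0.0,
     ("q1", "q1"): 0.7, ("q1", "q2"): 0.3,
     ("q2", "q1"): 0.5, ("q2", "q2"): 0.5}
B = {("q1", "x"): 0.6, ("q1", "y"): 0.1, ("q1", "z"): 0.3,
     ("q2", "x"): 0.1, ("q2", "y"): 0.7, ("q2", "z"): 0.2}

def backward(obs):
    T = len(obs)
    beta = [dict() for _ in range(T)]
    beta[T - 1] = {i: 1.0 for i in STATES}    # nothing left to generate after time T
    for t in range(T - 2, -1, -1):
        for i in STATES:
            beta[t][i] = sum(A[(i, j)] * B[(j, obs[t + 1])] * beta[t + 1][j]
                             for j in STATES)
    return beta

beta = backward(["x", "z", "y"])
# Sanity check: sum_j a_<s>j * b_j(o_1) * beta(j, 1) equals P(O | lambda),
# the same value the forward algorithm gives (~0.04968 for this example).
print(sum(A[(START, j)] * B[(j, "x")] * beta[0][j] for j in STATES))
```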
Guarantees

• EM is guaranteed to find a local maximum of the likelihood.

  [Figure: P(O|λ) plotted against values of λ, showing EM converging to a local peak rather than the global one.]

• Not guaranteed to find the global maximum.
• Practical issues: initialization, random restarts, early stopping.
Forward-backward/EM in practice
• Fully unsupervised learning of HMM for POS tagging does not work well.
– Model inaccuracies that work ok for supervised learning often cause problems for unsupervised.
• Can be better if more constrained.
  – Other tasks, using Bayesian priors, etc.
• And, general idea of EM can also be useful.
– E.g., for clustering problems or word alignment in
machine translation.
Summary
• HMM: a generative model of sentences using hidden state sequence
• Dynamic programming algorithms to compute
– Best tag sequence given words (Viterbi algorithm)
– Likelihood (forward algorithm)
– Best parameters from unannotated corpus (forward-backward algorithm, an instance of EM)
Video 6: Answers to more HMM questions

• (no slides)