CS447: Natural Language Processing
http://courses.engr.illinois.edu/cs447
Julia Hockenmaier
juliahmr@illinois.edu
3324 Siebel Center
Lecture 6:
HMM algorithms
Recap: Statistical POS tagging
w = w1 w2 w3 w4 w5 w6 = she promised to back the bill
t = t1 t2 t3 t4 t5 t6 = PRP VBD TO VB DT NN
What is the most likely sequence of tags t
for the given sequence of words w?
Statistical POS tagging with HMMs
What is the most likely sequence of tags t
for the given sequence of words w ?
Hidden Markov Models define P(t) and P(w|t) as:
Transition probabilities:
P(t) = ∏i P(ti | ti−1) [bigram HMM]
or P(t) = ∏i P(ti | ti−1, ti−2) [trigram HMM]
Emission probabilities:
P(w | t) = ∏i P(wi | ti)
Estimate argmaxt P(t | w) directly (in a conditional model),
or use Bayes’ Rule (and a generative model):

argmaxt P(t | w) = argmaxt P(t,w) / P(w)
                 = argmaxt P(t,w)
                 = argmaxt P(t) P(w | t)
HMMs as probabilistic automata
[Figure: a bigram HMM drawn as a probabilistic automaton with states DT, JJ, NN and VBZ.
Arcs between states are labeled with transition probabilities (0.7, 0.3, 0.4, 0.6, 0.55, 0.45, 0.5, …);
each state emits words with emission probabilities, e.g. DT emits "the" (0.2), "a" (0.1), "every" (0.1),
"some" (0.1), "no" (0.01), …; JJ emits "able" (0.003), "zealous" (0.002), …; NN emits "zone" (0.00024),
"abandonment" (0.001), …; VBZ emits "yields" (0.02), "acts", ….]
A (bigram) HMM defines
Transition probabilities:
P( ti | ti-1)
Emission probabilities:
P( wi | ti )
HMMs as probabilistic automata
Transition probabilities P(ti | ti−1):
Probability of going from one state (ti−1) of the
automaton to the next (ti)
“Markov model”: We’re making a Markov [independence]
assumption for how to move between states of the automaton
Emission probabilities P(wi | ti):
Probability of emitting a symbol (wi) in a given state of
the automaton (ti)
“Hidden Markov model”: The data that we see (at test time)
consists only of the words w, and we find tags for w by
searching for the most likely sequence of (hidden) states of the
automaton (the tags t) that generated the data w
An example HMM
Transition matrix A (rows = current tag; each row shows its outgoing transition probabilities over the tags D, N, V, A, .):
D: 0.8, 0.2
N: 0.7, 0.3
V: 0.6, 0.4
A: 0.8, 0.2
.: (sentence-final state)

Emission matrix B (over the words the, man, ball, throws, sees, red, blue, .):
D: the 1
N: man 0.7, ball 0.3
V: throws 0.6, sees 0.4
A: red 0.8, blue 0.2
.: . 1

Initial state vector π: D 1 (sentences start in state D)

[Figure: the corresponding automaton over the states D, N, V, A, .]
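For concreteness, these parameters could be stored as plain Python dictionaries. The sketch below follows the tables above for B and π; since the target tags of the transition probabilities in A are not spelled out above, the pairings shown (e.g. D→N 0.8, D→A 0.2) are assumptions made only for illustration.

# Toy HMM parameters as nested dictionaries.
# pi and B follow the example above; the target tags in A are assumed
# (the table above only gives the probability values per row).

pi = {"D": 1.0}                           # initial state vector: sentences start in D

A = {                                     # transition probabilities P(t_i | t_(i-1))
    "D": {"N": 0.8, "A": 0.2},            # assumed pairing
    "N": {"V": 0.7, ".": 0.3},            # assumed pairing
    "V": {"D": 0.6, ".": 0.4},            # assumed pairing
    "A": {"N": 0.8, "A": 0.2},            # assumed pairing
}

B = {                                     # emission probabilities P(w_i | t_i)
    "D": {"the": 1.0},
    "N": {"man": 0.7, "ball": 0.3},
    "V": {"throws": 0.6, "sees": 0.4},
    "A": {"red": 0.8, "blue": 0.2},
    ".": {".": 1.0},
}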
Using HMMs for tagging
-The input to an HMM tagger is a sequence of words, w.
The output is the most likely sequence of tags, t, for w.
-For the underlying HMM model, w is a sequence of output
symbols, and t is the most likely sequence of states (in the
Markov chain) that generated w.
argmaxt P( t | w )   [t = output of the tagger, w = input to the tagger]
  = argmaxt P(w,t) / P(w)
  = argmaxt P(w,t)
  = argmaxt P( w | t ) P( t )   [w = output symbols of the HMM, t = states of the HMM]
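A naive decoder can implement this argmax directly by enumerating every possible tag sequence and scoring it with P(w | t)P(t). A minimal Python sketch, assuming dictionary-shaped parameters pi, A, B like the toy example above; this takes T^N steps and mainly serves to motivate the Viterbi algorithm later in the lecture.

from itertools import product

def brute_force_decode(words, tags, pi, A, B):
    # Return argmax_t P(t) P(w | t) by scoring every one of the T^N tag sequences.
    best_seq, best_p = None, 0.0
    for seq in product(tags, repeat=len(words)):
        p = pi.get(seq[0], 0.0) * B[seq[0]].get(words[0], 0.0)
        for i in range(1, len(words)):
            p *= A.get(seq[i-1], {}).get(seq[i], 0.0)     # transition P(t_i | t_(i-1))
            p *= B[seq[i]].get(words[i], 0.0)             # emission P(w_i | t_i)
        if p > best_p:
            best_seq, best_p = seq, p
    return best_seq, best_p

# e.g. brute_force_decode("the man throws the ball .".split(), ["D", "N", "V", "A", "."], pi, A, B)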
What would the automaton for a trigram HMM
with transition probabilities P(ti | ti−2, ti−1) look like?
What about unigrams or other n-grams?
Encoding a trigram model as FSA

[Figure: the bigram automaton has one state per tag (DT, JJ, NN, VBZ, plus a start state q0);
the trigram automaton has one state per tag bigram (DT_, JJ_DT, NN_DT, JJ_JJ, NN_JJ, NN_NN, VBZ_NN, …).]

Bigram model: States = Tag Unigrams
Trigram model: States = Tag Bigrams
Trigram HMMs
In a trigram HMM tagger, each state qi corresponds
to a POS tag bigram (the tags of the current and
preceding word): qi=tjtk
Emission probabilities depend only on the current
POS tag: States tjtk and titk use the same emission
probabilities P(wi | tk)
Building an HMM tagger
To build an HMM tagger, we have to:
-Train the model, i.e. estimate its parameters
(the transition and emission probabilities)
Easy case: we have a corpus labeled with POS tags
(supervised learning)
-Define and implement a tagging algorithm that
finds the best tag sequence t* for each input
sentence w:
t* = argmaxt P(t)P(w | t)
Learning an HMM
Where do we get the transition probabilities P(tj | ti)
(matrix A) and the emission probabilities P(wj | ti)
(matrix B) from?
Case 1: We have a POS-tagged corpus.
– This is learning from labeled data, aka “supervised learning”
Case 2: We have a raw (untagged) corpus and a tagset.
– This is learning from unlabeled data, aka “unsupervised learning”
Pierre_NNP Vinken_NNP ,_, 61_CD years_NNS
old_JJ ,_, will_MD join_VB the_DT board_NN
as_IN a_DT nonexecutive_JJ director_NN Nov._NNP
29_CD ._.
Pierre Vinken , 61 years old , will
join the board as a nonexecutive
director Nov. 29 .
Tagset: NNP: proper noun, CD: numeral, JJ: adjective, …
Learning an HMM from labeled data

We count how often we see titj and wj_ti etc. in the data
(use relative frequency estimates):

Learning the transition probabilities:
P(tj | ti) = C(ti tj) / C(ti)

Learning the emission probabilities:
P(wj | ti) = C(wj ti) / C(ti)

We might use some smoothing, but this is pretty trivial…

Pierre_NNP Vinken_NNP ,_, 61_CD years_NNS old_JJ ,_, will_MD join_VB the_DT board_NN as_IN a_DT nonexecutive_JJ director_NN Nov._NNP 29_CD ._.
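A minimal sketch of these relative-frequency estimates, assuming the labeled corpus is given as a list of sentences, each a list of (word, tag) pairs; the function and variable names are illustrative, not from the lecture.

from collections import Counter, defaultdict

def estimate_hmm(tagged_sentences):
    # Relative-frequency estimates pi, A, B for a bigram HMM from labeled data.
    tag_count = Counter()                 # C(t_i)
    trans = defaultdict(Counter)          # C(t_i t_j), indexed as trans[t_i][t_j]
    emit = defaultdict(Counter)           # C(w_j t_i), indexed as emit[t_i][w_j]
    init = Counter()                      # how often each tag starts a sentence

    for sent in tagged_sentences:
        prev = None
        for word, tag in sent:
            tag_count[tag] += 1
            emit[tag][word] += 1
            if prev is None:
                init[tag] += 1
            else:
                trans[prev][tag] += 1
            prev = tag

    pi = {t: c / len(tagged_sentences) for t, c in init.items()}
    A = {ti: {tj: c / tag_count[ti] for tj, c in row.items()} for ti, row in trans.items()}
    B = {t: {w: c / tag_count[t] for w, c in row.items()} for t, row in emit.items()}
    return pi, A, B

# corpus = [[("Pierre", "NNP"), ("Vinken", "NNP"), (",", ","), ("61", "CD"), ...], ...]
# pi, A, B = estimate_hmm(corpus)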
Learning an HMM from unlabeled data
We can’t count anymore.
We have to guess how often we’d expect to see titj etc. in our data set.
Call this expected count ⟨C(…)⟩.

-Our estimate for the transition probabilities:
 P̂(tj | ti) = ⟨C(ti tj)⟩ / ⟨C(ti)⟩
-Our estimate for the emission probabilities:
 P̂(wj | ti) = ⟨C(wj ti)⟩ / ⟨C(ti)⟩
-We will talk about how to obtain these expected counts on Friday.

Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Tagset: NNP: proper noun, CD: numeral, JJ: adjective, …
Finding the best tag sequence
The number of possible tag sequences is
exponential in the length of the input sentence:
Each word can have up to T tags.
There are N words.
There are up to T^N possible tag sequences.

We cannot enumerate all T^N possible tag sequences.
But we can exploit the independence assumptions
in the HMM to define an efficient algorithm that
returns the tag sequence with the highest probability.
Dynamic Programming for HMMs
The three basic problems for HMMs
We observe an output sequence w = w1…wN:
w = “she promised to back the bill”

Problem I (Likelihood): find P(w | λ)
Given an HMM λ = (A, B, π), compute the likelihood
of the observed output, P(w | λ)

Problem II (Decoding): find Q = q1…qN
Given an HMM λ = (A, B, π), what is the most likely sequence of
states Q = q1…qN ≈ t1…tN to generate w?

Problem III (Estimation): find argmaxλ P(w | λ)
Find the parameters A, B, π which maximize P(w | λ)
How can we solve these problems?
I. Likelihood of the input w:
Compute P(w | λ) for the input w and HMM λ

II. Decoding (= tagging) the input w:
Find the best tags t* = argmaxt P(t | w, λ) for the input w and HMM λ

III. Estimation (= learning the model):
Find the best model parameters λ* = argmaxλ P(t, w | λ)
for the training data w

These look like hard problems: with T tags, every input string
w1…wn has T^n possible tag sequences.
Can we find efficient (polynomial-time) algorithms?
Dynamic programming
Dynamic programming is a general technique to solve
certain complex search problems by memoization
1.) Recursively decompose the large search problem
into smaller subproblems that can be solved efficiently
–There is only a polynomial number of subproblems.
2.) Store (memoize) the solution of each subproblem
in a common data structure
–Processing this data structure takes polynomial time
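As a sketch of how this applies to tagging (anticipating the Viterbi algorithm later in this lecture): the subproblem “probability of the best tag sequence for w(1)…w(i) that ends in tag t” can be written as a recursion and memoized. The parameter shapes below are assumed to match the toy dictionaries shown earlier; there are only N·T distinct subproblems, so the cached recursion runs in polynomial time.

from functools import lru_cache

def best_score(words, tags, pi, A, B):
    # Probability of the best tag sequence for `words`, via a memoized recursion.
    @lru_cache(maxsize=None)
    def score(i, t):
        # best probability of any tagging of w(1)..w(i) whose last tag is t
        if i == 0:
            return pi.get(t, 0.0) * B[t].get(words[0], 0.0)
        return max(score(i - 1, tp) * A.get(tp, {}).get(t, 0.0) for tp in tags) \
               * B[t].get(words[i], 0.0)

    return max(score(len(words) - 1, t) for t in tags)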
Dynamic programming algorithms for HMMs
I. Likelihood of the input:
Compute P(w| λ ) for an input sentence w and HMM λ
⇒ Forward algorithm
II. Decoding (=tagging) the input:
Find best tags t*=argmaxt P(t | w,λ) for an input sentence w and HMM λ
⇒ Viterbi algorithm
III. Estimation (=learning the model):
Find best model parameters λ*=argmax λ P(t, w | λ) for training data w
⇒ Forward-Backward algorithm
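For Problem I, the forward algorithm has the same shape as Viterbi but sums over tag sequences instead of maximizing. A minimal sketch, again assuming dictionary-shaped parameters pi, A, B as in the earlier examples:

def forward(words, tags, pi, A, B):
    # P(w | lambda): total probability of `words`, summed over all tag sequences.
    # alpha[t] = P(w(1)..w(i), t(i) = t)
    alpha = {t: pi.get(t, 0.0) * B[t].get(words[0], 0.0) for t in tags}
    for word in words[1:]:
        alpha = {t: B[t].get(word, 0.0) *
                    sum(alpha[tp] * A.get(tp, {}).get(t, 0.0) for tp in tags)
                 for t in tags}
    return sum(alpha.values())

Replacing the sum with a max (and adding backpointers) turns this into the Viterbi algorithm described next.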
Bookkeeping: the trellis
We use an N×T table (“trellis”) to keep track of the HMM.
The HMM can assign one of the T tags to each of the N words.

[Figure: the trellis. Columns = the words w(1)…w(N) (“time steps”); rows = the tags t1…tT.
The cell in row tj and column w(i) stands for “word w(i) has tag tj”.]
One tag sequence = one path through the trellis

[Figure: the trellis (columns w(1)…w(N), rows q1…qT) with one path through it highlighted:
one path through the trellis = one tag sequence.]
Computing P(t,w) for one tag sequence
[Figure: one path through the trellis (columns w(1)…w(N), rows q1…qT), annotated with the
probabilities multiplied along it: the initial probability P(t(1)=q1), the emission probabilities
P(w(1)|q1), P(w(2)|qj), …, P(w(N)|qj), and the transition probabilities P(qj|q1), ….]

One path through the trellis = one tag sequence.
To get its probability, we just multiply the initial state probability
and all emission and transition probabilities.
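For the recap example (“she promised to back the bill”, tagged PRP VBD TO VB DT NN), this product is:

P(t,w) = P(PRP) · P(she | PRP)
       · P(VBD | PRP) · P(promised | VBD)
       · P(TO | VBD) · P(to | TO)
       · P(VB | TO) · P(back | VB)
       · P(DT | VB) · P(the | DT)
       · P(NN | DT) · P(bill | NN)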
The Viterbi algorithm
Finding the best tag sequence
The number of possible tag sequences is exponential
in the length of the input sentence:
Each word can have up to T tags.
There are N words.
There are up to T^N possible tag sequences.

We cannot enumerate all T^N possible tag sequences.
But we can exploit the independence assumptions
in the HMM to define an efficient algorithm that
returns the tag sequence with the highest probability
in linear (O(N)) time.
Notation: ti/wi vs t(i)/w(i)
To make the distinction clear between the i-th word/tag
in the vocabulary/tag set and the i-th word/tag in the sentence:
we use superscript notation w(i) for the i-th token in the sequence,
and subscript notation wi for the i-th type in the inventory (tag set/vocabulary).
HMM decoding
We observe a sentence w = w(1)…w(N):
w = “she promised to back the bill”
We want to use an HMM tagger to find its POS tags t:
t* = argmaxt P(w, t)
   = argmaxt P(t(1))·P(w(1) | t(1))·P(t(2) | t(1))·…·P(w(N) | t(N))
To do this efficiently, we will use
dynamic programming to exploit
the independence assumptions
in the HMM.
The Viterbi algorithm
A dynamic programming algorithm which finds the
best (=most probable) tag sequence t* for an input
sentence w: t* = argmaxt P(w | t)P(t)
Complexity: linear in the sentence length.
With a bigram HMM, Viterbi runs in O(T²N) steps
for an input sentence with N words and a tag set of T tags.
The independence assumptions of the HMM tell us
how to break up the big search problem
(find t* = argmaxt P(w | t)P(t)) into smaller subproblems.
The data structure used to store the solution of these
subproblems is the trellis.
HMM independences
1. Emissions depend only on the current tag:
… P(w(i) = man | t(i) = NN )…
We only have to multiply the emission probability
P(w(i) | tj ) with the probability of the best tag
sequence that gets us to t(i) = tj
HMM independences
2. Transition probabilities to the current tag t(i)
depend only on the previous tag t(i−1):
… P( t(i) = NN | t(i−1) = DT )
-Assume the probability of the best tag sequence
for the prefix w(1)…w(i−1) that ends in the tag t(i−1) = tj
is known, and stored in a variable max[i−1][j].
-To compute the probability of the best tag sequence
for w(1)…w(i-1)w(i) that ends in the tags t(i-1)t(i) = tjtk,
multiply max[i−1][j] with P(tk | tj) and P(w(i) | tk)
-To compute the probability of the best tag sequence
for w(1)…w(i-1)w(i) that ends in t(i) = tk ,
consider all possible tags t(i-1) = tj for the preceding word:
max[i][k] = maxj ( max[i−1][j] P(tk | tj) )P(w(i) | tk)
HMM independences
3. The current tag also determines the transition
probability of the next tag:
… P( t(i+1) = VBZ | t(i) = NN )…
We cannot fix the current tag t(i) based on the
probability of getting to t(i) (and producing w(i))
We have to wait until we have reached the last word
in the sequence.
Then, we can trace back to get the best tag sequence
for the entire sentence.
Using the trellis to find t*
Let trellis[i][j] (word w(i) and tag tj) store the probability
of the best tag sequence for w(1)…w(i) that ends in tj:
trellis[i][j] = max over t(1)…t(i−1) of P(w(1)…w(i), t(1)…t(i−1), t(i) = tj)

We can recursively compute trellis[i][j]
from the entries in the previous column trellis[i−1][k]:
trellis[i][j] = P(w(i) | tj) · Maxk( trellis[i−1][k] · P(tj | tk) )

At the end of the sentence, we pick the highest
scoring entry in the last column of the trellis.
At any given cell
-For each cell in the preceding column: multiply its entry with
the transition probability to the current cell.
-Keep a single backpointer to the best (highest-scoring) cell in
the preceding column.
-Multiply this score with the emission probability of the current
word.

[Figure: the column for w(n-1) holds the cells trellis[n-1][j] = P(w(1..n-1), t(n-1)=tj) for j = 1…T;
each is multiplied with the transition probability P(ti | tj) into the current cell in the column for w(n).]

trellis[n][i] = P(w(n) | ti) · Maxj( trellis[n-1][j] · P(ti | tj) )
At the end of the sentence
In the last column (i.e. at the end of the sentence)
pick the cell with the highest entry, and trace back the
backpointers to the first word in the sentence.
Retrieving t* = argmaxt P(t,w)

[Figure: the trellis (columns w(1)…w(N), rows q1…qT) with backpointers from each cell
to its best predecessor in the previous column.]

By keeping one backpointer from each cell to the cell
in the previous column that yields the highest probability,
we can retrieve the most likely tag sequence when we’re done.
The Viterbi algorithm
Viterbi(w1…n){
  for t (1…T)                      // INITIALIZATION: first column
    trellis[1][t].viterbi = p_init[t] × p_emit[t][w1]
  for i (2…n){                     // RECURSION: every other column
    for t (1…T){
      trellis[i][t].viterbi = 0
      for t’ (1…T){
        tmp = trellis[i-1][t’].viterbi × p_trans[t’][t]
        if (tmp > trellis[i][t].viterbi){
          trellis[i][t].viterbi = tmp
          trellis[i][t].backpointer = t’}}
      trellis[i][t].viterbi ×= p_emit[t][wi]}}
  t_max = NULL, vit_max = 0;       // FINISH: find the best cell in the last column
  for t (1…T)
    if (trellis[n][t].viterbi > vit_max){t_max = t; vit_max = trellis[n][t].viterbi}
  return unpack(n, t_max);
}
Unpacking the trellis
unpack(n, t){
  i = n;
  tags = new array[n+1];
  while (i > 0){
    tags[i] = t;
    t = trellis[i][t].backpointer;
    i--;
  }
  return tags;
}
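For reference, a runnable Python sketch that mirrors the pseudocode above (including the unpacking step), assuming the dictionary-shaped parameters pi, A, B used in the earlier sketches; names are illustrative.

def viterbi(words, tags, pi, A, B):
    # Best tag sequence for `words`; mirrors the pseudocode above.
    n = len(words)
    trellis = [dict() for _ in range(n)]            # trellis[i][t] = (viterbi score, backpointer)
    for t in tags:                                  # INITIALIZATION: first column
        trellis[0][t] = (pi.get(t, 0.0) * B[t].get(words[0], 0.0), None)
    for i in range(1, n):                           # RECURSION: every other column
        for t in tags:
            best, back = 0.0, None
            for tp in tags:
                tmp = trellis[i-1][tp][0] * A.get(tp, {}).get(t, 0.0)
                if tmp > best:
                    best, back = tmp, tp
            trellis[i][t] = (best * B[t].get(words[i], 0.0), back)
    t_max = max(tags, key=lambda t: trellis[n-1][t][0])   # FINISH: best cell in last column
    tags_out = [t_max]                              # unpack: follow backpointers right to left
    for i in range(n - 1, 0, -1):
        tags_out.append(trellis[i][tags_out[-1]][1])
    return list(reversed(tags_out))

# e.g. viterbi("the man throws the ball .".split(), ["D", "N", "V", "A", "."], pi, A, B)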
Today’s key concepts
HMM taggers
Learning HMMs from labeled text
Viterbi for HMMs
Dynamic programming
Independence assumptions in HMMs
The trellis
Supplementary material: Viterbi for Trigram HMMs
Trigram HMMs
In a Trigram HMM, transition probabilities are of the form:
P(t(i) = ti | t(i−1) = tj, t(i−2) = tk )
The i-th tag in the sequence influences the probabilities
of the (i+1)-th tag and the (i+2)-th tag:
… P(t(i+1) | t(i), t(i−1)) … P(t(i+2) | t(i+1), t(i))
Hence, each row in the trellis for a trigram HMM has to
correspond to a pair of tags — the current and the preceding tag:
(abusing notation)
trellis[i]⟨j,k⟩: word w(i) has tag tj, word w(i−1) has tag tk
The trellis now has T2 rows.
But we still need to consider only T transitions into each cell,
since the current word’s tag is the next word’s preceding tag:
Transitions are only possible from trellis[i]⟨j,k⟩ to trellis[i+1]⟨l,j⟩
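A minimal sketch of this bookkeeping in code, with assumed (illustrative) parameter shapes: pi[t] for the initial tag probability, A3[(tk, tj)][tl] for P(t(i) = tl | t(i−1) = tj, t(i−2) = tk) (including dummy entries with tk = "<s>" for the second word of the sentence), and B[t][w] for the emissions, which depend only on the current tag.

def viterbi_trigram(words, tags, pi, A3, B):
    # Each trellis state is a tag bigram (current tag, preceding tag).
    BOS = "<s>"                                     # dummy "preceding tag" for the first word
    column = {(t, BOS): (pi.get(t, 0.0) * B[t].get(words[0], 0.0), None) for t in tags}
    columns = [column]
    for word in words[1:]:
        new = {}
        # a state (tl, tj) in this column is reachable only from states of the form (tj, tk):
        for (tj, tk), (p, _) in column.items():
            for tl in tags:
                q = p * A3.get((tk, tj), {}).get(tl, 0.0) * B[tl].get(word, 0.0)
                if q >= new.get((tl, tj), (-1.0, None))[0]:
                    new[(tl, tj)] = (q, (tj, tk))   # backpointer to the best predecessor state
        column = new
        columns.append(column)
    # best state in the last column; following the backpointers stored in `columns` yields t*
    return max(column.items(), key=lambda kv: kv[1][0])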