18 Hidden Markov Models
In Markov models, we directly observe the state of the process (for example, in a random walk, the state is completely described by the position of the random walker). We can easily imagine a system where we don’t directly observe the state but observe some outcome which depends on the state. Call the observed outcomes symbols. We thus make a distinction between the sequence of states and the sequence of symbols. We formally model these systems as hidden Markov models (HMMs).
Example: Weather. Suppose we live in a place that gets winds either from the south or from the north. On days when the wind is from the south, it is rainy and cold 75% of the time, and warm and sunny the other 25%. When the wind is from the north, it is rainy and cold 20% of the time, and warm and sunny 80%. If the wind is northerly today, it will be northerly tomorrow with probability 0.7; if it is southerly today, it will be southerly tomorrow with probability 0.6. □
Figure 5: The chance of Sunny or Rainy weather on a day is determined by the wind direction. When it is Northerly, Sun is more likely; when it is Southerly, Rain is more likely.
The idea in the HMM is that we only observe the sequence of symbols but to understand what is going on, we need to know what the underlying state is.
Example cont.: A state sequence for the weather example might be NNNSSNS, while the emission sequence looks like (using R for rain and U for sun) UUUURUR. We might observe the weather (rainy or sunny) and wonder what is causing the patterns we see. The pattern is best understood by knowing the sequence of states (direction of wind). □
Notation for HMMs
The ith state of the state sequence is π_i, and the transitions between states are given by a_{kl} = P(π_i = l | π_{i−1} = k). We use b to denote symbols, so we get a new set of parameters called emission probabilities: e_k(b) = P(x_i = b | π_i = k) is the probability of emitting symbol b given that we are in state k.
Example cont.: In the weather example, aNN = 0.7, aNS = 0.3, aSS = 0.6 and aSN = 0.4. The emission probabilities are eN(R) = 0.2, eN(U) = 0.8, eS(R) = 0.75 and eS(U) = 0.25. □
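These parameters are just two small tables, one of transition probabilities and one of emission probabilities. A minimal sketch in Python (the dictionary layout is my own choice) stores the weather HMM and uses it to answer a simple question: the chance of rain tomorrow given a northerly wind today, summing over tomorrow's possible states.

```python
# Weather HMM from the example above.
# States: N (northerly), S (southerly); symbols: R (rain), U (sun).
a = {"N": {"N": 0.7, "S": 0.3},       # transition probabilities a_kl
     "S": {"S": 0.6, "N": 0.4}}
e = {"N": {"R": 0.2, "U": 0.8},       # emission probabilities e_k(b)
     "S": {"R": 0.75, "U": 0.25}}

# Each row of transitions and emissions must sum to 1.
for k in a:
    assert abs(sum(a[k].values()) - 1.0) < 1e-12
    assert abs(sum(e[k].values()) - 1.0) < 1e-12

# P(rain tomorrow | northerly today): transition, then emit.
p_rain = sum(a["N"][l] * e[l]["R"] for l in a)
print(p_rain)  # 0.7 * 0.2 + 0.3 * 0.75 ≈ 0.365
```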
Figure 6: The chance of Sunny or Rainy weather on a day is determined by the wind direction. When it is Northerly, Sun is more likely; when it is Southerly, Rain is more likely.
Example: Cheating Casino. A casino tips things in its favour at a dice table by occasionally switching from a fair die to a loaded one. The loaded die produces a 6 50% of the time, and each of 1-5 just 10% of the time. Call the fair die F and the loaded die L. We have transitions aFL = 0.05 and aLF = 0.1, with aFF = 1 − aFL and aLL = 1 − aLF. Emissions are eF(b) = 1/6 for b = 1, …, 6, while eL(b) = 1/10 for b = 1, …, 5 and eL(6) = 0.5.
Figure 7: The cheating casino as an HMM: states F and L with aFF = 0.95, aFL = 0.05, aLF = 0.1, aLL = 0.9; emissions eF(x) = 1/6 for all x, while eL(6) = 1/2 and eL(x) = 1/10 otherwise.
A run of the chain might look like:

FFFFLFFFLLLLLLLLFFFFFFFFFFFFFFLLLLFFFFLLLLLLFFFFFFFF

producing rolls

2516624245663624223463653452351666624662666661516412. □
Think of an HMM as a generative process: the first state π_1 is chosen according to initial probabilities a_{0k}. From that state, a symbol is emitted according to the emission probabilities e_{π_1}. The second state is then chosen according to the transition probabilities a_{π_1 k}, a symbol is emitted, and so on.
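This generative view is easy to simulate. The sketch below (function and variable names are my own) samples a state path and its emitted symbols from the cheating-casino HMM, assuming for illustration that play always starts with the fair die.

```python
import random

def sample_hmm(a, e, a0, length, rng=random.Random(0)):
    """Sample (states, symbols) of the given length from an HMM.

    a  : transition probs a[k][l],  e : emission probs e[k][b],
    a0 : initial state distribution a0[k].
    """
    states, symbols = [], []
    # Choose the first state from a0, then alternate emit / transition.
    state = rng.choices(list(a0), weights=a0.values())[0]
    for _ in range(length):
        states.append(state)
        symbols.append(rng.choices(list(e[state]), weights=e[state].values())[0])
        state = rng.choices(list(a[state]), weights=a[state].values())[0]
    return "".join(states), "".join(symbols)

# Cheating-casino parameters from the example.
a = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.1, "L": 0.9}}
e = {"F": {b: 1/6 for b in "123456"},
     "L": {**{b: 0.1 for b in "12345"}, "6": 0.5}}
states, rolls = sample_hmm(a, e, {"F": 1.0, "L": 0.0}, 50)
print(states)
print(rolls)
```

Rerunning with different seeds produces different runs, but long stretches of F punctuated by shorter stretches of L, as in the run shown above, are typical because both states are "sticky" (aFF = 0.95, aLL = 0.9).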
The joint probability of states and symbols is then

P(x, π) = a_{0π_1} ∏_{i=1}^{L} e_{π_i}(x_i) a_{π_i π_{i+1}}.
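The joint probability is a straightforward product to compute. A sketch, assuming for illustration a uniform initial distribution a_{0k} and no end state (so the transition factors run only to i = L − 1):

```python
def joint_prob(x, pi, a, e, a0):
    """P(x, pi) = a0[pi_1] * prod over i of e_{pi_i}(x_i) * a_{pi_i, pi_{i+1}}."""
    p = a0[pi[0]]
    for i in range(len(x)):
        p *= e[pi[i]][x[i]]          # emission at step i
        if i + 1 < len(x):           # no end state modelled here
            p *= a[pi[i]][pi[i + 1]]
    return p

# Weather example: uniform start over N and S is an assumption for illustration.
a = {"N": {"N": 0.7, "S": 0.3}, "S": {"S": 0.6, "N": 0.4}}
e = {"N": {"R": 0.2, "U": 0.8}, "S": {"R": 0.75, "U": 0.25}}
a0 = {"N": 0.5, "S": 0.5}
print(joint_prob("UU", "NN", a, e, a0))  # 0.5 * 0.8 * 0.7 * 0.8 ≈ 0.224
```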
Multiple underlying states could produce the same path. Clearly, though, some states are more likely than others given the emissions. If we see lots of very good weather, for example, we may guess that northerlies are predominating in our city. If we see long strings of 6s, we might guess that the casino has switched to the loaded die, while sequences of apparently evenly distributed throws suggest a fair die is being used.
18.1 The Viterbi algorithm for finding the most probable state path
If we want to work with just one state path, it can be argued that the state path with the highest probability is the one to choose. Let
π* = argmax_π P(x, π).
We can find π* using a recursive algorithm due to Andrew Viterbi (a Jewish Italian refugee from the fascists who grew up in the USA, wrote this algorithm while at MIT, and went on to found Qualcomm, now a huge technology company).
Let v_k(i) be the probability of the most probable state path ending in state k at observation i (that is, considering only the first i observations). Suppose v_k(i) is known for all states k. Then the next probability v_l(i+1) can be calculated for observation x_{i+1} as

v_l(i+1) = e_l(x_{i+1}) max_k (v_k(i) a_{kl}).
If we insist that all sequences start in a begin state called 0, we have v0(0) = 1 and we can use a backtracking procedure like we saw in the pairwise alignment algorithm to find the most probable path.
The Viterbi algorithm thus proceeds as follows:

Initialisation (i = 0): v_0(0) = 1, v_k(0) = 0 for k > 0.

Recursion (i = 1, …, L): v_l(i) = e_l(x_i) max_k (v_k(i−1) a_{kl}); Pointer_i(l) = argmax_k (v_k(i−1) a_{kl}).
Termination: P(x, π*) = max_k (v_k(L) a_{k0}); π*_L = argmax_k (v_k(L) a_{k0}).

Traceback (i = L, …, 1): π*_{i−1} = Pointer_i(π*_i).
Here we assume an end state is modelled, with transition probabilities a_{k0} into it. If no end state is modelled, this term disappears.
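The algorithm above transcribes almost directly into code. This sketch works in probability space and assumes no end state (so termination is just max_k v_k(L)); the function and variable names are my own. Applied to the casino, runs of 6s decode to the loaded die L.

```python
def viterbi(x, a, e, a0):
    """Most probable state path for symbols x; returns (path, probability)."""
    states = list(a)
    # v[i][k] = probability of the best path for x[:i+1] ending in state k.
    v = [{k: a0[k] * e[k][x[0]] for k in states}]
    ptr = []                           # ptr[i-1][l] = argmax state at step i-1
    for i in range(1, len(x)):
        row, back = {}, {}
        for l in states:
            best = max(states, key=lambda k: v[-1][k] * a[k][l])
            back[l] = best
            row[l] = e[l][x[i]] * v[-1][best] * a[best][l]
        v.append(row)
        ptr.append(back)
    # Termination (no end state), then trace the pointers backwards.
    last = max(states, key=lambda k: v[-1][k])
    path = [last]
    for back in reversed(ptr):
        path.append(back[path[-1]])
    return "".join(reversed(path)), v[-1][last]

# Cheating-casino parameters; uniform start is an assumption for illustration.
a = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.1, "L": 0.9}}
e = {"F": {b: 1/6 for b in "123456"},
     "L": {**{b: 0.1 for b in "12345"}, "6": 0.5}}
path, p = viterbi("31662666615", a, e, {"F": 0.5, "L": 0.5})
print(path)
```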
Even the most probable path typically has a very low probability, so it is important to work in log units. The recursion in log units is

V_l(i+1) = log(e_l(x_{i+1})) + max_k (V_k(i) + log(a_{kl}))

where V_k(i) = log(v_k(i)).
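The underflow problem is easy to demonstrate: a product of a few hundred per-step probability factors falls below the smallest representable float and collapses to zero, while the equivalent sum of logs stays perfectly well behaved. (The 0.1 per-step factor and the path length of 400 are arbitrary illustrative values.)

```python
import math

p_step = 0.1   # a typical per-step probability factor (illustrative)
n = 400        # path length

prob = 1.0
for _ in range(n):
    prob *= p_step          # underflows to exactly 0.0 long before n = 400

log_prob = n * math.log(p_step)   # the log-space equivalent: about -921.03

print(prob)      # 0.0 -- the true value 1e-400 is below the smallest float
print(log_prob)
```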