Learning HMMs: Baum-Welch
J&M Appendix A
CMPSC 442
Week 10, Meeting 28, Three Segments
Outline
● Backward algorithm
● Formulae for estimating the parameters
● Baum-Welch
Learning HMMs: Baum-Welch
J&M Appendix A
CMPSC 442
Week 10, Meeting 28, Segment 1 of 3: Backward Algorithm
Developing the Parameters for an HMM
● In previous meetings, we assumed that the parameters of the HMM are known
○ Often these parameters are estimated on annotated training data
○ Annotation is often difficult and/or expensive
○ Training data might be different from a new dataset of interest
● Parameters can be learned from unlabeled data
○ In principle, the goal is to find the parameter values that maximize the likelihood of the current data
■ Develop a model λ such that λ̂ = argmaxλ P(O | λ)
Learning the HMM Parameters
● Unfortunately, there is no known way to analytically find a global maximum, i.e., a model λ̂ such that λ̂ = argmaxλ P(O | λ)
● It is possible to find a local maximum
● Theorem: Given an initial model λ, we can always find a model λ′ such that P(O | λ′) ≥ P(O | λ)
Iterative Re-Estimation of Parameters
● Key Idea: parameter re-estimation by hill-climbing
● Iteratively re-estimate the HMM parameters until convergence
○ Initialize with some parameter values
○ With these initial parameters, compute the expected counts of states, and
expected state transition counts, given the observation sequences
○ Using these expectations, which approximate MLE counts over the data
(were the state labels available), we then recompute the transition
probabilities and emission probabilities
Baum-Welch
● The Baum-Welch algorithm uses Expectation Maximization to iteratively re-estimate the parameters, yielding at each iteration a new model λ′
○ Initializes λ to a random set of values, then for each iteration:
○ Calculates the forward probabilities αt(i) (left to right) and the backward probabilities βt(j) (right to left) to get the probability of the states i and j at times t and t+1
■ Every state transition from t to t+1 occurs as part of a sequence
■ For all transitions, we compute the forward probability from the sequence start to t, and the backward probability from the sequence end back to t+1
○ Re-estimates the parameters λ = (A, B, π) from these expected counts
● Requires an algorithm for backward computation through the trellis
Backward Algorithm
● Given that we are in state i at time t, the probability βt(i) of seeing the observations from time t+1 to T, given λ, is:
βt(i) = P(ot+1, ot+2, …, oT | qt = i, λ)
● The Forward and Backward algorithms give the same total probability over the observations: P(O | λ) = P(o1, o2, …, oT | λ)
Backward Algorithm
The backward probability βt(i) is symmetrical to αt(i) in the Forward Algorithm:
● Initialization: βT(i) = 1, for 1 ≤ i ≤ N
● Induction: βt(i) = Σj=1..N aij bj(ot+1) βt+1(j), for 1 ≤ i ≤ N and 1 ≤ t < T
● Termination: P(O | λ) = Σj=1..N πj bj(o1) β1(j)
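Below is a minimal sketch of these formulas in Python with NumPy (the toy model, variable names, and helper functions are illustrative assumptions, not from the slides); it also runs a forward pass so the two terminations can be checked against each other:

```python
import numpy as np

def backward(pi, A, B, obs):
    """Backward pass: beta[t, i] = P(o_{t+1}, ..., o_T | q_t = i, lambda)."""
    N, T = A.shape[0], len(obs)
    beta = np.zeros((T, N))
    beta[T - 1, :] = 1.0                                  # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):                        # induction, right to left
        beta[t, :] = A @ (B[:, obs[t + 1]] * beta[t + 1, :])
    p_obs = (pi * B[:, obs[0]] * beta[0, :]).sum()        # termination: P(O | lambda)
    return p_obs, beta

def forward(pi, A, B, obs):
    """Forward pass: alpha[t, i] = P(o_1, ..., o_t, q_t = i | lambda)."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0, :] = pi * B[:, obs[0]]                       # initialization
    for t in range(1, T):                                 # induction, left to right
        alpha[t, :] = (alpha[t - 1, :] @ A) * B[:, obs[t]]
    return alpha[-1, :].sum(), alpha                      # termination: P(O | lambda)

# Toy 2-state, 2-symbol HMM (all numbers are made up for illustration)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])
obs = [0, 1, 0]

p_b, _ = backward(pi, A, B, obs)
p_f, _ = forward(pi, A, B, obs)
assert np.isclose(p_f, p_b)                               # same P(O | lambda) from both directions
```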
Backward Induction Step Illustrated
Figure from Jurafsky & Martin, 3rd Ed., Appendix A
Parameter Re-estimation
● Three sets of HMM parameters need to be iteratively estimated
○ Initial state distribution: πi
○ Transition probabilities: ai,j
○ Emission probabilities: bi(ot)
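As a concrete sketch of these three parameter sets (the sizes and names are illustrative assumptions), they can be stored as row-stochastic NumPy arrays:

```python
import numpy as np

N, V = 3, 5                                    # e.g., 3 hidden states, 5 observation symbols
rng = np.random.default_rng(0)

pi = rng.random(N); pi /= pi.sum()                          # pi[i]   : P(q_1 = i), sums to 1
A = rng.random((N, N)); A /= A.sum(axis=1, keepdims=True)   # A[i, j] : a_ij, each row sums to 1
B = rng.random((N, V)); B /= B.sum(axis=1, keepdims=True)   # B[i, k] : b_i(v_k), each row sums to 1
```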
Learning HMMs: Baum-Welch
J&M Appendix A
CMPSC 442
Week 10, Meeting 28, Segment 2 of 3: Formulae for
Estimating the Parameters
Likelihood of the Observations
● To compute the likelihood of o1,o2,…,oT as P(O | λ), we would need to sum over all the paths through the trellis and their associated probabilities
● Given that we do not have λ, we use an approximation λ’ to compute
an expectation for each sequence of observations
● Given the Markov assumptions, we need only consider the probability
of every state transition i, j for every observation at every t
● Given all our sequences o1,o2,…,oT, the probability of every state transition with an emission at t+1 is derivable from the forward probability from 1 to t and the backward probability from T back to t+1
Using the Forward-Backward Algorithm
Figure from Jurafsky & Martin, 3rd Ed., Appendix A
Expectations for the Transition Probabilities
● Assume we have an estimate of the transition probability aij at a
particular point in time t in the observation sequence
● Then the estimate âij of the transition probability sums over all t
○ Formulate an expression for the expected joint probability ξt(i,j) of state i at time t and state j at time t+1, based on the observation sequences
○ For the numerator of âij, sum ξt(i,j) over all t
○ For the denominator, sum ξt(i,k) over all t and over all transitions i,k
Deriving the Expression for the Expectation
● The expression ξt(i,j) can be formulated as the joint probability of the states i and j, conditioned on the observation sequence and the parameters:
ξt(i,j) = P(qt = i, qt+1 = j | O, λ)
● Using the forward-backward computation, we can compute the following:
εt(i,j) = P(qt = i, qt+1 = j, O | λ) = αt(i) aij bj(ot+1) βt+1(j)
Deriving the Expression for the Expectation
● We want: ξt(i,j) = P(qt = i, qt+1 = j | O, λ)
● We have: εt(i,j) = P(qt = i, qt+1 = j, O | λ)
● We divide εt(i,j) by P(O | λ) to get ξt(i,j)
○ Because P(X | Y, Z) = P(X, Y | Z) / P(Y | Z)
Deriving ξt(i,j)
1. By the laws of probability: εt(i,j) = αt(i) aij bj(ot+1) βt+1(j)
2. P(O | λ) = Σj=1..N αt(j) βt(j): the probability of the whole observation sequence
3. Divide εt(i,j) by P(O | λ) to get ξt(i,j): ξt(i,j) = αt(i) aij bj(ot+1) βt+1(j) / Σj=1..N αt(j) βt(j)
Formulation for Estimating the Transition Probabilities
● Given the expression ξt(i,j), we can now sum over all t:
âij = Σt=1..T−1 ξt(i,j) / Σt=1..T−1 Σk=1..N ξt(i,k)
● Numerator sums over all probabilities of states i at t and states j at t+1
● Denominator sums over all probabilities of states i at t and states k at
t+1 for all k in full set of states Q
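A hedged NumPy sketch of these two formulas, assuming alpha and beta arrays produced by forward and backward passes like those sketched earlier (the function name and argument layout are illustrative):

```python
import numpy as np

def xi_and_a_hat(A, B, obs, alpha, beta):
    """xi[t, i, j] = P(q_t = i, q_{t+1} = j | O, lambda); then re-estimate the transitions."""
    T, N = len(obs), A.shape[0]
    p_obs = alpha[-1, :].sum()                            # P(O | lambda)
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        # epsilon_t(i, j) = alpha_t(i) * a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        epsilon = alpha[t, :, None] * A * B[:, obs[t + 1]] * beta[t + 1, :]
        xi[t] = epsilon / p_obs                           # divide by P(O | lambda) to get xi_t(i, j)
    numer = xi.sum(axis=0)                                # sum over t of xi_t(i, j)
    denom = numer.sum(axis=1, keepdims=True)              # sum over t and k of xi_t(i, k)
    return xi, numer / denom                              # a_hat[i, j]
```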
Expectations of the Emission Probabilities
● Given each ot, estimate the probability γt(j) of being in state j at time t:
γt(j) = P(qt = j | O, λ) = αt(j) βt(j) / P(O | λ)
Estimating the Emission Probabilities (Sensor Model)
● The estimate b̂j(vk) is the expected number of times in state j while observing symbol vk, divided by the expected number of times in state j:
b̂j(vk) = Σt=1..T s.t. ot=vk γt(j) / Σt=1..T γt(j)
Estimating the Priors for Each State
● Initial state distribution: π̂i is the expected probability that qi is the start state (t = 1), i.e., π̂i = γ1(i)
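A corresponding sketch for γt(j), b̂j(vk), and π̂i for a single observation sequence (the function name and argument layout are again illustrative assumptions):

```python
import numpy as np

def gamma_b_hat_pi_hat(alpha, beta, obs, V):
    """gamma[t, j] = P(q_t = j | O, lambda); re-estimate the emissions and the priors."""
    obs = np.asarray(obs)
    p_obs = alpha[-1, :].sum()                        # P(O | lambda)
    gamma = alpha * beta / p_obs                      # gamma_t(j) = alpha_t(j) beta_t(j) / P(O | lambda)

    N = gamma.shape[1]
    b_hat = np.zeros((N, V))
    for k in range(V):
        b_hat[:, k] = gamma[obs == k, :].sum(axis=0)  # expected count of state j emitting v_k
    b_hat /= gamma.sum(axis=0)[:, None]               # expected total time spent in state j

    pi_hat = gamma[0, :]                              # pi_hat_i = gamma_1(i)
    return gamma, b_hat, pi_hat
```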
Learning HMMs: Baum-Welch
J&M Appendix A
CMPSC 442
Week 10, Meeting 28, Segment 3 of 3: Baum-Welch
Baum-Welch
● An example of Expectation Maximization
● E-Step: Compute expected values of the states j at times t using γt(j)
and of the transitions i,j from t to t+1 using ξt(i,j)
● M-Step: From these expected values, compute new parameter estimates π̂i, âij, and b̂j(vk)
● Iterate until convergence
Baum-Welch Expectation Step
● E-step: with the current λ, run the forward and backward passes and compute γt(j) and ξt(i,j) for every state and every time step
Baum-Welch Maximization Step
● M-step: from the expected counts, re-estimate π̂i = γ1(i), âij from the sums over ξt(i,j), and b̂j(vk) from the sums over γt(j)
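Putting the pieces together, here is one possible end-to-end sketch of Baum-Welch for a single observation sequence (all names, sizes, and the fixed iteration count are illustrative; practical implementations work in log space or with scaling to avoid numerical underflow):

```python
import numpy as np

def baum_welch(obs, N, V, n_iters=100, seed=0):
    """EM for HMM parameters lambda = (pi, A, B) from one unlabeled symbol sequence."""
    rng = np.random.default_rng(seed)
    pi = rng.random(N); pi /= pi.sum()
    A = rng.random((N, N)); A /= A.sum(axis=1, keepdims=True)
    B = rng.random((N, V)); B /= B.sum(axis=1, keepdims=True)
    obs = np.asarray(obs)
    T = len(obs)

    for _ in range(n_iters):
        # E-step: forward and backward passes, then gamma and xi under the current lambda
        alpha = np.zeros((T, N)); beta = np.zeros((T, N))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        beta[T - 1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        p_obs = alpha[-1].sum()                               # P(O | lambda)
        gamma = alpha * beta / p_obs
        xi = np.zeros((T - 1, N, N))
        for t in range(T - 1):
            xi[t] = alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1] / p_obs

        # M-step: re-estimate pi, A, B from the expected counts
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        B = np.stack([gamma[obs == k].sum(axis=0) for k in range(V)], axis=1)
        B /= gamma.sum(axis=0)[:, None]
    return pi, A, B

# Example use on a toy symbol sequence (2 hidden states, 3 observation symbols)
pi, A, B = baum_welch([0, 1, 2, 1, 0, 0, 2, 1], N=2, V=3)
```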
Theory versus Practice
● In theory, Baum-Welch can do completely unsupervised learning of the
parameters
● In practice, the initialization is very important
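One common way to address this sensitivity to initialization, shown here as an illustrative usage of the baum_welch and forward sketches above, is to run several random restarts and keep the parameters whose final likelihood P(O | λ) is highest:

```python
# Illustrative multi-restart wrapper, reusing the baum_welch and forward sketches above.
obs = [0, 1, 2, 1, 0, 0, 2, 1]
runs = [baum_welch(obs, N=2, V=3, seed=s) for s in range(5)]       # several random initializations
pi, A, B = max(runs, key=lambda params: forward(*params, obs)[0])  # keep the best P(O | lambda)
```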
Summary
● The backward algorithm moves through the trellis from right to left to compute the backward probabilities βt(i) of paths
● Formulae for estimating the parameters of an HMM require a
combination of forward and backward computation through the trellis
● Baum-Welch is an example of the Expectation Maximization algorithm, which uses hill-climbing to find a locally optimal assignment of parameter values that accounts for the observed sequences