Statistical Machine Learning
©2020 Ong & Walder & Webers
Data61 | CSIRO
The Australian National University
Outlines
Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2
Statistical Machine Learning
Christian Walder
Machine Learning Research Group
CSIRO Data61
and
College of Engineering and Computer Science
The Australian National University
Canberra
Semester One, 2020.
(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)
Part XXII
Sequential Data 2
Example of a Hidden Markov Model
Assume Peter and Mary are students in Canberra and Sydney, respectively. Peter is a computer science student and is only interested in riding his bicycle, shopping for new computer gadgets, and studying. (He also does other things, but because those activities don't depend on the weather we neglect them here.)

Mary does not know the current weather in Canberra, but she knows the general weather trends there. She also knows Peter well enough to know what he does on average on rainy days and when the skies are blue.

She believes that the weather (rainy or not rainy) follows a given discrete Markov chain. She tries to guess the sequence of weather patterns for a number of days after Peter tells her on the phone what he did over those days.
Statistical Machine
Learning
c©2020
Ong & Walder & Webers
Data61 | CSIRO
The Australian National
University
A Simple Example
Maximum Likelihood for
HMM
Forward-Backward
HMM
Conditional
Independence
Alpha-Beta HMM
How to train a HMM
using EM?
HMM – Viterbi Algorithm
764of 821
Example of a Hidden Markov Model
Mary uses the following HMM (transition rows give the current day's weather, columns the next day's; each emission column gives the distribution over activities for that weather):

                                    rainy   sunny
   initial probability               0.2     0.8
   transition probability   rainy    0.3     0.7
                            sunny    0.4     0.6
   emission probability     cycle    0.1     0.6
                            shop     0.4     0.3
                            study    0.5     0.1
Assume Peter tells Mary that the list of his activities over the last days was ['cycle', 'shop', 'study'].
(a) Calculate the probability of this observation sequence.
(b) Calculate the most probable sequence of hidden states for
these observations.
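Below is a minimal Python/NumPy sketch (not part of the original slides; all variable names are my own) showing how Mary's HMM can be encoded and how question (a) is answered with the forward (alpha) recursion derived later in this lecture; question (b) is revisited with the Viterbi algorithm at the end.

```python
import numpy as np

# States: 0 = rainy, 1 = sunny; activities: 0 = cycle, 1 = shop, 2 = study.
pi = np.array([0.2, 0.8])                  # initial state probabilities
A = np.array([[0.3, 0.7],                  # transition matrix: row = today, column = tomorrow
              [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5],             # emission matrix: row = state, column = activity
              [0.6, 0.3, 0.1]])
obs = [0, 1, 2]                            # observed sequence: 'cycle', 'shop', 'study'

# Forward (alpha) recursion: alpha_n(k) = p(x_1, ..., x_n, z_n = k)
alpha = pi * B[:, obs[0]]
for x in obs[1:]:
    alpha = B[:, x] * (alpha @ A)

print("p('cycle','shop','study') =", alpha.sum())   # answer to (a)
```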
Hidden Markov Model
The trellis for the hidden Markov model
Find the probability of ’cycle’, ’shop’, ’study’
(Trellis diagram: the states Rainy and Sunny unrolled over three time steps, with initial probabilities 0.2/0.8, the transition probabilities 0.3, 0.7, 0.4, 0.6 between consecutive steps, and the emission probabilities for cycle, shop, study at each step.)
Hidden Markov Model
Find the most probable hidden states for ’cycle’, ’shop’,
’study’
(The same trellis diagram, now used to find the most probable hidden state sequence for the observations cycle, shop, study.)
Homogeneous Hidden Markov Model
Joint probability distribution over both latent and observed
variables
    p(X, Z \mid \theta) = p(z_1 \mid \pi) \left[ \prod_{n=2}^{N} p(z_n \mid z_{n-1}, A) \right] \prod_{m=1}^{N} p(x_m \mid z_m, \phi)

where X = (x_1, \ldots, x_N), Z = (z_1, \ldots, z_N), and \theta = \{\pi, A, \phi\}.
Most of the discussion will be independent of the particular
choice of emission probabilities (e.g. discrete tables,
Gaussians, mixture of Gaussians).
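As a concrete reading of this factorisation, here is a small hedged Python sketch (my own, assuming tabular transition and emission distributions) that evaluates ln p(X, Z | θ) for one particular latent path:

```python
import numpy as np

def log_joint(z, x, pi, A, B):
    """ln p(X, Z | theta) for a discrete HMM with tabular emissions:
    p(z_1 | pi) * prod_{n>=2} p(z_n | z_{n-1}, A) * prod_n p(x_n | z_n, B)."""
    lp = np.log(pi[z[0]]) + np.log(B[z[0], x[0]])
    for n in range(1, len(x)):
        lp += np.log(A[z[n - 1], z[n]]) + np.log(B[z[n], x[n]])
    return lp
```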
Maximum Likelihood for HMM
We have observed a data set X = (x_1, \ldots, x_N).
Assume it came from an HMM with a given structure (number of nodes, form of emission probabilities).
The likelihood of the data is

    p(X \mid \theta) = \sum_{Z} p(X, Z \mid \theta)

This joint distribution does not factorise over n (in contrast to the mixture distribution).
We have N latent variables, each with K states: K^N terms.
The number of terms grows exponentially with the length of the chain.
But we can use the conditional independence of the latent variables to reorder their calculation later.
A further obstacle to finding a closed-form maximum likelihood solution: calculating the emission probabilities for the different states z_n.
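To see the exponential blow-up concretely, here is a hedged brute-force sketch (my own; only feasible for very small N) that evaluates the sum over all K^N latent paths directly:

```python
import itertools
import numpy as np

def likelihood_brute_force(x, pi, A, B):
    """p(X | theta) = sum_Z p(X, Z | theta), enumerated over all K^N latent paths.
    Exponential in N -- this is exactly the cost the forward-backward recursions avoid."""
    K, N = len(pi), len(x)
    total = 0.0
    for z in itertools.product(range(K), repeat=N):     # K^N terms
        p = pi[z[0]] * B[z[0], x[0]]
        for n in range(1, N):
            p *= A[z[n - 1], z[n]] * B[z[n], x[n]]
        total += p
    return total
```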
Maximum Likelihood for HMM – EM
Employ the EM algorithm to find the maximum likelihood solution for the HMM.
Start with some initial parameter settings θ^old.
E-step: Find the posterior distribution of the latent variables p(Z | X, θ^old).
M-step: Maximise

    Q(\theta, \theta^{old}) = \sum_{Z} p(Z \mid X, \theta^{old}) \ln p(Z, X \mid \theta)

with respect to the parameters \theta = \{\pi, A, \phi\}.
Maximum Likelihood for HMM – EM
Denote the marginal posterior distribution of z_n by γ(z_n), and the joint posterior distribution of two successive latent variables by ξ(z_{n-1}, z_n):

    \gamma(z_n) = p(z_n \mid X, \theta^{old})
    \xi(z_{n-1}, z_n) = p(z_{n-1}, z_n \mid X, \theta^{old})

For each step n, γ(z_n) has K nonnegative values which sum to 1.
For each step n, ξ(z_{n-1}, z_n) has K × K nonnegative values which sum to 1.
Elements of these are denoted by γ(z_{nk}) and ξ(z_{n-1,j}, z_{nk}) respectively.
Maximum Likelihood for HMM – EM
Because the expectation of a binary random variable is the probability that it takes the value one, with this notation we get

    \gamma(z_{nk}) = \mathbb{E}[z_{nk}] = \sum_{z_n} \gamma(z_n) \, z_{nk}
    \xi(z_{n-1,j}, z_{nk}) = \mathbb{E}[z_{n-1,j} \, z_{nk}] = \sum_{z_{n-1}, z_n} \xi(z_{n-1}, z_n) \, z_{n-1,j} \, z_{nk}

Putting it all together we get

    Q(\theta, \theta^{old}) = \sum_{k=1}^{K} \gamma(z_{1k}) \ln \pi_k
        + \sum_{n=2}^{N} \sum_{j=1}^{K} \sum_{k=1}^{K} \xi(z_{n-1,j}, z_{nk}) \ln A_{jk}
        + \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \ln p(x_n \mid \phi_k)
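A hedged NumPy sketch (my own naming) of how Q(θ, θ^old) can be evaluated once the E-step quantities γ and ξ are available as arrays:

```python
import numpy as np

def q_function(gamma, xi, log_pi, log_A, log_emission):
    """Q(theta, theta_old) for an HMM.
    gamma:        (N, K)      posterior marginals gamma(z_nk)
    xi:           (N-1, K, K) pairwise posteriors xi(z_{n-1,j}, z_nk)
    log_pi:       (K,)        ln pi_k
    log_A:        (K, K)      ln A_jk
    log_emission: (N, K)      ln p(x_n | phi_k)"""
    return (gamma[0] @ log_pi                 # initial-state term
            + np.sum(xi * log_A)              # transition term (sum over n, j, k)
            + np.sum(gamma * log_emission))   # emission term (sum over n, k)
```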
Maximum Likelihood for HMM – EM
M-step: Maximising

    Q(\theta, \theta^{old}) = \sum_{k=1}^{K} \gamma(z_{1k}) \ln \pi_k
        + \sum_{n=2}^{N} \sum_{j=1}^{K} \sum_{k=1}^{K} \xi(z_{n-1,j}, z_{nk}) \ln A_{jk}
        + \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \ln p(x_n \mid \phi_k)

results in

    \pi_k = \frac{\gamma(z_{1k})}{\sum_{j=1}^{K} \gamma(z_{1j})}

    A_{jk} = \frac{\sum_{n=2}^{N} \xi(z_{n-1,j}, z_{nk})}{\sum_{l=1}^{K} \sum_{n=2}^{N} \xi(z_{n-1,j}, z_{nl})}
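These two updates translate directly into array operations; a hedged sketch (my own naming), assuming γ is stored as an (N, K) array and ξ as an (N−1, K, K) array:

```python
import numpy as np

def m_step_pi_A(gamma, xi):
    """M-step updates for pi and A from the E-step statistics.
    gamma: (N, K), xi: (N-1, K, K)."""
    pi_new = gamma[0] / gamma[0].sum()                 # pi_k = gamma(z_1k) / sum_j gamma(z_1j)
    counts = xi.sum(axis=0)                            # sum_n xi(z_{n-1,j}, z_nk), shape (K, K)
    A_new = counts / counts.sum(axis=1, keepdims=True) # normalise each row j over k
    return pi_new, A_new
```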
Maximum Likelihood for HMM – EM
Still left: Maximising with respect to φ.

    Q(\theta, \theta^{old}) = \sum_{k=1}^{K} \gamma(z_{1k}) \ln \pi_k
        + \sum_{n=2}^{N} \sum_{j=1}^{K} \sum_{k=1}^{K} \xi(z_{n-1,j}, z_{nk}) \ln A_{jk}
        + \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \ln p(x_n \mid \phi_k)

But φ only appears in the last term, and under the assumption that all the φ_k are independent of each other, this term decouples into a sum.
Then maximise each contribution \sum_{n=1}^{N} \gamma(z_{nk}) \ln p(x_n \mid \phi_k) individually.
Maximum Likelihood for HMM – EM
In the case of Gaussian emission densities p(x \mid \phi_k) = \mathcal{N}(x \mid \mu_k, \Sigma_k), the maximising parameters for the emission densities are

    \mu_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk}) \, x_n}{\sum_{n=1}^{N} \gamma(z_{nk})}

    \Sigma_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^T}{\sum_{n=1}^{N} \gamma(z_{nk})}
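A hedged NumPy sketch of these Gaussian M-step updates (my own naming), with the observations stacked as an (N, D) array:

```python
import numpy as np

def m_step_gaussian(X, gamma):
    """Weighted mean and covariance updates for Gaussian emissions.
    X: (N, D) observations, gamma: (N, K) posterior marginals."""
    Nk = gamma.sum(axis=0)                              # effective counts per state
    mu = (gamma.T @ X) / Nk[:, None]                    # (K, D)
    K, D = mu.shape
    Sigma = np.zeros((K, D, D))
    for k in range(K):
        diff = X - mu[k]                                # (N, D)
        Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    return mu, Sigma
```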
Forward-Backward HMM
Need to efficiently evaluate the γ(z_{nk}) and ξ(z_{n-1,j}, z_{nk}).
The graphical model for the HMM is a tree!
We know we can use a two-stage message passing
algorithm to calculate the posterior distribution of the latent
variables.
For HMM this is called the forward-backward algorithm
(Rabiner, 1989), or Baum-Welch algorithm (Baum, 1972).
Other variants exist, differing only in the form of the messages propagated.
We look at the most widely known, the alpha-beta
algorithm.
Conditional Independence for HMM
Given the data X = \{x_1, \ldots, x_N\}, the following independence relations hold:

    p(X \mid z_n) = p(x_1, \ldots, x_n \mid z_n) \, p(x_{n+1}, \ldots, x_N \mid z_n)
    p(x_1, \ldots, x_{n-1} \mid x_n, z_n) = p(x_1, \ldots, x_{n-1} \mid z_n)
    p(x_1, \ldots, x_{n-1} \mid z_{n-1}, z_n) = p(x_1, \ldots, x_{n-1} \mid z_{n-1})
    p(x_{n+1}, \ldots, x_N \mid z_n, z_{n+1}) = p(x_{n+1}, \ldots, x_N \mid z_{n+1})
    p(x_{n+2}, \ldots, x_N \mid x_{n+1}, z_{n+1}) = p(x_{n+2}, \ldots, x_N \mid z_{n+1})
    p(X \mid z_{n-1}, z_n) = p(x_1, \ldots, x_{n-1} \mid z_{n-1}) \, p(x_n \mid z_n) \, p(x_{n+1}, \ldots, x_N \mid z_n)
    p(x_{N+1} \mid X, z_{N+1}) = p(x_{N+1} \mid z_{N+1})
    p(z_{N+1} \mid X, z_N) = p(z_{N+1} \mid z_N)
(Graphical model: the latent Markov chain z_1, z_2, \ldots with one observed emission x_n attached to each z_n.)
Conditional Independence for HMM – Example
Let's look at the following independence relation:

    p(X \mid z_n) = p(x_1, \ldots, x_n \mid z_n) \, p(x_{n+1}, \ldots, x_N \mid z_n)

Any path from the set \{x_1, \ldots, x_n\} to the set \{x_{n+1}, \ldots, x_N\} has to go through z_n.
In p(X | z_n) the node z_n is conditioned on (= observed).
All paths from x_1, \ldots, x_{n-1} through z_n to x_{n+1}, \ldots, x_N are head-to-tail at z_n, so blocked.
The path from x_n through z_n to z_{n+1} is tail-to-tail at z_n, so also blocked.
Therefore x_1, \ldots, x_n \perp\!\!\!\perp x_{n+1}, \ldots, x_N \mid z_n.
Alpha-Beta HMM
Define the joint probability of observing all data up to step n and having z_n as latent variable to be

    \alpha(z_n) = p(x_1, \ldots, x_n, z_n).

Define the probability of all future data given z_n to be

    \beta(z_n) = p(x_{n+1}, \ldots, x_N \mid z_n).

Then it can be shown that the following recursions hold:

    \alpha(z_n) = p(x_n \mid z_n) \sum_{z_{n-1}} \alpha(z_{n-1}) \, p(z_n \mid z_{n-1})

    \alpha(z_1) = \prod_{k=1}^{K} \{\pi_k \, p(x_1 \mid \phi_k)\}^{z_{1k}}
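A hedged sketch of the alpha recursion (my own naming; it works with raw probabilities, so for long chains one would rescale α at each step or work in log space, which is omitted here):

```python
import numpy as np

def forward(pi, A, emission_lik):
    """Alpha recursion. emission_lik is an (N, K) array with entries p(x_n | z_n = k)."""
    N, K = emission_lik.shape
    alpha = np.zeros((N, K))
    alpha[0] = pi * emission_lik[0]                      # alpha(z_1) = pi_k p(x_1 | phi_k)
    for n in range(1, N):
        # alpha(z_n) = p(x_n | z_n) sum_{z_{n-1}} alpha(z_{n-1}) p(z_n | z_{n-1})
        alpha[n] = emission_lik[n] * (alpha[n - 1] @ A)
    return alpha
```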
Alpha-Beta HMM
At step n we can efficiently calculate α(z_n) given α(z_{n-1}):

    \alpha(z_n) = p(x_n \mid z_n) \sum_{z_{n-1}} \alpha(z_{n-1}) \, p(z_n \mid z_{n-1})

(Trellis fragment: α(z_{n,1}) computed from α(z_{n-1,1}), α(z_{n-1,2}), α(z_{n-1,3}) via the transition weights A_{11}, A_{21}, A_{31} and the emission probability p(x_n | z_{n,1}).)
Alpha-Beta HMM
And for β(z_n) we get the recursion

    \beta(z_n) = \sum_{z_{n+1}} \beta(z_{n+1}) \, p(x_{n+1} \mid z_{n+1}) \, p(z_{n+1} \mid z_n)

(Trellis fragment: β(z_{n,1}) computed from β(z_{n+1,1}), β(z_{n+1,2}), β(z_{n+1,3}) via the transition weights A_{11}, A_{12}, A_{13} and the corresponding emission probabilities.)
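A matching hedged sketch of the beta recursion (my own naming; the same caveat about rescaling applies), initialised with β(z_N) = 1 as discussed on the next slide:

```python
import numpy as np

def backward(A, emission_lik):
    """Beta recursion. emission_lik is an (N, K) array with entries p(x_n | z_n = k)."""
    N, K = emission_lik.shape
    beta = np.ones((N, K))                               # beta(z_N) = 1
    for n in range(N - 2, -1, -1):
        # beta(z_n) = sum_{z_{n+1}} beta(z_{n+1}) p(x_{n+1} | z_{n+1}) p(z_{n+1} | z_n)
        beta[n] = A @ (emission_lik[n + 1] * beta[n + 1])
    return beta
```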
Alpha-Beta HMM
How do we start the β recursion? What is β(z_N)?
Taking the definition literally, β(z_N) would be p(x_{N+1}, \ldots, x_N \mid z_N), a conditional over an empty set of observations.
It can be shown that setting

    \beta(z_N) = 1

is consistent with the recursion.
Alpha-Beta HMM
Now we know how to calculate α(z_n) and β(z_n) for each step.
What is the probability of the data, p(X)?
Use the definition of γ(z_n) and Bayes

    \gamma(z_n) = p(z_n \mid X) = \frac{p(X \mid z_n) \, p(z_n)}{p(X)} = \frac{p(X, z_n)}{p(X)}

and the following conditional independence statement from the graphical model of the HMM

    p(X \mid z_n) = \underbrace{p(x_1, \ldots, x_n \mid z_n)}_{\alpha(z_n)/p(z_n)} \; \underbrace{p(x_{n+1}, \ldots, x_N \mid z_n)}_{\beta(z_n)}

and therefore

    \gamma(z_n) = \frac{\alpha(z_n) \, \beta(z_n)}{p(X)}
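Putting the two passes together, a hedged sketch (assuming the forward and backward sketches above) of γ(z_n) and the likelihood p(X):

```python
def posteriors(alpha, beta):
    """gamma(z_n) = alpha(z_n) beta(z_n) / p(X), with p(X) read off at step N."""
    pX = alpha[-1].sum()         # p(X) = sum_{z_N} alpha(z_N), since beta(z_N) = 1
    gamma = alpha * beta / pX    # (N, K)
    return gamma, pX
```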
Alpha-Beta HMM
Marginalising over z_n results in

    1 = \sum_{z_n} \gamma(z_n) = \frac{\sum_{z_n} \alpha(z_n) \, \beta(z_n)}{p(X)}

and therefore at each step n

    p(X) = \sum_{z_n} \alpha(z_n) \, \beta(z_n)

Most conveniently evaluated at step N, where β(z_N) = 1, as

    p(X) = \sum_{z_N} \alpha(z_N).
Alpha-Beta HMM
Finally, we need the joint posterior distribution of two successive latent variables, ξ(z_{n-1}, z_n), defined by

    \xi(z_{n-1}, z_n) = p(z_{n-1}, z_n \mid X).

This can be calculated directly from the α and β values as

    \xi(z_{n-1}, z_n) = \frac{\alpha(z_{n-1}) \, p(x_n \mid z_n) \, p(z_n \mid z_{n-1}) \, \beta(z_n)}{p(X)}
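A hedged sketch of this computation (my own naming, assuming the forward/backward results and an (N, K) emission-likelihood array):

```python
import numpy as np

def pairwise_posteriors(alpha, beta, A, emission_lik, pX):
    """xi(z_{n-1}, z_n) = alpha(z_{n-1}) p(x_n | z_n) p(z_n | z_{n-1}) beta(z_n) / p(X)."""
    N, K = alpha.shape
    xi = np.zeros((N - 1, K, K))
    for n in range(1, N):
        xi[n - 1] = (alpha[n - 1][:, None] * A
                     * emission_lik[n][None, :] * beta[n][None, :]) / pX
    return xi
```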
How to train a HMM using EM?
1 Make an initial selection for the parameters θ^old, where θ = {π, A, φ}. (Often A and π are initialised uniformly or randomly. The initialisation of φ_k depends on the emission distribution; for Gaussians, run K-means first and get μ_k and Σ_k from there.)
2 (Start of E-step) Run the forward recursion to calculate α(z_n).
3 Run the backward recursion to calculate β(z_n).
4 Calculate γ(z_n) and ξ(z_{n-1}, z_n) from α(z_n) and β(z_n).
5 Evaluate the likelihood p(X).
6 (Start of M-step) Find a θ^new maximising Q(θ, θ^old). This results in new settings for the parameters π_k, A_{jk} and φ_k as described before.
7 Iterate until convergence is detected. (A sketch of this loop in code follows below.)
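The sketch below (my own; it assumes the forward, backward, posteriors, pairwise_posteriors and m_step_pi_A sketches from the previous slides are in scope, uses discrete emission tables rather than Gaussians, and runs a fixed number of iterations instead of a proper convergence check) assembles these steps into one EM loop:

```python
import numpy as np

def hmm_em(x, K, n_iter=50, seed=0):
    """EM training of a discrete-emission HMM on one observation sequence x
    of integer symbols in {0, ..., M-1}, following steps 1-7 above."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    M = x.max() + 1
    pi = np.full(K, 1.0 / K)                           # step 1: initialise parameters
    A = np.full((K, K), 1.0 / K)
    B = rng.dirichlet(np.ones(M), size=K)              # random emission tables, shape (K, M)
    for _ in range(n_iter):
        lik = B[:, x].T                                # (N, K): p(x_n | z_n = k)
        alpha = forward(pi, A, lik)                    # step 2
        beta = backward(A, lik)                        # step 3
        gamma, pX = posteriors(alpha, beta)            # steps 4 and 5
        xi = pairwise_posteriors(alpha, beta, A, lik, pX)
        pi, A = m_step_pi_A(gamma, xi)                 # step 6
        B = np.zeros((K, M))                           # emission M-step for discrete tables:
        np.add.at(B.T, x, gamma)                       # B[k, m] propto sum_{n: x_n = m} gamma(z_nk)
        B /= B.sum(axis=1, keepdims=True)
    return pi, A, B                                    # step 7: here a fixed number of iterations
```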
Alpha-Beta HMM – Notes
In order to calculate the likelihood, we need the joint probability p(X, Z) summed over all possible values of Z.
Every particular choice of Z corresponds to one path through the lattice diagram, and there are exponentially many of them!
Using the alpha-beta algorithm, the exponential cost is reduced to a cost linear in the length of the chain.
How did we do that? By swapping the order of multiplication and summation.
HMM – Viterbi Algorithm
Motivation: The latent states can have some meaningful
interpretation, e.g. phonemes in a speech recognition
system where the observed variables are the acoustic
signals.
Goal: After the system has been trained, find the most
probable states of the latent variables for a given
sequence of observations.
Warning: Finding the set of states which are each
individually the most probable does NOT solve this
problem.
HMM – Viterbi Algorithm
Define

    \omega(z_n) = \max_{z_1, \ldots, z_{n-1}} \ln p(x_1, \ldots, x_n, z_1, \ldots, z_n)

From the joint distribution of the HMM given by

    p(x_1, \ldots, x_N, z_1, \ldots, z_N) = p(z_1) \left[ \prod_{n=2}^{N} p(z_n \mid z_{n-1}) \right] \prod_{n=1}^{N} p(x_n \mid z_n)

the following recursion can be derived:

    \omega(z_n) = \ln p(x_n \mid z_n) + \max_{z_{n-1}} \{ \ln p(z_n \mid z_{n-1}) + \omega(z_{n-1}) \}

    \omega(z_1) = \ln p(z_1) + \ln p(x_1 \mid z_1) = \ln p(x_1, z_1)

(Trellis diagram: states k = 1, 2, 3 over steps n−2, …, n+1; the Viterbi recursion keeps only the best incoming transition for each state at each step.)
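A hedged sketch of this recursion with back-pointers (my own naming; emission_lik is again an (N, K) array of p(x_n | z_n = k)):

```python
import numpy as np

def viterbi(pi, A, emission_lik):
    """Most probable latent path via the log-space omega recursion above."""
    N, K = emission_lik.shape
    omega = np.log(pi) + np.log(emission_lik[0])        # omega(z_1)
    backptr = np.zeros((N, K), dtype=int)
    for n in range(1, N):
        scores = omega[:, None] + np.log(A)             # (K, K): previous state j -> state k
        backptr[n] = scores.argmax(axis=0)              # best predecessor for each state k
        omega = np.log(emission_lik[n]) + scores.max(axis=0)
    path = np.zeros(N, dtype=int)                       # backtrack from the best final state
    path[-1] = omega.argmax()
    for n in range(N - 1, 0, -1):
        path[n - 1] = backptr[n, path[n]]
    return path, omega.max()                            # state sequence and max log-probability
```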
HMM – Viterbi Algorithm
Calculate

    \omega(z_n) = \max_{z_1, \ldots, z_{n-1}} \ln p(x_1, \ldots, x_n, z_1, \ldots, z_n)

for n = 1, \ldots, N.
For each step n, remember which is the best transition into each state at the next step.
At step N: find the state with the highest probability.
Then backtrack through the stored transitions, for n = N − 1, \ldots, 1, to identify which state at step n led to the most probable state at step n + 1.
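Applied to question (b) of the earlier example (a hedged usage sketch, assuming the viterbi function above and the parameter arrays from the first code sketch):

```python
import numpy as np

pi = np.array([0.2, 0.8])                         # rainy, sunny
A = np.array([[0.3, 0.7], [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5], [0.6, 0.3, 0.1]])
obs = [0, 1, 2]                                   # 'cycle', 'shop', 'study'
states = ['rainy', 'sunny']

path, log_prob = viterbi(pi, A, B[:, obs].T)
print([states[k] for k in path], "with probability", np.exp(log_prob))
```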