Outlines
Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2
Statistical Machine Learning
Christian Walder
Machine Learning Research Group
CSIRO Data61
and
College of Engineering and Computer Science
The Australian National University
Canberra
Semester One, 2020.
(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)
Part XXI
Sequential Data 1
Sequential Data
Often data is not i.i.d., e.g.
time series (finance, daily rainfall, speech, . . . ),
DNA sequences,
natural language text (English, Chinese, . . . ).
Current data may not be independent of previous data.
The distribution of the observed data may change as the
sequence progresses.
Note: we use ‘past’ and ‘future’ to describe an ordering of the observations, but the concept of a sequence is not restricted to temporal sequences.
Sequential Data Example
Stationary vs. Nonstationary Sequential
Distributions
Stationary:
Data evolves in time, but the distribution from which it is
drawn stays the same.
Nonstationary:
The generative distribution itself changes over time.
We will focus on the stationary case.
Sequence Modelling Challenges
Goal:
Predict the next value in a sequence.
Assumption:
Not all previous data equally influence the next value.
Technical problem:
How can we store an ever-growing history of observations?
Assume that recent observations are more likely to be
informative for the prediction of the next value than less
recent observations.
Is this always a good assumption?
Limitations of Modelling a Sequence as i.i.d.
Example:
Binary variable recording whether it rained or not on a day.
Only information from such a model: frequency of rainfall.
We can only use this frequency to predict whether it rains
tomorrow or not. (Maybe not so bad for Canberra 😉)
Usually:
Observing whether it rained today helps to predict the
weather for tomorrow.
We need to relax the i.i.d. assumption to capture this idea.
(Figure: graphical model of i.i.d. observations x1, x2, x3, x4 with no links between them.)
The Markov Model
One of the simplest ways to relax the i.i.d. assumption.
Use the product rule to exactly express the joint
distribution
p(x_1, \dots, x_N) = p(x_1) \prod_{n=2}^{N} p(x_n \mid x_1, \dots, x_{n-1})
Assume that each of the conditional distributions depends
only on the most recent observation.
First-order Markov chain
p(x_1, \dots, x_N) = p(x_1) \prod_{n=2}^{N} p(x_n \mid x_{n-1})
(Figure: first-order Markov chain x1 → x2 → x3 → x4.)
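As an aside (not from the original slides), a first-order Markov chain over K discrete states can be fitted by maximum likelihood simply by counting transitions. The sketch below assumes numpy; the function name and example sequence are purely illustrative.

```python
import numpy as np

def fit_first_order_markov(seq, K):
    """ML estimate of the transition matrix from one observed
    sequence of states in {0, ..., K-1}: count and normalise."""
    counts = np.zeros((K, K))
    for prev, nxt in zip(seq[:-1], seq[1:]):
        counts[prev, nxt] += 1            # count transitions prev -> nxt
    rows = np.maximum(counts.sum(axis=1, keepdims=True), 1)  # guard unseen states
    return counts / rows                  # row j is p(x_n | x_{n-1} = j)

seq = [0, 1, 1, 2, 1, 0, 1, 2, 2, 1]
A = fit_first_order_markov(seq, K=3)
print(A)                                  # each visited row sums to 1
```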
Markov Model: Forecasting
Given the factorisation of the first-order Markov chain
p(x_1, \dots, x_N) = p(x_1) \prod_{n=2}^{N} p(x_n \mid x_{n-1})
What is the conditional distribution for observation xn given
the previous observations?
p(x_n \mid x_1, \dots, x_{n-1}) = \frac{p(x_1, \dots, x_n)}{p(x_1, \dots, x_{n-1})}
                                = \frac{p(x_1) \prod_{i=2}^{n} p(x_i \mid x_{i-1})}{p(x_1) \prod_{j=2}^{n-1} p(x_j \mid x_{j-1})}
                                = p(x_n \mid x_{n-1})
This also follows from D-separation in the graphical model:
(Figure: first-order Markov chain x1 → x2 → x3 → x4.)
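To make this concrete, here is a small numerical check (a hypothetical two-state chain, not from the slides) that the conditional computed from the full joint really equals p(xn | xn−1):

```python
import numpy as np

p1 = np.array([0.6, 0.4])             # p(x_1) for a hypothetical K = 2 chain
A = np.array([[0.9, 0.1],
              [0.3, 0.7]])            # A[j, k] = p(x_n = k | x_{n-1} = j)

def joint(xs):
    """p(x_1, ..., x_n) under the first-order factorisation."""
    p = p1[xs[0]]
    for prev, nxt in zip(xs[:-1], xs[1:]):
        p *= A[prev, nxt]
    return p

history = [0, 1, 1]
cond = np.array([joint(history + [k]) for k in range(2)])
cond /= cond.sum()                    # p(x_4 | x_1, x_2, x_3) via the joint
print(cond, A[history[-1]])           # identical: only the last observation matters
```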
Second-order Markov Chain
Assume that, say, the trend in the previous observations
provides important information in predicting the next value.
For a trend, we need at least two previous observations.
p(x_1, \dots, x_N) = p(x_1)\, p(x_2 \mid x_1) \prod_{n=3}^{N} p(x_n \mid x_{n-1}, x_{n-2})
(Figure: second-order Markov chain over x1, x2, x3, x4; each xn is linked to both xn−1 and xn−2.)
Mth-order Markov Chain
Extend this idea to an Mth-order Markov chain in which the
probability of each observation depends on the previous M
observations
p(x_1, \dots, x_N) = p(x_1)\, p(x_2 \mid x_1) \cdots p(x_M \mid x_1, \dots, x_{M-1}) \prod_{n=M+1}^{N} p(x_n \mid x_{n-1}, x_{n-2}, \dots, x_{n-M})
What is a reasonable M?
Mth-order Markov Chain
Assuming K states for each variable, how many parameters
does an Mth-order Markov chain have?
M = 0: no Markov dependence; i.i.d. data.
M = 1: first-order Markov chain, K − 1 parameters for
each of the K states of the previous observation. Number
of parameters: K(K − 1).
General M: Mth-order Markov chain, K − 1 parameters for each of
the K^M joint states of the previous M observations. Number of
parameters: K^M(K − 1).
Number of parameters grows exponentially with the order
of the Markov chain. Impractical for larger M.
(Neglects the special case of the first few observations)
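A quick numerical illustration (not from the slides) of this exponential growth in the number of parameters:

```python
# Free parameters of an Mth-order Markov chain over K states: K^M * (K - 1),
# ignoring the special treatment of the first few observations.
K = 10
for M in range(6):
    print(M, K**M * (K - 1))
# prints 9, 90, 900, 9000, 90000, 900000 -- exponential in the order M
```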
Extending the Model
Goal: we want a model which is not restricted to the
Markov assumption of any order, but which can be specified by
a limited number of free parameters.
Use the idea of latent variables to construct a rich class of
models out of simple components.
(Remember mixture of Gaussians.)
State Space Model
For each observation xn, a latent variable zn is added.
The type and dimensionality of zn may differ from those of xn.
Assume Markovian latent variables, i.e.
zn+1 ⊥⊥ zn−1 | zn
As a directed graphical model:
(Figure: state space model as a directed graph; a first-order Markov chain of latent variables z1 → z2 → · · · , where each zn emits the observation xn.)
State Space Model – Basic Properties
Joint distribution:
p(x_1, \dots, x_N, z_1, \dots, z_N) = p(z_1) \left[ \prod_{n=2}^{N} p(z_n \mid z_{n-1}) \right] \prod_{n=1}^{N} p(x_n \mid z_n)
D-separation: no pair of observations xi and xj is D-separated
(the path between them through the latent chain is never blocked).
The forecast p(xn+1 | x1, . . . , xn) therefore depends on the entire history.
The xn are not (fixed order) Markovian.
State Space Model – Two Important Varieties
Two important models described by this graphical model.
Hidden Markov Model or HMM:
Latent zn are discrete.
Observed xn may be discrete or continuous.
Linear Dynamical System:
Latent and observed variables are continuous.
In particular, latent and observed variables are Gaussian
with a linear-Gaussian dependence of the conditional
distribution on their parents.
Hidden Markov Model
State space model with discrete latent variables.
p(x_1, \dots, x_N, z_1, \dots, z_N) = p(z_1) \left[ \prod_{n=2}^{N} p(z_n \mid z_{n-1}) \right] \prod_{n=1}^{N} p(x_n \mid z_n)
                                    = p(z_1)\, p(x_1 \mid z_1) \prod_{n=2}^{N} p(z_n \mid z_{n-1})\, p(x_n \mid z_n)
Each step can be viewed as an extension of the mixture
distribution model where each of the component densities
is p(x | z). The choice of mixture component now depends on
the previous state and is represented by p(zn | zn−1).
Latent variables are discrete multinomial variables zn
describing which component of the mixture is responsible
for generating the corresponding observable xn.
Hidden Markov Model: Latent Dynamics
Assume a 1-of-K coding scheme.
Each latent variable zn has K different states.
Initial latent node z1 has no parent, therefore the marginal
distribution is given by a vector of probabilities π with
πk = p(z1k = 1) so that
p(z_1 \mid \pi) = \prod_{k=1}^{K} \pi_k^{z_{1k}}, \qquad \sum_{k=1}^{K} \pi_k = 1
The conditional distribution p(zn | zn−1) is a table (matrix)
with K × K entries, denoted by A ∈ [0, 1]^{K×K}.
The elements of A are called transition probabilities,
A_{jk} = p(z_{nk} = 1 \mid z_{n-1,j} = 1),
satisfying 0 ≤ A_{jk} ≤ 1 and ∑_{k=1}^{K} A_{jk} = 1.
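A minimal sketch (assuming numpy; the values of π and A are hypothetical) of drawing a latent state sequence z1, . . . , zN from the prior defined by π and A:

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 3, 10
pi = np.array([1.0, 0.0, 0.0])              # p(z_1): always start in state 0
A = np.array([[0.90, 0.05, 0.05],           # A[j, k] = p(z_n = k | z_{n-1} = j)
              [0.05, 0.90, 0.05],
              [0.05, 0.05, 0.90]])

z = [rng.choice(K, p=pi)]                   # sample z_1 from pi
for n in range(1, N):
    z.append(rng.choice(K, p=A[z[-1]]))     # sample z_n from row z_{n-1} of A
print(z)
```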
Hidden Markov Model – Transition Diagram
Given the state zn−1 at step n − 1, what is the distribution of
the next state zn?
p(z_n \mid z_{n-1}, A) = \prod_{k=1}^{K} \prod_{j=1}^{K} A_{jk}^{z_{n-1,j}\, z_{nk}}
Transition Diagram for a model with three possible states.
Black lines denote the elements of the transition matrix Ajk.
HMM – Unfolded Transition Diagram
Unfold the transition diagram over the steps to get a
lattice, or trellis. (Note: the transition matrices A can in
general differ from step to step.)
Unfolded Transition Diagram for a model with 3 possible states.
Each column corresponds to one latent variable zn.
Hidden Markov Model – Emission Probabilities
Complete the HMM by defining the emission probabilities
p(xn | zn, φ) where φ is a set of parameters governing the
conditional distributions.
As xn is observed, and φ is given, p(xn | zn, φ) is a
K-dimensional vector corresponding to the K possible
states of the binary vector zn.
Emission probabilities can be represented as
p(x_n \mid z_n, \phi) = \prod_{k=1}^{K} p(x_n \mid \phi_k)^{z_{nk}}
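Because zn is a 1-of-K binary vector, the product over k simply selects the density of the active component. A small sketch under the (hypothetical) assumption of one-dimensional Gaussian emissions:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

phi = [(-2.0, 0.5), (0.0, 0.5), (2.0, 0.5)]    # hypothetical (mean, std) per state

def emission_prob(x, z_onehot):
    """p(x_n | z_n, phi) = prod_k p(x_n | phi_k)^{z_nk}."""
    p = 1.0
    for k, (mu, sigma) in enumerate(phi):
        p *= gauss_pdf(x, mu, sigma) ** z_onehot[k]
    return p

z = [0, 1, 0]                                  # 1-of-K coding: component 2 is active
print(emission_prob(0.3, z), gauss_pdf(0.3, 0.0, 0.5))   # identical values
```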
Hidden Markov Model – Homogeneous
Remember: Transition probability in a Markov chain
T(zn−1, zn) = p(zn | zn−1).
A Hidden Markov Model is called homogeneous if the
transition probabilities are the same for all steps n
p(z_n \mid z_{n-1}) = p(z_{n-1} \mid z_{n-2}) \qquad \forall\, n = 3, \dots, N
Assume a homogeneous HMM in the following.
Homogeneous Hidden Markov Model
Joint probability distribution over both latent and observed
variables
p(X, Z \mid \theta) = p(z_1 \mid \pi) \left[ \prod_{n=2}^{N} p(z_n \mid z_{n-1}, A) \right] \prod_{m=1}^{N} p(x_m \mid z_m, \phi)
where X = (x1, . . . , xN), Z = (z1, . . . , zN), and
θ = {π,A, φ}.
Most of the discussion will be independent of the particular
choice of emission probabilities (e.g. discrete tables,
Gaussians, mixture of Gaussians).
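A sketch (discrete table emissions, with hypothetical parameter values) of evaluating the complete-data log probability log p(X, Z | θ) directly from this factorisation:

```python
import numpy as np

pi = np.array([0.5, 0.5])                   # hypothetical theta = {pi, A, phi}, K = 2
A = np.array([[0.8, 0.2],
              [0.4, 0.6]])
phi = np.array([[0.7, 0.2, 0.1],            # phi[k, s] = p(x = s | z = k), 3 symbols
                [0.1, 0.3, 0.6]])

def log_joint(X, Z):
    """log p(X, Z | theta) = log p(z_1 | pi)
       + sum_n log p(z_n | z_{n-1}, A) + sum_n log p(x_n | z_n, phi)."""
    lp = np.log(pi[Z[0]]) + np.log(phi[Z[0], X[0]])
    for n in range(1, len(X)):
        lp += np.log(A[Z[n - 1], Z[n]]) + np.log(phi[Z[n], X[n]])
    return lp

print(log_joint(X=[0, 2, 1, 0], Z=[0, 1, 1, 0]))
```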
Homogeneous HMM – Generative View
Sampling from a hidden Markov model having a 3-state
latent variable z and a Gaussian emission model p(x | z)
where x ∈ ℝ².
(Figure, left: contours of constant probability for the emission distributions of the three states k = 1, 2, 3. Right: samples generated with a 5% probability of changing to each of the other states at every step.)
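In the same spirit, a minimal ancestral-sampling sketch (assuming numpy; the means, covariance and switching probability below are stand-ins for the figure's settings):

```python
import numpy as np

rng = np.random.default_rng(1)
K, N = 3, 50
means = np.array([[0.2, 0.2], [0.5, 0.8], [0.8, 0.3]])   # one 2-D emission mean per state
cov = 0.01 * np.eye(2)                                   # shared isotropic covariance
A = np.full((K, K), 0.05)                                # 5% chance of each other state
np.fill_diagonal(A, 0.90)                                # 90% chance of staying put
pi = np.ones(K) / K

z = rng.choice(K, p=pi)                                  # z_1 ~ p(z_1)
X = np.empty((N, 2))
for n in range(N):
    X[n] = rng.multivariate_normal(means[z], cov)        # x_n ~ p(x | z_n)
    z = rng.choice(K, p=A[z])                            # z_{n+1} ~ p(z_{n+1} | z_n)
print(X.shape)   # (50, 2): points cluster around the three emission means
```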
Hidden Markov Model – Variants
Create variants of HMM by imposing constraints on the
transition matrix A.
Left-to-right HMM: set all elements of A below the
diagonal to zero, Ajk = 0 for k < j.
Set the initial state probabilities to p(z11) = 1 and p(z1j) = 0 for
j ≠ 1.
Then every sequence is constrained to start in state k = 1,
and once a state has been left it can never be revisited.
(Figure: left-to-right transition diagram over states k = 1, 2, 3, with self-transitions A11, A22, A33 and forward transitions A12, A23 and A13.)
Hidden Markov Model - Variants
Further restrict the transition matrix A to ensure that no
large changes of state can occur: Ajk = 0 for k > j + ∆,
where ∆ is the maximal change of state in one step.
(Figure: unfolded lattice over states k = 1, 2, 3 for an example HMM with ∆ = 1.)
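A small helper (a sketch, not from the slides) that builds a random transition matrix respecting both constraints, Ajk = 0 for k < j and for k > j + ∆:

```python
import numpy as np

def banded_left_to_right(K, delta, rng=np.random.default_rng(2)):
    """Random left-to-right transition matrix: zero below the diagonal
    (no going back) and zero beyond the band j + delta (no large jumps)."""
    A = rng.random((K, K))
    for j in range(K):
        for k in range(K):
            if k < j or k > j + delta:
                A[j, k] = 0.0
        A[j] /= A[j].sum()               # each row is a distribution over next states
    return A

print(np.round(banded_left_to_right(K=5, delta=1), 2))
```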
Hidden Markov Model – Handwritten Digits
The trajectory of each digit is digitised online.
The xi represent fixed-width line segments at one of 16 different
angles.
Coincidentally also K = 16 states.
⇒ 16 × 16 transition and emission parameter matrices.
Trained with 45 examples of the digit ’2’.
Left-to-right transition probabilities with ∆ = 1.
Upper row: training samples; lower row: samples generated from the model.