Statistical Machine Learning


Outline

Overview
Introduction
Linear Algebra

Probability

Linear Regression 1

Linear Regression 2

Linear Classification 1

Linear Classification 2

Kernel Methods
Sparse Kernel Methods

Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis

Autoencoders
Graphical Models 1

Graphical Models 2

Graphical Models 3

Sampling

Sequential Data 1

Sequential Data 2


Statistical Machine Learning

Christian Walder

Machine Learning Research Group
CSIRO Data61

and

College of Engineering and Computer Science
The Australian National University

Canberra
Semester One, 2020.

(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)


Part XXI

Sequential Data 1


Sequential Data

Often data is not i.i.d., e.g.
time series (finance, daily rainfall, speech, . . . ),
DNA sequences,
Natural language text (English, Chinese, . . . ).

Current data may not be independent of previous data.
The distribution of the observed data may change as the
sequence progresses.
Note: we use ‘past’ and ‘future’ to describe an order on the observations, but the concept of a sequence is not restricted to temporal sequences.


Sequential Data Example


Stationary vs. Nonstationary Sequential Distributions

Stationary:
Data evolves in time, but the distribution from which it is drawn stays the same.
Nonstationary:
The generative distribution itself changes with time.

We will focus on the stationary case.


Sequence Modelling Challenges

Goal:
Predict the next value in a sequence.
Assumption:
Not all previous data equally influence the next value.
Technical problem:
How to store an ever-growing history of observations?
Assume that recent observations are more likely to be informative for predicting the next value than less recent observations.
Is this always a good assumption?


Limitations of Modelling a Sequence as i.i.d.

Example:
Binary variable recording whether it rained or not on a day.
Only information from such a model: frequency of rainfall.
Can only use the frequency to predict whether it rains
tomorrow or not. (Maybe not so bad for Canberra 😉)
Usually:
Observing whether it rained today helps to predict the
weather for tomorrow.
Need to relax the i.i.d. assumption to capture this idea.

(Graphical model: the observations x1, x2, x3, x4 drawn as independent nodes with no links between them.)


The Markov Model

One of the simplest ways to relax the i.i.d. assumption.
Use the product rule to exactly express the joint
distribution

p(x_1, \dots, x_N) = p(x_1) \prod_{n=2}^{N} p(x_n \mid x_1, \dots, x_{n-1})

Assume that each of the conditional distributions depends
only on the most recent observation.
First-order Markov chain

p(x_1, \dots, x_N) = p(x_1) \prod_{n=2}^{N} p(x_n \mid x_{n-1})

(Graphical model: first-order Markov chain x1 → x2 → x3 → x4.)
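As an illustrative sketch (not from the lecture: the values of pi and A and the function names below are made-up assumptions), a first-order Markov chain over K = 3 discrete states can be sampled and its factorised joint probability evaluated in a few lines of NumPy:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: K = 3 states; pi and A are made-up values.
pi = np.array([0.5, 0.3, 0.2])              # p(x_1 = k)
A = np.array([[0.8, 0.1, 0.1],              # A[i, j] = p(x_n = j | x_{n-1} = i)
              [0.2, 0.7, 0.1],
              [0.3, 0.3, 0.4]])

def sample_chain(n):
    # Ancestral sampling: draw x_1 from pi, then each x_n from the row A[x_{n-1}].
    x = [rng.choice(3, p=pi)]
    for _ in range(n - 1):
        x.append(rng.choice(3, p=A[x[-1]]))
    return x

def joint_prob(x):
    # p(x_1, ..., x_N) = p(x_1) * prod_{n=2}^{N} p(x_n | x_{n-1})
    p = pi[x[0]]
    for prev, cur in zip(x[:-1], x[1:]):
        p *= A[prev, cur]
    return p

x = sample_chain(10)
print(x, joint_prob(x))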


Markov Model: Forecasting

Given the factorisation of the first-order Markov chain

p(x_1, \dots, x_N) = p(x_1) \prod_{n=2}^{N} p(x_n \mid x_{n-1})

What is the conditional distribution for observation xn given
the previous observations?

p(x_n \mid x_1, \dots, x_{n-1}) = \frac{p(x_1, \dots, x_n)}{p(x_1, \dots, x_{n-1})}
                               = \frac{p(x_1) \prod_{i=2}^{n} p(x_i \mid x_{i-1})}{p(x_1) \prod_{j=2}^{n-1} p(x_j \mid x_{j-1})}
                               = p(x_n \mid x_{n-1})

Also clear from D-separation in the graphical model:

(Graphical model: first-order Markov chain x1 → x2 → x3 → x4.)
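A quick numerical sanity check of this result, as a sketch with the same made-up pi and A as above: enumerate every possible history by brute force and verify that the forecast distribution only depends on the last observation.

import itertools
import numpy as np

# Made-up first-order chain with K = 3 states.
pi = np.array([0.5, 0.3, 0.2])
A = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.7, 0.1],
              [0.3, 0.3, 0.4]])
K, n = 3, 4

def joint(seq):
    # p(x_1, ..., x_m) under the first-order chain
    p = pi[seq[0]]
    for a, b in zip(seq[:-1], seq[1:]):
        p *= A[a, b]
    return p

# Brute-force p(x_n | x_1, ..., x_{n-1}) = p(x_1, ..., x_n) / p(x_1, ..., x_{n-1}) ...
for hist in itertools.product(range(K), repeat=n - 1):
    denom = sum(joint(hist + (x,)) for x in range(K))
    forecast = [joint(hist + (x,)) / denom for x in range(K)]
    # ... which equals p(x_n | x_{n-1}) = A[x_{n-1}, :], whatever the earlier history was.
    assert np.allclose(forecast, A[hist[-1]])
print("forecast depends only on the most recent observation")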


Second-order Markov Chain

Assume that, say, the trend in the previous observations
provides important information in predicting the next value.
For a trend, we need at least two previous observations.

p(x_1, \dots, x_N) = p(x_1)\, p(x_2 \mid x_1) \prod_{n=3}^{N} p(x_n \mid x_{n-1}, x_{n-2})

(Graphical model: second-order Markov chain x1, x2, x3, x4, where each xn depends on both xn−1 and xn−2.)


Mth-order Markov Chain

Extend this idea to an Mth-order Markov chain in which the
probability of each observation depends on the previous M
observations

p(x_1, \dots, x_N) = p(x_1)\, p(x_2 \mid x_1) \cdots p(x_M \mid x_1, \dots, x_{M-1}) \prod_{n=M+1}^{N} p(x_n \mid x_{n-1}, x_{n-2}, \dots, x_{n-M})

What is a reasonable M?


Mth-order Markov Chain

Assuming K states for each variable, how many parameters
does an Mth-order Markov chain have?

M = 0 : no Markov parameters, i.i.d. data.
M = 1 : first-order Markov chain, K − 1 parameters for
each of the K states of the previous observation. Number
of parameters: K(K − 1).
M : Mth-order Markov chain, K − 1 parameters for each of
the K^M joint states of the previous M observations. Number of
parameters: K^M (K − 1).
The number of parameters grows exponentially with the order
of the Markov chain. Impractical for larger M.

(Neglects the special case of the first few observations)
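For example, as a worked instance of the formula above: with K = 10 states a first-order chain needs 10 × 9 = 90 parameters, whereas a third-order chain (M = 3) already needs 10^3 × 9 = 9000.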


Extending the Model

Goal: we want a model that is not restricted to a Markov
assumption of any fixed order, but that can still be specified by
a limited number of free parameters.
Use the idea of latent variables to construct a rich class of
models out of simple components.
(Remember the mixture of Gaussians.)


State Space Model

For each observation xn, a latent variable zn is added.
The type and dimensionality of zn can differ from those of xn.
Assume Markovian latent variables, i.e.

zn+1 ⊥⊥ zn−1 | zn

As a directed graphical model:

(Graphical model: latent chain z1 → z2 → ... → zn−1 → zn → zn+1 → ..., with each zn emitting the corresponding observation xn.)


State Space Model – Basic Properties

Joint distribution:

p(x_1, \dots, x_N, z_1, \dots, z_N) = p(z_1) \left[ \prod_{n=2}^{N} p(z_n \mid z_{n-1}) \right] \prod_{n=1}^{N} p(x_n \mid z_n)

D-separation:
No pair of observations xi and xj is D-separated.
The forecast p(x_{n+1} \mid x_1, \dots, x_n) therefore depends on the entire history.
The xn are not Markovian of any fixed order.

(Graphical model: latent chain z1 → z2 → ... → zn−1 → zn → zn+1 → ..., with each zn emitting xn.)


State Space Model – Two Important Varieties

Two important models described by this graphical model.
Hidden Markov Model or HMM:

Latent zn are discrete.
Observed xn may be discrete or continuous.

Linear Dynamical System:
Latent and observed variables are continuous.
In particular, latent and observed variables are Gaussian
with a linear-Gaussian dependence of the conditional
distribution on their parents.

(Graphical model: the same latent chain z1 → z2 → ... with each zn emitting xn, as on the previous slides.)


Hidden Markov Model

State space model with discrete latent variables.

p(x_1, \dots, x_N, z_1, \dots, z_N) = p(z_1) \left[ \prod_{n=2}^{N} p(z_n \mid z_{n-1}) \right] \prod_{n=1}^{N} p(x_n \mid z_n)
                                    = p(z_1)\, p(x_1 \mid z_1) \prod_{n=2}^{N} p(z_n \mid z_{n-1})\, p(x_n \mid z_n)

Each step can be viewed as an extension of a mixture
distribution model in which each of the component densities
is p(x | z); the choice of mixture component now depends on
the previous state and is represented by p(zn | zn−1).
The latent variables are discrete multinomial variables zn
describing which component of the mixture is responsible
for generating the corresponding observation xn.


Hidden Markov Model: Latent Dynamics

Assume 1-in-K coding scheme.
Each latent variable zn has K different states.
Initial latent node z1 has no parent, therefore the marginal
distribution is given by a vector of probabilities π with
πk = p(z1k = 1) so that

p(z_1 \mid \pi) = \prod_{k=1}^{K} \pi_k^{z_{1k}}, \qquad \sum_{k=1}^{K} \pi_k = 1

The conditional distribution p(zn | zn−1) is a table (matrix)
with K × K entries, denoted by A ∈ [0, 1]^{K×K}.
The elements of A are called transition probabilities,

A_{jk} = p(z_{n,k} = 1 \mid z_{n-1,j} = 1),

satisfying 0 ≤ A_{jk} ≤ 1 and \sum_{k=1}^{K} A_{jk} = 1.
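As a sketch of these latent dynamics alone (the numerical values of pi and A and the function name below are made-up assumptions), the hidden chain can be sampled ancestrally once the initial distribution and the row-stochastic transition matrix are fixed:

import numpy as np

rng = np.random.default_rng(1)

K = 3
pi = np.array([0.6, 0.3, 0.1])                 # pi_k = p(z_1k = 1), sums to 1
A = np.array([[0.90, 0.05, 0.05],              # A[j, k] = p(z_{n,k} = 1 | z_{n-1,j} = 1)
              [0.05, 0.90, 0.05],
              [0.05, 0.05, 0.90]])
assert np.isclose(pi.sum(), 1.0) and np.allclose(A.sum(axis=1), 1.0)   # row-stochastic

def sample_latent_path(N):
    # Draw z_1 from pi, then each z_n from the row of A selected by z_{n-1} (states as indices).
    z = [rng.choice(K, p=pi)]
    for _ in range(N - 1):
        z.append(rng.choice(K, p=A[z[-1]]))
    return np.array(z)

print(sample_latent_path(20))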


Hidden Markov Model – Transition Diagram

Given the state zn−1 at step n − 1, what is the distribution of
the next state zn?

p(z_n \mid z_{n-1}, A) = \prod_{k=1}^{K} \prod_{j=1}^{K} A_{jk}^{\, z_{n-1,j}\, z_{nk}}

(Transition diagram for a model with three states k = 1, 2, 3; the black lines denote the elements Ajk of the transition matrix, including the self-transitions A11, A22, A33.)
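A small sanity check of the expression above (the matrix A and the choice of j and k are made-up): with 1-of-K coded zn−1 and zn, every factor in the double product is 1 except the single entry Ajk.

import numpy as np

A = np.array([[0.90, 0.05, 0.05],   # made-up 3x3 transition matrix
              [0.05, 0.90, 0.05],
              [0.05, 0.05, 0.90]])
j, k = 1, 2
z_prev = np.eye(3)[j]               # one-hot z_{n-1} with state j active
z_next = np.eye(3)[k]               # one-hot z_n with state k active

# prod_k prod_j A_{jk}^{z_{n-1,j} z_{n,k}}: the exponent is 1 only at position (j, k).
p = np.prod(A ** np.outer(z_prev, z_next))
assert np.isclose(p, A[j, k])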


HMM – Unfolded Transition Diagram

Unfold the transition diagram over the steps to get a
lattice, or trellis. (Note, the transitions A in each step can
be different.)

(Unfolded transition diagram, a lattice or trellis, for a model with 3 possible states over steps n − 2, n − 1, n, n + 1; each column corresponds to one latent variable zn, and horizontal arrows such as A11 and A33 denote transitions between consecutive steps.)


Hidden Markov Model – Emission Probabilities

Complete the HMM by defining the emission probabilities
p(xn | zn, φ) where φ is a set of parameters governing the
conditional distributions.
As xn is observed, and φ is given, p(xn | zn, φ) is a
K-dimensional vector corresponding to the K possible
states of the binary vector zn.
Emission probabilities can be represented as

p(x_n \mid z_n, \phi) = \prod_{k=1}^{K} p(x_n \mid \phi_k)^{z_{nk}}


Hidden Markov Model – Homogeneous

Remember: Transition probability in a Markov chain
T(zn−1, zn) = p(zn | zn−1).
A Hidden Markov Model is called homogeneous if the
transition probabilities are the same for all steps n

p(zn | zn−1) = p(zn−1 | zn−2) ∀ n = 3, . . . ,N

Assume a homogeneous HMM in the following.


Homogeneous Hidden Markov Model

Joint probability distribution over both latent and observed
variables

p(X, Z \mid \theta) = p(z_1 \mid \pi) \left[ \prod_{n=2}^{N} p(z_n \mid z_{n-1}, A) \right] \prod_{m=1}^{N} p(x_m \mid z_m, \phi)

where X = (x1, . . . , xN), Z = (z1, . . . , zN), and
θ = {π,A, φ}.
Most of the discussion will be independent of the particular
choice of emission probabilities (e.g. discrete tables,
Gaussians, mixture of Gaussians).
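As a sketch (the discrete emission table phi and all numerical values below are made-up assumptions, not from the slides), the joint probability above can be evaluated directly, in log space, for given observation and state sequences:

import numpy as np

# Made-up homogeneous HMM: K = 3 latent states, discrete observations with 2 symbols.
pi = np.array([0.6, 0.3, 0.1])                       # p(z_1 | pi)
A = np.array([[0.90, 0.05, 0.05],                    # p(z_n | z_{n-1}, A)
              [0.05, 0.90, 0.05],
              [0.05, 0.05, 0.90]])
phi = np.array([[0.9, 0.1],                          # phi[k, v] = p(x_n = v | z_n = k)
                [0.5, 0.5],
                [0.1, 0.9]])

def log_joint(x, z):
    # log p(X, Z | theta) = log p(z_1) + sum_n log p(z_n | z_{n-1}) + sum_n log p(x_n | z_n)
    lp = np.log(pi[z[0]]) + np.log(phi[z[0], x[0]])
    for t in range(1, len(x)):
        lp += np.log(A[z[t - 1], z[t]]) + np.log(phi[z[t], x[t]])
    return lp

print(log_joint(x=[0, 0, 1, 1, 1], z=[0, 0, 1, 2, 2]))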


Homogeneous HMM – Generative View

Sampling from a hidden Markov model having a 3-state
latent variable z and a Gaussian emission model p(x | z)
where x ∈ R^2.

(Figure, left: contours of constant probability for the emission distributions of the three states k = 1, 2, 3 over the unit square. Right: a sample path generated with a 5% probability of changing to each of the other states.)
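A sketch of the generative procedure illustrated above, assuming three latent states, two-dimensional Gaussian emissions with made-up means and covariance, and a 5% probability of switching to each of the other states (the function name and all numbers are assumptions):

import numpy as np

rng = np.random.default_rng(2)

K = 3
pi = np.full(K, 1.0 / K)                               # uniform initial state (assumption)
A = np.full((K, K), 0.05) + 0.85 * np.eye(K)           # 5% chance of moving to each other state
means = np.array([[0.2, 0.2], [0.5, 0.8], [0.8, 0.3]]) # made-up emission means in R^2
cov = 0.01 * np.eye(2)                                 # made-up shared emission covariance

def sample_hmm(N):
    # Ancestral sampling: draw z_n from the latent chain, then x_n ~ N(means[z_n], cov).
    z = rng.choice(K, p=pi)
    Z, X = [], []
    for _ in range(N):
        Z.append(z)
        X.append(rng.multivariate_normal(means[z], cov))
        z = rng.choice(K, p=A[z])
    return np.array(Z), np.array(X)

Z, X = sample_hmm(50)
print(Z[:10], X[:3])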


Hidden Markov Model – Variants

Create variants of HMM by imposing constraints on the
transition matrix A.
Left-to-right HMM: set all elements of A below the diagonal to zero, Ajk = 0 for k < j.
Set the initial state probabilities to p(z11) = 1 and p(z1j) = 0 for j ≠ 1.
Then every sequence is constrained to start in state k = 1, and a state, once left, cannot be revisited.

(Transition diagram for a left-to-right model with three states k = 1, 2, 3: self-transitions A11, A22, A33 and forward transitions A12, A23, A13.)

Hidden Markov Model – Variants

Further restrict the transition matrix A to ensure that no large changes of state occur in one step:

A_{jk} = 0 \quad \text{for } k > j + \Delta

where ∆ is the maximal change of state in one step.

(Unfolded lattice diagram for a left-to-right HMM with three states k = 1, 2, 3 over steps n − 2, n − 1, n, n + 1; only self-transitions such as A11, A33 and moves of at most ∆ states forward are allowed. Example of an HMM with ∆ = 1.)
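A sketch of how such a constrained transition matrix could be constructed (the function name and the random probabilities are made-up assumptions): zero the entries below the diagonal and those more than ∆ above it, then renormalise each row.

import numpy as np

def left_to_right_A(K, delta, seed=3):
    # Random row-stochastic transition matrix with A[j, k] = 0 for k < j (left-to-right)
    # and A[j, k] = 0 for k > j + delta (no jump of more than delta states).
    rng = np.random.default_rng(seed)
    A = rng.random((K, K))
    for j in range(K):
        for k in range(K):
            if k < j or k > j + delta:
                A[j, k] = 0.0
    return A / A.sum(axis=1, keepdims=True)            # each row must sum to 1

print(np.round(left_to_right_A(K=4, delta=1), 2))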


Hidden Markov Model – Handwritten Digits

Digitize the pen trajectory online.
The xi represent fixed-width line segments at 16 different angles.
Coincidentally, there are also K = 16 latent states.
⇒ 16 × 16 transition and emission parameter matrices.
Trained with 45 examples of the digit ’2’.
Left-to-right transition probabilities with ∆ = 1.
(Figure: upper row shows training samples; lower row shows samples generated by the model.)
