Outlines
Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2
Statistical Machine Learning
Christian Walder
Machine Learning Research Group
CSIRO Data61
and
College of Engineering and Computer Science
The Australian National University
Canberra
Semester One, 2020.
(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)
©2020 Ong & Walder & Webers, Data61 | CSIRO, The Australian National University
Part XI
Mixture Models and EM 1
Marginalisation
Sum rule: $p(A, B) = \sum_{C} p(A, B, C)$
Product rule: $p(A, B) = p(A \,|\, B)\, p(B)$
Why do we optimize the log likelihood?
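One standard answer, illustrated with a small sketch (not from the slides): for many i.i.d. observations the likelihood is a product of small numbers and underflows in floating point, whereas the log likelihood turns the product into a sum that stays representable, and it is maximised at the same parameters because the logarithm is monotone.

```python
# Illustration: a product of many small probabilities underflows; the sum of logs does not.
import numpy as np

p = np.full(1000, 0.01)      # 1000 i.i.d. observations, each with probability 0.01
print(np.prod(p))            # 0.0 -- underflows in double precision
print(np.sum(np.log(p)))     # about -4605.17 -- the log likelihood is well behaved
```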
Strategy in this course
Estimate best predictor = training = learning
Given data $(x_1, y_1), \ldots, (x_n, y_n)$, find a predictor $f_w(\cdot)$.
1 Identify the type of input $x$ and output $y$ data
2 Propose a (linear) mathematical model for $f_w$
3 Design an objective function or likelihood
4 Calculate the optimal parameter $w$
5 Model uncertainty using the Bayesian approach
6 Implement and compute (the algorithm in Python)
7 Interpret and diagnose results
We will study unsupervised learning this week
Mixture Models and EM
Complex marginal distributions over observed variables
can be expressed via more tractable joint distributions over
the expanded space of observed and latent variables.
Mixture Models can also be used to cluster data.
General technique for finding maximum likelihood
estimators in latent variable models:
expectation-maximisation (EM) algorithm.
Example – Wallaby Distribution
Introduced very recently to show . . .
(Figure: the 'Wallaby' density $p(x)$ plotted for $x \in [-20, 20]$; vertical axis from 0 to 0.30.)
Example – ’Wallaby’ Distribution
. . . that even a mixture of just three Gaussians can be fun.
$$p(x) = \tfrac{3}{10}\,\mathcal{N}(x \,|\, 5, 0.5) + \tfrac{3}{10}\,\mathcal{N}(x \,|\, 9, 2) + \tfrac{4}{10}\,\mathcal{N}(x \,|\, 2, 20)$$
(Figure: the 'Wallaby' density $p(x)$ for $x \in [-20, 20]$; vertical axis from 0 to 0.30.)
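A minimal NumPy/SciPy sketch (not from the slides) that evaluates this mixture density; it assumes the second argument of $\mathcal{N}(x \,|\, \cdot, \cdot)$ denotes the variance, so the standard deviation passed to SciPy is its square root.

```python
# Sketch: evaluate the 'Wallaby' mixture density on a grid.
# Assumption: the second parameter of each component is the variance.
import numpy as np
from scipy.stats import norm

weights   = np.array([0.3, 0.3, 0.4])
means     = np.array([5.0, 9.0, 2.0])
variances = np.array([0.5, 2.0, 20.0])

def wallaby_pdf(x):
    x = np.atleast_1d(x)[:, None]                              # shape (N, 1)
    comps = norm.pdf(x, loc=means, scale=np.sqrt(variances))   # shape (N, 3)
    return comps @ weights                                     # mixture density, shape (N,)

xs = np.linspace(-20, 20, 401)
px = wallaby_pdf(xs)
print(np.trapz(px, xs))   # should be close to 1
```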
Example – ’Wallaby’ Distribution
Use µ, σ as latent variables and define a distribution
$$p(\mu, \sigma) = \begin{cases} \tfrac{3}{10} & \text{if } (\mu, \sigma) = (5, 0.5) \\ \tfrac{3}{10} & \text{if } (\mu, \sigma) = (9, 2) \\ \tfrac{4}{10} & \text{if } (\mu, \sigma) = (2, 20) \\ 0 & \text{otherwise.} \end{cases}$$
$$\begin{aligned} p(x) &= \int_{-\infty}^{\infty} \int_{0}^{\infty} p(x, \mu, \sigma)\, d\sigma\, d\mu \\ &= \int_{-\infty}^{\infty} \int_{0}^{\infty} p(x \,|\, \mu, \sigma)\, p(\mu, \sigma)\, d\sigma\, d\mu \\ &= \tfrac{3}{10}\,\mathcal{N}(x \,|\, 5, 0.5) + \tfrac{3}{10}\,\mathcal{N}(x \,|\, 9, 2) + \tfrac{4}{10}\,\mathcal{N}(x \,|\, 2, 20) \end{aligned}$$
(Figure: the 'Wallaby' density $p(x)$ for $x \in [-20, 20]$; vertical axis from 0 to 0.30.)
K-means Clustering
Given a set of data $\{x_1, \ldots, x_N\}$ where $x_n \in \mathbb{R}^D$, $n = 1, \ldots, N$.
Goal: Partition the data into $K$ clusters.
Each cluster contains points close to each other.
Introduce a prototype $\mu_k \in \mathbb{R}^D$ for each cluster.
Goal: Find
1 a set of prototypes $\mu_k$, $k = 1, \ldots, K$, each representing a different cluster, and
2 an assignment of each data point to exactly one cluster.
K-means Clustering – The Algorithm
Start with arbitrarily chosen prototypes $\mu_k$, $k = 1, \ldots, K$.
1 Assign each data point to the closest prototype.
2 Calculate new prototypes as the mean of all data points assigned to each of them.
In the following, we formalise this by introducing notation that will be useful later (a minimal sketch of the procedure follows below).
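A minimal NumPy sketch of this procedure (Lloyd's iteration); the random initialisation from $K$ data points and the stopping test are illustrative choices, not prescribed by the slides.

```python
# Minimal K-means sketch, assuming Euclidean distance and initialisation
# by picking K distinct data points as the starting prototypes.
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]              # initial prototypes (K, D)
    for _ in range(max_iter):
        # Assignment step: index of the closest prototype for each point
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # squared distances (N, K)
        assign = d2.argmin(axis=1)
        # Update step: each prototype becomes the mean of its assigned points
        new_mu = np.array([X[assign == k].mean(axis=0) if np.any(assign == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):   # stop once the prototypes no longer move
            break
        mu = new_mu
    return mu, assign
```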
K-means Clustering – Notation
Binary indicator variables
$$r_{nk} = \begin{cases} 1, & \text{if data point } x_n \text{ belongs to cluster } k \\ 0, & \text{otherwise} \end{cases}$$
using the 1-of-K coding scheme.
Define a distortion measure
$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \|x_n - \mu_k\|^2.$$
Find the values for $\{r_{nk}\}$ and $\{\mu_k\}$ so as to minimise $J$ (a small snippet evaluating $J$ follows below).
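As a small companion sketch (assuming the X, mu and assign produced by the kmeans sketch above), the distortion $J$ can be evaluated directly from the hard assignments:

```python
# Sketch: distortion J for data X (N, D), prototypes mu (K, D),
# and hard assignments assign (N,), e.g. as returned by kmeans() above.
import numpy as np

def distortion(X, mu, assign):
    return float(((X - mu[assign]) ** 2).sum())
```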
K-means Clustering – Notation
Find the values for $\{r_{nk}\}$ and $\{\mu_k\}$ so as to minimise $J$.
But $\{r_{nk}\}$ depends on $\{\mu_k\}$, and $\{\mu_k\}$ depends on $\{r_{nk}\}$.
Iterate until no further change:
1 Minimise $J$ w.r.t. $r_{nk}$ while keeping $\{\mu_k\}$ fixed,
$$r_{nk} = \begin{cases} 1, & \text{if } k = \arg\min_j \|x_n - \mu_j\|^2 \\ 0, & \text{otherwise} \end{cases} \qquad \forall\, n = 1, \ldots, N$$
(Expectation step)
2 Minimise $J$ w.r.t. $\{\mu_k\}$ while keeping $r_{nk}$ fixed,
$$0 = 2 \sum_{n=1}^{N} r_{nk}(x_n - \mu_k) \quad \Longrightarrow \quad \mu_k = \frac{\sum_{n=1}^{N} r_{nk}\, x_n}{\sum_{n=1}^{N} r_{nk}}$$
(Maximisation step)
K-means Clustering – Example
(Figure: K-means on a two-dimensional data set; panels (a)–(f) show the initial prototypes followed by alternating assignment and update steps; both axes range from −2 to 2.)
K-means Clustering – Example
(Figure: continuation of the K-means example; panels (d)–(i); both axes range from −2 to 2.)
K-means Clustering – Cost Function
(Figure: the cost function $J$ after each E step (blue points) and M step (red points); horizontal axis shows iterations 1–4, vertical axis ranges from 0 to 1000.)
K-means Clustering – Notes
The initial condition is crucial for convergence.
What happens if at least one cluster centre is too far from all data points?
Costly step: finding the nearest prototype (a nearest-neighbour search). (Use the triangle inequality; build K-D trees, …)
Generalise to non-Euclidean dissimilarity measures $V(x_n, \mu_k)$ (the K-medoids algorithm),
$$\tilde{J} = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, V(x_n, \mu_k).$$
Online stochastic algorithm (a small sketch follows below):
1 Draw a data point $x_n$ and locate the nearest prototype $\mu_k$.
2 Update only $\mu_k$ using a decreasing learning rate $\eta_n$,
$$\mu_k^{\text{new}} = \mu_k^{\text{old}} + \eta_n\,(x_n - \mu_k^{\text{old}}).$$
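A minimal sketch of the online variant; the assumptions here are that the first $K$ streamed points initialise the prototypes and that $\eta_n = 1/N_k$, where $N_k$ counts the points prototype $k$ has received so far (any decreasing schedule could be used).

```python
# Sketch of sequential (online) K-means on a stream of D-dimensional points.
import numpy as np

def online_kmeans(stream, K):
    stream = iter(stream)
    mu = np.array([next(stream) for _ in range(K)], dtype=float)  # first K points as prototypes
    counts = np.ones(K)                                           # points seen per prototype
    for x in stream:
        x = np.asarray(x, dtype=float)
        k = int(np.argmin(((mu - x) ** 2).sum(axis=1)))           # nearest prototype
        counts[k] += 1
        # mu_new = mu_old + eta_n (x - mu_old), with eta_n = 1 / counts[k] (assumed schedule),
        # which makes mu[k] the running mean of the points assigned to it.
        mu[k] += (x - mu[k]) / counts[k]
    return mu
```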
K-means Clustering – Image Segmentation
Segment an image into regions of reasonably homogeneous appearance.
Each pixel is a point in $\mathbb{R}^3$ (red, blue, green). (Note that the pixel intensities are bounded in the range $[0, 1]$ and therefore this space is, strictly speaking, not Euclidean.)
Run K-means on all points of the image until convergence.
Replace each pixel with the corresponding mean $\mu_k$.
This results in an image with a palette of only $K$ different colours.
There are much better approaches to image segmentation (it remains an active research topic); this serves only to illustrate K-means (a small sketch follows below).
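A small sketch of this recipe, reusing the kmeans sketch from earlier; the file name and the matplotlib-based image loading are illustrative assumptions.

```python
# Sketch: K-means colour quantisation of an image, reusing the kmeans() sketch above.
# Assumes an RGB(A) image read as floats in [0, 1]; the file name is hypothetical.
import numpy as np
import matplotlib.pyplot as plt

img = plt.imread("example.png")[..., :3].astype(float)   # (H, W, 3)
H, W, _ = img.shape
pixels = img.reshape(-1, 3)                               # every pixel as a point in R^3

K = 3
mu, assign = kmeans(pixels, K)            # cluster the pixel colours (fine for small images)
segmented = mu[assign].reshape(H, W, 3)   # replace each pixel by its prototype colour

plt.imshow(segmented)
plt.show()
```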
Illustrating K-means Clustering – Segmentation
(Figure: segmentation results for K = 2, K = 3, and K = 10, together with the original image.)
K-means Clustering – Compression
Lossy data compression: accept some error in the reconstruction as a trade-off for higher compression.
Apply K-means to the data.
Store the code-book vectors $\mu_k$.
Store the data in the form of references (labels) to the code-book: each data point has a label in $\{1, \ldots, K\}$.
New data points are also compressed by finding the closest code-book vector and then storing only the label.
This technique is also called vector quantisation (a small sketch follows below).
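A minimal sketch of the encode/decode side of vector quantisation; the commented usage assumes the kmeans sketch from earlier provides the code-book, and the bit-counting remark is illustrative rather than part of the slides.

```python
# Sketch: vector quantisation of data X (N, D) against a K-entry code-book (K, D).
import numpy as np

def vq_encode(X, codebook):
    d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)            # one label in {0, ..., K-1} per point

def vq_decode(labels, codebook):
    return codebook[labels]             # lossy reconstruction

# codebook, _ = kmeans(X, K)            # fit the code-book with the earlier sketch
# labels = vq_encode(X, codebook)
# X_hat = vq_decode(labels, codebook)
# Storage per point drops from D floats to about log2(K) bits, plus the code-book itself.
```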
Illustrating K-means Clustering – Compression
(Figure: compressed images and relative storage; K = 2: 4.2%, K = 3: 8.3%, K = 10: 16.7%; original image: 100%.)
Latent variable modelling
We have already seen a mixture of two Gaussians for linear classification.
However, in the clustering scenario we do not observe the class membership.
Strategy (this is vague):
We have a difficult distribution $p(x)$.
We introduce a new variable $z$ to get $p(x, z)$.
Model $p(x, z) = p(x \,|\, z)\, p(z)$ with easy distributions.
Mixture of Gaussians
A Gaussian mixture distribution is a linear superposition of Gaussians of the form
$$p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \,|\, \mu_k, \Sigma_k).$$
As $\int p(x)\, dx = 1$, it follows that $\sum_{k=1}^{K} \pi_k = 1$.
Let us write this with the help of a latent variable $z$.
Definition (Latent variables)
Latent variables (as opposed to observable variables) are variables that are not directly observed but are inferred (through a mathematical model) from other variables that are observed and directly measured. They are also sometimes called hidden variables, model parameters, or hypothetical variables.
Mixture of Gaussians
Let $z \in \{0, 1\}^K$ with $\sum_{k=1}^{K} z_k = 1$. In words, $z$ is a $K$-dimensional vector in 1-of-K representation.
There are exactly $K$ different possible vectors $z$, depending on which of the $K$ entries is 1.
Define the joint distribution $p(x, z)$ in terms of a marginal distribution $p(z)$ and a conditional distribution $p(x \,|\, z)$ as
$$p(x, z) = p(z)\, p(x \,|\, z).$$
(Figure: graphical model with the latent node $z$ as parent of the observed node $x$.)
Mixture of Gaussians
Set the marginal distribution to
$$p(z_k = 1) = \pi_k$$
where $0 \le \pi_k \le 1$, together with $\sum_{k=1}^{K} \pi_k = 1$.
Because $z$ uses 1-of-K coding, we can also write
$$p(z) = \prod_{k=1}^{K} \pi_k^{z_k}.$$
Set the conditional distribution of $x$ given a particular $z$ to
$$p(x \,|\, z_k = 1) = \mathcal{N}(x \,|\, \mu_k, \Sigma_k),$$
or
$$p(x \,|\, z) = \prod_{k=1}^{K} \mathcal{N}(x \,|\, \mu_k, \Sigma_k)^{z_k}.$$
Mixture of Gaussians
The marginal distribution over $x$ is now found by summing the joint distribution over all possible states of $z$:
$$p(x) = \sum_{z} p(z)\, p(x \,|\, z) = \sum_{z} \prod_{k=1}^{K} \pi_k^{z_k} \prod_{k=1}^{K} \mathcal{N}(x \,|\, \mu_k, \Sigma_k)^{z_k} = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \,|\, \mu_k, \Sigma_k).$$
The marginal distribution of $x$ is a Gaussian mixture.
For several observations $x_1, \ldots, x_N$ we need one latent variable $z_n$ per observation.
What have we gained? We can now work with the joint distribution $p(x, z)$. This will lead to significant simplification later, especially for the EM algorithm.
Mixture of Gaussians
Conditional probability of $z$ given $x$ by Bayes' theorem:
$$\gamma(z_k) = p(z_k = 1 \,|\, x) = \frac{p(z_k = 1)\, p(x \,|\, z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1)\, p(x \,|\, z_j = 1)} = \frac{\pi_k\, \mathcal{N}(x \,|\, \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x \,|\, \mu_j, \Sigma_j)}.$$
$\gamma(z_k)$ is the responsibility of component $k$ to 'explain' the observation $x$.
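A hedged sketch of computing the responsibilities for a batch of points with SciPy; a more careful implementation would work in log space to avoid underflow.

```python
# Sketch: responsibilities gamma[n, k] = p(z_k = 1 | x_n) for data X (N, D),
# mixture weights pi (K,), means mu (K, D) and covariances Sigma (K, D, D).
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pi, mu, Sigma):
    K = len(pi)
    weighted = np.stack(
        [pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k]) for k in range(K)],
        axis=1)                                        # (N, K): pi_k N(x_n | mu_k, Sigma_k)
    return weighted / weighted.sum(axis=1, keepdims=True)   # normalise over components
```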
Mixture of Gaussians – Ancestral Sampling
Goal: Generate random samples distributed according to
the mixture model.
1 Generate a sample ẑ from the distribution p(z).
2 Generate a value x̂ from the conditional distribution p(x | ẑ).
Example: Mixture of 3 Gaussians, 500 points.
(Figure: 500 samples from a mixture of 3 Gaussians; panel (a): samples coloured by the original states of $z$; panel (b): samples from the marginal $p(x)$ alone; panel (c): samples with (R, G, B) colours mixed according to the responsibilities $\gamma(z_{nk})$. Both axes range from 0 to 1.)
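A minimal sketch of ancestral sampling from a Gaussian mixture; the 2-D parameter values are illustrative placeholders, not the ones behind the slide's figure.

```python
# Sketch: ancestral sampling from a K-component Gaussian mixture.
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-D mixture parameters (placeholders).
pi    = np.array([0.5, 0.3, 0.2])
mu    = np.array([[0.3, 0.3], [0.7, 0.7], [0.5, 0.2]])
Sigma = np.array([np.eye(2) * 0.01] * 3)

def sample_mixture(n):
    z = rng.choice(len(pi), size=n, p=pi)                    # step 1: sample z ~ p(z)
    x = np.array([rng.multivariate_normal(mu[k], Sigma[k])   # step 2: sample x ~ p(x | z)
                  for k in z])
    return x, z

X, Z = sample_mixture(500)
```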
Mixture of Gaussians – Maximum Likelihood
Given $N$ data points, each of dimension $D$, we have the data matrix $X \in \mathbb{R}^{N \times D}$ where each row contains one data point.
Similarly, we have the matrix of latent variables $Z \in \mathbb{R}^{N \times K}$ with rows $z_n^T$.
Assuming the data are drawn i.i.d., the distribution of the data can be represented by the following graphical model.
(Figure: plate-notation graphical model; the latent $z_n$ is the parent of the observed $x_n$ inside a plate over $N$; the mixing coefficients $\pi$ point to $z_n$, and the parameters $\mu$, $\Sigma$ point to $x_n$.)
Mixture of Gaussians – Maximum Likelihood
The log of the likelihood function is then
$$\ln p(X \,|\, \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \,|\, \mu_k, \Sigma_k) \right\}.$$
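A short sketch of evaluating this log likelihood, using SciPy's logsumexp for the sum inside the logarithm; parameter shapes follow the responsibilities sketch above.

```python
# Sketch: log likelihood of a Gaussian mixture for data X (N, D),
# weights pi (K,), means mu (K, D), covariances Sigma (K, D, D).
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pi, mu, Sigma):
    K = len(pi)
    # log pi_k + log N(x_n | mu_k, Sigma_k), shape (N, K)
    log_terms = np.stack(
        [np.log(pi[k]) + multivariate_normal.logpdf(X, mean=mu[k], cov=Sigma[k])
         for k in range(K)], axis=1)
    return logsumexp(log_terms, axis=1).sum()   # sum over n of log sum over k
```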
Significant problem: if a mean $\mu_j$ 'sits' directly on a data point $x_n$, then
$$\mathcal{N}(x_n \,|\, x_n, \sigma_j^2 I) = \frac{1}{(2\pi)^{1/2}}\, \frac{1}{\sigma_j},$$
which diverges as $\sigma_j \to 0$, so the log likelihood is unbounded.
Here we assumed $\Sigma_k = \sigma_k^2 I$, but the problem is general; just think of a principal-axis transformation for $\Sigma_k$.
This is overfitting (in disguise) occurring again with the maximum likelihood approach.
Use heuristics to detect this situation and reset the mean of the corresponding component of the mixture.
Mixture of Gaussians – Maximum Likelihood
A $K$-component mixture has a total of $K!$ equivalent solutions, corresponding to the $K!$ ways of assigning $K$ sets of parameters to $K$ components.
This is also called the identifiability problem. It needs to be considered when the parameters discovered by a model are interpreted.
Maximising the log likelihood of a Gaussian mixture is more complex than for a single Gaussian: the summation over all $K$ components inside the logarithm makes it harder.
Setting the derivatives of the log likelihood to zero no longer results in a closed-form solution.
May use gradient-based optimisation.
Or the EM algorithm. Stay tuned.