Statistical Machine Learning
Outlines
Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2
Statistical Machine Learning
Christian Walder
Machine Learning Research Group
CSIRO Data61
and
College of Engineering and Computer Science
The Australian National University
Canberra
Semester One, 2020.
(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)
Part XII
Mixture Models and EM 2
Recall: K-means Clustering
Find the values for {rnk} and {µk} so as to minimise J.
But {rnk} depends on {µk}, and {µk} depends on {rnk}.
Iterate until no further change
1. Minimise J w.r.t. rnk while keeping {µk} fixed:
   \[
   r_{nk} = \begin{cases} 1, & \text{if } k = \arg\min_j \|x_n - \mu_j\|^2 \\ 0, & \text{otherwise,} \end{cases}
   \qquad \forall\, n = 1, \dots, N
   \]
   (Expectation step)
2. Minimise J w.r.t. {µk} while keeping rnk fixed:
   \[
   0 = 2 \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)
   \quad\Longrightarrow\quad
   \mu_k = \frac{\sum_{n=1}^{N} r_{nk}\, x_n}{\sum_{n=1}^{N} r_{nk}}
   \]
   (Maximisation step)
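Below is a minimal NumPy sketch of these two alternating steps (not from the slides); the interface, the random initialisation, and the convergence test are illustrative assumptions.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Alternate the two K-means steps above on data X of shape (N, D)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]   # initialise means from the data
    for _ in range(n_iter):
        # "E step": hard-assign each point to its nearest mean (the r_nk above).
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
        r = np.argmin(d2, axis=1)
        # "M step": each mean becomes the average of the points assigned to it.
        new_mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):     # no further change: stop
            break
        mu = new_mu
    return mu, r
```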
Responsibilities
For K-means clustering we have hard assignments rnk of the data points xn to components with parameters θk; for a Gaussian mixture model we have soft assignments, the responsibilities γ(znk).
EM for Gaussian Mixtures
Starting point is the log of the likelihood function
\[
\ln p(X \,|\, \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \,|\, \mu_k, \Sigma_k) \right\}
\]
Critical point of ln p(X |π,µ,Σ) w.r.t. µk:
\[
0 = \sum_{n=1}^{N} \underbrace{\frac{\pi_k \, \mathcal{N}(x_n \,|\, \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \,|\, \mu_j, \Sigma_j)}}_{\gamma(z_{nk})} \; \Sigma_k^{-1} (x_n - \mu_k)
\]
Therefore
\[
\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n
\]
where the effective number of points assigned to Gaussian k is
\[
N_k = \sum_{n=1}^{N} \gamma(z_{nk}).
\]
EM for Gaussian Mixtures
Maximum of the log likelihood function w.r.t. the mean µk:
\[
\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n
\]
Similarly for the covariance matrix
\[
\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^T,
\]
and for the mixing coefficients πk (using a Lagrange multiplier to enforce \(\sum_k \pi_k = 1\))
\[
\pi_k = \frac{N_k}{N}.
\]
This is not a closed-form solution because the responsibilities γ(znk) themselves depend on π, µ, Σ.
EM for Gaussian Mixtures
Given a Gaussian mixture and data X, maximise the log likelihood w.r.t. the parameters (π, µ, Σ).
1. Initialise the means µk, covariances Σk and mixing coefficients πk. Evaluate the log likelihood function.
2. E step: Evaluate the responsibilities γ(znk) using the current parameters
   \[
   \gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \,|\, \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \,|\, \mu_j, \Sigma_j)}
   \]
3. M step: Re-estimate the parameters using the current γ(znk)
   \[
   \mu_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n
   \qquad
   \pi_k^{\text{new}} = \frac{N_k}{N}
   \]
   \[
   \Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k^{\text{new}})(x_n - \mu_k^{\text{new}})^T
   \]
4. Evaluate the log likelihood
   \[
   \ln p(X \,|\, \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k^{\text{new}} \, \mathcal{N}(x_n \,|\, \mu_k^{\text{new}}, \Sigma_k^{\text{new}}) \right\}
   \]
   and, if not converged, go to step 2.
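A minimal NumPy/SciPy sketch of this loop follows (not part of the slides); the initialisation, the small ridge added to the covariances, and the stopping tolerance are assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iter=200, tol=1e-6, seed=0):
    N, D = X.shape
    rng = np.random.default_rng(seed)
    # 1. Initialise means, covariances and mixing coefficients.
    mu = X[rng.choice(N, size=K, replace=False)]
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # Weighted densities pi_k * N(x_n | mu_k, Sigma_k), shape (N, K).
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                                for k in range(K)])
        # 4. Log likelihood ln p(X | pi, mu, Sigma); stop if it has converged.
        ll = np.log(dens.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
        # 2. E step: responsibilities gamma(z_nk).
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # 3. M step: re-estimate the parameters with the current gamma.
        Nk = gamma.sum(axis=0)                       # effective counts N_k
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
    return pi, mu, Sigma
```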
EM for Gaussian Mixtures – Example
[Figure: EM fit of a Gaussian mixture to a two-dimensional data set; panels (a)-(f) show the progress of the algorithm, with the later panels labelled L = 1, 2, 5 and 20 iterations.]
EM for Gaussian Mixtures – Relation to K-Means
Assume a Gaussian mixture model.
Covariance matrices given by εI, where ε is shared by all components.
Then
\[
p(x \,|\, \mu_k, \Sigma_k) = \frac{1}{(2\pi\epsilon)^{D/2}} \exp\left\{ -\frac{1}{2\epsilon} \|x - \mu_k\|^2 \right\}.
\]
Keep ε fixed, do not re-estimate it.
Responsibilities
\[
\gamma(z_{nk}) = \frac{\pi_k \exp\{ -\|x_n - \mu_k\|^2 / 2\epsilon \}}{\sum_j \pi_j \exp\{ -\|x_n - \mu_j\|^2 / 2\epsilon \}}
\]
Taking the limit ε → 0, the term in the denominator for which ‖xn − µj‖² is the smallest will go to zero most slowly.
EM for Gaussian Mixtures – Relation to K-Means
Assume a Gaussian mixture model.
With shared covariances εI, the responsibilities are
\[
\gamma(z_{nk}) = \frac{\pi_k \exp\{ -\|x_n - \mu_k\|^2 / 2\epsilon \}}{\sum_j \pi_j \exp\{ -\|x_n - \mu_j\|^2 / 2\epsilon \}}
\]
Therefore, in the limit ε → 0,
\[
\gamma(z_{nk}) \;\to\; \begin{cases} 1 & \text{if } \|x_n - \mu_k\| < \|x_n - \mu_j\| \;\; \forall\, j \neq k \\ 0 & \text{otherwise} \end{cases}
\]
This holds independently of the πk, as long as none of them are zero.
Hard assignment of each point to exactly one cluster: this is exactly K-means,
\[
\lim_{\epsilon \to 0} \gamma(z_{nk}) = r_{nk}.
\]
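A small numerical illustration of this limit (an arbitrary example, not from the slides): for one point, two fixed means and fixed mixing weights, the responsibilities harden towards a 0/1 assignment to the nearest mean as ε shrinks.

```python
import numpy as np

x = np.array([0.0, 0.0])
mu = np.array([[1.0, 0.0],     # nearer mean
               [2.0, 0.0]])    # farther mean
pi = np.array([0.3, 0.7])

for eps in [1.0, 0.1, 0.01]:
    # Log of pi_k * exp(-||x - mu_k||^2 / (2 eps)), up to a shared constant.
    logits = np.log(pi) - np.sum((x - mu) ** 2, axis=1) / (2 * eps)
    gamma = np.exp(logits - logits.max())
    gamma /= gamma.sum()
    print(eps, gamma)          # gamma -> [1, 0] as eps -> 0
```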
Mixture of Bernoulli Distributions
Set of D binary variables xi, i = 1, . . . ,D.
Each governed by a Bernoulli distribution with parameter
µi. Therefore
\[
p(x \,|\, \mu) = \prod_{i=1}^{D} \mu_i^{x_i} (1 - \mu_i)^{1 - x_i}
\]
Expectation and covariance
\[
\mathbb{E}[x] = \mu \qquad \operatorname{cov}[x] = \operatorname{diag}\{ \mu_i (1 - \mu_i) \}
\]
Mixture of Bernoulli Distributions
Mixture
\[
p(x \,|\, \mu, \pi) = \sum_{k=1}^{K} \pi_k \, p(x \,|\, \mu_k)
\]
with
\[
p(x \,|\, \mu_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i}
\]
A similar calculation as for the mixture of Gaussians gives
\[
\gamma(z_{nk}) = \frac{\pi_k \, p(x_n \,|\, \mu_k)}{\sum_{j=1}^{K} \pi_j \, p(x_n \,|\, \mu_j)}
\qquad
N_k = \sum_{n=1}^{N} \gamma(z_{nk})
\]
\[
\bar{x}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n
\qquad
\mu_k = \bar{x}_k
\qquad
\pi_k = \frac{N_k}{N}
\]
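A minimal sketch of the corresponding EM updates (assuming X is an (N, D) array of 0/1 values; the initialisation and the clipping used for numerical stability are choices made here, not part of the slides).

```python
import numpy as np

def bernoulli_mixture_em(X, K, n_iter=100, seed=0):
    N, D = X.shape
    rng = np.random.default_rng(seed)
    mu = rng.uniform(0.25, 0.75, size=(K, D))   # component means mu_ki
    pi = np.full(K, 1.0 / K)                    # mixing coefficients
    for _ in range(n_iter):
        # E step: ln p(x_n | mu_k) = sum_i [x_ni ln mu_ki + (1 - x_ni) ln(1 - mu_ki)]
        log_p = X @ np.log(mu).T + (1 - X) @ np.log(1 - mu).T      # (N, K)
        log_w = np.log(pi) + log_p
        log_w -= log_w.max(axis=1, keepdims=True)                   # stabilise
        gamma = np.exp(log_w)
        gamma /= gamma.sum(axis=1, keepdims=True)                   # gamma(z_nk)
        # M step: mu_k = weighted mean of the data, pi_k = N_k / N.
        Nk = gamma.sum(axis=0)
        mu = np.clip((gamma.T @ X) / Nk[:, None], 1e-6, 1 - 1e-6)
        pi = Nk / N
    return pi, mu, gamma
```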
EM for Mixture of Bernoulli Distributions - Digits
Examples from a digits data set, where each pixel takes only binary values.
[Figure: sample binarised digit images; the fitted parameters µki for each component in the mixture; and, for comparison, the fit of a single multivariate Bernoulli distribution.]
The Role of Latent Variables
EM finds the maximum likelihood solution for models with latent variables.
Two kinds of variables
Observed variables X
Latent variables Z
plus model parameters θ.
Log likelihood is then
\[
\ln p(X \,|\, \theta) = \ln \left\{ \sum_{Z} p(X, Z \,|\, \theta) \right\}
\]
The optimisation is difficult because of the log-sum: the logarithm acts on the sum over Z rather than directly on p(X, Z | θ).
Assume maximisation of the distribution p(X,Z |θ) over
the complete data set {X,Z} is straightforward.
But we only have the incomplete data set {X} and the
posterior distribution p(Z |X,θ).
EM - Key Idea
Key idea of EM: As Z is not observed, work with an
‘averaged’ version Q(θ,θold) of the complete log-likelihood
ln p(X,Z |θ), averaged over all states of Z.
\[
Q(\theta, \theta^{\text{old}}) = \sum_{Z} p(Z \,|\, X, \theta^{\text{old}}) \, \ln p(X, Z \,|\, \theta)
\]
EM Algorithm
1. Choose an initial setting for the parameters θold.
2. E step: Evaluate p(Z | X, θold).
3. M step: Evaluate θnew given by
   \[
   \theta^{\text{new}} = \arg\max_{\theta} Q(\theta, \theta^{\text{old}})
   \]
   where
   \[
   Q(\theta, \theta^{\text{old}}) = \sum_{Z} p(Z \,|\, X, \theta^{\text{old}}) \, \ln p(X, Z \,|\, \theta)
   \]
4. Check for convergence of the log likelihood or of the parameter values. If not yet converged, set θold ← θnew and go to step 2.
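The loop can be written generically, as in the sketch below; e_step, m_step and log_likelihood are placeholder callables (assumptions) standing in for model-specific routines such as those of the Gaussian or Bernoulli mixtures above.

```python
def em(theta, e_step, m_step, log_likelihood, tol=1e-6, max_iter=1000):
    """Generic EM loop: alternate E and M steps until ln p(X | theta) stops improving."""
    prev_ll = float("-inf")
    for _ in range(max_iter):
        posterior = e_step(theta)        # p(Z | X, theta_old)
        theta = m_step(posterior)        # argmax_theta Q(theta, theta_old)
        ll = log_likelihood(theta)       # ln p(X | theta_new)
        if ll - prev_ll < tol:           # convergence check
            break
        prev_ll = ll
    return theta
```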
EM Algorithm - Convergence
Start with the product rule for the observed variables X, the unobserved variables Z, and the parameters θ:
\[
\ln p(X, Z \,|\, \theta) = \ln p(Z \,|\, X, \theta) + \ln p(X \,|\, \theta).
\]
Apply \(\sum_Z q(Z)\) with arbitrary q(Z) to the formula:
\[
\sum_{Z} q(Z) \ln p(X, Z \,|\, \theta) = \sum_{Z} q(Z) \ln p(Z \,|\, X, \theta) + \ln p(X \,|\, \theta).
\]
Rewrite as
\[
\ln p(X \,|\, \theta) = \underbrace{\sum_{Z} q(Z) \ln \frac{p(X, Z \,|\, \theta)}{q(Z)}}_{\mathcal{L}(q,\, \theta)} \; \underbrace{- \sum_{Z} q(Z) \ln \frac{p(Z \,|\, X, \theta)}{q(Z)}}_{\operatorname{KL}(q \,\|\, p)}
\]
KL(q‖p) is the Kullback-Leibler divergence.
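A quick numerical check of this decomposition for a toy discrete latent variable (the joint table and the choice of q below are arbitrary): L(q, θ) and KL(q‖p) sum exactly to ln p(X | θ).

```python
import numpy as np

joint = np.array([0.10, 0.25, 0.05])   # p(X = x_obs, Z = z | theta) for z = 0, 1, 2
q = np.array([0.2, 0.5, 0.3])          # an arbitrary distribution over Z

p_x = joint.sum()                       # p(X = x_obs | theta)
post = joint / p_x                      # p(Z | X = x_obs, theta)

L = np.sum(q * np.log(joint / q))       # lower bound L(q, theta)
KL = np.sum(q * np.log(q / post))       # KL(q || p)

print(np.log(p_x), L + KL)              # identical up to rounding
print(L <= np.log(p_x))                 # True, since KL(q || p) >= 0
```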
Kullback-Leibler Divergence
‘Distance’ between two distributions p(y) and q(y):
\[
\operatorname{KL}(q \,\|\, p) = \sum_{y} q(y) \ln \frac{q(y)}{p(y)} = - \sum_{y} q(y) \ln \frac{p(y)}{q(y)}
\]
\[
\operatorname{KL}(q \,\|\, p) = \int q(y) \ln \frac{q(y)}{p(y)} \, dy = - \int q(y) \ln \frac{p(y)}{q(y)} \, dy
\]
KL(q‖p) ≥ 0
not symmetric: KL(q‖p) ≠ KL(p‖q)
KL(q‖p) = 0 iff q = p
invariant under parameter transformations
Example: the Kullback-Leibler divergence between two normal distributions q(x) = N(x | µ1, σ1) and p(x) = N(x | µ2, σ2) is
\[
\operatorname{KL}(q \,\|\, p) = \ln \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2 \sigma_2^2} - \frac{1}{2}
\]
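A quick check of this closed form against numerical integration (the parameter values are arbitrary; a sketch, not from the slides).

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu1, s1 = 0.0, 1.0      # q = N(mu1, s1^2)
mu2, s2 = 1.0, 2.0      # p = N(mu2, s2^2)

closed_form = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

# KL(q || p) = E_q[ln q - ln p], integrated numerically over a wide interval.
numeric, _ = quad(lambda y: norm.pdf(y, mu1, s1)
                  * (norm.logpdf(y, mu1, s1) - norm.logpdf(y, mu2, s2)),
                  -20.0, 20.0)

print(closed_form, numeric)   # agree to numerical precision
```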
EM Algorithm - Convergence
The two parts of ln p(X | θ):
\[
\ln p(X \,|\, \theta) = \underbrace{\sum_{Z} q(Z) \ln \frac{p(X, Z \,|\, \theta)}{q(Z)}}_{\mathcal{L}(q,\, \theta)} \; \underbrace{- \sum_{Z} q(Z) \ln \frac{p(Z \,|\, X, \theta)}{q(Z)}}_{\operatorname{KL}(q \,\|\, p)}
\]
[Figure: decomposition of ln p(X|θ) into the lower bound L(q, θ) and the gap KL(q‖p).]
EM Algorithm - E Step
Hold θold fixed. Maximise the lower bound L(q,θold) with
respect to q(·).
L(q,θold) is a functional.
ln p(X |θ) does NOT depend on q(·).
Maximum for L(q,θold) will occur when the
Kullback-Leibler divergence vanishes.
Therefore, choose q(Z) = p(Z |X,θold)
\[
\ln p(X \,|\, \theta) = \underbrace{\sum_{Z} q(Z) \ln \frac{p(X, Z \,|\, \theta)}{q(Z)}}_{\mathcal{L}(q,\, \theta)} \; \underbrace{- \sum_{Z} q(Z) \ln \frac{p(Z \,|\, X, \theta)}{q(Z)}}_{\operatorname{KL}(q \,\|\, p)}
\]
[Figure: after the E step, KL(q‖p) = 0 and the bound is tight, L(q, θold) = ln p(X|θold).]
EM Algorithm - M Step
Hold q(·) = p(Z | X, θold) fixed. Maximise the lower bound L(q, θ) with respect to θ:
\[
\theta^{\text{new}} = \arg\max_{\theta} \mathcal{L}(q, \theta) = \arg\max_{\theta} \sum_{Z} q(Z) \ln p(X, Z \,|\, \theta)
\]
(the entropy term \(-\sum_Z q(Z) \ln q(Z)\) does not depend on θ).
L(q, θnew) > L(q, θold), unless the maximum has already been reached.
As q(·) = p(Z | X, θold) is held fixed, p(Z | X, θnew) will in general not be equal to q(·), and therefore the Kullback-Leibler divergence will be greater than zero (unless converged).
\[
\ln p(X \,|\, \theta) = \underbrace{\sum_{Z} q(Z) \ln \frac{p(X, Z \,|\, \theta)}{q(Z)}}_{\mathcal{L}(q,\, \theta)} \; \underbrace{- \sum_{Z} q(Z) \ln \frac{p(Z \,|\, X, \theta)}{q(Z)}}_{\operatorname{KL}(q \,\|\, p)}
\]
[Figure: after the M step, both L(q, θnew) and ln p(X|θnew) have increased, and KL(q‖p) is non-zero again.]
EM Algorithm – Parameter View
[Figure: EM in parameter space, showing ln p(X|θ) and the lower bound L(q, θ) around the updates from θold to θnew.]
Red curve: incomplete data likelihood ln p(X|θ). Blue curve: after the E step. Green curve: after the M step.