
Statistical Machine Learning

©2020 Ong & Walder & Webers
Data61 | CSIRO, The Australian National University

Outline

Overview
Introduction
Linear Algebra

Probability

Linear Regression 1

Linear Regression 2

Linear Classification 1

Linear Classification 2

Kernel Methods
Sparse Kernel Methods

Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis

Autoencoders
Graphical Models 1

Graphical Models 2

Graphical Models 3

Sampling

Sequential Data 1

Sequential Data 2


Statistical Machine Learning

Christian Walder

Machine Learning Research Group
CSIRO Data61

and

College of Engineering and Computer Science
The Australian National University

Canberra
Semester One, 2020.

(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)


Part XII

Mixture Models and EM 2


Recall: K-means Clustering

Find the values for {rnk} and {µk} so as to minimise J.
But {rnk} depends on {µk}, and {µk} depends on {rnk}.
Iterate until no further change

1 Minimise J w.r.t. rnk while keeping {µk} fixed:
$$ r_{nk} = \begin{cases} 1, & \text{if } k = \arg\min_{j} \|x_n - \mu_j\|^2 \\ 0, & \text{otherwise} \end{cases} \qquad \forall\, n = 1, \dots, N $$
(Expectation step)

2 Minimise J w.r.t. {µk} while keeping rnk fixed:
$$ 0 = 2 \sum_{n=1}^{N} r_{nk} (x_n - \mu_k) \quad\Longrightarrow\quad \mu_k = \frac{\sum_{n=1}^{N} r_{nk} x_n}{\sum_{n=1}^{N} r_{nk}} $$
(Maximisation step)
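
As a concrete illustration, the two alternating steps translate almost directly into code. The following is a minimal NumPy sketch, not part of the lecture material; the data matrix X, the choice of K and the initialisation from random data points are illustrative assumptions.

import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal K-means: alternate the assignment (E-like) and mean (M-like) steps."""
    rng = np.random.default_rng(seed)
    # Initialise the means with K randomly chosen data points (an assumption of this sketch).
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Assignment step: hard-assign each x_n to its nearest mean (r_nk is one-hot).
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
        r = np.argmin(d2, axis=1)
        # Update step: recompute each mean as the average of its assigned points.
        new_mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, r

# Example usage on synthetic 2-D data:
X = np.vstack([np.random.randn(100, 2) + [3, 3], np.random.randn(100, 2) - [3, 3]])
mu, r = kmeans(X, K=2)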


Responsibilities

[Figure: component parameters θk, data points xn, hard assignments rnk, responsibilities γ(znk).]

For k-means clustering, we have hard assignments rnk.
For the GMM, we have soft assignments γ(znk).


EM for Gaussian Mixtures

The starting point is the log of the likelihood function
$$ \ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\} $$
Critical point of ln p(X | π, µ, Σ) w.r.t. µk:
$$ 0 = \sum_{n=1}^{N} \underbrace{\frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}}_{\gamma(z_{nk})} \, \Sigma_k^{-1} (x_n - \mu_k) $$
Therefore
$$ \mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n $$
where the effective number of points assigned to Gaussian k is $N_k = \sum_{n=1}^{N} \gamma(z_{nk})$.
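
As an aside, the underbraced quantity γ(znk) and the resulting mean update can be computed in a few lines. The sketch below is an illustration only; the array shapes, the function name and the use of SciPy's multivariate_normal are assumptions, not part of the slides.

import numpy as np
from scipy.stats import multivariate_normal

def gmm_responsibilities(X, pi, mu, Sigma):
    # X: (N, D) data; pi: (K,) mixing coefficients; mu: (K, D) means; Sigma: (K, D, D) covariances.
    K = len(pi)
    # Numerator of gamma(z_nk): pi_k * N(x_n | mu_k, Sigma_k), one column per component.
    weighted = np.column_stack([pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
                                for k in range(K)])
    gamma = weighted / weighted.sum(axis=1, keepdims=True)   # normalise over components
    Nk = gamma.sum(axis=0)                                   # effective number of points per component
    mu_new = (gamma.T @ X) / Nk[:, None]                     # mu_k = (1/N_k) sum_n gamma(z_nk) x_n
    return gamma, Nk, mu_new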


EM for Gaussian Mixtures

Maximising the log likelihood w.r.t. µk gives
$$ \mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n $$
Similarly for the covariance matrices
$$ \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^T, $$
and for the mixing coefficients πk (using a Lagrange multiplier to enforce $\sum_k \pi_k = 1$)
$$ \pi_k = \frac{N_k}{N}. $$
This is not a closed-form solution because the responsibilities γ(znk) depend on π, µ, Σ.


EM for Gaussian Mixtures

Given a Gaussian mixture and data X, maximise the log
likelihood w.r.t. the parameters (π,µ,Σ).

1 Initialise the means µk, covariances Σk and mixing coefficients πk. Evaluate the log likelihood.

2 E step: Evaluate the responsibilities γ(znk) using the current parameters
$$ \gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} $$

3 M step: Re-estimate the parameters using the current γ(znk)
$$ \mu_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n \qquad \pi_k^{\text{new}} = \frac{N_k}{N} $$
$$ \Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k^{\text{new}})(x_n - \mu_k^{\text{new}})^T $$

4 Evaluate the log likelihood
$$ \ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k^{\text{new}} \, \mathcal{N}(x_n \mid \mu_k^{\text{new}}, \Sigma_k^{\text{new}}) \right\} $$
and if it has not converged, go to step 2.
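
A compact NumPy/SciPy sketch of steps 1–4 follows. It is an illustration rather than an official implementation: the initialisation from random data points, the convergence tolerance and the small ridge added to each covariance for numerical stability are my own choices.

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=200, tol=1e-6, seed=0):
    # X: (N, D) data matrix; K: number of components (both are assumptions of this sketch).
    N, D = X.shape
    rng = np.random.default_rng(seed)
    # Step 1: initialise means, covariances and mixing coefficients.
    mu = X[rng.choice(N, size=K, replace=False)]
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # Step 2 (E step): responsibilities gamma[n, k] under the current parameters.
        weighted = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                                    for k in range(K)])
        gamma = weighted / weighted.sum(axis=1, keepdims=True)
        # Step 3 (M step): re-estimate the parameters from the responsibilities.
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            Xc = X - mu[k]
            # The small ridge keeps the covariance invertible (a numerical convenience, not from the slide).
            Sigma[k] = (gamma[:, k, None] * Xc).T @ Xc / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
        # Step 4: log likelihood of the parameters used in this E step; stop when it stabilises.
        ll = np.log(weighted.sum(axis=1)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pi, mu, Sigma, gamma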


EM for Gaussian Mixtures – Example

[Figure: progress of EM on a two-dimensional data set, shown in panels (a)–(f); the later panels correspond to L = 1, 2, 5 and 20 EM iterations.]


EM for Gaussian Mixtures – Relation to K-Means

Assume a Gaussian mixture model.
Covariance matrices given by εI, where ε is shared by all components.
Then
$$ p(x \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi\varepsilon)^{D/2}} \exp\left\{ -\frac{1}{2\varepsilon} \|x - \mu_k\|^2 \right\}. $$
Keep ε fixed; do not re-estimate it.
Responsibilities
$$ \gamma(z_{nk}) = \frac{\pi_k \exp\left\{ -\|x_n - \mu_k\|^2 / 2\varepsilon \right\}}{\sum_j \pi_j \exp\left\{ -\|x_n - \mu_j\|^2 / 2\varepsilon \right\}} $$
Taking the limit ε → 0, the term in the denominator for which ‖xn − µj‖² is smallest goes to zero most slowly.
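
This limit is easy to observe numerically. The sketch below uses made-up values for the point, the means and the mixing coefficients; as ε shrinks, the responsibilities concentrate on the nearest mean, anticipating the result derived on the next slide.

import numpy as np

x = np.array([0.0, 0.0])                      # one data point (made up)
mu = np.array([[0.5, 0.0], [2.0, 0.0]])       # two component means (made up)
pi = np.array([0.3, 0.7])                     # unequal mixing coefficients

for eps in [1.0, 0.1, 0.01, 0.001]:
    # gamma_k is proportional to pi_k exp(-||x - mu_k||^2 / (2 eps)); subtract the largest exponent for stability.
    expo = -np.sum((x - mu) ** 2, axis=1) / (2 * eps)
    w = pi * np.exp(expo - expo.max())
    print(eps, w / w.sum())                   # tends to [1, 0]: hard assignment to the nearest mean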


EM for Gaussian Mixtures – Relation to K-Means

Assume a Gaussian mixture model.

$$ \gamma(z_{nk}) = \frac{\pi_k \exp\left\{ -\|x_n - \mu_k\|^2 / 2\varepsilon \right\}}{\sum_j \pi_j \exp\left\{ -\|x_n - \mu_j\|^2 / 2\varepsilon \right\}} $$
Therefore, in the limit ε → 0,
$$ \gamma(z_{nk}) \to \begin{cases} 1 & \text{if } \|x_n - \mu_k\| < \|x_n - \mu_j\| \ \forall\, j \neq k \\ 0 & \text{otherwise} \end{cases} $$
This holds independently of the πk, as long as none of them are zero.
Hard assignment to exactly one cluster: K-means.
$$ \lim_{\varepsilon \to 0} \gamma(z_{nk}) = r_{nk} $$

Mixture of Bernoulli Distributions

Set of D binary variables xi, i = 1, . . . , D.
Each is governed by a Bernoulli distribution with parameter µi.
Therefore
$$ p(x \mid \mu) = \prod_{i=1}^{D} \mu_i^{x_i} (1 - \mu_i)^{1 - x_i} $$
Expectation and covariance
$$ \mathbb{E}[x] = \mu \qquad \operatorname{cov}[x] = \operatorname{diag}\{ \mu_i (1 - \mu_i) \} $$

Mixture of Bernoulli Distributions

Mixture
$$ p(x \mid \mu, \pi) = \sum_{k=1}^{K} \pi_k \, p(x \mid \mu_k) \quad \text{with} \quad p(x \mid \mu_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i} $$
A similar calculation as for the mixture of Gaussians gives
$$ \gamma(z_{nk}) = \frac{\pi_k \, p(x_n \mid \mu_k)}{\sum_{j=1}^{K} \pi_j \, p(x_n \mid \mu_j)} \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk}) $$
$$ \bar{x}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n \qquad \mu_k = \bar{x}_k \qquad \pi_k = \frac{N_k}{N} $$
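
A sketch of these updates on binary data is given below. Working in the log domain and clipping µ away from 0 and 1 are implementation choices of mine, not part of the derivation; the function name and array shapes are likewise assumptions.

import numpy as np

def em_bernoulli_mixture(X, K, n_iter=100, seed=0):
    # X: (N, D) array of 0/1 values; K: number of mixture components (assumptions of this sketch).
    N, D = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    mu = rng.uniform(0.25, 0.75, size=(K, D))                      # random initial Bernoulli parameters
    for _ in range(n_iter):
        # E step: log p(x_n | mu_k) = sum_i [ x_ni log mu_ki + (1 - x_ni) log(1 - mu_ki) ]
        log_p = X @ np.log(mu).T + (1 - X) @ np.log(1 - mu).T      # shape (N, K)
        log_w = np.log(pi) + log_p
        log_w -= log_w.max(axis=1, keepdims=True)                  # for numerical stability
        gamma = np.exp(log_w)
        gamma /= gamma.sum(axis=1, keepdims=True)                  # responsibilities gamma(z_nk)
        # M step: mu_k = x_bar_k (responsibility-weighted mean), pi_k = N_k / N
        Nk = gamma.sum(axis=0)
        mu = np.clip((gamma.T @ X) / Nk[:, None], 1e-6, 1 - 1e-6)  # clip to keep the logs finite
        pi = Nk / N
    return pi, mu, gamma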
EM for Mixture of Bernoulli Distributions – Digits

Examples from a digits data set, with each pixel taking only binary values.
Parameters µki for each component in the mixture.
Fit to one multivariate Bernoulli distribution.

The Role of Latent Variables

EM finds the maximum likelihood solution for models with latent variables.
Two kinds of variables: observed variables X and latent variables Z, plus model parameters θ.
The log likelihood is then
$$ \ln p(X \mid \theta) = \ln \left\{ \sum_{Z} p(X, Z \mid \theta) \right\} $$
The optimisation is difficult because the sum over Z appears inside the logarithm (the log-sum).
Assume that maximisation of the distribution p(X, Z | θ) over the complete data set {X, Z} is straightforward.
But we only have the incomplete data set {X} and the posterior distribution p(Z | X, θ).

EM – Key Idea

Key idea of EM: as Z is not observed, work with an 'averaged' version Q(θ, θold) of the complete-data log likelihood ln p(X, Z | θ), averaged over all states of Z:
$$ Q(\theta, \theta^{\text{old}}) = \sum_{Z} p(Z \mid X, \theta^{\text{old}}) \ln p(X, Z \mid \theta) $$

EM Algorithm

1 Choose an initial setting for the parameters θold.

2 E step: Evaluate p(Z | X, θold).

3 M step: Evaluate θnew given by
$$ \theta^{\text{new}} = \arg\max_{\theta} Q(\theta, \theta^{\text{old}}), \quad \text{where} \quad Q(\theta, \theta^{\text{old}}) = \sum_{Z} p(Z \mid X, \theta^{\text{old}}) \ln p(X, Z \mid \theta) $$

4 Check for convergence of the log likelihood or of the parameter values. If not yet converged, set θold = θnew and go to step 2.

EM Algorithm – Convergence

Start with the product rule for the observed variables X, the unobserved variables Z, and the parameters θ:
$$ \ln p(X, Z \mid \theta) = \ln p(Z \mid X, \theta) + \ln p(X \mid \theta). $$
Multiply by an arbitrary distribution q(Z) and sum over Z:
$$ \sum_{Z} q(Z) \ln p(X, Z \mid \theta) = \sum_{Z} q(Z) \ln p(Z \mid X, \theta) + \ln p(X \mid \theta). $$
Rewrite as
$$ \ln p(X \mid \theta) = \underbrace{\sum_{Z} q(Z) \ln \frac{p(X, Z \mid \theta)}{q(Z)}}_{\mathcal{L}(q, \theta)} \; \underbrace{- \sum_{Z} q(Z) \ln \frac{p(Z \mid X, \theta)}{q(Z)}}_{\mathrm{KL}(q \parallel p)} $$
KL(q‖p) is the Kullback–Leibler divergence.

Kullback–Leibler Divergence

'Distance' between two distributions p(y) and q(y):
$$ \mathrm{KL}(q \parallel p) = \sum_{y} q(y) \ln \frac{q(y)}{p(y)} = - \sum_{y} q(y) \ln \frac{p(y)}{q(y)} $$
$$ \mathrm{KL}(q \parallel p) = \int q(y) \ln \frac{q(y)}{p(y)} \, dy = - \int q(y) \ln \frac{p(y)}{q(y)} \, dy $$
KL(q‖p) ≥ 0
Not symmetric: KL(q‖p) ≠ KL(p‖q)
KL(q‖p) = 0 iff q = p.
Invariant under parameter transformations.
Example: the Kullback–Leibler divergence between two normal distributions q(x) = N(x | µ1, σ1) and p(x) = N(x | µ2, σ2) is
$$ \mathrm{KL}(q \parallel p) = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2 \sigma_2^2} - \frac{1}{2} $$
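
As a sanity check of the closed form for two normal distributions, the sketch below compares it with a Monte Carlo estimate of E_q[ln q(x) − ln p(x)]; the particular parameter values are arbitrary.

import numpy as np

def kl_gauss(mu1, s1, mu2, s2):
    # Closed form from the slide: KL( N(mu1, s1^2) || N(mu2, s2^2) ).
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

def log_normal_pdf(x, mu, s):
    return -0.5 * np.log(2 * np.pi * s**2) - (x - mu)**2 / (2 * s**2)

mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0          # arbitrary example parameters
rng = np.random.default_rng(0)
x = rng.normal(mu1, s1, size=1_000_000)        # samples from q
mc = np.mean(log_normal_pdf(x, mu1, s1) - log_normal_pdf(x, mu2, s2))   # E_q[ln q - ln p]
print(kl_gauss(mu1, s1, mu2, s2), mc)          # the two estimates should agree closely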
EM Algorithm – Convergence

The two parts of ln p(X | θ):
$$ \ln p(X \mid \theta) = \underbrace{\sum_{Z} q(Z) \ln \frac{p(X, Z \mid \theta)}{q(Z)}}_{\mathcal{L}(q, \theta)} \; \underbrace{- \sum_{Z} q(Z) \ln \frac{p(Z \mid X, \theta)}{q(Z)}}_{\mathrm{KL}(q \parallel p)} $$
[Figure: decomposition of ln p(X | θ) into the lower bound L(q, θ) and KL(q‖p).]

EM Algorithm – E Step

Hold θold fixed. Maximise the lower bound L(q, θold) with respect to q(·).
L(q, θold) is a functional.
ln p(X | θ) does NOT depend on q(·).
The maximum of L(q, θold) occurs when the Kullback–Leibler divergence vanishes.
Therefore, choose q(Z) = p(Z | X, θold).
[Figure: after the E step, KL(q‖p) = 0 and L(q, θold) = ln p(X | θold).]

EM Algorithm – M Step

Hold q(·) = p(Z | X, θold) fixed.
Maximise the lower bound L(q, θ) with respect to θ:
$$ \theta^{\text{new}} = \arg\max_{\theta} \mathcal{L}(q, \theta) = \arg\max_{\theta} \sum_{Z} q(Z) \ln p(X, Z \mid \theta) $$
L(q, θnew) > L(q, θold) unless a maximum has already been reached.
As q(·) = p(Z | X, θold) is held fixed, p(Z | X, θnew) will in general not be equal to q(·), and therefore the Kullback–Leibler divergence will be greater than zero (unless converged).

$$ \ln p(X \mid \theta) = \underbrace{\sum_{Z} q(Z) \ln \frac{p(X, Z \mid \theta)}{q(Z)}}_{\mathcal{L}(q, \theta)} \; \underbrace{- \sum_{Z} q(Z) \ln \frac{p(Z \mid X, \theta)}{q(Z)}}_{\mathrm{KL}(q \parallel p)} $$
[Figure: after the M step, ln p(X | θnew) = L(q, θnew) + KL(q‖p), with KL(q‖p) > 0.]
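
The decomposition ln p(X | θ) = L(q, θ) + KL(q‖p) can be checked numerically on a tiny model with a single observation and a binary latent variable; all numbers below are arbitrary illustrations.

import numpy as np
from scipy.stats import norm

# Tiny model: binary latent z with p(z = k) = pi[k] and p(x | z = k) = N(x | m[k], 1).
pi = np.array([0.4, 0.6])
m = np.array([-1.0, 2.0])
x = 0.5                                        # a single observation

joint = pi * norm.pdf(x, loc=m, scale=1.0)     # p(x, z = k | theta)
evidence = joint.sum()                         # p(x | theta)
posterior = joint / evidence                   # p(z | x, theta)

q = np.array([0.7, 0.3])                       # an arbitrary distribution over z
L = np.sum(q * np.log(joint / q))              # lower bound L(q, theta)
KL = np.sum(q * np.log(q / posterior))         # KL( q || p(z | x, theta) )
print(np.log(evidence), L + KL)                # identical up to floating-point error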


EM Algorithm – Parameter View

[Figure: the lower bound L(q, θ) and ln p(X | θ) as functions of θ, with θold and θnew marked.]

Red curve: incomplete data likelihood. Blue curve: after the E step. Green curve: after the M step.
