
Statistical Machine Learning

©2020 Ong & Walder & Webers
Data61 | CSIRO, The Australian National University

Outlines

Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2

Statistical Machine Learning

Christian Walder

Machine Learning Research Group
CSIRO Data61

and

College of Engineering and Computer Science
The Australian National University

Canberra
Semester One, 2020.

(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)


Part XI

Mixture Models and EM 1


Marginalisation

Sum rule: p(A,B) = ∑_C p(A,B,C)

Product rule: p(A,B) = p(A|B) p(B)
Why do we optimize the log likelihood?
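A standard answer (my illustration, not spelled out on the slide): the log is monotonic, so maximising the log likelihood gives the same parameters as maximising the likelihood, while turning a product of many small likelihood terms into a numerically stable sum. A minimal sketch in Python with numpy, using made-up probability values:

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(1e-4, 1e-1, size=1000)  # 1000 i.i.d. likelihood terms, each small

print(np.prod(p))         # the raw likelihood underflows to 0.0 in float64
print(np.sum(np.log(p)))  # the log likelihood is a finite, well-behaved number
```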


Strategy in this course

Estimate best predictor = training = learning
Given data (x1, y1), . . . , (xn, yn), find a predictor fw(·).

1 Identify the type of input x and output y data
2 Propose a (linear) mathematical model for fw
3 Design an objective function or likelihood
4 Calculate the optimal parameter (w)
5 Model uncertainty using the Bayesian approach
6 Implement and compute (the algorithm in python)
7 Interpret and diagnose results

We will study unsupervised learning this week


Mixture Models and EM

Complex marginal distributions over observed variables
can be expressed via more tractable joint distributions over
the expanded space of observed and latent variables.
Mixture Models can also be used to cluster data.
General technique for finding maximum likelihood
estimators in latent variable models:
expectation-maximisation (EM) algorithm.


Example – Wallaby Distribution

Introduced very recently to show . . .

[Plot: a density over x ∈ [−20, 20], with values up to about 0.30.]


Example – ’Wallaby’ Distribution

. . . that already a mixture of three Gaussians can be fun.

p(x) = (3/10) N(x | 5, 0.5) + (3/10) N(x | 9, 2) + (4/10) N(x | 2, 20)

[Plot of p(x) over x ∈ [−20, 20], with density values up to about 0.30.]
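A minimal sketch of this density in Python (scipy assumed). The second argument of N(· | µ, σ) is read here as the standard deviation σ, matching the (µ, σ) pairs on the next slide; if it were meant as a variance, its square root would be passed instead:

```python
import numpy as np
from scipy.stats import norm

def wallaby_pdf(x):
    # weighted sum of three Gaussian densities; the weights 3/10, 3/10, 4/10 sum to 1
    return (0.3 * norm.pdf(x, loc=5, scale=0.5)
            + 0.3 * norm.pdf(x, loc=9, scale=2)
            + 0.4 * norm.pdf(x, loc=2, scale=20))

xs = np.linspace(-20, 20, 401)
print(wallaby_pdf(xs).max())                      # peak of the density, roughly 0.26
print((wallaby_pdf(xs) * (xs[1] - xs[0])).sum())  # mass inside [-20, 20]; below 1 because the sigma = 20 component extends beyond this range
```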


Example – ’Wallaby’ Distribution

Use µ, σ as latent variables and define a distribution

p(µ, σ) =
    3/10  if (µ, σ) = (5, 0.5)
    3/10  if (µ, σ) = (9, 2)
    4/10  if (µ, σ) = (2, 20)
    0     otherwise.

p(x) = ∫_0^∞ ∫_{−∞}^{∞} p(x, µ, σ) dµ dσ
     = ∫_0^∞ ∫_{−∞}^{∞} p(x | µ, σ) p(µ, σ) dµ dσ
     = (3/10) N(x | 5, 0.5) + (3/10) N(x | 9, 2) + (4/10) N(x | 2, 20)

[Plot of p(x) over x ∈ [−20, 20], as on the previous slide.]


K-means Clustering

Given a set of data {x1, . . . , xN} where xn ∈ RD,
n = 1, . . . ,N.
Goal: Partition the data into K clusters.

Each cluster contains points close to each other.
Introduce a prototype µk ∈ RD for each cluster.
Goal: Find

1 a set of prototypes µk, k = 1, . . . ,K, each representing a
different cluster.

2 an assignment of each data point to exactly one cluster.



K-means Clustering – The Algorithm

Start with arbitrarily chosen prototypes µk, k = 1, . . . ,K.
1 Assign each data point to the closest prototype.
2 Calculate new prototypes as the mean of all data points

assigned to each of them.

In the following, we will formalise this by introducing
notation which will be useful later.


K-means Clustering – Notation

Binary indicator variables

rnk = 1 if data point xn belongs to cluster k, and 0 otherwise,

using the 1-of-K coding scheme.
Define a distortion measure

J = ∑_{n=1}^N ∑_{k=1}^K rnk ‖xn − µk‖²

Find the values for {rnk} and {µk} so as to minimise J.


K-means Clustering – Notation

Find the values for {rnk} and {µk} so as to minimise J.
But {rnk} depends on {µk}, and {µk} depends on {rnk}.

Iterate until no further change:

1 Minimise J w.r.t. rnk while keeping {µk} fixed,

  rnk = 1 if k = argmin_j ‖xn − µj‖², and 0 otherwise,   for all n = 1, . . . , N.

  Expectation step

2 Minimise J w.r.t. {µk} while keeping rnk fixed,

  0 = 2 ∑_{n=1}^N rnk (xn − µk)   ⟹   µk = ∑_{n=1}^N rnk xn / ∑_{n=1}^N rnk

  Maximisation step
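A minimal numpy sketch of these two alternating steps (my own illustration, not the lecture's reference implementation; the function name and the initialisation choice are assumptions):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain K-means: X has shape (N, D); returns prototypes mu (K, D), hard assignments, and J."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]            # arbitrarily chosen prototypes
    for _ in range(n_iter):
        # E step: assign each point to its closest prototype (r_nk stored as one index per point)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)   # (N, K) squared distances
        assign = d2.argmin(axis=1)
        # M step: each prototype becomes the mean of the points assigned to it
        new_mu = np.array([X[assign == k].mean(axis=0) if np.any(assign == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):                              # no further change
            break
        mu = new_mu
    J = d2[np.arange(len(X)), assign].sum()                      # distortion measure J
    return mu, assign, J
```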



K-means Clustering – Example

[Figure: K-means iterations on a two-dimensional data set in [−2, 2] × [−2, 2], panels (a)–(f).]


K-means Clustering – Example

[Figure: the same example continued, panels (d)–(i).]


K-means Clustering – Cost Function

[Plot: the cost J (0 to 1000) against iteration number 1 to 4.]

Cost function J after each E step (blue points) and M step (red points).


K-means Clustering – Notes

The initial condition is crucial for convergence.
What happens if at least one cluster centre is too far from
all data points?
The computationally complex step is finding the nearest
neighbour (use the triangle inequality, build k-d trees, …).
Generalise to non-Euclidean dissimilarity measures
V(xn, µk) (called the K-medoids algorithm),

J̃ = ∑_{n=1}^N ∑_{k=1}^K rnk V(xn, µk).

Online stochastic algorithm
1 Draw a data point xn and locate the nearest prototype µk.
2 Update only µk using a decreasing learning rate ηn,

µk^new = µk^old + ηn (xn − µk^old).
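A sketch of this online update in Python (a sequential K-means step; the per-cluster 1/count learning rate below is one common choice and is my assumption, not prescribed by the slide):

```python
import numpy as np

def online_kmeans_step(x, mu, counts):
    """One stochastic update: move the nearest prototype a little towards the new point x."""
    k = ((mu - x) ** 2).sum(axis=1).argmin()   # locate the nearest prototype
    counts[k] += 1
    eta = 1.0 / counts[k]                      # decreasing learning rate eta_n
    mu[k] += eta * (x - mu[k])                 # mu_new = mu_old + eta_n * (x - mu_old)
    return mu, counts
```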


K-means Clustering – Image Segmentation

Segment an image into regions of reasonably
homogeneous appearance.
Each pixel is a point in R3 (red, blue, green). (Note that the
pixel intensities are bounded in the range [0, 1] and
therefore this space is, strictly speaking, not Euclidean.)
Run K-means on all points of the image until convergence.
Replace all pixels with the corresponding mean µk.
This results in an image with a palette of only K different colours.
There are much better approaches to image segmentation
(it is an active research topic); this example serves only to
illustrate K-means.
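A sketch of this procedure (scikit-learn and matplotlib are assumed to be available, and "input.png" is a hypothetical file name; the numpy K-means sketch above would work equally well):

```python
import matplotlib.image as mpimg
from sklearn.cluster import KMeans

img = mpimg.imread("input.png")          # hypothetical image file; floats in [0, 1], shape (H, W, 3) or (H, W, 4)
pixels = img[:, :, :3].reshape(-1, 3)    # each pixel as a point in R^3

km = KMeans(n_clusters=10, n_init=10).fit(pixels)                         # run K-means on all pixels
segmented = km.cluster_centers_[km.labels_].reshape(img[:, :, :3].shape)  # replace each pixel by its prototype mu_k
```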


Illustrating K-means Clustering – Segmentation

[Figure: segmentation results for K = 2, K = 3, and K = 10, next to the original image.]


K-means Clustering – Compression

Lossy data compression: accept some errors in the
reconstruction as a trade-off for higher compression.
Apply K-means to the data.
Store the code-book vectors µk.
Store the data in the form of references (labels) to the
code-book. Each data point has a label in the range
[1, . . . ,K].
New data points are also compressed by finding the
closest code-book vector and then storing only the label.
This technique is also called vector quantisation.
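The storage percentages on the next slide are consistent with spending ⌈log2 K⌉ bits per pixel on the label versus 24 bits per original RGB pixel, ignoring the small code-book overhead (this back-of-the-envelope reading is mine, not stated on the slide):

```python
import math

def compression_ratio(K, bits_per_pixel=24):
    """Fraction of the original size when each pixel stores only a code-book label."""
    return math.ceil(math.log2(K)) / bits_per_pixel

for K in (2, 3, 10):
    print(K, f"{compression_ratio(K):.1%}")   # 2: 4.2%, 3: 8.3%, 10: 16.7%
```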


Illustrating K-means Clustering – Compression

[Figure: compressed images and their storage relative to the original — K = 2: 4.2%, K = 3: 8.3%, K = 10: 16.7%; original image: 100%.]


Latent variable modeling

We have already seen a mixture of two Gaussians for
linear classification.
However, in the clustering scenario, we do not observe the
class membership.
Strategy (this is vague)

We have a difficult distribution p(x)
We introduce a new variable z to get p(x, z)
Model p(x, z) = p(x|z)p(z) with easy distributions


Mixture of Gaussians

A Gaussian mixture distribution is a linear superposition of
Gaussians of the form

p(x) = ∑_{k=1}^K πk N(x | µk, Σk).

As ∫ p(x) dx = 1, it follows that ∑_{k=1}^K πk = 1.
Let us write this with the help of a latent variable z.

Definition (Latent variables)

Latent variables (as opposed to observable variables), are
variables that are not directly observed but are rather inferred
(through a mathematical model) from other variables that are
observed and directly measured. They are also sometimes
called hidden variables, model parameters, or hypothetical
variables.


Mixture of Gaussians

Let z ∈ {0, 1}^K with ∑_{k=1}^K zk = 1. In words, z is a
K-dimensional vector in 1-of-K representation.
There are exactly K different possible vectors z depending
on which of the K entries is 1.
Define the joint distribution p(x, z) in terms of a marginal
distribution p(z) and a conditional distribution p(x | z) as

p(x, z) = p(z) p(x | z)

[Graphical model with latent node z and observed node x.]


Mixture of Gaussians

Set the marginal distribution to

p(zk = 1) = πk

where 0 ≤ πk ≤ 1 together with ∑_{k=1}^K πk = 1.
Because z uses 1-of-K coding, we can also write

p(z) = ∏_{k=1}^K πk^{zk}.

Set the conditional distribution of x given a particular z to

p(x | zk = 1) = N(x | µk, Σk),

or

p(x | z) = ∏_{k=1}^K N(x | µk, Σk)^{zk}.


Mixture of Gaussians

The marginal distribution over x is now found by summing
the joint distribution over all possible states of z:

p(x) = ∑_z p(z) p(x | z) = ∑_z ∏_{k=1}^K πk^{zk} ∏_{k=1}^K N(x | µk, Σk)^{zk}
     = ∑_{k=1}^K πk N(x | µk, Σk)

The marginal distribution of x is a Gaussian mixture.
For several observations x1, . . . , xN we need one latent
variable zn per observation.
What have we gained? We can now work with the joint
distribution p(x, z). This will lead to significant simplifications
later, especially for the EM algorithm.


Mixture of Gaussians

Conditional probability of z given x by Bayes’ theorem:

γ(zk) ≡ p(zk = 1 | x) = p(zk = 1) p(x | zk = 1) / ∑_{j=1}^K p(zj = 1) p(x | zj = 1)
                      = πk N(x | µk, Σk) / ∑_{j=1}^K πj N(x | µj, Σj)

γ(zk) is the responsibility of component k to ‘explain’ the
observation x.
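A minimal sketch computing these responsibilities for a one-dimensional mixture (scipy assumed; the ‘Wallaby’ parameters from earlier are reused purely as an example):

```python
import numpy as np
from scipy.stats import norm

pi    = np.array([0.3, 0.3, 0.4])    # mixing coefficients pi_k
mu    = np.array([5.0, 9.0, 2.0])    # component means
sigma = np.array([0.5, 2.0, 20.0])   # component standard deviations

def responsibilities(x):
    """gamma(z_k) = pi_k N(x | mu_k, sigma_k) / sum_j pi_j N(x | mu_j, sigma_j)."""
    weighted = pi * norm.pdf(x, loc=mu, scale=sigma)
    return weighted / weighted.sum()

print(responsibilities(5.0))   # the narrow component centred at 5 takes most of the responsibility
```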


Mixture of Gaussians – Ancestral Sampling

Goal: Generate random samples distributed according to
the mixture model.

1 Generate a sample ẑ from the distribution p(z).
2 Generate a value x̂ from the conditional distribution p(x | ẑ).

Example: Mixture of 3 Gaussians, 500 points.

[Figure, three scatter plots on [0, 1] × [0, 1]: (a) samples coloured by the original states of z; (b) samples from the marginal p(x); (c) samples with (R, G, B) colours mixed according to γ(znk).]
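A sketch of ancestral sampling for a one-dimensional Gaussian mixture (numpy only; the parameters are illustrative, not those of the figure):

```python
import numpy as np

rng = np.random.default_rng(0)
pi    = np.array([0.3, 0.3, 0.4])
mu    = np.array([5.0, 9.0, 2.0])
sigma = np.array([0.5, 2.0, 20.0])

N = 500
z = rng.choice(len(pi), size=N, p=pi)   # step 1: draw z_hat from p(z)
x = rng.normal(mu[z], sigma[z])         # step 2: draw x_hat from p(x | z_hat)
```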


Mixture of Gaussians – Maximum Likelihood

Given N data points, each of dimension D, we have the
data matrix X ∈ RN×D where each row contains one data
point.
Similarly, we have the matrix of latent variables Z ∈ RN×K
with rows zTn .
Assuming the data are drawn i.i.d., the distribution of the
data can be represented by a graphical model.

[Graphical model: a plate over n = 1, . . . , N containing zn and xn, with parameters π, µ and Σ.]


Mixture of Gaussians – Maximum Likelihood

The log of the likelihood function is then

ln p(X | π, µ, Σ) = ∑_{n=1}^N ln { ∑_{k=1}^K πk N(xn | µk, Σk) }

Significant problem: If a mean µj ‘sits’ directly on a data
point xn then

N(xn | xn, σj² I) = 1 / ((2π)^{1/2} σj),

which diverges as σj → 0, so the log likelihood is unbounded.
Here we assumed Σk = σk² I, but the problem is general; just
think of a principal-axis transformation of Σk.
This is overfitting (in disguise) occurring again with the
maximum likelihood approach.
Use heuristics to detect this situation and reset the mean
of the corresponding component of the mixture.
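A sketch of evaluating this log likelihood for one-dimensional data (scipy assumed; the logsumexp trick for numerical stability is my addition, not discussed on the slide). Shrinking the width of a component whose mean sits on a data point makes the value blow up, illustrating the singularity:

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def gmm_log_likelihood(X, pi, mu, sigma):
    """ln p(X | pi, mu, sigma) = sum_n ln sum_k pi_k N(x_n | mu_k, sigma_k), for 1-D data."""
    log_terms = np.log(pi) + norm.logpdf(X[:, None], loc=mu, scale=sigma)  # shape (N, K)
    return logsumexp(log_terms, axis=1).sum()

X = np.array([0.0, 1.0, 2.0, 3.0])
for s in (1.0, 1e-2, 1e-4):   # second component centred on the data point 0.0, width shrinking
    print(s, gmm_log_likelihood(X, np.array([0.5, 0.5]), np.array([2.0, 0.0]), np.array([1.0, s])))
```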


Mixture of Gaussians – Maximum Likelihood

A K-component mixture has a total of K! equivalent
solutions, corresponding to the K! ways of assigning K sets
of parameters to K components.
This is also called the identifiability problem. It needs to be
considered when the parameters discovered by a model are
interpreted.
Maximising the log likelihood of a Gaussian mixture is
more complex than for a single Gaussian. The summation over
all K components inside the logarithm makes it harder.
Setting the derivatives of the log likelihood to zero no
longer results in a closed-form solution.
We may use gradient-based optimisation,
or the EM algorithm. Stay tuned.
