
CSC 311: Introduction to Machine Learning
Tutorial 10 – EM Algorithm
University of Toronto


First, a brief overview of the Expectation-Maximization (EM) algorithm.
- In the lecture we used a Gaussian Mixture Model fitted with Maximum Likelihood (ML) estimation.
Today, we practice the EM algorithm on an image completion task.
We will use a Mixture of Bernoullis fitted with Maximum a Posteriori (MAP) estimation.
- Learning the parameters
- Posterior inference

The Generative Model
We'll be working with the following generative model for data D.
Assume a datapoint x is generated as follows:
- Choose a cluster z from {1, ..., K} such that p(z = k) = \pi_k.
- Given z, sample x from a probability distribution. (Earlier you saw the Gaussian N(x | \mu_z, I); now we will work with Bernoulli(\theta_z).)
This can also be written:

    p(z = k) = \pi_k
    p(x | z = k) = N(x | \mu_k, I) \;\text{or}\; \mathrm{Bernoulli}(\theta_k)

Maximum Likelihood with Latent Variables
How should we choose the parameters \{\pi_k, \mu_k\}_{k=1}^K?
Maximum likelihood principle: choose parameters to maximize the likelihood of the observed data.
We don't observe the cluster assignments z; we only see the data x.
Given data D = \{x^{(n)}\}_{n=1}^N, choose parameters to maximize:

    \log p(D) = \sum_{n=1}^N \log p(x^{(n)})

We can find p(x) by marginalizing out z:

    p(x) = \sum_{k=1}^K p(z = k, x) = \sum_{k=1}^K p(z = k) \, p(x | z = k)
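As a quick illustration of this marginalization (not part of the tutorial code), here is a minimal NumPy sketch that evaluates log p(x) for one datapoint by summing the per-cluster joint probabilities in log space; the arrays log_pi and log_px_given_z are assumed inputs.

import numpy as np
from scipy.special import logsumexp

def log_marginal(log_pi, log_px_given_z):
    """log p(x) = log sum_k pi_k p(x | z = k), computed stably in log space.

    log_pi:          (K,) log mixing proportions, log pi_k
    log_px_given_z:  (K,) log p(x | z = k) for a single datapoint x
    """
    return logsumexp(log_pi + log_px_given_z)

# Toy example with K = 2 clusters (made-up numbers for illustration).
log_pi = np.log(np.array([0.3, 0.7]))
log_px_given_z = np.log(np.array([0.05, 0.2]))
print(log_marginal(log_pi, log_px_given_z))   # equals log(0.3*0.05 + 0.7*0.2)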

Log-likelihood derivatives

\frac{\partial}{\partial \theta} \log p(x)
    = \frac{\partial}{\partial \theta} \log \sum_z p(x, z)
    = \frac{\frac{\partial}{\partial \theta} \sum_z p(x, z)}{\sum_{z'} p(x, z')}
    = \frac{\sum_z \frac{\partial}{\partial \theta} p(x, z)}{\sum_{z'} p(x, z')}
    = \sum_z \frac{p(x, z)}{\sum_{z'} p(x, z')} \frac{\partial}{\partial \theta} \log p(x, z)
    = \sum_z p(z | x) \frac{\partial}{\partial \theta} \log p(x, z)
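The final identity can be checked numerically. The sketch below (a toy example of my own, not from the tutorial) compares a finite-difference derivative of log p(x) with the posterior-weighted derivative of log p(x, z) for a two-component Bernoulli mixture over a single binary value; all parameter values are made up for illustration.

import numpy as np

pi = np.array([0.4, 0.6])      # mixing proportions (held fixed)
theta = np.array([0.2, 0.8])   # Bernoulli parameters theta_1, theta_2
x = 1                          # observed binary value

def joint(th):
    """p(x, z = k) for each k, as a length-2 array."""
    return pi * th**x * (1 - th)**(1 - x)

def log_marginal(th):
    return np.log(joint(th).sum())

# Left-hand side: finite-difference derivative of log p(x) wrt theta_1.
eps = 1e-6
th_plus, th_minus = theta.copy(), theta.copy()
th_plus[0] += eps
th_minus[0] -= eps
lhs = (log_marginal(th_plus) - log_marginal(th_minus)) / (2 * eps)

# Right-hand side: posterior-weighted derivative of log p(x, z).
post = joint(theta) / joint(theta).sum()                # p(z | x)
dlog_joint = x / theta[0] - (1 - x) / (1 - theta[0])    # d/d(theta_1) log p(x, z = 1)
rhs = post[0] * dlog_joint                              # the z = 2 term has zero derivative

print(lhs, rhs)   # the two values should agree to roughly 1e-6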

Expectation-Maximization algorithm
The Expectation-Maximization algorithm alternates between two steps:
1. E-step: Compute the posterior probabilities r_k^{(n)} = p(z^{(n)} = k | x^{(n)}) given our current model, i.e. how much do we think each cluster is responsible for generating each datapoint.
2. M-step: Use the equations on the last slide to update the parameters, assuming the r_k^{(n)} are held fixed: change the parameters of each distribution to maximize the probability that it would generate the data it is currently responsible for.

\frac{\partial}{\partial \theta} \log p(D)
    = \sum_{n=1}^N \frac{\partial}{\partial \theta} \log \sum_{k=1}^K p(z^{(n)} = k, x^{(n)})
    = \sum_{n=1}^N \sum_{k=1}^K p(z^{(n)} = k | x^{(n)}) \frac{\partial}{\partial \theta} \log p(x^{(n)}, z^{(n)} = k)
    = \sum_{n=1}^N \sum_{k=1}^K r_k^{(n)} \frac{\partial}{\partial \theta} \left[ \log \Pr(z^{(n)} = k) + \log p(x^{(n)} | z^{(n)} = k) \right]
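Structurally, the algorithm is just this alternation. Below is a minimal sketch of the loop, assuming hypothetical e_step and m_step callables like the ones sketched alongside the later slides; it is not the assignment's exact interface.

import numpy as np

def em(X, K, e_step, m_step, num_iters=50, seed=0):
    """Generic EM loop: alternate responsibilities (E) and parameter updates (M).

    X: (N, D) binary data matrix. e_step and m_step are supplied callables,
    e.g. the sketches given with the later slides.
    """
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)                      # start from uniform mixing proportions
    theta = rng.uniform(0.25, 0.75, size=(K, D))  # random initial Bernoulli parameters

    for _ in range(num_iters):
        R = e_step(X, pi, theta)   # (N, K) responsibilities r_k^{(n)}
        pi, theta = m_step(X, R)   # parameters that maximize the expected log joint
    return pi, theta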

Image Completion using Mixture of Bernoullis

A probabilistic model for the task of image completion.
We observe the top half of an image of a handwritten digit, and we would like to predict what is in the bottom half.
[Figure: example digit images. Left panel, "Given these observations...": the observed top halves. Right panel, "...you want to make these predictions": the bottom halves to be filled in.]

Mixture of Bernoullis model
Our dataset is a set of 28 × 28 binary images, represented as 784-dimensional binary vectors.
- N = 60,000, the number of training cases. The training cases are indexed by i.
- D = 28 × 28 = 784, the dimension of each observation vector. The dimensions are indexed by j.
Conditioned on the latent variable z = k, each pixel x_j is an independent Bernoulli random variable with parameter \theta_{k,j}:

    p(x^{(i)} | z = k) = \prod_{j=1}^D p(x_j^{(i)} | z = k) = \prod_{j=1}^D \theta_{k,j}^{x_j^{(i)}} (1 - \theta_{k,j})^{1 - x_j^{(i)}}
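This product of per-pixel Bernoullis is usually evaluated in log space. A minimal NumPy sketch (array shapes are assumptions, not the assignment's starter code):

import numpy as np

def bernoulli_log_likelihood(X, theta, eps=1e-12):
    """Return the (N, K) matrix of log p(x^{(i)} | z = k).

    X:     (N, D) binary observations
    theta: (K, D) Bernoulli parameters theta_{k,j}
    """
    log_theta = np.log(theta + eps)        # (K, D)
    log_1m_theta = np.log(1 - theta + eps)
    # log p(x | z = k) = sum_j [ x_j log theta_{k,j} + (1 - x_j) log(1 - theta_{k,j}) ]
    return X @ log_theta.T + (1 - X) @ log_1m_theta.T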

The Generative Process
This can be written out as the following generative process:
Sample z from a multinomial distribution with parameter \pi.
For j = 1, ..., D:
    Sample x_j from a Bernoulli distribution with parameter \theta_{k,j}, where k is the value of z.
It can also be written mathematically as:

    z \sim \mathrm{Multinomial}(\pi)
    x_j | z = k \sim \mathrm{Bernoulli}(\theta_{k,j})
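A minimal NumPy sketch of this sampling process, assuming pi and theta are given (again my own illustration, not the assignment's starter code):

import numpy as np

def sample_images(pi, theta, num_samples, seed=0):
    """Sample binary vectors from the mixture-of-Bernoullis generative process.

    pi:    (K,) mixing proportions
    theta: (K, D) Bernoulli parameters
    """
    rng = np.random.default_rng(seed)
    K, D = theta.shape
    z = rng.choice(K, size=num_samples, p=pi)                        # z ~ Multinomial(pi)
    X = (rng.random((num_samples, D)) < theta[z]).astype(np.int64)   # x_j ~ Bernoulli(theta_{z,j})
    return z, X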

Part 1: Learning the Parameters
In the first step, we'll learn the parameters of the model given the responsibilities (M-step of the EM algorithm).
We want to use the MAP criterion instead of maximum likelihood (ML) to fit the Mixture of Bernoullis model.
- The only difference is that we add a prior probability term to the ML objective function in the M-step.
- ML objective function:

    \sum_{i=1}^N \sum_{k=1}^K r_k^{(i)} \left[ \log \Pr(z^{(i)} = k) + \log p(x^{(i)} | z^{(i)} = k) \right]

- MAP objective function:

    \sum_{i=1}^N \sum_{k=1}^K r_k^{(i)} \left[ \log \Pr(z^{(i)} = k) + \log p(x^{(i)} | z^{(i)} = k) \right] + \log p(\pi) + \log p(\Theta)

Part 1: Learning the Parameters (Prior Distribution)
Use a Beta distribution as the prior for \Theta: every entry is drawn independently from a Beta distribution with parameters a and b:

    p(\theta_{k,j}) \propto \theta_{k,j}^{a-1} (1 - \theta_{k,j})^{b-1}

Use a Dirichlet distribution as the prior over the mixing proportions \pi:

    p(\pi) \propto \pi_1^{a_1 - 1} \pi_2^{a_2 - 1} \cdots \pi_K^{a_K - 1}

Part 1: Learning the Parameters
Derive the M-step update rules for \Theta and \pi by setting the partial derivatives of the MAP objective function to zero.

    J(\Theta, \pi) = \sum_{i=1}^N \sum_{k=1}^K r_k^{(i)} \left[ \log \Pr(z^{(i)} = k) + \log p(x^{(i)} | z^{(i)} = k) \right] + \log p(\pi) + \log p(\Theta)

    \pi_k = \ldots \qquad \theta_{k,j} = \ldots

Part 1: Learning the Parameters
J(\Theta, \pi) = \sum_{i=1}^N \sum_{k=1}^K r_k^{(i)} \left[ \log \Pr(z^{(i)} = k) + \log p(x^{(i)} | z^{(i)} = k) \right] + \log p(\pi) + \log p(\Theta)

    = \sum_{i=1}^N \sum_{k=1}^K r_k^{(i)} \left[ \log \pi_k + \sum_{j=1}^D \left( x_j^{(i)} \log \theta_{k,j} + (1 - x_j^{(i)}) \log(1 - \theta_{k,j}) \right) \right]
      + \sum_{k=1}^K (a_k - 1) \log \pi_k + \sum_{k=1}^K \sum_{j=1}^D \left[ (a - 1) \log \theta_{k,j} + (b - 1) \log(1 - \theta_{k,j}) \right] + C

Derivative wrt. \theta_{k,j}

J(\Theta, \pi) = \sum_{i=1}^N \sum_{k=1}^K r_k^{(i)} \left[ \log \pi_k + \sum_{j=1}^D \left( x_j^{(i)} \log \theta_{k,j} + (1 - x_j^{(i)}) \log(1 - \theta_{k,j}) \right) \right]
    + \sum_{k=1}^K (a_k - 1) \log \pi_k + \sum_{k=1}^K \sum_{j=1}^D \left[ (a - 1) \log \theta_{k,j} + (b - 1) \log(1 - \theta_{k,j}) \right] + C

First we take the derivative wrt. \theta_{k,j}:

\frac{\partial J}{\partial \theta_{k,j}}
    = \sum_{i=1}^N r_k^{(i)} \left( x_j^{(i)} \frac{1}{\theta_{k,j}} + (1 - x_j^{(i)}) \frac{1}{\theta_{k,j} - 1} \right) + (a - 1) \frac{1}{\theta_{k,j}} + (b - 1) \frac{1}{\theta_{k,j} - 1}
    = \frac{1}{\theta_{k,j}} \left[ \sum_{i=1}^N r_k^{(i)} x_j^{(i)} + (a - 1) \right] + \frac{1}{\theta_{k,j} - 1} \left[ \sum_{i=1}^N r_k^{(i)} - \sum_{i=1}^N r_k^{(i)} x_j^{(i)} + (b - 1) \right]

Setting this to zero, and multiplying both sides by \theta_{k,j} (\theta_{k,j} - 1), yields:

    0 = (\theta_{k,j} - 1) \left[ \sum_{i=1}^N r_k^{(i)} x_j^{(i)} + (a - 1) \right] + \theta_{k,j} \left[ \sum_{i=1}^N r_k^{(i)} - \sum_{i=1}^N r_k^{(i)} x_j^{(i)} + (b - 1) \right]

This gives:

    \theta_{k,j} = \frac{\sum_{i=1}^N r_k^{(i)} x_j^{(i)} + (a - 1)}{\sum_{i=1}^N r_k^{(i)} x_j^{(i)} + (a - 1) + \sum_{i=1}^N r_k^{(i)} - \sum_{i=1}^N r_k^{(i)} x_j^{(i)} + (b - 1)}
               = \frac{\sum_{i=1}^N r_k^{(i)} x_j^{(i)} + a - 1}{\sum_{i=1}^N r_k^{(i)} + a + b - 2}
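This update vectorizes directly. A sketch assuming R is the (N, K) responsibility matrix, X the (N, D) binary data, and a = b = 2 an assumed choice of Beta hyperparameters:

import numpy as np

def update_theta(X, R, a=2.0, b=2.0):
    """MAP M-step update for theta_{k,j} with a Beta(a, b) prior.

    theta_{k,j} = (sum_i r_k^{(i)} x_j^{(i)} + a - 1) / (sum_i r_k^{(i)} + a + b - 2)
    Returns a (K, D) matrix.
    """
    numerator = R.T @ X + (a - 1)                        # (K, D)
    denominator = R.sum(axis=0)[:, None] + (a + b - 2)   # (K, 1), broadcasts over pixels
    return numerator / denominator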

Derivative wrt. \pi_k

Now we take the derivative wrt. \pi_k.
Note that it is a bit trickier, because we need to account for the constraint \sum_{k=1}^K \pi_k = 1.
This can be done with the use of a Lagrange multiplier. Let

    \tilde{J} = J + \lambda \left( \sum_{k=1}^K \pi_k - 1 \right)

    \frac{\partial \tilde{J}}{\partial \pi_k} = \sum_{i=1}^N r_k^{(i)} \frac{1}{\pi_k} + (a_k - 1) \frac{1}{\pi_k} + \lambda

Setting this to zero, we get:

    \pi_k = -\frac{(a_k - 1) + \sum_{i=1}^N r_k^{(i)}}{\lambda}

Knowing that \pi_k sums to one, we obtain:

    \pi_k = \frac{(a_k - 1) + \sum_{i=1}^N r_k^{(i)}}{\sum_{k=1}^K \left[ (a_k - 1) + \sum_{i=1}^N r_k^{(i)} \right]}
          = \frac{(a_k - 1) + \sum_{i=1}^N r_k^{(i)}}{N + \sum_{k=1}^K (a_k - 1)}

(We used \sum_{i=1}^N \sum_{k=1}^K r_k^{(i)} = \sum_{i=1}^N 1 = N.)
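A corresponding sketch for the mixing-proportion update, with the Dirichlet hyperparameters a_k passed as an assumed alpha array (defaulting to all ones, which recovers the maximum-likelihood update):

import numpy as np

def update_pi(R, alpha=None):
    """MAP M-step update for the mixing proportions with a Dirichlet(alpha) prior.

    pi_k = (a_k - 1 + sum_i r_k^{(i)}) / (N + sum_k (a_k - 1))
    R: (N, K) responsibilities. With alpha = 1 this reduces to
    pi_k = (1/N) sum_i r_k^{(i)}.
    """
    N, K = R.shape
    if alpha is None:
        alpha = np.ones(K)
    numerator = (alpha - 1) + R.sum(axis=0)
    return numerator / (N + (alpha - 1).sum())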

Part 2: Posterior inference
We represent partial observations in terms of variables m_j^{(i)}, where m_j^{(i)} = 1 if the jth pixel of the ith image is observed, and 0 otherwise.
Derive the posterior probability distribution p(z | x_{obs}), where x_{obs} denotes the subset of the pixels which are observed.
Using Bayes' rule, we have:

    p(z = k | x_{obs}) = \frac{p(x_{obs} | z = k) \, p(z = k)}{p(x_{obs})}
        = \frac{\pi_k \prod_{j=1}^D \theta_{k,j}^{m_j x_j} (1 - \theta_{k,j})^{m_j (1 - x_j)}}{\sum_{l=1}^K \pi_l \prod_{j=1}^D \theta_{l,j}^{m_j x_j} (1 - \theta_{l,j})^{m_j (1 - x_j)}}
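In code, the mask simply zeroes out the contribution of unobserved pixels to the log-likelihood. A sketch assuming X and the binary mask M are both (N, D) arrays (these names are illustrative, not the assignment's interface):

import numpy as np
from scipy.special import logsumexp

def e_step_masked(X, M, pi, theta, eps=1e-12):
    """Responsibilities p(z = k | x_obs) using only the observed pixels.

    X:     (N, D) binary data (values at unobserved pixels are ignored)
    M:     (N, D) mask, 1 where the pixel is observed
    pi:    (K,) mixing proportions
    theta: (K, D) Bernoulli parameters
    Returns the (N, K) responsibility matrix.
    """
    log_theta = np.log(theta + eps)
    log_1m_theta = np.log(1 - theta + eps)
    # Only observed pixels (m_j = 1) contribute to log p(x_obs | z = k).
    log_lik = (M * X) @ log_theta.T + (M * (1 - X)) @ log_1m_theta.T   # (N, K)
    log_post = np.log(pi + eps) + log_lik
    return np.exp(log_post - logsumexp(log_post, axis=1, keepdims=True))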

Part 3: Posterior Predictive Mean
Compute the posterior predictive means of the missing pixels given the observed ones.
The posterior predictive distribution is:
    p(x_2 | x_1) = \sum_z p(z | x_1) \, p(x_2 | z, x_1)
Assume that the xi values are conditionally independent given z. For instance, the pixels in one half of an image are clearly not independent of the pixels in the other half. But they are roughly independent, conditioned on a detailed description of everything going on in the image.
So we have:
    E[x_{mis} | x_{obs}] = \sum_{k=1}^K r_k \, p(x_{mis} = 1 | z = k) = \sum_{k=1}^K r_k \, \theta_{k, mis}
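Given responsibilities computed from the observed pixels (for example with the e_step_masked sketch above), the completion itself is a single matrix product:

import numpy as np

def posterior_predictive_means(R, theta):
    """E[x_j | x_obs] = sum_k r_k theta_{k,j}, evaluated for every pixel of every image.

    R:     (N, K) responsibilities p(z = k | x_obs)
    theta: (K, D) Bernoulli parameters
    Returns an (N, D) array; entries at unobserved pixels are the model's completions.
    """
    return R @ theta

To display a completed image, one would keep the observed pixels from x and substitute these predictive means only where the mask is zero.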

Questions?
