CS5487 Problem Set 4
Mixture models and the EM algorithm
Department of Computer Science, City University of Hong Kong
Mixture models
Problem 4.1 Mean and variance of a mixture model
Consider a d-dimensional vector r.v. x with a mixture distribution given by

  p(x) = \sum_{j=1}^{K} \pi_j \, p(x|j),                                                              (4.1)

where the mean and covariance of each component p(x|j) are μ_j and Σ_j. Show that the mean and covariance of x are

  E[x] = \sum_{j=1}^{K} \pi_j \mu_j,  \qquad
  \mathrm{cov}(x) = \sum_{j=1}^{K} \pi_j (\Sigma_j + \mu_j \mu_j^T) - E[x]\,E[x]^T.                    (4.2)
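Hint: the mean follows by pushing the expectation through the mixture sum; a similar expansion of E[x x^T] gives the covariance:

  E[x] = \int x\, p(x)\, dx = \sum_{j=1}^{K} \pi_j \int x\, p(x|j)\, dx = \sum_{j=1}^{K} \pi_j \mu_j.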
………

Problem 4.2 Direct derivation of MLE of a GMM

In this problem we will directly derive the maximum likelihood estimate for a Gaussian mixture model. Let r.v. x be distributed according to a GMM,

  p(x|\theta) = \sum_{j=1}^{K} \pi_j \mathcal{N}(x|\mu_j, \sigma_j^2),                                 (4.3)

where θ = {π_j, μ_j, σ_j²}_{j=1}^K are the parameters. Given a set of i.i.d. samples, D = {x_1, ..., x_n}, the maximum likelihood estimate is given by

  \hat{\theta} = \arg\max_{\theta} \ell(\theta),  \qquad  \ell(\theta) = \sum_{i=1}^{n} \log p(x_i|\theta).          (4.4)

(a) Derive the ML estimate for the mean μ_j by taking the partial derivative of ℓ(θ) w.r.t. μ_j, setting it to zero, and solving for μ_j. Repeat for the variance σ_j² and the priors π_j. For the latter, use Lagrange multipliers to enforce the constraint Σ_j π_j = 1 (see Problem 4.12).
(b) Verify that your estimate in (a) is not a closed-form solution (i.e., each estimate depends on itself). Suggest an iterative scheme for finding a solution. How is it related to the EM algorithm for GMMs?
EM algorithm for mixture models
Problem 4.3 EM for multivariate GMMs
Let x be a d-dimensional vector r.v. distributed as a multivariate Gaussian mixture model,

  p(x) = \sum_{j=1}^{K} \pi_j \mathcal{N}(x|\mu_j, \Sigma_j),                                          (4.5)

where θ = {π_j, μ_j, Σ_j}_{j=1}^K are the parameters. Let X = {x_1, ..., x_n} be a set of observed samples, and Z = {z_1, ..., z_n} the set of corresponding hidden values.
(a) Write down the complete data log-likelihood, log p(X, Z|θ).
(b) E-step: derive the Q function, Q(θ; θ̂^old), and show that the E-step consists of calculating the soft assignments,

  \hat{z}_{ij} = p(z_i = j | x_i, \hat{\theta}^{old})
               = \frac{\hat{\pi}_j \mathcal{N}(x_i|\hat{\mu}_j, \hat{\Sigma}_j)}{\sum_{k=1}^{K} \hat{\pi}_k \mathcal{N}(x_i|\hat{\mu}_k, \hat{\Sigma}_k)}.          (4.6)

(c) M-step: show that the parameters that maximize the Q function are

  \hat{\mu}_j = \frac{1}{\hat{N}_j} \sum_{i=1}^{n} \hat{z}_{ij}\, x_i,  \quad
  \hat{\Sigma}_j = \frac{1}{\hat{N}_j} \sum_{i=1}^{n} \hat{z}_{ij} (x_i - \hat{\mu}_j)(x_i - \hat{\mu}_j)^T,  \quad
  \hat{\pi}_j = \frac{\hat{N}_j}{n},  \quad
  \hat{N}_j = \sum_{i=1}^{n} \hat{z}_{ij}.          (4.7)

Hint: use Lagrange multipliers (Problem 4.12) to find π̂_j. The derivatives in Problem 2.6 will be helpful.
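For reference when implementing, here is a minimal numpy sketch of one EM iteration built directly from (4.6) and (4.7); the function name and the use of scipy.stats for the component densities are illustrative choices, not part of the problem.

import numpy as np
from scipy.stats import multivariate_normal

def em_step_gmm(X, pi, mu, Sigma):
    """One EM iteration for a K-component GMM.
    X: (n, d) data; pi: (K,); mu: (K, d); Sigma: (K, d, d)."""
    n, d = X.shape
    K = len(pi)

    # E-step: soft assignments z_hat[i, j] from (4.6)
    z_hat = np.zeros((n, K))
    for j in range(K):
        z_hat[:, j] = pi[j] * multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j])
    z_hat /= z_hat.sum(axis=1, keepdims=True)

    # M-step: updates from (4.7)
    N_hat = z_hat.sum(axis=0)                      # soft counts N_hat[j]
    pi_new = N_hat / n
    mu_new = (z_hat.T @ X) / N_hat[:, None]
    Sigma_new = np.zeros((K, d, d))
    for j in range(K):
        D = X - mu_new[j]                          # centered data
        Sigma_new[j] = (z_hat[:, j, None] * D).T @ D / N_hat[j]
    return pi_new, mu_new, Sigma_new

In practice the E-step should be computed in the log domain (see Problem 4.13) to avoid underflow.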
Problem 4.4 EM for multivariate GMMs with shared covariance
Repeat Problem 4.3 for the special case in which the covariance matrices Σ_j of all components are constrained to have a common value Σ, i.e.,

  \Sigma_1 = \Sigma_2 = \cdots = \Sigma_K = \Sigma.                                                    (4.8)

………
Problem 4.5 Flying Bombs, part II – EM for mixtures of Poissons
Let's reconsider Problem 2.1, where we fit a Poisson distribution to the number of flying bombs hitting different areas in London. If we assume that the Germans were indeed targeting specific areas, then the bomb hit rate would be higher for some squares (the targets), and lower for others (not the targets). Hence, the distribution over all squares should be a mixture of Poissons, with each Poisson component corresponding to squares with a particular hit rate. For K = 2, the components would correspond to target squares and non-target squares. For K > 2, one component would correspond to the target hit rate, while the other (non-target) components would have some gradation of hit rates (with squares far away from the target squares having lower hit rates).
(a) Consider the mixture of Poisson distributions

  p(x = k|\theta) = \sum_{j=1}^{K} \pi_j \frac{\lambda_j^k}{k!} e^{-\lambda_j},                        (4.9)

where λ_j is the rate parameter of component j, and θ = {λ_j, π_j}_{j=1}^K are the parameters of the mixture. Derive the EM algorithm to estimate the parameters of the model given samples X = {x_1, ..., x_n}. How is the M-step related to the ML estimate for a Poisson (Problem 2.1)?
(b) Implement your algorithm (a starter sketch is given after this problem) and run it for different values of K ∈ {1, 2, 3, 4, 5} on the following data obtained from 2 cities (learn a separate mixture model for each city):
  number of hits (k)                        0     1     2     3     4     5 and over
  London  (number of cells with k hits)     229   211   93    35
  Antwerp (number of cells with k hits)     325   115   67    30
What conclusions can you make about the attacks on each city? Is there any evidence to suggest there is specific targeting of areas in London or Antwerp?
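For part (b), here is a minimal numpy sketch of one possible implementation, assuming the EM updates take the usual mixture form (soft assignments in the E-step, weighted Poisson ML estimates in the M-step); the function name and initialization are illustrative only.

import numpy as np
from scipy.stats import poisson

def em_poisson_mixture(counts, freqs, K, n_iter=200, seed=0):
    """EM for a K-component Poisson mixture on histogram data.
    counts: array of k values; freqs: number of cells with each k.
    (The '5 and over' bin is treated as exactly 5 for simplicity.)"""
    rng = np.random.default_rng(seed)
    x = np.repeat(counts, freqs).astype(float)       # expand histogram to samples
    n = len(x)
    pi = np.full(K, 1.0 / K)
    lam = rng.uniform(0.1, x.mean() + 1.0, size=K)   # random initial rates

    for _ in range(n_iter):
        # E-step: soft assignments over the K x n grid of component likelihoods
        lik = pi[:, None] * poisson.pmf(x[None, :], lam[:, None])
        z = lik / lik.sum(axis=0, keepdims=True)
        # M-step: weighted Poisson ML estimates and updated priors
        N = z.sum(axis=1)
        lam = (z @ x) / N
        pi = N / n

    # data log-likelihood, useful for comparing different K
    loglik = np.log((pi[:, None] * poisson.pmf(x[None, :], lam[:, None])).sum(axis=0)).sum()
    return pi, lam, loglik

# Example call (illustrative; use the cell counts from the table above):
# pi, lam, ll = em_poisson_mixture(np.arange(6), london_cell_counts, K=2)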
………

Problem 4.6 Mixture of exponentials

Consider a mixture of exponential densities,

  p(x) = \sum_{j=1}^{K} \pi_j \, p(x|j),  \qquad  p(x|j) = \lambda_j e^{-\lambda_j x},                 (4.10)

where λ_j > 0 is the parameter of component j. Given a set of samples {x_1, ..., x_n}, derive the EM algorithm to estimate the parameters θ = {π_j, λ_j}_{j=1}^K.

………

EM for other things

Besides mixture models, EM can also be used for learning when there is other missing information (e.g., missing feature values).

Problem 4.7 EM and missing features

In this problem, we will investigate using EM to learn a multivariate distribution from training data when some of the features are missing. Suppose we have two r.v.'s, x and y, that are jointly Gaussian,

  p(x, y) = \mathcal{N}\!\left( \begin{bmatrix} x \\ y \end{bmatrix} \Big|\, \mu, \Sigma \right),       (4.11)

where μ = [μ_x, μ_y]^T is the mean and Σ the covariance. We have collected a set of samples D = {x_i, y_i}_{i=1}^n to learn the parameters of the model. However, upon examining the samples, we find that the sample y_1 is corrupted, while x_1 is still okay. Rather than throw away both x_1 and y_1, we can treat the data point y_1 as "missing information" and then use EM to find the model parameters. Denote Y_I = {y_2, ..., y_n} as the incomplete data of Y. Let's consider the case where the covariance is diagonal,

  \Sigma = \begin{bmatrix} \sigma_x^2 & 0 \\ 0 & \sigma_y^2 \end{bmatrix},

i.e., x and y are independent univariate Gaussians.
(a) Write down the complete data log-likelihood, log p(X, Y|θ).
(b) E-step: Show that the Q function can be written as (up to an additive constant)

  Q(\theta; \hat{\theta}^{old}) = E_{y_1|X,Y_I}[\log p(X, Y|\theta)]
    = -\frac{1}{2\sigma_x^2}\sum_{i=1}^{n}(x_i - \mu_x)^2
      - \frac{1}{2\sigma_y^2}\sum_{i=2}^{n}(y_i - \mu_y)^2
      - \frac{\hat{\sigma}_y^2 + (\hat{\mu}_y - \mu_y)^2}{2\sigma_y^2}
      - \frac{n}{2}\log\sigma_x^2 - \frac{n}{2}\log\sigma_y^2,                                          (4.13)

where {μ̂_y, σ̂_y²} are the old parameters (θ̂^old).
(c) M-step: Show that the maximum of the Q function in (4.13) is obtained with

  \mu_x = \frac{1}{n}\sum_{i=1}^{n} x_i,  \qquad
  \sigma_x^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu_x)^2,

  \mu_y = \frac{1}{n}\Big( \hat{\mu}_y + \sum_{i=2}^{n} y_i \Big),  \qquad
  \sigma_y^2 = \frac{1}{n}\Big( \hat{\sigma}_y^2 + (\hat{\mu}_y - \mu_y)^2 + \sum_{i=2}^{n} (y_i - \mu_y)^2 \Big).
(d) Determine the fixed point of this EM algorithm (i.e., the solution at convergence) by setting μ_y = μ̂_y and solving for μ_y (and similarly for σ_y²). What does the fixed-point solution correspond to?
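For reference, a minimal numpy sketch of the iteration in (c) for the diagonal case; the function name and the initialization from the observed y's are illustrative assumptions.

import numpy as np

def em_missing_y1(x, y_rest, n_iter=100):
    """EM for a diagonal bivariate Gaussian where y_1 is missing.
    x: all n observations of x; y_rest: observations y_2, ..., y_n."""
    n = len(x)
    # x-parameters do not depend on the missing value (closed form)
    mu_x = x.mean()
    var_x = ((x - mu_x) ** 2).mean()

    # initialize the y-parameters from the observed y's
    mu_y, var_y = y_rest.mean(), y_rest.var()
    for _ in range(n_iter):
        # E-step: under the diagonal model, y_1 | x_1 has the old mean and variance
        mu_y1_hat, var_y1_hat = mu_y, var_y
        # M-step: the updates from part (c)
        mu_y_new = (mu_y1_hat + y_rest.sum()) / n
        var_y = (var_y1_hat + (mu_y1_hat - mu_y_new) ** 2
                 + ((y_rest - mu_y_new) ** 2).sum()) / n
        mu_y = mu_y_new
    return mu_x, var_x, mu_y, var_y

Running it for a few hundred iterations makes the fixed point asked about in (d) easy to observe empirically.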
Now consider the case where Σ is an arbitrary covariance matrix. For convenience, define the vector z_i = [x_i, y_i]^T.
(e) Write down the complete data log-likelihood, log p(X, Y|θ), using matrix notation.
(f) E-step: Show that the Q function can be written as

  Q(\theta; \hat{\theta}^{old}) = E_{y_1|X,Y_I,\hat{\theta}^{old}}[\log p(X, Y)]                         (4.16)
    = -\frac{1}{2}\sum_{i=2}^{n} (z_i - \mu)^T \Sigma^{-1} (z_i - \mu) - \frac{n}{2}\log|\Sigma|
      - \frac{1}{2}\,\mathrm{tr}\!\left( \Sigma^{-1} E_{y_1|x_1,\hat{\theta}^{old}}\big[(z_1 - \mu)(z_1 - \mu)^T\big] \right).      (4.17)
(g) Show that the expectation in (4.17) is given by

  E_{y_1|x_1,\hat{\theta}^{old}}\big[(z_1 - \mu)(z_1 - \mu)^T\big] = (\hat{z}_1 - \mu)(\hat{z}_1 - \mu)^T + \hat{\Sigma}_1,
  \qquad \hat{z}_1 = \begin{bmatrix} x_1 \\ \hat{\mu}_{y_1} \end{bmatrix},
  \qquad \hat{\Sigma}_1 = \begin{bmatrix} 0 & 0 \\ 0 & \hat{\sigma}_{y_1}^2 \end{bmatrix},

where

  \hat{\mu}_{y_1} = E_{y_1|x_1,\hat{\theta}^{old}}[y_1],  \qquad  \hat{\sigma}_{y_1}^2 = \mathrm{var}_{y_1|x_1,\hat{\theta}^{old}}(y_1).
In other words, the missing data point y1 is replaced with the mean of y1 conditioned on x1, with some extra variance added due to uncertainty when recovering y1 from x1. (Hint: write out the terms of the outer-product, and then take the expectation.)
(h) M-step: Show that the maximum of the Q function in (4.17) is

  \mu = \frac{1}{n}\Big( \hat{z}_1 + \sum_{i=2}^{n} z_i \Big),  \qquad
  \Sigma = \frac{1}{n}\Big[ \hat{\Sigma}_1 + (\hat{z}_1 - \mu)(\hat{z}_1 - \mu)^T + \sum_{i=2}^{n} (z_i - \mu)(z_i - \mu)^T \Big].      (4.21)
Given the result in (c), how is this solution similar to or different from what you expected?
Problem 4.8 Multinomial EM
In this problem we consider an example where there is a closed-form solution to ML estimation from incomplete data. The goal is to compare with the EM solution and get some insight on how the steps of the latter can be substantially easier to derive than the former.
Consider our bridge example and let U be the type of vehicle that crosses the bridge. U can take 4 values (compact, sedan, station wagon, and pick-up truck), which we denote by U ∈ {1, 2, 3, 4}. On a given day, an operator collects an i.i.d. sample of size n from U and the number of vehicles of each type is counted and stored in a vector D = [x_1, x_2, x_3, x_4]. The resulting random variable X (the histogram of vehicle classes) has a multinomial distribution
  p(x_1, x_2, x_3, x_4|\theta) = \frac{n!}{x_1!\,x_2!\,x_3!\,x_4!}
      \left(\tfrac{1}{2} + \tfrac{1}{4}\theta\right)^{x_1}
      \left(\tfrac{1}{4} - \tfrac{1}{4}\theta\right)^{x_2}
      \left(\tfrac{1}{4} - \tfrac{1}{4}\theta\right)^{x_3}
      \left(\tfrac{1}{4}\theta\right)^{x_4},                                                            (4.22)

where θ is the parameter and 0 ≤ θ ≤ 1.
It is later realized that the operator included motorcycles in the compact class. Denote x_{11} as the number of compact cars, and x_{12} as the number of motorcycles. It is established that motorcycles have probability (1/4)θ, which leads to a new model

  p(x_{11}, x_{12}, x_2, x_3, x_4|\theta) = \frac{n!}{x_{11}!\,x_{12}!\,x_2!\,x_3!\,x_4!}
      \left(\tfrac{1}{2}\right)^{x_{11}}
      \left(\tfrac{1}{4}\theta\right)^{x_{12}}
      \left(\tfrac{1}{4} - \tfrac{1}{4}\theta\right)^{x_2}
      \left(\tfrac{1}{4} - \tfrac{1}{4}\theta\right)^{x_3}
      \left(\tfrac{1}{4}\theta\right)^{x_4}.                                                            (4.23)

Determining the parameter θ from the available data is a problem of ML estimation with missing data, since we only have measurements for

  x_1 = x_{11} + x_{12},

but not for x_{11} and x_{12} independently.
(a) Determine the value of θ that maximizes the likelihood of D, i.e.,

  \theta^* = \arg\max_{\theta} p(D|\theta),

by using standard ML estimation procedures.
(b) Assume that we have the complete data, i.e., D_c = [x_{11}, x_{12}, x_2, x_3, x_4]. Determine the value of θ that maximizes its likelihood, i.e.,

  \theta_c^* = \arg\max_{\theta} p(D_c; \theta),                                                        (4.26)

by using standard ML estimation procedures. Compare the difficulty of obtaining this solution vs. that of obtaining the solution in (a). Does this look like a problem where EM might be helpful?
(c) Derive the E and M-steps of the EM algorithm for this problem.
(d) Using the equations for the EM steps, determine the fixed point of the algorithm (i.e., the solution) by setting

  \theta^{(k+1)} = \theta^{(k)},                                                                        (4.27)

where k is the iteration number. Compare to the solution obtained in (a).
………

Problem 4.9 EM for hyperparameter estimation
In this problem we will use EM to estimate the hyperparameters of Bayesian linear regression (Problem 3.10). As before, we want to estimate a function

  f(x, \theta) = \phi(x)^T \theta,                                                                      (4.28)

where θ ∈ R^D is the parameter vector and φ(x) is a feature transformation of the input x ∈ R^d. We observe a noisy version of the function, y = f(x, θ) + ε, where ε ~ N(0, σ²). Hence, the likelihood of the output given the input x and parameter θ is

  p(y|x, \theta, \sigma^2) = \mathcal{N}(y|\phi(x)^T\theta, \sigma^2).                                  (4.29)

As it is a Bayesian model, we place a Gaussian prior on the parameter vector,

  p(\theta|\alpha) = \mathcal{N}(\theta|0, \alpha I).                                                   (4.30)

For this model, λ = {σ², α} are the hyperparameters. One way to estimate the hyperparameters is to maximize the marginal likelihood of the data (e.g., see Programming Assignment 1, Q3),

  \hat{\lambda} = \arg\max_{\sigma^2, \alpha} \log p(y|x, \sigma^2, \alpha)
                = \arg\max_{\sigma^2, \alpha} \log \int p(y|x, \theta, \sigma^2)\, p(\theta|\alpha)\, d\theta.          (4.31)

Note that this is a maximization problem of a likelihood that has "missing data", i.e., the parameter vector θ. Hence, the EM algorithm can be employed to optimize it! Let the observations D = {x_i, y_i}_{i=1}^n be the incomplete data, and θ be the missing data (note there is only one θ for all the observations).
(a) Write down the complete data log-likelihood, log p(D, θ|λ).
(b) E-step: Show that the Q function is

  Q(\lambda; \hat{\lambda}^{old}) = E_{\theta|D,\hat{\lambda}^{old}}[\log p(D, \theta|\lambda)]
    = -\frac{D}{2}\log\alpha - \frac{1}{2\alpha}\, E_{\theta|D,\hat{\lambda}^{old}}[\theta^T\theta]
      - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\, E_{\theta|D,\hat{\lambda}^{old}}\big[\|y - \Phi^T\theta\|^2\big],          (4.33)

where the conditional expectations are

  E_{\theta|D,\hat{\lambda}^{old}}[\theta^T\theta] = \hat{\mu}_\theta^T \hat{\mu}_\theta + \mathrm{tr}(\hat{\Sigma}_\theta),          (4.34)
  E_{\theta|D,\hat{\lambda}^{old}}\big[\|y - \Phi^T\theta\|^2\big] = \|y - \Phi^T\hat{\mu}_\theta\|^2 + \mathrm{tr}(\Phi^T \hat{\Sigma}_\theta \Phi),          (4.35)

and {μ̂_θ, Σ̂_θ} are defined in Problem 3.10 and calculated using λ̂^old.
(c) M-step: show that the maximum of the Q function in (4.33) is

  \alpha = \frac{1}{D}\left( \hat{\mu}_\theta^T \hat{\mu}_\theta + \mathrm{tr}(\hat{\Sigma}_\theta) \right),          (4.36)
  \sigma^2 = \frac{1}{n}\left\{ \|y - \Phi^T\hat{\mu}_\theta\|^2 + \mathrm{tr}(\Phi^T \hat{\Sigma}_\theta \Phi) \right\}.          (4.37)

What is the interpretation of these estimates? Does it make sense?
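For reference, a minimal numpy sketch of the resulting EM loop. It assumes the Problem 3.10 posterior takes the standard Bayesian linear regression form, Σ̂_θ = (ΦΦ^T/σ² + I/α)⁻¹ and μ̂_θ = (1/σ²) Σ̂_θ Φ y; the function name and initial values are illustrative.

import numpy as np

def em_hyperparams(Phi, y, n_iter=100, alpha=1.0, sigma2=1.0):
    """EM for the hyperparameters {sigma2, alpha} of Bayesian linear regression.
    Phi: (D, n) feature matrix with columns phi(x_i); y: (n,) outputs."""
    D, n = Phi.shape
    for _ in range(n_iter):
        # E-step: posterior of theta under the current hyperparameters
        # (assumed form of the Problem 3.10 posterior)
        Sigma_theta = np.linalg.inv(Phi @ Phi.T / sigma2 + np.eye(D) / alpha)
        mu_theta = Sigma_theta @ Phi @ y / sigma2
        # M-step: updates (4.36) and (4.37)
        alpha = (mu_theta @ mu_theta + np.trace(Sigma_theta)) / D
        resid = y - Phi.T @ mu_theta
        sigma2 = (resid @ resid + np.trace(Phi.T @ Sigma_theta @ Phi)) / n
    return alpha, sigma2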
EM algorithm and MAP estimation
Problem 4.10 EM algorithm for MAP estimation
In this problem, we extend the EM algorithm to perform MAP estimation for models with hidden variables. Let x be the observation r.v., z the hidden r.v., and θ the parameter r.v. The joint likelihood of {x, z} is

  p(x, z|\theta) = p(x|z, \theta)\, p(z|\theta).                                                        (4.38)

Let X be the observed data, and Z the corresponding hidden values. We wish to use the EM algorithm to find the maximum of the posterior distribution over parameters, p(θ|X).
(a) MAP Q function: rather than take the expectation of the complete data log-likelihood, log p(X, Z|θ), we take the expectation of the log posterior, log p(θ|X, Z). Show that the MAP Q function can be written as

  Q_{MAP}(\theta; \hat{\theta}^{old}) = E_{Z|X,\hat{\theta}^{old}}[\log p(\theta|X, Z)]                 (4.39)
     \propto E_{Z|X,\hat{\theta}^{old}}[\log p(X, Z|\theta)] + \log p(\theta),                          (4.40)

where we have dropped terms that do not affect the M-step later.
(b) MAP-EM algorithm: Using the result in (a), what are the E- and M-steps of the MAP-EM algorithm?
(c) MAP-EM, alternative view: Note that the first term in (4.40) is the only term with an expectation, and is also the same as the Q function of EM. Show that the MAP-EM algorithm from (b) can be written as

  E-step:  Q(\theta; \hat{\theta}^{old}) = E_{Z|X,\hat{\theta}^{old}}[\log p(X, Z|\theta)],
  M-step:  \hat{\theta}^{new} = \arg\max_{\theta}\; Q(\theta; \hat{\theta}^{old}) + \log p(\theta).
Hence, the MAP-EM adds the log prior to the M-step of standard EM, and the M-step looks like a standard Bayesian estimation procedure.
(d) Example: Derive the MAP-EM algorithm for a univariate GMM with 2 components,

  p(x) = \pi_1 \mathcal{N}(x|\mu_1, \sigma_1^2) + (1 - \pi_1) \mathcal{N}(x|\mu_2, \sigma_2^2),          (4.42)

where θ = {π_1, μ_1, μ_2} are the parameters and the σ_j² are known. The prior distributions are

  p(\pi_1) = 1,  \quad 0 \le \pi_1 \le 1,                                                               (4.43)
  p(\mu_1) = \mathcal{N}(\mu_1|\mu_0, \sigma_0^2),                                                      (4.44)
  p(\mu_2) = \mathcal{N}(\mu_2|\mu_0, \sigma_0^2).                                                      (4.45)

(Hint: Use your results from Problem 3.4 and Problem 3.7.) How are these E- and M-steps similar to and different from the standard EM for GMMs? What is the interpretation in terms of "virtual" samples added to standard EM for GMMs?
………

Problem 4.11 MAP-EM for multinomial

Consider the multinomial distribution of Problem 4.8, and a Beta prior

  p(\theta) = \frac{\Gamma(\nu_1 + \nu_2)}{\Gamma(\nu_1)\,\Gamma(\nu_2)}\, \theta^{\nu_1 - 1} (1 - \theta)^{\nu_2 - 1}.

Derive the equations of the EM algorithm for MAP estimation of the parameter θ.
Optimization Theory
Problem 4.12 Lagrange multipliers and equality constraints
In the M-step of the EM algorithm for mixture models, we need to calculate an estimate of the component prior probabilities π_j via the optimization problem

  \{\hat{\pi}_j\} = \arg\max_{\{\pi_j\}} \; \sum_{j=1}^{K} z_j \log \pi_j,
  \quad \text{s.t.} \quad \sum_{j=1}^{K} \pi_j = 1, \;\; \pi_j \ge 0,                                   (4.47)

for some z_j ≥ 0. Note that this optimization problem has an equality constraint, namely that the {π_j} must sum to 1, since they represent a probability distribution.
One method of solving an optimization problem with equality constraints is to use Lagrange multipliers. Consider the following problem,
  x^* = \arg\max_{x} f(x),  \quad \text{s.t.} \;\; g(x) = 0,                                            (4.48)

where f(x) is the objective function and g(x) is the constraint function. Let's look at two properties of these functions:
• First, the gradient ∇g(x) is orthogonal to the constraint surface, since g(x) should be constant along the direction of the constraint surface (otherwise it would not be 0).
• Second, at the optimum, the gradient ∇f(x) must also be orthogonal to the constraint surface. Otherwise, we could move along the constraint surface to increase f(x).
This is illustrated in the following figure¹:

  [Figure: contours of f(x) = 1 − x_1² − x_2² together with the constraint line g(x) = x_1 + x_2 − 1 = 0; the constrained optimum is at (x_1*, x_2*) = (0.5, 0.5).]

Hence, ∇f(x) and ∇g(x) must be parallel or anti-parallel, and by extension,

  \nabla f(x) + \lambda \nabla g(x) = 0,                                                                (4.49)

for some λ ≠ 0. Define the Lagrangian function,

  L(x, \lambda) = f(x) + \lambda g(x).                                                                  (4.50)

The optimality condition in (4.49) is obtained by setting ∂L/∂x = 0. Furthermore, setting ∂L/∂λ = 0 yields the equality constraint, g(x) = 0. Hence, to solve the constrained optimization problem (4.48), we form the Lagrangian function and find the stationary point w.r.t. both x and λ, by simultaneously solving

  \frac{\partial}{\partial x} L(x, \lambda) = 0,  \qquad  \frac{\partial}{\partial \lambda} L(x, \lambda) = 0.
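As a quick check of the example in the figure, the stationarity conditions can be solved symbolically; the snippet below is only an illustrative sketch.

import sympy as sp

x1, x2, lam = sp.symbols('x1 x2 lam', real=True)
f = 1 - x1**2 - x2**2                 # objective
g = x1 + x2 - 1                       # equality constraint g(x) = 0
L = f + lam * g                       # Lagrangian (4.50)

# Stationarity w.r.t. x1, x2, and lam recovers the constrained optimum
sol = sp.solve([sp.diff(L, x1), sp.diff(L, x2), sp.diff(L, lam)], [x1, x2, lam])
print(sol)   # {x1: 1/2, x2: 1/2, lam: 1}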
(a) Use Lagrange multipliers to optimize (4.47), and show that the solution is

  \hat{\pi}_j = \frac{z_j}{\sum_{k=1}^{K} z_k}.

(b) Consider another optimization problem,

  \{\hat{\pi}_j\} = \arg\max_{\{\pi_j\}} \; \sum_{j=1}^{K} \pi_j (z_j - \log \pi_j),
  \quad \text{s.t.} \quad \sum_{j=1}^{K} \pi_j = 1.

Show that the solution is

  \hat{\pi}_j = \frac{\exp z_j}{\sum_{k=1}^{K} \exp z_k}.

More details about Lagrange multipliers can be found in Appendix E of Bishop's book, PRML.

………
¹The red line is straight! There is an optical illusion that makes it look curved!
Implementation Tricks
Problem 4.13 The “log trick” and preventing numerical precision problems
In this problem, we will see how to prevent numerical precision problems by doing calculations in the log domain. When dealing with mixtures, we often need to calculate the likelihood of a sample under the full mixture distribution,

  p(x) = \sum_{j=1}^{K} \pi_j \, p(x|j),                                                                (4.54)

where p(x|j) is the mixture component density and π_j is the component prior probability. For example, p(x) is needed for the denominator when calculating the soft assignment variables,

  \hat{z}_{ij} = \frac{\pi_j p(x|j)}{\sum_k \pi_k p(x|k)} = \frac{\pi_j p(x|j)}{p(x)}.                   (4.55)

However, when x is high-dimensional (or the sample x happens to be far away from all the components, or both), then the component likelihoods p(x|j) can become very small. If the likelihood p(x|j) < e^{-709}, then the computer will interpret this as 0 due to the limited precision of the computer. If all the component likelihoods are this small, then ẑ_ij cannot be calculated since the numerator and denominator will both be zero.

The solution to this problem is to do the calculations in the log domain. That is, use the log-likelihood of the components, ℓ(x|j) = log p(x|j) + log π_j, to calculate the log-likelihood of the mixture, ℓ(x) = log p(x). We can then calculate the log probabilities of the assignment variables,

  \hat{\ell}_{ij} = \log \hat{z}_{ij} = \ell(x_i|j) - \ell(x_i),                                         (4.56)

which avoids the 0/0 problem, and take the exponent to get the assignment variable,

  \hat{z}_{ij} = e^{\hat{\ell}_{ij}}.                                                                    (4.57)

Finally, we can use the data log-likelihood to test for convergence of the algorithm,

  \log p(X) = \sum_{i=1}^{n} \log p(x_i) = \sum_{i=1}^{n} \ell(x_i).                                     (4.58)

The individual component log-likelihoods ℓ(x|j) can be calculated without a problem. However, when looking at ℓ(x),

  \ell(x) = \log p(x) = \log \sum_{j=1}^{K} \pi_j p(x|j) = \log \sum_{j=1}^{K} e^{\ell(x|j)},            (4.59)

we still need to take the exponent before calculating the sum. If all ℓ(x|j) are small, then all e^{ℓ(x|j)} will be zero, the sum will be 0, and ℓ(x) = log 0 = −∞, resulting in the same precision problem. We will now derive a way to calculate ℓ(x) in a numerically safe way, which is what I call the "log trick". Formally, given values {ℓ_j}_{j=1}^K, we want to calculate the log-sum-exp of these values,

  \ell = \log \sum_{j=1}^{K} e^{\ell_j}.                                                                 (4.60)

Let ℓ* = max(ℓ_1, ..., ℓ_K) be the maximum value over all the ℓ_j.
(a) Show that ℓ in (4.60) can be rewritten as

  \ell = \ell^* + \log \sum_{j=1}^{K} e^{(\ell_j - \ell^*)}.                                             (4.61)

(b) Why is the form in (4.61) more numerically stable than (4.60)? That is, how have we avoided taking the log of 0?
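For reference, a minimal numpy sketch of the log trick applied to the log assignments in (4.56); the function names are illustrative, and scipy.special.logsumexp provides the same operation.

import numpy as np

def logsumexp(ell, axis=-1):
    """Numerically safe log-sum-exp, i.e. equation (4.61)."""
    ell_star = np.max(ell, axis=axis, keepdims=True)
    return (ell_star + np.log(np.sum(np.exp(ell - ell_star), axis=axis,
                                     keepdims=True))).squeeze(axis)

def log_soft_assignments(log_comp):
    """log_comp[i, j] = log p(x_i|j) + log pi_j; returns log z_hat[i, j] as in (4.56)."""
    return log_comp - logsumexp(log_comp, axis=1)[:, None]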
………

Problem 4.14 Calculating log determinant

For very high-dimensional Gaussian densities, the likelihoods may be very close to zero (or even zero due to numerical precision). The numerical precision problems can be avoided by calculating the log-likelihood of a Gaussian, which requires the log-determinant of Σ. Consider the following two properties:
• The Cholesky decomposition of a positive definite matrix Σ is

    \Sigma = L L^T,                                                                                     (4.62)

  where L is a lower triangular matrix (a matrix where the upper triangle is zero, i.e., L_{ij} = 0 when i < j). A 3 × 3 lower triangular matrix looks like this:

    \begin{bmatrix} L_{11} & 0 & 0 \\ L_{21} & L_{22} & 0 \\ L_{31} & L_{32} & L_{33} \end{bmatrix}.

• The determinant of a lower triangular matrix L (or an upper triangular matrix) is the product of its diagonal elements,

    |L| = \prod_{i=1}^{d} L_{ii}.

Use these two properties to show that the log-determinant of Σ is

  \log|\Sigma| = 2 \sum_{i=1}^{d} \log L_{ii},

where Σ = LL^T is the Cholesky decomposition of Σ.
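For reference, a minimal numpy sketch of this computation (the function name is illustrative).

import numpy as np

def log_det_psd(Sigma):
    """Log-determinant of a positive definite matrix via its Cholesky factor."""
    L = np.linalg.cholesky(Sigma)            # Sigma = L @ L.T, L lower triangular
    return 2.0 * np.sum(np.log(np.diag(L)))  # log|Sigma| = 2 * sum_i log L_ii

# e.g., a Gaussian log-likelihood can then use
# -0.5 * (d * np.log(2 * np.pi) + log_det_psd(Sigma) + quad_form)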