CS5487 Problem Set
Solutions – Tutorials (1-5)
Antoni Chan, Department of Computer Science, City University of Hong Kong

Tutorial Problems (1-5)

Problem 1.6 Multivariate Gaussian

(a) The multivariate Gaussian distribution is
N(x|μ, Σ) = 1/((2π)^{d/2} |Σ|^{1/2}) exp{-½ (x - μ)^T Σ^{-1} (x - μ)}.   (S.158)
Assuming a diagonal covariance matrix, Σ = diag(σ_1^2, ..., σ_d^2), and substituting the properties |Σ| = ∏_{i=1}^d σ_i^2 and Σ^{-1} = diag(1/σ_1^2, ..., 1/σ_d^2) into (1.16),
N(x|μ, Σ) = 1/((2π)^{d/2} (∏_{i=1}^d σ_i^2)^{1/2}) exp{-½ ∑_{i=1}^d (x_i - μ_i)^2 / σ_i^2}   (S.159)
= 1/((2π)^{d/2} ∏_{i=1}^d σ_i) ∏_{i=1}^d exp{-(x_i - μ_i)^2 / (2σ_i^2)}   (S.160)
= ∏_{i=1}^d 1/((2π)^{1/2} σ_i) exp{-(x_i - μ_i)^2 / (2σ_i^2)}   (S.161)
= ∏_{i=1}^d N(x_i|μ_i, σ_i^2).   (S.162)
Hence, with a diagonal covariance the multivariate Gaussian factors into a product of independent univariate Gaussians, one per dimension.

(b) See Figure 7b. The diagonal terms indicate how far the Gaussian stretches in each axis direction.

(c) See Figure 7a.

(d) The eigenvalue and eigenvector pairs (λ_i, v_i) of Σ satisfy
Σ v_i = λ_i v_i,   i ∈ {1, ..., d}.   (S.163)
Rewriting using matrix notation,
Σ [v_1 ··· v_d] = [λ_1 v_1 ··· λ_d v_d]   (S.164)
Σ [v_1 ··· v_d] = [v_1 ··· v_d] diag(λ_1, ..., λ_d)   (S.165)
Σ V = V Λ,   (S.166)
where V = [v_1 ··· v_d] is the matrix of eigenvectors, and Λ = diag(λ_1, ..., λ_d) is a diagonal matrix with the corresponding eigenvalues. Finally, post-multiplying both sides by V^{-1},
Σ V V^{-1} = V Λ V^{-1}   (S.167)
Σ = V Λ V^T,   (S.168)
where in the last line we have used the property that the eigenvectors are orthogonal, i.e., V^T V = I and hence V^{-1} = V^T.
(e) The inverse of the covariance matrix is
Σ^{-1} = (V Λ V^T)^{-1} = V^{-T} Λ^{-1} V^{-1} = V Λ^{-1} V^T.   (S.169)
Hence, the Mahalanobis distance term can be rewritten as
‖x - μ‖²_Σ = (x - μ)^T Σ^{-1} (x - μ) = (x - μ)^T V Λ^{-1} V^T (x - μ) = y^T Λ^{-1} y = ‖y‖²_Λ,   (S.170)
where we define y = V^T (x - μ).

(f) In the transformation x = V y + μ, first y is rotated according to the eigenvector matrix V, then the result is translated by the vector μ.

(g) See Figure 7c. For this Σ, we have V = [v_1 v_2] with v_1 = [√2/2, √2/2]^T and v_2 = [-√2/2, √2/2]^T (the ±45° diagonal directions), and Λ = diag(1, 0.25). The Gaussian is first stretched according to the eigenvalues (e.g., like in Figure 7b). Then it is rotated to match the directions of the eigenvectors.

Figure 7: Example of multivariate Gaussians: a) isotropic covariance matrix; b) diagonal covariance matrix; c) full covariance matrix.
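As a quick numerical sanity check of parts (a) and (d)-(e) (this sketch is an addition, not part of the original solution; it assumes NumPy/SciPy, and the random covariance and seed are arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(0)
d = 3
mu = rng.normal(size=d)
x = rng.normal(size=d)

# (a) diagonal covariance: the joint density equals the product of univariate densities
sig2 = rng.uniform(0.5, 2.0, size=d)                 # diagonal entries sigma_i^2
p_joint = multivariate_normal.pdf(x, mean=mu, cov=np.diag(sig2))
p_prod = np.prod(norm.pdf(x, loc=mu, scale=np.sqrt(sig2)))
assert np.isclose(p_joint, p_prod)

# (d)-(e) eigendecomposition of a full covariance: Sigma = V Lam V^T
A = rng.normal(size=(d, d))
Sigma = A @ A.T + d * np.eye(d)                      # symmetric positive definite
lam, V = np.linalg.eigh(Sigma)                       # columns of V are eigenvectors
assert np.allclose(Sigma, V @ np.diag(lam) @ V.T)

# Mahalanobis distance equals ||y||^2_Lam with y = V^T (x - mu), as in (S.170)
y = V.T @ (x - mu)
m1 = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
m2 = y @ np.diag(1.0 / lam) @ y
assert np.isclose(m1, m2)
```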
………
Problem 1.8 Product of Multivariate Gaussian Distributions
The product of two Gaussians is
N(x|a, A) N(x|b, B)   (S.171)
= 1/((2π)^{d/2} |A|^{1/2}) exp{-½ (x - a)^T A^{-1} (x - a)} · 1/((2π)^{d/2} |B|^{1/2}) exp{-½ (x - b)^T B^{-1} (x - b)}   (S.172)
= 1/((2π)^d |A|^{1/2} |B|^{1/2}) exp{-½ [(x - a)^T A^{-1} (x - a) + (x - b)^T B^{-1} (x - b)]}.   (S.173)
Looking at the exponent term, we expand and collect terms,
M = (x - a)^T A^{-1} (x - a) + (x - b)^T B^{-1} (x - b)   (S.174)
= x^T A^{-1} x + x^T B^{-1} x - 2 a^T A^{-1} x - 2 b^T B^{-1} x + a^T A^{-1} a + b^T B^{-1} b   (S.175)
= x^T (A^{-1} + B^{-1}) x - 2 (a^T A^{-1} + b^T B^{-1}) x + a^T A^{-1} a + b^T B^{-1} b,   (S.176)
which is a quadratic in x with quadratic term A' = A^{-1} + B^{-1}, linear term b' = A^{-1} a + B^{-1} b, and constant term c' = a^T A^{-1} a + b^T B^{-1} b. Using these definitions of (A', b', c') in place of (A, b, c), we complete the square using Problem 1.10:
M = (x - d)^T A' (x - d) + e,   (S.177)
where
d = A'^{-1} b' = (A^{-1} + B^{-1})^{-1} (A^{-1} a + B^{-1} b)   (S.178)
and
e = c' - b'^T A'^{-1} b'   (S.179)
= a^T A^{-1} a + b^T B^{-1} b - (a^T A^{-1} + b^T B^{-1})(A^{-1} + B^{-1})^{-1}(A^{-1} a + B^{-1} b)   (S.180)
= a^T A^{-1} a - a^T A^{-1}(A^{-1} + B^{-1})^{-1} A^{-1} a + b^T B^{-1} b - b^T B^{-1}(A^{-1} + B^{-1})^{-1} B^{-1} b - 2 a^T A^{-1}(A^{-1} + B^{-1})^{-1} B^{-1} b   (S.181)
= a^T (A + B)^{-1} a + b^T (A + B)^{-1} b - 2 a^T (A + B)^{-1} b   (S.182)
= (a - b)^T (A + B)^{-1} (a - b),   (S.183)
where in (S.182) we use the matrix inversion lemma on the first two terms (from Problem 1.15), and the identity A^{-1}(A^{-1} + B^{-1})^{-1} B^{-1} = (A + B)^{-1} on the cross term. Finally, defining C = (A^{-1} + B^{-1})^{-1} and c = C(A^{-1} a + B^{-1} b), we obtain for the exponent term
M = (x - c)^T C^{-1} (x - c) + (a - b)^T (A + B)^{-1} (a - b) = ‖x - c‖²_C + ‖a - b‖²_{A+B}.   (S.184)
Next, we look at the determinant term,
D = 1/(|A|^{1/2} |B|^{1/2}) = |C|^{1/2} / (|A|^{1/2} |B|^{1/2} |C|^{1/2})   (S.185)
= |(A^{-1} + B^{-1})^{-1}|^{1/2} / (|A|^{1/2} |B|^{1/2}) · 1/|C|^{1/2}   (S.186)
= 1/(|A|^{1/2} |A^{-1} + B^{-1}|^{1/2} |B|^{1/2}) · 1/|C|^{1/2}   (S.187)
= 1/((|A| |A^{-1} + B^{-1}| |B|)^{1/2}) · 1/|C|^{1/2}   (S.188)
= 1/(|A + B|^{1/2} |C|^{1/2}),   (S.189)
since |A| |A^{-1} + B^{-1}| |B| = |A (A^{-1} + B^{-1}) B| = |A + B|.
Finally, substituting the derived expressions for M and D into (S.173),
N(x|a, A) N(x|b, B) = 1/((2π)^d |A + B|^{1/2} |C|^{1/2}) exp{-½ [‖x - c‖²_C + ‖a - b‖²_{A+B}]}   (S.190-S.191)
= [1/((2π)^{d/2} |A + B|^{1/2}) exp{-½ ‖a - b‖²_{A+B}}] · [1/((2π)^{d/2} |C|^{1/2}) exp{-½ ‖x - c‖²_C}]   (S.192-S.193)
= N(a|b, A + B) N(x|c, C).   (S.194)
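A quick numerical check of (S.194) (an addition, not part of the original solution; it assumes SciPy, and the diagonal covariances are arbitrary, though any SPD A and B work):

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(1)
d = 2
a, b, x = rng.normal(size=(3, d))
A = np.diag(rng.uniform(0.5, 2.0, d))
B = np.diag(rng.uniform(0.5, 2.0, d))

lhs = mvn.pdf(x, mean=a, cov=A) * mvn.pdf(x, mean=b, cov=B)

# C = (A^-1 + B^-1)^-1, c = C (A^-1 a + B^-1 b), as in the derivation above
C = np.linalg.inv(np.linalg.inv(A) + np.linalg.inv(B))
c = C @ (np.linalg.inv(A) @ a + np.linalg.inv(B) @ b)
rhs = mvn.pdf(a, mean=b, cov=A + B) * mvn.pdf(x, mean=c, cov=C)

assert np.isclose(lhs, rhs)   # N(x|a,A) N(x|b,B) = N(a|b,A+B) N(x|c,C)
```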
………

Problem 1.10 Completing the square

The original form is
f(x) = x^T A x - 2 x^T b + c.
Now expand the desired form,
f(x) = (x - d)^T A (x - d) + e = x^T A x - 2 x^T A d + d^T A d + e.   (S.195)
We need to match the quadratic, linear, and constant terms. The quadratic term already matches. For the linear term, we have by inspection d = A^{-1} b, so that x^T A d = x^T b. For the constant term, we set
c = d^T A d + e  ⟹  e = c - d^T A d   (S.196)
= c - b^T A^{-1} A A^{-1} b = c - b^T A^{-1} b.   (S.197)
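A tiny numerical check of (S.195)-(S.197) (an addition; the random A, b, c, x are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 3
A = rng.normal(size=(n, n)); A = A @ A.T + n * np.eye(n)   # symmetric positive definite
b, x = rng.normal(size=(2, n))
c = 1.7

d = np.linalg.solve(A, b)              # d = A^{-1} b
e = c - b @ d                          # e = c - b^T A^{-1} b
f1 = x @ A @ x - 2 * x @ b + c         # original form
f2 = (x - d) @ A @ (x - d) + e         # completed square
assert np.isclose(f1, f2)
```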
………

Problem 2.6 MLE for a multivariate Gaussian

(a) The log-likelihood of the data is
ℓ = log p(X) = ∑_{i=1}^N log p(x_i) = ∑_{i=1}^N {-½ ‖x_i - μ‖²_Σ - ½ log|Σ| - (d/2) log 2π}   (S.198)
= -½ ∑_{i=1}^N ‖x_i - μ‖²_Σ - (N/2) log|Σ|   (S.199)
= -½ ∑_{i=1}^N (x_i - μ)^T Σ^{-1} (x_i - μ) - (N/2) log|Σ|,   (S.200)
where we have dropped constant terms. To find the maximum of the log-likelihood ℓ w.r.t. μ, we take the derivative and set it to 0. One way to proceed is to expand the quadratic term and take the derivative. Alternatively, we can use the chain rule. First, removing terms that do not depend on μ,
∂ℓ/∂μ = -½ ∑_{i=1}^N ∂/∂μ (x_i - μ)^T Σ^{-1} (x_i - μ).   (S.201)
Letting z_i = x_i - μ and applying the chain rule ∂f/∂μ = (∂f/∂z)(∂z/∂μ),
∂ℓ/∂μ = -½ ∑_{i=1}^N ∂/∂μ z_i^T Σ^{-1} z_i = -½ ∑_{i=1}^N (∂/∂z_i z_i^T Σ^{-1} z_i)(∂z_i/∂μ)   (S.202)
= -½ ∑_{i=1}^N [2 Σ^{-1} z_i][-1] = ∑_{i=1}^N Σ^{-1} (x_i - μ).   (S.203)
Setting the derivative to 0, and solving for μ,
∑_{i=1}^N Σ^{-1} (x_i - μ) = 0.   (S.204)
Pre-multiplying both sides by Σ,
Σ ∑_{i=1}^N Σ^{-1} (x_i - μ) = Σ · 0  ⟹  ∑_{i=1}^N x_i - ∑_{i=1}^N μ = 0  ⟹  ∑_{i=1}^N x_i - N μ = 0  ⟹  μ̂ = (1/N) ∑_{i=1}^N x_i.   (S.205-S.206)

(b) To find the maximum of the log-likelihood w.r.t. Σ, we first use the "trace" trick, x^T A x = tr[x^T A x] = tr[A x x^T], on the log-likelihood,
ℓ = -½ ∑_{i=1}^N (x_i - μ)^T Σ^{-1} (x_i - μ) - (N/2) log|Σ|   (S.207)
= -½ ∑_{i=1}^N tr[Σ^{-1} (x_i - μ)(x_i - μ)^T] - (N/2) log|Σ|   (S.208)
= -½ tr[Σ^{-1} ∑_{i=1}^N (x_i - μ)(x_i - μ)^T] - (N/2) log|Σ|.   (S.209)
Taking the derivative w.r.t. Σ (using the helpful derivatives provided in Problem 2.6) and setting it to 0,
∂ℓ/∂Σ = ½ Σ^{-1} [∑_{i=1}^N (x_i - μ)(x_i - μ)^T] Σ^{-1} - (N/2) Σ^{-1} = 0.   (S.210)
Pre-multiplying and post-multiplying by Σ on both sides gives
½ ∑_{i=1}^N (x_i - μ)(x_i - μ)^T - (N/2) Σ = 0  ⟹  Σ̂ = (1/N) ∑_{i=1}^N (x_i - μ)(x_i - μ)^T.   (S.211-S.212)
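A quick numerical check (an addition, not part of the original solution; the synthetic Gaussian data and seed are arbitrary) that (S.206) and (S.212) are the sample mean and the 1/N-normalized sample covariance:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 500, 3
X = rng.multivariate_normal(mean=[1.0, -2.0, 0.5],
                            cov=np.diag([1.0, 2.0, 0.5]), size=N)

mu_hat = X.mean(axis=0)                 # (S.206): sample mean
diff = X - mu_hat
Sigma_hat = diff.T @ diff / N           # (S.212): note 1/N, not 1/(N-1)

# np.cov with bias=True uses the same 1/N normalization
assert np.allclose(Sigma_hat, np.cov(X.T, bias=True))
```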
………

Problem 2.8 Least-squares regression and MLE

(a) We want to find the θ that minimizes the sum-squared error,
E = ∑_{i=1}^N (y_i - φ(x_i)^T θ)^2 = ‖y - Φ^T θ‖² = (y - Φ^T θ)^T (y - Φ^T θ)   (S.213)
= y^T y - 2 y^T Φ^T θ + θ^T Φ Φ^T θ.   (S.214)
Next, take the derivative w.r.t. θ (using the derivatives from Problem 2.6), and set it to zero,
∂E/∂θ = -2 Φ y + 2 Φ Φ^T θ = 0   (S.215)
⟹  Φ Φ^T θ = Φ y  ⟹  θ̂ = (Φ Φ^T)^{-1} Φ y.   (S.216)
(b) The noise ε is zero-mean Gaussian, and hence y_i = f(x_i) + ε is also Gaussian with mean equal to the function value f(x_i) = φ(x_i)^T θ,
p(y_i|x_i, θ) = N(y_i | φ(x_i)^T θ, σ²).   (S.217)
Then the log-likelihood of the data is
ℓ = ∑_{i=1}^N log p(y_i|x_i, θ) = ∑_{i=1}^N {-(1/(2σ²)) (y_i - φ(x_i)^T θ)^2 - ½ log σ² - ½ log 2π}.   (S.218)
The maximum likelihood solution is then
θ̂ = argmax_θ ℓ = argmax_θ -(1/(2σ²)) ∑_{i=1}^N (y_i - φ(x_i)^T θ)^2 = argmin_θ ∑_{i=1}^N (y_i - φ(x_i)^T θ)^2,   (S.219)
where the last step follows from 1/(2σ²) being a scalar constant w.r.t. θ. Hence, the maximum likelihood solution is equivalent to the least-squares solution. This is mainly because we assumed Gaussian noise, which introduces the squared-error term.
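A brief check (an addition; the polynomial feature map and synthetic data are arbitrary) that the closed form (S.216) matches a generic least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 100
x = rng.uniform(-2, 2, size=N)
Phi = np.vstack([np.ones(N), x, x**2])             # columns are phi(x_i) = [1, x, x^2]^T
theta_true = np.array([1.0, -0.5, 2.0])
y = Phi.T @ theta_true + 0.3 * rng.normal(size=N)  # Gaussian observation noise

# (S.216): theta_hat = (Phi Phi^T)^{-1} Phi y
theta_hat = np.linalg.solve(Phi @ Phi.T, Phi @ y)

# same answer as the generic least-squares solver applied to Phi^T theta ~ y
theta_lstsq, *_ = np.linalg.lstsq(Phi.T, y, rcond=None)
assert np.allclose(theta_hat, theta_lstsq)
```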
………
Problem 3.10 Bayesian regression with Gaussian prior
(a) The posterior distribution of the parameters is obtained using Bayes’ rule,
p(θ|y, X) = p(y|X, θ) p(θ) / ∫ p(y|X, θ) p(θ) dθ.   (S.220)
Here the denominator ensures that the posterior is properly normalized (integrates to 1). Note that the denominator is only a function of the data D since the parameter θ is integrated out. Hence, it suffices to find the form of θ in the numerator first, and then normalize that expression to obtain the distribution,
p(θ|y, X) ∝ p(y|X, θ) p(θ),   (S.221)
or equivalently
log p(θ|y, X) = log p(y|X, θ) + log p(θ) + const.   (S.222)
Substituting for the data likelihood and prior terms (and ignoring terms not involving θ),
log p(θ|y, X) = log N(y | Φ^T θ, Σ) + log N(θ | 0, Γ) + const.   (S.223)
= -½ ‖y - Φ^T θ‖²_Σ - ½ θ^T Γ^{-1} θ + const.   (S.224)
= -½ [-2 θ^T Φ Σ^{-1} y + θ^T Φ Σ^{-1} Φ^T θ] - ½ θ^T Γ^{-1} θ + const.   (S.225)
= -½ [θ^T (Φ Σ^{-1} Φ^T + Γ^{-1}) θ - 2 θ^T Φ Σ^{-1} y] + const.,   (S.226)
where we identify the quadratic coefficient A = Φ Σ^{-1} Φ^T + Γ^{-1} and the linear coefficient b = Φ Σ^{-1} y. Next, using the above A and b, we complete the square (see Problem 1.10),
log p(θ|y, X) = -½ (θ - A^{-1} b)^T A (θ - A^{-1} b) + const.   (S.227)
= -½ ‖θ - μ̂_θ‖²_{Σ̂_θ} + const.,   (S.228)
where again constant terms are ignored, and we define
μ̂_θ = A^{-1} b = (Φ Σ^{-1} Φ^T + Γ^{-1})^{-1} Φ Σ^{-1} y,   (S.229)
Σ̂_θ = A^{-1} = (Φ Σ^{-1} Φ^T + Γ^{-1})^{-1}.   (S.230)
Finally, note that the log-posterior in (S.228) has the same form as the log of a Gaussian in θ. Hence the posterior is Gaussian,
p(θ|y, X) = N(θ | μ̂_θ, Σ̂_θ).   (S.231)
(b) The MAP solution is the mean of the Gaussian (i.e., the θ with the largest posterior density),
θ̂_MAP = argmax_θ N(θ | μ̂_θ, Σ̂_θ)   (S.232)
= μ̂_θ = (Φ Σ^{-1} Φ^T + Γ^{-1})^{-1} Φ Σ^{-1} y.   (S.233)
θ̂_MAP is similar to the weighted least-squares estimate, but has an additional term Γ^{-1}. When Γ^{-1} = 0, θ̂_MAP is the same as the weighted least-squares solution. For non-zero Γ^{-1}, the term serves to regularize the matrix Φ Σ^{-1} Φ^T, which might not be strictly positive definite or might be nearly singular. E.g., if Γ = αI is a diagonal matrix, then this guarantees that the matrix inverse of (Φ Σ^{-1} Φ^T + Γ^{-1}) can always be performed.

(c) Substituting Γ = αI and Σ = σ²I into θ̂_MAP,
θ̂ = (Φ (1/σ²) Φ^T + (1/α) I)^{-1} Φ (1/σ²) y = (Φ Φ^T + (σ²/α) I)^{-1} Φ y = (Φ Φ^T + λ I)^{-1} Φ y,   (S.234)
where λ = σ²/α ≥ 0. To solve the regularized least-squares problem, first consider the objective function
R = ‖y - Φ^T θ‖² + λ ‖θ‖² = (y - Φ^T θ)^T (y - Φ^T θ) + λ θ^T θ   (S.235)
= y^T y - 2 y^T Φ^T θ + θ^T Φ Φ^T θ + λ θ^T θ   (S.236)
= y^T y - 2 y^T Φ^T θ + θ^T (Φ Φ^T + λ I) θ.   (S.237)
Taking the derivative and setting it to 0,
∂R/∂θ = -2 Φ y + 2 (Φ Φ^T + λ I) θ = 0  ⟹  (Φ Φ^T + λ I) θ = Φ y   (S.238)
⟹  θ = (Φ Φ^T + λ I)^{-1} Φ y.   (S.239)
Hence the regularized least-squares estimate (aka ridge regression) is equivalent to the above MAP estimate for Bayesian regression with Gaussian noise and a Gaussian prior with isotropic covariances.
(d) Substituting Γ = αI and Σ = σ²I, the posterior mean and covariance are
μ̂_θ = (Φ Φ^T + (σ²/α) I)^{-1} Φ y,   Σ̂_θ = ((1/σ²) Φ Φ^T + (1/α) I)^{-1}.   (S.240)
Here are the various cases of interest:
• Setting α → 0 corresponds to setting a very strong prior at θ = 0, since the covariance of the prior, Γ = αI, goes to 0. The term (1/α)I is a diagonal matrix with very large entries, and its inverse tends to the zero matrix. As a result, the posterior of θ is equivalent to the prior, μ̂_θ = 0 and Σ̂_θ = 0.
• Setting α → ∞ yields a very weak prior since the covariance is very large. As a result, the (1/α)I term vanishes, leaving an estimate equivalent to ordinary least squares, μ̂_θ = (Φ Φ^T)^{-1} Φ y and Σ̂_θ = ((1/σ²) Φ Φ^T)^{-1}.
• When σ² → 0, there is no observation noise. As a result, the prior is ignored and the data term dominates the mean, μ̂_θ = (Φ Φ^T)^{-1} Φ y, again resulting in the ordinary least-squares estimate. The posterior covariance is Σ̂_θ = 0, since there is no uncertainty in the observations.
• When σ² → ∞, the observations are very noisy. As a result, the data term is ignored and the prior dominates, resulting in a posterior equivalent to the prior, μ̂_θ = 0 and Σ̂_θ = αI.
(e) Given a novel input x*, we are interested in the function value f* conditioned on the data {y, X, x*}. Note that the posterior of θ given (y, X) is a Gaussian random variable. Hence, when conditioning on the data, f* = φ(x*)^T θ is a linear transformation of a Gaussian random variable, which is also Gaussian. Using the properties in Problem 1.1, which give the mean and variance of the transformed Gaussian, we have
p(f*|y, X, x*) = N(f* | μ̂*, σ̂*²),   μ̂* = φ(x*)^T μ̂_θ,   σ̂*² = φ(x*)^T Σ̂_θ φ(x*).   (S.241)
Finally, for the predicted y*, we obtain the distribution by integrating over f*,
p(y*|y, X, x*) = ∫ p(y*|f*) p(f*|y, X, x*) df* = ∫ N(y* | f*, σ²) N(f* | μ̂*, σ̂*²) df*   (S.242)
= ∫ N(f* | y*, σ²) N(f* | μ̂*, σ̂*²) df* = N(y* | μ̂*, σ̂*² + σ²),   (S.243)
where the last line uses Problem 1.9 to calculate the integral (correlation between Gaussian distributions).
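The following sketch (an addition, not part of the original solution; the synthetic data, seed, and linear feature map are arbitrary) computes the posterior (S.229)-(S.230) with Γ = αI and Σ = σ²I, checks the ridge-regression equivalence of (S.234) and (S.239), and forms the predictive distribution (S.241)-(S.243):

```python
import numpy as np

rng = np.random.default_rng(4)
N, sigma2, alpha = 50, 0.25, 1.0
x = rng.uniform(-2, 2, size=N)
Phi = np.vstack([np.ones(N), x])                    # phi(x) = [1, x]^T
y = Phi.T @ np.array([0.5, -1.0]) + np.sqrt(sigma2) * rng.normal(size=N)

# Posterior (S.229)-(S.230) with Gamma = alpha*I, Sigma = sigma2*I
A = Phi @ Phi.T / sigma2 + np.eye(2) / alpha
Sigma_post = np.linalg.inv(A)
mu_post = Sigma_post @ Phi @ y / sigma2

# MAP estimate equals ridge regression with lambda = sigma2/alpha (S.234), (S.239)
lam = sigma2 / alpha
theta_ridge = np.linalg.solve(Phi @ Phi.T + lam * np.eye(2), Phi @ y)
assert np.allclose(mu_post, theta_ridge)

# Predictive distribution at a new input (S.241)-(S.243)
phi_star = np.array([1.0, 0.3])                     # phi(x*) for x* = 0.3
mu_star = phi_star @ mu_post
var_star = phi_star @ Sigma_post @ phi_star + sigma2
```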
………
Problem 3.12 L1-regularized least-squares (LASSO)
(a) The L1-regularized least-squares objective function is
E = ½ ‖y - Φ^T θ‖² + λ ∑_{i=1}^D |θ_i|.   (S.244)
The first term (data term) is the squared error, as in ordinary least squares. The second term (regularization term) is the sum of absolute values of the parameters. In contrast, regularized least squares (ridge regression) uses the squared norm of the parameters, ‖θ‖². The objective function E is equivalent to a negative log-likelihood, so the likelihood takes the form
ℓ ∝ e^{-E} = e^{-½ ‖y - Φ^T θ‖² - λ ∑_{i=1}^D |θ_i|} = e^{-½ ∑_{i=1}^N (y_i - φ(x_i)^T θ)^2 - λ ∑_{i=1}^D |θ_i|}   (S.245)
= [∏_{i=1}^N e^{-½ (y_i - φ(x_i)^T θ)^2}] [∏_{i=1}^D e^{-λ |θ_i|}].   (S.246)
The left term is the data likelihood, which is equivalent to a Gaussian (as in ordinary least squares). The right term is the prior on θ, and takes the form of a Laplacian on each parameter value. Hence, the probabilistic interpretation of L1-regularized least squares is to find the MAP estimate of the parameters θ, using the model
y = f(x; θ) + ε,   ε ~ N(0, σ²),   θ_i ~ Laplace(λ) = (λ/2) e^{-λ |θ_i|}.   (S.247)
(b) In the equivalent optimization problem, we rewrite θ_i as the difference of two non-negative values, θ_i = θ_i⁺ - θ_i⁻, where θ_i⁺ ≥ 0 and θ_i⁻ ≥ 0. Note that when θ_i⁺ = 0 and θ_i⁻ > 0, or vice versa (θ_i⁺ > 0, θ_i⁻ = 0), the absolute value can be rewritten as |θ_i⁺ - θ_i⁻| = θ_i⁺ + θ_i⁻. Finally, at the optimum one of these two conditions indeed holds. If it were not the case, i.e., (θ_i⁺ > 0, θ_i⁻ > 0), then the term (θ_i⁺ + θ_i⁻) could be further reduced by subtracting min(θ_i⁺, θ_i⁻) from both θ_i⁺ and θ_i⁻. This would reduce the regularization term (θ_i⁺ + θ_i⁻), but not affect the data term, since the data term only depends on the difference (θ_i⁺ - θ_i⁻).
(c) Let x = [θ⁺; θ⁻] be the stacked vector of the two non-negative parts. The objective function is
E = ½ ‖y - Φ^T (θ⁺ - θ⁻)‖² + λ ∑_i (θ_i⁺ + θ_i⁻)   (S.248)
= ½ ‖y - [Φ^T, -Φ^T] x‖² + λ 1^T x   (S.249)
= ½ y^T y - y^T [Φ^T, -Φ^T] x + ½ x^T [Φ^T, -Φ^T]^T [Φ^T, -Φ^T] x + λ 1^T x   (S.250)
∝ ½ x^T H x + f^T x,   with   H = [[Φ Φ^T, -Φ Φ^T], [-Φ Φ^T, Φ Φ^T]],   f = λ 1 - [Φ y; -Φ y],   (S.251)
where constant terms that do not affect the minimization are dropped. This is a quadratic program in x, subject to the non-negativity constraint x ≥ 0.
Figure 8 shows an example of cubic polynomial regression using least-squares, Bayesian regression, and LASSO.

Figure 8: Cubic polynomial regression using (left) least-squares, (middle) Bayesian regression, (right) LASSO. The true function is the dashed line, with true θ = [-2.272, 0.000, -1.198, -0.228]; the estimates are LS θ = [-1.464, -0.458, -3.537, 0.974], BR θ = [-1.256, 1.251, -2.378, 0.050], and LASSO θ = [-1.673, -0.000, -2.272, 0.296]. For Bayesian regression, the dotted lines show the 2 standard deviations around the mean. Note that LASSO can find that the linear term θ_1 is 0.
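As an illustration (an addition, not in the original solution), the split-variable problem (S.248)-(S.251) can be handed to any non-negatively constrained quadratic-programming routine; the sketch below uses SciPy's bound-constrained L-BFGS-B optimizer on the smooth objective (S.249) as a stand-in for a dedicated QP solver, with arbitrary synthetic data and regularization weight:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
N, lam = 50, 5.0
x = rng.uniform(-2, 2, size=N)
Phi = np.vstack([x**p for p in range(4)])          # cubic polynomial features
theta_true = np.array([-2.0, 0.0, -1.2, -0.2])     # sparse-ish coefficients
y = Phi.T @ theta_true + 0.5 * rng.normal(size=N)

D = Phi.shape[0]
Phi2 = np.hstack([Phi.T, -Phi.T])                  # [Phi^T, -Phi^T] acting on z = [th+; th-]

def objective(z):
    r = y - Phi2 @ z
    return 0.5 * r @ r + lam * z.sum()             # (S.249) with |theta_i| = th+_i + th-_i

res = minimize(objective, x0=np.zeros(2 * D), method="L-BFGS-B",
               bounds=[(0.0, None)] * (2 * D))     # non-negativity: z >= 0
theta_lasso = res.x[:D] - res.x[D:]                # theta = theta+ - theta-
print(np.round(theta_lasso, 3))                    # small coefficients shrink toward 0
```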
………

Problem 4.12 Lagrange multipliers and equality constraints

(a) The Lagrangian is
L(π, λ) = ∑_{j=1}^K z_j log π_j + λ (∑_{j=1}^K π_j - 1).   (S.252)
Taking the derivatives and setting to zero gives
∂L/∂λ = ∑_{j=1}^K π_j - 1 = 0  ⟹  ∑_{j=1}^K π_j = 1,   (S.253)
∂L/∂π_j = z_j/π_j + λ = 0  ⟹  z_j + λ π_j = 0.   (S.254)
Taking (S.254) and summing over j,
∑_{j=1}^K (z_j + λ π_j) = 0  ⟹  λ ∑_{j=1}^K π_j = -∑_{j=1}^K z_j  ⟹  λ = -∑_{j=1}^K z_j,   (S.255)
since ∑_{k=1}^K π_k = 1. Finally, substituting into (S.254) (and changing the index from j to k to avoid confusion),
z_j - π_j ∑_{k=1}^K z_k = 0  ⟹  π_j = z_j / ∑_{k=1}^K z_k.   (S.256)

(b) The Lagrangian is
L(π, λ) = ∑_{j=1}^K π_j (z_j - log π_j) + λ (∑_{j=1}^K π_j - 1).   (S.257)
Taking the derivatives and setting to zero gives
∂L/∂λ = ∑_{j=1}^K π_j - 1 = 0  ⟹  ∑_{j=1}^K π_j = 1,   (S.258)
∂L/∂π_j = z_j - log π_j - 1 + λ = 0  ⟹  π_j = e^{z_j - 1} e^{λ}  ⟹  π_j e^{-λ} = e^{z_j - 1}.   (S.259)
Summing over j and noting that ∑_{j=1}^K π_j = 1 gives
e^{-λ} = ∑_{k=1}^K e^{z_k - 1}.   (S.260)
Finally, substituting (S.260) back into (S.259),
π_j ∑_{k=1}^K e^{z_k - 1} = e^{z_j - 1}  ⟹  π_j = e^{z_j - 1} / ∑_{k=1}^K e^{z_k - 1} = e^{z_j} / ∑_{k=1}^K e^{z_k}.   (S.261)
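A small numerical check of the two solutions (an addition; the vector z, the counts, and the random simplex points are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
z = np.array([0.5, 2.0, -1.0, 0.3])

# (b): the maximizer of sum_j pi_j (z_j - log pi_j) on the simplex is the softmax (S.261)
pi_b = np.exp(z) / np.exp(z).sum()

def objective(p):
    return np.sum(p * (z - np.log(p)))

# its value is log sum_j e^{z_j}, and it beats random points on the simplex
assert np.isclose(objective(pi_b), np.log(np.exp(z).sum()))
for q in rng.dirichlet(np.ones(len(z)), size=200):
    assert objective(pi_b) >= objective(q)

# (a): with non-negative z_j (e.g. counts), the maximizer of sum_j z_j log pi_j
# is the normalization pi_j = z_j / sum_k z_k (S.256)
counts = np.array([3.0, 1.0, 6.0])
pi_a = counts / counts.sum()
```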
………

Problem 4.6 Mixture of exponentials
Define z_i as the hidden assignment variable that assigns sample x_i to mixture component z_i = j. The complete data likelihood is
p(X, Z) = ∏_{i=1}^n p(z_i) p(x_i|z_i) = ∏_{i=1}^n π_{z_i} p(x_i|z_i).   (S.262)
Using the indicator variable trick, by defining z_ij = 1 iff z_i = j and 0 otherwise, we have
p(X, Z) = ∏_{i=1}^n ∏_{j=1}^K π_j^{z_ij} p(x_i|z_i = j)^{z_ij},   (S.263)
and taking the log,
log p(X, Z) = ∑_{i=1}^n ∑_{j=1}^K z_ij log π_j + z_ij log p(x_i|z_i = j).   (S.264)
For the E-step, we obtain the Q function by taking the expectation of the complete data log-likelihood in (S.264),
Q(θ; θ̂) = E_{Z|X,θ̂}[log p(X, Z|θ)]   (S.265)
= E_{Z|X,θ̂}[∑_{i=1}^n ∑_{j=1}^K z_ij log π_j + z_ij log p(x_i|z_i = j)]   (S.266)
= ∑_{i=1}^n ∑_{j=1}^K ẑ_ij log π_j + ẑ_ij log p(x_i|z_i = j),   (S.267)
where the last line follows because the expectation only applies to the variable z_ij. The "soft assignment" term ẑ_ij is calculated using the current parameter estimates θ̂,
ẑ_ij = E_{Z|X,θ̂}[z_ij] = p(z_i = j|X, θ̂)   (S.268)
= p(X|z_i = j) p(z_i = j) / p(X) = p(X_{¬i}) p(x_i|z_i = j) p(z_i = j) / (p(X_{¬i}) p(x_i))   (S.269)
= π_j p(x_i|z_i = j) / ∑_{k=1}^K π_k p(x_i|z_i = k) = p(z_i = j|x_i, θ̂).   (S.270)
(S.269) follows from the independence assumption of the samples, and X_{¬i} is the set of x_k with k ≠ i. Note that we have not used any properties of the mixture components yet, so this is the general form of the Q function for any mixture model.
For the M-step, we maximize Q with respect to the parameters θ. First, we optimize the component priors,
π* = argmax_π Q(θ; θ̂).   (S.271)
We have a constraint that ∑_j π_j = 1, and π_j ≥ 0. For the equality constraint (the non-negativity constraint is naturally satisfied), define the Lagrangian as
L(π) = ∑_{i=1}^n ∑_{j=1}^K ẑ_ij log π_j + λ (1 - ∑_{j=1}^K π_j),   (S.272)
where λ is the Lagrange multiplier. Taking the derivatives and setting to 0,
∂L/∂λ = 1 - ∑_{j=1}^K π_j = 0  ⟹  ∑_{j=1}^K π_j = 1,   (S.273)
∂L/∂π_j = ∑_{i=1}^n ẑ_ij / π_j - λ = 0  ⟹  ∑_{i=1}^n ẑ_ij - λ π_j = 0.   (S.274)
Summing (S.274) over j, we have
∑_{j=1}^K ∑_{i=1}^n ẑ_ij - λ ∑_{j=1}^K π_j = 0  ⟹  λ = n,   (S.275)
which follows from ∑_j ẑ_ij = 1 and ∑_j π_j = 1. Finally, substituting into (S.274), we have
π̂_j = (1/n) ∑_{i=1}^n ẑ_ij = n̂_j / n,   (S.276)
where n̂_j = ∑_{i=1}^n ẑ_ij is the (soft) number of samples assigned to component j. Again, this is a standard result for any mixture model.
Finally, for the exponential parameter λ_j, we optimize
λ_j* = argmax_{λ_j} Q(θ; θ̂).   (S.278)
We collect the terms of the Q function that depend on λ_j,
ℓ_j = ∑_{i=1}^n ẑ_ij log p(x_i|z_i = j) = ∑_{i=1}^n ẑ_ij (log λ_j - λ_j x_i).   (S.279)
Taking the derivative and setting to 0 gives
∂ℓ_j/∂λ_j = ∑_{i=1}^n ẑ_ij (1/λ_j - x_i) = 0  ⟹  (1/λ_j) ∑_{i=1}^n ẑ_ij = ∑_{i=1}^n ẑ_ij x_i  ⟹  1/λ̂_j = ∑_{i=1}^n ẑ_ij x_i / ∑_{i=1}^n ẑ_ij.   (S.280)
In summary, the EM algorithm for the mixture of exponentials is
E-step:  ẑ_ij = p(z_i = j|x_i, θ̂) = π̂_j p(x_i|z_i = j, λ̂_j) / ∑_{k=1}^K π̂_k p(x_i|z_i = k, λ̂_k),   (S.281)
M-step:  n̂_j = ∑_{i=1}^n ẑ_ij,   π̂_j = n̂_j / n,   1/λ̂_j = (1/n̂_j) ∑_{i=1}^n ẑ_ij x_i.   (S.282)
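A compact implementation of (S.281)-(S.282) (an addition, not part of the original solution; the two-component synthetic data, the initialization, and the fixed iteration count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(8)
# synthetic data from a 2-component mixture of exponentials
n = 2000
true_pi, true_lam = np.array([0.3, 0.7]), np.array([0.5, 4.0])
comp = rng.choice(2, size=n, p=true_pi)
x = rng.exponential(scale=1.0 / true_lam[comp])           # rate lambda -> scale 1/lambda

K = 2
pi_hat = np.full(K, 1.0 / K)
lam_hat = np.array([1.0, 2.0])
for _ in range(200):
    # E-step: soft assignments z_ij (S.281)
    lik = pi_hat * lam_hat * np.exp(-np.outer(x, lam_hat))   # n x K
    z = lik / lik.sum(axis=1, keepdims=True)
    # M-step: (S.282)
    n_j = z.sum(axis=0)
    pi_hat = n_j / n
    lam_hat = n_j / (z * x[:, None]).sum(axis=0)              # 1/lambda_j = weighted mean of x

print(pi_hat, lam_hat)   # should approach the true parameters
```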
………

Problem 5.1 Bias and variance of the kernel density estimator
(a) To calculate the bias and variance of the kernel density estimator p̂(x) = (1/n) ∑_{i=1}^n k̃(x - x_i), where k̃(x) = (1/h^d) k(x/h) is the scaled kernel, we suppose that the samples X = {x_i}_{i=1}^n are distributed according to the true distribution p(x), i.e., x_i ~ p(x), ∀i. The mean of the estimator p̂(x) is
E_X[p̂(x)] = E_X[(1/n) ∑_{i=1}^n k̃(x - x_i)] = (1/n) ∑_{i=1}^n E_{x_i}[k̃(x - x_i)]   (S.283)
= (1/n) · n ∫ p(x_i) k̃(x - x_i) dx_i = ∫ p(z) k̃(x - z) dz = p(x) * k̃(x),   (S.284)
where (S.284) follows from each term in the sum being the same, and * is the convolution operator. The mean of the estimator is the true distribution convolved with the kernel. In other words, it is a "blurred" or "smoothed" version of the true distribution.

(b) The variance of the estimator is
var_X(p̂(x)) = var((1/n) ∑_{i=1}^n k̃(x - x_i)) = (1/n²) ∑_{i=1}^n var(k̃(x - x_i))   (S.285)
= (1/n²) · n var(k̃(x - z)) = (1/n) var(k̃(x - x_i)),   (S.286)
which follows from the {x_i} being independent, and hence the variance of the sum is the sum of the variances (see Problem 1.4), and also identically distributed. Noting that var(x) = E[x²] - E[x]², and thus var(x) ≤ E[x²], we can place an upper bound on the variance,
var_X(p̂(x)) ≤ (1/n) E[k̃(x - z)²] = (1/n) ∫ (1/h^d) k((x - z)/h) k̃(x - z) p(z) dz   (S.287)
≤ (1/(n h^d)) (max_x k(x)) ∫ k̃(x - z) p(z) dz = (1/(n h^d)) (max_x k(x)) E[p̂(x)],   (S.288)
where the last line follows from k((x - z)/h) ≤ max_x k(x), i.e., the kernel is upper-bounded by its maximum value.
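A Monte Carlo check of (S.284) and the bound (S.288) for a 1-D Gaussian kernel (an addition; the true density N(0,1), bandwidth, seed, and trial count are arbitrary, and the closed-form Gaussian convolution is used for the theoretical mean):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
n, h, x0, trials = 50, 0.5, 0.3, 2000
kde_at_x0 = np.empty(trials)
for t in range(trials):
    xs = rng.normal(size=n)                               # samples from p = N(0,1)
    kde_at_x0[t] = np.mean(norm.pdf((x0 - xs) / h) / h)   # p_hat(x0) with Gaussian kernel

# (S.284): E[p_hat(x)] = (p * k_tilde)(x) = N(x | 0, 1 + h^2) for Gaussian p and kernel
mean_theory = norm.pdf(x0, scale=np.sqrt(1 + h**2))
print(kde_at_x0.mean(), mean_theory)                      # close for large `trials`

# (S.288) with d = 1: var[p_hat(x)] <= (1/(n h)) * max_x k(x) * E[p_hat(x)]
bound = norm.pdf(0) / (n * h) * mean_theory
assert kde_at_x0.var() <= bound
```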
Figure 9 plots the mean and variance of the KDE for different bandwidths.
Figure 9: Mean and variance of the kernel density estimator. Top row: the true density function (thin line), the estimator mean (thick line), and 2 standard deviations around the mean (dotted line) for different bandwidths h. Bottom row: examples of the estimate p̂(x) using 10 different sets of 50 samples drawn from the true density p(x). When the bandwidth is small (h = 0.5), the estimates are significantly different for each sample set (i.e., high variance), but the estimator mean is close to the true density. When the bandwidth is large (h = 2), the estimates are consistent with each other (i.e., low variance), but the estimator mean is far from the true density.

………