
EECS 189 Introduction to Machine Learning Fall 2020
This homework is due Wednesday, November 11 at 11:59 p.m.
2 Expectation Maximization (EM) Algorithm: A closer look!
Note that this problem appears lengthy. Don’t worry! Several parts are of a tutorial nature and several others are of a demo nature, where we have tried to provide ample explanations. This leads to the lengthy appearance of the overall problem, even though it is not actually that long. Hopefully you enjoy working on this homework and learn several concepts in greater detail.
The discussion will also be going over EM using examples similar to this homework.
In this problem, we will work on different aspects of the EM algorithm to reinforce your understanding of this very important algorithm.
For the first few parts, we work with the following one-dimensional mixture model:

$$Z \sim \begin{cases} 1 & \text{w.p. } \tfrac{1}{2}, \\ 2 & \text{w.p. } \tfrac{1}{2}, \end{cases} \qquad X \mid Z = 1 \sim \mathcal{N}(\mu_1, \sigma_1^2), \quad X \mid Z = 2 \sim \mathcal{N}(\mu_2, \sigma_2^2).$$
Z denotes the hidden label of the Gaussian distribution from which X is drawn. p(Z = 1) = 0.5, p(Z = 2) = 0.5 — the hidden label is a fair coin toss. X has two different Gaussian distributions based on what Z is. In other words, we have an equally weighted 2-mixture of Gaussians, where the variances and means for both elements of the mixture are unknown. (Note that the mixture weights are given to you. In general, even those would be unknown.)
For a given set of parameters, we represent the likelihood of (X, Z) by p(X, Z; θ) and its log-likelihood by l(X, Z; θ), where θ is used to denote the set of all unknown parameters θ = {μ_1, μ_2, σ_1, σ_2}.
Given a dataset presumed to consist of i.i.d. samples of only (xi, i = 1, . . . , n) (and no labels zi), our goal is to iteratively approximate the maximum-likelihood estimate of the parameters θ. In the first few parts, we walk you through one way of achieving this, namely the EM algorithm.
(a) Write down the expression for the joint likelihood p(X = x, Z = 1; θ) and p(X = x, Z = 2; θ). What is the marginal likelihood p(X = x; θ) and the log-likelihood l(X = x; θ)?
Solution: We have
$$p(X = x, Z = 1; \theta) = p(Z = 1; \theta)\, p(X = x \mid Z = 1; \theta) = \frac{1}{2} \cdot \frac{1}{\sqrt{2\pi\sigma_1^2}} \exp\!\left(-\frac{(x-\mu_1)^2}{2\sigma_1^2}\right),$$

$$p(X = x, Z = 2; \theta) = p(Z = 2; \theta)\, p(X = x \mid Z = 2; \theta) = \frac{1}{2} \cdot \frac{1}{\sqrt{2\pi\sigma_2^2}} \exp\!\left(-\frac{(x-\mu_2)^2}{2\sigma_2^2}\right),$$

and hence the marginal likelihood p(X = x; θ) is given by

$$p(X = x; \theta) = p(X = x, Z = 1; \theta) + p(X = x, Z = 2; \theta) = \frac{1}{2}\frac{1}{\sqrt{2\pi\sigma_1^2}} \exp\!\left(-\frac{(x-\mu_1)^2}{2\sigma_1^2}\right) + \frac{1}{2}\frac{1}{\sqrt{2\pi\sigma_2^2}} \exp\!\left(-\frac{(x-\mu_2)^2}{2\sigma_2^2}\right),$$

and the log-likelihood l(X = x; θ) is thereby given by

$$l(X = x; \theta) = \log\!\left[\frac{1}{\sqrt{2\pi\sigma_1^2}} \exp\!\left(-\frac{(x-\mu_1)^2}{2\sigma_1^2}\right) + \frac{1}{\sqrt{2\pi\sigma_2^2}} \exp\!\left(-\frac{(x-\mu_2)^2}{2\sigma_2^2}\right)\right] - \log 2.$$
(b) Now we are given an i.i.d. dataset where the label values Z are unobserved, i.e., we are given a set of n data points {x_i, i = 1, . . . , n}. Derive the expression for the log-likelihood l(X_1 = x_1, . . . , X_n = x_n; θ) of a given dataset {x_i}_{i=1}^n.
Solution: The log-likelihood for a dataset with i.i.d. points is just a sum of the log-likelihoods for individual points in the dataset and hence we have
$$l(X_1 = x_1, \ldots, X_n = x_n; \theta) = -n\log 2 + \sum_{i=1}^{n} \log\!\left[\frac{1}{\sqrt{2\pi\sigma_1^2}} \exp\!\left(-\frac{(x_i-\mu_1)^2}{2\sigma_1^2}\right) + \frac{1}{\sqrt{2\pi\sigma_2^2}} \exp\!\left(-\frac{(x_i-\mu_2)^2}{2\sigma_2^2}\right)\right].$$
Note that finding the MLE by differentiating this likelihood is not trivial (try differentiating with respect to μ_1 and setting it to zero!) and hence we feel the need for an iterative algorithm like EM.
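To make this concrete, the log-likelihood above is easy to evaluate numerically even though its maximizer has no closed form. Below is a minimal sketch; the function name is our own choice, and scipy's logsumexp is used only for numerical stability:

```python
import numpy as np
from scipy.special import logsumexp

def mixture_log_likelihood(x, mu1, mu2, sigma1, sigma2):
    """l(x_1, ..., x_n; theta) for the equally weighted 2-mixture of Gaussians."""
    # Per-point log densities under each component (the 1/2 weights pulled out).
    log_p1 = -0.5 * ((x - mu1) / sigma1) ** 2 - np.log(sigma1) - 0.5 * np.log(2 * np.pi)
    log_p2 = -0.5 * ((x - mu2) / sigma2) ** 2 - np.log(sigma2) - 0.5 * np.log(2 * np.pi)
    # Stable log(0.5 exp(log_p1) + 0.5 exp(log_p2)), summed over the dataset.
    return np.sum(logsumexp(np.stack([log_p1, log_p2]), axis=0) - np.log(2))
```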
(c) Let q denote a possible distribution on the (hidden) labels {Z_i}_{i=1}^n given by
$$q(Z_1 = z_1, \ldots, Z_n = z_n) = \prod_{i=1}^{n} q_i(Z_i = z_i). \qquad (1)$$
Note that since z_i ∈ {1, 2}, q has n parameters, namely {q_i(Z_i = 1), i = 1, . . . , n}. More generally, if Z took values in {1, . . . , K}, q would have n(K − 1) parameters. To simplify notation, from now on we use l(x; θ) := l(X = x; θ) and p(x, k; θ) := p(X = x, Z = k; θ). Show that for a given point x_i, we have
$$l(x_i; \theta) \geq F_i(\theta; q_i) := \underbrace{\sum_{k=1}^{2} q_i(k) \log p(x_i, k; \theta)}_{L(x_i;\,\theta,\,q_i)} + \underbrace{\sum_{k=1}^{2} q_i(k) \log \frac{1}{q_i(k)}}_{H(q_i)}, \qquad (2)$$
where H(q_i) denotes the Shannon entropy of the distribution q_i. Thus, conclude that we obtain the following lower bound on the log-likelihood:
$$l(\{x_i\}_{i=1}^{n}; \theta) \geq F(\theta; q) := \sum_{i=1}^{n} F_i(\theta; q_i). \qquad (3)$$
Hint: Jensen's inequality (for concave-∩ functions f, we know f(E[X]) ≥ E[f(X)], since a line joining two points on the graph lies below the function), the concave-∩ nature of the log, and reviewing your notes from lecture might be useful.
Notice that the right-hand side of the bound depends on q while the left-hand side does not.

Solution: We prove this result in generality, without using the particular expressions for the Gaussian distribution.
$$\begin{aligned} l(x_i; \theta) = \log p(x_i; \theta) &= \log \sum_{k=1}^{2} p(x_i, k; \theta) = \log \sum_{k=1}^{2} q_i(k) \frac{p(x_i, k; \theta)}{q_i(k)} \\ &= \log \mathbb{E}_{Z \sim q_i}\!\left[\frac{p(x_i, Z; \theta)}{q_i(Z)}\right] \overset{(i)}{\geq} \mathbb{E}_{Z \sim q_i}\!\left[\log \frac{p(x_i, Z; \theta)}{q_i(Z)}\right] \\ &= \sum_{k=1}^{2} q_i(k) \log p(x_i, k; \theta) + \sum_{k=1}^{2} q_i(k) \log \frac{1}{q_i(k)}, \end{aligned}$$
where step (i) follows from Jensen's inequality: for any concave-∩ function f, we have f(E[X]) ≥ E[f(X)].
(d) The EM algorithm can be considered a coordinate-ascent¹ algorithm on the lower bound F(θ; q) derived in the previous part, where we ascend with respect to θ and q in an alternating fashion. More precisely, one iteration of the EM algorithm is made up of two steps:
$$q^{t+1} = \arg\max_{q} F(\theta^t; q) \qquad \text{(E-step)}$$
$$\theta^{t+1} \in \arg\max_{\theta} F(\theta; q^{t+1}). \qquad \text{(M-step)}$$
Given an estimate θ^t, the previous part tells us that l({x_i}_{i=1}^n; θ^t) ≥ F(θ^t; q). Verify that equality holds in this bound if we plug in q(Z_1 = z_1, . . . , Z_n = z_n) = ∏_{i=1}^n p(Z = z_i | X = x_i; θ^t), and hence we can conclude that

$$q^{t+1}(Z_1 = z_1, \ldots, Z_n = z_n) = \prod_{i=1}^{n} p(Z = z_i \mid X = x_i; \theta^t) \qquad (4)$$

is a valid maximizer for the problem max_q F(θ^t; q) and hence a valid E-step update.
¹A coordinate-ascent algorithm is just one that fixes some coordinates and maximizes the function with respect to the others as a way of taking iterative improvement steps. (By contrast, gradient-descent algorithms tend to change all the coordinates in each step, just by a little bit.)

Solution: We have
$$F(\theta^t; q) = \sum_{i=1}^{n} \sum_{k=1}^{2} q_i(k) \log \frac{p(x_i, k; \theta^t)}{q_i(k)}.$$

Substituting q_i(k) = p(Z_i = k | X = x_i; θ^t) (short form p(k | x_i; θ^t)), we obtain that the RHS is

$$\sum_{i=1}^{n} \sum_{k=1}^{2} q_i(k) \log \frac{p(x_i, k; \theta^t)}{q_i(k)} = \sum_{i=1}^{n} \sum_{k=1}^{2} p(k \mid x_i; \theta^t) \log \frac{p(x_i, k; \theta^t)}{p(k \mid x_i; \theta^t)}.$$

Note that we have p(x_i, k; θ^t) = p(x_i; θ^t) p(k | x_i; θ^t). Also, since p(· | x_i; θ^t) is a valid conditional distribution on Z, we have ∑_{k=1}^{2} p(k | x_i; θ^t) = 1. Using these two facts, we obtain

$$\begin{aligned} \sum_{i=1}^{n} \sum_{k=1}^{2} q_i(k) \log \frac{p(x_i, k; \theta^t)}{q_i(k)} &= \sum_{i=1}^{n} \sum_{k=1}^{2} p(k \mid x_i; \theta^t) \log \frac{p(x_i; \theta^t)\, p(k \mid x_i; \theta^t)}{p(k \mid x_i; \theta^t)} \\ &= \sum_{i=1}^{n} \sum_{k=1}^{2} p(k \mid x_i; \theta^t) \log p(x_i; \theta^t) \\ &= \sum_{i=1}^{n} \log p(x_i; \theta^t) \sum_{k=1}^{2} p(k \mid x_i; \theta^t) \\ &= \sum_{i=1}^{n} \log p(x_i; \theta^t) = l(\{x_i\}_{i=1}^{n}; \theta^t). \end{aligned}$$

Using this equality and the fact that l({x_i}_{i=1}^n; θ^t) ≥ F(θ^t; q) for any q, we conclude that

$$q^{t+1}(Z_1 = z_1, \ldots, Z_n = z_n) = \prod_{i=1}^{n} p(Z = z_i \mid X = x_i; \theta^t)$$

is a valid maximizer for the problem max_q F(θ^t; q) and hence a valid E-step update.
(e) Using equation (4) from above and the relation (1), we find that the E-step updates can be re-written as
$$q_i^{t+1}(Z_i = k) = p(Z = k \mid X = x_i; \theta^t).$$
Using this relation, show that the E-step updates for the 2-mixture case are given by
$$q_i^{t+1}(Z_i = 1) = \frac{\frac{1}{\sigma_1}\exp\!\left(-\frac{(x_i-\mu_1)^2}{2\sigma_1^2}\right)}{\frac{1}{\sigma_1}\exp\!\left(-\frac{(x_i-\mu_1)^2}{2\sigma_1^2}\right) + \frac{1}{\sigma_2}\exp\!\left(-\frac{(x_i-\mu_2)^2}{2\sigma_2^2}\right)}, \quad \text{and}$$

$$q_i^{t+1}(Z_i = 2) = \frac{\frac{1}{\sigma_2}\exp\!\left(-\frac{(x_i-\mu_2)^2}{2\sigma_2^2}\right)}{\frac{1}{\sigma_1}\exp\!\left(-\frac{(x_i-\mu_1)^2}{2\sigma_1^2}\right) + \frac{1}{\sigma_2}\exp\!\left(-\frac{(x_i-\mu_2)^2}{2\sigma_2^2}\right)} = 1 - q_i^{t+1}(Z_i = 1).$$

Explain intuitively why these updates make sense.
Solution: We have

$$\begin{aligned} p(Z = 1 \mid X = x_i; \theta^t) &= \frac{p(Z = 1, X = x_i; \theta^t)}{p(X = x_i; \theta^t)} = \frac{p(x_i, 1; \theta^t)}{p(x_i, 1; \theta^t) + p(x_i, 2; \theta^t)} \\ &= \frac{p(1; \theta^t)\, p(x_i \mid 1; \theta^t)}{p(1; \theta^t)\, p(x_i \mid 1; \theta^t) + p(2; \theta^t)\, p(x_i \mid 2; \theta^t)} = \frac{\frac{1}{2}\, p(x_i \mid 1; \theta^t)}{\frac{1}{2}\, p(x_i \mid 1; \theta^t) + \frac{1}{2}\, p(x_i \mid 2; \theta^t)} \\ &= \frac{\frac{1}{\sigma_1}\exp\!\left(-\frac{(x_i-\mu_1)^2}{2\sigma_1^2}\right)}{\frac{1}{\sigma_1}\exp\!\left(-\frac{(x_i-\mu_1)^2}{2\sigma_1^2}\right) + \frac{1}{\sigma_2}\exp\!\left(-\frac{(x_i-\mu_2)^2}{2\sigma_2^2}\right)}. \end{aligned}$$

Now using the likelihood expressions from part (a) and substituting θ^t = (μ_1, μ_2, σ_1, σ_2), we obtain the desired result.
These updates reflect that EM tries to probabilistically impute the hidden data labels Zi. In doing so, for any given set of parameters θt, it assigns the labels their posterior probabilities which for the Gaussian case simplify to the expression above.
Note that for the given case, the prior probabilities of Z do not change with θt, but in a general setting when the parameter for Z is unknown too (like you have seen in the lectures) p(Z; θt) will also change with θt.
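As a concrete illustration, this E-step is a couple of lines of vectorized code. Here is a minimal sketch, with our own names and the parameters taken from θ^t:

```python
import numpy as np

def e_step(x, mu1, mu2, sigma1, sigma2):
    """Responsibilities q_i(Z_i = 1); the common 1/(2 sqrt(2 pi)) factor cancels."""
    w1 = np.exp(-0.5 * ((x - mu1) / sigma1) ** 2) / sigma1
    w2 = np.exp(-0.5 * ((x - mu2) / sigma2) ** 2) / sigma2
    return w1 / (w1 + w2)  # q_i(Z_i = 2) is 1 minus this
```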
(f) We now discuss the M-step. Using the definitions from equations (2) and (3), we have that

$$F(\theta; q^{t+1}) = \sum_{i=1}^{n}\left( L(x_i; \theta, q_i^{t+1}) + H(q_i^{t+1})\right) = H(q^{t+1}) + \sum_{i=1}^{n} L(x_i; \theta, q_i^{t+1}),$$

where we have used the fact that the entropy in this case is given by $H(q^{t+1}) = \sum_{i=1}^{n} H(q_i^{t+1})$. Notice that although (as computed in the previous part) q^{t+1} depends on θ^t, the M-step only involves maximizing F(θ; q^{t+1}) with respect to just the parameter θ while keeping the parameter q^{t+1} fixed. Now, noting that the entropy term H(q^{t+1}) does not depend on the parameter θ, we conclude that the M-step simplifies to solving for

$$\arg\max_{\theta}\; \underbrace{\sum_{i=1}^{n} L(x_i; \theta, q_i^{t+1})}_{=:\, L(\theta;\, q^{t+1})}.$$

For this and the next few parts, we use the simplified notation

$$q_i^{t+1} := q_i^{t+1}(Z_i = 1) \quad \text{and} \quad 1 - q_i^{t+1} := q_i^{t+1}(Z_i = 2),$$

and recall that θ = (μ_1, μ_2, σ_1, σ_2). Show that the expression for L(θ; q^{t+1}) for the 2-mixture case is given by
$$L((\mu_1, \mu_2, \sigma_1, \sigma_2); q^{t+1}) = C - \sum_{i=1}^{n}\left[ q_i^{t+1}\left(\frac{(x_i - \mu_1)^2}{2\sigma_1^2} + \log \sigma_1\right) + (1 - q_i^{t+1})\left(\frac{(x_i - \mu_2)^2}{2\sigma_2^2} + \log \sigma_2\right)\right],$$

where C is a constant that does not depend on θ or q^{t+1}.
Solution:
$$\begin{aligned} L((\mu_1, \mu_2, \sigma_1, \sigma_2); q^{t+1}) &= \sum_{i=1}^{n}\left( q_i^{t+1} \log p(x_i, 1) + (1 - q_i^{t+1}) \log p(x_i, 2)\right) \\ &= -n\log 2 + \sum_{i=1}^{n}\left( q_i^{t+1} \log p(x_i \mid Z_i = 1) + (1 - q_i^{t+1}) \log p(x_i \mid Z_i = 2)\right) \\ &= -n\log 2 - \sum_{i=1}^{n}\left[ q_i^{t+1}\left(\frac{(x_i - \mu_1)^2}{2\sigma_1^2} + \frac{1}{2}\log 2\pi + \log \sigma_1\right) + (1 - q_i^{t+1})\left(\frac{(x_i - \mu_2)^2}{2\sigma_2^2} + \frac{1}{2}\log 2\pi + \log \sigma_2\right)\right] \\ &= C - \sum_{i=1}^{n}\left[ q_i^{t+1}\left(\frac{(x_i - \mu_1)^2}{2\sigma_1^2} + \log \sigma_1\right) + (1 - q_i^{t+1})\left(\frac{(x_i - \mu_2)^2}{2\sigma_2^2} + \log \sigma_2\right)\right]. \end{aligned}$$
(g) Using the expression from the previous part, it is easy to show (make sure you know how to do this) that the gradients of L(θ; q^{t+1}) with respect to μ_1, μ_2, σ_1, σ_2 are given by
$$\frac{\partial L}{\partial \mu_1} = -\frac{\sum_{i=1}^{n} q_i^{t+1}(\mu_1 - x_i)}{\sigma_1^2}, \qquad \frac{\partial L}{\partial \mu_2} = -\frac{\sum_{i=1}^{n} (1 - q_i^{t+1})(\mu_2 - x_i)}{\sigma_2^2},$$

$$\frac{\partial L}{\partial \sigma_1} = \frac{\sum_{i=1}^{n} q_i^{t+1}(x_i - \mu_1)^2}{\sigma_1^3} - \frac{\sum_{i=1}^{n} q_i^{t+1}}{\sigma_1}, \qquad \frac{\partial L}{\partial \sigma_2} = \frac{\sum_{i=1}^{n} (1 - q_i^{t+1})(x_i - \mu_2)^2}{\sigma_2^3} - \frac{\sum_{i=1}^{n} (1 - q_i^{t+1})}{\sigma_2}.$$
Typically, the M-step updates are computed using the stationary points of F(θ; q^{t+1}). Using the expressions from previous parts and setting these gradients to zero, conclude that the M-step updates are given by

$$\mu_1^{t+1} = \frac{\sum_{i=1}^{n} q_i^{t+1} x_i}{\sum_{i=1}^{n} q_i^{t+1}}, \qquad \mu_2^{t+1} = \frac{\sum_{i=1}^{n} (1 - q_i^{t+1})\, x_i}{\sum_{i=1}^{n} (1 - q_i^{t+1})},$$

$$(\sigma_1^2)^{t+1} = \frac{\sum_{i=1}^{n} q_i^{t+1} (x_i - \mu_1^{t+1})^2}{\sum_{i=1}^{n} q_i^{t+1}}, \qquad (\sigma_2^2)^{t+1} = \frac{\sum_{i=1}^{n} (1 - q_i^{t+1})(x_i - \mu_2^{t+1})^2}{\sum_{i=1}^{n} (1 - q_i^{t+1})}.$$
Explain intuitively why these updates make sense.

Solution:
Setting the gradients to zero, we see

$$0 = -\frac{\sum_{i=1}^{n} q_i^{t+1}(\mu_1 - x_i)}{\sigma_1^2} \;\Longrightarrow\; \sum_{i=1}^{n} q_i^{t+1}\mu_1 = \sum_{i=1}^{n} q_i^{t+1} x_i \;\Longrightarrow\; \mu_1^{t+1} = \frac{\sum_{i=1}^{n} q_i^{t+1} x_i}{\sum_{i=1}^{n} q_i^{t+1}},$$

and identically

$$0 = -\frac{\sum_{i=1}^{n} (1 - q_i^{t+1})(\mu_2 - x_i)}{\sigma_2^2} \;\Longrightarrow\; \mu_2^{t+1} = \frac{\sum_{i=1}^{n} (1 - q_i^{t+1})\, x_i}{\sum_{i=1}^{n} (1 - q_i^{t+1})}.$$

Similarly,

$$0 = \frac{\sum_{i=1}^{n} q_i^{t+1}(x_i - \mu_1)^2}{\sigma_1^3} - \frac{\sum_{i=1}^{n} q_i^{t+1}}{\sigma_1} \;\Longrightarrow\; \sum_{i=1}^{n} q_i^{t+1} = \frac{\sum_{i=1}^{n} q_i^{t+1}(x_i - \mu_1)^2}{\sigma_1^2} \;\Longrightarrow\; (\sigma_1^2)^{t+1} = \frac{\sum_{i=1}^{n} q_i^{t+1}(x_i - \mu_1^{t+1})^2}{\sum_{i=1}^{n} q_i^{t+1}}$$

is the unique stationary point, with an identical argument for the other variance:

$$0 = \frac{\sum_{i=1}^{n} (1 - q_i^{t+1})(x_i - \mu_2)^2}{\sigma_2^3} - \frac{\sum_{i=1}^{n} (1 - q_i^{t+1})}{\sigma_2} \;\Longrightarrow\; (\sigma_2^2)^{t+1} = \frac{\sum_{i=1}^{n} (1 - q_i^{t+1})(x_i - \mu_2^{t+1})^2}{\sum_{i=1}^{n} (1 - q_i^{t+1})}.$$
Notice that these updates are simply weighted versions of the usual sample-mean and sample-covariance updates.
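In code, these weighted M-step updates are one-liners. A minimal sketch with our own names, where q is the vector of responsibilities q_i^{t+1} from the E-step:

```python
import numpy as np

def m_step(x, q):
    """Weighted means and standard deviations from responsibilities q."""
    mu1 = np.sum(q * x) / np.sum(q)
    mu2 = np.sum((1 - q) * x) / np.sum(1 - q)
    var1 = np.sum(q * (x - mu1) ** 2) / np.sum(q)
    var2 = np.sum((1 - q) * (x - mu2) ** 2) / np.sum(1 - q)
    return mu1, mu2, np.sqrt(var1), np.sqrt(var2)
```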
(h) For the next few parts, we further simplify the scalar mixture model. We work with the following simpler one-dimensional mixture model that has only a single unknown parameter:

$$Z \sim \begin{cases} 1 & \text{w.p. } \tfrac{1}{2}, \\ 2 & \text{w.p. } \tfrac{1}{2}, \end{cases} \qquad X \mid Z = 1 \sim \mathcal{N}(\mu, 1), \quad X \mid Z = 2 \sim \mathcal{N}(-\mu, 1),$$
where Z denotes the label of the Gaussian from which X is drawn. Given a set of observations only for X (i.e., the labels are unobserved), our goal is to infer the maximum-likelihood param- eter μ. Using computations similar to part (a), we can conclude that the likelihood function in this simpler set-up is given by
$$p(X = x; \mu) = \frac{1}{2}\,\frac{e^{-\frac{1}{2}(x-\mu)^2}}{\sqrt{2\pi}} + \frac{1}{2}\,\frac{e^{-\frac{1}{2}(x+\mu)^2}}{\sqrt{2\pi}}.$$

For a given dataset of i.i.d. samples {x_i, i = 1, . . . , n}, what is the log-likelihood l({x_i}_{i=1}^n; μ) as a function of μ?
Solution: In a straightforward manner, we obtain

$$l(\{x_i\}_{i=1}^{n}; \mu) = \sum_{i=1}^{n} l(x_i; \mu) = \sum_{i=1}^{n} \log\!\left(\frac{1}{2}\,\frac{e^{-\frac{1}{2}(x_i-\mu)^2}}{\sqrt{2\pi}} + \frac{1}{2}\,\frac{e^{-\frac{1}{2}(x_i+\mu)^2}}{\sqrt{2\pi}}\right) = -n\log\!\left(2\sqrt{2\pi}\right) + \sum_{i=1}^{n} \log\!\left(e^{-\frac{1}{2}(x_i-\mu)^2} + e^{-\frac{1}{2}(x_i+\mu)^2}\right).$$
Again, note that taking the derivative of the log-likelihood with respect to μ and finding the
solution for zero gradient is hard to do analytically.
(i) We now discuss EM updates for the setup introduced in the previous part. Let μ_t denote the estimate for μ at time t. First, we derive the E-step updates. Using part (d) (equation (4)) and part (e), show that the E-step updates simplify to
$$q_i^{t+1}(Z_i = 1) = \frac{\exp\!\left(-(x_i - \mu_t)^2/2\right)}{\exp\!\left(-(x_i - \mu_t)^2/2\right) + \exp\!\left(-(x_i + \mu_t)^2/2\right)}.$$
Notice that these updates can also be derived by plugging in μ1 = μ and μ2 = −μ in the updates given in part (e).
Solution: This is basically asking us to verify that the argument given in part (d) is still correct, since the expression is clearly an application of Bayes' rule to compute the conditional probability as in part (e) — as pointed out in the problem itself.
So, let’s examine the argument in part (d). Did we ever use what θ is or what the form of the probabilities is? No! The argument is completely generic and so it continues to apply here as well. The conditional probability of the hidden labels (given the current parameters θ and the data) is always a valid maximizer and hence a valid E-step update.
(j) Next, we derive the M-step update. Note that we can NOT simply plug in μ_1 = μ in the updates obtained in part (g), because the parameters are shared between the two mixtures. However, we can still make use of some of our previous computations for this simpler case. Plugging in
μ_1 = μ and μ_2 = −μ, σ_1 = σ_2 = 1 in part (f), show that the objective for the M-step is given by

$$L(\mu; q^{t+1}) = C - \sum_{i=1}^{n}\left[ q_i^{t+1}\,\frac{(x_i - \mu)^2}{2} + (1 - q_i^{t+1})\,\frac{(x_i + \mu)^2}{2}\right],$$

where C is a constant independent of μ. Compute the expression for the gradient $\frac{d}{d\mu} L(\mu; q^{t+1})$, and by setting the gradient to zero, conclude that the M-step update at time t + 1 is given by

$$\mu_{t+1} = \frac{1}{n}\sum_{i=1}^{n} \left(2q_i^{t+1} - 1\right) x_i.$$

Solution: Plugging in μ_1 = μ and μ_2 = −μ, σ_1 = σ_2 = 1 in the likelihood expression in part (f) indeed gives us

$$L(\mu; q^{t+1}) = C - \sum_{i=1}^{n}\left[ q_i^{t+1}\,\frac{(x_i - \mu)^2}{2} + (1 - q_i^{t+1})\,\frac{(x_i + \mu)^2}{2}\right].$$

Taking the derivative with respect to μ, we obtain
$$\frac{d L(\mu; q^{t+1})}{d\mu} = -\sum_{i=1}^{n}\left( q_i^{t+1}(\mu - x_i) + (1 - q_i^{t+1})(\mu + x_i)\right),$$

and setting the derivative to zero, we obtain

$$\sum_{i=1}^{n}\left( q_i^{t+1}(\mu - x_i) + (1 - q_i^{t+1})(\mu + x_i)\right) = 0 \;\Longrightarrow\; \mu_{t+1} = \frac{\sum_{i=1}^{n} (2q_i^{t+1} - 1)\, x_i}{\sum_{i=1}^{n} (q_i^{t+1} + 1 - q_i^{t+1})} = \frac{1}{n}\sum_{i=1}^{n} (2q_i^{t+1} - 1)\, x_i.$$
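Combining the E-step from part (i) with this M-step gives the complete EM loop for the single-parameter model. Below is a minimal sketch with our own names; the sigmoid form of the posterior follows from dividing the numerator and denominator of the E-step by exp(−(x_i − μ_t)²/2):

```python
import numpy as np

def em_simple(x, mu0=0.1, iters=100, tol=1e-8):
    """EM for X ~ 0.5 N(mu, 1) + 0.5 N(-mu, 1); returns the estimate of mu."""
    mu = mu0
    for _ in range(iters):
        q = 1.0 / (1.0 + np.exp(-2.0 * mu * x))   # E-step: q_i(Z_i = 1)
        mu_new = np.mean((2.0 * q - 1.0) * x)      # M-step
        if abs(mu_new - mu) < tol:
            break
        mu = mu_new
    return mu
```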
(k) Let us now consider a direct optimization method to estimate the MLE for μ: doing a gradient ascent algorithm directly on the complete log-likelihood function l({x_i}_{i=1}^n; μ)/n (scaling with n is natural here since the log-likelihood is a sum). Compute the gradient $\frac{d}{d\mu}\left(\frac{1}{n}\, l(\{x_i\}_{i=1}^n; \mu)\right)$ and show that it is equal to

$$\frac{d}{d\mu}\left(\frac{1}{n}\, l(\{x_i\}_{i=1}^{n}; \mu)\right) = \left(\frac{1}{n}\sum_{i=1}^{n} (2w_i(\mu) - 1)\, x_i\right) - \mu, \quad \text{where} \quad w_i(\mu) = \frac{e^{-\frac{(x_i-\mu)^2}{2}}}{e^{-\frac{(x_i-\mu)^2}{2}} + e^{-\frac{(x_i+\mu)^2}{2}}}.$$
Finally conclude that the gradient ascent scheme with step size α is given by

$$\mu_{t+1}^{GA} = \mu_t^{GA} + \alpha \left.\frac{d}{d\mu}\left(\frac{1}{n}\, l(\{x_i\}_{i=1}^{n}; \mu)\right)\right|_{\mu = \mu_t^{GA}} = (1 - \alpha)\,\mu_t^{GA} + \alpha\left(\frac{1}{n}\sum_{i=1}^{n} (2w_i(\mu_t^{GA}) - 1)\, x_i\right).$$
Solution: We have
$$\begin{aligned} \frac{d}{d\mu}\left(\frac{1}{n}\, l(\{x_i\}_{i=1}^{n}; \mu)\right) &= \frac{1}{n}\,\frac{d}{d\mu}\left(-n\log\!\left(2\sqrt{2\pi}\right) + \sum_{i=1}^{n}\log\!\left(e^{-\frac{1}{2}(x_i-\mu)^2} + e^{-\frac{1}{2}(x_i+\mu)^2}\right)\right) \\ &= \frac{1}{n}\sum_{i=1}^{n} \frac{e^{-\frac{1}{2}(x_i-\mu)^2}(x_i - \mu) - e^{-\frac{1}{2}(x_i+\mu)^2}(x_i + \mu)}{e^{-\frac{1}{2}(x_i-\mu)^2} + e^{-\frac{1}{2}(x_i+\mu)^2}} \\ &= \frac{1}{n}\sum_{i=1}^{n}\left( w_i(\mu)(x_i - \mu) - (1 - w_i(\mu))(x_i + \mu)\right) \\ &= \left(\frac{1}{n}\sum_{i=1}^{n} (2w_i(\mu) - 1)\, x_i\right) - \mu, \end{aligned}$$

where w_i(μ) is as defined in the equation above. Writing the gradient ascent step, we have

$$\mu_{t+1}^{GA} = \mu_t^{GA} + \alpha \left.\frac{d}{d\mu}\left(\frac{1}{n}\, l(\{x_i\}_{i=1}^{n}; \mu)\right)\right|_{\mu = \mu_t^{GA}} = (1 - \alpha)\,\mu_t^{GA} + \alpha\left(\frac{1}{n}\sum_{i=1}^{n} (2w_i(\mu_t^{GA}) - 1)\, x_i\right),$$

as desired.
(l) Comment on the similarity or dissimilarity between the EM and gradient ascent (GA) updates derived in the previous two parts. Refer to the corresponding jupyter notebook. You can run the two algorithms for the simpler one-dimensional mixture with a single unknown parameter μ. Run corresponding part in the notebook first. The code first generates a dataset of size 100 for two cases μtrue = 0.5 (i.e., when the two mixtures are close) and the case μtrue = 3 (i.e., when the two mixtures are far). We also plot the labeled dataset to help you visualize (but note that labels are not available to estimate μtrue and hence we use EM and GA to obtain estimates). Starting at μ0 = 0.1, the code then computes EM updates and GA updates with step size 0.05 for the dataset. Comment on the convergence plots (attach the convergence plots) for the two algorithms. Do the observations match well with the similarity/dissimilarity observed in the updates derived in the previous parts?
Solution: From previous parts, we have that q_i^{t+1} = w_i(μ_t). And hence we can see that the GA updates are a weighted combination of the EM update and the current iterate. In fact, if we set the step size to 1, then we recover the EM updates exactly.
In this case, we see that the EM updates will converge when θ_t stops changing much. The stationary point for the M-update is given by

$$\mu_{t+1} = \mu_t \iff \frac{1}{n}\sum_{i=1}^{n} (2w_i(\mu_t) - 1)\, x_i = \mu_t.$$

Figure 1: EM vs. Gradient Ascent. Panels: (a) μ_true = 0.5; (b) μ_true = 0.5, μ_0 = 0.1; (c) μ_true = 3; (d) μ_true = 3, μ_0 = 0.1. We see that GA and EM converge to the same estimate, and converge faster for the case when the data is well separated (μ_true = 3). In the simple one-dimensional example we considered, we also observe that gradient ascent converges more slowly than EM.
Note that solving for zero gradient also gives us the same equation (which is hard to solve directly!).
In other words, EM and gradient ascent updates have the same stationary point for this case. EM tries to find that point in a greedy manner by solving the fixed-point equation directly, while GA approaches the problem in a weighted fashion and will converge more slowly than EM in this case. As pointed out in lecture, there are other cases when gradient ascent actually is faster.
Our observations are also confirmed by the convergence plots as seen in Figure 1.
More generally, recall that the M-step relies on being able to solve the zero-gradient equations. Whenever stationary points can be determined easily (or directly), EM can be used. Otherwise, the gradient ascent method can be used; the GA method is also referred to as the Gradient EM method.
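For comparison with the EM loop sketched earlier, here is a gradient ascent loop implementing the updates from part (k) (a sketch; names are our own, and the defaults mirror the notebook's μ_0 = 0.1 and step size 0.05):

```python
import numpy as np

def gradient_ascent_simple(x, mu0=0.1, alpha=0.05, iters=500):
    """Gradient ascent on l({x_i}; mu)/n for the single-parameter mixture."""
    mu = mu0
    for _ in range(iters):
        w = 1.0 / (1.0 + np.exp(-2.0 * mu * x))  # w_i(mu), same form as the E-step
        mu = (1.0 - alpha) * mu + alpha * np.mean((2.0 * w - 1.0) * x)
    return mu
```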
(m) Suppose we decided to use the simplest algorithm to estimate the parameter μ: K-Means! Because the parameter is shared, assuming μ > 0, we can estimate

$$\hat{\mu} = \frac{1}{n_1 + n_2}\left| n_1 \hat{\mu}_1 - n_2 \hat{\mu}_2 \right|,$$

where n_1, n_2 denote the sizes of the two clusters at the time of the update, and μ̂_1 and μ̂_2 denote the cluster centroids determined by K-Means. Do you think this strategy will work well? Does your answer depend on whether μ is large or small? To help you answer the question, we have given a numerical implementation for this part as well. Run the code of the corresponding part in the Jupyter notebook. The code plots a few data points where we also plot the hidden labels to help you understand the dataset. The code also provides the final estimates of μ by EM, Gradient Ascent (GA), and K-Means. Use the plots and the final answers (report the final answers) to provide an intuitive argument for the questions asked above in this part. Do not spend time on being mathematically rigorous.
Hopefully you are able to learn the following takeaway messages: for the simple one-dimensional mixture model, we have that
• EM works well: it converges to a good estimate of μ pretty quickly.
• Gradient ascent is a weighted version of EM: it converges to a good estimate of μ, but is slower than EM.
• K-Means: because of the hard thresholding, it converges to a biased estimate of μ if the two distributions overlap.
Solution: For the two cases we obtain that
• True mean: 0.500, GA estimate: 0.510, EM estimate: 0.510, K-Means estimate: 0.591
• True mean: 3.000, GA estimate: 3.085, EM estimate: 3.085, K-Means estimate: 3.085
From the data plots in Figure 1, we see that when the two clusters are far apart, we expect K-Means to work well, since there is a large separation and points from one cluster do not affect the data from the other cluster.
However, when the data overlaps, we observe that K-Means finds a biased estimate. To see why that happens, assume that K-Means has converged; then note that the two means will depend on the data that is closest to them. Because of overlap, we will have some data from the other cluster weighing in (with full confidence, since each data point in the cluster has weight 1) and some data from the true cluster not weighing in (since a data point not in the cluster has weight 0) to determine the cluster mean. In this simple one-d example, to determine the mean on the +x axis, we see that (approximately) no negative data (blue cross/red dots) point will weigh in, and in fact some positive data points (red dots) will weigh in, thereby moving the mean for the blue cluster to the right. Indeed, we see that the K-Means estimate in this case is 0.591, which is larger than the true mean 0.5.
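For concreteness, a sketch of the K-Means-based estimate analyzed here (our own names and a crude initialization; it assumes both clusters stay non-empty):

```python
import numpy as np

def kmeans_mu_estimate(x, iters=50):
    """Two-cluster 1-D K-Means, then the pooled estimate of mu from part (m)."""
    c1, c2 = x.max(), x.min()  # initialize centroids at the extremes
    for _ in range(iters):
        assign = np.abs(x - c1) <= np.abs(x - c2)  # hard assignment
        c1, c2 = x[assign].mean(), x[~assign].mean()
    n1, n2 = assign.sum(), (~assign).sum()
    return np.abs(n1 * c1 - n2 * c2) / (n1 + n2)
```

The hard assignment is exactly where the bias enters: overlapping points are counted with weight 1 in the wrong cluster instead of being weighted by a posterior probability.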

3 Expectation Maximization (EM) Algorithm: In Action!
Suppose we have the following general mixture of Gaussians. We describe the model by a pair of random variables (X, Z), where X takes values in R^d and Z takes values in the set [K] = {1, . . . , K}. The joint distribution of the pair (X, Z) is given to us as follows:
$$Z \sim \text{Multinomial}(\pi), \qquad (X \mid Z = k) \sim \mathcal{N}(\mu_k, \Sigma_k), \quad k \in [K],$$

where π = (π_1, . . . , π_K)^⊤ and $\sum_{k=1}^{K} \pi_k = 1$. Note that we can also write

$$X \sim \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mu_k, \Sigma_k).$$
Suppose we are given a dataset {x_i}_{i=1}^n without their labels. Our goal is to identify the K clusters of the data. To do this, we are going to estimate the parameters μ_k, Σ_k using this dataset. (The EM updates will also pull out an estimate for π because the prior probability of a point being in each cluster is important for the probabilistic model, but K-Means-style algorithms don't care about that.) We are going to use the following three algorithms for this clustering task.
K-Means: For each data point at iteration t, we find its cluster by computing

$$y_i^{(t)} = \arg\min_{j \in [K]} \|x_i - \mu_j^{(t-1)}\|^2, \qquad C_j^{(t)} = \{x_i : y_i^{(t)} = j\}_{i=1}^{n},$$

where μ_j^{(t−1)} denotes the mean of C_j^{(t−1)}, the j-th cluster in iteration t − 1. The cluster means are then recomputed as

$$\mu_j^{(t)} = \frac{1}{|C_j^{(t)}|} \sum_{x_i \in C_j^{(t)}} x_i.$$

We can run K-Means till convergence (that is, we stop when the cluster memberships do not change anymore). Let us denote the final iteration as T; then the estimate of the covariances Σ_k from the final clusters can be computed as

$$\Sigma_j = \frac{1}{|C_j^{(T)}|} \sum_{x_i \in C_j^{(T)}} (x_i - \mu_j^{(T)})(x_i - \mu_j^{(T)})^\top.$$

Notice that this method can be viewed as a "hard" version of EM.

K-QDA: Given that we aim to also estimate the covariance, we may consider a QDA version of K-Means where the covariances keep getting updated at every iteration and also play a role in determining cluster membership (the relevant distance to the mean has to be relative to the norm induced by the inverse covariance matrix). The objective at the assignment step would be given by
$$y_i^{(t)} = \arg\min_{j \in [K]} (x_i - \mu_j^{(t-1)})^\top (\Sigma_j^{(t-1)})^{-1} (x_i - \mu_j^{(t-1)}), \qquad C_j^{(t)} = \{x_i : y_i^{(t)} = j\}_{i=1}^{n}.$$

We could then use C_j^{(t)} to recompute the parameters as follows:

$$\mu_j^{(t)} = \frac{1}{|C_j^{(t)}|} \sum_{x_i \in C_j^{(t)}} x_i, \quad \text{and} \quad \Sigma_j^{(t)} = \frac{1}{|C_j^{(t)}|} \sum_{x_i \in C_j^{(t)}} (x_i - \mu_j^{(t)})(x_i - \mu_j^{(t)})^\top.$$
We again run K-QDA until convergence (that is we stop when the cluster memberships do not change anymore). Notice that, again, this method can be viewed as another variant for the “hard” EM method since each point is assigned to exactly one cluster.
EM: The EM updates are given by
– E-step: For k = 1, . . . , K and i = 1, . . . , n, we have

$$q_i^{(t)}(Z_i = k) = p(Z = k \mid X = x_i; \theta^{(t-1)}).$$

– M-step: For k = 1, . . . , K, we have

$$\pi_k^{(t)} = \frac{1}{n}\sum_{i=1}^{n} q_i^{(t)}(Z_i = k) = \frac{1}{n}\sum_{i=1}^{n} p(Z = k \mid X = x_i; \theta^{(t-1)}),$$

$$\mu_k^{(t)} = \frac{\sum_{i=1}^{n} q_i^{(t)}(Z_i = k)\, x_i}{\sum_{i=1}^{n} q_i^{(t)}(Z_i = k)}, \quad \text{and} \quad \Sigma_k^{(t)} = \frac{\sum_{i=1}^{n} q_i^{(t)}(Z_i = k)\,(x_i - \mu_k^{(t)})(x_i - \mu_k^{(t)})^\top}{\sum_{i=1}^{n} q_i^{(t)}(Z_i = k)}.$$

Notice that unlike the previous two methods, in the EM updates each data point contributes in determining the mean and covariance for each cluster.

We now see the three methods in action. You are provided with code in the Jupyter notebook for all of the above 3 algorithms. You can run it by calling the following function from main:

experiments(seed, factor, num_samples, num_clusters)

We assume that x ∈ R², and the default settings are: number of samples is 500 (n = 500), and the number of clusters is 3 (K = 3). Notice that seed will determine the randomness and factor will determine how far apart the clusters are.
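For concreteness, one full EM iteration for this mixture could look like the following sketch (our own names; it leans on scipy only for the Gaussian density):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_iteration(X, pi, mus, Sigmas):
    """One EM iteration for a K-component Gaussian mixture."""
    n, K = X.shape[0], len(pi)
    # E-step: responsibilities q[i, k] proportional to pi_k N(x_i; mu_k, Sigma_k).
    q = np.column_stack([pi[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
                         for k in range(K)])
    q /= q.sum(axis=1, keepdims=True)
    # M-step: every point contributes to every cluster, weighted by q.
    Nk = q.sum(axis=0)
    pi = Nk / n
    mus = [q[:, k] @ X / Nk[k] for k in range(K)]
    Sigmas = [(q[:, k, None] * (X - mus[k])).T @ (X - mus[k]) / Nk[k]
              for k in range(K)]
    return pi, mus, Sigmas
```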
(a) Run the following setting (as written in the corresponding part):
experiments(seed=11, factor=1, num_samples=500, num_clusters=3)
Observe the initial guesses for the means and the plots for the 3 algorithms on convergence.
Comment on your observations. Attach the two plots for this case.
Note that the colors are used to indicate the points that belong to different clusters, to help you visualize the data and understand the results.
Solution:
The K-Means method is not able to properly cluster samples from the two Gaussians that are very close to one another, as one would expect. Notice that K-QDA does not converge well, so the fact that we try to estimate the variance from the clustered data does not help when clusters are close to one another. EM performs fairly well, as expected, even in this challenging setting and despite the fact that the initial guesses for the centroids were extremely inaccurate.
(b) Comment on the results obtained for the following setting: (as written in the corresponding part)
experiments(seed=63, factor=10, num_samples=500, num_clusters=3)
and attach the two plots for this case as well.
Solution:

All three methods are able to converge to the correct clustering. Given that the clusters are fairly separated and the initial guesses for the centroids are roughly in the area of the correct values, the results are what one would expect.
4 Implicit Bias of Gradient Descent for Logistic Regression
You have seen in an earlier homework that gradient descent on squared loss will tend to the minimum-norm solution if initialized at zero. We focused on the special case where you could get to a perfect solution (perfectly interpolate all the training data) where you get the lowest-possible squared loss, namely zero. Here, we study what happens for logistic regression in the counterpart case.
In this problem, we will consider the effects of the method used to train a model on the final out- come. Specifically, consider a classification problem with two linearly separable classes of points {ai} and {bi}. (This condition is the logistic regression counterpart of having a perfect solution possible — we can perfectly separate the training points.) Let the first class of points have label 0, and the second class of points have label +1. (We used −1 and +1 earlier at times, but here we will use 0 and +1 as the labels.)
Recall that logistic regression associates a point x with a real number from 0 to 1 by computing

$$f(x) = \frac{1}{1 + \exp(-w^\top x)}.$$
This number can be interpreted as the estimated probability for the point x having a true label of +1. Since this number is 1/2 when w^⊤x = 0, the sign of w^⊤x is what predicts the label of the test point x.
The loss function is defined to be
$$\sum_i \left[ y_i \ln\!\left(\frac{1}{f(x_i)}\right) + (1 - y_i) \ln\!\left(\frac{1}{1 - f(x_i)}\right)\right], \qquad (5)$$
where the label of the ith point x_i is y_i. This form illustrates the "log-loss" nature of logistic loss. Notice that when y_i = +1, the second term inside the sum is gone and all we want is ln(1/f(x_i)) to be close to 0, which will happen if f(x_i) is close to 1, which happens if w^⊤x_i is a huge positive number. A similar story for y_i = 0 makes us want w^⊤x_i to be a huge negative number.
In the two-class example defined above, it can be rewritten as

$$\sum_i \ln\!\left(1 + e^{w^\top a_i}\right) + \sum_i \ln\!\left(1 + e^{-w^\top b_i}\right), \qquad (6)$$
summing over the points in each class separately.
(a) Consider trying to minimize (6) when our sets of points are linearly separable (i.e., there exists a w such that w^⊤a_i < 0 and w^⊤b_j > 0 for all valid i, j). Does a single "optimal" solution exist for minimizing (6)? How low can the loss go?
(Hint: If w is such that it properly separates the two classes of points, what happens to the loss in (6) if we doubled w? Also, look at the individual terms in (6) — can any of those terms be negative?)
Solution: It is clear that if such a w existed, kw would also have the same property, but (for k > 1) with a lower loss. In other words, scaling w up would drive the loss down. For some w′ = kw, as k → ∞, we see that
$$\sum_i \ln\!\left(1 + e^{{w'}^\top a_i}\right) + \sum_i \ln\!\left(1 + e^{-{w'}^\top b_i}\right) \to \sum_i \ln\!\left(1 + e^{-\infty}\right) + \sum_i \ln\!\left(1 + e^{-\infty}\right) = 0,$$
so we see that we can in fact drive the loss down to zero (abusing notation somewhat). So if our data is linearly separable, we can simply increase the magnitude of w to bring the loss down arbitrarily close to zero.
(b) In the previous part, you should have seen that any w that correctly separates our training data by class (i.e., sign(w^⊤x) has zero classification error on the training data) can be used to produce a set of weights w′ with an arbitrarily low loss on the training set when evaluating (6).

In most cases of separable training data, infinitely many qualitatively distinct w's can exist that separate the training data, as shown in the below figure:
[Figure: several qualitatively distinct lines in the (x, y) plane, each separating the same two-class training data.]

Yet when we try to fit a model by minimizing (6) using standard techniques such as gradient descent, it turns out that we tend to converge to a particular classifier, even though infinitely many exist with arbitrarily small loss.
To understand this effect, we will consider a very simple one-dimensional example where the points are scalars. To be able to place the separator anywhere on the line, we will introduce a bias term into our model so that there are two parameters. In particular, here we consider
$$f(x) = \frac{1}{1 + \exp(-m(x - c))},$$
with decision boundary x = c. (All the numbers bigger than c are deemed to belong to the class 1 and all the points smaller than c are deemed to belong to the class 0.) For any c that separates our training data, it is easy to see that as m → ∞, the training loss in (5) goes to zero if we assume that all the +1-labeled training points are indeed bigger than c and all the 0-labeled training points are indeed smaller than c.
Explicitly write down the loss function for logistic regression in this example, assuming that we have the points a_1, . . . , a_{n_1} in class 0, and b_1, . . . , b_{n_2} in class 1, following the example of (6), so that the dependence on the two scalar parameters m and c is clear.
Solution: We have
$$L = \sum_{i=1}^{n_1} \ln\!\left(1 + e^{m(a_i - c)}\right) + \sum_{i=1}^{n_2} \ln\!\left(1 + e^{-m(b_i - c)}\right).$$
(c) Compute the gradient descent update step with step size γ. Give updates for both m and c.

Solution: Differentiating the loss function with respect to m, we find that

$$\frac{\partial L}{\partial m} = \sum_{i=1}^{n_1} \frac{a_i - c}{1 + \exp(-m(a_i - c))} - \sum_{i=1}^{n_2} \frac{b_i - c}{1 + \exp(m(b_i - c))}.$$

The gradient update step for m is therefore

$$m^{(t+1)} = m^{(t)} - \gamma\left[\sum_{i=1}^{n_1} \frac{a_i - c^{(t)}}{1 + \exp(-m^{(t)}(a_i - c^{(t)}))} - \sum_{i=1}^{n_2} \frac{b_i - c^{(t)}}{1 + \exp(m^{(t)}(b_i - c^{(t)}))}\right].$$

Similarly,

$$\frac{\partial L}{\partial c} = m\left[\sum_{i=1}^{n_1} \frac{-1}{1 + \exp(-m(a_i - c))} + \sum_{i=1}^{n_2} \frac{1}{1 + \exp(m(b_i - c))}\right],$$

so

$$c^{(t+1)} = c^{(t)} - \gamma\, m^{(t)}\left[\sum_{i=1}^{n_1} \frac{-1}{1 + \exp(-m^{(t)}(a_i - c^{(t)}))} + \sum_{i=1}^{n_2} \frac{1}{1 + \exp(m^{(t)}(b_i - c^{(t)}))}\right].$$

(d) Now, you can imagine that based on where we are initialized and based on where the points are, gradient descent will take some path as it minimizes. Here, we are going to assume

that our training data is separable, so max({ai}) < min({bi}). It is important to note that initially, it is possible that c does not separate our data i.e. there may exist an i such that c < ai, or c > bi. We can start with some points on the wrong side of c.
It turns out that logistic loss is nicely convex and so with a sufficiently small step size γ, gradient descent will take us towards a global minimizer. (Strictly speaking, a global infimizer.) Throughout this problem, you may use the fact that gradient descent, when applied to minimize our loss function, can drive the loss down to L_min + ε for any ε > 0, where L_min is the minimum possible loss (0, in the separable case).
Show that gradient descent will move c to a value that does separate the training data. (Hint: Suppose that there was a point that was on the wrong side of c. How low could your
loss be if that were true?)
Solution: Our goal will be to establish a nonzero lower bound for our loss function, if c does not separate our data. Then we know that gradient descent will at some point have to reduce the loss below this lower bound, meaning that c will have to separate the training data.
Without loss of generality, assume that c > b1, so it is not smaller than all the {bi}. Then we can write our loss L as
$$L = \sum_{i=1}^{n_1} \ln\!\left(1 + e^{m(a_i - c)}\right) + \sum_{i=1}^{n_2} \ln\!\left(1 + e^{-m(b_i - c)}\right) \geq \ln\!\left(1 + e^{m(c - b_1)}\right).$$
First consider the case when m ≥ 0. Since c > b_1, we have that e^{c−b_1} > 1, so e^{m(c−b_1)} ≥ 1. Thus, L ≥ ln(2) > 0, as desired.
Next, consider the case when m < 0. By an analogous manipulation to what was done earlier, we have that

$$L \geq \ln\!\left(1 + e^{m(a_1 - c)}\right).$$

Since c > b_1, we have c > a_1 as well, since the {b_i} are all greater than the {a_i} (by our assumption of separability). Thus m(a_1 − c) > 0, so L ≥ ln(2) > 0 when m < 0 as well, completing our proof.

(e) You have now shown that, after some time, gradient descent will be forced to drive c strictly between the two classes in the training data. For a sufficiently small choice of learning rate, it is also known that gradient descent updates only strictly decrease the loss being optimized by gradient descent. Thus, we have in fact shown that c will continue to separate our training data by class as we take further steps in gradient descent.

At this point, there are two conceptual steps to understand why c must converge to the max-margin choice. First, we are going to show that m must grow unboundedly. Then, we will leverage the idea that m has gotten very large to allow us to focus on the class members on the boundary and their impact on gradient descent.

Here, we are asking you to do the first part: after c separates our training data by class, show that m grows unboundedly towards infinity. For simplicity, assume that m^{(t)} > 0 when c first separates our training points by class.

If you want, you can use approximation or heuristic-based reasoning to understand the rate of growth, as a function of the number of gradient descent steps. To do this, you may also assume a sufficiently small step size γ, that is so small that you can approximate m as a continuous function of t, with
$$m^{(t+1)} - m^{(t)} \approx \frac{dm}{dt}.$$
This will permit you to use a differential equation to understand the growth of m.
(Hint: You want to show that the gradient is going to be pushing m^{(t)} to be bigger. How small can this gradient be? That will give you a lower bound of the form κ/(1 + exp{m^{(t)}β}), where κ can be made as small as you want by setting the step size, and β is something that has to do with the training data. You could similarly upper-bound this gradient if you wanted to, even though we didn't ask you to. Notice that this bound by itself can allow you to prove that m^{(t)} cannot stay bounded. Why? Suppose that m^{(t)} stayed bounded. Then the gradient would also stay bounded and positive. Why does that lead to a contradiction?

You can also apply the approximation that dm/dt = your lower bound for the gradient and look at that differential equation. The gradient is positive and so m is growing. Solve this differential equation by guessing a solution using separation of variables if you want to get the asymptotic scaling.)
Solution:
It is useful to invoke the WLOG ordering assumption given in the next part since it makes the notation simpler. We can use b_1 to denote the smallest member of one class and a_1 to denote the largest member of the other class. This way, b_1 − a_1 represents the separation between the two sets of training points.
Looking at the gradient update step for m, we can rewrite it as
$$m^{(t+1)} = m^{(t)} + \gamma\left[\sum_{i=1}^{n_1} \frac{c^{(t)} - a_i}{1 + \exp(m^{(t)}(c^{(t)} - a_i))} + \sum_{i=1}^{n_2} \frac{b_i - c^{(t)}}{1 + \exp(m^{(t)}(b_i - c^{(t)}))}\right]. \qquad (7)$$

For simplicity, we can lower bound this as
$$m^{(t+1)} - m^{(t)} \geq \gamma\left[\frac{c^{(t)} - a_1}{1 + \exp(m^{(t)}(c^{(t)} - a_1))} + \frac{b_1 - c^{(t)}}{1 + \exp(m^{(t)}(b_1 - c^{(t)}))}\right]. \qquad (8)$$

Here, notice that at least one of the two terms (c^{(t)} − a_1) or (b_1 − c^{(t)}) has to be at least (b_1 − a_1)/2, and the biggest that either can be is (b_1 − a_1). So we can further lower-bound the increment in (8) by

$$\gamma\, \frac{b_1 - a_1}{2\left(1 + \exp(m^{(t)}(b_1 - a_1))\right)}. \qquad (9)$$
We know γ > 0. Let us use a proof by contradiction. Suppose that there existed a finite upper bound on m^{(t)}, so that for all t we had m^{(t)} < m̄ < ∞. Then we know that (9) is itself lower-bounded by

$$\gamma\, \frac{b_1 - a_1}{2\left(1 + \exp(\bar{m}(b_1 - a_1))\right)} > 0,$$

and so the gradient-descent update step always grows m^{(t)} by at least that much, which means it grows unboundedly. This is a contradiction, and so m^{(t)} must grow to infinity.
To get a sense for how fast we go to infinity, we can make some more approximations. For a sufficiently small step size γ, based on the lower bounds above, we can treat the gradient-descent difference equation as though it were a gradient-flow differential equation instead. If we have a lower bound on the growth (and the right-hand side is always positive), then the solution to this differential equation clearly must be a lower bound on the growth rate. And similarly if we had an upper bound.
$$\frac{dm}{dt} \geq \frac{\kappa}{1 + \exp(m(t)(b_1 - a_1))},$$

for some constant κ independent of m, c, and t. Solving this differential equation via separation of variables, we obtain

$$m(t) + \frac{1}{b_1 - a_1} \exp\!\left(m(t)(b_1 - a_1)\right) = \kappa t \;\Longrightarrow\; m(t) = \Omega(\log t).$$
This confirms that m goes to infinity. To see that the growth is indeed logarithmic, we can upper-bound the right-hand side of the gradient-descent update in the same qualitative manner. Notice that there are a finite number of terms in the sum and each of them is no bigger than what we have here when m is large enough.
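To watch these dynamics numerically, here is a small simulation of the coupled updates from part (c) (a sketch; the function name and defaults are our own):

```python
import numpy as np

def logistic_gd_1d(a, b, m=0.0, c=0.0, gamma=0.01, iters=10000):
    """Gradient descent on the 1-D logistic loss; a has label 0, b has label 1."""
    for _ in range(iters):
        dm = np.sum((a - c) / (1 + np.exp(-m * (a - c)))) \
           - np.sum((b - c) / (1 + np.exp(m * (b - c))))
        dc = m * (np.sum(-1 / (1 + np.exp(-m * (a - c))))
                  + np.sum(1 / (1 + np.exp(m * (b - c)))))
        m, c = m - gamma * dm, c - gamma * dc
    return m, c  # m grows roughly like log(t); c approaches the margin midpoint
```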
(f) Now that we have shown that m tends towards infinity, let’s see what is eventually going to happen to c. This will tell us which of the classifiers separating our training data gradient descent converges towards.
Assume without loss of generality that

$$a_{n_1} \leq \cdots \leq a_1 < b_1 \leq \cdots \leq b_{n_2},$$

and let c^* denote the max-margin decision boundary, the midpoint between a_1 and b_1. Consider an iterate c^{(t)} that separates the data but is at least δ/2 away from c^*, and argue that once m^{(t)} is large enough, the gradient pushes c^{(t)} towards c^*. Note that every term in the gradient with respect to c is of the form ±1/(1 + exp{m^{(t)} k}) for some k > 0 and will thus all go to zero. However, it is clear that the terms with a smaller k will dominate, since they will be greater than the other terms.
Solution: As stipulated, consider the case where c^{(t)} ∈ (c^* + δ/2, c^* + δ). For convenience, let us define the margin of the problem as ξ = b_1 − a_1.
Here, the dominant term that is trying to make c^{(t)} bigger is the one coming from 1/(1 + exp{m^{(t)}(c^{(t)} − a_1)}). Meanwhile, the dominant term that is trying to make c^{(t)} smaller is the one coming from −1/(1 + exp{m^{(t)}(b_1 − c^{(t)})}). Of course, there are n_1 other terms that could be helping try to make c bigger. However, if m^{(t)} > (ln n_1)/δ, even all of them put together cannot overcome the push coming from the single term −1/(1 + exp{m^{(t)}(b_1 − c^{(t)})}). Why? Because b_1 − c^{(t)} < c^{(t)} − a_1 − δ by assumption. Consequently, the gradient is pushing c closer to c^* as long as m is large enough. By symmetry, the exact same argument works if c were on the other side of c^*, but at least δ/2 away from c^*.

(g) The final thing to do is to make sure that the force of gradient descent can actually move c enough to close the gap by δ/2 in a finite amount of time if c were to stay in the region between δ/2 and δ away from c^*. Argue why this must be true based on the fact that m^{(t)} must grow unboundedly, by comparing the gradient pushing c to the one that is pushing m.

Once you've done that, you have shown that c has to close the gap to δ/2 away in a finite time, and since δ was arbitrary, this means that we have to close the gap to δ/4 and then to δ/8 and so on, all the way as c converges to c^* itself. It cannot converge anywhere else.

Technically speaking, we also would have to rule out the prospect of gradient descent with a constant step size constantly jumping too far across c^* to result in an infinite oscillation. However, you can also show this cannot happen forever based on the fact that m^{(t)} → ∞ implies that the gradients themselves are getting smaller and smaller. No finite jump could be sustained.

Solution: We saw in the previous part that if m^{(t)} is large enough, then effectively the only two training points that matter as far as pushing c goes are those on the margin. And if we are in the interval (c^* + δ/2, c^* + δ), the term −1/(1 + exp{m^{(t)}(b_1 − c^{(t)})}) in the gradient dominates as long as m^{(t)} is large enough. So after many timesteps, when m^{(t)} is large enough, we can bound the difference

$$c^{(t+1)} - c^{(t)} \leq -\gamma\, \frac{m^{(t)}}{2}\left(\frac{1}{1 + \exp(m^{(t)}(b_1 - c^{(t)}))}\right). \qquad (10)$$

Let us compare this to the force pushing m^{(t)} in (7). The denominator in (10) dominates all the denominators in the sum in (7). Furthermore, there are a finite number of terms in (7), and the a_i and b_i points have a finite span. Consequently, when m^{(t)} is large enough, the right-hand side of (10) is always bigger in magnitude than the increment in (7). Since m^{(t)} grows unboundedly, this must mean that c^{(t)} can also move as far as needed. As described in the problem statement above, this means that c^{(t)} must therefore converge to c^*, and logistic regression finds the max-margin separator when solved using gradient descent.

(h) Run the associated Jupyter notebook to observe this effect, and its generalization to d dimensions, numerically. Comment on what you observe in the notebook.

Solution: See the Jupyter solution notebook for details.

5 Linear Methods on Fruits and Veggies

In previous semesters that were not struck by the pandemic and shelter-at-home restrictions, students were asked to collectively collect a dataset for the class. Unfortunately, this is hard to do this semester and so we are going to have to trust that you had enough experience collecting data for learning in 16AB's labs, or if you are a graduate student, in the appropriate lab-based course for your field.

In this problem, we will use a dataset of fruits and vegetables that was collected in previous semesters. The goal is to accurately classify the produce in the image. Instead of operating on the raw pixel values (because that can be slow), we operate on extracted HSV colorspace histogram features from the image.
HSV histogram features extract the color spectrum of an image, so we expect these features to serve well for distinguishing produce like bananas from apples. Denote the input state x ∈ R^729, which is an HSV histogram generated from an RGB image with a fruit centered in it. Each data point will have a corresponding class label, which corresponds to its matching produce. Given 25 classes, we can denote the label as y ∈ {0, . . . , 24}.

Better features would of course give better results, but we chose color spectra for an initial problem for ease of interpretation. As a historical note, early image search engines also used color histograms to group images. HSV is a more perceptually relevant color-space than RGB, and that is why we are using those features.

Classification here is still a hard problem because the state space is much larger than the amount of data we obtained in the class: we are trying to perform classification in a 729-dimensional space with only a few hundred data points from each of the 25 classes. In order to obtain higher accuracy, we will examine how to perform hyperparameter optimization and dimensionality reduction. We will first build out each component and test on a smaller dataset of just 3 categories: apple, banana, eggplant. Then we will combine the components to perform a search over the entire dataset. Note that all Python packages needed for the project will be imported already. DO NOT import new Python libraries.

(a) Before we classify our data, we will study how to reduce the dimensionality of our data. We will project some of the dataset into 2D to visualize how effective different dimensionality-reduction procedures are. The first method we consider is a random projection, where a matrix is randomly created and the data is linearly projected along it. For random projections, we produce a matrix A ∈ R^{2×729} where each element A_{ij} is sampled independently from a normal distribution (i.e., A_{ij} ∼ N(0, 1)).

Run the Part (a) in Jupyter Notebook and report the resulting plot of the projection. You do not need to write any code.

Solution:

Figure 2: A random projection of the high-dimensional HSV histograms

See Jupyter Notebook for solution. Your plot may look different based on the random values you generated, but the message to take away is similar.

(b) We will next examine how well PCA performs as a dimensionality-reduction tool. PCA projects the data into the subspace with the most variance, which is determined via the covariance matrix Σ_XX. We can compute the principal components via the singular value decomposition UΛV^⊤ = Σ_XX. Collecting the first two column vectors of U in the matrix U_2, we can project our data as follows:

$$\bar{x} = U_2^\top x.$$

Run the Part (b) in Jupyter Notebook and report the projected 2D points figure. You do not need to write any code.

Solution:

Figure 3: A PCA projection of the high-dimensional HSV histograms

See Jupyter Notebook for solution.
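For intuition, a minimal sketch of this PCA projection (our own variable names; X holds the HSV features with one row per image):

```python
import numpy as np

def pca_project_2d(X):
    """Project rows of X onto the top-2 principal components."""
    Xc = X - X.mean(axis=0)            # center the data
    cov = Xc.T @ Xc / len(X)           # empirical covariance Sigma_XX
    U, _, _ = np.linalg.svd(cov)       # U Lambda V^T = Sigma_XX
    return Xc @ U[:, :2]               # x_bar = U_2^T x for each row
```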
(c) Finally, we will use what is called Canonical Correlation Analysis (CCA) to do dimensionality reduction. As PCA provided us with a dimensionality-reduction approach that didn't use the labels y in any way, what would happen to PCA if the data by nature was "contaminated" with a strong correlated latent signal that had nothing to do with the labels? PCA would select projections with the greatest variation and keep them (even though they are irrelevant to predicting the labels), in effect throwing away those dimensions where we could actually hope to get information relevant for predicting the label.

We saw in an earlier homework how we can use auxiliary labels to help us find the relevant directions in the inputs that correspond to latent variables that are useful in predicting the auxiliary labels as a collective. In this problem, we are going to exploit the fact that we have more than one class. This means we can use one-hot encoding to get vector-valued ȳ ∈ {0, 1}^J, where J is the number of class labels (J = 3 for this part). Each of these dimensions can function as an auxiliary label.

In the earlier homework, we saw that the object whose SVD is desired is E[x z^⊤]. In that problem, however, we had made sure that the input data x by its nature did not have any strong directions in it. If we want to suppress any strong directions in our x inputs, we can apply a normalization to our data set that is called "whitening" — whereby we change coordinates so that our X matrix has all of its nonzero singular values being constant. This can be done by applying a change of coordinates that is given by the square root of the PSD matrix that represents the inverse of the empirical covariance of the x in the data.

In an earlier homework, you also saw that there can be issues that arise from having class imbalances in our data set. Such class imbalances manifest in one-hot encodings by having some of the "label-features" (banana or not-banana) having a higher norm than others — notice that under one-hot encoding, the empirical "label-feature" vectors are guaranteed to be orthogonal to begin with. Reweighting data points to achieve class balance is spiritually akin to whitening the one-hot encodings ȳ of the labels.

CCA computes the singular value decomposition

$$U\Lambda V^\top = \Sigma_{XX}^{-\frac{1}{2}}\, \Sigma_{XY}\, \Sigma_{YY}^{-\frac{1}{2}}$$

to basically work with such whitened x and y data. We do not want to be biased by strong directions in x, nor by strong directions in y — we just want to understand which directions in a whitened x help us predict a whitened y. We can then project to the canonical variates by using the first k columns of U, or U_k, together with the whitening. The complete lower-dimensional projection can be written as follows:

$$\bar{x} = U_k^\top\, \Sigma_{XX}^{-\frac{1}{2}}\, x.$$

In this problem, however, you can just treat CCA as a black-box tool for dimensionality reduction that takes label information into account. It is interesting to note that, theoretically, once you have picked the first J singular vectors, there is no further label-targeted information in the singular vectors beyond that point. However, there is still information about the inputs there, and so formally we do not have to stop at J when it comes to reducing dimensionality.

Run the Part (c) in Jupyter Notebook and report the resulting plot for CCA. You do not need to write any code. Among the dimension-reduction methods we have tried, i.e., random projection, PCA, and CCA, which is the best for separation among classes? Which is the worst? Why do you think this happens?

Solution:

Figure 4: A CCA projection of the high-dimensional HSV histograms

See Jupyter Notebook for solution. Among the three dimension-reduction methods, CCA works best and the random projection is the worst, because compared to PCA, a random projection does not preserve the maximum variance in the data. CCA works better than PCA because it has access to the labels, which PCA does not; thus it is able to find the directions that maximize the correlation with the labels.
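A minimal sketch of the whitened projection above (our own names; Y is the n×J one-hot label matrix, and the small eps inside the inverse square roots is our own assumption to keep them well defined on rank-deficient data):

```python
import numpy as np

def cca_project(X, Y, k=2, eps=1e-6):
    """Project X onto its top-k canonical directions against one-hot labels Y."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    n = len(X)

    def inv_sqrt(M):
        # Inverse square root of a PSD matrix via its eigendecomposition.
        w, V = np.linalg.eigh(M)
        return V @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ V.T

    Sxx, Syy, Sxy = Xc.T @ Xc / n, Yc.T @ Yc / n, Xc.T @ Yc / n
    U, _, _ = np.linalg.svd(inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy))
    return Xc @ inv_sqrt(Sxx) @ U[:, :k]   # x_bar = U_k^T Sigma_XX^{-1/2} x
```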
(d) We will now examine ways to perform classification using the smaller projected space from CCA as our features. One technique is to regress to the class labels and then greedily choose the model's best guess. In this problem, we will use ridge regression to learn a mapping from the HSV histogram features to the one-hot encoding ȳ described in the previous problem. Solve the following ridge regression problem:

$$\min_{W} \sum_{n=1}^{N} \|\bar{y}_n - W^\top x_n\|_2^2 + \lambda \|W\|_F^2.$$

Then we will make predictions with the following function:

$$y = \arg\max_{j \in \{0, \ldots, J-1\}} (W^\top x)_j,$$

where (W^⊤x)_j considers the j-th coordinate of the predicted vector.

Run the Part (d) in Jupyter Notebook. You do not need to write any code. It will output a confusion matrix, a matrix that compares the actual label to the predicted label of the model. The higher the numerical values on the diagonal, the higher the percentage of correct predictions made by the model, and thus the better the model. Report the ridge regression confusion matrix for the training data and validation data.

Solution:

Figure 5: A Confusion Matrix for Training Data with Ridge

Figure 6: A Confusion Matrix for Validation Data with Ridge

See Jupyter Notebook for solution.

(e) Instead of performing regression, we can potentially obtain better performance by using algorithms that are more tailored for classification problems. LDA (Linear Discriminant Analysis) approaches the problem by assuming each p(x|y = j) is a normal distribution with mean μ_j and covariance Σ. Notice that the covariance matrix is assumed to be the same for all the class labels. LDA works by fitting μ_j and Σ on the dimensionality-reduced dataset. During prediction, the class with the highest likelihood is chosen:

$$y = \arg\min_{j \in \{0, \ldots, J-1\}} (x - \mu_j)^\top \Sigma^{-1} (x - \mu_j).$$

Fill in the Part (e) in Jupyter Notebook. Report the LDA confusion matrix for the training and validation data.

Solution:

Figure 7: A Confusion Matrix for Training Data with LDA

Figure 8: A Confusion Matrix for Validation Data with LDA

See Jupyter Notebook for solution.

(f) LDA makes an assumption that all classes have the same covariance matrix. We can relax this assumption with QDA. In QDA, we will now parametrize each conditional distribution (still normal) by μ_j and Σ_j. The prediction function is then computed² as

$$y = \arg\min_{j \in \{0, \ldots, J-1\}} (x - \mu_j)^\top \Sigma_j^{-1} (x - \mu_j) + \ln\!\left(\det \Sigma_j\right).$$

²You should verify for yourself why this is indeed the maximum-likelihood way to pick a class.

Fill in the Part (f) in Jupyter Notebook. Report the QDA confusion matrix for the training and validation data.

Solution:

Figure 9: A Confusion Matrix for Training Data with QDA

Figure 10: A Confusion Matrix for Validation Data with QDA

See Jupyter Notebook for solution.
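A sketch of the QDA decision rule just described (our own names; LDA corresponds to passing the same shared covariance for every class, in which case the log-determinant term is constant and drops out):

```python
import numpy as np

def qda_predict(x, mus, Sigmas):
    """Pick the class minimizing Mahalanobis distance plus the log-det term."""
    scores = []
    for mu, Sigma in zip(mus, Sigmas):
        diff = x - mu
        scores.append(diff @ np.linalg.solve(Sigma, diff)
                      + np.log(np.linalg.det(Sigma)))
    return int(np.argmin(scores))
```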
(g) Let us try the Linear SVM, which fits a hyperplane to separate the dimensionality-reduced data. The regularization parameter of the SVM (the squared ℓ2 regularization in the SVM) is inversely proportional to C > 0 in the Jupyter Notebook. You can treat this C as a tuning hyperparameter in this problem.
Run the Part (g) in Jupyter Notebook. Report the Linear SVM confusion matrix for the training data and validation data. You do not need to write any code.
Solution:
Figure 11: A Confusion Matrix for Training Data with SVM

Figure 12: A Confusion Matrix for Validation Data with SVM
See Jupyter Notebook for solution.
(h) Let us try logistic regression.
Run the Part (h) in Jupyter Notebook. Report the logistic regression confusion matrix for the training data and validation data. You do not need to write any code.
Solution:
Figure 13: A Confusion Matrix for Training Data with Logistic Regression

Figure 14: A Confusion Matrix for Validation Data with Logistic Regression
See Jupyter Notebook for solution.

                   label = 1          label = 0
  prediction = 1   True Positive      False Positive
  prediction = 0   False Negative     True Negative
(i) In this part, we look at the Receiver Operating Characteristic (ROC), another approach to understanding a classifier's performance. In a two-class classification problem, we can compare the predictions to the labels. Specifically, we count the number of True Positives (TP), False Positives (FP), False Negatives (FN) and True Negatives (TN). The true positive rate (TPR) measures how many positive examples out of all positive examples have been detected; concretely, TPR = TP/(TP + FN). The false positive rate (FPR), on the other hand, measures the proportion of negative examples that are mistakenly classified as positive; concretely, FPR = FP/(FP + TN).
A perfect classifier would have TPR = 1.0 and FPR = 0.0. However, in the real world, we usually do not have such a classifier. Requiring a higher TPR usually incurs the cost of a higher FPR, since that makes the classifier predict more positive outcomes. An ROC curve plots the trade-off between TPR and FPR in a graphical manner. One way to get an ROC curve is to first obtain a set of scalar prediction scores, one for each of the validation samples. A higher score means that the classifier believes the sample is more likely to be a positive example. For each of the possible thresholds for the classifier, we can compute the TPR and FPR. After getting all (TPR, FPR) pairs, we can plot the ROC curve.
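A minimal sketch of this threshold sweep (our own function name; scores are the classifier's scalar outputs on the validation set and labels are 0/1):

```python
import numpy as np

def roc_curve_points(scores, labels):
    """(FPR, TPR) pairs obtained by lowering the threshold one score at a time."""
    order = np.argsort(-scores)          # sweep thresholds from high to low
    hits = labels[order].astype(bool)
    tpr = np.cumsum(hits) / hits.sum()          # TP / (TP + FN)
    fpr = np.cumsum(~hits) / (~hits).sum()      # FP / (FP + TN)
    return fpr, tpr
```

Note that the curve depends only on the ordering of the scores, which is the fact behind part (j) below.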
Finish the Part (i) in Jupyter Notebook. Report the ROC curve for an SVM with different regularization weight C. Which C is better and why?

Solution:
Figure 15: ROC curve for SVM with different C.
C = 0.01 is better in this case, since the curve closer to the upper left corner is better. See the Jupyter Notebook code for details.
(j) If you multiply the scores output by the classifier by a factor of 10.0, how would the ROC curve change?
Solution: The ROC curve will not change, since only the relative ranking among the scores matters, rather than the absolute scores.
(k) We will finally train on the full dataset and compare the different models. We will perform a grid search over the following hyperparameters:
(a) The regularization term λ in Ridge Regression.
(b) The weighting on slack variables, C, in the linear SVM.
(c) The number of dimensions, k, that we project to using CCA.
Part (k) in Jupyter Notebook contains the parameters that will be swept over. If the code is correctly implemented in the previous steps, this code should perform a sweep over all parameters and give the best model.
Run the Part (k) in Jupyter Notebook, report the model parameters chosen, report the plot of the models' validation error, and report the best model's confusion matrix for validation data.
Solution:
best model: svm C 01. best k (CCA): 100.

Figure 16: Comparison of Models Across Validation Error

Figure 17: A Confusion Matrix for Validation Data with SVM

Contributors:
• Alex Tsigler
• Anant Sahai
• Michael Laskey
• Peter Wang
• Raaz Dwivedi
• Rahul Arya
• Ruta Jawale
• Stella Yu
• Yaodong Yu