
CS5487 Problem Set
Solutions – Homework and Tutorials
Antoni Chan, Department of Computer Science, City University of Hong Kong
Important Note: These problem set solutions are meant to be a study aid for the final exam only. They should not be used as “model answers” to help do the problem set. The point of the problem set is to encourage students to think critically about machine learning problems – part of this is to learn to check your own answers, examine them for flaws, and be confident that they are correct. This is an important skill for PhD students to obtain, since research, by definition, involves solving unsolved problems. There are no “model answers” in research, so it's best to get used to not having them available.


For current and former CS5487 students:
• DO NOT give the solutions to current CS5487 students.
• DO NOT post electronic copies of the solutions online.
• DO NOT use the solutions as “model answers” for doing the problem set.
• DO use these solutions as a study aid for the final exam.
version v1: 4220e5da9c68ecfd5d8536fe34e0bcfb
Homework Problems
Problem 2.1 The Poisson distribution and flying bombs
(a) The log-likelihood of the data D = {k1,··· ,kN} is
  log p(D) = Σ_{i=1}^N [−λ + k_i log λ − log k_i!] = −Nλ + (Σ_{i=1}^N k_i) log λ − Σ_{i=1}^N log k_i!

We obtain the maximum by setting the derivative with respect to λ to 0,

  ∂ log p(D)/∂λ = −N + (1/λ) Σ_{i=1}^N k_i = 0   ⇒   λ̂ = (1/N) Σ_{i=1}^N k_i,

which is the sample average of the counts. The constraint that λ ≥ 0 is naturally satisfied since k_i ≥ 0. Looking at the second derivative,

  ∂² log p(D)/∂λ² = −(1/λ²) Σ_{i=1}^N k_i,

it is always negative (or zero in the degenerate case), and hence we indeed have a maximum of the log-likelihood.

(b) Let x_i be the sample random variables, where each x_i is distributed as a Poisson with parameter λ. The mean of the estimator using samples X = {x_1, ···, x_N} is

  E_X[λ̂] = E_X[(1/N) Σ_{i=1}^N x_i] = (1/N) Σ_{i=1}^N E_X[x_i] = λ   ⇒   E_X[λ̂ − λ] = 0.

Hence the ML estimate λ̂ is unbiased. For the variance, we have

  var_X[λ̂] = E_X[(λ̂ − E_X[λ̂])²] = E_X[((1/N) Σ_{i=1}^N x_i − λ)²] = E_X[((1/N) Σ_{i=1}^N (x_i − λ))²]
           = (1/N²) E_X[Σ_{i=1}^N Σ_{j=1}^N (x_i − λ)(x_j − λ)] = (1/N²) Σ_{i=1}^N Σ_{j=1}^N E_X[(x_i − λ)(x_j − λ)]
           = (1/N²) [ Σ_i E_X[(x_i − λ)²] + Σ_{i≠j} E_X[(x_i − λ)(x_j − λ)] ].   (S.8)

For i ≠ j, we have

  E_{x_i,x_j}[(x_i − λ)(x_j − λ)] = E_{x_i}[x_i − λ] E_{x_j}[x_j − λ] = 0,

which uses the independence assumption. For i = j, we have

  E_{x_i}[(x_i − λ)²] = var(x_i) = λ.

Substituting into (S.8),

  var_X[λ̂] = (1/N²)(Nλ) = λ/N.

Hence the variance of the estimator is inversely proportional to the number of samples.

(c) The number of cells is 576. Assuming that the observation for the “5 and over” bin was actually 5, the ML estimate for the Poisson distribution is

  λ̂ = (1/576)(229(0) + 211(1) + 93(2) + 35(3) + 7(4) + 1(5)) = 0.929.   (S.12)

Note that increasing the value of the “5 and over” bin to 6 slightly increases the estimate by 1/576 = 0.0017.

(d) Using λ̂ from (c), the expected counts for 576 observations are given in the table below.

  k          0        1        2        3        4        5+
  p(x = k)   0.3950   0.3669   0.1704   0.0528   0.0122   0.0027
  expected   227.5    211.3    98.1     30.4     7.1      1.5
  observed   229      211      93       35       7        1

The expected and observed counts are very similar. For k = 2 and k = 3 the expected and observed counts are off by around 5, but other values are very close. Because the expected and observed data match well, this supports the conclusion that the flying bombs were falling randomly, and not due to any precision targeting. Quantitatively, a χ2 test can be used to check the goodness-of-fit between the expected and actual data. In this case, the χ2 test gives a high probability (0.888) that the observations came from the expected distribution.
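As a quick check of the numbers above, the following Python sketch recomputes λ̂ and the expected counts from the binned data and runs the χ² goodness-of-fit test with scipy (a minimal sketch; one extra degree of freedom is subtracted for the estimated λ).

```python
import numpy as np
from scipy.stats import poisson, chisquare

# Binned London counts: value k -> number of squares, with "5 and over" treated as 5.
k = np.arange(6)
observed = np.array([229, 211, 93, 35, 7, 1])
n_cells = observed.sum()                        # 576 squares

lam_hat = (k * observed).sum() / n_cells        # ML estimate, ~0.929
p = poisson.pmf(k, lam_hat)
p[-1] = 1 - poisson.cdf(4, lam_hat)             # fold the tail into the "5+" bin
expected = n_cells * p

# ddof=1 removes one degree of freedom for the estimated lambda.
stat, pval = chisquare(observed, expected, ddof=1)
print(lam_hat, expected.round(1))
print(stat, pval)                               # p-value close to the 0.888 quoted above
```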
Problem 2.10 Robust Regression and MLE
(a) Assuming that the observation noise follows a Laplace distribution yields the robust regression formulation. Formally, we have a function f(x) = φ(x)^T θ. We observe noisy outputs of f(x),

  y_i = f(x_i) + ε_i,

where ε_i is distributed as a Laplace distribution with zero mean,

  p(ε_i) = (1/(2λ)) e^{−|ε_i|/λ}.

Hence, y_i is also Laplace distributed, but with mean f(x_i),

  p(y_i|x_i) = (1/(2λ)) e^{−|y_i − f(x_i)|/λ}.

Taking the log and summing over the training points, we get the data log-likelihood

  log p(y|X) = −(1/λ) Σ_{i=1}^n |y_i − φ(x_i)^T θ| + const.

Since λ is a constant that does not affect the optimization, the MLE solution is then

  θ* = argmax_θ log p(y|X) = argmax_θ −Σ_{i=1}^n |y_i − φ(x_i)^T θ| = argmin_θ Σ_{i=1}^n |y_i − φ(x_i)^T θ|.
In summary, robust regression (minimizing the L1 norm) is equivalent to assuming a Laplace
distribution on the observation noise and applying MLE.
The L1 norm is more robust to outliers because it gives less penalty to large errors than the L2 norm. This is illustrated in Figure 1a. The L2 norm gives a squared penalty to large errors, which in turn makes the regression algorithm focus more on reducing those errors. In contrast, the L1 norm gives a smaller penalty, thus allowing the regression to focus less on large errors. Another way to think about it is to consider the derivatives of the L2 and L1 error functions. For L2, the error is decreased the most by reducing the largest residual. For L1, the error decreases by the same amount regardless of the size of the residual being reduced.
A second way to explain how L1 is more robust to outliers is to consider the associated probabilistic model. Noting that outliers are rare events of large amplitude, the ability of a given model to explain outliers is a function of how much probability mass is contained in its tails. In particular, a heavy-tailed distribution (more probability mass in the tails) will explain outliers better than a distribution that is not heavy-tailed. It is indeed the case that the Laplacian is more heavy-tailed than the Gaussian. You can convince yourself of this by plotting the two distributions. Notice that, since both functions are exponentially decreasing with the distance from the mean, it is usually difficult to see anything about the tails in the plot of the pdf. A better strategy is to look at the plot of the log of the pdf, which makes the differences more salient. Both are shown in Figures 1b and 1c, for the Gaussian and Laplacian distributions when σ² = 1. Figure 1b shows the pdf plots, and Figure 1c shows their logs. As you can see, the Gaussian decays much faster (its log is a quadratic function of the distance from the mean) than the Laplacian (whose log is a linear function of this distance). For example, for x five standard deviations away from the mean, the difference between the two log-densities is about 8 nats, i.e. the Laplace probability is about 3,000 times larger. This implies that Laplacian noise explains outliers much better than Gaussian noise. Hence, in the presence of outliers there will be less of a mismatch under the Laplacian model, and the ML solution is therefore better than that achievable with the Gaussian.

Figure 1: (a) comparison of L1 and L2 norms; (b) Gaussian and Laplace distributions; (c) log distributions.
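To see the tail behaviour numerically, the short sketch below compares the two log-densities at five standard deviations from the mean, assuming unit scale parameters for both (σ = 1 for the Gaussian, λ = 1 for the Laplace; the exact parameters used for Figure 1 are not recoverable here, so treat the numbers as indicative).

```python
import numpy as np
from scipy.stats import norm, laplace

x = 5.0                                   # five standard deviations from the mean
log_gauss = norm.logpdf(x, loc=0, scale=1)
log_lap = laplace.logpdf(x, loc=0, scale=1)
print(log_lap - log_gauss)                # ~7.7 nats in favour of the Laplace
print(np.exp(log_lap - log_gauss))        # density ratio on the order of a few thousand
```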
(b) This is an optimization trick to turn an absolute value in the objective function into inequality constraints. t_i is an upper bound on each term |y_i − φ(x_i)^T θ|. At the optimum, we must have t_i = |y_i − φ(x_i)^T θ|; otherwise, we would not be at a minimum (we could still decrease t_i and hence decrease the objective function). Hence, at the optimum of (2.16), the minimum of the original objective (2.15) has been found.
(c) Note that an inequality constraint |a| ≤ b can be turned into two inequality constraints,

  |a| ≤ b   ⇒   −b ≤ a ≤ b   ⇒   { a ≤ b,  −a ≤ b }.

If a is positive, then the first inequality bounds it. If a is negative, then the second inequality bounds its magnitude by b. Hence, the ith inequality constraint in (2.16) can be written as

  y_i − φ(x_i)^T θ ≤ t_i,   −(y_i − φ(x_i)^T θ) ≤ t_i,   (S.19)–(S.20)

or equivalently

  −φ(x_i)^T θ − t_i ≤ −y_i,   φ(x_i)^T θ − t_i ≤ y_i.   (S.21)–(S.22)

Stacking the inequalities into a vector, we obtain the definitions of A and b in (2.18). Setting f = [0_D; 1_n] (a vector of D zeros stacked on n ones) will sum over the t_i’s.
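A minimal sketch of solving this linear program with scipy.optimize.linprog is shown below. The toy data, the linear feature map φ(x) = [x, 1], and the injected outliers are hypothetical choices for illustration; the LP itself follows the A, b, f construction above.

```python
import numpy as np
from scipy.optimize import linprog

# Toy data: y = 2x - 1 with Laplace noise and a few gross outliers (hypothetical).
rng = np.random.default_rng(0)
n = 50
x = rng.uniform(-3, 3, n)
y = 2 * x - 1 + rng.laplace(scale=0.3, size=n)
y[:3] += 15

Phi = np.column_stack([x, np.ones(n)])    # phi(x) = [x, 1], so D = 2
D = Phi.shape[1]

# LP variables xi = [theta; t]: minimize sum(t) subject to |y - Phi theta| <= t.
f = np.concatenate([np.zeros(D), np.ones(n)])
A = np.block([[ Phi, -np.eye(n)],
              [-Phi, -np.eye(n)]])
b = np.concatenate([y, -y])
res = linprog(f, A_ub=A, b_ub=b, bounds=[(None, None)] * (D + n))
theta_l1 = res.x[:D]                      # robust (L1) estimate

theta_l2 = np.linalg.lstsq(Phi, y, rcond=None)[0]   # ordinary least squares, for contrast
print("L1:", theta_l1, " L2:", theta_l2)
```

The L1 solution should stay close to the true slope and intercept despite the outliers, while the least-squares solution is pulled toward them.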
Problem 3.7 Bayesian estimation for a Bernoulli distribution
(a) Assuming independent samples,
  p(D|π) = ∏_{i=1}^n p(x_i|π) = ∏_{i=1}^n π^{x_i} (1 − π)^{1−x_i} = π^{Σ_{i=1}^n x_i} (1 − π)^{Σ_{i=1}^n (1−x_i)} = π^s (1 − π)^{n−s},   s = Σ_{i=1}^n x_i.
(b) Assuming a uniform prior p(π) = 1, and applying Bayes’ rule,
  p(π|D) = p(D|π)p(π) / ∫_0^1 p(D|π)p(π) dπ = p(D|π) / ∫_0^1 p(D|π) dπ = π^s(1 − π)^{n−s} / ∫_0^1 π^s(1 − π)^{n−s} dπ
         = π^s(1 − π)^{n−s} · (s + n − s + 1)!/(s!(n−s)!) = ((n+1)!/(s!(n−s)!)) π^s(1 − π)^{n−s},
which uses (3.31).
For n = 1 and s ∈ {0,1}, the posterior is
  p(π|D) = 2π^s(1 − π)^{1−s}.
The plots of the two densities are shown in Figure 2. Notice that the observation of a “0” turns the uniform prior into a posterior with a lot more probability for small values of π (which is the probability of “1”). On the other hand, the observation of a “1” turns the uniform prior into a posterior with a lot more probability for large values of π. As usual for Bayesian inference, this is intuitive.
Figure 2: Posteriors for π: p(π|s = 0) and p(π|s = 1).
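A minimal sketch to reproduce these two posteriors, assuming matplotlib is available:

```python
import numpy as np
import matplotlib.pyplot as plt

pi = np.linspace(0, 1, 200)
plt.plot(pi, 2 * (1 - pi), label="p(pi | s=0)")   # observed a single "0"
plt.plot(pi, 2 * pi, label="p(pi | s=1)")         # observed a single "1"
plt.xlabel("pi"); plt.ylabel("posterior density"); plt.legend(); plt.show()
```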

(c) The predictive distribution is obtained by integrating the observation likelihood with the posterior,

  p(x|D) = ∫_0^1 p(x|π)p(π|D) dπ = ∫_0^1 π^x(1 − π)^{1−x} · ((n+1)!/(s!(n−s)!)) π^s(1 − π)^{n−s} dπ   (S.28)
         = ((n+1)!/(s!(n−s)!)) ∫_0^1 π^{s+x}(1 − π)^{n−s+1−x} dπ = ((n+1)!/(s!(n−s)!)) · (s+x)!(n−s+1−x)!/(n+2)!   (S.29)
         = (n+1)!(s+x)!(n−s+1−x)! / ((n+2)! s!(n−s)!) = (1/(n+2)) (s+1)^x (n−s+1)^{1−x}   (S.30)
         = ((s+1)/(n+2))^x ((n−s+1)/(n+2))^{1−x} = ((s+1)/(n+2))^x (1 − (s+1)/(n+2))^{1−x},   (S.31)

where (S.29) follows from (3.31) and (S.30) uses x ∈ {0, 1}. The effective estimate for π is (s+1)/(n+2), which can be explained as adding two “virtual” samples, a “0” and a “1”, to the MLE estimate.
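The closed form can be sanity-checked by numerical integration against the posterior from (b); the values n = 10 and s = 7 below are hypothetical, chosen only for illustration.

```python
import math
from scipy.integrate import quad

n, s = 10, 7
norm_const = math.factorial(n + 1) / (math.factorial(s) * math.factorial(n - s))
posterior = lambda p: norm_const * p**s * (1 - p)**(n - s)   # p(pi|D) from (b)

p1_numeric, _ = quad(lambda p: p * posterior(p), 0, 1)       # p(x=1|D) by integration
print(p1_numeric, (s + 1) / (n + 2))                         # both ~0.6667
```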
(d) The ML solution is
  π̂_ML = argmax_π log p(D|π) = argmax_π [s log π + (n − s) log(1 − π)].   (S.32)

Note that we should impose the constraints 0 ≤ π ≤ 1. However, as we will see below, the maximum of the unconstrained problem is inside the constraint region. Hence, we can get away without doing this. It therefore suffices to compute the derivative of l = log p(D|π),

  ∂l/∂π = s/π − (n − s)/(1 − π),   (S.33)

and set it to zero, from which we obtain

  s/π − (n − s)/(1 − π) = 0   (S.34)
  ⇒ s(1 − π) − (n − s)π = 0   (S.35)
  ⇒ π* = s/n.   (S.36)

Since s ≤ n this is indeed inside the constraint region. However, we still need to check that we have a maximum. For this we compute the second derivative

  ∂²l/∂π² = −s/π² − (n − s)/(1 − π)².   (S.37)
This is always negative, and hence we have a maximum. Furthermore the function l(π) is concave and this maximum is global. Hence, we do not need to check boundary constraints.
The MAP solution is
  π̂ = argmax_π p(π|D) = argmax_π [log p(D|π) + log p(π)].   (S.38)
Notice that, since p(π) is uniform, the second term does not depend on π and can be dropped. Hence, the MAP solution is the same as the ML solution. This result is always true when we have a uniform prior, in which case there is no point in favoring MAP over ML.

(e) For the first prior, p1(π) = 2π, substituting into (S.38),
  l_1 = log p(D|π) + log p(π) = s log π + (n − s) log(1 − π) + log 2π,   (S.39)
  ∂l_1/∂π = s/π − (n − s)/(1 − π) + 1/π = 0   (S.40)
  ⇒ s(1 − π) − (n − s)π + (1 − π) = 0   (S.41)
  ⇒ π̂_1 = (s + 1)/(n + 1).   (S.42)

For the second prior, p_0(π) = 2 − 2π,

  l_0 = log p(D|π) + log p(π) = s log π + (n − s) log(1 − π) + log 2(1 − π),   (S.43)
  ∂l_0/∂π = s/π − (n − s)/(1 − π) − 1/(1 − π) = 0   (S.44)
  ⇒ s(1 − π) − (n − s)π − π = 0   (S.45)
  ⇒ π̂_0 = s/(n + 1).   (S.46)
For these two priors, it is equivalent to adding a virtual sample to the training set, a “1” for p1(π) and a “0” for p0(π). These virtual samples reflect the bias of the prior.
In summary, for all these cases, the predictive distribution (the distribution of a new sample after learning the parameter π) is of the form
  p(x|D) = π̂^x (1 − π̂)^{1−x},

where π̂ is determined by the estimation method:

  Estimator                     π̂               # tosses   # “1”s   interpretation
  MAP (non-informative)         s/n             n          s        same as ML
  MAP (favor 1s)                (s + 1)/(n + 1) n + 1      s + 1    add an extra “1”
  MAP (favor 0s)                s/(n + 1)       n + 1      s        add an extra “0”
  Bayesian (non-informative)    (s + 1)/(n + 2) n + 2      s + 1    add one of each
In all cases, the estimate can be interpreted as an ML estimate with extra sample(s) added to reflect the bias of the prior.
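The table translates directly into code; the toy values n = 5 and s = 4 below are hypothetical, used only to show how the virtual samples shift the estimate.

```python
n, s = 5, 4   # hypothetical: 4 ones in 5 tosses
estimates = {
    "MAP, uniform prior (= ML)": s / n,
    "MAP, p1 favours 1s":        (s + 1) / (n + 1),
    "MAP, p0 favours 0s":        s / (n + 1),
    "Bayesian, uniform prior":   (s + 1) / (n + 2),
}
for name, pi_hat in estimates.items():
    print(f"{name:28s} {pi_hat:.3f}")
```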
Problem 4.5 Flying Bombs, part II – EM for mixtures of Poissons
(a) Define zi as the hidden assignment variable that assigns sample xi to mixture component zi = j. The complete data likelihood is
  p(X, Z) = ∏_{i=1}^n p(z_i) p(x_i|z_i) = ∏_{i=1}^n π_{z_i} p(x_i|z_i).   (S.48)

Using the indicator variable trick, by defining z_{ij} = 1 iff z_i = j and 0 otherwise, we have

  p(X, Z) = ∏_{i=1}^n ∏_{j=1}^K π_j^{z_{ij}} p(x_i|z_i = j)^{z_{ij}},   (S.49)

and taking the log,

  log p(X, Z) = Σ_{i=1}^n Σ_{j=1}^K [z_{ij} log π_j + z_{ij} log p(x_i|z_i = j)].   (S.50)

For the E-step, we obtain the Q function by taking the expectation of the complete data log-likelihood in (S.50),

  Q(θ; θ̂) = E_{Z|X,θ̂}[log p(X, Z|θ)]   (S.51)
           = E_{Z|X,θ̂}[ Σ_{i=1}^n Σ_{j=1}^K z_{ij} log π_j + z_{ij} log p(x_i|z_i = j) ]   (S.52)
           = Σ_{i=1}^n Σ_{j=1}^K ẑ_{ij} log π_j + ẑ_{ij} log p(x_i|z_i = j),   (S.53)

where the last line follows because the expectation only applies to the variable z_{ij}. The “soft assignment” term ẑ_{ij} is calculated using the current parameter estimates θ̂,

  ẑ_{ij} = E_{Z|X,θ̂}[z_{ij}] = p(z_i = j|X, θ̂)   (S.54)
         = p(X|z_i = j)p(z_i = j)/p(X) = p(X_{¬i}) p(x_i|z_i = j) p(z_i = j) / (p(X_{¬i}) p(x_i))   (S.55)
         = π_j p(x_i|z_i = j) / Σ_{k=1}^K π_k p(x_i|z_i = k) = p(z_i = j|x_i, θ̂).   (S.56)

(S.55) follows from the independence assumption of the samples, and X_{¬i} is the set of x_k with k ≠ i. Note that we have not used any properties of the mixture components yet, so this is the general form of the Q function for any mixture model.

For the M-step, we maximize Q with respect to the parameters θ. First, we optimize the component priors. We have a constraint that Σ_j π_j = 1, and π_j ≥ 0. For the equality constraint (the non-negativity constraint is naturally satisfied), define the Lagrangian as

  L(π) = Σ_{i=1}^n Σ_{j=1}^K ẑ_{ij} log π_j + λ(1 − Σ_{j=1}^K π_j),   (S.57)

where λ is the Lagrange multiplier. Taking the derivatives and setting to 0,

  ∂L/∂λ = 1 − Σ_{j=1}^K π_j = 0   ⇒   Σ_{j=1}^K π_j = 1,   (S.58)
  ∂L/∂π_j = Σ_{i=1}^n ẑ_{ij}/π_j + λ = 0   ⇒   Σ_{i=1}^n ẑ_{ij} + λπ_j = 0.   (S.59)

Summing (S.59) over j, we have

  Σ_{j=1}^K Σ_{i=1}^n ẑ_{ij} + λ Σ_{j=1}^K π_j = 0   ⇒   λ = −n,   (S.60)

which follows from Σ_j ẑ_{ij} = 1 and Σ_j π_j = 1. Finally, substituting into (S.59), we have

  π̂_j = (1/n) Σ_{i=1}^n ẑ_{ij} = n̂_j/n,   (S.61)

where n̂_j = Σ_{i=1}^n ẑ_{ij} is the (soft) number of samples assigned to component j. Again this is a standard result for any mixture model.

Next, we look at optimizing λ_j for each Poisson component. Note that the parameters of the jth component do not affect the other components, so each λ_j can be optimized separately, by maximizing

  l_j = Σ_{i=1}^n ẑ_{ij} log p(x_i|z_i = j)   (S.62)
      = Σ_{i=1}^n ẑ_{ij} (−λ_j + x_i log λ_j − log x_i!)   (S.63)
      = −λ_j (Σ_{i=1}^n ẑ_{ij}) + (Σ_{i=1}^n ẑ_{ij} x_i) log λ_j − Σ_{i=1}^n ẑ_{ij} log x_i!.   (S.64)

Taking the derivative and setting to zero,

  ∂l_j/∂λ_j = −Σ_{i=1}^n ẑ_{ij} + (1/λ_j) Σ_{i=1}^n ẑ_{ij} x_i = 0   (S.65)
  ⇒ λ̂_j = (Σ_{i=1}^n ẑ_{ij} x_i) / (Σ_{i=1}^n ẑ_{ij}) = (1/n̂_j) Σ_{i=1}^n ẑ_{ij} x_i.   (S.66)

In summary, the EM algorithm for a mixture of Poissons is

  E-step:  ẑ_{ij} = p(z_i = j|x_i, θ̂) = π̂_j p(x_i|z_i = j, λ̂_j) / Σ_{k=1}^K π̂_k p(x_i|z_i = k, λ̂_k),
  M-step:  n̂_j = Σ_{i=1}^n ẑ_{ij},   π̂_j = n̂_j/n,   λ̂_j = (1/n̂_j) Σ_{i=1}^n ẑ_{ij} x_i.
The estimate of λj, the arrival rate, is similar to the standard ML estimate, except a weighted average is used with EM, where the weights are the posterior probabilities of the sample being assigned to component j. Note that for any exponential family distribution, we will get a similar form in the M-step, which is a weighted average of the sufficient statistics.
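A compact sketch of these E- and M-steps in Python/NumPy is given below (a sketch only: random initialization, a fixed number of iterations, and no restarts; the scipy Poisson pmf is used for the component likelihoods).

```python
import numpy as np
from scipy.stats import poisson

def em_poisson_mixture(x, K, n_iter=200, seed=0):
    """EM for a K-component mixture of Poissons, following the E- and M-steps above."""
    rng = np.random.default_rng(seed)
    n = len(x)
    pi = np.full(K, 1.0 / K)                      # mixing weights pi_j
    lam = x.mean() * rng.uniform(0.5, 2.0, K)     # arrival rates lambda_j (random init)
    for _ in range(n_iter):
        # E-step: soft assignments z[i, j] = p(z_i = j | x_i, theta_hat)
        logp = poisson.logpmf(x[:, None], lam[None, :]) + np.log(pi)[None, :]
        logp -= logp.max(axis=1, keepdims=True)   # stabilize before exponentiating
        z = np.exp(logp)
        z /= z.sum(axis=1, keepdims=True)
        # M-step: soft counts, weights, and weighted-average rates
        nj = z.sum(axis=0)
        pi = nj / n
        lam = (z * x[:, None]).sum(axis=0) / nj
    # final data log-likelihood
    mix = (np.exp(poisson.logpmf(x[:, None], lam[None, :])) * pi).sum(axis=1)
    return pi, lam, np.log(mix).sum()

# Example: the London counts from Problem 2.1, expanded to one value per square.
x = np.repeat(np.arange(6), [229, 211, 93, 35, 7, 1])
print(em_poisson_mixture(x, K=2))
```

Run on the London counts, the returned log-likelihood should be close to the values reported in the table below, although with random initialization the recovered components may differ slightly between runs.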
(b) A mixture of Poissons was learned from the data of each city with number of components K ∈ {1, 2, 3, 4}. The final data log-likelihood and model parameters are in the following table, and the mixture distributions are plotted in Figure 3 (top).

  city      K   loglike     (π1, λ1)         (π2, λ2)         (π3, λ3)         (π4, λ4)
  London    1   -728.713    (1, 0.929)
  London    2   -728.703    (0.736, 0.988)   (0.264, 0.765)
  London    3   -728.701    (0.599, 0.982)   (0.206, 0.972)   (0.195, 0.721)
  London    4   -728.703    (0.099, 0.995)   (0.317, 0.989)   (0.330, 0.981)   (0.254, 0.761)
  Antwerp   1   -830.696    (1, 0.896)
  Antwerp   2   -748.025    (0.339, 2.195)   (0.661, 0.230)
  Antwerp   3   -747.665    (0.278, 2.374)   (0.451, 0.523)   (0.271, 0.0001)
  Antwerp   4   -747.665    (0.278, 2.374)   (0.234, 0.523)   (0.217, 0.523)   (0.271, 0.0001)
There are a few interesting observations. First, for the London data, the log-likelihood does not increase significantly as K increases. This suggests that the London data can already be explained well with one component. Looking at the model parameters for K = 2, there seem to be two important values of λ, 0.98 and 0.76. The first is the dominant value (larger prior weight), while the second provides some correction for the number of samples with count 2 and 3 (see Figure 3 top-left). While EM has found this second component, the fact that the log-likelihood is similar to K = 1 indicates that the variations in the sample data are equally well explained by the single component. In any case, 0.98 and 0.76 are relatively similar values. For larger values of K, the extra components have values similar to the original 0.98.
Looking at the Antwerp models, there is a clear increase in log-likelihood as K increases, until K = 3. For K = 1, the value of λ is 0.896, but the plot in Figure 3 (top-right) shows that the single Poisson does not model the data well. Increasing the number of components to K = 2, we recover two significant components, λ1 = 2.195 and λ2 = 0.230. The difference between the two values suggests that some squares are hit more often than others. Further increasing to K = 3 recovers three types of squares: 1) frequently hit (λ1 = 2.374), 2) infrequently hit (λ2 = 0.523), and 3) never hit (λ3 = 0.0001). Hence, these results give strong evidence that there was specific targeting of sites in Antwerp. On the other hand, the analysis of the London data suggests that there was no targeting of sites in London.
Finally, the counts for the London data are the real counts observed during WWII. The Antwerp data was generated synthetically. Figure 3 (middle) shows the simulated hit locations and square plot counts for the two cities. For the London data, the distribution of hits is uniform. For Antwerp, there are three targeted locations. We can assign each square to one of the components of the mixture, which clusters the squares according to the hit frequency. Figure 3 (bottom) shows the clusters. For the London data, the squares are always assigned to the same component, i.e., there is one dominant component in each mixture, as seen in Figure 3 (top-left). On the other hand, for the Antwerp data, there is a clear distinction between regions. For K = 2, the city is separated into regions with 2 or more hits, and those with less than 2. For K = 3, the city is separated into regions with 2 or more hits, only 1 hit, and zero hits. This can be seen in the mixture plots, where a different component dominates for different values of x.

Figure 3: London (left) and Antwerp (right): (top) learned mixture of Poissons for each K; (middle) simulated hit locations and per-square counts; (bottom) cluster assignments of squares to mixture components.
