
CS5487 Problem Set
Solutions – Homework and Tutorials
Antoni Chan Department of Computer Science City University of Hong Kong
Important Note: These problem set solutions are meant to be a study aid for the final exam only. They should not be used as “model answers” to help do the problem set. The point of the problem set is to encourage students to think critically about machine learning problems – part of this is to learn to check your own answers, examine them for flaws, and be confident that they are correct. This is an important skill for PhD students to obtain, since research, by definition, involves solving unsolved problems. There are no “model answers” in research, so it's best to get used to not having them available.
For current and former CS5487 students:
• DO NOT give the solutions to current CS5487 students.
• DO NOT post electronic copies of the solutions online.
• DO NOT use the solutions as “model answers” for doing the problem set.
• DO use these solutions as a study aid for the final exam.
version v1: 4220e5da9c68ecfd5d8536fe34e0bcfb
Homework Problems
Problem 2.1 The Poisson distribution and flying bombs
(a) The log-likelihood of the data D = {k_1, ..., k_N} is

log p(D) = Σ_{i=1}^N [−λ + k_i log λ − log k_i!] = −Nλ + (Σ_{i=1}^N k_i) log λ − Σ_{i=1}^N log k_i!   (S.1)

We obtain the maximum by setting the derivative with respect to λ to 0,

∂ log p(D)/∂λ = −N + (1/λ) Σ_{i=1}^N k_i = 0,   (S.2)

λ̂ = (1/N) Σ_{i=1}^N k_i,   (S.3)

which is the sample average of the counts. The constraint that λ ≥ 0 is naturally satisfied since k_i ≥ 0. Looking at the second derivative,

∂² log p(D)/∂λ² = −(1/λ²) Σ_{i=1}^N k_i,

it is always negative (or zero in the degenerate case), and hence we indeed have a maximum of the log-likelihood.
(b) Let x_i be the sample random variables, where each x_i is distributed as a Poisson with parameter λ. The mean of the estimator using samples X = {x_1, ..., x_N} is

E_X[λ̂] = E_X[(1/N) Σ_{i=1}^N x_i] = (1/N) Σ_{i=1}^N E_X[x_i] = (1/N) Σ_{i=1}^N λ = λ   (S.4)
⇒ E_X[λ̂ − λ] = 0.   (S.5)

Hence the ML estimate λ̂ is unbiased. For the variance, we have

var_X[λ̂] = E_X[(λ̂ − E_X[λ̂])²] = E_X[((1/N) Σ_{i=1}^N x_i − λ)²] = E_X[((1/N) Σ_{i=1}^N (x_i − λ))²]   (S.6)
= (1/N²) E_X[Σ_{i=1}^N Σ_{j=1}^N (x_i − λ)(x_j − λ)] = (1/N²) Σ_{i=1}^N Σ_{j=1}^N E_X[(x_i − λ)(x_j − λ)]   (S.7)
= (1/N²) [ Σ_{i=j} E_X[(x_i − λ)²] + Σ_{i≠j} E_X[(x_i − λ)(x_j − λ)] ].   (S.8)

For i ≠ j, we have

E_{x_i,x_j}[(x_i − λ)(x_j − λ)] = E_{x_i}[x_i − λ] E_{x_j}[x_j − λ] = 0,   (S.9)

which uses the independence assumption. For i = j, we have

E_{x_i}[(x_i − λ)²] = var(x_i) = λ.   (S.10)

Substituting into (S.8),

var_X[λ̂] = (1/N²)(Nλ) = λ/N.   (S.11)

Hence the variance of the estimator is inversely proportional to the number of samples.

(c) The number of cells is 576. Assuming that the observation for the “5 and over” bin was actually 5, the ML estimate for the Poisson distribution is

λ̂ = (1/576)(229(0) + 211(1) + 93(2) + 35(3) + 7(4) + 1(5)) = 0.929.   (S.12)

Note that increasing the value of the “5 and over” bin to 6 slightly increases the estimate, by 1/576 = 0.0017.

(d) Using λ̂ from (c), the expected counts for 576 observations are given in the table below.

  k          0       1       2       3       4       5+
  p(x = k)   0.3950  0.3669  0.1704  0.0528  0.0122  0.0027
  expected   227.5   211.3   98.1    30.4    7.1     1.5
  observed   229     211     93      35      7       1
The expected and observed counts are very similar. For k = 2 and k = 3 the expected and observed counts are off by around 5, but other values are very close. Because the expected and observed data match well, this supports the conclusion that the flying bombs were falling randomly, and not due to any precision targeting. Quantitatively, a χ2 test can be used to check the goodness-of-fit between the expected and actual data. In this case, the χ2 test gives a high probability (0.888) that the observations came from the expected distribution.
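The short sketch below reproduces the calculations in (c) and (d) numerically; the scipy calls and the lumping of the Poisson tail probability into the “5 and over” bin are my own choices for illustration, not part of the original solution.

```python
# Minimal numerical check of (c)-(d), assuming the London counts from the table above.
import numpy as np
from scipy.stats import poisson, chisquare

counts = np.array([229, 211, 93, 35, 7, 1])    # observed cells for k = 0,1,2,3,4,5+
k = np.arange(6)
N = counts.sum()                               # 576 cells

lam = (counts * k).sum() / N                   # ML estimate, treating "5 and over" as 5
print(round(lam, 3))                           # ~0.929

# expected counts under Poisson(lam); lump the tail probability into the last bin
p = poisson.pmf(np.arange(5), lam)
p = np.append(p, 1.0 - p.sum())
expected = N * p

# chi-square goodness-of-fit; ddof=1 accounts for the estimated lambda
stat, pval = chisquare(counts, expected, ddof=1)
print(expected.round(1), round(pval, 3))
```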
………
Problem 2.10 Robust Regression and MLE
(a) Assuming that the observation noise follows a Laplace distribution yields the robust regression formulation. Formally, we have a function f(x) = φ(x)^T θ. We observe noisy outputs of f(x),

y_i = f(x_i) + ε_i,   (S.13)

where ε_i is distributed as a Laplace distribution with zero mean,

p(ε_i) = (1/(2λ)) e^{−|ε_i|/λ}.   (S.14)

Hence, y_i is also Laplace distributed, but with mean f(x_i),

p(y_i|x_i) = (1/(2λ)) e^{−|y_i − f(x_i)|/λ}.   (S.15)

Taking the log and summing over the training points, we get the data log-likelihood

log p(y|X) = −(1/λ) Σ_{i=1}^n |y_i − φ(x_i)^T θ| + const.   (S.16)

Since λ is a constant that does not affect the optimization, the MLE solution is then

θ* = argmax_θ log p(y|X) = argmax_θ −Σ_{i=1}^n |y_i − φ(x_i)^T θ| = argmin_θ Σ_{i=1}^n |y_i − φ(x_i)^T θ|.   (S.17)

In summary, robust regression (minimizing the L1 norm) is equivalent to assuming a Laplace distribution on the observation noise and applying MLE.
The L1 norm is more robust to outliers because it gives less penalty to large errors than the L2 norm. This is illustrated in Figure 1a. The L2 norm gives a squared penalty to large errors, which in turn makes the regression algorithm focus more on reducing those errors. In contrast, the L1 norm gives less penalty, thus allowing the regression to focus less on large errors. Another way to think about it is to consider the derivatives of the L2 and L1 error functions. For L2, the error is decreased the most by reducing the largest error. For L1, the error is decreased equally regardless of the amount of error.
A second way to explain how L1 is more robust to outliers is to consider the associated probabilistic model. Noting that outliers are rare events of large amplitude, the ability of a given model to explain outliers is a function of how much probability mass is contained in its tails. In particular, a heavy-tailed distribution (more probability mass in the tails) will explain outliers better than a distribution that is not heavy-tailed. It is indeed the case that the Laplacian is more heavy-tailed than the Gaussian. You can convince yourself of this by plotting the two distributions. Notice that, since both functions are exponentially decreasing with the distance from the mean, it is usually difficult to see anything about the tails in the plot of the pdf. A better strategy is to look at the plot of the log of the pdf, which makes the differences more salient. Both are shown in Figure 1b and 1c for the Gaussian and Laplacian distributions when σ² = 1: Figure 1b shows the pdf plots, and Figure 1c shows their logs. As you can see, the Gaussian decays much faster (its log is a quadratic function of the distance from the mean) than the Laplacian (whose log is a linear function of this distance). For example, for x five standard deviations away from the mean, the difference between the two functions is about 8 dBs, i.e. the Laplace probability is about 3,000 times larger. This implies that Laplacian noise explains outliers much better than Gaussian noise. Hence, in the presence of outliers there will be less of a mismatch under the Laplacian model, and the ML solution is therefore better than that achievable with the Gaussian.

Figure 1: (a) comparison of L1 and L2 norms; (b) Gaussian and Laplace distributions; (c) log distributions.
(b) This is an optimization trick to turn an absolute value in the objective function into inequality constraints. t_i is an upper bound on each term |y_i − φ(x_i)^T θ|. At the optimum, we must have t_i = |y_i − φ(x_i)^T θ|; otherwise, we would not be at a minimum (we could still decrease t_i and hence decrease the objective function). Hence, at the optimum of (2.16), the minimum of the original objective (2.15) has been found.
(c) Note that an inequality constraint |a| ≤ b can be turned into two inequality constraints,

|a| ≤ b  ⇒  −b ≤ a ≤ b  ⇒  { a ≤ b,  −a ≤ b }.   (S.18)

If a is positive, then the first inequality bounds it. If a is negative, then the second inequality bounds its magnitude by b. Hence, the ith inequality constraint in (2.16) can be written as

y_i − φ(x_i)^T θ ≤ t_i,   −(y_i − φ(x_i)^T θ) ≤ t_i,   (S.19, S.20)

or equivalently

φ(x_i)^T θ − t_i ≤ y_i,   −φ(x_i)^T θ − t_i ≤ −y_i.   (S.21, S.22)

Stacking the inequalities into a vector, we obtain the definitions of A and b in (2.18). Setting f = [0_D; 1_n] will sum over the t_i's.
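As a concrete illustration of this LP formulation, the sketch below solves the stacked problem with a generic solver; the synthetic data, the feature map φ(x) = [1, x], and the use of scipy.optimize.linprog are assumptions for the example, not part of the problem set.

```python
# Sketch of robust (L1) regression as the LP in (c): variables are [theta; t].
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=30)
y = 1.5 * x - 0.5 + rng.laplace(scale=0.3, size=30)
y[:3] += 8.0                                   # a few outliers

Phi = np.stack([np.ones_like(x), x], axis=1)   # rows are phi(x_i)^T, D = 2
n, D = Phi.shape

f = np.concatenate([np.zeros(D), np.ones(n)])  # minimize sum_i t_i
A = np.block([[ Phi, -np.eye(n)],              #  phi(x_i)^T theta - t_i <= y_i
              [-Phi, -np.eye(n)]])             # -phi(x_i)^T theta - t_i <= -y_i
b = np.concatenate([y, -y])

res = linprog(f, A_ub=A, b_ub=b, bounds=[(None, None)] * D + [(0, None)] * n)
theta_l1 = res.x[:D]                           # robust estimate, close to [-0.5, 1.5]
print(theta_l1.round(2))
```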
………
Problem 3.7 Bayesian estimation for a Bernoulli distribution
(a) Assuming independent samples,

p(D|π) = Π_{i=1}^n p(x_i|π) = Π_{i=1}^n π^{x_i}(1 − π)^{1−x_i} = π^{Σ_i x_i}(1 − π)^{Σ_i (1−x_i)} = π^s (1 − π)^{n−s},   s = Σ_{i=1}^n x_i.

(b) Assuming a uniform prior p(π) = 1, and applying Bayes' rule,

p(π|D) = p(D|π)p(π) / ∫_0^1 p(D|π)p(π)dπ = p(D|π) / ∫_0^1 p(D|π)dπ = π^s(1 − π)^{n−s} / ∫_0^1 π^s(1 − π)^{n−s}dπ
= π^s(1 − π)^{n−s} / [s!(n−s)!/(s+n−s+1)!] = [(n+1)!/(s!(n−s)!)] π^s(1 − π)^{n−s},

which uses (3.31).

For n = 1 and s ∈ {0,1}, the posterior is

p(π|D) = 2π^s(1 − π)^{1−s}.
The plots of the two densities are shown in Figure 2. Notice that the observation of a “0” turns the uniform prior into a posterior with a lot more probability for small values of π (which is the probability of “1”). On the other hand, the observation of a “1” turns the uniform prior into a posterior with a lot more probability for large values of π. As usual for Bayesian inference, this is intuitive.
Figure 2: Posteriors for π, p(π|s = 0) and p(π|s = 1).
(c) The predictive distribution is obtained by integrating the observation likelihood with the posterior,

p(x|D) = ∫_0^1 p(x|π)p(π|D)dπ = ∫_0^1 π^x(1 − π)^{1−x} [(n+1)!/(s!(n−s)!)] π^s(1 − π)^{n−s}dπ   (S.28)
= [(n+1)!/(s!(n−s)!)] ∫_0^1 π^{s+x}(1 − π)^{n−s+1−x}dπ = [(n+1)!/(s!(n−s)!)] (s+x)!(n−s+1−x)!/(n+2)!   (S.29)
= [(n+1)!/(n+2)!] [(s+x)!/s!] [(n−s+1−x)!/(n−s)!] = (1/(n+2)) (s+1)^x (n−s+1)^{1−x}   (S.30)
= ((s+1)/(n+2))^x ((n−s+1)/(n+2))^{1−x} = ((s+1)/(n+2))^x (1 − (s+1)/(n+2))^{1−x},   (S.31)

where (S.29) follows from (3.31) and (S.30) uses x ∈ {0, 1}. The effective estimate for π is (s+1)/(n+2), which can be explained as adding two “virtual” samples, a “0” and a “1”, to the MLE estimate.
(d) The ML solution is

π_ML = argmax_π log p(D|π) = argmax_π s log π + (n−s) log(1 − π).   (S.32)

Note that we should impose the constraint 0 ≤ π ≤ 1. However, as we will see below, the maximum of the unconstrained problem is inside the constraint region, so we can get away without doing this. It therefore suffices to compute the derivative of l = log p(D|π),

∂l/∂π = s/π − (n−s)/(1−π),   (S.33)

and set it to zero, from which we obtain

s/π − (n−s)/(1−π) = 0   (S.34)
⇒ s(1−π) − (n−s)π = 0   (S.35)
⇒ π* = s/n.   (S.36)

Since s ≤ n this is indeed inside the constraint region. However, we still need to check that we have a maximum. For this we compute the 2nd derivative

∂²l/∂π² = −s/π² − (n−s)/(1−π)².   (S.37)

This is always negative, and hence we have a maximum. Furthermore, the function l(π) is concave and this maximum is global. Hence, we do not need to check the boundary constraints.

The MAP solution is

π̂ = argmax_π p(π|D) = argmax_π log p(D|π) + log p(π).   (S.38)

Notice that, since p(π) is uniform, the second term does not depend on π and can be dropped. Hence, the MAP solution is the same as the ML solution. This result is always true when we have a uniform prior, in which case there is no point in favoring MAP over ML.
(e) For the first prior, p_1(π) = 2π, substituting into (S.38),

l_1 = log p(D|π) + log p_1(π) = s log π + (n−s) log(1−π) + log 2π,   (S.39)
∂l_1/∂π = s/π − (n−s)/(1−π) + 1/π = 0   (S.40)
⇒ s(1−π) − (n−s)π + (1−π) = 0   (S.41)
⇒ π̂_1 = (s+1)/(n+1).   (S.42)

For the second prior, p_0(π) = 2 − 2π,

l_0 = log p(D|π) + log p_0(π) = s log π + (n−s) log(1−π) + log 2(1−π),   (S.43)
∂l_0/∂π = s/π − (n−s)/(1−π) − 1/(1−π) = 0   (S.44)
⇒ s(1−π) − (n−s)π − π = 0   (S.45)
⇒ π̂_0 = s/(n+1).   (S.46)

For these two priors, it is equivalent to adding a virtual sample to the training set: a “1” for p_1(π) and a “0” for p_0(π). These virtual samples reflect the bias of the prior.

In summary, for all these cases, the predictive distribution (the distribution of a new sample after learning the parameter π) is of the form

p(x|D) = π̂^x(1 − π̂)^{1−x},   (S.47)

where π̂ is determined by the estimation method:

  Estimator                     π̂             # tosses   # "1"s   interpretation
  ML                            s/n            n          s
  MAP (non-informative)         s/n            n          s        same as ML
  MAP (favor 1s)                (s+1)/(n+1)    n+1        s+1      add an extra "1"
  MAP (favor 0s)                s/(n+1)        n+1        s        add an extra "0"
  Bayesian (non-informative)    (s+1)/(n+2)    n+2        s+1      add one of each
In all cases, the estimate can be interpreted as an ML estimate with extra sample(s) added to reflect the bias of the prior.
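A small sketch comparing these estimators on a toy coin-flip sample is given below; the data and variable names are illustrative only.

```python
# Compare the estimators in the table above on a synthetic Bernoulli sample.
import numpy as np

rng = np.random.default_rng(1)
x = rng.binomial(1, 0.7, size=20)   # n tosses of a coin with true pi = 0.7
n, s = x.size, int(x.sum())

estimates = {
    "ML / MAP (uniform prior)":         s / n,
    "MAP, prior p1(pi) = 2*pi":         (s + 1) / (n + 1),
    "MAP, prior p0(pi) = 2 - 2*pi":     s / (n + 1),
    "Bayes predictive (uniform prior)": (s + 1) / (n + 2),
}
for name, val in estimates.items():
    print(f"{name}: {val:.3f}")
```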
………
Problem 4.5 Flying Bombs, part II – EM for mixtures of Poissons
(a) Define z_i as the hidden assignment variable that assigns sample x_i to mixture component z_i = j. The complete data likelihood is

p(X, Z) = Π_{i=1}^n p(z_i)p(x_i|z_i) = Π_{i=1}^n π_{z_i} p(x_i|z_i).   (S.48)

Using the indicator variable trick, by defining z_ij = 1 iff z_i = j and 0 otherwise, we have

p(X, Z) = Π_{i=1}^n Π_{j=1}^K π_j^{z_ij} p(x_i|z_i = j)^{z_ij},   (S.49)

and taking the log,

log p(X, Z) = Σ_{i=1}^n Σ_{j=1}^K z_ij log π_j + z_ij log p(x_i|z_i = j).   (S.50)

For the E-step, we obtain the Q function by taking the expectation of the complete data log-likelihood in (S.50),

Q(θ; θ̂) = E_{Z|X,θ̂}[log p(X, Z|θ)]   (S.51)
= E_{Z|X,θ̂}[Σ_{i=1}^n Σ_{j=1}^K z_ij log π_j + z_ij log p(x_i|z_i = j)]   (S.52)
= Σ_{i=1}^n Σ_{j=1}^K ẑ_ij log π_j + ẑ_ij log p(x_i|z_i = j),   (S.53)

where the last line follows because the expectation only applies to the variable z_ij. The “soft assignment” term ẑ_ij is calculated using the current parameter estimates θ̂,

ẑ_ij = E_{Z|X,θ̂}[z_ij] = p(z_i = j|X, θ̂)   (S.54)
= p(X|z_i = j)p(z_i = j)/p(X) = p(X_¬i)p(x_i|z_i = j)p(z_i = j)/(p(X_¬i)p(x_i))   (S.55)
= π_j p(x_i|z_i = j) / Σ_{k=1}^K π_k p(x_i|z_i = k) = p(z_i = j|x_i, θ̂).   (S.56)

(S.55) follows from the independence assumption of the samples, and X_¬i is the set of x_k with k ≠ i. Note that we have not used any properties of the mixture components yet, so this is the general form of the Q function for any mixture model.

For the M-step, we maximize Q with respect to the parameters θ. First, we optimize the component priors. We have the constraints Σ_j π_j = 1 and π_j ≥ 0. For the equality constraint (the non-negativity constraint is naturally satisfied), define the Lagrangian as

L(π) = Σ_{i=1}^n Σ_{j=1}^K ẑ_ij log π_j + λ(1 − Σ_{j=1}^K π_j),   (S.57)

where λ is the Lagrange multiplier. Taking the derivatives and setting to 0,

∂L/∂λ = 1 − Σ_{j=1}^K π_j = 0  ⇒  Σ_{j=1}^K π_j = 1,   (S.58)
∂L/∂π_j = Σ_{i=1}^n ẑ_ij/π_j − λ = 0  ⇒  Σ_{i=1}^n ẑ_ij − λπ_j = 0.   (S.59)

Summing (S.59) over j, we have

Σ_{j=1}^K Σ_{i=1}^n ẑ_ij − λ Σ_{j=1}^K π_j = 0  ⇒  λ = n,   (S.60)

which follows from Σ_j ẑ_ij = 1 and Σ_j π_j = 1. Finally, substituting into (S.59), we have

π̂_j = (1/n) Σ_{i=1}^n ẑ_ij = n̂_j/n,   (S.61)

where n̂_j = Σ_{i=1}^n ẑ_ij is the (soft) number of samples assigned to component j. Again, this is a standard result for any mixture model.

Next, we look at optimizing λ_j for each Poisson component. Note that the parameters of the jth component do not affect the other components, so each λ_j can be optimized separately, by maximizing

l_j = Σ_{i=1}^n ẑ_ij log p(x_i|z_i = j)   (S.62)
= Σ_{i=1}^n ẑ_ij (−λ_j + x_i log λ_j − log x_i!)   (S.63)
= −λ_j (Σ_{i=1}^n ẑ_ij) + (Σ_{i=1}^n ẑ_ij x_i) log λ_j − Σ_{i=1}^n ẑ_ij log x_i!.   (S.64)

Taking the derivative and setting to zero,

∂l_j/∂λ_j = −Σ_{i=1}^n ẑ_ij + (1/λ_j) Σ_{i=1}^n ẑ_ij x_i = 0   (S.65)
⇒ λ̂_j = Σ_{i=1}^n ẑ_ij x_i / Σ_{i=1}^n ẑ_ij = (1/n̂_j) Σ_{i=1}^n ẑ_ij x_i.   (S.66)

In summary, the EM algorithm for a mixture of Poissons is

E-step:  ẑ_ij = p(z_i = j|x_i, θ̂) = π̂_j p(x_i|z_i = j, λ̂_j) / Σ_{k=1}^K π̂_k p(x_i|z_i = k, λ̂_k),   (S.67)
M-step:  n̂_j = Σ_{i=1}^n ẑ_ij,   π̂_j = n̂_j/n,   λ̂_j = (1/n̂_j) Σ_{i=1}^n ẑ_ij x_i.   (S.68)
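The sketch below implements exactly these E- and M-steps on synthetic counts; the two-component data, initialization, and the use of scipy.stats.poisson are assumptions for illustration.

```python
# EM for a mixture of Poissons, following (S.67)-(S.68), on synthetic count data.
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(0)
x = np.concatenate([rng.poisson(0.3, 300), rng.poisson(2.4, 200)])   # counts
n, K = x.size, 2

pi = np.full(K, 1.0 / K)                 # mixture weights
lam = np.array([0.5, 3.0])               # initial Poisson rates

for _ in range(200):
    # E-step: soft assignments z[i, j] = p(z_i = j | x_i, theta_hat)
    z = pi * poisson.pmf(x[:, None], lam)
    z /= z.sum(axis=1, keepdims=True)
    # M-step: soft counts, weights, and rates
    nj = z.sum(axis=0)
    pi = nj / n
    lam = (z * x[:, None]).sum(axis=0) / nj

print(pi.round(2), lam.round(2))         # roughly recovers the generating parameters
```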
(b) A mixture of Poissons was learned from the data of each city with number of components K ∈ {1, 2, 3, 4}. The final data log-likelihood and model parameters are in the following table, and the mixture distributions are plotted in Figure 3 (top).
  city     K   loglike     (π1, λ1)        (π2, λ2)        (π3, λ3)         (π4, λ4)
  London   1   -728.713    (1, 0.929)
           2   -728.703    (0.736, 0.988)  (0.264, 0.765)
           3   -728.701    (0.599, 0.982)  (0.206, 0.972)  (0.195, 0.721)
           4   -728.703    (0.099, 0.995)  (0.317, 0.989)  (0.330, 0.981)   (0.254, 0.761)
  Antwerp  1   -830.696    (1, 0.896)
           2   -748.025    (0.339, 2.195)  (0.661, 0.230)
           3   -747.665    (0.278, 2.374)  (0.451, 0.523)  (0.271, 0.0001)
           4   -747.665    (0.278, 2.374)  (0.234, 0.523)  (0.217, 0.523)   (0.271, 0.0001)
There are a few interesting observations. First, for the London data, the log-likelihood does not increase significantly as K increases. This suggests that the London data can already be explained well with one component. Looking at the model parameters for K = 2, there seem to be two important values of λ, 0.98 and 0.76. The first is the dominant value (larger prior weight), while the second provides some correction for the number of samples with count 2 and 3 (see Figure 3 top-left). While EM has found this second component, the fact that the log-likelihood is similar to K = 1 indicates that the variations in the sample data are equally well explained by the single component. In any case, 0.98 and 0.76 are relatively similar values. For larger values of K, the extra components have values similar to the original 0.98.
Looking at the Antwerp models, there is a clear increase in log-likelihood as K increases, until K = 3. For K = 1, the value of λ is 0.896, but the plot in Figure 3 (top-right) shows that the single Poisson does not model the data well. Increasing the number of components to K = 2, we recover two significant components, λ1 = 2.195 and λ2 = 0.230. The difference between the two values suggests that some squares are hit more often than others. Further increasing to K = 3 recovers three types of squares: 1) frequently hit (λ1 = 2.374), 2) infrequently hit (λ2 = 0.523), and 3) never hit (λ3 = 0.0001). Hence, from these results we see strong evidence to support that there is specific targeting of sites in Antwerp. On the other hand, the analysis of the London data suggests that there was no targeting of sites in London.
Finally, the counts for the London data are the real counts observed during WWII. The Antwerp data was generated synthetically. Figure 3 (middle) shows the simulated hit locations and square plot counts for the two cities. For the London data, the distribution of hits is uniform. For Antwerp, there are three targeted locations. We can assign each square to one of the components of the mixture, which clusters the squares according to the hit frequency. Figure 3 (bottom) shows the clusters. For the London data, the squares are always assigned to the same component, i.e., there is one dominant component in each mixture, as seen in Figure 3 (top-left). On the other hand, for the Antwerp data, there is a clear distinction between regions. For K = 2, the city is separated into regions with 2 or more hits, and those with less than 2. For K = 3, the city is separated into regions with 2 or more hits, only 1 hit, and zero hits. This can be seen in the mixture plots, where a different component dominates for different values of x.
………
Figure 3: Flying bombs analysis: (top) mixture of Poissons learned for each city, for different number of components (K); (middle) hit locations and square plot counts; (bottom) assignment of squares to most likely components.
Problem 5.2 Mean and variance of a kernel density estimate
(a) The mean of the density estimate p̂(x) is

μ̂ = E_p̂[x] = ∫ p̂(x) x dx = ∫ (1/n) Σ_{i=1}^n k̃(x − x_i) x dx = (1/n) Σ_{i=1}^n ∫ k̃(x − x_i) x dx.   (S.69)

Looking at the integral, define x̄ = x − x_i and perform a change of variable,

∫ k̃(x − x_i) x dx = ∫ k̃(x̄)(x̄ + x_i) dx̄ = ∫ [k̃(x̄)x̄ + k̃(x̄)x_i] dx̄ = x_i,   (S.70)

where the last step follows from the assumption of zero mean of the kernel in (5.6), and the fact that the kernel integrates to 1. Finally, substituting into (S.69),

μ̂ = (1/n) Σ_{i=1}^n x_i.   (S.71)

(b) For the covariance of the density estimate,

Σ̂ = cov_p̂(x) = E_p̂[(x − μ̂)(x − μ̂)^T] = ∫ p̂(x)(x − μ̂)(x − μ̂)^T dx   (S.72)
= (1/n) Σ_{i=1}^n ∫ k̃(x − x_i)(x − μ̂)(x − μ̂)^T dx.

Looking at the integral, and again performing a change of variable,

∫ k̃(x − x_i)(x − μ̂)(x − μ̂)^T dx = ∫ k̃(x̄)(x̄ + x_i − μ̂)(x̄ + x_i − μ̂)^T dx̄   (S.73)
= ∫ [k̃(x̄)x̄x̄^T + k̃(x̄)x̄(x_i − μ̂)^T + k̃(x̄)(x_i − μ̂)x̄^T + k̃(x̄)(x_i − μ̂)(x_i − μ̂)^T] dx̄   (S.74)
= H + (x_i − μ̂)(x_i − μ̂)^T,   (S.75)

where the last step follows from the zero-mean assumption of the kernel, and the fact that the kernel integrates to 1 (the first term is the kernel covariance H). Finally, substituting into (S.72),

Σ̂ = (1/n) Σ_{i=1}^n [H + (x_i − μ̂)(x_i − μ̂)^T] = H + (1/n) Σ_{i=1}^n (x_i − μ̂)(x_i − μ̂)^T.   (S.76, S.77)
(c) The mean of the distribution μˆ is the same as the empirical mean of the samples. On the other hand, the covariance of the distribution Σˆ is the sum of the empirical covariance and the kernel covariance H. Hence, the kernel density estimate produces a distribution with larger covariance than that of the samples. In class, we showed that the mean of the estimator is the convolution between the true density and the kernel, EX [pˆ(x)] = p(x) ∗ k ̃(x), and hence the estimator is biased. From the result in this problem, we can see that this estimator bias manifests itself as a density estimate with covariance larger than what we see empirically in the samples.
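The sketch below checks this result by sampling from a Gaussian-kernel KDE with bandwidth matrix H = h²I; the data, bandwidth, and the Monte Carlo sampling construction are assumptions for illustration.

```python
# Numerical check of (a)-(b): KDE mean = sample mean; KDE covariance = sample covariance + H.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[1.0, 0.3], [0.0, 0.7]])   # n samples in 2-D
h = 0.8
H = h**2 * np.eye(2)

# sampling from the KDE: pick a data point at random, then add kernel noise ~ N(0, H)
idx = rng.integers(0, X.shape[0], size=200_000)
S = X[idx] + rng.multivariate_normal(np.zeros(2), H, size=idx.size)

print(S.mean(axis=0).round(3), X.mean(axis=0).round(3))              # nearly identical
print(np.cov(S, rowvar=False).round(2))
print((np.cov(X, rowvar=False) + H).round(2))                        # matches the line above
```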
………
Problem 6.4 Coin Tossing
(a) For this problem, the Bayesian decision rule is to guess heads (H) when

p(s = H|r = H) > p(s = T|r = H)   (S.78)
p(r = H|s = H)p(s = H) > p(r = H|s = T)p(s = T)   (S.79)
(1 − θ_1)α > θ_2(1 − α)   (S.80)
α > θ_2/(1 − θ_1 + θ_2),   (S.81)

and tails (T) when

α < θ_2/(1 − θ_1 + θ_2).   (S.82)

When

α = θ_2/(1 − θ_1 + θ_2),   (S.83)

any guess is equally good.

(b) When θ_1 = θ_2 = θ, the minimum probability of error decision is to declare heads if

α > θ   (S.84)
and tails otherwise. This means that you should only believe your friend’s report if your prior for heads is greater than the probability that he lies. To see that this makes sense let’s look at a few different scenarios:
• If your friend is a pathological liar (θ = 1), then you know for sure that the answer is not heads and you should always say tails. This is the decision that (S.84) advises you to take.
• If he never lies (θ = 0) you know that the answer is heads. Once again this is the decision that (S.84) advises you to take.
• If both α = 0 and θ = 0 we have a contradiction, i.e. you know for sure that the result of the toss is always tails, but this person who never lies is telling you that it is heads. In this case Bayes just gives up and says “either way is fine”. This is a sensible strategy: there is something wrong with the models, and you probably need to learn something more about the problem.
• If your friend is completely random, θ = 1/2, (S.84) tells you to go with your prior and ignore him. If you believe that the coin is more likely to land on heads, say heads; otherwise say tails. Bayes has no problem with ignoring the observations whenever these are completely uninformative.
• When you do not have prior reason to believe that one of the outcomes is more likely than the other, i.e. if you assume a fair coin (α = 1/2), (S.84) advises you to reject the report whenever you think that your friend is more of a liar (θ > 1/2) and to accept it when you believe that he is more on the honest side (θ < 1/2). Once again this makes sense.
• In general, the optimal decision rule is to “modulate” this decision by your prior belief on the outcome of the toss: say heads if your prior belief that the outcome was really heads is larger than the probability that your friend is lying.

(c) Denoting by r_i the ith report and assuming that the sequence of reports R = {r_1, ..., r_n} has n_h heads and n − n_h tails, the MPE decision is now to say heads if

p(s = H|r_1, ..., r_n) > p(s = T|r_1, ..., r_n)   (S.85)
p(r_1, ..., r_n|s = H)p(s = H) > p(r_1, ..., r_n|s = T)p(s = T)   (S.86)
(1 − θ_1)^{n_h} θ_1^{n−n_h} α > θ_2^{n_h} (1 − θ_2)^{n−n_h} (1 − α)   (S.87)
α > θ_2^{n_h}(1 − θ_2)^{n−n_h} / [(1 − θ_1)^{n_h} θ_1^{n−n_h} + θ_2^{n_h}(1 − θ_2)^{n−n_h}]   (S.88)
α > 1 / [1 + ((1 − θ_1)/θ_2)^{n_h} (θ_1/(1 − θ_2))^{n−n_h}]   (S.89)

and tails otherwise.
(d) When θ_1 = θ_2 = θ and the report sequence is all heads (n_h = n), the MPE decision becomes to declare heads if

α > 1 / [1 + ((1 − θ)/θ)^n]   (S.91)
and tails otherwise. As n becomes larger, i.e. n → ∞, we have three situations:
• Your friend is more of a liar, θ > 1/2. In this case, ((1 − θ)/θ)n → 0 and the decision rule
becomes α > 1. That is, you should always reject his report.
• Your friend is more of an honest person, θ < 1/2. In this case, ((1 − θ)/θ)^n → ∞ and the decision rule becomes α > 0. That is, you should always accept his report.
• Your friend is really just random, θ = 1/2. In this case, the decision rule becomes α > 1/2
and you should go with your prior.
Once again this makes sense. Now you have a lot of observations, so you are much more confident in the data and need to rely a lot less on your prior. It also takes a lot less work to figure out what you should do, since you do not have to make detailed probability comparisons. Because your friend seems to be so certain of the outcome (he always says heads), you either: 1) do not trust him (θ somewhere between 1/2 and 1), and therefore believe that he is just trying to fool you and reject what he says, or 2) trust him (θ somewhere between 0 and 1/2) and accept his report. It is only in the case that he is completely unpredictable that the prior becomes important. This looks like a really good strategy, and sounds a lot like the way people think. As you can see in this example, the optimal Bayesian decision can be something as qualitative as: if you trust, accept; if you doubt, reject; otherwise ignore.
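A tiny sketch of the repeated-report decision rule (S.89) is given below; the numeric values of α and θ are illustrative only.

```python
# MPE decision for a sequence of reports, following (S.89).
def say_heads(alpha: float, theta1: float, theta2: float, n_heads: int, n_tails: int) -> bool:
    """Return True if the minimum-probability-of-error decision is 'heads'."""
    ratio = ((1 - theta1) / theta2) ** n_heads * (theta1 / (1 - theta2)) ** n_tails
    return alpha > 1.0 / (1.0 + ratio)

print(say_heads(alpha=0.5, theta1=0.3, theta2=0.3, n_heads=5, n_tails=0))   # mostly honest -> True
print(say_heads(alpha=0.5, theta1=0.8, theta2=0.8, n_heads=5, n_tails=0))   # mostly lying -> False
```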
………
Problem 7.5 PCA and Classification
(a) To solve this problem in the most general form, we consider that p(y = 1) = π and p(y = 2) = 1 − π. Marginalizing out y, the distribution of the features x is

p(x) = Σ_y p(x|y)p(y) = p(x|y = 1)p(y = 1) + p(x|y = 2)p(y = 2)   (S.92)
= πp(x|y = 1) + (1 − π)p(x|y = 2).   (S.93)

We then note that, for any function f(x),

E[f(x)] = ∫ f(x)πp(x|y = 1)dx + ∫ f(x)(1 − π)p(x|y = 2)dx   (S.94)
= πE_{x|y=1}[f(x)] + (1 − π)E_{x|y=2}[f(x)].   (S.95)

To compute the mean of x it suffices to apply this result with f(x) = x, which leads to

μ_x = πμ_1 + (1 − π)μ_2.   (S.96)

For the covariance we use f(x) = (x − μ_x)(x − μ_x)^T, leading to

Σ_x = E[(x − μ_x)(x − μ_x)^T]   (S.97)
= πE_{x|y=1}[(x − μ_x)(x − μ_x)^T] + (1 − π)E_{x|y=2}[(x − μ_x)(x − μ_x)^T].   (S.98)

We next note that

E_{x|y=j}[(x − μ_x)(x − μ_x)^T] = E_{x|y=j}[(x − μ_j + μ_j − μ_x)(x − μ_j + μ_j − μ_x)^T]   (S.99)
= Σ_j + E_{x|y=j}[(x − μ_j)(μ_j − μ_x)^T] + E_{x|y=j}[(μ_j − μ_x)(x − μ_j)^T] + E_{x|y=j}[(μ_j − μ_x)(μ_j − μ_x)^T]   (S.100)
= Σ_j + (μ_j − μ_x)(μ_j − μ_x)^T,   (S.101)

from which it follows that

Σ_x = πΣ_1 + (1 − π)Σ_2 + π(μ_1 − μ_x)(μ_1 − μ_x)^T + (1 − π)(μ_2 − μ_x)(μ_2 − μ_x)^T   (S.102)
= πΣ_1 + (1 − π)Σ_2 + π(1 − π)²(μ_1 − μ_2)(μ_1 − μ_2)^T + (1 − π)π²(μ_1 − μ_2)(μ_1 − μ_2)^T   (S.103)
= πΣ_1 + (1 − π)Σ_2 + [π(1 − π)² + (1 − π)π²](μ_1 − μ_2)(μ_1 − μ_2)^T   (S.104)
= πΣ_1 + (1 − π)Σ_2 + π(1 − π)(μ_1 − μ_2)(μ_1 − μ_2)^T.   (S.105)

Hence, when π = 1/2,

μ_x = E[x] = (1/2)(μ_1 + μ_2)   (S.106)
Σ_x = E[(x − μ_x)(x − μ_x)^T] = (1/2)(Σ_1 + Σ_2) + (1/4)(μ_1 − μ_2)(μ_1 − μ_2)^T.   (S.107)

(b) Regarding PCA, we start by computing the mean and covariance of x. Using the results of (a), we have μ_x = 0 and

Σ_x = [ 1 + α²   0
        0        σ² ].   (S.108)

Since this is a diagonal matrix, its eigenvectors are just the coordinate axes. Thus PCA will choose the eigenvector e_1 = [1, 0]^T when 1 + α² > σ², and the eigenvector e_2 = [0, 1]^T otherwise. We thus have two possible decision rules:

• When α > √(σ² − 1) we have z = e_1^T x = x_1, where x_1 is the 1st dimension of x, and

p(z|y = 1) = p(x_1 = z|y = 1) = N(z|α, 1)   (S.109)
p(z|y = 2) = p(x_1 = z|y = 2) = N(z|−α, 1).   (S.110)

This is the direction along which the Gaussians have greatest separation. Hence the classifier obtained with PCA is optimal.
• When α < √(σ² − 1) we have z = e_2^T x = x_2, and

p(z|y = 1) = p(x_2 = z|y = 1) = N(z|0, σ²)   (S.111)
p(z|y = 2) = p(x_2 = z|y = 2) = N(z|0, σ²).   (S.112)

This is a terrible selection of direction to project on, since both classes have the same marginal along this projection and the classification error is therefore going to be the worst possible (0.5, i.e. the same as random guessing). In this case PCA fails miserably.

Overall, the conclusion is that one has to be very careful with the use of PCA for classification. There are situations in which it is optimal, but others in which it leads to the worst possible choice of features. Furthermore, a slight perturbation of the covariance matrix can lead to the switch from one situation to the other. Hence, in the context of classification problems, PCA is not necessarily an optimal technique for dimensionality reduction.

………
Problem 8.5 Loss functions

(a) Defining z_i = y_i w^T x_i with classes y_i ∈ {+1, −1}, the point x_i is misclassified if z_i < 0. Hence the number of misclassified training points is the number of points with z_i < 0,

R_emp = Σ_{i=1}^n L_{01}(z_i),   (S.113)
L_{01}(z_i) = { 0, z_i ≥ 0;  1, z_i < 0 }.   (S.114)

(b) The perceptron minimizes the error over the misclassified points,

E(w) = Σ_{i∈M} −y_i w^T x_i,   (S.115)

where M is the set of misclassified points. Noting that correctly classified points (when z_i ≥ 0) have 0 penalty, the sum can be rewritten over all points as

E(w) = Σ_{i∈M} −z_i = Σ_{i=1}^n { 0, z_i ≥ 0;  −z_i, z_i < 0 }   (S.116)
= Σ_{i=1}^n max(0, −z_i) = Σ_{i=1}^n L_p(z_i),   (S.117)

where L_p(z_i) = max(0, −z_i).

(c) Least-squares classification minimizes the squared error between the label and the linear function,

E(w) = Σ_{i=1}^n (w^T x_i − y_i)² = Σ_{i=1}^n (1/y_i)²(y_i w^T x_i − y_i²)² = Σ_{i=1}^n (y_i w^T x_i − y_i²)²/y_i²   (S.118)
= Σ_{i=1}^n (y_i w^T x_i − 1)² = Σ_{i=1}^n (z_i − 1)²,   (S.119)

where the last line follows from y_i² = 1. Hence, L_LSC(z_i) = (z_i − 1)².

(d) Logistic regression maximizes the data log-likelihood, or equivalently minimizes the “cross-entropy” error

E(w) = −Σ_{i=1}^n [ỹ_i log π_i + (1 − ỹ_i) log(1 − π_i)],   (S.120)

where the classes are now defined as ỹ_i ∈ {1, 0}, and π_i = σ(w^T x_i) is the classifier's probability (confidence) that x_i should be in class ỹ_i = 1. Because ỹ_i can only take a value of either 1 or 0, only one of the terms in the square brackets is non-zero for each i. Hence, we have

E(w) = −Σ_{i=1}^n { log π_i, ỹ_i = 1;  log(1 − π_i), ỹ_i = 0 }.   (S.121)

Next, we map the class labels from ỹ_i ∈ {1, 0} to y_i ∈ {+1, −1}, and the error function is

E(w) = −Σ_{i=1}^n { log π_i, y_i = +1;  log(1 − π_i), y_i = −1 }   (S.122)
= −Σ_{i=1}^n { log σ(w^T x_i), y_i = +1;  log σ(−w^T x_i), y_i = −1 }   (S.123)
= −Σ_{i=1}^n log σ(y_i w^T x_i)   (S.124)
= −Σ_{i=1}^n log [1/(1 + exp(−y_i w^T x_i))] = Σ_{i=1}^n log(1 + e^{−z_i}),   (S.125)

where (S.123) follows from the property 1 − σ(f) = σ(−f) from Problem 8.1. Hence, the logistic regression loss function is

L_LR(z_i) = log(1 + e^{−z_i}) ∝ (1/log 2) log(1 + e^{−z_i}).   (S.126)

The purpose of the scaling term 1/log(2) is to make the loss function go through the (0,1) “corner” of the 0-1 loss function; obviously it doesn't change the underlying optimization problem.

(e) See discussion for Problem 9.5 (c).

………
Problem 9.1 Margin

(a) The goal is to find the point x on the hyperplane f(x) = w^T x + b = 0 that is closest to a point x_a. In other words, we want to minimize the distance from x to x_a, subject to the constraint f(x) = 0. The Lagrangian for this problem is

L = ∥x − x_a∥² − λ(w^T x + b).   (S.127)

Setting the derivative wrt λ to 0, we recover the desired equality constraint

∂L/∂λ = −(w^T x + b) = 0  ⇒  f(x) = 0.   (S.128)
Next, setting the derivative wrt x to 0,

∂L/∂x = 2(x − x_a) − λw = 0  ⇒  2(x − x_a) = λw.   (S.129)

Multiplying both sides by w^T,

2w^T(x − x_a) = λw^T w   (S.130)
2(w^T x − w^T x_a) = λ∥w∥²   (S.131)
2(w^T x + b − w^T x_a − b) = λ∥w∥²   (S.132)
2(f(x) − f(x_a)) = λ∥w∥²   (S.133)
⇒ λ = −2f(x_a)/∥w∥²,   (S.134)

which follows from the constraint f(x) = 0. Next, substituting for λ in (S.129),

(x − x_a) = −f(x_a)w/∥w∥².   (S.135)

The length-squared of this vector is

∥x − x_a∥² = f(x_a)²∥w∥²/∥w∥⁴ = f(x_a)²/∥w∥²,   (S.136)

and hence the minimal distance between x (a point on the hyperplane) and x_a is

∥x − x_a∥ = |f(x_a)|/∥w∥,   (S.137)

which is the distance from x_a to the hyperplane.

(b) Setting x_a = 0 as the origin, we have

∥x∥ = |w^T 0 + b|/∥w∥ = |b|/∥w∥,   (S.138)

which is the distance from the origin to the hyperplane.

(c) The point closest to x_a on the hyperplane is obtained from (S.135),

x̂ = x_a − f(x_a)w/∥w∥².   (S.139)

………
Problem 9.5 Soft-margin SVM risk function

(a) The soft-margin constraint is

y_i(w^T x_i + b) ≥ 1 − ξ_i  ⇒  ξ_i ≥ 1 − y_i(w^T x_i + b).   (S.140)

The slack variable is also non-negative, ξ_i ≥ 0. Both inequalities are lower bounds on ξ_i, and hence they can be combined by taking the maximum of the two,

ξ_i ≥ max(0, 1 − y_i(w^T x_i + b)).   (S.141)

Note that the soft-margin objective,

min_{w,b,ξ} (1/2)∥w∥² + C Σ_{i=1}^n ξ_i,   (S.142)

is minimizing the 1-norm of ξ. Hence, at the minimum, we must have equality for the constraint in (S.141) (i.e., the constraint is active),

ξ_i = max(0, 1 − y_i(w^T x_i + b)).   (S.143)

If this were not true, we could still reduce ξ_i and satisfy the inequality constraint, which means the objective is not minimized yet.

(b) Substituting (S.143) into the objective in (S.142), we have the error function

E(w, b) = (1/2)∥w∥² + C Σ_{i=1}^n max(0, 1 − y_i(w^T x_i + b))   (S.144)
∝ Σ_{i=1}^n max(0, 1 − y_i(w^T x_i + b)) + λ∥w∥²,   (S.145)

where λ = 1/(2C) > 0. The left term is the empirical risk, where the loss function is

L_SVM(z_i) = max(0, 1 − y_i(w^T x_i + b)) = max(0, 1 − z_i),   (S.146)
and the right term regularizes w (i.e., controls the complexity). This loss function is also known as the “hinge loss”.
(c) Figure 4 plots the loss function for various classifiers: perceptron, least-squares (LS), logistic regression (LR), SVM, and AdaBoost. We didn't talk about boosting (AdaBoost) in this class, but anyways, it can be shown to be minimizing an exponential loss function L_exp(z_i) = e^{−z_i}. There are three important regions in this plot: misclassified points (z < 0), correctly classified points inside the margin (0 < z < 1), and correct points outside the margin (1 < z). Let's compare these regions for the different classifiers.

• For points that are misclassified (z < 0), the perceptron, LR, and SVM all apply a penalty that is linear in the distance of the point from the classification boundary. On the other hand, boosting and LS both use non-linear penalties for misclassifications, with boosting being the most aggressive (exponential penalty). Intuitively, classifiers with the linear misclassification penalty (LR, SVM) will be more robust to outliers and errors caused by overlapping classes. On the other hand, boosting will tend to overfit to reduce these errors (in the absence of enough training data).
• All the classifiers except for the perceptron penalize correctly classified points that are inside the margin (0 ≤ z ≤ 1). The main purpose is to find a hyperplane that has the most amount of “buffer” space (neutral zone) between the positive and negative classes. The perceptron is very similar to the SVM, except that it doesn't have a margin penalty!
• Finally, the classification methods differ in how they deal with correctly classified points far from the margin (z > 1). First, for SVM, there is no penalty if the point falls outside the margin, i.e., these points don't affect the hyperplane at all. On the other hand, LR and boosting apply an exponentially decaying penalty, which causes correctly classified points to be “pushed” farther away from the margin (actually the points are stationary and the margin is being pushed). This has the effect of tilting the decision boundary away from the bulk of the data (see Figure 5 for an example). Finally, LS applies a large penalty for points that are correctly classified and far from the margin, i.e. that are “too correct”. So, in actuality, LS is pushing all the points towards the margin, regardless of whether they are correct or not.
So which classifiers are better? Well, there are different properties that might be desirable for a classifier: 1) robustness to outliers (linear misclassification penalty); 2) maximizing the margin; and 3) pushing points away from the margin. These properties are embodied in the three classifiers: SVM, LR, and AdaBoost. Interestingly, LR is somehow a mix of SVM (with linear misclassification penalty) and AdaBoost (with exponentially decaying penalty for correct classifications).
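For reference, the sketch below evaluates each of these loss functions at a few values of z = y w^T x; the grid of z values is arbitrary.

```python
# Loss functions as a function of z = y * w^T x, as compared in Figure 4.
import numpy as np

def losses(z):
    return {
        "0-1":                    (z < 0).astype(float),
        "perceptron":             np.maximum(0.0, -z),
        "least-squares":          (z - 1.0) ** 2,
        "logistic (scaled)":      np.log1p(np.exp(-z)) / np.log(2.0),
        "hinge (SVM)":            np.maximum(0.0, 1.0 - z),
        "exponential (AdaBoost)": np.exp(-z),
    }

z = np.linspace(-2, 2, 5)
for name, vals in losses(z).items():
    print(f"{name:>24}: {np.round(vals, 2)}")
```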
Figure 4: Loss functions versus distance to margin (z = y w^T x). The plot annotations mark the linear vs. non-linear penalties for errors (z < 0), the penalty for correct points inside the margin, the absence of penalty outside the margin (SVM), and the least-squares penalty for points that are “too correct”.

………
Figure 5: Example of logistic regression and SVM classification boundaries. The LR boundary is slightly tilted away from the data, whereas the SVM boundary only depends on the points on the margin.
Problem 10.1 Constructing kernels from kernels
(f) The Taylor expansion (at 0) of the exponential function is

e^a = Σ_{n=0}^∞ a^n/n! = 1 + a + a²/2 + a³/6 + a⁴/24 + ···   (S.147)

Hence, substituting the kernel function for a,

k(x, z) = e^{k_1(x,z)} = Σ_{n=0}^∞ (1/n!) k_1(x, z)^n.   (S.148)

This is an infinite sum of non-negative integer powers of the kernel function, where each term is scaled by a positive value. If k_1(x, z) is a positive definite kernel, then all these operations also result in a positive definite kernel (see Problems 10.1a, 10.1b, and 10.1e). Hence, the exponential of a positive definite kernel is also positive definite.
………
Problem 10.2 Kernels on real vectors
(a) We can rewrite the RBF as

k(x, z) = e^{−α∥x−z∥²} = e^{−α(x^T x − 2x^T z + z^T z)} = e^{−αx^T x} e^{2αx^T z} e^{−αz^T z}.   (S.149)

The middle term k_1(x, z) = e^{2αx^T z} is a positive definite kernel via Problem 10.1(f) and the fact that the linear kernel x^T z is positive definite (as is its scaled version). Next, letting f(x) = e^{−αx^T x}, we can rewrite the RBF kernel as

k(x, z) = f(x) e^{2αx^T z} f(z).   (S.150)

Hence, using the input scaling property (Problem 10.1(d)), the RBF kernel is also positive definite.

In lecture we saw that the mapping for the RBF kernel is from a point to a function, x_i → k(·, x_i) = e^{−α∥·−x_i∥²}, in the RKHS. Using the Taylor expansion of the exponential function we can get an alternative interpretation. For simplicity, consider the case when α = 1/2. The exponential term can be written as a sum of polynomial terms,

k_1(x, z) = e^{x^T z} = Σ_{n=0}^∞ (1/n!)(x^T z)^n = 1 + x^T z + (1/2)(x^T z)² + (1/6)(x^T z)³ + (1/24)(x^T z)⁴ + ···   (S.151)

Letting Φ_p(x) be the feature transform of the degree-p polynomial kernel,

k_1(x, z) = 1 + x^T z + (1/2)Φ_2(x)^T Φ_2(z) + (1/6)Φ_3(x)^T Φ_3(z) + (1/24)Φ_4(x)^T Φ_4(z) + ···   (S.152)
= Φ(x)^T Φ(z),   (S.153)

where

Φ(x) = [ 1;  x;  (1/√2)Φ_2(x);  (1/√6)Φ_3(x);  (1/√24)Φ_4(x);  ··· ].

Hence, the feature transformation of the exponential term is a concatenation of the feature transformations of all the polynomial kernels (i.e., a concatenation of all polynomial combinations of features) with decaying weights. Note that the input scaling terms are

f(x) = e^{−αx^T x} = (e^{2αx^T x})^{−1/2} = 1/√(k_1(x, x)).   (S.154)
Hence, the RBF kernel is the normalized version of the exponential kernel k1, which only measures the angle between the high-dimensional vectors in feature space (see Problem 10.4 for more about normalized kernels).
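This normalization identity is easy to verify numerically; the sketch below does so for random vectors and an arbitrary α.

```python
# Check: RBF kernel = normalized exponential kernel k1(x,z) / sqrt(k1(x,x) k1(z,z)).
import numpy as np

rng = np.random.default_rng(0)
x, z = rng.normal(size=3), rng.normal(size=3)
alpha = 0.7

k1 = lambda a, b: np.exp(2 * alpha * a @ b)        # exponential kernel
rbf = np.exp(-alpha * np.sum((x - z) ** 2))
normalized = k1(x, z) / np.sqrt(k1(x, x) * k1(z, z))
print(np.isclose(rbf, normalized))                 # True
```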
………
Problem 10.4 Constructing kernels from kernels (part 3 – normalization)
(a) Setting f(x) = 1/√(k(x, x)) and using the “input scaling” property (Problem 10.1(d)) shows that the normalized kernel is positive definite if k(x, z) is also positive definite.

(b) The positive definite kernel k(x, z) can be expressed as an inner product in the high-dimensional space, k(x, z) = ⟨Φ(x), Φ(z)⟩. The length (norm) of the transformed vector is ∥Φ(x)∥ = √⟨Φ(x), Φ(x)⟩. Hence, the normalized kernel can be written as

k̃(x, z) = k(x, z)/√(k(x, x)k(z, z)) = ⟨Φ(x), Φ(z)⟩/(∥Φ(x)∥∥Φ(z)∥),   (S.155)

which is the formula for the cosine of the angle between Φ(x) and Φ(z). Usually this relationship is written as ⟨x, z⟩ = ∥x∥∥z∥ cos θ.
(c) Since k̃(x, z) is the cosine of an angle, we have −1 ≤ k̃(x, z) ≤ 1. In addition, k̃(x, z) = 1 when Φ(x) and Φ(z) are pointing in the same direction, k̃(x, z) = −1 if they are pointing in opposite directions, and k̃(x, z) = 0 when Φ(x) is orthogonal to Φ(z).

On a side note, squaring the normalized kernel, we can obtain the inequality

k̃(x, z)² ≤ 1  ⇒  k(x, z)² ≤ k(x, x)k(z, z),   (S.156)

which is true for any positive definite kernel k(x, z). Taking the square root, we obtain the Cauchy-Schwarz inequality for kernels,

|k(x, z)| ≤ √(k(x, x)k(z, z)).   (S.157)
Finally, Figure 6 shows the exponential kernel matrix for the data in Figure 5, and its normalized kernel (i.e., RBF kernel). In the unnormalized kernel, some rows have within-class kernel values that are similar to out-of-class values for other rows. The dynamic range of the similarity values changes depending on the location, which makes the learning task harder for the classifier. On the other hand, after normalizing the kernel, the maximum similarity is 1 (on the diagonal), and all similarities in each row are scaled relative to this.
Figure 6: (left) the exponential kernel k(x, z) = e^{αx^T z}; (right) the normalized exponential kernel, i.e. the RBF kernel. The data is from Figure 5 and α = 0.1.
………
Tutorial Problems
Problem 1.6 Multivariate Gaussian
(a) The multivariate Gaussian distribution is

N(x|μ, Σ) = 1/((2π)^{d/2}|Σ|^{1/2}) exp{−(1/2)(x − μ)^T Σ^{−1}(x − μ)}.   (S.158)

Assuming a diagonal covariance matrix, Σ = diag(σ_1², ..., σ_d²), and substituting the properties in (1.16),

N(x|μ, Σ) = 1/((2π)^{d/2}(Π_{i=1}^d σ_i²)^{1/2}) exp{−(1/2)(x − μ)^T diag(1/σ_1², ..., 1/σ_d²)(x − μ)}   (S.159)
= 1/((2π)^{d/2} Π_{i=1}^d σ_i) exp{−(1/2) Σ_{i=1}^d (x_i − μ_i)²/σ_i²}   (S.160)
= Π_{i=1}^d [1/((2π)^{1/2}σ_i)] exp{−(1/(2σ_i²))(x_i − μ_i)²}   (S.161)
= Π_{i=1}^d N(x_i|μ_i, σ_i²).   (S.162)

(b) See Figure 7b. The diagonal terms indicate how far the Gaussian stretches in each axis direction.

(c) See Figure 7a.

(d) The eigenvalue and eigenvector pairs (λ_i, v_i) of Σ satisfy

Σv_i = λ_i v_i,   i ∈ {1, ..., d}.   (S.163)

Rewriting using matrix notation,

Σ[v_1 ··· v_d] = [λ_1 v_1 ··· λ_d v_d] = [v_1 ··· v_d] diag(λ_1, ..., λ_d)   (S.164, S.165)
ΣV = VΛ,   (S.166)

where V = [v_1 ··· v_d] is the matrix of eigenvectors, and Λ = diag(λ_1, ..., λ_d) is a diagonal matrix with the corresponding eigenvalues. Finally, post-multiplying both sides by V^{−1},

ΣVV^{−1} = VΛV^{−1}   (S.167)
Σ = VΛV^T,   (S.168)

where in the last line we have used the property that the eigenvectors are orthogonal, i.e., V^T V = I and hence V^{−1} = V^T.
Figure 7: Example of multivariate Gaussians: a) isotropic covariance matrix; b) diagonal covariance matrix; c) full covariance matrix.
(e) The inverse of the covariance matrix is

Σ^{−1} = (VΛV^T)^{−1} = V^{−T}Λ^{−1}V^{−1} = VΛ^{−1}V^T.   (S.169)

Hence, the Mahalanobis distance term can be rewritten as

∥x − μ∥²_Σ = (x − μ)^T Σ^{−1}(x − μ) = (x − μ)^T VΛ^{−1}V^T(x − μ) = y^T Λ^{−1} y = ∥y∥²_Λ,   (S.170)

where we define y = V^T(x − μ).
(f) In the transformation x = V y + μ, first y is rotated according to the eigenvector matrix V , then the result is translated by vector μ.
(g) See Figure 7c. For this Σ, we have

V = [ √2/2  −√2/2
      √2/2   √2/2 ],    Λ = [ 1  0
                              0  0.25 ].

The Gaussian is first stretched according to the eigenvalues (e.g., like in Figure 7b). Then it is rotated to match the directions of the eigenvectors.
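The eigendecomposition in (d)-(g) is easy to check numerically; the covariance below is the one implied by the V and Λ in (g), and the check of (S.170) uses an arbitrary test point.

```python
# Eigendecomposition Sigma = V Lam V^T and the transformed variable y = V^T (x - mu).
import numpy as np

Sigma = np.array([[0.625, 0.375],
                  [0.375, 0.625]])          # eigenvalues 1 and 0.25, eigenvectors at 45 degrees
eigvals, V = np.linalg.eigh(Sigma)          # columns of V are eigenvectors
Lam = np.diag(eigvals)
print(np.allclose(Sigma, V @ Lam @ V.T))    # True

mu = np.zeros(2)
x = np.array([1.0, -0.5])
y = V.T @ (x - mu)
mahal_x = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
mahal_y = y @ np.linalg.inv(Lam) @ y
print(np.isclose(mahal_x, mahal_y))         # True, as in (S.170)
```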
………
Problem 1.8 Product of Multivariate Gaussian Distributions
The product of two Gaussians is

N(x|a, A) N(x|b, B)   (S.171)
= [1/((2π)^{d/2}|A|^{1/2})] exp{−(1/2)(x − a)^T A^{−1}(x − a)} · [1/((2π)^{d/2}|B|^{1/2})] exp{−(1/2)(x − b)^T B^{−1}(x − b)}   (S.172)
= 1/((2π)^d |A|^{1/2}|B|^{1/2}) exp{−(1/2)[(x − a)^T A^{−1}(x − a) + (x − b)^T B^{−1}(x − b)]}.   (S.173)

Looking at the exponent term, we expand and collect terms,

M = (x − a)^T A^{−1}(x − a) + (x − b)^T B^{−1}(x − b)   (S.174)
= x^T A^{−1}x + x^T B^{−1}x − 2a^T A^{−1}x − 2b^T B^{−1}x + a^T A^{−1}a + b^T B^{−1}b   (S.175)
= x^T (A^{−1} + B^{−1}) x − 2(a^T A^{−1} + b^T B^{−1}) x + a^T A^{−1}a + b^T B^{−1}b,   (S.176)

where, in the notation of Problem 1.10, the quadratic matrix is A^{−1} + B^{−1}, the linear vector is A^{−1}a + B^{−1}b, and the constant is a^T A^{−1}a + b^T B^{−1}b. Using these definitions of (A, b, c) we complete the square using Problem 1.10:

M = (x − d)^T A(x − d) + e,   (S.177)

where

d = A^{−1}b = (A^{−1} + B^{−1})^{−1}(A^{−1}a + B^{−1}b)   (S.178)

and

e = c − b^T A^{−1}b   (S.179)
= a^T A^{−1}a + b^T B^{−1}b − (a^T A^{−1} + b^T B^{−1})(A^{−1} + B^{−1})^{−1}(A^{−1}a + B^{−1}b)   (S.180)
= a^T A^{−1}a − a^T A^{−1}(A^{−1} + B^{−1})^{−1}A^{−1}a + b^T B^{−1}b − b^T B^{−1}(A^{−1} + B^{−1})^{−1}B^{−1}b − 2a^T A^{−1}(A^{−1} + B^{−1})^{−1}B^{−1}b   (S.181)
= a^T (A + B)^{−1}a + b^T (A + B)^{−1}b − 2a^T (A + B)^{−1}b   (S.182)
= (a − b)^T (A + B)^{−1}(a − b),   (S.183)

where in (S.182) we use the matrix inversion lemma on the first two terms (from Problem 1.15). Finally, defining C = (A^{−1} + B^{−1})^{−1} and c = C(A^{−1}a + B^{−1}b), we obtain for the exponent term

M = (x − c)^T C^{−1}(x − c) + (a − b)^T (A + B)^{−1}(a − b) = ∥x − c∥²_C + ∥a − b∥²_{A+B}.   (S.184)

Next, we look at the determinant term,

D = 1/(|A|^{1/2}|B|^{1/2}) = |C|^{1/2}/(|A|^{1/2}|B|^{1/2}|C|^{1/2})   (S.185, S.186)
= |(A^{−1} + B^{−1})^{−1}|^{1/2}/(|A|^{1/2}|B|^{1/2}) · 1/|C|^{1/2}   (S.187)
= 1/((|A||A^{−1} + B^{−1}||B|)^{1/2}) · 1/|C|^{1/2}   (S.188)
= 1/(|A + B|^{1/2}|C|^{1/2}).   (S.189)

Finally, substituting the derived expressions for M and D into (S.173),

N(x|a, A) N(x|b, B) = 1/((2π)^d |A + B|^{1/2}|C|^{1/2}) exp{−(1/2)[∥x − c∥²_C + ∥a − b∥²_{A+B}]}   (S.190, S.191)
= [1/((2π)^{d/2}|A + B|^{1/2}) exp{−(1/2)∥a − b∥²_{A+B}}] · [1/((2π)^{d/2}|C|^{1/2}) exp{−(1/2)∥x − c∥²_C}]   (S.192)
= N(a|b, A + B) N(x|c, C).   (S.193)
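A numerical spot-check of (S.193) with arbitrary 2-D parameters is shown below; it only verifies the identity at a single point x.

```python
# Verify N(x|a,A) N(x|b,B) = N(a|b, A+B) N(x|c, C) at one test point.
import numpy as np
from scipy.stats import multivariate_normal as mvn

a, A = np.array([1.0, -0.5]), np.array([[1.0, 0.2], [0.2, 0.5]])
b, B = np.array([0.3, 0.8]),  np.array([[0.7, -0.1], [-0.1, 1.2]])
x = np.array([0.4, 0.1])

C = np.linalg.inv(np.linalg.inv(A) + np.linalg.inv(B))
c = C @ (np.linalg.inv(A) @ a + np.linalg.inv(B) @ b)

lhs = mvn.pdf(x, a, A) * mvn.pdf(x, b, B)
rhs = mvn.pdf(a, b, A + B) * mvn.pdf(x, c, C)
print(np.isclose(lhs, rhs))   # True
```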
………
Problem 1.10 Completing the square
The original form is

f(x) = x^T Ax − 2x^T b + c.   (S.194)

Now expand the desired form,

f(x) = (x − d)^T A(x − d) + e = x^T Ax − 2x^T Ad + d^T Ad + e,   (S.195)

which has a quadratic, a linear, and a constant term. We need to match the quadratic, linear, and constant terms. The quadratic term already matches. For the linear term, we have by inspection d = A^{−1}b, so that x^T Ad = x^T b. For the constant term, we set

c = d^T Ad + e  ⇒  e = c − d^T Ad   (S.196)
= c − b^T A^{−1}AA^{−1}b = c − b^T A^{−1}b.   (S.197)
………
Problem 2.6 MLE for a multivariate Gaussian
(a) The log-likelihood of the data is

l = log p(X) = Σ_{i=1}^N log p(x_i) = Σ_{i=1}^N [−(1/2)∥x_i − μ∥²_Σ − (1/2)log|Σ| − (d/2)log 2π]   (S.198)
= −(1/2) Σ_{i=1}^N ∥x_i − μ∥²_Σ − (N/2)log|Σ|   (S.199)
= −(1/2) Σ_{i=1}^N (x_i − μ)^T Σ^{−1}(x_i − μ) − (N/2)log|Σ|,   (S.200)

where we have dropped constant terms. To find the maximum of the log-likelihood l w.r.t. μ, we take the derivative and set it to 0. One way to proceed is to expand the quadratic term and take the derivative. Alternatively, we can use the chain rule. First, removing terms that do not depend on μ,

∂l/∂μ = −(1/2) Σ_{i=1}^N ∂/∂μ (x_i − μ)^T Σ^{−1}(x_i − μ).   (S.201)

Letting z_i = x_i − μ and applying the chain rule ∂f/∂μ = (∂f/∂z)(∂z/∂μ),

∂l/∂μ = −(1/2) Σ_{i=1}^N ∂/∂μ z_i^T Σ^{−1} z_i = −(1/2) Σ_{i=1}^N [∂/∂z_i z_i^T Σ^{−1} z_i][∂z_i/∂μ]   (S.202)
= −(1/2) Σ_{i=1}^N [2Σ^{−1}z_i][−1] = Σ_{i=1}^N Σ^{−1}(x_i − μ).   (S.203)

Setting the derivative to 0, and solving for μ,

Σ_{i=1}^N Σ^{−1}(x_i − μ) = 0.   (S.204)

Pre-multiplying both sides by Σ,

Σ Σ_{i=1}^N Σ^{−1}(x_i − μ) = Σ0  ⇒  Σ_{i=1}^N x_i − Σ_{i=1}^N μ = 0  ⇒  Σ_{i=1}^N x_i − Nμ = 0   (S.205)
⇒ μ̂ = (1/N) Σ_{i=1}^N x_i.   (S.206)

(b) To find the maximum of the log-likelihood w.r.t. Σ, we first use the “trace” trick, x^T Ax = tr[x^T Ax] = tr[Axx^T], on the log-likelihood,

l = −(1/2) Σ_{i=1}^N (x_i − μ)^T Σ^{−1}(x_i − μ) − (N/2)log|Σ|   (S.207)
= −(1/2) Σ_{i=1}^N tr[Σ^{−1}(x_i − μ)(x_i − μ)^T] − (N/2)log|Σ|   (S.208)
= −(1/2) tr[Σ^{−1} Σ_{i=1}^N (x_i − μ)(x_i − μ)^T] − (N/2)log|Σ|.   (S.209)

Taking the derivative w.r.t. Σ (using the helpful derivatives provided in Problem 2.6) and setting to 0,

∂l/∂Σ = −(1/2){−Σ^{−1}[Σ_{i=1}^N (x_i − μ)(x_i − μ)^T]Σ^{−1}} − (N/2)Σ^{−1} = 0.   (S.210)

Pre-multiplying and post-multiplying by Σ on both sides gives

(1/2) Σ_{i=1}^N (x_i − μ)(x_i − μ)^T − (N/2)Σ = 0   (S.211)
⇒ Σ̂ = (1/N) Σ_{i=1}^N (x_i − μ)(x_i − μ)^T.   (S.214)

………
Problem 2.8 Least-squares regression and MLE

(a) We want to find the θ that minimizes the sum-squared error,

E = Σ_{i=1}^N (y_i − φ(x_i)^T θ)² = ∥y − Φ^T θ∥² = (y − Φ^T θ)^T(y − Φ^T θ)
= y^T y − 2y^T Φ^T θ + θ^T ΦΦ^T θ.

Next, take the derivative w.r.t. θ (using the derivatives from Problem 2.6), and set it to zero,

∂E/∂θ = −2Φy + 2ΦΦ^T θ = 0   (S.215)
⇒ ΦΦ^T θ = Φy  ⇒  θ̂ = (ΦΦ^T)^{−1}Φy.   (S.216)

(b) The noise ε is zero-mean Gaussian, and hence y_i = f(x_i) + ε is also Gaussian with mean equal to the function value f(x_i) = φ(x_i)^T θ,

p(y_i|x_i, θ) = N(y_i|φ(x_i)^T θ, σ²).   (S.217)

Then the log-likelihood of the data is

l = Σ_{i=1}^N log p(y_i|x_i, θ) = Σ_{i=1}^N [−(1/(2σ²))(y_i − φ(x_i)^T θ)² − (1/2)log σ² − (1/2)log 2π].   (S.218)

The maximum likelihood solution is then

θ̂ = argmax_θ l = argmax_θ −(1/(2σ²)) Σ_{i=1}^N (y_i − φ(x_i)^T θ)² = argmin_θ Σ_{i=1}^N (y_i − φ(x_i)^T θ)²,   (S.219)

where the last step follows from −1/(2σ²) being a (negative) scalar constant w.r.t. θ. Hence, the maximum likelihood solution is equivalent to the least-squares solution. This is mainly because we assumed Gaussian noise, which introduces the squared-error term.
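A short numerical illustration of the normal equations (S.216), using polynomial features and synthetic data of my own choosing:

```python
# Least-squares via the normal equations: theta = (Phi Phi^T)^{-1} Phi y,
# with Phi a D x N matrix whose columns are phi(x_i).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 50)
y = 2.0 - 3.0 * x + 0.5 * x**2 + rng.normal(scale=0.1, size=50)

Phi = np.stack([np.ones_like(x), x, x**2])          # D x N
theta = np.linalg.solve(Phi @ Phi.T, Phi @ y)
print(theta.round(2))                               # close to [2.0, -3.0, 0.5]
```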
………
Problem 3.10 Bayesian regression with Gaussian prior
(a) The posterior distribution of the parameters is obtained using Bayes' rule,

p(θ|y, X) = p(y|X, θ)p(θ) / ∫ p(y|X, θ)p(θ)dθ.   (S.220)

Here the denominator ensures that the posterior is properly normalized (integrates to 1). Note that the denominator is only a function of the data D since the parameter θ is integrated out. Hence, it suffices to find the form of θ in the numerator first, and then normalize that equation to obtain the distribution,

p(θ|y, X) ∝ p(y|X, θ)p(θ),   (S.221)

or equivalently

log p(θ|y, X) = log p(y|X, θ) + log p(θ) + const.   (S.222)

Substituting for the data likelihood and prior terms (and ignoring terms not involving θ),

log p(θ|y, X) = log N(y|Φ^T θ, Σ) + log N(θ|0, Γ) + const.   (S.223)
= −(1/2)∥y − Φ^T θ∥²_Σ − (1/2)θ^T Γ^{−1}θ + const.   (S.224)
= −(1/2)(−2θ^T ΦΣ^{−1}y + θ^T ΦΣ^{−1}Φ^T θ) − (1/2)θ^T Γ^{−1}θ + const.   (S.225)
= −(1/2)[θ^T (ΦΣ^{−1}Φ^T + Γ^{−1})θ − 2θ^T ΦΣ^{−1}y] + const.,   (S.226)

where we identify A = ΦΣ^{−1}Φ^T + Γ^{−1} and b = ΦΣ^{−1}y. Next, using the above A and b, we complete the square (see Problem 1.10),

log p(θ|y, X) = −(1/2)(θ − A^{−1}b)^T A(θ − A^{−1}b) + const.   (S.227)
= −(1/2)∥θ − μ̂_θ∥²_{Σ̂_θ} + const.,   (S.228)

where again constant terms are ignored, and we define

μ̂_θ = A^{−1}b = (ΦΣ^{−1}Φ^T + Γ^{−1})^{−1}ΦΣ^{−1}y,   (S.229)
Σ̂_θ = A^{−1} = (ΦΣ^{−1}Φ^T + Γ^{−1})^{−1}.   (S.230)

Finally, note that the log-posterior in (S.228) has the same form as a Gaussian in θ. Hence the posterior is Gaussian,

p(θ|y, X) = N(θ|μ̂_θ, Σ̂_θ).   (S.231)
(b) The MAP solution is the mean of the Gaussian (i.e., the θ with the largest posterior probability),

θ̂_MAP = argmax_θ N(θ|μ̂_θ, Σ̂_θ)   (S.232)
= μ̂_θ = (ΦΣ^{−1}Φ^T + Γ^{−1})^{−1}ΦΣ^{−1}y.   (S.233)
θ̂_MAP is similar to the weighted least-squares estimate, but has an additional term Γ^{−1}. When Γ^{−1} = 0, θ̂_MAP is the same as the weighted least-squares solution. For non-zero values of Γ, the term serves to regularize the covariance matrix ΦΣ^{−1}Φ^T, which might not be strictly positive definite or might be nearly singular. E.g., if Γ = I, then this would guarantee that the matrix inverse of (ΦΣ^{−1}Φ^T + Γ^{−1}) can always be performed.
(c) Substituting Γ = αI and Σ = σ²I into θ̂_MAP,

θ̂ = (Φ(1/σ²)Φ^T + (1/α)I)^{−1}Φ(1/σ²)y = (ΦΦ^T + (σ²/α)I)^{−1}Φy = (ΦΦ^T + λI)^{−1}Φy,   (S.234)

where λ = σ²/α ≥ 0.

To solve the regularized least-squares problem, first consider the objective function

R = ∥y − Φ^T θ∥² + λ∥θ∥² = (y − Φ^T θ)^T(y − Φ^T θ) + λθ^T θ   (S.235)
= y^T y − 2y^T Φ^T θ + θ^T ΦΦ^T θ + λθ^T θ   (S.236)
= y^T y − 2y^T Φ^T θ + θ^T (ΦΦ^T + λI)θ.   (S.237)

Taking the derivative and setting to 0,

∂R/∂θ = −2Φy + 2(ΦΦ^T + λI)θ = 0  ⇒  (ΦΦ^T + λI)θ = Φy   (S.238)
⇒ θ = (ΦΦ^T + λI)^{−1}Φy.   (S.239)
Hence the regularized least-squares estimate (aka ridge regression) is equivalent to the above MAP estimate for Bayesian regression with Gaussian noise and prior with isotropic covariances.
(d) Substituting Γ = αI and Σ = σ²I, the posterior mean and covariance are

μ̂_θ = (ΦΦ^T + (σ²/α)I)^{−1}Φy,   Σ̂_θ = ((1/σ²)ΦΦ^T + (1/α)I)^{−1}.   (S.240)

Here are the various cases of interest:

• Setting α → 0 corresponds to setting a very strong prior at θ = 0, since the covariance of the prior Γ = αI goes to 0. The term (1/α)I is a diagonal matrix with very large entries, and its inverse tends to the zero matrix. As a result, the posterior of θ is equivalent to the prior, μ̂_θ = 0 and Σ̂_θ = 0.
• Setting α → ∞ yields a very weak prior, since the covariance Γ is very large. As a result, the (1/α)I term vanishes, leaving an estimate equivalent to ordinary least squares, μ̂_θ = (ΦΦ^T)^{−1}Φy and Σ̂_θ = ((1/σ²)ΦΦ^T)^{−1}.
• When σ² → 0, there is no observation noise. As a result, the prior is ignored and the data term dominates the mean, μ̂_θ = (ΦΦ^T)^{−1}Φy, resulting in the ordinary least-squares estimate. The posterior covariance is Σ̂_θ = 0, since there is no uncertainty in the observations.
• When σ² → ∞, the observations are very noisy. As a result, the data term is ignored and the prior dominates, resulting in a posterior equivalent to the prior, μ̂_θ = 0 and Σ̂_θ = αI.
(e) Given a novel input x∗, we are interested in the function value f∗ conditioned on the data {y, X, x∗}. Note that the posterior θ|y, X is a Gaussian random variable. Hence, when conditioning on the data, f∗ = φ(x∗)^T θ is a linear transformation of a Gaussian random variable, which is also Gaussian. Using the properties in Problem 1.1, which give the mean and variance of the transformed Gaussian, we have

p(f∗|y, X, x∗) = N(f∗|μ̂∗, σ̂∗²),   μ̂∗ = φ(x∗)^T μ̂_θ,   σ̂∗² = φ(x∗)^T Σ̂_θ φ(x∗).   (S.241)

Finally, for the predicted y∗, we obtain the distribution by integrating over f∗,

p(y∗|y, X, x∗) = ∫ p(y∗|f∗)p(f∗|y, X, x∗)df∗ = ∫ N(y∗|f∗, σ²)N(f∗|μ̂∗, σ̂∗²)df∗   (S.242)
= ∫ N(f∗|y∗, σ²)N(f∗|μ̂∗, σ̂∗²)df∗ = N(y∗|μ̂∗, σ̂∗² + σ²),   (S.243)

where the last line uses Problem 1.9 to calculate the integral (correlation between Gaussian distributions).
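The sketch below computes the posterior (S.229)-(S.230) and the predictive moments of (e) for the isotropic case Γ = αI, Σ = σ²I; the data, features, and hyperparameter values are illustrative assumptions.

```python
# Bayesian linear regression posterior and predictive distribution (isotropic case).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 30)
y = 1.0 + 2.0 * x + rng.normal(scale=0.2, size=30)

Phi = np.stack([np.ones_like(x), x])                   # D x N feature matrix
sigma2, alpha = 0.2**2, 10.0

A = Phi @ Phi.T / sigma2 + np.eye(2) / alpha
Sigma_theta = np.linalg.inv(A)                         # posterior covariance (S.230)
mu_theta = Sigma_theta @ Phi @ y / sigma2              # posterior mean / MAP estimate (S.229)

phi_star = np.array([1.0, 0.5])                        # phi(x*) for a test input x* = 0.5
mu_star = phi_star @ mu_theta                          # predictive mean
var_star = phi_star @ Sigma_theta @ phi_star + sigma2  # predictive variance (S.243)
print(mu_theta.round(2), round(mu_star, 2), round(var_star, 3))
```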
………
Problem 3.12 L1-regularized least-squares (LASSO)

(a) The L1-regularized least-squares objective function is

E = (1/2)∥y − Φ^T θ∥² + λ Σ_{i=1}^D |θ_i|.   (S.244)

The first term (data term) is the squared error, as in ordinary least squares. The second term (regularization term) is the sum of absolute values of the parameters; in contrast, regularized least squares (ridge regression) uses the squared norm ∥θ∥². The objective function E is equivalent to a negative log-likelihood, so the likelihood takes the form

l ∝ e^{−E} ∝ e^{−∥y−Φ^Tθ∥² − λ Σ_{i=1}^D |θ_i|} = e^{−Σ_{i=1}^N (y_i−φ(x_i)^Tθ)² − λ Σ_{i=1}^D |θ_i|}   (S.245)
= [Π_{i=1}^N e^{−(y_i−φ(x_i)^Tθ)²}] [Π_{i=1}^D e^{−λ|θ_i|}].   (S.246)

The left term is the data likelihood, which is equivalent to a Gaussian (as in ordinary least squares). The right term is the prior on θ, and takes the form of a Laplacian on each parameter value. Hence, the probabilistic interpretation of L1-regularized least squares is to find the MAP estimate of the parameters θ under the model

y = f(x; θ) + ε,   ε ∼ N(0, σ²),   θ_i ∼ Laplace(λ) = (1/(2λ))e^{−|θ_i|/λ}.   (S.247)
(b) In the equivalent optimization problem, we rewrite θ_i as the difference between two non-negative values, θ_i = θ_i^+ − θ_i^−, where θ_i^+ ≥ 0 and θ_i^− ≥ 0. Note that when θ_i^+ = 0 and θ_i^− > 0, or vice versa (θ_i^+ > 0, θ_i^− = 0), the absolute value can be rewritten as |θ_i^+ − θ_i^−| = (θ_i^+ + θ_i^−). Finally, at the optimum one of these two conditions indeed holds, (θ_i^+ = 0, θ_i^− > 0) or (θ_i^+ > 0, θ_i^− = 0). If this were not the case, i.e., (θ_i^+ > 0, θ_i^− > 0), then the term (θ_i^+ + θ_i^−) could be further reduced by subtracting min(θ_i^+, θ_i^−) from both θ_i^+ and θ_i^−. This would reduce the regularization term (θ_i^+ + θ_i^−), but not affect the data term, since the data term only depends on the difference (θ_i^+ − θ_i^−).
(c) Let x = [θ^+; θ^−]. The objective function is

E = (1/2)∥y − Φ^T(θ^+ − θ^−)∥² + λ Σ_i (θ_i^+ + θ_i^−)   (S.248)
= (1/2)∥y − [Φ^T, −Φ^T]x∥² + λ1^T x   (S.249)
= (1/2)y^T y − y^T [Φ^T, −Φ^T]x + (1/2)x^T [Φ; −Φ][Φ^T, −Φ^T]x + λ1^T x   (S.250)
= (1/2)x^T [ ΦΦ^T   −ΦΦ^T
             −ΦΦ^T   ΦΦ^T ] x + (λ1 − [Φy; −Φy])^T x,   (S.251)

where the matrix in (S.251) is H, the vector is f, and constant terms that do not affect the minimization are dropped.
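A small sketch of this non-negative quadratic program is shown below, solved with a generic bounded optimizer; the cubic-polynomial data (generated from the "true" coefficients quoted in Figure 8), the value of λ, and the use of scipy.optimize.minimize are assumptions for the example.

```python
# LASSO as the non-negative QP in (S.251); the variable is x = [theta_plus; theta_minus].
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
xs = rng.uniform(-2, 2, 40)
y = -2.272 + 0.0 * xs - 1.198 * xs**2 - 0.228 * xs**3 + rng.normal(scale=1.0, size=40)

Phi = np.stack([xs**p for p in range(4)])            # D x N, cubic polynomial features
D, lam = Phi.shape[0], 5.0

H = np.block([[ Phi @ Phi.T, -Phi @ Phi.T],
              [-Phi @ Phi.T,  Phi @ Phi.T]])
f = lam * np.ones(2 * D) - np.concatenate([Phi @ y, -Phi @ y])

obj = lambda v: 0.5 * v @ H @ v + f @ v
res = minimize(obj, np.zeros(2 * D), bounds=[(0, None)] * (2 * D))
theta = res.x[:D] - res.x[D:]                        # theta = theta_plus - theta_minus
print(theta.round(3))                                # small coefficients are driven toward 0
```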
Figure 8 shows an example of cubic polynomial regression using least-squares, Bayesian regression, and LASSO.
Figure 8: Cubic polynomial regression using (left) least-squares, (middle) Bayesian regression, (right) LASSO. The true function (dashed line) has θ = [−2.272, 0.000, −1.198, −0.228]; the estimates are LS θ = [−1.464, −0.458, −3.537, 0.974], BR θ = [−1.256, 1.251, −2.378, 0.050], and LASSO θ = [−1.673, −0.000, −2.272, 0.296]. For Bayesian regression, the dotted lines show the 2 standard-deviation band around the mean. Note that LASSO can find that the linear term θ1 is 0.

………
Problem 4.12 Lagrange multipliers and equality constraints
(a) The Lagrangian is

L(π, λ) = Σ_{j=1}^K z_j log π_j + λ ( Σ_{j=1}^K π_j − 1 ).   (S.252)

Taking the derivatives and setting them to zero gives

∂L/∂λ = Σ_{j=1}^K π_j − 1 = 0  ⇒  Σ_{j=1}^K π_j = 1,   (S.253)
∂L/∂π_j = z_j/π_j + λ = 0  ⇒  z_j + π_j λ = 0.   (S.254)

Taking (S.254) and summing over j,

Σ_{j=1}^K (z_j + π_j λ) = 0  ⇒  λ Σ_{j=1}^K π_j = −Σ_{j=1}^K z_j  ⇒  λ = −Σ_{j=1}^K z_j,   (S.255)

since Σ_{k=1}^K π_k = 1. Finally, substituting λ into (S.254) (and changing the summation index from j to k to avoid confusion),

z_j − π_j Σ_{k=1}^K z_k = 0  ⇒  π_j = z_j / Σ_{k=1}^K z_k.   (S.256)
(b) The Lagrangian is

L(π, λ) = Σ_{j=1}^K π_j (z_j − log π_j) + λ ( Σ_{j=1}^K π_j − 1 ).   (S.257)

Taking the derivatives and setting them to zero gives

∂L/∂λ = Σ_{j=1}^K π_j − 1 = 0  ⇒  Σ_{j=1}^K π_j = 1,   (S.258)
∂L/∂π_j = z_j − log π_j − 1 + λ = 0  ⇒  π_j = e^{z_j − 1 + λ}  ⇒  π_j e^{−λ} = e^{z_j − 1}.   (S.259)

Summing over j and noting that Σ_{j=1}^K π_j = 1 gives

e^{−λ} = Σ_{j=1}^K e^{z_j − 1}.   (S.260)

Finally, substituting (S.260) back into (S.259),

π_j = e^{z_j − 1} / Σ_{k=1}^K e^{z_k − 1} = e^{z_j} / Σ_{k=1}^K e^{z_k}.   (S.261)
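Both closed-form solutions are easy to verify numerically against a generic constrained optimizer: the normalized counts of (S.256) and the softmax of (S.261). The z values below are arbitrary illustrative numbers.

```python
import numpy as np
from scipy.optimize import minimize

cons = [{"type": "eq", "fun": lambda p: p.sum() - 1}]   # sum_j pi_j = 1
bounds = [(1e-9, 1.0)] * 3

# Part (a): maximize sum_j z_j log(pi_j)  ->  pi_j = z_j / sum_k z_k  (S.256)
z_a = np.array([5.0, 2.0, 1.0])
res_a = minimize(lambda p: -(z_a * np.log(p)).sum(), np.ones(3) / 3,
                 bounds=bounds, constraints=cons)
print(res_a.x, z_a / z_a.sum())

# Part (b): maximize sum_j pi_j (z_j - log pi_j)  ->  softmax(z)  (S.261)
z_b = np.array([2.0, 0.5, -1.0])
res_b = minimize(lambda p: -(p * (z_b - np.log(p))).sum(), np.ones(3) / 3,
                 bounds=bounds, constraints=cons)
print(res_b.x, np.exp(z_b) / np.exp(z_b).sum())
```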
………
Problem 4.6 Mixture of exponentials
The derivation is similar to that of Problem 4.5 (mixture of Poissons); the only difference is that the observation likelihood is now an exponential density rather than a Poisson. In particular, the Q function is

Q(θ; θ̂) = E_{Z|X,θ̂}[log p(X, Z|θ)] = Σ_{i=1}^n Σ_{j=1}^K [ ẑ_ij log π_j + ẑ_ij log p(x_i|z_i = j) ],   (S.262)

where the soft assignments are

ẑ_ij = E_{Z|X,θ̂}[z_ij] = π_j p(x_i|z_i = j) / Σ_{k=1}^K π_k p(x_i|z_i = k) = p(z_i = j | x_i, θ̂).   (S.263)

For the M-step, the mixture weights are updated as before,

π̂_j = (1/n) Σ_{i=1}^n ẑ_ij = n̂_j / n,   (S.264)

where n̂_j = Σ_{i=1}^n ẑ_ij is the (soft) number of samples assigned to component j.

Finally, for the exponential parameter λ_j, we collect the terms of the Q function that depend on λ_j,

l_j = Σ_{i=1}^n ẑ_ij log p(x_i|z_i = j) = Σ_{i=1}^n ẑ_ij (log λ_j − λ_j x_i).   (S.265)

Taking the derivative and setting it to 0 gives

∂l_j/∂λ_j = Σ_{i=1}^n ẑ_ij (λ_j^{-1} − x_i) = 0  ⇒  λ_j^{-1} Σ_{i=1}^n ẑ_ij = Σ_{i=1}^n ẑ_ij x_i  ⇒  λ_j^{-1} = ( Σ_{i=1}^n ẑ_ij x_i ) / ( Σ_{i=1}^n ẑ_ij ).   (S.266)
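A compact sketch of the resulting EM algorithm for a mixture of exponentials, using the updates (S.263), (S.264), and (S.266); the synthetic data, initialization, and number of components are illustrative assumptions.

```python
import numpy as np

def em_mix_exponential(x, K, n_iter=200, seed=0):
    """EM for a K-component mixture of exponentials, p(x|z=j) = lam_j * exp(-lam_j * x)."""
    rng = np.random.default_rng(seed)
    n = x.size
    pi = np.full(K, 1.0 / K)
    lam = rng.uniform(0.5, 2.0, size=K)                # assumed initialization
    for _ in range(n_iter):
        # E-step: soft assignments z_ij (S.263)
        p = pi * lam * np.exp(-np.outer(x, lam))       # n x K array of pi_j p(x_i|z_i=j)
        z = p / p.sum(axis=1, keepdims=True)
        # M-step: weights (S.264) and rates (S.266)
        nj = z.sum(axis=0)                             # soft counts n_j
        pi = nj / n
        lam = nj / (z * x[:, None]).sum(axis=0)        # lam_j = sum_i z_ij / sum_i z_ij x_i
    return pi, lam

# Illustrative usage: data from two exponentials with rates 0.5 and 3.0.
rng = np.random.default_rng(1)
x = np.concatenate([rng.exponential(1 / 0.5, 300), rng.exponential(1 / 3.0, 700)])
print(em_mix_exponential(x, K=2))
```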
………
Problem 5.1 Bias and variance of the kernel density estimator
(a) To calculate the bias and variance of the kernel density estimator p̂(x) = (1/n) Σ_{i=1}^n k̃(x − x_i), we suppose that the samples X = {x_i}_{i=1}^n are distributed according to the true distribution p(x), i.e., x_i ∼ p(x), ∀i. The mean of the estimator p̂(x) is

E_X[p̂(x)] = E_X[ (1/n) Σ_{i=1}^n k̃(x − x_i) ] = (1/n) Σ_{i=1}^n E_{x_i}[ k̃(x − x_i) ]   (S.267)
           = (1/n) Σ_{i=1}^n ∫ p(x_i) k̃(x − x_i) dx_i = ∫ p(z) k̃(x − z) dz = p(x) ∗ k̃(x),   (S.268)
where (S.268) follows from each term in the sum being the same, and ∗ is the convolution operator. The mean of the estimator is the true distribution convolved with the kernel; in other words, it is a "blurred" or "smoothed" version of the true distribution.
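A minimal sketch of the estimator itself, which makes the smoothing interpretation easy to see; the Gaussian kernel, bandwidth, and data below are assumptions for illustration only.

```python
import numpy as np

def kde(x_eval, samples, h=0.5):
    """Kernel density estimate p_hat(x) = (1/n) sum_i k_tilde(x - x_i),
    using a Gaussian kernel k_tilde(u) = N(u | 0, h^2)."""
    diffs = x_eval[:, None] - samples[None, :]         # shape (len(x_eval), n)
    k = np.exp(-0.5 * (diffs / h) ** 2) / (h * np.sqrt(2 * np.pi))
    return k.mean(axis=1)

# Illustrative usage: samples from a two-component Gaussian mixture.
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(-2, 1, 250), rng.normal(2, 0.5, 250)])
xs = np.linspace(-5, 5, 201)
p_hat = kde(xs, samples, h=0.5)    # smaller h: higher variance; larger h: more smoothing
```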
(b) The variance of the estimator is

var_X(p̂(x)) = var( (1/n) Σ_{i=1}^n k̃(x − x_i) ) = (1/n²) var( Σ_{i=1}^n k̃(x − x_i) )   (S.269)
            = (1/n²) Σ_{i=1}^n var( k̃(x − x_i) ) = (1/n) var( k̃(x − x_i) ),   (S.270)

which follows from the {x_i} being independent, and hence the variance of the sum is the sum of the variances (see Problem 1.4), and also identically distributed. Noting that var(x) = E[x²] − E[x]², and thus var(x) ≤ E[x²], we can place an upper bound on the variance,

var_X(p̂(x)) ≤ (1/n) E[ k̃(x − z)² ] = (1/n) ∫ (1/h^d) k((x − z)/h) k̃(x − z) p(z) dz   (S.271)
            ≤ (1/(n h^d)) ( max_x k(x) ) ∫ k̃(x − z) p(z) dz = (1/(n h^d)) ( max_x k(x) ) E_X[p̂(x)],   (S.272)

where the last line follows from k((x − z)/h) ≤ max_x k(x), i.e., the kernel is upper-bounded by its maximum value.
Figure 9 plots the mean and variance of the KDE for different bandwidths.
Figure 9: Mean and variance of the kernel density estimator. Top row: the true density function (thin line), the estimator mean (thick line), and 2 standard deviations around the mean (dotted line) for different bandwidths h. Bottom row: examples of the estimate p̂(x) using 10 different sets of 50 samples drawn from the true density p(x). When the bandwidth is small (h = 0.5), the estimates are significantly different for each sample set (i.e., high variance), but the estimator mean is close to the true density. When the bandwidth is large (h = 2), the estimates are consistent with each other (i.e., low variance), but the estimator mean is far from the true density.
………
Problem 6.2 BDR for regression
(a) For a given x, the conditional risk function for the squared loss is

R(x) = E_{y|x}[ L(g(x), y) ] = ∫ p(y|x) (g(x) − y)² dy   (S.273)
     = ∫ p(y|x) (g(x)² − 2 y g(x) + y²) dy = g(x)² − 2 g(x) E[y|x] + E[y²|x].   (S.274)

We wish to minimize the risk w.r.t. the choice of g(x). Hence, taking the derivative w.r.t. g(x) and setting it to 0,

∂R(x)/∂g(x) = 2 g(x) − 2 E[y|x] = 0  ⇒  g*(x) = E[y|x].   (S.275)
(b) The Minkowski loss function is plotted in Figure 10. For q = 1 it corresponds to the absolute loss (L1-norm), while for q = 2 it is the squared loss (L2-norm). As q → 0, it approaches the 0-1 loss function, L_0 = 0 if g(x) = y, and L_0 = 1 otherwise.
[Figure 10 appears here, plotting L_q for q ∈ {0.01, 0.1, 0.2, 1, 2, 10}.]

Figure 10: Minkowski loss function L_q = |g(x) − y|^q for different values of q.
(c) The conditional risk function for q = 1 is

R(x) = ∫ p(y|x) |g − y| dy   (S.276)
     = ∫_{−∞}^{g} p(y|x) (g − y) dy + ∫_{g}^{∞} p(y|x) (y − g) dy   (S.277)
     = g E_−[1] − E_−[y] + E_+[y] − g E_+[1],   (S.278)

where we have defined the following partial expectation terms,

E_−[y] = ∫_{−∞}^{g} p(y|x) y dy,   E_+[y] = ∫_{g}^{∞} p(y|x) y dy,   (S.279)
E_−[1] = ∫_{−∞}^{g} p(y|x) dy,    E_+[1] = ∫_{g}^{∞} p(y|x) dy.   (S.280)
To minimize the risk we need to take the derivative w.r.t. g. However, g also appears in the limits of the integrals. Recall the fundamental theorem of calculus,

∂/∂x ∫_{a}^{x} f(t) dt = f(x)   and   ∂/∂x ∫_{x}^{a} f(t) dt = −f(x).   (S.281)

Hence, the derivatives of the partial expectations are

∂E_−[y]/∂g = p(g|x) g,   (S.282)
∂E_−[1]/∂g = p(g|x),   (S.283)
∂E_+[y]/∂g = −p(g|x) g,   (S.284)
∂E_+[1]/∂g = −p(g|x).   (S.285)

Finally, taking the derivative of R(x) and setting it to 0,

∂R(x)/∂g = E_−[1] + g p(g|x) − p(g|x) g − p(g|x) g − E_+[1] + g p(g|x) = E_−[1] − E_+[1] = 0.   (S.286)

Hence, at the minimum we have the condition on g that

E_−[1] = E_+[1]  ⇒  ∫_{−∞}^{g} p(y|x) dy = ∫_{g}^{∞} p(y|x) dy = 1/2,   (S.287)

since the density integrates to 1. The optimal g is the value that splits the posterior distribution p(y|x) in half, i.e., the median of p(y|x).
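A quick simulation illustrating both results: for a skewed conditional distribution, the empirical risk under the squared loss is minimized near the mean, and under the absolute loss near the median. The gamma distribution below is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.gamma(shape=2.0, scale=1.5, size=20_000)    # skewed samples of y|x (assumed)
g = np.linspace(0.5, 8.0, 300)                      # candidate predictions g(x)

risk_sq = ((g[:, None] - y[None, :]) ** 2).mean(axis=1)   # empirical squared-loss risk
risk_abs = np.abs(g[:, None] - y[None, :]).mean(axis=1)   # empirical absolute-loss risk

print("argmin squared loss:", g[risk_sq.argmin()], " mean:", y.mean())
print("argmin absolute loss:", g[risk_abs.argmin()], " median:", np.median(y))
```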
………
Problem 6.6 Gaussian classifier with common covariance
(a) The BDR for a classifier with the 0-1 loss function is

g(x) = argmax_j [ log p(x|j) + log p(j) ],   (S.288)

where the decision function for each class j is

g_j(x) = log p(x|j) + log p(j)   (S.289)
       = −(1/2) ∥x − μ_j∥²_Σ − (1/2) log|Σ| − (d/2) log 2π + log π_j   (S.290)
       ∝ −(1/2) ∥x − μ_j∥²_Σ + log π_j   (S.291)
       = −(1/2)(x − μ_j)^T Σ^{-1}(x − μ_j) + log π_j   (S.292)
       = −(1/2)( x^T Σ^{-1} x − 2 μ_j^T Σ^{-1} x + μ_j^T Σ^{-1} μ_j ) + log π_j   (S.293)
       ∝ μ_j^T Σ^{-1} x + ( −(1/2) μ_j^T Σ^{-1} μ_j + log π_j ) = w_j^T x + b_j,   (S.294)

with w_j = Σ^{-1} μ_j and b_j = −(1/2) μ_j^T Σ^{-1} μ_j + log π_j.
(b) The boundary between two classes i and j occurs where the two decision functions are equal,

g_i(x) = g_j(x)   (S.295)
w_i^T x + b_i = w_j^T x + b_j   (S.296)
⇒ (w_i − w_j)^T x + (b_i − b_j) = 0 = w^T x + b,   (S.297)

where

w = w_i − w_j = Σ^{-1}(μ_i − μ_j),   (S.298)
b = b_i − b_j = ( −(1/2) μ_i^T Σ^{-1} μ_i + log π_i ) − ( −(1/2) μ_j^T Σ^{-1} μ_j + log π_j )   (S.299)
  = −(1/2)(μ_i + μ_j)^T Σ^{-1}(μ_i − μ_j) + log(π_i/π_j).   (S.300)

The last line follows from the identity a^T C a − b^T C b = (a + b)^T C (a − b).
(c) To rewrite the hyperplane in the form w^T(x − x_0) = 0, we note that w^T x − w^T x_0 = 0, and hence we must find an x_0 such that

b = −w^T x_0   (S.301)
−(1/2)(μ_i + μ_j)^T Σ^{-1}(μ_i − μ_j) + log(π_i/π_j) = −(μ_i − μ_j)^T Σ^{-1} x_0   (S.302)
−(1/2)(μ_i + μ_j)^T Σ^{-1}(μ_i − μ_j) + ( ∥μ_i − μ_j∥²_Σ / ∥μ_i − μ_j∥²_Σ ) log(π_i/π_j) = −(μ_i − μ_j)^T Σ^{-1} x_0   (S.303)
−(μ_i − μ_j)^T Σ^{-1} [ (1/2)(μ_i + μ_j) − ( (μ_i − μ_j) / ∥μ_i − μ_j∥²_Σ ) log(π_i/π_j) ] = −(μ_i − μ_j)^T Σ^{-1} x_0.   (S.304)

Hence, by inspection,

x_0 = (1/2)(μ_i + μ_j) − ( (μ_i − μ_j) / ∥μ_i − μ_j∥²_Σ ) log(π_i/π_j),   (S.305)

where ∥μ_i − μ_j∥²_Σ = (μ_i − μ_j)^T Σ^{-1}(μ_i − μ_j).
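A small sketch computing the quantities w, b, and x_0 from (S.298), (S.300), and (S.305) for two Gaussians with a shared covariance; the means, covariance, and priors are made-up illustrative values.

```python
import numpy as np

def shared_cov_boundary(mu_i, mu_j, Sigma, pi_i, pi_j):
    """Return (w, b, x0) describing the linear boundary w^T x + b = 0 = w^T (x - x0)."""
    Sinv = np.linalg.inv(Sigma)
    diff = mu_i - mu_j
    w = Sinv @ diff                                                  # (S.298)
    b = -0.5 * (mu_i + mu_j) @ Sinv @ diff + np.log(pi_i / pi_j)     # (S.300)
    maha = diff @ Sinv @ diff                                        # ||mu_i - mu_j||^2_Sigma
    x0 = 0.5 * (mu_i + mu_j) - (diff / maha) * np.log(pi_i / pi_j)   # (S.305)
    return w, b, x0

# Illustrative values (assumed).
mu_i = np.array([1.0, 2.0]); mu_j = np.array([-1.0, 0.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
w, b, x0 = shared_cov_boundary(mu_i, mu_j, Sigma, pi_i=0.7, pi_j=0.3)
print(w, b, x0, w @ x0 + b)    # the last value is ~0, confirming b = -w^T x0
```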
………
Problem 7.4 PCA implementation using SVD
(a) X̄ is the matrix of mean-subtracted points,

X̄ = [x_1 − μ, ..., x_n − μ] = X − [μ, ..., μ] = X − μ 1^T   (S.306)
   = X − ( (1/n) Σ_{i=1}^n x_i ) 1^T = X − (1/n) X 1 1^T = X ( I − (1/n) 1 1^T ).   (S.307)

(b) The covariance can be written as the outer product of X̄,

Σ = (1/n) X̄ X̄^T.   (S.308)
Substituting the SVD of X̄ = U S V^T,

Σ = (1/n) U S V^T (U S V^T)^T = (1/n) U S V^T V S U^T = (1/n) U S² U^T,   (S.309)

since V^T V = I. Finally,

Σ = U ( (1/n) S² ) U^T.   (S.310)

Note that S² is a diagonal matrix and also U^T U = I. Hence U is a matrix of eigenvectors and (1/n) S² is a diagonal matrix of eigenvalues. In particular, the i-th eigen-pair is {u_i, s_i²/n}.
(c) The PCA algorithm using SVD is as follows (a short code sketch is given after the list):

1. Center the data: X̄ = X(I − (1/n) 1 1^T).
2. Calculate the SVD: X̄ = U S V^T.
3. Keep the top K singular values, s_1 ≥ s_2 ≥ ··· ≥ s_K.
4. Form the PCA basis matrix, Φ = [u_1, ..., u_K].
5. Project the data: z = Φ^T x.
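A minimal numpy sketch of the five steps above; the data dimensions and the choice of K are illustrative.

```python
import numpy as np

def pca_svd(X, K):
    """PCA via SVD. X is d x n with one point per column; returns (Phi, Z, mu)."""
    mu = X.mean(axis=1, keepdims=True)
    Xbar = X - mu                                         # step 1: center the data
    U, s, Vt = np.linalg.svd(Xbar, full_matrices=False)   # step 2: SVD (s sorted descending)
    Phi = U[:, :K]                                        # steps 3-4: top-K basis vectors
    Z = Phi.T @ Xbar                                      # step 5: project the (centered) data
    return Phi, Z, mu                                     # covariance eigenvalues are s**2 / n

# Illustrative usage.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 200))
Phi, Z, mu = pca_svd(X, K=2)
print(Phi.shape, Z.shape)      # (5, 2), (2, 200)
```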
………
Problem 7.6 Fisher's linear discriminant

(a) Fixing the denominator of the ratio J(w) yields the equivalent optimization problem,

w* = argmax_w w^T S_B w   s.t.  w^T S_W w = 1.   (S.311)

The Lagrangian is

L(w, λ) = w^T S_B w − λ (w^T S_W w − 1).   (S.312)

(b) Taking the derivative w.r.t. w and setting it to 0,

∂L/∂w = 2 S_B w − 2 λ S_W w = 0  ⇒  S_B w = λ S_W w.   (S.313)

This is a generalized eigenvalue problem.

(c) Assuming that S_W is invertible,

λ w = S_W^{-1} S_B w   (S.314)
    = S_W^{-1} (μ_1 − μ_2) (μ_1 − μ_2)^T w,   (S.315)

where (μ_1 − μ_2)^T w is a scalar. Note that the scale of w does not affect the ratio J(w) = (w^T S_B w)/(w^T S_W w); we only need the direction of w. Thus, we can drop the scalar terms λ and (μ_1 − μ_2)^T w, and the solution is

w* = S_W^{-1} (μ_1 − μ_2).   (S.316)
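A small sketch computing the Fisher direction (S.316) from two labelled point clouds, where S_W is built as the within-class scatter; the sample data are illustrative.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher's linear discriminant direction w* = S_W^{-1} (mu1 - mu2).

    X1, X2 are d x n_k matrices of points from each class (one point per column)."""
    mu1, mu2 = X1.mean(axis=1), X2.mean(axis=1)
    S_W = (X1 - mu1[:, None]) @ (X1 - mu1[:, None]).T \
        + (X2 - mu2[:, None]) @ (X2 - mu2[:, None]).T     # within-class scatter
    return np.linalg.solve(S_W, mu1 - mu2)

# Illustrative usage on two correlated Gaussian clouds (assumed data).
rng = np.random.default_rng(0)
A = np.array([[1.0, 0.8], [0.0, 0.6]])                    # shared covariance factor (assumed)
X1 = A @ rng.standard_normal((2, 100)) + np.array([[2.0], [1.0]])
X2 = A @ rng.standard_normal((2, 100)) + np.array([[-1.0], [0.0]])
w = fisher_direction(X1, X2)
print(w / np.linalg.norm(w))                              # only the direction matters
```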
………
Problem 8.6 Newton-Raphson Method
(a) The derivative f′(x^(i)) of f(x) at x^(i) is equal to the change in the function divided by the change in the input,

f′(x^(i)) = Δf / Δx.   (S.317)

To find the zero-crossing point, we substitute the current point (x^(i), f(x^(i))) and the desired zero-crossing point (x^(i+1), 0),

f′(x^(i)) = ( f(x^(i)) − 0 ) / ( x^(i) − x^(i+1) ).   (S.318)

Solving for x^(i+1) gives the iteration

x^(i+1) = x^(i) − f(x^(i)) / f′(x^(i)).   (S.319)

(b) Figure 11 shows the Newton-Raphson method on the polynomial f(x) = x³ − 2x + 2 when starting at x^(1) = 0.
[Figure 11 appears here, showing the first two iterations with the tangent lines at x^(1) = 0 and x^(2) = 1.]

Figure 11: Newton-Raphson method for the polynomial f(x) = x³ − 2x + 2 when starting at x^(1) = 0. In the first iteration, the next point is calculated as x^(2) = 1. In the second iteration, the next point is calculated as x^(3) = 0. Hence, the iteration repeats indefinitely and fails to converge.
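A short sketch of the iteration (S.319) on this polynomial, which reproduces the 0 → 1 → 0 cycle described in the caption; the iteration count is arbitrary.

```python
def newton_raphson(f, fprime, x0, n_iter=6):
    """Newton-Raphson root finding: x_{i+1} = x_i - f(x_i) / f'(x_i)."""
    xs = [x0]
    for _ in range(n_iter):
        x = xs[-1]
        xs.append(x - f(x) / fprime(x))
    return xs

f = lambda x: x**3 - 2*x + 2
fprime = lambda x: 3*x**2 - 2

print(newton_raphson(f, fprime, x0=0.0))   # [0.0, 1.0, 0.0, 1.0, ...] -- the iteration cycles
```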
(c) The condition for an optimum of g(x) is g′(x) = 0. Hence, we want to find the zero-crossing of f(x) = g′(x), yielding the iteration

x^(i+1) = x^(i) − g′(x^(i)) / g″(x^(i)).   (S.320)

The iteration is similar to gradient descent/ascent in that a scaled version of the gradient is added to the current point in each iteration. The main difference is that the scale is chosen adaptively using the second derivative g″(x^(i)).

(d) Let h(x) be the second-order Taylor approximation of g(x) around the point a = x^(i),

h(x) = g(a) + g′(a)(x − a) + (1/2) g″(a)(x − a)².   (S.321)
To find the optimum of h(x), we take the derivative and set it to 0,

∂h(x)/∂x = g′(a) + g″(a)(x − a) = 0   (S.322)
⇒  x − a = −g′(a)/g″(a)  ⇒  x* = a − g′(a)/g″(a).   (S.323)

This is equivalent to the Newton-Raphson iteration in (S.320). Hence, using the Newton-Raphson method for finding the optimum of a function g(x) can be interpreted as iteratively forming a quadratic Taylor approximation h(x) at the current point, and then moving to the optimum of the approximation h(x).
………
Problem 8.4 Regularized Logistic Regression: MAP framework
(a) The MAP objective function is

l = log p(y|X, w) + log p(w) = Σ_{i=1}^n log p(y_i|x_i, w) + log p(w)   (S.324)
  = Σ_{i=1}^n [ y_i log π_i + (1 − y_i) log(1 − π_i) ] − (1/2) w^T Γ w + const,   (S.325)

where π_i = σ(w^T x_i) and the constant collects the terms of the Gaussian log-prior that do not depend on w. The MAP solution is

w* = argmax_w l = argmin_w −l.   (S.326)

Dropping the constant terms, the MAP solution is equivalent to

w* = argmin_w −Σ_{i=1}^n [ y_i log π_i + (1 − y_i) log(1 − π_i) ] + (1/2) w^T Γ w   (S.327)
   = argmin_w Ê(w),   Ê(w) = E(w) + (1/2) w^T Γ w,   (S.328)

where E(w) = −Σ_{i=1}^n [ y_i log π_i + (1 − y_i) log(1 − π_i) ] denotes the unregularized cross-entropy error.
(b) First, let's look at the gradient of E(w),

∇E(w) = ∂/∂w ( −Σ_{i=1}^n [ y_i log π_i + (1 − y_i) log(1 − π_i) ] )   (S.329)
      = −Σ_{i=1}^n [ y_i ∂/∂w log π_i + (1 − y_i) ∂/∂w log(1 − π_i) ]   (S.330)
      = −Σ_{i=1}^n [ (y_i/π_i) ∂π_i/∂w − ((1 − y_i)/(1 − π_i)) ∂π_i/∂w ].   (S.331)

Using Problem 8.1a and the chain rule, we obtain the derivative of π_i = σ(w^T x_i) w.r.t. w,

∂π_i/∂w = σ(w^T x_i)(1 − σ(w^T x_i)) ∂(w^T x_i)/∂w = π_i (1 − π_i) x_i.   (S.332)
Substituting into (S.331),

∇E(w) = −Σ_{i=1}^n [ (y_i/π_i) π_i(1 − π_i) x_i − ((1 − y_i)/(1 − π_i)) π_i(1 − π_i) x_i ]   (S.333)
      = −Σ_{i=1}^n [ y_i(1 − π_i) x_i − (1 − y_i) π_i x_i ] = −Σ_{i=1}^n [ y_i x_i − y_i π_i x_i − π_i x_i + y_i π_i x_i ]   (S.334)
      = −Σ_{i=1}^n (y_i − π_i) x_i = Σ_{i=1}^n (π_i − y_i) x_i = X(π − y).   (S.335)

The gradient of the regularization term (1/2) w^T Γ w is Γ w. Hence,

∇Ê(w) = X(π − y) + Γ w.   (S.336)
(c) For the Hessian of Ê(w),

∇²Ê(w) = ∂/∂w ∂/∂w^T Ê(w) = ∂/∂w [ (π − y)^T X^T + w^T Γ ]   (S.337)
       = ∂/∂w [ π^T X^T − y^T X^T + w^T Γ ].   (S.338)

Next, we note that

∂/∂w w^T = ∂/∂w [w_1, ..., w_d] = I,   (S.339)
∂/∂w π^T = ∂/∂w [π_1, ..., π_n] = [ π_1(1 − π_1) x_1, ..., π_n(1 − π_n) x_n ] = X R,   (S.340)

where R = diag(π_1(1 − π_1), ..., π_n(1 − π_n)). Finally, substituting into (S.338),

∇²Ê(w) = X R X^T + Γ.   (S.341)
(d) The Newton-Raphson iteration is

w^(new) = w^(old) − [∇²Ê(w)]^{-1} ∇Ê(w)   (S.342)
        = w^(old) − (X R X^T + Γ)^{-1} ( X(π − y) + Γ w^(old) )   (S.343)
        = (X R X^T + Γ)^{-1}(X R X^T + Γ) w^(old) − (X R X^T + Γ)^{-1} ( X(π − y) + Γ w^(old) )   (S.344)
        = (X R X^T + Γ)^{-1} ( X R X^T w^(old) − X(π − y) )   (S.345)
        = (X R X^T + Γ)^{-1} X R ( X^T w^(old) − R^{-1}(π − y) )   (S.346)
        = (X R X^T + Γ)^{-1} X R z,   (S.347)

where z = X^T w^(old) − R^{-1}(π − y).
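A compact sketch of the resulting iteratively reweighted least squares (IRLS) update (S.342)-(S.347), implemented directly as w ← w − H^{-1}∇Ê(w) using (S.336) and (S.341); the synthetic data, prior precision Γ, and label convention y_i ∈ {0, 1} are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def map_logistic_irls(X, y, Gamma, n_iter=25):
    """Newton-Raphson (IRLS) for MAP logistic regression.

    X is d x n (one point per column; include a constant row for the bias),
    y has labels in {0, 1}, and Gamma is the d x d prior precision matrix."""
    w = np.zeros(X.shape[0])
    for _ in range(n_iter):
        pi = sigmoid(X.T @ w)                       # pi_i = sigma(w^T x_i)
        R = np.diag(pi * (1 - pi))                  # R = diag(pi_i (1 - pi_i))
        grad = X @ (pi - y) + Gamma @ w             # gradient (S.336)
        H = X @ R @ X.T + Gamma                     # Hessian (S.341)
        w = w - np.linalg.solve(H, grad)            # Newton step (S.342)
    return w

# Illustrative usage on synthetic data.
rng = np.random.default_rng(0)
n = 200
X = np.vstack([rng.standard_normal((2, n)), np.ones(n)])     # 2 features + bias row
y = (sigmoid(X.T @ np.array([1.5, -2.0, 0.5])) > rng.uniform(size=n)).astype(float)
print(map_logistic_irls(X, y, Gamma=np.diag([1.0, 1.0, 1e-3])))
```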
(e) Γ is added to the term X R X^T, which is then inverted. Hence, Γ helps to ensure that we avoid inverting a singular matrix.
(f) Substituting the two priors into the objective function Ê(w),

Ê_1(w) = E(w) + (1/2) w^T Γ_1 w = E(w) + (1/2)( λ ∥w̃∥² + b̃² ),   (S.348)
Ê_2(w) = E(w) + (1/2) w^T Γ_2 w = E(w) + (1/2) λ ∥w̃∥².   (S.349)

Hence, the prior Γ_1 applies a squared penalty on b̃, thus favoring smaller b̃. In contrast, the prior Γ_2 does not penalize b̃. In general, we should prefer Γ_2: the value of b̃ depends on the location of the data in the space, and data that is far from the origin will typically need a large b̃ to translate the separating hyperplane into the region where the data lies. It seems unreasonable to penalize such translations.
………
Problem 9.4 Soft-margin SVM (1-norm penalty)
(a) First, rewrite the soft-margin inequality constraints,

y_i(w^T x_i + b) ≥ 1 − ξ_i  ⇒  y_i(w^T x_i + b) − 1 + ξ_i ≥ 0.   (S.350)

Introducing Lagrange multipliers α_i ≥ 0 for the soft-margin inequality constraints, and r_i ≥ 0 for the non-negative slack constraints (ξ_i ≥ 0), we obtain the Lagrangian

L(w, b, ξ, α, r) = (1/2) ∥w∥² + C Σ_{i=1}^n ξ_i − Σ_{i=1}^n α_i ( y_i(w^T x_i + b) − 1 + ξ_i ) − Σ_{i=1}^n r_i ξ_i.   (S.351)
(b) Taking the derivatives w.r.t. {w, b, ξ} and setting them to zero yields the conditions for a minimum,

∂L/∂w = w − Σ_{i=1}^n α_i y_i x_i = 0  ⇒  w* = Σ_{i=1}^n α_i y_i x_i,   (S.352)
∂L/∂b = −Σ_{i=1}^n α_i y_i = 0  ⇒  Σ_{i=1}^n α_i y_i = 0,   (S.353)
∂L/∂ξ_i = C − α_i − r_i = 0  ⇒  r_i = C − α_i.   (S.354)
(c) Substituting (S.352) into L,

L = (1/2) ∥ Σ_{i=1}^n α_i y_i x_i ∥² + C Σ_{i=1}^n ξ_i − Σ_{i=1}^n α_i ( y_i ( (Σ_{j=1}^n α_j y_j x_j)^T x_i + b ) − 1 + ξ_i ) − Σ_{i=1}^n r_i ξ_i.   (S.355)

Next, expanding the norm term and collecting terms together,

L = (1/2)(Σ_{i=1}^n α_i y_i x_i)^T (Σ_{j=1}^n α_j y_j x_j) − Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_j^T x_i − b Σ_{i=1}^n α_i y_i + Σ_{i=1}^n α_i + Σ_{i=1}^n (C − α_i − r_i) ξ_i,   (S.356)

where the third term vanishes by (S.353), the last term vanishes by (S.354), and the first two terms combine to −(1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_i^T x_j.
Hence, the dual function is

L(α) = Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_i^T x_j.   (S.357)

(d) From (S.354), we have α_i + r_i = C. Since r_i ≥ 0 and α_i ≥ 0, the upper bound of α_i is C, attained when r_i = 0. Hence, we have the equivalent constraint 0 ≤ α_i ≤ C. Finally, maximizing the dual function subject to this constraint and (S.353) yields the SVM dual problem,

max_α Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_i^T x_j   (S.358)
s.t.  Σ_{i=1}^n α_i y_i = 0,   0 ≤ α_i ≤ C, ∀i.   (S.359)
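A small sketch that solves the dual (S.358)-(S.359) with a general-purpose constrained optimizer and recovers w* from (S.352); the toy data, the value of C, and the use of scipy's SLSQP solver are illustrative choices, not part of the original solution.

```python
import numpy as np
from scipy.optimize import minimize

def svm_dual(X, y, C=1.0):
    """Soft-margin SVM dual (S.358)-(S.359). X is d x n, y has labels in {-1, +1}."""
    n = X.shape[1]
    Q = (y[:, None] * y[None, :]) * (X.T @ X)          # Q_ij = y_i y_j x_i^T x_j
    obj = lambda a: 0.5 * a @ Q @ a - a.sum()          # negated dual (we minimize)
    grad = lambda a: Q @ a - np.ones(n)
    cons = [{"type": "eq", "fun": lambda a: a @ y, "jac": lambda a: y}]
    res = minimize(obj, np.zeros(n), jac=grad, method="SLSQP",
                   bounds=[(0.0, C)] * n, constraints=cons)
    alpha = res.x
    w = X @ (alpha * y)                                # w* = sum_i alpha_i y_i x_i  (S.352)
    sv = (alpha > 1e-6) & (alpha < C - 1e-6)           # assumes some points lie on the margin
    b = np.mean(y[sv] - X[:, sv].T @ w)                # from y_i (w^T x_i + b) = 1 on the margin
    return alpha, w, b

# Illustrative usage on two overlapping clouds (assumed data).
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(1.5, 1.0, (2, 40)), rng.normal(-1.5, 1.0, (2, 40))])
y = np.concatenate([np.ones(40), -np.ones(40)])
alpha, w, b = svm_dual(X, y, C=1.0)
print(w, b)
```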
(e) For the optimization problem and corresponding Lagrangian,

min_x f(x)  s.t.  g(x) ≥ 0,    L(x, λ) = f(x) − λ g(x),   (S.360)

the following KKT conditions are satisfied at a solution,

g(x) ≥ 0,   λ ≥ 0,   λ g(x) = 0.   (S.361)

The last condition implies that either:
• g(x) = 0 and λ > 0, i.e., the inequality constraint is active, or
• g(x) > 0 and λ = 0, i.e., the inequality constraint is inactive.
Since the SVM problem has several Lagrange multipliers (α_i, r_i), we consider the different combinations of active/inactive constraints:

• r_i > 0 (slack constraint active, ξ_i = 0) and α_i = 0 (margin constraint inactive, y_i(w^T x_i + b) > 1): the point x_i is correctly classified and beyond the margin.
• r_i > 0 (slack constraint active, ξ_i = 0) and α_i > 0 (margin constraint active, y_i(w^T x_i + b) = 1): the point x_i is correctly classified and on the margin.
• r_i = 0 (slack constraint inactive, ξ_i > 0) and α_i = C, using (S.354) (margin constraint active, y_i(w^T x_i + b) = 1 − ξ_i): since ξ_i > 0, the point x_i violates the margin (i.e., lies inside the margin).
………
Problem 10.12 Gaussian process regression – nonlinear Bayesian regression
(a) Substituting μ̂_θ into the predictive mean μ̂_*,

μ̂_* = φ_*^T μ̂_θ = φ_*^T (Γ^{-1} + Φ Σ^{-1} Φ^T)^{-1} Φ Σ^{-1} y   (S.362)
    = φ_*^T Γ Φ (Φ^T Γ Φ + Σ)^{-1} y   (S.363)
    = φ_*^T Γ Φ (Φ^T Γ Φ + σ² I)^{-1} y,   (S.364)

where we use the matrix inverse identity (1.33) in Problem 1.15, (P^{-1} + B^T R^{-1} B)^{-1} B^T R^{-1} = P B^T (B P B^T + R)^{-1}.
(b) Substituting the covariance Σ̂_θ into the predictive variance σ̂_*² gives

σ̂_*² = φ_*^T Σ̂_θ φ_* = φ_*^T (Γ^{-1} + Φ Σ^{-1} Φ^T)^{-1} φ_*   (S.365)
     = φ_*^T ( Γ − Γ Φ (Σ + Φ^T Γ Φ)^{-1} Φ^T Γ ) φ_*   (S.366)
     = φ_*^T Γ φ_* − φ_*^T Γ Φ (Φ^T Γ Φ + σ² I)^{-1} Φ^T Γ φ_*,   (S.367)

where we have used the matrix inverse identity (1.36) in Problem 1.15, (A^{-1} + U C^{-1} V^T)^{-1} = A − A U (C + V^T A U)^{-1} V^T A.
(c) Using the matrix inverse identities, the expressions for μ̂_* and σ̂_*² now depend on the data only through inner products of the form φ(x_i)^T Γ φ(x_j). Hence, we define the kernel function k(x_i, x_j) = φ(x_i)^T Γ φ(x_j), resulting in

μ̂_* = k_*^T (K + σ² I)^{-1} y,   (S.368)
σ̂_*² = k_** − k_*^T (K + σ² I)^{-1} k_*,   (S.369)

where K is the kernel matrix with entries [K]_{ij} = k(x_i, x_j), k_* = [k(x_*, x_1), ..., k(x_*, x_n)]^T is the test kernel vector, and k_** = k(x_*, x_*). The prior covariance Γ is embedded into the kernel function; one interpretation is that the prior covariance parameter is absorbed into the parameters of the kernel function.
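A minimal sketch of the resulting Gaussian process predictor (S.368)-(S.369) with an RBF kernel; the kernel bandwidth, noise level, and training data are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(A, B, alpha=1.0):
    """k(a, b) = exp(-alpha ||a - b||^2) for 1-D inputs A (m,) and B (n,)."""
    return np.exp(-alpha * (A[:, None] - B[None, :]) ** 2)

def gp_predict(x_train, y_train, x_test, sigma2=0.1, alpha=1.0):
    """GP regression predictive mean and variance, per (S.368)-(S.369)."""
    n = x_train.size
    K = rbf_kernel(x_train, x_train, alpha)
    Ks = rbf_kernel(x_test, x_train, alpha)                     # each row is k_* for one test point
    Kinv = np.linalg.solve(K + sigma2 * np.eye(n), np.eye(n))   # (K + sigma^2 I)^{-1}
    mu = Ks @ Kinv @ y_train                                    # predictive means (S.368)
    var = 1.0 - np.sum((Ks @ Kinv) * Ks, axis=1)                # (S.369); k_** = 1 for this kernel
    return mu, var

# Illustrative usage.
rng = np.random.default_rng(0)
x_train = rng.uniform(-2, 3, 20)
y_train = np.sin(2 * x_train) + 0.3 * rng.standard_normal(20)
mu, var = gp_predict(x_train, y_train, np.linspace(-3, 4, 7))
print(mu)
print(var)    # approaches k_** = 1 far from the training inputs, cf. part (e)
```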
(d) Note that z = (K + σ² I)^{-1} y is fixed for a given training set. We have

μ̂_* = k_*^T z = Σ_{i=1}^n z_i k(x_*, x_i).   (S.370)

In other words, z contains the coefficients on the kernel functions centered at each x_i. Substituting different kernel functions, we obtain the form of the posterior mean μ̂_* as a function of x_*:
In other words, the z contains coefficients on the kernel functions centered at each xi. Substi- tuting for different kernel functions, we can get the forms of the posterior mean μˆ∗ as a function of x∗,
• linear: k(xi,xj) = xTi xj. The posterior mean is n􏰉n􏰊
μˆ =􏰋z(xTx)=xT 􏰋zx , ∗i∗i∗ii
i=1 i=1
which is a linear function in x∗.
• polynomial: k(x_i, x_j) = (x_i^T x_j + 1)². The posterior mean is

μ̂_* = Σ_{i=1}^n z_i (x_*^T x_i + 1)² = Σ_{i=1}^n z_i ( (x_*^T x_i)² + 2 x_*^T x_i + 1 )   (S.372)
    = Σ_{i=1}^n z_i ( x_*^T x_i x_i^T x_* + 2 x_*^T x_i + 1 ) = x_*^T ( Σ_{i=1}^n z_i x_i x_i^T ) x_* + 2 x_*^T ( Σ_{i=1}^n z_i x_i ) + Σ_{i=1}^n z_i,   (S.373)

which is a quadratic function of x_*.
• RBF: k(x_i, x_j) = e^{−α ∥x_i − x_j∥²}. The posterior mean is

μ̂_* = Σ_{i=1}^n z_i e^{−α ∥x_* − x_i∥²}.   (S.374)

This looks like a kernel density estimate, except that there are no constraints on the weights z_i, so the function can be negative.
• periodic: k(x_i, x_j) = e^{−sin²((x_i − x_j)/2)}. The posterior mean is

μ̂_* = Σ_{i=1}^n z_i e^{−sin²((x_* − x_i)/2)} = Σ_{i=1}^n z_i Σ_{m=0}^∞ (1/m!) ( −sin²((x_* − x_i)/2) )^m   (S.375)
    = Σ_{i=1}^n z_i + Σ_{m=1}^∞ ((−1)^m / m!) Σ_{i=1}^n z_i sin^{2m}((x_* − x_i)/2),   (S.376)

where we have used the Taylor expansion of the exponential function e^x = Σ_{m=0}^∞ x^m / m!. Note that sin^{2m}(θ) can be rewritten as a linear combination of cosines with frequencies that are integer multiples of the base frequency,

sin^{2m}(θ) = a_{m,0} + Σ_{j=1}^m a_{m,j} cos(2jθ),   (S.377)

for some coefficients {a_{m,j}} (the actual formula is a little messy). Substituting into (S.376),

μ̂_* = Σ_{i=1}^n z_i + Σ_{m=1}^∞ ((−1)^m / m!) Σ_{i=1}^n z_i ( a_{m,0} + Σ_{j=1}^m a_{m,j} cos(j(x_* − x_i)) )   (S.378)
    = b_0 + Σ_{m=1}^∞ b_m Σ_{i=1}^n z_i cos(m(x_* − x_i)),   (S.379)

for some weights {b_m}. Note that a linear combination of cosines of the same frequency m but with different phase shifts x_i is itself a scaled cosine of frequency m. Hence, the mean function finally reduces to

μ̂_* = w_0 + Σ_{m=1}^∞ w_m cos(m(x_* − c_m)),   (S.380)

for some weights {w_m} and phase offsets {c_m}. In other words, the mean function is a weighted sum of cosines, each with a frequency that is an integer multiple of the base frequency of 1. Hence, the mean function is a periodic function with frequency 1. Here we have assumed that the base frequency is 1, but we could easily add a parameter θ to change it, as in sin²(θ(x_i − x_j)/2).
Figure 12 shows examples of the above four kernels.
(e) Since (K + σ² I) is a positive definite matrix, as is its inverse, the smallest value that k_*^T (K + σ² I)^{-1} k_* can take is 0. Hence the predictive variance is bounded by

0 ≤ σ̂_*² ≤ k_**.   (S.381)
[Figure 12 appears here, with panels for the linear, quadratic, RBF, and periodic kernels.]

Figure 12: GP functions learned using linear, quadratic, RBF, and periodic kernels. The dashed line is the ground-truth function. The thick line is the posterior mean, while the dotted lines show 2 standard deviations around the mean.
The predictive variance is largest when

k_*^T (K + σ² I)^{-1} k_* = 0  ⇒  σ̂_*² = k_**.   (S.382)

This occurs when the vector k_* = 0, i.e., k(x_*, x_i) = 0 for all training samples x_i. This means that x_* is different from all examples in the training set. Formally, ⟨φ(x_*), φ(x_i)⟩ = 0, and hence x_* is orthogonal to all x_i in the high-dimensional feature space.

The predictive variance is smallest when

k_** = k_*^T (K + σ² I)^{-1} k_*  ⇒  σ̂_*² = 0.   (S.383)

Consider the case when the test point x_* is the same as the first point x_1, and different from all other points x_i, i ≥ 2. Also assume the extreme case where k(x_*, x_1) = k(x_*, x_*) = k_** = α and k(x_*, x_i) = 0 for i ≥ 2; e.g., this could occur when using a Gaussian kernel with a very small bandwidth. The test kernel vector is

k_* = [k(x_*, x_1), k(x_*, x_2), ..., k(x_*, x_n)]^T = [α, 0, ..., 0]^T.   (S.384)

Also, the kernel matrix can be decomposed into blocks as

K = [ α  0^T ;  0  A ],   (S.385)

since x_1 = x_*. Then we have

(K + σ² I)^{-1} = [ α + σ²  0^T ;  0  A + σ² I ]^{-1} = [ 1/(α + σ²)  0^T ;  0  (A + σ² I)^{-1} ],   (S.386)

which uses the property that the inverse of a block-diagonal matrix is the block-diagonal matrix of the inverses. Substituting into the quadratic term,

k_*^T (K + σ² I)^{-1} k_* = [α, 0^T] [ 1/(α + σ²)  0^T ;  0  (A + σ² I)^{-1} ] [α, 0^T]^T = α² / (α + σ²).   (S.387)
Hence, when σ² = 0,

k_*^T (K + σ² I)^{-1} k_* = α = k_**  ⇒  σ̂_*² = 0.   (S.388)
In other words, the predictive variance will be 0 when the observation noise is 0, and the test point is the same as one training point and dissimilar to all other training points. This is the extreme case. In general, the predictive variance will be small when the observation noise is small and the test point is very similar to only one training point and very dissimilar to all others, according to the kernel function (i.e., the kernel function is very localized, or has small bandwidth).
………