EECS 189 Introduction to Machine Learning Fall 2020
This homework is due Wednesday, October 14 at 11:59 p.m.
HW6
The most important problem on this homework is technically optional if you have already done this and are completely confident in your setup. However, if you are not fully confident, it is worth doing this again to reduce stress ahead of the actual midterm. Also, be aware that if you experience any kind of technical difficulties and do not have a successfully fully completed practice on file, you will be given zero accommodations of any kind even if the disruption was entirely out of your control. If a student can’t be bothered to actually test out their exam setup, they are taking that risk entirely on their own head.
2 Exam Policy and Practice
Please read through the entirety of the EECS189 exam proctoring policy (click here) carefully before proceeding. This question is designed to familiarize you with some of the things you will have to do during the exam.
(a) After reading through the Exam Policy carefully, please answer the following questions.
(i) Given you experience no disruptions during the exam, how many total minutes do you have for scanning and submission? Does it matter if you are using a tablet or whether you are using paper to write your answers?
(ii) Are you required to record locally during the exam? How much space should you have available on your computer for a local recording?
(iii) How should you contact the course staff for an emergency situation during the exam?
Solution:
(i) You have a total of 20 minutes for scanning and submission if you experience no disruption and are using paper, and 10 minutes if you are using a tablet. If you experience x minutes of disruption during the exam, you may work for min(x, 15) minutes past the end of the exam.
(ii) You are not required to record locally; you may do a Zoom cloud recording. You should have 5 GB available on your computer if you are doing a local recording.
(iii) You should contact the course staff by making a private post on Piazza. However, you should not be looking at Piazza during the exam other than to make a private post in the case of an emergency.
(b) Please configure your Zoom link.
(i) Fill out the following Google Form (click here) to submit the Zoom link you will be using. You must use this Zoom link for this assignment as well as for the exams. This can easily be done by submitting your Personal Meeting Room link and setting your Personal Meeting ID as your default on all devices you will be using for the final.
(ii) Ensure that anyone can join your Zoom link and that there is no waiting room for your Zoom meeting. You can try to do this by entering the meeting on one device that is logged in to your official Berkeley Zoom account and then entering the meeting on another device that is logged into some other Zoom account. If you are able to do that, then your Zoom link is joinable. If you are not put into a waiting room, then your Zoom meeting will not have a waiting room. (You might want to have a member of your study group try this out with you if you don't already have two Zoom accounts.)
Solution: Ensure your Zoom link is joinable and that the Zoom link in the form is the correct link which you will be using for the exam.
(c) You will now conduct a Zoom recording. You should use this recording to work through a homework problem or other study material to simulate the actual circumstances of the final exam.
(i) Start the Zoom call for the link you provided above. Turn on your microphone and recording device (webcam, phone camera). Turn off your speaker. Share your entire desktop (not just a particular window).
(ii) Start recording via Zoom. You may record locally or on the cloud (see the exam policy document for details).
(iii) Hold your CalID next to your face and record yourself saying your name into the webcam. Both your face and your entire CalID should be visible in the video. We should be able to read your name and SID. This step should take at least 3 seconds. See figure 1. If you do not have a CalID for some reason, please hold up some document which has an image of you and proves your identity, such as a driver's license.
Figure 1: ID card demonstration. Do not actually black out your SID and name.
Figure 2: Demonstration of taking your exam. Your setup should look like this while you are taking the exam.
(iv) Turn your recording device (webcam, phone) around 360° slowly so that we can see your entire room clearly. There should be no uncovered screens anywhere in the room during your exam. Only admin TAs and instructors will be able to see your videos (both for this assignment and for the actual exams).
(v) Position your recording device in such a way that we can see your workspace and your hands. We don't care if we can see your face or not. If you are not using your phone to record your workspace, then it should be visible in the recording, face down. See figure 2. Think during this test run: on the actual exam, you will want to use the desktop to see the exam itself. If you are using your laptop's built-in camera, the angle might be bad. In this case, think about how you might position the laptop or consider using your phone or an external webcam on a stand to record your workspace. We want you to iron out these details ahead of the actual exam so that the exam itself has no additional stress due to proctoring logistics.
(vi) Your microphone should be on at all times. We should be able to see the time on your desktop at all times.
(vii) Record for a full two and a half hours. You should use this time to work through a homework problem or other study material for the course. The more closely it resembles actually taking an exam, the better practice it will be for you. (We also want to make sure that your computer can handle the video encoding task if you are doing a local recording.)
(viii) After two and a half hours, stop the recording. Check your recording to confirm that it contains your video as well as your desktop throughout its duration. Upload your video to Google drive and submit the link to the video using this Google Form (click here). You must make sure that the link sharing permissions are set so that we may view the video.
Solution: Ensure that you were able to submit your video to the Google form.
(d) A Midterm Google document with instructions for the midterm and the time at which you will start taking the exam should be shared with you a few days before the midterm. More details will be made available closer to the exam date.
Link for policy:
https://docs.google.com/document/d/14qO3xRt724j8Fs12i9d4E-pMWuyqk4h8b5gBbnd3hVo/edit
Form to submit Zoom link:
https://forms.gle/P7CDYVpaqLBV4zez5
Form to submit video link:
https://forms.gle/F5V1J89NKf21Rj7Y6
The first real problem in this homework is designed to exercise your understanding of probability, Gaussians, and coordinate changes. It is not a previous exam problem, but you have not had quite enough practice on this and these skills will be useful on the midterm exam this semester. This isn’t really a machine-learning question. Consider it a warm-up exercise.
3 Gaussian Quadrant Probability
Consider a two-dimensional Gaussian random variable
$$X = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} \sim \mathcal{N}(0, \Sigma),$$
where $\Sigma$ is a positive definite matrix,
$$\Sigma = \begin{bmatrix} a & c \\ c & b \end{bmatrix}.$$
In this problem we calculate the probability $P(X_1 > 0, X_2 > 0)$.
(a) Define an appropriate $\Sigma^{-\frac{1}{2}}$ and show that $Y = \Sigma^{-\frac{1}{2}} X \sim \mathcal{N}(0, I)$.
Solution: A linear transformation of a Gaussian is a Gaussian. We have
$$\mathbb{E}[Y] = \mathbb{E}[\Sigma^{-\frac{1}{2}} X] = \Sigma^{-\frac{1}{2}} \mathbb{E}[X] = 0,$$
$$\mathbb{E}[YY^\top] = \mathbb{E}\left[\Sigma^{-\frac{1}{2}} X X^\top \Sigma^{-\frac{1}{2}}\right] = \Sigma^{-\frac{1}{2}} \mathbb{E}[XX^\top] \Sigma^{-\frac{1}{2}} = \Sigma^{-\frac{1}{2}} \Sigma \Sigma^{-\frac{1}{2}} = I.$$
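As a quick numerical sanity check of this whitening step, here is a minimal sketch (assuming numpy is available; the particular values of a, b, c are arbitrary illustrative choices, not from the problem) that builds $\Sigma^{-\frac{1}{2}}$ from the eigendecomposition of $\Sigma$ and confirms that $Y = \Sigma^{-\frac{1}{2}} X$ has approximately identity covariance.

    import numpy as np

    rng = np.random.default_rng(0)
    a, b, c = 2.0, 1.0, 0.5                      # any positive definite choice (ab > c^2)
    Sigma = np.array([[a, c], [c, b]])

    # Sigma^{-1/2} via the eigendecomposition of the symmetric matrix Sigma
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    Sigma_inv_sqrt = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

    # Draw samples of X ~ N(0, Sigma) and whiten them
    X = rng.multivariate_normal(mean=np.zeros(2), cov=Sigma, size=200000)
    Y = X @ Sigma_inv_sqrt.T                     # each row is Sigma^{-1/2} x

    print(np.cov(Y.T))                           # should be close to the 2x2 identity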
(b) Let Q denote the region in the 2D plane given by
$$Q = \{(x_1, x_2) \mid x_1 > 0, x_2 > 0\}.$$
Note that $P(X_1 > 0, X_2 > 0) = P(X \in Q)$. Show that $P(X \in Q) = P(Y \in S)$ where S is a sector of the 2D plane centered around the origin with angle $\theta \in [0, \pi]$ as shown in the figure, with
$$\theta = \cos^{-1}\left( \frac{-c}{\sqrt{ab}} \right).$$
[Figure: the quadrant Q in the $(x_1, x_2)$ plane, and the sector S spanned by the vectors $u_1$ and $u_2$ with angle $\theta$ between them.]
(Hint 1: Compute the mapping of the points (1, 0), (0, 1) under the mapping $\Sigma^{-\frac{1}{2}}$ to get $u_1, u_2$.)
(Hint 2: Note that
$$\Sigma^{-1} = \frac{1}{|ab - c^2|} \begin{bmatrix} b & -c \\ -c & a \end{bmatrix}.$$
)
Solution: Using the hint we have
$$u_1 = \Sigma^{-\frac{1}{2}} \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \qquad u_2 = \Sigma^{-\frac{1}{2}} \begin{bmatrix} 0 \\ 1 \end{bmatrix}.$$
Note that the set Q can be expressed as
$$Q = \left\{ x \in \mathbb{R}^2 \;\middle|\; x = c_1 \begin{bmatrix} 1 \\ 0 \end{bmatrix} + c_2 \begin{bmatrix} 0 \\ 1 \end{bmatrix} \text{ for } c_1 > 0, c_2 > 0 \right\}.$$
Thus under the mapping $\Sigma^{-\frac{1}{2}}$, the set Q maps to the set S given by
$$S = \{ x \in \mathbb{R}^2 \mid x = c_1 u_1 + c_2 u_2 \text{ for } c_1 > 0, c_2 > 0 \}.$$
The angle θ between the vectors $u_1$ and $u_2$ can be computed by observing that
$$\langle u_1, u_2 \rangle = \|u_1\| \|u_2\| \cos(\theta),$$
which gives
$$\theta = \cos^{-1}\left( \frac{\langle u_1, u_2 \rangle}{\|u_1\| \|u_2\|} \right).$$
First note that
$$\langle u_1, u_2 \rangle = \begin{bmatrix} 1 & 0 \end{bmatrix} \Sigma^{-\frac{1}{2}} \Sigma^{-\frac{1}{2}} \begin{bmatrix} 0 \\ 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 \end{bmatrix} \Sigma^{-1} \begin{bmatrix} 0 \\ 1 \end{bmatrix} = \frac{-c}{|ab - c^2|}.$$
Similarly, by computing inner products,
$$\|u_1\| = \sqrt{\frac{b}{|ab - c^2|}}, \qquad \|u_2\| = \sqrt{\frac{a}{|ab - c^2|}}.$$
Putting everything together we have
$$\theta = \cos^{-1}\left( \frac{-c}{\sqrt{ab}} \right).$$
(c) Conclude that $P(Y \in S) = \frac{1}{2\pi} \cos^{-1}\left( \frac{-c}{\sqrt{ab}} \right)$.
(HINT: If you look at Y in polar coordinates, what is the probability distribution of the angle? Remember the distribution for $Y \sim \mathcal{N}(0, I)$.)
Solution: Since $Y \sim \mathcal{N}(0, I)$ it is radially symmetric around the origin and thus
$$P(Y \in S) = \frac{\theta}{2\pi} = \frac{1}{2\pi} \cos^{-1}\left( \frac{-c}{\sqrt{ab}} \right).$$
Thus we have shown that
$$P(X_1 > 0, X_2 > 0) = \frac{1}{2\pi} \cos^{-1}\left( \frac{-c}{\sqrt{ab}} \right).$$
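A minimal Monte Carlo sketch of this result (assuming numpy; the values of a, b, c are arbitrary illustrative choices) that compares the empirical quadrant probability with the closed form:

    import numpy as np

    rng = np.random.default_rng(0)
    a, b, c = 2.0, 1.0, 0.5                      # any valid covariance (ab > c^2)
    Sigma = np.array([[a, c], [c, b]])

    X = rng.multivariate_normal(np.zeros(2), Sigma, size=1000000)
    empirical = np.mean((X[:, 0] > 0) & (X[:, 1] > 0))
    closed_form = np.arccos(-c / np.sqrt(a * b)) / (2 * np.pi)

    print(empirical, closed_form)                # should agree to about 3 decimal places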
The second problem on this homework is designed to further exercise skills and concepts that you have seen in earlier homeworks this semester, but were not in scope in earlier semesters. This problem is designed to help make sure that you get more practice with these ideas ahead of the exam, since these ideas are definitely in scope.
4 Linear Regression in Overparameterized Regime: Part II
Recall from the previous homework how conducting linear regression in the overparameterized regime, where many features aliased each other on the training points, led to an effect similar to explicit ridge regularization.
We repeat the setup from that homework problem below, for your reference.
[Begin setup from previous homework]
We are trying to learn a one-dimensional function f(x) for x ∈ [0,1]. For mathematical ease,
consider a standard complex Fourier featurization: $\phi_k^u(x) = e^{j 2\pi k x}$ for $k \in \{0, 1, \ldots, d-1\}$. Notice that this featurization is orthonormal relative to a uniform test distribution on [0, 1].
Assume that the number of features d = (M + 1)n for a positive integer M. We have access to observations y ∈ Cn where
$$y = f\left( \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_{n-1} \end{bmatrix} \right) + \varepsilon \tag{1}$$
where the scalar complex function f(x) is applied element-wise to the vector of training sampling locations $[x_0, x_1, \ldots, x_{n-1}]^\top$, and ε refers to the noise in the training data; we assume $\varepsilon \sim \mathcal{CN}(0, \sigma^2 I_n)$. We further assumed that $f(x) = \phi_t^u(x)$.
Instead of performing linear regression using this featurization we first scale the features to obtain scaled features,
$$\phi_k(x) = c_k \phi_k^u(x)$$
where $c_k \in \mathbb{R}$ for $k = 0, 1, \ldots, d-1$. Let $\Phi_t$ denote the column vector with entries $\phi_t(x_0), \phi_t(x_1), \ldots, \phi_t(x_{n-1})$, and similarly $\Phi_t^u$ the unscaled version of the same thing. To do learning, we solve the minimum-norm interpolation problem using the scaled features:
$$\hat{\beta} = \arg\min_{\beta} \|\beta\|_2 \quad \text{s.t.} \quad y = \sum_{k=0}^{d-1} \beta_k \Phi_k.$$
Having found βˆ we convert it into an equivalent set of coefficients αˆ to obtain,
$$y = \sum_{k=0}^{d-1} \hat{\alpha}_k \Phi_k^u,$$
where $\hat{\alpha}_k = c_k \hat{\beta}_k$.
The test data $x_{\text{test}}, y_{\text{test}}$ is drawn from the distribution p and is noiseless. Recall that the features $\{\phi_k^u\}_{k=0,\ldots,d-1}$ are orthonormal with respect to the test distribution. We make a prediction at test time using the learned coefficients,
$$y_{\text{pred}} = \sum_{k=0}^{d-1} \hat{\beta}_k \phi_k(x_{\text{test}}) = \sum_{k=0}^{d-1} \hat{\alpha}_k \phi_k^u(x_{\text{test}}).$$
[End setup from previous homework]
Last time, we showed that the expected squared test error
$$E_{\text{pred}} = \mathbb{E}\left[ \left( f(x) - \Phi(x)\hat{\beta} \right)^2 \right] = \left( 1 - \frac{c_t^2}{V(t)} \right)^2 + \frac{1}{V^2(t)} \sum_{k \in A(t)} c_k^4 + \frac{\sigma^2}{n} \sum_{q=0}^{n-1} \frac{1}{V^2(q)} \sum_{k \in R(q)} c_k^4,$$
where $f(x) = \phi_t^u(x)$ is the true function that is a pure unscaled feature with $t < n$, $A(t) = \{n+t, 2n+t, \ldots\}$ is the aliasing set of all feature indices that alias $\phi_t^u$, and $R(t) = A(t) \cup \{t\}$ is the set of all relevant feature indices that alias $\phi_t^u$, as well as t itself. Here, the notation $V(q) = \sum_{k \in R(q)} c_k^2$ was essentially a measure of the total energy in all the scaled features in the group R(q).
Now, we will consider particular choices of the scalings ci intended to have a regularizing effect. Last time, we looked at the strict “bi-level” model, where the first s features were given identically large scalings, and the remaining d − s features were given comparatively small scalings, also identical. We saw that the effects of the noise present in our training data were distributed over the remaining d − s features in a manner that did not significantly affect our performance on the test data, without significantly affecting the learned weight on the tth (true) feature.
We will now explore what happens when we perturb the strict Bi-Level scalings to no longer be strictly identical. Recall from the previous homework that, as long as the true function was some linear combination of the first s features, we were able to treat our true function as though it were a single feature $\phi_t^u(x)$, because we could always rotate our orthonormal featurization to make that the case, without affecting the feature scaling. This was because the first s features all had the same scaling, so we could rotate them to produce s new features with that same scaling — and then in spirit apply the identical rotation to each group of aliases of these s features. Since minimum-norm solutions would essentially be unaffected by these kinds of unitary transformations, the entire story would still go through and the expressions for the expected squared-loss on test points would be unchanged.
However, now that we are going to be perturbing the scalings of each feature independently, we can no longer rely on this “trick” to simplify our setup. Instead, we need to consider the general case, when our true function is of the form
$$y = f(x) = \sum_{k=0}^{t-1} \alpha_k \phi_k^u(x), \tag{2}$$
where the αk are bounded complex coefficients such that |αk| ≤ 1. We also give ourselves the freedom to choose t ≤ s, so that all the features in the true function are in the “favored” group of features, as per the original bi-level model.
(a) Suppose we had noise-free observations of the form
$$y = \sum_{k=0}^{n-1} w_k \phi_k^u(x),$$
on the same regularly spaced sampling points as the previous homework problem.
Show that the min-norm interpolating solution yields
$$\hat{\alpha}_k = \frac{c_k^2}{V(q)} w_q \quad \text{for } k \in R(q).$$
(Hint: this is a simple consequence of the work that you already did in the previous homework. You don’t have to redo that work. What is the property of the min-norm interpolating solution that we are using here?)
Solution: This follows from part (a) from the previous homework and the linearity property of the min-norm solution.
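As a sanity check of this aliasing formula, here is a minimal numerical sketch (assuming numpy; n, M and the random scalings are arbitrary illustrative choices, not values from the homework) that solves the minimum-norm interpolation problem directly on a regular grid and compares the resulting $\hat{\alpha}$ with $c_k^2 w_q / V(q)$.

    import numpy as np

    rng = np.random.default_rng(0)
    n, M = 8, 3                                   # small sizes so the check is fast
    d = (M + 1) * n
    x = np.arange(n) / n                          # regularly spaced training points
    k = np.arange(d)

    c = rng.uniform(0.2, 2.0, size=d)             # arbitrary positive feature scalings
    Phi_u = np.exp(2j * np.pi * x[:, None] * k[None, :])   # unscaled features at train points
    Phi = Phi_u * c[None, :]                      # scaled features

    w = rng.standard_normal(n) + 1j * rng.standard_normal(n)   # noise-free "true" weights
    y = Phi_u[:, :n] @ w                          # y = sum_{q < n} w_q phi_q^u(x)

    beta_hat = np.linalg.pinv(Phi) @ y            # minimum-norm interpolating solution
    alpha_hat = c * beta_hat

    # closed form: alpha_k = c_k^2 w_q / V(q) with q = k mod n
    V = np.array([np.sum(c[q::n] ** 2) for q in range(n)])
    alpha_formula = c ** 2 * w[k % n] / V[k % n]

    print(np.max(np.abs(alpha_hat - alpha_formula)))   # should be numerically negligible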
(b) Now, let us assume that the true noiseless function is given by (2) and we have noisy observa- tions of it as in (1).
Find the expected test error Epred as a function of the {αk}, the {ck}, and σ, taking an expectation over both the test distribution for x and the training noise ε. This calculation will be fairly similar to what you did in the previous homework, and you can verify your result by checking that it matches your previous result when all the αk = 0, except for αt = 1.
(HINT: Remember to invoke the same "frequency-domain" change-of-coordinates on the noise as you did last time — transform the training noise into the i.i.d. $\delta_i$ that were each $\mathcal{CN}(0, \frac{\sigma^2}{n})$. This will let you decompose the problem across different groups R(q) and deal with each separately by leveraging the orthonormality of the features relative to the test distribution.)
Solution: We will first write the minimum-norm solution as a function of the αi and the noise
δi. We have that
$$\hat{\alpha}_k = \frac{c_k^2}{V(q)} (\alpha_q + \delta_q) \quad \text{for } k \in R(q),$$
defining $\alpha_q = 0$ for $q \geq t$ for convenience.
Then, we compute the expected test error to be
$$E_{\text{pred}} = \mathbb{E}\left[ \left| \sum_{k=0}^{t-1} \alpha_k \phi_k^u(x) - \sum_{k=0}^{d-1} \hat{\alpha}_k \phi_k^u(x) \right|^2 \right]$$
$$= \mathbb{E}\left[ \left| \sum_{k=0}^{t-1} \alpha_k \left( 1 - \frac{c_k^2}{V(k)} \right) \phi_k^u(x) - \sum_{q=0}^{n-1} \delta_q \sum_{k \in R(q)} \frac{c_k^2}{V(q)} \phi_k^u(x) - \sum_{q=0}^{t-1} \alpha_q \sum_{k \in A(q)} \frac{c_k^2}{V(q)} \phi_k^u(x) \right|^2 \right]$$
$$= \sum_{k=0}^{t-1} |\alpha_k|^2 \left( 1 - \frac{c_k^2}{V(k)} \right)^2 + \frac{\sigma^2}{n} \sum_{q=0}^{n-1} \frac{1}{V^2(q)} \sum_{k \in R(q)} c_k^4 + \sum_{q=0}^{t-1} \frac{|\alpha_q|^2}{V^2(q)} \sum_{k \in A(q)} c_k^4.$$
Notice the different terms here. The first one is the counterpart of the survival-related terms since it is impacting the true signal. The second is all the contamination that is due to the noise in the training data. And the third is the external contamination that is due to the learning algorithm itself putting weight on the aliases instead of the true pattern.
(c) Now we will look at how these error terms respond to changes in the feature scalings. There are essentially four different windows of interest that one could think about: the favored true features from 0 to t − 1, the favored features that happen not to be true from t to s − 1, the unfavored features from s to n − 1, and then the tail of aliases from n to d.
In the strict bi-level model, recall that all cq for q ≥ n are set equal to some fixed quantity: denote it as c for this part. Let’s leave that alone.
For some q ∈ [s, n), imagine that we increase or decrease $c_q$. In each case, what happens to $E_{\text{pred}}$? Show that it monotonically increases as $c_q$ increases above c and monotonically increases as $c_q$ decreases below c, with a unique global optimum at $c_q = c$.
(HINT: Does this affect the survival of the true pattern? Does this affect any contamination that might be coming from the true pattern itself? Does this affect the contamination that is coming from noise in the training data? Once you have identified the relevant term that is impacted, take a derivative to see what is going on.)
Solution: We see that varying cq will not affect the survival component of the error, but will only affect the contamination due to training noise. Specifically, we only need to consider the change in the expression
$$\frac{1}{V^2(q)} \sum_{k \in R(q)} c_k^4.$$
By definition, we can substitute for V(q) to obtain
$$\frac{\sum_{k \in R(q)} c_k^4}{\left( \sum_{k \in R(q)} c_k^2 \right)^2}.$$
Now applying the known scalings of the other ck from the bi-level model, we can rewrite the
above as
$$\frac{c_q^4 + M c^4}{(c_q^2 + M c^2)^2}.$$
Taking the derivative with respect to cq, we get
$$\frac{4c_q^3 (c_q^2 + Mc^2)^2 - 4c_q (c_q^4 + Mc^4)(c_q^2 + Mc^2)}{(c_q^2 + Mc^2)^4} = \frac{4c_q^3 (c_q^2 + Mc^2) - 4c_q (c_q^4 + Mc^4)}{(c_q^2 + Mc^2)^3} = \frac{4c_q^3 Mc^2 - 4c_q Mc^4}{(c_q^2 + Mc^2)^3} = \frac{4c_q Mc^2}{(c_q^2 + Mc^2)^3} (c_q^2 - c^2).$$
Notice that this expression is a positive number times (c2q − c2). So the derivative is only zero at cq = c and is a positive number if cq > c — meaning the cost monotonically increases as cq gets bigger than c — and is a negative number if cq < c — meaning that the cost monotonically increases as cq gets smaller than c. The minimum is only at cq = c with monotonic increases on both sides.
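A quick numerical illustration of this monotonicity (numpy assumed; M and c are arbitrary illustrative values): the group's noise factor is minimized exactly at $c_q = c$.

    import numpy as np

    # noise-contamination factor for the group R(q): (c_q^4 + M c^4) / (c_q^2 + M c^2)^2
    M, c = 10, 1.0                                 # illustrative values
    cq = np.linspace(0.01, 3.0, 1000)
    factor = (cq ** 4 + M * c ** 4) / (cq ** 2 + M * c ** 2) ** 2

    print(cq[np.argmin(factor)])                   # the minimizer should be (approximately) c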
(d) Now that we understand the qualitative effects of these perturbations, we will show that we do not need exactly the bi-level model to obtain the desired regularization effects. One key property of the bi-level scalings is that they “favor” the first s features, including the t that are involved in the true functions, so the learned weights associated with those features do not get too thinly distributed over the M aliases. Another key property is that the training noise, having independent and identically distributed components in all n aliasing groups, is “spread” and thus diluted over the many aliases in the trailing d − s features, since their scalings are all low and roughly the same.
In this part, we will examine this second property. From the previous part, you should have seen that decreasing cq increases the test error, due to noise contamination. One way to visual- ize this is to imagine that the noise is now spread less evenly over the unfavored features, since the qth feature is absorbing less of it.
Let's see what happens in the extreme case. Start with the bi-level model with the first s features scaled so that
$$c_k^2 = \frac{\gamma d}{s},$$
and the remaining d − s features scaled so that
$$c_k^2 = \frac{(1-\gamma)d}{d-s}.$$
We calculated the test error for this model in the previous homework. What does the test error become if all $c_s, \ldots, c_{n-1}$ are reduced all the way to zero? As in the previous homework, assume that $d \gg n$, and let $\lambda = s(1-\gamma)/(n\gamma)$ for convenience.
(Hint: reuse your results from the previous homework)
Solution: Intuitively, when the n − s unfavored features have their scaling zeroed out, the weight that they had will shift to their aliases. After all, the other d − n features above n still have a presence, and since there is still at least one feature in each aliasing group, we will still be able to handle noise, just slightly less well.
More formally, observe that varying the $c_q$ for q ∈ [s, n) only affects the $CN_\varepsilon^2$ component of the error, as we showed in an earlier part. Thus, we can reuse our results from the previous homework for the other components of the error.
For $CN_\varepsilon^2$, we can write it as
$$\frac{\sigma^2}{n} \sum_{q=0}^{n-1} \frac{1}{V^2(q)} \sum_{k \in R(q)} c_k^4 = \frac{\sigma^2}{n} \sum_{q=0}^{s-1} \frac{1}{V^2(q)} \sum_{k \in R(q)} c_k^4 + \frac{\sigma^2}{n} \sum_{q=s}^{n-1} \frac{1}{V^2(q)} \sum_{k \in R(q)} c_k^4$$
$$= \frac{\sigma^2}{n} \sum_{q=0}^{s-1} \frac{\left( \frac{\gamma d}{s} \right)^2 + M \left( \frac{(1-\gamma)d}{d-s} \right)^2}{\left( \frac{\gamma d}{s} + M \frac{(1-\gamma)d}{d-s} \right)^2} + \frac{\sigma^2}{n} \sum_{q=s}^{n-1} \frac{M \left( \frac{(1-\gamma)d}{d-s} \right)^2}{\left( M \frac{(1-\gamma)d}{d-s} \right)^2}$$
$$\approx \frac{\sigma^2}{n} \sum_{q=0}^{s-1} \frac{\left( \frac{\gamma d}{s} \right)^2 + \frac{d}{n}(1-\gamma)^2}{\left( \frac{\gamma d}{s} + \frac{d}{n}(1-\gamma) \right)^2} + \frac{\sigma^2}{n} \sum_{q=s}^{n-1} \frac{n}{d}$$
$$\approx \frac{\sigma^2 s}{n} \cdot \frac{1 + \frac{n}{d}\lambda^2}{(1+\lambda)^2} + \frac{\sigma^2 (n-s)}{d}.$$
Notice that this is exactly the same as what we got in the previous homework, before we zeroed out some of the scalings. This matches with our intuitive prediction, since as d ≫ n, removing just n − s unfavored features will not significantly impact our ability to absorb noise, since we’ll still have ≈ d unfavored features remaining.
Solution: The previous part showed that zeroing out all of these scalings does not change the test error in any essential way. An earlier part showed that reducing each of these $c_k$ below c increases the cost monotonically, so long as the scalings for the features past the first n are kept the same. Thus, partially decreasing these $c_k$ cannot push the test error above what it would be if all of them were set to zero, which is essentially the original error. Therefore, it is clear that decreasing any of these specific $c_k$ from their initial value will not qualitatively impact the test error.
This helps us see that the result in the bi-level scaling is robust to small changes in the scalings for this intermediate regime. Essentially, as long as this intermediate regime has low-enough scalings, we are going to get to enjoy nice effects from this kind of learning approach.
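The claim can also be checked numerically. The sketch below (numpy assumed; all sizes and parameters are arbitrary illustrative choices) evaluates the noise term $CN_\varepsilon^2$ for the bi-level scalings, for the version with $c_s, \ldots, c_{n-1}$ zeroed out, and for the closed-form approximation; the three values should nearly coincide.

    import numpy as np

    # Bi-level squared scalings: first s features get gamma*d/s, the rest (1-gamma)*d/(d-s).
    n, M, s, gamma, sigma2 = 50, 40, 5, 0.9, 1.0   # illustrative values
    d = (M + 1) * n
    c2 = np.full(d, (1 - gamma) * d / (d - s))
    c2[:s] = gamma * d / s

    def noise_error(c2):
        # sigma^2/n * sum_q ( sum_{k in R(q)} c_k^4 ) / ( sum_{k in R(q)} c_k^2 )^2
        total = 0.0
        for q in range(n):
            group = c2[q::n]
            total += np.sum(group ** 2) / np.sum(group) ** 2
        return sigma2 / n * total

    c2_zeroed = c2.copy()
    c2_zeroed[s:n] = 0.0                           # zero out the unfavored scalings below n

    lam = s * (1 - gamma) / (n * gamma)
    closed_form = sigma2 * s / n * (1 + n / d * lam ** 2) / (1 + lam) ** 2 + sigma2 * (n - s) / d

    print(noise_error(c2), noise_error(c2_zeroed), closed_form)   # all three should be close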
The third problem continues to exercise skills and concepts that are being introduced this semester, since there is no other way that you can practice these ideas ahead of the exam.
5 Junk features can have regularizing effect
We have seen previously that increasing the number of parameters of a model (e.g. the number of features that we use for linear regression) can lead to increased sensitivity to noise (recall the problem "Learning 1d function" from HW1), and one needs to introduce regularization to cope with it. On the other hand, in HW4 we observed that kernel methods can sometimes learn a noisy function with a very small amount of regularization, despite the number of implicit features being infinite! The reason is that it is not only the number of features that matters, but also how those features are weighted.
In the previous homework, you saw a simple 1d function model involving Fourier features and regularly-spaced training data. You saw how extra features can have a regularizing effect in what we called the bi-level scaling model, where the true pattern was a low-frequency complex exponential chosen from one of s favored features that had a high scaling, and there were lots of high-frequency complex exponentials around that had lower scalings but acted as perfect aliases for their low-frequency counterparts.
The previous problem in this homework gave you a hint that the dependence on the details of the bi-level scaling is not so strict for the essential results to continue holding. However, you might believe that the complex exponentials are a very peculiar set of features and that perfectly regularly spaced samples are a very special case for the sampling matrix. This problem is designed to help you see how a similar story about the regularizing effect of extra features holds more generally as well as how we need to have sufficiently loud features where the actual pattern resides for us to be able to avoid over-regularization.
In machine-learning, the “signal-processing” intuition provided by Fourier features often carries over to Gaussian features at the expense of some technical difficulties. Different frequencies corre- spond to independent Gaussian features, and the scalings of the Fourier features correspond to the standard deviations on those Gaussians. Because all jointly Gaussian random variables are just a unitary transformation away from being independent scaled Gaussians, the story is thus completely generic.
Consequently, in this problem we are going to explore how the weights of the features influence our prediction error in a simple spiked covariance model: imagine that our data comes from a multivariate Gaussian distribution with covariance matrix
$$\Sigma = \operatorname{diag}(\underbrace{\lambda_L, \ldots, \lambda_L}_{k \text{ times}}, \underbrace{\lambda_S, \ldots, \lambda_S}_{p \text{ times}}),$$
where $k \ll n \ll p$, $\lambda_S \ll \lambda_L$ (L stands for "large" and S for "small"), and n is the number of data points. In words, the dimension of our data is much larger than the number of samples that we have. However, most of the signal comes from the subspace of dimension k, which is much smaller than n. Notice that k here is playing the role of s in the bi-level weighting for Fourier features and p here is essentially playing the role of d = Mn in the bi-level weighting for Fourier features.
Notation: due to the structure of the covariance matrix above, we will often separate the coordinates that correspond to the spiked part (the first k coordinates) from the coordinates that correspond to the non-spiked part. When we do that we use "programming notation" from Python to denote such things: if A is a matrix, A[:, 0:k] denotes the matrix comprised of the first k columns of A, and A[:, k:p+k] the matrix comprised of the columns of A starting from the (k+1)-st and ending with the (p+k)-th. Analogously, if a is a vector, a[0:k] means a vector in $\mathbb{R}^k$ comprised of the first k components of a, and a[k:p+k] a vector in $\mathbb{R}^p$ comprised of the components from the (k+1)-st to the (p+k)-th.
(a) Before even thinking about our linear regression, let's understand how we are going to measure its performance. Suppose that the true value for any point x is given by $x^\top w^*$, where $w^*$ is the vector of true parameters. Show that for any $\hat{w}$ and $x \sim \mathcal{N}(0, \Sigma)$ which is independent of $\hat{w}$ we have
$$R(\hat{w}, w^*) := \mathbb{E}[(x^\top \hat{w} - x^\top w^*)^2 \mid \hat{w}] = \left\| \sqrt{\Sigma}(\hat{w} - w^*) \right\|_2^2 = \lambda_L \|(\hat{w} - w^*)[0:k]\|^2 + \lambda_S \|(\hat{w} - w^*)[k:p+k]\|^2,$$
where we defined our notion of risk (expected test error) in the first line.
Comment on how this measure of error is qualitatively different from $\|\hat{w} - w^*\|^2$. (Which components do you need to estimate more accurately? What is the connection between this and the α and β weights that you saw in the previous problem?)
Solution: First, we can write
$$(x^\top \hat{w} - x^\top w^*)^2 = (x^\top \hat{w} - x^\top w^*)^\top (x^\top \hat{w} - x^\top w^*) = (\hat{w} - w^*)^\top x x^\top (\hat{w} - w^*).$$
When we take expectation in x (i.e. conditional on wˆ ), we just substitute E[xx⊤|wˆ ] = E[xx⊤] = Σ. Therefore,
$$\mathbb{E}[(x^\top \hat{w} - x^\top w^*)^2 \mid \hat{w}] = (\hat{w} - w^*)^\top \Sigma (\hat{w} - w^*) = \left\| \sqrt{\Sigma}(\hat{w} - w^*) \right\|_2^2.$$
To obtain the second equality we just need to recall what $\sqrt{\Sigma}$ is: it is a diagonal matrix whose first k diagonal entries are $\sqrt{\lambda_L}$, and the rest of the diagonal entries are $\sqrt{\lambda_S}$. Thus, the first k coordinates of $\sqrt{\Sigma}(\hat{w} - w^*)$ are the first k coordinates of $\hat{w} - w^*$ multiplied by $\sqrt{\lambda_L}$, and the last p coordinates of $\sqrt{\Sigma}(\hat{w} - w^*)$ are the last p coordinates of $\hat{w} - w^*$ multiplied by $\sqrt{\lambda_S}$. Therefore
$$\left\| \sqrt{\Sigma}(\hat{w} - w^*) \right\|_2^2 = \sum_{i=0}^{k+p-1} \left( \sqrt{\Sigma}(\hat{w} - w^*) \right)[i]^2 = \lambda_L \sum_{i=0}^{k-1} (\hat{w} - w^*)[i]^2 + \lambda_S \sum_{i=k}^{k+p-1} (\hat{w} - w^*)[i]^2 = \lambda_L \|(\hat{w} - w^*)[0:k]\|^2 + \lambda_S \|(\hat{w} - w^*)[k:p+k]\|^2.$$
We see that the first k coordinates have larger weight than the last p coordinates, thus we need to estimate the first k coordinates more accurately. If our risk were $\|\hat{w} - w^*\|^2$, then errors in single coordinates would have exactly the same influence on the final total error. In the setting of the previous problem, our error is equal to $\left\| \sqrt{\Sigma}(\hat{\beta} - \beta^*) \right\|_2^2 = \|\hat{\alpha} - \alpha^*\|_2^2$, so we should estimate the betas with different accuracy, but when it comes to the alphas, their errors all contribute in the same way. This helps us better understand what "orthonormality under the test distribution" refers to. Whereas in the case of functions on the 1d interval, we thought of the original complex exponential features as being orthonormal and the scaling of features as something that happened later, in this case, the "native features" are themselves scaled relative to standard Gaussians. What matters is always the scaling/coordinate-system in which the prediction error corresponds to the error in estimating the parameters.
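Here is a small Monte Carlo sketch (numpy assumed; the sizes and λ values are arbitrary illustrative choices) that checks the risk identity above for a fixed $\hat{w}$:

    import numpy as np

    rng = np.random.default_rng(0)
    k, p = 3, 50                                   # illustrative sizes
    lambda_L, lambda_S = 10.0, 0.1
    Sigma_diag = np.concatenate([np.full(k, lambda_L), np.full(p, lambda_S)])

    w_hat = rng.standard_normal(k + p)             # any fixed estimate
    w_star = rng.standard_normal(k + p)            # any true parameter

    # Monte Carlo estimate of E[(x^T w_hat - x^T w_star)^2] with x ~ N(0, Sigma)
    x = rng.standard_normal((500000, k + p)) * np.sqrt(Sigma_diag)
    mc_risk = np.mean((x @ (w_hat - w_star)) ** 2)

    delta = w_hat - w_star
    closed_form = lambda_L * np.sum(delta[:k] ** 2) + lambda_S * np.sum(delta[k:] ** 2)

    print(mc_risk, closed_form)                    # should agree up to Monte Carlo error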
(b) Recall that the interpolating min-norm solution is given by
$$\hat{w} = X^\top (XX^\top)^{-1} y. \tag{3}$$
To simplify the computation a bit more, we will assume that the true vector w∗ has all zero coordinates starting from k + 1-st (i.e. w∗[k : p + k] = 0): that is y = X[ : , 0 : k]w∗[0 : k] + ε, where ε ∼ N(0,σ2In) — independent of X (i.e. our true signal only lives in the spiked part, and the independent noise is i.i.d. centered gaussian with variance σ2).
Since wˆ is linear in y, we can split it into two components: wˆ = wˆ B + wˆ V ,
where
wˆB =X⊤(XX⊤)−1X[:,0:k]w∗[0:k], wˆV =X⊤(XX⊤)−1ε.
Here B and V correspond to ”Bias” and ”Variance”. Here, we follow traditional terminology in which any impact of the learning rule on the squared error that we face even when there is no training noise is called “bias” and the impact of training noise is called “variance.”
Denote expectation in ε as Eε.
Look at the following decomposition of the error:
$$\mathbb{E}_\varepsilon R(\hat{w}, w^*) = \left\| \sqrt{\Sigma}(\hat{w}_B - w^*) \right\|^2 + \mathbb{E}_\varepsilon \left\| \sqrt{\Sigma}\hat{w}_V \right\|^2$$
$$= \lambda_L \left\| \left( X[:, 0:k]^\top (XX^\top)^{-1} X[:, 0:k] - I_k \right) w^*[0:k] \right\|^2 + \lambda_S \left\| X[:, k:p+k]^\top (XX^\top)^{-1} X[:, 0:k] w^*[0:k] \right\|^2 + \sigma^2 \lambda_L \left\| X[:, 0:k]^\top (XX^\top)^{-1} \right\|_F^2 + \sigma^2 \lambda_S \left\| X[:, k:p+k]^\top (XX^\top)^{-1} \right\|_F^2.$$
How would you interpret each component of the error? (Which comes from the noise and which from the bias? Which corresponds to the error in the estimation of the first k coordinates of w∗ (i.e. the spiked part), and which to contamination that comes from learning non-zero coefficients in the non-spiked part?)
Show that this decomposition is indeed correct.
HINT: show that if A is independent from ε, then $\mathbb{E}_\varepsilon \|A\varepsilon\|^2 = \sigma^2 \operatorname{Trace}(A^\top A) = \sigma^2 \|A\|_F^2$.
Solution: The decomposition from the first line is standard: since our noise ε is centered and independent of the signal, when we compute the squared error the cross-terms have zero mean, so expectation kills them. In language of formulas it looks like
$$\mathbb{E}_\varepsilon R(\hat{w}, w^*) = \mathbb{E}_\varepsilon \left\| \sqrt{\Sigma}(\hat{w} - w^*) \right\|_2^2 = \mathbb{E}_\varepsilon \left\| \sqrt{\Sigma}(\hat{w}_B + \hat{w}_V - w^*) \right\|^2$$
$$= \mathbb{E}_\varepsilon \left[ \left\| \sqrt{\Sigma}(\hat{w}_B - w^*) \right\|^2 + \left\| \sqrt{\Sigma}\hat{w}_V \right\|^2 + 2\,\hat{w}_V^\top \Sigma (\hat{w}_B - w^*) \right]$$
$$= \left\| \sqrt{\Sigma}(\hat{w}_B - w^*) \right\|^2 + \mathbb{E}_\varepsilon \left\| \sqrt{\Sigma}\hat{w}_V \right\|^2,$$
where we made the last transition by using the facts that $\| \sqrt{\Sigma}(\hat{w}_B - w^*) \|^2$ doesn't depend on ε and $\hat{w}_V^\top \Sigma(\hat{w}_B - w^*)$ depends on it linearly, so it has zero expectation.
Now we can further split these two terms analogously to what we did in part (a):
$$\left\| \sqrt{\Sigma}(\hat{w}_B - w^*) \right\|^2 = \lambda_L \|(\hat{w}_B - w^*)[0:k]\|^2 + \lambda_S \|(\hat{w}_B - w^*)[k:p+k]\|^2,$$
$$\left\| \sqrt{\Sigma}\hat{w}_V \right\|^2 = \lambda_L \|\hat{w}_V[0:k]\|^2 + \lambda_S \|\hat{w}_V[k:p+k]\|^2.$$
Now we just plug in
$$w^*[k:p+k] = 0,$$
$$\hat{w}_B[0:k] = X[:, 0:k]^\top (XX^\top)^{-1} X[:, 0:k] w^*[0:k], \qquad \hat{w}_B[k:p+k] = X[:, k:p+k]^\top (XX^\top)^{-1} X[:, 0:k] w^*[0:k],$$
$$\hat{w}_V[0:k] = X[:, 0:k]^\top (XX^\top)^{-1} \varepsilon, \qquad \hat{w}_V[k:p+k] = X[:, k:p+k]^\top (XX^\top)^{-1} \varepsilon$$
to obtain
$$\lambda_L \|(\hat{w}_B - w^*)[0:k]\|^2 = \lambda_L \left\| \left( X[:, 0:k]^\top (XX^\top)^{-1} X[:, 0:k] - I_k \right) w^*[0:k] \right\|^2,$$
$$\lambda_S \|(\hat{w}_B - w^*)[k:p+k]\|^2 = \lambda_S \left\| X[:, k:p+k]^\top (XX^\top)^{-1} X[:, 0:k] w^*[0:k] \right\|^2,$$
$$\lambda_L \|\hat{w}_V[0:k]\|^2 = \lambda_L \left\| X[:, 0:k]^\top (XX^\top)^{-1} \varepsilon \right\|^2,$$
$$\lambda_S \|\hat{w}_V[k:p+k]\|^2 = \lambda_S \left\| X[:, k:p+k]^\top (XX^\top)^{-1} \varepsilon \right\|^2.$$
The final step is to plug in the equality from the hint, $\mathbb{E}_\varepsilon \|A\varepsilon\|^2 = \sigma^2 \|A\|_F^2$, to replace the last two terms with their expectation in ε. To show this equality we can write
$$\|A\varepsilon\|^2 = \varepsilon^\top A^\top A \varepsilon = \sum_{i,j} (A^\top A)_{ij}\, \varepsilon[i]\varepsilon[j].$$
Thus,
$$\mathbb{E}_\varepsilon \|A\varepsilon\|^2 = \sum_{i,j} (A^\top A)_{ij}\, \mathbb{E}_\varepsilon[\varepsilon[i]\varepsilon[j]] = \sigma^2 \sum_{i} (A^\top A)_{ii} = \sigma^2 \operatorname{Trace}(A^\top A) = \sigma^2 \|A\|_F^2,$$
where we used the fact that
$$\mathbb{E}_\varepsilon[\varepsilon[i]\varepsilon[j]] = \begin{cases} \sigma^2 & i = j, \\ 0 & i \neq j. \end{cases}$$
Finally, we see that the first term in the decomposition is the bias in the spiked part, the second the bias in the non-spiked part, the third the variance in the spiked part, and the fourth the variance in the non-spiked part.
To use the language of survival and contamination, the bias terms in the spiked part largely correspond to the extent to which the true pattern survives the learning process. The bias terms in the non-spiked part are entirely due to the contamination produced in these features due to the learning process as a consequence of the true pattern in the spiked part. The variance in the spiked part is the contamination that the training noise produces within the estimated weights that correspond to where the true pattern lies. The variance in the non-spiked part is the contamination that the training noise produces there.
(c) Now that we know what we are looking for, let's start with evaluating how well linear regression predicts values on unobserved data. If our training data vectors were i.i.d. Gaussian from the test distribution itself, the data matrix would be $X = \left[ \sqrt{\lambda_L}\, Z[:, 0:k], \ \sqrt{\lambda_S}\, Z[:, k:p+k] \right]$, where $Z \in \mathbb{R}^{n \times (k+p)}$ is a matrix with i.i.d. standard normal entries. However, considering such a random matrix would add quite some technical work to the argument that we are going to make, so we'll want to make a simplifying assumption. But what should that assumption be?
Here, we recall what was going on in the regularly spaced training data for complex expo- nential features. Notice that there, we had M aliasing groups. For the favored features, the columns in the un-scaled training matrix were orthogonal and each had norm squared equal to n. For Gaussians, the counterpart statement would be to have Z[ : , 0 : k]⊤Z[ : , 0 : k] equal to
n times the k × k identity matrix. After all, orthogonality is like independence and the sum of the variances is like the norm squared.
But what was going on for the un-favored features? There were lots of exact aliases here. The key is to observe that if we ignore the special zone from s to n (which the previous problem tells us really wouldn’t matter qualitatively even if we gave it zero weight), that although columns kept repeating in the tail, the n rows there were orthogonal to each other (they inherited that orthogonality from the standard DFT matrix) and each row basically had a norm squared of Mn. For Gaussians, the counterpart statement would be to have Z[ : , k : p + k]Z[ : , k : p + k]⊤ equal to p times the n × n identity matrix. This is because p is playing the role of Mn here.
So, we know what to wish for. But is this wish reasonable, and can we do anything with it? Recall that in problem 5 ("Isotropic Gaussians") of HW3 we saw that an n × d matrix with i.i.d. standard normal entries has all its non-zero singular values concentrate in the vicinity of $\sqrt{d}$ if $d \gg n$. For our case that means $Z[:, 0:k]^\top Z[:, 0:k] \approx n I_k$ and $Z[:, k:p+k]\, Z[:, k:p+k]^\top \approx p I_n$, since for isotropic Gaussians, as far as singular values go, it doesn't matter if a matrix is very tall or very wide: singular values don't care about transposing. This means that our wish is indeed reasonable.
To make life simpler and to avoid dealing with technicalities, in this problem we assume that matrix X ∈ Rn×(k+p) is such that those approximate equalities are actually exact equalities, i.e.
$$X[:, 0:k]^\top X[:, 0:k] = n\lambda_L I_k \quad \text{and} \quad X[:, k:p+k]\, X[:, k:p+k]^\top = p\lambda_S I_n. \tag{4}$$
(Notice that these conditions are effectively satisfied for the Fourier training matrix from the previous homework. That can help you check your work.)
It may look tricky at first glance to deal with the inverse matrix in (3). We got away with things in the Fourier case by really leveraging the detailed structure of the exact aliases. Here, we can't do that and have to stick to more linear-algebraic arguments. Luckily, under our assumptions, this matrix becomes easily tractable. Recall that X consists of two parts: the first k columns form a spiked part, and we may view their contribution as a low-rank correction when we look at $XX^\top$. There is a tool just for this case: an inverse of a low-rank correction of a matrix can be computed via the Woodbury matrix identity (see this link to wikipedia). Use the Woodbury matrix identity to show that
$$(XX^\top)^{-1} = \frac{1}{p\lambda_S}\left( I_n - \frac{1}{p\lambda_S + n\lambda_L} X[:, 0:k]\, X[:, 0:k]^\top \right). \tag{5}$$
HINT: $XX^\top = X[:, 0:k]\, X[:, 0:k]^\top + X[:, k:p+k]\, X[:, k:p+k]^\top$.
Solution:
$$(XX^\top)^{-1} = \left( X[:, 0:k]\, X[:, 0:k]^\top + X[:, k:p+k]\, X[:, k:p+k]^\top \right)^{-1} = \left( p\lambda_S I_n + X[:, 0:k]\, X[:, 0:k]^\top \right)^{-1}$$
Applying the Woodbury matrix identity,
$$= \frac{1}{p\lambda_S} I_n - \frac{1}{(p\lambda_S)^2} X[:, 0:k] \left( I_k + X[:, 0:k]^\top X[:, 0:k] / (p\lambda_S) \right)^{-1} X[:, 0:k]^\top$$
$$= \frac{1}{p\lambda_S} I_n - \frac{1}{(p\lambda_S)^2 \left( 1 + n\lambda_L/(p\lambda_S) \right)} X[:, 0:k]\, X[:, 0:k]^\top$$
$$= \frac{1}{p\lambda_S}\left( I_n - \frac{1}{p\lambda_S + n\lambda_L} X[:, 0:k]\, X[:, 0:k]^\top \right).$$
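Since assumption (4) pins down the relevant Gram matrices exactly, identity (5) can be checked numerically. The sketch below (numpy assumed; the sizes and λ values are arbitrary illustrative choices) constructs an X satisfying (4) and compares the direct inverse with the Woodbury-based formula.

    import numpy as np

    rng = np.random.default_rng(0)
    n, k, p = 20, 3, 200                           # illustrative sizes with k << n << p
    lambda_L, lambda_S = 5.0, 0.05

    # Build an X that satisfies assumption (4) exactly.
    Q1, _ = np.linalg.qr(rng.standard_normal((n, k)))      # orthonormal columns
    Q2, _ = np.linalg.qr(rng.standard_normal((p, n)))      # Q2.T has orthonormal rows
    X = np.hstack([np.sqrt(n * lambda_L) * Q1, np.sqrt(p * lambda_S) * Q2.T])

    # Check (4): X[:,0:k]^T X[:,0:k] = n*lambda_L*I_k and X[:,k:] X[:,k:]^T = p*lambda_S*I_n
    assert np.allclose(X[:, :k].T @ X[:, :k], n * lambda_L * np.eye(k))
    assert np.allclose(X[:, k:] @ X[:, k:].T, p * lambda_S * np.eye(n))

    # Compare the direct inverse with the Woodbury-based formula (5).
    direct = np.linalg.inv(X @ X.T)
    formula = (np.eye(n) - X[:, :k] @ X[:, :k].T / (p * lambda_S + n * lambda_L)) / (p * lambda_S)

    print(np.max(np.abs(direct - formula)))        # should be numerically negligible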
(d) In the next part we will put everything together to obtain the final theoretical result. This part introduces one small computation that will be useful. Use (5) to show that
$$(XX^\top)^{-1} X[:, 0:k] = \frac{1}{p\lambda_S + n\lambda_L} X[:, 0:k]. \tag{6}$$
Solution:
$$(XX^\top)^{-1} X[:, 0:k] = \frac{1}{p\lambda_S}\left( I_n - \frac{1}{p\lambda_S + n\lambda_L} X[:, 0:k]\, X[:, 0:k]^\top \right) X[:, 0:k]$$
$$= \frac{1}{p\lambda_S}\left( X[:, 0:k] - \frac{n\lambda_L}{p\lambda_S + n\lambda_L} X[:, 0:k] \right) = \frac{1}{p\lambda_S + n\lambda_L} X[:, 0:k].$$
(e) Plug (4), (5) and (6) into the decomposition of $\mathbb{E}_\varepsilon R(\hat{w}, w^*)$ into 4 terms from part (b) to express each of those terms in terms of $\|w^*\|_2$, $\lambda_S$, $\lambda_L$, n, k and p.
HINT 1: this is a lengthy but simple computation. In the next part you will check your result numerically.
HINT 2: $\|a\|^2 = a^\top a$ and $\|A\|_F^2 = \operatorname{Trace}(AA^\top) = \operatorname{Trace}(A^\top A)$.
HINT 3: show that $\|X[:, k:p+k]^\top A\|_F^2 = p\lambda_S \|A\|_F^2$ for any A.
Solution: The answer is
$$\lambda_L \left\| \left( X[:, 0:k]^\top (XX^\top)^{-1} X[:, 0:k] - I_k \right) w^*[0:k] \right\|_2^2 = \lambda_L \left( \frac{p\lambda_S}{p\lambda_S + n\lambda_L} \right)^2 \|w^*\|_2^2,$$
$$\lambda_S \left\| X[:, k:p+k]^\top (XX^\top)^{-1} X[:, 0:k]\, w^*[0:k] \right\|_2^2 = \lambda_S \cdot \frac{p\lambda_S}{p\lambda_S + n\lambda_L} \cdot \frac{n\lambda_L}{p\lambda_S + n\lambda_L} \|w^*\|_2^2,$$
$$\sigma^2 \lambda_L \left\| X[:, 0:k]^\top (XX^\top)^{-1} \right\|_F^2 = \sigma^2 \frac{k}{n} \left( \frac{\lambda_L n}{p\lambda_S + n\lambda_L} \right)^2,$$
$$\sigma^2 \lambda_S \left\| X[:, k:p+k]^\top (XX^\top)^{-1} \right\|_F^2 = \frac{\sigma^2}{p} \left( n - k + k \left( \frac{p\lambda_S}{p\lambda_S + n\lambda_L} \right)^2 \right).$$
We already know that
$$(XX^\top)^{-1} X[:, 0:k] = \frac{1}{p\lambda_S + n\lambda_L} X[:, 0:k].$$
Analogously,
$$X[:, 0:k]^\top (XX^\top)^{-1} = \frac{1}{p\lambda_S + n\lambda_L} X[:, 0:k]^\top,$$
and
$$X[:, 0:k]^\top (XX^\top)^{-1} X[:, 0:k] = \frac{1}{p\lambda_S + n\lambda_L} X[:, 0:k]^\top X[:, 0:k] = \frac{n\lambda_L}{p\lambda_S + n\lambda_L} I_k.$$
This gives us the first line:
$$\lambda_L \left\| \left( X[:, 0:k]^\top (XX^\top)^{-1} X[:, 0:k] - I_k \right) w^*[0:k] \right\|_2^2 = \lambda_L \left( \frac{p\lambda_S}{p\lambda_S + n\lambda_L} \right)^2 \|w^*\|_2^2.$$
For the second line we can write
$$\begin{aligned}
\left\| X[:, k:p+k]^\top (XX^\top)^{-1} X[:, 0:k]\, w^*[0:k] \right\|^2 &= \left( \frac{1}{p\lambda_S + n\lambda_L} \right)^2 \left\| X[:, k:p+k]^\top X[:, 0:k]\, w^*[0:k] \right\|^2 \\
&= \left( \frac{1}{p\lambda_S + n\lambda_L} \right)^2 w^*[0:k]^\top X[:, 0:k]^\top X[:, k:p+k]\, X[:, k:p+k]^\top X[:, 0:k]\, w^*[0:k] \\
&= n\lambda_L \cdot p\lambda_S \cdot \left( \frac{1}{p\lambda_S + n\lambda_L} \right)^2 \|w^*[0:k]\|^2,
\end{aligned}$$
which gives the second line. The third line:
$$\begin{aligned}
\left\| X[:, 0:k]^\top (XX^\top)^{-1} \right\|_F^2 &= \left( \frac{1}{p\lambda_S + n\lambda_L} \right)^2 \left\| X[:, 0:k]^\top \right\|_F^2 \\
&= \left( \frac{1}{p\lambda_S + n\lambda_L} \right)^2 \operatorname{Trace}\!\left( X[:, 0:k]^\top X[:, 0:k] \right) = k n \lambda_L \left( \frac{1}{p\lambda_S + n\lambda_L} \right)^2.
\end{aligned}$$
And the fourth line:
$$\begin{aligned}
\left\| X[:, k:p+k]^\top (XX^\top)^{-1} \right\|_F^2 &= p\lambda_S \left\| (XX^\top)^{-1} \right\|_F^2 = p\lambda_S \operatorname{Trace}\!\left( (XX^\top)^{-2} \right) \\
&= \frac{p\lambda_S}{(p\lambda_S)^2} \operatorname{Trace}\!\left( \left( I_n - \frac{1}{p\lambda_S + n\lambda_L} X[:, 0:k]\, X[:, 0:k]^\top \right)^2 \right) \\
&= \frac{1}{p\lambda_S} \operatorname{Trace}\!\left( I_n - \frac{2}{p\lambda_S + n\lambda_L} X[:, 0:k]\, X[:, 0:k]^\top + \frac{n\lambda_L}{(p\lambda_S + n\lambda_L)^2} X[:, 0:k]\, X[:, 0:k]^\top \right) \\
&= \frac{1}{p\lambda_S} \left( n - \frac{2kn\lambda_L}{p\lambda_S + n\lambda_L} + \frac{k n^2 \lambda_L^2}{(p\lambda_S + n\lambda_L)^2} \right) \\
&= \frac{1}{p\lambda_S} \left( n - k + k \left( 1 - \frac{n\lambda_L}{p\lambda_S + n\lambda_L} \right)^2 \right) = \frac{1}{p\lambda_S} \left( n - k + k \left( \frac{p\lambda_S}{p\lambda_S + n\lambda_L} \right)^2 \right).
\end{aligned}$$
(f) In the bi-level model with Fourier features, we saw that there were also effectively four terms to the error. Comment on the relationship between those terms and the ones here.
Solution:
The first term is $\lambda_L \left( \frac{p\lambda_S}{p\lambda_S + n\lambda_L} \right)^2 \|w^*\|_2^2 = \left( \frac{p\lambda_S}{p\lambda_S + n\lambda_L} \right)^2 (\lambda_L \|w^*\|_2^2)$. Here, notice that the $(\lambda_L \|w^*\|_2^2)$ is just the energy in the true pattern, and so the term that is before it is what corresponds to the $(1 - SU)^2 = (1 - \frac{1}{1+\lambda})^2 = (\frac{\lambda}{1+\lambda})^2$ in the Bi-Level Fourier model. So we see that $\frac{p\lambda_S}{n\lambda_L}$ corresponds to the λ in the Bi-Level Fourier case. Notice that in the Fourier case, $\lambda = \frac{s(1-\gamma)}{n\gamma}$. Notice that $\lambda_L$ is like the square of the big weights in the Fourier model and so corresponds to $\frac{d\gamma}{s}$. Meanwhile $\lambda_S$ is like the square of the small weights in the Fourier model and so corresponds to approximately $1 - \gamma$. Since $d \approx p$ in this case, we have an exact match: $\frac{p\lambda_S}{\lambda_L} \approx \frac{s(1-\gamma)}{\gamma}$.
Once we see that $\lambda \approx \frac{p\lambda_S}{n\lambda_L}$ in this model, the other terms also start making sense.
Look at $CN_s^2 \approx \frac{n}{d} \frac{\lambda^2}{(1+\lambda)^2}$. This corresponds to the second line
$$\lambda_S \cdot \frac{p\lambda_S}{p\lambda_S + n\lambda_L} \cdot \frac{n\lambda_L}{p\lambda_S + n\lambda_L} \|w^*\|_2^2 = \frac{n}{p} \left( \frac{p\lambda_S}{p\lambda_S + n\lambda_L} \right)^2 (\lambda_L \|w^*\|_2^2). \tag{7}$$
The last factor $(\lambda_L \|w^*\|_2^2)$ is just the true signal strength as before. The parenthetical term is just $\frac{\lambda^2}{(1+\lambda)^2}$ as we have already seen above. And $\frac{n}{p}$ is the same as $\frac{n}{d}$. So these terms match up as well.
Before tackling the third and fourth terms, we need to tease them apart a bit. In the Bi-Level Fourier case, we had for the noise-induced terms:
$$\frac{\sigma^2 s}{n} \cdot \frac{1 + \frac{n}{d}\lambda^2}{(1+\lambda)^2} + \frac{\sigma^2(n-s)}{d} = \frac{\sigma^2 s}{n} \cdot \frac{1}{(1+\lambda)^2} + \frac{\sigma^2}{d}\left( n - s + s\left( \frac{\lambda}{1+\lambda} \right)^2 \right).$$
We instantly see that the first of the above terms corresponds to the third term here, $\sigma^2 \frac{k}{n} \left( \frac{\lambda_L n}{p\lambda_S + n\lambda_L} \right)^2$, using the correspondences above, since k is playing the role of s. After all, $\frac{1}{(1+\lambda)^2} = \left( 1 - \frac{\lambda}{1+\lambda} \right)^2 \to \left( \frac{\lambda_L n}{p\lambda_S + n\lambda_L} \right)^2$.
And the second term is a perfect fit for the fourth term here, $\frac{\sigma^2}{p}\left( n - k + k \left( \frac{p\lambda_S}{p\lambda_S + n\lambda_L} \right)^2 \right)$, because p is playing the role of d and k is playing the role of s. Thus we see that every term has a direct correspondence (and hence an interpretation) between the two cases.
(g) Finally, we can compare our theoretical findings with the results of numerical experiments. Do the tasks from the associated jupyter notebook and report your results.
Solution: The 4 pieces of code that you needed to add in part 1 are, correspondingly,

    return lambda_L * ((p * lambda_S) / (p * lambda_S + n * lambda_L))**2

    return lambda_S * ((p * lambda_S) / (p * lambda_S + n * lambda_L)) * ((n * lambda_L) / (p * lambda_S + n * lambda_L))

    return (k)/(n) * ((n * lambda_L)/(p * lambda_S + n * lambda_L))**2

    return (k)/(p) * ((p * lambda_S)/(p * lambda_S + n * lambda_L))**2 + (n-k)/(p)

In part 2 the desired event often holds if
• for n = 10, d is around 300,
• for n = 30, d is around 1000,
• for n = 30, d is around 3900.
Looking at these numbers it would be reasonable to assume that for our predictions to do well for a Gaussian design we would need something like n ≈ 100k and p ≈ 100n, so p/k ≈ $10^4$.
However, in part 3 we see that our bounds actually predict the performance for a Gaussian design pretty well, despite the fact that we have p/k < 10 most of the time. The discrepancy looks largest when p is small; the value of k doesn't influence it very much.
The remaining problems are curated problems from past midterms and we have provided the very rough approximate times that we expect these to take to be solved in an exam setting, where you experience stress and random confusion. In an exam setting, you are however expected to have your full and complete attention on the problem in front of you and are expected to be committed to making progress. In homeworks, much time is sometimes lost to just sitting there and pondering whether the next step you want to take is the right one, etc... On an exam, you need to YOLO it and move forward. Because these times were computed expecting students to already have fully studied, your times might be slower at first try. Don’t panic, but do make sure that you can redo the problems in less than the given times once you’ve figured out how to solve them.
6 Finding the Centroid
Let $x_1, \ldots, x_n \in \mathbb{R}^d$. We consider computing the centroid of this dataset. Consider the loss function
$$L(w) := \frac{1}{2n} \sum_{i=1}^n \|x_i - w\|^2.$$
(a) (Est.: 5 minutes) First, we compute the gradient of the loss function. Show that
$$\nabla_w L(w) = w - \bar{x},$$
where $\bar{x} := \frac{1}{n} \sum_{i=1}^n x_i$.
Solution: We have that $\nabla_w \|x_i - w\|^2 = \nabla_w (x_i^\top x_i - 2 x_i^\top w + \|w\|^2) = 2w - 2x_i$. Hence,
$$\nabla_w L(w) = \frac{1}{2n} \sum_{i=1}^n (2w - 2x_i) = w - \bar{x}.$$
(b) (Est.: 5 minutes) Show that the minimizer of the loss function is given by $\bar{x}$, i.e. $\arg\min_{w \in \mathbb{R}^d} L(w) = \bar{x}$. Make sure to justify your answer.
Solution: Since L is convex and quadratic, at the minimum the gradient is equal to zero. There is only one place where this happens. Thus, by the previous part, this is when $w = \bar{x}$.
(c) (Est.: 10 minutes) Suppose $x_1, \ldots, x_n$ are identically and independently distributed according to a normal distribution with mean $x^*$ and diagonal covariance, i.e. $x_i \sim \mathcal{N}(x^*, \sigma^2 I_d)$ for $i = 1, \ldots, n$. Calculate $\mathbb{E}[\|\bar{x} - x^*\|^2]$.
Solution:
$$\begin{aligned}
\mathbb{E}[\|\bar{x} - x^*\|^2] &= \mathbb{E}\left[ \left\| \frac{1}{n} \sum_{i=1}^n x_i - x^* \right\|^2 \right] = \mathbb{E}\left[ \left\| \frac{1}{n} \sum_{i=1}^n (x_i - x^*) \right\|^2 \right] \\
&= \frac{1}{n^2} \mathbb{E}\left[ \left\| \sum_{i=1}^n (x_i - x^*) \right\|^2 \right] \\
&= \frac{1}{n^2} \mathbb{E}\left[ \sum_{i=1}^n \|x_i - x^*\|^2 + 2 \sum_{i < j} \langle x_i - x^*, x_j - x^* \rangle \right] \\
&= \frac{1}{n^2} \left( \sum_{i=1}^n \mathbb{E}[\|x_i - x^*\|^2] + 2 \sum_{i < j} \mathbb{E}[\langle x_i - x^*, x_j - x^* \rangle] \right) \\
&= \frac{1}{n^2} \sum_{i=1}^n \sigma^2 d = \sigma^2 d / n,
\end{aligned}$$
where the cross terms vanish because the $x_i - x^*$ are independent and zero-mean.
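A quick Monte Carlo sketch (numpy assumed; n, d, σ are arbitrary illustrative values) confirming that $\mathbb{E}[\|\bar{x} - x^*\|^2] = \sigma^2 d / n$:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, sigma = 25, 4, 2.0                       # illustrative values
    x_star = rng.standard_normal(d)

    trials = 20000
    errs = np.empty(trials)
    for t in range(trials):
        X = x_star + sigma * rng.standard_normal((n, d))   # n i.i.d. samples from N(x*, sigma^2 I_d)
        errs[t] = np.sum((X.mean(axis=0) - x_star) ** 2)

    print(errs.mean(), sigma ** 2 * d / n)          # should agree up to Monte Carlo error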
7 Checking Kernels
Recall that for a function k to be a valid kernel, it must be symmetric in its arguments and its Gram matrices must be positive semi-definite. More precisely, for every sample x1 , x2 , . . . , xn ∈ Rd , the Gram matrix
$$K = \begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{bmatrix}, \qquad K_{ij} = k(x_i, x_j),$$
must be positive semi-definite. Also, recall that a matrix is positive semi-definite if it is symmetric
and all its eigenvalues are non-negative.
(a) (Est.: 5 minutes) Give an example of two positive semi-definite matrices A1 and A2 in R2×2
such that A1 − A2 is not positive semi-definite.
As a consequence, show that the function k defined by k(xi, xj) = k1(xi, xj) − k2(xi, xj) is not
necessarily a kernel even when k1 and k2 are valid kernels. Solution: Take A1 = (1/2) · I2 and A2 = I2.
This would result in $A_1 - A_2 = -\frac{1}{2} I_2$, which has negative eigenvalues and so is not PSD.
We can define $k_1(x, z) = \frac{1}{2} x^\top z$ and $k_2(x, z) = x^\top z$ to have 2 × 2 Gram matrices equal to $A_1$ and $A_2$ respectively, as long as the original $x_1 = [1, 0]^\top$ and $x_2 = [0, 1]^\top$.
(b) (Est.: 5 minutes) Show that the function k defined by $k(x_i, x_j) = \|x_i - x_j\|_2^2$ is not a valid kernel.
Solution: The diagonal of such Gram matrices is always equal to zero, while the off-diagonal terms are positive in general. Such a matrix need not be PSD, as the example $\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$ shows. This matrix has eigenvalues of ±1, which includes a negative eigenvalue.
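The same conclusion is easy to see numerically. A minimal sketch (numpy assumed; the sample points are arbitrary random choices): the Gram matrix of $k(x_i, x_j) = \|x_i - x_j\|_2^2$ has negative eigenvalues.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((5, 3))                 # a few arbitrary sample points

    # Gram matrix of k(x_i, x_j) = ||x_i - x_j||_2^2
    K = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)

    print(np.linalg.eigvalsh(K))                    # some eigenvalues are negative, so K is not PSD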
8 Bias-Variance for Ridge Regression (Est.: 24 minutes total)
Consider the scalar data-generation model:
Y = xw∗ + Z
where x denotes the scalar input feature, Y denotes the scalar noisy measurement, Z ∼ N(0,1) is standard unit-variance zero-mean Gaussian noise, and w∗ denotes the true generating parameter that we would like to estimate.
We are given a set of n training samples {xi,yi}ni=1 that are generated by the above model with i.i.d. Zi and distinct xi. Our goal is to fit a linear model and get an estimate w for the true parameter w∗. For all parts, assume that xi’s are given and fixed (not random).
For a given training set $\{x_i, y_i\}_{i=1}^n$, the ridge-regression estimate for $w^*$ is defined by
$$w_\lambda = \arg\min_{w \in \mathbb{R}} \ \lambda w^2 + \sum_{i=1}^n (y_i - x_i w)^2 \quad \text{with } \lambda \geq 0.$$
For the rest of the problem, assume that this has been solved and written in the form:
$$w_\lambda = \frac{S_{xy}}{s_x^2 + \lambda} \tag{8}$$
where $S_{xy} = \sum_{i=1}^n x_i Y_i$ and $s_x^2 = \sum_{i=1}^n x_i^2$.
(This is given, no need to rederive it).
(a) (Est.: 8 minutes) Compute the squared-bias of the ridge estimate wλ defined as follows
Bias2(wλ) = (E[wλ] − w∗)2. (9)
It is fine if your answer depends on w∗ or sx, but it should not depend directly or indirectly on the realizations of the random Z noise. (So, no S xy allowed.)
Hint: First compute the expectation of the estimate wλ over the noises Z in the observation.
Solution: (For completeness, here is where (8) comes from.) Since we have to minimize a convex function in w, we have a convex program. And since the search space is unconstrained, it suffices to take the derivative of the objective and set it to zero to find the minimum. Thus, we have
$$2\lambda w - 2\sum_{i=1}^n x_i (y_i - x_i w) = 0 \;\Longrightarrow\; w_\lambda = \frac{\sum_{i=1}^n x_i y_i}{\lambda + \sum_{i=1}^n x_i^2} = \frac{S_{xy}}{s_x^2 + \lambda}.$$
Note that the estimates are random because the observation Y is random, and that $\mathbb{E}[Y_i] = \mathbb{E}[x_i w^* + Z_i] = x_i w^*$. Using these two facts and the linearity of expectation, we have for the ordinary least-squares estimate (λ = 0)
$$\mathbb{E}[w_{\text{ols}}] = \mathbb{E}\left[ \frac{\sum_{i=1}^n x_i Y_i}{\sum_{i=1}^n x_i^2} \right] = \frac{1}{\sum_{i=1}^n x_i^2} \sum_{i=1}^n x_i \mathbb{E}[Y_i] = \frac{1}{\sum_{i=1}^n x_i^2} \sum_{i=1}^n x_i^2 w^* = w^*.$$
For the ridge-regression estimate, we see that
$$\mathbb{E}[w_\lambda] = \mathbb{E}\left[ \frac{\sum_{i=1}^n x_i Y_i}{\sum_{i=1}^n x_i^2 + \lambda} \right] = \frac{1}{\sum_{i=1}^n x_i^2 + \lambda} \sum_{i=1}^n x_i \mathbb{E}[Y_i] = \frac{1}{\sum_{i=1}^n x_i^2 + \lambda} \sum_{i=1}^n x_i^2 w^* = \frac{\sum_{i=1}^n x_i^2}{\sum_{i=1}^n x_i^2 + \lambda} w^*.$$
From the OLS computation, we have that $\mathbb{E}[w_{\text{ols}}] = w^*$ and hence its squared-bias is zero; in other words, the OLS estimate is unbiased. For the ridge estimate,
$$(\mathbb{E}[w_\lambda] - w^*)^2 = \left( \frac{\sum_{i=1}^n x_i^2}{\sum_{i=1}^n x_i^2 + \lambda} w^* - w^* \right)^2 = (w^*)^2 \frac{\lambda^2}{\left( \sum_{i=1}^n x_i^2 + \lambda \right)^2} = (w^*)^2 \frac{\lambda^2}{(s_x^2 + \lambda)^2}.$$
(b) (Est.: 8 minutes) Compute the variance of the estimate wλ which is defined as
Var(wλ) = E[(wλ − E[wλ])2]. (10)
Hint: It might be useful to write $w_\lambda = \mathbb{E}[w_\lambda] + R$ for some random variable R.
Solution: We have
$$w_\lambda = \frac{\sum_{i=1}^n x_i Y_i}{\lambda + \sum_{i=1}^n x_i^2} = \frac{\sum_{i=1}^n x_i^2 w^* + \sum_{i=1}^n x_i Z_i}{\lambda + \sum_{i=1}^n x_i^2} = \mathbb{E}[w_\lambda] + \frac{1}{\lambda + \sum_{i=1}^n x_i^2} \sum_{i=1}^n x_i Z_i.$$
Thus, we have
$$\mathbb{E}(w_\lambda - \mathbb{E}[w_\lambda])^2 = \frac{1}{\left( \lambda + \sum_{i=1}^n x_i^2 \right)^2} \mathbb{E}\left( \sum_{i=1}^n x_i Z_i \right)^2 = \frac{1}{\left( \lambda + \sum_{i=1}^n x_i^2 \right)^2} \left( \sum_{i=1}^n x_i^2 \mathbb{E}(Z_i^2) + \sum_{i \neq j} x_i x_j \mathbb{E}[Z_i Z_j] \right) = \frac{\sum_{i=1}^n x_i^2}{\left( \lambda + \sum_{i=1}^n x_i^2 \right)^2} = \frac{s_x^2}{(\lambda + s_x^2)^2},$$
since $\mathbb{E}(Z_i^2) = 1$ and $\mathbb{E}(Z_i Z_j) = 0$ for $i \neq j$, as $Z_i$ is independent of $Z_j$.
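The bias and variance formulas above can be checked with a short Monte Carlo simulation (numpy assumed; n, w*, λ and the fixed design are arbitrary illustrative choices):

    import numpy as np

    rng = np.random.default_rng(0)
    n, w_star, lam = 30, 1.5, 4.0                   # illustrative values
    x = rng.uniform(-1, 1, size=n)                  # fixed design
    s_x2 = np.sum(x ** 2)

    trials = 200000
    Z = rng.standard_normal((trials, n))
    Y = x * w_star + Z                              # each row is one realization of the data
    w_lam = (Y @ x) / (s_x2 + lam)                  # ridge estimates, one per trial

    bias2_mc = (w_lam.mean() - w_star) ** 2
    var_mc = w_lam.var()
    bias2_formula = w_star ** 2 * lam ** 2 / (s_x2 + lam) ** 2
    var_formula = s_x2 / (s_x2 + lam) ** 2

    print(bias2_mc, bias2_formula)                  # should match up to Monte Carlo error
    print(var_mc, var_formula)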
(c) (Est.: 8 minutes) Describe how the squared-bias and variance of the estimate $w_\lambda$ change as we change the value of λ. What happens as λ → 0? As λ → ∞? Is the bias increasing or decreasing? Is the variance increasing or decreasing? In what sense is there a bias/variance tradeoff?
Solution: We have
$$\text{Bias}^2(w_\lambda) = (w^*)^2 \frac{\lambda^2}{\left( \sum_{i=1}^n x_i^2 + \lambda \right)^2} = (w^*)^2 \frac{1}{(s_x^2/\lambda + 1)^2}, \qquad \text{Var}(w_\lambda) = \frac{\sum_{i=1}^n x_i^2}{\left( \lambda + \sum_{i=1}^n x_i^2 \right)^2} = \frac{s_x^2}{(\lambda + s_x^2)^2},$$
and thus clearly the squared-bias increases with increasing λ and takes value 0 as λ → 0. In fact, λ = 0 corresponds to the OLS case and we have already seen that OLS is unbiased. Furthermore, for λ → ∞, we have that the squared-bias → (w*)². We can directly observe this fact, since λ → ∞ implies that the ridge estimate would be 0 and hence the bias would be w* and consequently the squared-bias would be (w*)².
For the variance, we see that increasing λ reduces the variance. This is intuitively correct too, because a larger penalty forces the weight to shrink towards zero, thereby reducing its scale and hence the variance too!
Any other argument using the derivative is fine too!
In this single-scalar-weight learning problem, external contamination isn't an issue. So this simple argument involving bias and variance tells the whole story.
9 Hospital (25 min)
You work at hospital A. Your hospital has collected patient data to build a model to predict who is likely to get sepsis (a bad outcome). Specifically, the data set contains the feature matrix X ∈ Rn×d, and associated real number labels y ∈ Rn, where n is the number of patients you are learning from and d is the number of features describing each patient. You plan to fit a linear regression model
y = w⊤x that will enable you to predict a label for future, unseen patients (using their feature vectors).
However, your hospital has only started collecting data a short time ago. Consequently the model
you fit is not likely to be particularly accurate. Hospital B has exactly the same set up as your
hospital (i.e., their patients are drawn from the same distribution as yours and they have the same
measurement tools). For privacy reasons, Hospital B will not share their data. However, they tell
you that they have trained a linear model on their own sepsis-relevant data ($X_B$ and $y_B$) and are willing to share their learned model $y = w_B^\top x$ with you. In particular, Hospital B shares their entire Gaussian posterior distribution on w with you: $\mathcal{N}(w_B, \Psi)$.
(a) (10 min) Assume that we use the posterior from Hospital B as our own prior distribution for w ∼ N(wB,Ψ). Suppose that our Hospital A model is given by y = Xw+ε, where the noise, ε, has an assumed distribution ε ∼ N (0, I). Derive the MAP estimate w for w using Hospital A’s data X, y and the prior information from Hospital B.
HINT: Recall that traditional ridge regression could be derived from a MAP perspective, where the parameter w has a zero mean Gaussian prior distribution with a scaled identity covariance. How could you use reparameterization (i.e. change of variables) for the problem here?
Solution:
The overall point of the Hospital problem was to highlight the equivalence between a prior distribution on parameter(s) and ”pseudo” data. Stating a prior can in general be reformulated as pretending to have seen some made up data, often referred to as pseudo-data. This point was highlighted in homework. In general, the tighter the prior distribution (e.g. the less variance), the more sure we are that it is ”correct”. Thus it also corresponds to having seen more ”pseudo” data.
In part (a) the question is really just asking you to crank through the MAP estimate. But the pedagogical insight was to appreciate that in the hospital setting, the ”pseudo” data actually corresponded to real data and helped us get around a privacy problem! One could derive it from ”scratch” as done in lecture, with the specific prior given in the question, as follows (but there is a much easier solution we go over next):
From first principles solution:
Let D be the dataset, i.e. X and y. According to the Bayes’ Rule and ignoring the terms not
related to w, we have:
$$\begin{aligned}
\log p(w \mid D) &\propto \log p(D \mid w) + \log p(w) \\
&\propto -(Xw - y)^\top (Xw - y) - (w - w_B)^\top \Psi^{-1} (w - w_B) \\
&\propto (-w^\top X^\top X w + 2 y^\top X w) - (w^\top \Psi^{-1} w - 2 w_B^\top \Psi^{-1} w) \\
&\propto -w^\top (X^\top X + \Psi^{-1}) w + 2 (y^\top X + w_B^\top \Psi^{-1}) w.
\end{aligned}$$
Setting the derivative of $\log p(w \mid D)$ to zero and solving the equation, we get
$$\hat{w} = (X^\top X + \Psi^{-1})^{-1} (X^\top y + \Psi^{-1} w_B). \tag{11}$$
However, an easier way to solve this question was to simply do a change of variables on w such that its prior became zero-mean and we could just read off the solution derived in lecture for MAP with arbitrary prior variance. In particular, let $v = w - w_B$ (i.e. $w = v + w_B$); then we have
$$\log p(w \mid D) \propto \log p(D \mid w) + \log p(w) = \log \mathcal{N}(y;\, X(v + w_B),\, I) + \log \mathcal{N}(v;\, 0,\, \Psi). \tag{12}$$
From this we read off that $\hat{w} = w_B + (X^\top X + \Psi^{-1})^{-1} X^\top (y - X w_B)$, appealing to the formula we derived in class and appearing in the notes.
A third solution would have been to use the change of variables $w = \Psi^{1/2} v + w_B$ to reduce to standard ridge with λ = 1 and read the solution off directly as done in lecture; see the lecture notes, only without the $w_B$ offset term.
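A minimal numerical sketch of the MAP estimate (11) (numpy assumed; the sizes, Ψ, and the data-generating choices are purely illustrative), which also confirms that it agrees with the shifted form obtained from the change of variables:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 15, 4                                    # Hospital A has only a little data
    X = rng.standard_normal((n, d))
    w_B = rng.standard_normal(d)                    # mean of Hospital B's posterior
    Psi = 0.5 * np.eye(d)                           # Hospital B's posterior covariance
    w_true = w_B + np.linalg.cholesky(Psi) @ rng.standard_normal(d)
    y = X @ w_true + rng.standard_normal(n)

    # MAP estimate from (11): w_hat = (X^T X + Psi^{-1})^{-1} (X^T y + Psi^{-1} w_B)
    Psi_inv = np.linalg.inv(Psi)
    w_hat = np.linalg.solve(X.T @ X + Psi_inv, X.T @ y + Psi_inv @ w_B)

    # Equivalent shifted form: w_hat = w_B + (X^T X + Psi^{-1})^{-1} X^T (y - X w_B)
    w_hat_alt = w_B + np.linalg.solve(X.T @ X + Psi_inv, X.T @ (y - X @ w_B))

    print(np.max(np.abs(w_hat - w_hat_alt)))        # the two forms agree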
(b) (15 min) Now, for simplicity, consider d = 1 so that the w is a scalar parameter. Suppose that instead of giving you their posterior distribution, Hospital B only gave you their mean wB. How can you use this information to help fit your model? Describe in detail how you should use your own hospital’s patient data and combine it with the mean wB from Hospital B in a procedure to find your own w for predicting sepsis in Hospital A.
Hint 1: You might want to consider introducing an appropriate hyperparameter and doing what you usually do with hyperparameters.
Hint 2: What does the λ hyperparameter in ridge-regression correspond to from a probabilistic perspective?
Solution:
This problem was asking students to remember that an unknown variance term in a prior corresponds to a hyperparameter. After all, λ in traditional ridge regression corresponds to the reciprocal of the prior variance. This suggests that we should introduce a hyperparameter ψ into the problem for the unknown variance. Once we have done this, we can do what we always do with hyperparameters: use cross-validation to pick an appropriate value. Putting these ideas together, we get the solution:
(a) Divide the data X and y from Hospital A into k folds. (For example, we could even have k = n here.)
(b) For each potential value of ψ, do the following:
i. Repeat over the k folds (indexing the held-out fold by j):
A. Let the training data Xtrain be all of X except the j-th fold. Similarly, let the training data ytrain be all of y except the j-th fold.
B. Use the modified ridge regression of Part (a) using the current potential value for the hyperparameter ψ to fit a candidate wψ using Xtrain and ytrain.
C. Evaluate the validation error on the validation data representing the j-th fold of X and y.
ii. Average the validation errors (squared errors) across the different folds to get an average validation error estimate for this particular value of the hyperparameter ψ.
(c) Choose the hyperparameter value ψ with the lowest average validation error.
(d) Retrain the model using the chosen value for ψ on the entire training data X and y to get our final w for Hospital A.
We also would accept other ways of using a hyperparameter. For example, realizing that the effect of the hyperparameter in the scalar case is simply to “shrink” the OLS estimate towards wB, it would also be reasonable to directly define the amount of interpolation between the two as the hyperparameter. The important thing is that the validation data for evaluating hyperparameters be chosen in a way that is not cross-contaminated with data being used to set the parameters themselves. So, it is still vital that the OLS estimate be done with different data and cross-validation be used.
Given that the context here is data-poor, it makes sense to use k-fold cross-validation instead of simply splitting the data into a training and validation set once. However, on the exam, no points were deducted for using a simple validation approach.
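As a concrete illustration of the cross-validation procedure above, here is a sketch (not part of the original solution) for the scalar case d = 1, using the part (a) estimator with prior mean wB and prior variance ψ. The data, the value of w_B, and the grid psi_grid are all hypothetical.

```python
import numpy as np

def fit_w(x, y, w_B, psi):
    """Scalar MAP estimate from part (a) with prior N(w_B, psi)."""
    return (x @ y + w_B / psi) / (x @ x + 1.0 / psi)

def cv_error(x, y, w_B, psi, k=5):
    """Average squared validation error over k folds for a given psi."""
    folds = np.array_split(np.arange(len(x)), k)
    errs = []
    for fold in folds:
        mask = np.ones(len(x), dtype=bool)
        mask[fold] = False                       # hold out the current fold
        w = fit_w(x[mask], y[mask], w_B, psi)
        errs.append(np.mean((y[fold] - w * x[fold]) ** 2))
    return np.mean(errs)

# Synthetic Hospital A data (for illustration only)
rng = np.random.default_rng(1)
x = rng.normal(size=30)
y = 1.3 * x + rng.normal(size=30)
w_B = 1.0                                        # mean shared by Hospital B (made up)

psi_grid = np.logspace(-3, 2, 20)                # hypothetical grid of prior variances
best_psi = min(psi_grid, key=lambda p: cv_error(x, y, w_B, p))
w_final = fit_w(x, y, w_B, best_psi)             # retrain on all of Hospital A's data
print(best_psi, w_final)
```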
10 Ridge regression vs. PCA (Est.: 24 minutes total)
Assume we are given n training data points (xi, yi). We collect the target values into y ∈ Rn, and the inputs into the matrix X ∈ Rn×d whose rows are the d-dimensional feature vectors xi⊤ corresponding to each training point. Furthermore, assume that (1/n)∑_{i=1}^{n} xi = 0, n > d, and X has rank d.
In this problem we want to compare two procedures: the first is ridge regression with hyperparameter λ, while the second is applying ordinary least squares after using PCA to reduce the feature dimension from d to k (we give this latter approach the short-hand name k-PCA-OLS, where k is the hyperparameter).
Notation: The singular value decomposition of X reads X = UΣV⊤ where U ∈ Rn×n, Σ ∈ Rn×d and V ∈ Rd×d. We denote by ui the n-dimensional column vectors of U and by vi the d-dimensional column vectors of V. Furthermore the diagonal entries σi = Σi,i of Σ satisfy σ1 ≥ σ2 ≥ · · · ≥ σd > 0. For notational convenience, assume that σi = 0 for i > d.
(a) (Est.: 6 minutes) It turns out that the ridge regression optimizer (with λ > 0) in the V-transformed coordinates
wridge = arg min_w ∥XVw − y∥² + λ∥w∥²
has the following expression:
wridge = diag(σi/(λ + σi²)) U⊤y.   (13)
Use ytest = xtest⊤Vwridge to denote the resulting prediction for a hypothetical xtest. Using (13) and the appropriate {βi}, this can be written as:
ytest = xtest⊤ ∑_{i=1}^{d} vi βi ui⊤y.   (14)
What are the βi for this to correspond to (13) from ridge regression?
Solution: First, we will abuse notation slightly and define R = diag(σi/(λ + σi²)) ∈ Rd×n as a d × d square diagonal matrix with a 0d×(n−d) matrix appended. We can do this because σi = 0 for i > d. Our predictor is then ytest = xtest⊤VRU⊤y. Working right to left, we get
ytest = xtest⊤VR [u1⊤y, u2⊤y, …, un⊤y]⊤
      = xtest⊤V [(σ1/(λ + σ1²)) u1⊤y, …, (σd/(λ + σd²)) ud⊤y]⊤
      = xtest⊤ ∑_{i=1}^{d} vi (σi/(λ + σi²)) ui⊤y
⟹ βi = σi/(λ + σi²)
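The closed form βi = σi/(λ + σi²) is easy to check numerically. The sketch below (not part of the original solution, random stand-in data) solves the V-transformed ridge problem directly and compares the prediction to the βi form.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 20, 5, 1.0
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
x_test = rng.normal(size=d)

U, s, Vt = np.linalg.svd(X, full_matrices=True)   # X = U diag(s) V^T
V = Vt.T

# Ridge in the V-transformed coordinates: argmin ||X V w - y||^2 + lam ||w||^2
A = X @ V
w_ridge = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ y)
pred_direct = x_test @ V @ w_ridge

# Prediction via the beta_i = sigma_i / (lam + sigma_i^2) form from the solution
beta = s / (lam + s**2)
pred_beta = sum(beta[i] * (x_test @ V[:, i]) * (U[:, i] @ y) for i in range(d))

assert np.isclose(pred_direct, pred_beta)
```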
(b) (Est.: 12 minutes) Suppose that we do k-PCA-OLS — i.e. ordinary least squares on the reduced k-dimensional feature space obtained by projecting the raw feature vectors onto the k < d principal components of the covariance matrix X⊤X. Use ytest to denote the resulting prediction for a hypothetical xtest.
It turns out that the learned k-PCA-OLS predictor can be written as:
ytest = xtest⊤ ∑_{i=1}^{d} vi βi ui⊤y.   (15)
Give the βi coefficients for k-PCA-OLS. Show work.
Hint 1: some of these βi will be zero. Also, if you want to use the compact form of the SVD, feel free to do so if that speeds up your derivation.
Hint 2: some inspiration may be possible by looking at the next part for an implicit clue as to what the answer might be.
Solution: The OLS problem on the k-PCA-reduced features reads
min_w ∥XVkw − y∥²
where the columns of Vk consist of the first k right singular vectors of X (equivalently, the top k eigenvectors of X⊤X).
There are again multiple ways to arrive at wPCA; the most direct is to apply OLS on the new matrix XVk to obtain
wPCA = [(XVk)⊤(XVk)]⁻¹(XVk)⊤y = [Vk⊤VΣ⊤ΣV⊤Vk]⁻¹Vk⊤X⊤y = Σk⁻²ΣkUk⊤y = Σk⁻¹Uk⊤y
where Σk is the k × k diagonal matrix of the top k singular values and Uk consists of the first k columns of U.
The resulting prediction for PCA reads (note that you need to project xtest first!)
yPCA = xtest⊤VkwPCA = xtest⊤VkΣk⁻¹Uk⊤y = xtest⊤ ∑_{i=1}^{k} (1/σi) vi ui⊤y
so that
βi,PCA = 1/σi for i = 1, …, k, and βi,PCA = 0 for i = k + 1, …, d.
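The analogous numerical check for k-PCA-OLS is sketched below (again not part of the original solution; the data is random and the sizes are arbitrary). It fits OLS on the top-k projected features and verifies that the prediction matches βi = 1/σi for i ≤ k and βi = 0 otherwise.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 20, 5, 3
X = rng.normal(size=(n, d))
X = X - X.mean(axis=0)          # the problem assumes centered features
y = rng.normal(size=n)
x_test = rng.normal(size=d)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
V_k = Vt.T[:, :k]               # top-k principal directions of X^T X

# OLS on the k-dimensional projected features
Z = X @ V_k
w_pca = np.linalg.solve(Z.T @ Z, Z.T @ y)
pred_direct = (x_test @ V_k) @ w_pca

# Same prediction via beta_i = 1/sigma_i for i <= k and 0 otherwise
beta = np.where(np.arange(d) < k, 1.0 / s, 0.0)
pred_beta = sum(beta[i] * (x_test @ Vt.T[:, i]) * (U[:, i] @ y) for i in range(d))

assert np.isclose(pred_direct, pred_beta)
```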
(c) (Est.: 6 minutes) For the following part, d = 5. The following β := (β1, . . . , β5) (written out to two significant figures) are the results of OLS (i.e. what we would get from ridge regression in the limit λ → 0), λ-ridge-regression, and k-PCA-OLS for some X, y (identical for each method) and λ = 1, k = 3. Write down which procedure was used for each of the three sub-parts below.
We hope this helps you intuitively see the connection between these three methods.
Hint: It is not necessary to find the singular values of X explicitly, or to do any numerical computations at all.
(i) β = (0.01, 0.1, 0.5, 0.1, 0.01)
(ii) β = (0.01, 0.1, 1, 0, 0)
(iii) β = (0.01, 0.1, 1, 10, 100)
Solution: (i) Ridge, (ii) 3-PCA-OLS, (iii) OLS. Since the σi are decreasing, the OLS coefficients βi = 1/σi must be increasing, which matches (iii); 3-PCA-OLS keeps βi = 1/σi for the first three indices and zeros out the rest, which matches (ii); and ridge, with βi = σi/(λ + σi²), agrees with OLS in the large-σi directions but shrinks the small-σi directions toward zero, so its β profile rises and then falls, which matches (i).
11 Nearest Neighbors Generalize (Est.: 30 minutes)
In this question, we consider whether and why nearest-neighbor approaches generalize from the training set to test sets, assuming that both were drawn from the same distribution.
For simplicity, we set k = 1 and focus on 1-nearest neighbors. The algorithm operates on a training dataset S = {z1,z2,...,zn} of pairs zi = (xi,yi) consisting of features xi ∈ Rd and labels yi ∈ {0,1}. Each point zi is drawn i.i.d. from a distribution D. For the entire data set, this is written S ∼ Dn.
The 1-NN algorithm computes a prediction fS (x) for input x by finding the xi ∈ S that is closest to x, and using yi as the prediction.
The overall goal of this question is to bound the generalization error of 1-NN:
LD(fS) = Ez∼D[l(fS, z)]
where l is the classification error for z (i.e. l(f, z) = 0 if f(x) = y and l(f, z) = 1 if f(x) ≠ y).
We want to say that the generalization error is not that much bigger on average than the leave-one-out error (i.e. n-fold cross-validation error) given by
Lloo(fS) = (1/n) ∑_{i=1}^{n} l(fS(i), zi)
where S(i) is defined as the dataset with the i-th point dropped, i.e.
S(i) = {z1, z2, ..., zi−1, zi+1, ..., zn}.
(a) (Est.: 3 minutes) What is the value of the training error for 1-NN classification? (i.e. you run the training data through the trained 1-NN classifier and compute the classification error)
Solution: The nearest neighbor of each training sample xi is itself, so the training error is trivially zero.
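The following small sketch (not part of the original solution, synthetic data) implements 1-NN directly; it illustrates both the zero training error claimed here and the leave-one-out error Lloo used in the rest of the problem.

```python
import numpy as np

def one_nn_predict(X_train, y_train, x):
    """Predict the label of x as the label of its nearest training point."""
    i = np.argmin(np.linalg.norm(X_train - x, axis=1))
    return y_train[i]

rng = np.random.default_rng(0)
n, d = 40, 2
X = rng.normal(size=(n, d))
y = (X[:, 0] + 0.3 * rng.normal(size=n) > 0).astype(int)   # noisy labels in {0, 1}

# Training error: each point is its own nearest neighbor, so this is 0
train_err = np.mean([one_nn_predict(X, y, X[i]) != y[i] for i in range(n)])

# Leave-one-out error: drop point i, then classify it with the remaining n-1 points
keep = lambda i: np.arange(n) != i
loo_err = np.mean([one_nn_predict(X[keep(i)], y[keep(i)], X[i]) != y[i]
                   for i in range(n)])
print(train_err, loo_err)
```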
(b) (Est.: 11 minutes) Show that Ez∼D[|l(fS(i), z) − l(fS, z)|] is less than or equal to the probability that a random point z = (x, y) drawn according to D has its x closer to xi than to any other point xj ∈ S(i).
Hint: focus your attention separately on these two kinds of potential points z = (x, y): (i) those that have their x closer to xi than to any other point xj ∈ S(i), and (ii) those for which some other point in S(i) is their nearest neighbor in S. Does the addition or removal of the point xi have any effect on the classification decision for the latter set of points?
Solution: Only when the x from z = (x, y) is closer to xi than to any other point xj ∈ S(i) (we call the set of points with this property Vi) can l(fS(i), z) and l(fS, z) differ. We therefore have |l(fS(i), z) − l(fS, z)| ≤ 1x∈Vi, and by taking expectations over z we obtain the result:
Ez∼D[|l(fS(i), z) − l(fS, z)|] ≤ E[1x∈Vi] = P(x ∈ Vi)
(c) (Est.: 11 minutes) Build on the previous part to show
ES∼Dn, z∼D[|l(fS(i), z) − l(fS, z)|] ≤ 1/n.
In other words, when averaged over both the possible training data and the possible test points, the average generalization error of 1-NN is on average less than 1/n away from the average leave-one-out cross-validation error.
Hint: Show that 1/n is the average probability of a point having the i-th random training point xi as its nearest neighbor in the random training set S.
Hint for Hint: Is there anything special about the i-th training point, or can you somehow leverage the symmetry of the i.i.d. drawing of the training set S?
Solution: Since the Vi are a partition of the feature space (i.e. the union of all the Vi's is the input domain and no two Vi's intersect), we have
1 = ES∼Dn, z∼D[∑_{i=1}^{n} 1x∈Vi]   (16)
and therefore n ES∼Dn[P(x ∈ Vi)] = 1 because the training data is i.i.d. Taking an expectation over S of the result from part (b) and using what we just concluded here yields the desired result.
Put differently, x has to fall into one of the Vi's, and in expectation the probability of falling into each Vi is 1/n. This is because all the xi's are drawn i.i.d., so no Vi is "special"; they are symmetric.
(d) (Est.: 5 minutes) For this part, we give you that with probability 1 − δ over the draw of the training set S ∼ Dn,
LD(fS) ≤ Lloo(fS) + √(7/(2δn))
holds. (This can be shown with tedious algebra looking at the expected squared difference of the two errors and using Markov’s inequality. For the exam, we are not making you show this.)
On a given training set of size n = 350000, we measure 3% average leave-one-out cross- validation error Lloo(f) for the 1-nearest-neighbor classifier. What can you say with 90% confidence about the probability of classification error on a test point drawn from the same distribution that the training points were drawn from?
Solution: Set δ = 0.1 and plug the numbers into the bound. You get
√(7/(2 · 0.1 · 350000)) = √0.0001 = 0.01,
so we can be sure that we won’t make more than 3% + 1% = 4% errors on average on test points, with probability at least 90%.
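The arithmetic, spelled out (a trivial check, not part of the original solution):

```python
import math

delta, n, loo_error = 0.1, 350_000, 0.03
slack = math.sqrt(7 / (2 * delta * n))   # sqrt(7 / 70000) = 0.01
print(loo_error + slack)                 # 0.04, i.e. at most 4% test error w.p. >= 90%
```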
This last problem is an example of multiple-choice problems. However, since the time of this exam, our understanding of the material and what we have taught you has improved. Consequently, some of these questions are more confusing than we would hope. As a homework problem, this problem doesn’t count. However, it is included so that you can be familiar with how multiple choice problems might work. You can rest assured that our grading philosophy is not a “gotcha” oriented one — we give partial credit fairly even for multiple choice problems when we notice that some answers can kind of be correct even though others are more correct.
12 Multiple Choice Questions (Est.: 15 minutes)
For these questions, select all the answers which are correct. You will get full credit for selecting
all the right answers.
(a) Let X ∈ Rn×d with n ≥ d. Suppose X = UΣV⊤ is the singular value decomposition of X, where σi = Σi,i are the diagonal entries of Σ and satisfy σ1 ≥ σ2 ≥ ··· ≥ σd, while ui and vi are the i-th columns of U and V respectively. Which of the following is the rank-k approximation to X that is best in the Frobenius norm? That is, which low-rank approximation Xk for X yields the lowest value for ∥X − Xk∥²F?
⃝ ∑_{i=1}^{k} σi ui v⊤_{n−i}
⃝ ∑_{i=1}^{k} σi ui vi⊤
⃝ ∑_{i=d−k+1}^{d} σi ui vi⊤
⃝ ∑_{i=1}^{k} σi u_{n−i} vi⊤
Solution: ∑_{i=1}^{k} σi ui vi⊤. This is Eckart–Young.
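A quick numerical illustration of Eckart–Young (not part of the original solution, random matrix): keep the top-k singular triplets and check that the Frobenius error equals the root-sum-of-squares of the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 8, 5, 2
X = rng.normal(size=(n, d))

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Best rank-k approximation in Frobenius norm: keep the top-k singular triplets
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Its Frobenius error is sqrt(sum of squared discarded singular values)
err = np.linalg.norm(X - X_k, ord='fro')
assert np.isclose(err, np.sqrt(np.sum(s[k:] ** 2)))
```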
(b) Consider a simple dataset of points (xi, yi) ∈ R2, each associated with a label bi which is −1 or +1. The dataset was generated by sampling data points with label −1 from a disk of radius 1.0 (shown as filled circles in the figure) and data points with label +1 from a ring with inner radius 0.8 and outer radius 2.0 (shown as crosses in the figure). Which set of polynomial features would be best for performing linear regression, assuming at least as much data as shown in the figure?
⃝ 1, xi, yi, xi², xiyi, yi²
⃝ 1, xi, yi
⃝ 1, xi, yi, xi², xiyi, yi², xi³, yi³, xi²yi, xiyi²
⃝ 1, xi², xiyi, yi²
Solution: 1, xi², xiyi, yi². The disk/ring boundary is (approximately) a circle centered at the origin, so the purely quadratic features suffice; the purely linear set cannot represent a circular boundary, and the cubic set adds unnecessary features.
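The sketch below (not part of the original solution) generates synthetic data resembling the described figure and fits a linear least-squares classifier on the feature set from the stated answer; all sizes and the data-generation details are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_annulus(m, r_in, r_out):
    """Sample m points uniformly from the annulus with the given radii."""
    r = np.sqrt(rng.uniform(r_in**2, r_out**2, size=m))
    th = rng.uniform(0, 2 * np.pi, size=m)
    return np.column_stack([r * np.cos(th), r * np.sin(th)])

neg = sample_annulus(100, 0.0, 1.0)   # label -1: disk of radius 1
pos = sample_annulus(100, 0.8, 2.0)   # label +1: ring with radii 0.8 to 2.0
pts = np.vstack([neg, pos])
b = np.concatenate([-np.ones(100), np.ones(100)])

# Feature map from the stated answer: 1, x^2, xy, y^2 (captures circular boundaries)
x, y = pts[:, 0], pts[:, 1]
Phi = np.column_stack([np.ones(len(b)), x**2, x * y, y**2])

w, *_ = np.linalg.lstsq(Phi, b, rcond=None)
acc = np.mean(np.sign(Phi @ w) == b)
print(acc)   # reasonably high accuracy despite the overlapping radii 0.8 to 1.0
```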
(c) Which of the following is a valid kernel function for vectors of the same length, x and y?
⃝ k(x, y) = x⊤y
⃝ k(x, y) = e^{−(1/2)∥x−y∥²}
⃝ k(x, y) = (1 + x⊤y)^p for some degree p
⃝ k(x, y) = k1(x, y) − k2(x, y) for valid kernels k1 and k2.
Solution: All except k(x, y) = k1(x, y) − k2(x, y) for valid kernels k1 and k2.
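A quick sanity check (not part of the original solution): a valid kernel must produce a positive semidefinite Gram matrix on any finite set of points. Checking a few random points cannot prove validity, but it can expose an invalid kernel, as with the difference of two valid kernels below.

```python
import numpy as np

rng = np.random.default_rng(0)
Xp = rng.normal(size=(6, 3))            # a handful of random points

def gram(k):
    """Gram matrix of kernel k on the points Xp."""
    return np.array([[k(a, b) for b in Xp] for a in Xp])

k_lin  = lambda a, b: a @ b
k_rbf  = lambda a, b: np.exp(-0.5 * np.sum((a - b) ** 2))
k_poly = lambda a, b: (1 + a @ b) ** 3

for k in (k_lin, k_rbf, k_poly):
    assert np.min(np.linalg.eigvalsh(gram(k))) >= -1e-9   # PSD up to round-off

# k1 - k2 for valid kernels k1, k2 need not be PSD, e.g. k_lin - 2 * k_lin
k_diff = lambda a, b: k_lin(a, b) - 2 * k_lin(a, b)
print(np.min(np.linalg.eigvalsh(gram(k_diff))))           # negative => not a valid kernel
```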
(d) During training of your model, both the independent variables in the matrix X ∈ Rn×d and the dependent target variables y ∈ Rn are corrupted by noise. At test time, the data points you are computing predictions for, xtest, are noiseless. Which method(s) should you use to estimate the value of wˆ from the training data in order to make the most accurate predictions ytest from the noiseless test input data, Xtest? Assume that you make predictions using ytest = Xtestwˆ.
⃝ OLS
⃝ Ridge regression
⃝ Weighted Least Squares
⃝ TLS
Solution: TLS is the natural choice if we think that we have a lot of training data. Ridge regression with hyperparameter search that includes negative values for the ridge parameter would also be something that could work.
(e) Assume you have n input data points, each with d high quality features (X ∈ Rn×d) and associated labels (y ∈ Rn). Suppose that d ≫ n and that you want to learn a linear predictor. Which of these approaches would help you to avoid overfitting?
⃝ Preprocess X using k ≪ n random projections
⃝ Preprocess X using PCA with k ≪ n components.
⃝ Preprocess X using PCA with n components.
⃝ Add polynomial features
⃝ Use a kernel approach
⃝ Add a ridge penalty to OLS
⃝ Do weighted least squares
Solution: Preprocess using random projections, PCA, ridge penalty
The high quality is what lets random projections still work. Ridge and PCA behave like each other.
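For intuition, here is a short sketch (not part of the original solution) of the mechanics of the three "safe" options on synthetic d ≫ n data; the sizes and the data-generating model are made up, and this only illustrates how each preprocessing step is wired up, not a claim about which performs best.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 40, 500, 10                    # d >> n, with k << n
X = rng.normal(size=(n, d))
w_true = np.zeros(d); w_true[:5] = 1.0   # only a few features actually matter
y = X @ w_true + 0.1 * rng.normal(size=n)

# Option 1: random projection to k dimensions, then OLS
P = rng.normal(size=(d, k)) / np.sqrt(k)
w_rp = np.linalg.lstsq(X @ P, y, rcond=None)[0]

# Option 2: PCA down to k components, then OLS
Vk = np.linalg.svd(X, full_matrices=False)[2].T[:, :k]
w_pca = np.linalg.lstsq(X @ Vk, y, rcond=None)[0]

# Option 3: ridge penalty on the full d-dimensional problem
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Compare test-set predictions of the three fitted models
X_test = rng.normal(size=(20, d))
y_test = X_test @ w_true
for name, pred in [("rand-proj", X_test @ P @ w_rp),
                   ("pca", X_test @ Vk @ w_pca),
                   ("ridge", X_test @ w_ridge)]:
    print(name, np.mean((pred - y_test) ** 2))
```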
(f) Which methods could yield a transformation to go from the two-dimensional data on the left to the two-dimensional data on the right?
⃝ Use of a kernel
⃝ PCA
⃝ Adding polynomial features
⃝ Random projections
Solution: The natural answer is polynomial features. However, you could also do PCA on a polynomial kernel.
(g) Your friend is training a machine learning model to predict disease severity based on k different health indicators. She generates the following plot, where the value of k is on the x axis.
Which of these might the y axis represent?
⃝ Training Error
⃝ Bias
⃝ Validation Error
⃝ Variance
Solution: As k increases, the number of features in the ML model increases.
Full credit was given to anyone who responded Training Error, Bias, Validation Error or
Training Error, Bias.
It is most important to recognize that training error and bias decrease with the number of features in the ML model. The reason that validation error is a correct answer is because it may be the case that the behavior changes after k = 13. Given this graph, it is not possible to know for sure.
Variance is ambiguous in general, but if we had interpreted it to mean the impact of the training noise, we now understand that in a severely overparameterized model, that can also drop with more features. For example, suppose there were only two training points.
(h) Your friend is training a machine learning model to predict disease severity based on k different
health indicators. She generates the following plot, where the value of k is on the x axis.
Which of these might the y axis represent?
⃝ Training Error
⃝ Bias
⃝ Validation Error
⃝ Variance
Solution: As k increases, the number of features in the ML model increases.
Full credit was given to anyone who responded Variance, Validation Error or Variance.
It is most important to recognize that variance increases with the number of features in the ML model. The reason that validation error is a correct answer is because it may be the case that the behavior changes before k = 4. Given this graph, it is not possible to know for sure.
The challenge with the word Bias is that the word is ambiguous. If we meant the gap between the true value of the underlying parameter and what survives of the true parameter after learning, then we have seen that overparameterization can behave like increased regularization. This can make the bias increase too. (We didn’t know this when this question was first run.)
(i) Your friend is training a machine learning model to predict disease severity based on k different health indicators. She generates the following plot, where the value of k is on the x axis.
Which of these might the y axis represent?
⃝ Training Error
⃝ Bias
⃝ Validation Error
⃝ Variance
Solution: As k increases, the number of features in the ML model increases. Full credit was given to anyone who responded Validation Error.
As you have seen, the words Bias and Variance are not always clear. Because overparameterization is a thing, those could also show this behavior.
13 Your Own Question
Write your own question, and provide a thorough solution.
Writing your own problems is a very important way to really learn the material. The famous “Bloom’s Taxonomy” that lists the levels of learning is: Remember, Understand, Apply, Analyze, Evaluate, and Create. Using what you know to create is the top-level. We rarely ask you any HW questions about the lowest level of straight-up remembering, expecting you to be able to do that yourself. (e.g. make yourself flashcards) But we don’t want the same to be true about the highest level.
As a practical matter, having some practice at trying to create problems helps you study for exams much better than simply counting on solving existing practice problems. This is because thinking about how to create an interesting problem forces you to really look at the material from the perspective of those who are going to create the exams.
Besides, this is fun. If you want to make a boring problem, go ahead. That is your prerogative. But it is more fun to really engage with the material, discover something interesting, and then come up with a problem that walks others down a journey that lets them share your discovery. You don’t have to achieve this every week. But unless you try every week, it probably won’t happen ever.
Contributors:
• Alexander Tsigler
• Alvin Wan
• Alvin Won
• Anant Sahai
• Chawin Sitawarin
• Esther Rolf
• Fanny Yang
• Jane Yu
• Jennifer Listgarten
• Kate Sanders
• Khalil Sarwari
• M. J. McDonald
• Max Simchowitz
• Peter Wang
• Philipp Moritz
• Raaz Dwivedi
• Rahul Arya
• Ross Boczar
• Sarah Dean
• Satish Rao
• Vignesh Subramanian
• Yang Gao