
EECS 189 Introduction to Machine Learning Fall 2020
This homework is due Wednesday, October 21 at 11:59 p.m.
HW7
The first part of this homework is just a redo of the midterm. In a sense, this is a short part of the assignment since everything is intended to be completable in essentially 2 hours and you’ve already spent 2 hours on it. However, we want to encourage you to take a deep breath and redo the problems or at least check your existing work carefully to make sure that you actually understand what is going on. Talk to your group. Use Piazza to ask questions. Get everything right before having to look at the solutions.
2 Regularization from the Augmentation Perspective (10 points)
Assume w is a d-dimensional Gaussian random vector w ∼ N(0,Σ) and Σ is symmetric positive-
definite. Our model for how the {yi} training data is generated is
y = w^⊤ x + Z,   Z ∼ N(0, 1),   (1)
where the noise variables Z are independent of w and iid across training samples. Notice that all the training {yi} and the parameters w are jointly normal/Gaussian random variables conditioned on the training inputs {xi}. Let us define the standard data matrix and measurement vector:
X = [x_1^⊤; x_2^⊤; ⋯; x_n^⊤] ∈ R^{n×d},   y = [y_1; y_2; ⋯; y_n] ∈ R^n.
In this model, the MAP estimate of w is given by the Tikhonov regularization counterpart of ridge regression:
ŵ = (X^⊤ X + Σ^{-1})^{-1} X^⊤ y.   (2)
In this question, we explore Tikhonov regularization from the data augmentation perspective.
Define the matrix Γ as a d × d matrix that satisfies Γ^⊤ Γ = Σ^{-1}. Consider the following augmented design matrix (data) X̂ and augmented measurement vector ŷ:
X̂ = [X; Γ] ∈ R^{(n+d)×d},   ŷ = [y; 0_d] ∈ R^{n+d},
where 0_d is the zero vector in R^d. Show that the ordinary least squares problem
argmin_w ‖ŷ − X̂ w‖²₂
has the same solution as (2).
(HINT: Feel free to just use the formula you know for the OLS solution. You don’t have to rederive
that. This problem is not intended to be hard or time consuming.)
Solution: The solution to min_w ‖ŷ − X̂ w‖²₂ can be obtained using the OLS formula:
ŵ = (X̂^⊤ X̂)^{-1} X̂^⊤ ŷ
  = ( [X^⊤  Γ^⊤] [X; Γ] )^{-1} [X^⊤  Γ^⊤] [y; 0]
  = (X^⊤ X + Γ^⊤ Γ)^{-1} X^⊤ y + (X^⊤ X + Γ^⊤ Γ)^{-1} Γ^⊤ 0
  = (X^⊤ X + Σ^{-1})^{-1} X^⊤ y.
This exactly agrees with (2), and so we are done.
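To make the augmentation trick concrete, here is a small numerical check (our own sketch, not part of the original solutions): it builds a random problem, forms the augmented pair (X̂, ŷ), and verifies that plain OLS on the augmented data matches the Tikhonov solution in (2). The problem sizes and the particular Σ are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

A = rng.standard_normal((d, d))
Sigma = A @ A.T + np.eye(d)                          # symmetric positive-definite prior covariance
Gamma = np.linalg.cholesky(np.linalg.inv(Sigma)).T   # so that Gamma^T Gamma = Sigma^{-1}

# Tikhonov / MAP solution from equation (2)
w_tik = np.linalg.solve(X.T @ X + np.linalg.inv(Sigma), X.T @ y)

# Ordinary least squares on the augmented data
X_aug = np.vstack([X, Gamma])
y_aug = np.concatenate([y, np.zeros(d)])
w_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

print(np.allclose(w_tik, w_aug))                     # expect True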
3 What is going on? (12 points)
This question relies on your understanding of the behavior of ridge regression and kernel ridge regression as the hyperparameters change. Part (a) uses a piecewise linear function which you have already seen in homework. Parts (b) and (c) show the results of (kernel) ridge regression using heatmaps which you should already be familiar with from homework. For parts (b) and (c) we used the circles dataset which you have already seen, with an 80% train/ 20% test split illustrated in Figure 1 below.
[Figure 1 panels: “Circle Training Dataset” and “Circle Test Dataset”, plotted with labels y = +1 and y = −1.]
Figure 1: Training and test datasets, with partial transparency to show overlapping datapoints. Notice that there are “mislabeled” points in this dataset.
(a) (6 pts) In this part we have performed kernel ridge regression with the RBF kernel k(x, x′) = exp(−γ‖x − x′‖²) on a piecewise linear 1D function which you should recognize from homework. Both the regularization parameter λ and the smoothing parameter γ have been varied in the plots on
the next page. Your goal in this problem is to identify the qualitative level of the γ and λ parameters given the visible plots of what was learned.
Determine whether the regression was over- or under-regularized and whether the smoothing factor was too wide (small γ) or too narrow (large γ), and write the letter of each subfigure from Figure 2 in the appropriate location on your answer sheet following Figure 3. We have provided the location of (b) to get you started. Not all the marked spots will be used and each spot will correspond to at most one letter.
[Figure 2: six subfigures (a)–(f), each showing the kernel ridge prediction, the ground truth, and the training data points over x ∈ [0, 1].]
Figure 2: Match these subfigures to the (γ, λ) hyper-parameter space.
[Figure 3: the (γ, λ) hyper-parameter grid, with axes λ ↑ (overregularized) / λ ↓ (underregularized) and γ ↓ (oversmoothed) / γ ↑ (undersmoothed); the spot for subfigure (b) is filled in as an example.]
Figure 3: Write subfigure letters from Figure 2 in the appropriate location on your answer sheet.
Solution: Your plot of the parameter space should look something like this (Figure 4: approximate parameter mapping):
• overregularized and oversmoothed: (f)
• overregularized and undersmoothed: (e)
• regularization about right, oversmoothed: (a)
• regularization about right, smoothing about right: (b)
• regularization about right, undersmoothed: (d)
• underregularized and undersmoothed: (c)
The key indications are that (d) is clearly undersmoothed because it is more wiggly, but otherwise it is following the curve in (b) pretty closely. This suggests that otherwise, its explicit regularization is just fine. By contrast, (e) is clearly wiggly but has been shrunk closer to zero, and this tells us that (e) is both undersmoothed and overregularized. Going in the opposite qualitative direction to (c), we see the wiggliness but also even wilder oscillations as it tries to get closer to the training points, and so (c) is both undersmoothed and underregularized.
Then we look at (a) and (f). Both are not nearly wiggly enough, and this is a mark of them being oversmoothed. Between the two of them, (f) has clearly been shrunk a lot closer to zero and so it is overregularized as well as being oversmoothed. (a) is then simply oversmoothed. How do we know that it isn’t also underregularized? That is harder to rule out, but there is no visible indication of underregularization.
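If you want to regenerate pictures like these yourself, the following is a minimal sketch (our own, not the course's plotting code) that fits kernel ridge regression with an RBF kernel to noisy samples of a piecewise-linear target for a grid of (γ, λ) values. The target function, noise level, and grid values are illustrative assumptions.

import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
f = lambda x: np.abs(x - 0.5)                     # a stand-in piecewise-linear target
x_train = np.sort(rng.uniform(0, 1, 30))
y_train = f(x_train) + 0.1 * rng.standard_normal(30)
x_plot = np.linspace(0, 1, 200)[:, None]

for gamma in [1.0, 100.0, 10000.0]:               # small gamma = oversmoothed, large = undersmoothed
    for lam in [1e-6, 1e-2, 10.0]:                # small lambda = underregularized, large = overregularized
        model = KernelRidge(alpha=lam, kernel="rbf", gamma=gamma)
        model.fit(x_train[:, None], y_train)
        pred = model.predict(x_plot)              # plot pred against f to see the different regimes
        train_mse = np.mean((model.predict(x_train[:, None]) - y_train) ** 2)
        print(f"gamma={gamma:g}, lambda={lam:g}, train MSE={train_mse:.3f}")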
(b) (3 pts) In this part we have performed kernel ridge regression with the RBF kernel exp(−‖x − x′‖²/(2σ²)) and λ = 0.001. Match each subfigure (a), (b), and (c) of Figure 5 with the appropriate location (1), (2), or (3) on the test error curve of Figure 6. For example, write: (a) → (1); (b) → (2); (c) → (3).
[Figure 5 heatmap panels: (a) Test Error: 0.21, (b) Test Error: 0.56, (c) Test Error: 0.58.]
Figure 5: Heatmaps corresponding to the functions learned by kernel ridge regression. Match these subfigures to the test error curve below. Don’t forget to use the test-error information.
[Figure 6 panel: training and test error versus σ, with locations (1), (2), and (3) marked on the test error curve.]
Figure 6: Train and test errors for varying RBF kernel parameter σ.
Solution: Subfigure (a) has σ = 0.3288, (b) has σ = 0.037, and (c) has σ = 0.651. (a) → (2)
(b) → (1)
(c) → (3)
How could you tell? Getting (a) in the right place is easy by just looking at the test error. For (b), you notice that it is just reverting to being near zero everywhere — this means that the influence of the training points isn’t going out as far. That means that (b) is undersmoothed, which in this case comes from having too small of a width σ for the RBF. For (c), we see that it is oversmoothed by comparison and so has a σ that is too wide.
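A hedged sketch of the σ-sweep behind Figure 6 (using sklearn's make_circles as a stand-in for the course's circles dataset, with an assumed noise level and an 80/20 split):

import numpy as np
from sklearn.datasets import make_circles
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import train_test_split

X, y = make_circles(n_samples=500, noise=0.1, factor=0.5, random_state=0)
y = 2.0 * y - 1.0                                    # labels in {-1, +1}
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

for sigma in np.logspace(-2, 1, 7):
    gamma = 1.0 / (2.0 * sigma ** 2)                 # RBF kernel exp(-||x - x'||^2 / (2 sigma^2))
    model = KernelRidge(alpha=0.001, kernel="rbf", gamma=gamma).fit(X_tr, y_tr)
    train_err = np.mean((model.predict(X_tr) - y_tr) ** 2)
    test_err = np.mean((model.predict(X_te) - y_te) ** 2)
    print(f"sigma={sigma:7.3f}: train={train_err:.3f}, test={test_err:.3f}")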
(c) (3 pts) In this part we have performed ridge regression on the circles dataset with polynomial features of increasing degree D. After examining Figure 7, match each subfigure (a), (b), and (c) with the appropriate location (1), (2), or (3) on the train/test error plot in Figure 8. For example, write: (a) → (1); (b) → (2); (c) → (3).
[Figure 7 heatmap panels: (a) Test Error: 0.29, (b) Test Error: 0.29, (c) Test Error: 0.22.]
Figure 7: Heatmaps corresponding to the functions learned by ridge regression. Match these subfigures to the test error curve below. Don’t forget to use the test-error information.
[Figure 8 panel: training and test error versus polynomial degree (5 to 30), with locations (1), (2), and (3) marked.]
Figure 8: Train and test errors for increasing polynomial degrees.
Solution: Subfigure (a) is degree 27, (b) is degree 8, and (c) is degree 15.
(a) → (3)
(b) → (1)
(c) → (2)
This is the most intriguing one of all. You can see from the test error numbers that (c) must be in position (2) since it has the smallest test error. Looking at (a), we see more rapid changes in the function and more responsiveness to the details of where the training points are and what their values are — this is a signature of a higher degree polynomial as compared to the more smooth behavior of (b) which is a signature of a lower-degree polynomial.
The interesting thing here is that (b) looks more “correct” than (c) does, even though it is getting a higher test error. There are two lessons here. First, one has to wonder if the problem here wasn’t in the model but in the loss function — maybe squared error was making life more difficult for this data set. Notice however that the data set that we were given didn’t have any qualitatively different points in it, so this is probably not the main issue. The second question here is whether we should always just pick the minimum validation error when choosing a model. What this example shows is that when the validation data might be coming from a finite data set, maybe we should be a little bit biased towards going for the simpler model even when the losses are nearly tied. This theme is developed further in the next problem.
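A similar sketch of the degree sweep behind Figure 8 (again our own stand-in data via make_circles, with λ = 0.001 assumed from the problem statement):

import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_circles(n_samples=500, noise=0.1, factor=0.5, random_state=0)
y = 2.0 * y - 1.0
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

for degree in range(1, 31):
    model = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=0.001))
    model.fit(X_tr, y_tr)
    test_err = np.mean((model.predict(X_te) - y_te) ** 2)
    print(f"degree={degree:2d}: test MSE={test_err:.3f}")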
4 Validation can sometimes fail (40 points)
This problem is a simplified setting where we investigate the probability that validation ends up picking a more complicated false model instead of the simpler true model. Consider the setting of learning a one-dimensional function from noisy samples.
In this problem, suppose the true underlying function is the constant zero function. We have access to noisy training data consisting of nt data points that we arrange into a pair of nt-dimensional vectors (xt,yt). Because the underlying function is the zero function, we have:
yt =εt,
where εt ∼ N (0, σ2 Int ). Further we have access to similarly noisy validation data that has been
arranged into a pair of nv-dimensional vectors (xv, yv) where yv = εv,
and ε_v ∼ N(0, σ² I_{n_v}). We assume ε_v is independent of ε_t. For ease of computation, assume
‖x_t‖²₂ = n_t and ‖x_v‖²₂ = n_v.
We use the training data to learn models and then use the validation data to choose the better
performing one.
In this simplified case, the 0-th model is just the constant zero. There are no parameters to learn for this model.
The 1-st model is a simple linear function f (x) = wx where we need to learn the single parameter w from the training data.
For this problem, the various parts are largely independent of each other even though they collectively tell one story. Since this is an exam, don’t get bogged down in any one part and lose easy points elsewhere.
(a) (4 pts) Suppose that we just decided to perform OLS on the validation data and computed
ŵ_v = argmin_{w∈R} ‖y_v − x_v w‖²₂.
Just use the formula for OLS to show that
ŵ_v = (1/n_v) x_v^⊤ ε_v,
and give the mean and variance of the Gaussian random variable ŵ_v.
Here, we are viewing this hypothetical ŵ_v as a random variable because the noise in the validation data is random.
Solution: Using the OLS formula, we have
ŵ_v = (x_v^⊤ x_v)^{-1} x_v^⊤ y_v = (1/n_v) x_v^⊤ ε_v,
where the second equality follows immediately since ‖x_v‖²₂ = n_v and y_v = ε_v. Since ε_v is zero-mean, a linear combination of its components must be zero-mean. Further,
E[ŵ_v²] = (1/n_v²) E[x_v^⊤ ε_v ε_v^⊤ x_v] = (1/n_v²) x_v^⊤ E[ε_v ε_v^⊤] x_v = σ²/n_v.
Thus ŵ_v ∼ N(0, σ²/n_v).
(b) (4 pts) By doing a calculation analogous to the previous part, you can understand the actually learned parameter ŵ_t obtained by performing OLS on the training data. This gives
ŵ_t = (1/n_t) x_t^⊤ ε_t,
and this has a Gaussian distribution ŵ_t ∼ N(0, σ²/n_t). Here, the randomness in ŵ_t comes from the random noise in the training data.
Solution:
Just for your information since we didn’t ask, we can perform OLS on the training data and learn
ŵ_t = argmin_{w∈R} ‖y_t − x_t w‖²₂.
We have
ŵ_t = (x_t^⊤ x_t)^{-1} x_t^⊤ y_t = (1/n_t) x_t^⊤ ε_t.
Further,
E[ŵ_t²] = (1/n_t²) E[x_t^⊤ ε_t ε_t^⊤ x_t] = (1/n_t²) x_t^⊤ E[ε_t ε_t^⊤] x_t = σ²/n_t.
Thus ŵ_t ∼ N(0, σ²/n_t).
You now have the marginal distributions for ŵ_t and ŵ_v. Find the joint distribution of ŵ_t and ŵ_v, i.e. find the distribution of
w = [ŵ_t; ŵ_v].
Solution: Note that ε_t and ε_v are independent of each other, so any individual linear combinations of their components would also be independent of each other. Consequently, w is jointly Gaussian with mean and covariance matrix given by:
E[w] = [E[ŵ_t]; E[ŵ_v]] = [0; 0],
E[w w^⊤] = [[E[ŵ_t²], E[ŵ_t ŵ_v]], [E[ŵ_t ŵ_v], E[ŵ_v²]]] = [[σ²/n_t, 0], [0, σ²/n_v]].
(c) (12 pts) Suppose we have the following two choices for the predictor:
Choice 0: f̂(x) = 0.
Choice 1: f̂(x) = x ŵ_t.
We use the validation data to determine which choice to use by computing the validation errors:
E_0^val = ‖y_v‖²₂,   E_1^val = ‖y_v − x_v ŵ_t‖²₂.
If E_0^val ≤ E_1^val, we correctly choose the zero predictor. If E_0^val > E_1^val, we choose the linear model (Choice 1).
Since the true function is actually zero, the probability that validation fails is given by P(Validation fails) = P(E_1^val < E_0^val).
Show that:
P(E_1^val < E_0^val) = P( (ŵ_v − ŵ_t)² < ŵ_v² ).
(Hint 1: Start expanding E_1^val − E_0^val.
Hint 2: At some point, you’ll need to expand out squares ‖v‖² = v^⊤ v.
Hint 3: At an opportune point, use the trick y_v = y_v − x_v ŵ_v + x_v ŵ_v.
Hint 4: Remember that the orthogonality principle in ordinary least squares tells us that the residual vector y_v − x_v ŵ_v is orthogonal to the vector x_v.)
Solution: We have
E_1^val − E_0^val = ‖y_v − x_v ŵ_t‖²₂ − ‖y_v‖²₂
  = ‖y_v − x_v ŵ_v + x_v ŵ_v − x_v ŵ_t‖²₂ − ‖y_v − x_v ŵ_v + x_v ŵ_v‖²₂
  = ‖y_v − x_v ŵ_v‖²₂ + ‖x_v ŵ_v − x_v ŵ_t‖²₂ + 2(y_v − x_v ŵ_v)^⊤ x_v (ŵ_v − ŵ_t) − ‖y_v − x_v ŵ_v‖²₂ − ‖x_v ŵ_v‖²₂ − 2(y_v − x_v ŵ_v)^⊤ x_v ŵ_v
  = ‖x_v ŵ_v − x_v ŵ_t‖²₂ − ‖x_v ŵ_v‖²₂ − 2(y_v − x_v ŵ_v)^⊤ x_v ŵ_t
  = ‖x_v‖²₂ ( (ŵ_v − ŵ_t)² − ŵ_v² )     (the remaining cross term vanishes by orthogonality)
  = n_v ( (ŵ_v − ŵ_t)² − ŵ_v² ).
Thus,
P(E_1^val < E_0^val) = P(E_1^val − E_0^val < 0) = P( n_v ((ŵ_v − ŵ_t)² − ŵ_v²) < 0 ) = P( (ŵ_v − ŵ_t)² < ŵ_v² ).
There are also other ways to get here, some of which directly take you to the expression in the next part.
(d) (4 pts) Simple algebra takes the previous result P(E_1^val < E_0^val) = P((ŵ_v − ŵ_t)² < ŵ_v²) and simplifies it to P(E_1^val < E_0^val) = P( ŵ_t (ŵ_t − 2 ŵ_v) < 0 ). Assume that ŵ_t > 0. Draw a one-dimensional region corresponding to values of ŵ_v for which validation fails.
Your drawing should have a 1d real line with both zero and ŵ_t marked, and a shaded region that corresponds to those values of ŵ_v for which validation would make the wrong choice. Be sure to label the boundary of the region.
Solution: The values of ŵ_v for which validation fails are precisely those that are above (1/2) ŵ_t, so we obtain a region as shown below.
[Figure: the real line with 0 and ŵ_t marked; the shaded region is ŵ_v > ŵ_t/2, with boundary at ŵ_t/2.]
This is because we need ŵ_t (ŵ_t − 2 ŵ_v) < 0, but if ŵ_t > 0, the only way that can happen is if ŵ_t − 2 ŵ_v < 0, or equivalently: ŵ_v > (1/2) ŵ_t.
(e) (12 pts) Use the illustration in Fig. 9 and the fact that P(E_1^val < E_0^val) = P( ŵ_t (ŵ_t − 2 ŵ_v) < 0 ) to show that the probability validation fails is given by
P(E_1^val < E_0^val) = (1/π) cos⁻¹( 1/√(1 + 4 n_t/n_v) ).
(HINT: The illustration tells us that for a pair of jointly Gaussian, zero-mean random variables X_1, X_2 with covariance matrix [[a, c], [c, b]], we have P(X_1 > 0, X_2 > 0) = (1/(2π)) cos⁻¹( −c/√(ab) ); ask yourself what the relevant 2d Gaussian random variables are here that you need to think about to understand the probability of the event we want to understand.)
Solution: We have
P(E_1^val < E_0^val) = P(ŵ_t > 0, ŵ_t − 2ŵ_v < 0) + P(ŵ_t < 0, ŵ_t − 2ŵ_v > 0).
These two probabilities are clearly the same by symmetry. (You could also have dealt with the two cases in an identical manner, just flipping both signs.) Let X_1 = ŵ_t and X_2 = 2ŵ_v − ŵ_t. Recall from the previous parts that
E[ŵ_t] = E[ŵ_v] = 0,  E[ŵ_t ŵ_v] = 0,  E[ŵ_t²] = σ²/n_t,  E[ŵ_v²] = σ²/n_v.
Thus X = [X_1; X_2] ∼ N(0, Σ) where Σ = [[a, c], [c, b]] with
a = E[X_1²] = E[ŵ_t²] = σ²/n_t,
b = E[X_2²] = E[ŵ_t²] + 4 E[ŵ_v²] − 4 E[ŵ_t ŵ_v] = σ² (1/n_t + 4/n_v),
c = E[X_1 X_2] = 2 E[ŵ_t ŵ_v] − E[ŵ_t²] = −σ²/n_t.
Note that
−c/√(ab) = (σ²/n_t) / √( (σ²/n_t) · σ² (1/n_t + 4/n_v) ) = 1/√(1 + 4 n_t/n_v).
Thus
P(E_1^val < E_0^val) = P(ŵ_t > 0, ŵ_t − 2ŵ_v < 0) + P(ŵ_t < 0, ŵ_t − 2ŵ_v > 0)
  = P(X_1 > 0, X_2 > 0) + P(X_1 < 0, X_2 < 0)
  = 2 P(X_1 > 0, X_2 > 0)
  = 2 · (1/(2π)) cos⁻¹( −c/√(ab) )
  = (1/π) cos⁻¹( 1/√(1 + 4 n_t/n_v) ),
where in the third equality we used symmetry and in the last equality we used the result from the hint.
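A quick Monte Carlo sanity check of this formula (our own sketch, not part of the solutions): draw ŵ_t ∼ N(0, σ²/n_t) and ŵ_v ∼ N(0, σ²/n_v) independently and estimate P(ŵ_t(ŵ_t − 2ŵ_v) < 0) directly.

import numpy as np

def p_fail_closed_form(n_t, n_v):
    # the probability derived above: (1/pi) * arccos(1 / sqrt(1 + 4 n_t / n_v))
    return np.arccos(1.0 / np.sqrt(1.0 + 4.0 * n_t / n_v)) / np.pi

def p_fail_monte_carlo(n_t, n_v, sigma=1.0, trials=1_000_000, seed=0):
    rng = np.random.default_rng(seed)
    w_t = rng.normal(0.0, sigma / np.sqrt(n_t), trials)
    w_v = rng.normal(0.0, sigma / np.sqrt(n_v), trials)
    return np.mean(w_t * (w_t - 2.0 * w_v) < 0.0)

for n_t, n_v in [(50, 50), (50, 5000), (5000, 50)]:
    print(n_t, n_v, p_fail_closed_form(n_t, n_v), p_fail_monte_carlo(n_t, n_v))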
(f) (4 pts) Use what you proved in the previous part, P(E_1^val < E_0^val) = (1/π) cos⁻¹( 1/√(1 + 4 n_t/n_v) ), to evaluate P(E_1^val < E_0^val) when n_t = 50, n_v → ∞, as well as evaluate P(E_1^val < E_0^val) when n_t → ∞, n_v = 50. Feel free to use the facts that cos⁻¹(1) = 0 and cos⁻¹(0) = π/2.
Solution: For the first case, 4 n_t/n_v → 0, so the argument of cos⁻¹ tends to 1 and P(E_1^val < E_0^val) → (1/π) cos⁻¹(1) = 0. For the second case, the argument tends to 0, so P(E_1^val < E_0^val) → (1/π) cos⁻¹(0) = 1/2.
[The following paragraph is the tail end of the solution to a separate problem about categorical “coupon” features and ridge regression, whose statement is not included in this excerpt.]
When d > 2(m + n), one can give a better-looking lower bound on the probability: the predictions are exactly the same with probability at least exp(−(m + n)²/d). The derivation of this fact is given below. It uses the inequality ln(1 − x) ≥ −2x for x ∈ [0, 0.5]:
∏_{i=0}^{m+n−1} (1 − i/d) = exp( ∑_{i=0}^{m+n−1} ln(1 − i/d) ) ≥ exp( −2 ∑_{i=0}^{m+n−1} i/d ) = exp( −(m + n − 1)(m + n)/d ) ≥ exp( −(m + n)²/d ).
This means that when there are significantly more coupon types than there are training or test points, there is literally no difference between having these categorical features and doing ridge regression. In other words, you can add junk features to your data (not only training, but test too!) completely independently, and this will do exactly ridge regularization if the number of your junk features is high enough!
7 Local approximation using neighbors (20 points)
Suppose we are learning a one-dimensional function f. We have access to noisy samples (x_i, y_i) with x_i, y_i ∈ R. We consider the estimator
f̂(x) = ∑_i y_i h(x − x_i),
where h(x − x_i) is the local influence of the i-th training sample on the value of f̂ at x.
(a) (5 pts) Suppose our training samples x_i were equally spaced (∆ apart: so x_{i+1} = x_i + ∆) and ordered (i.e., x_i < x_j for i < j), with i going through the integers from −∞ to +∞. If f̂ is a k-nearest neighbor estimator (i.e., the prediction for a test point is the average of the closest k samples from the training data), and k = 2l for some strictly positive integer l, what is the relevant local influence function h(·) that would give this estimator? Your answer should involve l and ∆ in some way.
(HINT: Draw what is going on to help.)
Solution:
h(x) = 1/(2l) if −l∆ ≤ x < l∆, and 0 otherwise.
This is because with that window, we get exactly k = 2l of the regularly spaced neighbors, and since we want to average them, we need to divide by 2l.
(b) (15 pts) Now, let us consider an explicit function that we want to learn: the absolute value function f(x) = |x|. Suppose we have training samples (x_i, y_i) that are separated by ∆ = 1 with x_i = i + 1/2 and y_i = f(x_i) + ε_i for all integers i. The training noise is i.i.d. with ε_i ∼ N(0, σ²).
If our test point is just x = 0, our goal is to understand the best value of the hyperparameter l so that our mean squared error on this test point would be minimized when using a k = 2l-nearest neighbor estimator. What l should we choose as a function of the training noise standard deviation σ?
(HINT: The only randomness here is in the noise on the training points. To understand the dependence of terms you might want to make sure you understand how the mean-squared error behaves when l = 1 and l = 2 before working with a formula. What gets worse as you increase l? What might be getting better?)
Depending on how you choose to solve this, you might find the formula for the sum of an arithmetic series useful: ∑_{i=1}^{l} i = l(l + 1)/2.
Feel free to compute a possibly continuous value for l — on the exam, don’t worry about what the right way to round it might be.
Solution: First we find an expression for the MSE at a test point (x_v, y_v), where y_v = f(x_v) + ε_v and ε_v ∼ N(0, σ²), just like the training data:
E_val = E[ |y_v − f̂(x_v)|² ] = E[ | f(x_v) − ∑_i f(x_i) h(x_v − x_i) + ε_v − ∑_i ε_i h(x_v − x_i) |² ]
  = | f(x_v) − ∑_i f(x_i) h(x_v − x_i) |² + σ² ( 1 + ∑_i h(x_v − x_i)² ).
We observe this is a bias–variance style decomposition for the error. Now we plug in the explicit form of the estimator f̂. The bias term is
f(0) − (1/(2l)) ∑_{i=−l}^{l−1} f(i + 1/2) = −(2/(2l)) (1/2 + 3/2 + ⋯ + (2l−1)/2) = −(1/(2l)) (1 + 3 + ⋯ + (2l − 1)) = −(1/(2l)) l² = −l/2.
Therefore, the validation error is
E_val = | (1/(2l)) ∑_i f(x_i) |² + σ² (1 + 1/(2l)) = l²/4 + σ² (1 + 1/(2l)).
Differentiating with respect to l and setting the result equal to zero yields
0 = l/2 − σ²/(2l²),
so l* = σ^{2/3}.
Now, the new part of the homework begins on material that you haven’t done homework on before.
8 l1-Penalized Linear Regression: LASSO
The l1-norm is one of the popular penalty functions used to enhance the robustness of regression models. Regression with an l1-penalty is referred to as LASSO regression. One of its most important aspects is that it promotes sparsity in the resulting solution. In this problem, we will explore the optimization of the LASSO in a simplified setting. Assume the training data points are denoted as the rows of an n × d matrix X and their corresponding output values as an n × 1 vector y. The generic parameter vector and its optimal value (relative to the LASSO cost function) are represented by d × 1 vectors w and ŵ, respectively. For the sake of simplicity, we assume columns of data have been standardized to have mean 0 and variance 1, and are also uncorrelated (i.e. X^⊤ X = nI).
(We center the data mean to zero so that the penalty treats all features similarly. We assume uncorrelated features as a simplifying assumption in order to reason about LASSO in this question. In general, this is a major simplification, since the power of regularizers is that they often enable us to make reliable inferences even when we do not have as many samples as we have raw features.)
For LASSO regression, the optimal parameter vector is given by:
ŵ = argmin_w { J_λ(w) = (1/2) ‖y − Xw‖²₂ + λ ‖w‖₁ },
where λ > 0.
(a) Show that for data with uncorrelated features, one can learn the parameter w_i corresponding to each i-th feature independently from the other features, one at a time, and get a solution which is equivalent to having learned them all jointly as we normally do.
Hint: To show this, write J_λ(w) in the following form for appropriate functions g and f:
J_λ(w) = g(y) + ∑_{i=1}^{d} f(X_i, y, w_i, λ),
where X_i is the i-th column of X. By having no interaction terms that link the different w_i variables, you know that the joint optimization can be decomposed into individual optimizations.
Solution: Considering X_i as the i-th column of X:
(1/2) ‖y − Xw‖²₂ + λ ‖w‖₁ = (1/2) y^⊤ y + ∑_{i=1}^{d} { −y^⊤ X_i w_i + (n/2) w_i² + λ |w_i| }.
Notice that we achieved the form
J_λ(w) = g(y) + ∑_{i=1}^{d} f(X_i, y, w_i, λ).
Using this with the hint we get the solution.
(b) Assume that ŵ_i > 0. What is the value of ŵ_i in this case?
Solution: If ŵ_i > 0, then we want to minimize
−y^⊤ X_i w_i + (n/2) w_i² + λ |w_i| = −y^⊤ X_i w_i + (n/2) w_i² + λ w_i.
Since the function above is convex, taking the derivative and equating to zero will give us the minimum. So we get:
ŵ_i = (1/n) (y^⊤ X_i − λ).
This is only valid if (1/n) (y^⊤ X_i − λ) is indeed greater than zero.
(c) Assume that ŵ_i < 0. What is the value of ŵ_i in this case?
Solution: If ŵ_i < 0, then we want to minimize
−y^⊤ X_i w_i + (n/2) w_i² + λ |w_i| = −y^⊤ X_i w_i + (n/2) w_i² − λ w_i.
Take the derivative and equate to zero; we have:
ŵ_i = (1/n) (y^⊤ X_i + λ).
This is only valid if (1/n) (y^⊤ X_i + λ) is indeed less than zero.
(d) From the previous two parts, what is the condition for ŵ_i to be zero? If you put this part together with the previous parts, you can understand why this is often referred to as “soft thresholding.”
Solution: At w_i = 0 the derivative is not defined; it is a critical point. The optimum will be there if no zeros exist for the derivative. From the previous parts, we know ŵ_i = 0 if none of the above conditions hold, that is:
y^⊤ X_i + λ ≥ 0,   y^⊤ X_i − λ ≤ 0.
Combining them, we get
−λ ≤ y^⊤ X_i ≤ λ.
(e) Now consider the ridge regression problem. In ridge, the regularization term is replaced by λ‖w‖²₂, where the optimal parameter vector is now given by:
ŵ = argmin_w { J_λ(w) = (1/2) ‖y − Xw‖²₂ + λ ‖w‖²₂ },
where λ > 0.
What is the condition for w􏰓i = 0? How does it differ from the condition you obtained in
the previous part? Can you see why the l1 norm promotes sparsity?
Solution: If the LASSO penalty is replaced by λ‖w‖²₂, the optimization problem for w_i is given by:
−y^⊤ X_i w_i + (n/2) w_i² + λ w_i².
Take the derivative and equate to zero:
ŵ_i = y^⊤ X_i / (n + 2λ).
It is equal to zero only if y^⊤ X_i = 0 exactly or λ goes to infinity. For a finite λ, if y is noisy this will never happen. In contrast, ŵ_i = 0 whenever |y^⊤ X_i| ≤ λ in LASSO regression. This is why the l1-norm regularization encourages sparsity.
(f) Assume that we have a sparse image vectorized in the vector w (so w is a sparse vector). We have a Gaussian n × d matrix X and an n × 1 noise vector z. Our measurements take the form y = Xw + z. We want to extract the original image w given matrix X, knowing that this image is sparse. The fact that w is sparse suggests using l1 regularization. Use the provided Jupyter notebook and apply l1 regularization.
(This might remind you of EE16A. That’s intended. OMP is another way to find sparse solutions. OMP is generally faster to run than LASSO and can be understood without as much optimization background. The issue with vanilla OMP is that while you can view when to stop as a hyperparameter, this isn’t quite the same as being able to tune λ. It turns out that there are algorithms that are inspired by OMP that can spiritually compute something pretty close to what LASSO would compute as you vary the λ parameter. One such algorithm is called LARS, but it is out of scope for this course.)
Change the hyperparameter λ to extract the best looking image and report it.
Solution: Please refer to the provided code. Any λ on the order of 10^{-7} to 10^{-5} should give you a good reconstruction.
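A minimal check of the soft-thresholding formula from parts (b)–(d) (our own sketch, under the X^⊤X = nI assumption; the data here are synthetic): for orthogonal features the LASSO minimizer is ŵ_i = soft(y^⊤X_i, λ)/n, which we compare against a direct numerical minimization of J_λ.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d, lam = 50, 5, 2.0
Q, _ = np.linalg.qr(rng.standard_normal((n, d)))
X = np.sqrt(n) * Q                                   # columns satisfy X^T X = n I
y = rng.standard_normal(n)

def soft_threshold(z, lam):
    # shrink toward zero by lam, zeroing out anything smaller than lam in magnitude
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

w_closed = soft_threshold(X.T @ y, lam) / n          # coordinate-wise closed form

def J(w):                                            # the LASSO objective from the problem
    return 0.5 * np.sum((y - X @ w) ** 2) + lam * np.sum(np.abs(w))

w_num = minimize(J, np.zeros(d), method="Nelder-Mead",
                 options={"xatol": 1e-9, "fatol": 1e-12, "maxiter": 50000}).x

print(np.max(np.abs(w_closed - w_num)))              # should be small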
9 Contamination of Sparse Linear Models Obtained by Thresholding
In this question, we will analyze the benefits of learning sparse linear models when the true pattern is indeed sparse. In particular, we will analyze two simple procedures (computing the OLS solution and then just keeping the biggest entries or just keeping entries bigger than a threshold) that perform feature selection in linear models, and show quantitatively that feature selection lowers the resulting contamination in a way that reduces the mean-squared prediction error. This is sometimes viewed as a “variance reduction” in the literature.
This should make sense to you at an intuitive level for two reasons. The first should be pretty straightforward: zeroing out a bunch of feature weights when the truth is sparse is going to reduce contamination if it is contaminating weights that are being zeroed out. A more roundabout way is to think about model complexity (using a vague analogy with polynomial degrees, etc…) since enforcing sparsity is equivalent to deliberately constraining model complexity. Simpler models should be easier to learn in a more reliable way as long as we have enough data.
There are, however, some concerns. The most basic one is that you might be worried that the thresholding procedure will end up eliminating the true signal itself! How do we know that the true signal will make the cut? If the true signal doesn’t survive, then we could get bad prediction error performance from that fact, notwithstanding any reduction in contamination.
The other issue is more subtle. There is a difference between feature selection before looking at training data and after seeing the training data. If we use fewer features to begin with (and we have enough training data), our intuitions so far imply that we will have lower prediction error because of smaller model complexity — there are simply fewer features that the training noise can put contamination into.
What we learn from working through this problem is that selecting features adaptively (that is, based on the training data itself) does not hurt either, if done properly. In other words, although there is a philosophical difference between doing feature selection before or after using the data, post training feature selection still leads to improved prediction error (via contamination reduction) under certain assumptions.
First, some setup. Data from a sparse linear model is generated using
y=Xw∗ +z,
where y ∈ R^n denotes a vector of responses, X ∈ R^{n×d} is our data matrix, w* ∈ R^d is an unknown, s-sparse vector of true parameters (with at most s non-zero entries), and z ∼ N(0, σ²I_n) is an n-dimensional vector of i.i.d. Gaussian noise of variance σ². Notice the critical difference from the Bi-Level model you had seen earlier when we were discussing overparameterized models. In that model, we knew which s features were favored and where the true parameters lay. In this model, all that we know is that the true parameter vector is sparse — we don’t know which features are favored before seeing the labels.
The solution to first three parts of the problem can be filled out in the provided Jupyter notebook. The point of the first three parts is to make sure that you understand the magnitude of this effect. The rest is designed to help you see why this effect is there. Parts (d)-(j) must have a separate,
written solution. All logarithms are to the base e.
(a) Let us first do some numerical exploration. In the provided Jupyter notebook, you will find
code to generate and plot the behavior of the ordinary least squares algorithm.
Solution: See solution notebook.
The error is defined as ||w∗ − wˆ ||2. n is the number of examples, d is the number of feature dimensions and s is the number of non-zero entries in the weights w. The 3 figures below plots the error versus n, d, s respectively.
Above, we plotted the normalized prediction error over: 1. n (in the range (100, 2000)), with d = 100 and s = 5. 2. d (in the range (10, 1000)), with n = 1000 and s = 5. 3. s (in the range (5, 50)), with n = 1000 and d = 100.
Clearly, these plots are expected. The first has slope -1, showing that the error has an inverse dependence on n. The second is linear, with slope roughly 1/1000 = 1/n, as expected. The third shows that the least squares error does not depend at all on the sparsity s, as expected.
See Figures 10, 11 and 12 for details.
Figure 10: OLS log error vs log n. Errors in all figures are defined as ||w∗ − wˆ ||2. In this case d = 100 and s = 5. It has slope -1, showing that the error has an inverse dependence on n.
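For reference, here is a sketch of the data generation and the OLS baseline used here (our own re-implementation; the provided notebook is the authoritative version). With i.i.d. Gaussian features, X^⊤X ≈ nI and the parameter error ‖w* − ŵ_LS‖² should scale roughly like σ²d/n, matching the slopes above.

import numpy as np

def sparse_model_data(n, d, s, sigma=1.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    X = rng.standard_normal((n, d))                  # i.i.d. Gaussian features
    w_star = np.zeros(d)
    support = rng.choice(d, size=s, replace=False)   # unknown s-sparse support
    w_star[support] = rng.standard_normal(s)
    y = X @ w_star + sigma * rng.standard_normal(n)
    return X, y, w_star

rng = np.random.default_rng(0)
for n in [100, 200, 500, 1000, 2000]:
    X, y, w_star = sparse_model_data(n, d=100, s=5, rng=rng)
    w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(n, np.sum((w_ls - w_star) ** 2))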
(b) In this problem, we will analyze two estimators that explicitly take into account the fact that w∗ is sparse, and consequently attain lower error than the vanilla least-squares estimate.
Let us define two operators. Given a vector v ∈ Rd, the operation τk(v) zeroes out all but the top k entries of v measured in absolute value. The operator Tλ(v), on the other hand, zeros out all entries that are less than λ in absolute value.
Recall that the least squares estimate was given by w􏰓LS = X†y = X⊤y, where X† is the appro- priate pseudo-inverse of X. As we did earlier, we’ll make the simplifying assumption that X
Figure 11: OLS error vs d. In this case, n = 1000 and s = 5. It is linear, with slope roughly 1/1000 = 1/n.
Figure 12: OLS error vs s. In this case, n = 1000 and d = 100. The least squares error does not depend at all on the sparsity s, as expected.
has orthonormal columns even in the training data. (Notice that by making this assumption, we are also making it clear that the input data matrix X doesn’t give any hints as to which features or linear combinations of features are to be preferred — everything looks equally strong as a feature. This is a way of forcing the y to speak if they have anything to say.) We now define
ŵ_top(s) = τ_s(ŵ_LS),   ŵ_T(λ) = T_λ(ŵ_LS),
which are the two sparsity-inducing estimators that we will consider.
Now implement the two estimators described above and plot their performance as a function
of n, d and s.
Here, we are going to assume that the underlying features are also orthonormal relative to the test distribution so the Euclidean-norm squared of the error in estimating the weights is a correct estimate of the prediction error on the test distribution.
Solution: See solution notebook.
Same as above, the error is defined as ||w∗ − wˆ ||2. n is the number of examples, d is the number of feature dimensions and s is the number of non-zero entries in the weights w. The 3 figures below plots the error versus n, d, s respectively. Here for each of the plots, we plot 3 methods: least square, the top s estimator and the threshold estimator.
We see here that the sparsity seeking estimators are way better than the least squares estima- tor as long as the sparsity is small enough. This advantage is most stark in the case of d- dependence, where we see that while the least squares estimator has error depending linearly on d, the sparsity-seeking estimators have way lower dependence (logarithmic, as predicted by our theory).
See Figures 13, 14 and 15 for details. We also included the oracle performance, where we are magically given which entries of the weight w are non-zero, denoted as LS oracle in the figures. Note that you don’t need to include the oracle in your solution.
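Here is a sketch of the two sparsity-seeking estimators themselves (our own, under the orthonormal-columns assumption; the signal magnitude and the threshold value below are illustrative choices):

import numpy as np

def top_s(v, s):
    # zero out all but the s largest-magnitude entries of v
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-s:]
    out[keep] = v[keep]
    return out

def hard_threshold(v, lam):
    # zero out entries of v smaller than lam in absolute value
    return np.where(np.abs(v) >= lam, v, 0.0)

rng = np.random.default_rng(0)
n, d, s, sigma = 1000, 100, 5, 1.0
Q, _ = np.linalg.qr(rng.standard_normal((n, d)))     # X with orthonormal columns
w_star = np.zeros(d)
w_star[:s] = 10.0
y = Q @ w_star + sigma * rng.standard_normal(n)

w_ls = Q.T @ y                                       # OLS when X^T X = I
lam = 2.0 * sigma * np.sqrt(np.log(d))               # threshold scale suggested by part (d)
for name, w_hat in [("LS", w_ls), ("top-s", top_s(w_ls, s)),
                    ("threshold", hard_threshold(w_ls, lam))]:
    print(name, np.sum((w_hat - w_star) ** 2))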
Figure 13: Three estimators’ log error vs log n. LS is the ordinary least squares estimator, Tops is the ŵ_top(s) estimator and Thresh is the ŵ_T(λ) estimator. In this case d = 100 and s = 5. The sparsity aware estimators are much better than OLS. The slopes of Tops and Thresh are the same as that of LS.
Figure 15: Three estimators’ error vs s. In this case, n = 1000, d = 100. LS does not have any dependence on the underlying model sparsity, while the sparsity aware estimators’ errors increase linearly with the sparsity s.
Figure 14: Three estimators’ error vs d. In this case, n = 1000, s = 5. We see that while the least squares estimator has error depending linearly on d, the sparsity-seeking estimators have way lower dependence (logarithmic, as predicted by our theory).
(c) Now generate data from a non-sparse linear model, and numerically compute the esti- mators for this data. Explain the behavior you are seeing in these plots in terms of the underlying trade-off between approximation error, survival and contamination.
Solution: See solution notebook.
We see here that the top(s) estimator does way worse than expected when it searches for a model with sparsity 5, even when the true sparsity is 25. Clearly, this implies that the estimator can hurt the survival of the true signal, since it is searching within a small class of models that does not contain the truth. This can also be viewed as a kind of approximation error. When the sparsity being searched over is large enough, the error now drops to something reasonable. The dominant term now is the contamination that is due to the training noise entering our estimated weights.
Figure 16: On a non-sparse dataset, three estimators’ log error vs log n. In this case d = 100, s = 5. Since we assume the wrong sparsity level, Tops performs the worst due to high bias.
Figure 17: On a non-sparse dataset, three estimators’ error vs d. In this case, n = 1000, s = 5. The Thresh estimator performs the best. The performance of LS is independent of the true s. Tops still performs worst due to high bias.
See Figures 16, 17 and 18 for details.
(d) In the rest of the problem, we will theoretically analyze the prediction error for the top-k proce- dure for sparse estimation, and try to explain the curves you saw in the numerical explorations above. We will need to use a handy tool, which is a bound on the maximum of d Gaussian random variables.
Show that given d Gaussians {Z_i}_{i=1}^{d} (not necessarily independent) with mean 0 and variance σ², we have
P( max_{i∈{1,2,…,d}} |Z_i| ≥ 2σ√(log d) ) ≤ 1/d.
Hint 1: You may use without proof the fact that for a Gaussian random variable Z ∼ N(0, σ²) and scalar t > 0, we have P{|Z| ≥ t} ≤ e^{−t²/(2σ²)}.
Figure 18: On a non-sparse dataset, three estimators’ error vs s. In this case, n = 1000, d = 100. We see here that the top(s) estimator does way worse than expected when it searches for a model with sparsity 5, even when the true sparsity is 25. Clearly, this implies that the bias of the estimator is high, since it is searching within a small class of models that does not contain the truth. When the sparsity being searched over is large enough, the error now drops to something reasonable.
Hint 2: For the maximum to be large, one of the Gaussians must be large. Now use the union bound.
(If you wanted to, you could prove tighter concentration for the maximum of iid Gaussian random variables by invoking/proving the weak-law-of-large numbers for the maximum of iid random variables and introducing an appropriate tolerance ε, etc. But we don’t ask you to do that here to keep the math lighter.)
Solution: As provided in the hint, the maximum is large if and only if at least one of the individual Gaussians is large. In other words, for every t > 0, we have the inclusion of events
{ max_{i∈{1,2,…,d}} |Z_i| ≥ t } = ∪_{i∈{1,2,…,d}} { |Z_i| ≥ t }.
Using the union bound, we therefore have
P( max_{i∈{1,2,…,d}} |Z_i| ≥ t ) ≤ ∑_{i=1}^{d} P( |Z_i| ≥ t ) ≤ ∑_{i=1}^{d} e^{−t²/(2σ²)} = d e^{−t²/(2σ²)} = e^{−t²/(2σ²) + log d},
where we have used the first hint to bound the probability of each individual Gaussian being large. Finally, substituting t = 2σ√(log d), we have
P( max_{i∈{1,2,…,d}} |Z_i| ≥ 2σ√(log d) ) ≤ e^{−2 log d + log d} = e^{−log d} = 1/d,
which completes the proof.
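A quick simulation of the bound we just proved (our own sketch; d, σ, and the number of trials are arbitrary choices): empirically, the maximum of d Gaussians exceeds 2σ√(log d) with probability that stays below the 1/d guarantee (and in fact is much smaller, since the union bound is loose).

import numpy as np

rng = np.random.default_rng(0)
d, sigma, trials = 1000, 1.0, 5000
Z = sigma * rng.standard_normal((trials, d))         # trials x d Gaussian draws
threshold = 2.0 * sigma * np.sqrt(np.log(d))
empirical = np.mean(np.max(np.abs(Z), axis=1) >= threshold)
print("empirical:", empirical, " union-bound guarantee:", 1.0 / d)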
(e) As in the previous problem, for algebraic convenience, we will restrict attention in this entire problem to the special case where the input data matrix X has orthonormal columns.
Show that ŵ_top(s) can be expressed as the top s entries of the vector w* + z′ in absolute value, where z′ is a vector of i.i.d. Gaussian random variables with variance σ².
Solution: In order to show this, it is sufficient to show that ŵ_LS = w* + z′. After all, the definition says ŵ_top(s) = τ_s(ŵ_LS).
This is easy to do, since
ŵ_LS = X^⊤ y = X^⊤ (X w* + z) = w* + X^⊤ z = w* + z′,
where we have used the fact that X^⊤ X = I. To conclude, we must show that z′ is i.i.d. Gaussian. It is clearly Gaussian, since it is a linear combination of i.i.d. Gaussians. Additionally, E[z′ z′^⊤] = X^⊤ E[z z^⊤] X = σ² I, and a multivariate Gaussian with covariance equal to the scaled identity is i.i.d.
(f) Argue that the (random) error vector e = w􏰓top(s) − w∗ is always (at most) 2s-sparse.
Solution: This is true by definition, since both vectors w∗ and w􏰓top(s) are s-sparse. The differ- ence has the largest number of non-zeros when both of the sparsity patterns are disjoint, and so e is at most 2s-sparse.
(g) Let us now condition on the event E = { max_i |z′_i| ≤ 2σ√(log d) }. Conditioned on this event, show that we have |e_i| ≤ 4σ√(log d) for each index i.
Solution: Conditioned on the event E, we may assume that all the noise is actually bounded by 2σ√(log d) in absolute value. Now notice that by the thresholding operation, we have that
e_i = z′_i if w*_i = 0 and [ŵ_top(s)]_i ≠ 0,
e_i = 0 if w*_i = [ŵ_top(s)]_i = 0,
e_i = −w*_i if w*_i ≠ 0 and [ŵ_top(s)]_i = 0.
In addition, there are at most 2s indices such that one of the latter two cases holds.
If e_i = z′_i, then clearly we have |e_i| ≤ 2σ√(log d) ≤ 4σ√(log d). It remains to handle the final case.
Notice that a non-zero entry of w* (let us call this index k) only gets zeroed out by the thresholding operation if it becomes one of the d − s smallest entries of w* + z′. This can only happen if some zero entry of w* (say entry l) was such that |(w* + z′)_l| > |(w* + z′)_k|, which implies that |z′_l| > |w*_k + z′_k|. Clearly, if |w*_k| > 4σ√(log d), this can never occur, since |z′_i| ≤ 2σ√(log d) for all i.
Consequently, we must have that for any such index k which is zeroed out, |w*_k| ≤ 4σ√(log d), which results in the required bound |e_i| ≤ 4σ√(log d) for each index i.
(h) Conclude that with probability at least 1 − 1/d, we have
‖ŵ_top(s) − w*‖²₂ ≤ 32σ² s log d.
We can understand if you feel that this was a bit too slick, since we never truly engaged with what the true signal was. Essentially, we finessed this issue by saying that the worst random competing false feature can’t be too loud, and so if we lost any parts of the true signal, they weren’t that important to begin with.
Solution: We know that conditioned on the event E from the previous part, we have |e_i| ≤ 4σ√(log d) for each index i. Additionally, there are at most 2s indices i for which e_i is non-zero. Consequently, we have
‖ŵ_top(s) − w*‖²₂ = ∑_{i=1}^{d} e_i² ≤ (2s) · 16σ² log d = 32σ² s log d.
By our previous calculations, the event E occurs with probability at least 1 − 1/d, which completes the proof.
Clearly, we have 32σ² s log d ≤ σ² d whenever s ≤ d/(32 log d). In particular, when d grows with s remaining a constant, we see significantly better estimation by exploiting the sparsity in our models than the naive least squares estimator.
(i) We can view this from a denoising perspective on the original training data. Use the above part to show that with probability at least 1 − 1/d, we have
(1/n) ‖X(ŵ_top(s) − w*)‖²₂ ≤ 32σ² s log d / n.
Solution: By unitary invariance of the Euclidean norm (note that X has orthonormal columns), we have from the calculations of the previous part that
(1/n) ‖X(ŵ_top(s) − w*)‖²₂ = (1/n) ‖ŵ_top(s) − w*‖²₂ ≤ 32σ² s log d / n.
(j) Considering the simple least-squares estimator ŵ_LS = (X^⊤X)^{-1} X^⊤ y, we can show that
E[ (1/n) ‖X(ŵ_LS − w*)‖²₂ ] = σ² d/n, and (9)
E[ ‖ŵ_LS − w*‖²₂ ] = σ² trace( (X^⊤X)^{-1} ), (10)
respectively. Equations (9) and (10) represent the “denoising” squared error and the mean squared error of the parameters of our OLS model, respectively. Recall that in this problem, we have been assuming that the matrix X has orthonormal columns, and so the bounds become
E[ (1/n) ‖X(ŵ_LS − w*)‖²₂ ] = σ² d/n, and (11)
E[ ‖ŵ_LS − w*‖²₂ ] = σ² d. (12)
Compare these to the previous part, assume that the statement proved with probability 1 − 1/d can actually be modified with more work to be an analogous statement that holds with high probability, and conclude that by leveraging the sparsity assumption, we have smaller prediction error as long as s ≤ d/(32 log d).
Solution: Comparing the variance of the sparsity-inducing estimator, 32σ² s log d / n, with the LS variance σ² d/n, we see gains when s ≤ d/(32 log d).
(k) In conclusion, roughly speaking the sparse solution has a mean prediction error upper bound of O(σ² s log d / n) with high probability. On the other hand, the least squares solution has an expected
mean prediction error of σ2 d . Thus when s log d < c × d, the sparse solution has smaller error n than the least square solution, where c is some constant. It turns out that appropriate sparsity seeking estimators can go well beyond the n = d barrier by exploiting core effects similar to what are being discussed here. Fundamentally, truly false features have a hard time by dumb luck competing with true ones because of chance alignment with the training noise. Meanwhile, true features can stand out. This is one of the reasons that OMP works well, but that is not a topic for this problem. Now consider the case if we already knew the important s features to begin with. What would be the prediction squared error of the sparse OLS estimator that just used the s important features? How does this compare to the behavior for the sparse w􏰓top(s) derived above? What is the price of not knowing in advance which are the important features? Solution: If we knew the important s features, the squared error would be σ2 s , comparing n to the error of the sparse estimator, 32σ2 s log d , we pay a price of α log d, where α is some n constant. Sensors, Objects, and Localization 10 In this problem, we will be using gradient descent to solve the problem of figuring out where objects are, given noisy distance measurements. (This is roughly how GPS works and students who have taken EE16A have seen a variation on this problem in lecture and lab. The hope is that this problem will provide you with a way to build a more physical intuition about what is going on with both gradient descent as well as the concept of local minima, etc. and their relationship with machine-learning relevant concepts like noise, number of features, etc.) First, the setup. Let us say there are m sensors and n objects located in a two dimensional plane. The m sensors are located at the points (a1, b1), . . . , (am, bm). The n objects are located at the points HW7, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 39 (x1, y1), . . . , (xn, yn). We have measurements for the distances between the sensors and the objects: Di j is the measured distance from sensor i to object j. The distance measurement has noise in it. Specifically, we model Dij =||(ai,bi)−(xj,yj)||2 +Zij, where Zi j ∼ N(0, 1). The noise is independent across different measurements. Code has been provided for data generation to aid your explorations. This is the first part of this problem. We will return to this in a later homework where you will see how neural network approaches stack up on such problems. (a) Consider the case where m = 7 and n = 1. That is, there are 7 sensors and 1 object. Suppose that we know the exact location of the 7 sensors but not the 1 object. We have 7 measurements of the distances from each sensor to the object Di1 = di for i = 1, . . . , 7. Because the underlying measurement noise is modeled as iid Gaussian, the log likelihood function, which we would like to maximize, is 7 L(x1,y1)=−􏰑(􏰤(ai −x1)2 +(bi −y1)2 −di)2 +C, (13) i=1 where C is a constant. Manually compute the symbolic gradient of the log likelihood function with respect to x1 and y1. Solution: We identify x, y with x1, y1 respectively for notational simplicity. We have Di1 ∼N(􏰤(ai −x)2 +(bi −y)2,1). (14) The log likelihood function is 7 L(x,y)=􏰑logP(Di1 =di) i=1 7 =−􏰑(􏰤(ai −x)2 +(bi −y)2 −di)2 +Constant i=1 For notational simplicity, define φi(x, y) = 􏰤(ai − x)2 + (bi − y)2. 
Then the gradient of φi(x, y) with respect to x, y is ∇φi(x,y)=(x−ai , y−bi )T. φi(x,y) φi(x,y) The gradient of L(x, y) is 7 ∇L(x, y) = ∇[− 􏰑(φi(x, y) − di)2] i=1 7 = −2 􏰑(φi(x, y) − di)∇φi(x, y) i=1 HW7, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 40 =−2􏰑7 φi(x,y)−di(x−ai,y−bi)T, i=1 φi(x, y) where the second line follows by the chain rule and the third line follows from plugging in the gradient of φi directly. (b) The provided code generates • m=7sensorlocations(ai,bi)sampledfromN(0,σ2sI) • n = 1 object locations (x1, y1) sampled from N(μ, σ2oI) • mn = 7 distance measurements Di1 = ||(ai, bi) − (x1, y1)|| + N(0, 1). for μ = [0, 0]T , σs = 100 and σo = 100. Solve for the maximum likelihood estimator of (x1, y1) by gradient descent on the negative log-likelihood. Report the true and estimated (x1,y1) for the given sensor locations. Try two approaches for initializing gradient de- scent: starting at 0 and starting at a random point. Which of the following step sizes is the most reasonable one, 1, 0.01, 0.001 or 0.0001? Solution: Please refer to the solution code. We should use a step size that is not too large to make sure that the algorithm converges and also avoid choosing a step size that is too small so that the algorithm converges faster. ######################################################################## ######### Results ##################################################### ######################################################################## The real object location is [[ 44.38632327 33.36743274]] The estimated object location with zero initialization is [[ 43.07188426 32.71217807]] The estimated object location with random initialization is [[ 43.07188426 32.71217807]] (c) (Local Minima of Gradient Descent) In this part, we vary the location of the single object among different positions: (x1, y1) ∈ {(0, 0), (100, 100), (200, 200), . . . , (900, 900)}. For each choice of (x1, y1), generate the following data set 10 times, for a total of 100 data sets: • Generate m = 7 sensor locations (ai, bi) from N(0, σ2s I) (Use the same σs from the previ- ous part.) • Generate mn = 7 distance measurements Di1 = ||(ai, bi) − (x1, y1)||2 + N(0, 1). For each data set, run the gradient descent methods 100 times to find a prediction for (x1, y1). We are pretending we do not know (x1,y1) and are trying to predict it. For each gradient descent, take 1000 iterations with step-size 0.1 and a random initialization of (x,y) from N(0, σ2I), where σ = x1 + 1. These are the results you should report for this problem: HW7, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 41 • Draw the contour plot of the log likelihood function of a particular data set for (x1, y1) = (0, 0) and (x1, y1) = (200, 200). • For each of the ten data sets and each of the ten choices of (x1, y1), calculate the number of distinct points that gradient descent converges to. Then, for each of the ten choices of (x1, y1), calculate the average of the number of distinct points over the ten data sets. Plot the average number of local minima against x1. For this problem, two local minima are considered identical if their distance is within 0.01. Hint: np.unique and np.round will help. 
• Foreachofthetendatasetsandeachofthetenchoicesof(x1,y1),calculatetheproportion of gradient descents which converge to what you believe to be a global minimum (that is, the minimum point in the set of local minima that you have found). Then, for each of the ten choices of (x1, y1), calculate the average of the proportion over the ten data sets. Plot the average proportion against x1. • For the 7th data set for the object location of (500, 500), plot the sensor locations, the ground truth object location and the MLE object locations found by 100 times of gradient descent. Do you find any patterns? Please be aware that the code might take a while to run. Solution: Please refer to the solution code and figures 19, 20 and 21. You will notice that as we get to the larger values for the true location of the object, gradient descent converges to more and more local minima. This is happening for two reasons. First, we are creating more local minima because the distance of the object to the sensors is slowly dominating the distance between the sensors themselves. Image the extreme case where sensors are within a radius of 1 while the object is about 1 × 108 away. Then the distances of the object to each of the sensor are almost the same. And the estimated location of the object can be anywhere around the circle of radius 1 × 108. In other words, all of these places would be local minima. Second, gradient descent is starting further away from the true optimal, which increases the chances that gradient descent get diverted to a local minima along the way. From figure 21, we can see that the recovered local minima roughly lies on a circle. Thus, in this case, most of the local minimas are caused by the large distance between the object and the sensors. Figure 19 HW7, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 42 Figure 20 Figure 21 (d) Repeat the previous part, except explore what happens as you reduce the variance of the mea- surement noise. Present the same plots and comment on the differences and why they occur. For the sake of saving time, you can experiment with only one smaller noise, such as N (0, 0.012 ). You might need to change the learning rate to make the gradient descent converge. Solution: For the sake of this solution, it is important to distinguish between the process of generating the Di1 and the optimization process for a fixed set of Di1. HW7, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 43 First, think about how incorrect local minima might arise from the data generation process. The first way that local minima are created is by noise in the neighborhood of the global minimum. This creates local minima that are near the true object location, but their distance from the true object location will grow with the noise σ. The second way local minima are created is by bad geometry, meaning that there is not a unique global minimum (because the sensors are not located in a way that can disambiguate certain points). Decreasing noise helps with the first case by making the objective value at the true object location deeper, while it does not help in the second case. To be more concrete, imagine the extreme case where the noise σ is very small (practically 0). The true object location is a global minimum. The only other minima are points which are almost exactly Di1 from each of the i; the geometric ambiguities. 
(d) Repeat the previous part, except explore what happens as you reduce the variance of the measurement noise. Present the same plots and comment on the differences and why they occur. For the sake of saving time, you can experiment with only one smaller noise, such as N(0, 0.01²). You might need to change the learning rate to make the gradient descent converge.

Solution: For the sake of this solution, it is important to distinguish between the process of generating the Di1 and the optimization process for a fixed set of Di1.

First, think about how incorrect local minima might arise from the data generation process. The first way that local minima are created is by noise in the neighborhood of the global minimum. This creates local minima that are near the true object location, but their distance from the true object location will grow with the noise σ. The second way local minima are created is by bad geometry, meaning that there is not a unique global minimum (because the sensors are not located in a way that can disambiguate certain points). Decreasing the noise helps in the first case by making the objective value at the true object location deeper, while it does not help in the second case. To be more concrete, imagine the extreme case where the noise σ is very small (practically 0). The true object location is a global minimum. The only other minima are points which are almost exactly Di1 from each of the sensors i; the geometric ambiguities.

The argument above explains how having smaller noise in data generation can create an optimization landscape with fewer local minima in the neighborhood of the global minimum. The argument below will show that once the optimization landscape is fixed, the choice of σ only rescales the gradients seen by gradient descent.

Consider the optimization process once the Di1 are fixed. Reducing the variance of the measurement noise to σ² < 1 will scale the gradient at every point by 1/σ² (check this by re-writing the log-likelihood function). We can counteract this by making our step size smaller by the same factor. The result is that these two changes cancel each other out, and we end up taking the same set of steps as before if we start at the same initialization. Overall, we do not see significant changes in our results in practice. If you did see an effect, hopefully GD improved with smaller noise σ. If you saw erratic behavior when you changed σ², it is possible that you did not change the step size, so your gradient was huge and your GD diverged.

See figures 22, 23 and 24.

Figure 22
Figure 23
Figure 24

(e) Repeat part (c) again, except explore what happens as you increase the number of sensors. For the sake of saving time, you can experiment with only one number of sensors, such as 20. Present the same plots and comment on the differences and why they occur.

Solution: Intuitively, more sensors lead to more measurements from different locations. This can help reduce the negative effects of measurement errors and locate the object.

Let's be a little more precise and explore what happens to the local minima. When we increase the number of sensors, the log likelihood will change to include more information about the location of the object. Imagine that we examine the global minimum and a local minimum before and after adding a new sensor. When we add the new sensor, the local minimum will be pulled upwards more than the global minimum. That is because the global minimum already roughly "agrees" with the data from the new sensor (that is, it adds a small penalty to the likelihood) while the local minimum might "disagree" (that is, it adds a strong penalty to the likelihood). Ultimately, this addition of sensors will eliminate some of the local minima, and so we expect gradient descent to work better. See figures 25 and 26.

Consider a concrete example where more sensors can help determine the location of the object better geometrically. Imagine the case where we only have two sensors, at (−1, 0) and (1, 0), and each reports a distance of about √2 from the object. Then we cannot decide whether the object is around (0, 1) or around (0, −1). A new sensor, preferably one which is far from the x-axis, will likely resolve this issue. For the (500, 500) object, 20 sensors help to resolve the local minima problem.

Figure 25
Figure 26

11 ReLU Elbow Update under SGD

In this question we will explore the behavior of the ReLU nonlinearity under Stochastic Gradient Descent (SGD) updates. The hope is that this problem will help you build a more intuitive understanding of how SGD works and how it iteratively adjusts the learned function.
We want to model a 1D function y = f(x) using a 1-hidden-layer network with ReLU activations and no biases in the linear output layer. Mathematically, our network is

f̂(x) = W(2)Φ(W(1)x + b)

where x, y ∈ R, b ∈ R^d, W(1) ∈ R^(d×1), and W(2) ∈ R^(1×d). We define our loss function to be the squared error,

l(x, y, W(1), b, W(2)) = (1/2)∥f̂(x) − y∥².

For the purposes of this problem, we define the gradient of a ReLU at 0 to be 0. You will visualize the behavior derived here using the notebook from Discussion 7.

(a) Let's start by examining the behavior of a single ReLU with a linear function of x as the input,

φ(x) = wx + b   if wx + b > 0,
φ(x) = 0        else.

Notice that the slope of φ(x) is w in the non-zero domain.
We define a loss function l(x, y, φ) = (1/2)∥φ(x) − y∥². Find the following:

(i) The location of the ‘elbow’ e of the function, where it transitions from 0 to something else.
(ii) The derivative of the loss w.r.t. φ(x), namely dl/dφ.
(iii) The partial derivative of the loss w.r.t. w, namely ∂l/∂w.
(iv) The partial derivative of the loss w.r.t. b, namely ∂l/∂b.
Solution:

(i) e = −b/w.

(ii) dl/dφ = φ(x) − y.

(iii) ∂l/∂w = (∂l/∂φ)(∂φ/∂w) = (φ(x) − y)x  if wx + b > 0,  and 0 else.

(iv) ∂l/∂b = (∂l/∂φ)(∂φ/∂b) = (φ(x) − y)  if wx + b > 0,  and 0 else.
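As a quick numerical sanity check on these derivatives (not part of the original assignment), one can compare them against centered finite differences; the helper names and sample values below are illustrative.

# Compare the hand-derived gradients of the single ReLU against finite differences.
def relu_loss(w, b, x, y):
    phi = max(w * x + b, 0.0)            # single ReLU with linear input
    return 0.5 * (phi - y) ** 2

def analytic_grads(w, b, x, y):
    phi = max(w * x + b, 0.0)
    active = 1.0 if w * x + b > 0 else 0.0   # indicator of the non-zero domain
    return (phi - y) * x * active, (phi - y) * active   # dl/dw, dl/db

w, b, x, y, eps = 1.5, -0.2, 1.0, 0.3, 1e-6
num_dw = (relu_loss(w + eps, b, x, y) - relu_loss(w - eps, b, x, y)) / (2 * eps)
num_db = (relu_loss(w, b + eps, x, y) - relu_loss(w, b - eps, x, y)) / (2 * eps)
print(analytic_grads(w, b, x, y))   # (1.0, 1.0)
print(num_dw, num_db)               # approximately 1.0, 1.0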
(b) Now suppose we have some training point (x, y) such that φ(x) − y = 1. In other words, the prediction φ(x) is 1 unit above the target y — we are too high and are trying to pull the function downward.
Describe what happens to the slope and elbow of φ(x) when we perform gradient descent in the following cases:
(i) φ(x) = 0.
(ii) w > 0, x > 0, and φ(x) > 0. It is fine to check the behavior of the elbow numerically in this case.
(iii) w > 0, x < 0, and φ(x) > 0.
(iv) w < 0, x > 0, and φ(x) > 0. It is fine to check the behavior of the elbow numerically in this case.
Additionally, draw and label φ(x), the elbow, and the qualitative changes to the slope and elbow after a gradient update to w and b. You should label the elbow location and a candidate (x, y) pair. Remember that the update for some parameter vector p and loss l under SGD is
p′ = p − λ∇p(l),  λ > 0.
Solution: For each of the cases, the change in the slope is determined by ∂l/∂w, and the new elbow is located at

e′ = −(b − ∆b)/(w − ∆w) = (−b + λ ∂l/∂b)/(w − λ ∂l/∂w).

(i) No changes, since the derivatives are 0.

(ii) The slope gets shallower and the elbow moves right:
∂l/∂w = 1 · x > 0  ⟹  w > w − λx.

(iii) The slope gets steeper and the elbow moves right. Using the fact that w, b > 0 and e < 0 for this configuration,
∂l/∂w = 1 · x < 0  ⟹  |w| < |w − λx|,
∂l/∂b = 1  ⟹  |b| > |b − λ|,
∴ e′ > e.
Several numerical checks show the motion of the elbow.

(iv) The slope gets steeper and the elbow moves left. Since w < 0 and x > 0 for this configuration,
∂l/∂w = 1 · x > 0  ⟹  |w| < |w − λx|.
Several numerical checks show the motion of the elbow.

[The solution includes hand-drawn sketches of φ(x) for cases (ii)–(iv), labeling the elbow and a candidate point (x, y) before and after the update.]

Believe it or not, even when the slope moves in the opposite direction from the one you expect, the error still decreases. You can try this yourself with a learning rate of 0.1, x = 1, y = 1, w = −2, and b = 4; a small numeric sketch of this check follows below. We encourage you to check the behavior (numerically is probably the easiest method) for these cases when φ(x) − y < 0 and see what changes and what stays the same. You may give yourself full credit for self-grades if you checked the behavior for a single value in each case rather than in general.
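Here is a minimal numeric version of that single-step check, using the values λ = 0.1, x = 1, y = 1, w = −2, b = 4 suggested above; the helper names are illustrative rather than taken from the released code.

# One SGD step on the single ReLU, using the suggested values.
def phi(w, b, x):
    return max(w * x + b, 0.0)

def sgd_step(w, b, x, y, lr):
    err = phi(w, b, x) - y
    active = 1.0 if w * x + b > 0 else 0.0
    return w - lr * err * x * active, b - lr * err * active   # gradients from part (a)

w, b, x, y, lr = -2.0, 4.0, 1.0, 1.0, 0.1
w2, b2 = sgd_step(w, b, x, y, lr)
print(0.5 * (phi(w, b, x) - y) ** 2, "->", 0.5 * (phi(w2, b2, x) - y) ** 2)  # 0.5 -> 0.32
print("slope:", w, "->", w2)            # -2.0 -> -2.1 (steeper in magnitude)
print("elbow:", -b / w, "->", -b2 / w2) # 2.0 -> ~1.857 (moves left), yet the loss drops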
(c) Now we return to the full network function f̂(x). Derive the location ei of the elbow of the i'th elementwise ReLU activation.

Solution: The i'th activation transitions where its input crosses zero:

W(1)i x + bi = 0  ⟹  ei = x = −bi / W(1)i.

(d) Derive the new elbow location e′i of the i'th elementwise ReLU activation after one stochastic gradient update with learning rate λ. (Hint: we found the vector partial derivatives of this network formulation in Discussion 7 using backpropagation.)

Solution: The new location after an SGD update is

e′i = (−bi + λ ∂l/∂bi) / (W(1)i − λ ∂l/∂W(1)i).

From Discussion 7, we have

∂l/∂W(1) = x (W(2)Φ(W(1)x + b) − y) W(2) ∂Φ/∂(W(1)x + b).

For the ReLU activation,

∂Φ/∂(W(1)x + b) = diag(1(W(1)x + b > 0)),

where 1(·) is an indicator variable for whether each element of its argument is true or false. Since we are only interested in the i'th index of this gradient, we can simplify W(2) ∂Φ/∂(W(1)x + b) to

W(2)i  if (W(1)x + b)i > 0,  and 0 else.

Therefore,

∂l/∂W(1)i = x (W(2)Φ(W(1)x + b) − y) W(2)i   if (W(1)x + b)i > 0,  and 0 else,
∂l/∂bi = (W(2)Φ(W(1)x + b) − y) W(2)i        if (W(1)x + b)i > 0,  and 0 else.

Putting everything together, for (W(1)x + b)i > 0,

e′i = (−bi + λ(W(2)Φ(W(1)x + b) − y)W(2)i) / (W(1)i − λ x (W(2)Φ(W(1)x + b) − y)W(2)i)
    = (−bi + λ(f̂(x) − y)W(2)i) / (W(1)i − λ x (f̂(x) − y)W(2)i),

and e′i = ei otherwise (an inactive unit receives no gradient).

We can still predict the change in each component ReLU function Φi(x) as long as we know f̂(x) − y, W(1)i, bi, and W(2)i.
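As a sanity check (not part of the original assignment), the sketch below takes one SGD step on a small random network and compares the resulting elbows against the closed-form e′i above; the array names, shapes, and sample values are illustrative assumptions.

# Verify the elbow-update formula numerically on a tiny random network.
import numpy as np

rng = np.random.default_rng(0)
d, lr = 4, 0.1
W1, b, W2 = rng.normal(size=(d, 1)), rng.normal(size=d), rng.normal(size=(1, d))
x, y = 0.7, 1.3

def fhat(W1, b, W2, x):
    return float(W2 @ np.maximum(W1[:, 0] * x + b, 0.0))

# One SGD step on l = 0.5 * (fhat(x) - y)^2.
err = fhat(W1, b, W2, x) - y
active = (W1[:, 0] * x + b > 0).astype(float)
dW1 = x * err * W2[0] * active          # dl/dW1_i, zero where the ReLU is off
db = err * W2[0] * active               # dl/db_i
W1_new = W1 - lr * dW1[:, None]
b_new = b - lr * db

# Elbows before/after the step vs. the closed-form prediction derived above.
e_old = -b / W1[:, 0]
e_sgd = -b_new / W1_new[:, 0]
e_pred = np.where(active > 0,
                  (-b + lr * err * W2[0]) / (W1[:, 0] - lr * x * err * W2[0]),
                  e_old)
print(np.allclose(e_sgd, e_pred))       # True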
Contributors:
• Alexander Tsigler
• Anant Sahai
• Ashwin Pananjady
• Chawin Sitawarin
• Inigo Incer
• Josh Sanz
• Raaz Dwivedi
• Rahul Arya
• Yang Gao
• Yaodong Yu