
EECS 189 Introduction to Machine Learning Fall 2020
HW13

This homework is due Wednesday, December 9 at 11:59 p.m.

1 Getting Started
Read through this page carefully. You may typeset your homework in LaTeX or submit neatly handwritten/scanned solutions. Please start each question on a new page. Deliverables:
1. Submit a PDF of your writeup, with an appendix for your code, to the appropriate assignment on Gradescope. If there are graphs, include those graphs in the correct sections. Do not simply reference your appendix.
2. If there is code, submit all code needed to reproduce your results.
3. If there is a test set, submit your test set evaluation results.
After you’ve submitted your homework, watch out for the self-grade form.
(a) Who else did you work with on this homework? In case of course events, just describe the group. How did you work on this homework? Any comments about the homework?

(b) Please copy the following statement and sign next to it. We just want to make it extra clear so that no one inadvertently cheats.

I certify that all solutions are entirely in my words and that I have not looked at another student's solutions nor have I looked at any online solutions to any of these problems. I have credited all external sources in this write-up.

2 CNNs on Fruits and Veggies
Note that running the CNN once on Google Colab or an Intel(R) i5-6360U CPU takes about 50 minutes, so please start early.
In this problem, we will use the dataset of fruits and vegetables that was collected by students in a different semester. The goal is to accurately classify the produce in the image. In prior homework, we explored how to select features and then use linear classification to learn a function. There, we just used the color histograms as the basic data. We will now explore using Convolutional Neural Networks to optimize feature discovery jointly with learning a classification policy.
Denote the input state $x \in \mathbb{R}^{90 \times 90 \times 3}$, which is a downsampled RGB image with the fruit centered in it. Each data point will have a corresponding class label, which corresponds to its matching produce. Given 25 classes, we can denote the label as $y \in \{0, \dots, 24\}$.
The goal of this problem is to make sure you learn how to implement a Convolutional Neural Network (CNN) using PyTorch. We will use it to compare with both a fully-connected neural network as well as a nearest-neighbor approach.
Note that all Python packages needed for the project will be imported already. DO NOT import new Python libraries. Also, this project will be computationally expensive on your computer's CPU. Please use Google Colab if you do not have a strong CPU. The dataset for this project can be downloaded here (link). We recommend using PyTorch version 1.7.0 with the Python 3 interface.
(a) To begin the problem, we need to implement a CNN in PyTorch. In the starter code there is a file named cnn_pytorch.py in which the network architecture and the loss function are currently blank. You will have to write a convolutional neural network that has the following architecture:
(a) Layer 1: A convolutional layer with 5 filters of size 15 by 15
(b) Non-Linear Response: Rectified Linear Units
(c) A max pooling operation with filter size of 3 by 3
(d) Layer 2: A Fully Connected Layer with output size 512.
(e) Non-Linear Response: Rectified Linear Units
(f) Layer 3: A Fully Connected Layer with output size 25 (i.e. the class labels in one-hot encoding)
(g) Loss Layer: Softmax Cross-Entropy Loss
In the file example_cnn_pytorch.py, we show how to implement a network in PyTorch. Please use this as a reference. Once the network is implemented, run the corresponding part in the jupyter notebook on the dataset and report the resulting confusion matrix. The goal is to ensure that your network compiles; we should not expect the results to be good because the network is randomly initialized.
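For reference, here is a minimal PyTorch sketch consistent with the architecture above. The conv/pool strides and padding (stride-1 convolution with no padding, pool stride equal to its 3-by-3 kernel) are assumptions, since the prompt does not fix them, and the actual starter code may organize things differently:

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Sketch of the part (a) architecture; layer hyperparameters not stated
    in the prompt (strides, padding) are assumptions."""
    def __init__(self, num_classes=25):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 5, kernel_size=15),  # Layer 1: 5 filters of size 15x15
            nn.ReLU(),                        # non-linear response
            nn.MaxPool2d(kernel_size=3),      # 3x3 max pooling
        )
        # Infer the flattened feature size from a dummy 90x90 RGB input.
        with torch.no_grad():
            n_flat = self.features(torch.zeros(1, 3, 90, 90)).numel()
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_flat, 512),           # Layer 2: fully connected, 512 outputs
            nn.ReLU(),                        # non-linear response
            nn.Linear(512, num_classes),      # Layer 3: fully connected, 25 class scores
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# nn.CrossEntropyLoss combines softmax and cross-entropy, matching the loss layer.
criterion = nn.CrossEntropyLoss()
```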
(b) The next step to train the network is to complete the pipeline which loads the datasets and offers them as mini-batches to the network. Fill in the missing code in data_manager_pytorch.py and report your code. If you are using Google Colab, you can directly open and edit the corresponding files under the Files tab → Drive → MyDrive → Colab Notebooks → prob2.
(c) We will now complete the iterative optimization loop. Fill in the missing code in trainer_pytorch.py to iteratively apply SGD for a fixed number of iterations. In our system, we will be using an extra momentum term to help speed up the SGD optimization. Run the jupyter notebook and report the resulting chart.
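A minimal sketch of such a loop, reusing the SimpleCNN sketch from part (a) and random stand-in data (the learning rate, momentum, and batch size below are illustrative, not the values from trainer_pytorch.py):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

model = SimpleCNN()  # from the part (a) sketch above
# Random stand-in data; the real pipeline uses the fruits-and-veggies dataset.
data = TensorDataset(torch.randn(8, 3, 90, 90), torch.randint(0, 25, (8,)))
loader = DataLoader(data, batch_size=4)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(2):                # fixed number of iterations
    for x, y in loader:
        optimizer.zero_grad()         # clear gradients from the previous step
        loss = criterion(model(x), y)
        loss.backward()               # backpropagate through the CNN
        optimizer.step()              # momentum-accelerated SGD step
```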
(d) We want to better understand how the network was able to achieve better performance on our fruits and veggies dataset. Somehow, it is learning features to reduce the dimensionality of
the data. We can see what features were learned by examining the response maps after our convolutional layer.
The response map is the output image after the convolution has been applied. This image can be interpreted as showing which features are interesting for classification. Fill in the missing code in viz_features_pytorch.py and report the images specified.
(e) Data augmentation refers to techniques used to increase the amount of data by adding slightly modified copies of existing data or newly created synthetic data derived from existing data. It acts as a regularizer and can help improve generalization when training a machine learning model. It is a staple in image classification and usually improves the performance of CNNs.
We have implemented an image horizontal-flip augmentation for you. Specifically, in the training phase, there is a 50% chance that the loaded image will be horizontally flipped. There is no augmentation in the evaluation phase.
Set “FLIPAUG = True” in data_manager_pytorch.py and run the corresponding part in the jupyter notebook. Compare the performance with part (c), which has no augmentation.
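The flip itself can be sketched in a few lines; `maybe_hflip` is a hypothetical helper, not the actual starter-code implementation:

```python
import random
import torch

def maybe_hflip(image: torch.Tensor) -> torch.Tensor:
    """Horizontally flip a C x H x W image with probability 0.5 (training only)."""
    if random.random() < 0.5:
        return torch.flip(image, dims=[2])  # dim 2 is the width axis
    return image
```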
(Bonus) Implement another augmentation algorithm that outperforms the given method. To start with, you can think of random crop, color jittering, etc.
(f) We have seen that the convolutional neural network can achieve good performance on this dataset. However, we are not sure whether a fully connected neural network could do the same job. We would like to replace layers (a)-(c) in the original network with a fully connected layer and a ReLU layer. We want approximately the same number of parameters in the new fully connected layer as in layer (a) of the original network. How many hidden units will the new fully connected layer have? You may round fractional numbers up to the nearest integer. Do you observe anything weird about this new network?
Hint: For a convolution layer with input size W × H × C and N convolution filters, each of size X × Y, there are C × X × Y × N parameters.
(g) Implement the fully connected network from the previous part and redo part (c). How is the performance of the fully connected network?
(h) Given that our CNN has achieved some generalization with such low training error, it suggests that we should try other "interpolating" types of estimators. The easiest such approach to understand is 1-nearest neighbors. Fill in the missing code in nn_classifier.py and report the performance as the number of neighbors is swept when the corresponding jupyter notebook part is run.
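For reference, a brute-force k-nearest-neighbor predictor can be sketched as follows; `knn_predict` is an illustrative helper, not the starter code's API:

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=1):
    """Predict by majority vote over the k nearest training points (Euclidean)."""
    preds = []
    for x in X_test:
        d = np.linalg.norm(X_train - x, axis=1)             # distance to every training point
        nearest = np.argsort(d)[:k]                         # indices of the k closest
        preds.append(np.bincount(y_train[nearest]).argmax())  # majority label
    return np.array(preds)
```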
3 Gradient boosting and early stopping
In this problem we show how the greedy algorithms Matching Pursuit (a simplified version of the Orthogonal Matching Pursuit that you might have seen in EECS16A) and AdaBoost which you have seen in class, can both be interpreted as taking steps greedily in a direction which is closest
to the negative gradient in a restricted set/space. We start by connecting gradient descent on the square loss with Matching Pursuit.
(a) Let us explore a naive approach to doing gradient descent on the square loss.

So far, we have considered the squared error loss $L(w) := \frac{1}{2}\|y - Xw\|_2^2$ as a function of the variable $w$. When running gradient descent we took the derivative of $L$ with respect to $w$ and obtained the iterations
$$w_{t+1} = w_t - \alpha_t X^\top (Xw_t - y). \tag{1}$$
We can substitute $v = Xw \in \mathbb{R}^n$ and instead consider the loss function as a map $\mathcal{L} : V \to \mathbb{R}$ from the Euclidean space $V = \mathbb{R}^n$ to the reals: $\mathcal{L}(v) = \frac{1}{2}\|y - v\|_2^2$.

In order to minimize this new loss $\mathcal{L}$ with respect to $v$, we could imagine simply running gradient descent on this loss as follows:
$$v_{t+1} = v_t - \alpha_t \nabla_v \mathcal{L}(v)\big|_{v=v_t}, \tag{2}$$
for some suitably chosen step size $\alpha_t > 0$.

Compute the gradient $\nabla_v \mathcal{L}(v)$. Write down the gradient descent update for $v_{t+1}$. What is the gradient $\nabla_v \mathcal{L}(v)$ evaluated at $v = Xw$, which we denote $\nabla_v \mathcal{L}(v)\big|_{v=Xw}$? For $v_t = Xw_t$, compute the expression for $v_{t+1}$ using update (2).
(b) Now we compare this update with the one obtained by running the gradient descent update of $w$ on the loss $L$. In particular, using equation (1), write down the expression for $Xw_{t+1}$. How do you interpret the difference between the updates $Xw_{t+1}$ and $v_{t+1}$? Think in terms of the underlying subspaces of $\mathbb{R}^n$ in which the updates lie.
(c) As we saw in the previous part, the updates $Xw_{t+1}$ and $v_{t+1}$, derived by taking gradients of $L$ and $\mathcal{L}$ respectively, are not equivalent. In general, we may not be able to run the gradient updates (2) exactly, as there might be constraints on the update we want to make. Sometimes, these constraints are consequences of the problem formulation (in the previous part, for example, we wanted a linear fit for $y$ depending on $x$). Sometimes, there are further constraints that come from computational considerations, as in the follow-up parts dealing with AdaBoost. In this problem, we will see that if, instead of (2), we take a direction closest to the gradient in some restricted space, the two procedures actually match.
In the specific example of linear regression, we want to find a vector $\hat{v}_t$ which can be expressed as a linear function of our data matrix $X$. Note that $\hat{v}_t$ will be used for updating $v$ later (equation (4)). For this purpose we restrict our search space to the column space of $X$ (a subspace of $\mathbb{R}^n$), in which all vectors $\hat{v}$ can be expressed as $\hat{v} = Xw$ for some vector $w$ of norm at most one (we'll see later why this is necessary), i.e. the set
$$\hat{V} = \{\hat{v} = Xw : \|w\|_2 \le 1,\, w \in \mathbb{R}^d\} \subset V.$$
Notice that the gradient of $\mathcal{L}$ with respect to $v_t$ need not lie in this smaller subspace, i.e. $\nabla_v \mathcal{L}(v_t) \notin \hat{V}$. In that case, we consider a greedy strategy: we first find the direction in $\hat{V}$ which best aligns with the negative gradient. That is, at each iteration we find
$$\hat{v}_t = \arg\max_{\hat{v} \in \hat{V}} \langle -\nabla \mathcal{L}(v_t), \hat{v} \rangle, \tag{3}$$
and then compute $v_{t+1}$ using the update equation
$$v_{t+1} = v_t + \alpha_t \hat{v}_t, \tag{4}$$
for an appropriate choice of step size $\alpha_t$, where $\hat{v}_t$ now belongs to the subset $\hat{V}$. We now show how this new procedure yields updates in the same direction as performing the gradient step with respect to $w$ explicitly.

Again assume that the current iterate is given by $v_t = Xw_t$ for some vector $w_t$. Now, using equation (3) and the expression for $\nabla_v \mathcal{L}(v)\big|_{v=Xw_t}$ derived previously, show that $\hat{v}_t$ in (3) is exactly parallel to the vector $-XX^\top(Xw_t - y)$ (i.e. they are the same up to a positive scalar). Now write the update equation (4) in terms of $X$, $w_t$, $y$ and $\alpha_t$. Thus, we have seen that gradient descent on the losses $L$ and $\mathcal{L}$ are essentially the same (and in fact the same if the stepsize $\alpha_t$ is adjusted accordingly) by choosing the set $\hat{V}$ accordingly.

Hint: For any function $h$ mapping to a scalar you may use that $\arg\max_{v = Xw : \|w\|_2 \le 1} h(v) = X \arg\max_{\|w\|_2 \le 1} h(Xw)$.
(d) Towards Matching Pursuit in the gradient descent framework. In this problem we are going to interpret the direction-selection step of Matching Pursuit as a general gradient-type method as described in (b) for an appropriate choice of the set $\hat{V}$.

As you have seen in earlier homeworks, Orthogonal Matching Pursuit is a greedy method to pick relevant coordinates that can best explain the data. It searches for a direction (among the columns) along which the residual error can be best explained, and updates the entire fit in the augmented space. Matching Pursuit (MP) is a simpler variant where the entire fit is not updated at every step, only the fit along the newly chosen coordinate.

Let $X \in \mathbb{R}^{n \times d}$ be the training feature matrix and $y$ the observed training targets from which we want to infer (potentially sparse) weights $w \in \mathbb{R}^d$. At every iteration $t$, MP picks the column of $X$ with maximal absolute inner product with the current residual $r_t = y - Xw_t$, i.e.
$$i_t = \arg\max_{i=1,\dots,d} |\langle r_t, \phi_i(X) \rangle|, \tag{5}$$
where $X \in \mathbb{R}^{n \times d}$ is the data matrix with columns $\phi_i(X)$ and rows $x_1, \dots, x_n$, i.e. $X = (x_1, \dots, x_n)^\top = (\phi_1(X), \dots, \phi_d(X))$. Note that $x_i$ denotes the $i$-th data point and $\phi_i(X)$ denotes the vector of the $i$-th feature across all the data points (i.e. the $i$-th column of the data matrix $X$).

The update of MP at each iteration $t$ then reads
$$Xw_{t+1} = Xw_t + \frac{\langle r_t, \phi_{i_t}(X) \rangle}{\|\phi_{i_t}(X)\|^2}\, \phi_{i_t}(X). \tag{6}$$
Now your task in this part is to establish the correspondence between the MP selection rule (5) and the general approach to (restricted) gradient descent given by (3).

In particular, set $v_t = Xw_t$ and find the set $\hat{V}$ such that the selection represented by (5) can be understood as just being (3) in action, that is
$$\hat{v}_t = \arg\max_{\hat{v} \in \hat{V}} \langle -\nabla \mathcal{L}(v_t), \hat{v} \rangle = \operatorname{sign}(\langle r_t, \phi_{i_t}(X) \rangle)\, \phi_{i_t}(X).$$
Hint: don't forget how to deal with the absolute value.
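For concreteness, here is a minimal NumPy sketch of the MP iterations (5)-(6) (illustrative only; the problem itself asks for the derivation, not code):

```python
import numpy as np

def matching_pursuit(X, y, n_iters):
    """Greedy column selection on the residual, per equations (5)-(6)."""
    n, d = X.shape
    w = np.zeros(d)
    r = y.copy()                                  # residual r_t = y - X w_t
    for _ in range(n_iters):
        scores = X.T @ r                          # <r_t, phi_i(X)> for every column
        i = np.argmax(np.abs(scores))             # selection rule (5)
        step = scores[i] / np.dot(X[:, i], X[:, i])
        w[i] += step                              # update only coordinate i_t, eq. (6)
        r -= step * X[:, i]                       # keep the residual in sync
    return w
```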
(e) In traditional gradient descent, the rule for choosing the step-size $\alpha_t$ is left open. One way of choosing it adaptively is linesearch: picking the step-size $\alpha$ that reduces the loss the most. Formally, choosing the stepsize $\alpha_t$ via linesearch with respect to the search direction $\hat{v}_t$ at each iteration for a loss $\mathcal{L}$ means finding $\alpha_t = \arg\min_{\alpha \ge 0} \mathcal{L}(v_t + \alpha \hat{v}_t)$. Argue that the update (4) with the direction $\hat{v}_t = \operatorname{sign}(\langle r_t, \phi_{i_t}(X)\rangle)\,\phi_{i_t}(X)$ you derived in (d), with $\alpha_t$ chosen via linesearch along that $\hat{v}_t$, is the same as doing the MP update (6).
(f) MP as gradient boosting. The two previous greedy principles of matching pursuit (column selection and linesearch) can be extended to more general settings of non-linear and non-parametric function classes $\mathcal{F}$ and general loss functions $\mathcal{L}$ (which need not be the squared loss). Consider the general learning problem where again we are given pairwise training samples $(x_i, y_i)$ for $i = 1, \dots, n$, with $x_i \in \mathbb{R}^d$ and $y_i \in \mathcal{Y} \subset \mathbb{R}$, and we want to learn a function $f : \mathbb{R}^d \to \mathcal{Y}$ from a function space $\mathcal{F}$ (with possibly non-linear and non-parametric basis elements, such as trees) which we can then use to predict $y_{\text{test}}$ by $f(x_{\text{test}})$ for a test sample $x_{\text{test}}$.

For this purpose we minimize the following (average) loss $\mathcal{L} : \mathbb{R}^n \to \mathbb{R}$:
$$\mathcal{L}(f(\{x\}_{i=1}^n)) = \mathcal{L}(f(x_1), \dots, f(x_n)) = \frac{1}{n} \sum_{i=1}^n L(y_i, f(x_i)),$$
for some point-wise loss function $L : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ for one sample. Notice that the loss $\mathcal{L}$ maps a vector to a scalar, as in the linear setting. Here, we use $f(\{x\}_{i=1}^n) := (f(x_1), \dots, f(x_n))$ to denote the vector of function values of $f \in \mathcal{F}$ at the points $x_1, \dots, x_n$.

As in gradient descent (4), we want to minimize the loss using an iterative algorithm with updates of the form
$$f^{t+1} = f^t + \alpha_t \hat{f}^t.$$
Note that our notation follows the convention that $f^t \in \mathcal{F}$ is the current iterate and $\hat{f}^t \in \mathcal{W}$ is the direction that we are adding at each step. The main question is how to choose $\hat{f}^t$. Let's assume that it is, for example, computationally much simpler to search over a smaller set of functions $\mathcal{W} \subset \mathcal{F}$ at each timestep.

Similar to the parametric update in (3), we then choose the function $\hat{f} \in \mathcal{W}$ for which the function value vector $\hat{f}(\{x\}_{i=1}^n)$ has maximal inner product with the negative gradient, i.e.
$$\hat{f}^t = \arg\max_{\hat{f} \in \mathcal{W}} \langle -\nabla \mathcal{L}(f^t(\{x\}_{i=1}^n)),\, \hat{f}(\{x\}_{i=1}^n) \rangle, \tag{7}$$
where the gradient $\nabla \mathcal{L}(f^t(\{x\}_{i=1}^n)) = \nabla_v \mathcal{L}(v)\big|_{v = f^t(\{x\}_{i=1}^n)}$, which you have already computed above for the square loss.
Now we want to see how Matching Pursuit can be viewed as an instance of this framework, also called gradient boosting or functional gradient descent. Specifically, show that the particular choices of
$$\mathcal{F} = \{f(x) = w^\top x : w \in \mathbb{R}^d\} \quad \text{and} \quad \mathcal{W} = \{\hat{f}(x) = \pm e_i^\top x : i = 1, \dots, d\},$$
where $e_i$ is the $i$-th standard basis element with a 1 in the $i$-th position and zeros elsewhere, yield Matching Pursuit as a form of gradient boosting. Notice that $\mathcal{F}$ is the function space of all linear functions and $\mathcal{W}$ is the set of functions on vectors $x \in \mathbb{R}^d$ which pick out signed coordinates of $x$.

Hint: For matching pursuit, what are $\hat{f}^t(\{x\}_{i=1}^n)$ and $f^t(\{x\}_{i=1}^n)$?
(g) AdaBoost as gradient boosting. Now let's have a look at non-linear and non-parametric functions such as decision trees for classification. In this problem we will establish that AdaBoost is also an instance of gradient boosting on a particular loss function (which is not the zero-one loss!).
Notice how each decision tree can be viewed as a function $\hat{f} : \mathbb{R}^d \to \{-1, +1\}$. Since in each iteration we fit a tree to the weighted samples, our function set $\mathcal{W}$ now corresponds to the set of decision trees with a fixed set of parameters.
Remember the AdaBoost algorithm from lecture and the previous problem. At each iteration $t$, it upweights each sample $i$ by a weight $w_i^t$ and finds a new tree which solves the reweighted classification problem
$$\hat{f}^t_{\text{Ada}} = \arg\min_{\hat{f} \in \mathcal{W}} \sum_{i=1}^n w_i^t\, \mathbb{I}(y_i \ne \hat{f}(x_i)), \quad \text{where } w_i^t = \frac{\exp\left(-y_i f^t(x_i)\right)}{\sum_{j=1}^n \exp\left(-y_j f^t(x_j)\right)},$$
where $f^t$ is a linear combination of functions in $\mathcal{W}$ found by previous iterations and $\mathbb{I}(A) = 1$ if $A$ is true and $0$ otherwise (the indicator function). Notice that the samples which the current classifier $f^t$ gets wrong are assigned a larger weight.

The update for the overall classifier reads
$$f^{t+1} = f^t + \alpha_t \hat{f}^t_{\text{Ada}}. \tag{8}$$
The classification rule is then obtained by taking the sign of the final iterate at time $\tau$, i.e. $f_{\text{final}}(x) = \operatorname{sign}(f^\tau(x))$, which lies in a bigger space $\mathcal{F} \supset \mathcal{W}$.
Show that $\hat{f}^t_{\text{Ada}}(\{x\}_{i=1}^n) = \hat{f}^t(\{x\}_{i=1}^n)$ in the gradient boosting framework (7) for the loss $l(y, f(x)) = \exp(-y f(x))$. This implies that finding the best fit for the reweighted samples is equivalent to fitting the gradient. For the gradient in (7), notice that we again view $\mathcal{L}$ as a function of a vector $v$ with
$$\mathcal{L}(v) = \frac{1}{n}\sum_{i=1}^n l(y_i, v_i) = \frac{1}{n}\sum_{i=1}^n \exp(-y_i v_i) \tag{9}$$
and the gradient $\nabla \mathcal{L}(f^t(\{x\}_{i=1}^n)) = \nabla_v \mathcal{L}(v)\big|_{v=f^t(\{x\}_{i=1}^n)}$.
Hint: Note that for $y \in \{-1, +1\}$ and $\hat{f} : \mathbb{R}^d \to \{-1, +1\}$, we have
$$y_i \hat{f}(x_i) = 2\left(1 - \mathbb{I}(y_i \ne \hat{f}(x_i))\right) - 1. \tag{10}$$
Hint: Recall that the maximizer does not change if you scale the objective by a positive constant, or add to it a term that is independent of the object you are searching over, i.e.
$$\arg\max_{\hat{f} \in \mathcal{W}} \left[C\, H(\hat{f}(\{x\}_{i=1}^n)) + G(f(\{x\}_{i=1}^n))\right] = \arg\max_{\hat{f} \in \mathcal{W}} H(\hat{f}(\{x\}_{i=1}^n))$$
for functions $H, G$ that depend on the function value vectors $\hat{f}, f$, and a scalar $C > 0$; here $f$ does not depend on $\hat{f}$.
(h) AdaBoost weight update as linesearch steps. In this problem we show that the weights used in the AdaBoost algorithm at each time step $t$ correspond to the "best" stepsize in the direction $\hat{f}^t_{\text{Ada}}$, chosen using linesearch.

Let's define the error of the classifier $f^t$ as $\varepsilon_t = \sum_{i=1}^n w_i^t\, \mathbb{I}(y_i \ne \hat{f}^t_{\text{Ada}}(x_i))$. In practice, AdaBoost uses the stepsize $\alpha_t = \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}$.

Show that the AdaBoost stepsize at any time $t$ corresponds to the stepsize chosen via linesearch for the direction $\hat{v} = \hat{f}^t_{\text{Ada}}(\{x\}_{i=1}^n)$, i.e. it solves the optimization problem
$$\min_{\alpha \in \mathbb{R}} \frac{\mathcal{L}\left(f^t(\{x\}_{i=1}^n) + \alpha\, \hat{f}^t_{\text{Ada}}(\{x\}_{i=1}^n)\right)}{\mathcal{L}\left(f^t(\{x\}_{i=1}^n)\right)}.$$
Hint: division by a constant factor (independent of the variable we are minimizing over) does not change the minimizer of the optimization problem, i.e.
$$\arg\min_\alpha \frac{g(\alpha; \{x\}_{i=1}^n)}{h(\{x\}_{i=1}^n)} = \arg\min_\alpha g(\alpha; \{x\}_{i=1}^n).$$
(i) Now we will explore how gradient boosting does in practice for both examples we have theoretically studied in the previous parts: Matching Pursuit and AdaBoost. In particular, we will see how the number of updates, equivalent to the number of trees that we are fitting, affects the test and training error.
We have written some code (see the corresponding jupyter notebook) that trains several classifiers on the Spam dataset you worked with in the previous problem. Run the code for this part of the question. It generates a training error plot that uses several different decision trees (depths 1, 2, and 4) as well as several AdaBoost-trained classifiers, each built off of one of these decision tree classifiers. Do you see a trend in the performance of the different trees by themselves? Do you observe a trend in the training error as you use deeper trees for AdaBoost? Why might this happen?
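The experiment is roughly in the spirit of the following scikit-learn sketch (the random data here is a stand-in for the Spam dataset, and all hyperparameters are illustrative; the notebook's actual code may differ):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; the notebook uses the Spam dataset instead.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 10))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

train_err = {}
for depth in [1, 2, 4]:
    ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=depth),
                             n_estimators=100)       # boosting iterations
    ada.fit(X_train, y_train)
    # Training error after each boosting iteration (one point per tree added).
    train_err[depth] = [1 - s for s in ada.staged_score(X_train, y_train)]
```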
(j) Now we examine the test error, and compare against a baseline classifier (a decision tree of depth 9). Run the code for this part. Is there a difference in the training and test error? Which decision tree depth works best for AdaBoost? Explain your observations.
(k) In the last part, you noticed that the test error decreases as a function of boosting iterations in the beginning, but eventually starts to increase when the number of decision trees in AdaBoost is pretty large. In practice, this phenomenon motivates us to stop training early and limit the number of classifiers used in a boosting setting. Here "stopping early" means that the training error has not been reduced to zero but we have stopped training anyway! Refer to the plot from the previous part and answer: Do you think limiting the number of base classifiers used for AdaBoost would help? Which base classifier can we run more boosting iterations on before the test error starts increasing? Justify your answer intuitively.

(l) In this part, we connect this phenomenon to matching pursuit. The provided code generates a synthetic dataset
$$y = Xw + z,$$
with (for both test and train data)
$$y_{\text{train}}, y_{\text{test}} \in \mathbb{R}^n, \quad X_{\text{train}}, X_{\text{test}} \in \mathbb{R}^{n \times d}, \quad w \in \mathbb{R}^d.$$
Here, $w$ is a sparse vector (it has only a few non-zero entries) and $z$ denotes noise. Given the observations $y_{\text{train}}$ and the feature matrix $X_{\text{train}}$, we are tasked with finding a sparse solution $\hat{w}$, for which we use the matching pursuit algorithm.

Let $\hat{w}^t_{\text{MP}}$ denote the estimate of $w$ recovered by MP after $t$ iterations. The code also plots the training error $\|y_{\text{train}} - X_{\text{train}}\hat{w}^t_{\text{MP}}\|_2$ and the test error $\|y_{\text{test}} - X_{\text{test}}\hat{w}^t_{\text{MP}}\|_2$ as a function of the iteration $t$ (as $t$ increases, matching pursuit builds an estimate using a greater number of features). Run the code for this part. Explain the shape of the training error plot. Does the plot for the test error look similar to the one from part (i)? Comment on the similarities. You may use conclusions obtained in previous parts to justify your comments.

4 Neural Tangent Kernels
This problem shows how we can understand neural network behavior better if we leverage a very standard tool that you have seen in EECS16B (or wherever you learned similar ideas) — lineariza- tion of nonlinear functions. This staple tool from control theory and analog transistor circuits (where it is called “small signal analysis” and is developed in courses like EE105) turns out to be vital to give us insight into modern neural networks. In hindsight, this is perhaps not surprising given that a neural network is basically an analog circuit with nonlinear elements in it. The key is to view the network as a function not of its supposed inputs, but instead of its parameter weights. Since the learning dynamics evolve on the weights, a “small signal” view of the response to the weights sheds some important insight into what is going on.
Consider the two-layer neural network which computes the following function:
$$f(x|a, W) = \frac{1}{\sqrt{m}} \sum_{r=1}^m a_r\, \sigma(w_r^\top x).$$

We impose the following assumptions:
• Initial values of the final-layer weights $\{a_r\}$ are centered i.i.d. random variables supported on the segment $[-A, A]$ (not necessarily from the uniform distribution).
• Initial values of the weight vectors $\{w_r\}$ are i.i.d. random vectors independent of the $\{a_r\}$.
• The nonlinear activation function $\sigma$ and its first and second derivatives $\sigma'$ and $\sigma''$ are all bounded (for example, you can think of the sigmoid activation function).
For ease of understanding, the model that we will use to make predictions from the learned weights is the following:
$$\hat{f}(x|a, W) := f(x|a, W) - f(x|a_{\text{initial}}, W_{\text{initial}}),$$
i.e. we subtract out the guess that our initial random weights make. (This corresponds to the
concept of bias point in analog circuits and the idea of operating point in control theory.)
Why? Because the spirit of this problem is to take the Taylor expansion of the neural network with respect to the weights that we’re learning and just keep the first two terms — the constant term (corresponding to the random initial state’s predictions) and the linear term (that captures how the network’s behavior will change locally as we learn). The prediction model above lets us focus entirely on the linear term where all the action is — the jupyter parts will help you see this more clearly.
Our further goal is to understand what happens for large neural networks. Here, we’ll use another standard approach — we’ll take limits in two ways. We’ll let m → ∞ and we will also use a continuous-time perspective since that lets us leverage the core differential-equation thinking that you’ve developed in other courses (like EECS16B).
Consequently, this problem studies training of such a neural net with gradient descent in the “infinite width” regime, which means that we will set parameters of σ, distributions of ar and wr, the number of data points n and the data X itself to be constant, and will only have m going to infinity. What happens when we take m to infinity? As it turns out, in this regime training a neural net is no different from just doing a kernel regression with kernel K(xi,xj) = ⟨∇a,W f(xi|a,W),∇a,W f(xj|a,W)⟩, which in turn itself converges to a fixed kernel in the infinite- width limit. Sure, there are an infinite number of “features,” but that is not an obstacle since the actual (delta-) weights being applied to them are also infinitesimal. This limiting kernel is of- ten called the “Neural Tangent Kernel” (NTK) for obvious reasons. This is what “small signal” analysis as applied to a neural net is about.
The rough outline of our argument in this problem is the following: we spend most of our time showing that we don’t go very far in the weight space during gradient descent, and then in the last part we show why it means that we are doing kernel regression.
We start by computing the gradient of our prediction:
$$\frac{\partial}{\partial a_r} f(x|a,W) = \frac{1}{\sqrt{m}}\, \sigma(w_r^\top x), \qquad \nabla_{w_r} f(x|a,W) = \frac{1}{\sqrt{m}}\, a_r\, \sigma'(w_r^\top x)\, x.$$
For given training data $X = [x_1, \dots, x_n]^\top$ and $y = [y_1, \dots, y_n]^\top$, consider the MSE loss:
$$L(X, W, a) = \frac{1}{2} \sum_{i=1}^n \left(y_i - \hat{f}(x_i|a, W)\right)^2.$$
To simplify the math, instead of considering (usual) discrete-time gradient descent, we will assume that during the training the parameters of our NN evolve according to the continuous-time gradient descent defined by the system of differential equations:
$$\frac{d w_r(t)}{dt} = -\nabla_{w_r} L(X, W(t), a(t)) = \sum_{i=1}^n \left(y_i - \hat{f}(x_i|a(t), W(t))\right) \nabla_{w_r} f(x_i|a(t), W(t)),$$
$$\frac{d a(t)}{dt} = -\nabla_{a} L(X, W(t), a(t)) = \sum_{i=1}^n \left(y_i - \hat{f}(x_i|a(t), W(t))\right) \nabla_{a} f(x_i|a(t), W(t)).$$
You can think of this process as your usual gradient descent with infinitesimally small step size and a rescaling of time.
It turns out that looking at the evolution of weights directly is not very convenient. The main idea is to look at the dynamics of the predictions on the training data: denote
$$u_i(t) := \hat{f}(x_i|a(t), W(t)), \qquad u(t) := [u_1(t), \dots, u_n(t)]^\top.$$
Note that u(0) = 0 because of how we simplified things. Then during the GD we have
$$\frac{d u_i(t)}{dt} = \left\langle \nabla_a f(x_i|a,W),\, \frac{da(t)}{dt} \right\rangle + \sum_{r=1}^m \left\langle \nabla_{w_r} f(x_i|a,W),\, \frac{d w_r(t)}{dt} \right\rangle$$
$$= \sum_{j=1}^n \left(y_j - u_j(t)\right) \left\langle \nabla_a f(x_i|a,W),\, \nabla_a f(x_j|a,W) \right\rangle + \sum_{j=1}^n \sum_{r=1}^m \left(y_j - u_j(t)\right) \left\langle \nabla_{w_r} f(x_i|a,W),\, \nabla_{w_r} f(x_j|a,W) \right\rangle.$$
So, looking at the differential equation satisfied by the residuals (this should remind you of what you saw in the review problem in HW0 when we looked at discrete-time gradient descent for least squares), we see:
$$\frac{d}{dt}\left(y - u(t)\right) = -(G(t) + H(t))\left(y - u(t)\right),$$
where G(t) and H(t) are empirical kernel matrices:
$$G_{i,j}(t) := \frac{1}{m}\sum_{r=1}^m \sigma(w_r(t)^\top x_i)\,\sigma(w_r(t)^\top x_j),$$
$$H_{i,j}(t) := x_i^\top x_j\, \frac{1}{m}\sum_{r=1}^m a_r(t)^2\, \sigma'(w_r(t)^\top x_i)\,\sigma'(w_r(t)^\top x_j).$$
As $m$ goes to infinity, $G(0)$ and $H(0)$ converge to kernel Gram matrices with kernels $G^\infty$ and $H^\infty$ respectively:
$$G^\infty_{i,j} := \mathbb{E}_w\left[\sigma(w^\top x_i)\,\sigma(w^\top x_j)\right], \qquad H^\infty_{i,j} := x_i^\top x_j\, \mathbb{E}[a^2]\, \mathbb{E}_w\left[\sigma'(w^\top x_i)\,\sigma'(w^\top x_j)\right].$$
Denote $K(t) := G(t) + H(t)$ and $K^\infty := G^\infty + H^\infty$. Note that $K^\infty$ is exactly the kernel Gram matrix that corresponds to the NTK.
In this problem we assume that the matrix K∞ is non-degenerate, i.e. λmin(K∞) > 0. (This is a reasonable assumption since there are more features than data points.) Our goal is to show that under this condition training with gradient descent leads to the same prediction as kernel regression with the neural tangent kernel.
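As a numeric sanity check, the empirical matrices $G(0)$, $H(0)$, and $K(0) = G(0) + H(0)$ can be computed directly from their definitions. The sketch below uses tanh as the bounded activation and standard normal $w_r$; both are illustrative choices, not requirements of the problem:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, A = 5, 3, 10_000, 1.0
X = rng.normal(size=(n, d))                    # data points as rows
W = rng.normal(size=(m, d))                    # rows are the w_r (assumed Gaussian)
a = rng.uniform(-A, A, size=m)                 # final-layer weights in [-A, A]

pre = X @ W.T                                  # pre-activations w_r^T x_i, shape (n, m)
sig = np.tanh(pre)                             # sigma(.)
dsig = 1 - np.tanh(pre) ** 2                   # sigma'(.)

G = (sig @ sig.T) / m                          # G_ij = (1/m) sum_r s(w_r.x_i) s(w_r.x_j)
H = (X @ X.T) * ((dsig * a**2) @ dsig.T) / m   # H_ij = x_i.x_j (1/m) sum_r a_r^2 s'() s'()
K = G + H                                      # empirical NTK Gram matrix K(0)
```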
(a) The first observation that we are going to make is that if the lowest eigenvalue of our matrix K(t) remains separated from zero, then the residuals converge rapidly to zero loss. This rapid convergence will help us prove that our weights don’t go very far during the training.
Suppose that for some $T > 0$, for any $t \in [0, T]$, the lowest eigenvalue $\lambda_{\min}(K(t))$ of $K(t)$ is at least $\bar\lambda > 0$. Show that for any $t \in [0, T]$
$$\|y - u(t)\|^2 \le \|y\|^2 \exp\left(-2t\bar\lambda\right).$$
Hint: consider $\frac{d}{dt}\|y - u(t)\|^2$. Don't forget that $u(0) = 0$.
(b) As promised, now we want to show that our weights don't go very far and we will not get out of the local regime where the NTK is a good approximation. Suppose for some $T > 0$ there exist $\gamma > 0$ and $\bar\lambda > 0$ s.t. for any $t < T$ the following holds:
• for any $r$, $|a_r(t)| < \gamma$ (i.e. there is some bound on how big the final-layer weights become),
• $\lambda_{\min}(K(t)) \ge \bar\lambda$.
Show that for any $t < T$ the time derivatives of the weights obey bounds of the form
$$\left\|\frac{d w_r(t)}{dt}\right\| \le \frac{c}{\sqrt{m}}\,\|y - u(t)\|, \qquad \left|\frac{d a_r(t)}{dt}\right| \le \frac{c}{\sqrt{m}}\,\|y - u(t)\|,$$
for a constant $c$ that depends on $\sigma$, $X$, $n$ and $\gamma$ but not on $m$.

(c) Now show that for any $\delta > 0$ there exists $m_0$ s.t. for any $m > m_0$ the following holds: if we can control the condition number and the magnitude of the weights $a$, i.e.
• $\forall t \in [0,T]$, $\lambda_{\min}(K(t)) > \bar\lambda$,
• $\forall t \in [0,T]$, $\max_r |a_r(t)| < \gamma$,
then for any $t \in [0, T]$ and any $r$, $\|w_r(t) - w_r(0)\| < \delta$ and $|a_r(t) - a_r(0)| < \delta$.
Hint: you need to integrate the bounds from the previous part. How do they behave when $m$ grows?

(d) So far we were assuming that the lowest singular value of $K(t)$ stays separated from zero. We know that this is true at initialization, as $K(0)$ is close to $K^\infty$ for $m$ large enough. However, it is not clear why it should stay like that throughout the training. To show this we are going to use something like a "recursion": we already showed that if $\lambda_{\min}(K(t))$ stays separated from zero, then the weights don't go very far. Now we are going to show that if the weights don't go very far, then $\lambda_{\min}(K(t))$ stays separated from zero! In the next part we will put everything together and finalize the argument, but for now:
Show that if for any $r$, $\|w_r(t) - w_r(0)\| < \delta$ and $|a_r(t) - a_r(0)| < \delta$, then
$$\|G(t) - G(0)\| \le C_G(n, \sigma, X)\,\delta, \qquad \|H(t) - H(0)\| \le C_H(n, \sigma, A, X)\,(\delta + \delta^2),$$
where $C_G(n, \sigma, X)$ is a constant that only depends on $n, \sigma, X$, and $C_H(n, \sigma, A, X)$ is a constant that only depends on $n, \sigma, A, X$.
Hint: you have explicit formulas for $G$ and $H$. How does it help that the derivatives of $\sigma$ are bounded, and that the distribution of $a_r$ is supported on a bounded segment?

(e) Finally, we can put everything together: Show that for any $\varepsilon > 0$ we can choose $m$ to be so large that with probability at least $1 - \varepsilon$, for any $t$ the following hold:
• $\|K(t) - K^\infty\| \le \varepsilon$,
• $\max_r \|w_r(t) - w_r(0)\| \le \varepsilon$,
• $\max_r |a_r(t) - a_r(0)| < \varepsilon$.
Your argument should (probably) go as follows:
• For large $m$, $K(0)$ is close to $K^\infty$, so $\lambda_{\min}(K(0))$ is separated from zero.
• The weights $\{a_r\}$ are all bounded at initialization.
• While $\lambda_{\min}(K(t))$ is separated from zero and the weights $\{a_r\}$ are bounded, the weights don't go very far if $m$ is large enough.
• However, if the weights don't go very far, then $\lambda_{\min}(K(t))$ stays separated from zero and the weights $\{a_r\}$ stay bounded.
• Therefore, if $m$ is large enough, the weights will not go far away from initialization, and the matrix $K(t)$ will not go far away from $K^\infty$.

(f) Finally we can show that going not far from the initialization means that we are not actually doing anything non-linear! We can write our prediction at some new point $x$ as
$$\hat{f}(x) = \int_0^\infty \frac{d}{dt} f(x|a(t), W(t))\, dt = \int_0^\infty \left\langle \nabla_{a,W} f(x|a(t), W(t)),\ \sum_{i=1}^n (y_i - u_i(t))\, \nabla_{a,W} f(x_i|a(t), W(t)) \right\rangle dt = \int_0^\infty \sum_{i=1}^n (y_i - u_i(t))\, k(x, x_i, t)\, dt,$$
where we introduced the notation $k(x, x_i, t)$ for the scalar product of gradients. It's not hard to see that our argument could be applied almost as it is to show that for $m$ large enough the vector $[k(x,x_1,t), \dots, k(x,x_n,t)]^\top$ stays close to the vector $[k(x,x_1), \dots, k(x,x_n)]^\top$, where $k$ is the NTK. Plugging in this approximation we get
$$\hat{f}(x) = \sum_{i=1}^n k(x, x_i) \cdot \left(\int_0^\infty (y_i - u_i(t))\, dt\right).$$
Thus, to show that our predictions converge to the predictions of NTK regression we just need to show that the coefficients $\left\{\int_0^\infty (y_i - u_i(t))\, dt\right\}_{i=1}^n$ converge to the coefficients learned by kernel regression with the NTK. Show that convergence for $m$ going to infinity.
Hint: look at the learned predictions at the data points. You don't have to be completely rigorous; for example, you may assume $K(t) = K^\infty$ for large $m$ and pull $K^\infty$ out of the integral as we did above with $k(x, x_i)$.

(g) For the final part of this problem you will implement the key steps for neural tangent kernel regression and explore the behavior of the kernel compared to the network under gradient descent for varying hidden layer widths. Complete the missing code and respond to the prompts for each subpart in the notebook.

5 Double descent

As we increase the number of features, we increase the kinds of functions that our model can approximate well. This reduces the approximation error, and if the learning method can take advantage of it, could be a good thing. Unfortunately, there is a price that generally is paid by the learning process — as you have seen in earlier homeworks, the additional parameters can provide a place for noise/errors to enter the learned model and thereby degrade the prediction performance on unseen test data.

Traditionally, the test error is depicted as a ∪-shaped curve as a function of the number of parameters of the model, and one wants to find the "sweet spot" (the minimum of the test error). However, in this course you have learned that the story is not so simple. You have seen how more features can have a regularizing effect — they can reduce the contamination effect — even as they allow you to exactly interpolate the training data. In earlier homeworks, you have used both a Fourier model and a Gaussian model to see this effect.
This homework problem is an homage to (see this celebrated paper) where the consequences of this underlying phenomenon were observed empirically, showing that it wasn't something restricted to deep neural nets, but was actually ubiquitous. The paper showed that as we increase the number of parameters even further than the interpolation threshold (the point where the model is rich enough to exactly interpolate the data), the test error begins to decrease again in many problems — and in many cases, actually decreases below the best possible model with a few parameters. This phenomenon is called "Double Descent."

The aim of this problem is to show theoretically how double descent happens in a simple model where we add features one by one and perform linear regression for each number of features. This simple model is one that you have already seen in an earlier homework where we gave you a theoretical model that exhibits adversarial examples. You've done analyses in this style before — the goal here is to help you put the pieces together.

Suppose our signal is $\phi^*(x)$ where $\phi^*$ is a "master feature." We don't know the feature $\phi^*$, but we have some intuition about it. Let's model it in the following way: suppose that we can construct features $\phi_i = \phi^* + \Delta_i$ s.t. the following conditions hold: if $x$ is a point from the data distribution, then
• $\mathbb{E}_x \phi^*(x)\Delta_i(x) = \mathbb{E}_x \Delta_i(x)\Delta_j(x) = 0$ for all $i \ne j$,
• $\mathbb{E}_x \phi^*(x) = 0$, $\mathbb{E}_x \phi^*(x)^2 = 1$,
• $\mathbb{E}_x \Delta_i(x) = 0$, $\mathbb{E}_x \Delta_i(x)^2 = \sigma^2$,
• $(\phi^*(x), \Delta_1(x), \Delta_2(x), \dots)$ are jointly Gaussian (note that this implies independence since the correlations are zero).

Notice how this model can be thought of as possessing a single-dimensional latent space (using the style of thinking that you have also seen in earlier homeworks) where each visible feature is a noisy view of that latent space.

We want to estimate the generalization error of linear regression if we create $m$ features. If $m < n$ we do least squares; if $m \ge n$ we find the min-norm interpolating solution. Let's denote the corresponding solution as $\hat{w}_y$ (note that we emphasize the dependence on $y$ here). Suppose we find weights $\{w_i\}_{i=1}^m$. The generalization error will be
$$R(w) := \left(1 - \sum_{i=1}^m w_i\right)^2 + \sigma^2 \sum_{i=1}^m w_i^2.$$
Let $O \in \mathbb{R}^{m \times m}$ be a unitary matrix (i.e. a matrix with orthonormal columns) whose first row is $[1/\sqrt{m}, \dots, 1/\sqrt{m}]$. We can change coordinates in the weight space in the following way:
$$X \to XO^\top =: [\sqrt{m}\,\phi^*(X) + c_1,\, c_2, \dots, c_m], \qquad w \to Ow =: \theta,$$
where $[c_1, \dots, c_m]$ is the matrix with i.i.d. entries from $N(0, \sigma^2)$. The generalization error becomes
$$L(\theta) := R(Ow) = (1 - \sqrt{m}\,\theta_1)^2 + \sigma^2\|\theta\|^2,$$
where $\theta_1$ is the first coordinate of $\theta$. Notice that the last group of terms are contamination terms in spirit, and the first term is fundamentally about how much of the true signal survives. However, the scenario here has an approximation-error dimension as well, since the true pattern/signal here cannot actually be represented accurately using a finite number of features. The first feature in rotated coordinates is becoming an ever-better approximation to the true pattern as we add more features, and the approximation error goes to zero as the number of features goes to infinity. It is this improved approximation error that permits us the possibility of doing better with lots of features than we could possibly do with a few features.

To allow us to get a better handle on the contamination, denote by $\theta_{-1}$ the vector comprised of the rest of the components after the first one, and let $C_{-1} := [c_2, \dots, c_m]$.

This problem has many parts to help you walk through the derivation in detail, but it is good to remember what you already know from earlier homeworks. As the number of features increases, the effective "weight" (remember the meta-learning problem on the previous homework) on the first feature in rotated coordinates is increasing. This is going to make the min-norm interpolating solution put a lot of weight on that relative to its aliases — this is what is going to allow the signal to survive, even as we have to learn from a few samples. Meanwhile, the contaminating aliases are going to become less and less contaminating to test points as we add more features, in the same way that you have seen in earlier homeworks about Fourier-based models as well as Gaussian-based models.

(a) Let us give the first feature in the rotated coordinates its own name and notation: $\eta := \sqrt{m}\,\phi^*(X) + c_1$. This is where we want a lot of weight to end up in the learned solution. Prove the following formula for the solution:
$$(\hat\theta_1)_y = \begin{cases} y^\top \eta_\perp / \|\eta_\perp\|^2, & m \le n, \\ (C_{-1}^\dagger \eta)^\top C_{-1}^\dagger y \,/\, (1 + \|C_{-1}^\dagger \eta\|^2), & \text{else}, \end{cases}$$
$$(\hat\theta_{-1})_y = \begin{cases} C_{-1}^\dagger\left(I - \eta\,\eta_\perp^\top / \|\eta_\perp\|^2\right) y, & m \le n, \\ \left(I - C_{-1}^\dagger \eta\,(C_{-1}^\dagger \eta)^\top / (1 + \|C_{-1}^\dagger \eta\|^2)\right) C_{-1}^\dagger y, & \text{else}, \end{cases}$$
where $\eta_\perp$ is the projection of $\eta$ on the space orthogonal to the column span of $C_{-1}$.
Hint 1: the least squares solution finds the linear combination of columns that is closest to the target $y$. Therefore for $m < n$, we know $(\hat\theta_1)_y\,\eta$ is the dilation of $\eta$ that is closest to the projection of $y$ on the orthogonal complement of the span of the columns of $C_{-1}$.
Hint 2: $(\hat\theta_{-1})_y = C_{-1}^\dagger\left(y - (\hat\theta_1)_y\,\eta\right)$.
Hint 3: for $m > n$, plug the equation from the previous hint into the definition of the min-norm interpolating solution.
(b) We can already see that the relation between $y$ and $\eta$ is somewhat inconvenient: $\eta = \sqrt{m}\,y + c_1$, so it is easy to express $\eta$ in terms of $y$, but the equations depend on $\eta$ in a complicated way, so we would like to leave $\eta$ as a free variable. Fortunately, because we assumed that the distributions are jointly Gaussian, this dependence can be simplified too! Show that
$$y = \frac{\sqrt{m}}{m + \sigma^2}\,\eta + \sqrt{\frac{\sigma^2}{m + \sigma^2}}\,\varepsilon,$$
where $\varepsilon \sim N(0, I_n)$ is independent of $\eta$ and $C_{-1}$.
This decomposition is great because it lets us pretend that $y$ is actually a noisy version of a scaled first feature — bringing us closer to cases that we have studied earlier. Notice how the noise gets smaller as $m$ grows.
Hint: use the conditional distribution formula for multivariate Gaussians.
(c) Treating $\varepsilon$ as something like noise, we can derive a bias-variance style decomposition of $L(\theta) = (1 - \sqrt{m}\,\theta_1)^2 + \sigma^2\|\theta\|^2$:
$$\mathbb{E}_\varepsilon L(\hat\theta_y) = \mathbb{E}_\varepsilon L\!\left(\frac{\sqrt{m}}{m+\sigma^2}\,\hat\theta_\eta + \sqrt{\frac{\sigma^2}{m+\sigma^2}}\,\hat\theta_\varepsilon\right) = L\!\left(\frac{\sqrt{m}}{m+\sigma^2}\,\hat\theta_\eta\right) + \frac{\sigma^2}{m+\sigma^2}\,\mathbb{E}_\varepsilon\!\left[(\hat\theta_1)_\varepsilon^2\, m + \sigma^2\|\hat\theta_\varepsilon\|^2\right].$$
Here, notice that we have introduced this subscript notation for convenience on the estimated parameters in the rotated space — this refers to what happens if we run the relevant least- squares/min-norm estimator using the subscript as the target.
Now we can compute each part of the error separately. Let's start with the bias (the part that doesn't depend on the realization of the "noise" $\varepsilon$): show that
$$(\hat\theta_{-1})_\eta = \begin{cases} 0, & m \le n, \\ \dfrac{1}{1 + \|C_{-1}^\dagger\eta\|^2}\, C_{-1}^\dagger\eta, & \text{else}. \end{cases}$$
(Conceptual Hint: When m ≤ n, there is no alias for η among the features. When m > n, what is the relationship of C†−1η to an alias for η?)
(d) Now plug in the result of the previous part to obtain
$$(\hat\theta_1)_\eta = \begin{cases} 1, & m \le n, \\ 1 - \dfrac{1}{1 + \|C_{-1}^\dagger\eta\|^2}, & \text{else}, \end{cases}$$
and hence
$$L\!\left(\frac{\sqrt{m}}{m+\sigma^2}\,\hat\theta_\eta\right) = \begin{cases} \dfrac{\sigma^2}{m+\sigma^2}, & m \le n, \\[2ex] \left(\dfrac{\sigma^2}{m+\sigma^2} + \dfrac{m}{m+\sigma^2}\cdot\dfrac{1}{1+\|C_{-1}^\dagger\eta\|^2}\right)^{\!2} + \dfrac{\sigma^2 m}{(m+\sigma^2)^2}\cdot\dfrac{\|C_{-1}^\dagger\eta\|^2}{1+\|C_{-1}^\dagger\eta\|^2}, & m > n. \end{cases}$$
(e) Now let's deal with the "variance" part $\frac{\sigma^2}{m+\sigma^2}\,\mathbb{E}_\varepsilon\!\left[(\hat\theta_1)_\varepsilon^2\, m + \sigma^2\|\hat\theta_\varepsilon\|^2\right]$ (don't forget that $\varepsilon$ is a vector with i.i.d. standard normal entries that doesn't depend on $\eta$ or $C_{-1}$).
The variance term turns out to be a bit more complicated than the bias term, so we start with the following technical step: show that
$$\mathbb{E}_\varepsilon (\hat\theta_1)_\varepsilon^2 = \begin{cases} 1/\|\eta_\perp\|^2, & m \le n, \\ \|(C_{-1}^\dagger)^\top C_{-1}^\dagger\eta\|^2 \,/\, (1 + \|C_{-1}^\dagger\eta\|^2)^2, & \text{else}, \end{cases}$$
$$\mathbb{E}_\varepsilon \|(\hat\theta_{-1})_\varepsilon\|^2 = \begin{cases} \operatorname{tr}\!\left((C_{-1}^\dagger)^\top C_{-1}^\dagger \left(I - \eta_\parallel\eta_\perp^\top/\|\eta_\perp\|^2 - \eta_\perp\eta_\parallel^\top/\|\eta_\perp\|^2 + \eta_\parallel\eta_\parallel^\top/\|\eta_\perp\|^2\right)\right), & m \le n, \\ \operatorname{tr}\!\left(C_{-1}^\dagger (C_{-1}^\dagger)^\top\right) - \dfrac{2 + \|C_{-1}^\dagger\eta\|^2}{(1 + \|C_{-1}^\dagger\eta\|^2)^2}\,\|(C_{-1}^\dagger)^\top C_{-1}^\dagger\eta\|^2, & \text{else}, \end{cases}$$
where we denoted $\eta_\parallel := \eta - \eta_\perp$. The formulas may look bulky but the computation is very straightforward.
Hint 1: if some matrix $A$ is independent of $\varepsilon$, then $\mathbb{E}_\varepsilon\|A\varepsilon\|^2 = \|A\|_F^2 = \operatorname{tr}(AA^\top) = \operatorname{tr}(A^\top A)$.
Hint 2: explain why $C_{-1}^\dagger\eta = C_{-1}^\dagger\eta_\parallel$.
Hint 3: $\operatorname{tr}(AB^\top) = \operatorname{tr}(B^\top A)$ for any matrices $A, B \in \mathbb{R}^{a \times b}$, for any natural $a, b$.
(f) We will not ask you to finish the computation of the generalization error for the case $m > n$. The steps of that computation are exactly the same as the steps for the case $m \le n$, but it just turns out to be (substantially) more bulky. So, from now on you will only deal with the underparametrized case $m \le n$.

Moreover, even in this case we'll do some work for you: since $\eta_\parallel$ is zero-mean and independent of $\eta_\perp$ and $C_{-1}$, the expectation of the second and the third terms is equal to zero, so we can remove them without changing the expectation:
$$\mathbb{E}\!\left[\operatorname{tr}\!\left((C_{-1}^\dagger)^\top C_{-1}^\dagger \left(I - \eta_\parallel\eta_\perp^\top/\|\eta_\perp\|^2 - \eta_\perp\eta_\parallel^\top/\|\eta_\perp\|^2 + \eta_\parallel\eta_\parallel^\top/\|\eta_\perp\|^2\right)\right)\right] = \mathbb{E}\!\left[\operatorname{tr}\!\left((C_{-1}^\dagger)^\top C_{-1}^\dagger \left(I + \eta_\parallel\eta_\parallel^\top/\|\eta_\perp\|^2\right)\right)\right] = \mathbb{E}\!\left[\operatorname{tr}\!\left(C_{-1}^\dagger (C_{-1}^\dagger)^\top\right) + \|C_{-1}^\dagger\eta_\parallel\|^2/\|\eta_\perp\|^2\right].$$
So overall, for the expectation of the variance term we have
$$\frac{\sigma^2}{m+\sigma^2}\,\mathbb{E}\!\left[(\hat\theta_1)_\varepsilon^2\, m + \sigma^2\|\hat\theta_\varepsilon\|^2\right] = \begin{cases} \dfrac{\sigma^2}{m+\sigma^2}\,\mathbb{E}\!\left[\sigma^2\operatorname{tr}\!\left(C_{-1}^\dagger(C_{-1}^\dagger)^\top\right) + \left(\sigma^2\|C_{-1}^\dagger\eta_\parallel\|^2 + \sigma^2 + m\right)/\|\eta_\perp\|^2\right], & m \le n, \\[2ex] \dfrac{\sigma^2}{m+\sigma^2}\,\mathbb{E}\!\left[\sigma^2\operatorname{tr}\!\left(C_{-1}^\dagger(C_{-1}^\dagger)^\top\right) + \dfrac{m - \sigma^2(1+\|C_{-1}^\dagger\eta\|^2)}{(1+\|C_{-1}^\dagger\eta\|^2)^2}\,\|(C_{-1}^\dagger)^\top C_{-1}^\dagger\eta\|^2\right], & m > n. \end{cases}$$
Now use Lemma 1 to show that for $m \le n$
$$\mathbb{E}\!\left[\left(\sigma^2\|C_{-1}^\dagger\eta_\parallel\|^2 + \sigma^2 + m\right)/\|\eta_\perp\|^2\right] = \begin{cases} +\infty, & n - m + 1 \le 2, \\[1ex] \dfrac{1}{(m+\sigma^2)(n-m-1)}\,\mathbb{E}\!\left[\sigma^2\|C_{-1}^\dagger\eta_\parallel\|^2 + \sigma^2 + m\right], & n - m + 1 > 2. \end{cases}$$
Lemma 1. Suppose a r.v. $\xi$ has a chi-squared distribution with $d$ degrees of freedom. Then
$$\mathbb{E}[\xi^{-1}] = \begin{cases} +\infty, & d \le 2, \\ \dfrac{1}{d-2}, & d > 2. \end{cases}$$
(g) At this point we've obtained expressions for the generalization error that only depend on four random quantities: $\mathbb{E}\operatorname{tr}\!\left((C_{-1}^\dagger)^\top C_{-1}^\dagger\right)$, $\mathbb{E}\|(C_{-1}^\dagger)^\top C_{-1}^\dagger\|_F^2$, $\mathbb{E}\|C_{-1}^\dagger\eta\|^2$, and $\mathbb{E}\|(C_{-1}^\dagger)^\top C_{-1}^\dagger\eta\|^2$. To simplify the last step in the estimation of the expected error we are going to cheat a little bit: instead of computing true expectations we will simply substitute these four quantities with their expectations in our equations.

Hint: don't forget that $\eta$ is Gaussian and independent of $C_{-1}$. Use Lemma 2 to compute the four quantities listed above.
Lemma 2. Suppose $G$ is an $a \times b$ matrix with i.i.d. standard normal entries and $a \ge b$. Then
$$\mathbb{E}\operatorname{tr}\!\left(G^\dagger (G^\dagger)^\top\right) = \begin{cases} +\infty, & a \in \{b, b+1\}, \\ \dfrac{b}{a-b-1}, & a > b+1, \end{cases}$$
and
$$\mathbb{E}\left\|G^\dagger (G^\dagger)^\top\right\|_F^2 = \begin{cases} +\infty, & a \le b+3, \\ \dfrac{(a-1)\,b}{(a-b)(a-b-1)(a-b-3)}, & a > b+3. \end{cases}$$

(h) Finally, we are ready to compute the answer. As stated before, plug the following into the formulas:
$$\|C_{-1}^\dagger\eta\|^2 \approx \mathbb{E}\|C_{-1}^\dagger\eta\|^2, \qquad \|(C_{-1}^\dagger)^\top C_{-1}^\dagger\eta\|^2 \approx \mathbb{E}\|(C_{-1}^\dagger)^\top C_{-1}^\dagger\eta\|^2.$$
Moreover, let's assume that we are in the regime where $m$, $n$, $|m-n| \gg 1$, so the following also holds:
$$|m-n| \approx |m-n|-1 \approx |m-n|-2 \approx |m-n|-3, \qquad m \approx m-1 \approx m-2, \ \text{etc.}$$
(note that $\sigma^2$ is not a constant, so you cannot assume $m + \sigma^2 \approx m$).
Combine the expressions for bias and variance and plug in our approximations to show that for $m \le n$
$$\mathbb{E}\, L(\hat\theta_y) \approx \frac{n\sigma^2}{(m+\sigma^2)(n-m)}.$$
As promised, we don't ask you to do the same computation for $m > n$. However, we provide you with the answer: for $m > n$,
$$\mathbb{E}\, L(\hat\theta_y) \approx \frac{\sigma^2(n^2 + m\sigma^2)}{(n+\sigma^2)^2(m-n)}.$$
(i) Look once again at our result:
$$\mathbb{E}\, L(\hat\theta_y) \approx \begin{cases} \dfrac{n\sigma^2}{(m+\sigma^2)(n-m)}, & m \le n, \\[2ex] \dfrac{\sigma^2(n^2 + m\sigma^2)}{(n+\sigma^2)^2(m-n)}, & m > n. \end{cases}$$
Answer the following questions qualitatively:
• What happens if $m$ is close to $n$? What happens to the training error in this case?
• Suppose $n$ and $\sigma$ are constant. What is the shape of the curve if you plot $\mathbb{E}\, L(\hat\theta_y)$ against $m$? Do we always get double descent?
• Suppose $n$ and $\sigma$ are fixed. What is the smallest achievable error if $m \le n$? What if $m > n$? (Don't forget that $\sigma^2$ may be larger than $n$.)
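A small simulation in exactly this feature model can be sketched as follows (the sizes, seed, and trial count are illustrative; the notebook's actual code may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2 = 50, 1.0

def risk(m):
    """One draw of the generalization error R(w) with m noisy views of phi*."""
    phi = rng.normal(size=(n, 1))                         # phi*(x) on training points
    X = phi + np.sqrt(sigma2) * rng.normal(size=(n, m))   # features phi_i = phi* + Delta_i
    w = np.linalg.pinv(X) @ phi                           # least squares / min-norm solution
    return (1 - w.sum())**2 + sigma2 * (w**2).sum()       # R(w) from the problem statement

errors = {m: np.mean([risk(m) for _ in range(20)])
          for m in [5, 25, 45, 55, 100, 400]}             # sweep across m <= n and m > n
```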
(j) Compare your theoretical result to the simulation result in the Jupyter notebook. Report
your observations.
(k) (Optional) Copy, paste, and modify the simulation in the Jupyter notebook to plot what happens if you hold the number of features constant at something moderately large and sweep the number of training points from 0 to something large. Report your observations.
This should remind you of plots you saw in the meta-learning problem as well. The fact that more high-quality training data can actually at times make performance worse was not widely appreciated until recently, but is important to be aware of.
The second part of this homework is designed to further exercise skills and concepts by giving you past exam problems to look at. You can consider all of these problems optional.
6 Short answer (5 parts, Est. 14 minutes)
For each of the following questions, write at most one sentence in answer.
If you write more than one sentence, we will only grade the first sentence.
(a) (Est. 4 minutes) For the following loss functions l : R × R → R, write a YES or a NO in the
table for whether it
(1) assigns a loss of 0 to any prediction with the right sign, i.e., for any $\hat{y}, y_* \in \mathbb{R}$, $l(\hat{y}, y_*) = 0$ if $\operatorname{sign}(\hat{y}) = \operatorname{sign}(y_*)$,
(2) is bounded, i.e. $\max_{\hat{y} \in \mathbb{R},\, y_* \in \mathbb{R}} |l(\hat{y}, y_*)| \le C$ for some finite constant $C \ge 0$,
(3) is convex on $\mathbb{R}$ in its first argument, i.e. $\hat{y} \mapsto l(\hat{y}, y_*)$ is convex on $\mathbb{R}$,
(4) is differentiable everywhere on $\mathbb{R}$ in its first argument, i.e. $\hat{y} \mapsto l(\hat{y}, y_*)$ is differentiable on $\mathbb{R}$.
The losses are defined as follows (recall that both $\hat{y}$ and $y_*$ are scalars):
• 0-1 loss: $l(\hat{y}, y_*) = \begin{cases} 1 & \text{if } \operatorname{sign}(\hat{y}) \ne \operatorname{sign}(y_*) \\ 0 & \text{otherwise} \end{cases}$
• squared loss: $l(\hat{y}, y_*) = (\hat{y} - y_*)^2$
• absolute loss: $l(\hat{y}, y_*) = |\hat{y} - y_*|$
• sigmoid loss: $l(\hat{y}, y_*) = \dfrac{1}{1 + \exp(-\hat{y}\cdot y_*)}$
• hinge loss: $l(\hat{y}, y_*) = \max\{0,\, 1 - \hat{y}\cdot y_*\}$
The first column of answers is provided for you as an example.

                  (1) sign   (2) bounded   (3) convex   (4) everywhere differentiable
0-1 loss            YES        YES           NO           NO
squared loss
absolute loss
sigmoid loss
hinge loss
(b) (Est. 2 minutes) Give an example of two random variables X, Y ∈ R that are uncorrelated, but not independent.
(c) (Est. 2 minutes) Give a necessary and sufficient condition on a matrix X ∈ Rn×n, under which we can write X = UDU⊤ for D ∈ Rn×n diagonal and U ∈ Rn×n orthonormal.
(d) (Est. 2 minutes) Recall that when describing the optimal regression weights $w$ using ridge regression, the pseudo-inverse, and PCA, we can write the weights using the SVD of the feature matrix $X$ as follows, with rescaling of the singular values $s_i$ according to the spectral function $\rho(s_i)$:
$$w = \sum_{i=1}^n \rho(s_i)\, v_i u_i^\top y.$$
Figure 1 plots the spectral function $\rho(s_i)$ for (1) Pseudo-Inverse, (2) PCA, and (3) Ridge Regression. In the boxes on Figure 1, label the graphs of the spectral function with the method that they correspond to.

[Figure 1: Fill in the boxes with "ridge", "pinv", and "PCA".]
(e) (Est. 4 minutes) Assume that $X \in \mathbb{R}^{n \times d_1}$ and $Y \in \mathbb{R}^{n \times d_2}$ are two full column-rank datasets ($n \ge d_1$, $n \ge d_2$). Recall that in the computations for canonical correlation analysis (CCA), we would like to maximize the correlation coefficient
$$\rho_{\text{cca}}(X, Y) := \max_{\substack{v \in \mathbb{R}^{d_1},\, u \in \mathbb{R}^{d_2} \\ u \ne 0,\, v \ne 0}} \frac{v^\top X^\top Y u}{\sqrt{v^\top X^\top X v \cdot u^\top Y^\top Y u}}.$$
Now consider a full column-rank matrix Z ∈ Rn×d (i.e. rank(Z) = d). Compute the following correlation coefficients, and write down any corresponding pair of vectors u, v ∈ Rd that achieves the maximum.
(1) ρcca(Z, Z)
(2) ρcca(Z, −Z)
7 Gaussian Kernels (3 parts, Est. 12 minutes)
In this question, we will look at training a binary classifier with a Gaussian kernel. Specifically
given a labelled dataset $S = \{(x_i, y_i)\}_{i=1}^n \subseteq \mathbb{R}^d \times \{\pm 1\}$ and a kernel function $k(x_1, x_2)$, we consider

classifiers of the form:
$$\hat{f}(x) = \operatorname{sign}\left(\sum_{i=1}^n \alpha_i\, k(x, x_i)\right),$$
where we define $\operatorname{sign}(u)$ to be $1$ if $u \ge 0$ and $-1$ if $u < 0$. In order to choose the weights $\alpha_i$, $i = 1, \dots, n$, we will consider the least-squares problem:
$$\alpha \in \arg\min_{\alpha \in \mathbb{R}^n} \|K\alpha - y\|_2^2, \tag{11}$$
where $K = (k(x_i, x_j))_{i,j=1}^n$ is the kernel matrix and $y = (y_1, \dots, y_n)$ is the vector of labels. We will work with the Gaussian kernel. Recall that the Gaussian kernel with bandwidth $\sigma > 0$ is defined as:
$$k(x_i, x_j) := \exp\left(-\frac{\|x_i - x_j\|_2^2}{2\sigma^2}\right).$$
(a) (Est. 4 minutes) When the bandwidth parameter $\sigma \to 0$, observe that the off-diagonal entries of the kernel matrix $K$ also tend to zero. Consider a two-sample dataset $S$ (i.e. $n = 2$) with $(x_1, y_1) = (1, 1)$ and $(x_2, y_2) = (-1, -1)$. Assuming that as $\sigma \to 0$ the off-diagonal entries of $K$ are equal to zero (and the diagonal entries unmodified), what is the optimal solution $\alpha$ of the optimization problem (11), and what is the resulting classifier $\hat{f}(x)$?
(b) (Est. 4 minutes) Now we consider the regime when the bandwidth parameter $\sigma \to +\infty$. Observe that in this regime the off-diagonal entries of the kernel matrix $K$ tend to one. Given a
dataset $S$, suppose we solve the optimization problem (11) with all the off-diagonal entries of $K$ equal to one (and the diagonal entries unmodified). Prove that if the number of $+1$ labels in $S$ equals the number of $-1$ labels in $S$, then $\alpha = 0$ is an optimal solution of (11). What is the resulting classifier $\hat{f}(x)$?
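As a quick numeric illustration of the least-squares setup (11) with the Gaussian kernel (not part of the required solution; the bandwidth below is illustrative):

```python
import numpy as np

X = np.array([[1.0], [-1.0]])          # the two-sample dataset from part (a)
y = np.array([1.0, -1.0])
sigma = 10.0                           # illustrative "large but finite" bandwidth

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
K = np.exp(-sq / (2 * sigma**2))                      # Gaussian kernel matrix
alpha = np.linalg.lstsq(K, y, rcond=None)[0]          # solve problem (11)

def f_hat(x):
    """Evaluate the learned classifier at a scalar test point x."""
    return np.sign(np.exp(-((x - X[:, 0]) ** 2) / (2 * sigma**2)) @ alpha)
```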
(c) (Est. 4 minutes) Now we consider the regime when the bandwidth parameter is large but finite. Consider again the two-sample dataset $S$ with $(x_1, y_1) = (1, 1)$ and $(x_2, y_2) = (-1, -1)$. When $\sigma \gg 1$, we can approximate $k(x_1, x_2) \approx 1 + \frac{x_1 x_2}{2\sigma^2}$. Show that the solution of the optimization problem (11) with the kernel $k_a(x_1, x_2) = 1 + \frac{x_1 x_2}{2\sigma^2}$ is given by $\alpha = (\sigma^2, -\sigma^2)$. What is the resulting classifier $\hat{f}(x)$?

Hint: The inverse of a $2 \times 2$ matrix is given by the formula
$$\begin{pmatrix} a & b \\ c & d \end{pmatrix}^{-1} = \frac{1}{ad - bc}\begin{pmatrix} d & -b \\ -c & a \end{pmatrix}.$$

8 Neural Networks (3 parts, Est. 12 minutes)

We consider the following neural network structure, where $x, w_1, w_2 \in \mathbb{R}^7$ and $\alpha_1, \alpha_2, h_{11}, h_{12}, h_{21}, h_{22}, \hat{y} \in \mathbb{R}$. The output of this network $\hat{y}(x)$ is given by $\hat{y} = \alpha_1 v_1 + \alpha_2 v_2$, where
$$u_1 = \max\{0, w_1^\top x\}, \qquad u_2 = \max\{0, w_2^\top x\},$$
$$v_1 = \max\{0, h_{11} u_1 + h_{21} u_2\}, \qquad v_2 = \max\{0, h_{12} u_1 + h_{22} u_2\}.$$
For a single data pair $(x, y)$ we consider the loss function $l(x, y) = \frac{1}{2}(\hat{y}(x) - y)^2$.

(a) (Est. 4 minutes) The average loss over an arbitrary set of labeled points $\{x_i, y_i\}$ is given by $\frac{1}{n}\sum_{i=1}^n l(x_i, y_i)$. In general, this function does not have a unique minimizer. Give a specific reason why this is true.

(b) (Est. 4 minutes) Compute the partial derivative $\frac{\partial}{\partial h_{ij}}\, l(x, y)$. Your answer may be in terms of intermediate variables but must not contain any partial derivatives. Please give an answer in terms of a general $\frac{\partial}{\partial h_{ij}}$ and do not simply give an explicit enumeration of the four different $\frac{\partial}{\partial h_{ij}}$ partial derivatives.
Note: You may take the derivative of a ReLU at zero to be zero.

(c) (Est. 4 minutes) Assuming $n > 2$, suppose we train this network with stochastic gradient descent using minibatches of size 2 and step size $\eta$. Write down a procedure to update $h_{ij}$ at time $t$. (You may write it in terms of an unexpanded $\frac{\partial l}{\partial h_{ij}}$, i.e. you do not need to rely on a correct part (b).)
9 Multiple Choice (Est. 10 minutes)
For the following multiple choice questions, select all that apply.
(a) Which of the following computation graphs describes the function
$$z = \sum_{i=1}^n \phi(x_i^\top w + b_i) + \|w\|^2,$$
where $z \in \mathbb{R}$, $w, x_i \in \mathbb{R}^d$, and $b_i \in \mathbb{R}$?
[Three candidate computation graphs, (A), (B), and (C), each with inputs x1, x2, x3, w, and b1, b2, b3 feeding into the output z.]
⃝ computation graph (A)
⃝ computation graph (B)
⃝ computation graph (C)

(b) Consider a distribution $\mathcal{D}$ on labelled pairs $(x, y) \in \{\pm 1\} \times \{\pm 1\}$ where the conditional distribution of $y$ given $x$ is:
$$y = \begin{cases} x & \text{with probability } \alpha, \\ -x & \text{with probability } 1 - \alpha. \end{cases}$$
The 0/1 population risk of a classifier of the form $\operatorname{sign}(w \cdot x)$, $w \in \mathbb{R}$, is defined as:
$$R_{\mathcal{D}}[w] = \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\mathbb{I}(\operatorname{sign}(w \cdot x) \ne y)\right].$$
What is $\min_{w \in \mathbb{R}} R_{\mathcal{D}}[w]$?
(a) $1/2$.
(b) $\max\{\alpha, 1-\alpha\}$.
(c) $\min\{\alpha, 1-\alpha\}$.
(d) Depends on the marginal distribution of $x$.
(c) Using a fully connected neural net with one hidden layer and ReLU activations, what is the minimum number of hidden nodes needed to achieve perfect classification, with respect to the hinge loss, of the data shown in Figure 2 (the plus signs denote data points labelled +1 while the minus signs denote data points labelled −1)? Mathematically, the neural network we are considering has the form
$$f(x) = \beta + \sum_{j=1}^N \alpha_j \max\{\langle w_j, x\rangle + b_j, 0\},$$
where $\beta$, $\alpha_j$, $w_j$, and $b_j$ are the parameters of the network.
[Figure 2: Data set for part (d), showing data points labelled + and −.]
(a) N = 1 (b) N = 2 (c) N = 3 (d) N = 4
10 When testing is “unfair” (Est. 35 minutes)
In this problem we explore what happens when we train an estimator on a data set which has a
different distribution than the test set.
For concreteness, let us consider the problem of linear regression where we want to learn something close to the true linear model f (x) = x⊤ w⋆ . Our training set consists of a feature matrix X ∈ Rn×d with rows x⊤1 , . . . , x⊤n and noisily observed target vector y = Xw⋆ + ε with Gaussian observation noise ε ∼ N (0, I), where w⋆ ∈ Rd is the true weight vector.
10
HW13, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 26

After fitting a linear regression model to (X, y), which results in the estimator $\hat{w}$, we want to know how well this learnt model will perform on a test set where inputs have been drawn from a different distribution than the training points, but still generating true outputs in the same manner (namely with $y_{\text{test}} = x_{\text{test}}^\top w^\star$, for the same true $w^\star$). This is sometimes referred to as “covariate shift” in the machine learning and statistical literature.
In the following we assume that X is full rank and has compact singular value decomposition $X = U \Lambda^{1/2} V^\top$ with diagonal matrix $\Lambda$ which has entries $\lambda_1, \dots, \lambda_d$ on the diagonal and matrices $U \in \mathbb{R}^{n \times d}$, $V \in \mathbb{R}^{d \times d}$. We refer to the matrix $\Sigma = X^\top X$ as the empirical covariance matrix of X.
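A quick way to get comfortable with this setup is to generate a synthetic instance and inspect the compact SVD; here is a small sketch (the sizes and distributions are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))           # training features
w_star = rng.normal(size=d)           # true weights
y = X @ w_star + rng.normal(size=n)   # y = X w* + eps, eps ~ N(0, I)

U, s, Vt = np.linalg.svd(X, full_matrices=False)  # X = U diag(s) V^T
Lambda = s**2                         # eigenvalues of Sigma = X^T X
print(np.allclose(X.T @ X, Vt.T @ np.diag(Lambda) @ Vt))  # True
```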
(a) (Est. 5 minutes) Write down the least squares estimator $\hat{w}$ you obtain by minimizing the ordinary least squares (OLS) objective, i.e. $\hat{w} := \arg\min_{w \in \mathbb{R}^d} \frac{1}{2} \|y - Xw\|_2^2$, in terms of $w^\star$, $V$, $\Lambda$ and $\tilde{\varepsilon} = U^\top \varepsilon$. You do not need to derive the standard OLS solution but can start with that if you choose.
(b) (Est. 5 minutes) What are the mean and covariance matrix of the Gaussian random vector $\tilde{\varepsilon} = U^\top \varepsilon$?
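Once you have a candidate answer for part (b), you can check it empirically by drawing many noise vectors; this sketch uses an arbitrary U with orthonormal columns and prints the sample mean and covariance of $\tilde{\varepsilon}$ for comparison.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, trials = 50, 3, 20000
U, _ = np.linalg.qr(rng.normal(size=(n, d)))  # any matrix with orthonormal columns
eps = rng.normal(size=(trials, n))            # rows are draws of eps ~ N(0, I_n)
eps_tilde = eps @ U                           # each row is U^T eps
print(eps_tilde.mean(axis=0))                 # compare to your answer
print(np.cov(eps_tilde, rowvar=False))        # compare to your answer
```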
(c) (Est. 15 minutes) We are now given a different test set $X_{\text{test}}$ with empirical covariance $\Sigma_{\text{test}}$, where $X_{\text{test}}$ has the same eigenbasis (and hence implicitly, the same number n of points) for simplicity, i.e. $X_{\text{test}} = U \Lambda_{\text{test}}^{1/2} V^\top$. Use parts (a), (b) to derive the following expected average test prediction error in terms of the eigenvalues $\lambda_i$ of $X^\top X$ and eigenvalues $\lambda_i^{\text{test}}$ of $X_{\text{test}}^\top X_{\text{test}}$:
$$\frac{1}{n} \, \mathbb{E}_\varepsilon \big\| X_{\text{test}} \hat{w} - X_{\text{test}} w^\star \big\|^2 = \frac{1}{n} \sum_{i=1}^{d} \frac{\lambda_i^{\text{test}}}{\lambda_i}.$$
Recall that these eigenvalues, $\lambda_i^{\text{test}}$, correspond to the same i-th eigenvector in V, for i = 1, . . . , d. Note that the expectation above is taken over the only randomness in the problem, the training noise ε.
Hint: Depending on how you do it, you might find these facts useful: (i) $u^\top v = \operatorname{tr}(v u^\top)$, (ii) $\mathbb{E}[\operatorname{tr}(A)] = \operatorname{tr}(\mathbb{E}[A])$, and (iii) $\operatorname{tr}(A) = \sum_i \lambda_i$ where the $\lambda_i$ are the eigenvalues of matrix A. But there are other ways to get to the answer as well.
(d) (Est. 5 minutes) In practice, we sometimes have a choice of training sets. Let's consider a concrete scenario with d = 2. We assume for simplicity that $V = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ for all covariance matrices to follow. Assume that we can choose to obtain noisy observations $y = Xw^\star + \varepsilon$ for three possible training feature matrices $X_1, X_2, X_3$ whose covariance matrices have the following diagonal eigenvalue matrices:
$$\Lambda_1 = \begin{pmatrix} 10 & 0 \\ 0 & 10 \end{pmatrix}, \qquad \Lambda_2 = \begin{pmatrix} 1 & 0 \\ 0 & 100 \end{pmatrix}, \qquad \Lambda_3 = \begin{pmatrix} 100 & 0 \\ 0 & 0.1 \end{pmatrix}.$$
First, we are given a test set $X_{\text{test}}$ (with the same singular vectors as X) with diagonal eigenvalue matrix $\Lambda_{\text{test}} = \begin{pmatrix} 0.01 & 0 \\ 0 & 1 \end{pmatrix}$ and are asked to minimize the average expected prediction error $\frac{1}{n} \mathbb{E}_\varepsilon \| X_{\text{test}} \hat{w} - X_{\text{test}} w^\star \|^2$. Which training feature matrix do you choose and why?
(e) (Est. 5 minutes) Now you are asked to make a choice of training data from the previous part’s X1, X2, X3 which minimizes the worst-case expected error under any possible xtest with unit norm. Which feature matrix do you choose and why?
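Before deriving the identity in part (c), it can be reassuring to confirm it numerically; here is a Monte Carlo sketch for one randomly chosen spectrum (all sizes and distributions are placeholders).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, trials = 40, 3, 5000
U, _ = np.linalg.qr(rng.normal(size=(n, d)))  # shared left singular vectors
V, _ = np.linalg.qr(rng.normal(size=(d, d)))  # shared eigenbasis
lam, lam_test = rng.uniform(0.5, 5, d), rng.uniform(0.5, 5, d)
X = U @ np.diag(np.sqrt(lam)) @ V.T
X_test = U @ np.diag(np.sqrt(lam_test)) @ V.T
w_star = rng.normal(size=d)

errs = []
for _ in range(trials):
    y = X @ w_star + rng.normal(size=n)
    w_hat = np.linalg.lstsq(X, y, rcond=None)[0]  # OLS fit
    errs.append(np.sum((X_test @ (w_hat - w_star)) ** 2) / n)
print(np.mean(errs), np.sum(lam_test / lam) / n)  # the two should roughly agree
```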
11 Coin tossing with unknown coins (Est. 35 minutes)
This question is about adapting EM and the spirit of k-means to a discrete problem of tossing coins.
We have a bag that contains two kinds of coins that look identical. The first kind has probability of heads pa and the other kind has probability of heads pb, but we don’t know these. We also don’t know how many of each kind of coin are in the bag; so the probability, αa, of drawing a coin of the first type is also unknown (and since αa + αb = 1, we do not need to separately estimate αb, the probability of drawing a coin of the second type).
What we have is n pieces of data: for each data point, someone reached into the bag, pulled out a random coin, tossed it l times and then reported the number hi which was the number of times it came up heads. The coin was then put back into the bag. This was repeated n times. The resulting n head-counts (h1, h2, h3, . . . , hn) constitute our data.
Our goal is to estimate pa, pb, αa from this data in some reasonable way. For this problem, the binomial distribution can be good to have handy:
$$P(H = h) = \binom{l}{h} p^h (1 - p)^{l - h}$$
for the probability of seeing exactly h heads having tossed a coin l times with each toss inde- pendently having probability p of turning up heads. Also recall that the mean and variance of a binomial distribution are given respectively by lp and lp(1 − p).
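The generative process above is straightforward to simulate, which is useful for testing whatever estimators you construct in the parts below; here is a minimal sketch (the parameter values are placeholders taken from part (b)).

```python
import numpy as np

def sample_head_counts(n, l, p_a, p_b, alpha_a, rng):
    # For each of n draws: pick a coin type, then toss it l times.
    is_a = rng.random(n) < alpha_a
    p = np.where(is_a, p_a, p_b)
    return rng.binomial(l, p)  # head counts h_1, ..., h_n

rng = np.random.default_rng(0)
h = sample_head_counts(n=1000, l=5, p_a=0.4, p_b=0.6, alpha_a=0.5, rng=rng)
print(np.bincount(h, minlength=6))  # histogram of head counts 0..5
```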
(a) (Est. 10 minutes) How would you adapt the main ideas in the k-means algorithm to construct an analogous approach to estimating $\hat{p}_a, \hat{p}_b, \hat{\alpha}_a$ from this data set? Give an explicit algorithm, although it is fine if it is written just in English.
(b) (Est. 8 minutes) Suppose that the true $p_a = 0.4$ and the true $p_b = 0.6$ and $\alpha_a = 0.5$, and $l = 5$. For $n \to \infty$, will your “k-means” based estimates (those from the preceding question) for $\hat{p}_a$ and $\hat{p}_b$ yield the correct parameter estimates (namely, $\hat{p}_a = 0.4$ and $\hat{p}_b = 0.6$)? Why or why not?
Hint: Draw a sketch of the typical histograms of the number of heads of each coin on the same axes.
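For the hint, the two head-count distributions can be tabulated directly instead of sketched by hand; with $p_a = 0.4$, $p_b = 0.6$ and $l = 5$, note how heavily the two pmfs overlap.

```python
import numpy as np
from math import comb

def binom_pmf(l, p):
    # pmf of the number of heads in l tosses with heads probability p
    return np.array([comb(l, h) * p**h * (1 - p)**(l - h) for h in range(l + 1)])

l = 5
for h, (qa, qb) in enumerate(zip(binom_pmf(l, 0.4), binom_pmf(l, 0.6))):
    print(h, round(qa, 3), round(qb, 3))  # P(H = h) for p_a = 0.4 vs p_b = 0.6
```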
(c) (Est. 17 minutes) How would you adapt the EM for Gaussian Mixture Models that you have seen to construct an EM algorithm for estimating $\hat{p}_a, \hat{p}_b, \hat{\alpha}_a$ from this data set?
You don't have to solve for the parameters in closed form, but (i) write down the E-step update equations (i.e. write down the distributions that should be computed for the E-step, not in general, but specifically for this problem) and (ii) the objective function that gets maximized for the M-step and also what you are maximizing with respect to (again,
not just the general form, but specific to this problem). If you introduce any notation, be sure to explain what everything means. Explain in words what the E- and M-steps are doing on an intuitive level.
12 SVMs for Novelty Detection (Est. 40 minutes)
This problem is an SVM-variant that works with training data from only one class.
The classification problems we saw in class are two-class or multi-class classification problems. What would one-class classification even mean? In a one-class classification problem, we want to determine whether our new test sample is normal (not as in Gaussian), namely whether it is a member of the class represented by the training data or whether it is abnormal. One-class classification is also called outlier detection. In particular, we assume that all/most of the training data are from the normal class, and want to somehow model them, such that for new unseen test points, we can tell whether they “look like” these points, or whether they are different (i.e., abnormal).
For example, Netflix may want to predict whether a user likes a movie but only have thumbs-up data about movies that the user liked and no thumbs-down votes at all. How can we deal with learning with no negative training samples?
(a) (Est. 7 minutes) Let $x_1, x_2, \dots, x_n$ be your training data for the one-class classification problem (all supposedly belonging to one class, the normal class). One way to formulate one-class classification using SVMs is to have the goal of finding a decision plane which goes through the origin, and for which all the training points are on one side of it. We also want to maximize the distance between the decision plane and the data points. Let the equation of the decision plane H be
$$H := \{x \in \mathbb{R}^d \mid w^\top x = 0\}. \tag{12}$$
Let the margin m be the distance between the decision plane and the data points:
$$m = \min_i \frac{|w^\top x_i|}{\|w\|}. \tag{13}$$
If the convex hull of the training data does not contain the origin, then it is possible to solve the following optimization problem:
$$\hat{w} = \arg\min_w \frac{1}{2} \|w\|^2 \tag{14}$$
$$\text{subject to} \quad w^\top x_i \geq 1, \quad 1 \leq i \leq n. \tag{15}$$
Argue that in the above hard one-class SVM optimization (assuming that the convex hull of the training data does not contain the origin), the resulting margin is given by $\hat{m} = \frac{1}{\|\hat{w}\|}$.
(b) (Bonus, Est. 10 minutes) The optimal $\hat{w}$ in the hard one-class SVM optimization problem defined by (14) and (15) is identical to the optimal $\hat{w}_{\text{two-class}}$ in the traditional two-class hard-margin SVM you saw in class using the augmented training data $(x_1, 1), (x_2, 1), \dots, (x_n, 1), (-x_1, -1), (-x_2, -1), \dots, (-x_n, -1)$.
Argue why this is true by comparing the objective functions and constraints of the two optimization problems, as well as the optimization variables.
(c) (Est. 3 minutes) It turns out that the hard one-class SVM optimization cannot deal with problems in which the origin is in the convex hull of the training data. To extend the one-class SVM to such data, we use the hinge loss function
$$\max(0, 1 - w^\top x_i) \tag{16}$$
to replace the hard constraints used in the one-class SVM, so that the optimization becomes
$$\hat{w} = \arg\min_w \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \max(0, 1 - w^\top x_i). \tag{17}$$
Explain how the hyper-parameter C > 0 affects the behavior of the soft one-class SVM in (17).
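If you want to see the effect of C empirically, the objective in (17) can be minimized by subgradient descent; this is only a rough sketch (the data, step size, and iteration budget are placeholders, and a proper convex solver would be more reliable in practice).

```python
import numpy as np

def soft_one_class_svm(X, C, lr=1e-3, iters=5000):
    # Minimize 0.5*||w||^2 + C * sum_i max(0, 1 - w^T x_i) by subgradient descent.
    w = np.zeros(X.shape[1])
    for t in range(iters):
        violating = (1.0 - X @ w) > 0          # points with w^T x_i < 1
        subgrad = w - C * X[violating].sum(axis=0)
        w -= lr / np.sqrt(t + 1.0) * subgrad   # diminishing step size
    return w

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=0.5, size=(200, 2))  # placeholder one-class data
for C in [0.01, 0.1, 10.0]:
    w = soft_one_class_svm(X, C)
    print(C, np.round(w, 3), float(np.mean(X @ w >= 1)))
```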
(d) (Est. 5 minutes) Let $\hat{w}^\top x$ be the score of sample x, where $\hat{w}$ is the optimal w according to the training data and the soft one-class SVM in (17) with some particular C. We will classify a sample as an outlier if its score is below a threshold η. Describe a procedure to find a threshold η so that about 5% of your training data will be classified as outliers.
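One natural reading of “about 5%” is an empirical quantile of the training scores; here is a tiny sketch (the scores array is a placeholder for $\hat{w}^\top x_i$ computed on your own training data).

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(loc=2.0, size=500)  # placeholder: w_hat^T x_i for each training point
eta = np.percentile(scores, 5)          # 5th percentile of training scores
print(eta, np.mean(scores < eta))       # roughly 0.05 of points fall below eta
```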
(e) (Est. 10 minutes) The Lagrangian dual of the soft one-class SVM optimization problem is:
$$\hat{\alpha} = \arg\max_{\alpha} \; \alpha^\top \mathbf{1} - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j x_i^\top x_j \tag{18}$$
$$\text{subject to} \quad 0 \leq \alpha_i \leq C, \quad 1 \leq i \leq n.$$
Given a sample $x_{\text{test}}$ and the $\hat{w}$ solution to the primal one-class SVM optimization in (17), you could test whether it is an outlier by checking if $\hat{w}^\top x_{\text{test}} < \eta$. How can you test whether $x_{\text{test}}$ is an outlier using the dual formulation, given the training data and the optimal $\hat{\alpha}$ from (18)?
(f) (Est. 5 minutes) Your friend claims that linear models like the one-class SVM are too simple to be useful in practice. After all, for the example training data in Figure 3, it is impossible to find a sensible decision line to separate the origin and the raw training data. Suppose that we believe the right pattern for “normalcy” here is everything within an approximate annulus around the unit circle. How could you use the one-class SVM to do the right thing for outlier detection with such data? Explain your answer.
[Figure 3 (counter example provided by your friend): training points forming an approximate annulus around the unit circle; both axes run from −1 to 1.]
13 Multiple Choice Questions (Est. 30 minutes)
For these questions, select all the answers which are correct. You will get full credit for selecting all the right answers. On some questions, real-valued partial credit will be assigned. You will be graded on your best 15 of the 18 questions below. So feel free to skip three of them if you want.
(a) Which of the following classifiers can be used to classify the training data perfectly? Methods are not kernelized or with additional features unless explicitly stated so.
⃝ Logistic regression
⃝ 3-nearest neighbors
⃝ SVM with a quadratic kernel
⃝ Hard margin SVM
⃝ LDA
⃝ QDA
(b) Which of the following classifiers can be used to classify the data perfectly? Methods are not kernelized or with additional features unless explicitly stated so.
⃝ Logistic regression without a kernel
⃝ 3-nearest neighbors
⃝ SVM with a quadratic kernel
⃝ Hard margin SVM
⃝ LDA
⃝ QDA
(c) Which of the following loss functions tends to result in a sparse solution? Select all that apply.
⃝ a
⃝ b
⃝ c
⃝ None
(d) “Regularization” is a critical aspect of machine learning. Which of the following statements about regularization is true?
⃝ the main function of regularization is to reduce the bias of your learned predictor
⃝ the main function of regularization is to reduce the variance of your learned predictor
⃝ regularization serves to combat overfitting
⃝ regularization serves to combat underfitting
⃝ Removing training data points is regularizing
⃝ Adding noisy copies of training data points is regularizing
⃝ Dropout is regularizing
⃝ Doing dimensionality reduction as a part of the learning process has a regularizing effect.
⃝ Adding a penalty term on the learned weights can be regularizing.
⃝ Adding a prior for the learned weights can be regularizing.
⃝ Averaging together the results of diverse predictors is regularizing.
⃝ Adding features to a learning problem is regularizing.
⃝ Using kernelized methods is always more regularizing than using direct methods.
(e) The form of the decision boundary for a Linear Discriminant Analysis (LDA) based classifier
⃝ is always linear
⃝ is always quadratic
⃝ is always cubic
⃝ can take on any form
⃝ depends on if the covariance matrices are diagonal
⃝ depends on if the means for each class are all the same
(f) K-means
⃝ is like LDA and QDA in that it is a generative model for classification.
⃝ is similar to EM in that during training it alternates between (i) assigning points to clusters and (ii) updating the parameters of the model – centroids for each class.
⃝ is like Mixtures of Gaussians because it works on unsupervised data.
⃝ always optimizes a loss which is equivalent to cross-entropy loss.
⃝ is dissimilar to EM in that it hard assigns each training point to only one cluster during each training iteration.
⃝ is guaranteed to find a global optimum.
(g) Which of the following statements are true about loss functions?
⃝ loss functions are usually learned from the data set
⃝ cross-entropy loss and mean-squared loss are always identical.
⃝ the lower the value of the loss function at the end of training, the better the generalization error.
⃝ MSE loss can be derived from doing maximum likelihood on a linear regression model (with Gaussian iid noise).
(h) Your organization wants to develop a predictor as a component of a self-driving car. It collects a lot of labeled data and randomly divides it into two parts A and B. It creates different teams to try different prediction techniques and gives every team access to A. The organization holds back B so that it can evaluate the performance of each team's predictor on B to choose which predictor to commit to for further development in the self-driving car. Your team has further divided the A data into two disjoint sets: AV and AT. Which of the following are true?
⃝ If you use AT to train your neural net predictor, you should use AV to decide how many layers to use.
⃝ If you use AT to train a kernelized SVM, you should use AV to pick the kernel parameters.
⃝ If you use AT to create a nearest-neighbor based predictor, you should use AV to pick the number of neighbors.
⃝ If you use AT to train a boosted decision tree, you should use AV to decide how many stages of boosting to use.
⃝ Your team should try to figure out a way to get access to B so you can make the best predictor.
(i) Consider using k-nearest-neighbors for the problem of classification. Which of the following are true?
⃝ The training classification error on the training data used to create the 1-nearest-neighbor classifier is always zero.
⃝ As the number of neighbors k gets very large and the number of training points goes to infinity, the probability of error for a k-nearest-neighbor classifier will tend to approach the probability of error for a MAP classifier that knew the true underlying distributions.
⃝ Reducing k has a regularizing effect.
⃝ If the training data is linearly separable with a positive margin, then the decision boundary found by 1-nearest-neighbors will always be a hyperplane.
⃝ k-nearest-neighbors can be regularized by dimensionality reduction via PCA applied to the training data.
(j) Consider using a random forest of depth-1 trees for least-squares regression where the individual trees are formed by randomly picking a feature and then randomly choosing a value for it to split on. All random choices for the trees in the random forest are made i.i.d. Which of the following are true?
⃝ Such individual regression trees are always going to be unbiased estimators.
⃝ The expected (over both the training data and the random choices used to construct the trees) bias of the random forest estimator will always be the same as the expected bias of a single random tree.
⃝ The variance (with an expectation over both the training data and the random choices used to construct the trees) of the random forest estimator will be a non-increasing function of the number of trees in the random forest.
⃝ The random forest estimator will tend to become unbiased as the number of trees in it increases.
(k) You have a function that is implemented as a black box executable to which you have lost the source code. (You can run it as much as you would like though.) The function has three inputs that are a True/False Boolean, a real number from [−1, +1], and an integer from 1 to 10. The function returns true or false. You decide to learn an approximation to this function using machine learning and use the black box executable to generate training data. Which of the following are reasonable things to do in this situation?
⃝ Representing the Boolean input using ±1.
⃝ Check our performance using a validation set that was drawn separately from the training set
⃝ Try using AdaBoost with decision trees to fit the training data
⃝ Try using OLS to fit the training data
⃝ Represent the integer input using a real number but scaling and shifting it to be in the range [−1, +1].
⃝ Represent the integer input using a vector of length 10 and one-hot-encoding.
⃝ Divide the problem into 20 sub-problems based on the Boolean and integer values, and just learn a 1-dimensional classifier for each one using a binary decision tree on the remaining real input.
⃝ Do dimensionality reduction using random projections to regularize this problem
(l) Suppose we have a neural network with one hidden layer and all linear activations. Then we create a second model which is similar, but with 50 hidden layers (of the same size). No parameters are tied. Which of the following are true?
⃝ the 50-layer network has the same representational capacity as the 1-layer network.
⃝ the 50-layer network has greater representational capacity than the 1-layer network.
⃝ the 50-layer network is more likely to suffer from unstable gradients than the 1-layer network.
⃝ if we let the 1-layer network be arbitrarily wide and replace the linear activations with sigmoid activations, this updated 1-layer network would have higher representational capacity (be able to model a larger class of functions) than the original 50-layer network (with only linear activations).
(m) Which of the following statements about Convolutional Neural Networks (CNN) are true?
⃝ the convolution operation is invariant to image rotation
⃝ a CNN has fewer parameters than a fully connected network when the number of hidden units (and input size and number of outputs) is the same
⃝ a CNN has more parameters than a fully connected network when the number of hidden units (and input size and number of outputs) is the same
⃝ the convolutional filters must be designed by hand because they cannot be learned by gradient descent
(n) You designed a 3-layer, fully connected neural network with sigmoid activation functions for a classification task. The output layer is a linear layer followed by a softmax function. The whole network is trained by stochastic gradient descent with cross entropy loss. You observe that the accuracy on the training set is close to 100%; however, it performs poorly on the hold-out validation set. Which of these are likely to improve the validation performance?
⃝ Increase the number of hidden units
⃝ Reduce the number of hidden units
⃝ Increase the l2 regularization weight
⃝ Reduce the l2 regularization weight
⃝ Add the use of dropout to the neural network training
⃝ Train the network for more iterations/epochs
(o) The ReLU activation function, r(z), is less likely to contribute to vanishing gradients than the sigmoid activation function, s(z), because:
⃝ s(z) ≥ r(z) for z < 0
⃝ r(z) always has a constant non-zero gradient for z > 0
⃝ r(z) is constant when z < 0
⃝ r(z = 0) = 0 whereas s(z = 0) = 0.5
(p) Backpropagation
⃝ is a type of neural network model
⃝ is a way of computing the gradient in neural networks
⃝ yields better parameter estimation than using the chain rule for the same number of training iterations
⃝ only works for cross-entropy loss
⃝ prevents vanishing gradients by efficient local computation
⃝ is efficient in part because it caches factors of the gradient that can be re-used
⃝ has time complexity that is linear in the number of layers
⃝ has time complexity that is quadratic in the number of layers
14 Your Own Question
Write your own question, and provide a thorough solution.
Writing your own problems is a very important way to really learn the material. The famous “Bloom's Taxonomy” that lists the levels of learning is: Remember, Understand, Apply, Analyze, Evaluate, and Create. Using what you know to create is the top level. We rarely ask you any HW questions about the lowest level of straight-up remembering, expecting you to be able to do that yourself (e.g. make yourself flashcards). But we don't want the same to be true about the highest level. As a practical matter, having some practice at trying to create problems helps you study for exams much better than simply counting on solving existing practice problems. This is because thinking about how to create an interesting problem forces you to really look at the material from the perspective of those who are going to create the exams.
Besides, this is fun. If you want to make a boring problem, go ahead. That is your prerogative. But it is more fun to really engage with the material, discover something interesting, and then come up with a problem that walks others down a journey that lets them share your discovery. You don't have to achieve this every week. But unless you try every week, it probably won't happen ever.
Contributors:
• Alexander Tsigler
• Anant Sahai
• Fanny Yang
• Garrett Thomas
• Jennifer Listgarten
• Joshua Sanz
• Kailas Vodrahalli
• Lisa Jian
• Mark Velednitsky
• Michael Laskey
• Peter Wang
• Raaz RSK Dwivedi
• Ruta Jawale
• Yang Gao
• Yaodong Yu
• Yichao Zhou