EECS 189 Introduction to Machine Learning Fall 2020
HW8
This homework is due Wednesday, October 28 at 11:59 p.m.
2 Towards understanding Few-shot Learning by Predicting Auxiliary La-
bels
Make sure to submit the code you write in this problem to “HW8 Code” on Gradescope.
This problem helps us understand the “latent space” view of what learning is about. Here, we explore things in the simplest possible setting, but the idea is far more generally applicable. It is the foundation of ideas like “self-supervision” and “few-shot learning.” The practical problem is that in many real world settings, there is not enough labeled data. Or more precisely, there is not enough data that has been labeled with the labels that you care about. How can we hope to get machine learning to succeed in these cases? At the most basic level, the only patterns that we can ever hope to learn are those that are simple enough relative to the amount of data that we have.
We have seen in earlier homeworks that having lots of features or a high dimensional input can be regularizing. However, even in those cases, the examples we gave of where learning succeeds still had some inner structure to those inputs that was approximately low-dimensional. We know that approaches like least-squares and ridge-regression only really care about the singular-value structure of the input feature matrix — they don’t care if things have been rotated. And so, if there are only a few big singular values, then those are what define the dimensionality of the input data. If the true pattern that explains the target labels aligns well to those important directions, then learning can succeed. This was made completely explicit with PCA: an approach that allows us to discover those directions and then just use them.
If we ask ourselves what is the story behind PCA, the most natural story involves thinking about a latent space. There is a true low-dimensional space where the truly important (physically signifi- cant) parameters are. All the inputs and features that we measure or compute are noisy projections of that underlying latent space.
In a way, PCA is a way of picking out the “loud directions” in the input data. But should “loudness” always be synonymous with “likely to be important” when we do machine learning? Clearly not. We saw in the last homework how assuming sparsity could help us even if the inputs didn’t have any explicitly loud directions in them. If we knew that the true pattern is only going to depend on a few of the features, then we could use sparsity-seeking approaches (like OMP, LASSO, or even simple thresholding/pruning to zero) to learn the pattern.
This problem explores a different way of getting at the underlying latent space when simply looking at the loud directions is not enough. The key insight is philosophical: if something is actually real, then it should influence the world in multiple ways. No task is an isolated island — almost certainly, there are going to be other tasks that are related in the sense of depending on the same
HW8, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 1
underlying (physically real) variables. We should be able to use auxiliary labels for presumably related tasks to help us set the stage to do well on our task. How? By using those to figure out what the important directions are in feature space.
Suppose our observed data inputs X = [x1, · · · , xn]⊤ ∈ Rn×d are generated by hidden data points from the latent space,
L = [l1,··· ,ln]⊤ ∈ Rn×r, and l ∼ N(0,σ2lIr×r). In math, the data matrix can be defined as X = LV⊤ux,
xi = Vuxli + εi, εi ∼ N(0, σ2xId×d), i ∈ [n],
where li is the vector of latent variables associated with the i-th training sample, li ∈ Rr, and Vux ∈ Rd×r. We assume orthonormality of these vectors that relate the important latent space to the observedinputs,namelyV⊤uxVux =Ir×r.Supposethattheunderlyingtrueoutputy=[y1,···,yn]⊤ is generated from the latent space:
yi=l⊤iθ, i∈[n],
where θ ∈ Rr is the ground truth weight vector. The actual observed vector y might be a noisy
version of the underlying true outputs.
Besides the output y, we can also observe k sets of auxiliary labels, {z1,··· ,zk}, where zj =
[z1j , · · · , znj ]⊤ . For these too, the underlying noise-free versions come from the same latent space: zij=l⊤iqj, i∈[n], j∈[k].
This question helps us see the utility of such auxiliary labels for machine learning, even though we do not have access to the auxiliary labels at test time.
(a) If you look at the back-story of the training points above, notice that what we have is n i.i.d. samples of a random d-dimensional vector x which is distributed according to x = Vuxl + εx where l ∼ N(0,σ2lIr×r) and the independent εx ∼ N(0,σ2xId×d). Similarly, the auxiliary k-dimensional labels z are distributed according to z = Q⊤l + εz where l is the same latent vector that helped make x and εz is independent of the other random variables (l and ε) and distributed εz ∼ N(0, σ2z Ik×k).
Here, suppose that we had the expectation of xz⊤, i.e., E xz⊤. This is a matrix with d rows and k columns. We can compute its full SVD:
⊤ ⊤ UΣV=SVDExz .
Prove that we can use this to recover Vux up to rotation, T = VuxR,
where R ∈ Rr×r and R⊤R = RR⊤ = Ir×r. Note that here, you need to clearly tell us what matrix T you are going to use to estimate Vux.
HW8, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 2
(Hint: do you think that U might be involved? V? What are the sizes of these matrices? What size do you need the answer to be? How many nonzero singular values are there going to be? To start, you’re going to have to compute what E xz⊤ is in terms of things like Vux and Q and the SVD of Q might come in handy.)
(Note: The linear mapping cannot be uniquely determined, because the latent distribution is isotropic, you could always rotate the transformation matrix Vux and it will give you the same results.)
Solution:
First, let’s write the full SVD of Q = RΣV⊤. We are using different letters, but all that matters
here is that R is a square matrix with orthonormal columns. ⊤
E xz
⊤ ⊤
= Vuxσ2Ir×rRΣV⊤ l
= VuxR(σ2Σ)V⊤ l
U Σ
=E (Vuxl+εx)(Q l+εz)
⊤ ⊤ ⊤ ⊤
=E Vuxll Q+Vuxlεz +εxl Q+εxεz ⊤ ⊤ ⊤ ⊤
=VuxE ll Q+E Vuxlεz +εxl Q+εxεz
which completes the proof. The key is that the scalar σ2l commutes with matrices and the identity matrix goes away — this is the mathematical manifestation of the fact that we assumed that the true latent space was isotropic.
Essentially, we just use the large singular vectors in the compact SVD U as our matrix T.
(b) In reality, we don’t actually have access to this expectation E xz⊤. Instead, we only have a
finite number n of training samples to approximate the expectation. Consider the following matrix 1 X⊤ Z, where
=0
1X⊤Z n
= 1X⊤ z1,··· ,zk n
n
=
x z , · · · , x z , · · · , x z
Z
nnn 1
1 j k
iiii nii
i=1 i=1 i=1 nnnn
1⊤⊤
= ···, V l(q l)+ ε(q l)+ V l(ε [j])+ ε(ε [j]),···
n ux i j i i j i ux i z,i i z,i
i=1
HW8, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 3
i=1
i=1
i=1
nnnn
1 ⊤ 1 ⊤ 1 ⊤ 1 ⊤ =Vllq,···,q+εlq,···,q+Vlε+εε
uxn ii 1 k i=1
= V u x Σˆ n Q + Eˆ n Q + V u x Sˆ + Pˆ
What kind of random variables are inside the matrices Σˆ n and Eˆ n? Give their means and the dependence of their variances on n. Are they individually going to converge to anything as n grows?
To help you out, we will give you the answer for the entries in Sˆ and Pˆ . Notice that each of
these matrices is the average of n rank-1 matrices given by the outer product of two independent
random vectors. The n constituent random variables for each entry are iid and are distributed as
the product of two independent zero-mean Gaussians (such random variables have a modified
Bessel function as their density but this isn’t important). Their individual means are therefore
zero since the mean of a product of two independent random variables is the product of their
means. Because of independence, this same property extends to the variances as well and so
the entries in Sˆ have zero mean and variances 1σ2σ2. Similarly, the entries in Pˆ have zero nlz
mean and variances 1 σ2 σ2 . Since d and k are constant, as n gets large these two matrices are nxz
tending to zero matrices, and by the CLT, along the way, the individual entries in the matrices are also behaving individually like small Gaussians.
(Note: Here we assume l is drawn from an isotropic gaussian distribution, in practice, you may first want to zero-center the data and then perform some appropriate whitening to the data if you want to suppress any “loudness” that might exist in x alone.)
Solution: Just as in the examples already done for you above, the random variables inside
the matrices Σˆ n and Eˆ n are made from products of the constituents of li and εi. In the ( j, k)-
th position inside Σˆ , we see the random variable 1 m l [ j]l [k]. Similarly, n the ( j, k)-th n n i=1 i i
position inside Eˆ , we see the random variable 1 m l [ j]ε [k]. n n i=1 i i
Q Σˆ n
n ii 1 i=1
k uxn iz,i n iz,i i=1 i=1
Eˆ n
Sˆ Pˆ
Because all the cross-terms are zero-mean, the mean of Σˆ n is σ2l Ir×r . The variance of the off-diagonal entries in Σˆ n are 1 σ4 , and the variance of the diagonal entries in Σˆ n are 2 σ4 (be-
nl nl cause the square of Gaussian variable follows the chi-squared distribution). More specifically,
suppose X ∼ N(0, σ2), then we have
Var[X2] = E[X4] − (E[X2])2 = 2σ4.
3σ4 σ4
Notice that the variance of both the off-diagonal and the diagonal terms is going to zero as n goes to infinity. This means that the matrix as a whole is tending to its mean, which is just a scaled identity matrix.
By similar logic, the mean of Eˆ n is 0d×r. The variance of the entries in Eˆ n are all 1 σ2σ2x. Notice nl
how these are going to go to zero as n increases, and so the entire matrix approaches the zero matrix.
(c) Now, as a reality check, we are going to see how the ideally recovered T would work if we used it. How can do this reality check? Well, we can just assert that T = VuxR as above and
HW8, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 4
try to solve the following ridge regression problem that forces us to learn parameters that go through what we think that the latents are:
min ∥y − XTθ∥2 + λ∥Tθ∥2. θ
Prove that when εi = 0, i ∈ [n] (the no observation noise case), then the solution θ to the above problem satisfies
X T θ = L ( L ⊤ L + λ I ) − 1 L ⊤ y = L θˆ ,
θˆ
where θˆ is the solution to min ∥y − Lθ∥2 + λ∥θ∥2. In other words, it is as though we had
θ22 been working directly with the latents themselves.
(Note: Although we can only recover the matrix Vux up to a rotation matrix, it is equivalent to the case that we have access to the exact Vux w.r.t. the predictions.)
Solution:
Notice that for us, the T has orthonormal columns by construction. This means that ∥Tθ∥ = ∥θ∥ by construction. So, we are free to just use the ridge-regression formula with a new data matrix XT.
The solution to the above problem can be written as
θ = (T⊤X⊤XT + λI)−1T⊤X⊤y
= (T⊤(X⊤X + λI)T)−1T⊤X⊤y
= (T⊤(X⊤X + λI)T)−1T⊤VuxL⊤y
= (T⊤(VuxL⊤LV⊤ux + λI)T)−1T⊤VuxL⊤y
= ((VuxR)⊤(VuxL⊤LV⊤ux + λI)VuxR)−1(VuxR)⊤VuxL⊤y = (R⊤(L⊤L + λI)R)−1R⊤L⊤y
= R−1(L⊤L + λI)−1L⊤y
which completes the proof since we can multiply both sides by matR and see that what we have is exactly what we would have gotten if we did ridge regression directly using the latents themselves.
(d) Next, let us suppose that there are many latent variables but only a smaller subset of them are relevant to both the true labels y as well as the auxiliary labels zj. We rewrite the latent data matrix as
L = [T|F] ∈ Rn×r, and T ∈ Rn×t,F ∈ Rn×f,
where T = [t1,··· ,tn]⊤, F = [ft,··· ,fn]⊤, and r = t + f. The first t features can be viewed as the “truly relevant” features and so we can define ti to be the t-dimensional vector that just has those in it. Assume the outputs (including the auxiliary outputs) are generated as
yi=t⊤iθ i∈[n].
HW8, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 5
and
Figure 1: Latent space.
zj =t⊤q +εj, εj ∼N(0,σ2), i∈[n], j∈[k]. iijiiz
So all the outputs are only related to the first t features in the latent space. From what you’ve done above, you should see that the SVD strategy should give you access to a transformation that gives you access to those t latent dimensions. This can be exploited to improve our ability to learn y.
Complete the coding part related to Part (a)-(b) and compare its performance with: (1). k-PCA OLS; (2) (realistic best baseline) OLS when Vux is known in the Jupyter notebook. Note that for visualization purposes, we fix t = 2.
1) Based on the previous parts, implement functions to solve truncated SVD and compute θˆ in the jupyter notebook.
2) Visualize the recovered latent space on the code in the jupyter notebook and include the plot in your solution. Compare to the ground-truth latent variable and that recovered by PCA, what do you observe?
3) Change k, the dimension of auxiliary labels Z. Plot the performance of different base- line methods and include the plot in your solution. Comment on how the performance of different methods changes as k increases.
Solution: Please refer to the solution notebook. Code to implement is also avaialble below.
In the latent variable visualization (Fig. 1), with the aid of Z, truncated SVD is able to recover the latent variables (with a rotation). PCA fails to recover the latent variables because they are not loud enough.
Fig. 2 shows that the method studied here in general has better performance with larger k. However, in this example, the performance gain starts to saturate around k = 2. Students are encouraged to think why there is a saturation. Staff finds that increasing latent variable dimension also increase the threshold of saturation.
val = (np.transpose(X)@Z)
U, z1, _ = np.linalg.svd(val)
U = U[:, :dim_t]
return U
theta_hat = np.linalg.inv(np.transpose(T_hat)@(np.transpose(X)@X + ridge_lambda*np.eye(X.shape[1]))
→ @T_hat)@np.transpose(T_hat)@np.transpose(X)@y return theta_hat
HW8, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 6
Figure 2: Performance with different dimension of Z. 3 Backpropagation Algorithm for Neural Networks
Make sure to submit the code you write in this problem to “HW8 Code” on Gradescope.
In this problem, we will be implementing the backpropagation algorithm to train a neural network to classify the difference between two handwritten digits (specifically the digits 3 and 9).
Note that we need to install the library mnist for loading data, which has already been provided in the notebook.
The notebook might take a while to run, so please start early.
Our implementation takes about 20 minutes to run all parts in total.
To establish notation for this problem, we define:
ai+1 = σ(zi) = σ(Wiai + bi).
In this equation, Wi is a ni+1 × mi matrix that maps the input ai of dimension mi to a vector of dimension ni+1, where ni+1 is the size of layer i + 1 and we have that mi = ni. The vector bi is the bias vector added after the matrix multiplication, and σ is the nonlinear function applied element- wise to the result of the matrix multiplication and addition. zi = Wiai + bi is a shorthand for the intermediate result within layer i before applying the nonlinear activation function σ. Each layer is computed sequentially where the output of one layer is used as the input to the next. To compute the derivatives with respect to the weights Wi and the biases bi of each layer, we use the chain rule starting with the output of the network and work our way backwards through the layers, which is where the backprop algorithm gets its name.
In the jupyter notebook, you are given starter code with incomplete function implementations. For this problem, you will fill in the missing code so that we can train a neural network to learn your nonlinear classifier. The code currently trains a network with one hidden layer with 4 nodes.
(a) Start by drawing a small example network with one hidden layer, where the last layer has a single scalar output. The first layer should have 784 inputs corresponding to the input
HW8, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 7
image x which consists of a 28 × 28 pixels. The computational layers should have widths of 16, and 1 respectively. The final “output” layer’s “nonlinearity” should be a linear unit that just returns its input. The earlier “hidden” layers should have ReLU units. Label all the ni and mi as well as all the ai and Wi and bi weights. You can consider the bias terms to be weights connected to a dummy unit whose output is always 1 for the purpose of labeling. You can also draw and label the loss function that will be important during training — use a squared-error loss.
Here, the important thing is for you to understand your own clear ways to illustrate neural nets. You can follow conventions seen online or in lecture or discussion, or you can modify those conventions to pick something that makes the most sense to you. The important thing is to have your illustration be unambiguous so you can use it to help understand the forward flow of information during evaluation and the backward flow during gradient computations. Since you’re going to be implementing all this during this problem, it is good to be clear.
Solution: The following diagram is a sample of what you could have drawn for your answer. Here, the weights a marked as Wi: j−>k which indicates the weight in layer i going from the jth input to the kth output. The weights are drawn along the lines connecting graph nodes to indicate multiplication along the connection. Operations are indicated with circles, and model
parameters and inputs are indicated with rectangles.
(b) Let’s start by implementing the cost function of the network. We will be using a simple least squares classifier. We will regress all the positive classes to 1, and the negative classes to -1. The sign of the predicted value will be the predicted class label. This function is used to assign an error for each prediction made by the network during training.
The error we actually care about is the misclassification error (MCE) which will be:
1 n
MCE(yˆ) = n
HW8, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 8
i=1
I{sign(yi) sign(yˆi)} .
However this function is hard to to optimize so the implementation will be optimizing the mean squared error cost (MSE) function, which is given by
1 n
(yi − yˆi)2
where yi is the observation that we want the neural network to output and yˆi is the prediction
from the network.
Write the derivative of the mean squared error cost function with respect to the predicted outputsyˆ.Incorrespondingpartofthenotebook,implementthefunctionsQuadraticCost.fx and QuadraticCost.dx
Solution: The derivative is given by
∂MSE = 21(−1)(y − yˆ)
Please refer to the solution notebook or the following script:
(c) Now, let’s take the derivatives of the nonlinear activation functions used in the network. Im- plement the following nonlinear functions in the code and their derivatives:
σlinear(z) = z
MSE(yˆ) = 2n
i=1
return 0.5 * np.square(yp-y)
return yp – y
z otherwise σtanh(z) = ez − e−z
sigmoid function as an example activation function.
∂yˆ 2
= yˆ − y
σReLU(z) =
0 z<0
ez + e−z
For the tanh function, feel free to use the tanh function in numpy. We have provided the
Solution:
1 otherwise
HW8, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 9
∂σlinear =1 ∂z
∂σ 0z<0
ReLU
∂z =
∂σtanh = 1 − σtanh(z)2 ∂z
Note in particular for σtanh there are multiple valid ways to express its derivative. Please refer to the solution notebook or the following script:
(d) Wehaveimplementedtheforwardpropagationpartofthenetworkforyou(seeModel.evaluate
in the notebook). We now need to compute the derivative of the cost function with respect to
the weights W and the biases b of each layer in the network. We will be using all of the code
we previously implemented to help us compute these gradients. Assume that ∂MSE is given, ∂ai+1
where ai+1 is the input to layer i + 1. Write the expression for ∂MSE in terms of ∂MSE . Then ∂ai ∂ai+1
implement these derivative calcualtions in the function Model.compute grad. Recall, ai+1 is given by
ai+1 = σ(zi) = σ(Wiai + bi) .
Solution: We start with looking at the formula that takes the activations from layer i and
computes the activations for layer i + 1:
ai+1 = σ(Wiai + bi).
We know that the error is computed as a composition of the layers in the network, so we can use the chain rule to help us compute ∂MSE by using
∂MSE = ∂MSE ∂ σ(Wiai + bi) ∂ai ∂ai+1 ∂ai
= ∂MSEσ′(Wiai + bi)Wi ∂ai+1
Please refer to the solution notebook or the following script:
(e) One method to shortcut training is to randomly initialize the weights of the neural network and only train the weights of the last layer. This effectively treats the hidden layers of the neural net- work as random feature mappings. Use ridge regression to optimize the weights of the final layer. Compare the resulting approximation mean-squared-error values and misclassifi- cation error values for the 8 cases above (2 nonlinearities times 4 widths). Comment on what you see.
Solution: Please refer to the results below. Note it can change due to randomness.
HW8, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 10
return np.tanh(z)
return 1 - np.square(np.tanh(z))
return np.maximum(0,z)
return (z>0).astype(’float’)
∂ai
idx = idx – 1
grad.insert(0,np.multiply(d,layer.dx(curZ)))
d = np.dot(layer.W.T,grad[0])
Running ridge regression on last layer
width train MSE test MSE train MCE test MCE nonlinearity
0
1
2
3
4
5
6
7
8
9
10 256 0.058585 0.962726 0.019040 0.020802 tanh
11 1024 0.032488 0.973296 0.006374 0.007925 tanh
(f) We use gradients to update the model parameters using batched stochastic gradient descent.
16 0.210569 0.792764 0.124007 0.116394 ReLU
128 0.080593
0.952182
0.957169
0.974312
0.025911
0.021358
0.007781
0.024765 ReLU
0.020307 ReLU
0.010896 ReLU
256 0.066176
1024 0.034290
16 0.203848 0.801650 0.115315 0.093611 linear
128 0.079075
0.946974
0.954685
0.959153
0.024503
0.021854
0.017550
0.022784 linear
0.022784 linear
0.019316 linear
256 0.069996
1024 0.064330
16 0.189980 0.821697 0.100662 0.094106 tanh
128 0.074589 0.939985 0.024421 0.024269 tanh
Implement the function GDOptimizer.update to update the parameters in each layer of the network. You will need to use the derivatives ∂MSE and the outputs of each layer ai to
∂zi
compute the derivatives ∂MSE and ∂MSE . Use the learning rate η, given by self.eta in the
∂Wi ∂bi
function, to scale the gradients when using them to update the model parameters. Choose
several different batch sizes and different activation functions. Report the final error on the training set given the various batch sizes and activation functions.
Solution: Please refer to the solution notebook or the following script:
Please refer to the results below. Note it can change due to randomness.
Using SGD
width train MSE test MSE train MCE test MCE nonlinearity epochs \
update = np.dot(curGrad,curA.T)
updateB = np.sum(curGrad,1).reshape(layer.b.shape)
layer.updateWeights(-self.eta/m * update)
layer.updateBias(-self.eta/m * updateB)
0 4 0.042577 0.967269 0.016142 0.018326
1 4 0.048379 0.922746 0.016556 0.018821
2 4 0.045741 0.907047 0.014156 0.015849
3 4 0.074020 0.933215 0.024834 0.022288
4 4 0.071947 0.950103 0.022930 0.018821
5 4 0.072266 0.919634 0.022268 0.021793
6 4 0.015053 0.993663 0.008278 0.012878
7 4 0.015780 0.989164 0.008940 0.013868
8 4 0.016225 0.992477 0.008940 0.013868
batch_size
01 1 128 2 12080 31 4 128 5 12080 61 7 128 8 12080
ReLU 10
ReLU 10
ReLU 10
linear 10
linear 10
linear 10
tanh 10
tanh 10
tanh 10
(g) Let’s now explore how the number of hidden nodes per layer affects the approximation. Train a models using the tanh and the ReLU activation functions with 2, 4, 8, 16, and 32 hidden nodes per layer (width). Use the same training iterations and learning rate from the starter
HW8, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 11
code. Report the resulting error on the training set after training for each combination of parameters.
Solution: Please refer to the results below. Note it can change due to randomness.
Training with various sized network
width train MSE test MSE train MCE test MCE nonlinearity epochs \
0 1 2 3 4 5 6 7 8 9
2 0.043303 0.967367 0.016308 0.019316
4 0.032604 0.978365 0.008940 0.010896
8 0.030336 0.981463 0.006209 0.009411
16 0.019606 0.926387 0.003394 0.005448
32 0.013829 0.972237 0.001573 0.006439
2 0.016775 0.992101 0.009189 0.014364
4 0.015167 0.992304 0.008526 0.012878
8 0.014315 0.987023 0.007533 0.013868
ReLU 10
ReLU 10
ReLU 10
ReLU 10
ReLU 10
tanh 10
tanh 10
tanh 10
tanh 10
tanh 10
16 0.010133
32 0.009117
0.992090
0.989887
0.005215
0.002980
0.012382
0.009411
batch_size 01 11 21 31 41 51 61 71 81 91
(h) Optional. Not for credit: Currently the code classifies between the digits 3 and 9, modify the variables N0 and N1 in the code to classify between other pairs of digits. Do your results change if you pick different digits? Do you need more or less hidden nodes to reach the same accuracy? Do you need more or less training epochs?
Solution: Below is an example results of classifying digit 2 and 8. Note it can change due to randomness.
training classifier between digits 2 and 8
—————————————-
Running ridge regression on last layer
width train MSE test MSE train MCE test MCE nonlinearity
0
1
2
3
4
5
6
7
8
9
10 256 0.082940 0.918686 0.026251 0.025922 tanh
11 1024 0.044983 0.955066 0.007621 0.011964 tanh
—————————————-
Using SGD
width train MSE test MSE train MCE test MCE nonlinearity epochs \
16 0.216486 0.782670 0.127022 0.121137 ReLU
128 0.129465
0.882611
0.908358
0.954601
0.051147
0.029723
0.008892
0.046361 ReLU
0.028415 ReLU
0.014955 ReLU
256 0.091692
1024 0.050245
16 0.288043 0.706028 0.193751 0.186441 linear
128 0.105861
0.895269
0.900846
0.915866
0.036667
0.031671
0.027606
0.036889 linear
0.031904 linear
0.031904 linear
256 0.096373
1024 0.088115
16 0.298087 0.690882 0.197392 0.232802 tanh
128 0.098381 0.899877 0.035227 0.037388 tanh
0 1 2
4 0.055427 0.998874 0.015158 0.018445
4 0.071957 1.018416 0.023626 0.028415
4 0.059661 0.954422 0.021340 0.026919
ReLU 10
ReLU 10
ReLU 10
HW8, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 12
3 4 0.101653 0.948827 0.036244 0.043370
4 4 0.101807 0.945396 0.035820 0.040877
5 4 0.101652 0.945019 0.036752 0.042373
6 4 0.021250 0.982485 0.012279 0.016451
7 4 0.020162 0.981614 0.011347 0.017448
8 4 0.018996 0.992728 0.010162 0.014457
batch_size
01 1 128 2 11809 31 4 128 5 11809 61 7 128 8 11809
—————————————-
linear 10
linear 10
linear 10
tanh 10
tanh 10
tanh 10
Training with various sized network
width train MSE test MSE train MCE test MCE nonlinearity epochs \
0 1 2 3 4 5 6 7 8 9
2 0.076284 0.894241 0.026166 0.045862
4 0.060280 0.964880 0.022610 0.027418
8 0.044081 1.002878 0.008553 0.017946
16 0.026663 0.979146 0.003557 0.007478
32 0.025266 0.918361 0.002879 0.008973
2 0.030053 0.967184 0.019985 0.020439
4 0.021715 0.972992 0.011940 0.017946
8 0.011949 0.975711 0.004996 0.011466
ReLU 10
ReLU 10
ReLU 10
ReLU 10
ReLU 10
tanh 10
tanh 10
tanh 10
tanh 10
tanh 10
16 0.015331
32 0.013982
0.966522
0.988086
0.008637
0.005420
0.013958
0.012463
batch_size 01 11 21 31 41 51 61 71 81 91
4
Entropy, KL Divergences, and Cross-Entropy
So far, we have mostly considered so called Euclidean spaces, the most prominent example is the vector space Rd with inner product ⟨x, y⟩ = x⊤y and norm ∥x∥2 = x⊤x. For example in linear regression, we had the loss function
n f(θ)=∥X⊤θ−y∥2 =(x⊤θ−y)2
2ii i=1
which uses precisely the Euclidean norm squared to measure the error.
In this problem we will implicitly consider a different geometric structure, which defines a metric on probability distributions. In more advanced courses, this can be developed further to show how probability distributions are naturally imbued with a curved non-Euclidean intrinsic geometry.
HW8, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 13
Here, our goals are more modest — we just want you to better understand the relationship between probability distributions, entropy, KL Divergence, and cross-entropy.
Letpandqbetwoprobabilitydistributions,i.e. p ≥0,q ≥0, p =1and q =1,thenwe iiiiii
define the Kullback-Leibler divergence
KL(p, q) = pi ln pi
i qi
which is the “distance“ to p from q. We have KL(p, p) = 0 and KL(p, q) ≥ 0 (the latter by Jensen’s inequality) as would be expected from a distance metric. However, KL(p, q) KL(q, p) since the KL divergence is not symmetric.
(a) Entropy motivation: Let X1, X2, . . . , Xn be independent identically distributed random variables
taking values in a finite set {0, 1, . . . , m}, i.e. p j = P(Xi = j) for j ∈ {0, 1, . . . , m}. The empirical
number of occurrences is then a random vector that we can denote F(n) where F(n) is the number j
of variables Xi that happen to take a value equal to j.
Intuitively, we can consider coin tosses with j = 0 corresponding to heads and j = 1 corre-
sponding to tails. Say we do an experiment with n = 100 coin tosses, then F(100) is the number
of heads that came up and F(100) is the number of tails. 1
0
Recall that the number of configurations of X1, X2, . . . , Xn that have f (n) as their empirical n
type is f(n),f(n),…,f(n) . Further notice that dividing the empirical type by n yeilds an empirical 01m
probability distribution.
Show using the crudest form of Stirling’s approximation (l! ≈ (l)l) that this number of
e
n (n)
configurations f (n), f (n),…, f (n) is approximately equal to exp nH( f /n) 01m
where the entropy H
is the number of ways to put n interchangeable objects into m boxes so that box j has f (n) j
objects in it. There are that many distinct “n-length strings” of X realizations whose empirical counts match f(n).
Then use the attached Jupyter notebook to plot your approximation, and verify that it looks reasonable.
Solution:
of a probability distribution is defined as H(p) = m p ln 1 . j=0 j pj
Note: The multinomial coefficient
n n!
f(n), f(n),…, f(n) = (n) (n) (n) 0 1 m f0 !f1 !···fm !
By applying the Stirling approximation, we get n n!
f(n), f(n),…, f(n) = (n) (n) (n) 0 1 m f0 !f1 !···fm !
≈ (n/e)n (f(n)/e)f(n)(f(n)/e)f(n) ···(f(n)/e)f(n)
0011 mm
HW8, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 14
f(n) mi m(n)
n f n i
==expn ln f ( n ) n f ( n )
i=0i i=0 i = exp nH( f (n)/n) .
(b) KL divergence motivation: Recall that the probability of seeing a particular empirical type is given by:
.
Consider the limit of large n and a sequence of empirical types so that 1f(n) → f for n → ∞, n
where f is some distribution of interest.
Use Stirling’s approximation to show that
lim1lnP(F(n) =f(n))=−KL(f,p) n→∞ n
Intuitively this means that the larger KL(f, p) is, the easier it is to conclude f p from empirical data since the chance that we would get confused in that way is decaying exponentially. Note also that the empirical distribution is the first argument of the KL divergence and the true model is the second argument of the KL divergence — we are going from the true model to the empirical one.
After you obtain a result, use the attached Jupyter notebook to see this convergence with a simulated model, and experiment with varying both f and p.
n m f(n) P(F(n) =f(n))= pj
f(n), f(n),…, f(n) j 01 mj=0
Solution: Recall that there are
n n!
f(n), f(n),…, f(n) = (n) (n) (n) 0 1 m f0 !f1 !···fm !
f(n)
many ways to realize f(n), each of them has probability m p j . Therefore
j=0 j
n m f(n)
P(F(n) =f(n))= pj f(n), f(n),…, f(n) j
.
01 mj=0
Using the alternate form of Stirling’s formula ln(n!) ≈ n ln n − n + o(n) for large n we then get
n! m f(n) lnP(F(n) = f(n)) = ln p j
f(n)!f(n)!··· f(n)! j 01 mj=0
≈ (n ln n − n) − ( f (n) ln f (n) − f (n) ) + f (n) ln p j jjjj
jj
f (n) =(nlnn−f(n)lnn)−(n−f(n))−f(n)ln j +f(n)lnpj
jjjnj jjjj
HW8, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 15
f(n) p =n j ln j
n f(n)/n jj
Therefore we have
for n → ∞.
(c) Show that for probability distributions p(x,y) and qθ(x,y) = qθ(y | x)q(x) with x from
some discrete set X and y from some discrete set Y we have
KL(p, qθ) = c − p(x, y) ln qθ(y | x) (1)
x∈X y∈Y
for some constant c independent of θ.
This is one of the motivations for looking at the log of the probabilities as a loss function. From the point of view of learning θ, this is functioning the same as using the KL divergence. The next part asks you to make this explicit.
f (n)
1lnP(F(n) = f(n))→ j ln pj →−KL(f,p)
n n f(n)/n jj
Solution: We have
KL(p,qθ) = p(x,y)ln p(x,y)
x∈X y∈Y qθ(x,y)
= p(x, y) ln p(x, y) − p(x, y) lnqθ(y | x)q(x)
x∈X y∈Y
= c − p(x, y) ln qθ(y | x)
x∈X y∈Y
(d) In logistic regression we predict labels yi = +1 or yi = −1 from features xi using the transition probability model
q (y | x ) = 1 . (2) θ i i 1+e−yiθ⊤xi
We now show that the cross-entropy loss you have seen in lectures can be formulated as min- imizing the KL distance to the empirical probabilities from the probabilities induced by the model qθ.
For convenience, we assume that all the feature xi are distinct — no two training points are identical.
Use (c) to show that with the empirical distribution
0 otherwise
1
ifx=x andy=y forsomei=1,2,…,n p ( x , y ) = n i i
HW8, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 16
5
we get
argmin KL(p,qθ) = argmin −1 lnqθ(yi | xi), θθni
which is the cross entropy loss derived in lectures.
Solution: By plugging p(x, y) = 1 n 1 into the expression from (c), we get n i=1 {x=xi ,y=yi }
1 n
KL(p, qθ) = c − n 1{x=xi,y=yi} ln qθ(y | x)
x∈X y∈Y i=1
1 n
=c−n
The constant doesn’t matter for the minimizer and so we can get rid of it.
SGD in the overparametrized regime
i=1
lnqθ(yi |xi).
This is a problem that helps you understand why we cared so much about the properties of minimum-norm solutions in the context of machine learning. The standard way of training neural networks in practice is stochastic gradient descent (and variants thereof). We need to understand how it behaves. Here, we just try to minimize squared error. (A similar story holds for logistic loss.)
Consider the standard least-squares problem:
minL(w;X,y), whereL(w;X,y):= 1∥Xw−y∥2.
w2
Here X = [x1, . . . , xn]⊤ is a n × d matrix of features and y is an n-dimensional vector of labels.
Say d > n. For a single training sample (xi, yi) let fi(w) := 1 (x⊤w − yi)2. A stochastic gradient is 2i
∇w fi(w) = ( d fi(w))⊤ where i is uniformly sampled from {1, . . . , n}. By the properties of vector dw
calculus, recall that the derivative of a scalar function of a vector is represented by a row vector, and the gradient is the transpose of that row vector so that we have a regular column vector.
(a) Show that if there exists a minimizer with loss L(w; X, y) = 0, then there is an infinite number of such minimizers.
(HINT: Remember the shape of X here.)
Solution: The loss is zero iff Xw = y — minimizers with zero loss are exactly the solutions of the system of linear equations. Since d > n the system is underdetermined, and it either has no solutions or infinitely many (if w is a solution, then w + z is also a solution for any z in the null space of X.) Thus, if there is at least one minimizer with zero loss, then there are infinitely many of them.
(b) Prove that any linear combination of stochastic gradients must lie in the row span of X. That is, it must be a linear combination of the {xi}.
HW8, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 17
(HINT: Here, you need to actually take the derivative and see what it is.)
Solution: In our setting a stochastic gradient is
∇ 1(x⊤w−y)2 =x(x⊤w−y)
w2iiiii
for some i — a vector which is collinear to xi. Since x⊤i are rows of X, any stochastic gradient lies in the row span of X, and as the row span is a linear space, any linear combination of stochastic gradients lies in the same span.
(c) Suppose that rows of X are linearly independent, and that stochastic gradient descent with constant step size η > 0 initialized at w0 = 0 converges to something when we view the sequence of vectors wt as an infinite sequence. Show that it converges to the minimum norm interpolating solution w = X†y. Here X† is the Moore-Penrose Pseudoinverse. You can feel free to assume that each training point is used infinitely often by SGD.
(Hint: Remember, what does it mean for a sequence to converge? What does that imply about the individual gradients? What does that mean about the quantity ∥Xwt − y∥2? Finally, this part comes after the previous part for a reason.)
Solution: We can only converge to some point w∗ if all the stochastic gradients in that point are zero: indeed, if some stochastic gradient has length l > 0, our SGD would have to eventually make a step of length close to lη, but since our SGD converges, the steps also must converge to zero.
Thus, we can only converge to such w∗ for which xi(x⊤i w∗ − yi) = 0 for any i, i.e. to the interpolating solution. Now note that the interpolating solution that lies in the row span of the data is exactly the unique minimum norm interpolating solution, so by the previous part this is the solution we will converge to (as our trajectory cannot escape the span of the rows of X).
(d) Suppose you initialize SGD at w = w0 0, and SGD still converges. What will it converge to in this case? Express your answer in terms of X, y and w0.
Solution: We know that SGD only travels in the affine space which is parallel to the span of the rows of X. Therefore, we start with introducing w∥ and w⊥: the part of w that lies in the span of rows of X and the orthogonal part correspondingly. The explicit formulas are the following:
w∥ =X⊤(XX⊤)−1Xw,
w⊥ =(Id − X⊤(XX⊤)−1X)w,
w =w∥ + w⊥.
Here we used the fact that X⊤(XX⊤)−1X is the orthogonal projector on the row span of X.
Now let’s see what SGD does. We already know that all stochastic gradients lie in the span of rows of X, so w⊥ will always remain unchanged. To see how w∥ evolves, let’s write out a stochastic gradient:
xi(x⊤i w−yi)=xi(x⊤i (w∥ +w⊥)−yi)=xi(x⊤i w∥ −yi).
HW8, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 18
We see that a stochastic gradient at w is exactly the same as the corresponding stochastic gradient at w∥. That means that we can throw away w⊥ before running SGD and all the steps of SGD will be exactly the same (the whole trajectory will just be shifted by −w⊥)!
Thus, w∥ (which is the same as the shifted trajectory) converges to w∗ from the previous part (note that in the previous part we did not use the fact that we start at zero, we only needed to start in the span of rows of X), and the w⊥ doesn’t change. Thus, the limiting point is
X⊤(XX⊤)−1y + (Id − X⊤(XX⊤)−1X)w0.
(e) Do the tasks from the associated Jupyter notebook and report the results. Solution:
6
Part 2 For the usual parametrization SGD converges in about 900 iterations. For the ”neural network” style it takes around 1500 iterations.
Part 3 We can see the result of part b by observing that the whole trajectory lies in a 2d plane. When it comes to the ”neural network” style, its trajectory doesn’t lie in a plane, but it does lie on a 2d surface (submerged in 3d space).
Part 4 The convergence doesn’t occur when the training points x1 and x2 are close to each other. The usual parametrization is better in terms of convergence.
When the convergence occurs, the function learned with ”neural network” parametriza- tion is usually flatter (graph is closer to a straigt line).
SGD (Interpolation)
Part 1 The code that you need to write is the following:
phi_train = np.array([x_train**2, x_train, 1])
return phi_train * (np.dot(phi_train, w) – y_train)
return np.array([2 * w[1] * x_train * (w[0] * x_train + 1),
(w[0] * x_train + 1)**2,
1]) * (w[1] * (w[0] * x_train + 1)**2 + w[2] –
→ y_train)
This is a problem about the convergence of SGD. It can seem a bit hairy, but we’ve tried to simplify things so that you can understand and there is code at the end that will help you visualize what is going on. If you want, start at the end and run the code first. Then you can appreciate the phenomenon, and can work through the arguments here.
Consider we are minimizing the following optimization problem: 1 n
minL(w)= (yi −x⊤i w)2, (3) w n i=1
Li (w)
HW8, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 19
where xi ∈ Rd and the data points {(xi, yi)}ni=1 are generated as follows,
yi = x⊤i w⋆, (4)
which implies that
Here we apply stochastic gradient descent for optimizing (3),
L(w⋆)=0, and Li(w⋆)=0, i∈[n].
1
w =w−η ∇L(w), t=0,···T, (5)
t+1t it |St| i∈|St|
where St is uniformly sampled from {1,··· ,n}, St ⊂ {1,··· ,n} and |St| = B, and η > 0 is the step size.
Denote X = [x1,··· ,xn]⊤ ∈ Rn×d and y = [y1,··· ,yn]⊤. Then minimizing (3) is equivalent to minimizing:
min 1∥Xw − y∥2. wn
Denote the largest and smallest eigenvalue of 1 X⊤ X by λmax and λmin , and denote n
β = max ∥xi∥2. i∈[n]
(a) Prove the following inequality holds for all w′ and w:
L(w′) ≤ L(w) + ⟨∇L(w), w′ − w⟩ + λmax∥w − w′∥2.
We will start you out with the first few lines of this proof. You need to finish it. You’ll see some familiar tricks:
L(w′ ) =1∥Xw′ − y∥2
n
=1∥X(w′ −w+w)−y∥2 n
=1∥Xw−y∥2 + 2⟨X(w′ −w),Xw−y⟩+ 1∥X(w−w′)∥2 nnn
=···
(Hint: expand stuff out and remember what the gradient is here.)
Solution: Write down the formulation of L(w) and perform eigenvalue decomposition of 1X⊤X,
n
1X⊤X = UΣU⊤, n
HW8, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 20
where
and λmax = λ1 and λmin = λd. More specifically,
L(w′ ) =1∥Xw′ − y∥2
n
=1∥X(w′ −w+w)−y∥2 n
=1∥Xw−y∥2 + 2⟨X(w′ −w),Xw−y⟩+ 1∥X(w−w′)∥2 nnn
Σ = diag(λ1,··· ,λd),
=
1 2 1 ∥Xw−y∥2 +⟨w′ −w, X⊤(Xw−y)⟩+(w−w′)⊤ X⊤X (w−w′)
nnn
L(w) ∇L(w) UΣU⊤
≤L(w) + ⟨w′ − w, ∇L(w)⟩ + λmax∥w − w′∥2, which completes the proof.
(b) Assume n > d, λmin > 0, and the data points are generated according to (4), prove the follow- ing inequalities hold for all w:
4λmin(L(w) − L(w⋆)) ≤ ∥∇L(w)∥2 ≤ 4λmax(L(w) − L(w⋆)), ∥∇Li(w)∥2 ≤ 4β(Li(w) − Li(w⋆)), for i ∈ [n].
To start you out, we first write down the expression for ∥∇L(w)∥2, ∥∇L(w)∥2 = 4 ⟨X⊤(Xw − y), X⊤(Xw − y)⟩
0 or something else. . . .
(Hint: Think about the possible nullspace of X. Does it matter?)
Solution: We first write down the expression for ∥∇L(w)∥2, ∥∇L(w)∥2 = 4 ⟨X⊤(Xw − y), X⊤(Xw − y)⟩
n2
= 4(Xw−y)⊤XX⊤(Xw−y).
n2
To complete this, you need to think about what Xw − y can be. There are always two choices
n2
= 4(Xw−y)⊤XX⊤(Xw−y),
n2
suppose Xw − y = 0, then the first two inequalities hold. Assume Xw − y 0, we have Xw−y = X(w−w⋆) N(X),whereN(X)isthenullspaceofX. Thenthefirsttwo inequalities also hold. Then for ∥∇Li(w)∥2, we have
∥∇Li(w)∥2 = 4⟨(x⊤i w − yi)xi, (x⊤i w − yi)xi⟩ = 4∥xi∥2(x⊤i w − yi)2, which completes the proof.
HW8, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 21
(c) Assume n > d, λmin > 0, and the data points are generated according to (4), prove that by performing SGD with constant step size η, when η satisfies
η≤ 2λminB , 2(λmaxβ + λminλmax(B − 1))
the iterates of SGD in (5) satisify the following exponential decay in expectation of the losses: E L(wt+1) ≤ 1 − ηλmin E L(wt) ,
where the expectation is taken w.r.t. the randomness in sampling the mini-batch.
Spiritually, think back to the review problem that you did way back in HW0 that was about gradient descent for ordinary least squares. There, you saw exactly this kind of exponential convergence. Here, we are doing SGD and while E ∇Li(wt) = ∇L(wt) since a training point i is chosen uniformly at random across the data set, that isn’t quite enough. We need to make sure that we don’t jump around too much due to the randomness in the choice of that training point. This is why B (the size of the minibatch) is here as well as β (the maximum norm squared of a training point).
To bound the size of the fluctuation that can happen from the randomness in the choice of the mini-batch, we need to get a handle on the expectation of the squared-norm of the stochastic gradient. How different is this than the squared-norm of the actual gradient?
Here, the following trick comes in handy that is a consequence of all the mini-batch being chosen uniformly at random from the possible B-sized subsets of the training data.
1 B2
we know how we have to start. We need to use the first part of this problem: L(wt+1) ≤ L(wt) + ⟨∇L(wt), wt+1 − wt⟩ + λmax∥wt+1 − wt∥2
= L(wt) − η⟨∇L(wt), 1 ∇Li(wt)⟩ + η2λmax∥ 1 ∇Li(wt)∥2, |St| i∈|St| |St| i∈|St|
taking the expectation w.r.t. the randomness in sampling this mini-batch and rearranging the terms,
E[L(wt) − L(wt+1)]
1 2
2
2 ∥∇L(wt)∥
E ∥ ∇Li(wt)∥ = 2
E
∥∇Li(wt)∥ 2
+
2
B(B−1) B2
|St|i∈St
What this does is tell us that we can’t be that far away. How do we use this above fact? Well,
i∈St
11 2 2
≥η ∇L(wt),E |St|
i∈St
= ··· . .
≥ ηλminL(wt).
HW8, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 22
∇Li(wt) −ηλmaxE∥
∇Li(wt)∥ 2
|St|
i∈St
You need to take it from here and use the ideas above to fill in the · · · lines of gaps. What can you do to those expectations? Then you need to combine terms to pull together things that depend on ∥∇L(wt∥2. Once you’ve done that, you can apply one of the bounds from above. (Which direction does your inequality need to go?)
To complete the proof once you reach the end of the derivation above, you just take expecta- tions of both sides.
Solution: To start with, we apply the result in Part (a),
L(wt+1) ≤ L(wt) + ⟨∇L(wt), wt+1 − wt⟩ + λmax∥wt+1 − wt∥2
= L(wt) − η⟨∇L(wt), 1 ∇Li(wt)⟩ + η2λmax∥ 1 ∇Li(wt)∥2, |St| i∈|St| |St| i∈|St|
taking the expectation w.r.t. the randomness in sampling the mini-batch and rearranging the terms,
E[L(wt) − L(wt+1)]
11 2 2
≥η ∇L(wt),E ∇Li(wt) −ηλmaxE∥ ∇Li(wt)∥ |St| |St| 2
i∈St i∈St
1 22 2
B(B−1)
= η∥∇L(wt)∥ − η λmax E ∥∇Li(wt)∥
η2λ
=η− ∥∇L(w)∥ −ηλ E
2
2 B2 i∈St
+
2 ∥∇L(wt)∥ 2 B2
(B − 1) 4β max 2 2
L(w) t maxit
B 2 B2 i∈St
η2λmax(B − 1) 4η2βλmax ≥4λ η− L(w)−
L(w) min B t B t
2 2 η λmax(B−1) 4η βλmax
= 4λ η− −
L(w) min BBt
2 2
= ηλmin + 3ηλmin − 4η λminλmax(B − 1) − 4η βλmax L(wt)
BB
≥0
≥ ηλminL(wt),
where the second inequality is because Part (b), and the last ineuqality is because our assump-
tion on η. We complete the proof.
(d) (Completely Optional — Just to see if you understand the details. Read this part however.
1
It is important.) Consider n < d and denote the smallest nonzero singular value of √n X by
σmin > 0, and let λ ̃min = σ2min. The data points are still generated in a noise-free manner according to (4). It turns out that the inequalities in Part (b) still hold by replacing λmin with λ ̃ min :
4λ ̃min(L(w) − L(w⋆)) ≤ ∥∇L(w)∥2 ≤ 4λmax(L(w) − L(w⋆)),
HW8, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 23
∥∇Li(w)∥2 ≤ 4β(Li(w) − Li(w⋆)), for i ∈ [n].
If you want to practice, Verify the above proof still works. The consequence of that proof
working is that by performing SGD with constant step size η, when η satisfies η≤ 2λ ̃minB ,
2(λmaxβ + λ ̃minλmax(B − 1))
the iterates of SGD in (5) also have an exponential decay in expectation: ̃
E L(wt+1) ≤ 1 − ηλmin E L(wt) ,
where the expectation is taken w.r.t. the randomness in sampling the mini-batch. This is because none of the arguments that you used in the earlier part actually cared about what the relationship was between n and d.
This has an important consequence. When data is noiseless and you are doing least-squares, you get exponential convergence even if you do SGD instead of gradient descent. It doesn’t matter if you have more data than features or more features than data.
You might wonder whether there is a difference between this fact holding in expectation and convergence holding with probability 1. Think about it. When we have an exponential decay in an expectation, even Markov’s inequality will tell us that this convergence has to happen with probability 1. Suppose there was a finite probability of divergence. Then there has to be some finite non-zero level that is crossed infinitely often. Markov’s inequality will rule that out. So, SGD with a constant step size (that is sufficiently small) always converges and does so exponentially fast!
Solution: Proof can be done by using similar arguments in Part (b) and Part (c). Similar to Part (b). We first write down the expression for ∥∇L(w)∥2,
∥∇L(w)∥2 = 4 ⟨X⊤(Xw − y), X⊤(Xw − y)⟩ n2
= 4(Xw−y)⊤XX⊤(Xw−y), n2
suppose Xw − y = 0, then the first two inequalities hold. Assume Xw − y 0, we have Xw−y = X(w−w⋆) N(X),whereN(X)isthenullspaceofX. Thenthefirsttwo inequalities also hold. More specifically, for the lower bound part of ∥∇L(w)∥2, we have
∥∇L(w)∥2 = 4 (Xw − y)⊤XX⊤(Xw − y) ≥ 4λ ̃min (Xw − y)⊤(Xw − y), n2 n
which completes the proof.
Then the convergence proof can be done as in Part (c) by applying the 4λ ̃ min (L(w) − L(w⋆ )) ≤
∥∇L(w)∥2 ≤ 4λmax(L(w) − L(w⋆)).
(e) Now we can enjoy the payoff of the work above. The noise-free case seems insanely special. What if we have some noise or model mismatch? Adding regularization (the right way) does the trick for us.
HW8, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 24
Suppose n < d and the data points are generated with noise in them as follows: yi=x⊤iw⋆+εi, ε∼N(0,σ).
Recall the augmented feature perspective on ridge regression: min ∥θ∥2
(6)
(7)
where the augmented data matrix Xˆ is defined as
Xˆ = X | √λI ∈ Rn×(d+n),
θ
s.t.Xθ=y, θ=ξ∈R ,
w ˆ
d+n
1ˆ where λ is the regularization in ridge regression. Denote the smallest singular value of √n X by
σmin > 0, and let λ ̃min = σ2min, and let
β = max∥xˆi∥2, i∈[n]
where xˆ i is the i-th row of matrix Xˆ . Denote the optimal solution to (7) by
Consider the case of initializing θ0 = 0 for SGD and having a learning rate η that satisfies η≤ 2λ ̃minB .
2(λmaxβ + λ ̃minλmax(B − 1)) Consider the iterates of SGD in (5) for the following problem
w
ξ, θ=.
1 n minL(θ)= (yi −xˆ⊤i θ)2,
solved by w in the limit (θ = ) as t gets large? t t ξt,
wt
θ n i=1 Li (θ)
and argue based on earlier parts and earlier problems that they satisify the following
inequality which shows exponential decay in expected loss for the augmented problem.
̃
E L(θt+1) ≤ 1 − ηλmin E L(θt) ,
where the expectation is taken w.r.t. the randomness in sampling the mini-batch. Based on the previous HW problem, what do you think is the implication for what problem is being
Solution: Simply by applying the results in Part (d) and the previous problem on implicit regularization, we could finish the proof. The earlier parts tell us that the solution to least- squares SGD must converge. The previous problem tells us it must converge to the minimum norm solution if we initialize at zero.
This means that we get the solution to ridge regression this way using SGD and that we get that solution exponentially fast in terms of convergence rate. For a large feature space and a lot of data points, this represents a tremendous speedup in practice.
HW8, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 25
(f) The previous part has shown you that done properly, you can get SGD to converge exponen- tially for ridge regression. Check out the Jupyter part of this problem, run (5) for different settings as described in Part (c), (d), and (e). Report what you observed in terms of the convergence rate of (5).
One of the lessons that you will observe from the code is that the implementation details matter. If you do ridge regression and just treat it as an optimization problem, you won’t just be able to use SGD and get exponential convergence with a constant step size. (You would have to adjust the step sizes to make them smaller, but this would slow down your convergence considerably.) But if you intelligently use the feature-aumentation perspective on ridge regression, you’ll get exponential convergence.
This is why it is vital for people in EECS to really understand machine learning at the level of detail that we are teaching you. Because in the real world, even if you are a practicing machine learning engineer, if you are working on cutting-edge systems, you need to understand how to implement what you want to do so that it works fast. Equivalent formulations mathematically need not be equivalent from the point of view of implementation – this is one dramatic example of a case when they are not. Take EE227C and beyond if you want to understand these things more deeply.
Solution: Ths SGD with constant step size (5) converges exponentially towards zero in these setttings.
For people coming from an optimization point of view who have seen SGD before, this might seem perplexing. After all, it is generically the case that the convergence of SGD to a so- lution requires an appropriately diminishing step size, and that the convergence is far slower than exponential. This is where the machine-learning understanding is important as well as understanding the special case of interpolating solutions. The machine-learning understanding tells us that by doing SGD in the augmented-features point of view, we are keeping a little bit of memory for every training point. This memory is being used to help us better “agree to disagree” on each individual point instead of constantly having the same argument over and over again between the individual data points and the collective impact of the rest of the data points. Whereas traditional SGD convergence requires an increasing stubborness of the col- lective vis-a-vis each data point (that is what a diminishing step size means — it is harder to change the collective’s point of view), the augmented features perspective gets around this by allowing learning to focus on the new disagreements as opposed to old ones. Similar ideas are important in other ML contexts, although most people don’t understand their linkages.
HW8, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 26
Contributors:
• Alexander Tsigler • Anant Sahai
• Ashwin Pananjady • Kailas Vodrahalli • Peter Wang
• Philipp Moritz
• Stephen Bailey
• Stephen Tu
• Vaishaal Shankar • Yaodong Yu
HW8, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 27