计算机代写 GA-1003 – Spring 2022 1 Homework 2: Gradient Descent & Regularization

DS-GA-1003 – Spring 2022 1 Homework 2: Gradient Descent & Regularization
Due: Wednesday, February 16, 2022 at 11:59PM
Instructions: Your answers to the questions below, including plots and mathematical work, should be submitted as a single PDF file. It’s preferred that you write your answers using software that typesets mathematics (e.g.LaTeX, LyX, or MathJax via iPython), though if you need to you may scan handwritten work. You may find the minted package convenient for including source code in your LaTeX document. If you are using LyX, then the listings package tends to work better. The last application is optional.
This second homework features 3 problem sets and explores gradient descent algorithms, loss functions (both topics of week 2), regularization (topic of week 3) and statistical learning theory (week 1 + week 2). Following the instructions in the homework you should be able to solve a lot of questions before the lecture of week 3. Additionally this homework has an optional problem set. Optional questions will be graded but the points do not count towards this assignment. They instead contribute towards the extra credit you can earn over the entire course (maximum 2%) which can be used to improve the final grade by half a letter (e.g., A- to A).

Copyright By PowCoder代写 加微信 powcoder

1 Statistical Learning Theory
In the last HW, we used training error to determine whether our models have converged. It is crucial to understand what is the source of this training error. We specifically want to understand how it is connected to the noise in the data. In this question, we will compute the expected training error when we use least squares loss to fit a linear function.
Consider a full rank N × d data matrix X (N > d) where the training labels are generated as yi = b xi + εi where εi ∼ N(0,σ2) is noise. From HW 1, we know the formula for the ERM, ˆb = (XT X)−1XT y.
1. Show that:
1􏰖􏰖􏰌 T −1 T 􏰍􏰖􏰖2 Training Error = 􏰖􏰖􏰖􏰖 X(X X) X − I ε􏰖􏰖􏰖􏰖
N2 where ε ∼ N (0, σ2In) and training error is defined as N1 ||Xˆb − y||2.
2. Show that the expectation of the training error can be expressed solely in terms of only N,d,σ as:
􏰔1􏰖􏰖􏰌 T −1 T 􏰍􏰖􏰖2􏰕 (N−d) 2
E 􏰖􏰖􏰖􏰖X(XX)X−Iε􏰖􏰖􏰖􏰖= N2N
• Consider A = X(XT X)−1XT . What is AT A? Is A symmetric? What is A2? • For a symmetric matrix A satisfying A2 = A, what are its eigenvalues?
• If X is full rank, then what is the rank of A? What is the eigenmatrix of A?
3. From this result, give a reason as to why the training error is very low when d is close to N i.e. when we overfit the data.

DS-GA-1003 – Spring 2022 2 2 Gradient descent for ridge(less) linear regression
We have provided you with a file called ridge regression dataset.csv. Columns x0 through x47 correspond to the input and column y corresponds to the output. We are trying to fit the data using a linear model and gradient based methods. Please also check the supporting code in skeleton code.py. Throughout this problem, we refer to particular blocks of code to help you step by step.
Feature normalization When feature values differ greatly, we can get much slower rates of convergence of gradient-based algorithms. Furthermore, when we start using regularization, features with larger values are treated as “more important”, which is not usually desired.
One common approach to feature normalization is perform an affine transformation (i.e. shift and rescale) on each feature so that all feature values in the training set are in [0, 1]. Each feature gets its own transformation. We then apply the same transformations to each feature on the validation set or test set. Importantly, the transformation is “learned” on the training set, and then applied to the test set. It is possible that some transformed test set values will lie outside the [0, 1] interval.
4. Modify function feature normalization to normalize all the features to [0, 1]. Can you use numpy’s broadcasting here? Often broadcasting can help to simplify and/or speed up your code. Note that a feature with constant value cannot be normalized in this way. Your function should discard features that are constant in the training set.
At the end of the skeleton code, the function load data loads, split into a training and test set, and normalize the data using your feature normalization.
Linear regression
In linear regression, we consider the hypothesis space of linear functions hθ : Rd → R, where hθ(x) = θT x,
for θ,x ∈ Rd, and we choose θ that minimizes the following “average square loss” objective
pothesis space of affine functions:
hθ,b(x) = θT x + b,
which allows a nonzero intercept term b – sometimes called a “bias” term. The standard way to achieve this, while still maintaining the convenience of the first representation, is to add an extra dimension to x that is always a fixed value, such as 1, and use θ,x ∈ Rd+1. Convince yourself that this is equivalent. We will assume this representation.
5. Let X ∈ Rm×(d+1) be the design matrix, where the i’th row of X is xi. Let y = (y1, . . . , ym)T ∈ Rm×1 be the response. Write the objective function J(θ) as a ma- trix/vector expression, without using an explicit summation sign. 1
1Being able to write expressions as matrix/vector expressions without summations is crucial to making im-
(hθ(xi) − yi)2 ,
While this formulation of linear regression is very convenient, it’s more standard to use a hy-
where (x1, y1), . . . , (xm, ym) ∈ Rd × R is our training data.

DS-GA-1003 – Spring 2022 3
6. Write down an expression for the gradient of J without using an explicit summation sign.
7. Write down the expression for updating θ in the gradient descent algorithm for a step size η.
8. Modify the function compute square loss, to compute J(θ) for a given θ. You might want to create a small dataset for which you can compute J(θ) by hand, and verify that your compute square loss function returns the correct value.
9. Modify the function compute square loss gradient, to compute ∇θJ(θ). You may again want to use a small dataset to verify that your compute square loss gradient function returns the correct value.
Gradient checker
We can numerically check the gradient calculation. If J : Rd → R is differentiable, then for any vector h ∈ Rd, the directional derivative of J at θ in the direction h is given by
lim J(θ+εh)−J(θ−εh). ε→0 2ε
It is also given by the more standard definition of directional derivative, lim 1[J(θ+εh)−J(θ)] .
The former form gives a better approximation to the derivative when we are using small (but not infinitesimally small) ε. We can approximate this directional derivative by choosing a small value of ε > 0 and evaluating the quotient above. We can get an approximation to the gradient by approximating the directional derivatives in each coordinate direction and putting them together into a vector. In other words, take h = (1, 0, 0, . . . , 0) to get the first component of the gradient. Then take h = (0,1,0,…,0) to get the second component, and so on.
10. Complete the function grad checker according to the documentation of the function given
in the skeleton code.py. Alternatively, you may complete the function generic grad checker so which can work for any objective function.
You should be able to check that the gradients you computed above remain correct throughout the learning below.
Batch gradient descent
We will now finish the job of running regression on the training set.
11. Complete batch gradient descent. Note the phrase batch gradient descent distinguishes
between stochastic gradient descent or more generally minibatch gradient descent.
12. Now let’s experiment with the step size. Note that if the step size is too large, gradient descent may not converge. Starting with a step-size of 0.1, try various different fixed step sizes to see which converges most quickly and/or which diverge. As a minimum, try step sizes 0.5, 0.1, .05, and .01. Plot the average square loss on the training set as a function of the number of steps for each step size. Briefly summarize your findings.
plementations that are useful in practice, since you can use numpy (or more generally, an efficient numerical linear algebra library) to implement these matrix/vector operations orders of magnitude faster than naively implementing with loops in Python.

DS-GA-1003 – Spring 2022 4 13. For the learning rate you selected above, plot the average test loss as a function of the
iterations. You should observe overfitting: the test error first decreases and then increases.
Ridge Regression
We will add l2 regularization to linear regression. When we have a large number of features compared to instances, regularization can help control overfitting. Ridge regression is linear regression with l2 regularization. The regularization term is sometimes called a penalty term. The objective function for ridge regression is
where λ is the regularization parameter, which controls the degree of regularization. Note that the bias term (which we included as an extra dimension in θ) is being regularized as well as the other parameters. Sometimes it is preferable to treat this term separately.
14. Compute the gradient of Jλ(θ) and write down the expression for updating θ in the gradient descent algorithm. (Matrix/vector expression, without explicit summation)
15. Implement compute regularized square loss gradient.
16. Implement regularized grad descent.
Our goal is to find λ that gives the minimum average square loss on the test set. So you should start your search very broadly, looking over several orders of magnitude. For example, λ ∈ 􏰇10−7,10−5,10−3,10−1,1,10,100􏰈. Then you can zoom in on the best range. Follow the steps below to proceed.
17. Choosing a reasonable step-size, plot training average square loss and the test average square loss (just the average square loss part, without the regularization, in each case) as a function of the training iterations for various values of λ. What do you notice in terms of overfitting?
18. Plot the training average square loss and the test average square loss at the end of training as a function of λ. You may want to have log(λ) on the x-axis rather than λ. Which value of λ would you choose ?
19. Another heuristic of regularization is to early-stop the training when the test error reaches a minimum. Add to the last plot the minimum of the test average square loss along training as a function of λ. Is the value λ you would select with early stopping the same as before?
20. What θ would you select in practice and why?
Stochastic Gradient Descent (SGD) (optional)
When the training data set is very large, evaluating the gradient of the objective function can take a long time, since it requires looking at each training example to take a single gradient step.
In SGD, rather than taking −∇θJ(θ) as our step direction to minimize
(hθ(xi) − yi)2 + λθT θ,

DS-GA-1003 – Spring 2022 5 we take −∇θfi(θ) for some i chosen uniformly at random from {1, . . . , m}. The approximation
is poor, but we will show it is unbiased.
In machine learning applications, each fi(θ) would be the loss on the ith example. In practical implementations for ML, the data points are randomly shuffled, and then we sweep through the whole training set one by one, and perform an update for each training example individually. One pass through the data is called an epoch. Note that each epoch of SGD touches as much data as a single step of batch gradient descent. You can use the same ordering for each epoch, though optionally you could investigate whether reshuffling after each epoch affects the convergence speed.
21. Show that the objective function
(hθ(xi) − yi)2 + λθT θ
can be written in the form Jλ(θ) = m1 􏰂mi=1 fi(θ) by giving an expression for fi(θ) that
Jλ(θ) = m makes the two expressions equivalent.
22. Showthatthestochasticgradient∇θfi(θ),forichosenuniformlyatrandomfrom{1,…,m}, is an unbiased estimator of ∇θJλ(θ). In other words, show that E[∇fi(θ)] = ∇Jλ(θ) for any θ. It will be easier to prove this for a general J(θ) = m1 􏰂mi=1 fi(θ), rather than the specific case of ridge regression. You can start by writing down an expression for E [∇fi(θ)]
23. Write down the update rule for θ in SGD for the ridge regression objective function.
24. Implement stochastic grad descent.
25. Use SGD to find θλ∗ that minimizes the ridge regression objective for the λ you selected
in the previous problem. (If you could not solve the previous problem, choose λ = 10−2).
Try a few fixed step sizes (at least try ηt ∈ {0.05, .005}). Note that SGD may not converge
with fixed step size. Simply note your results. Next try step sizes that decrease with the
step number according to the following schedules: ηt = C and ηt = √C , C ≤ 1. Please tt
include C = 0.1 in your submissions. You are encouraged to try different values of C (see notes below for details). For each step size rule, plot the value of the objective function (or the log of the objective function if that is more clear) as a function of epoch (or step number, if you prefer). How do the results compare?
A few remarks about the question above:
• In this case we are investigating the convergence rate of the optimization algorithm with different step size schedules, thus we’re interested in the value of the objective function, which includes the regularization term.
• Sometimes the initial step size (C for C/t and C/ t) is too aggressive and will get you into a part of parameter space from which you can’t recover. Try reducing C to counter this problem.
• SGD convergence is much slower than GD once we get close to the minimizer (remember, the SGD step directions are very noisy versions of the GD step direction). If you look at the objective function values on a logarithmic scale, it may look like SGD will never find objective values that are as low as GD gets. In statistical learning theory terminology, GD

DS-GA-1003 – Spring 2022 6
has much smaller optimization error than SGD. However, this difference in optimization error is usually dominated by other sources of error (estimation error and approximation error). Moreover, for very large datasets, SGD (or minibatch GD) is much faster (by wall-clock time) than GD to reach a point close enough to the minimizer.
Acknowledgement: This problem set is based on assignments developed by of NYU and Bloomberg.
3 Image classification with regularized logistic regression
In this second problem set we will examine a classification problem. To do so we will use the MNIST dataset2 which is one of the traditional image benchmark for machine learning algo- rithms. We will only load the data from the 0 and 1 class, and try to predict the class from the im- age. You will find the support code for this problem in mnist classification source code.py. Before starting, take a little time to inspect the data. Load X train, y train, X test, y test with pre process mnist 01(). Using the function plt.imshow from matplotlib visualize some data points from X train by reshaping the 764 dimensional vectors into 28 × 28 arrays. Note how the class labels ‘0’ and ‘1’ have been encoded in y train. No need to report these steps in your submission.
Logistic regression
We will use here again a linear model, meaning that we will fit an affine function, hθ,b(x) = θT x + b,
with x ∈ R764, θ ∈ R764 and b ∈ R. This time we will use the logistic loss instead of the squared loss. Instead of coding everything from scratch, we will also use the package scikit learn and study the effects of l1 regularization. You may want to check that you have a version of the package up to date (0.24.1).
26. Recall the definition of the logistic loss between target y and a prediction hθ,b(x) as a function of the margin m = yhθ,b(x). Show that given that we chose the convention yi ∈ {−1, 1}, our objective function over the training data {xi, yi}mi=1 can be re-written as
(1 + yi) log(1 + e−hθ,b(xi)) + (1 − y) log(1 + ehθ,b(xi)).
27. What will become the loss function if we regularize the coefficients of θ with an l1 penalty
using a regularization parameter α ?
We are going to use the module SGDClassifier from scikit learn. In the code provided we have set a little example of its usage. By checking the online documentation3, make sure you understand the meaning of all the keyword arguments that were specified. We will keep the learning rate schedule and the maximum number of iterations fixed to the given values for all the problem. Note that scikit learn is actually implementing a fancy version of SGD to deal with the l1 penalty which is not differentiable everywhere, but we will not enter these details here.
2 http://yann.lecun.com/exdb/mnist/
3 https://scikit- learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html

DS-GA-1003 – Spring 2022 7
28. To evaluate the quality of our model we will use the classification error, which corresponds to the fraction of incorrectly labeled examples. For a given sample, the classification error is 1 if no example was labeled correctly and 0 if all examples were perfectly labeled. Using the method clf.predict() from the classifier write a function that takes as input an SGDClassifier which we will call clf, a design matrix X and a target vector y and returns the classification error. You should check that your function returns the same value as
1 – clf.score(X, y).
To speed up computations we will subsample the data. Using the function sub sample, restrict
X train and y train to N train = 100.
29. Report the test classification error achieved by the logistic regression as a function of the regularization parameters α (taking 10 values between 10−4 and 10−1). You should make a plot with α as the x-axis in log scale. For each value of α, you should repeat the experiment 10 times so has to finally report the mean value and the standard deviation. You should use plt.errorbar to plot the standard deviation as error bars.
30. Which source(s) of randomness are we averaging over by repeating the experiment?
31. What is the optimal value of the parameter α among the values you tested?
32. Finally, for one run of the fit for each value of α plot the value of the fitted θ. You can access it via clf.coef , and should reshape the 764 dimensional vector to a 28 × 28 arrray to visualize it with plt.imshow. Defining scale = np.abs(clf.coef ).max(), you can use the following keyword arguments (cmap=plt.cm.RdBu, vmax=scale, vmin=-scale) which will set the colors nicely in the plot. You should also use a plt.colorbar() to visualize the values associated with the colors.
33. What can you note about the pattern in θ? What can you note about the effect of the regularization?

程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com