CS 189 Introduction to Machine Learning Fall 2017
HW2
This homework is due Friday, September 8 at 10 p.m.

1 Getting Started
You may typeset your homework in LaTeX or submit neatly handwritten and scanned solutions. Please make sure to start each question on a new page, as grading (with Gradescope) is much easier that way!

Deliverables:
1. Submit a PDF of your write-up to the assignment on Gradescope, “HW[n] Write-Up”.
2. Submit all code needed to reproduce your results, “HW[n] Code”.
3. Submit your test set evaluation results, “HW[n] Test Set”.
After you’ve submitted your homework, be sure to watch out for the self-grade form.
(a) Before you start your homework, write down your team. Who else did you work with on this homework? List names and email addresses. If you met people at course events, just describe the group. How did you work on this homework? Any comments about the homework?
(b) Please copy the following statement and sign next to it:
I certify that all solutions are entirely in my words and that I have not looked at another student’s solutions. I have credited all external sources in this write up.
2 Geometry of Ridge Regression
One way to interpret “ridge regression” is as the Lagrangian form of a constrained problem.
- (a) Given a matrix X ∈ Rn×d and a vector ⃗y ∈ Rn, define the optimization problem
$$\begin{aligned}
\min_{\vec{w}}\quad & \|\vec{y} - X\vec{w}\|^2 \\
\text{subject to}\quad & \|\vec{w}\|^2 \le \beta^2. \tag{1}
\end{aligned}$$
Using the spirit of the method of Lagrange multipliers (wherein we replace a constraint with a penalty on the quantity being constrained, and then adjust the level of the penalty until the solution ends up satisfying the constraint; these penalties are sometimes referred to as “shadow prices,” especially in the economics literature), rewrite the constrained optimization problem as an unconstrained optimization problem with the constraint incorporated into the objective function.
Recall that ridge regression is given by the unconstrained optimization problem
$$\hat{w} = \operatorname*{argmin}_{\vec{w}} \|\vec{y} - X\vec{w}\|^2 + \lambda \|\vec{w}\|^2. \tag{2}$$
Hence, by performing ridge regression with penalty λ, we are essentially solving a constrained least squares problem with our parameter having bounded Euclidean norm β. Qualitatively, how would increasing β be reflected in the desired penalty λ of ridge regression?
- (b) One reason why we might want to have small weights ⃗w has to do with the sensitivity of the predictor to its input. Let ⃗x be a d-dimensional list of features corresponding to a new test point. Our predictor is ⃗w⊤⃗x. By how much can our prediction change if nature added an arbitrary disturbance vector of length ε to the test point’s features ⃗x?
- (c) Derive that the solution to ridge regression (2) is given by ŵr = (X⊤X + λI)−1X⊤⃗y. What happens when λ → ∞? It is for this reason that regularization is sometimes referred to as “shrinkage.”
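The claimed closed form and its shrinkage behavior can be sanity-checked numerically; a minimal sketch on random data (sizes, seed, and the `ridge` helper name are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

def ridge(X, y, lam):
    """Closed-form ridge solution: (X^T X + lam*I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

w_ols = ridge(X, y, 0.0)  # lam = 0 recovers ordinary least squares
w_small, w_large = ridge(X, y, 0.1), ridge(X, y, 1e6)
# Increasing lambda shrinks the solution toward zero.
assert np.linalg.norm(w_large) < np.linalg.norm(w_small) <= np.linalg.norm(w_ols) + 1e-12
```

At λ = 0 this agrees with the ordinary least-squares solution, and as λ grows the fitted weights are driven toward the zero vector.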
- (d) Note that in computing ŵr, we are trying to invert the matrix X⊤X + λI instead of the matrix X⊤X. If X⊤X has eigenvalues λ1, …, λd, what are the eigenvalues of X⊤X + λI? Comment on why adding the regularizer term λI can improve the inversion operation numerically.
- (e) Let d = 3, n = 5, and let the eigenvalues of X⊤X be given by 1000, 1 and 0.001. We must now choose between two regularization parameters λ1 = 100 and λ2 = 0.5. Which do you think is a better choice for this problem and why?
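The effect of the regularizer on the spectrum and conditioning can be illustrated with a hypothetical diagonal Gram matrix whose eigenvalues match this part (the matrix itself is made up; only the eigenvalues come from the problem statement):

```python
import numpy as np

# Hypothetical Gram matrix whose eigenvalues match part (e): 1000, 1, 0.001.
XtX = np.diag([1000.0, 1.0, 0.001])

def condition_number(lam):
    """Condition number of X^T X + lam*I (ratio of extreme eigenvalues)."""
    eigs = np.linalg.eigvalsh(XtX + lam * np.eye(3))
    return eigs.max() / eigs.min()

# Each eigenvalue lambda_i becomes lambda_i + lam, so lam = 100 tames the
# tiny 0.001 eigenvalue far more than lam = 0.5 does.
assert condition_number(100.0) < condition_number(0.5)
```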
- (f) Another advantage of ridge regression can be seen for underdetermined systems. Say we have data drawn from a 5-parameter model, but only have 4 samples of it, i.e. X ∈ R4×5. This is clearly an underdetermined system, since n < d. Show that ridge regression with λ > 0 results in a unique solution, whereas ordinary least squares has an infinite number of solutions.
Hint: To make this point, it may be helpful to expand any vector w as w = w0 + X⊤a for w0 ∈ nullspace(X) and some a.
- (g) (BONUS) For the previous part, what will the answer be if you take the limit λ → 0 for ridge regression?
- (h) Tikhonov regularization is a general term for ridge regression, where the constraint set takes the form of an ellipsoid instead of a ball. In other words, we solve the optimization problem

$$\hat{w} = \operatorname*{argmin}_{w} \|y - Xw\|^2 + \lambda \|\Gamma w\|^2$$

for some full rank matrix Γ ∈ Rd×d. Derive a closed-form solution to this problem.

3 Polynomials and invertibility
Consider using a D-degree polynomial to fit a function y = f(x) with n training samples, where both x and y are scalar. We know that this is equivalent to performing linear regression with a feature matrix F constructed from the n training data sampling positions x1, …, xn. Assume all the training sampling positions are non-zero, and let this mapping be given by F = [⃗pD(x1), …, ⃗pD(xn)]⊤, where ⃗pD(x) = [x^0, x^1, …, x^D]⊤.
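For concreteness, this feature matrix is exactly a Vandermonde matrix, which numpy can construct directly; a minimal sketch with made-up sample points (the helper name is illustrative):

```python
import numpy as np

def poly_features(xs, D):
    """Stack rows p_D(x_i) = [x_i^0, x_i^1, ..., x_i^D] into a matrix."""
    xs = np.asarray(xs, dtype=float)
    return np.vander(xs, N=D + 1, increasing=True)

F = poly_features([1.0, 2.0, 3.0], D=2)
# F is 3 x 3: row i is [1, x_i, x_i^2].
```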
- (a) For n = 2 and D = 1, show that the matrix F has full rank iff x1 ≠ x2.
- (b) More generally, let us now show that the columns of F are linearly independent provided the sampling data points are distinct and n ≥ D + 1. It suffices to consider the case n = D + 1, so assume n = D + 1 from this point forward for all the questions about univariate polynomials.
We do this by performing the following operations in sequence. From the matrix F, subtract the first row from rows 2 through n to obtain matrix F′.
Is it true that det(F) = det(F′)?
Hint: Think about representing the row subtraction operation using a matrix multiplication, and argue why this additional matrix must have determinant 1. (What are the eigenvalues of a triangular matrix?)
- (c) Perform the following sequence of operations on F′ to obtain the matrix F′′:
i) Subtract x1 · column n−1 from column n.
ii) Subtract x1 · column n−2 from column n−1.
⋮
n−1) Subtract x1 · column 1 from column 2.
Write out the matrix F′′ and argue why det(F′) = det(F′′).
- (d) For any square matrix A ∈ Rd×d and a matrix
$$B = \begin{bmatrix} 1 & \vec{0}^\top \\ \vec{0} & A \end{bmatrix},$$
argue that the d + 1 eigenvalues of B are given by {1, λ1(A), λ2(A), …, λd(A)}, and conclude that det(B) = det(A). Here, ⃗0 represents a column vector of zeros in Rd.
(e) Use the above parts to show by induction that we have det(F) = ∏1≤i<j≤n (xj − xi). Consequently, the matrix F is full rank unless two data points are equal.
Hint: First show that

$$\det(F) = \prod_{i=2}^{n} (x_i - x_1)\,\det\!\left([\vec{p}_{D-1}(x_2), \vec{p}_{D-1}(x_3), \ldots, \vec{p}_{D-1}(x_n)]^\top\right).$$
Hint Hint: You can use the fact that multiplying a row of a matrix by a constant scales the determinant by this constant. (A fact that is clear from the oriented volume interpretation of determinants.)
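The identity to be proven can be spot-checked numerically for a small n; a sketch using hypothetical sample points:

```python
import numpy as np
from itertools import combinations

xs = np.array([0.5, 1.0, 2.0, 3.0])  # distinct sample points, n = D + 1 = 4
F = np.vander(xs, increasing=True)   # row i is [1, x_i, x_i^2, x_i^3]
lhs = np.linalg.det(F)
# Product of (x_j - x_i) over all pairs i < j.
rhs = np.prod([xs[j] - xs[i] for i, j in combinations(range(len(xs)), 2)])
assert abs(lhs - rhs) < 1e-9
```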
(f) Let us now extend this argument to features from a multidimensional space of dimension l, using a multivariate polynomial of degree D.
Using a stars and bars argument (link is here if you did not take CS70), show that we now have $\binom{D+l}{l}$ features for each sampling point.
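The count $\binom{D+l}{l}$ can be cross-checked by brute-force enumeration of the monomials; a small sketch (the function name is illustrative):

```python
from itertools import product
from math import comb

def n_features(D, l):
    """Count monomials x1^a1 * ... * xl^al with a1 + ... + al <= D."""
    return sum(1 for exps in product(range(D + 1), repeat=l)
               if sum(exps) <= D)

for D, l in [(2, 2), (3, 2), (2, 3)]:
    assert n_features(D, l) == comb(D + l, l)
```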
(g) Choose n sample points {⃗xi}ni=1 with ⃗xi = (xi,1,xi,2,…,xi,l)T , and stack up all their features
like in part (a) to form the feature matrix Fl.
First, we will show that for some choice of distinct sampling points, this may not be full rank. Let us now form a particular instance of the matrix Fl by choosing xi,1 = xi,2 = · · · = xi,l = αi for distinct αi, i ∈ {1, 2, …, n}. Show that this leads to an Fl with linearly dependent columns no matter how many samples n you take.
(h) To show that the matrix can be full rank, pick a set of sampling points ⃗xi to show that there is a way to choose the samples so that Fl is full column rank. You are allowed to pick as many sample points as you want. Although we are only asking you to show that there exists a way to sample to achieve full rank with enough samples taken, it turns out to be true that the Fl matrix will be full rank as long as $n = \binom{D+l}{l}$ “generic” points are chosen.
Hint: Leverage earlier parts of this problem if you can.
4 Polynomials and approximation
For a p-times differentiable function f : R → R, the Taylor series expansion of order m ≤ p − 1 about the point x0 is given by

$$f(x) = \sum_{i=0}^{m} \frac{1}{i!} f^{(i)}(x_0)(x - x_0)^i + \frac{1}{(m+1)!} f^{(m+1)}(a(x))(x - x_0)^{m+1}. \tag{3}$$

Here, f^(m) denotes the m-th derivative of the function f, and a(x) is some value between x0 and x.

The last term of this expansion is typically referred to as the approximation-error term when approximating f(x) by an m-th degree polynomial. For functions f whose derivatives are bounded in the neighborhood of interest, if we have |f^(m+1)(x)| ≤ T for x ∈ (x0 − s, x0 + s), then we know that for x ∈ (x0 − s, x0 + s),

$$\left| f(x) - \sum_{i=0}^{m} \frac{1}{i!} f^{(i)}(x_0)(x - x_0)^i \right| \le \frac{T\,|x - x_0|^{m+1}}{(m+1)!}.$$
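This bound can be checked numerically for a concrete choice of f; a sketch using f(x) = eˣ about x0 = 0, with the evaluation point, order, and interval chosen arbitrarily for illustration:

```python
import math

def taylor_exp(x, m):
    """m-th order Taylor polynomial of e^x about x0 = 0."""
    return sum(x**i / math.factorial(i) for i in range(m + 1))

x, m, s = 1.5, 4, 3.0
T = math.exp(s)  # |f^(m+1)(x)| = e^x <= e^s on (-s, s)
err = abs(math.exp(x) - taylor_exp(x, m))
bound = T * abs(x) ** (m + 1) / math.factorial(m + 1)
# The true approximation error respects the stated bound.
assert err <= bound
```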
- (a) Compute the 2nd, 3rd, and 4th order Taylor approximations of the following functions about the point x0 = 0. Also bound the error of approximation at x = 3.
i) e^x
ii) sin x
- (b) Let us say we would like an accurate polynomial approximation of the functions in part (a) for all x ∈ [−3, 3]. In other words, we want a polynomial of degree D (call it φD) such that |f(x) − φD(x)| ≤ ε for all x ∈ [−3, 3]. How large does D need to be as a function of ε for such a guarantee, for the two choices of f(x) in part (a)?
- (c) What is $\lim_{m \to \infty} \frac{(x - x_0)^{m+1}}{(m+1)!}$?
Conclude that a univariate polynomial of high enough degree can approximate any function that is sufficiently smooth. This is the power of using polynomial features, even when we don’t know the underlying function f that is generating our data! This universal approximation property gives us some justification for using polynomial features. (Later, we will see that neural networks are also universal function approximators.)
- (d) Now let’s extend this idea of approximating functions with polynomials to multivariable functions. The Taylor series expansion for a function f(x, y) about the point (x0, y0) is given by
$$\begin{aligned}
f(x,y) = {} & f(x_0,y_0) + f_x(x_0,y_0)(x - x_0) + f_y(x_0,y_0)(y - y_0) \\
& + \frac{1}{2!}\big[ f_{xx}(x_0,y_0)(x - x_0)^2 + f_{xy}(x_0,y_0)(x - x_0)(y - y_0) \\
& \qquad + f_{yx}(x_0,y_0)(x - x_0)(y - y_0) + f_{yy}(x_0,y_0)(y - y_0)^2 \big] + \cdots \tag{4}
\end{aligned}$$

where $f_x = \frac{\partial f}{\partial x}$, $f_y = \frac{\partial f}{\partial y}$, $f_{xx} = \frac{\partial^2 f}{\partial x^2}$, $f_{yy} = \frac{\partial^2 f}{\partial y^2}$, and $f_{xy} = \frac{\partial^2 f}{\partial x \partial y}$.
As you can see, the Taylor series for multivariate functions quickly becomes unwieldy after the second order. Let’s try to make the series a little bit more manageable. Using matrix notation, write the expansion for a function of two variables in a more compact form up to the second order terms, where f(⃗x) = f(x, y) with ⃗x = [x, y]⊤ and ⃗x0 = [x0, y0]⊤. Clearly define any additional vectors and matrices that you use.
(e) To see that we can do universal approximation, first consider approximating the function along a single straight-line path. For this problem, we will use the function
$$f(\vec{x}) = e^x \sin y$$

where ⃗x = [x, y]⊤. We’re interested in the path $\vec{x}(t) = [\tfrac{1}{\sqrt{2}} t, \tfrac{1}{\sqrt{2}} t]^\top$. Write the 3rd order Taylor expansion of f(⃗x(t)) for the variable t about the point t0 = 0. Use the chain rule.
(f) Similar to part (b), determine the degree D for the polynomial φD(⃗x(t)) from the Taylor series about the point t0 = 0 such that |f(⃗x(t))−φD(⃗x(t))| ≤ ε on the interval t ∈ [0,3] for the function from the previous part.
(g) BONUS: Sketch how the argument above can be extended to show that multivariate polynomials are universal function approximators for sufficiently smooth functions of many variables.
5 Jaina and her giant peaches
In another alternative universe, Jaina is a mage testing how long she can fly a collection of giant peaches. She has n training peaches – with masses given by x1,x2,…xn – and flies all the peaches once to collect training data. The experimental flight time of peach i is given by yi. She believes that the flight time is well approximated by a polynomial function of the mass, and her goal is to fit a polynomial of degree D to this data. Include all text responses and plots in your write-up.
- (a) Show how Jaina’s problem can be formulated as a linear regression problem.
- (b) You are given data of the masses {xi}ni=1 and flying times {yi}ni=1 in the “x train” and “y train” keys of the file 1D POLY.MAT, respectively, with the masses centered and normalized to lie in the range [−1, 1]. Write a routine to do a least-squares fit (taking care to include a constant term) of a polynomial function of degree D to the data. Letting fD denote the fitted polynomial, plot the average training error $R(D) = \frac{1}{n}\sum_{i=1}^{n} (y_i - f_D(x_i))^2$ against D in the range D ∈ {1, 2, 3, …, n − 1}.
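One way such a routine might look, as a sketch on synthetic stand-in data (the real masses and times would be loaded from the .mat file, e.g. with scipy.io.loadmat; the function names and the toy data below are illustrative only):

```python
import numpy as np

def fit_poly(xs, ys, D):
    """Least-squares fit of a degree-D polynomial, constant term included."""
    F = np.vander(xs, N=D + 1, increasing=True)
    w, *_ = np.linalg.lstsq(F, ys, rcond=None)
    return w

def avg_error(xs, ys, w):
    """Average squared error (1/n) * sum_i (y_i - f_D(x_i))^2."""
    F = np.vander(xs, N=len(w), increasing=True)
    return float(np.mean((ys - F @ w) ** 2))

# Synthetic stand-in for the training data.
rng = np.random.default_rng(1)
xs = rng.uniform(-1, 1, size=20)
ys = 2 * xs**3 - xs + 0.01 * rng.normal(size=20)

errors = [avg_error(xs, ys, fit_poly(xs, ys, D)) for D in range(1, 8)]
# Training error never increases as the degree D grows.
assert all(a >= b - 1e-9 for a, b in zip(errors, errors[1:]))
```

Plotting `errors` against D (e.g. with matplotlib) then gives the requested curve.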
- (c) How does the average training error behave as a function of D, and why? What happens
if you try to fit a polynomial of degree n with a standard matrix inversion method?
- (d) Jaina has taken Mystical Learning 189, and so decides that she needs to run another experiment before deciding that her prediction is true. She runs another fresh experiment of flight times using the same peaches, to obtain the data with key “y fresh” in 1D POLY.MAT. Denoting the fresh flight time of peach i by ỹi, plot the average error $\tilde{R}(D) = \frac{1}{n}\sum_{i=1}^{n} (\tilde{y}_i - f_D(x_i))^2$ for the same values of D as in part (b), using the polynomial approximations fD also from the previous part. How does this plot differ from the plot in (b) and why?
- (e) How do you propose using the two plots from parts (b) and (d) to “select” the right polynomial model for Jaina?
- (f) Jaina has a new hypothesis – the flying time is actually a function of the mass, smoothness, size, and sweetness of the peach, and some multivariate polynomial function of all of these parameters. The data in POLYNOMIAL REGRESSION SAMPLES.MAT is a 100000 × 5 matrix with columns corresponding to the 5 attributes of the peach. Use 4-fold cross-validation to decide which of D ∈ {1, 2, 3, 4} is the best fit for the data provided. For this part, compute the polynomial coefficients via ridge regression with penalty λ = 0.1, instead of ordinary least squares.
- (g) Now redo the previous part, but use 4-fold cross-validation on all combinations of D ∈ {1, 2, 3, 4} and λ ∈ {0.05, 0.1, 0.15, 0.2} – this is referred to as a grid search. Find the D and λ that best explain the data using ridge regression.
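A sketch of the λ half of such a grid search with 4-fold cross-validation; loading the .mat file and building the multivariate polynomial features are elided, so the data, shapes, and helper names below are synthetic stand-ins:

```python
import numpy as np

def ridge_fit(F, y, lam):
    """Ridge solution (F^T F + lam*I)^{-1} F^T y."""
    return np.linalg.solve(F.T @ F + lam * np.eye(F.shape[1]), F.T @ y)

def cv_error(F, y, lam, k=4):
    """Average validation MSE over k contiguous folds."""
    idx = np.arange(len(y))
    errs = []
    for val in np.array_split(idx, k):
        train = np.setdiff1d(idx, val)
        w = ridge_fit(F[train], y[train], lam)
        errs.append(np.mean((y[val] - F[val] @ w) ** 2))
    return float(np.mean(errs))

# Synthetic stand-in for the featurized peach data.
rng = np.random.default_rng(2)
F = rng.normal(size=(200, 6))
y = F @ rng.normal(size=6) + 0.1 * rng.normal(size=200)

grid = {lam: cv_error(F, y, lam) for lam in (0.05, 0.1, 0.15, 0.2)}
best_lam = min(grid, key=grid.get)
```

The full grid search simply wraps this in an outer loop over D, rebuilding the feature matrix for each degree.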
6 Your Own Question
Write your own question, and provide a thorough solution.
This problem should show your understanding of a key concept in the class.
Writing your own problems is a very important way to really learn material. The famous “Bloom’s Taxonomy” that lists the levels of learning is: Remember, Understand, Apply, Analyze, Evaluate, and Create. Using what you know to create is the top level. We rarely ask you any HW questions about the lowest level of straight-up remembering, expecting you to be able to do that yourself (e.g. by making yourself flashcards). But we don’t want the same to be true about the highest level.
As a practical matter, having some practice at trying to create problems helps you study for exams much better than simply counting on solving existing practice problems. This is because thinking about how to create an interesting problem forces you to really look at the material from the perspective of those who are going to create the exams.
Besides, this is fun. If you want to make a boring problem, go ahead. That is your prerogative. But it is more fun to really engage with the material, discover something interesting, and then come up with a problem that walks others down a journey that lets them share your discovery. You don’t have to achieve this every week. But unless you try every week, it probably won’t happen ever.