DS-GA-1003 – Spring 2022
Homework 1: Error Decomposition & Polynomial Regression
Due: Wednesday, February 2, 2022 at 11:59pm
Instructions: Your answers to the questions below, including plots and mathematical work, should be submitted as a single PDF file. It's preferred that you write your answers using software that typesets mathematics (e.g. LaTeX, LyX, or MathJax via iPython), though if you need to you may scan handwritten work. You may find the minted package convenient for including source code in your LaTeX document. If you are using LyX, then the listings package tends to work better. Using such a package is optional.
General considerations (10 Points)
For the first part of this assignment we will consider a synthetic prediction problem to develop our intuition about the error decomposition. Consider the random variables $x \in \mathcal{X} = [0,1]$, distributed uniformly ($x \sim \mathrm{Unif}([0,1])$), and $y \in \mathcal{Y} = \mathbb{R}$ defined as a polynomial of degree 2 of $x$: there exists $(a_0, a_1, a_2) \in \mathbb{R}^3$ such that the values of $x$ and $y$ are linked as $y = g(x) = a_0 + a_1 x + a_2 x^2$. Note that this relation fixes the joint distribution $P_{\mathcal{X} \times \mathcal{Y}}$.
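As an illustration only (this is our own sketch, not part of the assignment code; the coefficients below are hypothetical stand-ins for the values returned by get_a), drawing such a synthetic sample could look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.array([1.0, -2.0, 3.0])  # hypothetical (a0, a1, a2); the real values come from get_a

def g(x):
    # true underlying degree-2 polynomial: g(x) = a0 + a1*x + a2*x^2
    return a[0] + a[1] * x + a[2] * x ** 2

x = rng.uniform(0.0, 1.0, size=10)  # x ~ Unif([0, 1])
y = g(x)                            # noiseless labels (noise is added in the last part)
```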
From the knowledge of a sample $\{x_i, y_i\}_{i=1}^N$, we would like to predict the relation between $x$ and $y$, that is, find a function $f$ to make predictions $\hat{y} = f(x)$. We denote by $\mathcal{H}_d$ the set of polynomial functions on $\mathbb{R}$ of degree $d$: $\mathcal{H}_d = \{f : x \mapsto b_0 + b_1 x + \cdots + b_d x^d;\ b_k \in \mathbb{R}\ \forall k \in \{0, \cdots, d\}\}$. We will consider the hypothesis classes $\mathcal{H}_d$ for varying $d$. We will minimize the squared loss $\ell(\hat{y}, y) = \frac{1}{2}(\hat{y} - y)^2$ to solve the regression problem.
1. (2 Points) Recall the definition of the expected risk $R(f)$ of a predictor $f$. While this cannot be computed in general, note that here we have defined $P_{\mathcal{X} \times \mathcal{Y}}$. Which function $f^*$ is an obvious Bayes predictor? Make sure to explain why the risk $R(f^*)$ is minimal at $f^*$.
2. (2 Points) Using $\mathcal{H}_2$ as your hypothesis class, which function $f^*_{\mathcal{H}_2}$ is a risk minimizer in $\mathcal{H}_2$? Recall the definition of the approximation error. What is the approximation error achieved by $f^*_{\mathcal{H}_2}$?
3. (2 Points) Consider now $\mathcal{H}_d$ with $d > 2$. Justify an inequality between $R(f^*_{\mathcal{H}_d})$ and $R(f^*_{\mathcal{H}_2})$. Which function $f^*_{\mathcal{H}_d}$ is a risk minimizer in $\mathcal{H}_d$? What is the approximation error achieved by $f^*_{\mathcal{H}_d}$?
4. (4 Points) For this question we assume $a_0 = 0$. Considering $\mathcal{H} = \{f : x \mapsto b_1 x;\ b_1 \in \mathbb{R}\}$, which function $f^*_{\mathcal{H}}$ is a risk minimizer in $\mathcal{H}$? What is the approximation error achieved by $f^*_{\mathcal{H}}$? In particular, what is the approximation error achieved if furthermore $a_2 = 0$ in the definition of the true underlying relation $g(x)$ above?
Polynomial regression as linear least squares (5 Points)
In practice, $P_{\mathcal{X} \times \mathcal{Y}}$ is usually unknown and we use the empirical risk minimizer (ERM). We will reformulate the problem as a $d$-dimensional linear regression problem. First note that functions in $\mathcal{H}_d$ are parametrized by a vector $b = [b_0, b_1, \cdots, b_d]^\top$; we will use the notation $f_b$. Similarly, we will denote by $a \in \mathbb{R}^3$ the vector parametrizing $g(x) = f_a(x)$. We will also gather the data points from the training sample in the following matrix and vector:
$$X = \begin{bmatrix} 1 & x_1 & \cdots & x_1^d \\ 1 & x_2 & \cdots & x_2^d \\ \vdots & \vdots & & \vdots \\ 1 & x_N & \cdots & x_N^d \end{bmatrix}, \qquad y = [y_1, y_2, \cdots, y_N]^\top. \tag{1}$$
These notations allow us to take advantage of the very effective linear algebra formalism. $X$ is called the design matrix.
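As a concrete illustration (our own sketch, not code from hw1_code_source.py), the design matrix of equation (1) can be built with NumPy's Vandermonde helper:

```python
import numpy as np

def design_matrix(x, d):
    # Stack columns [1, x, x^2, ..., x^d] for a 1-D array x of shape (N,),
    # giving the N x (d+1) design matrix of equation (1).
    return np.vander(x, N=d + 1, increasing=True)
```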
5. (2 Points) Show that the empirical risk minimizer (ERM) $\hat{b}$ is given by the following minimization: $\hat{b} = \arg\min_b \|Xb - y\|_2^2$.
6. (3 Points) If $N > d$ and $X$ is full rank, show that $\hat{b} = (X^\top X)^{-1} X^\top y$. (Hint: take the gradient of the loss above with respect to $b$.) Why do we need the conditions $N > d$ and $X$ full rank?
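As an informal sanity check (our own illustration, not part of the assignment; all names below are ours), the closed form can be compared numerically against NumPy's least-squares solver:

```python
import numpy as np

# Verify that (X^T X)^{-1} X^T y matches a standard least-squares solver
# on a random full-rank design with N > d.
rng = np.random.default_rng(1)
N, d = 10, 3
X = np.vander(rng.uniform(0.0, 1.0, N), N=d + 1, increasing=True)
y = rng.normal(size=N)

b_closed = np.linalg.inv(X.T @ X) @ X.T @ y
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(b_closed, b_lstsq)
```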
Hands on (7 Points)
Open the source code file hw1_code_source.py from the .zip folder. Using the function get_a, get a value for a, and draw a sample x_train, y_train of size $N = 10$ and a sample x_test, y_test of size $N_{\text{test}} = 1000$ using the function draw_sample.
7. (2 Points) Write a function called least square estimator taking as input a design matrix X ∈ RN×(d+1) and the corresponding vector y ∈ RN returning bˆ ∈ R(d+1). Your function should handle any value of N and d, and in particular return an error if N ≤ d. (Drawing x at random from the uniform distribution makes it almost certain that any design matrix X with d ≥ 1 we generate is full rank).
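One possible implementation sketch (not the official solution) follows the closed form from question 6:

```python
import numpy as np

def least_square_estimator(X, y):
    # X: (N, d+1) design matrix, y: (N,) target vector; returns b_hat in R^(d+1).
    N, p = X.shape  # p = d + 1
    if N <= p - 1:  # i.e. N <= d
        raise ValueError("Need N > d for the least-squares estimator.")
    # Solve the normal equations X^T X b = X^T y; np.linalg.solve is
    # numerically preferable to forming the explicit inverse.
    return np.linalg.solve(X.T @ X, X.T @ y)
```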
8. (1 Point) Recall the definition of the empirical risk $\hat{R}(f)$ on a sample $\{x_i, y_i\}_{i=1}^N$ for a prediction function $f$. Write a function empirical_risk to compute the empirical risk of $f_b$, taking as input a design matrix $X \in \mathbb{R}^{N \times (d+1)}$, a vector $y \in \mathbb{R}^N$ and the vector $b \in \mathbb{R}^{d+1}$ parametrizing the predictor.
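A minimal sketch of such a function, assuming the squared loss $\ell(\hat{y}, y) = \frac{1}{2}(\hat{y} - y)^2$ defined above and averaging it over the sample:

```python
import numpy as np

def empirical_risk(X, y, b):
    # Average of the squared loss (1/2)(y_hat - y)^2 over the sample.
    y_hat = X @ b
    return np.mean(0.5 * (y_hat - y) ** 2)
```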
9. (3 Points) Use your code to estimate $\hat{b}$ from x_train, y_train using $d = 5$. Compare $\hat{b}$ and $a$. Make a single plot (Plot 1) of the plane $(x, y)$ displaying the points in the training set, the values of the true underlying function $g(x)$ on $[0, 1]$, and the values of the estimated function $f_{\hat{b}}(x)$ on $[0, 1]$. Make sure to include a legend in your plot.
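One way Plot 1 might be produced (a sketch assuming x_train, y_train, the true coefficients a, and the estimate b_hat from the previous questions are in scope; the variable names are ours):

```python
import matplotlib.pyplot as plt
import numpy as np

xs = np.linspace(0.0, 1.0, 200)
X_grid = np.vander(xs, N=len(b_hat), increasing=True)  # features for f_bhat
G_grid = np.vander(xs, N=3, increasing=True)           # features for the true g

plt.scatter(x_train, y_train, label="training points")
plt.plot(xs, G_grid @ a, label="true g(x)")
plt.plot(xs, X_grid @ b_hat, "--", label="estimated f_bhat(x)")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()
```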
10. (1 Point) Now you can adjust $d$. What is the minimum value for which we get a "perfect fit"? How does this result relate to your conclusions on the approximation error above?
In the presence of noise (13 Points)
Now we will modify the true underlying $P_{\mathcal{X} \times \mathcal{Y}}$ by adding some noise to $y$: $y = g(x) + \varepsilon$, with $\varepsilon \sim \mathcal{N}(0, 1)$ a standard normal random variable independent of $x$. We will call the training error $e_t$ the empirical risk on the training set and the generalization error $e_g$ the empirical risk on the test set.
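For intuition only, a sketch of drawing noisy labels, reusing the hypothetical g and rng from the first sketch above:

```python
eps = rng.normal(0.0, 1.0, size=x.shape)  # epsilon ~ N(0, 1), independent of x
y_noisy = g(x) + eps
```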
11. (6 Points) Plot $e_t$ and $e_g$ as a function of $N$ for $d$