Homework 1: ML
Q1. Recall the definition of the expected risk $R(f)$ of a predictor $f$. While this cannot be computed in general, note that here we have defined $P_{X\times Y}$. Which function $f^*$ is an obvious Bayes predictor? Make sure to explain why the risk $R(f^*)$ is minimal at $f^*$.
f∗(x) = g(x)
Bayes predictor: $f^* = \arg\min_f R(f) = \arg\min_f \mathbb{E}_{(x,y)\sim P_{X\times Y}}[\ell(\hat{y}, y)]$
$\phantom{f^*} = \arg\min_f \mathbb{E}_{(x,y)\sim P_{X\times Y}}\left[\tfrac{1}{2}(\hat{y} - y)^2\right]$, where $\hat{y} = f(x)$.
The squared loss $\tfrac{1}{2}(\hat{y} - y)^2$ is non-negative, so the risk of any predictor is at least zero; the function $g(x)$ achieves zero risk and is therefore the Bayes predictor.
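Restating the argument above in symbols (assuming, as elsewhere in this assignment, the noiseless relation $y = g(x)$):
\[
R(g) = \mathbb{E}_{(x,y)\sim P_{X\times Y}}\!\left[\tfrac{1}{2}(g(x) - y)^2\right]
     = \mathbb{E}_{x}\!\left[\tfrac{1}{2}(g(x) - g(x))^2\right] = 0 \;\le\; R(f) \quad \text{for every predictor } f,
\]
so no predictor can achieve lower risk than $g$.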
Q2. Using $H_2$ as your hypothesis class, which function $f^*_{H_2}$ is a risk minimizer in $H_2$? Recall the definition of the approximation error. What is the approximation error achieved by $f^*_{H_2}$?

$f^*_{H_2}(x) = g(x)$: this is the Bayes predictor and it lies in the hypothesis space $H_2$, so it is the risk minimizer within $H_2$. The approximation error $R(f^*_{H_2}) - R(f^*)$ is therefore zero.
Q3. Considering now $H_d$, with $d > 2$. Justify an inequality between $R(f^*_{H_d})$ and $R(f^*_{H_2})$. Which function $f^*_{H_d}$ is a risk minimizer in $H_d$? What is the approximation error achieved by $f^*_{H_d}$?

In the general case, the inequality is $R(f^*_{H_d}) \le R(f^*_{H_2})$. For any $H_d$ with $d > 2$, every function from $H_2$ is included in $H_d$, because you can set the coefficients $b_k$, $k \in \{3,\dots,d\}$, to zero and set $b_0, b_1, b_2$ to match the required function from $H_2$. In particular the risk minimizer from $H_2$, $f^*_{H_2}$, is included in $H_d$, so the minimizer over $H_d$ has to either further reduce the risk or at worst match that of $f^*_{H_2}$.

The minimizer $f^*_{H_d}$ is the same as $g(x)$, defined by $b_0 = a_0$, $b_1 = a_1$, $b_2 = a_2$, and $b_3 = \dots = b_d = 0$. As a result the approximation error is zero again, because this is the Bayes predictor.
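The inequality can be written in one line from the inclusion of hypothesis spaces (this is just a compact restatement of the argument above):
\[
H_2 \subseteq H_d \;\Longrightarrow\; R\!\left(f^*_{H_d}\right) = \min_{f \in H_d} R(f) \;\le\; \min_{f \in H_2} R(f) = R\!\left(f^*_{H_2}\right).
\]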
Q4. For this question we assume a0 = 0. Considering H = {f : x → b1x;b1 ∈ R}, which function fH∗ is a risk minimizer in H? What is the approximation error achieved by fH? In
particular what is the approximation error achieved if furthermore a2 = 0 in the definition of true underlying relation g(x) above?
We are given that $a_0 = 0$, therefore $y = a_1 x + a_2 x^2$. The prediction is $\hat{y} = f(x) = b_1 x$, so the minimizer is $f^*_H = \arg\min_{f \in H} \tfrac{1}{2}\mathbb{E}[(f(x) - y)^2]$.
Solving the expectation alone (note we are told that $x \sim \mathrm{Unif}[0,1]$, so on $[0,1]$ the probability density function is $f_X(x) = 1$):
\[
\begin{aligned}
\mathbb{E}[(f(x) - y)^2] &= \int_0^1 (f(x) - y)^2 \, dx \\
&= \int_0^1 (b_1 x - a_1 x - a_2 x^2)^2 \, dx \\
&= \int_0^1 x^2 \, (b_1 - a_1 - a_2 x)^2 \, dx \\
&= \int_0^1 x^2 \left[ b_1^2 + a_1^2 + a_2^2 x^2 - 2 a_1 b_1 - 2 a_2 b_1 x + 2 a_1 a_2 x \right] dx \\
&= \int_0^1 \left[ b_1^2 x^2 + a_1^2 x^2 + a_2^2 x^4 - 2 a_1 b_1 x^2 - 2 a_2 b_1 x^3 + 2 a_1 a_2 x^3 \right] dx \\
&= \left[ \frac{b_1^2 x^3}{3} + \frac{a_1^2 x^3}{3} + \frac{a_2^2 x^5}{5} - \frac{2 a_1 b_1 x^3}{3} - \frac{2 a_2 b_1 x^4}{4} + \frac{2 a_1 a_2 x^4}{4} \right]_0^1 \\
&= \frac{b_1^2}{3} + \frac{a_1^2}{3} + \frac{a_2^2}{5} - \frac{2 a_1 b_1}{3} - \frac{a_2 b_1}{2} + \frac{a_1 a_2}{2}
\end{aligned}
\]
We now need to find the function $f \in H$ that minimizes this, i.e. the value of $b_1$ that minimizes the above expression. To do this, we differentiate with respect to $b_1$:
\[
\frac{d\,\mathbb{E}[(f(x)-y)^2]}{d b_1} = \frac{2 b_1}{3} - \frac{2 a_1}{3} - \frac{2 a_2}{4} = 0
\qquad\Longrightarrow\qquad b_1 = a_1 + \frac{3}{4} a_2 .
\]
We can verify that the second derivative is positive ($\frac{d^2\,\mathbb{E}[(f(x)-y)^2]}{d b_1^2} = \frac{2}{3}$), hence the value found for $b_1$ is a minimum.

The associated approximation error is then
\[
\begin{aligned}
R(f^*_H) - R(f^*) = R(f^*_H) - 0 &= \tfrac{1}{2}\,\mathbb{E}[(f(x)-y)^2] = \tfrac{1}{2} \int_0^1 (b_1 x - a_1 x - a_2 x^2)^2 \, dx \\
&= \tfrac{1}{2} \int_0^1 \left(\tfrac{3}{4} a_2 x - a_2 x^2\right)^2 dx
 = \tfrac{1}{2} \int_0^1 \frac{a_2^2}{16} (3x - 4x^2)^2 \, dx \\
&= \frac{a_2^2}{32} \int_0^1 \left( 9x^2 + 16 x^4 - 24 x^3 \right) dx \\
&= \frac{a_2^2}{32} \left( \frac{9}{3} + \frac{16}{5} - \frac{24}{4} \right) = \frac{a_2^2}{160}
\end{aligned}
\]
Furthermore, if we apply the constraint that $a_2 = 0$, then $f^*_H(x) = a_1 x$, and the associated approximation error is zero.
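As an optional sanity check (not part of the required solution), a short symbolic computation with sympy reproduces both the optimal $b_1$ and the $a_2^2/160$ approximation error; the variable names below are illustrative.

import sympy as sp

x, b1, a1, a2 = sp.symbols('x b1 a1 a2', real=True)
# Expected squared error E[(f(x) - y)^2] with x ~ Unif[0, 1] and y = a1*x + a2*x**2
expected_sq_err = sp.integrate((b1*x - a1*x - a2*x**2)**2, (x, 0, 1))
# Minimize over b1 by setting the derivative to zero
b1_star = sp.solve(sp.diff(expected_sq_err, b1), b1)[0]
print(sp.simplify(b1_star))                              # a1 + 3*a2/4
# Approximation error = (1/2) * E[(f*(x) - y)^2]
approx_err = sp.Rational(1, 2) * expected_sq_err.subs(b1, b1_star)
print(sp.simplify(approx_err))                           # a2**2/160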
Q5. Show that the empirical risk minimizer (ERM) $\hat{b}$ is given by the following minimization: $\hat{b} = \arg\min_b \|Xb - y\|^2$.
As long as you show that opening out the matrix expression gives you the same result as the ERM, you get the points.
\[
\text{ERM: } \hat{R}(\hat{f}) = \min_{f \in H_d} \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i)
= \min_{f \in H_d} \frac{1}{n} \sum_{i=1}^{n} \tfrac{1}{2}(f(x_i) - y_i)^2
= \min_{f \in H_d} \frac{1}{2n} \sum_{i=1}^{n} (f(x_i) - y_i)^2
\]
\[
\|Xb - y\|^2 = \sum_{i=1}^{n} (x_i^\top b - y_i)^2 = \sum_{i=1}^{n} (f(x_i) - y_i)^2 = \sum_{i=1}^{n} 2\,\ell(f(x_i), y_i)
\]
The constant factors (the $2$, and likewise the $\tfrac{1}{2n}$) do not change the minimizer and can be dropped when we minimize. Therefore
\[
\hat{b} = \arg\min_{f \in H_d} \frac{1}{2n} \sum_{i=1}^{n} (f(x_i) - y_i)^2 = \arg\min_{b} \|Xb - y\|^2 .
\]
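A small numerical sanity check of the identity above (illustrative names, assuming a design matrix whose rows are the feature vectors $x_i^\top$):

import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 4
X = rng.standard_normal((n, p))                   # rows are the feature vectors x_i
y = rng.standard_normal(n)
b = rng.standard_normal(p)

sum_of_squares = np.sum((X @ b - y) ** 2)         # sum_i (f(x_i) - y_i)^2
norm_squared = np.linalg.norm(X @ b - y) ** 2     # ||Xb - y||^2
print(np.isclose(sum_of_squares, norm_squared))   # True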
Q6. If $N > d$ and $X$ is full rank, show that $\hat{b} = (X^\top X)^{-1} X^\top y$. (Hint: you should take the gradient of the loss above with respect to $b$.) Why do we need to use the conditions $N > d$ and $X$ full rank?
From Q5, we showed that the ERM is $\hat{b} = \arg\min_b \|Xb - y\|^2$.
\[
\begin{aligned}
\|Xb - y\|^2 &= (Xb - y)^\top (Xb - y) \\
&= (b^\top X^\top - y^\top)(Xb - y) \\
&= b^\top X^\top X b - y^\top X b - b^\top X^\top y + y^\top y \\
&= b^\top X^\top X b - 2 b^\top X^\top y + y^\top y \qquad \text{(because $y^\top X b$ is a scalar equal to $b^\top X^\top y$, so the two middle terms combine)}
\end{aligned}
\]
To minimize this, we take the derivative with respect to $b$:
\[
\frac{d\,\|Xb - y\|^2}{db} = \frac{d}{db}\left( b^\top X^\top X b - 2 b^\top X^\top y + y^\top y \right) = 2 X^\top X b - 2 X^\top y = 0 ,
\]
so $2 X^\top X b = 2 X^\top y$, i.e. $b = (X^\top X)^{-1} X^\top y$.
The second derivative (the Hessian $2 X^\top X$) is positive semi-definite, so the obtained $b$ is a minimum. Therefore $\hat{b} = \arg\min_b \|Xb - y\|^2 = (X^\top X)^{-1} X^\top y$.
For $X^\top X$ to be invertible, we need the rank of $X^\top X$ to be $d$, because $X^\top X$ is a $d \times d$ matrix. To satisfy this we need both $X$ to be full rank and $N > d$: if $X$ is full rank then $\mathrm{rank}(X) = \min(N, d)$, so if $N < d$ the condition on $X^\top X$ would not be satisfied. (Note we use that $\mathrm{rank}(X) = \mathrm{rank}(X^\top X)$.)
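An optional numerical check of the closed form (illustrative, with a randomly generated full-rank design matrix), comparing against numpy's least-squares solver:

import numpy as np

rng = np.random.default_rng(1)
N, d = 50, 6
X = rng.standard_normal((N, d))              # N > d, almost surely full rank
y = rng.standard_normal(N)

b_closed_form = np.linalg.inv(X.T @ X) @ X.T @ y
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(b_closed_form, b_lstsq))   # True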
Q7. Write a function called least_squares_estimator taking as input a design matrix X and the corresponding vector y, returning $\hat{b}$. Your function should handle any value of $N$ and $d$, and in particular return an error if $N \le d$. (Drawing x at random from the uniform distribution makes it almost certain that any design matrix X with $d \ge 1$ we generate is full rank.)
import numpy as np

def least_squares_estimator(X, y):
    # X has N rows and d + 1 columns (powers x^0, ..., x^d), so N <= d means X.shape[0] <= X.shape[1] - 1
    if X.shape[0] <= X.shape[1] - 1:
        raise ValueError("Error: N <= d (number of features exceeds number of examples).")
    b = np.linalg.pinv(X.T @ X) @ X.T @ y
    return b
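A hypothetical usage example (the data-generation step here is an assumption about how x_train, y_train were produced; the coefficient values are the ones reported in Q9, and d = 5 mirrors Q9):

import numpy as np

rng = np.random.default_rng(2)
a = np.array([8.820, 2.001, 4.894])                       # true coefficients (a0, a1, a2), as reported in Q9
x_train = rng.uniform(0, 1, size=100)
y_train = a[0] + a[1] * x_train + a[2] * x_train**2       # noiseless y = g(x)

d = 5
X_train = np.vander(x_train, N=d + 1, increasing=True)    # columns are x^0, x^1, ..., x^d
b_hat = least_squares_estimator(X_train, y_train)
print(b_hat)   # first three entries should be close to a, the rest close to zero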
Q8. Recall the definition of the empirical risk $\hat{R}(f)$ on a sample $\{x_i, y_i\}_{i=1}^{n}$ for a prediction function $f$. Write a function empirical_risk to compute the empirical risk of $f_b$, taking as input a design matrix X, a vector y and the vector b parametrizing the predictor.
def empirical_risk(X, y, b):
    # Empirical risk with the (1/2) squared loss: (1 / 2n) * sum_i (x_i^T b - y_i)^2
    preds = X @ b
    n = len(y)
    emp_risk = np.sum((preds - y) ** 2) / (2 * n)
    return emp_risk
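Continuing the hypothetical example sketched after Q7 (it reuses X_train, y_train and b_hat from that sketch), the empirical risk of the fitted predictor should be essentially zero, since the data there were generated without noise:

print(empirical_risk(X_train, y_train, b_hat))   # ~0 up to numerical precision, since d = 5 >= 2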
Q9. Use your code to estimate $\hat{b}$ from x_train, y_train using d = 5. Compare $\hat{b}$ and $a$. Make a single plot (Plot 1) of the plane (x, y) displaying the points in the training set, the values of the true underlying function g(x) on [0, 1], and the values of the estimated function f on [0, 1]. Make sure to include a legend in your plot. The code to estimate this is attached at the end of the assignment.
Comparing the found $\hat{b}$ with the true $a$:
Vector $\hat{b}$: [8.820, 2.001, 4.894, −2.882e−06, 1.93e−06, −4.610e−07]
Vector $a$: [8.820, 2.001, 4.894]
We can see that there is essentially a perfect match, with the first three coefficients agreeing up to three decimal places and the subsequent coefficients in $\hat{b}$ taking very small values.
The required figure is Fig. 1. This might look slightly different depending on a, but you should show a perfect fit.
Figure 1: Scatter plot of train points along with true underlying function g and estimated function f (d-value 5) from x = 0 to x=1. We can see a perfect fit between f and g.
Q10. Now you can adjust d. What is the minimum value for which we get a "perfect fit"? How does this result relate to your conclusions on the approximation error above?
For d = 2 and above there is a perfect fit. This agrees with our conclusions from Q2 and Q3, where the approximation error for $H_2$ is zero and the approximation error for $H_d$ ($d > 2$) is at most that of $H_2$ (and hence is zero for all of those as well). Essentially, the true underlying function $g$ belongs to every hypothesis space from d = 2 onwards. A plot to confirm this programmatically would look something like Figure 2; a code sketch for this check follows the figure caption below.
Figure 2: Test for minimum d-value needed for perfect fit. We can see that there is a perfect fit i.e. zero empirical risk, for d ≥ 2
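A sketch of the kind of sweep Figure 2 describes (assuming the least_squares_estimator and empirical_risk functions defined above, and reusing the noiseless x_train, y_train from the earlier sketch; exact values will differ with your draw of a):

import numpy as np

# Hypothetical check: empirical risk of the fitted polynomial for each degree d
for d in range(1, 6):
    X_d = np.vander(x_train, N=d + 1, increasing=True)
    b_d = least_squares_estimator(X_d, y_train)
    print(d, empirical_risk(X_d, y_train, b_d))   # drops to ~0 from d = 2 onwards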
Q11. Plot $e_t$ and $e_g$ as a function of $N$ for $d$