COMP9417 – Machine Learning Tutorial: Regression I
Weekly Problem Set: Please submit questions 1b, 1c and 2g on Moodle by 12pm Tuesday 22nd Feb, 2022. Please only submit these requested questions and no others.
In this tutorial we will explore the theoretical foundations of linear regression. We first work through linear regression in 1 dimension (univariate linear regression), and then extend the analysis to multivariate linear regression. We also consider a numerical technique called gradient descent for solving optimisation problems. If you find yourself unsure about any of the math covered in this tutorial, we strongly recommend reading the corresponding material in the text (freely available online):
Mathematics for Machine Learning by Deisenroth, Faisal, and Ong. We will refer to this as the MML book throughout the course.
Question 1. (Univariate Least Squares)
A univariate linear regression model is a linear equation $y = w_0 + w_1 x$. Learning such a model requires fitting it to a sample of training data, $(x_1, y_1), \ldots, (x_n, y_n)$, so as to minimize the loss (usually the Mean Squared Error (MSE)),
$$L = \frac{1}{n} \sum_{i=1}^{n} \big( y_i - (w_0 + w_1 x_i) \big)^2.$$
To find the best parameters $w_0$ and $w_1$ that minimize this error function we need to find the error gradients $\frac{\partial L}{\partial w_0}$ and $\frac{\partial L}{\partial w_1}$. So we need to derive these expressions by taking partial derivatives, set them to zero, and solve for $w_0$ and $w_1$.
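A quick way to sanity-check the gradient expressions you derive is to compare them against finite-difference approximations. The sketch below is not part of the original question; the toy data and the helper functions are made up purely for illustration.

```python
import numpy as np

# Toy data, chosen only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

def mse(w0, w1):
    """Univariate MSE: L(w0, w1) = (1/n) * sum_i (y_i - (w0 + w1*x_i))^2."""
    return np.mean((y - (w0 + w1 * x)) ** 2)

def numerical_grad(w0, w1, eps=1e-6):
    """Central-difference approximation of (dL/dw0, dL/dw1)."""
    dw0 = (mse(w0 + eps, w1) - mse(w0 - eps, w1)) / (2 * eps)
    dw1 = (mse(w0, w1 + eps) - mse(w0, w1 - eps)) / (2 * eps)
    return dw0, dw1

# Whatever closed-form expressions you derive for dL/dw0 and dL/dw1
# should agree (up to rounding) with these numbers at any (w0, w1).
print(numerical_grad(0.5, 1.5))
```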
(a) Derive the least-squares estimates (minimizers of the MSE loss function) for the univariate linear
regression model.
(b) Show that the centroid of the data, i.e. the point $(\bar{x}, \bar{y})$, is always on the least squares regression line.
(c) To make sure you understand the process, try to solve the following loss function for linear regression with a version of “L2” regularization, in which we add a penalty that penalizes the size of $w_1$. Let $\lambda > 0$ and consider the regularised loss
$$L(w_0, w_1) = \frac{1}{n} \sum_{i=1}^{n} \big( y_i - (w_0 + w_1 x_i) \big)^2 + \lambda w_1^2.$$
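If you would like to check the closed-form solution you derive here, one option is to minimise the regularised loss numerically and compare. The sketch below is not part of the original question: the toy data, the value of lambda, and the use of scipy.optimize.minimize are all illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

# Toy data and regularisation strength, chosen only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])
lam = 0.5

def ridge_loss(w):
    """Regularised loss (1/n) * sum_i (y_i - (w0 + w1*x_i))^2 + lam * w1^2."""
    w0, w1 = w
    return np.mean((y - (w0 + w1 * x)) ** 2) + lam * w1 ** 2

# Minimise numerically; the minimiser should match your closed-form w0, w1.
res = minimize(ridge_loss, x0=np.zeros(2))
print(res.x)
```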
Question 2. (Multivariate Least Squares)
In the previous question, we found the least squares solution for the univariate (single feature) problem. We now generalise this for the case when we have $p$ features. Let $x_1, x_2, \ldots, x_n$ be $n$ feature vectors (e.g. corresponding to $n$ instances) in $\mathbb{R}^p$, that is:
$$x_i = (x_{i0}, x_{i1}, \ldots, x_{i,p-1})^T.$$
We stack these feature vectors into a single matrix, $X \in \mathbb{R}^{n \times p}$, called the design matrix. The convention is to stack the feature vectors so that each row of $X$ corresponds to a particular instance, that is:
$$X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{bmatrix} = \begin{bmatrix} x_{10} & x_{11} & \cdots & x_{1,p-1} \\ x_{20} & x_{21} & \cdots & x_{2,p-1} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n0} & x_{n1} & \cdots & x_{n,p-1} \end{bmatrix},$$
where the superscript $T$ denotes the transpose operation. Note that it is standard to take the first element of the feature vectors to be 1 to account for the bias term, so we will assume $x_{i0} = 1$ for all $i = 1, \ldots, n$. Analogously to the previous question, the goal is to learn a weight vector $w \in \mathbb{R}^p$, and make predictions:
$$\hat{y}_i = w^T x_i = w_0 + w_1 x_{i1} + w_2 x_{i2} + \cdots + w_{p-1} x_{i,p-1},$$
where $\hat{y}_i$ denotes the $i$-th predicted value. To solve for the optimal weights in $w$, we can use the same
procedure as before and use the MSE:
$$L(w_0, w_1, \ldots, w_{p-1}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{n} \sum_{i=1}^{n} \big( y_i - (w_0 + w_1 x_{i1} + w_2 x_{i2} + \cdots + w_{p-1} x_{i,p-1}) \big)^2.$$
One approach to solve this would be to take derivatives with respect to each of the p weights and solve the resulting equations, but this would be extremely tedious. The more efficient way to solve this problem is to appeal to matrix notation. We can write the above loss as:
$$L(w) = \frac{1}{n} \| y - Xw \|_2^2,$$
where $\| \cdot \|_2$ is the Euclidean norm. For the remainder of this question, we will assume that $X$ is a full-rank matrix, which means that we are able to compute the inverse of $X^T X$.
(a) Show that $L(w)$ has a critical point:
$$\hat{w} = (X^T X)^{-1} X^T y,$$
(note: critical point here means that $\big( \frac{dL(w)}{dw} \big)(\hat{w}) = 0$, i.e. the gradient evaluated at the critical point $\hat{w}$ is zero.)
Hint 1: if $u$ is a vector in $\mathbb{R}^n$, and $v$ is a fixed vector in $\mathbb{R}^n$, then $\frac{\partial v^T u}{\partial u} = v$.
Hint 2: if $A$ is a fixed $n \times n$ matrix, then $\frac{\partial u^T A u}{\partial u} = Au + A^T u$.
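Once you have worked through the algebra, it can be reassuring to confirm the formula numerically. The sketch below is not part of the original question; the random toy data and the finite-difference check are illustrative choices. It computes $\hat{w} = (X^T X)^{-1} X^T y$ and verifies that an approximate gradient of $L$ at $\hat{w}$ is essentially zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random toy design matrix (first column of ones for the bias) and targets.
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.normal(size=n)

def loss(w):
    """L(w) = (1/n) * ||y - Xw||_2^2."""
    r = y - X @ w
    return (r @ r) / n

# Closed-form critical point from part (a).
w_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Central-difference gradient of L at w_hat: every entry should be ~0.
eps = 1e-6
grad = np.array([
    (loss(w_hat + eps * e) - loss(w_hat - eps * e)) / (2 * eps)
    for e in np.eye(p)
])
print(grad)
```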
(b) The condition $\frac{\partial L}{\partial w}(\hat{w}) = 0$ is necessary but not sufficient to show that $\hat{w}$ is a (global) minimizer of $L$, since this point could be a local minimum or a saddle point. Show that the critical point in part (a) is indeed a global minimizer of $L$.
Hint 1: $L(w)$ is a function of $w \in \mathbb{R}^p$, and so its Hessian, $H$, is the $p \times p$ matrix of second order partial derivatives, that is, the $(k, l)$-th element of $H$ is
$$H_{kl} = \frac{\partial^2 L(w)}{\partial w_k \partial w_l}.$$
We will often write $H = \nabla_w^2 L(w)$, where $\nabla$ is the gradient operator, and $\nabla^2$ means taking the gradient twice. Note that the Hessian plays the role of the second derivative for multivariate functions.
Hint 2: a function is convex if its Hessian is positive semi-definite, which means that for any vector $u$,
$$u^T H u \geq 0.$$
Note also that this condition means that for any choice of u, the product term will always be non- negative.
Hint 3: Any critical point of a convex function is a global minimum.
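The positive semi-definiteness condition in Hint 2 is also easy to explore numerically: a symmetric matrix is positive semi-definite exactly when all of its eigenvalues are non-negative. The sketch below is only an illustration (the random matrix A is made up for this example) and does not answer the question for you; it simply shows how such a check can be done in numpy.

```python
import numpy as np

rng = np.random.default_rng(1)

# Matrices of the form A^T A are a convenient source of symmetric
# positive semi-definite examples.
A = rng.normal(size=(10, 4))
H = A.T @ A

# Numerical check: all eigenvalues of a PSD matrix are >= 0
# (up to floating-point error).
eigvals = np.linalg.eigvalsh(H)
print(eigvals, np.all(eigvals >= -1e-10))
```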
(c) In the next parts, we will use the formula derived above to verify our solution in the univariate case. We assume that p = 2, so that we have a two dimensional feature vector (one term for the intercept, and another for our feature). Write down the following values:
$x_i$, $y$, $w$, $X$, $X^T X$, $(X^T X)^{-1}$, $X^T y$.
(d) Compute the least-squares estimate for the p = 2 case using the results from the previous part.
(e) Consider the following problem: we have inputs $x_1, \ldots, x_5 = 3, 6, 7, 8, 11$ and outputs $y_1, \ldots, y_5 = 13, 8, 11, 2, 6$. Compute the least-squares solution and plot the results both by hand and using Python. Finally, use the sklearn implementation to check your results.
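For the computational part, one possible workflow is sketched below; it is only one way of doing it, and the plotting choices are arbitrary. It builds the design matrix for the given data, computes the closed-form least-squares solution with numpy, cross-checks it against sklearn's LinearRegression, and plots the fit.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

x = np.array([3.0, 6.0, 7.0, 8.0, 11.0])
y = np.array([13.0, 8.0, 11.0, 2.0, 6.0])

# Design matrix with a leading column of ones for the bias term.
X = np.column_stack([np.ones_like(x), x])

# Closed-form least-squares estimate: w_hat = (X^T X)^{-1} X^T y.
w_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print("closed form:", w_hat)

# Cross-check with sklearn, which handles the intercept internally,
# so it is given the raw feature only.
model = LinearRegression().fit(x.reshape(-1, 1), y)
print("sklearn:", model.intercept_, model.coef_)

# Plot the data and the fitted line.
grid = np.linspace(x.min(), x.max(), 100)
plt.scatter(x, y, label="data")
plt.plot(grid, w_hat[0] + w_hat[1] * grid, label="least squares fit")
plt.legend()
plt.show()
```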
(f) Advanced: Some of you may have been concerned by our assumption that $X$ is a full-rank matrix, which is an assumption made to ensure that $X^T X$ is an invertible matrix. If $X^T X$ were not invertible, then we would not have been able to compute the least squares solution. So what happens if $X$ is not full-rank? Does least squares fail? The answer is no: we can use something called the pseudo-inverse, also referred to as the Moore-Penrose inverse. This is outside the scope of this course, and the interested reader can refer to the following notes, or chapter 2 in MML. At a very high level, the pseudo-inverse is a matrix that acts like the inverse for non-invertible matrices. In numpy, we can easily compute the pseudo-inverse using the command np.linalg.pinv.
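As a quick illustration of this point, the sketch below (with a made-up, deliberately rank-deficient design matrix) uses np.linalg.pinv where the ordinary inverse would fail; when $X$ does have full rank, the same call agrees with the usual least-squares formula.

```python
import numpy as np

rng = np.random.default_rng(2)

# A rank-deficient design matrix: the third column duplicates the second,
# so X^T X is singular and np.linalg.inv(X.T @ X) would raise an error.
n = 20
x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x1])
y = 2.0 + 3.0 * x1 + rng.normal(scale=0.1, size=n)

# The pseudo-inverse still returns a least-squares solution.
w = np.linalg.pinv(X) @ y
print(w)
```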
(g) The mean-squared error (MSE) is
$$\mathrm{MSE}(w) = \frac{1}{n} \| y - Xw \|_2^2,$$
whereas the sum of squared errors (SSE) (also referred to as the residual sum of squares (RSS)) is
$$\mathrm{SSE}(w) = \| y - Xw \|_2^2.$$
Are the following statements True or False? Explain why.
(i) $\arg\min_{w \in \mathbb{R}^p} \mathrm{MSE}(w) = \arg\min_{w \in \mathbb{R}^p} \mathrm{SSE}(w)$
(ii) $\min_{w \in \mathbb{R}^p} \mathrm{MSE}(w) = \min_{w \in \mathbb{R}^p} \mathrm{SSE}(w)$
Notation: recall that $\min_x g(x)$ is the smallest value of $g(x)$, whereas $\arg\min_x g(x)$ is the value of $x$ that minimizes $g(x)$. So $\min_x (x - 2)^2 = 0$ but $\arg\min_x (x - 2)^2 = 2$.