
COMP9417 – Machine Learning Tutorial: Linear Regression
Question 1. (Least Squares Regression)
A univariate linear regression model is a linear equation $y = w_0 + w_1 x$. Learning such a model requires fitting it to a sample of training data, $(x_1, y_1), \ldots, (x_n, y_n)$, so as to minimize the loss, usually the Mean Squared Error (MSE), which is also referred to as the mean sum of squares loss,
$$L = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - (w_0 + w_1 x_i)\right)^2.$$
To find the best parameters $w_0$ and $w_1$ that minimize this error function we need to find the error gradients $\frac{\partial L}{\partial w_0}$ and $\frac{\partial L}{\partial w_1}$. So we need to derive these expressions by taking partial derivatives, set them to zero, and solve for $w_0$ and $w_1$.
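For concreteness, here is a tiny NumPy sketch (the data and the candidate line are hypothetical, purely illustrative) that evaluates this MSE loss for one choice of $(w_0, w_1)$:

import numpy as np

# hypothetical toy data and a candidate line, just to make the loss concrete
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 5.9, 8.2])
w0, w1 = 0.0, 2.0

# mean squared error L = (1/n) * sum_i (y_i - (w0 + w1 * x_i))^2
L = np.mean((y - (w0 + w1 * x)) ** 2)
print(L)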
(a) Derive the least-squares estimates (minimizers of the MSE loss function) for the univariate linear regression model.
First we write the loss function for the univariate linear regression $y = w_0 + w_1 x$ as
$$L = L(w_0, w_1) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - (w_0 + w_1 x_i)\right)^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - w_0 - w_1 x_i\right)^2.$$
At a minimum of $L$, the partial derivatives with respect to $w_0, w_1$ should be zero. We will start with taking the partial derivative of $L$ with respect to $w_0$:
$$\begin{aligned}
\frac{\partial L}{\partial w_0} &= \frac{\partial}{\partial w_0}\,\frac{1}{n}\sum_{i=1}^{n}(y_i - w_0 - w_1 x_i)^2 \\
&= \frac{1}{n}\sum_{i=1}^{n} \frac{\partial}{\partial w_0}(y_i - w_0 - w_1 x_i)^2 \\
&= \frac{1}{n}\sum_{i=1}^{n} -2\,(y_i - w_0 - w_1 x_i) \\
&= -2\left(\frac{1}{n}\sum_{i=1}^{n} y_i - w_0\,\frac{1}{n}\sum_{i=1}^{n} 1 - w_1\,\frac{1}{n}\sum_{i=1}^{n} x_i\right) \\
&= -2\left[\bar{y} - w_0 - w_1\bar{x}\right],
\end{aligned}$$
where we have introduced the notation $\bar{f}$ to mean the sample average of $f$, i.e. $\bar{f} = \frac{1}{m}\sum_{j=1}^{m} f_j$, where $m$ is the length of $f$. Now, we equate this to zero and solve for $w_0$ to get
$$-2\left[\bar{y} - w_0 - w_1\bar{x}\right] = 0 \implies w_0 = \bar{y} - w_1\bar{x}.$$
Note, we have not actually solved for $w_0$ yet, since our expression depends on $w_1$, which we must also optimise over. Taking the partial derivative of $L$ with respect to $w_1$:
$$\begin{aligned}
\frac{\partial L}{\partial w_1} &= \frac{\partial}{\partial w_1}\,\frac{1}{n}\sum_{i=1}^{n}(y_i - w_0 - w_1 x_i)^2 \\
&= \frac{1}{n}\sum_{i=1}^{n} -2\,x_i (y_i - w_0 - w_1 x_i) \\
&= -2\left(\frac{1}{n}\sum_{i=1}^{n} x_i y_i - w_0\,\frac{1}{n}\sum_{i=1}^{n} x_i - w_1\,\frac{1}{n}\sum_{i=1}^{n} x_i^2\right) \\
&= -2\left[\overline{xy} - w_0\bar{x} - w_1\overline{x^2}\right].
\end{aligned}$$
Now, we equate this to zero and solve for $w_1$ to get
$$-2\left[\overline{xy} - w_0\bar{x} - w_1\overline{x^2}\right] = 0 \implies w_1 = \frac{\overline{xy} - w_0\bar{x}}{\overline{x^2}}.$$
We now have an expression for $w_0$ in terms of $w_1$, and an expression for $w_1$ in terms of $w_0$. These are known as the Normal Equations. In order to get an explicit solution for $w_0, w_1$, we can plug $w_0$ into $w_1$ and solve:
$$\begin{aligned}
w_1 &= \frac{\overline{xy} - w_0\bar{x}}{\overline{x^2}} \\
&= \frac{\overline{xy} - (\bar{y} - w_1\bar{x})\bar{x}}{\overline{x^2}} \\
&= \frac{\overline{xy} - \bar{x}\,\bar{y} + w_1\bar{x}^2}{\overline{x^2}}.
\end{aligned}$$
Rearranging and solving for $w_1$ gives us
$$w_1 = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - \bar{x}^2}.$$
So now we have an explicit solution for the regression parameters $w_0$ and $w_1$, and so we are done.
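To make the closed-form solution concrete, here is a minimal NumPy sketch (the toy data is hypothetical) that computes $w_0$ and $w_1$ directly from these sample averages:

import numpy as np

# hypothetical toy data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# sample averages appearing in the closed-form solution
x_bar, y_bar = x.mean(), y.mean()
xy_bar, x2_bar = (x * y).mean(), (x ** 2).mean()

# explicit least-squares estimates derived above
w1 = (xy_bar - x_bar * y_bar) / (x2_bar - x_bar ** 2)
w0 = y_bar - w1 * x_bar
print(w0, w1)

For comparison, np.polyfit(x, y, 1) should return the same two numbers, in the order $(w_1, w_0)$.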
(b) To make sure you understand the process, try to solve the following loss function for linear regression with a version of "L2" regularization, in which we add a penalty that penalizes the size of $w_1$. Let $\lambda > 0$ and consider the regularised loss
$$L(w_0, w_1) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - (w_0 + w_1 x_i)\right)^2 + \lambda w_1^2.$$
Repeating the steps above for the new problem yields
$$w_0 = \bar{y} - w_1\bar{x}, \qquad w_1 = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - \bar{x}^2 + \lambda}.$$
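As a sanity check on this regularised solution (a sketch, assuming the loss is exactly as written above, i.e. the bias $w_0$ is not penalised), we can compare the closed form against a direct numerical minimisation; the toy data and $\lambda$ below are hypothetical:

import numpy as np
from scipy.optimize import minimize

# hypothetical toy data and penalty strength, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
lam = 0.5

# closed-form regularised solution derived above
x_bar, y_bar = x.mean(), y.mean()
w1 = ((x * y).mean() - x_bar * y_bar) / ((x ** 2).mean() - x_bar ** 2 + lam)
w0 = y_bar - w1 * x_bar

# numerical minimisation of the same loss as a sanity check
loss = lambda w: np.mean((y - (w[0] + w[1] * x)) ** 2) + lam * w[1] ** 2
w_num = minimize(loss, x0=np.zeros(2)).x

print(np.round([w0, w1], 4), np.round(w_num, 4))  # the two results should agree closely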
Question 2. (Multivariate Least Squares)
In the previous question, we found the least squares solution for the univariate (single feature) problem. We now generalise this for the case when we have $p$ features. Let $x_1, x_2, \ldots, x_n$ be $n$ feature vectors (e.g. corresponding to $n$ instances) in $\mathbb{R}^p$, that is:
$$x_i = \begin{pmatrix} x_{i0} \\ x_{i1} \\ \vdots \\ x_{i,p-1} \end{pmatrix}.$$
We stack these feature vectors into a single matrix, $X \in \mathbb{R}^{n \times p}$, called the design matrix. The convention is to stack the feature vectors so that each row of $X$ corresponds to a particular instance, that is:
$$X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{pmatrix} = \begin{pmatrix} x_{10} & x_{11} & \cdots & x_{1,p-1} \\ x_{20} & x_{21} & \cdots & x_{2,p-1} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n0} & x_{n1} & \cdots & x_{n,p-1} \end{pmatrix},$$
where the superscript $T$ denotes the transpose operation. Note that it is standard to take the first element of the feature vectors to be 1 to account for the bias term, so we will assume $x_{i0} = 1$ for all $i = 1, \ldots, n$. Analogously to the previous question, the goal is to learn a weight vector $w \in \mathbb{R}^p$, and make predictions:
$$\hat{y}_i = w^T x_i = w_0 + w_1 x_{i1} + w_2 x_{i2} + \cdots + w_{p-1} x_{i,p-1},$$
where $\hat{y}_i$ denotes the $i$-th predicted value. To solve for the optimal weights in $w$, we can use the same procedure as before and use the MSE:
$$L(w_0, w_1, \ldots, w_{p-1}) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - (w_0 + w_1 x_{i1} + w_2 x_{i2} + \cdots + w_{p-1} x_{i,p-1})\right)^2.$$
One approach to solve this would be to take derivatives with respect to each of the p weights and solve the resulting equations, but this would be extremely tedious. The more efficient way to solve this problem is to appeal to matrix notation. We can write the above loss as:
$$L(w) = \frac{1}{n}\,\|y - Xw\|_2^2,$$
where $\|\cdot\|_2$ is the Euclidean norm. For the remainder of this question, we will assume that $X$ is a full-rank matrix, which means that we are able to compute the inverse of $X^T X$.
(a) Show that $L(w)$ is minimised by the least-squares estimator:
$$\hat{w} = (X^T X)^{-1} X^T y.$$
Hint 1: if $u$ is a vector in $\mathbb{R}^n$, and $v$ is a fixed vector in $\mathbb{R}^n$, then $\frac{\partial v^T u}{\partial u} = v$.
Hint 2: if $A$ is a fixed $n \times n$ matrix, and if $f = u^T A u$, then $\frac{\partial u^T A u}{\partial u} = A u + A^T u$.
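If you want to convince yourself of these two identities, here is a short finite-difference check on hypothetical random inputs (just a sketch, not part of the required solution):

import numpy as np

# hypothetical random inputs, for illustration only
rng = np.random.default_rng(1)
n = 4
u = rng.normal(size=n)
v = rng.normal(size=n)
A = rng.normal(size=(n, n))

def num_grad(f, u, eps=1e-6):
    # central finite-difference approximation of the gradient of f at u
    g = np.zeros_like(u)
    for i in range(len(u)):
        e = np.zeros_like(u)
        e[i] = eps
        g[i] = (f(u + e) - f(u - e)) / (2 * eps)
    return g

print(np.allclose(num_grad(lambda z: v @ z, u), v))                     # d(v^T u)/du = v
print(np.allclose(num_grad(lambda z: z @ A @ z, u), A @ u + A.T @ u))   # d(u^T A u)/du = Au + A^T u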
$$\|y - Xw\|_2^2 = (y - Xw)^T (y - Xw) = y^T y - 2\,y^T X w + w^T X^T X w.$$
To minimise the above, we take derivatives with respect to $w$ and set equal to zero as follows:
$$\frac{\partial}{\partial w}\left(y^T y - 2\,y^T X w + w^T X^T X w\right) = -2 X^T y + 2 X^T X w \overset{\text{set}}{=} 0.$$
Solving for $w$ in the above yields:
$$2 X^T X w = 2 X^T y \implies \hat{w} = (X^T X)^{-1} X^T y.$$
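As an aside (beyond what the question asks), the explicit inverse is rarely formed in practice; the sketch below, on hypothetical random data, checks that np.linalg.solve and np.linalg.lstsq recover the same estimator as the formula:

import numpy as np

# hypothetical random design matrix (bias column + 2 features) and targets
rng = np.random.default_rng(0)
X = np.column_stack((np.ones(20), rng.normal(size=(20, 2))))
y = rng.normal(size=20)

# explicit least-squares formula from the derivation
w_explicit = np.linalg.inv(X.T @ X) @ (X.T @ y)

# alternatives that solve the normal equations / least-squares problem directly
w_solve = np.linalg.solve(X.T @ X, X.T @ y)
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_explicit, w_solve), np.allclose(w_explicit, w_lstsq))  # True True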
(b) In the next parts, we will use the formula derived above to verify our solution in the univariate case. We assume that $p = 2$, so that we have a two-dimensional feature vector (one term for the bias, and another for our feature). Write down the following values:
$$x_i, \quad y, \quad w, \quad X, \quad X^T X, \quad (X^T X)^{-1}, \quad X^T y.$$
We have the following results:
$$x_i = \begin{pmatrix} 1 \\ x_{i1} \end{pmatrix} \in \mathbb{R}^2, \qquad y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} \in \mathbb{R}^n, \qquad w = \begin{pmatrix} w_0 \\ w_1 \end{pmatrix} \in \mathbb{R}^2,$$
$$X = \begin{pmatrix} 1 & x_{11} \\ 1 & x_{21} \\ \vdots & \vdots \\ 1 & x_{n1} \end{pmatrix} \in \mathbb{R}^{n \times 2}, \qquad X^T X = \begin{pmatrix} n & n\bar{x} \\ n\bar{x} & n\overline{x^2} \end{pmatrix} \in \mathbb{R}^{2 \times 2}.$$
Finally, we have
$$(X^T X)^{-1} = \frac{1}{n\left(\overline{x^2} - \bar{x}^2\right)} \begin{pmatrix} \overline{x^2} & -\bar{x} \\ -\bar{x} & 1 \end{pmatrix}, \qquad X^T y = \begin{pmatrix} n\bar{y} \\ n\overline{xy} \end{pmatrix}.$$
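A quick numerical check of the $X^T X$ expression, on a hypothetical feature vector (any numbers would do):

import numpy as np

# hypothetical feature values, for illustration only
x = np.array([3.0, 6.0, 7.0, 8.0, 11.0])
n = x.shape[0]
X = np.column_stack((np.ones(n), x))  # p = 2: bias column plus the feature

# X^T X should equal n * [[1, x_bar], [x_bar, x2_bar]]
XTX = X.T @ X
expected = n * np.array([[1.0, x.mean()], [x.mean(), (x ** 2).mean()]])
print(np.allclose(XTX, expected))  # True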
(c) Compute the least-squares estimate for the $p = 2$ case using the results from the previous part.
Putting the results from parts (a) and (b) together:
$$\hat{w} = (X^T X)^{-1} X^T y = \frac{1}{n\left(\overline{x^2} - \bar{x}^2\right)} \begin{pmatrix} \overline{x^2} & -\bar{x} \\ -\bar{x} & 1 \end{pmatrix} \begin{pmatrix} n\bar{y} \\ n\overline{xy} \end{pmatrix} = \begin{pmatrix} \bar{y} - \hat{w}_1\bar{x} \\[6pt] \dfrac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - \bar{x}^2} \end{pmatrix},$$
which (after simplifying the first entry) is exactly the same as the brute-force approach in the previous question.
(d) Consider the following problem: we have inputs $x_1, \ldots, x_5 = 3, 6, 7, 8, 11$ and outputs $y_1, \ldots, y_5 = 13, 8, 11, 2, 6$. Compute the least-squares solution and plot the results both by hand and using Python. Finally, use the sklearn implementation to check your results.
This should be straightforward; we can use the following Python code:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model

y = np.array([13, 8, 11, 2, 6])
x = np.array([3, 6, 7, 8, 11])
n = x.shape[0]
X = np.stack((np.ones(n), x), axis=1)

# compute the least-squares solution via the normal equations
XTX = X.T @ X
XTXinv = np.linalg.inv(XTX)
XTy = X.T @ y
least_squares_estimate = XTXinv @ XTy
print(least_squares_estimate)          # approximately [15. -1.], i.e. y = 15 - x

# sklearn comparison
model = linear_model.LinearRegression()
model.fit(x.reshape(-1, 1), y)
print(model.intercept_, model.coef_)   # should match: 15.0 and [-1.]

xx = np.linspace(0, 13, 1000)
plt.scatter(x, y, color='red')
plt.plot(xx, 15. - xx, color='blue', label='$y = 15 - x$')
plt.legend()
plt.xlabel('x')
plt.ylabel('y')
plt.savefig("LSLine.png")
plt.show()
We should end up with the following plot (saved as LSLine.png): the five data points scattered around the fitted line $y = 15 - x$.
(e) Advanced: Some of you may have been concerned by our assumption that $X$ is a full-rank matrix, which is an assumption made to ensure that $X^T X$ is an invertible matrix. If $X^T X$ were not invertible, then we would not have been able to compute the least squares solution. So what happens if $X$ is not full-rank? Does least squares fail? The answer is no: we can use something called the pseudo-inverse, also referred to as the Moore–Penrose inverse. This is outside the scope of this course, and the interested reader can refer to the following notes. At a very high level, the pseudo-inverse is a matrix that acts like the inverse for non-invertible matrices. In numpy, we can easily compute the pseudo-inverse using the command np.linalg.pinv.
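A minimal sketch of this idea, using a hypothetical rank-deficient design matrix (a duplicated column makes $X^T X$ singular), where the ordinary inverse fails but the pseudo-inverse still gives a least-squares solution:

import numpy as np

# hypothetical rank-deficient design: the third column duplicates the second
X = np.column_stack((np.ones(5), np.arange(5.0), np.arange(5.0)))
y = np.array([1.0, 2.0, 2.5, 3.5, 5.0])

# X.T @ X is singular here, so np.linalg.inv would fail;
# the pseudo-inverse still yields a least-squares solution
w = np.linalg.pinv(X) @ y
print(w)
print(np.linalg.matrix_rank(X.T @ X))  # 2 < 3, confirming X is not full-rank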
