COMP90051 StatML 2020S2 Q&A 02 – Solutions
August 6, 2021
Exercise 1: A dataset has two instances with one feature,
\[
X = \begin{bmatrix} 2 \\ 4 \end{bmatrix}, \qquad y = \begin{bmatrix} 1 \\ 2 \end{bmatrix}.
\]
We want to fit a linear regression model on this dataset. Please use the normal equation of linear
regression to find $w$ and $b$.
This question showcases some basic operations on matrices and vectors, such as matrix-matrix and
matrix-vector multiplication and transposes. It also makes more concrete what one does once they
have the normal equations: those come from the analytical solution of linear regression training
(formulated either as MLE or as decision-theoretically optimising the sum of squared errors). To get
some more intuition, you might like to plot the $x$'s and $y$'s here; visually you might expect the
answer $w = 0.5$, $b = 0$, although in general, even for 1D data, it isn't possible to read off the
solution visually.
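If you would like to try the plot yourself, here is a minimal Python sketch (an illustrative addition, not part of the original solution; the variable names are our own):

```python
import numpy as np
import matplotlib.pyplot as plt

# The two training instances (one feature each) and their responses.
x = np.array([2.0, 4.0])
y = np.array([1.0, 2.0])

# Plot the data together with the line y = 0.5 x + 0 we expect visually.
grid = np.linspace(0.0, 5.0, 100)
plt.scatter(x, y, label="data")
plt.plot(grid, 0.5 * grid, label="y = 0.5 x")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()
```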
First, we can add a dummy feature to obtain
\[
X = \begin{bmatrix} 1 & 2 \\ 1 & 4 \end{bmatrix},
\]
absorbing the bias term into the weight vector $\mathbf{w} = \begin{bmatrix} b \\ w \end{bmatrix}$. Then, we can directly find the optimal value of $\hat{\mathbf{w}}$ by solving the normal equation.
\begin{align*}
\hat{\mathbf{w}} &= (X^\top X)^{-1} X^\top \mathbf{y} \\
&= \left( \begin{bmatrix} 1 & 2 \\ 1 & 4 \end{bmatrix}^\top \begin{bmatrix} 1 & 2 \\ 1 & 4 \end{bmatrix} \right)^{-1} \begin{bmatrix} 1 & 2 \\ 1 & 4 \end{bmatrix}^\top \begin{bmatrix} 1 \\ 2 \end{bmatrix} \\
&= \left( \begin{bmatrix} 1 & 1 \\ 2 & 4 \end{bmatrix} \begin{bmatrix} 1 & 2 \\ 1 & 4 \end{bmatrix} \right)^{-1} \begin{bmatrix} 1 & 1 \\ 2 & 4 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \end{bmatrix} \\
&= \begin{bmatrix} 2 & 6 \\ 6 & 20 \end{bmatrix}^{-1} \begin{bmatrix} 3 \\ 10 \end{bmatrix} \\
&= \frac{1}{2 \times 20 - 6 \times 6} \begin{bmatrix} 20 & -6 \\ -6 & 2 \end{bmatrix} \begin{bmatrix} 3 \\ 10 \end{bmatrix} \\
&= \begin{bmatrix} 5 & -3/2 \\ -3/2 & 1/2 \end{bmatrix} \begin{bmatrix} 3 \\ 10 \end{bmatrix} \\
&= \begin{bmatrix} 0 \\ 1/2 \end{bmatrix}
\end{align*}
where we have used a formula for inverting $2 \times 2$ matrices, which we don't expect students to know
or remember.
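To double-check the arithmetic, here is a minimal NumPy sketch (an illustrative addition, not part of the original solution) that solves the same normal equation numerically:

```python
import numpy as np

# Design matrix with a dummy all-ones column to absorb the bias b.
X = np.array([[1.0, 2.0],
              [1.0, 4.0]])
y = np.array([1.0, 2.0])

# Normal equation: w_hat = (X^T X)^{-1} X^T y.
# Solving the linear system is preferred over explicitly inverting X^T X.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)  # [0.  0.5], i.e. b = 0 and w = 1/2
```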
Exercise 2: Show that Newton-Raphson for linear regression gives you the normal equations.
Linear regression training has us minimise the sum of squared errors. We can include a 0.5 factor
to make the derivative more convenient (cancelling the 2 that we’d otherwise obtain):
\[
L(\mathbf{w}) = \frac{1}{2} \left\| X \mathbf{w} - \mathbf{y} \right\|_2^2 .
\]
Newton-Raphson makes use of both the gradient and Hessian of this objective function, so let’s
start by calculating those. The first-order derivatives (gradient):
\[
\nabla L(\mathbf{w}) = X^\top X \mathbf{w} - X^\top \mathbf{y}
\]
The second-order derivatives (Hessian):
\[
\nabla^2 L(\mathbf{w}) = X^\top X
\]
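As a sanity check on these derivatives (again an illustrative addition, not part of the original solution), the closed-form gradient can be compared against a central finite-difference approximation on the toy data from Exercise 1:

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [1.0, 4.0]])
y = np.array([1.0, 2.0])

def loss(w):
    # L(w) = 0.5 * ||Xw - y||_2^2
    r = X @ w - y
    return 0.5 * r @ r

def grad(w):
    # Closed form: X^T X w - X^T y
    return X.T @ X @ w - X.T @ y

# Central finite-difference gradient at an arbitrary point.
w = np.array([0.3, -0.7])
eps = 1e-6
fd = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
               for e in np.eye(2)])
print(np.allclose(fd, grad(w)))  # True
# The Hessian is the constant matrix X^T X (it does not depend on w).
```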
To apply the Newton-Raphson method, we set an initial value $\mathbf{w}_0$ and then iteratively update it.
Let's not worry for now about what initial value we choose (in practice it could be a random value).
The update from $\mathbf{w}_t$ to $\mathbf{w}_{t+1}$ is given by:
\[
\mathbf{w}_{t+1} = \mathbf{w}_t - \left( \nabla^2 L(\mathbf{w}_t) \right)^{-1} \nabla L(\mathbf{w}_t)
\]
Plugging in the expressions for $\nabla^2 L(\mathbf{w}_t)$ and $\nabla L(\mathbf{w}_t)$ and simplifying:
\begin{align*}
\mathbf{w}_{t+1} &= \mathbf{w}_t - (X^\top X)^{-1} (X^\top X \mathbf{w}_t - X^\top \mathbf{y}) \\
&= \mathbf{w}_t - (X^\top X)^{-1} X^\top X \mathbf{w}_t + (X^\top X)^{-1} X^\top \mathbf{y} \\
&= \mathbf{w}_t - I \mathbf{w}_t + (X^\top X)^{-1} X^\top \mathbf{y} \\
&= \mathbf{w}_t - \mathbf{w}_t + (X^\top X)^{-1} X^\top \mathbf{y} \\
&= (X^\top X)^{-1} X^\top \mathbf{y}
\end{align*}
That is, the first and all subsequent iterates are given by the normal equation of linear regression!
Moreover, $\mathbf{w}_{t+1}$ does not depend on $\mathbf{w}_t$: no matter what the initial value $\mathbf{w}_0$ is,
the Newton-Raphson method converges in one step to the usual linear regression solution.
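We can see this one-step convergence numerically. Here is a minimal NumPy sketch (an illustrative addition, not part of the original solution) applying one Newton-Raphson step on the toy data from Exercise 1, starting from a random point:

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [1.0, 4.0]])
y = np.array([1.0, 2.0])

def newton_step(w):
    # One Newton-Raphson step: w - (Hessian)^{-1} gradient.
    grad = X.T @ X @ w - X.T @ y
    hess = X.T @ X
    return w - np.linalg.solve(hess, grad)

w0 = np.random.randn(2)   # arbitrary starting point
w1 = newton_step(w0)
print(w1)                 # [0.  0.5] -- the normal-equation solution
print(newton_step(w1))    # unchanged: converged after a single step
```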
Hat tip to Cameron for also outlining this solution on Piazza!