COMP90051 StatML 2020S2 Q&A 03 – Solutions
August 13, 2021
Exercise 1: We derived the exact analytical solution to linear regression during the workshop. Can
you do the same for the ridge regression solution \hat{w} = (X^T X + \lambda I)^{-1} X^T y?
The objective function in linear regression training is
L(w) = \|y - Xw\|_2^2 .
Ridge regression adds a regularisation term to this objective:
L(w) = \|y - Xw\|_2^2 + \lambda \|w\|_2^2 .
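As a concrete reference, here is a minimal NumPy sketch of the two objectives (the function names linear_loss and ridge_loss, and the argument lam, are placeholders of my own, not from the workshop):

import numpy as np

def linear_loss(w, X, y):
    # Sum of squared residuals: ||y - Xw||_2^2
    r = y - X @ w
    return r @ r

def ridge_loss(w, X, y, lam):
    # Ridge objective: ||y - Xw||_2^2 + lam * ||w||_2^2
    return linear_loss(w, X, y) + lam * (w @ w)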
We can expand this as in the Week 3 workshop, collecting the two 'cross terms' multiplying y with Xw: these products are scalars and are therefore unchanged by transposition (see the Q&A session recording for more detail if you're unsure):
L(w) = w^T X^T X w - 2 w^T X^T y + y^T y + \lambda w^T w
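Spelling the expansion out in full, with the cross-term step made explicit:

\|y - Xw\|_2^2 + \lambda \|w\|_2^2
    = (y - Xw)^T (y - Xw) + \lambda w^T w
    = y^T y - y^T X w - w^T X^T y + w^T X^T X w + \lambda w^T w
    = w^T X^T X w - 2 w^T X^T y + y^T y + \lambda w^T w ,

where the two cross terms combine because y^T X w is a scalar, so y^T X w = (y^T X w)^T = w^T X^T y.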
Next, at an optimal w the gradient must be zero (this is the first-order necessary condition).
So let’s calculate the gradient:
\nabla_w L(w) = \nabla_w (w^T X^T X w) - \nabla_w (2 w^T X^T y) + \nabla_w (y^T y) + \nabla_w (\lambda w^T w)
             = 2 X^T X w - 2 X^T y + 0 + 2 \lambda w .
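For reference, the term-by-term gradients follow from the standard vector-calculus identities

\nabla_w (w^T A w) = (A + A^T) w = 2 A w \quad \text{for symmetric } A , \qquad \nabla_w (w^T b) = b ,

applied with A = X^T X (which is symmetric) and b = X^T y; the \lambda w^T w term is the first identity with A = \lambda I.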
At an optimal w, we should have gradient zero:
\nabla_w L(w) = 0 .
Plugging in our calculated expression for \nabla_w L(w) and trying to isolate w:

2 X^T X w - 2 X^T y + 2 \lambda w = 0
X^T X w + \lambda I w = X^T y
(X^T X + \lambda I) w = X^T y .
By inserting the identity matrix before w (Iw = w) in the second equality, we’re able to ‘factor’
out the w in the third equality. We can then solve for w:
w = (X^T X + \lambda I)^{-1} X^T y .
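As a quick numerical sanity check (a small NumPy sketch on synthetic data; the sizes and the value of lam are arbitrary choices of mine), the closed-form solution should make the gradient computed above vanish:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))       # 50 instances, 5 features
y = rng.standard_normal(50)
lam = 0.1

# Closed-form ridge solution w = (X^T X + lam I)^{-1} X^T y,
# computed with a linear solve rather than an explicit inverse.
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

# Gradient of the ridge objective at w_hat: 2 X^T X w - 2 X^T y + 2 lam w
grad = 2 * X.T @ (X @ w_hat) - 2 * X.T @ y + 2 * lam * w_hat
print(np.allclose(grad, 0.0))          # expected: True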
Compared with the linear regression solution, ridge regression adds a positive value (\lambda > 0) to each diagonal entry of X^T X. This is the same as adding \lambda to each eigenvalue of the Gram matrix X^T X, which guarantees that the matrix becomes full-rank and invertible (if it isn't already).
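To see this last point numerically (a small sketch of my own; the duplicated column is constructed deliberately so that X^T X is singular):

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 3))
X = np.hstack([A, A[:, :1]])          # duplicate a column, so X^T X is rank-deficient
lam = 0.5

G = X.T @ X                           # Gram matrix, singular here
eigvals = np.linalg.eigvalsh(G)
eigvals_ridge = np.linalg.eigvalsh(G + lam * np.eye(4))

print(eigvals)                                      # smallest eigenvalue is (numerically) zero
print(np.allclose(eigvals_ridge, eigvals + lam))    # expected: True — every eigenvalue shifted up by lam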