
COMP90051 StatML 2020S2 Q&A 03 – Solutions

August 13, 2021

Exercise 1: We derived the exact analytical solution to linear regression during the workshop. Can you do the same for the ridge regression solution ŵ = (XTX + λI)−1XTy?

The objective function in linear regression training is

L(w) = ‖y − Xw‖₂² .

Ridge regression adds a regularisation term to this objective:

L(w) = ‖y − Xw‖₂² + λ‖w‖₂² .

We can expand this as we did in the Week 3 workshop, collecting the ‘cross terms’ that multiply y with Xw, since these products are scalars and are therefore unchanged by transposes (see the Q&A session recording for more detail if you’re unsure):

L(w) = wTXTXw − 2wTXTy + yTy + λwTw
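
As a quick sanity check, here is a minimal NumPy sketch (the design matrix X, targets y, weight vector w and λ = 0.1 below are all invented for illustration) confirming that this expanded quadratic form equals the original penalised sum of squared errors:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))   # hypothetical design matrix: 20 samples, 3 features
    y = rng.normal(size=20)        # hypothetical targets
    w = rng.normal(size=3)         # an arbitrary weight vector
    lam = 0.1                      # regularisation strength lambda

    # Ridge objective written with norms: ||y - Xw||_2^2 + lambda * ||w||_2^2
    loss_norm_form = np.sum((y - X @ w) ** 2) + lam * np.sum(w ** 2)

    # Expanded quadratic form: w'X'Xw - 2 w'X'y + y'y + lambda * w'w
    loss_expanded = w @ X.T @ X @ w - 2 * w @ X.T @ y + y @ y + lam * w @ w

    print(np.isclose(loss_norm_form, loss_expanded))  # True
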
Next, at an optimal w the gradient must be zero (this is the first-order necessary condition).

So let’s calculate the gradient:

∇wL(w) = ∇w(wTXTXw) − ∇w(2wTXTy) + ∇w(yTy) + ∇w(λwTw)
= 2XTXw − 2XTy + 0 + 2λw .
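
To gain confidence in this gradient expression, here is a hedged sketch comparing it against a central finite-difference approximation (X, y, w and λ are again invented toy values):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(20, 3))   # hypothetical design matrix
    y = rng.normal(size=20)        # hypothetical targets
    w = rng.normal(size=3)         # point at which to check the gradient
    lam = 0.1

    def ridge_loss(w):
        # Ridge objective ||y - Xw||_2^2 + lambda * ||w||_2^2
        return np.sum((y - X @ w) ** 2) + lam * np.sum(w ** 2)

    # Analytic gradient derived above: 2 X'Xw - 2 X'y + 2 lambda w
    grad_analytic = 2 * X.T @ X @ w - 2 * X.T @ y + 2 * lam * w

    # Central finite differences, one coordinate direction at a time
    eps = 1e-6
    grad_numeric = np.array([
        (ridge_loss(w + eps * e) - ridge_loss(w - eps * e)) / (2 * eps)
        for e in np.eye(3)
    ])

    print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))  # True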

At an optimal w, we should have gradient zero:

∇wL(w) = 0 .

Plugging in our calculated expression for ∇wL(w) and trying to isolate w:

2XTXw − 2XTy + 2λw = 0
XTXw + λIw = XTy
(XTX + λI)w = XTy .

By inserting the identity matrix before w (Iw = w) in the second equality, we’re able to ‘factor’
out the w in the third equality. We can then solve for w:

w = (XTX + λI)−1XTy .
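
In code one would typically avoid forming the inverse explicitly and solve the linear system instead. Here is a minimal NumPy sketch (again with invented toy data) that computes ŵ this way and checks the first-order condition:

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(20, 3))   # hypothetical design matrix
    y = rng.normal(size=20)        # hypothetical targets
    lam = 0.1

    # Solve (X'X + lambda I) w = X'y rather than inverting the matrix
    A = X.T @ X + lam * np.eye(X.shape[1])
    w_hat = np.linalg.solve(A, X.T @ y)

    # The ridge gradient should vanish (numerically) at w_hat
    grad_at_w_hat = 2 * X.T @ X @ w_hat - 2 * X.T @ y + 2 * lam * w_hat
    print(np.allclose(grad_at_w_hat, 0))  # True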

Compared with the linear regression solution, ridge regression adds a positive value (λ > 0)
to each diagonal entry of XTX. This is the same as adding λ to every eigenvalue of the Gram matrix
XTX, which guarantees that the matrix is full-rank and invertible (if it isn’t already).
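
This eigenvalue claim is easy to check numerically. The sketch below builds a deliberately rank-deficient (hypothetical) X, so that XTX is singular, and verifies that adding λI shifts every eigenvalue up by λ:

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.normal(size=(20, 3))
    X[:, 2] = X[:, 0] + X[:, 1]    # linearly dependent columns: X'X is singular
    lam = 0.1

    gram = X.T @ X
    eig_plain = np.linalg.eigvalsh(gram)                    # smallest eigenvalue is ~0
    eig_ridge = np.linalg.eigvalsh(gram + lam * np.eye(3))  # eigenvalues of the ridge matrix

    print(np.isclose(eig_plain.min(), 0))            # True: X'X alone is not invertible
    print(np.allclose(eig_ridge, eig_plain + lam))   # True: ridge adds lambda to each eigenvalue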
