Problem Set: Regression SOLUTIONS
1. (a) The Multinomial Theorem tells us how many terms there are in a degree-k polynomial expansion in p variables (see: https://en.wikipedia.org/wiki/Multinomial_theorem):
C(k + p − 1, p − 1) = (k + p − 1)! / (k! (p − 1)!)
Here k = 3, p = 6, therefore :
#terms = (3 + 6 − 1)! / (3! 5!) = 56
So our problem is underdetermined, and therefore not unique.
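As a quick numerical sanity check of this count (illustrative only, not part of the marked solution; Python's standard-library math.comb is assumed available):

```python
from math import comb

# Number of degree-3 monomials in 6 variables: C(k + p - 1, p - 1)
print(comb(3 + 6 - 1, 6 - 1))   # 56
```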
VERSION 1:
Explanation:
We have m = 56 parameters to learn, but only n = 50 data points (equations). The OLS objective, L, is given by:
L = (1/2) ‖y − Xw‖², where X is the design matrix.
Note: X ∈ R^{n×m}, X^T X ∈ R^{m×m}.
Therefore, since n < m, X^T X is rank deficient and X^T X ⊁ 0.
So L is not strictly convex.
By the Calculus Lecture (Quadratic Functions), the solution is not unique because L is not strictly convex.
VERSION 2:
The normal equations will not yield a solution.
Explanation:
We have m = 56 parameters to learn, but only n = 50 data points (equations).
Recall from the normal equations that:
w_OLS = (X^T X)^{-1} X^T y,
where X is the design matrix.
Note: X ∈ R^{n×m}, X^T X ∈ R^{m×m}.
Therefore, since n < m, X^T X is rank deficient and (X^T X)^{-1} does not exist. Thus the normal equations do not yield a solution.
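A minimal numerical sketch of this situation (random stand-in data, not the actual design matrix; numpy assumed available):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 56                      # 50 data points, 56 polynomial features
X = rng.normal(size=(n, m))        # stand-in for the design matrix

G = X.T @ X                        # m x m Gram matrix
print(np.linalg.matrix_rank(G))    # at most n = 50 < m = 56, so G is singular
# G is rank deficient, so (X^T X)^{-1} does not exist and the OLS minimiser is not unique.
```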
(b) Discard an attribute; then k = 3, p = 5, therefore:
#terms = (3 + 5 − 1)! / (3! 4!) = 35
Now we have more data points (n = 50) than unknowns (m = 35).
X^T X is invertible, so the solution is unique and globally optimal, and the normal equations yield a solution.
(c) Regularisation
VERSION 1:
Explanation:
Adding an appropriate regulariser (e.g. the l2-norm) ensures that the Hessian of the objective is positive definite.
Therefore the objective is strictly convex and our solution will be unique.
VERSION 2:
Explanation:
Via an Illustration, e.g. Ridge Regression (other illustrations acceptable):
Ridge Regression objective:
L = (1/2) ‖y − Xw‖² + (λ/2) ‖w‖²
Therefore:
H = X^T X + λI ≻ 0
Why? Because X^T X is positive semi-definite, so for λ > 0 every eigenvalue of X^T X + λI is at least λ; the matrix therefore always has full rank.
(Relatedly, the revised normal equations
w_RR = (X^T X + λI)^{-1} X^T y
yield a solution because X^T X + λI is invertible.)
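A matching sketch showing that the ridge term restores full rank and a unique solution (again with random stand-in data and an arbitrarily chosen λ):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, lam = 50, 56, 1.0
X = rng.normal(size=(n, m))                # stand-in design matrix (n < m)
y = rng.normal(size=n)                     # stand-in targets

H = X.T @ X + lam * np.eye(m)              # Hessian of the ridge objective
print(np.linalg.matrix_rank(H))            # m = 56: full rank even though n < m

w_rr = np.linalg.solve(H, X.T @ y)         # unique solution of the revised normal equations
```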
(d) (No marks for any kind of suggestion which is equivalent to a combinatorial search!)
Use the LASSO:
L = (1/2) ‖y − Xw‖² + (λ/2) ‖w‖₁
Explanation: The second term in the objective pushes weights to exactly zero (more aggressively than ridge regression does) as λ increases.
See the lecture notes, especially the demonstration via subgradients that, for w_j = 0, optimality obtains if
−λ/2 < ∂L/∂w_j < λ/2
(the partial derivative here being that of the smooth least-squares part of the objective).
Thus the range of gradient values for which w_j remains at 0, if it is already there, grows with λ.
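A minimal illustration of this thresholding effect, using the one-dimensional problem (1/2)(w − z)² + (λ/2)|w| as a stand-in for a single LASSO coordinate (z and the λ values below are purely illustrative):

```python
import numpy as np

def lasso_1d(z, lam):
    """Minimiser of (1/2)(w - z)^2 + (lam/2)|w|: the soft-thresholding operator."""
    return np.sign(z) * max(abs(z) - lam / 2.0, 0.0)

# The gradient of the smooth part at w = 0 is -z, so w = 0 is optimal iff |z| <= lam/2.
for lam in (0.5, 1.0, 2.0, 4.0):
    print(lam, lasso_1d(1.0, lam))   # hits exactly 0 once lam/2 >= |z| = 1
```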
2. (a)
P(S; w) = ∏_{i=1}^n P(x^(i), y^(i); w)   (by the i.i.d. assumption)
= ∏_{i=1}^n P(y^(i) | x^(i); w) P(x^(i))   (since P(A ∩ B) = P(A | B) P(B))
= ∏_{i=1}^n N(w x^(i), α) P(x^(i))   (since y = wx + ε with ε ∼ N(0, α))
= [∏_{i=1}^n P(x^(i))] × ∏_{i=1}^n (1/√(2πα)) exp(−(1/(2α)) (y^(i) − w x^(i))²)
Therefore:
p_{X,Y}(S; w) = [∏_{i=1}^n P(x^(i))] (1/√(2πα))^n exp(−(1/(2α)) Σ_{i=1}^n (y^(i) − w x^(i))²)
= A × exp(−(1/(2α)) Σ_{i=1}^n (y^(i) − w x^(i))²), where A collects every factor that does not depend on w.
(b) Prior distribution:
p_W(w) = N(0, β) = (1/√(2πβ)) exp(−w²/(2β))
Likelihood:
p_{X,Y}(S | w) = p_{X,Y}(S; w) = A × exp(−(1/(2α)) Σ_{i=1}^n (y^(i) − w x^(i))²)
The posterior distribution is p_W(w | S). By Bayes' Rule:
p_W(w | S) ∝ p_{X,Y}(S | w) p_W(w)
∝ (1/√(2πβ)) exp(−(1/(2α)) Σ_{i=1}^n (y^(i) − w x^(i))² − w²/(2β))
= (1/√(2πβ)) exp(−[ w² (1/(2β) + (Σ_{i=1}^n (x^(i))²)/(2α)) − w (Σ_{i=1}^n x^(i) y^(i))/α + (Σ_{i=1}^n (y^(i))²)/(2α) ])
= const. × exp(−((α + β Σ_{i=1}^n (x^(i))²)/(2αβ)) [ w² − 2w (β Σ_{i=1}^n x^(i) y^(i))/(α + β Σ_{i=1}^n (x^(i))²) ])
= const.' × exp(−((α + β Σ_{i=1}^n (x^(i))²)/(2αβ)) (w − (β Σ_{i=1}^n x^(i) y^(i))/(α + β Σ_{i=1}^n (x^(i))²))²)
= const.' × exp(−(w − μ)²/(2σ²)),
where
μ = (β Σ_{i=1}^n x^(i) y^(i)) / (α + β Σ_{i=1}^n (x^(i))²)
σ² = αβ / (α + β Σ_{i=1}^n (x^(i))²)
We recognise the characteristic form of a Gaussian distribution.
Since the posterior distribution is a true probability distribution, it must be normalised to 1.
This allows us to say that const.' = 1/√(2πσ²).
More succinctly:
w | S ∼ N( (β Σ_{i=1}^n x^(i) y^(i)) / (α + β Σ_{i=1}^n (x^(i))²), αβ / (α + β Σ_{i=1}^n (x^(i))²) )
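A small numerical sketch of these closed-form posterior parameters (synthetic one-dimensional data; α, β, and the true w below are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 0.5, 2.0             # noise variance and prior variance (illustrative)
w_true, n = 1.3, 30

x = rng.normal(size=n)
y = w_true * x + rng.normal(scale=np.sqrt(alpha), size=n)   # y = w x + eps, eps ~ N(0, alpha)

# Posterior parameters from part (b)
denom = alpha + beta * np.sum(x ** 2)
mu = beta * np.sum(x * y) / denom
sigma2 = alpha * beta / denom
print(mu, sigma2)                  # mean close to w_true; variance shrinks as n grows
```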
(c) The MAP estimate of w is the value of w for which the posterior p_W(w | S) is maximal:
w_MAP = argmax_w exp(−(w − μ)²/(2σ²))
For a Gaussian distribution the mode coincides with the mean, so:
w_MAP = μ = (β Σ_{i=1}^n x^(i) y^(i)) / (α + β Σ_{i=1}^n (x^(i))²)
(d) Recall that our optimisation problem is:
argmax_w exp(−(w − μ)²/(2σ²))
This can be written as:
argmin_w (w − μ)²
Let us call the objective L(w) = (w − μ)². Now:
∂L/∂w = 2(w − μ)
∂²L/∂w² = 2 > 0
By the definition of Strict Convexity (Calculus Lecture: Convexity) this satisfies the second-order sufficient condition for strict convexity.
Therefore L is strictly convex.
By the Calculus Lecture (Convexity), if L is strictly convex then the solution to min_w L(w) must be unique. Therefore w_MAP is unique.
(e) Recall from part (b) that:
p_W(w | S) = const. × exp(−(1/(2α)) Σ_{i=1}^n (y^(i) − w x^(i))² − w²/(2β))
From the definition of w_MAP:
w_MAP = argmax_w p_W(w | S)
= argmax_w [ −(1/(2α)) Σ_{i=1}^n (y^(i) − w x^(i))² − w²/(2β) ]   (taking the log and dropping constants)
= argmin_w [ (1/(2α)) Σ_{i=1}^n (y^(i) − w x^(i))² + (1/(2β)) w² ]
= argmin_w [ (1/2) Σ_{i=1}^n (y^(i) − w x^(i))² + (α/(2β)) w² ]
This is identical to the canonical form of the optimisation problem associated with Ridge Regression:
argmin_w [ (1/2) Σ_{i=1}^n (y^(i) − w x^(i))² + (λ/2) w² ]
where the connection is expressed via λ = α/β.
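To illustrate the correspondence numerically (synthetic data; the ridge solution below is the one-dimensional closed form, with λ set to α/β as derived above):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta, n = 0.5, 2.0, 40
x = rng.normal(size=n)
y = 0.7 * x + rng.normal(scale=np.sqrt(alpha), size=n)

# Closed-form MAP estimate from part (c)
w_map = beta * np.sum(x * y) / (alpha + beta * np.sum(x ** 2))

# One-dimensional ridge solution with lambda = alpha / beta
lam = alpha / beta
w_ridge = np.sum(x * y) / (np.sum(x ** 2) + lam)
print(w_map, w_ridge)              # identical up to floating-point error
```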
3. (a) Ordinary Least Squares
argmin_w (1/2) Σ_{i=1}^n (y^(i) − w · x^(i))² = argmin_w (1/2) ‖y − Xw‖²
(c) Note that:
ε ∼ N(0, σ²) ⟹ y | x ∼ N(w · x, σ²)
The likelihood is given by:
P(S; w, σ) = ∏_{i=1}^n p_{X,Y}(x^(i), y^(i); w, σ)
= ∏_{i=1}^n p_Y(y^(i) | x^(i); w, σ) p_X(x^(i))
Let us search for the maximum likelihood estimator:
w_MLE = argmax_w P(S; w, σ)
= argmax_w ln P(S; w, σ)
= argmax_w Σ_{i=1}^n [ ln( (1/√(2πσ²)) exp(−(y^(i) − w · x^(i))²/(2σ²)) ) + ln p_X(x^(i)) ]
= argmax_w Σ_{i=1}^n −(y^(i) − w · x^(i))²/(2σ²)
= argmin_w Σ_{i=1}^n (y^(i) − w · x^(i))²
(d) Note that:
ε ∼ Laplace(0, b) ⟹ y | x ∼ Laplace(w · x, b)
The likelihood is given by:
P(S; w, b) = ∏_{i=1}^n p_{X,Y}(x^(i), y^(i); w, b)
= ∏_{i=1}^n p_Y(y^(i) | x^(i); w, b) p_X(x^(i))
Let us search for the maximum likelihood estimator:
w_MLE = argmax_w P(S; w, b)
= argmax_w ln P(S; w, b)
= argmax_w Σ_{i=1}^n [ ln( (1/(2b)) exp(−|y^(i) − w · x^(i)|/b) ) + ln p_X(x^(i)) ]
= argmax_w Σ_{i=1}^n −|y^(i) − w · x^(i)|/b
= argmin_w Σ_{i=1}^n |y^(i) − w · x^(i)|
(e) The estimator is robust to outliers compared with the squared-error estimator, because an outlying point is given less weight by the ℓ1 norm than by the ℓ2 norm.
(f) Note that the objective is not smooth and there is no analytic solution.
Try numerical optimisation of some sort (proximal gradient descent, for example).
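A minimal sketch of one such numerical approach, here plain subgradient descent on the absolute-loss objective with synthetic data and an injected outlier (the step-size schedule, data, and one-dimensional setting are all illustrative choices, not the only option):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x = rng.normal(size=n)
y = 2.0 * x + rng.laplace(scale=0.3, size=n)
x[0], y[0] = 3.0, 40.0                          # a gross outlier (true value would be about 6)

def l1_fit(x, y, steps=2000, lr=0.01):
    """Minimise sum_i |y_i - w x_i| by subgradient descent (1-D w for simplicity)."""
    w = 0.0
    for t in range(1, steps + 1):
        g = -np.sum(np.sign(y - w * x) * x)     # a subgradient of the objective
        w -= (lr / np.sqrt(t)) * g              # diminishing step size
    return w

w_l1 = l1_fit(x, y)
w_ols = np.sum(x * y) / np.sum(x ** 2)          # closed-form 1-D least squares
print(w_l1, w_ols)                              # the L1 fit stays near 2.0; OLS is pulled well above it
```

The final comparison also illustrates part (e): the outlier drags the least-squares fit noticeably, while the absolute-loss fit stays close to the true slope.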
(g) The objective function is convex, but it is not strictly convex (use the definition of convexity to demonstrate this).
Thus local minima are global minima, but such minima are not necessarily unique.