Inputs: $\mathbf{x} = [1, x_1, x_2, \ldots, x_m]^T \in \mathbb{R}^{m+1}$
$y \in \mathbb{R}$ for regression problems
$y \in \{0, 1\}$ for binary classification problems
Training Data:
$S = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{n}$
Input Training Data:
The design matrix, $X$, is defined as:
$$X = \begin{bmatrix} \mathbf{x}^{(1)T} \\ \mathbf{x}^{(2)T} \\ \vdots \\ \mathbf{x}^{(n)T} \end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & \cdots & x_m^{(1)} \\ 1 & x_1^{(2)} & \cdots & x_m^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(n)} & \cdots & x_m^{(n)} \end{bmatrix}$$
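For concreteness, a minimal sketch of how such a design matrix might be assembled with NumPy (the variable names `X_raw` and `X` are illustrative, not part of the exam):

```python
import numpy as np

# n = 4 training points, m = 2 raw features (illustrative values)
X_raw = np.array([[0.5, 1.2],
                  [1.0, 0.3],
                  [2.1, 0.9],
                  [0.7, 1.8]])

# Prepend a column of ones so each row is x^(i)T = [1, x_1, ..., x_m]
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])
print(X.shape)  # (4, 3): n rows, m+1 columns
```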
Output Training Data:
$$\mathbf{y} = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix}$$
Data-Generating Distribution:
$S$ is drawn i.i.d. from a data-generating distribution, $\mathcal{D}$.
1. Consider the following function:
$$f(\mathbf{x}) = \mathbf{x}^T A \mathbf{x} + \mathbf{b} \cdot \mathbf{x}$$
where:
$$\mathbf{x} = [x_1, x_2]^T, \quad A = \begin{bmatrix} 8 & (2\alpha - 1) \\ 1 & 2 \end{bmatrix}, \quad \mathbf{b} = [6, -2]^T, \quad x_1 \in \mathbb{R},\ x_2 \in \mathbb{R},\ \alpha \in \mathbb{R}.$$
(a) Show that, without loss of generality, this function can be re-written as:
$$f(\mathbf{x}) = \mathbf{x}^T B \mathbf{x} + \mathbf{b} \cdot \mathbf{x}$$
where $B$ is a symmetric matrix. In so doing, characterise $B$ in terms of $\alpha$.
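As a hint at the structure of part (a)'s argument (a standard identity, not the full answer): $\mathbf{x}^T A \mathbf{x}$ is a scalar and therefore equal to its own transpose, $\mathbf{x}^T A^T \mathbf{x}$, so for any square matrix $A$,
$$\mathbf{x}^T A \mathbf{x} = \tfrac{1}{2}\left(\mathbf{x}^T A \mathbf{x} + \mathbf{x}^T A^T \mathbf{x}\right) = \mathbf{x}^T \left(\tfrac{1}{2}(A + A^T)\right) \mathbf{x},$$
and $\tfrac{1}{2}(A + A^T)$ is symmetric by construction.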
Now, assume that $\mathbf{x}$ is the outcome of a multivariate Gaussian random variable, $\mathbf{X} = [X_1, X_2]^T$, characterised by the following probability distribution function, $p(\mathbf{x})$:
$$p(\mathbf{x}) = g(B, \mathbf{b})\, e^{-f(\mathbf{x})}$$
Here $g$ is some function such that $g : \mathbb{R}^{2 \times 2} \times \mathbb{R}^2 \to \mathbb{R}$.
(b) Characterise the covariance, $\Sigma$, of this distribution in terms of $B$.
(c) Characterise the mean, $\boldsymbol{\mu}$, of this distribution in terms of $B$ and $\mathbf{b}$.
(d) Characterise $g(B, \mathbf{b})$ in terms of $B$ and $\mathbf{b}$.
[4 marks]
(e) Fully characterise the conditions on $\alpha$ for $p(\mathbf{x})$ to be a valid multivariate Gaussian probability distribution function.
(f) Assuming that the correlation between $X_1$ and $X_2$ is $0.75$, what is the probability density associated with the point $\mathbf{x} = \left[\tfrac{3}{7}, \tfrac{7}{4}\right]^T$?
[5 marks]
[Total for Question 1: 25 marks]
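As an editorial aside, a minimal numerical sketch of Question 1's setup (assuming the reconstructed $\mathbf{b} = [6, -2]^T$, a placeholder value of $\alpha$, and the completion-of-the-square expressions for $\Sigma$ and $\boldsymbol{\mu}$ that parts (b) and (c) ask you to derive):

```python
import numpy as np
from scipy.stats import multivariate_normal

alpha = 1.0  # placeholder value; part (e) asks for the valid range of alpha
A = np.array([[8.0, 2 * alpha - 1.0],
              [1.0, 2.0]])
B = 0.5 * (A + A.T)          # symmetrisation, cf. part (a)
b = np.array([6.0, -2.0])

# p(x) can only be a valid Gaussian density if B is positive definite
assert np.all(np.linalg.eigvalsh(B) > 0), "B must be positive definite"

# Candidate moments from completing the square (to be derived in (b)-(c))
B_inv = np.linalg.inv(B)
Sigma = 0.5 * B_inv
mu = -0.5 * B_inv @ b

x = np.array([3 / 7, 7 / 4])
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))
```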
2. In linear regression we seek to learn a linear mapping, $f_{\mathbf{w}}$, characterised by a weight vector, $\mathbf{w} \in \mathbb{R}^{m+1}$, and drawn from a function class, $\mathcal{F}$:
$$\mathcal{F} = \left\{ f_{\mathbf{w}}(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} \;\middle|\; \mathbf{w} = [w_0, w_1, \ldots, w_m]^T \in \mathbb{R}^{m+1} \right\}$$
One approach to learning $\mathbf{w}$ is to seek the optimal weight vector, $\mathbf{w}^*$, that minimises the empirical mean squared error loss across the training set:
$$\mathbf{w}^* = \operatorname*{arg\,min}_{\mathbf{w}}\ \frac{1}{2} \sum_{i=1}^{n} \left( y^{(i)} - \mathbf{w} \cdot \mathbf{x}^{(i)} \right)^2$$
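A minimal numerical sketch of this minimisation (toy data; the $\tfrac{1}{2}$ factor does not change the minimiser, so an ordinary least-squares solver applies):

```python
import numpy as np

# Toy data: n = 5 points, m = 1 feature, design matrix with bias column
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([1.0, 2.9, 5.1, 7.0, 9.1])

# w* solves the normal equations X^T X w = X^T y
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_star)  # approximately [1.0, 2.0]
```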
(a) What is the name usually given to this approach?
(b) Assume that we pre-process the data such that the input vector is rescaled to be $\hat{\mathbf{x}}$, where $\hat{\mathbf{x}} = \alpha \mathbf{x}$, and $\alpha \in \mathbb{R}$.
Now we re-train our model to seek $\hat{\mathbf{w}}^*$ such that:
$$\hat{\mathbf{w}}^* = \operatorname*{arg\,min}_{\mathbf{w}}\ \frac{1}{2} \sum_{i=1}^{n} \left( y^{(i)} - \mathbf{w} \cdot \hat{\mathbf{x}}^{(i)} \right)^2$$
Given a novel test point, $\mathbf{x}_{\text{new}}$, demonstrate that the output prediction associated with this point is the same for both versions of our model (i.e. for the versions with unscaled and scaled data).
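An empirical sanity check of the claim in part (b), as a sketch (random toy data; here the scaling $\hat{\mathbf{x}} = \alpha\mathbf{x}$ is applied to the entire input vector, bias component included, matching the definition above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])  # bias + 2 features
y = rng.normal(size=20)
alpha = 3.5

w, *_ = np.linalg.lstsq(X, y, rcond=None)              # unscaled model
w_hat, *_ = np.linalg.lstsq(alpha * X, y, rcond=None)  # scaled model

x_new = np.array([1.0, 0.4, -1.2])
print(np.isclose(w @ x_new, w_hat @ (alpha * x_new)))  # True
```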
(c) If we wish to alleviate overfitting in the context of our original model we may seek to add a term $\lambda \|\mathbf{w}\|^2$ (where $\lambda > 0$) to our objective function, resulting in a modified model and the following optimisation problem:
$$\mathbf{w}^* = \operatorname*{arg\,min}_{\mathbf{w}}\ \frac{1}{2} \sum_{i=1}^{n} \left( y^{(i)} - \mathbf{w} \cdot \mathbf{x}^{(i)} \right)^2 + \lambda \|\mathbf{w}\|^2$$
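A closed-form sketch of this modified problem, under the stated $\tfrac{1}{2}$ convention on the squared-error term (with that convention the stationarity condition is $(X^TX + 2\lambda I)\mathbf{w} = X^T\mathbf{y}$; the helper name `ridge_fit` is illustrative):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimise 0.5 * ||y - Xw||^2 + lam * ||w||^2 in closed form.

    Note: this penalises every weight, including the bias weight w_0;
    in practice the bias is often left unpenalised.
    """
    return np.linalg.solve(X.T @ X + 2 * lam * np.eye(X.shape[1]), X.T @ y)
```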
What name do we usually give to the term which we added, and what name do we give to the learning algorithm associated with optimising this modified objective?
(d) Explain how and why the weights learnt from optimising the modified objective will differ from those learnt from optimising the original objective.
(e) Again assume that we pre-process the data such that the input vector is rescaled to be $\hat{\mathbf{x}}$, where $\hat{\mathbf{x}} = \alpha \mathbf{x}$, and $\alpha \in \mathbb{R}$.
Now we re-train our modified model to seek $\hat{\mathbf{w}}^*$ such that:
$$\hat{\mathbf{w}}^* = \operatorname*{arg\,min}_{\mathbf{w}}\ \frac{1}{2} \sum_{i=1}^{n} \left( y^{(i)} - \mathbf{w} \cdot \hat{\mathbf{x}}^{(i)} \right)^2 + \hat{\lambda} \|\mathbf{w}\|^2$$
Here $\hat{\lambda} > 0$ is some constant.
Given a novel test point, $\mathbf{x}_{\text{new}}$, what relationship between $\lambda$ and $\hat{\lambda}$ is required such that the output prediction associated with this point will be the same for both versions of our updated model (i.e. for the versions with unscaled and scaled data)?
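One way to probe part (e) empirically, as a sketch: fit the unscaled and scaled models and test a few natural candidate values of $\hat{\lambda}$ (the helper repeats the closed form sketched after part (c); all names are illustrative):

```python
import numpy as np

def ridge_fit(X, y, lam):
    # minimise 0.5 * ||y - Xw||^2 + lam * ||w||^2, as sketched earlier
    return np.linalg.solve(X.T @ X + 2 * lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(1)
X = np.hstack([np.ones((30, 1)), rng.normal(size=(30, 2))])
y = rng.normal(size=30)
x_new = np.array([1.0, 0.5, -0.7])
alpha, lam = 2.0, 0.3

w = ridge_fit(X, y, lam)                      # unscaled model
for lam_hat in [lam, alpha * lam, alpha**2 * lam]:
    w_hat = ridge_fit(alpha * X, y, lam_hat)  # scaled model
    match = np.isclose(w @ x_new, w_hat @ (alpha * x_new))
    print(lam_hat, match)  # exactly one candidate matches
```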
(f) Will the optimisation of the alternative objective function described in part (c) result in unique solutions? Explain.
(g) State an alternative term, discussed in lectures, which we could add to our original objective function and which would encourage sparsity in the weight vector. Furthermore, name the learning algorithm associated with the optimisation of this alternative objective.
[2 marks]
[Total for Question 2: 25 marks]
END OF PAPER