Question 1
(a) You are given a dataset {(xn, tn)}_{n=1}^N where xn = (φ1(xn), φ2(xn), . . . , φp(xn)) is a p-dimensional row vector of real-valued features associated with an individual n, and tn is the corresponding real-valued target variable. A linear regression model introduces a function y = f(x; w) that depends linearly on a weight vector w. Write out the expression for f(x; w). How many real numbers in w does the learning algorithm have to learn?
[4 marks]
(b) Express the learning problem introduced in the previous part in terms of a design matrix A, so that Aw = y. You must outline what the size of the design matrix is, and what each of its rows and columns represents. Use the column vector of targets t = (t1, . . . , tN)^T to define the mean squared error (MSE) loss function.
COMP3223W1
Copyright 2020 © University of Southampton Page 2 of 39
[4 marks]
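A minimal numerical sketch of the design-matrix formulation above, using a small hypothetical dataset (numpy assumed; the sizes N = 5 and p = 3 are arbitrary):

```python
import numpy as np

# Hypothetical dataset: N = 5 individuals, p = 3 features per row.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))   # design matrix: row n holds (phi_1(x_n), ..., phi_p(x_n))
t = rng.normal(size=5)        # column vector of targets (t_1, ..., t_N)^T
w = rng.normal(size=3)        # p weights, one per feature, to be learnt

y = A @ w                     # f(x_n; w) = sum_j w_j phi_j(x_n), for every n at once
mse = np.mean((y - t) ** 2)   # MSE loss: (1/N) sum_n (y_n - t_n)^2
print(mse)
```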
(c) The best-fit weights w for the MSE are given by w = A^+ t, where A^+ is the pseudo-inverse

A^+ = (A^T A)^{-1} A^T.

Show that (i) the sum of the residuals is zero; (ii) the vector of each of the p features (φj(x1), φj(x2), . . . , φj(xN)), for j = 1, . . . , p, is orthogonal to the vector of residuals.
(d) Suppose the design matrix has a singular value decomposition

A = UΣV^T = ∑_k σ_k u_k v_k^T,

where U and V are orthogonal matrices of size (N × N) and (p × p) respectively and Σ contains non-zero entries σi only along the diagonal. Show that Av_i = σ_i u_i.
[3 marks]
[5 marks]
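Both claims in parts (c) and (d) can be checked numerically. The sketch below assumes a small random design matrix whose first column is a constant feature (which is what makes the residuals sum to zero):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 6, 3
A = rng.normal(size=(N, p))
A[:, 0] = 1.0                 # constant feature, so the residuals sum to zero
t = rng.normal(size=N)

w = np.linalg.pinv(A) @ t     # w = A^+ t, the MSE-optimal weights
r = t - A @ w                 # residual vector

print(np.sum(r))              # ~0: residuals sum to zero (via the constant column)
print(A.T @ r)                # ~0 vector: every feature column is orthogonal to r

# SVD property: A v_i = sigma_i u_i for each singular triple
U, s, Vt = np.linalg.svd(A, full_matrices=False)
for i in range(p):
    assert np.allclose(A @ Vt[i], s[i] * U[:, i])
```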
(e) If you rewrite the target vector t and weights w as linear combinations of the columns u_i of U and columns v_i of V as follows:

w = ∑_i α_i v_i,    t = ∑_n β_n u_n,

show that the MSE loss function reduces to

(1/N) ∑_i (β_i − α_i σ_i)^2.

For what choice of the expansion coefficients α_i of the weight vector w is this loss minimised?
[5 marks]
(f) Suppose the target vector t is corrupted by some (N × 1) additive noise vector e, so that t + e = ∑_i (β_i + ε_i) u_i, where each ε_i is small. Show that the existence of small singular values σ_i can lead to large changes in the learnt weights.
[4 marks]
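A numerical illustration of part (f), with one deliberately tiny singular value (the matrices and noise scale are hypothetical): the noise component ε_i along a direction with σ_i ≈ 1e-6 is amplified by 1/σ_i in the learnt weights.

```python
import numpy as np

rng = np.random.default_rng(2)
U, _ = np.linalg.qr(rng.normal(size=(6, 6)))
V, _ = np.linalg.qr(rng.normal(size=(3, 3)))
sigma = np.array([5.0, 1.0, 1e-6])        # one tiny singular value
A = (U[:, :3] * sigma) @ V.T              # build A = U Sigma V^T

t = rng.normal(size=6)
e = 1e-3 * rng.normal(size=6)             # small additive noise on the targets

w_clean = np.linalg.pinv(A) @ t
w_noisy = np.linalg.pinv(A) @ (t + e)
# eps_i / sigma_i blows up for sigma_i ~ 1e-6: a ~1e-3 perturbation of t
# moves the weights by several orders of magnitude more
print(np.linalg.norm(w_noisy - w_clean))
```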
Question 2
(a) For two probability distributions P_A = {a_i | a_i ≥ 0, ∑_i a_i = 1, i ∈ X} and P_B = {b_i | b_i ≥ 0, ∑_i b_i = 1, i ∈ X} over the same event space X, the Kullback-Leibler (KL) divergence KL(P_A ∥ P_B) is defined as:

KL(P_A ∥ P_B) = ∑_{i ∈ X} a_i log(a_i / b_i).

Similar definitions apply in the continuous case. Provide some intuition for when the value of KL(P_A ∥ P_B) is minimised, motivating how this is exploited in machine learning algorithms.
[5 marks]
(b) If the true distribution is given by P, which is bimodal, and you are learning the parameters of a Gaussian Q, minimising KL(P ∥ Q) or KL(Q ∥ P) can yield either of the two distributions (in dash-dotted lines) shown in the figure. Explain which of the two choices of the KL corresponds to which figure.
[5 marks]
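A quick numerical sketch of the KL definition in part (a), on hypothetical discrete distributions: the divergence is zero exactly when the two distributions agree, and positive otherwise.

```python
import numpy as np

def kl(a, b):
    """Discrete KL divergence: sum_i a_i log(a_i / b_i)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.sum(a * np.log(a / b)))

p = [0.5, 0.3, 0.2]
print(kl(p, p))                 # 0.0: minimised (zero) iff the distributions agree
print(kl(p, [0.2, 0.3, 0.5]))   # > 0 otherwise
```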
(c) You are given a sequence x := {x(1), . . . , x(N)} of heads (x(i) = H) and tails (x(i) = T) which are the outcomes of N tosses of a (potentially biased) coin, with nH being the number of times heads appears. All possible outcomes of N coin tosses constitute the event space X. A binomial distribution B(N, θ) sets the probability of occurrence of n1 events of type 1, and n2 = N − n1 of type 2, as

P(n1, n2 | N, θ) = (N! / (n1! n2!)) θ^{n1} (1 − θ)^{n2}.

Describe how you would fit the data x to a binomial distribution using maximum likelihood estimation (MLE), and find the result θ_MLE = nH / N.
[8 marks]
(d) Use Bayes' theorem to write the posterior probability P(θ|x) of the parameters θ upon observing some data x. Discuss how a conjugate Beta prior

p(θ) = θ^{α−1}(1 − θ)^{β−1} / ∫_0^1 θ^{α−1}(1 − θ)^{β−1} dθ = (1/Z) θ^{α−1}(1 − θ)^{β−1}

introduces "pseudo-counts" to affect the estimation of parameters of the binomial distribution as above, and combats the problem of overfitting, particularly for data sets of small size N. (Z is the normalisation constant.)
[7 marks]
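A small sketch of the pseudo-count effect in part (d), with hypothetical counts and using the posterior mode as the point estimate: a Beta(α, β) prior behaves like α − 1 extra heads and β − 1 extra tails, which pulls extreme small-sample estimates away from 0 or 1.

```python
def theta_map(n_heads, n_tails, alpha, beta):
    """Posterior mode with a Beta(alpha, beta) prior: the prior acts as
    (alpha - 1) pseudo-heads and (beta - 1) pseudo-tails added to the data."""
    return (n_heads + alpha - 1) / (n_heads + n_tails + alpha + beta - 2)

# Small sample: 3 tosses, all heads
print(theta_map(3, 0, 1, 1))   # 1.0 -- flat prior reproduces the overfit MLE
print(theta_map(3, 0, 2, 2))   # 0.8 -- pseudo-counts pull the estimate away from 1
```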
Question 3
(a) You are given a dataset X = {xn}_{n=1}^N where X is a (N × p) matrix, each row xn of which is a p-dimensional vector of real-valued features associated with an individual n. Write down what the (i, j)-th components of the matrices XX^T and X^TX are.
[5 marks]
(b) If the square of the length of xn defined above is ∥xn∥^2 = ∑_{i=1}^p x_{n,i}^2, show that

Tr(XX^T) = ∑_{n=1}^N ∥xn∥^2,

where Tr is the matrix trace.
[3 marks]
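The trace identity in part (b) checks out numerically on a small hypothetical matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(4, 2))                   # N = 4 rows, p = 2 features

lhs = np.trace(X @ X.T)                       # (XX^T)_{nn} = x_n . x_n, summed over n
rhs = sum(np.linalg.norm(x) ** 2 for x in X)  # sum_n ||x_n||^2
print(lhs, rhs)                               # equal up to rounding
```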
(c) From the given matrix X as above, write out the (i, j) entry of the covariance matrix of features.
[3 marks]
(d) Explain in detail how you would use Principal Components Analysis (PCA) to reduce the dimensionality p of the feature set. You have to introduce the eigenvalues and eigenvectors of the covariance matrix, and explain how they are used to perform this task.
[5 marks]
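A minimal PCA sketch along the lines of part (d), assuming numpy and a hypothetical dataset in which one direction dominates the variance: centre the data, form the covariance matrix, take its eigenvectors, and project onto those with the largest eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3)) @ np.diag([3.0, 1.0, 0.1])  # feature 0 dominates

Xc = X - X.mean(axis=0)              # centre the features
C = Xc.T @ Xc / (len(X) - 1)         # p x p covariance matrix
evals, evecs = np.linalg.eigh(C)     # eigh returns eigenvalues in ascending order
order = np.argsort(evals)[::-1]      # sort descending by explained variance
W = evecs[:, order[:2]]              # keep the top-2 principal directions
Z = Xc @ W                           # project: dimensionality p = 3 -> 2
print(Z.shape)                       # (100, 2)
```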
(e) A uniform random variable U(a, b) has mean (a + b)/2 and variance (b − a)^2/12. For any sample x of N numbers x = {x1, x2, . . . , xN} drawn from a uniform distribution U(1, 3), its average ⟨x⟩ is a random variable. What are the mean and standard deviation of this random variable?
[4 marks]
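A simulation sketch of part (e), with arbitrary sample sizes: for U(1, 3) the mean is 2 and the variance is 1/3, so the average of N draws has mean 2 and standard deviation sqrt(1/(3N)).

```python
import numpy as np

rng = np.random.default_rng(5)
N = 100
# 20000 repeated samples, each of N draws from U(1, 3); average each sample
averages = rng.uniform(1, 3, size=(20000, N)).mean(axis=1)
print(averages.mean())   # ~2.0
print(averages.std())    # ~sqrt(1/(3N)) = sqrt(1/300) ~ 0.0577
```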
(f) The adjacent figure illustrates the performance of a learned machine on a task whose target is at the centre of each set of concentric circles. The stars represent the predictions of the machine on test data. Explain the meaning of the terms bias and variance of a learning method, using the figures as a guide.
[5 marks]
Question 4
(a) Describe the k-means clustering algorithm.
[7 marks]
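A plain sketch of the algorithm in part (a), with hypothetical data and parameters: initialise k centres from the data, then alternate between assigning each point to its nearest centre and moving each centre to the mean of its assigned points.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain k-means: alternate nearest-centre assignment and centre updates."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # distances from every point to every centre, then nearest-centre labels
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = X[labels == j].mean(axis=0)
    return centres, labels

# Two well-separated blobs around (0, 0) and (5, 5)
rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
centres, labels = kmeans(X, 2)
print(centres)   # one centre near each blob
```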
(b) For a 2-class Gaussian classifier with class labels c = A, B, in a two-dimensional feature space with points X = (x, y)^T ∈ R^2, the probability distribution functions are

p(X|c) = (√|Λ_c| / 2π) exp(−(1/2) (X − μ_c)^T Λ_c (X − μ_c)),    c = A, B.

Consider the special case

Λ_A = Λ_B = Λ = ( a  b ; b  d ),    a, b, d > 0,    and means μ_A = μ = −μ_B.

Draw the contours of the two pdfs.
[4 marks]
(c) For the conditional pdfs introduced above, show that the decision boundary P(A|X) = P(B|X) for equal priors (P(A) = P(B)) is the set of points X = (x, y)^T described by X^T Λ μ = 0.
[3 marks]
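A numerical check of part (c), with a hypothetical shared precision matrix Λ and mean μ: any point satisfying X^T Λ μ = 0 scores equally under both class densities, so with equal priors it lies on the decision boundary.

```python
import numpy as np

Lam = np.array([[2.0, 0.5], [0.5, 1.0]])   # shared precision matrix (a b; b d)
mu = np.array([1.0, -0.5])                  # mu_A = mu, mu_B = -mu

def log_lik(X, m):
    """Log-density up to a constant shared by both classes."""
    return -0.5 * (X - m) @ Lam @ (X - m)

# A point on the boundary: any X orthogonal to Lam @ mu
v = Lam @ mu
X = np.array([-v[1], v[0]])                 # X . (Lam mu) = 0 by construction
print(X @ Lam @ mu)                         # ~0
print(log_lik(X, mu) - log_lik(X, -mu))     # ~0: equal posteriors, equal priors
```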
(d) A linear hyperplane is described by the equation y = w · x + b. The decision boundary in the figure is the line (representing a hyperplane) for which y = 0 (labelled 0) and is perpendicular to w. The two parallel hyperplanes that go through the support vectors (points with thickened edges) are indicated by the values y = ±1. Explain why a large margin is necessary for robust classification. From the geometry of the figure, show that the size of the margin along the direction of w is 2/∥w∥. You may take x_+ and x_− to be support vectors.
[6 marks]
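The margin claim in part (d) can be verified with a hypothetical weight vector: take one point on each of the planes w · x + b = ±1 and measure their separation along the unit vector w/∥w∥.

```python
import numpy as np

w = np.array([3.0, 4.0])           # hypothetical weight vector, ||w|| = 5
b = -2.0

# Support vectors sit on the planes w.x + b = +1 and w.x + b = -1.
x_plus = (1 - b) * w / (w @ w)     # a point with w . x + b = +1
x_minus = (-1 - b) * w / (w @ w)   # a point with w . x + b = -1

# Their separation along w/||w|| is the margin 2/||w||.
margin = (x_plus - x_minus) @ (w / np.linalg.norm(w))
print(margin, 2 / np.linalg.norm(w))   # both 0.4
```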
(e) Explain how the max-margin classifier for the training set {(x1, y1), . . . , (xN, yN)} is expressed as the solution to the constrained optimisation problem

min_{w,b} max_α L(w, b, α) = (1/2)∥w∥^2 − ∑_n α_n [y_n(w^T x_n + b) − 1],    α_n ≥ 0.
How many constraints are enforced by each Lagrange multiplier αn and what does each constraint encode?
[5 marks]
Question 5
Suppose you are given the 2-class dataset in the figure below.

[Figure: scatter plot of the dataset, with + denoting Class 1 and * denoting Class 2; clusters of the two classes alternate across the plane.]

(a) Would a perceptron be a suitable classifier for this dataset? Briefly explain your answer.
[3 marks]
(b) Would a neural network with 1 hidden layer and no bias be a suitable classifier for this dataset? Briefly explain your answer.
[3 marks]
(c) Assume you now have a fully functional neural network architecture in place and you are training your model. How can you check to see if you are overfitting? Explain exactly what you could plot from your output and what on this plot may indicate overfitting.
[7 marks]
Now assume you are working with a different dataset of 1000 images. These images are from the MNIST dataset and consist of handwritten digits from 0 to 9. Some samples are given below. Each one of the images consists of 28 × 28 pixels, each with a value of 0 or 1, and you want to build a neural network to classify the digits. (Note: for any of the answers below, if you do not have a calculator and cannot work out the final number you will not be penalised. Simply write out the simplest form of the equation that answers the question.)
(d) How many inputs would you need in your neural network?
[3 marks]
(e) How many neurons would you need in your output layer?
[3 marks]
(f) For a ReLU function defined as f+(x, a) := max(x − a, 0), evaluate the output of the network

h = 1 − f+(x1 + x2, 0.4)
y = f+(h, 0.1) − 2 f+(h, 0.5) + f+(h, 0.9)

on the set D of input pairs (x1, x2), where

D := {(0.1, 0.2), (0.2, 0.9), (0.8, 0.1), (0.7, 0.9)}.

What pattern has this network captured? You should create a table with a column of inputs on the left and the results of the stages of the computation in the other columns.

(x1, x2)     s = x1 + x2   h = 1 − f+(s, 0.4)   f+(h, 0.1)   f+(h, 0.5)   f+(h, 0.9)   y
(0.1, 0.2)
(0.2, 0.9)
(0.8, 0.1)
(0.7, 0.9)
[6 marks]
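The stages of the computation in part (f) can be tabulated with a short script (rounding is only for display; the `+ 0.0` normalises a floating-point `-0.0` to `0.0`):

```python
def f_plus(x, a):
    """Shifted ReLU: f+(x, a) = max(x - a, 0)."""
    return max(x - a, 0.0)

D = [(0.1, 0.2), (0.2, 0.9), (0.8, 0.1), (0.7, 0.9)]
rows = []
for x1, x2 in D:
    s = x1 + x2
    h = 1 - f_plus(s, 0.4)
    y = f_plus(h, 0.1) - 2 * f_plus(h, 0.5) + f_plus(h, 0.9)
    rows.append(((x1, x2), round(s, 2), round(h, 2), round(y, 2) + 0.0))
    print(rows[-1])
# y is positive only for the mixed pairs, where exactly one of x1, x2 is large
```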
END OF PAPER