1
(a) You are given a dataset $\{(x_n, y_n)\}_{n=1}^{N}$ where $x_n \in \mathbb{R}^p$ is a p-dimensional vector of real-valued features associated with an individual n and $y_n$ is a real-valued "output" variable for n. Show how you would set up a linear regression model by expressing $y_n$ as a linear function of $x_n$. What would the learning algorithm return? [5 marks]
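One way such a setup could look in code is sketched below: stack the $x_n$ into a design matrix with a bias column and solve the least-squares problem, so that the learning algorithm returns the fitted weight vector. The variable names, synthetic data, and the choice of a plain least-squares solver are illustrative assumptions, not part of the question.

```python
import numpy as np

# Illustrative sketch: fit y_n ≈ w0 + w^T x_n by least squares.
# X has shape (N, p); y has shape (N,). All names here are assumptions.
rng = np.random.default_rng(0)
N, p = 100, 3
X = rng.normal(size=(N, p))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.7 + rng.normal(scale=0.1, size=N)

A = np.hstack([np.ones((N, 1)), X])        # prepend a bias column
w, *_ = np.linalg.lstsq(A, y, rcond=None)  # minimises ||A w - y||^2
print(w)  # w[0] is the intercept, w[1:] the feature weights
```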
(b) For a new dataset of input-output pairs $\{(x_n, y_n)\}$, $n = 1, \ldots, N$, with scalars $x_n, y_n \in \mathbb{R}$, you wish to express $y_n$ as a cubic function of $x_n$. How would the corresponding design matrix A be set up? For N = 5 data points, describe the entries of the rows and columns of A and introduce a loss function. [8 marks]
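A sketch of one way the design matrix could be built for the cubic case; the Vandermonde-style layout and the squared-error loss are assumptions consistent with standard practice, and the data values are made up for illustration.

```python
import numpy as np

# Illustrative: cubic design matrix for N = 5 scalar inputs.
# Row n holds (1, x_n, x_n^2, x_n^3); columns are increasing powers of x.
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([1.0, 1.2, 0.8, 0.1, -0.9])

A = np.vander(x, N=4, increasing=True)     # shape (5, 4)
w, *_ = np.linalg.lstsq(A, y, rcond=None)  # squared loss ||A w - y||^2
print(A.shape, w)
```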
(c) You are asked to make an update to the weights of the quadratic function $y = f(x) = w_0 + w_1 x + w_2 x^2$ from a given set of values $w = (w_0, w_1, w_2) = (2, 1, -2)$ to minimise the squared loss on a minibatch consisting of the two points:

$$\{(x_1, y_1), (x_2, y_2)\} = \{(1.0, 0.5), (-1.0, -0.5)\}.$$

In stochastic gradient descent, what would the direction of the gradient be for each point? What would the batch gradient be? For a learning rate of 0.1, what is the updated weight vector w when the entire minibatch is used to compute the gradient? [12 marks]
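The computation can be checked numerically with the sketch below, assuming the per-point squared loss $L_n = (f(x_n) - y_n)^2$ and the mean of the per-point gradients as the minibatch gradient (both are conventions, not stated in the question; a sum convention would double the update).

```python
import numpy as np

# Per-point loss: L_n = (f(x_n) - y_n)^2 with f(x) = w0 + w1*x + w2*x^2.
# Gradient wrt w: dL_n/dw = 2 * (f(x_n) - y_n) * (1, x_n, x_n^2).
w = np.array([2.0, 1.0, -2.0])
pts = [(1.0, 0.5), (-1.0, -0.5)]

grads = []
for x, y in pts:
    phi = np.array([1.0, x, x**2])      # feature vector (1, x, x^2)
    residual = phi @ w - y              # f(x) - y
    grads.append(2.0 * residual * phi)
    print(f"point ({x}, {y}): gradient {grads[-1]}")

batch_grad = np.mean(grads, axis=0)     # assumed: average over the minibatch
w_new = w - 0.1 * batch_grad            # learning rate 0.1
print("batch gradient:", batch_grad)    # (0, 1, 0)
print("updated w:", w_new)              # (2.0, 0.9, -2.0)
```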
2
(a) For a data set $\{x_n \mid n = 1, \ldots, N\}$ of N column vectors $x_n$ with feature values as components $x_{n,1}, x_{n,2}, \ldots, x_{n,p}$, describe in detail how you would construct the covariance matrix of features. [6 marks]
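A minimal sketch of the construction in NumPy: centre the data, then average the outer products. The 1/N versus 1/(N-1) normalisation is a choice; the 1/N convention is assumed here.

```python
import numpy as np

# X holds the N column vectors x_n as rows: shape (N, p). Names assumed.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))

mean = X.mean(axis=0)                 # empirical mean of each feature
Xc = X - mean                         # centre the data
C = (Xc.T @ Xc) / X.shape[0]          # (p, p) covariance matrix, 1/N convention
assert np.allclose(C, np.cov(X, rowvar=False, bias=True))
```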
(b) Using the singular value decomposition (SVD) of an (N × p) matrix $A = U \Sigma V^T$, describe how you would construct a rank-k approximation of A, for $k < \min\{N, p\}$. [6 marks]
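A sketch of the truncated-SVD construction: keep only the k largest singular values and their singular vectors. By the Eckart-Young theorem this is the best rank-k approximation in both the Frobenius and spectral norms.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(6, 4))
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # s sorted descending
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]       # rank-k approximation
print(np.linalg.matrix_rank(A_k))                 # 2
print(np.linalg.norm(A - A_k))                    # sqrt of sum of discarded s^2
```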
(c) Explain in detail how you would use Principal Components Analysis (PCA) to reduce the dimensionality p of the vectors $x_n$: $\mathbb{R}^p \ni x_n \mapsto x'_n \in \mathbb{R}^k$, $k < p$. Comment on the relationship between the variance of the reduced-dimensional version of the vectors $x'_n$ and those of the original $x_n$. [7 marks]
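A sketch of PCA via the eigendecomposition of the feature covariance matrix (the synthetic correlated data is an assumption for illustration): project onto the top-k eigenvectors, and note that the variance retained by the $x'_n$ is the sum of the top-k eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))  # correlated features
k = 2

Xc = X - X.mean(axis=0)
C = (Xc.T @ Xc) / X.shape[0]
eigvals, eigvecs = np.linalg.eigh(C)        # eigenvalues in ascending order
W = eigvecs[:, ::-1][:, :k]                 # top-k principal directions
X_red = Xc @ W                              # reduced vectors x'_n in R^k

# Fraction of total variance retained: sum of top-k eigenvalues over the trace.
print(eigvals[::-1][:k].sum() / eigvals.sum())
```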
(d) For classification problems where there is a training set of labelled data, explain why dimensionality reduction using PCA may give rise to poor classification outcomes. You may draw a figure to illustrate your arguments. [6 marks]
3
(a) Describe the k-means clustering algorithm. [7 marks]
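A minimal sketch of Lloyd's algorithm for k-means, assuming random initial centroids chosen from the data: alternate the assignment step and the mean-update step until the centroids stop moving. (Empty-cluster handling is omitted for brevity.)

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2))
               for c in [(0, 0), (3, 3), (0, 3)]])
centroids, labels = kmeans(X, k=3)
print(centroids)
```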
(b) In Fisher's Linear Discriminant Analysis (LDA), training data $x_A^i$ and $x_B^j$ from 2 classes A and B is projected along some vector w so that $x_A^i$, $x_B^j$ are mapped into

$$y_A^i := w^T x_A^i, \quad i = 1, \ldots, N_A, \qquad y_B^j := w^T x_B^j, \quad j = 1, \ldots, N_B,$$

respectively. Means and covariances are computed from the training data. These empirical vector means are denoted $m_A$ and $m_B$, and the variance-covariance matrices are called $S_A$ and $S_B$ respectively. Let $\mu_A$ and $\mu_B$ be the (scalar) means of the values $\{y_A^i\}$ and $\{y_B^j\}$, and $\sigma_A^2$, $\sigma_B^2$ the corresponding variances; all of these depend on the choice of vector w. What is the quantity that LDA seeks to maximise in order to achieve classification accuracy? Explain why that is a good choice. Discuss whether the quantity is invariant under scaling. [5 marks]
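A sketch of the Fisher criterion and its closed-form maximiser. The ratio $J(w) = (\mu_A - \mu_B)^2 / (\sigma_A^2 + \sigma_B^2)$ and the solution $w \propto (S_A + S_B)^{-1}(m_A - m_B)$ follow the standard formulation; the synthetic data is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
XA = rng.normal([0, 0], 1.0, size=(100, 2))   # class A samples
XB = rng.normal([3, 1], 1.0, size=(120, 2))   # class B samples

mA, mB = XA.mean(axis=0), XB.mean(axis=0)
SA = np.cov(XA, rowvar=False, bias=True)
SB = np.cov(XB, rowvar=False, bias=True)

def fisher_ratio(w):
    """J(w) = (mu_A - mu_B)^2 / (sigma_A^2 + sigma_B^2) for projections w^T x."""
    return (w @ mA - w @ mB) ** 2 / (w @ SA @ w + w @ SB @ w)

w_star = np.linalg.solve(SA + SB, mA - mB)    # closed-form maximiser
print(fisher_ratio(w_star))
print(fisher_ratio(2.5 * w_star))             # identical: J is scale-invariant
```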
(c) A linear hyperplane is described by the equation $y = w \cdot x + b$. The decision boundary in the figure is the line (representing a hyperplane) for which y = 0 (labelled 0) and is perpendicular to w. The two parallel hyperplanes that go through the support vectors (points with thickened edges) are indicated by the values $y = \pm 1$. Explain why a large margin is necessary for robust classification. From the geometry of the figure, show that the size of the margin along the direction of w is $2/\|w\|$. You may take $x_+$ and $x_-$ to be support vectors. [7 marks]
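One way the derivation could go, assuming $x_+$ and $x_-$ lie exactly on the $y = +1$ and $y = -1$ hyperplanes:

```latex
% The two support vectors satisfy the marginal hyperplane equations:
w \cdot x_+ + b = +1, \qquad w \cdot x_- + b = -1.
% Subtracting gives  w \cdot (x_+ - x_-) = 2.  Projecting the gap onto the
% unit normal \hat{w} = w / \|w\| yields the margin width:
\hat{w} \cdot (x_+ - x_-) = \frac{w \cdot (x_+ - x_-)}{\|w\|} = \frac{2}{\|w\|}.
```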
(d) The max-margin classifier for the training set $\{(x_1, y_1), \ldots, (x_N, y_N)\}$ is obtained by solving an optimisation problem

$$\min_{w, b} \max_{\alpha} L(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{n=1}^{N} \alpha_n \left[ y_n (w^T x_n + b) - 1 \right], \qquad \alpha_n \geq 0.$$

Explain the motivation behind such an expression, describing what each term stands for. [6 marks]
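For intuition, the sketch below solves the equivalent constrained primal, $\min \frac{1}{2}\|w\|^2$ subject to $y_n(w^T x_n + b) \geq 1$, with a generic solver; the Lagrangian above is exactly what such a solver forms internally from the objective and constraints. The toy data and the choice of SLSQP are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Linearly separable toy data; labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(z):                      # z = (w1, w2, b)
    return 0.5 * np.dot(z[:2], z[:2])  # (1/2) ||w||^2

# One inequality constraint per point: y_n (w^T x_n + b) - 1 >= 0.
cons = [{"type": "ineq", "fun": lambda z, i=i: y[i] * (X[i] @ z[:2] + z[2]) - 1.0}
        for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), constraints=cons, method="SLSQP")
w, b = res.x[:2], res.x[2]
print(w, b, 2.0 / np.linalg.norm(w))   # weights, bias, margin width
```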
4

(a) For a probability distribution p over an event space X, $p := \{p_i \mid i \in X\}$, explain why the entropy

$$H(p) = \sum_i p_i \log\!\left(\frac{1}{p_i}\right)$$

is often interpreted as the average degree of 'surprise'. [3 marks]
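A small numeric illustration of this reading: entropy is the expected value of the surprise $\log(1/p_i)$, largest for a uniform distribution and small for a peaked one. The example distributions are arbitrary.

```python
import numpy as np

def entropy(p):
    """H(p) = sum_i p_i * log(1/p_i); terms with p_i = 0 contribute 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(np.sum(p[nz] * np.log(1.0 / p[nz])))

print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform: log(4), maximal surprise
print(entropy([0.97, 0.01, 0.01, 0.01]))  # peaked: low average surprise
```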
(b) For 2 probability distributions $p = \{p_i \mid i \in X\}$ and $q = \{q_i \mid i \in X\}$ over the same event space X, the cross-entropy H(p, q) and the Kullback-Leibler (KL) divergence KL(p‖q) are defined as:

$$H(p, q) = \sum_i p_i \log\!\left(\frac{1}{q_i}\right) \qquad \text{and} \qquad KL(p\|q) = H(p, q) - H(p) = \sum_i p_i \log\!\left(\frac{p_i}{q_i}\right).$$
Provide some intuition for what each of the two metrics over pairs of prob- ability distributions captures. You may treat the distribution p as the one representing ‘ground truth.’ [6 marks]
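A numeric sketch of both quantities, with p playing the role of ground truth; the example distributions are arbitrary.

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = sum_i p_i * log(1/q_i): average surprise when coding p with q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(1.0 / q)))

def kl(p, q):
    """KL(p || q) = H(p, q) - H(p) >= 0, with equality iff p == q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.7, 0.2, 0.1])   # 'ground truth'
q = np.array([0.4, 0.4, 0.2])   # model distribution
print(cross_entropy(p, q), kl(p, q), kl(p, p))  # kl(p, p) == 0
```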
(c) You are given a sequence $x := \{x^{(1)}, \ldots, x^{(N)}\}$ of heads ($x^{(i)} = H$) and tails ($x^{(i)} = T$) which are the outcomes of N tosses of a (potentially biased) coin. All possible outcomes of N coin tosses would constitute the event space X. A binomial distribution B(N, θ) sets the probability of occurrence of $n_1$ events of type 1, and $n_2 = N - n_1$ of type 2, as

$$P(n_1, n_2 \mid N, \theta) = \frac{N!}{n_1! \, n_2!} \theta^{n_1} (1 - \theta)^{n_2}.$$

Describe how you would fit the data x to a binomial distribution using maximum likelihood estimation, and show that the result is the same as the empirical distribution $\tilde{p} = (\tilde{p}_H, \tilde{p}_T)$:

$$\theta^* = \frac{n_H}{N} =: \tilde{p}_H.$$

[10 marks]
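A numeric check of this result: the log-likelihood $\ell(\theta) = n_H \log\theta + n_T \log(1-\theta) + \text{const}$ is maximised at $\theta^* = n_H/N$. The simulated coin and the grid search are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
N, theta_true = 1000, 0.3
tosses = rng.random(N) < theta_true     # True = heads
n_H = int(tosses.sum())

def log_likelihood(theta):
    """n_H log(theta) + (N - n_H) log(1 - theta); the N!/(n1! n2!) term is constant."""
    return n_H * np.log(theta) + (N - n_H) * np.log(1.0 - theta)

grid = np.linspace(1e-4, 1 - 1e-4, 10001)
theta_star = grid[np.argmax(log_likelihood(grid))]
print(theta_star, n_H / N)              # agree to the grid resolution
```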
(d) A 2-dimensional Gaussian distribution is defined by the probability density function of $X = (X_1, X_2)^T$, parameterised by its mean $\mu = (\mu_1, \mu_2)^T$ and variance-covariance matrix

$$\Sigma = \begin{pmatrix} \sigma_1^2 & \rho \sigma_1 \sigma_2 \\ \rho \sigma_1 \sigma_2 & \sigma_2^2 \end{pmatrix}:$$

$$p(X \mid \mu, \Sigma) = \frac{1}{2 \pi \sigma_1 \sigma_2 \sqrt{1 - \rho^2}} \exp\!\left( -\frac{1}{2} (X - \mu)^T \Sigma^{-1} (X - \mu) \right).$$

Draw 3 contour plots of equiprobable values of $(X_1, X_2)$ for the cases (a) ρ = 0, (b) ρ < 0 and (c) ρ > 0, providing reasons for doing so. [6 marks]
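A sketch of how the three contour plots could be generated (ρ = 0 gives axis-aligned ellipses; ρ > 0 tilts them along the diagonal and ρ < 0 along the anti-diagonal). The unit variances and zero mean are illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt

x1, x2 = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
fig, axes = plt.subplots(1, 3, figsize=(12, 4))

for ax, rho in zip(axes, [0.0, -0.7, 0.7]):
    # Density for mu = 0, sigma_1 = sigma_2 = 1; the quadratic form reduces to
    # (x1^2 - 2*rho*x1*x2 + x2^2) / (1 - rho^2).
    z = (1.0 / (2 * np.pi * np.sqrt(1 - rho**2))) * np.exp(
        -(x1**2 - 2 * rho * x1 * x2 + x2**2) / (2 * (1 - rho**2)))
    ax.contour(x1, x2, z)                # curves of equiprobable (X1, X2)
    ax.set_title(f"rho = {rho}")
plt.show()
```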
5

Suppose we are training a neural network to do classification on the MNIST dataset. We run various experiments, plot the loss and accuracy over epochs, and observe the plots shown in the figures below.
(a) In order to generate the loss vs. epochs plot above, the learning rate is modified four times, resulting in the curves marked (a), (b), (c) and (d). The trends behave quite differently as a result of modifying the learning rate. You need to decide which learning rate to use. Which of the cases (a), (b), (c) or (d) would you choose, and why? [6 marks]
(b) Looking at the loss curve marked (a) in the loss plot, how can you modify your network to solve the issue? [6 marks]
(c) Now consider the accuracy of your model on training versus validation sets over epochs, and assume you observe the results shown in the above figure. Comparing the curves labelled (a) and (b), which both plot the validation accuracy, which of these cases illustrates overfitting? Can you explain why? [7 marks]
(d) How can you prevent your neural network model from overfitting? [6 marks]
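Two common remedies, dropout and L2 weight decay, are sketched below in PyTorch; the architecture and hyperparameters are illustrative assumptions, not prescribed by the question.

```python
import torch
import torch.nn as nn

# Illustrative MNIST-sized classifier with dropout between layers.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),        # randomly zeroes activations during training
    nn.Linear(256, 10),
)

# L2 regularisation via the optimiser's weight_decay term.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# Other options include early stopping on validation accuracy, data
# augmentation, or reducing model capacity / gathering more training data.
```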
END OF PAPER