Practice Exam
Introduction to Statistical Machine Learning COMPSCI 3314, 7314
Writing Time: 130 mins
Uploading Time: 30 mins
Total Duration: 160 mins

Questions: Answer all 6 questions
Time: 130 mins
Marks: 100 marks
Total: 100 marks
Overview of Machine Learning, etc.
Question 1
(a) Cross-validation is a method to (Choose the best single answer from multiple choices): [2 marks]
(A) Remove the curse of dimensionality
(B) Assess how the results of a machine learning model will generalise to an unseen data set
(C) Remove noise or outliers from a data set
(b) Kernel Principal Component Analysis is a method for (Choose the best single answer from multiple choices): [2 marks]
(A) Classification
(B) Reduction of dimensionality
(C) Probability estimation
(D) Regression
(c) Which of the following statements is best practice in Machine Learning for building a real system? (Choose the best single answer from multiple choices) [3 marks]
(A) Use all the data available for training to obtain optimal performance
(B) Use all the data available for testing the performance of your algorithm
(C) Split the training data into two separate sets. Use the first subset for training and perform cross-validation solely on the second subset
(D) Perform cross-validation on training, validation and testing sets
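(Illustration, not part of the exam paper: a minimal sketch of the data-splitting practice described in option (C), assuming NumPy and scikit-learn are available; the SVC estimator and the 70/30 split are arbitrary choices for the example.)

```python
# Hold out a test set, cross-validate only on the training portion,
# and touch the test set once at the very end.
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Held-out test set; model selection never sees it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Cross-validation is performed only on the training portion.
cv_scores = cross_val_score(SVC(kernel="rbf"), X_train, y_train, cv=5)
print("CV accuracy (training portion):", cv_scores.mean())

# The untouched test set is used once, for the final performance estimate.
final_model = SVC(kernel="rbf").fit(X_train, y_train)
print("Test accuracy:", final_model.score(X_test, y_test))
```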
(d) Which of the following statements about Machine Learning is False? (Choose the best single answer from multiple choices) [3 marks]
(A) Machine learning algorithms often suffer from the curse of dimensionality
(B) Machine learning algorithms cannot generalise to data that are not observed during training of the algorithm
(C) Machine learning algorithms are typically sensitive to noise
(D) Machine learning algorithms typically perform better in terms of testing accuracy when more training data become available
(e) Which of the following statements is (are) true? (Select all the correct ones) [4 marks]
(A) Gaussian mixture model (GMM) is a supervised learning method.
(B) With the correct step size and batch size, stochastic gradient descent always converges to the global minimum of the objective function for a support vector machine with polynomial kernels, if such a minimum exists.
(C) With the correct step size and batch size, stochastic gradient descent always converges to the global minimum of the objective function for a 10-layer convolutional neural network, if such a minimum exists.
(D) For k = 1, k-nearest neighbour (kNN) classifiers can achieve 100% accuracy on the training set, therefore implying that choosing k = 1 in kNN tends to produce the best model.
(E) Suppose one wants to train a classifier to categorise an input image into one of 1,000 categories, and there are over 10^6 labelled images available for training. For this problem, a deep convolutional neural network with 50 layers is more likely to work better than a support vector machine classifier.
(f) Which of the following modifications to the Gaussian mixture model will make it most similar to k-means? (Choose the best single answer from multiple choices) [4 marks]
(A) Restrict each covariance matrix Σ_i to have all off-diagonal entries equal to zero
(B) Restrict Σ_i to take the form αI. Here α is a positive real-valued scalar shared by all clusters and I is the identity matrix.
(C) Restrict each Σ_i to take the form α_i I. Here α_i is a positive real-valued scalar and I is the identity matrix.
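(Illustration, not part of the exam paper: a small NumPy sketch, under the assumption of equal mixing weights, showing how GMM responsibilities behave when every cluster covariance is restricted to αI.)

```python
# With a shared spherical covariance alpha*I, the responsibility for
# cluster k depends only on ||x - mu_k||^2, so as alpha shrinks the soft
# assignments approach k-means' hard nearest-centroid rule.
import numpy as np

def responsibilities(x, mus, alpha):
    # log N(x | mu_k, alpha*I) = const - ||x - mu_k||^2 / (2*alpha)
    sq_dists = np.sum((mus - x) ** 2, axis=1)
    logits = -sq_dists / (2 * alpha)
    logits -= logits.max()            # numerical stability
    p = np.exp(logits)
    return p / p.sum()

mus = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
x = np.array([2.0, 0.5])

for alpha in [10.0, 1.0, 0.01]:
    print(alpha, responsibilities(x, mus, alpha).round(3))
# As alpha -> 0 the responsibility vector concentrates on the nearest mean,
# which is exactly the k-means assignment step.
```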
[Total for Question 1: 18 marks]
Support Vector Machines (SVMs) and Kernels
Question 2
Let {(x_i, y_i)}_{i=1}^n be the training data for a binary classification problem, where x_i ∈ R^d and y_i ∈ {−1, 1}. Let w ∈ R^d be the parameter vector, b ∈ R be the offset, and ξ_i be the slack variable for i = 1, …, n.
Here the notation ⟨p, q⟩ = p · q calculates the inner product of two vectors.
(a) What is wrong with the following primal form of the soft margin SVM? [2 marks]

    min_{w,b,ξ}  (1/2)∥w∥² + C Σ_{i=1}^{n} ξ_i,
    s.t.  y_i(⟨x_i, w⟩ + b) ≥ 1 − ξ_i,  i = 1, …, n.
(b) After fixing the problem in the above form, what is the estimated w if C = 0? [2 marks]
(c) The dual form of the soft margin SVM is given below. How would you modify it (slightly) to make it become the dual form of the hard margin SVM? [2 marks]
    max_{α}  Σ_{i=1}^{n} α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩
    s.t.  0 ≤ α_i ≤ C,  i = 1, …, n,
          Σ_{i=1}^{n} α_i y_i = 0.
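(Illustration, not part of the exam paper: a NumPy sketch that merely evaluates the dual objective above for a given feasible α with the linear kernel; solving the QP itself would need a dedicated solver and is not shown.)

```python
# Evaluate the soft-margin dual objective for a given alpha and linear kernel.
import numpy as np

def dual_objective(alpha, X, y):
    K = X @ X.T                        # Gram matrix of inner products <x_i, x_j>
    return alpha.sum() - 0.5 * (alpha * y) @ K @ (alpha * y)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)

alpha = np.full(6, 0.1)                # feasible: 0 <= alpha_i <= C, sum(alpha_i * y_i) = 0
print(dual_objective(alpha, X, y))
```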
(d) Express b using the dual variables and the training data. [3 marks]
(e) An RBF kernel corresponds to lifting to a feature space of how many dimensions? [3 marks]
(f) Let u = [w; b] and z = [x; 1]. We can rewrite ⟨w, x⟩ + b as ⟨u, z⟩. This means that if we augment the training data {(x_i, y_i)}_{i=1}^n to {(z_i, y_i)}_{i=1}^n, where z_i = [x_i; 1], we only need to learn one parameter u instead of two parameters w and b. [6 marks]
1. Please write down the primal form of the soft margin SVM using the decision function sign[⟨u, z⟩].
2. Is the new primal form equivalent to the old primal form? In other words, if we train two SVMs (the standard SVM and this new re-parameterised SVM), will we, in general, obtain exactly the same classification function?
3. Please prove your answer to the above question (i.e., use a derivation to show why they are or are not equivalent).
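(Illustration, not part of the exam paper: a short NumPy check of the identity ⟨u, z⟩ = ⟨w, x⟩ + b that the augmentation in part (f) relies on; the particular numbers are arbitrary.)

```python
# The augmented vectors reproduce <w, x> + b exactly.
import numpy as np

w = np.array([0.5, -2.0, 1.0])
b = 0.7
x = np.array([1.0, 2.0, 3.0])

u = np.concatenate([w, [b]])           # u = [w; b]
z = np.concatenate([x, [1.0]])         # z = [x; 1]

print(np.dot(w, x) + b, np.dot(u, z))  # identical values
# Note: the usual penalty ||w||^2 becomes ||u||^2 = ||w||^2 + b^2 under this
# re-parameterisation, i.e. the offset is now regularised as well.
```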
(g) Suppose that we have a kernel K(·, ·) such that there is an implicit high-dimensional feature map Φ : R^d → R^D satisfying, for all x, z ∈ R^d,

    K(x, z) = ⟨Φ(x), Φ(z)⟩ = Φ(x)ᵀΦ(z) = Σ_{i=1}^{D} Φ(x)_i Φ(z)_i,

the inner product in the D-dimensional space.
Show how to compute the squared ℓ2 distance in the D-dimensional space,

    ∥Φ(x) − Φ(z)∥² = Σ_{i=1}^{D} (Φ(x)_i − Φ(z)_i)²,

without explicitly calculating the values of the D-dimensional vectors. You are asked to provide a formal proof. [6 marks]
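(Illustration, not part of the exam paper: a numerical sanity check of the kind of kernel-only computation part (g) is after, using the homogeneous polynomial kernel K(x, z) = ⟨x, z⟩², chosen only because its feature map can be written out explicitly for comparison.)

```python
# Compare the explicit D-dimensional distance with a kernel-only computation.
import numpy as np

def K(x, z):
    return np.dot(x, z) ** 2

def phi(x):
    # Explicit feature map for <x, z>^2 in 2-D: [x1^2, x2^2, sqrt(2)*x1*x2]
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])

lhs = np.sum((phi(x) - phi(z)) ** 2)       # explicit distance in feature space
rhs = K(x, x) - 2 * K(x, z) + K(z, z)      # computed from kernel values only
print(lhs, rhs)                            # the two values agree
```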
[Total for Question 2: 24 marks]
Boosting
Question 3
(a) True or False: AdaBoost must give zero training error regardless of the type of weak classifiers it uses, provided enough iterations are performed. [3 marks]
(b) The AdaBoost algorithm has two drawbacks. Answer the following questions regarding these. [6 marks]
(I) Show mathematically why a weak learner with < 50% predictive accuracy presents a problem for AdaBoost.
(II) AdaBoost is susceptible to outliers. Suggest a simple heuristic that may alleviate this.
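(Illustration, not part of the exam paper: a short sketch of the AdaBoost classifier weight α_t = ½ ln((1 − ε)/ε) as a function of the weighted error ε, the quantity part (b)(I) asks you to reason about; it assumes the standard discrete AdaBoost weighting.)

```python
# Classifier weight alpha_t for a range of weighted errors eps.
import numpy as np

for eps in [0.1, 0.3, 0.49, 0.5, 0.6]:
    alpha = 0.5 * np.log((1 - eps) / eps)
    print(eps, round(alpha, 3))
# alpha is positive only when eps < 0.5; at eps = 0.5 it is zero, and for
# eps > 0.5 it is negative, i.e. the weak learner's votes would be inverted.
```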
(c) Assume that the weak learners are a finite set of decision stumps, and we train an AdaBoost classifier. Can the boosting algorithm select the same weak classifier more than once? Explain. [5 marks]
[Total for Question 3: 14 marks]
Neural Networks
Question 4
(a) Which one of the following statements is true about neural networks? (Select the single best answer) [3 marks]
(A) We always train neural networks by optimising a convex cost function.
(B) Neural networks are more robust to outliers than support vector machines.
(C) Neural networks always output values between 0 and 1.
(D) A neural network with a large number of parameters can often make better use of big training data than support vector machines.
(b) Which of the following statements is (are) true about neural networks? [3 marks]
(A) The training time depends on the size of the network as well as the training data.
(B) The perceptron is a single-layer recurrent neural network.
(C) In image processing, convolutional networks are usually preferred over fully connected networks.
(D) Neural networks cannot be used for solving regression problems.
(c) True or False: The training strategy "back-propagation" in neural networks is essentially to calculate gradients of the network parameters using the chain rule. [3 marks]
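(Illustration, not part of the exam paper: a minimal sketch of back-propagation as the chain rule for a one-hidden-layer network with squared-error loss, verified against a finite-difference gradient; the sigmoid activation and the layer sizes are assumptions made only for this example.)

```python
# Manual backward pass via the chain rule, checked by finite differences.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=(1, 3))
x, t = rng.normal(size=2), 1.0

# Forward pass
a1 = W1 @ x
h = sigmoid(a1)
y = W2 @ h
loss = 0.5 * (y[0] - t) ** 2

# Backward pass (chain rule applied layer by layer)
dy = y[0] - t                          # dL/dy
dW2 = dy * h[None, :]                  # dL/dW2
dh = dy * W2[0]                        # dL/dh
da1 = dh * h * (1 - h)                 # dL/da1 through the sigmoid
dW1 = np.outer(da1, x)                 # dL/dW1

# Finite-difference check on one entry of W1
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
num = (0.5 * ((W2 @ sigmoid(W1p @ x))[0] - t) ** 2 - loss) / eps
print(dW1[0, 0], num)                  # the two values should closely agree
```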
(d) When training a convolutional neural network for speech recognition, one finds that performance on the training set is very good while the performance on the validation set is unacceptably low. A reasonable fix might be to: (Select the single best answer) [4 marks]
(A) Decrease the weight decay
(B) Reduce the training set size
(C) Reduce the number of layers and neurons
(D) Increase the number of layers and neurons
(e) In neural networks, nonlinear activation functions such as sigmoid, tanh, and ReLU (Select the single best answer) [4 marks]
(A) speed up the gradient calculation in back-propagation as compared to linear units
(B) help to learn nonlinear decision boundaries
(C) always output values between 0 and 1
(D) are applied only to the output units
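(Illustration, not part of the exam paper: a small NumPy sketch showing that stacking layers without a nonlinear activation collapses to a single linear map, whereas inserting a ReLU does not; the random weights are arbitrary.)

```python
# Without a nonlinearity, two stacked linear layers equal one linear map.
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x = rng.normal(size=3)

linear_stack = W2 @ (W1 @ x)
single_layer = (W2 @ W1) @ x
print(np.allclose(linear_stack, single_layer))   # True: same linear map

relu_stack = W2 @ np.maximum(W1 @ x, 0.0)
print(np.allclose(relu_stack, single_layer))     # generally False
```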
[Total for Question 4: 17 marks]
Regression
Question 5
(a) Linear regression solves the following optimisation problem:

    min_{w}  ∥Xw − y∥²,

where X is the data matrix (each row is a data vector), y is the label column vector, and w is the parameter vector to be learned. Note that we have omitted the offset parameter b here.
Write down the solution of linear regression (i.e., the optimal w). [3 marks]
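(Illustration, not part of the exam paper: a NumPy check that the normal-equations solution agrees with a library least-squares solver on synthetic data; it assumes X has full column rank so that XᵀX is invertible.)

```python
# Closed-form least squares vs. np.linalg.lstsq on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

w_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)     # (X^T X)^{-1} X^T y
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_normal_eq, w_lstsq))             # True
```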
(b) Write down the formulation of support vector regression (both primal and dual). [3 marks]
[Total for Question 5: 6 marks]
Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA)
Question 6
In this problem two linear dimensionality reduction methods will be discussed. They are principal component analysis (PCA) and linear discriminant analysis (LDA).
(a) LDA reduces the dimensionality given labels by maximising the overall interclass variance relative to the intraclass variance. Plot the (rough) directions of the first PCA and LDA components in the following figures respectively. In the figures, squares and circles represent two different classes of data points. [5 marks]
[Figure: two copies of the same two-class scatter plot, labelled (a) First PCA component and (b) First LDA component, on which the directions are to be drawn.]
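(Illustration, not part of the exam paper: a NumPy sketch that computes a first PCA direction and a two-class LDA direction for synthetic data, to contrast what the two methods maximise; the data generation and the w ∝ S_w⁻¹(μ₁ − μ₂) form of two-class LDA are assumptions for this example, not the exam figure.)

```python
# PCA ignores labels and follows overall variance; LDA follows class separation.
import numpy as np

rng = np.random.default_rng(0)
# Two elongated classes separated along a direction of low overall variance.
class1 = rng.normal(size=(100, 2)) @ np.diag([3.0, 0.3]) + np.array([0.0, 1.0])
class2 = rng.normal(size=(100, 2)) @ np.diag([3.0, 0.3]) + np.array([0.0, -1.0])
X = np.vstack([class1, class2])

# First PCA component: top eigenvector of the covariance of the pooled data.
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
pca_dir = eigvecs[:, -1]

# Two-class LDA direction: within-class scatter inverse times mean difference.
mu1, mu2 = class1.mean(axis=0), class2.mean(axis=0)
Sw = np.cov(class1, rowvar=False) + np.cov(class2, rowvar=False)
lda_dir = np.linalg.solve(Sw, mu1 - mu2)
lda_dir /= np.linalg.norm(lda_dir)

print("PCA direction:", pca_dir.round(2))   # roughly along the high-variance axis
print("LDA direction:", lda_dir.round(2))   # roughly along the class-separating axis
```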
(b) In a supervised binary classification task we have a total of 15 features, among which only 4 are useful for predicting the target variable; the other features are pure random noise with very high variance. What complicates matters further is that the 4 features, when considered individually, show no predictive power, and only work when considered together.
Consider each of the following dimension reduction (or classification) techniques and indicate whether it may be able to successfully identify the relevant dimensions (Yes or No). Briefly explain why. [16 marks]
(I) Principal Component Analysis
(II) Linear Discriminant Analysis
(III) AdaBoost with decision stumps as the weak learner
(IV) AdaBoost with a linear support vector machine as the weak learner
[Total for Question 6: 21 marks]