Question 2 [23 marks] In this question, we will apply gradient descent to a simulated dataset. You may make use of numpy and matplotlib. You are not permitted to make use of any existing numpy implementations of gradient descent (if they exist). Generate data using the following Python code:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42) # make sure you run this line for consistency
x = np.random.uniform(1, 2, 100)
y = 1.2 + 2.9 * x + 1.8 * x**2 + np.random.normal(0, 0.9, 100)
plt.scatter(x, y)
plt.show()
Your data should look like the following:
[Figure: scatter plot of the generated (x, y) data]
Then, consider the loss function
$$L_c(x, y) = \sqrt{\frac{1}{c^2}(x - y)^2 + 1} - 1,$$
where $c \in \mathbb{R}$ is a hyper-parameter.
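As an aside, a minimal numpy sketch of this loss (the function name is an illustrative assumption, not part of the question):

def loss_c(x, y, c):
    # assumes numpy has been imported as np (as in the data-generation code above)
    # L_c(x, y) = sqrt((x - y)^2 / c^2 + 1) - 1, applied element-wise
    return np.sqrt((x - y) ** 2 / c ** 2 + 1) - 1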
(a) [3 marks] Consider the (simple) linear model $\hat{y}_i = w_0 + w_1 x_i$ for $i = 1, \dots, n$. We can write this more succinctly by letting $w = (w_0, w_1)^T$ and $X_i = (1, x_i)^T$, so that $\hat{y}_i = w^T X_i$. The loss achieved by our model ($w$) on a given dataset of $n$ observations is then
$$L_c(y, \hat{y}) = \sum_{i=1}^n L_c(y_i, \hat{y}_i) = \sum_{i=1}^n \left[ \sqrt{\frac{1}{c^2}(y_i - w^T X_i)^2 + 1} - 1 \right].$$
Compute the following derivatives:
$$\frac{\partial L_c(y_i, \hat{y}_i)}{\partial w_0} \quad \text{and} \quad \frac{\partial L_c(y_i, \hat{y}_i)}{\partial w_1}.$$
If you are unable to figure this out, you may use Wolfram Alpha or a similar program to compute the derivatives in order to make use of them in later questions. Note that you must show your working for full marks, so solutions using a tool like Wolfram Alpha will receive a mark of zero.
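One way to organise the working (a sketch of the chain-rule structure, not a full solution): let $u_i = y_i - w^T X_i$ and $g(u) = \sqrt{u^2/c^2 + 1} - 1$, so that $L_c(y_i, \hat{y}_i) = g(u_i)$. Then
$$\frac{\partial L_c(y_i, \hat{y}_i)}{\partial w_j} = g'(u_i)\,\frac{\partial u_i}{\partial w_j}, \qquad \frac{\partial u_i}{\partial w_0} = -1, \qquad \frac{\partial u_i}{\partial w_1} = -x_i,$$
so only $g'(u)$ needs to be computed once.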
(b) [2 marks] Using the gradients computed in (a), write down the gradient descent updates for $w_0$ and $w_1$ (using pseudocode), assuming a step size of $\alpha$. Note that in gradient descent we consider the loss over the entire dataset, not just at a single observation. For simplicity, assume that your updates are always based on the values at the previous time step, even if you have access to the value at the current time step (i.e., when updating multiple parameters, the update for $w_1^{(t+1)}$ might depend on $w_0$, so you could use the new value of $w_0$, namely $w_0^{(t+1)}$, since it has already been computed, but here we assume that we just use the old value $w_0^{(t)}$).
(c) [12 marks] In this section, you will implement gradient descent from scratch on the generated dataset using the gradients computed in (a), and the pseudocode in (b). Initialise your weight vector to $w^{(0)} = [1, 1]^T$, and let $c = 2$. Consider step sizes $\alpha \in [1, 0.1, 0.01, 0.001, \dots, 0.00000001]$ (a total of 9 step sizes). For each step size, generate a plot of the loss achieved at each iteration of gradient descent. You should use 100 iterations in total. Generate a $3 \times 3$ grid of figures showing performance for each step size. Add a screenshot of the Python code for this question in your answers PDF file (as well as pasting the code into your solutions.py file). The following code may help with plotting:
fig, axs = plt.subplots(3, 3, figsize=(10, 10))
alphas = [10e-1, 10e-2, 10e-3, 10e-4, 10e-5, 10e-6, 10e-7, 10e-8, 10e-9]
for i, ax in enumerate(axs.flat):
    # losses is a list of 9 elements. Each element is an array of length 100
    # storing the loss at each iteration for that particular step size
    ax.plot(losses[i])
    ax.set_title(f"step size: {alphas[i]}")  # plot titles
plt.tight_layout()  # plot formatting
plt.show()
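For illustration, a minimal sketch of how one run of gradient descent might be organised (the function name and internal variable names are assumptions; the gradient lines are only a sketch based on the loss $L_c$ defined above, and should be replaced by whatever you derive in parts (a) and (b)):

def gradient_descent(x, y, w0, w1, alpha, c=2.0, iters=100):
    # assumes numpy (np) and the generated data x, y from above are in scope
    # runs full-batch gradient descent and records the total loss at each iterate
    losses = []
    for _ in range(iters):
        r = y - (w0 + w1 * x)                        # residuals y_i - w^T X_i
        root = np.sqrt(r ** 2 / c ** 2 + 1.0)
        losses.append(np.sum(root - 1.0))            # loss over the entire dataset
        grad_w0 = np.sum(-r / (c ** 2 * root))       # sketch of dL_c/dw0 summed over i
        grad_w1 = np.sum(-x * r / (c ** 2 * root))   # sketch of dL_c/dw1 summed over i
        w0, w1 = w0 - alpha * grad_w0, w1 - alpha * grad_w1  # simultaneous update, as in (b)
    return np.array(losses), w0, w1

# e.g. the losses list used in the plotting snippet above could then be built as
# losses = [gradient_descent(x, y, 1.0, 1.0, a)[0] for a in alphas]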
(d) [1 mark] Comment on your results in part (c): what do you see happening for different step sizes? Why does this occur?
(e) [3 marks] Based on your results in part (c), choose an appropriate step size and re-run the gradient descent to find the optimal model parameters. State this step size, the final model, and generate two plots: the first showing the weights $w_0$ and $w_1$ over the 100 iterations, and the second showing the data with the final model superimposed. What do you think of the final model? Are there any obvious ways to improve it? Be sure to include the plots in your answers PDF file, and the code used in your solutions.py file.
(f) [2 marks] Consider the following scenario: you re-run the analysis for various values of $c$ and you notice that the optimal step size varies across different values of $c$. Describe an approach to choosing an appropriate value of $c$.
Question 3 [25 marks] This question covers the (kernel) perceptron and requires you to refer to the following training data for parts (a)-(c). You are only permitted to make use of numpy and matplotlib. You are not permitted to make use of any existing numpy implementations of perceptrons (if they exist).
 x1      x2     y
-0.8     1      1
 3.9     0.4   -1
 1.4     1      1
 0.1    -3.3   -1
 1.2     2.7   -1
-2.45    0.1   -1
-1.5    -0.5    1
 1.2    -1.5    1

Table 1: Data for Question 3
(a) [3 marks] Plot the data in Table 1. Recall that the polynomial kernel is defined as $k(x, y) = (m + x^T y)^d$ for $m \in \{0, 1, 2, \dots\}$ and $d \in \{1, 2, \dots\}$. Each such kernel corresponds to a feature representation of the original data. Find the simplest polynomial kernel for which this data becomes linearly separable (note: simplest in this case is defined as the polynomial kernel with the smallest values of both $m$ and $d$).
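For reference, a minimal numpy sketch of this kernel (the function name is an illustrative assumption):

def poly_kernel(x, y, m, d):
    # polynomial kernel k(x, y) = (m + x^T y)^d for vectors x and y
    return (m + np.dot(x, y)) ** d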
(b) [2 marks] The optimal kernel found in (a) induces a feature representation in $\mathbb{R}^p$ for some integer $p$ determined by your choice of kernel. Choose a subset of two coordinates from this $p$-dimensional vector and plot the transformed data. For example, for a vector $(w, x, y, z)^T \in \mathbb{R}^4$, a subset of two coordinates could be the first two coordinates $(w, x)$, or the first and third coordinates $(w, y)$, etc. Is your transformed data linearly separable?
(c) [10 marks] Train a kernel perceptron on this data with initial weight vector $w^{(0)} = 1_p$ (the vector of ones in $\mathbb{R}^p$). Use a learning rate of $\eta = 0.2$. Note: this should be done in numpy. Provide a table outlining all updates of the weight vector, and the iteration number at which the update occurred. State the final learned perceptron and the number of iterations until convergence. Demonstrate that your perceptron correctly classifies each of the points. You may use the following table as a template for presenting your results:
Iteration No.              w_0        w_1        ...    w_p
0                          1          1          ...    1
first update iteration     1 + Δ_0    1 + Δ_1    ...    1 + Δ_p
...                        ...        ...        ...    ...

where Δ_j is the update for w_j computed at the first iteration. To demonstrate that your perceptron classifies each point correctly, use the following table:
x_i                φ(x_i)                  y_i φ(x_i)^T w*
[-0.8, 1]^T        φ([-0.8, 1]^T)          r_1 > 0
...                ...                     ...
[1.2, -1.5]^T      φ([1.2, -1.5]^T)        r_8 > 0

where $r_i = y_i \phi(x_i)^T w^*$ should be positive if your perceptron has converged. Along with your table(s) of results, provide a screenshot of your code in your answers PDF file, and add the code to your solutions.py file.
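For illustration only, a minimal numpy sketch of a perceptron trained on an explicitly transformed dataset; Phi is assumed to be the n × p array whose rows are φ(x_i) for the kernel you selected in (a), and the function and variable names are assumptions rather than a prescribed solution:

def train_perceptron(Phi, y, eta=0.2, max_epochs=100):
    # Phi: (n, p) array of transformed features phi(x_i); y: labels in {-1, +1}
    w = np.ones(Phi.shape[1])                   # initial weight vector w^(0) = 1_p
    updates = []                                # (iteration, w) pairs for the results table
    iteration = 0
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(Phi.shape[0]):
            iteration += 1
            if y[i] * np.dot(w, Phi[i]) <= 0:   # misclassified point (r_i not strictly positive)
                w = w + eta * y[i] * Phi[i]     # perceptron update with learning rate eta
                updates.append((iteration, w.copy()))
                mistakes += 1
        if mistakes == 0:                       # a full pass with no mistakes: converged
            break
    return w, updates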
(d) [5 marks] Let $x, y \in \mathbb{R}^2$ (i.e., $x$ and $y$ are two-dimensional vectors), and consider the kernel
$$k(x, y) = (2x^T y + 3)^3.$$
Compute the feature vector $\phi(x)$ corresponding to this kernel (in other words, the feature representation of $x$ and $y$ such that $\phi(x)^T \phi(y) = k(x, y)$).
(e) [5 marks] Consider the following dataset. The positive examples (class $= +1$) are:
$$(2, 0), \quad (0, 2), \quad (1, 1), \quad \left(\tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2}}\right),$$
and the negative examples (class $= -1$) are:
$$(-2, 0), \quad (0, -2), \quad (-1, -1), \quad \left(-\tfrac{1}{\sqrt{2}}, -\tfrac{1}{\sqrt{2}}\right).$$
Claim: An ordering of the examples in this dataset on which a perceptron (with learning rate $\eta = 1$) makes at least 5 mistakes during training cannot exist.
If you agree with this claim, provide a proof of why it must be true. Otherwise, provide an ordering of the examples on which a perceptron makes at least 5 mistakes during training.
Question 4 [20 marks] This question covers the Support Vector Machine and requires you to refer to the following two-dimensional training data:
 x1    x2    y
-3     9    -1
-2     4    -1
-1     1    -1
 0    -3     1
 1     1     1
 2     4    -1
 3     9    -1
(a) [1 mark] Plot the data (no need to provide any code if you do this in Python). Is the data linearly separable?
(b) [11 marks] In this section, we will build an SVM classifier by hand. Recall that the dual formulation of the SVM classifier for a dataset of size $n$ is
$$\arg\max_{\alpha_1, \dots, \alpha_n} \; \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j,$$
subject to
$$\alpha_i \ge 0 \text{ for all } i, \qquad \sum_{i=1}^n \alpha_i y_i = 0.$$
Solve for $(\alpha_1, \dots, \alpha_7)$ for the provided dataset. (Hint: Using the plot in the previous section and your knowledge of SVMs, try to simplify the (dual) objective before doing any calculus.)

(c) [3 marks] For your SVM, compute the corresponding weight vector ($w \in \mathbb{R}^2$) and bias $t$. Superimpose a plot of the SVM onto the scatter in (a). What is the margin for this model?

(d) [2 marks] Discuss briefly the following claim: Linear classifiers are unable to represent non-linear functions.

(e) [3 marks] In your own words, explain what is meant by the 'kernel trick'. Why is this such an important technique for machine learning?
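If you would like to sanity-check the $\alpha$ values, weight vector, and bias you derive by hand in (b) and (c), one minimal sketch uses scikit-learn's SVC; the very large $C$ below is an assumption used only to approximate the hard-margin SVM for checking purposes:

import numpy as np
from sklearn.svm import SVC

# Data from the table above
X = np.array([[-3, 9], [-2, 4], [-1, 1], [0, -3], [1, 1], [2, 4], [3, 9]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, -1, -1])

svm = SVC(kernel="linear", C=1e10)   # large C approximates the hard-margin SVM
svm.fit(X, y)

print(svm.support_)                  # indices of the support vectors
print(svm.dual_coef_)                # y_i * alpha_i for each support vector
print(svm.coef_, svm.intercept_)     # weight vector w and intercept (relate to your bias t)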
Question 5 [20 marks] In this question, we will consider the Scikit-learn implementations of the following classifiers:
• Decision Trees
• Random Forest
• AdaBoost
• Logistic Regression
• Multilayer Perceptron (Neural Network)
• Support Vector Machine
You are required to compare the performance of the above models on a binary classification task. The following code loads in these classifiers and defines a function to simulate a toy dataset:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import warnings
warnings.simplefilter(action=’ignore’, category=FutureWarning)
import time
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification

def create_dataset():
    X, y = make_classification(n_samples=1250,
                               n_features=2,
                               n_redundant=0,
                               n_informative=2,
                               random_state=5,
                               n_clusters_per_class=1)
    rng = np.random.RandomState(2)
    X += 3 * rng.uniform(size=X.shape)
    linearly_separable = (X, y)
    X = StandardScaler().fit_transform(X)
    return X, y
(a) [3 marks] Generate a dataset. Then, randomly split the dataset into training set Xtrain and test set Xtest, with 80% of the examples for training and 20% for testing. Plot the decision boundaries of each of the classifiers on the test set. You may wish to use the following plotter function, which plots the decision boundary and can work for any sklearn model.
def plotter(classifier, X, X_test, y_test, title, ax=None):
    # plot decision boundary for given classifier
    plot_step = 0.02
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                         np.arange(y_min, y_max, plot_step))
    Z = classifier.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    if ax:
        ax.contourf(xx, yy, Z, cmap=plt.cm.Paired)
        ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test)
        ax.set_title(title)
    else:
        plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
        plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test)
        plt.title(title)
Note that you can use the same general approach for plotting a grid as used in Question 2, and the plotter function supports an 'ax' argument. Note that this plotter function differs slightly from the one in the sample final. Be sure to include all your code in solutions.py, and any figures generated in your answers PDF file.
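For illustration, a minimal sketch of one way to produce the grid of decision boundaries (the classifier settings, the random_state, and the variable names are assumptions, not required choices):

# assumes the imports, create_dataset() and plotter() defined above are in scope
X, y = create_dataset()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=42)  # random_state is an assumption

classifiers = {
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "Logistic Regression": LogisticRegression(),
    "Neural Network": MLPClassifier(),
    "SVM": SVC(),
}

fig, axs = plt.subplots(2, 3, figsize=(15, 10))
for ax, (name, clf) in zip(axs.flat, classifiers.items()):
    clf.fit(X_train, y_train)                      # fit on the full training set
    plotter(clf, X, X_test, y_test, name, ax=ax)   # decision boundary + test points
plt.tight_layout()
plt.show()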
(b) [10 marks] Now, test how the performance of each of the classifiers varies as you increase the size of the training set. Fix your training and test sets from part (a). Then, starting from a random subset (chosen with replacement) of your training set of size 50, train your classification model, and compute the accuracy on the test set. Repeat this process for training set sizes of [50, 100, 200, 300, . . . , 1000]. Repeat the experiment a total of 10 times for each classifier. Then, for each classifier, plot its average accuracy at each training set size. Compare the accuracy across different algorithms in a single figure, and in 5 lines or less, discuss your observations. Please use the following color code for your plots: [Decision Tree, Random Forest, AdaBoost, Logistic Regression, Neural Network, SVM]. Be sure to include all your code in solutions.py, and any figures generated in your answers PDF file.
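A minimal sketch of one way to organise this experiment (the seed and the bookkeeping variables are illustrative assumptions, not a prescribed solution):

# assumes X_train, X_test, y_train, y_test and the classifiers dict from part (a) are in scope
from sklearn.metrics import accuracy_score

sizes = [50, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
n_trials = 10
rng = np.random.default_rng(0)                   # seed is an assumption

mean_acc = {name: [] for name in classifiers}
for name, clf in classifiers.items():
    for size in sizes:
        accs = []
        for _ in range(n_trials):
            idx = rng.choice(len(X_train), size=size, replace=True)  # sample with replacement
            clf.fit(X_train[idx], y_train[idx])
            accs.append(accuracy_score(y_test, clf.predict(X_test)))
        mean_acc[name].append(np.mean(accs))     # average accuracy over the 10 trials

for name in classifiers:
    plt.plot(sizes, mean_acc[name], label=name)
plt.xlabel("training set size")
plt.ylabel("mean test accuracy")
plt.legend()
plt.show()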
(c) [7 marks] Using the time module, record the training time for each of the classifiers at each of the training set sizes. Plot the average training time over the 10 trials of each classifier as a function of the training size. You may add code for this section to your code in part (b). What do you observe? In 5 lines or less, discuss your observations. Use the same color scheme as in (b). Be sure to include your code in solutions.py, and any figures generated in your answers PDF file.
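A minimal sketch of the timing itself, intended to slot into the trial loop from part (b); train_times is an assumed dictionary of lists mirroring the accuracy bookkeeping above:

start = time.time()                              # time just the call to fit
clf.fit(X_train[idx], y_train[idx])
train_times[name].append(time.time() - start)    # elapsed training time in seconds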