COMP9417: A collection of sample exam exercises August 5, 2021
Note to student: Some of these questions are longer/more difficult than the questions that will be on the actual final exam. With that being said, you should aim to work through and understand all questions here as part of your revision. Note that this is not a sample final exam, it is a collection of possible exercises, and the actual final exam will be shorter (more details to be announced later in week 10). Solutions to these problems will be released at the end of Week 10.
Question 1
Note: this question was incorporated into the Ensemble learning lab and the code there is much improved. Refer to that lab for a better approach to this problem. This question requires you to implement the Adaptive Boosting Algorithm from lectures. Use the following code to generate a toy binary classification dataset:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_blobs
np.random.seed(2)
n_points = 15
X, y = make_blobs(n_points, 2, centers=[(0,0), (-1,1)])
y[y==0] = -1 # use -1 for negative class instead of 0
plt.scatter(*X[y==1].T, marker="+", s=100, color="red")
plt.scatter(*X[y==-1].T, marker="o", s=100, color="blue")
plt.show()
Your data should look like:
(a) By now, you will be familiar with the scikit-learn DecisionTreeClassifier class. Fit decision trees of increasing maximum depth for depths ranging from 1 to 9. Plot the decision boundaries of each of your models in a 3 × 3 grid. You may find the following helper function useful:
def plotter(classifier, X, y, title, ax=None):
    # plot decision boundary for given classifier
    plot_step = 0.02
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                         np.arange(y_min, y_max, plot_step))
    Z = classifier.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    if ax:
        ax.contourf(xx, yy, Z, cmap=plt.cm.Paired)
        ax.scatter(X[:, 0], X[:, 1], c=y)
        ax.set_title(title)
    else:
        plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
        plt.scatter(X[:, 0], X[:, 1], c=y)
        plt.title(title)
Solution:
The code used is:
fig, ax = plt.subplots(3, 3, figsize=(10, 10))
titles = [f"DT max_depth={i}" for i in range(1, 10)]
for i, ax in enumerate(ax.flat):
    dt = DecisionTreeClassifier(max_depth=i+1).fit(X, y)
    plotter(dt, X, y, titles[i], ax)
plt.tight_layout()  # apply the layout before saving so the saved figure matches
plt.savefig("boosting_a.png")
plt.show()
(b) Comment on your results in (a). What do you notice as you increase the depth of the trees? What do we mean when we say that trees have low bias and high variance?
(c) We now restrict attention to trees of depth 1. These are the most basic decision trees and are commonly referred to as decision stumps. Consider the adaptive boosting algorithm presented in the ensemble methods lecture notes on slide 50/70. In adaptive boosting, we build a model composed of T weak learners from a set of weak learners. At step t, we pick a model from the set of weak learners that minimises weighted error:
$$\varepsilon_t = \sum_{i=1}^{n} w_{t-1,i}\, I\{y_i \neq \hat{y}_i\},$$

where $w_{t-1,i}$ is the weight at the previous step for observation $i$, and $I\{y_i \neq \hat{y}_i\}$ is equal to 1 if $y_i \neq \hat{y}_i$ and zero otherwise. We do this for a total of T steps, which gives us a boosted model composed of T base classifiers:

$$M(x) = \sum_{t=1}^{T} \alpha_t M_t(x),$$

where $\alpha_t$ is the weight assigned to the t-th model. Classification is then carried out by assigning a point to the positive class if M(x) > 0, or to the negative class if M(x) < 0. Here we will take the class of weak learners to be the class of decision stumps. You may make use of the ‘sample_weight’ argument in the ‘fit()’ method to assign weights to the individual data points. Write code to build a boosted classifier for T = 15. Demonstrate the performance of your model on the generated dataset by printing out a list of your predictions versus the true class labels. (Note: you may be concerned that the decision tree implementation in scikit-learn does not actually minimise $\varepsilon_t$ even when weights are assigned, but we will ignore this detail for the current question.)
Solution:
Something along the lines of: low bias, since the shape of the learned function is highly flexible and can adapt to the shape of the data; high variance, since DTs can fit any noise in the data. Some discussion of the bias-variance decomposition would also be nice.
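To make the variance claim concrete, a minimal sketch reusing X, y from the question setup (the query point is an arbitrary choice for illustration, not part of the question):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Sketch: deep trees are high variance -- refitting on bootstrap resamples
# of the same 15 points can flip the prediction at a fixed query point.
x_query = np.array([[-0.5, 0.5]])
preds = []
for seed in range(10):
    rng = np.random.RandomState(seed)
    idx = rng.choice(len(X), size=len(X), replace=True)
    dt = DecisionTreeClassifier(max_depth=9).fit(X[idx], y[idx])
    preds.append(int(dt.predict(x_query)[0]))
print(preds)   # typically a mix of +1 and -1 across resamples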
Solution:
One approach is as follows:
## Question 1 part (c)
def weighted_error(w, y, yhat):
    return np.sum(w * (y != yhat))

T = 15
w = np.zeros((T+1, n_points))
w[0] = np.ones(n_points) / n_points
# storage for boosted model
alphas = np.zeros(T)
component_models = []

def boosted_model(x, alphas, component_models):
    individual_preds = np.array([m.predict(x) for m in component_models])
    alphaM = alphas * individual_preds.T
    return np.sign(alphaM.sum(axis=1))
for t in range(1, T+1):
    # select weak classifier that minimises weighted error
    dt = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w[t-1])
    # compute predictions from current model
    yhat = dt.predict(X)
    # compute weighted error epsilon_t
    eps_t = weighted_error(w[t-1], y, yhat)
    # compute alpha_t (model weight)
    alpha_t = 0.5 * np.log((1-eps_t)/eps_t)
    # update weights for next round
    for i in range(n_points):
        if y[i] == yhat[i]:
            # correctly classified instances
            w[t, i] = w[t-1, i]/(2 * (1-eps_t))
        else:
            # misclassified instances
            w[t, i] = w[t-1, i]/(2 * eps_t)
    # save model
    alphas[t-1] = alpha_t
    component_models.append(dt)

boosted_preds = boosted_model(X, alphas, component_models)
np.all((boosted_preds * y) == 1)  # check all classified correctly
(d) In this question, we will extend our implementation in (c) to be able to use the plotter function in (b). To do this, we need to implement a boosting model class that has a ‘predict’ method. Once you do this, repeat (c) for T = [2,...,17]. Plot the decision boundary of your 16 models in a 4 × 4 grid. The following template may be useful:
class boosted_model:
    def __init__(self, T):
        self.alphas = # your code here
        # your code here
    def predict(self, x):
        # your code here
Solution:
class boosted_model:
    def __init__(self, T):
        self.alphas = np.zeros(T)
        self.component_models = []
    def predict(self, x):
        individual_preds = np.array([m.predict(x) for m in self.component_models])
        alphaM = self.alphas * individual_preds.T
        return np.sign(alphaM.sum(axis=1))
bms = []
for T in range(2, 18):
    w = np.zeros((T+1, n_points))
    w[0] = np.ones(n_points) / n_points
    bm = boosted_model(T)
    for t in range(1, T+1):
        # select weak classifier that minimises weighted error
        dt = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w[t-1])
        # compute predictions from current model
        yhat = dt.predict(X)
        # compute weighted error epsilon_t
        eps_t = weighted_error(w[t-1], y, yhat)
        # compute alpha_t (model weight)
        alpha_t = 0.5 * np.log((1-eps_t)/eps_t)
        # update weights for next round
        for i in range(n_points):
            if y[i] == yhat[i]:
                # correctly classified instances
                w[t, i] = w[t-1, i]/(2 * (1-eps_t))
            else:
                # misclassified instances
                w[t, i] = w[t-1, i]/(2 * eps_t)
        # save model
        bm.alphas[t-1] = alpha_t
        bm.component_models.append(dt)
    # check number of components is right
    # print("nmb_components: ", len(bm.component_models))
    # store current boosted model
    bms.append(bm)

fig, ax = plt.subplots(4, 4, figsize=(12, 12))
titles = [f"n_components={len(bm.component_models)}" for bm in bms]
for i, ax in enumerate(ax.flat):
    plotter(bms[i], X, y, titles[i], ax)
plt.tight_layout()  # apply the layout before saving
plt.savefig("boosting_grid.png")
plt.show()
(e) Discuss the differences between bagging and boosting.
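As a concrete reference point for the discussion, here is a minimal sketch contrasting the two strategies in scikit-learn, reusing the toy data X, y generated above (the n_estimators value simply mirrors the T = 15 used earlier; this is an illustration, not a required solution):

from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging: deep (low-bias, high-variance) trees fit independently on bootstrap
# resamples; averaging their votes reduces variance.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=15).fit(X, y)

# Boosting: shallow (high-bias, low-variance) stumps fit sequentially, each
# upweighting the points its predecessors misclassified; this reduces bias.
boost = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                           n_estimators=15).fit(X, y)

print("bagging training accuracy: ", bag.score(X, y))
print("boosting training accuracy:", boost.score(X, y))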
Question 2
Note: this question is actually a bit longer and (slightly) more difficult than an exam question, but it is good practice. In this question, you will be required to manually implement Gradient Descent in Python to learn the parameters of a linear regression model. You will make use of the ‘real_estate.csv’ dataset, which you can download directly from the course Moodle page. The dataset contains 414 real estate records, each of which contains the following features:
• transactiondate: date of transaction
• age: age of property
• nearestMRT: distance of property to the nearest MRT station
• nConvenience: number of convenience stores in nearby locations
• latitude
• longitude
The target variable is the property price. The goal is to learn to predict property prices as a function of a subset of the above features.
(a) A number of pre-processing steps are required before we attempt to fit any models to the data. First remove any rows of the data that contain a missing (‘NA’) value. List the indices of the removed data points. Then, delete all features from the dataset apart from: age, nearestMRT and nConvenience.
Solution:
The indices of rows with missing values are: [19, 41, 109, 144, 230, 301].
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("real_estate.csv")
print(df[df.isna().any(axis=1)].index.tolist())  # indices of the removed rows
df = df.dropna()
X = df[["age", "nearestMRT", "nConvenience"]].values
(b) An important pre-processing step: feature normalisation. Feature normalisation involves rescaling the features such that they all have similar scales. This is also important for algorithms like gradient descent to ensure the convergence of the algorithm. One common technique is called min-max normalisation, in which each feature is scaled to the range [0, 1]. To do this, for each feature we must find the minimum and maximum values in the available sample, and then use the following formula to make the transformation:

$$x_{\text{new}} = \frac{x - \min(x)}{\max(x) - \min(x)}.$$
After applying this normalisation, the minimum value of your feature will be 0, and the maximum will be 1. For each of the features, provide the mean value over your dataset.
Solution:
The mean values for the three features are: [0.40607933, 0.16264268, 0.4120098].
# preprocessing
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
print(X.mean(axis=0))  # mean of each normalised feature
y = df["price"].values
(c) Now that the data is pre-processed, we will create train and test sets to build and then evaluate our models. Use the first half of observations (in the same order as in the original csv file, excluding those you removed in the previous question) to create the training set, and the remaining half for the test set. Print out the first and last rows of both your training and test sets.
Solution:
• first row Xtrain: [0.73059361, 0.00951267, 1.]
• last row Xtrain: [0.87899543, 0.09926012, 0.3]
• first row Xtest: [0.26255708, 0.20677973, 0.1]
• last row Xtest: [0.14840183, 0.0103754, 0.9]
• first row ytrain: 37.9
• last row ytrain: 34.2
• first row ytest: 26.2
• last row ytest: 63.9
# train/test split
X = np.concatenate((np.ones([len(X), 1]), X), axis=1)  # add a bias column of ones
split_point = X.shape[0] // 2
X_train = X[:split_point]
X_test = X[split_point:]
y_train = y[:split_point]
y_test = y[split_point:]
# printing
print("first row Xtrain: ", X_train[0])
print("last row Xtrain: ", X_train[-1])
print("first row Xtest: ", X_test[0])
print("last row Xtest: ", X_test[-1])
print("first row ytrain: ", y_train[0])
print("last row ytrain: ", y_train[-1])
print("first row ytest: ", y_test[0])
print("last row ytest: ", y_test[-1])
(d) Consider the loss function

$$L_c(x, y) = \sqrt{\frac{1}{c^2}(x - y)^2 + 1} - 1,$$

where $c \in \mathbb{R}$ is a hyper-parameter. Consider the (simple) linear model

$$\hat{y}^{(i)} = w_0 + w_1 x_1^{(i)} + w_2 x_2^{(i)} + w_3 x_3^{(i)}, \qquad i = 1, \dots, n.$$

We can write this more succinctly by letting $w = (w_0, w_1, w_2, w_3)^T$ and $X^{(i)} = (1, x_1^{(i)}, x_2^{(i)}, x_3^{(i)})^T$, so that $\hat{y}^{(i)} = w^T X^{(i)}$. The mean loss achieved by our model ($w$) on a given dataset of $n$ observations is then

$$L_c(y, \hat{y}) = \frac{1}{n}\sum_{i=1}^{n} L_c(y^{(i)}, \hat{y}^{(i)}) = \frac{1}{n}\sum_{i=1}^{n}\left[\sqrt{\frac{1}{c^2}\left(y^{(i)} - \langle w, X^{(i)} \rangle\right)^2 + 1} - 1\right].$$

Compute the following derivatives:

$$\frac{\partial L_c(y^{(i)}, \hat{y}^{(i)})}{\partial w_k}, \qquad k = 0, 1, 2, 3.$$

You must show your working for full marks.
Solution:

After applying the chain rule twice, we get a result similar to the L2-loss case:

$$\frac{\partial L_c(y^{(i)}, \hat{y}^{(i)})}{\partial w_k} = \frac{X_k^{(i)}\left(\langle w, X^{(i)} \rangle - y^{(i)}\right)}{c^2\sqrt{\frac{1}{c^2}\left(\langle w, X^{(i)} \rangle - y^{(i)}\right)^2 + 1}}, \qquad k = 0, 1, 2, 3.$$

(e) For some value of c > 0, we have

$$\frac{\partial L_c(y^{(i)}, \hat{y}^{(i)})}{\partial w_k} = \frac{X_k^{(i)}\left(\langle w^{(t)}, X^{(i)} \rangle - y^{(i)}\right)}{2\sqrt{\left(\langle w^{(t)}, X^{(i)} \rangle - y^{(i)}\right)^2 + 4}}, \qquad k = 0, 1, 2, 3.$$
Using this result, write down the gradient descent updates for w0, w1, w2, w3 (using pseudocode), assuming a step size of η. Note that in gradient descent we consider the loss over the entire dataset, not just at a single observation. Further, provide pseudocode for stochastic gradient descent updates. Here, w(t) denotes the weight vector at the t-th iteration of gradient descent.
Solution:

We have the following updates for GD, for $t = 1, \dots, T$:

$$w_k^{(t+1)} = w_k^{(t)} - \eta \cdot \frac{1}{n}\sum_{i=1}^{n} \frac{X_k^{(i)}\left(\langle w^{(t)}, X^{(i)} \rangle - y_i\right)}{2\sqrt{\left(\langle w^{(t)}, X^{(i)} \rangle - y_i\right)^2 + 4}}.$$

Similarly for SGD, where each update uses a single observation $i$, we have

$$w_k^{(t+1)} = w_k^{(t)} - \eta \cdot \frac{X_k^{(i)}\left(\langle w^{(t)}, X^{(i)} \rangle - y_i\right)}{2\sqrt{\left(\langle w^{(t)}, X^{(i)} \rangle - y_i\right)^2 + 4}}.$$
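The derivative given in (e) matches the general formula from (d) with c = 2 (as the solution to part (f) notes below). A quick finite-difference sanity check of this gradient, as a sketch (the observation Xi, target yi and weights w are arbitrary values chosen for illustration):

import numpy as np

# Sketch: compare the analytic gradient (c = 2) against central differences.
c = 2.0
Xi = np.array([1.0, 0.5, -0.3, 0.8])   # arbitrary observation, leading 1 = bias
yi = 2.0
w = np.array([0.2, -0.1, 0.4, 0.3])    # arbitrary weight vector

def loss(w):
    return np.sqrt((Xi @ w - yi)**2 / c**2 + 1) - 1

analytic = Xi * (Xi @ w - yi) / (2 * np.sqrt((Xi @ w - yi)**2 + 4))
h = 1e-6
numeric = np.array([(loss(w + h*np.eye(4)[k]) - loss(w - h*np.eye(4)[k])) / (2*h)
                    for k in range(4)])
print(np.allclose(analytic, numeric))   # True if the derivation is correct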
(f) In this section, you will implement gradient descent from scratch on the dataset above, using the gradients computed in part (d) and the pseudocode in part (e). Initialise your weight vector to w(0) = [1, 1, 1, 1]T. Consider step sizes
η ∈ {10, 5, 2, 1, 0.5, 0.25, 0.1, 0.05, 0.01}
(a total of 9 step sizes). For each step-size, generate a plot of the loss achieved at each iteration of gradient descent. You should use 400 iterations in total (which will require you to loop over your training data in order). Generate a 3 × 3 grid of plots showing performance for each step-size. Add a screen shot of the python code written for this question in your report (as well as pasting the code in to your solutions.py file). The following code may help with plotting:
fig, ax = plt.subplots(3, 3, figsize=(10, 10))
nIter = 400
alphas = [10, 5, 2, 1, 0.5, 0.25, 0.1, 0.05, 0.01]
for i, ax in enumerate(ax.flat):
    # losses is a list of 9 elements; each element is an array of length
    # nIter storing the loss at each iteration for that particular step size
    ax.plot(losses[i])
    ax.set_title(f"step size: {alphas[i]}")  # plot titles
plt.tight_layout()  # plot formatting
plt.show()
Solution:
The plot should look like:
If the student did not figure out that the provided gradient corresponds to the case c = 2, then they were told they could use either c = 1 or c = 2. This affects their loss values, even though they are using the correct gradients, and in that case their plots would look like:
Example code is as follows:
def calc_loss(y, y_pred, c=2):
    return np.mean(np.sqrt((1/c**2) * (y - y_pred)**2 + 1) - 1)

def calc_grad(X_train, y_train, w, c=2):
    Xw = X_train @ w
    const = (Xw - y_train) / (c**2 * np.sqrt((1/c**2) * (Xw - y_train)**2 + 1))
    grad = np.mean(X_train * np.repeat(const, 4).reshape(-1, 4), axis=0)
    return grad

nmb_iter = 400
w = np.zeros(shape=(nmb_iter+1, 4))
w[0] = np.array([1, 1, 1, 1])
alphas = [10, 5, 2, 1, 0.5, 0.25, 0.1, 0.05, 0.01]
losses = []
cur_c = 2
for alpha in alphas:
    train_loss = np.zeros(shape=(nmb_iter+1))
    train_loss[0] = calc_loss(y_train, X_train @ w[0], c=cur_c)
    for i in range(1, nmb_iter+1):
        w[i] = w[i-1] - alpha * calc_grad(X_train, y_train, w[i-1])
        train_loss[i] = calc_loss(y_train, X_train @ w[i], c=cur_c)
    losses.append(train_loss)
fig, ax = plt.subplots(3, 3, figsize=(10, 10))
for i, ax in enumerate(ax.flat):
    ax.plot(losses[i])
    ax.set_title(f"step size: {alphas[i]}")
plt.tight_layout()
plt.savefig("GDgrid.png", dpi=500)
plt.show()
(g) From your results in the previous part, choose an appropriate step size (and state your choice), and explain why you made this choice.
(h) For this part, take η = 0.3, re-run GD under the same parameter initialisations and number of iterations. On a single plot, plot the progression of each of the four weights over the iterations. Print out the final weight vector. Finally, run your model on the train and test set, and print the achieved losses.
Solution:
η = 10 seems like the best choice.
Solution:
We have
w = [26.11838076, 7.55495429, 0.22091267, 15.88216107]
Rerunning this model gives the following results:
• train loss: 4.089850739201336 / 8.887715476132039 (if c = 1)
• test loss: 3.8275009292188362 / 8.351707785988028 (if c = 1)
The plot of the weights looks like:
w = np.zeros(shape=(nmb_iter, 4))
w[0] = np.array([1, 1, 1, 1])
alpha = 0.3
for i in range(1, nmb_iter):
    w[i] = w[i-1] - alpha * calc_grad(X_train, y_train, w[i-1])
plt.plot(w[:, 0], label="w0")
plt.plot(w[:, 1], label="w1")
plt.plot(w[:, 2], label="w2")
plt.plot(w[:, 3], label="w3")
plt.legend()
plt.savefig("GDweights.png", dpi=400)
plt.show()
wstar = w[nmb_iter-1]
print("train loss: ", calc_loss(y_train, X_train @ wstar, c=cur_c))
print("test loss: ", calc_loss(y_test, X_test @ wstar, c=cur_c))
(i) We will now re-run the analysis in the previous part, but using stochastic gradient descent (SGD) instead. In SGD, we update the weights after seeing a single observation. Use the same settings as in the gradient descent implementation. For each step-size, generate a plot of the loss achieved at each iteration of stochastic gradient descent, which is to be run for 6 Epochs. An epoch is one pass over the entire training dataset, so in total there should be 6 × ntrain updates, where ntrain is the size of your training data. Be sure to run these updates in the same ordering that the data is stored. Generate a 3 × 3 grid of plots showing performance for each step-size. Add a screen shot of the python code written for this question in your report (as well as pasting the code in to your solutions.py file).
Solution:
The plot should look like:
If students use c = 1, then the plots will look like:
Example code is as follows:
def calc_grad_i(w, Xi, yi, c=2):
    # compute gradient for a single observation
    xw = Xi @ w
    num = xw - yi
    den = (c**2) * np.sqrt(1 + (1/c**2) * (xw - yi)**2)
    return (num/den) * Xi

epochs = 6
nmb_iter = X_train.shape[0] * epochs
w = np.zeros(shape=(nmb_iter, 4))
w[0] = np.array([1, 1, 1, 1])
alphas = [10, 5, 2, 1, 0.5, 0.25, 0.1, 0.05, 0.01]
losses = []
cur_c = 2
for alpha in alphas:
    train_loss = np.zeros(shape=(nmb_iter))
    train_loss[0] = calc_loss(y_train, X_train @ w[0], c=cur_c)
    for i in range(1, nmb_iter):
        idx = i % X_train.shape[0]
        w[i] = w[i-1] - alpha * calc_grad_i(w[i-1], X_train[idx], y_train[idx], c=cur_c)
        train_loss[i] = calc_loss(y_train, X_train @ w[i], c=cur_c)
    losses.append(train_loss)
fig, ax = plt.subplots(3, 3, figsize=(10, 10))
for i, ax in enumerate(ax.flat):
    ax.plot(losses[i])
    ax.set_title(f"step size: {alphas[i]}")
plt.tight_layout()
plt.savefig("SGDgrid.png", dpi=500)
plt.show()
(j) From your results in the previous part, choose an appropriate step size (and state your choice), and explain why you made this choice.
(k) Take η = 0.4 and re-run SGD under the same parameter initialisations and number of epochs. On a single plot, plot the progression of each of the four weights over the iterations. Print your final model. Finally, run your model on the train and test set, and record the achieved losses.
Solution:
0.5 seems like a reasonable choice, but 0.25 is also acceptable.
Solution:
We have
w = [30.13933686, −5.58250367, −12.85536393, 24.85002152]
Rerunning this model gives the following results:
• train loss: 2.7805130067939547 / 6.181134362027168 (if c = 1)
• test loss: 2.8446116656442526 / 6.329857966449784 (if c = 1)
The plot of the weights looks like:
w = np.zeros(shape=(nmb_iter, 4))
w[0] = np.array([1, 1, 1, 1])
alpha = 0.4
loss_vec = np.zeros(shape=(nmb_iter))
loss_vec[0] = calc_loss(y_train, X_train @ w[0], c=cur_c)
for i in range(1, nmb_iter):
    idx = i % X_train.shape[0]
    w[i] = w[i-1] - alpha * calc_grad_i(w[i-1], X_train[idx], y_train[idx], c=cur_c)
    loss_vec[i] = calc_loss(y_train, X_train @ w[i], c=cur_c)
plt.plot(w[:, 0], label="w0")
plt.plot(w[:, 1], label="w1")
plt.plot(w[:, 2], label="w2")
plt.plot(w[:, 3], label="w3")
plt.legend()
plt.savefig("SGDweights.png", dpi=500)
plt.show()
wstar = w[-1]
print("w: ", wstar)
print("train loss: ", calc_loss(y_train, X_train @ wstar, c=cur_c))
print("test loss: ", calc_loss(y_test, X_test @ wstar, c=cur_c))
(l) In a few lines, comment on your results for GD and SGD above. Explain the importance of the step-size in both GD and SGD. Explain why one might have a preference for GD or SGD. Explain why the GD paths look much smoother than the SGD paths.
Solution:
This is an open question so any reasonable short answer should be given full marks. Students should mention things along the lines of:
• In general, gradient descent takes steps towards the minimum by moving in the direction of greatest descent. For large step sizes, we may overshoot the minimum, which may result in oscillation or divergence of the algorithm, whereas too small a step size will lead to very slow convergence. We will in general find that GD uses a larger step size relative to SGD, since the direction (gradient) is more reliable in GD than in SGD.
• SGD is an approximation of GD, since in GD we use the entire training data before making a single update, whereas in SGD we make updates after seeing a single training point. The reason we might prefer SGD is when it is too computationally expensive to compute the gradient over the entire dataset, and SGD tends to converge much faster than GD for this reason. We see that in this case, SGD is able to achieve a lower mean loss, though we cannot directly compare the two algorithms here as the number of iterations differs; it would be interesting if a student provided a timing comparison for the two cases, but this is not necessary.
• The SGD paths are much more turbulent because we are estimating the gradient by looking at a single training point, and so the gradient updates are much noisier than when computing the gradient over the entire training set as in standard GD.
• Any other discussion points, or mention of mini-batch GD, or other variants.
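The mini-batch variant mentioned in the last point can be made concrete; a sketch reusing calc_grad and the X_train, y_train, w, alpha, nmb_iter variables from the solutions above (the batch size of 32 is an illustrative choice, not specified anywhere in the question):

# Sketch: mini-batch gradient descent sits between GD (batch = n) and
# SGD (batch = 1); each update uses a small contiguous slice of the data.
batch_size = 32
for i in range(1, nmb_iter):
    start = (i * batch_size) % X_train.shape[0]
    Xb = X_train[start:start + batch_size]
    yb = y_train[start:start + batch_size]
    w[i] = w[i - 1] - alpha * calc_grad(Xb, yb, w[i - 1])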
Question 3
In this question, we will consider the scikit-learn implementations of the following classifiers:
• Decision Trees
• Random Forest
• AdaBoost
• Logistic Regression
• Multilayer Perceptron (Neural Network)
• Support Vector Machine
You are required to compare the performance of the above models on a classification task. The following code loads in these classifiers and defines a function to simulate a toy dataset:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import time
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
def create_dataset(n=1250, nf=2, nr=0, ni=2, random_state=125):
    '''
    generate a new dataset with
    n: total number of samples
    nf: number of features
    nr: number of redundant features (these are linear combinations of informative features)
    ni: number of informative features (ni + nr = nf must hold)
    random_state: set for reproducibility
    '''
    X, y = make_classification(n_samples=n,
                               n_features=nf,
                               n_redundant=nr,
                               n_informative=ni,
                               random_state=random_state,
                               n_clusters_per_class=2)
    rng = np.random.RandomState(2)
    X += 3 * rng.uniform(size=X.shape)
    X = StandardScaler().fit_transform(X)
    return X, y
(a) Generate a dataset using the default parameters of create_dataset. Then, randomly split the dataset into training set Xtrain and test set Xtest (use the sklearn train_test_split function with random_state=45), with 80% of examples for training and 20% for testing. Fit each of the models to the training data and then plot the decision boundaries of each of the classifiers (using default parameter settings) on the test set. If you prefer, you may use the following plotter function, which plots the decision boundary and works for any sklearn model.
def plotter(classifier, X, X_test, y_test, title, ax=None):
    # plot decision boundary for given classifier
    plot_step = 0.02
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                         np.arange(y_min, y_max, plot_step))
    Z = classifier.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    if ax:
        ax.contourf(xx, yy, Z, cmap=plt.cm.Paired)
        ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test)
        ax.set_title(title)
    else:
        plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
        plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test)
        plt.title(title)
Note that you can use the same general approach for plotting a grid as used in Homework 1, and the plotter function supports an ‘ax’ argument. What to submit: a single 3 × 2 plot, a print screen of the python code used, a copy of your python code in solutions.py.
Solution:
The students should generate a plot that looks like:
The code used to generate this plot is:
X, y = create_dataset()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=45)
mods = [DecisionTreeClassifier(),
        RandomForestClassifier(),
        AdaBoostClassifier(),
        LogisticRegression(),
        MLPClassifier(),
        SVC()]
titles = ["Decision Tree", "RF", "AdaBoost",
          "Logistic Regression", "Neural Network", "SVM"]
fig, ax = plt.subplots(3, 2, figsize=(10, 10))
for i, ax in enumerate(ax.flat):
    mod = mods[i]
    mod.fit(X_train, y_train)
    plotter(mod, X, X_test, y_test, titles[i], ax)
plt.savefig("figures/boundaries.png")
plt.show()
(b) Next, we will study how the performance of each of the classifiers varies as you increase the size of the training set. Fix your training and test sets from part (a). Then, starting from a random subset (no need to set a seed) of your training set of size 50 (chosen with replacement), train your classification model, and compute the accuracy on the test set. Repeat this process for training set sizes of [50, 100, 200, 300, . . . , 1000]. Repeat the experiment a total of 10 times for each classifier. Then, for each classifier, plot its average accuracy (achieved on the test set) for each training set size. Compare the accuracy across different algorithms in a single figure, and in 5 lines or less, discuss your observations: for the models covered in the course so far, use what you know about the bias-variance decomposition to inform your discussion. Which model would you prefer for this task?
Please use the following color scheme for your plots: [Decision Tree, Random Forest, AdaBoost, Logistic Regression, Neural Network, SVM], and please include a legend in your plot. What to submit: a discussion of your observation, a single plot, a print screen of the python code used, a copy of your python code in solutions.py.
Solution:
The figure generated is
Discussion points can include:
• trees doing poorly and exhibiting high variance; RF seems to be overfitting
• as expected, MLPs and SVMs do quite well without much tuning
• logistic regression is preferred here as it is the simplest model and achieves consistently good results.
For full marks, students should use the decision surfaces in (a) to argue about the overfitting of RFs and AdaBoost relative to the other models. Code is presented at the end of part (c).
(c) Using the time module, record the training time for each of the classifiers at each of the training set sizes. Plot the log of the average training time over the 10 trials of each classifier as a function of the training size (use log base e). You may add code for this section to your code in part (b). What do you observe? In 5 lines or less, discuss your observations. Use the same color scheme as in (b). What to submit: a discussion of your observation, a single plot, a print screen of the python code used, a copy of your python code in solutions.py.
Solution:
The figure generated is
or, if using a log scale:
Some basic comments about NNs taking more time as you increase the training size. Some discussion of the default MLP number of hidden layers would be good here but not necessary. Code used for both (b) and (c) is provided here:
def train_mod(mod, X_train, y_train, X_test, y_test, train_size):
    idxs = np.random.choice(X_train.shape[0], size=train_size)
    Xp = X_train[idxs]
    yp = y_train[idxs]
    mod.fit(Xp, yp)
    return np.mean(mod.predict(X_test) == y_test)

mod_mean_accuracies = np.zeros(shape=(6, 12))
mod_mean_time = np.zeros(shape=(6, 12))
train_sizes = [25, 50] + [100 + 100*i for i in range(10)]
for mod_idx, mod in enumerate(mods):
    mean_accuracies = np.zeros(shape=(10, 12))
    mean_times = np.zeros(shape=(10, 12))
    for i in range(10):
        for j, tr in enumerate(train_sizes):
            tstart = time.time()
            mean_accuracies[i, j] = train_mod(mod, X_train, y_train,
                                              X_test, y_test, tr)
            tend = time.time()
            mean_times[i, j] = tend - tstart
    mod_mean_accuracies[mod_idx, :] = mean_accuracies.mean(axis=0)
    mod_mean_time[mod_idx, :] = mean_times.mean(axis=0)
fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 10
fig_size[1] = 8
plt.rcParams["figure.figsize"] = fig_size
for i in range(6):
    plt.plot(train_sizes, mod_mean_accuracies[i], label=titles[i])
plt.xlabel("Training Size")
plt.ylabel("Average Accuracy")
plt.legend()
plt.savefig("figures/Accuracy.png")
plt.show()
for i in range(6):
    plt.plot(train_sizes, mod_mean_time[i], label=titles[i])
plt.xlabel("Training Size")
plt.ylabel("Average Time")
plt.legend()
plt.savefig("figures/Timing.png")
plt.show()
Question 4
Refer to the following dataset containing a sample S of ten examples. Each example is described using two Boolean attributes A and B. Each is labelled (classified) by the target Boolean function.
A  B  Class
1  0  +
0  1  -
1  1  -
1  0  +
1  1  -
1  1  -
0  0  +
1  1  +
0  0  +
0  0  -
(a) What is the entropy of these examples with respect to the given classification?
(b) What is the Information gain of attribute A on sample S above?
Solution:
The classes are split equally (5 in each class), so the entropy is

$$H(S) = -p_+ \log_2 p_+ - p_- \log_2 p_- = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1.$$
Solution:
If we split on A, we get 6 observations falling into A = 1, and they are equally split across the classes. The remaining 4 observations fall into A = 0, and they are also equally split. So both regions have entropy 1. Therefore, the gain of A is

$$G(A) = H(S) - \frac{6}{10}\times 1 - \frac{4}{10}\times 1 = 1 - 1 = 0.$$
In other words, A is useless.
(c) What is the information gain of attribute B on sample S above?
Solution:
In this case, we see 5 observations each in the two regions B = 1 and B = 0. However, in B=1,wesee1positiveand4negatives,sotheentropyis−1log 1−4log 4 =0.7219.Inthe
region B = 0, we see 4 positives and 1 negative, so again the entropy is 0.7219. The gain of B is therefore
G(B)=H(S)− 5 ×0.7219− 5 ×0.7219=1−0.7219=0.2781. 10 10
525525
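These calculations are easy to check programmatically; a small sketch encoding the table above:

import numpy as np

# Sketch: verify H(S), G(A) and G(B) for the ten examples in the table.
def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

A = np.array([1, 0, 1, 1, 1, 1, 0, 1, 0, 0])
B = np.array([0, 1, 1, 0, 1, 1, 0, 1, 0, 0])
y = np.array(list("+--+--+++-"))

def gain(x, y):
    n = len(y)
    return entropy(y) - sum((x == v).sum() / n * entropy(y[x == v])
                            for v in np.unique(x))

print(entropy(y))   # 1.0
print(gain(A, y))   # 0.0
print(gain(B, y))   # 0.2781...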
(d) What would be chosen as the ‘best’ attribute by a decision tree learner using the information gain splitting criterion? Why?
(e) What are ensembles? Discuss one example in which decision trees are used in an ensemble.
Solution:
We choose B as it has higher information gain.
Solution:
One example is Bagging trees, for more discussion about this see here.
Question 5
Refer to the following data
x1  x2   y
-2  -1  -1
 2  -1   1
 1   1   1
-1  -1  -1
 3   2   1
(a) Apply the Perceptron Learning Algorithm with starting values w0 = 5, w1 = 1 and w2 = 1, and a learning rate η = 0.4. Be sure to cycle through the training data in the same order that they are presented in the table.
Solution:

Running the Perceptron training algorithm results in the following iterations, where a change in the weight vector on a particular iteration signifies a mistake (and hence an update) on that iteration. The correct weights are therefore w0 = 3.8, w1 = 2.6, w2 = 2.2. These weights first appear on iteration 9.

Iteration    wᵀx      y wᵀx    w (after iteration)
1             2.00    −2.00    [4.6, 1.8, 1.4]ᵀ
2             6.80     6.80    [4.6, 1.8, 1.4]ᵀ
3             7.80     7.80    [4.6, 1.8, 1.4]ᵀ
4             1.40    −1.40    [4.2, 2.2, 1.8]ᵀ
5            14.40    14.40    [4.2, 2.2, 1.8]ᵀ
6            −2.00     2.00    [4.2, 2.2, 1.8]ᵀ
7             6.80     6.80    [4.2, 2.2, 1.8]ᵀ
8             8.20     8.20    [4.2, 2.2, 1.8]ᵀ
9             0.20    −0.20    [3.8, 2.6, 2.2]ᵀ
10           16.00    16.00    [3.8, 2.6, 2.2]ᵀ
11           −3.60     3.60    [3.8, 2.6, 2.2]ᵀ
12            6.80     6.80    [3.8, 2.6, 2.2]ᵀ
13            8.60     8.60    [3.8, 2.6, 2.2]ᵀ
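The table can be reproduced in a few lines; a sketch, using the update w ← w + η y x whenever y wᵀx ≤ 0, with x carrying a leading 1 for the bias term:

import numpy as np

# Sketch: replay the perceptron iterations from the table above.
X = np.array([[1, -2, -1], [1, 2, -1], [1, 1, 1], [1, -1, -1], [1, 3, 2]])
y = np.array([-1, 1, 1, -1, 1])
w = np.array([5.0, 1.0, 1.0])
eta = 0.4

for it in range(13):
    xi, yi = X[it % 5], y[it % 5]
    score = w @ xi                  # w^T x before any update
    if yi * score <= 0:             # mistake: move w towards y*x
        w = w + eta * yi * xi
    print(it + 1, round(score, 2), w)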
(b) Consider a new point, x⋆ = (−5, 3). What is the predicted value and predicted class based on your learned perceptron for this point?

Solution:
value: −2.6, class: y⋆ = −1

(c) Consider adding a new point to the data set, x⋆ = (2, 2) and y⋆ = −1. Will your perceptron converge on the new dataset? How might you remedy this?

(d) Consider the following three logical functions:
1. A ∧ ¬B
2. ¬A ∨ B
3. (A ∨ B) ∧ (¬A ∨ ¬B)
Which of these functions can a perceptron learn? Explain. What are two ways that you can extend a perceptron to learn all three functions?

Solution:
A perceptron can only learn f1 and f2; f3 is the XOR function, which is not linearly separable. We can extend via multilayer perceptrons, or by using a smart transformation of the inputs (i.e. a kernel perceptron).
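To make the transformation remedy concrete: adding the product feature AB makes XOR linearly separable. A sketch:

import numpy as np

# Sketch: XOR is not linearly separable in (A, B), but in the transformed
# space (A, B, A*B) the linear rule A + B - 2*A*B reproduces it exactly.
A = np.array([0, 0, 1, 1])
B = np.array([0, 1, 0, 1])
xor = A ^ B                    # labels: [0, 1, 1, 0]

z = A + B - 2 * A * B          # linear function of the features (A, B, A*B)
print(np.array_equal(z, xor))  # True: threshold z at 0.5 to classify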
Question 6
Refer to the following information. In these two questions you will apply the k-means algorithm. You will use a univariate (one-variable) dataset containing the following 12 instances:
{2.01, 3.49, 4.58, 4.91, 4.99, 5.01, 5.32, 5.78, 5.99, 6.21, 7.26, 8.00}
Use the Manhattan or city-block distance, i.e., the distance between two instances xi and xj is the absolute value of the difference xi − xj . For example, if xi = 2 and xj = 3 then the distance between xi and xj is |2 − 3| = 1. Use the arithmetic mean to compute the centroids. Apply the k-means algorithm to the above dataset of examples. Let k = 2. Let the two centroids (means) be initialised to {3.33, 6.67}. On each iteration of the algorithm record the centroids.
(a) After two iterations of the algorithm you should have recorded two sets of two centroids. What are they?
Now apply the algorithm for one more iteration. Record the new centroids after iteration 3 and answer the following question.
(b) After 3 iterations it is clear that:
(a) due to randomness in the data, the centroids could change on further iterations
(b) due to randomness in the algorithm, the centroids could change on further iterations
(c) k-means converges in probability to the true centroids
(d) the algorithm has converged and the clustering will not change on further iterations
(e) the algorithm has not converged and the clustering will change on further iterations
Solution:
{4.00, 6.22} and {4.17, 6.43}
Solution:
After the third iteration, the cluster centroids are {4.17, 6.43}. This means that there is no change from iteration 2 to iteration 3, so the algorithm has converged, so (d) is true. The reason that (c) is not true is that we cannot be sure that these are the true centroids, so we cannot conclude that the converged centroids are the right centroids with high probability.
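The iterations are easy to verify with a short script; a sketch (k = 2, the given initial centroids, nearest-centroid assignment under the city-block distance):

import numpy as np

# Sketch: replay the k-means iterations on the univariate dataset.
x = np.array([2.01, 3.49, 4.58, 4.91, 4.99, 5.01,
              5.32, 5.78, 5.99, 6.21, 7.26, 8.00])
centroids = np.array([3.33, 6.67])

for it in range(3):
    # assign each point to its nearest centroid (|xi - c|)
    assign = np.abs(x[:, None] - centroids[None, :]).argmin(axis=1)
    # recompute each centroid as the arithmetic mean of its cluster
    centroids = np.array([x[assign == j].mean() for j in range(2)])
    print(f"iteration {it + 1}: {np.round(centroids, 2)}")
# prints [4. 6.22], then [4.17 6.43] twice: converged after iteration 2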
Question 7
Note: there will not be any MCQ on the final this year – this question is still useful practice though.
In this question, we will apply a mistake-bounded learner to the following dataset. This dataset has 6 binary features, x1 , x2 , . . . x6 . The class variable y can be either 1, denoting a positive example of the concept to be learned, or 0, denoting a negative example.
Example  x1  x2  x3  x4  x5  x6  Class
1)        0   0   0   0   1   1    1
2)        1   0   1   1   0   1    1
3)        0   1   0   1   0   1    0
4)        0   1   1   0   0   1    0
5)        1   1   0   0   0   0    1
Apply the Winnow2 algorithm to the above dataset of examples in the order in which they appear. Use the following values for the Winnow2 parameters: threshold t = 2, α = 2. Initialise all weights to have the value 1.
Learned Weights

Weight vector    w1     w2     w3     w4     w5     w6
1               2.000  1.000  1.000  0.000  2.000  1.000
2               3.000  0.000  1.000  1.000  2.000  1.000
3               2.000  2.000  2.000  2.000  2.000  2.000
4               2.000  0.500  0.500  0.500  2.000  0.500
5               2.000  0.250  0.500  0.500  4.000  0.125
(a) After one epoch, i.e., one pass through the dataset, which of the above weight configurations has been learned?
(a) Weight vector 1
(b) Weight vector 2
(c) Weight vector 3
(d) Weight vector 4
(e) Weight vector 5
Solution:
(d)
(b) On which of the examples did the algorithm not make a mistake?
(a) Examples 1), 2) and 5)
(b) Example 5)
(c) Example 4)
(d) Examples 4) and 5)
(e) None of the examples

Solution:
(e)

(c) The algorithm has learned a consistent concept on the training data:
(a) True
(b) False
(c) It is not possible to determine this

Solution:
(a)

(d) Assume the target concept from which this dataset was generated is defined by exactly two features. The worst-case mistake bound for the algorithm on this dataset is approximately:
(a) 1.79
(b) 2.58
(c) 3.58
(d) 4.67
(e) 10.75

Solution:
(c)
comment: The solution is based on the worst-case mistake bound O(r log n), where
r = number of relevant features = 2
n = total number of features = 6
So r log n = 2 log 6 = 3.5835 (log is base e). This concept was introduced on slide 61/78, and was also used in Q5 of the learning theory problem set. The idea is that Winnow is more robust to irrelevant features, so its mistake count scales with the logarithm of the total number of features. In this case, we are given 6 features x1, …, x6, but in (d) we are told that the data was actually generated from two features. The proof of the bound is outside the scope of the course but not too complicated; see here for a simplified version.
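For completeness, a sketch replaying the epoch described above. This assumes the learner predicts 1 when the weighted sum strictly exceeds t (an assumption about the tie-breaking rule, which is what reproduces weight vector 4):

import numpy as np

# Sketch: one epoch of Winnow2 with t = 2, alpha = 2, initial weights 1.
X = np.array([[0, 0, 0, 0, 1, 1],
              [1, 0, 1, 1, 0, 1],
              [0, 1, 0, 1, 0, 1],
              [0, 1, 1, 0, 0, 1],
              [1, 1, 0, 0, 0, 0]])
y = np.array([1, 1, 0, 0, 1])
t, alpha = 2.0, 2.0
w = np.ones(6)

for xi, yi in zip(X, y):
    pred = 1 if w @ xi > t else 0
    if pred == 0 and yi == 1:
        w[xi == 1] *= alpha      # promotion on a false negative
    elif pred == 1 and yi == 0:
        w[xi == 1] /= alpha      # demotion on a false positive

print(w)   # [2. 0.5 0.5 0.5 2. 0.5] -- weight vector 4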