Name of Candidate: …………………………………………….. Student id: …………………………………………….. Signature: ……………………………………………..
THE UNIVERSITY OF NEW SOUTH WALES
Term 2, 2020
COMP9417 Machine Learning and Data Mining – Sample Final Examination (SOLUTIONS)
1. I ACKNOWLEDGE THAT ALL OF THE WORK I SUBMIT FOR THIS EXAM WILL BE COMPLETED BY ME WITHOUT ASSISTANCE FROM ANYONE ELSE.
2. TIME ALLOWED — 24 hours
3. OPEN BOOK EXAM – LECTURE NOTES, TUTORIALS, AND ONLINE RESOURCES ARE PERMITTED. PLEASE USE REFERENCES WHERE NECESSARY.
4. SUBMISSION — YOU MUST SUBMIT A PDF FILE CONTAINING YOUR ANSWERS FOR EACH QUESTION ATTEMPTED. START EACH SUB-QUESTION ON A NEW PAGE. MARKS MAY BE DEDUCTED FOR UNCLEAR WORK. YOU MAY TYPE YOUR SOLUTIONS USING LATEX, OR TAKE CLEAR PHOTOS OF HANDWRIT- TEN WORK. FOR QUESTIONS THAT REQUIRE CODING, YOU MUST SUBMIT A .PY FILE (SEE TEMPLATE) CONTAINING YOUR CODE, THOUGH GENERATED PLOTS/TABLES MUST BE INCLUDED IN THE PDF.
5. DISCUSSION WITH OTHER STUDENTS IS STRICTLY PROHIBITED. CODE SUB- MISSIONS WILL BE CHECKED FOR PLAGIARISM. CHEATING WILL RESULT IN A FAILING GRADE FOR THE COURSE AND POTENTIAL FURTHER DISCIPLINARY ACTION.
6. IF NEEDED, YOU ARE PERMITTED TO SEEK CLARIFICATION FROM COURSE STAFF ON THE WEBCMS FORUM. QUESTIONS SPECIFIC TO CONTENT WILL NOT BE ANSWERED.
1 Please see over
Question 1 is on Linear Regression and requires you to refer to the following training data:
xy 42 64 12 10 25 23 29 28 46 44 59 60
We wish to fit a linear regression model to this data, i.e. a model of the form: yˆ i = w 0 + w 1 x i .
We consider the Least-Squares loss function:
L ( w 0 , w 1 ) = ( y i − yˆ i ) 2 ,
i=1 where n is the total number of training points.
(a) Derive the least squares estiamtes of w0 and w1 (showing their working), and compute them for the provided data.
ANSWER
For (A), students are expected to derive the least squares estimates: w0=y−w1x, w1=xy−xy
x2 − x2
then, computing the specific different pieces, we get
w0 = −2.4570, w1 = 1.0398.
(b) Based on your linear model, what is the prediction for a test point x⋆ = 50? ANSWER
For (B), they just need to compute
y ⋆ = wˆ 0 + wˆ 1 x ⋆
using the estimates from their previous answer.
2 Please see over
n
(c) Consider a new training point, (xnew,ynew) = (45,0). What are the new least squares parameters if we include all training data (including this new point)?
ANSWER
In (c), they need to recompute the estimates using the estimates in (A).
(d) The new training point in (c) can be considered to be what kind of point? What does such a point mean for your estimated parameters? How could you remedy the situation?
ANSWER
This is a clear outlier, and it has a large impact on our estimates under least squares error. This is not desirable as we would not want a single data point to have so much influence over our model. One way to remedy this is to screen for outliers, so this point would be removed from our analysis before computing the least squares estimates. Another approach would be to use a different loss function, namely, a robust loss such as the mean absolute error: ni=1 |yi − yˆi|.
(e) Are you comfortable coding this question up in Numpy? What about using the scikit learn implementation of Linear Regression?
(f) Compute the derivatives of the following functions (SHOW YOUR WORKING):
1. f(x) = x2 ln(x) 2. g(x)=(1+2×4)3
3. h(x) = 2 ln(x)+4x x3
ANSWER
1. f′(x)=2xln(x)+x 2. g′(x) = 24×3(1 + 2×4)2
x6
3.h′(x)= x
=
.
3 Please see over
( 2 +4)x3 −(2 ln(x)+4x)3×2
2−6 ln(x)−8x x4
Question 2 is on Tree Learning and requires you to refer to the following dataset containing a sample S of ten examples. Each example is described using two Boolean attributes A and B. Each is labelled (classified) by the target Boolean function.
A B Class 10+ 01- 11- 10+ 11- 11- 00+ 11+ 00+ 00-
(a) What is the entropy of these examples with respect to the given classification? ANSWER
The classes are split equally (5 in each class), so the entropy is
H(S)=−p log p −p log p =−1log 1−1log 1=1.
+2+−2−222222 (b) What is the Information gain of attribute A on sample S above?
ANSWER
If we split on A, we get 6 observations falling into A = 1, and they are equally split across the classes. The remaining 4 observations fall in to A = 0, and they are also equally split. So they both have entropy of 1. Therefore, the gain of A is
G(A)=H(S)− 6 ×1− 4 ×1=1−1=0. 10 10
In other words, A is useless.
(c) What is the information gain of attribute B on sample S above? ANSWER
4 Please see over
In this case, we see 5 observations each in the two regions B = 1 and B = 0. However, in B = 1, we see 1 positive and 4 negatives, so the entropy is −1 log 1 − 4 log 4 = 0.7219. In the
region B = 0, we see 4 positives and 1 negative, so again the entropy is 0.7219. The gain of B is therefore
G(B)=H(S)− 5 ×0.7219− 5 ×0.7219=1−0.7219=0.2781. 10 10
(d) What would be chosen as the ‘best’ attribute by a decision tree learner using the information gain splitting criterion? Why?
ANSWER
We choose B as it has higher information gain.
(e) What are ensembles? Discuss one example in which decision trees are used in an ensemble. ANSWER
One example is Bagging trees, for more discussion about this see here.
525525
5 Please see over
Questions 3 is on Perceptron Training and requires you to refer to the following training data:
x1 x2 y -2 -1 -1 2 -1 1 111 -1 -1 -1 321
(a) Apply the Perceptron Learning Algorithm with starting values w0 = 5, w1 = 1 and w2 = 1, and a learning rate η = 0.4. Be sure to cycle through the training data in the same order that they are presented in the table.
ANSWER
Running the Perceptron training algorithm results in the following iterations, where highlighted columns signify that the weight vector is changing on that particular iteration. The correct weights are therefore w0 = 3.8, w1 = 2.6, w2 = 2.2. These weights first appear on iteration 9.
Iteration wTx
1 2.00 −2.00
2 6.80 6.80
3 7.80 7.80
4 1.40 −1.40
5 14.40 14.40
6 −2.00 2.00
7 6.80 6.80
8 8.20 8.20
9 0.20 −0.20
10 16.00 16.00 11 −3.60 3.60
12 6.80 6.80
13 8.60 8.60
ywTx w
[4.6, 1.8, 1.4]T
[4.6, 1.8, 1.4]T [4.6, 1.8, 1.4]T [4.2, 2.2, 1.8]T [4.2, 2.2, 1.8]T [4.2, 2.2, 1.8]T [4.2, 2.2, 1.8]T [4.2, 2.2, 1.8]T [3.8, 2.6, 2.2]T [3.8, 2.6, 2.2]T [3.8, 2.6, 2.2]T [3.8, 2.6, 2.2]T [3.8, 2.6, 2.2]T
(b) Consider a new point, x⋆ = (−5, 3). What is the predicted value and predicted class based on your learned perceptron for this point?
ANSWER
6 Please see over
value: −2.6, class: y⋆ = −1
(c) Consider adding a new point to the data set, x⋆ = (2, 2) and y⋆ = −1. Will your perceptron converge on the new dataset? How might you remedy this?
(d) Consider the following three logical functions:
1. A∧¬B
2. ¬A∨B
3. (A∨B)∧(¬A∨¬B)
Which of these functions can a perceptron learn? Explain. What are two ways that you can extend a perceptron to learn all three functions ?
ANSWER
A Perceptron can only learn f1 and f2. we can extend via multilayer perceptrons, or by using a smart transformation (i.e. kernel perceptron).
7 Please see over
Questions 4 covers Unsupervised Learning and require you to refer to the following infor- mation.
In these two questions you will apply the k-means algorithm. You will use a univariate (one- variable) dataset containing the following 12 instances:
Dataset = { 2.01, 3.49, 4.58, 4.91, 4.99, 5.01, 5.32, 5.78, 5.99, 6.21, 7.26, 8.00 }
Use the Manhattan or city-block distance, i.e., the distance between two instances xi and xj is the absolute value of the difference xi − xj . For example, if xi = 2 and xj = 3 then the distance between xi and xj is |2 − 3| = 1. Use the arithmetic mean to compute the centroids.
Apply the k-means algorithm to the above dataset of examples. Let k = 2. Let the two centroids (means) be initialised to {3.33, 6.67}. On each iteration of the algorithm record the centroids.
After two iterations of the algorithm you should have recorded two sets of two centroids.
(a) After applying your algorithm to the dataset for two iterations, which of the sets of centroids in the table above has been learned ?
(select the row of the table with values closest to your centroids)
Centroids
After 1 iteration After 2 iterations
Centroids 1 Centroids 2 Centroids 3 Centroids 4 Centroids 5
{ 2.75, 5.81 } { 4.00, 6.22 } { 4.51, 6.87 } { 4.67, 7.16 } { 4.83, 7.03 }
{ 3.24, 6.01 } { 4.17, 6.43 } { 4.33, 6.65 } { 4.51, 6.87 } { 4.28, 6.79 }
(a) Centroids 1 (b) Centroids 2 (c) Centroids 3 (d) Centroids 4 (e) Centroids 5
ANSWER
(b)
8 Please see over
Now apply the algorithm for one more iteration. Record the new centroids after iteration 3 and answer the following question.
(b) After 3 iterations it is clear that:
(a) due to randomness in the data, the centroids could change on further iterations
(b) due to randomness in the algorithm, the centroids could change on further iterations (c) k-means converges in probability to the true centroids
(d) the algorithm has converged and the clustering will not change on further iterations (e) the algorithm has not converged and the clustering will change on further iterations
ANSWER
(d) comment: After the third iteration, the cluster centroids are {4.17, 6.43}. This means that there is no change from iteration 2 to iteration 3, so the algorithm has converged, so (d) is true. The reason that (c) is not true is that we cannot be sure that these are the ’true’ centroids, so we cannot conclude that the converged centroids are the right centroids with high probability. (Convergence in probability means ‘with probability 1’)
9 Please see over
Question 5 is on Learning Theory and requires you to apply a mistake-bounded learner to the following dataset.
This dataset has 6 binary features, x1, x2, . . . x6. The class variable y can be either 1, denoting a positive example of the concept to be learned, or 0, denoting a negative example.
Apply the Winnow2 algorithm to the above dataset of examples in the order in which they appear. Use the following values for the Winnow2 parameters: threshold t = 2, α = 2. Initialise all weights to have the value 1.
(a) After one epoch, i.e., one pass through the dataset, which of the above weight configura- tions has been learned ?
Example
x1 x2 x3 x4 x5 x6
Class
1) 2) 3) 4) 5)
000011 101101 010101 011001 110000
1 1 0 0 1
Learned Weights
w1 w2 w3 w4 w5 w6
Weight vector 1 Weight vector 2 Weight vector 3 Weight vector 4 Weight vector 5
2.000 1.000 1.000 0.000 2.000 1.000 3.000 0.000 1.000 1.000 2.000 1.000 2.000 2.000 2.000 2.000 2.000 2.000 2.000 0.500 0.500 0.500 2.000 0.500 2.000 0.250 0.500 0.500 4.000 0.125
(a) Weight vector 1 (b) Weight vector 2 (c) Weight vector 3 (d) Weight vector 4 (e) Weight vector 5
ANSWER
(d)
10 Please see over
(b) On which of the examples did the algorithm not make a mistake ?
(a) Examples 1), 2) and 5) (b) Example 5)
(c) Example 4)
(d) Examples 4) and 5) (e) None of the above
ANSWER
(e)
(c) The algorithm has learned a consistent concept on the training data:
(a) True
(b) False
(c) It is not possible to determine this
ANSWER
(a)
(d) Assume the target concept from which this dataset was generated is defined by exactly two features. The worst-case mistake bound for the algorithm on this dataset is approximately:
(a) 1.79 (b) 2.58 (c) 3.58 (d) 4.67 (e) 10.75
ANSWER
11 Please see over
(c)
comment: The solution is based on the worst case mistake bound O(r log n) where
r = number of relevant features = 2 n = number of total features = 6
So r log n = 2 log 6 = 3.5835 (log is base e). This concept was introduced on slide 61/78, but also used in Q5 of the learning theory problem set. The idea is that winnow is more robust to irrelevant features, so will scale with the logarithm of the number of irrelevant features. In this case, we are given 6 features x1, . . . , x6, but in (d) we are told that the data was actually generated from two features.
The proof of the bound is outside the scope of the course but not too complicated, see here for a simplified version.
12 Please see over
Question 6 is on Boosting Theory and requires you to implement the Adaptive Boosting Algo- rithm from lectures. Use the following code to generate a toy binary classification dataset:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import warnings
warnings.simplefilter(action=’ignore’, category=FutureWarning)
from sklearn.tree import DecisionTreeClassifier from sklearn.datasets import make_blobs
np.random.seed(2)
n_points = 15
X, y = make_blobs(n_points, 2, centers=[(0,0), (-1,1)]) y[y==0] = -1 # use -1 for negative class instead of 0
plt.scatter(*X[y==1].T, marker=”+”, s=100, color=”red”)
plt.scatter(*X[y==-1].T, marker=”o”, s=100, color=”blue”) plt.show()
1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17
1 def plotter(classifier , X, y, title , ax=None):
13 Please see over
Your data should look like:
(a) By now, you should be familiar with the scikitlearn DecisionTreeClassifier class. Fit Decision trees of increasing maximum depth for depths ranging from 1 to 9. Plot the decision boundaries of each of your models in a 3 × 3 grid. You may find the following helper function useful:
# plot decision boundary for given classifier
plot_step = 0.02
x_min, x_max = X[:, 0].min() – 1, X[:,0].max() + 1 y_min, y_max = X[:, 1].min() – 1, X[:,1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step), np.arange(y_min , y_max , plot_step))
Z = classifier.predict(np.c_[xx.ravel(),yy.ravel()]) Z = Z.reshape(xx.shape)
if ax:
ax.contourf(xx, yy, Z, cmap = plt.cm.Paired)
ax.scatter(X[:, 0], X[:, 1], c = y)
ax.set_title(title) else:
plt.contourf(xx, yy, Z, cmap = plt.cm.Paired) plt.scatter(X[:, 0], X[:, 1], c = y) plt.title(title)
2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17
ANSWER
14 Please see over
The code used is:
1 2 3 4 5 6 7
fig, ax = plt.subplots(3,3, figsize=(10,10))
titles = [f”DT max_depth={i}” for i in range(1,10)] for i, ax in enumerate(ax.flat):
dt = DecisionTreeClassifier(max_depth=i+1).fit(X, y)
plotter(dt, X, y, titles[i], ax) plt.savefig(“boosting_a.png”) plt.tight_layout()
15 Please see over
8 plt.show()
16 Please see over
(b) Comment on your results in (a). What do you notice as you increase the depth of the trees? What do we mean when we say that trees have low bias and high variance?
ANSWER
something along the lines of: low bias since the shape of the learned function is highly flexible and can adapt to the shape of the data. DTs can fit any noise in the data so are high variance. Some discussion of the bias variance decomposition would also be nice.
17 Please see over
(c) We now restrict attention to trees of depth 1. These are the most basic decision trees and are commonly referred to as decision stumps. Consider the adaptive boosting algorithm presented in the ensemble methods lecture notes on slide 50/70. In adaptive boosting, we build a model composed of T weak learners from a set of weak learners. At step t, we pick a model from the set of weak learners that minimises weighted error:
n
εt = wt−1,iI{yi ̸= yˆi}
i=1
where wt−1,i is the weight at the previous step for observation i, and I{yi ̸= yˆi} is equal to 1 if yi ̸= yˆi and zero otherwise. We do this for a total of T steps, which gives us a boosted model composed of T base classifiers:
T
M(x) = αtMt(x) t=1
where αt is the weight assigned to the t-th model. Classification is then carried out by assigning a point to the positive class if M(x) > 0 or to the negative class if M(x) < 1. Here we will take the class of weak learners to be the class of Decision stumps. You may make use of the ‘sample weight’ argument in the ‘fit()’ method to assign weights to the individual data points. Write code to build a boosted classifier for T = 15. Demonstrate the performance of your model on the generated dataset by printing out a list of your predictions versus the true class labels. (note: you may be concerned that the decision tree implementation in scikit learn does not actually minimise εt even when weights are assigned, but we will ignore this detail for the current question).
ANSWER
One approach is as follows:
1 2 3 4 5 6 7 8 9
10 11
## Question 6 part b
def weighted_error(w, y, yhat):
return np.sum(w*(y!=yhat))
T = 15
w = np.zeros((T+1, n_points)) w[0] = np.ones(n_points)/n_points
# storage for boosted model
alphas = np.zeros(T); component_models = []
18 Please see over
def boosted_model(x, alphas , component_models):
individual_preds = np.array([m.predict(x) for m in component_models]) alphaM = alphas * individual_preds.T
return np.sign(alphaM.sum(axis=1))
for t in range(1, T+1):
# select weak classifier that minimises weighted error
dt = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w[t-1])
# compute predictions from current model
yhat = dt.predict(X)
# comptue weighted error epsilon_t
eps_t = weighted_error(w[t-1], y, yhat)
# compute alpha_t (model weight)
alpha_t = 0.5 * np.log((1-eps_t)/eps_t)
# update weights for next round
for i in range(n_points): if y[i] == yhat[i]:
# correctly classified instances
w[t, i] = w[t-1, i]/(2 * (1-eps_t)) else:
# misclassified instances
w[t, i] = w[t-1, i]/(2 * eps_t)
# save model
alphas[t-1] = alpha_t component_models.append(dt)
boosted_preds = boosted_model(X, alphas , component_models)
np.all((boosted_preds * y) == 1) # check all classified correctly
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
19 Please see over
1 2 3 4 5 6 7
8
9 10 11 12 13 14 15 16 17 18 19 20 21
22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
(d) In this question, we will extend our implementation in (c) to be able to use the plotter function in (b). To do this, we need to implement a boosting model class that has a ‘predict’ method. Once you do this, repeat (c) for T = [2, . . . , 17]. Plot the decision boundary of your 16 models in a 4 × 4 grid. The following template may be useful:
class boosted_model:
def __init__(self, T):
self.alphas = np.zeros(T) self.component_models = []
def predict(self, x):
individual_preds = np.array([m.predict(x) for m in self.
component_models])
alphaM = self.alphas * individual_preds.T
bms = []
return np.sign(alphaM.sum(axis=1))
for T in range(2, 18):
w = np.zeros((T+1, n_points))
w[0] = np.ones(n_points)/n_points bm = boosted_model(T)
for t in range(1, T+1):
# select weak classifier that minimises weighted error
dt = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w[t -1])
# compute predictions from current model
yhat = dt.predict(X)
# comptue weighted error epsilon_t
eps_t = weighted_error(w[t-1], y, yhat)
# compute alpha_t (model weight)
alpha_t = 0.5 * np.log((1-eps_t)/eps_t)
# update weights for next round
for i in range(n_points): if y[i] == yhat[i]:
# correctly classified instances
w[t, i] = w[t-1, i]/(2 * (1-eps_t)) else:
# misclassified instances
w[t, i] = w[t-1, i]/(2 * eps_t)
20 Please see over
# save model
bm.alphas[t-1] = alpha_t bm.component_models.append(dt)
# check number of components is right
# print("nmb_components: ", len(bm.component_models))
# store current boosted model
bms.append(bm)
fig, ax = plt.subplots(4,4, figsize=(12,12))
titles = [f"n_components={len(bm.component_models)}" for bm in bms] for i, ax in enumerate(ax.flat):
plotter(bms[i], X, y, titles[i], ax) plt.savefig("boosting_grid.png") plt.tight_layout()
plt.show()
40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
(e) Discuss the differences between bagging and boosting.
21 Please see over
Question 7 Some more suggestions - Make sure you are comfortable with the following 1. Naive Bayes Classification example from lectures/tutorials
2. Decision trees
3. Understanding the Bias Variance trade off
4. SVM calculation
5. K Means clustering
6. VC dimension
7. have worked through the labs and are comfortable with scikitlearn/numpy.
22 Please see over
END OF PAPER
23