Exercise 2
Section A. Model Complexity and Model Selection
In this section, you study the effect of model complexity on the training and testing error. You also demonstrate your programming skills by developing a regression algorithm and a cross-validation technique, which you will use to select a model of appropriate complexity.
Background
A KNN regressor is similar to a KNN classifier (covered in Activity 1.1) in that it finds the K nearest neighbors and estimates the value of the given test point based on the values of its neighbors. The main difference is that a KNN classifier returns the label with the majority vote in the neighborhood, whilst a KNN regressor returns the average of the neighbors’ values.
Question 1 [KNN Regressor, 20 Marks]
I. Implement the KNN regressor function: knn(train.data, train.label, test.data, K=3)
which takes the training data and their labels (continuous values), the test set, and the size of the neighborhood (K). It should return the regressed values for the test data points. When choosing the neighbors, you can use the Euclidean distance function to measure the distance between a pair of data points.
Hint: You are allowed to use KNN classifier code from Activity 1.1.
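A minimal sketch of such a regressor is given below. It assumes train.data and test.data are numeric matrices or data frames with one row per point; the exact column layout of the provided CSVs may differ, so adapt the data preparation accordingly.

```r
# KNN regressor: predicts each test point as the mean label of its K nearest
# training points under Euclidean distance.
knn <- function(train.data, train.label, test.data, K = 3) {
  train.data <- as.matrix(train.data)
  test.data  <- as.matrix(test.data)
  pred <- numeric(nrow(test.data))
  for (i in seq_len(nrow(test.data))) {
    # Euclidean distance from the i-th test point to every training point
    diff <- sweep(train.data, 2, test.data[i, ])
    dist <- sqrt(rowSums(diff^2))
    # average the labels of the K nearest neighbors
    nn <- order(dist)[1:K]
    pred[i] <- mean(train.label[nn])
  }
  pred
}
```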
II. Plot the training and the testing errors versus 1/K for K=1,..,20 in one plot, using the Task1A_train.csv and Task1A_test.csv datasets provided for this assignment. Save the plot in your Jupyter Notebook file for Question 1.
III. Report (in your Jupyter Notebook file) the optimum value for K in terms of the testing error. Discuss the values of K corresponding to underfitting and overfitting based on your plot in the previous part (Part II).
Question 2 [L-fold Cross Validation, 20 Marks]
I. Implement an L-fold cross-validation (CV) function for your KNN regressor: cv(train.data, train.label, numFold=10)
which takes the training data and their labels (continuous values) and the number of folds, and returns the error for each fold of the training data.
Hint: You are allowed to use the bootstrap code from Activity 1.2.
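A sketch of the fold loop follows. Since the given signature does not include the neighborhood size, K is added here as an extra argument purely for illustration, and mean squared error is used as a placeholder error measure.

```r
# L-fold cross-validation for the KNN regressor from Question 1.
cv <- function(train.data, train.label, numFold = 10, K = 3) {
  n <- nrow(train.data)
  # randomly assign each training point to one of the numFold folds
  folds <- sample(rep(1:numFold, length.out = n))
  errors <- numeric(numFold)
  for (f in 1:numFold) {
    held <- which(folds == f)
    pred <- knn(train.data[-held, , drop = FALSE], train.label[-held],
                train.data[held, , drop = FALSE], K = K)
    errors[f] <- mean((pred - train.label[held])^2)  # placeholder: MSE
  }
  errors
}
```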
II. Using the training data, run your L-fold CV with numFold set to 10.
For each K=1,..,20 in your KNN regressor, compute the average of the 10 fold errors. Plot this average error versus 1/K, and add two dashed lines around it indicating the average +/- the standard deviation of the fold errors (see the sketch below). Save the plot in your Jupyter Notebook file for Question 2.
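One way to draw the dashed bands, assuming avg.err and sd.err are length-20 vectors holding the mean and standard deviation of the fold errors for each K (the names are illustrative):

```r
Ks <- 1:20
plot(1 / Ks, avg.err, type = "l",
     xlab = "1/K", ylab = "average cross-validation error")
lines(1 / Ks, avg.err + sd.err, lty = 2)  # dashed line: average + sd
lines(1 / Ks, avg.err - sd.err, lty = 2)  # dashed line: average - sd
```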
III. Report (in your Jupyter Notebook file) the values of K that result in the minimum average error and the minimum standard deviation of errors, based on your cross-validation plot in the previous part (Part II).
Section B. Prediction Uncertainty with Bootstrapping
This section adapts Activity 1.2 from KNN classification to KNN regression. You use the bootstrapping technique to quantify the uncertainty of predictions for the KNN regressor that you implemented in Section A.
Question 3 [Bootstrapping, 20 Marks]
I. Modify the code in Activity 1.2 to handle bootstrapping for KNN regression.
II. Load the Task1B_train.csv and Task1B_test.csv sets. Apply your bootstrapping for KNN regression with times = 100 (the number of subsets), size = 25 (the size of each subset), and vary K=1,..,20 (the neighborhood size); a sketch of the resampling loop is given below. Now create a boxplot where the x-axis is K and the y-axis is the average error (and the uncertainty around it) corresponding to each K. Save the plot in your Jupyter Notebook file for Question 3.
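A sketch of the resampling loop, assuming the knn() function from Question 1 and mean squared error as a placeholder error measure; the bookkeeping should follow your Activity 1.2 code.

```r
# For one value of K: draw `times` bootstrap subsets of size `size` from the
# training set, fit KNN on each, and record the test error of each fit.
boot.knn <- function(train.data, train.label, test.data, test.label,
                     K, times = 100, size = 25) {
  errors <- numeric(times)
  for (t in 1:times) {
    idx <- sample(nrow(train.data), size, replace = TRUE)
    pred <- knn(train.data[idx, , drop = FALSE], train.label[idx],
                test.data, K = K)
    errors[t] <- mean((pred - test.label)^2)  # placeholder: MSE
  }
  errors
}

# The per-K error vectors can then go straight into the boxplot, e.g.:
# err.by.K <- lapply(1:20, function(K) boot.knn(trn, trn.y, tst, tst.y, K))
# boxplot(err.by.K, names = 1:20, xlab = "K", ylab = "test error")
```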
III. Based on the plot in the previous part (Part II), how do the test error and its uncertainty behave as K increases? Explain in your Jupyter Notebook file.
IV. Load the Task1B_train.csv and Task1B_test.csv sets. Apply your bootstrapping for KNN regression with K=10 (the neighborhood size), size = 25 (the size of each subset), and vary times = 10, 20, 30,.., 200 (the number of subsets). Now create a boxplot where the x-axis is ‘times’ and the y-axis is the average error (and the uncertainty around it) corresponding to each value of ‘times’. Save the plot in your Jupyter Notebook file for Question 3.
V. Based on the plot in the previous part (Part IV), how do the test error and its uncertainty behave as the number of subsets in bootstrapping increases? Explain in your Jupyter Notebook file.
Section C. Probabilistic Machine Learning
In this section, you show your knowledge of the foundations of probabilistic machine learning (i.e. probabilistic inference and modeling) by solving two simple but fundamental statistical inference problems. Solve the following problems based on the probability concepts you have learned in Module 1, using the same mathematical conventions.
Question 4 [Bayes Rule, 20 Marks]
Suppose we have one red and one blue box. In the red box we have 2 apples and 6 oranges, whilst in the blue box we have 3 apples and 1 orange. Now suppose we randomly select one of the boxes and pick a fruit. If the picked fruit is an apple, what is the probability that it was picked from the blue box?
Note that the chance of picking the red box is 40%, and every piece of fruit in a box is equally likely to be selected. Please show your work in your PDF report.
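For reference, the quantity asked for follows from Bayes' rule in this form (B denotes the chosen box and F the picked fruit; the notation is illustrative):

```latex
P(B=\text{blue} \mid F=\text{apple})
  = \frac{P(F=\text{apple} \mid B=\text{blue})\, P(B=\text{blue})}
         {\sum_{b \in \{\text{red},\,\text{blue}\}} P(F=\text{apple} \mid B=b)\, P(B=b)}
```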
Question 5 [Maximum Likelihood, 20 Marks]
As opposed to a coin, which has two faces, a die has 6 faces. Suppose we are given a dataset that contains the outcomes of 10 independent tosses of a die: D:={1,4,5,3,1,2,6,5,6,6}. We are asked to build a model for this die, i.e. a model that tells us the probability of each face if we toss it. Using the maximum likelihood principle, determine the best values for our model parameters. Please show your work in your PDF report.
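For reference, with face probabilities θ₁,…,θ₆ and N_k denoting the number of times face k appears in D, the likelihood of independent tosses and its constrained maximizer take the standard categorical form (the notation is illustrative):

```latex
L(\boldsymbol{\theta}) = \prod_{k=1}^{6} \theta_k^{N_k},
\qquad \text{subject to } \sum_{k=1}^{6} \theta_k = 1,
\qquad \hat{\theta}_k^{\mathrm{ML}} = \frac{N_k}{\sum_{j=1}^{6} N_j}.
```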
Section D. Ridge Regression
In this section, you develop ridge regression by adding L2 norm regularization to linear regression (covered in Activity 1 of Module 2). This section assesses your mathematical (derivation) and programming skills.
Question 6 [Ridge Regression, 25 Marks]
I. Given the gradient descent algorithms for linear regression (discussed in Chapter 2 of Module 2), derive the weight update steps of stochastic gradient descent (SGD) and batch gradient descent (BGD) for linear regression with an L2 regularization term. Show your work with enough explanation in your PDF report; provide the steps of SGD and BGD separately.
Hint: Recall that for linear regression we defined the error function E and differentiated it to obtain the update direction. For this assignment, you only need to add an L2 regularization term to the error function and take the derivative of both terms (the error term plus the regularization term). This question is similar to Activity 2.1.
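For orientation, with the regularized sum-of-squares error written as below (this exact notation and scaling are assumptions; follow the module's conventions), the two update rules take the following form, where η is the learning rate and n indexes a single training point:

```latex
% Regularized sum-of-squares error (notation assumed; match the module's):
E(\mathbf{w}) = \tfrac{1}{2}\sum_{n=1}^{N}\bigl(y_n-\mathbf{w}^{\top}\mathbf{x}_n\bigr)^{2}
              + \tfrac{\lambda}{2}\lVert\mathbf{w}\rVert^{2}

% BGD: one weight update per pass over all N training points
\mathbf{w} \leftarrow \mathbf{w}
  - \eta\Bigl(\textstyle\sum_{n=1}^{N}(\mathbf{w}^{\top}\mathbf{x}_n-y_n)\,\mathbf{x}_n
  + \lambda\mathbf{w}\Bigr)

% SGD: one weight update per visited training point n
\mathbf{w} \leftarrow \mathbf{w}
  - \eta\bigl((\mathbf{w}^{\top}\mathbf{x}_n-y_n)\,\mathbf{x}_n + \lambda\mathbf{w}\bigr)
```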
II. Using R (without any special libraries), implement the SGD and BGD algorithms that you derived in Step I. The implementation is straightforward, as you are allowed to use the code examples from Activity 2.1.
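A compact sketch under the update rules above; it assumes X is an N x D design matrix (with a bias column already appended) and y is the target vector. The function names and default settings are illustrative.

```r
# Batch gradient descent for ridge regression: one weight update per epoch,
# using the gradient of the regularized sum-of-squares error.
bgd.ridge <- function(X, y, lambda, eta, epochs = 20, w = rep(0, ncol(X))) {
  for (e in 1:epochs) {
    grad <- t(X) %*% (X %*% w - y) + lambda * w  # full-batch gradient
    w <- w - eta * as.vector(grad)
  }
  w
}

# Stochastic gradient descent: one weight update per visited data point.
sgd.ridge <- function(X, y, lambda, eta, epochs = 20, w = rep(0, ncol(X))) {
  for (e in 1:epochs) {
    for (n in sample(nrow(X))) {                 # visit points in random order
      grad <- (sum(X[n, ] * w) - y[n]) * X[n, ] + lambda * w
      w <- w - eta * grad
    }
  }
  w
}
```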
III. Now let’s compare SGD and BGD implementations of ridge regression from Step II:
a. Load Task2A_train.csv and Task2A_test.csv sets.
b. Set the termination criterion to a maximum of 20 weight updates for BGD, which is equivalent to 20 x N weight updates for SGD (where N is the number of training data points).
c. Run your implementations of SGD and BGD with all parameter settings (initial values, learning rate, etc.) exactly the same for both algorithms. During the run, record the training error every time the weights are updated (a sketch of this error-recording loop is given after this list). Create a plot of the errors (use different colors for SGD and BGD), where the x-axis is the number of visited data points and the y-axis is the error. Save your plot in your Jupyter Notebook file for Question 6. Note that for every N SGD errors in the plot, you will have only one BGD error; the total length of the x-axis will be 20 x N.
d. Explain (in your Jupyter Notebook file) your observations based on the error plot you generated in Part c. In particular, discuss the convergence speed and the fluctuations you see in the error trends.
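A sketch of the error-recording loop for BGD is shown below (SGD is analogous, recording one error per visited point); rmse() is a stand-in for whatever error measure you use.

```r
# Root-mean-square training error, used here as a placeholder error measure.
rmse <- function(X, y, w) sqrt(mean((X %*% w - y)^2))

# BGD with one recorded error per weight update (i.e. per epoch). When
# plotting, place the e-th BGD error at x = e * N so that both curves share
# the "number of visited data points" axis.
bgd.curve <- function(X, y, lambda, eta, epochs = 20) {
  w <- rep(0, ncol(X))
  errs <- numeric(epochs)
  for (e in 1:epochs) {
    w <- w - eta * as.vector(t(X) %*% (X %*% w - y) + lambda * w)
    errs[e] <- rmse(X, y, w)
  }
  errs
}
```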
Section E. Bias-Variance Analysis
In this section, you conduct a bias-variance study on the ridge regression that you have developed in Section D. This task assesses your analytical skills and is based on Chapter 6 of Module 2. You essentially recreate Figure 2.6.3 using your implementation of ridge regression (with SGD) from Section D.
Question 7 [Bias-Variance for Ridge Regression, 25 Marks]
I. Load Task2B_train.csv and Task2B_test.csv sets.
II. Sample 50 sets from the provided training set, each containing 100 data points selected randomly with replacement.
III. For each lambda in {0, 0.2, 0.4, 0.6, …, 5} do:
a. Build 50 regression models using the sampled sets
b. Based on the predictions of these models on the test set, calculate the (average) test error, the variance, the (bias)², and the variance + (bias)² (see the sketch below). Plot these four quantities versus log lambda, and include the plot in your Jupyter Notebook file for Question 7.
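A sketch of the per-lambda computation, assuming P is a 50 x Ntest matrix of test-set predictions (one row per bootstrap model) and that the test labels y stand in for the true function values, as is usual in this style of exercise.

```r
bias.variance <- function(P, y) {
  avg.pred <- colMeans(P)                # prediction of the "average model"
  bias2    <- mean((avg.pred - y)^2)     # (bias)^2, averaged over test points
  variance <- mean(apply(P, 2, var))     # spread of the 50 models per point
  test.err <- mean(sweep(P, 2, y)^2)     # average test error over all models
  c(error = test.err, bias2 = bias2,
    variance = variance, bias2.plus.variance = bias2 + variance)
}
```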
IV. Based on your plot in the previous part (Part III), what’s the best value for lambda? Explain your answer in terms of the bias, variance, and test error in your Jupyter Notebook file.
Section F. Multiclass Perceptron
In this section, you demonstrate your understanding of linear models for classification by extending the binary perceptron algorithm covered in Activity 1 of Module 3 into a multiclass classifier.
Training Algorithm. We train the multiclass perceptron based on the following algorithm:
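(The formulation below is the standard multiclass perceptron and is assumed here.) Initialize one weight vector w_k per class, k = 1, …, K. Then repeat over the training data until convergence or an iteration limit is reached; for each training point (x, y):
1. Predict ŷ = argmax_k w_k·x.
2. If ŷ ≠ y, update w_y ← w_y + ηx and w_ŷ ← w_ŷ − ηx; otherwise leave the weights unchanged.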
In what follows, we look into the convergence properties of the training algorithm for multiclass perceptron (similar to Activity 3.1).
Question 8 [Multiclass Perceptron, 25 Marks]
I. Load Task2C_train.csv and Task2C_test.csv sets.
II. Implement the multiclass perceptron as explained above. Please provide enough comments for your code in your submission.
III. Set the learning rate η to 0.1 and train the multiclass perceptron on the provided training data. After processing every 5 training data points (one mini-batch), evaluate the error of the current model on the test data (a sketch of this evaluation schedule is given below). Plot the test error versus the number of mini-batches, and include the plot in your Jupyter Notebook file for Question 8.
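A sketch of the evaluation schedule; update.mc() and predict.mc() are placeholder names for the single-point update and the prediction routine of your implementation.

```r
eta <- 0.1
test.errs <- c()
for (n in seq_len(nrow(train.data))) {
  # one perceptron update with the n-th training point (placeholder call)
  W <- update.mc(W, train.data[n, ], train.label[n], eta)
  if (n %% 5 == 0) {  # one mini-batch of 5 points has been processed
    pred <- predict.mc(W, test.data)  # placeholder call
    test.errs <- c(test.errs, mean(pred != test.label))
  }
}
plot(seq_along(test.errs), test.errs, type = "l",
     xlab = "number of mini-batches", ylab = "test error")
```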
IV. Suppose we did not want to use the multiclass perceptron, and instead were interested in using the one-versus-one approach to solve the multiclass classification problem (Chapter 2 in Module 3). The idea is to build K(K−1)/2 classifiers, one for each possible pair of classes, where K is the number of classes. Each point is then classified according to a majority vote among the discriminant functions.
a. Train your K(K−1)/2 binary perceptron classifiers using the training data.
b. Predict the labels of the data points in the test set by majority vote (see the voting sketch after this list). Whenever there is a tie between two or more labels for a data point, call it a confusion event.
c. Did you expect to see confusion events in the one-versus-one approach? For how many test data points did you observe confusion? Why? Explain in your Jupyter Notebook file.
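A sketch of the voting step, assuming votes is an Ntest x K matrix in which votes[i, k] counts how many of the K(K−1)/2 binary classifiers voted for class k on test point i (the names are illustrative).

```r
decide <- function(votes) {
  winner    <- apply(votes, 1, which.max)  # majority-vote label per test point
  # confusion event: the top vote count is shared by two or more classes
  confusion <- apply(votes, 1, function(v) sum(v == max(v)) > 1)
  list(label = winner, confusion = confusion)
}
```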
Section G. Logistic Regression vs. Bayesian Classifier
This task assesses your analytical skills. You study the performance of two well-known models, the (generative) Bayesian classifier and (discriminative) logistic regression, as the size of the training set increases. You then show your understanding of the behavior of the learning curves of typical generative and discriminative models.
Question 9 [Discriminative vs Generative Models, 25 Marks]
I. Load Task2D_train.csv and Task2D_test.csv, as well as the Bayesian classifier (BC) and logistic regression (LR) codes from Activities 3.2 and 3.3.
II. Using the first 5 data points from the training set, train a BC and an LR model, and compute their test errors. In a “for loop”, increase the size of the training set (5 data points at a time), retraining the models and calculating their test errors until all training data points are used; a sketch of this loop is given below. In one figure, plot the test errors of each model (in different colors) versus the size of the training set; include the plot in your Jupyter Notebook file for Question 9.
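A sketch of the loop; train.bc/predict.bc and train.lr/predict.lr are placeholder names for the Activity 3.2 and 3.3 routines.

```r
sizes <- seq(5, nrow(train.data), by = 5)
err.bc <- err.lr <- numeric(length(sizes))
for (i in seq_along(sizes)) {
  idx <- 1:sizes[i]  # the first sizes[i] training points
  bc <- train.bc(train.data[idx, ], train.label[idx])  # placeholder calls
  lr <- train.lr(train.data[idx, ], train.label[idx])
  err.bc[i] <- mean(predict.bc(bc, test.data) != test.label)
  err.lr[i] <- mean(predict.lr(lr, test.data) != test.label)
}
plot(sizes, err.bc, type = "l", col = "red",
     xlab = "training set size", ylab = "test error")
lines(sizes, err.lr, col = "blue")
legend("topright", c("Bayesian classifier", "logistic regression"),
       col = c("red", "blue"), lty = 1)
```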
III. Explain your observations in your Jupyter Notebook file:
a. What happens to each classifier as the number of training data points increases?
b. Which classifier is better suited when the training set is small, and which when the training set is large?
c. Justify your observations in the previous questions (III.a & III.b) by offering possible reasons and informed speculation.
Hint: Think about model complexity and the fundamental concepts of machine learning covered in Activities 1.1 and 1.2.
Requirements
1. Jupyter Notebook files containing the code and your answers for Questions 1 to 9 (a separate “.ipynb” file for each of Questions 1 to 9).
2. You must add enough comments to your code to make it readable and understandable.
3. A PDF file that contains your report. The file name should be in the following format: STUDENTID_exercise_2_report.pdf, where you replace STUDENTID with your own student ID.