Assignment 1 Brief¶
Deadline: Tuesday, October 29, 2019 at 14:00 hrs¶
Number of marks available: 20¶
Scope: Sessions 1 to 5¶
How and what to submit¶
A. Submit a Jupyter Notebook named COM4509-6509_Assignment_1_UCard_XXXXXXXXX.ipynb where XXXXXXXXX refers to your UCard number.
B. Upload the notebook file to MOLE before the deadline above.
C. NO DATA UPLOAD: Please do not upload the data files used. We have a copy already.
Assessment Criteria¶
• Being able to express an objective function and its gradients in matrix form.
• Being able to use numpy and pandas to preprocess a dataset.
• Being able to use numpy to build a machine learning pipeline for supervised learning.
Late submissions¶
We follow the Department’s guidelines about late submissions, i.e., a deduction of 5% of the mark for each working day the work is late after the deadline. NO late submission will be marked more than one week after the deadline because we will have released a solution by then. Please read this link.
Use of unfair means¶
“Any form of unfair means is treated as a serious academic offence and action may be taken under the Discipline Regulations.” (from the MSc Handbook). Please carefully read this link on what constitutes Unfair Means if not sure.
Regularisation for Linear Regression¶
Regularisation is a technique commonly used in Machine Learning to prevent overfitting. It consists of adding terms to the objective function so that the optimisation procedure avoids solutions that merely memorise the training data. Popular techniques for regularisation in Supervised Learning include Lasso Regression, Ridge Regression and the Elastic Net.
In this Assignment, you will be looking at Ridge Regression and devising equations to optimise its objective function using two methods: a closed-form derivation and the update rules for stochastic gradient descent. You will then use those update rules to make predictions on an Air Quality dataset.
Ridge Regression¶
Let us start with a data set for training $\mathcal{D} = \{\mathbf{y}, \mathbf{X}\}$, where the vector $\mathbf{y}=[y_1, \cdots, y_n]^{\top}$ and $\mathbf{X}$ is the design matrix from Lab 3, that is,
\begin{align*} \mathbf{X} = \begin{bmatrix} 1 & x_{1,1} & \cdots & x_{1, D}\\ 1 & x_{2,1} & \cdots & x_{2, D}\\ \vdots & \vdots & \ddots & \vdots\\ 1 & x_{n,1} & \cdots & x_{n, D} \end{bmatrix} = \begin{bmatrix} \mathbf{x}_1^{\top}\\ \mathbf{x}_2^{\top}\\ \vdots\\ \mathbf{x}_n^{\top} \end{bmatrix}. \end{align*}
Our predictive model is going to be a linear model
$$ f(\mathbf{x}_i) = \mathbf{w}^{\top}\mathbf{x}_i,$$
where $\mathbf{w} = [w_0\; w_1\; \cdots \; w_D]^{\top}$.
The objective function we are going to use has the following form
$$ J(\mathbf{w}, \alpha) = \frac{1}{n}\sum_{i=1}^n (y_i - f(\mathbf{x}_i))^2 + \frac{\alpha}{2}\sum_{j=0}^D w_j^2,$$
where $\alpha>0$ is known as the regularisation parameter.
The first term on the right-hand side (rhs) of the expression for $J(\mathbf{w}, \alpha)$ is very similar to the least-squares objective function we have seen before, for example in Lab 3. The only difference is the term $\frac{1}{n}$, which we use to normalise the objective with respect to the number of observations in the dataset.
The first term on the rhs is what we call the “fitting” term, whereas the second term is the regularisation term. Given $\alpha$, the two terms have different purposes. The first term looks for a value of $\mathbf{w}$ that drives the squared errors to zero. While doing this, $\mathbf{w}$ can take any value, leading to a solution that is only good for the training data but perhaps not for the test data. The second term regularises the behaviour of the first term by driving $\mathbf{w}$ towards zero. By doing this, it restricts the set of values that $\mathbf{w}$ might take according to the first term. The value we use for $\alpha$ controls the compromise between a value of $\mathbf{w}$ that exactly fits the data (first term) and a value of $\mathbf{w}$ that does not grow too large (second term).
This type of regularisation has different names: ridge regression, Tikhonov regularisation or $\ell_2$ norm regularisation.
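To make the objective concrete, below is a minimal numpy sketch that evaluates $J(\mathbf{w}, \alpha)$ directly from the summation form on a toy dataset; all names and values (X_toy, y_toy, w, alpha) are illustrative only and are not part of the assignment data.
In [ ]:
import numpy as np

# Toy design matrix (first column of ones) and targets, for illustration only
X_toy = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5]])
y_toy = np.array([1.0, 2.0, 3.0])
w = np.array([0.1, 0.9])   # weights [w_0, w_1]
alpha = 0.1                # regularisation parameter

n = X_toy.shape[0]
predictions = X_toy @ w                             # f(x_i) = w^T x_i for every row
fit_term = np.sum((y_toy - predictions) ** 2) / n   # (1/n) * sum of squared errors
reg_term = (alpha / 2) * np.sum(w ** 2)             # (alpha/2) * sum_j w_j^2
J = fit_term + reg_term
print(J)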
Question 1: $J(\mathbf{w}, \alpha)$ in matrix form (2 marks)¶
Write the expression for $J(\mathbf{w}, \alpha)$ in matrix form. Include ALL the steps necessary to reach the expression.
Question 1 Answer¶
Write your answer to the question in this box.
Optimising the objective function with respect to $\mathbf{w}$¶
There are two ways to optimise the objective function with respect to $\mathbf{w}$. The first leads to a closed-form expression for $\mathbf{w}$; the second uses an iterative optimisation procedure that updates the value of $\mathbf{w}$ at each iteration by using the gradient of the objective function with respect to $\mathbf{w}$, $$ \mathbf{w}_{\text{new}} = \mathbf{w}_{\text{old}} - \eta \frac{d J(\mathbf{w}, \alpha)}{d\mathbf{w}}, $$ where $\eta$ is the learning rate parameter and $\frac{d J(\mathbf{w}, \alpha)}{d\mathbf{w}}$ is the gradient of the objective function.
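To illustrate the mechanics of this update rule (without giving away the gradient you will derive in Question 2), here is a minimal sketch that minimises a simple quadratic instead; the target vector and learning rate are illustrative values only.
In [ ]:
import numpy as np

# Illustration only: minimise g(w) = ||w - target||^2, whose gradient is
# 2 * (w - target). In the assignment you would plug in the gradient of
# J(w, alpha) that you derive in Question 2.
target = np.array([1.0, -2.0, 0.5])

def gradient(w):
    return 2.0 * (w - target)

eta = 0.1                        # learning rate (illustrative value)
w = np.zeros(3)                  # initial guess
for iteration in range(100):
    w = w - eta * gradient(w)    # w_new = w_old - eta * gradient

print(w)                         # converges towards target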
Question 2: Derivative of $J(\mathbf{w}, \alpha)$ wrt $\mathbf{w}$ (2 marks)¶
Find the closed-form expression for $\mathbf{w}$ by taking the derivative of $J(\mathbf{w}, \alpha)$ with respect to $\mathbf{w}$, equating to zero and solving for $\mathbf{w}$. Write the expression in matrix form.
Also, write down the specific update rule for $\mathbf{w}_{\text{new}}$ by using the equation above.
Question 2 Answer¶
Write your answer to the question in this box.
Using ridge regression to predict air quality¶
Our dataset comes from a popular machine learning repository that hosts open source datasets for educational and research purposes, the UCI Machine Learning Repository. We are going to use ridge regression for predicting air quality. The description of the dataset can be found here.
In [ ]:
import pods
import zipfile

# Download the Air Quality dataset from the UCI repository
pods.util.download_url('https://archive.ics.uci.edu/ml/machine-learning-databases/00360/AirQualityUCI.zip')

# Extract all files from the zip archive into the current directory
with zipfile.ZipFile('./AirQualityUCI.zip', 'r') as zip_file:
    zip_file.extractall('.')
In [ ]:
# The .csv version of the file has some formatting issues, so we use the Excel version
import pandas as pd
air_quality = pd.read_excel('./AirQualityUCI.xlsx', usecols=range(2, 15))
We can see some of the rows in the dataset
In [ ]:
air_quality.sample(5)
The target variable corresponds to the CO(GT) variable in the first column. The following columns correspond to the variables in the feature vectors, e.g., PT08.S1(CO) is $x_1$ up until AH, which is $x_D$. The original dataset also has date and time columns that we are not going to use in this assignment.
Before designing our predictive model, we need to think about three stages: the preprocessing stage, the training stage and the validation stage. The three stages are interconnected, and it is important to remember that the test data we use for validation has to be set aside before preprocessing. Any preprocessing you do has to be done only on the training data, and several key statistics need to be saved for the test stage.
Separating the dataset into training and test sets before any preprocessing has happened helps us recreate the real-world scenario in which we will deploy our system, where the data will arrive without any preprocessing.
We are going to use hold-out validation for testing our predictive model, so we need to separate the dataset into a training set and a test set.
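As a hint of the mechanics (a sketch on a toy dataframe, not the answer itself), a hold-out split might look like this; the seed value 12345 is a placeholder for the last five digits of your UCard.
In [ ]:
import numpy as np
import pandas as pd

# Toy dataframe for illustration only
toy = pd.DataFrame({'a': range(10), 'b': range(10, 20)})

np.random.seed(12345)                       # placeholder seed, use your UCard digits
shuffled = np.random.permutation(toy.index) # random order of the row labels
n_train = int(0.7 * len(toy))               # 70% of the observations for training
train = toy.loc[shuffled[:n_train]]
test = toy.loc[shuffled[n_train:]]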
Question 3: Splitting the dataset (1 mark)¶
Split the dataset into a training set and a test set. The training set should have 70% of the total observations and the test set the remaining 30%. For the random selection, make sure that you use a random seed that corresponds to the last five digits of your student UCard. Make sure that you comment your code.
Question 3 Answer¶
In [ ]:
# Write your code here
Preprocessing the data¶
The dataset has missing values tagged with a -200 value. Before doing any work with the training data, we want to make sure that we deal properly with the missing values.
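As a starting point for the exploratory analysis, here is a minimal sketch that counts the -200 sentinel values per column; it assumes the dataframe is still named air_quality, as above.
In [ ]:
# Count how many entries in each column carry the -200 missing-value sentinel
missing_counts = (air_quality == -200).sum()
print(missing_counts)

# The same counts as a fraction of the number of rows
print(missing_counts / len(air_quality))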
Question 4: Missing values (3 marks)¶
Do some exploratory analysis of the number of missing values per column in the training data.
• Remove the rows for which the target feature has missing values. We are doing supervised learning, so we need all our data observations to have known target values.
• Remove features with more than 20% missing values. For all other features with missing values, use the mean of the non-missing values for imputation.
Question 4 Answer¶
In [ ]:
# Write your code here
Question 5: Normalising the training data (2 marks)¶
Now that you have dealt with the missing values, we need to normalise the input vectors.
• Explain in a sentence why you need to normalise the input features for this dataset.
• Normalise the training data by subtracting the mean value of each feature and dividing the result by the standard deviation of each feature. Keep the mean values and standard deviations; you will need them at test time (a toy sketch of this standardisation appears after this list).
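Here is that sketch, on a toy array; the names are illustrative, and the key point is that the statistics come from the training data only.
In [ ]:
import numpy as np

# Toy training matrix, rows are observations and columns are features
X_train_toy = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

means = X_train_toy.mean(axis=0)            # per-feature means, keep for test time
stds = X_train_toy.std(axis=0)              # per-feature standard deviations, keep too
X_train_norm = (X_train_toy - means) / stds

# At test time, reuse the SAME means and stds computed on the training data:
# X_test_norm = (X_test_toy - means) / stds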
Question 5 Answer¶
Write your explanation in this box
In [ ]:
# Write your code here
Training and validation stages¶
We have now curated our training data by removing data observations and features with a large number of missing values. We have also normalised the feature vectors. We are now in a good position to work on developing the prediction model and validating it. We will use both the closed-form expression for $\mathbf{w}$ and gradient descent for iterative optimisation.
We first organise the dataframe into the vector of targets $\mathbf{y}$ and the design matrix $\mathbf{X}$.
In [ ]:
# Write your code here to get y and X
y =
X =
Question 6: training with closed form expression for $\mathbf{w}$ (3 marks)¶
To find the optimal value of $\mathbf{w}$ using the closed form expression that you derived before, we need to know the value of the regularisation parameter $\alpha$ in advance. We will determine the value by using part of the training data for finding the parameters $\mathbf{w}$ and another part of the training data to choose the best $\alpha$ from a set of predefined values.
• Use np.logspace(start, stop, num) to create a set of values for $\alpha$ on a log scale. Use the parameters start=-3, stop=2 and num=20 (a sketch of this call appears after this list).
• Randomly split the training data into what is properly called the training set and the validation set. As before, make sure that you use a random seed that corresponds to the last five digits of your student UCard. Use 70% of the data for the training set and 30% of the data for the validation set.
• For each value of $\alpha$ from the previous step, use the training set to compute $\mathbf{w}$ and then measure the mean-squared error (MSE) over the validation data. After this, you will have num=20 MSE values. Choose the value of $\alpha$ that leads to the lowest MSE and save it. You will use it at the test stage.
• What was the best value of $\alpha$? Is there any explanation for that?
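For reference, the grid of candidate $\alpha$ values from the first bullet can be created like this:
In [ ]:
import numpy as np

# 20 values of alpha spaced logarithmically between 10^-3 and 10^2
alphas = np.logspace(-3, 2, num=20)
print(alphas)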
Question 6 Answer¶
In [ ]:
# Write your code here
Write your answer to the last question here.
Question 7: validation with the closed form expression for $\mathbf{w}$ (2 marks)¶
We are now going to deal with the test data to perform the validation of the model. Remember that the test data might also contain missing values in the target variable and in the input features.
• Remove the rows of the test data for which the labels have missing values.
• If you removed any features at the training stage, you also need to remove the same features at the test stage.
• Replace the missing values in each feature variable with the mean value you computed on the training data.
• Normalise the test data using the means and standard deviations computed from the training data.
• Compute $\mathbf{w}$ again for the value of $\alpha$ that performed best on the validation set, this time using ALL the training data (not only the training set).
• Report the MSE on the preprocessed test data and plot a histogram of the absolute errors (a toy sketch of such a histogram appears after this list).
• Does the regularisation have any effect on the model? Explain your answer.
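Here is that sketch of the histogram, using toy arrays in place of the true and predicted test targets; the names y_test and y_pred are illustrative.
In [ ]:
import numpy as np
import matplotlib.pyplot as plt

# Toy values standing in for the true and predicted test targets
y_test = np.array([2.0, 1.5, 3.0, 2.2])
y_pred = np.array([1.8, 1.9, 2.5, 2.4])

abs_errors = np.abs(y_test - y_pred)   # absolute error per test observation
plt.hist(abs_errors, bins=10)
plt.xlabel('Absolute error')
plt.ylabel('Count')
plt.show()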
Question 7 Answer¶
In [ ]:
# Write your code here
Write the explanation to your answer here.
Question 8: training with gradient descent and validation (5 marks)¶
Use gradient descent to iteratively compute the value of $\mathbf{w}_{\text{new}}$. Instead of using all the training set to compute the gradient, use a subset of $B$ datapoints from the training set. This is sometimes called minibatch gradient descent, where $B$ is the size of the minibatch. When using gradient descent with minibatches, you need to find the best values for three parameters: $\eta$, the learning rate; $B$, the number of datapoints in the minibatch; and $\alpha$, the regularisation parameter. A toy sketch of a single minibatch update appears below.
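Here is that sketch: drawing one minibatch and applying a single update on toy data. The gradient of the unregularised mean squared error stands in for the gradient of $J(\mathbf{w}, \alpha)$ from Question 2; all names and values are illustrative.
In [ ]:
import numpy as np

# Toy training data for illustration only
X_train_toy = np.random.randn(100, 3)
y_train_toy = np.random.randn(100)

B = 10      # minibatch size
eta = 0.01  # learning rate
w = np.zeros(3)

# Draw a random minibatch of B datapoints (without replacement)
idx = np.random.choice(len(X_train_toy), size=B, replace=False)
X_batch, y_batch = X_train_toy[idx], y_train_toy[idx]

# One update step: gradient of the (unregularised) mean squared error on the
# minibatch; in the assignment, use the full gradient of J(w, alpha) instead
grad = -2.0 / B * X_batch.T @ (y_batch - X_batch @ w)
w = w - eta * grad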
• As you did in Question 6, create a grid of values for the parameters $\alpha$ and $\eta$ using np.logspace, and a grid of values for $B$ using np.linspace. Because you need to find three parameters, start with num=5 and see if you can increase it.
• Use the same training set and validation set that you used in Question 6.
• For each combination of $\alpha$, $\eta$ and $B$ from the previous step, use the training set to compute $\mathbf{w}$ using minibatch gradient descent and then measure the MSE over the validation data. For the minibatch gradient descent, stop the iterative procedure after $500$ iterations.
• Choose the values of $\alpha$, $\eta$ and $B$ that lead to the lowest MSE and save them. You will use them at the test stage.
3 marks out of the 5 marks
• Use the test set from Question 7 and report the MSE obtained by using minibatch training with the best values for $\alpha$, $\eta$ and $B$ over the WHOLE training data (not only the training set).
• Compare the performance of the closed form solution and the minibatch solution. Are the performances similar? Are the parameters $\mathbf{w}$ and $\alpha$ similar in both approaches? Please comment on both questions.
2 marks out of the 5 marks
Question 8 Answer¶
In [ ]:
# Write the code for your answer here
Write the answer to your last question here.