F21_APS1070_Project_4
Project 4, APS1070 Fall 2021¶
Linear Regression – 13 points¶
Deadline: Nov 26, 21:00
Academic Integrity
This project is individual – it is to be completed on your own. If you have questions, please post your query in the APS1070 Q&A forums (the answer might be useful to others!).
Do not share your code with others, or post your work online. Do not submit code that you have not written yourself. Students suspected of plagiarism on a project, midterm or exam will be referred to the department for formal discipline for breaches of the Student Code of Conduct.
Please fill out the following:
Your name:
Your student number:
Part 1 – Getting Started [1.5 marks]¶
Ailerons are small hinged sections on the outboard portion of a wing used to control the roll of an airplane. In this project, we are going to design a controller to manage the ailerons of an aircraft based on supervised learning.
The following dataset contains 13750 instances, where each instance is a set of 40 features describing the airplane’s status. Our goal is to use these features to predict the Goal column, which is a command that our controller should issue. We will make our predictions by implementing linear regression.
import pandas as pd
import numpy as np
df = pd.read_csv("https://raw.githubusercontent.com/aps1070-2019/datasets/master/F16L.csv", skipinitialspace=True)
Here are the steps to complete this portion:
Print the dataframe.
Prepare your dataset as follows: [0.5]
Using train_test_split from Sklearn, split the dataset into training, validation, and test sets ($70\%$ training, $15\%$ validation, and $15\%$ test). When splitting, set random_state=1.
Standardize the data using StandardScaler from sklearn.
Insert a first column of all $1$s into the training, validation, and test sets (this acts as the bias term).
Explain the difference between an epoch and an iteration in the gradient descent algorithm (SGD/mini-batch). [1]
### YOUR CODE HERE ###
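As a reference, here is one possible sketch of the preparation steps above; the variable names, the two-stage 70/15/15 split, and the assumption that the target column is labelled Goal are illustrative, not required.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df.drop(columns=["Goal"]).values  # the 40 feature columns
y = df["Goal"].values                 # the command our controller should issue

# 70% training, then split the remaining 30% evenly into validation and test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=1)

# Standardize using statistics computed on the training set only
scaler = StandardScaler().fit(X_train)
X_train, X_val, X_test = scaler.transform(X_train), scaler.transform(X_val), scaler.transform(X_test)

# Prepend a column of ones to each set (bias term)
X_train = np.hstack([np.ones((X_train.shape[0], 1)), X_train])
X_val = np.hstack([np.ones((X_val.shape[0], 1)), X_val])
X_test = np.hstack([np.ones((X_test.shape[0], 1)), X_test])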
Part 2 – Linear Regression Using Direct Solution [1 mark]¶
Implement the direct solution of the linear regression problem on the training set. [0.5]
Note: You should use scipy.linalg.inv to perform the matrix inversion, as numpy.linalg.inv may cause numerical issues.
Report the root-mean-square error (RMSE) for both the training and validation sets. [0.5]
You may use mean_squared_error from Sklearn for computing the RMSE.
### YOUR CODE HERE ###
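A minimal sketch of the direct (closed-form) solution, assuming the arrays X_train, y_train, X_val, y_val from Part 1 already include the bias column; the names rmse_train_direct and rmse_val_direct are illustrative.

from scipy import linalg
from sklearn.metrics import mean_squared_error

# Direct least-squares solution: w = (X^T X)^{-1} X^T y
w_direct = linalg.inv(X_train.T @ X_train) @ X_train.T @ y_train

# RMSE on the training and validation sets
rmse_train_direct = mean_squared_error(y_train, X_train @ w_direct, squared=False)
rmse_val_direct = mean_squared_error(y_val, X_val @ w_direct, squared=False)
print("Training RMSE: %.5f, Validation RMSE: %.5f" % (rmse_train_direct, rmse_val_direct))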
Part 3 – Full Batch Gradient Descent [2 marks]¶
We will now implement a “full batch” gradient descent algorithm and record the training time for our model. Recall that the full-batch gradient descent update is
$$w_t = w_{t-1} - \alpha~g_t$$ where $\alpha$ is the learning rate and $g_t$ is your gradient, computed on the entire training set.
Here are the steps for this part:
Implement gradient descent for linear regression using a fixed learning rate of $\alpha= 0.01$, and iterate until your model’s validation RMSE converges.
We consider gradient descent to have converged when the RMSE on the validation set satisfies:
$$ RMSE_\text{GD} \leq 1.001 \times RMSE_\text{Direct Solution}$$
where $RMSE_\text{Direct Solution}$ is the RMSE on the validation set using the direct solution that you have calculated in the previous part.
We refer to the quantity $RMSE_\text{Direct Solution}\times 1.001$ as the convergence threshold (CT).
Record the training time (from the first iteration until convergence) using the time.time() function. Be sure to compute the gradients yourself! Take a look at the code provided in the tutorial. [0.5]
Plot the training RMSE and the validation RMSE vs. epoch on the same figure. [0.5]
Comment on overfitting/underfitting by observing the training and validation RMSE. [1]
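One possible sketch of the full-batch loop described above; the rmse helper and the use of rmse_val_direct from the Part 2 sketch to form the convergence threshold are assumptions, and the hint and timing skeleton below show how to initialize the weights and where the time.time() calls go.

def rmse(X, y, w):
    return np.sqrt(np.mean((X @ w - y) ** 2))

alpha = 0.01
ct = 1.001 * rmse_val_direct                      # convergence threshold (CT)
w = np.random.rand(X_train.shape[1]) * 0.001      # small random initial weights
n = X_train.shape[0]

train_rmse, val_rmse = [], []
while len(val_rmse) == 0 or val_rmse[-1] > ct:
    g = (2.0 / n) * X_train.T @ (X_train @ w - y_train)  # gradient of the MSE loss on all data
    w = w - alpha * g                                     # full-batch update
    train_rmse.append(rmse(X_train, y_train, w))
    val_rmse.append(rmse(X_val, y_val, w))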
Hint: Initialize your weights with small random numbers (< $0.001$).

import time
start_time = time.time() ## Records current time
## GD Script -- Sample code in tutorial! ##
print("--- Total Training Time: %s (s) ---" % (time.time() - start_time))

Part 4 – Mini-batch and Stochastic Gradient Descent [4 marks]¶

Write a function that performs mini-batch gradient descent until the convergence threshold (CT) is reached. [1]

The inputs of that function are: input data (training/validation), batch size, learning rate, and convergence threshold (CT).

Your function will return the following arrays:

The final weights after training.
The training RMSE at each epoch.
The validation RMSE at each epoch.
An array with the elapsed time from the start of the training process to the end of each epoch (e.g., if each epoch takes exactly 2 seconds, the array would look like [2 4 6 8 ...]).

For certain batch sizes, GD might not converge to a solution. For that reason, you need to check the RMSE of the validation/training set at each epoch, and if it keeps getting larger, you should stop the training for that case (the design is up to you!). CT will help you to know when your model has converged.

Important: after each epoch, you need to shuffle the entire training set. This ensures that new mini-batches are selected for every epoch. Hint: use np.random.permutation.

Let's now use the function to investigate the effect of batch size on convergence. When the batch size is 1, we call it stochastic gradient descent. When the batch size equals the number of training data points, it is full-batch (i.e., all data points are used at every iteration). Anywhere in between is mini-batch (we use some of the data).

Sweep different values for the mini-batch size (at least 5 values), each time using a learning rate of $\alpha = 0.01$. Hint: Try batch sizes that are powers of two (e.g., 2, 4, 8, 16, 32, 64, 128, ...). These batch sizes fit better on the hardware and may achieve higher performance. [0.5]

Provide the following $3$ plots:

Plot training and validation RMSE vs. epoch for all the converging batch sizes (some batch sizes might not converge) in a figure. The X-axis is Epoch # and the Y-axis is RMSE. [0.5]
Plot training and validation RMSE vs. time for all the converging batch sizes in a figure. The X-axis is Time, and the Y-axis is RMSE. [0.5]
Plot Total training time (y-axis) vs. Batch size (x-axis). [0.5]

Describe your findings, including the main takeaways from each of your plots. [1]

### YOUR CODE HERE ###
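A minimal sketch of such a training function, reusing the rmse helper from the Part 3 sketch; the function name, the simple divergence guard, and max_epochs are illustrative design choices, not required.

import time

def minibatch_gd(X_tr, y_tr, X_v, y_v, batch_size, lr, ct, max_epochs=500):
    w = np.random.rand(X_tr.shape[1]) * 0.001
    n = X_tr.shape[0]
    train_rmse, val_rmse, elapsed = [], [], []
    start = time.time()
    for epoch in range(max_epochs):
        idx = np.random.permutation(n)              # reshuffle so new mini-batches form each epoch
        X_sh, y_sh = X_tr[idx], y_tr[idx]
        for i in range(0, n, batch_size):
            Xb, yb = X_sh[i:i + batch_size], y_sh[i:i + batch_size]
            g = (2.0 / len(yb)) * Xb.T @ (Xb @ w - yb)
            w = w - lr * g
        train_rmse.append(rmse(X_tr, y_tr, w))
        val_rmse.append(rmse(X_v, y_v, w))
        elapsed.append(time.time() - start)         # cumulative time up to the end of this epoch
        if val_rmse[-1] <= ct:                      # converged
            break
        if len(val_rmse) > 5 and val_rmse[-1] > val_rmse[-2] > val_rmse[-3]:
            break                                   # RMSE keeps growing: treat as diverging (design is up to you)
    return w, np.array(train_rmse), np.array(val_rmse), np.array(elapsed)

Sweeping batch sizes is then a loop over several values (e.g., powers of two plus 1 and the full training-set size), calling this function with $\alpha = 0.01$ each time. For Part 5 below, the raw gradient inside the inner loop can be replaced by the moving average $g_t = \beta~g_{t-1} + (1-\beta)\frac{\partial J}{\partial w}$, with $\beta = 0$ recovering this plain version.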
Part 5 – Introducing Momentum [3.5 marks]¶

Momentum is a popular technique that helps the gradient descent algorithm converge faster. Simply put, it behaves like a moving average of the gradients. First, take a look here to get familiar with the concept.

To summarize, if the weight update formula at time step $t$ is $w_t = w_{t-1} - \alpha~g_t$, then $g_t$ using momentum can be computed as $g_t = \beta~g_{t-1} + (1-\beta) \frac{\partial J}{\partial w}$, where $\beta$ is the momentum coefficient, between $[0, 1]$. Weight updates ($g_t$) with momentum are not only computed from the derivative of the loss function but are also a function of the previous weight updates. If you set $\beta = 0$ in the $g_t$ equation, you recover the original gradient descent method.

Add momentum to your training function. [1]

Train a linear model with a specific batch size and various values of momentum. Plot your training and validation RMSE for each epoch. [1]

With some plots (or tables), show how momentum affects the training time. [1]

Summarize your experiments and comment on the effect of momentum. [0.5]

### YOUR CODE HERE ###

Part 6 – Finalizing a Model [1 mark]¶

Based on your findings from the previous parts, pick a model (or combination of model settings) that you think would work best for our dataset and evaluate it on the test set.

Briefly describe your model selections/settings. [0.5]

Summarize the performance of your model for the task of managing the ailerons of an aircraft. [0.5]

### YOUR CODE HERE ###
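A brief sketch of a possible final evaluation, assuming minibatch_gd from the Part 4 sketch; the batch size shown is only a placeholder for whichever configuration performed best in your experiments.

# Retrain with the settings chosen above (illustrative values only), then evaluate on the test set
w_final, _, _, _ = minibatch_gd(X_train, y_train, X_val, y_val, batch_size=64, lr=0.01, ct=ct)
rmse_test = mean_squared_error(y_test, X_test @ w_final, squared=False)
print("Test RMSE: %.5f" % rmse_test)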