Homework Assignment 2
(10 Credits)
Due: Feb 13, 2019
The goal of this homework assignment is to master the programming of Linear Regression method. Sample codes are provided and you are required to complete missing lines, evaluate your codes and report your observations. Details instructions are as follows.
Overview. In this programing exercise, you will need to implement a linear regression algorithm for predicting student’s university GPAs from two features, i.e., Math SAT and Verb SAT. You will be instructed to program the GD method to solve the linear regression model from a set of training samples. Then, you will apply the model to predict the university GPAs of testing samples.
Datasets. The student’s data table is available in the attached ‘sat.csv file, which includes a data matrix. Each row represents one of the 105 students and includes five properties. We will use the properties: Math SAT, Verb SAT as input features, and the property: University GPA as output labels. We split the whole dataset into two parts: the first 60 rows for training and the rest for testing.
Sample Codes. The file “main_ha2.py” provides the starting codes for 4 major steps: loading training and testing data, training the linear regression model, testing and evaluating the learned model.
The first step is to load student’s data and split them into two sets, one for training and the other for testing. We use the function download_data() to load data from the ‘sat.csv’ file.
The second step will call the function gradientDescent() in GD.py, i.e., the implementation of gradient descent method, to obtain the optimal parameters and costs over iteration. The latter will be visualized as a convergence curve as the following:
The third step will apply the learned model (i.e., the optimal parameters) to predict the GPA for testing students. The last step will return the average error and standard deviation (STD) of the evaluation results.
There are four PLACEHOLDERs in the provided scripts: two in ‘main_ha2.py” and two in “GD.py”
In “main_ha2.py”,
PLACEHOLDER1: you will need to change the two variables: alpha and MAX_ITER and observe how the convergences curve and evaluation results change. Write down your observations in the report.
PLACEHOLDER2: You will also need to normalize the two input features and the output labels in the first step. The current lines use a function rescaleMatrix() now. Please replace it with your own normalization codes. You can use the function rescaleMatrix() to verify if you codes are running properly.
There are a few different ways for normalizing these features, e.g., you can try scale every feature to be 0 and 1, or taking mean off every value, or others. Please test and evaluate at least one normalization method.
In “GD.py”, you will need to implement the gradient descent function gradientDescent (). Replace the temporary code lines in PLACEHOLDER with your own codes.
PLACEHOLDER3: write your codes to update theta, i.e., the parameters to estimate, following the gradient direction.
PLACEHOLDER4: calculate the current cost with the updated model parameters, i.e. theta.
Write-up
In the report you will describe your algorithm and any decisions you made to write your algorithm a particular way. Then you will show and discuss the results of your codes following the above instructions. In the case of this project, show the convergence curves and quantitative results for each case of your implementations. Also, discuss anything extra you did. Feel free to add any other information you feel is relevant.
How to submit
• Submit your source codes and report using the course site. The codes should be self-contained, and run without any error. Otherwise, penalty will be applied.
• No hard copy required.