Homework 5
Due: Sunday, March 10, 2019 at 12:00pm (Noon)
Programming Assignment Introduction
In this assignment, you’ll implement Binary Logistic Regression with regularization to perform classification. The regularization method that you will be using is Tikhonov regularization.
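For reference, one common way to write the Tikhonov-regularized (L2) logistic regression objective for labels y_i in {0, 1} is sketched below; whether you average the loss and exactly how lambda scales the penalty are design choices left to you.

J(w) = -\frac{1}{n}\sum_{i=1}^{n}\left[ y_i \log \sigma(x_i^\top w) + (1 - y_i)\log\left(1 - \sigma(x_i^\top w)\right)\right] + \lambda \lVert w \rVert_2^2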
Stencil Code & Data
You can find the stencil code for this assignment on the course website. We have provided the following two files:
• main.py is the entry point of your program; it reads in the data, runs the classifier, and prints the results.
• models.py contains the RegularizedLogisticRegression model which you will be implementing.
To feed the data into the program successfully, please do not rename the data files, and make sure they are all in a directory named data located in the same directory as the stencil folder. To run the program, run python main.py in a terminal.
UCI Breast Cancer Wisconsin (Diagnostic) Data Set
You will be using a modified version of the Breast Cancer Wisconsin (Diagnostic) Data Set from UC Irvine’s Machine Learning Repository. You can read more about the dataset at https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic). To modify it, we have added additional features which may or may not be informative. We have already split it into train and validation sets for you, and we read them in for you in main.py.
On a department machine, you can copy the files to your working directory by running the command

cp /course/cs1420/pub/hw5/* <destination directory>

If you are working locally, you will need to use the scp command to copy the files to your computer over ssh:

scp <username>@<department machine>:/course/cs1420/pub/hw5/* <destination directory>
Data Format
We split the original dataset 70-15-15 to produce the training, validation, and test sets (you only use the train and validation sets; we will use the test set when running your programs). We also add a constant column of ones to the dataset to account for the bias.
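The stencil already performs this split and appends the bias column for you; the short sketch below, with hypothetical names, just illustrates one way the bias column could be added with NumPy.

import numpy as np

def add_bias_column(X):
    # Append a constant column of ones so one weight can act as the bias term.
    ones = np.ones((X.shape[0], 1))
    return np.hstack([X, ones])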
The Assignment
We provide you with a sigmoid function to use when training your model. In models.py, there are four functions you will implement. They are:
• RegularizedLogisticRegression:
– train() uses gradient descent to learn the weights. Since we are now also trying to keep the weights as small as possible, think about how this changes the update rule compared to our previous, unregularized use of gradient descent; a rough sketch of one possible update appears after the note below.
– predict() predicts labels for the inputs of a test dataset.
– accuracy() computes the percentage of correctly predicted labels over a dataset.
– plotError() trains the model and computes error for multiple values of the hyperparameter lambda, then graphs the results.
Once you have completed the plotError function, you can call it in main.py to produce your graphs.
Note: You are not allowed to use any off-the-shelf packages that have already implemented these models, such as scikit-learn. We’re asking you to implement them yourself.
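To make the regularized gradient step concrete, here is a minimal NumPy-only sketch (not the required interface), assuming labels in {0, 1}, a feature matrix X that already includes the bias column, and illustrative hyperparameter names such as alpha, lambda_param, and num_epochs.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_sketch(X, Y, lambda_param=1e-3, alpha=0.1, num_epochs=1000):
    # Full-batch gradient descent on the L2-regularized cross-entropy loss.
    n, d = X.shape
    weights = np.zeros(d)
    for _ in range(num_epochs):
        preds = sigmoid(X @ weights)
        # Gradient of the averaged cross-entropy plus the derivative of lambda * ||w||^2.
        grad = (X.T @ (preds - Y)) / n + 2 * lambda_param * weights
        weights -= alpha * grad
    return weights

def predict_sketch(X, weights):
    # Threshold the predicted probabilities at 0.5 to get 0/1 labels.
    return (sigmoid(X @ weights) >= 0.5).astype(int)

def accuracy_sketch(X, Y, weights):
    # Fraction of labels predicted correctly.
    return np.mean(predict_sketch(X, weights) == Y)

Whether you use batch or stochastic updates, and whether you regularize the bias weight, are design choices worth explaining in your report.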
Project Report
Guiding Questions
• Explain how you used gradient descent with regularization to learn the weights.
• Think back to when you implemented Logistic Regression on MNIST. How would it have been different if you had applied Tikhonov regularization?
• Produce a model selection curve; that is, plot training and validation error with respect to the value of lambda. It may be useful to plot against the log of lambda rather than lambda itself. Then conclude what the best value of lambda is and explain why; a rough plotting sketch follows this list. NOTE: for TA testing purposes, please set the default lambda in the constructor to the optimal lambda you discovered.
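As a rough outline only, the sketch below shows one way to plot training and validation error against log lambda; it reuses the hypothetical train_sketch and accuracy_sketch helpers from the earlier sketch and assumes matplotlib is available. Your plotError implementation may be organized quite differently.

import numpy as np
import matplotlib.pyplot as plt

def plot_error_sketch(X_train, Y_train, X_val, Y_val, lambda_values):
    train_errors, val_errors = [], []
    for lam in lambda_values:
        w = train_sketch(X_train, Y_train, lambda_param=lam)
        train_errors.append(1 - accuracy_sketch(X_train, Y_train, w))
        val_errors.append(1 - accuracy_sketch(X_val, Y_val, w))
    # Plot error against log10(lambda) so widely spaced lambda values stay readable.
    plt.plot(np.log10(lambda_values), train_errors, label="train error")
    plt.plot(np.log10(lambda_values), val_errors, label="validation error")
    plt.xlabel("log10(lambda)")
    plt.ylabel("error")
    plt.legend()
    plt.show()

For example, lambda_values = [10 ** k for k in range(-5, 2)] would sweep a logarithmic grid; the grid you choose is up to you.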
Grading Breakdown
We expect RegularizedLogisticRegression to reach a validation accuracy of at least 87%. As always, you will primarily be graded on the correctness of your code, not on whether it hits the accuracy target.
The grading breakdown for the assignment is as follows:

Regularized Logistic Regression: 80%
Report: 20%
Total: 100%
Handing in
Programming Assignment
To hand in the programming component of this assignment, first ensure that your code runs on Python 3 using our course virtualenv. You can activate the virtualenv on a department machine by running the following command in a Terminal:
source /course/cs1420/cs142_env/bin/activate
Once the virtualenv is activated, run your program and ensure that there are no errors. We will be using this virtualenv to grade all programming assignments in this course so we recommend testing your code on a department machine each time before you hand in. Note that handing in code that does not run may result in a significant loss of credit.
To hand in the coding portion of the assignment, run cs142_handin hw5 from the directory containing all of your source code and your report in a file named report.pdf.
Anonymous Grading
You need to be graded anonymously, so do not write your name anywhere on your handin. Instead, you should use the course ID that you generated when filling out the collaboration policy form. If you do not have a course ID, you should email the HTAs as soon as possible.
Obligatory Note on Academic Integrity
Plagiarism—don’t do it.
As outlined in the Brown Academic Code, attempting to pass off another’s work as your own can result in failing the assignment, failing this course, or even dismissal or expulsion from Brown. More than that, you will be missing out on the goal of your education, which is the cultivation of your own mind, thoughts, and abilities. Please review this course’s collaboration policy and, if you have any questions, please contact a member of the course staff.