F21_APS1070_Project_1
APS1070¶
Basic Principles and Models – Project 1¶
Deadline: Oct 1, 9PM – 10 percent
Academic Integrity
This project is individual – it is to be completed on your own. If you have questions, please post your query in the APS1070 Piazza Q&A forums (the answer might be useful to others!).
Do not share your code with others, or post your work online. Do not submit code that you have not written yourself. Students suspected of plagiarism on a project, midterm or exam will be referred to the department for formal discipline for breaches of the Student Code of Conduct.
Name: ___ (here and elsewhere, please replace the underscore with your answer)
Student ID: ___
Marking Scheme:¶
This project is worth 10 percent of your final grade.
Provide a plot or table where necessary to summarize your findings.
Practice Vectorized coding: If you need to write a loop in your solution, think about how you can implement the same functionality with vectorized operations. Try to avoid loops as much as possible (in some cases, loops are inevitable).
**Remember to push your work to GitHub and share the link to your private repo on Quercus.**
Project 1 [10 Marks]¶
Let’s apply the tools we have learned in the tutorial to a new dataset.
We’re going to work with a breast cancer dataset. Download it using the cell below:
In [ ]:
from sklearn.datasets import load_breast_cancer
dataset = load_breast_cancer()
Part 1: Getting started [2 Marks]¶
First off, take a look at the data, target and feature_names entries in the dataset dictionary. They contain the information we’ll be working with here. Then, create a Pandas DataFrame called df containing the data and the targets, with the feature names as column headings. If you need help, see here for more details on how to achieve this. [0.4]
How many features do we have in this dataset? ___
How many observations have a ‘mean area’ of greater than 700? ___
How many participants tested Malignant? ___
How many participants tested Benign? ___
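If you get stuck on the DataFrame or the counts above, here is a minimal sketch; it assumes the dataset object loaded earlier, and the target column name is our own choice.
import pandas as pd

# Build a DataFrame with the feature names as column headings, then add the targets.
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df['target'] = dataset.target   # in this dataset, 0 = malignant, 1 = benign

print(df.shape)                        # number of observations and columns
print((df['mean area'] > 700).sum())   # observations with 'mean area' > 700
print((df['target'] == 0).sum())       # malignant cases
print((df['target'] == 1).sum())       # benign cases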
Splitting the data¶
It is best practice to have a training set (from which a rotating validation subset is drawn) and a test set. Our aim here is to (eventually) obtain the best accuracy we can on the test set (we’ll do all our tuning on the training/validation sets, however).
Split the dataset into a training set and a test set (70:30), using random_state=0. The test set is set aside (untouched) for final evaluation, once hyperparameter optimization is complete. [0.5]
In [ ]:
### YOUR CODE HERE ###
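A minimal sketch of the split, assuming the DataFrame df from Part 1 (feature columns plus a target column of our own naming):
from sklearn.model_selection import train_test_split

X = df[dataset.feature_names]   # features only
y = df['target']                # labels (0 = malignant, 1 = benign)

# 70:30 split with a fixed random_state so the results are reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)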
Effect of Standardization (Visual)¶
Use seaborn.lmplot (help here) to visualize a few features of the training set. Draw a plot where the x-axis is worst smoothness, the y-axis is worst fractal dimension, and the color of each datapoint indicates its class. [0.5]
Standardizing the data is often critical in machine learning. Show a plot as above, but with two features with very different scales. Standardize the data and plot those features again. What’s different? Based on your observation, what is the advantage of standardization? [0.6]
In [ ]:
### YOUR CODE HERE ###
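One possible way to draw these plots with seaborn.lmplot; the chosen feature pairs and the use of fit_reg=False are our own choices here:
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import StandardScaler

train_df = X_train.copy()
train_df['target'] = y_train.values

# 'worst smoothness' vs. 'worst fractal dimension', coloured by class.
sns.lmplot(x='worst smoothness', y='worst fractal dimension', hue='target',
           data=train_df, fit_reg=False)

# Two features on very different scales, before standardization.
sns.lmplot(x='mean area', y='mean smoothness', hue='target',
           data=train_df, fit_reg=False)

# Standardize the training features and plot the same pair again.
scaled = pd.DataFrame(StandardScaler().fit_transform(X_train), columns=X_train.columns)
scaled['target'] = y_train.values
sns.lmplot(x='mean area', y='mean smoothness', hue='target',
           data=scaled, fit_reg=False)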
Part 2: KNN Classifier without Standardization [2 Marks]¶
Normally, standardizing data is a key step in preparing data for a KNN classifier. However, for educational purposes, let’s first try to build a model without standardization. Let’s create a KNN classifier to predict whether a patient has a malignant or benign tumor.
Follow these steps:
Train a KNN Classifier using cross-validation on the dataset. Sweep k (number of neighbours) from 1 to 100, and show a plot of the mean cross-validation accuracy vs k. [1]
What is the best k? What is the highest cross-validation accuracy? [0.5]
Comment on which ranges of k lead to underfitted or overfitted models (hint: compare training and validation curves!). [0.5]
In [ ]:
### YOUR CODE HERE ###
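A minimal sketch of the sweep using cross_val_score; 5-fold cross-validation and the plotting details are our own choices:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

ks = list(range(1, 101))
cv_scores = []
for k in ks:    # a loop over k is hard to avoid here
    knn = KNeighborsClassifier(n_neighbors=k)
    cv_scores.append(cross_val_score(knn, X_train, y_train, cv=5).mean())

plt.plot(ks, cv_scores)
plt.xlabel('k (number of neighbours)')
plt.ylabel('mean cross-validation accuracy')
plt.show()

best_k = ks[int(np.argmax(cv_scores))]
print(best_k, max(cv_scores))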
Part 3: Feature Selection [4 Marks]¶
In this part, we aim to investigate the importance of each feature on the final classification accuracy.
If we want to try every possible combination of features, we would have to test $2^F$ different cases, where F is the number of features, and in each case, we have to do a hyperparameter search (finding k in KNN using cross-validation). That would take days!
To find the more important features, we will use a decision tree. Based on a fitted decision tree, we can compute a feature-importance score for each feature, which serves as the metric for our feature selection (code is provided below).
You can use the following link to get familiar with extracting the feature-importance ordering from machine learning algorithms in Python:
https://machinelearningmastery.com/calculate-feature-importance-with-python/
After we have identified and removed the least important feature and evaluated a new KNN model on the reduced set of features, if the stop conditions (see step 7 below) are not met, we need to remove another feature. To do that, we fit a new decision tree to the remaining features and identify the least important one.
Design a function (Feature_selector) that accepts your dataset (X_train, y_train) and a threshold as inputs and: [3]
Fits a decision tree classifier on the training set.
Extracts the feature importance order of the decision tree model.
Removes the least important feature, based on the importance ordering from step 2.
Trains a KNN model on the remaining features. The number of neighbours (k) for each KNN model should be tuned using 5-fold cross-validation.
Stores the best mean cross-validation score and the corresponding k (number of neighbours) in two lists.
Goes back to step 1: fits a new tree on the reduced dataset and repeats all the steps until the stop condition is met.
We will stop this process when (1) there is only one feature left, or (2) our cross-validation accuracy drops significantly compared to a model that uses all the features. In this function, we accept a threshold as an input argument. For example, if threshold=0.95, we do not continue removing features once our mean cross-validation accuracy after tuning k falls below 0.95 $\times$ the full-feature cross-validation accuracy.
Your function returns the list of removed features, and the corresponding mean cross-validation accuracy and k value when a feature was removed.
Visualize your results by plotting the mean cross-validation accuracy (with tuned k) on the y-axis vs. the number of features on the x-axis. This plot shows the best cross-validation score with 1 feature, 2 features, 3 features, …, and all the features. [0.5]
Plot the best value of k (y-axis) vs. the number of features (x-axis). This plot shows how the best number of neighbours varies with the number of features. [0.5]
You can use the following piece of code to start training a decision tree classifier and obtain its feature importance order.
from sklearn import tree

# Fit a decision tree classifier on the training data.
dt = tree.DecisionTreeClassifier()
dt.fit(X_train, y_train)

# Importance score for each feature (higher means more important).
importance = dt.feature_importances_
In [ ]:
def Feature_selector(X_train, y_train, tr=0.95):
    ### YOUR CODE HERE ###
    return ______
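A minimal sketch of one way to structure Feature_selector, assuming X_train is a DataFrame; the helper tune_knn and the exact bookkeeping are our own choices, not a required implementation:
import numpy as np
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def tune_knn(X, y, ks=range(1, 101)):
    """Return (best mean 5-fold CV accuracy, best k) for a KNN classifier on X, y."""
    scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
              for k in ks]
    best = int(np.argmax(scores))
    return scores[best], list(ks)[best]

def Feature_selector(X_train, y_train, tr=0.95):
    removed, cv_scores, best_ks = [], [], []
    X = X_train.copy()

    # Baseline: cross-validation accuracy with all features, used for the threshold.
    full_score, _ = tune_knn(X, y_train)

    while X.shape[1] > 1:
        # Fit a decision tree and find the least important remaining feature.
        dt = tree.DecisionTreeClassifier()
        dt.fit(X, y_train)
        least_important = X.columns[int(np.argmin(dt.feature_importances_))]

        # Remove it and re-tune k for a KNN on the reduced feature set.
        X = X.drop(columns=[least_important])
        score, k = tune_knn(X, y_train)

        # Stop if the tuned accuracy drops below the threshold of the full-feature score.
        if score < tr * full_score:
            break

        removed.append(least_important)
        cv_scores.append(score)
        best_ks.append(k)

    return removed, cv_scores, best_ks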
Part 4: Standardization [1.5 Marks]¶
Standardizing the data usually means scaling our data to have a mean of zero and a standard deviation of one.
Note: When we standardize a dataset, do we care if the data points are in our training set or test set? Yes! The training set is available for us to train a model – we can use it however we want. The test set, however, represents a subset of data that is not available for us during training. For example, the test set can represent the data that someone who bought our model would use to see how the model performs (which they are not willing to share with us).
Therefore, we cannot compute the mean or standard deviation of the whole dataset to standardize it – we can only calculate the mean and standard deviation of the training set. However, when we sell a model to someone, we can tell them what our scaler's parameters (the mean and standard deviation of our training set) were. They can scale their data (the test set) with our training set’s mean and standard deviation. Of course, there is no guarantee that the test set would have a mean of zero and a standard deviation of one, but it should work fine.
To summarize: We fit the StandardScaler only on the training set. We transform both training and test sets with that scaler.
Standardize the training and test data (Help) [0.5]
Call your Feature_selector function on the standardized training data with a threshold of 0.95. [0.5]
Plot the cross-validation accuracy for the standardized data (this part) and for the original training data (the previous part) vs. the number of features, in a single plot (to compare them easily).
Discuss whether standardization helped or hurt your model and its performance. Which cases lead to a higher cross-validation accuracy (how many features? which features? what k?)? [0.5]
In [ ]:
### YOUR CODE HERE ###
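A minimal sketch of the scaling step (fit on the training set only, then transform both sets), reusing Feature_selector from Part 3:
import pandas as pd
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_std = pd.DataFrame(scaler.fit_transform(X_train),
                           columns=X_train.columns, index=X_train.index)
X_test_std = pd.DataFrame(scaler.transform(X_test),
                          columns=X_test.columns, index=X_test.index)

# Feature selection on the standardized training data, threshold of 0.95.
removed_std, cv_scores_std, best_ks_std = Feature_selector(X_train_std, y_train, tr=0.95)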
Part 5: Test Data [0.5 Mark]¶
Now that you’ve created several models, pick your best one (highest CV accuracy) and apply it to the test dataset you had initially set aside. Discuss your results. [0.5]
In [ ]:
### YOUR CODE HERE ###
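A minimal sketch of the final evaluation; best_features and best_k are placeholders for whichever feature subset and k gave your highest cross-validation accuracy:
from sklearn.neighbors import KNeighborsClassifier

# best_features and best_k come from your earlier results (placeholders here).
knn = KNeighborsClassifier(n_neighbors=best_k)
knn.fit(X_train_std[best_features], y_train)
print('Test accuracy:', knn.score(X_test_std[best_features], y_test))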
References:
https://towardsdatascience.com/decision-trees-in-machine-learning-641b9c4e8052
Machine Learning 101: Decision Tree Algorithm for Classification