HW-05_SVMandPCA-STUDENT
Homework Assignment #5 (Individual)¶
Using SVMs and PCA with new data: The Palmer Penguins Dataset¶
✅ Put your name here.
¶
✅ Put your _GitHub username_ here.
¶
Goals for this homework assignment¶
By the end of this assignment, you should be able to:
Use git to track your work and turn in your assignment
Read in data and prepare it for modeling
Build, fit, and evaluate an SVC model of data
Use PCA to reduce the number of important features
Build, fit, and evaluate an SVC model of PCA-transformed data
Systematically investigate the effects of the number of PCA components on an SVC model of data
Assignment instructions:¶
Work through the following assignment, making sure to follow all of the directions and answer all of the questions.
There are 47 points (+2 bonus points) possible on this assignment. Point values for each part are included in the section headers.
This assignment is due at 11:59 pm on Friday, December 3. It should be pushed to your repo (see Part 1) and submitted to D2L.
Imports¶
It’s useful to put all of the imports you need for this assignment in one place. Read through the assignment to figure out which imports you’ll need or add them here as you go.
In [ ]:
# Put all necessary imports here
1. Add to your Git repository to track your progress on your assignment (4 points)¶
As usual, for this assignment you’re going to add your work to the cmse202-f21-turnin repository you created in class so that you can track your progress on the assignment and preserve the final version that you turn in.
✅ Do the following:
Navigate to your cmse202-f21-turnin repository and create a new directory called hw-05.
Move this notebook into that new directory in your repository, then add it and commit it to your repository.
Finally, to test that everything is working, “git push” the file so that it ends up in your GitHub repository.
Important: Make sure you’ve added your Professor and your TA as collaborators to your “turnin” repository with “Read” access so that we can see your assignment (you should have done this in the previous homework assignment).
Also important: Make sure that the version of this notebook that you are working on is the same one that you just added to your repository! If you are working on a different copy of the notebook, none of your changes will be tracked!
If everything went as intended, the file should now show up on your GitHub account in the “cmse202-f21-turnin” repository inside the hw-05 directory that you just created. Periodically, you’ll be asked to commit your changes to the repository and push them to the remote GitHub location. Of course, you can always commit your changes more often than that, if you wish. It can be good to get into a habit of committing your changes any time you make a significant modification, or when you stop working on the project for a bit.
✅ Do this: Before you move on, put the command that your instructor should run to clone your repository in the markdown cell below.
# Put the command for cloning your repository here!
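For reference, the command generally takes this form, with <username> replaced by your own GitHub username:

git clone https://github.com/<username>/cmse202-f21-turnin.git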
2. Loading a new dataset: The Palmer Penguins data (8 points)¶
We’ve seen the iris dataset a number of times in the course so far, and it has a number of nice properties that make it useful for getting some practice with the machine learning methods that are around today. However, recently a new dataset was suggested as a possible replacement/alternative for the iris data: the “Palmer Penguins” (perhaps you’ve already seen it before!). This dataset also has some nice properties that make it a good playground for experimenting with machine learning tools. You can learn more about the dataset on the palmerpenguins website.
Since the goal for this assignment is to practice using the SVM and PCA tools we’ve covered in class, we’re going to use this relatively simple dataset and avoid any complicated data wrangling headaches!
The data¶
The penguins dataset is pretty straightforward, but you’ll need to download the data and give yourself some time to get familiar with it.
✅ Do This: To get started, you’ll need to download the following file:
https://raw.githubusercontent.com/msu-cmse-courses/cmse202-F21-data/main/data/penguins_size.csv
Once you’ve downloaded the data, open the file using a text editor or other tool on your computer and take a look at the data to get a sense for the information it contains. You’ll probably also want to read through the information on the palmerpenguins website to get a sense for what the values correspond to. The website talks about two different versions of the data, a simplified one and a “raw” one with more values. Which one are you working with?
2.1 Load the data¶
✅ Task 2.1 (2 points): Read the penguins_size.csv file into your notebook. For the purposes of this assignment, we’re going to use “species” as the class that we’ll be trying to predict with our classification model. To make this clear, you should rename the species column to class. The species column currently has the following class labels:
“Adelie”
“Chinstrap”
“Gentoo”
Once you’ve loaded in the data and changed the species column to class, display the DataFrame to make sure it looks reasonable. You should have 7 columns and 344 rows.
In [ ]:
# Put your code here
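A minimal sketch of one possible approach, assuming the downloaded file sits in the same directory as this notebook (the variable name penguins is just a placeholder):

import pandas as pd

# Read the CSV into a DataFrame and rename the "species" column to "class"
penguins = pd.read_csv("penguins_size.csv")
penguins = penguins.rename(columns={"species": "class"})
penguins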
2.2 Relabeling the classes¶
To simplify the process of modeling the penguin data, we should convert the class labels from strings to integers. For example, rather than Adelie, we can consider this to be class “0”.
✅ Task 2.2 (2 points): Replace all of the strings in your “class” column with integers based on the following:
original label → replaced label
“Adelie” → 0
“Chinstrap” → 1
“Gentoo” → 2
Once you’ve replaced the labels, display your DataFrame and confirm that it looks correct.
In [ ]:
# Put your code here
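One possible approach, assuming the penguins DataFrame from the sketch above:

# Map each species string to its integer label
penguins["class"] = penguins["class"].replace({"Adelie": 0, "Chinstrap": 1, "Gentoo": 2})
penguins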
2.3 Removing rows with missing data¶
At this point, you’ve hopefully noticed that some of the rows seem to be missing data values, as indicated by the presence of NaN values. Since we don’t necessarily know what to replace these values with, let’s just play it safe and remove all of the rows that have NaN in any of the column entries. This should help ensure that we don’t end up with errors or confusing results when we try to classify the data.
✅ Task 2.3 (1 point): Remove all of the rows that contain a NaN in any column. Make sure you actually store this new version of your dataframe either in the original variable name or in a new variable name. If everything went as intended, you should find that you have 334 rows left over.
In [ ]:
# Put your code here
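A minimal sketch, again assuming the penguins DataFrame from the earlier sketches:

# Drop every row that contains a NaN in any column, then check the new length
penguins = penguins.dropna()
print(len(penguins))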
2.4 Separating the “features” from the “labels”¶
As we’ve seen when working with sklearn it can be much easier to work with the data if we have separate variables that store the features and the labels.
✅ Task 2.4 (1 point): Split your DataFrame so that you have two separate DataFrames, one called features, which contains all of the penguin features, and one called labels, which contains all of the new penguin integer labels you just created.
In [ ]:
# Put your code here
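One way to do this, assuming the penguins DataFrame from the earlier sketches:

# Separate the integer labels from the feature columns
labels = penguins["class"]
features = penguins.drop(columns=["class"])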
✅ Question 2.1 (1 point): How balanced is your set of penguin classes? Does it matter for the set of classes to be balanced? Why or why not? (You might need to write a bit of code to figure out how balanced your set of penguin classes is.)
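One quick way to check the balance, assuming a labels variable like the one in the Task 2.4 sketch, is pandas’ value_counts() method:

# Count how many penguins fall into each class
labels.value_counts()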
✎ Erase this and put your answer here.
2.5 Dropping the non-numeric features¶
The last thing we should probably do before moving on to building a classifier model is to drop the two categorical (i.e. non-numeric) features from our set of features to avoid confusing or complicating the model.
✅ Task 2.5 (1 point): Drop the two non-numeric columns from your new features dataframe. You should end up with your final four features, which should all have floating point values. Display your new features dataframe to make sure this is true.
In [ ]:
# Put your code here
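Rather than hard-coding column names, one option is to keep only the numeric columns; this is just a sketch, so double-check that the result matches the four measurement features you expect:

# Keep only the numeric columns, dropping the two categorical ones
features = features.select_dtypes(include="number")
features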
🛑 STOP¶
Pause to commit your changes to your Git repository!
Take a moment to save your notebook, commit the changes to your Git repository using the commit message “Committing Part 2”, and push the changes to GitHub.
3. Building an SVC model (4 points)¶
Now, to tackle this classification problem, we will use a support vector machine just like we’ve done previously (e.g. in the Day 19 and Day 20 assignments). Of course, we could easily replace this with any sklearn classifier we choose, but for now we will just use an SVC with a linear kernel.
3.1 Splitting the data¶
But first, we need to split our data into training and testing data!
✅ Task 3.1 (1 point): Split your data into a training and testing set with the training set representing 75% of your data. For reproducibility, set the random_state argument to 314159. Print the lengths to show you have the right number of entries.
In [ ]:
# Put your code here
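A minimal sketch using sklearn’s train_test_split, assuming the features and labels variables from Part 2:

from sklearn.model_selection import train_test_split

# 75% training / 25% testing, with a fixed random_state for reproducibility
train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, train_size=0.75, random_state=314159)
print(len(train_features), len(test_features))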
3.2 Modeling the data and evaluating the fit¶
Since you have done this a number of times at this point, we ask you to do most of the analysis for this problem in one cell.
✅ Task 3.2 (2 points): Build a linear SVC model with C=0.01, fit it to the training set, and use the test features to predict the outcomes. Evaluate the fit using the confusion matrix and classification report.
Note: Double-check the documentation on the confusion matrix because the way sklearn outputs false positives and false negatives may be different from what most images on the web indicate.
In [ ]:
# Put your code here
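One way this analysis might look, assuming the train/test variables from the Task 3.1 sketch:

from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, classification_report

# Build and fit a linear-kernel SVC with C=0.01, then predict the test set
model = SVC(kernel="linear", C=0.01)
model.fit(train_features, train_labels)
predictions = model.predict(test_features)
print(confusion_matrix(test_labels, predictions))
print(classification_report(test_labels, predictions))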
✅ Question 3.1 (1 point): How accurate is your model? What evidence are you using to determine that? How many false positives and false negatives does it predict for each class?
✎ Erase this and put your answer here.
🛑 STOP¶
Pause to commit your changes to your Git repository!
Take a moment to save your notebook, commit the changes to your Git repository using the commit message “Committing Part 3”, and push the changes to GitHub.
4. Finding and using the best hyperparameters (8 points)¶
At this point, we have fit one model and determined its performance, but is it the best model? We can use GridSearchCV to find the best model (given our choices of parameters). Once we do that, we will use that “best” model for making predictions. This is similar to what we did when working with the “digits” data and the “faces” data in the Day 20 and Day 21 assignments.
Note: in a production environment you would typically rerun this grid search periodically to continue verifying that you have the best model, but we won’t do that here for the sake of speed.
4.1 Performing a grid search¶
✅ Task 4.1 (4 points): Using the following parameters (C = 1e-3, 0.01, 0.1, 1, 10, 100 and gamma = 1e-6, 1e-5, 1e-4, 1e-3, 0.01, 0.1) for both a linear and an rbf kernel, use GridSearchCV with the SVC() model to find the best-fit parameters. Once you’ve run the grid search, print the “best estimators”.
In [ ]:
# Put your code here
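A sketch of one way to set up the search, reusing the training variables from Part 3 (note that the gamma values are simply ignored whenever the linear kernel is tried):

from sklearn.model_selection import GridSearchCV

# Try every combination of C, gamma, and kernel
param_grid = {"C": [1e-3, 0.01, 0.1, 1, 10, 100],
              "gamma": [1e-6, 1e-5, 1e-4, 1e-3, 0.01, 0.1],
              "kernel": ["linear", "rbf"]}
grid = GridSearchCV(SVC(), param_grid)
grid.fit(train_features, train_labels)
print(grid.best_estimator_)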
✅ Question 4.1 (1 point): How do the “best estimator” results of the grid search compare to what you used in Part 3? Did the hyperparameter(s) change? What kernel did the grid search determine was the best option?
✎ Erase this and put your answer here.
4.2 Evaluating the best fit model¶
Now that we have found the “best estimators”, let’s determine how good the fit is.
✅ Task 4.2 (2 points): Use the test features to predict the outcomes for the best model. Evaluate the fit using the confusion matrix and classification report.
Note: Double-check the documentation on the confusion matrix because the way sklearn outputs false positives and false negatives may be different from what most images on the web indicate.
In [ ]:
# Put your code here
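Continuing the sketch from Task 4.1, the fitted grid’s best estimator can make the predictions:

# Predict with the best estimator found by the grid search
best_predictions = grid.best_estimator_.predict(test_features)
print(confusion_matrix(test_labels, best_predictions))
print(classification_report(test_labels, best_predictions))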
✅ Question 4.2 (1 point): How accurate is this best model? What evidence are you using to determine that? How many false positives and false negatives does it predict?
✎ Erase this and put your answer here.
🛑 STOP¶
Pause to commit your changes to your Git repository!
Take a moment to save your notebook, commit the changes to your Git repository using the commit message “Committing Part 4”, and push the changes to GitHub.
5. Using Principal Components (11 points)¶
The full model uses all 4 penguin features to predict the results, and you likely found that the model is pretty accurate using all 4 features. But in some cases we might have significantly more features (which means much more computational time!), and we might not need the level of accuracy we can achieve with the full data set, or we might not have enough computational resources to use all of the features.
In such situations, we might need to see how close we can get with fewer features. But instead of simply removing features, we will use a Principal Component Analysis (PCA) to determine the features that contribute the most to the model (through their accounted variance) and use those to build our SVC model. We did this to improve our classification with the “faces” dataset in the Day 21 assignment.
5.1 Doing a little bit of data preparation before we perform our PCA¶
Because the features in our dataset have very different relative values (i.e. body_mass_g is in the range 3000-5000, but bill_depth_mm is in the range 10-20), the variation captured by the PCA will be skewed by these relative differences. As a result, it is good practice to normalize the features so that they have comparable values. Thankfully, sklearn has a useful function for doing this!
✅ Do This: Run the following cell which uses a sklearn function to perform a “Min-Max” scaling to normalize the features. These new features are stored in the features_norm variable. Take a look at the output to get a sense for how the values change when they are normalized.
In [ ]:
from sklearn import preprocessing

# Rescale every feature to the range [0, 1] so that no feature dominates
# the PCA simply because its raw values are larger
min_max_scaler = preprocessing.MinMaxScaler()
features_norm = pd.DataFrame(min_max_scaler.fit_transform(features), columns=features.columns, index=features.index)
features_norm
Recreating our train-test split¶
Now that we have new feature values, we need to create new training and testing variables.
✅ Task 5.1 (1 point): As you did in Task 3.1 above, split your new normalized features and corresponding labels (the labels are the same as before) into a training and testing set with the training set representing 75% of your data. For reproducibility, set the random_state argument to 314159. Print the lengths to show you have the right number of entries.
In [ ]:
# Put your code here
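This follows the same pattern as the Task 3.1 sketch, just with the normalized features:

# Same 75/25 split and random_state, but using features_norm
train_features, test_features, train_labels, test_labels = train_test_split(
    features_norm, labels, train_size=0.75, random_state=314159)
print(len(train_features), len(test_features))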
5.2 Running a Principal Component Analysis (PCA)¶
Since we only have 4 total features to start with, let’s see how well we can do if we try to aggressively reduce the feature count and use only 1 principal component. We’ll see how well we can predict the classes of the penguin dataset with just one!
✅ Task 5.2 (3 points): Using PCA() and the associated fit() method, run a principal component analysis on your training features using only 1 component. Transform both the test and training features using the result of your PCA. Print the explained_variance_ratio_.
In [ ]:
# Put your code here
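A minimal sketch, assuming the normalized train/test variables from Task 5.1:

from sklearn.decomposition import PCA

# Fit the PCA on the training features only, then transform both sets
pca = PCA(n_components=1)
pca.fit(train_features)
train_features_pca = pca.transform(train_features)
test_features_pca = pca.transform(test_features)
print(pca.explained_variance_ratio_)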
✅ Question 5.1 (1 point): What is the total explained variance ratio captured by this simple 1-component PCA? (Just quote the number.) How well do you think a model with this many features will perform? Why?
✎ Erase this and put your answer here.
5.3 Fit and Evaluate an SVC model¶
Using the PCA-transformed features, we need to train and test a new SVC model. You’ll want to perform the GridSearchCV again since there may be a better choice for the kernel and the hyperparameters.
✅ Task 5.3 (2 points): Using the PCA-transformed training data, build and train an SVC model using the GridSearchCV tool to make sure you’re using the best kernel and hyperparameter combination. Predict the classes using the PCA-transformed test data. Evaluate the model using the classification report and the confusion matrix.
In [ ]:
# Put your code here
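One way to structure this, reusing the param_grid dictionary from the Part 4 sketch and the transformed features from Task 5.2:

# Rerun the grid search on the PCA-transformed training data
grid_pca = GridSearchCV(SVC(), param_grid)
grid_pca.fit(train_features_pca, train_labels)
pca_predictions = grid_pca.best_estimator_.predict(test_features_pca)
print(confusion_matrix(test_labels, pca_predictions))
print(classification_report(test_labels, pca_predictions))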
✅ Question 5.2 (1 point): How accurate is this model? What evidence are you using to determine that? How many false positives and false negatives does it predict? How does it compare to the full feature model? Which classes does it seem to be struggling the most with classifying? Why might that be?
✎ Erase this and put your answer here.
5.4 Repeat your analysis with more components¶
You probably found that the model with just 1 feature didn’t actually do too badly, which is pretty impressive. That said, can we do better?
What if we increase the number of principal components to 2? What happens then?
✅ Task 5.4 (2 points): Repeat your analysis from Sections 5.2 and 5.3 using 2 components instead. As part of your analysis, print the explained variance ratio for each component as well as the sum of these values.
In [ ]:
# Put your code here
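The only substantive change from the earlier sketches is n_components; for example:

# Two components this time; .sum() gives the total variance captured
pca = PCA(n_components=2)
pca.fit(train_features)
print(pca.explained_variance_ratio_, pca.explained_variance_ratio_.sum())

From there, transforming the features and refitting the SVC proceeds exactly as in the 1-component sketches above.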
✅ Question 5.3 (1 point): What is the total explained variance ratio captured by this PCA? How accurate is this model? What evidence are you using to determine that? How many false positives and false negatives does it predict? How does it compare to the 1-component PCA model? To the full feature model?
✎ Erase this and put your answer here.
🛑 STOP¶
Pause to commit your changes to your Git repository!
Take a moment to save your notebook, commit the changes to your Git repository using the commit message “Committing Part 5”, and push the changes to GitHub.
6. How well does PCA work? (12 points)¶
Clearly, the number of components we use in our PCA matters. Let’s investigate how much it matters by systematically building a model for any requested number of components. While this might seem a bit unnecessary for such a simple dataset, this approach can be very useful for more complex datasets and models!
6.1 Accuracy vs. Components¶
To systematically explore how well PCA improves our classification model, we will write a function that creates the PCA, creates the SVC model, fits the model to the training data, predicts the labels using the test data, and returns the accuracy score and the explained variance ratio. Your function should take as input:
the number of requested PCA components
the training feature data
the testing feature data
the training data labels
the test data labels
and it should return the accuracy score for an SVC model fit to the PCA-transformed features and the total explained variance ratio (i.e. the sum of the explained variance for each component).
✅ Task 6.1 (4 points): Create this function, which you will use in the next section.
In [ ]:
# Put your code here
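A sketch of one way such a function might look (the name pca_svc_accuracy is just a placeholder, and a default SVC() is used here; you could instead pass in your best-fit parameters from Part 4):

from sklearn.decomposition import PCA
from sklearn.svm import SVC

def pca_svc_accuracy(n_components, train_features, test_features, train_labels, test_labels):
    # Reduce the features to the requested number of components
    pca = PCA(n_components=n_components)
    pca.fit(train_features)
    train_pca = pca.transform(train_features)
    test_pca = pca.transform(test_features)
    # Fit an SVC on the reduced features and score it on the test set
    model = SVC()
    model.fit(train_pca, train_labels)
    accuracy = model.score(test_pca, test_labels)
    # Total explained variance is the sum over all components
    total_variance = pca.explained_variance_ratio_.sum()
    return accuracy, total_variance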
6.2 Compute accuracies¶
Now that you have created a function that returns the accuracy for a given number of components, we will use it to see how the accuracy of your SVC model changes as we increase the number of components used in the PCA.
✅ Task 6.2 (2 points): For 1 through 4 components, use your function above to compute and store (as a list) the accuracy of your models and the total explained variance ratio of your models.
In [ ]:
# Put your code here
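Assuming a function like the pca_svc_accuracy sketch above and the train/test variables from Task 5.1:

# Store the accuracy and total explained variance for 1 through 4 components
accuracies = []
variances = []
for n in range(1, 5):
    accuracy, variance = pca_svc_accuracy(n, train_features, test_features, train_labels, test_labels)
    accuracies.append(accuracy)
    variances.append(variance)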
6.3 Plot accuracy vs number of components¶
Now that we have those numbers, it makes sense to look at the accuracy vs # of components.
✅ Task 6.3 (2 points): Plot the accuracy vs # of components.
In [ ]:
# Put your code here
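A minimal matplotlib sketch, assuming the accuracies list from Task 6.2:

import matplotlib.pyplot as plt

plt.plot(range(1, 5), accuracies, marker="o")
plt.xlabel("Number of PCA components")
plt.ylabel("Accuracy")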
✅ Question 6.1 (1 point): Where does it seem like we have diminishing returns? That is, at what point is there no major increase in accuracy (or perhaps the accuracy even decreases) as we add additional components to the PCA?
✎ Erase this and put your answer here.
6.4 Plot total explained variance vs number of components¶
What if we look at total explained variance as a function of # of components?
✅ Task 6.4 (2 points): Plot the total explained variance ratio vs # of components.
In [ ]:
# Put your code here
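The same pattern as the previous sketch, using the variances list instead:

plt.plot(range(1, 5), variances, marker="o")
plt.xlabel("Number of PCA components")
plt.ylabel("Total explained variance ratio")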
✅ Question 6.2 (1 point): Where does it seem like we have diminishing returns, that is, no major increase in explained variance as we add additional components to the PCA? How does that number of components compare to the point of diminishing returns for accuracy?
✎ Erase this and put your answer here.
🛑 STOP¶
Pause to commit your changes to your Git repository!
Take a moment to save your notebook, commit the changes to your Git repository using the commit message “Committing Part 6”, and push the changes to GitHub.
7. Bonus exercise: visualizing the decision boundaries for a portion of the feature space (2 bonus points)¶
As you might imagine, visualizing decision boundaries for a multidimensional feature space can be a challenge! That said, when trying to build some intuition about how these classifiers work, visualizing 2D decision boundaries can be useful.
To earn some extra points on this assignment, try using the following example as a guide to visualize the decision boundary for your “best estimator” parameters using your 2 PCA components as your training features. To be clear, you should be using your PCA component data and your best-fit parameters; you should not just be running the example! You should be able to get a plot that looks something like this:
Since we didn’t explicitly cover this in class, you do not have to complete this part of the assignment unless you would like the extra credit points.
✅ Task 7.1 (2 extra points): Try to create a plot of the decision boundaries for the 2 principal components using your “best estimator” parameters.
In [ ]:
# Put your code here
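A rough sketch of the usual meshgrid approach. Here train_features_pca2 is assumed to be your 2-component PCA-transformed training data (a NumPy array) and model_2d an SVC already fit to it with your best-estimator parameters; both names are placeholders:

import numpy as np
import matplotlib.pyplot as plt

# Build a grid of points spanning the 2D PCA space
# (train_features_pca2 and model_2d are assumed to exist; see above)
x_min, x_max = train_features_pca2[:, 0].min() - 0.1, train_features_pca2[:, 0].max() + 0.1
y_min, y_max = train_features_pca2[:, 1].min() - 0.1, train_features_pca2[:, 1].max() + 0.1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200), np.linspace(y_min, y_max, 200))

# Classify every grid point and shade the regions by predicted class
Z = model_2d.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(train_features_pca2[:, 0], train_features_pca2[:, 1], c=train_labels)
plt.xlabel("PCA component 1")
plt.ylabel("PCA component 2")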
Assignment wrap-up¶
Please fill out the form that appears when you run the code below. You must completely fill this out in order to receive credit for the assignment!
In [ ]:
from IPython.display import HTML
HTML(
"""
"""
)
Congratulations, you’re done!¶
Submit this assignment by uploading it to the course Desire2Learn web page. Go to the “Homework Assignments” folder, find the submission folder for Homework #5, and upload your notebook.