Assignment_2
Assignment 2: Classification with Naive Bayes & Logistic Regression¶
Introduction¶
In this assignment, we will implement multiclass classification using Logistic Regression and Naive Bayes via the Scikit-learn library.
In Part A of this assignment, we are going to implement Logistic Regression. First, we are going to examine how data scaling affects the performance of the classifier: We will produce a classification report and plot a confusion matrix. We are then going to use cross validation to more reliably compare the performance of the models.
In Part B of this assignment, we are going to implement multiclass and Bernoulli Naive Bayes. We are going to perform feature selection and analyse the performance of the model. Finally, we are also going to look at decision boundaries.
We are going to work with two datasets:
The ‘Gene expression cancer RNA-Seq’ dataset
The ‘Zoo’ dataset.
Guidelines¶
The structure of the code is given to you and you will need to fill in the parts corresponding to each question.
You will have to submit the completed notebook in the Jupyter notebook format: .ipynb.
Do not modify/erase other parts of the code if you have not been given specific instructions to do so.
When you are asked to insert code, do so between the areas which begin:
##########################################################
#[your code here]
And which end:
##########################################################
When you are asked to comment on the results you should give clear and comprehensible explanations. Write the comments in a ‘Code Cell’ with a # sign at the beginning of each row, and in the areas which begin:
# [INSERT YOUR ANSWER HERE]
Please do not change the cell below; it contains a number of imports. All these packages are relevant for the assignment and it is important that you get used to them. You can find more information about them in their respective documentation. The most relevant package for this assignment is Scikit-learn:
https://scikit-learn.org/stable/
#PLEASE DO NOT CHANGE THIS CELL
# Standard python libraries for data and visualisation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pylab
%matplotlib inline
import seaborn as sns
# SciKit Learn python ML Library
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import MultinomialNB, ComplementNB, BernoulliNB
# Import error metric
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Import library for handling warnings
import warnings
#PLEASE DO NOT CHANGE THIS CELL
# Functions to use
# Decision boundary plotting
def plot_predictions(X, y, clf):
h = .02 # step size in the mesh
# create a mesh to plot in
x_min, x_max = X.iloc[:, 0].min() - 1, X.iloc[:, 0].max() + 1
y_min, y_max = X.iloc[:, 1].min() - 1, X.iloc[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
plt.figure(figsize=(8,6))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.8)
# Plot also the training points
pylab.scatter(X.iloc[:, 0], X.iloc[:, 1], c=y, cmap=plt.cm.coolwarm)
plt.xlabel(list(X.head(0))[0])
plt.ylabel(list(X.head(0))[1])
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(np.arange(min(X.iloc[:, 0]), max(X.iloc[:, 0])+1, 1.0))
plt.yticks(np.arange(min(X.iloc[:, 1]), max(X.iloc[:, 1])+1, 1.0))
plt.title(clf)
# confusion matrix plotting
def plot_conf_matrix(conf_matrix):
plt.figure(figsize=(5,5))
sns.heatmap(conf_matrix, annot=True, cmap="YlGnBu", fmt='g')
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
Part A: Logistic Regression [50 marks]¶
Gene expression cancer RNA-Seq dataset¶
This dataset contains gene expression data from patients diagnosed with one of five tumor types: BRCA, KIRC, COAD, LUAD and PRAD. Each feature corresponds to a different gene.
Dataset location: https://archive.ics.uci.edu/ml/datasets/gene+expression+cancer+RNA-Seq
Number of instances: 801
Number of features: 20531
All of these features are real-valued and continuous. To reduce computation time, we are going to work with only the first 200 features.
Load dataset¶
Please save the ‘data.csv’ and ‘labels.csv’ files included in the assignment zip file, which contain this data, and change the paths below to point to the location of your downloaded files. You may want to use os.chdir to change directory.
# PLEASE CHANGE THE FILE PATHS
file_path_data = "yourFilePath/data.csv"
file_path_labels = "yourFilePath/labels.csv"
#PLEASE DO NOT CHANGE THIS CELL
# read the file with pandas.read_csv
X = pd.read_csv(file_path_data, usecols=[*range(1, 201)])
y = pd.read_csv(file_path_labels, usecols=[1]).values.ravel()
label_list = ["BRCA", "KIRC", "COAD", "LUAD", "PRAD"]
Data analysis and pre-processing¶
Below, we will generate histograms of the first 12 ‘Gene expression cancer RNA-Seq’ dataset features.
#PLEASE DO NOT CHANGE THIS CELL
figs, axs = plt.subplots(3, 4, figsize=(12, 10))
axs = axs.ravel()
for counter in range(12):
col = X.columns[counter]
axs[counter].hist(X[col], bins=20)
axs[counter].set_title(col)
Regularisation makes the classifier dependent on the scale of the features.
We are going to scale the features and compare the performance of Logistic Regression on the unscaled and scaled data.
Question 1 [10 marks]¶
a) [3 marks]¶
Use StandardScaler() to scale the data. Save the result to a new variable (do not overwrite X).
#######################################################
#[your code here]
#######################################################
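For reference, here is a minimal sketch of how StandardScaler() might be applied here; the variable name X_scaled is an assumption, not part of the assignment template.
# Sketch only: fit the scaler on X and keep the result in a new DataFrame with
# the original column names, leaving X itself untouched.
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)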
b) [3 marks]¶
Explain how the StandardScaler() function changes the data (in particular its mean and variance). (Hint: You can re-run the code from the section Data analysis and pre-processing in order to visualise the scaled values.)
# [INSERT YOUR ANSWER HERE]
c) [4 marks]¶
LogisticRegression() uses $\ell_2$ regularisation by default. Briefly explain the effect of such a regulariser. Furthermore, briefly explain why data scaling might be a useful pre-processing step before applying such a regulariser.
# [INSERT YOUR ANSWER HERE]
Classifier performance analysis¶
A Confusion Matrix is a table used for the evaluation of classification models. The x axis represents predicted labels while the y axis represents actual labels. Each cell indicates the number of instances with a particular combination of actual and predicted labels. Diagonal values represent correctly classified instances.
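As a toy illustration (unrelated to the assignment data), the snippet below shows how confusion_matrix() arranges these counts, with rows corresponding to actual labels and columns to predicted labels:
# Toy example: two classes, five instances.
y_true_toy = ["cat", "cat", "dog", "dog", "dog"]
y_pred_toy = ["cat", "dog", "dog", "dog", "cat"]
print(confusion_matrix(y_true_toy, y_pred_toy, labels=["cat", "dog"]))
# Expected output:
# [[1 1]
#  [1 2]]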
Question 2 [20 marks]¶
a) [5 marks]¶
Create training and testing datasets for the unscaled and scaled data (set random_state=42 when making your split).
lg = LogisticRegression(solver="lbfgs", multi_class="multinomial", max_iter=5000)
lg_scaled = LogisticRegression(solver="lbfgs", multi_class="multinomial", max_iter=5000)
#######################################################
#[your code here]
#######################################################
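A minimal sketch of one way to create the splits; the variable names and the test_size value are assumptions, since the assignment only fixes random_state=42.
# Sketch only: split the unscaled and scaled data with the same random_state
# so that both models see the same train/test partition.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(X_scaled, y, test_size=0.2, random_state=42)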
b) [5 marks]¶
Fit LogisticRegression() to the unscaled and scaled data.
#######################################################
#[your code here]
#######################################################
c) [5 marks]¶
Plot confusion matrices for the scaled and unscaled data using the Scikit-learn confusion_matrix() function together with the plot_conf_matrix() function defined for you at the beginning of the notebook.
#######################################################
#[your code here]
#######################################################
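A sketch of one way to do this, assuming the fitted models and the split variable names from the sketches above:
# Sketch only: compute the confusion matrices on the test sets and plot them
# with the helper defined at the top of the notebook.
plot_conf_matrix(confusion_matrix(y_test, lg.predict(X_test)))
plot_conf_matrix(confusion_matrix(y_test_s, lg_scaled.predict(X_test_s)))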
d) [5 marks]¶
Print a classification report using the scikit-learn classification_report() function. You can use target_names = label_list to include the class labels.
#######################################################
#[your code here]
#######################################################
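A sketch under the same naming assumptions as above:
# Sketch only: report per-class precision, recall and F1 for both models.
print(classification_report(y_test, lg.predict(X_test), target_names=label_list))
print(classification_report(y_test_s, lg_scaled.predict(X_test_s), target_names=label_list))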
Cross validation¶
In Scikit-learn, StratifiedKFold() splits the data into $k$ different folds.
cross_val_score() then uses these folds to run the classifier multiple times and collect multiple accuracy scores.
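As an illustration of how the two functions fit together (a sketch only, using a fresh estimator rather than the assignment variables):
# Sketch only: 10 stratified folds, each preserving the class proportions of y;
# cross_val_score fits and scores the estimator once per fold.
skf_example = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
example_scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=skf_example)
print(example_scores)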
Question 3 [20 marks]¶
a) [5 marks]¶
Split data using StratifiedKFold(). Set n_splits = 10, shuffle = True, and random_state=42.
with warnings.catch_warnings():
warnings.simplefilter("ignore")
# please note: the two lines above silence sklearn warnings; this is optional
#######################################################
#[your code here]
########################################################
b) [5 marks]¶
Calculate cross validation scores using cross_val_score(). Call the variables storing these scores lg_scores and lg_scaled_scores (for consistency with plotting done for you in the subsequent section). (Hint: cv is equal to the output of StratifiedKFold().)
with warnings.catch_warnings():
warnings.simplefilter("ignore")
# please note: the two lines above silence sklearn warnings; this is optional
#######################################################
#[your code here]
########################################################
c) [5 marks]¶
Calculate and print the mean of the scores.
with warnings.catch_warnings():
warnings.simplefilter("ignore")
# please note: the two lines above silence sklearn warnings; this is optional
#######################################################
#[your code here]
########################################################
d) [5 marks]¶
Unlike vanilla KFold(), StratifiedKFold() aims to preserve the proportion of examples belonging to each class in each split. Does StratifiedKFold() make each data split balanced if the whole dataset is not balanced?
# [INSERT YOUR ANSWER HERE]
We can visualise the scores using a box plot. It highlights the median, the lower and upper quartiles, and “whiskers” showing the extent of the scores.
#PLEASE DO NOT CHANGE THIS CELL
plt.figure(figsize=(8, 4))
plt.plot([1]*10, lg_scores, ".")
plt.plot([2]*10, lg_scaled_scores, ".")
plt.boxplot([lg_scores, lg_scaled_scores], labels=("logistic regression", "logistic regression w/ scaling"))
plt.ylabel("Accuracy", fontsize=14)
plt.show()
Part B: Naive Bayes [50 marks]¶
Please note that we are still working with the ‘Gene expression cancer RNA-Seq’ dataset loaded in Part A.
Removing correlated features¶
Feature independence is an assumption of Naive Bayes. Naive Bayes is particularly sensitive to feature correlations which can lead to overfitting. Based on data alone, we cannot test if features are truly independent, but we can exclude correlated features.
Below, we test if features are correlated.
Question 4 [10 marks]¶
Drop features with correlation above 0.75.
Hint: see what to_drop returns, then use it as an argument in the pandas drop() function with axis = 1.
# Create correlation matrix
corr_matrix = X.corr().abs()
# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Find index of feature columns with correlation greater than 0.75
to_drop = [column for column in upper.columns if any(upper[column] > 0.75)]
#######################################################
#[your code here]
#######################################################
print("Correlated features dropped: ")
print(*to_drop, sep=", ")
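A minimal sketch of the intended drop; since the later cells keep using X, one reasonable reading is to drop the columns in place, but the exact choice is left to you.
# Sketch only: axis=1 drops the listed columns rather than rows.
X = X.drop(to_drop, axis=1)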
Recursive feature elimination¶
Let's go further and select the 5 most important features. Recursive Feature Elimination (RFE) selects features by recursively considering smaller and smaller sets of features.
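As a rough sketch of the RFE pattern (LogisticRegression is used as the base estimator purely for illustration, since RFE needs an estimator exposing coef_ or feature_importances_; the estimator, the variable names and whether to overwrite X are all choices left to you):
# Sketch only: recursively eliminate features until 5 remain, then use the
# boolean support_ mask to keep the corresponding columns of the DataFrame.
selector = RFE(LogisticRegression(max_iter=5000), n_features_to_select=5)
selector.fit(X, y)
X_selected = X.loc[:, selector.support_]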
Question 5 [10 marks]¶
a) [5 marks]¶
Use the RFE() function in Scikit-learn to select features. (Hint: Check the Scikit-learn documentation and example.)
b) [5 marks]¶
After selecting features to eliminate, use the support_ attribute as a mask to select the right columns.
nb = MultinomialNB()
#######################################################
#[your code here]
#######################################################
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
nb.fit(X_train, y_train)
nb_predict = nb.predict(X_test)
print(classification_report(y_test, nb_predict, target_names = label_list))
nb_confusion = confusion_matrix(y_test, nb_predict)
plot_conf_matrix(nb_confusion)
We are now going to switch to a different dataset.
Zoo dataset¶
This is a simple dataset which classifies animals into 7 categories.
Dataset location: https://archive.ics.uci.edu/ml/datasets/Zoo
Number of instances: 101
Number of features: 17
Attribute Information:
animal name: Unique for each instance
hair: Boolean
feathers: Boolean
eggs: Boolean
milk: Boolean
airborne: Boolean
aquatic: Boolean
predator: Boolean
toothed: Boolean
backbone: Boolean
breathes: Boolean
venomous: Boolean
fins: Boolean
legs: Numeric (set of values: {0,2,4,5,6,8})
tail: Boolean
domestic: Boolean
catsize: Boolean
type: Numeric (integer values in range [1,7])
All of these parameters are discrete-valued.
Load dataset¶
Please save the ‘zoo.csv’ file included in the assignment zip file, which contains a subset of this data, and change the path below to point to the location of your downloaded file. You may want to use os.chdir to change directory.
# PLEASE CHANGE THE FILE PATHS
file_path_data_zoo = "yourFilePath/zoo.csv"
#PLEASE DO NOT CHANGE THIS CELL
# read the file with pandas.read_csv
data_zoo = pd.read_csv(file_path_data_zoo)
# because the file does not contain header information, we manually add the dataset headers
data_zoo.columns = ["animal name", "hair", "feathers", "eggs",
"milk", "airborne", "aquatic", "predator", "toothed", "backbone",
"breathes", "venomous", "fins", "legs", "tail", "domestic", "catsize", "class"]
We assign columns 2 to 17 (everything other than animal name and class) to the variable X_zoo. Remember that indexing starts at 0.
We assign the “class” column to the variable y_zoo.
We then split X_zoo and y_zoo into train and test datasets.
#PLEASE DO NOT CHANGE THIS CELL
X_zoo = data_zoo.iloc[:,1:17]
y_zoo = data_zoo["class"]
X_zoo_train, X_zoo_test, y_zoo_train, y_zoo_test = train_test_split(X_zoo, y_zoo, test_size=0.2, random_state=42)
Below, we create a training dataset which contains only animals with 4 or fewer legs.
#PLEASE DO NOT CHANGE THIS CELL
X_drop_train = X_zoo_train.drop(X_zoo_train[X_zoo_train["legs"] > 4].index)
y_drop_train = y_zoo_train.drop(X_zoo_train[X_zoo_train["legs"] > 4].index)
We create an instance of a multinomial Naive Bayes classifier. We train nb on X_drop_train, and test it on X_zoo_test.
You should get a warning message indicating that the value of alpha is too small and is automatically overwritten.
#PLEASE DO NOT CHANGE THIS CELL
nb = MultinomialNB(alpha =0)
nb.fit(X_drop_train, y_drop_train)
nb_predict_train = nb.predict(X_drop_train)
nb_predict_test = nb.predict(X_zoo_test)
print(accuracy_score(nb_predict_train, y_drop_train))
print(accuracy_score(nb_predict_test, y_zoo_test))
Question 6 [10 marks]¶
Please comment on what the alpha parameter does.
Think about what Naive Bayes does when it encounters a discrete feature value that is absent from the training dataset but present in the test dataset. What probability estimate would be associated with it?
See: Scikit-learn MultinomialNB documentation, in particular the description of the alpha parameter https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
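For reference, the scikit-learn documentation linked above gives the smoothed per-class feature probability estimate used by MultinomialNB as
$$\hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_y + \alpha n},$$
where $N_{yi}$ is the count of feature $i$ in the training samples of class $y$, $N_y$ is the total count of all features for class $y$, and $n$ is the number of features.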
### [INSERT YOUR ANSWER HERE]
We are now going to repeat the above process using Naive Bayes for features with the Bernoulli distribution (BernoulliNB() in scikit-learn). We are particularly interested in interpreting decision boundaries of Bernoulli Naive Bayes.
Question 7 [20 marks]¶
a) [2 marks]¶
Use recursive feature elimination (RFE) to select only 2 features. (Note: in real-life cases, you are likely to use this approach to select more than just two features. In this case we are asking you to analyse the outcome, and 2 features allow us to visualise the decision boundary more easily.)
b) [1 mark]¶
After selecting the features to eliminate, use the support_ attribute as a mask to select the right columns.
nb = BernoulliNB()
#######################################################
#[your code here]
#######################################################
c) [2 marks]¶
Split data into train and test sets (set random_state=42).
d) [1 mark]¶
Fit the model.
#######################################################
#[your code here]
##########################################################
e) [2 marks]¶
Create a confusion matrix using the plot_conf_matrix() function.
#######################################################
#[your code here]
##########################################################
f) [2 marks]¶
Plot decision boundaries using the plot_predictions() function.
#######################################################
#[your code here]
##########################################################
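A sketch of the call, assuming X_zoo_2 is a DataFrame holding just the two RFE-selected columns and nb is the fitted BernoulliNB model (both names are assumptions):
# Sketch only: plot_predictions expects a two-column DataFrame, the labels and a fitted classifier.
plot_predictions(X_zoo_2, y_zoo, nb)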
g) [10 marks]¶
Interpret the decision boundaries: Recall the shapes of decision boundaries you have seen in classes – were they straight and crossing at right angles? Why is this the case when using BernoulliNB()?
### [INSERT YOUR ANSWER HERE]