Assignment_4_Solved
Follow These Instructions¶
Once you are finished, be sure to complete the following steps.
Restart your kernel by clicking ‘Kernel’ > ‘Restart & Run All’.
Fix any errors which result from this.
Repeat steps 1. and 2. until your notebook runs without errors.
Submit your completed notebook to OWL by the deadline.
Assignment 4: Classification with Logistic Regression [ __ /100 marks]¶
In this assignment we will use the diabetes dataset, which was collected and made available by the National Institute of Diabetes and Digestive and Kidney Diseases as part of the Pima Indians Diabetes Database.
We will use logistic regression to predict whether subjects have diabetes or not.
import pandas as pd
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, precision_recall_curve, auc
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
Question 1.1 [ _ /3 marks]¶
Read the file diabetes.csv into a pandas DataFrame. Display the first 5 rows of the DataFrame.
# ****** your code here ******
df = pd.read_csv('diabetes.csv')
df.head()
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
Question 1.2 [ _ /6 marks]¶
(1) How many classes are there? How many features are available to predict the outcome?
Your answer: 2 classes. 8 features.
(2) Is the dataset class-balanced?
# ****** your code here ******
df.Outcome.value_counts()
0    500
1    268
Name: Outcome, dtype: int64
Your answer: No
(3) For this classification problem, what is the baseline accuracy? Round to 3 decimal places.
# ****** your code here ******
counts = df.Outcome.value_counts()
# A = df.Outcome[df['Outcome'] == 0].count() # majority class
# B = df.Outcome[df['Outcome'] == 1].count() # minority class
baseline_accuracy = round(counts[0] / (counts[0] + counts[1]), 3)
print("Baseline Accuracy is:", baseline_accuracy)
Baseline Accuracy is: 0.651
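As an optional cross-check (a sketch, not part of the required solution), sklearn's DummyClassifier with strategy="most_frequent" always predicts the majority class, so its accuracy should match the baseline computed above; the variable name features is just an illustrative helper.
from sklearn.dummy import DummyClassifier
features = df.drop("Outcome", axis="columns")  # illustrative helper variable
dummy = DummyClassifier(strategy="most_frequent").fit(features, df.Outcome)
print("DummyClassifier accuracy:", round(dummy.score(features, df.Outcome), 3))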
Question 1.3 [ _ /3 marks]¶
Use train_test_split with random_state=0 to split the data into training and test sets. Leave 20% for testing.
# Store all the features into variable “X”
# ****** your code here ******
#X = df.iloc[:,0:8].values
X = df.drop("Outcome", axis="columns").values
# Store the output class values into variable “y”
# ****** your code here ******
y = df.Outcome.values
# Split your X and y data using train_test_split
# ****** your code here ******
seed = 0  # the question specifies random_state=0; this seed is reused in later questions
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=seed)
Question 2.1 [ _ /3 marks]¶
We will use sklearn’s LogisticRegression to solve the classification problem. Before we move on, answer the following questions by reading the documentation.
(1) Does LogisticRegression use a penalty by default? If yes, what penalty?
Your answer: Yes, it uses L2 penalty by default.
(2) If we apply a penalty during learning, what difference do you expect to see in the resulting coefficients, relative to not applying a penalty during learning?
Your answer: We expect the coefficients to be smaller if a penalty is applied.
(3) If using the default settings of LogisticRegression, do you need to include a column of 1s in your feature/design matrix? Briefly explain why or why not.
Your answer: A column of 1s is not needed because, by default, LogisticRegression fits an intercept term (fit_intercept=True).
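This shrinkage can be checked directly. The sketch below is illustrative, not part of the required solution (unpenalized and penalized are throwaway model names); it fits two models on the training data from Question 1.3 and compares coefficient norms.
unpenalized = LogisticRegression(penalty="none", max_iter=10000).fit(Xtrain, ytrain)
penalized = LogisticRegression(penalty="l2", C=0.01, max_iter=10000).fit(Xtrain, ytrain)
print("||coef|| without penalty:", np.linalg.norm(unpenalized.coef_).round(3))
print("||coef|| with L2 penalty:", np.linalg.norm(penalized.coef_).round(3))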
Question 2.2 [ _ /10 marks]¶
Create a LogisticRegression model with penalty="none". Let's first train and test this classifier using only "Insulin" as the input feature. Make a scatter plot of the test points and plot your prediction on the same graph.
# Create a LogisticRegression model without regularization
# ****** your code here ******
LR1 = LogisticRegression(penalty="none", max_iter=10000)
# Obtain training data and test data
# ****** your code here ******
xtrain = Xtrain[:,4].reshape(-1, 1)
xtest = Xtest[:,4].reshape(-1, 1)
# xtrain = Xtrain[:,[4]]
# xtest = Xtest[:,[4]]
# Fit to your training data using Logistic Regression
# ****** your code here ******
LR1.fit(xtrain, ytrain)
# Create a scatter plot of the test data.
# ****** your code here ******
plt.scatter(xtest, ytest)
plt.xlabel('Insulin')
plt.ylabel('Outcome')
# Also plot your prediction using sns.lineplot
# lineplot needs 1d vector x
ypred = LR1.predict(xtest)
sns.lineplot(x=xtest.reshape(-1), y=ypred, color="red")
plt.show()
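Optionally, plotting the predicted probability of class 1 rather than the hard 0/1 prediction shows the fitted sigmoid more clearly. This is a sketch that reuses LR1 and the Insulin-only test data from above; xs is just an illustrative grid of Insulin values.
xs = np.linspace(xtest.min(), xtest.max(), 200).reshape(-1, 1)
plt.scatter(xtest, ytest, alpha=0.5)
plt.plot(xs, LR1.predict_proba(xs)[:, 1], color="red")
plt.xlabel('Insulin')
plt.ylabel('P(Outcome = 1)')
plt.show()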
Question 2.3 [ _ /10 marks]¶
Evaluate the classification performance using Accuracy, Recall, Precision, Sensitivity and Specificity.
# ****** your code here ******
# You can either write a function or not
def compute_performance(yhat, y, classes):
    # First, get tp, tn, fp, fn
    tp = sum(np.logical_and(yhat == classes[1], y == classes[1]))
    tn = sum(np.logical_and(yhat == classes[0], y == classes[0]))
    fp = sum(np.logical_and(yhat == classes[1], y == classes[0]))
    fn = sum(np.logical_and(yhat == classes[0], y == classes[1]))
    print(f"tp: {tp} tn: {tn} fp: {fp} fn: {fn}")
    # Accuracy
    acc = (tp + tn) / (tp + tn + fp + fn)
    # Precision: "Of the ones I labeled +, how many are actually +?"
    precision = tp / (tp + fp)
    # Recall: "Of all the + in the data, how many do I correctly label?"
    recall = tp / (tp + fn)
    # Sensitivity is another name for recall
    sensitivity = recall
    # Specificity: "Of all the - in the data, how many do I correctly label?"
    specificity = tn / (fp + tn)
    # Print results
    print("Accuracy:", round(acc, 6), "Recall:", round(recall, 6), "Precision:", round(precision, 6),
          "Sensitivity:", round(sensitivity, 6), "Specificity:", round(specificity, 6))
compute_performance(ypred, ytest, LR1.classes_)
tp: 3 tn: 106 fp: 1 fn: 44
Accuracy: 0.707792 Recall: 0.06383 Precision: 0.75 Sensitivity: 0.06383 Specificity: 0.990654
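As an optional cross-check (a sketch, not part of the required solution), sklearn's built-in metrics should reproduce the hand-computed values above; the positive class is 1.
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix
print(confusion_matrix(ytest, ypred))
print("Accuracy:", round(accuracy_score(ytest, ypred), 6),
      "Precision:", round(precision_score(ytest, ypred), 6),
      "Recall:", round(recall_score(ytest, ypred), 6))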
Question 3.1 [ _ /10 marks]¶
Create another LogisticRegression model with penalty="none". Train and test this classifier with all features and then evaluate the performance.
# Create a LogisticRegression model without regularization
# ****** your code here ******
LR2 = LogisticRegression(penalty="none", max_iter=10000)
# Fit to your training data using Logistic Regression
# ****** your code here ******
LR2.fit(Xtrain, ytrain)
# Compute your test predictions, given test inputs
# ****** your code here ******
ypred2 = LR2.predict(Xtest)
# Evaluate the performance
# ****** your code here ******
compute_performance(ypred2, ytest, LR2.classes_)
tp: 29 tn: 98 fp: 9 fn: 18
Accuracy: 0.825 Recall: 0.617 Precision: 0.763 Sensitivity: 0.617 Specificity: 0.916
Does using more features help to improve the classification?
Your answer: Yes. Using all eight features improves test accuracy from about 0.708 (Insulin only) to 0.825, and recall improves from 0.064 to 0.617.
Question 3.2 [ _ /10 marks]¶
Let's adjust the decision threshold from 0.5 (default) to 0.4 and 0.6, and then evaluate the performance.
# Using your classifier from the last question, adjust the decision threshold and get the updated predictions
# ****** your code here ******
threshold = 0.4
ytest_prob = LR2.predict_proba(Xtest)
ypred_new = LR2.classes_[(ytest_prob[:,1] > threshold).astype(int)]
# Evaluate the performance
# ****** your code here ******
compute_performance(ypred_new, ytest, LR2.classes_)
tp: 34 tn: 88 fp: 19 fn: 13
Accuracy: 0.792 Recall: 0.723 Precision: 0.642 Sensitivity: 0.723 Specificity: 0.822
# Using your classifier from the last question, adjust the decision threshold and get the updated predictions
# ****** your code here ******
threshold = 0.6
ytest_prob = LR2.predict_proba(Xtest)
ypred_new = LR2.classes_[(ytest_prob[:,1] > threshold).astype(int)]
# Evaluate the performance
# ****** your code here ******
compute_performance(ypred_new, ytest, LR2.classes_)
tp: 26 tn: 100 fp: 7 fn: 21
Accuracy: 0.818 Recall: 0.553 Precision: 0.788 Sensitivity: 0.553 Specificity: 0.935
What do you think is a better threshold?
Your answer: That depends on the context and requirements of the project, in particular how costly false negatives are relative to false positives; deciding may involve consulting the project stakeholders.
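If an empirical starting point is needed, a simple threshold sweep makes the trade-off explicit. This is a sketch that reuses ytest_prob and compute_performance from above; the candidate thresholds are illustrative.
for t in [0.3, 0.4, 0.5, 0.6, 0.7]:
    print("threshold =", t)
    compute_performance(LR2.classes_[(ytest_prob[:, 1] > t).astype(int)], ytest, LR2.classes_)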
Question 3.3 [ _ /10 marks]¶
Create a final LogisticRegression model with penalty="l2" and C=0.01. Train and test this classifier with all features and then evaluate the performance.
# Create a LogisticRegression model with l2 regularization
# ****** your code here ******
LR3 = LogisticRegression(penalty="l2", solver="lbfgs", multi_class="ovr", C=0.01, max_iter=10000)
# LR3 = make_pipeline(StandardScaler(), LogisticRegression(penalty="l2", solver="lbfgs", multi_class="ovr", C=0.01, max_iter=10000))
# LR3 = LogisticRegression(penalty="l2", solver="lbfgs", multi_class="ovr", C=1, max_iter=10000)
# LR3 = make_pipeline(StandardScaler(), LogisticRegression(penalty="l2", solver="lbfgs", multi_class="ovr", C=1, max_iter=10000))
# Fit to your training data using Logistic Regression
# ****** your code here ******
LR3.fit(Xtrain, ytrain)
# Compute your test predictions, given test inputs
# ****** your code here ******
ypred3 = LR3.predict(Xtest)
# Evaluate the performance
# ****** your code here ******
compute_performance(ypred3, ytest, LR3.classes_)
tp: 28 tn: 97 fp: 10 fn: 19
Accuracy: 0.811688 Recall: 0.595745 Precision: 0.736842 Sensitivity: 0.595745 Specificity: 0.906542
Does regularization help to improve the classification?
Your answer: No, not in this case. With C=0.01 the regularized model performs slightly worse on the test set than the unregularized one (accuracy 0.812 vs. 0.825).
Question 4 [ _ /15 marks]¶
Plot ROC Curves for the classifiers you used in questions 2.2, 3.1, and 3.3. Use AUC to determine which classifier is the best.
# Use roc_curve to get FPR and TPR for each of the 3 classifiers
# ****** your code here ******
fpr1, tpr1, _ = roc_curve(ytest, LR1.predict_proba(xtest)[:,1], pos_label=1)
fpr2, tpr2, _ = roc_curve(ytest, LR2.predict_proba(Xtest)[:,1], pos_label=1)
fpr3, tpr3, _ = roc_curve(ytest, LR3.predict_proba(Xtest)[:,1], pos_label=1)
# Plot all of the ROC curves
# ****** your code here ******
ax = sns.lineplot(x=fpr1, y=tpr1, color="blue", label='Q2.2')
ax = sns.lineplot(x=fpr2, y=tpr2, color="red", label='Q3.1')
ax = sns.lineplot(x=fpr3, y=tpr3, color="green", label='Q3.3')
ax.set_xlabel("FP Rate")
ax.set_ylabel("TP Rate")
# Determine AUC for each of the ROC curves
# ****** your code here ******
print(f"AUC Q2.2 Model: {auc(fpr1, tpr1).round(3)}| AUC Q3.1 Model: {auc(fpr2, tpr2).round(3)}| AUC Q3.3 Model: {auc(fpr3, tpr3).round(3)}")
AUC Q2.2 Model: 0.536| AUC Q3.1 Model: 0.867| AUC Q3.3 Model: 0.853
Which one is the best classifier?
Your answer: The logistic regression classifier from Question 3.1, which uses all input features and no regularization (AUC 0.867).
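Optionally, since this dataset is class-imbalanced, precision-recall curves give a complementary view of the same three classifiers. The sketch below reuses the fitted models and the precision_recall_curve import from the top of the notebook; the variable names p1/r1 etc. are illustrative.
p1, r1, _ = precision_recall_curve(ytest, LR1.predict_proba(xtest)[:, 1], pos_label=1)
p2, r2, _ = precision_recall_curve(ytest, LR2.predict_proba(Xtest)[:, 1], pos_label=1)
p3, r3, _ = precision_recall_curve(ytest, LR3.predict_proba(Xtest)[:, 1], pos_label=1)
ax = sns.lineplot(x=r1, y=p1, color="blue", label='Q2.2')
ax = sns.lineplot(x=r2, y=p2, color="red", label='Q3.1')
ax = sns.lineplot(x=r3, y=p3, color="green", label='Q3.3')
ax.set_xlabel("Recall")
ax.set_ylabel("Precision")
plt.show()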
Question 5 [ _ /20 marks]¶
Multiclass Logistic Regression
In the classification lab, we trained a binary LR classifier on the mnist dataset to discriminate entries equal to 5 from the rest. Use the same dataset to train a multiclass SGDClassifier with l2 regularization; this time you will have 10 classes, i.e., 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. Your model must achieve an average accuracy of more than 90% according to the cross-validation accuracy scores over 3 folds. For training, use max_iter=2000, tol=1e-3, random_state=seed. For some sklearn functions you can set the argument n_jobs=N to run them in parallel for a computation speed-up; a good value for N is the number of physical CPU cores your machine has, so check the documentation of each function and take advantage of this where possible. First load the data and plot a histogram, then comment on the class distribution and whether you consider the dataset balanced or imbalanced. For data splitting, use test_size=0.5 and random_state=seed.
# We will be using the MNIST dataset, which is a set of 70000 small images of handwritten digits.
# Each image is labeled with the digit it represents.
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
x_mnist = mnist["data"]
y_mnist = mnist["target"].astype(np.uint8)
c, n = np.unique(y_mnist, return_counts=True)
print('Classes:', c)
print('Class count:', n)
fig, ax = plt.subplots(figsize=(4, 3))
sns.histplot(y_mnist, discrete=True, ax=ax)
plt.xlabel('Class')
plt.show()
Classes: [0 1 2 3 4 5 6 7 8 9]
Class count: [6903 7877 6990 7141 6824 6313 6876 7293 6825 6958]
Your answer:
According to the histogram (and the class counts above), the classes are roughly equally represented, so it is safe to treat the dataset as balanced.
X_train, X_test, y_train, y_test = train_test_split(x_mnist, y_mnist, test_size=0.5, random_state=seed)
After reading the SGDClassifier documentation, note the following loss settings:
loss='hinge' --> gives a linear support-vector machine
loss='log_loss' --> gives logistic regression
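One practical consequence, shown in the illustrative sketch below (demo is a throwaway model name, fitted on a small subset only to demonstrate the API): SGDClassifier exposes predict_proba for probabilistic losses such as log_loss, whereas with hinge loss only decision_function is available.
demo = SGDClassifier(loss='log_loss', penalty="l2", max_iter=2000, tol=1e-3, random_state=seed)
demo.fit(X_train[:1000], y_train[:1000])  # a small subset is enough to illustrate the API
print(demo.predict_proba(X_train[:1]).round(3))  # with loss='hinge' this attribute is not available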
clf1 = SGDClassifier(loss='hinge', penalty="l2", max_iter=2000, tol=1e-3, n_jobs=4, random_state=seed).fit(X_train, y_train)
i = np.random.randint(len(X_train))  # assumed: pick a random training example to spot-check
print('Real label of i is _'+str(y_train[i])+'_ and the model predicted _'+str(clf1.predict([X_train[i]])[0])+'_.')
scores1 = cross_val_score(clf1, X_train, y_train, cv=3, n_jobs=4, scoring="accuracy")
print("Scores:", scores1.round(3))
print("Standard deviation:", scores1.std().round(3))
print("Mean score: "+str(scores1.mean().round(3)*100)+'%')
Real label of i is _4_ and the model predicted _4_.
Scores: [0.874 0.857 0.875]
Standard deviation: 0.008
Mean score: 86.9%
Not interested in clf1 as its mean score is less than 90%.
from sklearn.pipeline import make_pipeline
clf2 = make_pipeline(StandardScaler(), SGDClassifier(loss='hinge', penalty="l2", max_iter=2000, tol=1e-3, n_jobs=4, random_state=seed)).fit(X_train, y_train)
print('Real label of i is _'+str(y_train[i])+'_ and the model predicted _'+str(clf2.predict([X_train[i]])[0])+'_.')
scores2 = cross_val_score(clf2, X_train, y_train, cv=3, n_jobs=4, scoring="accuracy")
print("Scores:", scores2.round(3))
print("Standard deviation:", scores2.std().round(3))
print("Mean score: "+str(scores2.mean().round(3)*100)+'%')
Real label of i is _8_ and the model predicted _8_.
Scores: [0.902 0.909 0.91 ]
Standard deviation: 0.003
Mean score: 90.7%
clf2 is an acceptable model as its mean score is greater than 90%.
clf3 = SGDClassifier(loss='log_loss', penalty="l2", max_iter=2000, tol=1e-3, n_jobs=4, random_state=seed).fit(X_train, y_train)
print('Real label of i is _'+str(y_train[i])+'_ and the model predicted _'+str(clf3.predict([X_train[i]])[0])+'_.')
scores3 = cross_val_score(clf3, X_train, y_train, cv=3, n_jobs=4, scoring="accuracy")
print("Scores:", scores3.round(3))
print("Standard deviation:", scores3.std().round(3))
print("Mean score: "+str(scores3.mean().round(3)*100)+'%')
Real label of i is _3_ and the model predicted _3_.
Scores: [0.863 0.875 0.88 ]
Standard deviation: 0.007
Mean score: 87.3%
Not interested in clf3 as its mean score is less than 90%.
from sklearn.pipeline import make_pipeline
clf4 = make_pipeline(StandardScaler(), SGDClassifier(loss='log_loss', penalty="l2", max_iter=2000, tol=1e-3, n_jobs=4, random_state=seed)).fit(X_train, y_train)
print('Real label of i is _'+str(y_train[i])+'_ and the model predicted _'+str(clf4.predict([X_train[i]])[0])+'_.')
scores4 = cross_val_score(clf4, X_train, y_train, cv=3, n_jobs=4, scoring="accuracy")
print("Scores:", scores4.round(3))
print("Standard deviation:", scores4.std().round(3))
print("Mean score: "+str(scores4.mean().round(2)*100)+'%')
Real label of i is _7_ and the model predicted _7_.
Scores: [0.901 0.909 0.909]
Standard deviation: 0.004
Mean score: 91.0%
clf4 is an acceptable model as its mean score is greater than 90%.
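As an optional sanity check (a sketch, not required by the question), the chosen model can also be scored on the held-out test half from the 50/50 split above.
print("Test accuracy of clf4:", round(clf4.score(X_test, y_test), 3))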
You can run the cell below to see how well your model recognizes a digit drawn with the mouse cursor. Set the final_model variable, run the cell, draw on the pop-up canvas, and once you close the canvas you will see the result.
final_model = clf4 # use the name of your final model
#!pip install tk-tools
import io
from tkinter import *
import tkinter as tk
from PIL import Image
import matplotlib as mpl

temp_file_name = "TEMP_image_TEMP.jpg"
app = Tk()
app.geometry("300x300")
canvas = tk.Canvas(app, bg='white')
canvas.pack(anchor='nw', fill='both', expand=1)

def get_x_and_y(event):
    # record the starting point of a stroke
    global lasx, lasy
    lasx, lasy = event.x, event.y

def draw_smth(event):
    # draw a line segment from the last point to the current cursor position
    global lasx, lasy
    canvas.create_line((lasx, lasy, event.x, event.y), fill='red', width=3.5)
    lasx, lasy = event.x, event.y
    # save the current canvas contents so they can be loaded after the window closes
    ps = canvas.postscript(colormode='color')
    img = Image.open(io.BytesIO(ps.encode('utf-8')))
    img.save(temp_file_name)

# bind mouse events: button press records the starting point, dragging draws
canvas.bind("<Button-1>", get_x_and_y)
canvas.bind("<B1-Motion>", draw_smth)
app.mainloop()

img = Image.open(temp_file_name)
# resize image to 28x28 pixels
img = img.resize((28, 28))
# convert rgb to grayscale
img = img.convert("L")
img = np.array(img)
img = 255.0 - img
print("Your input:")
plt.imshow(img, cmap=mpl.cm.binary); plt.axis("off")
plt.show()
# reshaping to support our model input
img = np.reshape(img, 28*28)
# predicting the class
print('Input recognized as ' + str(final_model.predict([img])[0]) + '.')
Your input:
Input recognized as 7.
Question 6 [0/0 marks]¶
The model often makes mistakes, and its accuracy on these hand-drawn digits does not seem to be anywhere near 90%. Can you give reasons to explain why?
Your answer: This was explained in class.
Follow These Instructions¶
Once you are finished, be sure to complete the following steps.
Restart your kernel by clicking ‘Kernel’ > ‘Restart & Run All’.
Fix any errors which result from this.
Repeat steps 1. and 2. until your notebook runs without errors.
Submit your completed notebook to OWL by the deadline.