Live Coding Wk4 – Lecture 11 – Linear Classification¶
For this demo we will explore classification using linear classifiers. We continue in our role as an intern of Environment and Society, carrying on with the task of classifying flower species. Specifically, we will focus on different metrics for classification.
### Imports and data you will need
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
# Something which should be a bit familiar
data = pd.read_csv('data/IRIS.csv')
data.head()
Problem: More methods for classification¶
You present the K Nearest Neighbor results we explored in the last demo. Unfortunately, your data has experienced some random noise! As a result you will have to retrain some models.
# The magic data science fairy adding noise
# (np.random.rand() is uniform on [0, 1), so each value shifts by at most 0.25)
np.random.seed(0)
data['sepal_length'] = data['sepal_length'].apply(lambda x: max(0, x + (np.random.rand() - 0.5) / 2))
data['sepal_width'] = data['sepal_width'].apply(lambda x: max(0, x + (np.random.rand() - 0.5) / 2))
data['petal_length'] = data['petal_length'].apply(lambda x: max(0, x + (np.random.rand() - 0.5) / 2))
data['petal_width'] = data['petal_width'].apply(lambda x: max(0, x + (np.random.rand() - 0.5) / 2))
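As a quick check on the size of this perturbation, here is a minimal sketch. It assumes a local RandomState (the name rng is ours, not part of the demo) so the global seed used above is left untouched.

```python
import numpy as np

# Local generator: an assumption for this sketch, so np.random.seed(0) above is unaffected
rng = np.random.RandomState(0)

# Same transformation as in the demo: rand() is uniform on [0, 1)
noise = (rng.rand(10_000) - 0.5) / 2

# Every draw lies in [-0.25, 0.25), and the mean is close to zero
print(noise.min(), noise.max(), noise.mean())
```

So each measurement moves by at most a quarter of a unit in either direction, which is small for petal length but noticeable for petal width.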
Let's see how KNN and linear classification compare in this noisier setting. For linear classification we will use LogisticRegression from sklearn.
Last time¶
Here is some of the setup we did last time:
# Should have been done last time!
def species_to_index(species_str):
    if species_str == 'Iris-setosa':
        return 0
    elif species_str == 'Iris-versicolor':
        return 1
    return 2

def index_to_species(species_index):
    if species_index == 0:
        return 'Iris-setosa'
    elif species_index == 1:
        return 'Iris-versicolor'
    return 'Iris-virginica'
# Columns we want to get values from
info_columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species_index']
data[‘species_index’] = data.species.apply(species_to_index)
iris_array = data[info_columns].values
train_data, test_data = train_test_split(iris_array, train_size=0.6, random_state=1)
# Training Data
train_input = train_data[:, :-1]
train_output = train_data[:, -1]
# Testing Data
test_input = test_data[:, :-1]
test_output = test_data[:, -1]
Remember how we defined the K Nearest Neighbor classifier last time? (n_neighbors=3)
# Define our model
knn = None # TODO
knn.fit(train_input, train_output)
test_pred_knn = knn.predict(test_input)
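One possible way to fill in the TODO is sketched below. To stay self-contained it assumes sklearn's bundled iris data (load_iris) as a stand-in for data/IRIS.csv and skips the noise step, so the names X_train, knn_sketch and so on are ours and the predictions will not match the demo exactly.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the bundled iris data stands in for data/IRIS.csv (no noise added)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=1)

# Each test point is classified by a majority vote of its 3 nearest training points
knn_sketch = KNeighborsClassifier(n_neighbors=3)
knn_sketch.fit(X_train, y_train)
pred_knn_sketch = knn_sketch.predict(X_test)

print(pred_knn_sketch[:10])
```

In the notebook itself, the same constructor call would replace None on the knn line, using train_input and train_output.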
Logistic Regression¶
Alright, now let's look at the LogisticRegression class from sklearn.
?LogisticRegression
Now it's your turn to define, train, and predict with logistic regression.
# For you to fill!
logistic = None # TODO
# Train here as well!
test_pred_logistic = None # TODO
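A minimal sketch of the define, train, predict pattern follows. It again assumes sklearn's bundled iris data in place of data/IRIS.csv, and max_iter=1000 is our choice to give the solver room to converge, not a setting taken from the demo.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumption: the bundled iris data stands in for data/IRIS.csv (no noise added)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=1)

# Define, then train, then predict: the same three steps as for KNN
logistic_sketch = LogisticRegression(max_iter=1000)
logistic_sketch.fit(X_train, y_train)
pred_logistic_sketch = logistic_sketch.predict(X_test)

# One weight vector per class, one weight per input feature
print(logistic_sketch.coef_.shape)
print(logistic_sketch.score(X_test, y_test))
```

Unlike KNN, the fitted model is summarised by a small weight matrix, so prediction cost does not grow with the size of the training set.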
Here is a comparison of the different classification methods so far:
# Simple plotting
plt.figure(figsize=[14,6])
colours = ['r', 'b', 'g']
correctly_pred_knn = np.equal(test_pred_knn, test_output)
correctly_pred_logistic = np.equal(test_pred_logistic, test_output)
ax1 = plt.subplot(121)
for i in range(3):
    s_indices_correct = (test_output == i) & correctly_pred_knn
    plt.scatter(test_input[s_indices_correct, 0], test_input[s_indices_correct, 1],
                c=colours[i], alpha=0.5)
    s_indices_incorrect = (test_output == i) & (np.logical_not(correctly_pred_knn))
    plt.scatter(test_input[s_indices_incorrect, 0], test_input[s_indices_incorrect, 1],
                marker='X', c=colours[i], alpha=0.5)
plt.title("KNN Classification of Sepal Dimensions")
plt.xlabel("sepal_length")
plt.ylabel("sepal_width")
ax2 = plt.subplot(122)
for i in range(3):
    s_indices_correct = (test_output == i) & correctly_pred_logistic
    plt.scatter(test_input[s_indices_correct, 0], test_input[s_indices_correct, 1],
                c=colours[i], alpha=0.5, label=index_to_species(i))
    s_indices_incorrect = (test_output == i) & (np.logical_not(correctly_pred_logistic))
    plt.scatter(test_input[s_indices_incorrect, 0], test_input[s_indices_incorrect, 1],
                marker='X', c=colours[i], alpha=0.5)
plt.title("Logistic Classification of Sepal Dimensions")
plt.xlabel("sepal_length")
plt.ylabel("sepal_width")
plt.legend()
plt.show()
Discussion: Hmmm, it doesn't seem like much has changed between the two methods. Can you recall the pros and cons of these two methods? What would happen if the dataset's size increased?
Discuss Here!
Measuring success¶
In the lecture, a number of metrics were introduced for classification. Luckily, sklearn has many of these already defined in sklearn.metrics.
Take a look at sklearn.metrics.precision_recall_fscore_support; it provides many of the metrics we have been introduced to!
Use the function to compute the precision, recall, and F1 score for both classifiers we have evaluated above.
import sklearn.metrics as skm # Shorthand: refer to sklearn.metrics as skm
# For you to fill!
knn_precision, knn_recall, knn_f1score = None # TODO
logistic_precision, logistic_recall, logistic_f1score = None # TODO
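To see what the function returns before applying it to the classifiers, here is a small sketch on hand-made labels. The arrays toy_true and toy_pred are invented for illustration and are not results from the demo. Note that the function returns four arrays (precision, recall, F-score, support), so the fourth has to be unpacked or discarded.

```python
import numpy as np
import sklearn.metrics as skm

# Hand-made labels for illustration only (3 classes, 10 samples)
toy_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
toy_pred = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 1])

# With the default average=None, each returned array has one entry per class,
# ordered by sorted class label (0, 1, 2)
precision, recall, f1score, support = skm.precision_recall_fscore_support(
    toy_true, toy_pred)

print(precision)  # class 1: 2 of the 3 samples predicted as 1 are truly 1
print(recall)     # class 2: 3 of the 4 true class-2 samples are recovered
print(f1score)
print(support)    # number of true samples per class
```

In the notebook, the same call with test_output and test_pred_knn (or test_pred_logistic) gives per-species arrays where index i corresponds to index_to_species(i).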
Discussion: Try looking at the different metric values the function has given us. Why do the values have the dimensionality they do? How do you interpret the scores? (i.e., which species do the metric values correspond to?)
Discuss Here!
Here is a dataframe to summarise your results.
knn_results = [{'model': 'knn', 'species': index_to_species(i), 'precision': knn_precision[i], 'recall': knn_recall[i], 'f1score': knn_f1score[i]} for i in range(3)]
logistic_results = [{'model': 'logistic', 'species': index_to_species(i), 'precision': logistic_precision[i], 'recall': logistic_recall[i], 'f1score': logistic_f1score[i]} for i in range(3)]
results_df = pd.DataFrame(knn_results + logistic_results)
results_df
Let's do some simple comparisons.
Discussion: Which classifier performs better for each of the species?
## Code to have a better display
## For you to fill!
Discuss Here!
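One way to get a side-by-side display is to pivot the long-format table so that each model becomes a column. The sketch below uses placeholder scores that are purely illustrative (they are not results from this demo), and toy_results and comparison are names we introduce here.

```python
import pandas as pd

# Placeholder scores for illustration only; substitute results_df in the notebook
toy_results = pd.DataFrame([
    {'model': 'knn', 'species': 'Iris-setosa', 'f1score': 1.00},
    {'model': 'knn', 'species': 'Iris-versicolor', 'f1score': 0.80},
    {'model': 'knn', 'species': 'Iris-virginica', 'f1score': 0.85},
    {'model': 'logistic', 'species': 'Iris-setosa', 'f1score': 0.95},
    {'model': 'logistic', 'species': 'Iris-versicolor', 'f1score': 0.90},
    {'model': 'logistic', 'species': 'Iris-virginica', 'f1score': 0.80},
])

# One row per species, one column per model
comparison = toy_results.pivot(index='species', columns='model', values='f1score')

# Name of the column (model) with the highest score in each row
comparison['best'] = comparison.idxmax(axis=1)
print(comparison)
```

The same pivot can be repeated with values='precision' or values='recall' to compare the models on each metric in turn.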
Hopefully less confusing¶
Your boss wants to see exactly how many cases in the test dataset we are misclassifying. It is suggested you look at sklearn.metrics.confusion_matrix.
Call this function on the predictions of each classifier to see what is happening.
knn_confusion_matrix = None # TODO
logistic_confusion_matrix = None # TODO
print("KNN:")
print(knn_confusion_matrix)
print("Logistic:")
print(logistic_confusion_matrix)
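To help with reading the output, here is a sketch of confusion_matrix on hand-made labels. The arrays toy_true and toy_pred are invented for illustration and are not results from the demo.

```python
import numpy as np
import sklearn.metrics as skm

# Hand-made labels for illustration only (3 classes, 10 samples)
toy_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
toy_pred = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 1])

# Rows are the true classes, columns are the predicted classes
cm = skm.confusion_matrix(toy_true, toy_pred)
print(cm)

# The diagonal holds the correct predictions, so the per-class metrics
# can be recovered from row sums (recall) and column sums (precision)
recall_from_cm = np.diag(cm) / cm.sum(axis=1)
precision_from_cm = np.diag(cm) / cm.sum(axis=0)
print(recall_from_cm)
print(precision_from_cm)
```

Off-diagonal entries show which pairs of classes get confused: here one true class-1 sample is predicted as class 2, and one true class-2 sample is predicted as class 1.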
Discussion: Does this result align with the metrics we looked at above? Does this align with our intuition of the data?
Discuss Here!