
Classifier Calibration
Many classifiers, including random forest classifiers, can return prediction probabilities: given the data X and a candidate class c, the prediction probability estimates P(Y=c|X), the probability that the data point belongs to class c. However, when the classes in the training data are unbalanced, as in this wine example, the prediction probabilities calculated by a classifier can be inaccurate, because many classifiers, again including random forests, have no way to internally adjust for the imbalance.
Despite the inaccuracy caused by imbalance, the prediction probabilities returned by a classifier can still be used to construct good predictions if we choose the right way to turn a prediction probability into a prediction of the class the data point belongs to. We call this task calibration.
If a classifier's prediction probabilities are accurate, the appropriate way to convert them into predictions is simply to choose the class with probability > 0.5; this is the default behavior of classifiers when we call their predict method. When the probabilities are inaccurate, this does not work well, but we can still get good predictions by choosing a more appropriate cutoff. In this question, we will choose a cutoff by cross-validation.
(a) Fit a random forest classifier to the wine data using 15 trees. Compute the predicted probabilities that the classifier assigns to each of the training examples (hint: use the predict_proba method of the classifier after fitting). As a sanity test, construct a prediction from these probabilities that labels a wine 1 if its predicted probability of being in class 1 is > 0.5, and 0 otherwise. For example, if probabilities = [0.1, 0.4, 0.5, 0.6, 0.7], the predictions should be [0, 0, 0, 1, 1]. Compare this to the output of the classifier's predict method, and show that they are the same.
In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import sklearn
import scipy as sp
import sklearn.model_selection
In [3]:
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', sep=';')
df.head()
Out[3]:

   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality
0            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5
1            7.8              0.88         0.00             2.6      0.098                 25.0                  67.0   0.9968  3.20       0.68      9.8        5
2            7.8              0.76         0.04             2.3      0.092                 15.0                  54.0   0.9970  3.26       0.65      9.8        5
3           11.2              0.28         0.56             1.9      0.075                 17.0                  60.0   0.9980  3.16       0.58      9.8        6
4            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5
In [ ]:

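A hedged sketch of part (a). The excerpt above never shows how the binary labels are constructed, so the block below assumes a common binarization for this dataset, treating wines with quality >= 7 as class 1 (a choice that yields unbalanced classes); adjust the labeling if your assignment defines it differently.

from sklearn.ensemble import RandomForestClassifier

# Assumed binarization (not defined in the excerpt above): class 1 = high quality.
X = df.drop(columns='quality').values
y = (df['quality'] >= 7).astype(int).values

# Fit a 15-tree random forest on the full training data.
clf = RandomForestClassifier(n_estimators=15)
clf.fit(X, y)

# predict_proba returns one column per class; column 1 holds P(Y=1|X).
probabilities = clf.predict_proba(X)[:, 1]

# Sanity test: thresholding at 0.5 should reproduce clf.predict exactly.
manual_predictions = (probabilities > 0.5).astype(int)
print(np.all(manual_predictions == clf.predict(X)))  # expect True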
(b) Write a function cutoff_predict that takes a trained classifier, a data matrix X, and a cutoff, and generates predictions based on the classifier’s predicted probability and the cutoff value, as you did in the previous question.
In [ ]:

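A minimal sketch of cutoff_predict, assuming a binary classifier whose predict_proba puts P(Y=1|X) in column 1:

def cutoff_predict(clf, X, cutoff):
    """Predict 1 where the classifier's probability of class 1 exceeds
    the cutoff, and 0 otherwise."""
    return (clf.predict_proba(X)[:, 1] > cutoff).astype(int)

With cutoff=0.5 this reproduces the sanity test from part (a); smaller cutoffs label more wines as class 1.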
(c) Using 10-fold cross-validation, find the cutoff in np.arange(0.1, 0.9, 0.1) that gives the best average F1 score when converting prediction probabilities from a 15-tree random forest classifier into predictions.
To help you with this task, we have provided a function custom_f1 that takes a cutoff value and returns a function suitable for use as the scoring argument to cross_val_score. This function uses the cutoff_predict function that you defined in the previous question.
Using a boxplot, compare the F1 scores that correspond to each candidate cutoff value.
In [ ]:

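A sketch of the cross-validation loop, reusing X, y, and cutoff_predict from the previous parts. The custom_f1 helper below is a plausible reconstruction of the provided function, whose actual code is not shown in this excerpt; it returns a callable with the (estimator, X, y) signature that cross_val_score accepts as its scoring argument.

import sklearn.metrics

def custom_f1(cutoff):
    # Reconstruction of the provided helper: wrap cutoff_predict in a
    # scorer usable as the scoring argument of cross_val_score.
    def f1_cutoff(clf, X, y):
        ypred = cutoff_predict(clf, X, cutoff)
        return sklearn.metrics.f1_score(y, ypred)
    return f1_cutoff

cutoffs = np.arange(0.1, 0.9, 0.1)
scores = []
for cutoff in cutoffs:
    clf = RandomForestClassifier(n_estimators=15)
    scores.append(sklearn.model_selection.cross_val_score(
        clf, X, y, cv=10, scoring=custom_f1(cutoff)))

# One box of ten F1 scores per candidate cutoff.
plt.boxplot(scores)
plt.xticks(range(1, len(cutoffs) + 1), ['%.1f' % c for c in cutoffs])
plt.xlabel('cutoff')
plt.ylabel('F1 score (10-fold CV)')
plt.show()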
(d) According to this analysis, which cutoff value gives the best predictive results? Explain why this answer makes sense in light of the unbalanced classes in the training data.
In [ ]: