Fundamentals of Machine Learning for Predictive Data Analytics
Chapter 8: Evaluation (Sections 8.4 and 8.5)
John D. Kelleher, Brian Mac Namee, and Aoife D'Arcy
Designing Evaluation Experiments
- Hold-out Sampling
- k-Fold Cross Validation
- Leave-one-out Cross Validation
- Bootstrapping
- Out-of-time Sampling

Performance Measures: Categorical Targets
- Confusion Matrix-based Performance Measures
- Precision, Recall and F1 Measure
- Average Class Accuracy
- Measuring Profit and Loss

Performance Measures: Prediction Scores
- Receiver Operating Characteristic Curves
- Kolmogorov-Smirnov Statistic
- Measuring Gain and Lift

Performance Measures: Multinomial Targets

Performance Measures: Continuous Targets
- Basic Measures of Error
- Domain Independent Measures of Error

Evaluating Models after Deployment
- Monitoring Changes in Performance Measures
- Monitoring Model Output Distributions
- Monitoring Descriptive Feature Distribution Changes
- Comparative Experiments Using a Control Group
Designing Evaluation Experiments
Hold-out Sampling
Figure: Hold-out sampling can divide the full data into training, validation, and test sets: (a) a 50:20:30 split; (b) a 40:20:40 split.
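As an illustration, here is a minimal sketch of a 50:20:30 hold-out split, assuming scikit-learn is available; the feature matrix `X` and target `y` are synthetic placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))     # placeholder descriptive features
y = rng.integers(0, 2, size=1000)  # placeholder binary target

# Carve off the 30% test set first, then split the remaining 70%
# into a 50% training set and a 20% validation set (of the full data).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.20 / 0.70, random_state=42)

print(len(X_train), len(X_valid), len(X_test))  # 500 200 300
```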
Figure: Using a validation set to avoid overfitting in iterative machine learning algorithms: misclassification rate (0.1 to 0.5) on the training set and the validation set, plotted against training iteration (0 to 200).
k-Fold Cross Validation
Figure: The division of data during the k-fold cross validation process (folds 1 to 10). Black rectangles indicate test data, and white spaces indicate training data.
Table: Confusion matrices and class accuracies for the individual folds of a k-fold cross validation experiment on the 'lateral'/'frontal' X-ray classification task (rows give the target level, columns the predicted level). Two of the per-fold matrices are:

                      Prediction
                  'lateral'   'frontal'
Target 'lateral'      43           9
       'frontal'      10          38

                      Prediction
                  'lateral'   'frontal'
Target 'lateral'      51          10
       'frontal'       8          31
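A minimal sketch of the k-fold process, assuming scikit-learn and synthetic placeholder data (a k-NN classifier stands in for whichever model is under evaluation); swapping `KFold` for scikit-learn's `LeaveOneOut` gives leave-one-out cross validation.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))     # placeholder descriptive features
y = rng.integers(0, 2, size=500)  # placeholder 'lateral'/'frontal' labels

kf = KFold(n_splits=10, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    # Train on k-1 folds, test on the single held-out fold.
    model = KNeighborsClassifier().fit(X[train_idx], y[train_idx])
    cm = confusion_matrix(y[test_idx], model.predict(X[test_idx]))
    print(f"fold {fold}: accuracy = {cm.trace() / cm.sum():.3f}")
```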
Leave-one-out Cross Validation
Figure: The division of data during the leave-one-out cross validation process (folds 1 to k). Black rectangles indicate instances in the test set, and white spaces indicate training data.
Bootstrapping
Figure: The division of data during the ε0 bootstrap process (iterations 1 to k). Black rectangles indicate test data, and white spaces indicate training data.
In each iteration, a random sample of m instances is taken, with replacement, from the dataset to generate the training set; the instances never selected (the out-of-bag instances) form the test set. A performance measure is calculated for this iteration.
This process is repeated for k iterations.
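A minimal sketch of these two steps, assuming NumPy; `evaluate` is a hypothetical callback that trains a model on the training partition and returns a performance measure on the test partition.

```python
import numpy as np

def epsilon0_bootstrap(X, y, evaluate, k=200, seed=0):
    """Run k iterations of the epsilon-0 bootstrap and average the results."""
    rng = np.random.default_rng(seed)
    m = len(X)
    measures = []
    for _ in range(k):
        # Sample m instances with replacement for training; the instances
        # never selected (the out-of-bag instances) form the test set.
        train_idx = rng.integers(0, m, size=m)
        test_idx = np.setdiff1d(np.arange(m), train_idx)
        measures.append(evaluate(X[train_idx], y[train_idx],
                                 X[test_idx], y[test_idx]))
    return float(np.mean(measures))
```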
Out-of-time Sampling
Figure: The out-of-time sampling process, in which the training set is drawn from one period of time and the test set from a later period.
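A minimal sketch, assuming pandas and a hypothetical timestamp column: instances before a cutoff date form the training set, and instances after it form the test set.

```python
import pandas as pd

# Hypothetical dataset with a timestamp on every instance.
df = pd.DataFrame({
    "timestamp": pd.date_range("2023-01-01", periods=120, freq="D"),
    "feature": range(120),
    "target": [i % 2 for i in range(120)],
})

cutoff = pd.Timestamp("2023-04-01")
train = df[df["timestamp"] < cutoff]   # earlier period: training set
test = df[df["timestamp"] >= cutoff]   # later period: test set
```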
Performance Measures: Categorical Targets
Confusion Matrix-based Performance Measures
$$\text{TPR} = \frac{TP}{TP + FN} \quad (1)$$

$$\text{TNR} = \frac{TN}{TN + FP} \quad (2)$$

$$\text{FPR} = \frac{FP}{TN + FP} \quad (3)$$

$$\text{FNR} = \frac{FN}{TP + FN} \quad (4)$$
For example, for a model that produced TP = 6, FN = 3, TN = 9, and FP = 2 on a test set:

$$\text{TPR} = \frac{6}{6 + 3} = 0.667 \qquad \text{TNR} = \frac{9}{9 + 2} = 0.818$$

$$\text{FPR} = \frac{2}{9 + 2} = 0.182 \qquad \text{FNR} = \frac{3}{6 + 3} = 0.333$$
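These measures are simple enough to compute directly; a minimal sketch using the worked example's counts:

```python
# Confusion-matrix measures from Equations (1)-(4), using the counts
# from the worked example above (TP=6, FN=3, TN=9, FP=2).
TP, FN, TN, FP = 6, 3, 9, 2

tpr = TP / (TP + FN)   # 0.667
tnr = TN / (TN + FP)   # 0.818
fpr = FP / (TN + FP)   # 0.182
fnr = FN / (TP + FN)   # 0.333
```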
Precision, Recall and F1 Measure
$$\text{precision} = \frac{TP}{TP + FP} \quad (5)$$

$$\text{recall} = \frac{TP}{TP + FN} \quad (6)$$
For the same example:

$$\text{precision} = \frac{6}{6 + 2} = 0.75 \qquad \text{recall} = \frac{6}{6 + 3} = 0.667$$
$$\text{F}_1\text{-measure} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \quad (7)$$
Continuing the example:

$$\text{F}_1\text{-measure} = 2 \times \frac{0.75 \times 0.667}{0.75 + 0.667} = 0.706$$
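The same counts reproduce the precision, recall, and F1 values above:

```python
# Precision, recall, and F1 from Equations (5)-(7), using the same counts.
TP, FP, FN = 6, 2, 3

precision = TP / (TP + FP)                             # 0.750
recall = TP / (TP + FN)                                # 0.667
f1 = 2 * (precision * recall) / (precision + recall)   # 0.706
```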
Average Class Accuracy
Table: A confusion matrix for a k-NN model trained on a churn prediction problem (classification accuracy: 91%).

                        Prediction
                  'non-churn'   'churn'
Target 'non-churn'     90           0
       'churn'          9           1

Table: A confusion matrix for a naive Bayes model trained on a churn prediction problem (classification accuracy: 78%).

                        Prediction
                  'non-churn'   'churn'
Target 'non-churn'     70          20
       'churn'          2           8
We’ll use average class accuracy instead of classification accuracy to deal with the imbalanced data in the first table:
$$\text{average class accuracy} = \frac{1}{|\textit{levels}(t)|} \sum_{l \in \textit{levels}(t)} \textit{recall}_l \quad (8)$$
Alternative: use the harmonic mean instead of the arithmetic mean:

$$\text{average class accuracy}_{HM} = \frac{1}{\dfrac{1}{|\textit{levels}(t)|} \displaystyle\sum_{l \in \textit{levels}(t)} \dfrac{1}{\textit{recall}_l}} \quad (9)$$
For the k-NN model (recalls 1.0 and 0.1):

$$\text{average class accuracy}_{HM} = \frac{1}{\frac{1}{2}\left(\frac{1}{1.0} + \frac{1}{0.1}\right)} = \frac{1}{5.5} = 18.2\%$$

For the naive Bayes model (recalls 0.778 and 0.800):

$$\text{average class accuracy}_{HM} = \frac{1}{\frac{1}{2}\left(\frac{1}{0.778} + \frac{1}{0.800}\right)} = \frac{1}{1.268} = 78.873\%$$
Figure: Surfaces generated by calculating (a) the arithmetic mean and (b) the harmonic mean of all combinations of features A and B that range from 0 to 100.
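A minimal sketch of both variants, assuming NumPy; the recall values passed in are the per-class recalls from the churn example above.

```python
import numpy as np

def average_class_accuracy(recalls, harmonic=False):
    """Average class accuracy over per-level recalls (Equations (8) and (9))."""
    recalls = np.asarray(recalls, dtype=float)
    if harmonic:
        return float(1.0 / np.mean(1.0 / recalls))  # harmonic mean of recalls
    return float(recalls.mean())                    # arithmetic mean of recalls

print(average_class_accuracy([1.0, 0.1], harmonic=True))      # ~0.182 (k-NN)
print(average_class_accuracy([0.778, 0.800], harmonic=True))  # ~0.789 (naive Bayes)
```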
Measuring Profit and Loss
It is not always correct to treat all outcomes equally. In these cases, it is useful to take the costs of the different outcomes into account when evaluating models.
Table: The structure of a profit matrix.

                    Prediction
                 positive      negative
Target positive  TP_Profit     FN_Profit
       negative  FP_Profit     TN_Profit
Table: The profit matrix for the pay-day loan credit scoring problem.

                  Prediction
               'good'   'bad'
Target 'good'    140     −140
       'bad'    −700        0
Table: (a) The confusion matrix for a k-NN model trained on the pay-day loan credit scoring problem (average class accuracy_HM = 83.824%); (b) the confusion matrix for a decision tree model trained on the pay-day loan credit scoring problem (average class accuracy_HM = 80.761%).

(a) k-NN model

                  Prediction
               'good'   'bad'
Target 'good'    57        3
       'bad'     10       30

(b) decision tree

                  Prediction
               'good'   'bad'
Target 'good'    43       17
       'bad'      3       37
Table: (a) Overall profit for the k-NN model using the profit matrix in Table 4 and the confusion matrix in Table 5(a); (b) overall profit for the decision tree model using the profit matrix in Table 4 and the confusion matrix in Table 5(b).

(a) k-NN model

                  Prediction
               'good'    'bad'
Target 'good'    7980     −420
       'bad'    −7000        0
Profit:           560

(b) decision tree

                  Prediction
               'good'    'bad'
Target 'good'    6020    −2380
       'bad'    −2100        0
Profit:          1540

Even though the k-NN model achieved the higher average class accuracy, the decision tree model earns almost three times as much profit.
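Since overall profit is just the element-wise product of the profit matrix and a confusion matrix, summed, the comparison above can be reproduced in a few lines (a sketch assuming NumPy, with the matrices transcribed from the tables):

```python
import numpy as np

# Rows are targets ('good', 'bad'); columns are predictions ('good', 'bad').
profit_matrix = np.array([[ 140, -140],
                          [-700,    0]])
knn_cm = np.array([[57,  3],
                   [10, 30]])
tree_cm = np.array([[43, 17],
                    [ 3, 37]])

print((knn_cm * profit_matrix).sum())   # 560
print((tree_cm * profit_matrix).sum())  # 1540
```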
Performance Measures: Prediction Scores
All of the classification models we have seen return a prediction score, which is then converted into a target level by thresholding:
$$\textit{threshold}(\textit{score}, 0.5) = \begin{cases} \text{positive} & \text{if } \textit{score} \geq 0.5 \\ \text{negative} & \text{otherwise} \end{cases}$$
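In code, the thresholding step is a one-liner; a minimal sketch (the function name and default threshold are illustrative):

```python
def threshold(score, t=0.5):
    """Convert a prediction score into a target level."""
    return "positive" if score >= t else "negative"

print(threshold(0.657))        # positive
print(threshold(0.657, 0.75))  # negative
```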
Table: A sample test set with model predictions and scores (threshold = 0.5).

ID   Target   Prediction   Score   Outcome
7    ham      ham          0.001   TN
11   ham      ham          0.003   TN
15   ham      ham          0.059   TN
13   ham      ham          0.064   TN
19   ham      ham          0.094   TN
12   spam     ham          0.160   FN
2    spam     ham          0.184   FN
3    ham      ham          0.226   TN
16   ham      ham          0.246   TN
1    spam     ham          0.293   FN
5    ham      ham          0.302   TN
14   ham      ham          0.348   TN
17   ham      spam         0.657   FP
8    spam     spam         0.676   TP
6    spam     spam         0.719   TP
10   spam     spam         0.781   TP
18   spam     spam         0.833   TP
20   ham      spam         0.877   FP
9    spam     spam         0.960   TP
4    spam     spam         0.963   TP
We have ordered the instances by score so that the threshold is apparent in the predictions. Note that instances that should be predicted 'ham' generally have low scores, and those that should be predicted 'spam' generally have high scores.
A number of performance measures assess how well a model performs by using this ability of a model to rank instances that should get predictions of one target level higher than those of the other. The basis of most of these approaches is measuring how well separated the distributions of scores produced by the model for the different target levels are.
Figure: Prediction score distributions for two different prediction models. The distributions in (a) are much better separated than those in (b).
Figure: Prediction score distributions for the (a) 'spam' and (b) 'ham' target levels based on the data in Table 7.
Receiver Operating Characteristic Curves
The receiver operating characteristic index (ROC index), which is based on the receiver operating characteristic curve (ROC curve), is a widely used performance measure that is calculated using prediction scores.
TPR and TNR are intrinsically tied to the threshold used to convert prediction scores into target levels.
This threshold can be changed, however, which leads to different predictions and a different confusion matrix.
Table: Confusion matrices for the set of predictions shown in Table 7 using (a) a prediction score threshold of 0.75 and (b) a prediction score threshold of 0.25.

(a) Threshold: 0.75

                 Prediction
              'spam'   'ham'
Target 'spam'    4        5
       'ham'     1       10

(b) Threshold: 0.25

                 Prediction
              'spam'   'ham'
Target 'spam'    7        2
       'ham'     4        7
Table: The predictions made for the test data in Table 7 using thresholds of 0.10, 0.25, 0.50, 0.75, and 0.90, together with the resulting performance measures.

ID   Target   Score   Pred.(0.10)   Pred.(0.25)   Pred.(0.50)   Pred.(0.75)   Pred.(0.90)
7    ham      0.001   ham           ham           ham           ham           ham
11   ham      0.003   ham           ham           ham           ham           ham
15   ham      0.059   ham           ham           ham           ham           ham
13   ham      0.064   ham           ham           ham           ham           ham
19   ham      0.094   ham           ham           ham           ham           ham
12   spam     0.160   spam          ham           ham           ham           ham
2    spam     0.184   spam          ham           ham           ham           ham
3    ham      0.226   spam          ham           ham           ham           ham
16   ham      0.246   spam          ham           ham           ham           ham
1    spam     0.293   spam          spam          ham           ham           ham
5    ham      0.302   spam          spam          ham           ham           ham
14   ham      0.348   spam          spam          ham           ham           ham
17   ham      0.657   spam          spam          spam          ham           ham
8    spam     0.676   spam          spam          spam          ham           ham
6    spam     0.719   spam          spam          spam          ham           ham
10   spam     0.781   spam          spam          spam          spam          ham
18   spam     0.833   spam          spam          spam          spam          ham
20   ham      0.877   spam          spam          spam          spam          ham
9    spam     0.960   spam          spam          spam          spam          spam
4    spam     0.963   spam          spam          spam          spam          spam

Misclassification Rate        0.300   0.300   0.250   0.300   0.350
True Positive Rate (TPR)      1.000   0.778   0.667   0.444   0.222
True Negative Rate (TNR)      0.455   0.636   0.818   0.909   1.000
False Positive Rate (FPR)     0.545   0.364   0.182   0.091   0.000
False Negative Rate (FNR)     0.000   0.222   0.333   0.556   0.778
Note: as the threshold increases, TPR decreases and TNR increases (and vice versa).
Capturing this tradeoff is the basis of the ROC curve.
Figure: (a) The changing values of TPR and TNR for the test data shown in Table 36 as the threshold is altered; (b) points in ROC space (true positive rate against false positive rate) for thresholds of 0.25, 0.5, and 0.75.
Figure: (a) A complete ROC curve for the email classification example; (b) a selection of ROC curves for different models trained on the same prediction task: Model 1 (0.996), Model 2 (0.887), Model 3 (0.764), and Model 4 (0.595), where the value in parentheses is each model's ROC index.
We can also calculate a single performance measure from an ROC curve: the ROC index measures the area underneath an ROC curve.

$$\text{ROC index} = \sum_{i=2}^{|T|} \frac{\big(\textit{FPR}(T[i]) - \textit{FPR}(T[i-1])\big) \times \big(\textit{TPR}(T[i]) + \textit{TPR}(T[i-1])\big)}{2} \quad (11)$$

where T is the set of thresholds tested, ordered so that FPR increases with i.
The Gini coefficient is a linear rescaling of the ROC index:

$$\text{Gini coefficient} = (2 \times \text{ROC index}) - 1 \quad (12)$$

The Gini coefficient takes values in the range [0, 1]; the higher the value, the better the model's performance.
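A minimal sketch of both measures, assuming NumPy; `roc_index` sweeps the observed scores as thresholds and applies the trapezoidal rule of Equation (11). (For binary labels, scikit-learn's `roc_auc_score` computes the same area directly.)

```python
import numpy as np

def roc_index(labels, scores, positive="spam"):
    """Trapezoidal area under the ROC curve, per Equation (11)."""
    scores = np.asarray(scores, dtype=float)
    pos = np.asarray(labels) == positive
    # Sweep the threshold from above the highest score down to the lowest,
    # collecting one (FPR, TPR) point per threshold.
    thresholds = np.concatenate(([np.inf], np.sort(scores)[::-1]))
    tpr = np.array([((scores >= t) & pos).sum() / pos.sum()
                    for t in thresholds])
    fpr = np.array([((scores >= t) & ~pos).sum() / (~pos).sum()
                    for t in thresholds])
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

def gini(roc_idx):
    """Gini coefficient, per Equation (12)."""
    return 2 * roc_idx - 1
```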
Kolmogorov-Smirnov Statistic
The Kolmogorov-Smirnov statistic (K-S statistic) is another performance measure that captures the separation between the distribution of prediction scores for the different target levels in a classification problem.
To calculate the K-S statistic, we first determine the cumulative probability distributions of the prediction scores for the positive and negative target levels:

$$CP(\text{positive}, ps) = \frac{\text{num positive test instances with score} \leq ps}{\text{num positive test instances}} \quad (13)$$

$$CP(\text{negative}, ps) = \frac{\text{num negative test instances with score} \leq ps}{\text{num negative test instances}} \quad (14)$$
Figure: The K-S chart for the email classification predictions shown in Table 7, plotting the cumulative probabilities CP(spam, ps) and CP(ham, ps) against score.
The K-S statistic is calculated by determining the maximum difference between the cumulative probability distributions for the positive and negative target levels.
$$K\text{-}S = \max_{ps} \big(CP(\text{positive}, ps) - CP(\text{negative}, ps)\big) \quad (15)$$
Table: The workings required to generate the K-S chart for the predictions in Table 7. The maximum distance, marked *, is the K-S statistic (K-S = 0.576).

ID   Score   'spam' Cum. Count   'ham' Cum. Count   CP(spam, ps)   CP(ham, ps)   Distance
7    0.001   0                   1                  0.000          0.091         0.091
11   0.003   0                   2                  0.000          0.182         0.182
15   0.059   0                   3                  0.000          0.273         0.273
13   0.064   0                   4                  0.000          0.364         0.364
19   0.094   0                   5                  0.000          0.455         0.455
12   0.160   1                   5                  0.111          0.455         0.343
2    0.184   2                   5                  0.222          0.455         0.232
3    0.226   2                   6                  0.222          0.545         0.323
16   0.246   2                   7                  0.222          0.636         0.414
1    0.293   3                   7                  0.333          0.636         0.303
5    0.302   3                   8                  0.333          0.727         0.394
14   0.348   3                   9                  0.333          0.818         0.485
17   0.657   3                   10                 0.333          0.909         0.576 *
8    0.676   4                   10                 0.444          0.909         0.465
6    0.719   5                   10                 0.556          0.909         0.354
10   0.781   6                   10                 0.667          0.909         0.242
18   0.833   7                   10                 0.778          0.909         0.131
20   0.877   7                   11                 0.778          1.000         0.222
9    0.960   8                   11                 0.889          1.000         0.111
4    0.963   9                   11                 1.000          1.000         0.000
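A minimal sketch of the K-S calculation, assuming NumPy; it scans the observed scores as cut-points and takes the largest absolute gap between the two cumulative distributions (so the orientation of the two distributions does not matter). Applied to the Table 7 scores and labels, it reproduces the starred maximum of 0.576.

```python
import numpy as np

def ks_statistic(labels, scores, positive="spam"):
    """Maximum distance between the cumulative score distributions of the
    positive and negative target levels (Equation (15))."""
    scores = np.asarray(scores, dtype=float)
    pos = np.asarray(labels) == positive
    distances = []
    for ps in np.sort(scores):
        cp_pos = (scores[pos] <= ps).mean()   # CP(positive, ps)
        cp_neg = (scores[~pos] <= ps).mean()  # CP(negative, ps)
        distances.append(abs(cp_pos - cp_neg))
    return float(max(distances))
```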
Figure: Prediction score distributions and K-S charts for four different models (Model 1 to Model 4) on the same large email classification test set used to generate the ROC curves shown earlier.
Measuring Gain and Lift
If we rank the instances in the test data in descending order of prediction score, we would expect the majority of the positive instances to be toward the top of this ranking. Gain and lift measure the extent to which a set of predictions made by a model meets this expectation (a sketch follows the decile table below).
$$\textit{gain}(dec) = \frac{\text{num positive test instances in decile } dec}{\text{num positive test instances}} \quad (16)$$
Table: The test set with model predictions and scores from Table 7 extended to include deciles.
Decile   ID   Target   Prediction   Score   Outcome
1st       4   spam     spam         0.963   TP
1st       9   spam     spam         0.960   TP
2nd      20   ham      spam         0.877   FP
2nd      18   spam     spam         0.833   TP
3rd      10   spam     spam         0.781   TP
3rd       6   spam     spam         0.719   TP
4th       8   spam     spam         0.676   TP
4th      17   ham      spam         0.657   FP
5th      14   ham      ham          0.348   TN
5th       5   ham      ham          0.302   TN
6th       1   spam     ham          0.293   FN
6th      16   ham      ham          0.246   TN
7th       3   ham      ham          0.226   TN
7th       2   spam     ham          0.184   FN
8th      12   spam     ham          0.160   FN
8th      19   ham      ham          0.094   TN
9th      13   ham      ham          0.064   TN
9th      15   ham      ham          0.059   TN
10th     11   ham      ham          0.003   TN
10th      7   ham      ham          0.001   TN
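A minimal sketch of the gain calculation in Equation (16), assuming NumPy; `decile_gains` is a hypothetical helper that ranks a test set by descending score, splits it into deciles, and returns the fraction of all positive instances landing in each.

```python
import numpy as np

def decile_gains(labels, scores, positive="spam", n_deciles=10):
    """gain(dec) for each decile of the descending score ranking."""
    order = np.argsort(scores)[::-1]                 # rank by descending score
    pos = (np.asarray(labels) == positive)[order]
    deciles = np.array_split(pos, n_deciles)         # top decile first
    return [float(d.sum() / pos.sum()) for d in deciles]
```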
Table: Tabulating the workings required to calculate gain and lift for each decile.