CS699
Lecture 6 Performance Evaluation
Model Evaluation and Selection
Evaluation metrics: How can we measure accuracy? Other metrics to consider?
Use an independent test dataset instead of training dataset when assessing accuracy
Methods for estimating a classifier’s accuracy: Holdout method, random subsampling
Cross‐validation
Bootstrap
Comparing classifiers:
Confidence intervals
ROC curves
Model Evaluation and Selection
Model testing: a dataset with known class labels is given to the MODEL, which predicts (or classifies) the class of each tuple.
Assume: Y is positive and N is negative (which class is treated as positive is application dependent).

Id   Actual Class   Predicted Class   Result
1    Y              Y                 TP
2    Y              N                 FN
3    N              N                 TN
4    Y              N                 FN
5    N              N                 TN
6    N              N                 TN
7    N              Y                 FP
8    Y              Y                 TP
9    N              N                 TN
10   N              N                 TN

(Attribute values A1–A4 of the 10 tuples are omitted.)
7 out of 10 tuples were correctly classified: accuracy = 70%
Classifier Evaluation Metrics: Confusion Matrix
Confusion Matrix:
Actual class\Predicted class   Positive               Negative
Positive                       True Positives (TP)    False Negatives (FN)
Negative                       False Positives (FP)   True Negatives (TN)
Given m classes, an entry, CMi,j in a confusion matrix indicates # of tuples in class i that were labeled by the classifier as class j
May have extra rows/columns to provide totals
Confusion matrix of the example in the previous slide
Actual class\Predicted class   Y   N   Total
Y                              2   2   4
N                              1   5   6
Total                          3   7   10
Classifier Evaluation Metrics: Confusion Matrix
Another Example:

Actual class\Predicted class   buy_computer = yes   buy_computer = no   Total
buy_computer = yes             6954                 46                  7000
buy_computer = no              412                  2588                3000
Total                          7366                 2634                10000
Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity
A\P     C    ¬C   Total
C       TP   FN   P
¬C      FP   TN   N
Total   P'   N'   All

C: buys_computer = yes (P)   ¬C: buys_computer = no (N)
Classifier Accuracy, or recognition rate: percentage of test set tuples that are correctly classified
Accuracy = (TP + TN)/All
Error rate: 1 – accuracy, or Error rate = (FP + FN)/All
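As a quick check of these formulas on the buy_computer confusion matrix above (a small Python sketch; the counts come from that matrix):

```python
# Accuracy and error rate for the buy_computer confusion matrix above
TP, FN = 6954, 46    # actual buy_computer = yes
FP, TN = 412, 2588   # actual buy_computer = no
ALL = TP + FN + FP + TN          # 10000

accuracy = (TP + TN) / ALL       # (6954 + 2588) / 10000 = 0.9542
error_rate = (FP + FN) / ALL     # (412 + 46) / 10000 = 0.0458
print(accuracy, error_rate)
```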
Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity
Class Imbalance Problem:
One class may be rare, e.g. fraud, or HIV‐positive
Significant majority of the negative class and minority of the positive class
Example: 9900 N’s (no cancer) and 100 P’s (cancer)
A model correctly classifies all N’s and 20 P’s
Model accuracy = 9920 / 10000 = 99.2% => very high
But 80% of the cancer patients were misdiagnosed as non-cancer patients
Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity
If it is important to correctly diagnose cancer patients, we need a different measure.
In this example, we need TP/P.
Sensitivity: True Positive recognition rate
Sensitivity = TP/P
Specificity: True Negative recognition rate
Specificity = TN/N
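Applying these definitions to the class-imbalance example above (a minimal Python sketch; the counts follow the 9,900/100 cancer example):

```python
# Cancer screening example from the earlier slide
P, N = 100, 9900          # actual positives (cancer), actual negatives (no cancer)
TP, TN = 20, 9900         # model gets 20 positives and all negatives right
FN, FP = P - TP, N - TN   # 80, 0

accuracy    = (TP + TN) / (P + N)   # 9920 / 10000 = 0.992
sensitivity = TP / P                # 20 / 100 = 0.20  -> poor at finding cancer
specificity = TN / N                # 9900 / 9900 = 1.0
print(accuracy, sensitivity, specificity)
```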
Classifier Evaluation Metrics: Precision and Recall, and F-measures
Precision: exactness – what % of tuples that the classifier labeled as positive are actually positive
Recall: completeness – what % of actual positive tuples did the classifier label as positive?
Recall is the same as sensitivity
Perfect score is 1.0 (for both)
Classifier Evaluation Metrics: Precision and Recall, and F-measures
Combine precision and recall into a single measure
F measure (F1 or F-score): harmonic mean of precision and recall
Fß: weighted measure of precision and recall
ß is a parameter
assigns ß times as much weight to recall as to precision
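For reference, the standard definitions written in terms of the confusion-matrix counts (they are consistent with the worked example on the next slide):

$$\text{precision} = \frac{TP}{TP + FP} = \frac{TP}{P'}, \qquad \text{recall} = \frac{TP}{TP + FN} = \frac{TP}{P}$$

$$F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}, \qquad F_\beta = \frac{(1+\beta^2) \times \text{precision} \times \text{recall}}{\beta^2 \times \text{precision} + \text{recall}}$$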
Classifier Evaluation Metrics: Example
Actual Class\Predicted class   cancer = yes   cancer = no   Total   Recognition (%)
cancer = yes                   90             210           300     30.00 (sensitivity)
cancer = no                    140            9560          9700    98.56 (specificity)
Total                          230            9770          10000   96.50 (accuracy)

Precision = 90/230 = 39.13%
Recall = 90/300 = 30.00%
F (or F1 or F-score) = (2 * 0.39 * 0.3) / (0.39 + 0.3) = 0.339
F2 = [(1 + 2^2) * 0.39 * 0.3] / (2^2 * 0.39 + 0.3) = 0.315
Classifier Evaluation Metrics: Example
(Same cancer confusion matrix as on the previous slide.)
• MCC (Matthews Correlation Coefficient): more reliable for unbalanced binary-class datasets.
• Calculate MCC for the above confusion matrix.
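One way to carry out this calculation (a Python sketch using the standard MCC formula; the counts are taken from the matrix above):

```python
from math import sqrt

# Cancer confusion matrix from the slide above
TP, FN = 90, 210
FP, TN = 140, 9560

# Matthews Correlation Coefficient (ranges from -1 to +1)
mcc = (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
print(round(mcc, 3))   # ~0.325, far less flattering than the 96.5% accuracy
```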
Evaluating Classifier Accuracy: Holdout Method
Holdout method
Given data is randomly partitioned into two independent sets
Training set (e.g., 2/3) for model construction
Test set (e.g., 1/3) for accuracy estimation
Random subsampling: a variation of holdout
Repeat holdout k times, accuracy = avg. of the k accuracies obtained
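A rough holdout/random-subsampling sketch in Python; scikit-learn, the breast-cancer dataset, and the decision tree are placeholder choices, not part of the lecture:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)   # placeholder dataset

# Holdout: 2/3 training, 1/3 test
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print("holdout accuracy:", accuracy_score(y_te, model.predict(X_te)))

# Random subsampling: repeat the holdout k times and average the accuracies
k, accs = 10, []
for i in range(k):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=i)
    m = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    accs.append(accuracy_score(y_te, m.predict(X_te)))
print("random subsampling accuracy:", np.mean(accs))
```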
Evaluating Classifier Accuracy: Cross-Validation Method
Cross‐validation (k‐fold, where k = 10 is most popular)
Randomly partition the data into k mutually exclusive subsets, D1, D2, …, Dk,
each approximately equal size
At i‐th iteration, use Di as test set and others as training set
Iter. 1: {D2, …, D10} used for training, D1 used for testing
Iter. 2: {D1, D3, …, D10} used for training, D2 used for testing
…
Iter. 10: {D1, …, D9} used for training, D10 used for testing
Accuracy = (# tuples correctly classified) / (total # tuples)
Leave‐one‐out: k = # of tuples, each fold has one sample, for small sized data
Stratified cross‐validation: folds are stratified so that class distribution in each fold is approx. the same as that in the initial data
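A minimal stratified 10-fold cross-validation sketch along the same lines, again with placeholder data and classifier:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)             # placeholder dataset
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

correct, total = 0, 0
for train_idx, test_idx in skf.split(X, y):             # 10 iterations
    model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    correct += (pred == y[test_idx]).sum()              # tuples classified correctly
    total += len(test_idx)

print("10-fold CV accuracy:", correct / total)          # (# correct) / (total # tuples)
```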
Evaluating Classifier Accuracy: Bootstrap
Bootstrap
Works well with small data sets
Samples the given training tuples uniformly with replacement
i.e., each time a tuple is sampled, it is re-added to the training set
Several bootstrap methods exist; a common one is the .632 bootstrap
A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples.
Since samples are replaced, the same tuple can be sampled multiple times.
So, the training sample can have duplicates.
The data tuples that did not make it into the training set end up forming the test set.
Evaluating Classifier Accuracy: Bootstrap
Bootstrap
About 63.2% of the original data tuples end up in the training set (which is also referred to as the bootstrap sample), and the remaining 36.8% form the test set (since (1 − 1/d)^d ≈ e^(−1) = 0.368 for large d)
The model is built from the training dataset and tested on both the training dataset and the test dataset
Accuracy of the model is computed by combining two accuracies (see next slide)
This is repeated k times, and overall accuracy of the model is computed as:
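A common way to write this estimate, averaging the combined accuracies over the k repetitions (notation assumed from the textbook's description):

$$Acc(M) = \frac{1}{k}\sum_{i=1}^{k}\Big(0.632 \times Acc(M_i)_{test\_set} + 0.368 \times Acc(M_i)_{train\_set}\Big)$$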
Evaluating Classifier Accuracy: Bootstrap
Bootstrap illustration:
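A small Python sketch of a single bootstrap round, illustrating that roughly 63.2% of the distinct tuples end up in the training sample; the data here is just a list of ids:

```python
import random

random.seed(0)
d = 10000                                   # number of tuples in the dataset
data = list(range(d))                       # stand-in for the actual tuples

# One bootstrap round: sample d tuples uniformly with replacement
train = [random.choice(data) for _ in range(d)]      # may contain duplicates
test = set(data) - set(train)                        # tuples never sampled

print(len(set(train)) / d)   # fraction of distinct tuples in training, ~0.632
print(len(test) / d)         # fraction forming the test set, ~0.368
```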
Comparing Models M1 vs. M2: Hypothesis Testing
Suppose we have two classifiers, M1 and M2; which one is better?
Use 10-fold cross-validation to obtain their mean error rates, err(M1) and err(M2)
These mean error rates are just estimates
We want to know whether the difference between the two error rates is attributable to chance or is statistically significant.
Use a test of statistical significance
Hypothesis Testing
Hypothesis Testing or Testing of Statistical Significance
Will describe two-sided testing
State null hypothesis (and alternative hypothesis)
Choose the significance level
Compute test statistic X
Find the critical value C from the distribution table
If X > C or X < −C, reject the null hypothesis.
Otherwise, fail to reject the null hypothesis.
Hypothesis Testing
Perform 10-fold cross-validation
Use t-test (or Student's t-test) with degrees of freedom = k − 1 (k = 10 for our case)
Null Hypothesis: M1 & M2 are the same
If we can reject null hypothesis, then
we conclude that the difference between M1 & M2 is statistically significant
Choose model with lower error rate
If we cannot reject null hypothesis, we conclude that the difference is by chance (or not statistically significant).
Hypothesis Testing: t-test
When only one test set is available: pairwise comparison
Perform 10‐fold cross‐validation for M1 and M2
For each iteration, the same partitions (i.e., the same training set and the same test set) are used for both M1 and M2.
For each iteration, error rate of each classifier is computed.
For example
Iteration No.   err(M1)i   err(M2)i
1               0.023      0.12
2               0.05       0.067
…               …          …
Hypothesis Testing: t-test
Average over the 10 rounds to get the mean error rates, err(M1) and err(M2)
Compute the t-statistic with k − 1 degrees of freedom:

$$t = \frac{\overline{err}(M_1) - \overline{err}(M_2)}{\sqrt{var(M_1 - M_2)/k}}, \quad \text{where} \quad var(M_1 - M_2) = \frac{1}{k}\sum_{i=1}^{k}\Big[err(M_1)_i - err(M_2)_i - \big(\overline{err}(M_1) - \overline{err}(M_2)\big)\Big]^2$$

Note: This formula from the textbook calculates a population variance (the denominator inside var is k). However, in the example in a later slide, a sample variance will be used.
Hypothesis Testing: t-test
We choose a significance level α
For example, significance level α = 0.05 (or 5%)
From the t-distribution table, we find the t value tα/2, df corresponding to α/2 (the t-distribution is symmetric; typically only the upper percentage points of the distribution are shown) and degrees of freedom df = k − 1. This is the critical value.
For 10‐fold cross‐validation, k = 10.
Example: if we choose α = 0.05, we find t0.025, 9 from the t‐distribution table (see next slide)
Hypothesis Testing: t-test
t0.025, 9 = 2.262
Hypothesis Testing: t-test
Test
If t0 > tα/2, df or t0 < −tα/2, df, then the t value lies in the rejection region:
Reject null hypothesis
Conclude the difference between M1 and M2 is statistically significant
We choose the one with a lower error rate
Otherwise, conclude that any difference is due to chance (not statistically significant)
Hypothesis Testing: t-test
t = mean(E1-E2) / SQRT(var(E1-E2) / k) = 2.489
t0.025,9 = 2.262 (from t-distribution table)
Since t > t0.025,9 , we reject the null hypothesis and conclude M2 is better than M1.
Note: A sample variance is used for var(E1 – E2)
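A sketch of this computation in Python; the per-fold error rates E1 and E2 below are made-up placeholders, not the values behind the 2.489 above:

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold error rates from the same 10 CV partitions (placeholders)
E1 = np.array([0.12, 0.10, 0.15, 0.11, 0.13, 0.14, 0.12, 0.16, 0.11, 0.13])  # err(M1)_i
E2 = np.array([0.10, 0.09, 0.12, 0.10, 0.11, 0.12, 0.10, 0.13, 0.09, 0.11])  # err(M2)_i

d = E1 - E2
k = len(d)
t = d.mean() / np.sqrt(d.var(ddof=1) / k)   # sample variance, as in the slide's example
print("t =", t)

# Equivalent via SciPy's paired (two-sided) t-test
t_scipy, p_value = stats.ttest_rel(E1, E2)
print(t_scipy, p_value)   # reject H0 at alpha = 0.05 if |t| > t_{0.025,9} = 2.262
```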
ROC Curves
ROC (Receiver Operating Characteristic) curves: for visual comparison of classification models
Originated from signal detection theory
Shows the trade-off between the true positive rate and the false positive rate
The area under the ROC curve is a measure of the accuracy of the model
ROC Curves
Vertical axis represents the true positive rate (TPR)
Horizontal axis represents the false positive rate (FPR)
The plot also shows a diagonal line
The closer to the diagonal line (i.e., the closer the area is to 0.5), the less accurate is the model
A model with perfect accuracy will have an area of 1.0
ROC Curves
Rank the test tuples in decreasing order of predicted probability: the one that is most likely to belong to the positive class appears at the top of the list.
tuple_id   Actual class   Probability
1          P              0.992
2          P              0.964
3          N              0.953
4          P              0.931
5          P              0.893
6          N              0.875
7          P              0.82
8          N              0.793
9          N              0.778
10         P              0.742
ROC Curves
For each row:
Assume all tuples above, including itself, are classified as positive and all tuples below are classified as negative
Calculate TP, FP, TN, FN, and TPR (= TP/P) and FPR (= FP/N)
Then, plot TPR vs. FPR
ROC Curves
P = 6, N = 4

tuple_id   Actual class   Probability   TP   FP   TN   FN   TPR    FPR
1          P              0.992         1    0    4    5    0.17   0
2          P              0.964         2    0    4    4    0.33   0
3          N              0.953         2    1    3    4    0.33   0.25
4          P              0.931         3    1    3    3    0.5    0.25
5          P              0.893         4    1    3    2    0.67   0.25
6          N              0.875         4    2    2    2    0.67   0.5
7          P              0.82          5    2    2    1    0.83   0.5
8          N              0.793         5    3    1    1    0.83   0.75
9          N              0.778         5    4    0    1    0.83   1.0
10         P              0.742         6    4    0    0    1.0    1.0
ROC Curves
The same table as on the previous slide, with a "Classified as" column added for the fourth row (the top four tuples are classified as P, the rest as N):

tuple_id   Actual class   Probability   Classified as
1          P              0.992         P
2          P              0.964         P
3          N              0.953         P
4          P              0.931         P
5          P              0.893         N
6          N              0.875         N
7          P              0.82          N
8          N              0.793         N
9          N              0.778         N
10         P              0.742         N
Fourth row:
• Top four are classified as P and all others are classified as N.
• TP is 3 (3 positives are correctly classified as P)
• FP is 1 (1 negative is incorrectly classified as P)
• TN is 3 (3 negatives are correctly classified as N)
• FN is 3 (3 positives are incorrectly classified as N)
• TPR = 3/6 = 0.5 and FPR = 1/4 = 0.25
ROC Curves
(ROC curve for the example: the (FPR, TPR) points from the table above, plotted together with the diagonal line.)
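A small matplotlib sketch that draws this curve from the (FPR, TPR) pairs in the table; printing a trapezoidal AUC at the end is an extra illustration, not a value given on the slides:

```python
import numpy as np
import matplotlib.pyplot as plt

# (FPR, TPR) pairs from the table above, plus the origin (0, 0)
fpr = [0.0, 0.0, 0.0, 0.25, 0.25, 0.25, 0.5, 0.5, 0.75, 1.0, 1.0]
tpr = [0.0, 0.17, 0.33, 0.33, 0.5, 0.67, 0.67, 0.83, 0.83, 0.83, 1.0]

plt.plot(fpr, tpr, marker="o", label="ROC curve")
plt.plot([0, 1], [0, 1], linestyle="--", label="diagonal")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()

print("AUC (trapezoidal):", np.trapz(tpr, fpr))   # area under the curve
```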
Issues Affecting Model Selection
Accuracy
classifier accuracy: predicting class label
Speed
time to construct the model (training time)
time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk‐resident databases
Interpretability
understanding and insight provided by the model
Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules
Performance Measures for Numeric Prediction
Will discuss the measures reported by Weka:
Correlation coefficient
Mean absolute error
Root mean squared error
Relative absolute error
Root relative squared error
Performance Measures for Numeric Prediction
Notations:
a1, a2, …, an: Actual values of the dependent variable (also
called output attribute or class attribute)
p1, p2, …, pn: Predicted values (by a prediction/classifier algorithm)
Performance measures represent how far the predicted values are from the actual attribute values
Performance Measures for Numeric Prediction
Correlation coefficient
Correlation between the a's and the p's
Between −1 and 1

$$\frac{S_{PA}}{\sqrt{S_P\,S_A}}, \quad \text{where} \quad S_{PA} = \frac{\sum_{i=1}^{n}(p_i-\bar{p})(a_i-\bar{a})}{n-1}, \quad S_P = \frac{\sum_{i=1}^{n}(p_i-\bar{p})^2}{n-1}, \quad S_A = \frac{\sum_{i=1}^{n}(a_i-\bar{a})^2}{n-1}$$
Performance Measures for Numeric Prediction
Mean absolute error
Average of the magnitude of the individual errors
Less affected by outliers

$$\frac{|a_1-p_1| + |a_2-p_2| + \cdots + |a_n-p_n|}{n} \quad \text{or} \quad \frac{\sum_{i=1}^{n}|a_i-p_i|}{n}$$
Performance Measures for Numeric Prediction
Root mean squared error
Square root of the average of the squared individual errors
The effect of outliers is exaggerated.

$$\sqrt{\frac{(a_1-p_1)^2 + (a_2-p_2)^2 + \cdots + (a_n-p_n)^2}{n}} \quad \text{or} \quad \sqrt{\frac{\sum_{i=1}^{n}(a_i-p_i)^2}{n}}$$
Performance Measures for Numeric Prediction
Relative absolute error
Absolute error is normalized by the error that would have been generated when a simple predictor had been used. The average of actual attribute values of a’s is used as the simple predictor.
$$\frac{|a_1-p_1| + |a_2-p_2| + \cdots + |a_n-p_n|}{|a_1-\bar{a}| + |a_2-\bar{a}| + \cdots + |a_n-\bar{a}|} \quad \text{or} \quad \frac{\sum_{i=1}^{n}|a_i-p_i|}{\sum_{i=1}^{n}|a_i-\bar{a}|}$$
Performance Measures for Numeric Prediction
Root relative squared error
Square root of relative squared error
$$\sqrt{\frac{(a_1-p_1)^2 + (a_2-p_2)^2 + \cdots + (a_n-p_n)^2}{(a_1-\bar{a})^2 + (a_2-\bar{a})^2 + \cdots + (a_n-\bar{a})^2}} \quad \text{or} \quad \sqrt{\frac{\sum_{i=1}^{n}(a_i-p_i)^2}{\sum_{i=1}^{n}(a_i-\bar{a})^2}}$$
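A consolidated Python sketch of the five measures above; the actual values a and predicted values p are made-up placeholders:

```python
import numpy as np

# Placeholder actual (a) and predicted (p) values of the numeric class attribute
a = np.array([3.0, 5.0, 2.5, 7.0, 4.0])
p = np.array([2.5, 5.5, 3.0, 6.0, 4.5])

corr = np.corrcoef(p, a)[0, 1]                                      # correlation coefficient
mae  = np.abs(a - p).mean()                                         # mean absolute error
rmse = np.sqrt(((a - p) ** 2).mean())                               # root mean squared error
rae  = np.abs(a - p).sum() / np.abs(a - a.mean()).sum()             # relative absolute error
rrse = np.sqrt(((a - p) ** 2).sum() / ((a - a.mean()) ** 2).sum())  # root relative squared error

print(corr, mae, rmse, rae, rrse)
```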
Performance Measures for Numeric Prediction
Which measure is best?
Should be determined by the application
where the prediction is used.
For many practical problems, the best prediction method is still the best regardless of which performance measure is used.
References
• Han, J., Kamber, M., Pei, J., "Data Mining: Concepts and Techniques," 3rd Ed., Morgan Kaufmann, 2012. http://www.cs.illinois.edu/~hanj/bk3/
• Witten, I.H. and Frank, E., "Data Mining: Practical Machine Learning Tools and Techniques," 2nd Ed., Morgan Kaufmann, 2005, pp. 176–179.