
CS699
Lecture 6 Performance Evaluation

Model Evaluation and Selection
 Evaluation metrics: How can we measure accuracy? Other metrics to consider?
 Use an independent test dataset instead of training dataset when assessing accuracy
 Methods for estimating a classifier’s accuracy:
 Holdout method, random subsampling
 Cross‐validation
 Bootstrap
 Comparing classifiers:
 Confidence intervals
 ROC Curves
2

Model Evaluation and Selection
 Model testing: a dataset with known class labels is given to the MODEL, which predicts (or classifies) the class of each tuple
 Assume: Y is positive, N is negative (application dependent)

Id   Actual Class   Predicted Class   Result
1    Y              Y                 TP
2    Y              N                 FN
3    N              N                 TN
4    Y              N                 FN
5    N              N                 TN
6    N              N                 TN
7    N              Y                 FP
8    Y              Y                 TP
9    N              N                 TN
10   N              N                 TN
(attribute columns A1–A4 omitted)

7 out of 10 were correctly classified: accuracy = 70%
3

Classifier Evaluation Metrics: Confusion Matrix
Confusion Matrix:

Actual class\Predicted class   Positive                Negative
Positive                       True Positives (TP)     False Negatives (FN)
Negative                       False Positives (FP)    True Negatives (TN)

 Given m classes, an entry CM_i,j in a confusion matrix indicates the number of tuples in class i that were labeled by the classifier as class j
 May have extra rows/columns to provide totals
 Confusion matrix of the example in the previous slide:

Actual class\Predicted class   Y   N   Total
Y                              2   2   4
N                              1   5   6
Total                          3   7   10
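A minimal sketch (not from the slides) showing how the confusion matrix above can be tallied from the actual and predicted labels of the 10-tuple example:

# Tally the 2x2 confusion matrix for the 10-tuple example (Y = positive, N = negative).
actual    = ["Y", "Y", "N", "Y", "N", "N", "N", "Y", "N", "N"]
predicted = ["Y", "N", "N", "N", "N", "N", "Y", "Y", "N", "N"]

counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
for a, p in zip(actual, predicted):
    if a == "Y" and p == "Y":
        counts["TP"] += 1
    elif a == "Y" and p == "N":
        counts["FN"] += 1
    elif a == "N" and p == "Y":
        counts["FP"] += 1
    else:
        counts["TN"] += 1

print(counts)  # {'TP': 2, 'FN': 2, 'FP': 1, 'TN': 5}, matching the matrix above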
4

Classifier Evaluation Metrics: Confusion Matrix
Another Example:
Actual class\Predicted class   buy_computer = yes   buy_computer = no   Total
buy_computer = yes             6954                 46                  7000
buy_computer = no              412                  2588                3000
Total                          7366                 2634                10000
5

Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity
A\P   C    ¬C
C     TP   FN    P
¬C    FP   TN    N
      P’   N’    All

C: buys_computer = yes (P)
¬C: buys_computer = no (N)
 Classifier Accuracy, or recognition rate: percentage of test set tuples that are correctly classified
Accuracy = (TP + TN)/All
 Error rate: 1 – accuracy, or Error rate = (FP + FN)/All
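A small sketch (mine, not the textbook's) that applies these two formulas to the counts from the earlier 10-tuple example (TP = 2, TN = 5, FP = 1, FN = 2):

TP, TN, FP, FN = 2, 5, 1, 2             # counts from the 10-tuple example above
total = TP + TN + FP + FN

accuracy   = (TP + TN) / total          # (2 + 5) / 10 = 0.70
error_rate = (FP + FN) / total          # (1 + 2) / 10 = 0.30 = 1 - accuracy

print(accuracy, error_rate)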
6

Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity
 Class Imbalance Problem:
 One class may be rare, e.g. fraud, or HIV‐positive
 A significant majority of tuples belong to the negative class and only a small minority to the positive class
 Example: 9900 N’s (no cancer) and 100 P’s (cancer)
 A model correctly classifies all N’s and 20 P’s
 Model accuracy = 9920 / 10000 = 99.2% => very high
 But, 80% of the cancer patients were misdiagnosed as non‐cancer patients
7

Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity
 If it is important to correctly diagnose cancer patients, we need a different measure.
 In this example, we need TP/P.
 Sensitivity: True Positive recognition rate
 Sensitivity = TP/P
 Specificity: True Negative recognition rate
 Specificity = TN/N
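A hedged sketch using the class-imbalance numbers from the previous slide to show why sensitivity exposes what plain accuracy hides:

# Class-imbalance example from the previous slide:
# 100 cancer patients (P) and 9900 non-cancer patients (N);
# the model classifies all N's and only 20 P's correctly.
TP, FN = 20, 80
TN, FP = 9900, 0
P, N = TP + FN, TN + FP

accuracy    = (TP + TN) / (P + N)   # 9920 / 10000 = 0.992
sensitivity = TP / P                # 20 / 100 = 0.20 -> only 20% of cancer cases are caught
specificity = TN / N                # 9900 / 9900 = 1.00

print(accuracy, sensitivity, specificity)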
8

Classifier Evaluation Metrics: Precision and Recall, and F-measures
 Precision: exactness – what % of tuples that the classifier labeled as positive are actually positive
 Recall: completeness – what % of actual positive tuples did the classifier label as positive?
 Recall is the same as sensitivity
 Perfect score is 1.0 (for both)
9

Classifier Evaluation Metrics: Precision and Recall, and F-measures
 Combine precision and recall into a single measure
 F measure (F1 or F‐score): harmonic mean of precision and recall
 Fß: weighted measure of precision and recall
 ß is a parameter
 assigns ß times as much weight to recall as to precision
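The slides state these measures in words; below is a small sketch assuming the standard formulas, Precision = TP/(TP + FP), Recall = TP/(TP + FN), F1 = 2PR/(P + R), and Fβ = (1 + β²)PR/(β²P + R):

def precision(tp, fp):
    # What fraction of tuples labeled positive are actually positive.
    return tp / (tp + fp)

def recall(tp, fn):
    # What fraction of actual positives were labeled positive (same as sensitivity).
    return tp / (tp + fn)

def f1(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r)

def f_beta(p, r, beta):
    # Weighted combination; recall gets beta times the weight of precision.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

Plugging in the cancer example on the next slide (TP = 90, FP = 140, FN = 210) closely reproduces the precision, recall, F, and F2 values computed there.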
10

Classifier Evaluation Metrics: Example
Actual Class\Predicted class   cancer = yes   cancer = no   Total   Recognition (%)
cancer = yes                   90             210           300     30.00 (sensitivity)
cancer = no                    140            9560          9700    98.56 (specificity)
Total                          230            9770          10000   96.50 (accuracy)

 Precision = 90/230 = 39.13%
 Recall = 90/300 = 30.00%
 F (or F1 or F‐score) = (2 * 0.39 * 0.3) / (0.39 + 0.3) = 0.339
 F2 = [(1 + 2^2) * 0.39 * 0.3] / (2^2 * 0.39 + 0.3) = 0.315
11

Classifier Evaluation Metrics: Example
Actual Class\Predicted class   cancer = yes   cancer = no   Total   Recognition (%)
cancer = yes                   90             210           300     30.00 (sensitivity)
cancer = no                    140            9560          9700    98.56 (specificity)
Total                          230            9770          10000   96.50 (accuracy)
• MCC (Matthews Correlation Coefficient): More reliable for unbalanced binary-class dataset.
• Calculate MCC for the above confusion matrix.
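The slides do not give the MCC formula; the sketch below assumes the standard definition, MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), applied to the confusion matrix above:

from math import sqrt

def mcc(tp, fp, fn, tn):
    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    num = tp * tn - fp * fn
    den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

# Confusion matrix above: TP = 90, FN = 210, FP = 140, TN = 9560
print(mcc(tp=90, fp=140, fn=210, tn=9560))  # about 0.32, far less flattering than 96.5% accuracy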
12

Evaluating Classifier Accuracy: Holdout Method
 Holdout method
 Given data is randomly partitioned into two independent sets
 Training set (e.g., 2/3) for model construction
 Test set (e.g., 1/3) for accuracy estimation
 Random subsampling: a variation of holdout
 Repeat holdout k times, accuracy = avg. of the k accuracies obtained
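A minimal sketch (assumptions mine) of the holdout split and its random-subsampling variant; evaluate stands in for any hypothetical routine that builds a model on the training rows and returns its accuracy on the test rows:

import random

def holdout_split(rows, train_frac=2/3, seed=None):
    # Randomly partition the data into independent training and test sets.
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]      # (training set, test set)

def random_subsampling(rows, evaluate, k=10):
    # Repeat holdout k times; accuracy = average of the k accuracies.
    accs = []
    for i in range(k):
        train, test = holdout_split(rows, seed=i)
        accs.append(evaluate(train, test))     # evaluate: hypothetical train-and-score routine
    return sum(accs) / k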
13

Evaluating Classifier Accuracy: Cross-Validation Method
 Cross‐validation (k‐fold, where k = 10 is most popular)
 Randomly partition the data into k mutually exclusive subsets, D1, D2, …, Dk, each of approximately equal size
 At the i‐th iteration, use Di as the test set and the others as the training set
Iter. 1: {D2, …, D10} used for training, D1 used for testing
Iter. 2: {D1, D3, …, D10} used for training, D2 used for testing
…
Iter. 10: {D1, …, D9} used for training, D10 used for testing
Accuracy = (# tuples correctly classified) / (total # tuples)
 Leave‐one‐out: k = # of tuples, each fold has one sample, for small sized data
 Stratified cross‐validation: folds are stratified so that class distribution in each fold is approx. the same as that in the initial data
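A sketch of plain (unstratified) k-fold partitioning as described above; count_correct is a hypothetical routine that trains on the training folds and returns the number of test tuples classified correctly:

import random

def cross_validation_accuracy(rows, count_correct, k=10, seed=0):
    # Plain (unstratified) k-fold CV: every tuple is used for testing exactly once.
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]               # k roughly equal-sized folds

    correct = 0
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        correct += count_correct(train, test)                # hypothetical: # of test tuples classified correctly
    return correct / len(rows)                               # (# correctly classified) / (total # tuples)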
14

Evaluating Classifier Accuracy: Bootstrap
 Bootstrap
 Works well with small data sets
 Samples the given training tuples uniformly with replacement
 i.e., each time a tuple is sampled, it is re‐added to the training set
 Several bootstrap methods exist; a common one is the .632 bootstrap
 A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples.
 Since samples are replaced, the same tuple can be sampled multiple times.
 So, the training sample can have duplicates.
 The data tuples that did not make it into the training set end up forming the test set.
15

Evaluating Classifier Accuracy: Bootstrap
 Bootstrap
 About 63.2% of the original data end up in the training set (which is also referred to as the bootstrap sample), and the remaining 36.8% form the test set (since (1 – 1/d)^d ≈ e^(–1) = 0.368, for large d)
 The model is built from the training dataset and tested on both the training dataset and the test dataset
 Accuracy of the model is computed by combining two accuracies (see next slide)
 This is repeated k times, and overall accuracy of the model is computed as:
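A sketch of the .632 bootstrap as described above, averaging the k per-round estimates; train_and_score is a hypothetical helper that builds a model on its first argument and returns its accuracy on the second:

import random

def bootstrap_632_accuracy(rows, train_and_score, k=10, seed=0):
    # k rounds of the .632 bootstrap; train_and_score(train, test) -> accuracy (hypothetical helper).
    rng = random.Random(seed)
    d = len(rows)
    accs = []
    for _ in range(k):
        # Sample d tuples with replacement -> training set (may contain duplicates).
        train = [rows[rng.randrange(d)] for _ in range(d)]
        # Tuples never sampled (about 36.8% on average) form the test set.
        chosen = {id(r) for r in train}
        test = [r for r in rows if id(r) not in chosen]
        acc_test  = train_and_score(train, test)
        acc_train = train_and_score(train, train)
        accs.append(0.632 * acc_test + 0.368 * acc_train)    # weight the two accuracies
    return sum(accs) / k                                     # average over the k rounds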
16

Evaluating Classifier Accuracy: Bootstrap
 Bootstrap illustration:
17

Comparing Models M1 vs. M2: Hypothesis Testing
 Suppose we have 2 classifiers, M1 and M2, which one is better?
 Use 10‐fold cross‐validation to obtain the mean error rates err(M1) and err(M2)
 These mean error rates are just estimates
 We want to know whether the difference between the two error rates is merely due to chance or is statistically significant.
 Use a test of statistical significance
18

Hypothesis Testing
 Hypothesis Testing or Testing of Statistical Significance
 Will describe two‐sided testing
 State the null hypothesis (and the alternative hypothesis)
 Choose the significance level
 Compute the test statistic X
 Find the critical value C from the distribution table
 If X > C or X < –C, reject the null hypothesis. Otherwise, fail to reject the null hypothesis.
19

Hypothesis Testing
 Perform 10‐fold cross‐validation
 Use the t‐test (or Student's t‐test) with degrees of freedom = k – 1 (k = 10 for our case)
 Null hypothesis: M1 & M2 are the same
 If we can reject the null hypothesis, then
 we conclude that the difference between M1 & M2 is statistically significant
 Choose the model with the lower error rate
 If we cannot reject the null hypothesis, we conclude that the difference is by chance (i.e., not statistically significant).
20

Hypothesis Testing: t-test
 When only one test set is available: pairwise comparison
 Perform 10‐fold cross‐validation for M1 and M2
 For each iteration, the same partitions (i.e., the same training set and the same test set) are used for both M1 and M2.
 For each iteration, the error rate of each classifier is computed.
 For example:

Iteration No.   err(M1)_i   err(M2)_i
1               0.023       0.12
2               0.05        0.067
…               …           …
21

Hypothesis Testing: t-test
 Average over the 10 rounds to get mean_err(M1) and mean_err(M2)
 Compute the t‐statistic with k – 1 degrees of freedom:

t = [mean_err(M1) – mean_err(M2)] / sqrt(var(M1 – M2) / k), where
var(M1 – M2) = (1/k) Σ_{i=1..k} [err(M1)_i – err(M2)_i – (mean_err(M1) – mean_err(M2))]^2

 Note: This formula from the textbook calculates a population variance (the denominator is k). However, in the example in a later slide, a sample variance will be used.
22

Hypothesis Testing: t-test
 We choose a significance level α
 For example, significance level α = 0.05 (or 5%)
 From the t‐distribution table, we find the t value, t_{α/2, df}, corresponding to α/2 (the t‐distribution is symmetric; typically the upper percentage points of the distribution are shown) and degrees of freedom df = k – 1. This is the critical value.
 For 10‐fold cross‐validation, k = 10.
 Example: if we choose α = 0.05, we find t_{0.025, 9} from the t‐distribution table (see next slide)
23

Hypothesis Testing: t-test
t_{0.025, 9} = 2.262
24

Hypothesis Testing: t-test
 Test
 If t0 > t_{α/2, df} or t0 < –t_{α/2, df}, then the t value lies in the rejection region:
 Reject the null hypothesis
 Conclude that the difference between M1 and M2 is statistically significant
 We choose the one with the lower error rate
 Otherwise, conclude that any difference is due to chance
25

Hypothesis Testing: t-test
t = mean(E1 – E2) / SQRT(var(E1 – E2) / k) = 2.489
t_{0.025, 9} = 2.262 (from the t-distribution table)
Since t > t_{0.025, 9}, we reject the null hypothesis and conclude M2 is better than M1.
Note: A sample variance is used for var(E1 – E2)
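A sketch (mine) of the paired t-statistic just described, using the population-variance form noted above (denominator k) and the tabled critical value t_{0.025, 9} = 2.262; the per-fold error rates below are hypothetical:

from math import sqrt

def paired_t_statistic(err1, err2):
    # t = (mean_err(M1) - mean_err(M2)) / sqrt(var(M1 - M2) / k), population variance (denominator k).
    k = len(err1)
    diffs = [e1 - e2 for e1, e2 in zip(err1, err2)]
    mean_diff = sum(diffs) / k
    var_diff = sum((d - mean_diff) ** 2 for d in diffs) / k   # use (k - 1) here for a sample variance
    return mean_diff / sqrt(var_diff / k)

# Hypothetical per-fold error rates from 10-fold CV (the same folds used for both M1 and M2).
err_m1 = [0.023, 0.12, 0.05, 0.067, 0.08, 0.09, 0.06, 0.11, 0.07, 0.10]
err_m2 = [0.050, 0.067, 0.04, 0.05, 0.06, 0.07, 0.05, 0.08, 0.06, 0.07]

t0 = paired_t_statistic(err_m1, err_m2)        # about 2.8 for these made-up numbers
critical = 2.262                               # t_{0.025, 9} from the t-distribution table
print(abs(t0) > critical)                      # True -> reject the null hypothesis for this data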
26

ROC Curves
 ROC (Receiver Operating Characteristics) curves: for visual comparison of classification models
 Originated from signal detection theory
 Shows the trade‐off between the true positive rate and the false positive rate
 The area under the ROC curve is a measure of the accuracy of the model
27

ROC Curves
 Vertical axis represents the true positive rate (TPR)
 Horizontal axis represents the false positive rate (FPR)
 The plot also shows a diagonal line
 The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model is
 A model with perfect accuracy will have an area of 1.0
28

ROC Curves
 Rank the test tuples in decreasing order of predicted probability: the one that is most likely to belong to the positive class appears at the top of the list.
tuple_id   Actual class   Probability
1          P              0.992
2          P              0.964
3          N              0.953
4          P              0.931
5          P              0.893
6          N              0.875
7          P              0.82
8          N              0.793
9          N              0.778
10         P              0.742
29

ROC Curves
 For each row:
 Assume all tuples above, including itself, are classified as positive and all tuples below are classified as negative
 Calculate TP, FP, TN, FN, and TPR (= TP/P) and FPR (= FP/N)
 Then, plot TPR vs. FPR
30

ROC Curves
P = 6, N = 4

tuple_id   Actual class   Probability   TP   FP   TN   FN   TPR    FPR
1          P              0.992         1    0    4    5    0.17   0
2          P              0.964         2    0    4    4    0.33   0
3          N              0.953         2    1    3    4    0.33   0.25
4          P              0.931         3    1    3    3    0.5    0.25
5          P              0.893         4    1    3    2    0.67   0.25
6          N              0.875         4    2    2    2    0.67   0.5
7          P              0.82          5    2    2    1    0.83   0.5
8          N              0.793         5    3    1    1    0.83   0.75
9          N              0.778         5    4    0    1    0.83   1.0
10         P              0.742         6    4    0    0    1.0    1.0
31

ROC Curves

tuple_id   Actual class   Probability   Classified as   TP   FP   TN   FN   TPR    FPR
1          P              0.992         P               1    0    4    5    0.17   0
2          P              0.964         P               2    0    4    4    0.33   0
3          N              0.953         P               2    1    3    4    0.33   0.25
4          P              0.931         P               3    1    3    3    0.5    0.25
5          P              0.893         N               4    1    3    2    0.67   0.25
6          N              0.875         N               4    2    2    2    0.67   0.5
7          P              0.82          N               5    2    2    1    0.83   0.5
8          N              0.793         N               5    3    1    1    0.83   0.75
9          N              0.778         N               5    4    0    1    0.83   1.0
10         P              0.742         N               6    4    0    0    1.0    1.0

Fourth row:
• The top four tuples are classified as P and all others are classified as N.
• TP is 3 (3 positives are correctly classified as P)
• FP is 1 (1 negative is incorrectly classified as P)
• TN is 3 (3 negatives are correctly classified as N)
• FN is 3 (3 positives are incorrectly classified as N)
• TPR = 3/6 = 0.5 and FPR = 1/4 = 0.25
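A sketch that recomputes the TPR/FPR columns above from the ranked (class, probability) list; plotting the resulting points gives the ROC curve shown on the next slide:

# Tuples ranked by decreasing predicted probability of the positive class (from the table above).
ranked = [("P", 0.992), ("P", 0.964), ("N", 0.953), ("P", 0.931), ("P", 0.893),
          ("N", 0.875), ("P", 0.82), ("N", 0.793), ("N", 0.778), ("P", 0.742)]

P = sum(1 for cls, _ in ranked if cls == "P")   # 6 positives
N = len(ranked) - P                             # 4 negatives

tp = fp = 0
roc_points = [(0.0, 0.0)]                       # the curve starts at the origin
for cls, _prob in ranked:                       # move the cut-off down one row at a time
    if cls == "P":
        tp += 1
    else:
        fp += 1
    roc_points.append((fp / N, tp / P))         # (FPR, TPR) after classifying this row as positive

for fpr, tpr in roc_points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")      # matches the TPR/FPR columns above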
32

ROC Curves
33

Issues Affecting Model Selection
 Accuracy
 classifier accuracy: predicting class label
 Speed
 time to construct the model (training time)
 time to use the model (classification/prediction time)
 Robustness: handling noise and missing values
 Scalability: efficiency in disk‐resident databases
 Interpretability
 understanding and insight provided by the model
 Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules
34

Performance Measures for Numeric Prediction
 Will discuss the measures reported by Weka:
 Correlation coefficient
 Mean absolute error
 Root mean squared error
 Relative absolute error
 Root relative squared error
35

Performance Measures for Numeric Prediction
 Notations:
 a1, a2, …, an: Actual values of the dependent variable (also
called output attribute or class attribute)
 p1, p2, …, pn: Predicted values (by a prediction/classifier algorithm)
 Performance measures represent how far the predicted values are from the actual attribute values
36

Performance Measures for Numeric Prediction
 Correlation coefficient
 Correlation between a’s and p’s  Between ‐1 and 1
correlation coefficient = S_PA / sqrt(S_P * S_A), where

S_PA = Σ_{i=1..n} (p_i – p̄)(a_i – ā) / (n – 1)
S_P = Σ_{i=1..n} (p_i – p̄)^2 / (n – 1)
S_A = Σ_{i=1..n} (a_i – ā)^2 / (n – 1)
37

Performance Measures for Numeric Prediction
 Mean absolute error
 Average of the magnitude of individual errors
 Less affected by outliers

MAE = (|a_1 – p_1| + |a_2 – p_2| + … + |a_n – p_n|) / n   or   (Σ_{i=1..n} |a_i – p_i|) / n
38

Performance Measures for Numeric Prediction
 Root mean squared error
 Square root of the average of the squared individual errors
 Effect of outliers is exaggerated.

RMSE = sqrt( ((a_1 – p_1)^2 + (a_2 – p_2)^2 + … + (a_n – p_n)^2) / n )   or   sqrt( Σ_{i=1..n} (a_i – p_i)^2 / n )
39

Performance Measures for Numeric Prediction
 Relative absolute error
 Absolute error is normalized by the error that would have been generated when a simple predictor had been used. The average of actual attribute values of a’s is used as the simple predictor.
RAE = (|a_1 – p_1| + |a_2 – p_2| + … + |a_n – p_n|) / (|a_1 – ā| + |a_2 – ā| + … + |a_n – ā|)   or   Σ_{i=1..n} |a_i – p_i| / Σ_{i=1..n} |a_i – ā|
40

Performance Measures for Numeric Prediction
 Root relative squared error
 Square root of relative squared error
RRSE = sqrt( ((a_1 – p_1)^2 + … + (a_n – p_n)^2) / ((a_1 – ā)^2 + … + (a_n – ā)^2) )   or   sqrt( Σ_{i=1..n} (a_i – p_i)^2 / Σ_{i=1..n} (a_i – ā)^2 )
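A combined sketch of the four error measures from the preceding slides, with hypothetical actual/predicted vectors:

from math import sqrt

def mae(a, p):
    # Mean absolute error.
    return sum(abs(ai - pi) for ai, pi in zip(a, p)) / len(a)

def rmse(a, p):
    # Root mean squared error.
    return sqrt(sum((ai - pi) ** 2 for ai, pi in zip(a, p)) / len(a))

def rae(a, p):
    # Relative absolute error: normalized by the error of always predicting the mean of the a's.
    a_bar = sum(a) / len(a)
    return sum(abs(ai - pi) for ai, pi in zip(a, p)) / sum(abs(ai - a_bar) for ai in a)

def rrse(a, p):
    # Root relative squared error.
    a_bar = sum(a) / len(a)
    return sqrt(sum((ai - pi) ** 2 for ai, pi in zip(a, p)) /
                sum((ai - a_bar) ** 2 for ai in a))

actual    = [3.0, 5.0, 2.5, 7.0, 4.0]
predicted = [2.8, 5.4, 3.0, 6.5, 4.2]
print(mae(actual, predicted), rmse(actual, predicted),
      rae(actual, predicted), rrse(actual, predicted))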
41

Performance Measures for Numeric Prediction
 Which measure is best?
 Should be determined by the application
where the prediction is used.
 For many practical problems, the best prediction method is still the best regardless of which performance measure is used.
42

References
• Han, J., Kamber, M., and Pei, J., “Data Mining: Concepts and Techniques,” 3rd Ed., Morgan Kaufmann, 2012
• http://www.cs.illinois.edu/~hanj/bk3/
• Witten, I.H. and Frank, E., “Data Mining: Practical Machine Learning Tools and Techniques,” 2nd Ed., Morgan Kaufmann, 2005, pp. 176–179
43