Document Classification 4: Evaluation
This time:
Accuracy and error rate
The confusion matrix
Precision and recall
Trading off precision and recall
The Naïve Bayes decision rule
The decision boundary
The Receiver Operating Characteristics (ROC) curve
Evaluating Classifiers
Suppose we want to measure how well a classifier is performing — evaluate on a test dataset
What proportion of items in test set classified correctly? — Accuracy
What proportion of items in test set classified incorrectly? — Error Rate
Are these measures the most appropriate way of assessing/comparing classifiers?
Binary Classification
Will focus on case of binary classification
Binary choice common scenario for document classification:
spam v. ham
thumbs up v. thumbs down
on topic v. off topic
One class is usually distinguished as being “of interest”
the positive class (+ve)
often the +ve class is much smaller than the -ve class
we are trying to find documents in the +ve class
Confusion Matrix (Binary Classification)
                Predicted +ve           Predicted -ve
True +ve        True Positive (TP)      False Negative (FN)
True -ve        False Positive (FP)     True Negative (TN)

Total number of data items: Tot = TP + FP + TN + FN
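As an illustration (not from the slides), here is a minimal Python sketch of how these four counts could be tallied from gold and predicted labels; the function name and the "+ve"/"-ve" label values are assumptions for the example.

    # Tally the four confusion-matrix cells from parallel lists of
    # gold-standard and predicted labels (binary classification).
    def confusion_counts(gold, predicted, positive="+ve"):
        tp = fp = fn = tn = 0
        for g, p in zip(gold, predicted):
            if g == positive and p == positive:
                tp += 1   # true positive
            elif g == positive:
                fn += 1   # false negative: +ve item predicted -ve
            elif p == positive:
                fp += 1   # false positive: -ve item predicted +ve
            else:
                tn += 1   # true negative
        return tp, fp, fn, tn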
Measuring Accuracy and Error Rate
Accuracy: the proportion that is classified correctly:
    Accuracy = (TP + TN) / Tot
Error Rate: the proportion that is classified incorrectly:
    Error Rate = (FN + FP) / Tot
Note that Accuracy = 1 − Error Rate
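A small sketch (again illustrative, not from the slides) of these two measures as Python functions over the confusion-matrix counts:

    # Accuracy and error rate from the four confusion-matrix counts.
    def accuracy(tp, fp, fn, tn):
        tot = tp + fp + fn + tn
        return (tp + tn) / tot

    def error_rate(tp, fp, fn, tn):
        tot = tp + fp + fn + tn
        return (fn + fp) / tot

    # With the counts of Example 1 below (TP=45, FP=15, FN=5, TN=35):
    # accuracy(45, 15, 5, 35) -> 0.8 and error_rate(45, 15, 5, 35) -> 0.2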
Example 1: Balanced Classes
                Predicted +ve    Predicted -ve
True +ve        TP = 45          FN = 5
True -ve        FP = 15          TN = 35

Accuracy = (45 + 35) / 100 = 0.8
Error Rate = (15 + 5) / 100 = 0.2
Example 2: Skewed Classes
                Predicted +ve    Predicted -ve
True +ve        TP = 3           FN = 7
True -ve        FP = 3           TN = 87

Accuracy = (3 + 87) / 100 = 0.9
But how good is this classifier at detecting the +ve class?!
Precision & Recall
[Figure: Venn diagram with A = the set of documents predicted +ve, B = the set of documents that are actually +ve, and C = A ∩ B]

Precision = |C| / |A|
Recall = |C| / |B|
Measuring Precision & Recall
Precision (positive predictive value):
    Precision = TP / (TP + FP)
Recall (true positive rate, sensitivity, hit rate):
    Recall = TP / (TP + FN)
F-Score: one number that captures both precision and recall (their harmonic mean):
    F-Score = 2 · Precision · Recall / (Precision + Recall)
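A minimal Python sketch of the three measures (illustrative only; the function names are assumptions):

    # Precision, recall and F-score from confusion-matrix counts.
    def precision(tp, fp):
        return tp / (tp + fp)

    def recall(tp, fn):
        return tp / (tp + fn)

    def f_score(p, r):
        # Harmonic mean of precision and recall.
        return 2 * p * r / (p + r)

    # With Example 2 below (TP=3, FP=3, FN=7):
    # precision(3, 3) -> 0.5, recall(3, 7) -> 0.3, f_score(0.5, 0.3) -> 0.375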
Example 2: Precision and Recall
                Predicted +ve    Predicted -ve
True +ve        TP = 3           FN = 7
True -ve        FP = 3           TN = 87

Precision = 3 / (3 + 3) = 0.5
Recall = 3 / (3 + 7) = 0.3
F-Score = 2 · (0.5 · 0.3) / (0.5 + 0.3) = 0.3 / 0.8 = 0.375
Trading off Precision and Recall
Given a classifier, we can change the decision boundary
Makes it more/less likely to classify an item as +ve
Can be used to trade off precision and recall
Naïve Bayes: Decision Rule
Standard decision rule:
If P(+ve | f_1^n) > P(−ve | f_1^n) then choose the +ve class
If P(−ve | f_1^n) > P(+ve | f_1^n) then choose the -ve class
(where f_1^n denotes the document's features f_1, ..., f_n)
The standard decision boundary is defined by P(+ve | f_1^n) = 0.5
Generalised decision rule:
Define the decision boundary by P(+ve | f_1^n) = p for some 0 ≤ p ≤ 1
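A minimal sketch of the generalised rule, assuming we already have the Naïve Bayes posterior P(+ve | f_1^n) for a document (the function name is an assumption for illustration):

    # Generalised decision rule: predict +ve when the posterior probability
    # of the +ve class exceeds a threshold p (p = 0.5 gives the standard rule).
    def classify(posterior_pos, p=0.5):
        return "+ve" if posterior_pos > p else "-ve"

    # Lowering p makes +ve predictions more likely (recall tends to rise,
    # precision tends to fall); raising p has the opposite effect.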
Easy Decision Boundary

[Figure: blue and red documents plotted along the P(+ve | f_1^n) axis from 0.0 to 1.0, with candidate decision boundaries]

One choice of boundary: 0 blue documents mis-classified, 1 red document mis-classified
Another choice of boundary: 0 blue documents mis-classified, 0 red documents mis-classified
Harder Decision Boundary

[Figure: blue and red documents plotted along the P(+ve | f_1^n) axis from 0.0 to 1.0, with candidate decision boundaries]

One choice of boundary: 4 blue documents mis-classified, 1 red document mis-classified
Another choice of boundary: 1 blue document mis-classified, 3 red documents mis-classified
ROC Curve
Receiver Operating Characteristics
— originally used to show trade-off between false alarm rate and hit rate
A plot of false positive rate against recall
False Positive Rate = FP / (FP + TN)
Recall = TP / (TP + FN)
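A sketch of how the curve's points could be computed by sweeping the decision boundary over the classifier's scores (illustrative only; assumes both classes occur in the gold labels and that each score is the posterior P(+ve | f_1^n)):

    # One (false positive rate, recall) point per candidate decision boundary.
    def roc_points(scores, gold, positive="+ve"):
        n_pos = sum(1 for g in gold if g == positive)
        n_neg = len(gold) - n_pos
        points = []
        # Sweep boundaries from "predict nothing +ve" down to the lowest score.
        for t in [float("inf")] + sorted(set(scores), reverse=True):
            tp = sum(1 for s, g in zip(scores, gold) if s >= t and g == positive)
            fp = sum(1 for s, g in zip(scores, gold) if s >= t and g != positive)
            points.append((fp / n_neg, tp / n_pos))
        return points  # first point is (0.0, 0.0), last is (1.0, 1.0)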
Example ROC Curve

[Figure: example ROC curve plotting recall (0 to 100) against false positive rate (0 to 100), traced out as the decision boundary P(+ve | f_1^n) varies from 0.0 to 1.0]
AUROC: Area Under the ROC Curve
A measure of the quality of the classifier
The greater the area, the better the classifier
A way of comparing different classifiers
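A sketch of the area computation by the trapezoidal rule, given (false positive rate, recall) points such as those from the roc_points sketch above; in practice a library routine such as sklearn.metrics.roc_auc_score can be used instead.

    # Area under the ROC curve via the trapezoidal rule.
    def auroc(points):
        pts = sorted(points)                     # order by false positive rate
        area = 0.0
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            area += (x1 - x0) * (y0 + y1) / 2.0  # trapezoid between adjacent points
        return area

    # A random classifier's curve (the diagonal) gives an area of about 0.5;
    # a perfect classifier gives 1.0.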
ROC Curve: Comparing Classifiers
[Figure: ROC curves (recall against false positive rate, both 0 to 100) for three classifiers: Classifier 1 (good), Classifier 2 (random), Classifier 3 (poor)]
Next Topic: Document Similarity
Characterising documents
Identifying indicative terms
Stop words
Word frequency
TF-IDF weighting
Words and phrases
Measuring document similarity
The vector space model
Cosine similarity