Document Classification 4: Evaluation
Evaluating Classifiers
This time:
Accuracy and error rate
The confusion matrix
Precision and recall
Trading off precision and recall
The Naïve Bayes decision rule
The decision boundary
The Receiver Operating Characteristics (ROC) curve
Binary Classification
We will focus on the case of binary classification.
A binary choice is a common scenario for document classification:
spam v. ham
thumbs up v. thumbs down
on topic v. off topic
One class is usually distinguished as being “of interest”: the positive class (+ve).
Often the +ve class is much smaller than the -ve class, and we are trying to find documents in the +ve class.

Evaluating Classifiers
Suppose we want to measure how well a classifier is performing: evaluate it on a test dataset.
What proportion of items in the test set is classified correctly? Accuracy.
What proportion of items in the test set is classified incorrectly? Error Rate.
Are these measures the most appropriate way of assessing/comparing classifiers?

Confusion Matrix (Binary Classification)

             Predicted +ve         Predicted -ve
True +ve     True Positive (TP)    False Negative (FN)
True -ve     False Positive (FP)   True Negative (TN)

Total number of data items: Tot = TP + FP + TN + FN
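The four cells can be tallied directly from parallel lists of true and predicted labels. A minimal Python sketch (not from the slides; the function name confusion_counts and the "+ve"/"-ve" label strings are illustrative assumptions):

    # Sketch: tally confusion matrix cells from parallel lists of
    # true and predicted labels ("+ve"/"-ve" are illustrative label names).
    def confusion_counts(true_labels, predicted_labels, positive="+ve"):
        tp = fp = fn = tn = 0
        for true, pred in zip(true_labels, predicted_labels):
            if pred == positive and true == positive:
                tp += 1      # predicted +ve, actually +ve
            elif pred == positive:
                fp += 1      # predicted +ve, actually -ve
            elif true == positive:
                fn += 1      # predicted -ve, actually +ve
            else:
                tn += 1      # predicted -ve, actually -ve
        return tp, fp, fn, tn

    tp, fp, fn, tn = confusion_counts(["+ve", "-ve", "+ve"], ["+ve", "+ve", "-ve"])
    tot = tp + fp + tn + fn   # Tot = TP + FP + TN + FN = 3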
Measuring Accuracy and Error Rate
Accuracy: the proportion that is classified correctly: Accuracy = (TP + TN) / Tot
Error Rate: the proportion that is classified incorrectly: Error Rate = (FN + FP) / Tot
Note that Accuracy = 1 − Error Rate

Example 1: Balanced Classes

             Predicted +ve   Predicted -ve
True +ve     TP = 45         FN = 5
True -ve     FP = 15         TN = 35

Accuracy = (45 + 35) / 100 = 0.8
Error Rate = (15 + 5) / 100 = 0.2
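The same two calculations for Example 1 as a short Python sketch (plain arithmetic, nothing beyond the formulas above):

    # Sketch: accuracy and error rate for the Example 1 confusion matrix.
    tp, fn, fp, tn = 45, 5, 15, 35
    tot = tp + fp + tn + fn              # 100

    accuracy = (tp + tn) / tot           # (45 + 35) / 100 = 0.8
    error_rate = (fn + fp) / tot         # (15 + 5) / 100 = 0.2
    assert abs(accuracy + error_rate - 1.0) < 1e-12   # Accuracy = 1 - Error Rate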
Example 2: Skewed Classes

             Predicted +ve   Predicted -ve
True +ve     TP = 3          FN = 7
True -ve     FP = 3          TN = 87

Accuracy = (3 + 87) / 100 = 0.9
But how good is this classifier at detecting the +ve class?!

Precision & Recall
Let A be the set of documents predicted +ve, B the set of documents actually +ve, and C = A ∩ B.
Precision = |C| / |A|
Recall = |C| / |B|
Measuring Precision & Recall
Precision (positive predictive value): Precision = TP / (TP + FP)
Recall (true positive rate, sensitivity, hit rate): Recall = TP / (TP + FN)
F-Score: one number that captures both precision and recall, the harmonic mean of Precision and Recall:
F-Score = (2 · Precision · Recall) / (Precision + Recall)

Example 2: Precision and Recall

             Predicted +ve   Predicted -ve
True +ve     TP = 3          FN = 7
True -ve     FP = 3          TN = 87

Precision = 3 / (3 + 3) = 0.5
Recall = 3 / (3 + 7) = 0.3
F-Score = (2 · 0.5 · 0.3) / (0.5 + 0.3) = 0.3 / 0.8 = 0.375
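The same quantities for Example 2 as a Python sketch (plain arithmetic, not from the slides):

    # Sketch: precision, recall and F-score for the Example 2 counts.
    tp, fn, fp, tn = 3, 7, 3, 87

    precision = tp / (tp + fp)                                 # 3 / 6  = 0.5
    recall = tp / (tp + fn)                                    # 3 / 10 = 0.3
    f_score = 2 * precision * recall / (precision + recall)    # 0.3 / 0.8 = 0.375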
Naïve Bayes: Decision Rule
Standard decision rule:
If P(+ve | f1n) > P(-ve | f1n) then choose the +ve class
If P(-ve | f1n) > P(+ve | f1n) then choose the -ve class
The standard decision boundary is defined by P(+ve | f1n) = 0.5

Trading off Precision and Recall
Given a classifier, we can change the decision boundary, making it more or less likely that an item is classified as +ve.
This can be used to trade off precision and recall.
Generalised decision rule: define the decision boundary by P(+ve | f1n) = p, for some 0 ≤ p ≤ 1.
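A minimal sketch of the generalised rule, assuming some classifier has already produced the posterior P(+ve | f1n) for a document (the helper name classify_with_boundary is hypothetical). Lowering p tends to raise recall, since more documents are called +ve; raising p tends to raise precision.

    # Sketch: generalised decision rule with an adjustable boundary p.
    def classify_with_boundary(posterior_positive, p=0.5):
        # posterior_positive is the classifier's estimate of P(+ve | f1n)
        return "+ve" if posterior_positive > p else "-ve"

    classify_with_boundary(0.4)          # '-ve' under the standard boundary p = 0.5
    classify_with_boundary(0.4, p=0.3)   # '+ve' once the boundary is lowered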
Easy Decision Boundary

[Figure: documents plotted along the P(+ve | f1n) axis from 0.0 to 1.0. With one decision boundary: 0 blue documents mis-classified, 1 red document mis-classified. With a shifted boundary: 0 blue documents mis-classified, 0 red documents mis-classified.]

Harder Decision Boundary

[Figure: documents plotted along the P(+ve | f1n) axis from 0.0 to 1.0. With one decision boundary: 4 blue documents mis-classified, 1 red document mis-classified. With another: 1 blue document mis-classified, 3 red documents mis-classified.]
ROC Curve
Receiver Operating Characteristics: originally used to show the trade-off between false alarm rate and hit rate.
A plot of false positive rate against recall:
False Positive Rate = FP / (FP + TN)
Recall = TP / (TP + FN)

Example ROC Curve

[Figure: an example ROC curve, plotting recall (0 to 100) against false positive rate (0 to 100).]
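Such a curve can be traced by sweeping the decision boundary p over the classifier's scores and recording (false positive rate, recall) at each setting. A sketch, assuming the scores are posteriors P(+ve | f1n); the helper name roc_points is illustrative:

    # Sketch: sweep the decision boundary over the scores and collect
    # (false positive rate, recall) points for an ROC curve.
    def roc_points(scores, true_labels, positive="+ve"):
        points = [(0.0, 0.0)]                    # start at the origin
        for p in sorted(set(scores), reverse=True):
            tp = sum(s >= p and y == positive for s, y in zip(scores, true_labels))
            fp = sum(s >= p and y != positive for s, y in zip(scores, true_labels))
            fn = sum(s < p and y == positive for s, y in zip(scores, true_labels))
            tn = sum(s < p and y != positive for s, y in zip(scores, true_labels))
            fpr = fp / (fp + tn) if (fp + tn) else 0.0   # FP / (FP + TN)
            rec = tp / (tp + fn) if (tp + fn) else 0.0   # TP / (TP + FN)
            points.append((fpr, rec))
        return points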
AUROC: Area Under the ROC Curve
A measure of the quality of the classifier.
The greater the area, the better the classifier.
A way of comparing different classifiers.

ROC Curve: Comparing Classifiers

[Figure: ROC curves for three classifiers, plotting recall against false positive rate: Classifier 1 (good), Classifier 2 (random), Classifier 3 (poor).]
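The area itself can be approximated with the trapezoidal rule over the (false positive rate, recall) points produced by the roc_points sketch above (the function name auroc is illustrative):

    # Sketch: area under the ROC curve via the trapezoidal rule.
    def auroc(points):
        pts = sorted(points)                      # order by false positive rate
        area = 0.0
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            area += (x1 - x0) * (y0 + y1) / 2.0   # trapezoid between neighbours
        return area

    auroc([(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)])   # 1.0: a perfect classifier
    auroc([(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)])   # 0.5: a random classifier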
Next Topic: Document Similarity
Characterising documents
Identifying indicative terms
Stop words
Word frequency
TF-IDF weighting
Words and phrases
Measuring document similarity
The vector space model
Cosine similarity