Introduction to information system
Model Evaluation Metrics
Bowei Chen
School of Computer Science
University of Lincoln
CMP3036M/CMP9063M Data Science
MASH
• Maths
• And
• Stats
• Help
• MASH
• mash@lincoln.ac.uk
• In The Library
Today’s Objectives
• What Is A Model Evaluation Metric?
• Mean Absolute Error (MAE)
• Root Mean Squared Error (RMSE)
• Confusion Matrix
• Accuracy, Error Rate, True Positive Rate, False Positive Rate, Precision
• Receiver Operating Characteristic (ROC)
• Area Under ROC Curve (AUC)
Quick Recap on Model Selection
• Basic Setup of the Learning from Data
• Cross-Validation Methods
– Test Set Method
– Leave-One-Out Cross Validation
– 𝑘-Fold Cross Validation
• Appendix A: Testing-Based/Stepwise Procedures
– Backward Elimination
– Forward Selection
– Stepwise Selection
• Appendix B: Criterion-Based Procedures
– Mallows’ 𝐶𝑝
– AIC & BIC
– 𝑅2 Adjusted
Basic Setup of the Learning from Data
Source: Y. Abu-Mostafa, M. Magdon-Ismail and H. Lin.
Learning from Data. AMLbook.com, 2012, Chapter 1
Test Set Method
1) Randomly choose 30% of the data to be in a test set
2) The remainder is a training set
3) Perform your regression on the training set
4) Estimate your future performance with the test set
[Figure: plot of y against x.]
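The four steps of the test set method can be sketched in plain Python. This is only an illustration under stated assumptions: the toy data, the helper names `split_train_test` and `fit_line`, the fixed random seed, and the choice of mean absolute error as the performance estimate are all made up for this example and are not taken from the slides.

```python
import random

def split_train_test(points, test_fraction=0.3, seed=0):
    """Shuffle a copy of the data and return (train, test)."""
    shuffled = points[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_line(points):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    b = sxy / sxx
    return mean_y - b * mean_x, b

# Hypothetical data: y = 2x + 1 with a small alternating perturbation.
data = [(x, 2 * x + 1 + (0.5 if x % 2 else -0.5)) for x in range(10)]

train, test = split_train_test(data)   # steps 1 and 2
a, b = fit_line(train)                 # step 3: regression on the training set
# Step 4: estimate future performance using only the held-out test points.
test_error = sum(abs((a + b * x) - y) for x, y in test) / len(test)
print(len(train), len(test), test_error)
```

The key design point is that `fit_line` never sees the test points, so `test_error` is an estimate of performance on unseen data.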
LOOCV (Leave-One-Out Cross Validation)
[Figure: plots of y against x.]
For 𝑘 = 1 to 𝑛
1) Let (𝑥𝑘, 𝑦𝑘) be the 𝑘th record
2) Temporarily remove (𝑥𝑘, 𝑦𝑘) from the dataset
3) Train on the remaining 𝑛 − 1 data points
4) Note your error on (𝑥𝑘, 𝑦𝑘)
When you’ve done all points, report the mean error.
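The LOOCV loop can be written almost line for line in Python. The sketch below assumes a deliberately simple model, a no-intercept least-squares line y = b·x, and uses the absolute error as the per-point error; the function names and both toy datasets are hypothetical.

```python
def fit_slope(points):
    """Least-squares slope for the no-intercept model y = b * x."""
    return sum(x * y for x, y in points) / sum(x * x for x, _ in points)

def loocv_mean_abs_error(points):
    errors = []
    for k in range(len(points)):
        x_k, y_k = points[k]                      # 1) the k-th record
        remaining = points[:k] + points[k + 1:]   # 2) temporarily remove it
        b = fit_slope(remaining)                  # 3) train on the other n - 1 points
        errors.append(abs(b * x_k - y_k))         # 4) note the error on (x_k, y_k)
    return sum(errors) / len(errors)              # report the mean error

# Hypothetical data: one exactly linear set and one noisy set.
exact = [(1, 2), (2, 4), (3, 6), (4, 8)]
noisy = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]
print(loocv_mean_abs_error(exact), loocv_mean_abs_error(noisy))
```

Note that the model is refitted n times, once per left-out point, which is why LOOCV becomes expensive on large datasets.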
𝑘-Fold Cross Validation
[Figure: plot of y against x with the points colored by partition.]
Break the dataset into 𝑘 partitions randomly. In this example, we’ll have 𝑘 = 3 partitions colored blue, green and violet.
For the blue partition, train on all the points not in the blue partition. Find the test-set sum of errors on the blue points.
For the green partition, train on all the points not in the green partition. Find the test-set sum of errors on the green points.
For the violet partition, train on all the points not in the violet partition. Find the test-set sum of errors on the violet points.
Then report the mean error.
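A generic version of this procedure might look like the following sketch. It assumes that "mean error" means the summed held-out errors divided by the total number of points, and it uses a trivial mean predictor as a stand-in model; `k_fold_mean_error`, `fit_mean`, and the datasets are hypothetical names and values chosen for illustration.

```python
import random

def k_fold_mean_error(points, fit, k=3, seed=0):
    """Mean absolute held-out error over k random partitions.

    `fit` takes a list of training points and returns a predictor x -> y.
    """
    indices = list(range(len(points)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]     # k random partitions
    total_error = 0.0
    for fold in folds:
        held_out = set(fold)
        train = [p for i, p in enumerate(points) if i not in held_out]
        predict = fit(train)                      # train on the other partitions
        # Sum of errors on the held-out partition.
        total_error += sum(abs(predict(points[i][0]) - points[i][1]) for i in fold)
    return total_error / len(points)

def fit_mean(train):
    """Baseline model (an assumption for this sketch): always predict the training mean."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

constant = [(x, 5.0) for x in range(9)]
ramp = [(x, float(x)) for x in range(9)]
print(k_fold_mean_error(constant, fit_mean), k_fold_mean_error(ramp, fit_mean))
```

Passing the model in as a `fit` callable keeps the cross-validation loop independent of the learning algorithm being evaluated.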
What Is A Model Evaluation Metric?
A model evaluation metric (or performance metric) measures how well your
data mining or machine learning algorithm is performing on a given dataset.
Example:
If we apply a classification algorithm to a dataset, we first check how many of the data points were classified correctly. This is a performance metric, and its formal name is “accuracy.”
MAE
The mean absolute error (MAE) metric is given by
\[
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |\varepsilon_i| = \frac{1}{n} \sum_{i=1}^{n} |\hat{y}_i - y_i| ,
\]
where
• $n$ is the total number of observations
• $\hat{y}_i$ is the predicted value of the $i$th observation
• $y_i$ is the actual value of the $i$th observation
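Test Set Method Using MAE
1) Randomly choose 30% of the data to be in a test set
2) The remainder is a training set
3) Perform your regression on the training set
4) Estimate your future performance with the test set
\[
\mathrm{MAE} = \frac{1}{3} \left( |-3| + |-7| + |1| \right) = \frac{1}{3} (3 + 7 + 1) = \frac{11}{3} \approx 3.67
\]
[Figure: plot of y against x with the three test-set errors labelled −3, −7 and 1.]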
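The MAE calculation for the three test-set errors −3, −7 and 1 can be checked with a few lines of Python. The actual values below are hypothetical; only the differences between predicted and actual values come from the slide.

```python
def mean_absolute_error(actual, predicted):
    """MAE = (1/n) * sum of |predicted_i - actual_i|."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical actual values; predictions are offset by the errors -3, -7 and 1.
actual = [10, 20, 30]
predicted = [10 - 3, 20 - 7, 30 + 1]

mae = mean_absolute_error(actual, predicted)
print(round(mae, 2))
```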
RMSE
The root mean squared error (RMSE) metric is given by
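\[
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \varepsilon_i^2} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2} ,
\]
where
• $n$ is the total number of observations
• $\hat{y}_i$ is the predicted value of the $i$th observation
• $y_i$ is the actual value of the $i$th observation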
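Test Set Method Using RMSE
1) Randomly choose 30% of the data to be in a test set
2) The remainder is a training set
3) Perform your regression on the training set
4) Estimate your future performance with the test set
[Figure: plot of y against x with the three test-set errors labelled −3, −7 and 1.]
\[
\mathrm{RMSE} = \sqrt{\frac{1}{3} \left( 3^2 + 7^2 + 1^2 \right)} = \sqrt{\frac{59}{3}} \approx 4.43
\]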
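The same three errors can be run through a small RMSE function to confirm the arithmetic. As before, the actual values are hypothetical and only the errors −3, −7 and 1 come from the slide.

```python
import math

def root_mean_squared_error(actual, predicted):
    """RMSE = sqrt((1/n) * sum of (predicted_i - actual_i)^2)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical actual values; predictions are offset by the errors -3, -7 and 1.
actual = [10, 20, 30]
predicted = [10 - 3, 20 - 7, 30 + 1]

rmse = root_mean_squared_error(actual, predicted)
print(round(rmse, 2))
```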
MAE vs RMSE
Similarity
Both measures express average model prediction error in units of the variable of interest. They range from 0 to ∞ and are indifferent to the direction of errors. Lower values indicate better model performance.
Difference
Taking the square root of the average squared errors has some interesting implications for RMSE. Since the errors are squared before they are averaged, the RMSE gives a relatively high weight to large errors. This means the RMSE should be more useful when large errors are particularly undesirable.
Editor of Human in a Machine World. MAE and RMSE — Which Metric is Better?
https://medium.com/human-in-a-machine-world/mae-and-rmse-which-metric-is-better-e60ac3bde13d#.5utapadgw
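A small numerical experiment makes the difference concrete. The two error vectors below are invented for illustration: they have the same MAE, but one concentrates all of the error in a single large miss, and only the RMSE reacts to that.

```python
import math

def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Hypothetical errors: the same total absolute error, distributed differently.
even_errors = [2, 2, 2, 2]     # four moderate misses
spiky_errors = [0, 0, 0, 8]    # one large miss

for name, errors in [("even", even_errors), ("spiky", spiky_errors)]:
    print(name, mae(errors), rmse(errors))
```

Here both sets have an MAE of 2, while the RMSE doubles from 2 to 4 for the set with the single large error.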
Classification with Two Classes
Housing dataset
      Price   Fullbase
1     420     1
2     385     0
3     495     0
4     605     0
5     610     0
6     660     1
7     660     1
8     690     0
9     838     1
10    885     0
…     …       …
Predictor: Price. Response variable: Fullbase.
Distributions of Two Classes
[Figure: number of houses (vertical axis) against 𝑃 (horizontal axis), showing one distribution for houses with Fullbase and one for houses without Fullbase.]
Threshold
[Figure: the same two distributions with a threshold on 𝑃. Houses below the threshold are called “negative”; houses above it are called “positive”.]
True Positive (TP)
[Figure: the true positives highlighted: houses with Fullbase that are called “positive”.]
False Positive (FP)
[Figure: the false positives highlighted: houses without Fullbase that are called “positive”.]
True Negative (TN)
[Figure: the true negatives highlighted: houses without Fullbase that are called “negative”.]
False Negative (FN)
[Figure: the false negatives highlighted: houses with Fullbase that are called “negative”.]
2 × 2 Confusion Matrix for Two-Class Problems
                 Actual 1               Actual 0               ∑
Estimate 1       $TP$                   $FP$                   $\hat{N}_+ = TP + FP$
Estimate 0       $FN$                   $TN$                   $\hat{N}_- = FN + TN$
∑                $N_+ = TP + FN$        $N_- = FP + TN$        $N = TP + FP + TN + FN$
Kevin Murphy. Machine Learning: A Probabilistic Perspective, p. 183
• $N_+$ is the true number of positives
• $N_-$ is the true number of negatives
• $\hat{N}_+$ is the estimated number of positives
• $\hat{N}_-$ is the estimated number of negatives
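The four cells of the confusion matrix can be counted directly from paired actual and estimated labels. The sketch below assumes labels coded as 1 (positive) and 0 (negative); the function name `confusion_counts` and the label vectors are hypothetical.

```python
def confusion_counts(actual, estimated):
    """Count TP, FP, FN and TN for binary labels coded as 1 and 0."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, e in zip(actual, estimated):
        if e == 1:
            counts["TP" if a == 1 else "FP"] += 1
        else:
            counts["FN" if a == 1 else "TN"] += 1
    return counts

# Hypothetical labels for eight houses.
actual    = [1, 1, 1, 0, 0, 1, 0, 0]
estimated = [1, 0, 1, 0, 1, 1, 0, 0]

c = confusion_counts(actual, estimated)
true_positives_total = c["TP"] + c["FN"]        # N+: true number of positives
estimated_positives_total = c["TP"] + c["FP"]   # N-hat+: estimated number of positives
print(c, true_positives_total, estimated_positives_total)
```

The row and column sums follow from the four counts, so only TP, FP, FN and TN need to be stored.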
Accuracy
Overall, how often is the classifier correct?
\[
\mathrm{Accuracy} = \frac{TP + TN}{N} = \frac{100 + 50}{165} \approx 0.91
\]
                 Actual 1                     Actual 0                    ∑
Estimate 1       $TP = 100$                   $FP = 10$                   $\hat{N}_+ = TP + FP = 110$
Estimate 0       $FN = 5$                     $TN = 50$                   $\hat{N}_- = FN + TN = 55$
∑                $N_+ = TP + FN = 105$        $N_- = FP + TN = 60$        $N = 165$
Error Rate
Overall, how often is the classifier incorrect?
\[
1 - \mathrm{Accuracy} = 1 - \frac{TP + TN}{N} = \frac{FP + FN}{N} = \frac{10 + 5}{165} \approx 0.09
\]
True Positive Rate/Recall
When it’s actually yes, how often does it predict yes?
\[
TPR = \frac{TP}{TP + FN} = \frac{100}{105} \approx 0.95
\]
The true positive rate is also known as recall.
False Positive Rate
When it’s actually no, how often does it predict yes?
\[
FPR = \frac{FP}{FP + TN} = \frac{10}{60} \approx 0.17
\]
Precision
When it predicts yes, how often is it correct?
\[
\mathrm{Precision} = \frac{TP}{TP + FP} = \frac{100}{110} \approx 0.91
\]
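All five metrics can be recomputed from the four counts used in the worked examples (TP = 100, FP = 10, FN = 5, TN = 50). The function name `classification_metrics` and the dictionary keys are choices made for this sketch.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return the five metrics discussed above from confusion-matrix counts."""
    n = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / n,
        "error_rate": (fp + fn) / n,
        "tpr": tp / (tp + fn),        # true positive rate, also called recall
        "fpr": fp / (fp + tn),        # false positive rate
        "precision": tp / (tp + fp),
    }

metrics = classification_metrics(tp=100, fp=10, fn=5, tn=50)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Accuracy and error rate always sum to 1, which gives a quick sanity check on any implementation.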
ROC Graph/Curve/Space
• ROC graphs are two-dimensional graphs in which TPR is plotted on the Y axis and FPR is plotted on the X axis.
• An ROC graph depicts relative trade-offs between benefits (true positives) and costs (false positives).
• The figure shows an ROC graph with five classifiers/models labelled A through E.
ROC Graph/Curve/Space
• The lower left point (0, 0) represents the strategy of never issuing a positive classification; such a classifier commits no false positive errors but also gains no true positives.
• The upper right corner (1, 1) represents the opposite strategy of unconditionally issuing positive classifications.
• The point (0, 1) represents perfect classification. D’s performance is perfect, as shown.
• One point in ROC space is better than another if it lies to the northwest of it (TPR is higher, FPR is lower, or both).
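The "northwest" rule can be expressed as a simple comparison of two ROC points. In this sketch a point is an (FPR, TPR) pair; the function name `is_northwest` and the point `other` are hypothetical, while the three named strategies are the corner points described above.

```python
def is_northwest(a, b):
    """True if ROC point a = (fpr, tpr) is no worse than b on both axes
    and strictly better on at least one."""
    fpr_a, tpr_a = a
    fpr_b, tpr_b = b
    no_worse = fpr_a <= fpr_b and tpr_a >= tpr_b
    strictly_better = fpr_a < fpr_b or tpr_a > tpr_b
    return no_worse and strictly_better

never_positive = (0.0, 0.0)    # lower left corner
always_positive = (1.0, 1.0)   # upper right corner
perfect = (0.0, 1.0)           # perfect classification
other = (0.3, 0.7)             # hypothetical classifier

print(is_northwest(perfect, other))
print(is_northwest(never_positive, always_positive))
```

The two trivial strategies are not comparable under this rule: each is better on one axis and worse on the other.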
Best ROC Curve
[Figure: ROC curve with TPR (0% to 100%) on the vertical axis and FPR (0% to 100%) on the horizontal axis, reaching TPR = 1 at FPR = 0, shown next to the distributions of houses with and without Fullbase over 𝑃.]
The distributions don’t overlap at all.
Worse ROC Curve
[Figure: ROC curve with TPR (0% to 100%) on the vertical axis and FPR (0% to 100%) on the horizontal axis, lying on the diagonal FPR = TPR, shown next to the distributions of houses with and without Fullbase over 𝑃.]
The distributions overlap completely.
Plotting An ROC Curve
Threshold   TPR (%)   FPR (%)
0.25        99        50
0.3         97        39
0.4         83        20
0.6         60        10
0.8         40        5
0.9         20        2
[Figure: ROC curve of TPR against FPR with one point per threshold, labelled 0.25, 0.3, 0.4, 0.6, 0.8 and 0.9.]
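Each row of a threshold table like this one corresponds to one point on the ROC curve. The sketch below shows how such points could be computed from classifier scores; the scores and labels are invented, the rule "call a house positive when its score is at or above the threshold" is an assumption, and only the threshold values are taken from the table.

```python
def roc_point(scores, labels, threshold):
    """Return (FPR, TPR) when scores >= threshold are called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return fp / n_neg, tp / n_pos

# Hypothetical scores and true labels (1 = positive, 0 = negative).
scores = [0.10, 0.20, 0.35, 0.50, 0.70, 0.85, 0.95]
labels = [0, 0, 1, 0, 1, 1, 1]

thresholds = [0.25, 0.3, 0.4, 0.6, 0.8, 0.9]
points = [roc_point(scores, labels, t) for t in thresholds]
for t, (fpr, tpr) in zip(thresholds, points):
    print(f"threshold={t}: TPR={tpr:.2f}, FPR={fpr:.2f}")
```

Raising the threshold can only move houses from "positive" to "negative", so both TPR and FPR are non-increasing in the threshold.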
AUC
Area under ROC curve (AUC) has an important statistical property:
• The AUC of a model is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance.
• It is often used to compare classifiers: the larger the AUC, the better.
AUC
[Figure: four ROC curves (TPR against FPR, each axis from 0% to 100%) with AUC = 50%, AUC = 90%, AUC = 65% and AUC = 100%.]
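The ranking interpretation of AUC suggests a direct, if inefficient, way to compute it: compare every positive score with every negative score. The sketch below does exactly that; counting ties as one half is a common convention assumed here, and the scores and labels are hypothetical.

```python
def auc_by_ranking(scores, labels):
    """Fraction of (positive, negative) pairs in which the positive
    instance has the higher score; ties count as one half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical scores and true labels (1 = positive, 0 = negative).
scores = [0.10, 0.20, 0.35, 0.50, 0.70, 0.85, 0.95]
labels = [0, 0, 1, 0, 1, 1, 1]

auc = auc_by_ranking(scores, labels)
print(auc)
```

This pairwise version costs time proportional to the number of positive-negative pairs, so it is meant for understanding the definition rather than for large datasets.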
Summary
• What Is A Model Evaluation Metric?
• Mean Absolute Error (MAE)
• Root Mean Squared Error (RMSE)
• Confusion Matrix
• Accuracy, Error Rate, True Positive Rate, False Positive Rate, Precision
• Receiver Operating Characteristic (ROC)
• Area Under ROC Curve (AUC)
References
• Editor of Human in a Machine World. MAE and RMSE — Which Metric is Better?
https://medium.com/human-in-a-machine-world/mae-and-rmse-which-metric-is-better-e60ac3bde13d#.5utapadgw
• David Page. Evaluating Machine Learning Methods. University of Wisconsin-Madison Lecture Slides, 2016
• Kevin Murphy. Machine Learning: A Probabilistic Perspective. Chapters 5, 6, 7 & 8
• Christopher Manning, Prabhakar Raghavan and Hinrich Schütze. An Introduction to Information Retrieval. Chapter 1
• Jesse Davis and Mark Goadrich. The Relationship Between Precision-Recall and ROC Curves. In ICML, 2006
Thank You!
bchen@Lincoln.ac.uk