CS373 Data Mining and Machine Learning
Lecture 9
Jean Honorio
Purdue University
Goal of machine learning
• Use algorithms that will perform well on unseen data
• How to measure performance?
• How to use unseen data?
• Variability?
• By-product: a way to set hyper-parameters
- e.g., C for SVMs, λ for kernel ridge regression, etc.
Measures of Performance: Regression
• Assume that for a point $x$, we predict $g(x)$
• Mean square error: $\mathrm{MSE}(g) = \frac{1}{n}\sum_{i=1}^{n} (g(x_i) - y_i)^2$
• Root mean square error: $\mathrm{RMSE}(g) = \sqrt{\mathrm{MSE}(g)}$
• Mean absolute error: $\mathrm{MAE}(g) = \frac{1}{n}\sum_{i=1}^{n} |g(x_i) - y_i|$
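A minimal NumPy sketch of these three measures (the arrays y_true and y_pred are made-up examples):

import numpy as np

def mse(y_true, y_pred):
    # Mean square error: average of squared residuals
    return np.mean((y_pred - y_true) ** 2)

def rmse(y_true, y_pred):
    # Root mean square error: square root of the MSE
    return np.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    # Mean absolute error: average of absolute residuals
    return np.mean(np.abs(y_pred - y_true))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.3])
print(mse(y_true, y_pred), rmse(y_true, y_pred), mae(y_true, y_pred))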
Measures of Performance: Classification
• True Positive (TP)
• True Negative (TN)
• False Positive (FP)
• False Negative (FN)

                     True Label
                     +1    -1
Predicted Label  +1  TP    FP
                 -1  FN    TN

• Accuracy = (TP + TN) / (TP + FP + FN + TN)
• Error = (FP + FN) / (TP + FP + FN + TN)
• Recall / Sensitivity = TP / (TP + FN)
• Precision = TP / (TP + FP)
• Specificity = TN / (TN + FP)
• Use jointly: (Precision, Recall) or (Sensitivity, Specificity)
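A minimal sketch computing all five measures from the four counts (the counts in the example call are made up):

def classification_measures(tp, fp, fn, tn):
    # Computes the measures above from confusion-matrix counts
    total = tp + fp + fn + tn
    return {
        'accuracy':    (tp + tn) / total,
        'error':       (fp + fn) / total,
        'recall':      tp / (tp + fn),   # a.k.a. sensitivity
        'precision':   tp / (tp + fp),
        'specificity': tn / (tn + fp),
    }

print(classification_measures(tp=40, fp=10, fn=5, tn=45))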
Precision and Recall
• Idea comes from information retrieval
[Figure: precision and recall illustration; from Wikipedia]
Sensitivity and Specificity
• Idea comes from signal detection theory
• Assume Gaussian distributions
• By sliding the threshold $\theta_0$ we get different (TP, FP, TN, FN), and thus different sensitivity and specificity

$p(x \mid y = +1) = N(\mu_1, \sigma^2) \qquad p(x \mid y = -1) = N(\mu_2, \sigma^2)$

Classifier: $g(x) = \begin{cases} +1, & \text{if } x < \theta_0 \\ -1, & \text{if } x \geq \theta_0 \end{cases}$

Pattern Classification, Duda et al., 2nd Ed., Chapter 2
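A small sketch of this effect using scipy's normal CDF; the values of mu1, mu2, sigma are made up, with mu1 < mu2 so that predicting +1 for x < theta0 makes sense:

from scipy.stats import norm

mu1, mu2, sigma = 0.0, 2.0, 1.0   # made-up parameters, with mu1 < mu2

def sensitivity(theta0):
    # P(x < theta0 | y = +1): fraction of the +1 class predicted +1
    return norm.cdf(theta0, loc=mu1, scale=sigma)

def specificity(theta0):
    # P(x >= theta0 | y = -1): fraction of the -1 class predicted -1
    return 1.0 - norm.cdf(theta0, loc=mu2, scale=sigma)

# Sliding theta0 trades sensitivity against specificity
for theta0 in (0.5, 1.0, 1.5):
    print(theta0, sensitivity(theta0), specificity(theta0))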
Receiver Operating Characteristic (ROC)
• By varying the hyper-parameter of a classifier (e.g., C for an SVM with linear kernel, β for an SVM with RBF kernels) we can get different:
  - Sensitivity
  - Specificity
• Summarized with an Area Under the Curve (AUC):
  - Random: 0.5
  - Perfect classifier: 1
[Figure: ROC curves for an SVM with linear kernel, an SVM with RBF kernels, and logistic regression; from The Elements of Statistical Learning, Hastie et al.]
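The slide traces the curve by varying a hyper-parameter; a simpler way to illustrate the same idea is to sweep a decision threshold over a classifier's real-valued scores. In this sketch the scores and labels are made up:

import numpy as np

def roc_auc(scores, labels):
    # Sweep the threshold over all observed scores
    # (higher score means predict +1); collect (FPR, TPR) points
    pos = np.sum(labels == +1)
    neg = np.sum(labels == -1)
    fpr, tpr = [0.0], [0.0]
    for thresh in np.sort(np.unique(scores))[::-1]:
        pred = np.where(scores >= thresh, +1, -1)
        tpr.append(np.sum((pred == +1) & (labels == +1)) / pos)  # sensitivity
        fpr.append(np.sum((pred == +1) & (labels == -1)) / neg)  # 1 - specificity
    # Area under the curve by the trapezoid rule
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
               for i in range(1, len(fpr)))

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3])  # made-up scores
labels = np.array([+1, +1, -1, +1, -1, -1])
print(roc_auc(scores, labels))   # 8/9, close to 1 for a good ranking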
Other Loss Functions
• Let +1 mean "diseased patient" and -1 mean "healthy patient"
• 0/1 loss: every mistake costs 1

                     True Label
                     +1    -1
Predicted Label  +1   0     1
                 -1   1     0

• Asymmetric cost: missing a diseased patient (a false negative) costs 10, while a false alarm costs 1

                     True Label
                     +1    -1
Predicted Label  +1   0     1
                 -1  10     0

• Average 0/1 loss (the error rate): $\frac{1}{n}\sum_{i=1}^{n} 1[g(x_i) \neq y_i]$
• Average cost in general: $\frac{1}{n}\sum_{i=1}^{n} \mathrm{Cost}(g(x_i), y_i)$
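A minimal sketch of the general cost measure under the asymmetric cost matrix above (the labels in the example are made up):

import numpy as np

# Asymmetric cost matrix from the second table above:
# keys are (predicted, true); a false negative costs 10
COST = {(+1, +1): 0, (+1, -1): 1,
        (-1, +1): 10, (-1, -1): 0}

def average_cost(y_pred, y_true, cost=COST):
    # (1/n) * sum of Cost(g(x_i), y_i)
    return np.mean([cost[p, t] for p, t in zip(y_pred, y_true)])

y_true = [+1, +1, -1, -1]   # made-up labels
y_pred = [+1, -1, -1, +1]   # one false negative, one false positive
print(average_cost(y_pred, y_true))   # (0 + 10 + 0 + 1) / 4 = 2.75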
2) Using “Unseen” Data
• Overfitting:
  - Bigger model class, better fit on the training data (linear to quadratic to cubic)
  - Finding hyper-parameters that better fit the training data
  - Usually poor performance on unseen data
• To prevent overfitting, how can we "see" unseen data?
  - Simulate it!
Pattern Classification, Duda et al., 2nd Ed., Chapter 9
Training, Validation, Testing
• Three data sets:
  - Training set: fit the models
  - Validation set: try different algorithms, different hyper-parameters
  - Test set: report measures
k-Fold Cross Validation
• Split training data D into k disjoint sets S1,…,Sk
- Either randomly, or in a fixed fashion
- If D has n samples, then each fold has approximately n / k samples
- Popular choices: k=5, k=10, k=n (leave-one-out)
• For i = 1…k:
train with sets S1,…,Si–1, Si+1,…,Sk
test on set Si
let Mi be the test measure (e.g., accuracy, MSE)
• Mean and variance are:
$\hat{\mu} = \frac{1}{k}\sum_{i=1}^{k} M_i \qquad \hat{\sigma}^2 = \frac{1}{k}\sum_{i=1}^{k} (M_i - \hat{\mu})^2$
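A minimal sketch of the loop above, assuming a hypothetical callback train_and_eval(X_train, y_train, X_test, y_test) that fits a model and returns the test measure:

import numpy as np

def k_fold_measures(X, y, k, train_and_eval):
    # Split indices into k disjoint folds (here: randomly)
    n = len(y)
    folds = np.array_split(np.random.permutation(n), k)
    measures = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Train on the other k-1 folds, test on fold i
        measures.append(train_and_eval(X[train_idx], y[train_idx],
                                       X[test_idx], y[test_idx]))
    measures = np.array(measures)
    # np.var divides by k, matching the 1/k in the formula above
    return measures.mean(), measures.var()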
0.632 Bootstrapping
• Let B > 0, and n be the number of training samples in D
• For i = 1…B:
    pick n samples from D with replacement, call it Si
    (Si might contain the same sample more than once)
    train with set Si
    test on the remaining samples (D – Si)
    let Mi be the test measure (e.g., accuracy, MSE)
• Mean and variance are:

$\hat{\mu} = \frac{1}{B}\sum_{i=1}^{B} M_i \qquad \hat{\sigma}^2 = \frac{1}{B}\sum_{i=1}^{B} (M_i - \hat{\mu})^2$
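A matching sketch, reusing the same hypothetical train_and_eval callback as in the k-fold example:

import numpy as np

def bootstrap_measures(X, y, B, train_and_eval):
    n = len(y)
    measures = []
    for _ in range(B):
        # Pick n sample indices with replacement (duplicates allowed)
        boot = np.random.randint(0, n, size=n)
        # Test on the samples that were never picked (D - Si)
        out_of_bag = np.setdiff1d(np.arange(n), boot)
        measures.append(train_and_eval(X[boot], y[boot],
                                       X[out_of_bag], y[out_of_bag]))
    measures = np.array(measures)
    return measures.mean(), measures.var()   # np.var divides by B, as above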
0.632 Bootstrapping
• Why 0.632 ?
• Recall that:
- We pick n items with replacement out of n items
- We choose uniformly at random
• The probability of:
  - not picking one particular item in 1 draw is $1 - 1/n$
  - not picking one particular item in n draws is $(1 - 1/n)^n$
  - picking one particular item in n draws is $1 - (1 - 1/n)^n$
• Finally: $\lim_{n\to\infty} 1 - (1 - 1/n)^n = 1 - 1/e \approx 0.632$
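A quick numeric check of the limit:

for n in (10, 100, 10000):
    print(n, 1 - (1 - 1 / n) ** n)   # approaches 1 - 1/e ≈ 0.632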
3) Variability
• How to compare two algorithms?
- Not only means, but also variances!
• Statistical hypothesis testing
Statistical Hypothesis Testing
• Let $\hat{\mu}_1, \hat{\sigma}_1^2, \hat{\mu}_2, \hat{\sigma}_2^2$ be the means and variances of algorithms 1 and 2
• When to reject the null hypothesis $\mu_1 = \mu_2$ in favor of $\mu_1 > \mu_2$?
• Let:

$x = \frac{(\hat{\mu}_1 - \hat{\mu}_2)\sqrt{n}}{\sqrt{\hat{\sigma}_1^2 + \hat{\sigma}_2^2}} \qquad \nu = \frac{(\hat{\sigma}_1^2 + \hat{\sigma}_2^2)^2\,(n-1)}{\hat{\sigma}_1^4 + \hat{\sigma}_2^4}$

where $\nu$ is the degrees of freedom of Student's t-distribution.
Statistical Hypothesis Testing
• Student's t-distribution:

[Figure: probability density function (PDF) and cumulative density function (CDF) of Student's t-distribution; from Wikipedia]

• For significance level $\alpha$ and degrees of freedom $\nu$:
  - Find the value $x_{1-\alpha,\nu}$ for which the CDF equals $1-\alpha$
  - Python: from scipy.stats import t
            t.ppf(1-alpha, v)
• If $x > x_{1-\alpha,\nu}$, reject the null hypothesis $\mu_1 = \mu_2$ in favor of $\mu_1 > \mu_2$
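Putting the two slides together, a minimal sketch of this test; the per-fold accuracies M1 and M2 are made-up numbers for illustration:

import numpy as np
from scipy.stats import t

def reject_null(M1, M2, alpha=0.05):
    # M1, M2: measures (e.g., fold accuracies) for algorithms 1 and 2
    n = len(M1)
    mu1, mu2 = np.mean(M1), np.mean(M2)
    var1, var2 = np.var(M1), np.var(M2)
    x = (mu1 - mu2) * np.sqrt(n) / np.sqrt(var1 + var2)
    v = (var1 + var2) ** 2 * (n - 1) / (var1 ** 2 + var2 ** 2)
    # x_{1-alpha,nu}: the value where the t CDF equals 1 - alpha
    threshold = t.ppf(1 - alpha, v)
    return x > threshold   # True: reject mu1 = mu2 in favor of mu1 > mu2

M1 = np.array([0.82, 0.85, 0.81, 0.84, 0.83])   # made-up fold accuracies
M2 = np.array([0.74, 0.76, 0.73, 0.77, 0.75])
print(reject_null(M1, M2))   # True for these numbers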
What is a Sample?
• In this lecture we assume that each sample is a different
“unit of interest” for the experimenter
• Never sample the same “unit of interest” several times
- In a medical application, we might be interested in knowing the accuracy (and variance) with respect to patients.
- Taking two visits of the same patient as two different samples
would be incorrect.
• Collect more data, if necessary
• Never duplicate (copy & paste) data.