STAT318 — Data Mining
Dr
University of Canterbury, Christchurch,
Some of the figures in this presentation are taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.
Cross-Validation and the Bootstrap
In this section we discuss two important resampling methods: cross-validation and the bootstrap.
These methods use samples formed from the training data to obtain additional information about a fitted model or an estimator.
They can be used for estimating prediction error, determining appropriate model flexibility, estimating standard errors, …
Training error vs. test error
The training error is the average error that results from applying a statistical learning method to the observations used for training — a simple calculation.
The test error is the average error that results from applying a statistical learning technique to test observations that were not used for training — a simple calculation if test data exists, but we usually only have training data.
The training error tends to dramatically under-estimate the test error.
We want the test error to be low, but we usually only have training data.
[Figure: prediction error versus model flexibility for the training sample and the testing sample. Low flexibility corresponds to high bias and low variance; high flexibility to low bias and high variance.]
We want to be able to determine the correct level of model flexibility (be near the minimum of the testing error curve) for our particular statistical learning problem.
Validation Set Approach
A very simple strategy is to randomly divide the training data into two sets:
1 Training Set: Fit the model using the training set.
2 Validation Set: Predict the response values for the observations in the validation set.
The validation set error provides an estimate of the test error (MSE for regression and error rate for classification).
The key word here is randomly. You should always randomly split the data to destroy data entry ordering and to ensure that each set captures all the characteristics of the population. You also need to be careful about ephemeral predictors (those whose predictive powers fade away over time). If their predictive powers are strong in the training set and weak in the testing set, poor error estimates can be obtained.
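A minimal sketch of this split in R (assuming the Auto data from the ISLR package, which is used in the example that follows, with mpg as the response and horsepower as the predictor):

library(ISLR)   # assumed source of the Auto data

set.seed(1)                               # make the random split reproducible
n <- nrow(Auto)
train <- sample(n, n / 2)                 # random 50% split destroys any entry ordering

fit <- lm(mpg ~ horsepower, data = Auto, subset = train)

# Validation set MSE: errors are computed on the held-out half only
mean((Auto$mpg - predict(fit, Auto))[-train]^2)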
Validation Set Approach
In this example, the training data are randomly split into two sets of approximately the same size. The blue set is used for training and the orange set for validation.
Example: auto data
[Figure: validation set MSE versus degree of polynomial (1–10) for the Auto data.]
Find the best level of flexibility in polynomial regression using the validation set approach (50% used for training).
You will recall this example from earlier slides, where we were using horsepower to predict miles per gallon (mpg). We consider polynomial regression models of the form
$$\text{mpg} = \beta_0 + \sum_{i=1}^{d} \beta_i (\text{horsepower})^i,$$
where $d$ ranges from 1 through 10. Using the validation set approach, we see that the quadratic model ($d = 2$) is the best model here. Adding higher-order polynomial terms to the model does not substantially reduce the test MSE, but it does make the model more complex. We should always choose the most parsimonious model (the simplest model that works well).
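A sketch of this degree selection in R (assuming the same ISLR Auto data and a fresh random 50% split):

library(ISLR)

set.seed(1)
train <- sample(nrow(Auto), nrow(Auto) / 2)

# Validation set MSE for polynomial degrees d = 1, ..., 10
val.mse <- sapply(1:10, function(d) {
  fit <- lm(mpg ~ poly(horsepower, d), data = Auto, subset = train)
  mean((Auto$mpg - predict(fit, Auto))[-train]^2)
})
which.min(val.mse)   # the minimum is typically near d = 2; higher degrees gain little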
Example: auto data
[Figure: validation set MSE versus degree of polynomial (1–10), one curve per random split.]
The validation set approach using different training and validation sets (50% used for training).
We can see that the validation set approach is sensitive to the way the training data are split. The curves have the same shape and suggest the same level of flexibility (quadratic in this case), but the test MSE estimates are highly variable. There are two main reasons for this variability:
• the training data split; and
• only half the observations are used to fit the model.
If the training data set is large, the validation set approach works well because there are enough observations to fit the model and the data split variability tends to be small.
Drawbacks of the validation set approach
The validation set estimate of the test error can be highly variable.
Only a subset of observations are used to train the model. Hence, the validation set test error will tend to over-estimate the test error.
Train the final model with the complete data set.
K-fold cross-validation
1 The training data are divided into K groups (folds) of approximately equal size.
2 The first fold is used as a validation set and the remaining (K − 1) folds are used for training. The test error is then computed on the validation set.
3 This procedure is repeated K times, using a different fold as the validation set each time.
4 The estimate of the test error is the average of the K test errors from each validation set.
Cross-validation attempts to mimic the validation set approach without the need for a validation set. The training data should be randomly partitioned into K groups to destroy data entry ordering etc.
5-fold cross-validation
Each row corresponds to one iteration of the algorithm, where the orange set is used for validation and the blue set is used for training.
If K = n, we have leave-one-out cross-validation (LOOCV).
The LOOCV method can be computationally expensive because you need to fit many models. However, for linear models, the LOOCV test error can be computed from one fit (see page 180 in the course textbook). Best practice (at least according to Hastie et al.) is to choose K = 5 or 10, possibly averaged over several random data splits.
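For least squares linear models, that one-fit identity is easy to use directly: the LOOCV MSE is the average of the squared leave-one-out residuals $(y_i - \hat{y}_i)/(1 - h_i)$, where $h_i$ is the ith leverage. A sketch, again assuming the ISLR Auto data:

library(ISLR)

fit <- lm(mpg ~ horsepower, data = Auto)

# LOOCV MSE from a single fit, via the leverages (diagonal of the hat matrix)
h <- hatvalues(fit)
mean((residuals(fit) / (1 - h))^2)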
Downside to cross-validation: computational cost.
Cross-validation test error
We randomly divide the training data of n observations into K groups of approximately equal size, $C_1, C_2, \ldots, C_K$.
Regression Problems (average MSE):
$$\mathrm{CV}_K = \frac{1}{n} \sum_{k=1}^{K} \sum_{i:\, x_i \in C_k} (y_i - \hat{y}_i)^2.$$
Classification Problems (average error rate):
$$\mathrm{CV}_K = \frac{1}{n} \sum_{k=1}^{K} \sum_{i:\, x_i \in C_k} I(y_i \ne \hat{y}_i).$$
Assuming n/K is an integer (or close enough to an integer), we have
$$\mathrm{CV}_K = \frac{1}{n} \sum_{k=1}^{K} \sum_{i:\, x_i \in C_k} (y_i - \hat{y}_i)^2 = \frac{1}{K} \sum_{k=1}^{K} \mathrm{MSE}_k,$$
where
$$\mathrm{MSE}_k = \frac{K}{n} \sum_{i:\, x_i \in C_k} (y_i - \hat{y}_i)^2.$$
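These formulas translate directly into a fold loop. A sketch for the regression case (assuming the ISLR Auto data and a quadratic fit; the fold labels are assigned at random):

library(ISLR)

set.seed(1)
K <- 10
folds <- sample(rep(1:K, length.out = nrow(Auto)))   # random fold labels C_1, ..., C_K

fold.mse <- sapply(1:K, function(k) {
  fit <- lm(mpg ~ poly(horsepower, 2), data = Auto[folds != k, ])
  pred <- predict(fit, Auto[folds == k, ])
  mean((Auto$mpg[folds == k] - pred)^2)              # MSE_k on fold C_k
})
mean(fold.mse)                                       # CV_K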
Cross-validation: auto data
[Figure: LOOCV error (left) and 10-fold CV error (right) versus degree of polynomial (1–10) for the Auto data.]
The right plot shows nine different 10-fold cross-validations.
We can see that the variability in the estimated test MSE has been removed (by averaging). The one standard error (SE) rule can be used to choose the best model (although you are not expected to in this course). We choose the most parsimonious model whose test MSE is within one SE of the minimum test MSE. The SE is computed using
$$\mathrm{SE} = \sqrt{\frac{\mathrm{var}(\mathrm{MSE}_1, \mathrm{MSE}_2, \ldots, \mathrm{MSE}_K)}{K}},$$
where K is the number of folds.
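For completeness (the one SE rule is not assessed), a sketch of the rule in R, building the fold-by-degree MSE matrix directly (same assumed Auto data and random fold labels as above):

library(ISLR)

set.seed(1)
K <- 10
folds <- sample(rep(1:K, length.out = nrow(Auto)))

# Rows are folds, columns are polynomial degrees 1, ..., 10
fold.mse <- sapply(1:10, function(d) {
  sapply(1:K, function(k) {
    fit <- lm(mpg ~ poly(horsepower, d), data = Auto[folds != k, ])
    mean((Auto$mpg[folds == k] - predict(fit, Auto[folds == k, ]))^2)
  })
})

cv <- colMeans(fold.mse)                   # CV error for each degree
se <- apply(fold.mse, 2, sd) / sqrt(K)     # SE of each CV estimate
best <- which.min(cv)
min(which(cv <= cv[best] + se[best]))      # most parsimonious degree within one SE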
K-fold cross-validation: classification
[Figure: simulated two-class data in the predictors X1 and X2. The Bayes error rate is 0.133 in this example.]
You will recall this example from earlier in the course, where the dashed line is the Bayes decision boundary. The Bayes error rate is the test error rate of the Bayes classifier (the best you can do), which can be calculated here because this is a simulated example.
K-fold cross-validation: logistic regression
[Figure: logistic regression decision boundaries on the simulated two-class data for Degree = 1 (left) and Degree = 2 (right). The test error rates are 0.201 and 0.197, respectively.]
Degree 1:
$$\ln\left(\frac{p(x)}{1-p(x)}\right) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2$$
Degree 2:
$$\ln\left(\frac{p(x)}{1-p(x)}\right) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_1^2 + \hat{\beta}_4 x_2^2$$
K-fold cross-validation: logistic regression
[Figure: logistic regression decision boundaries on the simulated two-class data for Degree = 3 (left) and Degree = 4 (right). The test error rates are 0.160 and 0.162, respectively.]
Degree 3:
$$\ln\left(\frac{p(x)}{1-p(x)}\right) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_1^2 + \hat{\beta}_4 x_2^2 + \hat{\beta}_5 x_1^3 + \hat{\beta}_6 x_2^3$$
Degree 4:
$$\ln\left(\frac{p(x)}{1-p(x)}\right) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_1^2 + \hat{\beta}_4 x_2^2 + \hat{\beta}_5 x_1^3 + \hat{\beta}_6 x_2^3 + \hat{\beta}_7 x_1^4 + \hat{\beta}_8 x_2^4$$
K-fold cross-validation: logistic regression and KNN
[Figure: test error (orange), training error (blue) and 10-fold cross-validation error (black). The left plot shows logistic regression against the order of polynomials used (1–10); the right plot shows KNN against 1/K (0.01–1).]
Recall that the Bayes error for this problem was 0.133. As we increase the complexity of the models, we start over-fitting the training data (as expected) and the characteristic U curve is visible. The CV error rate does a good job at finding the correct level of flexibility (the base of the U) for each model. However, in this example, the CV errors under-estimate the true error rate (but at least we can actually compute the CV error!).
Comments
Since each training set has approximately (1 − 1/K)n observations, the cross-validation test error will tend to over-estimate the prediction error.
LOOCV minimizes this upward bias, but this estimate has high variance.
K = 5 or 10 provides a good compromise for this bias-variance trade-off.
LOOCV can have high variance because the training data sets used for each fit differ by only one observation (the training folds are highly correlated). Once again, best practice (at least according to Hastie et al.) is to choose K = 5 or 10, possibly averaged over several random data splits.
Cross-validation: right and wrong
Consider the following classifier for a two-class problem:
1 Starting with 1000 predictors and 50 observations, find the 20 predictors having the largest correlation with the response.
2 Apply a classifier using only these 20 predictors.
If we use cross-validation to estimate test error, can we simply apply it at step (2)?
The filtering step (subset selection) is a training step because the response variable is used. Hence, we cannot simply apply CV at step (2). We need to apply CV to the full learning process. That is, divide the data into K folds, find the 20 predictors that are most correlated with the response using K − 1 of the folds, and then fit the classifier to those 20 predictors. We would expect a different '20 best predictors' to be found at each iteration and hence, we expect to fit the model using different predictors at each iteration.
Unsupervised screening steps are fine (e.g. choose the 20 predictors with the highest variance across all 50 observations) because the response variable is not used (it’s not supervised).
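A sketch of the correct procedure on simulated pure-noise data (the logistic regression classifier, 0.5 threshold and 5 folds are our own choices; since the labels are independent of the predictors, an honest CV error should be close to 0.5):

set.seed(1)
n <- 50; p <- 1000
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("V", 1:p)))
y <- rbinom(n, 1, 0.5)                     # labels independent of X: no real signal

K <- 5
folds <- sample(rep(1:K, length.out = n))

errs <- sapply(1:K, function(k) {
  tr <- folds != k
  # Screening INSIDE the loop: correlations use the training folds only
  top20 <- order(abs(cor(X[tr, ], y[tr])), decreasing = TRUE)[1:20]
  dat <- data.frame(y = y, X[, top20])
  fit <- glm(y ~ ., data = dat, subset = tr, family = binomial)
  pred <- predict(fit, dat[!tr, ], type = "response") > 0.5
  mean(pred != (y[!tr] == 1))              # error rate on the held-out fold
})
mean(errs)                                 # roughly 0.5, as it should be

Moving the screening step outside the loop (using all 50 observations) makes the same code report a misleadingly low error rate.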
Bootstrap
The use of the term bootstrap derives from the phrase:
“to pull oneself up by one’s bootstraps”.
The bootstrap is a powerful statistical tool that can be used to quantify uncertainty associated with a statistical learning technique or a given estimator.
Example
Suppose we wish to invest a fixed sum of money in two financial assets that yield returns X and Y, respectively (X and Y are random variables).
We want to minimize the risk (variance) of our investment:
$$\min_{\alpha} V(\alpha X + (1-\alpha)Y),$$
where $0 \le \alpha \le 1$.
This is a motivating example for the Bootstrap. We can minimise the variance (see below), but we are really just interested in obtaining a statistic. Then we can apply the Bootstrap to investigate properties of our statistic.
Optional material (not assessed) for those interested.
Properties of variance:
$$V(aX) = a^2 V(X)$$
$$V(X + Y) = V(X) + V(Y) + 2\,\mathrm{cov}(X, Y)$$
Expanding the variance using these properties gives
$$V(\alpha X + (1-\alpha)Y) = \alpha^2 V(X) + (1-\alpha)^2 V(Y) + 2\alpha(1-\alpha)\,\mathrm{cov}(X, Y).$$
To minimize this expression we set $\frac{d}{d\alpha} V(\cdot) = 0$, which gives
$$2\alpha V(X) - 2(1-\alpha)V(Y) + (2 - 4\alpha)\,\mathrm{cov}(X, Y) = 0.$$
Hence, the $\alpha$ that minimises the expression is
$$\alpha = \frac{V(Y) - \mathrm{cov}(X, Y)}{V(X) + V(Y) - 2\,\mathrm{cov}(X, Y)} = \frac{\sigma_Y^2 - \sigma_{XY}}{\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}}.$$
Example
The values of $\sigma_X^2$, $\sigma_Y^2$ and $\sigma_{XY}$ are unknown and hence need to be estimated from sample data.
We can then estimate the $\alpha$ value that minimizes the variance of our investment using
$$\hat{\alpha} = \frac{\hat{\sigma}_Y^2 - \hat{\sigma}_{XY}}{\hat{\sigma}_X^2 + \hat{\sigma}_Y^2 - 2\hat{\sigma}_{XY}}.$$
$\hat{\alpha}$ is an estimator, but we don't know its sampling distribution or its standard error.
Revision material: I hope you know the sample variance and sample covariance, but if you don't, here they are:
$$\hat{\sigma}_X^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2;$$
$$\hat{\sigma}_Y^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2;$$
$$\hat{\sigma}_{XY} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}),$$
where $\bar{x}$ and $\bar{y}$ are the sample means of x and y, respectively.
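In R, $\hat{\alpha}$ maps directly onto var() and cov(). A sketch (the function name alpha.fn and its index argument, which lets the bootstrap resample rows later, follow the style of the course textbook's lab):

# alpha-hat computed from the rows of `data` selected by `index`
alpha.fn <- function(data, index) {
  X <- data$X[index]
  Y <- data$Y[index]
  (var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y))
}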
Example: simulated returns for investments X and Y
Each plot shows 100 simulated returns for investments X and Y .
• Each sample gives an estimate of $\alpha$.
• We can use sample statistics to look at statistical properties of our estimator $\hat{\alpha}$, for example, its standard error (the standard deviation of an estimator).
The big assumption here is that we can draw as many samples from the population as we like. If we could do this, statistics would be a trivial subject! Clearly, we cannot do this in practice; we're just trying to illustrate a point here.
Example: simulated sampling distribution for $\hat{\alpha}$
The sampling distribution of $\hat{\alpha}$ using 1000 simulated data sets (a histogram of 1000 $\hat{\alpha}$ values, one from each data set). The true value of $\alpha$ is shown by the vertical line ($\alpha = 0.6$).
Example: statistics from 1000 observations
The sample mean is
$$\bar{\alpha} = \frac{1}{1000} \sum_{i=1}^{1000} \hat{\alpha}_i = 0.5996,$$
which is very close to the true value, $\alpha = 0.6$. The standard deviation is
$$\mathrm{sd}(\hat{\alpha}) = \sqrt{\frac{1}{1000 - 1} \sum_{i=1}^{1000} (\hat{\alpha}_i - \bar{\alpha})^2} = 0.083,$$
which gives an approximate standard error of $\mathrm{SE}(\hat{\alpha}) = 0.083$.
Statistics is very easy if we can draw as many samples from the population as we like!
Real world
The procedure we have discussed cannot be applied in the real world because we cannot sample the original population many times.
The bootstrap comes to the rescue.
Rather than sampling the population many times directly, we repeatedly sample
the observed sample data using random sampling with replacement.
These bootstrap samples are the same size as the original sample (n observations) and will likely contain repeated observations.
Bootstrap samples contain approximately 2/3 of the original training observations (or, if you prefer, approximately 1/3 of the original training observations are not included in a bootstrap sample). The probability of a training observation $x_i$ being included (at least once) in the bootstrap sample is
$$\Pr(x_i \in \text{Bootstrap}) = 1 - (1 - 1/n)^n.$$
Plugging in some n values we see that
$$\Pr(x_i \in \text{Bootstrap}) = 0.6513 \text{ for } n = 10,$$
$$\Pr(x_i \in \text{Bootstrap}) = 0.6340 \text{ for } n = 100.$$
As $n \to \infty$ you can show (not in this course) that
$$\lim_{n \to \infty} 1 - (1 - 1/n)^n = 1 - \exp(-1) \approx 2/3.$$
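A quick numeric check of this probability in R:

n <- c(10, 100, 1000, 10000)
data.frame(n = n, prob = 1 - (1 - 1/n)^n)   # approaches 1 - exp(-1) = 0.6321...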
Bootstrap
Original data Z:

Obs  X    Y
1    4.3  2.4
2    2.1  1.1
3    5.3  2.8

Bootstrap sample Z*1:

Obs  X    Y
3    5.3  2.8
1    4.3  2.4
3    5.3  2.8

Bootstrap sample Z*2:

Obs  X    Y
2    2.1  1.1
3    5.3  2.8
1    4.3  2.4

Bootstrap sample Z*B:

Obs  X    Y
2    2.1  1.1
2    2.1  1.1
1    4.3  2.4
[Figure: bootstrap schematic. The original data Z is resampled with replacement to give bootstrap samples $Z^{*1}, Z^{*2}, \ldots, Z^{*B}$, and each bootstrap sample yields an estimate $\hat{\alpha}^{*1}, \hat{\alpha}^{*2}, \ldots, \hat{\alpha}^{*B}$.]
In this figure, there are only three observations and hence, the bootstrap is pointless. We are just trying to illustrate the procedure here! That is, each bootstrap sample gives an estimate of $\alpha$ (the estimator of interest) and if we draw many bootstrap samples, we get many different estimates of $\alpha$.
Example: bootstrap sampling distribution for $\hat{\alpha}$
(Left) The simulated sampling distribution for $\hat{\alpha}$. (Right) Histogram showing 1000 bootstrap estimates of $\alpha$. The bootstrap histogram is similar to the simulated sampling distribution:
• both have a similar shape;
• both are centred on $\alpha = 0.6$; and
• both have similar variance.
Hence, we can learn statistical properties of $\hat{\alpha}$, for example its standard error, by looking at sample statistics from the bootstrap samples.
Example: statistics from B bootstrap samples
Let $Z^{*i}$ and $\hat{\alpha}^{*i}$ denote the ith bootstrap sample and the ith bootstrap estimate of $\alpha$, respectively.
We estimate the standard error of $\hat{\alpha}$ using
$$\mathrm{SE}_B = \sqrt{\frac{1}{B-1} \sum_{i=1}^{B} (\hat{\alpha}^{*i} - \bar{\alpha}^*)^2},$$
where B is some large value (say 1000) and
$$\bar{\alpha}^* = \frac{1}{B} \sum_{i=1}^{B} \hat{\alpha}^{*i}.$$
For our example, $\mathrm{SE}_B = 0.087$.
The simulated (true) SE was 0.083, so $\mathrm{SE}_B \approx \mathrm{SE}(\hat{\alpha})$ for this example.
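Putting the pieces together with the boot package (a sketch assuming the Portfolio data from the ISLR package, which has return columns X and Y; alpha.fn is redefined here so the snippet stands alone):

library(ISLR)   # assumed source of the Portfolio data
library(boot)

alpha.fn <- function(data, index) {
  with(data[index, ], (var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y)))
}

set.seed(1)
# 1000 bootstrap resamples; the printout reports alpha-hat and its bootstrap SE
boot(Portfolio, alpha.fn, R = 1000)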
The general picture
An empirical distribution function (EDF) for a sample of n = 20 standard normal random variables.
The EDF is defined as
$$\hat{F}_n(x) = \frac{\#\{x_i \le x\}}{n},$$
the proportion of sample values less than or equal to x.
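In R the EDF of a sample is available directly via ecdf(). A small sketch matching the figure:

set.seed(1)
x <- rnorm(20)    # n = 20 standard normal draws, as in the figure
Fn <- ecdf(x)     # the step function F-hat_n
Fn(0)             # proportion of the sample at or below 0
plot(Fn)          # the staircase shown above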