STAT318 — Data Mining
Dr
University of Canterbury, Christchurch, New Zealand
Some of the figures in this presentation are taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.
Cross-Validation and the Bootstrap
In this section we discuss two important resampling methods: cross-validation and the bootstrap.
These methods use samples formed from the training data to obtain additional information about a fitted model or an estimator.
They can be used for estimating prediction error, determining appropriate model flexibility, estimating standard errors, …
Training error vs. test error
The training error is the average error that results from applying a statistical learning method to the observations used for training — a simple calculation.
The test error is the average error that results from applying a statistical learning technique to test observations that were not used for training — a simple calculation if test data exists, but we usually only have training data.
The training error tends to dramatically under-estimate the test error.
Performance
[Figure: prediction error (y-axis) versus model flexibility (x-axis, low to high) for the training sample and the testing sample, ranging from high bias/low variance at low flexibility to low bias/high variance at high flexibility.]
We want to be able to determine the correct level of model flexibility (be near the minimum of the testing error curve) for our particular statistical learning problem.
Validation Set Approach
A very simple strategy is to randomly divide the training data into two sets:
1 Training Set: Fit the model using the training set.
2 Validation Set: Predict the response values for the observations in the validation set.
The validation set error provides an estimate of the test error (MSE for regression and error rate for classification).
The key word here is randomly. You should always randomly split the data to destroy data entry ordering and to ensure that each set captures all the characteristics of the population. You also need to be careful about ephemeral predictors (those whose predictive power fades over time). If their predictive power is strong in the training set and weak in the testing set, poor error estimates can be obtained.
Validation Set Approach
In this example, the training data are randomly split into two sets of approximately the same size. The blue set is used for training and the orange set for validation.
Example: auto data
[Figure: validation set MSE versus degree of polynomial.]
Find the best level of flexibility in polynomial regression using the validation set approach (50% used for training).
You will recall this example from earlier slides, where we were using horsepower to predict miles per gallon (mpg). We consider polynomial regression models of the form
$$\text{mpg} = \beta_0 + \sum_{i=1}^{d} \beta_i (\text{horsepower})^i,$$
where d ranges from 1 through to 10. Using the validation set approach, we see that the quadratic model (d = 2) is the best model here. Adding higher-order polynomial terms to the model does not substantially reduce the test MSE, but it does make the model more complex. We should always choose the most parsimonious model (the simplest model that works well).
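A minimal R sketch of this validation set approach, assuming the Auto data from the ISLR package (columns mpg and horsepower):

```r
# Validation set approach on the Auto data: estimate the test MSE of
# polynomial regressions of mpg on horsepower for degrees 1 to 10.
library(ISLR)                 # assumed: provides the Auto data
set.seed(1)

n <- nrow(Auto)
train <- sample(n, n / 2)     # random 50% split into training/validation

val_mse <- sapply(1:10, function(d) {
  fit <- lm(mpg ~ poly(horsepower, d), data = Auto, subset = train)
  pred <- predict(fit, Auto)[-train]        # predictions on the validation set
  mean((Auto$mpg[-train] - pred)^2)         # validation set MSE
})

val_mse
which.min(val_mse)            # degree with the smallest estimated test MSE
```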
Example: auto data
[Figure: validation set MSE versus degree of polynomial for several different random splits.]
The validation set approach using different training and validation sets (50% used for training).
We can see that the validation set approach is sensitive to the way the training data are split. The curves have the same shape and suggest the same level of flexibility (quadratic in this case), but the test MSE estimates are highly variable. There are two main reasons for this variability:
• the training data split; and
• only half the observations are used to fit the model.
If the training data set is large, the validation set approach works well because there are enough observations to fit the model and the data split variability tends to be small.
Drawbacks of the validation set approach
The validation set estimate of the test error can be highly variable.
Only a subset of observations are used to train the model. Hence, the validation set test error will tend to over-estimate the test error.
K-fold cross-validation
1 The training data are divided into K groups (folds) of approximately equal size.
2 The first fold is used as a validation set and the remaining (K − 1) folds are used for training. The test error is then computed on the validation set.
3 This procedure is repeated K times, using a different fold as the validation set each time.
4 The estimate of the test error is the average of the K test errors from each validation set.
Cross-validation attempts to mimic the validation set approach without the need for a validation set. The training data should be randomly partitioned into K groups to destroy data entry ordering etc.
5-fold cross-validation
Each row corresponds to one iteration of the algorithm, where the orange set is used for validation and the blue set is used for training.
If K = n, we have leave one out cross-validation (LOOCV).
The LOOCV method can be computationally expensive because you need to fit many models. However, for linear models, the LOOCV test error can be computed from one fit (see page 180 in the course textbook). Best practice (at least according to Hastie et al.) is to choose K = 5 or 10, possibly averaged over several random data splits.
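For least squares fits, the one-fit shortcut uses the leverage (hat) values; a minimal sketch on the Auto data, assuming the ISLR package:

```r
# LOOCV for a linear model from ONE fit, using the leverage-based shortcut
# CV_n = mean(((y - yhat) / (1 - h))^2). Sketch on the Auto data.
library(ISLR)
fit <- lm(mpg ~ poly(horsepower, 2), data = Auto)
h <- hatvalues(fit)                                     # leverage of each observation
loocv_mse <- mean(((Auto$mpg - fitted(fit)) / (1 - h))^2)
loocv_mse
```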
Cross-validation test error
We randomly divide the training data of n observations into K groups of approximately equal size, C1, C2, . . . , CK .
Regression Problems (average MSE):
$$\mathrm{CV}_K = \frac{1}{n}\sum_{k=1}^{K}\;\sum_{i:\,x_i \in C_k} (y_i - \hat{y}_i)^2.$$
Classification Problems (average error rate):
$$\mathrm{CV}_K = \frac{1}{n}\sum_{k=1}^{K}\;\sum_{i:\,x_i \in C_k} I(y_i \neq \hat{y}_i).$$
Assuming n/K is an integer (or close enough to an integer), we have
$$\mathrm{CV}_K = \frac{1}{n}\sum_{k=1}^{K}\;\sum_{i:\,x_i \in C_k} (y_i - \hat{y}_i)^2 = \frac{1}{K}\sum_{k=1}^{K} \mathrm{MSE}_k,$$
where
$$\mathrm{MSE}_k = \frac{K}{n}\sum_{i:\,x_i \in C_k} (y_i - \hat{y}_i)^2.$$
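A hand-rolled 10-fold CV sketch that mirrors the CV_K formula above, again assuming the ISLR Auto data and the quadratic model:

```r
# 10-fold cross-validation "by hand" for the quadratic model on the Auto
# data, computing CV_K as the average squared error over all n observations.
library(ISLR)
set.seed(1)

K <- 10
n <- nrow(Auto)
folds <- sample(rep(1:K, length.out = n))   # random fold labels 1..K

sq_err <- numeric(n)
for (k in 1:K) {
  fit  <- lm(mpg ~ poly(horsepower, 2), data = Auto[folds != k, ])  # train on K-1 folds
  pred <- predict(fit, newdata = Auto[folds == k, ])                # predict on fold k
  sq_err[folds == k] <- (Auto$mpg[folds == k] - pred)^2
}

mean(sq_err)   # CV_K estimate of the test MSE
```

In practice the same estimate can be obtained with cv.glm() from the boot package.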
Cross-validation: auto data
[Figure: left panel, LOOCV error versus degree of polynomial; right panel, 10-fold CV error versus degree of polynomial.]
The right plot shows nine different 10-fold cross-validations.
We can see that the variability in the estimated test MSE has been removed (by averaging). The one standard error (SE) rule can be used to choose the best model (although you are not expected to do this in this course). We choose the most parsimonious model whose test MSE is within one SE of the minimum test MSE. The SE is computed using
$$\mathrm{SE} = \sqrt{\frac{\mathrm{var}(\mathrm{MSE}_1, \mathrm{MSE}_2, \ldots, \mathrm{MSE}_K)}{K}},$$
where K is the number of folds.
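A small illustrative helper for the one-SE rule (not course code): given a matrix of per-fold MSEs with one row per candidate model, ordered from simplest to most complex, pick the simplest model within one SE of the minimum.

```r
# One-standard-error rule: choose the most parsimonious model whose CV
# error is within one SE of the smallest CV error.
one_se_choice <- function(fold_mse) {                    # fold_mse: models x folds
  cv <- rowMeans(fold_mse)                               # CV error per model
  se <- apply(fold_mse, 1, sd) / sqrt(ncol(fold_mse))    # SE per model
  best <- which.min(cv)
  which(cv <= cv[best] + se[best])[1]                    # simplest model within 1 SE
}
```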
Auto data
[Figure: 10-fold CV error (MSE) versus degree of polynomial for the Auto data.]
K-fold cross-validation: classification
[Figure: simulated two-class data plotted in (X1, X2), with the Bayes decision boundary shown as a dashed line.]
The Bayes error rate is 0.133 in this example.
You will recall this example from earlier in the course, where the dashed line is the Bayes decision boundary. The Bayes error rate is the test error rate of the Bayes classifier (the best you can do), which can be calculated here because this is a simulated example.
K-fold cross-validation: logistic regression
[Figure: estimated decision boundaries for the Degree = 1 (left) and Degree = 2 (right) logistic regression models on the simulated two-class data.]
The test error rates are 0.201 and 0.197, respectively.
Degree 1:
$$\ln\frac{p(x)}{1-p(x)} = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2$$
Degree 2:
$$\ln\frac{p(x)}{1-p(x)} = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \hat\beta_3 x_1^2 + \hat\beta_4 x_2^2$$
K-fold cross-validation: logistic regression
[Figure: estimated decision boundaries for the Degree = 3 (left) and Degree = 4 (right) logistic regression models on the simulated two-class data.]
The test error rates are 0.160 and 0.162, respectively.
Degree 3:
$$\ln\frac{p(x)}{1-p(x)} = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \hat\beta_3 x_1^2 + \hat\beta_4 x_2^2 + \hat\beta_5 x_1^3 + \hat\beta_6 x_2^3$$
Degree 4:
$$\ln\frac{p(x)}{1-p(x)} = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \hat\beta_3 x_1^2 + \hat\beta_4 x_2^2 + \hat\beta_5 x_1^3 + \hat\beta_6 x_2^3 + \hat\beta_7 x_1^4 + \hat\beta_8 x_2^4$$
K-fold cross-validation: logistic regression and KNN
[Figure: error rate versus order of polynomial used (logistic regression, left) and versus 1/K (KNN, right).]
The test error (orange), training error (blue) and the 10-fold cross-validation error (black). The left plot shows logistic regression and the right plot shows KNN.
Recall that the Bayes error for this problem was 0.133. As we increase the complexity of the models, we start over-fitting the training data (as expected) and the characteristic U-shaped test error curve is visible. The CV error rate does a good job of finding the correct level of flexibility (the base of the U) for each model. However, in this example, the CV errors under-estimate the true error rate (but at least we can actually compute the CV error!).
Comments
Since each training set has approximately (1 − 1/K )n observations, the cross-validation test error will tend to over-estimate the prediction error.
LOOCV minimizes this upward bias, but this estimate has high variance. K = 5 or 10 provides a good compromise for this bias-variance trade-off.
LOOCV can have high variance because the training data sets that are used for each fit only differ by one observation (the training folds are highly correlated). Once again, best practice (at least according to Hastie et al.) is to choose K = 5 or 10, possibly averaged over several random data splits.
Cross Validation: right and wrong
Consider the following classifier for a two-class problem:
1 Starting with 1000 predictors and 50 observations, find the 20 predictors having the largest correlation with the response.
2 Apply a classifier using only these 20 predictors.
If we use cross-validation to estimate test error, can we simply apply it at step (2)?
The filtering step (subset selection) is a training step because the response variable is used. Hence, we cannot simply apply CV at step (2). We need to apply CV to the full learning process. That is, divide the data into K folds, find the 20 predictors that are most correlated with the response using K − 1 of the folds, and then fit the classifier using those 20 predictors. We would expect a different set of '20 best predictors' to be found at each iteration and hence, we expect to fit the model using different predictors at each iteration.
Unsupervised screening steps are fine (e.g. choose the 20 predictors with the highest variance across all 50 observations) because the response variable is not used (it’s not supervised).
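A sketch of the right way to cross-validate this procedure in R, redoing the supervised screening inside every fold. The objects X (a 50 × 1000 predictor matrix), y (a two-class factor) and the use of logistic regression as the classifier are all assumptions made for illustration.

```r
# Correct CV: the "choose the 20 most correlated predictors" step is part of
# training, so it must be repeated within each fold.
set.seed(1)
K <- 5
folds <- sample(rep(1:K, length.out = length(y)))
err <- numeric(K)

for (k in 1:K) {
  X_tr <- X[folds != k, , drop = FALSE]
  y_tr <- y[folds != k]

  # Supervised screening using the training folds ONLY
  cors  <- apply(X_tr, 2, function(col) abs(cor(col, as.numeric(y_tr))))
  top20 <- order(cors, decreasing = TRUE)[1:20]

  train_df <- data.frame(y = y_tr, X_tr[, top20])
  test_df  <- data.frame(X[folds == k, top20, drop = FALSE])
  names(test_df) <- names(train_df)[-1]        # align predictor names

  fit  <- glm(y ~ ., data = train_df, family = binomial)
  prob <- predict(fit, newdata = test_df, type = "response")
  pred <- levels(y)[1 + (prob > 0.5)]          # predicted class labels
  err[k] <- mean(pred != y[folds == k])
}
mean(err)   # honest CV error estimate for the whole learning procedure
```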
Bootstrap
The use of the term bootstrap derives from the phrase:
“to pull oneself up by one’s bootstraps”.
The bootstrap is a powerful statistical tool that can be used to quantify uncertainty associated with a statistical learning technique or a given estimator.
Example
Suppose we wish to invest a fixed sum of money in two financial assets that yield returns X and Y , respectively (X and Y are random variables).
We want to minimize the risk (variance) of our investment:
$$\min_{\alpha} V(\alpha X + (1-\alpha)Y),$$
where 0 ≤ α ≤ 1.
This is a motivating example for the Bootstrap. We can minimise the variance (see below), but we are really just interested in obtaining a statistic. Then we can apply the Bootstrap to investigate properties of our statistic.
Optional material (not assessed) for those interested.
Properties of variance:
$$V(aX) = a^2 V(X)$$
$$V(X + Y) = V(X) + V(Y) + 2\,\mathrm{cov}(X, Y)$$
Expanding the variance using these properties gives
$$V(\alpha X + (1-\alpha)Y) = \alpha^2 V(X) + (1-\alpha)^2 V(Y) + 2\alpha(1-\alpha)\,\mathrm{cov}(X, Y).$$
To minimize this expression we set $\frac{d}{d\alpha}V(\cdot) = 0$, which gives
$$2\alpha V(X) - 2(1-\alpha)V(Y) + (2 - 4\alpha)\,\mathrm{cov}(X, Y) = 0.$$
Hence, the α that minimises the expression is
$$\alpha = \frac{V(Y) - \mathrm{cov}(X, Y)}{V(X) + V(Y) - 2\,\mathrm{cov}(X, Y)} = \frac{\sigma_Y^2 - \sigma_{XY}}{\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}}.$$
Example
The values of σX², σY² and σXY are unknown and hence need to be estimated from sample data.
We can then estimate the α value that minimizes the variance of our investment using
$$\hat\alpha = \frac{\hat\sigma_Y^2 - \hat\sigma_{XY}}{\hat\sigma_X^2 + \hat\sigma_Y^2 - 2\hat\sigma_{XY}}.$$
αˆ is an estimator, but we don’t know its sampling distribution or its standard error.
Revision material: I hope you know sample variance and sample covariance, but if you don't, here they are:
$$\hat\sigma_X^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2; \qquad \hat\sigma_Y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2; \qquad \hat\sigma_{XY} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}),$$
where x̄ and ȳ are the sample means for x and y, respectively.
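A small R helper for the plug-in estimate α̂ (a sketch; `returns` is an assumed data frame with columns X and Y):

```r
# Plug-in estimate of alpha from a sample of returns.
alpha_hat <- function(returns) {
  vX  <- var(returns$X)              # estimate of sigma_X^2
  vY  <- var(returns$Y)              # estimate of sigma_Y^2
  cXY <- cov(returns$X, returns$Y)   # estimate of sigma_XY
  (vY - cXY) / (vX + vY - 2 * cXY)
}
```

Note that R's var() and cov() use the n − 1 denominator, matching the formulas above.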
Example: simulated returns for investments X and Y
Each plot shows 100 simulated returns for investments X and Y .
• Each sample gives an estimate of α.
• We can use sample statistics to look at statistical properties of our estimator αˆ, for example, its standard error (standard deviation of an estimator).
The big assumption here is that we can draw as many samples from the population as we like. If we could do this, statistics would be a trivial subject! Clearly, we cannot do this in practice; we're just trying to illustrate a point here.
Example: simulated sampling distribution for αˆ
The sampling distribution of αˆ using 1000 simulated data sets (a histogram of 1000 αˆ values, one αˆ from each data set). The true value of α is shown by the vertical line (α = 0.6).
Example: statistics from 1000 observations
The sample mean is
$$\bar\alpha = \frac{1}{1000}\sum_{i=1}^{1000}\hat\alpha_i = 0.5996,$$
which is very close to the true value, α = 0.6. The standard deviation is
$$\mathrm{sd}(\hat\alpha) = \sqrt{\frac{1}{1000-1}\sum_{i=1}^{1000}(\hat\alpha_i - \bar\alpha)^2} = 0.083,$$
which gives an approximate standard error of SE(α̂) = 0.083.
Statistics is very easy if we can draw as many samples from the population as we like!
Real world
The procedure we have discussed cannot be applied in the real world because we cannot sample the original population many times.
The bootstrap comes to the rescue.
Rather than sampling the population many times directly, we repeatedly sample the observed sample data using random sampling with replacement.
These bootstrap samples are the same size as the original sample (n observations) and will likely contain repeated observations.
Bootstrap samples contain approximately 2/3 of the original training observations (or if you prefer, approximately 1/3 of the original training observations are not included in a Bootstrap sample). The probability of a training observation xi being included (at least once) in the bootstrap sample is
$$\Pr(x_i \in \text{bootstrap sample}) = 1 - (1 - 1/n)^n.$$
Plugging in some n values we see that Pr = 0.6513 for n = 10 and Pr = 0.6340 for n = 100. As n → ∞ you can show (not in this course) that
$$\lim_{n\to\infty}\left[1 - (1 - 1/n)^n\right] = 1 - \exp(-1) \approx 2/3.$$
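A quick check of these numbers in R:

```r
# Probability that a given observation appears (at least once) in a
# bootstrap sample of size n, for a few values of n.
n <- c(10, 100, 1000)
round(1 - (1 - 1/n)^n, 4)   # 0.6513, 0.6340, 0.6323; the limit is 1 - exp(-1)
```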
Bootstrap
[Figure: a schematic of the bootstrap on a small sample of n = 3 observations. Each bootstrap sample Z*1, Z*2, …, Z*B is drawn with replacement from the original data Z, and each one yields a bootstrap estimate α̂*1, α̂*2, …, α̂*B.]

Original Data (Z):
Obs 1: X = 4.3, Y = 2.4
Obs 2: X = 2.1, Y = 1.1
Obs 3: X = 5.3, Y = 2.8

Example bootstrap samples (observations drawn with replacement):
Z*1: observations 3, 1, 3
Z*2: observations 2, 3, 1
Z*B: observations 2, 2, 1
In this figure, there are only three observations and hence, the bootstrap is pointless. We are just trying to illustrate the procedure here! That is, each bootstrap sample gives an estimate of α (the estimator of interest) and if we draw many bootstrap samples, we get many different estimates of α.
Example: bootstrap sampling distribution for αˆ
(Left) The simulated sampling distribution for α̂. (Right) Histogram showing 1000 bootstrap estimates of α. The bootstrap histogram is similar to the simulated sampling distribution:
• both have a similar shape;
• both are centred on α = 0.6; and
• both have similar variance.
Hence, we can learn statistical properties of αˆ, for example its standard error, by looking at the sample statistics from the Bootstrap samples.
Example: statistics from B bootstrap samples
Let Z*i and α̂*i denote the ith bootstrap sample and the ith bootstrap estimate of α, respectively.
We estimate the standard error of α̂ using
$$\mathrm{SE}_B = \sqrt{\frac{1}{B-1}\sum_{i=1}^{B}\left(\hat\alpha^{*i} - \bar\alpha^*\right)^2},$$
where B is some large value (say 1000) and
$$\bar\alpha^* = \frac{1}{B}\sum_{i=1}^{B}\hat\alpha^{*i}.$$
For our example SE_B = 0.087. The simulated (true) SE was 0.083, so SE_B ≈ SE(α̂) for this example.
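A minimal bootstrap sketch in R, assuming the data frame `returns` (columns X and Y) and the `alpha_hat()` helper defined earlier:

```r
# Bootstrap estimate of the standard error of alpha_hat.
set.seed(1)
B <- 1000
n <- nrow(returns)

alpha_star <- replicate(B, {
  idx <- sample(n, n, replace = TRUE)   # bootstrap sample: n draws with replacement
  alpha_hat(returns[idx, ])             # bootstrap estimate of alpha
})

sd(alpha_star)    # SE_B: compare with the simulated (true) SE of 0.083
```

The same idea is implemented by boot() in the boot package; the hand-rolled version is shown here to make the resampling explicit.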
The general picture
An empirical distribution function (EDF) for a sample of n = 20 standard normal random variables.
The EDF is defined as
$$\hat{F}(x) = \frac{\#\{x_i \leq x\}}{n}.$$
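A quick sketch of an EDF in R using the built-in ecdf() function:

```r
# Empirical distribution function of n = 20 standard normal draws.
set.seed(1)
x <- rnorm(20)
Fhat <- ecdf(x)       # step function: Fhat(t) = #{x_i <= t} / n
Fhat(0)               # proportion of the sample at or below 0
plot(Fhat, main = "EDF of 20 standard normal draws")
```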