
Cross-validation and the Bootstrap
STAT318 — Data Mining
Jiří Moravec
University of Canterbury, Christchurch


Some of the figures in this presentation are taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.

Cross-Validation and the Bootstrap
In this section we discuss two important resampling methods: cross-validation and the bootstrap.
These methods use samples formed from the training data to obtain additional information about a fitted model or an estimator.
They can be used for estimating prediction error, determining appropriate model flexibility, estimating standard errors, …

Motivation
Cars: Speed vs distance
[Figure: scatter plot of stopping distance (feet) against speed (mph) for the cars data.]



Training error vs. test error
The training error is the average error that results from applying a statistical learning method to the observations used for training — a simple calculation.
The test error is the average error that results from applying a statistical learning technique to test observations that were not used for training — a simple calculation if test data exists, but we usually only have training data.
The training error tends to dramatically under-estimate the test error.


Validation Set Approach
A very simple strategy is to randomly divide the training data into two sets:
1 Training Set: Fit the model using the training set.
2 Validation Set: Predict the response values for the observations in the validation set.
The validation set error provides an estimate of the test error (MSE for regression and error rate for classification).
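As a rough sketch of this approach in R, using the built-in cars data from the motivating example (the 50/50 split and the simple linear model are illustrative choices, not part of the slides):

# Validation set approach: a minimal sketch on the built-in 'cars' data.
set.seed(1)                                    # for a reproducible split
n     <- nrow(cars)
train <- sample(n, n / 2)                      # indices of the training half

fit  <- lm(dist ~ speed, data = cars, subset = train)   # fit on the training set
pred <- predict(fit, newdata = cars[-train, ])          # predict the validation set

val_mse <- mean((cars$dist[-train] - pred)^2)  # validation estimate of the test MSE
val_mse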

Validation Set Approach
In this example, the training data are randomly split into two sets of approximately the same size. The blue set is used for training and the orange set for validation.

Example: auto data
Find the best level of flexibility in polynomial regression using the validation set approach (50% used for training).
[Figure: validation-set mean squared error plotted against the degree of the polynomial (1 to 10).]

Polynomial regression:
\text{mpg} = \beta_0 + \sum_{i=1}^{p} \beta_i (\text{horsepower})^i
Minimum MSE at p ≈ 7
The relative increase in complexity compared to the more parsimonious model p = 2 is likely not worth the small improvement in MSE.
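A sketch of how this comparison might be run in R, assuming the Auto data frame from the ISLR package (the slides only refer to the auto data):

# Validation-set MSE for polynomial regressions of degree 1 to 10 on the Auto data.
library(ISLR)

set.seed(1)
n     <- nrow(Auto)
train <- sample(n, n / 2)                          # 50% used for training

val_mse <- sapply(1:10, function(p) {
  fit <- lm(mpg ~ poly(horsepower, p), data = Auto, subset = train)
  mean((Auto$mpg - predict(fit, Auto))[-train]^2)  # MSE on the held-out half
})
which.min(val_mse)                                 # degree with the smallest validation MSE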

[Figure: the validation set approach repeated with different random training/validation splits (50% used for training); the resulting MSE curves vary noticeably from split to split.]

Drawbacks of the validation set approach
The validation set estimate of the test MSE can be highly variable. The main reasons are:
the random split of the training data
only half of the observations (n/2) are used to train the model
Hence, the validation set test error will tend to over-estimate the test error.
Jiˇr ́ı Moravec, University of Canterbury 2022
STAT318 — Data Mining 13 / 42

K-fold cross-validation
1 The training data are divided into K groups (folds) of approximately equal size.
2 The first fold is used as a validation set and the remaining (K − 1) folds are used for training. The test error is then computed on the validation set.
3 This procedure is repeated K times, using a different fold as the validation set each time.
4 The estimate of the test error is the average of the K test errors from each validation set.
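A minimal hand-rolled version of this procedure in R (the fold assignment, the linear model and the cars data are illustrative choices):

# A minimal K-fold cross-validation sketch for a regression model.
kfold_cv_mse <- function(data, K = 5, formula = dist ~ speed) {
  set.seed(1)
  n     <- nrow(data)
  folds <- sample(rep(1:K, length.out = n))             # random folds of roughly equal size
  fold_mse <- sapply(1:K, function(k) {
    fit  <- lm(formula, data = data[folds != k, ])      # train on the other K - 1 folds
    pred <- predict(fit, newdata = data[folds == k, ])  # predict the held-out fold
    mean((data[folds == k, all.vars(formula)[1]] - pred)^2)
  })
  mean(fold_mse)                                        # average the K fold MSEs
}

kfold_cv_mse(cars, K = 5)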

5-fold cross-validation
Each row corresponds to one iteration of the algorithm, where the orange set is used for validation and the blue set is used for training.
If K = n, we have leave one out cross-validation (LOOCV).

Cross-validation test error
We randomly divide the training data of n observations into K groups of approximately equal size, C1, C2, . . . , CK .
Regression Problems (average MSE):
CV = \frac{1}{K}\sum_{k=1}^{K} \frac{1}{n_k} \sum_{i:\, x_i \in C_k} (y_i - \hat{y}_i)^2 = \frac{1}{K}\sum_{k=1}^{K} \text{MSE}_k,
where n_k is the number of observations in fold C_k.
Classification Problems (average error rate):
CV = \frac{1}{K}\sum_{k=1}^{K} \frac{1}{n_k} \sum_{i:\, x_i \in C_k} I(y_i \neq \hat{y}_i).

Cross-validation: auto data
10-fold CV
[Figure: mean squared error against degree of polynomial (1 to 10) for the auto data. The right plot shows nine different 10-fold cross-validations.]

The variability in the estimated test MSE has been removed (by averaging). The one standard error (SE) rule can be used to choose the best model. We choose the most parsimonious model whose test MSE is within one SE of the minimum test MSE. The SE is computed as:
\text{SE} = \sqrt{\frac{\mathrm{Var}(\text{MSE}_1, \ldots, \text{MSE}_K)}{K}},
where K is the number of folds.
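A sketch of the one SE rule, assuming a hypothetical K × P matrix cv_mse of fold-level MSEs (rows are folds, columns are candidate models ordered from least to most flexible):

# One standard error rule: choose the simplest model within one SE of the minimum.
one_se_rule <- function(cv_mse) {
  K        <- nrow(cv_mse)
  mean_mse <- colMeans(cv_mse)                 # CV estimate for each candidate model
  se_mse   <- apply(cv_mse, 2, sd) / sqrt(K)   # SE of each CV estimate
  best     <- which.min(mean_mse)
  which(mean_mse <= mean_mse[best] + se_mse[best])[1]   # most parsimonious within one SE
}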

K-fold cross-validation: classification
[Figure: simulated two-class classification data in two dimensions.]
The Bayes error rate is 0.133 in this example.

K-fold cross-validation: logistic regression
This is a classification problem, so we will use logistic regression. Consider a polynomial logistic regression of degree p in two predictors:
\ln\!\left(\frac{p(x)}{1 - p(x)}\right) = \hat{\beta}_0 + \sum_{i=1}^{p} \hat{\beta}_{1i}\, x_1^i + \sum_{i=1}^{p} \hat{\beta}_{2i}\, x_2^i
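A sketch of fitting such a model in R with glm; the data frames train_data and test_data and the variable names x1, x2 and y are hypothetical:

# Polynomial logistic regression of degree p in two predictors (illustrative names).
p    <- 2
fit  <- glm(y ~ poly(x1, p) + poly(x2, p), family = binomial, data = train_data)
phat <- predict(fit, newdata = test_data, type = "response")
yhat <- ifelse(phat > 0.5, 1, 0)        # classify with a 0.5 threshold
mean(yhat != test_data$y)               # test error rate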

K-fold cross-validation: logistic regression
[Figure: the simulated two-class data with fitted polynomial logistic regression models of degree 1 (left) and degree 2 (right).]
The test error rates are 0.201 and 0.197, respectively.

[Figure: fitted polynomial logistic regression models of degree 3 (left) and degree 4 (right) on the same data.]
The test error rates are 0.160 and 0.162, respectively.

K-fold cross-validation: logistic regression and KNN
[Figure: error rate against the order of polynomial used (left, logistic regression) and against 1/K (right, KNN).]
The test error (orange), training error (blue) and the 10-fold cross-validation error (black). The left plot shows logistic regression and the right plot shows KNN.

Since each training set has approximately (1 − 1/K )n observations, the cross-validation test error will tend to over-estimate the prediction error.
LOOCV minimizes this upward bias, but this estimate has high variance. This is due to a high correlation between training folds as they differ only by a single observation.
K = 5 or 10 provides a good compromise for this bias-variance trade-off.

Cross Validation: right and wrong
Consider the following classifier for a two-class problem:
1 Starting with 1000 predictors and 50 observations, find the 20 predictors having the largest correlation with the response.
2 Apply a classifier using only these 20 predictors.
If we use cross-validation to estimate test error, can we simply apply it at step (2)?

The filtering step (subset selection) is a training step because the response variable is used. Hence, we cannot simply apply CV at step (2).
We need to apply CV to the full learning process. That is, divide the data into K folds, find the 20 predictors that are most correlated with the response using K − 1 of the folds, and then fit the classifier using those 20 predictors.
We would expect different 20 best predictors to be found at each iteration and hence, we expect to fit the model using different predictors at each iteration.
Unsupervised screening steps are fine (e.g. choosing the 20 predictors with the highest variance across all 50 observations) because the response variable is not used, so the screening is not supervised.
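A sketch of the right procedure, with the supervised screening step repeated inside every fold. The predictor matrix X (50 × 1000), the binary 0/1 response y and the choice of logistic regression as the classifier are all illustrative assumptions:

# CV done right: screening is redone on the training folds at every iteration.
cv_with_screening <- function(X, y, K = 5, m = 20) {
  set.seed(1)
  folds <- sample(rep(1:K, length.out = length(y)))
  errs <- sapply(1:K, function(k) {
    tr   <- folds != k
    # screen on the TRAINING folds only: the m predictors most correlated with y
    cors <- abs(cor(X[tr, ], y[tr]))
    keep <- order(cors, decreasing = TRUE)[1:m]
    Xk   <- as.data.frame(X[, keep, drop = FALSE])
    colnames(Xk) <- paste0("V", seq_len(m))
    dat  <- cbind(y = y, Xk)
    # a simple stand-in classifier; with only 50 observations it may warn about separation
    fit  <- glm(y ~ ., family = binomial, data = dat[tr, ])
    phat <- predict(fit, newdata = dat[!tr, ], type = "response")
    mean((phat > 0.5) != (y[!tr] == 1))                 # error rate on the held-out fold
  })
  mean(errs)
}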

The Bootstrap
The use of the term bootstrap derives from the phrase:
“to pull oneself up by one’s bootstraps”.
The bootstrap is a powerful statistical tool that can be used to quantify uncertainty associated with a statistical learning technique or a given estimator.

Suppose we wish to invest a fixed sum of money in two financial assets that yield returns X and Y , respectively (X and Y are random variables).
We want to minimize the risk (variance) of our investment:
\min_{\alpha} \; V\big(\alpha X + (1-\alpha)Y\big),
where 0 ≤ α ≤ 1.

The values of \sigma_X^2, \sigma_Y^2 and \sigma_{XY} are unknown and hence need to be estimated from sample data.
We can then estimate the α value that minimizes the variance of our investment using
\hat{\alpha} = \frac{\hat{\sigma}_Y^2 - \hat{\sigma}_{XY}}{\hat{\sigma}_X^2 + \hat{\sigma}_Y^2 - 2\hat{\sigma}_{XY}}.
αˆ is an estimator, but we don’t know its sampling distribution or its standard error.

We could either try to calculate this analytically or, in this case, simulate:
Simulate 1000 returns for investments X and Y
Calculate αˆ
Repeat to get a good estimate of αˆ, sd(αˆ), etc.
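A sketch of this simulation in R. The bivariate normal return distribution and its parameters (variances 1 and 1.25, covariance 0.5) are assumptions, chosen so that the true value is α = 0.6 as quoted below; mvrnorm is from the MASS package:

# Simulating the sampling distribution of alpha-hat.
library(MASS)

alpha_hat <- function(x, y) {
  (var(y) - cov(x, y)) / (var(x) + var(y) - 2 * cov(x, y))
}

set.seed(1)
Sigma  <- matrix(c(1, 0.5, 0.5, 1.25), 2, 2)        # var(X) = 1, var(Y) = 1.25, cov = 0.5
alphas <- replicate(1000, {
  z <- mvrnorm(1000, mu = c(0, 0), Sigma = Sigma)   # 1000 simulated returns
  alpha_hat(z[, 1], z[, 2])
})
mean(alphas)   # should be close to 0.6
sd(alphas)     # approximate standard error of alpha-hat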

Example: simulated returns for investments X and Y
[Figure: scatter plots of simulated returns for Y against X from several simulated data sets.]

Example: simulated sampling distribution for αˆ
[Figure: histogram of the simulated αˆ estimates, with values roughly between 0.4 and 0.9.]

Example: statistics from 1000 observations
The sample mean is
\bar{\alpha} = \frac{1}{1000}\sum_{i=1}^{1000} \hat{\alpha}_i = 0.5996,
which is very close to the true value, α = 0.6.
The standard deviation is
\mathrm{sd}(\hat{\alpha}) = \sqrt{\frac{1}{999}\sum_{i=1}^{1000} (\hat{\alpha}_i - \bar{\alpha})^2} = 0.083,
which gives an approximate standard error of SE(αˆ) = 0.083.

The big assumption here is that we can draw as many samples from the population as we like. If we could do this, statistics would be a trivial subject!
Clearly, we cannot do this in practice.

Real world
The procedure we have discussed cannot be applied in the real world because we cannot sample the original population many times.
The bootstrap comes to the rescue.
Rather than sampling the population many times directly, we repeatedly sample
the observed sample data using random sampling with replacement.
These bootstrap samples are the same size as the original sample (n observations) and will likely contain repeated observations.

[Figure: schematic of the bootstrap. The original data Z is repeatedly resampled with replacement to give bootstrap data sets Z∗1, Z∗2, . . . , Z∗B, each of which yields a bootstrap estimate of α.]

Example: bootstrap sampling distribution for αˆ
[Figure] Left: the simulated sampling distribution for αˆ. Right: histogram showing 1000 bootstrap estimates of α.

Example: statistics from B bootstrap samples
Let Z∗i and αˆ∗i denote the ith bootstrap sample and the ith bootstrap estimate
of α, respectively.
We estimate the standard error of αˆ using
\mathrm{SE}_B(\hat{\alpha}) = \sqrt{\frac{1}{B-1}\sum_{i=1}^{B} \left(\hat{\alpha}^{*i} - \bar{\alpha}^{*}\right)^2},
where B is some large value (say 1000) and \bar{\alpha}^{*} = \frac{1}{B}\sum_{i=1}^{B} \hat{\alpha}^{*i}.
For our example, SE_B = 0.087 (close to the true SE = 0.083).
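A sketch of this calculation in R for a single observed sample of returns; the data frame returns (with columns X and Y) is hypothetical, and the boot package could be used instead of the explicit loop:

# Bootstrap estimate of SE(alpha-hat) from one observed sample.
alpha_hat <- function(x, y) {
  (var(y) - cov(x, y)) / (var(x) + var(y) - 2 * cov(x, y))
}

set.seed(1)
B <- 1000
boot_alpha <- replicate(B, {
  idx <- sample(nrow(returns), replace = TRUE)    # resample rows with replacement
  alpha_hat(returns$X[idx], returns$Y[idx])       # re-estimate alpha on the bootstrap sample
})
sd(boot_alpha)                                    # SE_B, the bootstrap standard error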

The general picture
[Figure] Cumulative distribution function (CDF) (black) and an empirical distribution function (EDF) (red) for a sample of n = 20 standard normal random variables.

The EDF is defined as:
\hat{F}_n(x) = \frac{\#\{\text{elements in the sample} \le x\}}{n} = \frac{1}{n}\sum_{i=1}^{n} I(x_i \le x),
where the indicator function I(·) equals 1 if its condition is true and zero otherwise.
The EDF approximates the CDF.
Sampling from the EDF (the bootstrap) approximates sampling from the population CDF.
Bootstrap samples will only contain values from the observed sample, but each bootstrap sample will contain different combinations of these values, mimicking a newly drawn random sample from the population.
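A small illustration in R, using n = 20 standard normal draws as in the figure above:

# The EDF of a sample, and a bootstrap sample as a draw from the EDF.
set.seed(1)
x  <- rnorm(20)      # n = 20 standard normal observations
Fn <- ecdf(x)        # empirical distribution function F_n

Fn(0)                # EDF at 0: the proportion of sample values <= 0
pnorm(0)             # N(0,1) CDF at 0, which the EDF approximates

# A bootstrap sample: n draws from the EDF, i.e. sampling x with replacement.
x_star <- sample(x, size = length(x), replace = TRUE)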

Be aware that bootstrap samples contain approximately 2/3 of the original training observations.
The probability that a training observation xi is included (at least once) in the bootstrap sample is:
\Pr(x_i \in \text{bootstrap sample}) = 1 - \left(1 - \frac{1}{n}\right)^n \;\longrightarrow\; 1 - e^{-1} \approx 0.632 \quad \text{as } n \to \infty.
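A quick numerical check of this formula for a few sample sizes:

# Probability that a given observation appears in a bootstrap sample of size n.
n <- c(10, 100, 1000, 10000)
1 - (1 - 1 / n)^n    # approaches 1 - exp(-1) as n grows
1 - exp(-1)          # about 0.632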

The bootstrap can also be used to approximate confidence intervals, the simplest method being the bootstrap percentile confidence interval.
For example, an approximate 90% confidence interval is given by the 5th and 95th percentiles of the B bootstrap estimates.
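A one-line sketch, assuming boot_alpha holds the B bootstrap estimates computed earlier:

# Bootstrap percentile confidence interval.
quantile(boot_alpha, probs = c(0.05, 0.95))   # approximate 90% CI for alpha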
It is possible to use the bootstrap for estimating prediction error, but cross-validation is easier and gives similar results.
We will use the bootstrap when building decision trees.
