Overfitting, underfitting, cross validation and the curse of dimensionality
Chenlei Leng
Outline
The curse of dimensionality Dimension reduction Underfitting and overfitting Example: Linear regression Validation
The curse of dimensionality
Many sources of data are very high dimensional:
Human single nucleotide polymorphism (SNP) data: 106 dimensions
Images: > 106 pixels (“megapixels”), ∼12 bits per pixel (bpp)
HD video: 1920 × 1080 pixels with 24 frames per second (fps)
CD quality audio: 44.1 × 103 samples per second (kHz), 16
bits per sample, 2 channels (stereo) up to 8 channels (7.1)
High dimensional spaces are hard to understand
k-nearest neighbours classification and regression generally do
badly in high dimensions
“Nearest points still a long way off.”
Ordinary least squares (OLS) regression will tend to overfit in high-dimensional spaces
The curse of dimensionality
Arrange N points uniformly at random inside the D-dimensional unit cube.
Each point will be far away from its nearest neighbour unless D ≪ N.
How small does ε need to be for the cube [ε,1−ε]D to have volume 1/2?
●
Dε
1 0.250 2 0.146 3 0.103 4 0.080 5 0.064
10 0.033 100 0.003
●
●●
●
●
●
●●
●●
●● ●
●
●● ●●●●
●●
●
●●
● ●
● ●●●●
●● ●
●●
●● ●
●
●
●
●
●●
● ●
●
●● ● ●
●
● ●
● ●●
●
●
●
●
●
●
●
●
●●● ●●
●●
●
●
●
●
● ●●
●
● ● ● ●
●
● ●
●●
●
●
●● ●
●
●● ●
●
●●
●
1.0
● ●●
0.8
0.6
0.0 0.2 0.4
0.6 0.8 1.0
x
0.0
0.4
0.2
“Volume concentrates near the boundary.”
0.0 0.2 0.4
0.6 0.8 1.0
y
z
The curse of dimensionality
To deal with a 14-dimensional space, visualize a 3D space and say ‘fourteen’ to yourself very loudly. Everyone does it.
— Geoffrey Hinton, FRS (U. Toronto & Google)
Dimension reduction
For high-dimensional data we might try a dimension reduction technique.
Examples:
SVD : Singular value decomposition
PCA : Principal components analysis SOM : Self-organizing map
LLE : Local linear embedding AE : Autoencoder
Underfitting
exp(− 8×2)
●● ●●
●●
●●
●●
●●
●●
●● ●●
●● ●●
●● ●●
●●●●●●● ●●●●●●●
−1.0 −0.5 0.0 0.5 1.0
x
Underfitting: The classifier cannot capture the complexity of the training data.
High training error—errors on the training data
Typically too few parameters / a model that is too simple High bias.
High generalization error—errors on new data.
0.0 0.2 0.4
0.6 0.8 1.0
y
Overfitting
●●
●
●
●
●●
●
●
●●
●
●
●
●
●
● ●
●
●
●
−1.0 −0.5 0.0 0.5 1.0
x
Overfitting: Good classification of training samples, but by an overly complicated classifier.
High generalization error.
Typically too many parameters / a model that is too complex High variance.
0.2 0.4
0.6 0.8
y
Overfitting and underfitting
Model selection and assessment
Model selection for supervised learning:
2. 3.
Gather data (x(n))Nn=1.
Detect underfitting by looking at the training error.
Detect overfitting by looking at the validation error—evaluate the performance of the trained model on a separate, labelled dataset that was not used in training.
1.
Label the data (x(n),y(n))Nn=1.
Train a model on the data.
Model assessment for supervised learning:
Select the trained model with lowest validation error.
Evaluate your chosen model against somebody else’s by looking at its test error—its performance on a third, separate, labelled dataset.
at the end of the data analysis. Suppose instead that we use the test-set repeatedly, choosing the model with smallest test-set error. Then the test
Model selection and assessment
set error of the final chosen model will underestimate the true test error, sometimes substantially.
The previous procedure required three labelled datasets.
It is difficult to give a general rule on how to choose the number of
observations in each of the three parts, as this depends on the signal-to- In practice we will collect one labelled dataset and chop it in
noise ratio in the data and the training sample size. A typical split might
three.
be 50% for training, and 25% each for validation and testing:
Hastie et al. (2009) suggest the ratio 2:1:1 as a rule of thumb:
Train
Validation
Test
Observations:
The methods in this chapter are designed for situations where there is
The train/validation split should be random so that the
insufficient data to split it into three parts. Again it is too difficult to give
training data and validation data come from the same
a general rule on how much training data is enough; among other things,
distribution.
this depends on the signal-to-noise ratio of the underlying function, and
Test data that comes from the same distribution as the the complexity of the models being fit to the data.
train/validation data should be handled as well as the
validation data (“interpolation”).
Test data that differs from the train/validation data can be
handled badly (“extrapolation”).
Ordinary Least Squares regression
Recap of ordinary least squares regression:
Y=Xβ+e
X is the N × D design matrix,
YandeareN×1columnvectors
errors ei have mean zero.
ˆOLS 2
Least squares estimator β minimizes |Y −Xβ|2, i.e.
ˆOLS N 2
β = argmin ∑(Y −Xβ)i .
β i=1
There is a unique, closed-form solution if X is of full rank:
ˆOLS T−1T
β =(XX)XY.
InR:oneof
lm(y~x1+x2+x3)
lm(y~.,dataframe)
Ridge regression
Y=Xβ+easbefore.
Let us introduce a parameter, λ to give us more control over
the ‘complexity’ of the model we choose.
Theridgeregressionestimatorminimizes|Y−Xβ|2+λ|β|2,
i.e.
ˆRIDGE N 2D2 β = argmin ∑(Y −Xβ)i +λ ∑βi .
β i=1 i=1
Theorem: There is always a unique solution (unlike OLS).
Interpret λ as a penalty for complex models, where complexity is measured by |β|2.
ˆ RIDGE ˆ OLS
Asλ→0,β →β (unbiased,overfits).
ˆ RIDGE Asλ→∞,β
→0(zerovariance,underfits). “Bias-variance trade-off.”
MSE = bias2 + variance is minimized by some λ > 0. In R: library(MASS); lm.ridge(y~x1+x2+x3).
Solution for ridge regression
The ridge regression estimator also has a closed form solution. ˆRIDGE T −1 T
Solution: β =(X X+λID) X Y. (Check!) Theorem
(XTX +λID) is invertible for λ > 0.
Proof.
For any z, zTXTXz = ∥Xz∥2 ≥ 0 so XTX is positive semidefinite. HenceXTX+λID ispositivedefinite.
Hence it is invertible (c.f. ST208).
Ridge Regression
Trade-off analysis:
Ife∼N(0,σ2ID)thenY=(Xβ+e)∼N(Xβ,σ2ID).
ˆRIDGE T −1 T
Sinceβ =(X X+λID) X Y isalinear
transformation of Y and
“Z ∼ N(μ,Σ) =⇒ MZ ∼ N(Mμ,MΣMT)′′
we find
ˆRIDGE T 2 T T −1
β ∼N(WX Xβ,σ WX XW), W :=(X X+λID) .
Diagonalising the symmetric matrix XTX = PDPT
(P orthogonal, PPT = PTP = ID, D diagonal), we can show
ˆRIDGE Var(β
2 ˆRIDGE Bias (β
2
−2 T
DP , which is decreasing in λ,
which is increasing in λ.
) = σ P(D +λID)
2 2 )=λ ∥Wβ∥2,
LASSO: least absolute shrinkage and selection operator
Y=Xβ+easbefore.
What about a different penalty term?
ˆ LASSO 2
TheLASSOestimatorβ minimizes|Y−Xβ|2+λ|β|1.
An innocuous change but has important implications—some of the βi will be equal to 0 (for large enough λ).
No closed-form solution as for ridge regression.
InR:
library(lars)
lars(x, y, type=”lasso”)
It is difficult to give a general rule on how to choose the number of
observations in each of the three parts, as this depends on the signal-to- SimnopisleeraVtiaolindathteiodnata and the training sample size. A typical split might
be 50% for training, and 25% each for validation and testing:
Train
Validation
Test
The methods in this chapter are designed for situations where there is
Back to the model selection problem:
insufficient data to split it(inn)toN three parts. Again it is too difficult to give
1. Gather data (x )n=1.
a general rule on how much (tnr)ain(in)gNdata is enough; among other things,
Label the data (x ,y )n=1.
this depends on the signal-to-noise ratio of the underlying function, and
Train a model on the training data. Repeat for different the complexity of the models being fit to the data.
models (e.g. different penalties λ).
Compare trained models on validation data. Pick a model that
works well.
Compete against someone else on test data.
2. Major issue: we might not have sufficient data to split it in three.
Can we approximate the validation error somehow, or re-use our data?
Leave one out cross validation (LOOCV)
Idea: don’t just split the training data once. Given training data (x(n),y(n))Ni=1:
Forj=1,…,N:
Train a model on the data but with the j-th case removed.
Evaluate the model on the j-th case only
(a validation set of size 1).
Look at the mean validation error over the N cases. Pick the model with the lowest validation error.
Re-train that model using all N training cases
Makes the most of the available data
Except in very simple cases, very slow!
k-fold cross validation
Idea: Don’t just split the training data once.
Given training data (x(n),y(n))Ni=1, split it into k equally sized subsets.
Forj=1,…,k:
Train a model on the data but with the j-th subset removed.
Test the model on the j-th subset only
(a validation set of size N/k).
Look at the mean validation error over the k cases.
Pick the model with the lowest validation error.
Re-train that model using all N training cases.
Makes the most of the available data.
Often it is a good compromise between
simple validation (one validation set, less accurate, fastest),
and
LOOCV (N validation sets, most accurate, slowest)