Model assessment and selection
Data Mining
Prof. Dr. Matei Demetrescu
Statistics and Econometrics (CAU Kiel), Summer 2020
Why consider model selection?
Predictor subset selection
We identify a subset of the p predictors that we believe to be related to the response.
We then fit the relevant model using the reduced set of variables. This way, we aim to improve
Prediction Accuracy
One determinant of prediction error is the variance of the coefficient estimators, and reducing p helps to keep it small.
Model Interpretability
By removing irrelevant features we can obtain a model that is more easily interpreted.
To achieve this, we will need estimates of error levels.
Today’s outline
Model assessment and selection
1 Subset selection in the linear case
2 Estimating the prediction error
3 Validation
4 Up next
Subset selection in the linear case
Best subset selection
Why not just find the best combination of predictors?
1 Let M0 denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation.
2 For k = 1, 2, …, p:
(a) Fit all $\binom{p}{k}$ models that contain exactly k predictors.
(b) Pick the best among these $\binom{p}{k}$ models, and call it Mk. Here best is defined as having the smallest RSS, or equivalently the largest R2.
3 Select a single best model from among M0, …, Mp.
Unfortunately we may not use R2 in Step 3, since R2 increases by construction as predictors are added to the model!
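To make the procedure concrete, here is a minimal Python sketch of best subset selection for least squares (illustration only; `X` and `y` stand for a hypothetical numpy feature matrix and response, and scikit-learn is used for the fits):

```python
from itertools import combinations
import numpy as np
from sklearn.linear_model import LinearRegression

def best_subset(X, y):
    """For each size k, return the predictor subset with the smallest RSS."""
    n, p = X.shape
    models = {0: ((), np.sum((y - y.mean()) ** 2))}      # null model M_0: predict the mean
    for k in range(1, p + 1):
        best_rss, best_vars = np.inf, None
        for subset in combinations(range(p), k):         # all (p choose k) candidates
            cols = list(subset)
            fit = LinearRegression().fit(X[:, cols], y)
            rss = np.sum((y - fit.predict(X[:, cols])) ** 2)
            if rss < best_rss:
                best_rss, best_vars = rss, subset
        models[k] = (best_vars, best_rss)                # M_k
    return models  # Step 3 needs a criterion other than raw RSS (see below)
```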
Example: Credit data set
RSS and R2 for each possible model containing a subset of the ten predictors in the Credit data set (with Balance as target). The red frontier tracks the best model for a given number of predictors.
[Figure: left panel, residual sum of squares against the number of predictors; right panel, R2 against the number of predictors.]
(Though the data set contains only ten predictors, the x-axis ranges from 1 to 11, since one of the variables is categorical and takes on three values, leading to the creation of two dummy variables.)
This is a general idea
Although we have presented best subset selection here for linear least squares regression, the same ideas apply to other types of models.
E.g. the deviance — negative two times the maximized log-likelihood — plays the role of RSS for a broader class of models.
May employ other loss functions directly (MAE, MAPE, 0/1 loss etc.).
Using the best-performing model (given the data) works if
X to be predicted/classified comes from the same population, and
you don't trust raw RSS too much!
The problem is that RSS can be made arbitrarily small by adding (irrelevant) predictors (remember overfitting?).
Stepwise selection
If p is large, best subset selection runs into problems:
Computational problems (how many models to compare?)
Statistical problems
The larger the search space, the higher the chance of finding models that look good on the given data, even though they might not have any predictive power on future data.
Thus an enormous search space can lead to overfitting and high variance of the coefficient estimates.
For both of these reasons, stepwise methods, which explore a far more restricted set of models, are attractive alternatives.
Forward stepwise selection
Forward stepwise selection begins with a model containing no predictors,
and then adds predictors to the model, one-at-a-time, until all of the predictors are in the model.
In particular, at each step the variable that gives the greatest additional improvement to the fit is added to the model.
This type of approach is called greedy: go for the largest immediate improvement.
In detail
1 Let M0 denote the null model, which contains no predictors.
2 For k = 0, . . . , p − 1 :
1 Consider all p − k models that augment the predictors in Mk with one additional predictor.
2 Choose the best among these p − k models, and call it Mk+1. Here best is defined as having the smallest RSS, or equivalently largest R2.
3 Select a single best model from among M0, . . . , Mp using suitable methods.
Computational advantage over best subset selection is clear.
It is not guaranteed to find the best possible model out of all 2^p models containing subsets of the p predictors.
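For illustration, a greedy forward stepwise sketch in Python (same assumptions as above: hypothetical numpy arrays `X` and `y`, least-squares fits via scikit-learn):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def forward_stepwise(X, y):
    """Greedy forward selection: best model of each size 0, ..., p by RSS."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    path = {0: ((), np.sum((y - y.mean()) ** 2))}      # null model M_0
    for k in range(p):
        best_rss, best_j = np.inf, None
        for j in remaining:                            # try adding each remaining predictor
            cols = selected + [j]
            fit = LinearRegression().fit(X[:, cols], y)
            rss = np.sum((y - fit.predict(X[:, cols])) ** 2)
            if rss < best_rss:
                best_rss, best_j = rss, j
        selected.append(best_j)
        remaining.remove(best_j)
        path[k + 1] = (tuple(selected), best_rss)      # M_{k+1}
    return path  # choose among M_0, ..., M_p with Cp/AIC/BIC or cross-validation
```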
Credit rating example
Set up predictive model for Balance
#Variables   Best subset                      Forward stepwise
One          rating                           rating
Two          rating, income                   rating, income
Three        rating, income, student          rating, income, student
Four         cards, income, student, limit    rating, income, student, limit
The first four selected models for best subset selection and forward stepwise selection on the Credit data set. The first three models are identical, but the fourth differs.
Backward stepwise selection
Like forward stepwise selection, backward stepwise selection provides an efficient alternative to best subset selection.
It begins with the full model containing all p predictors, and then iteratively removes the least useful predictor, one-at-a-time.
1 Let Mp denote the full model, which contains all p predictors.
2 For k = p, p − 1, . . . , 1:
1 Consider all k models that contain all but one of the predictors in Mk, for a total of k − 1 predictors.
2 Choose the best among these k models, and call it Mk−1. Here best is defined as having smallest RSS or highest R2.
3 Select a single best model from among M0, . . . , Mp using suitable methods.
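A corresponding backward sketch, under the same assumptions (note that it requires n > p so that the full model can be fit):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def backward_stepwise(X, y):
    """Greedy backward elimination: drop the predictor whose removal hurts RSS least."""
    n, p = X.shape
    selected = list(range(p))                          # full model M_p
    full = LinearRegression().fit(X, y)
    path = {p: (tuple(selected), np.sum((y - full.predict(X)) ** 2))}
    for k in range(p, 0, -1):
        best_rss, drop_j = np.inf, None
        for j in selected:                             # candidate models leaving out j
            cols = [c for c in selected if c != j]
            if cols:
                fit = LinearRegression().fit(X[:, cols], y)
                rss = np.sum((y - fit.predict(X[:, cols])) ** 2)
            else:
                rss = np.sum((y - y.mean()) ** 2)      # null model
            if rss < best_rss:
                best_rss, drop_j = rss, j
        selected.remove(drop_j)
        path[k - 1] = (tuple(selected), best_rss)      # M_{k-1}
    return path
```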
More on backward stepwise selection
Backward selection too searches through only 1 + p(p + 1)/2 models, and so can be applied in settings where p is too large to apply best subset selection.
The backward approach is not guaranteed to yield the best model containing a subset of the p predictors.
Backward selection requires that the number of samples n is larger than the number of variables p (so that the full model can be fit).
In contrast, forward stepwise can be used even when n < p, and so is the only viable subset method when p is very large.
Additional tricks
The effectiveness of model selection can be improved:
Targeting/Pre-screening
Only consider predictors that are not statistically independent of the target (in linear regression, this – more or less – amounts to correlation); see the sketch after this list.
Core predictors
Some predictors may always be included (for non-statistical reasons).
Smart searches
One could use heuristic discrete optimization techniques to find the best model without searching all 2^p of them.
Either way, success hinges on a reliable prediction error estimate.
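A minimal pre-screening sketch (the cutoff `m` and the ranking by absolute correlation are illustrative choices, not a recommendation):

```python
import numpy as np

def prescreen(X, y, m=20):
    """Keep the m predictors with the largest absolute sample correlation with the target."""
    corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    keep = np.argsort(-np.abs(corr))[:m]      # indices of the m most correlated predictors
    return np.sort(keep)
```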
Choosing the optimal model
Worth repeating:
The model containing all of the predictors will always have the smallest RSS and the largest R2. This is not informative about actual predictive performance!
Therefore, RSS and R2 are only suitable for selecting the best model among a collection of models with the same number of predictors.
To select among models of different complexity, we have to understand how error variances can be estimated. (Or, for classification, error rates.)
Estimating the prediction error
Prediction errors
Take a generic prediction model
Y = f(X; θ) + ε.
This may be linear, polynomial, or other (more to come).¹
The predictor function is f(x; θ̂) for some fitted θ̂.
Assuming bias away for simplicity, the prediction error is roughly
$$\hat{\varepsilon} = \varepsilon - (\hat{\theta} - \theta)'\, \nabla f(X; \theta),$$
where X is a new data point (and thus independent of the training data).
One of course aims to minimize Var(ε̂), but one also needs to measure it.
¹We look at prediction in more detail, but the same ideas apply to classification.
Prediction error variance
The variance of the prediction error ε̂ is
$$\mathrm{Var}(\hat{\varepsilon}) \approx \mathrm{Var}(\varepsilon) + \mathrm{tr}\left(\mathrm{Cov}(\hat{\theta})\, \mathrm{E}\left[\nabla f\, \nabla f'\right]\right),$$
where independence of X and θ̂ was used.
Of equal interest is the error variance given X,
$$\mathrm{Var}(\hat{\varepsilon} \mid X) \approx \mathrm{Var}(\varepsilon) + \nabla f'\, \mathrm{Cov}(\hat{\theta})\, \nabla f.$$
Either way, the second term is relatively easy to estimate given training data.
But estimating Var(ε) is not trivial.
Training errors
The usual residual variance estimator
$$\widehat{\mathrm{Var}}(\hat{\varepsilon}) = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - f(X_i; \hat{\theta})\right)^2 = \frac{1}{n}\sum_{i=1}^{n}\left(\varepsilon_i - (\hat{\theta} - \theta)'\, \nabla f(X_i; \theta)\right)^2$$
is an "in-sample" estimator, $\widehat{\mathrm{Var}}(\hat{\varepsilon}) = \frac{1}{n}\,\mathrm{RSS}$.
We call it training error.
Since RSS can be made arbitrarily small by adding regressors (or, in nonlinear models, by increasing the model complexity), the training error is a downward biased estimate of Var(ε̂).
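A small simulation can make this bias visible. The sketch below uses a made-up data-generating process (only the first three regressors matter) and compares the training MSE with the MSE on an independent test sample as irrelevant regressors are added:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, p = 100, 20
X, X_test = rng.normal(size=(n, p)), rng.normal(size=(n, p))
beta = np.r_[np.ones(3), np.zeros(p - 3)]            # only the first 3 predictors matter
y = X @ beta + rng.normal(size=n)
y_test = X_test @ beta + rng.normal(size=n)

for k in (3, 10, 20):                                # add more and more irrelevant regressors
    fit = LinearRegression().fit(X[:, :k], y)
    train_mse = np.mean((y - fit.predict(X[:, :k])) ** 2)
    test_mse = np.mean((y_test - fit.predict(X_test[:, :k])) ** 2)
    print(f"k={k:2d}  training MSE={train_mse:.2f}  test MSE={test_mse:.2f}")
```

Typically the training MSE keeps falling as k grows, while the test MSE does not.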
The issue
A bit of algebra shows that
$$\widehat{\mathrm{Var}}(\hat{\varepsilon}) = \frac{1}{n}\sum_{i=1}^{n}\varepsilon_i^2 \;-\; (\hat{\theta} - \theta)'\,\frac{2}{n}\sum_{i=1}^{n}\nabla f(X_i; \theta)\,\varepsilon_i \;+\; (\hat{\theta} - \theta)'\,\frac{1}{n}\sum_{i=1}^{n}\nabla f(X_i; \theta)\,\nabla f(X_i; \theta)'\,(\hat{\theta} - \theta).$$
The third term on the r.h.s. is usually of lower magnitude order; but, since θ̂ is not independent of the Xi, the second has expectation²
$$\mathrm{E}\left[-(\hat{\theta} - \theta)'\,\frac{2}{n}\sum_{i=1}^{n}\nabla f(X_i; \theta)\,\varepsilon_i\right] \approx -\frac{2}{n}\,\dim\theta\;\mathrm{Var}(\varepsilon).$$
²Strictly speaking, this requires some regularity conditions to hold.
Estimating test error: two approaches
To compare models, we need (unbiased) estimators of the prediction error variance:
We can directly estimate the test error, using a so-called validation/test set approach or a cross-validation approach.
The idea is to ensure that the −(2/n) dim θ bias term vanishes,
by forcing X and the training data to be (approximately) independent.
We can indirectly estimate predictive error variances by making an adjustment to the training error to account for the bias due to overfitting.
We illustrate both approaches next.
Cp, AIC/BIC, and adjusted R2
Credit data example
These statistics adjust the training error for the model size, and can be used to select among a set of models with different numbers of variables.
[Figure: Cp, BIC, and adjusted R2 for the best model of each size produced by best subset selection on the Credit data set, each plotted against the number of predictors.]
Now for some details
Mallows' Cp for LS regression:
$$C_p = \frac{1}{n}\left(\mathrm{RSS} + 2\,d\,\hat{\sigma}^2\right),$$
where d is the number of parameters (d ≠ p if including interactions, powers etc.) and σ̂² is an estimate of the variance of the error ε.
The AIC criterion is defined for a large class of models fit by ML:
AIC = −2 log L + 2d,
where L is the maximized value of the likelihood function for the estimated model.
(Some textbooks have a slightly different variant.)
May replace (with some care) −2 log L with a sum of normalized in-sample prediction losses.
Bayesian IC
For the Bayesian (Schwarz) criterion we have
BIC = −2 log L + d log n.
Like Cp, the BIC will tend to take on a small value for a model with a low test error, and so generally we select the model that has the lowest BIC value.
Notice that BIC replaces the 2dσ̂² used by Cp with a log(n)dσ̂² term, where n is the number of observations.
Since log n > 2 for any n > 7, the BIC statistic generally places a heavier penalty on models with many variables, and hence results in the selection of smaller models than Cp.
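For concreteness, these criteria can be computed along the following lines for a least-squares fit (a sketch assuming a Gaussian likelihood; `rss` is the residual sum of squares, `d` the number of fitted parameters, and `sigma2_hat` an estimate of Var(ε), e.g. from the largest model):

```python
import numpy as np

def mallows_cp(rss, d, sigma2_hat, n):
    """Cp = (RSS + 2 d sigma^2) / n."""
    return (rss + 2 * d * sigma2_hat) / n

def aic_gaussian(rss, d, n):
    """AIC = -2 log L + 2d for a Gaussian likelihood (up to an additive constant)."""
    return n * np.log(rss / n) + 2 * d

def bic_gaussian(rss, d, n):
    """BIC = -2 log L + d log n for a Gaussian likelihood (up to an additive constant)."""
    return n * np.log(rss / n) + d * np.log(n)
```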
AIC vs. BIC
BIC:
chooses the right model size as n → ∞
… but is a bit strict when n is small
AIC:
chooses better forecasting models when n is small
… but is too liberal as n → ∞.
Adjusted R2
For a least squares model with d variables, the adjusted R2 statistic is calculated as
$$\text{adjusted } R^2 = 1 - \frac{\mathrm{RSS}/(n - d - 1)}{\mathrm{TSS}/(n - 1)},$$
where TSS is the total sum of squares.
Unlike Cp, AIC, and BIC, for which a small value indicates a model with a low test error, a large value of adjusted R2 indicates a model with a low test error.
Maximizing the adjusted R2 is equivalent to minimizing
RSS/(n − d − 1). While RSS always decreases as the number of variables in the model increases, RSS/(n − d − 1) may increase or decrease, due to the presence of d in the denominator.
Unlike the R2 statistic, the adjusted R2 statistic pays a price for the
inclusion of unnecessary variables in the model.
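As a one-line illustration (inputs `rss`, `tss`, sample size `n`, and `d` variables):

```python
def adjusted_r2(rss, tss, n, d):
    """Adjusted R^2 = 1 - [RSS/(n-d-1)] / [TSS/(n-1)]."""
    return 1.0 - (rss / (n - d - 1)) / (tss / (n - 1))
```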
Credit data example
[Figure: square root of BIC, validation set error, and cross-validation error, each plotted against the number of predictors (Credit data).]
The validation errors were calculated by randomly selecting three-quarters of the observations as the training set, and the remainder as the validation set.
The cross-validation errors were computed using k = 10 folds.
See below for the essential details.
Validation
Validation-set approach
The idea is to remove the dependence between θ̂ and X.
So divide (randomly!) the available set of samples into two parts: the training set and a validation (or hold-out) set.
The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set.
The resulting validation-set error gives an honest estimate of prediction error:
MSE/misclassification rate for competing models.
We also use such methods to select so-called tuning parameters (e.g. controlling the complexity of a particular model rather than which predictors enter the model).
(Btw., we may have a 2nd layer of held-out data: the test set, which delivers a final estimate of prediction error variance that includes e.g. model selection as a source of estimation noise.)
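A minimal sketch with scikit-learn (hypothetical arrays `X` and `y`; the 50/50 split mirrors the Auto example below, but any split fraction can be used):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

def validation_set_mse(X, y, val_share=0.5, seed=1):
    """Fit on a random training part, report the MSE on the held-out validation part."""
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=val_share, random_state=seed)
    fit = LinearRegression().fit(X_tr, y_tr)              # fit on the training set only
    return np.mean((y_val - fit.predict(X_val)) ** 2)     # honest estimate of prediction error
```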
The Validation process
[Figure: the n observations are split at random into two halves.]
A random splitting into two halves: the left part is the training set, the right part is the validation set.
Example: Automobile Data I
Non-linear effects of predictors
Compare linear vs. higher-order polynomial terms in a linear regression: polynomial regression on the Auto data.
[Figure: miles per gallon against horsepower for the Auto data, with linear, degree-2 and degree-5 fits. The linear regression fit is shown in orange, the fit of a model that also includes horsepower² as a blue curve, and the fit including all polynomials of horsepower up to fifth degree in green. (Data available from the companion website of the James et al. textbook.)]
Example: Automobile Data II
We want to compare linear vs. higher-order polynomial terms in a linear regression.
We randomly split the 392 observations into two sets: a training set containing 196 of the data points, and a validation set containing the remaining 196 observations.
[Figure: validation mean squared error against the degree of the polynomial. Left panel shows a single split; right panel shows multiple splits.]
Drawbacks of validation set approach
The validation estimate of the test error can be highly variable, depending on precisely which observations are included in the training set and which observations are included in the validation set.
In the validation approach, only a subset of the observations — those that are included in the training set rather than in the validation set — are used to fit the model.
This suggests that the validation set error may tend to overestimate the test error for the model fit on the entire data set.
K-fold Cross-validation
Widely used approach for estimating test error.
Estimates can be used to select the best model, and to give an idea
of the test error of the final chosen model.
The idea is to randomly divide the data into K equal-sized parts:
We leave out part k,
fit the model to the other K − 1 parts (combined),
and then obtain predictions for the left-out kth part.
This is done in turn for each part k = 1, 2, …, K, and then the results are combined.
K-fold Cross-validation in detail
Divide the data into K roughly equal-sized parts (K = 5 here)
[Diagram: the sample is split into 5 folds; one fold serves as the validation set, the remaining four folds as the training set.]
… and use, in turn, each fold for validation.
Let the K parts/subsamples be C1, C2, …, CK, where Ck denotes the indices of the observations in part k.
There are nk observations in part k: if n is a multiple of K, then nk = n/K.
The details
Compute $\mathrm{MSE}_k = \sum_{i \in C_k} (Y_i - \hat{Y}_i)^2 / n_k$, where Ŷi is the fit for observation i, obtained from the data with part k removed, followed by
$$\mathrm{CV}_{(K)} = \sum_{k=1}^{K} \frac{n_k}{n}\, \mathrm{MSE}_k.$$
Put K = n to get n-fold or leave-one-out cross-validation (LOOCV).
The estimated standard deviation of CV(K) is
$$\widehat{\mathrm{SE}}(\mathrm{CV}_{(K)}) = \sqrt{\sum_{k=1}^{K}\left(\mathrm{MSE}_k - \overline{\mathrm{MSE}}\right)^2 / (K - 1)}.$$
This is a useful, but, strictly speaking, not a valid estimate. Why not?
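For concreteness, a K-fold sketch in Python (scikit-learn's KFold; the weighted combination follows the CV(K) formula above):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

def kfold_cv_mse(X, y, K=10, seed=1):
    """K-fold cross-validation estimate CV_(K) = sum_k (n_k / n) * MSE_k."""
    n = len(y)
    cv = 0.0
    for train_idx, val_idx in KFold(n_splits=K, shuffle=True, random_state=seed).split(X):
        fit = LinearRegression().fit(X[train_idx], y[train_idx])   # fit with part k removed
        mse_k = np.mean((y[val_idx] - fit.predict(X[val_idx])) ** 2)
        cv += len(val_idx) / n * mse_k                             # weight by n_k / n
    return cv
```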
A nice special case!
With least-squares linear or polynomial regression, a shortcut makes the cost of LOOCV the same as that of a single model fit! Namely,
$$\mathrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{Y_i - \hat{Y}_i}{1 - h_i}\right)^2,$$
where Yˆi is the ith fitted value from the original least squares fit, and hi is the leverage (diagonal of the “hat” matrix X(X′X)−1X′.) This is like the ordinary MSE, except the ith residual is divided by 1 − hi.
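A sketch of this shortcut for linear least squares, computing the leverages directly from the design matrix (illustration only; an intercept column is added for concreteness):

```python
import numpy as np

def loocv_shortcut(X, y):
    """LOOCV for least squares via leverages: mean of ((Y - Yhat) / (1 - h))^2."""
    Xd = np.column_stack([np.ones(len(y)), X])        # design matrix with intercept
    H = Xd @ np.linalg.solve(Xd.T @ Xd, Xd.T)         # hat matrix X(X'X)^{-1}X'
    h = np.diag(H)                                    # leverages h_i
    resid = y - H @ y                                 # residuals from the single full fit
    return np.mean((resid / (1.0 - h)) ** 2)
```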
LOOCV often useful, but sometimes doesn’t shake up the data enough. The estimates from each fold are highly correlated and hence their average can have high variance.
A better choice is K = 5 or 10.
Auto data revisited
[Figure: mean squared error against the degree of the polynomial for the Auto data; left panel LOOCV, right panel 10-fold CV.]
Other issues with cross-validation
Since each training set is only (K − 1)/K as big as the original training set, the estimates of prediction error will typically be biased upward.
This bias is minimized when K = n (LOOCV), but this estimate has high variance, as noted earlier.
K = 5 or K = 10 provides a good compromise for this bias-variance tradeoff.
Don’t forget to use the relevant loss function,
And don’t forget to include any model (pre)selection steps in the cross-validation procedure!
Cross-Validation for Classification Problems
We divide the data into K roughly equal-sized parts C1,C2,…CK. Ck denotes the indices of the observations in part k. There are nk obs. in part k.
Compute
$$\mathrm{CV}_K = \sum_{k=1}^{K} \frac{n_k}{n}\, \mathrm{Err}_k,$$
where $\mathrm{Err}_k = \sum_{i \in C_k} I(Y_i \neq \hat{Y}_i) / n_k$.
The estimated standard deviation of CVK is
$$\widehat{\mathrm{SE}}(\mathrm{CV}_K) = \sqrt{\sum_{k=1}^{K}\left(\mathrm{Err}_k - \overline{\mathrm{Err}}\right)^2 / (K - 1)}.$$
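A corresponding sketch for the misclassification rate (the logistic regression classifier is an illustrative stand-in for whatever classifier is being assessed):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

def kfold_cv_error_rate(X, y, K=10, seed=1):
    """CV_K = sum_k (n_k / n) * Err_k, with Err_k the 0/1 error rate in fold k."""
    n = len(y)
    cv = 0.0
    for tr, val in StratifiedKFold(n_splits=K, shuffle=True, random_state=seed).split(X, y):
        clf = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])  # fit with part k removed
        err_k = np.mean(clf.predict(X[val]) != y[val])             # misclassification rate
        cv += len(val) / n * err_k
    return cv
```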
No model selection without…
In-sample estimation of prediction error
Easy to compute
Downward biased without correction
Validation and cross-validation
possibly more accurate
no need for asymptotic approximations
more flexible, not only ML estimators
Up next
Coming up
Shrinkage and complexity reduction