Predictive Analytics
Week 4: Model Selection and Estimation I
Semester 2, 2018
Discipline of Business Analytics, The University of Sydney Business School
Week 4: Model Selection
1. Model Selection and Evaluation
2. The bias-variance decomposition
3. KNN bias-variance decomposition
4. Cross validation
5. The bias-variance derivation details (optional)
Reading: Chapter 2.1, 2.2 of ISL.
Exercise questions: Chapter 2.4 of ISL, Q1, Q3, Q5, Q6 and
Q7(a). Where a question contains "classification" content, focus
only on the "regression" parts.
Model Selection and Evaluation
Model Selection and Evaluation
Model Selection: estimate the performance of different models in
order to choose the (approximately) best one. We select the model
that is estimated to have the best predictive ability.

Model Evaluation: after choosing the final model, estimate its test
error (generalisation error) on new data.
• Training set: for exploratory data analysis, model building,
model estimation, etc.
• Validation set: for appropriate model selection.
• Test set: for model evaluation.
Underfitting and Overfitting
Why is overfitting bad?
• Low training error, high generalization error
• Poor predictive performance
• Overreacts to minor fluctuations in training data
How to address overfitting:
• Drop some features
  • Model selection algorithm
  • Manually select features to keep
• Regularization
  • Keep all features, but reduce the magnitude/values of the parameters.
More details in later sections.
The bias-variance trade-off (key concept)
• Increasing model complexity brings higher flexibility and
therefore lower bias. However, this comes at a cost of higher
variance: there is higher sample variability when estimating
complex models. Overfitting can be a problem.
• Decreasing model complexity leads to lower variance.
However, simpler models may not be sufficiently flexible to
capture the underlying patterns in the data, leading to higher
bias. Underfitting can be a problem.
The bias-variance trade-off
To review, we use the training data to estimate the additive error
model
Y = f(X) + ε
leading to an estimator f̂(x0) for the regression function at given
input point x0, f(x0).
Examples
Linear regression. Adding predictors increases model complexity.
Least squares estimates have high variability when the number of
inputs is large, but excluding relevant predictors leads to bias.
Review: Lecture 2.
Examples
KNN regression. Reducing the number of neighbours increases
model complexity. Closer neighbours means lower bias. However,
averaging fewer observations increases variance.
Review: Lecture 3.
Approaches to model selection
Train, validation and test sets. Validation set is used for
selecting the optimal model complexity.
Resampling methods. Estimate generalisation performance by
generating multiple splits of the training data.
Analytical criteria. Use analytical results to penalise training
performance to account for overfitting.
Training, validation, and test split
In the validation set approach, we randomly split the training
data into a training set, a validation set and a test set. We select
the model with the best predictive performance in the validation
set.
Typically, we use 50-80% of the data for the training set.
Training, validation, and test split
1. Estimate different models on the training data.
2. Predict the observations in the validation set.
3. Select the model with best validation set performance.
4. Re-estimate the selected model by combining the training
and validation sets.
5. Predict the test data with the selected model.
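A minimal numpy sketch of these five steps (an illustrative example rather than the course's own code; the simulated data, the polynomial candidate models and the 60/20/20 split are all assumptions):

import numpy as np

rng = np.random.default_rng(0)

# Illustrative data (assumed): 300 observations from a nonlinear f plus noise.
x = rng.uniform(0, 1, 300)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 300)

# 1. Random 60/20/20 split into training, validation and test sets.
idx = rng.permutation(300)
tr, va, te = idx[:180], idx[180:240], idx[240:]

# 2-3. Estimate candidate models (polynomial degrees) on the training set
#      and keep the one with the best validation MSE.
degrees = list(range(1, 11))
val_mse = [np.mean((y[va] - np.polyval(np.polyfit(x[tr], y[tr], d), x[va])) ** 2)
           for d in degrees]
best = degrees[int(np.argmin(val_mse))]

# 4. Re-estimate the selected model on the combined training + validation data.
tv = np.concatenate([tr, va])
beta = np.polyfit(x[tv], y[tv], best)

# 5. Evaluate the selected model once on the test set.
test_mse = np.mean((y[te] - np.polyval(beta, x[te])) ** 2)
print(best, test_mse)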
Validation set
The validation set approach has serious limitations when the size
of the training data is not large. The model may not have enough
data to train on, and there may not be enough cases in the
validation set to reliably estimate generalisation performance.
We turn instead to resampling methods.
Learning curve
We cannot use the training set to select the best model: the fitted
model minimises the loss on the training data, which does not
necessarily minimise the loss on new data sets.
Diagnosing learning curve
Suppose your training loss is low, while validation/test loss is high.
• Underfitting or overfitting problem?
Suppose your training loss is high, while validation/test loss is also
high.
• Underfitting or overfitting problem?
Diagnosing learning curve
Underfitting: training error is high, validation error is slightly >
training error;
Overfitting: training error is low, validation error is significantly >
training error.
The bias-variance decomposition
Expected prediction error
• Consider again the additive error model:

  Y = f(X) + ε,

  where we assume that E(ε) = 0 and Var(ε) = σ².
• In the previous section, we treated f̂(·) as given since our
objective was to estimate the test error. Now, we discuss the
fundamental problem of choosing a method to learn a
predictive function f̂(·).
Expected prediction error
We define the expected prediction error for a new input point
X = x0 as

Err(x0) = E[(Y0 − f̂(x0))² | X = x0],

where Y0 = f(x0) + ε0. The expectation is over ε0 and the
training sample, i.e. over the sampling distribution of f̂(·). The
EPE is an expected loss.
Note that this is different from the generalisation error, where
f̂(x0) is an estimate (not an estimator), and the expectation is
over the population P (X,Y ).
Expected prediction error decomposition
We can write the expected prediction error as:
Err(x0) = E[(Y0 − f̂(x0))² | X = x0]
        = E[(f(x0) + ε0 − f̂(x0))² | X = x0]
        = σ² + E[(f(x0) − f̂(x0))² | X = x0]
        = Irreducible error + Reducible error
Expected prediction error decomposition
Err(x0) = σ² + E_D[(f(x0) − f̂(x0))² | X = x0]
        = Irreducible error + Reducible error
• The first term is the variance of the response around its true
mean f(x0). We cannot avoid this source of error, and it puts
an upper bound on the accuracy of the prediction.
• In choosing a method, our concern is the reducible error: we
  want to minimise the estimation error

  E_D[(f(x0) − f̂(x0))² | X = x0].

• Here, E_D(·) emphasises that the expectation is taken over the
  training data D. This notation may be omitted later for simplicity.
The bias-variance trade-off
We can show that (see the last, optional section of these slides):

E[(f(x0) − f̂(x0))²] = [E(f̂(x0)) − f(x0)]² + E[(f̂(x0) − E(f̂(x0)))²]
                    = Bias²(f̂(x0)) + Var(f̂(x0))
                    = Bias² + Variance
The bias-variance trade-off
E[(f(x0) − f̂(x0))²] = Bias²(f̂(x0)) + Var(f̂(x0))
• We would like our model to be flexible enough to be able to
approximate (possibly) complex relationships between Y and
X.
• Typically, the more complex we make the model, the better its
approximation capabilities, which translates into lower bias.
• On the other hand, increasing model complexity leads to
higher variance. This is due to the larger (effective) number
of parameters to estimate.
• Hence, we would like to find the optimal (problem specific)
model complexity that minimises our expected loss.
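To make the trade-off concrete, the sketch below (an illustration added here, not from the slides) estimates the squared bias and variance of polynomial fits of increasing degree at a single input point by simulating many training sets from an assumed true f; the function, noise level and sample size are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)     # assumed true regression function
sigma, n, R, x0 = 0.3, 50, 500, 0.6     # noise sd, training size, replications, test point

for degree in (1, 3, 9):                # increasing model complexity
    preds = np.empty(R)
    for r in range(R):
        x = rng.uniform(0, 1, n)                              # a fresh training sample
        y = f(x) + rng.normal(0, sigma, n)
        preds[r] = np.polyval(np.polyfit(x, y, degree), x0)   # f_hat(x0) for this sample
    bias2 = (preds.mean() - f(x0)) ** 2                       # squared bias at x0
    var = preds.var()                                         # variance of f_hat(x0)
    print(f"degree {degree}: bias^2={bias2:.4f}  variance={var:.4f}  sum={bias2 + var:.4f}")

Typically the degree-1 fit shows high bias and low variance, the degree-9 fit the reverse, and an intermediate degree minimises their sum.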
The bias-variance trade-off
Increasing model complexity will always reduce the training error,
but there is an optimal level of complexity that minimises the test
error.
What would the plots look like if we fixed the model complexity
and changed the x-axis from "model complexity" to "size of the data"?
The bias-variance trade-off
Just because a model is more “realistic”, it does not mean that it
will have higher predictive accuracy. All models are
approximations, and our task is to find the most accurate one for
our purposes in a data-driven way.
Model selection
• Model selection is a set of methods (such as cross
validation) that allow us to choose the right model among
options of different complexity. It will be a fundamental part
of our methodology.
• Similarly to model evaluation, model selection methods are
concerned with estimating the generalisation error. However,
it is important not to confuse these two steps, which have
different goals.
KNN bias-variance decomposition
Bias-variance decomposition
Remember the following expression for the expected prediction
error:
Err(x0) = E[(Y0 − f̂(x0))² | X = x0]
        = σ² + Bias²(f̂(x0)) + Var(f̂(x0))
The Bias-Variance Decomposition: kNN example
• For kNN regression: suppose x0 is a test point whose k nearest
  neighbours in the training data D are {x1, x2, …, xk}, with
  responses {y1, y2, …, yk}. Then the model prediction is

  f̂(x0) = (1/k) ∑_{xi ∈ Nk(x0, D)} yi = (1/k) ∑_{l=1}^{k} yl

• Then the test error is

  Err(x0) = σ² + Bias²(f̂(x0)) + Var(f̂(x0))
          = σ² + [E(f̂(x0)) − f(x0)]² + E[(f̂(x0) − E(f̂(x0)))²]
The Bias-Variance Decomposition: kNN example
• First, noting that yl = f(xl) + εl, the expected prediction is

  E(f̂(x0)) = E[(1/k) ∑_{l=1}^{k} yl] = (1/k) ∑_{l=1}^{k} E[f(xl) + εl] = (1/k) ∑_{l=1}^{k} f(xl)

• For the variance term, we have

  f̂(x0) − E(f̂(x0)) = (1/k) ∑_{l=1}^{k} yl − (1/k) ∑_{l=1}^{k} f(xl) = (1/k) ∑_{l=1}^{k} εl

  hence, noting the independence of the εl,

  Var(f̂(x0)) = E[((1/k) ∑_{l=1}^{k} εl)²] = (1/k²)(σ² + ··· + σ²) = σ²/k
The Bias-Variance Decomposition: kNN example
• Finally, we have

  Err(x0) = E[(Y0 − f̂(x0))² | X = x0]
          = σ² + [f(x0) − (1/k) ∑_{l=1}^{k} f(xl)]² + σ²/k

  where the bias term involves the true regression function values
  f(x0) and f(xl) at the k neighbours of x0.
Bias-variance decomposition
Err(x0) = σ² + [f(x0) − (1/k) ∑_{l=1}^{k} f(xl)]² + σ²/k
• The model complexity decreases with the number of
neighbours k.
• With a small k, the bias will be relatively small, since the
  regression function evaluated at the neighbours, f(xl), will be
  close to f(x0). However, a small k means that we are averaging
  only a few observations, leading to high variance (σ²/k is large).
• As we increase k we reduce the variance, at the cost of higher
  bias.
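As a quick numerical check of this formula (an added illustration; the true function, noise level and design are assumptions), fix the training inputs, simulate many noise draws, and compare the Monte Carlo bias and variance of the kNN prediction at x0 with the analytical terms:

import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)          # assumed true regression function
sigma, n, k, R, x0 = 0.3, 100, 10, 2000, 0.5

# The derivation conditions on the inputs, so keep them fixed across
# simulated training sets and redraw only the noise.
x = np.sort(rng.uniform(0, 1, n))
nbr = np.argsort(np.abs(x - x0))[:k]         # indices of the k nearest neighbours of x0

preds = np.empty(R)
for r in range(R):
    y = f(x) + rng.normal(0, sigma, n)       # new responses, same inputs
    preds[r] = y[nbr].mean()                 # kNN prediction f_hat(x0)

print("bias^2:   MC", (preds.mean() - f(x0)) ** 2,
      " formula", (f(x0) - f(x[nbr]).mean()) ** 2)
print("variance: MC", preds.var(), " formula", sigma ** 2 / k)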
Cross validation
Cross validation
Cross validation methods are based on multiple random
training/validation set splits of the training data. Unlike in the
validation set approach, each observation gets a turn at being
predicted.
K-fold cross-validation (key concept)
[Figure: schematic of K-fold cross validation. The training data are split into K folds; each fold in turn serves as the validation set while the remaining folds are used for estimation.]
The idea of K-fold cross validation is simple:
1. We randomly split the training sample into K folds of roughly
equal size.
2. For each fold k ∈ {1, . . . ,K}, we estimate the model on all
other folds combined, and use k as the validation set.
3. The cross validation error is the average error across the K
validation sets.
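A sketch of these three steps in plain numpy (an illustrative implementation, not from the unit; the model is passed in as fit/predict functions, and the OLS example data below are assumptions):

import numpy as np

def kfold_cv_mse(X, y, fit, predict, K=5, seed=0):
    """K-fold cross validation estimate of the squared-error generalisation error."""
    idx = np.random.default_rng(seed).permutation(len(y))   # 1. random split into K folds
    folds = np.array_split(idx, K)
    errors = []
    for k in range(K):
        val = folds[k]                                       # fold k is the validation set
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit(X[train], y[train])                      # 2. estimate on the other folds
        errors.append(np.mean((y[val] - predict(model, X[val])) ** 2))
    return np.mean(errors)                                   # 3. average over the K folds

# Illustrative use with a linear model fitted by least squares:
fit_ols = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict_ols = lambda beta, X: X @ beta
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])
y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.normal(scale=0.5, size=100)
print(kfold_cv_mse(X, y, fit_ols, predict_ols, K=5))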
K-fold cross-validation
5-fold and 10-fold CV. K = 5 or K = 10 folds are common
choices for cross validation.
Leave one out cross validation (key concept). If we set K = N ,
this is called leave one out cross validation, or LOOCV. For each
observation i, we train the model on all other observations, and
predict i.
Leave one out CV
Algorithm Leave one out CV for regression
1: for i=1:N do
2: Assign observation i to the validation set.
3: Assign observations 1, . . . , i − 1, i + 1 . . . , N to the training
set D−i.
4: Estimate the model using the training set D−i.
5: Compute the prediction f̂−i(xi).
6: Compute the squared error (yi − f̂−i(xi))2.
7: end for
8: Compute the leave-one-out MSE:

   MSE_CV = (1/N) ∑_{i=1}^{N} (yi − f̂−i(xi))²
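A direct numpy translation of this algorithm for a linear model fitted by least squares (an illustrative sketch; any other estimator could be substituted for the least squares fit):

import numpy as np

def loocv_mse(X, y):
    """Leave-one-out CV MSE for a linear model fitted by least squares."""
    N = len(y)
    sq_err = np.empty(N)
    for i in range(N):
        keep = np.arange(N) != i                                   # training set D_{-i}
        beta = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]    # estimate on D_{-i}
        sq_err[i] = (y[i] - X[i] @ beta) ** 2                      # squared error for obs i
    return sq_err.mean()                                           # MSE_CV

Note that this refits the model N times; for OLS, the hat-matrix shortcut later in this section gives the same value from a single fit.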
LOOCV vs K-fold cross-validation
LOOCV. Approximately unbiased estimator of the expected
prediction error. However, it can have high variance in some
settings (since the training sets are very similar for every
prediction) and a high computational cost (except in special cases).
K-fold. Lower computational cost and may have lower variance.
However, it is subject to bias since the training sets are smaller
than N .
Cross validation: recommendations
One standard deviation rule. Pick the simplest model within one
standard deviation of the model with the lowest cross validated
errors.
Choice of K. There are no general guidelines for choosing K since
the trade-off between variance, bias, and computational cost is
highly context specific. The variance of LOOCV tends to be
relatively low with stable estimators such as linear regression.
Many predictors. When there are many predictors, pre-screening
them using the entire training data (before the CV splits) can give
misleadingly optimistic CV estimates.
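A small helper implementing the one standard deviation rule (an illustrative sketch; how the per-model CV means and standard deviations are obtained is left to the user):

import numpy as np

def one_sd_rule(cv_mean, cv_sd, complexity):
    """Index of the simplest model within one SD of the lowest CV error.

    cv_mean, cv_sd : CV error and its standard deviation for each candidate model;
    complexity     : model complexity measure (lower = simpler), in the same order.
    """
    cv_mean, cv_sd, complexity = map(np.asarray, (cv_mean, cv_sd, complexity))
    best = np.argmin(cv_mean)                                  # model with lowest CV error
    eligible = np.flatnonzero(cv_mean <= cv_mean[best] + cv_sd[best])
    return eligible[np.argmin(complexity[eligible])]           # simplest eligible model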
Leave one out CV for linear regression
For a linear regression estimated by OLS, we can use a shortcut to
compute the leave-one-out errors without having to re-estimate the
model. Given the OLS fitted values

ŷ = X(XᵀX)⁻¹Xᵀy = Hy,

we can show that the leave-one-out MSE is

MSE_CV = (1/N) ∑_{i=1}^{N} (yi − f̂−i(xi))² = (1/N) ∑_{i=1}^{N} [(yi − f̂(xi)) / (1 − Hii)]²,
where Hii is the ith diagonal element of the hat matrix H.
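A sketch of this shortcut (reusing the illustrative X, y and loocv_mse from the earlier examples); the two computations should agree up to rounding error:

import numpy as np

def loocv_mse_shortcut(X, y):
    """Leave-one-out MSE for OLS via the hat matrix, without refitting."""
    H = X @ np.linalg.solve(X.T @ X, X.T)     # hat matrix H = X (X^T X)^{-1} X^T
    fitted = H @ y                            # in-sample fitted values f_hat(x_i)
    h = np.diag(H)                            # leverages H_ii
    return np.mean(((y - fitted) / (1 - h)) ** 2)

# e.g. loocv_mse(X, y) and loocv_mse_shortcut(X, y) should match.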
Generalised cross validation
The previous method applies to many situations in which we have
fitted values of the type
ŷ = Sy.
The generalised cross validation method approximates the leave
one out MSE as
MSE_GCV = (1/N) ∑_{i=1}^{N} [(yi − f̂(xi)) / (1 − tr(S)/N)]²,

where tr(S) is the trace of S (the sum of the elements on its
diagonal). GCV can be computationally convenient in some settings.
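Continuing the same illustrative sketch, GCV replaces each leverage Hii by the average tr(S)/N; for OLS the smoother S is the hat matrix H, and for a full-rank X with p columns tr(S) = p:

import numpy as np

def gcv_mse(X, y):
    """Generalised cross validation approximation to the leave-one-out MSE."""
    N = len(y)
    S = X @ np.linalg.solve(X.T @ X, X.T)     # smoother matrix (here the OLS hat matrix)
    fitted = S @ y
    return np.mean(((y - fitted) / (1 - np.trace(S) / N)) ** 2)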
The bias-variance derivation details
(optional)
The Bias-Variance Decomposition
• We first focus on the squared-error loss function. Assume the true
  model Y = f(X) + ε, where E(ε) = 0 and Var(ε) = σ².
• For a test point x0, first we expand

  (Y − f̂(x0))²
  = [Y − f(x0) + f(x0) − E_D f̂(x0) + E_D f̂(x0) − f̂(x0)]²
  = (Y − f(x0))² + (f(x0) − E_D f̂(x0))² + (E_D f̂(x0) − f̂(x0))²
    + 2(Y − f(x0))(f(x0) − E_D f̂(x0)) + 2(Y − f(x0))(E_D f̂(x0) − f̂(x0))
    + 2(f(x0) − E_D f̂(x0))(E_D f̂(x0) − f̂(x0))
The Bias-Variance Decomposition
• We calculate its test error, noting that Y − f(x0) = ε:

  Err(x0) = E_{Y,D}[(Y − f̂(x0))² | X = x0]
          = E_{Y,D}[ε²] + E_{Y,D}[(f(x0) − E_D f̂(x0))²] + E_{Y,D}[(E_D f̂(x0) − f̂(x0))²]
            + 2 E_{Y,D}[ε (f(x0) − E_D f̂(x0))] + 2 E_{Y,D}[ε (E_D f̂(x0) − f̂(x0))]
            + 2 E_{Y,D}[(f(x0) − E_D f̂(x0))(E_D f̂(x0) − f̂(x0))]
The Bias-Variance Decomposition
• Clearly

  E_{Y,D}[ε²] = E_ε[ε²] = Var(ε) = σ²

• There is no randomness in (f(x0) − E_D f̂(x0))², so

  E_{Y,D}[(f(x0) − E_D f̂(x0))²] = (f(x0) − E_D f̂(x0))²,

  called the squared bias.
The Bias-Variance Decomposition
• And the variance is

  E_{Y,D}[(E_D f̂(x0) − f̂(x0))²] = E_D[(E_D f̂(x0) − f̂(x0))²] = Var(f̂(x0))

• There is no randomness in (f(x0) − E_D f̂(x0)), hence

  E_{Y,D}[ε (f(x0) − E_D f̂(x0))] = E(ε)(f(x0) − E_D f̂(x0)) = 0

• With independence, we have

  E_{Y,D}[ε (E_D f̂(x0) − f̂(x0))] = E(ε) E_D[E_D f̂(x0) − f̂(x0)] = 0

• Similarly,

  E_{Y,D}[(f(x0) − E_D f̂(x0))(E_D f̂(x0) − f̂(x0))] = 0
The Bias-Variance Decomposition
• Then we have the following decomposition:

  Err(x0) = σ² + Bias²(f̂(x0)) + Var(f̂(x0))

• The first term is the irreducible error. It comes from the noise
  inherent in the data and cannot be reduced by any modelling choice.
• The second term is the squared bias, defined by

  Bias²(f̂(x0)) = (f(x0) − E_D f̂(x0))²

• The last term is the variance, the expected squared deviation of
  the estimate f̂(x0) around its mean:

  Var(f̂(x0)) = E_D[(E_D f̂(x0) − f̂(x0))²],

  where E_D denotes the "average" over all possible training data sets.
The Bias-Variance Decomposition: OLS regression example
• For OLS with the estimated model f̂_ols(x0) = β̂ᵀx0 = x0ᵀβ̂, we have

  E_D f̂_ols(x0) = E_D[x0ᵀβ̂] = x0ᵀ E_D[(XᵀX)⁻¹Xᵀy]
                = x0ᵀ(XᵀX)⁻¹Xᵀ E_D[y]
                = x0ᵀ(XᵀX)⁻¹Xᵀ E_D[f + ε]
                = x0ᵀ(XᵀX)⁻¹Xᵀ f    (note that E(ε) = 0)

  where f = [f(x1), f(x2), …, f(xN)]ᵀ is the vector of true model
  values at the training inputs and ε = [ε1, ε2, …, εN]ᵀ is given by
  yi = f(xi) + εi (i = 1, 2, …, N).
The Bias-Variance Decomposition: OLS regression example
• Hence the squared bias is

  Bias² = (f(x0) − x0ᵀ(XᵀX)⁻¹Xᵀ f)²

• Denote

  h(x0) = X(XᵀX)⁻¹x0 = [h1(x0), h2(x0), …, hN(x0)]ᵀ
The Bias-Variance Decomposition: OLS regression example
• For the variance term, we have

  Var(f̂(x0)) = E_D[(f̂_ols(x0) − E_D f̂_ols(x0))²]
             = E_D[(h(x0)ᵀy − h(x0)ᵀf)²]
             = E_D[(h(x0)ᵀε)²]
             = E_D[(h1(x0)ε1 + h2(x0)ε2 + ··· + hN(x0)εN)²]

• By assumption, E_D(εi²) = σ² and the εi are independent of each
  other, so

  Var(f̂(x0)) = h1(x0)²σ² + h2(x0)²σ² + ··· + hN(x0)²σ² = ‖h(x0)‖²σ²

• Finally, we have

  Err(x0) = σ² + (f(x0) − h(x0)ᵀf)² + ‖h(x0)‖²σ²
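As with the kNN case, this decomposition can be verified numerically (an added illustration; the design, true function and noise level are arbitrary assumptions). Keep X fixed, redraw the noise, and compare the Monte Carlo bias and variance of f̂_ols(x0) with the two formulas:

import numpy as np

rng = np.random.default_rng(2)
N, sigma, R = 50, 0.5, 5000

# Fixed design with an intercept, and a nonlinear true f so the linear
# model is misspecified and the bias term is nonzero.
x = np.sort(rng.uniform(-1, 1, N))
X = np.column_stack([np.ones(N), x])
f_vec = np.sin(2 * x)                          # f at the training inputs
x0 = np.array([1.0, 0.7])                      # test input (with intercept)
f_x0 = np.sin(2 * 0.7)                         # f(x0)

h = X @ np.linalg.solve(X.T @ X, x0)           # h(x0) = X (X^T X)^{-1} x0

preds = np.empty(R)
for r in range(R):
    y = f_vec + rng.normal(0, sigma, N)        # new noise, same design
    beta = np.linalg.solve(X.T @ X, X.T @ y)   # OLS estimate
    preds[r] = x0 @ beta                       # f_hat_ols(x0) = h(x0)^T y

print("bias^2:   MC", (preds.mean() - f_x0) ** 2, " formula", (f_x0 - h @ f_vec) ** 2)
print("variance: MC", preds.var(), " formula", (h ** 2).sum() * sigma ** 2)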
Review questions
• How does model selection relate to the bias-variance
trade-off?
• What is a validation set? How is it different from a test set?
• What is K-Fold cross validation? Describe how it works.
• What is the one standard deviation rule?