

Predictive Analytics
Week 4: Model Selection and Estimation I

Semester 2, 2018

Discipline of Business Analytics, The University of Sydney Business School

Week 4: Model Selection

1. Model Selection and Evaluation

2. The bias-variance decomposition

3. KNN bias-variance decomposition

4. Cross validation

5. The bias-variance derivation details (optional)

Reading: Chapter 2.1, 2.2 of ISL.

Exercise questions: Chapter 2.4 of ISL, Q1, Q3, Q5, Q6 and
Q7(a). Where a question contains “classification” content, focus
only on the “regression” parts.


Model Selection and Evaluation


Model Selection: estimate the performance of different models in
order to choose the (approximately) best one. We select the model
that is estimated to have the best predictive ability.

Model Evaluation: having chosen the final model, estimate its test
error (generalisation error) on new data.

• Training set: for exploratory data analysis, model building,
model estimation, etc.

• Validation set: for appropriate model selection.

• Test set: for model evaluation.
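As a minimal sketch of these three roles, the snippet below splits a synthetic data set into 60% training, 20% validation and 20% test; the proportions, the data, and the use of scikit-learn's train_test_split are assumptions made for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=500)

# First carve off a 20% test set, then split the rest 75/25 into training/validation,
# giving roughly 60% training, 20% validation, 20% test overall.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=1)
```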


Underfitting and Overfitting


Why is overfitting bad?

• Low training error, high generalisation error
• Poor predictive performance
• Overreacts to minor fluctuations in the training data

How to address overfitting:

• Drop some features
  • Use a model selection algorithm
  • Manually select which features to keep

• Regularisation
  • Keep all features, but reduce the magnitude/values of the parameters.

More details in later sections.


The bias-variance trade-off (key concept)

• Increasing model complexity brings higher flexibility and
therefore lower bias. However, this comes at a cost of higher
variance: there is higher sample variability when estimating
complex models. Overfitting can be a problem.

• Decreasing model complexity leads to lower variance.
However, simpler models may not be sufficiently flexible to
capture the underlying patterns in the data, leading to higher
bias. Underfitting can be a problem.


The bias-variance trade-off

To review, we use the training data to estimate the additive error
model

Y = f(X) + ε,

leading to an estimator f̂(x0) of the regression function f(x0) at a
given input point x0.


Examples

Linear regression. Adding predictors increases model complexity.
Least squares estimates have high variability when the number of
inputs is large, but excluding relevant predictors leads to bias.

Review: Lecture 2.


Examples

KNN regression. Reducing the number of neighbours increases
model complexity. Closer neighbours means lower bias. However,
averaging fewer observations increases variance.

Review: Lecture 3.


Approaches to model selection

Train, validation and test sets. The validation set is used for
selecting the optimal model complexity.

Resampling methods. Estimate generalisation performance by
generating multiple splits of the training data.

Analytical criteria. Use analytical results to penalise training
performance to account for overfitting.


Training, validation, and test split

In the validation set approach, we randomly split the available data
into a training set, a validation set and a test set. We select the
model with the best predictive performance on the validation set.

Typically, we use 50-80% of the data for the training set.


Training, validation, and test split

1. Estimate different models on the training data.

2. Predict the observations in the validation set.

3. Select the model with best validation set performance.

4. Re-estimate the selected model by combining the training
and validation sets.

5. Predict the test data with the selected model.
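A hedged Python sketch of these five steps, using KNN regressors with different values of k as the candidate models; the synthetic data, the candidate k values and the use of scikit-learn are assumptions made only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(600, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=600)

# Split into training (60%), validation (20%) and test (20%) sets.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=1)

# Steps 1-3: estimate each candidate model on the training set,
# predict the validation set, and keep the best validation MSE.
val_mse = {}
for k in (1, 5, 15, 50):
    model = KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr)
    val_mse[k] = mean_squared_error(y_val, model.predict(X_val))
best_k = min(val_mse, key=val_mse.get)

# Step 4: re-estimate the selected model on training + validation data.
X_full = np.vstack([X_tr, X_val])
y_full = np.concatenate([y_tr, y_val])
final_model = KNeighborsRegressor(n_neighbors=best_k).fit(X_full, y_full)

# Step 5: predict the test data with the selected model.
test_mse = mean_squared_error(y_test, final_model.predict(X_test))
print(best_k, test_mse)
```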


Validation set

The validation set approach has serious limitations when the size
of the training data is not large. The model may not have enough
data to train on, and there may not be enough cases in the
validation set to reliably estimate generalisation performance.

We turn instead to resampling methods.


Learning curve

We cannot use the training set to select the best model: a fitted
model minimises the loss on the training data set, not necessarily
the loss on new data sets.
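A learning curve typically plots training and validation error, for example against the size of the training set. One way such a curve could be computed is sketched below with scikit-learn's learning_curve helper; the linear model, the synthetic data and the chosen training-set sizes are assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=1.0, size=400)

# Training and cross-validated MSE for increasing training-set sizes.
sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5, scoring="neg_mean_squared_error",
)
train_mse = -train_scores.mean(axis=1)
val_mse = -val_scores.mean(axis=1)
print(np.c_[sizes, train_mse, val_mse])  # columns: size, training MSE, validation MSE
```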


Diagnosing learning curve

Suppose your training loss is low, while validation/test loss is high.

• Underfitting or overfitting problem?

Suppose your training loss is high, while validation/test loss is also
high.

• Underfitting or overfitting problem?


Diagnosing learning curve

Underfitting: training error is high, validation error is slightly >
training error;

Overfitting: training error is low, validation error is significantly >
training error.


The bias-variance decomposition

Expected prediction error

• Consider again the additive error model:

Y = f(X) + ε,

where we assume that E(ε) = 0 and Var(ε) = σ².

• In the previous section, we treated f̂(·) as given since our
objective was to estimate the test error. Now, we discuss the
fundamental problem of choosing a method to learn a
predictive function f̂(·).


Expected prediction error

We define the expected prediction error for a new input point
X = x0 as

Err(x0) = E[(Y0 − f̂(x0))² | X = x0],

where Y0 = f(x0) + ε0. The expectation is over ε0 and the
training sample, i.e. over the sampling distribution of f̂(·). The
EPE is an expected loss.

Note that this is different from the generalisation error, where
f̂(x0) is an estimate (not an estimator), and the expectation is
over the population P (X,Y ).


Expected prediction error decomposition

We can write the expected prediction error as:

Err(x0) = E[(Y0 − f̂(x0))² | X = x0]
        = E[(f(x0) + ε0 − f̂(x0))² | X = x0]
        = σ² + E[(f(x0) − f̂(x0))² | X = x0]
        = Irreducible error + Reducible error


Expected prediction error decomposition

Err(x0) = σ² + ED[(f(x0) − f̂(x0))² | X = x0]
        = Irreducible error + Reducible error

• The first term is the variance of the response around its true
mean f(x0). We cannot avoid this source of error, and it puts
an upper bound on the accuracy of the prediction.

• In choosing a method, our concern is the reducible error: we
want to minimise the estimation error ED[(f(x0) − f̂(x0))² | X = x0].

• Here, ED(·) is used to emphasise that the expectation is over
the training data. This notation may be omitted later for simplicity.


The bias-variance trade-off

We can show that (see last optional section of this slides):

E[(f(x0) − f̂(x0))²] = [E(f̂(x0)) − f(x0)]² + E([f̂(x0) − E(f̂(x0))]²)
                     = Bias²(f̂(x0)) + Var(f̂(x0))
                     = Bias² + Variance
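The decomposition can be checked by simulation. The sketch below repeatedly draws training sets from an assumed true function, fits polynomials of two different degrees, and compares the Monte Carlo bias² and variance of f̂(x0) with the mean squared estimation error; the true function, noise level and polynomial models are all illustrative assumptions.

```python
import numpy as np

# Monte Carlo check of E[(f(x0) - fhat(x0))^2] = Bias^2 + Variance.
rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * x)              # assumed "true" regression function
sigma, n, x0, reps = 0.3, 50, 0.5, 2000

for degree in (1, 5):                    # a simple and a more complex model
    preds = np.empty(reps)
    for r in range(reps):
        x = rng.uniform(-1, 1, n)
        y = f(x) + rng.normal(scale=sigma, size=n)
        coef = np.polyfit(x, y, degree)  # fit to this particular training set
        preds[r] = np.polyval(coef, x0)  # fhat(x0) for this training set
    bias2 = (preds.mean() - f(x0)) ** 2
    var = preds.var()
    mse = np.mean((preds - f(x0)) ** 2)
    print(f"degree {degree}: bias^2={bias2:.4f} var={var:.4f} "
          f"bias^2+var={bias2 + var:.4f} MSE={mse:.4f}")
```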


The bias-variance trade-off

E[(f(x0) − f̂(x0))²] = Bias²(f̂(x0)) + Var(f̂(x0))
• We would like our model to be flexible enough to be able to

approximate (possibly) complex relationships between Y and
X.

• Typically, the more complex we make the model, the better its
approximation capabilities, which translates into lower bias.

• On the other hand, increasing model complexity leads to
higher variance. This is due to the larger (effective) number
of parameters to estimate.

• Hence, we would like to find the optimal (problem specific)
model complexity that minimises our expected loss.

The bias-variance trade-off

Increasing model complexity will always reduce the training error,
but there is an optimal level of complexity that minimises the test
error.

How would the plots look if we fixed the model complexity and
changed the x-axis from ”model complexity” to ”size of the data”?


The bias-variance trade-off

Just because a model is more “realistic” does not mean that it
will have higher predictive accuracy. All models are
approximations, and our task is to find the most accurate one for
our purposes in a data-driven way.


Model selection

• Model selection is a set of methods (such as cross
validation) that allow us to choose the right model among
options of different complexity. It will be a fundamental part
of our methodology.

• Similarly to model evaluation, model selection methods are
concerned with estimating the generalisation error. However,
it is important not to confuse these two steps, which have
different goals.


KNN bias-variance decomposition

Bias-variance decomposition

Remember the following expression for the expected prediction
error:

Err(x0) = E[(Y0 − f̂(x0))² | X = x0]
        = σ² + Bias²(f̂(x0)) + Var(f̂(x0))


The Bias-Variance Decomposition: kNN example

• For kNN regression: suppose x0 is a test data point whose k
nearest neighbours in the training data D are {x1, x2, ..., xk},
with responses {y1, y2, ..., yk}. Then the model prediction is

  f̂(x0) = (1/k) Σ_{xl ∈ Nk(x0,D)} yl.

• Then the test error is

  Err(x0) = σ² + Bias²(f̂(x0)) + Var(f̂(x0))
          = σ² + [E(f̂(x0)) − f(x0)]² + E([f̂(x0) − E(f̂(x0))]²)
27/47

The Bias-Variance Decomposition: kNN example

• First we check, noting that yl = f(xl) + εl,

  E(f̂(x0)) = E[(1/k) Σ_{l=1}^{k} yl] = (1/k) Σ_{l=1}^{k} E[f(xl) + εl]
            = (1/k) Σ_{l=1}^{k} f(xl)

• For the third term, we have

  f̂(x0) − E(f̂(x0)) = (1/k) Σ_{l=1}^{k} yl − (1/k) Σ_{l=1}^{k} f(xl) = (1/k) Σ_{l=1}^{k} εl

  hence, noting the independence of the εl,

  Var(f̂(x0)) = E[((1/k) Σ_{l=1}^{k} εl)²] = (1/k²)(σ² + ··· + σ²) = σ²/k


The Bias-Variance Decomposition: kNN example

• Finally we have

  Err(x0) = E[(Y0 − f̂(x0))² | X = x0]
          = σ² + [f(x0) − (1/k) Σ_{l=1}^{k} f(xl)]² + σ²/k

  where we need the true model values f(x0) and f(xl) at the k
  nearest neighbours of x0.
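A quick numerical check of this formula: holding the training inputs fixed and redrawing only the noise, the simulated bias² and variance of the kNN prediction should match the two theoretical terms; the true function, σ and k below are assumptions for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2                     # assumed true regression function
sigma, n, k, x0, reps = 0.5, 100, 10, 0.3, 5000

x_train = np.sort(rng.uniform(-1, 1, n))          # fixed training inputs
nn = np.argsort(np.abs(x_train - x0))[:k]         # indices of the k nearest neighbours

preds = np.empty(reps)
for r in range(reps):
    y_train = f(x_train) + rng.normal(scale=sigma, size=n)   # new noise each time
    preds[r] = y_train[nn].mean()                            # kNN prediction at x0

bias2_theory = (f(x0) - f(x_train[nn]).mean()) ** 2
var_theory = sigma ** 2 / k
print("bias^2: simulated", (preds.mean() - f(x0)) ** 2, "theory", bias2_theory)
print("var:    simulated", preds.var(), "theory", var_theory)
```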


Bias-variance decomposition

Err(x0) = σ² + [f(x0) − (1/k) Σ_{l=1}^{k} f(xl)]² + σ²/k

• The model complexity decreases with the number of
neighbours k.

• With a small k, the bias will be relatively small, since the
regression function evaluated at the neighbours, f(xl), will be
close to f(x0). However, a small k means that we are averaging
only a few observations, leading to high variance (σ²/k is large).

• As we increase k we reduce the variance, at the cost of higher
bias.

Cross validation


Cross validation methods are based on multiple random
training/validation set splits. Unlike in the validation set approach,
each observation gets a turn at being predicted.


K-fold cross-validation (key concept)


The idea of K-fold cross validation is simple:

1. We randomly split the training sample into K folds of roughly
equal size.

2. For each fold k ∈ {1, . . . , K}, we estimate the model on all the
other folds combined, and use fold k as the validation set.

3. The cross validation error is the average error across the K
validation sets.
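A sketch of K-fold CV with scikit-learn, here used to compare KNN regressors with different numbers of neighbours; the synthetic data and the candidate k values are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

kf = KFold(n_splits=5, shuffle=True, random_state=1)   # step 1: random folds
for k in (1, 5, 15, 50):
    # steps 2-3: fit on the other folds, score each held-out fold, average the K errors
    scores = cross_val_score(KNeighborsRegressor(n_neighbors=k), X, y,
                             cv=kf, scoring="neg_mean_squared_error")
    print(f"k={k:2d}  CV MSE = {-scores.mean():.4f}")
```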


K-fold cross-validation

5-fold and 10-fold CV. K = 5 or K = 10 folds are common
choices for cross validation.

Leave one out cross validation (key concept). If we set K = N ,
this is called leave one out cross validation, or LOOCV. For each
observation i, we train the model on all other observations, and
predict i.


Leave one out CV

Algorithm: Leave-one-out CV for regression

1: for i = 1, . . . , N do
2:   Assign observation i to the validation set.
3:   Assign observations 1, . . . , i − 1, i + 1, . . . , N to the training set D−i.
4:   Estimate the model using the training set D−i.
5:   Compute the prediction f̂−i(xi).
6:   Compute the squared error (yi − f̂−i(xi))².
7: end for
8: Compute the leave-one-out MSE:

   MSECV = (1/N) Σ_{i=1}^{N} (yi − f̂−i(xi))²
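The algorithm translates directly into code. Below is a minimal sketch, assuming a linear regression model and synthetic data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = X @ np.array([1.0, -1.0]) + rng.normal(scale=0.5, size=60)

N = len(y)
sq_errors = np.empty(N)
for i in range(N):
    mask = np.arange(N) != i                   # training set D_{-i}
    model = LinearRegression().fit(X[mask], y[mask])
    pred_i = model.predict(X[i:i + 1])[0]      # fhat_{-i}(x_i)
    sq_errors[i] = (y[i] - pred_i) ** 2        # squared leave-one-out error
mse_cv = sq_errors.mean()                      # leave-one-out MSE
print(mse_cv)
```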


LOOCV vs K-fold cross-validation

LOOCV. Approximately unbiased estimator of the expected
prediction error. However, it can have high variance in some
settings (since the training sets are very similar for every
prediction) and a high computational cost (except in special cases).

K-fold. Lower computational cost and may have lower variance.
However, it is subject to bias since the training sets are smaller
than N .


Cross validation: recommendations

One standard deviation rule. Pick the simplest model within one
standard deviation of the model with the lowest cross validated
errors.
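A small sketch of this rule, using hypothetical CV means and standard errors for five models ordered from simplest to most complex (all numbers are made up for illustration).

```python
import numpy as np

# Hypothetical cross-validation results, models ordered simplest -> most complex.
cv_mean = np.array([1.30, 1.05, 0.98, 0.97, 1.02])   # average CV error per model
cv_se   = np.array([0.06, 0.05, 0.05, 0.06, 0.07])   # standard error per model

best = np.argmin(cv_mean)                      # model with the lowest CV error
threshold = cv_mean[best] + cv_se[best]        # one standard deviation above the best
chosen = np.min(np.where(cv_mean <= threshold)[0])   # simplest model within the band
print("lowest CV error:", best, "one-SD rule choice:", chosen)
```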

Choice of K. There are no general guidelines for choosing K since
the trade-off between variance, bias, and computational cost is
highly context specific. The variance of LOOCV tends to be
relatively low with stable estimators such as linear regression.

Many predictors. When there are many predictors, pre-screening
based on the entire training set may result in misleading CV.


Leave one out CV for linear regression

For a linear regression estimated by OLS, we can use a shortcut to
compute the leave-one-out errors without having to re-estimate the
model. Given the OLS fitted values

ŷ = X(XᵀX)⁻¹Xᵀy = Hy,

we can show that the leave-one-out MSE is

MSECV = (1/N) Σ_{i=1}^{N} (yi − f̂−i(xi))²
      = (1/N) Σ_{i=1}^{N} [(yi − f̂(xi)) / (1 − Hii)]²,

where Hii is the ith diagonal element of the hat matrix H.
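The identity can be verified numerically. The sketch below computes the leverage values Hii from the hat matrix (with an intercept column added, an assumption of this example) and checks the shortcut against brute-force LOOCV on synthetic data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 80
X = rng.normal(size=(n, 3))
y = X @ np.array([2.0, 0.0, -1.0]) + rng.normal(scale=0.7, size=n)

# Shortcut: residuals scaled by 1 - H_ii, with H the hat matrix of [1, X].
Xd = np.column_stack([np.ones(n), X])
H = Xd @ np.linalg.solve(Xd.T @ Xd, Xd.T)
h = np.diag(H)
fit = LinearRegression().fit(X, y)
resid = y - fit.predict(X)
mse_shortcut = np.mean((resid / (1 - h)) ** 2)

# Brute-force LOOCV for comparison.
errs = []
for i in range(n):
    mask = np.arange(n) != i
    m = LinearRegression().fit(X[mask], y[mask])
    errs.append((y[i] - m.predict(X[i:i + 1])[0]) ** 2)
print(mse_shortcut, np.mean(errs))  # the two agree up to floating-point error
```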


Generalised cross validation

The previous method applies to many situations in which we have
fitted values of the type

ŷ = Sy.

The generalised cross validation method approximates the leave
one out MSE as

MSEGCV = (1/N) Σ_{i=1}^{N} [(yi − f̂(xi)) / (1 − tr(S)/N)]²,

where tr(S) is the trace of S (the sum of the elements in its
diagonal). GCV can be computationally convenient in some
settings.
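As an assumed example of such a linear smoother, ridge regression has S = X(XᵀX + λI)⁻¹Xᵀ; the sketch below computes MSEGCV for it on synthetic data (the penalty λ and the data are arbitrary choices for the illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 100, 5, 1.0
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=n)

# Smoother matrix of ridge regression (no intercept, for simplicity).
S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
y_hat = S @ y
mse_gcv = np.mean(((y - y_hat) / (1 - np.trace(S) / n)) ** 2)
print(mse_gcv)
```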


The bias-variance derivation details (optional)

The Bias-Variance Decomposition

• We first focus on the squared-error loss function. Assume the true
model Y = f(X) + ε, where E(ε) = 0 and Var(ε) = σ².

• For a test point x0, first we look at

  (Y − f̂(x0))²
  = [Y − f(x0) + f(x0) − ED f̂(x0) + ED f̂(x0) − f̂(x0)]²
  = (Y − f(x0))² + (f(x0) − ED f̂(x0))² + (ED f̂(x0) − f̂(x0))²
    + 2(Y − f(x0))(f(x0) − ED f̂(x0)) + 2(Y − f(x0))(ED f̂(x0) − f̂(x0))
    + 2(f(x0) − ED f̂(x0))(ED f̂(x0) − f̂(x0))


The Bias-Variance Decomposition

• We calculate its test error as, noting that Y − f(x0) = ε,

  Err(x0) = EY,D[(Y − f̂(x0))² | X = x0]
          = EY,D[ε²] + EY,D[(f(x0) − ED f̂(x0))²] + EY,D[(ED f̂(x0) − f̂(x0))²]
            + 2 EY,D[ε (f(x0) − ED f̂(x0))] + 2 EY,D[ε (ED f̂(x0) − f̂(x0))]
            + 2 EY,D[(f(x0) − ED f̂(x0))(ED f̂(x0) − f̂(x0))]


The Bias-Variance Decomposition

• Clearly

  EY,D[ε²] = Eε[ε²] = Var(ε) = σ²

• There is no randomness in (f(x0) − ED f̂(x0))², so

  EY,D[(f(x0) − ED f̂(x0))²] = (f(x0) − ED f̂(x0))²

  which is called the squared bias.


The Bias-Variance Decomposition

• And the variance:

  EY,D[(ED f̂(x0) − f̂(x0))²] = ED[(ED f̂(x0) − f̂(x0))²] = Var(f̂(x0))

• There is no randomness in (f(x0) − ED f̂(x0)), hence

  EY,D[ε (f(x0) − ED f̂(x0))] = E(ε)(f(x0) − ED f̂(x0)) = 0

• With independence, we have

  EY,D[ε (ED f̂(x0) − f̂(x0))] = E(ε) ED[ED f̂(x0) − f̂(x0)] = 0

• Similarly

  EY,D[(f(x0) − ED f̂(x0))(ED f̂(x0) − f̂(x0))] = 0


The Bias-Variance Decomposition

• Then we have the following decomposition:

  Err(x0) = σ² + Bias²(f̂(x0)) + Var(f̂(x0))

• The first term is the irreducible error. It comes from the noise in
the data itself; we cannot make it smaller through any modelling.

• The second term is the squared bias, defined by

  Bias²(f̂(x0)) = (f(x0) − ED f̂(x0))²

• The last term is the variance, the expected squared deviation of
the estimate f̂(x0) around its mean, i.e.

  Var(f̂(x0)) = ED[(ED f̂(x0) − f̂(x0))²]

  where ED is the “average” over all possible training data sets.


The Bias-Variance Decomposition: OLS regression example

• For OLS with the estimated model f̂ols(x0) = β̂ᵀx0 = x0ᵀβ̂,
we have

  ED f̂ols(x0) = ED[x0ᵀβ̂] = x0ᵀ ED[(XᵀX)⁻¹Xᵀy]
              = x0ᵀ(XᵀX)⁻¹Xᵀ ED[y]
              = x0ᵀ(XᵀX)⁻¹Xᵀ ED[f + ε]
              = x0ᵀ(XᵀX)⁻¹Xᵀf   (note that E(ε) = 0)

  where f = [f(x1), f(x2), ..., f(xN)]ᵀ is the vector of true model
  values at the training points and ε = [ε1, ε2, ..., εN]ᵀ is given by
  yi = f(xi) + εi (i = 1, 2, ..., N).


The Bias-Variance Decomposition: OLS regression example

• Hence the squared bias is

  Bias² = (f(x0) − x0ᵀ(XᵀX)⁻¹Xᵀf)²

• Denote

  h(x0) = X(XᵀX)⁻¹x0 = [h1(x0), h2(x0), ..., hN(x0)]ᵀ


The Bias-Variance Decomposition: OLS regression example

• For the variance term, we have

  Var(f̂(x0)) = ED[(f̂ols(x0) − ED f̂ols(x0))²]
             = ED[(h(x0)ᵀy − h(x0)ᵀf)²]
             = ED[(h(x0)ᵀε)²]
             = ED[(h1(x0)ε1 + h2(x0)ε2 + ··· + hN(x0)εN)²]

• By our assumptions, ED(εi²) = σ² and the εi are independent of
each other. We have

  Var(f̂(x0)) = h1(x0)²σ² + h2(x0)²σ² + ··· + hN(x0)²σ² = ‖h(x0)‖²σ²

• Finally we have

  Err(x0) = σ² + (f(x0) − h(x0)ᵀf)² + ‖h(x0)‖²σ²


Review questions

• How does model selection relate to the bias-variance
trade-off?

• What is a validation set? How is it different from a test set?

• What is K-Fold cross validation? Describe how it works.

• What is the one standard deviation rule?

