Predictive Analytics
Week 6: Model Selection and Estimation III
Semester 2, 2018
Discipline of Business Analytics, The University of Sydney Business School
Week 6: Model Selection and Estimation III
1. Maximum likelihood (continued)
2. Inference for the ML estimator (optional)
3. Analytical criteria
4. Comparison of model selection methods
5. Limitations of model selection
6. Optimism (optional)
Reading: Chapter 6.1 of ISL.
Exercise questions: Chapter 6.1 of ISL, Q1. Try to answer this question based on your existing knowledge of
regression variable selection, which will help you prepare for the next lecture.
Maximum likelihood (continued)
Maximum likelihood
Maximum likelihood estimation (MLE), which we have discussed in
the context of linear regression, is one of the most important
concepts in statistics. We now present it more generally for
inference.
ML for discrete distributions (key concept)
Let p(y; θ) be a discrete probability distribution, and let Y1, . . . , YN be independent observations from it. The likelihood function is

\[
\ell(\theta) = P(Y_1 = y_1, Y_2 = y_2, \ldots, Y_N = y_N)
= P(Y_1 = y_1)\,P(Y_2 = y_2)\cdots P(Y_N = y_N)
= \prod_{i=1}^{N} p(y_i; \theta).
\]

The maximum likelihood estimate θ̂ is the value of θ that maximises ℓ(θ).
ML for continuous distributions (key concept)
Let p(y; θ) be a density function. The likelihood function is

\[
\ell(\theta) = p(y_1;\theta)\,p(y_2;\theta)\cdots p(y_N;\theta)
= \prod_{i=1}^{N} p(y_i;\theta).
\]

The maximum likelihood estimate θ̂ is the value of θ that maximises ℓ(θ).
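As a quick numerical illustration of the continuous case (a sketch, not lecture code; the data are simulated for this example), we evaluate the likelihood of a Gaussian sample with known variance over a grid of candidate values for the mean θ and check that the maximiser is close to the sample mean:

```python
# Minimal sketch: grid search of the Gaussian likelihood with known variance.
# The data are simulated for illustration only.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=20)    # simulated sample, sigma assumed known (= 1)

theta_grid = np.linspace(0.0, 4.0, 401)         # candidate values of theta (the mean)
likelihood = np.array([norm.pdf(y, loc=t, scale=1.0).prod() for t in theta_grid])

theta_hat = theta_grid[likelihood.argmax()]
print(theta_hat, y.mean())                      # the two values should be very close
```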
Maximum likelihood
• Even though ℓ(θ) equals an expression that involves p(yi; θ),
we think of these functions in different ways.
• When considering a probability mass function or density
p(y;θ), we consider y to be a variable, and θ to be fixed.
• In the likelihood, θ is a variable, and y is fixed.
Log-likelihood (key concept)
The log-likelihood is

\[
L(\theta) = \log\left( \prod_{i=1}^{N} p(y_i;\theta) \right)
= \sum_{i=1}^{N} \log p(y_i;\theta).
\]
Because the log-likelihood is a monotonic transformation of the
likelihood, maximising it is the same as maximising the likelihood.
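Beyond this equivalence, working on the log scale is also numerically convenient: a product of many densities underflows in double precision, while a sum of log-densities does not. A minimal sketch with simulated Gaussian data (the numbers are illustrative only):

```python
# Sketch: the likelihood underflows for moderately large N; the log-likelihood does not.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
y = rng.normal(loc=0.0, scale=1.0, size=2000)   # simulated data

likelihood = norm.pdf(y, loc=0.0, scale=1.0).prod()        # product of 2000 densities
log_likelihood = norm.logpdf(y, loc=0.0, scale=1.0).sum()  # sum of log-densities

print(likelihood)       # 0.0: underflows in double precision
print(log_likelihood)   # a large negative but finite number
```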
Example: Bernoulli distribution
Suppose that Y1, . . . , YN follow the Bernoulli distribution with
parameter θ (the probability of a success).
\[
p(y_i; \theta) = \theta^{y_i}(1-\theta)^{1-y_i}
\]

\[
\ell(\theta) = \prod_{i=1}^{N} \theta^{y_i}(1-\theta)^{1-y_i}
\]

\[
L(\theta) = \sum_{i=1}^{N} \big[ y_i \log(\theta) + (1-y_i)\log(1-\theta) \big]
= \Big(\sum_i y_i\Big)\log(\theta) + \Big(N - \sum_i y_i\Big)\log(1-\theta)
\]
Example: Bernoulli distribution
Derivative of the log-likelihood with respect to θ:
\[
\frac{dL(\theta)}{d\theta} = \frac{\sum_i y_i}{\theta} - \frac{N - \sum_i y_i}{1-\theta}.
\]

The ML estimate therefore satisfies

\[
\frac{\sum_i y_i}{\hat{\theta}} = \frac{N - \sum_i y_i}{1-\hat{\theta}}.
\]

The solution is the sample proportion:

\[
\hat{\theta} = \frac{\sum_{i=1}^{N} y_i}{N}.
\]
What about the 2nd order derivative?
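A sketch that checks the closed-form result numerically (the data vector and the helper name neg_log_likelihood are made up for this illustration):

```python
# Sketch: numerical maximisation of the Bernoulli log-likelihood versus the
# closed-form ML estimate (the sample proportion). Data are made up.
import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])    # ten Bernoulli outcomes
N = len(y)

def neg_log_likelihood(theta):
    # L(theta) = sum_i [ y_i log(theta) + (1 - y_i) log(1 - theta) ]
    return -(y.sum() * np.log(theta) + (N - y.sum()) * np.log(1 - theta))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x)        # numerical ML estimate
print(y.mean())     # sample proportion: should agree to several decimals
```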
Inference for the ML estimator
(optional)
Inference for the ML estimator
The score function is

\[
s(\hat{\theta}) = \nabla_{\theta} L(\theta)\big|_{\theta=\hat{\theta}}.
\]

For example, when the parameter is a scalar,

\[
s(\hat{\theta}) = \sum_{i=1}^{N} \frac{d \log p(y_i;\theta)}{d\theta}\bigg|_{\theta=\hat{\theta}}.
\]
Inference for the ML estimator
The observed information matrix is the negative of the second
derivative (the Hessian matrix) of the log-likelihood:

\[
J(\hat{\theta}(\mathcal{D})) = -\nabla^2_{\theta} L(\theta)\big|_{\theta=\hat{\theta}}.
\]

When the parameter is a scalar,

\[
J(\hat{\theta}) = -\sum_{i=1}^{N} \frac{d^2 \log p(y_i;\theta)}{d\theta^2}\bigg|_{\theta=\hat{\theta}}.
\]
Inference for the ML estimator
We define the Fisher information matrix as the expected value of
the observed information matrix
\[
I_N(\hat{\theta}) = \mathbb{E}_{\theta}\big[ J(\hat{\theta}(\mathcal{D})) \big].
\]
So the observed information matrix is a sample-based version of
the Fisher information matrix.
Inference for the ML estimator
A standard result shows that the sampling distribution of the ML
estimator converges to a normal distribution,

\[
\hat{\theta} \to N\big(\theta, \, I_N^{-1}(\theta)\big)
\]

as N → ∞.
That suggests the large-sample approximations

\[
N\big(\theta, \, I_N(\hat{\theta})^{-1}\big) \quad \text{or} \quad N\big(\theta, \, J(\hat{\theta})^{-1}\big).
\]
Example: Bernoulli distribution
Continuing the example, the observed information is

\[
J(\theta) = -\frac{d^2 L(\theta)}{d\theta^2}
= \frac{\sum_i y_i}{\theta^2} + \frac{N - \sum_i y_i}{(1-\theta)^2}.
\]

Since E(Y) = θ,

\[
\mathbb{E}\big[J(\theta)\big] = \frac{N}{\theta(1-\theta)},
\]

so that

\[
I_N^{-1}(\theta) = \frac{\theta(1-\theta)}{N},
\]

which is familiar as the variance of a sample proportion from basic
statistics.
Inference for the ML estimator
The corresponding estimates for the standard errors of individual
parameters are

\[
\mathrm{SE}(\hat{\theta}_j) = \sqrt{\big[I_N(\hat{\theta})^{-1}\big]_{jj}}
\quad \text{or} \quad
\mathrm{SE}(\hat{\theta}_j) = \sqrt{\big[J(\hat{\theta})^{-1}\big]_{jj}}.
\]

A large-sample 100 × (1 − α)% confidence interval is

\[
\hat{\theta}_j \pm z_{\alpha/2} \times \mathrm{SE}(\hat{\theta}_j).
\]
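Continuing the Bernoulli example, a sketch of the large-sample standard error and a 95% confidence interval based on the observed information J(θ̂) = N / (θ̂(1 − θ̂)); it follows the formulas above rather than any particular library routine, and the data are the same made-up values as before:

```python
# Sketch: large-sample SE and 95% CI for the Bernoulli parameter using the
# observed information J(theta_hat) = N / (theta_hat * (1 - theta_hat)).
import numpy as np
from scipy.stats import norm

y = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])    # same made-up data as before
N = len(y)

theta_hat = y.mean()                            # ML estimate (sample proportion)
J = N / (theta_hat * (1 - theta_hat))           # observed information at theta_hat
se = np.sqrt(1 / J)                             # SE = sqrt(J^{-1})
z = norm.ppf(0.975)                             # z_{alpha/2} for alpha = 0.05

ci = (theta_hat - z * se, theta_hat + z * se)
print(theta_hat, se, ci)
```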
Inference for the ML estimator
The following large sample approximation leads to accurate
confidence intervals and hypothesis tests
\[
2\big( L(\hat{\theta}) - L(\theta) \big) \sim \chi^2_d,
\]
where d is the number of parameters in θ.
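As an illustration (again with the made-up Bernoulli data), the sketch below uses this result to test H0: θ = 0.5 with a likelihood ratio statistic and a chi-squared p-value:

```python
# Sketch: likelihood ratio test of H0: theta = 0.5 for the Bernoulli example.
import numpy as np
from scipy.stats import chi2

y = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])    # made-up data
N, s = len(y), y.sum()

def log_lik(theta):
    return s * np.log(theta) + (N - s) * np.log(1 - theta)

theta_hat = s / N
lr_stat = 2 * (log_lik(theta_hat) - log_lik(0.5))   # 2 (L(theta_hat) - L(theta_0))
p_value = chi2.sf(lr_stat, df=1)                    # d = 1 parameter under test
print(lr_stat, p_value)
```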
Analytical criteria
Analytical criteria
Analytical criteria estimate the generalisation error based on
theoretical arguments. They have the form:
criterion = training loss + penalty for number of parameters
Mallow’s Cp statistic
Mallow's Cp statistic applies to linear regression. It directly
implements the recipe suggested by our calculation of the
optimism:

\[
C_p = \frac{\mathrm{RSS}}{N} + \frac{2}{N}\hat{\sigma}^2(p+1),
\]

where σ̂² is an estimate of the variance of the errors based on
the largest model under consideration.
Mallow’s Cp statistic
We select the model with the lowest Cp. To compare two
specifications,
\[
\Delta C_p = \mathrm{MSE}_1 - \mathrm{MSE}_2 + \frac{2}{N}\hat{\sigma}^2(p_1 - p_2).
\]
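A sketch of how Cp could be computed for a sequence of nested linear regressions on simulated data; σ̂² is estimated from the largest model, as described above, and the design, coefficients, and the helper rss_and_p are made up for this illustration:

```python
# Sketch: Mallow's Cp for nested linear regressions on simulated data.
# sigma2_hat is estimated from the largest model under consideration.
import numpy as np

rng = np.random.default_rng(42)
N = 200
X = rng.normal(size=(N, 3))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=1.0, size=N)  # X[:, 2] is irrelevant

def rss_and_p(X_sub):
    """Least squares fit with an intercept; returns (RSS, number of predictors p)."""
    A = np.column_stack([np.ones(len(X_sub)), X_sub])
    beta, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return resid @ resid, X_sub.shape[1]

rss_full, p_full = rss_and_p(X)            # largest model: all 3 predictors
sigma2_hat = rss_full / (N - p_full - 1)   # error variance estimate from the largest model

for cols in [[0], [0, 1], [0, 1, 2]]:
    rss, p = rss_and_p(X[:, cols])
    cp = rss / N + (2 / N) * sigma2_hat * (p + 1)
    print(cols, round(cp, 4))              # select the specification with the lowest Cp
```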
Akaike Information Criterion (key concept)
The Akaike information criterion (AIC) applies to models
estimated by maximum likelihood.
AIC = −2L(θ̂) + 2p,
where L(θ̂) is the maximised log-likelihood and p is the number of
estimated parameters. We select the model with the lowest AIC.
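A small illustration of the general formula (reusing the made-up Bernoulli data from the earlier example; this is a sketch, not lecture code): we compare a model that estimates θ by maximum likelihood (p = 1) with a model that fixes θ = 0.5 (p = 0).

```python
# Sketch: AIC = -2 L(theta_hat) + 2p for two Bernoulli models on made-up data.
import numpy as np

y = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])
N, s = len(y), y.sum()

def log_lik(theta):
    return s * np.log(theta) + (N - s) * np.log(1 - theta)

aic_A = -2 * log_lik(s / N) + 2 * 1    # model A: theta estimated by ML, one parameter
aic_B = -2 * log_lik(0.5) + 2 * 0      # model B: theta fixed at 0.5, no estimated parameters
print(aic_A, aic_B)                    # select the model with the lower AIC
```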
Akaike Information Criterion
• The AIC is one of the most popular and versatile strategies for
model selection.
• The formula follows the "in-sample performance plus penalty
for complexity" structure.
• The AIC has a rigorous theoretical justification which we do not
address here. However, keep in mind that it is an asymptotic
approximation (N → ∞).
AIC for linear regression
In the special case of comparing linear regression specifications
under Gaussian errors, the AIC simplifies to (up to proportionality
and ignoring constant terms in the log-likelihood)

\[
\mathrm{AIC} = \log\left(\frac{\mathrm{RSS}}{N}\right) + \frac{2}{N}(p+2).
\]

The number of parameters is d = p + 2 because the parameter
vector includes the constant and the variance of the errors. Note that
this is different from the formula in the book, which treats the error
variance as known rather than as an estimated parameter.
Relation between Mallow’s Cp and the AIC
For a linear regression with Gaussian errors and known variance:
\[
\mathrm{AIC} = \frac{1}{\hat{\sigma}^2}\left( \frac{\mathrm{RSS}}{N} + \frac{2}{N}\hat{\sigma}^2(p+1) \right),
\]

which compares to

\[
C_p = \frac{\mathrm{RSS}}{N} + \frac{2}{N}\hat{\sigma}^2(p+1).
\]
Hence, the AIC and Cp lead to the same decision in this case. For
practical purposes, the AIC and Cp are regarded as the same for
linear regression.
Bayesian information criterion
The Bayesian information criterion (BIC) also applies to models
estimated by maximum likelihood.
BIC = −2L(θ̂) + log(N)p
where L(θ̂) is the maximised log-likelihood, p is the number of
estimated parameters and N is the sample size. We select the
model with the lowest BIC.
Bayesian information criterion
• The BIC formula is comparable to the AIC, but with the
penalty factor of 2 replaced by log(N). Hence, the BIC penalises
complexity more heavily when N ≥ 8. The BIC has a very
different theoretical justification to the AIC.
• The BIC is an asymptotic approximation to a Bayesian
approach to model selection.
BIC: Gaussian linear regression case
In the special case of a linear regression with Gaussian errors, the
BIC simplifies to (ignoring constant terms)
\[
\mathrm{BIC} = \log\left(\frac{\mathrm{RSS}}{N}\right) + \frac{\log(N)}{N}(p+2).
\]

If we assume that the variance of the errors is known, we have
instead

\[
\mathrm{BIC} = \frac{1}{\hat{\sigma}^2}\left( \frac{\mathrm{RSS}}{N} + \frac{\log(N)}{N}\hat{\sigma}^2(p+1) \right).
\]
In this case the BIC is proportional to AIC and Cp, but with a
log(N) penalty factor instead of 2.
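A sketch comparing the AIC and BIC for the Gaussian linear regression case on simulated data (the design, coefficients, and the rss helper are made up for this illustration); the only difference between the two criteria is the penalty factor:

```python
# Sketch: AIC versus BIC for nested Gaussian linear regressions on simulated data.
import numpy as np

rng = np.random.default_rng(42)
N = 200
X = rng.normal(size=(N, 3))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=1.0, size=N)

def rss(X_sub):
    """RSS of a least squares fit with an intercept."""
    A = np.column_stack([np.ones(len(X_sub)), X_sub])
    beta, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return resid @ resid

for cols in [[0], [0, 1], [0, 1, 2]]:
    p = len(cols)
    fit = np.log(rss(X[:, cols]) / N)
    aic = fit + (2 / N) * (p + 2)          # penalty factor 2
    bic = fit + (np.log(N) / N) * (p + 2)  # penalty factor log(N) > 2 since N >= 8
    print(cols, round(aic, 4), round(bic, 4))
```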
Comparison of model selection
methods
Model selection properties
Consistency. In a collection of models that includes the correct
model, the probability that the model selection criterion chooses
the correct one approaches one when N →∞.
Efficiency. The selected model predicts as well as the theoretically
best model under consideration in terms of expected loss when
N →∞.
It is not possible to combine these properties (Claeskens and Hjort,
2008, Section 4.9).
Properties of model selection methods
LOOCV, AIC, and Mallow's Cp. Efficient but not consistent.
The efficiency follows because they construct (approximately) unbiased
estimators of the test error. However, they retain a non-vanishing
probability of selecting a model that is more complex than the true
model, even as N → ∞.
BIC. Consistent under some conditions but not efficient. It often
chooses models that are too simple because of its heavier penalty
on complexity.
LOOCV, AIC and Cp
• LOOCV, AIC, and Cp are asymptotically equivalent (as N → ∞), so
they will tend to pick the same model in practice when N is large.
• In finite samples, we can view AIC and Cp as theoretical
approximations to LOOCV.
• The advantage of AIC and Cp over LOOCV is mainly
computational. CV should be preferred to AIC when the
assumptions of the model (e.g., constant error variance) are
likely to be wrong.
• LOOCV is universally applicable, while this is not the case for
AIC and Cp.
Limitations of model selection
Limitations of model selection
• Standard statistical inference is no longer valid after model
selection.
• The reason is that standard inference assumes a fixed model,
whereas model selection will by definition pick the specific
model that best fits the sample. This leads to over-optimistic
estimates of sampling variation based on the chosen model.
• In our context, the only way around this difficulty would be
data splitting: using one part of the sample for model
selection, and another for inference.
Limitations of model selection
Model selection is an important tool in your data analysis process,
but it should not be a replacement for model building through EDA,
diagnostics, and domain knowledge.
Review questions
• What are the Akaike Information Criterion, Bayesian
Information Criterion and Mallow’s Cp?
• What are the relationships between these three criteria?
• Why is it incorrect to conduct statistical inference after model
selection (using the same data)?
Optimism (optional)
Optimism
Our objective in this section is to develop a better understanding
of overfitting. This discussion will inform our understanding of
analytical criteria in the next section.
Optimism
We define the training error as the empirical loss for the training
data,
\[
\mathrm{err}_{\mathcal{D}} = \frac{1}{N}\sum_{i=1}^{N} L\big(Y_i, \hat{f}(x_i)\big).
\]

We focus on our standard regression setting,

\[
\mathrm{err}_{\mathcal{D}} = \frac{1}{N}\sum_{i=1}^{N} \big(Y_i - \hat{f}(x_i)\big)^2,
\]
which is RSS/N for linear regression estimated by least squares.
Optimism
The expected prediction error (EPE) is

\begin{align*}
\mathrm{EPE}(x_0) &= E_{\mathcal{D}}\Big[\big(Y_0 - \hat{f}(x_0)\big)^2\Big] \\
&= E_{\mathcal{D}}\Big[\big(f(x_0) + \varepsilon_0 - \hat{f}(x_0)\big)^2\Big] \\
&= \sigma^2 + E_{\mathcal{D}}\Big[\big(f(x_0) - \hat{f}(x_0)\big)^2\Big],
\end{align*}

where x_0 is fixed and Y_0 = f(x_0) + ε_0 is a new observation,
independent of the training data D.
Optimism
Now, consider the estimation error for a training case i:

\begin{align*}
E_{\mathcal{D}}\Big[\big(Y_i - \hat{f}(x_i)\big)^2\Big]
&= E_{\mathcal{D}}\Big[\big(f(x_i) + \varepsilon_i - \hat{f}(x_i)\big)^2\Big] \\
&= \sigma^2 + E_{\mathcal{D}}\Big[\big(f(x_i) - \hat{f}(x_i)\big)^2\Big] - 2\,E_{\mathcal{D}}\big[\hat{f}(x_i)\,\varepsilon_i\big] \\
&= \sigma^2 + E_{\mathcal{D}}\Big[\big(f(x_i) - \hat{f}(x_i)\big)^2\Big] - 2\,\mathrm{Cov}\big(\hat{f}(x_i), \varepsilon_i\big).
\end{align*}
Unlike in the EPE, the last term appears because the estimator
f̂(xi) is a function of D, which includes training case i itself.
Optimism
Averaging over the data, the expected value of the training error is
\begin{align*}
E\big[\mathrm{err}_{\mathcal{D}}\big]
&= \frac{1}{N}\sum_{i=1}^{N} E_{\mathcal{D}}\Big[\big(Y_i - \hat{f}(x_i)\big)^2\Big] \\
&= \sigma^2 + \frac{1}{N}\sum_{i=1}^{N} E_{\mathcal{D}}\Big[\big(f(x_i) - \hat{f}(x_i)\big)^2\Big]
- \frac{2}{N}\sum_{i=1}^{N} \mathrm{Cov}\big(\hat{f}(x_i), \varepsilon_i\big).
\end{align*}
Because of the last term, the training error is not a good estimate
of the expected prediction error.
Optimism
We define the out-of-sample error as

\begin{align*}
\mathrm{Err}_{\mathrm{out}}
&= \frac{1}{N}\sum_{i=1}^{N} E\Big[\big(Y_i^0 - \hat{f}(x_i)\big)^2\Big] \\
&= \sigma^2 + \frac{1}{N}\sum_{i=1}^{N} E_{\mathcal{D}}\Big[\big(f(x_i) - \hat{f}(x_i)\big)^2\Big],
\end{align*}

where Y_i^0 = f(x_i) + ε_i^0 indicates an independent case for a given
x_i.
Optimism
The optimism of the training error is

\[
\mathrm{Err}_{\mathrm{out}} - E\big[\mathrm{err}_{\mathcal{D}}\big]
= \frac{2}{N}\sum_{i=1}^{N} \mathrm{Cov}\big(\hat{f}(x_i), \varepsilon_i\big).
\]

The more we overfit, the higher Cov(f̂(x_i), ε_i) will be, increasing
the optimism.
Example: linear regression
For the linear regression model, we can show that

\[
\mathrm{Optimism} = \frac{2}{N}\sum_{i=1}^{N} \mathrm{Cov}\big(\hat{f}(x_i), \varepsilon_i\big)
= \frac{2}{N}\sigma^2(p+1).
\]
Interpretation:
• The larger the sample size (N), the harder it is to overfit.
• The larger the variance of the errors (σ2), the larger the
overfitting.
• The optimism is proportional to the number of predictors.
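A simulation sketch of this result (everything below, including the design, coefficients, and number of replications, is made up for illustration): we repeatedly draw training responses from a known linear model, fit least squares, and compare the average training error with the average error on fresh responses at the same inputs. The gap should be close to 2σ²(p + 1)/N.

```python
# Sketch: simulate the optimism of the training error for least squares and
# compare it to the theoretical value 2 * sigma^2 * (p + 1) / N.
import numpy as np

rng = np.random.default_rng(0)
N, p, sigma = 50, 3, 1.0
X = rng.normal(size=(N, p))                    # fixed design
A = np.column_stack([np.ones(N), X])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
f = A @ beta_true                              # true regression function at the training inputs

train_err, out_err = [], []
for _ in range(5000):
    y = f + rng.normal(scale=sigma, size=N)        # training responses
    beta_hat = np.linalg.lstsq(A, y, rcond=None)[0]
    fhat = A @ beta_hat
    train_err.append(np.mean((y - fhat) ** 2))     # err_D
    y_new = f + rng.normal(scale=sigma, size=N)    # fresh responses at the same x_i
    out_err.append(np.mean((y_new - fhat) ** 2))   # contribution to Err_out

print(np.mean(out_err) - np.mean(train_err))       # simulated optimism
print(2 * sigma**2 * (p + 1) / N)                  # theoretical value: 0.16
```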