Predictive Analytics – Week 6: Model Selection and Estimation III

Predictive Analytics
Week 6: Model Selection and Estimation III

Semester 2, 2018

Discipline of Business Analytics, The University of Sydney Business School

Week 6: Model Selection and Estimation III

1. Maximum likelihood (continued)

2. Inference for the ML estimator (optional)

3. Analytical criteria

4. Comparison of model selection methods

5. Limitations of model selection

6. Optimism (optional)
Reading: Chapter 6.1 of ISL.

Exercise questions: Chapter 6.1 of ISL, Q1. Try to answer this question based on your existing knowledge of
regression variable selection; this will help you prepare for the next lecture.


Maximum likelihood (continued)

Maximum likelihood

Maximum likelihood estimation (MLE), which we have discussed in
the context of linear regression, is one of the most important
concepts in statistics. We now present it more generally for
inference.


ML for discrete distributions (key concept)

Let $p(y;\theta)$ be a discrete probability distribution. For an independent sample $y_1, \ldots, y_N$, the likelihood function is

$$\ell(\theta) = P(Y_1 = y_1, Y_2 = y_2, \ldots, Y_N = y_N) = P(Y_1 = y_1)P(Y_2 = y_2) \cdots P(Y_N = y_N) = \prod_{i=1}^{N} p(y_i;\theta)$$

The maximum likelihood estimate $\hat{\theta}$ is the value of $\theta$ that
maximises $\ell(\theta)$.


ML for continuous distributions (key concept)

Let $p(y;\theta)$ be a density function. For an independent sample, the likelihood function is

$$\ell(\theta) = p(y_1;\theta)\, p(y_2;\theta) \cdots p(y_N;\theta) = \prod_{i=1}^{N} p(y_i;\theta)$$

The maximum likelihood estimate $\hat{\theta}$ is the value of $\theta$ that
maximises $\ell(\theta)$.


Maximum likelihood

• Even though $\ell(\theta)$ equals an expression that involves $p(y_i;\theta)$,
we think of these functions in different ways.

• When considering a probability mass function or density
$p(y;\theta)$, we consider $y$ to be a variable and $\theta$ to be fixed.

• In the likelihood, $\theta$ is a variable and $y$ is fixed.


Log-likelihood (key concept)

The log-likelihood is

$$L(\theta) = \log\left(\prod_{i=1}^{N} p(y_i;\theta)\right) = \sum_{i=1}^{N} \log p(y_i;\theta)$$

Because the log-likelihood is a monotonic transformation of the
likelihood, maximising it is the same as maximising the likelihood.


Example: Bernoulli distribution

Suppose that Y1, . . . , YN follow the Bernoulli distribution with
parameter θ (the probability of a success).

$$p(y_i;\theta) = \theta^{y_i}(1-\theta)^{1-y_i}$$

$$\ell(\theta) = \prod_{i=1}^{N} \theta^{y_i}(1-\theta)^{1-y_i}$$

$$L(\theta) = \sum_{i=1}^{N} \left[ y_i \log(\theta) + (1-y_i)\log(1-\theta) \right] = \left(\sum_{i=1}^{N} y_i\right)\log(\theta) + \left(N - \sum_{i=1}^{N} y_i\right)\log(1-\theta)$$


Example: Bernoulli distribution

Derivative of the log-likelihood with respect to θ:

$$\frac{dL(\theta)}{d\theta} = \frac{\sum_{i=1}^{N} y_i}{\theta} - \frac{N - \sum_{i=1}^{N} y_i}{1-\theta}$$

The ML estimate therefore satisfies

$$\frac{\sum_{i=1}^{N} y_i}{\hat{\theta}} = \frac{N - \sum_{i=1}^{N} y_i}{1-\hat{\theta}}.$$

The solution is the sample proportion:

$$\hat{\theta} = \frac{\sum_{i=1}^{N} y_i}{N}.$$

What about the 2nd order derivative?
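As a quick numerical check, we can compare the closed-form estimate against a direct numerical maximisation of the log-likelihood. The following is a minimal sketch in Python; the simulated sample (N = 200, true θ = 0.3) is an assumption for illustration:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.3, size=200)  # simulated Bernoulli sample, true theta = 0.3
N, s = y.size, y.sum()

def neg_log_lik(theta):
    # Negative Bernoulli log-likelihood: -[s log(theta) + (N - s) log(1 - theta)]
    return -(s * np.log(theta) + (N - s) * np.log(1 - theta))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, s / N)  # the numerical maximiser should match the sample proportion
```

As for the second derivative: $-\sum y_i/\theta^2 - (N - \sum y_i)/(1-\theta)^2$ is negative for all $\theta \in (0, 1)$, so the stationary point is indeed a maximum. The next section builds inference on exactly this second derivative.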

Inference for the ML estimator
(optional)

Inference for the ML estimator

The score function is

$$s(\hat{\theta}) = \nabla L(\theta)\big|_{\theta=\hat{\theta}}$$

For example, when the parameter is a scalar,

$$s(\hat{\theta}) = \sum_{i=1}^{N} \frac{d \log p(y_i;\theta)}{d\theta} \bigg|_{\theta=\hat{\theta}}$$


Inference for the ML estimator

The observed information matrix is the negative of the second
derivative (the Hessian matrix) of the log-likelihood.

$$J(\hat{\theta}(D)) = -\nabla^2_{\theta}\, L(\theta)\big|_{\theta=\hat{\theta}}$$

When the parameter is a scalar,

$$J(\hat{\theta}) = -\sum_{i=1}^{N} \frac{d^2 \log p(y_i;\theta)}{d\theta^2} \bigg|_{\theta=\hat{\theta}}.$$


Inference for the ML estimator

We define the Fisher information matrix as the expected value of
the observed information matrix

$$I_N(\hat{\theta}) = E_\theta\left[J(\hat{\theta}(D))\right]$$
So the observed information matrix is a sample-based version of
the Fisher information matrix.


Inference for the ML estimator

A standard result shows that the sampling distribution of the ML
estimator converges to a normal distribution,

$$\hat{\theta} \to N\left(\theta, I_N^{-1}(\theta)\right)$$

as $N \to \infty$.

This suggests the large sample approximations

$$N\left(\theta, I_N(\hat{\theta})^{-1}\right) \quad \text{or} \quad N\left(\theta, J(\hat{\theta})^{-1}\right)$$


Example: Bernoulli distribution

Continuing the example, the observed information is

$$J(\theta) = -\frac{d^2 L(\theta)}{d\theta^2} = \frac{\sum_{i=1}^{N} y_i}{\theta^2} + \frac{N - \sum_{i=1}^{N} y_i}{(1-\theta)^2}$$

Since $E(Y_i) = \theta$,

$$I_N(\theta) = E(J(\theta)) = \frac{N}{\theta(1-\theta)},$$

so that

$$I_N^{-1}(\theta) = \frac{\theta(1-\theta)}{N},$$

which is familiar as the variance of a sample proportion from basic
statistics.


Inference for the ML estimator

The corresponding estimates for the standard errors of individual
parameters are

$$SE(\hat{\theta}_j) = \sqrt{I_N(\hat{\theta})^{-1}_{jj}} \qquad \text{or} \qquad SE(\hat{\theta}_j) = \sqrt{J(\hat{\theta})^{-1}_{jj}}$$

A large sample $100 \times (1-\alpha)\%$ confidence interval is

$$\hat{\theta}_j \pm z_{\alpha/2} \times SE(\hat{\theta}_j)$$
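As a minimal sketch (reusing the simulated Bernoulli sample from the earlier example, an assumption for illustration), the standard error and a 95% confidence interval follow directly from the Fisher information derived above:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.3, size=200)  # simulated Bernoulli sample, true theta = 0.3
N = y.size

theta_hat = y.mean()                          # ML estimate (sample proportion)
info = N / (theta_hat * (1 - theta_hat))      # I_N(theta_hat) = N / (theta (1 - theta))
se = np.sqrt(1 / info)                        # SE = sqrt of the inverse information

z = norm.ppf(0.975)                           # z_{alpha/2} for alpha = 0.05
print(theta_hat - z * se, theta_hat + z * se) # large sample 95% confidence interval
```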


Inference for the ML estimator

The following large sample approximation leads to accurate
confidence intervals and hypothesis tests

$$2\left(L(\hat{\theta}) - L(\theta)\right) \sim \chi^2_d,$$

where d is the number of parameters in θ.


Analytical criteria

Analytical criteria

Analytical criteria estimate the generalisation error based on
theoretical arguments. They have the form:

criterion = training loss + penalty for the number of parameters


Mallow’s Cp statistic

Mallow’s $C_p$ statistic applies to linear regression. It directly
implements the recipe suggested by our calculation of the
optimism:

$$C_p = \frac{RSS}{N} + \frac{2}{N}\hat{\sigma}^2(p+1),$$

where $\hat{\sigma}^2$ is an estimate of the variance of the errors based on
the largest model under consideration.


Mallow’s Cp statistic

We select the model with the lowest $C_p$. To compare two
specifications,

$$\Delta C_p = \text{MSE}_1 - \text{MSE}_2 + \frac{2}{N}\hat{\sigma}^2(p_1 - p_2).$$


Akaike Information Criterion (key concept)

The Akaike information criterion (AIC) applies to models
estimated by maximum likelihood.

$$AIC = -2L(\hat{\theta}) + 2p,$$

where L(θ̂) is the maximised log-likelihood and p is the number of
estimated parameters. We select the model with the lowest AIC.
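As a small illustration of the definition, the AIC of the Bernoulli model from the ML section can be computed directly from its maximised log-likelihood. A minimal sketch (the simulated sample is again an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.3, size=200)  # simulated Bernoulli sample
N, s = y.size, y.sum()

theta_hat = s / N  # ML estimate (sample proportion)
max_log_lik = s * np.log(theta_hat) + (N - s) * np.log(1 - theta_hat)  # L(theta_hat)
aic = -2.0 * max_log_lik + 2 * 1  # p = 1 estimated parameter
print(aic)
```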


Akaike Information Criterion

• The AIC is one of the most popular and versatile strategies for
model selection.

• The formula follows the structure of in-sample performance plus
a penalty for complexity.

• The AIC has a rigorous theoretical justification which we do not
address here. However, keep in mind that it is an asymptotic
approximation ($N \to \infty$).


AIC for linear regression

In the special case of comparing linear regression specifications
under Gaussian errors, the AIC simplifies to (up to proportionality
and ignoring constant terms in the log-likelihood):

$$AIC = \log\left(\frac{RSS}{N}\right) + \frac{2}{N}(p+2),$$

The number of parameters is $d = p + 2$ because the parameter
vector includes the constant and the variance of the errors. Note
that this differs from the formula in the book, which is a
simplification with unknown error variance.


Relation between Mallow’s Cp and the AIC

For a linear regression with Gaussian errors and known variance:

$$AIC = \frac{1}{\hat{\sigma}^2}\left(\frac{RSS}{N} + \frac{2}{N}\hat{\sigma}^2(p+1)\right),$$

which compares to

$$C_p = \frac{RSS}{N} + \frac{2}{N}\hat{\sigma}^2(p+1).$$

Hence, the AIC and Cp lead to the same decision in this case. For
practical purposes, the AIC and Cp are regarded as the same for
linear regression.


Bayesian information criterion

The Bayesian information criterion (BIC) also applies to models
estimated by maximum likelihood.

$$BIC = -2L(\hat{\theta}) + \log(N)\,p$$

where L(θ̂) is the maximised log-likelihood, p is the number of
estimated parameters and N is the sample size. We select the
model with the lowest BIC.


Bayesian information criterion

• The BIC formula is comparable to the AIC, but with the
penalty factor 2 replaced by $\log(N)$. Hence, the BIC penalises
complexity more heavily when $N \geq 8$. The BIC has a very
different theoretical justification to the AIC.

• The BIC is an asymptotic approximation to a Bayesian
approach to model selection.


BIC: Gaussian linear regression case

In the special case of a linear regression with Gaussian errors, the
BIC simplifies to (ignoring constant terms)

$$BIC = \log\left(\frac{RSS}{N}\right) + \frac{\log(N)}{N}(p+2)$$

If we assume that the variance of the errors is known, we have
instead

$$BIC = \frac{1}{\hat{\sigma}^2}\left(\frac{RSS}{N} + \frac{\log(N)}{N}\hat{\sigma}^2(p+1)\right),$$

In this case the BIC is proportional to the AIC and $C_p$, but with a
$\log(N)$ penalty factor instead of 2.
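To see the heavier $\log(N)$ penalty in action, the sketch below fits nested linear regressions to simulated data and reports which model size each simplified criterion selects. The data generating process (one relevant predictor out of eight) is a made-up assumption for illustration; the BIC will typically favour the smaller model:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p_max = 200, 8
X = rng.normal(size=(N, p_max))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=N)  # only the first predictor matters

def aic_bic(rss, n, p):
    # Simplified criteria for Gaussian linear regression (constants dropped)
    aic = np.log(rss / n) + 2.0 * (p + 2) / n
    bic = np.log(rss / n) + np.log(n) * (p + 2) / n
    return aic, bic

scores = []
for p in range(1, p_max + 1):
    Xp = np.column_stack([np.ones(N), X[:, :p]])   # constant + first p predictors
    beta, *_ = np.linalg.lstsq(Xp, y, rcond=None)  # least squares fit
    rss = np.sum((y - Xp @ beta) ** 2)
    scores.append(aic_bic(rss, N, p))

aics, bics = zip(*scores)
print("AIC selects p =", 1 + int(np.argmin(aics)))
print("BIC selects p =", 1 + int(np.argmin(bics)))
```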


Comparison of model selection
methods

Model selection properties

Consistency. In a collection of models that includes the correct
model, the probability that the model selection criterion chooses
the correct one approaches one when N →∞.

Efficiency. The selected model predicts as well as the theoretically
best model under consideration in terms of expected loss when
N →∞.

It is not possible to combine these properties (Claeskens and Hjort,
2008, Section 4.9).


Properties of model selection methods

LOOCV, AIC, and Mallow’s Cp. Efficient but not consistent.
Efficiency follows because these criteria construct approximately
unbiased estimators of the test error. However, even as N →∞ they
retain a positive probability of selecting models that are more
complex than the true model.

BIC. Consistent under some conditions, but not efficient. It often
chooses models that are too simple because of its heavier penalty
on complexity.


LOOCV, AIC and Cp

• LOOCV, AIC, and Cp are equivalent when N →∞. They will
pick the same model in practice when N is large.

• In finite samples, we can view AIC and Cp as theoretical
approximations to LOOCV.

• The advantage of AIC and Cp over LOOCV is mainly
computational. CV should be preferred to AIC when the
assumptions of the model (e.g., constant error variance) are
likely to be wrong.

• LOOCV is universally applicable, while this is not the case for
AIC and Cp.


Limitations of model selection

Limitations of model selection

• Standard statistical inference is no longer valid after model
selection.

• The reason is that standard inference assumes a fixed model,
whereas model selection by definition picks the model that best
fits the sample. This leads to overly optimistic estimates of
sampling variation based on the chosen model.

• In our context, the only way around this difficulty would be
data splitting: using one part of the sample for model
selection, and another for inference.


Limitations of model selection

Model selection is an important tool in your data analysis process,
but it should not replace model building through EDA,
diagnostics, and domain knowledge.


Review questions

• What are the Akaike Information Criterion, Bayesian
Information Criterion and Mallow’s Cp?

• What are the relationships between these three criteria?

• Why is it incorrect to conduct statistical inference after model
selection (using the same data)?


Optimism (optional)

Optimism

Our objective in this section is to develop a better understanding
of overfitting. This discussion underpins the analytical criteria
presented earlier.


Optimism

We define the training error as the empirical loss for the training
data,

$$\text{err}_D = \frac{1}{N}\sum_{i=1}^{N} L\left(Y_i, \hat{f}(x_i)\right).$$

We focus on our standard regression setting,

$$\text{err}_D = \frac{1}{N}\sum_{i=1}^{N} \left(Y_i - \hat{f}(x_i)\right)^2,$$

which is RSS/N for linear regression estimated by least squares.


Optimism

The expected prediction error (EPE) is

$$
\begin{aligned}
EPE(x_0) &= E_D\left[\left(Y_0 - \hat{f}(x_0)\right)^2\right] \\
&= E_D\left[\left(f(x_0) + \varepsilon_0 - \hat{f}(x_0)\right)^2\right] \\
&= \sigma^2 + E_D\left[\left(f(x_0) - \hat{f}(x_0)\right)^2\right],
\end{aligned}
$$

where $x_0$ is fixed.


Optimism

Now, consider the estimation error for a training case i,

$$
\begin{aligned}
E_D\left[\left(Y_i - \hat{f}(x_i)\right)^2\right] &= E_D\left[\left(f(x_i) + \varepsilon_i - \hat{f}(x_i)\right)^2\right] \\
&= \sigma^2 + E_D\left[\left(f(x_i) - \hat{f}(x_i)\right)^2\right] - 2E_D\left[\hat{f}(x_i)\,\varepsilon_i\right] \\
&= \sigma^2 + E_D\left[\left(f(x_i) - \hat{f}(x_i)\right)^2\right] - 2\,\text{Cov}\left(\hat{f}(x_i), \varepsilon_i\right)
\end{aligned}
$$

Unlike in the EPE, the last term appears because the estimator
$\hat{f}(x_i)$ is a function of $D$, which includes training case $i$ itself.


Optimism

Averaging over the data, the expected value of the training error is

$$
\begin{aligned}
E[\text{err}_D] &= \frac{1}{N}\sum_{i=1}^{N} E_D\left[\left(Y_i - \hat{f}(x_i)\right)^2\right] \\
&= \sigma^2 + \frac{1}{N}\sum_{i=1}^{N} E_D\left[\left(f(x_i) - \hat{f}(x_i)\right)^2\right] - \frac{2}{N}\sum_{i=1}^{N} \text{Cov}\left(\hat{f}(x_i), \varepsilon_i\right).
\end{aligned}
$$

Because of the last term, the training error is not a good estimate
of the expected prediction error.


Optimism

We define the out-of-sample error as

$$
\begin{aligned}
\text{Err}_{out} &= \frac{1}{N}\sum_{i=1}^{N} E\left[\left(Y_i^0 - \hat{f}(x_i)\right)^2\right] \\
&= \sigma^2 + \frac{1}{N}\sum_{i=1}^{N} E_D\left[\left(f(x_i) - \hat{f}(x_i)\right)^2\right],
\end{aligned}
$$

where $Y_i^0 = f(x_i) + \varepsilon_i^0$ denotes an independent case for a given
$x_i$.


Optimism

The optimism of the training error is

$$\text{Err}_{out} - E[\text{err}_D] = \frac{2}{N}\sum_{i=1}^{N} \text{Cov}\left(\hat{f}(x_i), \varepsilon_i\right)$$

The more we overfit, the higher $\text{Cov}(\hat{f}(x_i), \varepsilon_i)$ will be, increasing
the optimism.


Example: linear regression

For the linear regression model, we can show that

$$\text{Optimism} = \frac{2}{N}\sum_{i=1}^{N} \text{Cov}\left(\hat{f}(x_i), \varepsilon_i\right) = \frac{2}{N}\sigma^2(p+1)$$

Interpretation:

• The larger the sample size ($N$), the harder it is to overfit.

• The larger the variance of the errors ($\sigma^2$), the larger the
optimism.

• The optimism is proportional to the number of parameters ($p + 1$).
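A short simulation sketch can check this result: with a fixed design, repeatedly draw training responses, fit by least squares, and compare the average training error against the average error on fresh responses at the same inputs. All settings (N = 50, p = 4, σ² = 1, 5000 replications) are made-up assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
N, p, sigma2, reps = 50, 4, 1.0, 5000
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # fixed design with constant
beta = rng.normal(size=p + 1)
f = X @ beta  # true regression function at the training inputs

train_err = out_err = 0.0
for _ in range(reps):
    y = f + rng.normal(scale=np.sqrt(sigma2), size=N)   # training responses
    y0 = f + rng.normal(scale=np.sqrt(sigma2), size=N)  # independent responses at the same x_i
    b, *_ = np.linalg.lstsq(X, y, rcond=None)           # least squares fit
    fhat = X @ b
    train_err += np.mean((y - fhat) ** 2) / reps        # average err_D
    out_err += np.mean((y0 - fhat) ** 2) / reps         # average Err_out

print("simulated optimism:", out_err - train_err)
print("theoretical value: ", 2 * sigma2 * (p + 1) / N)  # 2 sigma^2 (p + 1) / N
```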

