Model Diagnostics
MAST90083 Computational Statistics and Data Mining
School of Mathematics & Statistics The University of Melbourne
Outline
§2.1 General purpose of model diagnostics
§2.2 Training error vs. Generalization error
§2.3 Model Diagnostic with Data
§2.4 Bias-variance decomposition
§2.5 Optimism
§2.6 Model Selection Criteria
§2.7 Model Evaluation and Averaging
§2.8 Cross-Validation
General purpose of model diagnostics
Supervised learning models are used to investigate/discover the relationship between a response/outcome/dependent variable $y$ and a set of predictor/explanatory/independent/covariate variables $x$, based on observations $D = \{(y_i, x_i),\ i = 1, \dots, N\}$.
Given a training data set there is generally more than one possible learning method or model
General purpose of model diagnostics
There is therefore a need for a measure of the quality of these learning methods or models
The "generalization performance" of a learning method relates to its prediction capability on independent test data
It gives a measure of the quality of the selected model
It helps assess how well the model fits and, if necessary, modify the model to improve the fit
It guides the choice of a learning method or model
Assessment of this performance is important and used in practice
General purpose of model diagnostics
Assume a quantitative response $y$ and a vector of predictors $x$. Given a training sample $D = \{(y_i, x_i),\ i = 1, \dots, N\}$ we can estimate a prediction model $\hat f(x)$.
The loss function for measuring the error or deviation between $y$ and $\hat f(x)$ is
$$L\big(y, \hat f(x)\big) = \begin{cases} \big(y - \hat f(x)\big)^2 & \text{squared error} \\ \big|y - \hat f(x)\big| & \text{absolute error} \end{cases}$$
Training error vs. Generalization error
The training error is the empirical loss, or average loss over the training sample
$$\overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^{N} L\big(y_i, \hat f(x_i)\big)$$
The generalization error is the expected prediction error over an independent test sample (test error)
$$\mathrm{Err} = E\big[L\big(y, \hat f(x)\big)\big]$$
Interest: the test error of our estimated model $\hat f$
Training error vs. Generalization error
The training error is not a good estimate of the test error
More complex model → adapt to more complex structures
Training error consistently decreases with model complexity → dropping to zero for a sufficiently complex model
However, a model with zero training error is overfit to the training data → it will generalize poorly
Training error vs. Generalization error
For a qualitative or categorical response $G \in \{1, \dots, K\}$ we model
$$p_k(x) = \Pr(G = k \mid x) \quad\text{and}\quad \hat G(x) = \arg\max_k \hat p_k(x)$$
The loss functions are
$$L\big(G, \hat G(x)\big) = I\big(G \neq \hat G(x)\big), \quad \text{the 0--1 loss}$$
$$L\big(G, \hat p(x)\big) = -2\sum_{k=1}^{K} I(G = k)\log \hat p_k(x) = -2\log \hat p_G(x)$$
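As a small illustration (not from the slides; the probabilities and labels below are made up), both losses can be computed directly from a matrix of predicted class probabilities:

```python
import numpy as np

# Hypothetical predicted class probabilities for 4 observations and K = 3 classes
p_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.6, 0.3],
                  [0.3, 0.3, 0.4],
                  [0.2, 0.5, 0.3]])
g = np.array([0, 1, 0, 2])                  # true class labels (0-based)

g_hat = p_hat.argmax(axis=1)                # G_hat(x) = argmax_k p_hat_k(x)
zero_one = (g != g_hat).mean()              # average 0-1 loss
deviance = -2 * np.mean(np.log(p_hat[np.arange(len(g)), g]))   # -2 log p_hat_G(x), averaged

print(zero_one, deviance)
```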
Training error vs. Generalization error
We are interested in the expected misclassification rate $\mathrm{Err} = E\big[L\big(G, \hat G(x)\big)\big]$ or $\mathrm{Err} = E\big[L\big(G, \hat p(x)\big)\big]$
But in practice we have access to the training error
$$\overline{\mathrm{err}} = -\frac{2}{N}\sum_{i=1}^{N} \log \hat p_{g_i}(x_i)$$
We are interested in estimating the test error and finding the model with the appropriate complexity
Training error vs. Generalization error
The problem of estimating the test error for a categorical response is similar to that for a quantitative response, on which we will focus.
If there is a parameter α that controls the complexity of the model, the aim is to find the α that produces the minimum test error.
Training error vs. Generalization error
[Figure: prediction error versus model complexity for training and test data. Moving from left (high bias, low variance) to right (low bias, high variance), the training error decreases steadily while the test error first decreases and then increases.]
The test error varies with the model complexity
Model Diagnostic with Data
There are two separate objectives in supervised learning:
Model Selection: estimating the performance of different models in order to choose the (approximately) best one; and
Model Assessment: having chosen or selected a model, estimating its prediction error (generalization error or performance) on new data.
Before we look at these however, we shall define some terms and discuss the bias-variance tradeoff and its relation to model complexity.
Model Diagnostic with Data
In a data-rich situation, the best approach for both problems is to randomly divide the dataset into three sets
A training dataset is the set of data used to fit a model.
A validation dataset is the one on which we check the performance of the model fitted from the training dataset. We use this to guide our model selection.
A test dataset is the one on which we assess the prediction accuracy of the model found from a model selection procedure.
If the test set is also used to choose the model → the final model will underestimate the true test error
Model Diagnostic with Data
There is no general rule on how to choose the number of samples in each data set. This could, for example, depend on the signal-to-noise ratio in the data or the model complexity
A typical split might be 50% for training and 25% each for validation and testing.
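A minimal sketch of the 50/25/25 split described above (numpy only; the synthetic data and variable names are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200                                        # synthetic data, just to make the sketch runnable
X = rng.uniform(0, 2, size=(N, 3))
y = X @ np.array([1.0, -0.5, 0.2]) + rng.normal(0, 0.3, N)

idx = rng.permutation(N)                       # shuffle before splitting
n_train, n_valid = int(0.50 * N), int(0.25 * N)

train_idx = idx[:n_train]
valid_idx = idx[n_train:n_train + n_valid]
test_idx  = idx[n_train + n_valid:]            # remaining ~25% for final assessment

X_train, y_train = X[train_idx], y[train_idx]  # fit candidate models here
X_valid, y_valid = X[valid_idx], y[valid_idx]  # guide model selection here
X_test,  y_test  = X[test_idx],  y[test_idx]   # use once only, to assess the chosen model
```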
Accuracy versus sample size
By splitting up the data this way we are left with a significantly smaller number of observations with which to fit our model. This is sometimes a problem, although not always.
Accuracy versus sample size
In a number of cases, there is insufficient data to split it into three parts. The methods discussed here
Approximate the validation step analytically using model selection criteria, or by efficient sample re-use
Provide an estimate of the test error of the final chosen model
Accuracy versus sample size
The methods discussed next are designed for situations where there is insufficient data
These methods approximate or include the validation step
analytically: Cp, AIC, BIC
by efficient sample re-use: cross-validation and bootstrap
Bias-variance decomposition
Assume $y = f(x) + \varepsilon$ with $E(\varepsilon) = 0$, $\mathrm{Var}(\varepsilon) = \sigma^2$, and let $\hat f(x)$ be a regression fit.
The expected prediction error of the fit $\hat f(x)$ at $x = x_0$ is
$$\mathrm{Err}(x_0) = E\Big[\big(y - \hat f(x_0)\big)^2\Big] = \sigma^2 + \big[E\hat f(x_0) - f(x_0)\big]^2 + E\Big[\hat f(x_0) - E\hat f(x_0)\Big]^2 = \sigma^2 + \mathrm{Bias}^2\big(\hat f(x_0)\big) + \mathrm{Var}\big(\hat f(x_0)\big)$$
Note 1
Bias-variance decomposition
The first term is the variance of $y$ around its true mean $f(x_0)$ and cannot be reduced
The second term is the squared bias, the amount by which the average of $\hat f(x_0)$ differs from the true mean $f(x_0)$
The last term is the variance, the expected squared deviation of $\hat f(x_0)$ around its mean
Bias-variance decomposition
The more complex we make $\hat f$, the lower the (squared) bias but the higher the variance
The optimal model is the one that gives the best compromise between the second and third terms
Bias-variance decomposition
Suppose we have the choice of a range of models of differing complexity.
For example, the number of polynomial terms in a one-dimensional linear regression $(x, x^2, x^3, \dots)$.
Suppose further that we have both a training dataset to fit our model and a test dataset on which to assess prediction accuracy.
The more complex models will always fit the training data better, but will not necessarily improve test performance!
The extra complexity allows the bias of the estimate to be reduced, but at a cost of extra variance associated with estimating parameters.
An Example of Bias-variance decomposition
For a linear model $\hat f(x) = \hat\beta^\top x$, $\beta \in \mathbb{R}^p$, estimated by least squares,
$$\mathrm{Err}(x_i) = E\Big[\big(y - \hat f(x_i)\big)^2\Big] = \sigma^2 + \big[E\hat f(x_i) - f(x_i)\big]^2 + \|h(x_i)\|^2\sigma^2$$
While the variance changes with $i$, its average over the sample values $x_i$ doesn't.
Note 2
An Example of Bias-variance decomposition
$$\frac{1}{N}\sum_{i=1}^{N}\mathrm{Err}(x_i) = \sigma^2 + \frac{1}{N}\sum_{i=1}^{N}\big[E\hat f(x_i) - f(x_i)\big]^2 + \frac{p}{N}\sigma^2$$
For linear models fit by least squares, the bias is zero
For a regularized fit, the bias is positive, with the aim of reducing the variance
Note 2
An illustration of bias-variance tradeoff
Suppose we have $n = 50$ observations with $x_i \sim \mathrm{Uniform}(0, 2)$ and $y_i = \cos(2x_i) + \varepsilon_i$, with $\varepsilon_i \sim N(0, 0.3^2)$.
We want to use a polynomial function of $x_i$ to fit $y_i$, but we are not sure how many polynomial terms to use.
To test this, we fit linear, quadratic, cubic, quartic and quintic polynomials to the data.
We then check how each fitted model performs on a separate test dataset of $n = 1000$. We repeat this 100 times and average the results.
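A sketch of one way this experiment could be coded (using numpy's polynomial fitting; the seed and printed numbers are illustrative and are not the results shown in the figures that follow):

```python
import numpy as np

rng = np.random.default_rng(1)
n_train, n_test, n_rep, max_deg, sd = 50, 1000, 100, 5, 0.3

train_err = np.zeros(max_deg)
test_err = np.zeros(max_deg)

for _ in range(n_rep):
    x = rng.uniform(0, 2, n_train)
    y = np.cos(2 * x) + rng.normal(0, sd, n_train)
    x_new = rng.uniform(0, 2, n_test)
    y_new = np.cos(2 * x_new) + rng.normal(0, sd, n_test)
    for d in range(1, max_deg + 1):
        coef = np.polyfit(x, y, deg=d)                      # polynomial fit with d terms
        train_err[d - 1] += np.mean((y - np.polyval(coef, x)) ** 2)
        test_err[d - 1] += np.mean((y_new - np.polyval(coef, x_new)) ** 2)

train_err /= n_rep                                          # average over the 100 repetitions
test_err /= n_rep
print(np.round(train_err, 3))                               # training error by number of terms
print(np.round(test_err, 3))                                # test error by number of terms
```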
Target function
[Figure: the target function $y = \cos(2x)$ plotted for $x \in [0, 2]$.]
Example data (one of the 100 repetitions)
[Figure: scatter plot of one simulated training sample of $y$ against $x$.]
Training and test error for increasingly complex models
[Figure: average squared error on the training and test data plotted against the number of polynomial terms (1 to 5). The training error keeps decreasing as terms are added, while the test error does not keep improving.]
Bias-variance decomposition for test data
[Figure: decomposition of the test-data average squared error into squared bias, variance and total error, plotted against the number of polynomial terms (1 to 5).]
Summary: model fitting, selection and assessment
A moral coming out of this example is that a model that fits well on the training data is not indicative of true model performance.
Thus the average loss
$$\overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^{N} L\big\{y_i, \hat f(x_i)\big\}$$
computed from the training data can be misleading.
Rather, the average loss of the model should be computed on a separate "test" dataset.
In reality, this means we should partition the data, fitting on one portion and testing performance on the other.
Summary: model fitting, selection and assessment
On the other hand, suppose the data is split into a training and a validation set.
In model selection we use the training data to fit each candidate model and choose as the selected model the one having the best performance on the validation set.
The average loss on the validation set will be smallest for the selected model.
But to assess the prediction accuracy of the selected model, we still need to calculate its average loss based on a separate "test" dataset.
Optimism
The training error rate
$$\overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^{N} L\big(y_i, \hat f(x_i)\big)$$
will not correctly reflect the true error
$$\mathrm{Err} = E\big[L\big(y, \hat f(x)\big)\big]$$
because the same data is used both to fit the model and to assess its error
Optimism
The model obtained from $D = \{(y_i, x_i),\ i = 1, \dots, N\}$ adapts to the training data
The training error e ̄rr is an optimistic estimate of the generalization error Err
Err is a form of extra-sample error, since the test feature vectors don’t need to coincide with the training vectors
Optimism
The nature of the optimism can be seen when we consider
$$\mathrm{Err} = \frac{1}{N}\sum_{i=1}^{N} E_{\mathrm{New}}\big[L\big(y_i, \hat f(x_i)\big)\big]$$
where $E_{\mathrm{New}}$ indicates that we observe multiple new responses at each of the training points $x_i$, $i = 1, \dots, N$
It better reflects the true error and is therefore a better performance measure of a model
Optimism
The optimism is defined as
$$\mathrm{op} = E\big[\mathrm{Err} - \overline{\mathrm{err}}\big]$$
and is positive, since $\overline{\mathrm{err}}$ is usually biased downward as an estimate of $\mathrm{Err}$
An obvious way to estimate the prediction error is to estimate the optimism and add it to the training error $\overline{\mathrm{err}}$
Optimism
A corrected estimate of $\mathrm{Err}$ is
$$\widehat{\mathrm{Err}} = \overline{\mathrm{err}} + \widehat{\mathrm{op}}$$
where $\widehat{\mathrm{op}}$ is an estimate of the optimism
This corrected estimate provides a method to assess and select a model
Optimism
For squared loss,
$$\mathrm{op} = \frac{2}{N}\sum_{i=1}^{N} \mathrm{cov}(\hat y_i, y_i)$$
The amount by which $\overline{\mathrm{err}}$ underestimates the true error depends on how strongly $y_i$ affects its own prediction
The harder we fit the data, the greater $\mathrm{cov}(\hat y_i, y_i)$ will be, thereby increasing the optimism
Note 3
Optimism
The expected criterion is
$$\mathrm{Err} = E_y\big(\overline{\mathrm{err}}\big) + \frac{2}{N}\sum_{i=1}^{N}\mathrm{cov}(\hat y_i, y_i)$$
which gives, in the case of a linear model fit with $d$ inputs,
$$\mathrm{Err} = E_y\big(\overline{\mathrm{err}}\big) + 2\frac{d}{N}\sigma^2$$
Note 4
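As a hedged numerical check (not from the slides) of the identity behind the last step, namely that $\sum_i \mathrm{cov}(\hat y_i, y_i) = d\sigma^2$ for a least-squares fit with a fixed design, one can regenerate the responses many times and estimate the covariances empirically; all names and sizes below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma, n_rep = 100, 4, 0.5, 5000

X = rng.normal(size=(n, d))                           # fixed design
f = X @ rng.normal(size=d)                            # true mean E(y)

Y = f + rng.normal(0, sigma, size=(n_rep, n))         # n_rep fresh response vectors
H = X @ np.linalg.solve(X.T @ X, X.T)                 # hat matrix of the least-squares fit
Yhat = Y @ H.T                                        # fitted values for every replication

# empirical cov(yhat_i, y_i), summed over the training points
cov_sum = np.sum(np.mean((Yhat - Yhat.mean(0)) * (Y - Y.mean(0)), axis=0))
print(cov_sum, d * sigma**2)                          # the two numbers should be close
```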
Relation with existing criteria
$C_p$ statistic
$$C_p = \overline{\mathrm{err}} + 2\frac{d}{N}\hat\sigma^2$$
$\hat\sigma^2$ is the noise variance, obtained from the mean squared error of a low-bias model
This criterion adjusts the training error by a factor proportional to the number of parameters
Relation with existing criteria
The Akaike information criterion is
$$\mathrm{AIC} = -\frac{2}{N}\,\mathrm{loglik} + 2\frac{d}{N}$$
and is equivalent to $C_p$ for Gaussian models,
$$\mathrm{AIC}(\alpha) = \overline{\mathrm{err}}(\alpha) + 2\frac{d(\alpha)}{N}\hat\sigma^2$$
since $-2\,\mathrm{loglik}$ equals $\sum_i\big(y_i - \hat f(x_i)\big)^2/\sigma^2$, which is $N\,\overline{\mathrm{err}}/\sigma^2$
Relation with existing criteria
The Bayesian information criterion is
$$\mathrm{BIC} = -2\,\mathrm{loglik} + d\log(N)$$
and for Gaussian models
$$\mathrm{BIC} = \frac{N}{\hat\sigma^2}\Big[\overline{\mathrm{err}} + \frac{d}{N}\log(N)\,\hat\sigma^2\Big]$$
Therefore BIC is proportional to AIC and $C_p$, with the factor 2 replaced by $\log(N)$
BIC tends to penalize complex models more heavily, giving preference to simpler models in selection
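A minimal sketch comparing the criteria over nested polynomial models, using the slide's Gaussian forms with a plug-in $\hat\sigma^2$ from the largest model (an illustrative implementation, not the only convention for writing AIC and BIC):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 100
x = rng.uniform(0, 2, N)
y = np.cos(2 * x) + rng.normal(0, 0.3, N)

max_deg = 8
X_big = np.vander(x, max_deg + 1, increasing=True)            # largest (low-bias) model
beta_big = np.linalg.lstsq(X_big, y, rcond=None)[0]
sigma2_hat = np.sum((y - X_big @ beta_big) ** 2) / (N - X_big.shape[1])

for d in range(1, max_deg + 1):
    Xd = np.vander(x, d + 1, increasing=True)                  # intercept + d polynomial terms
    beta = np.linalg.lstsq(Xd, y, rcond=None)[0]
    err_bar = np.mean((y - Xd @ beta) ** 2)                    # training error
    p = Xd.shape[1]                                            # number of parameters
    cp_aic = err_bar + 2 * p / N * sigma2_hat                  # Cp / AIC form from the slides
    bic    = err_bar + np.log(N) * p / N * sigma2_hat          # factor 2 replaced by log(N)
    print(d, round(cp_aic, 4), round(bic, 4))
```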
The effective number of parameters
The concept of "number of parameters" can be generalized to settings where regularization is used:
$$\hat{\mathbf{y}} = \mathbf{S}\,\mathbf{y}$$
where for least squares
$$\mathbf{S} = \mathbf{X}\big(\mathbf{X}^\top\mathbf{X}\big)^{-1}\mathbf{X}^\top$$
and for ridge regression
$$\mathbf{S} = \mathbf{X}\big(\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I}\big)^{-1}\mathbf{X}^\top$$
The effective number of parameters
The effective number of parameters is
$$d(\mathbf{S}) = \mathrm{trace}(\mathbf{S})$$
If $\mathbf{S}$ is an orthogonal projection matrix, $\mathrm{trace}(\mathbf{S}) = d$, the number of parameters
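A quick sketch (illustrative only) showing that trace(S) equals the number of columns for least squares and shrinks below it for ridge regression as λ grows:

```python
import numpy as np

rng = np.random.default_rng(4)
N, p = 60, 10
X = rng.normal(size=(N, p))

S_ls = X @ np.linalg.solve(X.T @ X, X.T)               # least-squares projection matrix
print(np.trace(S_ls))                                  # = p for an orthogonal projection

for lam in (0.1, 1.0, 10.0):
    S_ridge = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    print(lam, np.trace(S_ridge))                      # effective number of parameters < p
```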
Motivations for Selection Criteria
Let $M_0$ be the model with density $f_{\beta_0}$. In the linear case, for example,
$$\mathbf{y} = \mathbf{X}\beta_0 + \varepsilon, \qquad \varepsilon \sim N\big(\mathbf{0}, \sigma_0^2\mathbf{I}\big)$$
We have $f_{\beta_0} = f(\mathbf{y}\mid\beta_0) = f(\varepsilon)$
Motivations for Selection Criteria
Given a class of candidate models $\mathcal{M} = \{M_1, \dots, M_K\}$, a selection criterion aims to select a candidate model $M_k$ as an approximation to $M_0$
In the linear case this becomes
$$\mathbf{y} = \mathbf{X}\hat\beta_k + \varepsilon_k, \qquad \varepsilon_k \sim N\big(\mathbf{0}, \sigma_k^2\mathbf{I}\big)$$
and we have $f_{\beta_k} = f\big(\mathbf{y}\mid\hat\beta_k\big) = f(\varepsilon_k)$
Motivations for Selection Criteria
The derivation of selection criteria requires measures that quantify the separation between $M_0$ and $M_k$
$C_p$ is a criterion derived using the $L_2$ norm as a basis for measuring the discrepancy between $M_0$ and $M_k$:
$$\Delta(M_0, M_k) = \big\|\mu_{M_0} - \mu_{M_k}\big\|^2 = L_2(M_k)$$
where $\mu_{M_0}$ and $\mu_{M_k}$ are the true and candidate model means
Motivations for Selection Criteria
Advantages
$L_2$ depends only on the means of the models and not on the two actual densities
This means that $L_2$ can be applied when the errors are not normally distributed
Disadvantage
$L_2$ is a matrix in certain multivariate models
Motivations for Selection Criteria
$C_p$ provides an estimate of $E[J_k]$, where
$$J_k = \frac{1}{\sigma_0^2}\big(\hat\beta_k - \beta_0\big)^\top\mathbf{X}^\top\mathbf{X}\big(\hat\beta_k - \beta_0\big)$$
Since
$$E\big[\mathrm{RSS}_k/\sigma_0^2\big] = n - k + \frac{B_k}{\sigma_0^2}$$
we have
$$E\bigg[\frac{\mathrm{RSS}_k}{\sigma_0^2} - n + 2k\bigg] = k + \frac{B_k}{\sigma_0^2}$$
Motivations for Selection Criteria
Hence
$$C_p = \frac{\mathrm{RSS}_k}{\sigma_0^2} - n + 2k$$
is unbiased for $E[J_k]$
In $C_p$, $\sigma_0^2$ is replaced by $\sigma_K^2$, obtained from the largest candidate model
Motivations for Selection Criteria
AIC is based on using the Kullback-Leibler divergence between the true and approximating probability density models as the measure of discrepancy:
$$E_0\bigg[\log\frac{f(\mathbf{y}\mid\beta_0)}{f(\mathbf{y}\mid\beta_k)}\bigg] = \int f(\mathbf{y}\mid\beta_0)\log f(\mathbf{y}\mid\beta_0)\,d\mathbf{y} - \int f(\mathbf{y}\mid\beta_0)\log f(\mathbf{y}\mid\beta_k)\,d\mathbf{y} = d(\beta_0, \beta_0) - d(\beta_k, \beta_0)$$
Motivations for Selection Criteria
The discrepancy can then be measured using
$$d_n(\beta_k, \beta_0) = E_0\big\{-2\log f(\mathbf{y}_n\mid\beta_k)\big\}$$
Using the maximum likelihood estimate $\hat\beta_k$,
$$d_n\big(\hat\beta_k, \beta_0\big) = E_0\big\{-2\log f(\mathbf{y}_n\mid\beta_k)\big\}\Big|_{\beta_k = \hat\beta_k}$$
Motivations for Selection Criteria
Akaike noted that $-2\log f\big(\mathbf{y}_n\mid\hat\beta_k\big)$ is biased, and that the bias
$$E_0\Big[E_0\big\{-2\log f(\mathbf{y}_n\mid\beta_k)\big\}\big|_{\beta_k=\hat\beta_k}\Big] - E_0\Big[-2\log f\big(\mathbf{y}_n\mid\hat\beta_k\big)\Big]$$
can often be asymptotically estimated by twice the dimension of $\hat\beta_k$
Motivations for Selection Criteria
Therefore, for
$$\mathrm{AIC} = -2\log f\big(\mathbf{y}_n\mid\hat\beta_k\big) + 2k$$
we have
$$E_0\{\mathrm{AIC}\} \simeq E_0\Big[d_n\big(\hat\beta_k, \beta_0\big)\Big]$$
Motivations for Selection Criteria
The derivation of BIC is motivated using Bayesian arguments.
Let $f(M_k)$, $k \in \{1, \dots, K\}$, denote the discrete prior over the models $M_1, \dots, M_K$
Let $f(\beta_k\mid M_k)$ denote a prior on $\beta_k$ given the model $M_k$
Motivations for Selection Criteria
Applying Bayes' rule gives
$$f(\mathbf{y}, \beta_k, M_k) = f(\mathbf{y}\mid\beta_k, M_k)\,f(\beta_k\mid M_k)\,f(M_k) = f(\beta_k, M_k\mid\mathbf{y})\,f(\mathbf{y})$$
BIC aims to choose the model which is a posteriori most probable
Motivations for Selection Criteria
The posterior probability of $M_k$ is
$$f(M_k\mid\mathbf{y}) = \frac{1}{f(\mathbf{y})}\,f(M_k)\int f(\mathbf{y}\mid\beta_k, M_k)\,f(\beta_k\mid M_k)\,d\beta_k$$
Consider minimizing
$$-2\log f(M_k\mid\mathbf{y}) = 2\log\{f(\mathbf{y})\} - 2\log\{f(M_k)\} - 2\log\int f(\mathbf{y}\mid\beta_k, M_k)\,f(\beta_k\mid M_k)\,d\beta_k$$
Motivations for Selection Criteria
The first term is constant with respect to $k$, and assuming uniform priors for $f(M_k)$ and $f(\beta_k\mid M_k)$, the BIC is obtained using a Taylor series expansion and a Laplace approximation of the resulting integral:
$$-2\log f(M_k\mid\mathbf{y}) \approx -2\log f\big(\mathbf{y}\mid\hat\beta_k\big) + k\log n$$
Motivations for Selection Criteria
BIC is an asymptotic approximation of $-2\log f(M_k\mid\mathbf{y})$
The model with minimum BIC is the model with the largest approximate posterior probability
We have discussed three types of criteria, $C_p$, AIC and BIC. What is the difference?
Motivations for Selection Criteria
AIC and $C_p$ are asymptotically efficient:
$$\lim_{n\to\infty}\frac{E_0[L_2(M_c)]}{E_0[L_2(M_k)]} = 1$$
where $M_c$ is the candidate model closest to the true model and $M_k$ is the selected model
BIC is consistent (it asymptotically selects, with probability one, the model having the correct structure)
Motivations for Selection Criteria
In Bayesian applications, comparisons between models are based on Bayes factors
Considering two models $M_{k_1}$ and $M_{k_2}$, the Bayes factor $B_{12}$ is the ratio of the posterior odds
$$B_{12} = \frac{f(M_{k_1}\mid\mathbf{y})}{f(M_{k_2}\mid\mathbf{y})}$$
If $B_{12} > 1$, $M_{k_1}$ is favored by the data; if $B_{12} < 1$, then $M_{k_2}$ is favored by the data
Model Evaluation
A problem closely related to model selection is one of model evaluation
Here, an investigator is less interested in the selection of a single model and more interested in assessing preference from the data toward each of the models in the candidate collection
Model Evaluation
As BIC approximates a transformation of a model's posterior probability, one can perform model evaluation by transforming BIC back to a posterior probability:
$$f(M_k\mid\mathbf{y}) \approx \frac{\exp\big(-\tfrac{1}{2}\mathrm{BIC}_k\big)}{\sum_{i=1}^{K}\exp\big(-\tfrac{1}{2}\mathrm{BIC}_i\big)}$$
The set of posterior probabilities can be used as a model evaluation tool to assess the relative merits of the considered models
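A small sketch of this back-transformation (the BIC values are made-up numbers, only to show the computation):

```python
import numpy as np

bic = np.array([212.4, 208.9, 209.6, 215.1])   # hypothetical BIC_k for K = 4 candidate models

w = np.exp(-0.5 * (bic - bic.min()))           # subtracting the minimum does not change the ratios
w /= w.sum()                                   # approximate posterior probabilities f(M_k | y)
print(np.round(w, 3))
```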
Model Averaging
This can also be used in model averaging
Consider inference on a parameter δ that is defined within each model in the collection of candidate models
δ can be a prediction $f(x_0)$ at some fixed value $x_0$
Rather than taking a selected model as correct with probability one, model averaging allows a quantification of the uncertainty inherent to model selection
Model Averaging
The posterior distribution of δ is found as a weighted average of the posterior distributions conditional on each model
$$f(\delta\mid\mathbf{y}) = \sum_{k=1}^{K} f(\delta\mid M_k, \mathbf{y})\,f(M_k\mid\mathbf{y})$$
with posterior mean
$$E(\delta\mid\mathbf{y}) = \sum_{k=1}^{K} E(\delta\mid M_k, \mathbf{y})\,f(M_k\mid\mathbf{y})$$
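Continuing the earlier sketch with made-up numbers: given approximate posterior model probabilities, the model-averaged estimate of δ is simply the corresponding weighted mean.

```python
import numpy as np

w = np.array([0.13, 0.49, 0.34, 0.04])         # posterior model probabilities f(M_k | y)
delta_hat = np.array([1.32, 1.25, 1.28, 1.10]) # E(delta | M_k, y) for each candidate model

delta_avg = np.sum(w * delta_hat)              # model-averaged posterior mean E(delta | y)
print(delta_avg)
```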
Model Averaging
This Bayesian prediction is a weighted average of the individual predictions with weights proportional to the posterior probability of each model
The process of model averaging is seen to improve estimation and prediction, which tend to be over-confident if one proceeds as if a selected model is correct with certainty
Cross-Validation
The simplest and most widely used method for estimating the prediction error is cross-validation
It estimates the generalization error
$$\mathrm{Err} = E\big[L\big(y, \hat f(x)\big)\big]$$
when $\hat f(x)$ is applied to an independent test sample from the joint distribution
Cross-Validation
Suppose for now that we do not need the final model assessment on a test dataset, so we only need to fit and validate a model
K-fold cross validation uses part of the available data to fit the model and a different part to test it
Cross-Validation
$K - 1$ parts of the data are used to fit or learn the model, and the $k$th part is used to calculate the prediction error of the fitted model when predicting the $k$th part of the data
This is repeated for $k = 1, \dots, K$ and the $K$ estimates of the prediction error are combined (averaged)
$\hat f^{-k}$ is widely used to denote the fitted model obtained with the $k$th part of the data removed
Illustration of Cross-Validation
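[Figure: schematic of K-fold cross-validation. The data are partitioned into K folds; each fold in turn is held out as the validation part while the remaining K − 1 folds are used to fit the model.]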
Cross-Validation
The case $K = N$ is known as leave-one-out cross-validation:
$$\mathrm{CV} = \frac{1}{N}\sum_{i=1}^{N} L\big(y_i, \hat f^{-k(i)}(x_i)\big)$$
In this case $k(i) = i$, and the fit is computed using all the data except the $i$th pair $(y_i, x_i)$.
In this case CV is approximately unbiased (low bias) for the true prediction error.
Cross-Validation
Given a set of models indexed by a tuning parameter α,
$$\mathrm{CV}(\alpha) = \frac{1}{N}\sum_{i=1}^{N} L\big(y_i, \hat f^{-k(i)}(x_i, \alpha)\big)$$
The curve CV(α) is used for tuning the parameter α:
Select the $\hat\alpha$ that minimizes CV(α)
The model $\hat f(x, \hat\alpha)$ is the final chosen model
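A sketch of K-fold cross-validation over a tuning parameter (here the polynomial degree plays the role of α); the fold construction below is one simple choice among several:

```python
import numpy as np

rng = np.random.default_rng(5)
N, K = 200, 10
x = rng.uniform(0, 2, N)
y = np.cos(2 * x) + rng.normal(0, 0.3, N)

folds = np.array_split(rng.permutation(N), K)            # random partition into K folds
degrees = range(1, 9)
cv = []

for d in degrees:
    fold_errs = []
    for k in range(K):
        test = folds[k]
        train = np.setdiff1d(np.arange(N), test)
        coef = np.polyfit(x[train], y[train], deg=d)      # fit with the k-th part removed
        fold_errs.append(np.mean((y[test] - np.polyval(coef, x[test])) ** 2))
    cv.append(np.mean(fold_errs))                         # CV(alpha) for this degree

best = degrees[int(np.argmin(cv))]                        # alpha_hat minimizing CV(alpha)
print(best, np.round(cv, 4))
```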
Cross-validation
Leave-one-out vs. K-fold CV
In practice K = 5 or K = 10 is usually sufficient.
In situations where the sample size is not large, leave-one-out CV may be employed.
K-fold CV is often preferable to leave-one-out CV:
It saves computational time.
It can improve accuracy due to the bias-variance trade-off.
As K ↑, more observations are used to fit the model ⇒ bias ↓.
But the number of observations in the validation set ↓ ⇒ variance ↑ (less typical observations / outliers have more influence).
Generalized Cross-validation
GCV provides a convenient approximation to leave-one-out cross-validation, for linear fitting under squared loss
In linear fitting $\hat{\mathbf{y}} = \mathbf{S}\mathbf{y}$, with $\mathbf{S} = \mathbf{X}\big(\mathbf{X}^\top\mathbf{X}\big)^{-1}\mathbf{X}^\top$ for least squares
In the case of a linear model,
$$\frac{1}{N}\sum_{i=1}^{N}\Big[y_i - \hat f^{-k(i)}(x_i)\Big]^2 = \frac{1}{N}\sum_{i=1}^{N}\bigg[\frac{y_i - \hat f(x_i)}{1 - S_{ii}}\bigg]^2$$
where
$$S_{ii} = x_i^\top\big(\mathbf{X}^\top\mathbf{X}\big)^{-1} x_i$$
Note 5
Generalized Cross-validation
The GCV approximation is
$$\mathrm{GCV}\big(\hat f\big) = \frac{1}{N}\sum_{i=1}^{N}\bigg[\frac{y_i - \hat f(x_i)}{1 - \mathrm{trace}(\mathbf{S})/N}\bigg]^2$$
and takes the approximate form
$$\mathrm{GCV}\big(\hat f\big) \approx \overline{\mathrm{err}} + 2\frac{p}{N}\hat\sigma^2$$
where
$$\hat\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\big(y_i - \hat f(x_i)\big)^2$$
Note 6
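A sketch comparing, for an ordinary least-squares fit on made-up data, the exact leave-one-out shortcut of the previous slide with its GCV approximation:

```python
import numpy as np

rng = np.random.default_rng(6)
N, p = 80, 4
X = rng.normal(size=(N, p))
y = X @ rng.normal(size=p) + rng.normal(0, 0.5, N)

S = X @ np.linalg.solve(X.T @ X, X.T)                    # linear smoother for least squares
resid = y - S @ y
S_ii = np.diag(S)

loo = np.mean((resid / (1 - S_ii)) ** 2)                 # exact leave-one-out CV
gcv = np.mean((resid / (1 - np.trace(S) / N)) ** 2)      # GCV approximation
print(loo, gcv)
```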
For further reading
Summaries on LMS.
Chapters 7 & 8 from 'The Elements of Statistical Learning' book.
Chapter 6 from 'An Introduction to Statistical Learning' book.