QBUS2820 Predictive Analytics
Model Selection and Variable Selection
Business Analytics, The University of Sydney Business School
Table of contents
Popular model selection methods
Variable selection in linear regression
Recommended reading
▶ Chapter 6, An Introduction to Statistical Learning with
Applications in R by James et al.: easy to read, comes with
R/Python code for practice.
▶ Chapter 7 of The Elements of Statistical Learning by Hastie et
al.: well-written, deep in theory, suitable for students with a
sound maths background.
Popular model selection methods
▶ Model selection in general and variable selection in particular
are important parts of data analysis.
▶ Let {Mi, i ∈ I} be a set of potential models; we want to select
the best one.
▶ Variable selection is a special case of model selection: select
the “best” subset of the given p predictors to explain/predict
the response Y .
▶ Two ideal measures for doing model selection: prediction error
and expected prediction error
▶ Bias-Variance decomposition:
Expected prediction error = Variance + Bias²
So model selection is about picking a model that trades off bias
against variance (a small simulation sketch follows below).
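To make the trade-off concrete, here is a minimal simulation sketch (the sine-curve data-generating process, the polynomial models and all numbers are illustrative choices, not from the lecture): as the polynomial degree grows, the squared bias at a test point typically shrinks while the variance grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, x0 = 50, 0.3, 0.6          # training size, noise sd, test point (illustrative)

def f(x):
    """Illustrative 'true' regression function."""
    return np.sin(2 * np.pi * x)

def prediction_at_x0(degree):
    """Simulate one training set, fit a polynomial of the given degree,
    and return its prediction at the test point x0."""
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, sigma, n)
    coefs = np.polyfit(x, y, degree)
    return np.polyval(coefs, x0)

for degree in (1, 3, 9):
    preds = np.array([prediction_at_x0(degree) for _ in range(2000)])
    bias2 = (preds.mean() - f(x0)) ** 2   # squared bias at x0
    var = preds.var()                     # variance at x0
    print(f"degree {degree}: Bias^2 = {bias2:.4f}, Variance = {var:.4f}, "
          f"Bias^2 + Variance = {bias2 + var:.4f}")
```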
AIC: Akaike’s Information Criterion
Akaike’s information criterion (AIC): Select the model with the
smallest AIC
AIC = −2× log-likelihood(θ̂mle) + 2d
which has the form: training error+model complexity.
▶ log-likelihood(θ̂mle) is the log-likelihood evaluated at the MLE
▶ d is the number of parameters in θ (i.e. # covariates)
▶ The factor 2 is not important, but useful when comparing AIC
to other model selection criteria
▶ Proposed by Hirotugu Akaike in 1973; one of the most
impactful statistical methods of ’modern’ times.
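As a quick sanity check of the formula, the sketch below (simulated data; statsmodels is used purely as an illustrative tool) computes AIC by hand from the maximised log-likelihood and compares it with the value statsmodels reports. Note that statsmodels’ OLS results count d as the number of mean parameters (intercept plus slopes); some texts also count the error variance σ², which only shifts every model’s AIC by the same constant.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 100
X = sm.add_constant(rng.normal(size=(n, 2)))            # intercept + 2 covariates
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=n)  # simulated response

fit = sm.OLS(y, X).fit()
d = fit.df_model + 1                    # slopes + intercept (statsmodels' convention)
aic_by_hand = -2 * fit.llf + 2 * d      # -2 * log-likelihood(MLE) + 2d
print(aic_by_hand, fit.aic)             # the two values should coincide
```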
AIC: Simple intuition
▶ The training error is actually an approximation of the expected
prediction error, but it can be quite a flawed approximation.
▶ Under certain assumptions on the true generating process and
the model, we can compute a correction term that, added to
the training error, resembles the prediction error a bit better.
▶ This correction term is simply twice the number of parameters
in the model!
AIC: theoretical justification*
▶ Consider data D, and a model M depending on a vector of
parameters θ of size d.
▶ Under model M, the density of the data is p(D|θ), i.e. the
likelihood function.
▶ Denote by g(D) the true density function of the data. Of
course g(D) is unknown!
▶ Idea: We want a model such that some discrepancy between
the truth g(D) and the model-based p(D|θ) is small.
▶ Let’s use Kullback-Leibler divergence to measure such
discrepancy.
AIC: theoretical justification*
▶ Kullback-Leibler (KL) divergence is widely used to measure
the discrepancy between two probability distributions.
▶ Consider two probability distributions with the corresponding
density functions p(x) and q(x).
▶ The KL divergence from a probability distribution with pdf p(x) to
a distribution with pdf q(x) is defined as
KL(p∥q) = ∫ p(x) log( p(x)/q(x) ) dx.
▶ Note: KL(p∥q) ̸= KL(q∥p).
▶ Example (check yourself!): if p(x) is the pdf of N(µ1, σ1²) and
q(x) is the pdf of N(µ2, σ2²), then
KL(p∥q) = log(σ2/σ1) + ( σ1² + (µ1 − µ2)² ) / (2σ2²) − 1/2.
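A quick numerical check of this Gaussian example (a sketch; the specific parameter values are arbitrary): the closed-form KL divergence is compared with the defining integral evaluated by numerical quadrature.

```python
import numpy as np
from scipy import stats, integrate

mu1, s1 = 0.0, 1.0     # parameters of p = N(mu1, s1^2)
mu2, s2 = 1.5, 2.0     # parameters of q = N(mu2, s2^2)

# Closed form: KL(p||q) = log(s2/s1) + (s1^2 + (mu1 - mu2)^2) / (2 s2^2) - 1/2
kl_closed = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

# Numerical: integral of p(x) * log(p(x)/q(x)) dx
p = stats.norm(mu1, s1).pdf
q = stats.norm(mu2, s2).pdf
kl_numeric, _ = integrate.quad(lambda x: p(x) * np.log(p(x) / q(x)), -20, 20)

print(kl_closed, kl_numeric)   # the two values should agree closely
```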
AIC: theoretical justification*
▶ The KL divergence from the truth g(D) to the model-based
density p(D|θ) is
LM(θ) := KL(g(D)∥p(D|θ)) = ED[ log g(D) − log p(D|θ) ].
This can be interpreted as the distance between the true
distribution of the data and the distribution of the data under
the postulated model M.
▶ We want a model M that minimises LM(θ) for all θ.
▶ It’s reasonable to consider an easier problem: we want a
model M that minimises ED[LM(θ̂mle)], with θ̂mle the MLE
estimator of θ (under model M).
AIC: theoretical justification*
▶ Akaike showed that, when the sample size n is large enough,
ED[LM(θ̂mle)] ≈ (1/2) AIC + ED[log g(D)],
where
AIC = −2 log p(D|θ̂mle) + 2d
= −2× log-likelihood(θ̂mle) + 2d.
▶ As the last term ED[log g(D)] is a constant (independent of the
model M), minimising AIC is equivalent to minimising ED[LM(θ̂mle)].
AIC: Limitations
▶ The model families must contain the true generating process.
▶ Each possible value of θ within a model family must map to
a different probability distribution. This rules out models that
produce the same predictions for different values of their
parameters, which excludes some important model families
such as neural networks, and in general all model families that can
fit the data to ’perfection’, including linear models with very many
parameters.
▶ It is an asymptotic approximation: it relies on the sample size
being very large compared to the number of parameters.
▶ If these conditions are not met, AIC can still be used, but without
guarantees. In practice, when it errs, it is most often too
conservative: it tends to prefer the smaller model.
BIC: Bayesian Information Criterion
Bayesian information criterion (BIC): Select the model with the
smallest BIC
BIC = −2× log-likelihood(θ̂mle) + (log n)× d
BIC, proposed by Gideon Schwarz in 1978, is motivated by Bayesian statistics.
BIC: theoretical justification*
Consider a model M with a d-dimensional vector of parameters θ, and data D.
The posterior of M is
p(M|D) ∝ p(M) p(D|M) = p(M) ∫ p(D|θ,M) p(θ|M) dθ.
Here: p(D|θ,M) is the likelihood function (under model M) and
p(θ|M) is the prior distribution of θ.
▶ Bayesian statistics encodes all information about model M in
its posterior distribution p(M|D).
▶ Bayesian statistics is getting more and more popular in
modern statistics. Taught in QBUS3830.
▶ We want a model M with the highest posterior p(M|D).
BIC: theoretical justification*
Using a uniform prior for M and approximating the integral p(D|M) by the
so-called Laplace approximation, it can be shown that
− log p(M|D) ≈ − log p(D|θ̂mle, M) + (d/2) log n = BIC/2 (up to an additive constant).
We want to pick the model with the highest posterior p(M|D),
which is equivalent to picking the model with the smallest BIC.
AIC or BIC?
AIC = −2× log-likelihood(θ̂mle) + 2d
BIC = −2× log-likelihood(θ̂mle) + (log n)× d
▶ They’re both popular model selection methods. BIC puts a
heavier penalty on model complexity.
▶ BIC is shown to be consistent asymptotically: it is able to
identify the true model as n → ∞ (if such a true model exists!
Some people argue that a true model doesn’t exist).
▶ Practitioners seem to prefer AIC over BIC when n is small
M.-N. Tran (2011), “The Loss Rank Criterion for Variable Selection
in Linear Regression Analysis”, Scandinavian Journal of Statistics,
proposes another criterion that is in some sense a compromise
between AIC and BIC.
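The sketch below (simulated data, statsmodels; all details are illustrative) compares AIC and BIC across a sequence of nested linear models: because log n > 2 once n > 7, BIC penalises extra covariates more heavily and tends to pick the smaller model.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=n)  # only X0, X1 matter

# Fit nested models using the first k covariates, k = 0, ..., p
for k in range(p + 1):
    Xk = sm.add_constant(X[:, :k]) if k > 0 else np.ones((n, 1))
    fit = sm.OLS(y, Xk).fit()
    print(f"k = {k}: AIC = {fit.aic:8.2f}, BIC = {fit.bic:8.2f}")
# Choose the k minimising AIC (or BIC); BIC's log(n) penalty favours smaller k.
```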
Validation: Hold-out
By far the most common method for model selection, because it
has its roots in the scientific method.
▶ Separate part of the available data into a training dataset and a
validation dataset.
▶ Fit a model to the training set, then compute its error on the
validation dataset.
▶ The validation error is a good approximation of the prediction error.
Problem: When doing model selection, if we try many possible
candidate models and pick the best one (lowest validation error), we will
end up with problems similar to those of using the training error for
model selection; basically, the selection becomes less reliable.
Think of multiple hypothesis testing. For critical applications
(medicine, policy making), one usually uses another, extra validation
dataset to estimate the error of the chosen model.
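A minimal hold-out sketch (scikit-learn; the simulated data, the linear model and the 70/30 split are illustrative choices):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 5))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=500)

# Split once into a training set and a validation set
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_tr, y_tr)          # fit on the training set only
val_error = mean_squared_error(y_val, model.predict(X_val))
print("validation MSE:", val_error)                 # estimate of the prediction error
```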
Cross-validation
▶ An extension of hold-out validation that gives an estimate of
the expected prediction error.
Cross-validation
▶ Divide the data into K parts, K ≥ 2. Often, this is done at random.
▶ For the kth part, fit the model to the other K − 1 parts.
Denote the fitted model as f̂^{−k}(x).
▶ Use the fitted model to predict the kth part. The prediction
error on that part (with squared-error loss) is
Error_k = ∑_{(yi, xi) ∈ part k} (yi − f̂^{−k}(xi))².
▶ The K-fold cross-validated prediction error is
CV = (1/n) ∑_{k=1}^{K} ∑_{(yi, xi) ∈ part k} (yi − f̂^{−k}(xi))².
It’s an estimate of the expected prediction error as it’s
averaged over both test data and training data.
Cross-validation
▶ The selected model is the one with the smallest CV prediction error.
▶ Typical choices of K are 5, 10 or n. The case K = n is known
as leave-one-out cross-validation.
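The same recipe in code: a K-fold cross-validation sketch with K = 5 (scikit-learn; the simulated data and the linear model are illustrative).

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=200)

K = 5
kf = KFold(n_splits=K, shuffle=True, random_state=0)
sq_errors = []
for train_idx, test_idx in kf.split(X):
    # Fit on the other K-1 folds, predict the held-out fold
    fit = LinearRegression().fit(X[train_idx], y[train_idx])
    resid = y[test_idx] - fit.predict(X[test_idx])
    sq_errors.append(np.sum(resid ** 2))

cv_error = np.sum(sq_errors) / len(y)   # K-fold CV estimate of prediction error
print("5-fold CV error:", cv_error)
```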
Cross-validation: Limitations
Cross-validation is simple and widely used, but it has some limitations:
▶ Can be very computationally expensive because one has to fit
the model many times.
▶ Because of the partitioning, each model is fitted on a smaller
training set than the one that will be used for the final model.
This means CV tends to prefer simpler models.
▶ As an approximation of the expected prediction error, it can be
quite poor, because the partitions share a lot of data and are
therefore not independent.
Practical considerations:
▶ CV is useful in small datasets.
▶ In large datasets, we can get away with just hold-out
validation. Large as in 1M training, 100K validation.
Variable selection in linear regression
Variable selection in linear regression
▶ Consider a linear regression model or a logistic regression
model with p potential covariates.
▶ At the initial step of modelling, a large number p of covariates
is often introduced in order to reduce potential bias.
▶ The task is then to select the best subset among these p covariates.
Best subset selection: Search over all 2^p possible subsets
of the p covariates to find the best one (a brute-force sketch follows
below). The criterion can be AIC, BIC or any other model selection criterion.
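A brute-force sketch of best subset selection by AIC (itertools plus statsmodels on simulated data; as noted, this is only feasible for small p):

```python
from itertools import combinations
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n, p = 150, 6
X = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=n)

best = None
for k in range(p + 1):                        # all subset sizes 0, ..., p
    for subset in combinations(range(p), k):  # all 2^p subsets in total
        Xs = sm.add_constant(X[:, list(subset)]) if subset else np.ones((n, 1))
        aic = sm.OLS(y, Xs).fit().aic         # could equally use BIC here
        if best is None or aic < best[0]:
            best = (aic, subset)

print("best subset by AIC:", best[1], "AIC =", best[0])
```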
Variable selection in linear regression
Searching over 2^p subsets is only feasible when p is small (< 30).
But we will see recent methods for computing the best subset a bit
later.
Forward-stepwise selection: Start with the intercept, then
sequentially add into the model the covariate that most improves
the model selection criterion.
Backward-stepwise selection: Start with the full model with p
covariates, then sequentially remove the covariate that most
improves the model selection criterion.
▶ Advantage: much more time-efficient than the best subset
selection method
▶ Disadvantage: does not necessarily end up at the best subset.
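A sketch of forward-stepwise selection driven by AIC (statsmodels, simulated data; BIC or a CV error could be used in exactly the same way):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n, p = 150, 8
X = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=n)

def aic_of(cols):
    """AIC of the OLS model using the covariates indexed by cols (plus intercept)."""
    Xs = sm.add_constant(X[:, cols]) if cols else np.ones((n, 1))
    return sm.OLS(y, Xs).fit().aic

selected, remaining = [], list(range(p))
current_aic = aic_of(selected)               # start from the intercept-only model
while remaining:
    # Try adding each remaining covariate; keep the one that lowers AIC the most
    best_aic, best_j = min((aic_of(selected + [j]), j) for j in remaining)
    if best_aic >= current_aic:              # no addition improves AIC: stop
        break
    selected.append(best_j)
    remaining.remove(best_j)
    current_aic = best_aic

print("forward-stepwise selection by AIC:", selected)
```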
Variable selection in linear regression
Variable selection based on hypothesis testing
▶ Consider the test H0 : βj = 0 v.s. H1 : βj ̸= 0.
▶ Let β̂j be an estimator of βj. If the sampling distribution of β̂j
is known, a p-value can be computed.
▶ If the p-value is large (e.g. > 0.05, 0.1) then the
corresponding covariate Xj might be removed from the model
Possible disadvantages:
▶ It’s not clear what prediction error is being optimised.
▶ Not time-efficient when p is large (the model needs to be refitted
many times).
This variable selection method is not popular in “modern”
statistics and machine learning. For historical reasons, it is still
widely used in many fields such as the social sciences.
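For completeness, a sketch of the p-value-driven approach (backward elimination with statsmodels; the 0.05 threshold and the simulated data are illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n, p = 150, 5
X = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=n)

cols = list(range(p))
while cols:
    fit = sm.OLS(y, sm.add_constant(X[:, cols])).fit()
    pvals = fit.pvalues[1:]             # skip the intercept's p-value
    worst = int(np.argmax(pvals))
    if pvals[worst] <= 0.05:            # all remaining covariates are significant
        break
    del cols[worst]                     # drop the least significant covariate and refit

print("covariates kept:", cols)
```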
Women’s labour force example
The data set “MROZ.xlsx”, available on Canvas, contains
information on women’s labour force participation.
We would like to build a logistic regression model to explain
women’s labour force participation using potential predictors
nwifeinc (income), educ (years of education), age, exper (years
of experience), expersq (squared years of experience), kidslt6
(number of kids younger than 6) and kidsge6 (number of
kids aged 6 or older).
Let’s carry out the variable selection task.
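A sketch of how the full logistic regression could be fitted in Python (statsmodels formula API). The predictor names follow the slide; the response column name inlf and the file location are assumptions about MROZ.xlsx.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumes MROZ.xlsx contains columns inlf (participation indicator), nwifeinc,
# educ, age, exper, expersq, kidslt6, kidsge6; the response name is an assumption.
mroz = pd.read_excel("MROZ.xlsx")

full = smf.logit(
    "inlf ~ nwifeinc + educ + age + exper + expersq + kidslt6 + kidsge6",
    data=mroz,
).fit()

print(full.summary())                      # inspect coefficients and p-values
print("AIC:", full.aic, "BIC:", full.bic)  # criteria for comparing candidate models
# Variable selection can now proceed by best subset / stepwise search on AIC or BIC.
```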