Predictive Analytics – Week 6: Regularization
Predictive Analytics
Week 6: Regularization
Business Analytics, University of School
Table of contents
Ridge regression
LASSO and other regularisation methods
Recommended reading
• Section 6.2, An Introduction to Statistical Learning with
Applications in R by James et al.: easy to read, comes with
R/Python code for practice.
• Section 3.4, The Elements of Statistical Learning by Hastie et
al.: well-written, deep in theory, suitable for students with a
sound maths background.
Regularization
• Adding an additional penalty term to the formulation of a
learning problem with the purpose of reducing overfitting.
• Any modification we make to a learning algorithm that is
intended to reduce its generalization error, but not its training
error (Goodfellow et al.)
argmin_f L(f(X), Y)  →  argmin_f { L(f(X), Y) + R(f) }
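To make this concrete, here is a minimal sketch (not from the slides; the synthetic data, polynomial basis, and penalty weight are all assumptions for illustration) comparing an unpenalised fit with an L2-penalised fit of the same flexible model:

# Illustrative sketch only: the same flexible model fitted with and without an L2 penalty.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=60)
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.5, random_state=0)

for name, reg in [("unpenalised", LinearRegression()), ("L2-penalised", Ridge(alpha=1.0))]:
    model = make_pipeline(PolynomialFeatures(degree=10), reg)  # flexible basis, prone to overfitting
    model.fit(x_tr, y_tr)
    print(name,
          "train MSE:", round(mean_squared_error(y_tr, model.predict(x_tr)), 3),
          "test MSE:", round(mean_squared_error(y_te, model.predict(x_te)), 3))

The penalty typically increases the training error slightly while reducing the gap between training and test error.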
Ridge regression
Multiple linear regression (recall)
y_i = β_0 + β_1 x_{i1} + … + β_p x_{ip} + ε_i = x_i′β + ε_i,   i = 1, …, n
E(ε_i) = 0,  V(ε_i) = σ²,  β = (β_0, …, β_p)′ ∈ R^{p+1}.
The design matrix is

X = ⎡ 1  x_{11}  ···  x_{1p} ⎤
    ⎢ ⋮     ⋮            ⋮   ⎥
    ⎢ 1  x_{i1}  ···  x_{ip} ⎥
    ⎢ ⋮     ⋮            ⋮   ⎥
    ⎣ 1  x_{n1}  ···  x_{np} ⎦
Matrix form
The ordinary least squares (OLS) estimate of β is
β̂^ls = (X′X)⁻¹X′y,
given that the inverse of the matrix X′X exists.
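As a small numerical sketch (the data below are synthetic and purely illustrative), the OLS formula can be computed by solving the normal equations rather than forming the inverse explicitly:

import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design matrix with a leading column of 1s
beta = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta + rng.normal(scale=0.5, size=n)

# beta_hat_ls = (X'X)^{-1} X'y, computed by solving (X'X) b = X'y
beta_hat_ls = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat_ls)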
What if the inverse of the matrix X′X doesn’t exist, or X′X is nearly singular?
Ridge regression
• It can be shown that E∥β̂ − β∥² = σ² tr((X′X)⁻¹), where tr(A) denotes the trace of the matrix A, i.e. the sum of the diagonal elements of A.
• Multicollinearity issue: when at least two columns of X are strongly correlated with each other, X′X is nearly singular.
Typical warning from statistical software:
Warning: Matrix is close to singular or badly scaled. Results
may be inaccurate. RCOND = 1.012021e-16.
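A rough illustration (synthetic data, not from the slides) of how near-duplicate predictors make X′X nearly singular, as measured by its condition number:

import numpy as np

rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=1e-6, size=n)  # x2 is almost an exact copy of x1
X = np.column_stack([np.ones(n), x1, x2])

# a huge condition number signals that X'X is nearly singular
print(np.linalg.cond(X.T @ X))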
Ridge regression
The fact that the estimate β̂ explodes when multicollinearity is present leads us to the idea of restricting the coefficients to be ”small”, for example within a sphere centred at 0:

min_β ∑_{i=1}^n (y_i − β_0 − β_1 x_{i1} − … − β_p x_{ip})²   subject to   ∑_{j=1}^p β_j² ≤ t,

for some pre-specified control parameter t > 0.
This is known as ridge regression.
Ridge regression
This constrained optimisation problem is equivalent to minimising a penalised residual sum of squares:

β̂^ridge_λ = argmin_β ∑_{i=1}^n (y_i − β_0 − ∑_{j=1}^p β_j x_{ij})² + λ ∑_{j=1}^p β_j²

for some shrinkage parameter λ > 0.
• We minimise the penalised RSS. The second term penalises the parameter complexity.
• It turns out that the idea of jointly optimising
  Error term + Model complexity term
  is very common in Statistics/Data Mining and Machine Learning.
• Note that one usually doesn’t penalise the intercept β0.
• λ controls the strength of the penalty and needs to be selected.
Ridge regression
This constrained optimisation problem is equivalent to minimising

β̂^ridge_λ = argmin_β ∑_{i=1}^n (y_i − β_0 − ∑_{j=1}^p β_j x_{ij})² + λ ∑_{j=1}^p β_j²

for some shrinkage parameter λ > 0.
In regularisation methods for linear regression, for mathematical convenience, we often centre the data, so that β0 = 0. The ridge problem then becomes

β̂^ridge_λ = argmin_β ∑_{i=1}^n (y_i − ∑_{j=1}^p β_j x_{ij})² + λ ∑_{j=1}^p β_j².
Ridge regression
Or, written in matrix form,

β̂^ridge_λ = argmin_β ∥y − Xβ∥² + λ β′I_p β,

with X the design matrix (without the column of 1s), and I_p the identity matrix of size p.
• It is easy to see that β̂^ridge_λ = (X′X + λI_p)⁻¹X′y (a small numerical check of this formula follows this list).
• Bias: in the expected squared error decomposition, ridge regression has a higher (bias)² term than least squares.
• Variance: in the expected squared error decomposition, ridge regression has a lower variance term than least squares.
• It can be shown that there exists a nonzero value of λ that achieves a better bias-variance tradeoff than least squares.
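A minimal sketch (synthetic data; λ is an arbitrary choice) that checks the closed-form ridge solution above against scikit-learn's Ridge on centred data:

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = X @ np.array([1.5, -2.0, 0.0, 0.5, 1.0]) + rng.normal(scale=0.5, size=n)

# centre the data so that the intercept can be dropped (beta_0 = 0)
Xc = X - X.mean(axis=0)
yc = y - y.mean()

lam = 2.0
beta_closed_form = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
beta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(Xc, yc).coef_
print(np.allclose(beta_closed_form, beta_sklearn))  # should print True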
Optimal values for λ in Ridge regression
• How do we find the best values for λ in Ridge regression?
• Many ideas through the years…
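One widely used idea is to select λ by cross-validation over a grid of candidate values; a minimal sketch of this approach (scikit-learn's RidgeCV on synthetic data; the grid is an arbitrary choice) is shown below:

import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.5, 1.0]) + rng.normal(scale=0.5, size=200)

# try a grid of lambda (called alpha in sklearn) values and pick the best by 5-fold CV
alphas = np.logspace(-3, 3, 25)
model = RidgeCV(alphas=alphas, cv=5).fit(X, y)
print("selected lambda:", model.alpha_)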
LASSO and other regularisation
LASSO regression
β̂^lasso_λ = argmin_β ∑_{i=1}^n (y_i − β_0 − ∑_{j=1}^p β_j x_{ij})² + λ ∑_{j=1}^p |β_j|
• LASSO regression is another penalised regression, using the absolute value of the coefficients in the penalty instead of their square.
• The lasso is often used when we also need to do variable selection, since it tends to set some coefficients exactly to 0 for large enough values of λ.
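A minimal sketch (synthetic data; the λ value is arbitrary) of this sparsity behaviour: with a large enough penalty, several lasso coefficients are exactly zero:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]               # only the first three predictors matter
y = X @ beta + rng.normal(scale=0.5, size=n)

lasso = Lasso(alpha=0.2).fit(X, y)        # sklearn's alpha plays the role of lambda (up to scaling)
print(np.round(lasso.coef_, 3))           # most of the irrelevant coefficients are exactly 0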
Which method to use?
• Recall the no free lunch theorem: neither ridge regression nor the lasso universally outperforms the other. The choice of method should be data driven.
• The lasso has better interpretability since it can lead to a sparse solution.
• Ridge always keeps all the predictors, hence loses some interpretability, but often has a lower prediction error.
View of regularization as a Bayesian prior
We can draw an equivalence between ridge regression and the lasso on the one hand, and maximising the posterior density under a particular prior distribution on the other.
Recall the Bayes Theorem
P(β | D) ∝ P(D | β) × P(β)
For models
Posterior ∝ Likelihood × Prior
For example, if we take the prior on β to be Gaussian with mean 0 and a given variance, then maximising the log of the posterior gives exactly the penalised formulation of ridge regression. If we take the prior to be Laplacian, we get the lasso.
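As a sketch of the argument (a standard derivation, written out here for completeness; σ² denotes the noise variance and τ² the prior variance):

\log P(\beta \mid D) = \log P(D \mid \beta) + \log P(\beta) + \text{const}
                     = -\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - x_i'\beta)^2
                       - \frac{1}{2\tau^2}\sum_{j=1}^{p}\beta_j^2 + \text{const},

so maximising the log posterior is the same as minimising
\sum_{i=1}^{n}(y_i - x_i'\beta)^2 + \lambda \sum_{j=1}^{p}\beta_j^2, \qquad \lambda = \sigma^2/\tau^2.
Replacing the Gaussian prior with a Laplace prior with scale b, \log P(\beta) = -\tfrac{1}{b}\sum_{j}|\beta_j| + \text{const}, gives the lasso objective with \lambda = 2\sigma^2/b.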
The takeaway
We can see regularization as a way of adding ”prior information”: we believe that coefficient values close to zero are preferable. For some problems other values, not zero, might be preferable (e.g. we might know the βs of similar tasks).
Elastic Net
The elastic net is a compromise between ridge regression and the lasso:

β̂^EN = argmin_β ∑_{i=1}^n (y_i − β_0 − ∑_{j=1}^p β_j x_{ij})² + λ ∑_{j=1}^p (α β_j² + (1 − α)|β_j|)

for λ ≥ 0 and 0 < α < 1. The elastic net performs variable selection like the lasso, and shrinks together the coefficients of correlated predictors like ridge regression.
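A minimal sketch (synthetic data; the penalty settings are arbitrary) of the elastic net in scikit-learn. Note that sklearn's l1_ratio weights the ℓ1 term, so it corresponds roughly to (1 − α) in the notation above, up to scaling constants:

import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(6)
n, p = 200, 10
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + rng.normal(scale=0.05, size=n)  # two highly correlated predictors
beta = np.zeros(p)
beta[:2] = [1.0, 1.0]
y = X @ beta + rng.normal(scale=0.5, size=n)

# l1_ratio mixes the |beta_j| and beta_j^2 penalties
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(np.round(enet.coef_, 3))  # correlated predictors tend to receive similar coefficients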
Best subset
We can view best subset selection with k variables as another penalised or constrained optimisation problem:

min_β ∑_{i=1}^n (y_i − β_0 − ∑_{j=1}^p β_j x_{ij})²   subject to   ∑_{j=1}^p 1{β_j ≠ 0} ≤ k.

This is often what we really want when we do variable selection, while the lasso is an approximation to this goal.
Best subset selection
Nowadays we can solve this problem much faster!
Review questions
• What is the multicollinearity issue, and when might we encounter it?
• What is ridge regression?
• What are the penalty terms in the ridge and lasso methods?
• How do we choose the shrinkage parameter in ridge regression?
Other types of regularization
• Modifying the optimization algorithm (early stopping)
• Injecting noise
• Augmenting the dataset (rotations, shiftings of the data, adding …)