Shrinkage and dimensionality reduction
Data Mining
Prof. Dr. Matei Demetrescu
Statistics and Econometrics (CAU Kiel), Summer 2020
Things can go bad fast
Take linear regression with p = 1 predictor for n = 2 data points
(Figure: two panels, each showing n = 2 data points and the fitted regression line, y against x.)
Line fits all – no matter how data were generated!
Some datasets (e.g. in biostatistics) exhibit p = O(n), or even p ≫ n.
So what do we do?
Today’s outline
Shrinkage and dimensionality reduction
1 Recall model selection
2 Shrinkage methods
3 Dimension reduction methods
4 Up next
Recall model selection
Outline
1 Recall model selection
2 Shrinkage methods
3 Dimension reduction methods
4 Up next
More generally: model degrees of freedom I
If there are not enough data (relative to dimensionality and complexity of the model), overfitting looms.
In linear regression, p is the relevant quantity;
It covers both dimensionality and complexity (polynomials etc.). We need a concept that extends to nonlinear and nonparametric models.
In introductory statistics classes you learn about degrees of freedom.
This refers to the number of (linearly) independent residuals used in computing a residual variance.
If we take degrees of freedom to mean the number of independent variables in a system,
we can extend the concept to arbitrary models.
The “usual” degrees of freedom are then error degrees of freedom.
More generally: model degrees of freedom II
Thus, model degrees of freedom capture the capability of a model to learn complex functional forms.
It makes sense to compare complexity only for given dimensionality.
You may equate degrees of freedom with the number of distinct data points that can be exactly interpolated by the model.
A rough approximation is given by the number of parameters, but
the more nonparametric and the more nonlinear the model, the less exact the approximation.
In nonparametric cases, one can devise equivalence formulas (see splines later on).
Either way,
More model degrees of freedom than observations is really bad!
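A tiny sketch (my own, not from the slides) of the interpolation view above: a polynomial with as many coefficients as data points fits any such data set exactly, even pure noise, which is why matching model degrees of freedom to the sample size leaves nothing to learn from.

```python
# Sketch (my own illustration): a polynomial with d+1 coefficients can
# interpolate d+1 distinct data points exactly -- as many model degrees of
# freedom as observations leaves no information about the noise.
import numpy as np

rng = np.random.default_rng(7)
x = np.sort(rng.uniform(size=6))
y = rng.normal(size=6)                    # pure noise, no structure at all

coefs = np.polyfit(x, y, deg=len(x) - 1)  # 6 coefficients for 6 points
print(np.max(np.abs(np.polyval(coefs, x) - y)))  # ~0: perfect in-sample fit
```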
Too many predictors?
But what if we tried to estimate only the coefficients of predictors with strong signal?
After all…
there is not enough information to estimate all of them
if we have a lot of coefficients, some must be much less important than others
So set them to 0 (or close to it)!
(Formally, we try to get good MSE by reducing variance yet increasing bias a bit.)
Forward stepwise selection may help
Adding regressors stepwise
may be combined with different criteria
can handle data with p > n
but
one needs (in principle) to account for multiple testing
estimates may be seriously biased
… alternatives available.
More approaches to play with
Shrinkage
We fit a model involving all p predictors, but the estimated coefficients are shrunk towards zero relative to OLS estimates.
Such regularization has the effect of reducing variance (at the cost of some bias) and can also perform variable selection.
Dimension Reduction
We project the p predictors into an M-dimensional subspace, where M < p.
The M projections are used as predictors to fit a linear regression model by least squares.
Model averaging
Rather than selecting a single model, we “pool” many simpler models with suitably selected weights.
We study this under ensemble methods.
Shrinkage methods
Outline
1 Recall model selection
2 Shrinkage methods
3 Dimension reduction methods
4 Up next
Regularized (linear) regression
Make sure that some of the parameter estimates are small or zero: place a penalty on large values of the slope coefficients, add it to the sum of squared residuals (or the likelihood, or ...), then minimize/maximize to get β̂,
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \big( Y_i - X_i'\beta \big)^2 + \lambda f(\bar{\beta}).
Here, f measures the distance between 0 and the slope coefficients β̄ (the intercept does not need regularization). How this distance is measured matters.
The tuning parameter λ (shrinkage intensity) controls how close to 0 the estimates of unimportant coefficients should be.
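As a quick illustration of this penalized objective (a sketch of my own, not part of the slides), the snippet below minimizes the sum of squared residuals plus λ·f(slopes) numerically for a user-supplied penalty f; the helper name penalized_fit and the simulated data are invented for this example.

```python
# A minimal sketch: numerically minimizing RSS + lambda * f(slopes)
# for a generic (smooth) penalty f; the intercept is not penalized.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([2.0, 0.0, 0.0, -1.0, 0.0]) + rng.normal(size=n)

def penalized_fit(X, y, lam, penalty):
    """Minimize RSS + lam * penalty(slopes) over intercept and slopes."""
    n, p = X.shape
    def objective(b):
        resid = y - b[0] - X @ b[1:]
        return resid @ resid + lam * penalty(b[1:])
    b0 = np.zeros(p + 1)
    return minimize(objective, b0, method="BFGS").x

# Ridge-type penalty (sum of squared slopes) as one example choice of f.
beta_hat = penalized_fit(X, y, lam=10.0, penalty=lambda b: np.sum(b**2))
print(beta_hat)
```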
Ridge regression
Recall that the least squares fitting procedure estimates β0, β1, . . . , βp using the values that minimize
RSS = \sum_{i=1}^{n} \Big( Y_i - \beta_0 - \sum_{j=1}^{p} \beta_j X_{ij} \Big)^2
In contrast, the ridge regression coefficient estimates βˆR are the values that minimize
\sum_{i=1}^{n} \Big( Y_i - \beta_0 - \sum_{j=1}^{p} \beta_j X_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = RSS + \lambda \sum_{j=1}^{p} \beta_j^2,
where λ ≥ 0 is a tuning parameter, to be determined separately.
Ridge regression — summarized
As with least squares, ridge regression seeks coefficient estimates that fit the data well, by making the RSS small.
However, the second term, λ Σj βj², is small when β1, . . . , βp are close to zero, and so it has the effect of shrinking the estimates of βj towards zero.
The tuning parameter λ serves to control the relative impact of these two terms on the regression coefficient estimates.
Selecting a good value for λ is critical; information criteria or (more often) cross-validation are used for this.
Not to be underestimated: there’s a closed-form expression for the ridge regression estimator.
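For standardized predictors and no separately fitted intercept, that closed form is β̂^R = (X'X + λI)^{-1}X'Y. A minimal sketch (my own check, not from the slides) comparing it with scikit-learn's Ridge, whose alpha plays the role of λ here:

```python
# Sketch (assumption: no intercept, predictors already standardized):
# ridge closed form  beta_hat_R = (X'X + lambda * I)^{-1} X'y.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
n, p = 200, 10
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

lam = 5.0
beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Cross-check with scikit-learn (alpha corresponds to lambda above).
beta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
print(np.allclose(beta_closed, beta_sklearn))  # True up to solver tolerance
```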
Credit data example
Ridge regression estimators (standardized coefficients for Income, Limit, Rating and Student in the Credit data) plotted against the amount of shrinkage. Left: against λ; Right: against ‖β̂^R_λ‖₂/‖β̂‖₂, the coefficient length relative to OLS.
Ridge regression: scaling of predictors
The standard least squares coefficient estimates are scale equivariant. In other words, regardless of how the jth predictor is scaled, X_j β̂_j will remain the same.
In contrast, the ridge regression coefficient estimates can change substantially when multiplying a given predictor by a constant, due to the sum of squared coefficients term in the penalty part of the ridge regression objective function.
Therefore, it is best to apply ridge regression after standardizing the predictors, using the formula

\tilde{X}_{ij} = \frac{X_{ij}}{\sqrt{\tfrac{1}{n} \sum_{i=1}^{n} (X_{ij} - \bar{X}_j)^2}}
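A small sketch of this standardization step (my own, assuming a plain numpy array of predictors); note that the formula only rescales each column by its 1/n standard deviation, it does not center:

```python
# Sketch: rescale each predictor by its (1/n) standard deviation,
# as in the formula above, before running ridge regression.
import numpy as np

def standardize(X):
    """Divide each column by sqrt((1/n) * sum_i (X_ij - mean_j)^2)."""
    sd = np.sqrt(np.mean((X - X.mean(axis=0)) ** 2, axis=0))
    return X / sd

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3)) * np.array([1.0, 100.0, 0.01])  # very different scales
X_tilde = standardize(X)
print(X_tilde.std(axis=0))  # each column now has unit (1/n) standard deviation
```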
Why Does Ridge Regression Improve Over Least Squares?
The Bias-Variance trade-off
Simulated data with n = 50 observations, p = 45 predictors, all having nonzero coefficients. Squared bias (black), variance (green), and test mean squared error (purple) for the ridge regression predictions on a simulated data set, as a function of λ and ‖β̂^R_λ‖₂/‖β̂‖₂. The horizontal dashed lines indicate the minimum possible MSE. The purple crosses indicate the ridge regression models for which the MSE is smallest.
The Lasso
Ridge regression does have one obvious disadvantage: unlike subset selection, which will generally select models that involve just a subset of the variables, ridge regression will include all p predictors in the final model.
The lasso is an alternative to ridge regression that overcomes this disadvantage. The lasso coefficients β̂^L_λ minimize the quantity
\sum_{i=1}^{n} \Big( Y_i - \beta_0 - \sum_{j=1}^{p} \beta_j X_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| = RSS + \lambda \sum_{j=1}^{p} |\beta_j|.
The lasso uses an ℓ1 penalty instead of an ℓ2 penalty. The ℓ1 norm of a coefficient vector β is given by ||β||₁ = Σj |βj|.
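A short sketch (my own, not from the slides) of fitting the lasso with scikit-learn; its Lasso minimizes (1/(2n))·RSS + α‖β‖₁, so α is a rescaled version of the λ above:

```python
# Sketch: fitting the lasso; note scikit-learn's parametrization
# (1/(2n)) * RSS + alpha * ||beta||_1, a rescaled version of lambda.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p = 100, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]          # only three predictors matter
y = X @ beta_true + rng.normal(size=n)

fit = Lasso(alpha=0.2).fit(X, y)
print("nonzero coefficients:", np.flatnonzero(fit.coef_))
```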
The Lasso — continued
As with ridge regression, the lasso shrinks the coefficient estimates towards zero.
However, in the case of the lasso, the l1 penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently large.
Hence, much like best subset selection, the lasso performs variable selection.
We say that the lasso yields sparse models — that is, models that involve only a subset of the variables.
As in ridge regression, selecting a good value of λ for the lasso is critical.
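One way to see where the exact zeros come from (an illustration of my own, under the simplifying assumption of an orthonormal design, X'X = I): each lasso coefficient is then the soft-thresholded OLS coefficient, which is exactly zero once the OLS estimate is small enough relative to λ.

```python
# Sketch (orthonormal-design special case, my own illustration):
# with X'X = I, the lasso coefficient in the RSS + lambda*sum|beta_j|
# parametrization is beta_j^L = sign(beta_j) * max(|beta_j| - lambda/2, 0),
# i.e. exactly zero whenever |beta_j| <= lambda/2.
import numpy as np

def soft_threshold(beta_ols, lam):
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam / 2.0, 0.0)

beta_ols = np.array([4.0, 1.5, -0.4, 0.1])
print(soft_threshold(beta_ols, lam=1.0))   # small coefficients are set to 0
```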
Example: Credit dataset
(Figure: standardized lasso coefficient estimates for Income, Limit, Rating and Student in the Credit data. Left: plotted against λ; Right: plotted against ‖β̂^L_λ‖₁/‖β̂‖₁.)
The variable selection property of the Lasso
Why is it that the lasso, unlike ridge regression, results in coefficient estimates that are exactly equal to zero?
The lasso and ridge penalized sums of squares are in fact the Lagrangians of the constrained problems

\min_{\beta} \sum_{i=1}^{n} \Big( Y_i - \beta_0 - \sum_{j=1}^{p} \beta_j X_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s

and

\min_{\beta} \sum_{i=1}^{n} \Big( Y_i - \beta_0 - \sum_{j=1}^{p} \beta_j X_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le s,

respectively.
The Lasso Picture
The Lasso vs. Ridge picture and animation
Comparing the Lasso and Ridge Regression
Left: Plots of squared bias (black), variance (green), and test MSE (purple) for the lasso on a simulated data set. Right: Comparison of squared bias, variance and test MSE between lasso (solid) and ridge (dashed). Both are plotted against their R² on the training data as a common scale. The crosses in both plots indicate the lasso model for which the MSE is smallest.
Comparing the Lasso and Ridge Regression — continued
The simulated data are similar to the previous one, except that now only two predictors are related to the response.
Left: Plots of squared bias (black), variance (green), and test MSE (purple) for the lasso. Right: Comparison of squared bias, variance and test MSE between lasso (solid) and ridge (dashed). Both are plotted against their R² on the training data. The crosses in both plots indicate the lasso model for which the MSE is smallest.
Conclusions
These two examples illustrate that neither ridge regression nor the lasso will universally dominate the other. An extension, the elastic net, combines both penalty terms for this reason.
Still, the lasso and its descendants are more popular in practice.
In general, one might expect the lasso to perform better when the response is a function of only a relatively small number of predictors.
However, the number of predictors that is related to the response is never known a priori for real data sets.
A technique such as cross-validation can be used in order to determine which approach is better on a particular data set.
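For completeness, a sketch (not from the slides) of the elastic net mentioned above, using scikit-learn's ElasticNet; its objective is (1/(2n))·RSS + α·l1_ratio·‖β‖₁ + ½·α·(1−l1_ratio)·‖β‖₂², so α and l1_ratio jointly control the two penalties:

```python
# Sketch: the elastic net combines the l1 and l2 penalties.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

fit = ElasticNet(alpha=0.3, l1_ratio=0.5).fit(X, y)  # half l1, half l2
print(fit.coef_)
```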
Tuning Parameters for Ridge Regression and the Lasso
As for subset selection, for ridge regression and lasso we require a method to determine which of the models under consideration is best.
That is, we require a method for selecting a value of the tuning parameter λ or, equivalently, of the constraint s.
Cross-validation provides a simple way to tackle this problem: we choose a grid of λ values and compute the cross-validation error for each value of λ.
We then select the tuning parameter value for which the cross-validation error is smallest.
Finally, the model is re-fit using all of the available observations and the selected value of the tuning parameter.
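A sketch of this tuning procedure (my own, using scikit-learn with the lasso as the example; the grid and data are invented): compute the 10-fold cross-validation MSE over a grid of penalty values, pick the minimizer, and refit on all observations.

```python
# Sketch of the procedure above: grid of penalty values, CV error for each,
# keep the best value, then refit on the full sample.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 15))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

grid = np.logspace(-2, 1, 30)                      # grid of penalty values
cv_mse = [
    -cross_val_score(Lasso(alpha=a), X, y, cv=10,
                     scoring="neg_mean_squared_error").mean()
    for a in grid
]
best = grid[int(np.argmin(cv_mse))]
final_fit = Lasso(alpha=best).fit(X, y)            # refit on all observations
print(best, np.flatnonzero(final_fit.coef_))
```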
Credit data example
Left: Cross-validation errors that result from applying ridge regression to the Credit data set with various values of λ.
Right: The coefficient estimates as a function of λ. The vertical dashed lines indicate the value of λ selected by cross-validation.
Simulated data example
Left: Ten-fold cross-validation MSE for the lasso, applied to the sparse simulated data set.
Right: The corresponding lasso coefficient estimates. The vertical dashed lines indicate the lasso fit for which the cross-validation error is smallest.
Dimension reduction methods
Outline
1 Recall model selection
2 Shrinkage methods
3 Dimension reduction methods
4 Up next
Summarize regressors?
The methods that we have discussed so far in this chapter have involved fitting linear regression models, via
least squares or
a shrinkage approach,
using the original predictors X1, X2, . . . , Xp.
We now explore a class of approaches that transform the predictors
and then fit a least squares model using the transformed variables. This is a particular case of so-called “feature generation”.
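A sketch of the general recipe (my own illustration; taking the transformed variables to be principal components is an assumption here, as the construction of the new predictors is detailed on the following slides): build M < p transformed features from X1, . . . , Xp and run least squares on them.

```python
# Sketch: build M < p transformed features Z_1,...,Z_M, then fit OLS on them.
# Here the Z_m are principal components, one common choice (an illustration,
# not necessarily the construction used on the following slides).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 20))
y = X[:, :5].sum(axis=1) + rng.normal(size=100)

M = 5  # number of transformed predictors, a tuning choice
fit = make_pipeline(StandardScaler(), PCA(n_components=M), LinearRegression())
fit.fit(X, y)
print(fit.score(X, y))  # in-sample R^2 of the M-component fit
```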
Dimension Reduction Methods: details
Let Z1, Z2, . . . , ZM represent M < p linear combinations of the original p predictors.