Shrinkage and dimensionality reduction
Data Mining
Prof. Dr. Matei Demetrescu
Statistics and Econometrics (CAU Kiel), Summer 2020
Things can go bad fast
Take linear regression with p = 1 predictor for n = 2 data points
(Figure: two panels, each showing n = 2 data points and the fitted regression line, y against x.)
Line fits all – no matter how data were generated!
Some datasets (e.g. in biostatistics) exhibit p = O(n), or even p ≫ n.
So what do we do?
Today’s outline
Shrinkage and dimensionality reduction
1 Recall model selection
2 Shrinkage methods
3 Dimension reduction methods
4 Up next
Recall model selection
Outline
1 Recall model selection
2 Shrinkage methods
3 Dimension reduction methods
4 Up next
More generally: model degrees of freedom I
If there are not enough data (relative to dimensionality and complexity of the model), overfitting looms.
In linear regression, p is the relevant quantity;
It covers both dimensionality and complexity (polynomials etc.). We need a concept that extends to nonlinear and nonparametric models.
In introductory statistics classes you learn about degrees of freedom.
This refers to the number of (linearly) independent residuals used in computing a residual variance.
If we take degrees of freedom to mean the number of independent variables in a system,
we can extend the concept to arbitrary models.
The “usual” degrees of freedom are then error degrees of freedom.
More generally: model degrees of freedom II
Thus, model degrees of freedom capture the capability of a model to learn complex functional forms.
It makes sense to compare complexity only for given dimensionality.
You may equate degrees of freedom with the number of distinct data points that can be exactly interpolated by the model.
A rough approximation is given by the number of parameters, but
the more nonparametric and the more nonlinear the model, the less exact the approximation.
In nonparametric cases, one can devise equivalence formulas (see splines later on).
Either way,
More model degrees of freedom than observations is really bad!
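A tiny sketch (my own, not from the slides) of the interpolation view above: a polynomial with as many coefficients as data points fits any such data set exactly, even pure noise, which is why matching model degrees of freedom to the sample size leaves nothing to learn from.

```python
# Sketch (my own illustration): a polynomial with d+1 coefficients can
# interpolate d+1 distinct data points exactly -- as many model degrees of
# freedom as observations leaves no information about the noise.
import numpy as np

rng = np.random.default_rng(7)
x = np.sort(rng.uniform(size=6))
y = rng.normal(size=6)                    # pure noise, no structure at all

coefs = np.polyfit(x, y, deg=len(x) - 1)  # 6 coefficients for 6 points
print(np.max(np.abs(np.polyval(coefs, x) - y)))  # ~0: perfect in-sample fit
```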
Too many predictors?
But what if we tried to estimate only the coefficients of predictors with strong signal?
After all…
there is not enough information to estimate all of them
if we have a lot of coefficients, some must be much less important than others
So set them to 0 (or close to it)!
(Formally, we try to get good MSE by reducing variance yet increasing bias a bit.)
Forward stepwise selection may help
Adding regressors stepwise
may be combined with different criteria
can handle data with p > n
but
one needs (in principle) to account for multiple testing
estimates may be seriously biased
… alternatives available.
More approaches to play with
Shrinkage
We fit a model involving all p predictors, but the estimated coefficients are shrunk towards zero relative to OLS estimates.
Such regularization has the effect of reducing variance (at the cost of some bias) and can also perform variable selection.
Dimension Reduction
We project the p predictors into an M-dimensional subspace, where M < p.
The M projections are used as predictors to fit a linear regression model by least squares.
Model averaging
Rather than selecting a single model, we “pool” many simpler models with suitably selected weights.
We study this under ensemble methods.
Shrinkage methods
Outline
1 Recall model selection
2 Shrinkage methods
3 Dimension reduction methods
4 Up next
Regularized (linear) regression
Make sure that some of the parameter estimates are small or zero: place a penalty on large values of the slope coefficients, add it to the sum of squared residuals (or the likelihood, or ...), then minimize/maximize to get β̂,
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \big( Y_i - X_i'\beta \big)^2 + \lambda f(\bar{\beta}).
Here, f measures the distance between 0 and the slope coefficients β̄ (the intercept does not need regularization). How this distance is measured matters.
The tuning parameter λ (shrinkage intensity) controls how close to 0 the estimates of unimportant coefficients should be.
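As a quick illustration of this penalized objective (a sketch of my own, not part of the slides), the snippet below minimizes the sum of squared residuals plus λ·f(slopes) numerically for a user-supplied penalty f; the helper name penalized_fit and the simulated data are invented for this example.

```python
# A minimal sketch: numerically minimizing RSS + lambda * f(slopes)
# for a generic (smooth) penalty f; the intercept is not penalized.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([2.0, 0.0, 0.0, -1.0, 0.0]) + rng.normal(size=n)

def penalized_fit(X, y, lam, penalty):
    """Minimize RSS + lam * penalty(slopes) over intercept and slopes."""
    n, p = X.shape
    def objective(b):
        resid = y - b[0] - X @ b[1:]
        return resid @ resid + lam * penalty(b[1:])
    b0 = np.zeros(p + 1)
    return minimize(objective, b0, method="BFGS").x

# Ridge-type penalty (sum of squared slopes) as one example choice of f.
beta_hat = penalized_fit(X, y, lam=10.0, penalty=lambda b: np.sum(b**2))
print(beta_hat)
```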
Ridge regression
Recall that the least squares fitting procedure estimates β0, β1, . . . , βp using the values that minimize
RSS = \sum_{i=1}^{n} \Big( Y_i - \beta_0 - \sum_{j=1}^{p} \beta_j X_{ij} \Big)^2
In contrast, the ridge regression coefficient estimates βˆR are the values that minimize
\sum_{i=1}^{n} \Big( Y_i - \beta_0 - \sum_{j=1}^{p} \beta_j X_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = RSS + \lambda \sum_{j=1}^{p} \beta_j^2,
where λ ≥ 0 is a tuning parameter, to be determined separately.
Ridge regression — summarized
As with least squares, ridge regression seeks coefficient estimates that fit the data well, by making the RSS small.
However, the second term, λ Σj βj², is small when β1, . . . , βp are close to zero, and so it has the effect of shrinking the estimates of βj towards zero.
The tuning parameter λ serves to control the relative impact of these two terms on the regression coefficient estimates.
Selecting a good value for λ is critical; information criteria or (more often) cross-validation are used for this.
Not to be underestimated: there’s a closed-form expression for the ridge regression estimator.
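For standardized predictors and no separately fitted intercept, that closed form is β̂^R = (X'X + λI)^{-1}X'Y. A minimal sketch (my own check, not from the slides) comparing it with scikit-learn's Ridge, whose alpha plays the role of λ here:

```python
# Sketch (assumption: no intercept, predictors already standardized):
# ridge closed form  beta_hat_R = (X'X + lambda * I)^{-1} X'y.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
n, p = 200, 10
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

lam = 5.0
beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Cross-check with scikit-learn (alpha corresponds to lambda above).
beta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
print(np.allclose(beta_closed, beta_sklearn))  # True up to solver tolerance
```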
Credit data example
Ridge regression estimators (standardized coefficients for Income, Limit, Rating and Student in the Credit data) plotted against the amount of shrinkage. Left: against λ; Right: against ‖β̂^R_λ‖₂/‖β̂‖₂, the coefficient length relative to OLS.
Ridge regression: scaling of predictors
The standard least squares coefficient estimates are scale equivariant. In other words, regardless of how the jth predictor is scaled, X_j β̂_j will remain the same.
In contrast, the ridge regression coefficient estimates can change substantially when multiplying a given predictor by a constant, due to the sum of squared coefficients term in the penalty part of the ridge regression objective function.
Therefore, it is best to apply ridge regression after standardizing the predictors, using the formula

\tilde{X}_{ij} = \frac{X_{ij}}{\sqrt{\tfrac{1}{n} \sum_{i=1}^{n} (X_{ij} - \bar{X}_j)^2}}
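A small sketch of this standardization step (my own, assuming a plain numpy array of predictors); note that the formula only rescales each column by its 1/n standard deviation, it does not center:

```python
# Sketch: rescale each predictor by its (1/n) standard deviation,
# as in the formula above, before running ridge regression.
import numpy as np

def standardize(X):
    """Divide each column by sqrt((1/n) * sum_i (X_ij - mean_j)^2)."""
    sd = np.sqrt(np.mean((X - X.mean(axis=0)) ** 2, axis=0))
    return X / sd

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3)) * np.array([1.0, 100.0, 0.01])  # very different scales
X_tilde = standardize(X)
print(X_tilde.std(axis=0))  # each column now has unit (1/n) standard deviation
```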
Why Does Ridge Regression Improve Over Least Squares?
The Bias-Variance trade-off
Simulated data with n = 50 observations, p = 45 predictors, all having nonzero coefficients. Squared bias (black), variance (green), and test mean squared error (purple) for the ridge regression predictions on a simulated data set, as a function of λ and ‖β̂^R_λ‖₂/‖β̂‖₂. The horizontal dashed lines indicate the minimum possible MSE. The purple crosses indicate the ridge regression models for which the MSE is smallest.
The Lasso
Ridge regression does have one obvious disadvantage: unlike subset selection, which will generally select models that involve just a subset of the variables, ridge regression will include all p predictors in the final model.
The lasso is an alternative to ridge regression that overcomes this disadvantage. The lasso coefficients β̂^L_λ minimize the quantity
\sum_{i=1}^{n} \Big( Y_i - \beta_0 - \sum_{j=1}^{p} \beta_j X_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| = RSS + \lambda \sum_{j=1}^{p} |\beta_j|.
The lasso uses an ℓ1 penalty instead of an ℓ2 penalty. The ℓ1 norm of a coefficient vector β is given by ||β||₁ = Σj |βj|.
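A short sketch (my own, not from the slides) of fitting the lasso with scikit-learn; its Lasso minimizes (1/(2n))·RSS + α‖β‖₁, so α is a rescaled version of the λ above:

```python
# Sketch: fitting the lasso; note scikit-learn's parametrization
# (1/(2n)) * RSS + alpha * ||beta||_1, a rescaled version of lambda.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p = 100, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]          # only three predictors matter
y = X @ beta_true + rng.normal(size=n)

fit = Lasso(alpha=0.2).fit(X, y)
print("nonzero coefficients:", np.flatnonzero(fit.coef_))
```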
The Lasso — continued
As with ridge regression, the lasso shrinks the coefficient estimates towards zero.
However, in the case of the lasso, the l1 penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently large.
Hence, much like best subset selection, the lasso performs variable selection.
We say that the lasso yields sparse models — that is, models that involve only a subset of the variables.
As in ridge regression, selecting a good value of λ for the lasso is critical.
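One way to see where the exact zeros come from (an illustration of my own, under the simplifying assumption of an orthonormal design, X'X = I): each lasso coefficient is then the soft-thresholded OLS coefficient, which is exactly zero once the OLS estimate is small enough relative to λ.

```python
# Sketch (orthonormal-design special case, my own illustration):
# with X'X = I, the lasso coefficient in the RSS + lambda*sum|beta_j|
# parametrization is beta_j^L = sign(beta_j) * max(|beta_j| - lambda/2, 0),
# i.e. exactly zero whenever |beta_j| <= lambda/2.
import numpy as np

def soft_threshold(beta_ols, lam):
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam / 2.0, 0.0)

beta_ols = np.array([4.0, 1.5, -0.4, 0.1])
print(soft_threshold(beta_ols, lam=1.0))   # small coefficients are set to 0
```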
Example: Credit dataset
(Figure: standardized lasso coefficient estimates for Income, Limit, Rating and Student in the Credit data. Left: plotted against λ; Right: plotted against ‖β̂^L_λ‖₁/‖β̂‖₁.)
The variable selection property of the Lasso
Why is it that the lasso, unlike ridge regression, results in coefficient estimates that are exactly equal to zero?
The lasso and ridge penalized sums of squares are in fact the Lagrangians of the constrained problems

\min_{\beta} \sum_{i=1}^{n} \Big( Y_i - \beta_0 - \sum_{j=1}^{p} \beta_j X_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s

and

\min_{\beta} \sum_{i=1}^{n} \Big( Y_i - \beta_0 - \sum_{j=1}^{p} \beta_j X_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le s,

respectively.
The Lasso Picture
The Lasso vs. Ridge picture and animation
Comparing the Lasso and Ridge Regression
Left: Plots of squared bias (black), variance (green), and test MSE (purple) for the lasso on a simulated data set. Right: Comparison of squared bias, variance and test MSE between lasso (solid) and ridge (dashed). Both are plotted against their R² on the training data as a common scale. The crosses in both plots indicate the lasso model for which the MSE is smallest.
Comparing the Lasso and Ridge Regression — continued
The simulated data are similar to the previous one, except that now only two predictors are related to the response.
Left: Plots of squared bias (black), variance (green), and test MSE (purple) for the lasso. Right: Comparison of squared bias, variance and test MSE between lasso (solid) and ridge (dashed). Both are plotted against their R² on the training data. The crosses in both plots indicate the lasso model for which the MSE is smallest.
Conclusions
These two examples illustrate that neither ridge regression nor the lasso will universally dominate the other. An extension, the elastic net, combines both penalty terms for this reason.
Still, the lasso and its descendants are more popular in practice.
In general, one might expect the lasso to perform better when the response is a function of only a relatively small number of predictors.
However, the number of predictors that is related to the response is never known a priori for real data sets.
A technique such as cross-validation can be used in order to determine which approach is better on a particular data set.
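For completeness, a sketch (not from the slides) of the elastic net mentioned above, using scikit-learn's ElasticNet; its objective is (1/(2n))·RSS + α·l1_ratio·‖β‖₁ + ½·α·(1−l1_ratio)·‖β‖₂², so α and l1_ratio jointly control the two penalties:

```python
# Sketch: the elastic net combines the l1 and l2 penalties.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

fit = ElasticNet(alpha=0.3, l1_ratio=0.5).fit(X, y)  # half l1, half l2
print(fit.coef_)
```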
Tuning Parameters for Ridge Regression and the Lasso
As for subset selection, for ridge regression and lasso we require a method to determine which of the models under consideration is best.
That is, we require a method for selecting a value of the tuning parameter λ or, equivalently, of the constraint s.
Cross-validation provides a simple way to tackle this problem: we choose a grid of λ values and compute the cross-validation error for each value of λ.
We then select the tuning parameter value for which the cross-validation error is smallest.
Finally, the model is re-fit using all of the available observations and the selected value of the tuning parameter.
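A sketch of this tuning procedure (my own, using scikit-learn with the lasso as the example; the grid and data are invented): compute the 10-fold cross-validation MSE over a grid of penalty values, pick the minimizer, and refit on all observations.

```python
# Sketch of the procedure above: grid of penalty values, CV error for each,
# keep the best value, then refit on the full sample.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 15))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

grid = np.logspace(-2, 1, 30)                      # grid of penalty values
cv_mse = [
    -cross_val_score(Lasso(alpha=a), X, y, cv=10,
                     scoring="neg_mean_squared_error").mean()
    for a in grid
]
best = grid[int(np.argmin(cv_mse))]
final_fit = Lasso(alpha=best).fit(X, y)            # refit on all observations
print(best, np.flatnonzero(final_fit.coef_))
```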
Credit data example
Left: Cross-validation errors that result from applying ridge regression to the Credit data set with various values of λ.
Right: The coefficient estimates as a function of λ. The vertical dashed lines indicate the value of λ selected by cross-validation.
Simulated data example
Left: Ten-fold cross-validation MSE for the lasso, applied to the sparse simulated data set.
Right: The corresponding lasso coefficient estimates. The vertical dashed lines indicate the lasso fit for which the cross-validation error is smallest.
Dimension reduction methods
Outline
1 Recall model selection
2 Shrinkage methods
3 Dimension reduction methods
4 Up next
Summarize regressors?
The methods that we have discussed so far in this chapter have involved fitting linear regression models, via
least squares or
a shrinkage approach,
using the original predictors X1, X2, . . . , Xp.
We now explore a class of approaches that transform the predictors
and then fit a least squares model using the transformed variables. This is a particular case of so-called “feature generation”.
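A sketch of the general recipe (my own illustration; taking the transformed variables to be principal components is an assumption here, as the construction of the new predictors is detailed on the following slides): build M < p transformed features from X1, . . . , Xp and run least squares on them.

```python
# Sketch: build M < p transformed features Z_1,...,Z_M, then fit OLS on them.
# Here the Z_m are principal components, one common choice (an illustration,
# not necessarily the construction used on the following slides).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 20))
y = X[:, :5].sum(axis=1) + rng.normal(size=100)

M = 5  # number of transformed predictors, a tuning choice
fit = make_pipeline(StandardScaler(), PCA(n_components=M), LinearRegression())
fit.fit(X, y)
print(fit.score(X, y))  # in-sample R^2 of the M-component fit
```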
Dimension Reduction Methods: details
Let Z1, Z2, . . . , ZM represent M < p linear combinations of the original p predictors.