
Linear Model Selection and Regularization
Chapter 6
March 18, 2020
1

1 Subset selection
2 Shrinkage methods
3 High-dimensional data analysis
2

About this chapter
• The linear model, Y = β0 + β1X1 + … + βpXp + ε, was already addressed in detail in Chapter 3.
• Model assessment: cross-validation (prediction) error was covered in Chapter 5.
• This chapter is about model selection for linear models.
• The model selection techniques can be extended beyond linear models.
3

Feature/variable selection
• Not all existing input variables are useful for predicting the output.
• Keeping redundant inputs in the model can lead to poor prediction and poor interpretation.
• We consider two ways of variable/model selection:
1. Subset selection.
2. Shrinkage/regularization: shrinking the regression coefficients toward 0 (some may be set exactly to 0).
4

Best subset selection
• Exhaust all possible combinations of inputs.
• With p variables, there are 2^p distinct combinations.
• Identify the best model among these models.
5

The algorithm of best subset selection
• Step 1. Let M0 be the null model, Y = β0 + ε. Its prediction is the sample mean of the response.
• Step 2. For k = 1, 2, …, p:
Fit all $\binom{p}{k} = \frac{p!}{k!(p-k)!}$ models that contain exactly k predictors.
Pick the best model among them, namely the one with the largest R² (equivalently, the smallest RSS), and call it Mk.
• Step 3. Select a single best model from M0, …, Mp by cross-validation, AIC, or BIC.
6

Comments
• Step 2 reduces the 2^p candidates to p + 1 models, using training error.
• Step 3 identifies the best among these p + 1 models using prediction error.
• Why use R² in Step 2: the models compared there have the same complexity, and RSS is easy to compute.
• Training error cannot be used in Step 3, because it always favors the largest model. (A minimal code sketch of the whole procedure follows.)
7
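To make the procedure concrete, here is a minimal Python sketch of best subset selection (not the textbook's code). It assumes a numeric design matrix `X` of shape n × p and a response vector `y`; within each size k the models are ranked by RSS, and the final comparison across sizes uses a Gaussian-likelihood BIC with constants dropped.

```python
# Minimal sketch of best subset selection (X: n x p array and y: length-n array assumed given).
import itertools
import numpy as np
from sklearn.linear_model import LinearRegression

def best_subset(X, y):
    n, p = X.shape
    best_per_size = {}                                      # k -> (RSS, chosen column indices)
    best_per_size[0] = (np.sum((y - y.mean()) ** 2), [])    # Step 1: null model M0
    for k in range(1, p + 1):                               # Step 2: best model of each size k
        best_rss, best_cols = np.inf, None
        for subset in itertools.combinations(range(p), k):
            cols = list(subset)
            fit = LinearRegression().fit(X[:, cols], y)
            rss = np.sum((y - fit.predict(X[:, cols])) ** 2)
            if rss < best_rss:                              # smallest RSS = largest R^2
                best_rss, best_cols = rss, cols
        best_per_size[k] = (best_rss, best_cols)

    # Step 3: compare M0, ..., Mp with a Gaussian-likelihood BIC (constants dropped);
    # d = k + 1 counts the intercept plus the k slopes.
    def bic(rss, k):
        return n * np.log(rss / n) + np.log(n) * (k + 1)

    best_k = min(best_per_size, key=lambda k: bic(best_per_size[k][0], k))
    return best_k, best_per_size[best_k][1]
```

In practice this is only feasible for small p, since the inner loop visits all 2^p subsets.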

Recall: Definitions
• Residual:
$$\hat\varepsilon_i = y_i - \hat\beta_0 - \sum_{j=1}^{p} \hat\beta_j x_{ij}$$
• Residual sum of squares:
$$\mathrm{RSS} = \sum_{i=1}^{n} \hat\varepsilon_i^2 = \sum_{i=1}^{n}\Big(y_i - \hat\beta_0 - \sum_{j=1}^{p}\hat\beta_j x_{ij}\Big)^2$$
• R-squared:
$$R^2 = \frac{\mathrm{SS}_{\mathrm{reg}}}{\mathrm{SS}_{\mathrm{total}}} = 1 - \frac{\mathrm{SS}_{\mathrm{error}}}{\mathrm{SS}_{\mathrm{total}}}$$
where $\mathrm{SS}_{\mathrm{error}} = \mathrm{RSS} = \sum_{i=1}^{n}\hat\varepsilon_i^2$ and $\mathrm{SS}_{\mathrm{total}} = \sum_{i=1}^{n}(y_i - \bar y)^2$.
8

Example: Credit data

Figure: 6.1. For each possible model containing a subset of the ten predictors in the Credit data set, the RSS (left panel) and R² (right panel) are displayed against the number of predictors. The red frontier tracks the best model for a given number of predictors, according to RSS and R². Though the data set contains only ten predictors, the x-axis ranges from 1 to 11, since one of the variables is categorical and takes on three values, leading to the creation of two dummy variables.
9

The issues of R-squared
• The R-squared is the percentage of the total variation in response due to the inputs.
• The R-squared reflects the training error.
• However, a model with larger R-squared is not necessarily better than another model with smaller R-squared when we consider test error!
• If model A contains all the inputs of model B, then model A's R-squared will always be at least as large as that of model B.
• If model A's additional inputs are entirely uncorrelated with the response, model A contains more noise than model B. As a result, predictions based on model A will be no better, and typically poorer.
10

Adjusted R-squared.
• The adjusted R-squared, which takes the degrees of freedom into account, is defined as
$$\text{adjusted } R^2 = 1 - \frac{\mathrm{MS}_{\mathrm{error}}}{\mathrm{MS}_{\mathrm{total}}} = 1 - \frac{\mathrm{SS}_{\mathrm{error}}/(n-p-1)}{\mathrm{SS}_{\mathrm{total}}/(n-1)} = 1 - \frac{s^2}{\sum_{i=1}^{n}(y_i-\bar y)^2/(n-1)}$$
where $s^2 = \mathrm{SS}_{\mathrm{error}}/(n-p-1)$. With more inputs, R² always increases, but the adjusted R² can decrease, since irrelevant inputs are penalized through the smaller residual degrees of freedom.
• The adjusted R-squared is preferred over the R-squared for evaluating models.
11

Commonly used criteria for model selection
• Akaike information criterion (AIC):
AIC = −2 log l(θˆ) + 2d
where θ̂ is the maximum likelihood estimate, l(θ̂) is the likelihood function evaluated at θ̂, and d is the number of estimated parameters in the model.
• Sometimes −2 log l(θ̂) is called the deviance.
• Bayesian information criterion (BIC):
BIC = −2 log l(θˆ) + log(n)d
12

Criteria given in the reference book
• AIC as given in the reference book:
$$\mathrm{AIC} = \frac{1}{n\hat\sigma^2}\big(\mathrm{RSS} + 2d\hat\sigma^2\big) = \frac{1}{n}\big(-2\log l(\hat\theta) + 2d\big)$$
For linear regression the model parameters are $\hat\theta = \{\hat\beta, \hat\sigma^2\}$, where $\hat\beta$ is the vector of fitted regression coefficients.
• BIC in the reference book (7th printing):
$$\mathrm{BIC} = \frac{1}{n\hat\sigma^2}\big(\mathrm{RSS} + \log(n)\,d\,\hat\sigma^2\big) = \frac{1}{n}\big(-2\log l(\hat\theta) + \log(n)\,d\big)$$
where the factor $\hat\sigma^2$ was missing in the first six printings.
• Mallows' Cp is equivalent to AIC in the special case of Gaussian linear regression. (A sketch computing these criteria for an OLS fit follows.)
13
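As a worked illustration (not from the book), the sketch below computes AIC, BIC, and adjusted R² for a single OLS fit, using the Gaussian log-likelihood with σ² estimated by RSS/n. This is the standard unscaled form of AIC/BIC rather than the book's σ̂²-scaled version, but it is used the same way: smaller AIC/BIC is better, larger adjusted R² is better. `X` and `y` are assumed given.

```python
# Sketch: AIC, BIC, and adjusted R^2 for an OLS fit (X: n x p, y: length n, assumed given).
import numpy as np
from sklearn.linear_model import LinearRegression

def selection_criteria(X, y):
    n, p = X.shape
    fit = LinearRegression().fit(X, y)
    rss = np.sum((y - fit.predict(X)) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    d = p + 1                                            # intercept + p slopes
    neg2loglik = n * np.log(2 * np.pi * rss / n) + n     # Gaussian -2 log-likelihood, sigma^2 = RSS/n
    return {
        "AIC": neg2loglik + 2 * d,
        "BIC": neg2loglik + np.log(n) * d,
        "adjusted R2": 1 - (rss / (n - p - 1)) / (tss / (n - 1)),
    }
```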

Pros and Cons of best subset selection
• Seems straightforward to carry out.
• Conceptually clear.
• The search space is too large (2^p models), which may lead to overfitting.
• Computationally infeasible: too many models to fit.
• If p = 20, there are 2^20 > 1,000,000 models.
14

Forward stepwise selection
• Start with the null model.
• Find the best one-variable model.
• Given the best one-variable model, add one more variable to get the best two-variable model.
• Given the best two-variable model, add one more variable to get the best three-variable model.
• ….
• Find the best among all these best k-variable models.
15

The algorithm of forward stepwise selection
• Step 1. Let M0 be the null model, Y = β0 + ε. Its prediction is the sample mean of the response.
• Step 2. For k = 0, 1, …, p − 1:
Consider all p − k models that augment the predictors in Mk with one additional predictor.
Choose the best among these p − k models, and call it Mk+1. Here best is defined as having the smallest RSS or, equivalently, the highest R².
• Step 3. Select a single best model from M0, …, Mp by cross-validation, AIC, BIC, Cp, or adjusted R². (A code sketch of this loop follows.)
16
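A minimal Python sketch of the forward stepwise loop above, again assuming a design matrix `X` and response `y`; scikit-learn's `SequentialFeatureSelector` offers a similar, more general facility.

```python
# Forward stepwise selection sketch (X: n x p array, y: length-n array, assumed given).
import numpy as np
from sklearn.linear_model import LinearRegression

def forward_stepwise(X, y):
    n, p = X.shape
    selected, remaining = [], list(range(p))
    path = [list(selected)]                      # M0, M1, ..., Mp as lists of column indices
    rss_path = [np.sum((y - y.mean()) ** 2)]     # RSS of the null model
    for _ in range(p):
        best_rss, best_j = np.inf, None
        for j in remaining:                      # try adding each remaining predictor
            cols = selected + [j]
            fit = LinearRegression().fit(X[:, cols], y)
            rss = np.sum((y - fit.predict(X[:, cols])) ** 2)
            if rss < best_rss:
                best_rss, best_j = rss, j
        selected.append(best_j)
        remaining.remove(best_j)
        path.append(list(selected))
        rss_path.append(best_rss)
    return path, rss_path                        # pick among M0..Mp by CV, AIC, or BIC
```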

Pros and Cons of forward stepwise selection
• Less computation.
• Fewer models: only $1+\sum_{k=0}^{p-1}(p-k) = 1 + p(p+1)/2$ models in total.
• If p = 20, only 211 models, compared with more than 1 million for best subset selection.
• Applicable even when p > n (though only the models M0, …, Mn−1 can be constructed).
• Once an input is in, it does not get out.
17

Example: credit dataset

Variables   Best subset                     Forward stepwise
one         rating                          rating
two         rating, income                  rating, income
three       rating, income, student         rating, income, student
four        cards, income, student, limit   rating, income, student, limit

TABLE 6.1. The first four selected models for best subset selection and forward stepwise selection on the Credit data set. The first three models are identical but the fourth models differ.
18

Backward stepwise selection
• Start with the largest model (all p inputs in).
• Find the best (p − 1)-variable model by removing one variable from the largest model.
• Find the best (p − 2)-variable model by removing one variable from the best (p − 1)-variable model.
• Find the best (p − 3)-variable model by removing one variable from the best (p − 2)-variable model.
• …
• Find the best 1-variable model by removing one variable from the best 2-variable model.
• The null model.
19

The algorithm of backward stepwise selection
• Step 1. Let Mp be the full model.
• Step 2. For k = p, p − 1, …, 1,
Consider all k models that contain all but one of the predictors in Mk for a total of k − 1 predictors
Choose the best among these k models, and call it Mk−1. Here best is defined as having smallest RSS or highest R2.
• Step 3. Select a single best model from M0, …, Mp by cross validation or AIC or BIC or Cp or adjusted R2.
20

Pros and Cons of backward stepwise selection
• Less computation.
• Fewer models: only 1 + p(p + 1)/2 models in total.
• If p = 20, only 211 models, compared with more than 1 million for best subset selection.
• Once an input is out, it does not get back in.
• Not applicable when p > n.
21

Find the best model based on prediction error.
• Validation/cross-validation approach (addressed in Chapter 5).
• Use Adjusted R2, AIC, BIC or Cp.
22

Example

Figure: 6.2. Cp, BIC, and adjusted R² are shown for the best models of each size for the Credit data set (the lower frontier in Figure 6.1), plotted against the number of predictors. Cp and BIC are estimates of test MSE. In the middle plot we see that the BIC estimate of test error shows an increase after four variables are selected. The other two plots are rather flat after four variables are included.
23

Example

Figure: 6.3. For the Credit data set, three quantities are displayed for the best model containing d predictors, for d ranging from 1 to 11. The overall best model, based on each of these quantities, is shown as a blue cross. Left: Square root of BIC. Center: Validation set errors (75% training data). Right: 10-fold cross-validation errors.
24

The one standard deviation rule
• In the figure above, the model with 6 inputs does not seem to be much better than the models with 3 or 4 inputs.
• Keep Occam's razor in mind: choose the simplest model when the candidates are similar by the other criteria.
25

The one standard deviation rule
• Calculate the standard error of the estimated test MSE for each model size.
• Consider the models whose estimated test MSE is within one standard error of the smallest test MSE.
• Among them, select the one with the smallest model size.
• (Applying this rule to the example in Figure 6.3 gives the model with 3 variables. A minimal sketch of the rule follows.)
26
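A small sketch of the one-standard-error rule, assuming arrays `cv_mse` and `cv_se` holding the cross-validated MSE and its standard error for model sizes 1, …, p (these names are illustrative, not from the text).

```python
# One-standard-error rule sketch: cv_mse[k] and cv_se[k] are the estimated test MSE and
# its standard error for the model with k+1 predictors (both arrays assumed given).
import numpy as np

def one_se_rule(cv_mse, cv_se):
    cv_mse, cv_se = np.asarray(cv_mse), np.asarray(cv_se)
    k_min = np.argmin(cv_mse)                          # model with the smallest CV error
    threshold = cv_mse[k_min] + cv_se[k_min]           # one standard error above the minimum
    candidates = np.where(cv_mse <= threshold)[0]      # all models within one SE
    return candidates.min() + 1                        # smallest such model size
```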

Ridge Regression
• The least squares estimator β̂ minimizes
$$\mathrm{RSS} = \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2$$
• The ridge regression estimator $\hat\beta_\lambda^R$ minimizes
$$\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2$$
where λ ≥ 0 is a tuning parameter.
• The first term measures goodness of fit; the smaller, the better.
• The second term, $\lambda\sum_{j=1}^{p}\beta_j^2$, is called the shrinkage penalty, which shrinks the βj towards 0.
• The shrinkage reduces variance (at the cost of increased bias)!
27

Tuning parameter λ.
• λ = 0: no penalty, $\hat\beta_0^R = \hat\beta$ (the least squares estimate).
• λ → ∞: infinite penalty, $\hat\beta_\infty^R \to 0$.
• Large λ: heavy penalty, more shrinkage of the estimator.
• Note that β0 is not penalized.
28

Standardize the inputs.
• For the j-th input Xj with observations (x1j, …, xnj), standardize it as
$$\tilde x_{ij} = \frac{x_{ij} - \bar x_j}{\sqrt{\tfrac{1}{n}\sum_{i=1}^{n}(x_{ij} - \bar x_j)^2}}$$
to remove the scale of Xj.
• It is suggested to standardize the inputs before applying ridge regression (as in the sketch below).
• Least squares is unaffected by the scale of Xj (i.e., $X_j\hat\beta_j = (cX_j)(\hat\beta_j/c)$).
• Ridge is affected by the scale of the inputs as well as by λ.
29
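A sketch of this workflow: standardize the inputs, then compute ridge coefficients over a grid of tuning parameters. It uses scikit-learn, whose `Ridge` estimator calls the tuning parameter `alpha` rather than λ and leaves the intercept unpenalized by default. `X` and `y` are assumed given.

```python
# Sketch: standardize the inputs, then compute ridge coefficients over a grid of
# tuning parameters (scikit-learn's Ridge uses `alpha` for what the text calls lambda).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

def ridge_path(X, y, lambdas=np.logspace(-2, 4, 100)):
    X_std = StandardScaler().fit_transform(X)      # mean 0, variance 1 for each column
    coefs = []
    for lam in lambdas:
        fit = Ridge(alpha=lam).fit(X_std, y)       # intercept is fit but not penalized
        coefs.append(fit.coef_)
    return np.array(coefs)                         # one row of p coefficients per lambda
```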

Example

Figure: 6.4. The standardized ridge regression coefficients for the Credit data set (Income, Limit, Rating, and Student) are displayed as a function of λ (left) and of $\|\hat\beta_\lambda^R\|_2/\|\hat\beta\|_2$ (right). Here $\|a\|_2 = \sqrt{\sum_{j=1}^{p} a_j^2}$.
30

Bias-variance tradeoff (why ridge improves over LSE)

Figure: 6.5. Simulated data (p = 45, n = 50). Squared bias (black), variance (green), and test mean squared error (purple) for the ridge regression predictions, as a function of λ (left) and of $\|\hat\beta_\lambda^R\|_2/\|\hat\beta\|_2$ (right). The horizontal dashed lines indicate the minimum possible MSE. The purple crosses indicate the ridge regression models for which the MSE is smallest.
31

Remark.
• Suppose the relationship between the response and the predictors is close to linear.
• Then the least squares estimates will have low bias.
• But they may have high variance when p is relatively large: a small change in the training data can cause a large change in the least squares coefficient estimates.
• For large p, as in the example in Figure 6.5, the least squares estimates will be extremely variable.
• If p > n, the least squares estimates do not even have a unique solution.
32

Remark.
• If p > n, ridge regression can still perform well by trading off a small increase in bias for a large decrease in variance.
• Ridge regression works best in situations where the least squares estimates have high variance.
• Ridge regression also has substantial computational advantages: it has the closed form
$$\hat\beta_\lambda^R = (X^T X + \lambda I)^{-1} X^T y$$
where I is the (p + 1) × (p + 1) diagonal matrix with diagonal elements (0, 1, 1, …, 1), so the intercept is not penalized. (A numpy sketch of this formula follows.)
33
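The closed form above is easy to implement directly; here is a numpy sketch, with a column of ones prepended for the intercept and the corresponding diagonal entry of I set to 0 so the intercept is not penalized, as on the slide.

```python
# Closed-form ridge regression: beta_hat = (X^T X + lambda*I)^{-1} X^T y, with the
# intercept column unpenalized (first diagonal entry of I set to 0).  X, y assumed given.
import numpy as np

def ridge_closed_form(X, y, lam):
    n, p = X.shape
    Xa = np.column_stack([np.ones(n), X])      # prepend intercept column
    I = np.eye(p + 1)
    I[0, 0] = 0.0                              # do not penalize the intercept
    beta = np.linalg.solve(Xa.T @ Xa + lam * I, Xa.T @ y)
    return beta                                # beta[0] is the intercept
```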

The Lasso
• Lasso stands for Least Absolute Shrinkage and Selection Operator.
• The lasso estimator $\hat\beta_\lambda^L$ minimizes
$$\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j|$$
• The penalty uses $\|\beta\|_1 = \sum_{j=1}^{p}|\beta_j|$, the $\ell_1$ norm.
• The lasso often shrinks coefficients to be identically 0 (this is not the case for ridge), as in the sketch below.
• Hence it performs variable selection and yields sparse models.
34
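A small simulated illustration (not from the book) of the sparsity property: on the same data the lasso sets many coefficients exactly to zero while ridge does not. Note that scikit-learn scales the two penalties differently, so the `alpha` values below are not directly comparable.

```python
# Sketch: on simulated data, the lasso sets some coefficients exactly to zero,
# while ridge only shrinks them.  (scikit-learn scales the penalties differently:
# Lasso minimizes (1/2n)*RSS + alpha*||beta||_1, Ridge minimizes RSS + alpha*||beta||_2^2.)
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]                 # only 3 of the 20 predictors matter
y = X @ beta_true + rng.standard_normal(n)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)
print("lasso zero coefficients:", int(np.sum(lasso.coef_ == 0)))  # typically most of the 17 noise terms
print("ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)))  # typically 0: ridge does not zero out
```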

Example: Credit data.

Figure: 6.6. The standardized lasso coefficients on the Credit data set (Income, Limit, Rating, and Student) are shown as a function of λ (left) and of $\|\hat\beta_\lambda^L\|_1/\|\hat\beta\|_1$ (right).
35

Another formulation
• For the lasso: minimize
$$\sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 \quad\text{subject to}\quad \sum_{j=1}^{p}|\beta_j|\le s$$
• For ridge: minimize
$$\sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 \quad\text{subject to}\quad \sum_{j=1}^{p}\beta_j^2\le s$$
• For $\ell_0$: minimize
$$\sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 \quad\text{subject to}\quad \sum_{j=1}^{p} I(\beta_j\ne 0)\le s$$
The $\ell_0$ method penalizes the number of non-zero coefficients; it is a difficult (NP-hard) optimization problem.
36

Variable selection property for Lasso
Figure: 6.7. Contours of the error and constraint functions for the lasso (left) and ridge regression (right). The solid blue areas are the constraint regions, $|\beta_1| + |\beta_2| \le s$ and $\beta_1^2 + \beta_2^2 \le s$, while the red ellipses are the contours of the RSS.
37

Simulated data as in Figure 6.5 for comparing Lasso and Ridge

Figure: 6.8. Left: Plots of squared bias (black), variance (green), and test MSE (purple) for the lasso on a simulated data set, as a function of λ. Right: Comparison of squared bias, variance, and test MSE between lasso (solid) and ridge (dotted). Both are plotted against their R² on the training data, as a common form of indexing. The crosses in both plots indicate the lasso model for which the MSE is smallest.
38

The Lasso
• In the previous example, ridge is slightly better than the lasso.
• The data in Figure 6.8 were generated in such a way that all 45 predictors were related to the response.
• None of the true coefficients β1, …, β45 equaled zero.
• The lasso implicitly assumes that a number of the coefficients truly equal zero.
• So it is not surprising that ridge regression outperforms the lasso in terms of prediction error in this setting.
• Yet the lasso model is sparse, hence more interpretable than the ridge model.
• In the next figure, only 2 out of 45 predictors are related to the response.
39

Comparing Lasso and ridge, different data

Figure: 6.9. Left: Plots of squared bias (black), variance (green), and test MSE (purple) for the lasso, as a function of λ. The simulated data are similar to those in Figure 6.8, except that now only two predictors are related to the response. Right: Comparison of squared bias, variance, and test MSE between lasso (solid) and ridge (dotted), plotted against their R² on the training data. The crosses in both plots indicate the lasso model for which the MSE is smallest.
40

Summary remark
• Both ridge and the lasso can improve over ordinary least squares by trading off bias against variance.
• The improvement is most significant when the variance of the least squares estimates is large, typically with small n and large p.
• The lasso performs feature selection, while ridge does not.
• Use cross-validation to determine which one predicts better.
41

Simple cases
• Ridge has a closed-form solution; the lasso generally does not.
• Consider the simple model yi = βi + εi, i = 1, …, n, with n = p. Then
the least squares estimate is $\hat\beta_j = y_j$; the ridge estimate is $\hat\beta_j^R = y_j/(1+\lambda)$;
the lasso estimate is $\hat\beta_j^L = \mathrm{sign}(y_j)\,(|y_j| - \lambda/2)_+$.
• Slightly more generally, suppose the input columns of X are standardized to have mean 0 and variance 1 and are orthogonal. Then
$$\hat\beta_j^R = \hat\beta_j^{LSE}/(1+\lambda), \qquad \hat\beta_j^L = \mathrm{sign}(\hat\beta_j^{LSE})\big(|\hat\beta_j^{LSE}| - \lambda/2\big)_+$$
for j = 1, …, p. (A sketch of these formulas follows.)
42
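A sketch of these closed forms in the simple n = p identity-design setting of Figure 6.10: ridge shrinks every least-squares estimate proportionally, while the lasso soft-thresholds it.

```python
# Sketch of the closed-form solutions in the simple setting y_i = beta_i + eps_i (n = p):
# ridge shrinks each least-squares estimate proportionally, the lasso soft-thresholds it.
import numpy as np

def ridge_simple(y, lam):
    return y / (1.0 + lam)                                         # proportional shrinkage toward 0

def lasso_simple(y, lam):
    return np.sign(y) * np.maximum(np.abs(y) - lam / 2.0, 0.0)     # soft-thresholding

y = np.array([-1.5, -0.3, 0.2, 1.0])
print(ridge_simple(y, lam=1.0))        # [-0.75 -0.15  0.1   0.5 ]
print(lasso_simple(y, lam=1.0))        # [-1.   -0.    0.    0.5 ]  (small estimates set to 0)
```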

Example

Figure: 6.10. The ridge regression and lasso coefficient estimates for a simple setting with n = p and X a diagonal matrix with 1s on the diagonal, plotted against the least squares estimates yj. Left: The ridge regression coefficient estimates are shrunken proportionally towards zero, relative to the least squares estimates. Right: The lasso coefficient estimates are soft-thresholded towards zero.
43

Tuning parameter selection by cross-validation: Credit data

Figure: 6.12. Left: Cross-validation errors that result from applying ridge regression to the Credit data set with various values of λ. Right: The coefficient estimates as a function of λ. The vertical dashed lines indicate the value of λ selected by cross-validation.
44

Example

Figure: 6.13. Left: Ten-fold cross-validation MSE for the lasso, applied to the sparse simulated data set from Figure 6.9, as a function of $\|\hat\beta_\lambda^L\|_1/\|\hat\beta\|_1$. Right: The corresponding lasso coefficient estimates. The vertical dashed lines indicate the lasso fit for which the cross-validation error is smallest.
45
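In practice λ is chosen by cross-validation as in Figures 6.12 and 6.13. A sketch using scikit-learn's built-in cross-validated estimators on simulated data (the data-generating choices here are illustrative only):

```python
# Sketch: choose lambda (alpha in scikit-learn) by cross-validation on simulated data.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV, LassoCV

rng = np.random.default_rng(1)
n, p = 100, 10
X = rng.standard_normal((n, p))
y = X[:, 0] - 2 * X[:, 1] + rng.standard_normal(n)

X_std = StandardScaler().fit_transform(X)                           # standardize before penalized fits
ridge_cv = RidgeCV(alphas=np.logspace(-3, 3, 100)).fit(X_std, y)    # efficient leave-one-out CV by default
lasso_cv = LassoCV(cv=10).fit(X_std, y)                             # 10-fold CV over an automatic alpha grid
print("ridge lambda:", ridge_cv.alpha_)
print("lasso lambda:", lasso_cv.alpha_, "nonzero coefs:", int(np.sum(lasso_cv.coef_ != 0)))
```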

General remark
• Digitization of society brings big data.
• Many datasets contain a large number of variables.
• It is common that p ≫ n.
• Example: prediction of blood pressure.
Response: blood pressure.
Inputs: SNPs (individual DNA mutations).
n may be in the hundreds, but p can be in the millions.
46

The trouble
• Large p makes the linear regression model too flexible (or too large).
• It can easily lead to overfitting.
• If p > n, the LSE is not even uniquely determined.
• A common phenomenon: small training error, but large test error.
47

Example

Figure: 6.22. Least squares regression of Y on X. Left: the low-dimensional setting. Right: least squares regression with n = 2 observations and two parameters to be estimated (an intercept and a coefficient).
48

Example

Figure: 6.23. On a simulated example with n = 20 training observations, features that are completely unrelated to the outcome are added to the model. Left: The R² increases to 1 as more features are included. Center: The training set MSE decreases to 0 as more features are included. Right: The test set MSE increases as more features are included.
49

Deal with high dimensional data
• Fit less flexible models to avoid overfitting.
• Use forward stepwise selection, ridge regression, the lasso, or principal components regression.
• Regularization or shrinkage plays an important role.
• The choice of tuning parameter is crucial. (A sketch of a p ≫ n fit follows.)
50
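A sketch of the p ≫ n situation (simulated setup, illustrative only): least squares attains essentially zero training error but a large test error, while a cross-validated lasso regularizes and predicts much better.

```python
# Sketch: p >> n.  Least squares fits the training data almost perfectly but predicts poorly;
# the lasso, with lambda chosen by cross-validation, regularizes and predicts better.
# Simulated setup: only the first 5 of 500 features are related to the response.
import numpy as np
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
n_train, n_test, p = 50, 200, 500
beta = np.zeros(p)
beta[:5] = 2.0
X_tr, X_te = rng.standard_normal((n_train, p)), rng.standard_normal((n_test, p))
y_tr = X_tr @ beta + rng.standard_normal(n_train)
y_te = X_te @ beta + rng.standard_normal(n_test)

ols = LinearRegression().fit(X_tr, y_tr)      # not unique when p > n; scikit-learn returns one solution
lasso = LassoCV(cv=5, max_iter=5000).fit(X_tr, y_tr)
print("OLS   train/test MSE:", mean_squared_error(y_tr, ols.predict(X_tr)),
      mean_squared_error(y_te, ols.predict(X_te)))
print("lasso train/test MSE:", mean_squared_error(y_tr, lasso.predict(X_tr)),
      mean_squared_error(y_te, lasso.predict(X_te)))
```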

Example for curse of dimensionality
Figure: 6.24. Test MSE of the lasso versus degrees of freedom, for p = 20, p = 50, and p = 2000 (caption on the next slide).
51

Figure 6.24. The lasso was performed with n = 100 observations and three values of p, the number of features. Of the p features, 20 were associated with the response. The boxplots show the test MSEs that result from using three different values of the tuning parameter λ in (6.7). For ease of interpretation, rather than reporting λ, the degrees of freedom are reported; for the lasso this turns out to be simply the number of estimated non-zero coefficients. When p = 20, the lowest test MSE was obtained with the smallest amount of regularization. When p = 50, the lowest test MSE was achieved when there was a substantial amount of regularization. When p = 2,000 the lasso performed poorly regardless of the amount of regularization, because only 20 of the 2,000 features are truly associated with the outcome.
52

Caution when p > n.
• Extreme multicollinearity.
• Refrain from over-statement. (What we find may be only one of many possible models.)
• Avoid using sums of squares, p-values, R², or other traditional measures of fit computed on the training data as evidence of a good fit.
• Place more emphasis on test error or cross validation error.
53