Linear Model Selection and Regularization
Dr. Lan Du
Faculty of Information Technology, Monash University, Australia
FIT5149 week 6
Outline
1 Subset Selection Methods
    Best Subset Selection
    Stepwise Selection
2 Shrinkage Methods
    Ridge regression
    The Lasso
    Elastic net
    Group Lasso
3 Summary
Improve linear model fitting procedure
Recall the linear model
$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$$
The linear model has distinct advantages in terms of interpretability and often shows good predictive performance, provided its assumptions are satisfied.
Improve the simple linear model:
Replace ordinary least squares fitting with alternative fitting procedures that can yield better prediction accuracy and model interpretability.
Why consider alternatives to least squares?
Prediction Accuracy: especially when p > n, to control the variance.
If the true relationship between Y and X is approximately linear ⇒ low bias
If n ≫ p ⇒ low variance
If n ≈ p ⇒ high variance, overfitting, and poor prediction
If n < p ⇒ infinite variance and no unique OLS coefficient estimate
Model Interpretability:
When we have a large number of variables X, there will generally be some or many that are not associated with the response Y.
Including irrelevant variables leads to unnecessary complexity in the model.
Removing irrelevant variables, by setting their coefficients to 0, increases the interpretability of the resulting model.
Solution: feature (variable) selection.
Three classes of selection methods
Subset Selection: Identifying a subset of the p predictors X that we believe to be related to the response Y, and then fitting a model using least squares on the reduced set of variables.
Best subset selection
Forward/Backward stepwise selection
Hybrid selection
Shrinkage, also known as regularisation
The estimated coefficients are shrunken towards zero relative to the least
squares estimates.
The shrinkage has the effect of reducing variance.
The shrinkage can also perform variable selection.
− Ridge regression: L2 regularisation
− The Lasso: L1 regularisation
− Elastic Net: the mixture of L1 and L2
− Group Lasso
Dimension Reduction: Involves projecting all p predictors into an M-dimensional space where M < p, and then fitting a regression model on the projected variables.
e.g., Principal Component Analysis (a minimal sketch follows below)
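As a concrete illustration of the dimension-reduction idea, here is a minimal principal components regression sketch in Python. It assumes scikit-learn and NumPy are available, and the simulated X and y are purely illustrative; it is a sketch of the idea, not part of the original slides.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                  # p = 10 predictors (simulated)
y = X[:, 0] - X[:, 1] + rng.normal(size=100)

# Project the predictors onto M = 3 principal components, then fit least squares on them.
pcr = make_pipeline(PCA(n_components=3), LinearRegression()).fit(X, y)
print(pcr.score(X, y))                          # training R^2 of the reduced model
```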
Subset Selection Methods
Outline
1 Subset Selection Methods
    Best Subset Selection
    Stepwise Selection
2 Shrinkage Methods
3 Summary
Best Subset Selection

The problem of selecting the best model from among the 2^p possibilities considered by best subset selection is not trivial. This is usually broken up into two stages, as described in Algorithm 6.1.

Algorithm 6.1 Best subset selection
1. Let M0 denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation.
2. For k = 1, 2, ..., p:
   (a) Fit all $\binom{p}{k}$ models that contain exactly k predictors.
   (b) Pick the best among these $\binom{p}{k}$ models, and call it Mk. Here best is defined as having the smallest RSS, or equivalently the largest R².
3. Select a single best model from among M0, ..., Mp using cross-validated prediction error, Cp, AIC, BIC, or adjusted R².

In Algorithm 6.1, Step 2 identifies the best model on the training data for each subset size, in order to reduce the problem from one of 2^p possible models to one of p + 1 possible models. In Figure 6.1, these models form the lower frontier depicted in red.
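The following is a minimal Python sketch of Algorithm 6.1, assuming scikit-learn and NumPy; the function name best_subset and the exhaustive search over itertools.combinations are illustrative, not an optimized implementation.

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression


def best_subset(X, y):
    """For each size k, return the predictor subset with the smallest training RSS."""
    p = X.shape[1]
    best = {0: ((), float(np.sum((y - y.mean()) ** 2)))}  # M0: the null model
    for k in range(1, p + 1):
        candidates = {}
        for subset in combinations(range(p), k):           # all "p choose k" models of size k
            cols = list(subset)
            fit = LinearRegression().fit(X[:, cols], y)
            candidates[subset] = float(np.sum((y - fit.predict(X[:, cols])) ** 2))
        best[k] = min(candidates.items(), key=lambda kv: kv[1])  # Mk: lowest RSS
    return best
```

Step 3 of the algorithm would then compare M0, ..., Mp using cross-validated prediction error, Cp, AIC, BIC, or adjusted R² rather than training RSS.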
Example: Credit data set
[Figure: RSS (left) and R² (right) plotted against the number of predictors.]
For each possible model containing a subset of the ten predictors in the Credit data set, the RSS and R2 are displayed. The red frontier tracks the best model for a given number of predictors, according to RSS and R2. Though the data set contains only ten predictors, the x-axis ranges from 1 to 11, since one of the variables is categorical and takes on three values, leading to the creation of two
dummy variables
More on Best Subset Selection
The same ideas apply to other types of models, such as logistic regression.
The deviance — negative two times the maximized log-likelihood — plays the role of RSS for a broader class of models.
For computational reasons, best subset selection cannot be applied with very large p. Why not? The number of candidate models, 2^p, grows exponentially with p.
Overfitting: when p is large, the search space is larger, so there is a higher chance of finding models with a low training error, which by no means guarantees a low test error.
Stepwise methods, which explore a far more restricted set of models, are attractive alternatives to best subset selection.
Forward Stepwise Selection

Forward stepwise selection begins with a model containing no predictors, and then adds predictors to the model, one at a time, until all of the predictors are in the model. In particular, at each step the variable that gives the greatest additional improvement to the fit is added to the model. More formally, the forward stepwise selection procedure is given in Algorithm 6.2.

Algorithm 6.2 Forward stepwise selection
1. Let M0 denote the null model, which contains no predictors.
2. For k = 0, ..., p − 1:
   (a) Consider all p − k models that augment the predictors in Mk with one additional predictor.
   (b) Choose the best among these p − k models, and call it Mk+1. Here best is defined as having the smallest RSS or highest R².
3. Select a single best model from among M0, ..., Mp using cross-validated prediction error, Cp, AIC, BIC, or adjusted R².
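A corresponding sketch of Algorithm 6.2, under the same assumptions as the best-subset sketch above (scikit-learn, NumPy; names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def forward_stepwise(X, y):
    """Greedily add, at each step, the predictor that most reduces the training RSS."""
    p = X.shape[1]
    selected, models = [], {0: []}
    for k in range(p):
        rss = {}
        for j in (j for j in range(p) if j not in selected):  # the p - k candidate additions
            cols = selected + [j]
            fit = LinearRegression().fit(X[:, cols], y)
            rss[j] = float(np.sum((y - fit.predict(X[:, cols])) ** 2))
        selected = selected + [min(rss, key=rss.get)]          # M_{k+1}: best single addition
        models[k + 1] = selected
    return models   # compare M0..Mp with CV, Cp, AIC, BIC, or adjusted R^2
```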
More on Forward Stepwise Selection
Computational advantage over best subset selection is clear.
Best subset selection: 2^p models
Forward stepwise selection: $1 + \sum_{k=0}^{p-1}(p-k) = 1 + p(p+1)/2$ models
For example, when p = 20, best subset selection must fit 2^20 = 1,048,576 models, whereas forward stepwise selection fits only 211.
It is not guaranteed to find the best possible model out of all 2^p models containing subsets of the p predictors.
Why not? Give an example.
Suppose that in a given data set with p = 3 predictors,
− the best possible one-variable model contains X1
− the best possible two-variable model instead contains X2 and X3
Then forward stepwise selection will fail to select the best possible two-variable model, because M1 will contain X1, so M2 must also contain X1 together with one additional variable.
Forward stepwise selection can be applied even in the high-dimensional setting where n < p.
Just construct submodels M0, ..., Mn−1 only. Why? Each submodel is fit by least squares, which does not yield a unique solution once the number of predictors reaches n.
Compare best subset selection with forward selection
# Variables   Best subset                      Forward stepwise
One           rating                           rating
Two           rating, income                   rating, income
Three         rating, income, student          rating, income, student
Four          cards, income, student, limit    rating, income, student, limit

TABLE 6.1. The first four selected models for best subset selection and forward stepwise selection on the Credit data set. The first three models are identical but the fourth models differ.
Backward Stepwise Selection

Unlike forward stepwise selection, backward stepwise selection begins with the full least squares model containing all p predictors, and then iteratively removes the least useful predictor, one at a time. Details are given in Algorithm 6.3.

Algorithm 6.3 Backward stepwise selection
1. Let Mp denote the full model, which contains all p predictors.
2. For k = p, p − 1, ..., 1:
   (a) Consider all k models that contain all but one of the predictors in Mk, for a total of k − 1 predictors.
   (b) Choose the best among these k models, and call it Mk−1. Here best is defined as having the smallest RSS or highest R².
3. Select a single best model from among M0, ..., Mp using cross-validated prediction error, Cp, AIC, BIC, or adjusted R².
More on Backward Stepwise Selection
Like forward stepwise selection, the backward selection approach searches through only 1 + p(p + 1)/2 models, and so can be applied in settings where p is too large to apply best subset selection.
Like forward stepwise selection, backward stepwise selection is not guaranteed to yield the best model containing a subset of the p predictors.
Backward selection requires that the number of samples n is larger than the number of variables p (so that the full model can be fit).
Choosing the Optimal Model
RSS and R2 are not suitable for selecting the best model among a collection of models with different numbers of predictors.
The RSS of these p + 1 models decreases monotonically, and the R2 increases monotonically, as the number of features included in the models increases.
These quantities are related to the training error. Recall that training error is usually a poor estimate of test error.
We wish to choose a model with low test error, not a model with low training error.
How to estimate test error?
Indirectly estimate the test error by making an adjustment to the training error to account for the bias due to overfitting.
Directly estimate the test error, using either a validation set approach or a cross-validation approach.
Measures for selecting the best model
Other measures can be used to select among a set of models with different numbers of variables:
Mallow's $C_p$: $C_p = \frac{1}{n}\left(\mathrm{RSS} + 2d\hat{\sigma}^2\right)$
AIC (Akaike information criterion): $\mathrm{AIC} = \frac{1}{n\hat{\sigma}^2}\left(\mathrm{RSS} + 2d\hat{\sigma}^2\right)$
BIC (Bayesian information criterion): $\mathrm{BIC} = \frac{1}{n}\left(\mathrm{RSS} + \log(n)\,d\hat{\sigma}^2\right)$
Adjusted $R^2$: $\text{Adjusted } R^2 = 1 - \frac{\mathrm{RSS}/(n-d-1)}{\mathrm{TSS}/(n-1)}$
where d is the number of predictors in the model and $\hat{\sigma}^2$ is an estimate of the variance of the error term.
These measures add a penalty to the RSS for the number of variables (i.e., complexity) in the model.
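A small helper that evaluates these four criteria from the quantities above is sketched below (a sketch only; it assumes RSS and TSS have already been computed and $\hat{\sigma}^2$ estimated, e.g., from the residuals of the full model):

```python
import numpy as np


def selection_criteria(rss, tss, n, d, sigma2_hat):
    """Return Cp, AIC, BIC, and adjusted R^2 for a model with d predictors."""
    cp = (rss + 2 * d * sigma2_hat) / n
    aic = (rss + 2 * d * sigma2_hat) / (n * sigma2_hat)
    bic = (rss + np.log(n) * d * sigma2_hat) / n
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return cp, aic, bic, adj_r2
```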
Example on the Credit data
[Figure: Cp, BIC, and adjusted R² for the best model of each size on the Credit data set, plotted against the number of predictors.]
A small value of Cp and BIC indicates a low error, and thus a better model. A large value of the adjusted R² indicates a better model.
Validation and Cross-Validation
[Figure: square root of BIC, validation set error, and cross-validation error on the Credit data set, each plotted against the number of predictors.]
We can compute the validation set error or the cross-validation error for each model under consideration, and then select the model for which the resulting estimated test error is smallest.
Advantage: it provides a direct estimate of the test error, and makes fewer assumptions about the true underlying model.
Shrinkage Methods
Outline
1 Subset Selection Methods
2 Shrinkage Methods
    Ridge regression
    The Lasso
    Elastic net
    Group Lasso
3 Summary
Ridge regression and Lasso
The subset selection methods use least squares to fit a linear model that
contains a subset of the predictors.
As an alternative, we can fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero.
It may not be immediately obvious why such a constraint should improve the fit, but it turns out that shrinking the coefficient estimates can significantly reduce their variance.
Ridge regression
The Ordinary Least Squares (OLS) fitting procedure estimates β0, β1, ..., βp using the values that minimize
$$\mathrm{RSS}=\sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Big)^2$$
The ridge regression coefficient estimates are the values that minimize
$$\sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Big)^2+\lambda\sum_{j=1}^{p}\beta_j^2=\mathrm{RSS}+\lambda\|\beta\|_2^2$$
where λ ≥ 0 is a regularisation parameter (or tuning parameter), and $\lambda\|\beta\|_2^2$ is called the shrinkage penalty.
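A minimal scikit-learn sketch of the effect of the shrinkage penalty follows; the data are simulated purely for illustration, λ is called alpha in scikit-learn, and the predictors are standardized because ridge regression is not scale-invariant.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(100, 10)))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

for lam in (0.01, 1.0, 100.0, 1e4):
    coef = Ridge(alpha=lam).fit(X, y).coef_
    # the l2 norm of the coefficient estimates shrinks towards zero as lambda grows
    print(lam, np.round(np.linalg.norm(coef), 3))
```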
What does the shrinkage penalty do?
[Figure] The standardized ridge regression coefficients (Income, Limit, Rating, Student) on the Credit data set, as a function of λ (left) and of $\|\hat\beta_\lambda^R\|_2/\|\hat\beta\|_2$ (right).
The shrinkage penalty has the effect of shrinking the estimates of βj towards zero.
λ = 0: ridge regression will produce the least squares estimates.
λ → ∞: the ridge regression coefficient estimates will approach zero.
The notation $\|\beta\|_2$ denotes the ℓ2 norm, $\|\beta\|_2=\sqrt{\sum_{j=1}^{p}\beta_j^2}$, which measures the distance of β from zero.
Why does ridge regression improve over OLS?
Recall that MSE is a function of the variance plus the squared bias.
Figure: The Bias-Variance tradeoff with a simulated data set containing p = 45
predictors and n = 50 samples. It shows that as λ increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias.
Black: Squared bias
Green: Variance
Purple: the test mean squared error (MSE), a function of the variance plus the squared bias.
Horizontal dashed line: the minimum possible MSE.
More on ridge regression
For p ≈ n and p > n,
OLS estimates are extremely variable
Ridge regression performs well by trading off a small increase in bias for a
large decrease in variance.
Computational advantages over best subset selection: for any fixed value of λ, ridge regression only fits a single model.
The Lasso
One obvious disadvantage of Ridge regression:
The shrinkage penalty will never force any of the coefficients to be exactly zero.
The final model will include all p variables, which makes it hard to interpret.
The LASSO uses the L1 penalty to force some of the coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently large:
$$\sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Big)^2+\lambda\|\beta\|_1=\mathrm{RSS}+\lambda\sum_{j=1}^{p}|\beta_j|$$
Ridge regression:
$$\sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Big)^2+\lambda\|\beta\|_2^2=\mathrm{RSS}+\lambda\sum_{j=1}^{p}\beta_j^2$$
Hence the LASSO performs variable selection.
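The same kind of sketch for the lasso shows coefficients being set exactly to zero as λ grows (again simulated data; scikit-learn's alpha plays the role of λ up to the library's 1/(2n) scaling of the RSS term):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(100, 10)))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

for lam in (0.01, 0.1, 1.0, 5.0):
    coef = Lasso(alpha=lam).fit(X, y).coef_
    # the number of nonzero coefficients drops as lambda grows
    print(lam, int(np.sum(coef != 0)))
```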
What does the L1 penalty do?
[Figure] The standardized lasso coefficients (Income, Limit, Rating, Student) on the Credit data set, as a function of λ (left) and of $\|\hat\beta_\lambda^L\|_1/\|\hat\beta\|_1$ (right).
When λ becomes sufficiently large, the lasso gives the null model.
In the right-hand panel: rating ⇒ rating + student + limit ⇒ rating + student + limit + income.
Comparing the Lasso and Ridge Regression
[Figure] Plots of squared bias (black), variance (green), and test MSE (purple), against λ (left) and against R² on the training data (right).
Left plot: The lasso leads to qualitatively similar behavior to ridge regression.
Right plot:
A plot against training R2 can be used to compare models with different types of regularisation.
The minimum MSE of ridge regression is slightly smaller than that of the lasso
− In the simulated data set, all predictors were related to the response.
Comparing the Lasso and Ridge Regression
[Figure] Plots of squared bias (black), variance (green), and test MSE (purple), against λ (left) and against R² on the training data (right). In these plots, the simulated data are generated in such a way that only 2 out of 45 predictors are related to the response.
Conclusion: Neither ridge regression nor the lasso will universally dominate
the other.
Conclusions
We expect:
The lasso to perform better in a setting where a relatively small number of predictors have substantial coefficients.
Ridge regression will perform better when the response is a function of many predictors, all with coefficients of roughly equal size
However, the number of predictors that are related to the response is never known a priori for real data sets.
Document classification: Given 10,000 words in a vocabulary, which words are related to document classes?
Solution: cross validation!
Selecting the tuning parameter — 1
[Figure] The choice of λ that results from performing leave-one-out cross-validation on the ridge regression fits from the Credit data set: cross-validation error (left) and standardized coefficients (right), each as a function of λ.
Select a grid of potential values, use cross-validation to estimate the error for each value of λ, and select the value that gives the smallest estimated error.
A similar strategy applies to the lasso.
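A sketch of this strategy with scikit-learn's built-in cross-validated estimators (RidgeCV uses an efficient leave-one-out scheme by default; the simulated data are illustrative):

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(100, 10)))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

lambdas = np.logspace(-3, 3, 100)                     # grid of candidate tuning parameters
ridge_cv = RidgeCV(alphas=lambdas).fit(X, y)          # leave-one-out CV over the grid
lasso_cv = LassoCV(alphas=lambdas, cv=10).fit(X, y)   # 10-fold CV over the grid
print(ridge_cv.alpha_, lasso_cv.alpha_)               # the selected values of lambda
```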
Selecting the tuning parameter — 2
Figure: Left: Ten-fold cross-validation MSE for the lasso, applied to the sparse simulated data set. Right: The corresponding lasso coefficient estimates are displayed. The vertical dashed lines indicate the lasso fit for which the cross-validation error is smallest.
Shrinkage Methods
Elastic net
Outline
1 Subset Selection Methods
2 Shrinkage Methods
    Ridge regression
    The Lasso
    Elastic net
    Group Lasso
3 Summary
Elastic net
$$\sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Big)^2+\lambda\|\beta\|_1$$
Limitations of the lasso
if p > n, the lasso selects at most n variables
Grouped variables: the lasso fails to do grouped selection.
Elastic net
$$\sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Big)^2+\lambda\Big((1-\alpha)\frac{\|\beta\|_2^2}{2}+\alpha\|\beta\|_1\Big)$$
The L1 part of the penalty generates a sparse model.
The quadratic part of the penalty
− removes the limitation on the number of selected variables;
− encourages the grouping effect;
− stabilizes the L1 regularization path;
− automatically includes whole groups in the model if one variable among them is selected.
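A minimal scikit-learn sketch follows; the library's alpha plays the role of λ and l1_ratio the role of α above (up to scaling conventions in the loss), and the p > n data are simulated for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(50, 100)))   # p = 100 > n = 50
y = X[:, 0] + X[:, 1] + X[:, 2] + rng.normal(size=50)

enet = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10_000).fit(X, y)
print(int(np.sum(enet.coef_ != 0)))   # a sparse model, not restricted to at most n variables
```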
Geometry of the elastic net

[Figure] A 2-dimensional illustration (α = 0.5) of the ridge, lasso, and elastic net penalty contours in the (β1, β2) plane, from "Regularization and Variable Selection via the Elastic Net" by Zou and Hastie.
− Singularities at the vertexes encourage sparsity.
− Strict convexity encourages the grouping effect.
Elastic net
$$\sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Big)^2+\lambda\Big((1-\alpha)\frac{\|\beta\|_2^2}{2}+\alpha\|\beta\|_1\Big)$$
The elastic net performs simultaneous regularization and variable selection.
Ability to perform grouped selection.
Appropriate for the p ≫ n problem.
Elastic net example
A simple illustration: elastic net vs. lasso
• Two independent "hidden" factors z1 and z2: z1 ∼ U(0, 20), z2 ∼ U(0, 20)
• Generate the response vector y = z1 + 0.1·z2 + N(0, 1)
• Suppose we only observe the predictors
  x1 = z1 + ε1, x2 = −z1 + ε2, x3 = z1 + ε3
  x4 = z2 + ε4, x5 = −z2 + ε5, x6 = z2 + ε6
• Fit the model on (X, y)
• An "oracle" would identify x1, x2, and x3 (the z1 group) as the most important variables.
Figure: Slide from “Regularization and Variable Selection via the Elastic Net” by Zou and Hastie
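A sketch of this simulation in Python (the noise scale for the ε's is not specified on the slide, so the value used here is an assumption; variable names are illustrative):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 100
z1, z2 = rng.uniform(0, 20, n), rng.uniform(0, 20, n)
y = z1 + 0.1 * z2 + rng.normal(size=n)
eps = rng.normal(scale=0.25, size=(n, 6))                          # assumed noise level
X = StandardScaler().fit_transform(np.column_stack([z1, -z1, z1, z2, -z2, z2]) + eps)

print(Lasso(alpha=0.1, max_iter=50_000).fit(X, y).coef_)           # tends to keep one of x1, x2, x3
print(ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=50_000).fit(X, y).coef_)  # tends to spread weight over the group
```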
Elastic net example
[Figure] Standardized coefficient paths for the lasso (left) and the elastic net with λ = 0.5 (right) on this simulated data, plotted against s = |β|/max|β| (slide from "Regularization and Variable Selection via the Elastic Net" by Zou and Hastie).
Group Lasso
Some advantages of group lasso
The information contained in the grouping structure is informative in learning.
Selecting important groups of variables gives models that are more sensible
and interpretable
Group Lasso formulation
We denote X as being composed of J groups X1, X2, ..., XJ, with pj denoting the size of group j; i.e., $\sum_j p_j = P$.
$$\min_{\beta\in\mathbb{R}^P}\ \Big\|y-\sum_{j=1}^{J}X_j\beta_j\Big\|_2^2+\lambda\sum_{l=1}^{J}\sqrt{p_l}\,\|\beta_l\|_2$$
Group lasso acts like the lasso at the group level.
Group lasso does not yield sparsity within a group.
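Group lasso is not part of scikit-learn's standard estimators, so below is a minimal proximal-gradient sketch of the formulation above (using the common 1/2 scaling of the squared-error term; groups is a list of column-index arrays, and the code is an illustration rather than an optimized solver):

```python
import numpy as np


def group_lasso(X, y, groups, lam, n_iter=1000):
    """Proximal gradient for 0.5 * ||y - X beta||_2^2 + lam * sum_l sqrt(p_l) * ||beta_l||_2."""
    beta = np.zeros(X.shape[1])
    step = 1.0 / np.linalg.norm(X, 2) ** 2        # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        z = beta + step * X.T @ (y - X @ beta)    # gradient step on the smooth part
        for g in groups:                          # block soft-thresholding, one group at a time
            norm = np.linalg.norm(z[g])
            thresh = step * lam * np.sqrt(len(g))
            beta[g] = 0.0 if norm <= thresh else (1.0 - thresh / norm) * z[g]
        # whole groups are zeroed out together once lam is large enough
    return beta
```

For the sparse group lasso on the next slide, the update would additionally soft-threshold each coordinate (the ℓ1 part) before the group-level shrinkage.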
Sparse Group Lasso
Sparse Group Lasso formulation
We denote X as being composed of J groups X1, X2, ..., XJ, with pj denoting the size of group j; i.e., $\sum_j p_j = P$.
$$\min_{\beta\in\mathbb{R}^P}\ \Big\|y-\sum_{j=1}^{J}X_j\beta_j\Big\|_2^2+\lambda_1\sum_{l=1}^{J}\sqrt{p_l}\,\|\beta_l\|_2+\lambda_2\|\beta\|_1$$
Sparse Group lasso yields sparsity at both the group and individual feature levels.
Summary
Model selection methods are an essential tool for data analysis, especially
for big datasets involving many predictors.
Reading materials:
“Linear Model Selection and Regularization”, Chapter 6 of “Introduction to Statistical Learning”, 6th edition
− Section 6.1 “Subset Selection”
− Section 6.2 “Shrinkage Methods”
References:
Figures in this presentation were taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani
Some of the slides are reproduced based on the slides from T. Hastie and R. Tibshirani