
Linear Model Selection and Regularization
Dr. Lan Du
Faculty of Information Technology, Monash University, Australia
FIT5149 week 6
(Monash) FIT5149 1 / 38

Outline
1 Subset Selection Methods
  - Best Subset Selection
  - Stepwise Selection
2 Shrinkage Methods
  - Ridge regression
  - The Lasso
  - Elastic net
  - Group Lasso
3 Summary
(Monash)
FIT5149 2 / 38

Improve linear model fitting procedure
Recall the linear model
Y = β_0 + β_1 X_1 + · · · + β_p X_p
The linear model has distinct advantages in terms of its interpretability, and it often shows good predictive performance when its assumptions are satisfied.
Improve the simple linear model:
- Replace ordinary least squares fitting with alternative fitting procedures that can yield better prediction accuracy and better model interpretability.
(Monash) FIT5149 3 / 38
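As a reference point for the rest of the deck, here is a minimal ordinary least squares fit in Python (scikit-learn, synthetic data; the data set and values are illustrative, not from the slides). The alternative fitting procedures discussed below replace only this fitting step, not the model form.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Synthetic data standing in for the predictors X1, ..., Xp and the response Y
X, y = make_regression(n_samples=100, n_features=10, noise=10.0, random_state=0)

ols = LinearRegression().fit(X, y)     # ordinary least squares fit
print(ols.intercept_, ols.coef_)       # estimates of beta_0 and (beta_1, ..., beta_p)
```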

Why consider alternatives to least squares?
Prediction Accuracy: especially when p > n, we need to control the variance.
- If the true relationship between Y and X is approximately linear ⇒ low bias
- If n ≫ p ⇒ low variance
- If n ≈ p ⇒ high variance & overfitting & poor prediction
- If n < p ⇒ infinite variance & no unique OLS coefficient estimate
Model Interpretability:
- When we have a large number of variables X, there will generally be some or many that are not associated with the response Y.
- Including irrelevant variables leads to unnecessary complexity in the model.
- Removing irrelevant variables by setting their coefficients to 0 increases the interpretability of the resulting model.
Solution: feature (variable) selection.
(Monash) FIT5149 4 / 38

Three classes of selection methods
Subset Selection: Identify a subset of all p predictors X that we believe to be related to the response Y, and then fit the model by least squares on the reduced set of variables.
- Best subset selection
- Forward/Backward stepwise selection
- Hybrid selection
Shrinkage, also known as regularisation:
- The estimated coefficients are shrunken towards zero relative to the least squares estimates.
- The shrinkage has the effect of reducing variance.
- The shrinkage can also perform variable selection.
  - Ridge regression: L2 regularisation
  - The Lasso: L1 regularisation
  - Elastic Net: the mixture of L1 and L2
  - Group Lasso
Dimension Reduction: Involves projecting all p predictors into an M-dimensional space where M < p, and then fitting a regression model.
- e.g., Principal Component Analysis
(Monash) FIT5149 5 / 38

Subset Selection Methods
Outline
1 Subset Selection Methods
  - Best Subset Selection
  - Stepwise Selection
2 Shrinkage Methods
3 Summary
(Monash) FIT5149 6 / 38

Subset Selection Methods
Best Subset Selection
Best Subset Selection
The problem of selecting the best model from among the 2^p possibilities considered by best subset selection is not trivial. It is usually broken up into two stages, as described in Algorithm 6.1.
Algorithm 6.1 Best subset selection
1. Let M_0 denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation.
2. For k = 1, 2, ..., p:
   (a) Fit all (p choose k) models that contain exactly k predictors.
   (b) Pick the best among these (p choose k) models, and call it M_k. Here best is defined as having the smallest RSS, or equivalently the largest R^2.
3. Select a single best model from among M_0, ..., M_p using cross-validated prediction error, Cp, AIC, BIC, or adjusted R^2.
In Algorithm 6.1, Step 2 identifies the best model on the training data for each subset size, in order to reduce the problem from one of 2^p possible models to one of p + 1 possible models. In Figure 6.1, these models form the lower frontier depicted in red.
(Monash) FIT5149 7 / 38

Subset Selection Methods
Best Subset Selection
Example: Credit data set
[Figure: RSS and R^2 plotted against the Number of Predictors for every possible subset model.]
For each possible model containing a subset of the ten predictors in the Credit data set, the RSS and R^2 are displayed. The red frontier tracks the best model for a given number of predictors, according to RSS and R^2. Though the data set contains only ten predictors, the x-axis ranges from 1 to 11, since one of the variables is categorical and takes on three values, leading to the creation of two dummy variables.
(Monash) FIT5149 8 / 38

Subset Selection Methods
Best Subset Selection
More on Best Subset Selection
The same idea applies to other types of models, such as logistic regression.
- The deviance (negative two times the maximized log-likelihood) plays the role of RSS for a broader class of models.
For computational reasons, best subset selection cannot be applied with very large p. Why not?
Overfitting: when p is large, the larger the search space, the higher the chance of finding models with a low training error, which by no means guarantees a low test error.
Stepwise methods, which explore a far more restricted set of models, are attractive alternatives to best subset selection.
(Monash) FIT5149 9 / 38

Subset Selection Methods
Stepwise Selection
Forward Stepwise Selection
Forward stepwise selection begins with a model containing no predictors, and then adds predictors to the model one at a time, until all of the predictors are in the model. In particular, at each step the variable that gives the greatest additional improvement to the fit is added to the model. More formally, the forward stepwise selection procedure is given in Algorithm 6.2.
Algorithm 6.2 Forward stepwise selection
1. Let M_0 denote the null model, which contains no predictors.
2. For k = 0, ..., p − 1:
   (a) Consider all p − k models that augment the predictors in M_k with one additional predictor.
   (b) Choose the best among these p − k models, and call it M_{k+1}. Here best is defined as having the smallest RSS or the highest R^2.
3. Select a single best model from among M_0, ..., M_p using cross-validated prediction error, Cp, AIC, BIC, or adjusted R^2.
(Monash) FIT5149 10 / 38

Subset Selection Methods
Stepwise Selection
More on Forward Stepwise Selection
Computational advantage over best subset selection is clear.
- Best subset selection: 2^p models
- Forward stepwise selection: 1 + Σ_{k=0}^{p−1} (p − k) = 1 + p(p + 1)/2 models
It is not guaranteed to find the best possible model out of all 2^p models containing subsets of the p predictors.
- Why not? Give an example.
- Suppose that in a given data set with p = 3 predictors,
  - the best possible one-variable model contains X_1,
  - the best possible two-variable model instead contains X_2 and X_3.
  Then forward stepwise selection will fail to select the best possible two-variable model, because M_1 will contain X_1, so M_2 must also contain X_1 together with one additional variable.
Forward stepwise selection can be applied even in the high-dimensional setting where n < p.
- Just construct submodels M_0, ..., M_{n−1} only. Why?
(Monash) FIT5149 11 / 38
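A minimal sketch of forward stepwise selection (Algorithm 6.2) in Python, using training RSS to pick the best candidate at each step; the synthetic data and the helper names (`rss`, `forward_stepwise`) are my own, not from the slides. Step 3 of the algorithm (choosing among M_0, ..., M_p by cross-validation, Cp, AIC, BIC or adjusted R^2) is left out for brevity.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

def rss(X, y, features):
    """Residual sum of squares of an OLS fit on the given feature subset."""
    if not features:
        return np.sum((y - y.mean()) ** 2)          # null model M_0
    pred = LinearRegression().fit(X[:, features], y).predict(X[:, features])
    return np.sum((y - pred) ** 2)

def forward_stepwise(X, y):
    """Return the nested sequence of models M_0 ⊂ M_1 ⊂ ... ⊂ M_p (as index lists)."""
    p = X.shape[1]
    selected, models = [], [[]]
    for _ in range(p):
        remaining = [j for j in range(p) if j not in selected]
        # add the single predictor that lowers the training RSS the most
        best_j = min(remaining, key=lambda j: rss(X, y, selected + [j]))
        selected = selected + [best_j]
        models.append(selected)
    return models

X, y = make_regression(n_samples=80, n_features=8, n_informative=3, noise=5.0, random_state=1)
for k, m in enumerate(forward_stepwise(X, y)):
    print(k, m)
```

This sketch fits only 1 + p(p + 1)/2 models, versus the 2^p that best subset selection would enumerate.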
Subset Selection Methods
Stepwise Selection
Compare best subset selection with forward selection
# Variables   Best subset                      Forward stepwise
One           rating                           rating
Two           rating, income                   rating, income
Three         rating, income, student          rating, income, student
Four          cards, income, student, limit    rating, income, student, limit
TABLE 6.1. The first four selected models for best subset selection and forward stepwise selection on the Credit data set. The first three models are identical but the fourth models differ.
(Monash) FIT5149 12 / 38

Subset Selection Methods
Stepwise Selection
Backward Stepwise Selection
Unlike forward stepwise selection, backward stepwise selection begins with the full least squares model containing all p predictors, and then iteratively removes the least useful predictor, one at a time. Details are given in Algorithm 6.3.
Algorithm 6.3 Backward stepwise selection
1. Let M_p denote the full model, which contains all p predictors.
2. For k = p, p − 1, ..., 1:
   (a) Consider all k models that contain all but one of the predictors in M_k, for a total of k − 1 predictors.
   (b) Choose the best among these k models, and call it M_{k−1}. Here best is defined as having the smallest RSS or the highest R^2.
3. Select a single best model from among M_0, ..., M_p using cross-validated prediction error, Cp, AIC, BIC, or adjusted R^2.
(Monash) FIT5149 13 / 38

Subset Selection Methods
Stepwise Selection
More on Backward Stepwise Selection
Like forward stepwise selection, the backward selection approach searches through only 1 + p(p + 1)/2 models, and so can be applied in settings where p is too large to apply best subset selection.
Like forward stepwise selection, backward stepwise selection is not guaranteed to yield the best model containing a subset of the p predictors.
Backward selection requires that the number of samples n is larger than the number of variables p (so that the full model can be fit).
(Monash) FIT5149 14 / 38

Subset Selection Methods
Stepwise Selection
Choosing the Optimal Model
RSS and R^2 are not suitable for selecting the best model among a collection of models with different numbers of predictors.
- The RSS of these p + 1 models decreases monotonically, and the R^2 increases monotonically, as the number of features included in the models increases.
- These quantities are related to the training error.
Recall that training error is usually a poor estimate of test error. We wish to choose a model with low test error, not a model with low training error. How can we estimate the test error?
- Estimate the test error indirectly, by making an adjustment to the training error to account for the bias due to overfitting.
- Estimate the test error directly, using either a validation set approach or a cross-validation approach.
(Monash) FIT5149 15 / 38

Subset Selection Methods
Stepwise Selection
Measures for selecting the best model
Other measures can be used to select among a set of models with different numbers of variables (d denotes the number of predictors in the model and σ̂² an estimate of the error variance):
- Mallow's Cp: C_p = (1/n)(RSS + 2 d σ̂²)
- AIC (Akaike information criterion): AIC = (1/(n σ̂²))(RSS + 2 d σ̂²)
- BIC (Bayesian information criterion): BIC = (1/n)(RSS + log(n) d σ̂²)
- Adjusted R²: Adjusted R² = 1 − [RSS/(n − d − 1)] / [TSS/(n − 1)]
These measures add a penalty to RSS for the number of variables (i.e. complexity) in the model.
(Monash) FIT5149 16 / 38

Subset Selection Methods
Stepwise Selection
Example on the Credit data
[Figure: Cp, BIC and Adjusted R² plotted against the Number of Predictors for the Credit data set.]
A small value of Cp and BIC indicates a low error, and thus a better model. A large value for the Adjusted R² indicates a better model.
(Monash) FIT5149 17 / 38

Subset Selection Methods
Stepwise Selection
Validation and Cross-Validation
[Figure: Square Root of BIC, Validation Set Error and Cross-Validation Error plotted against the Number of Predictors for the Credit data set.]
We can compute the validation set error or the cross-validation error for each model under consideration, and then select the model for which the resulting estimated test error is smallest.
Advantage: it provides a direct estimate of the test error, and makes fewer assumptions about the true underlying model.
(Monash) FIT5149 18 / 38

Shrinkage Methods
Outline
1 Subset Selection Methods
2 Shrinkage Methods
  - Ridge regression
  - The Lasso
  - Elastic net
  - Group Lasso
3 Summary
(Monash) FIT5149 19 / 38

Shrinkage Methods
Ridge regression and Lasso
The subset selection methods use least squares to fit a linear model that contains a subset of the predictors.
As an alternative, we can fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero.
It may not be immediately obvious why such a constraint should improve the fit, but it turns out that shrinking the coefficient estimates can significantly reduce their variance.
(Monash) FIT5149 20 / 38

Shrinkage Methods
Ridge regression
Ridge regression
The Ordinary Least Squares (OLS) fitting procedure estimates β_0, β_1, ..., β_p using the values that minimize
RSS = Σ_{i=1}^{n} ( y_i − β_0 − Σ_{j=1}^{p} β_j x_{ij} )²
The ridge regression coefficient estimates are the values that minimize
Σ_{i=1}^{n} ( y_i − β_0 − Σ_{j=1}^{p} β_j x_{ij} )² + λ ∥β∥²_2 = RSS + λ Σ_{j=1}^{p} β_j²
where
- λ ≥ 0 is a regularisation parameter (or tuning parameter);
- λ ∥β∥²_2 is called a shrinkage penalty.
(Monash) FIT5149 21 / 38

Shrinkage Methods
Ridge regression
What does the shrinkage penalty do?
Figure: The standardized ridge regression coefficients (Income, Limit, Rating, Student) are displayed for the Credit data set, as a function of λ (left) and of ∥β̂_λ^R∥_2 / ∥β̂∥_2 (right).
The shrinkage penalty has the effect of shrinking the estimates of β_j towards zero.
- λ = 0: ridge regression will produce the least squares estimates.
- λ → ∞: the ridge regression coefficient estimates will approach zero.
The notation ∥β∥_2 denotes the ℓ2 norm, ∥β∥_2 = √( Σ_{j=1}^{p} β_j² ), which measures the distance of β from zero.
(Monash) FIT5149 22 / 38

Shrinkage Methods
Ridge regression
Why does ridge regression improve over OLS?
Recall that the test MSE is a function of the variance plus the squared bias.
Figure: The bias-variance trade-off on a simulated data set containing p = 45 predictors and n = 50 samples; squared bias (black), variance (green) and test MSE (purple) are plotted against λ (left) and against ∥β̂_λ^R∥_2 / ∥β̂∥_2 (right). As λ increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias.
- Black: squared bias
- Green: variance
- Purple: the test mean squared error (MSE), a function of the variance plus the squared bias
- Horizontal dashed line: the minimum possible MSE
(Monash) FIT5149 23 / 38

Shrinkage Methods
Ridge regression
More on ridge regression
For p ≈ n and p > n,
- OLS estimates are extremely variable;
- ridge regression performs well by trading off a small increase in bias for a large decrease in variance.
Computational advantages over best subset selection: for any fixed value of λ, ridge regression only fits a single model.
(Monash) FIT5149 24 / 38
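A minimal ridge regression sketch in Python (scikit-learn, synthetic data with p close to n; values are illustrative). Note that scikit-learn calls the tuning parameter λ `alpha`, and that the predictors should be standardized so the penalty treats them on a common scale.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# p close to n, the setting where OLS is highly variable
X, y = make_regression(n_samples=50, n_features=45, noise=5.0, random_state=0)

for lam in [0.01, 1.0, 100.0]:                   # scikit-learn calls lambda `alpha`
    model = make_pipeline(StandardScaler(), Ridge(alpha=lam)).fit(X, y)
    coef = model.named_steps["ridge"].coef_
    print(lam, round(np.linalg.norm(coef), 2))   # the L2 norm of the coefficients shrinks as lambda grows
```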

Shrinkage Methods
The Lasso
The Lasso
One obvious disadvantage of ridge regression:
- The shrinkage penalty will never force any of the coefficients to be exactly zero.
- The final model will include all variables, which makes it hard to interpret.
The LASSO uses the L1 penalty to force some of the coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently large:
Σ_{i=1}^{n} ( y_i − β_0 − Σ_{j=1}^{p} β_j x_{ij} )² + λ ∥β∥_1 = RSS + λ Σ_{j=1}^{p} |β_j|
Ridge regression:
Σ_{i=1}^{n} ( y_i − β_0 − Σ_{j=1}^{p} β_j x_{ij} )² + λ ∥β∥²_2 = RSS + λ Σ_{j=1}^{p} β_j²
The LASSO performs variable selection.
(Monash) FIT5149
25 / 38
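A minimal lasso sketch in Python (scikit-learn, synthetic data; the values are illustrative, and scikit-learn's objective scales the squared-error term by 1/(2n), so its `alpha` does not map one-to-one onto the slide's λ). Unlike ridge, a sufficiently large penalty drives some coefficients exactly to zero, which is the variable-selection effect described above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=20, n_informative=3, noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)            # standardize so the penalty is comparable across predictors

for lam in [0.1, 1.0, 10.0]:                     # scikit-learn calls lambda `alpha`
    lasso = Lasso(alpha=lam).fit(X, y)
    n_nonzero = np.sum(lasso.coef_ != 0)
    print(f"alpha={lam}: {n_nonzero} non-zero coefficients")
```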

Shrinkage Methods
The Lasso
What does the L1 penalty do?
Figure: The standardized lasso coefficient paths (Income, Limit, Rating, Student) on the Credit data set, as a function of λ (left) and of ∥β̂_λ^L∥_1 / ∥β̂∥_1 (right).
When λ becomes sufficiently large, the lasso gives the null model.
In the right-hand panel, as the penalty is relaxed the selected variables enter in the order: rating ⇒ rating + student + limit ⇒ rating + student + limit + income.
(Monash) FIT5149 26 / 38

Shrinkage Methods
The Lasso
Comparing the Lasso and Ridge Regression
Figure: Plots of squared bias (black), variance (green), and test MSE (purple) for the lasso, against λ (left) and against R² on the training data (right).
Left plot: The lasso leads to qualitatively similar behavior to ridge regression.
Right plot:
- A plot against training R² can be used to compare models with different types of regularisation.
- The minimum MSE of ridge regression is slightly smaller than that of the lasso.
  - In this simulated data set, all predictors were related to the response.
(Monash) FIT5149 27 / 38

Shrinkage Methods
The Lasso
Comparing the Lasso and Ridge Regression
Figure: Plots of squared bias (black), variance (green), and test MSE (purple), against λ (left) and against R² on the training data (right). In these plots, the simulated data are generated so that only 2 out of 45 predictors are related to the response.
Conclusion: Neither ridge regression nor the lasso will universally dominate
the other.
(Monash) FIT5149 28 / 38

Shrinkage Methods
The Lasso
Conclusions
We expect:
- the lasso to perform better in a setting where a relatively small number of predictors have substantial coefficients;
- ridge regression to perform better when the response is a function of many predictors, all with coefficients of roughly equal size.
However, the number of predictors that are related to the response is never known a priori for real data sets.
- Document classification: given 10,000 words in a vocabulary, which words are related to the document classes?
Solution: cross validation!
(Monash) FIT5149 29 / 38

Shrinkage Methods
The Lasso
Selecting the tuning parameter — 1
Figure: The choice of λ that results from performing leave-one-out cross-validation on the ridge regression fits from the Credit data set (left: cross-validation error vs. λ; right: standardized coefficients vs. λ).
Select a grid of candidate λ values, use cross-validation to estimate the test error for each value of λ, and select the value that gives the smallest estimated error.
Similar strategy applies to the Lasso.
(Monash) FIT5149 30 / 38

Shrinkage Methods
The Lasso
Selecting the tuning parameter — 2
Figure: Left: Ten-fold cross-validation MSE for the lasso, applied to the sparse simulated data set, plotted against ∥β̂_λ^L∥_1 / ∥β̂∥_1. Right: The corresponding lasso coefficient estimates. The vertical dashed lines indicate the lasso fit for which the cross-validation error is smallest.
(Monash) FIT5149 31 / 38
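A minimal sketch of selecting λ by cross-validation in Python; `LassoCV` runs the grid search and K-fold cross-validation internally (the synthetic data and the fold count are illustrative). `RidgeCV` works analogously for ridge regression.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Sparse setting: only a few of the 45 predictors are related to the response
X, y = make_regression(n_samples=50, n_features=45, n_informative=2, noise=1.0, random_state=0)
X = StandardScaler().fit_transform(X)

cv_lasso = LassoCV(cv=10, random_state=0).fit(X, y)    # 10-fold CV over an automatic lambda grid
print("chosen lambda:", cv_lasso.alpha_)
print("non-zero coefficients:", np.sum(cv_lasso.coef_ != 0))
```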

Shrinkage Methods
Elastic net
Outline
1 Subset Selection Methods
2 Shrinkage Methods
  - Ridge regression
  - The Lasso
  - Elastic net
  - Group Lasso
3 Summary
(Monash)
FIT5149 32 / 38

Shrinkage Methods
Elastic net
Elastic net
Recall the lasso objective:
Σ_{i=1}^{n} ( y_i − β_0 − Σ_{j=1}^{p} β_j x_{ij} )² + λ ∥β∥_1
Limitations of the lasso
- If p > n, the lasso selects at most n variables.
- Grouped variables: the lasso fails to do grouped selection.
(Monash) FIT5149 33 / 38

Shrinkage Methods
Elastic net
Elastic net
The elastic net objective:
Σ_{i=1}^{n} ( y_i − β_0 − Σ_{j=1}^{p} β_j x_{ij} )² + λ ( (1 − α) ∥β∥²_2 / 2 + α ∥β∥_1 )
The L1 part of the penalty generates a sparse model. The quadratic part of the penalty
- removes the limitation on the number of selected variables;
- encourages a grouping effect;
- stabilizes the L1 regularization path.
It automatically includes whole groups of correlated variables in the model if one variable among them is selected.
(Monash) FIT5149 34 / 38
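A minimal elastic net sketch in Python (scikit-learn, synthetic data with correlated predictors; the values are illustrative). Note that scikit-learn's parameterisation differs slightly from the slide's: `alpha` plays the role of λ, `l1_ratio` the role of α, and the squared-error term is scaled by 1/(2n), so the numbers do not map one-to-one onto the formula above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.preprocessing import StandardScaler

# Low effective rank => strongly correlated predictors, the setting where grouping matters
X, y = make_regression(n_samples=60, n_features=100, n_informative=10,
                       effective_rank=5, noise=1.0, random_state=0)
X = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=0.5).fit(X, y)
enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)   # mixture of L1 and L2 penalties

print("lasso selects:      ", np.sum(lasso.coef_ != 0), "variables")
print("elastic net selects:", np.sum(enet.coef_ != 0), "variables")
```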

Shrinkage Methods
Elastic net
Elastic net
Geometry of the elastic net (slide adapted from "Regularization and Variable Selection via the Elastic Net" by Hui Zou, Stanford University)
A two-dimensional illustration (α = 0.5) of minimizing ∥y − Xβ∥² subject to a constraint on the elastic net penalty J(β), which mixes the L1 norm and the squared L2 norm.
[Figure: contours of the Ridge, Lasso and Elastic Net penalties in the (β1, β2) plane.]
- Singularities at the vertices of the elastic net contour encourage sparsity.
- Strict convexity of the contour (from the quadratic part) encourages the grouping effect.
(Monash) FIT5149 34 / 38

Shrinkage Methods
Elastic net
Elastic net
Σ_{i=1}^{n} ( y_i − β_0 − Σ_{j=1}^{p} β_j x_{ij} )² + λ ( (1 − α) ∥β∥²_2 / 2 + α ∥β∥_1 )
The elastic net performs simultaneous regularization and variable selection.
Ability to perform grouped selection.
Appropriate for the p ≫ n problem.
(Monash) FIT5149 34 / 38

Shrinkage Methods
Elastic net
Elastic net example
A simple illustration: elastic net vs. lasso
- Two independent "hidden" factors z_1 and z_2: z_1 ~ U(0, 20), z_2 ~ U(0, 20).
- Generate the response vector y = z_1 + 0.1 · z_2 + N(0, 1).
- Suppose we only observe the predictors
  x_1 = z_1 + ε_1,  x_2 = −z_1 + ε_2,  x_3 = z_1 + ε_3,
  x_4 = z_2 + ε_4,  x_5 = −z_2 + ε_5,  x_6 = z_2 + ε_6.
- Fit the model on (X, y).
- An "oracle" would identify x_1, x_2 and x_3 (the z_1 group) as the most important variables.
Figure: Slide from “Regularization and Variable Selection via the Elastic Net” by Zou and Hastie
(Monash) FIT5149 35 / 38
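A sketch reproducing this simulation in Python, to compare which predictors the lasso and the elastic net pick up. The noise scale for the ε terms is not legible on the slide; a standard deviation of 0.25 follows Zou and Hastie's paper and is an assumption here, as are the sample size and penalty values.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
n = 100
z1, z2 = rng.uniform(0, 20, n), rng.uniform(0, 20, n)
y = z1 + 0.1 * z2 + rng.normal(0, 1, n)                  # response driven mostly by z1

eps = rng.normal(0, 0.25, (n, 6))                        # assumed noise scale (sd 1/4)
X = np.column_stack([z1, -z1, z1, z2, -z2, z2]) + eps    # x1..x3 from z1, x4..x6 from z2

lasso = Lasso(alpha=1.0).fit(X, y)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)
print("lasso coefficients:      ", np.round(lasso.coef_, 2))
print("elastic net coefficients:", np.round(enet.coef_, 2))
```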

Shrinkage Methods
Elastic net
Elastic net example
[Figure: standardized coefficient paths for predictors 1-6 under the Lasso (left) and the Elastic Net with lambda = 0.5 (right), plotted against s = |beta| / max|beta|.]
Figure: Slide from “Regularization and Variable Selection via the Elastic Net” by Zou
and Hastie
(Monash) FIT5149 35 / 38

Shrinkage Methods
Group Lasso
Group Lasso
Some advantages of group lasso
- The grouping structure itself carries information that is useful for learning.
- Selecting important groups of variables gives models that are more sensible and interpretable.
Group Lasso formulation
- We denote X as being composed of J groups X_1, X_2, ..., X_J, with p_j denoting the size of group j; i.e., Σ_j p_j = P.
min_{β ∈ R^P} ∥ y − Σ_{j=1}^{J} X_j β_j ∥²_2 + λ Σ_{j=1}^{J} √p_j ∥β_j∥_2
- Group lasso acts like the lasso at the group level.
- Group lasso does not yield sparsity within a group.
(Monash) FIT5149 36 / 38
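The slides do not show an algorithm for fitting the group lasso; a common choice is proximal gradient descent, where the proximal step is block soft-thresholding applied group by group. Below is a minimal numpy sketch of that idea under stated assumptions: the function names, the 1/2 scaling on the loss (which only rescales λ), the step-size choice and the toy data are my own, not from the slides.

```python
import numpy as np

def group_soft_threshold(z, thresh):
    """Block soft-thresholding: shrink a whole group of coefficients towards zero."""
    norm = np.linalg.norm(z)
    if norm <= thresh:
        return np.zeros_like(z)
    return (1.0 - thresh / norm) * z

def group_lasso(X, y, groups, lam, n_iter=2000):
    """Proximal gradient descent for
        0.5 * ||y - X beta||_2^2 + lam * sum_j sqrt(p_j) * ||beta_j||_2,
    where `groups` is a list of index arrays, one per group."""
    beta = np.zeros(X.shape[1])
    step = 1.0 / np.linalg.norm(X, 2) ** 2          # 1/L, L = Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)                 # gradient of the smooth least-squares part
        z = beta - step * grad
        for g in groups:                            # proximal step: block soft-threshold each group
            beta[g] = group_soft_threshold(z[g], step * lam * np.sqrt(len(g)))
    return beta

# Toy usage: 6 predictors in two groups of 3; only the first group drives y
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = X[:, :3] @ np.array([2.0, -1.0, 1.5]) + rng.normal(scale=0.5, size=100)
groups = [np.arange(0, 3), np.arange(3, 6)]
print(np.round(group_lasso(X, y, groups, lam=20.0), 3))  # with a large enough lam, group 2 is zeroed out as a block
```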

Shrinkage Methods
Group Lasso
Sparse Group Lasso
Sparse Group Lasso formulation
- We denote X as being composed of J groups X_1, X_2, ..., X_J, with p_j denoting the size of group j; i.e., Σ_j p_j = P.
min_{β ∈ R^P} ∥ y − Σ_{j=1}^{J} X_j β_j ∥²_2 + λ_1 Σ_{j=1}^{J} √p_j ∥β_j∥_2 + λ_2 ∥β∥_1
- Sparse group lasso yields sparsity at both the group and individual feature levels.
(Monash) FIT5149 37 / 38
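Continuing the hedged proximal-gradient sketch above (same assumptions, my own naming), the sparse group lasso only changes the proximal step: soft-threshold each coefficient for the L1 part, then block soft-threshold each group. The helper below would replace the group-only proximal step inside the loop of the earlier sketch.

```python
import numpy as np

def sparse_group_prox(z, step, lam_group, lam_l1, groups):
    """Proximal step for  lam_group * sum_j sqrt(p_j) * ||beta_j||_2  +  lam_l1 * ||beta||_1:
    elementwise soft-thresholding first, then block soft-thresholding per group."""
    beta = np.sign(z) * np.maximum(np.abs(z) - step * lam_l1, 0.0)   # L1 shrinkage (feature-level sparsity)
    for g in groups:                                                 # then group-level shrinkage
        norm = np.linalg.norm(beta[g])
        thresh = step * lam_group * np.sqrt(len(g))
        beta[g] = np.zeros(len(g)) if norm <= thresh else (1.0 - thresh / norm) * beta[g]
    return beta
```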

Summary
Summary
Model selection methods are an essential tool for data analysis, especially
for big datasets involving many predictors.
Reading materials:
- "Linear Model Selection and Regularization", Chapter 6 of "An Introduction to Statistical Learning"
  - Section 6.1 "Subset Selection"
  - Section 6.2 "Shrinkage Methods"
References:
- Figures in this presentation were taken from "An Introduction to Statistical Learning, with Applications in R" (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.
- Some of the slides are reproduced based on the slides from T. Hastie and R. Tibshirani.
(Monash) FIT5149 38 / 38