Statistical Inference STAT 431
Lecture 19: Multiple Regression (VI) Variable Selection
Which Predictors to Use?
• So far, we have assumed that the set of predictor variables to be included in the model is given
• In practice, we typically have a large set of candidate predictors. The key question is: which predictors shall we use in the model?
• From a practical point of view
– Too many predictors → hard to interpret
– Too few predictors → bad prediction
• From a theoretical point of view
– Too many predictors → low bias, high variance
– Too few predictors → high bias, low variance
• This class: systematic approaches for variable selection
Forward Selection
• Idea: start with the most parsimonious model (Yi = β0 + εi); add predictors one at a time until no further addition significantly improves the fit.
• A forward selection step consists of two tasks:
1. Consider all models obtained by adding one more predictor to the current model. For each variable xj not already included, calculate the P-value obtained from the F-test comparing the current model with the new model where xj is added. Identify the variable with the smallest P-value.
2. If the smallest P-value is less than a pre-specified cutoff value (e.g., 0.05), add that predictor to form a new current model.
• Forward selection starts with Yi = β0 + εi as the initial current model, and repeats tasks 1 and 2 until no additional explanatory variables can be added.
• Note: dummy variables representing a categorical predictor should be added as a single unit
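The two tasks above can be sketched as code. This is an illustrative numpy sketch (function names are mine, not from the lecture): since each comparison adds exactly one predictor (1 numerator df), choosing the smallest P-value is equivalent to choosing the largest partial F statistic, so the sketch compares F statistics against a fixed cutoff `f_crit` instead of computing P-values.

```python
import numpy as np

def sse(y, X, cols):
    """Error sum of squares of the least-squares fit of y on an
    intercept plus the predictor columns listed in `cols`."""
    Z = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return float(resid @ resid)

def forward_select(y, X, f_crit=4.0):
    """Forward selection: repeatedly add the candidate predictor with the
    largest partial F statistic (equivalently, the smallest P-value for a
    1-df comparison) until no statistic exceeds f_crit."""
    n, k = X.shape
    current = []
    while True:
        sse_cur = sse(y, X, current)
        best_j, best_f = None, 0.0
        for j in range(k):
            if j in current:
                continue
            # Task 1: fit the model with xj added, compute its partial F
            sse_new = sse(y, X, current + [j])
            df_resid = n - (len(current) + 1) - 1
            f_stat = (sse_cur - sse_new) / (sse_new / df_resid)
            if f_stat > best_f:
                best_j, best_f = j, f_stat
        # Task 2: add the best candidate only if it clears the cutoff
        if best_j is None or best_f < f_crit:
            return current
        current.append(best_j)
```

With a strong signal in two columns and pure noise elsewhere, the loop picks up the signal columns and then stops.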
• To perform a forward selection step in R, one could use the add1 command.
• Here is an illustration based on the house price of zip 30062 data set (see R code)

> house <- read.csv("HousePrices.csv", header = TRUE)
> fit0 <- lm(log(Price) ~ 1, data = house)
> add1(fit0, ~ Age + BLDSQFT + LOTSQFT + GARSQFT + BEDRMS + BATHS, test = "F")
Single term additions

Model:
log(Price) ~ 1
        Df Sum of Sq    RSS      AIC  F value  Pr(F)
<none>               50.271  -949.34
Age      1    21.448 28.824 -1191.53 325.1700 <2e-16 ***
BLDSQFT  1    33.996 16.275 -1442.45 912.8311 <2e-16 ***
LOTSQFT  1     0.035 50.236  -947.65   0.3037 0.5818
GARSQFT  1     0.015 50.256  -947.47   0.1306 0.7180
BEDRMS   1    19.920 30.351 -1168.86 286.8151 <2e-16 ***
BATHS    1    17.659 32.612 -1137.32 236.6270 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note: fit0 is the initial model; the formula passed to add1 specifies the largest possible model under consideration.
Backward Elimination
• Idea: start with the model including all possible predictors; eliminate predictors one at a time until all the remaining predictors are significant.
• A backward elimination step also consists of two tasks:
1. For each variable in the current model, compute the P-value of testing its significance. Identify the variable with the largest P-value, i.e., the one that is least significant.
2. If the largest P-value is greater than a pre-specified cutoff value (e.g., 0.05), then remove that predictor to obtain a new current model.
• Backward elimination starts with the full model as the initial current model, and repeats tasks 1 and 2 until none of the remaining variables can be removed.
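The elimination loop mirrors forward selection in reverse. A numpy sketch under the same conventions as before (names are mine; the least significant variable is the one whose removal yields the smallest partial F statistic, i.e., the largest P-value):

```python
import numpy as np

def sse(y, X, cols):
    """Error sum of squares of the least-squares fit of y on an
    intercept plus the predictor columns listed in `cols`."""
    Z = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return float(resid @ resid)

def backward_eliminate(y, X, f_crit=4.0):
    """Backward elimination: repeatedly drop the predictor with the
    smallest partial F statistic (the largest P-value) until every
    remaining predictor's statistic exceeds f_crit."""
    n, k = X.shape
    current = list(range(k))          # start from the full model
    while current:
        sse_cur = sse(y, X, current)
        df_resid = n - len(current) - 1
        worst_j, worst_f = None, np.inf
        for j in current:
            # Task 1: partial F for dropping xj from the current model
            sse_red = sse(y, X, [c for c in current if c != j])
            f_stat = (sse_red - sse_cur) / (sse_cur / df_resid)
            if f_stat < worst_f:
                worst_j, worst_f = j, f_stat
        # Task 2: remove the least significant variable, if insignificant
        if worst_f > f_crit:
            break
        current.remove(worst_j)
    return current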
• To perform a backward elimination step in R, one could use the drop1 command.
• Here is an illustration based on the house price of zip 30062 data set (see R code)

> fitb0 <- lm(log(Price) ~ Age + BLDSQFT + LOTSQFT + GARSQFT + BEDRMS + BATHS,
+             data = house)
> drop1(fitb0, test = "F")
Single term deletions

Model:
log(Price) ~ Age + BLDSQFT + LOTSQFT + GARSQFT + BEDRMS + BATHS
        Df Sum of Sq    RSS     AIC  F value     Pr(F)
<none>               13.327 -1520.2
Age      1    1.6396 14.966 -1471.2  53.1505 1.484e-12 ***
BLDSQFT  1    7.5340 20.861 -1325.5 244.2249 < 2.2e-16 ***
LOTSQFT  1    0.1242 13.451 -1518.1   4.0247 0.0454606 *
GARSQFT  1    0.2567 13.583 -1513.8   8.3220 0.0041128 **
BEDRMS   1    0.3494 13.676 -1510.8  11.3270 0.0008322 ***
BATHS    1    0.0216 13.348 -1521.5   0.7001 0.4032149
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Stepwise Regression
• Idea: mix addition and deletion in the selection procedure
• Stepwise regression:
– Initialize with the model Yi = β0 + εi;
– Repeat the following two steps until no predictors can be added or removed:
1. Do one step of forward selection;
2. Do one step of backward elimination.
• Note: it is important to specify the largest model to be considered!
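The alternation of the two steps can be sketched as follows (an illustrative numpy sketch, not the lecture's R code; a hard iteration cap guards against the rare case where additions and deletions cycle):

```python
import numpy as np

def sse(y, X, cols):
    """Error sum of squares: least-squares fit of y on an intercept
    plus the predictor columns in `cols`."""
    Z = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return float(resid @ resid)

def partial_f(y, X, small, big):
    """Partial F statistic (1 numerator df) comparing the model `small`
    with the model `big`, which has one extra predictor."""
    s_small, s_big = sse(y, X, small), sse(y, X, big)
    return (s_small - s_big) / (s_big / (len(y) - len(big) - 1))

def stepwise(y, X, f_in=4.0, f_out=4.0):
    """Alternate one forward-selection step and one backward-elimination
    step until neither changes the current model."""
    k = X.shape[1]
    current = []
    for _ in range(4 * k):            # safety cap against cycling
        changed = False
        # forward step: add the most significant absent predictor
        candidates = [(partial_f(y, X, current, current + [j]), j)
                      for j in range(k) if j not in current]
        if candidates:
            f_stat, j = max(candidates)
            if f_stat > f_in:
                current.append(j)
                changed = True
        # backward step: drop the least significant included predictor
        if current:
            stats = [(partial_f(y, X, [c for c in current if c != j], current), j)
                     for j in current]
            f_stat, j = min(stats)
            if f_stat < f_out:
                current.remove(j)
                changed = True
        if not changed:
            break
    return sorted(current)
```

Taking f_in = f_out avoids the obvious oscillation where a variable is added and immediately removed.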
Reflections on Stepwise Procedures
• Forward selection, backward elimination, and stepwise regression can lead to different final models
• "Adjacent" models are compared by hypothesis testing
– So they are not treated on an equal footing [Recall that H0 and H1 are not treated equally!]
– The test is concerned with lack of fit, but does not take into account the complexity of the model (i.e., the number of predictors included)
– No linear ranking for three models or more
• Possible improvements
– If the total number of candidate predictors k is not too large, we might want to consider all models that use a subset of them (there are 2^k models in total!)
– When comparing models, take both lack of fit and complexity into account
– Attach a number to each model so that a list of candidate models can be linearly ranked
Information Criteria
• A model M is specified by a subset of all candidate predictors
• Lack of fit of the model can be measured by its error sum of squares, SSE_M
• Complexity of the model can be measured by the number of predictors used in the model, p_M
• Information criteria try to combine both lack of fit and complexity when comparing different candidate models
• The three most widely used information criteria are:
1. AIC [Akaike Information Criterion]: AIC(M) = n log(SSE_M / n) + 2(p_M + 1)
2. BIC [Bayesian Information Criterion]: BIC(M) = n log(SSE_M / n) + log(n)(p_M + 1)
3. Mallows' Cp criterion: Cp(M) = SSE_M / s^2 + 2(p_M + 1) - n, where s^2 is the MSE from the full model which includes all the predictors
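All three criteria are simple functions of SSE_M, p_M, n, and the full-model MSE, so they are easy to compute directly from a fitted model's SSE. A minimal sketch of the three formulas (the function name is mine):

```python
import numpy as np

def info_criteria(sse_m, p_m, n, s2_full):
    """AIC, BIC and Mallows' Cp for a model M with error sum of squares
    sse_m and p_m predictors, fit to n observations; s2_full is the MSE
    of the full model (used only by Cp)."""
    aic = n * np.log(sse_m / n) + 2 * (p_m + 1)
    bic = n * np.log(sse_m / n) + np.log(n) * (p_m + 1)
    cp = sse_m / s2_full + 2 * (p_m + 1) - n
    return aic, bic, cp
```

A useful sanity check: for the full model itself, SSE_M = (n - k - 1) s^2, so Cp collapses to k + 1 regardless of the data. Also note that for n > e^2 ≈ 7.4, BIC penalizes each predictor more heavily than AIC.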
AIC(M) = n log(SSE_M / n) + 2(p_M + 1)
BIC(M) = n log(SSE_M / n) + log(n)(p_M + 1)
Cp(M) = SSE_M / s^2 + 2(p_M + 1) - n
• Comments on the criteria
– In each case, the smaller the value of the criterion, the better the model
– Using each criterion, we can linearly rank a list of candidate models
– When the number of predictors is fixed, the three criteria lead to the same "best" model: the one with the smallest SSE
– AIC is the most commonly used criterion in multiple regression
– Cp can be interpreted as an estimator of the prediction error [see textbook]
• We can use the information criteria in stepwise procedures for variable selection
• In R, if we decide to use AIC as the criterion in a stepwise procedure, then we could use the step command [The output can be overwhelming!]

> fit0 <- lm(log(Price) ~ 1, data = house)
> fitb0 <- lm(log(Price) ~ Age + BLDSQFT + LOTSQFT + GARSQFT + BEDRMS + BATHS,
+             data = house)
> house.fs <- step(fit0, scope = list(lower = ~ 1,
+     upper = ~ Age + BLDSQFT + LOTSQFT + GARSQFT + BEDRMS + BATHS),
+     direction = "forward", data = house)
> house.be <- step(fitb0, scope = list(lower = ~ 1,
+     upper = ~ Age + BLDSQFT + LOTSQFT + GARSQFT + BEDRMS + BATHS),
+     direction = "backward", data = house)
> house.st <- step(fit0, scope = list(lower = ~ 1,
+     upper = ~ Age + BLDSQFT + LOTSQFT + GARSQFT + BEDRMS + BATHS),
+     direction = "both", data = house)
All-Subsets Regression
• Suppose we have a set of k candidate predictors; then in total there are 2^k different subsets of it, leading to 2^k different models [the intercept is always included]
• Idea: select the "best" model among the 2^k models → all-subsets regression
– Conceptually, this is feasible, because information criteria allow us to linearly rank all these models
– In practice, it only works when k is not too large
• In R, we can perform all-subsets regression using the regsubsets command in the leaps package
– The function returns the one model with the smallest SSE for any given number of predictors
– We can then choose the final model among these models as the one with the smallest value of a particular information criterion
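The exhaustive search regsubsets performs can be sketched by brute force: enumerate all 2^k subsets, keep the smallest-SSE model of each size, and attach an information criterion (AIC here) to compare across sizes. An illustrative numpy sketch (names are mine; real implementations such as leaps prune the search rather than fitting every subset):

```python
import numpy as np
from itertools import combinations

def sse(y, X, cols):
    """Error sum of squares: least-squares fit of y on an intercept
    plus the predictor columns in `cols`."""
    Z = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return float(resid @ resid)

def best_subsets(y, X):
    """For each model size 0..k, find the subset with the smallest SSE
    (the models regsubsets reports), together with its AIC so that
    models of different sizes can be compared."""
    n, k = X.shape
    best = {}
    for size in range(k + 1):
        for cols in combinations(range(k), size):
            s = sse(y, X, list(cols))
            if size not in best or s < best[size][1]:
                best[size] = (cols, s)
    return {size: (cols, n * np.log(s / n) + 2 * (size + 1))
            for size, (cols, s) in best.items()}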
> house.subset <- regsubsets(log(Price) ~ Age + BLDSQFT + LOTSQFT + GARSQFT + BEDRMS +
+     BATHS, data = house)
> hsub <- summary(house.subset)
> hsub
...
Selection Algorithm: exhaustive
         Age BLDSQFT LOTSQFT GARSQFT BEDRMS BATHS
1  ( 1 ) " " "*"     " "     " "     " "    " "
2  ( 1 ) "*" "*"     " "     " "     " "    " "
3  ( 1 ) "*" "*"     " "     " "     "*"    " "
4  ( 1 ) "*" "*"     " "     "*"     "*"    " "
5  ( 1 ) "*" "*"     "*"     "*"     "*"    " "
6  ( 1 ) "*" "*"     "*"     "*"     "*"    "*"

The best subset of predictors of each given size is indicated by *.

> hsub$cp
[1] 92.573539 25.546276 13.123846  8.014674  5.700094  7.000000

Choose the final model by checking the Cp values of the best model of each size [one could also use BIC or AIC].
Variable Selection and Significance
• All variable selection methods can overstate significance
• An experiment: we generate a data set of 100 observations in the file sim.csv
– In each observation, the response yi and 30 predictors xi1, ..., xi30 are i.i.d. standard normal
– The correct model is Yi = 0 + εi
• The F-test for H0: β1 = ··· = β30 = 0 has P-value 0.66, so the null hypothesis is not rejected
• We apply all-subsets selection based on the Cp criterion, and the "best" model includes three predictors: {x12, x20, x29}
• Fit the multiple regression with these three predictors. The F-test for testing all slopes = 0 returns P-value 0.03! The null hypothesis (which is true) is rejected.
– The variable selection overstates the significance of the three selected predictors!
• What impact does such an overstatement have?
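The mechanism behind the overstatement can be seen even without running a regression: on pure noise, each individual sample correlation with y has typical size about 1/sqrt(n), but selection in effect keeps the predictors with the largest sample association, and the maximum over many candidates is biased upward. A small Python sketch (the seed is arbitrary and the noise is regenerated here rather than read from sim.csv, so the numbers differ from the slide's):

```python
import numpy as np

# 100 observations, 30 pure-noise predictors, response independent of all
rng = np.random.default_rng(431)   # arbitrary seed
n, k = 100, 30
X = rng.normal(size=(n, k))
y = rng.normal(size=n)

# Absolute correlation of y with each predictor.  Each one is roughly
# |N(0, 1/n)| in size, but selection keeps the LARGEST few, and the
# maximum of 30 such correlations is systematically inflated -- which is
# why the selected predictors then pass a naive significance test.
corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(k)])
print(corr.max())   # noticeably larger than the typical 1/sqrt(n) = 0.1
```

A significance test applied after selection ignores this maximization step, so its nominal P-value is too small.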
• Key points of this class
– Variable selection: stepwise procedures / all-subsets
– Information criteria: AIC / BIC / Mallows' Cp
– Variable selection methods overstate significance
• Read Section 11.7 of the textbook