
Statistical Inference STAT 431
Lecture 19: Multiple Regression (VI) Variable Selection

Which Predictors to Use?
• So far, we have assumed that the set of predictor variables to be included in the
model is given
• In practice, we typically have a large set of candidate predictors.
The key question is: which predictors shall we use in the model?
• From a practical point of view
– Too many predictors → hard to interpret
– Too few predictors → bad prediction
• From a theoretical point of view
– Too many predictors → low bias, high variance
– Too few predictors → high bias, low variance
• This class: systematic approaches for variable selection

Forward Selection
• Idea: start with the most parsimonious model (Yi = β0 + εi); add predictors one at a time until no further addition significantly improves the fit.
• A forward selection step consists of two tasks:
1. Consider all models obtained by adding one more predictor to the current model. For each variable xj not already included, calculate the P-value obtained from the F-test comparing the current model with the new model where xj is added. Identify the variable with the smallest P-value.
2. If the smallest P-value is less than a pre-specified cutoff value (e.g., 0.05), add that predictor to form a new current model.
• Forward selection starts with Yi = β0 + εi as the initial current model, and repeats tasks 1 and 2 until no additional explanatory variables can be added.
• Note: dummy variables representing a categorical predictor should be added as a single unit
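The forward-selection loop above can be sketched in a few lines. Below is a minimal Python sketch (the lecture's own examples use R); like R's step command shown later in this lecture, it adds predictors by AIC improvement rather than by F-test P-values, and the data are purely hypothetical.

```python
import numpy as np

def aic(y, X):
    # AIC(M) = n log(SSE_M / n) + 2 (p_M + 1); X already contains the intercept column
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = float(np.sum((y - X @ beta) ** 2))
    return n * np.log(sse / n) + 2 * X.shape[1]

def forward_select(y, candidates):
    # Greedy forward selection: at each step add the candidate that lowers AIC most
    n = len(y)
    chosen, current = [], np.ones((n, 1))      # start from the model Y_i = b0 + e_i
    best = aic(y, current)
    improved = True
    while improved:
        improved = False
        for name, col in candidates.items():
            if name in chosen:
                continue
            trial_aic = aic(y, np.column_stack([current, col]))
            if trial_aic < best:
                best, pick, improved = trial_aic, name, True
        if improved:
            chosen.append(pick)
            current = np.column_stack([current, candidates[pick]])
    return chosen

# Hypothetical data: y depends on x1 and x2 only; x3 is pure noise
rng = np.random.default_rng(0)
n = 200
x1, x2, x3 = rng.normal(size=(3, n))
y = 2.0 + 1.5 * x1 - 0.8 * x2 + rng.normal(scale=0.5, size=n)
print(forward_select(y, {"x1": x1, "x2": x2, "x3": x3}))
```

Since x1 has the largest effect, it is added first; the loop stops once no remaining candidate lowers the AIC.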

• To perform a forward selection step in R, one could use the add1 command.
• Here is an illustration based on the house price of zip 30062 data set (see R code)

> house <- read.csv("HousePrices.csv", header = TRUE)
> fit0 <- lm(log(Price) ~ 1, data = house)
> add1(fit0, ~ Age + BLDSQFT + LOTSQFT + GARSQFT + BEDRMS + BATHS, test = "F")
Single term additions

Model:
log(Price) ~ 1
        Df Sum of Sq    RSS      AIC  F value  Pr(F)
<none>               50.271  -949.34
Age      1    21.448  28.824 -1191.53 325.1700 <2e-16 ***
BLDSQFT  1    33.996  16.275 -1442.45 912.8311 <2e-16 ***
LOTSQFT  1     0.035  50.236  -947.65   0.3037 0.5818
GARSQFT  1     0.015  50.256  -947.47   0.1306 0.7180
BEDRMS   1    19.920  30.351 -1168.86 286.8151 <2e-16 ***
BATHS    1    17.659  32.612 -1137.32 236.6270 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

[Here fit0 is the initial model, and the formula gives the largest possible model under consideration]

Backward Elimination
• Idea: start with the model including all possible predictors; eliminate predictors one at a time until all the remaining predictors are significant.
• A backward elimination step also consists of two tasks:
1. For each variable in the current model, compute the P-value of testing its significance. Identify the variable with the largest P-value, i.e., the one that is least significant.
2. If the largest P-value is greater than a pre-specified cutoff value (e.g., 0.05), then remove that predictor to obtain a new current model.
• Backward elimination starts with the full model as the initial current model, and repeats tasks 1 and 2 until none of the remaining variables can be removed.

• To perform a backward elimination in R, one could use the drop1 command.
• Here is an illustration based on the house price of zip 30062 data set (see R code)

> fitb0 <- lm(log(Price) ~ Age + BLDSQFT + LOTSQFT + GARSQFT + BEDRMS + BATHS, data = house)
> drop1(fitb0, test = "F")
Single term deletions

Model:
log(Price) ~ Age + BLDSQFT + LOTSQFT + GARSQFT + BEDRMS + BATHS
        Df Sum of Sq    RSS     AIC  F value     Pr(F)
<none>               13.327 -1520.2
Age      1    1.6396  14.966 -1471.2  53.1505 1.484e-12 ***
BLDSQFT  1    7.5340  20.861 -1325.5 244.2249 < 2.2e-16 ***
LOTSQFT  1    0.1242  13.451 -1518.1   4.0247 0.0454606 *
GARSQFT  1    0.2567  13.583 -1513.8   8.3220 0.0041128 **
BEDRMS   1    0.3494  13.676 -1510.8  11.3270 0.0008322 ***
BATHS    1    0.0216  13.348 -1521.5   0.7001 0.4032149
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Stepwise Regression
• Idea: mix addition and deletion in the selection procedure
• Stepwise regression
– Initialize with the model Yi = β0 + εi;
– Repeat the following two steps until no predictors can be added or removed:
1. Do one step of forward selection;
2. Do one step of backward elimination;
• Note: it is important to specify the largest model to be considered!

Reflections on Stepwise Procedures
• Forward selection, backward elimination, and stepwise regression can lead to different final models
• "Adjacent" models are compared by hypothesis testing
– So they are not treated on an equal footing [Recall that H0 and H1 are not treated equally!]
– The test is concerned with lack of fit, but does not take into account the complexity of the model (i.e., the number of predictors included)
– No linear ranking for three models or more
• Possible improvements
– If the total number of candidate predictors is not too large, we might want to consider all models that use a subset of them (there are 2^k models in total!)
– When comparing models, take both lack of fit and complexity into account
– Attach to each model a number, to linearly rank a list of candidate models

Information Criteria
• A model M is specified by a subset of all candidate predictors
• Lack of fit of the model could be measured by its SSE: SSE_M
• Complexity of the model could be measured by the number of predictors used in the model: p_M
• Information criteria try to combine both lack of fit and complexity when comparing different candidate models
• Three most widely used information criteria are:
1. AIC [Akaike Information Criterion]: AIC(M) = n log(SSE_M/n) + 2(p_M + 1)
2. BIC [Bayesian Information Criterion]: BIC(M) = n log(SSE_M/n) + log(n)(p_M + 1)
3.
Mallow's Cp Criterion: Cp(M) = SSE_M/s² + 2(p_M + 1) − n
– s² is the MSE from the full model, which includes all the predictors

AIC(M) = n log(SSE_M/n) + 2(p_M + 1)
BIC(M) = n log(SSE_M/n) + log(n)(p_M + 1)
Cp(M) = SSE_M/s² + 2(p_M + 1) − n
• Comments on the criteria
– In each case, the smaller the value of the criterion, the better the model
– Using each criterion, we can linearly rank a list of candidate models
– When the number of predictors is fixed, the three criteria lead to the same "best" model – the one with the smallest SSE
– AIC is the most commonly used criterion in multiple regression
– Cp can be interpreted as an estimator of the prediction error [see textbook]
• We can use the information criteria in stepwise procedures for variable selection

• In R, if we decide to use AIC as the criterion in a stepwise procedure, then we could use the step command [The output can be overwhelming!]

> fit0 <- lm(log(Price) ~ 1, data = house)
> fitb0 <- lm(log(Price) ~ Age + BLDSQFT + LOTSQFT + GARSQFT + BEDRMS + BATHS,
+   data = house)
> house.fs <- step(fit0, scope = list(lower = ~ 1,
+   upper = ~ Age + BLDSQFT + LOTSQFT + GARSQFT + BEDRMS + BATHS),
+   direction = "forward", data = house)
> house.be <- step(fitb0, scope = list(lower = ~ 1,
+   upper = ~ Age + BLDSQFT + LOTSQFT + GARSQFT + BEDRMS + BATHS),
+   direction = "backward", data = house)
> house.st <- step(fit0, scope = list(lower = ~ 1,
+   upper = ~ Age + BLDSQFT + LOTSQFT + GARSQFT + BEDRMS + BATHS),
+   direction = "both", data = house)

All-Subsets Regression
• Suppose we have a set of k candidate predictors; then in total there are 2^k different subsets of it, leading to 2^k different models [the intercept is always included]
• Idea: select the "best" model among the 2^k models → all-subsets regression
– Conceptually, this is feasible, because information criteria allow us to linearly rank all these models
– In practice, this only works when k is not too large
• In R, we can perform all-subsets regression using the regsubsets command in the leaps package
– The function returns one model with the smallest SSE for any given number of predictors
– We can then choose the final model among these models as the one with the smallest value of a particular information criterion

> house.subset <- regsubsets(log(Price) ~ Age + BLDSQFT + LOTSQFT + GARSQFT + BEDRMS +
+   BATHS, data = house)
> hsub <- summary(house.subset)
> hsub

Selection Algorithm: exhaustive
         Age BLDSQFT LOTSQFT GARSQFT BEDRMS BATHS
1  ( 1 ) " " "*"     " "     " "     " "    " "
2  ( 1 ) "*" "*"     " "     " "     " "    " "
3  ( 1 ) "*" "*"     " "     " "     "*"    " "
4  ( 1 ) "*" "*"     " "     "*"     "*"    " "
5  ( 1 ) "*" "*"     "*"     "*"     "*"    " "
6  ( 1 ) "*" "*"     "*"     "*"     "*"    "*"

[The best subset of predictors for each given size is indicated by *]

> hsub$cp
[1] 92.573539 25.546276 13.123846  8.014674  5.700094  7.000000

[Choose the final model by checking the Cp values for each of the best models of a given size; here the five-predictor model attains the smallest Cp. One could also use BIC or AIC]
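The three criteria are simple functions of SSE_M, n, and p_M, so the ranking can be reproduced by hand. A minimal Python sketch of the formulas above (the sample size n below is hypothetical; the SSE is the full-model value 13.327 from the drop1 output):

```python
import numpy as np

# The three information criteria from the slides; p = number of predictors in M
def aic(sse, n, p):
    return n * np.log(sse / n) + 2 * (p + 1)

def bic(sse, n, p):
    return n * np.log(sse / n) + np.log(n) * (p + 1)

def cp(sse, s2, n, p):        # s2 = MSE of the full model
    return sse / s2 + 2 * (p + 1) - n

# Sanity check: for the full model itself, SSE/s2 = n - k - 1, so Cp
# collapses to k + 1 regardless of the data
n, k, sse_full = 100, 6, 13.327           # n is hypothetical; SSE from drop1
print(cp(sse_full, sse_full / (n - k - 1), n, k))   # prints ≈ 7.0
```

This explains the 7.000000 at the end of hsub$cp: with k = 6 candidate predictors, the full model always has Cp = k + 1 = 7, so only the smaller models carry real information in that column.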

Variable Selection and Significance
• All variable selection methods can overstate significance
• An experiment: we generate a data set of 100 observations in the file sim.csv
– In each observation, the response yi and 30 predictors xi1, . . . , xi30 are i.i.d.
standard normal
– The correct model is Yi = 0 + εi
• The F-test for H0 : β1 = ··· = β30 = 0 has P-value 0.66, so the null hypothesis is not rejected.
• We apply all-subsets selection based on the Cp criterion, and the "best" model includes three predictors: {x12, x20, x29}
• Fit the multiple regression with these three predictors. The F-test for testing all slopes = 0 returns P-value 0.03! The null hypothesis (which is true) is rejected.
– The variable selection overstates the significance of the three selected predictors!
• What impact does such an overstatement have?
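The experiment above can be replayed without sim.csv. Below is a minimal Python sketch (hypothetical seed, freshly generated noise rather than the actual sim.csv data): it searches all three-predictor subsets for the smallest SSE, then computes the overall F statistic of the selected model, which is mechanically inflated relative to an arbitrary three-predictor model.

```python
from itertools import combinations
import numpy as np

# Hypothetical replay of the slide's experiment: the true model is Y_i = 0 + e_i,
# so every predictor is pure noise
rng = np.random.default_rng(431)
n, k = 100, 30
X = rng.normal(size=(n, k))
y = rng.normal(size=n)

def sse(cols):
    # SSE of the model with an intercept plus the predictors listed in `cols`
    Z = np.column_stack([np.ones(n), X[:, list(cols)]])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return float(np.sum((y - Z @ beta) ** 2))

def overall_f(s):
    # F statistic for H0: all 3 slopes are 0, against the intercept-only model
    sse0 = float(np.sum((y - y.mean()) ** 2))
    return ((sse0 - s) / 3) / (s / (n - 4))

best = min(combinations(range(k), 3), key=sse)   # all-subsets search of size 3
f_selected = overall_f(sse(best))
f_arbitrary = overall_f(sse((0, 1, 2)))          # same test on a fixed subset
print(best, round(f_selected, 2), round(f_arbitrary, 2))
```

Because `best` minimizes SSE over all 4060 three-predictor subsets, its F statistic is at least as large as that of any fixed subset; reporting its P-value as if the subset had been chosen in advance is exactly the overstatement the slide warns about.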

• Key points of this class
– Variable selection: stepwise procedures / all-subsets
– Information criteria: AIC / BIC / Mallow's Cp
– Variable selection methods overstate significance
• Read Section 11.7 of the textbook