STAT 614 Week 13: Non-nested Comparisons
Stepwise Fits, step(), and Criteria
Richard Ressler 2019-11-20
1
Learning Outcomes
• Choose Between Non-nested Models • References: Sections 10.4.1 and 12.4
2
Case Study and EDA
Case Study: Sex Descrimination
• Same study as in Case Study 0102
• Looked at beginning salary at a bank with respect to sex. • Want to control for many different variables.
3
Case Study: Sex Descrimination
library(Sleuth3) data(case1202) head(case1202)
## Bsal Sal77 Sex Senior Age Educ Exper
## 1 5040 12420 Male
## 2 6300 12060 Male
## 3 6000 15120 Male
## 4 6000 16320 Male
## 5 6000 12300 Male
## 6 6840 10380 Male
96 329
82 357
67 315
97 354
66 351
92 374
15 14.0
15 72.0
15 35.5
12 24.0
12 56.0
15 41.5
4
EDA
library(GGally)
ggpairs(case1202, columns = c(1, 2, 4, 5, 6, 7),
6e−04 4e−04 2e−04 0e+00
15000
12500
10000
7500
90 80 70
700
600
500
400
300
16 14 12 10
400
300
200
100
aes(size = I(0.1)))
8
Bsal Sal77
Senior
Age Educ
Exper
0
4050607080705100102015050000708090 30405060700 8101214160102030400
BsalSal77SeniorAge EducExper
5
EDA – Log Transform Salary
case1202$logBsal <- log(case1202$Bsal) ggpairs(case1202, columns = c(2, 4, 5, 6, 7, 8),
aes(size = I(0.1)))
Sal77 Senior Age Educ Exper logBsal
0.00025 0.00020 0.00015 0.00010 0.00005 0.00000
90 80 70
700
600
500
400
300
16 14 12 10
400
300
200
100
9.0 8.8 8.6 8.4
7510102015050000708090 30405060700 81012141601020304008.84.86.98.0
8
0
Sal77SeniorAge EducExpelrogBsal
6
EDA Summary
• Loging Bsal seems to help a lot.
• Age and Experience might need a quadratic transformation.
7
Step-wise Procedures (Section 12.3)
Step-wise Regression
• Start with a complicated model.
• Look at p-values (when testing HO that a coefficient is 0)
• Drop the one with the largest p-value.
• Continue until all p-values are less than some threshold (usually 0.05).
• Note, this process means we cannot interpret p-values and CIs in the usual way. Why?
8
Step-wise Regression, the manual way
case1202$Age2 <- case1202$Age ^ 2 case1202$Exper2 <- case1202$Exper ^ 2
lm1 <- lm(logBsal ~ Senior + Age + Age2 +
Educ + Exper + Exper2, data = case1202)
9
Step-wise Regression, the manual way
coef(summary(lm1))
## Estimate Std. Error t value P
## (Intercept) 8.629509e+00 2.138580e-01 40.35158181 1.155
## Senior
## Age
## Age2
## Educ
## Exper
## Exper2
-3.242477e-03 1.149520e-03 -2.82072176 5.948
-3.093674e-04 9.166743e-04 -0.33748887 7.365
-2.788479e-08 8.828163e-07 -0.03158617 9.748
2.062741e-02 5.095469e-03 4.04818633 1.124
1.959782e-03 6.091284e-04 3.21735357 1.824
-4.098148e-06 1.657324e-06 -2.47274930 1.537
• Drop Age2 (p-value of 0.97)
10
Step-wise Regression, the manual way
## Estimate Std. Error t value Pr(
## (Intercept) 8.634793e+00 1.324347e-01 65.200375 1.20735
## Senior
## Age
## Educ
## Exper
## Exper2
-3.234426e-03 1.114447e-03 -2.902270 4.69264
-3.379533e-04 1.449062e-04 -2.332221 2.19983
2.065267e-02 5.003349e-03 4.127768 8.37213
1.970388e-03 5.052818e-04 3.899582 1.89185
-4.130283e-06 1.300768e-06 -3.175265 2.07105
11
Step-wise Regression
• Can also start at the simplest model,
• add the variable with the smallest p-value
• continue until no new variables have a p-value less than 0.05
• Can also both add and drop variables based on p-values at each step.
12
Step-wise Regression in R
• Use the step() function to do this automatically
• It actually uses AIC (not p-values) to choose between models,
but the idea is similar. See later for AIC.
• The model fitting must apply the models to the same dataset.
This may be a problem if there are missing values. Remove rows with missing values for any of the variables first.
lm1 <- lm(logBsal ~ Senior + Age + Age2 + Senior + Educ + Exper + Exper2,
data = case1202)
stepout <- step(object = lm1, trace = FALSE) stepout
##
## Call:
## lm(formula = logBsal ~ Senior + Age + Educ + Exper + Exper2,
## data = case1202)
##
## Coefficients:
13
Step-wise Regression in R
• The output of step() is also an lm object, so you can get coefficients, p-values, confidence intervals, fits, predictions, residuals, etc directly from it.
## # A tibble: 6 x 6
## term estimate statistic p.value conf.low
##
## 1 (Intercept) 8.63
## 2 Senior -0.00323
## 3 Age -0.000338
## 4 Educ 0.0207
## 5 Exper 0.00197
## 6 Exper2 -0.00000413
65.2 1.21e-75 8.37
-2.90 4.69e- 3 -0.00545
-2.33 2.20e- 2 -0.000626
4.13 8.37e- 5 0.0107
3.90 1.89e- 4 0.000966 -3.18 2.07e- 3 -0.0000016472
#confint(stepout)
tidy(stepout, conf.int = TRUE)[c(1,2,4,5,6,7)]
Comparing Non-nested Models (Section 12.4)
Motivation
• What if we want to decide between the following two models • μ(logBsal|…) = Senior + Educ + Exper + Exper2
• μ(logBsal|…) = Senior + Educ + Age + Age2
• These models are non-nested, so we cannot apply F-test techniques to them.
15
BIC and AIC
• BIC (Bayesian Information Criterion) and AIC (Akaike Information Criterion) return the log of the sum of square residuals plus a penalty due to the number of parameters (p) in the model.
• Best model has the smallest BIC or AIC.
• BIC: n log(SSR/n) + log(n)(p + 1)
• AIC: n log(SSR/n) + 2(p + 1)
• BIC penalizes more when the sample size is larger.
• BIC is better for model selection (get interpretable model), AIC is better for prediction (goal is prediction).
16
BIC and AIC in R
• Fit both models, then use the AIC() and BIC() functions.
lm_mod1 <- lm(logBsal ~ Senior + Educ + Exper + Exper2, data = case1202)
lm_mod2 <- lm(logBsal ~ Senior + Educ + Age + Age2, data = case1202)
cbind(BIC(lm_mod1), BIC(lm_mod2))
## [,1] [,2]
## [1,] -131.2381 -123.5927
cbind(AIC(lm_mod1), AIC(lm_mod2))
## [,1] [,2]
## [1,] -146.4337 -138.7883
17
Mallow’s Cp statistic
• Focuses on the bias due to exlcuding important variables and the extra variance due to incuding too many
• Bias(Yˆi ) = μ(Yˆi ) − μ(Yi ) - difference between mean of the estimate and the true mean
• MSE (Yˆi ) = Bias (Yˆi )2 + Var (Yˆi ) : The Mean Squared Error
• TMSE = ni=1 MSE(Yˆi) : The Total Mean Squared Error for
the sample
• We want to minimize the TMSE, but we don’t know it. Mallow’s Cp estimates it (assuming no bias).
18
Cp plot
• You obtain Mallow’s Cp for every possible model.
• Only feasible if you have less than p = 10 or so explanatory
variables (2p models are possible).
• Plot Cp on the y-axis and the number of parameters on the
x -axis.
• Models below the y = x line are candidate models
• Models without bias should have a Cp of about p
• So if Cp is below p, the model probably does not have any bias issues.
19
Cp in R using the leaps library leaps() function
• leaps() searches for the best subsets of the X variables for predicting ‘Y’ in linear regression, using an efficient branch-and-bound algorithm.
library(leaps)
lm1 <- lm(logBsal ~ Senior + Age + Age2 +
Educ + Exper + Exper2, data = case1202)
X <- model.matrix(lm1) leapsout <- leaps(x = X,
y = case1202$logBsal, int = FALSE)
20
Cp in R
qplot(leapsout$size, leapsout$Cp, xlab = "Number of Parameters", ylab = "Cp") +
geom_abline(slope = 1, intercept = 0) + scale_y_log1
0() #
21
1e+05 1e+04 1e+03 1e+02 1e+01
246
Number of Parameters
Cp
Cp in R
goodmodel <- leapsout$Cp < 1000 qplot(leapsout$size[goodmodel], leapsout$Cp[goodmodel]
xlab = "Number of Parameters",
ylab = "Cp") +
geom_abline(slope = 1, intercept = 0) + ylim(0, 60)
,
22
60
40
20
0
Cp
Back to Case Study
Back to Case Study
• We chose a model with
μ(logBsal|...) = Senior + Age + Educ + Exper + Exper2
• Now let’s answer the question if Sex is still associated with base salary after adjusting for these variables.
23
Results
## # A tibble: 7 x 5
## term estimate p.value conf.low
conf.high
8.79
1.20e-81 8.35
##
## 1 (Intercept) 8.57
## 2 SexMale 0.140
## 3 Senior -0.00326
## 4 Age -0.0000208 8.72e- 1 -0.000277
## 5 Educ 0.0137 1.79e- 3 0.00526
## 6 Exper 0.00155 4.12e- 4 0.000711
## 7 Exper2 -0.00000413 2.26e- 4 -0.00000626 -0.00000200
5.40e- 9 0.0974
6.28e- 4 -0.00509
0.184
-0.00143
0.000236
0.0222
0.00239
24
lmfinal <- lm(logBsal ~ Sex + Senior + Age + Educ + Exper + Exper2,
data = case1202) #coef(summary(lmfinal))
broom::tidy(lmfinal, conf.int = TRUE)[c(1,2,5,6,7)]
Results
cbind(coef(lmfinal), confint(lmfinal))
## 2.5 % 97.5 %
## (Intercept) 8.567188e+00 8.349190e+00 8.785187e+00
## SexMale
## Senior
## Age
## Educ
## Exper
## Exper2
1.404853e-01 9.741543e-02 1.835552e-01
-3.260895e-03 -5.087075e-03 -1.434716e-03
-2.079316e-05 -2.773778e-04 2.357914e-04
1.373116e-02 5.262387e-03 2.219993e-02
1.549318e-03 7.113471e-04 2.387289e-03
-4.128233e-06 -6.259705e-06 -1.996762e-06
25
Results
## SexMale
## 0.1508322
exp(confint(lmfinal)[2, ]) -1 ## 2.5 % 97.5 %
## 0.1023182 0.2014813
# from Female as base
1/exp(coef(lmfinal)[2])
## SexMale
## 0.8689364
1/exp(confint(lmfinal)[2, ])
## 2.5 % 97.5 %
## 0.9071791 0.8323059
26
exp(coef(lmfinal)[2])-1
Summary
Substantial evidence the bank’s median starting salary for males and females is different after accounting for other factors (p-value of 5.400800e-09). One would expect to see the bank pay new male employees 15% more than new female employees after accounting for seniority, age, education, and experience with a 95% CI of (10% to 20%)
Includes effect size, p-value, and confidence interval.
27