Modeling Issues in Linear Regression
Contents
1 Residuals and Influence Measures 1
1.1 ResidualPlots………………………………………. 2 1.2 IdentifyingandClassifyingUnusualObservations……………………. 9
2 Predictor Transformations 19
3 Omitted Variable Bias 24
4 Irrelevant Variables 25
5 Multicollinearity 27
5.1 DetectingMulticollinearity ……………………………….. 27
6 Model Misspecification: Ramsey RESET 30
7 Model Selection 31
7.1 AkaikeInformationCriterion(AIC) …………………………… 31 7.2 BayesianInformationCriterion(BIC)…………………………… 32 7.3 Mallows’CP……………………………………….. 32 7.4 StepwiseRegression …………………………………… 33
8 Heteroskedasticity 38
8.1 DetectingHeteroskedasticity ………………………………. 38 8.1.1 SpreadLevelPlots ………………………………… 38 8.1.2 ScoreTestsforNonconstantErrorVariance……………………. 39 8.1.3 Goldfeld-QuandtTest ………………………………. 41
8.2 MitigatingHeteroskedasticity………………………………. 42 8.2.1 Heteroskedasticity-ConsistentStandardErrors………………….. 42 8.2.2 Generalized(Weighted)LeastSquares………………………. 43
9 Missing Observations 45
Note: To access your textbook resources type the following on the console:
1 Residuals and Influence Measures
When diagnosing a regression fit, we have to carefully assess the impact that unusual observations may have the fit. Typically, we characterize Unusual observations in terms of outliers, leverage, and influence. Next we discuss each one in detail.
A. Outliers
They are easy to find visually because their y value is relatively large (conditioned on the predictor(s)), given the distribution of other y values in the vicinity.
#library(car)
#carWeb()
1
** ****** * * *
* Outliers *
*
*
***** ***** **
0 5 10 15 20 25 Speed
B. Leverage
In this case we are looking at observations that have an extreme x (i.e., predictor) value relative to the mean.
*
* *
*
* **** * ********
**** *
**
*
High Leverage
*
*
0 5 10 15 20 25 30 Speed
C. Influence
An influential observation can be considered as an outlier with high leverage. Therefore, they can alter the estimates of the regression coefficients. In this case, when we remove them, the regression fit may become more stable.
2
Distance Distance
0 20 40 60 80 0 50 100 150 200
** ****** * * *
** ** **** **
*
**
*
Influential
0 5 10 15 20 25 30 Speed
There are several metrics we can use to better classify unusual observations, however, since these tools depend on analyzing the residuals, we will first look at understanding residuals and then we can discuss the related metrics.
1.1 Residual Plots
The traditional residuals are given by ei = yi − yi, i = 1, . . . , N. They are a good diagnostic of how well the model fit the data if when plotted vs. e.g., a predictor, they are randomly distributed about y = 0 and have a constant variance. In more general terms, we can propose a functional form for the residuals such as
var(ei) = σ2(1 − hi)
where hi is called the leverage or hat-value is bounded between 0 and 1. Large values of hi (a.k.a high leverage) are associated with unusual values of xi. If hi is close to zero, we then obtain the classic linear regression SR1 condition of constant variance.
For the case of SLR, i.e., only one predictor, hi is given by:
(xi − x)2 1
From the formula we can interpret hi (leverage) as the proportion of the total sum of squares of the explanatory variable contributed by the iˆ{th} case.
## Registered S3 methods overwritten by ‘car’:
## method from
## influence.merMod lme4
## cooks.distance.influence.merMod lme4
## dfbeta.influence.merMod lme4
## dfbetas.influence.merMod lme4
hi=∑N (xi−x)2+N i=1
# Example 1: Food expenditure vs. income (non-constant variance) library(car) #for hccm() robust standard errors
3
Distance
0 50 100 150 200
##
## Attaching package: ‘car’
## The following object is masked from ‘package:purrr’:
##
## some
## The following object is masked from ‘package:dplyr’:
##
## recode
library(PoEdata) #for PoE4 datasets
data(“food”,package=”PoEdata”)
mreg.mod <- lm(food_exp~income, data=food)
plot(food$income,food$food_exp, pch=20,
xlab="Income", ylab="Food Expenditure")
abline(mreg.mod,lwd=2,col="blue")
5 10 15 20 25 30
Income
plot(food$income, mreg.mod$residuals, pch=20,ylab="Resdiduals", xlab="Income")
abline(h=0,col="red", lwd=2)
4
Food Expenditure
100 200 300 400 500 600
5 10 15 20 25 30
Income
# From the residuals plot we can see that the variance is not constant
# Example 2: US Investment Data
## Chapter 3 in Greene (2003)
## transform (and round) data to match Table 3.1 data("USInvest", package = "AER")
us <- as.data.frame(USInvest)
#us$invest <- round(0.1 * us$invest/us$price, digits = 3) #us$gnp <- round(0.1 * us$gnp/us$price, digits = 3) plot(us$gnp, us$invest , ylab="Investment", xlab="GNP",pch=20) mreg.mod=lm(us$invest ~ us$gnp, data = us) abline(mreg.mod,lwd=2,col="blue")
5
Resdiduals
−200 −100 0 100 200
1000 1500 2000 2500 3000
GNP
plot(us$gnp, mreg.mod$residuals, pch=20,ylab="Resdiduals", xlab="GNP",ylim=c(-45,45))
abline(h=0,col="red",lwd=2)
1000 1500 2000 2500 3000
GNP
# These residuals also show some hint of non-constant variance.
There are transformations that can be applied to the residuals to make the variance more constant. However, because they are applied to the residuals, the underlying regression model still needs to be reevaluated. For example, we can consider the Standardized Residuals as given by:
eSi= 2√ei , σ 1−hi
6
Resdiduals Investment
−40 −20 0 20 40 150 250 350 450
# Example 3: Standardized Residuals
library(car)
data("USInvest", package = "AER")
us <- as.data.frame(USInvest)
mreg.mod=lm(us$invest ~ us$gnp, data = us)
library(car)
residualPlots(mreg.mod,type = "rstandard")
1000 2000 3000
us$gnp
150
250 350 450
Fitted values
##
## us$gnp -1.8139 0.09477 .
## Tukey test -1.8139 0.06970 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
or the Studentized Residuals:
Test stat Pr(>|Test stat|)
ei
eTi= √ .
σ ( − i ) 1 − h i
# Example 4: Studentized Residuals
library(car)
data(“USInvest”, package = “AER”)
us <- as.data.frame(USInvest)
mreg.mod=lm(us$invest ~ us$gnp, data = us)
residualPlots(mreg.mod,type = "rstudent")
7
Rstandard residuals
−2 −1 0 1
Rstandard residuals
−2 −1 0 1
1000 2000 3000
us$gnp
150
250 350 450
Fitted values
##
## us$gnp -1.8139 0.09477 .
## Tukey test -1.8139 0.06970 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Another alternative is to compute the Pearson Residuals which are obtained from the Weighted Least Squares (WLS) Regression, where the weights are equal to wi and the residuals can be computed from:
ePi = √wiei
. As can be seen, when wi = 1, we recover the standard residuals.
Test stat Pr(>|Test stat|)
# Example 5: Pearson Residuals
library(car)
data(“USInvest”, package = “AER”)
us <- as.data.frame(USInvest)
mreg.mod=lm(us$invest ~ us$gnp, data = us)
library(car)
residualPlots(mreg.mod)
8
Rstudent residuals
−3 −2 −1 0 1 2
Rstudent residuals
−3 −2 −1 0 1 2
1000 2000 3000
us$gnp
150
250 350 450
Fitted values
##
## us$gnp -1.8139 0.09477 .
## Tukey test -1.8139 0.06970 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Test stat Pr(>|Test stat|)
# Example 6: All Residuals
#We can also graph all the relevant residual plots prestige.mod.2 <- lm(prestige ~ education + income + type,
data=Prestige)
residualPlots(prestige.mod.2)
9
Pearson residuals
−60 −40 −20 0 20 40
Pearson residuals
−60 −40 −20 0 20 40
6
8 10 12
education
14
16
5000 15000 25000
income
bc prof
type
wc
30 40 50 60 70 80 90
Fitted values
##
## education -0.6836
## income -2.8865
## type
## Tukey test -2.6104
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
1.2 Identifying and Classifying Unusual Observations
Unusual observations can be somewhat difficult to quantify through visual inspection but fortunately there are several robust metrics that can be used in a more objective fashion.
A. Q-Q Plot
This plot shows the studentized residuals vs. the corresponding quantiles of a t-distribution with N − K − 2 degrees of freedom, and allows us to identify outliers (std. residuals 2 are considered large).
Test stat Pr(>|Test stat|)
0.495942
0.004854 **
0.009043 **
# Example 6: Q-Q Plot
library(car)
mreg.mod <- lm(prestige ~ income + education, data=Duncan)
qqPlot(mreg.mod, id=list(n=3))
10
Pearson residuals
Pearson residuals
−15 0 10
−15 0 10
Pearson residuals
Pearson residuals
−15 0 10
−15 0 10
minister
contractor
reporter
−2 −1 0 1 2
t Quantiles
## minister reporter contractor
## 6 9 17
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferroni p
## minister 3.134519 0.0031772 0.14297
# The large p-value (Bonferroni) suggests that this value is not surpsing.
B. Cook’s Distance Plot
Cook’s Distance is defined as:
D i = e 2S i × h i k+1 1−hi
where large values of Di (larger than 1 and/or larger relative to the majority of the values) are typically associated with influential observations.
# It looks like minister might be a potential outlier. To be sure,
# we can perform an outlier test known as the Bonferroni-corrcted t-test: outlierTest(mreg.mod)
# Example 7: Cook's Distance Plot
library(car)
mreg.mod <- lm(prestige ~ income + education, data=Duncan)
influenceIndexPlot(mreg.mod, id=list(n=3),vars="Cook")
11
Studentized Residuals(mreg.mod)
−2 −1 0 1 2 3
minister
Diagnostic Plots
minister
conductor
reporter
0 10 20 30 40
Index
Diagnostic Plots
# It looks like minister, conductor and reporter are potential outliers
# Note: if you don't specify the flag "vars", it will plot all of them:
influenceIndexPlot(mreg.mod, id=list(n=3))
conducto
r
reporter
minister
contra
ctor
re
porter
accountant
reporter
minister
RR.engineer
conducto
0 10 20 30 40
Index
C. Hat-Values Plot
This plot is designed for identifying high leverage observations (some of which could also be influential). On average hi = p/N, therefore, hi 2p/N is considered large.
r
minister
# Example 8: Hat-Values Plot
library(car)
12
hat−valuesBonferroni p−Svatuludentized residuCaolsok's distance
Cook's distance
0.05 0.20 0.2 0.6 1.0 −2 0 2 0.0 0.3
0.0 0.2 0.4
mreg.mod <- lm(prestige ~ income + education, data=Duncan)
influenceIndexPlot(mreg.mod, id=list(n=3),vars="hat")
Diagnostic Plots
RR.engineer
minister
conductor
0 10 20 30 40
Index
## Calls:
## 1: lm(formula = prestige ~ income + education, data = Duncan)
## 2: lm(formula = prestige ~ income + education, data = Duncan, subset =
## rownames(Duncan) != "minister")
##
## Model 1 Model 2
# It looks like minister, conductor and RR.engineer are influential
# We can check if e.g., minister is influential by comparing the # coefficient estimates w/ and w/o 'minister'
mod.duncan <- lm(prestige ~ income + education, data=Duncan) mod.duncan.2 <- update(mod.duncan,
subset= rownames(Duncan) != "minister")
compareCoefs(mod.duncan, mod.duncan.2)
## (Intercept) -6.06
## SE 4.27
##
## income 0.599
## SE 0.120
##
-6.63 3.89
0.732 0.117
## education 0.5458 0.4330
## SE 0.0983 0.0963
##
# We can also depict the hat-values via:
influencePlot(mreg.mod, id=list(n=3))
13
hat−values
0.05 0.15 0.25
contractor
minister
RR.engineer
conductor
reporter
0.05
0.10 0.15 0.20 0.25
Hat−Values
StudRes
Hat
3.1345186
0.1730582
-2.3970224
0.0543936
-1.7040324
0.1945416
2.0438046
0.0432552
0.8089221
0.2690896
CookD minister 0.5663797 reporter 0.0989846 conductor 0.2236412 contractor 0.0585235 RR.engineer 0.0809681
D. Added Variable Plot (Partial Regression)
Depicts the relationship between y and one predictor x, adjusting for the effects of the other predictors. These can be used to identify influential observations and high leverage observations. For example, high leverage observations appear the in added variable plots as points horizontally distant from the rest of the data. They are also useful for identifying jointly influential points.
An alternative method is to look at the individual coefficient differences between estimates from the original data (bj) and those where observation i is removed (b(−i)j) .
dfβij=b(−i)j−bj forj=0,...,k
# Example 9: Added-Variable Plot
library(car)
mreg.mod <- lm(prestige ~ income + education, data=Duncan)
avPlots(mreg.mod, id=list(n=3, method="mahal"))
14
Studentized Residuals
−2 −1 0 1 2 3
Added−Variable Plots
RR.engineer
minister
conductor
minister
RR.engineer
conductor
−40 0 20 40 −60 −20 0 20 40
income | others education | others
# It looks like minister, conductor and RR.engineer are influential
# minister & conductor --> decrease the income slope
# minister & conductor –> increase the education slope
# RR.engineer seems fine for the most part
# We can check if e.g., minister & conductor are jointly influential
# by comparing the coefficient estimates w/ and w/o ‘minister & conductor’ mod.duncan <- lm(prestige ~ income + education, data=Duncan)
mod.duncan.3 <- update(mod.duncan,
subset = - whichNames(c("minister", "conductor"), Duncan))
compareCoefs(mod.duncan, mod.duncan.2, mod.duncan.3, se=FALSE)
## Calls:
## 1: lm(formula = prestige ~ income + education, data = Duncan)
## 2: lm(formula = prestige ~ income + education, data = Duncan, subset =
## rownames(Duncan) != "minister")
## 3: lm(formula = prestige ~ income + education, data = Duncan, subset =
## -whichNames(c("minister", "conductor"), Duncan))
##
## Model 1 Model 2 Model 3
## (Intercept) -6.06 -6.63
## income 0.599 0.732
## education 0.546 0.433
-6.41
0.867
0.332
# Based on the comparison is does appear that these points were
# jointly influential.
# We can also depict the hat-values via:
influencePlot(mreg.mod, id=list(n=3))
15
prestige | others
−30 −10 0 10 20 30 40
prestige | others
−40 −20 0 20 40 60
contractor
minister
RR.engineer
conductor
reporter
0.05
0.10 0.15 0.20 0.25
Hat−Values
CookD minister 0.5663797 reporter 0.0989846 conductor 0.2236412 contractor 0.0585235 RR.engineer 0.0809681
StudRes
Hat
3.1345186
0.1730582
-2.3970224
0.0543936
-1.7040324
2.0438046
0.1945416
0.0432552
0.8089221
0.2690896
# Lastly, we can compute the df_betas and cehck the individual differences
mreg.mod <- lm(prestige ~ income + education, data=Duncan)
dfbs <- dfbetas(mreg.mod)
# We can plot the influence on the income coefficient against the # influence on the education coefficient
plot(dfbs[ ,c("income", "education")]) # for b1 and b2 showLabels(dfbs[ , "income"], dfbs[ , "education"],
labels=rownames(Duncan), method=c(6, 16, 27))
16
Studentized Residuals
−2 −1 0 1 2 3
minister
conductor
RR.engineer
## minister conductor RR.engineer
## 6 16 27
−1.0 −0.5 0.0 0.5
income
# Notice that once again, minister and conductor stand out
# We can alos plot them all via:
avPlots(lm(prestige~income+education+type, data=Duncan))
17
education
0.0 0.5 1.0
Added−Variable Plots
minister
machinist
contract
or
store.manager
RR.engineer
n
−40 −20 0 20 40 60
income | others
−30 −20 −10 0 10 20
education | others
machi
ist
minister
minister
machinist
contractor
store.manager
machinist minister
store.clerk conductor
−0.4 −0.2 0.0 0.2 0.4 0.6
typeprof | others
## Warning in mmps(...): Interactions and/or factors skipped
−0.4 −0.2 0.0 0.2 0.4 0.6
typewc | others
# Example 10: Other Plots
library(car)
mreg.mod <- lm(prestige ~ income + education + type, data=Prestige)
# Marginal Plots
marginalModelPlots(mreg.mod, sd=TRUE)
18
prestige | others
prestige | others
−10 10 30
−20 0 20 40
prestige | others
prestige | others
−20 0 20
−20 0 20
Marginal Model Plots
Data
5000
Data
Model
15000
income
Model
25000
6
Data Model
8 10 12 14 16
education
30 40 50 60 70 80 90
Fitted values
Marginal plot of education
Added−Variable plot of education
# Marginal & Added-Variable Plots
mcPlots(mreg.mod, ~ education, overlaid=FALSE)
−4 −2 0 2 4
Centered education
−4 −2 0 2 4
education | others
19
Centered prestige
TRUE
TRUE
−30 −10 0 10 20 30 40
prestige | others
TRUE
−30 −10 0 10 20 30 40
20 40 60 80
20 40 60 80
20 40 60 80
mcPlots(mreg.mod, ~ income, overlaid=FALSE) Marginal plot of income
Added−Variable plot of income
−5000 5000 15000
Centered income
2 Predictor Transformations
−5000 5000 15000
income | others
Suppose we have a linear model y = β1 + β2x2 + β3x3 + e, but are interested in the relationship between y and x3 only. Therefore, we would like to plot y vs. x3 but with the effect of x2 removed. This is equivalent to plotting y − β1 + β2x2 vs. x3. We can accomplish this with a Component-Plus-Residual Plot. Mathematically, we are really doing an approximation by estimating a partial residual for x3. Algorithmically, here is how we can estimate it:
i. Obtain the coefficient estimates βk, for k = 1, 2, 3 from E[y|x] = β1 + β2x2 + β3x3 and respective residuals, e
ii. Estimate the partial residual (epartial) according to: epartial = y − β1 + β2x2 iii. The Component-Plus-Residual is given by: epartial = β2x2 + e
In general, we can express the partial residual as epartial,ij = ei + bjxij. This plot is often used to help detect the need for transformations of the predictor variables by looking for any departures from linearity in the plots of epartial,ij vs. xij. If we identify any nonlinear dynamics, we can then apply the techniques from Lecture 1 to the respective predictor variables.
# Example 11: Component-Plus-Residual Plot
library(car)
mreg.mod <- lm(prestige ~ income + education + women,
data=Prestige)
crPlots(mreg.mod, order=2)
20
Centered prestige
−30 −10 0 10 20 30 40
prestige | others
−30 −10 0 10 20 30 40
Component + Residual Plots
0 5000 15000 25000
income
6
8 10 12 14 16
education
0 20 40 60 80 100
women
# The dashed blue line is a partial fit, and the magenta curve a lowess smoother. Of the
# 3 predicators, income seems have the largest degree of curvature. Therefore, we might
# consider transforming it using, e.g., the bulging rule.
# We can try applying a log tranformation to income
mreg.mod2 <- update(mreg.mod,
. ~ . + log2(income) - income)
crPlots(mreg.mod2, order=2)
21
Component+Residual(prestige) Component+Residual(prestige)
−20 0 10 −20 0 20
Component+Residual(prestige)
−20 0 20
Component + Residual Plots
6 8 10 12 14 16
education
0 20 40 60 80 100
women
10 11 12 13 14
log2(income)
# It looks better, but maybe we should also apply a quadratic transformation to women
mreg.mod3 <- update(mreg.mod2,
. ~ . - women + poly(women, 2))
crPlots(mreg.mod3, order=1)
22
Component+Residual(prestige) Component+Residual(prestige)
−20 0 20 −20 0 20
Component+Residual(prestige)
−15 0 10
Component + Residual Plots
6 8 10 12 14 16 10 11 12 13 14
education log2(income)
−2 0 2 4 6
poly(women, 2)
## Warning in ceresPlots(lm(prestige ~ income + education + type, data =
## Prestige)): Factors skipped in drawing CERES plots.
# Note: When there are strong non-linear relations between predictors, the
# component-plus-residual plots may not work, in this case, we instead look
# at CERES Plots (Combining conditional Expectations and REsiduals)
ceresPlots(lm(prestige~income+education+type, data=Prestige))
23
Component+Residual(prestige) Component+Residual(prestige)
−15 0 10 −20 0 20
Component+Residual(prestige)
−20 0 20
CERES Plots
5000 15000 25000 6 8
income
10 12 14 16
education
A more general method for dealing with nonlinear predictors is to a apply the Box-Tidwell method for transforming predictors. Their proposed model assumes the form:
y = β0 +β1TBBC(x1,γ1)+···+βkTBC(xk,γk)+ε, where TBC is the Box-Cox power transformation introduced in Lecture 1.
# Example 12: Box-Tidwell Transformation
library(car)
# In R we can use the the functon boxTidwell to suggest the # appropriate transformation(s) via MLE, while also allowing # for user specifed transformations
# For example, let's apply a quadratic tranformation to women
# and see what it recommmends for income and education:
boxTidwell(prestige ~ income + education,
other.x = ~poly(women,2),data=Prestige)
##
## income
## education
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## iterations = 12
MLE of lambda Score Statistic (z) Pr(>|z|)
-0.037775 -5.3013 1.15e-07 ***
2.192827 2.4056 0.01615 *
# According to the output: lambda = -0.04 for income–> log
# According to the output: lambda = 2.19 for education–> quadratic
24
CERES Residual(prestige)
−10 0 10 20
CERES Residual(prestige)
−30 −20 −10 0 10 20
# To visualize these results, we can plot the “Constructed-Variable Plots”
# ‘construction’ in this context means transformation
3 Omitted Variable Bias
The Omitted Variable Bias quantifies the impact of omitting a relevant (e.g., statistically and/or economically significant) variable in a regression model. For example, suppose the model true underlying population model is y = β1 + β2×2 + β3×3 + e where both x2 and x3 should be included, but instead we only estimated the model y = β1 + β2×2 + e. How can we quantify the effect that omitting x3 would have on the model? We can show that mathematically it corresponds to a bias, namely:
The example below shows how omitting a relevant variable such as the wife’s years of education contribution to annual family income, leads to a ∼ $2000 bias.
## (Intercept) -5533.6
## he 3131.5
## we 4522.6
## —
11229.5 -0.493 0.622426
802.9 3.900 0.000112 ***
1066.3 4.241 2.73e-05 ***
bias(b∗) = E(b )∗ − β = β cov(x2, x3) 2 2 2 3var(x2)
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
##
## Residual standard error: 40500 on 425 degrees of freedom
## Multiple R-squared: 0.1613, Adjusted R-squared: 0.1574
## F-statistic: 40.87 on 2 and 425 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = faminc ~ he, data = edu_inc)
# Example 12: Omitted Variable Bias
library(car)
library(AER)
library(broom)
data("edu_inc", package="PoEdata")
mreg.mod1 <- lm(faminc~he+we, data=edu_inc)
mreg.mod2 <- lm(faminc~he, data=edu_inc)
# Correct Model
summary(mreg.mod1)
##
## Call:
## lm(formula = faminc ~ he + we, data = edu_inc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -88299 -23663 -6800 17244 232085
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
#Incorrect Model
summary(mreg.mod2)
25
##
## Residuals:
## Min 1Q Median 3Q Max
## -78985 -26196 -6924 18372 250934
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26191.3 8541.1 3.066 0.0023 **
## he 5155.5 658.5 7.830 3.92e-14 ***
## —
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
##
## Residual standard error: 41300 on 426 degrees of freedom
## Multiple R-squared: 0.1258, Adjusted R-squared: 0.1237
## F-statistic: 61.3 on 1 and 426 DF, p-value: 3.921e-14
##
## Call:
## lm(formula = faminc ~ he + we + kl6, data = edu_inc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91760 -23745 -6642 17636 243255
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
# We can see that the effect of omitting wedu translates to an
# ovesstatement of ~$2000 which would correspond to the bias in this case.
# Q: What if we include kl6 ( = number of chidren under the age of 6)?
mreg.mod3 <- lm(faminc~he+we+kl6, data=edu_inc)
summary(mreg.mod3)
## (Intercept) -7755.3
## he 3211.5
## we 4776.9
## kl6 -14310.9
## ---
11162.9 -0.695 0.48760
796.7 4.031 6.58e-05 ***
1061.2 4.502 8.73e-06 ***
5003.9 -2.860 0.00445 **
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40160 on 424 degrees of freedom
## Multiple R-squared: 0.1772, Adjusted R-squared: 0.1714
## F-statistic: 30.43 on 3 and 424 DF, p-value: < 2.2e-16
4 Irrelevant Variables
Typically, including irrelevant variables into the model has the effect of increasing the variance of the other variables, and therefore, potentially rendering the respective coefficient estimates statistically insignificant.
# A: In this case, the coefficients of hedu and wedu did not change by much,
# yet kl6 is both statistically and economically significant.
# This suggests that the estimates of wedu and hedu are robust.
26
# Example 13: Irrelevant Variables
library(car)
library(AER)
library(broom)
library(PoEdata)
data("edu_inc", package="PoEdata")
mreg.mod1 <- lm(faminc~he+we+kl6, data=edu_inc)
# The variables xtra_x5 and _x6 are contain random values that are not associated with $y$ # We can would expect that the variances of th relevant variable will increase as we intro # these irrlevant variables.
mreg.mod2 <- lm(faminc~he+we+kl6+xtra_x5, data=edu_inc)
mreg.mod3 <- lm(faminc~he+we+kl6+xtra_x5+xtra_x6, data=edu_inc) #summary(mreg.mod1)
#summary(mreg.mod2)
#summary(mreg.mod3)
#stargazer(mreg.mod1, mreg.mod2, mreg.mod3)
Below we summarize the results from the 3 models: Table 1:
duce
he
we
kl6 xtra_x5 xtra_x6 Constant
Observations R2
Adjusted R2 Residual Std. Error F Statistic
Note:
(1)
3,211.526∗∗∗ (796.703)
4,776.907∗∗∗ (1,061.164)
−14,310.920∗∗∗ (5,003.928)
−7,755.330 (11,162.930)
428
0.177
0.171 40,160.080 (df = 424) 30.432∗∗∗ (df = 3; 424)
Dependent variable:
faminc (2)
3,377.219∗∗∗ (1,247.058)
4,783.930∗∗∗ (1,063.156)
−14,216.440∗∗∗ (5,039.395)
−180.162 (1,042.335)
−7,682.601 (11,183.650)
428
0.177
0.169 40,206.100 (df = 423) 22.779∗∗∗ (df = 4; 423)
(3)
3,339.792∗∗∗ (1,250.039)
5,868.677∗∗ (2,278.067)
−14,200.180∗∗∗ (5,043.720)
888.843 (2,242.491)
−1,067.186 (1,981.685)
−7,558.613 (11,195.410)
428
0.178
0.168 40,239.890 (df = 422) 18.251∗∗∗ (df = 5; 422)
27
∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01
5 Multicollinearity
Multicollinearity occurs when there are strong linear relationships between the predictors. For example, consider the model y = β1 + β2x2 + β3x3 + ε, where x3 = α1 + α2x2. In this case, Ordinary Least Squares cannot generate estimates of the regression coefficients because the predictor x2 and x3 move together. In this case (which is considered Perfect Multicollinearity) the solution would be to drop of the two predictors from the model.
In general, Imperfect Multicollinearity is more common. This occurs when e.g., using the previous case, x3 = α1 + α2 x2 + u, where u represents some unknown random component.
The effects of Multicollinearity can be summarized as follows:
1. The variances and standard errors of the regression coefficient estimates will increase. This leads to
lower t−statistics.
2. The overall fit of the regression equation will be largely unaffected by multicollinearity. Therefore,
forecasting and prediction will be largely unaffected.
3. Regression coefficients can change substantially when variables are added or dropped.
5.1 Detecting Multicollinearity
There are several ways in which we can detect multicollinearity. Below are the 3 most common ones:
1. Large Correlation Coefficients
High pairwise correlations among independent variables is a strong sign of multicollinearity. Usually |rxi,xj | 0.8 is considered large.
2. Large R2 with Low t−Stats
It is possible for individual coefficients to be insignificant but for the overall fit of the equation to be high.
3. Variance Inflation Factor (VIF)
VIF is a measure of how much the variance of the estimated regression coefficient bk is inflated by the existence of correlations among the predictor variables in the model. Consider for example var(bj) as given by:
σ2 1
v a r ( b j ) = ( n − 1 ) s 2 × 1 − R 2 .
jj
The term 1/(1 − R2j ) is called the *VIF*. Values of VIF= 1 suggest no correlation whereas VIF 10 (and
in some cases 4) is considered large, and thus, we might consider removing those variables with large VIFs from the model. The R2j are obtained from the *Auxiliary Regressions*, i.e., xj regressed on the other predcitors.
An important limitation to of VIF is that it does not apply to nonlinear regressors or contrasts. For these cases we would instead use the Generalized VIF (GVIF) which is given by GVIF1/2p, where p represents the number of regressors in a term.
# Example 13: Multicollinearity (Fuel Efficiency)
library(car)
data("cars", package="PoEdata")
mreg.mod=lm(mpg~cyl+eng+wgt, data=cars)
summary(mreg.mod)
28
##
## Call:
## lm(formula = mpg ~ cyl + eng + wgt, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.5568 -2.8703 -0.3649 2.2708 16.4338
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.3709616 1.4806851 29.967 < 2e-16 ***
## cyl
## eng
## wgt
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
-0.2677968 0.4130673 -0.648 0.517
-0.0126740 0.0082501 -1.536 0.125
-0.0057079 0.0007139 -7.995 1.5e-14 ***
## Residual standard error: 4.297 on 388 degrees of freedom
## Multiple R-squared: 0.6993, Adjusted R-squared: 0.697
## F-statistic: 300.8 on 3 and 388 DF, p-value: < 2.2e-16
tidy(vif(mreg.mod))
## Warning: 'tidy.numeric' is deprecated.
## See help("Deprecated")
names cyl eng wgt
x 10.515508 15.786455 7.788716
# We can see that both cyl and eng should be removed since VIF>10
# Example 14: Multicollinearity (Median household value in Boston)
data(“Boston”, package = “MASS”)
mreg.mod <- lm(medv ~., data = Boston)
summary(mreg.mod)
##
## Call:
## lm(formula = medv ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
## crim
## zn
## indus
## chas
## nox
## rm
## age
-1.080e-01 3.286e-02 -3.287 0.001087 **
4.642e-02 1.373e-02 3.382 0.000778 ***
2.056e-02 6.150e-02 0.334 0.738288
2.687e+00 8.616e-01 3.118 0.001925 **
-1.777e+01 3.820e+00 -4.651 4.25e-06 ***
3.810e+00 4.179e-01 9.116 < 2e-16 ***
6.922e-04 1.321e-02 0.052 0.958229
29
## dis
## rad
## tax
## ptratio
## black
## lstat
## ---
-1.476e+00 1.995e-01 -7.398 6.01e-13 ***
3.060e-01 6.635e-02 4.613 5.07e-06 ***
-1.233e-02 3.760e-03 -3.280 0.001112 **
-9.527e-01 1.308e-01 -7.283 1.31e-12 ***
9.312e-03 2.686e-03 3.467 0.000573 ***
-5.248e-01 5.072e-02 -10.347 < 2e-16 ***
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
tidy(vif(mreg.mod))
## Warning: 'tidy.numeric' is deprecated.
## See help("Deprecated")
names x crim 1.792191 zn 2.298758 indus 3.991596 chas 1.073995 nox 4.393720 rm 1.933744 age 3.100826 dis 3.955945 rad 7.484496 tax 9.008554 ptratio 1.799084 black 1.348521 lstat 2.941491
##
## Call:
## lm(formula = undercount ~ ., data = Ericksen)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8356 -0.8033 -0.0553 0.7050 4.2467
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.611411 1.720843 -0.355 0.723678
## minority 0.079834 0.022609 3.531 0.000827 ***
## crime 0.030117 0.012998 2.317 0.024115 *
## poverty -0.178369 0.084916 -2.101 0.040117 *
# We can see that nox, rad, and tax might be worth removing since VIF>4
# Also note that idus, and dis are very close to 4, so we might want to
# consider removing them as well.
# Example 15: Multicollinearity (US Census Data) -see Ericksen et. al., 1989
mreg.mod <- lm(undercount ~ ., data=Ericksen)
summary(mreg.mod)
30
## language 0.215125 0.092209 2.333 0.023200 *
## highschool 0.061290 0.044775 1.369 0.176415
## housing -0.034957 0.024630 -1.419 0.161259
## citystate -1.159982 0.770644 -1.505 0.137791
## conventional 0.036989 0.009253 3.997 0.000186 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.426 on 57 degrees of freedom
## Multiple R-squared: 0.7077, Adjusted R-squared: 0.6667
## F-statistic: 17.25 on 8 and 57 DF, p-value: 1.044e-12
tidy(vif(mreg.mod))
## Warning: 'tidy.numeric' is deprecated.
## See help("Deprecated")
names
minority 5.009065 crime 3.343586 poverty 4.625178 language 1.635568 highschool 4.619169 housing 1.871745 city 3.537750 conventional 1.691320
# We can see that minority, poverty, and highschool might be worth removing since VIF>4
6 Model Misspecification: Ramsey RESET
As we have seen previously, sometimes adding nonlinear expressions for the predictors can help improve our model in terms of economic and statistical significance. A popular test to evaluate is such nonlinear terms are needed, is known as the REgression Specification Error Test (RESET). For example, assume we have the model y = β1 + β2×2 + β3×3 + ε and we are interested in determining if we can improve the model by including higher order terms of the predictors. We can easily assess this by estimating the model
2
y=β1+β2×2+β3×3+γy +ε,andtestingthenullhypothesisH0 :γ=0againstH0 :γ̸=0.Ifwereject
x
H0, then we should consider including e.g., quadratic terms. We can then perform the same test but instead 3
including the term δy , and so on, until we fail to reject H0. In practice, however, we should not have to include high order terms beyond 2 or 3.
# Example 16: Functional Form (RESET)
library(car)
library(PoEdata)
library(lmtest)
# We will create to models: A quadratic one and a linear one, and test each for model miss
x <- c(1:30)
# Model 1: Quadratic
y1 <- 1 + x + x^2 + rnorm(30)
# Test if a quadratic model is appropriate
resettest(y1 ~ x , power=2, type="regressor")
31
pecification
##
## RESET test
##
## data: y1 ~ x
## RESET = 158944, df1 = 1, df2 = 27, p-value < 2.2e-16
##
## RESET test
##
## data: y2 ~ x
## RESET = 0.17222, df1 = 1, df2 = 27, p-value = 0.6814
##
## RESET test
##
## data: mreg.mod
## RESET = 2.6046, df1 = 3, df2 = 421, p-value = 0.05144
# According to the test, our model is fine
7 Model Selection
We now discuss how to select the best model among competing ones using robust statistical metrics.
7.1 Akaike Information Criterion (AIC)
AIC estimates the quality of the models being considered for the data relative to each other. The smaller the value of AIC, the better the model.
SSE 2K AIC=ln N +N
# According to the test, we can improve the model
# Test if a quadratic model is appropriate
y2 <- 1 + x + rnorm(30)
resettest(y2 ~ x , power=2, type="regressor")
# According to the test, our model is well specified, no need for adding a quadratic term
# Example 17: Family Income (RESET)
data("edu_inc", package="PoEdata")
mreg.mod <- lm(faminc~he+we+kl6, data=edu_inc)
resettest(mreg.mod , power=2, type="regressor")
# Example 18: AIC (Family Income)
library(car)
library(AER)
library(broom)
library(PoEdata)
data("edu_inc", package="PoEdata")
mreg.mod1 <- lm(faminc~he, data=edu_inc)
mreg.mod2 <- lm(faminc~he+we, data=edu_inc)
32
mreg.mod3 <- lm(faminc~he+we+kl6, data=edu_inc)
mreg.mod4 <- lm(faminc~he+we+kl6+xtra_x5, data=edu_inc)
mreg.mod5 <- lm(faminc~he+we+kl6+xtra_x5+xtra_x6, data=edu_inc)
AIC(mreg.mod1, mreg.mod2, mreg.mod3, mreg.mod4, mreg.mod5)
AIC mreg.mod1 10316.65 mreg.mod2 10300.91 mreg.mod3 10294.73 mreg.mod4 10296.70 mreg.mod5 10298.41
7.2 Bayesian Information Criterion (BIC)
df
3
4
5
6
7
# As expected, Model 3 (faminc~he+we+kl6) is the preferred choice
Similarly to AIC, BIC does the same bu penalizes extra variables more heavily than AIC. SSE Kln(N)
As with AIC,
BIC=ln N + N the model with the smallest BIC is preferred.
# Example 18: BIC (Family Income)
library(car)
library(AER)
library(broom)
library(PoEdata)
data("edu_inc", package="PoEdata")
mreg.mod1 <- lm(faminc~he, data=edu_inc)
mreg.mod2 <- lm(faminc~he+we, data=edu_inc)
mreg.mod3 <- lm(faminc~he+we+kl6, data=edu_inc)
mreg.mod4 <- lm(faminc~he+we+kl6+xtra_x5, data=edu_inc)
mreg.mod5 <- lm(faminc~he+we+kl6+xtra_x5+xtra_x6, data=edu_inc)
BIC(mreg.mod1, mreg.mod2, mreg.mod3, mreg.mod4, mreg.mod5)
df
mreg.mod1 mreg.mod2 mreg.mod3 mreg.mod4 mreg.mod5
BIC 10328.83 10317.15 10315.03 10321.06 10326.82
3
4
5
6
7
# As expected, Model 3 (faminc~he+we+kl6) is also the preferred choice
7.3 Mallows’ CP
Mallows’ Cp can be used to assess model fits using different number of parameters. The statistic is given by: Cp = SSE(p) − N + 2p,
σ2
where SSE is the sum squared error, N the number of observations, and p the number of parameters being used to estimate the model.
33
The smaller the value of the statistic Cp, the better. Therefore, if model p is the best choice, Cp will tend to be close to or smaller than p.
Abbreviation he h
)
elblue4",cex=1
we w kl6 k
xtra_x5 x_5
xtra_x6 x_6
Mallows CP
12345
Subset Size
7.4 Stepwise Regression
Since our goal is ultimately about making the best predictions possible with our model, Stepwise Regression consists of iteratively adding and removing predictors until the model with the lowest prediction error is obtained. There are 3 popular ways to perform this:
34
# Example 18: Mallows Cp (Family Income)
library(car)
library(AER)
library(broom)
library(PoEdata)
library(leaps)
data("edu_inc", package="PoEdata")
mreg.mod5 <- lm(faminc~he+we+kl6+xtra_x5+xtra_x6, data=edu_inc)
ss=regsubsets(faminc~he+we+kl6+xtra_x5+xtra_x6,method=c("exhaustive"),nbest=3,data=edu_inc
subsets(ss,statistic="cp",legend=F,main="Mallows CP",col="steelblue4")
legend(3,25,bty="y",legend=c('h=hedu','w=wedu','k=kl6','x_5 =xtra5', 'x_6=extr6'),col="ste
x_6 h
w
h=hedu w=wedu k=kl6
x_5 =xtra5 x_6=extr6
h−k−x_5−x_6
h−w−k−x_5−x_6 h−w−k−x_56
w−x_5 w−k
h−w
h−w−x_6 w−k−x_5
h−w−k
Statistic: cp
5 10 15 20 25
# As expected, Model 3 (faminc~he+we+kl6) is also the preferred choice by Cp
1. Forward Selection
Start with no predictors in the model, iteratively add the most contributive predictors, and stop when the improvement is no longer statistically significant.
2. Backward Selection
Starts with all predictors in the model (full model), iteratively remove the least contributive predictors, and stop when you have a model where all predictors are statistically significant.
3. Stepwise Selection
Is a combination of forward and backward selections. You start with no predictors, then sequentially add the most contributive predictors (like forward selection). After adding each new variable, remove any variables that no longer provide an improvement in the model fit (like backward selection).
Below we show a brute force approach by manually comparing models based on their prediction accuracy.
# Example 19: Evaluating the Prediction Error
# See http://www.sthda.com/ for details of the exmaple below library(car)
library(AER)
library(broom)
library(PoEdata)
library(leaps)
library(tidyverse)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
## The following object is masked from 'package:survival':
##
## cluster
library(MASS) suppressMessages(library("tidyverse"))
# Load the data
data("Boston", package = "MASS")
# Split the data into training and test set set.seed(123)
training.samples <- Boston$medv %>%
createDataPartition(p = 0.8, list = FALSE) train.data <- Boston[training.samples, ] test.data <- Boston[-training.samples, ]
# Build the model
model1 <- lm(medv ~., data = train.data)
# Make predictions
predictions <- model1 %>% predict(test.data) # Model performance
data.frame(
RMSE = RMSE(predictions, test.data$medv),
R2 = R2(predictions, test.data$medv)
)
35
RMSE R2 4.588948 0.761126
## crim zn indus chas nox rm age dis
## 1.844585 2.323334 3.956419 1.066866 4.447852 1.911905 3.213786 4.065420
## rad tax ptratio black lstat
## 8.152643 9.660128 1.851645 1.364539 3.131835
les?
# So far we have R^2 = 0.67, can we do better by removing e.g., any multicoliinear varaiab
vif(model1)
# Remove tax
model2 <- lm(medv ~. -tax, data = train.data) # Make predictions
predictions <- model2 %>% predict(test.data) # Model performance
data.frame(
RMSE = RMSE(predictions, test.data$medv),
R2 = R2(predictions, test.data$medv)
)
RMSE R2 4.644396 0.7558647
In R we can use the function stepAIC (from the leaps library) with the flag direction set to forward, backward, or both (Stepwise) to automatically do the model selection process for us. The function step is often used as well.
# Looks like it didn’t help much
# Technically we would need to manually run through all the plausible models
# until we converge on the best one. Or we can simply use ‘step(model1)’ as
# shown in Example 20.
#step(model1)
# Example 20: Stepwise (Family Income)
library(car)
library(AER)
library(broom)
library(PoEdata)
library(leaps)
data(“edu_inc”, package=”PoEdata”)
mreg.mod <- lm(faminc~he+we+kl6+xtra_x5+xtra_x6, data=edu_inc)
# Forward Selection
step.model <- stepAIC(mreg.mod, direction = "forward",
trace = FALSE)
summary(step.model)
##
## Call:
## lm(formula = faminc ~ he + we + kl6 + xtra_x5 + xtra_x6, data = edu_inc)
##
## Residuals:
## Min 1Q Median 3Q Max
36
## -90853 -24027 -6555 18014 243361
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7558.6
## he 3339.8
## we 5868.7
## kl6 -14200.2
## xtra_x5 888.8
## xtra_x6 -1067.2
## —
11195.4 -0.675 0.49995
1250.0 2.672 0.00784 **
2278.1 2.576 0.01033 *
5043.7 -2.815 0.00510 **
2242.5 0.396 0.69204
1981.7 -0.539 0.59050
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
##
## Residual standard error: 40240 on 422 degrees of freedom
## Multiple R-squared: 0.1778, Adjusted R-squared: 0.1681
## F-statistic: 18.25 on 5 and 422 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = faminc ~ he + we + kl6, data = edu_inc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91760 -23745 -6642 17636 243255
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
# Backward Selection
step.model <- stepAIC(mreg.mod, direction = "backward",
trace = FALSE)
summary(step.model)
## (Intercept) -7755.3
## he 3211.5
## we 4776.9
## kl6 -14310.9
## ---
11162.9 -0.695 0.48760
796.7 4.031 6.58e-05 ***
1061.2 4.502 8.73e-06 ***
5003.9 -2.860 0.00445 **
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40160 on 424 degrees of freedom
## Multiple R-squared: 0.1772, Adjusted R-squared: 0.1714
## F-statistic: 30.43 on 3 and 424 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = faminc ~ he + we + kl6, data = edu_inc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91760 -23745 -6642 17636 243255
# Stepwise Selection
step.model <- stepAIC(mreg.mod, direction = "both",
trace = FALSE)
summary(step.model)
37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7755.3
## he 3211.5
## we 4776.9
## kl6 -14310.9
## —
11162.9 -0.695 0.48760
796.7 4.031 6.58e-05 ***
1061.2 4.502 8.73e-06 ***
5003.9 -2.860 0.00445 **
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
##
## Residual standard error: 40160 on 424 degrees of freedom
## Multiple R-squared: 0.1772, Adjusted R-squared: 0.1714
## F-statistic: 30.43 on 3 and 424 DF, p-value: < 2.2e-16
## Start: AIC=9081.8
## faminc ~ he + we + kl6 + xtra_x5 + xtra_x6 ##
## DfSumofSq RSS AIC ## - xtra_x5 1 2.5439e+08 6.8358e+11 9080.0 ## - xtra_x6 1 4.6960e+08 6.8379e+11 9080.1
##
## -we
## -he ## – kl6 ##
## Step: AIC=9079.95
## faminc ~ he + we + kl6 + xtra_x6
##
## DfSumofSq RSS AIC ## – xtra_x6 1 2.6350e+08 6.8384e+11 9078.1
# Autmated way to evaluate all the models:
step(mreg.mod)
##
## – kl6
## -he
## -we ##
6.8358e+11 9080.0
1 1.2698e+10 6.9628e+11 9085.8
## Step: AIC=9078.12
## faminc ~ he + we + kl6
##
##
##
## – kl6
## – he
## – we
1 1.5563e+10
1 2.0830e+10
6.9914e+11
7.0441e+11
9087.6
9090.8
RSS
6.8384e+11 9078.1
1 1.3192e+10 6.9703e+11 9084.3
1 2.6207e+10 7.1005e+11 9092.2
1 3.2683e+10 7.1652e+11 9096.1
DfSumofSq
AIC
6.8332e+11 9081.8
6.9407e+11 9086.5
6.9488e+11 9087.0
1 1.0746e+10
1 1.1559e+10
1 1.2835e+10 6.9616e+11 9087.8
##
## Call:
## lm(formula = faminc ~ he + we + kl6, data = edu_inc)
##
## Coefficients:
## (Intercept) he we
## -7755 3212 4777
38
kl6 -14311
# As expected, Model 3 (faminc~he+we+kl6) is the best one for all cases
# Note: The leaps package has other functions such as ‘leapSeq’ which give
# you more choices for performing the selection.
8 Heteroskedasticity
The classic signature of Heteroskedasticity is a growing spread in y with increasing values of x. In this case the Constant Variance assumption no longer holds, and we therefore need to apply other techniques to mitigate this problem.
ConsidertheMRmodel: yi = β1 +β2xi2 +···+βKxiK +ei. Ifthevariancevar(yi) = E(ei2)isnolonger constant, this suggests that there is a function h() that depends on predictors z2, . . . , zS that can be used to characterize the behavior of var(yi). For example, we can propose
var(yi) = E(ei2) = h(α1 + α2zi2 + · · · + αSziS).
This is more general result since the constant variance assumption holds provided α2 = α3 = · · · = αS = 0.
Next, we will discuss a couple of statistical methods to help us identify if heteroskedasticity is present by evaluating hypotheses such as H0 : α2 = α3 = · · · = αS = 0 and examining plausible function forms for h() such as linear, quadratic and exponential functions.
8.1 Detecting Heteroskedasticity
Next, we will discuss a couple of statistical methods to help us identify if heteroskedasticity is present by evaluating hypotheses such as H0 : α2 = α3 = · · · = αS = 0 and examining plausible function forms for h() such as linear, quadratic and exponential functions.
8.1.1 Spread Level Plots
These can be used as a diagnostic for nonconstant variance with the advantage that it also yields an estimate for the power law transformation to use for the response variable.
# Example 21: Spread Level Plot (Family Income)
library(car)
library(PoEdata)
# Traditional LS
foodeq <- lm(food_exp~income,data=food)
tidy(foodeq)
term (Intercept) income
p.value 0.0621824 0.0000195
estimate
std.error
statistic
83.41600
43.410163
1.921578
10.20964
2.093263
4.877381
spreadLevelPlot(foodeq)
39
Spread−Level Plot for foodeq
150 200 250 300 350 400
Fitted Values
##
## Suggested power transformation: -1.106871
# It suggests a lambda = -1.107 for the power low to use in transforming y.
8.1.2 Score Tests for Nonconstant Error Variance
There are several popular tests for Heteroskedasticity depending on various characteristics of the data. In this section, however, we will focus on:
1. Breusch-Pagan Test
2. White Test
3. Goldfeld-Quandt Test
8.1.2.1 Lagrange Multiplier Test (Breusch-Pagan Test)
AssumingweknowwhatthezS’sare,thistestisforthenullH0 :α2 =α3 =···=αS =0againstthe alternative H1 : not all the αS are zero. The respective test-statistic is χ2 ∼ N × R2 ∼ χ2S−1, where R2 is
2
obtainedfromtheregression:e =α +α z +···+α z +v.Thechallengewiththistestisthatwemay
i12i2 SiSi not not know (or have) other predictors zS for estimating the αS’s.
# Example 22: BP Test: Food Expenditure vs. Family Income
library(car)
library(PoEdata)
# The exmaple below shows how ot compute everything from scracth, however, we will shortly
# instead use the R function 'bptest'
40
Absolute Studentized Residuals
0.05 0.20 0.50 2.00
alpha <- 0.05
mod1 <- lm(food_exp~income, data=food)
ressq <- resid(mod1)^2
#The test equation:
modres <- lm(ressq~income, data=food)
N <- nobs(modres)
gmodres <- glance(modres)
S <- gmodres$df #Number of Betas in model
#Chi-square is always a right-tail test
chisqcr <- qchisq(1-alpha, S-1)
Rsqres <- gmodres$r.squared
chisq <- N*Rsqres
pval <- 1-pchisq(chisq,S-1)
# Since p-val = 0.0066 < alpha=0.05, we reject the null # and conclude tht heteroskedasticity is present.
# You can also use the function ncvTest from the car package # to test for Non Constant Variance (hence ncv) ncvTest(mod1)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 7.343935, Df = 1, p = 0.0067289
# We can see that we reject the null H0: variance = constant
8.1.2.2 White Test
Given the problem we saw with the Breusch-Pagan (BP) Test, the White Test solves this by suggesting we start by using our existing predictors from the original model. For example, if our regression model is E(y)=β1+β2x2+β3x3,thenwesetz2 =x2,z3 =x3,z4 =x2,andz5 =x32.
# Example 22: White Test: Food Expenditure vs. Family Income
library(car)
library(PoEdata)
alpha <- 0.05
mod1 <- lm(food_exp~income, data=food)
ressq <- resid(mod1)^2
modres <- lm(ressq~income+I(income^2), data=food) gmodres <- glance(modres)
Rsq <- gmodres$r.squared
S <- gmodres$df #Number of Betas in model
chisq <- N*Rsq
pval <- 1-pchisq(chisq, S-1)
# Since p-val = 0.023 < alpha=0.05, we reject the null # and conclude tht heteroskedasticity is present.
# We can directly test the Robust version of the BP Test
mod1 <- lm(food_exp~income, data=food)
bptest(mod1)
##
## studentized Breusch-Pagan test
##
41
## data: mod1
## BP = 7.3844, df = 1, p-value = 0.006579
# Consissten with the previous results, we reject H0.
8.1.3 Goldfeld-Quandt Test
When our model includes a factor variable(s), and suffers from heteroskedasticty, the Goldfeld-Quandt Test can be useful for comparing groups based on the factor levels. For example, consider the case of a binary indicator variable (I) that takes on the values 0 or 1. The test will compares the variances of the two groups by testing the null hypothesis H0 : σ12 = σ02 against the alternative: H0 : σ12 ̸= σ02. The respective test statistic is F = σ12/σ02 which is then compared against Fc = FN1−K,N0−K. This test can also be applied to a model with only quantitative predictors by simply splitting the observations according to a desired rule. The two examples below show these cases.
# Example 23: Goldfeld-Quandt Test (w/ Inidicator variable) # wage = educ + exper + metro (=1 metropolitan and =0 urban) library(car)
library(PoEdata)
alpha <- 0.05 #two tail, will take alpha/2 data("cps2", package="PoEdata")
#Create the two groups, m (metro) and r (rural) m <- cps2[which(cps2$metro==1),]
r <- cps2[which(cps2$metro==0),]
wg1 <- lm(wage~educ+exper, data=m)
wg0 <- lm(wage~educ+exper, data=r)
df1 <- wg1$df.residual #Numerator degrees of freedom
df0 <- wg0$df.residual #Denominatot df
sig1squared <- glance(wg1)$sigma^2
sig0squared <- glance(wg0)$sigma^2
fstat <- sig1squared/sig0squared
Flc <- qf(alpha/2, df1, df0)#Left (lower) critical F
Fuc <- qf(1-alpha/2, df1, df0) #Right (upper) critical F
# Since F = 2.09 (in the rejection region), we reject H0 and conclude # that the variances are different, and therefore, heteroskedasticity # is an issue.
# Example 24: Goldfeld-Quandt Test (only quantitative predictors) # Food Expenditure vs. Family Income
# Divide the data into two groups: below median income and above library(car)
library(PoEdata)
alpha <- 0.05
data("food", package="PoEdata")
medianincome <- median(food$income)
li <- food[which(food$income<=medianincome),]
hi <- food[which(food$income>=medianincome),]
eqli <- lm(food_exp~income, data=li)
eqhi <- lm(food_exp~income, data=hi)
dfli <- eqli$df.residual
dfhi <- eqhi$df.residual
sigsqli <- glance(eqli)$sigma^2
42
sigsqhi <- glance(eqhi)$sigma^2
fstat <- sigsqhi/sigsqli #The larger var in numerator Fc <- qf(1-alpha, dfhi, dfli)
pval <- 1-pf(fstat, dfhi, dfli)
# Since p-val = 0.0046 < alpha=0.05, we reject the null
# and conclude tht heteroskedasticity is present. In this case,
# we can see that the test suggests that there is a hiher variance
# at higher incomes.
# An easier way to conduct the GQ test is using the function gqtest # from the lmtest package
library(lmtest)
foodeq <- lm(food_exp~income, data=food)
gqtest(foodeq, point=0.5, alternative="greater",
order.by=food$income)
##
## Goldfeld-Quandt test
##
## data: foodeq
## GQ = 3.6148, df1 = 18, df2 = 18, p-value = 0.004596
## alternative hypothesis: variance increases from segment 1 to 2
8.2 Mitigating Heteroskedasticity
In the previous sections we learned how to detect heteroskedasticity, now we will discuss what to with our model to mitigate it its impact.
8.2.1 Heteroskedasticity-Consistent Standard Errors
Recall that heteroskedasticity has the two effects that (1) the LS estimators are no longer the best, and (2) the LS standard errors are not correct since the variance is no longer constant. A more robust (in the sense that it applies to both heteroskedastic and homoskedastic errors) is the White Statndard Error as given by:
N 22 var(b2)= N ∑i=1 (xi−x)ei
N−2∑N [(xi−x)2]2 i=1
For example, we look at the results from the food expenditure case and compare the traditional vs. the White standard errors, as well as the respective confidence intervals for the model coefficients. Keep in mind that hypotheses such as e.g., λ = c1 β1 + c2 β2 = c0 would also require us to update the standard errors, and therefore, our conclusions may change.
43
# Example 25: White SE
# Food Expenditure vs. Family Income
# To compute the Robust White SE, the car package has a function
# hccm (heteroskedasticity-robust coefficient covariance matrix) that will
# compute them
library(car)
library(PoEdata)
# Traditional Standard Errors
foodeq <- lm(food_exp~income,data=food)
tidy(foodeq)
term (Intercept) income
term (Intercept) income
p.value 0.0621824 0.0000195
p.value 0.0042989 0.0000018
estimate
83.41600
std.error
43.410163
statistic
1.921578
10.20964
2.093263
4.877381
# Robust White Standard Errors
cov1 <- hccm(foodeq, type="hc1")
food.HC1 <- coeftest(foodeq, vcov.=cov1)
tidy(food.HC1)
estimate
83.41600
std.error
27.463748
statistic
3.037313
10.20964
1.809077
5.643566
# Hypotheiss Test for a linear combination of paramters: H0 = beta_2 + beta_3 = 0
data("andy", package="PoEdata")
andy.eq <- lm(sales~price+advert, data=andy) bp <- bptest(andy.eq) #Heteroskedsticity test b2 <- coef(andy.eq)[["price"]]
b3 <- coef(andy.eq)[["advert"]]
H0 <- "price+advert=0"
# Traditional Standard Errors
linearHypothesis(andy.eq, H0)
Res.Df Pr(>F)
73 72
Res.Df Pr(>F)
NA 1.06e-05
RSS
Df
Sum of Sq
F
2254.715
NA
NA
NA
1718.943
1
535.7719
22.44145
# Robust White Standard Errors
linearHypothesis(andy.eq, H0, vcov=hccm(andy.eq, type=”hc1″))
Df
73 72
NA 7.3e-06
NA
NA
1
23.38698
# In this case the p-value is larger using the traditional errors.
8.2.2 Generalized (Weighted) Least Squares
F
To properly deal with Heteroskedasticity, we would need to estimate a different variance of the error term for each observation. On a practical level this is not possible, and instead we model the dependence of the variance on one or more of the predictors in the model. We will exams two forms of dependence depending on whether we know or not the actual relationship between the variance and predictor(s).
44
8.2.2.1 Known Form or Variance
For simplicity, let’s assume we have a SLR model yi = β1 + β2xi + ei and that var(ei) = σ2 = σ2xi. If we √ ∗∗i∗∗∗
divide our original model by xi, then we show that in the transformed model: yi = β1xi1 + β2xi2 + ei , ei are now homoskedastic since var(ei∗) =varei/√xi =var(ei)/xi = σ2xi/xi = σ2.Also, the desired properties E(ei∗) = 0 and cov(ei∗, e∗j ) = 0, i ̸= j still hold.
The transformation we proposed is part of a more general method known as Weighted Least Squares where we weighted the errors by x−1/2.
i
For example, looking again at the food expenditure example, applying this method we obtain:
# Example 25: Weighted Least Squares -Known Variance
# Food Expenditure vs. Family Income
# To compute the Robust White SE, the car package has a function
# hccm (heteroskedasticity-robust coefficient covariance matrix) that will
# compute them
library(car)
library(PoEdata)
# Traditional LS
foodeq <- lm(food_exp~income,data=food)
tidy(foodeq)
term (Intercept) income
term (Intercept) income
8.2.2.2 Unknown Form of Variance
p.value 0.0621824 0.0000195
p.value 0.0042989 0.0000018
estimate
83.41600
std.error
43.410163
statistic
1.921578
10.20964
2.093263
4.877381
# Weighted LS
w <- 1/food$income # Weights : R takes the sqrt by default already, hence w=1/x food.wls <- lm(food_exp~income, weights=w, data=food)
vcvfoodeq <- coeftest(foodeq, vcov.=cov1)
tidy(vcvfoodeq)
estimate
std.error
27.463748
statistic
83.41600
3.037313
10.20964
1.809077
5.643566
# Traditional LS
If we do not know the form of the variance, we can propose a more general power law form such as var(ei) = σi2 = σ2xiγ, where γ is an unknown parameter that we can estimate. The trick to estimating γ is to take logs on both side of the variance equation, ln(σi2) = ln(σ2) + γ ln(xi), and take the exponential on both sides,
45
σi2 =exp[ln(σ2)+γln(xi)]=exp(α1+α2zi),
where α1 = ln(σ2), α2 = γ, and zi = ln(xi). For the MR case, we simply have more terms, ziS to include in the exponential function. We can use traditional LS to estimate γ from the log-linear model ln(σi2) = α1 + α2zi.
2
The estimated equation is ln(σi ) = 0.9378 + 2.39zi (recall that we used γ = 1 in our previous example). Now
we can use σ2 = exp(0.9378 + 2.39zi ) to transform our original model instead of the x−1/2 weights. The ii
estimated model is now:
# Example 25: Weighted Least Squares -Uknown Variance
# Food Expenditure vs. Family Income
# To compute the Robust White SE, the car package has a function
# hccm (heteroskedasticity-robust coefficient covariance matrix) that will
# compute them
library(car)
library(PoEdata)
foodeq <- lm(food_exp~income,data=food)
data("food", package="PoEdata")
food.ols <- lm(food_exp~income, data=food)
ehatsq <- resid(food.ols)^2
sighatsq.ols <- lm(log(ehatsq)~log(income), data=food)
vari <- exp(fitted(sighatsq.ols))
food.fgls <- lm(food_exp~income, weights=1/vari, data=food)
# This case is also known as Feasible Generalized Least Squares library(stargazer)
# stargazer(food.ols, food.HC1, food.wls, food.fgls)
Table 2: Comparing various ’food’ models Dependent variable: ’food expenditure’
Constant income
Observations
9 Missing Observations
OLS
83.416 (43.410)
10.210 (2.093)
40
HC1
83.416 (27.464)
10.210 (1.809)
WLS
78.684 (23.789)
10.451 (1.386)
40
FGLS
76.054 (9.713)
10.633 (0.972)
40
There are several ways to deal with missing data depending on their characteristics and volume. Below we briefly discuss for of them.
46
1. Remove Missing Observations
If the number of missing observations is small relative to the dataset, we can simply remove these observations. Also, we need to make sure not to introduce a bias due to their removal by e.g., failing to represent certain characteristics of the data.
2. Remove the Variable
This may be appropriate if most of the missing values occur mainly in a particular variable. We would need to weigh its removal against other considerations such as economic significance, etc.
3. Imputation via Mean/Median/Mode
Depending on the distribution of the observations, we can impute the missing data with estimates of the Mean, Median or Mode.
4. Prediction
We can predict the missing observations using kNN (k-Nearest Neighbor), regression, interpolation, etc. However, of all these methods, kNN seems to work well provided the variables are not factor variables. The basic idea behind the method is to replace a missing observation with a weighted average of the closest (using the Euclidean distance) k neighbors. In the case of factor variables, we can instead use *Recursive Partitioning* (We normally cover this when discussing *Regression Trees*).
# Example 25: Boston Housing Data
library(mlbench) #Machine Learning Benchmark
# For refence on this example, look at:
# http://r-statistics.co/Missing-Value-Treatment-With-R.html#4.%20Prediction
data ("BostonHousing", package="mlbench") # initialize the data # load the data original <- BostonHousing # backup original data
# Introduce missing values
set.seed(100)
BostonHousing[sample(1:nrow(BostonHousing), 40), "rad"] <- NA
BostonHousing[sample(1:nrow(BostonHousing), 40), "ptratio"] <- NA
head(BostonHousing)
crim medv 0.00632 24.0 0.02731 21.6 0.02729 34.7 0.03237 33.4 0.06905 36.2 0.02985 28.7
##
## Attaching package: 'mice'
## The following object is masked from 'package:tidyr':
##
## complete
## The following objects are masked from 'package:base':
##
## cbind, rbind
md.pattern(BostonHousing) # pattern or missing values in data.
zn
indus
chas
nox
rm
age
dis
rad
tax
ptratio
b
lstat
18
2.31
0
0.538
6.575
65.2
4.0900
1
296
15.3
396.90
4.98
0
7.07
0
0.469
6.421
78.9
4.9671
2
242
17.8
396.90
9.14
0
7.07
0
0.469
7.185
61.1
4.9671
2
242
17.8
392.83
4.03
0
2.18
0
0.458
6.998
45.8
6.0622
NA
222
18.7
394.63
2.94
0
2.18
0
0.458
7.147
54.2
6.0622
3
222
18.7
396.90
5.33
0
2.18
0
0.458
6.430
58.7
6.0622
3
222
18.7
394.12
5.21
# Before going through the methods, we should first look at # the pattern of missing observations
library(mice)
47
crim zn induschas nox
rm
age
dis
tax
b
lstat medv rad ptratio
430
36 1 36 1
42 0 0 0 0 0 0 0 0 0 0 0 0 40 40 80
0
430
36 1 36 1 42
0
80
crim
zn
indus
chas
nox
rm
age
dis
tax
b
lstat
medv
rad
ptratio
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
1
1
1
1
1
1
1
1
1
1
1
1
0
1
1
1
1
1
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
40
40
# Method 1: Remove Missing Observations
lm(medv ~ ptratio + rad, data=BostonHousing, na.action=na.omit) #
##
## Call:
## lm(formula = medv ~ ptratio + rad, data = BostonHousing, na.action = na.omit)
##
## Coefficients:
## (Intercept) ptratio rad
## 57.0465 -1.7691 -0.2032
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:xtable':
##
## label, label<-
## The following object is masked from 'package:quantmod':
##
## Lag
## The following objects are masked from 'package:tis':
##
## capitalize, Lag
## The following objects are masked from 'package:plyr':
##
## is.discrete, summarize
## The following objects are masked from 'package:dplyr':
# Method 2: Remove the variable
# Simply leave it out of the regression model -not much more to do 🙂
# Method 3: Imputation via Mean/Median/Mode
library(Hmisc)
48
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
## Loading required package: grid
##
## Attaching package: 'DMwR'
## The following object is masked from 'package:broom':
##
## bootstrap
## The following object is masked from 'package:plyr':
##
## join
## mae mse rmse mape
## 1.7898624 4.6780049 2.1628696 0.1035111
## [1] FALSE
## mae mse rmse mape
## 0.043700256 0.084619114 0.290893647 0.002683069
= T)
#impute(BostonHousing$ptratio, mean) # replace with mean
#impute(BostonHousing$ptratio, median) # median
#impute(BostonHousing$ptratio, 20) # replace specific number
# or if you want to impute manually
#BostonHousing$ptratio[is.na(BostonHousing$ptratio)] <- mean(BostonHousing$ptratio, na.rm
# Compute the accuracy when NAs are imputed with the mean library(DMwR) #Data Mining w/ R
# This library offers many fancy ways of dealing with NAs
# As a side note, if you're interested in trading, try their trading.simulator actuals <- original$ptratio
predicted <- rep(mean(BostonHousing$ptratio, na.rm=T), length(actuals)) regr.eval(actuals, predicted)
# Looking at mape, we have about a 10% (mean absolute error) -not too bad
# Method 4: Prediction
# Perform knn imputation.
knnOutput <- knnImputation(BostonHousing[, !names(BostonHousing) %in% "medv"])
anyNA(knnOutput) # Check that we don't have any NAs left in the entire dataset
actuals <- original$ptratio
predicted <- knnOutput$ptratio
regr.eval(actuals, predicted)
# Based on mape, we are now down to 0.5% error!
# Prediction for Factor Variables
49
library(rpart)
class_mod <- rpart(rad ~ . - medv,
data=BostonHousing[!is.na(BostonHousing$rad), ],
method="class", na.action=na.omit) # since rad is a factor anova_mod <- rpart(ptratio ~ . - medv,
data=BostonHousing[!is.na(BostonHousing$ptratio), ],
method="anova", na.action=na.omit) # since ptratio is numeric. rad_pred <- predict(class_mod, BostonHousing[is.na(BostonHousing$rad), ])
ptratio_pred <- predict(anova_mod, BostonHousing[is.na(BostonHousing$ptratio), ])
# Accuracy for ptratio
actuals <- original$ptratio[is.na(BostonHousing$ptratio)]
predicteds <- ptratio_pred
regr.eval(actuals, predicteds)
## mae mse rmse mape
## 0.45583994 0.67213722 0.81983975 0.02592722
## [1] 0.225
##
## iter imp variable
## 1 1 rad ptratio
## 1 2 rad ptratio
## 1 3 rad ptratio
## 1 4 rad ptratio
## 1 5 rad ptratio
## 2 1 rad ptratio
## 2 2 rad ptratio
## 2 3 rad ptratio
## 2 4 rad ptratio
## 2 5 rad ptratio
## 3 1 rad ptratio
## 3 2 rad ptratio
## 3 3 rad ptratio
## 3 4 rad ptratio
## 3 5 rad ptratio
## 4 1 rad ptratio
## 4 2 rad ptratio
# A mape of about 4% not bad
# Acccuracy for rad (factor variable)
actuals <- original$rad[is.na(BostonHousing$rad)]
predicteds <- as.numeric(colnames(rad_pred)[apply(rad_pred, 1, which.max)]) mean(actuals != predicteds) # compute misclass error.
# About 25% mis-classification error -pretty good
# Method 5: Advanced Method (Multivariate Imputation by Chained Equations)
# Use the R package mice
library(mice)
# Perform mice imputation, based on random forests
miceMod <- mice(BostonHousing[, !names(BostonHousing) %in% "medv"], method="rf")
50
## 4 3 rad ptratio
## 4 4 rad ptratio
## 4 5 rad ptratio
## 5 1 rad ptratio
## 5 2 rad ptratio
## 5 3 rad ptratio
## 5 4 rad ptratio
## 5 5 rad ptratio
## [1] FALSE
## mae mse rmse mape
## 0.40750000 0.94025000 0.96966489 0.02407982
## [1] 0.2
miceOutput <- complete(miceMod) # generate the completed data. anyNA(miceOutput)
# Accuracy for ptratio
actuals <- original$ptratio[is.na(BostonHousing$ptratio)]
predicteds <- miceOutput[is.na(BostonHousing$ptratio), "ptratio"]
regr.eval(actuals, predicteds)
# Acccuracy for rad (factor variable)
actuals <- original$rad[is.na(BostonHousing$rad)] predicteds <- miceOutput[is.na(BostonHousing$rad), "rad"] mean(actuals != predicteds) # compute misclass error.
51