Modeling Issues in Linear Regression
Dr. Randall R. Rojas
Note: To access your textbook resources type the following on the console:
#library(car)
#carWeb()
1 Residuals and Influence Measures
When diagnosing a regression fit, we have to carefully assess the impact that unusual observations may have on the fit. Typically, we characterize unusual observations in terms of outliers, leverage, and influence. Next we discuss each one in detail.
A. Outliers
They are easy to spot visually because their y value is unusually large or small (conditioned on the predictor(s)) relative to the distribution of the other y values in the vicinity.
[Figure: scatter plot of Distance vs. Speed illustrating outliers]
B. Leverage
In this case we are looking at observations that have an extreme x (i.e., predictor) value relative to the
mean.
[Figure: scatter plot of Distance vs. Speed illustrating high-leverage observations]
C. Influence
An influential observation can be thought of as an outlier with high leverage. Such observations can alter the estimates of the regression coefficients, so when we remove them the regression fit may become more stable.
[Figure: scatter plot of Distance vs. Speed illustrating an influential observation]
There are several metrics we can use to better classify unusual observations. Since these tools depend on analyzing the residuals, however, we will first look at understanding residuals and then discuss the related metrics.
1.1 Residual Plots
The traditional residuals are given by ei = yi − ŷi, i = 1, . . . , N. They are a good diagnostic of how well the model fits the data: when plotted against, e.g., a predictor, they should be randomly scattered about y = 0 with a constant variance. In more general terms, we can propose a functional form for the residual variance such as
var(ei) = σ²(1 − hi)
where hi, called the leverage or hat-value, is bounded between 0 and 1. Large values of hi (a.k.a. high leverage) are associated with unusual values of xi. If hi is close to zero, we recover the classic linear regression SR1 condition of constant variance.
For the case of SLR, i.e., only one predictor, hi is given by:
hi = 1/N + (xi − x̄)² / ∑(xi − x̄)²
where the sum runs over all N observations. From this formula we can interpret hi (the leverage) as the proportion of the total sum of squares of the explanatory variable contributed by the i-th case.
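As a quick check of this formula, here is a minimal sketch (an illustration, not part of the original notes, using the food data that appears in Example 1 below) that computes the hat-values by hand and compares them with R's built-in hatvalues():
# Sketch: SLR hat-values computed by hand vs. hatvalues() (food data, PoEdata)
library(PoEdata)
data("food", package="PoEdata")
slr <- lm(food_exp~income, data=food)
x <- food$income
N <- length(x)
h.manual <- 1/N + (x - mean(x))^2/sum((x - mean(x))^2)
max(abs(h.manual - hatvalues(slr)))  # essentially zero (numerical precision)
# Rule of thumb: hat-values above 2*mean(h) = 2*2/N flag high-leverage points
which(h.manual > 2*mean(h.manual))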
# Example 1: Food expenditure vs. income (non-constant variance)
library(car) #for hccm() robust standard errors
##
## Attaching package: ‘car’
## The following object is masked from ‘package:purrr’:
##
## some
## The following object is masked from ‘package:dplyr’:
##
## recode
library(PoEdata) #for PoE4 datasets
data("food",package="PoEdata")
mreg.mod <- lm(food_exp~income, data=food)
plot(food$income,food$food_exp, pch=20,
xlab="Income", ylab="Food Expenditure")
abline(mreg.mod,lwd=2,col="blue")
[Figure: Food Expenditure vs. Income with the fitted regression line]
plot(food$income, mreg.mod$residuals, pch=20,ylab="Residuals", xlab="Income")
abline(h=0,col="red", lwd=2)
[Figure: residuals vs. Income]
# From the residuals plot we can see that the variance is not constant
# Example 2: US Investment Data
## Chapter 3 in Greene (2003)
## transform (and round) data to match Table 3.1
data("USInvest", package = "AER")
us <- as.data.frame(USInvest)
#us$invest <- round(0.1 * us$invest/us$price, digits = 3)
#us$gnp <- round(0.1 * us$gnp/us$price, digits = 3)
plot(us$gnp, us$invest , ylab="Investment", xlab="GNP",pch=20)
mreg.mod=lm(us$invest ~ us$gnp, data = us)
abline(mreg.mod,lwd=2,col="blue")
[Figure: Investment vs. GNP with the fitted regression line]
plot(us$gnp, mreg.mod$residuals, pch=20,ylab="Residuals", xlab="GNP",ylim=c(-45,45))
abline(h=0,col="red",lwd=2)
[Figure: residuals vs. GNP]
# These residuals also show some hint of non-constant variance.
There are transformations that can be applied to the residuals to make the variance more constant. However,
because they are applied to the residuals, the underlying regression model still needs to be reevaluated. For
example, we can consider the Standardized Residuals as given by:
eSi = ei / (σ̂ √(1 − hi)),
# Example 3: Standardized Residuals
library(car)
data("USInvest", package = "AER")
us <- as.data.frame(USInvest)
mreg.mod=lm(us$invest ~ us$gnp, data = us)
library(car)
residualPlots(mreg.mod,type = "rstandard")
[Figure: standardized residuals vs. us$gnp and vs. fitted values]
## Test stat Pr(>|Test stat|)
## us$gnp -1.8139 0.09477 .
## Tukey test -1.8139 0.06970 .
## —
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
or the Studentized Residuals:
eTi = ei / (σ̂(−i) √(1 − hi)).
# Example 4: Studentized Residuals
library(car)
data("USInvest", package = "AER")
us <- as.data.frame(USInvest)
mreg.mod=lm(us$invest ~ us$gnp, data = us)
residualPlots(mreg.mod,type = "rstudent")
[Figure: studentized residuals vs. us$gnp and vs. fitted values]
## Test stat Pr(>|Test stat|)
## us$gnp -1.8139 0.09477 .
## Tukey test -1.8139 0.06970 .
## —
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
Another alternative is to compute the Pearson Residuals which are obtained from the Weighted Least Squares
(WLS) Regression, where the weights are equal to wi and the residuals can be computed from:
ePi = √(wi) ei. As can be seen, when wi = 1, we recover the standard residuals.
# Example 5: Pearson Residuals
library(car)
data("USInvest", package = "AER")
us <- as.data.frame(USInvest)
mreg.mod=lm(us$invest ~ us$gnp, data = us)
library(car)
residualPlots(mreg.mod)
[Figure: Pearson residuals vs. us$gnp and vs. fitted values]
## Test stat Pr(>|Test stat|)
## us$gnp -1.8139 0.09477 .
## Tukey test -1.8139 0.06970 .
## —
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
# Example 6: All Residuals
#We can also graph all the relevant residual plots
prestige.mod.2 <- lm(prestige ~ education + income + type,
data=Prestige)
residualPlots(prestige.mod.2)
[Figure: Pearson residuals vs. education, income, type, and the fitted values for the Prestige model]
## Test stat Pr(>|Test stat|)
## education -0.6836 0.495942
## income -2.8865 0.004854 **
## type
## Tukey test -2.6104 0.009043 **
## —
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
1.2 Identifying and Classifying Unusual Observations
Unusual observations can be somewhat difficult to quantify through visual inspection but fortunately there
are several robust metrics that can be used in a more objective fashion.
A. Q-Q Plot
This plot shows the studentized residuals vs. the corresponding quantiles of a t-distribution with N − K − 2 degrees of freedom, and allows us to identify outliers (studentized residuals ≳ 2 in absolute value are considered large).
# Example 6: Q-Q Plot
library(car)
mreg.mod <- lm(prestige ~ income + education, data=Duncan)
qqPlot(mreg.mod, id=list(n=3))
[Figure: Q-Q plot of studentized residuals vs. t quantiles, flagging minister, reporter, and contractor]
## minister reporter contractor
## 6 9 17
# It looks like minister might be a potential outlier. To be sure,
# we can perform an outlier test known as the Bonferroni-corrected t-test:
outlierTest(mreg.mod)
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferonni p
## minister 3.134519 0.0031772 0.14297
# The large p-value (Bonferroni) suggests that this value is not surprising.
B. Cook’s Distance Plot
Cook’s Distance is defined as:
Di = e²Si/(k + 1) × hi/(1 − hi)
where large values of Di (larger than 1 and/or larger relative to the majority of the values) are typically
associated with influential observations.
# Example 7: Cook's Distance Plot
library(car)
mreg.mod <- lm(prestige ~ income + education, data=Duncan)
influenceIndexPlot(mreg.mod, id=list(n=3),vars="Cook")
[Figure: index plot of Cook's distance, flagging minister, conductor, and reporter]
# It looks like minister, conductor and reporter are potential outliers
# Note: if you don't specify the flag "vars", it will plot all of them:
influenceIndexPlot(mreg.mod, id=list(n=3))
[Figure: diagnostic index plots of Cook's distance, studentized residuals, Bonferroni p-values, and hat-values]
C. Hat-Values Plot
This plot is designed for identifying high leverage observations (some of which could also be influential).
On average hi = p/N; therefore, hi ≳ 2p/N is considered large.
# Example 8: Hat-Values Plot
library(car)
mreg.mod <- lm(prestige ~ income + education, data=Duncan)
influenceIndexPlot(mreg.mod, id=list(n=3),vars="hat")
[Figure: index plot of hat-values, flagging RR.engineer, conductor, and minister]
# It looks like minister, conductor and RR.engineer are influential
# We can check if e.g., minister is influential by comparing the
# coefficient estimates w/ and w/o 'minister'
mod.duncan <- lm(prestige ~ income + education, data=Duncan)
mod.duncan.2 <- update(mod.duncan,
subset= rownames(Duncan) != "minister")
compareCoefs(mod.duncan, mod.duncan.2)
## Calls:
## 1: lm(formula = prestige ~ income + education, data = Duncan)
## 2: lm(formula = prestige ~ income + education, data = Duncan, subset =
## rownames(Duncan) != "minister")
##
## Model 1 Model 2
## (Intercept) -6.06 -6.63
## SE 4.27 3.89
##
## income 0.599 0.732
## SE 0.120 0.117
##
## education 0.5458 0.4330
## SE 0.0983 0.0963
##
# We can also depict the hat-values via:
influencePlot(mreg.mod, id=list(n=3))
[Figure: influence plot of studentized residuals vs. hat-values, with circle size proportional to Cook's distance]
StudRes Hat CookD
minister 3.1345186 0.1730582 0.5663797
reporter -2.3970224 0.0543936 0.0989846
conductor -1.7040324 0.1945416 0.2236412
contractor 2.0438046 0.0432552 0.0585235
RR.engineer 0.8089221 0.2690896 0.0809681
D. Added Variable Plot (Partial Regression)
Depicts the relationship between y and one predictor x, adjusting for the effects of the other predictors.
These can be used to identify influential observations and high-leverage observations. For example, high-leverage observations appear in the added-variable plots as points horizontally distant from the rest of the data. They are also useful for identifying jointly influential points.
An alternative method is to look at the individual coefficient differences between estimates from the original data (bj) and those where observation i is removed (b(−i)j):
dfbetaij = b(−i)j − bj, for j = 0, . . . , k
# Example 9: Added-Variable Plot
library(car)
mreg.mod <- lm(prestige ~ income + education, data=Duncan)
avPlots(mreg.mod, id=list(n=3, method="mahal"))
[Figure: added-variable plots for income and education, flagging minister, RR.engineer, and conductor]
# It looks like minister, conductor and RR.engineer are influential
# minister & conductor --> decrease the income slope
# minister & conductor --> increase the education slope
# RR.engineer seems fine for the most part
# We can check if e.g., minister & conductor are jointly influential
# by comparing the coefficient estimates w/ and w/o 'minister' & 'conductor'
mod.duncan <- lm(prestige ~ income + education, data=Duncan)
mod.duncan.3 <- update(mod.duncan,
subset = - whichNames(c("minister", "conductor"), Duncan))
compareCoefs(mod.duncan, mod.duncan.2, mod.duncan.3, se=FALSE)
## Calls:
## 1: lm(formula = prestige ~ income + education, data = Duncan)
## 2: lm(formula = prestige ~ income + education, data = Duncan, subset =
## rownames(Duncan) != "minister")
## 3: lm(formula = prestige ~ income + education, data = Duncan, subset =
## -whichNames(c("minister", "conductor"), Duncan))
##
## Model 1 Model 2 Model 3
## (Intercept) -6.06 -6.63 -6.41
## income 0.599 0.732 0.867
## education 0.546 0.433 0.332
# Based on the comparison it does appear that these points were
# jointly influential.
# We can also depict the hat-values via:
influencePlot(mreg.mod, id=list(n=3))
[Figure: influence plot of studentized residuals vs. hat-values (same plot as above)]
StudRes Hat CookD
minister 3.1345186 0.1730582 0.5663797
reporter -2.3970224 0.0543936 0.0989846
conductor -1.7040324 0.1945416 0.2236412
contractor 2.0438046 0.0432552 0.0585235
RR.engineer 0.8089221 0.2690896 0.0809681
# Lastly, we can compute the df_betas and check the individual differences
mreg.mod <- lm(prestige ~ income + education, data=Duncan)
dfbs <- dfbetas(mreg.mod)
# We can plot the influence on the income coefficient against the
# influence on the education coefficient
plot(dfbs[ ,c("income", "education")]) # for b1 and b2
showLabels(dfbs[ , "income"], dfbs[ , "education"],
labels=rownames(Duncan), method=c(6, 16, 27))
[Figure: dfbetas for the income coefficient vs. the education coefficient, labeling minister, conductor, and RR.engineer]
## minister conductor RR.engineer
## 6 16 27
# Notice that once again, minister and conductor stand out
# We can also plot them all via:
avPlots(lm(prestige~income+education+type, data=Duncan))
[Figure: added-variable plots for income, education, typeprof, and typewc in the model that includes type]
# Example 10: Other Plots
library(car)
mreg.mod <- lm(prestige ~ income + education + type, data=Prestige)
# Marginal Plots
marginalModelPlots(mreg.mod, sd=TRUE)
## Warning in mmps(...): Interactions and/or factors skipped
[Figure: marginal model plots (data vs. model smooths) for income, education, and the fitted values]
# Marginal & Added-Variable Plots
mcPlots(mreg.mod, ~ education, overlaid=FALSE)
[Figure: marginal and added-variable plots for education]
mcPlots(mreg.mod, ~ income, overlaid=FALSE)
[Figure: marginal and added-variable plots for income]
2 Predictor Transformations
Suppose we have a linear model y = β1 + β2x2 + β3x3 + e, but are interested in the relationship between y and x3 only. We would therefore like to plot y vs. x3 with the effect of x2 removed, which is equivalent to plotting y − (β1 + β2x2) vs. x3. We can accomplish this with a Component-Plus-Residual Plot. Mathematically, we are really doing an approximation by estimating a partial residual for x3. Algorithmically, here is how we can estimate it:
i. Obtain the coefficient estimates β̂k, for k = 1, 2, 3, from E[y|x] = β1 + β2x2 + β3x3, and the respective residuals, e
ii. Estimate the partial residual according to: epartial = y − (β̂1 + β̂2x2)
iii. The Component-Plus-Residual is then: epartial = β̂3x3 + e
In general, we can express the partial residual as epartial,ij = ei + bjxij. This plot is often used to help detect
the need for transformations of the predictor variables by looking for any departures from linearity in the
plots of epartial,ij vs. xij. If we identify any nonlinear dynamics, we can then apply the techniques from
Lecture 1 to the respective predictor variables.
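To make the algorithm concrete, here is a minimal sketch (an illustration, not part of the original notes) that builds the partial residuals for income by hand for the same Prestige model used in Example 11 below; plotting them against income reproduces the income panel of crPlots():
# Sketch: partial residuals for 'income' computed by hand (Prestige data, car package)
library(car)
cpr.mod <- lm(prestige ~ income + education + women, data=Prestige)
e.partial <- residuals(cpr.mod) + coef(cpr.mod)["income"]*Prestige$income
plot(Prestige$income, e.partial, pch=20,
     xlab="income", ylab="Component+Residual(prestige)")
abline(lm(e.partial ~ Prestige$income), lty=2, col="blue")  # the linear 'partial fit'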
# Example 11: Component-Plus-Residual Plot
library(car)
mreg.mod <- lm(prestige ~ income + education + women,
data=Prestige)
crPlots(mreg.mod, order=2)
[Figure: component-plus-residual plots for income, education, and women]
# The dashed blue line is a partial fit, and the magenta curve a lowess smoother. Of the
# 3 predictors, income seems to have the largest degree of curvature. Therefore, we might
# consider transforming it using, e.g., the bulging rule.
# We can try applying a log transformation to income
mreg.mod2 <- update(mreg.mod,
. ~ . + log2(income) - income)
crPlots(mreg.mod2, order=2)
[Figure: component-plus-residual plots for education, women, and log2(income)]
# It looks better, but maybe we should also apply a quadratic transformation to women
mreg.mod3 <- update(mreg.mod2,
. ~ . - women + poly(women, 2))
crPlots(mreg.mod3, order=1)
[Figure: component-plus-residual plots for education, log2(income), and poly(women, 2)]
# Note: When there are strong non-linear relations between predictors, the
# component-plus-residual plots may not work; in this case, we instead look
# at CERES Plots (Combining conditional Expectations and REsiduals)
ceresPlots(lm(prestige~income+education+type, data=Prestige))
## Warning in ceresPlots(lm(prestige ~ income + education + type, data =
## Prestige)): Factors skipped in drawing CERES plots.
[Figure: CERES plots for income and education]
A more general method for dealing with nonlinear predictors is to apply the Box-Tidwell method for transforming predictors. Their proposed model assumes the form:
y = β0 + β1 TBC(x1, γ1) + · · · + βk TBC(xk, γk) + ε,
where TBC is the Box-Cox power transformation introduced in Lecture 1.
# Example 12: Box-Tidwell Transformation
library(car)
# In R we can use the function boxTidwell to suggest the
# appropriate transformation(s) via MLE, while also allowing
# for user-specified transformations
# For example, let's apply a quadratic transformation to women
# and see what it recommends for income and education:
boxTidwell(prestige ~ income + education,
other.x = ~poly(women,2),data=Prestige)
## MLE of lambda Score Statistic (z) Pr(>|z|)
## income -0.037775 -5.3013 1.15e-07 ***
## education 2.192827 2.4056 0.01615 *
## —
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
##
## iterations = 12
# According to the output: lambda = -0.04 for income --> log
# According to the output: lambda = 2.19 for education --> quadratic
# To visualize these results, we can plot the "Constructed-Variable Plots"
# 'construction' in this context means transformation
3 Omitted Variable Bias
The Omitted Variable Bias quantifies the impact of omitting a relevant (e.g., statistically and/or economically significant) variable from a regression model. For example, suppose the true underlying population model is y = β1 + β2x2 + β3x3 + e, where both x2 and x3 should be included, but instead we only estimated the model y = β1 + β2x2 + e. How can we quantify the effect that omitting x3 would have on the model?
We can show that mathematically it corresponds to a bias, namely:
bias(b∗2) = E(b∗2) − β2 = β3 · cov̂(x2, x3) / var̂(x2)
The example below shows how omitting a relevant variable, such as the wife's years of education in a model of annual family income, leads to a bias of roughly $2000 in the husband's education coefficient.
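We can also evaluate the bias formula directly. The sketch below (an illustration, not part of the original notes) plugs the edu_inc sample moments into β3 · cov̂(x2, x3)/var̂(x2) and compares the result with the actual change in the he coefficient; the two agree because the omitted-variable identity holds exactly in-sample.
# Sketch: omitted-variable bias from the formula vs. the actual coefficient change
library(PoEdata)
data("edu_inc", package="PoEdata")
full <- lm(faminc~he+we, data=edu_inc)   # model including both spouses' education
short <- lm(faminc~he, data=edu_inc)     # model omitting we
bias.formula <- coef(full)["we"]*cov(edu_inc$he, edu_inc$we)/var(edu_inc$he)
bias.actual <- coef(short)["he"] - coef(full)["he"]
c(formula=unname(bias.formula), actual=unname(bias.actual))  # both ~ +2000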
# Example 12: Omitted Variable Bias
library(car)
library(AER)
library(broom)
data("edu_inc", package="PoEdata")
mreg.mod1 <- lm(faminc~he+we, data=edu_inc)
mreg.mod2 <- lm(faminc~he, data=edu_inc)
# Correct Model
summary(mreg.mod1)
##
## Call:
## lm(formula = faminc ~ he + we, data = edu_inc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -88299 -23663 -6800 17244 232085
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5533.6 11229.5 -0.493 0.622426
## he 3131.5 802.9 3.900 0.000112 ***
## we 4522.6 1066.3 4.241 2.73e-05 ***
## —
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
##
## Residual standard error: 40500 on 425 degrees of freedom
## Multiple R-squared: 0.1613, Adjusted R-squared: 0.1574
## F-statistic: 40.87 on 2 and 425 DF, p-value: < 2.2e-16
#Incorrect Model
summary(mreg.mod2)
##
## Call:
## lm(formula = faminc ~ he, data = edu_inc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -78985 -26196 -6924 18372 250934
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26191.3 8541.1 3.066 0.0023 **
## he 5155.5 658.5 7.830 3.92e-14 ***
## —
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
##
## Residual standard error: 41300 on 426 degrees of freedom
## Multiple R-squared: 0.1258, Adjusted R-squared: 0.1237
## F-statistic: 61.3 on 1 and 426 DF, p-value: 3.921e-14
# We can see that the effect of omitting wedu translates to an
# overstatement of ~$2000, which corresponds to the bias in this case.
# Q: What if we include kl6 ( = number of children under the age of 6)?
mreg.mod3 <- lm(faminc~he+we+kl6, data=edu_inc)
summary(mreg.mod3)
##
## Call:
## lm(formula = faminc ~ he + we + kl6, data = edu_inc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91760 -23745 -6642 17636 243255
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7755.3 11162.9 -0.695 0.48760
## he 3211.5 796.7 4.031 6.58e-05 ***
## we 4776.9 1061.2 4.502 8.73e-06 ***
## kl6 -14310.9 5003.9 -2.860 0.00445 **
## —
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
##
## Residual standard error: 40160 on 424 degrees of freedom
## Multiple R-squared: 0.1772, Adjusted R-squared: 0.1714
## F-statistic: 30.43 on 3 and 424 DF, p-value: < 2.2e-16
# A: In this case, the coefficients of hedu and wedu did not change by much,
# yet kl6 is both statistically and economically significant.
# This suggests that the estimates of wedu and hedu are robust.
4 Irrelevant Variables
Typically, including irrelevant variables in the model has the effect of increasing the variances of the other coefficient estimates, and therefore potentially rendering those estimates statistically insignificant.
# Example 13: Irrelevant Variables
library(car)
library(AER)
library(broom)
library(PoEdata)
data("edu_inc", package="PoEdata")
mreg.mod1 <- lm(faminc~he+we+kl6, data=edu_inc)
# The variables xtra_x5 and xtra_x6 contain random values that are not associated with y
# We would expect that the variances of the relevant variables will increase as we introduce
# these irrelevant variables.
mreg.mod2 <- lm(faminc~he+we+kl6+xtra_x5, data=edu_inc)
mreg.mod3 <- lm(faminc~he+we+kl6+xtra_x5+xtra_x6, data=edu_inc)
#summary(mreg.mod1)
#summary(mreg.mod2)
#summary(mreg.mod3)
#stargazer(mreg.mod1, mreg.mod2, mreg.mod3)
Below we summarize the results from the 3 models:
Table 1:
Dependent variable:
faminc
(1) (2) (3)
he 3,211.526∗∗∗ 3,377.219∗∗∗ 3,339.792∗∗∗
(796.703) (1,247.058) (1,250.039)
we 4,776.907∗∗∗ 4,783.930∗∗∗ 5,868.677∗∗
(1,061.164) (1,063.156) (2,278.067)
kl6 −14,310.920∗∗∗ −14,216.440∗∗∗ −14,200.180∗∗∗
(5,003.928) (5,039.395) (5,043.720)
xtra_x5 −180.162 888.843
(1,042.335) (2,242.491)
xtra_x6 −1,067.186
(1,981.685)
Constant −7,755.330 −7,682.601 −7,558.613
(11,162.930) (11,183.650) (11,195.410)
Observations 428 428 428
R2 0.177 0.177 0.178
Adjusted R2 0.171 0.169 0.168
Residual Std. Error 40,160.080 (df = 424) 40,206.100 (df = 423) 40,239.890 (df = 422)
F Statistic 30.432∗∗∗ (df = 3; 424) 22.779∗∗∗ (df = 4; 423) 18.251∗∗∗ (df = 5; 422)
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01
5 Multicollinearity
Multicollinearity occurs when there are strong linear relationships between the predictors. For example, consider the model y = β1 + β2x2 + β3x3 + ε, where x3 = α1 + α2x2. In this case, Ordinary Least Squares cannot generate estimates of the regression coefficients because the predictors x2 and x3 move together. In this case (which is considered Perfect Multicollinearity) the solution would be to drop one of the two predictors from the model.
In general, Imperfect Multicollinearity is more common. This occurs when e.g., using the previous case,
x3 = α1 + α2x2 + u, where u represents some unknown random component.
The effects of Multicollinearity can be summarized as follows:
1. The variances and standard errors of the regression coefficient estimates will increase. This leads to
lower t−statistics.
2. The overall fit of the regression equation will be largely unaffected by multicollinearity. Therefore,
forecasting and prediction will be largely unaffected.
3. Regression coefficients can change substantially when variables are added or dropped.
5.1 Detecting Multicollinearity
There are several ways in which we can detect multicollinearity. Below are the 3 most common ones:
1. Large Correlation Coefficients
High pairwise correlations among independent variables are a strong sign of multicollinearity. Usually |rxi,xj| ≳ 0.8 is considered large.
2. Large R2 with Low t−Stats
It is possible for individual coefficients to be insignificant but for the overall fit of the equation to be
high.
3. Variance Inflation Factor (VIF)
VIF is a measure of how much the variance of the estimated regression coefficient bk is inflated by the existence of correlations among the predictor variables in the model. Consider for example v̂ar(bj) as given by:
v̂ar(bj) = σ̂²/((n − 1)s²j) × 1/(1 − R²j).
The term 1/(1 − R²j) is called the *VIF*. A value of VIF = 1 suggests no correlation, whereas VIF ≳ 10 (and in some cases ≳ 4) is considered large, and thus we might consider removing those variables with large VIFs from the model. The R²j are obtained from the *Auxiliary Regressions*, i.e., xj regressed on the other predictors.
An important limitation of VIF is that it does not apply to nonlinear regressors or contrasts. For these cases we would instead use the Generalized VIF (GVIF), reported as GVIF^(1/(2p)), where p represents the number of regressors in a term.
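To see where these numbers come from, the sketch below (an illustration, not part of the original notes, using the cars data from Example 13) rebuilds the VIF of cyl from its auxiliary regression and also shows the quick pairwise-correlation screen from item 1 above.
# Sketch: VIF of 'cyl' rebuilt from its auxiliary regression (cars data, PoEdata)
library(car)
library(PoEdata)
data("cars", package="PoEdata")
aux <- lm(cyl~eng+wgt, data=cars)      # auxiliary regression for cyl
1/(1 - summary(aux)$r.squared)         # VIF = 1/(1 - R_j^2); ~10.5, matching vif() in Example 13
# Detection rule 1: pairwise correlations, |r| >~ 0.8 is a warning sign
cor(cars[, c("cyl","eng","wgt")])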
# Example 13: Multicollinearity (Fuel Efficiency)
library(car)
data("cars", package="PoEdata")
mreg.mod=lm(mpg~cyl+eng+wgt, data=cars)
summary(mreg.mod)
##
## Call:
## lm(formula = mpg ~ cyl + eng + wgt, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.5568 -2.8703 -0.3649 2.2708 16.4338
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.3709616 1.4806851 29.967 < 2e-16 ***
## cyl -0.2677968 0.4130673 -0.648 0.517
## eng -0.0126740 0.0082501 -1.536 0.125
## wgt -0.0057079 0.0007139 -7.995 1.5e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.297 on 388 degrees of freedom
## Multiple R-squared: 0.6993, Adjusted R-squared: 0.697
## F-statistic: 300.8 on 3 and 388 DF, p-value: < 2.2e-16
tidy(vif(mreg.mod))
## Warning: 'tidy.numeric' is deprecated.
## See help("Deprecated")
names x
cyl 10.515508
eng 15.786455
wgt 7.788716
# We can see that both cyl and eng should be removed since VIF>10
# Example 14: Multicollinearity (Median household value in Boston)
data("Boston", package = "MASS")
mreg.mod <- lm(medv ~., data = Boston)
summary(mreg.mod)
##
## Call:
## lm(formula = medv ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
## crim -1.080e-01 3.286e-02 -3.287 0.001087 **
## zn 4.642e-02 1.373e-02 3.382 0.000778 ***
## indus 2.056e-02 6.150e-02 0.334 0.738288
## chas 2.687e+00 8.616e-01 3.118 0.001925 **
## nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
## rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
## age 6.922e-04 1.321e-02 0.052 0.958229
## dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
## rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
## tax -1.233e-02 3.760e-03 -3.280 0.001112 **
## ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
## black 9.312e-03 2.686e-03 3.467 0.000573 ***
## lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
tidy(vif(mreg.mod))
## Warning: 'tidy.numeric' is deprecated.
## See help("Deprecated")
names x
crim 1.792191
zn 2.298758
indus 3.991596
chas 1.073995
nox 4.393720
rm 1.933744
age 3.100826
dis 3.955945
rad 7.484496
tax 9.008554
ptratio 1.799084
black 1.348521
lstat 2.941491
# We can see that nox, rad, and tax might be worth removing since VIF>4
# Also note that indus and dis are very close to 4, so we might want to
# consider removing them as well.
# Example 15: Multicollinearity (US Census Data) - see Ericksen et al., 1989
mreg.mod <- lm(undercount ~ ., data=Ericksen)
summary(mreg.mod)
##
## Call:
## lm(formula = undercount ~ ., data = Ericksen)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8356 -0.8033 -0.0553 0.7050 4.2467
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.611411 1.720843 -0.355 0.723678
## minority 0.079834 0.022609 3.531 0.000827 ***
## crime 0.030117 0.012998 2.317 0.024115 *
## poverty -0.178369 0.084916 -2.101 0.040117 *
## language 0.215125 0.092209 2.333 0.023200 *
## highschool 0.061290 0.044775 1.369 0.176415
## housing -0.034957 0.024630 -1.419 0.161259
## citystate -1.159982 0.770644 -1.505 0.137791
## conventional 0.036989 0.009253 3.997 0.000186 ***
## —
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
##
## Residual standard error: 1.426 on 57 degrees of freedom
## Multiple R-squared: 0.7077, Adjusted R-squared: 0.6667
## F-statistic: 17.25 on 8 and 57 DF, p-value: 1.044e-12
tidy(vif(mreg.mod))
## Warning: ‘tidy.numeric’ is deprecated.
## See help(“Deprecated”)
names x
minority 5.009065
crime 3.343586
poverty 4.625178
language 1.635568
highschool 4.619169
housing 1.871745
city 3.537750
conventional 1.691320
# We can see that minority, poverty, and highschool might be worth removing since VIF>4
6 Model Misspecification: Ramsey RESET
As we have seen previously, sometimes adding nonlinear expressions of the predictors can help improve our model in terms of economic and statistical significance. A popular test to evaluate if such nonlinear terms are needed is known as the REgression Specification Error Test (RESET). For example, assume we have the model y = β1 + β2x2 + β3x3 + ε and we are interested in determining if we can improve the model by including higher-order terms of the predictors. We can easily assess this by estimating the model y = β1 + β2x2 + β3x3 + γŷ² + ε and testing the null hypothesis H0 : γ = 0 against H1 : γ ≠ 0. If we reject H0, then we should consider including, e.g., quadratic terms. We can then perform the same test but instead including the term δŷ³, and so on, until we fail to reject H0. In practice, however, we should not have to include higher-order terms beyond 2 or 3.
# Example 16: Functional Form (RESET)
library(car)
library(PoEdata)
library(lmtest)
# We will create two models, a quadratic one and a linear one, and test each for model misspecification
x <- c(1:30)
# Model 1: Quadratic
y1 <- 1 + x + x^2 + rnorm(30)
# Test if a quadratic model is appropriate
resettest(y1 ~ x , power=2, type="regressor")
##
## RESET test
##
## data: y1 ~ x
## RESET = 137780, df1 = 1, df2 = 27, p-value < 2.2e-16
# According to the test, we can improve the model
# Now test the linear model: is a quadratic term needed?
y2 <- 1 + x + rnorm(30)
resettest(y2 ~ x , power=2, type="regressor")
##
## RESET test
##
## data: y2 ~ x
## RESET = 0.36652, df1 = 1, df2 = 27, p-value = 0.55
# According to the test, our model is well specified, no need for adding a quadratic term
# Example 17: Family Income (RESET)
data("edu_inc", package="PoEdata")
mreg.mod <- lm(faminc~he+we+kl6, data=edu_inc)
resettest(mreg.mod , power=2, type="regressor")
##
## RESET test
##
## data: mreg.mod
## RESET = 2.6046, df1 = 3, df2 = 421, p-value = 0.05144
# According to the test, our model is fine (p = 0.051, so we fail to reject H0 at the 5% level, though only barely)
7 Model Selection
We now discuss how to select the best model among competing ones using robust statistical metrics.
7.1 Akaike Information Criterion (AIC)
AIC estimates the quality of the models being considered for the data relative to each other. The smaller the
value of AIC, the better the model.
AIC = ln(SSE/N) + 2K/N
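The formula can also be computed directly from a fitted lm object. The sketch below is only an illustration (not part of the original notes); R's built-in AIC() is based on the full log-likelihood, so its values differ from this textbook version, but for models fit to the same data both rank the models identically.
# Sketch: textbook AIC = ln(SSE/N) + 2K/N computed by hand (edu_inc data, PoEdata)
library(PoEdata)
data("edu_inc", package="PoEdata")
aic.tb <- function(mod) {
  sse <- sum(resid(mod)^2)
  N <- nobs(mod)
  K <- length(coef(mod))   # number of estimated coefficients
  log(sse/N) + 2*K/N
}
m1 <- lm(faminc~he, data=edu_inc)
m3 <- lm(faminc~he+we+kl6, data=edu_inc)
c(m1=aic.tb(m1), m3=aic.tb(m3))  # smaller is better; m3 wins, matching AIC() in Example 18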
# Example 18: AIC (Family Income)
library(car)
library(AER)
library(broom)
library(PoEdata)
data("edu_inc", package="PoEdata")
mreg.mod1 <- lm(faminc~he, data=edu_inc)
mreg.mod2 <- lm(faminc~he+we, data=edu_inc)
mreg.mod3 <- lm(faminc~he+we+kl6, data=edu_inc)
mreg.mod4 <- lm(faminc~he+we+kl6+xtra_x5, data=edu_inc)
mreg.mod5 <- lm(faminc~he+we+kl6+xtra_x5+xtra_x6, data=edu_inc)
AIC(mreg.mod1, mreg.mod2, mreg.mod3, mreg.mod4, mreg.mod5)
df AIC
mreg.mod1 3 10316.65
mreg.mod2 4 10300.91
mreg.mod3 5 10294.73
mreg.mod4 6 10296.70
mreg.mod5 7 10298.41
# As expected, Model 3 (faminc~he+we+kl6) is the preferred choice
7.2 Bayesian Information Criterion (BIC)
Similarly to AIC, BIC does the same but penalizes extra variables more heavily than AIC.
BIC = ln(SSE/N) + K ln(N)/N
As with AIC, the model with the smallest BIC is preferred.
# Example 18: BIC (Family Income)
library(car)
library(AER)
library(broom)
library(PoEdata)
data("edu_inc", package="PoEdata")
mreg.mod1 <- lm(faminc~he, data=edu_inc)
mreg.mod2 <- lm(faminc~he+we, data=edu_inc)
mreg.mod3 <- lm(faminc~he+we+kl6, data=edu_inc)
mreg.mod4 <- lm(faminc~he+we+kl6+xtra_x5, data=edu_inc)
mreg.mod5 <- lm(faminc~he+we+kl6+xtra_x5+xtra_x6, data=edu_inc)
BIC(mreg.mod1, mreg.mod2, mreg.mod3, mreg.mod4, mreg.mod5)
df BIC
mreg.mod1 3 10328.83
mreg.mod2 4 10317.15
mreg.mod3 5 10315.03
mreg.mod4 6 10321.06
mreg.mod5 7 10326.82
# As expected, Model 3 (faminc~he+we+kl6) is also the preferred choice
7.3 Mallows’ CP
Mallows' Cp can be used to assess model fits using different numbers of parameters. The statistic is given by:
Cp = SSE(p)/σ̂² − N + 2p,
where SSE is the sum squared error, N the number of observations, and p the number of parameters being
used to estimate the model.
The smaller the value of the statistic Cp, the better. Therefore, if model p is the best choice, Cp will tend to be
close to or smaller than p.
# Example 18: Mallows Cp (Family Income)
library(car)
library(AER)
library(broom)
library(PoEdata)
library(leaps)
data("edu_inc", package="PoEdata")
mreg.mod5 <- lm(faminc~he+we+kl6+xtra_x5+xtra_x6, data=edu_inc)
ss=regsubsets(faminc~he+we+kl6+xtra_x5+xtra_x6,method=c("exhaustive"),nbest=3,data=edu_inc)
subsets(ss,statistic="cp",legend=F,main="Mallows CP",col="steelblue4")
Abbreviation
he h
we w
kl6 k
xtra_x5 x_5
xtra_x6 x_6
legend(3,25,bty="y",legend=c('h=hedu','w=wedu','k=kl6','x_5 =xtra5', 'x_6=extr6'),col="steelblue4",cex=1.25)
1 2 3 4 5
5
1
0
1
5
2
0
2
5
Mallows CP
Subset Size
S
ta
tis
tic
:
cp
w
h
x_6
h−w
w−kw−x_5
h−w−k
w−k−x_5h−w−x_6
h−w−k−x_6h−w−k−x_5
h−k−x_5−x_6
h−w−k−x_5−x_6
h=hedu
w=wedu
k=kl6
x_5 =xtra5
x_6=extr6
# As expected, Model 3 (faminc~he+we+kl6) is also the preferred choice by Cp
7.4 Stepwise Regression
Since our goal is ultimately about making the best predictions possible with our model, Stepwise Regression
consists of iteratively adding and removing predictors until the model with the lowest prediction error is
obtained. There are 3 popular ways to perform this:
1. Forward Selection
Start with no predictors in the model, iteratively add the most contributive predictors, and stop when the improvement is no longer statistically significant.
2. Backward Selection
Start with all predictors in the model (the full model), iteratively remove the least contributive predictors,
and stop when you have a model where all predictors are statistically significant.
3. Stepwise Selection
This is a combination of forward and backward selection. You start with no predictors, then sequentially
add the most contributive predictors (like forward selection). After adding each new variable, remove
any variables that no longer provide an improvement in the model fit (like backward selection).
Below we show a brute force approach by manually comparing models based on their prediction accuracy.
# Example 19: Evaluating the Prediction Error
# See http://www.sthda.com/ for details of the example below
library(car)
library(AER)
library(broom)
library(PoEdata)
library(leaps)
library(tidyverse)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
## The following object is masked from 'package:survival':
##
## cluster
library(MASS)
suppressMessages(library("tidyverse"))
# Load the data
data("Boston", package = "MASS")
# Split the data into training and test set
set.seed(123)
training.samples <- Boston$medv %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]
# Build the model
model1 <- lm(medv ~., data = train.data)
# Make predictions
predictions <- model1 %>% predict(test.data)
# Model performance
data.frame(
RMSE = RMSE(predictions, test.data$medv),
R2 = R2(predictions, test.data$medv)
)
RMSE R2
4.992775 0.6704608
# So far we have R^2 = 0.67, can we do better by removing e.g., any multicollinear variables?
vif(model1)
## crim zn indus chas nox rm age dis
## 1.865531 2.364859 3.901322 1.064429 4.471619 2.010665 3.018555 3.961686
## rad tax ptratio black lstat
## 7.799919 9.163102 1.907071 1.311933 2.967784
# Remove tax
model2 <- lm(medv ~. -tax, data = train.data)
# Make predictions
predictions <- model2 %>% predict(test.data)
# Model performance
data.frame(
RMSE = RMSE(predictions, test.data$medv),
R2 = R2(predictions, test.data$medv)
)
RMSE R2
5.013736 0.6705781
# Looks like it didn't help much
# Technically we would need to manually run through all the plausible models
# until we converge on the best one. Or we can simply use 'step(model1)' as
# shown in Example 20.
#step(model1)
In R we can use the function stepAIC (from the MASS library) with the flag direction set to forward, backward, or both (Stepwise) to automatically do the model selection process for us. The function step is often used as well.
# Example 20: Stepwise (Family Income)
library(car)
library(AER)
library(broom)
library(PoEdata)
library(leaps)
data("edu_inc", package="PoEdata")
mreg.mod <- lm(faminc~he+we+kl6+xtra_x5+xtra_x6, data=edu_inc)
# Forward Selection
step.model <- stepAIC(mreg.mod, direction = "forward",
trace = FALSE)
summary(step.model)
##
## Call:
## lm(formula = faminc ~ he + we + kl6 + xtra_x5 + xtra_x6, data = edu_inc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -90853 -24027 -6555 18014 243361
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7558.6 11195.4 -0.675 0.49995
## he 3339.8 1250.0 2.672 0.00784 **
## we 5868.7 2278.1 2.576 0.01033 *
## kl6 -14200.2 5043.7 -2.815 0.00510 **
## xtra_x5 888.8 2242.5 0.396 0.69204
## xtra_x6 -1067.2 1981.7 -0.539 0.59050
## —
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
##
## Residual standard error: 40240 on 422 degrees of freedom
## Multiple R-squared: 0.1778, Adjusted R-squared: 0.1681
## F-statistic: 18.25 on 5 and 422 DF, p-value: < 2.2e-16
# Backward Selection
step.model <- stepAIC(mreg.mod, direction = "backward",
trace = FALSE)
summary(step.model)
##
## Call:
## lm(formula = faminc ~ he + we + kl6, data = edu_inc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91760 -23745 -6642 17636 243255
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7755.3 11162.9 -0.695 0.48760
## he 3211.5 796.7 4.031 6.58e-05 ***
## we 4776.9 1061.2 4.502 8.73e-06 ***
## kl6 -14310.9 5003.9 -2.860 0.00445 **
## —
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
##
## Residual standard error: 40160 on 424 degrees of freedom
## Multiple R-squared: 0.1772, Adjusted R-squared: 0.1714
## F-statistic: 30.43 on 3 and 424 DF, p-value: < 2.2e-16
# Stepwise Selection
step.model <- stepAIC(mreg.mod, direction = "both",
trace = FALSE)
summary(step.model)
##
## Call:
## lm(formula = faminc ~ he + we + kl6, data = edu_inc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91760 -23745 -6642 17636 243255
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7755.3 11162.9 -0.695 0.48760
## he 3211.5 796.7 4.031 6.58e-05 ***
## we 4776.9 1061.2 4.502 8.73e-06 ***
## kl6 -14310.9 5003.9 -2.860 0.00445 **
## —
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
##
## Residual standard error: 40160 on 424 degrees of freedom
## Multiple R-squared: 0.1772, Adjusted R-squared: 0.1714
## F-statistic: 30.43 on 3 and 424 DF, p-value: < 2.2e-16
# Automated way to evaluate all the models:
step(mreg.mod)
## Start: AIC=9081.8
## faminc ~ he + we + kl6 + xtra_x5 + xtra_x6
##
## Df Sum of Sq RSS AIC
## - xtra_x5 1 2.5439e+08 6.8358e+11 9080.0
## - xtra_x6 1 4.6960e+08 6.8379e+11 9080.1
##
## - we 1 1.0746e+10 6.9407e+11 9086.5
## - he 1 1.1559e+10 6.9488e+11 9087.0
## - kl6 1 1.2835e+10 6.9616e+11 9087.8
##
## Step: AIC=9079.95
## faminc ~ he + we + kl6 + xtra_x6
##
## Df Sum of Sq RSS AIC
## - xtra_x6 1 2.6350e+08 6.8384e+11 9078.1
##
## - kl6 1 1.2698e+10 6.9628e+11 9085.8
## - he 1 1.5563e+10 6.9914e+11 9087.6
## - we 1 2.0830e+10 7.0441e+11 9090.8
##
## Step: AIC=9078.12
## faminc ~ he + we + kl6
##
## Df Sum of Sq RSS AIC
##
## - kl6 1 1.3192e+10 6.9703e+11 9084.3
## - he 1 2.6207e+10 7.1005e+11 9092.2
## - we 1 3.2683e+10 7.1652e+11 9096.1
##
## Call:
## lm(formula = faminc ~ he + we + kl6, data = edu_inc)
##
## Coefficients:
## (Intercept) he we kl6
## -7755 3212 4777 -14311
# As expected, Model 3 (faminc~he+we+kl6) is the best one for all cases
# Note: The leaps package has other functions such as 'leapSeq' which give
# you more choices for performing the selection.
8 Heteroskedasticity
The classic signature of Heteroskedasticity is a growing spread in y with increasing values of x. In this
case the Constant Variance assumption no longer holds, and we therefore need to apply other techniques to
mitigate this problem.
Consider the MR model: yi = β1 + β2xi2 + · · · + βKxiK + ei. If the variance var(yi) = E(e²i) is no longer constant, this suggests that there is a function h() that depends on predictors z2, . . . , zS that can be used to characterize the behavior of var(yi). For example, we can propose
var(yi) = E(e²i) = h(α1 + α2zi2 + · · · + αSziS).
This is a more general result, since the constant variance assumption holds provided α2 = α3 = · · · = αS = 0.
8.1 Detecting Heteroskedasticity
Next, we will discuss a couple of statistical methods that help us identify whether heteroskedasticity is present by evaluating hypotheses such as H0 : α2 = α3 = · · · = αS = 0 and examining plausible functional forms for h(), such as linear, quadratic, and exponential functions.
8.1.1 Spread Level Plots
These can be used as a diagnostic for nonconstant variance with the advantage that it also yields an estimate
for the power law transformation to use for the response variable.
# Example 21: Spread Level Plot (Family Income)
library(car)
library(PoEdata)
# Traditional LS
foodeq <- lm(food_exp~income,data=food)
tidy(foodeq)
term estimate std.error statistic p.value
(Intercept) 83.41600 43.410163 1.921578 0.0621824
income 10.20964 2.093263 4.877381 0.0000195
spreadLevelPlot(foodeq)
[Figure: spread-level plot of absolute studentized residuals vs. fitted values for foodeq]
##
## Suggested power transformation: -1.106871
# It suggests a lambda = -1.107 for the power law to use in transforming y.
8.1.2 Score Tests for Nonconstant Error Variance
There are several popular tests for Heteroskedasticity depending on various characteristics of the data. In
this section, however, we will focus on:
1. Breusch-Pagan Test
2. White Test
3. Goldfeld-Quandt Test
8.1.2.1 Lagrange Multiplier Test (Breusch-Pagan Test)
Assuming we know what the zS's are, this test is for the null H0 : α2 = α3 = · · · = αS = 0 against the alternative H1 : not all of the αS are zero. The test statistic is χ² = N × R² ∼ χ²(S−1), where R² is obtained from the regression êi² = α1 + α2zi2 + · · · + αSziS + vi. The challenge with this test is that we may not know (or have) other predictors zS for estimating the αS's.
# Example 22: BP Test: Food Expenditure vs. Family Income
library(car)
library(PoEdata)
# The example below shows how to compute everything from scratch; however, we will shortly
# instead use the R function 'bptest'
alpha <- 0.05
mod1 <- lm(food_exp~income, data=food)
ressq <- resid(mod1)^2
#The test equation:
modres <- lm(ressq~income, data=food)
N <- nobs(modres)
gmodres <- glance(modres)
S <- gmodres$df #Number of Betas in model
#Chi-square is always a right-tail test
chisqcr <- qchisq(1-alpha, S-1)
Rsqres <- gmodres$r.squared
chisq <- N*Rsqres
pval <- 1-pchisq(chisq,S-1)
# Since p-val = 0.0066 < alpha=0.05, we reject the null
# and conclude that heteroskedasticity is present.
# You can also use the function ncvTest from the car package
# to test for Non Constant Variance (hence ncv)
ncvTest(mod1)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 7.343935, Df = 1, p = 0.0067289
# We can see that we reject the null H0: variance = constant
8.1.2.2 White Test
Given the problem we saw with the Breusch-Pagan (BP) Test, the White Test solves this by suggesting we start from the existing predictors in the original model. For example, if our regression model is E(y) = β1 + β2x2 + β3x3, then we set z2 = x2, z3 = x3, z4 = x2², and z5 = x3².
# Example 22: White Test: Food Expenditure vs. Family Income
library(car)
library(PoEdata)
alpha <- 0.05
mod1 <- lm(food_exp~income, data=food)
ressq <- resid(mod1)^2
modres <- lm(ressq~income+I(income^2), data=food)
gmodres <- glance(modres)
Rsq <- gmodres$r.squared
S <- gmodres$df #Number of Betas in model
chisq <- N*Rsq # N (number of observations) carries over from the previous chunk
pval <- 1-pchisq(chisq, S-1)
# Since p-val = 0.023 < alpha=0.05, we reject the null
# and conclude that heteroskedasticity is present.
# We can directly test the Robust version of the BP Test
mod1 <- lm(food_exp~income, data=food)
bptest(mod1)
##
## studentized Breusch-Pagan test
##
## data: mod1
## BP = 7.3844, df = 1, p-value = 0.006579
# Consistent with the previous results, we reject H0.
8.1.3 Goldfeld-Quandt Test
When our model includes a factor variable(s) and suffers from heteroskedasticity, the Goldfeld-Quandt Test can be useful for comparing groups based on the factor levels. For example, consider the case of a binary indicator variable (I) that takes on the values 0 or 1. The test compares the variances of the two groups by testing the null hypothesis H0 : σ²1 = σ²0 against the alternative H1 : σ²1 ≠ σ²0. The test statistic is F = σ̂²1/σ̂²0, which is then compared against the critical value Fc = F(N1−K, N0−K). This test can also be applied to a model with only quantitative predictors by simply splitting the observations according to a desired rule. The two examples below show these cases.
# Example 23: Goldfeld-Quandt Test (w/ Indicator variable)
# wage = educ + exper + metro (=1 metropolitan and =0 rural)
library(car)
library(PoEdata)
alpha <- 0.05 #two tail, will take alpha/2
data("cps2", package="PoEdata")
#Create the two groups, m (metro) and r (rural)
m <- cps2[which(cps2$metro==1),]
r <- cps2[which(cps2$metro==0),]
wg1 <- lm(wage~educ+exper, data=m)
wg0 <- lm(wage~educ+exper, data=r)
df1 <- wg1$df.residual #Numerator degrees of freedom
df0 <- wg0$df.residual #Denominator df
sig1squared <- glance(wg1)$sigma^2
sig0squared <- glance(wg0)$sigma^2
fstat <- sig1squared/sig0squared
Flc <- qf(alpha/2, df1, df0)#Left (lower) critical F
Fuc <- qf(1-alpha/2, df1, df0) #Right (upper) critical F
# Since F = 2.09 (in the rejection region), we reject H0 and conclude
# that the variances are different, and therefore, heteroskedasticity
# is an issue.
# Example 24: Goldfeld-Quandt Test (only quantitative predictors)
# Food Expenditure vs. Family Income
# Divide the data into two groups: below median income and above
library(car)
library(PoEdata)
alpha <- 0.05
data("food", package="PoEdata")
medianincome <- median(food$income)
li <- food[which(food$income<=medianincome),]
hi <- food[which(food$income>=medianincome),]
eqli <- lm(food_exp~income, data=li)
eqhi <- lm(food_exp~income, data=hi)
dfli <- eqli$df.residual
dfhi <- eqhi$df.residual
sigsqli <- glance(eqli)$sigma^2
sigsqhi <- glance(eqhi)$sigma^2
fstat <- sigsqhi/sigsqli #The larger var in numerator
Fc <- qf(1-alpha, dfhi, dfli)
pval <- 1-pf(fstat, dfhi, dfli)
# Since p-val = 0.0046 < alpha=0.05, we reject the null
# and conclude that heteroskedasticity is present. In this case,
# we can see that the test suggests that there is a higher variance
# at higher incomes.
# An easier way to conduct the GQ test is using the function gqtest
# from the lmtest package
library(lmtest)
foodeq <- lm(food_exp~income, data=food)
gqtest(foodeq, point=0.5, alternative="greater",
order.by=food$income)
##
## Goldfeld-Quandt test
##
## data: foodeq
## GQ = 3.6148, df1 = 18, df2 = 18, p-value = 0.004596
## alternative hypothesis: variance increases from segment 1 to 2
8.2 Mitigating Heteroskedasticity
In the previous sections we learned how to detect heteroskedasticity; now we will discuss what to do with our model to mitigate its impact.
8.2.1 Heteroskedasticity-Consistent Standard Errors
Recall that heteroskedasticity has the two effects that (1) the LS estimators are no longer the best, and (2) the LS standard errors are not correct since the variance is no longer constant. A more robust alternative (in the sense that it applies to both heteroskedastic and homoskedastic errors) is the White Standard Error, based on:
var̂(b2) = [N/(N − 2)] × ∑[(xi − x̄)² êi²] / [∑(xi − x̄)²]²
For example, we look at the results from the food expenditure case and compare the traditional vs. the White
standard errors, as well as the respective confidence intervals for the model coefficients. Keep in mind that
hypotheses such as e.g., λ = c1β1 + c2β2 = c0 would also require us to update the standard errors, and
therefore, our conclusions may change.
# Example 25: White SE
# Food Expenditure vs. Family Income
# To compute the Robust White SE, the car package has a function
# hccm (heteroskedasticity-robust coefficient covariance matrix) that will
# compute them
library(car)
library(PoEdata)
# Traditional Standard Errors
foodeq <- lm(food_exp~income,data=food)
tidy(foodeq)
term estimate std.error statistic p.value
(Intercept) 83.41600 43.410163 1.921578 0.0621824
income 10.20964 2.093263 4.877381 0.0000195
# Robust White Standard Errors
cov1 <- hccm(foodeq, type="hc1")
food.HC1 <- coeftest(foodeq, vcov.=cov1)
tidy(food.HC1)
term estimate std.error statistic p.value
(Intercept) 83.41600 27.463748 3.037313 0.0042989
income 10.20964 1.809077 5.643566 0.0000018
# Hypothesis test for a linear combination of parameters: H0: beta_2 + beta_3 = 0
data("andy", package="PoEdata")
andy.eq <- lm(sales~price+advert, data=andy)
bp <- bptest(andy.eq) #Heteroskedsticity test
b2 <- coef(andy.eq)[["price"]]
b3 <- coef(andy.eq)[["advert"]]
H0 <- "price+advert=0"
# Traditional Standard Errors
linearHypothesis(andy.eq, H0)
Res.Df RSS Df Sum of Sq F Pr(>F)
73 2254.715 NA NA NA NA
72 1718.943 1 535.7719 22.44145 1.06e-05
# Robust White Standard Errors
linearHypothesis(andy.eq, H0, vcov=hccm(andy.eq, type="hc1"))
Res.Df Df F Pr(>F)
73 NA NA NA
72 1 23.38698 7.3e-06
# In this case the p-value is larger using the traditional errors.
8.2.2 Generalized (Weighted) Least Squares
To properly deal with heteroskedasticity, we would need to estimate a different variance of the error term for each observation. On a practical level this is not possible; instead, we model the dependence of the variance on one or more of the predictors in the model. We will examine two forms of dependence, depending on whether or not we know the actual relationship between the variance and the predictor(s).
8.2.2.1 Known Form of Variance
For simplicity, let's assume we have a SLR model yi = β1 + β2xi + ei and that var(ei) = σ²i = σ²xi. If we divide our original model by √xi, then we can show that in the transformed model y∗i = β1x∗i1 + β2x∗i2 + e∗i the errors e∗i are now homoskedastic, since var(e∗i) = var(ei/√xi) = var(ei)/xi = σ²xi/xi = σ². Also, the desired properties E(e∗i) = 0 and cov(e∗i, e∗j) = 0 for i ≠ j still hold.
The transformation we proposed is part of a more general method known as Weighted Least Squares, where we weight the errors by xi^(−1/2).
For example, looking again at the food expenditure example, applying this method we obtain:
# Example 25: Weighted Least Squares -Known Variance
# Food Expenditure vs. Family Income
# To compute the Robust White SE, the car package has a function
# hccm (heteroskedasticity-robust coefficient covariance matrix) that will
# compute them
library(car)
library(PoEdata)
# Traditional LS
foodeq <- lm(food_exp~income,data=food)
tidy(foodeq)
term estimate std.error statistic p.value
(Intercept) 83.41600 43.410163 1.921578 0.0621824
income 10.20964 2.093263 4.877381 0.0000195
# Weighted LS
w <- 1/food$income # Weights : R takes the sqrt by default already, hence w=1/x
food.wls <- lm(food_exp~income, weights=w, data=food)
# Re-display the White-SE fit for comparison (the WLS estimates appear in Table 2 below)
vcvfoodeq <- coeftest(foodeq, vcov.=cov1)
tidy(vcvfoodeq)
term estimate std.error statistic p.value
(Intercept) 83.41600 27.463748 3.037313 0.0042989
income 10.20964 1.809077 5.643566 0.0000018
8.2.2.2 Unknown Form of Variance
If we do not know the form of the variance, we can propose a more general power-law form such as var(ei) = σ²i = σ²xi^γ, where γ is an unknown parameter that we can estimate. The trick to estimating γ is to take logs on both sides of the variance equation, ln(σ²i) = ln(σ²) + γ ln(xi), and then take the exponential of both sides:
σ²i = exp[ln(σ²) + γ ln(xi)] = exp(α1 + α2zi),
where α1 = ln(σ²), α2 = γ, and zi = ln(xi). For the MR case, we simply have more terms ziS to include in the exponential function. We can use traditional LS to estimate γ from the log-linear model ln(σ²i) = α1 + α2zi. The estimated equation is ln̂(σ²i) = 0.9378 + 2.39zi (recall that we used γ = 1 in our previous example). Now we can use σ̂²i = exp(0.9378 + 2.39zi) to transform our original model instead of the xi^(−1/2) weights. The estimated model is now:
# Example 25: Weighted Least Squares - Unknown Variance
# Food Expenditure vs. Family Income
# To compute the Robust White SE, the car package has a function
# hccm (heteroskedasticity-robust coefficient covariance matrix) that will
# compute them
library(car)
library(PoEdata)
foodeq <- lm(food_exp~income,data=food)
data("food", package="PoEdata")
food.ols <- lm(food_exp~income, data=food)
ehatsq <- resid(food.ols)^2
sighatsq.ols <- lm(log(ehatsq)~log(income), data=food)
vari <- exp(fitted(sighatsq.ols))
food.fgls <- lm(food_exp~income, weights=1/vari, data=food)
# This case is also known as Feasible Generalized Least Squares
library(stargazer)
# stargazer(food.ols, food.HC1, food.wls, food.fgls)
Table 2: Comparing various 'food' models

                    Dependent variable: 'food expenditure'
                 OLS          HC1          WLS          FGLS
Constant         83.416       83.416       78.684       76.054
                 (43.410)     (27.464)     (23.789)     (9.713)
income           10.210       10.210       10.451       10.633
                 (2.093)      (1.809)      (1.386)      (0.972)
Observations     40                        40           40
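If stargazer is not available, a rough version of Table 2 can be assembled directly from the fitted objects. The
sketch below assumes food.ols, food.wls, and food.fgls from the examples above and recomputes the HC1 column
with hccm(); the food.HC1 object referenced in the commented stargazer call is presumably defined earlier in
these notes.
# Hand-built comparison of the four 'food' fits: coefficients and standard errors
library(car)  # for hccm()
se     <- function(fit) sqrt(diag(vcov(fit)))        # conventional SEs
se.hc1 <- sqrt(diag(hccm(food.ols, type = "hc1")))   # White (HC1) robust SEs
comparison <- rbind(
  OLS  = c(coef(food.ols),  se(food.ols)),
  HC1  = c(coef(food.ols),  se.hc1),
  WLS  = c(coef(food.wls),  se(food.wls)),
  FGLS = c(coef(food.fgls), se(food.fgls))
)
colnames(comparison) <- c("Constant", "income", "se(Constant)", "se(income)")
round(comparison, 3)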
9 Missing Observations
There are several ways to deal with missing data, depending on their characteristics and volume. Below we
briefly discuss four of them.
1. Remove Missing Observations
If the number of missing observations is small relative to the dataset, we can simply remove these
observations. We also need to make sure not to introduce a bias through their removal by, e.g., failing to
represent certain characteristics of the data.
2. Remove the Variable
This may be appropriate if the missing values occur mainly in a particular variable. We would
need to weigh its removal against other considerations such as economic significance, etc.
3. Imputation via Mean/Median/Mode
Depending on the distribution of the observations, we can impute the missing data with estimates of
the Mean, Median or Mode.
4. Prediction
We can predict the missing observations using kNN (k-Nearest Neighbor), regression, interpolation,
etc. However, of all these methods, kNN seems to work well provided the variables are not factor
variables. The basic idea behind the method is to replace a missing observation with a weighted
average of the closest (using the Euclidean distance) k neighbors. In the case of factor variables, we can
instead use *Recursive Partitioning* (We normally cover this when discussing *Regression Trees*).
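As a toy illustration of the weighted-average idea (a hypothetical sketch, not the DMwR implementation), the
function below fills in one missing value of a numeric variable y using its k nearest neighbors measured on a
single numeric predictor x; knnImputation() used in the example that follows generalizes this to distances
computed over all complete columns.
# Toy sketch of kNN imputation: fill one missing y using its k nearest neighbors in x
knn_impute_one <- function(x, y, i, k = 3) {
  donors <- which(!is.na(y) & seq_along(y) != i)  # rows with an observed y
  d <- abs(x[donors] - x[i])                      # (one-dimensional) Euclidean distance to row i
  nn <- donors[order(d)][1:k]                     # indices of the k closest donors
  w <- 1 / (abs(x[nn] - x[i]) + 1e-8)             # inverse-distance weights
  sum(w * y[nn]) / sum(w)                         # weighted average of the neighbors
}
set.seed(1)
x <- 1:10
y <- 2*x + rnorm(10)
y[5] <- NA
knn_impute_one(x, y, i = 5, k = 3)  # should be close to 2*5 = 10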
# Example 25: Boston Housing Data
library(mlbench) #Machine Learning Benchmark
# For reference on this example, see:
# http://r-statistics.co/Missing-Value-Treatment-With-R.html#4.%20Prediction
data("BostonHousing", package="mlbench") # load the data
original <- BostonHousing # backup original data
# Introduce missing values
set.seed(100)
BostonHousing[sample(1:nrow(BostonHousing), 40), "rad"] <- NA
BostonHousing[sample(1:nrow(BostonHousing), 40), "ptratio"] <- NA
head(BostonHousing)
crim zn indus chas nox rm age dis rad tax ptratio b lstat medv
0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21 28.7
# Before going through the methods, we should first look at
# the pattern of missing observations
library(mice)
##
## Attaching package: 'mice'
## The following object is masked from 'package:tidyr':
##
## complete
## The following objects are masked from 'package:base':
##
## cbind, rbind
md.pattern(BostonHousing) # pattern or missing values in data.
45
[md.pattern() missingness map: all variables complete except rad and ptratio, each with 40 missing values]
crim zn indus chas nox rm age dis tax b lstat medv rad ptratio
431 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
35 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
35 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
5 1 1 1 1 1 1 1 1 1 1 1 1 0 0 2
0 0 0 0 0 0 0 0 0 0 0 0 40 40 80
# Method 1: Remove Missing Observations
lm(medv ~ ptratio + rad, data=BostonHousing, na.action=na.omit)
##
## Call:
## lm(formula = medv ~ ptratio + rad, data = BostonHousing, na.action = na.omit)
##
## Coefficients:
## (Intercept) ptratio rad
## 57.2724 -1.7836 -0.2035
# Method 2: Remove the variable
# Simply leave it out of the regression model -not much more to do 🙂
# Method 3: Imputation via Mean/Median/Mode
library(Hmisc)
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:xtable':
##
## label, label<-
## The following object is masked from 'package:quantmod':
##
## Lag
## The following objects are masked from 'package:tis':
##
## capitalize, Lag
## The following objects are masked from 'package:plyr':
##
## is.discrete, summarize
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
#impute(BostonHousing$ptratio, mean) # replace with mean
#impute(BostonHousing$ptratio, median) # median
#impute(BostonHousing$ptratio, 20) # replace specific number
# or if you want to impute manually
#BostonHousing$ptratio[is.na(BostonHousing$ptratio)] <- mean(BostonHousing$ptratio, na.rm = T)
# Compute the accuracy when NAs are imputed with the mean
library(DMwR) #Data Mining w/ R
## Loading required package: grid
##
## Attaching package: 'DMwR'
## The following object is masked from 'package:broom':
##
## bootstrap
## The following object is masked from 'package:plyr':
##
## join
# This library offers many fancy ways of dealing with NAs
# As a side note, if you're interested in trading, try their trading.simulator
actuals <- original$ptratio
predicted <- rep(mean(BostonHousing$ptratio, na.rm=T), length(actuals))
regr.eval(actuals, predicted)
## mae mse rmse mape
## 1.7854298 4.6778718 2.1628388 0.1034322
# Looking at mape, the mean absolute percentage error is about 10% - not too bad
# Method 4: Prediction
# Perform knn imputation.
knnOutput <- knnImputation(BostonHousing[, !names(BostonHousing) %in% "medv"])
anyNA(knnOutput) # Check that we don't have any NAs left in the entire dataset
## [1] FALSE
actuals <- original$ptratio
predicted <- knnOutput$ptratio
regr.eval(actuals, predicted)
## mae mse rmse mape
## 0.079200565 0.156450738 0.395538542 0.004632037
# Based on mape, we are now down to 0.5% error!
# Prediction for Factor Variables
library(rpart)
##
## Attaching package: 'rpart'
## The following object is masked from 'package:survival':
##
## solder
class_mod <- rpart(rad ~ . - medv,
data=BostonHousing[!is.na(BostonHousing$rad), ],
method="class", na.action=na.omit) # since rad is a factor
anova_mod <- rpart(ptratio ~ . - medv,
data=BostonHousing[!is.na(BostonHousing$ptratio), ],
method="anova", na.action=na.omit) # since ptratio is numeric.
rad_pred <- predict(class_mod, BostonHousing[is.na(BostonHousing$rad), ])
ptratio_pred <- predict(anova_mod, BostonHousing[is.na(BostonHousing$ptratio), ])
# Accuracy for ptratio
actuals <- original$ptratio[is.na(BostonHousing$ptratio)]
predicteds <- ptratio_pred
regr.eval(actuals, predicteds)
## mae mse rmse mape
## 0.71061673 0.99693845 0.99846805 0.04099908
# A mape of about 4% - not bad
# Accuracy for rad (factor variable)
actuals <- original$rad[is.na(BostonHousing$rad)]
predicteds <- as.numeric(colnames(rad_pred)[apply(rad_pred, 1, which.max)])
mean(actuals != predicteds) # compute misclass error.
## [1] 0.25
# About a 25% misclassification error - pretty good
# Method 5: Advanced Method (Multivariate Imputation by Chained Equations)
# Use the R package mice
library(mice)
# Perform mice imputation, based on random forests
miceMod <- mice(BostonHousing[, !names(BostonHousing) %in% "medv"], method="rf")
##
## iter imp variable
## 1 1 rad ptratio
## 1 2 rad ptratio
## 1 3 rad ptratio
## 1 4 rad ptratio
## 1 5 rad ptratio
## 2 1 rad ptratio
## 2 2 rad ptratio
## 2 3 rad ptratio
## 2 4 rad ptratio
## 2 5 rad ptratio
## 3 1 rad ptratio
## 3 2 rad ptratio
## 3 3 rad ptratio
## 3 4 rad ptratio
## 3 5 rad ptratio
## 4 1 rad ptratio
## 4 2 rad ptratio
## 4 3 rad ptratio
## 4 4 rad ptratio
## 4 5 rad ptratio
## 5 1 rad ptratio
## 5 2 rad ptratio
## 5 3 rad ptratio
## 5 4 rad ptratio
## 5 5 rad ptratio
miceOutput <- complete(miceMod) # generate the completed data.
anyNA(miceOutput)
## [1] FALSE
# Accuracy for ptratio
actuals <- original$ptratio[is.na(BostonHousing$ptratio)]
predicteds <- miceOutput[is.na(BostonHousing$ptratio), "ptratio"]
regr.eval(actuals, predicteds)
## mae mse rmse mape
## 0.35000000 0.77700000 0.88147603 0.01965896
# Accuracy for rad (factor variable)
actuals <- original$rad[is.na(BostonHousing$rad)]
predicteds <- miceOutput[is.na(BostonHousing$rad), "rad"]
mean(actuals != predicteds) # compute misclass error.
## [1] 0.225
Residuals and Influence Measures
Residual Plots
Identifying and Classifying Unusual Observations
Predictor Transformations
Omitted Variable Bias
Irrelevant Variables
Multicollinearity
Detecting Multicollinearity
Model Misspecification: Ramsey RESET
Model Selection
Akaike Information Criterion (AIC)
Bayesian Information Criterion (BIC)
Mallows' CP
Stepwise Regression
Heteroskedasticity
Detecting Heteroskedasticity
Spread Level Plots
Score Tests for Nonconstant Error Variance
Goldfeld-Quandt Test
Mitigating Heteroskedasticity
Heteroskedasticity-Consistent Standard Errors
Generalized (Weighted) Least Squares
Missing Observations