—
title: Modeling Issues in Linear Regression
author: “Dr. Randall R. Rojas”
fontfamily: mathpazo
output:
pdf_document:
number_sections: true
fig_caption: yes
highlight: haddock
header-includes: \usepackage{graphicx}
word_document: default
html_document:
toc: true
df_print: paged
fontsize: 10.5pt
editor_options:
chunk_output_type: console
—
“`{r, echo=FALSE, warning=FALSE, message= FALSE}
library(knitr)
library(png)
opts_chunk$set(tidy.opts=list(width.cutoff=60))
“`
“`{r libraries, echo=FALSE, warning=FALSE, message=FALSE}
rm(list=ls(all=TRUE))
library(tm)
library(SnowballC)
library(lda)
library(topicmodels)
library(LDAvis)
library(dplyr)
library(stringi)
library(plyr)
library(foreign)
library(xts)
library(tis)
library(jsonlite)
library(FNN)
library(hexbin)
library(RColorBrewer)
library(MASS)
library(ldatuning)
library(gofastr)
library(quantmod)
library(tseries)
library(foreign)
library(forecast)
library(MASS)
library(TTR)
library(vars)
library(readtext)
library(tidyr)
library(scales)
library(tinytex)
library(fitdistrplus)
library(rgl)
library(xtable)
library(printr)
library(effects)
library(broom)
library(bookdown)
library(knitr)
library(stats)
library(sandwich)
library(stargazer)
library(leaps)
# install.packages(“kableExtra”)
library(knitr)
library(kableExtra)
library(“tidyverse”)
suppressMessages(library(“tidyverse”))
library(lmtest)#for coeftest(), #for coeftest() and bptest().
# install.packages(“stargazer”)
# library(stargazer) #nice and informative tables
#Data files for: https://bookdown.org/ccolonescu/RPoE4/simplelm.html#the-general-model
#install.packages(“devtools”)
#library(devtools)
#install_git(“https://github.com/ccolonescu/PoEdata”)
#library(PoEdata)
“`
Note: To access your textbook resources type the following on the console:
“`{r }
#library(car)
#carWeb()
“`
\section{Residuals and Influence Measures}
When diagnosing a regression fit, we have to carefully assess the impact that unusual observations may have the fit. Typically, we characterize Unusual observations in terms of outliers, leverage, and influence. Next we discuss each one in detail.
\begin{enumerate}
\item[A.] {\bf Outliers}\\
They are easy to find visually because their $y$ value is relatively large (conditioned on the predictor(s)), given the distribution of other $y$ values in the vicinity.
\begin{center}
\includegraphics[width=3in]{outlier.eps}
\end{center}
\item[B.] {\bf Leverage}\\
In this case we are looking at observations that have an extreme $x$ (i.e., predictor) value relative to the mean.
\begin{center}
\includegraphics[width=3in]{leverage.eps}
\end{center}
\item[C.] {\bf Influence}\\
An influential observation can be considered as an outlier with high leverage. Therefore, they can alter the estimates of the regression coefficients. In this case, when we remove them, the regression fit may become more stable.
\begin{center}
\includegraphics[width=3in]{influential.eps}
\end{center}
There are several metrics we can use to better classify unusual observations, however, since these tools depend on analyzing the residuals, we will first look at understanding residuals and then we can discuss the related metrics.
\end{enumerate}
## Residual Plots
The traditional residuals are given by $e_i = y_i – \widehat{y}_i$, $i=1,\ldots,N$. They are a good diagnostic of how well the model fit the data if when plotted vs. e.g., a predictor, they are randomly distributed about $y=0$ and have a constant variance. In more general terms, we can propose a functional form for the residuals such as
$${\rm var}(e_i) = \sigma^2(1-h_i)$$
where $h_i$ is called the *leverage* or *hat-value* is bounded between 0 and 1. Large values of $h_i$ (a.k.a *high leverage*) are associated with unusual values of $x_i$. If $h_i$ is close to zero, we then obtain the classic linear regression SR1 condition of constant variance.
For the case of SLR, i.e., only one predictor, $h_i$ is given by:
$$h_i = \frac{(x_i-\overline{x})^2}{\sum_{i=1}^{N}(x_i-\overline{x})^2} + \frac{1}{N} $$
From the formula we can interpret $h_i$ (leverage) as the proportion of the total sum of squares of the explanatory variable contributed by the $i$^{th} case.
“`{r }
# Example 1: Food expenditure vs. income (non-constant variance)
library(car) #for hccm() robust standard errors
library(PoEdata) #for PoE4 datasets
data(“food”,package=”PoEdata”)
mreg.mod <- lm(food_exp~income, data=food)
plot(food$income,food$food_exp, pch=20,
xlab="Income", ylab="Food Expenditure")
abline(mreg.mod,lwd=2,col="blue")
plot(food$income, mreg.mod$residuals, pch=20,ylab="Resdiduals", xlab="Income")
abline(h=0,col="red", lwd=2)
# From the residuals plot we can see that the variance is not constant
# Example 2: US Investment Data
## Chapter 3 in Greene (2003)
## transform (and round) data to match Table 3.1
data("USInvest", package = "AER")
us <- as.data.frame(USInvest)
#us$invest <- round(0.1 * us$invest/us$price, digits = 3)
#us$gnp <- round(0.1 * us$gnp/us$price, digits = 3)
plot(us$gnp, us$invest , ylab="Investment", xlab="GNP",pch=20)
mreg.mod=lm(us$invest ~ us$gnp, data = us)
abline(mreg.mod,lwd=2,col="blue")
plot(us$gnp, mreg.mod$residuals, pch=20,ylab="Resdiduals", xlab="GNP",ylim=c(-45,45))
abline(h=0,col="red",lwd=2)
# These residuals also show some hint of non-constant variance.
```
There are transformations that can be applied to the residuals to make the variance more constant. However, because they are applied to the residuals, the underlying regression model still needs to be reevaluated. For example, we can consider the *Standardized Residuals* as given by:
$$e_{Si} = \frac{e_i}{\widehat{\sigma}^2\sqrt{1-h_i}}, $$
```{r }
# Example 3: Standardized Residuals
library(car)
data("USInvest", package = "AER")
us <- as.data.frame(USInvest)
mreg.mod=lm(us$invest ~ us$gnp, data = us)
library(car)
residualPlots(mreg.mod,type = "rstandard")
```
or the *Studentized Residuals*:
$$e_{Ti}= \frac{e_i}{\widehat{\sigma}_{(-i)}\sqrt{1-h_i}}.$$
```{r }
# Example 4: Studentized Residuals
library(car)
data("USInvest", package = "AER")
us <- as.data.frame(USInvest)
mreg.mod=lm(us$invest ~ us$gnp, data = us)
residualPlots(mreg.mod,type = "rstudent")
```
Another alternative is to compute the *Pearson Residuals* which are obtained from the *Weighted Least Squares (WLS) Regression*, where the weights are equal to $w_i$ and the residuals can be computed from:
$$e_{Pi} = \sqrt{w_i}e_i $$. As can be seen, when $w_i=1$, we recover the standard residuals.
```{r }
# Example 5: Pearson Residuals
library(car)
data("USInvest", package = "AER")
us <- as.data.frame(USInvest)
mreg.mod=lm(us$invest ~ us$gnp, data = us)
library(car)
residualPlots(mreg.mod)
```
```{r }
# Example 6: All Residuals
#We can also graph all the relevant residual plots
prestige.mod.2 <- lm(prestige ~ education + income + type,
data=Prestige)
residualPlots(prestige.mod.2)
```
## Identifying and Classifying Unusual Observations
Unusual observations can be somewhat difficult to quantify through visual inspection but fortunately there are several robust metrics that can be used in a more objective fashion.
\begin{enumerate}
\item[A.] {\bf Q-Q Plot}\\
This plot shows the studentized residuals vs. the corresponding quantiles of a t-distribution with $N-K-2$ degrees of freedom, and allows us to identify outliers (std. residuals $\gtrsim 2$ are considered large).
\end{enumerate}
```{r }
# Example 6: Q-Q Plot
library(car)
mreg.mod <- lm(prestige ~ income + education, data=Duncan)
qqPlot(mreg.mod, id=list(n=3))
# It looks like minister might be a potential outlier. To be sure,
# we can perform an outlier test known as the Bonferroni-corrcted t-test:
outlierTest(mreg.mod)
# The large p-value (Bonferroni) suggests that this value is not surpsing.
```
\begin{enumerate}
\item[B.] {\bf Cook's Distance Plot}\\
Cook's Distance is defined as:
$$D_i = \frac{e_{Si}^{2}}{k+1}\times\frac{h_i}{1-h_i} $$
where large values of $D_i$ (larger than 1 and/or larger relative to the majority of the values) are typically associated with influential observations.
\end{enumerate}
```{r }
# Example 7: Cook's Distance Plot
library(car)
mreg.mod <- lm(prestige ~ income + education, data=Duncan)
influenceIndexPlot(mreg.mod, id=list(n=3),vars="Cook")
# It looks like minister, conductor and reporter are potential outliers
# Note: if you don't specify the flag "vars", it will plot all of them:
influenceIndexPlot(mreg.mod, id=list(n=3))
```
\begin{enumerate}
\item[C.] {\bf Hat-Values Plot}\\
This plot is designed for identifying high leverage observations (some of which could also be influential). On average $h_i=p/N$, therefore, $h_i \gtrsim 2p/N$ is considered large.
\end{enumerate}
```{r }
# Example 8: Hat-Values Plot
library(car)
mreg.mod <- lm(prestige ~ income + education, data=Duncan)
influenceIndexPlot(mreg.mod, id=list(n=3),vars="hat")
# It looks like minister, conductor and RR.engineer are influential
# We can check if e.g., minister is influential by comparing the
# coefficient estimates w/ and w/o 'minister'
mod.duncan <- lm(prestige ~ income + education, data=Duncan)
mod.duncan.2 <- update(mod.duncan,
subset= rownames(Duncan) != "minister")
compareCoefs(mod.duncan, mod.duncan.2)
# We can also depict the hat-values via:
influencePlot(mreg.mod, id=list(n=3))
```
\begin{enumerate}
\item[D.] {\bf Added Variable Plot (Partial Regression)}\\
Depicts the relationship between $y$ and one predictor $x$, adjusting for the effects of the other predictors. These can be used to identify influential observations and high leverage observations. For example, high leverage observations appear the in added variable plots as points horizontally distant from the rest of the data. They are also useful for identifying jointly influential points.
An alternative method is to look at the individual coefficient differences between estimates from the original data ($b_j$) and those where observation $i$ is removed ($b_{(-i)j}$) .
$$ df\beta_{ij} = b_{(-i)j} - b_j\;\; {\rm for}\;\; j = 0, \ldots,k$$
\end{enumerate}
```{r }
# Example 9: Added-Variable Plot
library(car)
mreg.mod <- lm(prestige ~ income + education, data=Duncan)
avPlots(mreg.mod, id=list(n=3, method="mahal"))
# It looks like minister, conductor and RR.engineer are influential
# minister & conductor --> decrease the income slope
# minister & conductor –> increase the education slope
# RR.engineer seems fine for the most part
# We can check if e.g., minister & conductor are jointly influential
# by comparing the coefficient estimates w/ and w/o ‘minister & conductor’
mod.duncan <- lm(prestige ~ income + education, data=Duncan)
mod.duncan.3 <- update(mod.duncan,
subset = - whichNames(c("minister", "conductor"), Duncan))
compareCoefs(mod.duncan, mod.duncan.2, mod.duncan.3, se=FALSE)
# Based on the comparison is does appear that these points were
# jointly influential.
# We can also depict the hat-values via:
influencePlot(mreg.mod, id=list(n=3))
# Lastly, we can compute the df_betas and cehck the individual differences
mreg.mod <- lm(prestige ~ income + education, data=Duncan)
dfbs <- dfbetas(mreg.mod)
# We can plot the influence on the income coefficient against the
# influence on the education coefficient
plot(dfbs[ ,c("income", "education")]) # for b1 and b2
showLabels(dfbs[ , "income"], dfbs[ , "education"],
labels=rownames(Duncan), method=c(6, 16, 27))
# Notice that once again, minister and conductor stand out
# We can alos plot them all via:
avPlots(lm(prestige~income+education+type, data=Duncan))
```
```{r }
# Example 10: Other Plots
library(car)
mreg.mod <- lm(prestige ~ income + education + type, data=Prestige)
# Marginal Plots
marginalModelPlots(mreg.mod, sd=TRUE)
# Marginal & Added-Variable Plots
mcPlots(mreg.mod, ~ education, overlaid=FALSE)
mcPlots(mreg.mod, ~ income, overlaid=FALSE)
```
\section{Predictor Transformations}
Suppose we have a linear model $y = \beta_1 + \beta_2 x_2 + \beta_3 x_3+e$, but are interested in the relationship between $y$ and $x_3$ only. Therefore, we would like to plot $y$ vs. $x_3$ but with the effect of $x_2$ removed. This is equivalent to plotting $y - \beta_1 + \beta_2 x_2$ vs. $x_3$. We can accomplish this with a *Component-Plus-Residual* Plot. Mathematically, we are really doing an approximation by estimating a partial residual for $x_3$.
Algorithmically, here is how we can estimate it:
\begin{enumerate}
\item[i.] Obtain the coefficient estimates $\beta_k$, for $k=1, 2, 3$ from $E[y|x] = \beta_1 + \beta_2 x_2 + \beta_3 x_3$ and respective residuals, $e$
\item[ii.] Estimate the partial residual ($e_{partial} $) according to: $e_{partial} = y-\widehat{\beta}_{1} + \widehat{\beta}_2 x_2$
\item[iii.] The Component-Plus-Residual is given by: $e_{partial} = \widehat{\beta}_2 x_2 + e$
\end{enumerate}
In general, we can express the partial residual as $e_{partial,ij} = e_i + b_j x_{ij}$. This plot is often used to help detect the need for transformations of the predictor variables by looking for any departures from linearity in the plots of $e_{partial,ij}$ vs. $x_{ij}$. If we identify any nonlinear dynamics, we can then apply the techniques from Lecture 1 to the respective predictor variables.
```{r }
# Example 11: Component-Plus-Residual Plot
library(car)
mreg.mod <- lm(prestige ~ income + education + women,
data=Prestige)
crPlots(mreg.mod, order=2)
# The dashed blue line is a partial fit, and the magenta curve a lowess smoother. Of the
# 3 predicators, income seems have the largest degree of curvature. Therefore, we might
# consider transforming it using, e.g., the bulging rule.
# We can try applying a log tranformation to income
mreg.mod2 <- update(mreg.mod,
. ~ . + log2(income) - income)
crPlots(mreg.mod2, order=2)
# It looks better, but maybe we should also apply a quadratic transformation to women
mreg.mod3 <- update(mreg.mod2,
. ~ . - women + poly(women, 2))
crPlots(mreg.mod3, order=1)
# Note: When there are strong non-linear relations between predictors, the
# component-plus-residual plots may not work, in this case, we instead look
# at CERES Plots (Combining conditional Expectations and REsiduals)
ceresPlots(lm(prestige~income+education+type, data=Prestige))
```
A more general method for dealing with nonlinear predictors is to a apply the *Box-Tidwell* method for transforming predictors. Their proposed model assumes the form:
$$y =\beta_0 + \beta_1 TB_{BC}(x_1, \gamma_1)+ \cdots + \beta_k T_{BC}(x_k, \gamma_k) +\varepsilon, $$
where $T_{BC}$ is the Box-Cox power transformation introduced in Lecture 1.
```{r }
# Example 12: Box-Tidwell Transformation
library(car)
# In R we can use the the functon boxTidwell to suggest the
# appropriate transformation(s) via MLE, while also allowing
# for user specifed transformations
# For example, let's apply a quadratic tranformation to women
# and see what it recommmends for income and education:
boxTidwell(prestige ~ income + education,
other.x = ~poly(women,2),data=Prestige)
# According to the output: lambda = -0.04 for income--> log
# According to the output: lambda = 2.19 for education–> quadratic
# To visualize these results, we can plot the “Constructed-Variable Plots”
# ‘construction’ in this context means transformation
“`
\section{Omitted Variable Bias}
The *Omitted Variable Bias* quantifies the impact of omitting a relevant (e.g., statistically and/or economically significant) variable in a regression model. For example, suppose the model true underlying population model is $y = \beta_1 + \beta_2 x_2 + \beta_3 x_3 + e$ where both $x_2$ and $x_3$ should be included, but instead we only estimated the model $y = \beta_1 + \beta_2 x_2 +e$. How can we quantify the effect that omitting $x_3$ would have on the model? We can show that mathematically it corresponds to a bias, namely:
$$bias(b_{2}^{*}) = E(b_2)^{*} – \beta_2 = \beta_3 \frac{\widehat{{\rm cov}(x_2, x_3)}}{{\rm var}(x_2)}$$
The example below shows how omitting a relevant variable such as the wife’s years of education contribution to annual family income, leads to a $\sim \$2000$ bias.
“`{r }
# Example 12: Omitted Variable Bias
library(car)
library(AER)
library(broom)
data(“edu_inc”, package=”PoEdata”)
mreg.mod1 <- lm(faminc~he+we, data=edu_inc)
mreg.mod2 <- lm(faminc~he, data=edu_inc)
# Correct Model
summary(mreg.mod1)
#Incorrect Model
summary(mreg.mod2)
# We can see that the effect of omitting wedu translates to an
# ovesstatement of ~$2000 which would correspond to the bias in this case.
# Q: What if we include kl6 ( = number of chidren under the age of 6)?
mreg.mod3 <- lm(faminc~he+we+kl6, data=edu_inc)
summary(mreg.mod3)
# A: In this case, the coefficients of hedu and wedu did not change by much,
# yet kl6 is both statistically and economically significant.
# This suggests that the estimates of wedu and hedu are robust.
```
\section{Irrelevant Variables}
Typically, including irrelevant variables into the model has the effect of increasing the variance of the other variables, and therefore, potentially rendering the respective coefficient estimates statistically insignificant.
```{r }
# Example 13: Irrelevant Variables
library(car)
library(AER)
library(broom)
library(PoEdata)
data("edu_inc", package="PoEdata")
mreg.mod1 <- lm(faminc~he+we+kl6, data=edu_inc)
# The variables xtra_x5 and _x6 are contain random values that are not associated with $y$
# We can would expect that the variances of th relevant variable will increase as we introduce
# these irrlevant variables.
mreg.mod2 <- lm(faminc~he+we+kl6+xtra_x5, data=edu_inc)
mreg.mod3 <- lm(faminc~he+we+kl6+xtra_x5+xtra_x6, data=edu_inc)
#summary(mreg.mod1)
#summary(mreg.mod2)
#summary(mreg.mod3)
#stargazer(mreg.mod1, mreg.mod2, mreg.mod3)
```
Below we summarize the results from the 3 models:
\begin{table}[!htbp] \centering
\caption{}
\label{}
\begin{tabular}{@{\extracolsep{5pt}}lccc}
\\[-1.8ex]\hline
\hline \\[-1.8ex]
& \multicolumn{3}{c}{\textit{Dependent variable:}} \\
\cline{2-4}
\\[-1.8ex] & \multicolumn{3}{c}{faminc} \\
\\[-1.8ex] & (1) & (2) & (3)\\
\hline \\[-1.8ex]
he & 3,211.526$^{***}$ & 3,377.219$^{***}$ & 3,339.792$^{***}$ \\
& (796.703) & (1,247.058) & (1,250.039) \\
& & & \\
we & 4,776.907$^{***}$ & 4,783.930$^{***}$ & 5,868.677$^{**}$ \\
& (1,061.164) & (1,063.156) & (2,278.067) \\
& & & \\
kl6 & $-$14,310.920$^{***}$ & $-$14,216.440$^{***}$ & $-$14,200.180$^{***}$ \\
& (5,003.928) & (5,039.395) & (5,043.720) \\
& & & \\
xtra\_x5 & & $-$180.162 & 888.843 \\
& & (1,042.335) & (2,242.491) \\
& & & \\
xtra\_x6 & & & $-$1,067.186 \\
& & & (1,981.685) \\
& & & \\
Constant & $-$7,755.330 & $-$7,682.601 & $-$7,558.613 \\
& (11,162.930) & (11,183.650) & (11,195.410) \\
& & & \\
\hline \\[-1.8ex]
Observations & 428 & 428 & 428 \\
R$^{2}$ & 0.177 & 0.177 & 0.178 \\
Adjusted R$^{2}$ & 0.171 & 0.169 & 0.168 \\
Residual Std. Error & 40,160.080 (df = 424) & 40,206.100 (df = 423) & 40,239.890 (df = 422) \\
F Statistic & 30.432$^{***}$ (df = 3; 424) & 22.779$^{***}$ (df = 4; 423) & 18.251$^{***}$ (df = 5; 422) \\
\hline
\hline \\[-1.8ex]
\textit{Note:} & \multicolumn{3}{r}{$^{*}$p$<$0.1; $^{**}$p$<$0.05; $^{***}$p$<$0.01} \\
\end{tabular}
\end{table}
\section{Multicollinearity}
Multicollinearity occurs when there are strong linear relationships between the predictors. For example, consider the model $y= \beta_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$, where $x_3 = \alpha_1 + \alpha_2 x_2$. In this case, Ordinary Least Squares cannot generate estimates of the regression coefficients because the predictor $x_2$ and $x_3$ move together. In this case (which is considered *Perfect Multicollinearity*) the solution would be to drop of the two predictors from the model.
In general, *Imperfect Multicollinearity* is more common. This occurs when e.g., using the previous case, $x_3 = \alpha_1 + \alpha_2 x_2 + u$, where $u$ represents some unknown random component.
The effects of Multicollinearity can be summarized as follows:
\begin{enumerate}
\item[1.] The variances and standard errors of the regression coefficient estimates will increase. This leads to lower $t-$statistics.
\item[2.] The overall fit of the regression equation will be largely unaffected by multicollinearity. Therefore, forecasting and prediction will be largely unaffected.
\item[3.] Regression coefficients can change substantially when variables are added or dropped.
\end{enumerate}
## Detecting Multicollinearity
There are several ways in which we can detect multicollinearity. Below are the 3 most common ones:
\begin{enumerate}
\item[1.] {\bf Large Correlation Coefficients}\\
High pairwise correlations among independent variables is a strong sign of multicollinearity. Usually $|r_{x_i,x_j}|\gtrsim 0.8$ is considered large.
\item[2.] {\bf Large $R^2$ with Low $t-$Stats}\\
It is possible for individual coefficients to be insignificant but for the overall fit of the equation to be high.
\item[3.] {\bf Variance Inflation Factor (VIF)}\\
VIF is a measure of how much the variance of the estimated regression coefficient $b_k$ is inflated by the existence of correlations among the predictor variables in the model. Consider for example $\widehat{var}(b_j)$ as given by:
$$\widehat{var}(b_j) = \frac{\widehat{\sigma}^2}{(n-1)s^{2}_{j}}\times \frac{1}{1-R_{j}^{2}}.$$
The term $1/(1-R_{j}^{2})$ is called the *VIF*. Values of VIF$=1$ suggest no correlation whereas VIF$\gtrsim 10$ (and in some cases $\gtrsim 4$) is considered large, and thus, we might consider removing those variables with large VIFs from the model. The $R_{j}^{2}$ are obtained from the *Auxiliary Regressions*, i.e., $x_j$ regressed on the other predcitors.
\end{enumerate}
An important limitation to of VIF is that it does not apply to nonlinear regressors or contrasts. For these cases we would instead use the *Generalized VIF (GVIF)* which is given by $GVIF^{1/2p}$, where $p$ represents the number of regressors in a term.
```{r }
# Example 13: Multicollinearity (Fuel Efficiency)
library(car)
data("cars", package="PoEdata")
mreg.mod=lm(mpg~cyl+eng+wgt, data=cars)
summary(mreg.mod)
tidy(vif(mreg.mod))
# We can see that both cyl and eng should be removed since VIF>10
# Example 14: Multicollinearity (Median household value in Boston)
data(“Boston”, package = “MASS”)
mreg.mod <- lm(medv ~., data = Boston)
summary(mreg.mod)
tidy(vif(mreg.mod))
# We can see that nox, rad, and tax might be worth removing since VIF>4
# Also note that idus, and dis are very close to 4, so we might want to
# consider removing them as well.
# Example 15: Multicollinearity (US Census Data) -see Ericksen et. al., 1989
mreg.mod <- lm(undercount ~ ., data=Ericksen)
summary(mreg.mod)
tidy(vif(mreg.mod))
# We can see that minority, poverty, and highschool might be worth removing since VIF>4
“`
\section{Model Misspecification: Ramsey RESET}
As we have seen previously, sometimes adding nonlinear expressions for the predictors can help improve our model in terms of economic and statistical significance. A popular test to evaluate is such nonlinear terms are needed, is known as the *REgression Specification Error Test (RESET)*. For example, assume we have the model $y= \beta_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$ and we are interested in determining if we can improve the model by including higher order terms of the predictors. We can easily assess this by estimating the model $y= \beta_1 + \beta_2 x_2 + \beta_3 x_3 + \gamma \widehat{y}^2 + \varepsilon$, and testing the null hypothesis $H_0:\gamma =0$ against $H_0:\gamma \neq 0$. If we reject $H_0$, then we should consider including e.g., quadratic terms. We can then perform the same test but instead including the term $\delta \widehat{y}^3$, and so on, until we fail to reject $H_0$. In practice, however, we should not have to include high order terms beyond 2 or 3.
“`{r }
# Example 16: Functional Form (RESET)
library(car)
library(PoEdata)
library(lmtest)
# We will create to models: A quadratic one and a linear one, and test each for model misspecification
x <- c(1:30)
# Model 1: Quadratic
y1 <- 1 + x + x^2 + rnorm(30)
# Test if a quadratic model is appropriate
resettest(y1 ~ x , power=2, type="regressor")
# According to the test, we can improve the model
# Test if a quadratic model is appropriate
y2 <- 1 + x + rnorm(30)
resettest(y2 ~ x , power=2, type="regressor")
# According to the test, our model is well specified, no need for adding a quadratic term
# Example 17: Family Income (RESET)
data("edu_inc", package="PoEdata")
mreg.mod <- lm(faminc~he+we+kl6, data=edu_inc)
resettest(mreg.mod , power=2, type="regressor")
# According to the test, our model is fine
```
\section{Model Selection}
We now discuss how to select the best model among competing ones using robust statistical metrics.
## Akaike Information Criterion (AIC)
AIC estimates the quality of the models being considered for the data relative to each other. The smaller the value of AIC, the better the model.
$$AIC = \ln \left( \frac{SSE}{N}\right) + \frac{2K}{N}$$
```{r }
# Example 18: AIC (Family Income)
library(car)
library(AER)
library(broom)
library(PoEdata)
data("edu_inc", package="PoEdata")
mreg.mod1 <- lm(faminc~he, data=edu_inc)
mreg.mod2 <- lm(faminc~he+we, data=edu_inc)
mreg.mod3 <- lm(faminc~he+we+kl6, data=edu_inc)
mreg.mod4 <- lm(faminc~he+we+kl6+xtra_x5, data=edu_inc)
mreg.mod5 <- lm(faminc~he+we+kl6+xtra_x5+xtra_x6, data=edu_inc)
AIC(mreg.mod1, mreg.mod2, mreg.mod3, mreg.mod4, mreg.mod5)
# As expected, Model 3 (faminc~he+we+kl6) is the preferred choice
```
## Bayesian Information Criterion (BIC)
Similarly to AIC, BIC does the same bu penalizes extra variables more heavily than AIC.
$$BIC = \ln \left( \frac{SSE}{N}\right) + \frac{K \ln(N)}{N}$$
As with AIC, the model with the smallest BIC is preferred.
```{r }
# Example 18: BIC (Family Income)
library(car)
library(AER)
library(broom)
library(PoEdata)
data("edu_inc", package="PoEdata")
mreg.mod1 <- lm(faminc~he, data=edu_inc)
mreg.mod2 <- lm(faminc~he+we, data=edu_inc)
mreg.mod3 <- lm(faminc~he+we+kl6, data=edu_inc)
mreg.mod4 <- lm(faminc~he+we+kl6+xtra_x5, data=edu_inc)
mreg.mod5 <- lm(faminc~he+we+kl6+xtra_x5+xtra_x6, data=edu_inc)
BIC(mreg.mod1, mreg.mod2, mreg.mod3, mreg.mod4, mreg.mod5)
# As expected, Model 3 (faminc~he+we+kl6) is also the preferred choice
```
## Mallows' CP
Mallows' $C_p$ can be used to assess model fits using different number of parameters. The statistic is given by:
$$C_p = \frac{SSE(p)}{\sigma^2} -N + 2p,$$
where $SSE$ is the sum squared error, $N$ the number of observations, and $p$ the number of parameters being used to estimate the model.
The smaller the value of the statistic $C_p$, the better. Therefore, if model $p$ is the best choice, $C_p$ will tend to be close to or smaller than $p$.
```{r }
# Example 18: Mallows Cp (Family Income)
library(car)
library(AER)
library(broom)
library(PoEdata)
library(leaps)
data("edu_inc", package="PoEdata")
mreg.mod5 <- lm(faminc~he+we+kl6+xtra_x5+xtra_x6, data=edu_inc)
ss=regsubsets(faminc~he+we+kl6+xtra_x5+xtra_x6,method=c("exhaustive"),nbest=3,data=edu_inc)
subsets(ss,statistic="cp",legend=F,main="Mallows CP",col="steelblue4")
legend(3,25,bty="y",legend=c('h=hedu','w=wedu','k=kl6','x_5 =xtra5', 'x_6=extr6'),col="steelblue4",cex=1.25)
# As expected, Model 3 (faminc~he+we+kl6) is also the preferred choice by Cp
```
## `Stepwise` Regression
Since our goal is ultimately about making the best predictions possible with our model, *Stepwise Regression* consists of iteratively adding and removing predictors until the model with the lowest prediction error is obtained. There are 3 popular ways to perform this:
\begin{enumerate}
\item {\bf Forward Selection}\\
Start with no predictors in the model, iteratively add the most contributive predictors, and stop when the improvement is no longer statistically significant.
\item {\bf Backward Selection}\\
Starts with all predictors in the model (full model), iteratively remove the least contributive predictors, and stop when you have a model where all predictors are statistically significant.
\item {\bf Stepwise Selection}\\
Is a combination of forward and backward selections. You start with no predictors, then sequentially add the most contributive predictors (like forward selection). After adding each new variable, remove any variables that no longer provide an improvement in the model fit (like backward selection).
\end{enumerate}
Below we show a brute force approach by manually comparing models based on their prediction accuracy.
```{r }
# Example 19: Evaluating the Prediction Error
# See http://www.sthda.com/ for details of the exmaple below
library(car)
library(AER)
library(broom)
library(PoEdata)
library(leaps)
library(tidyverse)
library(caret)
library(MASS)
suppressMessages(library("tidyverse"))
# Load the data
data("Boston", package = "MASS")
# Split the data into training and test set
set.seed(123)
training.samples <- Boston$medv %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]
# Build the model
model1 <- lm(medv ~., data = train.data)
# Make predictions
predictions <- model1 %>% predict(test.data)
# Model performance
data.frame(
RMSE = RMSE(predictions, test.data$medv),
R2 = R2(predictions, test.data$medv)
)
# So far we have R^2 = 0.67, can we do better by removing e.g., any multicoliinear varaiables?
vif(model1)
# Remove tax
model2 <- lm(medv ~. -tax, data = train.data)
# Make predictions
predictions <- model2 %>% predict(test.data)
# Model performance
data.frame(
RMSE = RMSE(predictions, test.data$medv),
R2 = R2(predictions, test.data$medv)
)
# Looks like it didn’t help much
# Technically we would need to manually run through all the plausible models
# until we converge on the best one. Or we can simply use ‘step(model1)’ as
# shown in Example 20.
#step(model1)
“`
In `R` we can use the function `stepAIC` (from the `leaps` library) with the flag `direction` set to `forward`, `backward`, or `both` (Stepwise) to automatically do the model selection process for us. The function `step` is often used as well.
“`{r }
# Example 20: Stepwise (Family Income)
library(car)
library(AER)
library(broom)
library(PoEdata)
library(leaps)
data(“edu_inc”, package=”PoEdata”)
mreg.mod <- lm(faminc~he+we+kl6+xtra_x5+xtra_x6, data=edu_inc)
# Forward Selection
step.model <- stepAIC(mreg.mod, direction = "forward",
trace = FALSE)
summary(step.model)
# Backward Selection
step.model <- stepAIC(mreg.mod, direction = "backward",
trace = FALSE)
summary(step.model)
# Stepwise Selection
step.model <- stepAIC(mreg.mod, direction = "both",
trace = FALSE)
summary(step.model)
# Autmated way to evaluate all the models:
step(mreg.mod)
# As expected, Model 3 (faminc~he+we+kl6) is the best one for all cases
# Note: The leaps package has other functions such as 'leapSeq' which give
# you more choices for performing the selection.
```
\section{Heteroskedasticity}
The classic signature of Heteroskedasticity is a growing spread in $y$ with increasing values of $x$. In this case the *Constant Variance* assumption no longer holds, and we therefore need to apply other techniques to mitigate this problem.
Consider the MR model: $y_{i}=\beta_1 + \beta_2 x_{i2} + \cdots + \beta_K x_{iK} + e_i$. If the variance var$(y_i) = E(e_{i}^{2})$ is no longer constant, this suggests that there is a function $h()$ that depends on predictors $z_2, \ldots,z_S$ that can be used to characterize the behavior of var$(y_i)$. For example, we can propose
$${\rm var}(y_i) = E(e_{i}^{2}) = h(\alpha_1 + \alpha_2 z_{i2} + \cdots + \alpha_{S}z_{iS}).$$
This is more general result since the constant variance assumption holds provided $\alpha_2 = \alpha_3 = \cdots =\alpha_S=0$.
Next, we will discuss a couple of statistical methods to help us identify if heteroskedasticity is present by evaluating hypotheses such as $H_0: \alpha_2 = \alpha_3 = \cdots =\alpha_S=0$ and examining plausible function forms for $h()$ such as linear, quadratic and exponential functions.
## Detecting Heteroskedasticity
Next, we will discuss a couple of statistical methods to help us identify if heteroskedasticity is present by evaluating hypotheses such as $H_0: \alpha_2 = \alpha_3 = \cdots =\alpha_S=0$ and examining plausible function forms for $h()$ such as linear, quadratic and exponential functions.
### Spread Level Plots
These can be used as a diagnostic for nonconstant variance with the advantage that it also yields an estimate for the power law transformation to use for the response variable.
```{r }
# Example 21: Spread Level Plot (Family Income)
library(car)
library(PoEdata)
# Traditional LS
foodeq <- lm(food_exp~income,data=food)
tidy(foodeq)
spreadLevelPlot(foodeq)
# It suggests a lambda = -1.107 for the power low to use in transforming y.
```
### Score Tests for Nonconstant Error Variance
There are several popular tests for Heteroskedasticity depending on various characteristics of the data. In this section, however, we will focus on:
\begin{enumerate}
\item Breusch-Pagan Test
\item White Test
\item Goldfeld-Quandt Test
\end{enumerate}
#### Lagrange Multiplier Test (Breusch-Pagan Test)
Assuming we know what the $z_S$'s are, this test is for the null $H_0: \alpha_2 = \alpha_3 = \cdots =\alpha_S=0$ against the alternative $H_1:$ not all the $\alpha_S$ are zero. The respective test-statistic is $\chi^2 \sim N\times R^2 \sim \chi^{2}_{S-1}$, where $R^2$ is obtained from the regression: $\widehat{e_i}^{2} = \alpha_1 + \alpha_2 z_{i2} + \cdots + \alpha_S z_{iS}+v_i$. The challenge with this test is that we may not not know (or have) other predictors $z_S$ for estimating the $\alpha_{S}$'s.
```{r }
# Example 22: BP Test: Food Expenditure vs. Family Income
library(car)
library(PoEdata)
# The exmaple below shows how ot compute everything from scracth, however, we will shortly
# instead use the R function 'bptest'
alpha <- 0.05
mod1 <- lm(food_exp~income, data=food)
ressq <- resid(mod1)^2
#The test equation:
modres <- lm(ressq~income, data=food)
N <- nobs(modres)
gmodres <- glance(modres)
S <- gmodres$df #Number of Betas in model
#Chi-square is always a right-tail test
chisqcr <- qchisq(1-alpha, S-1)
Rsqres <- gmodres$r.squared
chisq <- N*Rsqres
pval <- 1-pchisq(chisq,S-1)
# Since p-val = 0.0066 < alpha=0.05, we reject the null
# and conclude tht heteroskedasticity is present.
# You can also use the function ncvTest from the car package
# to test for Non Constant Variance (hence ncv)
ncvTest(mod1)
# We can see that we reject the null H0: variance = constant
```
#### White Test
Given the problem we saw with the Breusch-Pagan (BP) Test, the *White Test* solves this by suggesting we start by using our existing predictors from the original model. For example, if our regression model is $E(y)=\beta_1 + \beta_2 x_{2} + \beta_3 x_{3}$, then we set $z_2 =x_2$, $z_3 =x_3$, $z_4 =x_{2}^{2}$, and $z_5 =x_{3}^{2}$.
```{r }
# Example 22: White Test: Food Expenditure vs. Family Income
library(car)
library(PoEdata)
alpha <- 0.05
mod1 <- lm(food_exp~income, data=food)
ressq <- resid(mod1)^2
modres <- lm(ressq~income+I(income^2), data=food)
gmodres <- glance(modres)
Rsq <- gmodres$r.squared
S <- gmodres$df #Number of Betas in model
chisq <- N*Rsq
pval <- 1-pchisq(chisq, S-1)
# Since p-val = 0.023 < alpha=0.05, we reject the null
# and conclude tht heteroskedasticity is present.
# We can directly test the Robust version of the BP Test
mod1 <- lm(food_exp~income, data=food)
bptest(mod1)
# Consissten with the previous results, we reject H0.
```
### Goldfeld-Quandt Test
When our model includes a factor variable(s), and suffers from heteroskedasticty, the Goldfeld-Quandt Test can be useful for comparing groups based on the factor levels. For example, consider the case of a binary indicator variable ($I$) that takes on the values 0 or 1. The test will compares the variances of the two groups by testing the null hypothesis $H_0:\sigma^{2}_{1} = \sigma^{2}_{0}$ against the alternative: $H_0:\sigma^{2}_{1} \neq \sigma^{2}_{0}$. The respective test statistic is $F=\sigma^{2}_{1}/\sigma^{2}_{0}$ which is then compared against $F_c = F_{N_1 - K, N_0 -K}$. This test can also be applied to a model with only quantitative predictors by simply splitting the observations according to a desired rule. The two examples below show these cases.
```{r }
# Example 23: Goldfeld-Quandt Test (w/ Inidicator variable)
# wage = educ + exper + metro (=1 metropolitan and =0 urban)
library(car)
library(PoEdata)
alpha <- 0.05 #two tail, will take alpha/2
data("cps2", package="PoEdata")
#Create the two groups, m (metro) and r (rural)
m <- cps2[which(cps2$metro==1),]
r <- cps2[which(cps2$metro==0),]
wg1 <- lm(wage~educ+exper, data=m)
wg0 <- lm(wage~educ+exper, data=r)
df1 <- wg1$df.residual #Numerator degrees of freedom
df0 <- wg0$df.residual #Denominatot df
sig1squared <- glance(wg1)$sigma^2
sig0squared <- glance(wg0)$sigma^2
fstat <- sig1squared/sig0squared
Flc <- qf(alpha/2, df1, df0)#Left (lower) critical F
Fuc <- qf(1-alpha/2, df1, df0) #Right (upper) critical F
# Since F = 2.09 (in the rejection region), we reject H0 and conclude
# that the variances are different, and therefore, heteroskedasticity
# is an issue.
```
```{r }
# Example 24: Goldfeld-Quandt Test (only quantitative predictors)
# Food Expenditure vs. Family Income
# Divide the data into two groups: below median income and above
library(car)
library(PoEdata)
alpha <- 0.05
data("food", package="PoEdata")
medianincome <- median(food$income)
li <- food[which(food$income<=medianincome),]
hi <- food[which(food$income>=medianincome),]
eqli <- lm(food_exp~income, data=li)
eqhi <- lm(food_exp~income, data=hi)
dfli <- eqli$df.residual
dfhi <- eqhi$df.residual
sigsqli <- glance(eqli)$sigma^2
sigsqhi <- glance(eqhi)$sigma^2
fstat <- sigsqhi/sigsqli #The larger var in numerator
Fc <- qf(1-alpha, dfhi, dfli)
pval <- 1-pf(fstat, dfhi, dfli)
# Since p-val = 0.0046 < alpha=0.05, we reject the null
# and conclude tht heteroskedasticity is present. In this case,
# we can see that the test suggests that there is a hiher variance
# at higher incomes.
# An easier way to conduct the GQ test is using the function gqtest
# from the lmtest package
library(lmtest)
foodeq <- lm(food_exp~income, data=food)
gqtest(foodeq, point=0.5, alternative="greater",
order.by=food$income)
```
## Mitigating Heteroskedasticity
In the previous sections we learned how to detect heteroskedasticity, now we will discuss what to with our model to mitigate it its impact.
### Heteroskedasticity-Consistent Standard Errors
Recall that heteroskedasticity has the two effects that (1) the LS estimators are no longer the best, and (2) the LS standard errors are not correct since the variance is no longer constant. A more robust (in the sense that it applies to both heteroskedastic and homoskedastic errors) is the *White Statndard Error* as given by:
$${\rm var} (\widehat{b}_2) = \frac{N}{N-2} \frac{\sum_{i={1}}^{N} \left[(x_i - \overline{x})^2 \widehat{e}_{i}^{2} \right]}{\sum_{i={1}}^{N} \left[ (x_i - \overline{x})^2\right]^2}$$
For example, we look at the results from the food expenditure case and compare the traditional vs. the White standard errors, as well as the respective confidence intervals for the model coefficients. Keep in mind that hypotheses such as e.g., $\lambda = c_1\beta_1 + c_2\beta_2 =c_0$ would also require us to update the standard errors, and therefore, our conclusions may change.
\begin{center}
\includegraphics[width=2in]{whitese.png}
\end{center}
\vspace{5pt}
\begin{center}
\includegraphics[width=5in]{whiteci.png}
\end{center}
```{r }
# Example 25: White SE
# Food Expenditure vs. Family Income
# To compute the Robust White SE, the car package has a function
# hccm (heteroskedasticity-robust coefficient covariance matrix) that will
# compute them
library(car)
library(PoEdata)
# Traditional Standard Errors
foodeq <- lm(food_exp~income,data=food)
tidy(foodeq)
# Robust White Standard Errors
cov1 <- hccm(foodeq, type="hc1")
food.HC1 <- coeftest(foodeq, vcov.=cov1)
tidy(food.HC1)
# Hypotheiss Test for a linear combination of paramters: H0 = beta_2 + beta_3 = 0
data("andy", package="PoEdata")
andy.eq <- lm(sales~price+advert, data=andy)
bp <- bptest(andy.eq) #Heteroskedsticity test
b2 <- coef(andy.eq)[["price"]]
b3 <- coef(andy.eq)[["advert"]]
H0 <- "price+advert=0"
# Traditional Standard Errors
linearHypothesis(andy.eq, H0)
# Robust White Standard Errors
linearHypothesis(andy.eq, H0, vcov=hccm(andy.eq, type="hc1"))
# In this case the p-value is larger using the traditional errors.
```
### Generalized (Weighted) Least Squares
To properly deal with Heteroskedasticity, we would need to estimate a different variance of the error term for each observation. On a practical level this is not possible, and instead we model the dependence of the variance on one or more of the predictors in the model. We will exams two forms of dependence depending on whether we know or not the actual relationship between the variance and predictor(s).
#### Known Form or Variance
For simplicity, let's assume we have a SLR model $y_i = \beta_1 + \beta_2 x_i + e_i$ and that var$(e_i) = \sigma_{i}^{2} = \sigma^2 x_i$. If we divide our original model by $\sqrt{x_i}$, then we show that in the transformed model: $y_{i}^{*} = \beta_1 x_{i1}^{*} + \beta_2 x_{i2}^{*} + e_{i}^{*}$, $e_{i}^{*}$ are now homoskedastic since var$(e_{i}^{*}) =$var$e_i/\sqrt{x_i} =$var$(e_i)/x_i = \sigma^2 x_i/x_i = \sigma^2$.Also, the desired properties $E(e_{i}^{*}) =0$ and cov$(e_{i}^{*}, e_{j}^{*})=0$, $i\neq j$ still hold.
The transformation we proposed is part of a more general method known as *Weighted Least Squares* where we weighted the errors by $x_{i}^{-1/2}$.
For example, looking again at the food expenditure example, applying this method we obtain:
\begin{center}
\includegraphics[width=2in]{weighted.png}
\end{center}
\begin{center}
\includegraphics[width=5in]{weightedci.png}
\end{center}
```{r }
# Example 25: Weighted Least Squares -Known Variance
# Food Expenditure vs. Family Income
# To compute the Robust White SE, the car package has a function
# hccm (heteroskedasticity-robust coefficient covariance matrix) that will
# compute them
library(car)
library(PoEdata)
# Traditional LS
foodeq <- lm(food_exp~income,data=food)
tidy(foodeq)
# Weighted LS
w <- 1/food$income # Weights : R takes the sqrt by default already, hence w=1/x
food.wls <- lm(food_exp~income, weights=w, data=food)
vcvfoodeq <- coeftest(foodeq, vcov.=cov1)
tidy(vcvfoodeq)
# Traditional LS
```
#### Unknown Form of Variance
If we do not know the form of the variance, we can propose a more general power law form such as var$(e_i) = \sigma_{i}^{2} = \sigma^2 x_{i}^{\gamma}$, where $\gamma$ is an unknown parameter that we can estimate. The trick to estimating $\gamma$ is to take logs on both side of the variance equation, $\ln(\sigma_{i}^{2}) = \ln(\sigma^2) + \gamma \ln(x_i)$, and take the exponential on both sides,
$$\sigma_{i}^{2} = \exp[\ln(\sigma^2) + \gamma \ln(x_i)] = \exp(\alpha_1 + \alpha_2 z_i),$$
where $\alpha_1 = \ln(\sigma^2)$, $\alpha_2 = \gamma$, and $z_i = \ln(x_i)$. For the MR case, we simply have more terms, $z_{iS}$ to include in the exponential function. We can use traditional LS to estimate $\gamma$ from the log-linear model $\ln(\sigma_{i}^{2}) = \alpha_1 + \alpha_2 z_i$. The estimated equation is $\widehat{\ln(\sigma_{i}^{2}}) = 0.9378 + 2.39z_i$ (recall that we used $\gamma = 1$ in our previous example). Now we can use $\widehat{\sigma}^{2}_{i} = \exp(0.9378 + 2.39z_i)$ to transform our original model instead of the $x_{i}^{-1/2}$ weights. The estimated model is now:
\begin{center}
\includegraphics[width=3in]{wls.png}
\end{center}
```{r }
# Example 25: Weighted Least Squares -Uknown Variance
# Food Expenditure vs. Family Income
# To compute the Robust White SE, the car package has a function
# hccm (heteroskedasticity-robust coefficient covariance matrix) that will
# compute them
library(car)
library(PoEdata)
foodeq <- lm(food_exp~income,data=food)
data("food", package="PoEdata")
food.ols <- lm(food_exp~income, data=food)
ehatsq <- resid(food.ols)^2
sighatsq.ols <- lm(log(ehatsq)~log(income), data=food)
vari <- exp(fitted(sighatsq.ols))
food.fgls <- lm(food_exp~income, weights=1/vari, data=food)
# This case is also known as Feasible Generalized Least Squares
library(stargazer)
# stargazer(food.ols, food.HC1, food.wls, food.fgls)
```
\begin{table}[!htbp] \centering
\caption{Comparing various 'food' models}
\label{}
\begin{tabular}{@{\extracolsep{5pt}}lcccc}
\\[-1.8ex]\hline
\hline \\[-1.8ex]
& \multicolumn{4}{c}{Dependent variable: 'food expenditure'} \\
\cline{2-5}
& OLS & HC1 & WLS & FGLS \\
\hline \\[-1.8ex]
Constant & 83.416 & 83.416 & 78.684 & 76.054 \\
& (43.410) & (27.464) & (23.789) & (9.713) \\
& & & & \\
income & 10.210 & 10.210 & 10.451 & 10.633 \\
& (2.093) & (1.809) & (1.386) & (0.972) \\
& & & & \\
\hline \\[-1.8ex]
Observations & 40 & & 40 & 40 \\
\hline
\hline \\[-1.8ex]
\end{tabular}
\end{table}
\section{Missing Observations}
There are several ways to deal with missing data depending on their characteristics and volume. Below we briefly discuss for of them.
\begin{enumerate}
\item[1.] {\bf Remove Missing Observations}\\
If the number of missing observations is small relative to the dataset, we can simply remove these observations. Also, we need to make sure not to introduce a bias due to their removal by e.g., failing to represent certain characteristics of the data.
\item[2.] {\bf Remove the Variable}\\
This may be appropriate if most of the missing values occur mainly in a particular variable. We would need to weigh its removal against other considerations such as economic significance, etc.
\item[3.] {\bf Imputation via Mean/Median/Mode}\\
Depending on the distribution of the observations, we can impute the missing data with estimates of the Mean, Median or Mode.
\item[4.] {\bf Prediction}\\
We can predict the missing observations using kNN (k-Nearest Neighbor), regression, interpolation, etc. However, of all these methods, kNN seems to work well provided the variables are not factor variables. The basic idea behind the method is to replace a missing observation with a weighted average of the closest (using the Euclidean distance) $k$ neighbors. In the case of factor variables, we can instead use *Recursive Partitioning* (We normally cover this when discussing *Regression Trees*).
\end{enumerate}
```{r }
# Example 25: Boston Housing Data
library(mlbench) #Machine Learning Benchmark
# For refence on this example, look at:
# http://r-statistics.co/Missing-Value-Treatment-With-R.html#4.%20Prediction
data ("BostonHousing", package="mlbench") # initialize the data # load the data
original <- BostonHousing # backup original data
# Introduce missing values
set.seed(100)
BostonHousing[sample(1:nrow(BostonHousing), 40), "rad"] <- NA
BostonHousing[sample(1:nrow(BostonHousing), 40), "ptratio"] <- NA
head(BostonHousing)
# Before going through the methods, we should first look at
# the pattern of missing observations
library(mice)
md.pattern(BostonHousing) # pattern or missing values in data.
# Method 1: Remove Missing Observations
lm(medv ~ ptratio + rad, data=BostonHousing, na.action=na.omit) #
# Method 2: Remove the variable
# Simply leave it out of the regression model -not much more to do 🙂
# Method 3: Imputation via Mean/Median/Mode
library(Hmisc)
#impute(BostonHousing$ptratio, mean) # replace with mean
#impute(BostonHousing$ptratio, median) # median
#impute(BostonHousing$ptratio, 20) # replace specific number
# or if you want to impute manually
#BostonHousing$ptratio[is.na(BostonHousing$ptratio)] <- mean(BostonHousing$ptratio, na.rm = T)
# Compute the accuracy when NAs are imputed with the mean
library(DMwR) #Data Mining w/ R
# This library offers many fancy ways of dealing with NAs
# As a side note, if you're interested in trading, try their trading.simulator
actuals <- original$ptratio
predicted <- rep(mean(BostonHousing$ptratio, na.rm=T), length(actuals))
regr.eval(actuals, predicted)
# Looking at mape, we have about a 10% (mean absolute error) -not too bad
# Method 4: Prediction
# Perform knn imputation.
knnOutput <- knnImputation(BostonHousing[, !names(BostonHousing) %in% "medv"])
anyNA(knnOutput) # Check that we don't have any NAs left in the entire dataset
actuals <- original$ptratio
predicted <- knnOutput$ptratio
regr.eval(actuals, predicted)
# Based on mape, we are now down to 0.5% error!
# Prediction for Factor Variables
library(rpart)
class_mod <- rpart(rad ~ . - medv,
data=BostonHousing[!is.na(BostonHousing$rad), ],
method="class", na.action=na.omit) # since rad is a factor
anova_mod <- rpart(ptratio ~ . - medv,
data=BostonHousing[!is.na(BostonHousing$ptratio), ],
method="anova", na.action=na.omit) # since ptratio is numeric.
rad_pred <- predict(class_mod, BostonHousing[is.na(BostonHousing$rad), ])
ptratio_pred <- predict(anova_mod, BostonHousing[is.na(BostonHousing$ptratio), ])
# Accuracy for ptratio
actuals <- original$ptratio[is.na(BostonHousing$ptratio)]
predicteds <- ptratio_pred
regr.eval(actuals, predicteds)
# A mape of about 4% not bad
# Acccuracy for rad (factor variable)
actuals <- original$rad[is.na(BostonHousing$rad)]
predicteds <- as.numeric(colnames(rad_pred)[apply(rad_pred, 1, which.max)])
mean(actuals != predicteds) # compute misclass error.
# About 25% mis-classification error -pretty good
# Method 5: Advanced Method (Multivariate Imputation by Chained Equations)
# Use the R package mice
library(mice)
# Perform mice imputation, based on random forests
miceMod <- mice(BostonHousing[, !names(BostonHousing) %in% "medv"], method="rf")
miceOutput <- complete(miceMod) # generate the completed data.
anyNA(miceOutput)
# Accuracy for ptratio
actuals <- original$ptratio[is.na(BostonHousing$ptratio)]
predicteds <- miceOutput[is.na(BostonHousing$ptratio), "ptratio"]
regr.eval(actuals, predicteds)
# Acccuracy for rad (factor variable)
actuals <- original$rad[is.na(BostonHousing$rad)]
predicteds <- miceOutput[is.na(BostonHousing$rad), "rad"]
mean(actuals != predicteds) # compute misclass error.
```
\vspace{10pt}