Statistical Inference STAT 431
Lecture 17: Multiple Regression (IV) Collinearity
Example: Stock Return
• The data are from February 1978 to December 1987
– Response: monthly return for WalMart stock
– Predictors: monthly returns for the S&P500 index and the value-weighted (VW) index
• Results of using S&P500 alone and VW alone
lm(formula = WALMART ~ SP500, data = stock)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.024476 0.006083 4.024 0.000102 ***
SP500 1.244457 0.123262 10.096 < 2e-16 ***
lm(formula = WALMART ~ VW, data = stock)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.019589 0.006075 3.225 0.00164 **
VW 1.239207 0.118087 10.494 < 2e-16 ***
Two fits are comparable:
• comparable $\hat{\beta}_1$
• comparable $\mathrm{SE}(\hat{\beta}_1)$
• the predictor is highly significant in both cases
• What about the fit using both predictors?
lm(formula = WALMART ~ SP500 + VW, data = stock)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.015111 0.007106 2.126 0.0356 *
SP500       -1.257834   1.041077  -1.208   0.2294
VW           2.458291   1.015865   2.420   0.0171 *
• The fit seems problematic:
– Counter-intuitive signs of the regression coefficients
– The LS regression coefficients have large SEs
– The individual predictors are much less significant than they were in separate simple regressions; but jointly significant:
Analysis of Variance Table
Model 1: WALMART ~ 1
Model 2: WALMART ~ SP500 + VW
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)
1    118 0.92599
2    116 0.47108  2    0.4549 56.008 < 2.2e-16 ***
• What is wrong with the data?
• Checking the scatter plot matrix of the data, we see that the two predictors are highly correlated!
• This is further confirmed by the correlation between S&P500 and VW: r = 0.993! (A one-line check is sketched below.)
• In multiple regression, we call this phenomenon collinearity
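• A quick way to check these correlations in R (a minimal sketch, assuming the stock data frame uses the column names from the lm calls above):

> cor(stock[, c("WALMART", "SP500", "VW")])    # pairwise correlations
> pairs(stock[, c("WALMART", "SP500", "VW")])  # scatter plot matrix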
Collinearity

[Figure: scatter plot matrix of the SP500, WALMART, and VW monthly returns]
• When there is strong collinearity among the predictors:
1. The estimates $\hat{\beta}_j$ are subject to numerical errors and are unreliable
• Small changes in the data might lead to dramatic changes in the coefficient estimates (see the simulation sketch after this list)
• Sometimes, even the signs of $\hat{\beta}_j$ could be reversed
2. Most of the coefficients have very large standard errors and as a result are statistically insignificant, even when the overall F-statistic is significant
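• A minimal simulation sketch of this instability (hypothetical data, not from the lecture): with two nearly collinear predictors, a small perturbation of the response can shift the individual coefficients noticeably, even though the fit as a whole barely moves.

> set.seed(431)
> n  <- 100
> x1 <- rnorm(n)
> x2 <- x1 + rnorm(n, sd = 0.05)   # x2 is nearly a copy of x1
> y  <- 1 + x1 + x2 + rnorm(n)
> y2 <- y + rnorm(n, sd = 0.2)     # slightly perturbed response
> coef(lm(y  ~ x1 + x2))           # compare the two coefficient vectors:
> coef(lm(y2 ~ x1 + x2))           # x1, x2 coefficients shift a lot; their sum stays near 2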
[Figure: two 3-D plots of z against collinear predictors x and y]
Measuring the Effect of Collinearity
• Variance inflation factor (VIF)
– An individual VIF is attached to each estimated regression coefficient
– Measures the effect of collinearity on each coefficient
• Definition of VIF: consider a response Y and predictors $X_1, \ldots, X_k$
– In the simple regression of Y on $X_1$, the slope estimator satisfies
  $\mathrm{SD}(\text{slope}) = \sigma / \sqrt{S_{x_1 x_1}}$
– Consider fitting the multiple regression model $Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k + \epsilon$; then
  $\mathrm{SD}(\hat{\beta}_1) = \dfrac{\sigma}{\sqrt{S_{x_1 x_1}\,(1 - R^2_{1\cdot 2,\ldots,k})}} = \sqrt{\mathrm{VIF}_1}\,\dfrac{\sigma}{\sqrt{S_{x_1 x_1}}}$
– Here, $R^2_{1\cdot 2,\ldots,k}$ is the coefficient of determination from the regression of $X_1$ on all the other predictors $X_2, \ldots, X_k$
– $\mathrm{VIF}_1 = 1/(1 - R^2_{1\cdot 2,\ldots,k})$; the other VIFs are defined analogously
– $\sqrt{\mathrm{VIF}}$ measures how much the SD of the estimator is inflated due to collinearity
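• As a sanity check of this definition on the stock data (a minimal sketch; the vif() function shown below gives the same answer):

> r2 <- summary(lm(SP500 ~ VW, data = stock))$r.squared  # R^2 of X1 on the other predictor
> 1 / (1 - r2)                                           # VIF for SP500, straight from the definition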
• Rule of thumb: in general, VIF > 10 is regarded as unacceptable
• To compute the VIFs in R, we can use the vif function from the R package car
> stock.fit <- lm(WALMART ~ SP500 + VW, data = stock)
> require(car)
> vif(stock.fit)
SP500 VW
74.29672 74.29672
• From the R output, the VIFs of the two predictors in the stock data are very large (why are they the same? with only two predictors, $R^2_{1\cdot 2} = R^2_{2\cdot 1} = r^2$, so the two VIFs must coincide)
• A natural question: how do we deal with collinearity?
Dealing with Collinearity
• Solution #1: Do nothing.
– OK if the only goal is prediction with no extrapolation
– Collinearity does not affect predictions, R-square, or the assessment of the overall strength of the fitted model (ANOVA); a quick check is sketched below
– Be cautious: extrapolation is very dangerous when there is collinearity!
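• A quick check of the prediction claim (a sketch of my own, reusing the stock fits; not from the lecture):

> fit.both <- lm(WALMART ~ SP500 + VW, data = stock)
> fit.one  <- lm(WALMART ~ VW, data = stock)
> cor(fitted(fit.both), fitted(fit.one))  # nearly 1: the in-sample predictions barely differ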
[Figure: z plotted against collinear predictors x and y, illustrating why extrapolation is dangerous]
• Solution #2: Drop some of the predictors with large VIFs
– In the stock example, we can drop either S&P500 or VW
– Once a predictor is dropped, make sure to fit the new model and re-compute the VIFs
• Solution #3: Reconstruct the predictors to avoid collinearity
– We will see an example later in HW (a small sketch of both solutions follows)
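• A minimal sketch of Solutions #2 and #3 for the stock example (the average/difference reconstruction is one illustrative choice, not necessarily the one used in HW):

> # Solution #2: drop VW and refit
> fit.drop <- lm(WALMART ~ SP500, data = stock)
> # Solution #3: replace the two indices by their average and their difference,
> # which are typically far less correlated with each other
> stock$avg <- (stock$SP500 + stock$VW) / 2
> stock$dif <- stock$SP500 - stock$VW
> fit.recon <- lm(WALMART ~ avg + dif, data = stock)
> vif(fit.recon)  # expected to be dramatically smaller than 74.3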
Example: PDA Marketing
• A marketing firm studied the demand for a new type of PDA
• 75 consumers were randomly sampled, and three variables were recorded
– Response: Rating = the likelihood of purchase, on a scale of 1 to 10, with 1 implying little interest and 10 indicating an almost certain purchase
– Predictors: Age (in years) and Income (in $1000s)
• From the scatter plot matrix, the two predictors are correlated. The correlation coefficient is r = 0.827

[Figure: scatter plot matrix of Age, Income, and Rating]
• Consider two simple regressions first:
lm(formula = Rating ~ Age, data = pda)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.52008    0.77612   0.670    0.505
Age          0.08882    0.01540   5.765 1.82e-07 ***

lm(formula = Rating ~ Income, data = pda)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.667931   0.400427  -1.668   0.0996 .
Income       0.070519   0.004912  14.356   <2e-16 ***

• In the two simple regressions, both predictors have a significant positive slope coefficient
• If we use both predictors, we get
lm(formula = Rating ~ Age + Income, data = pda)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.51616    0.40954   1.260    0.212
Age         -0.07630    0.01447  -5.272 1.35e-06 ***
Income       0.10315    0.00748  13.790  < 2e-16 ***
• In the multiple regression, both predictors are still significant, but the sign in front of Age has been altered due to collinearity
• However, instead of merely observing the sign change and blaming collinearity, we can say that the multiple regression fit has actually clarified the relationships among the variables:
– As consumers age, they tend to acquire more income
– At any fixed level of income, the likelihood of purchase decreases as age increases
• The change of sign can be regarded as Simpson's paradox for regression
• Also note that the SE of the Age coefficient in the multiple regression fit (0.0145) is smaller than that in the simple regression (0.0154)!
• So, the presence of mild collinearity is not necessarily bad
– Careful investigation is always needed
– Special attention should be paid to the SEs of the regression coefficients (a VIF check for this example is sketched below)
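• We can confirm the mildness of the collinearity with a VIF check (a sketch, assuming the pda data frame and the car package are loaded): with r = 0.827 between Age and Income, VIF = 1/(1 − 0.827²) ≈ 3.2, well below the rule-of-thumb cutoff of 10.

> vif(lm(Rating ~ Age + Income, data = pda))  # both VIFs are about 3.2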
• Key points of this class
– Collinearity: definition / consequences
– VIF: definition / calculation
– Dealing with collinearity
• Read Section 11.6 of the textbook
• Next class: Multiple Regression (V) – Multiple Regression Diagnostics