Statistical Inference STAT 431
Lecture 18: Multiple Regression (V): Regression Diagnostics
Review: Modeling Assumptions for Multiple Regression
• Variables: 1 response variable Y, and k predictor variables x1, …, xk
• Data: n vectors of observations (xi1, …, xik, yi), i = 1, …, n
• Model: Yi = β0 + β1xi1 + ··· + βkxik + εi, i = 1, …, n
  (signal: β0 + β1xi1 + ··· + βkxik; noise: εi)
• Assumptions on the noises
1. εi's are mutually independent random variables (independence)
2. εi's have common mean 0, and common variance σ² (homoscedasticity)
3. εi's are normally distributed (normality)
• Equivalently, Yi ∼ N(β0 + β1xi1 + ··· + βkxik, σ²) and are mutually independent of each other
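The model above is easy to simulate, which makes the assumptions concrete. Below is a minimal numpy sketch (the lecture's own code is in R; the data and coefficients here are invented for illustration): we generate Yi = β0 + β1xi1 + β2xi2 + εi with iid normal noise and check that least squares recovers the signal.

```python
import numpy as np

# Simulated data from the model: Y = beta0 + beta1*x1 + beta2*x2 + eps,
# with eps iid N(0, sigma^2). All numbers are made up for illustration.
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.uniform(size=n), rng.uniform(size=n)])
beta = np.array([1.0, 2.0, -3.0])   # true coefficients (the signal)
sigma = 0.5                          # common noise standard deviation
y = X @ beta + rng.normal(0.0, sigma, size=n)   # signal + noise

# Least-squares estimate of beta from the simulated data
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 2))        # close to the true (1, 2, -3)
```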
Example: Price of Diamond Rings
• The diamond data conforms to the modeling assumptions.
[Scatter plot: Price (Singapore dollars) vs. Weight (carats)]
• The scatter plot shows a linear relationship between the response (Price) and the predictor (Weight)
• To check whether the errors have constant variance, we can plot residuals against fitted values (or predictors); the resulting plot is called the residual plot
– The vertical deviations of the residuals are well-behaved
[Residual plot: Residual vs. Fitted values (Singapore dollars)]
• To check the normality assumption on the errors, we could make a normal plot of the residuals. For the diamond data, there is no obvious violation of the normality assumption.
[Normal Q−Q Plot of the residuals: Sample Quantiles vs. Theoretical Quantiles]
Possible Anomalies in the Data
• Usually, the data set you get does not behave as well as the diamond data. There could be deviations from the assumptions.
• Things to look for in the data (listed in order of inspection)
1. Regression outliers
• Gross deviations in the vertical direction from the general pattern
2. Heteroscedasticity
• Changes in the vertical spread of the data for varying predictor values
3. Non-linearity
• A curved overall pattern as a function of the predictors
4. Non-normality of the residuals
5. Lack of independence
6. Influential/leverage points
• After fixing the above problems, we should also pay attention to collinearity in predictors [covered in last lecture]
Example: House Prices in Zip 30062
• The data set contains information about house prices in zip code 30062 in the year 2003. In total, there are n = 439 houses.
• We are interested in the relationship between Price (response, in $1000) and the following four predictors
– Age: age of the house (in years)
– BLDSQFT: size of the building (in square feet)
– BEDRMS: number of bedrooms
– BATHS: number of bathrooms
Regression Outliers
• Regression outliers are points far from the overall pattern for the average of the response Y given the predictors, i.e. points with gross vertical deviation from the general pattern
• What to do with outliers?
– If they are mis-recorded, they should be excluded from analysis
– If they do belong to the data set, retain them
– But, one should later check that any conclusions are not strongly dependent on these potential outliers

[Scatter plot matrix of Price, Age, BLDSQFT, BEDRMS, and BATHS; potential outliers are marked]
Heteroscedasticity
• We start with the most straightforward fit

fit1 <- lm(Price ~ Age + BLDSQFT + BEDRMS + BATHS, data = house)

• When both heteroscedasticity and non-linearity are present, we deal with the former first!
• Look at the plot of residuals vs. fitted values from the above fit
– The vertical spread of residuals becomes larger as the fitted values increase (a clear sign of heteroscedasticity!)
– The mean values of the residuals seem to be curved rather than staying at 0 (a sign of non-linearity)

[Residual plot: residuals(fit1) vs. fitted(fit1)]
Transformations of Y
• Heteroscedasticity can often be addressed by transforming the response variable
• However, such a transformation can take a linear situation to a non-linear one!
• Typically, we take the following steps in order:
1. Choose a suitable transformation to fix heteroscedasticity
2. Look at the transformed data and determine whether it satisfies both linearity and homoscedasticity
3. If homoscedasticity is satisfied, but non-linearity is present, a further transformation of the predictors may help
• Commonly employed transformations of Y
– If the vertical spread gets larger as ŷ gets larger, try log(y), √y, or 1/y
– If the vertical spread gets smaller as ŷ gets larger, try y² or e^y
• For the current data set, we could try the log(y) transformation

fit2 <- lm(log(Price) ~ Age + BLDSQFT + BEDRMS + BATHS, data = house)

– Is √y another option?
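Why log(y) helps when the spread grows with ŷ can be seen on simulated data. The numpy sketch below (not the house data; the helper `resid_spread_ratio` is a made-up diagnostic for this illustration) generates a response with multiplicative noise, so the spread grows with the mean, and compares the residual spread before and after a log transform.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.uniform(1.0, 5.0, size=n)
mu = np.exp(0.5 + 0.4 * x)                       # mean grows with x
y = mu * np.exp(rng.normal(0.0, 0.2, size=n))    # multiplicative noise: spread grows with the mean

def resid_spread_ratio(resp, x):
    """Fit resp ~ x by least squares; compare the residual spread on the
    upper vs. lower half of the fitted values (near 1 = homoscedastic)."""
    X = np.column_stack([np.ones_like(x), x])
    b, *_ = np.linalg.lstsq(X, resp, rcond=None)
    fitted = X @ b
    r = resp - fitted
    lo = r[fitted <= np.median(fitted)]
    hi = r[fitted > np.median(fitted)]
    return hi.std() / lo.std()

print(resid_spread_ratio(y, x))          # well above 1: heteroscedastic
print(resid_spread_ratio(np.log(y), x))  # near 1: spread stabilized by the log
```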
• We refit the model with the transformed response. Here are the plots of residuals vs. fitted values. (Both transformations lead to reasonable homoscedasticity.)

[Residual plots vs. fitted values: left, the log fit (residuals(fit2) vs. fitted(fit2)); right, the √y fit]
Non-linearity
• In simple regression, we can simply look at the scatter plot to check whether the overall pattern between the response and the single predictor is linear or curved.
• In multiple regression
– Usually impossible to obtain direct visualization of the multi-dimensional data
– We can look at all pairwise scatter plots via the scatter plot matrix
• A more informative way is to look at residual plots, including
– Plots of residuals vs. individual predictors
– Plot of residuals vs. fitted values
• Basic idea: If the relationship is linear, then all these residual plots should look like a plot of a random sample from a N(0, σ²) distribution
Residual Plots
• To obtain all the residual plots at the same time, we could use the residualPlots command in the car package
• Here are the results with y → log y

> residualPlots(fit2)
           Test stat Pr(>|t|)
Age            0.436    0.663
BLDSQFT       -0.371    0.711
BEDRMS         2.872    0.004
BATHS         -0.352    0.725
Tukey test     0.211    0.833

[Residual plots: Pearson residuals vs. Age, BLDSQFT, BEDRMS, BATHS, and fitted values]
• The residualPlots command simultaneously does the following things:
– Makes plots of residuals vs. predictors / fitted values
– Fits a mean function curve in each of the above residual plots (the red curve)
• Ideally, the curve should be a horizontal line at 0
– Performs a test for curvature in each residual plot
• In each residual plot, treat the residual as the response, and what's on the x-axis (i.e., a single predictor, the fitted value, etc.) as the predictor
• Fit the model y = β0 + β1x + β2x² to the scatter plot
• For all residual vs. predictor plots, the reported P-value comes from the t-test of H0: β2 = 0.
• For the plot of residual vs. fitted value, the P-value is obtained from the normal distribution, and the test is called Tukey's test for non-additivity.
• For the house price data, the residual plots look good except for the one with predictor BEDRMS. The curvature test reports a P-value = 0.004.
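The spirit of the curvature test can be reproduced by hand. The numpy sketch below (simulated data, not the house data; car's residualPlots works with the full fitted model, so this is only a simplified single-predictor version) fits a straight line to a truly quadratic relationship, then regresses the residuals on x and x² and t-tests the quadratic coefficient.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400
x = rng.uniform(-1.0, 1.0, size=n)
y = 1.0 + 2.0 * x + 1.5 * x**2 + rng.normal(0.0, 0.3, size=n)  # truly quadratic

# Step 1: fit the (misspecified) straight line y = b0 + b1*x
X1 = np.column_stack([np.ones(n), x])
b1, *_ = np.linalg.lstsq(X1, y, rcond=None)
resid = y - X1 @ b1

# Step 2: curvature check -- regress the residuals on x and x^2,
# then t-test H0: beta2 = 0 for the quadratic coefficient
X2 = np.column_stack([np.ones(n), x, x**2])
b2, *_ = np.linalg.lstsq(X2, resid, rcond=None)
r2 = resid - X2 @ b2
s2 = r2 @ r2 / (n - 3)                       # residual variance estimate
cov = s2 * np.linalg.inv(X2.T @ X2)          # covariance of the coefficients
t_stat = b2[2] / np.sqrt(cov[2, 2])
print(t_stat)                                # a large |t| flags curvature
```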
• What about the transformation y → √y?

> fit3 <- lm(sqrt(Price) ~ Age + BLDSQFT + BEDRMS + BATHS, data = house)
> residualPlots(fit3)
           Test stat Pr(>|t|)
Age            2.069    0.039
BLDSQFT        2.456    0.014
BEDRMS         3.883    0.000
BATHS          1.643    0.101
Tukey test     3.197    0.001

[Residual plots: Pearson residuals vs. Age, BLDSQFT, BEDRMS, BATHS, and fitted values]

• Clearly, the residual plots show that the square root transformation is less satisfactory than the log transformation.
• So, we decide to proceed with the y → log y transformation.
• How can we fix the mild non-linearity suggested in the residual plot?
Possible solution: adding a BEDRMS^2 term.

> fit4 <- lm(log(Price) ~ Age + BLDSQFT + BEDRMS + I(BEDRMS^2) + BATHS, data = house)
> residualPlots(fit4)
            Test stat Pr(>|t|)
Age            -0.231    0.818
BLDSQFT        -1.214    0.225
BEDRMS         -0.022    0.982
I(BEDRMS^2)    -1.066    0.287
BATHS          -0.991    0.322
Tukey test     -1.458    0.145

• Any other proposals?
[Residual plots: Pearson residuals vs. Age, BLDSQFT, BEDRMS, I(BEDRMS^2), BATHS, and fitted values]
Non-normality of Residuals
• Once we have achieved linearity and homoscedasticity, the next thing on the list is to check for normality of the residuals, via a normal plot of the residuals
• What if normality fails?
– Skewness and heavy tails are problematic: they lead to inaccurate tests / CIs
– Less critical when the sample size is large
• On the right is the normal plot of the residuals for our data.
[Normal Q−Q Plot of the residuals: Sample Quantiles vs. Theoretical Quantiles]
Lack of Independence
• Fairly rare except in time series data, where the residuals could be autocorrelated
• Example: The cellular data [Cellular.csv] contains the number of subscribers to cell phone services in the U.S. every six months from the end of 1984 to the end of 1995.
• The data is a time series y1, …, yn, where yt is the number of subscribers at time period t. In the simple regression setting, t plays the role of the predictor.
• A scatter plot of y vs. t shows a non-linear growth trend in the number of subscribers

[Scatter plot: Subscribers vs. Period]
• By trial and error, one discovers that the transformation y → y^(1/4) yields what appears to be an ideal linear relationship between the response and the predictor
[Left: Subscribers^(1/4) vs. Period. Right: residuals(cellular.fit) vs. fitted(cellular.fit)]
• However, the residual plot shows a meandering pattern of the residuals, which clearly violates the modeling assumptions of simple regression
• Such residuals are called autocorrelated, for adjacent residuals are correlated with each other
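One simple numeric check for autocorrelation is the correlation between adjacent residuals (the Durbin–Watson test is the standard formal version). A numpy sketch on simulated residual series, not the cellular data; the helper `lag1_corr` is invented for this illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200

# Two simulated residual series: independent vs. autocorrelated (AR(1))
indep = rng.normal(size=n)
auto = np.empty(n)
auto[0] = rng.normal()
for t in range(1, n):
    auto[t] = 0.8 * auto[t - 1] + rng.normal()   # each value drags along its predecessor

def lag1_corr(r):
    """Sample correlation between adjacent residuals r_t and r_{t+1}."""
    return np.corrcoef(r[:-1], r[1:])[0, 1]

print(lag1_corr(indep))  # near 0: no sign of autocorrelation
print(lag1_corr(auto))   # clearly positive: a meandering pattern
```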
Influential Observations
• An observation is influential if its removal causes the model fit to change noticeably.
• Influential observations and leverage
– Express the fitted values as ŷ_i = Σ_{j=1}^n h_ij y_j
– If h_ii is large, then we say the i-th observation has high leverage, and is a potential influential point
– Rule of thumb: in multiple regression, h_ii > 2(k + 1)/n is regarded as high leverage
– Compute leverage values in R: hatvalues
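The weights h_ij above are the entries of the hat matrix H = X(XᵀX)⁻¹Xᵀ, and the leverages h_ii are its diagonal. A numpy sketch on simulated predictors (not the house data) with one point planted far out in predictor space:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
X[0, 1:] = 6.0                                   # one point far out in predictor space

# Hat matrix H = X (X'X)^{-1} X'; its diagonal gives the leverages h_ii
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

print(h.sum())                                   # leverages always sum to k + 1
threshold = 2 * (k + 1) / n                      # rule-of-thumb cutoff 2(k+1)/n
print(np.where(h > threshold)[0])                # the far-out point is among those flagged
```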
• Influen$al observa$ons and Cook’s distance
– Thebasicidea:Ifthei-thobserva$onisinfluen$al,thenthepredictedvalues based on the LS fits with or without the i-th observa$on should be no$ceably different
– Let yˆ ,j = 1,…,n be the fiPed values using the LS fit j
y = βˆ + βˆ x + · · · βˆ x 011kk
obtained from the en$re data set
– Now,leaveoutthei-thobserva$on,andweobtainanewLSfitwithn−1
data points: y = βˆ0(i) + βˆ1(i)x1 + · · · βˆk(i)xk. Using this new equa$on, we
get a new set of predicted values for each of the data point in the original data
set, denoted by yˆ ,j = 1,…,n j (i)
– Cook’s distance for the i-th observa$on: Di =
– Ruleofthumb:anyobserva$onwithCook’sdistance>1isinfluen$al
– Compute Cook’s distance in R: cooks.distance
STAT 431 21
n
j=1 j j(i)
(yˆ−yˆ )2 (k + 1)s2
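The definition of D_i can be computed directly by refitting without each observation. A numpy sketch on simulated data (not the house data); in practice one would use the closed form via leverages rather than n refits, but the brute-force version below matches the formula term by term:

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(0.0, 0.5, size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ beta
s2 = (y - yhat) @ (y - yhat) / (n - k - 1)       # s^2 from the full fit

def cooks_distance(i):
    """D_i = sum_j (yhat_j - yhat_j(i))^2 / ((k+1) s^2), by explicit refit."""
    keep = np.arange(n) != i
    beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    yhat_i = X @ beta_i                          # predictions for ALL n points
    return ((yhat - yhat_i) ** 2).sum() / ((k + 1) * s2)

D = np.array([cooks_distance(i) for i in range(n)])
print(D.max())                                   # D_i > 1 would signal an influential point
```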
[Index plots for the house price data: hatvalues(fit4) and cooks.distance(fit4) vs. Index]
• Key points of this class
– Possible anomalies in data
– How to deal with heteroscedasticity
• Transformation of Y
– Non-linearity
• Residual plots
– Lack of independence in residuals
– Influential observations
• Leverage and Cook's distance
• Read Section 11.5 of the textbook
• Next class: Multiple Regression (VI)
– Variable selection