Statistical Inference STAT 431
Lecture 15: Multiple Regression (II) – ANOVA and Prediction
Review: The Multiple Regression Model
• Variables: 1 response variable Y, and k predictor variables x1, …, xk
• Data: n vectors of observations (xi1, …, xik, yi), i = 1, …, n
• A statistical model:
    Yi = β0 + β1 xi1 + ··· + βk xik + εi,   i = 1, …, n
    (the term β0 + β1 xi1 + ··· + βk xik is the signal; εi is the noise)
• Assumptions on the noises
1. εi's are mutually independent random variables (independence)
2. εi's have common mean 0 and common variance σ² (homoscedasticity)
3. εi's are normally distributed (normality)
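One way to see the model and its assumptions in action is to simulate a small data set and refit it with lm. This is a minimal sketch, not from the slides; the sample size, coefficients, and noise level below are made up for illustration.

    ## Minimal simulation of the model with k = 2 predictors
    set.seed(431)
    n  <- 100
    x1 <- runif(n)
    x2 <- runif(n)
    beta  <- c(2, 1.5, -0.8)               # beta_0, beta_1, beta_2 (hypothetical values)
    sigma <- 0.5
    eps <- rnorm(n, mean = 0, sd = sigma)   # i.i.d. N(0, sigma^2) noise
    y   <- beta[1] + beta[2] * x1 + beta[3] * x2 + eps
    summary(lm(y ~ x1 + x2))                # LS estimates should be close to beta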
• Interpretation of the regression coefficients β1, …, βk
– If xj does not depend on any of the other predictors, then βj describes how much E(Y) changes per unit change of xj, while all the other predictors are held fixed
– If xj depends on the other predictors, then the interpretation of βj is case dependent
• We will look at two special cases in later lectures
• LS estimators of the regression coefficients β̂0, …, β̂k
– Computed by the function lm in R
– Fitted values: Ŷi = β̂0 + β̂1 xi1 + ··· + β̂k xik,   i = 1, …, n
– Residuals: Ei = Yi − Ŷi,   i = 1, …, n
• In R, similar to the case of simple regression, the fitted values and the residuals can be obtained by the commands fitted and residuals, respectively.
> car.fit <- lm(MPG.City ~ Horsepower + Wt + Seating + Length, data = car)
> fitted(car.fit)
       1        2        3        4        5        6        7        8
18.46756 18.86320 20.80561 22.95250 19.29076 16.30161 20.31837 18.53184
       9       10       11       12       13       14       15       16
15.09061 16.65203 16.28459 20.38469 20.96575 20.68275 15.14919 18.27594
…
> residuals(car.fit)
           1            2            3            4            5            6
-0.467556318  1.136802714  2.194387019  2.047499652 -2.290763139  0.698387902
           7            8            9           10           11           12
-0.318366743 -0.531836789  1.909389608 -0.652032373  1.715410815 -0.384693043
…
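As a quick sanity check (a sketch, assuming the car data frame and car.fit from above), the fitted values and residuals add back to the observed responses:

    ## Fitted values plus residuals recover the observed MPG.City values
    all.equal(as.numeric(fitted(car.fit) + residuals(car.fit)), car$MPG.City)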
Goodness of Fit
• Similar to simple regression, we can ask the question:
How well does the least squares hyperplane fit the data?
• To answer this question, we compare how much better the LS hyperplane fits the data than does the best “constant” hyperplane: y = C
– The best “constant” hyperplane is y = ȳ
– The sum of squared distances to the best “constant” hyperplane is called the total sum of squares (SST):

    SST = Σ_{i=1}^n (yi − ȳ)²

– The sum of squared distances to the LS hyperplane is called the error sum of squares (SSE):

    SSE = Σ_{i=1}^n (yi − ŷi)²
Coefficient of Determination
• As in the case of simple regression, we define the coefficient of determination, R², as the proportion by which the LS hyperplane reduces the total sum of squares:

    R² = 1 − SSE/SST

• Define the regression sum of squares (SSR):

    SSR = Σ_{i=1}^n (ŷi − ȳ)²

– We have again the identity SST = SSR + SSE, so R² = SSR/SST
• The difference from the case of simple regression:
– In simple regression, R² equals the squared sample correlation coefficient
– There is no such interpretation for multiple regression
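These quantities can be computed directly from a fitted model. A minimal sketch, assuming the car data frame and car.fit from the earlier slides:

    ## SST, SSE, SSR, and R^2 by hand
    y   <- car$MPG.City
    SST <- sum((y - mean(y))^2)                    # total sum of squares
    SSE <- sum(residuals(car.fit)^2)               # error sum of squares
    SSR <- sum((fitted(car.fit) - mean(y))^2)      # regression sum of squares
    all.equal(SST, SSR + SSE)                      # the identity SST = SSR + SSE
    1 - SSE / SST                                  # R^2, same as summary(car.fit)$r.squared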
Analysis of Variance
• Each of the sums of squares has its associated d.f.:

    Sum of squares                  Degrees of freedom
    SST = Σ (yi − ȳ)²               n − 1
    SSR = Σ (ŷi − ȳ)²               k
    SSE = Σ (yi − ŷi)²              n − (k + 1)

• When β1 = ··· = βk = 0, we have SSR ~ σ²χ²_k, independent of SSE ~ σ²χ²_{n−(k+1)}, so

    F = (SSR/k) / (SSE/[n − (k + 1)]) ~ F_{k, n−(k+1)}

• The F statistic can be used to test

    H0: β1 = ··· = βk = 0   vs.   H1: there is at least one βj ≠ 0

– A level α test rejects H0 if F > f_{k, n−(k+1), α}
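The F statistic, its p-value, and the critical value can be computed by hand from the sums of squares. A minimal sketch, assuming car and car.fit from the earlier slides (k = 4 predictors); it should agree with the anova() output shown next:

    ## Overall F test "by hand"
    n   <- nrow(car)
    k   <- 4
    SSE <- sum(residuals(car.fit)^2)
    SSR <- sum((fitted(car.fit) - mean(car$MPG.City))^2)
    F.stat <- (SSR / k) / (SSE / (n - (k + 1)))
    F.stat                                          # should match the F in the anova() output below
    pf(F.stat, df1 = k, df2 = n - (k + 1), lower.tail = FALSE)   # p-value
    qf(0.95, df1 = k, df2 = n - (k + 1))            # critical value f_{k, n-(k+1), 0.05}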
• To perform ANOVA in R, we use the command anova
– However, the usage is slightly more complicated than in the simple regression case – we need to fit a model with only the intercept first!
> car.fit0 <- lm(MPG.City ~ 1, data = car)
> car.fit <- lm(MPG.City ~ Horsepower + Wt + Seating + Length, data = car)
> anova(car.fit0, car.fit)
Analysis of Variance Table

Model 1: MPG.City ~ 1
Model 2: MPG.City ~ Horsepower + Wt + Seating + Length
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)
1    221 2881.12
2    217  589.87  4    2291.2 210.73 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

– In this table, the RSS of Model 1 (2881.12, on 221 d.f.) is SST, the RSS of Model 2 (589.87, on 217 d.f.) is SSE, and their difference (Sum of Sq = 2291.2, on 4 d.f.) is SSR
Testing A Group of Coefficients
• Consider testing whether a group of predictor variables have (linear) effect on the response variable, after controlling for all other variables in the model
– E.g., in the gas mileage example, it might be of interest to test the null hypothesis that, given Horsepower and Weight, neither Seating nor Length has an effect on MPG-City
• Under the multiple regression model, the null hypothesis is equivalent to the statement that a subset of the regression coefficients are zero
– For notational convenience, we shall label these coefficients βk−m+1, …, βk
• In formal terms, the null and the alternative hypotheses are

    H0: βk−m+1 = ··· = βk = 0   vs.   H1: at least one of βk−m+1, …, βk is nonzero
• Be aware that the hypotheses are context dependent!
• To test the hypothesis, we could fit two separate LS regressions:
– One using all the k predictors; call the resulting error sum of squares SSE_k
– The other using only the k − m predictors whose coefficients are not related to H0; call the resulting error sum of squares SSE_{k−m}
– Note that SSE_{k−m} ≥ SSE_k (Why?)
• Under the null hypothesis, we have
– SSE_k ~ σ²χ²_{n−(k+1)}
– SSE_{k−m} ~ σ²χ²_{n−(k−m+1)}
– In addition, the difference SSE_{k−m} − SSE_k ~ σ²χ²_m is independent of SSE_k
– So,

    F = [(SSE_{k−m} − SSE_k)/m] / (SSE_k/[n − (k + 1)]) ~ F_{m, n−(k+1)}

• A level α test rejects the null hypothesis if F > f_{m, n−(k+1), α}
• The F test that all k coefficients equal zero is a special case of the above test (with m = k)
• In R, we could perform the test by applying the command anova to the two regression fits
> car.fit1 <- lm(MPG.City ~ Horsepower + Wt, data = car)
> car.fit <- lm(MPG.City ~ Horsepower + Wt + Seating + Length, data = car)
> anova(car.fit1, car.fit)
Analysis of Variance Table

Model 1: MPG.City ~ Horsepower + Wt
Model 2: MPG.City ~ Horsepower + Wt + Seating + Length
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)
1    219 626.20
2    217 589.87  2     36.33 6.6825 0.001527 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
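As a check, the F statistic in this table can be reproduced by hand from the two fits. A minimal sketch using car.fit1 and car.fit from above:

    ## Partial F test "by hand"
    SSE.full    <- sum(residuals(car.fit)^2)     # SSE_k (589.87 in the table)
    SSE.reduced <- sum(residuals(car.fit1)^2)    # SSE_{k-m} (626.20 in the table)
    m   <- df.residual(car.fit1) - df.residual(car.fit)   # number of coefficients tested (2)
    df2 <- df.residual(car.fit)                  # n - (k + 1) = 217
    F.stat <- ((SSE.reduced - SSE.full) / m) / (SSE.full / df2)   # about 6.68
    pf(F.stat, df1 = m, df2 = df2, lower.tail = FALSE)            # p-value, about 0.0015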
Prediction
• Goal: to predict the value of the response Y when x = x* = (x*1, …, x*k)
• Two different problems
– Predict the value of the r.v. Y* = β0 + β1 x*1 + ··· + βk x*k + ε
– Predict the mean value of Y, i.e., μ* = β0 + β1 x*1 + ··· + βk x*k
• Both intervals are based on the t_{n−(k+1)} distribution
– The actual formulas involve linear algebra (not required; see textbook p. 412)
– As in simple regression, the 100(1 − α)% prediction interval for Y* is always wider than the 100(1 − α)% confidence interval for the mean μ*
– Interpretations of the coverage probability are the same as in the simple regression case
• To compute these intervals in R, we again use the command predict
• Example 1: 95% prediction interval for the actual MPG-City of competing models with Horsepower = 225, Weight = 4000 lbs, Seating = 5, and Length = 180’’
> predict(car.fit, newdata = data.frame(Horsepower = 225, Wt = 4, Seating = 5, Length = 180), interval = "prediction", level = 0.95)
       fit      lwr      upr
1 17.79219 14.52555 21.05883
• Example 2: 95% prediction interval for the average MPG-City of competing models with Horsepower = 225, Weight = 4000 lbs, Seating = 5, and Length = 180’’
> predict(car.fit, newdata = data.frame(Horsepower = 225, Wt = 4, Seating = 5, Length = 180), interval = "confidence", level = 0.95)
       fit      lwr      upr
1 17.79219 17.45859 18.12579
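For interest, both intervals can be reproduced from the fitted model with standard lm accessors. This is a sketch of the linear algebra the slides say is not required, with the Wt = 4 coding taken from the predict() calls above:

    ## Reproducing the two intervals "by hand"
    x.new  <- c(1, 225, 4, 5, 180)         # intercept, Horsepower, Wt, Seating, Length
    y.hat  <- sum(coef(car.fit) * x.new)   # point prediction, about 17.79
    se.mu  <- sqrt(as.numeric(t(x.new) %*% vcov(car.fit) %*% x.new))  # SE of the estimated mean
    s      <- summary(car.fit)$sigma       # estimate of sigma
    t.crit <- qt(0.975, df = df.residual(car.fit))
    y.hat + c(-1, 1) * t.crit * se.mu                 # 95% CI for the mean mu*
    y.hat + c(-1, 1) * t.crit * sqrt(s^2 + se.mu^2)   # 95% PI for the actual Y*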
Class Summary
• Key points of this class
– Coefficient of determination
– Analysis of variance for multiple regression
– Testing a group of coefficients
– Prediction
• Read Section 11.4 of the textbook
• Next class: Multiple Regression (III) – Specially Constructed Predictor Variables