Statistical Inference STAT 431
Lecture 14: Multiple Regression (I) Statistical Model and Basic Inferences
Example: Gas Mileage
• A team designing a new car is concerned with the gas mileage that they should aim to achieve. In particular, they want to aim for the average value of MPG-city that their competitors would achieve for such a car.
– The new car they are designing will have the following characteristics: Horsepower = 225; Weight = 4000 lbs; Seating = 5 adults; Length = 180”
• To estimate the average MPG-city their competitors would achieve for such a vehicle, they gathered data on all “ordinary” car models in 2004 [Car04ord.csv]
• The data set contains information for each model about MPG-city, and about the four variables: Horsepower, Weight, Seating, and Length
• Here are a few entries from the data set:

Make/Model        MPG-City   HP   WT (1000 lbs)   Seats   Length
Acura_RL             18      225      3.898         5     196.6
Acura_TL             20      270      3.575         5     189.3
Acura_TSX            23      200      3.318         5     183.3
Acura_RSX            25      160      2.771         4     172.2
Acura_NSX            17      252      3.197         2     174.2
Acura_MDX            17      265      4.451         7     188.7
Audi_A4_Quattro      20      170      3.55          5     179
Audi_A6_Quattro      18      220      3.88          5     192
Audi_A8              17      330      4.399         5     204
Example: Gas Mileage (Cont’d)
• In total, there are 222 Make/Model entries.
Multiple Regression: Model & Assumptions
• Variables: one response variable $Y$, and $k$ predictor variables $x_1, \dots, x_k$
• Data: $n$ vectors of observations $(x_{i1}, \dots, x_{ik}, y_i)$, $i = 1, \dots, n$
  – For the gas mileage example, $k = 4$, $n = 222$
• A statistical model:
$$Y_i = \underbrace{\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}}_{\text{linear signal}} + \underbrace{\epsilon_i}_{\text{noise}}, \quad i = 1, \dots, n$$
• Assumptions on the noise terms:
  1. The $\epsilon_i$'s are mutually independent random variables (independence)
  2. The $\epsilon_i$'s have common mean 0 and common variance $\sigma^2$ (homoscedasticity)
  3. The $\epsilon_i$'s are normally distributed (normality)
• The multiple regression model assumes a true linear relationship between the response and the predictors:
$$E(Y_i) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}$$
• $\beta_0$ is the intercept, and $\beta_1, \dots, \beta_k$ are the regression coefficients corresponding to the respective predictors
  – Each $\beta_j$ describes how much $E(Y)$ changes per unit change of the $j$-th predictor $x_j$ while all the other predictors are held fixed (think of partial derivatives)
• As in the simple regression case, we shall treat the $(x_{i1}, \dots, x_{ik})$'s as fixed, so that the randomness comes only from the $\epsilon_i$'s
• There are $k + 2$ parameters in the multiple regression model: $\beta_0, \beta_1, \dots, \beta_k, \sigma^2$
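To make the signal-plus-noise structure concrete, here is a minimal R sketch that simulates data from this model; the sample size, predictor values, coefficients, and noise SD below are all hypothetical, chosen only for illustration.

# Minimal sketch: simulate Y_i = beta0 + beta1*x_i1 + beta2*x_i2 + eps_i
# (all parameter values are hypothetical, for illustration only)
set.seed(431)
n <- 100
x1 <- runif(n, 0, 10)                       # fixed predictor values
x2 <- runif(n, 0, 5)
beta0 <- 2; beta1 <- 1.5; beta2 <- -0.8     # hypothetical true coefficients
sigma <- 1                                  # hypothetical noise SD
eps <- rnorm(n, mean = 0, sd = sigma)       # iid N(0, sigma^2) noise terms
y <- beta0 + beta1 * x1 + beta2 * x2 + eps  # linear signal + noise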
LS Estimators of Regression Coefficients
• Similar to simple regression, we use the least squares method to find estimators for $\beta_0, \beta_1, \dots, \beta_k$: we find $\hat\beta_0, \hat\beta_1, \dots, \hat\beta_k$ which minimize the sum of squared distances between the observed $y_i$ and the fitted values:
$$\sum_{i=1}^{n} \left[ y_i - (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}) \right]^2$$
• The actual formulas for the LS estimators $\hat\beta_0, \hat\beta_1, \dots, \hat\beta_k$ involve linear algebra (not required; read Sec. 11.3 for the actual formulas if you are interested)
• To compute them for any particular data set in R, we can use the lm command:
> car.fit <- lm(MPG.City ~ Horsepower + Wt + Seating + Length, data = car)
> car.fit
...
Coefficients:
(Intercept)   Horsepower           Wt      Seating       Length
   31.49706     -0.01541     -3.76863      0.33659      0.01753
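For the curious, a sketch of the linear-algebra formula $\hat\beta = (X^T X)^{-1} X^T y$ from Sec. 11.3, assuming the data have been read into a data frame named car with the column names used above; it should reproduce the coefficients reported by lm.

# Sketch of the LS formula beta-hat = (X'X)^{-1} X'y (Sec. 11.3, optional)
X <- model.matrix(~ Horsepower + Wt + Seating + Length, data = car)  # design matrix with intercept column
y <- car$MPG.City
beta.hat <- solve(t(X) %*% X, t(X) %*% y)   # should match coef(car.fit)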
SSE and Estimator of $\sigma^2$
• Terminology (the same as in simple regression):
  – Fitted values: $\hat{Y}_i = \hat\beta_0 + \hat\beta_1 x_{i1} + \cdots + \hat\beta_k x_{ik}$, $i = 1, \dots, n$
  – Residuals: $E_i = Y_i - \hat{Y}_i$, $i = 1, \dots, n$
  – Error sum of squares: $SSE = \sum_{i=1}^{n} E_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$
• Mean square error (MSE) estimator for $\sigma^2$:
$$S^2 = \frac{SSE}{n - (k+1)}$$
  – There are $n$ observations and we are estimating $k + 1$ regression coefficients, so SSE has $n - (k+1)$ degrees of freedom
  – $S$ is called the residual standard error (or root mean square error, i.e., RMSE), and serves as an estimator for $\sigma$
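To tie these definitions back to the fitted model, a short sketch computing SSE, its degrees of freedom, and $S$ by hand, assuming the car.fit object from the previous slide.

# Sketch: SSE, its degrees of freedom, and S computed from car.fit
sse <- sum(residuals(car.fit)^2)   # error sum of squares
df  <- df.residual(car.fit)        # n - (k + 1) = 222 - 5 = 217
s   <- sqrt(sse / df)              # residual standard error; about 1.649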
Example: Gas Mileage (Cont’d)
• All the estimators can be obtained with the R command summary(car.fit):
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  31.497062   1.765402  17.841  < 2e-16 ***
Horsepower   -0.015415   0.002803  -5.500 1.06e-07 ***
Wt           -3.768635   0.277950 -13.559  < 2e-16 ***
Seating       0.336590   0.136014   2.475   0.0141 *
Length        0.017528   0.012613   1.390   0.1660
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.649 on 217 degrees of freedom

• The Estimate column gives the LS estimators of the coefficients, and the residual standard error is $s$
• Based on the LS fit of the multiple regression model, the predicted MPG-City for the planned car is
$$\hat{Y} = \hat\beta_0 + \hat\beta_1 \times 225 + \hat\beta_2 \times 4 + \hat\beta_3 \times 5 + \hat\beta_4 \times 180 = 17.79$$
  (Weight enters as 4 because Wt is recorded in thousands of lbs)
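The same prediction can be obtained directly in R; a minimal sketch, assuming the car.fit object fitted earlier.

# Sketch: predicted MPG-City for the planned car via predict()
new.car <- data.frame(Horsepower = 225, Wt = 4, Seating = 5, Length = 180)
predict(car.fit, newdata = new.car)   # about 17.79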
Interpretation of the LS Coefficients
• Interpretation: Each $\hat\beta_j$ is the estimated change in the response per unit change in the $j$-th predictor $x_j$, while all the other predictors are held fixed
– It is important to keep in mind that the other predictors are controlled
– Otherwise, some results can seem quite strange
• Example: In the LS fit for the Gas Mileage data, the coefficient of Seating is $\hat\beta_3 = 0.337$
  – This seems to say that “cars with more seats get better gas mileage”, which is counter-intuitive. In comparison, a simple regression of MPG-City on Seating yields a negative slope, $-0.905$ (see the sketch after this list).
• But if we take into account that other variables (Horsepower, Weight, and Length) are held fixed, here is one plausible explanation: for a fixed basic car design (esp. Weight and Horsepower), the number of seats depends on whether the model is intended for “families” or for “singles”
– Models for “families” tend to have more seats and better gas mileage; models for “singles” tend to have fewer seats and may be designed to be sportier at the price of worse gas mileage
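For comparison, a one-line sketch of the simple regression mentioned above, assuming the same car data frame.

# Sketch: simple regression of MPG-City on Seating alone
coef(lm(MPG.City ~ Seating, data = car))   # Seating slope about -0.905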
Sampling Distributions
• Sampling distributions of the $\hat\beta_j$'s:
$$\hat\beta_j \sim N(\beta_j, \sigma^2 v_{jj}), \quad j = 0, 1, \dots, k$$
  – The LS estimators of the $\beta_j$'s are unbiased, and are normally distributed
  – The $v_{jj}$ terms in the variance formulas are functions of the $x_{ij}$'s
  – Equivalently, we can write
$$\frac{\hat\beta_j - \beta_j}{SD(\hat\beta_j)} \sim N(0, 1), \quad \text{with } SD(\hat\beta_j) = \sigma\sqrt{v_{jj}}$$
• Sampling distribution of $S^2$:
$$\frac{(n - (k+1))\,S^2}{\sigma^2} \sim \chi^2_{n-(k+1)}$$
  – Important fact: $S^2$ is independent of $(\hat\beta_0, \hat\beta_1, \dots, \hat\beta_k)$
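These facts can be checked empirically; below is a small Monte Carlo sketch with hypothetical parameters and fixed predictors, comparing the empirical mean and SD of $\hat\beta_1$ to the theory above.

# Sketch: Monte Carlo check of the sampling distribution of beta1-hat
# (hypothetical parameters; predictors held fixed across replications)
set.seed(431)
n <- 50
x1 <- runif(n, 0, 10); x2 <- runif(n, 0, 5)
beta <- c(2, 1.5, -0.8); sigma <- 1
b1 <- replicate(2000, {
  y <- beta[1] + beta[2] * x1 + beta[3] * x2 + rnorm(n, 0, sigma)
  coef(lm(y ~ x1 + x2))[2]               # beta1-hat for this replication
})
mean(b1)                                 # close to 1.5: unbiased
X <- cbind(1, x1, x2)
sigma * sqrt(solve(t(X) %*% X)[2, 2])    # theoretical SD(beta1-hat) = sigma * sqrt(v11)
sd(b1)                                   # empirical SD, close to the theoretical value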
Tests and CIs for the Regression Coefficients
• Pivotal RVs: since $S^2$ is independent of $(\hat\beta_0, \hat\beta_1, \dots, \hat\beta_k)$, for $SE(\hat\beta_j) = S\sqrt{v_{jj}}$,
$$\frac{\hat\beta_j - \beta_j}{SE(\hat\beta_j)} \sim t_{n-(k+1)}, \quad j = 0, 1, \dots, k$$
• Tests and CIs for the $\beta_j$'s can be derived from the above pivotal RVs
  – E.g. 1: A $100(1-\alpha)\%$ two-sided CI for $\beta_j$ is
$$\hat\beta_j \pm t_{n-(k+1),\,\alpha/2} \, SE(\hat\beta_j)$$
  – E.g. 2: The test of $H_0: \beta_j = 0$ vs. $H_1: \beta_j \neq 0$ uses the test statistic $t = \hat\beta_j / SE(\hat\beta_j)$. The null hypothesis is rejected at level $\alpha$ if
$$|t| > t_{n-(k+1),\,\alpha/2}$$
• In R, we can obtain $SE(\hat\beta_j)$, the above test statistic $t$, and the P-value for the above test problem using the command summary
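A short sketch, assuming the car.fit object from before: confint computes the two-sided CIs, and the t statistics can be recomputed from the summary table.

# Sketch: 95% CIs for the beta_j's, and the t statistics by hand
confint(car.fit, level = 0.95)            # beta_j-hat +/- t_{217, 0.025} * SE(beta_j-hat)
tab <- coef(summary(car.fit))             # Estimate, Std. Error, t value, Pr(>|t|)
tab[, "Estimate"] / tab[, "Std. Error"]   # matches the t value column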
Example: Gas Mileage (Cont’d)
• Back to the output of the R command summary(car.fit):
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  31.497062   1.765402  17.841  < 2e-16 ***
Horsepower   -0.015415   0.002803  -5.500 1.06e-07 ***
Wt           -3.768635   0.277950 -13.559  < 2e-16 ***
Seating       0.336590   0.136014   2.475   0.0141 *
Length        0.017528   0.012613   1.390   0.1660
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.649 on 217 degrees of freedom

• The Std. Error column gives $SE(\hat\beta_j)$, the t value column gives the test statistic for $H_0: \beta_j = 0$ vs. $H_1: \beta_j \neq 0$, and the Pr(>|t|) column gives the P-value of the test
Class Summary
• Key points of this class
– Multiple regression model
– LS estimators of the regression coefficients, and the MSE estimator of $\sigma^2$
– Tests and CIs for the regression coefficients
• Reading: Sections 11.1–11.2 of the textbook
• Next class: Multiple Regression (II)