
Statistical Inference STAT 431
Lecture 14: Multiple Regression (I) Statistical Model and Basic Inferences

Example: Gas Mileage
• A team designing a new car is concerned with the gas mileage that they should aim to achieve. In particular, they want to aim for the average value of MPG-city that their competitors would achieve for such a car.
– The new car they are designing will have the following characteristics: Horsepower = 225; Weight = 4000 lbs; Seating = 5 adults; Length = 180”
• To estimate the average MPG-city their competitors would achieve for such a vehicle, they gathered data on all “ordinary” car models in 2004 [Car04ord.csv]
• The data set contains information for each model about MPG-city and about the four variables: Horsepower, Weight, Seating, and Length

Example: Gas Mileage (Cont’d)
• Here are a few entries from the data set:

  Make/Model        MPG-City   HP     WT  Seats  Length
  Acura_RL                18  225  3.898      5   196.6
  Acura_TL                20  270  3.575      5   189.3
  Acura_TSX               23  200  3.318      5   183.3
  Acura_RSX               25  160  2.771      4   172.2
  Acura_NSX               17  252  3.197      2   174.2
  Acura_MDX               17  265  4.451      7   188.7
  Audi_A4_Quattro         20  170  3.55       5   179
  Audi_A6_Quattro         18  220  3.88       5   192
  Audi_A8                 17  330  4.399      5   204

• In total, there are 222 Make/Model entries.
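The sample rows above can be collected into a small structure for quick checks. A minimal Python sketch (the course itself uses R) using only the nine entries shown in the table:

```python
# The nine sample rows from the table: (Make/Model, MPG-City, HP, WT, Seats, Length).
cars = [
    ("Acura_RL",        18, 225, 3.898, 5, 196.6),
    ("Acura_TL",        20, 270, 3.575, 5, 189.3),
    ("Acura_TSX",       23, 200, 3.318, 5, 183.3),
    ("Acura_RSX",       25, 160, 2.771, 4, 172.2),
    ("Acura_NSX",       17, 252, 3.197, 2, 174.2),
    ("Acura_MDX",       17, 265, 4.451, 7, 188.7),
    ("Audi_A4_Quattro", 20, 170, 3.550, 5, 179.0),
    ("Audi_A6_Quattro", 18, 220, 3.880, 5, 192.0),
    ("Audi_A8",         17, 330, 4.399, 5, 204.0),
]

# Average MPG-City over just the rows shown (the full data set has 222 rows).
mpg = [row[1] for row in cars]
avg_mpg = sum(mpg) / len(mpg)
```

Note that WT is recorded in thousands of pounds (e.g. 3.898 means 3,898 lbs), which matters later when predicting for the 4,000 lb design.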

Multiple Regression: Model & Assumptions
• Variables: 1 response variable Y, and k predictor variables x1, ..., xk
• Data: n vectors of observations (xi1, ..., xik, yi), i = 1, ..., n
  – For the gas mileage example, k = 4, n = 222
• A statistical model:

    Yi = β0 + β1 xi1 + ··· + βk xik + εi,   i = 1, ..., n

  where β0 + β1 xi1 + ··· + βk xik is the linear signal and εi is the noise
• Assumptions on the noise terms:
  1. The εi's are mutually independent random variables (independence)
  2. The εi's have common mean 0 and common variance σ² (homoscedasticity)
  3. The εi's are normally distributed (normality)
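The three noise assumptions can be illustrated with a small simulation. This is a minimal Python sketch (the course uses R); the coefficient values, predictor ranges, and sample size below are made up for illustration:

```python
import random

# Simulate Yi = b0 + b1*xi1 + b2*xi2 + eps_i with i.i.d. N(0, sigma^2) noise.
# All numeric values here are illustrative, not from the lecture's data set.
random.seed(0)
b0, b1, b2, sigma = 30.0, -0.02, -4.0, 1.5
n = 10_000

x1 = [random.uniform(100, 300) for _ in range(n)]   # horsepower-like scale
x2 = [random.uniform(2.5, 4.5) for _ in range(n)]   # weight-like scale (1000s lbs)
eps = [random.gauss(0, sigma) for _ in range(n)]    # independent, mean 0, common variance
y = [b0 + b1 * a + b2 * b + e for a, b, e in zip(x1, x2, eps)]

# With many draws, the sample mean of the noise is close to 0 and its
# sample variance is close to sigma^2 = 2.25.
mean_eps = sum(eps) / n
var_eps = sum((e - mean_eps) ** 2 for e in eps) / (n - 1)
```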

    Yi = β0 + β1 xi1 + ··· + βk xik + εi,   i = 1, ..., n
         (signal)                   (noise)

• The multiple regression model assumes a true linear relationship between the response and the predictors: E(Yi) = β0 + β1 xi1 + ··· + βk xik
• β0 is the intercept, and β1, ..., βk are the regression coefficients corresponding to the respective predictors
  – Each βj describes how much E(Y) changes per unit change of the j-th predictor xj while all the other predictors are held fixed (think of partial derivatives)
• As in the simple regression case, we shall treat the (xi1, ..., xik)'s as fixed, and the randomness only comes from the εi's
• There are k + 2 parameters in the multiple regression model: β0, β1, ..., βk, σ²
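The partial-derivative reading of βj can be checked numerically: raising one predictor by one unit while holding the others fixed changes E(Y) by exactly that coefficient. A short Python sketch (coefficient values are rounded, for illustration only):

```python
# E(Y) under the linear model; the beta values below are illustrative.
def mean_response(x, beta0, betas):
    return beta0 + sum(b * xj for b, xj in zip(betas, x))

beta0 = 31.5
betas = [-0.015, -3.77, 0.34, 0.018]   # intercept excluded; k = 4 coefficients
x = [225.0, 4.0, 5.0, 180.0]           # (HP, WT in 1000s lbs, Seats, Length)

# Raise the 3rd predictor (index 2, Seating) by one unit, others held fixed:
x_plus = x[:2] + [x[2] + 1] + x[3:]
change = mean_response(x_plus, beta0, betas) - mean_response(x, beta0, betas)
# The change in E(Y) equals beta_3 = 0.34, regardless of the other x values.
```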

LS Estimators of Regression Coefficients
• Similar to simple regression, we use the least squares method to find estimators for β0, β1, ..., βk: we find β̂0, β̂1, ..., β̂k which minimize the sum of squared distances between the observed yi and the fitted values:

    Σ_{i=1}^{n} [yi − (β0 + β1 xi1 + ··· + βk xik)]²

• The fitted coefficients for the gas mileage data:

  > car.fit
  ...
  Coefficients:
  (Intercept)   Horsepower           Wt      Seating       Length
     31.49706     -0.01541     -3.76863      0.33659      0.01753

• The actual formulas for the LS estimators β̂0, β̂1, ..., β̂k involve linear algebra (not required; read Sec. 11.3 for the actual formulas if you are interested)
• To compute them for any particular data set in R, we can use the lm command:

  > car.fit <- lm(MPG.City ~ Horsepower + Wt + Seating + Length, data = car)

SSE and Estimator of σ²
• Terminology (the same as in simple regression):
  – Fitted values: Ŷi = β̂0 + β̂1 xi1 + ··· + β̂k xik,  i = 1, ..., n
  – Residuals: Ei = Yi − Ŷi,  i = 1, ..., n
  – Error sum of squares (SSE): SSE = Σ_{i=1}^{n} Ei² = Σ_{i=1}^{n} (Yi − Ŷi)²
• Mean square error (MSE) estimator for σ²:

    S² = SSE / (n − (k + 1))

  – There are n observations and we are estimating k + 1 regression coefficients, so SSE has n − (k + 1) degrees of freedom
  – S is called the residual standard error (or root mean square error, i.e., RMSE), and serves as an estimator for σ

Example: Gas Mileage (Cont’d)
• All the estimators can be obtained with the R command summary(car.fit):

  Coefficients:
               Estimate Std. Error t value Pr(>|t|)
  (Intercept) 31.497062   1.765402  17.841  < 2e-16 ***
  Horsepower  -0.015415   0.002803  -5.500 1.06e-07 ***
  Wt          -3.768635   0.277950 -13.559  < 2e-16 ***
  Seating      0.336590   0.136014   2.475   0.0141 *
  Length       0.017528   0.012613   1.390   0.1660
  ---
  Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  Residual standard error: 1.649 on 217 degrees of freedom

  (The Estimate column gives the LS estimators of the coefficients; the residual standard error is s.)
• Based on the LS fit of the multiple regression model, the predicted MPG-City for the new design is

    Ŷ = β̂0 + β̂1 × 225 + β̂2 × 4 + β̂3 × 5 + β̂4 × 180 = 17.79

Interpretation of the LS Coefficients
• Interpretation: each β̂j is the estimated change in the response per unit change in the j-th predictor xj, while all the other predictors are held fixed
  – It is important to keep in mind that the other predictors are controlled
  – Otherwise, some results can seem quite strange
• Example: In the LS fit for the Gas Mileage data, the coefficient in front of Seating is β̂3 = 0.337
  – This seems to say that “cars with more seats get better gas mileage,” which is counter-intuitive. In comparison, a simple regression of MPG-City on Seating yields a negative slope, −0.905.
• But if we take into account that the other variables (Horsepower, Weight, and Length) are held fixed, here is one plausible explanation: for a fixed basic car design (esp. Weight and Horsepower), the number of seats depends on whether the model is intended for “families” or for “singles”
  – Models for “families” tend to have more seats and better gas mileage; models for “singles” tend to have fewer seats and may be designed to be sportier at the price of worse gas mileage

Sampling Distributions
• Sampling distributions of the β̂j's:

    β̂j ∼ N(βj, σ² vjj),   j = 0, 1, ..., k

  – The LS estimators of the βj's are unbiased, and are normally distributed
  – The vjj terms in the variance formulas are functions of the xij's
  – Equivalently, with SD(β̂j) = σ √vjj, we can write (β̂j − βj) / SD(β̂j) ∼ N(0, 1)
• Sampling distribution of S²:

    (n − (k + 1)) S² / σ² ∼ χ²_{n−(k+1)}

  – Important fact: S² is independent of (β̂0, β̂1, ..., β̂k)

Tests and CIs for the Regression Coefficients
• Pivotal RVs: since S² is independent of (β̂0, β̂1, ..., β̂k), setting SE(β̂j) = S √vjj gives

    (β̂j − βj) / SE(β̂j) ∼ t_{n−(k+1)},   j = 0, 1, ..., k

• Tests and CIs for the βj's can be derived from these pivotal RVs
  – E.g. 1: a 100(1 − α)% two-sided CI for βj is β̂j ± t_{n−(k+1), α/2} SE(β̂j)
  – E.g. 2: the test of H0: βj = 0 vs. H1: βj ≠ 0 uses the test statistic t = β̂j / SE(β̂j); the null hypothesis is rejected at level α if |t| > t_{n−(k+1), α/2}
• In R, we can obtain SE(β̂j), the above test statistic t, and the P-value for the above test using the command summary
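The least squares, SSE, S², and SE(β̂j) formulas above can be exercised end to end in a short script. This is a minimal Python sketch (the course itself uses R's lm/summary); the helper names `solve`, `ls_fit`, and `std_errors`, and the tiny data set, are ours for illustration, not from the textbook:

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def ls_fit(X, y):
    """LS fit via the normal equations (X'X) beta = X'y.

    X holds n rows [1, xi1, ..., xik]; returns (beta_hat, sse, s2),
    where s2 = SSE / (n - (k+1)) is the MSE estimator of sigma^2."""
    n, p = len(X), len(X[0])                        # p = k + 1
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
           for a in range(p)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    beta = solve(XtX, Xty)
    fitted = [sum(X[i][a] * beta[a] for a in range(p)) for i in range(n)]
    sse = sum((y[i] - fitted[i]) ** 2 for i in range(n))
    return beta, sse, sse / (n - p)

def std_errors(X, s2):
    """SE(beta_hat_j) = s * sqrt(v_jj), where v_jj is the j-th diagonal
    entry of (X'X)^{-1}, obtained by solving (X'X) v = e_j."""
    n, p = len(X), len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
           for a in range(p)]
    ses = []
    for j in range(p):
        e = [1.0 if a == j else 0.0 for a in range(p)]
        vjj = solve(XtX, e)[j]
        ses.append(math.sqrt(s2 * vjj))
    return ses

# Tiny illustrative data set in which y is exactly linear in (x1, x2):
# y = 2 + 3*x1 - x2, so the fit recovers the coefficients and SSE is ~0.
X = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 2, 1]]
y = [2, 5, 1, 4, 7]
beta, sse, s2 = ls_fit(X, y)
ses = std_errors(X, s2)
```

As a sanity check against the lecture's output: each printed t value is Estimate / Std. Error, e.g. −0.015415 / 0.002803 ≈ −5.50 for Horsepower, matching the summary table.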

Example: Gas Mileage (Cont’d)
• Back to the output of the R command summary(car.fit):

  Coefficients:
               Estimate Std. Error t value Pr(>|t|)
  (Intercept) 31.497062   1.765402  17.841  < 2e-16 ***
  Horsepower  -0.015415   0.002803  -5.500 1.06e-07 ***
  Wt          -3.768635   0.277950 -13.559  < 2e-16 ***
  Seating      0.336590   0.136014   2.475   0.0141 *
  Length       0.017528   0.012613   1.390   0.1660
  ---
  Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  Residual standard error: 1.649 on 217 degrees of freedom

  (Std. Error column: SE(β̂j); t value column: test statistic for H0: βj = 0 vs. H1: βj ≠ 0; Pr(>|t|) column: P-value of that test)

Class Summary
• Key points of this class
  – Multiple regression model
  – LS estimators of the regression coefficients, and MSE estimator of σ²
  – Tests and CIs for the regression coefficients
• Reading: Sections 11.1–11.2 of the textbook
• Next class: Multiple Regression (II)