Statistical Inference STAT 431
Lecture 13: Simple Regression (III) Analysis of Variance & Prediction
Review of Last Regression Lecture
• Statistical model for simple regression: Y_i = β₀ + β₁x_i + ε_i, i = 1, …, n
– Assumptions on the ε_i's: (1) independence; (2) mean 0, variance σ²; (3) normality
– Parameters in the model: β₀, β₁, σ²
• Parameter estimation:
– LS estimators of the regression coefficients (β̂₀, β̂₁)
– Estimator for σ²: S² = SSE/(n − 2)
• Sampling distributions of the estimators β̂₀, β̂₁, S²
• Pivotal random variables for inferences about β₀, β₁
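• The estimators reviewed above can be sketched numerically. A minimal illustration in Python on a small made-up data set (the x/y values are illustrative, not the diamond data):

```python
# Sketch of the LS estimators and S^2 for simple regression.
import numpy as np

# Illustrative data (not the diamond data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))

b1 = Sxy / Sxx                  # LS slope estimate (beta1-hat)
b0 = y.mean() - b1 * x.mean()   # LS intercept estimate (beta0-hat)

resid = y - (b0 + b1 * x)       # residuals y_i - yhat_i
S2 = np.sum(resid ** 2) / (n - 2)   # S^2 = SSE / (n - 2)

print(b0, b1, S2)
```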
STAT 431 2
Basic Inferences for Simple Regression in R
• Recall the diamond data: prices (in Singapore dollars) and weights (in carats) of 48 diamond rings.
• We fit a simple regression to the data with the command lm:
> diamond.fit <- lm(Price ~ Weight, data = diamond)
• To obtain point estimates of the regression coefficients, we can use coef:
> coef(diamond.fit)
(Intercept)      Weight
  -259.6259   3721.0249
• To further obtain confidence intervals for the regression coefficients, we can use confint
> confint(diamond.fit, level = 0.95)
                2.5 %     97.5 %
(Intercept) -294.4870  -224.7649
Weight      3556.3984  3885.6513
• To test H₀: β₀ = 0 against the two-sided alternative, or to perform the same test for β₁, one can simply use the powerful command summary
> summary(diamond.fit)
Call:
lm(formula = Price ~ Weight, data = diamond)
…
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -259.63      17.32  -14.99   <2e-16 ***
Weight       3721.02      81.79   45.50   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
...
(The Std. Error column shows the standard errors of the estimators, the t value column the observed test statistics, and Pr(>|t|) the P-values.)
• The fitted response values and the residuals are typically useful for later diagnostic purposes. To extract these values, we can use the commands fitted and residuals.
> fitted(diamond.fit)
       1        2        3        4        5        6        7        8
372.9483 335.7381 372.9483 410.1586 670.6303 335.7381 298.5278 447.3688
       9       10       11       12       13       14       15       16
521.7893 298.5278 410.1586 782.2611 335.7381 484.5791 596.2098 819.4713
…
> residuals(diamond.fit)
          1           2           3           4           5           6
 -17.9483176  -7.7380691 -22.9483176 -85.1585661 -28.6303057   6.2619309
          7           8           9          10          11          12
  23.4721795  37.6311854 -38.7893116  24.4721795  51.8414339  40.7389488
…
Goodness of Fit of the LS Line
• The basic question: How well does the least squares line fit the data?
• To answer this question, we compare how much better the LS line fits the data than the best horizontal line does.
• The best horizontal line for predicting the response with data (x₁, y₁), …, (xₙ, yₙ) is y = ȳ
• The total squared distance from this line is called the total sum of squares:
  SST = Σ_{i=1}^n (y_i − ȳ)²
• The total squared distance from the LS line is called the error sum of squares:
  SSE = Σ_{i=1}^n (y_i − ŷ_i)²
Coefficient of Determination
• The coefficient of determination R² measures the proportion by which the LS line reduces the total sum of squares. So, it is defined as
  R² = 1 − SSE/SST
• Curious but useful fact: R² is the square of the correlation coefficient r of the data (x₁, y₁), …, (xₙ, yₙ), i.e., R² = r²
– For a proof, see p. 355 of the textbook
• Define the regression sum of squares (SSR):
  SSR = Σ_{i=1}^n (ŷ_i − ȳ)²
  We have the identity SST = SSR + SSE, so R² = SSR/SST
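• The identities above can be checked numerically; a small Python sketch on illustrative data (not the diamond data):

```python
# Numerical check of R^2 = 1 - SSE/SST = SSR/SST = r^2.
import numpy as np

# Illustrative data (not the diamond data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# LS fit
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x

SST = np.sum((y - y.mean()) ** 2)   # total sum of squares
SSE = np.sum((y - yhat) ** 2)       # error sum of squares
SSR = np.sum((yhat - y.mean()) ** 2)  # regression sum of squares

r = np.corrcoef(x, y)[0, 1]         # sample correlation coefficient
print(1 - SSE / SST, SSR / SST, r ** 2)   # all three agree
```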
Analysis of Variance
• The comparison of SST, SSR, and SSE is called the analysis of variance (ANOVA)
• Each of these sums of squares has its associated degrees of freedom (d.f.)
  Sum of squares               Degrees of freedom
  SST = Σ_{i=1}^n (y_i − ȳ)²   n − 1
  SSR = Σ_{i=1}^n (ŷ_i − ȳ)²   1
  SSE = Σ_{i=1}^n (y_i − ŷ_i)² n − 2
• When β₁ = 0, we have SSR/σ² ~ χ²_1 independent of SSE/σ² ~ χ²_{n−2}, so
  F = (SSR/1) / (SSE/(n − 2)) ~ F_{1,n−2}
• Thus, to test H₀: β₁ = 0 against the two-sided alternative at level α, we reject if F > f_{1,n−2,α}
• Question: How does the above test relate to the t-test for the same problem?
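• A Python sketch of the F statistic on illustrative data, which also answers the question numerically: the F statistic equals the square of the t statistic for H₀: β₁ = 0 (scipy is assumed available for the tail probability):

```python
# F-test for H0: beta1 = 0 in simple regression, and its relation to the t-test.
import numpy as np
from scipy import stats

# Illustrative data (not the diamond data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 2.1, 2.8, 4.2, 4.9, 6.1])
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x

SSE = np.sum((y - yhat) ** 2)
SSR = np.sum((yhat - y.mean()) ** 2)

F = (SSR / 1) / (SSE / (n - 2))       # F statistic with (1, n-2) d.f.
p_value = stats.f.sf(F, 1, n - 2)     # upper-tail probability

# t statistic for H0: beta1 = 0 satisfies t^2 = F
S = np.sqrt(SSE / (n - 2))
t = b1 / (S / np.sqrt(Sxx))
print(F, t ** 2, p_value)
```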
• In R, we can use the command anova:
> anova(diamond.fit)
Analysis of Variance Table

Response: Price
          Df  Sum Sq Mean Sq F value    Pr(>F)
Weight     1 2098596 2098596    2070 < 2.2e-16 ***
Residuals 46   46636    1014
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Mean Sq = Sum Sq / Df; the F value column gives the observed F statistic, and Pr(>F) its P-value.)
Prediction
• One goal: to predict the value of the response Y when the predictor x = x*
• Two different problems
– Predict the value of the random variable Y* = β₀ + β₁x* + ε
– Predict the mean value of Y*, i.e., μ* = β₀ + β₁x*
• 100(1 − α)% confidence interval for μ*: let μ̂* = β̂₀ + β̂₁x*; the CI is
  μ̂* ± t_{n−2,α/2} S √(1/n + (x* − x̄)²/S_xx)
– Centered at μ̂*, the point estimator provided by the LS regression line
– Width depends on:
  • variance of the noise / sample size / distance of x* from x̄ / constellation of the x_i's
• 100(1 − α)% prediction interval for Y*: let Ŷ* = μ̂* = β̂₀ + β̂₁x*; the PI is
  Ŷ* ± t_{n−2,α/2} S √(1 + 1/n + (x* − x̄)²/S_xx)
– Centered at Ŷ* = μ̂*, the point estimator provided by the LS regression line
– Width depends on:
  • variance of the noise / sample size / distance of x* from x̄ / constellation of the x_i's
– Difference from the CI for μ*: an extra term of 1 under the square root
  • needed to account for the variability of Y* around its mean
– The 100(1 − α)% prediction interval is always wider than the 100(1 − α)% confidence interval
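• A Python sketch of both interval formulas at a given x*, on illustrative data (scipy assumed available for the t quantile); it confirms the PI is wider than the CI:

```python
# CI for the mean response and PI for a new observation at x = xstar.
import numpy as np
from scipy import stats

# Illustrative data (not the diamond data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 2.1, 2.8, 4.2, 4.9, 6.1])
n = len(x)
alpha = 0.05
xstar = 4.0

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
S = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))
t = stats.t.ppf(1 - alpha / 2, n - 2)   # t_{n-2, alpha/2}

mu_hat = b0 + b1 * xstar
se_mean = S * np.sqrt(1 / n + (xstar - x.mean()) ** 2 / Sxx)      # CI
se_pred = S * np.sqrt(1 + 1 / n + (xstar - x.mean()) ** 2 / Sxx)  # PI: extra 1

ci = (mu_hat - t * se_mean, mu_hat + t * se_mean)
pi = (mu_hat - t * se_pred, mu_hat + t * se_pred)
print(ci, pi)   # the PI is always wider than the CI
```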
Examples
[Figure: two scatterplots, each with the LS regression line, 95% CI, and 95% PI overlaid.
 Left: Diamond data — Weight (carats), 0.10–0.40, vs. Price (Singapore dollars), 200–1200.
 Right: Father-son height data — Father's height (inches), 60–75, vs. Son's height (inches), 60–75.]
CI / PI Calculation in R
• Key R command: predict
• Suppose a simple regression has been fit in R for the diamond data
> diamond.fit <- lm(Price ~ Weight, data = diamond)
• Example 1: calculating a 99% confidence interval for the average price when the weight of the diamond is 0.2 carats
> predict(diamond.fit, newdata = data.frame(Weight = 0.2),
          interval = "confidence", level = 0.99)
       fit      lwr      upr
1 484.5791 472.1962 496.9619
• Example 2: calculating a 99% prediction interval for the actual price when the weight of the diamond is 0.2 carats
> predict(diamond.fit, newdata = data.frame(Weight = 0.2),
          interval = "prediction", level = 0.99)
       fit      lwr      upr
1 484.5791 398.1317 571.0264
Class Summary
• Key points of this class
– Coefficient of determination & ANOVA
  • F-test for the goodness of fit
– Prediction of a future observation
  • CI and prediction interval (connection and difference)
• Reading: Section 10.3 of the textbook
• Next class: Multiple Regression (I)