1. Under the multiple regression model
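$$Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \varepsilon_i, \quad i = 1, \dots, n,$$
where $\varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)$. Decide for each of the following statements whether it is true or false.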
Briefly state your reasoning.
(a) The LS estimators of the regression coefficients $\beta_0, \dots, \beta_k$ are independent of SSE.
(b) The fitted values $\hat{Y}_1, \dots, \hat{Y}_n$ are independent of $S^2$.
(c) The regression sum of squares SSR satisfies $\mathrm{SSR}/\sigma^2 \sim \chi^2_k$.
(d) If the $\varepsilon_i$'s are not normally distributed, then the LS estimators of the regression coefficients $\beta_0, \dots, \beta_k$ are no longer unbiased.
(e) For testing $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$, P-values from the appropriate t-test and F-test are always identical.
(f) If the first predictor is not useful for predicting the response, then the LS estimator $\hat{\beta}_1 = 0$.
Solutions.
(a) True. The least squares estimators $(\hat{\beta}_0, \dots, \hat{\beta}_k)$ are independent of the MSE $S^2$, and $\mathrm{SSE} = (n - (k + 1)) S^2$ is a function of $S^2$, so they are also independent of SSE.
(b) True. Each fitted value $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \cdots + \hat{\beta}_k x_{ik}$ is a function of $(\hat{\beta}_0, \dots, \hat{\beta}_k)$ and hence is independent of $S^2$.
(c) False. This holds only when $\beta_1 = \cdots = \beta_k = 0$; it is false in general.
(d) False. The LS estimators are unbiased as long as all $\varepsilon_i$'s are mean zero random variables.
(e) True. The F test statistic always equals the square of the t test statistic, and $F_{1, n-(k+1)}$ is the distribution of $T^2$ when $T \sim t_{n-(k+1)}$.
(f) False. The true $\beta_1 = 0$, but the LS estimator $\hat{\beta}_1$ is a random variable with mean zero and nonzero variance, so it is almost never exactly zero.
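Statements (d) and (e) can be checked numerically. The following R sketch uses simulated data; the sample size, the coefficients, and the error distribution are illustrative assumptions and are not part of the exam. It compares the t-test with the partial F-test for one coefficient, and it averages the LS slope estimate over repeated samples with centered exponential errors.

```r
# Illustrative simulation; all numbers below are assumptions, not exam data.
set.seed(1)
n  <- 30
x1 <- rnorm(n)
x2 <- rnorm(n)

# (e): the t-test and the partial F-test for beta_1 give the same P-value.
y       <- 1 + 0.5 * x1 - 0.3 * x2 + rnorm(n)
full    <- lm(y ~ x1 + x2)
reduced <- lm(y ~ x2)
coefs   <- summary(full)$coefficients
t_stat  <- coefs["x1", "t value"]
p_t     <- coefs["x1", "Pr(>|t|)"]
f_tab   <- anova(reduced, full)      # partial F-test for dropping x1
f_stat  <- f_tab[2, "F"]
p_f     <- f_tab[2, "Pr(>F)"]
c(t_squared = t_stat^2, F = f_stat, p_t = p_t, p_F = p_f)

# (d): with mean-zero but non-normal errors, the LS slope is still unbiased.
beta1_hat <- replicate(2000, {
  eps   <- rexp(n) - 1               # mean zero, right-skewed errors
  y_sim <- 1 + 0.5 * x1 + eps
  coef(lm(y_sim ~ x1))[["x1"]]
})
mean(beta1_hat)                      # should be close to the true slope 0.5
```

The squared t statistic and the F statistic agree up to floating-point error, while the Monte Carlo average of the slope only approximates 0.5 because it is itself a random quantity.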
Speed limits and traffic fatalities. In November 1995, the National Highway System Designation Act was signed into law in the United States. Among other things, the act abolished the federal mandate of 55 mile-per-hour maximum speed limits on roads in the United States. Of the 50 states (the District of Columbia is not considered), 32 increased their speed limits at the beginning of 1996 or sometime during 1996, and 18 did not.
The data available are the percentage changes in interstate highway traffic fatalities from 1995 to 1996. From now on, we label the 32 states that increased the speed limit the “increase” group, and the other 18 states the “non-increase” group. The side-by-side box plots of the two groups are shown below.
[Figure: side-by-side box plots of percentage change for the Increase and Non-Increase groups.]
2. Consider the following statements about the data.
(i) The median percentage change of the “increase” group is higher than the median of the “non-increase” group.
(ii) There is no apparent outlier in the “non-increase” group.
(iii) More than half of the states which did not increase the speed limit had a negative percentage change in interstate highway traffic fatalities.
(iv) The distribution of the percentage change in the “increase” group has a long left tail.
Which of the above statements are (is) correct?
Solution. (i) is correct, since the middle line in each box marks the sample median. (ii) is correct, as no observation lies outside the two whiskers. (iii) is correct, since the median of the “non-increase” group is negative. (iv) is incorrect: according to the boxplot of the “increase” group, the two tails are comparable.
3. Tom analyzed the data in R, where the vector increase was used to record all the percentage changes in the “increase” group, and the vector nonincrease recorded the data from the “non-increase” group. The output of his comparison of the mean percentage changes of the two groups is the following.
Welch Two Sample t-test
data: increase and nonincrease
t = 2.5167, df = 29.529, p-value = 0.018
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
3.448554 33.246585
sample estimates:
mean of x mean of y
13.753125 -4.594444
Consider the following assumptions:
(i) The distributions within both groups are normal.
(ii) The population variances of the two distributions are equal.
(iii) The two samples are independent of each other.
(iv) The percentage change in fatalities of any one state is independent of all the other states.
Which of the above assumptions are (is) necessary in order for Tom’s analysis to be valid?
Solutions. (i), (iii) and (iv). This is a two-sample t-test for independent samples without the equal variance assumption. Therefore, one needs i.i.d. normal observations within each group, and the two groups of observations must also be mutually independent.
You may omit (iii) since it becomes redundant once (iv) is assumed, but including it is also fine.
4. In addition to the analysis shown in the previous question, Tom also tried an alternative method to compare the means of the two groups. The output of the alternative method is the following.
Two Sample t-test
data: increase and nonincrease
t = 2.6747, df = 48, p-value = 0.01
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
4.555163 32.139976
sample estimates:
mean of x mean of y
13.753125 -4.594444
In addition, the sample SDs of the two groups are 21.3 for the “increase” group and 26.5 for the “non-increase” group, and the normal plots for both groups are shown below.
[Figure: normal plots (sample quantiles against theoretical quantiles) for the Increase group and the Non-increase group.]
Based on the analysis here and that in the previous question, at significance level 0.05, what is the correct conclusion on the comparison of the mean percentage changes of the two groups?
Solutions. The mean percentage changes of the two groups are significantly different; the P-value of the appropriate hypothesis test is 0.018. We draw this conclusion because the equal variance assumption does not seem reasonable, given the large difference between the estimated degrees of freedom, 29.529, and the degrees of freedom n1 + n2 − 2 = 48 under the equal variance assumption.
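Both test statistics can be approximately reproduced from the summary statistics alone. The sketch below is a hand calculation in base R, not Tom's actual code; it uses the group means from the output and the sample SDs as rounded in the text, so the results match the printed output only up to that rounding.

```r
# Summary statistics taken from the R output and the text above.
n1 <- 32; n2 <- 18
m1 <- 13.753125; m2 <- -4.594444
s1 <- 21.3; s2 <- 26.5               # sample SDs, rounded in the text
d  <- m1 - m2                        # observed difference in means

# Welch test: unpooled standard error and Welch-Satterthwaite df.
v1 <- s1^2 / n1
v2 <- s2^2 / n2
se_welch <- sqrt(v1 + v2)
t_welch  <- d / se_welch
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
p_welch  <- 2 * pt(-abs(t_welch), df_welch)
ci_welch <- d + c(-1, 1) * qt(0.975, df_welch) * se_welch

# Pooled test: common variance estimate and n1 + n2 - 2 df.
sp2       <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2)
se_pooled <- sqrt(sp2 * (1 / n1 + 1 / n2))
t_pooled  <- d / se_pooled
p_pooled  <- 2 * pt(-abs(t_pooled), n1 + n2 - 2)

round(c(t_welch = t_welch, df_welch = df_welch, p_welch = p_welch,
        t_pooled = t_pooled, p_pooled = p_pooled), 3)
```

Both tests reject at level 0.05 here, so the choice between them affects the reported P-value but not the conclusion.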
Wine Consumption and Heart Disease. People are interested in whether the heart disease death rate is associated with average wine consumption. The data set contains measurements on two variables for 18 industrialized countries: (1) wine: the average wine consumption rate (in liters per person) and (2) mortality: number of ischemic heart disease deaths (per 1,000 men aged 55 to 64 years old).
First, consider the model
mortality = β0 + β1 log(wine) + ε.
Here, the logarithm to base e is used. A fraction of the R output from fitting the above model follows.
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  10.2795     0.8316  12.360 1.34e-09 ***
log(wine)    -1.7712     0.3468  -5.108 0.000105 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.498
Multiple R-squared: 0.6199

[Figure: left panel, mortality versus log(wine); right panel, residuals versus fitted values.]
5. Consider testing $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$ using an F statistic. For the current data, find the observed value of the F statistic and the degrees of freedom of its null distribution.
Solutions. By definition, the statistic is
$$F = \frac{\mathrm{SSR}/1}{\mathrm{SSE}/(n-2)} = (n-2)\frac{\mathrm{SSR}}{\mathrm{SSE}} = (n-2)\frac{R^2}{1-R^2} = 16 \times \frac{0.6199}{1-0.6199} = 26.09.$$
The degrees of freedom are $(1, n-2) = (1, 16)$.
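As a quick check, the same F statistic can be computed in R from the reported R-squared and compared with the square of the slope's t value. This is a hand calculation from the printed output, not part of the original analysis.

```r
r2      <- 0.6199    # Multiple R-squared from the output
n       <- 18        # number of countries
t_slope <- -5.108    # t value reported for log(wine)

f_stat <- (n - 2) * r2 / (1 - r2)                    # F on (1, n - 2) df
p_val  <- pf(f_stat, 1, n - 2, lower.tail = FALSE)   # upper-tail P-value
c(F = f_stat, t_squared = t_slope^2, p_value = p_val)
```

The F statistic and the squared t value agree up to the rounding of the printed output, as expected in simple linear regression.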
6. According to the fitted model, for any industrialized country whose average wine consumption rate is 16, the 95% confidence interval for its expected mortality is [4.50, 6.24]. What is the 95% prediction interval for its actual mortality?
(a) [2.61,8.13]. (b) [2.37,8.37]. (c) [2.33,8.41]. (d) [2.08,8.66].
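Solutions. First, the two intervals have the same center, so the center of the PI is
$$\frac{4.50 + 6.24}{2} = 5.37.$$
Next, the half-width of the CI is
$$\frac{6.24 - 4.50}{2} = 0.87 = s \times t_{16, 0.025} \times \sqrt{\diamond}.$$
Here, $s = 1.498$ and $t_{16, 0.025} = 2.12$, and so $\diamond = 0.075$. Finally, the half-width of the PI is
$$s \times t_{16, 0.025} \times \sqrt{1 + \diamond} = 3.293.$$
The PI is therefore $5.37 \pm 3.293 = [2.08, 8.66]$, so the answer is (d).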
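The steps of this solution can be traced in a few lines of R. The variable names are illustrative, and `diamond` stands for the unknown quantity ⋄ under the square root in the CI half-width.

```r
ci     <- c(4.50, 6.24)          # 95% CI for the expected mortality
s      <- 1.498                  # residual standard error
t_crit <- qt(0.975, df = 16)     # about 2.12

center  <- mean(ci)
hw_ci   <- diff(ci) / 2                      # half-width of the CI
diamond <- (hw_ci / (s * t_crit))^2          # solve hw_ci = s * t * sqrt(diamond)
hw_pi   <- s * t_crit * sqrt(1 + diamond)    # half-width of the PI
pred_int <- center + c(-1, 1) * hw_pi
round(pred_int, 2)
```

Equivalently, the squared PI half-width is the squared CI half-width plus `(s * t_crit)^2`, which is why the PI is always wider than the CI.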
7. Consider the following transformations of the response variable:
(i) $y \to y^{1/3}$, (ii) $y \to \sqrt{y}$, (iii) $y \to y^2$, (iv) $y \to y^4$.
Based on the residual plot, which of the above transformations could be considered for the current data set?
Solutions. In the residual plot, the vertical spread of the residuals increases with the fitted value. So we consider concave transformations, and both (i) and (ii) are reasonable choices.
A second analysis of the same data set appears next. The response variable is changed to log(mortality), while the predictor remains the same. Here, the logarithm to base e is used.
Call:
lm(formula = log(mortality) ~ log(wine), data = winedata)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.55555    0.12690  20.139 8.60e-13 ***
log(wine)   -0.35560    0.05291  -6.721 4.91e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2285
Multiple R-squared: 0.7384
[Figure: left panel, log(mortality) versus log(wine); right panel, residuals versus fitted values.]
8. What is the interpretation of the slope coefficient in the second fitted model?
(a) If the average wine consumption rate increases by one percent, then the number of ischemic heart disease deaths is expected to decrease by about 0.36 deaths per 1,000 men aged 55 to 64.
(b) If the average wine consumption rate increases by ten percent, then the number of ischemic heart disease deaths is expected to decrease by about 0.36 deaths per 1,000 men aged 55 to 64.
(c) If the average wine consumption rate increases by ten percent, then the number of ischemic heart disease deaths is expected to decrease by 3.6 percent.
(d) If the average wine consumption rate increases by ten percent, then the number of ischemic heart disease deaths is expected to decrease by 0.36 percent.
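Solutions. (c). Both the response and the predictor are log-transformed, so the slope is interpreted through percentage changes: a 10 percent increase in wine consumption changes the expected mortality by about −0.3556 × 10 ≈ −3.6 percent.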
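The percentage interpretation rests on the approximation log(1.10) ≈ 0.10. The short R sketch below compares this approximation with the exact multiplicative change implied by the fitted line; the variable names are illustrative.

```r
slope <- -0.35560                        # fitted coefficient of log(wine)

# Linear approximation used in the answer: slope times the percent change.
approx_pct <- slope * 10

# Exact change on the fitted line: multiplying wine by 1.10 multiplies
# the fitted mortality by 1.10^slope.
exact_pct <- (1.10^slope - 1) * 100

round(c(approximate = approx_pct, exact = exact_pct), 2)
```

The two values are close but not identical, and the gap grows for larger percentage changes in the predictor.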
9. Compute the value of the regression sum of squares in the second fitted model. Show the necessary steps.
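Solutions. To begin with, we express SSR as a function of $S^2$ and $R^2$. Note that
$$\frac{\mathrm{SSR}}{R^2} = \frac{\mathrm{SSE}}{1 - R^2} = \frac{(n-2) S^2}{1 - R^2}.$$
Hence
$$\mathrm{SSR} = (n-2) S^2 \frac{R^2}{1 - R^2} = 16 \times 0.2285^2 \times \frac{0.7384}{1 - 0.7384} = 2.358.$$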
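The same decomposition can be carried out step by step in R, going through SSE and SST. This is a sketch based only on the two summary numbers in the output.

```r
n  <- 18        # number of countries
s  <- 0.2285    # residual standard error of the second model
r2 <- 0.7384    # Multiple R-squared of the second model

sse <- (n - 2) * s^2      # SSE = (n - 2) * S^2
sst <- sse / (1 - r2)     # since 1 - R^2 = SSE / SST
ssr <- sst - sse          # SSR = SST - SSE
round(c(SSE = sse, SST = sst, SSR = ssr), 3)
```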
Oxygen Uptake Experiment. An experiment was conducted to study O2UP (oxygen uptake in milligrams of oxygen per minute), given the following five other chemical measurements:
• BOD: Biological oxygen demand,
• TKN: Total Kjeldahl nitrogen,
• TS: Total solids,
• TVS: Total volatile solids,
• COD: Chemical oxygen demand.
The data set includes measurements on 20 samples of dairy wastes in a laboratory. We first fit the
following multiple regression model to the data:
log(O2UP) = β0 + β1BOD + β2TKN + β3TS + β4TVS + β5COD + ε.
As before, the logarithm to base e is used. The R output of the fit follows.
Call:
lm(formula = log(O2UP) ~ BOD + TKN + TS + TVS + COD, data = dwaste)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.968e+00  2.110e+00  -2.355   0.0336 *
BOD         -4.298e-05  1.197e-03  -0.036   0.9719
TKN          3.016e-03  2.918e-03   1.034   0.3188
TS           2.959e-04  1.776e-04   1.666   0.1179
TVS          1.825e-02  3.233e-02   0.564   0.5814
COD          3.274e-04  1.703e-04   1.922   0.0751 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6046 on 14 degrees of freedom
Multiple R-squared: 0.8094, Adjusted R-squared: 0.7413
F-statistic: 11.89 on 5 and 14 DF, p-value: 0.0001238
10. Which of the following is a correct statement about the above fit?
(a) None of the five predictors is significant at level 0.05, and we should simply use the sample mean for predicting the value of log(O2UP).
(b) Approximately 81% of the log(O2UP) values are correctly determined by the model.
(c) The SSE of the above fit is 8.4644.
(d) None of the individual predictors is significant at level 0.05 given the other four predictors. However, the null hypothesis that all the slope coefficients are zero is rejected at level 0.05. This indicates that there is collinearity among the predictors.
Solutions. (d) is correct.
(a) is incorrect. Each of the five predictors is insignificant at level 0.05 only given the remaining four, so we cannot remove all of them at once; the overall F-test rejects the hypothesis that all slope coefficients are zero.
(b) is incorrect. Approximately 81% of the variation in the response variable log(O2UP) is explained by the model.
(c) is incorrect. $\mathrm{SSE} = 0.6046^2 \times 14 = 5.118$.
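The summary lines of the output are tied together by a few identities, which can be verified in R. The sketch below recomputes SSE, the overall F statistic, and the adjusted R-squared from the residual standard error and R-squared alone.

```r
n  <- 20        # number of samples
k  <- 5         # number of predictors
s  <- 0.6046    # residual standard error
r2 <- 0.8094    # Multiple R-squared

sse    <- s^2 * (n - k - 1)                           # SSE = S^2 * (n - k - 1)
f_stat <- (r2 / k) / ((1 - r2) / (n - k - 1))         # overall F statistic
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)        # adjusted R-squared
p_val  <- pf(f_stat, k, n - k - 1, lower.tail = FALSE)
round(c(SSE = sse, F = f_stat, adj_R2 = adj_r2, p_value = p_val), 4)
```

The recomputed values match the printed output up to rounding, which confirms that option (c) cannot be right.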
11. Consider 90% two-sided confidence intervals for the slope coefficients β1, . . . , β5. Among these
intervals, which cover(s) the point 0?
Solutions. By the duality between testing and confidence intervals, any variable whose t-test P-value is larger than 0.1 has its 90% CI covering zero. Hence, the CIs for β1, β2, β3 and β4 all cover zero.
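The intervals themselves can also be computed directly. In the sketch below, the estimates and standard errors are read off the coefficient table, with the rows taken in the order of the model formula (BOD, TKN, TS, TVS, COD); that row assignment is an assumption about the printed output.

```r
# Estimates and standard errors from the coefficient table (row order assumed).
est <- c(BOD = -4.298e-05, TKN = 3.016e-03, TS = 2.959e-04,
         TVS = 1.825e-02,  COD = 3.274e-04)
se  <- c(BOD = 1.197e-03,  TKN = 2.918e-03, TS = 1.776e-04,
         TVS = 3.233e-02,  COD = 1.703e-04)

t_crit <- qt(0.95, df = 14)          # two-sided 90% interval, 14 residual df
lower  <- est - t_crit * se
upper  <- est + t_crit * se
covers_zero <- lower < 0 & upper > 0
data.frame(lower, upper, covers_zero)
```

Only the interval for COD excludes zero, in line with its P-value being the only one below 0.1.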
12. Starting with the current model with all the five predictors, consider a backward elimination procedure for variable selection. Suppose we base our decision on the P-value of testing the significance of each predictor given the others, and the cutoff P-value is set at 0.05. Which variable will be removed from the model in the first backward elimination step?
(a) BOD.
(b) COD.
(c) TS.
(d) None of the above is correct.
Solutions. (a). We remove the variable with the largest P-value if it is above 0.05.
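The first step of this rule is easy to express in R. The sketch below covers only that single step, using the P-values from the coefficient table (rows assumed to follow the order of the model formula); a full backward elimination would refit the model after each removal.

```r
# P-values of the individual t-tests in the full model (row order assumed).
p_vals <- c(BOD = 0.9719, TKN = 0.3188, TS = 0.1179, TVS = 0.5814, COD = 0.0751)
cutoff <- 0.05

worst <- names(which.max(p_vals))    # predictor with the largest P-value
# Remove it only if its P-value exceeds the cutoff; otherwise remove nothing.
to_remove <- if (p_vals[[worst]] > cutoff) worst else NA_character_
to_remove
```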
In order to investigate the influences of individual observations on the above fit, we obtain the following plots.
[Figure: index plots of leverage and Cook’s distance against observation index for the 20 observations.]
13. Based on the above plots, which of the following statements is correct?
(a) There is one observation which is both influential and of high leverage.
(b) There is one influential observation, but it is not of high leverage.
(c) There is one observation of high leverage, but it is not influential.
(d) All observations are non-influential and of low leverage.
Solutions. The threshold for high leverage is $2(k+1)/n = 12/20 = 0.6$, and the threshold for Cook’s distance is 1. So the correct answer is (a).
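The two cutoffs can be applied to any fitted model with the base R functions `hatvalues()` and `cooks.distance()`. Since the dairy waste data are not reproduced here, the sketch below uses simulated stand-in data of the same dimensions (an assumption made purely for illustration), so its flags say nothing about the actual observations.

```r
n <- 20                              # number of observations
k <- 5                               # number of predictors
lev_cutoff  <- 2 * (k + 1) / n       # high-leverage threshold, 0.6 here
cook_cutoff <- 1                     # threshold for Cook's distance

# Simulated stand-in data (assumption): NOT the dairy waste measurements.
set.seed(2)
X   <- matrix(rnorm(n * k), n, k)
y   <- rnorm(n)
fit <- lm(y ~ X)

lev  <- hatvalues(fit)               # leverages, the diagonal of the hat matrix
cook <- cooks.distance(fit)          # Cook's distances
data.frame(obs = seq_len(n),
           high_leverage = lev > lev_cutoff,
           influential   = cook > cook_cutoff)
```

The leverages always sum to k + 1, so their average is (k + 1)/n and the cutoff is twice that average.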
The R output of an all-subsets selection performed on the current data set follows.
> dwaste.sub <- regsubsets(log(O2UP) ~ BOD + TKN + TS + TVS + COD, data = dwaste)
> dwaste.subsum <- summary(dwaste.sub)
> dwaste.subsum
Subset selection object
Call: regsubsets.formula(log(O2UP) ~ BOD + TKN + TS + TVS + COD, data = dwaste)
5 Variables  (and intercept)
    Forced in Forced out
BOD     FALSE      FALSE
TKN     FALSE      FALSE
TS      FALSE      FALSE
TVS     FALSE      FALSE
COD     FALSE      FALSE
1 subsets of each size up to 5
Selection Algorithm: exhaustive
         BOD TKN TS  TVS COD
1  ( 1 ) " " " " "*" " " " "
2  ( 1 ) " " " " "*" " " "*"
3  ( 1 ) " " "*" "*" " " "*"
4  ( 1 ) " " "*" "*" "*" "*"
5  ( 1 ) "*" "*" "*" "*" "*"
> dwaste.subsum$cp
[1] 6.295335 1.739348 2.318918 4.001289 6.000000
14. Under the Cp criterion, which model is the best for the current data set?
Solutions. The model with the smallest Cp value, namely the best two-predictor model, which uses TS and COD.
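The Cp comparison in Question 14 takes one line in R. The sketch below also checks the last Cp value against Mallows' formula Cp = SSE_p / s² − n + 2p, where s² is the MSE of the full model and p counts the intercept; the variable names are illustrative.

```r
# Cp values of the best model of each size, from the output above.
cp <- c(6.295335, 1.739348, 2.318918, 4.001289, 6.000000)
best_size <- which.min(cp)           # number of predictors in the best model
best_size

# For the full model, SSE_p / s^2 = n - p, so Cp reduces to p = k + 1 = 6.
n        <- 20
p_full   <- 6                        # five predictors plus the intercept
s_full   <- 0.6046                   # residual standard error of the full model
sse_full <- s_full^2 * (n - p_full)
cp_full  <- sse_full / s_full^2 - n + 2 * p_full
cp_full
```

The full model therefore always has Cp equal to its number of coefficients, which is why the last entry of the vector is exactly 6.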