1. Under the multiple regression model
$$Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \varepsilon_i, \quad i = 1, \dots, n,$$
where $\varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)$. Decide for each of the following statements whether it is true or false.
Briefly state your reasoning.
(a) The LS estimators of the regression coefficients $\beta_0, \dots, \beta_k$ are independent of SSE.
(b) The fitted values $\hat{Y}_1, \dots, \hat{Y}_n$ are independent of $S^2$.
(c) The regression sum of squares SSR satisfies $\mathrm{SSR}/\sigma^2 \sim \chi^2_k$.
(d) If the $\varepsilon_i$'s are not normally distributed, then the LS estimators of the regression coefficients $\beta_0, \dots, \beta_k$ are no longer unbiased.
(e) For testing $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$, P-values from the appropriate t-test and F-test are always identical.
(f) If the first predictor is not useful for predicting the response, then the LS estimator $\hat{\beta}_1 = 0$.
Speed limits and traffic fatalities. In November 1995, the National Highway System Designation Act was signed into law in the United States. Among other things, the act abolished the federal mandate of 55-mile-per-hour maximum speed limits on roads in the United States. Of the 50 states (the District of Columbia is not considered), 32 increased their speed limits at the beginning of 1996 or sometime during 1996, and 18 did not.
The data available are the percentage changes in interstate highway traffic fatalities from 1995 to 1996. From now on, we label the 32 states that increased the speed limit the “increase” group, and the other 18 states the “non-increase” group. The side-by-side box plots of the two groups are shown below.
[Figure: side-by-side box plots of the percentage change (axis from −40 to 60) for the Increase and Non-Increase groups.]
2. Consider the following statements about the data.
(i) The median percentage change of the “increase” group is higher than the median of the “non-increase” group.
(ii) There is no apparent outlier in the “non-increase” group.
(iii) More than half of the states which did not increase the speed limit had a negative percentage change in interstate highway traffic fatalities.
(iv) The distribution of the percentage change in the “increase” group has a long left tail.
Which of the above statements are (is) correct?
3. Tom analyzed the data in R, where the vector increase was used to record all the percentage changes in the “increase” group, and the vector nonincrease recorded the data from the “non-increase” group. The output of his comparison of the mean percentage changes of the two groups is the following.
Welch Two Sample t-test
data: increase and nonincrease
t = 2.5167, df = 29.529, p-value = 0.018
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
3.448554 33.246585
sample estimates:
mean of x mean of y
13.753125 -4.594444
Consider the following assumptions:
(i) The distributions within both groups are normal.
(ii) The population variances of the two distributions are equal.
(iii) The two samples are independent of each other.
(iv) The percentage change in fatalities of any one state is independent of all the other states.
Which of the above assumptions are (is) necessary in order for Tom’s analysis to be valid?
4. In addition to the analysis shown in the previous question, Tom also tried an alternative method to compare the means of the two groups. The output of the alternative method is the following.
Two Sample t-test
data: increase and nonincrease
t = 2.6747, df = 48, p-value = 0.01
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
4.555163 32.139976
sample estimates:
mean of x mean of y
13.753125 -4.594444
In addition, the sample SDs of the two groups are 21.3 for the “increase” group and 26.5 for the “non-increase” group, and the normal plots for both groups are shown below.
[Figure: two normal plots, titled “Normal plot: Increase group” and “Normal plot: Non−increase group”; horizontal axis: Theoretical Quantiles (−2 to 2); vertical axis: Sample Quantiles.]
Based on the analysis here and that in the previous question, at significance level 0.05, what is the correct conclusion on the comparison of the mean percentage changes of the two groups?
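As a rough check on the two R outputs, both t statistics can be recomputed from the summary statistics quoted in the text. The following is a minimal Python sketch, not part of Tom's R session; it assumes the rounded sample SDs 21.3 and 26.5, and all variable names are illustrative.

```python
import math

# Summary statistics quoted in the text (the sample SDs are rounded to one decimal).
n1, mean1, sd1 = 32, 13.753125, 21.3   # "increase" group
n2, mean2, sd2 = 18, -4.594444, 26.5   # "non-increase" group
diff = mean1 - mean2

# Welch test: separate variance estimates and Satterthwaite degrees of freedom.
v1, v2 = sd1**2 / n1, sd2**2 / n2
welch_t = diff / math.sqrt(v1 + v2)
welch_df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

# Pooled test: one common variance estimate with n1 + n2 - 2 degrees of freedom.
sp2 = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
pooled_t = diff / math.sqrt(sp2 * (1 / n1 + 1 / n2))
pooled_df = n1 + n2 - 2

print("Welch :", round(welch_t, 3), round(welch_df, 2))
print("Pooled:", round(pooled_t, 3), pooled_df)
```

Because the SDs are rounded, the recomputed values should match the printed t = 2.5167 (df = 29.529) and t = 2.6747 (df = 48) only to about two decimals. The only difference between the two methods is how the standard error and the degrees of freedom are obtained.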
Wine Consumption and Heart Disease. People are interested in whether the heart disease death rate is associated with average wine consumption. The data set contains measurements on two variables for 18 industrialized countries: (1) wine: the average wine consumption rate (in liters per person) and (2) mortality: number of ischemic heart disease deaths (per 1,000 men aged 55 to 64 years old).
First, consider the model
mortality = β0 + β1 log(wine) + ε.
Here, the logarithm to base e is used. A fraction of the R output from fitting the above model follows.
[Figure: mortality plotted against log(wine) (horizontal axis from 1.0 to 4.0), together with the plot of residuals against fitted values for this model.]
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.2795 0.8316 12.360 1.34e-09 ***
log(wine) -1.7712 0.3468 -5.108 0.000105 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.498
Multiple R-squared: 0.6199
5. Consider testing $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$ using an F statistic. For the current data, find the observed value of the F statistic and the degrees of freedom of its null distribution.
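In simple linear regression the F statistic for the slope equals the square of its t statistic, and it can also be written in terms of R-squared. The following is a minimal Python sketch of both routes using only the values printed in the R output above; variable names are illustrative, and small discrepancies between the two results come from rounding in the printed output.

```python
# Values read from the R output of the first model (18 countries, one predictor).
n = 18
t_slope = -5.108
r_squared = 0.6199

# Degrees of freedom of the F test: 1 for the slope, n - 2 for the residuals.
df_num, df_den = 1, n - 2

# Route 1: F is the square of the slope's t statistic.
f_from_t = t_slope ** 2

# Route 2: F = (R^2 / df_num) / ((1 - R^2) / df_den).
f_from_r2 = (r_squared / df_num) / ((1 - r_squared) / df_den)

print(round(f_from_t, 2), round(f_from_r2, 2), (df_num, df_den))
```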
6. According to the fitted model, for any industrialized country whose average wine consumption rate is 16, the 95% confidence interval for its expected mortality is [4.50, 6.24]. What is the 95% prediction interval for its actual mortality?
(a) [2.61,8.13]. (b) [2.37,8.37]. (c) [2.33,8.41]. (d) [2.08,8.66].
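One way to get from the confidence interval to the prediction interval is to back out the standard error of the fitted mean and then add the residual variance. The following is a minimal Python sketch of that arithmetic; it assumes the table value 2.120 for the 0.975 quantile of the t distribution with 16 degrees of freedom, and all variable names are illustrative.

```python
import math

ci_low, ci_high = 4.50, 6.24   # 95% CI for the expected mortality (from the question)
s = 1.498                      # residual standard error (from the R output)
t_crit = 2.120                 # ASSUMED table value: t quantile 0.975 with 18 - 2 = 16 df

center = (ci_low + ci_high) / 2                 # fitted value at wine = 16
se_mean = (ci_high - ci_low) / 2 / t_crit       # standard error of the fitted mean
se_pred = math.sqrt(s**2 + se_mean**2)          # adds the variance of a new error term

pi_low = center - t_crit * se_pred
pi_high = center + t_crit * se_pred
print(round(pi_low, 2), round(pi_high, 2))
```

The prediction interval has the same center as the confidence interval but is wider, because it accounts for the variability of an individual response around its mean.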
7. Consider the following transformations of the response variable:
(i) $y \to y^{1/3}$, (ii) $y \to \sqrt{y}$, (iii) $y \to y^2$, (iv) $y \to y^4$.
Based on the residual plot, which of the above transformations could be considered for the current data set?
A second analysis of the same data set appears next. The response variable is changed to log(mortality), while the predictor remains the same. Here, the logarithm to base e is used.
Call:
lm(formula = log(mortality) ~ log(wine), data = winedata)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.55555 0.12690 20.139 8.60e-13 ***
log(wine) -0.35560 0.05291 -6.721 4.91e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2285
Multiple R-squared: 0.7384
[Figure: log(mortality) plotted against log(wine), together with the plot of residuals against fitted values for the second model.]
8. What is the interpretation of the slope coefficient in the second fitted model?
(a) If the average wine consumption rate increases by one percent, then the number of ischemic heart disease deaths is expected to decrease by about 0.36 deaths per 1,000 men aged 55 to 64.
(b) If the average wine consumption rate increases by ten percent, then the number of ischemic heart disease deaths is expected to decrease by about 0.36 deaths per 1,000 men aged 55 to 64.
(c) If the average wine consumption rate increases by ten percent, then the number of ischemic heart disease deaths is expected to decrease by 3.6 percent.
(d) If the average wine consumption rate increases by ten percent, then the number of ischemic heart disease deaths is expected to decrease by 0.36 percent.
9. Compute the value of the regression sum of squares in the second fitted model. Show the necessary steps.
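The sums of squares are not printed in the R output, but they can be recovered from the residual standard error and R-squared. The following is a minimal Python sketch of one possible sequence of steps, using only the printed values of the second fit; variable names are illustrative and the result is subject to rounding in the output.

```python
# Values read from the R output of the second model (18 countries, one predictor).
n = 18
s = 0.2285          # residual standard error
r_squared = 0.7384
t_slope = -6.721

# Step 1: SSE = s^2 * (n - 2), since s^2 = SSE / (n - 2).
sse = s**2 * (n - 2)

# Step 2: R^2 = 1 - SSE / SST, so SST = SSE / (1 - R^2).
sst = sse / (1 - r_squared)

# Step 3: SSR = SST - SSE.
ssr = sst - sse

# Cross-check: with one predictor, SSR = F * s^2 = t^2 * s^2.
ssr_check = t_slope**2 * s**2

print(round(sse, 4), round(sst, 4), round(ssr, 3), round(ssr_check, 3))
```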
Oxygen Uptake Experiment. An experiment was conducted to study O2UP (oxygen uptake in milligrams of oxygen per minute), given the following five other chemical measurements:
• BOD: biological oxygen demand,
• TKN: total Kjeldahl nitrogen,
• TS: total solids,
• TVS: total volatile solids,
• COD: chemical oxygen demand.
The data set includes measurements on 20 samples of dairy wastes in a laboratory. We first fit the
following multiple regression model to the data:
log(O2UP) = β0 + β1BOD + β2TKN + β3TS + β4TVS + β5COD + ε.
As before, the logarithm to base e is used. The R output of the fit follows.
Call:
lm(formula = log(O2UP) ~ BOD + TKN + TS + TVS + COD, data = dwaste)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.968e+00  2.110e+00  -2.355   0.0336 *
BOD         -4.298e-05  1.197e-03  -0.036   0.9719
TKN          3.016e-03  2.918e-03   1.034   0.3188
TS           2.959e-04  1.776e-04   1.666   0.1179
TVS          1.825e-02  3.233e-02   0.564   0.5814
COD          3.274e-04  1.703e-04   1.922   0.0751 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6046 on 14 degrees of freedom
Multiple R-squared: 0.8094, Adjusted R-squared: 0.7413
F-statistic: 11.89 on 5 and 14 DF, p-value: 0.0001238
10. Which of the following is a correct statement about the above fit?
(a) None of the five predictors is significant at level 0.05, and we should simply use the sample mean for predicting the value of log(O2UP).
(b) Approximately 81% of the log(O2UP) values are correctly determined by the model.
(c) The SSE of the above fit is 8.4644.
(d) None of the individual predictors is significant at level 0.05 given the other four predictors. However, the null hypothesis that all the slope coefficients are zero is rejected at level 0.05. This indicates that there is collinearity among the predictors.
11. Consider 90% two-sided confidence intervals for the slope coefficients $\beta_1, \dots, \beta_5$. Among these intervals, which cover(s) the point 0?
12. Starting with the current model with all the five predictors, consider a backward elimination procedure for variable selection. Suppose we base our decision on the P-value of testing the significance of each predictor given the others, and the cutoff P-value is set at 0.05. Which variable will be removed from the model in the first backward elimination step?
(a) BOD.
(b) COD.
(c) TS.
(d) None of the above is correct.
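The summary lines of the full five-predictor fit (residual standard error, R-squared, adjusted R-squared, F-statistic) are tied together by standard identities, which can be used to check statements about SSE or about the overall F test. The following is a minimal Python sketch using only the values printed in the R output; variable names are illustrative.

```python
# Values read from the R output of the full model (20 samples, 5 predictors).
n, k = 20, 5
s = 0.6046           # residual standard error
r_squared = 0.8094

df_resid = n - k - 1                      # 14 residual degrees of freedom

# SSE = s^2 * residual df, since s^2 = SSE / (n - k - 1).
sse = s**2 * df_resid

# Overall F statistic for H0: all slope coefficients are zero.
f_stat = (r_squared / k) / ((1 - r_squared) / df_resid)

# Adjusted R-squared penalizes the number of predictors.
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / df_resid

print(round(sse, 4), round(f_stat, 2), round(adj_r_squared, 4))
```

The recomputed F statistic and adjusted R-squared should agree with the printed 11.89 and 0.7413 up to rounding, which gives some reassurance that the SSE obtained the same way is consistent with the fit.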
In order to investigate the influences of individual observations on the above fit, we obtain the following plots.
[Figure: index plots of leverage and of Cook’s distance against observation index for the 20 observations.]
13. Based on the above plots, which of the following statements is correct?
(a) There is one observation which is both influential and of high leverage.
(b) There is one influential observation, but it is not of high leverage.
(c) There is one observation of high leverage, but it is not influential.
(d) All observations are non-influential and of low leverage.
The R output of an all-subsets selection performed on the current data set follows.
> dwaste.sub <- regsubsets(log(O2UP) ~ BOD + TKN + TS + TVS + COD, data = dwaste)
> dwaste.subsum <- summary(dwaste.sub)
> dwaste.subsum
Subset selection object
Call: regsubsets.formula(log(O2UP) ~ BOD + TKN + TS + TVS + COD, data = dwaste)
5 Variables (and intercept)
    Forced in Forced out
BOD     FALSE      FALSE
TKN     FALSE      FALSE
TS      FALSE      FALSE
TVS     FALSE      FALSE
COD     FALSE      FALSE
1 subsets of each size up to 5
Selection Algorithm: exhaustive
         BOD TKN TS  TVS COD
1  ( 1 ) " " " " "*" " " " "
2  ( 1 ) " " " " "*" " " "*"
3  ( 1 ) " " "*" "*" " " "*"
4  ( 1 ) " " "*" "*" "*" "*"
5  ( 1 ) "*" "*" "*" "*" "*"
> dwaste.subsum$cp
[1] 6.295335 1.739348 2.318918 4.001289 6.000000
14. Under the Cp criterion, which model is the best for the current data set?
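Under Mallows' Cp, smaller values are preferred, and a model with p parameters (intercept included) is usually regarded as adequate when its Cp is close to or below p. The following is a minimal Python sketch that tabulates the printed Cp values against p and picks the smallest; the predictor sets are as read from the selection table in the R output, and all variable names are illustrative.

```python
# Cp values printed by dwaste.subsum$cp, ordered by number of predictors (1 to 5).
cp = [6.295335, 1.739348, 2.318918, 4.001289, 6.000000]

# Predictor sets for each size, as read from the selection table in the R output.
subsets = [
    ["TS"],
    ["TS", "COD"],
    ["TKN", "TS", "COD"],
    ["TKN", "TS", "TVS", "COD"],
    ["BOD", "TKN", "TS", "TVS", "COD"],
]

for size, (value, names) in enumerate(zip(cp, subsets), start=1):
    p = size + 1  # number of parameters, including the intercept
    print(size, names, f"Cp={value:.3f}", f"p={p}")

# Model size with the smallest Cp.
best_size = min(range(1, len(cp) + 1), key=lambda size: cp[size - 1])
best_subset = subsets[best_size - 1]
print("smallest Cp:", best_size, best_subset)
```

The full model always has Cp equal to its number of parameters (here 6), so it serves as the reference point rather than as evidence in its own favor.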