STAT 151A Final Exam, version B
Spring 2020
Due: 2:30 PM PDT, May 11, 2020
Instructions: answer each of the questions below. You may put your answers directly onto this document (or onto the LaTeX template), or you may write out your answers on separate pages. To receive credit you must upload pictures or scans of your answers to Gradescope by the deadline above.
You are allowed to refer to course notes, textbooks, and online resources, and you will be allowed to do calculations in R. However, you may not consult any other person. The only exception is that logistical questions about the exam may be sent to the course staff (please copy Sam, Amanda, and Jake) by private bCourses message.
Please be sure to give clear reasons for your answers when asked and to show work to receive partial credit. You do not need to reduce all numerical calculations to a single number.
Question   Points
Q1             15
Q2             10
Q3              8
Q4             13
Q5             12
Q6             12
Total          70
1. For this question we consider the seatpos dataset from R. Here is a description from the R help file:
Car drivers like to adjust the seat position for their own
comfort. Car designers would find it helpful to know where different
drivers will position the seat depending on their size and
age. Researchers at the HuMoSim laboratory at the University of
Michigan collected data on 38 drivers.
We focus on a random subset of 33 drivers. The dataset contains the following variables:
Age (in years)
Weight (in lbs)
HtShoes (height in shoes in cm)
Ht (Height bare foot in cm)
Seated (Seated height in cm)
Arm (lower arm length in cm)
Thigh (Thigh length in cm)
Leg (Lower leg length in cm)
hipcenter (horizontal distance of the midpoint of the hips from a
fixed location in the car in mm)
Using the variables given in the dataset, you decide to create four new variables v1, v2, v3 and v4 via
> v1 = seatpos$HtShoes - 172.0
> v2 = seatpos$Arm - 0.2185*seatpos$HtShoes + 5.338
> v3 = seatpos$Thigh - 0.1706*seatpos$HtShoes - 0.3845*seatpos$Arm + 3.013
> v4 = seatpos$Leg - 0.2515*seatpos$HtShoes - 0.1784*seatpos$Arm + 0.0395*seatpos$Thigh + 11.09
You then fit the model

    hipcenter = β0 + β1 v1 + β2 v2 + β3 v3 + β4 v4 + e

to the data using R, which gives the following output:
> model = lm(seatpos$hipcenter ~ v1 + v2 + v3 + v4)
> summary(model)
Call:
lm(formula = seatpos$hipcenter ~ v1 + v2 + v3 + v4)
Residuals:
Min 1Q Median 3Q Max
-94.367 -23.213 -0.284 30.222 54.459
Coefficients:
             Estimate Std. Error
(Intercept) -165.748       6.708
v1            -4.272       0.597
v2             1.345       2.977
v3            -1.508       2.591
v4            -6.613        XXXX
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 38.53 on 28 degrees of freedom
Multiple R-squared: 0.6574,Adjusted R-squared: XXX
F-statistic: XXX on 4 and 28 DF, p-value: 3.131e-06
The XᵀX matrix for this linear model is given (approximately) by
> X = model.matrix(model)
> t(X) %*% X
            (Intercept)      v1     v2     v3    v4
(Intercept)          33    0.00   0.00   0.00  0.00
v1                    0 4169.25   0.00   0.00  0.00
v2                    0    0.00 167.54   0.00  0.00
v3                    0    0.00   0.00 221.26  0.00
v4                    0    0.00   0.00   0.00 66.03
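The diagonal form of XᵀX is no accident: the coefficients used to define v1–v4 are consistent with a Gram–Schmidt-style construction in which each original variable is centered and then replaced by its residual from a regression on the previously used variables. A rough illustrative sketch of such a construction (this is an assumption about how numbers like those above could arise, not part of the exam output; seatpos is in the faraway package, and the exam uses only 33 of the 38 drivers, so exact values will differ):

  # illustrative sketch only
  library(faraway)                                               # provides seatpos
  w1 <- resid(lm(HtShoes ~ 1, data = seatpos))                   # centered HtShoes
  w2 <- resid(lm(Arm ~ HtShoes, data = seatpos))                 # Arm adjusted for HtShoes
  w3 <- resid(lm(Thigh ~ HtShoes + Arm, data = seatpos))         # Thigh adjusted for both
  w4 <- resid(lm(Leg ~ HtShoes + Arm + Thigh, data = seatpos))   # Leg adjusted for all three
  round(crossprod(cbind(1, w1, w2, w3, w4)), 2)                  # diagonal, like t(X) %*% X above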
(a) Fill in the three missing values in the R output above, explaining your answers. (4 points).
(b) You decide to drop the variables v3 and v4 from the above model, which results in the following fit:
Call:
lm(formula = seatpos$hipcenter ~ v1 + v2)
Residuals:
Min 1Q Median 3Q Max
-103.943 -23.701 0.244 23.740 65.656
Coefficients:
            Estimate Std. Error
(Intercept)   XXXXXX     XXXXXX
v1            XXXXXX     XXXXXX
v2            XXXXXX     XXXXXX
—
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 38.72 on 30 degrees of freedom
Multiple R-squared: 0.6295,Adjusted R-squared: 0.6048
F-statistic: 25.48 on 2 and 30 DF, p-value: 3.407e-07
Fill in the six missing values above, giving proper reasons. (6 points)
(c) A different data scientist takes the same original data and fits the model hipcenter ∼ HtShoes + Arm. Which of the three models (yours in (a) and (b) and that of the other scientist) would you prefer to use in practice for predicting how seats should be set up for new customers, and why? What quantities might you compute next to help you decide? (5 points)
2. Consider the following two observed vectors:

    y = (10, 70, 120, 170, 240)ᵀ    x = (340, 260, 180, 100, 20)ᵀ

The fitted model lm(y ~ x) returns
Coefficients:
(Intercept)            x
      248.0         -0.7
(a) Suppose we create a new sample by bootstrapping cases, and let z_i* = (y_i*, x_i*) be the ordered pair of values for the ith observation in this bootstrap sample. Write down the exact distribution of z_1*. (4 points)
(b) Suppose now that we generate the bootstrap sample by bootstrapping residuals instead of cases. Is the distribution of z_1* the same as in part (a)? If so, explain why. If not, provide the new distribution. (6 points)
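For concreteness, the two resampling schemes might be coded roughly as follows (an illustrative sketch only, assuming y and x hold the two vectors above; the question itself asks for distributions, not code):

  fit <- lm(y ~ x)
  n <- length(y)

  # bootstrapping cases: resample (y_i, x_i) pairs with replacement
  idx <- sample(1:n, n, replace = TRUE)
  y_case <- y[idx]
  x_case <- x[idx]

  # bootstrapping residuals: keep x fixed, resample residuals, add to fitted values
  e_star  <- sample(resid(fit), n, replace = TRUE)
  y_resid <- fitted(fit) + e_star
  x_resid <- x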
3. For a subset of the bodyfat dataset used in lab, I ran a regression and obtained the following output:
Call:
lm(formula = bodyfat ~ Age + Weight + Height + Knee + Biceps +
Wrist, data = body_subset)
Residuals:
Min 1Q Median 3Q Max
-18.7488 -3.7902 0.0035 3.8497 15.0668
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 48.41126   13.72912   3.526 0.000527 ***
Age          0.22626    0.03491   6.482 7.37e-10 ***
Weight       0.22758    0.03352   6.790 1.35e-10 ***
Height      -0.47390    0.11390  -4.161 4.77e-05 ***
Knee         0.05696    0.32633   0.175 0.861613
Biceps       0.22976    0.22809   1.007 0.315043
Wrist       -3.10337    0.72066  -4.306 2.64e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.69 on 193 degrees of freedom
Multiple R-squared: 0.5447,Adjusted R-squared: 0.5305
F-statistic: 38.48 on 6 and 193 DF, p-value: < 2.2e-16
(a) Look at the diagnostic plot in Figure 3a. Based on this plot, is observation 42 an outlier? An influential point? Explain. (2 points)
(b) I decided to drop observation 42 and perform the regression again. Here is what I got:
Call:
lm(formula = bodyfat ~ Age + Weight + Height + Knee + Biceps +
Wrist, data = body_subset[-which(rownames(body_subset) == "42"), ])
Residuals:
Min 1Q Median 3Q Max
-22.5625 -3.7245 0.4098 3.4343 11.8238
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 84.63658   15.17390   5.578 8.19e-08 ***
Age          0.18399    0.03437   5.352 2.46e-07 ***
Weight       0.24192    0.03199   7.562 1.60e-12 ***
Height      -1.17937    0.18577  -6.349 1.53e-09 ***
Knee         0.42454    0.31985   1.327    0.186
Biceps       0.06257    0.21963   0.285    0.776
Wrist       -2.88053    0.68632  -4.197 4.13e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.406 on 192 degrees of freedom
Multiple R-squared: 0.5854,Adjusted R-squared: 0.5725
F-statistic: 45.19 on 6 and 192 DF, p-value: < 2.2e-16
Based on the above two regression outputs, describe a test for assessing whether observation 42 is an outlier (in the first regression) and calculate its p-value (you are allowed to write the answer for the p-value in terms of the appropriate probability or quantile for the appropriate distribution). If you find it helpful, you may use the fact that the leverage of observation 42 in the original model is approximately 0.68. (6 points)

[Figure 3a: "Residuals vs Leverage" plot for lm(bodyfat ~ Age + Weight + Height + Knee + Biceps + Wrist), showing standardized residuals against leverage (roughly 0 to 0.7) with Cook's distance contours at 0.5 and 1; observations 216, 39, and 42 are labeled.]
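For reference, the quantities shown in a plot like Figure 3a can be extracted directly in R; a minimal sketch, assuming the first fit above is stored in an object called fit1 (the name is illustrative):

  plot(fit1, which = 5)        # Residuals vs Leverage, with Cook's distance contours
  h <- hatvalues(fit1)         # leverages
  r <- rstandard(fit1)         # standardized (internally studentized) residuals
  d <- cooks.distance(fit1)    # Cook's distances
  h["42"]                      # leverage of observation 42 (approximately 0.68 here)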
4. The Online News Popularity dataset contains information on news articles from the website Mashable.com, including the number of times each article was shared. Using a subset of 30,000 news articles, we model the number of shares as a function of the 58 other explanatory variables using shrinkage methods.
The following two plots show cross-validation error (top) and coefficient values (bottom) against log- lambda values for ridge regression (at left) and lasso regression (at right).
[Figure for Question 4: four panels. Left column (ridge): cross-validation mean-squared error versus Log(λ), with the sequence 58 58 58 58 58 58 58 58 58 printed across the top, and coefficient paths versus Log Lambda. Right column (lasso): cross-validation mean-squared error versus Log(λ), with the sequence 57 57 54 53 48 35 20 6 2 printed across the top, and coefficient paths versus Log Lambda.]
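Plots of this kind are typically produced with the glmnet package; a minimal sketch, assuming the predictors are in a numeric matrix x and the share counts in a vector y (the names are illustrative, not from the exam):

  library(glmnet)
  cv_ridge <- cv.glmnet(x, y, alpha = 0)          # ridge
  cv_lasso <- cv.glmnet(x, y, alpha = 1)          # lasso
  plot(cv_ridge)                                  # CV error vs log(lambda), nonzero-coefficient counts on top
  plot(cv_lasso)
  plot(cv_ridge$glmnet.fit, xvar = "lambda")      # coefficient paths
  plot(cv_lasso$glmnet.fit, xvar = "lambda")
  predict(cv_lasso, newx = x, s = "lambda.min")   # predictions at the CV-minimizing lambda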
(a) What is the meaning of the sequences of numbers written across the top of the two cross-validation plots? Interpret and explain the differences between the two sequences. (3 points).
(b) A different statistician fits an OLS model to this same data. Based on the plots above, do you think the R2 for this model will be closer to one or closer to zero? Explain your answer. (2 points).
(c) Two of the 58 explanatory variables can be written as linear combinations of the other 56 variables. Comment on the respective implications for OLS and ridge regression. (3 points).
(d) We select the lasso model and use the value of lambda achieving minimal cross-validation error. We compare the predictions of this model with the predictions of the OLS model for three articles in the dataset:
Lasso OLS
[1,] 2004.769 1919.122
[2,] 7566.888 9114.411
[3,] 1274.315 1308.842
Estimate the bias of each lasso prediction. (3 points).
(e) Suppose a new article is observed. Do you expect the lasso or the OLS model to have lower MSE
in predicting how many times it will be shared? Why? (2 points).
5. We study a subset of the chdage dataset in R (from package aplore3) that lists age in years (AGE) and the presence or absence of evidence of significant coronary heart disease (CHD) for n = 80 subjects. The response variable is CHD, coded 0 to indicate that CHD is absent and 1 to indicate that it is present. The goal is to explore the relationship between age and the presence or absence of CHD in this study population. I fitted a logistic regression model for CHD using AGE as the explanatory variable, which gave the following output.
> mod1 = glm(chd ~ age, family = binomial, data = chdage)
> summary(mod1)
Call:
glm(formula = chd ~ age, family = binomial, data = chdage)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0683 -0.8230 -0.4234 0.7627 2.2178
Coefficients:
Estimate Std. Error z value
(Intercept) -5.18 1.22 -4.24
age 0.11 0.03 4.26
—
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 110.45 on XXX degrees of freedom
Residual deviance: XXX on 78 degrees of freedom
Number of Fisher Scoring iterations: 4
Also consider the following results in R:
> sum(chdage$chd * log(mod1$fitted.values)) + sum((1 - chdage$chd) * log(1 - mod1$fitted.values))
[1] -42.51
> mean(chdage$chd)
[1] 0.4625
(a) Fill in the missing values in the above output, explaining your answers (3 points).
(b) Fill in the three missing values, with proper reasoning, in the following statement: the results suggest that the change in the log-odds of CHD per one-year increase in AGE is XXXXXX, and the change could be as little as XXXXX and as much as XXXXX with 95 percent confidence. (3 points).
Suppose now that for better interpretation, I decide to dichotomize the AGE variable to create a new variable AGED which takes the value 1 if the age of the subject is greater than or equal to 55 and 0 otherwise. Because the two variables CHD and AGED are both categorical, we can nicely summarize the data for these two variables in the form of the following table:
> table(chdage$chd, chdage$aged)
       0  1
  No  39  4
  Yes 18 19
This means that there are 39 subjects for which CHD and AGED both equal 0, there are 4 subjects for which CHD equals 0 and AGED equals 1, and so on.
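A minimal sketch of how such a dichotomized variable might be constructed in R (an assumption about the preprocessing, not part of the exam output; it uses the lowercase column names shown in the summaries above):

  chdage$aged <- as.integer(chdage$age >= 55)   # 1 if age >= 55, else 0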
The output of fitting a logistic regression model for CHD based on the explanatory variable AGED is as follows.
> mod2 = glm(chd ~ aged, family = binomial, data = chdage)
> summary(mod2)
Call:
glm(formula = chd ~ aged, family = binomial, data = chdage)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8704 -0.8712 -0.8712 0.6181 1.5183
Coefficients:
Estimate Std. Error
(Intercept) XXXXX 0.28
aged XXXXX 0.62
—
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 110.45 on 79 degrees of freedom
Residual deviance: 92.35 on 78 degrees of freedom
AIC: 96.35
Number of Fisher Scoring iterations: 4
(c) Fill in the two missing values above with proper reasoning (3 points each = 6 points).
6. Answer TRUE or FALSE to the following statements and justify your answer.
(a) If an observation with high leverage is dropped from the model fitting, then the estimates for the
coefficients become more precise. (2 points).
(b) Mallows’s Cp for the full model always equals 1+p where p is the number of explanatory variables
in the model. (2 points).
(c) If you suspect that the linear relationship between a continuous y and a continuous x has different slope values across levels of a categorical variable z, a good solution is to fit a generalized additive model in x and z. (2 points).
(d) When comparing two models, the one with lower mean-squared cross-validation error is always better if the goal of the analysis is causal inference. (2 points).
(e) A classification tree is always better than logistic regression when the response is binary. (2 points).
(f) The log-likelihood for a saturated logistic regression model is 0. (2 points).