STAT:3200 – Applied Linear Regression, Homework 9
Assigned by Dr. Erning Li on 11/16/2018.
Due at the beginning of class on Friday, November 30, 2018.
Coverage: Multicollinearity, Model selection.
1. A simulation study of the impacts of multicollinearity:
In statistics, Monte Carlo simulation can be used to assess the performance of a method. With simulations, we know and control the truth. In this simulation, you repeatedly draw independent samples from an artificial (Y, x1, x2, x3) population with known parameters, conduct estimation and inference on model parameters and empirically check the impacts of multicollinearity.
The response Y is generated from the following model (true model): Y=β0+β1×1+β2×2+ε, whereβ0=150,β1=−10,β2=2,ε∼N(0,15).
Hence in the population, x1 and x2 are useful for predicting Y , while x3 isn’t (but a surrogate of x2). The variables x2 and x3 have high correlation coefficient 0.95, but x1 is independent from them. Each simulated sample is fit to two regression models:
• ModelI: lm(Y ∼ x1 + x2) •ModelII: lm(Y∼x1+x2+x3)
Read the R script file at http://www.stat.uiowa.edu/~ernli/ALRdata/HW9_simulation.R to see the data generating scheme and multiple linear regressions. Open R and copy/paste this R script into your R Editor.
Step a):
Step b):
Run the entire script to obtain one simulated random sample and model fits. Copy/paste
the estimate, s.e., t stat, p-value for β1 and β2 from Model I fit, and those for β1, β2 and
β3 from Model II fit into your answer sheet.
Repeatedly simulate samples from the same population by repeating the last step (Step a))
for a total of 5 times.
Use your simulation results to answer:
(1) For the fits of Model I,
i) Are β1 and β2 stable and close to their corresponding true parameter values?
ii) Do β1 and β2 have significant p-values? (2) For the fits of Model II,
i) Are β1 stable and close to its true parameter value?
ii) Do β1 have significant p-values?
iii) Are β2 stable and close to its true parameter value?
iv) Do β2 (always) have significant p-values?
v) How are the standard errors of β2 compared to those from Model I (larger or smaller)?
vi) Are β3 stable and close to its true parameter value? 1
iid 2
vii) Do β3 (always) have significant p-values?
(3) Are the two regression models (I and II) reliable for analyzing the data? Briefly explain.
2. Consider the “Bfox” data set in car library, which contains 30 time series data on Canadian women’s labor-force participation in the first three decades of the postwar period.
The researchers were expecting: women’s earnings (womwage), consumer debt (debt), and the availability of part-time work (parttime) to affect women’s labor-force participation (partic) positively; while fertility (tfr) and men’s earnings (menwage) to have negative effects.
Because all of the series, including that for the dependent variable, manifest strong linear trends over the 20-year period of the study, we want to account for a linear time effect. Thus, we create our own time variable for inclusion into the model using the row names of the data which are the years:
> attach(Bfox) > time = as.numeric(rownames(Bfox)) > Bfox.new = cbind(Bfox, time) > names(Bfox.new) [1] "partic" "tfr" "menwage" "womwage" "debt" "parttime" "time"
- (a) Fit the multiple regression model for predicting women’s labor participation (partic) from all the other variables (predictors).
Based on the signs of the regression coefficient estimates, the effect of which predictor is not consistent with what the researchers were expecting?Are there any partial regression coefficients that are insignificant? List them.
## when regressing one response variable on all the other variables as main ## effects (i.e. no interactions), you can use the following short-cut code > lm.out = lm(partic ~ ., data = Bfox.new)
- (b) Plot the scatterplot matrix and check pairwise correlations. Are some of the correlations very high?
> plot(Bfox.new, pch=16) > round(cor(Bfox.new), 4) ## round to 4 decimals
(c) 1)
Compute the Variance Inflation Factors for the predictors in this model. Do you have a concern? Briefly explain.
> vif(lm.out)
2) Manually calculate the VIF for menwage using the R2 from regressing menwage on all the other predictors.
> summary( lm(menwage ~ tfr + womwage + debt + parttime + time,
data=Bfox.new) )$r.squared
(d) For the above multiple regression model (lm.out), use AIC in a backward stepwise method
to select an “optimal” model. (No need to include the AIC iterations.) 2
Provide a list of the predictors that were removed in the order from first to last. Provide a list of predictors that remain in the final chosen model.
> model.selection = step(lm.out)
- (e) Use BIC in the model selection procedure to choose an “optimal” model. (No need to include the BIC iterations.)
Provide a list of the predictors that were removed in the order from first to last.
Is the final model preferred by BIC the same as the one chosen by AIC?> model.selection.BIC = step(lm.out, k=log(nrow(Bfox.new)))
- (f) For the final model chosen by AIC/BIC, do the following:
- 1) Fit the final model. Provide summary output. Are the partial regression coefficients significant?
- 2) Compare AIC values between the original model (lm.out) and the final model. Which model is better according to AIC?
## check the AIC value of the original model
> AIC(lm.out) - 3) Compare BIC values between the original model (lm.out) and the final model. Which model is better according to BIC?
## check the BIC of the original model
> BIC(lm.out)## alternatively
> AIC(lm.out, k=log(nrow(Bfox.new)))
- 4) Compare Adjusted R2 between the original model (lm.out) and the final model.
Which model is better according to Adjusted R2?
- 5) Conduct a partial F test to see if the final model is sufficient in comparison to the original
model (lm.out)? Hint: anova().
- 6) For this final model, is there still a concern of multicollinearity? Compute VIF’s and
explain.
- 7) For this final chosen model, plot Cook’s D against observation indices.
Plot the “bubble” plot of studentized residuals against hat values with bubble size increasing with Cook’s D. (Note: Use correct sample size and number of predictors when determining the appropriate thresholds in your R codes.)
Which observation is unusually most influential — Is it most influential due to high leverage and/or large residual?
- (g) Compare the following two models:
> model.1 = lm(partic ~ tfr + debt + parttime) > model.2 = lm(partic ~ menwage + debt + time)
Which model is preferred by AIC? Which model is preferred by BIC?
(h) 1) Regress partic on debt, parttime, and their interaction debt:parttime. Compute the VIF’s and explain if there is a concern of multicollinearity.
3
> fit1 = lm(partic ~ debt + parttime + debt:parttime)
2) Regressparticoncentereddebt,centeredparttime,andtheinteractionofthetwocentered predictors. Compute the VIF’s and explain if there is a concern of multicollinearity.
## center debt
> c.debt = debt – mean(debt)
3. In the last assignment, we considered the data “Chirot” in the car library and fit the model > myfit = lm(intensity ~ commerce + tradition + commerce:tradition
+ midpeasant + inequality, data=Chirot)
- (a) Compute the VIF’s for the predictors in this model. Do you have a concern? Briefly explain.
- (b) Use AIC to select an “optimal” model. (No need to include the AIC iterations.)
Provide a list of predictors that remain in the AIC-chosen model.
Note: When there are higher-order terms in a model, in the iterations, AIC or BIC check/report the highest-order term of each predictor first. For example, midpeasant, inequality, commerce:tradition (but not commerce or tradition). The reason is if the inter- action ‘commerce:tradition’ needs to be kept in the model, then the corresponding main effects ‘commerce’ and ‘tradition’ should automatically remain in the model to ensure model hierarchy. - (c) Use BIC to select an “optimal” model. (No need to include the BIC iterations.) Provide a list of predictors that remain in the BIC-chosen model.
Which model is smaller, the one chosen by AIC or the one chosen by BIC? - (d) Provide the Adjusted R2 of the original model,of the AIC-chosen model, and of the BIC- chosen model, respectively.
Which model is better according to Adjusted R2? - (e) For the AIC-chosen model, do the following:
- 1) Check the VIF’s.
Which predictors have high VIF’s?
Do you still have a concern? Briefly explain. - 2) Fit the model when both commerce and tradition are centered.
Compute the VIF’s and explain if there is still a concern of multicollinearity.
- 1) Check the VIF’s.
Note: Provide necessary R code(s), output(s) and/or plot(s) to support your answers to each problem.
Please write up your answers clearly with all the key and necessary steps shown; you may lose points for skipping/missing key codes, output, or steps. The homework you turn in should be neat and stapled with your name on the top of each page. You can discuss and work on homework together, but must do your own programming and write up your own answers. NO blind copying allowed. The grader may deduct point if you didn’t STAPLE your homework.
4