R Programming and Machine Learning Assignment: Multiple Linear Regression

Question 1 (Multiple Linear Regression, 3.0 points)

Consider the multiple linear regression model to regress the logarithm of “Course_eval” on all the other variables in the dataset (please do not consider the interaction terms for now).

Please answer the following questions in the answer sheet.

a) (1 point) Please use R to fit the model. What is the least squares estimate for the coefficient of “age” (rounded to four decimal places)? Please also interpret this estimated coefficient.

b) (1 point) Based on the “summary” function output of this fitted model, what are the null hypothesis and the alternative hypothesis for the “F-statistic” reported there? What conclusion can you draw from this F-test?

c) (0.5 points) Based on the “summary” function output of this fitted model, if we control for the other variables, is the mean of log(Course_eval) in the category “male” significantly different from that in the category “female”?

d) (0.5 points) Please paste the R code for all the above analyses of Question 1 in the answer sheet.
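
The workflow behind Question 1 can be sketched as follows. This is only an illustration under stated assumptions: the real dataset is not reproduced here, so the data frame `dat` is a made-up stand-in, the indicator variables are assumed to be coded 0/1, and the seven explanatory variables are assumed to be the ones named in Question 3 a).

```r
# Synthetic stand-in data (assumption): only the column names follow the
# questions; all values are simulated and carry no real meaning.
set.seed(1)
n <- 200
dat <- data.frame(
  Beauty    = rnorm(n),
  Female    = rbinom(n, 1, 0.5),
  Minority  = rbinom(n, 1, 0.3),
  NNenglish = rbinom(n, 1, 0.3),
  intro     = rbinom(n, 1, 0.3),
  onecredit = rbinom(n, 1, 0.3),
  age       = round(runif(n, 30, 70))
)
dat$Course_eval <- exp(1.3 + 0.03 * dat$Beauty - 0.001 * dat$age +
                         rnorm(n, sd = 0.1))

# Regress log(Course_eval) on the other variables, without interactions
fit <- lm(log(Course_eval) ~ Beauty + Female + Minority + NNenglish +
            intro + onecredit + age, data = dat)
s <- summary(fit)

# Least squares estimate for age, rounded to four decimal places
round(coef(fit)["age"], 4)

# Overall F statistic with its degrees of freedom, and its p-value
# (H0: all slope coefficients are zero; H1: at least one is non-zero)
fstat <- s$fstatistic
f_pvalue <- pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)

# Coefficient table: estimates, standard errors, t statistics, p-values
coef(s)
```

Because the response is on the log scale, a coefficient b for “age” corresponds to multiplying the median of Course_eval by exp(b) for each additional year, with the other variables held fixed.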


Question 2 (Model Diagnostics, 6.0 points)

Consider the multiple linear regression model in Question 1 a). Please answer the following questions in the answer sheet.

a) (0.5 points) Based on the “summary” function output of the fitted model in Question 1 a), please interpret the R-squared.

b) (1 point) Please paste the residuals versus fitted values plot of the fitted model in Question 1 a) in the answer sheet. Are the assumptions of the multiple linear regression model violated based on this plot?

c) (1 point) Please paste the Q-Q plot of the residuals based on the fitted model in Question 1 a) in the answer sheet. What conclusions can you draw from the Q-Q plot?

d) (1 point) Please paste the Cook’s distance plot of the fitted model in Question 1 a) in the answer sheet. Based on the criterion introduced in lectures, are there any influential observations? Why or why not?

e) (1 point) Please find the observation with the largest Cook’s distance. (Hint: use the “which” function in R.) Based on the “rule of thumb” cut-offs for the studentized residual, is this observation an outlier? How would you deal with this suspected influential observation?

f) (1 point) We have found the observation with the largest Cook’s distance in e). Based on the “rule of thumb” cut-off for the leverage, does this observation have distant explanatory variable values? Why or why not?

g) (0.5 points) Please paste the R code for all the above analyses of Question 2 in the answer sheet.
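
The diagnostics asked for in Question 2 can be sketched as below. The data are a made-up stand-in with only two explanatory variables, and the cut-offs in the comments (|studentized residual| > 2 and leverage > 2p/n) are common rules of thumb assumed here; the criteria introduced in lectures may differ.

```r
# Synthetic stand-in data (assumption): simulated values, reduced variable set
set.seed(2)
n <- 150
dat <- data.frame(Beauty = rnorm(n), age = round(runif(n, 30, 70)))
dat$Course_eval <- exp(1.3 + 0.04 * dat$Beauty - 0.001 * dat$age +
                         rnorm(n, sd = 0.1))
fit <- lm(log(Course_eval) ~ Beauty + age, data = dat)

# R-squared: proportion of the variation in log(Course_eval) explained
r2 <- summary(fit)$r.squared

# Standard diagnostic plots produced by plot.lm
plot(fit, which = 1)   # residuals versus fitted values
plot(fit, which = 2)   # normal Q-Q plot of the standardized residuals
plot(fit, which = 4)   # Cook's distance for each observation

# Observation with the largest Cook's distance
cd <- cooks.distance(fit)
i_max <- which(cd == max(cd))

# Studentized residual of that observation (assumed cut-off: absolute value > 2)
rstudent(fit)[i_max]

# Leverage of that observation (assumed cut-off: 2p/n, p = number of coefficients)
p <- length(coef(fit))
high_leverage <- hatvalues(fit)[i_max] > 2 * p / n
```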


Question 3 (Multiple Linear Regression for Continuous and Categorical Explanatory Variables, 3.0 points)

Consider the multiple linear regression model in Question 1 a), but we would like to add more explanatory variables. Please answer the following questions in the answer sheet.

a) (0.5 points) Consider the model and the variables in Question 1 a). Now add all the interaction terms between any two of the explanatory variables “Beauty”, “Female”, “Minority”, “NNenglish”, “intro”, “onecredit”, and “age” to obtain a new model. Compute and show the sum of squared errors (SSE) for the fitted model in this question and for the fitted model in Question 1 a). Which one is smaller?

b) (0.5 points) Clearly “Minority” is an indicator variable with two categories, “non-White” and “White”. Which category is the baseline level for the model with interactions constructed in the previous question?

c) (0.5 points) Consider the regression model with the interaction terms suggested in Question 3 a). Suppose we are interested in testing whether the regression model for the response log(Course_eval) in the category “a native English speaker” is significantly different from that in the category “not a native English speaker” when the other variables are held constant. Please use R to obtain an appropriate test statistic and the corresponding p-value. What conclusion can you draw from the result?

d) (0.5 points) Consider the model with interactions in Question 3 a). How do you interpret the estimated coefficient of the interaction term between “Female” and “intro”? Is the interaction between “Female” and “intro” significant? Why or why not?

e) (0.5 points) What is the 90% confidence interval for the coefficient of the interaction term between “Female” and “intro”? Please round your answer to four decimal places. Please also interpret the meaning of this confidence interval.

f) (0.5 points) Please paste the R code for all the above analyses of Question 3 in the answer sheet.
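
The steps of Question 3 can be sketched as follows. The data are again a made-up stand-in with a reduced variable set and 0/1 indicators (both assumptions), and the extra-sum-of-squares F-test used for the “NNenglish” comparison is one reasonable choice rather than necessarily the test intended in lectures.

```r
# Synthetic stand-in data (assumption): simulated values, five of the variables
set.seed(3)
n <- 400
dat <- data.frame(
  Beauty    = rnorm(n),
  Female    = rbinom(n, 1, 0.5),
  NNenglish = rbinom(n, 1, 0.5),
  intro     = rbinom(n, 1, 0.5),
  age       = round(runif(n, 30, 70))
)
dat$Course_eval <- exp(1.3 + 0.03 * dat$Beauty - 0.02 * dat$Female +
                         rnorm(n, sd = 0.1))

# Main-effects model and the model with all two-way interactions
fit_main <- lm(log(Course_eval) ~ Beauty + Female + NNenglish + intro + age,
               data = dat)
fit_int  <- lm(log(Course_eval) ~ (Beauty + Female + NNenglish + intro + age)^2,
               data = dat)

# Sum of squared errors (residual sum of squares) of each fitted model
sse_main <- sum(resid(fit_main)^2)
sse_int  <- sum(resid(fit_int)^2)

# Extra-sum-of-squares F-test: drop every term that involves NNenglish and
# compare the reduced model with the full interaction model
fit_red <- lm(log(Course_eval) ~ (Beauty + Female + intro + age)^2, data = dat)
f_test  <- anova(fit_red, fit_int)

# t-test and 90% confidence interval for the Female:intro interaction
coef(summary(fit_int))["Female:intro", ]
ci90 <- round(confint(fit_int, "Female:intro", level = 0.90), 4)
```

Since the main-effects model is nested in the interaction model, the SSE of the larger model can never exceed that of the smaller one.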


Question 4 (Simulation for Multiple Linear Regression, 3.0 points)

Consider the multiple linear regression model $\mu\{Y \mid X_1, X_2\} = \beta_0 + \beta_1 X_1 + \beta_2 X_2$ for the observations $\{(Y_i, X_{1,i}, X_{2,i}) : i = 1, \dots, n+1\}$. The least squares estimates $\hat{\beta}_0$, $\hat{\beta}_1$ and $\hat{\beta}_2$ of the coefficients $\beta_0$, $\beta_1$ and $\beta_2$ can be obtained from the data $\{(Y_i, X_{1,i}, X_{2,i}) : i = 1, \dots, n\}$.

Lily wants to use R to generate random samples based on the multiple linear regression model assumptions. She follows the steps below.

Step 1: Specify $\beta_0 = 2$, $\beta_1 = 1$ and $\beta_2 = -1$.

Step 2: Suppose the observations $X_{1,1}, \dots, X_{1,n+1}$ are $1, 2, \dots, 101$, so $n = 100$.

Step 3: Generate $X_{2,1}, \dots, X_{2,n+1}$ from the $t_3$ distribution, i.e. the t distribution with 3 degrees of freedom. (Hint: use the R function “rt”.)

Step 4: Generate $E_1, \dots, E_{n+1}$ from the normal distribution with mean 0 and variance 2, i.e. $N(0, 2)$.

Step 5: Generate $Y_i = \mu\{Y_i \mid X_{1,i}, X_{2,i}\} + E_i$, $i = 1, \dots, n+1$.

Step 6: Repeat Step 4 and Step 5 1,000 times to obtain 1,000 different datasets $\{(Y_i, X_{1,i}, X_{2,i}) : i = 1, \dots, n+1\}$.
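
Steps 1 to 6 can be sketched in R as below. The seed and the object names (`datasets`, `n_rep`, and so on) are assumptions made only for this illustration. Note that Step 6 repeats only Steps 4 and 5, so the explanatory variables stay fixed across the 1,000 datasets and only the errors and responses change.

```r
set.seed(2024)                       # assumed seed, only for reproducibility

beta <- c(2, 1, -1)                  # Step 1: beta0, beta1, beta2
n    <- 100
x1   <- 1:(n + 1)                    # Step 2: 1, 2, ..., 101
x2   <- rt(n + 1, df = 3)            # Step 3: t distribution with 3 df
mu   <- beta[1] + beta[2] * x1 + beta[3] * x2   # mean of Y given X1, X2

n_rep <- 1000
datasets <- lapply(seq_len(n_rep), function(r) {
  # Step 4: errors with variance 2, so the standard deviation is sqrt(2)
  e <- rnorm(n + 1, mean = 0, sd = sqrt(2))
  # Step 5: responses are the mean plus the error
  data.frame(y = mu + e, x1 = x1, x2 = x2)
})
```

A common slip is to pass the variance directly as `sd = 2`; `rnorm` expects the standard deviation.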

Part 1. (1.5 points) Lei Li is a friend of Lily. Lily hands over the above 1,000 datasets of $\{(Y_i, X_{1,i}, X_{2,i}) : i = 1, \dots, n\}$ to him, but she keeps the observation $(Y_{n+1}, X_{1,n+1}, X_{2,n+1})$ of each dataset for herself. She also does not tell him the true values of $\beta_0$, $\beta_1$ and $\beta_2$. Based on each dataset of $\{(Y_i, X_{1,i}, X_{2,i}) : i = 1, \dots, n\}$, Lei Li computes the least squares estimates $\hat{\beta}_0$, $\hat{\beta}_1$ and $\hat{\beta}_2$ as well as the 95% confidence interval for the mean of the response given $X_1 = 2.5$ and $X_2 = 0$. Ultimately, he obtains 1,000 different confidence intervals.

Then Lily computes the mean of the response $\mu\{Y \mid X_1 = 2.5, X_2 = 0\}$ and tells Lei Li this information. Lei Li counts the number of the above 1,000 confidence intervals that cover $\mu\{Y \mid X_1 = 2.5, X_2 = 0\}$.

Please answer the following questions in the answer sheet.

a) (0.5 points) Suppose you play the roles of both Lily and Lei Li and carry out the above steps in R. Please paste the complete R code for all the above procedures in the answer sheet.


b) (0.5 points) What is the number of confidence intervals that cover $\mu\{Y \mid X_1 = 2.5, X_2 = 0\}$ based on the output after running your R code? Please answer this question in the answer sheet.

c) (0.5 points) Based on the result of b), interpret the 95% confidence interval for the mean of the response. Please answer this question in the answer sheet.
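
One way to carry out Part 1 is sketched below, with the confidence interval taken from `predict()` with `interval = "confidence"`. The seed and object names are assumptions, and the resulting count depends on the seed, so no particular number is claimed here.

```r
set.seed(2024)                       # assumed seed, only for reproducibility

# Lily's side: fixed design and true mean function (Steps 1 to 3)
beta <- c(2, 1, -1)
n    <- 100
x1   <- 1:(n + 1)
x2   <- rt(n + 1, df = 3)
mu   <- beta[1] + beta[2] * x1 + beta[3] * x2

# True mean response at X1 = 2.5, X2 = 0, known only to Lily
new_pt  <- data.frame(x1 = 2.5, x2 = 0)
mu_true <- beta[1] + beta[2] * 2.5 + beta[3] * 0

covered <- replicate(1000, {
  # Steps 4 and 5: new errors and responses for this dataset
  y <- mu + rnorm(n + 1, mean = 0, sd = sqrt(2))
  # Lei Li's side: he only sees the first n observations
  d   <- data.frame(y = y[1:n], x1 = x1[1:n], x2 = x2[1:n])
  fit <- lm(y ~ x1 + x2, data = d)
  ci  <- predict(fit, newdata = new_pt, interval = "confidence", level = 0.95)
  # Does the interval cover the true mean response?
  ci[1, "lwr"] <= mu_true && mu_true <= ci[1, "upr"]
})

n_cover <- sum(covered)              # number of intervals covering the true mean
n_cover
```

If the model assumptions hold, roughly 95% of such intervals are expected to cover the true mean response, which is the repeated-sampling meaning of the confidence level.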

Part 2. (1.5 points) James is another friend of Lily. Lily hands over the above 1,000 datasets of $\{(Y_i, X_{1,i}, X_{2,i}) : i = 1, \dots, n\}$ and $(X_{1,n+1}, X_{2,n+1})$ to him, but she keeps the response $Y_{n+1}$ of each dataset for herself. She also does not tell him the true values of $\beta_0$, $\beta_1$ and $\beta_2$. Based on each dataset of $\{(Y_i, X_{1,i}, X_{2,i}) : i = 1, \dots, n\}$, James computes the least squares estimates $\hat{\beta}_0$, $\hat{\beta}_1$ and $\hat{\beta}_2$. Using those estimates and $(X_{1,n+1}, X_{2,n+1})$, he also calculates the 95% prediction interval for the response $Y_{n+1}$. Ultimately, he obtains one prediction interval for $Y_{n+1}$ per dataset, and 1,000 different prediction intervals in total.

Then Lily tells James the values of $Y_{n+1}$ for the 1,000 datasets. For each dataset, James counts “1” if the prediction interval covers the corresponding $Y_{n+1}$, and “0” otherwise. Since there are 1,000 datasets, James can count the total number of “1”s in the above procedure.

Please answer the following questions in the answer sheet.

a) (0.5 points) Suppose you play the roles of both Lily and James and carry out the above steps in R. Please paste the complete R code for all the above procedures in the answer sheet.

b) (0.5 points) What is the total number of “1”s based on the output after running your R code? Please answer this question in the answer sheet.

c) (0.5 points) Based on the result of b), interpret the 95% prediction interval for $Y_{n+1}$. Please answer this question in the answer sheet.
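
Part 2 can be sketched in the same way, this time with `interval = "prediction"` and the held-back observation as the target. The seed and object names are assumptions, and the count again depends on the seed.

```r
set.seed(2024)                       # assumed seed, only for reproducibility

# Lily's side: fixed design and true mean function (Steps 1 to 3)
beta <- c(2, 1, -1)
n    <- 100
x1   <- 1:(n + 1)
x2   <- rt(n + 1, df = 3)
mu   <- beta[1] + beta[2] * x1 + beta[3] * x2

# Explanatory values of observation n + 1, which James is given
new_pt <- data.frame(x1 = x1[n + 1], x2 = x2[n + 1])

hits <- replicate(1000, {
  # Steps 4 and 5: new errors and responses for this dataset
  y <- mu + rnorm(n + 1, mean = 0, sd = sqrt(2))
  # James's side: fit on the first n observations only
  d    <- data.frame(y = y[1:n], x1 = x1[1:n], x2 = x2[1:n])
  fit  <- lm(y ~ x1 + x2, data = d)
  pint <- predict(fit, newdata = new_pt, interval = "prediction", level = 0.95)
  # Count 1 if the interval covers the held-back response, 0 otherwise
  as.integer(pint[1, "lwr"] <= y[n + 1] && y[n + 1] <= pint[1, "upr"])
})

n_ones <- sum(hits)                  # total number of "1"s
n_ones
```

A prediction interval has to account for the error of the new observation as well as the uncertainty in the estimated mean, so it is wider than the confidence interval for the mean response at the same point.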
