BU510.650 – Data Analytics, Fall 2020 Assignment # 4
Please submit two documents: Your answers to each part of every question in .pdf or .doc format, and your R script, in .R format. In your document with answers, please do *not* respond with R output only. While it is okay to include R output in that document, please make sure you spell out the response to the question asked. Please submit your assignment through Blackboard and name your files using the convention LastName_FirstName_AssignmentNumber. For example, Yazdi_Mohammad_4.pdf and Yazdi_Mohammad_4.R.
Question 1)
In this question, you will perform model selection using the AutoLoss data set. Our goal is to predict the loss payment for a vehicle (the payments made by an insurance company to cover claims) as a function of vehicle characteristics.
To start your work on this question, read the data in AutoLoss.csv into a data frame called AutoLoss. This data set has missing values, marked as “?” in the data file, and the rows containing them must be identified and removed. To do so, first replace each “?” with NA while reading the data from the .csv file, and then remove every observation that contains any NA.
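The read-and-clean step described above can be sketched as follows. This is a minimal illustration, not the required solution: because AutoLoss.csv is not reproduced here, the sketch reads a small made-up table from a string, and the column names and values in it are assumptions chosen only for the example.

```r
# Toy stand-in for AutoLoss.csv; the column names and values are made up.
csv_text <- "Losses,BodyStyle,Horsepower
164,sedan,102
?,hatchback,111
158,sedan,?
192,wagon,110"

# na.strings = "?" turns every "?" field into NA while reading.
# With the real file, pass the file name "AutoLoss.csv" instead of text = csv_text.
toy <- read.csv(text = csv_text, na.strings = "?")

# na.omit() drops every row that contains at least one NA.
toy_clean <- na.omit(toy)

nrow(toy)        # number of rows read
nrow(toy_clean)  # number of complete rows that remain
```

Marking “?” as NA at read time also lets R treat the affected columns as numeric rather than as text, which matters for the regression steps that follow.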
a) Using the best subset selection method and allowing up to 15 predictors, use regsubsets() to determine the best model with k predictors for k = 1, 2, …, 15. Use the output to answer the following question: Which predictors are included in the best model with 11 predictors?
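As a rough sketch of how regsubsets() output can be read, the example below runs best subset selection on R's built-in mtcars data rather than on AutoLoss. It assumes the leaps package is installed; the response (mpg), the cap of 5 predictors, and the choice k = 3 are arbitrary illustration values, not part of the assignment.

```r
library(leaps)  # provides regsubsets()

# Best subset selection (the default method, "exhaustive") with up to 5 predictors.
# nvmax sets the largest model size that is examined.
fit <- regsubsets(mpg ~ ., data = mtcars, nvmax = 5)
fit_summary <- summary(fit)

# Row k of the logical matrix `which` marks the terms in the best k-predictor model.
k <- 3
in_model <- fit_summary$which[k, ]
names(in_model)[in_model]   # "(Intercept)" plus the k selected predictors

# Coefficients of that k-predictor model.
coef(fit, k)
```

Printing summary(fit) directly shows the same information as a table of asterisks, one row per model size.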
Note: I will add extra information to help you understand one subtlety about the output. Please pay attention to the names of predictors displayed in the output – for example, the output will not show BodyStyle, which is one of the columns in your data, as a predictor. Instead, you will notice predictor names BodyStylehardtop, BodyStylehatchback, BodyStylesedan, BodyStylewagon. This is because BodyStyle is a qualitative variable, which can take the values “hardtop,” “hatchback,” “sedan,” “wagon,” and “convertible.” R is replacing the qualitative variable “BodyStyle” with four columns (BodyStylehardtop, BodyStylehatchback, BodyStylesedan, BodyStylewagon), which are 0-1 columns. For example, BodyStylehardtop would be 1 if the car’s BodyStyle is hardtop and 0 otherwise. BodyStylehatchback would be 1 if the car’s BodyStyle is hatchback, and 0 otherwise. The fifth value, “convertible,” gets no column of its own: it serves as the baseline and corresponds to all four columns being 0. This is similar to what we did in the Toyota Used Car example in class (replacing the Fuel Type column with CNGFuel and DieselFuel columns, which were 0-1 columns), but R is automating it for you here when you run regsubsets().
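The 0-1 expansion described in the note can be seen directly with base R's model.matrix(), which builds the same kind of design matrix that a formula interface uses. The mini data frame below is hypothetical and contains one car of each body style purely for illustration.

```r
# Hypothetical mini data frame with one car of each body style.
toy_cars <- data.frame(
  BodyStyle = factor(c("convertible", "hardtop", "hatchback", "sedan", "wagon"))
)

# model.matrix() expands the factor into an intercept column plus
# one 0-1 column for each level other than the baseline level.
X <- model.matrix(~ BodyStyle, data = toy_cars)

colnames(X)  # intercept and the four BodyStyle... columns
X            # the convertible row has 0 in all four BodyStyle columns
```

By default R takes the first factor level in alphabetical order as the baseline, which is why “convertible” is the level without its own column here.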
b) Focusing on the best model with 8 predictors, answer the following True / False questions:
i Whether or not a car has four doors is a predictor in this model.
ii Whether or not a car’s body style is hard top is a predictor in this model.
iii Whether or not a car’s drive wheels is forward is a predictor in this model.
c) Repeat part (a), but this time using the forward stepwise selection method.
d) Using the forward stepwise selection method and allowing up to 15 predictors, what is the best model according to the Cp criterion? State the predictors in the best model and their coefficients. Comment on predictors: What types of cars tend to have higher losses? What types of cars tend to have lower losses?
Hint: In part (c), you used regsubsets() with forward selection to determine the best model with k predictors for k = 1, 2, …, 15. Now, you need to compare the Cp of the best model with 1 predictor versus the Cp of the best model with 2 predictors versus … the Cp of the best model with 15 predictors, to determine the number of predictors that minimizes Cp. For guidance, check how we accomplished the same goal in Task 6 of the Hitters – Subset Selection example.
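One possible way to carry out the Cp comparison described in the hint is sketched below on R's built-in mtcars data. It assumes the leaps package is installed; the response (mpg) and the cap of 8 predictors are illustration values, and the approach may differ in detail from the class example.

```r
library(leaps)  # provides regsubsets()

# Forward stepwise selection with up to 8 predictors on a built-in data set.
fwd <- regsubsets(mpg ~ ., data = mtcars, nvmax = 8, method = "forward")
fwd_summary <- summary(fwd)

# One Cp value per model size; the preferred size has the smallest Cp.
fwd_summary$cp
best_k <- which.min(fwd_summary$cp)
best_k

# Predictors and coefficients of the Cp-minimizing model.
coef(fwd, best_k)
```

Plotting fwd_summary$cp against the model size is a quick visual check that the minimum found by which.min() is where the curve bottoms out.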
Question 2)
Suppose we have a data set, which has p predictors (input variables), and we perform model selection using (i) best subset selection (BSS), (ii) forward stepwise selection (FwSS), and (iii) backward stepwise selection (BwSS). Specifically, using each of these three approaches, we determine the best model with k predictors for all possible values of k, that is, k = 1, 2, …, p.
Answer the following True or False questions:
I. The predictors in the k-predictor model identified by FwSS are a subset of predictors in the (k+1)-predictor model identified by FwSS.
II. The predictors in the k-predictor model identified by BwSS are a subset of predictors in the (k+1)-predictor model identified by BwSS.
III. The predictors in the k-predictor model identified by BSS are a subset of predictors in the (k+1)-predictor model identified by BSS.
IV. The predictors in the k-predictor model identified by BwSS are a subset of predictors in the (k+1)-predictor model identified by FwSS.
V. The predictors in the k-predictor model identified by FwSS are a subset of predictors in the (k+1)-predictor model identified by BwSS.