BU510.650 – Data Analytics, Spring 2020, Homework #3
1. In this question, you will work with the Bikeshare data set, adapted from a data set of bike rentals from DC’s Capital Bikeshare system (see https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset for details). An explanation of the variables in the data set Bikeshare.csv is included in the Appendix to this assignment.
To start your work on this question, read the data in Bikeshare.csv into a data frame called Bikeshare. Then, create a new column in your data frame called “Weekend,” which is 1 if the day is a Saturday or Sunday, and 0 otherwise. (R Hint: In R, the “or” operator is the symbol |. For example, (x == 5) | (x == 6) will return TRUE if x is 5 or 6.)
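The hint above can be illustrated with a minimal sketch. It uses a small made-up data frame called toy instead of the real file, and it assumes the Weekday coding given in the Appendix (0 for Sunday through 6 for Saturday); the name toy and its values are illustrative only.

```r
# Toy stand-in for the real data (hypothetical values). The real file would be
# read with: Bikeshare <- read.csv("Bikeshare.csv")
toy <- data.frame(Weekday = 0:6)

# The comparison with | gives TRUE/FALSE; as.integer() turns that into 1/0.
# Assumes 0 = Sunday and 6 = Saturday, as described in the Appendix.
toy$Weekend <- as.integer((toy$Weekday == 0) | (toy$Weekday == 6))

print(toy)
```

The same expression should carry over to the Bikeshare data frame once it has been read in.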
(a) Run a multiple linear regression with “Rentals” as the output variable and “Temperature,” “Humidity,” “Windspeed,” and “Weekend” as input variables. Comment on the output: Which input variables have a statistically significant effect on the number of rentals?
(b) Run a multiple linear regression with “Registered” as the output variable and “Temperature,” “Humidity,” “Windspeed,” and “Weekend” as input variables. Comment on the output: Which input variables have a statistically significant effect on the number of rentals by registered users?
(c) Run a multiple linear regression with “Casual” as the output variable and “Temperature,” “Humidity,” “Windspeed,” and “Weekend” as input variables. Comment on the output: Which input variables have a statistically significant effect on the number of rentals by casual users?
(d) Compare and contrast your results from the previous three parts to answer the following question: How does the weekend affect rental patterns?
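Parts (a) to (c) all follow the same pattern: fit the model with lm() and read the p-values from summary(). The sketch below shows that pattern on simulated data, not on Bikeshare.csv; the data frame toy, its values, and the 0.05 cutoff are assumptions made for illustration.

```r
# Simulated stand-in data (hypothetical); column names mirror the assignment.
set.seed(1)
n <- 200
toy <- data.frame(
  Temperature = runif(n),
  Humidity    = runif(n),
  Windspeed   = runif(n),
  Weekend     = rbinom(n, 1, 2 / 7)
)
# In this simulation only Temperature truly drives the response.
toy$Rentals <- 1000 + 5000 * toy$Temperature + rnorm(n, sd = 500)

fit <- lm(Rentals ~ Temperature + Humidity + Windspeed + Weekend, data = toy)

# The coefficient table holds estimates, standard errors, t values and p-values.
coefs <- summary(fit)$coefficients
pvals <- coefs[, "Pr(>|t|)"]

# Terms whose p-value falls below the (assumed) 5% significance level.
names(pvals)[pvals < 0.05]
```

Swapping the response for Registered or Casual changes only the left-hand side of the formula.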
2. Suppose we have a data set, which has p predictors (input variables), and we perform model selection using (i) best subset selection (BSS), (ii) forward stepwise selection (FwSS), and (iii) backward stepwise selection (BwSS). Specifically, using each of these three approaches, we determine the best model with k predictors for all possible values of k, that is, k = 1, 2, …, p.
(a) For a given k, suppose we are comparing the models obtained by these three methods, that is, the model with k predictors obtained by BSS, FwSS, BwSS. Which of the three models will have the smallest RSS on the data we used to perform model selection? Explain your answer.
(b) Answer the following True or False questions:
(i) The predictors in the k-predictor model identified by FwSS are a subset of predictors in the (k+1)-predictor model identified by FwSS.
(ii) The predictors in the k-predictor model identified by BwSS are a subset of predictors in the (k+1)-predictor model identified by BwSS.
(iii) The predictors in the k-predictor model identified by BSS are a subset of predictors in the (k+1)-predictor model identified by BSS.
(iv) The predictors in the k-predictor model identified by BwSS are a subset of predictors in the (k+1)-predictor model identified by FwSS.
(v) The predictors in the k-predictor model identified by FwSS are a subset of predictors in the (k+1)-predictor model identified by BwSS.
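For readers who want to see the mechanics of one of these procedures, the following is a minimal base-R sketch of forward stepwise selection by training RSS. It runs on simulated data, and the names x1 to x4, rss, selected, and path are hypothetical helpers introduced here; it is an illustration of the greedy procedure, not the implementation used by any particular package.

```r
# Simulated data (hypothetical): four candidate predictors, one response.
set.seed(42)
n <- 100
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
y <- 2 * X$x1 - 1.5 * X$x3 + rnorm(n)

# Residual sum of squares of the least-squares fit of y on the given columns.
rss <- function(vars) {
  fit <- lm(y ~ ., data = cbind(y = y, X[, vars, drop = FALSE]))
  sum(resid(fit)^2)
}

selected <- character(0)  # predictors chosen so far
path <- list()            # path[[k]] holds the k-predictor model

for (k in seq_along(X)) {
  candidates <- setdiff(names(X), selected)
  # Try adding each remaining predictor and keep the one with the lowest RSS.
  rss_if_added <- sapply(candidates, function(v) rss(c(selected, v)))
  selected <- c(selected, candidates[which.min(rss_if_added)])
  path[[k]] <- selected
}

path
```

Backward stepwise selection would run the same loop in reverse, starting from all p predictors and dropping one at a time, while best subset selection would compare every combination of k predictors.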
3. In this question, you will perform model selection using the AutoLoss data set, adapted from a data set of loss payments made by insurance companies (see https://archive.ics.uci.edu/ml/datasets/Automobile for details). An explanation of the variables in this data set is included in the Appendix to this assignment. Our goal is to predict the loss payment for a vehicle (the payments made by an insurance company to cover claims) as a function of vehicle characteristics.
To start your work on this question, read the data in AutoLoss.csv into a data frame called AutoLoss. This data set has missing values, marked as “?” in the data file, so we need to identify the rows that contain a “?” and remove them. To do so, run the following two lines of code: the first replaces each “?” with NA while reading the data from the .csv file, and the second removes every observation that contains an NA.
AutoLoss <- read.csv("AutoLoss.csv", na.strings = "?")
AutoLoss <- na.omit(AutoLoss)
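To see what these two lines do, here is a small self-contained sketch that reads a made-up four-row table from a string instead of from AutoLoss.csv. The values and the names csv_text, toy, and toy_clean are hypothetical.

```r
# A tiny made-up CSV with "?" marking missing values (hypothetical data).
csv_text <- "Losses,Horsepower
164,102
?,115
158,?
192,110"

# na.strings = "?" turns every "?" into NA, so both columns stay numeric.
toy <- read.csv(text = csv_text, na.strings = "?")

# na.omit() drops every row that contains at least one NA.
toy_clean <- na.omit(toy)

nrow(toy)        # 4 rows before cleaning
nrow(toy_clean)  # 2 rows remain
```

Without na.strings = "?", a column containing “?” would be read as text rather than numbers, which is why the replacement happens at read time.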
(a) Using the best subset selection method and allowing up to 15 predictors, use regsubsets() to determine the best model with k predictors for k = 1, 2, ..., 15. Use the output to answer the following question: Which predictors are included in the best model with 10 predictors?
Note: I will add extra information to help you understand one subtlety about the output. Please pay attention to the names of predictors displayed in the output. For example, the output will not show BodyStyle, which is one of the columns in your data, as a predictor. Instead, you will notice predictor names BodyStylehardtop, BodyStylehatchback, BodyStylesedan, BodyStylewagon. This is because BodyStyle is a qualitative variable, which can take the values “hardtop,” “hatchback,” “sedan,” “wagon,” and “convertible.” R replaces the qualitative variable “BodyStyle” with four columns (BodyStylehardtop, BodyStylehatchback, BodyStylesedan, BodyStylewagon), which are 0-1 columns. For example, BodyStylehardtop would be 1 if the car’s BodyStyle is hardtop and 0 otherwise. BodyStylehatchback would be 1 if the car’s BodyStyle is hatchback, and 0 otherwise. A convertible is the case in which all four columns are 0.
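The 0-1 columns described in this note can be inspected directly with model.matrix(). The sketch below uses a five-row made-up data frame with one car of each body style; the name toy is hypothetical, and the coding shown relies on R's default treatment contrasts.

```r
# One made-up row per body style (hypothetical data).
toy <- data.frame(
  BodyStyle = factor(c("convertible", "hardtop", "hatchback", "sedan", "wagon"))
)

# model.matrix() builds the design matrix R uses internally for a formula.
# With default contrasts, the first level (alphabetically) is the baseline.
X <- model.matrix(~ BodyStyle, data = toy)

colnames(X)
X
```

Each row of X has at most one dummy column equal to 1, and the convertible row has zeros in all four dummy columns.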
(b) Allowing up to 15 predictors, what is the best model according to the Cp criterion? State the predictors in the best model and their coefficients. Comment on the predictors: What types of cars tend to have higher losses? What types of cars tend to have lower losses?
Hint: In part (a), you used regsubsets() to determine the best model with k predictors for k = 1, 2, ..., 15. Now compare the Cp of the best model with 1 predictor, the Cp of the best model with 2 predictors, and so on up to the Cp of the best model with 15 predictors. (For guidance, check how we made similar comparisons for the Hitters data set in class.) The model with the lowest Cp is the best model.
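The comparison described in the hint can be sketched as follows. The example assumes the leaps package is installed and uses simulated data with six predictors (so nvmax = 6 here, not 15); the names sim, fits, fit_summary, and best_k are hypothetical.

```r
library(leaps)  # provides regsubsets(); assumed to be installed

# Simulated data (hypothetical): predictors X1..X6 and a response y.
set.seed(7)
n <- 150
sim <- data.frame(matrix(rnorm(n * 6), ncol = 6))
sim$y <- 3 * sim$X1 - 2 * sim$X4 + rnorm(n)

# Best subset selection, keeping the best model of each size up to nvmax.
fits <- regsubsets(y ~ ., data = sim, nvmax = 6)
fit_summary <- summary(fits)

# One Cp value per model size; the smallest one marks the preferred size.
fit_summary$cp
best_k <- which.min(fit_summary$cp)

# Coefficients of the best model with best_k predictors.
coef(fits, best_k)
```

On the AutoLoss data, the same steps would use Losses as the response and nvmax = 15.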
(c) Using the forward stepwise selection method, what would be the best model with 5 predictors? (State the predictors included in the model and their coefficients.)
(d) Using the backward stepwise selection method, what would be the best model with 5 predictors? (State the predictors included in the model and their coefficients.)
(e) How do the models in (c) and (d) compare to the model obtained by the best subset selection method?
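Parts (c) to (e) rely on the method argument of regsubsets(). The sketch below, again on simulated data and assuming the leaps package is installed, shows how the three searches can be run side by side; sim, fwd, bwd, full, and the choice k = 3 are hypothetical.

```r
library(leaps)  # assumed to be installed

# Simulated data (hypothetical): predictors X1..X6 and a response y.
set.seed(7)
n <- 150
sim <- data.frame(matrix(rnorm(n * 6), ncol = 6))
sim$y <- 3 * sim$X1 - 2 * sim$X4 + rnorm(n)

# Same call, three search strategies.
fwd  <- regsubsets(y ~ ., data = sim, nvmax = 6, method = "forward")
bwd  <- regsubsets(y ~ ., data = sim, nvmax = 6, method = "backward")
full <- regsubsets(y ~ ., data = sim, nvmax = 6)  # default: exhaustive search

# Predictors (plus intercept) in the k-predictor model from each method.
k <- 3
names(coef(fwd, k))
names(coef(bwd, k))
names(coef(full, k))
```

Comparing the three name vectors shows whether the methods agree for a given k; coef() without names() also gives the fitted coefficients.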
APPENDIX: Description of the Bikeshare data set
• Temperature – normalized temperature in Celsius, derived according to: (temperature on that day - t_min)/(t_max -t_min), where t_min=-8, t_max=+39 (minimum and maximum temperatures encountered during the time period the data was collected).
• Humidity – normalized humidity, derived according to: Humidity (measured on a scale of 0 to 100) on that day / 100.
• Windspeed – normalized windspeed in km/h, derived according to: Windspeed on that day / wind_max, where wind_max = 67, the fastest wind encountered during the time period the data was collected.
• Rentals – number of bikes rented on that day.
• Weekday – goes from 0 to 6, with 0 indicating that the day was Sunday, 1 indicating that the day was Monday, etc.
• Registered – number of bikes rented by registered users on that day.
• Casual – number of bikes rented by casual users on that day.
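The normalizations above can be inverted to recover the original units, which may help when interpreting regression coefficients. A minimal sketch, using the constants stated in this Appendix; the function names to_celsius, to_kmh, and to_percent are hypothetical.

```r
# Constants taken from the variable descriptions above.
t_min <- -8
t_max <- 39
wind_max <- 67

# Invert (temperature - t_min) / (t_max - t_min).
to_celsius <- function(temp_norm) temp_norm * (t_max - t_min) + t_min

# Invert windspeed / wind_max and humidity / 100.
to_kmh <- function(wind_norm) wind_norm * wind_max
to_percent <- function(hum_norm) hum_norm * 100

to_celsius(c(0, 0.5, 1))  # -8.0, 15.5, 39.0 degrees Celsius
to_kmh(1)                 # 67 km/h
```

One consequence is that a one-unit change in normalized Temperature corresponds to the full observed range of 47 degrees Celsius.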
APPENDIX: Description of the AutoLoss data set
Each row represents one particular type of vehicle. The columns in the data set are as follows:
• Losses: The losses covered by the insurance company for this vehicle
• Fuel type: whether the vehicle has gas or diesel engine
• Aspiration: shows if the vehicle is standard or turbo
• NumDoors: whether the vehicle has two or four doors
• BodyStyle: whether the vehicle is a convertible, hardtop, hatchback, sedan, or station wagon (coded as “wagon” in the data)
• DriveWheels: whether the vehicle is front-wheel drive, rear-wheel drive, or four-wheel drive
• Length: length of the vehicle
• Width: width of the vehicle
• Height: height of the vehicle
• Weight: weight of the vehicle
• EngineSize: Engine size of the vehicle
• Horsepower: Horsepower of the vehicle
• PeakRPM: The peak RPM the vehicle can reach
• Citympg: The mpg of the vehicle in city driving
• Price: The price of the vehicle