JTERM 2018 FINAL EXAM COMPUTER PORTION Solution
This question uses the automobile data set that you used for your project!
- There is a new column called Accident Rating which is a dichotomous variable that describes whether the car is “Safe” or “Dangerous”. We would like to investigate if there is a relationship between the body.style of a car and its accident rating. We would like to do this after accounting for the horsepower of the car. Therefore, fit a logistic regression model with accident rating as the response and horsepower and body style as explanatory variables. Specifically, we would like to investigate if there is a significant difference in the odds of being a dangerous car between a sedan and any other body style. In your answer provide your R code and parameter estimate table. Be sure to interpret the appropriate parameter estimates and provide confidence intervals.
Holding horsepower fixed, the odds of being a Dangerous car for a Hatchback are estimated to be 3.12 times those of a Sedan. A 95% confidence interval for this odds ratio is (1.28, 7.99).
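The fit behind an answer like this can be sketched as follows. This is only an illustration under stated assumptions: the data frame `autos` is simulated, and the column name `accident.rating` and the simulated coefficients are placeholders rather than the course data set. Sedan is made the reference level so that each body-style coefficient compares that style to a sedan.

```r
# Simulated stand-in for the automobile data (an assumption, not the course data).
set.seed(1)
n <- 200
styles <- c("sedan", "hatchback", "wagon", "hardtop", "convertible")
autos <- data.frame(
  horsepower = round(runif(n, 50, 250)),
  # "sedan" is the first level, so it is the reference category
  body.style = factor(rep(styles, length.out = n), levels = styles)
)
eta <- -7.5 + 0.05 * autos$horsepower + 1.1 * (autos$body.style == "hatchback")
autos$accident.rating <- factor(
  ifelse(runif(n) < plogis(eta), "Dangerous", "Safe"),
  levels = c("Safe", "Dangerous")  # glm() models the second level, "Dangerous"
)

# Logistic regression: accident rating on horsepower and body style
fit <- glm(accident.rating ~ horsepower + body.style,
           data = autos, family = binomial)

# Parameter estimate table on the log-odds scale
print(summary(fit)$coefficients)

# Odds ratios with 95% Wald intervals (exponentiated coefficients)
or_table <- cbind(OR = exp(coef(fit)), exp(confint.default(fit)))
print(round(or_table, 3))
```

The row `body.stylehatchback` of `or_table` is the hatchback-versus-sedan odds ratio. `confint.default()` gives Wald intervals; `confint(fit)` would give profile-likelihood intervals instead, which can differ slightly.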
- Given the model you fit above, interpret the horsepower parameter estimate. Be sure and provide confidence intervals with your answer.
Holding body style fixed, the odds of being a Dangerous car are estimated to be 1.05 times as large for a car with one additional horsepower. A 95% confidence interval for this odds ratio is (1.03, 1.07).
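Because the model is linear on the log-odds scale, the odds ratio for a k-unit difference in horsepower is the one-unit odds ratio raised to the power k, and the interval endpoints scale the same way. A small sketch using the rounded values reported above (the choice k = 10 is only for illustration, and rounding error carries through):

```r
or_per_hp <- 1.05            # reported odds ratio per 1 horsepower
ci_per_hp <- c(1.03, 1.07)   # reported 95% interval
k <- 10                      # illustrative difference in horsepower

or_per_k <- or_per_hp^k      # exp(k * beta) = exp(beta)^k
ci_per_k <- ci_per_hp^k      # endpoints transform the same way

print(round(c(estimate = or_per_k, lower = ci_per_k[1], upper = ci_per_k[2]), 2))
```

With these rounded inputs, a 10-horsepower difference corresponds to an odds ratio of about 1.63.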
- We would like to get an idea of how well your model is performing. Perform the necessary steps to construct a confusion table for the model fit above. Provide all relevant RCode as well as the table. Also include an estimate of the correct classification rate and the mis-classification rate.
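The mechanics of a confusion table can be sketched as below. The probabilities and ratings are made-up toy values, not output from the course model; with a fitted model, `p_hat` would come from something like `predict(fit, type = "response")`.

```r
# Toy fitted probabilities of "Dangerous" and observed ratings (hypothetical values)
p_hat  <- c(0.91, 0.12, 0.67, 0.35, 0.80, 0.05, 0.58, 0.44, 0.73, 0.22)
actual <- factor(c("Dangerous", "Safe", "Dangerous", "Dangerous", "Dangerous",
                   "Safe", "Safe", "Safe", "Dangerous", "Safe"),
                 levels = c("Safe", "Dangerous"))

# Classify with a .5 threshold
predicted <- factor(ifelse(p_hat > 0.5, "Dangerous", "Safe"),
                    levels = c("Safe", "Dangerous"))

# Confusion table: rows are predictions, columns are observed ratings
conf_tab <- table(Predicted = predicted, Actual = actual)
print(conf_tab)

# Correct classification rate and misclassification rate
ccr <- sum(diag(conf_tab)) / sum(conf_tab)
mcr <- 1 - ccr
print(c(CCR = ccr, MCR = mcr))
```

Fixing the factor levels on both vectors keeps the table square even if one class is never predicted, so `diag()` always picks out the correct classifications.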
- Notice that there are a couple of cars with missing accident ratings. Use your model to estimate the probability that each car is a Dangerous Car. Report the probability for each car and use it to impute the accident rating of each vehicle. (Update the data set with these values.) Use a .5 threshold for determining between “Safe” and “Dangerous”.
The estimated probability that the first car with a missing Accident Rating (index 160) is Dangerous is 100%, so at the .5 threshold its rating is imputed as “Dangerous”. The estimated probability for the second car (index 166) is 43.4%, so its rating is imputed as “Safe”.
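The thresholding and update step can be sketched with the two probabilities reported above. The two-row data frame and the column name `accident.rating` are stand-ins for the real data set.

```r
# Estimated probabilities of "Dangerous" for the two cars, named by row index
p_missing <- c("160" = 1.00, "166" = 0.434)

# Apply the .5 threshold
imputed <- ifelse(p_missing > 0.5, "Dangerous", "Safe")
print(imputed)

# Toy stand-in for the two rows with missing ratings (column name is an assumption)
autos_missing <- data.frame(
  accident.rating = factor(c(NA, NA), levels = c("Safe", "Dangerous")),
  row.names = names(p_missing)
)

# Fill in only the missing entries
is_missing <- is.na(autos_missing$accident.rating)
autos_missing$accident.rating[is_missing] <- imputed
print(autos_missing)
```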
- Use your model to predict the safety rating of LeBron James’ car. It is a convertible with 450 horsepower. Is this extrapolation? Show all R code and be sure to report the estimated probability of his car being dangerous as well as the predicted rating (“Safe” or “Dangerous”). Use a .5 threshold for determining between “Safe” and “Dangerous”.
The estimated probability of LeBron’s car being rated as “Dangerous” is almost 100%, so at the .5 threshold the predicted rating is “Dangerous”.
- For this question, we would like to quantify our uncertainty about our prediction in the last question. Find a prediction interval for the probability of LeBron James’ car being Dangerous. Can we be 95% confident that his car is “Safe” or “Dangerous”? Why?
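One common way to get an interval for a predicted probability is to build a Wald interval on the log-odds scale and back-transform it, which keeps the endpoints between 0 and 1. The sketch below does this on simulated data. The data frame, the column name `accident.rating`, and the coefficients are assumptions, so the numbers it prints are not the exam's answer.

```r
# Simulated stand-in for the automobile data (an assumption, not the course data).
set.seed(1)
n <- 200
styles <- c("sedan", "hatchback", "wagon", "hardtop", "convertible")
autos <- data.frame(
  horsepower = round(runif(n, 50, 250)),
  body.style = factor(rep(styles, length.out = n), levels = styles)
)
eta <- -7.5 + 0.05 * autos$horsepower + 1.1 * (autos$body.style == "hatchback")
autos$accident.rating <- factor(
  ifelse(runif(n) < plogis(eta), "Dangerous", "Safe"),
  levels = c("Safe", "Dangerous")
)
fit <- glm(accident.rating ~ horsepower + body.style,
           data = autos, family = binomial)

# The new car: a convertible with 450 horsepower
new_car <- data.frame(horsepower = 450,
                      body.style = factor("convertible", levels = styles))

# Extrapolation check: is 450 outside the observed horsepower range?
extrapolating <- new_car$horsepower > max(autos$horsepower)

# Predict on the link (log-odds) scale with a standard error
pred <- predict(fit, newdata = new_car, type = "link", se.fit = TRUE)
z <- qnorm(0.975)

p_hat    <- plogis(pred$fit)                               # point estimate
interval <- plogis(pred$fit + c(-1, 1) * z * pred$se.fit)  # 95% interval
rating   <- ifelse(p_hat > 0.5, "Dangerous", "Safe")

print(list(extrapolating = extrapolating, p_hat = p_hat,
           interval = interval, rating = rating))
```

Under this approach, the classification is supported at the 95% level only when the whole interval lies on one side of .5. An interval computed this far beyond the observed horsepower range also relies on the fitted curve holding there.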
- Now we would like to use our full dataset, with the imputed values for accident rating, to estimate the price category of a car. Another new variable has been created for the price. This variable is called PriceCat (and PriceI) and represents the price category a car is in (Very Inexpensive, Inexpensive, Moderate, Expensive, Very Expensive, and Extremely Expensive). Fit a model that predicts the price category of the car (not the actual numerical price) using a cumulative logit / ordinal logistic regression model. Your model should use horsepower and body.style as predictors. Display the R code and parameter estimate table for this model and use this model to predict the probability that LeBron James’ car falls in the Very Expensive category.
P(Very Expensive) = .00000000934
The estimated probability is very high that LeBron James’ car is in the Extremely Expensive category.
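To see where a single-category probability comes from, recall that a cumulative logit model gives P(Y ≤ j) for each cutpoint, and each category probability is the difference of adjacent cumulative probabilities. The sketch below uses the parameterization of `MASS::polr`, logit P(Y ≤ j) = ζ_j − η. The cutpoints, slope, and body-style effect are hypothetical values chosen for illustration, not the fitted estimates.

```r
price_levels <- c("Very Inexpensive", "Inexpensive", "Moderate",
                  "Expensive", "Very Expensive", "Extremely Expensive")

# Hypothetical parameter values (assumptions, not the fitted model)
zeta      <- c(5, 7, 9, 11, 14)  # five cutpoints for six categories
beta_hp   <- 0.07                # slope for horsepower
beta_conv <- 0.5                 # effect of convertible vs. the reference style

# Linear predictor for a 450-horsepower convertible
eta <- beta_hp * 450 + beta_conv

# Cumulative probabilities P(Y <= j) for the five lower categories
cum_p <- plogis(zeta - eta)

# Category probabilities are differences of adjacent cumulative probabilities
probs <- diff(c(0, cum_p, 1))
names(probs) <- price_levels
print(signif(probs, 3))
```

With a fitted `polr` object, `predict(fit, newdata, type = "probs")` returns these category probabilities directly.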
- Use your model above to interpret the horsepower parameter estimate. Be sure and include a confidence interval.
Holding body style fixed, the odds of a car falling at or below any given price level are estimated to be .93 times as large for a car with one additional horsepower. A 95% confidence interval for this odds ratio is (.917, .943).
BONUS:
- Perform a cross validation for the ordinal regression model. Make your training set 2/3 of the original data set and your test set 1/3. Find the correct classification rate (CCR). Show all R code and confusion matrix.
- Fit another model of your choice that will beat the above model in terms of cross validated correct classification rate (based on the same training/test split of the data from the first bonus question.) Identify your model, show all Rcode and provide the CCR and the confusion matrix.
THAT’S IT! We have reached the end of Jan Term 2018 STAT 3300!
I can’t tell you how much fun I have had and how much respect I have for the amount of effort you have put in and frankly how much you have learned. I feel very confident (as I hope you do as well) that each of you has a solid understanding of the fundamentals of regression. You now have the tools to make decisions based on data and that is a very marketable skill. From here, there are so many more methods and models that can be studied; I can vouch that YOU have the talent, work ethic and now the background to take this as far as you wish.
Thank you for a great “semester”! 🙂
Bivin