STAT2450 – Final Exam

Name: *** , Student ID: B00*** , Email: ***

Reminder:

  • Submit your exam via Brightspace before the deadline: 3:00 pm on Sunday, April 8th.
  • ABSOLUTELY NO COLLABORATION ON THE EXAM! YOU MUST DO THE EXAM COMPLETELY ON YOUR OWN!

  • Show the outputs and results clearly.
  • Provide the executable R code.
  • Explain your results in detail using your own words.
  • Comment your code for major steps.

    Part 1 – Regression

    Use the Carseats data set in the ISLR package (column Sales is the response; the other columns are predictors) to answer the following questions.

    [1 point] (a) Briefly describe what this data set is about. (hint: ?Carseats)

    Before your analysis, set a random seed using the last 3 digits of your student ID.
    [1 point] (b) Split the data set into a training data set (80%) and a test data set (20%).
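As an illustration of the mechanics only, here is a minimal sketch of an 80/20 split in base R. It assumes the built-in mtcars data as a stand-in so that it runs without extra packages, and the seed 123 is a placeholder for the last 3 digits of a student ID.

```r
# Stand-in data (assumption): built-in mtcars instead of ISLR::Carseats.
dat <- mtcars

# Placeholder seed (assumption); the exam asks for the last 3 digits of your ID.
set.seed(123)

# Draw 80% of the row indices at random, without replacement.
n         <- nrow(dat)
train_idx <- sample(n, size = floor(0.8 * n))

train <- dat[train_idx, ]   # training set (80% of the rows)
test  <- dat[-train_idx, ]  # test set (the remaining rows)

# Row counts of the two pieces.
c(train = nrow(train), test = nrow(test))
```

Setting the seed before sample() is what makes the split reproducible from run to run.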

    [5 points] (c) Fit a multiple linear regression model on the training data using all predictors. Use the summary() function to print the results, and calculate the test error. Which predictors appear to have a statistically significant relationship with the response? What does the coefficient for Advertising suggest?
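The fit-then-evaluate pattern can be sketched as follows. This is an illustration on the built-in mtcars data (an assumption, with mpg standing in for Sales), and it takes mean squared error as the test error, which the question does not state explicitly.

```r
# Stand-in data and placeholder seed (both assumptions).
dat <- mtcars
set.seed(123)
train_idx <- sample(nrow(dat), size = floor(0.8 * nrow(dat)))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# "mpg ~ ." regresses the response on every other column of the data frame.
fit <- lm(mpg ~ ., data = train)

# Coefficient table: estimates, standard errors, t-statistics and p-values.
print(summary(fit))

# Test error, taken here as mean squared error on the held-out rows.
pred     <- predict(fit, newdata = test)
test_mse <- mean((test$mpg - pred)^2)
test_mse
```

Each slope in the summary table is the estimated change in the response for a one-unit increase in that predictor, holding the other predictors fixed.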

    [2 points] (d) Use the bootstrap to estimate the interquartile range (IQR = Q3 – Q1) of the Advertising column and provide the 95% CI (use 1000 bootstrap replicates). (hint: quantile() function. Q1: 25th percentile; Q3: 75th percentile)
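A minimal base-R sketch of bootstrapping the IQR is given below. It assumes mtcars$hp as a stand-in for the Advertising column and uses a percentile interval for the 95% CI. The question does not say which kind of interval is expected, so treat that choice as an assumption; iqr_stat is a hypothetical helper name.

```r
# Stand-in column (assumption): horsepower from the built-in mtcars data.
x <- mtcars$hp
set.seed(123)  # placeholder seed

# Statistic of interest: IQR = Q3 - Q1, computed with quantile().
iqr_stat <- function(v) {
  q <- quantile(v, probs = c(0.25, 0.75), names = FALSE)
  q[2] - q[1]
}

# Resample the column with replacement B times and recompute the IQR each time.
B        <- 1000
boot_iqr <- replicate(B, iqr_stat(sample(x, size = length(x), replace = TRUE)))

point_est <- iqr_stat(x)     # IQR of the observed data
se_boot   <- sd(boot_iqr)    # bootstrap standard error

# Percentile 95% interval: 2.5th and 97.5th percentiles of the replicates.
ci_95 <- quantile(boot_iqr, probs = c(0.025, 0.975))

point_est
se_boot
ci_95
```

An alternative under the same resampling scheme is the normal-approximation interval, point_est ± 1.96 × se_boot.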

    [5 points] (e) Choose one of the following methods (best subset, forward/backward stepwise selection, the Lasso) to perform model selection.

    Apply the method you chose to the training data and use 10-fold cross-validation to find the optimal parameter. Provide outputs and plots showing which predictors are selected in your best model. What are their coefficients? Use your best model to make predictions on the test data and calculate the test error. (hint: The parameter is the number of predictors in the model for the subset approaches, and the tuning parameter (λ) for the Lasso.)
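To illustrate how 10-fold cross-validation can pick the number of predictors, here is a hand-rolled forward stepwise sketch in base R. It is not the only way to do this (dedicated packages exist for subset selection and for the Lasso). The mtcars stand-in data, the helper name forward_order, and the cap of 5 predictors are all assumptions made to keep the example small.

```r
# Stand-in for the training data (assumption): built-in mtcars, response mpg.
dat      <- mtcars
response <- "mpg"
set.seed(123)  # placeholder seed

# Hypothetical helper: forward stepwise selection. At each step it adds the
# predictor that gives the smallest residual sum of squares, up to max_size.
forward_order <- function(d, response, max_size) {
  remaining <- setdiff(names(d), response)
  chosen    <- character(0)
  while (length(chosen) < max_size && length(remaining) > 0) {
    rss <- sapply(remaining, function(v) {
      f <- reformulate(c(chosen, v), response = response)
      sum(resid(lm(f, data = d))^2)
    })
    best      <- names(which.min(rss))
    chosen    <- c(chosen, best)
    remaining <- setdiff(remaining, best)
  }
  chosen
}

k        <- 10
max_size <- 5  # arbitrary cap on model size (assumption)
folds    <- sample(rep(1:k, length.out = nrow(dat)))  # random fold labels

# cv_err[j, s]: held-out MSE in fold j for the model with s predictors.
cv_err <- matrix(NA_real_, nrow = k, ncol = max_size)
for (j in 1:k) {
  tr  <- dat[folds != j, ]
  ho  <- dat[folds == j, ]
  ord <- forward_order(tr, response, max_size)  # selection redone in each fold
  for (s in 1:max_size) {
    f    <- reformulate(ord[1:s], response = response)
    pred <- predict(lm(f, data = tr), newdata = ho)
    cv_err[j, s] <- mean((ho[[response]] - pred)^2)
  }
}

mean_cv   <- colMeans(cv_err)
best_size <- which.min(mean_cv)
plot(1:max_size, mean_cv, type = "b",
     xlab = "Number of predictors", ylab = "10-fold CV mean squared error")

# Refit on all training rows with the chosen size and report the coefficients.
final_vars <- forward_order(dat, response, best_size)
final_fit  <- lm(reformulate(final_vars, response = response), data = dat)
coef(final_fit)
```

The selection step is repeated inside every fold so that the held-out rows play no part in choosing the variables; selecting once on all rows and cross-validating afterwards would understate the error.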


[5 points] (f) Fit a decision tree to the training data. Use 10-fold cross-validation to find the best tree size. Create a plot with tree size on the x-axis and deviance on the y-axis. Use the tree of the best size to make predictions on the test data and calculate the test error. Plot this tree with labels. Are the predictors chosen in (e) also used as splits in the tree?
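The grow, cross-validate, prune and predict workflow can be sketched with the rpart package. This is a swapped-in tool, not necessarily the one the course uses: rpart reports a cross-validated relative error per tree size rather than a deviance. The built-in iris data, with Sepal.Length as a numeric response, is a stand-in.

```r
library(rpart)

# Stand-in data (assumption): iris, with Sepal.Length as the numeric response.
dat <- iris
set.seed(123)  # placeholder seed
train_idx <- sample(nrow(dat), size = floor(0.8 * nrow(dat)))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Grow a fairly large regression tree; xval = 10 requests 10-fold CV.
fit <- rpart(Sepal.Length ~ ., data = train, method = "anova",
             control = rpart.control(xval = 10, cp = 0.001))

# cptable has one row per candidate subtree; "xerror" is the CV error.
cp_tab <- fit$cptable
sizes  <- cp_tab[, "nsplit"] + 1  # number of terminal nodes
plot(sizes, cp_tab[, "xerror"], type = "b",
     xlab = "Tree size (terminal nodes)",
     ylab = "Cross-validated relative error")

# Prune back to the subtree with the smallest cross-validated error.
best_row <- which.min(cp_tab[, "xerror"])
pruned   <- prune(fit, cp = cp_tab[best_row, "CP"])

# Test error (mean squared error) of the pruned tree.
pred     <- predict(pruned, newdata = test)
test_mse <- mean((test$Sepal.Length - pred)^2)
test_mse

# Draw the pruned tree with split labels (skipped if only the root is left).
if (nrow(pruned$frame) > 1) {
  plot(pruned, margin = 0.1)
  text(pruned, use.n = TRUE)
}
```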

[6 points] (g) Compare the methods you used for this regression problem. Based on all the results you have above, write a short conclusion about your data.

Part 2 – Classification

Use the frogs data set in the DAAG package (column pres.abs is the response; the other columns are predictors) to answer the following questions. If you don’t have the DAAG package in R, you need to install it first.

[1 point] (a) Describe the data set briefly. (hint: ?frogs)

[1 point] (b) Set the random seed first. Split the data set into a training data set (80%) and a test data set (20%).

[4 points] (c) Fit a logistic regression model on the training data using all predictors. Use the summary() function to print the results. Make predictions on the test data and calculate the test error. What does the coefficient for distance suggest?
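A minimal sketch of the logistic regression workflow in base R follows. It assumes the built-in infert data as a stand-in (binary column case as the response), a 0.5 probability cut-off, and misclassification rate as the test error; none of these choices comes from the question itself.

```r
# Stand-in data (assumption): built-in infert, binary response "case".
dat <- infert[, c("case", "age", "parity", "induced", "spontaneous")]
set.seed(123)  # placeholder seed
train_idx <- sample(nrow(dat), size = floor(0.8 * nrow(dat)))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# family = binomial gives logistic regression; "case ~ ." uses all predictors.
fit <- glm(case ~ ., data = train, family = binomial)
print(summary(fit))

# Predicted probabilities on the test set, then classes via a 0.5 cut-off.
prob <- predict(fit, newdata = test, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)

# Test error as the misclassification rate, plus the confusion table.
test_error <- mean(pred != test$case)
test_error
table(predicted = pred, actual = test$case)

# Coefficients are on the log-odds scale; exponentiating gives odds ratios.
exp(coef(fit))
```

A coefficient b means that a one-unit increase in that predictor multiplies the odds of the response being 1 by exp(b), with the other predictors held fixed.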

[4 points] (d) Choose one of the Bayes classifiers (Naive Bayes, LDA, QDA) to redo this question. Fit the model on the training data, make predictions on the test data and calculate the test error.
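As one possible illustration, LDA can be fitted with lda() from the MASS package. The sketch below again assumes the built-in infert data as a stand-in and reports the misclassification rate.

```r
library(MASS)

# Stand-in data (assumption): built-in infert, with "case" made a factor.
dat <- infert[, c("case", "age", "parity", "induced", "spontaneous")]
dat$case <- factor(dat$case)
set.seed(123)  # placeholder seed
train_idx <- sample(nrow(dat), size = floor(0.8 * nrow(dat)))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Fit linear discriminant analysis on the training rows.
fit <- lda(case ~ ., data = train)
print(fit)

# predict() returns a list; $class holds the predicted labels.
pred <- predict(fit, newdata = test)$class

# Misclassification rate on the test set and the confusion table.
test_error <- mean(as.character(pred) != as.character(test$case))
test_error
table(predicted = pred, actual = test$case)
```

Replacing lda() with qda() from the same package gives QDA with the same predict() interface.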

[3 points] (e) Briefly describe the differences between the procedures of Bagging, Random Forest and Boosting (you only need to introduce the general idea of these methods). What are the parameters of each of these three methods?
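To make the general idea concrete, here is a hand-rolled bagging sketch: fit one tree per bootstrap sample, then take a majority vote. It assumes rpart for the individual trees, the built-in infert data as a stand-in, and B = 100 trees to keep it quick.

```r
library(rpart)

# Stand-in data (assumption): built-in infert, with "case" as a factor.
dat <- infert[, c("case", "age", "parity", "induced", "spontaneous")]
dat$case <- factor(dat$case)
set.seed(123)  # placeholder seed
train_idx <- sample(nrow(dat), size = floor(0.8 * nrow(dat)))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

B <- 100  # number of bootstrapped trees (assumption, kept small for speed)

# Each column of "votes" holds one tree's predicted labels for the test rows.
votes <- sapply(seq_len(B), function(b) {
  boot   <- train[sample(nrow(train), replace = TRUE), ]  # bootstrap sample
  tree_b <- rpart(case ~ ., data = boot, method = "class")
  as.character(predict(tree_b, newdata = test, type = "class"))
})

# Aggregate: the most frequent label across the B trees for each test row.
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))
test_error  <- mean(bagged_pred != as.character(test$case))
test_error
```

A random forest follows the same scheme but considers only a random subset of the predictors at each split, which adds the subset size as a parameter. Boosting instead grows trees sequentially, each fitted to what the current ensemble still gets wrong, and is governed by the number of trees, the shrinkage rate and the tree depth.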

[4 points] (f) Choose Random Forest OR Boosting for this question. Apply the method you chose to the training data. Make predictions on the test data and calculate the test error. (hint: Build 1000 trees for Random Forest OR Boosting. For the other parameters, use the default settings. Convert the response to a factor first (as.factor()); otherwise regression trees will be built.)


[2 points] (g) Create a plot of the variable importance measures (or relative influence for boosting). Which predictors do you think are more important?
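A hedged sketch of the Random Forest route, covering the fit from (f) and the importance plot from (g), is shown below. It assumes the randomForest package, which is not part of base R, so the code first checks that it is installed. The built-in infert data is again a stand-in.

```r
# Stand-in data (assumption): built-in infert; response converted to a factor
# so that classification trees are grown rather than regression trees.
dat <- infert[, c("case", "age", "parity", "induced", "spontaneous")]
dat$case <- as.factor(dat$case)
set.seed(123)  # placeholder seed
train_idx <- sample(nrow(dat), size = floor(0.8 * nrow(dat)))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# randomForest is an add-on package; run the example only if it is available.
rf_available <- requireNamespace("randomForest", quietly = TRUE)

if (rf_available) {
  # 1000 trees, other settings left at their defaults; importance = TRUE
  # additionally stores the permutation-based importance measures.
  fit <- randomForest::randomForest(case ~ ., data = train,
                                    ntree = 1000, importance = TRUE)

  # Predicted classes on the test set and the misclassification rate.
  pred       <- predict(fit, newdata = test)
  test_error <- mean(as.character(pred) != as.character(test$case))
  print(test_error)

  # Importance table and the corresponding dot chart.
  print(randomForest::importance(fit))
  randomForest::varImpPlot(fit)
}
```

Predictors that appear higher in the importance plot contribute more to the forest's accuracy or to the decrease in node impurity.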

[6 points] (h) Compare these three different methods you used for this classification problem. Based on all the results you have above, write a short conclusion about your data.
