
 Assignment 3

Your name (netid)

 

In this assignment, you are asked to predict the sales of a store based on the following information. Note that the store is closed on a few days. The sales on closed days are imputed by average sales.

Variable      Description
sales         Daily sales on the log scale (dependent variable)
closed        Whether the store is closed
oil_price     Oil price (the country relies heavily on oil exports)
promotions    The number of items under promotion in the store
comp_prom     The average number of items under promotion in competing stores
is_national   Whether the day is a national holiday
is_local      Whether the day is a local holiday

 

This dataset has 1,172 observations, of which the sales for the last 200 days are missing. The goal of this assignment is to predict the sales on these 200 days. For Q1-Q4, please hold out the last 180 days with available sales for validation. The evaluation metric for this task is RMSE. However, to avoid overfitting, you may also want to look at SBC.
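As a rough illustration of the split and the two metrics described above, the sketch below computes the index ranges and the fit statistics by hand. It assumes SBC in the form n ln(SSE/n) + k ln(n), where k is the number of estimated parameters; the function names are hypothetical, and the Statistics of Fit reported by SAS remain the reference for your answers.

```python
import math

def split_indices(n_total=1172, n_missing=200, n_holdout=180):
    """Return (fit, holdout, test) index ranges for the daily series."""
    n_known = n_total - n_missing      # days with observed sales
    n_fit = n_known - n_holdout        # days used to estimate the model
    return range(0, n_fit), range(n_fit, n_known), range(n_known, n_total)

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def sbc(actual, predicted, n_params):
    """Schwarz Bayesian criterion, assumed form: n*ln(SSE/n) + k*ln(n)."""
    n = len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return n * math.log(sse / n) + n_params * math.log(n)
```

With the default arguments, the three ranges cover 792 fitting days, 180 holdout days, and the 200 days to be predicted. For the same fit, SBC grows with the number of parameters, which is why it can flag overfitting that RMSE alone would not.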

Please provide relevant screenshots while answering the questions below. Missing important screenshots can have a negative impact on your grade.

 

  1. Data Exploration (1.5 points)

Import (SAS Menu File → Import Data) train.csv and save it as a SAS dataset.

SAS allows you to explore the series in multiple ways. Sometimes, they may give different indications of whether the series has trend or seasonality. Now examine the following three plots. (1.5 points)

  a) Does the Series plot exhibit seasonality? You may want to zoom in to see better.
  b) Do the Autocorrelation plots reveal potential seasonality? You may want to zoom out to see more lags.
  c) Does the Seasonal Root Test provide statistical evidence for seasonality?

Explain your answer to each question.
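To make the idea behind the Autocorrelation plot concrete, here is a minimal sketch of the sample autocorrelation at a single lag. The function name is hypothetical, and the weekly toy series is made up for illustration; it is not the assignment data.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given non-negative lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    num = sum((series[t] - mean) * (series[t - lag] - mean)
              for t in range(lag, n))
    return num / denom

# Made-up series with a weekly pattern: five low days, two high days.
weekly = [1, 1, 1, 1, 1, 5, 5] * 20
```

For a series like `weekly`, the autocorrelation is large and positive at lags 7, 14, 21, and so on, and lower in between. Recurring spikes at multiples of one lag are the pattern to look for when you zoom out.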

 

 

 

 

  2. Linear Regression Models (5 points)

2.1 Fit a model in which all explanatory variables are used as ordinary regressors. Examine the Parameter Estimates: which variables have significant effects on sales at the 5% significance level? (1 point)

 

 

2.2 Apply any type of transformation (e.g., log) that makes sense to you on at least two explanatory variables. You may want to check the distribution of each variable first. (2 points)

  a) Explain why you want to apply certain transformations to certain regressors. You are encouraged to search online for this. If you do, please provide the links you find helpful.
  b) Does the transformation improve the RMSE of the model on the holdout sample?
  c) Does the transformation make any insignificant variable significant?

Hint: to transform a variable, you need to add it as a dynamic regressor, even though you may not specify any transfer function for it.
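One way to check whether a log transformation is worth trying is to compare the skewness of a regressor before and after transforming it. The sketch below is only an illustration: the values are made up, the helper name is hypothetical, and `log1p` (log of 1 + x) is used on the assumption that a count variable such as promotions can be zero.

```python
import math

def skewness(values):
    """Population skewness: third central moment over variance^1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Made-up, right-skewed counts (not taken from train.csv).
promotions = [0, 1, 2, 2, 3, 5, 8, 40, 120]
logged = [math.log1p(v) for v in promotions]  # log(1 + x) handles zeros
```

If the skewness drops markedly after the transformation, a few extreme days have less leverage on the regression coefficient, which is one common rationale for transforming a regressor.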

 

 

2.3 Based on your best model above, try applying some transfer functions on two regressors. (2 points)

  a) Explain in a few sentences why you chose a certain type of transfer function for each regressor.
  b) Do the transfer functions improve RMSE as planned?

It is fine if the transfer functions do not improve model performance. Please try at least four different models in this step.
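To build intuition for what a transfer function does to a regressor, the sketch below applies a simple first-order form, y_t = omega * x_(t-delay) + delta * y_(t-1), to an input series. This is an assumed textbook form with hypothetical parameter names, not a description of how SAS parameterizes its dynamic regressors.

```python
def first_order_transfer(x, omega, delta, delay=0):
    """Filter x through y[t] = omega * x[t - delay] + delta * y[t - 1].

    omega scales the immediate effect, delta (between 0 and 1) controls
    how slowly the effect dies out, and delay shifts the effect in time.
    """
    out = []
    prev = 0.0
    for t in range(len(x)):
        x_lagged = x[t - delay] if t >= delay else 0.0
        prev = omega * x_lagged + delta * prev
        out.append(prev)
    return out
```

For example, a one-day promotion `[1, 0, 0, 0]` with `omega=2.0` and `delta=0.5` gives an effect of 2.0 on the day itself that halves each day afterwards. A regressor whose effect plausibly lingers or arrives late is the kind of candidate where such a shape may help.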

 

 

 

2.4 What is your best model in terms of RMSE so far? Duplicate the model and change its name to “BEST2”. (0 point, just for your own record)

 

  3. ARIMA Model + Explanatory Variables (5 points)

3.1 Duplicate your best model in Q2.4 and do the following, one at a time. (2.5 points)

  i) Add Seasonal Dummies into the model
  ii) Add both Linear Trend and Seasonal Dummies into your model
  iii) Apply Seasonal Difference on sales
  iv) Apply First and Seasonal Differences on sales

Name the models properly, so that they can be easily understood. For example

Please answer the following two questions.

  a) Provide a screenshot of the “Statistics of Fit” for each model. Which model produces the smallest RMSE on the holdout sample?
  b) Summarize at least three findings from the comparison of the four models. Explain these findings or discuss what you learned from them. (Grading will be based on how insightful your findings and discussions are.)
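The four variants in Q3.1 rest on two operations: building seasonal dummies and differencing the series. The sketch below shows both in plain Python. The period of 7 is an assumption (daily data with a weekly cycle), and the function names are hypothetical.

```python
def seasonal_dummies(n_obs, period=7):
    """One row per observation with period - 1 indicator columns.

    The last season in each cycle is the baseline (all zeros), which
    avoids perfect collinearity with the intercept.
    """
    return [[1 if t % period == s else 0 for s in range(period - 1)]
            for t in range(n_obs)]

def difference(series, lag=1):
    """Return series[t] - series[t - lag]; the result is lag items shorter."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# Made-up series: linear trend plus a bump every seventh day.
toy = [t + (5 if t % 7 == 0 else 0) for t in range(21)]
seasonal_diff = difference(toy, lag=7)               # removes the weekly bump
first_and_seasonal = difference(seasonal_diff, lag=1)  # also removes the trend
```

On `toy`, the seasonal difference is a constant 7 (the trend accumulated over one week), and applying a first difference on top of it leaves all zeros. Comparing this with what dummies and a linear trend would capture may help when explaining your findings.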

 

3.2 Examine the diagnostic plots on the residuals of your best model in Q3.1. Now fit a Seasonal ARIMA model ARIMA(p,d,q)(P,1,Q)s based on your understanding of the data. Please also add the regressors that you deem helpful. You only need to consider p<=2, d<=1, q<=2, P+Q<=1. Different people may arrive at different models.

This is an iterative process: fit models → examine problems with residuals → fit new models. Explain your thought process and provide the necessary screenshots (grading will be partially based on the clarity of your explanation). Try at least 3 to 4 models. (2.5 points)
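To see how large the search space allowed by the constraints in Q3.2 is, the following sketch enumerates every admissible (p,d,q)(P,1,Q) combination. The function name is hypothetical; the bounds are the ones stated in the question, with the seasonal difference fixed at 1.

```python
from itertools import product

def candidate_orders(max_p=2, max_d=1, max_q=2, max_seasonal_terms=1):
    """List ((p, d, q), (P, 1, Q)) tuples with P + Q <= max_seasonal_terms."""
    orders = []
    ranges = (range(max_p + 1), range(max_d + 1), range(max_q + 1),
              range(max_seasonal_terms + 1), range(max_seasonal_terms + 1))
    for p, d, q, big_p, big_q in product(*ranges):
        if big_p + big_q <= max_seasonal_terms:
            orders.append(((p, d, q), (big_p, 1, big_q)))
    return orders
```

The constraints allow 54 combinations in total, so residual diagnostics are a far more economical guide than trying them all; the list is mainly useful for keeping track of which candidates you have already ruled out.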

 

3.3 What is your best model in terms of RMSE so far? Duplicate the model and change its name to “BEST3”. (0 point, just for your own record)

 

  4. Modeling Christmas and Events (4.5 points)

The sales increase substantially around Christmas. However, modeling the effect of Christmas on sales can be more complicated than it appears.

4.1 There are two potential ways to deal with it: a) treating Christmas as an event; b) modeling Christmas with a regressor (i.e., a dummy variable for Christmas). In principle, which approach is more helpful for prediction? Explain. There is no need to fit any model for this question. (1 point)

 

4.2 Regardless of your answer in Q4.1, now model the effect of Christmas using a dummy variable. The dummy variable can be defined based on different rationales. For example, you may code the dummy variable to be one on Dec 25 and zero otherwise, because Dec 25 is the official date for Christmas. Alternatively, you may code the dummy to be one on Dec 23, as the sales peak on that day. You may also code the dummy to be one on some other day. (2 points)

  a) In the CSV file, generate three dummy variables for Christmas based on three different days: Dec 23, Dec 25, and one day of your own choice (explain why you chose that day). Name them Christmas23, Christmas25, and so forth. Import this new CSV file as a SAS dataset.
  b) Add each of the three dummies as an ordinary regressor separately into the best model in Q3.3 and see which dummy leads to the smallest RMSE. Examine the effects of the three dummies and explain why a particular one performs the best.
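If you prefer to generate the dummies with a script rather than by hand, a minimal sketch follows. It assumes you can build a list of calendar dates matching the rows of the CSV file; the start date used here is made up, and the helper name is hypothetical.

```python
from datetime import date, timedelta

def christmas_dummy(dates, day_of_december):
    """Return 1 on the given day of December in any year, else 0."""
    return [1 if (d.month == 12 and d.day == day_of_december) else 0
            for d in dates]

# Made-up ten-day window, Dec 20 to Dec 29 (not the real data range).
dates = [date(2015, 12, 20) + timedelta(days=i) for i in range(10)]
christmas23 = christmas_dummy(dates, 23)
christmas25 = christmas_dummy(dates, 25)
```

Because the check ignores the year, the same function marks every Christmas in a multi-year series, including any that fall in the 200 days to be predicted, which matters if the dummy is to help the forecast.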

 

 

4.3 Examine the “Prediction Errors for Sales” plot (second button on the right) of your best model so far and identify events on at least two dates. Try to model at least two events with appropriate transfer functions. Explain your rationale behind these transfer functions. Does accounting for events improve your model performance? (1.5 points)
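When deciding how to model an event, it can help to picture the basic shapes an event variable may take. The sketch below builds three common ones (a one-day point, a permanent step, and a ramp); the function names are hypothetical, and which shape fits a given date is for you to judge from the prediction errors.

```python
def point_event(n_obs, at):
    """1 only at index `at`: a one-off shock such as a single unusual day."""
    return [1 if t == at else 0 for t in range(n_obs)]

def step_event(n_obs, at):
    """1 from index `at` onward: a permanent level shift."""
    return [1 if t >= at else 0 for t in range(n_obs)]

def ramp_event(n_obs, at):
    """0 up to index `at`, then rising by 1 per period: a change in slope."""
    return [max(0, t - at) for t in range(n_obs)]
```

A single large error that vanishes the next day suggests a point event, whereas errors that stay on one side of zero after some date suggest a step. A spike that fades gradually can be approximated by a point event combined with a decaying transfer function.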

 

 

 

 

  5. Further tune your models for performance points (4 points)

You may fit any models using any tools for this question. We recommend saving the models for this question as a separate project in the same catalog. You may submit predictions from five models. Please save the predictions as SAS datasets and name them pred1, pred2, and so forth.

  a) If you use tools other than SAS to generate your final predictions, please simply replace the missing values in train.csv with your predicted values and then submit the CSV file.
  b) You may want to re-train your best model on the whole sample.
  c) To make sure your best model is robust, you may vary the holdout sample size, refit the model, and then see whether it remains the best model.
  d) It is NOT recommended to select five models that are highly similar to each other, as they may suffer from the same issue.
  e) TSFS allows you to combine multiple models for prediction.
  f) You may impute sales on closed days differently, for example with the average sales over the past week.
  g) Make sure your predictions have 1,172 observations.
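As one possible reading of the imputation tip in f), the sketch below replaces the sales on each closed day with the average over the open days among the preceding seven. The function name and the window length are assumptions, and the numbers in the example are made up.

```python
def impute_closed_days(sales, closed, window=7):
    """Replace sales on closed days with the mean of recent open days.

    Only open days within the previous `window` observations are averaged.
    If none exist, the original value is left unchanged.
    """
    result = list(sales)
    for t, is_closed in enumerate(closed):
        if is_closed:
            recent = [result[i] for i in range(max(0, t - window), t)
                      if not closed[i]]
            if recent:
                result[t] = sum(recent) / len(recent)
    return result

# Made-up example: the fourth day is closed.
imputed = impute_closed_days([10.0, 12.0, 14.0, 99.0, 16.0], [0, 0, 0, 1, 0])
```

Here the closed day receives 12.0, the mean of the three preceding open days. Since sales are on the log scale, averaging them corresponds to a geometric mean of the raw sales, which is worth keeping in mind when you compare imputation schemes.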

Your performance points are proportional to the ranking of your best RMSE on the test set, namely the last 200 observations (not the holdout sample).

 

Please submit the following for your assignment:

1) The saved catalog file.

2) Predictions from 5 different models of your own choice.

3) This document.