
Week-9 Regression Models

Some of the slides are adapted from the lecture notes provided by Prof. Antoine Saure and Prof. Rob Hyndman

Business Forecasting Analytics
ADM 4307 – Fall 2021

Regression Models

Ahmet Kandakoglu, PhD

08 November, 2021

Outline

• The Linear Model with Time Series

• Simple Linear Regression

• Multiple Linear Regression

• Least Squares Estimation

• Evaluating the Regression Model

• Some Useful Predictors


Simple Linear Regression

• The basic concept is that we forecast a variable 𝑦 by assuming it has a linear relationship with another variable 𝑥:

𝑦𝑡 = 𝛽0 + 𝛽1𝑥𝑡 + 𝜀𝑡

• The model is called simple regression as we only allow one predictor variable 𝑥. The forecast
variable 𝑦 is sometimes also called the dependent or explained variable. The predictor
variable 𝑥 is sometimes also called the independent or explanatory variable.

• The parameters 𝛽0 and 𝛽1 determine the intercept and the slope of the line respectively. The
intercept 𝛽0 represents the predicted value of 𝑦 when 𝑥 = 0. The slope 𝛽1 represents the
average predicted change in 𝑦 resulting from a one unit increase in 𝑥.
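To make the interpretation of 𝛽0 and 𝛽1 concrete, here is a minimal sketch (simulated data, not from the slides) that generates observations from a known line and recovers the intercept and slope with lm():

set.seed(1)
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100, sd = 0.2)  # true intercept 2, true slope 0.5
coef(lm(y ~ x))  # estimates should be close to 2 and 0.5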


Simple Linear Regression

Notice that the observations do not lie on the straight line but are scattered around it. We can think of each observation 𝑦𝑡 as consisting of the systematic or explained part of the model, 𝛽0 + 𝛽1𝑥𝑡, and the random error, 𝜀𝑡.


Example: US Consumption Expenditure

us_change %>% pivot_longer(c(Consumption, Income), names_to = "Series") %>%
  autoplot(value) + labs(y = "% change")


Example: US Consumption Expenditure

us_change %>% ggplot(aes(x = Income, y = Consumption)) +
  labs(y = "Consumption (quarterly % change)", x = "Income (quarterly % change)") +
  geom_point() + geom_smooth(method = "lm", se = FALSE)


Example: US Consumption Expenditure

The equation is estimated in R using the TSLM() function:

fit <- us_change %>% model(lm = TSLM(Consumption ~ Income))

report(fit)


𝐶𝑜𝑛𝑠𝑢𝑚𝑝𝑡𝑖𝑜𝑛 = 0.54 + 0.27 𝐼𝑛𝑐𝑜𝑚𝑒

Series: Consumption

Model: TSLM

Residuals:

Min 1Q Median 3Q Max

-2.58236 -0.27777 0.01862 0.32330 1.42229

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept)  0.54454    0.05403  10.079  < 2e-16 ***
Income       0.27183    0.04673   5.817  2.4e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5905 on 196 degrees of freedom
Multiple R-squared: 0.1472, Adjusted R-squared: 0.1429
F-statistic: 33.84 on 1 and 196 DF, p-value: 2.4022e-08

Multiple Linear Regression

• In multiple regression there is one variable to be forecast and several predictor variables.

• For example, a credit score used to decide whether or not a customer will be given a loan could be predicted from other variables. This is an example of cross-sectional data, where we want to predict the value of the credit score variable using the values of other variables.

• Alternatively, suppose we want to forecast the value of future beer production, but there are no other variables available as predictors. With time series data, we could instead use the number of quarters since the start of the series, or the quarter of the year corresponding to each observation, as a predictor variable.

Multiple Linear Regression

• The general form of a multiple regression is

𝑦𝑡 = 𝛽0 + 𝛽1𝑥1,𝑡 + 𝛽2𝑥2,𝑡 + ⋯ + 𝛽𝑘𝑥𝑘,𝑡 + 𝜀𝑡

where 𝑦𝑡 is the variable to be forecast and 𝑥1,𝑡, … , 𝑥𝑘,𝑡 are the predictor variables.

• The coefficients 𝛽1, … , 𝛽𝑘 measure the effect of each predictor after taking account of the effects of all the other predictors in the model.

• Thus, the coefficients measure the marginal effects of the predictor variables.

Multiple Linear Regression

• For forecasting, we make the following assumptions about the errors (𝜀1, … , 𝜀𝑇):

• they have mean zero; otherwise the forecasts will be systematically biased.

• they are not autocorrelated; otherwise the forecasts will be inefficient, as there is more information in the data that can be exploited.

• they are unrelated to the predictor variables; otherwise there would be more information that should be included in the systematic part of the model.

• It is also useful to have the errors normally distributed with constant variance in order to produce prediction intervals, but this is not necessary for forecasting.

Example: US Consumption Expenditure

• Additional predictors may be useful for forecasting US consumption expenditure.

• Building a multiple linear regression model can potentially generate more accurate forecasts.

Example: US Consumption Expenditure

• Scatterplot matrix of five variables:

us_change %>% ggpairs(columns = 2:6)

• Note: Install and load the GGally package in R before you run the code.


Example: US Consumption Expenditure

fit_MR <- us_change %>% model(lm = TSLM(Consumption ~ Income + Production + Unemployment + Savings))

report(fit_MR)


Series: Consumption

Model: TSLM

Residuals:

Min 1Q Median 3Q Max

-0.90555 -0.15821 -0.03608 0.13618 1.15471

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.253105 0.034470 7.343 5.71e-12 ***

Income        0.740583   0.040115  18.461  < 2e-16 ***
Production    0.047173   0.023142   2.038   0.0429 *
Unemployment -0.174685   0.095511  -1.829   0.0689 .
Savings      -0.052890   0.002924 -18.088  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3102 on 193 degrees of freedom
Multiple R-squared: 0.7683, Adjusted R-squared: 0.7635
F-statistic: 160 on 4 and 193 DF, p-value: < 2.22e-16

Outline

• The Linear Model with Time Series

• Simple Linear Regression

• Multiple Linear Regression

• Least Squares Estimation

• Evaluating the Regression Model

• Some Useful Predictors

Least Squares Estimation

• The values of the coefficients 𝛽0, 𝛽1, … , 𝛽𝑘 are obtained by finding the minimum sum of squared errors. That is, we find the values of 𝛽0, 𝛽1, … , 𝛽𝑘 which minimize

$$ \sum_{t=1}^{T} \varepsilon_t^2 = \sum_{t=1}^{T} \left( y_t - \beta_0 - \beta_1 x_{1,t} - \beta_2 x_{2,t} - \cdots - \beta_k x_{k,t} \right)^2 $$

• This is called least squares estimation because it gives the least value of the sum of squared errors.

Least Squares Estimation

• Finding the best estimates of the coefficients is often called “fitting” the model to the data, or sometimes “learning” or “training” the model.

• We refer to the estimated coefficients using the notation β̂0, β̂1, … , β̂k.

• The TSLM() function fits a linear regression model to time series data.

• It is similar to the lm() function, which is widely used for linear models, but TSLM() provides additional facilities for handling time series.

Fitted Values

• Predictions of 𝑦 can be calculated by ignoring the error in the regression equation:

ŷ = β̂0 + β̂1𝑥1 + β̂2𝑥2 + ⋯ + β̂k𝑥k

• Plugging in the values of 𝑥1, … , 𝑥𝑘 gives a prediction of 𝑦.

• The values ŷ are referred to as fitted values.

• These are “predictions” of the data used in estimating the model. They are not genuine forecasts, since the actual value of 𝑦 for each observation was used in estimating the model.

Example: US Consumption Expenditure

augment(fit_MR) %>% ggplot(aes(x = Quarter)) +

  geom_line(aes(y = Consumption, colour = "Data")) +
  geom_line(aes(y = .fitted, colour = "Fitted")) +
  labs(y = NULL, title = "Percent change in US consumption expenditure") +
  scale_colour_manual(values = c(Data = "black", Fitted = "#D55E00")) +
  guides(colour = guide_legend(title = NULL))

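To see what least squares estimation (described above) computes, here is a minimal sketch that solves the normal equations directly for the multiple regression model; it assumes the us_change data and the fit_MR specification from earlier, and should reproduce the coefficients shown by report(fit_MR):

library(fpp3)

# Design matrix (intercept column plus the four predictors) and response
X <- model.matrix(~ Income + Production + Unemployment + Savings,
                  data = as_tibble(us_change))
y <- us_change$Consumption

# Least squares estimates from the normal equations: (X'X) beta = X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
beta_hat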

Example: US Consumption Expenditure

augment(fit_MR) %>% ggplot(aes(x = Consumption, y = .fitted)) +
  geom_point() +
  labs(y = "Fitted (predicted values)", x = "Data (actual values)",
       title = "Percent change in US consumption expenditure") +
  geom_abline(intercept = 0, slope = 1)


Goodness-of-Fit

• A common way to summarize how well a linear regression model fits the data is via the coefficient of determination, or 𝑅2:

$$ R^2 = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2} $$

where the summations are over all observations.

• Thus, 𝑅2 reflects the proportion of variation in the forecast variable that is accounted for (or explained) by the regression model.


Goodness-of-Fit

• In all cases, 0 ≤ 𝑅2 ≤ 1.

• In simple linear regression, the value of 𝑅2 is also equal to the square of the
correlation between 𝑦 and 𝑥.

• If the predictions are close to the actual values, we would expect 𝑅2 to be close to 1.
On the other hand, if the predictions are unrelated to the actual values, then 𝑅2 = 0.

• The 𝑅2 value is used frequently, though often incorrectly, in forecasting.

• There are no set rules for what is a good 𝑅2 value, and typical values of 𝑅2 depend on the type of data used.

• The value of 𝑅2 will never decrease when adding an extra predictor to the model
and this can lead to over-fitting.

• Validating a model’s forecasting performance on the test data is much better than measuring the 𝑅2 value on the training data.

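As a quick check on the 𝑅2 reported earlier, here is a minimal sketch that computes it directly from the fitted values of the simple regression model fit; as_tibble() drops the time series index so that summarise() collapses over all observations:

augment(fit) %>%
  as_tibble() %>%
  summarise(r_squared = sum((.fitted - mean(Consumption))^2) /
                        sum((Consumption - mean(Consumption))^2))
# should match the 0.1472 shown by report(fit)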

Example: US Consumption Expenditure

• For multiple regression, 𝑅2 = 0.768. That means the model explains 76.8% of the variation in the consumption data.

• For simple regression, 𝑅2 = 0.147.

• Adding the three extra predictors has allowed a lot more of the variation in the consumption data to be explained.


Standard Error of the Regression

• Another measure of how well the model has fitted the data is the standard deviation of the residuals, which is often known as the standard error of the regression:

$$ \hat{\sigma}_e = \sqrt{\frac{1}{T-k-1} \sum_{t=1}^{T} e_t^2} $$

where 𝑘 is the number of predictors in the model.

• Notice that we divide by 𝑇 − 𝑘 − 1 because we have estimated 𝑘 + 1 parameters (the intercept and a coefficient for each predictor variable) in computing the residuals.

• The standard error is related to the size of the average error that the model produces.

Outline

• The Linear Model with Time Series

• Simple Linear Regression

• Multiple Linear Regression

• Least Squares Estimation

• Evaluating the Regression Model

• Some Useful Predictors

Evaluating the Regression Model

• The differences between the observed 𝑦 values and the corresponding fitted ŷ values are the training-set errors, or “residuals”, defined as

𝑒𝑡 = 𝑦𝑡 − ŷ𝑡 for 𝑡 = 1, 2, . . . , 𝑇

• It is clear that the average of the residuals is zero, and that the correlation between the residuals and the observations of the predictor variable is also zero.

Evaluating the Regression Model

• After selecting the regression variables and fitting a regression model, it is necessary to plot the residuals to check that the assumptions of the model have been satisfied.

• There is a series of plots that should be produced in order to check different aspects of the fitted model and the underlying assumptions (whether the linear model was appropriate):

• ACF plot of residuals

• Histogram of residuals

• Residual plots against predictors

• Residual plots against fitted values

ACF Plot of Residuals

• It is common to find autocorrelation in the residuals.

• This violates the assumption of no autocorrelation in the errors, and our forecasts may be inefficient; there is some information left over which should be accounted for in the model in order to obtain better forecasts.

• Therefore, we should always look at an ACF plot of the residuals.

• Another useful test of autocorrelation in the residuals, designed to take account of the regression model, is the Ljung-Box test.

• A small p-value indicates there is significant autocorrelation remaining in the residuals.

Histogram of Residuals

• It is always a good idea to check whether the residuals are normally distributed.

• This is not essential for forecasting, but it does make the calculation of prediction intervals much easier.

• For the US consumption model below, the histogram shows that the residuals seem to be slightly skewed, which may affect the coverage probability of the prediction intervals.

Example: US Consumption Expenditure

fit_MR %>% gg_tsresiduals()

augment(fit_MR) %>% features(.innov,

ljung_box, lag = 10, dof = 5)


# A tibble: 1 x 3

.model lb_stat lb_pvalue

1 lm 18.9 0.00204

• The p-value is less than 0.05, so the residuals are significantly autocorrelated.

• There is a significant spike at lag 7 in the ACF, and the model fails the Ljung-Box test.

• The model can still be used for forecasting, but the prediction intervals may not be accurate because of the correlated residuals.


Residual Plots Against Predictors

• The residuals should be randomly scattered without showing any systematic patterns.

• If scatterplots of the residuals against each of the predictor variables show a pattern, then the relationship may be nonlinear and the model will need to be modified accordingly.

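A minimal sketch of these scatterplots for the US consumption model, assuming fit_MR and us_change from earlier; the residuals are joined back onto the predictors and plotted one panel per regressor:

us_change %>%
  left_join(residuals(fit_MR), by = "Quarter") %>%
  pivot_longer(Income:Unemployment, names_to = "regressor", values_to = "x") %>%
  ggplot(aes(x = x, y = .resid)) +
  geom_point() +
  facet_wrap(. ~ regressor, scales = "free_x") +
  labs(y = "Residuals", x = "")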

Residual Plots Against Fitted Values

• A plot of the residuals against the fitted values should also show no pattern.

• If a pattern is observed, the variance of the residuals may not be constant (this is known as “heteroscedasticity”).

• If this problem occurs, a transformation of the forecast variable, such as a logarithm or square root, may be required.

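A minimal sketch of this plot for the US consumption model, again assuming fit_MR from earlier:

augment(fit_MR) %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  labs(x = "Fitted values", y = "Residuals")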

Outliers and Influential Observations

• Observations that take extreme values compared to the majority of the data are called outliers.

• Observations that have a large influence on the estimated coefficients of a regression model are called influential observations.

• Usually, influential observations are also outliers that are extreme in the 𝑥 direction.


Spurious Regression

• Time series data are often “non-stationary” (the values of the time series do not fluctuate around a constant mean).

• For example, consider the following two variables: air passenger traffic in Australia and rice production in Guinea.

• These appear to be related simply because they both trend upwards in the same manner. However, air passenger traffic in Australia has nothing to do with rice production in Guinea.


Spurious Regression

• Regressing non-stationary time series can lead to spurious regressions.

• High 𝑅2 and high residual autocorrelation can be signs of spurious regression.

• Cases of spurious regression might appear to give reasonable short-term forecasts, but they will generally not continue to work into the future.


Outline

• The Linear Model with Time Series

• Simple Linear Regression

• Multiple Linear Regression

• Least Squares Estimation

• Evaluating the Regression Model

• Some Useful Predictors


Some Useful Predictors

• There are several useful predictors that occur frequently when using regression for time series data:

• Trend

• Dummy variables

• Seasonal dummy variables

• Intervention variables

• Trading days

• Distributed lags

• Holidays

• Fourier series


Linear Trend

• It is common for time series data to be trending.

• A linear trend can be modelled by simply using 𝑥𝑡 = 𝑡 as a predictor:

𝑦𝑡 = 𝛽0 + 𝛽1𝑡 + 𝜀𝑡

where 𝑡 = 1, 2, . . . , 𝑇

• A trend variable can be specified in the TSLM() function using the trend() special.

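A minimal sketch of fitting a linear trend with the trend() special; the aus_airpassengers data (from the fpp3 package) is used here purely for illustration:

library(fpp3)

fit_trend <- aus_airpassengers %>%
  model(TSLM(Passengers ~ trend()))  # y_t = beta0 + beta1 * t + error
report(fit_trend)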

Dummy Variables

• A predictor could be a categorical variable taking only two values (e.g., “yes” and “no”).

• This situation can still be handled within the framework of multiple regression models by creating a “dummy variable” taking value

• 1 corresponding to “yes” and

• 0 corresponding to “no”.

• A dummy variable is also known as an “indicator variable”.

• If there are more than two categories, then the variable can be coded using several dummy variables (one fewer than the total number of categories).


Dummy Variables

• For example, suppose we are forecasting daily sales and want to take account of whether the day is a public holiday. Then the predictor takes value

• “yes” on a public holiday, and

• “no” otherwise.

• A dummy variable can also be used to account for an outlier in the data. Rather than omit the outlier, a dummy variable removes its effect (see the sketch after this slide). In this case, the dummy variable takes value

• 1 for that observation and

• 0 everywhere else.

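A minimal sketch of an outlier dummy, using a hypothetical quarterly series with a one-off spike in 2010 Q1:

library(fpp3)

df <- tsibble(
  Quarter = yearquarter("2005 Q1") + 0:39,
  y = rnorm(40) + c(rep(0, 20), 10, rep(0, 19)),  # spike at 2010 Q1
  index = Quarter
)

# Dummy variable: 1 for the outlying observation, 0 everywhere else
df <- df %>% mutate(outlier = as.integer(Quarter == yearquarter("2010 Q1")))

fit <- df %>% model(TSLM(y ~ trend() + outlier))
report(fit)  # the outlier coefficient absorbs the spike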

Seasonal Dummy Variables

• Suppose that we are forecasting daily data and we want to account for the day of the week as a predictor. Then the following dummy variables can be created.


D1 D2 D3 D4 D5 D6

Monday 1 0 0 0 0 0

Tuesday 0 1 0 0 0 0

Wednesday 0 0 1 0 0 0

Thursday 0 0 0 1 0 0

Friday 0 0 0 0 1 0

Saturday 0 0 0 0 0 1

Sunday 0 0 0 0 0 0

Monday 1 0 0 0 0 0

⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮


Dummy Variable Trap

• Notice that only six dummy variables are needed to code seven categories. That is because the seventh category (in this case Sunday) is captured by the intercept, and is specified when the dummy variables are all set to zero.

• Many beginners will try to add a seventh dummy variable for the seventh category. This is known as the “dummy variable trap”, because it will cause the regression to fail.

• The general rule is to use one fewer dummy variable than the number of categories.

• The interpretation of each of the coefficients associated with the dummy variables is that it is a measure of the effect of that category relative to the omitted category.


Uses of Dummy Variables

• Seasonal dummies

• For quarterly data: use 3 dummies

• For monthly data: use 11 dummies

• For daily data: use 6 dummies

• The TSLM() function will automatically handle seasonal dummies if you specify the predictor season().

• Outliers

• If there is an outlier, you can use a dummy variable (taking value 1 for that observation and 0 elsewhere) to remove its effect.

• Public holidays

• For daily data: if it is a public holiday, dummy = 1, otherwise dummy = 0.


Example: Australian Beer Production

recent_production <- aus_production %>% filter(year(Quarter) >= 1992)

recent_production %>% autoplot(Beer) +
  labs(y = "Megalitres", title = "Australian quarterly beer production")


Example: Australian Beer Production

• We want to forecast the value of future beer production.

• We can model this data using a regression model with a linear trend and quarterly dummy variables:

𝑦𝑡 = 𝛽0 + 𝛽1𝑡 + 𝛽2𝑑2,𝑡 + 𝛽3𝑑3,𝑡 + 𝛽4𝑑4,𝑡 + 𝜀𝑡

where 𝑑𝑖,𝑡 = 1 if 𝑡 is in quarter 𝑖 and 0 otherwise.

• The first quarter variable has been omitted, so the coefficients associated with the other quarters are measures of the difference between those quarters and the first quarter.


Example: Australian Beer Production

fit_beer <- recent_production %>% model(TSLM(Beer ~ trend() + season()))

report(fit_beer)


Note that trend() and season() are not standard functions or objects; they are “special” functions that work within the TSLM() model formula.


Series: Beer

Model: TSLM

Residuals:

Min 1Q Median 3Q Max

-42.9029 -7.5995 -0.4594 7.9908 21.7895

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept)    441.80044    3.73353  118.333  < 2e-16 ***
trend()         -0.34027    0.06657   -5.111  2.73e-06 ***
season()year2  -34.65973    3.96832   -8.734  9.10e-13 ***
season()year3  -17.82164    4.02249   -4.430  3.45e-05 ***
season()year4   72.79641    4.02305   18.095  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 12.23 on 69 degrees of freedom
Multiple R-squared: 0.9243, Adjusted R-squared: 0.9199
F-statistic: 210.7 on 4 and 69 DF, p-value: < 2.22e-16

Example: Australian Beer Production

• There is an average downward trend of 0.34 megalitres per quarter.

• On average,

• the second quarter has production 34.7 megalitres lower than the first quarter,

• the third quarter has production 17.8 megalitres lower than the first quarter, and

• the fourth quarter has production 72.8 megalitres higher than the first quarter.

Example: Australian Beer Production

augment(fit_beer) %>% ggplot(aes(x = Quarter)) +

  geom_line(aes(y = Beer, colour = "Data")) +
  geom_line(aes(y = .fitted, colour = "Fitted")) +
  labs(y = "Megalitres", title = "Australian quarterly beer production") +
  scale_colour_manual(values = c(Data = "black", Fitted = "#D55E00")) +
  guides(colour = guide_legend(title = NULL))


Example: Australian Beer Production

augment(fit_beer) %>% ggplot(aes(x = Beer, y = .fitted, colour = factor(quarter(Quarter)))) +
  geom_point() + labs(y = "Fitted", x = "Actual values", title = "Australian quarterly beer production") +
  geom_abline(intercept = 0, slope = 1) + guides(colour = guide_legend(title = "Quarter"))


Example: Australian Beer Production

fit_beer %>% gg_tsresiduals()


Example: Australian Beer Production

fit_beer %>% forecast() %>% autoplot(recent_production)


Intervention Variables

• It is often necessary to model interventions that may have affected the variable to be forecast.

• For example, competitor activity, advertising expenditure, industrial action, and so on, can all have an effect.

• Three situations (the first two are sketched in code after this slide):

• When the effect lasts only for one period, we use a spike (dummy) variable. It takes value one in the period of the intervention and zero elsewhere.

• Other interventions have an immediate and permanent effect. Then we use a step variable. It takes value zero before the intervention and one from the time of intervention onward.

• Another form of permanent effect is a change of slope. Here the intervention is handled using a piecewise linear trend.

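A minimal sketch of spike and step variables, using a hypothetical monthly series with an intervention in January 2015:

library(fpp3)

df <- tsibble(
  Month = yearmonth("2012 Jan") + 0:59,
  y = rnorm(60),
  index = Month
)

df <- df %>% mutate(
  spike = as.integer(Month == yearmonth("2015 Jan")),  # effect in one period only
  step  = as.integer(Month >= yearmonth("2015 Jan"))   # permanent level shift
)

fit <- df %>% model(TSLM(y ~ trend() + spike + step))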

Trading days

• The number of trading days in a month can vary considerably and can have a substantial effect on sales data. To allow for this, the number of trading days in each month can be included as a predictor.

• An alternative that allows for the effects of different days of the week has the following predictors (counting weekdays is sketched after this slide):

x1 = number of Mondays in the month;
x2 = number of Tuesdays in the month;
⋮
x7 = number of Sundays in the month

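As a sketch of how such predictors can be built, the following helper counts the Mondays in a given month (lubridate is attached by fpp3); analogous helpers give the other weekdays, and the resulting counts can be added to a monthly tsibble with mutate() and included in the TSLM() formula:

library(fpp3)

n_mondays <- function(yrmonth) {
  first <- as.Date(yrmonth)                    # first day of the month
  last  <- first + days_in_month(first) - 1    # last day of the month
  sum(wday(seq(first, last, by = "day"), week_start = 1) == 1)
}

n_mondays(yearmonth("2021 Nov"))  # 5 Mondays in November 2021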

Distributed Lags

• It is often useful to include advertising expenditure as a predictor. However, since the effect of advertising can last beyond the actual campaign, we need to include lagged values of advertising expenditure.

• Thus, the following predictors may be used (see the sketch after this slide):

x1 = advertising for previous month;
x2 = advertising for two months previously;
⋮
xm = advertising for m months previously

• It is common to require the coefficients to decrease as the lag increases.

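A minimal sketch of distributed lags, assuming a hypothetical monthly tsibble sales with columns Sales and Advert; the lagged predictors are created explicitly with lag() before fitting:

lagged <- sales %>%
  mutate(
    advert_lag1 = lag(Advert, 1),  # advertising for previous month
    advert_lag2 = lag(Advert, 2)   # advertising for two months previously
  )

fit <- lagged %>%
  model(TSLM(Sales ~ Advert + advert_lag1 + advert_lag2))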

Holidays

• For monthly data:

• Christmas: always in December, so it is part of the monthly seasonal effect.

• Easter: use a dummy variable 𝑣𝑡 = 1 if any part of Easter is in that month, and 𝑣𝑡 = 0 otherwise.

• Ramadan and Chinese New Year can be handled similarly.


Fourier Series

• As an alternative to seasonal dummy variables, especially for long seasonal periods, Fourier terms can be used (their form is sketched after this list).

• Every periodic function can be approximated by sums of sin and cos terms, for large enough K.

• K specifies how many pairs of sin and cos terms to include.

• The maximum allowed is K = m/2, where m is the seasonal period.

• Choose K by minimizing the AICc.

• This approach is called “harmonic regression”.

• These Fourier terms are produced using the fourier() function.

• Example:

TSLM(y ~ trend() + fourier(K))

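For concreteness, with seasonal period 𝑚 the Fourier terms used as predictors take the following standard form (the 𝑘-th pair is sin(2𝜋𝑘𝑡/𝑚) and cos(2𝜋𝑘𝑡/𝑚), for 𝑘 = 1, … , 𝐾):

$$ x_{1,t} = \sin\left(\frac{2\pi t}{m}\right),\quad x_{2,t} = \cos\left(\frac{2\pi t}{m}\right),\quad x_{3,t} = \sin\left(\frac{4\pi t}{m}\right),\quad x_{4,t} = \cos\left(\frac{4\pi t}{m}\right),\ \ldots $$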

Example: Australian Beer Production

fourier_beer <- recent_production %>% model(TSLM(Beer ~ trend() + fourier(K = 2)))

report(fourier_beer)


Series: Beer

Model: TSLM

Residuals:

Min 1Q Median 3Q Max

-42.9029 -7.5995 -0.4594 7.9908 21.7895

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept)         446.87920    2.87321  155.533  < 2e-16 ***
trend()              -0.34027    0.06657   -5.111  2.73e-06 ***
fourier(K = 2)C1_4    8.91082    2.01125    4.430  3.45e-05 ***
fourier(K = 2)S1_4  -53.72807    2.01125  -26.714  < 2e-16 ***
fourier(K = 2)C2_4  -13.98958    1.42256   -9.834  9.26e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 12.23 on 69 degrees of freedom
Multiple R-squared: 0.9243, Adjusted R-squared: 0.9199
F-statistic: 210.7 on 4 and 69 DF, p-value: < 2.22e-16

Example: Australian Beer Production

recent_production %>%

model(

f1 = TSLM(Beer ~ trend() + fourier(K=1)),

f2 = TSLM(Beer ~ trend() + fourier(K=2)),

season = TSLM(Beer ~ trend() + season())

) %>% glance()


# A tibble: 3 x 15

.model r_squared adj_r_squared sigma2 statistic p_value df log_lik AIC AICc BIC

1 f1 0.818 0.810 354. 105. 7.41e-26 4 -320. 440. 441. 452.

2 f2 0.924 0.920 150. 211. 6.97e-38 5 -288. 377. 379. 391.

3 season 0.924 0.920 150. 211. 6.97e-38 5 -288. 377. 379. 391.

# … with 4 more variables: CV <dbl>, deviance <dbl>, df.residual <int>, rank <int>

Example: Eating-out Expenditure

aus_cafe <- aus_retail %>%
  filter(
    Industry == "Cafes, restaurants and takeaway food services",
    year(Month) %in% 2004:2018
  ) %>%
  summarise(Turnover = sum(Turnover))

aus_cafe %>% autoplot(Turnover)


Example: Eating-out Expenditure

fit <- aus_cafe %>%

model(K1 = TSLM(log(Turnover) ~ trend() + fourier(K = 1)),

K2 = TSLM(log(Turnover) ~ trend() + fourier(K = 2)),

K3 = TSLM(log(Turnover) ~ trend() + fourier(K = 3)),

K4 = TSLM(log(Turnover) ~ trend() + fourier(K = 4)),

K5 = TSLM(log(Turnover) ~ trend() + fourier(K = 5)),

K6 = TSLM(log(Turnover) ~ trend() + fourier(K = 6)))

glance(fit) %>% select(.model, r_squared, adj_r_squared, AICc)


# A tibble: 6 x 4

.model r_squared adj_r_squared AICc

1 K1 0.962 0.962 -1085.

2 K2 0.966 0.965 -1099.

3 K3 0.976 0.975 -1160.

4 K4 0.980 0.979 -1183.

5 K5 0.985 0.984 -1234.

6 K6 0.985 0.984 -1232.

The K5 model is preferred, since it has the lowest AICc (K5 and K6 have essentially the same adjusted R-squared).

Example: Eating-out Expenditure
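The slide shows the forecasts from the chosen model; a minimal sketch that reproduces such a plot, assuming the fit mable above and selecting the K5 model preferred by the AICc:

fit %>%
  select(K5) %>%
  forecast(h = "2 years") %>%
  autoplot(aus_cafe)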

