Week 9: Regression Models
Some of the slides are adapted from the lecture notes provided by Prof. Antoine Saure and Prof. Rob Hyndman
Business Forecasting Analytics
ADM 4307 – Fall 2021
Regression Models
Ahmet Kandakoglu, PhD
08 November, 2021
Outline
• The Linear Model with Time Series
• Simple Linear Regression
• Multiple Linear Regression
• Least Squares Estimation
• Evaluating the Regression Model
• Some Useful Predictors
ADM 4307 Business Forecasting Analytics – Fall 2021
Simple Linear Regression
• The basic concept is that we forecast a variable 𝑦 by assuming it has a linear relationship with another variable 𝑥:
𝑦𝑡 = 𝛽0 + 𝛽1𝑥𝑡 + 𝜀𝑡
• The model is called simple regression as we only allow one predictor variable 𝑥. The forecast
variable 𝑦 is sometimes also called the dependent or explained variable. The predictor
variable 𝑥 is sometimes also called the independent or explanatory variable.
• The parameters 𝛽0 and 𝛽1 determine the intercept and the slope of the line respectively. The
intercept 𝛽0 represents the predicted value of 𝑦 when 𝑥 = 0. The slope 𝛽1 represents the
average predicted change in 𝑦 resulting from a one unit increase in 𝑥.
Simple Linear Regression
Notice that the observations do not lie on the straight line but are scattered around it. We can
think of each observation 𝑦𝑡 as consisting of the systematic or explained part of the model,
𝛽0 + 𝛽1𝑥𝑡, and the random error, 𝜀𝑡.
Example: US Consumption Expenditure
us_change %>% pivot_longer(c(Consumption, Income), names_to = "Series") %>%
  autoplot(value) + labs(y = "% change")
Example: US Consumption Expenditure
us_change %>% ggplot(aes(x = Income, y = Consumption)) +
  labs(y = "Consumption (quarterly % change)", x = "Income (quarterly % change)") +
  geom_point() + geom_smooth(method = "lm", se = FALSE)
Example: US Consumption Expenditure
The equation is estimated in R using the TSLM() function:
fit <- us_change %>% model(lm = TSLM(Consumption ~ Income))
report(fit)
Estimated equation: Consumption = 0.54 + 0.27 × Income
Series: Consumption
Model: TSLM
Residuals:
Min 1Q Median 3Q Max
-2.58236 -0.27777 0.01862 0.32330 1.42229
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.54454    0.05403  10.079  < 2e-16 ***
Income       0.27183    0.04673   5.817  2.4e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5905 on 196 degrees of freedom
Multiple R-squared: 0.1472, Adjusted R-squared: 0.1429
F-statistic: 33.84 on 1 and 196 DF, p-value: 2.4022e-08

Multiple Linear Regression
• In multiple regression there is one variable to be forecast and several predictor variables.
• Example with cross-sectional data: a credit score is used to decide whether a customer will be given a loan. The credit score can be predicted from other variables; we want to predict the value of the credit score variable using the values of those other variables.
• Example with time series data: we want to forecast future beer production, but there are no other variables available as predictors. Instead, we could use the number of quarters since the start of the series, or the quarter of the year corresponding to each observation, as a predictor variable.

Multiple Linear Regression
• The general form of a multiple regression is
𝑦𝑡 = 𝛽0 + 𝛽1𝑥1,𝑡 + 𝛽2𝑥2,𝑡 + ⋯ + 𝛽𝑘𝑥𝑘,𝑡 + 𝜀𝑡
where 𝑦𝑡 is the variable to be forecast and 𝑥1,𝑡, … , 𝑥𝑘,𝑡 are the predictor variables.
• The coefficients 𝛽1, … , 𝛽𝑘 measure the effect of each predictor after taking account of the effects of all the other predictors in the model.
• Thus, the coefficients measure the marginal effects of the predictor variables.

Multiple Linear Regression
• For forecasting, we make the following assumptions about the errors (𝜀1, … , 𝜀𝑇):
• they have mean zero, otherwise the forecasts will be systematically biased;
• they are not autocorrelated, otherwise the forecasts will be inefficient, as there is more information in the data that can be exploited;
• they are unrelated to the predictor variables, otherwise there would be more information that should be included in the systematic part of the model.
• It is also useful to have the errors normally distributed with constant variance in order to produce prediction intervals, but this is not necessary for forecasting.

Example: US Consumption Expenditure
• Additional predictors may be useful for forecasting US consumption expenditure.
• Building a multiple linear regression model can potentially generate more accurate forecasts.

Example: US Consumption Expenditure
• Scatterplot matrix of five variables:
us_change %>% ggpairs(columns = 2:6)
• Note: install and load the GGally package in R before running the code.
Example: US Consumption Expenditure
fit_MR <- us_change %>% model(lm = TSLM(Consumption ~ Income + Production + Unemployment + Savings))
report(fit_MR)
Series: Consumption
Model: TSLM
Residuals:
Min 1Q Median 3Q Max
-0.90555 -0.15821 -0.03608 0.13618 1.15471
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.253105 0.034470 7.343 5.71e-12 ***
Income        0.740583   0.040115  18.461  < 2e-16 ***
Production    0.047173   0.023142   2.038   0.0429 *
Unemployment -0.174685   0.095511  -1.829   0.0689 .
Savings      -0.052890   0.002924 -18.088  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3102 on 193 degrees of freedom
Multiple R-squared: 0.7683, Adjusted R-squared: 0.7635
F-statistic: 160 on 4 and 193 DF, p-value: < 2.22e-16

Outline
• The Linear Model with Time Series
• Simple Linear Regression
• Multiple Linear Regression
• Least Squares Estimation
• Evaluating the Regression Model
• Some Useful Predictors

Least Squares Estimation
• The values of the coefficients 𝛽0, 𝛽1, … , 𝛽𝑘 are obtained by finding the minimum sum of squared errors. That is, we find the values of 𝛽0, 𝛽1, … , 𝛽𝑘 which minimize
Σₜ 𝜀𝑡² = Σₜ (𝑦𝑡 − 𝛽0 − 𝛽1𝑥1,𝑡 − 𝛽2𝑥2,𝑡 − ⋯ − 𝛽𝑘𝑥𝑘,𝑡)², with the sums taken over 𝑡 = 1, … , 𝑇.
• This is called least squares estimation because it gives the least value of the sum of squared errors.

Least Squares Estimation
• Finding the best estimates of the coefficients is often called "fitting" the model to the data, or sometimes "learning" or "training" the model.
• We refer to the estimated coefficients using the notation 𝛽̂0, 𝛽̂1, … , 𝛽̂𝑘.
• The TSLM() function fits a linear regression model to time series data.
• It is similar to the lm() function, which is widely used for linear models, but TSLM() provides additional facilities for handling time series.

Fitted Values
• Predictions of 𝑦 can be calculated by ignoring the error in the regression equation:
𝑦̂ = 𝛽̂0 + 𝛽̂1𝑥1 + 𝛽̂2𝑥2 + ⋯ + 𝛽̂𝑘𝑥𝑘
• Plugging in the values of 𝑥1, … , 𝑥𝑘 gives a prediction of 𝑦.
• The values 𝑦̂ are referred to as fitted values.
• These are "predictions" of the data used in estimating the model. They are not genuine forecasts, because the actual values of 𝑦 were used to estimate the model.
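The least squares criterion and the resulting fitted values can be illustrated with base R's lm() on a small synthetic example (the data and true coefficients below are hypothetical; TSLM() applies the same criterion to time series):

```r
# Least squares sketch on synthetic (hypothetical) data:
# the true model is y = 2 + 3x + noise, and lm() recovers the
# coefficients by minimizing the sum of squared errors.
set.seed(42)
x <- 1:100
y <- 2 + 3 * x + rnorm(100, sd = 5)
fit <- lm(y ~ x)
beta <- coef(fit)              # estimated intercept and slope
yhat <- fitted(fit)            # fitted values: beta[1] + beta[2] * x
rss  <- sum(residuals(fit)^2)  # the minimized sum of squared errors
```

Any other choice of coefficients yields a larger sum of squared errors, which is exactly what "least squares" means.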
Example: US Consumption Expenditure
augment(fit_MR) %>% ggplot(aes(x = Quarter)) +
  geom_line(aes(y = Consumption, colour = "Data")) +
  geom_line(aes(y = .fitted, colour = "Fitted")) +
  labs(y = NULL, title = "Percent change in US consumption expenditure") +
  scale_colour_manual(values = c(Data = "black", Fitted = "#D55E00")) +
  guides(colour = guide_legend(title = NULL))
Example: US Consumption Expenditure
augment(fit_MR) %>% ggplot(aes(x = Consumption, y = .fitted)) +
  geom_point() +
  labs(y = "Fitted (predicted values)", x = "Data (actual values)",
       title = "Percent change in US consumption expenditure") +
  geom_abline(intercept = 0, slope = 1)
Goodness-of-Fit
• A common way to summarize how well a linear regression model fits the data is via
the coefficient of determination, or 𝑅2.
𝑅² = Σ(𝑦̂𝑖 − 𝑦̄)² / Σ(𝑦𝑖 − 𝑦̄)²
where the summations are over all observations.
• Thus, 𝑅² reflects the proportion of variation in the forecast variable that is accounted
for (or explained) by the regression model.
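The definition can be checked numerically in base R on synthetic (hypothetical) data; for simple regression the same quantity equals the squared correlation between 𝑦 and 𝑥:

```r
# R-squared computed directly from its definition
set.seed(1)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)
yhat <- fitted(fit)
r2 <- sum((yhat - mean(y))^2) / sum((y - mean(y))^2)
# r2 agrees with summary(fit)$r.squared, and with cor(y, x)^2
# (the latter identity holds only for simple regression)
```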
Goodness-of-Fit
• In all cases, 0 ≤ 𝑅2 ≤ 1.
• In simple linear regression, the value of 𝑅2 is also equal to the square of the
correlation between 𝑦 and 𝑥.
• If the predictions are close to the actual values, we would expect 𝑅2 to be close to 1.
On the other hand, if the predictions are unrelated to the actual values, then 𝑅2 = 0.
• The 𝑅2 value is used frequently, though often incorrectly, in forecasting.
• There are no set rules for what is a good 𝑅2 value, and typical values of 𝑅2
depend on the type of data used.
• The value of 𝑅2 will never decrease when adding an extra predictor to the model
and this can lead to over-fitting.
• Validating a model’s forecasting performance on the test data is much better than
measuring the 𝑅2 value on the training data.
Example: US Consumption Expenditure
• For multiple regression, 𝑅2 = 0.768. That means, the model explains 76.8% of the variation in
the consumption data.
• For simple regression, 𝑅2 = 0.147
• Adding the three extra predictors has allowed a lot more of the variation in the consumption
data to be explained.
(For comparison: the simple regression output reports Multiple R-squared: 0.1472, while the multiple regression output reports Multiple R-squared: 0.7683.)

Standard Error of the Regression
• Another measure of how well the model has fitted the data is the standard deviation of the residuals, often known as the standard error of the regression:
𝜎̂ₑ = sqrt( Σₜ 𝑒𝑡² / (𝑇 − 𝑘 − 1) ), with the sum taken over 𝑡 = 1, … , 𝑇,
where 𝑘 is the number of predictors in the model.
• Notice that we divide by 𝑇 − 𝑘 − 1 because we have estimated 𝑘 + 1 parameters (the intercept and a coefficient for each predictor variable) in computing the residuals.
• The standard error is related to the size of the average error that the model produces.

Outline
• The Linear Model with Time Series
• Simple Linear Regression
• Multiple Linear Regression
• Least Squares Estimation
• Evaluating the Regression Model
• Some Useful Predictors

Evaluating the Regression Model
• The differences between the observed 𝑦 values and the corresponding fitted 𝑦̂ values are the training-set errors, or "residuals", defined as
𝑒𝑡 = 𝑦𝑡 − 𝑦̂𝑡 for 𝑡 = 1, 2, … , 𝑇
• The average of the residuals is zero, and the correlation between the residuals and the observations of the predictor variable is also zero.

Evaluating the Regression Model
• After selecting the regression variables and fitting a regression model, it is necessary to plot the residuals to check that the assumptions of the model have been satisfied.
• A series of plots should be produced to check different aspects of the fitted model and the underlying assumptions (e.g., whether the linear model was appropriate):
• ACF plot of residuals
• Histogram of residuals
• Residual plots against predictors
• Residual plots against fitted values

ACF Plot of Residuals
• It is common to find autocorrelation in the residuals. This violates the assumption of no autocorrelation in the errors, and our forecasts may be inefficient: there is some information left over which should be accounted for in the model in order to obtain better forecasts.
• Therefore we should always look at an ACF plot of the residuals.
• Another useful test of autocorrelation in the residuals, designed to take account of the regression model, is the Ljung-Box test.
• A small p-value indicates that significant autocorrelation remains in the residuals.

Histogram of Residuals
• It is always a good idea to check whether the residuals are normally distributed.
• This is not essential for forecasting, but it does make the calculation of prediction intervals much easier.
• In this example, the histogram shows that the residuals seem to be slightly skewed, which may affect the coverage probability of the prediction intervals.

Example: US Consumption Expenditure
fit_MR %>% gg_tsresiduals()
augment(fit_MR) %>% features(.innov,
ljung_box, lag = 10, dof = 5)
# A tibble: 1 x 3
.model lb_stat lb_pvalue
1 lm 18.9 0.00204
• The p-value is less than 0.05, so the residuals are significantly autocorrelated.
• There is a significant spike at lag 7 in the ACF, and the model fails the Ljung-Box test.
• The model can still be used for forecasting, but the prediction intervals may not be
accurate because of the correlated residuals.
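Outside the fable workflow, base R's Box.test() computes the same Ljung-Box statistic. A sketch on simulated series (the white-noise and AR examples below are illustrative, not from the US consumption data):

```r
# Ljung-Box test via base R's Box.test()
set.seed(123)
wn <- rnorm(200)                                      # white noise
ar <- as.numeric(arima.sim(list(ar = 0.8), n = 200))  # autocorrelated series
Box.test(wn, lag = 10, type = "Ljung-Box")$p.value    # typically large: no autocorrelation
Box.test(ar, lag = 10, type = "Ljung-Box")$p.value    # effectively zero for this AR series
# For regression residuals, pass fitdf = number of estimated coefficients
# to adjust the degrees of freedom (the role played by dof in ljung_box).
```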
Residual Plots Against Predictors
• The residuals should be randomly scattered, without showing any systematic patterns.
• If scatterplots of the residuals against each of the predictor variables show a pattern,
then the relationship may be nonlinear, and the model will need to be modified
accordingly.
Residual Plots Against Fitted Values
• A plot of the residuals against the fitted values should also show no pattern.
• If a pattern is observed, the variance of the residuals may not be constant.
• If this problem occurs, a transformation of the forecast variable such as a logarithm or
square root may be required.
Outliers and Influential Observations
• Observations that take extreme values compared to the majority of the data are called
outliers.
• Observations that have a large influence on the estimated coefficients of a regression
model are called influential observations.
• Usually, influential observations are also outliers that are extreme in the 𝑥 direction.
Spurious Regression
• Time series data are often “non-stationary” (the values of the time series do not
fluctuate around a constant mean).
• For example, consider the following two variables.
• These appear to be related simply because they both trend upwards in the same
manner. However, air passenger traffic in Australia has nothing to do with rice
production in Guinea.
Spurious Regression
• Regressing non-stationary time series can lead to spurious regressions.
• High 𝑅2 and high residual autocorrelation can be signs of spurious regression.
• Cases of spurious regression might appear to give reasonable short-term forecasts,
but they will generally not continue to work into the future.
Outline
• The Linear Model with Time Series
• Simple Linear Regression
• Multiple Linear Regression
• Least Squares Estimation
• Evaluating the Regression Model
• Some Useful Predictors
Some Useful Predictors
• There are several useful predictors that occur frequently when using regression for
time series data:
• Trend
• Dummy variables
• Seasonal dummy variables
• Intervention variables
• Trading days
• Distributed lags
• Holidays
• Fourier series
Linear Trend
• It is common for time series data to be trending.
• A linear trend can be modelled by simply using 𝑥𝑡 = 𝑡 as a predictor:
𝑦𝑡 = 𝛽0 + 𝛽1𝑡 + 𝜀𝑡
where 𝑡 = 1, 2, . . . , 𝑇
• A trend variable can be specified in the TSLM() function using the trend() special.
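In base R, the same idea is just a regression on 𝑡 = 1, … , 𝑇 (synthetic data below; the trend() special constructs this predictor for you inside TSLM()):

```r
# A linear trend as a regression on t = 1..T (synthetic data)
set.seed(7)
T_len <- 40
t_idx <- seq_len(T_len)
y <- 10 + 0.5 * t_idx + rnorm(T_len)  # true intercept 10, true slope 0.5
fit <- lm(y ~ t_idx)
coef(fit)  # estimated intercept near 10, estimated trend near 0.5
```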
Dummy Variables
• A predictor could be a categorical variable taking only two values (e.g., “yes” and
“no”)
• This situation can still be handled within the framework of multiple regression models
by creating a “dummy variable” taking value
• 1 corresponding to “yes” and
• 0 corresponding to “no”.
• A dummy variable is also known as an “indicator variable”
• If there are more than two categories, then the variable can be coded using several
dummy variables (one fewer than the total number of categories)
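Base R's model.matrix() shows this coding directly: a factor with seven levels expands to six dummy columns plus the intercept (the day-of-week labels are just an illustration):

```r
# k categories -> k-1 dummy variables (plus the intercept)
day <- factor(c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"),
              levels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))
X <- model.matrix(~ day)
dim(X)  # 7 rows and 7 columns: intercept + 6 dummies for 7 categories
# The reference level (Sun) has zeros in every dummy column,
# so it is captured by the intercept alone.
```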
Dummy Variables
• For example, suppose you are forecasting daily sales and want to take account of
whether the day is a public holiday. The predictor takes the value
• "yes" on a public holiday, and
• "no" otherwise.
• A dummy variable can also be used to account for an outlier in the data. Rather than
omit the outlier, a dummy variable removes its effect. In this case, the dummy
variable takes value
• 1 for that observation and
• 0 everywhere else.
Seasonal Dummy Variables
• Suppose that we are forecasting daily data and we want to account for the day of the
week as a predictor. Then the following dummy variables can be created.
Day        D1  D2  D3  D4  D5  D6
Monday      1   0   0   0   0   0
Tuesday     0   1   0   0   0   0
Wednesday   0   0   1   0   0   0
Thursday    0   0   0   1   0   0
Friday      0   0   0   0   1   0
Saturday    0   0   0   0   0   1
Sunday      0   0   0   0   0   0
Monday      1   0   0   0   0   0
⋮           ⋮   ⋮   ⋮   ⋮   ⋮   ⋮
Dummy Variable Trap
• Notice that only six dummy variables are needed to code seven categories. That is
because the seventh category (in this case Sunday) is captured by the intercept, and
is specified when the dummy variables are all set to zero.
• Many beginners will try to add a seventh dummy variable for the seventh category.
This is known as the “dummy variable trap”, because it will cause the regression to
fail.
• The general rule is to use one fewer dummy variable than there are categories.
• The interpretation of each of the coefficients associated with the dummy variables is
that it is a measure of the effect of that category relative to the omitted category.
Uses of Dummy Variables
• Seasonal dummies
• For quarterly data: use 3 dummies
• For monthly data: use 11 dummies
• For daily data: use 6 dummies
• Outliers
• If there is an outlier, you can use a dummy variable (taking value 1 for that observation
and 0 elsewhere) to remove its effect.
• Public holidays
• For daily data: if the day is a public holiday, dummy = 1; otherwise dummy = 0.
• Note: the TSLM() function will create the seasonal dummies automatically if you specify
the predictor season().
Example: Australian Beer Production
recent_production <- aus_production %>% filter(year(Quarter) >= 1992)
recent_production %>% autoplot(Beer) +
  labs(y = "Megalitres", title = "Australian quarterly beer production")
Example: Australian Beer Production
• We want to forecast the value of future beer production.
• We can model this data using a regression model with a linear trend and quarterly
dummy variables
𝑦𝑡 = 𝛽0 + 𝛽1𝑡 + 𝛽2𝑑2,𝑡 + 𝛽3𝑑3,𝑡 + 𝛽4𝑑4,𝑡 + 𝜀𝑡
where 𝑑𝑖,𝑡 = 1 if 𝑡 is in quarter 𝑖 and 0 otherwise.
• The first quarter variable has been omitted, so the coefficients associated with the
other quarters are measures of the difference between those quarters and the first
quarter.
Example: Australian Beer Production
fit_beer <- recent_production %>% model(TSLM(Beer ~ trend() + season()))
report(fit_beer)
Note that trend() and season() are not standard functions or objects; they are "specials"
that work within the TSLM() model formula.
Series: Beer
Model: TSLM
Residuals:
Min 1Q Median 3Q Max
-42.9029 -7.5995 -0.4594 7.9908 21.7895
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)    441.80044    3.73353  118.333  < 2e-16 ***
trend()         -0.34027    0.06657   -5.111  2.73e-06 ***
season()year2  -34.65973    3.96832   -8.734  9.10e-13 ***
season()year3  -17.82164    4.02249   -4.430  3.45e-05 ***
season()year4   72.79641    4.02305   18.095  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.23 on 69 degrees of freedom
Multiple R-squared: 0.9243, Adjusted R-squared: 0.9199
F-statistic: 210.7 on 4 and 69 DF, p-value: < 2.22e-16

Example: Australian Beer Production
• There is an average downward trend of 0.34 megalitres per quarter.
• On average,
• the second quarter has production 34.7 megalitres lower than the first quarter,
• the third quarter has production 17.8 megalitres lower than the first quarter, and
• the fourth quarter has production 72.8 megalitres higher than the first quarter.

Example: Australian Beer Production
augment(fit_beer) %>% ggplot(aes(x = Quarter)) +
  geom_line(aes(y = Beer, colour = "Data")) + geom_line(aes(y = .fitted, colour = "Fitted")) +
  labs(y = "Megalitres", title = "Australian quarterly beer production") +
  scale_colour_manual(values = c(Data = "black", Fitted = "#D55E00")) +
  guides(colour = guide_legend(title = NULL))
Example: Australian Beer Production
augment(fit_beer) %>% ggplot(aes(x = Beer, y = .fitted, colour = factor(quarter(Quarter)))) +
  geom_point() + labs(y = "Fitted", x = "Actual values", title = "Australian quarterly beer production") +
  geom_abline(intercept = 0, slope = 1) + guides(colour = guide_legend(title = "Quarter"))
Example: Australian Beer Production
fit_beer %>% gg_tsresiduals()
Example: Australian Beer Production
fit_beer %>% forecast() %>% autoplot(recent_production)
Intervention Variables
• It is often necessary to model interventions that may have affected the variable to be
forecast.
• For example, competitor activity, advertising expenditure, industrial action, and so on,
can all have an effect.
• Three situations:
• When the effect lasts only for one period, we use a spike (dummy) variable. It takes value
one in the period of the intervention and zero elsewhere.
• Other interventions have an immediate and permanent effect. Then we use a step
variable. It takes value zero before the intervention and one from the time of intervention
onward.
• Another form of permanent effect is a change of slope. Here the intervention is handled
using a piecewise linear trend.
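The spike and step variables are easy to construct by hand; a sketch with a hypothetical intervention at period 10 of a 20-period series:

```r
# Spike and step intervention dummies (hypothetical intervention at t = 10)
T_len <- 20
spike <- as.integer(seq_len(T_len) == 10)  # 1 in the intervention period only
step  <- as.integer(seq_len(T_len) >= 10)  # 0 before, 1 from t = 10 onward
# Either vector can then be included as an extra predictor in the regression.
```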
Trading days
• The number of trading days in a month can vary considerably and can have a
substantial effect on sales data. To allow for this, the number of trading days in each
month can be included as a predictor.
• An alternative that allows for the effects of different days of the week has the following
predictors:
x1 = number of Mondays in the month;
x2 = number of Tuesdays in the month;
⋮
x7 = number of Sundays in the month.
Distributed Lags
• It is often useful to include advertising expenditure as a predictor. However, since the
effect of advertising can last beyond the actual campaign, we need to include lagged
values of advertising expenditure.
• Thus, the following predictors may be used.
x1 = advertising for previous month;
x2 = advertising for two months previously;
⋮
xm = advertising for m months previously
• It is common to require the coefficients to decrease as the lag increases.
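A sketch of how such lagged predictors can be built in base R (the advertising figures and the lag_n helper are hypothetical):

```r
# Lagged advertising predictors (hypothetical monthly figures)
lag_n <- function(x, n) c(rep(NA, n), head(x, -n))  # shift x by n periods
adv <- c(5, 7, 6, 8, 9, 10, 4, 6)
x1 <- lag_n(adv, 1)  # advertising one month previously
x2 <- lag_n(adv, 2)  # advertising two months previously
# x1, x2, ... can then be supplied as additional predictors; the first
# few observations are lost to the NA values introduced by lagging.
```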
Holidays
• For monthly data:
• Christmas: always in December so part of monthly seasonal effect
• Easter: use a dummy variable 𝑣𝑡 = 1 if any part of Easter falls in that month, and 𝑣𝑡 = 0
otherwise.
• Ramadan and Chinese New Year can be handled similarly.
Fourier Series
• As an alternative to seasonal dummy variables, especially for long seasonal
periods, Fourier terms can be used.
• Every periodic function can be approximated by sums of sine and cosine terms,
for large enough K.
• K specifies how many pairs of sine and cosine terms to include.
• The maximum allowed is K = m/2, where m is the seasonal period.
• Choose K by minimizing the AICc.
• Regression with Fourier terms is often called "harmonic regression".
• These Fourier terms are produced using the fourier() function
• Example:
TSLM(y ~ trend() + fourier(K))
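For quarterly data (m = 4), the Fourier columns can be written out by hand, which shows why K = m/2 is the maximum (a base-R sketch; the fourier() special generates equivalent columns):

```r
# Fourier terms for quarterly data (m = 4), K = 2 pairs
m <- 4
t <- 1:12
C1 <- cos(2 * pi * 1 * t / m); S1 <- sin(2 * pi * 1 * t / m)
C2 <- cos(2 * pi * 2 * t / m); S2 <- sin(2 * pi * 2 * t / m)
# At K = m/2 the sine column S2 = sin(pi * t) is identically zero,
# leaving 2K - 1 = 3 informative columns (C1, S1, C2) -- which is why
# a K = 2 fit on quarterly data reports only C1_4, S1_4 and C2_4 terms.
```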
Example: Australian Beer Production
fourier_beer <- recent_production %>% model(TSLM(Beer ~ trend() + fourier(K = 2)))
report(fourier_beer)
Series: Beer
Model: TSLM
Residuals:
Min 1Q Median 3Q Max
-42.9029 -7.5995 -0.4594 7.9908 21.7895
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)        446.87920    2.87321  155.533  < 2e-16 ***
trend()             -0.34027    0.06657   -5.111  2.73e-06 ***
fourier(K = 2)C1_4   8.91082    2.01125    4.430  3.45e-05 ***
fourier(K = 2)S1_4 -53.72807    2.01125  -26.714  < 2e-16 ***
fourier(K = 2)C2_4 -13.98958    1.42256   -9.834  9.26e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.23 on 69 degrees of freedom
Multiple R-squared: 0.9243, Adjusted R-squared: 0.9199
F-statistic: 210.7 on 4 and 69 DF, p-value: < 2.22e-16

Example: Australian Beer Production
recent_production %>%
model(
f1 = TSLM(Beer ~ trend() + fourier(K=1)),
f2 = TSLM(Beer ~ trend() + fourier(K=2)),
season = TSLM(Beer ~ trend() + season())
) %>% glance()
# A tibble: 3 x 15
.model r_squared adj_r_squared sigma2 statistic p_value df log_lik AIC AICc BIC
1 f1 0.818 0.810 354. 105. 7.41e-26 4 -320. 440. 441. 452.
2 f2 0.924 0.920 150. 211. 6.97e-38 5 -288. 377. 379. 391.
3 season 0.924 0.920 150. 211. 6.97e-38 5 -288. 377. 379. 391.
# … with 4 more variables: CV
Example: Eating-out Expenditure
aus_cafe <- aus_retail %>% filter(
    Industry == "Cafes, restaurants and takeaway food services",
    year(Month) %in% 2004:2018
  ) %>% summarise(Turnover = sum(Turnover))
aus_cafe %>% autoplot(Turnover)
Example: Eating-out Expenditure
fit <- aus_cafe %>%
model(K1 = TSLM(log(Turnover) ~ trend() + fourier(K = 1)),
K2 = TSLM(log(Turnover) ~ trend() + fourier(K = 2)),
K3 = TSLM(log(Turnover) ~ trend() + fourier(K = 3)),
K4 = TSLM(log(Turnover) ~ trend() + fourier(K = 4)),
K5 = TSLM(log(Turnover) ~ trend() + fourier(K = 5)),
K6 = TSLM(log(Turnover) ~ trend() + fourier(K = 6)))
glance(fit) %>% select(.model, r_squared, adj_r_squared, AICc)
# A tibble: 6 x 4
.model r_squared adj_r_squared AICc
1 K1 0.962 0.962 -1085.
2 K2 0.966 0.965 -1099.
3 K3 0.976 0.975 -1160.
4 K4 0.980 0.979 -1183.
5 K5 0.985 0.984 -1234.
6 K6 0.985 0.984 -1232.
K5 is the preferred model: it has the lowest AICc (and ties for the highest adjusted R-squared).
Example: Eating-out Expenditure
Business Forecasting Analytics
ADM 4307 – Fall 2021
Regression Models