
ETC3231/5231 Business forecasting
Ch7. Regression models OTexts.org/fpp3/

Outline
1 The linear model with time series
2 Residual diagnostics
3 Some useful predictors for linear models
4 Selecting predictors and forecast evaluation
5 Forecasting with regression
6 Matrix formulation
7 Correlation, causation and forecasting
2

Outline
1 The linear model with time series
2 Residual diagnostics
3 Some useful predictors for linear models
4 Selecting predictors and forecast evaluation
5 Forecasting with regression
6 Matrix formulation
7 Correlation, causation and forecasting
3

Multiple regression and forecasting
y_t = β_0 + β_1 x_{1,t} + β_2 x_{2,t} + ⋯ + β_k x_{k,t} + ε_t

y_t is the variable we want to predict: the "response" variable.
Each x_{j,t} is numerical and is called a "predictor". The predictors are usually assumed to be known for all past and future times.
The coefficients β_1, . . . , β_k measure the effect of each predictor after taking account of the effects of all the other predictors in the model. That is, the coefficients measure the marginal effects.
ε_t is a white noise error term.
4

Example: US consumption expenditure
us_change %>%
  pivot_longer(c(Consumption, Income), names_to = "Series") %>%
  autoplot(value) +
  labs(y = "% change")

[Time plot: quarterly % change in US Consumption and Income]

Example: US consumption expenditure
us_change %>%
  ggplot(aes(x = Income, y = Consumption)) +
  labs(y = "Consumption (quarterly % change)",
       x = "Income (quarterly % change)") +
  geom_point() + geom_smooth(method = "lm", se = FALSE)

[Scatterplot: quarterly % change in Consumption against Income, with the least squares line]

Example: US consumption expenditure
fit_cons <- us_change %>%
model(lm = TSLM(Consumption ~ Income))
report(fit_cons)
## Series: Consumption
## Model: TSLM
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -2.582 -0.278  0.019  0.323  1.422
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   0.5445     0.0540   10.08  < 2e-16 ***
## Income        0.2718     0.0467    5.82  2.4e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.591 on 196 degrees of freedom
## Multiple R-squared: 0.147, Adjusted R-squared: 0.143
## F-statistic: 33.8 on 1 and 196 DF, p-value: 2e-08

Example: US consumption expenditure

[Time plots: quarterly % changes in Consumption, Income, Production, Savings and Unemployment]

Example: US consumption expenditure

[Scatterplot matrix of Consumption, Income, Production, Savings and Unemployment with pairwise correlations, e.g. Corr(Consumption, Income) = 0.384, Corr(Income, Savings) = 0.720, Corr(Production, Unemployment) = −0.768]

Example: US consumption expenditure

fit_consMR <- us_change %>%
  model(lm = TSLM(Consumption ~ Income + Production + Unemployment + Savings))
report(fit_consMR)

## Series: Consumption
## Model: TSLM
##
## Residuals:
##    Min     1Q Median     3Q   Max
## -0.906 -0.158 -0.036  0.136 1.155
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)   0.25311    0.03447    7.34  5.7e-12 ***
## Income        0.74058    0.04012   18.46  < 2e-16 ***
## Production    0.04717    0.02314    2.04    0.043 *
## Unemployment -0.17469    0.09551   -1.83    0.069 .
## Savings      -0.05289    0.00292  -18.09  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.31 on 193 degrees of freedom
## Multiple R-squared: 0.768, Adjusted R-squared: 0.763
## F-statistic: 160 on 4 and 193 DF, p-value: <2e-16

Example: US consumption expenditure

[Time plot: percentage change in US consumption expenditure, actual data and fitted values]

Example: US consumption expenditure

[Scatterplot: fitted (predicted) values against actual values of the percentage change in US consumption expenditure]

Example: US consumption expenditure

augment(fit_consMR) %>%
  gg_tsdisplay(.resid, plot_type = "hist")

[Residual diagnostics for fit_consMR: time plot of the residuals, ACF, and histogram]

Outline
1 The linear model with time series
2 Residual diagnostics
3 Some useful predictors for linear models
4 Selecting predictors and forecast evaluation
5 Forecasting with regression
6 Matrix formulation
7 Correlation, causation and forecasting
14

Multiple regression and forecasting
For forecasting purposes, we require the following assumptions:
ε_t are uncorrelated and have zero mean.
ε_t are uncorrelated with each x_{j,t}.
It is useful to also have ε_t ∼ N(0, σ²) when producing prediction intervals or doing statistical tests.
15

Residual plots
Useful for spotting outliers and whether the linear model was appropriate.
Scatterplot of residuals e_t against each predictor x_{j,t}.
Scatterplot of residuals against the fitted values ŷ_t.
Expect to see scatterplots resembling a horizontal band, with no values too far from the band and no patterns such as curvature or increasing spread.
16

Residual patterns
If a plot of the residuals vs any predictor in the model shows a pattern, then the relationship is nonlinear.
If a plot of the residuals vs any predictor not in the model shows a pattern, then the predictor should be added to the model.
If a plot of the residuals vs fitted values shows a pattern, then there is heteroscedasticity in the errors. (Could try a transformation.)
17
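To make these diagnostics concrete, here is a sketch of both kinds of scatterplot for the fit_consMR model estimated earlier (it assumes us_change and fit_consMR are still in the workspace):

library(fpp3)

# Residuals against each predictor included in the model
us_change %>%
  left_join(residuals(fit_consMR), by = "Quarter") %>%
  pivot_longer(Income:Unemployment,
               names_to = "regressor", values_to = "x") %>%
  ggplot(aes(x = x, y = .resid)) +
  geom_point() +
  facet_wrap(. ~ regressor, scales = "free_x") +
  labs(y = "Residuals", x = "")

# Residuals against the fitted values
augment(fit_consMR) %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  labs(x = "Fitted values", y = "Residuals")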

Outline
1 The linear model with time series
2 Residual diagnostics
3 Some useful predictors for linear models
4 Selecting predictors and forecast evaluation
5 Forecasting with regression
6 Matrix formulation
7 Correlation, causation and forecasting
18

Trend
Linear trend
x_t = t,   t = 1, 2, . . . , T
Strong assumption that the trend will continue.
19

Dummy variables
If a categorical variable takes only two values (e.g., ‘Yes’ or
‘No’), then an equivalent numerical variable can be constructed taking value 1 if yes and 0 if no. This is called a dummy variable.
20

Dummy variables
If there are more than two categories, then the variable can be coded using several dummy variables (one fewer than the total number of categories).
21

Beware of the dummy variable trap!
Using one dummy for each category gives too many dummy variables!
The regression will then be singular and inestimable.
Either omit the constant, or omit the dummy for one category.
The coefficients of the dummies are relative to the omitted category.
22

Uses of dummy variables
Seasonal dummies
For quarterly data: use 3 dummies.
For monthly data: use 11 dummies.
For daily data: use 6 dummies.
What to do with weekly data?
Outliers
If there is an outlier, you can use a dummy variable to remove its effect.
Public holidays
For daily data: if it is a public holiday, dummy = 1; otherwise dummy = 0.
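A minimal sketch of an outlier dummy on the us_change data (the quarter chosen here, 1980 Q2, is purely illustrative, not a claim about an actual outlier):

library(fpp3)

# Dummy variable equal to 1 in the (hypothetical) outlier quarter, 0 otherwise
us_change_dummy <- us_change %>%
  mutate(outlier = as.integer(Quarter == yearquarter("1980 Q2")))

fit_dummy <- us_change_dummy %>%
  model(TSLM(Consumption ~ Income + outlier))
report(fit_dummy)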

Beer production revisited
[Time plot: Australian quarterly beer production in megalitres]

Beer production revisited
Regression model
y_t = β_0 + β_1 t + β_2 d_{2,t} + β_3 d_{3,t} + β_4 d_{4,t} + ε_t

where d_{i,t} = 1 if t is in quarter i and 0 otherwise.
25

Beer production revisited
fit_beer <- recent_production %>% model(TSLM(Beer ~ trend() + season()))
report(fit_beer)
## Series: Beer
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.9 -7.6 -0.5 8.0 21.8
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)    441.8004     3.7335  118.33  < 2e-16 ***
## trend()         -0.3403     0.0666   -5.11  2.7e-06 ***
## season()year2  -34.6597     3.9683   -8.73  9.1e-13 ***
## season()year3  -17.8216     4.0225   -4.43  3.4e-05 ***
## season()year4   72.7964     4.0230   18.09  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.2 on 69 degrees of freedom
## Multiple R-squared: 0.924, Adjusted R-squared: 0.92
## F-statistic: 211 on 4 and 69 DF, p-value: <2e-16

Beer production revisited

augment(fit_beer) %>%
ggplot(aes(x = Quarter)) +
  geom_line(aes(y = Beer, colour = "Data")) +
  geom_line(aes(y = .fitted, colour = "Fitted")) +
  labs(y = "Megalitres", title = "Australian quarterly beer production") +
  scale_colour_manual(values = c(Data = "black", Fitted = "#D55E00"))

[Time plot: actual and fitted Australian quarterly beer production, in megalitres]

Beer production revisited
[Scatterplot: fitted versus actual quarterly beer production, with points coloured by quarter]

Beer production revisited
augment(fit_beer) %>% gg_tsdisplay(.resid, plot_type = "hist")

[Residual diagnostics for fit_beer: time plot of the residuals, ACF, and histogram]

Beer production revisited
fit_beer %>% forecast() %>% autoplot(recent_production)

[Forecast plot: quarterly beer production forecasts with 80% and 95% prediction intervals]

Fourier series
Periodic seasonality can be handled using pairs of Fourier terms:

s_k(t) = sin(2πkt / m),   c_k(t) = cos(2πkt / m)

y_t = a + b t + Σ_{k=1}^{K} [α_k s_k(t) + β_k c_k(t)] + ε_t

Every periodic function can be approximated by sums of sin and cos terms for large enough K.
Choose K by minimizing the AICc.
This is called "harmonic regression".

TSLM(y ~ trend() + fourier(K))
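A minimal sketch of choosing K by AICc, using the quarterly beer data from the aus_production dataset (loaded with fpp3); with quarterly data (m = 4), K can be at most m/2 = 2:

library(fpp3)

recent_production <- aus_production %>%
  filter(year(Quarter) >= 1992)

fit_K <- recent_production %>%
  model(
    K1 = TSLM(Beer ~ trend() + fourier(K = 1)),
    K2 = TSLM(Beer ~ trend() + fourier(K = 2))
  )

# Choose the value of K with the smallest AICc
glance(fit_K) %>% select(.model, AICc)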

Harmonic regression: beer production
fourier_beer <- recent_production %>%
  model(TSLM(Beer ~ trend() + fourier(K = 2)))
report(fourier_beer)

## Series: Beer
## Model: TSLM
##
## Residuals:
##   Min    1Q Median   3Q  Max
## -42.9  -7.6   -0.5  8.0 21.8
##
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)         446.8792     2.8732  155.53  < 2e-16 ***
## trend()              -0.3403     0.0666   -5.11  2.7e-06 ***
## fourier(K = 2)C1_4    8.9108     2.0112    4.43  3.4e-05 ***
## fourier(K = 2)S1_4  -53.7281     2.0112  -26.71  < 2e-16 ***
## fourier(K = 2)C2_4  -13.9896     1.4226   -9.83  9.3e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.2 on 69 degrees of freedom
## Multiple R-squared: 0.924, Adjusted R-squared: 0.92
## F-statistic: 211 on 4 and 69 DF, p-value: <2e-16

Intervention variables

Spikes: equivalent to a dummy variable for handling an outlier.
Steps: variable takes value 0 before the intervention and 1 afterwards.
Change of slope: variable takes value 0 before the intervention and values {1, 2, 3, . . . } afterwards.

Holidays

For monthly data:
Christmas: always in December, so it is part of the monthly seasonal effect.
Easter: use a dummy variable v_t = 1 if any part of Easter falls in that month, v_t = 0 otherwise.
Ramadan and Chinese New Year are handled similarly.

Trading days

With monthly data, if the observations vary depending on how many of each type of day fall in the month, then trading day predictors can be useful:
z_1 = number of Mondays in the month;
z_2 = number of Tuesdays in the month;
. . .
z_7 = number of Sundays in the month.

Distributed lags

Lagged values of a predictor. Example: x is advertising, which has a delayed effect:
x_1 = advertising for the previous month;
x_2 = advertising for two months previously;
. . .
x_m = advertising for m months previously.

Nonlinear trend

Piecewise linear trend with a bend ("knot") at time τ:
x_{1,t} = t
x_{2,t} = (t − τ)_+ = 0 if t < τ, and (t − τ) if t ≥ τ
β_1 is the trend slope before time τ; β_1 + β_2 is the trend slope after time τ.
More knots can be added, forming more (t − τ)_+ terms.
Quadratic or higher order trend: x_{1,t} = t, x_{2,t} = t², . . . NOT RECOMMENDED!

Piecewise linear trend

[Figure: a piecewise linear trend built from x_{1,t} = t and x_{2,t} = (t − τ)_+]

Example: Boston marathon winning times

marathon <- boston_marathon %>%
  filter(Event == "Men's open division") %>%
  select(-Event) %>%
  mutate(Minutes = as.numeric(Time)/60)
marathon %>% autoplot(Minutes) +
  labs(y = "Winning times in minutes")

[Time plot: Boston marathon winning times in minutes, men's open division]

Example: Boston marathon winning times
fit_trends <- marathon %>%
model(
# Linear trend
linear = TSLM(Minutes ~ trend()),
# Exponential trend
    exponential = TSLM(log(Minutes) ~ trend()),
    # Piecewise linear trend
piecewise = TSLM(Minutes ~ trend(knots = c(1940, 1980)))
)
fit_trends
## # A mable: 1 x 3
##    linear exponential piecewise
##   <model>     <model>   <model>
## 1  <TSLM>      <TSLM>    <TSLM>
40

Example: Boston marathon winning times
fit_trends %>% forecast(h=10) %>% autoplot(marathon)
Boston marathon winning times

[Forecast plot: winning times with 10-year forecasts from the linear, exponential and piecewise trend models, with 95% prediction intervals]

Example: Boston marathon winning times
fit_trends %>% select(piecewise) %>%
  augment() %>% gg_tsdisplay(.resid, plot_type = "histogram")

[Residual diagnostics for the piecewise model: time plot of the residuals, ACF, and histogram]

Outline
1 The linear model with time series
2 Residual diagnostics
3 Some useful predictors for linear models
4 Selecting predictors and forecast evaluation
5 Forecasting with regression
6 Matrix formulation
7 Correlation, causation and forecasting
43

Selecting predictors
When there are many predictors, how should we choose which ones to use?
We need a way of comparing competing models.
What not to do!
Plot y against a particular predictor (xj) and if it shows no noticeable relationship, drop it.
Do a multiple linear regression on all the predictors and disregard all variables whose p values are greater than 0.05.
Maximize R² or minimize MSE.
44

Comparing regression models
Computer output for regression will always give the R² value. This is a useful summary of the model.
It is equal to the square of the correlation between y and ŷ.
It is often called the "coefficient of determination".
It can also be calculated as follows:

R² = Σ(ŷ_t − ȳ)² / Σ(y_t − ȳ)²

It is the proportion of variance accounted for (explained) by the predictors.
45

Comparing regression models
However . . .
R² does not allow for "degrees of freedom".
Adding any variable tends to increase the value of R², even if that variable is irrelevant.
To overcome this problem, we can use the adjusted R²:

R̄² = 1 − (1 − R²) (T − 1) / (T − k − 1)

where k = number of predictors and T = number of observations.
Maximizing R̄² is equivalent to minimizing σ̂², where

σ̂² = (1 / (T − k − 1)) Σ_{t=1}^{T} ε_t²
46
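As a quick check of the formula, a sketch that reproduces the adjusted R² reported for the four-predictor consumption model (R² = 0.768, k = 4, and T = 198, since the residual degrees of freedom were 193):

# Adjusted R^2 from R^2, with k predictors and T observations
adj_r2 <- function(r2, k, T) 1 - (1 - r2) * (T - 1) / (T - k - 1)

adj_r2(0.768, k = 4, T = 198)   # approximately 0.763, matching report(fit_consMR)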

Akaike’s Information Criterion
AIC = −2 log(L) + 2(k + 2)
where L is the likelihood and k is the number of
predictors in the model.
AIC penalizes terms more heavily than R̄². Minimizing the AIC is asymptotically equivalent to minimizing the MSE via leave-one-out cross-validation (for any linear regression).
47

Corrected AIC
For small values of T, the AIC tends to select too many predictors, and so a bias-corrected version of the AIC has been developed.
AICc = AIC + 2(k + 2)(k + 3) / (T − k − 3)

As with the AIC, the AICc should be minimized.
48
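A direct transcription of this correction as an R helper; in practice fable reports AICc itself, for example via glance() on a fitted model such as fit_consMR from earlier:

library(fpp3)

# AICc from AIC, with k predictors and T observations
aicc <- function(aic, k, T) aic + 2 * (k + 2) * (k + 3) / (T - k - 3)

# fable computes these criteria directly:
glance(fit_consMR) %>% select(.model, AIC, AICc, BIC)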

Bayesian Information Criterion
BIC = −2 log(L) + (k + 2) log(T)

where L is the likelihood and k is the number of predictors in the model.
BIC penalizes terms more heavily than AIC
Also called SBIC and SC.
Minimizing BIC is asymptotically equivalent to leave-v-out cross-validation when
v = T[1 − 1/(log(T) − 1)].
49

Cross-validation
Leave-one-out cross-validation for regression can be carried out using the following steps.
Remove observation t from the data set, and fit the model using the remaining data. Then compute the error (e*_t = y_t − ŷ_t) for the omitted observation.
Repeat step 1 for t = 1, . . . , T.
Compute the MSE from {e*_1, . . . , e*_T}. We shall call this the CV.
The best model is the one with the minimum CV.
Isn't this time consuming? A naive version is sketched below; a shortcut follows later.
50
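A naive sketch of these steps for a simple regression of Consumption on Income, refitting T separate models; the hat-matrix shortcut at the end of these slides gives the same answer without the loop:

library(fpp3)   # for the us_change data

y <- us_change$Consumption
x <- us_change$Income
T <- length(y)

cv_errors <- numeric(T)
for (t in seq_len(T)) {
  fit <- lm(y[-t] ~ x[-t])                     # fit without observation t
  yhat <- coef(fit)[1] + coef(fit)[2] * x[t]   # predict the omitted observation
  cv_errors[t] <- y[t] - yhat
}
CV <- mean(cv_errors^2)
CV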

Comparing regression models
glance(fit_trends) %>%
select(.model, r_squared, adj_r_squared, AICc, CV)
## # A tibble: 3 x 5
##   .model      r_squared adj_r_squared  AICc       CV
##   <chr>           <dbl>         <dbl> <dbl>    <dbl>
## 1 linear          0.728         0.726  452. 39.1
## 2 exponential     0.744         0.742 -779.  0.00176
## 3 piecewise       0.767         0.761  438. 34.8
Be careful making comparisons when transformations are used.
51

Choosing regression variables
Best subsets regression
Fit all possible regression models using one or more of the predictors.
Choose the best model based on one of the measures of predictive ability (CV, AIC, AICc).
Warning!
If there are a large number of predictors, this is not possible.
For example, 44 predictors leads to 18 trillion possible models!
52

Choosing regression variables
Backwards stepwise regression
Start with a model containing all variables.
Try subtracting one variable at a time. Keep the model if it has lower CV or AICc.
Iterate until no further improvement.
Forwards stepwise regression
Start with a model containing only a constant.
Add one variable at a time. Keep the model if it has lower CV or AICc.
Iterate until no further improvement.
Hybrid backwards and forwards also possible.
Stepwise regression is not guaranteed to lead to the best possible model.
53
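A rough sketch of a backwards search on the us_change data, using lm() and the AICc formula given earlier rather than the fable machinery; it drops one predictor at a time for as long as AICc improves (an illustration, not the textbook's implementation):

library(fpp3)   # for the us_change data

aicc_lm <- function(fit) {
  k <- length(coef(fit)) - 1              # number of predictors
  T <- nobs(fit)
  AIC(fit) + 2 * (k + 2) * (k + 3) / (T - k - 3)
}

current <- c("Income", "Production", "Savings", "Unemployment")
best <- aicc_lm(lm(reformulate(current, "Consumption"), data = us_change))

repeat {
  if (length(current) == 1) break
  # AICc of each model with one predictor removed
  cand <- sapply(current, function(v) {
    aicc_lm(lm(reformulate(setdiff(current, v), "Consumption"), data = us_change))
  })
  if (min(cand) >= best) break            # no improvement: stop
  current <- setdiff(current, names(which.min(cand)))
  best <- min(cand)
}
current   # predictors retained by the backwards search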

What should you use?
Notes
Stepwise regression is not guaranteed to lead to the best possible model.
Inference on coefficients of final model will be wrong.
Choice: CV, AIC, AICc, BIC, R̄²
BIC tends to choose models that are too small for prediction (however, it can be useful when k is large).
R̄² tends to select models that are too large.
AIC is also slightly biased towards larger models (especially when T is small).
Empirical studies in forecasting show AIC is better than BIC for forecast accuracy.
Choice between AICc and CV (double-check AIC and BIC where possible).

Outline
1 The linear model with time series
2 Residual diagnostics
3 Some useful predictors for linear models
4 Selecting predictors and forecast evaluation
5 Forecasting with regression
6 Matrix formulation
7 Correlation, causation and forecasting
55

Ex-ante versus ex-post forecasts
Ex ante forecasts are made using only information available in advance.
→ require forecasts of the predictors.
Ex post forecasts are made using later information on the predictors.
→ useful for studying the behaviour of forecasting models.
Trend, seasonal and calendar variables are all known in advance, so these do not need to be forecast.
56

Scenario based forecasting
Assumes possible scenarios for the predictor variables
Prediction intervals for scenario based forecasts do not include the uncertainty associated with the future values of the predictor variables.
57

Building a predictive regression model
If getting forecasts of predictors is difficult, you can use lagged predictors instead.
y_t = β_0 + β_1 x_{1,t−h} + ⋯ + β_k x_{k,t−h} + ε_t
A different model for each forecast horizon h.
58
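A minimal sketch of a lagged-predictor model on the us_change data, for a hypothetical horizon of h = 4 quarters; the predictor is lagged before fitting so that only information available h quarters earlier is used:

library(fpp3)

h <- 4   # forecast horizon (illustrative choice)

fit_lagged <- us_change %>%
  mutate(Income_lag = lag(Income, h)) %>%   # income observed h quarters earlier
  filter(!is.na(Income_lag)) %>%
  model(TSLM(Consumption ~ Income_lag))
report(fit_lagged)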

Beer production
recent_production <- aus_production %>% filter(year(Quarter) >= 1992)
fit_beer <- recent_production %>% model(TSLM(Beer ~ trend() + season()))
fc_beer <- forecast(fit_beer)
fc_beer %>% autoplot(recent_production) +
  ggtitle("Forecasts of beer production using regression") +
  ylab("megalitres")

[Forecast plot: quarterly beer production forecasts from the regression model, with 80% and 95% prediction intervals]

US Consumption
fit_consBest <- us_change %>%
model(
TSLM(Consumption ~ Income + Savings + Unemployment)
)
future_scenarios <- scenarios(
  Increase = new_data(us_change, 4) %>%
    mutate(Income = 1, Savings = 0.5, Unemployment = 0),
  Decrease = new_data(us_change, 4) %>%
    mutate(Income = -1, Savings = -0.5, Unemployment = 0),
  names_to = "Scenario")
fc <- forecast(fit_consBest, new_data = future_scenarios)

US Consumption

us_change %>% autoplot(Consumption) +
  labs(y = "% change in US consumption") +
  autolayer(fc) +
  labs(title = "US consumption", y = "% change")

[Plot: historical quarterly % change in US consumption with four-quarter forecasts under the Increase and Decrease scenarios, with 80% and 95% prediction intervals]

Outline
1 The linear model with time series
2 Residual diagnostics
3 Some useful predictors for linear models
4 Selecting predictors and forecast evaluation
5 Forecasting with regression
6 Matrix formulation
7 Correlation, causation and forecasting
62

Matrix formulation

y_t = β_0 + β_1 x_{1,t} + β_2 x_{2,t} + ⋯ + β_k x_{k,t} + ε_t

Let y = (y_1, . . . , y_T)′, ε = (ε_1, . . . , ε_T)′, β = (β_0, β_1, . . . , β_k)′ and

X = [ 1  x_{1,1}  x_{2,1}  …  x_{k,1}
      1  x_{1,2}  x_{2,2}  …  x_{k,2}
      ⋮     ⋮        ⋮            ⋮
      1  x_{1,T}  x_{2,T}  …  x_{k,T} ]

Then y = Xβ + ε.
63

Matrix formulation
Least squares estimation
Minimize: (y − Xβ)′(y − Xβ)

Differentiating with respect to β gives

β̂ = (X′X)⁻¹X′y

(the "normal equation"), and

σ̂² = (y − Xβ̂)′(y − Xβ̂) / (T − k − 1)

Note: If you fall for the dummy variable trap, (X′X) is a singular matrix.
64
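A minimal sketch of these matrix computations for the simple Consumption-on-Income regression; lm() uses a more stable QR decomposition internally, so the explicit normal equation here is for illustration only:

library(fpp3)   # for the us_change data

y <- us_change$Consumption
X <- cbind(1, us_change$Income)      # design matrix: intercept column plus Income
T <- nrow(X); k <- ncol(X) - 1       # T observations, k predictors

beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # (X'X)^{-1} X'y via the normal equation
resid <- y - X %*% beta_hat
sigma2_hat <- sum(resid^2) / (T - k - 1)    # residual variance estimate

beta_hat
coef(lm(Consumption ~ Income, data = us_change))   # cross-check against lm()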

Likelihood
If the errors are iid and normally distributed, then y ∼ N(Xβ, σ²I).
So the likelihood is

L = (1 / (σ^T (2π)^{T/2})) exp( −(1 / (2σ²)) (y − Xβ)′(y − Xβ) )

which is maximized when (y − Xβ)′(y − Xβ) is minimized.
So MLE = OLS.
65

Multiple regression forecasts
Optimal forecasts

ŷ* = E(y* | y, X, x*) = x* β̂ = x* (X′X)⁻¹X′y

where x* is a row vector containing the values of the predictors for the forecasts (in the same format as X).

Forecast variance

Var(y* | X, x*) = σ² [1 + x* (X′X)⁻¹ (x*)′]

This ignores any errors in x*.
95% prediction intervals assuming normal errors:

ŷ* ± 1.96 √Var(y* | X, x*)
66

Multiple regression forecasts
Fitted values

ŷ = Xβ̂ = X(X′X)⁻¹X′y = Hy

where H = X(X′X)⁻¹X′ is the "hat matrix".

Leave-one-out residuals

Let h_1, . . . , h_T be the diagonal values of H. Then the cross-validation statistic is

CV = (1/T) Σ_{t=1}^{T} [e_t / (1 − h_t)]²

where e_t is the residual obtained from fitting the model to all T observations.
67
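A minimal sketch of the hat-matrix shortcut for the same simple Consumption-on-Income regression; it returns the leave-one-out CV statistic without refitting the model T times:

library(fpp3)   # for the us_change data

y <- us_change$Consumption
X <- cbind(1, us_change$Income)

H <- X %*% solve(t(X) %*% X) %*% t(X)   # hat matrix H = X (X'X)^{-1} X'
h <- diag(H)                            # leverages h_1, ..., h_T
e <- y - H %*% y                        # ordinary residuals from the full fit

CV <- mean((e / (1 - h))^2)             # CV = (1/T) sum [e_t / (1 - h_t)]^2
CV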

Outline
1 The linear model with time series
2 Residual diagnostics
3 Some useful predictors for linear models
4 Selecting predictors and forecast evaluation
5 Forecasting with regression
6 Matrix formulation
7 Correlation, causation and forecasting
68

Correlation is not causation
When x is useful for predicting y, it is not necessarily causing y.
e.g., predict number of drownings y using number of ice-creams sold x.
Correlations are useful for forecasting, even when there is no causality.
Better models usually involve causal relationships (e.g., temperature x and people z to predict drownings y).
69

Multicollinearity
In regression analysis, multicollinearity occurs when:
Two predictors are highly correlated (i.e., the correlation between them is close to ±1).
A linear combination of some of the predictors is highly correlated with another predictor.
A linear combination of one subset of predictors is highly correlated with a linear combination of another subset of predictors.
70

Multicollinearity
If multicollinearity exists. . .
the numerical estimates of coefficients may be wrong (worse in Excel than in a statistics package).
don't rely on the p-values to determine significance.
there is no problem with model predictions provided the predictors used for forecasting are within the range used for fitting.
omitting variables can help.
combining variables can help.
71

Outliers and influential observations
Things to watch for
Outliers: observations that produce large residuals.
Influential observations: removing them would markedly change the coefficients. (Often outliers in the x variable).
Lurking variable: a predictor not included in the regression but which has an important effect on the response.
Points should not normally be removed without a good explanation of why they are different.
72