
Lecture 1: Introduction to Forecasting
UCSD, January 9 2017

Allan Timmermann

UC San Diego

Timmermann (UCSD) Forecasting Winter, 2017 1 / 64

1 Course objectives

2 Challenges facing forecasters

3 Forecast Objectives: the Loss Function

4 Common Assumptions on Loss

5 Specific Types of Loss Functions

6 Multivariate loss

7 Does the loss function matter?

8 Informal Evaluation Methods

9 Out-of-Sample Forecast Evaluation

10 Some easy and hard to predict variables

11 Weak predictability but large economic gains

Course objectives: Develop

Skills in analyzing, modeling and working with time series data from
finance and economics

Ability to construct forecasting models and generate forecasts

formulating a class of models – using information intelligently
model selection
estimation – making best use of historical data

Develop creativity in posing forecasting questions, collecting and
using often incomplete data

which data help me build a better forecasting model?

Ability to critically evaluate and compare forecasts

reasonable (simple) benchmarks
skill or luck? Overfitting (data mining)
Compete or combine?


Ranking forecasters: Mexican inflation


Forecast situations

Forecasts are used to guide current decisions that affect the future
welfare of a decision maker (forecast user)

Predicting my grade – updating information on the likely grade as the
course progresses
Choosing between a fixed-rate mortgage (interest rate fixed for 20
years) versus a floating-rate (variable) mortgage

Depends on interest rate and inflation forecast

Political or sports outcomes – prediction markets
Investing in the stock market. How volatile will the stock market be?
Predicting Chinese property prices. Supply and demand considerations,
economic growth

Structural versus reduced-form approaches
Depends on the forecast horizon: 1 month vs 10 years


Forecasting and decisions

Credit card company deciding which transactions are potentially
fraudulent and should be denied (in real time)

requires fitting a model to past credit card transactions
binary data (zero-one)

Central Bank predicting the state of the economy – timing issues

Predicting which fund manager (if any) or asset class will outperform

Forecasting the outcome of the world cup:
http://www.goldmansachs.com/our-thinking/outlook/world-cup-sections/world-cup-book-2014-statistical-model.html


Forecasting the outcome of the world cup


Key issues

Decision maker’s actions depend on predicted future outcomes

Trade off relative costs of over- or underpredicting outcomes
Actions and forecasts are inextricably linked

good forecasts are expected to lead to good decisions
bad forecasts are expected to lead to poor decisions

Forecast is an intermediate input in a decision process, rather than an
end product of separate interest

Loss function weighs the cost of possible forecast errors – like a utility
function uses preferences to weigh different outcomes


Loss functions

Forecasts play an important role in almost all decision problems where
a decision maker’s utility or wealth is affected by his current and
future actions and depends on unknown future events

Central Banks

Forecast inflation, unemployment, GDP growth
Action: interest rate; monetary policy
Trade off cost of over- vs. under-predictions

Firms

Forecast sales
Action: production level, new product launch
Trade off inventory vs. stock-out/goodwill costs

Money managers

Forecast returns (mean, variance, density)
Action: portfolio weights/trading strategy
Trade off Risk vs. return


Ways to generate forecasts

Rule of thumb. Simple decision rule that is not optimal, but may be
robust

Judgmental/subjective forecast, e.g., expert opinion

Combine with other information/forecasts

Quantitative models

“… an estimated forecasting model provides a characterization of what
we expect in the present, conditional upon the past, from which we
infer what to expect in the future, conditional upon the present and the
past. Quite simply, we use the estimated forecasting model to
extrapolate the observed historical data.” (Frank Diebold, Elements of
Forecasting).

Combine different types of forecasts


Forecasts: key considerations

Forecasting models are simplified approximations to a complex reality

How do we make the right shortcuts?
Which methods seem to work in general or in specific situations?

Economic theory may suggest relevant predictor variables, but is silent
about functional form, dynamics of forecasting model

combine art (judgment) and science
how much can we learn from the past?


Forecast object – what are we trying to forecast?

Event outcome: predict if a certain event will happen

Will a bank or hedge fund close?
Will oil prices fall below $40/barrel in 2017?
Will Europe experience deflation in 2017?

Event timing: it is known that an event will happen, but unknown
when it will occur

When will US stocks enter a “bear” market (Dow drops by 10%)?

Time-series: forecasting future values of a continuous variable by
means of current and past data

Predicting the level of the Dow Jones Index on March 15, 2017


Forecast statement

Point forecast

Single number summarizing “best guess”. No information on how
certain or precise the point forecast is. Random shocks affect all
time-series so a non-zero forecast error is to be expected even from a
very good forecast
Ex: US GDP growth for 2017 is expected to be 2.5%

Interval forecast

Lower and upper bound on outcome. Gives a range of values inside
which we expect the outcome will fall with some probability (e.g., 50%
or 95%). Confidence interval for the predicted variable. Length of
interval conveys information about forecast uncertainty.
Ex: 90% chance US GDP growth will fall between 1% and 4%

Density or probability forecast

Entire probability distribution of the future outcome
Ex: US GDP growth for 2017 is Normally distributed N(2.5,1)
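For a normal density forecast, the point and interval statements follow directly from the density. A minimal sketch using only the Python standard library, with the slide's N(2.5, 1) example read as mean 2.5 and standard deviation 1:

```python
from statistics import NormalDist

# Density forecast from the slide: US GDP growth ~ N(2.5, 1)
density = NormalDist(mu=2.5, sigma=1.0)

# Point forecast: the "best guess" (the mean, optimal under MSE loss)
point = density.mean

# 90% interval forecast: the 5th and 95th percentiles of the density
lower, upper = density.inv_cdf(0.05), density.inv_cdf(0.95)

print(point, round(lower, 2), round(upper, 2))
```

The interval here is narrower or wider depending on sigma, which is how the interval statement conveys forecast uncertainty that the point forecast alone cannot.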


Forecast horizon

The best forecasting model is likely to depend on whether we are
forecasting 1 minute, 1 day, 1 month or 1 year ahead

We refer to an h-step-ahead forecast, where h (short for “horizon”)
is the number of time periods ahead that we predict

Often you hear the argument that “fundamentals matter in the long
run, psychological factors are more important in the short run”


Information set

Do we simply use past values of a series itself or do we include a
larger information set?
Suppose we wish to forecast some outcome y for period T + 1 and
have historical data on this variable for t = 1, ..., T. The univariate
information set consists of the series itself up to time T:

I_T^univariate = {y1, ..., yT}

If data on other series, zt (typically an N × 1 vector), are available,
we have a multivariate information set

I_T^multivariate = {y1, ..., yT, z1, ..., zT}

It is often important to establish whether a forecast can benefit from
using such additional information


Loss function: notations

Outcome: Y

Forecast: f

Forecast error: e = Y − f
Observed data: Z

Loss function: L(f, Y) → R
maps the inputs f, Y to the real number line R
yields a complete ordering of forecasts
describes in relative terms how costly it is to make forecast errors


Loss Function Considerations

Choice of loss function that appropriately measures trade-offs is
important for every facet of the forecasting exercise and affects

which forecasting models are preferred
how parameters are estimated
how forecasts are evaluated and compared

Loss function reflects the economics of the decision problem

Financial analysts’ forecasts: Hong and Kubik (2003), Lim (2001)

Analysts tend to bias their earnings forecasts (walk-down effect)

Sometimes a forecast is best viewed as a signal in a strategic game
that explicitly accounts for the forecast provider’s incentives


Constructing a loss function

For profit maximizing investors the natural choice of loss is the
function relating payoffs (through trading rule) to the forecast and
realized returns

Link between loss and utility functions: both are used to minimize risk
arising from economic decisions

Loss is sometimes viewed as the negative of utility

U(f, Y) ≈ −L(Y, f)

The majority of forecasting papers use simple ‘off the shelf’ statistical loss
functions such as Mean Squared Error (MSE)


Common Assumptions on Loss

Granger (1999) proposes three ‘required’ properties for error loss
functions, L(f, y) = L(y − f) = L(e):

A1. L(0) = 0 (minimal loss of zero for perfect forecast);
A2. L(e) ≥ 0 for all e;
A3. L(e) is monotonically non-decreasing in |e| :

L(e1) ≥ L(e2) if e1 > e2 > 0
L(e1) ≥ L(e2) if e1 < e2 < 0

A1: normalization
A2: imperfect forecasts are more costly than perfect ones
A3: regularity condition - bigger forecast mistakes are (weakly) costlier than smaller mistakes (of the same sign)

Additional Assumptions on Loss

Symmetry: L(y − f, y) = L(y + f, y)

Granger and Newbold (1986, p. 125): “.. an assumption of symmetry about the conditional mean ... is likely to be an easy one to accept ... an assumption of symmetry for the cost function is much less acceptable.”

Homogeneity: for some positive function h(a): L(ae) = h(a)L(e)

scaling doesn’t matter

Differentiability of loss with respect to the forecast (regularity condition)

Squared Error (MSE) Loss

L(e) = ae², a > 0

Satisfies the three Granger properties
Homogenous, symmetric, differentiable everywhere
Convex: penalizes large forecast errors at an increasing rate
Optimal forecast:

f* = arg min_f ∫ (y − f)² p_Y(y) dy

First order condition:

f* = ∫ y p_Y(y) dy = E(y)

The optimal forecast under MSE loss is the conditional mean
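This first-order condition can be checked numerically: across a grid of candidate point forecasts, average squared-error loss is minimized at the mean, even for a skewed outcome distribution. A sketch with simulated data (the gamma distribution is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.gamma(shape=2.0, scale=1.0, size=100_000)  # skewed outcome, mean 2

# Average squared-error loss for each candidate forecast f on a grid
grid = np.linspace(0.0, 5.0, 501)
avg_loss = np.array([np.mean((y - f) ** 2) for f in grid])
f_star = grid[np.argmin(avg_loss)]

print(f_star, y.mean())  # the minimizer sits at the sample mean
```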


Piece-wise Linear (lin-lin) Loss

L(e) = (1 − α) e 1{e>0} − α e 1{e≤0}, 0 < α < 1

where 1{e>0} = 1 if e > 0 and 1{e>0} = 0 otherwise (indicator variable)

Weight on positive forecast errors: (1− α)
Weight on negative forecast errors: α

Lin-lin loss satisfies the three Granger properties and is homogeneous
and differentiable everywhere with respect to f, except at zero

Lin-lin loss does not penalize large errors as much as MSE loss

Mean absolute error (MAE) loss arises if α = 1/2:

L(e) = |e|


MSE vs. piece-wise Linear (lin-lin) Loss

[Figure: lin-lin loss L(e) plotted against MSE loss for α = 0.25, α = 0.5 (MAE loss), and α = 0.75]


Optimal forecast under lin-lin Loss

Expected loss under lin-lin loss:

E_Y[L(Y − f)] = (1 − α) E[(Y − f) 1{Y>f}] − α E[(Y − f) 1{Y≤f}]

First order condition:
f* = P_Y^{-1}(1 − α)

PY : CDF of Y
The optimal forecast is the (1− α) quantile of Y
α = 1/2 : optimal forecast is the median of Y
As α increases towards one, the optimal forecast moves further into
the left tail of the predicted outcome distribution
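The quantile result can be verified by simulation: compute average lin-lin loss over a grid of forecasts and locate the minimum. A sketch assuming a standard normal outcome and α = 0.75 (an arbitrary illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=200_000)
alpha = 0.75  # weight on negative forecast errors

def linlin_loss(e, alpha):
    # L(e) = (1 - alpha) e 1{e>0} - alpha e 1{e<=0}
    return np.where(e > 0, (1 - alpha) * e, -alpha * e)

grid = np.linspace(-2.0, 2.0, 401)
avg_loss = np.array([linlin_loss(y - f, alpha).mean() for f in grid])
f_star = grid[np.argmin(avg_loss)]

# Theory: the optimum is the (1 - alpha) = 0.25 quantile of Y
q_star = np.quantile(y, 1 - alpha)
print(f_star, q_star)
```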


Optimal forecast of N(0,1) variable under lin-lin loss

[Figure: optimal forecast f* of an N(0,1) variable as a function of α under lin-lin loss]


Linex Loss

L(e) = exp(a2 e) − a2 e − 1, a2 ≠ 0

Differentiable everywhere

Asymmetric: a2 controls both the degree and direction of asymmetry

a2 > 0 : loss is approximately linear for e < 0 and approximately exponential for e > 0

Large underpredictions are very costly (f < y , so e = y − f > 0)

Converse is true when a2 < 0

MSE versus Linex Loss

[Figure: MSE loss compared with right-skewed linex loss (a2 = 1) and left-skewed linex loss (a2 = −1)]

Linex Loss

Suppose Y ∼ N(µY, σ²Y). Then

E[L(e)] = exp(a2(µY − f) + (a2²/2) σ²Y) − a2(µY − f) − 1

Optimal forecast:

f* = µY + (a2/2) σ²Y

Under linex loss, the optimal forecast depends on both the mean and variance of Y (µY and σ²Y) as well as on the curvature parameter of the loss function, a2

Optimal bias under Linex Loss for N(0,1) variable

[Figure: densities of the forecast error under MSE loss and under linex loss with a2 = 1 and a2 = −1]

Multivariate Loss Functions

Multivariate MSE loss with n errors e = (e1, ..., en)′:

MSE(A) = e′Ae

A is a positive definite n × n weighting matrix

This satisfies the basic assumptions for a loss function

When A = In, covariances can be ignored and the loss function simplifies to MSE(In) = E[e′e] = ∑_{i=1}^{n} E[e_i²], i.e., the sum of the individual mean squared errors

Does the loss function matter?
Cenesizoglu and Timmermann (2012) compare statistical and economic measures of forecasting performance across a large set of stock return prediction models with time-varying mean and volatility

Economic performance is measured through the certainty equivalent return (CER), i.e., the risk-adjusted return

Statistical performance is measured through mean squared error (MSE)

Performance is measured relative to that of a constant expected return (prevailing mean) benchmark

It is common for forecast models to produce worse mean squared error (MSE) but better return performance than the benchmark

The relation between statistical and economic measures of forecasting performance can be weak

Does loss function matter? Cenesizoglu and Timmermann

[Figure: statistical vs. economic performance measures from Cenesizoglu and Timmermann (2012)]

Percentage of models with worse statistical but better economic performance than prevailing mean (CT, 2012)

CER is certainty equivalent return
Sharpe is the Sharpe ratio
RAR is risk-adjusted return
RMSE is root mean squared (forecast) error

Example: Directional Trading system

Consider the decisions of a risk-neutral ‘market timer’ whose utility is linear in the return on the market portfolio (y):

U(δ(f), y) = δy

Investor’s decision rule, δ(f): go ‘long’ one unit in the risky asset if a positive return is predicted (f > 0), otherwise go short one unit:

δ(f) = 1 if f ≥ 0, −1 if f < 0

Let sign(y) = 1 if y > 0, otherwise sign(y) = 0. Payoff:

U(y, δ(f)) = (2 sign(f) − 1) y

Sign and magnitude of y and sign of f matter to trader’s utility
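The decision rule and payoff above can be sketched with simulated data. This is an illustration only: the return process and the correlated forecast below are made-up assumptions, not a calibrated model:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
y = rng.normal(0.005, 0.04, size=n)          # realized market returns
f = 0.6 * y + rng.normal(0.0, 0.04, size=n)  # stylized correlated forecasts

delta = np.where(f >= 0, 1.0, -1.0)  # long one unit if f >= 0, else short one
payoff = delta * y                   # risk-neutral trader's utility

hit_rate = np.mean(np.sign(f) == np.sign(y))  # directional accuracy
print(payoff.mean(), hit_rate)
```

A forecast that gets the sign right more often than not earns a positive average payoff even if its magnitudes are poor, which is the point of the slide.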


Example: Directional Trading system (cont.)

Which forecast approach is best under the directional trading rule?

Since the trader ignores information about the magnitude of the
forecast, an approach that focuses on predicting only the sign of the
excess return could make sense

Leitch and Tanner (1991) studied forecasts of T-bill futures:

Professional forecasters reported predictions with higher mean squared
error (MSE) than those from simple time-series models

Puzzling since the time-series models incorporate far less information
than the professional forecasts

When measured by their ability to generate profits or correctly forecast
the direction of future interest rate movements the professional
forecasters did better than the time-series models
Professional forecasters’ objectives are poorly approximated by MSE
loss – closer to directional or ‘sign’ loss


Common estimates of forecasting performance

Define the forecast error e_{t+h|t} = y_{t+h} − f_{t+h|t}. Then

MSE = (1/T) ∑_{t=1}^{T} e_{t+h|t}²

RMSE = √[ (1/T) ∑_{t=1}^{T} e_{t+h|t}² ]

MAE = (1/T) ∑_{t=1}^{T} |e_{t+h|t}|

Directional accuracy (DA): let I{x_{t+1}>0} = 1 if x_{t+1} > 0, otherwise
I{x_{t+1}>0} = 0. Then an estimate of DA is

DA = (1/T) ∑_{t=1}^{T} I{y_{t+h} × f_{t+h|t} > 0}
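These estimates translate directly into code; a minimal sketch with made-up toy numbers:

```python
import numpy as np

def forecast_metrics(y, f):
    """MSE, RMSE, MAE and directional accuracy for forecasts f of outcomes y."""
    e = y - f                    # forecast errors e_{t+h|t}
    mse = np.mean(e ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(e))
    da = np.mean(y * f > 0)      # share of correctly signed forecasts
    return mse, rmse, mae, da

y = np.array([0.5, -0.2, 0.3, -0.1])
f = np.array([0.4, 0.1, 0.2, -0.3])
mse, rmse, mae, da = forecast_metrics(y, f)
print(mse, rmse, mae, da)
```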


Forecast evaluation

ft+h|t : forecast of yt+h given information available at time t
Given a sequence of forecasts, f_{t+h|t}, and outcomes, y_{t+h},
t = 1, ..., T, it is natural to ask if the forecast was “optimal” or
obviously deficient

Questions posed by forecast evaluation are related to the
measurement of predictive accuracy

Absolute performance measures the accuracy of an individual
forecast relative to the outcome, using either an economic
(loss-based) or a statistical metric

Relative performance compares the performance of one or several
forecasts against some benchmark


Forecast evaluation (cont.)

Forecast evaluation amounts to understanding if the loss from a given
forecast is “small enough”

Informal methods – graphical plots, decompositions
Formal methods – distribution of test statistic for sample averages of
loss estimates can depend on how the forecasts were constructed, e.g.
which estimation method was used

The method (not only the model) used to construct the forecast
matters – expanding vs. rolling estimation window

Formal evaluation of an individual forecast requires testing whether
the forecast is optimal with respect to some loss function and a
specific information set

Rejection of forecast optimality suggests that the forecast can be
improved


Efficient Forecast: Definition

A forecast is efficient (optimal) if no other forecast using the available
data, xt ∈ It, can be used to generate a smaller expected loss
Under MSE loss:

f̂*_{t+h|t} = arg min_{f̂(xt)} E[(y_{t+h} − f̂(xt))²]

If we can use information in It to produce a more accurate forecast,
then the original forecast would be suboptimal

Efficiency is conditional on the information set

weak form forecast efficiency tests include only past forecasts and
past outcomes: It = {yt, yt−1, ..., f̂_{t|t−1}, e_{t|t−1}, ...}
strong form efficiency tests extend this to include all other variables
xt ∈ It
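An efficiency test of this kind boils down to an orthogonality check: regress forecast errors on variables in the information set and see whether the slope is zero. A sketch with a simulated data-generating process (the 0.8 coefficient is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
T = 5000
x = rng.normal(size=T)
y = 0.8 * x + rng.normal(size=T)   # outcome depends on x_t, known at time t

f_ignore = np.zeros(T)             # forecast that ignores x_t
f_use = 0.8 * x                    # conditional-mean forecast using x_t

def ols_slope(e, x):
    """Slope from regressing forecast errors on x (zero under efficiency)."""
    return np.cov(e, x)[0, 1] / np.var(x)

b_ignore = ols_slope(y - f_ignore, x)
b_use = ols_slope(y - f_use, x)
print(b_ignore, b_use)  # x_t predicts the first error, not the second
```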


Optimality under MSE loss

First order condition for an optimal forecast under MSE loss:

E[∂(y_{t+h} − f_{t+h|t})² / ∂f_{t+h|t}] = −2E[y_{t+h} − f_{t+h|t}] = −2E[e_{t+h|t}] = 0

Similarly, conditional on information at time t, It :
E [et+h|t |It ] = 0

The expected value of the forecast error must equal zero given
current information, It
Test E [et+h|txt ] = 0 for all variables xt ∈ It known at time t
If the forecast is optimal, no variable known at time t can predict its
future forecast error et+h|t . Otherwise the forecast wouldn’t be
optimal
If I can predict that my forecast will be too low, I should increase my
forecast

Optimality properties under Squared Error Loss

1 Optimal forecasts are unbiased: the forecast error et+h|t has zero
mean, both conditionally and unconditionally:

E [et+h|t ] = E [et+h|t |It ] = 0

2 h-period forecast errors (et+h|t) are uncorrelated with information
available at the time the forecast was computed (It). In particular,
single-period forecast errors, et+1|t , are serially uncorrelated:

E [et+1|tet |t−1] = 0

3 The variance of the forecast error (et+h|t) increases (weakly) in the
forecast horizon, h :

Var(et+h+1|t ) ≥ Var(et+h|t ) for all h ≥ 1


Optimality properties under Squared Error Loss (cont.)

Forecasts should be unbiased. Why? If they were biased, we could
improve the forecast simply by correcting for the bias

Suppose f_{t+1|t} is biased:

y_{t+1} = 1 + f_{t+1|t} + ε_{t+1}, ε_{t+1} ∼ WN(0, σ²)

The bias-corrected forecast

f*_{t+1|t} = 1 + f_{t+1|t}

is more accurate than f_{t+1|t}

Forecast errors should be unpredictable:

Suppose y_{t+1} − f_{t+1|t} = e_{t+1} = 0.5 e_t + ε_{t+1}, so the one-step forecast
error is serially correlated
Adding back 0.5 e_t to the original forecast yields a more accurate
forecast: f*_{t+1|t} = f_{t+1|t} + 0.5 e_t is better than f_{t+1|t}

Variance of forecast error increases in the forecast horizon
We learn more information as we get closer to the forecast “target”


Informal evaluation methods (Greenbook forecasts)

Time-series graphs of forecasts and outcomes {f_{t+h|t}, y_{t+h}}, t = 1, ..., T

[Figures: actual and forecast annualized GDP growth and inflation rate, 1965–2010]


Informal evaluation methods (Greenbook forecasts)

Scatterplots of {f_{t+h|t}, y_{t+h}}, t = 1, ..., T

[Figures: actual against forecast values for GDP growth and the inflation rate]


Informal evaluation methods (Greenbook Forecasts)

Plots of f_{t+h|t} − y_t against y_{t+h} − y_t: directional accuracy

[Figures: forecast changes against actual changes for GDP growth and the inflation rate]


Informal evaluation methods (Greenbook forecasts)

Plot of forecast errors e_{t+h} = y_{t+h} − f_{t+h|t}

[Figures: forecast errors over time for GDP growth and the inflation rate]


Informal evaluation methods

Theil (1961) suggested the following decomposition:

E[(y − f)²] = E[((y − Ey) − (f − Ef) + (Ey − Ef))²]
= (Ey − Ef)² + (σy − σf)² + 2σyσf(1 − ρ)

MSE depends on

squared bias (Ey − Ef )2
squared differences in standard deviations (σy − σf )2
correlation between the forecast and outcome ρ
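Because the decomposition is an algebraic identity in sample moments, a quick numerical check reproduces it exactly. The biased, noisy forecast below is an arbitrary illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
y = rng.normal(1.0, 2.0, size=50_000)
f = 0.5 + 0.6 * y + rng.normal(0.0, 1.0, size=50_000)  # biased, noisy forecast

mse = np.mean((y - f) ** 2)
bias2 = (y.mean() - f.mean()) ** 2            # squared bias
sd_term = (y.std() - f.std()) ** 2            # squared std-dev gap
rho = np.corrcoef(y, f)[0, 1]                 # forecast-outcome correlation
decomp = bias2 + sd_term + 2.0 * y.std() * f.std() * (1.0 - rho)

print(mse, decomp)  # identical up to floating-point error
```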


Pseudo out-of-sample Forecasts

Simulated (“pseudo”) out-of-sample (OoS) forecasts seek to mimic
the “real time” updating underlying most forecasts

What would a forecaster have done (historically) at a given point in
time?

Method splits data into an initial estimation sample (in-sample
period) and a subsequent evaluation sample (OoS period)

Forecasts are based on parameter estimates that use data only up to
the date when the forecast is computed

As the sample expands, the model parameters get updated, resulting
in a sequence of forecasts

Why do out-of-sample forecasting?

control for data mining – harder to “game”
feasible in real time (less “look-ahead” bias)


Pseudo out-of-sample forecasts (cont.)

Out-of-sample (OoS) forecasts impose the constraint that the
parameter estimates of the forecasting model only use information
available at the time the forecast was computed

Only information known at time t can be used to estimate and select
the forecasting model and generate forecasts ft+h|t
Many variants of OoS forecast estimation methods exist. These can
be illustrated for the linear regression model

y_{t+1} = β′x_t + ε_{t+1}

f̂_{t+1|t} = β̂′_t x_t

β̂_t = ( ∑_{s=1}^{t} ω(s, t) x_{s−1} x′_{s−1} )⁻¹ ( ∑_{s=1}^{t} ω(s, t) x_{s−1} y_s )

Different methods use different weighting functions ω(s, t)
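The weighting schemes discussed on the following slides (expanding, rolling, discounted) all plug into this formula. A sketch for a scalar predictor, where the data-generating process, the 100-observation window and λ = 0.99 are illustrative assumptions:

```python
import numpy as np

def wls_beta(x_lag, y, w):
    """beta_hat = (sum_s w(s,t) x_{s-1}^2)^{-1} (sum_s w(s,t) x_{s-1} y_s)."""
    return np.sum(w * x_lag * y) / np.sum(w * x_lag ** 2)

rng = np.random.default_rng(6)
T = 300
x = rng.normal(size=T + 1)
y = 0.5 * x[:T] + rng.normal(size=T)     # y_s = 0.5 x_{s-1} + eps_s

s = np.arange(1, T + 1)                  # s = 1, ..., t with t = T
w_expanding = np.ones(T)                 # omega(s,t) = 1 for all s <= t
w_rolling = (s > T - 100).astype(float)  # last 100 observations only
w_discount = 0.99 ** (T - s)             # exponentially declining weights

b_exp = wls_beta(x[:T], y, w_expanding)
b_roll = wls_beta(x[:T], y, w_rolling)
b_disc = wls_beta(x[:T], y, w_discount)
print(b_exp, b_roll, b_disc)  # all near the true slope 0.5
```

The shorter effective samples (rolling and discounted) produce noisier estimates, which is the estimation-error trade-off the comparison slide describes.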


Expanding window

Expanding or recursive estimation windows put equal weight on all
observations s = 1, …, t to estimate the parameters of the model:

ω(s, t) = 1 for 1 ≤ s ≤ t, and 0 otherwise

As time progresses, the estimation sample grows larger, It ⊆ It+1
If the parameters of the model do not change (“stationarity”), the
expanding window approach makes efficient use of the data and leads
to consistent parameter estimates

If model parameters are subject to change, the approach leads to
biased forecasts

The approach works well empirically due to its use of all available
data which reduces the effect of estimation error on the forecasts


Expanding window

[Figure: expanding window using observations 1, ..., t, with forecasts made at t, t+1, t+2, ..., T−1]


Rolling window

Rolling window uses an equal-weighted kernel of the most recent ω̄
observations to estimate the parameters of the forecasting model

ω(s, t) = 1 for t − ω̄ + 1 ≤ s ≤ t, and 0 otherwise

Only one ‘design’ parameter: ω̄ (length of window)

Practical way to account for slowly-moving changes to the data
generating process

Does this address “breaks”?

window too long immediately after breaks
window too short further away


Rolling window

[Figure: rolling window using observations t−ω̄+1, ..., t, with forecasts made at t, t+1, t+2, ..., T−1]


Fixed window

Fixed window uses only the first ω̄0 observations to once and for all
estimate the parameters of the forecasting model

ω(s, t) = 1 for 1 ≤ s ≤ ω̄0, and 0 otherwise

This method is typically employed when the costs of estimation are
very high, so re-estimating the model with new data is prohibitively
expensive or impractical in real time

The method also makes analytical results easier


Fixed window

[Figure: fixed window using observations 1, ..., ω̄0, with forecasts made at t, t+1, t+2, ..., T−1]


Exponentially declining weights

In the presence of model instability, it is common to discount past
observations using weights that get smaller, the older the data

Exponentially declining weights take the following form:

ω(s, t) = λ^(t−s) for 1 ≤ s ≤ t, and 0 otherwise

0 < λ < 1. This method is sometimes called discounted least squares, as the discount factor, λ, puts less weight on past observations

Comparisons

Expanding estimation window: the number of observations available for estimating model parameters increases with the sample size

Effect of estimation error gets reduced

Fixed/rolling/discounted window: parameter estimation error continues to affect the forecasts even as the sample grows large

model parameter estimates are inconsistent

Forecasts vary more under the short (fixed and rolling) estimation windows than under the expanding window

US stock index

[Figure: US stock index]

Monthly US stock returns

[Figure: monthly US stock returns]

Monthly inflation

[Figure: monthly inflation]

US T-bill rate

[Figure: US T-bill rate]

US Stock market volatility

[Figure: US stock market volatility]

Example: Portfolio Choice under Mean-Variance Utility

T-bills with known payoff rf vs. stocks with uncertain return r^s_{t+1} and excess return r_{t+1} = r^s_{t+1} − rf

Wt = $1: initial wealth
ωt: portion of portfolio held in stocks at time t
(1 − ωt): portion of portfolio held in T-bills
W_{t+1}: future wealth

W_{t+1} = (1 − ωt) rf + ωt (r_{t+1} + rf) = rf + ωt r_{t+1}

Investor chooses ωt to maximize mean-variance utility:

Et[U(W_{t+1})] = Et[W_{t+1}] − (A/2) Vart(W_{t+1})

Et[W_{t+1}] and Vart(W_{t+1}): conditional mean and variance of W_{t+1}
Suppose stock returns follow the process

r_{t+1} = µ + x_t + ε_{t+1}

x_t ∼ (0, σ²x), ε_{t+1} ∼ (0, σ²ε), cov(x_t, ε_{t+1}) = 0

x_t: predictable component given information at t
ε_{t+1}: unpredictable innovation (shock)

Uninformed investor’s (no information on x_t) stock holding:

ω*_t = arg max_{ωt} { ωt µ + rf − (A/2) ωt² (σ²x + σ²ε) } = µ / (A(σ²x + σ²ε))

E[U(W_{t+1}(ω*_t))] = rf + µ² / (2A(σ²x + σ²ε)) = rf + S²/(2A)

S = µ / √(σ²x + σ²ε): unconditional Sharpe ratio

Portfolio Choice under Mean-Variance Utility (cont.)

Informed investor knows x_t. His stock holdings are

ω*_t = (µ + x_t) / (Aσ²ε)

Et[U(W_{t+1}(ω*_t))] = rf + (µ + x_t)² / (2Aσ²ε)

Average (unconditional expectation) value of this is

E[Et[U(W_{t+1}(ω*_t))]] = rf + (µ² + σ²x) / (2Aσ²ε)

Increase in expected utility due to knowing the predictor variable:

E[U^informed] − E[U^uninformed] = σ²x / (2Aσ²ε) = R² / (2A(1 − R²))

Plausible empirical numbers, i.e., R² = 0.005 and A = 3, give an annualized certainty equivalent return of about 1%

Lecture 2: Univariate Forecasting Models

UCSD, January 18 2017

Allan Timmermann

UC San Diego

Timmermann (UCSD) ARMA Winter, 2017 1 / 59

1 Introduction to ARMA models

2 Covariance Stationarity and Wold Representation Theorem

3 Forecasting with ARMA models

4 Estimation and Lag Selection for ARMA Models
Choice of Lag Order

5 Random walk model

6 Trend and Seasonal Components
Seasonal components
Trended Variables

Introduction: ARMA models

When building a forecasting model for an economic or financial variable, the variable’s own past time series is often the first thing that comes to mind

Many time series are persistent

Effect of past and current shocks takes time to evolve

Auto Regressive Moving Average (ARMA) models

Work horse of the forecast profession since Box and Jenkins (1970)
Remain the centerpiece of many applied forecasting courses
Used extensively commercially
Why are ARMA models so popular?

1 Minimalist demand on forecaster’s information set: need only the past history of the variable, IT = {y1, y2, ..., yT−1, yT}

"Reduced form": no need to derive a fully specified model for y
By excluding other variables, ARMA forecasts show how useful the past of a time series is for predicting its future

2 Empirical success: ARMA forecasts often provide a good ‘benchmark’ and have proven surprisingly difficult to beat in empirical work

3 ARMA models are underpinned by theoretical arguments

Wold Representation Theorem: covariance stationary processes can be represented as a (possibly infinite order) moving average process
ARMA models have certain optimality properties among linear projections of a variable on its own past and past shocks to the series
ARMA models are not optimal in a global sense - it may be optimal to use nonlinear transformations of past values of the series or to condition on a wider information set ("other variables")

Covariance Stationarity: Definition

A time series, or stochastic process, {yt}, is covariance stationary if

The mean of yt, µt = E[yt], is the same for all values of t: µt = µ

without loss of generality we set µt = 0 for all t [de-meaning]

The autocovariance exists and does not depend on t, but only on the "distance", j, i.e., E[yt yt−j] ≡ γ(j, t) = γ(j) for all t

Autocovariance measures how strong the covariation is between current and past values of a time series

If yt is independently distributed over time, then E[yt yt−j] = 0 for all j ≠ 0

Covariance Stationarity: Interpretation

History repeats: if the series changed fundamentally over time, the past would not be useful for predicting the future of the series. To rule out this situation, we have to assume a certain degree of stability of the series.
This is known as covariance stationarity

Covariance stationarity rules out shifting patterns such as
  trends in the mean of a series
  breaks in the mean, variance, or autocovariance of a series

Covariance stationarity allows us to use historical information to construct a forecasting model and predict the future
Under covariance stationarity Cov(y_2016, y_2015) = Cov(y_2017, y_2016). This allows us to predict y_2017 from y_2016

Timmermann (UCSD) ARMA Winter, 2017 5 / 59

White noise

Covariance stationary processes can be built from white noise:

Definition: A stochastic process, ε_t, is called white noise if it has zero mean, constant variance, and is serially uncorrelated:

E[ε_t] = 0
Var(ε_t) = σ²
E[ε_t ε_s] = 0, for all t ≠ s

Timmermann (UCSD) ARMA Winter, 2017 6 / 59

Wold Representation Theorem

Any covariance stationary process can be written as an infinite order MA model, MA(∞), with coefficients θ_j that are independent of t:

Theorem (Wold's Representation Theorem): Any covariance stationary stochastic process {y_t} can be represented as a linear combination of serially uncorrelated lagged white noise terms ε_t and a linearly deterministic component, µ_t:

y_t = Σ_{j=0}^∞ θ_j ε_{t−j} + µ_t

where the {θ_j} are independent of time and Σ_{j=0}^∞ θ²_j < ∞.
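The sample autocorrelations and the Q-statistic used on the following slides can be computed by hand. A Python sketch (standard library only; for simplicity the autocovariance here divides by T rather than T − j − 1, and the Q-statistic is the slide's Box-Pierce form Q = T Σ ρ̂²_j; all names are mine):

```python
import random

def sample_autocorr(y, j):
    """Sample autocorrelation at lag j: cov(y_t, y_{t-j}) / var(y_t)."""
    T = len(y)
    ybar = sum(y) / T
    cov = sum((y[t] - ybar) * (y[t - j] - ybar) for t in range(j, T)) / T
    var = sum((v - ybar) ** 2 for v in y) / T
    return cov / var

def q_stat(y, m):
    """Box-Pierce Q-statistic for serial correlation of orders 1..m (chi^2_m under H0)."""
    T = len(y)
    return T * sum(sample_autocorr(y, j) ** 2 for j in range(1, m + 1))

random.seed(0)
white_noise = [random.gauss(0, 1) for _ in range(1000)]
q = q_stat(white_noise, m=5)  # compare to the chi^2_5 5% critical value (~11.07)
```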
Timmermann (UCSD) ARMA Winter, 2017 7 / 59

Wold Representation Theorem: Discussion

Since E[ε_t] = 0, E[ε²_t] = σ² ≥ 0, and E[ε_t ε_s] = 0 for all t ≠ s, ε_t is not predictable using linear models of past data
Practical concern: the MA order is potentially infinite
  Since Σ_{j=0}^∞ θ²_j < ∞, the parameters must die off eventually, so a finite approximation to the infinite MA process can be appropriate
In practice we need to construct ε_t from data (filtering)
The MA representation holds apart from a possible deterministic term, µ_t, which is perfectly predictable infinitely far into the future
  e.g., constant, linear time trend, seasonal pattern, or sinusoid with known periodicity

Timmermann (UCSD) ARMA Winter, 2017 8 / 59

Estimation of Autocovariances

Autocovariances and autocorrelations can be estimated from sample data (sample t = 1, ..., T):

Cov̂(y_t, y_{t−j}) = (1/(T − j − 1)) Σ_{t=j+1}^T (y_t − ȳ)(y_{t−j} − ȳ)

ρ̂_j = Cov̂(y_t, y_{t−j}) / Var̂(y_t)

where ȳ = (1/T) Σ_{t=1}^T y_t is the sample mean of y

Testing for autocorrelation: the Q-statistic can be used to test for serial correlation of orders 1, ..., m:

Q = T Σ_{j=1}^m ρ̂²_j ∼ χ²_m

Small p-values (below 0.05) suggest significant serial correlation

Timmermann (UCSD) ARMA Winter, 2017 9 / 59

Autocovariances in matlab

autocorr: computes sample autocorrelation
parcorr: computes sample partial autocorrelation
lbqtest: computes Ljung-Box Q-test for residual autocorrelation

Timmermann (UCSD) ARMA Winter, 2017 10 / 59

Sample autocorrelation for US T-bill rate

Timmermann (UCSD) ARMA Winter, 2017 11 / 59

Sample autocorrelation for US stock returns

Timmermann (UCSD) ARMA Winter, 2017 12 / 59

Autocorrelations and predictability

The more strongly autocorrelated a variable is, the easier it is to predict its mean
  strong serial correlation means the series is slowly mean reverting and so the past is useful for predicting the future
Strongly serially correlated variables include
  interest rates (in levels)
  level of inflation rate (year on year)
Weakly serially
correlated or uncorrelated variables include
  stock returns
  changes in inflation
  growth rate in corporate dividends

Timmermann (UCSD) ARMA Winter, 2017 13 / 59

Lag Operator and Lag Polynomials

The lag operator, L, when applied to any variable simply lags the variable by one period:

L y_t = y_{t−1}
L^p y_t = y_{t−p}

Lag polynomials such as φ(L) take the form

φ(L) = Σ_{i=0}^p φ_i L^i

For example, if p = 2 and φ(L) = 1 − φ_1 L − φ_2 L², then

φ(L) y_t = y_t − φ_1 y_{t−1} − φ_2 y_{t−2}

Timmermann (UCSD) ARMA Winter, 2017 14 / 59

ARMA Models

Autoregressive models specify y as a function of its own lags
Moving average models specify y as a weighted average of past shocks (innovations) to the series
ARMA(p, q) specification for a stationary variable y_t:

y_t = φ_1 y_{t−1} + ... + φ_p y_{t−p} + ε_t + θ_1 ε_{t−1} + ... + θ_q ε_{t−q}

In lag polynomial notation, φ(L) y_t = θ(L) ε_t, where

φ(L) = 1 − Σ_{i=1}^p φ_i L^i
θ(L) = Σ_{i=0}^q θ_i L^i = 1 + θ_1 L + ... + θ_q L^q   (θ_0 = 1)

Timmermann (UCSD) ARMA Winter, 2017 15 / 59

AR(1) Model

The ARMA(1, 0) or AR(1) model takes the form

y_t = φ_1 y_{t−1} + ε_t
(1 − φ_1 L) y_t = ε_t,  θ(L) = 1

By recursive backward substitution,

y_t = φ_1(φ_1 y_{t−2} + ε_{t−1}) + ε_t = φ²_1 y_{t−2} + ε_t + φ_1 ε_{t−1}

Iterating further backwards, we have, for h ≥ 1,

y_t = φ^h_1 y_{t−h} + Σ_{s=0}^{h−1} φ^s_1 ε_{t−s} = φ^h_1 y_{t−h} + θ(L) ε_t,

where θ(L) has coefficients θ_i = φ^i_1 (for i = 0, ..., h − 1)

Timmermann (UCSD) ARMA Winter, 2017 16 / 59

AR(1) Model

The AR(1) model is equivalent to an MA(∞) model as long as φ^h_1 y_{t−h} becomes "small" in a mean square sense:

E[y_t − Σ_{s=0}^{h−1} φ^s_1 ε_{t−s}]² = E[φ^h_1 y_{t−h}]² ≤ φ^{2h}_1 γ_y(0) → 0 as h → ∞,

provided that φ^{2h}_1 → 0, i.e., |φ_1| < 1
A stationary AR(1) process has an equivalent MA(∞) representation
The root of the polynomial φ(z) = 1 − φ_1 z = 0 is z* = 1/φ_1, so |φ_1| < 1 means that the root exceeds one (in absolute value).
This is a necessary and sufficient condition for stationarity of an AR(1) process
Stationarity of an AR(p) model requires that all roots of the equation φ(z) = 0 exceed one in absolute value (fall outside the unit circle)

Timmermann (UCSD) ARMA Winter, 2017 17 / 59

MA(1) Model

The ARMA(0, 1) or MA(1) model:

y_t = ε_t + θ_1 ε_{t−1},  i.e., φ(L) = 1, θ(L) = 1 + θ_1 L

Backwards substitution yields

ε_t = y_t / (1 + θ_1 L) = Σ_{s=0}^{h−1} (−θ_1)^s y_{t−s} + (−θ_1)^h ε_{t−h}

so y_t is equivalent to an AR(h) process with coefficients φ_s = −(−θ_1)^s, provided that (−θ_1)^h ε_{t−h} becomes small as h increases, i.e., |θ_1| < 1

An MA(q) model is invertible if the roots of θ(z) exceed one in absolute value
An invertible MA process can be written as an infinite order AR process
A stationary and invertible ARMA(p, q) process can be written as either an AR model or as an MA model, typically of infinite order:

y_t = φ(L)^{−1} θ(L) ε_t  or  θ(L)^{−1} φ(L) y_t = ε_t

Timmermann (UCSD) ARMA Winter, 2017 18 / 59

ARIMA representation for nonstationary processes

Suppose that d of the roots of φ(L) equal unity (one), while the remaining roots fall outside the unit circle. Factorization:

φ(L) = φ̃(L)(1 − L)^d

Applying (1 − L) to a series is called differencing
Let ỹ_t = (1 − L)^d y_t be the dth difference of y_t.
Then φ̃(L) ỹ_t = θ(L) ε_t
By assumption, the roots of φ̃(L) lie outside the unit circle, so the differenced process, ỹ_t, is stationary and can be studied instead of y_t
Processes with d ≠ 0 need to be differenced to achieve stationarity and are called ARIMA(p, d, q)

Timmermann (UCSD) ARMA Winter, 2017 19 / 59

US stock index

Timmermann (UCSD) ARMA Winter, 2017 20 / 59

Monthly US stock returns (first-differenced prices)

Timmermann (UCSD) ARMA Winter, 2017 21 / 59

Forecasting with AR models

Prediction is straightforward for AR(p) models:

y_{T+1} = φ_1 y_T + ... + φ_p y_{T−p+1} + ε_{T+1},  ε_{T+1} ∼ WN(0, σ²)

Treat parameters as known and ignore estimation error
Using that E[ε_{T+1}|I_T] = 0 and {y_{T−p+1}, ..., y_T} ∈ I_T, the forecast of y_{T+1} given I_T becomes

f_{T+1|T} = φ_1 y_T + ... + φ_p y_{T−p+1}

f_{T+1|T} means the forecast of y_{T+1} given information at time T
x ∈ I_T means "x is known at time T, i.e., belongs to the information set at time T"

Timmermann (UCSD) ARMA Winter, 2017 22 / 59

Forecasting with AR models: The Chain Rule

When generating forecasts multiple steps ahead, unknown values of y_{T+h} (h ≥ 1) can be replaced with their forecasts, f_{T+h|T}, setting up a recursive system of forecasts:

f_{T+2|T} = φ_1 f_{T+1|T} + φ_2 y_T + ... + φ_p y_{T−p+2}
f_{T+3|T} = φ_1 f_{T+2|T} + φ_2 f_{T+1|T} + φ_3 y_T + ... + φ_p y_{T−p+3}
...
f_{T+p+1|T} = φ_1 f_{T+p|T} + φ_2 f_{T+p−1|T} + φ_3 f_{T+p−2|T} + ... + φ_p f_{T+1|T}

The 'chain rule' is equivalent to recursively expressing unknown future values y_{T+i} as a function of y_T and its past
Known values of y affect the forecasts of an AR(p) model up to horizon T + p, while forecasts further ahead only depend on past forecasts themselves

Timmermann (UCSD) ARMA Winter, 2017 23 / 59

Forecasting with MA models

Consider the MA(q) model

y_{T+1} = ε_{T+1} + θ_1 ε_T + ... + θ_q ε_{T−q+1}

One-step-ahead forecast:

f_{T+1|T} = θ_1 ε_T + ... + θ_q ε_{T−q+1}

The sequence of shocks {ε_t} is not directly observable but can be computed recursively (estimated) given a set of assumptions on the initial values for ε_t, t = 0, ..., q − 1
For the MA(1) model, we can set ε_0 = 0 and use the recursion

ε_1 = y_1
ε_2 = y_2 − θ_1 ε_1 = y_2 − θ_1 y_1
ε_3 = y_3 − θ_1 ε_2 = y_3 − θ_1(y_2 − θ_1 y_1)

The unobserved shocks can thus be written as a function of the parameter value θ_1 and current and past values of y

Timmermann (UCSD) ARMA Winter, 2017 24 / 59

Forecasting with MA models (cont.)

Simple recursions using past forecasts can also be employed to update the forecasts. For the MA(1) model we have

f_{t+1|t} = θ_1 ε_t = θ_1(y_t − f_{t|t−1})

For MA processes of infinite order, y_{T+h} for h ≥ 1 is

y_{T+h} = θ(L) ε_{T+h} = (ε_{T+h} + θ_1 ε_{T+h−1} + ... + θ_{h−1} ε_{T+1})  [unpredictable]
          + (θ_h ε_T + θ_{h+1} ε_{T−1} + ...)  [predictable]

Hence, if ε_T were observed, the forecast would be

f_{T+h|T} = θ_h ε_T + θ_{h+1} ε_{T−1} + ...
= Σ_{j=h}^∞ θ_j ε_{T+h−j}

The MA(q) model has limited memory: values of an MA(q) process more than q periods into the future are not predictable

Timmermann (UCSD) ARMA Winter, 2017 25 / 59

Forecasting with mixed ARMA models

Consider a mixed ARMA(p, q) model

y_{T+1} = φ_1 y_T + φ_2 y_{T−1} + ... + φ_p y_{T−p+1} + ε_{T+1} + θ_1 ε_T + ... + θ_q ε_{T−q+1}

Separate AR and MA prediction steps can be combined by recursively replacing future values of y_{T+i} with their predicted values and setting E[ε_{T+j}|I_T] = 0 for j ≥ 1:

f_{T+1|T} = φ_1 y_T + φ_2 y_{T−1} + ... + φ_p y_{T−p+1} + θ_1 ε_T + ... + θ_q ε_{T−q+1}
f_{T+2|T} = φ_1 f_{T+1|T} + φ_2 y_T + ... + φ_p y_{T−p+2} + θ_2 ε_T + ... + θ_q ε_{T−q+2}
...
f_{T+h|T} = φ_1 f_{T+h−1|T} + φ_2 f_{T+h−2|T} + ... + φ_p f_{T−p+h|T} + θ_h ε_T + ... + θ_q ε_{T−q+h}

Note: f_{T−j+h|T} = y_{T−j+h} if j ≥ h, and we assumed q ≥ h

Timmermann (UCSD) ARMA Winter, 2017 26 / 59

Mean Square Forecast Errors

By the Wold Representation Theorem, all stationary ARMA processes can be written as an MA process with associated forecast error

y_{T+h} − f_{T+h|T} = ε_{T+h} + θ_1 ε_{T+h−1} + ... + θ_{h−1} ε_{T+1}

Mean square forecast error:

E[(y_{T+h} − f_{T+h|T})²] = E[(ε_{T+h} + θ_1 ε_{T+h−1} + ... + θ_{h−1} ε_{T+1})²]
                          = σ²(1 + θ²_1 + ... + θ²_{h−1})

For the AR(1) model, θ_i = φ^i_1 and so the MSE becomes

E[(y_{T+h} − f_{T+h|T})²] = σ²(1 + φ²_1 + ... + φ^{2(h−1)}_1) = σ²(1 − φ^{2h}_1) / (1 − φ²_1)

Timmermann (UCSD) ARMA Winter, 2017 27 / 59

Direct vs. Iterated multi-period forecasts

Two ways to generate multi-period forecasts (h > 1):

Iterated approach: the forecasting model is estimated at the highest frequency and iterated upon to obtain forecasts at longer horizons

Direct approach: the forecasting model is matched with the desired forecast horizon: one model for each horizon, h. The dependent variable is y_{t+h} while all predictor variables are dated period t

Example: AR(1) model y_t = φ_1 y_{t−1} + ε_t

Iterated approach: use the estimated value, φ̂_1, to obtain a forecast

f_{T+h|T} = φ̂^h_1 y_T

Direct approach: estimate the h-period lag relationship

y_{t+h} = φ^h_1 y_t + Σ_{s=0}^{h−1} φ^s_1 ε_{t+h−s}

where φ̃_{1h} ≡ φ^h_1 is the direct slope coefficient and ε̃_{t+h} ≡ Σ_{s=0}^{h−1} φ^s_1 ε_{t+h−s} is the h-period error

Timmermann (UCSD) ARMA Winter, 2017 28 / 59
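The iterated 'chain rule' forecasts described above are a short recursion in code. A Python sketch for an AR(p) model with known coefficients (the slides use MATLAB; function and variable names are mine):

```python
def ar_forecasts(y, phi, h):
    """Chain-rule forecasts f_{T+1|T}, ..., f_{T+h|T} for an AR(p) model.

    y   : observed history y_1, ..., y_T (oldest first)
    phi : known AR coefficients phi_1, ..., phi_p
    """
    p = len(phi)
    path = list(y)              # known values, extended with forecasts as we go
    forecasts = []
    for _ in range(h):
        # f = phi_1 * (most recent value) + ... + phi_p * (p-th most recent)
        f = sum(phi[i] * path[-1 - i] for i in range(p))
        forecasts.append(f)
        path.append(f)          # unknown future values are replaced by forecasts
    return forecasts

# AR(1) special case: f_{T+h|T} = phi_1^h * y_T
print(ar_forecasts([1.0, 2.0], phi=[0.5], h=3))  # → [1.0, 0.5, 0.25]
```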

Direct vs. Iterated multi-period forecasts: Trade-offs

When the autoregressive model is correctly specified, the iterated
approach makes more efficient use of the data and so tends to
produce better forecasts

Conversely, by virtue of being a linear projection, the direct approach
tends to be more robust towards misspecification

When the model is grossly misspecified, iteration on the misspecified
model can exacerbate biases and may result in a larger MSE

Which approach performs best depends on the true DGP, the degree
of model misspecification (both unknown), and the sample size

Empirical evidence in Marcellino et al. (2006) suggests that the
iterated approach works best on average for macro variables

Timmermann (UCSD) ARMA Winter, 2017 29 / 59
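In population the direct h-step slope for an AR(1) equals φ₁^h, so the two approaches coincide when the model is correctly specified; in a finite sample the estimates differ. A Python simulation sketch (standard library only; all names are mine):

```python
import random

def simulate_ar1(phi, T, seed=42):
    """Simulate y_t = phi * y_{t-1} + eps_t with standard normal shocks."""
    rng = random.Random(seed)
    y = [0.0]
    for _ in range(T):
        y.append(phi * y[-1] + rng.gauss(0, 1))
    return y[1:]

def ols_slope(x, y):
    """No-intercept OLS slope: sum(x*y) / sum(x^2)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

phi, h = 0.8, 3
y = simulate_ar1(phi, T=20000)

phi1_hat = ols_slope(y[:-1], y[1:])   # one-step AR(1) estimate
iterated = phi1_hat ** h               # iterated h-step slope
direct = ols_slope(y[:-h], y[h:])      # direct regression of y_{t+h} on y_t

# In this correctly specified model both estimates should be near phi^h = 0.512
print(round(iterated, 3), round(direct, 3))
```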

Estimation of ARIMA models

ARIMA models can be estimated by maximum likelihood methods

ARIMA models are based on linear projections (regressions) which
provide reasonable forecasts of linear processes under MSE loss

There may be nonlinear models of past data that provide better
predictors:

Under MSE loss the best predictor is the conditional mean, which need
not be a linear function of the past

Timmermann (UCSD) ARMA Winter, 2017 30 / 59

Estimation (continued)

AR(p) models with known p > 0 can be estimated by ordinary least
squares by regressing y_t on y_{t−1}, ..., y_{t−p}
Assuming the data are covariance stationary, OLS estimates of the
coefficients φ_1, ..., φ_p are consistent and asymptotically normal

If the AR model is correctly specified, such estimates are also
asymptotically efficient

Least squares estimates are not optimal in finite samples and will be
biased
For the AR(1) model, φ̂_1 has a downward bias of approximately (1 + 3φ_1)/T
For higher order models, the biases are complicated and can go in
either direction

Timmermann (UCSD) ARMA Winter, 2017 31 / 59

Estimation and forecasting with ARMA models in matlab

regARIMA: creates regression model with ARIMA time series errors

estimate: estimates parameters of regression models with ARIMA
errors

Pure AR models: can be estimated by OLS

forecast: forecast ARIMA models

Timmermann (UCSD) ARMA Winter, 2017 32 / 59

Lag length selection

In most situations, forecasters do not know the true or optimal lag
orders, p and q

Judgmental approaches based on examining the autocorrelations and
partial autocorrelations of the data
Model selection criteria: Different choices of (p, q) result in a set of
models {Mk}Kk=1, where Mk represents model k and the search is
conducted over K different combinations of p and q
Information criteria trade off fit versus parsimony

Timmermann (UCSD) ARMA Winter, 2017 33 / 59

Information criteria

Information criteria (IC) for linear ARMA specifications:

IC_k = ln σ̂²_k + n_k g(T)

ICs trade off fit (which improves with more parameters) against parsimony (fewer parameters is better). Choose k to minimize the IC

σ̂²_k: residual variance (sum of squared residuals divided by T) of model k. Lower σ̂²_k ⇔ better fit
n_k = p_k + q_k + 1: number of estimated parameters of model k
g(T): penalty term that depends on the sample size, T:

Criterion                g(T)
AIC (Akaike (1974))      2/T
BIC (Schwarz (1978))     ln(T)/T

In matlab: aicbic
Timmermann (UCSD) ARMA Winter, 2017 34 / 59
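A Python sketch of the criterion IC_k = ln σ̂²_k + n_k g(T) and the resulting lag choice (the residual variances below are hypothetical numbers for illustration; names are mine):

```python
import math

def aic(sigma2_hat, n_params, T):
    """Akaike information criterion: ln(sigma2) + n * 2/T."""
    return math.log(sigma2_hat) + n_params * 2 / T

def bic(sigma2_hat, n_params, T):
    """Schwarz criterion: ln(sigma2) + n * ln(T)/T (heavier penalty when T > e^2)."""
    return math.log(sigma2_hat) + n_params * math.log(T) / T

# Hypothetical residual variances for AR(1)..AR(4) fitted to T = 200 observations;
# fit improves mechanically with more lags, the penalty pushes the other way.
T = 200
sigma2 = {1: 1.10, 2: 1.02, 3: 1.015, 4: 1.013}
best_bic = min(sigma2, key=lambda p: bic(sigma2[p], p + 1, T))
print(best_bic)  # → 2
```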

Marcellino, Stock and Watson (2006)

Timmermann (UCSD) ARMA Winter, 2017 35 / 59

Random walk model

The random walk model is an AR(1) with φ1 = 1 :

yt = yt−1 + εt , εt ∼ WN(0, σ2)

This model implies that the change in yt is unpredictable:

∆yt = yt − yt−1 = εt

For example, the level of stock prices is easy to predict, but not its
change (rate of return if using logarithm of stock index)

Shocks to the random walk have permanent effects: A one unit shock
moves the series by one unit forever. This is in sharp contrast to a
mean-reverting process

Timmermann (UCSD) ARMA Winter, 2017 36 / 59

Random walk model (cont)

The variance of a random walk increases over time so the distribution
of yt changes over time. Suppose that yt started at zero, y0 = 0 :

y1 = y0 + ε1 = ε1
y2 = y1 + ε2 = ε1 + ε2

yt = ε1 + ε2 + …+ εt−1 + εt

From this we have

E[y_t] = 0
var(y_t) = tσ²,  lim_{t→∞} var(y_t) = ∞

The variance of y grows proportionally with time
A random walk does not revert to the mean but wanders up and down at random
Timmermann (UCSD) ARMA Winter, 2017 37 / 59
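The variance formula var(y_t) = tσ² above is easy to confirm by Monte Carlo. A Python sketch (standard library only; names are mine):

```python
import random

def random_walk_variance(t, n_paths=10000, sigma=1.0, seed=7):
    """Monte Carlo estimate of var(y_t) for y_t = eps_1 + ... + eps_t, y_0 = 0."""
    rng = random.Random(seed)
    endpoints = []
    for _ in range(n_paths):
        y = 0.0
        for _ in range(t):
            y += rng.gauss(0, sigma)
        endpoints.append(y)
    mean = sum(endpoints) / n_paths
    return sum((y - mean) ** 2 for y in endpoints) / n_paths

# With sigma = 1 the variance should grow roughly linearly: about 5, 10, 20
for t in (5, 10, 20):
    print(t, round(random_walk_variance(t), 2))
```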

Forecasts from random walk model

Recall that forecasts from the AR(1) process y_t = φ_1 y_{t−1} + ε_t, ε_t ∼ WN(0, σ²), are simply

f_{t+h|t} = φ^h_1 y_t

For the random walk model φ_1 = 1, so for all forecast horizons, h, the forecast is simply the current value:

f_{t+h|t} = y_t

The basic random walk model says that the value of the series next
period (given the history of the series) equals its current value plus an
unpredictable change:

Forecast of tomorrow = today’s value

Random steps, ε_t, make y_t a "random walk"
Timmermann (UCSD) ARMA Winter, 2017 38 / 59

Random walk with a drift

Introduce a non-zero drift term, δ :

yt = δ+ yt−1 + εt , εt ∼ WN(0, σ2)

This is a popular model for the logarithm of stock prices
The drift term, δ, plays the same role as a time trend. Assuming
again that the series started at y0, we have

y_t = y_0 + δt + ε_1 + ε_2 + ... + ε_t

Similarly,

E[y_t] = y_0 + δt
var(y_t) = tσ²,  lim_{t→∞} var(y_t) = ∞

Timmermann (UCSD) ARMA Winter, 2017 39 / 59

Summary of properties of random walk

Changes in random walk are unpredictable

Shocks have permanent effects

Variance grows in proportion with the forecast horizon

These points are important for forecasting:

point forecasts never revert to a mean
since the variance goes to infinity, the width of interval forecasts
increases without bound as the forecast horizon grows
Uncertainty grows without bounds

Timmermann (UCSD) ARMA Winter, 2017 40 / 59

Logs, levels and growth rates

Certain transformations of economic variables such as their logarithm
are often easier to forecast than the “raw” data

If the standard deviation of a time series is approximately proportional
to its level, then the standard deviation of the change in the logarithm
of the series is approximately constant:

Y_{t+1} = Y_t exp(ε_{t+1}),  ε_{t+1} ∼ (0, σ²)  ⇔  ln(Y_{t+1}) − ln(Y_t) = ε_{t+1}

Example: US GDP follows an upward trend. Instead of studying the
level of US GDP, we can study its growth rate which is not trending

The first difference of the log of Yt is ∆ ln(Yt ) = ln(Yt )− ln(Yt−1)
The percentage change in Yt between t − 1 and t is approximately
100∆ ln(Yt ). This can be interpreted as a growth rate

Timmermann (UCSD) ARMA Winter, 2017 41 / 59
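The log-difference growth-rate approximation above is a one-liner. A Python sketch (names are mine):

```python
import math

def log_growth(series):
    """First difference of logs: ln(Y_t) - ln(Y_{t-1}), approx. the growth rate."""
    return [math.log(b) - math.log(a) for a, b in zip(series, series[1:])]

gdp = [100.0, 102.0, 104.04]           # grows exactly 2% per period
g = log_growth(gdp)
print([round(100 * v, 2) for v in g])  # ≈ [1.98, 1.98]: 100*Δln(Y) approximates the % change
```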

Unit root processes

Random walk is a special case of a unit root process which has a unit
root in the AR polynomial, i.e.,

(1− L)φ̃(L)yt = θ(L)εt

where the roots of φ̃(L) lie outside the unit circle

We can test for a unit root using an Augmented Dickey Fuller (ADF)
test:

∆y_t = α + β y_{t−1} + Σ_{i=1}^p γ_i ∆y_{t−i} + ε_t

In matlab: adftest

Under the null of a unit root, β = 0. Under the alternative of
stationarity, β < 0.
The test is based on the t-statistic of β, which follows a non-standard distribution

Timmermann (UCSD) ARMA Winter, 2017 42 / 59

Critical values for Dickey-Fuller test

Timmermann (UCSD) ARMA Winter, 2017 43 / 59

Classical decomposition of time series into three components

Cycles (stochastic) - captured using ARMA models
Trend - captures the slow, long-run evolution in the outcome; for many series in levels, this is the most important component for long-run predictions
Seasonals - regular (deterministic) patterns related to time of the year (day), public holidays, etc.

Timmermann (UCSD) ARMA Winter, 2017 44 / 59

Example: CO2 concentration (ppm) - Dave Keeling, Scripps, 1957-2005

Timmermann (UCSD) ARMA Winter, 2017 45 / 59

Seasonality

Sources of seasonality:
  technology, preferences and institutions are linked to the calendar
  weather (agriculture, construction)
  holidays, religious events

Many economic time series display seasonal variations:
  home sales
  unemployment figures
  stock prices (?)
  commodity prices (?)

Timmermann (UCSD) ARMA Winter, 2017 46 / 59

Handling seasonalities

One strategy is to remove the seasonal component and work with seasonally adjusted series
Problem: we might be interested in forecasting the actual (non-adjusted) series, not just the seasonally adjusted part

Timmermann (UCSD) ARMA Winter, 2017 47 / 59

Seasonal components

Seasonal patterns can be deterministic or stochastic
The stochastic modeling approach uses differencing to incorporate seasonal components - e.g., year-on-year changes
Box and Jenkins (1970) considered seasonal ARIMA, or SARIMA, models of the form

φ(L)(1 − L^S) y_t = θ(L) ε_t

(1 − L^S) y_t = y_t − y_{t−S}: computes year-on-year changes

Timmermann (UCSD) ARMA Winter, 2017 48 / 59

Modeling seasonality

Seasonality can be modeled through seasonal dummies. Let S be the number of seasons per year.
S = 4 (quarterly data)
S = 12 (monthly data)
S = 52 (weekly data)

For example, the following set of dummies would be used to model quarterly variation in the mean (shown over twelve quarters):

D_1t = (1 0 0 0 1 0 0 0 1 0 0 0)
D_2t = (0 1 0 0 0 1 0 0 0 1 0 0)
D_3t = (0 0 1 0 0 0 1 0 0 0 1 0)
D_4t = (0 0 0 1 0 0 0 1 0 0 0 1)

D_1 picks up mean effects in the first quarter, D_2 picks up mean effects in the second quarter, etc. At any point in time only one of the quarterly dummies is activated

Timmermann (UCSD) ARMA Winter, 2017 49 / 59

Pure seasonal dummy model

The pure seasonal dummy model is

y_t = Σ_{s=1}^S δ_s D_{st} + ε_t

We only regress y_t on intercept terms (seasonal dummies) that vary across seasons. δ_s summarizes the seasonal pattern over the year
Alternatively, we can include an intercept and S − 1 seasonal dummies. Now the intercept captures the mean of the omitted season and the remaining seasonal dummies give the seasonal increase/decrease relative to the omitted season
Never include both a full set of S seasonal dummies and an intercept term - perfect collinearity

Timmermann (UCSD) ARMA Winter, 2017 50 / 59

General seasonal effects

Holiday variation (HDV) variables capture dates of holidays which may change over time (Easter, Thanksgiving) - with v_1 of these:

y_t = Σ_{s=1}^S δ_s D_{st} + Σ_{i=1}^{v_1} δ^HDV_i HDV_{it} + ε_t

Timmermann (UCSD) ARMA Winter, 2017 51 / 59

Seasonals

An ARMA model with seasonal dummies takes the form

φ(L) y_t = Σ_{s=1}^S δ_s D_{st} + θ(L) ε_t

Application of seasonal dummies can sometimes yield large improvements in predictive accuracy
Example: day of the week, seasonal, and holiday dummies:

µ_t = Σ_{day=1}^7 β_day D_{day,t} + Σ_{holiday=1}^H β_holiday D_{holiday,t} + Σ_{month=1}^{12} β_month D_{month,t}

Adding deterministic seasonal terms to the ARMA component, the value of y at time T + h can be predicted as follows:

y_{T+h} = Σ_{day=1}^7 β_day D_{day,T+h} + Σ_{holiday=1}^H β_holiday D_{holiday,T+h} + Σ_{month=1}^{12} β_month D_{month,T+h} + ỹ_{T+h},  φ(L) ỹ_{T+h} = θ(L) ε_{T+h}

Timmermann (UCSD) ARMA Winter, 2017 52 / 59

Deterministic trends

Let Time_t be
a deterministic time trend, so that Time_t = t, t = 1, ..., T
This time trend is perfectly predictable (deterministic)
Linear trend model:

Trend_t = β_0 + β_1 Time_t

β_0 is the intercept (value at time zero)
β_1 is the slope, which is positive if the trend is increasing or negative if the trend is decreasing

Timmermann (UCSD) ARMA Winter, 2017 53 / 59

Examples of trended variables

US stock price index
Number of residents in Beijing, China
US labor participation rate for women (up) or men (down)
Exchange rates over long periods (?)
Interest rates (?)
Global mean temperature (?)

Timmermann (UCSD) ARMA Winter, 2017 54 / 59

Quadratic trend

Sometimes the trend is nonlinear (curved), as when the variable increases at an increasing or decreasing rate
For such cases we can use a quadratic trend:

Trend_t = β_0 + β_1 Time_t + β_2 Time²_t

Caution: quadratic trends are mostly considered adequate local approximations and can give rise to a variety of unrealistic shapes for the trend if the forecast horizon is long

Timmermann (UCSD) ARMA Winter, 2017 55 / 59

Log-linear trend

Log-linear trends are used to describe time series that grow at a constant exponential rate:

Trend_t = β_0 exp(β_1 Time_t)

Although the trend is nonlinear in levels, it is linear in logs:

ln(Trend_t) = ln(β_0) + β_1 Time_t

Timmermann (UCSD) ARMA Winter, 2017 56 / 59

Deterministic Time Trends: summary

Three common time trend specifications:

Linear: µ_t = µ_0 + β_0 t
Quadratic: µ_t = µ_0 + β_0 t + β_1 t²
Exponential: µ_t = exp(µ_0 + β_0 t)

These global trends are unlikely to provide accurate descriptions of the future value of most time series at long forecast horizons

Timmermann (UCSD) ARMA Winter, 2017 57 / 59

Estimating trend models

Assuming MSE loss, we can estimate the trend parameters by solving

θ̂ = arg min_θ { Σ_{t=1}^T (y_t − Trend_t(θ))² }

Example: with a linear trend model we have Trend_t(θ) = β_0 + β_1 Time_t, θ = {β_0, β_1}, and we can estimate β_0, β_1 by OLS:

(β̂_0, β̂_1) = arg min_{β_0,β_1} { Σ_{t=1}^T (y_t − β_0 − β_1 Time_t)² }

Timmermann (UCSD) ARMA
Winter, 2017 58 / 59

Forecasting Trend

Suppose a time series is generated by the linear trend model

y_t = β_0 + β_1 Time_t + ε_t,  ε_t ∼ WN(0, σ²)

Future values of ε_t are unpredictable given current information, I_t: E[ε_{t+h}|I_t] = 0
Suppose we want to predict the series at time T + h given information I_T:

y_{T+h} = β_0 + β_1 Time_{T+h} + ε_{T+h}

Since Time_{T+h} = T + h is perfectly predictable while ε_{T+h} is unpredictable, our best forecast (under MSE loss) becomes

f_{T+h|T} = β̂_0 + β̂_1 Time_{T+h}

Timmermann (UCSD) ARMA Winter, 2017 59 / 59

Lecture 3: Model Selection
UCSD, January 23 2017

Allan Timmermann
UC San Diego

Timmermann (UCSD) Model Selection Winter, 2017 1 / 35

1 Estimation Methods
2 Introduction to Model Selection
3 In-Sample Selection Methods
4 Sequential Selection
5 Information Criteria
6 Cross-Validation
7 Lasso

Timmermann (UCSD) Model Selection Winter, 2017 2 / 35

Least Squares Estimation I

The multivariate linear regression model with k regressors

y_t = Σ_{j=1}^k β_j x_{jt−1} + u_t,  t = 1, ..., T

can be written more compactly as

y_t = β′x_{t−1} + u_t,  β = (β_1, ..., β_k)′,  x_{t−1} = (x_{1t−1}, ..., x_{kt−1})′

or, in matrix form, y = Xβ + u, where y is the T×1 vector with elements y_1, ..., y_T and X is the T×k matrix with rows x′_0, x′_1, ..., x′_{T−1}

Timmermann (UCSD) Model Selection Winter, 2017 2 / 35

Least Squares Estimation II

Ordinary Least Squares (OLS) minimizes the sum of squared (forecast) errors:

β̂ = arg min_β Σ_{t=1}^T (y_t − β′x_{t−1})²

This is the same as minimizing (y − Xβ)′(y − Xβ) and yields the solution

β̂ = (X′X)^{−1} X′y

(assuming that X′X is of full rank).
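The OLS solution β̂ = (X′X)⁻¹X′y specializes nicely to the two-parameter linear trend model above. A Python sketch solving the 2×2 normal equations by hand (the slides use MATLAB; names are mine):

```python
def ols_linear_trend(y):
    """Fit y_t = b0 + b1 * t by OLS, t = 1..T, via the normal equations."""
    T = len(y)
    t = list(range(1, T + 1))
    # Normal equations X'X b = X'y for X = [1, t]
    s_t, s_tt = sum(t), sum(v * v for v in t)
    s_y, s_ty = sum(y), sum(a * b for a, b in zip(t, y))
    det = T * s_tt - s_t ** 2          # determinant of X'X
    b0 = (s_tt * s_y - s_t * s_ty) / det
    b1 = (T * s_ty - s_t * s_y) / det
    return b0, b1

# Exact linear data y_t = 2 + 0.5 t is recovered (up to floating-point error)
b0, b1 = ols_linear_trend([2 + 0.5 * t for t in range(1, 21)])
print(b0, b1)
```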
Four assumptions on the disturbance terms, u, ensure that the OLS estimator has the smallest variance among all linear unbiased estimators of β, i.e., it is the Best Linear Unbiased Estimator (BLUE):

1 Zero mean: E(u_t) = 0 for all t
2 Homoskedastic: Var(u_t|X) = σ² for all t
3 No serial correlation: Cov(u_t, u_s|X) = 0 for all s ≠ t
4 Orthogonal: E[u_t|X] = 0 for all t

Timmermann (UCSD) Model Selection Winter, 2017 3 / 35

Least Squares Estimation III

We can also write the OLS estimator as follows:

β̂ = β + (X′X)^{−1} X′u

Provided that u is normally distributed, u ∼ N(0, σ²I_T), we have

β̂ ∼ N(β, σ²(X′X)^{−1})

A similar result can be established asymptotically
Thus we can use standard t-tests or F-tests to test if β is statistically significant

Timmermann (UCSD) Model Selection Winter, 2017 4 / 35

Maximum Likelihood Estimation (MLE)

Suppose the residuals u are independently, identically and normally distributed, u_t ∼ N(0, σ²). Then the likelihood of u_1, ..., u_T as a function of the parameters θ = (β′, σ²) becomes

L(θ) = (2πσ²)^{−T/2} exp( −(1/(2σ²)) Σ_{t=1}^T u²_t )
     = (2πσ²)^{−T/2} exp( −(1/(2σ²)) (y − Xβ)′(y − Xβ) )

Taking logs, we get the log-likelihood LL(θ) = log(L(θ)):

LL(θ) = −(T/2) ln(2πσ²) − (1/(2σ²)) (y − Xβ)′(y − Xβ)

The following parameter estimates maximize the LL:

β̂_MLE = (X′X)^{−1} X′y
σ̂²_MLE = û′û / T

Timmermann (UCSD) Model Selection Winter, 2017 5 / 35

Generalized Method of Moments (GMM) I

Suppose we have data (y_1, x_0), ..., (y_T, x_{T−1}) drawn from a probability distribution p((y_1, x_0), ..., (y_T, x_{T−1})|θ_0) with true parameters θ_0. The parameters can be identified from a set of population moment conditions

E[m((y_t, x_{t−1}), θ_0)] = 0 for all t

Parameter estimates can be based on the corresponding sample moments:

(1/T) Σ_{t=1}^T m((y_t, x_{t−1}), θ̂_T) = 0

If we have the same number of (non-redundant) moment conditions as we have parameters, the parameters θ̂_T are exactly identified by the moment conditions.
For the linear regression model, the moment conditions are that the regression residuals u_t = y_t − x′_{t−1}β are uncorrelated with the predictors, x_{t−1}:

E[x_{t−1} u_t] = E[x_{t−1}(y_t − x′_{t−1}β)] = 0,  t = 1, ..., T  ⇒  β̂_MoM = (X′X)^{−1} X′y

Timmermann (UCSD) Model Selection Winter, 2017 6 / 35

Generalized Method of Moments (GMM) II

Under broad conditions the GMM estimator is consistent and asymptotically normally distributed
The GMM estimator allows for heteroskedastic (time-varying variance) and autocorrelated (persistent) errors
The GMM estimator has certain robustness properties
GMM is widely used throughout finance
Lars Peter Hansen (Chicago) shared the Nobel prize in 2013 for his work on GMM estimation (and other topics)

Timmermann (UCSD) Model Selection Winter, 2017 7 / 35

Shrinkage and Ridge estimators

Estimation errors often lead to bad forecasts
A simple "trick" is to penalize for large parameter estimates
Shrinkage estimators do this by solving the problem

β̃ = arg min_β { T^{−1} Σ_{t=1}^T (y_t − β′x_{t−1})² + λ Σ_{i=1}^{n_k} β²_i }

where the last term is the penalty and λ > 0 penalizes large parameters
With a single regressor, the solution to this problem is simple:

β̃shrink =
1

1+ λ
β̂OLS

In the multivariate case we get the ridge estimator

β̃Ridge = (X
′X + λI )−1X ′y

Even though β̃shrink is now biased, the variance of the forecast is reduced
Timmermann (UCSD) Model Selection Winter, 2017 8 / 35
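A single-regressor Python sketch of the shrinkage effect (the closed form β̂/(1 + λ) assumes the regressor is standardized so that x′x/T = 1; all names are mine):

```python
def ols_no_intercept(x, y):
    """OLS slope without intercept: sum(x*y) / sum(x^2)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def ridge_single(x, y, lam):
    """Single-regressor ridge: minimizes T^-1 * sum (y - b*x)^2 + lam * b^2."""
    T = len(x)
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam * T)

x = [1.0, -1.0, 1.0, -1.0]            # standardized: x'x / T = 1
y = [1.2, -0.8, 0.9, -1.1]
b_ols = ols_no_intercept(x, y)         # = 1.0
b_ridge = ridge_single(x, y, lam=0.25)
print(b_ols, b_ridge)                  # ridge shrinks: 0.8 = 1.0 / (1 + 0.25)
```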

Model selection

Typically a set of models, rather than just a single model is considered when
constructing a forecast of a particular outcome

Models could differ by

dynamic specification (ARMA lags)
predictor variables (covariates)
functional form (nonlinearities)
estimation method (OLS, GMM, MLE)

Can a single ‘best’model be identified?

Model selection methods attempt to choose such a ‘best’model

might be hard if the space of models is very large
what if many models have similar performance?

Different from forecast combination which combines forecasts from several
models

Timmermann (UCSD) Model Selection Winter, 2017 9 / 35

Model selection: setup

M_K = {M_1, ..., M_K}: finite set of K forecasting models
M_k: individual model used to generate a forecast, f_k(x_t, β_k), k = 1, ..., K
x_t: data (conditioning information or predictors) at time t
β_k: parameters for model k
Model selection involves searching over M_K to find the best forecasting
model

Data sample: {yt+1, xt}, t = 0, …,T − 1
One model nests another model if the second model is a special case
(smaller version) of the first one. Example:

M1 : yt+1 = β1x1t + ε1t+1 (small model)
M2 : yt+1 = β21x1t + β22x2t + ε2t+1 (big model)

Timmermann (UCSD) Model Selection Winter, 2017 10 / 35

In-sample comparison of models

Two models: M1 = f1(x1, β1) and M2 = f2(x2, β2)

Squared error loss: e1 = y − f1, e2 = y − f2
The second (“large”) model nests the first (“small”)

Coefficient estimates for both models are selected such that

β̂_i = arg min_{β_i} T^{−1} Σ_{t=1}^T (y_t − f_{it|t−1}(β_i))²

Because f_{2t|t−1}(β_2) nests f_{1t|t−1}(β_1), it follows that, in a given sample,

T^{−1} Σ_{t=1}^T (y_t − f_{2t|t−1}(β̂_2))² ≤ T^{−1} Σ_{t=1}^T (y_t − f_{1t|t−1}(β̂_1))²

The larger model (M2) always provides at least as good a fit as the smaller
model (M1) and in most cases will provide a strictly better in-sample fit

Timmermann (UCSD) Model Selection Winter, 2017 11 / 35

In-sample comparison of models

The smaller model's in-sample fit is always worse even if the true expected
loss under the (first) small model is less than or equal to the expected loss
under the second (large) model, i.e., even if the following holds in population:

E[(y_{t+1} − f_{1t+1|t}(β*_1))²] ≤ E[(y_{t+1} − f_{2t+1|t}(β*_2))²]

β*_1, β*_2: true parameters. These are unknown

Superior in-sample fit does not by itself suggest that a particular forecast
model is necessarily better out-of-sample
Large (complex) models often perform well in comparisons of in-sample fit
even when they perform poorly compared with smaller models when
evaluated on new (out-of-sample) data

Take-away: Overfitting matters. Be careful with large models

Timmermann (UCSD) Model Selection Winter, 2017 12 / 35
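The nesting inequality above is mechanical and easy to demonstrate: adding even a pure-noise regressor can never raise the in-sample sum of squared errors. A Python sketch (standard library only; all names are mine):

```python
import random

def ssr_one(x1, y):
    """In-sample SSR from regressing y on x1 alone (no intercept)."""
    b = sum(a * c for a, c in zip(x1, y)) / sum(a * a for a in x1)
    return sum((c - b * a) ** 2 for a, c in zip(x1, y))

def ssr_two(x1, x2, y):
    """In-sample SSR from regressing y on (x1, x2), via the 2x2 normal equations."""
    s11 = sum(a * a for a in x1); s22 = sum(a * a for a in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y)); s2y = sum(a * c for a, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return sum((c - b1 * a - b2 * b) ** 2 for a, b, c in zip(x1, x2, y))

rng = random.Random(1)
x1 = [rng.gauss(0, 1) for _ in range(200)]
x2 = [rng.gauss(0, 1) for _ in range(200)]   # irrelevant regressor
y = [a + rng.gauss(0, 1) for a in x1]        # true model uses x1 only

# The larger (nesting) model never fits worse in sample
assert ssr_two(x1, x2, y) <= ssr_one(x1, y)
```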

Model selection methods

Popular in-sample model selection methods

Information criteria (IC)
Sequential hypothesis testing
Cross validation
LASSO (large dimensional models)

The advantages of each approach depend on whether there are few or very many
potential predictor variables

Also depends on the true, but unknown, model
Are many or few of the predictor variables truly significant?

sparseness

Timmermann (UCSD) Model Selection Winter, 2017 13 / 35

Sequential Hypothesis Testing

Sequential hypothesis tests choose the ‘best’ submodel from a larger set of
models through a sequence of specification tests that identify the relevant
parts of a model and exclude the remainder

Approach reflects how applied researchers construct their models in practice

Remove terms found not to be useful when tested against a smaller model
that omits such variables

Use t−tests, F−tests, or p−values
Difficulties may arise if models are nonnested or include nonlinearities

In matlab: stepwisefit

Timmermann (UCSD) Model Selection Winter, 2017 14 / 35

Sequential Hypothesis Testing

Different orders of the sequence in which variables are tested

forward stepwise
backward stepwise

General-to-specific – start big: include all potential variables in the initial
model. Then remove variables thought not to be useful through a sequence
of tests. The final model typically depends on the sequence of tests

Specific-to-general – start small: begin with a small baseline model with
the ‘main’ variables or simply a constant, then add further variables if they
improve the prediction model

Forward and backward methods can be mixed

Timmermann (UCSD) Model Selection Winter, 2017 15 / 35

Sequential Testing with Linear Model (backwise stepwise)

Kitchen sink with K potential predictors {x1, …, xK }:

yt+1 = β0 + β1x1t + β2x2t + …+ βK−1xK−1t + βK xKt + εt+1

Suppose the smallest absolute value of the t−statistic of any variable falls
below some threshold, t̄, such as t̄ = 2:

tmin = min_{k=1,…,K} |t_{β̂k}| < t̄ = 2

Eliminate the variable with the smallest absolute t−statistic (equivalently,
the largest p−value, e.g., above 0.05)

Timmermann (UCSD) Model Selection Winter, 2017 16 / 35

Sequential Testing with Linear Model (cont.)

Suppose xK is dropped. The trimmed model with the remaining K − 1
variables is next re-estimated:

yt+1 = β0 + β1x1t + β2x2t + … + βK−1xK−1t + εt+1

Recalculate the t-statistics for all regressors. Check if

min_{k=1,…,K−1} |t_{β̂k}| < t̄

and drop the variable with the smallest t−statistic if this condition holds

Repeat the procedure until no further variable is dropped

Timmermann (UCSD) Model Selection Winter, 2017 17 / 35

Forecasts from sequential tests

Backward stepwise forecast:

ft+1|t = β̂0 + ∑_{k=1}^K β̂k Ik xkt

Ik = 1 if the kth variable is included in the final model. Otherwise Ik = 0

Ik depends on the entire sequence of t−statistics not only for the kth
variable itself but also for all other variables. Why?

In matlab:
stepwisefit(X(1:t-1,:),y(2:t),’display’,’off’,’inmodel’,ones(1,size(X,2))); %
backward stepwise model selection

Timmermann (UCSD) Model Selection Winter, 2017 18 / 35

Sequential Hypothesis Testing: Specific to general

Begin from a simple model that only includes an intercept

yt+1 = β0 + εt+1

Next consider all K univariate models (forward stepwise approach):

yt+1 = β0k + βk xkt + εkt+1, k = 1, …, K

Add to the model the variable with the highest t−statistic subject to the
condition that this exceeds some threshold value t̄, e.g., t̄ = 2:

tmax = max_{k=1,…,K} |t_{β̂k}| > t̄

Add regressors from the remaining pool to this univariate model one by one.
New regressor is included if its t−statistic exceeds t̄
Repeat until no further variable is included
matlab: stepwisefit(X(1:t-1,:),y(2:t),’display’,’off’); % forward stepwise model
selection
Timmermann (UCSD) Model Selection Winter, 2017 19 / 35
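The backward elimination loop can be sketched in Python (the slides use matlab's stepwisefit; this numpy version is an illustrative analogue of the idea, not the toolbox implementation, and the simulated data and names are ours):

```python
import numpy as np

def backward_stepwise(X, y, t_bar=2.0):
    """Backward elimination: repeatedly drop the regressor whose absolute
    t-statistic is smallest, as long as it falls below the threshold t_bar."""
    T, K = X.shape
    keep = list(range(K))
    while keep:
        Z = np.column_stack([np.ones(T), X[:, keep]])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        s2 = resid @ resid / (T - Z.shape[1])          # residual variance
        cov = s2 * np.linalg.inv(Z.T @ Z)
        tstats = beta[1:] / np.sqrt(np.diag(cov)[1:])  # skip the intercept
        j = int(np.argmin(np.abs(tstats)))
        if abs(tstats[j]) < t_bar:
            keep.pop(j)    # eliminate the least significant variable
        else:
            break          # all remaining |t| >= t_bar: stop
    return keep

rng = np.random.default_rng(1)
T, K = 200, 6
X = rng.standard_normal((T, K))
y = 1.5 * X[:, 0] - 1.0 * X[:, 1] + rng.standard_normal(T)  # only x1, x2 matter
selected = backward_stepwise(X, y)
```

As the slide notes, the final set depends on the whole sequence of tests: dropping one variable changes every remaining t−statistic in the next round.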

Sequential approach

Forecasts from the backward and forward stepwise approaches take the form

ft+1|t = β̂0 + ∑_{k=1}^K β̂k Ik xkt

Ik = 1 if the kth variable is included in the final model. Otherwise Ik = 0

Advantages:

intuitive appeal
simplicity
computationally easy

Disadvantages:

no comprehensive search across all possible models
outcome can be path dependent – no guarantee that it finds the globally
optimal model
pre-test biases: hard to control the size

Timmermann (UCSD) Model Selection Winter, 2017 20 / 35

Information Criteria (IC)

ICs trade off model fit against a penalty for model complexity measured by
the number of freely estimated parameters

ICs were developed under different assumptions regarding the ‘correct’
underlying model and hence have different properties

ICs ‘correct’ a minimization criterion for the effect of parameter estimation,
which tends to make large models appear better in-sample than they really are

Popular information criteria:

Bayes information criterion (BIC or SIC)
Akaike information criterion (AIC)

In matlab: aicbic . Choose model with smallest

aic = −2 ∗ logL+ 2 ∗ numParam
bic = −2 ∗ logL+ numParam ∗ log (numObs)

Timmermann (UCSD) Model Selection Winter, 2017 21 / 35

Information Criteria

How much additional parameters improve the in-sample fit depends on the
true model, so differences across information criteria hinge on how to
practically get around our ignorance about the true model

Information criteria are used to rank a set of parametric models, Mk
Each model Mk ∈ M_K requires estimating nk parameters, βk
To conduct a comprehensive search over all possible models with K potential
predictor variables, {x1, …., xK } means considering 2K possible model
specifications

Example: K = 2 : Two possible predictors {x1, x2} yield four models {0, 0},
{1, 0}, {0, 1}, and {1, 1}

with K = 11, 2¹¹ = 2,048 models
with K = 20, 2²⁰ > 1,000,000 models
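For small K, the comprehensive search is feasible. A Python sketch (simulated data, our own helper names) that enumerates all 2^K subsets and ranks them by the linear-regression form of the BIC, ln σ̂² + nk ln(T)/T:

```python
import numpy as np
from itertools import combinations

def bic_linear(X, y):
    """BIC for a linear regression: ln(sigma_hat^2) + n_k * ln(T) / T."""
    T = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / T
    return float(np.log(sigma2) + X.shape[1] * np.log(T) / T)

rng = np.random.default_rng(2)
T, K = 150, 3
X = rng.standard_normal((T, K))
y = 1.0 * X[:, 0] + rng.standard_normal(T)   # only the first predictor matters

results = {}
for k in range(K + 1):                        # all 2^K subsets of predictors
    for subset in combinations(range(K), k):
        Z = np.column_stack([np.ones(T)] + [X[:, j] for j in subset])
        results[subset] = bic_linear(Z, y)

best = min(results, key=results.get)          # smallest BIC wins
```

With K = 3 this loop fits 2³ = 8 specifications; the combinatorial explosion above explains why this brute-force search is abandoned for large K.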

Timmermann (UCSD) Model Selection Winter, 2017 22 / 35

Bayesian/Schwarz Information Criterion

BIC = −2LogLk + nk ln(T )

nk : number of parameters of model k
T : sample size – penalty depends on the sample size: bigger T , bigger
penalty

For linear regressions, the BIC takes the form

BIC = ln σ̂²k + nk ln(T)/T

σ̂²k = e′k ek/T : sample estimate of the residual variance

Select the model with the smallest BIC
BIC is a consistent model selection criterion: It selects the true model in a
very large sample (big T) if this is included in M_K

Timmermann (UCSD) Model Selection Winter, 2017 23 / 35

Akaike Information Criterion

AIC = −2LogLk + 2nk

AIC minimizes the distance between the true model and the fitted model

For linear regression models

AIC(k) = ln σ̂²k + 2nk/T

AIC penalizes inclusion of extra parameters less than the SIC

AIC is not a consistent model selection criterion – it tends to select models
with too many parameters

AIC selects the best “approximate” model – asymptotic efficiency

Timmermann (UCSD) Model Selection Winter, 2017 24 / 35

Cross-validation

Cross validation (CV) avoids overfitting by removing the correlation that
causes the estimated in-sample loss to be “small” due to the use of the same
observations for both parameter estimation and model evaluation

CV makes use of the full dataset for both estimation and evaluation

CV averages over all possible combinations of estimation and evaluation
samples obtainable from a given data set

‘Leave one out’ CV estimator holds out one observation for model evaluation

Remaining observations are used for estimation of the parameters
The loss is calculated solely from the evaluation sample – this breaks the
connection leading to overfitting
Repeat calculation for all possible ways to leave out one data point for model
evaluation
CV can be computationally slow if T is large

Timmermann (UCSD) Model Selection Winter, 2017 25 / 35

Illustration: Estimation of sample mean under MSE

Estimation of sample mean ȳT = T⁻¹ ∑_{t=1}^T yt for an i.i.d. time series, yt

Mean Squared Error (MSE), writing εt = yt − µ:

T⁻¹ ∑_{t=1}^T (yt − ȳT)² = T⁻¹ ∑_{t=1}^T ε²t − (ȳT − µ)²  ⇒

E[T⁻¹ ∑_{t=1}^T (yt − ȳT)²] = σ² − E[(ȳT − µ)²] ≤ σ²

E[(ȳT − µ)²] gets subtracted from the MSE!
The in-sample MSE based on the fitted mean will on average be smaller than
the true MSE computed under known µ

Cross validation breaks the correlation between the forecast error and the
estimation error

Separate observations used to estimate the parameters of the prediction model
(the sample mean) from observations used to compute the MSE

Timmermann (UCSD) Model Selection Winter, 2017 26 / 35

How does the classical ‘leave one out’ CV work?

At each point in time, t, use the sample mean
ȳ{−t} = (T − 1)⁻¹ ∑_{s=1, s≠t}^T ys that leaves out observation yt
We can show that

E[T⁻¹ ∑_{t=1}^T (yt − ȳ{−t})²] = E[T⁻¹ ∑_{t=1}^T ε²t] + (T − 1)⁻¹ E[T⁻¹ ∑_{t=1}^T ε²t]
= σ²(1 + (T − 1)⁻¹) > σ²

The expected squared error of the leave-one-out estimator ȳ{−t} is larger
than that of the usual in-sample estimate based on ȳT, which does not leave
anything out

CV estimator tells us (correctly) that the MSE exceeds σ²
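The two MSEs can be compared directly. A short pure-Python sketch (data values made up for illustration) using the algebraic identity yt − ȳ{−t} = (T/(T−1))(yt − ȳT):

```python
# Leave-one-out CV for the sample mean. Because
# y_t - ybar_{-t} = (T/(T-1)) * (y_t - ybar_T),
# the LOO MSE exceeds the in-sample MSE by the factor (T/(T-1))^2.
y = [1.2, -0.4, 0.7, 2.1, -1.3, 0.5, 0.9, -0.2]
T = len(y)
ybar = sum(y) / T

mse_in = sum((v - ybar) ** 2 for v in y) / T       # in-sample MSE

mse_loo = 0.0
for t in range(T):
    ybar_minus_t = (sum(y) - y[t]) / (T - 1)       # mean computed without y_t
    mse_loo += (y[t] - ybar_minus_t) ** 2
mse_loo /= T

ratio = mse_loo / mse_in                           # equals (T/(T-1))^2
```

The ratio is exact for the sample mean, which makes this a useful sanity check before applying CV to richer models.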

Timmermann (UCSD) Model Selection Winter, 2017 27 / 35

How many predictor variables do we have?

Low-dimensional set of variables

Large-dimensional: hundreds or thousands

The Federal Reserve Bank of St. Louis database, FRED, has 429,000 time series

Timmermann (UCSD) Model Selection Winter, 2017 28 / 35

Lasso Model Selection

Lasso (Least Absolute Shrinkage and Selection Operator) is a type of
shrinkage estimator for least squares regression

Shrinkage estimators (“pull towards zero”) reduce the effect of sampling
errors

Lasso estimates linear regression coefficients by minimizing the sum of
squared residuals subject to a penalty function

T⁻¹ ∑_{t=1}^T (yt − β′xt−1)² + λ ∑_{i=1}^{nk} |βi|

The second term is the penalty
λ : scalar tuning parameter determining the size of the penalty
Shrinks the parameter estimates towards zero
λ = 0 gives OLS estimates. Big values of λ pull β̂ towards zero

No closed form solution for minimizing the expression – computational
methods are required
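One standard computational method is coordinate descent with soft-thresholding. A minimal numpy sketch of that idea for the objective above (an illustrative implementation, not matlab's lasso; data and names are ours):

```python
import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding operator S(z, gamma) = sign(z) * max(|z| - gamma, 0)."""
    return np.sign(z) * max(abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/T) * sum((y - X b)^2) + lam * sum(|b_j|)."""
    T, K = X.shape
    beta = np.zeros(K)
    for _ in range(n_iter):
        for j in range(K):
            r_j = y - X @ beta + X[:, j] * beta[j]   # partial residual
            rho = (2.0 / T) * (X[:, j] @ r_j)
            z = (2.0 / T) * (X[:, j] @ X[:, j])
            beta[j] = soft_threshold(rho, lam) / z   # closed-form 1-d update
    return beta

rng = np.random.default_rng(3)
T, K = 120, 5
X = rng.standard_normal((T, K))
y = 2.0 * X[:, 0] + rng.standard_normal(T)

beta_ols = lasso_cd(X, y, lam=0.0)     # lam = 0 reproduces OLS
beta_big = lasso_cd(X, y, lam=50.0)    # a large penalty shrinks everything to 0
```

The two extreme penalties illustrate the slide's point: λ = 0 gives OLS, while a large λ pulls every coefficient to exactly zero, which is what makes the Lasso usable for model selection.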

Timmermann (UCSD) Model Selection Winter, 2017 29 / 35

Lasso Model Selection

Common to re-estimate parameters of selected variables by OLS

In matlab: lasso

“lasso performs lasso or elastic net regularization for linear regression.
[B,STATS] = lasso(X,Y,…) Performs L1-constrained linear least squares fits
(lasso) or L1- and L2-constrained fits (elastic net) relating the predictors in X
to the responses in Y. The default is a lasso fit, or constraint on the L1-norm
of the coefficients B.”

matlab uses cross-validation to choose the weight on the penalty term, λ

Lasso tends to set many coeffi cients to zero and can thus be used for model
selection

Timmermann (UCSD) Model Selection Winter, 2017 30 / 35

Variable selection and Lasso (Patrick Breheny slides)

Timmermann (UCSD) Model Selection Winter, 2017 31 / 35

Empirical example

Forecasts of quarterly (excess) stock returns

Twelve predictor variables:

dividend-price ratio,
dividend-earnings (payout) ratio,
stock volatility,
book-to-market ratio,
net equity issues,
T-bill rate,
long term return,
term spread,
default yield,
default return,
inflation
investment-capital ratio

Timmermann (UCSD) Model Selection Winter, 2017 32 / 35

Time-series forecasts of quarterly stock returns

[Figure: four panels of time-series forecasts of quarterly stock returns,
1970Q1–2010Q4. Panel legends: AIC, BIC, CrossVal; Forward, Backward;
Bagging α=2%, α=3%; Lasso λ=8, λ=3]
Timmermann (UCSD) Model Selection Winter, 2017 33 / 35

Inclusion of individual variables

[Figure: inclusion of individual predictors over time, 1970Q1–2010Q4, by
selection method (AIC, BIC, CV, Forward, Backward, Lasso), shown for the
dividend-price ratio (dp), T-bill rate (tbl), term spread (tms), and default
yield (dfy)]
Timmermann (UCSD) Model Selection Winter, 2017 34 / 35

Conclusion

Model selection increases the “space” over which the search for the
forecasting model is conducted

Model uncertainty matters and can be as important as parameter
estimation error

When one model is clearly superior to others it will nearly always be selected

No free lunch – when a single model is not obviously superior to all other
models, different models are selected by different criteria in different samples

statistical techniques for model selection are used precisely because models are
hard to tell apart, and not because one model is obviously much better than
the others

Timmermann (UCSD) Model Selection Winter, 2017 35 / 35

Lecture 4: Updating and Forecasting with New
Information

UCSD, January 30, 2017

Allan Timmermann1

1UC San Diego

Timmermann (UCSD) Filtering Winter, 2017 1 / 53

1 Bayes rule and updating beliefs

2 The Kalman Filter
Application to inflation forecasting

3 Nowcasting

4 Jagged Edge Data

5 Daily Business Cycle Indicator for the U.S.

6 Markov Switching models
Empirical examples

Timmermann (UCSD) Filtering Winter, 2017 2 / 53

Updating and forecasting

A good forecasting model closely tracks how data evolve over time

Important to update beliefs about the predicted variable or the
forecasting model as new information arrives

Filtering

Suppose the current “state” is unobserved. For example we may not
know in real time if the economy is in a recession

Nowcasting: obtaining the best estimate of the current state

How do we accurately and efficiently update our forecasts as new
information arrives?

Kalman filter (continuous state)
Regime switching model (small number of states)

Timmermann (UCSD) Filtering Winter, 2017 2 / 53

Bayes’ rule

Bayes rule for two random variables A and B:

P(B |A) = P(A|B)P(B) / P(A)

P(A) : probability of event A
P(B) : probability of event B
P(B |A) : probability of event B given that event A occurred

Timmermann (UCSD) Filtering Winter, 2017 3 / 53

Bayes’ rule (cont.)

Let θ be some unknown model parameters while y are observed data.
By Bayes’ rule (setting θ = B, y = A)

P(θ|y) = P(y |θ)P(θ) / P(y)

Given some data, y , what do we know about model parameters θ?
If we are only interested in θ, we can ignore p(y) :

P(θ|y) ∝ P(y |θ) × P(θ)
(posterior ∝ likelihood × prior)

We start with prior beliefs before seeing the data. Updating these
beliefs with the observed data, we get posterior beliefs

Parameters, θ, that do not fit the data become less likely
For example, if θH is ‘high mean returns’ and we observe a data sample
with low mean returns, then we put less weight on θH

Timmermann (UCSD) Filtering Winter, 2017 4 / 53

Examples of Bayes’ rule

B : European recession. A : European growth rate of -1%

P(recession|g = −1%) = P(g = −1%|recession) P(recession) / P(g = −1%)

B : bear market. A : negative returns of -5%

P(bear |r = −5%) = P(r = −5%|bear) P(bear) / P(r = −5%)

Here P (recession) and P(bear) are the initial probabilities of being in
a recession/bear market (before observing the data)

Timmermann (UCSD) Filtering Winter, 2017 5 / 53
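A quick numeric illustration of the recession example in Python (every probability below is made up for illustration; the slides state only the formula):

```python
# Bayes' rule with hypothetical numbers: P(recession | g = -1%).
p_recession = 0.20        # prior P(recession)
p_g_given_rec = 0.60      # P(g = -1% | recession)
p_g_given_norec = 0.05    # P(g = -1% | no recession)

# law of total probability gives the denominator P(g = -1%)
p_g = p_g_given_rec * p_recession + p_g_given_norec * (1 - p_recession)

posterior = p_g_given_rec * p_recession / p_g   # updated recession probability
```

Seeing a growth rate that is far more likely under a recession raises the recession probability, here from a prior of 20% to a posterior of 75%.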

Understanding the updating process

Suppose two random variables Y and X are normally distributed:

(Y, X)′ ∼ N( (µy, µx)′, [σ²y σxy; σxy σ²x] )

µy and µx are the initial (unconditional) expected values of y and x

σ²y, σ²x, σxy are the variances and covariance

Conditional mean and variance of Y given an observation X = x:

E[Y |X = x] = µy + (σxy/σ²x)(x − µx)

Var(Y |X = x) = σ²y − σ²xy/σ²x

If Y and X are positively correlated (σxy > 0) and we observe a high
value of X (x > µx), then we increase our expectation of Y
Just like a linear regression! σxy/σ²x is the beta coefficient
Timmermann (UCSD) Filtering Winter, 2017 6 / 53
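The two conditioning formulas are one-liners to check in Python (the parameter values here are hypothetical):

```python
# Conditional mean and variance of Y given X = x for jointly normal (Y, X).
def normal_update(mu_y, mu_x, var_y, var_x, cov_xy, x):
    beta = cov_xy / var_x                     # the regression coefficient
    cond_mean = mu_y + beta * (x - mu_x)      # E[Y | X = x]
    cond_var = var_y - cov_xy ** 2 / var_x    # Var(Y | X = x)
    return cond_mean, cond_var

# seeing x above its mean raises the expectation of Y when cov_xy > 0
m, v = normal_update(mu_y=0.0, mu_x=0.0, var_y=4.0, var_x=1.0, cov_xy=1.0, x=2.0)
```

Note that the conditional variance never exceeds the unconditional one: observing X can only reduce (or leave unchanged) our uncertainty about Y.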

Kalman Filter: Background

The Kalman filter is an algorithm for linear updating and prediction

Introduced by Kalman in 1960 for engineering applications

Method has found great use in many disciplines, including economics
and finance

Kalman Filter gives an updating rule that can be used to revise our
beliefs as we see more and more data

For models with normally distributed variables, the filter can be used
to write down the likelihood function

Timmermann (UCSD) Filtering Winter, 2017 7 / 53

Kalman Filter (Wikipedia) I

“Kalman filtering, also known as linear quadratic estimation (LQE), is
an algorithm that uses a series of measurements observed over time,
containing noise (random variations) and other inaccuracies, and
produces estimates of unknown variables that tend to be more precise
than those based on a single measurement alone. More formally, the
Kalman filter operates recursively on streams of noisy input data to
produce a statistically optimal estimate of the underlying system
state. The filter is named for Rudolf (Rudy) E. Kálmán, one of the
primary developers of its theory.

The Kalman filter has numerous applications in technology. A
common application is for guidance, navigation and control of
vehicles, particularly aircraft and spacecraft. Furthermore, the
Kalman filter is a widely applied concept in time series analysis used
in fields such as signal processing and econometrics.

Timmermann (UCSD) Filtering Winter, 2017 8 / 53

Kalman Filter (Wikipedia) II

The algorithm works in a two-step process. In the prediction step,
the Kalman filter produces estimates of the current state variables,
along with their uncertainties. Once the outcome of the next
measurement (necessarily corrupted with some amount of error,
including random noise) is observed, these estimates are updated
using a weighted average, with more weight being given to estimates
with higher certainty. Because of the algorithm’s recursive nature, it
can run in real time using only the present input measurements and
the previously calculated state and its uncertainty matrix; no
additional past information is required.”

Timmermann (UCSD) Filtering Winter, 2017 9 / 53

Kalman Filter: Models in state space form

Let St be an unobserved (state) variable while yt is an observed
variable. A model that shows how yt is related to St and how St
evolves is called a state space model. This has two equations:

State equation (unobserved/latent):

St = φ× St−1 + εst , εst ∼ (0, σ2s ) (1)

Measurement equation (observed)

yt = B × St + εyt , εyt ∼ (0, σ2y ) (2)

Innovations are uncorrelated with each other:

Cov(εst , εyt ) = 0

Timmermann (UCSD) Filtering Winter, 2017 10 / 53

Example 1: AR(1) model in state space form

AR(1) model
yt = φyt−1 + εt

This can be written in state space form as

St = φSt−1 + εt state eq.
yt = St measurement eq.

with B = 1, σ²s = σ²ε, and σ²y = 0

very simple: no error in the measurement equation: yt is observed
without error

Timmermann (UCSD) Filtering Winter, 2017 11 / 53

Example 2: MA(1) model in state space form

MA(1) model with unobserved shocks εt :

yt = εt + θεt−1

This can be written in state space form:

[S1t; S2t] = [0 0; 1 0] [S1,t−1; S2,t−1] + [1; 0] εt   (state eq., φ = [0 0; 1 0])

yt = (1 θ) [S1t; S2t] = εt + θεt−1   (measurement eq., B = (1 θ))

Note that S1t = εt and S2t = S1,t−1 = εt−1

Timmermann (UCSD) Filtering Winter, 2017 12 / 53

Example 3: Unobserved components model

The unobserved components model consists of two equations

yt = St + εyt (B = 1)

St = St−1 + εst (φ = 1)

yt is observed with noise

St is the underlying “mean” of yt . This is smoother than yt
This model can be written as an ARIMA(0,1,1):

yt − yt−1 = St − St−1 + εyt − εyt−1
= εst + εyt − εyt−1 : MA(1)

Timmermann (UCSD) Filtering Winter, 2017 13 / 53

Kalman Filter: Advantages

The state equation in (1) is in AR(1) form and so is easy to iterate
forward. The h−step-ahead forecast of the state given its current
value, St, is given by

Et[St+h |St] = φ^h St

In practice we don’t observe St and so need an estimate of this given
current information, St |t , or past information, St |t−1
Updating the Kalman filter through newly arrived information is easy

Timmermann (UCSD) Filtering Winter, 2017 14 / 53

Kalman Filter Updating Equations

It = {yt , yt−1, yt−2, …}. Current information
It−1 = {yt−1, yt−2, yt−3, …}. Lagged information
yt : random variable we want to predict
yt |t−1 : best prediction of yt given information at t − 1, It−1
St |t−1 : best prediction of St given information at t − 1, It−1
St |t : best “prediction” (or nowcast) of St given information It
Define mean squared error (MSE)-values associated with the forecasts
of St and yt

MSE^S_{t|t−1} = E[(St − St|t−1)²]
MSE^S_{t|t} = E[(St − St|t)²]
MSE^y_{t|t−1} = E[(yt − yt|t−1)²]

Timmermann (UCSD) Filtering Winter, 2017 15 / 53

Prediction and Updating Equations

Using the state, measurement, and MSE equations, the Kalman filter
gives a set of prediction equations:

St|t−1 = φ St−1|t−1
MSE^S_{t|t−1} = φ² MSE^S_{t−1|t−1} + σ²s
yt|t−1 = B × St|t−1
MSE^y_{t|t−1} = B² × MSE^S_{t|t−1} + σ²y

Similarly, we have a pair of updating equations for S:

St|t = St|t−1 + B (MSE^S_{t|t−1} / MSE^y_{t|t−1}) (yt − yt|t−1)
MSE^S_{t|t} = MSE^S_{t|t−1} [1 − B² (MSE^S_{t|t−1} / MSE^y_{t|t−1})]

Timmermann (UCSD) Filtering Winter, 2017 16 / 53
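The recursions above can be written out directly. A minimal scalar implementation in Python (an illustrative sketch with simulated data; matlab's ssm object handles the general multivariate case):

```python
import numpy as np

def kalman_scalar(y, phi, B, sigma2_s, sigma2_y, S0=0.0, P0=1.0):
    """Scalar Kalman filter for S_t = phi*S_{t-1} + eps_s, y_t = B*S_t + eps_y.
    Returns filtered states S_{t|t} and their MSEs."""
    S_filt, P_filt = [], []
    S_prev, P_prev = S0, P0                       # S_{0|0} and MSE^S_{0|0}
    for y_t in y:
        # prediction step
        S_pred = phi * S_prev                     # S_{t|t-1}
        P_pred = phi ** 2 * P_prev + sigma2_s     # MSE^S_{t|t-1}
        y_pred = B * S_pred                       # y_{t|t-1}
        F = B ** 2 * P_pred + sigma2_y            # MSE^y_{t|t-1}
        # updating step
        gain = B * P_pred / F
        S_prev = S_pred + gain * (y_t - y_pred)            # S_{t|t}
        P_prev = P_pred * (1 - B ** 2 * P_pred / F)        # MSE^S_{t|t}
        S_filt.append(S_prev)
        P_filt.append(P_prev)
    return np.array(S_filt), np.array(P_filt)

# simulate a random-walk state observed with noise (unobserved components model)
rng = np.random.default_rng(4)
T = 300
S = np.cumsum(rng.standard_normal(T) * 0.1)       # sigma_s = 0.1
y = S + rng.standard_normal(T) * 0.5              # sigma_y = 0.5
S_filt, P_filt = kalman_scalar(y, phi=1.0, B=1.0, sigma2_s=0.01, sigma2_y=0.25)
```

With these parameters the filtered state tracks the true state much more closely than the raw noisy observations do, and the MSE sequence settles at its steady-state value after a few observations.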

Prediction and Updating Equations

Intuition for updating equations (B = 1)

St|t = St|t−1 + (MSE^S_{t|t−1} / MSE^y_{t|t−1}) (yt − yt|t−1)

St|t : estimate of current (t) state given current information It
St|t−1 : old (t − 1) estimate of state St given It−1
MSE^S_{t|t−1} / MSE^y_{t|t−1} : amount by which we update our estimate of
the current state after we observe yt. This is small if MSE^y_{t|t−1} is big
(noisy data) relative to MSE^S_{t|t−1}, i.e., σ²y >> σ²s

(yt − yt |t−1) : surprise (news) about yt
If yt is higher than we expected, (yt − yt |t−1) > 0 and we increase
our expectations about the state: St |t > St |t−1. The updating
equation tells us by how much

Timmermann (UCSD) Filtering Winter, 2017 17 / 53

Starting the Algorithm

At t = 0, we have not observed any data, so we must make our best
guesses of S1|0 and MSE^S_{1|0} without data by picking a pair of initial
conditions. This gives y1|0 and MSE^y_{1|0} from the prediction equations

At t = 1 we observe y1. The updating equations generate S1|1 and
MSE^S_{1|1}. The prediction equations then generate forecasts for the
second period

At t = 2 we observe y2, and the cycle continues to give sequences of
predictions of the states, {St |t} and {St |t−1}
Keep on iterating to get a sequence of estimates

Timmermann (UCSD) Filtering Winter, 2017 18 / 53

Filtered versus smoothed states

St |t : filtered states: estimate of the state at time t given information
up to time t

Uses only historical information
What is my best guess of St given my current information?
“Filters” past historical information for noise

St |T : smoothed states: estimate of the state at time t given
information up to time T

Uses the full sample up to time T ≥ t
Less sensitive to noise and thus tends to be smoother than the filtered
states
Information on yt−1, yt , yt+1 help us more precisely estimate the state
at time t, St

Timmermann (UCSD) Filtering Winter, 2017 19 / 53

Practical applications of the Kalman filter

Common to use Kalman filter to estimate adaptive forecasting models
with time-varying relations:

yt+1 = βtxt + εt+1

Two alternative specifications for βt :

βt − β̄ = φ(βt−1 − β̄) + ut : mean-reverts to β̄
βt = βt−1 + ut : random walk

yt , xt : observed variables
βt : time-varying coefficient (unobserved state variable)

Timmermann (UCSD) Filtering Winter, 2017 20 / 53

Kalman filter in matlab

Matlab has a good Kalman filter called ssm (state space model). The
setup is

xt = Atxt−1 + Btut
yt = Ctxt +Dtet

xt : unobserved state (our St)
yt : observed variable
ut , et : uncorrelated noise processes with variance of one
model = ssm(A,B,C,D,’StateType’,stateType); % state space model

modelEstimate = estimate(model,variable,params0,’lb’,[0; 0])

filtered = filter(modelEstimate,variable)

smoothed = smooth(modelEstimate,variable)

Timmermann (UCSD) Filtering Winter, 2017 21 / 53

Kalman filter example: monthly inflation

Unobserved components model for inflation

xt = xt−1 + σuut
yt = xt + σeet

A = 1; % state-transition matrix (A = φ in our notation)

B = NaN; % state-disturbance-loading matrix (B = σS )

C = 1; % measurement-sensitivity matrix (C = B in our notation)

D = NaN; % observation-innovation matrix (D = σy )

stateType = 2; % sets state equation to be a random walk

Timmermann (UCSD) Filtering Winter, 2017 22 / 53

Application of Kalman filter to monthly US inflation

[Figure: monthly US inflation, 1930–2010]
Timmermann (UCSD) Filtering Winter, 2017 23 / 53

Kalman filter estimates of inflation (last 100 obs.)

[Figure: last 100 observations, 2005–2012, in percentage points: inflation,
filtered inflation, and smoothed inflation]

Timmermann (UCSD) Filtering Winter, 2017 24 / 53

Kalman filter take-aways

Kalman filter is a very popular approach for dynamically updating
linear forecasts

Used to estimate ARMA models

Used throughout engineering and the social sciences

Fast, easy algorithm

Optimal updating equations for normally distributed data

Timmermann (UCSD) Filtering Winter, 2017 25 / 53

Nowcasting

Nowcasting refers to “estimating the present”

Nowcasting extracts information about the present state of some
variable or system of variables

distinct from traditional forecasting

Nowcasting only makes sense if the present state is unknown;
otherwise nowcasting would just amount to checking the current value

Example: Use a single unobserved state variable to summarize the
state of the economy, e.g., the daily point in the business cycle

Variables such as GDP are actually observed with large measurement
errors (revisions)

Timmermann (UCSD) Filtering Winter, 2017 26 / 53

Jagged edge data

Macroeconomic data such as GDP, monetary aggregates,
consumption, unemployment figures or housing starts as well as
financial data extracted from balance sheets and income statements
are published infrequently and sometimes at irregular intervals

Delays in the publication of macro variables differ across variables

Irregular data releases (release date changes from month to month)
generate what is often called “jagged edge”data

A forecaster can only use the data that is available on any given date
and needs to pay careful attention to which variables are in the
information set

Timmermann (UCSD) Filtering Winter, 2017 27 / 53

Aruoba, Diebold and Scotti daily business cycle indicator

ADS model the daily business cycle, St , as an unobserved variable
that follows a (zero-mean) AR(1) process:

St = φSt−1 + et

Although St is unobserved, we can extract information about it from
its relation with a set of observed economic variables y1t , y2t , …

At the daily horizon these variables follow processes:

yit = ki + βi St + γi yi,t−Di + uit,  i = 1, …, n

Di equals seven days if the variable is observed weekly, etc.

Timmermann (UCSD) Filtering Winter, 2017 28 / 53

Aruoba, Diebold and Scotti index from Philly Fed

Timmermann (UCSD) Filtering Winter, 2017 29 / 53

ADS five variable model

The ADS model can be written in state-space form

For example, a model could use the following observables:

interest rates (daily, y1t)
initial jobless claims (weekly, y2t)
personal income (monthly, y3t)
industrial production (monthly, y4t)
GDP (quarterly, y5t)

Kalman filter can be used to extract and update estimates of the
unobserved common variable that tracks the state of the economy

Kalman filter is well suited for handling missing data

If all elements of yt are missing on a given day, we skip the updating
step

Timmermann (UCSD) Filtering Winter, 2017 30 / 53

Markov Chains: Basics

Updating equations simplify a great deal if we only have two states,
states 1 and 2, and want to know which state we are currently in

recession/expansion
inflation/deflation
bull/bear market
high volatility/low volatility

Timmermann (UCSD) Filtering Winter, 2017 31 / 53

Markov Chains: Basics

A first order (constant) Markov chain, St , is a random process that
takes integer values {1, 2, ….,K} with state transitions that depend
only on the most recent state, St−1
Probability of moving from state i at time t − 1 to state j at time t is
pij :

P(St = j |St−1 = i) = pij
0 ≤ pij ≤ 1,  ∑_{j=1}^K pij = 1

Timmermann (UCSD) Filtering Winter, 2017 32 / 53

Fitted values, 3-state model for monthly T-bill rates

[Figure: fitted values from a three-state model for monthly T-bill rates,
1920–2020]
Timmermann (UCSD) Filtering Winter, 2017 33 / 53

Two-state Markov Chain

With K = 2 states, the transition probabilities can be collected in a
2 × 2 matrix

P = P(St+1 = j |St = i) = [p11 p21; p12 p22] = [p11 1−p22; 1−p11 p22]

Law of total probability: pi1 + pi2 = 1: we either stay in state i or we
leave to state j

pii : “stayer” probability – measure of state i’s persistence

Timmermann (UCSD) Filtering Winter, 2017 34 / 53

Basic regime switching model

Simple two-state regime-switching model

yt+1 = µst+1 + σst+1 εt+1, εt+1 ∼ N(0, 1)
P(st+1 = j |st = i) = pij

µst+1 : mean in state st+1
σst+1 : volatility in state st+1
st+1 matters for both the mean and volatility of yt+1

if st+1 = 1 : yt+1 = µ1 + σ1εt+1
if st+1 = 2 : yt+1 = µ2 + σ2εt+1

Timmermann (UCSD) Filtering Winter, 2017 35 / 53

Updating state probabilities

To predict yt+1, we need to predict st+1. This depends on the current
state, st
Let p1t |t = prob(st = 1|It ) be the current probability of being in
state 1 given all information up to time t, It

If p1t |t = 1, we know for sure that we are in state 1 at time t
Typically p1t|t < 1 and there is uncertainty about the present state
Let p1t+1|t = prob(st+1 = 1|It) be the predicted probability of being in
state 1 next period (t + 1), given It

Timmermann (UCSD) Filtering Winter, 2017 36 / 53

Updating state probabilities

To be in state 1 at time t + 1, we must have come from either state 1
or from state 2:

p1t+1|t = p11 × p1t|t + (1 − p22) × p2t|t
p2t+1|t = (1 − p11) × p1t|t + p22 × p2t|t

If p1t|t = 1, we know for sure that we are in state 1 at time t. Then
the equations simplify to

p1t+1|t = p11 × 1 + (1 − p22) × 0 = p11
p2t+1|t = (1 − p11) × 1 + p22 × 0 = 1 − p11

Timmermann (UCSD) Filtering Winter, 2017 37 / 53

Updating with two states

Let P(st = 1|yt−1) and P(st = 2|yt−1) be our initial estimates of being
in states 1 and 2 given information at time t − 1
In period t we observe a new data point: yt

If we are in state 1 the likelihood of observing yt is P(yt |st = 1)
If we are in state 2 the likelihood of yt is P(yt |st = 2)

If these are normally distributed, we have

P(yt |st = 1) = (1/√(2πσ²1)) exp(−(yt − µ1)²/(2σ²1))
P(yt |st = 2) = (1/√(2πσ²2)) exp(−(yt − µ2)²/(2σ²2))   (3)

Timmermann (UCSD) Filtering Winter, 2017 38 / 53

Bayesian updating with two states: examples I

Use Bayes’ rule to compute the updated state probabilities:

P(st = 1|yt) = P(yt |st = 1)P(st = 1) / P(yt), where
P(yt) = P(yt |st = 1)P(st = 1) + P(yt |st = 2)P(st = 2)

Similarly

P(st = 2|yt) = P(yt |st = 2)P(st = 2) / P(yt)

Suppose that µ1 < 0 and σ²1 is "large", so state 1 is a high-volatility
state with negative mean, while µ2 > 0 with small σ²2, so state 2 is a
“normal” state

Timmermann (UCSD) Filtering Winter, 2017 39 / 53

Bayesian updating with two states: examples II

If we see a large negative yt , this is most likely drawn from state 1
and so P(yt |st = 1) > P(yt |st = 2). Then we revise upward the
probability that we are currently (at time t) in state 1

Example:

µ1 = −3, σ1 = 5, µ2 = 1, σ2 = 2
P(st = 1|yt−1) = 0.70, P(st = 2|yt−1) = 0.30 : initial estimates
p11 = 0.8, p22 = 0.9

Suppose we observe yt = −4. Then from (3)

p(yt |st = 1) = normpdf(−4, −3, 5) = 0.0782
p(yt |st = 2) = normpdf(−4, 1, 2) = 0.0088

P(st = 1|yt) = (0.0782 × 0.70) / (0.0782 × 0.70 + 0.0088 × 0.30) = 0.954

P(st = 2|yt) = (0.0088 × 0.30) / (0.0782 × 0.70 + 0.0088 × 0.30) = 0.046

Timmermann (UCSD) Filtering Winter, 2017 40 / 53
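The arithmetic in this example is easy to reproduce. A short Python sketch (the normal pdf is written out with math.exp rather than matlab's normpdf; the parameter values are those on the slide):

```python
import math

def normpdf(y, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    return math.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# parameters from the example
mu1, sigma1, mu2, sigma2 = -3.0, 5.0, 1.0, 2.0
prior1, prior2 = 0.70, 0.30          # P(s_t = 1 | y_{t-1}), P(s_t = 2 | y_{t-1})
p11, p22 = 0.8, 0.9                  # stayer probabilities
y_t = -4.0                           # observed value

lik1 = normpdf(y_t, mu1, sigma1)     # roughly 0.0782
lik2 = normpdf(y_t, mu2, sigma2)     # roughly 0.0088

# Bayes' rule: update the state probabilities given y_t
post1 = lik1 * prior1 / (lik1 * prior1 + lik2 * prior2)
post2 = 1 - post1

# one-step-ahead state probabilities via the transition probabilities
pred1 = post1 * p11 + post2 * (1 - p22)
pred2 = post1 * (1 - p11) + post2 * p22
```

Changing y_t to +1 reproduces the later example in which the probability of state 1 falls instead of rising.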

Bayesian updating with two states: examples III

Because the observed value (-4%) is far more likely to have been
drawn from state 1 than from state 2, we revise upwards our beliefs
that we are currently in the first state from 70% to 95.4%
Using p11 and p22, our forecast of being in state 1 next period (at
time t + 1) is

P(st+1 = 1|yt ) = 0.954× 0.8+ 0.046× (1− 0.9) = 0.768

Our forecast of being in state 2 next period is

P(st+1 = 2|yt ) = 0.954× (1− 0.8) + 0.046× 0.9 = 0.232

Timmermann (UCSD) Filtering Winter, 2017 41 / 53

Bayesian updating with two states: examples IV

Similarly, the mean and variance forecasts in this case are given by

E[yt+1|yt] = µ1 P(st+1 = 1|yt) + µ2 P(st+1 = 2|yt)
= −3 × 0.768 + 1 × 0.232 = −2.07

Var(yt+1|yt) = σ²1 P(st+1 = 1|yt) + σ²2 P(st+1 = 2|yt)
+ P(st+1 = 1|yt) × P(st+1 = 2|yt) × (µ2 − µ1)²
= 5² × 0.768 + 2² × 0.232 + 0.768 × 0.232 × (1 + 3)² = 22.98

Timmermann (UCSD) Filtering Winter, 2017 42 / 53
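The two-state update and one-step-ahead forecast just described can be sketched in Python (a minimal illustration of the slides' arithmetic, not the course's matlab code; all parameter values are taken from the example above):

```python
import math

# Parameter values from the slides' example
mu = [-3.0, 1.0]       # state means (state 1 = "bad" state)
sigma = [5.0, 2.0]     # state standard deviations
p11, p22 = 0.8, 0.9    # probabilities of staying in states 1 and 2
prior = [0.70, 0.30]   # P(s_t = i | y_{t-1})

def normpdf(y, m, s):
    """Normal density with mean m and standard deviation s, evaluated at y."""
    return math.exp(-(y - m) ** 2 / (2 * s ** 2)) / (math.sqrt(2 * math.pi) * s)

def update_and_predict(y, prior):
    """Bayes update of the state probabilities on observing y, followed by
    one-step-ahead state, mean and variance forecasts."""
    like = [normpdf(y, mu[i], sigma[i]) for i in range(2)]
    denom = like[0] * prior[0] + like[1] * prior[1]
    post = [like[i] * prior[i] / denom for i in range(2)]   # P(s_t = i | y_t)
    pred = [post[0] * p11 + post[1] * (1 - p22),            # P(s_{t+1} = 1 | y_t)
            post[0] * (1 - p11) + post[1] * p22]            # P(s_{t+1} = 2 | y_t)
    mean = mu[0] * pred[0] + mu[1] * pred[1]
    var = (sigma[0] ** 2 * pred[0] + sigma[1] ** 2 * pred[1]
           + pred[0] * pred[1] * (mu[1] - mu[0]) ** 2)
    return post, pred, mean, var

post, pred, mean, var = update_and_predict(-4.0, prior)
# Reproduces the slides: post[0] = 0.954, pred[0] = 0.768, mean = -2.07, var = 22.98
print(round(post[0], 3), round(pred[0], 3), round(mean, 2), round(var, 2))
```

Applying the same function to yt = +1 reproduces the update on the next slide as well.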

Bayesian updating with two states: example (cont.)

Suppose instead we observe a value yt = +1. Then

p(yt|st = 1) = Normpdf(4/5) = 0.0579
p(yt|st = 2) = Normpdf(0) = 0.1995

P(st = 1|yt) = (0.0579 × 0.70)/(0.0579 × 0.70 + 0.1995 × 0.30) = 0.4038

P(st = 2|yt) = (0.1995 × 0.30)/(0.0579 × 0.70 + 0.1995 × 0.30) = 0.5962

Now, we reduce the probability of being in state 1 from 70% to 40%,
while we increase the chance of being in state 2 from 30% to 60%
Our forecasts of being in states 1 and 2 next period are

P(st+1 = 1|yt) = 0.4038 × 0.8 + 0.5962 × (1 − 0.9) = 0.3827
P(st+1 = 2|yt) = 0.4038 × (1 − 0.8) + 0.5962 × 0.9 = 0.6173

Timmermann (UCSD) Filtering Winter, 2017 43 / 53

Estimation of Markov switching models

The MS model is neither Gaussian nor linear: the state st might lead
to changes in regression coefficients and the covariance matrix

Two common estimation methods:

Maximum likelihood estimation (MLE)
Bayesian estimation using Gibbs sampler

Filtered states: P(st = i |It ) : probability of being in state i at time t
given information at time t, It
Smoothed states: P(st = i |IT ) : probability of being in state i at
time t given information at the end of the sample, IT
Choice of number of states can be tricky. We can use AIC or BIC

Timmermann (UCSD) Filtering Winter, 2017 44 / 53

Filtered states (Ang-Timmermann, 2012)

Timmermann (UCSD) Filtering Winter, 2017 45 / 53

Smoothed state probabilities (Ang-Timmermann, 2012)

Timmermann (UCSD) Filtering Winter, 2017 46 / 53

Smoothed state probabilities (Ang-Timmermann)

Timmermann (UCSD) Filtering Winter, 2017 47 / 53

Parameter estimates (Ang-Timmermann, 2012)

yt = µst + φst yt−1 + σst εt, εt ∼ iid N(0, 1)

Timmermann (UCSD) Filtering Winter, 2017 48 / 53

Take-away for MS models

Markov switching models are popular in finance and economics

MS models are easy to interpret economically

Empirically often one state is highly persistent (“normal” state) with
parameters not too far from the average of the series
The other state is often more transitory and captures spells of high
volatility (asset returns) or negative outliers (GDP growth)

Forecasts are easy to compute with MS models

One state often has high volatility – regime switching can be
important for risk management

Try to experiment with the Markov switching and Kalman filter codes
on Ted

Timmermann (UCSD) Filtering Winter, 2017 49 / 53

Estimates, 3-state model for monthly stock returns

P′ =
  0.9881  0.0119  0.000
  0.000   0.9197  0.0803
  0.8437  0.000   0.1563

µ = ( 0.0651  −0.1321  0.3756 )

σ = ( 0.0571  0.1697  0.0154 )

P : state transition probabilities, µ : means, σ : volatilities
State 1: highly persistent, medium mean, medium volatility

State 2: negative mean, high volatility, medium persistence

State 3: transitory bounce-back state with high mean

Timmermann (UCSD) Filtering Winter, 2017 50 / 53

Smoothed state probabilities, monthly stock returns

[Figure: smoothed state probabilities for monthly stock returns, 1930–2010]
Timmermann (UCSD) Filtering Winter, 2017 51 / 53

Fitted versus actual stock returns (3 state model)

[Figure: fitted values of monthly stock returns from the 3-state model, 1920–2020]

Timmermann (UCSD) Filtering Winter, 2017 52 / 53

Volatility of monthly stock returns

[Figure: fitted volatility of monthly stock returns, 1920–2020]

Timmermann (UCSD) Filtering Winter, 2017 53 / 53

Lecture 5: Random walk and spurious correlation
UCSD, Winter 2017

Allan Timmermann1

1UC San Diego

Timmermann (UCSD) Random walk Winter, 2017 1 / 21

1 Random walk model

2 Logs, levels and growth rates

3 Spurious correlation
Timmermann (UCSD) Random walk Winter, 2017 2 / 21

Random walk model

The random walk model is an AR(1) yt = φ1yt−1 + εt with φ1 = 1 :

yt = yt−1 + εt , εt ∼ WN(0, σ2).

This model implies that the change in yt is unpredictable:

∆yt = yt − yt−1 = εt

For example, the level of (log-) stock prices is easy to predict, but not
its change (rate of return for log-prices)

Shocks to the random walk have permanent effects: A one unit shock
moves the series by one unit forever. This is in sharp contrast to a
mean-reverting process such as yt = 0.8yt−1 + εt

Timmermann (UCSD) Random walk Winter, 2017 2 / 21

Random walk model (cont)

The variance of a random walk increases over time so the distribution
of yt changes over time. Suppose that yt started at zero, y0 = 0 :

y1 = y0 + ε1 = ε1
y2 = y1 + ε2 = ε1 + ε2

yt = ε1 + ε2 + … + εt−1 + εt, so

E[yt] = 0
var(yt) = var(ε1 + ε2 + … + εt) = tσ² ⇒ lim(t→∞) var(yt) = ∞

The variance of y grows proportionally with time
A random walk does not revert back to the mean but wanders up and
down at random
Timmermann (UCSD) Random walk Winter, 2017 3 / 21

Forecasts from random walk model

Recall that forecasts from the AR(1) process yt = φ1yt−1 + εt,
εt ∼ WN(0, σ²) are simply

ft+h|t = φ1^h yt

For the random walk model φ1 = 1, so for all forecast horizons, h, the
forecast is simply the current value:

ft+h|t = yt

Forecast of tomorrow = today’s value

The basic random walk model says that the value of the series next
period (given the history of the series) equals its current value plus an
unpredictable change. Random steps, εt , make yt a “random walk”

Timmermann (UCSD) Random walk Winter, 2017 4 / 21

Random walk with a drift

Introduce a non-zero drift term, δ :

yt = δ+ yt−1 + εt , εt ∼ WN(0, σ2).

This is a popular model for the logarithm of stock prices
The drift term, δ, plays the same role as a time trend. Assuming
again that the series started at y0, we have

yt = 2δ + yt−2 + εt + εt−1
   = δt + y0 + ε1 + ε2 + … + εt−1 + εt, so

E[yt] = y0 + δt
var(yt) = tσ² ⇒ lim(t→∞) var(yt) = ∞

Timmermann (UCSD) Random walk Winter, 2017 5 / 21

Summary of properties of random walk

Changes in a random walk are unpredictable

Shocks have permanent effects

Variance grows in proportion with the forecast horizon

These points are important for forecasting:

point forecasts never revert to a mean or a trend
since the variance goes to infinity, the width of interval forecasts
increases without bound as the forecast horizon grows. Uncertainty
grows without bounds.

Timmermann (UCSD) Random walk Winter, 2017 6 / 21
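These properties are easy to verify by simulation. The sketch below (Python, illustrative; the path count and horizons are arbitrary choices) simulates many random-walk paths and checks that the variance across paths grows roughly in proportion to t:

```python
import random

random.seed(1)

# Simulate random-walk paths y_t = y_{t-1} + eps_t starting at y_0 = 0 and
# record the cross-path distribution of y_t at two horizons.
n_paths, sigma = 5000, 1.0
checkpoints = {10: [], 50: []}
for _ in range(n_paths):
    y = 0.0
    for t in range(1, 51):
        y += random.gauss(0.0, sigma)
        if t in checkpoints:
            checkpoints[t].append(y)

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

v10, v50 = var(checkpoints[10]), var(checkpoints[50])
# Both should be close to t * sigma^2, i.e. about 10 and 50
print(round(v10, 1), round(v50, 1))
```

The point forecast from any path stays at its current value, while the dispersion of outcomes keeps widening with the horizon, matching the interval-forecast remark above.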

Logs, levels and growth rates

Certain transformations of economic variables such as their logarithm
are often easier to model than the “raw” data

If the standard deviation of a time series is proportional to its level,
then the standard deviation of the logarithm of the series is
approximately constant:

Yt = Yt−1 exp(εt ), εt ∼ (0, σ2)⇔
ln(Yt ) = ln(Yt−1) + εt

The first difference of the log of Yt is ∆ ln(Yt ) = ln(Yt )− ln(Yt−1)
The percentage change in Yt between t − 1 and t is approximately
100∆ ln(Yt ). This can be interpreted as a growth rate
Example: US GDP follows an upward trend. Instead of studying the
level of US GDP, we can study its growth rate which is not trending

Timmermann (UCSD) Random walk Winter, 2017 7 / 21
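A quick numeric check of the growth-rate approximation (Python, with illustrative values):

```python
import math

# 100 * change in ln(Y) approximates the exact percentage change
# when the change is small
y_prev, y_now = 100.0, 103.0
exact_pct = 100 * (y_now - y_prev) / y_prev            # exact: 3.0 percent
log_pct = 100 * (math.log(y_now) - math.log(y_prev))   # approx: about 2.96
print(round(exact_pct, 2), round(log_pct, 2))
```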

Unit root processes

Random walk is a special case of a unit root process which has a unit
root in the AR polynomial, i.e.,

(1− L)yt = θ(L)εt ,

We can test for a unit root using an Augmented Dickey Fuller (ADF)
test:

∆yt = α + βyt−1 + ∑(i=1..p) λi∆yt−i + εt.

Under the null of a unit root, H0 : β = 0. Under the alternative of
stationarity, H1 : β < 0

Timmermann (UCSD) Random walk Winter, 2017 8 / 21

Unit root processes (cont.)

Example: suppose p = 0 (no autoregressive terms for ∆yt) and β = −0.2. Then

∆yt = yt − yt−1 = α − 0.2yt−1 + εt ⇔ yt = α + 0.8yt−1 + εt (which is stationary)

If instead β = 0.2, we have

yt − yt−1 = α + 0.2yt−1 + εt ⇔ yt = α + 1.2yt−1 + εt (which is explosive)

The test is based on the t-statistic of β. The test statistic follows a non-standard distribution with wider tails than the normal distribution

Timmermann (UCSD) Random walk Winter, 2017 9 / 21

Unit root test in matlab

In matlab: adftest

[h,pValue,stat,cValue,reg] = adftest(y)
[h,pvalue,stat,cvalue] = adftest(logprice,'lags',1,'model','AR');

Timmermann (UCSD) Random walk Winter, 2017 10 / 21

Critical values for Dickey-Fuller test

Timmermann (UCSD) Random walk Winter, 2017 11 / 21

Shanghai SE stock price (monthly, 1991-2014)

t-statistic: 1.0362. p-value: 0.92. Fail to reject the null of a unit root.

[Figure: log Shanghai SE stock price index, 1990–2015]

Timmermann (UCSD) Random walk Winter, 2017 12 / 21

Changes in Shanghai SE stock price

t-statistic: −11.15. p-value: 0.001. Reject the null of a unit root.

[Figure: monthly changes in the log Shanghai SE stock price index, 1990–2015]

Timmermann (UCSD) Random walk Winter, 2017 13 / 21

Spurious correlation

Time series that are trending systematically up or down may appear to be significantly correlated even though they are completely independent

Correlation between a city's ice cream sales and the number of drownings in city swimming pools: both peak at the same time even though there is no causal relationship between the two. In fact, a heat wave may drive both variables

Dutch statistics reveal a positive correlation between the number of storks nesting in the spring and the number of human babies born at that time. Any causal relation?
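The spurious-regression phenomenon for independent trending series can be reproduced with a small simulation. The Python sketch below (illustrative; the simulation sizes are smaller than the 1,000 draws used on the slides) regresses one independently generated series on another and records how often the conventional t-test rejects β1 = 0 at the 5% level, for stationary AR(1) data and for random walks:

```python
import random

random.seed(0)

def ols_tstat(y, x):
    """t-statistic on the slope of y = b0 + b1*x + u (conventional OLS s.e.)."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    se = (rss / (n - 2) / sxx) ** 0.5
    return b1 / se

def simulate(unit_root, n_sims=300, T=200):
    """Fraction of regressions of two independent series with |t| > 1.96."""
    rejections = 0
    for _ in range(n_sims):
        y1, y2, a, b = [], [], 0.0, 0.0
        for _ in range(T):
            e1, e2 = random.gauss(0, 1), random.gauss(0, 1)
            a = a + e1 if unit_root else 0.5 * a + e1
            b = b + e2 if unit_root else 0.5 * b + e2
            y1.append(a)
            y2.append(b)
        if abs(ols_tstat(y1, y2)) > 1.96:
            rejections += 1
    return rejections / n_sims

rej_stationary = simulate(unit_root=False)
rej_rw = simulate(unit_root=True)
print(rej_stationary, rej_rw)
```

Rejection rates stay near the nominal level for the stationary series but are massively inflated for the random walks, even though the two series are independent in both designs.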
Cumulative rainfall in Brazil and US stock prices

Timmermann (UCSD) Random walk Winter, 2017 14 / 21

Spurious correlation

Two series with a random walk (unit root) component may appear to be related even when they are not. Consider an example:

y1t = y1t−1 + ε1t
y2t = y2t−1 + ε2t, cov(ε1t, ε2t) = 0

Regressing one variable on the other,

y2t = α + βy1t + ut

often leads to apparently high values of R² and of the associated t-statistic for β. Both are unreliable!

Solutions:

instead of regressing y1t on y2t in levels, regress ∆y1t on ∆y2t
use cointegration analysis

Timmermann (UCSD) Random walk Winter, 2017 15 / 21

Simulations of stationary processes

1,000 simulations (T = 500) of uncorrelated stationary AR(1) processes:

y1t = 0.5y1t−1 + ε1t
y2t = 0.5y2t−1 + ε2t
cov(ε1t, ε2t) = 0

The two time series y1t and y2t are independent by construction

Next, estimate a regression y1t = β0 + β1y2t + ut

What do you expect to find?

Timmermann (UCSD) Random walk Winter, 2017 16 / 21

Simulation from stationary AR(1) process

Average t-stat: 1.02. Rejection rate: 5.7%. Average R²: 0.003

[Figure: histograms of t-statistics and R-squared across simulations, stationary AR(1)]

Timmermann (UCSD) Random walk Winter, 2017 17 / 21

Spurious correlation: simulations

1,000 simulations of uncorrelated random walk processes:

y1t = y1t−1 + ε1t
y2t = y2t−1 + ε2t
cov(ε1t, ε2t) = 0

Then estimate the regression y1t = β0 + β1y2t + ut

What do we find now?

Timmermann (UCSD) Random walk Winter, 2017 18 / 21

Spurious correlation: simulation from random walk

Average t-stat: 13.4. Rejection rate: 44%.
Average R²: 0.25

[Figure: histograms of t-statistics and R-squared across simulations, random walk]

Timmermann (UCSD) Random walk Winter, 2017 19 / 21

Spurious correlation: dealing with the problem

1,000 simulations of uncorrelated random walk processes:

y1t = y1t−1 + ε1t
y2t = y2t−1 + ε2t
cov(ε1t, ε2t) = 0

Next, estimate the regression on first-differenced series:

∆y1t = β0 + β1∆y2t + ut

Timmermann (UCSD) Random walk Winter, 2017 20 / 21

Spurious correlation: simulation from random walk

Average t-stat: 0.78. Rejection rate: 1.5%. Average R²: 0.002

[Figure: histograms of t-statistics and R-squared across simulations, random walk in first differences]

Timmermann (UCSD) Random walk Winter, 2017 21 / 21

Lecture 5: Vector Autoregressions and Factor Models
UCSD, Winter 2017

Allan Timmermann1

1UC San Diego

Timmermann (UCSD) VARs and Factors Winter, 2017 1 / 41

1 Vector Autoregressions

2 Forecasting with VARs
Present value example
Impulse response analysis

3 Cointegration

4 Forecasting with Factor Models

5 Forecasting with Panel Data

Timmermann (UCSD) VARs and Factors Winter, 2017 2 / 41

From univariate to multivariate models

Often information other than a variable's own past values is relevant for forecasting

Think of forecasting Hong Kong house prices

exchange rate, GDP growth, population growth, interest rates might be relevant
past house prices in Hong Kong also matter (AR model)

In general we can get better models by using richer information sets

How do we incorporate additional information sources?
Vector Auto Regressions (VARs) (small set of predictors)
Factor models (many possible predictors)

Timmermann (UCSD) VARs and Factors Winter, 2017 2 / 41

Vector Auto Regressions (VARs)

Vector autoregressions generalize univariate autoregressions to the multivariate case by letting yt be an n × 1 vector and so extend the information set to It = {yit, yit−1, ..., yi1} for i = 1, ..., n

Many of the properties of VARs are simple multivariate generalizations of the univariate AR model

The Wold representation theorem also extends to the multivariate case and hence VARs and VARMA models can be used to approximate covariance stationary multivariate (vector) processes

VARMA: Vector AutoRegressive Moving Average

Timmermann (UCSD) VARs and Factors Winter, 2017 3 / 41

VARs: definition

A pth order VAR for an n × 1 vector yt takes the form:

yt = c + A1yt−1 + A2yt−2 + ... + Apyt−p + εt, εt ∼ WN(0, Σ)

Ai : n × n matrix of autoregressive coefficients for i = 1, ..., p:

Ai =
  Ai,11  Ai,12  · · ·  Ai,1n
  Ai,21  Ai,22  · · ·  Ai,2n
   ...                  ...
  Ai,n1   · · ·        Ai,nn

εt : n × 1 vector of innovations. These can be correlated across variables

VARs have the same regressors appearing in each equation

Number of parameters: n² × p (for A1, ..., Ap) + n (for c) + n(n + 1)/2 (for Σ)

Timmermann (UCSD) VARs and Factors Winter, 2017 4 / 41

Why do we need VARs for forecasting?

Consider a VAR with two variables: yt = (y1t, y2t)′

IT = {y1T, y1T−1, ..., y11, y2T, y2T−1, ..., y21}

Suppose y1 depends on past values of y2.
Forecasting y1 one step ahead (y1T+1) given IT is possible if we know today's values, y1T, y2T

Suppose we want to predict y1 two steps ahead (y1T+2)

Since y1T+2 depends on y2T+1, we need a forecast of y2T+1, given IT

We need a joint model for predicting y1 and y2 given their past values

This is provided by the VAR

Timmermann (UCSD) VARs and Factors Winter, 2017 5 / 41

Example: Bivariate VAR(1)

Joint model for the dynamics in y1t and y2t:

y1t = φ11y1t−1 + φ12y2t−1 + ε1t, ε1t ∼ WN(0, σ²1)
y2t = φ21y1t−1 + φ22y2t−1 + ε2t, ε2t ∼ WN(0, σ²2)

Each variable depends on one lag of the other variable and one lag of itself

φ12 measures the impact of the past value of y2, y2t−1, on current y1t. When φ12 ≠ 0, y2t−1 affects y1t

φ21 measures the impact of the past value of y1, y1t−1, on current y2t. When φ21 ≠ 0, y1t−1 affects y2t

The two variables can also be contemporaneously correlated if the innovations ε1t and ε2t are correlated and are influenced by common shocks: Cov(ε1t, ε2t) = σ12

If σ12 ≠ 0, shocks to y1t and y2t are contemporaneously correlated

Timmermann (UCSD) VARs and Factors Winter, 2017 6 / 41

Forecasting with Bivariate VAR I

One-step-ahead forecast given IT = {y1T, y2T, ..., y11, y21}:

f1T+1|T = φ11y1T + φ12y2T
f2T+1|T = φ21y1T + φ22y2T

To compute two-step-ahead forecasts, use the chain rule:

f1T+2|T = φ11f1T+1|T + φ12f2T+1|T
f2T+2|T = φ21f1T+1|T + φ22f2T+1|T

Using the expressions for f1T+1|T and f2T+1|T, we have

f1T+2|T = φ11(φ11y1T + φ12y2T) + φ12(φ21y1T + φ22y2T)
f2T+2|T = φ21(φ11y1T + φ12y2T) + φ22(φ21y1T + φ22y2T)

Timmermann (UCSD) VARs and Factors Winter, 2017 7 / 41

Forecasting with Bivariate VAR II
This can only be done if we have forecasting models for both y1 and y2 Therefore, we need to use a VAR for multi-step forecasting of time series that depend on other variables Timmermann (UCSD) VARs and Factors Winter, 2017 8 / 41 Predictive (Granger) causality Clive Granger (1969) used a variable’s predictive content to develop a definition of causality that depends on the conditional distribution of the predicted variable, given other information Statistical concept of causality closely related to forecasting Basic principles: cause should precede (come before) effect a causal series should contain information useful for forecasting that is not available from the other series (including their past) Granger causality in the bivariate VAR: If φ12 = 0, then y2 does not Granger cause y1 : past values of y2 do not improve our predictions of future values of y1 If φ21 = 0, then y1 does not Granger cause y2 : past values of y1 do not improve our predictions of future values of y2 For all other values of φ12 and φ21, y1 will Granger cause y2 and/or y2 will Granger cause y1 Include more lags? 
Timmermann (UCSD) VARs and Factors Winter, 2017 9 / 41

Granger causality tests

Each variable predicts every other variable in the general VAR

In VARs with many variables, it is quite likely that some variables are not useful for forecasting all the other variables

Granger causality findings might be overturned by adding more variables to the model – y2t may simply predict y1t+1 because other information (y3t, which causes both y1t+1 and y2t+1) has been left out (omitted variable)

Timmermann (UCSD) VARs and Factors Winter, 2017 10 / 41

Estimation of VARs

In sufficiently large samples and under conventional assumptions, the least squares estimates of (A1, ..., Ap) will be normally distributed around the true parameter values

Standard errors for each regression are computed using the OLS estimates

OLS estimation is asymptotically efficient

OLS estimates are generally biased in small samples

Timmermann (UCSD) VARs and Factors Winter, 2017 11 / 41

VARs in matlab

5-variable sample code on Triton Ed: varExample.m

model = vgxset('n',5,'nAR',nlags,'Constant',true); % set up the VAR model
[modelEstimate,modelStdEr,LL] = vgxvarx(model,Y(1:estimationEnd,:)); % estimate the VAR
numParams = vgxcount(model); % number of parameters
[aicForecast,aicForecastCov] = vgxpred(modelEstimates{aicLags,1},forecastHorizon,[]); % forecast with VAR

Timmermann (UCSD) VARs and Factors Winter, 2017 12 / 41

Difficulties with VARs

VARs initially became a popular forecasting tool because of their relative simplicity in terms of which choices need to be made by the forecaster

When estimating a VAR by classical methods, only two choices need to be made to construct forecasts

which variables to include (choice of y1, ..., yn)
how many lags of the variables to include (choice of p)

Risk of overparameterization of VARs is high

The general VAR has n(np + 1) mean parameters plus another n(n + 1)/2 covariance parameters

For n = 5, p = 4 this is 105 mean parameters and 15 covariance parameters

Bayesian
procedures reduce parameter estimation error by shrinking the parameter estimates towards some target value

Timmermann (UCSD) VARs and Factors Winter, 2017 13 / 41

Choice of lag length

Typically we search over VARs with different numbers of lags, p

With a vector of constants, n variables, p lags, and T observations, the BIC and AIC information criteria take the forms

BIC(p) = ln|Σ̂p| + n(np + 1) ln(T)/T
AIC(p) = ln|Σ̂p| + 2n(np + 1)/T

Σ̂p = T⁻¹ ∑(t=1..T) ε̂t ε̂′t is the estimate of the residual covariance matrix

The objective is to identify the model (indexed by p) that minimizes the information criterion

The sample code varExample.m chooses the VAR, using up to 12 lags (maxLags)

Timmermann (UCSD) VARs and Factors Winter, 2017 14 / 41

Multi-period forecasts

VARs are ideally designed for generating multi-period forecasts. For the VAR(1) specification

yt+1 = Ayt + εt+1, εt+1 ∼ WN(0, Σ)

the h-step-ahead value can be written

yt+h = A^h yt + ∑(i=1..h) A^(h−i) εt+i

The forecast under MSE loss is then

ft+h|t = A^h yt

Just like in the case with an AR(1) model!

Timmermann (UCSD) VARs and Factors Winter, 2017 15 / 41

Multi-period forecasts: 4-variable example

Forecasts using 4-variable VAR with quarterly inflation rate, unemployment rate, GDP growth and 10-year yield

vgxplot(modelEstimates,Y,aicForecast,aicForecastCov); % plot forecast

[Figure: forecast fan charts with 1-σ bands for inflation, GDP growth, unemployment, and the 10-year Treasury bond rate]

Timmermann (UCSD) VARs and Factors Winter, 2017 16 / 41

Multi-period forecasts of 10-year yield (cont.)
Reserve the last 5 years of data for forecast evaluation

[Figure: actual 10-year yield vs. AIC, BIC, and AR(4) forecasts, 2009–2014]

Timmermann (UCSD) VARs and Factors Winter, 2017 17 / 41

Example: Campbell-Shiller present value model I

Campbell and Shiller (1988) express the continuously compounded stock return in period t + 1, rt+1, as an approximate linear function of the logarithms of current and future stock prices, pt, pt+1, and dividends, dt+1:

rt+1 = k + ρpt+1 + (1 − ρ)dt+1 − pt

ρ is a scalar close to (but below) one, and k is a constant

Rearranging, we get a recursive equation for log-prices:

pt = k + ρpt+1 + (1 − ρ)dt+1 − rt+1

Iterating forward and taking expectations conditional on current information, we have

pt = k/(1 − ρ) + (1 − ρ)Et[∑(j=0..∞) ρ^j dt+1+j] − Et[∑(j=0..∞) ρ^j rt+1+j]

Timmermann (UCSD) VARs and Factors Winter, 2017 18 / 41

Example: Campbell-Shiller present value model II

Stock prices depend on an infinite sum of expected future dividends and expected returns

Key to the present value model is therefore how such expectations are formed

VARs can address this question since they can be used to generate multi-period forecasts

To illustrate this point, let zt be a vector of state variables with z1t = pt, z2t = dt, z3t = xt; xt are predictor variables

Define selection vectors e1 = (1 0 0)′, e2 = (0 1 0)′, e3 = (0 0 1)′ so pt = e′1 zt, dt = e′2 zt, xt = e′3 zt

Suppose that zt follows a VAR(1):

zt+1 = Azt + εt+1 ⇒ Et[zt+j] = A^j zt

Timmermann (UCSD) VARs and Factors Winter, 2017 19 / 41

Example: Campbell-Shiller present value model III

If expected returns Et[rt+1+j] are constant and stock prices only move due to variation in dividends, we have (ignoring the constant and assuming that we can invert (I − ρA))

pt = (1 − ρ)Et[∑(j=0..∞) ρ^j dt+1+j] = (1 − ρ)e′2 ∑(j=0..∞) ρ^j A^(j+1) zt = (1 − ρ)e′2 A(I − ρA)⁻¹ zt

Nice and simple expression for the present value stock price!
The VAR gives us a simple way to compute expected future dividends Et[dt+1+j] for all future points in time given the current information in zt

Can you suggest other ways of doing this?

Timmermann (UCSD) VARs and Factors Winter, 2017 20 / 41

Impulse response analysis

Stationary vector autoregressions (VARs) can equivalently be expressed as vector moving average (VMA) processes:

yt = εt + θ1εt−1 + θ2εt−2 + ...

Impulse response analysis shows how variable i in a VAR is affected by a shock to variable j at different horizons:

∂yit+1/∂εjt : 1-period impulse
∂yit+2/∂εjt : 2-period impulse
∂yit+h/∂εjt : h-period impulse

Suppose we find out that variable j is higher than we expected (by one unit). Impulse responses show how much we revise our forecasts of future values of yit+h due to this information

How does an interest rate shock affect future unemployment and inflation?

Timmermann (UCSD) VARs and Factors Winter, 2017 21 / 41

Impulse response analysis in matlab

Four-variable model: inflation, GDP growth, unemployment, 10-year Treasury bond rate

impulseHorizon = 24; % 24 months out
W0 = zeros(impulseHorizon,4); % baseline scenario of zero shock
W1 = W0;
W1(1,4) = sqrt(modelEstimates{aicLags,1}.Q(4,4)); % one standard deviation shock to variable number four (interest rate)
Yimpulse = vgxproc(modelEstimates{aicLags,1},W1,[],Y(1:estimationEnd,:)); % impulse response
Ynoimpulse = vgxproc(modelEstimates{aicLags,1},W0,[],Y(1:estimationEnd,:));

Timmermann (UCSD) VARs and Factors Winter, 2017 22 / 41

Impulse response analysis: shock to 10-year yield

[Figure: impulse responses of inflation, GDP growth, unemployment, and the 10-year Treasury bond rate to a one-standard-deviation yield shock, horizons 1–24]

Timmermann (UCSD) VARs and Factors Winter, 2017 23 / 41

Nobel Prize Award, 2003 press release

"Most macroeconomic time series follow a stochastic trend, so that a temporary
disturbance in, say, GDP has a long-lasting effect. These time series are called nonstationary; they differ from stationary series which do not grow over time, but fluctuate around a given value. Clive Granger demonstrated that the statistical methods used for stationary time series could yield wholly misleading results when applied to the analysis of nonstationary data. His significant discovery was that specific combinations of nonstationary time series may exhibit stationarity, thereby allowing for correct statistical inference. Granger called this phenomenon cointegration. He developed methods that have become invaluable in systems where short-run dynamics are affected by large random disturbances and long-run dynamics are restricted by economic equilibrium relationships. Examples include the relations between wealth and consumption, exchange rates and price levels, and short and long-term interest rates."

This work was done at UCSD

Timmermann (UCSD) VARs and Factors Winter, 2017 24 / 41

Cointegration

Consider the variables

xt = xt−1 + εt : x follows a random walk (nonstationary)
y1t = xt + u1t : y1 is a random walk plus noise
y2t = xt + u2t : y2 is a random walk plus noise

εt, u1t, u2t are all white noise (or at least stationary)

xt is a unit root process: (1 − L)xt = εt, so L = 1 is a "root"

y1 and y2 behave like random walks. However, their difference

y1t − y2t = u1t − u2t

is stationary (mean-reverting)

Over the long run, y1 − y2 will revert to its equilibrium value of zero

Timmermann (UCSD) VARs and Factors Winter, 2017 25 / 41

Cointegration (cont.)
Future levels of random walk variables are difficult to predict

It is much easier to predict differences between two sets of cointegrated variables

Example: Forecasting the level of Brent or WTI (West Texas Intermediate) crude oil prices five years from now is difficult

Forecasting the difference between these prices (or the logs of their prices) is likely to be easier

In practice we often study the logarithm of prices (instead of their level), so percentage differences cannot become too large

Timmermann (UCSD) VARs and Factors Winter, 2017 26 / 41

Cointegration (cont.)

If two variables are cointegrated, they must both individually have a stochastic trend (follow a unit root process) and their individual paths can wander arbitrarily far away from their current values

There exists a linear combination that ties the two variables closely together

Future values cannot deviate too far away from this equilibrium relation

Granger representation theorem: Equilibrium errors (deviations from the cointegrating relationship) can be used to predict future changes

Examples of possible cointegrated variables:

Oil prices in Shanghai and Hong Kong – if they differ by too much, there is an arbitrage opportunity
Long and short interest rates
Baidu and Alibaba stock prices (pairs trading)
House prices in two neighboring cities
Chinese A and H share prices for the same company. Arbitrage opportunities?
Timmermann (UCSD) VARs and Factors Winter, 2017 27 / 41

Vector Error Correction Models (VECM)

Vector error correction models (VECMs) can be used to analyze VARs with nonstationary variables that are cointegrated

The cointegration relation restricts the long-run behavior of the variables so they converge to their cointegrating relationship (long-run equilibrium)

The cointegration term is called the error-correction term

This measures the deviation from the equilibrium and allows for short-run predictability

In the long-run equilibrium, the error correction term equals zero

Timmermann (UCSD) VARs and Factors Winter, 2017 28 / 41

Vector Error Correction Models (cont.)

VECM for changes in two variables, y1t, y2t, with cointegrating equation y2t = βy1t and lagged error correction term (y2t−1 − βy1t−1):

∆y1t = α1(y2t−1 − βy1t−1) + λ1∆y1t−1 + ε1t
∆y2t = α2(y2t−1 − βy1t−1) + λ2∆y2t−1 + ε2t

where (y2t−1 − βy1t−1) is the lagged error correction term

In the short run y1 and y2 can deviate from the equilibrium y2t = βy1t

The lagged error correction term (y2t−1 − βy1t−1) pulls the variables back towards their equilibrium

α1 and α2 measure the speed of adjustment of y1 and y2 towards equilibrium

Larger values of α1 and α2 mean faster adjustment

Timmermann (UCSD) VARs and Factors Winter, 2017 29 / 41

Vector Error Correction Models (cont.)

In many applications (particularly with variables in logs), β = 1. Then a forecasting model for the changes ∆y1t and ∆y2t could take the form

∆y1t = c1 + ∑(i=1..p) λ1i∆y1t−i + α1(y2t−1 − y1t−1) + ε1t
∆y2t = c2 + ∑(i=1..p) λ2i∆y2t−i + α2(y2t−1 − y1t−1) + ε2t

where each sum contains p AR lags and (y2t−1 − y1t−1) is the error correction term

This can be estimated by OLS since you know the cointegrating coefficient, β = 1

Include more lags of the error correction term (y2t−1 − y1t−1)?
Adjustments may be slow

Timmermann (UCSD) VARs and Factors Winter, 2017 30 / 41

House prices in San Diego and San Francisco

[Figure: home price indices for San Diego (SD) and San Francisco (SF), 1990–2010]

Timmermann (UCSD) VARs and Factors Winter, 2017 31 / 41

Simple test for cointegration

Regress San Diego house prices on San Francisco house prices and test if the residuals are non-stationary

use logarithm of prices (?)

The null hypothesis is that there is no cointegration (so there is no linear combination of the two prices that is stationary)

If you reject the null hypothesis (get a low p-value), this means that the series are cointegrated

If you don't reject the null hypothesis (high p-value), the series are not cointegrated

Often the test has low power (fails to reject even when the series are cointegrated)

Timmermann (UCSD) VARs and Factors Winter, 2017 32 / 41

Test for cointegration in matlab

See the VecmExample.m file on Triton Ed

In matlab: egcitest : Engle-Granger cointegration test

[h, pValue, stat, cValue] = egcitest(Y)

"Engle-Granger tests assess the null hypothesis of no cointegration among the time series in Y. The test regresses Y(:,1) on Y(:,2:end), then tests the residuals for a unit root. Values of h equal to 1 (true) indicate rejection of the null in favor of the alternative of cointegration. Values of h equal to 0 (false) indicate a failure to reject the null."

p-value of the test for SD and SF house prices: 0.9351. We fail to reject the null that the house prices are not cointegrated. Why?
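The cointegration setup from the earlier slide (two random walks sharing a common stochastic trend) can be illustrated with a short simulation. In this Python sketch (the noise variances are arbitrary choices), the levels wander while the spread stays close to zero:

```python
import random

random.seed(2)

# Two series sharing a random-walk component x_t are individually
# nonstationary, but their difference u1 - u2 is stationary.
T = 5000
x = 0.0
levels, spread = [], []
for _ in range(T):
    x += random.gauss(0, 1)          # common stochastic trend
    y1 = x + random.gauss(0, 1)      # random walk plus noise
    y2 = x + random.gauss(0, 1)      # random walk plus noise
    levels.append(y1)
    spread.append(y1 - y2)

def var(xs):
    m = sum(xs) / len(xs)
    return sum((v - m) ** 2 for v in xs) / len(xs)

var_levels, var_spread = var(levels), var(spread)
# The level wanders far from its start; the spread has variance close to
# var(u1) + var(u2) = 2 and never drifts away
print(round(var_spread, 2))
```

This is exactly the property the Engle-Granger test looks for: a linear combination of the levels whose residual behaves like a stationary series.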
Timmermann (UCSD) VARs and Factors Winter, 2017 33 / 41

House prices in San Diego and San Francisco

[Figure: estimated cointegrating relation between SD and SF home price indices, 1990–2010]

Timmermann (UCSD) VARs and Factors Winter, 2017 34 / 41

Forecasting with Factor models I

Suppose we have a very large set of predictor variables, xit, i = 1, ..., n, where n could be in the hundreds or thousands

The simplest forecasting approach would be to consider a linear model with all predictors included:

yt+1 = α + ∑(i=1..n) βi xit + φ1yt + εyt+1

This model can be estimated by OLS, assuming that the total number of parameters, n + 2, is small relative to the length of the time series, T

Often n > T and so linear regression methods are not feasible

Instead it is commonly assumed that the x-variables only affect y through a
small set of r common factors, Ft = (F1t, ..., Frt)′, where r is much smaller
than n (typically less than ten)

Timmermann (UCSD) VARs and Factors Winter, 2017 35 / 41

Forecasting with Factor models II

This suggests using a common factor forecasting model of the form

yt+1 = α + ∑_{i=1}^{r} βiF Fit + φ1 yt + εt+1

Suppose that n = 200 and r = 3 common factors

The general forecasting model requires fitting 202 mean parameters:
α, β1, …, β200, φ1
The simple factor model only requires estimating 5 mean parameters:
α, β1F , β2F , β3F , φ1

Timmermann (UCSD) VARs and Factors Winter, 2017 36 / 41

Forecasting with Factor models

The identity of the common factors is usually unknown and so must be
extracted from the data

Forecasting with common factor models can therefore be thought of as a
two-step process

1 Extract estimates of the common factors from the data
2 Use the factors, along with past values of the predicted variable, to select and
estimate a forecasting model

Suppose a set of factor estimates, F̂it , has been extracted. These are then
used along with past values of y to estimate a model and generate forecasts
of the form:

ŷt+1|t = α̂ + ∑_{i=1}^{r} β̂iF F̂it + φ̂1 yt

Common factors can be extracted using the principal components method
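The first of the two steps can be sketched in a few lines. Below is an illustrative pure-Python extraction of the first principal component by power iteration; the function name `first_pc`, the toy one-factor data and the loadings are my own, not course code (the course uses matlab's principal components routines):

```python
import random

def first_pc(X):
    # first principal component of the T x n data matrix X:
    # demean columns, form the covariance matrix, and run power
    # iteration to get its top eigenvector; the factor estimate is Z v
    T, n = len(X), len(X[0])
    means = [sum(row[j] for row in X) / T for j in range(n)]
    Z = [[X[t][j] - means[j] for j in range(n)] for t in range(T)]
    S = [[sum(Z[t][i] * Z[t][j] for t in range(T)) / T
          for j in range(n)] for i in range(n)]
    v = [1.0] * n
    for _ in range(200):  # power iteration converges to the top eigenvector
        w = [sum(S[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(wi * wi for wi in w) ** 0.5
        v = [wi / norm for wi in w]
    return [sum(Z[t][j] * v[j] for j in range(n)) for t in range(T)]

# toy one-factor data: n = 20 predictors all loading on one common factor f
random.seed(0)
T, n = 120, 20
f = [random.gauss(0.0, 1.0) for _ in range(T)]
X = [[0.8 * f[t] + 0.5 * random.gauss(0.0, 1.0) for _ in range(n)]
     for t in range(T)]
F_hat = first_pc(X)  # step 1; step 2 would regress y on F_hat and lagged y
```

The extracted series F_hat is identified only up to sign and scale, which is why the second step always re-estimates the loadings β̂iF by regression.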

Timmermann (UCSD) VARs and Factors Winter, 2017 37 / 41

Principal components

Wikipedia: “Principal component analysis (PCA) is a statistical procedure
that uses an orthogonal transformation to convert a set of observations of
possibly correlated variables into a set of values of linearly uncorrelated
variables called principal components. The number of principal components is
less than or equal to the number of original variables. This transformation is
defined in such a way that the first principal component has the largest
possible variance (that is, accounts for as much of the variability in the data
as possible), and each succeeding component in turn has the highest variance
possible under the constraint that it is orthogonal to (i.e., uncorrelated with)
the preceding components. The principal components are orthogonal because
they are the eigenvectors of the covariance matrix, which is symmetric. PCA
is sensitive to the relative scaling of the original variables.”

Timmermann (UCSD) VARs and Factors Winter, 2017 38 / 41

Empirical example

Data set with n = 132 predictor variables

Available in macro_raw_data.xlsx on Triton Ed. Uses data from Sydney
Ludvigson’s NYU website

Data series have to be transformed (e.g., from levels to growth rates) before
they are used to form principal components

Extract r = 8 common factors using principal components methods

Timmermann (UCSD) VARs and Factors Winter, 2017 39 / 41

Empirical example (cont.): 8 principal components

[Figure: time-series plots of the eight extracted principal components, PC 1 through PC 8]

Timmermann (UCSD) VARs and Factors Winter, 2017 40 / 41

Forecasting with panel data I

Forecasting methods can also be applied to cross-sections or panel data
Key requirement is that the predictors are predetermined in time. For
example, we could build a forecasting model for a large cross-section of
credit-card holders using data on household characteristics, past payment
records etc.

The implicit time dimension is that we know whether a payment in the data
turned out fraudulent

Panel regressions take the form

yit = αi + λt + X′it β + uit ,  i = 1, ..., n, t = 1, ..., T

αi : fixed effect (e.g., firm, stock, or country level)
λt : time fixed effect

How do we predict λt+1?

Panel models can be estimated using regression methods
Do slope coefficients β vary across units (βi)?

bias-variance trade-off
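For the common-slope case, one standard way to handle both sets of fixed effects is the two-way within transformation (demean by unit and by time). A small illustrative Python sketch; the helper `within_beta` and the simulated panel are mine, not course code:

```python
import random

def within_beta(y, x):
    # two-way within estimator for a balanced panel: subtract unit means
    # and time means, add back the grand mean, then run pooled OLS
    # through the origin on the demeaned data
    N, T = len(y), len(y[0])
    ybar_i = [sum(y[i]) / T for i in range(N)]
    xbar_i = [sum(x[i]) / T for i in range(N)]
    ybar_t = [sum(y[i][t] for i in range(N)) / N for t in range(T)]
    xbar_t = [sum(x[i][t] for i in range(N)) / N for t in range(T)]
    ybar = sum(ybar_i) / N
    xbar = sum(xbar_i) / N
    num = den = 0.0
    for i in range(N):
        for t in range(T):
            yd = y[i][t] - ybar_i[i] - ybar_t[t] + ybar
            xd = x[i][t] - xbar_i[i] - xbar_t[t] + xbar
            num += xd * yd
            den += xd * xd
    return num / den

rng = random.Random(1)
N, T, beta = 30, 20, 2.0
alpha = [rng.gauss(0, 1) for _ in range(N)]   # unit fixed effects
lam = [rng.gauss(0, 1) for _ in range(T)]     # time fixed effects
x = [[rng.gauss(0, 1) for _ in range(T)] for _ in range(N)]
y = [[alpha[i] + lam[t] + beta * x[i][t] + rng.gauss(0, 0.1)
      for t in range(T)] for i in range(N)]
beta_hat = within_beta(y, x)  # close to the true beta = 2.0
```

Note the transformation sweeps out αi and λt, so for forecasting yi,T+1 one still has to supply a value for λT+1, which is exactly the difficulty raised on the slide.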
Timmermann (UCSD) VARs and Factors Winter, 2017 41 / 41

Lecture 7: Event, Density and Volatility Forecasting
UCSD, Winter 2017

Allan Timmermann1

1UC San Diego

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 1 / 57

1 Event forecasting

2 Point, interval and density forecasts
Location-Scale Models of Density Forecasts
GARCH Models
Realized Volatility Measures

3 Interval and Density Forecasts
Mean reverting processes
Random walk model
Alternative Distribution Models
Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 2 / 57

Event forecasting

Discrete events are important in economics and finance

Mergers & Acquisitions – do they happen (yes = 1) or not (no = 0)?
Will a credit card transaction be fraudulent (yes = 1, no = 0)?
Will Europe enter into a recession in 2017 (yes = 1, no = 0)?
What will my course grade be? A, B, or C?
Change in Fed funds rate is usually in increments of 25 bps or zero. Create
bins of 0 = no change, 1 = 25 bp change, 2 = 50 bp change, etc.

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 2 / 57

University of Iowa Electronic markets: value of contract on
Republican presidential nominee

Contracts trading for a total exposure of $500 with a $1 payoff on each
contract

y = 1 : you get paid one dollar if Trump wins the Republican nomination
y = 0 : you get nothing if Trump fails to win the Republican nomination

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 3 / 57

University of Iowa Electronic markets: Democrat vs
republican win

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 4 / 57

Limited dependent variables

Limited dependent variables have a restricted range of values (bins)

Example: A binary variable takes only two values: y = 1 or y = 0

In a binary response model, interest lies in the response probability given
some predictor variables x1t , …, xkt :

P(yt+1 = 1|x1t , x2t , .., xkt )

Example: what is the probability that the Fed will raise interest rates by more
than 75 bps in 2017 given the current level of inflation, changes in oil prices,
bank lending, unemployment rate and past interest rate decisions?

Suppose y is a binary variable taking values of zero or one

E[yt+1|xt] = P(yt+1 = 1|xt) × 1 + P(yt+1 = 0|xt) × 0 = P(yt+1 = 1|xt)

E [.] : Expectation. P(.) : Probability
The probability of “success” (yt+1 = 1) equals the expected value of yt+1
Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 5 / 57

Linear probability model

Linear probability model:

P(yt+1 = 1|x1t , .., xkt ) = β0 + β1x1t + …+ βk xkt

x1t , …, xkt : predictor variables
In the linear probability model, βj measures the change in the probability of
success when xjt changes, holding other variables constant:

∆P(yt+1 = 1|x1t, ..., xkt) = βj ∆xjt

Problems with linear probability model:

Probabilities can be bigger than one or less than zero
Is the effect of x linear? What if you are close to a probability of zero or one?
Often the linear model gives a good first idea of the slope coefficient βj

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 6 / 57

Linear probability model: forecasts outside [0,1]

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 7 / 57

Binary response models

To address the limitations of the linear probability model, consider a class of
binary response models of the form

P(yt+1 = 1|x1t, x2t, ..., xkt) = G(β0 + β1x1t + ... + βkxkt)

for functions G (.) satisfying

0 ≤ G (.) ≤ 1

Probabilities are now guaranteed to fall between zero and one

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 8 / 57

Logit and Probit models

Two popular choices for G (.)

Logit model:

G(x) = exp(x) / (1 + exp(x))

Probit model:

G(x) = Φ(x) = ∫_{−∞}^{x} φ(z) dz ,  φ(z) = (2π)^{−1/2} exp(−z²/2)

Φ(x) is the standard normal cumulative distribution function (CDF)
Logit and probit functions are increasing and steepest at x = 0

G (x)→ 0 as x → −∞, and G (x)→ 1 as x → ∞
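Both link functions are easy to code directly. A small Python sketch (the function names are mine; the probit CDF is written via the error function, Φ(x) = (1 + erf(x/√2))/2):

```python
import math

def G_logit(x):
    # logistic CDF G(x) = exp(x) / (1 + exp(x)),
    # written in a form that avoids overflow for large |x|
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def G_probit(x):
    # standard normal CDF Phi(x), via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
```

Both functions map any index β0 + β1x1t + ... + βkxkt into a valid probability in [0, 1], with G(0) = 0.5, which is the "steepest at x = 0" point noted above.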

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 9 / 57

Logit and Probit models

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 10 / 57

Logit, probit and LPM in matlab

binaryResponseExample.m on Triton Ed
lpmBeta = [ones(T,1) X]\y; % estimates for linear probability model
probitBeta = glmfit(X,y,'binomial','link','probit'); % Probit model
logitBeta = glmfit(X,y,'binomial','link','logit'); % Logit model
lpmFit = [ones(T,1) X]*lpmBeta; % Calculate fitted values
probitFit = glmval(probitBeta,X,'probit'); % fitted values, probit
logitFit = glmval(logitBeta,X,'logit'); % fitted values, logit

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 11 / 57

Application: Directional investment strategy

yt+1 = rst+1 − Tbillt+1 : excess return on stocks (rst+1) over T-bills (Tbillt+1)

I{yt+1 > 0} = 1 if yt+1 > 0, 0 otherwise

Investment strategy: buy stocks if we predict yt+1 > 0, otherwise hold T-bills

forecast        stocks  T-bills
ft+1|t > 0      +1      0
ft+1|t ≤ 0      0       +1

Logit/Probit model estimates the probability of a positive excess return, prob(I{yt+1 > 0} = 1 | It)

Isy = y >= 0; % create indicator for dependent variable

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 12 / 57

Fitted probabilities of a positive excess return

Use logit, probit or linear model to forecast the probability of a positive
(monthly) excess return using the lagged T-bill rate, dividend yield and
default spread as predictors

[Figure: fitted probability of a positive excess return, 1930-2010, for the LPM, probit and logit models]

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 13 / 57

Switching strategy (cont.)

Decision rule: define the stock "weight" ωst+1|t

ωst+1|t = 1 if prob(I{yt+1 > 0} = 1 | It) > 0.5, 0 otherwise

Payoff on the stock-bond switching (market timing) portfolio:

rt+1 = ωst+1|t rst+1 + (1 − ωst+1|t) Tbillt+1

Payoff depends on both the sign and magnitude of the predicted excess return, even though the forecast ignores information about magnitudes

Cumulated wealth: Starting from initial wealth W0 we get

WT = W0 ∏_{τ=1}^{T} (1 + rτ)
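The decision rule and the wealth recursion can be sketched in a few lines of Python; the helper `switching_wealth` and the toy inputs are mine, not course code:

```python
def switching_wealth(prob_up, stock_ret, tbill_ret, w0=1.0):
    # hold stocks in periods where the fitted P(excess return > 0) > 0.5,
    # otherwise hold T-bills, and compound wealth period by period
    wealth = w0
    for p, rs, rf in zip(prob_up, stock_ret, tbill_ret):
        r = rs if p > 0.5 else rf
        wealth *= (1.0 + r)
    return wealth

# two toy periods: invest in stocks first, then switch to T-bills
w = switching_wealth([0.6, 0.4], [0.10, -0.05], [0.01, 0.01])
```

In the first period the strategy earns the 10% stock return; in the second it sidesteps the −5% stock return and earns the 1% T-bill rate instead, so wealth compounds to 1.10 × 1.01.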

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 14 / 57

Cumulated wealth from T-bills, stocks and switching rule

[Figure: cumulated wealth from the switching strategy, stocks, and T-bills, 1930-2010]

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 15 / 57

Point forecasts

Point forecasts provide a summary statistic for the predictive density of the
predicted variable (Y ) given the data

This is all we need under MSE loss (sufficient statistic)

Limitations to point forecasts:

Different loss functions L give different point forecasts (Lecture 1)

Point forecasts convey no sense of the precision of the forecast: how
aggressively should an investor act on a predicted stock return of +1%?

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 16 / 57

Interval forecasts

It is always useful to report a measure of forecast uncertainty

Addresses “how certain am I of my forecast?”

many forecasts are surrounded by considerable uncertainty

Alternatively, use scenario analysis

specify outcomes in possible future scenarios along with the probabilities of the
scenarios

Under the assumption that the forecast errors are normally distributed, we
can easily construct an interval forecast, i.e., an interval that contains the
future value of Y with a probability such as 50%, 90% or 95%

forecast the variance as well

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 17 / 57

Distribution (density) forecasts

Distribution forecasts provide a complete characterization of the forecast
uncertainty

Calculation of expected utility for many risk-averse investors requires a
forecast of the full probability distribution of returns —not just its mean

Parametric approaches assume a known distribution such as the normal
(Gaussian)

Non-parametric methods treat the distribution as unknown

bootstrap draws from the empirical distribution of residuals

Hybrid approaches that mix different distributions can also be used

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 18 / 57

Density forecasts (cont.)

To construct density forecasts, typically three estimates are used:

Estimate of the conditional mean given the data, µt+1|t
Estimate of the conditional volatility given the data, σt+1|t
Estimate of the distribution function of the innovations/shocks, Pt+1|t

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 19 / 57

Conditional location-scale processes with normal errors

yt+1 = µt+1|t + σt+1|tut+1, ut+1 ∼ N(0, 1)

µt+1|t : conditional mean of yt+1, given current information, It
σt+1|t : conditional standard deviation or volatility, given It

P(yt+1 ≤ y | It) = P( (yt+1 − μt+1|t)/σt+1|t ≤ (y − μt+1|t)/σt+1|t )
= P( ut+1 ≤ (y − μt+1|t)/σt+1|t )
≡ N( (y − μt+1|t)/σt+1|t )

P : probability
N : cumulative distribution function of Normal(0, 1)
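Given μt+1|t and σt+1|t, this density forecast delivers any tail probability through the normal CDF. A one-function Python sketch (the name `prob_below` is mine; the normal CDF is computed via the error function):

```python
import math

def prob_below(y, mu, sigma):
    # P(y_{t+1} <= y | I_t) = N((y - mu_{t+1|t}) / sigma_{t+1|t}),
    # with the standard normal CDF computed via the error function
    return 0.5 * (1.0 + math.erf((y - mu) / (sigma * math.sqrt(2.0))))
```

For example, with mu = 0 and sigma = 1, prob_below(1.96, 0, 1) is approximately 0.975, matching the 95% interval endpoints used later in the lecture.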

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 20 / 57

Nobel prize committee press release (2003)

“On financial markets, random fluctuations over time —volatility —are
particularly significant because the value of shares, options and other
financial instruments depends on their risk. Fluctuations can vary
considerably over time; turbulent periods with large fluctuations are followed
by calmer periods with small fluctuations. Despite such time-varying
volatility, in want of a better alternative, researchers used to work with
statistical methods that presuppose constant volatility. Robert Engle’s
discovery was therefore a major breakthrough. He found that the concept of
autoregressive conditional heteroskedasticity (ARCH) accurately
captures the properties of many time series and developed methods for
statistical modeling of time-varying volatility. His ARCH models have become
indispensable tools not only for researchers, but also for analysts on financial
markets, who use them in asset pricing and in evaluating portfolio risk.”

Robert Engle did the work on ARCH models at UCSD

This work is critical for modeling σt+1|t

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 21 / 57

ARCH models

Returns, rt+1, at short horizons (daily, 5-minute, weekly, even monthly) are
hard to predict – they are not strongly serially correlated

However, squared returns, r2t+1, are serially correlated and easier to predict

Volatility clustering: periods of high market volatility alternate with periods
of low volatility

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 22 / 57

Daily stock returns

10-years of daily US stock returns

[Figure: daily S&P 500 returns, in percentage points, 2005-2014]

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 23 / 57

Daily stock return levels: AR(4) model estimates

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 24 / 57

Squared daily stock returns

[Figure: squared daily S&P 500 returns, in percentage points, 2005-2014]

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 25 / 57

Squared daily stock returns: AR(4) estimates

Much stronger evidence of serial persistence (autocorrelation) in squared returns

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 26 / 57

GARCH models

Generalized AutoRegressive Conditional Heteroskedasticity

GARCH(p, q) model for the conditional variance:

εt+1 = σt+1|tut+1, ut+1 ∼ N(0, 1)

σ²t+1|t = ω + ∑_{i=1}^{p} βi σ²t+1−i|t−i + ∑_{i=1}^{q} αi ε²t+1−i

GARCH(1, 1) is the empirically most popular specification:

σ²t+1|t = ω + β1 σ²t|t−1 + α1 ε²t
= ω + (α1 + β1) σ²t|t−1 + α1 σ²t|t−1 (u²t − 1)

where the last term, α1 σ²t|t−1 (u²t − 1), has zero mean

Difficult to beat GARCH(1,1) in many volatility forecasting contests
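The GARCH(1,1) variance recursion is simple to implement. An illustrative Python sketch (the function name is mine; the recursion is initialized at the unconditional variance, which presumes α1 + β1 < 1):

```python
def garch11_filter(eps, omega, alpha, beta):
    # conditional variances sigma2[t] = Var(eps_t | info at t-1) from the
    # GARCH(1,1) recursion sigma2_{t+1} = omega + alpha*eps_t^2 + beta*sigma2_t,
    # started at the unconditional variance omega / (1 - alpha - beta)
    sigma2 = [omega / (1.0 - alpha - beta)]
    for e in eps[:-1]:
        sigma2.append(omega + alpha * e * e + beta * sigma2[-1])
    return sigma2
```

With no shocks the filtered variance decays geometrically toward ω/(1 − β), while a large squared shock pushes it up before it mean-reverts, which is exactly the volatility clustering mechanism.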

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 27 / 57

GARCH(1,1) model: h-step forecasts I

α1 + β1 : measures persistence of GARCH(1,1) model

As long as α1 + β1 < 1, the volatility process will converge

Long-run (unconditional) variance is

E[σ²t+1|t] ≡ σ̄² = ω / (1 − α1 − β1)

GARCH(1,1) is similar to an ARMA(1,1) model in squares:

σ²t+1|t = σ̄² + (α1 + β1)(σ²t|t−1 − σ̄²) + α1 σ²t|t−1 (u²t − 1)

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 28 / 57

Volatility modeling

The GARCH(1,1) model can generate fat tails

The standard GARCH(1,1) model does not generate a skewed distribution, because the shocks are normally distributed (symmetric)

Conditional volatility estimate: estimate of the current volatility level given all current information. This varies over time

Mean reversion: If the current conditional variance forecast σ²t+1|t > σ̄², the
multi-period variance forecast will exceed the average forecast by an amount
that declines in the forecast horizon

Unconditional volatility estimate: long-run ("average") estimate of volatility.
This is constant over time

GARCH models can be estimated by maximum likelihood methods
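The mean-reversion property gives the h-step variance forecast a closed form that decays from σ²t+1|t toward the unconditional variance σ̄² at rate (α1 + β1): σ²t+h|t = σ̄² + (α1 + β1)^{h−1}(σ²t+1|t − σ̄²). A small Python sketch of this formula (the function name is mine):

```python
def garch11_forecast(sigma2_next, omega, alpha, beta, h):
    # h-step-ahead conditional variance forecast for a GARCH(1,1):
    # the gap between sigma2_{t+1|t} and the unconditional variance
    # shrinks by a factor (alpha + beta) for each extra horizon
    sbar2 = omega / (1.0 - alpha - beta)
    return sbar2 + (alpha + beta) ** (h - 1) * (sigma2_next - sbar2)
```

At h = 1 the function returns σ²t+1|t itself; as h grows it converges to σ̄², matching the slide's description of mean reversion in the multi-period forecast.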

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 29 / 57

Asymmetric GARCH models I

GJR model of Glosten, Jagannathan, and Runkle (1993):

σ²t+1|t = ω + α1 ε²t + λ ε²t I(εt < 0) + β1 σ²t|t−1

I(εt < 0) = 1 if εt < 0, 0 otherwise

Positive and negative shocks affect volatility differently if λ ≠ 0

If λ > 0, negative shocks will affect future conditional variance more strongly
than positive shocks

The bigger effect of negative shocks is sometimes attributed to leverage (for
stock returns)

λ measures the magnitude of the leverage

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 30 / 57

Asymmetric GARCH models II

EGARCH (exponential GARCH) model of Nelson (1991):

log(σ²t+1|t) = ω + α1(|εt| − E[|εt|]) + γ εt + β1 log(σ²t|t−1)

EGARCH volatility estimates are always positive in levels (the exponential of
a negative number is positive)

If γ < 0, negative shocks (εt < 0) will have a bigger effect on future conditional volatility than positive shocks

γ measures the magnitude of the leverage (sign different from GJR model)

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 31 / 57

GARCH estimation in matlab

garchExample.m: code on Triton Ed

[h, pValue, stat, cValue] = archtest(res,'lags',10); % test for ARCH
model = garch(P,Q); % creates a conditional variance GARCH model with GARCH degree P and ARCH degree Q
model = egarch(P,Q); % creates an EGARCH model with P lags of log(σ²t|t−1) and Q lags of ε²t
model = gjr(P,Q); % creates a GJR model
modelEstimate = estimate(model,spReturns); % estimate GARCH model
modelVariances = infer(modelEstimate,spReturns); % generate conditional variance estimate
varianceForecasts = forecast(modelEstimate,10,'V0',modelVariances); % generate variance forecast

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 32 / 57

GARCH(1,1) and EGARCH estimates

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 33 / 57

GJR estimates

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 34 / 57

Comparison of fitted volatility estimates

[Figure: fitted volatility estimates from the GARCH(1,1), EGARCH(1,1) and GJR(1,1) models]

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 35 / 57

Comparison of 3 models: out-of-sample forecasts

[Figure: out-of-sample volatility forecasts at horizons 1-20 from the GARCH(1,1), EGARCH(1,1) and GJR(1,1) models]

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 36 / 57

Realized variance

True variance is unobserved. How do we estimate it?
Intuition: the higher the variance of y is in a given period, the more y fluctuates in small intervals during that period

Idea: sum the squared changes in y during small intervals between time markers τ0, τ1, τ2, ..., τN within a given period

Realized variance:

RVt = ∑_{j=1}^{N} (yτj − yτj−1)² ,  t − 1 = τ0 < τ1 < ... < τN = t

Example: use 5-minute sampled data over 8 hours to estimate the daily stock market volatility: N = 8 × 12 = 96 observations

τ0 = 8am, τ1 = 8:05am, τ2 = 8:10am, ..., τN = 4pm

Example: use squared daily returns to estimate volatility within a month: N = 22 daily observations (trading days)

τ0 = Jan 31, τ1 = Feb 01, τ2 = Feb 02, ..., τN = Feb 28

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 37 / 57

Realized variance

Treating the realized variance as a noisy estimate of the true (unobserved) variance, we can use simple ARMA models to predict future volatility

AR(1) model for the realized variance:

RVt+1 = β0 + β1 RVt + εt+1

The realized volatility is the square root of the realized variance

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 38 / 57

Monthly realized volatility

[Figure: monthly realized volatility, in percentage points, 2005-2014]

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 39 / 57

Example: Linear regression model

Consider the linear regression model

yt+1 = β1 yt + εt+1 ,  εt+1 ∼ N(0, σ²)

The point forecast computed at time T using an estimated model is

f̂T+1|T = β̂1 yT

The forecast error is the difference between actual value and forecast:

yT+1 − f̂T+1|T = εT+1 + (β1 − β̂1) yT

The MSE is

MSE = E[(yT+1 − f̂T+1|T)²] = σ²ε + Var(β1 − β̂1) × y²T

This depends on σ²ε and also on the estimation error (β1 − β̂1)

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 40 / 57

Example (cont.): interval forecasts

Interval forecasts are similar to confidence intervals: A 95% interval forecast is
an interval that contains the future value of the outcome 95% of the time

If the variable is normally distributed, we can construct this as

f̂T+1|T ± 1.96 × SE(YT+1 − f̂T+1|T)

SE(YT+1 − f̂T+1|T) is the standard error of the forecast error eT+1 = YT+1 − f̂T+1|T

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 41 / 57

Interval forecasts

Consider the simple model

yt+1 = μ + σ εt+1 ,  εt+1 ∼ N(0, 1)

A typical interval forecast is that the outcome yt+1 falls in the interval [f^l, f^u] with some given probability, e.g., 95%

If εt+1 is normally distributed this simplifies to

f^l = μ − 1.96σ ,  f^u = μ + 1.96σ

More generally, with time-varying mean (μt+1|t) and volatility (σt+1|t):

f^l_t+1|t = μt+1|t − 1.96 σt+1|t ,  f^u_t+1|t = μt+1|t + 1.96 σt+1|t

What happens to forecasts for longer horizons?

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 42 / 57

Interval forecasts

[Figure: interval forecasts for a mean reverting AR(1) process starting at the mean (yT = 1, E[y] = 1, φ = 0.9, σ = 0.5)]

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 43 / 57

Interval forecasts

[Figure: interval forecasts for a mean reverting AR(1) process starting above the mean (yT = 2, E[y] = 1, σ = 0.5)]

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 44 / 57

Uncertainty and forecast horizon I

Consider the AR(1) process yt+1 = φ yt + εt+1, εt+1 ∼ N(0, σ²)

yt+h = φ yt+h−1 + εt+h
= φ(φ yt+h−2 + εt+h−1) + εt+h
= φ² yt+h−2 + φ εt+h−1 + εt+h
...
yt+h = φ^h yt + φ^{h−1} εt+1 + φ^{h−2} εt+2 + ... + φ εt+h−1 + εt+h   (the ε terms are unpredictable future shocks)

Using this expression, if |φ| < 1 (mean reversion) we have

ft+h|t = φ^h yt

Var(yt+h | It) = σ²(1 − φ^{2h}) / (1 − φ²) → σ² / (1 − φ²)   (for large h)

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 45 / 57

Uncertainty and forecast horizon II

The 95% interval forecast and probability (density) forecast for the mean reverting AR(1) process (|φ| < 1) with Gaussian shocks are
95% interval forecast: φ^h yt ± 1.96 σ √((1 − φ^{2h}) / (1 − φ²))

density forecast: N( φ^h yt , σ²(1 − φ^{2h}) / (1 − φ²) )

This ignores parameter estimation error (φ is taken as known)

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 46 / 57

Interval and density forecasts for random walk model

For the random walk model yt+1 = yt + εt+1, εt+1 ∼ N(0, σ²), so

yt+h = yt + εt+1 + εt+2 + ... + εt+h−1 + εt+h

Using this expression, we get

ft+h|t = yt

Var(yt+h | It) = h σ²

The 95% interval and probability forecasts are

95% interval forecast: yt ± 1.96 σ √h

density forecast: N(yt, h σ²)

Width of the confidence interval continues to expand as h → ∞

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 47 / 57

Interval forecasts: random walk model

[Figure: interval forecasts for the random walk model (yT = 2, σ = 1)]

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 48 / 57

Alternative distributions: Two-piece normal distribution

Two-piece normal distribution:

dist(yt+1) = exp(−(yt+1 − μt+1|t)² / 2σ1²) / (√(2π)(σ1 + σ2)/2)   for yt+1 ≤ μt+1|t
dist(yt+1) = exp(−(yt+1 − μt+1|t)² / 2σ2²) / (√(2π)(σ1 + σ2)/2)   for yt+1 > μt+1|t

The mean of this distribution is

Et[yt+1] = μt+1|t + √(2/π) (σ2 − σ1)

If σ2 > σ1, the distribution is positively skewed

The distribution has fat tails provided that σ1 ≠ σ2
This distribution is used by Bank of England to compute “fan charts”
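Sampling from the two-piece normal makes the mean formula easy to check by simulation. An illustrative Python sketch (the sampler, which draws the left half-normal with probability σ1/(σ1 + σ2), and all names are mine):

```python
import math
import random

def sample_two_piece(mu, s1, s2, rng):
    # two-piece normal draw: left half-normal (scale s1) with
    # probability s1/(s1+s2), right half-normal (scale s2) otherwise
    z = abs(rng.gauss(0.0, 1.0))
    if rng.random() < s1 / (s1 + s2):
        return mu - s1 * z
    return mu + s2 * z

def two_piece_mean(mu, s1, s2):
    # closed-form mean from the slide: mu + sqrt(2/pi) * (s2 - s1)
    return mu + math.sqrt(2.0 / math.pi) * (s2 - s1)

rng = random.Random(0)
mu, s1, s2 = 2.0, 0.5, 1.0
draws = [sample_two_piece(mu, s1, s2, rng) for _ in range(200_000)]
mc_mean = sum(draws) / len(draws)  # close to two_piece_mean(mu, s1, s2)
```

With σ2 > σ1 the simulated draws lie below μ only σ1/(σ1 + σ2) of the time and the sample mean sits above μ, illustrating the positive skew.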

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 49 / 57

Bank of England fan charts: Inflation report 02/2017

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 50 / 57

Bank of England fan charts: Inflation report 02/2016

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 51 / 57

IMF World Economic Outlook, October 2016

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 52 / 57

Alternative distributions: Mixtures of normals

Suppose

y1,t+1 ∼ N(μ1, σ1²)
y2,t+1 ∼ N(μ2, σ2²)
cov(y1,t+1, y2,t+1) = σ12

Sums of normal distributions are normally distributed:

y1,t+1 + y2,t+1 ∼ N(μ1 + μ2, σ1² + σ2² + 2σ12)

Mixtures of normal distributions are not normally distributed: Let
st+1 = {0, 1} be a random indicator variable. Then

st+1 × y1,t+1 + (1 − st+1) × y2,t+1 ≠ N(·, ·)

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 53 / 57

Moments of Gaussian mixture models

Let p1 = probability of state 1, p2 = probability of state 2

mean and variance of y:

E[y] = p1 μ1 + p2 μ2
Var(y) = p2 σ2² + p1 σ1² + p1 p2 (μ2 − μ1)²

skewness:

skew(y) = p1 p2 (μ1 − μ2) { 3(σ1² − σ2²) + (1 − 2p1)(μ2 − μ1)² }

kurtosis:

kurt(y) = p1 p2 (μ1 − μ2)² [ (p2³ + p1³)(μ1 − μ2)² ]
+ 6 p1 p2 (μ1 − μ2)² [ p1 σ2² + p2 σ1² ]
+ 3 p1 σ1⁴ + 3 p2 σ2⁴

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 54 / 57
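The mean and variance formulas above are easy to code and sanity-check. A small Python sketch (the helper name is mine):

```python
def mixture_mean_var(p1, mu1, s1, mu2, s2):
    # mean and variance of a two-state Gaussian mixture, using the
    # slide formulas: E[y] = p1 mu1 + p2 mu2 and
    # Var(y) = p1 s1^2 + p2 s2^2 + p1 p2 (mu2 - mu1)^2
    p2 = 1.0 - p1
    mean = p1 * mu1 + p2 * mu2
    var = p1 * s1 ** 2 + p2 * s2 ** 2 + p1 * p2 * (mu2 - mu1) ** 2
    return mean, var
```

With equal means the mixture is symmetric but fat-tailed; separating the means adds the p1 p2 (μ2 − μ1)² term to the variance, which is what makes mixtures flexible enough to match the skewness and kurtosis in the formulas above.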

Mixtures of normals: Ang and Timmermann (2012)

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 55 / 57

Mixtures of normals: Marron and Wand (1992)

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 56 / 57

Mixtures of normals: Marron and Wand (1992)

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 57 / 57

Lecture 8: Forecast Combination
UCSD, February 27, 2017

Allan Timmermann1

1UC San Diego

Timmermann (UCSD) Combination Winter, 2017 1 / 49

1 Introduction: When, why and what to combine?

2 Survey of Professional Forecasters

3 Optimal Forecast Combinations: Theory
Optimal Combinations under MSE loss

4 Estimating Forecast Combination Weights
Weighting schemes under MSE loss
Forecast Combination Puzzle
Rapach, Strauss and Zhou, 2010
Elliott, Gargano, and Timmermann, 2013
Time-varying combination weights

5 Model Combination
Optimal Pool

6 Bayesian Model Averaging

7 Conclusion
Timmermann (UCSD) Combination Winter, 2017 2 / 49

Key issues in forecast combination

Why combine?
Many models or forecasts with 'similar' predictive accuracy

Difficult to identify a single best forecast
State-dependent performance

Diversification gains

When to combine?
Individual forecasts are misspecified (“all models are wrong but some are
useful.”)
Unstable forecasting environment (past track record is unreliable)
Short track record; use “one-over-N” weights? (N forecasts)

What to combine?
Forecasts using different information sets
Forecasts based on different modeling approaches
Surveys, econometric model forecasts: surveys are truly forward-looking.
Econometric models are better calibrated to data

Timmermann (UCSD) Combination Winter, 2017 2 / 49

Essentials of forecast combination

Dimensionality reduction: Combination reduces the information in a large
set of forecasts to a single summary forecast using a set of combination
weights

Optimal combination chooses “optimal” weights to produce the most
accurate combined forecast

More accurate forecasts get larger weights
Combination weights also reflect correlations across forecasts
Estimation error is important for combination weights

Irrelevance Proposition: In a world with no model misspecification, infinite
data samples (no estimation error) and complete access to the information
sets underlying the individual forecasts, there is no need for forecast
combination

just use the single best model

Timmermann (UCSD) Combination Winter, 2017 3 / 49

When to combine?

Notations:

y : outcome
f̂1, f̂2 : individual forecasts
ω : combination weight
Simple combined forecast: f̂ com = ωf̂1 + (1−ω)f̂2
The combined forecast f̂ com dominates the individual forecasts f̂1 and f̂2
under MSE loss if

E[(y − f̂1)²] > E[(y − f̂ com)²]  and  E[(y − f̂2)²] > E[(y − f̂ com)²]

Both conditions need to hold

Timmermann (UCSD) Combination Winter, 2017 4 / 49

Applications of forecast combinations

Forecast combinations have been successfully applied in several areas of
forecasting:

Gross National Product
currency market volatility and exchange rates
inflation, interest rates, money supply
stock returns
meteorological data
city populations
outcomes of football games
wilderness area use
check volume
political risks

Estimation of GDP based on income and production measures

Averaging across values of unknown parameters

Timmermann (UCSD) Combination Winter, 2017 5 / 49

Two types of forecast combinations

1 Data used to construct the individual forecasts are not observed:

Treat individual forecasts like any other information (data) and estimate the
best possible mapping from the forecasts to the outcome
Examples: survey forecasts, analysts' earnings forecasts

2 Data underlying the model forecasts are observed: ‘model combination’

First generate forecasts from individual models. Then combine these forecasts

Why not simply construct a single “super” model?

Timmermann (UCSD) Combination Winter, 2017 6 / 49

Survey of Economic Forecasters: Participation

[Figure: number of participating forecasters, 1995-2015]

Timmermann (UCSD) Combination Winter, 2017 7 / 49

SPF: median, interquartile range, min, max real GDP
forecasts

[Figure: median, interquartile range, minimum and maximum of SPF real GDP forecasts, 1995-2015]

Timmermann (UCSD) Combination Winter, 2017 8 / 49

SPF: identity of best forecaster (unemployment rate),
ranked by MSE

[Figure: ID of the best SPF forecaster of the unemployment rate, ranked by MSE over rolling 5-year and 10-year windows]

Timmermann (UCSD) Combination Winter, 2017 9 / 49

Forecast combinations: simple example

Two forecasting models using x1 and x2 as predictors:

yt+1 = β1x1t + ε1t+1 ⇒ f̂1t+1|t = β̂1tx1t
yt+1 = β2x2t + ε2t+1 ⇒ f̂2t+1|t = β̂2tx2t

Combined forecast:

f̂ comt+1|t = ωf̂1t+1|t + (1−ω)f̂2t+1|t

Could the combined forecast be better than the forecast based on the
“super” model?

yt+1 = β1x1t + β2x2t + εt+1 ⇒ f̂ Super t+1|t = β̂1tx1t + β̂2tx2t

Timmermann (UCSD) Combination Winter, 2017 10 / 49

Combinations of forecasts: theory

Suppose the information set consists of m individual forecasts:
I = {f̂1, …., f̂m}
Find an optimal combination of the individual forecasts:

f̂ com = ω0 +ω1 f̂1 +ω2 f̂2 + …+ωm f̂m

ω0,ω1,ω2, …,ωm : unknown combination weights
The combined forecast uses the individual forecasts {f̂1, f̂2, …, f̂m} rather than
the underlying information sets used to construct the forecasts (f̂i = β̂i′ xi)

Timmermann (UCSD) Combination Winter, 2017 11 / 49

Combinations of forecasts: theory

Because the underlying ‘data’ are forecasts, they can be expected to obtain
non-negative weights that sum to unity:

0 ≤ ω_i ≤ 1, i = 1, …, m,   ∑_{i=1}^{m} ω_i = 1

Such constraints on the weights can be used to reduce the effect of
estimation error

Should we allow ω_i < 0 and go "short" in a forecast?

A negative ω_i doesn’t mean that the ith forecast was bad. It just means that
forecast i can be used to offset the errors of other forecasts

Timmermann (UCSD) Combination Winter, 2017 12 / 49

Combinations of two forecasts

Two individual forecasts f_1, f_2 with forecast errors e_1 = y − f_1, e_2 = y − f_2
Both forecasts are assumed to be unbiased: E[e_1] = E[e_2] = 0
Variances of forecast errors: σ_i^2, i = 1, 2. Covariance is σ_{12}
The combined forecast will also be unbiased if the weights add up to one:

f = ω f_1 + (1−ω) f_2 ⇒ e = y − f = y − ω f_1 − (1−ω) f_2 = ω e_1 + (1−ω) e_2

Forecast error from the combination is a weighted average of the individual
forecast errors:

E[e] = 0
Var(e) = ω^2 σ_1^2 + (1−ω)^2 σ_2^2 + 2ω(1−ω) σ_{12}

Like a portfolio of two correlated assets

Timmermann (UCSD) Combination Winter, 2017 13 / 49

Combination of two unbiased forecasts: optimal weights

Solve for the optimal combination weight, ω*:

ω* = (σ_2^2 − σ_{12}) / (σ_1^2 + σ_2^2 − 2σ_{12})
1 − ω* = (σ_1^2 − σ_{12}) / (σ_1^2 + σ_2^2 − 2σ_{12})

Combination weight can be negative if σ_{12} > σ_1^2 or σ_{12} > σ_2^2

If σ_{12} = 0: weights depend on the relative variance σ_2^2/σ_1^2 of the forecasts:

ω* = σ_2^2 / (σ_1^2 + σ_2^2) = σ_1^{−2} / (σ_1^{−2} + σ_2^{−2})

Greater weight is assigned to more precise models (small σ_i^2)

Timmermann (UCSD) Combination Winter, 2017 14 / 49
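The two-forecast optimal weight can be sketched numerically. This is a minimal illustration (not from the slides); the variance and covariance values are made up:

```python
# Sketch: optimal weight on forecast 1 given the error variances and
# covariance, following the two-forecast formula above.
def optimal_weight(var1, var2, cov12):
    """Weight on forecast 1 that minimizes the combined error variance."""
    return (var2 - cov12) / (var1 + var2 - 2 * cov12)

def combined_variance(w, var1, var2, cov12):
    """Error variance of the combination w*f1 + (1-w)*f2."""
    return w**2 * var1 + (1 - w)**2 * var2 + 2 * w * (1 - w) * cov12

w = optimal_weight(1.0, 2.0, 0.0)   # uncorrelated errors -> w = 2/3
# combined_variance(w, 1.0, 2.0, 0.0) = 2/3, below either individual variance
```

With uncorrelated errors the combined variance falls below that of even the best individual forecast, which is the diversification argument made above.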

Combinations of multiple unbiased forecasts

f: m × 1 vector of forecasts
e = ι_m y − f: vector of m forecast errors
ι_m = (1, 1, …, 1)′: m × 1 vector of ones
Assume that the individual forecast errors have zero mean:

E[e] = 0, Σ_e = Covar(e)

Choosing ω to minimize the MSE subject to the weights summing to one, we
get the optimal combination weights ω*:

ω* = argmin_ω ω′ Σ_e ω  s.t. ω′ ι_m = 1
   = (ι_m′ Σ_e^{−1} ι_m)^{−1} Σ_e^{−1} ι_m

Special case: if Σ_e is diagonal, ω_i* = σ_i^{−2} / ∑_{j=1}^{m} σ_j^{−2}: inverse MSE weights

Timmermann (UCSD) Combination Winter, 2017 15 / 49
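The multivariate solution above is a one-liner in numpy. A minimal sketch, assuming the error covariance matrix Σ_e is known (the diagonal example values are made up):

```python
import numpy as np

# Sketch: MSE-minimizing combination weights subject to summing to one,
# w* = Sigma^{-1} iota / (iota' Sigma^{-1} iota).
def optimal_combination_weights(sigma_e):
    m = sigma_e.shape[0]
    iota = np.ones(m)
    sig_inv_iota = np.linalg.solve(sigma_e, iota)   # Sigma^{-1} iota
    return sig_inv_iota / (iota @ sig_inv_iota)

# Diagonal Sigma: the weights reduce to inverse-MSE weights
w = optimal_combination_weights(np.diag([1.0, 2.0, 4.0]))
# w proportional to [1, 1/2, 1/4], i.e. [4/7, 2/7, 1/7]
```

With a diagonal covariance matrix the result matches the inverse-MSE special case stated on the slide.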

Optimality of equal weights

Equal weights (EW) play a special role in forecast combination

EW are optimal when the individual forecast errors have identical variance,
σ2, and identical pair-wise correlations ρ

nothing to distinguish between the forecasts

This situation holds to a close approximation when all models are based on
similar data and produce more or less equally accurate forecasts

Similarity to portfolio analysis: An equal-weighted portfolio is optimal if all
assets have the same mean and variance and pairwise identical covariances

Timmermann (UCSD) Combination Winter, 2017 16 / 49

Estimating combination weights

In practice, combination weights need to be estimated using past data

Once we use estimated combination weights it is difficult to show that any
particular weighting scheme will dominate other weighting methods
We prefer one method for some data and different methods for other data
Equal-weights avoid estimation error entirely

Why not always use equal weights then?

Timmermann (UCSD) Combination Winter, 2017 17 / 49

Estimating combination weights

If we try to estimate the optimal combination weights, estimation error
creeps in

In the case of forecast combination, the “data” (individual forecasts) are not
random draws but (possibly unbiased, if not precise) forecasts of the outcome

This suggests imposing special restrictions on the combination weights

We might impose that the weights sum to one and are non-negative:

∑_{i=1}^{m} ω_i = 1, ω_i ∈ [0, 1]

Simple combination schemes such as EW satisfy these constraints and do not
require estimation of any parameters

EW can be viewed as a reasonable prior when no data have been observed

Timmermann (UCSD) Combination Winter, 2017 18 / 49

Estimating combination weights

Simple estimation methods are difficult to beat in practice

Common baseline is to use a simple EW average of forecasts:

f^{ew} = (1/m) ∑_{i=1}^{m} f_i

No estimation error since the combination weights are imposed rather than
estimated (data independent)

Also works if the number of forecasts (m) changes over time or some
forecasts have short track records

Timmermann (UCSD) Combination Winter, 2017 19 / 49

Simple combination methods

Equal-weighted forecast

f^{ew} = (1/m) ∑_{i=1}^{m} f_i

Median forecast (robust to outliers)

f^{median} = median{f_i}_{i=1}^{m}

Trimmed mean. Order the forecasts

f_1 ≤ f_2 ≤ … ≤ f_{m−1} ≤ f_m

Then trim the top/bottom α% before taking an average:

f^{trim} = 1/(m(1 − 2α)) ∑_{i=⌊αm⌋+1}^{⌊(1−α)m⌋} f_i

Timmermann (UCSD) Combination Winter, 2017 20 / 49
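The three simple schemes can be sketched as follows. The forecast values are hypothetical, with one deliberate outlier to show the robustness point made above:

```python
import numpy as np

forecasts = np.array([1.8, 2.0, 2.1, 2.2, 9.0])   # one wild outlier

f_ew = forecasts.mean()            # equal-weighted average
f_median = np.median(forecasts)    # robust to the outlier

def trimmed_mean(f, alpha):
    """Drop the bottom/top alpha fraction of ordered forecasts, then average."""
    f = np.sort(f)
    k = int(np.floor(alpha * len(f)))
    return f[k:len(f) - k].mean()

f_trim = trimmed_mean(forecasts, 0.2)   # trims one forecast from each end
# f_ew = 3.42, while f_median = f_trim = 2.1: the outlier moves only the mean
```

The median and trimmed mean discard the influence of the extreme forecast, whereas the equal-weighted mean is pulled toward it.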

Weights inversely proportional to MSE or rankings

Ignore correlations across forecast errors and set weights proportional to the
inverse of the models’ MSE (mean squared error) values:

ω_i = MSE_i^{−1} / ∑_{j=1}^{m} MSE_j^{−1}

Robust weighting scheme that weights forecast models inversely to their rank,
Rank_i:

ω̂_i = Rank_i^{−1} / ∑_{j=1}^{m} Rank_j^{−1}

Best model gets a rank of 1, second best model a rank of 2, etc. Weights are
proportional to 1/1, 1/2, 1/3, etc.

Timmermann (UCSD) Combination Winter, 2017 21 / 49
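Both weighting schemes above are easy to sketch; the MSE values below are made up for illustration:

```python
import numpy as np

mse = np.array([0.8, 1.0, 2.0])              # hypothetical model MSEs

w_mse = (1 / mse) / (1 / mse).sum()          # inverse-MSE weights

ranks = 1 + np.argsort(np.argsort(mse))      # best (lowest MSE) gets rank 1
w_rank = (1 / ranks) / (1 / ranks).sum()     # proportional to 1, 1/2, 1/3
# w_rank = [6/11, 3/11, 2/11]
```

The rank-based weights depend only on the ordering of the models, so a single extreme MSE value cannot dominate the combination.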

Bates-Granger restricted least squares

Bates and Granger (1969): use plug-in weights in the optimal solution based
on the estimated variance-covariance matrix

Numerically identical to the restricted least squares estimator of the weights
from regressing the outcome on the vector of forecasts f_{t+h|t} with no
intercept, subject to the restriction that the coefficients sum to one:

f_{t+h|t}^{BG} = ω̂_{OLS}′ f_{t+h|t} = (ι′ Σ̂_e^{−1} ι)^{−1} ι′ Σ̂_e^{−1} f_{t+h|t}

Σ̂_e = (T − h)^{−1} ∑_{t=1}^{T−h} e_{t+h|t} e_{t+h|t}′: sample estimator of the
error covariance matrix

Timmermann (UCSD) Combination Winter, 2017 22 / 49
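The plug-in version of the optimal weights can be sketched directly from a history of forecast errors. The errors below are simulated purely for illustration:

```python
import numpy as np

# Sketch: Bates-Granger plug-in weights from a sample of forecast errors
# (rows = time, columns = models); the error data here are simulated.
rng = np.random.default_rng(0)
errors = rng.normal(size=(200, 3)) * np.array([1.0, 1.5, 2.0])

sigma_hat = errors.T @ errors / errors.shape[0]   # sample error covariance
iota = np.ones(3)
s_inv_iota = np.linalg.solve(sigma_hat, iota)
w_bg = s_inv_iota / (iota @ s_inv_iota)           # weights sum to one
```

The weights are the same formula as the known-covariance solution, only with Σ̂_e estimated from past errors; estimation noise in Σ̂_e is exactly what the combination puzzle discussion below is about.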

Forecast combination puzzle

Empirical studies often find that simple equal-weighted forecast combinations
perform very well compared with more sophisticated combination schemes
that rely on estimated combination weights

Smith and Wallis (2009): “Why is it that, in comparisons of combinations of
point forecasts based on mean-squared forecast errors …, a simple average
with equal weights, often outperforms more complicated weighting schemes.”

Errors introduced by estimation of the optimal combination weights could
overwhelm any gains relative to using 1/N weights

Timmermann (UCSD) Combination Winter, 2017 23 / 49

Combination forecasts using Goyal-Welch Data

RMSE: Prevmean: 1.9640, Ksink: 1.9924, EW = 1.9592

[Figure: out-of-sample return forecasts from the PrevMean, Ksink, and EW
combination schemes, 1975-2010]

Timmermann (UCSD) Combination Winter, 2017 24 / 49

Combination forecasts using Goyal-Welch Data

RMSE: EW = 1.9592, rolling = 1.9875, PrevBest = 2.0072

[Figure: out-of-sample return forecasts from the EW, rolling, and PrevBest
combination schemes, 1975-2010]

Timmermann (UCSD) Combination Winter, 2017 25 / 49

Rapach-Strauss-Zhou (2010)

Quarterly stock returns data, 1947-2005, 15 predictor variables
Individual univariate prediction models (i = 1, …, N = 15):

r_{t+1} = α_i + β_i x_{it} + ε_{i,t+1} (estimated model)
r̂_{t+1|i} = α̂_i + β̂_i x_{it} (generated forecast)

Combination forecast of returns, r̂_{t+1|t}^c:

r̂_{t+1|t}^c = ∑_{i=1}^{N} ω_i r̂_{t+1|i} with weights

ω_i = 1/N or ω_i = DMSPE_i^{−1} / ∑_{j=1}^{N} DMSPE_j^{−1}

DMSPE_i = ∑_{s=T_0}^{t−1} θ^{t−1−s} (r_{s+1} − r̂_{s+1|i})^2, θ ≤ 1

DMSPE: discounted mean squared prediction error
Timmermann (UCSD) Combination Winter, 2017 26 / 49
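The DMSPE weights above can be sketched as follows; the outcome and forecast arrays are hypothetical placeholders:

```python
import numpy as np

def dmspe_weights(actual, forecasts, theta=0.9):
    """DMSPE combination weights.

    actual: (T,) outcomes; forecasts: (T, N) model forecasts; theta <= 1
    discounts older squared errors by theta^(T-1-s).
    """
    T, N = forecasts.shape
    discounts = theta ** np.arange(T - 1, -1, -1)   # oldest error most discounted
    sq_err = (actual[:, None] - forecasts) ** 2
    dmspe = (discounts[:, None] * sq_err).sum(axis=0)
    return (1 / dmspe) / (1 / dmspe).sum()

# Toy check: model 1 has errors of 1, model 2 errors of 2, every period
w = dmspe_weights(np.zeros(5), np.column_stack([np.ones(5), 2 * np.ones(5)]))
# squared-error ratio 1:4 -> weights [0.8, 0.2]
```

With θ = 1 this reduces to plain inverse-MSE weights; θ < 1 tilts the weights toward models that have performed well recently.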

Rapach-Strauss-Zhou (2010): results

Timmermann (UCSD) Combination Winter, 2017 27 / 49

Rapach-Strauss-Zhou (2010): results

Timmermann (UCSD) Combination Winter, 2017 28 / 49

Empirical Results (Rapach, Strauss and Zhou, 2010)

Timmermann (UCSD) Combination Winter, 2017 29 / 49

Rapach, Strauss, Zhou: main results

Forecast combinations dominate individual prediction models for stock
returns out-of-sample

Forecast combination reduces the variance of the return forecast

Return forecasts are most accurate during economic recessions

“Our evidence suggests that the usefulness of forecast combining methods
ultimately stems from the highly uncertain, complex, and constantly evolving
data-generating process underlying expected equity returns, which are related
to a similar process in the real economy.”

Timmermann (UCSD) Combination Winter, 2017 30 / 49

Elliott, Gargano, and Timmermann (JoE, 2013):

K possible predictor variables

Generalizes the equal-weighted combination of K univariate models
r̂_{t+1|i} = α̂_i + β̂_i x_{it} to consider EW combinations of all possible 2-variate,
3-variate, etc. models:

r̂_{t+1|i,j} = α̂_i + β̂_i x_{it} + β̂_j x_{jt} (2 predictors)
r̂_{t+1|i,j,k} = α̂_i + β̂_i x_{it} + β̂_j x_{jt} + β̂_k x_{kt} (3 predictors)

For K = 12, there are 12 univariate models (k = 1), 66 bivariate models
(k = 2), 220 trivariate models (k = 3) to combine, etc.

Take equal-weighted averages over the forecasts from these models –
complete subset regressions

Timmermann (UCSD) Combination Winter, 2017 31 / 49

Elliott, Gargano, and Timmermann (JoE, 2013)

Timmermann (UCSD) Combination Winter, 2017 32 / 49

Elliott, Gargano, and Timmermann (JoE, 2013)

Timmermann (UCSD) Combination Winter, 2017 33 / 49

Adaptive combination weights

Bates and Granger (1969) propose several adaptive estimation schemes

Rolling window of the forecast models’ relative performance over the most
recent win observations:

ω̂_{i,t|t−h} = (∑_{τ=t−win+1}^{t} e_{i,τ|τ−h}^2)^{−1} / ∑_{j=1}^{m} (∑_{τ=t−win+1}^{t} e_{j,τ|τ−h}^2)^{−1}

Adaptive updating scheme that discounts older performance, λ ∈ (0, 1):

ω̂_{i,t|t−h} = λ ω̂_{i,t−1|t−h−1} + (1−λ) (∑_{τ=t−win+1}^{t} e_{i,τ|τ−h}^2)^{−1} / ∑_{j=1}^{m} (∑_{τ=t−win+1}^{t} e_{j,τ|τ−h}^2)^{−1}

The closer λ is to unity, the smoother the combination weights

Timmermann (UCSD) Combination Winter, 2017 34 / 49
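The adaptive updating scheme amounts to exponentially smoothing each model's inverse-MSE weight. A minimal sketch, with hypothetical rolling-window squared-error sums:

```python
def update_weights(prev_weights, window_sq_errors, lam=0.9):
    """One adaptive update: shrink last period's weights toward the
    current rolling-window inverse-MSE weights.

    window_sq_errors: list of sum-of-squared-errors over the window, per model.
    """
    inv = [1.0 / s for s in window_sq_errors]
    total = sum(inv)
    return [lam * w + (1 - lam) * v / total
            for w, v in zip(prev_weights, inv)]

w = update_weights([0.5, 0.5], [1.0, 4.0], lam=0.9)
# w = [0.53, 0.47]: weights drift slowly toward the target [0.8, 0.2]
```

With λ close to one the weights barely move each period, which is the smoothness property noted on the slide.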

Time-varying combination weights

Time-varying parameter (Kalman filter):

y_{t+1} = ω_t′ f̂_{t+1|t} + ε_{t+1}
ω_t = ω_{t−1} + u_t, cov(u_t, ε_{t+1}) = 0

Discrete (observed) state switching (Deutsch et al., 1994), conditional on an
observed event happening (e_t ∈ A_t):

y_{t+1} = I_{e_t ∈ A}(ω_{01} + ω_1′ f̂_{t+1|t}) + (1 − I_{e_t ∈ A})(ω_{02} + ω_2′ f̂_{t+1|t}) + ε_{t+1}

Regime switching weights (Elliott and Timmermann, 2005):

y_{t+1} = ω_{0,s_{t+1}} + ω_{s_{t+1}}′ f̂_{t+1|t} + ε_{t+1}
pr(S_{t+1} = s_{t+1} | S_t = s_t) = p_{s_{t+1} s_t}

Timmermann (UCSD) Combination Winter, 2017 35 / 49

Combinations as a hedge against instability

Forecast combinations can work well empirically because they provide
insurance against model instability

The performance of combined forecasts tends to be more stable than that of
individual forecasts used in the empirical combination study of Stock and
Watson (2004)
Combination methods that attempt to explicitly model time-variation in the
combination weights often fail to perform well, suggesting that regime
switching or model ‘breakdown’ can be difficult to predict or even to track
through time
Use simple, robust methods (rolling window)?

Timmermann (UCSD) Combination Winter, 2017 36 / 49

Combinations as a hedge against instability (cont.)

Suppose a particular forecast is correlated with the outcome only during
times when other forecasts break down. This creates a role for the forecast as
a hedge against model breakdown

Consider two forecasts and two regimes

first forecast works well only in the first state (normal state) but not in the
second state (rare state)
second forecast works well in the second state but not in the first state
second model serves as “insurance” against the breakdown of the first model
like a portfolio asset

Timmermann (UCSD) Combination Winter, 2017 37 / 49

Classical approach to density combination

Problem: we do not directly observe the outcome density, only a draw from
it, and so cannot directly choose the weights to minimize the loss between
this object and the combined density

Kullback-Leibler (KL) loss for a linear combination of densities ∑_{i=1}^{m} ω_i p_i(y)
relative to some unknown true density p(y) is given by

KL = ∫ p(y) ln ( p(y) / ∑_{i=1}^{m} ω_i p_i(y) ) dy
   = ∫ p(y) ln(p(y)) dy − ∫ p(y) ln ( ∑_{i=1}^{m} ω_i p_i(y) ) dy
   = C − E[ln ( ∑_{i=1}^{m} ω_i p_i(y) )]

C is constant for all choices of the weights ω_i
Minimizing the KL distance is the same as maximizing the log score in
expectation

Timmermann (UCSD) Combination Winter, 2017 38 / 49

Classical approach to density combination

Use of the log score to evaluate the density combination is popular in the
literature

Geweke and Amisano (2011) use this approach to combine GARCH and
stochastic volatility models for predicting the density of daily stock returns

Under the log score criterion, estimation of the combination weights becomes
equivalent to maximizing the log likelihood. Given a sequence of observed
outcomes {y_t}_{t=1}^{T}, the sample analog is to maximize

ω̂ = argmax_ω T^{−1} ∑_{t=1}^{T} ln ( ∑_{i=1}^{m} ω_i p_{it}(y_t) )

s.t. ω_i ≥ 0 for all i, ∑_{i=1}^{m} ω_i = 1

Timmermann (UCSD) Combination Winter, 2017 39 / 49
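For two models the constrained problem reduces to a single scalar weight, so a grid search is enough for a sketch. The density values below (each model's predictive density evaluated at the realized outcomes) are hypothetical:

```python
import numpy as np

def pool_weight(p1, p2, grid=np.linspace(0.0, 1.0, 101)):
    """Weight on density 1 maximizing the average log score of the pool."""
    scores = [np.mean(np.log(w * p1 + (1 - w) * p2)) for w in grid]
    return grid[int(np.argmax(scores))]

p1 = np.array([0.40, 0.35, 0.30, 0.45])   # model 1 fits every outcome better
p2 = np.array([0.10, 0.12, 0.08, 0.11])
w_star = pool_weight(p1, p2)
# w_star = 1.0 here: all weight goes to the uniformly better-scoring density
```

In realistic samples neither model dominates at every observation, and the optimal pool typically puts interior weight on both, which is the Geweke-Amisano finding illustrated next.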

Prediction pools with two models (Geweke-Amisano, 2011)

With two models, M_1, M_2, we have a predictive density

p(y_t |Y_{t−1}, M) = ω p(y_t |Y_{t−1}, M_1) + (1−ω) p(y_t |Y_{t−1}, M_2)

and a predictive log score

∑_{t=1}^{T} log [ω p(y_t |Y_{t−1}, M_1) + (1−ω) p(y_t |Y_{t−1}, M_2)], ω ∈ [0, 1]

Empirical example: Combine GARCH and stochastic volatility models for
predicting the density of daily stock returns

Timmermann (UCSD) Combination Winter, 2017 40 / 49

Log predictive score as a function of model weight,
S&P500, 1976-2005 (Geweke-Amisano, 2011)

Timmermann (UCSD) Combination Winter, 2017 41 / 49

Weights in pools of multiple models, S&P500, 1976-2005
(Geweke-Amisano, 2011)

Timmermann (UCSD) Combination Winter, 2017 42 / 49

Optimal prediction pool – time-varying combination weights

Pettenuzzo-Timmermann (2014)

Timmermann (UCSD) Combination Winter, 2017 43 / 49

Model combination – Bayesian Model Averaging

When constructing the individual forecasts ourselves, we can base the
combined forecast on information about the individual models’ fit

Methods such as BMA (Bayesian Model Averaging) can be used
BMA weights predictive densities by the posterior probabilities (fit) of the
models, Mi
Models that fit the data better get higher weights in the combination

Timmermann (UCSD) Combination Winter, 2017 44 / 49

Bayesian Model Averaging (BMA)

p^c(y) = ∑_{i=1}^{m} ω_i p(y |M_i)

m models: M_1, …, M_m
BMA weights the predictive densities by the posteriors of the models, M_i
BMA is a model averaging procedure rather than a predictive density
combination procedure per se

BMA assumes the availability of both the data underlying each of the
densities, p_i(y) = p(y |M_i), and knowledge of how those data are employed
to obtain a predictive density

Timmermann (UCSD) Combination Winter, 2017 45 / 49

Bayesian Model Averaging (BMA)

The combined model average, given data Z, is

p^c(y |Z) = ∑_{i=1}^{m} p(y |M_i, Z) p(M_i |Z)

p(M_i |Z): posterior probability of model i, given the data Z:

p(M_i |Z) = p(Z |M_i) p(M_i) / ∑_{j=1}^{m} p(Z |M_j) p(M_j)

Marginal likelihood of model i:

p(Z |M_i) = ∫ p(Z |θ_i, M_i) p(θ_i |M_i) dθ_i

p(θ_i |M_i): prior density of model i’s parameters
p(Z |θ_i, M_i): likelihood of the data given the parameters and the model
Timmermann (UCSD) Combination Winter, 2017 46 / 49

Constructing BMA estimates

Requirements:

List of models M1, …,Mm
Prior model probabilities p(M1), …., p(Mm)

Priors for the model parameters P(θ1 |M1), …,P(θm |Mm)
Computing p(M_i |Z) requires the marginal likelihood p(Z |M_i), which can be
time-consuming

Timmermann (UCSD) Combination Winter, 2017 47 / 49

Alternative BMA schemes

Raftery, Madigan and Hoeting (1997): MC3

If the models’ marginal likelihoods are difficult to compute, one can use a
simple approximation based on BIC:

ω_i = P(M_i |Z) ≈ exp(−0.5 BIC_i) / ∑_{j=1}^{m} exp(−0.5 BIC_j)

Remove models that appear not to be very good

Madigan and Raftery (1994) suggest removing models for which p(M_i |Z) is
much smaller than the posterior probability of the best model

Timmermann (UCSD) Combination Winter, 2017 48 / 49
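The BIC approximation is straightforward to sketch; the BIC values below are hypothetical. Subtracting the minimum BIC before exponentiating (a standard numerical-stability trick, not from the slides) leaves the weights unchanged:

```python
import numpy as np

def bma_weights_bic(bic):
    """Approximate BMA weights from BIC values: w_i ~ exp(-0.5 * BIC_i)."""
    b = np.asarray(bic, dtype=float)
    z = -0.5 * (b - b.min())   # shift by the minimum to avoid underflow
    w = np.exp(z)
    return w / w.sum()

w = bma_weights_bic([100.2, 101.5, 110.0])
# the model with the lowest BIC receives the largest weight
```

Models with BIC far above the best model get weights near zero, which is the pruning idea attributed to Madigan and Raftery above.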

Conclusion

Combination of forecasts is motivated by

misspecified forecasting models due to parameter instability, omitted variables
etc.
diversification across forecasts
private information used to compute individual forecasts (surveys)

Simple, robust estimation schemes tend to work well

optimal combination weights are hard to estimate in small samples

Even if they do not always deliver the most precise forecasts, forecast
combinations generally do not deliver poor performance and so represent a
relatively safe choice

Empirically, equal-weighted survey forecasts work well for many
macroeconomic variables, but they tend to be biased and not very precise for
stock returns

Timmermann (UCSD) Combination Winter, 2017 49 / 49

Lecture 9: Forecast Evaluation
UCSD, Winter 2017

Allan Timmermann1

1UC San Diego

Timmermann (UCSD) Forecast Evaluation Winter, 2017 1 / 50

1 Forecast Evaluation: Absolute vs. relative performance

2 Properties of Optimal Forecasts – Theoretical concepts

3 Evaluation of Sign (Directional) Forecasts

4 Evaluating Interval Forecasts

5 Evaluating Density Forecasts

6 Comparing forecasts I: Tests of equal predictive accuracy

7 Comparing Forecasts II: Tests of Forecast Encompassing
Timmermann (UCSD) Forecast Evaluation Winter, 2017 2 / 50

Forecast evaluation

Given an observed series of forecasts, f_{t+h|t}, and outcomes, y_{t+h},
t = 1, …, T, we want to know if the forecasts were “optimal” or poor

Forecast evaluation is closely related to how we measure forecast accuracy

Absolute performance measures the accuracy of an individual forecast
relative to the outcome, using either economic (loss-based) or statistical
measures of performance

Forecast optimality, efficiency
A forecast that isn’t obviously deficient could still be poor

Relative performance compares the performance of one or several forecasts
against some benchmark: a horse race between competing forecast models

Forecast comparisons: tests of equal predictive accuracy
Two forecasts could both be poor, but one less bad than the other

Forecast encompassing tests (tests for dominance)

Timmermann (UCSD) Forecast Evaluation Winter, 2017 2 / 50

Forecast evaluation (cont.)

Forecast evaluation amounts to understanding if a model’s predictive
accuracy is “good enough”

How accurate should we be able to forecast Chinese GDP growth? What’s a
reasonable R2 or RMSE?
How about forecasting stock returns? Expect low R2

How much does the forecast horizon matter to the degree of predictability?

Some variables are easier to predict than others. Why?

Unconditional forecast or random walk forecast are natural benchmarks

ARMA models are sometimes used

Timmermann (UCSD) Forecast Evaluation Winter, 2017 3 / 50

Forecast evaluation (cont.)

Informal methods – graphical plots, decompositions
Formal methods – deal with how to formally test if a forecasting model
satisfies certain “optimality criteria”

Evaluation of a forecasting model requires an estimate of its expected loss

Good forecasting models produce ‘small’ average losses, while bad models
produce ‘large’ average losses

Good performance in a given sample could be due to luck or could reflect
the performance of a genuinely good model

Power of statistical tests varies. Can we detect the difference between
R2 = 1% and R2 = 1.5%? It depends on the sample size

Test results depend on the loss function and the information set

Rejection of forecast optimality suggests that we can improve the forecast (at
least in theory)

Timmermann (UCSD) Forecast Evaluation Winter, 2017 4 / 50

Optimality Tests

Establish a benchmark for what constitutes an optimal or a “good” forecast
Efficient forecasts: constructed given knowledge of the true data generating
process (DGP) using all currently available information

sets the bar very high: in practice we don’t know the true DGP

Forecasts are efficient (rational) if they fully utilize all available information
and this information cannot be used to construct a better forecast

weak versus strong rationality (just like tests of market efficiency)
unbiasedness (forecast error should have zero mean)
orthogonality tests (forecast error should be unpredictable)

Timmermann (UCSD) Forecast Evaluation Winter, 2017 5 / 50

Efficient Forecast: Definition

A forecast is efficient (optimal) if no other forecast using the available data,
x_t ∈ I_t, can be used to generate a smaller expected loss
Under MSE loss:

f̂*_{t+h|t} = argmin_{f̂(x_t)} E[(y_{t+h} − f̂(x_t))^2]

If we can use information in I_t to produce a more accurate forecast, then the
original forecast is suboptimal

Efficiency is conditional on the information set

weak form forecast efficiency tests include only past forecasts and past
outcomes: I_t = {y_t, y_{t−1}, …, f̂_{t|t−1}, e_{t|t−1}, …}
strong form efficiency tests extend this to include all other variables x_t ∈ I_t

Timmermann (UCSD) Forecast Evaluation Winter, 2017 6 / 50

Optimality under MSE loss

First order condition for an optimal forecast under MSE loss:

E[∂(y_{t+h} − f_{t+h|t})^2 / ∂f_{t+h|t}] = −2 E[y_{t+h} − f_{t+h|t}] = −2 E[e_{t+h|t}] = 0

Similarly, conditional on information at time t, I_t:

E[e_{t+h|t} |I_t] = 0

Expected value of the forecast error must equal zero given current
information, I_t
Test E[e_{t+h|t} x_t] = 0 for all variables x_t ∈ I_t known at time t
If the forecast is optimal, no variable known at time t can predict its future
forecast error e_{t+h|t}. Otherwise the forecast wouldn’t be optimal

If I can predict that my forecast will be too low, I should increase my forecast

Timmermann (UCSD) Forecast Evaluation Winter, 2017 7 / 50

Optimality properties under Squared Error Loss

1 Forecasts are unbiased: the forecast error et+h|t has zero mean, both
conditionally and unconditionally:

E [et+h|t ] = E [et+h|t |It ] = 0

2 h-period forecast errors (et+h|t ) are uncorrelated with information available
at the time the forecast was computed (It ). In particular, single-period
forecast errors, et+1|t , are serially uncorrelated:

E [et+1|tet |t−1 ] = 0

3 The variance of the forecast error (et+h|t ) increases (weakly) in the forecast
horizon, h :

Var(et+h+1|t ) ≥ Var(et+h|t ), for all h ≥ 1

On average it’s harder to predict distant outcomes than outcomes in the near
future
Timmermann (UCSD) Forecast Evaluation Winter, 2017 8 / 50

Optimality properties under Squared Error Loss (cont.)

Optimal forecasts are unbiased. Why? If they were biased, we could improve
the forecast simply by correcting for the bias

Suppose f_{t+1|t} is biased:

y_{t+1} = 1 + f_{t+1|t} + ε_{t+1}, ε_{t+1} ∼ WN(0, σ^2)

Bias-corrected forecast:

f*_{t+1|t} = 1 + f_{t+1|t}

is more accurate than f_{t+1|t}

Forecast errors from an optimal model should be unpredictable:

Suppose e_{t+1} = 0.5 e_t, so the one-step forecast error is serially correlated
Adding back 0.5 e_t to the original forecast yields a more accurate forecast:
f*_{t+1|t} = f_{t+1|t} + 0.5 e_t is better than f_{t+1|t}

Variance of (optimal) forecast error increases in the forecast horizon

We learn more information as we get closer to the forecast “target” and
increase our information set

Timmermann (UCSD) Forecast Evaluation Winter, 2017 9 / 50

Illustration for MA(2) process

Y_t = ε_t + θ_1 ε_{t−1} + θ_2 ε_{t−2}, ε_t ∼ WN(0, σ^2)

Target | y_{T+h} | f_{T+h|T} | e_{T+h|T}
T+1 | ε_{T+1} + θ_1 ε_T + θ_2 ε_{T−1} | θ_1 ε_T + θ_2 ε_{T−1} | ε_{T+1}
T+2 | ε_{T+2} + θ_1 ε_{T+1} + θ_2 ε_T | θ_2 ε_T | ε_{T+2} + θ_1 ε_{T+1}
T+3 | ε_{T+3} + θ_1 ε_{T+2} + θ_2 ε_{T+1} | 0 | ε_{T+3} + θ_1 ε_{T+2} + θ_2 ε_{T+1}

From these results we see that

E[e_{T+h|T}] = 0 for h = 1, 2, 3, …
Var(e_{T+3|T}) ≥ Var(e_{T+2|T}) ≥ Var(e_{T+1|T})

Timmermann (UCSD) Forecast Evaluation Winter, 2017 10 / 50
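The MA(2) error variances can be worked out directly from the errors in the table: Var(e_{T+1|T}) = σ², Var(e_{T+2|T}) = σ²(1 + θ_1²), Var(e_{T+3|T}) = σ²(1 + θ_1² + θ_2²). A minimal numeric check with made-up parameter values:

```python
# Sketch: theoretical h-step forecast error variances for the MA(2) above,
# using hypothetical parameter values. They are weakly increasing in h.
theta1, theta2, sigma2 = 0.5, 0.3, 1.0
var_e = [sigma2,                                  # h = 1: e = eps_{T+1}
         sigma2 * (1 + theta1**2),                # h = 2
         sigma2 * (1 + theta1**2 + theta2**2)]    # h = 3 (and all h >= 3)
# var_e = [1.0, 1.25, 1.34]
```

Beyond h = 3 the variance is flat at the unconditional variance of Y, since an MA(2) is unpredictable more than two steps ahead.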

Regression tests of optimality under MSE loss

Efficiency regressions test whether any variable x_t known at time t can
predict the future forecast error e_{t+1|t}:

e_{t+1|t} = β′x_t + ε_{t+1}, ε_{t+1} ∼ WN(0, σ^2)
H_0: β = 0 vs H_1: β ≠ 0

Unbiasedness tests set x_t = 1:

e_{t+1|t} = β_0 + ε_{t+1}

Mincer-Zarnowitz regression uses y_{t+1} on the LHS and sets x_t = (1, f̂_{t+1|t}):

y_{t+1} = β_0 + β_1 f̂_{t+1|t} + ε_{t+1}
H_0: β_0 = 0, β_1 = 1

Zero intercept, unit slope: use an F test

Timmermann (UCSD) Forecast Evaluation Winter, 2017 11 / 50
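A Mincer-Zarnowitz regression can be sketched in a few lines. The data here are simulated so that the forecast is deliberately biased, mirroring the example that follows:

```python
import numpy as np

# Sketch: regress outcomes on a constant and the forecast; under optimality
# the coefficients are (0, 1). Simulated data with a biased forecast.
rng = np.random.default_rng(1)
f = rng.normal(size=500)                          # hypothetical forecasts
y = 0.2 + 0.9 * f + 0.1 * rng.normal(size=500)    # biased outcome process

X = np.column_stack([np.ones_like(f), f])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
# beta is close to (0.2, 0.9), far from the optimal (0, 1)
```

In practice one would add an F test of the joint null (β_0, β_1) = (0, 1) with appropriate standard errors; the point estimates alone already show the bias here.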

Regression tests of optimality: Example

Suppose that f̂_{t+1|t} is biased:

y_{t+1} = 0.2 + 0.9 f̂_{t+1|t} + ε_{t+1}, ε_{t+1} ∼ WN(0, σ^2)

Q: How can we easily produce a better forecast?

Answer:

f̂*_{t+1|t} = 0.2 + 0.9 f̂_{t+1|t}

will be an unbiased forecast

What if

y_{t+1} = 0.3 + f̂_{t+1|t} + ε_{t+1},
ε_{t+1} = u_{t+1} + θ_1 u_t, u_t ∼ WN(0, σ^2)

Can we improve on this forecast?

Timmermann (UCSD) Forecast Evaluation Winter, 2017 12 / 50

A question of power

In small samples with little predictability, forecast optimality tests may not
have much power (ability to detect deviations from forecast optimality)

Rare to find individual forecasters with a long track record
Predictive ability changes over time

Need a long out-of-sample data set (evaluation sample) to be able to tell
with statistical confidence if a forecast is suboptimal

Timmermann (UCSD) Forecast Evaluation Winter, 2017 13 / 50

Testing non-decreasing variance of forecast errors

Suppose we have forecasts recorded for three different horizons, h = S, M, L,
with S < M < L (short, medium, long)

μ_e = [E[e_{t+S|t}^2], E[e_{t+M|t}^2], E[e_{t+L|t}^2]]′: MSE values

MSE differentials (Long-Medium, Medium-Short):

Δe_{L−M} ≡ E[e_{t+L|t}^2] − E[e_{t+M|t}^2]
Δe_{M−S} ≡ E[e_{t+M|t}^2] − E[e_{t+S|t}^2]

We can test if the expected value of the squared forecast errors is weakly
increasing in the forecast horizon:

Δe_{L−M} ≥ 0
Δe_{M−S} ≥ 0

The distant future is harder to predict than the near future

Timmermann (UCSD) Forecast Evaluation Winter, 2017 14 / 50

Evaluating the rationality of the “Greenbook” forecasts

Patton and Timmermann (2012) study the Fed’s “Greenbook” forecasts of
GDP growth, the GDP deflator and CPI inflation
Data are quarterly, over the period 1982Q1 to 2000Q4, approx. 80
observations
Greenbook forecasts and actuals constructed from real-time Federal Reserve
publications. These are aligned in “event time”
6 forecast horizons: h = 0, 1, 2, 3, 4, 5

Increasing MSE and decreasing MSF

[Figure: MSE, forecast variance, and actual variance against the forecast
horizon, Greenbook forecasts of GDP growth, 1982Q1-2000Q4]

Bias vs. forecast horizon (in months)

[Figure: bias in analysts’ EPS forecast errors (AAPL) by forecast horizon]

RMSE vs. forecast horizon (in months)

[Figure: RMSE of analysts’ EPS forecasts (AAPL) by forecast horizon]

Optimality tests that do not rely on the outcome

Many macroeconomic data series are revised:

preliminary, first release, or latest data vintage could be used to measure the
outcome
volatility is also unobserved

Under MSE loss, forecast revisions should be unpredictable

If I could predict today that my future forecast of the same event will be
different in a particular direction (higher or lower), then I should incorporate
this information into my current forecast

Let Δf_{t+h} = f_{t+h|t+1} − f_{t+h|t} be the forecast revision.
Then E[Δf_{t+h} |I_t] = 0
Forecast revisions are a martingale difference process (zero mean)
This can be tested through a simple regression that doesn’t use the outcome:

Δf_{t+h} = α + δ x_t + ε_{t+h}

Timmermann (UCSD) Forecast Evaluation Winter, 2017 19 / 50

Forecast evaluation for directional forecasting

Suppose we are interested in evaluating the forecast (f) of the sign of a
variable, y. There are four possible outcomes:

forecast/outcome | sign(y) > 0 | sign(y) ≤ 0
sign(f) > 0 | true positive | false positive
sign(f) ≤ 0 | false negative | true negative

If stock returns (y) are positive 60% of the time and we always predict a
positive return, we have a “hit rate” of 60%. Is this good?

We need a test statistic that doesn’t reward “broken clock” forecasts (always
predict the same sign) with no informational content or value

Timmermann (UCSD) Forecast Evaluation Winter, 2017 20 / 50

Who is the better forecaster?

Timmermann (UCSD) Forecast Evaluation Winter, 2017 21 / 50

Information in forecasts

Both forecasters have a ‘hit rate’ of 80%, with 8 out of 10 correct predictions
(sum elements on the diagonal)

There is no information in the first forecast (the forecaster always says
“increase”)
There is some information in the second forecast: both increases and
decreases are successfully predicted
Not enough to only look at the overall hit rate

Timmermann (UCSD) Forecast Evaluation Winter, 2017 22 / 50

Forecast evaluation for sign forecasting

Suppose we are interested in predicting the sign of y_{t+h} using the sign
(direction) of a forecast f_{t+h|t}

P: probability of a correctly predicted sign (positive or negative)
P_y: probability of a positive sign of y
P_f: probability of a positive sign of the forecast, f

Define the sign indicator

I(z_t) = 1 if z_t ≥ 0, 0 if z_t < 0

Sample estimates of the sign probabilities with T observations:

P̂ = (1/T) ∑_{t=1}^{T} I(y_{t+h} f_{t+h|t})
P̂_y = (1/T) ∑_{t=1}^{T} I(y_{t+h})
P̂_f = (1/T) ∑_{t=1}^{T} I(f_{t+h|t})

Timmermann (UCSD) Forecast Evaluation Winter, 2017 23 / 50

Sign test

In large samples we can test for sign predictability using the
Pesaran-Timmermann sign statistic:

S_T = (P̂ − P̂*) / √(v̂ar(P̂) − v̂ar(P̂*)) ∼ N(0, 1), where

P̂* = P̂_y P̂_f + (1 − P̂_y)(1 − P̂_f)
v̂ar(P̂) = T^{−1} P̂*(1 − P̂*)
v̂ar(P̂*) = T^{−1}(2P̂_y − 1)^2 P̂_f(1 − P̂_f) + T^{−1}(2P̂_f − 1)^2 P̂_y(1 − P̂_y)
         + 4T^{−2} P̂_y P̂_f(1 − P̂_y)(1 − P̂_f)

This test is very simple to compute and has been used in studies of market
timing (financial returns) and studies of business cycle forecasting

Timmermann (UCSD) Forecast Evaluation Winter, 2017 24 / 50
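The statistic above can be sketched directly; the outcome and forecast series below are made up, with the forecaster getting 9 of 10 signs right:

```python
import numpy as np

def pt_sign_test(y, f):
    """Pesaran-Timmermann sign statistic (zeros counted as positive signs)."""
    T = len(y)
    P = np.mean((y * f) >= 0)                 # correctly signed pairs
    Py, Pf = np.mean(y >= 0), np.mean(f >= 0)
    Ps = Py * Pf + (1 - Py) * (1 - Pf)        # hit rate under independence
    var_P = Ps * (1 - Ps) / T
    var_Ps = ((2 * Py - 1) ** 2 * Pf * (1 - Pf) / T
              + (2 * Pf - 1) ** 2 * Py * (1 - Py) / T
              + 4 * Py * Pf * (1 - Py) * (1 - Pf) / T ** 2)
    return (P - Ps) / np.sqrt(var_P - var_Ps)

y = np.array([1., -1., 1., 1., -1., 1., -1., -1., 1., 1.])
f = y.copy(); f[0] = -1.0                     # 9 of 10 signs correct
stat = pt_sign_test(y, f)                     # roughly 2.7: rejects no skill
```

Note the statistic compares the hit rate against P̂*, the hit rate expected by chance, so a "broken clock" forecaster gains nothing.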

Forecast evaluation for event forecasting

Leitch and Tanner (American Economic Review, 1990)

Timmermann (UCSD) Forecast Evaluation Winter, 2017 25 / 50

Forecasts of binary variables: Liu and Moench (2014)

Liu and Moench (2014): What predicts U.S. recessions? Federal Reserve
Bank of New York working paper

S_t ∈ {0, 1}: true state of the economy (recession indicator)

S_t = 1 in recession
S_t = 0 in expansion

Forecast the probability of a recession using a probit model:

Pr(S_{t+1} = 1 |X_t) = Φ(β_0 + β_1 X_t) ≡ P_{t+1|t}

The log-likelihood function for β = (β_0, β_1)′ is

ln l(β) = ∑_{t=0}^{T−1} [S_{t+1} ln(Φ(β_0 + β_1 X_t)) + (1 − S_{t+1}) ln(1 − Φ(β_0 + β_1 X_t))]

Timmermann (UCSD) Forecast Evaluation Winter, 2017 26 / 50

Evaluating recession forecasts: Liu-Moench

Pt+1|t ∈ [0, 1] : prediction of St+1 given information known at time t, Xt
Blue: Xt = {term spread}
Green: Xt = {term spread, lagged term spread}
Red: Xt = {term spread, lagged term spread, additional predictor}

Timmermann (UCSD) Forecast Evaluation Winter, 2017 27 / 50

Evaluating binary recession forecasts

Split [0, 1] using evenly spaced thresholds
c_i ∈ {0, 0.01, 0.02, …, 0.98, 0.99, 1}

For each threshold, c_i, compute the prediction model’s classification:

Ŝ_{t+1|t}(c_i) = 1 if P_{t+1|t} ≥ c_i, 0 if P_{t+1|t} < c_i

True positive (TP) and false positive (FP) indicators:

I_{t+1}^{tp}(c_i) = 1 if S_{t+1} = 1 and Ŝ_{t+1|t}(c_i) = 1, 0 otherwise
I_{t+1}^{fp}(c_i) = 1 if S_{t+1} = 0 and Ŝ_{t+1|t}(c_i) = 1, 0 otherwise

Timmermann (UCSD) Forecast Evaluation Winter, 2017 28 / 50

Estimating the true positive and false positive rates

Using the true S_{t+1} and the classifications, Ŝ_{t+1|t}, calculate the percentage
of true positives, PTP, and the percentage of false positives, PFP:

PTP(c_i) = (1/n_1) ∑_{t=1}^{T} I_t^{tp}(c_i)
PFP(c_i) = (1/n_0) ∑_{t=1}^{T} I_t^{fp}(c_i)

n_1: number of times S_t = 1 (recessions)
n_0: number of times S_t = 0 (expansions)
n_0 + n_1 = n: sample size

Timmermann (UCSD) Forecast Evaluation Winter, 2017 29 / 50

Creating the ROC curve

Each c_i produces a pair of values (PFP_i, PTP_i)
Plot (PFP_i, PTP_i) across all thresholds c_i with PFP on the x-axis and PTP
on the y-axis
Connecting these points gives the Receiver Operating Characteristic (ROC) curve

The ROC curve plots all possible combinations of PTP(c_i) and PFP(c_i) for
c_i ∈ [0, 1]

is an increasing function in [0, 1]
as c → 0, TP(c) = FP(c) = 1
as c → 1, TP(c) = FP(c) = 0

The Area Under the ROC (AUROC) curve measures the accuracy of the
classification

Perfect forecast: ROC curve lies in the top left corner
Random guess: ROC curve follows the 45 degree diagonal line

Timmermann (UCSD) Forecast Evaluation Winter, 2017 30 / 50

Evaluating recession forecasts: Liu-Moench

Timmermann (UCSD) Forecast Evaluation Winter, 2017 31 / 50

Estimation and inference on AUROC

Y_t^R: observations of Y_t classified as recessions (S_t = 1)
Y_t^E: observations of Y_t classified as expansions (S_t = 0)

Nonparametric estimate of AUROC:

ÂUROC = (1/(n_1 n_0)) ∑_{i=1}^{n_R} ∑_{j=1}^{n_E} [I(Y_i^R > Y_j^E) + (1/2) I(Y_i^R = Y_j^E)]

Asymptotic variance of the AUROC estimate:

σ² = (1/(n1 n0)) [ AUROC (1 − AUROC ) + (n1 − 1)(Q1 − AUROC²)
                 + (n0 − 1)(Q2 − AUROC²) ]

Q1 = AUROC / (2 − AUROC );  Q2 = 2 AUROC² / (1 + AUROC )

Timmermann (UCSD) Forecast Evaluation Winter, 2017 32 / 50
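As an illustration of the pairwise-comparison estimator and the variance formula above, here is a minimal Python sketch; the function names (`auroc`, `auroc_se`) and the toy scores are made up for this example:

```python
import math

def auroc(y_rec, y_exp):
    """Nonparametric AUROC: share of (recession, expansion) score pairs
    ranked correctly, with ties counted as 1/2."""
    total = 0.0
    for yr in y_rec:
        for ye in y_exp:
            if yr > ye:
                total += 1.0
            elif yr == ye:
                total += 0.5
    return total / (len(y_rec) * len(y_exp))

def auroc_se(a, n1, n0):
    """Asymptotic standard error of the AUROC estimate, using the
    Q1, Q2 expressions from the slide."""
    q1 = a / (2.0 - a)
    q2 = 2.0 * a * a / (1.0 + a)
    var = (a * (1.0 - a) + (n1 - 1) * (q1 - a * a)
           + (n0 - 1) * (q2 - a * a)) / (n1 * n0)
    return math.sqrt(var)

# toy scores: every recession-period score exceeds every expansion score
a_hat = auroc([0.9, 0.8], [0.1, 0.2, 0.3])
```

With perfectly separated scores the estimate is 1 and the estimated standard error collapses to zero.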

Comparing AUROC for two forecasts

Suppose we have two AUROC estimates, ÂUROC1 and ÂUROC2, with variance
estimates σ̂²1, σ̂²2

We also need an estimate, r , of the correlation between ÂUROC1 and ÂUROC2

We can test whether the two AUROCs are the same using a t-statistic:

t = (AUROC1 − AUROC2) / √(σ²1 + σ²2 − 2rσ1σ2)
Timmermann (UCSD) Forecast Evaluation Winter, 2017 33 / 50
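The t-statistic above is a one-liner in code. A hypothetical sketch with made-up inputs (the function name and the numbers are illustrative only):

```python
import math

def auroc_tstat(a1, a2, se1, se2, r):
    """t-statistic for H0: AUROC1 = AUROC2, given the two standard
    errors and an estimate r of the correlation between the estimates."""
    return (a1 - a2) / math.sqrt(se1**2 + se2**2 - 2.0 * r * se1 * se2)

# hypothetical numbers: two classifiers with estimated AUROCs 0.85 and 0.75
t = auroc_tstat(0.85, 0.75, se1=0.04, se2=0.03, r=0.5)
```

Positive correlation between the two estimates shrinks the denominator, making it easier to reject equality than if the estimates were treated as independent.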

Evaluating recession forecasts: Liu-Moench (table 3)

Timmermann (UCSD) Forecast Evaluation Winter, 2017 34 / 50

Evaluating Interval forecasts

Interval forecasts predict that the future outcome yt+1 should lie in some
interval

[pL_t+1|t (α); pU_t+1|t (α)]

α ∈ (0, 1) : probability that outcome falls inside the interval forecast
(coverage)

pLt+1|t (α) : lower bound of interval forecast

pUt+1|t (α) : upper bound of interval forecast

Timmermann (UCSD) Forecast Evaluation Winter, 2017 35 / 50

Unconditional test of correctly specified interval forecast

Define the indicator variable

1yt+1 = 1{yt+1 ∈ [pL_t+1|t (α); pU_t+1|t (α)]}

      = { 1 if the outcome falls inside the interval
        { 0 if the outcome falls outside the interval

Test for correct unconditional (“average”) coverage:

E [1yt+1 ] = α

Use this to evaluate fan charts which show interval forecasts for different
values of α

Test correct coverage for α = 25%, 50%, 75%, 90%, 95%, etc.

Timmermann (UCSD) Forecast Evaluation Winter, 2017 36 / 50
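The unconditional coverage test above amounts to comparing the empirical hit rate with α. A minimal sketch, assuming i.i.d. hit indicators so a simple Bernoulli z-test applies (function name and toy data are illustrative):

```python
import math

def coverage_test(y, lower, upper, alpha):
    """Empirical coverage of the interval forecasts and a z-test of
    H0: E[1_y] = alpha, treating hits as i.i.d. Bernoulli(alpha)."""
    hits = [1 if lo <= yt <= up else 0 for yt, lo, up in zip(y, lower, upper)]
    n = len(hits)
    phat = sum(hits) / n
    z = (phat - alpha) / math.sqrt(alpha * (1.0 - alpha) / n)
    return phat, z

# toy example: nominal 75% intervals that only cover half the outcomes
y = [0.5, 1.5, -0.2, 0.9]
lo = [0.0, 0.0, 0.0, 0.0]
up = [1.0, 1.0, 1.0, 1.0]
phat, z = coverage_test(y, lo, up, alpha=0.75)
```

A large negative z indicates the intervals are too narrow (undercoverage); a large positive z indicates they are too wide.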

Test of correct conditional coverage

Test for correct conditional coverage: For all Xt

E [1yt+1 |Xt ] = α

Test this implication by estimating, say, a probit model

Pr(1yt+1 = 1) = Φ(β0 + β1xt )

Under the null of a correct interval forecast model, H0 : β1 = 0

Timmermann (UCSD) Forecast Evaluation Winter, 2017 37 / 50

Evaluating Density forecasts: Probability integral transform

The probability integral transform (PIT), Ut+1, of a continuous conditional
distribution with CDF PY (y |xt ) is the model-specified conditional CDF
evaluated at the realized outcome:

Ut+1 = PY (yt+1 |x1, x2, …, xt ) = ∫_{−∞}^{yt+1} pY (y |x1, x2, …, xt ) dy

yt+1 : realized value of the outcome (observed value of y)
x1, x2, …, xt : predictors (data)
pY (y |x1, x2, …, xt ) : conditional density of y
PIT value: how likely is it to observe a value equal to or smaller than the
actual outcome (yt+1), given the density forecast py ?

we don’t want this to be very small or very large most of the time

Timmermann (UCSD) Forecast Evaluation Winter, 2017 38 / 50

Probability integral transform: Example

Suppose that our prediction model is a GARCH(1,1):

yt+1 = β0 + β1xt + εt+1,  εt+1 ∼ N(0, σ²t+1|t )

σ²t+1|t = α0 + α1ε²t + γ1σ²t|t−1

(γ1 is the volatility persistence parameter, kept distinct from the slope β1
in the mean equation)

Then we have

pY (y |x1, x2, …, xt ) = N(β0 + β1xt , σ²t+1|t )

and so

ut+1 = ∫_{−∞}^{yt+1} pY (y |x1, x2, …, xt ) dy = Φ( [yt+1 − (β0 + β1xt )] / σt+1|t )

This is the standard cumulative normal function, Φ, evaluated at
zt+1 = [yt+1 − (β0 + β1xt )]/σt+1|t
Timmermann (UCSD) Forecast Evaluation Winter, 2017 39 / 50
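For a conditionally Gaussian density forecast, the PIT reduces to a normal CDF evaluation, which needs no external libraries since Φ can be built from `math.erf`. A minimal sketch (function names are illustrative):

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def pit(y_next, mean, sigma):
    """PIT value for a conditional N(mean, sigma^2) density forecast:
    u = Phi((y - mean)/sigma)."""
    return norm_cdf((y_next - mean) / sigma)

# outcome exactly at the conditional mean gives a PIT value of 0.5
u = pit(1.0, mean=1.0, sigma=2.0)
```

Outcomes below the conditional mean give PIT values below 0.5, and vice versa; a well-specified model should spread these values evenly over [0, 1].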

Understanding the PIT

By construction, PIT values lie between zero and one

If the density forecasting model pY (y |x1, x2, …, xt ) is correctly specified,
Ut+1 will be uniformly distributed on [0, 1]

The sequence of PIT values û1, û2, …, ûT should be independently and
identically distributed Uniform(0, 1)—they should not be serially correlated

If we apply the inverse of the Gaussian CDF, Φ−1, to the ût values to get
ẑt = Φ−1(ût ), we get a sequence of i.i.d. N(0, 1) variables
We can therefore test that the density forecasting model is correctly specified
through simple regression tests:

ẑt = µ + εt ,  H0 : µ = 0
ẑt = µ + ρẑt−1 + εt ,  H0 : µ = ρ = 0
ẑ²t = µ + ρẑ²t−1 + εt ,  H0 : µ = 1, ρ = 0

Timmermann (UCSD) Forecast Evaluation Winter, 2017 40 / 50
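As a rough stand-in for the regression tests above, one can check the first two moments of the transformed PIT values directly. This sketch uses simple i.i.d. t-statistics rather than the full autoregressions, and the function name and toy z-values are illustrative:

```python
import math
import statistics

def zt_moment_tests(z):
    """Simple moment checks on z_t = Phi^{-1}(u_t): t-statistics for
    H0: E[z] = 0 and H0: E[z^2] = 1, assuming i.i.d. observations."""
    n = len(z)
    zbar = statistics.fmean(z)
    t_mean = zbar / (statistics.stdev(z) / math.sqrt(n))
    z2 = [v * v for v in z]
    t_var = (statistics.fmean(z2) - 1.0) / (statistics.stdev(z2) / math.sqrt(n))
    return t_mean, t_var

# toy z-values centered on zero but with somewhat low dispersion
t_mean, t_var = zt_moment_tests([-1.0, 0.5, 1.0, -0.5])
```

A significantly negative t_var points to PIT values that cluster near 0.5, i.e. density forecasts that are too wide.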

Working in the PIT: graphical test

Timmermann (UCSD) Forecast Evaluation Winter, 2017 41 / 50

Tests of Equal Predictive Accuracy: Diebold-Mariano Test

Test if two forecasts generate the same average (MSE) loss:

E [e²1t+1 ] = E [e²2t+1 ]

Diebold and Mariano (1995) propose a simple and elegant method that
accounts for sampling uncertainty in average losses
Setup: two forecasts with associated loss MSE1, MSE2

MSE1 ∼ N(µ1,Ω11),MSE2 ∼ N(µ2,Ω22),
Cov(MSE1,MSE2) = Ω12

The loss differential in period t + h (dt+h) is

dt+h = e²1t+h|t − e²2t+h|t

Timmermann (UCSD) Forecast Evaluation Winter, 2017 42 / 50
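The test boils down to a t-statistic on the mean loss differential. A minimal sketch for the one-step-ahead case, using a simple i.i.d. standard error (the published test uses a HAC long-run variance when h > 1 and the differentials are serially correlated); names and toy errors are illustrative:

```python
import math
import statistics

def diebold_mariano(e1, e2):
    """t-statistic on the mean of d_t = e1_t^2 - e2_t^2.
    Negative values favor forecast 1, positive values favor forecast 2."""
    d = [a * a - b * b for a, b in zip(e1, e2)]
    n = len(d)
    dbar = statistics.fmean(d)
    se = statistics.stdev(d) / math.sqrt(n)
    return dbar / se

# toy errors: forecast 2's errors are uniformly larger, so t should be negative
t = diebold_mariano([0.1, -0.2, 0.15, -0.1], [0.5, -0.6, 0.55, -0.5])
```

Regressing dt+h on a constant and using the t-statistic on the intercept, as in the slide, gives the same number.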

Tests of Equal Loss – Diebold Mariano Test

Suppose we observe samples of forecasts, forecast errors and forecast
differentials dt+h
These form the basis of a test of the null of equal predictive accuracy:

H0 : E [dt+h ] = 0

To test H0, regress the time series dt+h on a constant and conduct a t−test
on the constant, µ:

dt+h = µ+ εt+h

µ > 0 suggests that MSE1 > MSE2 and so forecast 2 produces the smallest
squared forecast errors and is best

µ < 0 suggests that MSE1 < MSE2 and so forecast 1 produces the smallest
squared forecast errors and is best

Timmermann (UCSD) Forecast Evaluation Winter, 2017 43 / 50

Comparing forecast methods - Giacomini and White (2006)

Consider two sets of forecasts, f̂1t+h|t (ωin1) and f̂2t+h|t (ωin2), each
computed using a rolling estimation window of length ωi

Each forecast is a function of the data and parameter estimates

Different choices for the window length, ωin, used in the rolling regressions
alter what is being tested

If we change the window length, ωi , we also change the forecasts being
compared {f̂1t+h|t , f̂2t+h|t}

Big models are more affected by estimation error

The idea is to compare forecasting methods - not just forecasting models

Timmermann (UCSD) Forecast Evaluation Winter, 2017 44 / 50

Conditional Tests - Giacomini and White (2006)

We can also test if one method performs better than another one in different
environments - regress the forecast differential on current information xt ∈ It :

dt+h = µ + βxt + εt+h

Analysts may be better at forecasting stock returns than econometric models
in recessions or, say, in environments with low interest rates

Switch between forecasts?

Choose the forecast with the smallest conditional expected squared forecast
error

Timmermann (UCSD) Forecast Evaluation Winter, 2017 45 / 50

Tests of Forecast Encompassing

Encompassing: one model contains all the information (relevant for
forecasting) of another model plus some additional information

f1 encompasses f2 provided that for all values of ω

MSE (f1) ≤ MSE (ωf1 + (1−ω)f2)

Equality only holds for ω = 1

One forecast (f1) encompasses another (f2) when the information in the
second forecast does not help improve on the forecasting performance of the
first forecast

Timmermann (UCSD) Forecast Evaluation Winter, 2017 46 / 50

Encompassing Tests Under MSE loss

Use OLS to regress the outcome on the two forecasts to test for forecast
encompassing:

yt+1 = β1 f̂1t+1|t + β2 f̂2t+1|t + εt+1

Forecast 1 (f̂1t+1|t ) encompasses (dominates) forecast 2 (f̂2t+1|t ) if
β1 = 1 and β2 = 0. If this holds, only use forecast 1

Equivalently, if β2 = 0 in the following regression, forecast 1 encompasses
forecast 2:

ê1t+1 = β2 f̂2t+1|t + ε1t+1

Forecast 2 doesn’t explain model 1’s forecast error

If β1 = 0 in the following regression, forecast 2 encompasses forecast 1:

ê2t+1 = β1 f̂1t+1|t + ε2t+1

Timmermann (UCSD) Forecast Evaluation Winter, 2017 47 / 50

Forecast encompassing vs tests of equal predictive accuracy

Suppose we cannot reject a test of equal predictive accuracy for two
forecasts:

E [e²1t+h|t ] = E [e²2t+h|t ]

Then it is optimal to use equal weights in a combined forecast, rather than
use only one or the other forecast

Forecast encompassing tests if it is optimal to assign a weight of unity to
one forecast and zero to the other

completely ignore one forecast

Tests of equal predictive accuracy and tests of forecast encompassing
examine very different hypotheses about how useful the forecasts are

Timmermann (UCSD) Forecast Evaluation Winter, 2017 48 / 50

Comparing Two Forecast Methods

Three possible outcomes of model comparison:

One forecast method completely dominates another method

Encompassing; choose the dominant forecasting method

One forecast is best, but does not contain all useful information from the
second model

Combine forecasts using non-equal weights

The forecasts have the same expected loss (MSE)

Combine forecasts using equal weights

Timmermann (UCSD) Forecast Evaluation Winter, 2017 49 / 50

Forecast evaluation: Conclusions

Forecast evaluation is very important - a health check of forecast models

A variety of diagnostic tests is available for

Optimality tests: point, interval and density forecasts
Sign/direction forecasts
Forecast comparisons

Timmermann (UCSD) Forecast Evaluation Winter, 2017 50 / 50

Lecture 10: Model Instability
UCSD, Winter 2017

Allan Timmermann1

1UC San Diego

Timmermann (UCSD) Breaks Winter, 2017 1 / 42

1 Forecasting under Model Instability: General Issues

2 How costly is it to Ignore Model Instability?

3 Limitations of Tests for Model Instability
Frequent, small versus rare and large changes
Using pre-break data: Trade-offs

4 Ad-hoc Methods for Dealing with Breaks
Intelligent ways of using a rolling window

5 Modeling the Instability Process
Time-varying parameters
Regime switching
Change point models

6 Real-time monitoring of forecasting performance

7 Conclusions and Practical Lessons

Timmermann (UCSD) Breaks Winter, 2017 2 / 42

Model instability is everywhere...

Model instability affects a majority of macroeconomic and financial variables
(Stock and Watson, 1996)

Great Moderation: sharp drop in the volatility of macroeconomic variables
around 1984

Zero lower bound for US interest rates

Stock and Watson (2003) conclude: “... forecasts based on individual
indicators are unstable. Finding an indicator that predicts well in one period
is no guarantee that it will predict well in later periods. It appears that
instability of predictive relations based on asset prices (like many other
candidate leading indicators) is the norm.”

Strong evidence of instability for prediction models fitted to stock market
returns: Ang and Bekaert (2007), Paye and Timmermann (2006) and Rapach
and Strauss (2006)

Timmermann (UCSD) Breaks Winter, 2017 2 / 42

McLean and Pontiff (2012)

“We investigate the out-of-sample and post-publication return predictability
of 82 characteristics that have been shown to predict cross-sectional returns
by academic publications in peer-reviewed journals... We estimate
post-publication decay to be about 35%, and we can reject the hypothesis
that there is no decay, and we can reject the hypothesis that the
cross-sectional predictive ability disappears entirely. This finding is better
explained by a discrete change in predictive ability, rather than a declining
time-trend in predictive ability.” (p. 24)

Timmermann (UCSD) Breaks Winter, 2017 3 / 42

Sources of model instability

Model parameters may change over time due to

shifting market conditions (QE)
changing regulations and government policies (Dodd-Frank)
new technologies (fracking; iPhone)
mergers and acquisitions; spinoffs
shifts in behavior (market saturation; self-destruction of predictable
patterns under market efficiency)

Timmermann (UCSD) Breaks Winter, 2017 4 / 42

Strategies for dealing with model instability

Ignore it altogether

Test for large, discrete breaks and use only data after the most recent break
to estimate the forecasting model

Ad-hoc approaches that discount past data

rolling window estimation
exponential discounting of past observations (RiskMetrics)
adaptive approaches

Model the break process itself

If multiple breaks occurred in the past, we may want to model the possibility
of future breaks, particularly for long forecast horizons

Forecast combination

Timmermann (UCSD) Breaks Winter, 2017 5 / 42

Ignoring model instability

Timmermann (UCSD) Breaks Winter, 2017 6 / 42

How costly is it to ignore breaks?

“All models are wrong but some are useful” (George Box)

Full-sample estimated parameters of forecasting models are an average of
time-varying coefficients

these may or may not be useful for forecasting
fail to detect valuable predictors
wrongly include irrelevant predictors

Model instability can show up as a disparity between a forecasting model’s
in-sample and out-of-sample performance, or as differences in the model’s
forecasting performance across different historical subsamples

Timmermann (UCSD) Breaks Winter, 2017 7 / 42

How do breaks affect forecasting performance?

Forecasting model:

yt+1 = β′txt + εt+1,  t = 1, ...,T

yt+1 : outcome we are interested in predicting
βt : (time-varying) parameters of the data generating process
β̂t : parameter estimates
xt : predictors known at time t
T : present time (time where we generate our forecast of yT+1)
ŷT+1 = β̂′T xT : forecast

Timmermann (UCSD) Breaks Winter, 2017 8 / 42

Failing to find genuine predictability

[Figure: time path of βt in yt+1 = βtxt + εt+1, showing pre-break and
post-break parameter values around the break date Tbreak and the
full-sample estimate β̂T ]

Timmermann (UCSD) Breaks Winter, 2017 9 / 42

Wrongly identifying predictability

[Figure: time path of βt in yt+1 = βtxt + εt+1, showing pre-break and
post-break parameter values around the break date Tbreak and the
full-sample estimate β̂T ]

Timmermann (UCSD) Breaks Winter, 2017 10 / 42

Breaks to forecast models

What happens when the parameters of the forecasting model change over
time, so that the full-sample estimates provide poor guidance for the
forecasting model at the end of the sample (T ), where the forecast gets
computed?

βt : regression parameters of the forecasting model

‘t’ subscript indicates that the parameters vary over time

Assuming that parameter changes are not random, we would prefer to
construct forecasts using the parameters, βT , at the point of the forecast (T )

Instead, the estimator based on full-sample information, β̂T , will typically
converge not to βT but to the average value of βt computed over the sample
t = 1, ...,T

Timmermann (UCSD) Breaks Winter, 2017 11 / 42

Take-aways: Careful with full-sample tests

A forecasting model that uses a good predictor might generate poor
out-of-sample forecasts because the parameter estimates are an average of
time-varying coefficients

Valuable predictor variables may appear to be uninformative because the
full-sample parameter estimate β̂T is close to zero

Conversely, full-sample tests might indicate that certain predictors are useful
even though, at the time of the forecast, βT is close to zero

Under model instability, all bets could be off!

Timmermann (UCSD) Breaks Winter, 2017 12 / 42

Testing for parameter instability

There are many different ways to model and test for parameter instability

Tests for a single discrete break
Tests for multiple discrete breaks
Tests for random walk breaks (break every period)

Timmermann (UCSD) Breaks Winter, 2017 13 / 42

Testing for breaks: Known break date

If the date of the possible break in the coefficients, tB , is known, the null of
no break can be tested using a dummy-variable interaction regression

Let Dt (tB ) be a binary variable that equals zero before the break date, tB ,
and one afterwards:

Dt (tB ) = { 0 if t < tB
          { 1 if t ≥ tB

We can test for a break in the intercept:

yt+1 = β0 + β1Dt (tB ) + β2xt + εt+1,  H0 : β1 = 0 (no break)

Or we can test for a break in the slope:

yt+1 = β0 + β1xt + β2Dt (tB )xt + εt+1,  H0 : β2 = 0 (no break)

The t-test for β1 or β2 is called a Chow test

Timmermann (UCSD) Breaks Winter, 2017 14 / 42

Models with a single break I

What happens if the time and size of the break are unknown?

We can try to estimate the date and magnitude of the break

For each date in the sample we can compute the sum of squared residuals
(SSR) associated with that choice of break date, then choose the value t̂B
that minimizes the SSR:

SSR(tB ) = ∑_{t=1}^{T−1} (yt+1 − βxt − d xt 1(t ≤ tB ))²

We “trim” the data by searching only for breaks between lower and upper
limits that exclude 10-15% of the data at both ends, so that some minimal
amount of data is available for parameter estimation and evaluation

In a sample with T = 100 observations, test for breaks at t = 16, ..., 85

Timmermann (UCSD) Breaks Winter, 2017 15 / 42

Models with a single break II

Potentially we can compute a better forecast based on the post-break
parameter values. For the linear regression

yt+1 = { (β + d)xt + εt+1  t ≤ t̂B
       { βxt + εt+1        t > t̂B

we could use the forecast fT+1 = β̂xT where β̂ is estimated from a regression
that replaces the unknown break date with the estimated break date t̂B
In practice, estimates of the break date are often inaccurate

Should we always exclude pre-break data points?

Not if the break is small or the post-break data sample is short
Bias-variance trade-off

Timmermann (UCSD) Breaks Winter, 2017 16 / 42
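The grid search over candidate break dates that minimizes the SSR can be sketched in a few lines of Python. This is a deliberately simplified version: a single slope, no intercept, the slope estimated separately on each subsample, and 15% trimming; the function name and toy data are illustrative:

```python
def estimate_break_date(y, x, trim=0.15):
    """Estimate a single break date in the slope of y_t = beta * x_t by
    minimizing the combined SSR of the two subsample regressions,
    searching only over the trimmed middle of the sample."""
    n = len(y)
    lo, hi = int(trim * n), int((1 - trim) * n)

    def ssr(ys, xs):
        # OLS slope through the origin and its sum of squared residuals
        b = sum(u * v for u, v in zip(ys, xs)) / sum(v * v for v in xs)
        return sum((u - b * v) ** 2 for u, v in zip(ys, xs))

    return min(range(lo, hi),
               key=lambda tb: ssr(y[:tb], x[:tb]) + ssr(y[tb:], x[tb:]))

# toy data: the slope doubles at t = 30 (no noise, so the date is exact)
x = [1.0 + 0.01 * t for t in range(60)]
y = [(1.0 if t < 30 else 2.0) * x[t] for t in range(60)]
tb = estimate_break_date(y, x)
```

With noisy data the minimizer is typically much less precise, which is exactly the point made on the next slide.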

Many small breaks versus few large ones

[Figure: time path of βt in yt+1 = βtxt + εt+1, showing pre-break and
post-break parameter values around the break date Tbreak and the
full-sample estimate β̂T ]

Timmermann (UCSD) Breaks Winter, 2017 17 / 42

[Figure: power of break tests (qLL, Nyblom, SupF, AP) as a function of the
break size δ, for four break processes: a single break at 50% of the sample;
two breaks at 40% and 60%; random breaks with probability 10%; and random
breaks with probability 1]

Timmermann (UCSD) Breaks Winter, 2017 18 / 42

Practical lessons from break point testing

Tests designed for one type of breaks (frequent, small ones) typically also
have the ability to detect other types of breaks (rare, large ones)

Rejections of a test for a particular break process do not imply that the break
process tested for is “correct”

Rather, it could be one of many processes

Imagine a medical test that can tell if the patient is sick or not but cannot
tell if the patient suffers from diabetes or coronary disease

Timmermann (UCSD) Breaks Winter, 2017 19 / 42

Estimating the break date

We can attempt to estimate the date and magnitude of any breaks

In practice, estimates of the break date are often inaccurate

Costs of mistakes in estimation of the break date:

Too late: If the estimated break date falls after the true date, the resulting
parameter estimates are inefficient
Too early: If the estimated break date occurs prior to the true date, the
estimates will use pre-break data and hence be biased

If you know the time of a possible break, use this information

Introduction of Euro
Change in legislation

Timmermann (UCSD) Breaks Winter, 2017 20 / 42

Should we only use post-break data?

The expected performance of the forecasting model can sometimes be
improved by including pre-break observations to estimate the parameters of
the forecasting model

Adding pre-break observations introduces a bias in the forecast but can also
reduce the variance of the estimator
If the size of the break is small (small bias) and the break occurs late in the
sample so the additional observations reduce the variance of the estimates by
a large amount, the performance of the forecasting model can be improved by
incorporating pre-break data points

Timmermann (UCSD) Breaks Winter, 2017 21 / 42

Consequences of a small break

[Figure: time path of βt in yt+1 = βtxt + εt+1, with a small break at
Tbreak, so the full-sample estimate β̂T lies close to the post-break value]

Timmermann (UCSD) Breaks Winter, 2017 22 / 42

Including pre-break data (Pesaran and Timmermann, 2007)

The optimal pre-break window that minimizes the mean squared (forecast)
error (MSE) is longer, the

smaller the R2 of the prediction model (noisy returns data)
smaller the size of the break (small bias)
shorter the post-break window (post-break-only estimates are very noisy)

Timmermann (UCSD) Breaks Winter, 2017 23 / 42

Stopping rule for determining estimation window

First, estimate the time at which the most recent break occurred, T̂b
If no break is detected, use all the data
If a break is detected, estimate the MSE using only data after the break date,
t = T̂b + 1, …,T

Next, compute the MSE by including an additional observation, t = T̂b , …,T

If the new estimate reduces the MSE, continue by adding an additional data
point (T̂b − 1) to the sample and again compute the MSE
Repeat until the data suggest that including additional pre-break data no
longer reduces the MSE

Timmermann (UCSD) Breaks Winter, 2017 24 / 42

Downweighting past observations

In situations where the form of the break process is unknown, we might use
adaptive methods that do not depend directly on the nature of the breaks

Simple weighted least squares scheme puts greater weight on recent
observations than on past observations by choosing parameters

β̂T = [ ∑_{t=0}^{T−1} ωtxtx′t ]⁻¹ [ ∑_{t=0}^{T−1} ωtxtyt ]

The forecast is β̂′T xT

Expanding window regressions set ωt = 1 for all t

Rolling regressions set ωt = I{T − ω̄ ≤ t ≤ T − 1} : use the last ω̄
observations

Discounted least squares sets ωt = λ^{T−t} for λ ∈ (0, 1)

Timmermann (UCSD) Breaks Winter, 2017 25 / 42

Careful with rolling regressions!

Rolling regressions employ an intuitive trade-off

Short estimation windows reduce the bias in the estimates due to the use of
stale data that come from a different “regime”
This bias reduction is achieved at the cost of a decreased precision in the
parameter estimates as less data get used

We hope that the bias reduction more than makes up for the increased
parameter estimation error

However, there does not exist a data generating process for which a rolling
window is optimal, so how do we choose the length of the estimation window?

Timmermann (UCSD) Breaks Winter, 2017 26 / 42

Cross-validation and choice of estimation window

Treat rolling window as a choice variable

If the last P observations are used for cross-validation, we can choose the
length of the rolling estimation window, ω, to minimize the out-of-sample
MSE criterion

MSE (ω) = P⁻¹ ∑_{t=T−P+1}^{T} ( yt − x′t−1 β̂t−ω+1:t )²

β̂t−ω+1:t : OLS estimate of β that uses observations [t − ω + 1 : t]

This method requires a sufficiently long evaluation window, P, to yield
precise MSE estimates

not a good idea if the candidate break date, Tb , is close to T

Timmermann (UCSD) Breaks Winter, 2017 27 / 42
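The cross-validation criterion can be sketched as a loop over candidate window lengths. This simplified version regresses yt on contemporaneous xt with no intercept (rather than on lagged predictors); the function name and toy data are illustrative:

```python
def cv_window_mse(y, x, windows, P):
    """Pseudo out-of-sample MSE for each candidate rolling-window
    length, evaluated over the last P observations."""
    T = len(y)

    def slope(xs, ys):
        # OLS slope through the origin on the estimation window
        return sum(a * b for a, b in zip(xs, ys)) / sum(a * a for a in xs)

    out = {}
    for w in windows:
        errs = []
        for t in range(T - P, T):
            b = slope(x[t - w:t], y[t - w:t])
            errs.append((y[t] - b * x[t]) ** 2)
        out[w] = sum(errs) / P
    return out

# toy data with a level shift at t = 30: short windows should win
x = [1.0] * 40
y = [0.0] * 30 + [2.0] * 10
mse = cv_window_mse(y, x, windows=[2, 20], P=5)
```

Here the short window's estimates use only post-break data and forecast perfectly, while the long window averages across the break and incurs a large bias.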

Using model averaging to deal with breaks

A more robust approach uses model averaging to deal with the underlying
uncertainty surrounding the selection of the estimation window

Combine forecasts associated with estimation windows ω ∈ [ω0,ω1 ]:

ŷT+1|T = [ ∑_{ω=ω0}^{ω1} (x′T β̂T−ω+1:T ) MSE (ω)⁻¹ ] / [ ∑_{ω=ω0}^{ω1} MSE (ω)⁻¹ ]

Example: daily data with ω0 = 200 days, ω1 = 500 days

If the break is very large, models that start the estimation sample after the
break will receive greater weight (they have small MSE) than models that
include pre-break data and thus get affected by a large (squared) bias term

Timmermann (UCSD) Breaks Winter, 2017 28 / 42
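The combination formula above is just an inverse-MSE weighted average of the window-specific forecasts. A minimal sketch (function name and toy numbers are illustrative):

```python
def inverse_mse_combination(forecasts, mses):
    """Combine forecasts with weights proportional to the inverse of
    each forecast's out-of-sample MSE."""
    inv = [1.0 / m for m in mses]
    s = sum(inv)
    return sum(f * iv / s for f, iv in zip(forecasts, inv))

# toy numbers: the first window's forecast has one third the MSE,
# so it gets three times the weight
yhat = inverse_mse_combination([1.0, 3.0], [0.5, 1.5])
```

Equal MSEs reproduce the simple average; a very large break drives the MSE of the long-window forecasts up, so their weight shrinks toward zero, as the slide describes.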

Modeling the instability process

Several methods exist to capture the process giving rise to time variation in
the model parameters

Time-varying parameter model (random walk or mean-reverting)
Markov switching
Change point process

These are parametric approaches that assume a particular break process

Many small changes versus few large “breaks”

Timmermann (UCSD) Breaks Winter, 2017 29 / 42

Time-varying parameter models

Small “breaks” every period

Time-varying parameter (TVP) model:

yt+1 = x′tβt+1 + εyt+1

βt+1 − β̄ = κ(βt − β̄) + εβt+1

This model is in state space form with yt being the observable process and
βt being the latent “state”

Use Kalman filter or MCMC methods

Volatility may also be changing over time—stochastic volatility

Allowing too much time variation in parameters (large σεβ) leads to poor
forecasts (signal-to-noise issue)

Timmermann (UCSD) Breaks Winter, 2017 30 / 42

Regime switching models: History repeats

Markov switching processes take the form

yt+1 = µst+1 + εt+1, εt+1 ∼ N(0,Σst+1 )
Pr(st+1 = j |st = i) = pij , i , j ∈ {1, …,K}

st+1 : underlying state, st+1 ∈ {1, 2, …,K} for K ≥ 2
Key assumption: the same K states repeat

Is this realistic?

Use forward-looking information in state transitions:

pij (zt ) = Φ(αij + βij zt )

e.g., zt = ∆Leading Indicatort

Timmermann (UCSD) Breaks Winter, 2017 31 / 42

Markov Switching models: been there, seen that

[Figure: time path of βt in yt+1 = βst+1xt + εt+1, switching repeatedly
among a fixed set of regime values]

Timmermann (UCSD) Breaks Winter, 2017 32 / 42

Change point models: History doesn’t repeat

Change point models allow the number of states to increase over time and do
not impose that the states are drawn repeatedly from the same set of values
Example: Assuming K breaks up to time T , for i = 0, …,K

yt+1 = µi + Σi εt+1, τi ≤ t ≤ τi+1

Assuming that the probability of remaining within a particular state is
constant, but state-specific, the transitions for this class of models are

P =
[ p11   p12   0     ...   0
  0     p22   p23   ...   0
  ...
  0     ...   0     pKK   pK,K+1
  0     0     ...   0     1      ]

pi,i+1 = 1 − pii

The process either remains in the current state or moves to a new state

Timmermann (UCSD) Breaks Winter, 2017 33 / 42

Change point models: a new era arises

[Figure: time path of βt in yt+1 = βst+1xt + εt+1, shifting to a new value
at each change point rather than revisiting old regimes]

Timmermann (UCSD) Breaks Winter, 2017 34 / 42

Monitoring stability of forecasting models

Timmermann (UCSD) Breaks Winter, 2017 35 / 42

Monitoring the predictive accuracy

To study the real-time evolution in the accuracy of a model’s forecasts, plot
the Cumulative Sum of Squared prediction Error Difference (CSSED) for
some benchmark against the competitor model up to time t :

CSSEDm,t = ∑_{τ=1}^{t} ( e²Benmk,τ − e²m,τ )

eBenmk ,τ = yτ − ŷτ,Benmk : forecast error (Benmk)
em,τ = yτ − ŷτ,m : forecast error (model m)
Positive and rising values of CSSED indicate that the point forecasts
generated by model m are more accurate than those produced by the
benchmark

Timmermann (UCSD) Breaks Winter, 2017 36 / 42
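The CSSED path is a simple running sum. A minimal sketch (function name and toy errors are illustrative):

```python
def cssed(e_bench, e_model):
    """Cumulative sum of squared-error differences between the benchmark
    and model m; rising values mean model m is beating the benchmark."""
    path, total = [], 0.0
    for eb, em in zip(e_bench, e_model):
        total += eb * eb - em * em
        path.append(total)
    return path

# toy errors: model m's errors are uniformly smaller, so the path rises
path = cssed([1.0, -1.0, 2.0], [0.5, 0.5, 1.0])
```

Plotting this path over time shows when a model's advantage over the benchmark builds up, flattens out, or reverses, which is the real-time monitoring idea behind the figure that follows.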

Comparison of break models

Pettenuzzo and Timmermann (2015)

Benchmark: Constant parameter linear model (LIN)

Competitors:

Time-varying parameter, stochastic volatility (TVP-SV)
Markov switching with K regimes (MSK )
Change point model with K regimes (CPK )

Timmermann (UCSD) Breaks Winter, 2017 37 / 42

Data and models (US inflation)

Pt : quarterly price index for the GDP deflator
πt = 400× ln (Pt/Pt−1) : annualized quarterly inflation rate
Prediction model: backward-looking Phillips curve

∆πt+1 = µ + β(L)ut + λ(L)∆πt + εt+1,  εt+1 ∼ N(0, σ²ε)

∆πt+1 = πt+1 − πt : quarter-on-quarter change in the annualized inflation
rate

ut : quarterly unemployment rate

Timmermann (UCSD) Breaks Winter, 2017 38 / 42

Cumulative sum of squared forecast error differentials
(quarterly inflation, 1970-2012)

Timmermann (UCSD) Breaks Winter, 2017 39 / 42

Challenges to forecasting with breaks

1 Tests for model instability can detect many different types of instability and
tend to be uninformative about the nature of the instability

2 The forecaster therefore often does not have a good idea of which specific
way to capture model instability

3 Future parameter values might change again over the forecast horizon if this
is long. Forecasting procedures require modeling both the probability and
magnitude of future breaks

1 Models with rare changes to the parameters have little or nothing to say about
the chance of a future break in the parameters

2 Think about forecasting the growth in the Chinese economy over the next 25
years. Many things could change—future “breaks” could occur

Timmermann (UCSD) Breaks Winter, 2017 40 / 42

Conclusions: Practical Lessons

Model instability poses fundamental challenges to forecasting: All bets could
be off if ignored

Important to monitor model stability

Use model forecasts with more caution if models appear to be breaking down

Forecast combinations offer a promising tool to handle instability

Combine forecasts from models using short, medium and long estimation
windows
Combine different types of models, allowing combination weights to evolve
over time as new and better models replace old (stale) ones

Timmermann (UCSD) Breaks Winter, 2017 41 / 42

Conclusion: Practical lessons (cont.)

Models that allow parameters to change are adaptive – they catch up with a
shift in the data generating process

Adaptive approaches can work well but have obvious limitations

Models that attempt to predict instability or breaks require forward-looking
information

use information in option prices (VIX) or in financial fragility indicators

Diffi cult to predict exact timing and magnitude of breaks, but the risk of a
break may be time-varying

Model instability is not just a nuisance but also poses an opportunity for
improved forecasting performance

Timmermann (UCSD) Breaks Winter, 2017 42 / 42

Lecture 10: Data mining – Pitfalls in Forecasting
UCSD, Winter 2017

Allan Timmermann1

1UC San Diego

Timmermann (UCSD) Data mining Winter, 2017 1 / 23

1 Data mining
Opportunities and Challenges
Skill or Luck
Bonferroni Bound

2 Comparing Many Forecasts: Reality Check

3 Hal White’s Reality Check
Data snooping and technical trading rules
Timmermann (UCSD) Data mining Winter, 2017 2 / 23

Data mining (Wikipedia)

“Data mining is the process of sorting through large amounts of data and
picking out relevant information. It is usually used by business intelligence
organizations, and financial analysts, but is increasingly being used in the sciences
to extract information from the enormous data sets generated by modern
experimental and observational methods. It has been described as “the nontrivial
extraction of implicit, previously unknown, and potentially useful information from
data” and “the science of extracting useful information from large data sets or
databases.”
The term data mining is often used to apply to the two separate processes of
knowledge discovery and prediction. Knowledge discovery provides explicit
information that has a readable form and can be understood by a user (e.g.,
association rule mining). Forecasting, or predictive modeling provides
predictions of future events and may be transparent and readable in some
approaches (e.g., rule-based systems) and opaque in others such as neural
networks. Moreover, some data-mining systems such as neural networks
are inherently geared towards prediction and pattern recognition, rather
than knowledge discovery.”

Timmermann (UCSD) Data mining Winter, 2017 2 / 23

Model selection and data mining

In the context of economic/financial forecasting, data mining is often used in
a negative sense as the practice of using a data set more than once for
purposes of selecting, estimating and testing a model

If you get to evaluate a model on the same data used to develop/estimate the
model, chances are you are overfitting the data

This practice is necessitated because we are limited to short (time-series)
samples which cannot be easily replicated/generated

We only have one history of quarterly US GDP, fund-manager performance etc.
We cannot use experiments to generate new data on such a large scale

If we have panel data with large cross-sections and no fixed effects, then we
can keep a large separate evaluation sample for model validation

Timmermann (UCSD) Data mining Winter, 2017 3 / 23

Data mining as a source of new information

Statistical analysis guided strictly by theory may not discover unknown
relationships that have not yet been stipulated by theory

Before the emergence of germ theory, medical doctors didn’t understand why
some patients got infected, while others didn’t. It was the use of patient data
and a search for correlations that helped doctors find out that those who
washed their hands between patient visits infected fewer of them

Big data present a similar opportunity for finding new empirical relationships
which can then be interpreted by theorists

If we have a really big data set that is deep (multiple entries for each
variable) and wide (many variables), we can use machine learning to detect
novel and interesting patterns on one subset of data, then test it on another
(independent) or several other cross-validation samples

Man versus machine: testing if a theoretical model performs better than a
data mined model, using independent (novel) data

Can we automate the development of theories (hypotheses)?


Data mining as an overfitting problem

Data mining can cause problems for inference with small samples

“[David Leinweber, managing director of First Quadrant Corporation in
Pasadena, California] sifted through a United Nations CD-ROM and
discovered that historically, the single best prediction for the Standard &
Poor’s 500 stock index was butter production in Bangladesh.” (Coy,
1997, Business Week)

“Is it reasonable to use the standard t-statistic as a valid measure of
significance when the test is conducted on the same data used by many
earlier studies whose results influenced the choice of theory to be tested?”
(Merton, 1987)

“. . . given enough computer time, we are sure that we can find a mechanical
trading rule which “works” on a table of random numbers — provided that we
are allowed to test the rule on the same table of numbers which we used to
discover the rule.” (Jensen and Benington, 1970)


Skill or luck?

A student comes to the professor’s office and reports a stock market return
prediction model with a t-statistic of 3. Should the professor be impressed?

If the student only fitted a single model: Yes
What if the student experimented with hundreds of models?

A similar issue arises more broadly in the assessment of performance

Forecasting models in financial markets
Star mutual funds

How many mutual fund “stars” should we expect to find by random chance?

What if the answer is four and we only see one star in the actual data?

Lucky penny

Newsletter scam



Dealing with over-fitting

Report the full set (and number) of forecasting models that were considered

Good practice in all circumstances
The harder you had to work to find a ‘good’ model, the more skeptical you
should be that the model will produce good future forecasts

Problem: Difficult to keep track of all models

What about other forecasters whose work influenced your study? Do you know
how many models they considered? Collective data mining
Even if you keep track of all the models you studied, how do you account for
the correlation between the forecasts that they generate?


Dealing with over-fitting (cont.)

Use data from alternative sources

Seeking independent evidence
This is a way to ‘tie your hands’ by not initially looking at all possible data

This strategy works if you have access to similar and genuinely independent
data

Often such data are difficult to obtain
Example: Use European or Asian data to corroborate results found in the US
Problem: What if the data are correlated? US, European and Asian stock
market returns are highly correlated and so are not independent data sources


Dealing with over-fitting (cont.)

Reserve the last portion of your data for out-of-sample forecast evaluation

Problem: what if the world has changed?
Maybe the forecasting model truly performed well in a particular historical
sample (the “in-sample” period), but broke down in the subsequent sample

Example: Performance of small stocks

Small cap stocks have not systematically outperformed large cap stocks in the
35 years since the size effect was publicized in the early 1980s


Bonferroni bound

Suppose we are interested in testing if the best model among k = 1, …, m
models produces better forecasts than some benchmark forecasting model

Let pk be the p-value associated with the null hypothesis that model k does
not produce more accurate forecasts than the benchmark

This could be based on the t-statistic from a Diebold-Mariano test
The Bonferroni Bound says that the p-value for the null that none of the m
models is superior to the benchmark satisfies an upper bound

p ≤ min(m × min(p1, …, pm), 1)

The smallest of the p-values (which produces the strongest evidence against
the null that no model beats the benchmark) gets multiplied by the number of
tested models, m
Mindless data mining weakens the evidence!
Bonferroni bound holds for all possible correlations between test statistics
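As a quick illustration, the bound is trivial to compute; a minimal Python sketch (the function name `bonferroni_bound` is my own):

```python
def bonferroni_bound(p_values):
    """Bonferroni upper bound on the p-value of the null that none of
    the m tested models beats the benchmark: the smallest individual
    p-value times the number of models, capped at 1. Valid for any
    correlation between the underlying test statistics."""
    m = len(p_values)
    return min(m * min(p_values), 1.0)

# Ten models, smallest p-value 0.02: bound = min(10 * 0.02, 1) = 0.20
print(bonferroni_bound([0.02, 0.30, 0.45, 0.11, 0.52,
                        0.63, 0.08, 0.71, 0.25, 0.90]))
```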


Bonferroni bound: example

Bonferroni bound: Probability of observing a p-value as small as pmin among
m forecasting models is less than or equal to m× pmin
Example: m = 10, pmin = 0.02

Bonferroni bound = min(10 × 0.02, 1) = 0.20

In a sample with 10 p-values, there is at most a 20% chance that the
smallest p-value is less than 0.02

Test is conservative (doesn’t reject as often as it should): Suppose you have
20 models whose test statistics have a correlation of 1. Effectively you only
have one (independent) forecast, so the true p-value is just 0.01. However,
even if p1 = p2 = … = p20 = 0.01, the Bonferroni bound gives a value of

p ≤ 20 × 0.01 = 0.20
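The conservativeness is easy to see in a small Monte Carlo. A Python sketch (function name mine; it assumes the m p-values are independent Uniform(0,1) under the null): with m = 10 independent tests, the exact probability that the smallest p-value falls below 0.02 is 1 − 0.98^10 ≈ 0.183, somewhat below the Bonferroni bound of 0.20.

```python
import random

random.seed(0)

def prob_min_pvalue_below(threshold, m, n_sims=100_000):
    """Monte Carlo estimate of P(min of m independent Uniform(0,1)
    p-values < threshold), i.e. the chance of a 'discovery' by
    pure luck when m independent null models are tested."""
    hits = sum(
        min(random.random() for _ in range(m)) < threshold
        for _ in range(n_sims)
    )
    return hits / n_sims

est = prob_min_pvalue_below(0.02, m=10)
exact = 1 - 0.98 ** 10            # ≈ 0.183
bound = min(10 * 0.02, 1.0)       # Bonferroni bound: 0.20
print(est, exact, bound)          # estimate near 0.183, below the bound
```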


Reverse engineering the Bonferroni bound

Suppose a student reports a p-value of 0.001 (one in a thousand)

For this to correspond to a Bonferroni p-value of 0.05, at least 50 models
must have been considered, since 50 × 0.001 = 0.05
Is this likely?

What is a low p-value in a world with data snooping?

The conventional criterion that a variable is significant if its p-value falls
below 0.05 no longer applies

Back to the example with a t-statistic of 3 reported by a student. How
many models would the student have had to have looked at?

prob(t ≥ 3) ≈ 0.00135
0.05/0.00135 ≈ 37

Answer: about 37 models
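This back-of-the-envelope calculation can be scripted; a small Python sketch (the helper names are mine, and the one-sided normal tail stands in for the t-distribution in large samples):

```python
import math

def normal_tail(t):
    """One-sided tail probability P(Z >= t) for a standard normal."""
    return 0.5 * math.erfc(t / math.sqrt(2))

def implied_model_count(p_observed, alpha=0.05):
    """Number of models m at which a single reported p-value of
    p_observed stops being significant at level alpha once the
    Bonferroni correction m * p_observed is applied: m ≈ alpha / p."""
    return round(alpha / p_observed)

p_t3 = normal_tail(3.0)                  # ≈ 0.00135
print(implied_model_count(p_t3))         # roughly 37 models
print(implied_model_count(0.001))        # 50 models for p = 0.001
```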


Testing for Superior Predictive Ability

How confident can we be that the best forecast is genuinely better than some
benchmark, given that the best forecast is selected from a potentially large
set of forecasts?

Skill or luck? In a particular sample, a forecast model may produce a small
average loss even though in expectation (i.e., across all samples we could
have seen) the model would not have been so good

A search across multiple forecast models may result in the discovery of a
genuinely good model (skill), but it may also uncover a bad model that just
happens to perform well in a given sample (luck)

Tests used in model comparisons typically ignore any search that preceded
the selection of the prediction models


White (2000) Reality Check

Forecasts generated recursively using an expanding estimation window

m models used to compute m out-of-sample (average) losses

f0,t+1|t : forecast from the benchmark (model 0)
fk,t+1|t : forecast from alternative model k, k = 1, …, m

dk,t+1 = (yt+1 − f0,t+1|t)² − (yt+1 − fk,t+1|t)² : difference in squared
forecast errors of the benchmark (model 0) relative to model k

d̄k : sample average of dk,t+1
d̄k > 0 suggests model k outperformed the benchmark (model 0)

d̄ = (d̄1, …, d̄m)′ : m × 1 vector of sample averages of MSE differences
measured relative to the benchmark

d̄* = (d̄1(β*1), …, d̄m(β*m))′ : the same m × 1 vector of average MSE
differences, evaluated at the probability limits of the estimated parameters

β*k = plim t→∞ β̂kt : probability limit of the recursive parameter estimates β̂kt

White (2000) Reality Check (cont.)

White’s reality check tests the null hypothesis that the benchmark model is
not inferior to any of the m alternatives:

H0 : max over k = 1, …, m of E[d*k,t+1] ≤ 0

If no model is better than the benchmark, each element of
d̄* = (d̄*1, …, d̄*m) has mean at most zero

Examining the maximum of d̄ is the same as searching for the best model

Alternative hypothesis: the best model outperforms the benchmark, i.e.,
there exists a superior model k such that E[d*k,t+1] > 0


White (2000) Reality Check (cont.)

White shows conditions under which (⇒ denotes convergence in distribution)

max over k = 1, …, m of T^(1/2)(d̄k − E[d̄*k]) ⇒ max over k = 1, …, m of {Zk}

where (Z1, …, Zm) is distributed as N(0, Ω)

Problem: We cannot easily determine the distribution of the maximum of Z
because the m × m covariance matrix Ω is unknown

White (2000) Reality Check (cont.)

Hal White (2000) developed a bootstrap for drawing the maximum from
N(0,Ω)

1. Draw values of dk,t+1 with replacement to generate bootstrap samples of d̄k.
These draws use the same time index t across all models to preserve the
correlation structure across the test statistics d̄
2. Compute the maximum value of the test statistic across the models in each
bootstrap sample
3. Compare the actual maximum test statistic to the quantiles of its bootstrap
distribution to obtain White’s bootstrapped Reality Check p-value for the null
hypothesis that no model beats the benchmark

Important to impose the null hypothesis: the bootstrap distribution of each
d̄k is recentered around zero
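The steps above can be sketched in a few lines of Python. This is a simplified illustration rather than White's exact procedure: it resamples time indices iid rather than with the stationary bootstrap described on the next slide, and the function and variable names are my own. The key features from the slide are preserved: the same resampled time index is used for every model (keeping the cross-model correlation structure), and each d̄k is recentered at zero to impose the null.

```python
import random

def reality_check_pvalue(d, n_boot=2000, seed=0):
    """Bootstrap p-value for H0: no model beats the benchmark.

    d[t][k] is the loss differential d_{k,t} (benchmark squared error
    minus model k's squared error) at time t. Illustrative iid
    resampling; White (2000) uses the stationary bootstrap.
    """
    rng = random.Random(seed)
    n, m = len(d), len(d[0])
    dbar = [sum(row[k] for row in d) / n for k in range(m)]
    stat = max(dbar)  # best model's average gain over the benchmark
    exceed = 0
    for _ in range(n_boot):
        # same resampled dates for all models -> correlation preserved
        idx = [rng.randrange(n) for _ in range(n)]
        boot_stat = max(
            # recenter around dbar[k] to impose the null E[d_k] = 0
            sum(d[t][k] for t in idx) / n - dbar[k]
            for k in range(m)
        )
        if boot_stat >= stat:
            exceed += 1
    return exceed / n_boot
```

Because the usual √T scaling multiplies the actual and bootstrapped statistics alike, it cancels in the comparison and the sketch omits it.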


White (2000) Reality Check: stationary bootstrap

Let τl be a random time index between R + 1 and T

R: start of the evaluation sample
T : end of the evaluation sample

For each bootstrap replication b, generate a sample estimate
d̄bk = ∑ from t = R+1 to T of (yτl − fk,τl|τl−1)² as follows:

1. Set t = R + 1. Draw the initial index τl at random, independently, and
uniformly from {R + 1, …, T}
2. Increase t by 1. If t > T , stop. Otherwise, draw a standard uniform random
variable, U, independently of all other random variables
   1. If U < q, draw a new τl at random, independently, and uniformly from
   {R + 1, …, T}
   2. If U ≥ q, expand the block by setting τl = τl−1 + 1; if τl > T , reset τl = R + 1
3. Repeat step 2
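A minimal Python sketch of the index-drawing scheme above (the function name is my own; it returns one bootstrap sequence of T − R time indices, with expected block length 1/q):

```python
import random

def stationary_bootstrap_indices(R, T, q, rng=None):
    """Draw T - R time indices from {R+1, ..., T} per the scheme on
    the slide: with probability q start a new block at a uniformly
    drawn date, otherwise extend the current block by one period,
    wrapping around from T back to R + 1."""
    rng = rng or random.Random()
    tau = rng.randint(R + 1, T)        # step 1: initial uniform draw
    indices = [tau]
    for _ in range(T - R - 1):         # step 2, repeated
        if rng.random() < q:           # U < q: start a new block
            tau = rng.randint(R + 1, T)
        else:                          # U >= q: extend the block
            tau += 1
            if tau > T:                # past the sample end: wrap
                tau = R + 1
        indices.append(tau)
    return indices

# One draw: 20 indices from {101, ..., 120} with mean block length 1/q = 10
print(stationary_bootstrap_indices(100, 120, q=0.1, rng=random.Random(0)))
```

Setting q = 1 reproduces iid resampling; small q produces long blocks that preserve serial dependence in the resampled losses.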


Sullivan, Timmermann, and White, JF 1999

Investigates the performance of 7,846 technical trading rules applied to daily
data

filter rules, moving averages, support and resistance, channel breakouts,
on-balance volume averages

The “best” technical trading rule looks very good on its own

However, when we account for the possibility that the best technical trading
rule was selected from a large set of candidate rules, we can no longer rule
out that even the best rule fails to significantly beat the benchmark
buy-and-hold strategy


Sullivan, Timmermann, and White, JF 1999: Results

[Two slides of results tables from the paper appeared here]

Conclusions

Data mining poses both an opportunity and a challenge to constructing
forecasting models in economics and finance

Less of a concern if we can generate new data or use cross-sectional holdout
data

More of a concern if we only have one (short) historical time-series

Methods exist for quantifying how data mining affects statistical tests
