Lecture 1: Introduction to Forecasting
UCSD, January 9 2017
Allan Timmermann
UC San Diego
Timmermann (UCSD) Forecasting Winter, 2017 1 / 64
1 Course objectives
2 Challenges facing forecasters
3 Forecast Objectives: the Loss Function
4 Common Assumptions on Loss
5 Specific Types of Loss Functions
6 Multivariate loss
7 Does the loss function matter?
8 Informal Evaluation Methods
9 Out-of-Sample Forecast Evaluation
10 Some easy and hard to predict variables
11 Weak predictability but large economic gains
Course objectives: Develop
Skills in analyzing, modeling and working with time series data from
finance and economics
Ability to construct forecasting models and generate forecasts
formulating a class of models – using information intelligently
model selection
estimation – making best use of historical data
Develop creativity in posing forecasting questions, collecting and
using often incomplete data
which data help me build a better forecasting model?
Ability to critically evaluate and compare forecasts
reasonable (simple) benchmarks
skill or luck? Overfitting (data mining)
Compete or combine?
Ranking forecasters: Mexican inflation
Forecast situations
Forecasts are used to guide current decisions that affect the future
welfare of a decision maker (forecast user)
Predicting my grade – updating information on the likely grade as the
course progresses
Choosing between a fixed-rate mortgage (interest rate fixed for 20
years) versus a floating-rate (variable) mortgage
Depends on interest rate and inflation forecast
Political or sports outcomes – prediction markets
Investing in the stock market. How volatile will the stock market be?
Predicting Chinese property prices. Supply and demand considerations,
economic growth
Structural versus reduced-form approaches
Depends on the forecast horizon: 1 month vs 10 years
Forecasting and decisions
Credit card company deciding which transactions are potentially
fraudulent and should be denied (in real time)
requires fitting a model to past credit card transactions
binary data (zero-one)
Central Bank predicting the state of the economy – timing issues
Predicting which fund manager (if any) or asset class will outperform
Forecasting the outcome of the world cup:
http://www.goldmansachs.com/our-thinking/outlook/world-cup-sections/world-cup-book-2014-statistical-model.html
Forecasting the outcome of the world cup
Key issues
Decision maker’s actions depend on predicted future outcomes
Trade off relative costs of over- or underpredicting outcomes
Actions and forecasts are inextricably linked
good forecasts are expected to lead to good decisions
bad forecasts are expected to lead to poor decisions
Forecast is an intermediate input in a decision process, rather than an
end product of separate interest
Loss function weighs the cost of possible forecast errors – like a utility
function uses preferences to weigh different outcomes
Loss functions
Forecasts play an important role in almost all decision problems where
a decision maker’s utility or wealth is affected by his current and
future actions and depends on unknown future events
Central Banks
Forecast inflation, unemployment, GDP growth
Action: interest rate; monetary policy
Trade off cost of over- vs. under-predictions
Firms
Forecast sales
Action: production level, new product launch
Trade off inventory vs. stock-out/goodwill costs
Money managers
Forecast returns (mean, variance, density)
Action: portfolio weights/trading strategy
Trade off Risk vs. return
Ways to generate forecasts
Rule of thumb. Simple decision rule that is not optimal, but may be
robust
Judgmental/subjective forecast, e.g., expert opinion
Combine with other information/forecasts
Quantitative models
“… an estimated forecasting model provides a characterization of what
we expect in the present, conditional upon the past, from which we
infer what to expect in the future, conditional upon the present and the
past. Quite simply, we use the estimated forecasting model to
extrapolate the observed historical data.” (Frank Diebold, Elements of
Forecasting).
Combine different types of forecasts
Forecasts: key considerations
Forecasting models are simplified approximations to a complex reality
How do we make the right shortcuts?
Which methods seem to work in general or in specific situations?
Economic theory may suggest relevant predictor variables, but is silent
about functional form, dynamics of forecasting model
combine art (judgment) and science
how much can we learn from the past?
Forecast object – what are we trying to forecast?
Event outcome: predict if a certain event will happen
Will a bank or hedge fund close?
Will oil prices fall below $40/barrel in 2017?
Will Europe experience deflation in 2017?
Event timing: it is known that an event will happen, but unknown
when it will occur
When will US stocks enter a “bear” market (conventionally, a drop of 20% from a recent peak)?
Time-series: forecasting future values of a continuous variable by
means of current and past data
Predicting the level of the Dow Jones Index on March 15, 2017
Forecast statement
Point forecast
Single number summarizing “best guess”. No information on how
certain or precise the point forecast is. Random shocks affect all
time-series so a non-zero forecast error is to be expected even from a
very good forecast
Ex: US GDP growth for 2017 is expected to be 2.5%
Interval forecast
Lower and upper bound on outcome. Gives a range of values inside
which we expect the outcome will fall with some probability (e.g., 50%
or 95%). Confidence interval for the predicted variable. Length of
interval conveys information about forecast uncertainty.
Ex: 90% chance US GDP growth will fall between 1% and 4%
Density or probability forecast
Entire probability distribution of the future outcome
Ex: US GDP growth for 2017 is Normally distributed N(2.5,1)
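The three forecast statements are linked: a density forecast implies both a point forecast and interval forecasts. A minimal sketch, assuming the N(2.5, 1) density from the example above (the variable names are illustrative, not part of the lecture):

```python
from statistics import NormalDist

# Density forecast from the slide: GDP growth ~ N(2.5, 1)
density = NormalDist(mu=2.5, sigma=1.0)

# Point forecast: the mean of the density
point = density.mean

# 90% interval forecast: 5th and 95th percentiles of the density
lower, upper = density.inv_cdf(0.05), density.inv_cdf(0.95)

# Probability that growth falls between 1% and 4% under this density
prob_1_to_4 = density.cdf(4.0) - density.cdf(1.0)

print(point, round(lower, 2), round(upper, 2), round(prob_1_to_4, 3))
```

Under this particular density the interval from 1% to 4% has a coverage probability somewhat below 90%, so the interval and density examples on the slide should be read as separate illustrations.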
Forecast horizon
The best forecasting model is likely to depend on whether we are
forecasting 1 minute, 1 day, 1 month or 1 year ahead
We refer to an h−step-ahead forecast, where h (short for “horizon”)
is the number of time periods ahead that we predict
Often you hear the argument that “fundamentals matter in the long
run, psychological factors are more important in the short run”
Information set
Do we simply use past values of a series itself or do we include a
larger information set?
Suppose we wish to forecast some outcome y for period T + 1 and
have historical data on this variable from t = 1, ..,T . The univariate
information set consists of the series itself up to time T :
$$I_T^{\text{univariate}} = \{y_1, \ldots, y_T\}$$
If data on other series, zt (typically an N × 1 vector), are available,
we have a multivariate information set
$$I_T^{\text{multivariate}} = \{y_1, \ldots, y_T, z_1, \ldots, z_T\}$$
It is often important to establish whether a forecast can benefit from
using such additional information
Loss function: notations
Outcome: Y
Forecast: f
Forecast error: e = Y − f
Observed data: Z
Loss function: L(f ,Y )→ R
maps inputs f ,Y to the real number line R
yields a complete ordering of forecasts
describes in relative terms how costly it is to make forecast errors
Loss Function Considerations
Choice of loss function that appropriately measures trade-offs is
important for every facet of the forecasting exercise and affects
which forecasting models are preferred
how parameters are estimated
how forecasts are evaluated and compared
Loss function reflects the economics of the decision problem
Financial analysts’ forecasts: Hong and Kubik (2003), Lim (2001)
Analysts tend to bias their earnings forecasts (walk-down effect)
Sometimes a forecast is best viewed as a signal in a strategic game
that explicitly accounts for the forecast provider’s incentives
Constructing a loss function
For profit maximizing investors the natural choice of loss is the
function relating payoffs (through trading rule) to the forecast and
realized returns
Link between loss and utility functions: both are used to minimize risk
arising from economic decisions
Loss is sometimes viewed as the negative of utility
U(f, Y) ≈ −L(f, Y)
Majority of forecasting papers use simple ‘off the shelf’ statistical loss
functions such as Mean Squared Error (MSE)
Common Assumptions on Loss
Granger (1999) proposes three ‘required’ properties for error loss
functions, L(f , y) = L(y − f ) = L(e):
A1. L(0) = 0 (minimal loss of zero for perfect forecast);
A2. L(e) ≥ 0 for all e;
A3. L(e) is monotonically non-decreasing in |e| :
L(e1) ≥ L(e2) if e1 > e2 > 0
L(e1) ≥ L(e2) if e1 < e2 < 0
A1: normalization
A2: imperfect forecasts are more costly than perfect ones
A3: regularity condition - bigger forecast mistakes are (weakly)
costlier than smaller mistakes (of same sign)
Additional Assumptions on Loss
Symmetry:
L(y − e, y) = L(y + e, y), i.e., over- and underpredictions of the same size are equally costly
Granger and Newbold (1986, p. 125): “.. an assumption of symmetry
about the conditional mean ... is likely to be an easy one to accept ...
an assumption of symmetry for the cost function is much less
acceptable.”
Homogeneity: for some positive function h(a) :
L(ae) = h(a)L(e)
scaling doesn’t matter
Differentiability of loss with respect to the forecast (regularity
condition)
Squared Error (MSE) Loss
$$L(e) = a e^2, \quad a > 0$$
Satisfies the three Granger properties
Homogeneous, symmetric, differentiable everywhere
Convex: penalizes large forecast errors at an increasing rate
Optimal forecast:
$$f^* = \arg\min_f \int (y - f)^2 p_Y(y)\, dy$$
First order condition:
$$f^* = \int y\, p_Y(y)\, dy = E(y)$$
The optimal forecast under MSE loss is the conditional mean
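A quick numerical check of the result that the mean minimizes expected squared error loss. This is a sketch only: the sample values and the grid of candidate forecasts are illustrative assumptions, and the sample average stands in for the expectation.

```python
from statistics import fmean

# Hypothetical outcomes (illustrative only)
y = [1.0, 2.0, 2.5, 4.0, 10.5]

def avg_squared_loss(f, outcomes):
    """Sample analogue of E[(Y - f)^2] for a candidate forecast f."""
    return fmean((v - f) ** 2 for v in outcomes)

# Candidate forecasts on a grid from 0 to 12 in steps of 0.01
grid = [i / 100 for i in range(1201)]
f_star = min(grid, key=lambda f: avg_squared_loss(f, y))

print(f_star, fmean(y))
```

The grid minimizer coincides with the sample mean, as the first order condition predicts.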
Piece-wise Linear (lin-lin) Loss
$$L(e) = (1-\alpha)\, e\, 1_{e>0} - \alpha\, e\, 1_{e \le 0}, \quad 0 < \alpha < 1$$
$1_{e>0}$ is an indicator variable: $1_{e>0} = 1$ if $e > 0$, otherwise $1_{e>0} = 0$
Weight on positive forecast errors: (1− α)
Weight on negative forecast errors: α
Lin-lin loss satisfies the three Granger properties and is homogeneous
and differentiable everywhere with regard to f, except at e = 0
Lin-lin loss does not penalize large errors as much as MSE loss
Mean absolute error (MAE) loss arises, up to a factor of 1/2, if α = 1/2:
$$L(e) = |e|/2$$
MSE vs. piece-wise Linear (lin-lin) Loss
[Figure: MSE loss and lin-lin loss L(e) plotted against e from −3 to 3, in three panels: α = 0.25; α = 0.5 (MAE loss); α = 0.75]
Optimal forecast under lin-lin Loss
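Expected loss under lin-lin loss:
$$E_Y[L(Y - f)] = (1-\alpha)\, E[(Y - f)\, 1_{Y > f}] - \alpha\, E[(Y - f)\, 1_{Y \le f}]$$
First order condition: $-(1-\alpha)(1 - P_Y(f^*)) + \alpha P_Y(f^*) = 0$, so
$$f^* = P_Y^{-1}(1-\alpha)$$
$P_Y$: CDF of Y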
The optimal forecast is the (1− α) quantile of Y
α = 1/2 : optimal forecast is the median of Y
As α increases towards one, the optimal forecast moves further into the
left tail of the predicted outcome distribution
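The quantile result can be illustrated numerically. The sketch below assumes Y is standard normal; the function names `linlin_loss` and `optimal_linlin_forecast` are hypothetical helpers, not part of the lecture.

```python
from statistics import NormalDist

def linlin_loss(e, alpha):
    """Lin-lin loss: weight (1 - alpha) on e > 0, weight alpha on e <= 0."""
    return (1 - alpha) * e if e > 0 else -alpha * e

def optimal_linlin_forecast(dist, alpha):
    """f* = P_Y^{-1}(1 - alpha): the (1 - alpha) quantile of Y."""
    return dist.inv_cdf(1 - alpha)

Y = NormalDist(mu=0.0, sigma=1.0)  # assumed outcome distribution
for alpha in (0.25, 0.5, 0.75):
    print(alpha, round(optimal_linlin_forecast(Y, alpha), 3))
```

With α = 0.25 positive errors (underpredictions) carry the larger weight, so the optimal forecast sits above the median; with α = 0.75 it sits below.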
Optimal forecast of N(0,1) variable under lin-lin loss
[Figure: optimal forecast f* (vertical axis, from −2.5 to 2.5) plotted against α (horizontal axis, from 0 to 1)]
Linex Loss
$$L(e) = \exp(a_2 e) - a_2 e - 1, \quad a_2 \neq 0$$
Differentiable everywhere
Asymmetric: $a_2$ controls both the degree and direction of asymmetry
$a_2 > 0$: loss is approximately linear for e < 0 and approximately exponential for e > 0
Large underpredictions are very costly (f < y, so e = y − f > 0)
Converse is true when $a_2 < 0$
MSE versus Linex Loss
[Figure: MSE loss and linex loss L(e) plotted against e from −3 to 3, in two panels: right-skewed linex loss with $a_2 = 1$; left-skewed linex loss with $a_2 = -1$]
Linex Loss
Suppose $Y \sim N(\mu_Y, \sigma_Y^2)$. Then
$$E[L(e)] = \exp\left(a_2(\mu_Y - f) + \frac{a_2^2}{2}\sigma_Y^2\right) - a_2(\mu_Y - f) - 1$$
Optimal forecast:
$$f^* = \mu_Y + \frac{a_2}{2}\sigma_Y^2$$
Under linex loss, the optimal forecast depends on both the mean and variance of Y ($\mu_Y$ and $\sigma_Y^2$) as well as on the curvature parameter of the loss function, $a_2$
Optimal bias under Linex Loss for N(0,1) variable
[Figure: three panels plotted against e from −3 to 3: MSE loss; linex loss with $a_2 = 1$; linex loss with $a_2 = -1$]
Multivariate Loss Functions
Multivariate MSE loss with n errors $e = (e_1, \ldots, e_n)'$:
$$MSE(A) = e'Ae$$
A is a nonnegative and positive definite n × n matrix
This satisfies the basic assumptions for a loss function
When $A = I_n$, covariances can be ignored and the expected loss simplifies to
$$E[MSE(I_n)] = E[e'e] = \sum_{i=1}^{n} E[e_i^2],$$
i.e., the sum of the individual mean squared errors
Does the loss function matter?
Cenesizoglu and Timmermann (2012) compare statistical and economic measures of forecasting performance across a large set of stock return prediction models with time-varying mean and volatility
Economic performance is measured through the certainty equivalent return (CER), i.e., the risk-adjusted return
Statistical performance is measured through mean squared error (MSE)
Performance is measured relative to that of a constant expected return (prevailing mean) benchmark
Common for forecast models to produce worse mean squared error (MSE) but better return performance than the benchmark
Relation between statistical and economic measures of forecasting performance can be weak
[Figure: Does loss function matter? Cenesizoglu and Timmermann]
Percentage of models with worse statistical but better economic performance than prevailing mean (CT, 2012)
CER is certainty equivalent return
Sharpe is the Sharpe ratio
RAR is risk-adjusted return
RMSE is root mean squared (forecast) error
Example: Directional Trading system
Consider the decisions of a risk-neutral ‘market timer’ whose utility is linear in the return on the market portfolio (y)
$$U(\delta(f), y) = \delta(f)\, y$$
Investor’s decision rule, δ(f): go ‘long’ one unit in the risky asset if a positive return is predicted (f > 0), otherwise go short one unit:
$$\delta(f) = \begin{cases} 1 & \text{if } f > 0 \\ -1 & \text{if } f \le 0 \end{cases}$$
Let sign(f) = 1 if f > 0, otherwise sign(f) = 0. Payoff:
$$U(\delta(f), y) = (2\,\mathrm{sign}(f) - 1)\, y$$
Sign and magnitude of y and sign of f matter to trader’s utility
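The directional trading rule can be sketched in a few lines. This follows the verbal rule (long when f > 0, short otherwise); the function names and the sample forecasts and returns are hypothetical.

```python
def position(f):
    """Decision rule delta(f): long one unit if f > 0, otherwise short one unit."""
    return 1 if f > 0 else -1

def payoff(f, y):
    """Utility of the risk-neutral market timer: delta(f) * y."""
    return position(f) * y

# Hypothetical forecasts and realized returns (illustrative only)
forecasts = [0.5, -0.2, 0.1, -0.4]
returns = [1.0, 0.3, -0.6, -2.0]
payoffs = [payoff(f, y) for f, y in zip(forecasts, returns)]
print(payoffs)
```

Only the sign of each forecast enters the payoff: scaling all forecasts by any positive constant leaves `payoffs` unchanged.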
Example: Directional Trading system (cont.)
Which forecast approach is best under the directional trading rule?
Since the trader ignores information about the magnitude of the
forecast, an approach that focuses on predicting only the sign of the
excess return could make sense
Leitch and Tanner (1991) studied forecasts of T-bill futures:
Professional forecasters reported predictions with higher mean squared
error (MSE) than those from simple time-series models
Puzzling since the time-series models incorporate far less information
than the professional forecasts
When measured by their ability to generate profits or correctly forecast
the direction of future interest rate movements the professional
forecasters did better than the time-series models
Professional forecasters’ objectives are poorly approximated by MSE
loss – closer to directional or ‘sign’ loss
Common estimates of forecasting performance
Define the forecast error $e_{t+h|t} = y_{t+h} - f_{t+h|t}$. Then
$$MSE = T^{-1} \sum_{t=1}^{T} e_{t+h|t}^2$$
$$RMSE = \sqrt{T^{-1} \sum_{t=1}^{T} e_{t+h|t}^2}$$
$$MAE = T^{-1} \sum_{t=1}^{T} |e_{t+h|t}|$$
Directional accuracy (DA): let $I_{x_{t+1}>0} = 1$ if $x_{t+1} > 0$, otherwise $I_{x_{t+1}>0} = 0$. Then an estimate of DA is
$$DA = T^{-1} \sum_{t=1}^{T} I_{y_{t+h} \times f_{t+h|t} > 0}$$
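These sample measures are straightforward to compute. A minimal sketch, assuming outcomes and forecasts are already aligned as paired lists (the function name `forecast_metrics` and the data are illustrative):

```python
from math import sqrt

def forecast_metrics(y, f):
    """Return MSE, RMSE, MAE and directional accuracy for paired outcomes and forecasts."""
    errors = [yi - fi for yi, fi in zip(y, f)]
    T = len(errors)
    mse = sum(e * e for e in errors) / T
    mae = sum(abs(e) for e in errors) / T
    # DA: share of periods in which outcome and forecast have the same sign
    da = sum(1 for yi, fi in zip(y, f) if yi * fi > 0) / T
    return {"MSE": mse, "RMSE": sqrt(mse), "MAE": mae, "DA": da}

# Hypothetical outcomes and forecasts
y = [1.0, -2.0, 3.0, -1.0]
f = [0.0, -1.0, 1.0, 1.0]
metrics = forecast_metrics(y, f)
print(metrics)
```

Note that a forecast of exactly zero counts as a directional miss under the strict inequality in the DA definition.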
Forecast evaluation
ft+h|t : forecast of yt+h given information available at time t
Given a sequence of forecasts, ft+h|t , and outcomes, yt+h,
t = 1, …,T , it is natural to ask if the forecast was “optimal” or
obviously deficient
Questions posed by forecast evaluation are related to the
measurement of predictive accuracy
Absolute performance measures the accuracy of an individual
forecast relative to the outcome, using either an economic
(loss-based) or a statistical metric
Relative performance compares the performance of one or several
forecasts against some benchmark
Forecast evaluation (cont.)
Forecast evaluation amounts to understanding if the loss from a given
forecast is “small enough”
Informal methods – graphical plots, decompositions
Formal methods – distribution of test statistic for sample averages of
loss estimates can depend on how the forecasts were constructed, e.g.
which estimation method was used
The method (not only the model) used to construct the forecast
matters – expanding vs. rolling estimation window
Formal evaluation of an individual forecast requires testing whether
the forecast is optimal with respect to some loss function and a
specific information set
Rejection of forecast optimality suggests that the forecast can be
improved
Efficient Forecast: Definition
A forecast is efficient (optimal) if no other forecast using the available
data, $x_t \in I_t$, can be used to generate a smaller expected loss
Under MSE loss:
$$\hat f^*_{t+h|t} = \arg\min_{\hat f(x_t)} E\left[(y_{t+h} - \hat f(x_t))^2\right]$$
If we can use information in $I_t$ to produce a more accurate forecast,
then the original forecast would be suboptimal
Efficiency is conditional on the information set
weak form forecast efficiency tests include only past forecasts and
past outcomes $I_t = \{y_t, y_{t-1}, \ldots, \hat f_{t|t-1}, e_{t|t-1}, \ldots\}$
strong form efficiency tests extend this to include all other variables
$x_t \in I_t$
Optimality under MSE loss
First order condition for an optimal forecast under MSE loss:
$$E\left[\frac{\partial (y_{t+h} - f_{t+h|t})^2}{\partial f_{t+h|t}}\right] = -2E\left[y_{t+h} - f_{t+h|t}\right] = -2E\left[e_{t+h|t}\right] = 0$$
Similarly, conditional on information at time t, $I_t$:
$$E[e_{t+h|t} \mid I_t] = 0$$
The expected value of the forecast error must equal zero given
current information, It
Test $E[e_{t+h|t}\, x_t] = 0$ for all variables $x_t \in I_t$ known at time t
If the forecast is optimal, no variable known at time t can predict its
future forecast error et+h|t . Otherwise the forecast wouldn’t be
optimal
If I can predict that my forecast will be too low, I should increase my
forecast
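One informal way to look at the orthogonality condition is to regress forecast errors on a variable known at time t. The sketch below uses simulated data under an assumed data-generating process (y = 0.5x + noise) and a hand-written OLS helper; all names and parameter values are hypothetical.

```python
import random

def ols_slope_intercept(x, e):
    """OLS regression of e on a constant and x; returns (slope, intercept)."""
    n = len(x)
    xbar = sum(x) / n
    ebar = sum(e) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxe = sum((xi - xbar) * (ei - ebar) for xi, ei in zip(x, e))
    slope = sxe / sxx
    return slope, ebar - slope * xbar

random.seed(0)
T = 5000
x = [random.gauss(0, 1) for _ in range(T)]        # predictor known at time t
y = [0.5 * xi + random.gauss(0, 1) for xi in x]   # assumed process for the outcome

errors_opt = [yi - 0.5 * xi for xi, yi in zip(x, y)]  # forecast uses the true coefficient
errors_sub = [yi - 0.2 * xi for xi, yi in zip(x, y)]  # forecast underweights x

slope_opt, _ = ols_slope_intercept(x, errors_opt)
slope_sub, _ = ols_slope_intercept(x, errors_sub)
print(round(slope_opt, 3), round(slope_sub, 3))
```

The errors of the first forecast are close to unrelated to x, while the errors of the second load on x with a slope near 0.3, which signals that the forecast could be improved by putting more weight on x.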
Optimality properties under Squared Error Loss
1 Optimal forecasts are unbiased: the forecast error $e_{t+h|t}$ has zero
mean, both conditionally and unconditionally:
$$E[e_{t+h|t}] = E[e_{t+h|t} \mid I_t] = 0$$
2 h-period forecast errors ($e_{t+h|t}$) are uncorrelated with information
available at the time the forecast was computed ($I_t$). In particular,
single-period forecast errors, $e_{t+1|t}$, are serially uncorrelated:
$$E[e_{t+1|t}\, e_{t|t-1}] = 0$$
3 The variance of the forecast error ($e_{t+h|t}$) increases (weakly) in the
forecast horizon, h:
$$Var(e_{t+h+1|t}) \ge Var(e_{t+h|t}) \text{ for all } h \ge 1$$
Optimality properties under Squared Error Loss (cont.)
Forecasts should be unbiased. Why? If they were biased, we could
improve the forecast simply by correcting for the bias
Suppose $f_{t+1|t}$ is biased:
$$y_{t+1} = 1 + f_{t+1|t} + \varepsilon_{t+1}, \quad \varepsilon_{t+1} \sim WN(0, \sigma^2)$$
The bias-corrected forecast:
$$f^*_{t+1|t} = 1 + f_{t+1|t}$$
is more accurate than $f_{t+1|t}$
Forecast errors should be unpredictable:
Suppose $y_{t+1} - f_{t+1|t} = e_{t+1} = 0.5 e_t + \varepsilon_{t+1}$ so the one-step forecast
error is serially correlated
Adding back $0.5 e_t$ to the original forecast yields a more accurate
forecast: $f^*_{t+1|t} = f_{t+1|t} + 0.5 e_t$ is better than $f_{t+1|t}$
Variance of forecast error increases in the forecast horizon
We learn more information as we get closer to the forecast “target”
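Both corrections can be checked by simulation. This sketch assumes Gaussian white noise with unit variance and works directly with the forecast errors; the variable names are illustrative.

```python
import random

random.seed(1)
T = 10000
eps = [random.gauss(0, 1) for _ in range(T)]

def mse(errors):
    """Mean squared error of a list of forecast errors."""
    return sum(e * e for e in errors) / len(errors)

# Case 1: biased forecast, y_{t+1} = 1 + f_{t+1|t} + eps_{t+1}
raw_errors = [1 + e for e in eps]   # errors of the original forecast
corrected_errors = eps              # errors after adding the bias of 1

# Case 2: serially correlated errors, e_{t+1} = 0.5 e_t + eps_{t+1}
ar_errors = [0.0]
for shock in eps[1:]:
    ar_errors.append(0.5 * ar_errors[-1] + shock)
# After adding 0.5 e_t to the forecast, the remaining error is eps_{t+1}
adjusted_errors = [ar_errors[t] - 0.5 * ar_errors[t - 1] for t in range(1, T)]

print(round(mse(raw_errors), 2), round(mse(corrected_errors), 2))
print(round(mse(ar_errors[1:]), 2), round(mse(adjusted_errors), 2))
```

In population the bias correction lowers the MSE from 1 + σ² to σ², and the serial-correlation adjustment lowers it from σ²/(1 − 0.25) to σ².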
Informal evaluation methods (Greenbook forecasts)
Time-series graph of forecasts and outcomes $\{f_{t+h|t}, y_{t+h}\}_{t=1}^{T}$
[Figure: two time-series panels, 1965 to 2010, each showing Actual and Forecast: GDP growth (annualized change); inflation rate (annualized change)]
Informal evaluation methods (Greenbook forecasts)
Scatterplots of $\{f_{t+h|t}, y_{t+h}\}_{t=1}^{T}$
[Figure: two scatterplots of actual (vertical axis) against forecast (horizontal axis): GDP growth; inflation rate]
Informal evaluation methods (Greenbook Forecasts)
Plots of $f_{t+h|t} - y_t$ against $y_{t+h} - y_t$: directional accuracy
[Figure: two scatterplots of actual against forecast changes: GDP growth; inflation rate]
Informal evaluation methods (Greenbook forecasts)
Plot of forecast errors $e_{t+h|t} = y_{t+h} - f_{t+h|t}$
[Figure: two time-series panels of forecast errors, 1965 to 2010: GDP growth; inflation rate]
Informal evaluation methods
Theil (1961) suggested the following decomposition:
$$E[y - f]^2 = E[(y - Ey) - (f - Ef) + (Ey - Ef)]^2$$
$$= (Ey - Ef)^2 + (\sigma_y - \sigma_f)^2 + 2\sigma_y \sigma_f (1 - \rho)$$
MSE depends on
squared bias (Ey − Ef )2
squared differences in standard deviations (σy − σf )2
correlation between the forecast and outcome ρ
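The decomposition holds exactly in a sample when all moments are computed with divisor T. A minimal sketch with hypothetical data (the function name `theil_decomposition` is illustrative):

```python
from math import sqrt

def theil_decomposition(y, f):
    """Return (squared bias, squared s.d. difference, correlation term) using divisor T."""
    T = len(y)
    my, mf = sum(y) / T, sum(f) / T
    sy = sqrt(sum((v - my) ** 2 for v in y) / T)
    sf = sqrt(sum((v - mf) ** 2 for v in f) / T)
    rho = sum((a - my) * (b - mf) for a, b in zip(y, f)) / (T * sy * sf)
    bias2 = (my - mf) ** 2
    sd_diff2 = (sy - sf) ** 2
    corr_term = 2 * sy * sf * (1 - rho)
    return bias2, sd_diff2, corr_term

# Hypothetical outcomes and forecasts
y = [2.0, 3.5, 1.0, 4.0, 2.5]
f = [1.5, 3.0, 2.0, 3.0, 2.0]
parts = theil_decomposition(y, f)
mse = sum((a - b) ** 2 for a, b in zip(y, f)) / len(y)
print(parts, sum(parts), mse)
```

The three components sum to the sample MSE, so each can be read as a share of the total.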
Pseudo out-of-sample Forecasts
Simulated (“pseudo”) out-of-sample (OoS) forecasts seek to mimic
the “real time” updating underlying most forecasts
What would a forecaster have done (historically) at a given point in
time?
Method splits data into an initial estimation sample (in-sample
period) and a subsequent evaluation sample (OoS period)
Forecasts are based on parameter estimates that use data only up to
the date when the forecast is computed
As the sample expands, the model parameters get updated, resulting
in a sequence of forecasts
Why do out-of-sample forecasting?
control for data mining – harder to “game”
feasible in real time (less “look-ahead” bias)
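The mechanics of a pseudo out-of-sample exercise can be sketched with the simplest possible model, the prevailing mean, re-estimated each period on data available at that date. The function name, the data, and the choice of model are illustrative assumptions.

```python
def pseudo_oos_forecasts(y, initial_sample):
    """One-step-ahead forecasts of y[t] using only y[0:t], for t >= initial_sample."""
    forecasts = []
    for t in range(initial_sample, len(y)):
        history = y[:t]                               # data available when the forecast is made
        forecasts.append(sum(history) / len(history))  # prevailing-mean forecast
    return forecasts

# Hypothetical series: first 3 observations form the initial estimation sample
y = [1.0, 3.0, 2.0, 4.0, 5.0, 3.0]
f = pseudo_oos_forecasts(y, initial_sample=3)
errors = [yt - ft for yt, ft in zip(y[3:], f)]
print(f, errors)
```

No observation dated t or later enters the forecast of y[t], which is the constraint that distinguishes out-of-sample from in-sample fit.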
Pseudo out-of-sample forecasts (cont.)
Out-of-sample (OoS) forecasts impose the constraint that the
parameter estimates of the forecasting model only use information
available at the time the forecast was computed
Only information known at time t can be used to estimate and select
the forecasting model and generate forecasts ft+h|t
Many variants of OoS forecast estimation methods exist. These can
be illustrated for the linear regression model
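$$y_{t+1} = \beta' x_t + \varepsilon_{t+1}$$
$$\hat f_{t+1|t} = \hat\beta_t' x_t$$
$$\hat\beta_t = \left(\sum_{s=1}^{t} \omega(s,t)\, x_{s-1} x_{s-1}'\right)^{-1} \left(\sum_{s=1}^{t} \omega(s,t)\, x_{s-1} y_s\right)$$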
Different methods use different weighting functions ω(s, t)
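The weighted estimator can be sketched for the special case of a single (scalar) predictor, which avoids matrix inversion. The function names, the indexing convention, and the noise-free data are assumptions made for illustration.

```python
def weighted_beta(x, y, t, weight):
    """Scalar version of beta_hat_t: sum_s w(s,t) x_{s-1} y_s / sum_s w(s,t) x_{s-1}^2.

    x[s] holds x_s and y[s] holds y_s (index 0 of y is unused).
    """
    num = sum(weight(s, t) * x[s - 1] * y[s] for s in range(1, t + 1))
    den = sum(weight(s, t) * x[s - 1] ** 2 for s in range(1, t + 1))
    return num / den

def equal_weights(s, t):
    """Weight 1 on every observation s = 1, ..., t."""
    return 1.0 if 1 <= s <= t else 0.0

# Hypothetical data generated exactly by y_s = 2 * x_{s-1} (no noise)
x = [1.0, 2.0, -1.0, 0.5, 3.0]          # x_0, ..., x_4
y = [None] + [2.0 * v for v in x[:-1]]  # y_1, ..., y_4
t = 4
beta_hat = weighted_beta(x, y, t, equal_weights)
forecast = beta_hat * x[t]              # f_{t+1|t} = beta_hat_t * x_t
print(beta_hat, forecast)
```

Passing a different `weight` function changes the estimation scheme without touching the estimator itself.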
Expanding window
Expanding or recursive estimation windows put equal weight on all
observations s = 1, …, t to estimate the parameters of the model:
$$\omega(s,t) = \begin{cases} 1 & 1 \le s \le t \\ 0 & \text{otherwise} \end{cases}$$
As time progresses, the estimation sample grows larger, $I_t \subseteq I_{t+1}$
If the parameters of the model do not change (“stationarity”), the
expanding window approach makes efficient use of the data and leads
to consistent parameter estimates
If model parameters are subject to change, the approach leads to
biased forecasts
The approach works well empirically due to its use of all available
data which reduces the effect of estimation error on the forecasts
Expanding window
[Figure: time line (1, …, t, t+1, t+2, …, T−1) illustrating the expanding estimation window]
Rolling window
Rolling window uses an equal-weighted kernel of the most recent $\bar\omega$
observations to estimate the parameters of the forecasting model
$$\omega(s,t) = \begin{cases} 1 & t - \bar\omega + 1 \le s \le t \\ 0 & \text{otherwise} \end{cases}$$
Only one ‘design’ parameter: $\bar\omega$ (length of window)
Practical way to account for slowly-moving changes to the data
generating process
Does this address “breaks”?
window too long immediately after breaks
window too short further away
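The trade-off after a break can be seen in a stylized case where the only regressor is a constant, so the weighted least squares estimate is a weighted mean. The series, the window length of 10, and the function names are hypothetical.

```python
def expanding_weight(s, t):
    """Expanding window: weight 1 on all observations s = 1, ..., t."""
    return 1.0 if 1 <= s <= t else 0.0

def rolling_weight(s, t, window):
    """Rolling window: weight 1 on the most recent `window` observations."""
    return 1.0 if t - window + 1 <= s <= t else 0.0

def weighted_mean(y, t, weight):
    """Weighted LS estimate when the only regressor is a constant; y[s-1] holds y_s."""
    num = sum(weight(s, t) * y[s - 1] for s in range(1, t + 1))
    den = sum(weight(s, t) for s in range(1, t + 1))
    return num / den

# Hypothetical series with a break in the mean: 0 for 50 periods, then 1 for 10 periods
y = [0.0] * 50 + [1.0] * 10
t = len(y)
expanding = weighted_mean(y, t, expanding_weight)
rolling = weighted_mean(y, t, lambda s, t: rolling_weight(s, t, window=10))
print(round(expanding, 3), rolling)
```

Here the rolling window happens to match the post-break sample exactly; immediately after the break it would still include pre-break data, which is the "window too long" problem noted above.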
Rolling window
[Figure: time line (t−w+1, t−w+2, …, t, t+1, t+2, …, T−1) illustrating the rolling estimation window]
Fixed window
Fixed window uses only the first $\bar\omega_0$ observations to once and for all
estimate the parameters of the forecasting model
$$\omega(s,t) = \begin{cases} 1 & 1 \le s \le \bar\omega_0 \\ 0 & \text{otherwise} \end{cases}$$
This method is typically employed when the costs of estimation are
very high, so re-estimating the model with new data is prohibitively
expensive or impractical in real time
The method also makes analytical results easier
Fixed window
[Figure: time line (1, …, w, …, t, t+1, t+2, …, T−1) illustrating the fixed estimation window]
Exponentially declining weights
In the presence of model instability, it is common to discount past
observations using weights that get smaller, the older the data
Exponentially declining weights take the following form:
$$\omega(s,t) = \begin{cases} \lambda^{t-s} & 1 \le s \le t \\ 0 & \text{otherwise} \end{cases}$$
$0 < \lambda < 1$. This method is sometimes called discounted least squares as the discount factor, λ, puts less weight on past observations
Comparisons
Expanding estimation window: number of observations available for estimating model parameters increases with the sample size
Effect of estimation error gets reduced
Fixed/rolling/discounted window: parameter estimation error continues to affect the forecasts even as the sample grows large
estimates of model parameters are inconsistent
Forecasts vary more under the short (fixed and rolling) estimation windows than under the expanding window
[Figures: US stock index; monthly US stock returns; monthly inflation; US T-bill rate; US stock market volatility]
Example: Portfolio Choice under Mean-Variance Utility
T-bills with known payoff $r_f$ vs stocks with uncertain return $r^s_{t+1}$ and excess return $r_{t+1} = r^s_{t+1} - r_f$
$W_t$ = 1 dollar: initial wealth
$\omega_t$: portion of portfolio held in stocks at time t
$(1 - \omega_t)$: portion of portfolio held in T-bills
$W_{t+1}$: future wealth
$$W_{t+1} = (1 - \omega_t) r_f + \omega_t (r_{t+1} + r_f) = r_f + \omega_t r_{t+1}$$
Investor chooses $\omega_t$ to maximize mean-variance utility:
$$E_t[U(W_{t+1})] = E_t[W_{t+1}] - \frac{A}{2} Var_t(W_{t+1})$$
$E_t[W_{t+1}]$ and $Var_t(W_{t+1})$: conditional mean and variance of $W_{t+1}$
Portfolio Choice under Mean-Variance Utility (cont.)
Suppose stock returns follow the process
$$r_{t+1} = \mu + x_t + \varepsilon_{t+1}$$
$x_t \sim (0, \sigma_x^2)$, $\varepsilon_{t+1} \sim (0, \sigma_\varepsilon^2)$, $cov(x_t, \varepsilon_{t+1}) = 0$
$x_t$: predictable component given information at t
$\varepsilon_{t+1}$: unpredictable innovation (shock)
Uninformed investor’s (no information on $x_t$) stock holding:
$$\omega_t^* = \arg\max_{\omega_t} \left\{ \omega_t \mu + r_f - \frac{A}{2}\omega_t^2(\sigma_x^2 + \sigma_\varepsilon^2) \right\} = \frac{\mu}{A(\sigma_x^2 + \sigma_\varepsilon^2)}$$
$$E[U(W_{t+1}(\omega_t^*))] = r_f + \frac{\mu^2}{2A(\sigma_x^2 + \sigma_\varepsilon^2)} = r_f + \frac{S^2}{2A}$$
$S = \mu / \sqrt{\sigma_x^2 + \sigma_\varepsilon^2}$: unconditional Sharpe ratio
Portfolio Choice under Mean-Variance Utility (cont.)
Informed investor knows $x_t$. His stock holdings are
$$\omega_t^* = \frac{\mu + x_t}{A\sigma_\varepsilon^2}$$
$$E_t[U(W_{t+1}(\omega_t^*))] = r_f + \frac{(\mu + x_t)^2}{2A\sigma_\varepsilon^2}$$
Average (unconditional expectation) value of this is
$$E[E_t[U(W_{t+1}(\omega_t^*))]] = r_f + \frac{\mu^2 + \sigma_x^2}{2A\sigma_\varepsilon^2}$$
Increase in expected utility due to knowing the predictor variable:
$$E[U_{inf}] - E[U_{uninf}] = \frac{\mu^2 + \sigma_x^2}{2A\sigma_\varepsilon^2} - \frac{\mu^2}{2A(\sigma_x^2 + \sigma_\varepsilon^2)} = \frac{R^2(1 + S^2)}{2A(1 - R^2)} \approx \frac{\sigma_x^2}{2A\sigma_\varepsilon^2} = \frac{R^2}{2A(1 - R^2)}$$
where $R^2 = \sigma_x^2/(\sigma_x^2 + \sigma_\varepsilon^2)$ and the approximation holds when $S^2$ is small
Plausible empirical numbers, e.g., $R^2 = 0.005$ for monthly returns and A = 3, give an annualized certainty equivalent return of about 1%
Lecture 2: Univariate Forecasting Models
UCSD, January 18 2017
Allan Timmermann
UC San Diego
Timmermann (UCSD) ARMA Winter, 2017 1 / 59
1 Introduction to ARMA models
2 Covariance Stationarity and Wold Representation Theorem
3 Forecasting with ARMA models
4 Estimation and Lag Selection for ARMA Models
Choice of Lag Order
5 Random walk model
6 Trend and Seasonal Components
Seasonal components
Trended Variables
Introduction: ARMA models
When building a forecasting model for an economic or financial variable, the variable’s own past time series is often the first thing that comes to mind
Many time series are persistent
Effect of past and current shocks takes time to evolve
Auto Regressive Moving Average (ARMA) models
Work horse of forecast profession since Box and Jenkins (1970)
Remain the centerpiece of many applied forecasting courses
Used extensively commercially
Why are ARMA models so popular?
1 Minimalist demand on forecaster’s information set: Need only past history of the variable
$$I_T = \{y_1, y_2, \ldots, y_{T-1}, y_T\}$$
"Reduced form": No need to derive fully specified model for y
By excluding other variables, ARMA forecasts show how useful the past of a time series is for predicting its future
2 Empirical success: ARMA forecasts often provide a good ‘benchmark’ and have proven surprisingly difficult to beat in empirical work
3 ARMA models underpinned by theoretical arguments
Wold Representation Theorem: Covariance stationary processes can be represented as a (possibly infinite order) moving average process
ARMA models have certain optimality properties among linear projections of a variable on its own past and past shocks to the series
ARMA models are not optimal in a global sense - it may be optimal to use nonlinear transformations of past values of the series or to condition on a wider information set ("other variables")
Covariance Stationarity: Definition
A time series, or stochastic process, $\{y_t\}_{t=-\infty}^{\infty}$, is covariance stationary if
The mean of $y_t$, $\mu_t = E[y_t]$, is the same for all values of t: $\mu_t = \mu$
without loss of generality we set $\mu_t = 0$ for all t [de-meaning]
The autocovariance exists and does not depend on t, but only on the "distance", j, i.e., $E[y_t y_{t-j}] \equiv \gamma(j, t) = \gamma(j)$ for all t
Autocovariance measures how strong the covariation is between current and past values of a time series
If $y_t$ is independently distributed over time, then $E[y_t y_{t-j}] = 0$ for all $j \neq 0$
Covariance Stationarity: Interpretation
History repeats: if the series changed fundamentally over time, the past would not be useful for predicting the future of the series. To rule out this situation, we have to assume a certain degree of stability of the series.
This is known as covariance stationarity
Covariance stationarity rules out shifting patterns such as
trends in the mean of a series
breaks in the mean, variance, or autocovariance of a series
Covariance stationarity allows us to use historical information to construct a forecasting model and predict the future
Under covariance stationarity $Cov(y_{2016}, y_{2015}) = Cov(y_{2017}, y_{2016})$. This allows us to predict $y_{2017}$ from $y_{2016}$
White noise
Covariance stationary processes can be built from white noise:
Definition: A stochastic process, $\varepsilon_t$, is called white noise if it has zero mean, constant variance, and is serially uncorrelated:
$$E[\varepsilon_t] = 0, \quad Var(\varepsilon_t) = \sigma^2, \quad E[\varepsilon_t \varepsilon_s] = 0 \text{ for all } t \neq s$$
Wold Representation Theorem
Any covariance stationary process can be written as an infinite order MA model, MA(∞), with coefficients $\theta_j$ that are independent of t:
Theorem (Wold’s Representation Theorem): Any covariance stationary stochastic process $\{y_t\}$ can be represented as a linear combination of serially uncorrelated lagged white noise terms $\varepsilon_t$ and a linearly deterministic component, $\mu_t$:
$$y_t = \sum_{j=0}^{\infty} \theta_j \varepsilon_{t-j} + \mu_t$$
where $\{\theta_j\}$ are independent of time and $\sum_{j=0}^{\infty} \theta_j^2 < \infty$.
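As a small illustration of building a covariance stationary series from white noise, the sketch below simulates an MA(1) process and compares sample autocovariances with their theoretical values, (1 + θ²)σ², θσ², and 0 at lags 0, 1, and 2. Gaussian shocks, θ = 0.5, and the helper `sample_autocov` (which uses divisor n) are assumptions for the example.

```python
import random

random.seed(2)
T, theta = 20000, 0.5
eps = [random.gauss(0, 1) for _ in range(T + 1)]   # white noise with unit variance

# MA(1): y_t = eps_t + theta * eps_{t-1}
y = [eps[t] + theta * eps[t - 1] for t in range(1, T + 1)]

def sample_autocov(series, j):
    """Sample autocovariance at lag j (divisor n, de-meaned)."""
    n = len(series)
    m = sum(series) / n
    return sum((series[t] - m) * (series[t - j] - m) for t in range(j, n)) / n

gammas = [sample_autocov(y, j) for j in range(3)]
print([round(g, 2) for g in gammas])
```

The autocovariances depend only on the lag, not on calendar time, and vanish beyond lag 1 for this process.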
Wold Representation Theorem: Discussion
Since $E[\varepsilon_t] = 0$, $E[\varepsilon_t^2] = \sigma^2 \geq 0$, $E[\varepsilon_t \varepsilon_s] = 0$ for all $t \neq s$, $\varepsilon_t$ is not predictable using linear models of past data
Practical concern: MA order is potentially infinite
Since $\sum_{j=0}^{\infty} \theta_j^2 < \infty$, the parameters are likely to die off over time - a finite approximation to the infinite MA process could be appropriate
In practice we need to construct $\varepsilon_t$ from data (filtering)
MA representation holds apart from a possible deterministic term, $\mu_t$, which is perfectly predictable infinitely far into the future
e.g., constant, linear time trend, seasonal pattern, or sinusoid with known periodicity
Estimation of Autocovariances
Autocovariances and autocorrelations can be estimated from sample data (sample t = 1, ..., T):
$$\widehat{Cov}(y_t, y_{t-j}) = \frac{1}{T-j-1} \sum_{t=j+1}^{T} (y_t - \bar{y})(y_{t-j} - \bar{y}), \qquad \hat{\rho}_j = \frac{\widehat{Cov}(y_t, y_{t-j})}{\widehat{Var}(y_t)}$$
where $\bar{y} = (1/T)\sum_{t=1}^{T} y_t$ is the sample mean of y
Testing for autocorrelation: Q-stat can be used to test for serial correlation of order 1, ..., m:
$$Q = T \sum_{j=1}^{m} \hat{\rho}_j^2 \sim \chi^2_m$$
(asymptotically, under the null of no serial correlation)
Small p-values (below 0.05) suggest significant serial correlation
Autocovariances in matlab
autocorr: computes sample autocorrelation
parcorr: computes sample partial autocorrelation
lbqtest: computes Ljung-Box Q-test for residual autocorrelation
Figure: Sample autocorrelation for US T-bill rate
Figure: Sample autocorrelation for US stock returns
Autocorrelations and predictability
The more strongly autocorrelated a variable is, the easier it is to predict its mean
strong serial correlation means the series is slowly mean reverting and so the past is useful for predicting the future
strongly serially correlated variables include
interest rates (in levels)
level of inflation rate (year on year)
weakly serially correlated or uncorrelated variables include
stock returns
changes in inflation
growth rate in corporate dividends
Lag Operator and Lag Polynomials
The lag operator, L, when applied to any variable simply lags the variable by one period:
$$L y_t = y_{t-1}, \qquad L^p y_t = y_{t-p}$$
Lag polynomials such as $\phi(L)$ take the form
$$\phi(L) = \sum_{i=0}^{p} \phi_i L^i$$
For example, if p = 2 and $\phi(L) = 1 - \phi_1 L - \phi_2 L^2$, then
$$\phi(L) y_t = 1 \times y_t - \phi_1 L y_t - \phi_2 L^2 y_t = y_t - \phi_1 y_{t-1} - \phi_2 y_{t-2}$$
ARMA Models
Autoregressive models specify y as a function of its own lags
Moving average models specify y as a weighted average of past shocks (innovations) to the series
ARMA(p, q) specification for a stationary variable $y_t$:
$$y_t = \phi_1 y_{t-1} + \ldots + \phi_p y_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \ldots + \theta_q \varepsilon_{t-q}$$
In lag polynomial notation
$$\phi(L) y_t = \theta(L) \varepsilon_t$$
$$\phi(L) = 1 - \sum_{i=1}^{p} \phi_i L^i, \qquad \theta(L) = \sum_{i=0}^{q} \theta_i L^i = 1 + \theta_1 L + \ldots + \theta_q L^q$$
AR(1) Model
ARMA(1, 0) or AR(1) model takes the form:
$$y_t = \phi_1 y_{t-1} + \varepsilon_t, \qquad (1 - \phi_1 L) y_t = \varepsilon_t, \qquad \theta(L) = 1$$
By recursive backward substitution,
$$y_t = \phi_1 \underbrace{(\phi_1 y_{t-2} + \varepsilon_{t-1})}_{y_{t-1}} + \varepsilon_t = \phi_1^2 y_{t-2} + \varepsilon_t + \phi_1 \varepsilon_{t-1}$$
Iterating further backwards, we have, for $h \geq 1$,
$$y_t = \phi_1^h y_{t-h} + \sum_{s=0}^{h-1} \phi_1^s \varepsilon_{t-s} = \phi_1^h y_{t-h} + \theta(L) \varepsilon_t,$$
where $\theta(L)$ has coefficients $\theta_i = \phi_1^i$ (for i = 0, ..., h − 1)
AR(1) Model
AR(1) model is equivalent to an MA(∞) model as long as $\phi_1^h y_{t-h}$ becomes "small" in a mean square sense:
$$E\left[ y_t - \sum_{s=0}^{h-1} \phi_1^s \varepsilon_{t-s} \right]^2 = E\left[ \phi_1^h y_{t-h} \right]^2 \leq \phi_1^{2h} \gamma_y(0) \to 0$$
as $h \to \infty$, provided that $\phi_1^{2h} \to 0$, i.e., $|\phi_1| < 1$
Stationary AR(1) process has an equivalent MA(∞) representation
The root of the polynomial $\phi(z) = 1 - \phi_1 z = 0$ is $z^* = 1/\phi_1$, so $|\phi_1| < 1$ means that the root exceeds one in absolute value.
This is a necessary and sufficient condition for stationarity of an AR(1) process
Stationarity of an AR(p) model requires that all roots of the equation $\phi(z) = 0$ exceed one in absolute value (fall outside the unit circle)
MA(1) Model
ARMA(0, 1) or MA(1) model:
$$y_t = \varepsilon_t + \theta_1 \varepsilon_{t-1}, \quad \text{i.e., } \phi(L) = 1, \; \theta(L) = 1 + \theta_1 L$$
Backwards substitution yields
$$\varepsilon_t = \frac{y_t}{1 + \theta_1 L} = \sum_{s=0}^{h-1} (-\theta_1)^s y_{t-s} + (-\theta_1)^h \varepsilon_{t-h}$$
$\varepsilon_t$ can thus be written as an autoregression in y with lag polynomial coefficients $(-\theta_1)^s$, provided that $(-\theta_1)^h \varepsilon_{t-h}$ gets small (in mean square) as h increases, i.e., $|\theta_1| < 1$
MA(q) is invertible if the roots of $\theta(z)$ exceed one in absolute value
Invertible MA process can be written as an infinite order AR process
A stationary and invertible ARMA(p, q) process can be written as either an AR model or as an MA model, typically of infinite order
$$y_t = \phi(L)^{-1} \theta(L) \varepsilon_t \quad \text{or} \quad \theta(L)^{-1} \phi(L) y_t = \varepsilon_t$$
ARIMA representation for nonstationary processes
Suppose that d of the roots of $\phi(L)$ equal unity (one), while the remaining roots, those of $\tilde{\phi}(L)$, fall outside the unit circle. Factorization:
$$\phi(L) = \tilde{\phi}(L)(1 - L)^d$$
Applying (1 − L) to a series is called differencing
Let $\tilde{y}_t = (1 - L)^d y_t$ be the d-th difference of $y_t$.
Then $\tilde{\phi}(L) \tilde{y}_t = \theta(L) \varepsilon_t$
By assumption, the roots of $\tilde{\phi}(L)$ lie outside the unit circle so the differenced process, $\tilde{y}_t$, is stationary and can be studied instead of $y_t$
Processes with $d \neq 0$ need to be differenced to achieve stationarity and are called ARIMA(p, d, q)
Figure: US stock index
Figure: Monthly US stock returns (first-differenced prices)
Forecasting with AR models
Prediction is straightforward for AR(p) models
$$y_{T+1} = \phi_1 y_T + \ldots + \phi_p y_{T-p+1} + \varepsilon_{T+1}, \qquad \varepsilon_{T+1} \sim WN(0, \sigma^2)$$
Treat parameters as known and ignore estimation error
Using that $E[\varepsilon_{T+1} | I_T] = 0$ and $\{y_{T-p+1}, \ldots, y_T\} \in I_T$, the forecast of $y_{T+1}$ given $I_T$ becomes
$$f_{T+1|T} = \phi_1 y_T + \ldots + \phi_p y_{T-p+1}$$
$f_{T+1|T}$ means the forecast of $y_{T+1}$ given information at time T
$x \in I_T$ means "x is known at time T, i.e., belongs to the information set at time T"
Forecasting with AR models: The Chain Rule
When generating forecasts multiple steps ahead, unknown values of $y_{T+h}$ ($h \geq 1$) can be replaced with their forecasts, $f_{T+h|T}$, setting up a recursive system of forecasts:
$$f_{T+2|T} = \phi_1 f_{T+1|T} + \phi_2 y_T + \ldots + \phi_p y_{T-p+2}$$
$$f_{T+3|T} = \phi_1 f_{T+2|T} + \phi_2 f_{T+1|T} + \phi_3 y_T + \ldots + \phi_p y_{T-p+3}$$
...
$$f_{T+p+1|T} = \phi_1 f_{T+p|T} + \phi_2 f_{T+p-1|T} + \phi_3 f_{T+p-2|T} + \ldots + \phi_p f_{T+1|T}$$
'Chain rule' is equivalent to recursively expressing unknown future values $y_{T+i}$ as a function of $y_T$ and its past
Known values of y affect the forecasts of an AR(p) model up to horizon T + p, while forecasts further ahead only depend on past forecasts themselves
Forecasting with MA models
Consider the MA(q) model
$$y_{T+1} = \varepsilon_{T+1} + \theta_1 \varepsilon_T + \ldots + \theta_q \varepsilon_{T-q+1}$$
One-step-ahead forecast:
$$f_{T+1|T} = \theta_1 \varepsilon_T + \ldots + \theta_q \varepsilon_{T-q+1}$$
The shocks $\{\varepsilon_t\}$ are not directly observable but can be computed recursively (estimated) given a set of assumptions on the initial values for $\varepsilon_t$, t = 0, ..., q − 1
For the MA(1) model, we can set $\varepsilon_0 = 0$ and use the recursion
$$\varepsilon_1 = y_1$$
$$\varepsilon_2 = y_2 - \theta_1 \varepsilon_1 = y_2 - \theta_1 y_1$$
$$\varepsilon_3 = y_3 - \theta_1 \varepsilon_2 = y_3 - \theta_1 (y_2 - \theta_1 y_1)$$
Unobserved shocks can be written as a function of the parameter value $\theta_1$ and current and past values of y
Forecasting with MA models (cont.)
Simple recursions using past forecasts can also be employed to update the forecasts. For the MA(1) model we have
$$f_{t+1|t} = \theta_1 \varepsilon_t = \theta_1 (y_t - f_{t|t-1})$$
MA processes of infinite order: $y_{T+h}$ for $h \geq 1$ is
$$y_{T+h} = \theta(L) \varepsilon_{T+h} = \underbrace{\varepsilon_{T+h} + \theta_1 \varepsilon_{T+h-1} + \ldots + \theta_{h-1} \varepsilon_{T+1}}_{\text{unpredictable}} + \underbrace{\theta_h \varepsilon_T + \theta_{h+1} \varepsilon_{T-1} + \ldots}_{\text{predictable}}$$
Hence, if the shocks up to $\varepsilon_T$ were observed, the forecast would be
$$f_{T+h|T} = \theta_h \varepsilon_T + \theta_{h+1} \varepsilon_{T-1} + \ldots = \sum_{j=h}^{\infty} \theta_j \varepsilon_{T+h-j}$$
MA(q) model has limited memory: values of an MA(q) process more than q periods into the future are not predictable
Forecasting with mixed ARMA models
Consider a mixed ARMA(p, q) model
$$y_{T+1} = \phi_1 y_T + \phi_2 y_{T-1} + \ldots + \phi_p y_{T-p+1} + \varepsilon_{T+1} + \theta_1 \varepsilon_T + \ldots + \theta_q \varepsilon_{T-q+1}$$
Separate AR and MA prediction steps can be combined by recursively replacing future values of $y_{T+i}$ with their predicted values and setting $E[\varepsilon_{T+j} | I_T] = 0$ for $j \geq 1$:
$$f_{T+1|T} = \phi_1 y_T + \phi_2 y_{T-1} + \ldots + \phi_p y_{T-p+1} + \theta_1 \varepsilon_T + \ldots + \theta_q \varepsilon_{T-q+1}$$
$$f_{T+2|T} = \phi_1 f_{T+1|T} + \phi_2 y_T + \ldots + \phi_p y_{T-p+2} + \theta_2 \varepsilon_T + \ldots + \theta_q \varepsilon_{T-q+2}$$
...
$$f_{T+h|T} = \phi_1 f_{T+h-1|T} + \phi_2 f_{T+h-2|T} + \ldots + \phi_p f_{T-p+h|T} + \theta_h \varepsilon_T + \ldots + \theta_q \varepsilon_{T-q+h}$$
Note: $f_{T-j+h|T} = y_{T-j+h}$ if $j \geq h$, and we assumed $q \geq h$
Mean Square Forecast Errors
By the Wold Representation Theorem, all stationary ARMA processes can be written as an MA process with associated forecast error
$$y_{T+h} - f_{T+h|T} = \varepsilon_{T+h} + \theta_1 \varepsilon_{T+h-1} + \ldots + \theta_{h-1} \varepsilon_{T+1}$$
Mean square forecast error:
$$E[(y_{T+h} - f_{T+h|T})^2] = E[(\varepsilon_{T+h} + \theta_1 \varepsilon_{T+h-1} + \ldots + \theta_{h-1} \varepsilon_{T+1})^2] = \sigma^2 (1 + \theta_1^2 + \ldots + \theta_{h-1}^2)$$
For the AR(1) model, $\theta_i = \phi_1^i$ and so the MSE becomes
$$E[(y_{T+h} - f_{T+h|T})^2] = \sigma^2 (1 + \phi_1^2 + \ldots + \phi_1^{2(h-1)}) = \frac{\sigma^2 (1 - \phi_1^{2h})}{1 - \phi_1^2}$$
Direct vs. Iterated multi-period forecasts
Two ways to generate multi-period forecasts (h > 1):
Iterated approach: forecasting model is estimated at the highest
frequency and iterated upon to obtain forecasts at longer horizons
Direct approach: forecasting model is matched with the desired
forecast horizon: One model for each horizon, h. The dependent
variable is yt+h while all predictor variables are dated period t
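Example: AR(1) model $y_t = \phi_1 y_{t-1} + \varepsilon_t$
Iterated approach: use the estimated value, $\hat{\phi}_1$, to obtain a forecast
$$f_{T+h|T} = \hat{\phi}_1^h y_T$$
Direct approach: Estimate h-period lag relationship:
$$y_{t+h} = \underbrace{\phi_1^h}_{\tilde{\phi}_{1h}} y_t + \underbrace{\sum_{s=0}^{h-1} \phi_1^s \varepsilon_{t+h-s}}_{\tilde{\varepsilon}_{t+h}}$$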
Direct vs. Iterated multi-period forecasts: Trade-offs
When the autoregressive model is correctly specified, the iterated
approach makes more efficient use of the data and so tends to
produce better forecasts
Conversely, by virtue of being a linear projection, the direct approach
tends to be more robust towards misspecification
When the model is grossly misspecified, iteration on the misspecified
model can exacerbate biases and may result in a larger MSE
Which approach performs best depends on the true DGP, the degree
of model misspecification (both unknown), and the sample size
Empirical evidence in Marcellino et al. (2006) suggests that the
iterated approach works best on average for macro variables
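The two approaches can be compared on simulated data. The sketch below is illustrative only: it assumes a correctly specified AR(1) with no intercept and standard normal shocks, and the helper names (simulate_ar1, iterated_coef, direct_coef) are hypothetical, not part of the lecture. It estimates the h-step coefficient both ways, by raising the one-step OLS estimate to the power h and by regressing y_{t+h} directly on y_t.

```python
import random

def simulate_ar1(phi, T, seed=0):
    """Simulate y_t = phi*y_{t-1} + eps_t with standard normal shocks (assumed DGP)."""
    rng = random.Random(seed)
    y = [0.0]
    for _ in range(T - 1):
        y.append(phi * y[-1] + rng.gauss(0.0, 1.0))
    return y

def slope_no_intercept(x, y):
    """OLS slope from regressing y on x without an intercept."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def iterated_coef(y, h):
    """Iterated approach: estimate the one-step coefficient, then raise it to the power h."""
    phi_hat = slope_no_intercept(y[:-1], y[1:])
    return phi_hat ** h

def direct_coef(y, h):
    """Direct approach: regress y_{t+h} on y_t to estimate the h-step coefficient."""
    return slope_no_intercept(y[:-h], y[h:])

h = 3
y = simulate_ar1(phi=0.8, T=2000, seed=1)
coef_iter = iterated_coef(y, h)
coef_dir = direct_coef(y, h)

# Both forecasts scale the last observation by the estimated h-step coefficient
f_iter = coef_iter * y[-1]
f_dir = coef_dir * y[-1]
print("iterated:", coef_iter, "direct:", coef_dir, "true:", 0.8 ** h)
```

With a correctly specified model and a long sample both coefficients should land near phi^h; the trade-offs listed above concern short samples and misspecified models, which this sketch does not explore.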
Estimation of ARIMA models
ARIMA models can be estimated by maximum likelihood methods
ARIMA models are based on linear projections (regressions) which
provide reasonable forecasts of linear processes under MSE loss
There may be nonlinear models of past data that provide better
predictors:
Under MSE loss the best predictor is the conditional mean, which need
not be a linear function of the past
Estimation (continued)
AR(p) models with known p > 0 can be estimated by ordinary least
squares by regressing $y_t$ on $y_{t-1}, y_{t-2}, \ldots, y_{t-p}$
Assuming the data are covariance stationary, OLS estimates of the
coefficients $\phi_1, \ldots, \phi_p$ are consistent and asymptotically normal
If the AR model is correctly specified, such estimates are also
asymptotically efficient
Least squares estimates are not optimal in finite samples and will be
biased
For the AR(1) model, $\hat{\phi}_1$ has a downward bias of approximately $(1 + 3\phi_1)/T$
For higher order models, the biases are complicated and can go in
either direction
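The small-sample bias of the AR(1) least squares estimate can be checked by simulation. The following is a rough Monte Carlo sketch, assuming Gaussian shocks, a stationary start-up value and an OLS regression that includes an intercept; the function names are hypothetical. It compares the average estimate across replications with the approximation phi − (1 + 3·phi)/T.

```python
import math
import random

def ols_ar1_slope(y):
    """OLS slope from regressing y_t on a constant and y_{t-1}."""
    x, z = y[:-1], y[1:]
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    sxz = sum((a - mx) * (b - mz) for a, b in zip(x, z))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxz / sxx

def average_estimate(phi, T, n_rep, seed=0):
    """Average OLS estimate of phi over n_rep simulated AR(1) samples of length T."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rep):
        # Draw the first observation from the stationary distribution
        y = [rng.gauss(0.0, 1.0 / math.sqrt(1.0 - phi ** 2))]
        for _ in range(T - 1):
            y.append(phi * y[-1] + rng.gauss(0.0, 1.0))
        total += ols_ar1_slope(y)
    return total / n_rep

phi, T = 0.9, 50
avg = average_estimate(phi, T, n_rep=2000, seed=1)
approx = phi - (1 + 3 * phi) / T  # bias approximation quoted on the slide
print("average estimate:", avg, "approximation:", approx)
```

The average estimate should fall clearly below the true value of 0.9, in the neighborhood of the approximation; exact agreement is not expected since the formula is only a first-order result.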
Estimation and forecasting with ARMA models in matlab
regARIMA: creates regression model with ARIMA time series errors
estimate: estimates parameters of regression models with ARIMA
errors
Pure AR models: can be estimated by OLS
forecast: forecast ARIMA models
Lag length selection
In most situations, forecasters do not know the true or optimal lag
orders, p and q
Judgmental approaches based on examining the autocorrelations and
partial autocorrelations of the data
Model selection criteria: Different choices of (p, q) result in a set of
models $\{M_k\}_{k=1}^{K}$, where $M_k$ represents model k and the search is
conducted over K different combinations of p and q
Information criteria trade off fit versus parsimony
Information criteria
Information criteria (IC) for linear ARMA specifications:
$$IC_k = \ln \hat{\sigma}_k^2 + n_k g(T)$$
ICs trade off fit (gets better with more parameters) against parsimony
(fewer parameters is better). Choose k to minimize IC
$\hat{\sigma}_k^2$: residual variance (sum of squared residuals divided by T) of model k. Lower $\hat{\sigma}_k^2$ ⇔ better fit
$n_k = p_k + q_k + 1$: number of estimated parameters for model k
g(T): penalty term that depends on the sample size, T:
Criterion               g(T)
AIC (Akaike (1974))     2/T
BIC (Schwarz (1978))    ln(T)/T
In matlab: aicbic
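As a small worked illustration of the formula IC_k = ln(sigma_k^2) + n_k g(T), the sketch below evaluates AIC and BIC for three candidate models. The residual variances and parameter counts are made-up numbers chosen for illustration, and the function names are hypothetical.

```python
import math

def information_criterion(sigma2, n_params, T, criterion):
    """IC = ln(sigma2) + n_params * g(T), with g(T) = 2/T (AIC) or ln(T)/T (BIC)."""
    if criterion == "aic":
        g = 2.0 / T
    elif criterion == "bic":
        g = math.log(T) / T
    else:
        raise ValueError("criterion must be 'aic' or 'bic'")
    return math.log(sigma2) + n_params * g

def select_model(candidates, T, criterion):
    """Return the label of the candidate (label, sigma2, n_params) with the smallest IC."""
    best = min(candidates, key=lambda c: information_criterion(c[1], c[2], T, criterion))
    return best[0]

# Hypothetical candidates: (label, residual variance, number of parameters)
candidates = [("AR(1)", 1.00, 2), ("AR(2)", 0.97, 3), ("ARMA(2,2)", 0.95, 5)]
T = 100

aic_choice = select_model(candidates, T, "aic")
bic_choice = select_model(candidates, T, "bic")
print("AIC selects", aic_choice, "and BIC selects", bic_choice)
```

In this example AIC picks the AR(2) while BIC, whose penalty ln(T)/T exceeds 2/T for T = 100, picks the smaller AR(1).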
Marcellino, Stock and Watson (2006)
Random walk model
The random walk model is an AR(1) with $\phi_1 = 1$:
$$y_t = y_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim WN(0, \sigma^2)$$
This model implies that the change in $y_t$ is unpredictable:
$$\Delta y_t = y_t - y_{t-1} = \varepsilon_t$$
For example, the level of stock prices is easy to predict, but not its
change (rate of return if using logarithm of stock index)
Shocks to the random walk have permanent effects: A one unit shock
moves the series by one unit forever. This is in sharp contrast to a
mean-reverting process
Random walk model (cont)
The variance of a random walk increases over time so the distribution
of yt changes over time. Suppose that yt started at zero, y0 = 0 :
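$$y_1 = y_0 + \varepsilon_1 = \varepsilon_1$$
$$y_2 = y_1 + \varepsilon_2 = \varepsilon_1 + \varepsilon_2$$
...
$$y_t = \varepsilon_1 + \varepsilon_2 + \ldots + \varepsilon_{t-1} + \varepsilon_t$$
From this we have
$$E[y_t] = 0, \qquad var(y_t) = t\sigma^2, \qquad \lim_{t \to \infty} var(y_t) = \infty$$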
The variance of y grows proportionally with time
A random walk does not revert to the mean but wanders up and
down at random
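The linear growth of the variance can be illustrated with a short simulation. This is a sketch under the assumptions of Gaussian shocks with unit variance and y_0 = 0; the function names are hypothetical. It simulates many independent random walks and compares the cross-path variance of y_t at t = 25 and t = 100 with the theoretical values t·sigma^2.

```python
import random

def random_walk_terminal_values(t, n_paths, sigma=1.0, seed=0):
    """Simulate n_paths random walks started at zero and return their values at time t."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_paths):
        y = 0.0
        for _ in range(t):
            y += rng.gauss(0.0, sigma)  # y_t = y_{t-1} + eps_t
        values.append(y)
    return values

def sample_variance(v):
    """Unbiased sample variance."""
    m = sum(v) / len(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)

var_25 = sample_variance(random_walk_terminal_values(25, 4000, seed=1))
var_100 = sample_variance(random_walk_terminal_values(100, 4000, seed=2))
print("var at t=25:", var_25, "var at t=100:", var_100)
```

The two sample variances should be close to 25 and 100 respectively, so quadrupling the horizon roughly quadruples the variance.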
Forecasts from random walk model
Recall that forecasts from the AR(1) process $y_t = \phi_1 y_{t-1} + \varepsilon_t$,
$\varepsilon_t \sim WN(0, \sigma^2)$ are simply
$$f_{t+h|t} = \phi_1^h y_t$$
For the random walk model φ1 = 1, so for all forecast horizons, h, the
forecast is simply the current value:
ft+h|t = yt
The basic random walk model says that the value of the series next
period (given the history of the series) equals its current value plus an
unpredictable change:
Forecast of tomorrow = today’s value
Random steps, $\varepsilon_t$, make $y_t$ a "random walk"
Random walk with a drift
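Introduce a non-zero drift term, $\delta$:
$$y_t = \delta + y_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim WN(0, \sigma^2)$$
This is a popular model for the logarithm of stock prices
The drift term, $\delta$, plays the same role as a time trend. Assuming
again that the series started at $y_0$, we have
$$y_t = \delta t + y_0 + \varepsilon_1 + \varepsilon_2 + \ldots + \varepsilon_{t-1} + \varepsilon_t$$
Similarly,
$$E[y_t] = y_0 + \delta t, \qquad var(y_t) = t\sigma^2, \qquad \lim_{t \to \infty} var(y_t) = \infty$$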
Summary of properties of random walk
Changes in random walk are unpredictable
Shocks have permanent effects
Variance grows in proportion with the forecast horizon
These points are important for forecasting:
point forecasts never revert to a mean
since the variance goes to infinity, the width of interval forecasts
increases without bound as the forecast horizon grows
Uncertainty grows without bounds
Logs, levels and growth rates
Certain transformations of economic variables such as their logarithm
are often easier to forecast than the “raw” data
If the standard deviation of a time series is approximately proportional
to its level, then the standard deviation of the change in the logarithm
of the series is approximately constant:
$$Y_{t+1} = Y_t \exp(\varepsilon_{t+1}), \quad \varepsilon_{t+1} \sim (0, \sigma^2) \quad \Leftrightarrow \quad \ln(Y_{t+1}) - \ln(Y_t) = \varepsilon_{t+1}$$
Example: US GDP follows an upward trend. Instead of studying the
level of US GDP, we can study its growth rate which is not trending
The first difference of the log of Yt is ∆ ln(Yt ) = ln(Yt )− ln(Yt−1)
The percentage change in Yt between t − 1 and t is approximately
100∆ ln(Yt ). This can be interpreted as a growth rate
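A quick numerical check of the approximation between log differences and percentage changes is sketched below. The series of levels is made up for illustration and the function names are hypothetical.

```python
import math

def log_growth_pct(levels):
    """100 times the first difference of the log of the series."""
    return [100.0 * (math.log(b) - math.log(a)) for a, b in zip(levels, levels[1:])]

def simple_growth_pct(levels):
    """Exact percentage change between consecutive observations."""
    return [100.0 * (b / a - 1.0) for a, b in zip(levels, levels[1:])]

# Hypothetical levels growing by 2%, 1% and 5%
levels = [100.0, 102.0, 103.02, 108.171]

log_g = log_growth_pct(levels)
pct_g = simple_growth_pct(levels)
for lg, pg in zip(log_g, pct_g):
    print("log growth:", round(lg, 3), "percentage change:", round(pg, 3))
```

The two measures agree closely for small changes and drift apart as the change gets larger, with the log difference always the smaller of the two for positive growth.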
Unit root processes
Random walk is a special case of a unit root process which has a unit
root in the AR polynomial, i.e.,
(1− L)φ̃(L)yt = θ(L)εt
where the roots of φ̃(L) lie outside the unit circle
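We can test for a unit root using an Augmented Dickey-Fuller (ADF)
test:
$$\Delta y_t = \alpha + \beta y_{t-1} + \sum_{i=1}^{p} \psi_i \Delta y_{t-i} + \varepsilon_t$$
where the $\psi_i$ are the coefficients on the lagged differences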
In matlab: adftest
Under the null of a unit root, β = 0. Under the alternative of
stationarity, β < 0
Test is based on the t-stat of β. Test statistic follows a non-standard
distribution
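To make the test statistic concrete, here is a minimal sketch of the simplest Dickey-Fuller regression, with a constant but without lagged differences (p = 0), applied to a simulated stationary AR(1). This is an illustration only, not a substitute for adftest; the function names are hypothetical. The resulting t-statistic has to be compared with Dickey-Fuller critical values (roughly −2.86 at the 5% level for the case with a constant), not with the usual normal ones.

```python
import math
import random

def df_tstat(y):
    """t-statistic on beta in dy_t = alpha + beta*y_{t-1} + e_t (no lagged differences)."""
    x = y[:-1]
    dy = [b - a for a, b in zip(y[:-1], y[1:])]
    n = len(x)
    mx, md = sum(x) / n, sum(dy) / n
    sxx = sum((a - mx) ** 2 for a in x)
    beta = sum((a - mx) * (d - md) for a, d in zip(x, dy)) / sxx
    alpha = md - beta * mx
    ssr = sum((d - alpha - beta * a) ** 2 for a, d in zip(x, dy))
    s2 = ssr / (n - 2)  # residual variance with two estimated parameters
    return beta / math.sqrt(s2 / sxx)

def simulate_ar1(phi, T, seed=0):
    """Simulate y_t = phi*y_{t-1} + eps_t with standard normal shocks."""
    rng = random.Random(seed)
    y = [0.0]
    for _ in range(T - 1):
        y.append(phi * y[-1] + rng.gauss(0.0, 1.0))
    return y

stationary = simulate_ar1(phi=0.5, T=500, seed=1)
t_stat = df_tstat(stationary)
print("DF t-statistic for a stationary AR(1):", t_stat)
```

For this strongly mean-reverting series the statistic is far below the critical value, so the unit root null would be rejected.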
Critical values for Dickey-Fuller test
Classical decomposition of time series into three
components
Cycles (stochastic) - captured using ARMA models
Trend
trend captures the slow, long-run evolution in the outcome
for many series in levels, this is the most important component for
long-run predictions
Seasonals
regular (deterministic) patterns related to time of the year (day), public
holidays, etc.
Example: CO2 concentration (ppm) - Dave Keeling,
Scripps, 1957-2005
Seasonality
Sources of seasonality: technology, preferences and institutions are
linked to the calendar
weather (agriculture, construction)
holidays, religious events
Many economic time series display seasonal variations:
home sales
unemployment figures
stock prices (?)
commodity prices?
Handling seasonalities
One strategy is to remove the seasonal component and work with
seasonally adjusted series
Problem: We might be interested in forecasting the actual
(non-adjusted) series, not just the seasonally adjusted part
Seasonal components
Seasonal patterns can be deterministic or stochastic
Stochastic modeling approach uses differencing to incorporate
seasonal components - e.g., year-on-year changes
Box and Jenkins (1970) considered seasonal ARIMA, or SARIMA,
models of the form
$$\phi(L)(1 - L^S) y_t = \theta(L) \varepsilon_t.$$
$(1 - L^S) y_t = y_t - y_{t-S}$: computes year-on-year changes
Modeling seasonality
Seasonality can be modeled through seasonal dummies. Let S be the
number of seasons per year.
S = 4 (quarterly data)
S = 12 (monthly data)
S = 52 (weekly data)
For example, the following set of dummies would be used to model
quarterly variation in the mean:
$D_{1t}$ = (1 0 0 0 1 0 0 0 1 0 0 0)
$D_{2t}$ = (0 1 0 0 0 1 0 0 0 1 0 0)
$D_{3t}$ = (0 0 1 0 0 0 1 0 0 0 1 0)
$D_{4t}$ = (0 0 0 1 0 0 0 1 0 0 0 1)
D1 picks up mean effects in the first quarter. D2 picks up mean
effects in the second quarter, etc. At any point in time only one of
the quarterly dummies is activated
Pure seasonal dummy model
The pure seasonal dummy model is
$$y_t = \sum_{s=1}^{S} \delta_s D_{st} + \varepsilon_t$$
We only regress yt on intercept terms (seasonal dummies) that vary
across seasons. δs summarizes the seasonal pattern over the year
Alternatively, we can include an intercept and S − 1 seasonal
dummies.
Now the intercept captures the mean of the omitted season and the
remaining seasonal dummies give the seasonal increase/decrease
relative to the omitted season
Never include both a full set of S seasonal dummies and an intercept
term - perfect collinearity
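The pure seasonal dummy model is easy to sketch in code, because regressing on a full set of dummies without an intercept simply returns the sample mean of each season. The quarterly series and the function names below are made up for illustration.

```python
def seasonal_dummies(T, S):
    """Row t holds the S dummies for observation t; exactly one of them equals 1."""
    return [[1 if t % S == s else 0 for s in range(S)] for t in range(T)]

def seasonal_means(y, S):
    """OLS on a full set of S dummies (no intercept) equals the per-season sample means."""
    D = seasonal_dummies(len(y), S)
    means = []
    for s in range(S):
        obs = [v for v, row in zip(y, D) if row[s] == 1]
        means.append(sum(obs) / len(obs))
    return means

def seasonal_forecast(y, S, h):
    """Forecast of y_{T+h}: the estimated mean of the season that period T+h falls in."""
    return seasonal_means(y, S)[(len(y) + h - 1) % S]

# Hypothetical quarterly data covering two years (first observation is a first quarter)
y = [10.0, 20.0, 30.0, 40.0, 12.0, 22.0, 32.0, 42.0]
deltas = seasonal_means(y, S=4)

# Equivalent parameterization: intercept (omitted first quarter) plus S-1 differences
intercept = deltas[0]
relative_effects = [d - intercept for d in deltas[1:]]
print(deltas, intercept, relative_effects, seasonal_forecast(y, 4, 1))
```

The second parameterization illustrates the point about collinearity: once the intercept absorbs the first-quarter mean, only three further dummies carry information.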
General seasonal effects
Holiday variation (HDV ) variables capture dates of holidays which
may change over time (Easter, Thanksgiving) - v1 of these:
$$y_t = \sum_{s=1}^{S} \delta_s D_{st} + \sum_{i=1}^{v_1} \delta_i^{HDV} HDV_{it} + \varepsilon_t$$
Seasonals
ARMA model with seasonal dummies takes the form
$$\phi(L) y_t = \sum_{s=1}^{S} \delta_s D_{st} + \theta(L) \varepsilon_t$$
Application of seasonal dummies can sometimes yield large
improvements in predictive accuracy
Example: day of the week, seasonal, and holiday dummies:
$$\mu_t = \sum_{day=1}^{7} \beta_{day} D_{day,t} + \sum_{holiday=1}^{H} \beta_{holiday} D_{holiday,t} + \sum_{month=1}^{12} \beta_{month} D_{month,t}$$
Adding deterministic seasonal terms to the ARMA component, the
value of y at time T + h can be predicted as follows:
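$$y_{T+h} = \sum_{day=1}^{7} \beta_{day} D_{day,T+h} + \sum_{holiday=1}^{H} \beta_{holiday} D_{holiday,T+h} + \sum_{month=1}^{12} \beta_{month} D_{month,T+h} + \tilde{y}_{T+h},$$
$$\phi(L) \tilde{y}_{T+h} = \theta(L) \varepsilon_{T+h}$$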
Deterministic trends
Let $Time_t$ be a deterministic time trend so that
$$Time_t = t, \qquad t = 1, \ldots, T$$
This time trend is perfectly predictable (deterministic)
Linear trend model:
$$Trend_t = \beta_0 + \beta_1 Time_t$$
β0 is the intercept (value at time zero)
β1 is the slope which is positive if the trend is increasing or negative if
the trend is decreasing
Examples of trended variables
US stock price index
Number of residents in Beijing, China
US labor participation rate for women (up) or men (down)
Exchange rates over long periods (?)
Interest rates (?)
Global mean temperature (?)
Quadratic trend
Sometimes the trend is nonlinear (curved) as when the variable
increases at an increasing or decreasing rate
For such cases we can use a quadratic trend:
$$Trend_t = \beta_0 + \beta_1 Time_t + \beta_2 Time_t^2$$
Caution: quadratic trends are mostly considered adequate local
approximations and can give rise to a variety of unrealistic shapes for
the trend if the forecast horizon is long
Log-linear trend
log-linear trends are used to describe time series that grow at a
constant exponential rate:
$$Trend_t = \beta_0 \exp(\beta_1 Time_t)$$
Although the trend is non-linear in levels, it is linear in logs:
$$\ln(Trend_t) = \ln(\beta_0) + \beta_1 Time_t$$
Deterministic Time Trends: summary
Three common time trend specifications:
Linear: $\mu_t = \mu_0 + \beta_0 t$
Quadratic: $\mu_t = \mu_0 + \beta_0 t + \beta_1 t^2$
Exponential: $\mu_t = \exp(\mu_0 + \beta_0 t)$
These global trends are unlikely to provide accurate descriptions of
the future value of most time series at long forecast horizons
Estimating trend models
Assuming MSE loss, we can estimate the trend parameters by solving
$$\hat{\theta} = \arg\min_{\theta} \left\{ \sum_{t=1}^{T} (y_t - Trend_t(\theta))^2 \right\}$$
Example: with a linear trend model we have
Trendt (θ) = β0 + β1Timet
θ = {β0, β1}
and we can estimate $\beta_0, \beta_1$ by OLS
$$(\hat{\beta}_0, \hat{\beta}_1) = \arg\min_{\beta_0, \beta_1} \left\{ \sum_{t=1}^{T} (y_t - \beta_0 - \beta_1 Time_t)^2 \right\}$$
Forecasting Trend
Suppose a time series is generated by the linear trend model
$$y_t = \beta_0 + \beta_1 Time_t + \varepsilon_t, \qquad \varepsilon_t \sim WN(0, \sigma^2)$$
Future values of $\varepsilon_t$ are unpredictable given current information, $I_t$:
$$E[\varepsilon_{t+h} | I_t] = 0$$
Suppose we want to predict the series at time T + h given
information $I_T$:
$$y_{T+h} = \beta_0 + \beta_1 Time_{T+h} + \varepsilon_{T+h}$$
Since $Time_{T+h} = T + h$ is perfectly predictable while $\varepsilon_{T+h}$ is
unpredictable, our best forecast (under MSE loss) becomes
$$f_{T+h|T} = \hat{\beta}_0 + \hat{\beta}_1 Time_{T+h}$$
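The linear trend forecast can be sketched in a few lines using the closed-form OLS solution for a regression on a constant and Time_t = t. The data below are an artificial, noise-free trend chosen so the result is easy to verify by hand; the function names are hypothetical.

```python
def fit_linear_trend(y):
    """OLS estimates (b0, b1) from regressing y_t on a constant and Time_t = t, t = 1..T."""
    T = len(y)
    time = list(range(1, T + 1))
    mt, my = sum(time) / T, sum(y) / T
    b1 = sum((t - mt) * (v - my) for t, v in zip(time, y)) / sum((t - mt) ** 2 for t in time)
    b0 = my - b1 * mt
    return b0, b1

def trend_forecast(y, h):
    """Forecast f_{T+h|T} = b0 + b1*(T+h), using that Time_{T+h} = T + h is known."""
    b0, b1 = fit_linear_trend(y)
    return b0 + b1 * (len(y) + h)

# Artificial series lying exactly on the line 2 + 0.5*t for t = 1..10
y = [2.0 + 0.5 * t for t in range(1, 11)]
b0_hat, b1_hat = fit_linear_trend(y)
f_4 = trend_forecast(y, h=4)  # forecast of y at time T + h = 14
print(b0_hat, b1_hat, f_4)
```

With real data the shocks would make the fit inexact, and the forecast error would reflect both the unpredictable future shock and the estimation error in the two coefficients.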
Lecture 3: Model Selection
UCSD, January 23 2017
Allan Timmermann1
1UC San Diego
Timmermann (UCSD) Model Selection Winter, 2017 1 / 35
1 Estimation Methods
2 Introduction to Model Selection
3 In-Sample Selection Methods
4 Sequential Selection
5 Information Criteria
6 Cross-Validation
7 Lasso
Least Squares Estimation I
The multivariate linear regression model with k regressors
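$$y_t = \sum_{j=1}^{k} \beta_j x_{j,t-1} + u_t, \qquad t = 1, \ldots, T$$
can be written more compactly as
$$y_t = \beta' x_{t-1} + u_t, \qquad \beta = (\beta_1, \ldots, \beta_k)', \qquad x_{t-1} = (x_{1,t-1}, \ldots, x_{k,t-1})'$$
or, in matrix form,
$$y = X\beta + u, \qquad \underset{T \times 1}{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_T \end{pmatrix}, \qquad \underset{T \times k}{X} = \begin{pmatrix} x_{1,0} & x_{2,0} & \cdots & x_{k,0} \\ x_{1,1} & x_{2,1} & \cdots & x_{k,1} \\ \vdots & \vdots & & \vdots \\ x_{1,T-1} & x_{2,T-1} & \cdots & x_{k,T-1} \end{pmatrix}$$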
Least Squares Estimation II
Ordinary Least Squares (OLS) minimizes the sum of squared (forecast) errors:
$$\hat{\beta} = \arg\min_{\beta} \sum_{t=1}^{T} (y_t - \beta' x_{t-1})^2$$
This is the same as minimizing $(y - X\beta)'(y - X\beta)$ and yields the solution
$$\hat{\beta} = (X'X)^{-1} X'y$$
(assuming that $X'X$ is of full rank). Four assumptions on the disturbance terms,
u, ensure that the OLS estimator has the smallest variance among all linear
unbiased estimators of β, i.e., it is the Best Linear Unbiased Estimator (BLUE)
1 Zero mean: E (ut ) = 0 for all t
2 Homoskedastic: Var(ut |X ) = σ2 for all t
3 No serial correlation: Cov(ut , us |X ) = 0 for all s, t
4 Orthogonal: E [ut |X ] = 0 for all t
Least Squares Estimation III
We can also write the OLS estimator as follows:
$$\hat{\beta} = \beta + (X'X)^{-1} X'u$$
Provided that u is normally distributed, $u \sim N(0, \sigma^2 I_T)$, we have
$$\hat{\beta} \sim N(\beta, \sigma^2 (X'X)^{-1})$$
A similar result can be established asymptotically
Thus we can use standard t−tests or F -tests to test if β is statistically
significant
Maximum Likelihood Estimation (MLE)
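Suppose the residuals u are independently, identically and normally distributed
$u_t \sim N(0, \sigma^2)$. Then the likelihood of $u_1, \ldots, u_T$ as a function of the parameters
$\theta = (\beta', \sigma^2)$, becomes
$$L(\theta) = (2\pi\sigma^2)^{-T/2} \exp\left( \frac{-1}{2\sigma^2} \sum_{t=1}^{T} u_t^2 \right) = (2\pi\sigma^2)^{-T/2} \exp\left( \frac{-1}{2\sigma^2} (y - X\beta)'(y - X\beta) \right)$$
Taking logs, we get the log-likelihood $LL(\theta) = \log(L(\theta))$:
$$LL(\theta) = \frac{-T}{2} \ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} (y - X\beta)'(y - X\beta).$$
The following parameter estimates maximize the LL:
$$\hat{\beta}_{MLE} = (X'X)^{-1} X'y, \qquad \hat{\sigma}^2_{MLE} = \frac{\hat{u}'\hat{u}}{T}$$
where $\hat{u} = y - X\hat{\beta}_{MLE}$ are the fitted residuals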
Generalized Method of Moments (GMM) I
Suppose we have data $(y_1, x_0), \ldots, (y_T, x_{T-1})$ drawn from a probability
distribution $p((y_1, x_0), \ldots, (y_T, x_{T-1}) | \theta_0)$ with true parameters $\theta_0$. The
parameters can be identified from a set of population moment conditions
$$E[m((y_t, x_{t-1}), \theta_0)] = 0 \text{ for all } t$$
Parameter estimates can be based on sample moments, $m((y_t, x_{t-1}), \theta)$:
$$\frac{1}{T} \sum_{t=1}^{T} m((y_t, x_{t-1}), \hat{\theta}_T) = 0$$
If we have the same number of (non-redundant) moment conditions as we have
parameters, the parameters $\hat{\theta}_T$ are exactly identified by the moment conditions.
For the linear regression model, the moment conditions are that the regression
residuals $(y_t - x_{t-1}'\beta)$ are uncorrelated with the predictors, $x_{t-1}$:
$$E[x_{t-1} u_t] = E[x_{t-1}(y_t - x_{t-1}'\beta)] = 0, \quad t = 1, \ldots, T \quad \Rightarrow \quad \hat{\beta}_{MoM} = (X'X)^{-1} X'y$$
Generalized Method of Moments (GMM) II
Under broad conditions the GMM estimator is consistent and asymptotically
normally distributed
GMM estimator allows for heteroskedastic (time-varying variance) and
autocorrelated (persistent) errors
GMM estimator has certain robustness properties
GMM is widely used throughout finance
Lars Peter Hansen (Chicago) shared the Nobel prize in 2013 for his work on
GMM estimation (and other topics)
Shrinkage and Ridge estimators
Estimation errors often lead to bad forecasts
A simple "trick" is to penalize for large parameter estimates
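Shrinkage estimators do this by solving the problem
$$\tilde{\beta} = \arg\min_{\beta} \; T^{-1} \sum_{t=1}^{T} (y_t - \beta' x_{t-1})^2 + \underbrace{\lambda \sum_{i=1}^{n_k} \beta_i^2}_{\text{penalty}}$$
λ > 0: penalizes for large parameters
With a single regressor (scaled so that $T^{-1} \sum_{t=1}^{T} x_{t-1}^2 = 1$), the solution to this problem is simple:
$$\tilde{\beta}_{shrink} = \frac{1}{1 + \lambda} \hat{\beta}_{OLS}$$
In the multivariate case we get the ridge estimator
$$\tilde{\beta}_{Ridge} = (X'X + \lambda I)^{-1} X'y$$
(with the factor T from the objective absorbed into λ)
Even though $\tilde{\beta}_{shrink}$ is now biased, the variance of the forecast is reduced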
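The single-regressor case can be verified numerically. The sketch below solves the penalized problem in closed form from its first-order condition, beta = (T^{-1} sum x·y) / (T^{-1} sum x^2 + lambda), and uses a made-up regressor whose average square equals one so that the result reduces to the OLS estimate divided by 1 + lambda. The function names and data are hypothetical.

```python
def ols_single(x, y):
    """OLS slope with a single regressor and no intercept."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def ridge_single(x, y, lam):
    """Minimizer of (1/T)*sum (y_t - b*x_t)^2 + lam*b^2 over b (closed form)."""
    T = len(x)
    sxy = sum(a * b for a, b in zip(x, y)) / T
    sxx = sum(a * a for a in x) / T
    return sxy / (sxx + lam)

# Hypothetical data; the regressor has average square equal to one
x = [1.0, -1.0, 1.0, -1.0]
y = [2.5, -1.5, 1.5, -2.5]

beta_ols = ols_single(x, y)
beta_ridge = ridge_single(x, y, lam=1.0)
print("OLS:", beta_ols, "shrunk:", beta_ridge)
```

Here lambda = 1 halves the coefficient. If the regressor is not scaled to unit average square, the shrinkage factor becomes 1/(1 + lambda/sxx) rather than 1/(1 + lambda).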
Model selection
Typically a set of models, rather than just a single model is considered when
constructing a forecast of a particular outcome
Models could differ by
dynamic specification (ARMA lags)
predictor variables (covariates)
functional form (nonlinearities)
estimation method (OLS, GMM, MLE)
Can a single 'best' model be identified?
Model selection methods attempt to choose such a 'best' model
might be hard if the space of models is very large
what if many models have similar performance?
Different from forecast combination which combines forecasts from several
models
Model selection: setup
$\mathcal{M}_K = \{M_1, \ldots, M_K\}$: Finite set of K forecasting models
$M_k$: individual model used to generate a forecast, $f_k(x_t, \beta_k)$, k = 1, ..., K
$x_t$: data (conditioning information or predictors) at time t
$\beta_k$: parameters for model k
Model selection involves searching over $\mathcal{M}_K$ to find the best forecasting
model
Data sample: $\{y_{t+1}, x_t\}$, t = 0, ..., T − 1
One model nests another model if the second model is a special case
(smaller version) of the first one. Example:
$M_1: y_{t+1} = \beta_1 x_{1t} + \varepsilon_{1t+1}$ (small model)
$M_2: y_{t+1} = \beta_{21} x_{1t} + \beta_{22} x_{2t} + \varepsilon_{2t+1}$ (big model)
In-sample comparison of models
Two models: $M_1 = f_1(x_1, \beta_1)$ and $M_2 = f_2(x_2, \beta_2)$
Squared error loss: $e_1 = y - f_1$, $e_2 = y - f_2$
The second ("large") model nests the first ("small")
Coefficient estimates for both models are selected such that
$$\hat{\beta}_i = \arg\min_{\beta_i} T^{-1} \sum_{t=1}^{T} (y_t - f_{it|t-1}(\beta_i))^2$$
Because $f_{2t|t-1}(\beta_2)$ nests $f_{1t|t-1}(\beta_1)$, it follows that, in a given sample,
$$T^{-1} \sum_{t=1}^{T} (y_t - f_{2t|t-1}(\hat{\beta}_2))^2 \leq T^{-1} \sum_{t=1}^{T} (y_t - f_{1t|t-1}(\hat{\beta}_1))^2$$
The larger model (M2) always provides at least as good a fit as the smaller
model (M1) and in most cases will provide a strictly better in-sample fit
In-sample comparison of models
The smaller model's in-sample fit is never better (and is typically worse) even if
the true expected loss under the (first) small model is less than or equal to the
expected loss under the second (large) model, i.e., even if the following holds in population:
$$E\left[ (y_{t+1} - f_{1t+1|t}(\beta_1^*))^2 \right] \leq E\left[ (y_{t+1} - f_{2t+1|t}(\beta_2^*))^2 \right]$$
$\beta_1^*, \beta_2^*$: true parameters. These are unknown
Superior in-sample fit does not by itself suggest that a particular forecast
model is necessarily better out-of-sample
Large (complex) models often perform well in comparisons of in-sample fit
even when they perform poorly compared with smaller models when
evaluated on new (out-of-sample) data
Take-away: Overfitting matters. Be careful with large models
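The in-sample inequality for nested models can be demonstrated with a small simulation. This is an illustrative sketch with a made-up data generating process: y depends only on x1, while x2 is pure noise, so the large model adds an irrelevant regressor. The function names are hypothetical and both regressions omit the intercept, as in the M1 and M2 example.

```python
import random

def ssr_small(y, x1):
    """Sum of squared residuals from regressing y on x1 only."""
    b = sum(u * w for u, w in zip(x1, y)) / sum(u * u for u in x1)
    return sum((w - b * u) ** 2 for u, w in zip(x1, y))

def ssr_big(y, x1, x2):
    """Sum of squared residuals from regressing y on x1 and x2 (2x2 normal equations)."""
    a = sum(u * u for u in x1)
    b = sum(u * v for u, v in zip(x1, x2))
    c = sum(v * v for v in x2)
    d1 = sum(u * w for u, w in zip(x1, y))
    d2 = sum(v * w for v, w in zip(x2, y))
    det = a * c - b * b
    b1 = (c * d1 - b * d2) / det
    b2 = (a * d2 - b * d1) / det
    return sum((w - b1 * u - b2 * v) ** 2 for u, v, w in zip(x1, x2, y))

rng = random.Random(1)
T = 50
x1 = [rng.gauss(0.0, 1.0) for _ in range(T)]
x2 = [rng.gauss(0.0, 1.0) for _ in range(T)]          # irrelevant predictor
y = [0.5 * u + rng.gauss(0.0, 1.0) for u in x1]       # true model uses x1 only

small_fit = ssr_small(y, x1)
big_fit = ssr_big(y, x1, x2)
print("SSR small:", small_fit, "SSR big:", big_fit)
```

The larger model fits at least as well in-sample even though x2 has no predictive content, which is why in-sample fit alone cannot be used to choose between nested models.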
Model selection methods
Popular in-sample model selection methods
Information criteria (IC)
Sequential hypothesis testing
Cross validation
LASSO (large dimensional models)
Advantages of each approach depend on whether there are few or very many
potential predictor variables
Also depends on the true, but unknown, model
Are many or few of the predictor variables truly significant?
sparseness
Sequential Hypothesis Testing
Sequential hypothesis tests choose the 'best' submodel from a larger set of
models through a sequence of specification tests that identify the relevant
parts of a model and exclude the remainder
Approach reflects how applied researchers construct their models in practice
Remove terms found not to be useful when tested against a smaller model
that omits such variables
Use t−tests, F−tests, or p−values
Difficulties may arise if models are nonnested or include nonlinearities
In matlab: stepwisefit
Sequential Hypothesis Testing
Different orders of the sequence in which variables are tested
forward stepwise
backward stepwise
General-to-specific – start big: include all potential variables in the initial
model. Then remove variables thought not to be useful through a sequence
of tests. The final model typically depends on the sequence of tests
Specific-to-general – start small: begin with a small baseline model with
the 'main' variables or simply a constant, then add further variables if they
improve the prediction model
Forward and backward methods can be mixed
Sequential Testing with Linear Model (backward stepwise)
Kitchen sink with K potential predictors $\{x_1, \ldots, x_K\}$:
$$y_{t+1} = \beta_0 + \beta_1 x_{1t} + \beta_2 x_{2t} + \ldots + \beta_{K-1} x_{K-1,t} + \beta_K x_{Kt} + \varepsilon_{t+1}$$
Suppose the smallest absolute value of the t-statistic of any variable falls
below some threshold, $\bar{t}$, such as $\bar{t} = 2$:
$$t_{min} = \min_{k=1,\ldots,K} |t_{\hat{\beta}_k}| < \bar{t} = 2$$
Eliminate the variable with the smallest absolute t-statistic or, equivalently,
the highest p-value, provided that p-value is larger than 0.05
Sequential Testing with Linear Model (cont.)
Suppose xK is dropped. The trimmed model with the remaining K − 1
variables is next re-estimated:
$$ y_{t+1} = \beta_0 + \beta_1 x_{1t} + \beta_2 x_{2t} + \dots + \beta_{K-1} x_{K-1,t} + \varepsilon_{t+1} $$
Recalculate the t-statistics for all regressors. Check if
$$ \min_{k=1,\dots,K-1} |t_{\hat{\beta}_k}| < \bar{t} $$
and drop the variable with the smallest absolute t−statistic if this condition holds
Repeat procedure until no further variable is dropped
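The backward stepwise loop can be sketched in a few lines. This is an illustrative Python version using only the standard library, not the matlab stepwisefit routine used in the course; the helper names (solve_inverse, ols_t_stats, backward_stepwise) and the simulated data are assumptions made for the example, and y[t] is assumed to be already aligned with the predictors observed one period earlier.

```python
import math
import random


def solve_inverse(a):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def ols_t_stats(X, y):
    """OLS of y on the columns of X (constant included in X); returns (beta, t-statistics)."""
    n, k = len(X), len(X[0])
    xtx = [[sum(X[t][i] * X[t][j] for t in range(n)) for j in range(k)] for i in range(k)]
    xty = [sum(X[t][i] * y[t] for t in range(n)) for i in range(k)]
    inv = solve_inverse(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    resid = [y[t] - sum(X[t][j] * beta[j] for j in range(k)) for t in range(n)]
    s2 = sum(e * e for e in resid) / (n - k)
    t_stats = [beta[i] / math.sqrt(s2 * inv[i][i]) for i in range(k)]
    return beta, t_stats


def backward_stepwise(x_cols, y, t_bar=2.0):
    """Drop the regressor with the smallest |t| for as long as that |t| is below t_bar."""
    included = list(range(len(x_cols)))
    while included:
        X = [[1.0] + [x_cols[k][t] for k in included] for t in range(len(y))]
        _, t_stats = ols_t_stats(X, y)
        abs_t = [abs(v) for v in t_stats[1:]]  # the intercept is always kept
        weakest = min(range(len(included)), key=lambda i: abs_t[i])
        if abs_t[weakest] >= t_bar:
            break
        included.pop(weakest)
    return included


# Simulated (hypothetical) data: only the first predictor matters.
random.seed(0)
T = 200
x_cols = [[random.gauss(0, 1) for _ in range(T)] for _ in range(3)]
y = [1.0 + 2.0 * x_cols[0][t] + random.gauss(0, 1) for t in range(T)]
selected = backward_stepwise(x_cols, y)
print("indices of retained predictors:", selected)
```

Because every t-statistic is recomputed after each deletion, whether a given variable survives depends on which other variables are still in the model at that step.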
Forecasts from sequential tests
Backward stepwise forecast:
$$ f_{t+1|t} = \hat{\beta}_0 + \sum_{k=1}^{K} \hat{\beta}_k I_k x_{kt} $$
Ik = 1 if the kth variable is included in the final model. Otherwise Ik = 0
Ik depends on the entire sequence of t−statistics not only for the kth
variable itself but also for all other variables. Why?
In matlab:
stepwisefit(X(1:t-1,:),y(2:t),'display','off','inmodel',ones(1,size(X,2))); % backward stepwise model selection
Sequential Hypothesis Testing: Specific to general
Begin from a simple model that only includes an intercept
yt+1 = β0 + εt+1
Next consider all K univariate models (forward stepwise approach):
$$ y_{t+1} = \beta_{0k} + \beta_k x_{kt} + \varepsilon_{k,t+1}, \quad k = 1, \dots, K $$
Add to the model the variable with the highest absolute t−statistic subject to the
condition that this exceeds some threshold value $\bar{t}$, e.g., $\bar{t} = 2$
$$ t_{\max} = \max_{k=1,\dots,K} |t_{\hat{\beta}_k}| > \bar{t} $$
Add regressors from the remaining pool to this univariate model one by one.
New regressor is included if its t−statistic exceeds t̄
Repeat until no further variable is included
matlab: stepwisefit(X(1:t-1,:),y(2:t),'display','off'); % forward stepwise model selection
Sequential approach
Forecasts from the backward and forward stepwise approaches take the form
$$ f_{t+1|t} = \hat{\beta}_0 + \sum_{k=1}^{K} \hat{\beta}_k I_k x_{kt} $$
Ik = 1 if the kth variable is included in the final model. Otherwise Ik = 0
Advantages:
intuitive appeal
simplicity
computationally easy
Disadvantages:
no comprehensive search across all possible models
outcome can be path dependent – no guarantee that it finds the globally
optimal model
pre-test biases: hard to control the size
Information Criteria (IC)
ICs trade off model fit against a penalty for model complexity measured by
the number of freely estimated parameters
ICs were developed under different assumptions regarding the ‘correct’
underlying model and hence have different properties
ICs ‘correct’ a minimization criterion for the effect of parameter estimation
which tends to make large models appear better in-sample than they really are
Popular information criteria:
Bayes information criterion (BIC or SIC)
Akaike information criterion (AIC)
In matlab: aicbic. Choose the model with the smallest
aic = −2 ∗ logL + 2 ∗ numParam
bic = −2 ∗ logL + numParam ∗ log(numObs)
Information Criteria
How much additional parameters improve the in-sample fit depends on the
true model, so differences across information criteria hinge on how to
practically get around our ignorance about the true model
Information criteria are used to rank a set of parametric models, $M_k$
Each model $M_k \in \mathcal{M}_K$ requires estimating $n_k$ parameters, $\beta_k$
To conduct a comprehensive search over all possible models with K potential
predictor variables, {x1, …, xK }, means considering $2^K$ possible model
specifications
Example: K = 2 : Two possible predictors {x1, x2} yield four models {0, 0},
{1, 0}, {0, 1}, and {1, 1}
with K = 11, $2^{11} = 2{,}048$ models
with K = 20, $2^{20} > 1{,}000{,}000$ models
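The size of the model space can be checked directly. The short Python sketch below enumerates the inclusion vectors for K = 2 and counts the models for the values of K mentioned above; the function name all_inclusion_vectors is an assumption made for the illustration.

```python
from itertools import product


def all_inclusion_vectors(K):
    """Every combination of included (1) and excluded (0) predictors."""
    return list(product([0, 1], repeat=K))


models = all_inclusion_vectors(2)
print(models)  # the four models for K = 2

# Number of candidate models for the values of K used in the text
counts = {K: 2 ** K for K in (2, 11, 20)}
print(counts)
```

The count doubles with every additional predictor, which is why a comprehensive search quickly becomes infeasible.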
Bayesian/Schwarz Information Criterion
$$ BIC = -2 \log L_k + n_k \ln(T) $$
$n_k$ : number of parameters of model k
T : sample size – penalty depends on the sample size: bigger T , bigger
penalty
For linear regressions, the BIC takes the form
$$ BIC = \ln \hat{\sigma}_k^2 + n_k \ln(T)/T $$
$\hat{\sigma}_k^2 = e_k' e_k / T$ : sample estimate of the residual variance
Select the model with the smallest BIC
BIC is a consistent model selection criterion: It selects the true model in a
very large sample (big T ) if this is included in $\mathcal{M}_K$
Akaike Information Criterion
$$ AIC = -2 \log L_k + 2 n_k $$
AIC minimizes an estimate of the expected (Kullback-Leibler) distance between the
true model and the fitted model
For linear regression models
$$ AIC(k) = \ln \hat{\sigma}_k^2 + \frac{2 n_k}{T} $$
AIC penalizes inclusion of extra parameters less than the SIC (BIC) whenever
ln(T ) > 2, i.e., for T ≥ 8
AIC is not a consistent model selection criterion – it tends to select models
with too many parameters
AIC selects the best “approximate” model – asymptotic efficiency
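To see how the two penalties can lead to different choices, the sketch below evaluates the linear-regression forms of AIC and BIC for two candidate models. It is written in Python rather than with matlab's aicbic, and the residual variances and parameter counts are hypothetical numbers chosen only for illustration.

```python
import math


def aic_linear(sigma2_hat, n_params, T):
    """AIC(k) = ln(sigma_hat^2) + 2 n_k / T"""
    return math.log(sigma2_hat) + 2.0 * n_params / T


def bic_linear(sigma2_hat, n_params, T):
    """BIC(k) = ln(sigma_hat^2) + n_k ln(T) / T"""
    return math.log(sigma2_hat) + n_params * math.log(T) / T


T = 200
# Hypothetical fits: (residual variance, number of estimated parameters)
candidates = {"small": (1.00, 2), "large": (0.95, 6)}

aic = {name: aic_linear(s2, n, T) for name, (s2, n) in candidates.items()}
bic = {name: bic_linear(s2, n, T) for name, (s2, n) in candidates.items()}

best_aic = min(aic, key=aic.get)
best_bic = min(bic, key=bic.get)
print("AIC picks:", best_aic, "| BIC picks:", best_bic)
```

With these numbers the modest improvement in fit is enough to offset the AIC penalty but not the heavier BIC penalty, so the two criteria disagree.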
Cross-validation
Cross validation (CV) avoids overfitting by removing the correlation that
causes the estimated in-sample loss to be “small” due to the use of the same
observations for both parameter estimation and model evaluation
CV makes use of the full dataset for both estimation and evaluation
CV averages over all possible combinations of estimation and evaluation
samples obtainable from a given data set
‘Leave one out’ CV estimator holds out one observation for model evaluation
Remaining observations are used for estimation of the parameters
The loss is calculated solely from the evaluation sample – this breaks the
connection leading to overfitting
Repeat calculation for all possible ways to leave out one data point for model
evaluation
CV can be computationally slow if T is large
Illustration: Estimation of sample mean under MSE
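Estimation of sample mean $\bar{y}_T = T^{-1} \sum_{t=1}^{T} y_t$ for i.i.d. time series, $y_t$, with mean µ, variance σ2 and $\varepsilon_t = y_t - \mu$
Mean Squared Error (MSE):
$$ T^{-1} \sum_{t=1}^{T} (y_t - \bar{y}_T)^2 = T^{-1} \sum_{t=1}^{T} \varepsilon_t^2 - (\bar{y}_T - \mu)^2 \;\Rightarrow $$
$$ E\left[ T^{-1} \sum_{t=1}^{T} (y_t - \bar{y}_T)^2 \right] = \sigma^2 - E[(\bar{y}_T - \mu)^2] \le \sigma^2 $$
$E[(\bar{y}_T - \mu)^2]$ gets subtracted from the MSE!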
The in-sample MSE based on the fitted mean will on average be smaller than
the true MSE computed under known µ
Cross validation breaks the correlation between the forecast error and the
estimation error
Separate observations used to estimate the parameters of the prediction model
(the sample mean) from observations used to compute the MSE
How does the classical ‘leave one out’ CV work?
At each point in time, t, use the sample mean
$\bar{y}_{\{-t\}} = (T-1)^{-1} \sum_{s=1, s \neq t}^{T} y_s$ that leaves out observation $y_t$
We can show that
$$ E\left( T^{-1} \sum_{t=1}^{T} (y_t - \bar{y}_{\{-t\}})^2 \right) = E\left( T^{-1} \sum_{t=1}^{T} \varepsilon_t^2 \right) + (T-1)^{-1} E\left[ T^{-1} \sum_{t=1}^{T} \varepsilon_t^2 \right] = \sigma^2 (1 + (T-1)^{-1}) > \sigma^2 $$
The expected squared error based on the leave-one-out estimator $\bar{y}_{\{-t\}}$ is thus
larger than the in-sample value based on the usual estimate, $\bar{y}_T$, which does not
leave one out
CV estimator tells us (correctly) that the MSE exceeds σ2
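The comparison between the in-sample MSE and the leave-one-out MSE for the sample mean can be reproduced numerically. The Python sketch below uses simulated i.i.d. data (a hypothetical sample with µ = 0 and σ = 1) and the illustrative function names in_sample_mse and loo_cv_mse. For the sample mean the two quantities are linked by an exact identity, since y_t minus the leave-one-out mean equals T/(T − 1) times (y_t − ȳ_T).

```python
import random


def in_sample_mse(y):
    """Average squared deviation from the full-sample mean."""
    T = len(y)
    ybar = sum(y) / T
    return sum((v - ybar) ** 2 for v in y) / T


def loo_cv_mse(y):
    """Average squared error when each y_t is predicted by the mean of the other T-1 points."""
    T = len(y)
    total = sum(y)
    return sum((v - (total - v) / (T - 1)) ** 2 for v in y) / T


random.seed(1)
T = 20
y = [random.gauss(0.0, 1.0) for _ in range(T)]  # hypothetical i.i.d. sample

mse_in = in_sample_mse(y)
mse_cv = loo_cv_mse(y)
print("in-sample MSE:", mse_in)
print("leave-one-out MSE:", mse_cv)
```

The leave-one-out value always exceeds the in-sample value here, by the factor (T/(T − 1))², which is the sample counterpart of the expectations derived above.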
How many predictor variables do we have?
Low-dimensional set of variables
Large-dimensional: hundreds or thousands
The Federal Reserve Bank of St. Louis database, FRED, has 429,000 time series
Lasso Model Selection
Lasso (Least Absolute Shrinkage and Selection Operator) is a type of
shrinkage estimator for least squares regression
Shrinkage estimators (“pull towards zero”) reduce the effect of sampling
errors
Lasso estimates linear regression coefficients by minimizing the (average) sum of
squared residuals plus a penalty function
$$ T^{-1} \sum_{t=1}^{T} (y_t - \beta' x_{t-1})^2 + \underbrace{\lambda \sum_{i=1}^{n_k} |\beta_i|}_{\text{penalty}} $$
λ : scalar tuning parameter determining the size of the penalty
Shrinks the parameter estimates towards zero
λ = 0 gives OLS estimates. Big values of λ pull β̂ towards zero
No closed form solution for minimizing the expression – computational
methods are required
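One common computational method is coordinate descent with soft-thresholding. The Python sketch below applies it to the penalized criterion above. It is an illustration only, not the algorithm behind matlab's lasso function; it assumes the rows of X are already aligned with y (so row t holds the lagged predictors), omits an intercept as in the formula, and uses simulated data. The names soft_threshold and lasso_cd are hypothetical.

```python
import random


def soft_threshold(z, g):
    """Shrink z towards zero by g; return exactly zero if |z| <= g."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1/T) * sum (y_t - beta'x_t)^2 + lam * sum |beta_i| by coordinate descent."""
    T, K = len(X), len(X[0])
    beta = [0.0] * K
    resid = list(y)  # residuals at beta = 0
    for _ in range(n_iter):
        for j in range(K):
            xj = [row[j] for row in X]
            z = sum(v * v for v in xj) / T
            # covariance between x_j and the partial residual that excludes x_j's own fit
            rho = sum(xj[t] * (resid[t] + xj[t] * beta[j]) for t in range(T)) / T
            new = soft_threshold(rho, lam / 2.0) / z
            delta = new - beta[j]
            if delta != 0.0:
                resid = [resid[t] - xj[t] * delta for t in range(T)]
                beta[j] = new
    return beta


# Simulated (hypothetical) data: only the first predictor matters.
random.seed(2)
T = 200
X = [[random.gauss(0, 1) for _ in range(3)] for _ in range(T)]
y = [2.0 * X[t][0] + random.gauss(0, 1) for t in range(T)]

beta_ols = lasso_cd(X, y, lam=0.0)   # lam = 0 reproduces least squares
beta_mid = lasso_cd(X, y, lam=0.5)   # moderate shrinkage
beta_big = lasso_cd(X, y, lam=1e6)   # very large penalty
print(beta_ols, beta_mid, beta_big)
```

The threshold is λ/2 because the derivative of the squared-error term carries a factor of 2; a coefficient is set exactly to zero whenever its partial covariance with the residual falls inside that band.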
Lasso Model Selection
Common to re-estimate parameters of selected variables by OLS
In matlab: lasso
“lasso performs lasso or elastic net regularization for linear regression.
[B,STATS] = lasso(X,Y,…) Performs L1-constrained linear least squares fits
(lasso) or L1- and L2-constrained fits (elastic net) relating the predictors in X
to the responses in Y. The default is a lasso fit, or constraint on the L1-norm
of the coefficients B.”
matlab can use cross-validation (the 'CV' option of lasso) to choose the weight on the penalty term, λ
Lasso tends to set many coefficients to zero and can thus be used for model
selection
Variable selection and Lasso (Patrick Breheny slides)
Empirical example
Forecasts of quarterly (excess) stock returns
Twelve predictor variables:
dividend-price ratio,
dividend-earnings (payout) ratio,
stock volatility,
book-to-market ratio,
net equity issues,
T-bill rate,
long term return,
term spread,
default yield,
default return,
inflation
investment-capital ratio
Time-series forecasts of quarterly stock returns
[Figure: four panels of forecasts, 1970Q1 to 2010Q4: (1) AIC, BIC and cross-validation; (2) forward and backward stepwise; (3) bagging with α = 2% and α = 3%; (4) Lasso with λ = 8 and λ = 3]
Inclusion of individual variables
[Figure: four panels, 1970Q1 to 2010Q4, showing when each method (AIC, BIC, CV, forward stepwise, backward stepwise, Lasso) includes the variables dp, tbl, tms and dfy]
Conclusion
Model selection increases the “space” over which the search for the
forecasting model is conducted
Model uncertainty matters and can be as important as parameter
estimation error
When one model is clearly superior to others it will nearly always be selected
No free lunch – when a single model is not obviously superior to all other
models, different models are selected by different criteria in different samples
statistical techniques for model selection are used precisely because models are
hard to tell apart, and not because one model is obviously much better than
the others
Lecture 4: Updating and Forecasting with New
Information
UCSD, January 30, 2017
Allan Timmermann1
1UC San Diego
Timmermann (UCSD) Filtering Winter, 2017 1 / 53
1 Bayes rule and updating beliefs
2 The Kalman Filter
Application to inflation forecasting
3 Nowcasting
4 Jagged Edge Data
5 Daily Business Cycle Indicator for the U.S.
6 Markov Switching models
Empirical examples
Updating and forecasting
A good forecasting model closely tracks how data evolve over time
Important to update beliefs about the predicted variable or the
forecasting model as new information arrives
Filtering
Suppose the current “state” is unobserved. For example we may not
know in real time if the economy is in a recession
Nowcasting: obtaining the best estimate of the current state
How do we accurately and efficiently update our forecasts as new
information arrives?
Kalman filter (continuous state)
Regime switching model (small number of states)
Bayes’ rule
Bayes’ rule for two events A and B:
$$ P(B|A) = \frac{P(A|B) P(B)}{P(A)} $$
P(A) : probability of event A
P(B) : probability of event B
P(B |A) : probability of event B given that event A occurred
Bayes’ rule (cont.)
Let θ be some unknown model parameters while y are observed data.
By Bayes’ rule (setting θ = B, y = A)
$$ P(\theta|y) = \frac{P(y|\theta) P(\theta)}{P(y)} $$
Given some data, y , what do we know about model parameters θ?
If we are only interested in θ, we can ignore P(y) :
$$ \underbrace{P(\theta|y)}_{\text{posterior}} \propto \underbrace{P(y|\theta)}_{\text{likelihood}} \, \underbrace{P(\theta)}_{\text{prior}} $$
We start with prior beliefs before seeing the data. Updating these
beliefs with the observed data, we get posterior beliefs
Parameters, θ, that do not fit the data become less likely
For example, if θH is ‘high mean returns’ and we observe a data sample
with low mean returns, then we put less weight on θH
Examples of Bayes’ rule
B : European recession. A : European growth rate of -1%
$$ P(\text{recession} \mid g = -1\%) = \frac{P(g = -1\% \mid \text{recession}) \, P(\text{recession})}{P(g = -1\%)} $$
B : bear market. A : negative returns of -5%
$$ P(\text{bear} \mid r = -5\%) = \frac{P(r = -5\% \mid \text{bear}) \, P(\text{bear})}{P(r = -5\%)} $$
Here P (recession) and P(bear) are the initial probabilities of being in
a recession/bear market (before observing the data)
Understanding the updating process
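Suppose two random variables Y and X are normally distributed
$$ \begin{pmatrix} Y \\ X \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_y \\ \mu_x \end{pmatrix}, \begin{pmatrix} \sigma_y^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_x^2 \end{pmatrix} \right) $$
$\mu_y$ and $\mu_x$ are the initial (unconditional) expected values of y and x
$\sigma_y^2, \sigma_x^2, \sigma_{xy}$ are variances and covariance
Conditional mean and variance of Y given an observation X = x :
$$ E[Y | X = x] = \mu_y + \frac{\sigma_{xy}}{\sigma_x^2} (x - \mu_x) $$
$$ Var(Y | X = x) = \sigma_y^2 - \frac{\sigma_{xy}^2}{\sigma_x^2} $$
If Y and X are positively correlated ($\sigma_{xy} > 0$) and we observe a high
value of X ($x > \mu_x$), then we increase our expectation of Y
Just like a linear regression! $\sigma_{xy}/\sigma_x^2$ is the beta coefficient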
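A small numerical check of the two conditional-moment formulas may help. The Python function below (conditional_normal, a name chosen for this illustration) implements them directly; the input values are hypothetical.

```python
def conditional_normal(mu_y, mu_x, var_y, var_x, cov_xy, x):
    """Mean and variance of Y given X = x for jointly normal (Y, X)."""
    beta = cov_xy / var_x                 # regression (beta) coefficient
    mean = mu_y + beta * (x - mu_x)       # prior mean plus beta times the surprise in X
    var = var_y - cov_xy ** 2 / var_x     # observing X never increases the variance
    return mean, var


# Hypothetical inputs: X is observed one unit above its mean
mean, var = conditional_normal(mu_y=1.0, mu_x=2.0, var_y=4.0,
                               var_x=1.0, cov_xy=1.5, x=3.0)
print(mean, var)
```

With a positive covariance and x above µx, the conditional mean of Y moves above µy, while the conditional variance falls below σ2y regardless of the value of x observed.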
Kalman Filter: Background
The Kalman filter is an algorithm for linear updating and prediction
Introduced by Kalman in 1960 for engineering applications
Method has found great use in many disciplines, including economics
and finance
Kalman Filter gives an updating rule that can be used to revise our
beliefs as we see more and more data
For models with normally distributed variables, the filter can be used
to write down the likelihood function
Kalman Filter (Wikipedia) I
“Kalman filtering, also known as linear quadratic estimation (LQE), is
an algorithm that uses a series of measurements observed over time,
containing noise (random variations) and other inaccuracies, and
produces estimates of unknown variables that tend to be more precise
than those based on a single measurement alone. More formally, the
Kalman filter operates recursively on streams of noisy input data to
produce a statistically optimal estimate of the underlying system
state. The filter is named for Rudolf (Rudy) E. Kálmán, one of the
primary developers of its theory.
The Kalman filter has numerous applications in technology. A
common application is for guidance, navigation and control of
vehicles, particularly aircraft and spacecraft. Furthermore, the
Kalman filter is a widely applied concept in time series analysis used
in fields such as signal processing and econometrics.
Kalman Filter (Wikipedia) II
The algorithm works in a two-step process. In the prediction step,
the Kalman filter produces estimates of the current state variables,
along with their uncertainties. Once the outcome of the next
measurement (necessarily corrupted with some amount of error,
including random noise) is observed, these estimates are updated
using a weighted average, with more weight being given to estimates
with higher certainty. Because of the algorithm’s recursive nature, it
can run in real time using only the present input measurements and
the previously calculated state and its uncertainty matrix; no
additional past information is required.”
Kalman Filter: Models in state space form
Let St be an unobserved (state) variable while yt is an observed
variable. A model that shows how yt is related to St and how St
evolves is called a state space model. This has two equations:
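State equation (unobserved/latent):
$$ S_t = \phi S_{t-1} + \varepsilon_{st}, \quad \varepsilon_{st} \sim (0, \sigma_s^2) \qquad (1) $$
Measurement equation (observed)
$$ y_t = B S_t + \varepsilon_{yt}, \quad \varepsilon_{yt} \sim (0, \sigma_y^2) \qquad (2) $$
Innovations are uncorrelated with each other:
$$ Cov(\varepsilon_{st}, \varepsilon_{yt}) = 0 $$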
Example 1: AR(1) model in state space form
AR(1) model
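$$ y_t = \phi y_{t-1} + \varepsilon_t $$
This can be written in state space form as
$$ S_t = \phi S_{t-1} + \varepsilon_t \quad \text{(state eq.)} $$
$$ y_t = S_t \quad \text{(measurement eq.)} $$
with $B = 1$, $\sigma_s^2 = \sigma_\varepsilon^2$, and $\sigma_y^2 = 0$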
very simple: no error in the measurement equation: yt is observed
without error
Example 2: MA(1) model in state space form
MA(1) model with unobserved shocks εt :
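$$ y_t = \varepsilon_t + \theta \varepsilon_{t-1} $$
This can be written in state space form
$$ \begin{pmatrix} S_{1t} \\ S_{2t} \end{pmatrix} = \underbrace{\begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix}}_{\phi} \begin{pmatrix} S_{1,t-1} \\ S_{2,t-1} \end{pmatrix} + \begin{pmatrix} 1 \\ 0 \end{pmatrix} \varepsilon_t $$
$$ y_t = \underbrace{\begin{pmatrix} 1 & \theta \end{pmatrix}}_{B} \begin{pmatrix} S_{1t} \\ S_{2t} \end{pmatrix} = \varepsilon_t + \theta \varepsilon_{t-1} $$
Note that $S_{1t} = \varepsilon_t$, $S_{2t} = S_{1,t-1} = \varepsilon_{t-1}$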
Example 3: Unobserved components model
The unobserved components model consists of two equations
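$$ y_t = S_t + \varepsilon_{yt} \quad (B = 1) $$
$$ S_t = S_{t-1} + \varepsilon_{st} \quad (\phi = 1) $$
$y_t$ is observed with noise
$S_t$ is the underlying “mean” of $y_t$. This is smoother than $y_t$
This model can be written as an ARIMA(0,1,1):
$$ y_t - y_{t-1} = S_t - S_{t-1} + \varepsilon_{yt} - \varepsilon_{y,t-1} = \varepsilon_{st} + \varepsilon_{yt} - \varepsilon_{y,t-1} : \text{MA(1)} $$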
Kalman Filter: Advantages
The state equation in (1) is in AR(1) form and so is easy to iterate
forward. The h−step-ahead forecast of the state given its current
value, $S_t$, is given by
$$ E_t[S_{t+h} | S_t] = \phi^h S_t $$
In practice we don’t observe $S_t$ and so need an estimate of this given
current information, $S_{t|t}$, or past information, $S_{t|t-1}$
Updating the Kalman filter through newly arrived information is easy
Kalman Filter Updating Equations
$I_t = \{y_t, y_{t-1}, y_{t-2}, \dots\}$. Current information
$I_{t-1} = \{y_{t-1}, y_{t-2}, y_{t-3}, \dots\}$. Lagged information
$y_t$ : random variable we want to predict
$y_{t|t-1}$ : best prediction of $y_t$ given information at t − 1, $I_{t-1}$
$S_{t|t-1}$ : best prediction of $S_t$ given information at t − 1, $I_{t-1}$
$S_{t|t}$ : best “prediction” (or nowcast) of $S_t$ given information $I_t$
Define mean squared error (MSE)-values associated with the forecasts
of $S_t$ and $y_t$
$$ MSE^S_{t|t-1} = E[(S_t - S_{t|t-1})^2] $$
$$ MSE^S_{t|t} = E[(S_t - S_{t|t})^2] $$
$$ MSE^y_{t|t-1} = E[(y_t - y_{t|t-1})^2] $$
Prediction and Updating Equations
Using the state, measurement, and MSE equations, the Kalman filter
gives a set of prediction equations:
$$ S_{t|t-1} = \phi S_{t-1|t-1} $$
$$ MSE^S_{t|t-1} = \phi^2 MSE^S_{t-1|t-1} + \sigma_s^2 $$
$$ y_{t|t-1} = B \times S_{t|t-1} $$
$$ MSE^y_{t|t-1} = B^2 \times MSE^S_{t|t-1} + \sigma_y^2 $$
Similarly, we have a pair of updating equations for S :
$$ S_{t|t} = S_{t|t-1} + B \left( MSE^S_{t|t-1} / MSE^y_{t|t-1} \right) (y_t - y_{t|t-1}) $$
$$ MSE^S_{t|t} = MSE^S_{t|t-1} \left[ 1 - B^2 \left( MSE^S_{t|t-1} / MSE^y_{t|t-1} \right) \right] $$
Prediction and Updating Equations
Intuition for updating equations (B = 1)
$$ S_{t|t} = S_{t|t-1} + \frac{MSE^S_{t|t-1}}{MSE^y_{t|t-1}} (y_t - y_{t|t-1}) $$
$S_{t|t}$ : estimate of current (t) state given current information $I_t$
$S_{t|t-1}$ : old (t − 1) estimate of state $S_t$ given $I_{t-1}$
$MSE^S_{t|t-1} / MSE^y_{t|t-1}$ : amount by which we update our estimate of
the current state after we observe $y_t$. This is small if $MSE^y_{t|t-1}$ is big
(noisy data) relative to $MSE^S_{t|t-1}$, i.e., $\sigma_y^2 \gg \sigma_s^2$
$(y_t - y_{t|t-1})$ : surprise (news) about $y_t$
If $y_t$ is higher than we expected, $(y_t - y_{t|t-1}) > 0$ and we increase
our expectations about the state: $S_{t|t} > S_{t|t-1}$. The updating
equation tells us by how much
Starting the Algorithm
At t = 0, we have not observed any data, so we must make our best
guesses of $S_{1|0}$ and $MSE^S_{1|0}$ without data by picking a pair of initial
conditions. This gives $y_{1|0}$ and $MSE^y_{1|0}$ from the prediction equations
At t = 1 we observe $y_1$. The updating equations generate $S_{1|1}$ and
$MSE^S_{1|1}$. The prediction equations then generate forecasts for the
second period
At t = 2 we observe $y_2$, and the cycle continues to give sequences of
predictions of the states, $\{S_{t|t}\}$ and $\{S_{t|t-1}\}$
Keep on iterating to get a sequence of estimates
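The recursion can be written as a short loop for the scalar state space model in equations (1) and (2). The Python sketch below is a minimal illustration of the prediction and updating equations, not the matlab ssm code used in the course; the function name kalman_filter, the parameter values and the two observations are assumptions made for the example.

```python
def kalman_filter(y, phi, B, sigma2_s, sigma2_y, s_init, mse_init):
    """Scalar Kalman filter; s_init and mse_init are the initial conditions S_{1|0}, MSE^S_{1|0}.

    Returns a list of (S_{t|t}, MSE^S_{t|t}) pairs, one per observation.
    """
    s_pred, mse_s_pred = s_init, mse_init
    filtered = []
    for obs in y:
        # prediction equations for y_t
        y_pred = B * s_pred
        mse_y_pred = B ** 2 * mse_s_pred + sigma2_y
        # updating equations for S_t
        gain = B * mse_s_pred / mse_y_pred
        s_filt = s_pred + gain * (obs - y_pred)
        mse_s_filt = mse_s_pred * (1.0 - B * gain)
        filtered.append((s_filt, mse_s_filt))
        # prediction equations for the next period's state
        s_pred = phi * s_filt
        mse_s_pred = phi ** 2 * mse_s_filt + sigma2_s
    return filtered


# Hypothetical unobserved components model (phi = B = 1) with two observations
filtered = kalman_filter(y=[2.0, 1.0], phi=1.0, B=1.0,
                         sigma2_s=1.0, sigma2_y=1.0,
                         s_init=0.0, mse_init=1.0)
print(filtered)
```

In this example the first observation is a surprise of 2 and half of it is passed on to the state estimate; the second observation equals its prediction, so the state estimate is left unchanged.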
Filtered versus smoothed states
$S_{t|t}$ : filtered states: estimate of the state at time t given information
up to time t
Uses only historical information
What is my best guess of $S_t$ given my current information?
“Filters” past historical information for noise
$S_{t|T}$ : smoothed states: estimate of the state at time t given
information up to time T
Uses the full sample up to time T ≥ t
Less sensitive to noise and thus tends to be smoother than the filtered
states
Information on $y_{t-1}, y_t, y_{t+1}$ helps us more precisely estimate the state
at time t, $S_t$
Practical applications of the Kalman filter
Common to use Kalman filter to estimate adaptive forecasting models
with time-varying relations:
$$ y_{t+1} = \beta_t x_t + \varepsilon_{t+1} $$
Two alternative specifications for $\beta_t$:
$\beta_t - \bar{\beta} = \phi(\beta_{t-1} - \bar{\beta}) + u_t$ : mean-reverts to $\bar{\beta}$
$\beta_t = \beta_{t-1} + u_t$ : random walk
$y_t, x_t$ : observed variables
$\beta_t$ : time-varying coefficient (unobserved state variable)
Kalman filter in matlab
Matlab has a good Kalman filter called ssm (state space model). The
setup is
$$ x_t = A_t x_{t-1} + B_t u_t $$
$$ y_t = C_t x_t + D_t e_t $$
$x_t$ : unobserved state (our $S_t$)
$y_t$ : observed variable
$u_t, e_t$ : uncorrelated noise processes with variance of one
model = ssm(A,B,C,D,'StateType',stateType); % state space model
modelEstimate = estimate(model,variable,params0,'lb',[0; 0])
filtered = filter(modelEstimate,variable)
smoothed = smooth(modelEstimate,variable)
Kalman filter example: monthly inflation
Unobserved components model for inflation
$$ x_t = x_{t-1} + \sigma_u u_t $$
$$ y_t = x_t + \sigma_e e_t $$
A = 1; % state-transition matrix (A = φ in our notation)
B = NaN; % state-disturbance-loading matrix (B = σS )
C = 1; % measurement-sensitivity matrix (C = B in our notation)
D = NaN; % observation-innovation matrix (D = σy )
stateType = 2; % sets state equation to be a random walk
Application of Kalman filter to monthly US inflation
[Figure: monthly US inflation, 1930 to 2010. Horizontal axis: Time. Vertical axis: Inflation]
Kalman filter estimates of inflation (last 100 obs.)
[Figure: inflation, filtered inflation and smoothed inflation, 2005 to 2012. Horizontal axis: Time. Vertical axis: Percentage points]
Kalman filter take-aways
Kalman filter is a very popular approach for dynamically updating
linear forecasts
Used to estimate ARMA models
Used throughout engineering and the social sciences
Fast, easy algorithm
Optimal updating equations for normally distributed data
Nowcasting
Nowcasting refers to “estimating the present”
Nowcasting extracts information about the present state of some
variable or system of variables
distinct from traditional forecasting
Nowcasting only makes sense if the present state is
unknown; otherwise nowcasting would just amount to checking the
current value
Example: Use a single unobserved state variable to summarize the
state of the economy, e.g., the daily point in the business cycle
Variables such as GDP are actually observed with large measurement
errors (revisions)
Jagged edge data
Macroeconomic data such as GDP, monetary aggregates,
consumption, unemployment figures or housing starts as well as
financial data extracted from balance sheets and income statements
are published infrequently and sometimes at irregular intervals
Delays in the publication of macro variables differ across variables
Irregular data releases (release date changes from month to month)
generate what is often called “jagged edge” data
A forecaster can only use the data that is available on any given date
and needs to pay careful attention to which variables are in the
information set
Aruoba, Diebold and Scotti daily business cycle indicator
ADS model the daily business cycle, $S_t$, as an unobserved variable
that follows a (zero-mean) AR(1) process:
$$ S_t = \phi S_{t-1} + e_t $$
Although $S_t$ is unobserved, we can extract information about it from
its relation with a set of observed economic variables $y_{1t}, y_{2t}, \dots$
At the daily horizon these variables follow processes:
$$ y_{it} = k_i + \beta_i S_t + \gamma_i y_{i,t-D_i} + u_{it}, \quad i = 1, \dots, n $$
$D_i$ equals seven days if the variable is observed weekly, etc.
Aruoba, Diebold and Scotti index from Philly Fed
ADS five variable model
The ADS model can be written in state-space form
For example, a model could use the following observables:
interest rates (daily, y1t)
initial jobless claims (weekly, y2t)
personal income (monthly, y3t)
industrial production (monthly, y4t)
GDP (quarterly, y5t)
Kalman filter can be used to extract and update estimates of the
unobserved common variable that tracks the state of the economy
Kalman filter is well suited for handling missing data
If all elements of yt are missing on a given day, we skip the updating
step
Markov Chains: Basics
Updating equations simplify a great deal if we only have two states,
states 1 and 2, and want to know which state we are currently in
recession/expansion
inflation/deflation
bull/bear market
high volatility/low volatility
Markov Chains: Basics
A first order (constant) Markov chain, $S_t$, is a random process that
takes integer values {1, 2, …, K} with state transitions that depend
only on the most recent state, $S_{t-1}$
Probability of moving from state i at time t − 1 to state j at time t is
$p_{ij}$:
$$ P(S_t = j | S_{t-1} = i) = p_{ij}, \quad 0 \le p_{ij} \le 1, \quad \sum_{j=1}^{K} p_{ij} = 1 $$
Fitted values, 3-state model for monthly T-bill rates
[Figure: fitted values from the 3-state model for monthly T-bill rates, 1920 to 2020. Horizontal axis: Time. Vertical axis: Fitted values, T-bill rate]
Two-state Markov Chain
With K = 2 states, the transition probabilities can be collected in a
2× 2 matrix
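$$ P = P(S_{t+1} = j | S_t = i) = \begin{bmatrix} p_{11} & p_{21} \\ p_{12} & p_{22} \end{bmatrix} = \begin{bmatrix} p_{11} & 1 - p_{22} \\ 1 - p_{11} & p_{22} \end{bmatrix} $$
Law of total probability: $p_{i1} + p_{i2} = 1$: we either stay in state i or we
leave to state j
$p_{ii}$ : “stayer” probability – measure of state i’s persistence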
Basic regime switching model
Simple two-state regime-switching model
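$$ y_{t+1} = \mu_{s_{t+1}} + \sigma_{s_{t+1}} \varepsilon_{t+1}, \quad \varepsilon_{t+1} \sim N(0, 1) $$
$$ P(s_{t+1} = j | s_t = i) = p_{ij} $$
$\mu_{s_{t+1}}$ : mean in state $s_{t+1}$
$\sigma_{s_{t+1}}$ : volatility in state $s_{t+1}$
$s_{t+1}$ matters for both the mean and volatility of $y_{t+1}$
if $s_{t+1} = 1$ : $y_{t+1} = \mu_1 + \sigma_1 \varepsilon_{t+1}$
if $s_{t+1} = 2$ : $y_{t+1} = \mu_2 + \sigma_2 \varepsilon_{t+1}$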
Updating state probabilities
To predict $y_{t+1}$, we need to predict $s_{t+1}$. This depends on the current
state, $s_t$
Let $p_{1,t|t} = prob(s_t = 1 | I_t)$ be the current probability of being in
state 1 given all information up to time t, $I_t$
If $p_{1,t|t} = 1$, we know for sure that we are in state 1 at time t
Typically $p_{1,t|t} < 1$ and there is uncertainty about the present state
Let $p_{1,t+1|t} = prob(s_{t+1} = 1 | I_t)$ be the predicted probability of being
in state 1 next period (t + 1), given $I_t$
Updating state probabilities
To be in state 1 at time t + 1, we must have come from either state 1
or from state 2:
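$$ p_{1,t+1|t} = p_{11} \times p_{1,t|t} + (1 - p_{22}) \times p_{2,t|t} $$
$$ p_{2,t+1|t} = (1 - p_{11}) \times p_{1,t|t} + p_{22} \times p_{2,t|t} $$
If $p_{1,t|t} = 1$, we know for sure that we are in state 1 at time t. Then
the equations simplify to
$$ p_{1,t+1|t} = p_{11} \times 1 + (1 - p_{22}) \times 0 = p_{11} $$
$$ p_{2,t+1|t} = (1 - p_{11}) \times 1 + p_{22} \times 0 = 1 - p_{11} $$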
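The one-step prediction of the state probabilities is a two-line calculation. The Python sketch below (the function name predict_state_probs is an assumption) implements the two equations above; the stayer probabilities 0.8 and 0.9 are used purely as illustrative inputs.

```python
def predict_state_probs(p1_now, p11, p22):
    """Predicted probabilities of states 1 and 2 next period, given the current P(state 1)."""
    p2_now = 1.0 - p1_now
    p1_next = p11 * p1_now + (1.0 - p22) * p2_now
    p2_next = (1.0 - p11) * p1_now + p22 * p2_now
    return p1_next, p2_next


# If state 1 is known for sure, the forecast collapses to (p11, 1 - p11)
certain = predict_state_probs(1.0, p11=0.8, p22=0.9)
# With an even split between the two states today
uncertain = predict_state_probs(0.5, p11=0.8, p22=0.9)
print(certain, uncertain)
```

The two predicted probabilities always sum to one, because each column of stay and leave probabilities sums to one.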
Updating with two states
Let $P(s_t = 1 | y_{t-1})$ and $P(s_t = 2 | y_{t-1})$ be our initial estimates of
being in states 1 and 2 given information at time t − 1
In period t we observe a new data point: $y_t$
If we are in state 1 the likelihood of observing $y_t$ is $P(y_t | s_t = 1)$
If we are in state 2 the likelihood of $y_t$ is $P(y_t | s_t = 2)$
If these are normally distributed, we have
$$ P(y_t | s_t = 1) = \frac{1}{\sqrt{2\pi\sigma_1^2}} \exp\left( \frac{-(y_t - \mu_1)^2}{2\sigma_1^2} \right) $$
$$ P(y_t | s_t = 2) = \frac{1}{\sqrt{2\pi\sigma_2^2}} \exp\left( \frac{-(y_t - \mu_2)^2}{2\sigma_2^2} \right) \qquad (3) $$
Bayesian updating with two states: examples I
Use Bayes’ rule to compute the updated state probabilities:
$$ P(s_t = 1 | y_t) = \frac{P(y_t | s_t = 1) P(s_t = 1)}{P(y_t)}, \text{ where} $$
$$ P(y_t) = P(y_t | s_t = 1) P(s_t = 1) + P(y_t | s_t = 2) P(s_t = 2) $$
Similarly
$$ P(s_t = 2 | y_t) = \frac{P(y_t | s_t = 2) P(s_t = 2)}{P(y_t)} $$
Suppose that $\mu_1 < 0$ and $\sigma_1^2$ is “large” so state 1 is a high volatility state
with negative mean, while $\mu_2 > 0$ with small $\sigma_2^2$ so state 2 is a
“normal” state
Bayesian updating with two states: examples II
If we see a large negative yt , this is most likely drawn from state 1
and so P(yt |st = 1) > P(yt |st = 2). Then we revise upward the
probability that we are currently (at time t) in state 1
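Example:
$\mu_1 = -3, \sigma_1 = 5, \mu_2 = 1, \sigma_2 = 2$
$P(s_t = 1 | y_{t-1}) = 0.70$, $P(s_t = 2 | y_{t-1}) = 0.30$ : initial estimates
$p_{11} = 0.8, p_{22} = 0.9$
Suppose we observe $y_t = -4$. Then from (3), with Normpdf the standard normal
density evaluated at $(y_t - \mu_i)/\sigma_i$ and divided by $\sigma_i$:
$$ p(y_t | s_t = 1) = Normpdf(-1/5)/5 = 0.0782 $$
$$ p(y_t | s_t = 2) = Normpdf(-5/2)/2 = 0.0088 $$
$$ P(s_t = 1 | y_t) = \frac{0.0782 \times 0.70}{0.0782 \times 0.70 + 0.0088 \times 0.30} = 0.954 $$
$$ P(s_t = 2 | y_t) = \frac{0.0088 \times 0.30}{0.0782 \times 0.70 + 0.0088 \times 0.30} = 0.046 $$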
Bayesian updating with two states: examples III
Because the observed value (-4%) is far more likely to have been
drawn from state 1 than from state 2, we revise upwards our beliefs
that we are currently in the first state from 70% to 95.4%
Using p11 and p22, our forecast of being in state 1 next period (at
time t + 1) is
P(st+1 = 1|yt ) = 0.954× 0.8+ 0.046× (1− 0.9) = 0.768
Our forecast of being in state 2 next period is
P(st+1 = 2|yt ) = 0.954× (1− 0.8) + 0.046× 0.9 = 0.232
Bayesian updating with two states: examples IV
Similarly, the mean and variance forecasts in this case are given by
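$$ E[y_{t+1} | y_t] = \mu_1 P(s_{t+1} = 1 | y_t) + \mu_2 P(s_{t+1} = 2 | y_t) = -3 \times 0.768 + 1 \times 0.232 = -2.07 $$
$$ Var(y_{t+1} | y_t) = \sigma_1^2 P(s_{t+1} = 1 | y_t) + \sigma_2^2 P(s_{t+1} = 2 | y_t) + P(s_{t+1} = 1 | y_t) \times P(s_{t+1} = 2 | y_t) (\mu_2 - \mu_1)^2 $$
$$ = 5^2 \times 0.768 + 2^2 \times 0.232 + 0.768 \times 0.232 \times (1 + 3)^2 = 22.98 $$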
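The whole worked example (Bayesian update, one-step state forecast, and the implied mean and variance forecasts) can be reproduced with a few lines of Python. The sketch below uses the parameter values stated in the example; the variable and function names are chosen for this illustration only.

```python
import math


def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at y, as in equation (3)."""
    return math.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)


mu = (-3.0, 1.0)        # state means
sigma = (5.0, 2.0)      # state volatilities
prior = (0.70, 0.30)    # P(s_t = 1 | y_{t-1}), P(s_t = 2 | y_{t-1})
p11, p22 = 0.8, 0.9     # stayer probabilities
y_t = -4.0              # new observation

# Bayes' rule: posterior is proportional to likelihood times prior
lik = [normal_pdf(y_t, mu[i], sigma[i]) for i in range(2)]
evidence = sum(l * p for l, p in zip(lik, prior))
post = [l * p / evidence for l, p in zip(lik, prior)]

# One-step-ahead state probabilities
p1_next = p11 * post[0] + (1 - p22) * post[1]
p2_next = 1 - p1_next

# Mean and variance forecasts for y_{t+1}
mean_next = mu[0] * p1_next + mu[1] * p2_next
var_next = (sigma[0] ** 2 * p1_next + sigma[1] ** 2 * p2_next
            + p1_next * p2_next * (mu[1] - mu[0]) ** 2)

print(round(post[0], 3), round(p1_next, 3), round(mean_next, 2), round(var_next, 2))
```

Small differences in the last digit relative to the numbers above can arise because the hand calculation rounds the intermediate probabilities.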
Bayesian updating with two states: example (cont.)
Suppose instead we observe a value $y_t = +1$. Then, with Normpdf the standard normal
density evaluated at $(y_t - \mu_i)/\sigma_i$ and divided by $\sigma_i$:
$$ p(y_t | s_t = 1) = Normpdf(4/5)/5 = 0.0579 $$
$$ p(y_t | s_t = 2) = Normpdf(0)/2 = 0.1995 $$
$$ P(s_t = 1 | y_t) = \frac{0.0579 \times 0.70}{0.0579 \times 0.70 + 0.1995 \times 0.30} = 0.4038 $$
$$ P(s_t = 2 | y_t) = \frac{0.1995 \times 0.30}{0.0579 \times 0.70 + 0.1995 \times 0.30} = 0.5962 $$
Now, we reduce the probability of being in state 1 from 70% to 40%,
while we increase the chance of being in state 2 from 30% to 60%
Our forecasts of being in states 1 and 2 next period are
$$ P(s_{t+1} = 1 | y_t) = 0.4038 \times 0.8 + 0.5962 \times (1 - 0.9) = 0.3827 $$
$$ P(s_{t+1} = 2 | y_t) = 0.4038 \times (1 - 0.8) + 0.5962 \times 0.9 = 0.6173 $$
Estimation of Markov switching models
The MS model is neither Gaussian, nor linear: the state st might lead
to changes in regression coefficients and the covariance matrix
Two common estimation methods:
Maximum likelihood estimation (MLE)
Bayesian estimation using Gibbs sampler
Filtered states: P(st = i |It ) : probability of being in state i at time t
given information at time t, It
Smoothed states: P(st = i |IT ) : probability of being in state i at
time t given information at the end of the sample, IT
Choice of number of states can be tricky. We can use AIC or BIC
Filtered states (Ang-Timmermann, 2012)
Smoothed state probabilities (Ang-Timmermann, 2012)
Smoothed state probabilities (Ang-Timmermann)
Parameter estimates (Ang-Timmermann, 2012)
$$ y_t = \mu_{s_t} + \phi_{s_t} y_{t-1} + \sigma_{s_t} \varepsilon_t, \quad \varepsilon_t \sim \text{i.i.d. } N(0, 1) $$
Take-away for MS models
Markov switching models are popular in finance and economics
MS models are easy to interpret economically
Empirically often one state is highly persistent (“normal” state) with
parameters not too far from the average of the series
The other state is often more transitory and captures spells of high
volatility (asset returns) or negative outliers (GDP growth)
Forecasts are easy to compute with MS models
One state often has high volatility – regime switching can be
important for risk management
Try to experiment with the Markov switching and Kalman filter codes
on Ted
Estimates, 3-state model for monthly stock returns
$P' = \begin{pmatrix} 0.9881 & 0.0119 & 0.000 \\ 0.000 & 0.9197 & 0.0803 \\ 0.8437 & 0.000 & 0.1563 \end{pmatrix}$
$\mu = (0.0651, \; -0.1321, \; 0.3756)$
$\sigma = (0.0571, \; 0.1697, \; 0.0154)$
P : state transition probabilities, µ : means, σ : volatilities
State 1: highly persistent, medium mean, medium volatility
State 2: negative mean, high volatility, medium persistence
State 3: transitory bounce-back state with high mean
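As a rough illustration of how forecasts follow from these estimates, the Python sketch below propagates state probabilities forward and weights the state means by them. It assumes that each row of the reported matrix holds the probabilities of moving from one state to states 1, 2 and 3 (each row sums to one), and it starts from the hypothetical situation where state 2 is known to be the current state. Function names are illustrative.

```python
# Transition matrix and state means as reported on the slide.
# Assumption: row i holds the probabilities of moving from state i+1
# to states 1, 2 and 3.
P = [
    [0.9881, 0.0119, 0.0000],
    [0.0000, 0.9197, 0.0803],
    [0.8437, 0.0000, 0.1563],
]
MU = [0.0651, -0.1321, 0.3756]

def step(probs, transition):
    """Propagate a vector of state probabilities one period ahead."""
    n = len(probs)
    return [sum(probs[i] * transition[i][j] for i in range(n)) for j in range(n)]

def mean_forecasts(probs, transition, means, horizon):
    """Return E[y_{t+h}] for h = 1..horizon given current state probabilities."""
    out = []
    for _ in range(horizon):
        probs = step(probs, transition)
        out.append(sum(p * m for p, m in zip(probs, means)))
    return out

# Hypothetical starting point: the current state is known to be state 2.
forecasts = mean_forecasts([0.0, 1.0, 0.0], P, MU, horizon=3)
print([round(f, 4) for f in forecasts])
```

In practice the starting probabilities would be the filtered probabilities P(st = i | It) rather than a known state.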
Smoothed state probabilities, monthly stock returns
[Figure: smoothed state probabilities plotted against time, 1930-2010; vertical-axis label: Monthly stock returns]
Fitted versus actual stock returns (3 state model)
[Figure: fitted values and actual monthly stock returns plotted against time, 1920-2020; vertical-axis label: Fitted values, stock returns]
Volatility of monthly stock returns
[Figure: fitted volatility of monthly stock returns plotted against time, 1920-2020; vertical-axis label: Fitted volatility, stock returns]
Lecture 5: Random walk and spurious correlation
UCSD, Winter 2017
Allan Timmermann, UC San Diego
Timmermann (UCSD) Random walk Winter, 2017 1 / 21
1 Random walk model
2 Logs, levels and growth rates
3 Spurious correlation
Random walk model
The random walk model is an AR(1) yt = φ1yt−1 + εt with φ1 = 1 :
yt = yt−1 + εt , εt ∼ WN(0, σ2).
This model implies that the change in yt is unpredictable:
∆yt = yt − yt−1 = εt
For example, the level of (log-) stock prices is easy to predict, but not
its change (rate of return for log-prices)
Shocks to the random walk have permanent effects: A one unit shock
moves the series by one unit forever. This is in sharp contrast to a
mean-reverting process such as yt = 0.8yt−1 + εt
Random walk model (cont)
The variance of a random walk increases over time so the distribution
of yt changes over time. Suppose that yt started at zero, y0 = 0 :
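$y_1 = y_0 + \varepsilon_1 = \varepsilon_1$
$y_2 = y_1 + \varepsilon_2 = \varepsilon_1 + \varepsilon_2$
$\vdots$
$y_t = \varepsilon_1 + \varepsilon_2 + \dots + \varepsilon_{t-1} + \varepsilon_t$, so
$E[y_t] = 0$
$\mathrm{var}(y_t) = \mathrm{var}(\varepsilon_1 + \varepsilon_2 + \dots + \varepsilon_t) = t\sigma^2 \Rightarrow \lim_{t \to \infty} \mathrm{var}(y_t) = \infty$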
The variance of y grows proportionally with time
A random walk does not revert back to the mean but wanders up and
down at random
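A small simulation can make the growth of the variance concrete. The Python sketch below simulates many random walk paths starting at y0 = 0 and compares the cross-sectional variance of yt with tσ². It assumes Gaussian shocks with σ = 1, which the slides do not require; the number of paths and the seed are arbitrary.

```python
import random

def simulate_random_walk(n_steps, sigma, rng):
    """Return y_1, ..., y_n for y_t = y_{t-1} + eps_t with y_0 = 0."""
    path, level = [], 0.0
    for _ in range(n_steps):
        level += rng.gauss(0.0, sigma)
        path.append(level)
    return path

def cross_section_variance(paths, t):
    """Sample variance of y_t across simulated paths (t is 1-based)."""
    values = [p[t - 1] for p in paths]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

rng = random.Random(0)  # fixed seed so the run is reproducible
sigma = 1.0
paths = [simulate_random_walk(100, sigma, rng) for _ in range(4000)]
for t in (10, 50, 100):
    print(t, round(cross_section_variance(paths, t), 2), "theory:", t * sigma ** 2)
```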
Forecasts from random walk model
Recall that forecasts from the AR(1) process yt = φ1yt−1 + εt ,
εt ∼ WN(0, σ2) are simply
$f_{t+h|t} = \phi_1^h y_t$
For the random walk model φ1 = 1, so for all forecast horizons, h, the
forecast is simply the current value:
ft+h|t = yt
Forecast of tomorrow = today’s value
The basic random walk model says that the value of the series next
period (given the history of the series) equals its current value plus an
unpredictable change. Random steps, εt , make yt a “random walk”
Random walk with a drift
Introduce a non-zero drift term, δ :
yt = δ+ yt−1 + εt , εt ∼ WN(0, σ2).
This is a popular model for the logarithm of stock prices
The drift term, δ, plays the same role as a time trend. Assuming
again that the series started at y0, we have
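$y_t = 2\delta + y_{t-2} + \varepsilon_t + \varepsilon_{t-1}$
$\phantom{y_t} = \delta t + y_0 + \varepsilon_1 + \varepsilon_2 + \dots + \varepsilon_{t-1} + \varepsilon_t$, so
$E[y_t] = y_0 + \delta t$
$\mathrm{var}(y_t) = t\sigma^2$
$\lim_{t \to \infty} \mathrm{var}(y_t) = \infty$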
Summary of properties of random walk
Changes in a random walk are unpredictable
Shocks have permanent effects
Variance grows in proportion with the forecast horizon
These points are important for forecasting:
point forecasts never revert to a mean or a trend
since the variance goes to infinity, the width of interval forecasts
increases without bound as the forecast horizon grows
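To illustrate these two points, the Python sketch below compares point and interval forecasts from a random walk (φ1 = 1) with those from a mean-reverting AR(1) with φ1 = 0.8. The 1.96 multiplier assumes Gaussian shocks, which is an added assumption, and the starting value and σ are hypothetical. The forecast error variance is computed as σ² times the sum of φ1^(2i) for i = 0, ..., h-1, which equals hσ² when φ1 = 1.

```python
import math

def point_forecast(y_t, phi, h):
    """h-step forecast from y_t = phi * y_{t-1} + eps_t; phi = 1 is the random walk."""
    return phi ** h * y_t

def forecast_error_variance(phi, sigma, h):
    """Variance of the h-step forecast error: sigma^2 * sum_{i=0}^{h-1} phi^(2i)."""
    return sigma ** 2 * sum(phi ** (2 * i) for i in range(h))

def interval_forecast(y_t, phi, sigma, h, z=1.96):
    """Approximate 95% interval, assuming Gaussian shocks (an added assumption)."""
    centre = point_forecast(y_t, phi, h)
    half_width = z * math.sqrt(forecast_error_variance(phi, sigma, h))
    return centre - half_width, centre + half_width

for h in (1, 4, 16, 64):
    rw = interval_forecast(10.0, 1.0, 1.0, h)   # random walk
    ar = interval_forecast(10.0, 0.8, 1.0, h)   # mean-reverting AR(1)
    print(h, [round(v, 2) for v in rw], [round(v, 2) for v in ar])
```

The random walk interval stays centred on the current value and keeps widening, while the AR(1) interval moves towards the mean of zero and its width levels off.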
Logs, levels and growth rates
Certain transformations of economic variables such as their logarithm
are often easier to model than the “raw” data
If the standard deviation of a time series is proportional to its level,
then the standard deviation of the logarithm of the series is
approximately constant:
Yt = Yt−1 exp(εt ), εt ∼ (0, σ2)⇔
ln(Yt ) = ln(Yt−1) + εt
The first difference of the log of Yt is ∆ ln(Yt ) = ln(Yt )− ln(Yt−1)
The percentage change in Yt between t − 1 and t is approximately
100∆ ln(Yt ). This can be interpreted as a growth rate
Example: US GDP follows an upward trend. Instead of studying the
level of US GDP, we can study its growth rate which is not trending
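The quality of the approximation 100∆ ln(Yt ) for the percentage change can be checked with a few numbers. The Python sketch below uses hypothetical levels; it is only meant to show that the two measures are close for small changes and drift apart for large ones.

```python
import math

def percent_change(prev, curr):
    """Exact percentage change from prev to curr."""
    return 100.0 * (curr - prev) / prev

def log_growth(prev, curr):
    """100 times the first difference of the logarithm."""
    return 100.0 * (math.log(curr) - math.log(prev))

for prev, curr in [(100.0, 101.0), (100.0, 105.0), (100.0, 150.0)]:
    print(round(percent_change(prev, curr), 2), round(log_growth(prev, curr), 2))
```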
Unit root processes
Random walk is a special case of a unit root process which has a unit
root in the AR polynomial, i.e.,
(1− L)yt = θ(L)εt ,
We can test for a unit root using an Augmented Dickey Fuller (ADF)
test:
$\Delta y_t = \alpha + \beta y_{t-1} + \sum_{i=1}^{p} \lambda_i \Delta y_{t-i} + \varepsilon_t.$
Under the null of a unit root, H0 : β = 0. Under the alternative of
stationarity, H1 : β < 0
Unit root processes (cont.)
Example: suppose p = 0 (no autoregressive terms for ∆yt) and
β = −0.2. Then
∆yt = yt − yt−1 = α− 0.2yt−1 + εt ⇔
yt = α+ 0.8yt−1 + εt (which is stationary)
If instead β = 0.2, we have
yt − yt−1 = α+ 0.2yt−1 + εt ⇔
yt = α+ 1.2yt−1 + εt (which is explosive)
Test is based on the t-stat of β. Test statistic follows a non-standard
distribution with wider tails than the normal distribution
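For the case p = 0, the test statistic is just the OLS t-statistic on β in a regression of ∆yt on a constant and yt−1. The Python sketch below computes it for a simulated stationary AR(1) and a simulated random walk. It is a hand-rolled illustration, not a replacement for a library routine: it returns only the statistic, which must then be compared with Dickey-Fuller critical values (roughly -2.86 at the 5% level for the model with a constant in large samples) rather than with normal critical values. Simulation settings and function names are hypothetical.

```python
import math
import random

def ols_slope_tstat(y, x):
    """Slope and its t-statistic in the OLS regression y = a + b * x + u."""
    n = len(y)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return b, b / std_err

def dickey_fuller_tstat(series):
    """(beta, t-stat) from dy_t = alpha + beta * y_{t-1} + eps_t, i.e. p = 0."""
    changes = [curr - prev for prev, curr in zip(series, series[1:])]
    return ols_slope_tstat(changes, series[:-1])

def simulate_ar1(n_obs, phi, rng):
    """Simulate y_t = phi * y_{t-1} + eps_t with y_0 = 0 and N(0, 1) shocks."""
    path, level = [], 0.0
    for _ in range(n_obs):
        level = phi * level + rng.gauss(0.0, 1.0)
        path.append(level)
    return path

rng = random.Random(0)
print("stationary AR(1):", dickey_fuller_tstat(simulate_ar1(500, 0.5, rng)))
print("random walk:     ", dickey_fuller_tstat(simulate_ar1(500, 1.0, rng)))
```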
Unit root test in matlab
In matlab: adftest
[h,pValue,stat,cValue,reg] = adftest(y)
[h,pvalue,stat,cvalue] = adftest(logprice,'lags',1,'model','AR');
Critical values for Dickey-Fuller test
Shanghai SE stock price (monthly, 1991-2014)
t-statistic: 1.0362. p-value: 0.92. Fail to reject null of a unit root.
[Figure: Shanghai SE stock price series plotted against time, 1990-2015]
Changes in Shanghai SE stock price
t-statistic: -11.15. p−value: 0.001. Reject null of a unit root.
[Figure: changes in the Shanghai SE stock price series plotted against time, 1990-2015]
Spurious correlation
Time series that are trending systematically up or down may appear
to be significantly correlated even though they are completely
independent
Correlation between a city’s ice cream sales and the number of
drownings in city swimming pools: Both peak at the same time even
though there is no causal relationship between the two. In fact, a
heat wave may drive both variables
Dutch statistics reveal a positive correlation between the number of
storks nesting in the spring and the number of human babies born at
that time. Any causal relation?
Cumulative rainfall in Brazil and US stock prices
Spurious correlation
Two series with a random walk (unit root) component may appear to
be related even when they are not. Consider an example:
y1t = y1t−1 + ε1t
y2t = y2t−1 + ε2t ,
cov(ε1t , ε2t ) = 0
Regressing one variable on the other y2t = α+ βy1t + ut often leads
to apparently high values of R2 and of the associated t−statistic for
β. Both are unreliable! Solutions:
instead of regressing y2t on y1t in levels, regress ∆y2t on ∆y1t
use cointegration analysis
Simulations of stationary processes
1,000 simulations (T = 500) of uncorrelated stationary AR(1)
processes:
y1t = 0.5y1t−1 + ε1t
y2t = 0.5y2t−1 + ε2t
cov(ε1t , ε2t ) = 0
The two time series y1t and y2t are independent by construction
Next, estimate a regression
y1t = β0 + β1y2t + ut
What do you expect to find?
Simulation from stationary AR(1) process
Average t−stat: 1.02. Rejection rate: 5.7%. Average R2 : 0.003
[Figure: two histograms: distribution of t-stats, stationary AR(1); distribution of R-squared, stationary AR(1)]
Spurious correlation: simulations
1,000 simulations of uncorrelated random walk processes:
y1t = y1t−1 + ε1t
y2t = y2t−1 + ε2t
cov(ε1t , ε2t ) = 0
Then estimate regression
y1t = β0 + β1y2t + ut
What do we find now?
Spurious correlation: simulation from random walk
Average t−stat: 13.4. Rejection rate: 44%. Average R2 : 0.25
[Figure: two histograms: distribution of t-stats, random walk; distribution of R-squared, random walk]
Spurious correlation: dealing with the problem
1,000 simulations of uncorrelated random walk processes:
y1t = y1t−1 + ε1t
y2t = y2t−1 + ε2t
cov(ε1t , ε2t ) = 0
Next, estimate regression on first-differenced series:
∆y1t = β0 + β1∆y2t + ut
Spurious correlation: simulation from random walk
Average t−stat: 0.78. Rejection rate: 1.5%. Average R2 : 0.002
[Figure: two histograms: distribution of t-stats, random walk, first-differences; distribution of R-squared, random walk, first-differences]
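The three experiments can be sketched in a few lines of Python. This is not the code behind the slides: it uses Gaussian shocks, a smaller sample (T = 200) and fewer replications (300) to keep the run short, so the numbers it prints will differ from those reported above. It regresses one simulated series on an independent one and records how often the usual rule |t| > 1.96 rejects β1 = 0.

```python
import math
import random

def simulate_ar1(n_obs, phi, rng):
    """Simulate y_t = phi * y_{t-1} + eps_t with y_0 = 0 and N(0, 1) shocks."""
    path, level = [], 0.0
    for _ in range(n_obs):
        level = phi * level + rng.gauss(0.0, 1.0)
        path.append(level)
    return path

def slope_tstat(y, x):
    """t-statistic of beta1 in the OLS regression y = beta0 + beta1 * x + u."""
    n = len(y)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    rss = sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y))
    return beta1 / math.sqrt(rss / (n - 2) / sxx)

def diff(series):
    """First differences of a series."""
    return [curr - prev for prev, curr in zip(series, series[1:])]

def rejection_rate(phi, difference, n_sims=300, n_obs=200, seed=0):
    """Share of simulations with |t| > 1.96 for two independent series."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        y1, y2 = simulate_ar1(n_obs, phi, rng), simulate_ar1(n_obs, phi, rng)
        if difference:
            y1, y2 = diff(y1), diff(y2)
        rejections += abs(slope_tstat(y1, y2)) > 1.96
    return rejections / n_sims

rates = {
    "stationary AR(1), levels": rejection_rate(0.5, difference=False),
    "random walks, levels": rejection_rate(1.0, difference=False),
    "random walks, first differences": rejection_rate(1.0, difference=True),
}
for label, rate in rates.items():
    print(label, rate)
```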
Lecture 5: Vector Autoregressions and Factor Models
UCSD, Winter 2017
Allan Timmermann, UC San Diego
Timmermann (UCSD) VARs and Factors Winter, 2017 1 / 41
1 Vector Autoregressions
2 Forecasting with VARs
Present value example
Impulse response analysis
3 Cointegration
4 Forecasting with Factor Models
5 Forecasting with Panel Data
From univariate to multivariate models
Often information other than a variable’s own past values is relevant for
forecasting
Think of forecasting Hong Kong house prices
exchange rate, GDP growth, population growth, interest rates might be
relevant
past house prices in Hong Kong also matter (AR model)
In general we can get better models by using richer information sets
How do we incorporate additional information sources?
Vector Auto Regressions (VARs) (small set of predictors)
Factor models (many possible predictors)
Vector Auto Regressions (VARs)
Vector autoregressions generalize univariate autoregressions to the
multivariate case by letting yt be an n× 1 vector and so extend the
information set to It = {yit , yit−1, ..., yi1} for i = 1, ..., n
Many of the properties of VARs are simple multivariate generalizations of the
univariate AR model
The Wold representation theorem also extends to the multivariate case
and hence VARs and VARMA models can be used to approximate covariance
stationary multivariate (vector) processes
VARMA: Vector AutoRegressive Moving Average
VARs: definition
A pth order VAR for an n× 1 vector yt takes the form:
yt = c + A1yt−1 + A2yt−2 + ...+ Apyt−p + εt , εt ∼ WN(0,Σ)
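Ai : n× n matrix of autoregressive coefficients for i = 1, ..., p :
$A_i = \begin{pmatrix} A_{i,11} & A_{i,12} & \cdots & A_{i,1n} \\ A_{i,21} & A_{i,22} & \cdots & A_{i,2n} \\ \vdots & & & \vdots \\ A_{i,n1} & A_{i,n2} & \cdots & A_{i,nn} \end{pmatrix}$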
εt : n× 1 vector of innovations. These can be correlated across variables
VARs have the same regressors appearing in each equation
Number of parameters: $\underbrace{n^2 \times p}_{A_1, \dots, A_p} + \underbrace{n}_{c} + \underbrace{n(n+1)/2}_{\Sigma}$
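The parameter count grows quickly with n and p. The short Python sketch below evaluates the formula; the function name is illustrative, and the example values n = 5, p = 4 are the ones used later in the lecture.

```python
def var_parameter_count(n, p):
    """Return (mean parameters, covariance parameters) of a VAR(p) with n variables."""
    mean_params = n * n * p + n        # coefficient matrices A_1..A_p plus intercepts c
    cov_params = n * (n + 1) // 2      # distinct elements of the symmetric matrix Sigma
    return mean_params, cov_params

for n, p in [(2, 1), (5, 4), (10, 4)]:
    mean_params, cov_params = var_parameter_count(n, p)
    print(n, p, mean_params, cov_params, mean_params + cov_params)
```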
Why do we need VARs for forecasting?
Consider a VAR with two variables: yt = (y1t , y2t )′
IT = {y1T , y1T−1, ....y11, y2T , y2T−1, ..., y21}
Suppose y1 depends on past values of y2. Forecasting y1 one step ahead
(y1T+1) given IT is possible if we know today’s values, y1T , y2T
Suppose we want to predict y1 two steps ahead (y1T+2)
Since y1T+2 depends on y2T+1, we need a forecast of y2T+1, given IT
We need a joint model for predicting y1 and y2 given their past values
This is provided by the VAR
Example: Bivariate VAR(1)
Joint model for the dynamics in y1t and y2t :
$y_{1t} = \phi_{11} y_{1,t-1} + \phi_{12} y_{2,t-1} + \varepsilon_{1t}, \quad \varepsilon_{1t} \sim WN(0, \sigma_1^2)$
$y_{2t} = \phi_{21} y_{1,t-1} + \phi_{22} y_{2,t-1} + \varepsilon_{2t}, \quad \varepsilon_{2t} \sim WN(0, \sigma_2^2)$
Each variable depends on one lag of the other variable and one lag of itself
φ12 measures the impact of the past value of y2, y2t−1, on current y1t .
When φ12 ≠ 0, y2t−1 affects y1t
φ21 measures the impact of the past value of y1, y1t−1, on current y2t .
When φ21 ≠ 0, y1t−1 affects y2t
The two variables can also be contemporaneously correlated if the
innovations ε1t and ε2t are correlated and are influenced by common shocks:
Cov(ε1t , ε2t ) = σ12
If σ12 ≠ 0, shocks to y1t and y2t are contemporaneously correlated
Forecasting with Bivariate VAR I
One-step-ahead forecast given IT = {y1T , y2T , ..., y11, y21} :
f1T+1|T = φ11y1T + φ12y2T
f2T+1|T = φ21y1T + φ22y2T
To compute two-step-ahead forecasts, use the chain rule:
f1T+2|T = φ11f1T+1|T + φ12f2T+1|T
f2T+2|T = φ21f1T+1|T + φ22f2T+1|T
Using the expressions for f1T+1|T and f2T+1|T , we have
f1T+2|T = φ11(φ11y1T + φ12y2T ) + φ12(φ21y1T + φ22y2T )
f2T+2|T = φ21(φ11y1T + φ12y2T ) + φ22(φ21y1T + φ22y2T )
Forecasting with Bivariate VAR II
Collecting terms, we have
$f_{1,T+2|T} = (\phi_{11}^2 + \phi_{12}\phi_{21}) y_{1T} + \phi_{12}(\phi_{11} + \phi_{22}) y_{2T}$
$f_{2,T+2|T} = \phi_{21}(\phi_{11} + \phi_{22}) y_{1T} + (\phi_{12}\phi_{21} + \phi_{22}^2) y_{2T}$
To forecast y1 two steps ahead we need to forecast both y1 and y2 one step
ahead.
This can only be done if we have forecasting models for both y1 and y2
Therefore, we need to use a VAR for multi-step forecasting of time series that
depend on other variables
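The chain rule and the collected-terms expressions can be checked against each other. The Python sketch below does this for a bivariate VAR(1) without intercept; the coefficient values and the current observations are hypothetical.

```python
def one_step(phi, y):
    """One-step forecast from a bivariate VAR(1) without intercept.

    phi = ((phi11, phi12), (phi21, phi22)), y = (y1, y2).
    """
    (p11, p12), (p21, p22) = phi
    return (p11 * y[0] + p12 * y[1], p21 * y[0] + p22 * y[1])

def chain_rule_forecast(phi, y, horizon):
    """Iterate the one-step forecast `horizon` times."""
    forecast = y
    for _ in range(horizon):
        forecast = one_step(phi, forecast)
    return forecast

def two_step_closed_form(phi, y):
    """Two-step forecast after collecting terms in y1T and y2T."""
    (p11, p12), (p21, p22) = phi
    f1 = (p11 ** 2 + p12 * p21) * y[0] + p12 * (p11 + p22) * y[1]
    f2 = p21 * (p11 + p22) * y[0] + (p12 * p21 + p22 ** 2) * y[1]
    return (f1, f2)

phi = ((0.5, 0.2), (0.1, 0.7))  # hypothetical coefficients
y_T = (1.0, 2.0)                # hypothetical current values
print(chain_rule_forecast(phi, y_T, 2), two_step_closed_form(phi, y_T))
```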
Predictive (Granger) causality
Clive Granger (1969) used a variable’s predictive content to develop a
definition of causality that depends on the conditional distribution of the
predicted variable, given other information
Statistical concept of causality closely related to forecasting
Basic principles:
cause should precede (come before) effect
a causal series should contain information useful for forecasting that is not
available from the other series (including their past)
Granger causality in the bivariate VAR:
If φ12 = 0, then y2 does not Granger cause y1 : past values of y2 do not
improve our predictions of future values of y1
If φ21 = 0, then y1 does not Granger cause y2 : past values of y1 do not
improve our predictions of future values of y2
For all other values of φ12 and φ21, y1 will Granger cause y2 and/or y2 will
Granger cause y1
Include more lags?
Granger causality tests
Each variable predicts every other variable in the general VAR
In VARs with many variables, it is quite likely that some variables are not
useful for forecasting all the other variables
Granger causality findings might be overturned by adding more variables to
the model – y2t may simply predict y1t+1 because other information (y3t
which causes both y1t+1 and y2t+1) has been left out (omitted variable)
Estimation of VARs
In sufficiently large samples and under conventional assumptions, the least
squares estimates of (A1, ...,Ap) will be normally distributed around the true
parameter value
Standard errors for each regression are computed using the OLS estimates
OLS estimation is asymptotically efficient
OLS estimates are generally biased in small samples
VARs in matlab
5-variable sample code on Triton Ed: varExample.m
model = vgxset('n',5,'nAR',nlags,'Constant',true); % set up the VAR model
[modelEstimate,modelStdEr,LL] = vgxvarx(model,Y(1:estimationEnd,:)); % estimate the VAR
numParams = vgxcount(model); % number of parameters
[aicForecast,aicForecastCov] = ...
    vgxpred(modelEstimates{aicLags,1},forecastHorizon,[],Y(1:estimationEnd,:)); % forecast with VAR (presample data assumed to be the estimation sample)
Difficulties with VARs
VARs initially became a popular forecasting tool because of their relative
simplicity in terms of which choices need to be made by the forecaster
When estimating a VAR by classical methods, only two choices need to be
made to construct forecasts
which variables to include (choice of y1, ..., yn)
how many lags of the variables to include (choice of p)
Risk of overparameterization of VARs is high
The general VAR has n(np + 1) mean parameters plus another n(n+ 1)/2
covariance parameters
For n = 5, p = 4 this is 105 mean parameters and 15 covariance parameters
Bayesian procedures reduce parameter estimation error by shrinking the
parameter estimates towards some target value
Choice of lag length
Typically we search over VARs with different numbers of lags, p
With a vector of constants, n variables, p lags, and T observations, the BIC
and AIC information criteria take the forms
$BIC(p) = \ln|\hat{\Sigma}_p| + n(np + 1)\frac{\ln(T)}{T}$
$AIC(p) = \ln|\hat{\Sigma}_p| + n(np + 1)\frac{2}{T}$
$\hat{\Sigma}_p = T^{-1} \sum_{t=1}^{T} \hat{\varepsilon}_t \hat{\varepsilon}_t'$ is the estimate of the residual covariance matrix
The objective is to identify the model (indexed by p) that minimizes the
information criterion
The sample code varExample.m chooses the VAR, using up to 12 lags
(maxLags)
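The two criteria differ only in how heavily they penalize the n(np + 1) mean parameters. The Python sketch below evaluates both formulas for given values of ln|Σ̂p|. The log-determinants are hypothetical numbers chosen for illustration, not estimates from the course data; in an application they would come from fitting each VAR(p).

```python
import math

def var_information_criteria(log_det_sigma, n, p, T):
    """AIC and BIC for a VAR(p) with n variables, a constant and T observations."""
    n_mean_params = n * (n * p + 1)
    aic = log_det_sigma + n_mean_params * 2.0 / T
    bic = log_det_sigma + n_mean_params * math.log(T) / T
    return aic, bic

# Hypothetical values of ln|Sigma_hat_p| for p = 1, 2, 3 (not taken from the slides).
log_dets = {1: -10.00, 2: -10.30, 3: -10.35}
n, T = 4, 200
criteria = {p: var_information_criteria(ld, n, p, T) for p, ld in log_dets.items()}
best_aic = min(criteria, key=lambda p: criteria[p][0])
best_bic = min(criteria, key=lambda p: criteria[p][1])
print("AIC picks p =", best_aic, "; BIC picks p =", best_bic)
```

With these inputs the heavier BIC penalty selects a shorter lag length than AIC.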
Multi-period forecasts
VARs are ideally designed for generating multi-period forecasts. For the
VAR(1) specification
$y_{t+1} = A y_t + \varepsilon_{t+1}, \quad \varepsilon_{t+1} \sim WN(0, \Sigma)$
the h−step-ahead value can be written
$y_{t+h} = A^h y_t + \sum_{i=1}^{h} A^{h-i} \varepsilon_{t+i}$
The forecast under MSE loss is then
$f_{t+h|t} = A^h y_t$
Just like in the case with an AR(1) model!
Multi-period forecasts: 4-variable example
Forecasts using 4-variable VAR with quarterly inflation rate, unemployment
rate, GDP growth and 10-year yield
vgxplot(modelEstimates,Y,aicForecast,aicForecastCov); % plot forecast
[Figure: four panels (Inflation, GDP growth, Unemployment, 10 year Treasury bond rate), each showing the process with lower and upper 1-σ bands]
Multi-period forecasts of 10-year yield (cont.)
Reserve the last five years of data for forecast evaluation
[Figure: 10-year yield in percentage points plotted against time, 2009.5-2014; series shown: Actual, AIC, BIC, AR(4)]
Example: Campbell-Shiller present value model I
Campbell and Shiller (1988) express the continuously compounded stock
return in period t + 1, rt+1, as an approximate linear function of the
logarithms of current and future stock prices, pt , pt+1 and dividends, dt+1:
rt+1 = k + ρpt+1 + (1− ρ)dt+1 − pt
ρ is a scalar close to (but below) one, and k is a constant
Rearranging, we get a recursive equation for log-prices:
pt = k + ρpt+1 + (1− ρ)dt+1 − rt+1
Iterating forward and taking expectations conditional on current information,
we have
$p_t = \frac{k}{1 - \rho} + (1 - \rho) E_t\left[\sum_{j=0}^{\infty} \rho^j d_{t+1+j}\right] - E_t\left[\sum_{j=0}^{\infty} \rho^j r_{t+1+j}\right]$
Example: Campbell-Shiller present value model II
Stock prices depend on an infinite sum of expected future dividends and
expected returns
Key to the present value model is therefore how such expectations are formed
VARs can address this question since they can be used to generate
multi-period forecasts
To illustrate this point, let zt be a vector of state variables with z1t = pt ,
z2t = dt , z3t = xt ; xt are predictor variables
Define selection vectors $e_1 = (1 \; 0 \; 0)'$, $e_2 = (0 \; 1 \; 0)'$, $e_3 = (0 \; 0 \; 1)'$ so
$p_t = e_1' z_t$, $d_t = e_2' z_t$, $x_t = e_3' z_t$
Suppose that zt follows a VAR(1):
$z_{t+1} = A z_t + \varepsilon_{t+1} \Rightarrow E_t[z_{t+j}] = A^j z_t$
Example: Campbell-Shiller present value model III
If expected returns Et [rt+1+j ] are constant and stock prices only move due
to variation in dividends, we have (ignoring the constant and assuming that
we can invert (I − ρA))
$p_t = (1 - \rho) E_t\left[\sum_{j=0}^{\infty} \rho^j d_{t+1+j}\right] = (1 - \rho) e_2' \sum_{j=0}^{\infty} \rho^j A^{j+1} z_t = (1 - \rho) e_2' A (I - \rho A)^{-1} z_t$
Nice and simple expression for the present value stock price!
The VAR gives us a simple way to compute expected future dividends
Et [dt+1+j ] for all future points in time given the current information in zt
Can you suggest other ways of doing this?
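One way to convince yourself of the closed-form expression is to compare it with a long truncated sum. The Python sketch below does this for a simplified two-variable state vector zt = (dt , xt ) instead of the three-variable vector above, so the selection vector picks the first element. The matrix A, the state zt and ρ = 0.96 are hypothetical values, chosen so that (I − ρA) is invertible and the sum converges.

```python
def mat_vec(m, v):
    """Matrix-vector product for small lists of lists."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

def inverse_2x2(m):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def pv_closed_form(A, z, rho, pick):
    """(1 - rho) * e' A (I - rho A)^{-1} z, with e selecting element `pick`."""
    i_minus_rho_a = [[(1.0 if i == j else 0.0) - rho * A[i][j] for j in range(2)]
                     for i in range(2)]
    discounted = mat_vec(A, mat_vec(inverse_2x2(i_minus_rho_a), z))
    return (1.0 - rho) * discounted[pick]

def pv_truncated_sum(A, z, rho, pick, n_terms):
    """(1 - rho) * sum_{j=0}^{n_terms-1} rho^j e' A^{j+1} z."""
    total, forecast = 0.0, list(z)
    for j in range(n_terms):
        forecast = mat_vec(A, forecast)      # E_t[z_{t+1+j}] = A^{j+1} z_t
        total += rho ** j * forecast[pick]
    return (1.0 - rho) * total

A = [[0.9, 0.1], [0.0, 0.5]]   # hypothetical VAR(1) coefficients for (d_t, x_t)
z_t = [1.0, 0.5]               # hypothetical current state
rho = 0.96
print(pv_closed_form(A, z_t, rho, pick=0), pv_truncated_sum(A, z_t, rho, 0, 2000))
```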
Impulse response analysis
Stationary vector autoregressions (VARs) can equivalently be expressed as
vector moving average (VMA) processes:
yt = εt + θ1εt−1 + θ2εt−2 + ...
Impulse response analysis shows how variable i in a VAR is affected by a
shock to variable j at different horizons:
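$\partial y_{i,t+1} / \partial \varepsilon_{jt}$: 1-period impulse
$\partial y_{i,t+2} / \partial \varepsilon_{jt}$: 2-period impulse
$\partial y_{i,t+h} / \partial \varepsilon_{jt}$: h-period impulse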
Suppose we find out that variable j is higher than we expected (by one unit).
Impulse responses show how much we revise our forecasts of future values of
yit+h due to this information
How does an interest rate shock affect future unemployment and inflation?
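For a VAR(1), yt+1 = Ayt + εt+1, the h-period response of variable i to a unit shock in εjt is element (i, j) of A^h, which follows from writing yt+h as A^h yt plus future shocks. The Python sketch below computes these responses for a hypothetical 2× 2 coefficient matrix. It works with the reduced-form shocks directly and does not orthogonalize correlated innovations, which applied work usually does.

```python
def mat_mul(a, b):
    """Product of two square matrices stored as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def impulse_responses(A, horizon):
    """Return [A^1, ..., A^horizon].

    Element (i, j) of A^h is the response of variable i at t+h
    to a unit shock to variable j at time t.
    """
    n = len(A)
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    responses = []
    for _ in range(horizon):
        power = mat_mul(A, power)
        responses.append(power)
    return responses

A = [[0.5, 0.2], [0.1, 0.7]]  # hypothetical VAR(1) coefficients
irf = impulse_responses(A, 8)
for h, response in enumerate(irf, start=1):
    print(h, round(response[0][1], 4))  # response of variable 1 to a shock in variable 2
```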
Impulse response analysis in matlab
Four-variable model: inflation, GDP growth, unemployment, 10-year Treasury
bond rate
impulseHorizon = 24; % 24 months out
W0 = zeros(impulseHorizon,4); % baseline scenario of zero shock
W1 = W0;
W1(1,4) = sqrt(modelEstimates{aicLags,1}.Q(4,4)); % one standard deviation shock to variable number four (interest rate)
Yimpulse = vgxproc(modelEstimates{aicLags,1},W1,[],Y(1:estimationEnd,:)); % impulse response
Ynoimpulse = vgxproc(modelEstimates{aicLags,1},W0,[],Y(1:estimationEnd,:));
Impulse response analysis: shock to 10-year yield
[Figure: four panels showing the response (% change) against horizon for Inflation, GDP growth, Unemployment and the 10 year Treasury bond rate]
Nobel Prize Award, 2003 press release
“Most macroeconomic time series follow a stochastic trend, so that a temporary
disturbance in, say, GDP has a long-lasting effect. These time series are called
nonstationary; they differ from stationary series which do not grow over time, but
fluctuate around a given value. Clive Granger demonstrated that the statistical
methods used for stationary time series could yield wholly misleading results when
applied to the analysis of nonstationary data. His significant discovery was that
specific combinations of nonstationary time series may exhibit stationarity, thereby
allowing for correct statistical inference. Granger called this phenomenon
cointegration. He developed methods that have become invaluable in systems
where short-run dynamics are affected by large random disturbances and long-run
dynamics are restricted by economic equilibrium relationships. Examples include
the relations between wealth and consumption, exchange rates and price levels,
and short and long-term interest rates.”
This work was done at UCSD
Cointegration
Consider the variables
xt = xt−1 + εt x follows a random walk (nonstationary)
y1t = xt + u1t y1 is a random walk plus noise
y2t = xt + u2t y2 is a random walk plus noise
εt , u1t , u2t are all white noise (or at least stationary)
xt is a unit root process: (1− L)xt = εt , so L = 1 is a "root"
y1 and y2 behave like random walks. However, their difference
y1t − y2t = u1t − u2t
is stationary (mean-reverting)
Over the long run, y1 − y2 will revert to its equilibrium value of zero
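A short simulation shows the contrast between the wandering levels and the stable difference. The Python sketch below generates the three series with Gaussian white noise of unit variance, which is an added assumption, and compares the sample variance of y1 with that of y1 − y2. The sample size and seed are arbitrary.

```python
import random

def simulate_cointegrated_pair(n_obs, rng, noise_sd=1.0):
    """Simulate y1 = x + u1 and y2 = x + u2, where x is a common random walk."""
    x, y1, y2 = 0.0, [], []
    for _ in range(n_obs):
        x += rng.gauss(0.0, 1.0)                 # common stochastic trend
        y1.append(x + rng.gauss(0.0, noise_sd))
        y2.append(x + rng.gauss(0.0, noise_sd))
    return y1, y2

def sample_variance(values):
    """Unbiased sample variance."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

rng = random.Random(1)
y1, y2 = simulate_cointegrated_pair(2000, rng)
spread = [a - b for a, b in zip(y1, y2)]
print("variance of y1:     ", round(sample_variance(y1), 2))
print("variance of y1 - y2:", round(sample_variance(spread), 2))
```

The spread has variance close to var(u1) + var(u2) = 2 regardless of the sample length, while the sample variance of y1 depends on how far the random walk happened to wander.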
Cointegration (cont.)
Future levels of random walk variables are difficult to predict
It is much easier to predict differences between two sets of cointegrated
variables
Example: Forecasting the level of Brent or WTI (West Texas Intermediate)
crude oil prices five years from now is difficult
Forecasting the difference between these prices (or the logs of their prices) is
likely to be easier
In practice we often study the logarithm of prices (instead of their level), so
percentage differences cannot become too large
Cointegration (cont.)
If two variables are cointegrated, they must both individually have a
stochastic trend (follow a unit root process) and their individual paths can
wander arbitrarily far away from their current values
There exists a linear combination that ties the two variables closely together
Future values cannot deviate too far away from this equilibrium relation
Granger representation theorem: Equilibrium errors (deviations from the
cointegrating relationship) can be used to predict future changes
Examples of possible cointegrated variables:
Oil prices in Shanghai and Hong Kong – if they differ by too much, there is an
arbitrage opportunity
Long and short interest rates
Baidu and Alibaba stock prices (pairs trading)
House prices in two neighboring cities
Chinese A and H share prices for same company. Arbitrage opportunities?
Vector Error Correction Models (VECM)
Vector error correction models (VECMs) can be used to analyze VARs with
nonstationary variables that are cointegrated
Cointegration relation restricts the long-run behavior of the variables so they
converge to their cointegrating relationship (long-run equilibrium)
Cointegration term is called the error-correction term
This measures the deviation from the equilibrium and allows for short-run
predictability
In the long-run equilibrium, the error correction term equals zero
Vector Error Correction Models (cont.)
VECM for changes in two variables, y1t , y2t with cointegrating equation
y2t = βy1t and lagged error correction term (y2t−1 − βy1t−1) :
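$\Delta y_{1t} = \alpha_1 \underbrace{(y_{2,t-1} - \beta y_{1,t-1})}_{\text{lagged error correction term}} + \lambda_1 \Delta y_{1,t-1} + \varepsilon_{1t}$
$\Delta y_{2t} = \alpha_2 (y_{2,t-1} - \beta y_{1,t-1}) + \lambda_2 \Delta y_{2,t-1} + \varepsilon_{2t}$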
In the short run y1 and y2 can deviate from the equilibrium y2t = βy1t
Lagged error correction term (y2t−1 − βy1t−1) pulls the variables back towards
their equilibrium
α1 and α2 measure the speed of adjustment of y1 and y2 towards equilibrium
Larger values of α1 and α2 (in absolute value) mean faster adjustment
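The role of α1 and α2 is easiest to see without shocks. The Python sketch below iterates the VECM with the shocks set to zero and λ1 = λ2 = 0, both simplifications made here for illustration, and tracks the error correction term y2 − βy1. With this sign convention the gap shrinks when α1 > 0 and/or α2 < 0: each period it is multiplied by (1 + α2 − βα1). The parameter values are hypothetical.

```python
def vecm_step(y1, y2, alpha1, alpha2, beta=1.0):
    """One shock-free step of the VECM with lambda1 = lambda2 = 0."""
    ec = y2 - beta * y1                      # lagged error correction term
    return y1 + alpha1 * ec, y2 + alpha2 * ec

def error_correction_path(y1, y2, alpha1, alpha2, n_steps, beta=1.0):
    """Error correction term at the start and after each of n_steps periods."""
    gaps = [y2 - beta * y1]
    for _ in range(n_steps):
        y1, y2 = vecm_step(y1, y2, alpha1, alpha2, beta)
        gaps.append(y2 - beta * y1)
    return gaps

# Hypothetical adjustment speeds: y1 moves up towards y2, y2 moves down towards y1.
gaps = error_correction_path(y1=0.0, y2=1.0, alpha1=0.1, alpha2=-0.2, n_steps=5)
print([round(g, 4) for g in gaps])
```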
Vector Error Correction Models (cont.)
In many applications (particularly with variables in logs), β = 1. Then a
forecasting model for the changes ∆y1t and ∆y2t could take the form
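$\Delta y_{1t} = c_1 + \underbrace{\sum_{i=1}^{p} \lambda_{1i} \Delta y_{1,t-i}}_{p \text{ AR lags}} + \alpha_1 \underbrace{(y_{2,t-1} - y_{1,t-1})}_{\text{error correction term}} + \varepsilon_{1t}$
$\Delta y_{2t} = c_2 + \sum_{i=1}^{p} \lambda_{2i} \Delta y_{2,t-i} + \alpha_2 (y_{2,t-1} - y_{1,t-1}) + \varepsilon_{2t}$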
This can be estimated by OLS since you know the cointegrating coefficient,
β = 1
Include more lags of the error correction term (y2t−1 − y1t−1)? Adjustments
may be slow
House prices in San Diego and San Francisco
[Figure: home price index plotted against time, 1990-2010; series shown: SD, SF]
Simple test for cointegration
Regress San Diego house prices on San Francisco house prices and test if the
residuals are non-stationary
use logarithm of prices (?)
Null hypothesis is that there is no cointegration (so there is no linear
combination of the two prices that is stationary)
If you reject the null hypothesis (get a low p-value), this means that the
series are cointegrated
If you don’t reject the null hypothesis (high p-value), there is no evidence
that the series are cointegrated
Often test has low power (fails to reject even when the series are
cointegrated)
Test for cointegration in matlab
See VecmExample.m file on Triton Ed
In matlab: egcitest : Engle-Granger cointegration test
[h, pValue, stat, cValue] = egcitest(Y )
"Engle-Granger tests assess the null hypothesis of no cointegration among
the time series in Y. The test regresses Y(:,1) on Y(:,2:end), then tests the
residuals for a unit root.
Values of h equal to 1 (true) indicate rejection of the null in favor of the
alternative of cointegration. Values of h equal to 0 (false) indicate a failure
to reject the null."
p−value of test for SD and SF house prices: 0.9351. We fail to reject the
null that the house prices are not cointegrated. Why?
House prices in San Diego and San Francisco
[Figure: Cointegrating Relation plotted against time, 1990-2010]
Forecasting with Factor models I
Suppose we have a very large set of predictor variables, xit , i = 1, ...n, where
n could be in the hundreds or thousands
The simplest forecasting approach would be to consider a linear model with
all predictors included:
$y_{t+1} = \alpha + \sum_{i=1}^{n} \beta_i x_{it} + \phi_1 y_t + \varepsilon_{y,t+1}$
This model can be estimated by OLS, assuming that the total number of
parameters, n+ 2, is small relative to the length of the time series, T
Often n > T and so linear regression methods are not feasible
Instead it is commonly assumed that the x−variables only affect y through a
small set of r common factors, Ft = (F1t , …,Frt )′, where r is much smaller
than n (typically less than ten)
Forecasting with Factor models II
This suggests using a common factor forecasting model of the form
$y_{t+1} = \alpha + \sum_{i=1}^{r} \beta_{iF} F_{it} + \phi_1 y_t + \varepsilon_{y,t+1}$
Suppose that n = 200 and r = 3 common factors
The general forecasting model requires fitting 202 mean parameters:
α, β1, …, β200, φ1
The simple factor model only requires estimating 5 mean parameters:
α, β1F , β2F , β3F , φ1
Forecasting with Factor models
The identity of the common factors is usually unknown and so must be
extracted from the data
Forecasting with common factor models can therefore be thought of as a
two-step process
1 Extract estimates of the common factors from the data
2 Use the factors, along with past values of the predicted variable, to select and
estimate a forecasting model
Suppose a set of factor estimates, F̂it , has been extracted. These are then
used along with past values of y to estimate a model and generate forecasts
of the form:
$\hat{y}_{t+1|t} = \hat{\alpha} + \sum_{i=1}^{r} \hat{\beta}_{iF} \hat{F}_{it} + \hat{\phi}_1 y_t$
Common factors can be extracted using the principal components method
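The first step, extracting a factor, can be sketched with simulated data. The Python code below standardizes a set of predictors and extracts the first principal component by power iteration on their covariance matrix. Power iteration is used here only to keep the example free of external libraries; a full eigen-decomposition would normally be used and would deliver all r factors at once. The data are generated from a hypothetical one-factor model, so the extracted component can be compared with the true factor.

```python
import math
import random

def standardize(columns):
    """Demean each series and scale it to unit sample variance."""
    out = []
    for col in columns:
        mean = sum(col) / len(col)
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (len(col) - 1))
        out.append([(v - mean) / sd for v in col])
    return out

def first_principal_component(columns, n_iter=200):
    """Weights and scores of the leading principal component (power iteration)."""
    n, T = len(columns), len(columns[0])
    cov = [[sum(a * b for a, b in zip(columns[i], columns[j])) / (T - 1)
            for j in range(n)] for i in range(n)]
    weights = [1.0 / math.sqrt(n)] * n
    for _ in range(n_iter):
        new = [sum(cov[i][j] * weights[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(w * w for w in new))
        weights = [w / norm for w in new]
    factor = [sum(weights[i] * columns[i][t] for i in range(n)) for t in range(T)]
    return weights, factor

def correlation(a, b):
    """Sample correlation between two series."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))

# Hypothetical data: 8 predictors driven by one common factor plus noise.
rng = random.Random(2)
true_factor = [rng.gauss(0.0, 1.0) for _ in range(300)]
predictors = [[f + 0.5 * rng.gauss(0.0, 1.0) for f in true_factor] for _ in range(8)]
weights, estimated_factor = first_principal_component(standardize(predictors))
print(round(abs(correlation(estimated_factor, true_factor)), 3))
```

The estimated factor would then enter the forecasting regression for yt+1 alongside yt in the second step.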
Principal components
Wikipedia: “Principal component analysis (PCA) is a statistical procedure
that uses an orthogonal transformation to convert a set of observations of
possibly correlated variables into a set of values of linearly uncorrelated
variables called principal components. The number of principal components is
less than or equal to the number of original variables. This transformation is
defined in such a way that the first principal component has the largest
possible variance (that is, accounts for as much of the variability in the data
as possible), and each succeeding component in turn has the highest variance
possible under the constraint that it is orthogonal to (i.e., uncorrelated with)
the preceding components. The principal components are orthogonal because
they are the eigenvectors of the covariance matrix, which is symmetric. PCA
is sensitive to the relative scaling of the original variables.”
Empirical example
Data set with n = 132 predictor variables
Available in macro_raw_data.xlsx on Triton Ed. Uses data from Sydney
Ludvigson’s NYU website
Data series have to be transformed (e.g., from levels to growth rates) before
they are used to form principal components
Extract r = 8 common factors using principal components methods
Empirical example (cont.): 8 principal components
[Figure: time-series plots of the eight extracted principal components, panels PC 1 to PC 8]
Forecasting with panel data I
Forecasting methods can also be applied to cross-sections or panel data
Key requirement is that the predictors are predetermined in time. For
example, we could build a forecasting model for a large cross-section of
credit-card holders using data on household characteristics, past payment
records etc.
The implicit time dimension is that we know whether a payment in the data
turned out fraudulent
Panel regressions take the form
$$y_{it} = \alpha_i + \lambda_t + X_{it}'\beta + u_{it}, \quad i = 1, \ldots, n, \quad t = 1, \ldots, T$$
αi : fixed effect (e.g., firm, stock, or country level)
λt : time fixed effect
How do we predict λt+1?
Panel models can be estimated using regression methods
Do slope coefficients β vary across units (βi)?
bias-variance trade-off
Lecture 7: Event, Density and Volatility Forecasting
UCSD, Winter 2017
Allan Timmermann1
1UC San Diego
Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 1 / 57
1 Event forecasting
2 Point, interval and density forecasts
Location-Scale Models of Density Forecasts
GARCH Models
Realized Volatility Measures
3 Interval and Density Forecasts
Mean reverting processes
Random walk model
Alternative Distribution Models
Event forecasting
Discrete events are important in economics and finance
Mergers & Acquisitions – do they happen (yes = 1) or not (no = 0)?
Will a credit card transaction be fraudulent (yes = 1, no = 0)?
Will Europe enter into a recession in 2017 (yes = 1, no = 0)?
What will my course grade be? A, B , C
Change in Fed funds rate is usually in increments of 25 bps or zero. Create
bins of 0 = no change, 1 = 25 bp change, 2 = 50 bp change, etc.
University of Iowa Electronic markets: value of contract on
Republican presidential nominee
Contracts trading for a total exposure of $500 with a $1 payoff on each
contract
y = 1 : you get paid one dollar if Trump wins the Republican nomination
y = 0 : you get nothing if Trump fails to win the Republican nomination
University of Iowa Electronic markets: Democrat vs Republican win
Limited dependent variables
Limited dependent variables have a restricted range of values (bins)
Example: A binary variable takes only two values: y = 1 or y = 0
In a binary response model, interest lies in the response probability given
some predictor variables x1t , …, xkt :
P(yt+1 = 1|x1t , x2t , .., xkt )
Example: what is the probability that the Fed will raise interest rates by more
than 75 bps in 2017 given the current level of inflation, changes in oil prices,
bank lending, unemployment rate and past interest rate decisions?
Suppose y is a binary variable taking values of zero or one
E [yt+1 |xt ] = P(yt+1 = 1|xt )× 1+ P(yt+1 = 0|xt )× 0 = P(yt+1 = 1|xt )
E [.] : Expectation. P(.) : Probability
The probability of “success” (yt+1 = 1) equals the expected value of yt+1
Linear probability model
Linear probability model:
P(yt+1 = 1|x1t , .., xkt ) = β0 + β1x1t + …+ βk xkt
x1t , …, xkt : predictor variables
In the linear probability model, βj measures the change in the probability of
success when xjt changes, holding other variables constant:
∆P(yt+1 = 1|x1t , ..,∆xjt , …, xkt ) = βj∆xjt
Problems with linear probability model:
Probabilities can be bigger than one or less than zero
Is the effect of x linear? What if you are close to a probability of zero or one?
Often the linear model gives a good first idea of the slope coefficient βj
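A minimal sketch of the first problem, assuming a single predictor and a made-up data set: fitting the linear probability model by least squares can produce fitted "probabilities" below zero and above one. The helper `ols_line` and the data are illustrative only.

```python
def ols_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - slope * mx, slope

# Hypothetical data: the event (y = 1) becomes more likely as x increases.
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
b0, b1 = ols_line(x, y)
fitted = [b0 + b1 * v for v in x]

# Fitted values that are not valid probabilities.
outside = [round(p, 3) for p in fitted if p < 0 or p > 1]
print(outside)
```

In this example the fitted values at the smallest and largest x fall outside the unit interval, which is what motivates the binary response models that follow.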
Linear probability model: forecasts outside [0,1]
Binary response models
To address the limitations of the linear probability model, consider a class of
binary response models of the form
P(yt+1 = 1|x1t , x2t , …, xkt ) = G (β0 + β1x1t + ..+ βk xkt )
for functions G (.) satisfying
0 ≤ G (.) ≤ 1
Probabilities are now guaranteed to fall between zero and one
Logit and Probit models
Two popular choices for G (.)
Logit model:
$$G(x) = \frac{\exp(x)}{1 + \exp(x)}$$
Probit model:
$$G(x) = \Phi(x) = \int_{-\infty}^{x} \phi(z)\,dz, \quad \phi(z) = (2\pi)^{-1/2} \exp(-z^2/2)$$
Φ(x) is the standard normal cumulative distribution function (CDF)
Logit and probit functions are increasing and steepest at x = 0
G (x)→ 0 as x → −∞, and G (x)→ 1 as x → ∞
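The two link functions can be sketched in a few lines. The coefficient values below are hypothetical and only serve to show how an index β0 + β1x is mapped into a probability between zero and one.

```python
import math

def logit_cdf(x):
    """Logistic function G(x) = exp(x) / (1 + exp(x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def probit_cdf(x):
    """Standard normal CDF, Phi(x), via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

beta0, beta1 = -0.5, 2.0   # hypothetical coefficients
for x in (-3.0, 0.0, 0.25, 3.0):
    index = beta0 + beta1 * x
    print(x, round(logit_cdf(index), 4), round(probit_cdf(index), 4))
```

Both functions return 0.5 at an index of zero and flatten out in the tails, so a given change in x moves the probability most when the index is near zero.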
Logit and Probit models
Logit, probit and LPM in matlab
binaryResponseExample.m on Triton Ed
lpmBeta = [ones(T,1) X]\y; % estimates for linear probability model
probitBeta = glmfit(X,y,'binomial','link','probit'); % Probit model
logitBeta = glmfit(X,y,'binomial','link','logit'); % Logit model
lpmFit = [ones(T,1) X]*lpmBeta; % Calculate fitted values
probitFit = glmval(probitBeta,X,'probit'); % fitted values, probit
logitFit = glmval(logitBeta,X,'logit'); % fitted values, logit
Application: Directional investment strategy
$y_{t+1} = r^s_{t+1} - Tbill_{t+1}$: excess return on stocks ($r^s_{t+1}$) over T-bills ($Tbill_{t+1}$)
$$I^s_{y_{t+1}>0} = \begin{cases} 1 & \text{if } y_{t+1} > 0 \\ 0 & \text{otherwise} \end{cases}$$
Investment strategy: buy stocks if we predict $y_{t+1} > 0$, otherwise hold T-bills
forecast              stocks   T-bills
$f_{t+1|t} > 0$         +1        0
$f_{t+1|t} \leq 0$       0       +1
Logit/Probit model estimates the probability of a positive excess return, $\mathrm{prob}(I^s_{y_{t+1}>0} = 1 \mid I_t)$
Iy = (y > 0); % create indicator for dependent variable (variable names illustrative; strict inequality as in the definition of the indicator)
Fitted probabilities of a positive excess return
Use logit, probit or linear model to forecast the probability of a positive
(monthly) excess return using the lagged T-bill rate, dividend yield and
default spread as predictors
[Figure: fitted probabilities of a positive excess return from the LPM, probit and logit models, 1930 to 2010; horizontal axis: time, vertical axis: fitted probability]
Switching strategy (cont.)
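Decision rule: define the stock “weight” $\omega^s_{t+1|t}$
$$\omega^s_{t+1|t} = \begin{cases} 1 & \text{if } \mathrm{prob}(I^s_{y_{t+1}>0} = 1 \mid I_t) > 0.5 \\ 0 & \text{otherwise} \end{cases}$$
Payoff on stock-bond switching (market timing) portfolio:
$$r_{t+1} = \omega^s_{t+1|t}\, r^s_{t+1} + (1 - \omega^s_{t+1|t})\, Tbill_{t+1}$$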
Payoff depends on both the sign and magnitude of the predicted excess
return, even though the forecast ignores information about magnitudes
Cumulated wealth: Starting from initial wealth W0 we get
$$W_T = W_0 \prod_{\tau=1}^{T} (1 + r_\tau)$$
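The decision rule and the wealth recursion can be sketched as follows. The probabilities and returns are made-up numbers, and the function name `switching_wealth` is an assumption for this example, not part of the course code.

```python
def switching_wealth(prob_up, stock_returns, tbill_returns, w0=1.0, cutoff=0.5):
    """Cumulate wealth from holding stocks when prob_up > cutoff, T-bills otherwise."""
    wealth = w0
    for p, rs, rf in zip(prob_up, stock_returns, tbill_returns):
        weight = 1.0 if p > cutoff else 0.0      # stock weight, chosen before the return is seen
        r = weight * rs + (1.0 - weight) * rf    # portfolio return for the period
        wealth *= 1.0 + r
    return wealth

# Hypothetical monthly inputs (returns in decimals, not percent).
prob_up = [0.56, 0.48, 0.53, 0.45]
stocks = [0.02, -0.03, 0.01, -0.01]
tbills = [0.003, 0.003, 0.003, 0.003]
w_switch = switching_wealth(prob_up, stocks, tbills)
w_stocks = switching_wealth([1.0] * 4, stocks, tbills)   # always in stocks
```

In this toy sample the rule happens to avoid both negative stock returns, so `w_switch` exceeds `w_stocks`; nothing guarantees that in real data.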
Cumulated wealth from T-bills, stocks and switching rule
[Figure: cumulated wealth from the switching rule, stocks and T-bills, 1930 to 2010; horizontal axis: time]
Point forecasts
Point forecasts provide a summary statistic for the predictive density of the
predicted variable (Y ) given the data
This is all we need under MSE loss (sufficient statistic)
Limitations to point forecasts:
Different loss functions L give different point forecasts (Lecture 1)
Point forecasts convey no sense of the precision of the forecast: how
aggressively should an investor act on a predicted stock return of +1%?
Interval forecasts
It is always useful to report a measure of forecast uncertainty
Addresses “how certain am I of my forecast?”
many forecasts are surrounded by considerable uncertainty
Alternatively, use scenario analysis
specify outcomes in possible future scenarios along with the probabilities of the
scenarios
Under the assumption that the forecast errors are normally distributed, we
can easily construct an interval forecast, i.e., an interval that contains the
future value of Y with a probability such as 50%, 90% or 95%
forecast the variance as well
Distribution (density) forecasts
Distribution forecasts provide a complete characterization of the forecast
uncertainty
Calculation of expected utility for many risk-averse investors requires a
forecast of the full probability distribution of returns, not just its mean
Parametric approaches assume a known distribution such as the normal
(Gaussian)
Non-parametric methods treat the distribution as unknown
bootstrap draws from the empirical distribution of residuals
Hybrid approaches that mix different distributions can also be used
Density forecasts (cont.)
To construct density forecasts, typically three estimates are used:
Estimate of the conditional mean given the data, µt+1|t
Estimate of the conditional volatility given the data, σt+1|t
Estimate of the distribution function of the innovations/shocks, Pt+1|t
Conditional location-scale processes with normal errors
yt+1 = µt+1|t + σt+1|tut+1, ut+1 ∼ N(0, 1)
µt+1|t : conditional mean of yt+1, given current information, It
σt+1|t : conditional standard deviation or volatility, given It
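$$P(y_{t+1} \leq y \mid I_t) = P\left(\frac{y_{t+1} - \mu_{t+1|t}}{\sigma_{t+1|t}} \leq \frac{y - \mu_{t+1|t}}{\sigma_{t+1|t}}\right) = P\left(u_{t+1} \leq \frac{y - \mu_{t+1|t}}{\sigma_{t+1|t}}\right) \equiv N\left(\frac{y - \mu_{t+1|t}}{\sigma_{t+1|t}}\right)$$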
P : probability
N : cumulative distribution function of Normal(0, 1)
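Given estimates of the conditional mean and volatility, the normal location-scale model delivers both event probabilities and interval forecasts. The sketch below assumes hypothetical values for the mean and volatility; the function names are illustrative.

```python
from statistics import NormalDist

def prob_below(c, mu, sigma):
    """P(y_{t+1} <= c | I_t) under the Gaussian location-scale model."""
    return NormalDist().cdf((c - mu) / sigma)

def interval_forecast(mu, sigma, coverage=0.95):
    """Symmetric interval containing y_{t+1} with the stated probability."""
    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)
    return mu - z * sigma, mu + z * sigma

mu, sigma = 0.5, 4.0            # hypothetical monthly mean and volatility, in percent
p_loss = prob_below(0.0, mu, sigma)          # probability of a negative outcome
lower, upper = interval_forecast(mu, sigma)  # 95% interval forecast
```

With a mean that is small relative to the volatility, the probability of a negative outcome is only slightly below one half and the 95% interval is wide.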
Nobel prize committee press release (2003)
“On financial markets, random fluctuations over time —volatility —are
particularly significant because the value of shares, options and other
financial instruments depends on their risk. Fluctuations can vary
considerably over time; turbulent periods with large fluctuations are followed
by calmer periods with small fluctuations. Despite such time-varying
volatility, in want of a better alternative, researchers used to work with
statistical methods that presuppose constant volatility. Robert Engle’s
discovery was therefore a major breakthrough. He found that the concept of
autoregressive conditional heteroskedasticity (ARCH) accurately
captures the properties of many time series and developed methods for
statistical modeling of time-varying volatility. His ARCH models have become
indispensable tools not only for researchers, but also for analysts on financial
markets, who use them in asset pricing and in evaluating portfolio risk.”
Robert Engle did the work on ARCH models at UCSD
This work is critical for modeling σt+1|t
ARCH models
Returns, rt+1, at short horizons (daily, 5-minute, weekly, even monthly) are
hard to predict – they are not strongly serially correlated
However, squared returns, $r_{t+1}^2$, are serially correlated and easier to predict
Volatility clustering: periods of high market volatility alternate with periods
of low volatility
Daily stock returns
10 years of daily US stock returns
[Figure: daily S&P 500 returns, 2006 to 2014; horizontal axis: time, vertical axis: percentage points]
Daily stock return levels: AR(4) model estimates
Squared daily stock returns
[Figure: squared daily S&P 500 returns, 2006 to 2014; horizontal axis: time, vertical axis: percentage points]
Squared daily stock returns: AR(4) estimates
Much stronger evidence of serial persistence (autocorrelation) in squared returns
GARCH models
Generalized AutoRegressive Conditional Heteroskedasticity
GARCH(p, q) model for the conditional variance:
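$$\varepsilon_{t+1} = \sigma_{t+1|t} u_{t+1}, \quad u_{t+1} \sim N(0, 1)$$
$$\sigma^2_{t+1|t} = \omega + \sum_{i=1}^{p} \beta_i \sigma^2_{t+1-i|t-i} + \sum_{i=1}^{q} \alpha_i \varepsilon^2_{t+1-i}$$
GARCH(1, 1) is the empirically most popular specification:
$$\sigma^2_{t+1|t} = \omega + \beta_1 \sigma^2_{t|t-1} + \alpha_1 \varepsilon^2_t = \omega + (\alpha_1 + \beta_1) \sigma^2_{t|t-1} + \underbrace{\alpha_1 \sigma^2_{t|t-1} (u^2_t - 1)}_{\text{zero mean}}$$
Difficult to beat GARCH(1,1) in many volatility forecasting contests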
GARCH(1,1) model: h-step forecasts I
α1 + β1 : measures persistence of GARCH(1,1) model
As long as α1 + β1 < 1, the volatility process will converge
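The long-run (unconditional) variance is
$$E[\sigma^2_{t+1|t}] \equiv \sigma^2 = \frac{\omega}{1 - \alpha_1 - \beta_1}$$
GARCH(1,1) is similar to an ARMA(1,1) model in squares:
$$\sigma^2_{t+1|t} = \sigma^2 + (\alpha_1 + \beta_1)(\sigma^2_{t|t-1} - \sigma^2) + \alpha_1 \sigma^2_{t|t-1} (u^2_t - 1)$$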
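Taking expectations of the ARMA form, the zero-mean term drops out, so the h-step variance forecast is the long-run variance plus (α1 + β1)^(h−1) times the current gap between the one-step forecast and the long-run variance. The sketch below filters the conditional variance through a GARCH(1,1) recursion and then applies this formula. The parameter values and returns are hypothetical, and starting the recursion at the long-run variance is an assumption of this example.

```python
def garch11_forecasts(returns, omega, alpha, beta, horizon):
    """Filter conditional variances through a GARCH(1,1) and forecast `horizon` steps ahead."""
    long_run = omega / (1.0 - alpha - beta)   # unconditional variance, needs alpha + beta < 1
    var = long_run                            # start the recursion at the long-run level
    for eps in returns:                       # returns are treated as zero-mean shocks
        var = omega + beta * var + alpha * eps ** 2
    # `var` is now the one-step-ahead forecast; longer horizons decay towards long_run.
    persistence = alpha + beta
    return [long_run + persistence ** (h - 1) * (var - long_run) for h in range(1, horizon + 1)]

# Hypothetical parameter values and daily returns (percent); not estimates from the lecture data.
omega, alpha, beta = 0.02, 0.08, 0.90
returns = [0.3, -1.2, 2.5, -3.1, 0.4]
forecasts = garch11_forecasts(returns, omega, alpha, beta, horizon=20)
```

Because the last few shocks are large, the one-step forecast starts above the long-run variance and the forecasts decline towards it at rate α1 + β1 as the horizon grows.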
Volatility modeling
GARCH(1,1) model can generate fat tails
The standard GARCH(1,1) model does not generate a skewed distribution
because the shocks are normally distributed (symmetric)
Conditional volatility estimate: estimate of the current volatility level given
all current information. This varies over time
Mean reversion: If the current conditional variance forecast $\sigma^2_{t+1|t} > \sigma^2$, the
multi-period variance forecast will exceed the average forecast by an amount
that declines in the forecast horizon
Unconditional volatility estimate: long-run (“average”) estimate of volatility.
This is constant over time
GARCH models can be estimated by maximum likelihood methods
Asymmetric GARCH models I
GJR model of Glosten, Jagannathan, and Runkle (1993):
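$$\sigma^2_{t+1|t} = \omega + \alpha_1 \varepsilon^2_t + \lambda \varepsilon^2_t I(\varepsilon_t < 0) + \beta_1 \sigma^2_{t|t-1}$$
$$I(\varepsilon_t < 0) = \begin{cases} 1 & \text{if } \varepsilon_t < 0 \\ 0 & \text{otherwise} \end{cases}$$
Positive and negative shocks affect volatility differently if $\lambda \neq 0$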
If λ > 0, negative shocks will affect future conditional variance more strongly
than positive shocks
The bigger effect of negative shocks is sometimes attributed to leverage (for
stock returns)
λ measures the magnitude of the leverage
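A one-step GJR update makes the asymmetry concrete: two shocks of equal size but opposite sign lead to different variance forecasts. The parameter values below are hypothetical.

```python
def gjr_next_variance(eps, prev_var, omega, alpha, lam, beta):
    """One-step GJR update: negative shocks get the extra weight `lam`."""
    indicator = 1.0 if eps < 0 else 0.0
    return omega + alpha * eps ** 2 + lam * eps ** 2 * indicator + beta * prev_var

# Hypothetical parameters; lam > 0 means negative shocks raise variance more.
params = dict(omega=0.02, alpha=0.03, lam=0.10, beta=0.88)
var_after_gain = gjr_next_variance(+2.0, 1.0, **params)
var_after_loss = gjr_next_variance(-2.0, 1.0, **params)
```

The gap between the two forecasts equals λ times the squared shock, which is the extra contribution of a negative shock.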
Asymmetric GARCH models II
EGARCH (exponential GARCH) model of Nelson (1991):
$$\log(\sigma^2_{t+1|t}) = \omega + \alpha_1 (|\varepsilon_t| - E[|\varepsilon_t|]) + \gamma \varepsilon_t + \beta_1 \log(\sigma^2_{t|t-1})$$
EGARCH volatility estimates are always positive in levels (the exponential of
a negative number is positive)
If γ < 0, negative shocks (εt < 0) will have a bigger effect on future conditional volatility than positive shocks
γ measures the magnitude of the leverage (sign different from GJR model)
GARCH estimation in matlab
garchExample.m: code on Triton Ed
[h, pValue, stat, cValue] = archtest(res,'lags',10); % test for ARCH
model = garch(P,Q); % creates a conditional variance GARCH model with GARCH degree P and ARCH degree Q
model = egarch(P,Q); % creates an EGARCH model with P lags of log(sigma^2_{t|t-1}) and Q lags of epsilon^2_t
model = gjr(P,Q); % creates a GJR model
modelEstimate = estimate(model,spReturns); % estimate GARCH model
modelVariances = infer(modelEstimate,spReturns); % generate conditional variance estimate
varianceForecasts = forecast(modelEstimate,10,'V0',modelVariances); % generate variance forecast
GARCH(1,1) and EGARCH estimates
GJR estimates
Comparison of fitted volatility estimates
[Figure: fitted volatility estimates from the GARCH(1,1), EGARCH(1,1) and GJR(1,1) models; vertical axis: percentage points]
Comparison of 3 models: out-of-sample forecasts
[Figure: volatility forecasts from the GARCH(1,1), EGARCH(1,1) and GJR(1,1) models for forecast horizons up to 20; horizontal axis: forecast horizon, vertical axis: percentage points]
Realized variance
True variance is unobserved. How do we estimate it?
Intuition: the higher the variance of y is in a given period, the more y fluctuates in small intervals during that period
Idea: sum the squared changes in y during small intervals between time markers $\tau_0, \tau_1, \tau_2, \ldots, \tau_N$ within a given period
Realized variance:
$$RV_t = \sum_{j=1}^{N} (y_{\tau_j} - y_{\tau_{j-1}})^2, \quad t - 1 = \tau_0 < \tau_1 < \ldots < \tau_N = t$$
Example: use 5-minute sampled data over 8 hours to estimate the daily stock market volatility:
N = 8 × 12 = 96 observations
τ0 = 8am, τ1 = 8:05am, τ2 = 8:10am, ..., τN = 4pm
Example: use squared daily returns to estimate volatility within a month:
N = 22 daily observations (trading days)
τ0 = Jan31, τ1 = Feb01, τ2 = Feb02, ..., τN = Feb28
Realized variance
Treating the realized variance as a noisy estimate of the true (unobserved) variance, we can use simple ARMA models to predict future volatility
AR(1) model for the realized variance:
$$RV_{t+1} = \beta_0 + \beta_1 RV_t + \varepsilon_{t+1}$$
The realized volatility is the square root of the realized variance
Monthly realized volatility
[Figure: monthly realized volatility, 2005 to 2014; horizontal axis: time, vertical axis: percentage points]
Example: Linear regression model
Consider the linear regression model
$$y_{t+1} = \beta_1 y_t + \varepsilon_{t+1}, \quad \varepsilon_{t+1} \sim N(0, \sigma^2)$$
The point forecast computed at time T using an estimated model is
$$\hat{f}_{T+1|T} = \hat{\beta}_1 y_T$$
The forecast error is the difference between actual value and forecast:
$$y_{T+1} - \hat{f}_{T+1|T} = \varepsilon_{T+1} + (\beta_1 - \hat{\beta}_1) y_T$$
The MSE is
$$MSE = E[(y_{T+1} - \hat{f}_{T+1|T})^2] = \sigma^2_\varepsilon + Var(\beta_1 - \hat{\beta}_1) \times y_T^2$$
This depends on $\sigma^2_\varepsilon$ and also on the estimation error $(\beta_1 - \hat{\beta}_1)$
Example (cont.): interval forecasts
Interval forecasts are similar to confidence intervals: A 95% interval forecast is an interval that contains the future value of the outcome 95% of the time
If the variable is normally distributed, we can construct this as
$$\hat{f}_{T+1|T} \pm 1.96 \times SE(Y_{T+1} - \hat{f}_{T+1|T})$$
$SE(Y_{T+1} - \hat{f}_{T+1|T})$ is the standard error of the forecast error $e_{T+1} = Y_{T+1} - \hat{f}_{T+1|T}$
Interval forecasts
Consider the simple model
$$y_{t+1} = \mu + \sigma \varepsilon_{t+1}, \quad \varepsilon_{t+1} \sim N(0, 1)$$
A typical interval forecast is that the outcome $y_{t+1}$ falls in the interval $[f^l, f^u]$ with some given probability, e.g., 95%
If $\varepsilon_{t+1}$ is normally distributed this simplifies to
$$f^l = \mu - 1.96\sigma, \quad f^u = \mu + 1.96\sigma$$
More generally, with time-varying mean ($\mu_{t+1|t}$) and volatility ($\sigma_{t+1|t}$):
$$f^l_{t+1|t} = \mu_{t+1|t} - 1.96\sigma_{t+1|t}, \quad f^u_{t+1|t} = \mu_{t+1|t} + 1.96\sigma_{t+1|t}$$
What happens to forecasts for longer horizons?
Interval forecasts
Mean reverting AR(1) process starting at the mean (yT = 1, E[y] = 1, φ = 0.9, σ = 0.5)
Interval forecasts
Mean reverting AR(1) process starting above the mean (yT = 2, E[y] = 1, σ = 0.5)
Uncertainty and forecast horizon I
Consider the AR(1) process
$$y_{t+1} = \phi y_t + \varepsilon_{t+1}, \quad \varepsilon_{t+1} \sim N(0, \sigma^2)$$
$$y_{t+h} = \phi y_{t+h-1} + \varepsilon_{t+h} = \phi(\phi y_{t+h-2} + \varepsilon_{t+h-1}) + \varepsilon_{t+h} = \phi^2 y_{t+h-2} + \phi \varepsilon_{t+h-1} + \varepsilon_{t+h}$$
$$\vdots$$
$$y_{t+h} = \phi^h y_t + \underbrace{\phi^{h-1} \varepsilon_{t+1} + \phi^{h-2} \varepsilon_{t+2} + \ldots + \phi \varepsilon_{t+h-1} + \varepsilon_{t+h}}_{\text{unpredictable future shocks}}$$
Using this expression, if |φ| < 1 (mean reversion) we have
$$f_{t+h|t} = \phi^h y_t, \quad Var(y_{t+h} \mid I_t) = \frac{\sigma^2 (1 - \phi^{2h})}{1 - \phi^2} \to \frac{\sigma^2}{1 - \phi^2} \text{ (for large } h)$$
Uncertainty and forecast horizon II
The 95% interval forecast and probability (density) forecast for the mean reverting AR(1) process (|φ| < 1) with Gaussian shocks are
95% interval forecast: $\phi^h y_t \pm 1.96 \sigma \sqrt{\dfrac{1 - \phi^{2h}}{1 - \phi^2}}$
density forecast: $N\left(\phi^h y_t, \dfrac{\sigma^2 (1 - \phi^{2h})}{1 - \phi^2}\right)$
This ignores parameter estimation error (φ is taken as known)
Interval and density forecasts for random walk model
For the random walk model $y_{t+1} = y_t + \varepsilon_{t+1}$, $\varepsilon_{t+1} \sim N(0, \sigma^2)$, so
$$y_{t+h} = y_t + \varepsilon_{t+1} + \varepsilon_{t+2} + \ldots + \varepsilon_{t+h-1} + \varepsilon_{t+h}$$
Using this expression, we get
$$f_{t+h|t} = y_t, \quad Var(y_{t+h} \mid I_t) = h\sigma^2$$
The 95% interval and probability forecasts are
95% interval forecast: $y_t \pm 1.96 \sigma \sqrt{h}$
density forecast: $N(y_t, h\sigma^2)$
Width of confidence interval continues to expand as h → ∞
Interval forecasts: random walk model
Interval forecasts for random walk model (yT = 2, σ = 1)
Alternative distributions: Two-piece normal distribution
Two-piece normal distribution:
$$dist(y_{t+1}) = \begin{cases} \dfrac{\exp(-(y_{t+1} - \mu_{t+1|t})^2 / 2\sigma_1^2)}{\sqrt{2\pi}(\sigma_1 + \sigma_2)/2} & \text{for } y_{t+1} \leq \mu_{t+1|t} \\[2ex] \dfrac{\exp(-(y_{t+1} - \mu_{t+1|t})^2 / 2\sigma_2^2)}{\sqrt{2\pi}(\sigma_1 + \sigma_2)/2} & \text{for } y_{t+1} > \mu_{t+1|t} \end{cases}$$
The mean of this distribution is
$$E_t[y_{t+1}] = \mu_{t+1|t} + \sqrt{\frac{2}{\pi}} (\sigma_2 - \sigma_1)$$
If σ2 > σ1, the distribution is positively skewed
The distribution has fat tails provided that $\sigma_1 \neq \sigma_2$
This distribution is used by Bank of England to compute “fan charts”
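As a numerical check on the two-piece normal, the sketch below codes the density, integrates it with a simple midpoint rule and compares the numerical mean with the closed-form expression. The parameter values and the integration settings are assumptions chosen for illustration.

```python
import math

def two_piece_pdf(y, mu, sigma1, sigma2):
    """Two-piece normal density: scale sigma1 below mu, sigma2 above mu."""
    sigma = sigma1 if y <= mu else sigma2
    norm = math.sqrt(2.0 * math.pi) * (sigma1 + sigma2) / 2.0
    return math.exp(-((y - mu) ** 2) / (2.0 * sigma ** 2)) / norm

def integrate(func, lo, hi, steps=20000):
    """Midpoint-rule integral of func over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(func(lo + (k + 0.5) * width) for k in range(steps)) * width

mu, s1, s2 = 2.0, 0.5, 1.5      # hypothetical mode and scales, with upside risk (s2 > s1)
total = integrate(lambda y: two_piece_pdf(y, mu, s1, s2), mu - 15, mu + 15)
mean_numeric = integrate(lambda y: y * two_piece_pdf(y, mu, s1, s2), mu - 15, mu + 15)
mean_formula = mu + math.sqrt(2.0 / math.pi) * (s2 - s1)
```

The density integrates to one, and with σ2 > σ1 the mean lies above the mode μ, consistent with positive skewness.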
Bank of England fan charts: Inflation report 02/2017
Bank of England fan charts: Inflation report 02/2016
IMF World Economic Outlook, October 2016
Alternative distributions: Mixtures of normals
Suppose
$$y_{1t+1} \sim N(\mu_1, \sigma_1^2), \quad y_{2t+1} \sim N(\mu_2, \sigma_2^2), \quad cov(y_{1t+1}, y_{2t+1}) = \sigma_{12}$$
Sums of normal distributions are normally distributed:
$$y_{1t+1} + y_{2t+1} \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2 + 2\sigma_{12})$$
Mixtures of normal distributions are not normally distributed: Let
$s_{t+1} \in \{0, 1\}$ be a random indicator variable. Then
$$s_{t+1} \times y_{1t+1} + (1 - s_{t+1}) \times y_{2t+1} \not\sim N(\cdot, \cdot)$$
Moments of Gaussian mixture models
Let p1 = probability of state 1, p2 = probability of state 2
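mean and variance of y:
$$E[y] = p_1 \mu_1 + p_2 \mu_2$$
$$Var(y) = p_2 \sigma_2^2 + p_1 \sigma_1^2 + p_1 p_2 (\mu_2 - \mu_1)^2$$
skewness (third central moment, not scaled by the variance):
$$skew(y) = p_1 p_2 (\mu_1 - \mu_2) \left\{ 3(\sigma_1^2 - \sigma_2^2) + (1 - 2p_1)(\mu_2 - \mu_1)^2 \right\}$$
kurtosis (fourth central moment, not scaled by the variance):
$$kurt(y) = p_1 p_2 (\mu_1 - \mu_2)^2 \left[ (p_2^3 + p_1^3)(\mu_1 - \mu_2)^2 \right] + 6 p_1 p_2 (\mu_1 - \mu_2)^2 \left[ p_1 \sigma_2^2 + p_2 \sigma_1^2 \right] + 3 p_1 \sigma_1^4 + 3 p_2 \sigma_2^4$$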
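The moment formulas are easy to evaluate numerically. The sketch below computes them for a hypothetical two-state mixture and then scales the third and fourth central moments by powers of the variance to obtain the usual standardized skewness and kurtosis coefficients. The state parameters and the function name are assumptions made for illustration.

```python
def mixture_central_moments(p1, mu1, sig1, mu2, sig2):
    """Mean, variance, third and fourth central moments of a two-component normal mixture."""
    p2 = 1.0 - p1
    d = mu1 - mu2
    mean = p1 * mu1 + p2 * mu2
    var = p1 * sig1 ** 2 + p2 * sig2 ** 2 + p1 * p2 * d ** 2
    third = p1 * p2 * d * (3.0 * (sig1 ** 2 - sig2 ** 2) + (1.0 - 2.0 * p1) * d ** 2)
    fourth = (p1 * p2 * d ** 4 * (p1 ** 3 + p2 ** 3)
              + 6.0 * p1 * p2 * d ** 2 * (p1 * sig2 ** 2 + p2 * sig1 ** 2)
              + 3.0 * p1 * sig1 ** 4 + 3.0 * p2 * sig2 ** 4)
    return mean, var, third, fourth

# Hypothetical states: a calm state (prob 0.8) and a turbulent state with a lower mean.
mean, var, third, fourth = mixture_central_moments(0.8, 1.0, 2.0, -2.0, 6.0)
skewness = third / var ** 1.5     # standardized skewness coefficient
kurtosis = fourth / var ** 2      # standardized kurtosis coefficient (3 for a normal)
```

For these values the mixture is negatively skewed and has kurtosis well above 3, even though each component is normal.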
Mixtures of normals: Ang and Timmermann (2012)
Mixtures of normals: Marron and Wand (1992)
Mixtures of normals: Marron and Wand (1992)
Lecture 8: Forecast Combination
UCSD, February 27, 2017
Allan Timmermann1
1UC San Diego
Timmermann (UCSD) Combination Winter, 2017 1 / 49
1 Introduction: When, why and what to combine?
2 Survey of Professional Forecasters
3 Optimal Forecast Combinations: Theory
Optimal Combinations under MSE loss
4 Estimating Forecast Combination Weights
Weighting schemes under MSE loss
Forecast Combination Puzzle
Rapach, Strauss and Zhou, 2010
Elliott, Gargano, and Timmermann, 2013
Time-varying combination weights
5 Model Combination
Optimal Pool
6 Bayesian Model Averaging
7 Conclusion
Key issues in forecast combination
Why combine?
Many models or forecasts with ‘similar’ predictive accuracy
Difficult to identify a single best forecast
State-dependent performance
Diversification gains
When to combine?
Individual forecasts are misspecified (“all models are wrong but some are
useful.”)
Unstable forecasting environment (past track record is unreliable)
Short track record; use “one-over-N” weights? (N forecasts)
What to combine?
Forecasts using different information sets
Forecasts based on different modeling approaches
Surveys, econometric model forecasts: surveys are truly forward-looking.
Econometric models are better calibrated to data
Essentials of forecast combination
Dimensionality reduction: Combination reduces the information in a large
set of forecasts to a single summary forecast using a set of combination
weights
Optimal combination chooses “optimal” weights to produce the most
accurate combined forecast
More accurate forecasts get larger weights
Combination weights also reflect correlations across forecasts
Estimation error is important for combination weights
Irrelevance Proposition: In a world with no model misspecification, infinite
data samples (no estimation error) and complete access to the information
sets underlying the individual forecasts, there is no need for forecast
combination
just use the single best model
When to combine?
Notations:
y : outcome
f̂1, f̂2 : individual forecasts
ω : combination weight
Simple combined forecast: f̂ com = ωf̂1 + (1−ω)f̂2
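The combined forecast $\hat{f}^{com}$ dominates the individual forecasts $\hat{f}_1$ and $\hat{f}_2$
under MSE loss if
$$E\left[(y - \hat{f}_1)^2\right] > E\left[(y - \hat{f}^{com})^2\right] \quad \text{and} \quad E\left[(y - \hat{f}_2)^2\right] > E\left[(y - \hat{f}^{com})^2\right]$$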
Both conditions need to hold
Applications of forecast combinations
Forecast combinations have been successfully applied in several areas of
forecasting:
Gross National Product
currency market volatility and exchange rates
inflation, interest rates, money supply
stock returns
meteorological data
city populations
outcomes of football games
wilderness area use
check volume
political risks
Estimation of GDP based on income and production measures
Averaging across values of unknown parameters
Two types of forecast combinations
1 Data used to construct the individual forecasts are not observed:
Treat individual forecasts like any other information (data) and estimate the
best possible mapping from the forecasts to the outcome
Examples: survey forecasts, analysts’ earnings forecasts
2 Data underlying the model forecasts are observed: ‘model combination’
First generate forecasts from individual models. Then combine these forecasts
Why not simply construct a single “super” model?
Survey of Economic Forecasters: Participation
[Figure: survey participation over time, 1995 to 2015]
SPF: median, interquartile range, min, max real GDP forecasts
[Figure: median, interquartile range, minimum and maximum of SPF real GDP forecasts, 1995 to 2015]
SPF: identity of best forecaster (unemployment rate), ranked by MSE
[Figure: ID of the best forecaster over time, in two panels labeled 5 Years and 10 Years; vertical axis: forecaster ID]
Forecast combinations: simple example
Two forecasting models using x1 and x2 as predictors:
yt+1 = β1x1t + ε1t+1 ⇒ f̂1t+1|t = β̂1tx1t
yt+1 = β2x2t + ε2t+1 ⇒ f̂2t+1|t = β̂2tx2t
Combined forecast:
f̂ comt+1|t = ωf̂1t+1|t + (1−ω)f̂2t+1|t
Could the combined forecast be better than the forecast based on the
“super” model?
$$y_{t+1} = \beta_1 x_{1t} + \beta_2 x_{2t} + \varepsilon_{t+1} \Rightarrow \hat{f}^{Super}_{t+1|t} = \hat{\beta}_{1t} x_{1t} + \hat{\beta}_{2t} x_{2t}$$
Combinations of forecasts: theory
Suppose the information set consists of m individual forecasts:
I = {f̂1, …., f̂m}
Find an optimal combination of the individual forecasts:
f̂ com = ω0 +ω1 f̂1 +ω2 f̂2 + …+ωm f̂m
ω0,ω1,ω2, …,ωm : unknown combination weights
The combined forecast uses the individual forecasts $\{\hat{f}_1, \hat{f}_2, \ldots, \hat{f}_m\}$ rather than
the underlying information sets used to construct the forecasts ($\hat{f}_i = \hat{\beta}_i' x_i$)
Combinations of forecasts: theory
Because the underlying ‘data’ are forecasts, they can be expected to obtain
non-negative weights that sum to unity,
$$0 \leq \omega_i \leq 1, \quad i = 1, \ldots, m, \quad \sum_{i=1}^{m} \omega_i = 1$$
Such constraints on the weights can be used to reduce the effect of
estimation error
Should we allow ωi < 0 and go "short" in a forecast?
Negative ωi doesn’t mean that the ith forecast was bad. It just means that
forecast i can be used to offset the errors of other forecasts
Combinations of two forecasts
Two individual forecasts f1, f2 with forecast errors e1 = y − f1, e2 = y − f2
Both forecasts are assumed to be unbiased: E [e1 ] = E [e2 ] = 0
Variances of forecast errors: $\sigma_i^2$, $i = 1, 2$. Covariance is $\sigma_{12}$
The combined forecast will also be unbiased if the weights add up to one:
f = ωf1 + (1−ω)f2 ⇒
e = y − f = y −ωf1 − (1−ω)f2 = ωe1 + (1−ω)e2
Forecast error from the combination is a weighted average of the individual
forecast errors
E [e] = 0
$$Var(e) = \omega^2 \sigma_1^2 + (1 - \omega)^2 \sigma_2^2 + 2\omega(1 - \omega)\sigma_{12}$$
Like a portfolio of two correlated assets
Combination of two unbiased forecasts: optimal weights
Solve for the optimal combination weight, ω∗ :
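$$\omega^* = \frac{\sigma_2^2 - \sigma_{12}}{\sigma_1^2 + \sigma_2^2 - 2\sigma_{12}}, \quad 1 - \omega^* = \frac{\sigma_1^2 - \sigma_{12}}{\sigma_1^2 + \sigma_2^2 - 2\sigma_{12}}$$
Combination weight can be negative if $\sigma_{12} > \sigma_1^2$ or $\sigma_{12} > \sigma_2^2$
If $\sigma_{12} = 0$: weights depend only on the relative variance $\sigma_2^2/\sigma_1^2$ of the forecasts:
$$\omega^* = \frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2} = \frac{\sigma_1^{-2}}{\sigma_1^{-2} + \sigma_2^{-2}}$$
Greater weight is assigned to more precise models (small $\sigma_i^2$)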
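A short numerical sketch of the two-forecast case, using hypothetical error variances and covariances: it evaluates the optimal weight, checks that the combined error variance is below that of either individual forecast, and shows a case where the weight on one forecast turns negative.

```python
def optimal_weight(var1, var2, cov12):
    """MSE-optimal weight on forecast 1 when both forecasts are unbiased."""
    return (var2 - cov12) / (var1 + var2 - 2.0 * cov12)

def combined_variance(w, var1, var2, cov12):
    """Error variance of the combination w*f1 + (1 - w)*f2."""
    return w ** 2 * var1 + (1.0 - w) ** 2 * var2 + 2.0 * w * (1.0 - w) * cov12

var1, var2, cov12 = 1.0, 2.0, 0.5     # hypothetical forecast error moments
w_star = optimal_weight(var1, var2, cov12)
v_star = combined_variance(w_star, var1, var2, cov12)

# Here cov12 > var1, so the weight on forecast 2 (1 - w) becomes negative.
w_negative_case = optimal_weight(1.0, 4.0, 1.5)
```

In the first case the more precise forecast gets weight 0.75 and the combination has a lower error variance than either forecast alone.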
Combinations of multiple unbiased forecasts
f : m× 1 vector of forecasts
e = ιmy − f : vector of m forecast errors
ιm = (1, 1, …, 1)′ : m× 1 vector of ones
Assume that the individual forecast errors are unbiased:
E [e] = 0, Σe = Covar(e)
Choosing ω to minimize the MSE subject to the weights summing to one, we
get the optimal combination weights ω∗
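$$\omega^* = \arg\min_{\omega} \omega' \Sigma_e \omega = (\iota_m' \Sigma_e^{-1} \iota_m)^{-1} \Sigma_e^{-1} \iota_m$$
Special case: if $\Sigma_e$ is diagonal, $\omega_i^* = \sigma_i^{-2} / \sum_{j=1}^{m} \sigma_j^{-2}$: inverse MSE weights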
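The general formula amounts to solving a linear system in the error covariance matrix and rescaling the solution to sum to one. The sketch below does this with a small hand-written Gaussian elimination routine (an implementation choice for this example, to stay within the standard library) and two hypothetical covariance matrices.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def optimal_weights(sigma_e):
    """Weights proportional to Sigma_e^{-1} iota, rescaled to sum to one."""
    raw = solve(sigma_e, [1.0] * len(sigma_e))
    total = sum(raw)
    return [v / total for v in raw]

# Hypothetical error covariance matrices for three forecasts.
diagonal = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 4.0]]
equicorrelated = [[1.0, 0.3, 0.3], [0.3, 1.0, 0.3], [0.3, 0.3, 1.0]]
w_diag = optimal_weights(diagonal)
w_equal = optimal_weights(equicorrelated)
```

The diagonal case reproduces the inverse MSE weights, while identical variances and identical pairwise correlations give equal weights.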
Optimality of equal weights
Equal weights (EW) play a special role in forecast combination
EW are optimal when the individual forecast errors have identical variance,
σ2, and identical pair-wise correlations ρ
nothing to distinguish between the forecasts
This situation holds to a close approximation when all models are based on
similar data and produce more or less equally accurate forecasts
Similarity to portfolio analysis: An equal-weighted portfolio is optimal if all
assets have the same mean and variance and pairwise identical covariances
Estimating combination weights
In practice, combination weights need to be estimated using past data
Once we use estimated combination weights it is difficult to show that any
particular weighting scheme will dominate other weighting methods
We prefer one method for some data and different methods for other data
Equal-weights avoid estimation error entirely
Why not always use equal weights then?
Estimating combination weights
If we try to estimate the optimal combination weights, estimation error
creeps in
In the case of forecast combination, the “data” (individual forecasts) is not a
random draw but (possibly unbiased, if not precise) forecasts of the outcome
This suggests imposing special restrictions on the combination weights
We might impose that the weights sum to one and are non-negative:
$\sum_i \omega_i = 1$, $\omega_i \in [0, 1]$
Simple combination schemes such as EW satisfy these constraints and do not
require estimation of any parameters
EW can be viewed as a reasonable prior when no data has been observed
Estimating combination weights
Simple estimation methods are difficult to beat in practice
Common baseline is to use a simple EW average of forecasts:
$f^{ew} = \frac{1}{m} \sum_{i=1}^m f_i$
No estimation error since the combination weights are imposed rather than
estimated (data independent)
Also works if the number of forecasts (m) changes over time or some
forecasts have short track records
Simple combination methods
Equal-weighted forecast
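$f^{ew} = \frac{1}{m} \sum_{i=1}^m f_i$
Median forecast (robust to outliers)
$f^{median} = \mathrm{median}\{f_i\}_{i=1}^m$
Trimmed mean. Order the forecasts
$\{f_1 \leq f_2 \leq \ldots \leq f_{m-1} \leq f_m\}$
Then trim the top/bottom $\alpha\%$ before taking an average
$f^{trim} = \frac{1}{m(1-2\alpha)} \sum_{i=\lfloor \alpha m + 1 \rfloor}^{\lfloor (1-\alpha)m \rfloor} f_i$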
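The three simple schemes can be sketched in a few lines. This is an illustration under stated assumptions: `alpha` is a fraction (0.2 means 20%), the trimmed mean drops the same number of forecasts, `floor(alpha * m)`, from each tail and divides by the number kept, which matches the formula above when `alpha * m` is an integer. All names and the example forecasts are hypothetical.

```python
import math
import statistics


def equal_weighted(forecasts):
    return sum(forecasts) / len(forecasts)


def median_forecast(forecasts):
    return statistics.median(forecasts)


def trimmed_mean(forecasts, alpha):
    """Drop the floor(alpha*m) smallest and largest forecasts, average the rest."""
    if not 0 <= alpha < 0.5:
        raise ValueError("alpha must lie in [0, 0.5)")
    ordered = sorted(forecasts)
    m = len(ordered)
    k = math.floor(alpha * m)
    kept = ordered[k:m - k]
    return sum(kept) / len(kept)


# Hypothetical cross-section of forecasts with one outlier.
fc = [1.0, 2.0, 3.0, 4.0, 100.0]
ew = equal_weighted(fc)
med = median_forecast(fc)
trim = trimmed_mean(fc, alpha=0.2)
```

The outlier pulls the equal-weighted average far from the bulk of the forecasts, while the median and trimmed mean are unaffected.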
Weights inversely proportional to MSE or rankings
Ignore correlations across forecast errors and set weights proportional to the
inverse of the models’ MSE (mean squared error) values:
$\omega_i = \frac{MSE_i^{-1}}{\sum_{j=1}^m MSE_j^{-1}}$
Robust weighting scheme that weights forecast models inversely to their rank, $Rank_i$:
$\hat\omega_i = \frac{Rank_i^{-1}}{\sum_{j=1}^m Rank_j^{-1}}$
Best model gets a rank of 1, second best model a rank of 2, etc. Weights
proportional to 1/1, 1/2, 1/3, etc.
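Both weighting schemes are easy to compute from a list of MSE values. A minimal sketch, assuming strictly positive MSEs and breaking rank ties by position (an assumption, since the slides do not discuss ties); function names are hypothetical.

```python
def inverse_mse_weights(mse):
    inv = [1.0 / v for v in mse]
    total = sum(inv)
    return [x / total for x in inv]


def inverse_rank_weights(mse):
    """Rank 1 = lowest MSE; ties are broken by position (an assumption)."""
    order = sorted(range(len(mse)), key=lambda i: mse[i])
    rank = [0] * len(mse)
    for r, i in enumerate(order, start=1):
        rank[i] = r
    inv = [1.0 / r for r in rank]
    total = sum(inv)
    return [x / total for x in inv]


# Hypothetical MSE values for three models.
mse = [2.0, 1.0, 4.0]
w_mse = inverse_mse_weights(mse)
w_rank = inverse_rank_weights(mse)
```

Rank weights depend only on the ordering of the models, so a single very large MSE value does not change the weights of the other models.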
Bates-Granger restricted least squares
Bates and Granger (1969): use plug-in weights in the optimal solution based
on the estimated variance-covariance matrix
Numerically identical to restricted least squares estimator of the weights from
regressing the outcome on the vector of forecasts ft+h|t and no intercept
subject to the restriction that the coefficients sum to one:
$f^{BG}_{t+h|t} = \hat\omega_{OLS}' f_{t+h|t} = (\iota' \hat\Sigma_e^{-1} \iota)^{-1} \iota' \hat\Sigma_e^{-1} f_{t+h|t}$
$\hat\Sigma_e = (T-h)^{-1} \sum_{t=1}^{T-h} e_{t+h|t} e_{t+h|t}'$: sample estimator of error covariance matrix
Forecast combination puzzle
Empirical studies often find that simple equal-weighted forecast combinations
perform very well compared with more sophisticated combination schemes
that rely on estimated combination weights
Smith and Wallis (2009): “Why is it that, in comparisons of combinations of
point forecasts based on mean-squared forecast errors …, a simple average
with equal weights, often outperforms more complicated weighting schemes.”
Errors introduced by estimation of the optimal combination weights could
overwhelm any gains relative to using 1/N weights
Combination forecasts using Goyal-Welch Data
RMSE: Prevmean: 1.9640, Ksink: 1.9924, EW = 1.9592
[Figure: combination forecasts, 1975-2010. x-axis: Time; y-axis: return forecast; series: PrevMean, Ksink, EW]
Combination forecasts using Goyal-Welch Data
RMSE: EW = 1.9592, rolling = 1.9875, PrevBest = 2.0072
[Figure: combination forecasts, 1975-2010. x-axis: Time; y-axis: return forecast; series: EW, rolling, PrevBest]
Rapach-Strauss-Zhou (2010)
Quarterly stock returns data, 1947-2005, 15 predictor variables
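Individual univariate prediction models ($i = 1, \ldots, N = 15$):
$r_{t+1} = \alpha_i + \beta_i x_{it} + \varepsilon_{it+1}$ (estimated model)
$\hat r_{t+1|i} = \hat\alpha_i + \hat\beta_i x_{it}$ (generated forecast)
Combination forecast of returns, $\hat r^c_{t+1|t}$:
$\hat r^c_{t+1|t} = \sum_{i=1}^N \omega_i \hat r_{t+1|i}$ with weights
$\omega_i = 1/N$ or $\omega_i = \frac{DMSPE_i^{-1}}{\sum_{j=1}^N DMSPE_j^{-1}}$
$DMSPE_i = \sum_{s=T_0}^{t} \theta^{\tau-1-s} (r_{s+1} - \hat r_{s+1|i})^2$, $\theta \leq 1$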
DMSPE : discounted mean squared prediction error
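A sketch of discounted-MSPE weights on a toy history. The indexing is an assumption made for illustration: the most recent squared error gets discount factor `theta**0` and each older error is discounted by one more power of `theta`. Function names and data are hypothetical, and each model is assumed to have a strictly positive DMSPE.

```python
def dmspe(actual, forecasts, theta):
    """Discounted sum of squared errors; the most recent error has weight theta**0."""
    n = len(actual)
    return sum(theta ** (n - 1 - s) * (actual[s] - forecasts[s]) ** 2
               for s in range(n))


def dmspe_weights(actual, forecast_sets, theta=1.0):
    inv = [1.0 / dmspe(actual, f, theta) for f in forecast_sets]
    total = sum(inv)
    return [x / total for x in inv]


# Hypothetical history: model 1 erred most recently, model 2 erred two periods ago.
actual = [1.0, 2.0, 3.0]
model_1 = [1.0, 2.0, 4.0]
model_2 = [2.0, 2.0, 3.0]

w_no_discount = dmspe_weights(actual, [model_1, model_2], theta=1.0)
w_discount = dmspe_weights(actual, [model_1, model_2], theta=0.5)
```

With `theta = 1` the two models are treated identically; with `theta < 1` the model whose error lies further in the past receives the larger weight.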
Rapach-Strauss-Zhou (2010): results
Empirical Results (Rapach, Strauss and Zhou, 2010)
Rapach, Strauss, Zhou: main results
Forecast combinations dominate individual prediction models for stock
returns out-of-sample
Forecast combination reduces the variance of the return forecast
Return forecasts are most accurate during economic recessions
“Our evidence suggests that the usefulness of forecast combining methods
ultimately stems from the highly uncertain, complex, and constantly evolving
data-generating process underlying expected equity returns, which are related
to a similar process in the real economy.”
Elliott, Gargano, and Timmermann (JoE, 2013):
K possible predictor variables
Generalizes equal-weighted combination of K univariate models
$\hat r_{t+1} = \hat\alpha_i + \hat\beta_i x_{it}$ to consider EW combination of all possible 2-variate,
3-variate, etc. models:
$\hat r_{t+1|i,j} = \hat\alpha_i + \hat\beta_i x_{it} + \hat\beta_j x_{jt}$ (2 predictors)
$\hat r_{t+1|i,j,k} = \hat\alpha_i + \hat\beta_i x_{it} + \hat\beta_j x_{jt} + \hat\beta_k x_{kt}$ (3 predictors)
For K = 12, there are 12 univariate models (k = 1), 66 bivariate models
(k = 2), 220 trivariate models (k = 3) to combine, etc.
Take equal-weighted averages over the forecasts from these models –
complete subset regressions
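The model counts quoted above are binomial coefficients, which can be verified directly. The predictor labels below are placeholders, not the variables used in the paper.

```python
from itertools import combinations
from math import comb

K = 12

# Number of k-variate models that can be formed from K predictors.
models_per_k = {k: comb(K, k) for k in range(1, K + 1)}

# Enumerating the bivariate subsets explicitly (placeholder predictor labels).
predictors = [f"x{i}" for i in range(1, K + 1)]
bivariate_subsets = list(combinations(predictors, 2))
```

For each subset size k, a complete subset regression would estimate one model per subset and average the resulting forecasts with equal weights.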
Elliott, Gargano, and Timmermann (JoE, 2013)
Adaptive combination weights
Bates and Granger (1969) propose several adaptive estimation schemes
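Rolling window of the forecast models’ relative performance over the most
recent $win$ observations:
$\hat\omega_{i,t|t-h} = \frac{\left(\sum_{\tau=t-win+1}^{t} e^2_{i,\tau|\tau-h}\right)^{-1}}{\sum_{j=1}^m \left(\sum_{\tau=t-win+1}^{t} e^2_{j,\tau|\tau-h}\right)^{-1}}$
Adaptive updating scheme discounts older performance, $\lambda \in (0, 1)$:
$\hat\omega_{i,t|t-h} = \lambda \hat\omega_{i,t-1|t-h-1} + (1-\lambda) \frac{\left(\sum_{\tau=t-win+1}^{t} e^2_{i,\tau|\tau-h}\right)^{-1}}{\sum_{j=1}^m \left(\sum_{\tau=t-win+1}^{t} e^2_{j,\tau|\tau-h}\right)^{-1}}$
The closer to unity is $\lambda$, the smoother the combination weights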
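A minimal sketch of the rolling-window and smoothed updating schemes. It assumes each model's past forecast errors are stored oldest first and that every window contains at least one nonzero error; the function names and numbers are hypothetical.

```python
def rolling_inverse_mse_weights(errors, win):
    """errors[i] holds the past forecast errors of model i, oldest first."""
    inv = [1.0 / sum(e * e for e in errs[-win:]) for errs in errors]
    total = sum(inv)
    return [x / total for x in inv]


def smoothed_weights(prev_weights, errors, win, lam):
    """Convex combination of last period's weights and the new rolling weights."""
    new = rolling_inverse_mse_weights(errors, win)
    return [lam * p + (1 - lam) * n for p, n in zip(prev_weights, new)]


# Hypothetical error histories for two models.
errors = [[3.0, 1.0, 1.0],
          [1.0, 2.0, 2.0]]
w_roll = rolling_inverse_mse_weights(errors, win=2)
w_smooth = smoothed_weights([0.5, 0.5], errors, win=2, lam=0.5)
```

Only the last `win` errors enter the rolling weights, so the large early error of the first model is ignored here.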
Time-varying combination weights
Time-varying parameter (Kalman filter):
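$y_{t+1} = \omega_t' \hat f_{t+1|t} + \varepsilon_{t+1}$
$\omega_t = \omega_{t-1} + u_t$, $\mathrm{cov}(u_t, \varepsilon_{t+1}) = 0$
Discrete (observed) state switching (Deutsch et al., 1994) conditional on
observed event happening ($e_t \in A_t$):
$y_{t+1} = I_{e_t \in A}(\omega_{01} + \omega_1' \hat f_{t+1|t}) + (1 - I_{e_t \in A})(\omega_{02} + \omega_2' \hat f_{t+1|t}) + \varepsilon_{t+1}$
Regime switching weights (Elliott and Timmermann, 2005):
$y_{t+1} = \omega_{0 s_{t+1}} + \omega_{s_{t+1}}' \hat f_{t+1|t} + \varepsilon_{t+1}$
$\Pr(S_{t+1} = s_{t+1} | S_t = s_t) = p_{s_{t+1} s_t}$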
Combinations as a hedge against instability
Forecast combinations can work well empirically because they provide
insurance against model instability
The performance of combined forecasts tends to be more stable than that of
individual forecasts used in the empirical combination study of Stock and
Watson (2004)
Combination methods that attempt to explicitly model time-variations in the
combination weights often fail to perform well, suggesting that regime
switching or model ‘breakdown’ can be difficult to predict or even to track
through time
Use simple, robust methods (rolling window)?
Combinations as a hedge against instability (cont.)
Suppose a particular forecast is correlated with the outcome only during
times when other forecasts break down. This creates a role for the forecast as
a hedge against model breakdown
Consider two forecasts and two regimes
first forecast works well only in the first state (normal state) but not in the
second state (rare state)
second forecast works well in the second state but not in the first state
second model serves as “insurance” against the breakdown of the first model
like a portfolio asset
Classical approach to density combination
Problem: we do not directly observe the outcome density (we only observe a
draw from it) and so cannot directly choose the weights to minimize the
loss between this object and the combined density
Kullback-Leibler (KL) loss for a linear combination of densities $\sum_{i=1}^m \omega_i p_{it}(y)$
relative to some unknown true density $p(y)$ is given by
$KL = \int p(y) \ln\left(\frac{p(y)}{\sum_{i=1}^m \omega_i p_i(y)}\right) dy$
$= \int p(y) \ln(p(y))\, dy - \int p(y) \ln\left(\sum_{i=1}^m \omega_i p_i(y)\right) dy$
$= C - E\left[\ln\left(\sum_{i=1}^m \omega_i p_i(y)\right)\right]$
C is constant for all choices of the weights ωi
Minimizing the KL distance is the same as maximizing the log score in
expectation
Classical approach to density combination
Use of the log score to evaluate the density combination is popular in the
literature
Geweke and Amisano (2011) use this approach to combine GARCH and
stochastic volatility models for predicting the density of daily stock returns
Under the log score criterion, estimation of the combination weights becomes
equivalent to maximizing the log likelihood. Given a sequence of observed
outcomes $\{y_t\}_{t=1}^T$, the sample analog is to maximize
$\hat\omega = \arg\max_{\omega} T^{-1} \sum_{t=1}^T \ln\left(\sum_{i=1}^m \omega_i p_{it}(y_t)\right)$
s.t. $\omega_i \geq 0$ for all $i$, $\sum_{i=1}^m \omega_i = 1$
Prediction pools with two models (Geweke-Amisano, 2011)
With two models, M1,M2, we have a predictive density
$p(y_t | Y_{t-1}, M) = \omega p(y_t | Y_{t-1}, M_1) + (1-\omega) p(y_t | Y_{t-1}, M_2)$
and a predictive log score
$\sum_{t=1}^T \log\left[\omega p(y_t | Y_{t-1}, M_1) + (1-\omega) p(y_t | Y_{t-1}, M_2)\right]$, $\omega \in [0, 1]$
Empirical example: Combine GARCH and stochastic volatility models for
predicting the density of daily stock returns
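The two-model pool can be illustrated with a grid search over the weight. This is a toy sketch, not the Geweke-Amisano procedure: the two "models" are fixed Gaussian densities with hypothetical standard deviations, the outcomes are made-up numbers, and the grid resolution is an arbitrary choice.

```python
import math


def normal_pdf(y, mean, sd):
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def pool_log_score(w, dens1, dens2):
    """Sum over t of log[w * p1_t + (1 - w) * p2_t]."""
    return sum(math.log(w * p1 + (1 - w) * p2) for p1, p2 in zip(dens1, dens2))


def best_pool_weight(dens1, dens2, n_grid=101):
    """Grid search for the weight in [0, 1] with the highest log score."""
    grid = [i / (n_grid - 1) for i in range(n_grid)]
    return max(grid, key=lambda w: pool_log_score(w, dens1, dens2))


# Hypothetical outcomes: mostly small values plus one large one.
y = [-0.5, 0.2, 0.1, 3.0, -0.3, 0.4]
dens1 = [normal_pdf(v, 0.0, 1.0) for v in y]  # narrow density
dens2 = [normal_pdf(v, 0.0, 3.0) for v in y]  # wide density
w_hat = best_pool_weight(dens1, dens2)
```

Because the log score is concave in the weight, a simple grid search or any one-dimensional optimizer finds the maximum.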
Log predictive score as a function of model weight,
S&P500, 1976-2005 (Geweke-Amisano, 2011)
Weights in pools of multiple models, S&P500, 1976-2005
(Geweke-Amisano, 2011)
Optimal prediction pool – time-varying combination weights
Pettenuzzo-Timmermann (2014)
Model combination – Bayesian Model Averaging
When constructing the individual forecasts ourselves, we can base the
combined forecast on information on the individual models’ fit
Methods such as BMA (Bayesian Model Averaging) can be used
BMA weights predictive densities by the posterior probabilities (fit) of the
models, Mi
Models that fit the data better get higher weights in the combination
Bayesian Model Averaging (BMA)
$p^c(y) = \sum_{i=1}^m \omega_i p(y | M_i)$
$m$ models: $M_1, \ldots, M_m$
BMA weights predictive densities by the posteriors of the models, Mi
BMA is a model averaging procedure rather than a predictive density
combination procedure per se
BMA assumes the availability of both the data underlying each of the
densities, pi (y) = p(y |Mi ), and knowledge of how that data is employed to
obtain a predictive density
Bayesian Model Averaging (BMA)
The combined model average, given data ($Z$), is
$p^c(y | Z) = \sum_{i=1}^m p(y | M_i, Z)\, p(M_i | Z)$
$p(M_i | Z)$: Posterior probability for model $i$, given the data, $Z$
$p(M_i | Z) = \frac{p(Z | M_i)\, p(M_i)}{\sum_{j=1}^m p(Z | M_j)\, p(M_j)}$
Marginal likelihood of model $i$ is
$p(Z | M_i) = \int p(Z | \theta_i, M_i)\, p(\theta_i | M_i)\, d\theta_i$
$p(\theta_i | M_i)$: prior density of model $i$’s parameters
$p(Z | \theta_i, M_i)$: likelihood of the data given the parameters and the model
Constructing BMA estimates
Requirements:
List of models $M_1, \ldots, M_m$
Prior model probabilities $p(M_1), \ldots, p(M_m)$
Priors for the model parameters $p(\theta_1 | M_1), \ldots, p(\theta_m | M_m)$
Computation of p(Mi |Z ) requires computation of the marginal likelihood
p(Z |Mi ) which can be time consuming
Alternative BMA schemes
Raftery, Madigan and Hoeting (1997) MC3
If the models’ marginal likelihoods are difficult to compute, one can use a
simple approximation based on BIC:
$\omega_i = p(M_i | Z) \approx \frac{\exp(-0.5\, BIC_i)}{\sum_{j=1}^m \exp(-0.5\, BIC_j)}$
Remove models that appear not to be very good
Madigan and Raftery (1994) suggest removing models for which p(Mi |Z ) is
much smaller than the posterior probability of the best model
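The BIC approximation to the posterior model probabilities can be sketched as follows. Subtracting the smallest BIC before exponentiating is a numerical-stability choice added here (it cancels in the ratio); the function name and BIC values are hypothetical.

```python
import math


def bic_weights(bic):
    """Approximate posterior model probabilities from BIC values."""
    best = min(bic)
    raw = [math.exp(-0.5 * (b - best)) for b in bic]  # shift avoids underflow
    total = sum(raw)
    return [r / total for r in raw]


# Hypothetical BIC values for three models.
w_bic = bic_weights([100.0, 102.0, 110.0])
```

A BIC difference of 2 already shifts the relative weight by a factor of e, so models with a much larger BIC receive negligible weight, in line with the suggestion to drop them.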
Conclusion
Combination of forecasts is motivated by
misspecified forecasting models due to parameter instability, omitted variables
etc.
diversification across forecasts
private information used to compute individual forecasts (surveys)
Simple, robust estimation schemes tend to work well
optimal combination weights are hard to estimate in small samples
Even if they do not always deliver the most precise forecasts, forecast
combinations generally do not deliver poor performance and so represent a
relatively safe choice
Empirically, equal-weighted survey forecasts work well for many
macroeconomic variables, but they tend to be biased and not very precise for
stock returns
Lecture 9: Forecast Evaluation
UCSD, Winter 2017
Allan Timmermann1
1UC San Diego
Timmermann (UCSD) Forecast Evaluation Winter, 2017 1 / 50
1 Forecast Evaluation: Absolute vs. relative performance
2 Properties of Optimal Forecasts – Theoretical concepts
3 Evaluation of Sign (Directional) Forecasts
4 Evaluating Interval Forecasts
5 Evaluating Density Forecasts
6 Comparing forecasts I: Tests of equal predictive accuracy
7 Comparing Forecasts II: Tests of Forecast Encompassing
Forecast evaluation
Given an observed series of forecasts, $f_{t+h|t}$, and outcomes, $y_{t+h}$,
$t = 1, \ldots, T$, we want to know if the forecasts were “optimal” or poor
Forecast evaluation is closely related to how we measure forecast accuracy
Absolute performance measures the accuracy of an individual forecast
relative to the outcome, using either economic (loss-based) or statistical
measures of performance
Forecast optimality, efficiency
A forecast that isn’t obviously deficient could still be poor
Relative performance compares the performance of one or several forecasts
against some benchmark: a horse race between competing forecast models
Forecast comparisons: test of equal predictive accuracy
Two forecasts could be poor, but one is less bad than the other
Forecast encompassing tests (tests for dominance)
Forecast evaluation (cont.)
Forecast evaluation amounts to understanding if a model’s predictive
accuracy is “good enough”
How accurately should we be able to forecast Chinese GDP growth? What’s a
reasonable $R^2$ or RMSE?
How about forecasting stock returns? Expect low $R^2$
How much does the forecast horizon matter to the degree of predictability?
Some variables are easier to predict than others. Why?
Unconditional forecast or random walk forecast are natural benchmarks
ARMA models are sometimes used
Forecast evaluation (cont.)
Informal methods – graphical plots, decompositions
Formal methods – deal with how to formally test if a forecasting model
satisfies certain “optimality criteria”
Evaluation of a forecasting model requires an estimate of its expected loss
Good forecasting models produce ‘small’ average losses, while bad models
produce ‘large’ average losses
Good performance in a given sample could be due to luck or could reflect
the performance of a genuinely good model
Power of statistical tests varies. Can we detect the difference between an
$R^2 = 1\%$ versus $R^2 = 1.5\%$? Depends on the sample size
Test results depend on the loss function and the information set
Rejection of forecast optimality suggests that we can improve the forecast (at
least in theory)
Optimality Tests
Establish a benchmark for what constitutes an optimal or a “good” forecast
Efficient forecasts: Constructed given knowledge of the true data generating
process (DGP) using all currently available information
sets the bar very high: in practice we don’t know the true DGP
Forecasts are efficient (rational) if they fully utilize all available information
and this information cannot be used to construct a better forecast
weak versus strong rationality (just like tests of market efficiency)
unbiasedness (forecast error should have zero mean)
orthogonality tests (forecast error should be unpredictable)
Efficient Forecast: Definition
A forecast is efficient (optimal) if no other forecast using the available data,
$x_t \in I_t$, can be used to generate a smaller expected loss
Under MSE loss:
$\hat f^*_{t+h|t} = \arg\min_{\hat f(x_t)} E\left[(y_{t+h} - \hat f(x_t))^2\right]$
If we can use information in $I_t$ to produce a more accurate forecast, then the
original forecast is suboptimal
Efficiency is conditional on the information set
weak form forecast efficiency tests include only past forecasts and past
outcomes $I_t = \{y_t, y_{t-1}, \ldots, \hat f_{t|t-1}, e_{t|t-1}, \ldots\}$
strong form efficiency tests extend this to include all other variables $x_t \in I_t$
Optimality under MSE loss
First order condition for an optimal forecast under MSE loss:
$E\left[\frac{\partial (y_{t+h} - f_{t+h|t})^2}{\partial f_{t+h|t}}\right] = -2E\left[y_{t+h} - f_{t+h|t}\right] = -2E\left[e_{t+h|t}\right] = 0$
Similarly, conditional on information at time $t$, $I_t$:
$E[e_{t+h|t} | I_t] = 0$
Expected value of the forecast error must equal zero given current
information, $I_t$
Test $E[e_{t+h|t} x_t] = 0$ for all variables $x_t \in I_t$ known at time $t$
If the forecast is optimal, no variable known at time t can predict its future
forecast error et+h|t . Otherwise the forecast wouldn’t be optimal
If I can predict my forecast will be too low, I should increase my forecast
Optimality properties under Squared Error Loss
1 Forecasts are unbiased: the forecast error $e_{t+h|t}$ has zero mean, both
conditionally and unconditionally:
$E[e_{t+h|t}] = E[e_{t+h|t} | I_t] = 0$
2 h-period forecast errors ($e_{t+h|t}$) are uncorrelated with information available
at the time the forecast was computed ($I_t$). In particular, single-period
forecast errors, $e_{t+1|t}$, are serially uncorrelated:
$E[e_{t+1|t} e_{t|t-1}] = 0$
3 The variance of the forecast error ($e_{t+h|t}$) increases (weakly) in the forecast
horizon, $h$:
$\mathrm{Var}(e_{t+h+1|t}) \geq \mathrm{Var}(e_{t+h|t})$, for all $h \geq 1$
On average it’s harder to predict distant outcomes than outcomes in the near
future
Optimality properties under Squared Error Loss (cont.)
Optimal forecasts are unbiased. Why? If they were biased, we could improve
the forecast simply by correcting for the bias
Suppose $f_{t+1|t}$ is biased:
$y_{t+1} = 1 + f_{t+1|t} + \varepsilon_{t+1}$, $\varepsilon_{t+1} \sim WN(0, \sigma^2)$,
Bias-corrected forecast:
$f^*_{t+1|t} = 1 + f_{t+1|t}$
is more accurate than $f_{t+1|t}$
Forecast errors from an optimal model should be unpredictable:
Suppose $e_{t+1} = 0.5 e_t$ so the one-step forecast error is serially correlated
Adding back $0.5 e_t$ to the original forecast yields a more accurate forecast:
$f^*_{t+1|t} = f_{t+1|t} + 0.5 e_t$ is better than $f_{t+1|t}$
Variance of (optimal) forecast error increases in the forecast horizon
We learn more information as we get closer to the forecast “target” and
increase our information set
Illustration for MA(2) process
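$Y_t = \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2}$, $\varepsilon_t \sim WN(0, \sigma^2)$
$t$ | $y_t$ | $f_{t+h|t}$ | $e_{t+h|t}$
$T+1$ | $\varepsilon_{T+1} + \theta_1 \varepsilon_T + \theta_2 \varepsilon_{T-1}$ | $\theta_1 \varepsilon_T + \theta_2 \varepsilon_{T-1}$ | $\varepsilon_{T+1}$
$T+2$ | $\varepsilon_{T+2} + \theta_1 \varepsilon_{T+1} + \theta_2 \varepsilon_T$ | $\theta_2 \varepsilon_T$ | $\varepsilon_{T+2} + \theta_1 \varepsilon_{T+1}$
$T+3$ | $\varepsilon_{T+3} + \theta_1 \varepsilon_{T+2} + \theta_2 \varepsilon_{T+1}$ | $0$ | $\varepsilon_{T+3} + \theta_1 \varepsilon_{T+2} + \theta_2 \varepsilon_{T+1}$
From these results we see that
$E[e_{t+h|t}] = 0$ for $t = T$, $h = 1, 2, 3, \ldots$
$\mathrm{Var}(e_{T+3|T}) \geq \mathrm{Var}(e_{T+2|T}) \geq \mathrm{Var}(e_{T+1|T})$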
Regression tests of optimality under MSE loss
Efficiency regressions test if any variable $x_t$ known at time $t$ can predict the
future forecast error $e_{t+1|t}$:
$e_{t+1|t} = \beta' x_t + \varepsilon_{t+1}$, $\varepsilon_{t+1} \sim WN(0, \sigma^2)$
$H_0: \beta = 0$ vs $H_1: \beta \neq 0$
Unbiasedness tests set $x_t = 1$:
$e_{t+1|t} = \beta_0 + \varepsilon_{t+1}$
Mincer-Zarnowitz regression uses $y_{t+1}$ on LHS and sets $x_t = (1, \hat f_{t+1|t})$:
$y_{t+1} = \beta_0 + \beta_1 \hat f_{t+1|t} + \varepsilon_{t+1}$
$H_0: \beta_0 = 0, \beta_1 = 1$
Zero intercept, unit slope: use F test
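A minimal sketch of the Mincer-Zarnowitz regression and its F statistic, written in plain Python. It assumes homoskedastic, serially uncorrelated errors (reasonable only for one-step forecasts); under the null the statistic is compared with an F(2, T-2) distribution. The function name and the toy data are hypothetical.

```python
def mincer_zarnowitz(y, f):
    """OLS of y on (1, f) and the F statistic for H0: b0 = 0, b1 = 1."""
    T = len(y)
    y_bar = sum(y) / T
    f_bar = sum(f) / T
    sxx = sum((fi - f_bar) ** 2 for fi in f)
    sxy = sum((fi - f_bar) * (yi - y_bar) for fi, yi in zip(f, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * f_bar
    # Unrestricted residual sum of squares (estimated b0, b1).
    rss_u = sum((yi - b0 - b1 * fi) ** 2 for yi, fi in zip(y, f))
    # Restricted residual sum of squares (b0 = 0, b1 = 1 imposed).
    rss_r = sum((yi - fi) ** 2 for yi, fi in zip(y, f))
    f_stat = ((rss_r - rss_u) / 2.0) / (rss_u / (T - 2))
    return b0, b1, f_stat


# Toy data generated as y = 0.2 + 0.9 f + noise (noise orthogonal to the regressors).
f = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.8, 3.0, 3.8, 4.7]
b0, b1, f_stat = mincer_zarnowitz(y, f)
```

For multi-step forecasts the errors overlap, so a heteroskedasticity and autocorrelation robust covariance estimator would be needed in place of the simple F statistic.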
Regression tests of optimality: Example
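Suppose that $\hat f_{t+1|t}$ is biased:
$y_{t+1} = 0.2 + 0.9 \hat f_{t+1|t} + \varepsilon_{t+1}$, $\varepsilon_{t+1} \sim WN(0, \sigma^2)$.
Q: How can we easily produce a better forecast?
Answer:
$\hat f^*_{t+1|t} = 0.2 + 0.9 \hat f_{t+1|t}$
will be an unbiased forecast
What if
$y_{t+1} = 0.3 + \hat f_{t+1|t} + \varepsilon_{t+1}$,
$\varepsilon_{t+1} = u_{t+1} + \theta_1 u_t$, $u_t \sim WN(0, \sigma^2)$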
Can we improve on this forecast?
A question of power
In small samples with little predictability, forecast optimality tests may not
have much power (ability to detect deviations from forecast optimality)
Rare to find individual forecasters with a long track record
Predictive ability changes over time
Need a long out-of-sample data set (evaluation sample) to be able to tell
with statistical confidence if a forecast is suboptimal
Testing non-decreasing variance of forecast errors
Suppose we have forecasts recorded for three different horizons, h = S ,M, L,
with S < M < L (short, medium, long)
$\mu_e = \left[E[e^2_{t+S|t}], E[e^2_{t+M|t}], E[e^2_{t+L|t}]\right]'$: MSE values
MSE differentials (Long-Medium, Medium-Short):
$\Delta^e_{L-M} \equiv E[e^2_{t+L|t}] - E[e^2_{t+M|t}]$
$\Delta^e_{M-S} \equiv E[e^2_{t+M|t}] - E[e^2_{t+S|t}]$
We can test if the expected value of the squared forecast errors is weakly
increasing in the forecast horizon:
$\Delta^e_{L-M} \geq 0$
$\Delta^e_{M-S} \geq 0$
Distant future is harder to predict than the near future
Evaluating the rationality of the “Greenbook” forecasts
Patton and Timmermann (2012) study the Fed’s “Greenbook” forecasts of
GDP growth, GDP deflator and CPI inflation
Data are quarterly, over the period 1982Q1 to 2000Q4, approx. 80
observations
Greenbook forecasts and actuals constructed from real-time Federal Reserve
publications. These are aligned in “event time”
6 forecast horizons: h = 0, 1, 2, 3, 4, 5
Increasing MSE and decreasing MSF
Greenbook forecasts of GDP growth, 1982Q1-2000Q4
[Figure: Forecasts and forecast errors, GDP growth. x-axis: Forecast horizon; y-axis: Variance; series: MSE, V[forecast], V[actual]]
Bias vs. forecast horizon (in months)
Analysts’ EPS forecasts at different horizons: Biases
[Figure: Bias in forecast errors (AAPL). x-axis: horizon; y-axis: Percentage points]
RMSE vs. forecast horizon (in months)
Analysts’ EPS forecasts at different horizons: RMSE
[Figure: RMSE (AAPL). x-axis: horizon; y-axis: RMSE]
Optimality tests that do not rely on the outcome
Many macroeconomic data series are revised: preliminary, first release, latest
data vintage could be used to measure the outcome
volatility is also unobserved
Under MSE loss, forecast revisions should be unpredictable
If I could predict today that my future forecast of the same event will be
different in a particular direction (higher or lower), then I should incorporate
this information into my current forecast
Let $\Delta f_{t+h} = f_{t+h|t+1} - f_{t+h|t}$ be the forecast revision. Then
$E[\Delta f_{t+h} | I_t] = 0$
Forecast revisions are a martingale difference process (zero mean)
This can be tested through a simple regression that doesn’t use the outcome:
$\Delta f_{t+h} = \alpha + \delta x_t + \varepsilon_{t+h}$
Forecast evaluation for directional forecasting
Suppose we are interested in evaluating the forecast (f ) of the sign of a
variable, y . There are four possible outcomes:
forecast/outcome | sign(y) > 0    | sign(y) ≤ 0
sign(f) > 0      | true positive  | false positive
sign(f) ≤ 0      | false negative | true negative
If stock returns (y) are positive 60% of the time and we always predict a
positive return, we have a “hit rate” of 60%. Is this good?
We need a test statistic that doesn’t reward “broken clock” forecasts (always
predict the same sign) with no informational content or value
Who is the better forecaster?
Information in forecasts
Both forecasters have a ‘hit rate’ of 80%, with 8 out of 10 correct predictions
(sum elements on the diagonal)
There is no information in the first forecast (the forecaster always says
“increase”)
There is some information in the second forecast: both increases and
decreases are successfully predicted
Not enough to only look at the overall hit rate
Forecast evaluation for sign forecasting
Suppose we are interested in predicting the sign of yt+h using the sign
(direction) of a forecast ft+h|t
$P$: probability of a correctly predicted sign (positive or negative)
$P_y$: probability of a positive sign of $y$
$P_f$: probability of a positive sign of the forecast, $f$
Define the sign indicator
$I(z_t) = 1$ if $z_t \geq 0$, and $I(z_t) = 0$ if $z_t < 0$
Sample estimates of sign probabilities with $T$ observations:
$\hat P = \frac{1}{T} \sum_{t=1}^T I(y_{t+h} f_{t+h|t})$
$\hat P_y = \frac{1}{T} \sum_{t=1}^T I(y_{t+h})$
$\hat P_f = \frac{1}{T} \sum_{t=1}^T I(f_{t+h|t})$
Sign test
In large samples we can test for sign predictability using the
Pesaran-Timmermann sign statistic
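$S_T = \frac{\hat P - \hat P_*}{\sqrt{\widehat{\mathrm{var}}(\hat P) - \widehat{\mathrm{var}}(\hat P_*)}} \sim N(0, 1)$, where
$\hat P_* = \hat P_y \hat P_f + (1 - \hat P_y)(1 - \hat P_f)$
$\widehat{\mathrm{var}}(\hat P) = T^{-1} \hat P_* (1 - \hat P_*)$
$\widehat{\mathrm{var}}(\hat P_*) = T^{-1} (2\hat P_y - 1)^2 \hat P_f (1 - \hat P_f) + T^{-1} (2\hat P_f - 1)^2 \hat P_y (1 - \hat P_y) + 4 T^{-2} \hat P_y \hat P_f (1 - \hat P_y)(1 - \hat P_f)$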
This test is very simple to compute and has been used in studies of market
timing (financial returns) and studies of business cycle forecasting
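The sign statistic can be computed directly from the sample proportions. A minimal sketch in plain Python, using the sign indicator defined earlier (non-negative values count as positive); returning NaN when the variance difference is not positive is a choice made here, and the data are hypothetical.

```python
import math


def pt_sign_test(y, f):
    """Pesaran-Timmermann sign statistic; NaN if the denominator is not positive."""
    T = len(y)

    def ind(z):
        return 1.0 if z >= 0 else 0.0

    p_hat = sum(ind(yi * fi) for yi, fi in zip(y, f)) / T
    p_y = sum(ind(yi) for yi in y) / T
    p_f = sum(ind(fi) for fi in f) / T
    p_star = p_y * p_f + (1 - p_y) * (1 - p_f)
    var_p = p_star * (1 - p_star) / T
    var_star = ((2 * p_y - 1) ** 2 * p_f * (1 - p_f) / T
                + (2 * p_f - 1) ** 2 * p_y * (1 - p_y) / T
                + 4 * p_y * p_f * (1 - p_y) * (1 - p_f) / T ** 2)
    denom = var_p - var_star
    if denom <= 0:
        return float("nan")
    return (p_hat - p_star) / math.sqrt(denom)


# Hypothetical data: forecasts that always get the sign right.
y_alt = [1.0, -1.0] * 5
f_alt = [2.0, -3.0] * 5
s_perfect = pt_sign_test(y_alt, f_alt)

# "Broken clock": always predict a positive sign; outcomes are positive 60% of the time.
y_up = [1.0, 1.0, 1.0, -1.0, -1.0] * 2
f_up = [1.0] * 10
s_clock = pt_sign_test(y_up, f_up)
```

The broken-clock forecaster has a 60% hit rate but the statistic is undefined (zero numerator and zero denominator), so the test does not credit it with any sign-forecasting ability.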
Forecast evaluation for event forecasting
Leitch and Tanner (American Economic Review, 1990)
Forecasts of binary variables: Liu and Moench (2014)
Liu and Moench (2014): What predicts U.S. Recessions? Federal Reserve
Bank of New York working paper
$S_t \in \{0, 1\}$: true state of the economy (recession indicator)
$S_t = 1$ in recession
$S_t = 0$ in expansion
Forecast the probability of a recession using a probit model
$\Pr(S_{t+1} = 1 | X_t) = \Phi(\beta_0 + \beta_1 X_t) \equiv P_{t+1|t}$
The log-likelihood function for $\beta = (\beta_0\ \beta_1)'$ is
$\ln l(\beta) = \sum_{t=0}^{T-1} \left[S_{t+1} \ln(\Phi(\beta_0 + \beta_1 X_t)) + (1 - S_{t+1}) \ln(1 - \Phi(\beta_0 + \beta_1 X_t))\right]$
Evaluating recession forecasts: Liu-Moench
Pt+1|t ∈ [0, 1] : prediction of St+1 given information known at time t, Xt
Blue: Xt = {term spread}
Green: Xt = {term spread, lagged term spread}
Red: Xt = {term spread, lagged term spread, additional predictor}
Evaluating binary recession forecasts
Split $[0, 1]$ into 100 evenly spaced steps, giving 101 thresholds
$c_i = [0, 0.01, 0.02, \ldots, 0.98, 0.99, 1]$
For each threshold, $c_i$, compute the prediction model’s classification:
$\hat S_{t+1|t}(c_i) = 1$ if $P_{t+1|t} \geq c_i$, and $0$ if $P_{t+1|t} < c_i$
True positive (TP) and false positive (FP) indicators:
$I^{tp}_{t+1}(c_i) = 1$ if $S_{t+1} = 1$ and $\hat S_{t+1|t}(c_i) = 1$, and $0$ otherwise
$I^{fp}_{t+1}(c_i) = 1$ if $S_{t+1} = 0$ and $\hat S_{t+1|t}(c_i) = 1$, and $0$ otherwise
Estimating the true positive and false positive rates
Using the true $S_{t+1}$ and the classifications, $\hat S_{t+1|t}$, calculate the percentage
of true positives, PTP, and the percentage of false positives, PFP
$PTP(c_i) = \frac{1}{n_1} \sum_{t=1}^T I^{tp}_t(c_i)$
$PFP(c_i) = \frac{1}{n_0} \sum_{t=1}^T I^{fp}_t(c_i)$
n1 : number of times St = 1 (recessions)
n0 : number of times St = 0 (expansions)
n0 + n1 = n : sample size
Creating the ROC curve
Each $c_i$ produces a set of values $(PFP_i, PTP_i)$
Plot $(PFP_i, PTP_i)$ across all thresholds $c_i$ with PFP on the x-axis and PTP
on the y-axis
Connecting these points gives the Receiver Operating Characteristic (ROC) curve
ROC curve
plots all possible combinations of $PTP(c_i)$ and $PFP(c_i)$ for $c_i \in [0, 1]$
is an increasing function in $[0, 1]$
as $c \to 0$, $TP(c) = FP(c) = 1$
as $c \to 1$, $TP(c) = FP(c) = 0$
Area Under the ROC (AUROC) curve measures accuracy of the classification
Perfect forecast: ROC curve lies in the top left corner
Random guess: ROC curve follows the 45 degree diagonal line
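The threshold sweep behind the ROC curve can be sketched in plain Python. The function name, the toy recession indicators, and the predicted probabilities are hypothetical, and the sample is assumed to contain at least one recession and one expansion.

```python
def roc_points(states, probs, n_steps=100):
    """(PFP, PTP) pairs for thresholds c = 0, 1/n_steps, ..., 1."""
    n1 = sum(states)            # number of recessions (S = 1)
    n0 = len(states) - n1       # number of expansions (S = 0)
    points = []
    for k in range(n_steps + 1):
        c = k / n_steps
        tp = sum(1 for s, p in zip(states, probs) if s == 1 and p >= c)
        fp = sum(1 for s, p in zip(states, probs) if s == 0 and p >= c)
        points.append((fp / n0, tp / n1))
    return points


# Hypothetical sample in which the probabilities separate the two states perfectly.
states = [1, 1, 0, 0, 0]
probs = [0.9, 0.7, 0.4, 0.2, 0.1]
points = roc_points(states, probs)
```

With perfect separation some threshold yields the point (0, 1), the top left corner; plotting the pairs with PFP on the x-axis gives the ROC curve.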
Evaluating recession forecasts: Liu-Moench
Estimation and inference on AUROC
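$Y^R_t$: observations of $Y_t$ classified as recessions ($S_t = 1$)
$Y^E_t$: observations of $Y_t$ classified as expansions ($S_t = 0$)
Nonparametric estimate of AUROC, with $n_R = n_1$ recession and $n_E = n_0$ expansion observations:
$\widehat{AUROC} = \frac{1}{n_1 n_0} \sum_{i=1}^{n_R} \sum_{j=1}^{n_E} \left[I(Y^R_i > Y^E_j) + \frac{1}{2} I(Y^R_i = Y^E_j)\right]$
Asymptotic variance of AUROC (the standard error is $\sigma$):
$\sigma^2 = \frac{1}{n_1 n_0} \left[AUROC(1 - AUROC) + (n_1 - 1)(Q_1 - AUROC^2) + (n_0 - 1)(Q_2 - AUROC^2)\right]$
$Q_1 = \frac{AUROC}{2 - AUROC}$; $Q_2 = \frac{2\, AUROC^2}{1 + AUROC}$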
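The nonparametric estimate and its variance translate directly into code. A minimal sketch with hypothetical function names and toy values; the inputs are the predictor (or predicted probability) values observed in recessions and in expansions.

```python
def auroc(rec_values, exp_values):
    """Share of (recession, expansion) pairs ranked correctly; ties count one half."""
    total = 0.0
    for r in rec_values:
        for e in exp_values:
            if r > e:
                total += 1.0
            elif r == e:
                total += 0.5
    return total / (len(rec_values) * len(exp_values))


def auroc_variance(a, n1, n0):
    """Variance formula from the text, with Q1 and Q2 as defined there."""
    q1 = a / (2.0 - a)
    q2 = 2.0 * a * a / (1.0 + a)
    return (a * (1 - a) + (n1 - 1) * (q1 - a * a)
            + (n0 - 1) * (q2 - a * a)) / (n1 * n0)


# Hypothetical values: two recession and three expansion observations.
a_hat = auroc([0.9, 0.8], [0.1, 0.2, 0.8])
var_hat = auroc_variance(a_hat, n1=2, n0=3)
```

An AUROC of 0.5 corresponds to a random guess and 1 to a perfect classifier, so the estimate is usually compared with 0.5 using the standard error.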
Comparing AUROC for two forecasts
Suppose we have two sets of AUROC estimates $\widehat{AUROC}_1$, $\widehat{AUROC}_2$, $\hat\sigma_1^2$, $\hat\sigma_2^2$
We also need an estimate, $r$, of the correlation between $AUROC_1$, $AUROC_2$
We can test if the AUROC are the same using a t-statistic:
$t = \frac{AUROC_1 - AUROC_2}{\sqrt{\sigma_1^2 + \sigma_2^2 - 2 r \sigma_1 \sigma_2}}$
Evaluating recession forecasts: Liu-Moench (table 3)
Evaluating Interval forecasts
Interval forecasts predict that the future outcome yt+1 should lie in some
interval
$[p^L_{t+1|t}(\alpha); p^U_{t+1|t}(\alpha)]$
$\alpha \in (0, 1)$: probability that outcome falls inside the interval forecast
(coverage)
$p^L_{t+1|t}(\alpha)$: lower bound of interval forecast
$p^U_{t+1|t}(\alpha)$: upper bound of interval forecast
Unconditional test of correctly specified interval forecast
Define the indicator variable
$1_{y_{t+1}} = 1\{y_{t+1} \in [p^L_{t+1|t}(\alpha); p^U_{t+1|t}(\alpha)]\}$
$= 1$ if outcome falls inside interval, $0$ if outcome falls outside interval
Test for correct unconditional (“average”) coverage:
$E[1_{y_{t+1}}] = \alpha$
Use this to evaluate fan charts which show interval forecasts for different
values of α
Test correct coverage for α = 25%, 50%, 75%, 90%, 95%, etc.
Test of correct conditional coverage
Test for correct conditional coverage: For all $X_t$
$E[1_{y_{t+1}} | X_t] = \alpha$
Test this implication by estimating, say, a probit model
$\Pr(1_{y_{t+1}} = 1) = \Phi(\beta_0 + \beta_1 x_t)$
Under the null of a correct interval forecast model, $H_0: \beta_1 = 0$
Evaluating Density forecasts: Probability integral transform
The probability integral transform (PIT), $U_{t+1}$, of a continuous cumulative
distribution function $P_Y(y_{t+1} | x_t)$ is defined as the outcome evaluated at the
model-specified conditional CDF:
$U_{t+1} = P_Y(y_{t+1} | x_1, x_2, \ldots, x_t) = \int_{-\infty}^{y_{t+1}} p_Y(y | x_1, x_2, \ldots, x_t)\, dy$
$y_{t+1}$: realized value of the outcome (observed value of $y$)
$x_1, x_2, \ldots, x_t$: predictors (data)
$p_Y(y | x_1, x_2, \ldots, x_t)$: conditional density of $y$
PIT value: how likely is it to observe a value equal to or smaller than the
actual outcome ($y_{t+1}$), given the density forecast $p_Y$?
we don’t want this to be very small or very large most of the time
Probability integral transform: Example
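Suppose that our prediction model is a GARCH(1,1)
\[ y_{t+1} = \beta_0 + \beta_1 x_t + \varepsilon_{t+1}, \quad \varepsilon_{t+1} \sim N(0, \sigma^2_{t+1|t}) \]
\[ \sigma^2_{t+1|t} = \alpha_0 + \alpha_1 \varepsilon_t^2 + \beta_1 \sigma^2_{t|t-1} \]
Then we have
\[ p_Y(y \mid x_1, x_2, \ldots, x_t) = N(\beta_0 + \beta_1 x_t, \sigma^2_{t+1|t}) \]
and so
\[ u_{t+1} = \int_{-\infty}^{y_{t+1}} p_Y(y \mid x_1, x_2, \ldots, x_t)\, dy = \Phi\left( \frac{y_{t+1} - (\beta_0 + \beta_1 x_t)}{\sigma_{t+1|t}} \right) \]
This is the standard cumulative normal function, Φ, evaluated at
\[ z_{t+1} = [\, y_{t+1} - (\beta_0 + \beta_1 x_t) \,] / \sigma_{t+1|t} \]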
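A short Python sketch of this PIT calculation, assuming the mean and GARCH parameters are already estimated. All parameter values are hypothetical, and the GARCH persistence coefficient is named `garch_beta` in the code only to keep it apart from the mean slope.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def garch_variance(prev_eps, prev_var, alpha0, alpha1, garch_beta):
    """One-step-ahead GARCH(1,1) variance given last shock and last variance."""
    return alpha0 + alpha1 * prev_eps ** 2 + garch_beta * prev_var

def pit_value(y_next, x_t, b0, b1, sigma):
    """PIT of the realized outcome under a N(b0 + b1*x_t, sigma^2) forecast."""
    z = (y_next - (b0 + b1 * x_t)) / sigma
    return normal_cdf(z)

# Hypothetical parameter values and data
var_next = garch_variance(prev_eps=0.5, prev_var=1.0,
                          alpha0=0.1, alpha1=0.1, garch_beta=0.8)
u = pit_value(y_next=1.2, x_t=0.4, b0=0.1, b1=0.5, sigma=math.sqrt(var_next))
print(var_next, u)
```

An outcome equal to the conditional mean gives a PIT of 0.5; outcomes far in either tail give values close to 0 or 1.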
Understanding the PIT
By construction, PIT values lie between zero and one
If the density forecasting model pY (y |x1, x2, …, xt ) is correctly specified, U
will be uniformly distributed on [0, 1]
The sequence of PIT values $\hat u_1, \hat u_2, \ldots, \hat u_T$ should be independently and
identically distributed Uniform(0, 1); they should not be serially correlated
If we apply the inverse Gaussian CDF, $\Phi^{-1}$, to the $\hat u_t$ values to get
$\hat z_t = \Phi^{-1}(\hat u_t)$, we get a sequence of i.i.d. N(0, 1) variables
We can therefore test that the density forecasting model is correctly specified
through simple regression tests:
\[ \hat z_t = \mu + \varepsilon_t, \quad H_0 : \mu = 0 \]
\[ \hat z_t = \mu + \rho \hat z_{t-1} + \varepsilon_t, \quad H_0 : \mu = \rho = 0 \]
\[ \hat z_t^2 = \mu + \rho \hat z_{t-1}^2 + \varepsilon_t, \quad H_0 : \mu = 1, \rho = 0 \]
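The first two regression checks can be sketched with the Python standard library as below. The PIT values are hypothetical, and only the t-statistic for µ and the OLS slope ρ are computed; a full test would also need a standard error for ρ.

```python
import math
from statistics import NormalDist

def pit_to_z(pits):
    """Map PIT values in (0, 1) to z-scores with the inverse standard normal CDF."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf(u) for u in pits]

def mean_t_stat(z):
    """t-statistic for H0: mu = 0 in the regression z_t = mu + eps_t."""
    n = len(z)
    mean = sum(z) / n
    var = sum((v - mean) ** 2 for v in z) / (n - 1)
    return mean / math.sqrt(var / n)

def ar1_slope(z):
    """OLS slope rho in z_t = mu + rho * z_{t-1} + eps_t."""
    x, y = z[:-1], z[1:]
    xbar, ybar = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - xbar) ** 2 for a in x)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    return sxy / sxx

# Hypothetical PIT values, symmetric around 0.5
pits = [0.1, 0.9, 0.3, 0.7, 0.5]
z = pit_to_z(pits)
t_mu = mean_t_stat(z)
rho = ar1_slope(z)
print(t_mu, rho)
```

Note that `inv_cdf` raises an error for PIT values of exactly 0 or 1, so such values would have to be handled before the transformation.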
Working in the PIT: graphical test
Tests of Equal Predictive Accuracy: Diebold-Mariano Test
Test if two forecasts generate the same average (MSE) loss
\[ E[e^2_{1,t+1}] = E[e^2_{2,t+1}] \]
Diebold and Mariano (1995) propose a simple and elegant method that
accounts for sampling uncertainty in average losses
Setup: two forecasts with associated loss $MSE_1$, $MSE_2$
\[ MSE_1 \sim N(\mu_1, \Omega_{11}), \quad MSE_2 \sim N(\mu_2, \Omega_{22}), \quad Cov(MSE_1, MSE_2) = \Omega_{12} \]
Loss differential in period t + h ($d_{t+h}$) is
\[ d_{t+h} = e^2_{1,t+h|t} - e^2_{2,t+h|t} \]
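A minimal Python sketch of how the loss differential feeds a test of equal average loss: compute $d_{t+h}$ from two series of forecast errors and form a plain t-statistic on its mean. This simplified version ignores serial correlation in the differentials; for horizons h > 1 a HAC standard error would normally be needed. The function name and error series are hypothetical.

```python
import math

def dm_t_stat(errors1, errors2):
    """Mean squared-error loss differential and its (non-HAC) t-statistic."""
    d = [a * a - b * b for a, b in zip(errors1, errors2)]
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((v - mean_d) ** 2 for v in d) / (n - 1)
    return mean_d, mean_d / math.sqrt(var_d / n)

# Hypothetical forecast errors from two competing forecasts
e1 = [1.0, -2.0, 1.5, -1.0]
e2 = [0.5, -1.0, 1.0, -0.5]
mean_d, t_stat = dm_t_stat(e1, e2)
print(mean_d, t_stat)
```

A positive mean differential indicates that forecast 1 has the larger average squared error in this sample.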
Tests of Equal Loss – Diebold Mariano Test
Suppose we observe samples of forecasts, forecast errors and loss
differentials $d_{t+h}$
These form the basis of a test of the null of equal predictive accuracy:
\[ H_0 : E[d_{t+h}] = 0 \]
To test $H_0$, regress the time series $d_{t+h}$ on a constant and conduct a t-test
on the constant, µ:
\[ d_{t+h} = \mu + \varepsilon_{t+h} \]
µ > 0 suggests that $MSE_1 > MSE_2$ and so forecast 2 produces the smaller
squared forecast errors and is best
µ < 0 suggests that $MSE_1 < MSE_2$ and so forecast 1 produces the smaller
squared forecast errors and is best
Comparing forecast methods - Giacomini and White (2006)
Consider two sets of forecasts, $\hat f_{1,t+h|t}(\omega_{in1})$ and $\hat f_{2,t+h|t}(\omega_{in2})$, each
computed using a rolling estimation window of length $\omega_{in}$
Each forecast is a function of the data and parameter estimates
Different choices for the window length, $\omega_{in}$, used in the rolling regressions
alter what is being tested
If we change the window length, $\omega_{in}$, we also change the forecasts being
compared, $\{\hat f_{1,t+h|t}, \hat f_{2,t+h|t}\}$
Big models are more affected by estimation error
The idea is to compare forecasting methods, not just forecasting models
Conditional Tests - Giacomini and White (2006)
We can also test if one method performs better than another one in different
environments: regress the loss differential on current information $x_t \in I_t$:
\[ d_{t+h} = \mu + \beta x_t + \varepsilon_{t+h} \]
Analysts may be better at forecasting stock returns than econometric models in
recessions or, say, in environments with low interest rates
Switch between forecasts?
Choose the forecast with the smallest conditional expected squared forecast error
Tests of Forecast Encompassing
Encompassing: one model contains all the information (relevant for forecasting) of
another model plus some additional information
$f_1$ encompasses $f_2$ provided that for all values of ω
\[ MSE(f_1) \le MSE(\omega f_1 + (1 - \omega) f_2) \]
Equality only holds for ω = 1
One forecast ($f_1$) encompasses another ($f_2$) when the information in the second
forecast does not help improve on the forecasting performance of the first forecast
Encompassing Tests Under MSE loss
Use OLS to regress the outcome on two forecasts to test for forecast encompassing:
\[ y_{t+1} = \beta_1 \hat f_{1,t+1|t} + \beta_2 \hat f_{2,t+1|t} + \varepsilon_{t+1} \]
Forecast 1 ($\hat f_{1,t+1|t}$) encompasses (dominates) forecast 2 ($\hat f_{2,t+1|t}$) if β1 = 1
and β2 = 0. If this holds, only use forecast 1
Equivalently, if β2 = 0 in the following regression, forecast 1 encompasses forecast 2:
\[ \hat e_{1,t+1} = \beta_2 \hat f_{2,t+1|t} + \varepsilon_{1,t+1} \]
Forecast 2 doesn’t explain model 1’s forecast error
If β1 = 0 in the following regression, forecast 2 encompasses forecast 1:
\[ \hat e_{2,t+1} = \beta_1 \hat f_{1,t+1|t} + \varepsilon_{2,t+1} \]
Forecast encompassing vs tests of equal predictive accuracy
Suppose we cannot reject a test of equal predictive accuracy for two forecasts:
\[ E[e^2_{1,t+h|t}] = E[e^2_{2,t+h|t}] \]
Then it is optimal to use equal weights in a combined forecast, rather than use
only one or the other forecast
Forecast encompassing tests if it is optimal to assign a weight of unity on one
forecast and zero on the other, i.e., to completely ignore one forecast
Tests of equal predictive accuracy and tests of forecast encompassing examine
very different hypotheses about how useful the forecasts are
Comparing Two Forecast Methods
Three possible outcomes of model comparison:
One forecast method completely dominates another method
Encompassing; choose the dominant forecasting method
One forecast is best, but does not contain all useful information from the second
model
Combine forecasts using non-equal weights
The forecasts have the same expected loss (MSE)
Combine forecasts using equal weights
Forecast evaluation: Conclusions
Forecast evaluation is very important - health check of forecast models
Variety of diagnostic tests available for
Optimality tests: point, interval and density forecasts
Sign/direction forecasts
Forecast comparisons
Lecture 10: Model Instability
UCSD, Winter 2017
Allan Timmermann1
1UC San Diego
Timmermann (UCSD) Breaks Winter, 2017 1 / 42
1 Forecasting under Model Instability: General Issues
2 How costly is it to Ignore Model Instability?
3 Limitations of Tests for Model Instability
Frequent, small versus rare and large changes
Using pre-break data: Trade-offs
4 Ad-hoc Methods for Dealing with Breaks
Intelligent ways of using a rolling window
5 Modeling the Instability Process
Time-varying parameters
Regime switching
Change point models
6 Real-time monitoring of forecasting performance
7 Conclusions and Practical Lessons
Model instability is everywhere...
Model instability affects a majority of macroeconomic and financial variables
(Stock and Watson, 1996)
Great Moderation: Sharp drop in the volatility of macroeconomic variables
around 1984
Zero lower bound for US interest rates
Stock and Watson (2003) conclude “... forecasts based on individual indicators are
unstable. Finding an indicator that predicts well in one period is no guarantee that
it will predict well in later periods.
It appears that instability of predictive relations based on asset prices (like many
other candidate leading indicators) is the norm.”
Strong evidence of instability for prediction models fitted to stock market
returns: Ang and Bekaert (2007), Paye and Timmermann (2006) and Rapach and
Strauss (2006)
McLean and Pontiff (2012)
"We investigate the out-of-sample and post-publication return predictability of 82
characteristics that have been shown to predict cross-sectional returns by academic
publications in peer-reviewed journals... We estimate post-publication decay to be
about 35%, and we can reject the hypothesis that there is no decay, and we can
reject the hypothesis that the cross-sectional predictive ability disappears entirely.
This finding is better explained by a discrete change in predictive ability, rather
than a declining time-trend in predictive ability." (p. 24)
Sources of model instability
Model parameters may change over time due to
shifting market conditions (QE)
changing regulations and government policies (Dodd-Frank)
new technologies (fracking; iPhone)
mergers and acquisitions; spinoffs
shift in behavior (market saturation; self-destruction of predictable patterns
under market efficiency)
Strategies for dealing with model instability
Ignore it altogether
Test for large, discrete breaks and use only data after the most recent break to
estimate the forecasting model
Ad-hoc approaches that discount past data
rolling window estimation
exponential discounting of past observations (RiskMetrics)
adaptive approaches
Model the break process itself
If multiple breaks occurred in the past, we may want to model the possibility of
future breaks, particularly for long forecast horizons
Forecast combination
Ignoring model instability
How costly is it to ignore breaks?
"All models are wrong but some are useful" (George Box)
Full-sample estimated parameters of forecasting models are an average of
time-varying coefficients
these may or may not be useful for forecasting
fail to detect valuable predictors
wrongly include irrelevant predictors
Model instability can show up as a disparity between a forecasting model’s
in-sample and out-of-sample performance or in differences in the model’s
forecasting performance across different historical subsamples
How do breaks affect forecasting performance?
Forecasting model
\[ y_{t+1} = \beta_t' x_t + \varepsilon_{t+1}, \quad t = 1, \ldots, T \]
$y_{t+1}$ : outcome we are interested in predicting
$\beta_t$ : (time-varying) parameters of the data generating process
$\hat\beta_t$ : parameter estimates
$x_t$ : predictors known at time t
T : present time (time where we generate our forecast of $y_{T+1}$)
$\hat y_{T+1} = \hat\beta_T' x_T$ : forecast
Failing to find genuine predictability
[Figure: time path of $\beta_t$ with a break at $T_{break}$, showing the pre-break β, the post-break β and the full-sample estimate $\hat\beta_T$ relative to zero; model $y_{t+1} = \beta_t x_t + \varepsilon_{t+1}$]
Wrongly identifying predictability
[Figure: time path of $\beta_t$ with a break at $T_{break}$, showing the pre-break β, the post-break β and the full-sample estimate $\hat\beta_T$; model $y_{t+1} = \beta_t x_t + \varepsilon_{t+1}$]
Breaks to forecast models
What happens when the parameters of the forecasting model change over time
so the full-sample estimates of the forecasting model provide poor guidance for
the forecasting model at the end of the sample (T) where the forecast gets
computed?
$\beta_t$ : regression parameters of the forecasting model
the ‘t’ subscript indicates that the parameters vary over time
Assuming that parameter changes are not random, we would prefer to construct
forecasts using the parameters, $\beta_T$, at the point of the forecast (T)
Instead, the estimator based on full-sample information, $\hat\beta_T$, will typically
converge not to $\beta_T$ but to the average value for $\beta_t$ computed over the sample
t = 1, ..., T
Take-aways: Careful with full-sample tests
A forecasting model that uses a good predictor might generate poor
out-of-sample forecasts because the parameter estimates are an average of
time-varying coefficients
Valuable predictor variables may appear to be uninformative because the
full-sample parameter estimate $\hat\beta_T$ is close to zero
Conversely, full-sample tests might indicate that certain predictors are useful
even though, at the time of the forecast, $\beta_T$ is close to zero
Under model instability, all bets could be off!
Testing for parameter instability
There are many different ways to model and test for parameter instability
Tests for a single discrete break
Tests for multiple discrete breaks
Tests for random walk breaks (break every period)
Testing for breaks: Known break date
If the date of the possible break in the coefficients ($t_B$) is known, the null of no
break can be tested using a dummy variable interaction regression
Let $D_t(t_B)$ be a binary variable that equals zero before the break date, $t_B$, and
is one afterwards:
\[ D_t(t_B) = \begin{cases} 0 & \text{if } t < t_B \\ 1 & \text{if } t \ge t_B \end{cases} \]
We can test for a break in the intercept:
\[ y_{t+1} = \beta_0 + \beta_1 D_t(t_B) + \beta_2 x_t + \varepsilon_{t+1}, \quad H_0 : \beta_1 = 0 \text{ (no break)} \]
Or we can test for a break in the slope:
\[ y_{t+1} = \beta_0 + \beta_1 x_t + \beta_2 D_t(t_B) x_t + \varepsilon_{t+1}, \quad H_0 : \beta_2 = 0 \text{ (no break)} \]
The t-test for β1 or β2 is called a Chow test
Models with a single break I
What happens if the time and size of the break are unknown?
We can try to estimate the date and magnitude of the break
For each date in the sample we can compute the sum of squared residuals (SSR)
associated with that choice of break date. Then choose the value $\hat t_B$ that
minimizes the SSR
\[ SSR(t_B) = \sum_{t=1}^{T-1} \left( y_{t+1} - \beta x_t - d\, x_t 1(t \le t_B) \right)^2 \]
We “trim” the data by searching only for breaks between lower and upper limits
that exclude 10-15% of the data at both ends to have some minimal amount of
data for parameter estimation and evaluation
In a sample with T = 100 observations, test for break at t = 16, ..., 85
Models with a single break II
Potentially we can compute a better forecast based on the post-break parameter
values. For the linear regression
\[ y_{t+1} = \begin{cases} (\beta + d) x_t + \varepsilon_{t+1} & t \le \hat t_B \\ \beta x_t + \varepsilon_{t+1} & t > \hat t_B \end{cases} \]
we could use the forecast fT+1 = β̂xT where β̂ is estimated from a regression
that replaces the unknown break date with the estimated break date t̂B
In practice, estimates of the break date are often inaccurate
Should we always exclude pre-break data points?
Not if the break is small or the post-break data sample is short
Bias-variance trade-off
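The break-date search can be sketched in Python for the single-regressor model without intercept used above. Because the model allows a different slope before and after the break, the SSR for a candidate date is the sum of the SSRs from separate pre-break and post-break regressions. The function names, the trimming rule and the data are illustrative assumptions.

```python
def slope_ssr(x, y):
    """No-intercept OLS slope and sum of squared residuals."""
    sxx = sum(a * a for a in x)
    b = sum(a * c for a, c in zip(x, y)) / sxx
    return b, sum((c - b * a) ** 2 for a, c in zip(x, y))

def estimate_break(x, y, trim=0.15):
    """Search candidate break dates; y[i] is the outcome one period after x[i].

    tb is the number of observations in the pre-break segment.
    Returns (tb, ssr, post-break slope) for the SSR-minimizing date.
    """
    n = len(x)
    first = max(1, int(trim * n))  # roughly exclude a share `trim` at each end
    best = None
    for tb in range(first, n - first + 1):
        _, ssr_pre = slope_ssr(x[:tb], y[:tb])
        b_post, ssr_post = slope_ssr(x[tb:], y[tb:])
        total = ssr_pre + ssr_post
        if best is None or total < best[1]:
            best = (tb, total, b_post)
    return best

# Hypothetical noise-free data: slope 2.0 for 12 observations, then 0.5
x = [1.0, 2.0] * 10
y = [2.0 * v for v in x[:12]] + [0.5 * v for v in x[12:]]
tb_hat, ssr_min, beta_post = estimate_break(x, y)
print(tb_hat, ssr_min, beta_post)
```

With noisy data the SSR profile is usually much flatter around the minimum, which is one way to see why break-date estimates tend to be imprecise.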
Many small breaks versus few large ones
[Figure: time path of $\beta_t$ with a break at $T_{break}$, showing the pre-break β, the post-break β and the full-sample estimate $\hat\beta_T$; model $y_{t+1} = \beta_t x_t + \varepsilon_{t+1}$]
[Figure: power of the qLL, Nyblom, SupF and AP tests plotted against δ in four panels: Single Break at 50%; Two Breaks, 40% and 60%; Random Breaks with Probability 10%; Random Breaks with Probability 1]
Practical lessons from break point testing
Tests designed for one type of break (frequent, small ones) typically also
have the ability to detect other types of breaks (rare, large ones)
Rejections of a test for a particular break process do not imply that the break
process tested for is “correct”
Rather, it could be one of many processes
Imagine a medical test that can tell if the patient is sick or not but cannot
tell if the patient suffers from diabetes or coronary disease
Estimating the break date
We can attempt to estimate the date and magnitude of any breaks
In practice, estimates of the break date are often inaccurate
Costs of mistakes in estimation of the break date:
Too late: If the estimated break date falls after the true date, the resulting
parameter estimates are inefficient
Too early: If the estimated break date occurs prior to the true date, the
estimates will use pre-break data and hence be biased
If you know the time of a possible break, use this information
Introduction of Euro
Change in legislation
Should we only use post-break data?
The expected performance of the forecasting model can sometimes be
improved by including pre-break observations to estimate the parameters of
the forecasting model
Adding pre-break observations introduces a bias in the forecast but can also
reduce the variance of the estimator
If the size of the break is small (small bias) and the break occurs late in the
sample so the additional observations reduce the variance of the estimates by
a large amount, the performance of the forecasting model can be improved by
incorporating pre-break data points
Consequences of a small break
[Figure: time path of $\beta_t$ with a small break at $T_{break}$, showing the pre-break β, the post-break β and the full-sample estimate $\hat\beta_T$; model $y_{t+1} = \beta_t x_t + \varepsilon_{t+1}$]
Including pre-break data (Pesaran and Timmermann, 2007)
The optimal pre-break window that minimizes the mean squared (forecast)
error (MSE) is longer, the
smaller the $R^2$ of the prediction model (noisy returns data)
smaller the size of the break (small bias)
shorter the post-break window (post-break-only estimates are very noisy)
Stopping rule for determining estimation window
First, estimate the time at which the most recent break occurred, T̂b
If no break is detected, use all the data
If a break is detected, estimate the MSE using only data after the break date,
t = T̂b + 1, …,T
Next, compute the MSE by including an additional observation, t = T̂b , …,T
If the new estimate reduces the MSE, continue by adding an additional data
point (T̂b − 1) to the sample and again compute the MSE
Repeat until the data suggest that including additional pre-break data no
longer reduces the MSE
Downweighting past observations
In situations where the form of the break process is unknown, we might use
adaptive methods that do not depend directly on the nature of the breaks
Simple weighted least squares scheme puts greater weight on recent
observations than on past observations by choosing parameters
\[ \hat\beta_T = \left[ \sum_{t=0}^{T-1} \omega_t x_t x_t' \right]^{-1} \left[ \sum_{t=0}^{T-1} \omega_t x_t y_t \right] \]
The forecast is $\hat\beta_T' x_T$
Expanding window regressions set $\omega_t = 1$ for all t
Rolling regressions set $\omega_t = I\{T - \bar\omega \le t \le T - 1\}$ : use last $\bar\omega$ obs.
Discounted least squares sets $\omega_t = \lambda^{T-t}$ for λ ∈ (0, 1)
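The three weighting schemes can be compared in a small Python sketch. It assumes a single regressor without intercept, so the matrix inverse reduces to a division; function names and data are hypothetical.

```python
def expanding_weights(T):
    """omega_t = 1 for all t = 0, ..., T-1."""
    return [1.0] * T

def rolling_weights(T, window):
    """omega_t = 1 for the last `window` observations, 0 otherwise."""
    return [1.0 if t >= T - window else 0.0 for t in range(T)]

def discounted_weights(T, lam):
    """omega_t = lam ** (T - t) for lam in (0, 1)."""
    return [lam ** (T - t) for t in range(T)]

def wls_slope(x, y, w):
    """Weighted least squares slope for a scalar regressor, no intercept."""
    num = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    den = sum(wi * xi * xi for wi, xi in zip(w, x))
    return num / den

# Hypothetical data with a shift in the slope halfway through the sample
x = [1.0, 1.0, 1.0, 1.0]
y = [0.0, 0.0, 2.0, 2.0]
T = len(x)
b_expanding = wls_slope(x, y, expanding_weights(T))
b_rolling = wls_slope(x, y, rolling_weights(T, window=2))
b_discounted = wls_slope(x, y, discounted_weights(T, lam=0.5))
print(b_expanding, b_rolling, b_discounted)
```

In this toy example the expanding window averages over both regimes, the rolling window uses only the recent regime, and discounting lands in between.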
Careful with rolling regressions!
Rolling regressions employ an intuitive trade-off
Short estimation windows reduce the bias in the estimates due to the use of
stale data that come from a different “regime”
This bias reduction is achieved at the cost of a decreased precision in the
parameter estimates as less data get used
We hope that the bias reduction more than makes up for the increased
parameter estimation error
However, there does not exist a data generating process for which a rolling
window is optimal, so how do we choose the length of the estimation window?
Cross-validation and choice of estimation window
Treat rolling window as a choice variable
If the last P observations are used for cross-validation, we can choose the
length of the rolling estimation window, ω, to minimize the out-of-sample
MSE criterion
\[ MSE(\omega) = P^{-1} \sum_{t=T-P+1}^{T} \left( y_t - x_{t-1}' \hat\beta_{t-\omega+1:t} \right)^2 \]
$\hat\beta_{t-\omega+1:t}$ : OLS estimate of β that uses observations [t − ω + 1 : t]
This method requires a sufficiently long evaluation window, P, to yield
precise MSE estimates
not a good idea if the candidate break date, Tb , is close to T
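A Python sketch of the window choice, under a timing assumption that is not spelled out on the slide: each forecast in the evaluation sample uses only the `window` observation pairs that precede it. It again uses a single regressor without intercept; names and data are hypothetical.

```python
def slope(x, y):
    """No-intercept OLS slope."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def rolling_mse(x, y, window, P):
    """Out-of-sample MSE over the last P observations for a given window.

    y[i] is the outcome paired with predictor x[i]; the forecast of y[i]
    uses the slope estimated on pairs i-window, ..., i-1 only.
    """
    n = len(y)
    errors = []
    for i in range(n - P, n):
        b = slope(x[i - window:i], y[i - window:i])
        errors.append((y[i] - b * x[i]) ** 2)
    return sum(errors) / P

def choose_window(x, y, candidates, P):
    """Pick the candidate window length with the smallest out-of-sample MSE."""
    return min(candidates, key=lambda w: rolling_mse(x, y, w, P))

# Hypothetical data: the slope shifts from 0 to 1 halfway through the sample
x = [1.0] * 30
y = [0.0] * 15 + [1.0] * 15
best_window = choose_window(x, y, candidates=[5, 20], P=5)
print(best_window, rolling_mse(x, y, 5, 5), rolling_mse(x, y, 20, 5))
```

Every candidate window must satisfy window ≤ n − P so that the first evaluation forecast has enough data behind it.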
Using model averaging to deal with breaks
A more robust approach uses model averaging to deal with the underlying
uncertainty surrounding the selection of the estimation window
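Combine forecasts associated with estimation windows $\omega \in [\omega_0, \omega_1]$:
\[ \hat y_{T+1|T} = \frac{\sum_{\omega=\omega_0}^{\omega_1} (x_T' \hat\beta_{T-\omega+1:T})\, MSE(\omega)^{-1}}{\sum_{\omega=\omega_0}^{\omega_1} MSE(\omega)^{-1}} \]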
Example: daily data with ω0 = 200 days, ω1 = 500 days
If the break is very large, models that start the estimation sample after the
break will receive greater weight (they have small MSE) than models that
include pre-break data and thus get affected by a large (squared) bias term
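The inverse-MSE weighting can be sketched in a few lines of Python, assuming the window-specific forecasts and their MSE values have already been computed and that all MSEs are strictly positive. The function name and numbers are hypothetical.

```python
def inverse_mse_combination(forecasts, mses):
    """Combine forecasts with weights proportional to 1 / MSE."""
    inv = [1.0 / m for m in mses]
    total = sum(inv)
    return sum(f * w for f, w in zip(forecasts, inv)) / total

# Hypothetical forecasts from two estimation windows and their MSEs
forecasts = [1.0, 3.0]
mses = [1.0, 3.0]
combined = inverse_mse_combination(forecasts, mses)
print(combined)
```

Here the first window has one third of the MSE of the second, so it receives three times the weight (0.75 versus 0.25).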
Modeling the instability process
Several methods exist to capture the process giving rise to time variation in
the model parameters
Time-varying parameter model (random walk or mean-reverting)
Markov switching
Change point process
These are parametric approaches that assume a particular break process
Many small changes versus few large “breaks”
Time-varying parameter models
Small “breaks” every period
Time-varying parameter (TVP) model:
\[ y_{t+1} = x_t' \beta_{t+1} + \varepsilon_{y,t+1} \]
\[ \beta_{t+1} - \bar\beta = \kappa (\beta_t - \bar\beta) + \varepsilon_{\beta,t+1} \]
This model is in state space form with $y_t$ being the observable process and
$\beta_t$ being the latent “state”
Use Kalman filter or MCMC methods
Volatility may also be changing over time: stochastic volatility
Allowing too much time variation in parameters (large $\sigma_{\varepsilon_\beta}$) leads to poor
forecasts (signal-to-noise issue)
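One Kalman filter step for a scalar version of this model can be sketched in Python as below. It assumes that κ, $\bar\beta$ and the two shock variances (`q` for the state, `r` for the observation) are known, which in practice they are not; all names and values are hypothetical.

```python
def kalman_step(b, P, x_t, y_next, kappa, beta_bar, q, r):
    """One predict/update step for the scalar TVP model.

    b, P     : filtered mean and variance of beta_t
    x_t      : predictor observed at time t
    y_next   : outcome y_{t+1}
    q, r     : variances of the state and observation shocks
    Returns the filtered mean and variance of beta_{t+1}.
    """
    # Predict beta_{t+1} from the mean-reverting state equation
    b_pred = beta_bar + kappa * (b - beta_bar)
    P_pred = kappa ** 2 * P + q
    # Update with the observation y_{t+1} = x_t * beta_{t+1} + noise
    v = y_next - x_t * b_pred      # one-step forecast error
    S = x_t ** 2 * P_pred + r      # variance of the forecast error
    K = P_pred * x_t / S           # Kalman gain
    return b_pred + K * v, (1.0 - K * x_t) * P_pred

# Hypothetical values; r = 0 means the observation pins down beta exactly
b_new, P_new = kalman_step(b=0.0, P=1.0, x_t=1.0, y_next=2.0,
                           kappa=1.0, beta_bar=0.0, q=1.0, r=0.0)
print(b_new, P_new)
```

A larger state variance `q` raises the gain and makes the coefficient estimate chase the latest observation more closely, which is the signal-to-noise issue noted above.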
Regime switching models: History repeats
Markov switching processes take the form
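\[ y_{t+1} = \mu_{s_{t+1}} + \varepsilon_{t+1}, \quad \varepsilon_{t+1} \sim N(0, \Sigma_{s_{t+1}}) \]
\[ \Pr(s_{t+1} = j \mid s_t = i) = p_{ij}, \quad i, j \in \{1, \ldots, K\} \]
$s_{t+1}$ : underlying state, $s_{t+1} \in \{1, 2, \ldots, K\}$ for K ≥ 2
Key assumption: the same K states repeat
Is this realistic?
Use forward-looking information in state transitions:
\[ p_{ij}(z_t) = \Phi(\alpha_{ij} + \beta_{ij} z_t) \]
e.g., $z_t = \Delta \text{Leading Indicator}_t$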
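A small Python sketch of the one-step-ahead forecast implied by a Markov switching model with constant transition probabilities: propagate the current state probabilities through the transition matrix and average the state means. The current state probabilities are assumed to be given (in practice they come from a filter), and all numbers are hypothetical.

```python
def ms_forecast(state_probs, transition, means):
    """One-step-ahead mean forecast for a Markov switching model.

    state_probs[i]   : probability of being in state i at time t
    transition[i][j] : Pr(s_{t+1} = j | s_t = i)
    means[j]         : mean of y in state j
    """
    k = len(means)
    next_probs = [sum(state_probs[i] * transition[i][j] for i in range(k))
                  for j in range(k)]
    forecast = sum(p * m for p, m in zip(next_probs, means))
    return forecast, next_probs

# Hypothetical two-state example: currently in state 1 with certainty
transition = [[0.9, 0.1],
              [0.2, 0.8]]
forecast, next_probs = ms_forecast([1.0, 0.0], transition, means=[1.0, -2.0])
print(forecast, next_probs)
```

With time-varying transitions, the fixed matrix would be replaced by one evaluated at the current value of $z_t$.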
Markov Switching models: been there, seen that
[Figure: time path of $\beta_t$ switching back and forth between two values, $\beta_1$ and $\beta_2$; model $y_{t+1} = \beta_{s_{t+1}} x_t + \varepsilon_{t+1}$]
Change point models: History doesn’t repeat
Change point models allow the number of states to increase over time and do
not impose that the states are drawn repeatedly from the same set of values
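Example: Assuming K breaks up to time T, for i = 0, …, K
\[ y_{t+1} = \mu_i + \Sigma_i \varepsilon_{t+1}, \quad \tau_i \le t \le \tau_{i+1} \]
Assuming that the probability of remaining within a particular state is
constant, but state-specific, the transitions for this class of models are
\[ P = \begin{pmatrix} p_{11} & p_{12} & 0 & \cdots & 0 \\ 0 & p_{22} & p_{23} & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & \cdots & 0 & p_{KK} & p_{K,K+1} \\ 0 & 0 & \cdots & 0 & 1 \end{pmatrix} \]
$p_{i,i+1} = 1 - p_{ii}$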
The process either remains in the current state or moves to a new state
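The transition matrix of a change point model can be built from the state-specific "stay" probabilities, as in this Python sketch. The function name and probabilities are hypothetical.

```python
def change_point_matrix(stay_probs):
    """Build the (K+1) x (K+1) transition matrix from p_11, ..., p_KK.

    From state i the process either stays (probability p_ii) or moves to
    state i+1 (probability 1 - p_ii); the final state is absorbing.
    """
    k = len(stay_probs)
    n = k + 1
    P = [[0.0] * n for _ in range(n)]
    for i, p in enumerate(stay_probs):
        P[i][i] = p
        P[i][i + 1] = 1.0 - p
    P[n - 1][n - 1] = 1.0
    return P

# Hypothetical stay probabilities for K = 2
P = change_point_matrix([0.9, 0.8])
for row in P:
    print(row)
```

Unlike the Markov switching case, no row allows a return to an earlier state, which is what "history doesn't repeat" means here.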
Change point models: a new era arises
[Figure: time path of $\beta_t$ moving through a sequence of new values, $\beta_1$, $\beta_2$, $\beta_3$, $\beta_4$; model $y_{t+1} = \beta_{s_{t+1}} x_t + \varepsilon_{t+1}$]
Monitoring stability of forecasting models
Monitoring the predictive accuracy
To study the real-time evolution in the accuracy of a model’s forecasts, plot
the Cumulative Sum of Squared prediction Error Difference (CSSED) for
some benchmark against the competitor model up to time t :
\[ CSSED_{m,t} = \sum_{\tau=1}^{t} \left( e^2_{Benmk,\tau} - e^2_{m,\tau} \right) \]
$e_{Benmk,\tau} = y_\tau - \hat y_{\tau,Benmk}$ : forecast error (Benmk)
$e_{m,\tau} = y_\tau - \hat y_{\tau,m}$ : forecast error (model m)
Positive and rising values of CSSED indicate that the point forecasts
generated by model m are more accurate than those produced by the
benchmark
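A minimal Python sketch of the CSSED path, assuming the two series of forecast errors are available; the function name and the error values are hypothetical.

```python
def cssed(bench_errors, model_errors):
    """Cumulative sum of squared error differences, benchmark minus model."""
    path, total = [], 0.0
    for e_b, e_m in zip(bench_errors, model_errors):
        total += e_b ** 2 - e_m ** 2
        path.append(total)
    return path

# Hypothetical forecast errors
bench = [1.0, 2.0, 1.0]
model = [0.5, 1.0, 2.0]
path = cssed(bench, model)
print(path)
```

In this example the model gains on the benchmark in the first two periods and gives most of that back in the third, which shows up as a rise followed by a drop in the path.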
Comparison of break models
Pettenuzzo and Timmermann (2015)
Benchmark: Constant parameter linear model (LIN)
Competitors:
Time-varying parameter, stochastic volatility (TVP-SV)
Markov switching with K regimes ($MS_K$)
Change point model with K regimes ($CP_K$)
Data and models (US inflation)
$P_t$ : quarterly price index for the GDP deflator
$\pi_t = 400 \times \ln(P_t / P_{t-1})$ : annualized quarterly inflation rate
Prediction model: backward-looking Phillips curve
\[ \Delta\pi_{t+1} = \mu + \beta(L) u_t + \lambda(L) \Delta\pi_t + \varepsilon_{t+1}, \quad \varepsilon_{t+1} \sim N(0, \sigma^2_\varepsilon) \]
$\Delta\pi_{t+1} = \pi_{t+1} - \pi_t$ : quarter-on-quarter change in the annualized inflation
rate
ut : quarterly unemployment rate
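The data transformations can be sketched in Python as follows; the price index values are hypothetical and the function names are illustrative.

```python
import math

def annualized_inflation(prices):
    """pi_t = 400 * ln(P_t / P_{t-1}) for a quarterly price index."""
    return [400.0 * math.log(p1 / p0) for p0, p1 in zip(prices, prices[1:])]

def first_difference(series):
    """Delta pi_{t+1} = pi_{t+1} - pi_t."""
    return [b - a for a, b in zip(series, series[1:])]

# Hypothetical price index growing by 1% per quarter
prices = [100.0, 101.0, 102.01]
pi = annualized_inflation(prices)
dpi = first_difference(pi)
print(pi, dpi)
```

Steady 1% quarterly price growth gives annualized inflation of about 3.98% in each quarter, so the change in inflation is approximately zero.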
Cumulative sum of squared forecast error differentials
(quarterly inflation, 1970-2012)
Challenges to forecasting with breaks
1 Tests for model instability can detect many different types of instability and
tend to be uninformative about the nature of the instability
2 The forecaster therefore often does not have a good idea of which specific
way to capture model instability
3 Future parameter values might change again over the forecast horizon if this
is long. Forecasting procedures require modeling both the probability and
magnitude of future breaks
1 Models with rare changes to the parameters have little or nothing to say about
the chance of a future break in the parameters
2 Think about forecasting the growth in the Chinese economy over the next 25
years. Many things could change—future “breaks” could occur
Conclusions: Practical Lessons
Model instability poses fundamental challenges to forecasting: All bets could
be off if ignored
Important to monitor model stability
Use model forecasts with more caution if models appear to be breaking down
Forecast combinations offer a promising tool to handle instability
Combine forecasts from models using short, medium and long estimation
windows
Combine different types of models, allowing combination weights to evolve
over time as new and better models replace old (stale) ones
Conclusion: Practical lessons (cont.)
Models that allow parameters to change are adaptive – they catch up with a
shift in the data generating process
Adaptive approaches can work well but have obvious limitations
Models that attempt to predict instability or breaks require forward-looking
information
use information in option prices (VIX) or in financial fragility indicators
Difficult to predict exact timing and magnitude of breaks, but the risk of a
break may be time-varying
Model instability is not just a nuisance but also poses an opportunity for
improved forecasting performance
Lecture 10: Data mining – Pitfalls in Forecasting
UCSD, Winter 2017
Allan Timmermann1
1UC San Diego
Timmermann (UCSD) Data mining Winter, 2017 1 / 23
1 Data mining
Opportunities and Challenges
Skill or Luck
Bonferroni Bound
2 Comparing Many Forecasts: Reality Check
3 Hal White’s Reality Check
Data snooping and technical trading rules
Data mining (Wikipedia)
“Data mining is the process of sorting through large amounts of data and
picking out relevant information. It is usually used by business intelligence
organizations, and financial analysts, but is increasingly being used in the sciences
to extract information from the enormous data sets generated by modern
experimental and observational methods. It has been described as “the nontrivial
extraction of implicit, previously unknown, and potentially useful information from
data” and “the science of extracting useful information from large data sets or
databases.”
The term data mining is often used to apply to the two separate processes of
knowledge discovery and prediction. Knowledge discovery provides explicit
information that has a readable form and can be understood by a user (e.g.,
association rule mining). Forecasting, or predictive modeling provides
predictions of future events and may be transparent and readable in some
approaches (e.g., rule-based systems) and opaque in others such as neural
networks. Moreover, some data-mining systems such as neural networks
are inherently geared towards prediction and pattern recognition, rather
than knowledge discovery.”
Model selection and data mining
In the context of economic/financial forecasting, data mining is often used in
a negative sense as the practice of using a data set more than once for
purposes of selecting, estimating and testing a model
If you get to evaluate a model on the same data used to develop/estimate the
model, chances are you are overfitting the data
This practice is necessitated because we are limited to short (time-series)
samples which cannot be easily replicated/generated
We only have one history of quarterly US GDP, fund-manager performance etc.
We cannot use experiments to generate new data on such a large scale
If we have panel data with large cross-sections and no fixed effects, then we
can keep a large separate evaluation sample for model validation
Data mining as a source of new information
Statistical analysis guided strictly by theory may not discover unknown
relationships that have not yet been stipulated by theory
Before the emergence of germ theory, medical doctors didn’t understand why
some patients got infected, while others didn’t. It was the use of patient data
and a search for correlations that helped doctors find out that those who
washed their hands between patient visits infected fewer of them
Big data present a similar opportunity for finding new empirical relationships
which can then be interpreted by theorists
If we have a really big data set that is deep (multiple entries for each
variable) and wide (many variables), we can use machine learning to detect
novel and interesting patterns on one subset of data, then test it on another
(independent) or several other cross-validation samples
Man versus machine: testing if a theoretical model performs better than a
data mined model, using independent (novel) data
Can we automate the development of theories (hypotheses)?
Data mining as an overfitting problem
Data mining can cause problems for inference with small samples
“[David Leinweber, managing director of First Quadrant Corporation in
Pasadena, California] sifted through a United Nations CD-Rom and
discovered that historically, the single best prediction for the Standard &
Poor’s 500 stock index was butter production in Bangladesh.” (Coy,
1997, Business Week)
“Is it reasonable to use the standard t-statistic as a valid measure of
significance when the test is conducted on the same data used by many
earlier studies whose results influenced the choice of theory to be tested?”
(Merton, 1987)
“. . . given enough computer time, we are sure that we can find a mechanical
trading rule which “works” on a table of random numbers - provided that we
are allowed to test the rule on the same table of numbers which we used to
discover the rule” (Jensen and Bennington, 1970)
Skill or luck?
A student comes to the professor’s office and reports a stock market return
prediction model with a t-statistic of 3. Should the professor be impressed?
If the student only fitted a single model: Yes
What if the student experimented with hundreds of models?
A similar issue arises more broadly in the assessment of performance
Forecasting models in financial markets
Star mutual funds
How many mutual fund “stars” should we expect to find by random chance?
What if the answer is four and we only see one star in the actual data?
Lucky penny
Newsletter scam
Dealing with over-fitting
Report the full set (and number) of forecasting models that were considered
Good practice in all circumstances
The harder you had to work to find a ‘good’ model, the more skeptical you
should be that the model will produce good future forecasts
Problem: Difficult to keep track of all models
What about other forecasters whose work influenced your study? Do you know
how many models they considered? Collective data mining
Even if you keep track of all the models you studied, how do you account for
the correlation between the forecasts that they generate?
Dealing with over-fitting (cont.)
Use data from alternative sources
Seeking independent evidence
This is a way to ‘tie your hands’ by not initially looking at all possible data
This strategy works if you have access to similar and genuinely independent
data
Often such data are difficult to obtain
Example: Use European or Asian data to corroborate results found in the US
Problem: What if the data are correlated? US, European and Asian stock
market returns are highly correlated and so are not independent data sources
Dealing with over-fitting (cont.)
Reserve the last portion of your data for out-of-sample forecast evaluation
Problem: what if the world has changed?
Maybe the forecasting model truly performed well in a particular historical
sample (the “in-sample” period), but broke down in the subsequent sample
Example: Performance of small stocks
Small cap stocks have not systematically outperformed large cap stocks in the
35 years since the size effect was publicized in the early 1980s
Bonferroni bound
Suppose we are interested in testing if the best model among k = 1, …, m
models produces better forecasts than some benchmark forecasting model
Let $p_k$ be the p-value associated with the null hypothesis that model k does
not produce more accurate forecasts than the benchmark
This could be based on the t-statistic from a Diebold-Mariano test
The Bonferroni Bound says that the p-value for the null that none of the m
models is superior to the benchmark satisfies an upper bound
$p \le \min(m \times \min(p_1, \ldots, p_m), 1)$
The smallest of the p-values (which produces the strongest evidence against
the null that no model beats the benchmark) gets multiplied by the number of
tested models, m
Mindless data mining weakens the evidence!
Bonferroni bound holds for all possible correlations between test statistics
Bonferroni bound: example
Bonferroni bound: Probability of observing a p-value as small as pmin among
m forecasting models is less than or equal to m× pmin
Example: m = 10, pmin = 0.02
Bonferroni bound = Min(10× 0.02, 1) = 0.20
Under the null that no model beats the benchmark, in a sample with 10 p-values
there is at most a 20% chance that the smallest p-value is as small as 0.02
Test is conservative (doesn’t reject as often as it should): Suppose you have
20 models whose tests have a correlation of 1. Effectively you only have one
(independent) forecast. However, even if p1 = p2 = … = p20 = 0.01, the
Bonferroni bound gives a value of
p ≤ 20× 0.01 = 0.20
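The two calculations above can be reproduced with a few lines of code. This is a minimal sketch; the function name `bonferroni_pvalue` is a hypothetical helper, and the nine filler p-values of 0.5 in the first call are made up purely so that the list has m = 10 entries with a minimum of 0.02.

```python
def bonferroni_pvalue(p_values):
    """Bonferroni bound: min(m * smallest p-value, 1) for m tested models."""
    m = len(p_values)
    return min(m * min(p_values), 1.0)

# m = 10 models with smallest p-value 0.02 (other p-values are placeholders)
bound_ten = bonferroni_pvalue([0.02] + [0.5] * 9)

# 20 perfectly correlated tests that all report p = 0.01
bound_twenty = bonferroni_pvalue([0.01] * 20)

print(bound_ten, bound_twenty)
```

The second call illustrates the conservativeness: the bound treats 20 identical tests as if they were 20 separate attempts.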
Reverse engineering the Bonferroni bound
Suppose a student reports a p-value of 0.001 (one in a thousand)
For this to correspond to a Bonferroni p-value of 0.05, at least 50 models
must have been considered since 50× 0.001 = 0.05
Is this likely?
What is a low p-value in a world with data snooping?
The conventional criterion that a variable is significant if its p-value falls
below 0.05 no longer applies
Back to the example with a t-statistic of 3 reported by a student. How
many models would the student have to have considered for the Bonferroni
p-value to reach 0.05?
prob(t ≥ 3) ≈ 0.00135 (one-sided, standard normal)
0.05/0.00135 ≈ 37
Answer: 37 models
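The reverse-engineering calculation can be checked numerically. The sketch below assumes the t-statistic is approximately standard normal and that the test is one-sided; `one_sided_normal_pvalue` is a hypothetical helper name.

```python
import math

def one_sided_normal_pvalue(t):
    """P(Z >= t) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

p = one_sided_normal_pvalue(3.0)

# Number of models m for which the Bonferroni bound m * p equals 0.05
models_needed = 0.05 / p

print(p, models_needed)
```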
Testing for Superior Predictive Ability
How confident can we be that the best forecast is genuinely better than some
benchmark, given that the best forecast is selected from a potentially large
set of forecasts?
Skill or luck? In a particular sample, a forecast model may produce a small
average loss even though in expectation (i.e., across all samples we could
have seen) the model would not have been so good
A search across multiple forecast models may result in the discovery of a
genuinely good model (skill), but it may also uncover a bad model that just
happens to perform well in a given sample (luck)
Tests used in model comparisons typically ignore any search that preceded
the selection of the prediction models
White (2000) Reality Check
Forecasts generated recursively using an expanding estimation window
m models used to compute m out-of-sample (average) losses
$f_{0,t+1|t}$: forecast from benchmark (model 0)
$f_{k,t+1|t}$: forecast from alternative model $k$, $k = 1, \ldots, m$
$d_{k,t+1} = (y_{t+1} - f_{0,t+1|t})^2 - (y_{t+1} - f_{k,t+1|t})^2$: MSE difference for the
benchmark (model 0) relative to model $k$
$\bar{d}_k$: sample average of $d_{k,t+1}$
$\bar{d}_k > 0$ suggests model $k$ outperformed benchmark (model 0)
$\bar{d} = (\bar{d}_1, \ldots, \bar{d}_m)'$: $m \times 1$ vector of sample averages of MSE differences
measured relative to the benchmark
$\bar{d}^* = (\bar{d}_1(\beta_1^*), \ldots, \bar{d}_m(\beta_m^*))'$: $m \times 1$ vector of sample averages of MSE
differences measured relative to the benchmark
$\beta_i^* = \operatorname{plim}_{t \to \infty} \hat{\beta}_{it}$: probability limit of $\hat{\beta}_{it}$
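As a small illustration of the MSE difference defined above, the sketch below computes the per-period differences and their sample average for one alternative model. The three observations and forecasts are made-up numbers, and `mse_differentials` is a hypothetical helper name.

```python
def mse_differentials(y, f_bench, f_model):
    """Squared error of the benchmark minus squared error of model k, per period."""
    return [(yt - f0) ** 2 - (yt - fk) ** 2
            for yt, f0, fk in zip(y, f_bench, f_model)]

# Made-up outcomes and one-step-ahead forecasts, for illustration only
y = [1.0, 2.0, 3.0]
f_bench = [0.0, 0.0, 0.0]   # benchmark (model 0)
f_model = [1.0, 2.0, 2.0]   # alternative model k

d_k = mse_differentials(y, f_bench, f_model)
d_bar_k = sum(d_k) / len(d_k)   # positive: model k had the lower MSE in this sample
print(d_k, d_bar_k)
```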
White (2000) Reality Check (cont.)
White’s reality check tests the null hypothesis that the benchmark model is
not inferior to any of the m alternatives:
$H_0: \max_{k=1,\ldots,m} E[d^*_{k,t+1}] \le 0$
If all models perform as well as the benchmark, every element of
$\bar{d}^* = (\bar{d}^*_1, \ldots, \bar{d}^*_m)'$ has mean zero and so the largest of these
expected values is also zero
Examining the maximum of $\bar{d}$ is the same as searching for the best model
Alternative hypothesis: the best model outperforms the benchmark, i.e.,
there exists a superior model $k$ such that $E[d^*_{k,t+1}] > 0$
White (2000) Reality Check (cont.)
White shows conditions under which ($\Rightarrow$ means convergence in distribution)
$\max_{k=1,\ldots,m} T^{1/2}(\bar{d}_k - E[\bar{d}^*_k]) \Rightarrow \max_{k=1,\ldots,m} \{Z_k\}$
$Z = (Z_1, \ldots, Z_m)'$ is distributed as $N(0, \Omega)$
Problem: We cannot easily determine the distribution of the maximum of $Z$
because the $m \times m$ covariance matrix $\Omega$ is unknown
White (2000) Reality Check (cont.)
Hal White (2000) developed a bootstrap for drawing the maximum from
$N(0, \Omega)$
1 Draw values of $d_{k,t+1}$ with replacement to generate bootstrap samples of $\bar{d}_k$.
These draws use the same ‘t’ across all models to preserve the correlation
structure across the test statistics $\bar{d}$
2 Compute the maximum value of the test statistic across the m models in each
bootstrap sample
3 Compare these to the actual value of the maximum test statistic, i.e.,
compare the best performance in the actual data to the quantiles from the
bootstrap to obtain White’s bootstrapped Reality Check p-value for the null
hypothesis that no model beats the benchmark
Important to impose the null hypothesis: recentering the bootstrap
distribution around $\bar{d}_k = 0$
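The three steps above can be sketched in code. This is a simplified illustration, not White's exact procedure: it resamples time indices independently (which ignores serial dependence in the loss differences), it recentres each bootstrap average by subtracting the sample average, and the function name `reality_check_pvalue`, the simulated data, and all parameter values are hypothetical.

```python
import random

def reality_check_pvalue(d, n_boot=500, seed=0):
    """d[t][k]: MSE difference of model k versus the benchmark in period t."""
    rng = random.Random(seed)
    T, m = len(d), len(d[0])
    d_bar = [sum(row[k] for row in d) / T for k in range(m)]
    stat = T ** 0.5 * max(d_bar)                      # best model in the actual data
    exceed = 0
    for _ in range(n_boot):
        idx = [rng.randrange(T) for _ in range(T)]    # same periods for all models
        boot_bar = [sum(d[t][k] for t in idx) / T for k in range(m)]
        # Recentre around the sample averages to impose the null hypothesis
        boot_stat = T ** 0.5 * max(b - a for b, a in zip(boot_bar, d_bar))
        exceed += boot_stat >= stat
    return exceed / n_boot

# Simulated example: 10 models, 100 periods (made-up data)
rng = random.Random(1)
noise = [[rng.gauss(0.0, 1.0) for _ in range(10)] for _ in range(100)]
skill = [[row[0] + 1.0] + row[1:] for row in noise]   # model 0 truly beats the benchmark

p_noise = reality_check_pvalue(noise)
p_skill = reality_check_pvalue(skill)
print(p_noise, p_skill)
```

Drawing one set of time indices per bootstrap sample and applying it to every model is what preserves the cross-model correlation mentioned in step 1.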
White (2000) Reality Check: stationary bootstrap
Let $\tau_l$ be a random time index between $R$ and $T$
R: beginning of evaluation sample
T: end of evaluation sample
For each bootstrap, $b$, generate a sample estimate
$\bar{d}^b_k = \sum_{t=R}^{T} (y_{\tau_l} - f_{k,\tau_l - 1})^2$ as follows:
1 Set $t = R + 1$. Draw $\tau_l = \tau_R$ at random, independently, and uniformly from
$\{R + 1, \ldots, T\}$
2 Increase $t$ by 1. If $t > T$, stop. Otherwise, draw a standard uniform random
variable, $U$, independently of all other random variables
2a If $U < q$, draw $\tau_l$ at random, independently, and uniformly, from $\{R + 1, \ldots, T\}$
2b If $U \ge q$, expand the block by setting $\tau_l = \tau_{l-1} + 1$; if $\tau_l > T$, reset $\tau_l = R + 1$
3 Repeat step 2
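The index-drawing steps above can be sketched as follows. This is a minimal illustration under the reading that q is the probability of starting a new block at each step; the function name `stationary_bootstrap_indices` and the example values of R, T and q are hypothetical.

```python
import random

def stationary_bootstrap_indices(R, T, q, rng):
    """Resampled time indices in {R+1, ..., T}; a new block starts with probability q."""
    tau = rng.randint(R + 1, T)          # step 1: uniform initial draw (inclusive bounds)
    indices = [tau]
    for _ in range(T - R - 1):           # step 2, repeated until t exceeds T
        if rng.random() < q:
            tau = rng.randint(R + 1, T)  # start a new block at a random date
        else:
            tau = tau + 1 if tau < T else R + 1   # extend the block, wrapping around
        indices.append(tau)
    return indices

rng = random.Random(0)
idx = stationary_bootstrap_indices(R=100, T=200, q=0.1, rng=rng)
print(idx[:10])
```

The resulting indices would then be used to pick the resampled forecast errors for every model at once, so blocks of consecutive dates (of average length 1/q) are kept together.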
Sullivan, Timmermann, and White, JF 1999
Investigates the performance of 7,846 technical trading rules applied to daily
data
filter rules, moving averages, support and resistance, channel breakouts,
on-balance volume averages
The “best” technical trading rule looks very good on its own
However, once we account for the possibility that the best technical trading
rule was selected from a large set of possible rules, during the period with
liquid trading we can no longer conclude that even the best rule is
significantly better than the benchmark buy-and-hold strategy
Sullivan, Timmermann, and White, JF 1999: Results
Conclusions
Data mining poses both an opportunity and a challenge to constructing
forecasting models in economics and finance
Less of a concern if we can generate new data or use cross-sectional holdout
data
More of a concern if we only have one (short) historical time-series
Methods exist for quantifying how data mining affects statistical tests