Loss Functions
Australian National University
(James Taylor) 1 / 12
2.1
Loss Functions
Stop for a moment and ask: Why do we want to forecast?
Usually to guide decision making under uncertainty
Make financial investments
Manage inventories
Make investment in productive capacities
Prepare our economy for future changes
Loss Functions
There are consequences for making errors in the forecast
Need to consider the associated loss if the forecast is wrong
The cost associated with a forecast error is summarized by the loss
function
Example: Weather prediction
Predict if it will be freezing on the way home:
State Space = {Freezing, Not Freezing}
If freezing, bring a thick jacket
If not freezing, don’t bring a jacket
Example: Weather prediction
If our prediction is correct then we’re happy (loss is zero)
If we’re incorrect:
if it’s freezing and no jacket we get cold and are very unhappy (cost
= 10)
if it’s not freezing and we have to carry around our big jacket we are
slightly unhappy (cost = 2)
Weather prediction – The Loss Function
Table: Loss function table
Freezing Fine
Bring a Jacket 0 2
No Jacket 10 0
Expected Loss Minimisation
Our action depends not only on the weather forecast, but also on the loss
function
Suppose we believe there is a 20% chance of it being freezing this
afternoon:
if we bring a jacket, our expected cost is 1.6, that's (0 × 0.2 + 2 × 0.8)
if we don't bring a jacket, our expected cost is 2, that's (10 × 0.2 + 0 × 0.8)
even though we think it is unlikely to be freezing, we should nonetheless bring a jacket.
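The expected-loss calculation is easy to script. A minimal Python sketch of the weather example (the dictionary layout is my own; the costs and probability come from the slides):

```python
# Loss table from the slides: loss[action][outcome]
loss = {
    "bring jacket": {"freezing": 0, "fine": 2},
    "no jacket": {"freezing": 10, "fine": 0},
}
p_freezing = 0.2  # believed probability of freezing this afternoon

def expected_loss(action):
    # Expected loss = sum over outcomes of loss x outcome probability
    return (loss[action]["freezing"] * p_freezing
            + loss[action]["fine"] * (1 - p_freezing))

best = min(loss, key=expected_loss)
print(expected_loss("bring jacket"))  # 1.6
print(expected_loss("no jacket"))     # 2.0
print(best)                           # bring jacket
```

The optimal action minimises expected loss, so we bring the jacket even though freezing is unlikely.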
Symmetric and Asymmetric Loss Functions
The loss function in the weather example is asymmetric – the losses are different for different kinds of deviations from optimal.
Symmetric loss functions are much more common – where errors above or
below the actual value cause the same loss.
Two popular symmetric loss functions:
squared loss: L(ŷ, y) = (ŷ − y)²
absolute loss: L(ŷ, y) = |ŷ − y|
where ŷ is the prediction and y is the actual value
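Both loss functions are one-liners; a quick sketch showing the symmetry (the sample values are arbitrary):

```python
def squared_loss(y_hat, y):
    # Squared loss: L(y_hat, y) = (y_hat - y)^2
    return (y_hat - y) ** 2

def absolute_loss(y_hat, y):
    # Absolute loss: L(y_hat, y) = |y_hat - y|
    return abs(y_hat - y)

# Symmetric: over- and under-predicting by the same amount cost the same
print(squared_loss(7, 5), squared_loss(3, 5))    # 4 4
print(absolute_loss(7, 5), absolute_loss(3, 5))  # 2 2
```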
Symmetric Loss Functions
(a) Squared Loss (b) Absolute Loss
Figure: Symmetric Loss Functions
Optimal Point Forecast
We have been using E(yT+h | IT, θ) as our h-step ahead point forecast
Is this really a good choice? What does it mean for a forecast to be
“good”?
Since we want to produce forecasts to aid decision making…
Define a point forecast as optimal if it minimises the given loss function
Optimal Point Forecast
Squared Loss
Theorem: Given a density forecast f(yT+h | IT, θ) and the squared loss function L(ŷ, y) = (ŷ − y)², the point forecast ŷT+h which minimises expected loss is
ŷT+h = E(yT+h | IT, θ)
that is, the mean of the density forecast.
Proof: In Supplementary Lecture Module
Optimal Point Forecast
Absolute Loss
Theorem: Given a density forecast f(yT+h | IT, θ) and the absolute loss function L(ŷ, y) = |ŷ − y|, the point forecast ŷT+h which minimises expected loss is the median mT+h.
Proof: In assignment
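Both theorems can be checked numerically on a toy sample: a grid search over candidate point forecasts should return the sample mean under squared loss and the sample median under absolute loss. A minimal sketch with made-up data:

```python
# Toy sample standing in for a density forecast (values are invented)
data = [1.0, 2.0, 2.0, 3.0, 10.0]
mean = sum(data) / len(data)            # 3.6
median = sorted(data)[len(data) // 2]   # 2.0

def avg_squared(y_hat):
    # Expected squared loss under the empirical distribution
    return sum((y_hat - y) ** 2 for y in data) / len(data)

def avg_absolute(y_hat):
    # Expected absolute loss under the empirical distribution
    return sum(abs(y_hat - y) for y in data) / len(data)

# Grid search over candidate point forecasts in [0, 11]
grid = [i / 100 for i in range(0, 1101)]
print(min(grid, key=avg_squared))   # 3.6 -> the mean
print(min(grid, key=avg_absolute))  # 2.0 -> the median
```

Note how the outlier 10.0 pulls the squared-loss-optimal forecast (the mean) upward, while the absolute-loss-optimal forecast (the median) is unaffected.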
Minimising Squared Loss
Optimal Point Forecasts
Australian National University
Optimal Point Forecast
Squared Loss
Theorem: Given a density forecast f(yT+h | IT, θ) and the squared loss function L(ŷ, y) = (ŷ − y)², the point forecast ŷT+h which minimises expected loss is
ŷT+h = E(yT+h | IT, θ)
Proof: Upcoming
1. The hard way
2. The easy way
3. Why I’ve talked about both
1. The hard way. Choose L(ŷ, y) = (ŷ − y)². Using E[g(y)] = ∫ g(y) f(y) dy, the expected loss is

E[L(ŷ, y)] = ∫ (ŷ − y)² f(y) dy
= ∫ (ŷ² − 2ŷy + y²) f(y) dy
= ŷ² ∫ f(y) dy − 2ŷ ∫ y f(y) dy + ∫ y² f(y) dy
= ŷ² − 2ŷ E(y) + E(y²)

since ∫ f(y) dy = 1, ∫ y f(y) dy = E(y) and ∫ y² f(y) dy = E(y²). Differentiating with respect to ŷ and setting the result to zero:

2ŷ − 2E(y) = 0, so ŷ = E(y)

The second derivative is 2 > 0, so this is a minimum.

2. The easy way. Choose L(ŷ, y) = (ŷ − y)². By linearity of expectation,

E[L(ŷ, y)] = E[ŷ² − 2ŷy + y²] = ŷ² − 2ŷ E(y) + E(y²)

and differentiating with respect to ŷ gives 2ŷ − 2E(y) = 0, so ŷ = E(y) as before.

3. Why I've talked about both: the easy way is quicker, but the absolute-loss result in the assignment needs the hard way; working through both emphasises the benefit of general theorems.
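The first-order condition 2ŷ − 2E(y) = 0 can be sanity-checked numerically with a central finite difference (the sample below is invented for illustration):

```python
data = [0.5, 1.5, 2.0, 4.0]   # toy sample; its mean plays the role of E(y)
mu = sum(data) / len(data)    # 2.0

def expected_sq_loss(y_hat):
    # Average of (y_hat - y)^2 over the sample
    return sum((y_hat - y) ** 2 for y in data) / len(data)

y_hat, eps = 3.0, 1e-6
numeric = (expected_sq_loss(y_hat + eps) - expected_sq_loss(y_hat - eps)) / (2 * eps)
analytic = 2 * y_hat - 2 * mu   # the derivative from the proof
print(abs(numeric - analytic) < 1e-6)  # True
```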
Model Selection 1 – In-Sample Fitting
Australian National University
2.2
Evaluating Forecast Models
Typically there are several reasonable models for yt ,
With no good theoretical reason to prefer one to another,
But they may produce different forecasts
Example: GDP
Forecast GDP over the medium term (1 year or 5 years)
It should be growing (hopefully), so need a trend
Use linear trend? Quadratic trend? Cubic Trend?
Something else?
Approaches to Model Selection
In-Sample Fitting
Out-of-Sample Forecasting
In-sample Fitting
Use the full sample to estimate the model parameters, and then compute
some measures to assess model-fit
Two popular measures are
the mean absolute error or MAE
MAE = (1/T) Σ_{t=1}^{T} |yt − ŷt|
the mean squared error or MSE
MSE = (1/T) Σ_{t=1}^{T} (yt − ŷt)²
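Both measures are straightforward to compute; a short sketch with made-up fitted values:

```python
def mae(y, y_hat):
    # Mean absolute error: average of |y_t - y_hat_t|
    return sum(abs(a - f) for a, f in zip(y, y_hat)) / len(y)

def mse(y, y_hat):
    # Mean squared error: average of (y_t - y_hat_t)^2
    return sum((a - f) ** 2 for a, f in zip(y, y_hat)) / len(y)

y_actual = [1.0, 2.0, 3.0, 4.0]
y_fitted = [1.5, 2.0, 2.5, 5.0]  # hypothetical in-sample fitted values
print(mae(y_actual, y_fitted))   # 0.5
print(mse(y_actual, y_fitted))   # 0.375
```

MSE punishes the one large error (1.0) more heavily than the two small ones, which is why the two criteria can rank models differently.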
In-sample Fitting with MAE/MSE
Choose the model with the lowest MAE or MSE
If our model fits the historical data well, then hopefully it will produce
good forecasts in the future
But, in-sample fitting measures are prone to the over-fitting problem
the selected model is excessively complex relative to the amount of
data available
Over-fitting Problem
MAE and MSE both depend exclusively on how well the model fits
data and ignore model complexity
Both criteria necessarily favour more complex models
That is, if we have two nested specifications, the MAE/MSE for the
more complex model is always no larger than that of the simpler
model
Over-fitting Problem
That the more complex model is always favoured is disconcerting
With a finite amount of data, parameter estimates become more imprecise as the model becomes more complex
Forecasts based on inaccurately estimated parameters tend to be poor
Over-fitting Problem – Example
Consider the two trend specifications:
ŷ1t = a0 + a1t
ŷ2t = b0 + b1t + b2t²
The linear model is nested within the quadratic model (for b2 = 0)
Estimate parameters by OLS
Claim: The MSE under the quadratic model is always no larger than the MSE under the linear model
Over-fitting Problem – Example
Define f : ℝ³ → ℝ by
f(c0, c1, c2) = (1/T) Σ_{t=1}^{T} (yt − c0 − c1t − c2t²)²
Generate (â0, â1) and (b̂0, b̂1, b̂2) by OLS. Then
(â0, â1) = argmin_{a0,a1} Σ_{t=1}^{T} (yt − ŷ1t)² = argmin_{a0,a1} f(a0, a1, 0)
(b̂0, b̂1, b̂2) = argmin_{b0,b1,b2} Σ_{t=1}^{T} (yt − ŷ2t)² = argmin_{b0,b1,b2} f(b0, b1, b2)
Over-fitting Problem – Example
Observation 1:
MSE1 = f(â0, â1, 0), MSE2 = f(b̂0, b̂1, b̂2)
Observation 2: (b̂0, b̂1, b̂2) is the global minimiser of f . Therefore,
f(b̂0, b̂1, b̂2) ≤ f(â0, â1, 0)
So MSE2 ≤ MSE1
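The claim is easy to reproduce numerically. The sketch below simulates a linear trend with noise (an invented data-generating process), fits both specifications by OLS via NumPy's least-squares solver, and confirms MSE2 ≤ MSE1:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
t = np.arange(1, T + 1, dtype=float)
y = 0.5 + 0.1 * t + rng.normal(size=T)   # true process is linear + noise

def ols_mse(X, y):
    # OLS fit, then in-sample mean squared error
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    return np.mean(residuals ** 2)

X_linear = np.column_stack([np.ones(T), t])             # a0 + a1 t
X_quadratic = np.column_stack([np.ones(T), t, t ** 2])  # b0 + b1 t + b2 t^2
mse1 = ols_mse(X_linear, y)
mse2 = ols_mse(X_quadratic, y)
print(mse2 <= mse1)  # True: the richer nested model never fits worse
```

The quadratic model fits better in-sample even though the true process is linear, which is exactly the over-fitting worry.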
Penalize Model Complexity
So it seems reasonable to penalize models with more parameters
If two models both fit the data similarly well, we would prefer to pick
the simpler model (usually)
Reduces the impact of the over-fitting problem
AIC and BIC
Popular selection criteria that explicitly include model complexity are the
Akaike information criterion and the Bayesian information criterion.
AIC = MSE × T + 2k
BIC = MSE × T + k log T
where T is the sample size and k is the number of parameters in the
model
pick the model with the smallest AIC or BIC
they reward goodness-of-fit
but also penalize model complexity to reduce over-fitting
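The formulas above apply directly; in the sketch below the MSE values are hypothetical, chosen only to mimic the nested-trend pattern (big fit gain from linear to quadratic, small gain from quadratic to cubic):

```python
import math

# AIC and BIC exactly as defined on the slide
def aic(mse, T, k):
    return mse * T + 2 * k

def bic(mse, T, k):
    return mse * T + k * math.log(T)

# Hypothetical MSEs for three nested trend models; k = number of parameters
models = {"linear": (0.50, 2), "quadratic": (0.05, 3), "cubic": (0.045, 4)}
T = 100
for name, (m, k) in models.items():
    print(name, round(aic(m, T, k), 2), round(bic(m, T, k), 2))

best = min(models, key=lambda n: bic(models[n][0], T, models[n][1]))
print(best)  # quadratic, with these numbers
```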
AIC and BIC
AIC = MSE × T + 2k
BIC = MSE × T + k log T
The penalty term in BIC is larger than in AIC (for T ≥ 8)
So BIC favours more parsimonious models
AIC and BIC will often agree on the best model in any case
Is There a Better Way?
We are not really interested in fitting the historical data
We are interested in choosing the model which produces the best
forecasts
But, we don’t know which model does best in forecasting ex ante
Model Selection 2 – Out-of-sample Forecasting
Australian National University
2.3
Pseudo out-of-sample Forecasting
We will simulate the experience of a real-time forecaster
use a portion of the dataset, say from t = 1 to t = T0 to estimate
the model parameters
make the forecast ŷT0+h|T0
compare this forecast with the observed value yT0+h
MAFE and MSFE
The Mean Absolute Forecast Error, or MAFE, is
MAFE = (1/(T − h − T0 + 1)) Σ_{t=T0}^{T−h} |y_{t+h} − ŷ_{t+h|t}|
The Mean Squared Forecast Error, or MSFE, is
MSFE = (1/(T − h − T0 + 1)) Σ_{t=T0}^{T−h} (y_{t+h} − ŷ_{t+h|t})²
where yt+h is the realized value of y at time t + h, and ŷt+h|t is the h-step
ahead point forecast given the information at time t.
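The recursive scheme can be sketched in a few lines. The forecasting rule below is a naive historical mean, a stand-in chosen for illustration; any model's h-step-ahead point forecast could be plugged in:

```python
# Recursive pseudo out-of-sample MSFE.
# `forecast` maps the history y_1..y_t to an h-step-ahead point forecast.
def msfe(y, T0, h, forecast):
    T = len(y)
    sq_errors = []
    for t in range(T0, T - h + 1):   # t = T0, ..., T - h (1-based time index)
        y_hat = forecast(y[:t])      # uses data up to time t only
        sq_errors.append((y[t + h - 1] - y_hat) ** 2)  # compare with y_{t+h}
    return sum(sq_errors) / len(sq_errors)   # divides by T - h - T0 + 1

# Naive historical-mean rule (illustrative only; ignores h)
def mean_forecast(history):
    return sum(history) / len(history)

y = [float(i) for i in range(1, 11)]   # toy upward-trending series
print(msfe(y, T0=5, h=1, forecast=mean_forecast))  # 16.5
```

The mean forecast does badly here precisely because the toy series is trending, which is the kind of model failure MSFE is designed to expose.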
While MAE and MSE are in-sample goodness-of-fit measures, both
MAFE and MSFE are out-of-sample performance measures.
This means that even if we choose the model with the smallest MAFE or MSFE, we do not run into the over-fitting problem
Model Selection Example
USA Gross Domestic Product
Australian National University
2.4
Example: U.S. GDP Forecast
Figure: US Seasonally adjusted real GDP from 1947Q1 to 2009Q1, from FRED
Model Specifications
There’s clearly a trend in the U.S. GDP data, and it seems to be
faster than linear
Linear, quadratic and cubic trends:
ŷ1t = a0 + a1t
ŷ2t = b0 + b1t + b2t²
ŷ3t = c0 + c1t + c2t² + c3t³
We present the results here; estimation details will come later
Fitted Values of U.S. GDP
(a) Linear Model (b) Quadratic Model (c) Cubic Model
Figure: Fitted value of U.S. GDP under various models
MSE
Table: MSE under various trend models
Linear Trend Quadratic Trend Cubic Trend
MSE 0.543 0.045 0.039
MSE is decreasing in model complexity, as expected
The linear model does not fit the data very well
The addition of the quadratic term decreases the MSE by over 90%
The addition of the cubic term only reduces the MSE by 13%
AIC and BIC
Table: AIC and BIC of the various models
Linear Trend Quadratic Trend Cubic Trend
# of parameters 2 3 4
AIC 140.88 17.39 17.95
BIC 147.94 27.98 32.07
Both AIC and BIC suggest the best model is the quadratic trend
We make a trade-off between goodness-of-fit and model complexity: the linear model's low complexity cannot offset its terrible fit, while the cubic model fits only slightly better than the quadratic but pays a larger complexity penalty
Pseudo Out-of-sample Forecasting
We will compute the MSFE for each of the specifications with h = 4 (one year ahead) and h = 20 (five years ahead), since the data are quarterly
The recursive forecasting exercises start from 1957Q1 (i.e. T0 = 40)
MSFE
Table: MSFE under various models
Linear Trend Quadratic Trend Cubic Trend
h = 4 0.746 0.073 0.079
h = 20 1.335 0.188 0.394
The linear specification is terrible
The quadratic model forecasts best
1-year forecasts are better than 5-year forecasts (as usual)
Linear Algebra Review
Australian National University
2.5
Linear Algebra Review
Matrix multiplication
Transposition
Determinants and Invertibility
Matrix Multiplication
Transposition
Determinants and Invertibility
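Since the three review topics are only listed here, a brief NumPy sketch of each (the matrices are arbitrary examples):

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.0, 1.0], [1.0, 0.0]])

# Matrix multiplication: (AB)[i, j] = sum_k A[i, k] * B[k, j]
print(A @ B)             # [[2. 1.] [4. 3.]]

# Transposition: swap rows and columns
print(A.T)               # [[1. 3.] [2. 4.]]

# Determinant: A is invertible iff det(A) != 0
print(np.linalg.det(A))  # -2.0, up to floating-point error

# Inverse: A @ inv(A) is (numerically) the identity
print(np.allclose(A @ np.linalg.inv(A), np.eye(2)))  # True
```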