
Loss Functions

Australian National University

(James Taylor) 1 / 12

2.1

Loss Functions

Stop for a moment and ask: Why do we want to forecast?

Usually to guide decision making under uncertainty

Make financial investments

Manage inventories

Make investment in productive capacities

Prepare our economy for future changes

(James Taylor) 2 / 12

Loss Functions

There are consequences for making errors in the forecast

Need to consider the associated loss if the forecast is wrong

The cost associated with a forecast error is summarized by the loss function

(James Taylor) 3 / 12

Example: Weather prediction

Predict if it will be freezing on the way home:

State Space = {Freezing, Not Freezing}

If freezing, bring a thick jacket

If not freezing, don’t bring a jacket

(James Taylor) 4 / 12


Example: Weather prediction

If our prediction is correct then we’re happy (loss is zero)

If we’re incorrect:

if it’s freezing and we have no jacket, we get cold and are very unhappy (cost = 10)

if it’s not freezing and we have to carry around our big jacket, we are slightly unhappy (cost = 2)

(James Taylor) 5 / 12

Weather prediction – The Loss Function

Table: Loss function table

                 Freezing   Fine
Bring a Jacket      0        2
No Jacket          10        0

(James Taylor) 6 / 12


Expected Loss Minimisation

Our action depends not only on the weather forecast, but also on the loss function

Suppose we believe there is a 20% chance of it being freezing this afternoon:

if we bring a jacket, our expected cost is 1.6,
that’s (0 × 0.2 + 2 × 0.8)

if we don’t bring a jacket, our expected cost is 2.0,
that’s (10 × 0.2 + 0 × 0.8)

even though we think it is unlikely to be freezing, we should nonetheless bring a jacket.
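The calculation above can be sketched in a few lines of code (a minimal illustration; the loss table and the 20% probability are the ones on the slides):

```python
# Loss table from the slides: rows are actions, columns are outcomes.
loss = {
    "jacket":    {"freezing": 0.0,  "fine": 2.0},
    "no_jacket": {"freezing": 10.0, "fine": 0.0},
}
p_freezing = 0.2  # believed probability of a freezing afternoon

def expected_loss(action):
    # Expected loss = loss in each state, weighted by its probability.
    return (loss[action]["freezing"] * p_freezing
            + loss[action]["fine"] * (1.0 - p_freezing))

print(expected_loss("jacket"))       # 1.6
print(expected_loss("no_jacket"))    # 2.0
print(min(loss, key=expected_loss))  # jacket -- bring it anyway
```

The optimal action is whichever one minimises expected loss, so even an unlikely outcome can dictate the decision when its cost is high.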

(James Taylor) 7 / 12


Symmetric and Asymmetric Loss Functions

The loss function in the weather example is asymmetric – the losses are different for different kinds of deviations from optimal.

Symmetric loss functions are much more common – where errors above or below the actual value cause the same loss.

Two popular symmetric loss functions (ŷ is the prediction, y the actual value):

squared loss: L(ŷ, y) = (ŷ − y)²

absolute loss: L(ŷ, y) = |ŷ − y|
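A small sketch of the symmetry property (the numbers are hypothetical):

```python
def squared_loss(y_hat, y):
    return (y_hat - y) ** 2

def absolute_loss(y_hat, y):
    return abs(y_hat - y)

y = 5.0
for e in [0.5, 1.0, 3.0]:
    # Errors of +e and -e cost the same under a symmetric loss.
    assert squared_loss(y + e, y) == squared_loss(y - e, y)
    assert absolute_loss(y + e, y) == absolute_loss(y - e, y)
# The weather loss, by contrast, is asymmetric: the two kinds of
# mistakes cost 10 and 2 respectively.
```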

(James Taylor) 8 / 12


Symmetric Loss Functions

(a) Squared Loss (b) Absolute Loss

Figure: Symmetric Loss Functions

(James Taylor) 9 / 12


Optimal Point Forecast

We have been using E(yT+h | IT, θ) as our h-step ahead point forecast

Is this really a good choice? What does it mean for a forecast to be “good”?

Since we want to produce forecasts to aid decision making…

Define a point forecast as optimal if it minimises the given loss function

(James Taylor) 10 / 12


Optimal Point Forecast

Squared Loss

Theorem: Given a density forecast f(yT+h | IT, θ) and the squared loss function L(ŷ, y) = (ŷ − y)², the point forecast ŷT+h which minimises expected loss is

ŷT+h = E(yT+h | IT, θ)

Proof: In Supplementary Lecture Module

(James Taylor) 11 / 12


Optimal Point Forecast

Absolute Loss

Theorem: Given a density forecast f(yT+h | IT, θ) and the absolute loss function L(ŷ, y) = |ŷ − y|, the point forecast ŷT+h which minimises expected loss is the median mT+h.

Proof: In assignment
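A numerical check of the theorem (not the assignment proof; the sample values are hypothetical): over a right-skewed sample, a grid search for the point forecast minimising mean absolute loss lands on the sample median.

```python
import statistics

sample = [0.1, 0.2, 0.3, 0.5, 0.9, 1.5, 4.0]  # hypothetical skewed draws

def mean_abs_loss(y_hat):
    # Average absolute loss of point forecast y_hat over the sample.
    return sum(abs(y_hat - y) for y in sample) / len(sample)

grid = [i / 100 for i in range(0, 501)]  # candidate forecasts 0.00..5.00
best = min(grid, key=mean_abs_loss)

print(best, statistics.median(sample))  # both are 0.5
```

Note that the mean of this sample (≈ 1.07) is pulled up by the outlier, while the median is not; this is exactly why the two loss functions pick different optimal forecasts.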

(James Taylor) 12 / 12


Minimising Squared Loss

Optimal Point Forecasts

Australian National University

(James Taylor) 1 / 3


Optimal Point Forecast

Squared Loss

Theorem: Given a density forecast f(yT+h | IT, θ) and the squared loss function L(ŷ, y) = (ŷ − y)², the point forecast ŷT+h which minimises expected loss is

ŷT+h = E(yT+h | IT, θ)

Proof: Upcoming

(James Taylor) 2 / 3

1. The hard way

2. The easy way

3. Why I’ve talked about both

(James Taylor) 3 / 3

1. The hard way. Choose L(ŷ, y) = (ŷ − y)² and work with the integral directly:

E[L(ŷ, y)] = ∫ (ŷ − y)² f(y) dy

= ∫ (ŷ² − 2ŷy + y²) f(y) dy

= ŷ² ∫ f(y) dy − 2ŷ ∫ y f(y) dy + ∫ y² f(y) dy

= ŷ² − 2ŷ E[y] + E[y²]

Differentiate with respect to ŷ and set to zero:

2ŷ − 2 E[y] = 0  ⟹  ŷ = E[y]

(The second derivative is 2 > 0, so this is indeed a minimum.)

2. The easy way. Choose L(ŷ, y) = (ŷ − y)², expand, and use linearity of expectation:

E[(ŷ − y)²] = E[ŷ² − 2ŷy + y²] = ŷ² − 2ŷ E[y] + E[y²]

which is the same expression as before, so again ŷ = E[y].

3. Why both? The easy way emphasises the benefit of general theorems (linearity of expectation); the similar absolute-loss problem in the assignment, with L(ŷ, y) = |ŷ − y|, needs the hard way.
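The result can also be checked numerically (a sketch with draws from an assumed N(2, 1) predictive density; this is an illustration, not part of the proof):

```python
import random

random.seed(0)
draws = [random.gauss(2.0, 1.0) for _ in range(10_000)]  # assumed density
m = sum(draws) / len(draws)  # sample mean, standing in for E[y]

def expected_squared_loss(y_hat):
    # Monte Carlo estimate of E[(y_hat - y)^2] under the assumed density.
    return sum((y_hat - y) ** 2 for y in draws) / len(draws)

grid = [1.0 + i / 100 for i in range(201)]  # candidates 1.00..3.00
best = min(grid, key=expected_squared_loss)
# The minimiser sits at the mean, up to the grid spacing of 0.01.
assert abs(best - m) <= 0.01
```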

Model Selection 1 – In-Sample Fitting

Australian National University

(James Taylor) 1 / 15

2.2

Evaluating Forecast Models

Typically there are several reasonable models for yt,

with no good theoretical reason to prefer one to another,

but they may produce different forecasts

(James Taylor) 2 / 15

Example: GDP

Forecast GDP over the medium term (1 year or 5 years)

It should be growing (hopefully), so we need a trend

Use a linear trend? A quadratic trend? A cubic trend?

Something else?

(James Taylor) 3 / 15

Approaches to Model Selection

In-Sample Fitting

Out-of-Sample Forecasting

(James Taylor) 4 / 15

In-sample Fitting

Use the full sample to estimate the model parameters, and then compute some measures to assess model fit

Two popular measures are

the mean absolute error or MAE

MAE = (1/T) Σ_{t=1}^{T} |yt − ŷt|

the mean squared error or MSE

MSE = (1/T) Σ_{t=1}^{T} (yt − ŷt)²
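The two formulas translate directly into code (the series and fitted values below are hypothetical):

```python
def mae(actual, fitted):
    # Mean absolute error: average |y_t - yhat_t|.
    return sum(abs(y - f) for y, f in zip(actual, fitted)) / len(actual)

def mse(actual, fitted):
    # Mean squared error: average (y_t - yhat_t)^2.
    return sum((y - f) ** 2 for y, f in zip(actual, fitted)) / len(actual)

y = [1.0, 2.0, 4.0, 8.0]       # hypothetical series
y_hat = [1.5, 2.0, 3.0, 9.0]   # hypothetical fitted values

print(mae(y, y_hat))  # 0.625  = (0.5 + 0 + 1 + 1) / 4
print(mse(y, y_hat))  # 0.5625 = (0.25 + 0 + 1 + 1) / 4
```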

(James Taylor) 5 / 15


In-sample Fitting with MAE/MSE

Choose the model with the lowest MAE or MSE

If our model fits the historical data well, then hopefully it will produce good forecasts in the future

But in-sample fitting measures are prone to the over-fitting problem

the selected model is excessively complex relative to the amount of data available

(James Taylor) 6 / 15


Over-fitting Problem

MAE and MSE both depend exclusively on how well the model fits the data and ignore model complexity

Both criteria necessarily favour more complex models

That is, if we have two nested specifications, the MAE/MSE for the more complex model is always no larger than that of the simpler model

(James Taylor) 7 / 15

Over-fitting Problem

That the more complex model is always favoured is disconcerting

With a finite amount of data, parameter estimates become less precise as the model becomes more complex

Forecasts based on imprecisely estimated parameters tend to be poor

(James Taylor) 8 / 15


Over-fitting Problem – Example

Consider the two trend specifications:

ŷ1t = a0 + a1 t

ŷ2t = b0 + b1 t + b2 t²

The linear model is nested within the quadratic model (for b2 = 0)

Estimate parameters by OLS

Claim: The MSE under the quadratic model is always no larger than the MSE under the linear model

(James Taylor) 9 / 15


Over-fitting Problem – Example

Define f : R³ → R by

f(c0, c1, c2) = (1/T) Σ_{t=1}^{T} (yt − c0 − c1 t − c2 t²)²

Generate (â0, â1) and (b̂0, b̂1, b̂2) by OLS. Then

(â0, â1) = argmin_{a0, a1} Σ_{t=1}^{T} (yt − ŷ1t)² = argmin_{a0, a1} f(a0, a1, 0)

(b̂0, b̂1, b̂2) = argmin_{b0, b1, b2} Σ_{t=1}^{T} (yt − ŷ2t)² = argmin_{b0, b1, b2} f(b0, b1, b2)

(James Taylor) 10 / 15

HSE yt

Fat

Over-fitting Problem – Example

Observation 1:

MSE1 = f(â0, â1, 0),  MSE2 = f(b̂0, b̂1, b̂2)

Observation 2: (b̂0, b̂1, b̂2) is the global minimiser of f. Therefore,

f(b̂0, b̂1, b̂2) ≤ f(â0, â1, 0)

So MSE2 ≤ MSE1
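The claim can be illustrated numerically (a sketch on simulated data, using NumPy's polynomial least-squares fit as the OLS step):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(1, 101)
y = 0.5 + 0.02 * t + rng.normal(0.0, 1.0, size=t.size)  # simulated series

def trend_mse(degree):
    # Fit a degree-d polynomial trend by OLS and return its in-sample MSE.
    coeffs = np.polyfit(t, y, degree)
    fitted = np.polyval(coeffs, t)
    return float(np.mean((y - fitted) ** 2))

mse1, mse2 = trend_mse(1), trend_mse(2)
# The quadratic model nests the linear one, so its in-sample MSE
# can never be larger.
assert mse2 <= mse1
print(mse1, mse2)
```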

(James Taylor) 11 / 15


Penalize Model Complexity

So it seems reasonable to penalize models with more parameters

If two models both fit the data similarly well, we would prefer to pick the simpler model (usually)

Reduces the impact of the over-fitting problem

(James Taylor) 12 / 15

AIC and BIC

Popular selection criteria that explicitly include model complexity are the Akaike information criterion and the Bayesian information criterion.

AIC = MSE × T + 2k

BIC = MSE × T + k log T

where T is the sample size and k is the number of parameters in the model

pick the model with the smallest AIC or BIC

they reward goodness-of-fit

but also penalize model complexity to reduce over-fitting
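The two criteria as defined above (a sketch; the MSE values and parameter counts are hypothetical, and log is taken as the natural logarithm):

```python
import math

def aic(mse, T, k):
    # AIC = MSE x T + 2k, as defined on the slide.
    return mse * T + 2 * k

def bic(mse, T, k):
    # BIC = MSE x T + k log T, as defined on the slide.
    return mse * T + k * math.log(T)

T = 100
# Simple model: worse fit, fewer parameters.
print(aic(0.50, T, 2), bic(0.50, T, 2))  # 54.0, ~59.21
# Complex model: slightly better fit, more parameters.
print(aic(0.49, T, 5), bic(0.49, T, 5))  # 59.0, ~72.03
# Both criteria prefer the simpler model despite its higher MSE.
```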

(James Taylor) 13 / 15


AIC and BIC

AIC = MSE × T + 2k

BIC = MSE × T + k log T

The penalty term in BIC is larger than in AIC (for T ≥ 8, since k log T > 2k once log T > 2, i.e. T > e² ≈ 7.4)

So BIC favours more parsimonious models

AIC and BIC will often agree on the best model in any case

(James Taylor) 14 / 15


Is There a Better Way?

We are not really interested in fitting the historical data

We are interested in choosing the model which produces the best forecasts

But we don’t know which model does best in forecasting ex ante

(James Taylor) 15 / 15

Model Selection 2 – Out-of-sample Forecasting

Australian National University

(James Taylor) 1 / 6

2.3

Is There a Better Way?

We are not really interested in fitting the historical data

We are interested in choosing the model which produces the best forecasts

But we don’t know which model does best in forecasting ex ante

(James Taylor) 2 / 6

Pseudo out-of-sample Forecasting

We will simulate the experience of a real-time forecaster

use a portion of the dataset, say from t = 1 to t = T0, to estimate the model parameters

make the forecast ŷT0+h|T0

compare this forecast with the observed value yT0+h

(James Taylor) 3 / 6


(James Taylor) 4 / 6

[Whiteboard sketch: recursive pseudo out-of-sample forecasting – estimate the parameters on an initial subsample, forecast ahead, compare the forecast with the realised value, then expand the sample and repeat; the aim is to make the forecast errors as small as we can.]

MAFE and MSFE

The Mean Absolute Forecast Error, or MAFE, is

MAFE = (1 / (T − h − T0 + 1)) Σ_{t=T0}^{T−h} |yt+h − ŷt+h|t|

The Mean Squared Forecast Error, or MSFE, is

MSFE = (1 / (T − h − T0 + 1)) Σ_{t=T0}^{T−h} (yt+h − ŷt+h|t)²

where yt+h is the realized value of y at time t + h, and ŷt+h|t is the h-step ahead point forecast given the information at time t.
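The recursive procedure and the MAFE/MSFE formulas can be sketched together (simulated data; the linear-trend model, T0 and h are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
T, T0, h = 80, 40, 4
t = np.arange(1, T + 1)
y = 1.0 + 0.05 * t + rng.normal(0.0, 0.5, size=T)  # simulated series

errors = []
for s in range(T0, T - h + 1):
    # Estimate on observations up to time s only, as in real time.
    coeffs = np.polyfit(t[:s], y[:s], 1)
    y_hat = np.polyval(coeffs, s + h)    # h-step ahead point forecast
    errors.append(y[s + h - 1] - y_hat)  # forecast error vs realised value

errors = np.array(errors)
mafe = np.mean(np.abs(errors))  # mean absolute forecast error
msfe = np.mean(errors ** 2)     # mean squared forecast error
print(len(errors), mafe, msfe)  # T - h - T0 + 1 = 37 forecast errors
```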

(James Taylor) 5 / 6


While MAE and MSE are in-sample goodness-of-fit measures, both MAFE and MSFE are out-of-sample performance measures.

This means that choosing the model with the smallest MAFE or MSFE does not run into the over-fitting problem

(James Taylor) 6 / 6


Model Selection Example

USA Gross Domestic Product

Australian National University

(James Taylor) 1 / 8

2.4

Example: U.S. GDP Forecast

Figure: US Seasonally adjusted real GDP from 1947Q1 to 2009Q1, from FRED

(James Taylor) 2 / 8


Model Specifications

There’s clearly a trend in the U.S. GDP data, and it seems to be faster than linear

Linear, quadratic and cubic trends:

ŷ1t = a0 + a1 t

ŷ2t = b0 + b1 t + b2 t²

ŷ3t = c0 + c1 t + c2 t² + c3 t³

We present the results here; the estimation details come later

(James Taylor) 3 / 8

Fitted Values of U.S GDP

(a) Linear Model (b) Quadratic Model (c) Cubic Model

Figure: Fitted value of U.S. GDP under various models

(James Taylor) 4 / 8

MSE

Table: MSE under various trend models

Linear Trend Quadratic Trend Cubic Trend

MSE 0.543 0.045 0.039

MSE is decreasing in model complexity, as expected

The linear model does not fit the data very well

The addition of the quadratic term decreases the MSE by over 90%

The addition of the cubic term only reduces the MSE by 13%

(James Taylor) 5 / 8


AIC and BIC

Table: AIC and BIC of the various models

Linear Trend Quadratic Trend Cubic Trend

# of parameters 2 3 4

AIC 140.88 17.39 17.95

BIC 147.94 27.98 32.07

Both AIC and BIC suggest the best model is the quadratic trend

We make a trade-off between goodness-of-fit and model complexity

(James Taylor) 6 / 8

[Annotation: linear vs quadratic – similar model complexity but terrible goodness-of-fit, so the quadratic is better; cubic vs quadratic – similar goodness-of-fit but higher model complexity, so the quadratic is again better.]

Pseudo Out-of-sample Forecasting

We will compute the MSFE for each of the specifications with h = 4 (one year ahead) and h = 20 (five years ahead)

The recursive forecasting exercises start from 1957Q1 (i.e. T0 = 40)

(James Taylor) 7 / 8


MSFE

Table: MSFE under various models

Linear Trend Quadratic Trend Cubic Trend

h = 4 0.746 0.073 0.079

h = 20 1.335 0.188 0.394

Linear Specification is terrible

Quadratic model forecasts the best

1-year forecasts are better than 5-year forecasts (as usual)

(James Taylor) 8 / 8


Linear Algebra Review

Australian National University

(James Taylor) 1 / 5

2.5

Linear Algebra Review

Matrix multiplication

Transposition

Determinants and Invertibility
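A quick sketch of the three operations using NumPy (the matrices are arbitrary examples):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[0.0, 1.0],
              [1.0, 0.0]])

product = A @ B             # matrix multiplication
transpose = A.T             # transposition
det = np.linalg.det(A)      # determinant: 1*4 - 2*3 = -2
inverse = np.linalg.inv(A)  # exists because det != 0

print(product)
assert round(float(det), 6) == -2.0
# A matrix times its inverse gives the identity.
assert np.allclose(A @ inverse, np.eye(2))
```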

(James Taylor) 2 / 5

Matrix Multiplication

(James Taylor) 3 / 5

Transposition

(James Taylor) 4 / 5

Determinants and Invertibility

(James Taylor) 5 / 5