STAT 443: Forecasting
Paul Marriott
7 September 2020
Contents

1 Introduction to forecasting, control and time series
  1.1 Introduction
  1.2 Examples
      1.2.1 Observed data examples
      1.2.2 Simple Mathematical models
      1.2.3 Advanced Mathematical models
  1.3 Forecasting, prediction and control problems
  1.4 Simple statistical tools

2 Regression methods and model building principles
  2.1 Introduction
  2.2 Statistical decision theory
  2.3 Linear regression models
      2.3.1 Computing in R
  2.4 Model complexity
      2.4.1 Bias-variance decomposition
      2.4.2 Subset selection
      2.4.3 Forward and backward step selection
      2.4.4 Computing in R
      2.4.5 Ridge regression
      2.4.6 The Lasso
      2.4.7 Computing in R
  2.5 Model assessment and selection
      2.5.1 Bias-variance decomposition
      2.5.2 Cross validation
  Appendices
      2.A Appendix: Multivariate normal distribution
      2.B Matrix differentiation review
      2.C Appendix: Review of Regression Theory

3 Forecasting Stationary processes
  3.1 Introduction
  3.2 Stationary processes
  3.3 Estimable but flexible models
  3.4 Best linear predictor
  3.5 Estimating the mean and auto-covariance functions
  3.6 Computing the auto-covariance function
  3.7 MA(q) processes
  3.8 The AR(p) process
  3.9 The ARMA(p,q) process
  3.10 Estimation of ARMA models
      3.10.1 Likelihood estimation
      3.10.2 Identification of ARMA models
  3.11 Using R for ARMA modelling
  3.12 MA(∞) processes
  Appendices
      3.A Partial Correlation
      3.B Other stationary processes
          3.B.1 Long memory processes
          3.B.2 Gaussian processes
      3.C Power Series Expansions

4 The Box-Jenkins approach to forecasting and control
  4.1 Introduction
  4.2 Non-stationary time series
      4.2.1 ARIMA modelling
  4.3 Forecasting ARIMA models
  4.4 Using R for ARIMA modelling
  4.5 SARIMA modelling
  4.6 Using R for SARIMA modelling
  4.7 The Box-Jenkins Approach
      4.7.1 Summary
      4.7.2 Tests for stationarity
      4.7.3 Criticisms of approach

5 Forecast Evaluation
  5.1 Introduction
  5.2 Evaluation of Forecasting Methods
  5.3 Simulation experiments
  5.4 Using simulation studies to evaluate statistical methods
  5.5 Evaluation by competition
      5.5.1 Makridakis Competitions
      5.5.2 Using R based ML packages
      5.5.3 Tuning parameters
      5.5.4 Economic models versus time series approaches
  5.6 Illusions of progress

6 Bayesian and state space methods
  6.1 Introduction
  6.2 Bayesian methods
      6.2.1 Bayesian methods in regression
      6.2.2 Link to penalty methods
  6.3 State space model
  6.4 Case Study
      6.4.1 Introduction
      6.4.2 Data
      6.4.3 The modelling approach
      6.4.4 Results
      6.4.5 Conclusions and further work
  Appendices
      6.A JAGS, BUGS and R

7 The Kalman Filter
  7.1 Introduction
      7.1.1 State space model
  7.2 State space representations of ARIMA models
  7.3 The Kalman filter
      7.3.1 Historical note
  7.4 The Kalman filter
  7.5 Algorithm
  7.6 The Kalman Filter and Bayesian Updating

8 ARCH and GARCH modelling
      8.0.1 Working with GARCH models in R
  8.1 Alternative State Space Models
      8.1.1 Fitting the Hidden Markov Model

9 Frequency domain methods
      9.0.1 Why the complex plane?
      9.0.2 Spectral density and discrete Fourier transforms
      9.0.3 The Periodogram
      9.0.4 Smoothing the periodogram
      9.0.5 Filtering in the spectral domain
1  Introduction to forecasting, control and time series
“There are two kinds of forecasters: those who don’t know, and those who don’t know they don’t know.” J. K. Galbraith
1.1 Introduction
This set of notes draws from a number of resources and details can be found in the bibliography. The material in Chapter 2 – on regression and model building for forecasting – is strongly influenced by the book Hastie et al. (2009, Ch. 1, 3 and 7). Another good general reference to regression modelling is Abraham and Ledolter (2006). The stationary process material in Chapter 3 and its application to Box-Jenkins is treated in Box and Jenkins (1976), Brockwell and Davis (2002) and Brockwell and Davis (2009). The second of these is perhaps most suitable for the novice. Another very good time series book is Shumway and Stoffer (2010).
1.2 Examples
We start with a large set of examples: first real-data examples and then examples of models.
1.2.1 Observed data examples
We start with examples of time series, a form of data where forecasting is often important. Issues illustrated include the types of structure that we commonly come across with this kind of data. In particular we note the following key points: (i) unlike examples from earlier statistics courses at Waterloo, the data does not come as independent realisations of a random variable; (ii) the data is not identically distributed, where both the mean and, potentially, the variance can vary in time; (iii) there can be abrupt changes in the data generation process.
Example 1.2.1. (Actuarial Science) Insurance companies offering accident insurance need to have forecasts of the number of payouts they are likely to have in a given period in order to set premiums. Of course they also need forecasts of the severity of the accidents, but for simplicity let's focus on just the number of cases.
Illustrated in Fig. 1.1 are the monthly numbers of deaths and serious injuries in road accidents in the UK and in Ontario respectively, (R Core Team, 2013). The time ranges are from January 1969 to December 1984 and from January 1960 to December 1974. For the UK data, the description of the dataset mentions that a seatbelt law was introduced in February 1983.
Figure 1.1: Number of deaths and serious injuries in road accidents per month (left: UK deaths and injuries; right: Ontario road deaths)
The plots in Fig. 1.1 are examples of time series plots.
Definition 1.2.2. (Time series plot) A time series plot is a scatterplot of the data with time on the x-axis in, typically, equally spaced intervals, and the observations on the y-axis. For visual clarity adjacent observations are connected by lines. The equally spaced intervals can be in units of years, months, days, hours, etc., and these units are called the period of the time series.
In Example 1.2.1 within each year there is a repeated pattern across the months of the year.
Definition 1.2.3. (Level, trend and seasonal component) Terms like level, trend and seasonality can only be defined precisely in terms of a modelling exercise. Here we give informal definitions for use in describing observed patterns in time series plots. The level is the local mean of the observations, around which we see random noise. When the level varies with time we say there is a trend. A seasonal effect is a systematic and calendar-related effect which repeats with a given period.
In the Ontario accident data we notice, on the yearly time scale after smoothing out the seasonal aspects, that there is a steady increase in the average number of accidents across the time period: the trend. A final feature of this data is the possibility of a change point, perhaps due to the change in the seatbelt law in the UK data.
Definition 1.2.4. (Change point) A change point is a time at which at least one of the following changes: the data generation process, the way that the data is measured, or the way that the observation is defined.
Definition 1.2.5. (Non-stationarity) Seasonality, trends, non-constant variance and change points are all examples of non-stationarity. Stationarity, defined formally in Chapter 3, informally means that the underlying random process does not change in time.
If we have a stationary process we would be able to assume that, at least statistically, the future is similar to the past; this means that we could perhaps use information from the past to make forecasts about the future. Exact stationarity is rare, and we need to be able to model the forms of non-stationarity seen in the data.
Example 1.2.6. (Ontario Gas prices) Figure 1.2 shows the monthly demand for gasoline in Ontario from 1960 to 1975, measured in gallons. We see strong seasonality in the data as well as a non-linear change in the level.

Figure 1.2: Monthly demand for gasoline in Ontario from 1960 to 1975

An underlying forecasting problem here might have been, for an analyst in 1975, to forecast gas usage in the upcoming year.
Example 1.2.7. (Airline passengers) In Fig. 1.3, which shows the monthly totals of international airline passengers from 1949 to 1960, (Box and Jenkins, 1976), we see a strong trend and a seasonal pattern. This pattern is a little different from that of Example 1.2.6, since the size of the periodic oscillations increases with time. We might model this by letting the periodic seasonal effect act multiplicatively on the time series, i.e. as a proportional change.
Figure 1.3: Monthly totals of international airline passengers, 1949 to 1960.
Example 1.2.8. (Internet network traffic) Figure 1.4 shows time series plots of the network traffic in the UK academic network backbone¹. Data was collected between 19 November 2004 and 27 January 2005, at daily and hourly time scales. In each time period the number of bits has been recorded. We see different aspects of the data on these different time scales.
Figure 1.4: Network traffic in the UK academic network at daily and hourly intervals.
Example 1.2.9. (Stock price) Figure 1.5 shows an example of daily stock prices, from the FTSE index in the period 1991-1998, (R Core Team, 2013). The right-hand plot shows the first difference of the log stock price, i.e. log(X_t/X_{t−1}) if X_t is the price series; this is a measure of the relative daily change in the price. We see that the variability of this series seems to change with time.
Figure 1.5: Daily FTSE stock price (left) and the first difference of the log price (right).
1 https://datamarket.com/
Example 1.2.10. (US unemployment) An example from economics is shown in Fig. 1.6, the monthly numbers of registered unemployed in the US, (R Core Team, 2013). We see periods where the level is low and periods where it is high. This is representative of the so-called business cycle, (Sherman and Kolk, 1996). This is a cycle of expansions and contractions in the economy that occur at about the same time in many economic activities. They are recurrent but not strictly periodic, with durations varying from one to twelve years.
Figure 1.6: Monthly US unemployment (thousands)
Example 1.2.11. (Denmark birth data) The data illustrated in Fig. 1.7 shows the number of births in Denmark for each month in the period 1900-1992. The data shows strong seasonality, a non-constant level and at least one change point, around 1919.
Figure 1.7: Monthly birth data in Denmark 1900-1992
1.2.2 Simple Mathematical models
Definition 1.2.12. (Markov chain) An example from probability theory of a discrete state time series is a Markov chain, a form of stochastic iteration. It is a sequence X_i of dependent random variables where the distribution of X_{n+1} depends on X_n and only on X_n. That means

    P(X_{n+1} | X_0 = x_0, …, X_n = x_n) = P(X_{n+1} | X_n = x_n),   (1.1)

the Markov property. As with any iterative scheme you need to define the initial conditions. The initial value X_0 can be chosen deterministically or it can be random with some chosen distribution.
If X_n takes values in the finite set S = {1, …, K} it is a discrete space Markov chain, and

    P{X_n = j | X_{n−1} = i} := p_ij

defines the K × K transition matrix P. The n-step probabilities p_n(i, j) are defined as follows:

    p_n(i, j) = P{X_n = j | X_0 = i} = P{X_{n+k} = j | X_k = i},

and it can easily be shown that the n-step transition probability is the (i, j)th entry of the matrix power P^n. For large n the n-step transition probabilities often converge to an equilibrium distribution which is independent of the starting state.
Example 1.2.13. (Simple Markov chain)

Figure 1.8: Realisation of a 4-state Markov chain (a) and its equilibrium distribution (b)
Figure 1.8 (a) shows a realisation of a 4-state Markov chain which has the transition matrix
        ( 0.9  0.1  0.0  0.0 )
    P = ( 0.0  0.8  0.1  0.0 )
        ( 0.0  0.0  0.9  0.1 )
        ( 0.1  0.0  0.0  0.9 )
and (b) shows its equilibrium distribution.
We might think about forecasting in Example 1.2.13 based on data. To make a short term forecast you need the current state, Xn, and an estimate of the transition matrix which might be based on all available data. Of course we are here making a modelling assumption that what we have seen is a realisation of a stationary Markov chain. A second, conceptual point, is to note that Fig. 1.8 (a) only shows a single realisation of the Markov chain although we see 1000 observations. We have to be careful with the concept of sample size in any situation where we have dependent data.
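To make the matrix-power idea concrete, here is a small numerical sketch in Python with NumPy (the notes otherwise use R). The cyclic 4-state transition matrix below is an assumption in the spirit of Example 1.2.13, chosen because its equilibrium distribution is easy to anticipate:

```python
import numpy as np

# A cyclic 4-state chain: stay with probability 0.9, move to the next
# state with probability 0.1 (assumed values for illustration).
P = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.9, 0.1, 0.0],
    [0.0, 0.0, 0.9, 0.1],
    [0.1, 0.0, 0.0, 0.9],
])

# The n-step transition probabilities are the entries of the matrix power P^n.
P100 = np.linalg.matrix_power(P, 100)

# For large n every row approaches the equilibrium distribution,
# independently of the starting state; by symmetry it is uniform here.
pi = P100[0]
print(np.round(pi, 3))  # [0.25 0.25 0.25 0.25]
```

Estimating P from data amounts to counting observed transitions out of each state; with an estimated matrix and the current state X_n in hand, the h-step forecast distribution is the corresponding row of the matrix raised to the power h.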
Example 1.2.14. (Random walk) Consider a continuous state space Markov chain. Let Z_t, t ∈ Z, be an i.i.d. sequence of random variables. The series defined by

    X_t := Σ_{i=1}^{t} Z_i,

for t = 1, 2, …, is called a random walk.
A related Markov process has the following definition.
Example 1.2.15. (Autoregressive model) Let Z_t, t ∈ Z, be an i.i.d. sequence of N(0, σ²) random variables. The series defined by

    X_t = φ X_{t−1} + Z_t

for all t ∈ Z, is called a first order autoregressive (AR(1)) process.
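Both processes are simple to simulate from their definitions; a sketch in Python with NumPy (φ = 0.7 and σ = 1 are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(443)
n, phi, sigma = 500, 0.7, 1.0
z = rng.normal(0.0, sigma, size=n)

# Random walk: X_t = Z_1 + ... + Z_t, a cumulative sum of the noise.
random_walk = np.cumsum(z)

# AR(1): X_t = phi * X_{t-1} + Z_t, iterated here from X_0 = 0.
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + z[t]

# With |phi| < 1 the AR(1) process is stationary with variance
# sigma^2 / (1 - phi^2); the random walk's variance grows linearly in t.
```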
Not all time series have to be observed at discrete time points. If the time period of observation is arbitrary then we talk about continuous-time stochastic processes.
Example 1.2.16. (Non-constant variance process) If we want models where the underlying variance of the process changes with time then one possibility is to use an AutoRegressive Conditional Heteroscedasticity (ARCH) model. We define the model hierarchically: first define X_t = σ_t Z_t where Z_t ∼ N(0, 1) i.i.d., but treat σ_t as random such that

    σ_t² = α_0 + α_1 X_{t−1}² + ··· + α_q X_{t−q}².

So the variance is time dependent – a large value of X_t will result in a period of high volatility. We illustrate an example of a realisation in Fig. 1.9.
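From the hierarchical definition a realisation is easy to generate. A sketch of an ARCH(1) simulation in Python with NumPy (the coefficients α_0 = 0.5, α_1 = 0.6 are assumed values for illustration, not those behind Fig. 1.9):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
a0, a1 = 0.5, 0.6  # alpha_0, alpha_1: assumed ARCH(1) coefficients

x = np.zeros(n)
sigma2 = np.zeros(n)
for t in range(1, n):
    sigma2[t] = a0 + a1 * x[t - 1] ** 2       # conditional variance sigma_t^2
    x[t] = np.sqrt(sigma2[t]) * rng.normal()  # X_t = sigma_t * Z_t

# A large |X_{t-1}| inflates sigma_t^2, producing the volatility
# clustering visible in realisations such as Fig. 1.9.
```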
1.2.3 Advanced Mathematical models
Figure 1.9: Realisation of an ARCH process

Definition 1.2.17. (Stochastic process) A stochastic process is a family of random variables {X_t, t ∈ T} defined on a probability space (Ω, F, P), i.e. a sample space, a set of events and a probability function. The index set T could be a subset of the reals or something more general. The functions {X_(·)(ω), ω ∈ Ω} on T are known as the realisations or sample paths of the process {X_t, t ∈ T}.
When the index set T is not finite we cannot just write down a joint distribution to define the behaviour of the random variables. The following theorem shows what we can do when we look at finite subsets of T.
Theorem 1.2.18. (Kolmogorov's Existence Theorem) The probability distribution functions F_t(·) are the distribution functions of some stochastic process if and only if for any n ∈ {1, 2, …}, t = (t_1, …, t_n) ∈ T^n and 1 ≤ i ≤ n,

    lim_{x_i → ∞} F_t(x) = F_{t(i)}(x(i)),

where t(i) and x(i) are the (n − 1)-component vectors obtained by deleting the ith components of t and x respectively.
Example 1.2.19. (Gaussian process) Let T = R; then {X_t} is a Gaussian process if any finite subset {X_{t_1}, …, X_{t_n}} has an n-dimensional multivariate normal distribution. This model is completely determined when the mean and variance-covariance structures are known.
Figure 1.10 shows three realisations from three different one-dimensional Gaussian processes. They all share the fact that the realisation gives a continuous graph, but they differ in the 'smoothness' of the realisation. This smoothness is controlled by a single parameter ν, and the amount of 'smoothness' increases as we go from left to right.
Figure 1.10: Three realisations of one-dimensional Gaussian processes (ν = 0.5, 0.75 and 1)

Example 1.2.20. (Brownian motion) A Brownian motion or Wiener process is a continuous-time stochastic process; it is the unique process W_t which satisfies the following:

1. W_0 = 0,
2. The function t → W_t is almost surely everywhere continuous,

3. The increments W_{t_1} − W_{s_1}, W_{t_2} − W_{s_2} are independent when 0 ≤ s_1 < t_1 ≤ s_2 < t_2, and each increment W_t − W_s has a N(0, t − s) distribution.
Definition 1.4.4. (Simple moving average) For Model (1.4.3) we can estimate m_t with a moving average filter. If the period is odd, d = 2q + 1, then the smoother is defined on the realisation x_t by

    m̂_t = (x_{t−q} + x_{t−q+1} + ··· + x_{t+q−1} + x_{t+q}) / d.
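A direct implementation of this smoother can be sketched as follows (Python with NumPy rather than the R used elsewhere in these notes; this handles only the odd-period case of the definition – for an even period such as d = 12 a modified, end-weighted window is normally used):

```python
import numpy as np

def moving_average(x, d):
    """Centred moving-average filter for odd window length d = 2q + 1.

    Returns an array the same length as x, with NaN at the ends where
    the symmetric window does not fit."""
    q = d // 2
    m = np.full(len(x), np.nan)
    for t in range(q, len(x) - q):
        m[t] = x[t - q : t + q + 1].mean()
    return m

x = np.arange(10.0)        # a pure linear "trend" with no noise
m = moving_average(x, 5)   # recovers the trend away from the ends
```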
Figure 1.14: Decomposition of additive time series: Ontario gas demand (gallons)

Figure 1.15: Decomposition of additive time series: monthly birth data in Denmark 1900-1992
Let's think about applying this to the Ontario gas example shown in Fig. 1.14.
We start, because it is monthly data, with d = 12; the trend in the figure is the result of applying the moving average to the observed data. Note that since

    Σ_{t=1}^{12} s_t = 0,

the seasonal component will not be part of this estimate: the smoother has completely removed it. Hence if we subtract the estimated trend from the observed data we will be left with the seasonal effect plus the noise.
Averaging will reduce noise, so we can then estimate the seasonal components by computing 12 averages ŵ_Month: averaging just the January dates, then averaging the February dates, and so on. It is then standard to normalise to get

    ŝ_Month = ŵ_Month − (1/12) Σ_{m=1}^{12} ŵ_m.
Having estimated the trend and the seasonal effects, we subtract them to leave the noise.
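The monthly averaging and sum-to-zero normalisation just described can be sketched as follows (Python with NumPy; the toy series, its known trend and the short period p = 4 are assumptions so that the answer can be checked exactly):

```python
import numpy as np

def seasonal_component(x, trend, period):
    """Average the detrended series within each season, then normalise
    the effects so they sum to zero over one period."""
    detrended = x - trend                                # seasonal effect + noise
    w = np.array([detrended[m::period].mean() for m in range(period)])
    return w - w.mean()                                  # subtract (1/p) * sum of w

# Toy check: a linear trend plus a known seasonal pattern, no noise.
period = 4
true_s = np.array([2.0, -1.0, 0.5, -1.5])                # sums to zero
t = np.arange(40)
x = 0.1 * t + np.tile(true_s, 10)
s_hat = seasonal_component(x, 0.1 * t, period)           # recovers true_s exactly
```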
We can see this used in our examples:
Example (1.2.11 revisited). If we look at the Danish birth data we can estimate the level, m_t, using a 12-point moving average filter. Figure 1.16 shows both the raw data (black) and the estimated level m̂_t, which smooths out the seasonal component (red).
Figure 1.16: Danish birth data: raw data (black) and estimated level (red)
We note that the change point around 1919 has not been well estimated by the filter. Rather than a discontinuity we have estimated a linear slope in the year containing the discontinuity. The key point here is that smoothing methods can over-smooth real features in the data. This can also be seen in the random component, where there is an increase in the variance around the discontinuity. We also see other regions where the variance is inflated.
The moving average filter of Definition 1.4.4 is not the only way of filtering the data. One disadvantage of its symmetric form is that it can only estimate m_t in the middle of the data – as can be seen from Fig. 1.16. This means it cannot be used directly for forecasting. There are other filtering methods which overcome this. We look at a useful method and a related, more general, one: exponential smoothing and Holt-Winters filtering.
Definition 1.4.5. (Exponential smoothing) Exponential smoothing can be used to estimate the level m_t of a time series when the process is assumed to be slowly varying. It should not be used when there is a trend or seasonality. Assume that we observe x_1, …, x_n. We select an initial value m_0, typically x_1, and then update the estimate recursively via the update equation

    m_{t+1} = α x_{t+1} + (1 − α) m_t = m_t + α(x_{t+1} − m_t).

The second form of the update is called the error-correcting version, and

    e_{t+1} := x_{t+1} − m_t

is the one-step-ahead forecast error. The tuning parameter α needs to be selected, and this can be done by finding the α which minimises

    Σ_{t=1}^{n−1} (x_{t+1} − m_t(α))².
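The recursion and the selection of α can be sketched as follows (Python with NumPy; the simulated series and the grid search over α are illustrative assumptions – R's HoltWinters, used later, performs a numerical optimisation instead):

```python
import numpy as np

def exp_smooth(x, alpha):
    """Exponential smoothing: returns the level estimates m_t and the
    sum of squared one-step-ahead forecast errors."""
    m = x[0]                       # initial value m_0, taken to be x_1
    levels, sse = [m], 0.0
    for xt in x[1:]:
        sse += (xt - m) ** 2       # e_{t+1} = x_{t+1} - m_t
        m = m + alpha * (xt - m)   # error-correcting form of the update
        levels.append(m)
    return np.array(levels), sse

# Select alpha by minimising the sum of squared errors over a grid.
rng = np.random.default_rng(0)
x = 10 + 0.05 * rng.normal(size=200).cumsum()   # a slowly varying level
alphas = np.linspace(0.01, 0.99, 99)
best = min(alphas, key=lambda a: exp_smooth(x, a)[1])
```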
Definition 1.4.6. (Holt-Winters filtering) The Holt-Winters method generalises exponential smoothing to the case where there is a trend and seasonality. We have three terms which depend on time t: the first is a level a_t, the second is the trend b_t and the third is the seasonal component s_t. The terms a_t, b_t are considered slowly varying, and the mean of the time series at time t + h is given by

    m_{t+h} = a_t + b_t h + s_{t+h}.
The Holt-Winters prediction function for h time periods ahead of the current time t is x̂_{t+h} = a_t + b_t h + s_{t+h}. The one-step error term for this prediction is then defined as

    e_t := x_t − (a_{t−1} + b_{t−1} + s_{t−p}).

Starting with initial values for a, b and s, we then update the parameters according to the size of the error, over the range of t, using

    a_t = a_{t−1} + b_{t−1} + α e_t,
    b_t = b_{t−1} + α β e_t,
    s_t = s_{t−p} + γ e_t.
The method uses three tuning parameters α, β and γ and these are selected by minimising the sum of squared errors.
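One pass of these update equations can be sketched as follows (Python with NumPy; representing the seasonal terms s_{t−p}, …, s_{t−1} as a rotating buffer, and the toy run below, are implementation assumptions – this is not the exact initialisation scheme of R's HoltWinters):

```python
import numpy as np

def holt_winters_step(a, b, s, x_t, alpha, beta, gamma):
    """One Holt-Winters update; s is a length-p buffer whose first entry
    plays the role of s_{t-p}, the oldest seasonal effect."""
    e = x_t - (a + b + s[0])        # one-step forecast error e_t
    a_new = a + b + alpha * e       # level:    a_t = a_{t-1} + b_{t-1} + alpha e_t
    b_new = b + alpha * beta * e    # trend:    b_t = b_{t-1} + alpha beta e_t
    s_new = s[0] + gamma * e        # seasonal: s_t = s_{t-p} + gamma e_t
    return a_new, b_new, np.append(s[1:], s_new), e

# A short run with p = 4, all terms initialised at zero.
a, b, s = 0.0, 0.0, np.zeros(4)
for x_t in [1.0, 2.0, 3.0]:
    a, b, s, e = holt_winters_step(a, b, s, x_t, 0.5, 0.1, 0.1)
```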
Example (1.2.11 revisited). In R we can apply the Holt-Winters method to the Denmark data, and predict using the result, with the functions
> birth.hw <- HoltWinters(birth)
> birth.hw.predict <- predict(birth.hw, n.ahead=12*8)
and we can see the fit (red) and the forecast (blue) for the next 8 years in Fig. 1.17. The prediction models the seasonality and the prediction is based on a local linear approach. This is a strength for short term forecasting but, as the figure shows, may be unrealistic for longer term forecasts.
Figure 1.17: Monthly birth data in Denmark 1900-1992: fitted (red) and forecast (blue)
2  Regression methods and model building principles
2.1 Introduction
Much of this chapter follows the approach of Hastie et al. (2009, Ch. 1, 3 and 7) and there is a great deal of very interesting further reading in that book. We focus here mostly on regression models but the ideas of this chapter have a much wider applicability in statistical classification and machine learning. The course STAT 441 Statistical Learning - Classification develops many of these themes.
2.2 Statistical decision theory
Let X ∈ R^p be a vector-valued random variable and Y ∈ R a random variable, where we are trying to predict Y given the information available in X. We are therefore looking for a function f : R^p → R which has 'nice' properties.
Definition 2.2.1. (Loss function) A loss function L(Y, f(X)) is a real-valued function which measures the error in estimating Y with the estimate f(X). Typically we assume L(y, f(x)) ≥ 0 with equality if and only if y = f(x). Since L(Y, f(X)) is a random variable, we may also consider its expected value, E_{X,Y}[L(Y, f(X))], often called the risk. Here the expectation is over the joint distribution of X, Y.
Example 2.2.2. (Quadratic loss) A commonly used example is the squared error, or quadratic, loss

    L_SE(Y, f(X)) = (Y − f(X))².

The expected value is called the mean squared error (MSE)

    MSE(f) := E_{X,Y}(L_SE(Y, f(X))) = E_{X,Y}[(Y − f(X))²].   (2.1)
Theorem 2.2.3. (Conditional expectation) If X and Y are two random variables with E(Y) = μ and Var(Y) < ∞, then the function f which minimises MSE(f) is given by the conditional expectation

    f(X) = E_{Y|X}(Y | X).

Proof. We can break the proof down into steps:
1. Show that the constant c that minimises E_Y[(Y − c)²] is c = μ. This follows by differentiating:

    d/dc E_Y[(Y − c)²] = E_Y[−2(Y − c)].

Setting this to zero gives the best choice

    c = E(Y) = μ,

the unconditional mean.
2. We now look at minimising E_Y[(Y − c(x))² | X = x] for any value of x. Here, due to the conditioning, c(x) can be treated as a constant, so the same argument as in Step 1 gives

    c(x) = E(Y | X = x),

the conditional mean. Hence the best forecast is

    f(x) := c(x) = E(Y | X = x).
3. Now we need to work at the level of random variables. Recall that an inequality between two random functions Z := z(X) and W := w(X), written Z ≥ W, is defined to hold if and only if z(x) ≥ w(x) for all realisations x. Then the argument in Step 2 says, for all realisations x and any forecast g(·),

    E_Y((Y − g(X))² | X = x) ≥ E_Y((Y − E(Y | X = x))² | X = x),

so we have the random function inequality

    E_Y((Y − g(X))² | X) ≥ E_Y((Y − f(X))² | X).
4. Finally we can use standard properties of expectation to get

    E_X[E_Y((Y − g(X))² | X)] ≥ E_X[E_Y((Y − f(X))² | X)].

Then using the law of iterated expectations gives

    E_{Y,X}[(Y − g(X))²] ≥ E_{Y,X}[(Y − f(X))²],

i.e. for any predictor g(·) we have MSE(g) ≥ MSE(f) when f(X) = E(Y | X).
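A quick Monte Carlo illustration of the theorem (Python with NumPy; the toy joint model Y = X² + ε, with ε ∼ N(0, 1) independent of X, is an assumption chosen so that E(Y | X) = X² is known in closed form):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

x = rng.normal(size=n)
y = x**2 + rng.normal(size=n)         # E(Y | X) = X^2, noise variance 1

mse_cond = np.mean((y - x**2) ** 2)   # forecast with f(X) = E(Y | X)
mse_const = np.mean((y - 1.0) ** 2)   # forecast with the best constant, E(Y) = 1

# mse_cond is close to the noise variance 1; mse_const is close to
# Var(Y) = Var(X^2) + 1 = 3, so the conditional mean clearly wins.
```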
Theorem 2.2.4. (Absolute loss) We can also define a loss function in terms of absolute values, i.e.

    L_abs(Y, f(X)) := |Y − f(X)|.

The function which minimises E(L_abs(Y, f(X))) is

    f(x) = Median(Y | X = x).
Proof. (Outline) Similar to Theorem 2.2.3. If f_Y(y) is the density of Y, we write

    E[L_abs(Y, c)] = ∫_{−∞}^{∞} |y − c| f_Y(y) dy = ∫_{−∞}^{c} (c − y) f_Y(y) dy + ∫_{c}^{∞} (y − c) f_Y(y) dy.

Differentiating with respect to c and setting to zero shows that the turning point c satisfies F_Y(c) = 0.5, i.e. the median is the turning point.
Definition 2.2.5. (Zero-one loss) If Y is a discrete, or categorical, random variable with sample space S, then a commonly used loss function is the indicator, or zero-one, loss

    L_01(Y, f(X)) = 0 if Y = f(X),  1 if Y ≠ f(X),

and its expected value is

    E(L_01(Y, f(X))) = P(Y ≠ f(X)).
Theorem 2.2.6. (Bayes classifier) The function which minimises E(L_01(Y, f(X))), where Y has sample space S, is given by

    f(x) = argmax_{y∈S} P(Y = y | X = x),

the so-called Bayes classifier.
2.3 Linear regression models
Let us start by considering an example where linear regression is often used for prediction and the quadratic loss function is a natural one.
Example 2.3.1. (House price example) The market value of a house is of interest to both buyers and sellers. It should be a function of a number of features of the house, and a multiple linear regression model can be used to estimate this function. In order to calibrate the model a data set has been constructed using the following variables.
    X1  Current taxes
    X2  Number of bathrooms
    X3  Lot size
    X4  Living space
    X5  Number of parking spaces
    X6  Number of rooms
    X7  Number of bedrooms
    X8  Age of house
    X9  Number of fireplaces
    Y   Actual sale price
From Theorem 2.2.3 we see that the best forecast of Y given the information in X, in terms of MSE, is given by E(Y | X). Computing a conditional expectation requires some knowledge of the joint distribution. The linear regression model makes some simplifying assumptions: firstly, if X = (X_1, …, X_p) then we assume

    f(X; β) = E(Y | X) = β_0 + Σ_{i=1}^{p} β_i X_i.   (2.2)
In detail, defining the design matrix X and the response and error vectors y and ε respectively as

        ( x_1^T )   ( 1  x_11  x_12  ...  x_1p )
    X = ( x_2^T ) = ( 1  x_21  x_22  ...  x_2p ),
        (  ...  )   (  .   .    .          .   )
        ( x_n^T )   ( 1  x_n1  x_n2  ...  x_np )

    y = (Y_1, Y_2, ..., Y_n)^T   and   ε = (ε_1, ε_2, ..., ε_n)^T,

we write y = Xβ + ε.

Definition 2.3.2. (Residual sum of squares) To estimate β given a training set {(y_i, x_i) | i = 1, …, N}, where x_i = (x_i1, …, x_ip), it is common to minimise the residual sum of squares
    RSS(β) := Σ_{i=1}^{N} ( y_i − β_0 − Σ_{j=1}^{p} β_j x_ij )².

The ordinary least squares estimate β̂ is defined as argmin_β RSS(β).
For notational convenience define X to be the N × (p + 1) matrix where each row corresponds to one of the elements in the training set, and we define xi0 = 1 for all i, and y = (y1,...,yN)T. With this notation we have
    RSS(β) = (y − Xβ)^T (y − Xβ).   (2.3)

Theorem 2.3.3. (OLS estimation) The ordinary least squares estimate is given by

    β̂_OLS = (X^T X)^{−1} X^T y,

when (X^T X)^{−1} exists.
Proof. Using matrix differentiation rules, see Appendix 2.B, we get

    ∂RSS(β)/∂β = −2X^T (y − Xβ)   (2.4)

and

    ∂²RSS(β)/∂β∂β^T = 2X^T X.

Setting Equation 2.4 to zero and solving for β gives the result.

Definition 2.3.4. (Hat matrix) We define the, so-called, hat matrix as
    H := X(X^T X)^{−1} X^T,

which is an orthogonal projection matrix since H² = H and H^T = H. The fitted values are the projection of the observed values y in the training set onto the linear subspace spanned by the columns of X, i.e.

    ŷ := Hy.

The estimated residuals are the corresponding orthogonal complement

    ε̂ := y − ŷ = (I − H) y.
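The formula of Theorem 2.3.3 and the projection properties of H are straightforward to verify numerically; a sketch in Python with NumPy (the simulated design matrix and coefficients are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # first column all ones
beta = np.array([15.0, 2.0, -0.5, 1.0])                     # assumed true coefficients
y = X @ beta + rng.normal(0.0, 1.0, size=n)

# OLS estimate: solve the normal equations (X^T X) b = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix H = X (X^T X)^{-1} X^T, fitted values and residuals.
H = X @ np.linalg.solve(X.T @ X, X.T)
fitted = H @ y
resid = y - fitted

# H is idempotent (H^2 = H) and the residuals are orthogonal to
# the column space of X, as in Definition 2.3.4.
```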
We shall now assume that we treat the explanatory variables X as fixed – or alternatively think of conditioning on the observed values of the explanatory variates. Further, assume that Var(Y_i) = σ² and that Y_i and Y_j are uncorrelated.
Theorem 2.3.5. (Variance of OLS) The variance-covariance matrix of the sampling distribution of β̂_OLS is given by

    Var(β̂_OLS) = (X^T X)^{−1} σ².
Proof. From Theorem 2.3.3 we have that β̂_OLS is a linear function of y, which has variance-covariance matrix

    Var(y) = σ² I_{N×N};

hence we have the result from standard properties of covariances.
For more details of the tools available with regression see Appendix 2.C.
Example (2.3.1 revisited). In R we can fit the full model using OLS and get the following output.
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.31044    5.96093   2.568   0.0223 *
X1           1.95413    1.03833   1.882   0.0808 .
X2           6.84552    4.33529   1.579   0.1367
X3           0.13761    0.49436   0.278   0.7848
X4           2.78143    4.39482   0.633   0.5370
X5           2.05076    1.38457   1.481   0.1607
X6          -0.55590    2.39791  -0.232   0.8200
X7          -1.24516    3.42293  -0.364   0.7215
X8          -0.03800    0.06726  -0.565   0.5810
X9           1.70446    1.95317   0.873   0.3976
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.973 on 14 degrees of freedom
Multiple R-squared: 0.8512, Adjusted R-squared: 0.7555
F-statistic: 8.898 on 9 and 14 DF, p-value: 0.0002015
We see that most of the terms individually are 'not significant', but the whole model has R² = 0.8512 and an overall p-value of 0.0002015, so it appears to explain a good deal of the variation in Y and hence might be a useful predictive model.
Individually the estimated βi values are a bit puzzling, for example the coefficient of X6, the number of rooms, is negative: intuitively we might feel that the more rooms a house has the higher the price. Of course the estimate is not ‘statistically significantly negative’ and maybe it could be set to zero. We return to this in the following section.
In Theorems 2.3.3 and 2.C.4, both to define the OLS estimates and to understand their sampling properties, we need the condition that (X^T X)^{−1} exists, i.e. that (X^T X) is non-singular. One of the hardest parts of using regression models is the case where (X^T X) is close to being singular.

Definition 2.3.6. (Multicollinearity) If two or more of the columns of X are highly correlated we say that we have multicollinearity.
In in terms of prediction we have that the correlated explanatory vari- ables are essentially are sharing the same information about Y . Having highly correlated explanatory variables does not add much to predictive power but is statistically expensive since there are more parameters to be estimated. We see in later sections that the bias is not reduced by much but the variance is increased, see §2.4.1. In particular the term νj in Theorem 2.C.4 can very large meaning we can not make very precise statements about the corresponding parameter.
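As a sketch of why multicollinearity is statistically expensive, the following simulated example (variable names are illustrative, not the house data) fits the same response twice: once with a nearly collinear companion covariate, and once with an independent one.

```r
## Sketch: multicollinearity inflates the variance of coefficient estimates.
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # nearly collinear with x1
x3 <- rnorm(n)                   # independent of x1
y  <- 2 * x1 + rnorm(n)

## Standard error of the x1 coefficient with each companion covariate:
se.collinear   <- summary(lm(y ~ x1 + x2))$coefficients["x1", "Std. Error"]
se.independent <- summary(lm(y ~ x1 + x3))$coefficients["x1", "Std. Error"]
## se.collinear is an order of magnitude larger than se.independent:
## the correlated pair share the same information about y.
```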
Example (2.3.1 revisited). Going back to the house price example we have 9 explanatory variates and the intercept term. Figure 2.1 shows a pairwise scatterplot of the 9 explanatory variables and we see that there are correlations. For example X3 (lot size) and X4 (living space) are correlated – as might be expected – and so share some of the same predictive power.
Figure 2.1: Pairwise scatterplot of the covariates in the house price example.
In the case of multiple linear regression suppose we want to predict the value of Y_0 for a new explanatory vector x_0 when we have a training set T available from which we can estimate β̂ = β̂(T).
Theorem 2.3.7. (Prediction interval in regression) Assume that (Y_0, x_0) is independent of the training set T = {(Y_1, x_1), . . . , (Y_N , x_N )} and that we have a correctly specified model.
Define the point forecast as

μ̂ := x_0^T β̂,

i.e. the estimated value of the conditional expectation. Then a (1 − α) prediction interval is

μ̂ ± c_α σ̂ √(1 + x_0^T (X^T X)^{-1} x_0),

where c_α is the (1 − α/2)-quantile of the t_{N−p−1} distribution and σ̂ is defined by Definition 2.C.3.

Proof. (Sketch) By independence of the prediction and training set, the distribution of Y_0 and the distribution of β̂, we can show that

Y_0 − x_0^T β̂ ∼ N(0, σ²(1 + x_0^T (X^T X)^{-1} x_0)).

Since σ² has to be estimated we define

T := (Y_0 − x_0^T β̂) / (σ̂ √(1 + x_0^T (X^T X)^{-1} x_0)),

which has a t_{N−p−1} distribution when σ̂ is the sample estimate of σ. This means, for all parameter values, we have

P(−c_α ≤ T ≤ c_α) = 1 − α,

or, by rearranging, Y_0 lies with probability 1 − α in the interval

x_0^T β̂ ± c_α σ̂ √(1 + x_0^T (X^T X)^{-1} x_0).
Example 2.3.8. (Prediction in regression) The following simple simulated example shows both how forecasting by prediction works and illustrates a major weakness of this, and other, forecasting methods. Consider Fig. 2.2: panel (a) shows the data (black dots), a fitted linear model (solid red line) and prediction intervals (solid blue) for 'new' values of the covariate of 6, 12 and 18.
Panels (b) and (c) show information about the residuals, and it looks like the linear model fits very well. In fact, the true model here was non-linear (dashed red line), so the prediction intervals for x = 12 and 18 would be very poor.
The lesson to learn from Example 2.3.8 is that making a prediction where the new covariate values are far from the values used to fit the data is a very risky thing to do. The goodness-of-fit statistics only tell us that the model fits the training data, and very little about how the model behaves far from the training data.
Figure 2.2: Prediction with regression. Panel (a): simple prediction; panel (b): fitted values and residuals; panel (c): distribution of residuals (normal QQ-plot).

2.3.1 Computing in R
In R the key function for regression is lm( ). For example in Example 2.3.1 we fit the model using

> house.fit <- lm(Y ~X1+X2+X3+X4+X5+X6+X7+X8+X9, data=house.price)

The lm-object then contains all the output needed for analysis. Functions for which it is an argument include the following:
> summary(house.fit)
> plot(house.fit)
> house.fit$fitted
> house.fit$residuals
The last two of these are very helpful for looking at goodness-of-fit residual analysis plots. When we want to use the lm( ) function for prediction we
can use the predict( ) or predict.lm( ) function. For example, suppose
we look at the simple model which regresses house price Y against X1 and
X2. We fit the model using
> house.fit1 <- lm(Y ~ X1+X2, data=house.price)
> new <- data.frame(X1=c(6.00), X2= c(1.5) )
> predict.lm(house.fit1, new, interval="prediction")
       fit      lwr      upr
1 35.56195 29.05033 42.07357

Here fit is our point estimate of the price, and lwr and upr define a 95% prediction interval through the lower and upper values respectively.
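We can check that predict.lm( ) agrees with the formula of Theorem 2.3.7 by computing the interval directly. A sketch using the built-in cars data (since the house data is not reproduced here):

```r
## Sketch: a 95% prediction interval computed directly from Theorem 2.3.7.
fit <- lm(dist ~ speed, data = cars)
X   <- model.matrix(fit)                  # N x (p+1) design matrix
x0  <- c(1, 21)                           # intercept and a new speed of 21
N   <- nrow(X); p <- ncol(X) - 1
sigma.hat <- summary(fit)$sigma           # the estimate of Definition 2.C.3
c.alpha   <- qt(0.975, df = N - p - 1)    # c_alpha for alpha = 0.05
mu.hat    <- sum(x0 * coef(fit))          # point forecast x0^T beta-hat
half <- c.alpha * sigma.hat *
  sqrt(1 + t(x0) %*% solve(t(X) %*% X) %*% x0)
c(lwr = mu.hat - half, upr = mu.hat + half)
## The same interval is returned by
## predict(fit, data.frame(speed = 21), interval = "prediction").
```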
2.4 Model complexity
The theory of linear models in the previous section, for example Theorems 2.3.3 and 2.C.4, implicitly assumes that we know the form of the regression model. That is, Equation (2.2) holds for a known set of p explanatory variables. We can use methods based on residuals for model checking – for
example Fig. 2.2(b, c) – but there can be many models which have ‘good enough’ goodness-of-fit to be considered plausible. Which one to use? We shall see that we do not use the model which has the best fit.
2.4.1 Bias-variance decomposition
Let us consider the multiple regression model with a large number of covariates and how the fitted model might be used for prediction. Naively, one might think that the more covariates there are, the more information is available, so the better the forecast. Multicollinearity, Def. 2.3.6, is one reason why this is not true. Adding covariates which are strongly correlated with ones already included typically reduces the model's prediction power.
There are, in fact, even stronger reasons why over-complex models should be avoided. The problem is called over-fitting: the fitted model is a very good description of the training data – the sample – but not such a good description of the underlying population. Over-fitted models have poor power to generalise away from the training data.
Definition 2.4.1. (Bias-variance decomposition) Let T be the training set used to estimate β̂ = β̂(T); we now want to estimate the forecasting error for a new set of covariates, x_0. Let the forecast be

ŷ_0 := x_0^T β̂.

If y_0 is the actual value we are trying to forecast, then using quadratic loss (Example 2.2.2) we evaluate the forecast using

MSE(x_0) := E_T[(y_0 − ŷ_0)²]
          = E_T[(ŷ_0 − E_T(ŷ_0) + E_T(ŷ_0) − y_0)²]
          = (y_0 − E_T(ŷ_0))² + E_T[(ŷ_0 − E_T(ŷ_0))²]
          = Bias²(ŷ_0) + Var_T(ŷ_0).
So when we are evaluating the MSE of a prediction we typically are trying to balance two different objectives: the bias of the forecast and the variance. In previous statistics courses we have often chosen estimates with zero bias; for example Def. 2.C.3 defines an unbiased estimate of variance and Theorem 2.C.4 shows that the OLS estimate of β is unbiased for a correctly specified model. However Hastie et al. (2009, §3.3) criticises the OLS method for prediction in models with many covariates, since although the predictions have small bias they can have large variance. Prediction accuracy can be improved by shrinking or setting some of the coefficients of β̂ to zero. This increases bias – since we are changing an unbiased estimate – but reduces variability.
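A minimal simulated sketch of this trade-off: shrinking an unbiased estimate towards zero introduces bias but reduces variance, and can lower the MSE (the numbers here are illustrative, not from the house example).

```r
## Sketch: a biased, shrunken estimate can beat the unbiased one in MSE.
set.seed(2)
mu <- 0.5; n <- 10; shrink <- 0.8
est <- replicate(10000, {
  xbar <- mean(rnorm(n, mean = mu))       # unbiased estimate of mu
  c(unbiased = xbar, shrunk = shrink * xbar)
})
rowMeans((est - mu)^2)   # MSE: here the shrunken estimate has the smaller
                         # value, because its variance drop beats its bias
```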
Figure 2.3 is an attempt to give a geometric view of the trade-off between bias and variance. Suppose we chose a parameterised model space – say linear regression with independent normal errors and a fixed set of explanatory variates – the 'true' model probably does not lie in the space but hopefully is close to it. We can think of the error caused by the true model being outside the space as model error or bias. Since the model space involves a set of unknown parameters, their estimation involves uncertainty, which is variability. We can reduce the bias by making the model space more complex – say by adding new covariates or allowing dependence in the model – but this will mean there are more parameters to estimate, so more variability.
Figure 2.3: Bias-variance trade-off. The bias (approximation error) is inf over the model space of (f(x) − f̂(x))², the distance from the true f(x) to the model space; the estimation error is the distance from this best approximation f*(x) to the fitted model.

2.4.2 Subset selection
One approach to improving the predictive performance is to use only a subset of the explanatory variables. This may reduce the multi-collinearity and reduce the variance of the estimator, at the cost of some bias. The question then is how to select the subset?
Definition 2.4.2. (Best subset regression) Best subset regression finds for each k ∈ {0, 1, · · · , p}, the subset which has the smallest residual sum of squares, see Def. 2.3.2.
Example (2.3.1 revisited). Going back to the house price example we have a maximum of 9 explanatory variates and the intercept term. Figure 2.4 shows the residual sum of squares for all possible subsets of X1, . . . , X9 – the intercept was always included – plotted against k, the number of terms in the subset. The red curve is the best subset curve and defines the models which are eligible for selection in the best subset method. Since this curve must be decreasing – adding any variable always gives a 'better' fit, hence reduces RSS – it cannot be used for finding the 'optimal' value of k. How to select k involves the trade-off between bias and variance, as described in §2.4.1. Typically the smallest model is selected which minimizes an estimate of expected prediction error, see §2.5.
Figure 2.4: RSS for all size-k subsets. The red solid line is the best subset curve, the blue dashed line is greedy forward selection.
Looking at all possible subsets and fitting a model to each choice can be computationally very expensive. In Example 2.3.1 there are 2⁹ = 512 possible models, so it is just about possible to do in a reasonable time. In general we use non-exhaustive search methods which are computationally much faster; they may not find the 'best' model but commonly find ones close to the best.
Most methods we look at will, like best subset regression, produce a set of possible models indexed by a single parameter – in this case k. The job is then to find the 'right' value of this parameter.
2.4.3 Forward and backward step selection
Definition 2.4.3. (Forward step selection) This is a greedy algorithm which starts with the intercepts then adds the covariate which most decreases the RSS at each stage. Rather than explore the whole of the space of possible models it looks for a ‘good’ yet easy to compute path through them.
Example (2.3.1 revisited). In Fig. 2.4 the blue dashed line shows the path, in terms of RSS and k, of the forward selection algorithm. Here it is close to the best subset curve for most values of k, but there is no guarantee that this will always happen.
Definition 2.4.4. (Backward step selection) Starts with the full model and deletes the covariate with the smallest z-statistic.
Definition 2.4.5. (Akaike information criterion) The Akaike information criterion (AIC) tries to balance the goodness of fit of a model against the model's complexity, which it measures by counting the number of parameters used. It is defined as

−2l(θ̂) + 2N_p,

where l is the log-likelihood, θ̂ the MLE and N_p the number of parameters in the model.

In general a smaller value of AIC is preferred. As you add parameters – make the model more complex – −2l(θ̂) gets smaller, but the penalty term, 2N_p, counts against adding parameters which don't improve the fit enough.
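As a quick check of the definition, the AIC reported by R can be recomputed from the log-likelihood; note that R's AIC( ) for lm counts the error variance as one of the N_p parameters (built-in cars data, for illustration).

```r
## Sketch: AIC = -2 * log-likelihood + 2 * Np, recomputed by hand.
fit <- lm(dist ~ speed, data = cars)
Np  <- length(coef(fit)) + 1            # 2 coefficients plus sigma
aic.by.hand <- -2 * as.numeric(logLik(fit)) + 2 * Np
c(aic.by.hand, AIC(fit))                # the two values agree
```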
Definition 2.4.6. (The step( ) function) In R we have the function step( ) whose default is to use both a forward and backward one step search at each stage – i.e. add any one variable not currently included and delete any one variable that has been included. The decision to add or drop is not done by just looking at the RSS but by using the AIC criterion which takes into account the complexity of the model.
2.4.4 Computing in R
We can do all the previous calculations in R very easily, but sometimes we need to add packages to the basic R framework. For example consider
> library(MASS)
> stepAIC(lm(Y ~X1+X2+X3+X4+X5+X6+X7+X8+X9, data=house.price) )
This opens the library MASS and uses the function stepAIC( ) from it. This function allows forward, backward (and both) stepwise searches through model space. If you get the response
> library(MASS)
Error in library(MASS) : there is no package called 'MASS'
when you try this, you will need to use the Package Installer to download the library from CRAN before you start.
If p is not too large you can try the exhaustive search method. This can be done very efficiently with the code
> library(leaps)
> least.subs <- leaps(X,y, nbest=1)
2.4.5 Ridge regression
Subset selection methods take a very binary approach: each explanatory variable is either included or excluded. We can also look at more continuous ways of adapting the model – again using a tuning parameter akin to k – to change the explanatory information.
Definition 2.4.7. (Ridge regression) The ridge regression estimate of a linear model is defined as

β̂_ridge := argmin_β RSS(λ),

where

RSS(λ) := Σ_{i=1}^N (y_i − β_0 − Σ_{j=1}^p x_ij β_j)² + λ Σ_{j=1}^p β_j²   (2.5)

and λ > 0 is a tuning parameter which determines the bias-variance trade-off.

This problem can be written in an equivalent form as

β̂_ridge := argmin_β Σ_{i=1}^N (y_i − β_0 − Σ_{j=1}^p x_ij β_j)²   (2.6)
subject to Σ_{j=1}^p β_j² ≤ t,

where there is a one-to-one correspondence between λ and t.
Without loss of predictive power we usually centre the explanatory variables so that Σ_{i=1}^n x_ij = 0. When the centring has been done we estimate β_0 by ȳ. Let X be the N × p matrix whose columns are the centred covariates, not including the constant term. With this definition we have

RSS(λ) = (y − Xβ)^T (y − Xβ) + λβ^Tβ,

where β = (β_1, . . . , β_p)^T. We can show that

β̂_ridge = (X^T X + λI)^{-1} X^T y.

From this formula we see that if X^T X is close to being singular – i.e. collinearity is a problem – the λI adjustment moves the matrix away from singularity towards the identity matrix.
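The closed form can be checked directly; a sketch with simulated centred covariates (names and dimensions are illustrative).

```r
## Sketch: the ridge estimate computed from its closed form.
set.seed(3)
N <- 40; p <- 3; lambda <- 2
X <- scale(matrix(rnorm(N * p), N, p), center = TRUE, scale = FALSE)
y <- X %*% c(1, -1, 0.5) + rnorm(N)
beta.ridge <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)
## As lambda -> 0 this approaches the OLS estimate; as lambda grows the
## coefficients shrink towards zero.
drop(beta.ridge)
```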
Example (2.3.1 revisited). We can see the effect of choosing different values of λ on the parameter values in the house example in Fig. 2.5. As λ gets bigger all parameters shrink towards zero, but at different rates. We see, for example, that the coefficient β6 of the 'Number of rooms' goes quickly to zero; this is the estimate that seemed to have the 'wrong' sign in the standard OLS estimates.
Definition 2.4.8. (Effective degrees of freedom) The effective degrees of freedom of a ridge estimate of the centred explanatory variables X is defined by

df(λ) := tr(X(X^T X + λI)^{-1} X^T).
Figure 2.5: Shrinkage of parameters (beta1, . . . , beta9) against λ in the house example.
When λ = 0 we get the usual degrees of freedom of the regression model, i.e. p. The effective degrees of freedom measures how much shrinkage has taken place.
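A sketch computing df(λ) for a simulated centred design, confirming that df(0) = p and that df(λ) decreases as λ grows:

```r
## Sketch: effective degrees of freedom df(lambda) = tr(X (X'X + lambda I)^{-1} X').
set.seed(4)
N <- 40; p <- 3
X <- scale(matrix(rnorm(N * p), N, p), center = TRUE, scale = FALSE)
df.lambda <- function(X, lambda)
  sum(diag(X %*% solve(t(X) %*% X + lambda * diag(ncol(X))) %*% t(X)))
df.lambda(X, 0)     # equals p = 3, the usual degrees of freedom
df.lambda(X, 10)    # strictly smaller: the fit has been shrunk
```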
2.4.6 The Lasso
Definition 2.4.9. (Lasso) The lasso regression estimate is defined by

β̂_lasso := argmin_β Σ_{i=1}^N (y_i − β_0 − Σ_{j=1}^p x_ij β_j)²   (2.7)
subject to Σ_{j=1}^p |β_j| ≤ t,

which can be compared to Equation (2.6): here the constraint is on the absolute values of the parameters, not their squares. Using a Lagrange multiplier this can also be expressed as

β̂_lasso := argmin_β { Σ_{i=1}^N (y_i − β_0 − Σ_{j=1}^p x_ij β_j)² + λ Σ_{j=1}^p |β_j| }.
The lasso not only shrinks estimates towards zero but can make small ones exactly zero. Thus it is able to perform subset selection as well as shrinkage.
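A standard way to see this (see Hastie et al., 2009) is the orthonormal-design case, where the lasso reduces to soft-thresholding the OLS estimates; the coefficients below are illustrative.

```r
## Sketch: soft-thresholding, the lasso solution for an orthonormal design.
soft <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)
b.ols <- c(3, -1.5, 0.4, -0.2)       # illustrative OLS estimates
soft(b.ols, lambda = 0.5)            # the two smallest become exactly zero
```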
Example (2.3.1 revisited). We can see the shrinkage effects of the lasso in Fig. 2.6. The left hand plot shows the lasso as λ moves from 0 (on the
right of the plot) to ∞ (on the left). We see the values of the (standardised) parameter estimates as they shrink and then go to zero. For example the red dashed line, which corresponds to β2, is the second-to-last to be set to zero (i.e. when k = 1). The path for β6 (solid magenta) is the first to be shrunk completely to zero.

For comparison, the right-hand panel shows forward stepwise selection, where we can see the covariates being added in the order 1, 2, 9, 8, 5, 7, 4, 3, 6.
Figure 2.6: Shrinkage of parameters in house example: standardized coefficients for the lasso (left) and forward stepwise selection (right), plotted against |beta|/max|beta|.
2.4.7 Computing in R
The following code shows how to implement ridge regression and the LASSO in R.
###########################
# Ridge regression
> library(MASS)
> fit.ridge <- lm.ridge(Y ~X1+X2+X3+X4+X5+X6+X7+X8+X9, lambda=1, data=house.price)
###################
# Lasso
> library(lars)
> house.lasso <- lars(X1, y, type="lasso")

where X1 is the matrix of centred covariates.
2.5 Model assessment and selection
The previous section describes how we can look at the model complexity of regression models in terms of one dimensional tuning parameters: k the number of covariates in subset selection, λ a positive real number in ridge
regression or the lasso. We have not yet discussed how to select the 'best' value of such a tuning parameter for the prediction problem. We emphasise that the model which best fits the training data is not going to be the best model for forecasting, because of the bias-variance trade-off. Very complex models can fit training data very well but have very high associated variance. Another way of viewing this issue is to recall that for prediction we want a model that fits the underlying population; by fitting a very complex model to a sample we may just get a very good description of the sample. Often a complex model will not be able to generalise away from the observed sample – we say it has over-fitted, describing detailed aspects of the sample that are there just by 'chance'.
Let us define how we can evaluate a model's predictive performance in a way that takes into account all aspects of the model fitting process.
Definition 2.5.1. (Generalization error) Let T be the training data, i.e. for
regression the observed set {(y1, x1), . . . , (yN , xN )} where N is the number
of cases, yi the response and xi the covariates for the ith-case.
Let f̂(X_0) be the prediction model fitted to the training data and evaluated at a value of the covariate selected from the population, X_0. We define the generalization error to be

Err_T := E[L(Y, f̂(X_0)) | T],

where L() is any loss function, but in this section will typically be squared loss, i.e. the generalization error is E[(Y − f̂(X_0))² | T]. In this we think of Y and X_0 as both being drawn from their joint distribution.

We can also define the expected generalization error to be Err_T averaged over training sets,

Err := E_T[Err_T].

When we are evaluating how well a model predicts we want to compute – or at least estimate – either Err_T or, in practice, Err.
Definition 2.5.2. (Training error) The training error in general is defined as

err := (1/N) Σ_{i=1}^N L(y_i, f̂(x_i)),

which in this section is typically the residual sum of squares (RSS) scaled by 1/N.

In general training error is not a good estimate of prediction error and typically underestimates it.

Definition 2.5.3. (Ideal procedure) Hastie et al. (2009, p. 222) state that if we had lots of data, and a set of possible models, what we would ideally do is split the data into three different parts: training, validation and test.
1. Use the training to fit each of our set of models i.e. estimate their
parameters.
2. Use the validation data to estimate the prediction error for each of our candidate fitted models in order to pick the best. Note here we are not reusing data to both fit and evaluate.
3. Finally, use the test data to estimate the prediction error for our se- lected single best model.
They propose a rough rule that 50% of the data be training, 25% validation and 25% testing.
This is however often not possible since data is scarce, so what should we do?
2.5.1 Bias-variance decomposition
Definition 2.5.4. (Expected prediction error in regression) We have a regression model of the form Y = f(X) + ε, where Var(ε) := σ_ε², and the fitted value is f̂(x_0) for some input X_0 = x_0. Then the expected prediction error is

Err(x_0) = E[(Y − f̂(x_0))² | X_0 = x_0]
         = σ_ε² + (E[f̂(x_0)] − f(x_0))² + E[(f̂(x_0) − E[f̂(x_0)])²]
         = σ_ε² + Bias² + Variance.
The first term is called the irreducible error associated with the ‘true’ model, f, and will always be there, the second and third terms depend on the model class we use.
Example 2.5.5. (Simulation Example) In order to understand the gener- alisation error for models selected by best subset regression a simulation experiment was done. The true model is given by
Y = 1.0X1 + 0.775X2 + 0.550X3 + 0.325X4 + 0.100X5 + ε (2.8)
where ε ∼ N(0, 0.3²). The set of possible covariates X_i is p = 20 dimensional and we generate each independently from a Unif(0, 1) distribution. A training sample T, with N = 80, was generated and the best subset regression was fitted for each subset size k = 1, . . . , 20.

From the training set we have 20 possible models to choose from. We see how well each does in terms of its Err_T prediction error by randomly selecting sets of explanatory variables (each independent Unif(0, 1)), generating the observed value of Y from model (2.8), and then seeing how well each of the 20 candidate fitted models did at predicting Y in terms of squared error.

Figure 2.7: Generalisation error against model complexity (subset size k).

With fixed T this was repeated many times and the average error (i.e., Err_T) is plotted in Fig. 2.7. We see that when the model is too simple – having fewer than the 5 covariates in the true model – it does very badly: it is strongly biased. But having the model too complex is not a good thing either, since the prediction variance starts to increase.
The simulation based approach of Example 2.5.5 does clarify what happens in model building and the trade-off between bias and variance, but it can't be used in practice with a real data set since it requires knowing the true model. If we try to use the training error (Definition 2.5.2) we find that, in the words of Hastie et al. (2009, p. 229), it is too optimistic: it underestimates the true error.
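This optimism of the training error is easy to demonstrate by simulation; in this sketch a model with many covariates is fitted to pure noise and then scored on fresh responses at the same covariate values.

```r
## Sketch: training error underestimates the error on new responses.
set.seed(5)
N <- 30; p <- 10
X <- matrix(rnorm(N * p), N, p)
y <- rnorm(N)                            # pure noise: no covariate matters
fit <- lm(y ~ X)
err.train <- mean(resid(fit)^2)          # training error (Definition 2.5.2)
y.new <- matrix(rnorm(N * 200), N, 200)  # 200 fresh response vectors
err.new <- mean((y.new - fitted(fit))^2)
c(err.train, err.new)                    # training error is clearly smaller
```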
Definition 2.5.6. (In-sample error) The in-sample error is defined by

Err_in := (1/N) Σ_{i=1}^N E_{Y⁰}[L(Y_i⁰, f̂(x_i)) | T].

That is, we fix the values of the explanatory variables at their training values and then average over random new responses Y_i⁰.

We define the optimism as being

op := Err_in − err,

and the average optimism as being

ω := E_Y(op).
The Akaike information criterion, Definition 2.4.5, can be thought of as an estimate of Err_in and we can use it for model selection: we simply choose the model giving the smallest AIC over the set of models considered.
Example (2.3.1 revisited). We can use the stepAIC function in the MASS package in R to do a forward and backwards search which stops when the smallest AIC value is found.
> library(MASS)
> stepAIC(lm(Y ~X1+X2+X3+X4+X5+X6+X7+X8+X9, data=house.price) )
[ … ]
       Df Sum of Sq    RSS    AIC
<none>              139.60 52.257
- X5    1    20.836 160.43 53.596
- X7    1    21.669 161.27 53.720
- X2    1    47.409 187.01 57.274
- X1    1   156.606 296.20 68.312
Call:
lm(formula = Y ~ X1 + X2 + X5 + X7, data = house.price)
Coefficients:
(Intercept) X1 X2 X5 X7
13.621 2.412 8.459 2.060 -2.215
> summary(lm(formula = Y ~ X1 + X2 + X5 + X7, data = house.price))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  13.6212     3.6725   3.709 0.001489 **
X1            2.4123     0.5225   4.617 0.000188 ***
X2            8.4589     3.3300   2.540 0.019970 *
X5            2.0604     1.2235   1.684 0.108541
X7           -2.2154     1.2901  -1.717 0.102176
Example (2.5.5 revisited). If we look again at the simulated exercise we can see how well the AIC based search works.
> stepAIC(lm(y~x1+x2+x3+x4+x5+x6+x7+x8+x9+x10))
[ … ]
Call:
lm(formula = y ~ x1 + x2 + x3 + x4)
Coefficients:
(Intercept) x1 x2 x3 x4
0.1332 0.9902 0.8383 0.4564 0.2637
It indeed picks out the p = 4 model that Fig. 2.7 shows has the best prediction error, but of course, unlike that figure, it has not used any knowledge of the true model to do it.
Definition 2.5.7. (AIC(α)) Given a set of candidate models which have been indexed by a tuning parameter α we can define

AIC(α) = err(α) + 2 (d(α)/N) σ̂_ε²,

where err(α) is the training error and d(α) the number of parameters (or the effective degrees of freedom) for the model indexed by α. This function provides an estimate of the test error curve, and we find the tuning parameter α that minimizes it.
2.5.2 Cross validation
For most examples the 'ideal procedure' of Definition 2.5.3 requires too much data, but it can be thought of as motivating the most popular method for estimating prediction error: cross-validation. It works in a very wide range of prediction problems, but here we will look at it in the regression context.
Definition 2.5.8. (K-fold cross-validation) The idea is to split the N subjects in the training set into K (roughly) equal-sized parts, say {T_k}, k = 1, . . . , K. For k = 1, . . . , K take out the kth part of this partition, T_k, and fit the model to the other K − 1 parts of the data. Calculate the prediction error of the fitted model on T_k. The prediction error is then estimated by averaging over the K times this can be done.
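The definition translates directly into a few lines of R; a sketch with simulated data and K = 5:

```r
## Sketch: K-fold cross-validation for a simple linear model, K = 5.
set.seed(6)
N <- 100
x <- runif(N); y <- 1 + 2 * x + rnorm(N, sd = 0.5)
K <- 5
fold <- sample(rep(1:K, length.out = N))   # random partition into K parts
cv.err <- sapply(1:K, function(k) {
  fit  <- lm(y ~ x, subset = (fold != k))  # fit on the other K - 1 parts
  pred <- predict(fit, newdata = data.frame(x = x[fold == k]))
  mean((y[fold == k] - pred)^2)            # error on the held-out part
})
mean(cv.err)   # estimates the expected prediction error (close to 0.5^2)
```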
Example (2.5.5 revisited). Suppose we want to evaluate the 20 models which were proposed by best subset regression in the simulation exercise of Example 2.5.5. Best subset regression gives a set of models, indexed by k, but does not tell you which k to use. The cross validation method can be used here to select k. In Fig. 2.8 one case was left out at a time – i.e. K = N cross-validation – the 20 possible models were fitted to the remaining data and the error each makes in predicting the left-out case was recorded. We average this over all N cases.
We see very similar results to Fig. 2.7 but the cross validation only uses the data which is available to the analyst. It does not use any knowledge of the ‘true model’.
One difference, though, between Figs. 2.7 and 2.8 is the type of generalisation error being estimated. In general cross-validation gives us an estimate of the expected generalization error Err and not the generalization error Err_T itself. Since, in a given prediction problem, we have one specific training set, it would be more appropriate to have a good idea about Err_T, but in general this is not an easy thing to estimate.
Figure 2.8: Cross validation estimate of generalisation error, plotted against subset size k.
Appendix
2.A Appendix: Multivariate normal distribution
First recall that we denote a linear transformation by y = Ax, where x, y are n × 1 (column) vectors and A is an n × n matrix. Further, we say that A is orthogonal if

AA^T = I_n,

where A^T is the transpose of A and I_n the identity matrix.

Definition 2.A.1. A projection of a vector x ∈ R^n onto a subspace S ⊂ R^n is the linear transformation

y = Px, where P is idempotent, i.e. P² = P.
In this, and later sections we will make heavy use of the properties of the multivariate normal distribution.
Definition 2.A.2. Let X be a p-dimensional multivariate normal random variable. Then the density of X is given by

(2π)^{−p/2} |det Σ|^{−1/2} exp(−(1/2)(x − μ)^T Σ^{−1}(x − μ)),

where μ ∈ R^p is the mean and Σ is the p × p positive-definite variance-covariance matrix of X. The usual notation is to write X ∼ N(μ, Σ). Note: x^T means the transpose of x.
The properties of the multivariate normal are as follows:

Theorem 2.A.3. Assume that Z ∼ N_n(μ, Σ). If μ_2 is an n-dimensional vector and S an n × n matrix then

SZ + μ_2 ∼ N_n(Sμ + μ_2, SΣS^T).

Note that the dimension does not have to stay the same. For example, if A is a p × n (where p ≤ n) matrix and Y = AZ then

Y ∼ N_p(Aμ, AΣA^T).
Theorem 2.A.4. Marginal distributions of a multivariate normal are again normal: if Y_1 is a p × 1 subvector of an n × 1 vector Y ∼ N_n(μ, Σ), then Y_1 ∼ N_p(μ_1, Σ_11).

Proof. Check formally by defining

Y_1 = (I_{p×p}, 0_{p×(n−p)}) Y,

where I is the identity matrix and 0 a matrix of zeros; then use the linear properties of the normal model to show Y_1 ∼ N_p(μ_1, Σ_11). Here we assume, without loss of generality, that Y_1 is the first p components, and write the mean and variance-covariance matrix in partitioned form as

μ = (μ_1, μ_2),   Σ = ( Σ_11  Σ_12 ; Σ_12^T  Σ_22 ).

We can also look at the conditional properties of the multivariate normal.

Theorem 2.A.5. The conditional distribution of Y_1 | Y_2 = y_2 is normal. It has mean

μ_{1.2} = μ_1 + Σ_12 Σ_22^{-1}(y_2 − μ_2)

and covariance

Σ_{1.2} = Σ_11 − Σ_12 Σ_22^{-1} Σ_12^T.
2.B Matrix differentiation review
• Let x = (x_1, . . . , x_p)^T be a p-dimensional vector and y = g(x) = (g_1(x), . . . , g_q(x))^T a q-dimensional vector.

• Define the derivative of y with respect to x as the following q × p matrix:

∂y/∂x = ∂g(x)/∂x = (∂g_i(x)/∂x_j)_{i,j}.

• If y = Ax, for a constant matrix A, then ∂y/∂x = A.

• If y = x^T A, then ∂y/∂x = A^T.

• If y = x^T A x, then ∂y/∂x = x^T (A + A^T).
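These identities are easy to check numerically; a sketch verifying the last rule with a forward-difference approximation:

```r
## Sketch: numerical check that d(x'Ax)/dx = x'(A + A').
A <- matrix(c(2, 1, 0, 3), 2, 2)
x <- c(1, -1)
g.analytic <- t(x) %*% (A + t(A))
h <- 1e-6
g.numeric <- sapply(1:2, function(j) {
  e <- replace(numeric(2), j, h)           # perturb the jth coordinate
  (t(x + e) %*% A %*% (x + e) - t(x) %*% A %*% x) / h
})
rbind(drop(g.analytic), g.numeric)         # the rows agree to rounding error
```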
2.C Appendix: Review of Regression Theory
Theorem 2.C.1. (Properties of hat matrix)

(a) The hat matrix H is n × n symmetric, i.e. H = H^T.

(b) The hat matrix H is idempotent, i.e. H² = H.

(c) ŷ = Hy = X(X^T X)^{-1}X^T y is the projection of the vector y onto the column space of the design matrix X.

(d) trace(H) = Σ_{i=1}^n h_ii = p + 1.

(e) For a given point x, we can write h(x)^T = x^T (X^T X)^{-1} X^T and therefore

f̂_n(x) = f(x; β̂) = h(x)^T Y = Σ_{i=1}^n h_i(x) Y_i.
Proof. Sketch of (d): Recall that the trace of a square matrix is the sum of its diagonal elements, then note that

trace(H) = trace(X(X^T X)^{−1} X^T) = trace((X^T X)^{−1} X^T X) = trace(I_{p+1}) = p + 1.
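Properties (a), (b) and (d) can be checked directly on a random design matrix. A minimal Python/NumPy sketch (the notes themselves compute in R; names here are illustrative):

```python
# Check properties (a), (b) and (d) of the hat matrix on a random design.
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + p covariates
H = X @ np.linalg.inv(X.T @ X) @ X.T                        # hat matrix

sym_ok = np.allclose(H, H.T)       # (a) symmetric
idem_ok = np.allclose(H @ H, H)    # (b) idempotent
tr = np.trace(H)                   # (d) should equal p + 1 = 4
```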
It can be helpful to think of the variability in the data in terms of different types of ‘sums-of-squares’.
Definition 2.C.2. (Sum of squares)

(a) Total corrected sum of squares

TSS = Σ_{i=1}^n (y_i − ȳ)².

(b) Regression sum of squares (i.e., sum of squares explained by the regression model)

SSReg = Σ_{i=1}^n (ŷ_i − ȳ)².

(c) Residual sum of squares

RSS = RSS(β̂) = Σ_{i=1}^n (y_i − ŷ_i)².
It is easy to show that TSS = SSReg + RSS, i.e.

total variability = variability explained by the model + unexplained variability (i.e. error).
Definition 2.C.3. (Unbiased estimate of variance) The unbiased estimate of σ² is defined as

σ̂² := (1 / (N − p − 1)) Σ_{i=1}^N (y_i − ŷ_i)².
Theorem 2.C.4. (Sampling distributions) As well as the assumptions of Theorem 2.3.5, further assume that we can write Y in terms of its mean and an error term

Y = E(Y|X) + ε = β0 + Σ_{i=1}^p βi Xi + ε,

where ε ∼ N(0, σ² I_{N×N}).

(i) The sampling distribution of β̂_OLS is given by

β̂_OLS ∼ N(β, σ²(X^T X)^{−1}).

(ii) The distribution of σ̂² is determined by

(N − p − 1) σ̂² / σ² ∼ χ²_{N−p−1}.

(iii) To test the hypothesis βj = 0 we use the test statistic

z_j = β̂_j / (σ̂ √ν_j),

where ν_j is the jth diagonal element of (X^T X)^{−1}, and z_j has a t_{N−p−1} distribution under the null.

(iv) To test a smaller model M0, with p0 covariates, inside a larger model M1, with p1 covariates, we use the test statistic

F = ((RSS0 − RSS1)/(p1 − p0)) / (RSS1/(N − p1 − 1)),

where RSS_i is the residual sum of squares for each estimated model. F has an F_{p1−p0, N−p1−1} distribution under the null.
Assume the model Y_i = β0 + β1 x_i1 + · · · + βp x_ip + ε_i. We may wish to test

H0 : β1 = β2 = · · · = βp = 0
H1 : at least one βi ≠ 0, i = 1, …, p.

Under the assumption of normality, the test statistic

F = (SSReg/p) / (RSS/(n − p − 1))

has an F distribution with p and n − p − 1 degrees of freedom, i.e. F(p, n − p − 1). Reject H0 if F > F_{α; p, n−p−1}, where F_{α; p, n−p−1} is the upper α quantile of the F(p, n − p − 1) distribution.
Analysis of variance table:

Source of variation | Degrees of freedom (df) | Sum of squares (SS) | Mean squares (MS)       | F statistic
Regression          | p                       | SSReg               | SSReg/p                 | F = (SSReg/p) / (RSS/(n−p−1))
Residual            | n − p − 1               | RSS                 | σ̂² = RSS/(n − p − 1)    |
Total               | n − 1                   | TSS                 |                         |
The F-test is used to test whether there exists a linear relationship between Y and the p covariates in the x vector. If it is significant then a natural question is for which of the p covariates in x there is evidence of a linear relationship with Y. To answer this question, one can perform p separate t-tests using the test statistics

T_i = β̂_i / se(β̂_i),   i = 1, …, p,

where under H0 : βi = 0, T_i ∼ t_{n−p−1}, the t distribution with n − p − 1 degrees of freedom. The quantity se(β̂_i) is the estimated standard deviation of β̂_i. Note that, for i = 0, 1, …, p, se(β̂_i) is the square root of the (i, i) element of the matrix Var[β̂] = σ̂²(X^T X)^{−1}. A (1 − α)100% confidence interval for βi is

β̂_i − t_{(α/2, n−p−1)} se(β̂_i) ≤ βi ≤ β̂_i + t_{(α/2, n−p−1)} se(β̂_i).
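These quantities can all be computed by hand from the design matrix. A minimal Python/NumPy sketch (a stand-in for the R computations used in the course; the simulated data and names are illustrative):

```python
# Manual OLS estimates, standard errors and t-statistics, per the formulas above.
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, 0.0])                 # second covariate is irrelevant
y = X @ beta_true + rng.normal(scale=0.5, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p - 1)              # unbiased estimate of sigma^2
se = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(X.T @ X)))
T = beta_hat / se                                     # t-statistics for H0: beta_i = 0
```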
How do we test for the presence of a specified subset of the predictors in the model? This means, for 1 ≤ s ≤ p, testing the null hypothesis

H0 : β1 = β2 = · · · = βs = 0, versus H1 : H0 not true.

That is, testing the reduced model Y = β0 + β_{s+1} x_{s+1} + · · · + βp xp + ε versus the full model Y = β0 + β1 x1 + · · · + βs xs + β_{s+1} x_{s+1} + · · · + βp xp + ε. The partial F-test statistic is used:

F = [ (RSS(reduced model) − RSS(full model)) / (df_reduced − df_full) ] / [ RSS(full model) / df_full ]
  = [ (RSS(β0, β_{s+1}, …, βp) − RSS(β0, β1, …, βs, β_{s+1}, …, βp)) / s ] / [ RSS(β0, β1, …, βs, β_{s+1}, …, βp) / (n − p − 1) ].

Reject H0 if F > F_{α; s, n−p−1}.
To compare different fits, one may use the coefficient of determination

R² = SSReg/TSS = 1 − RSS/TSS.

A major deficiency of R² is that adding a predictor to the model (even an irrelevant one) always increases R², since

RSS(fit of smaller model) ≥ RSS(fit of larger model).

To compensate for this an adjusted coefficient of determination is proposed:

R²_adj = 1 − (RSS/(n − p − 1)) / (TSS/(n − 1)) = 1 − ((n − 1)/(n − p − 1))(1 − R²).

When comparing models with different numbers of predictors one should use R²_adj and not R².
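This behaviour is easy to demonstrate by simulation. A Python/NumPy sketch (illustrative data; the course itself uses R), fitting a model with and without a block of irrelevant predictors:

```python
# R^2 never decreases when predictors are added; adjusted R^2 penalises them.
import numpy as np

rng = np.random.default_rng(3)
n = 40
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(size=n)

def r2_stats(Z, y):
    """Return (R^2, adjusted R^2) for a regression of y on columns of Z plus intercept."""
    n = len(y)
    X = np.column_stack([np.ones(n), Z])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = np.sum((y - X @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    p = X.shape[1] - 1
    return 1 - rss / tss, 1 - (rss / (n - p - 1)) / (tss / (n - 1))

r2_small, adj_small = r2_stats(x[:, None], y)
junk = rng.normal(size=(n, 5))                  # 5 irrelevant predictors
r2_big, adj_big = r2_stats(np.column_stack([x, junk]), y)
```

Here `r2_big` is guaranteed to be at least `r2_small`, even though the extra predictors carry no information, whereas the adjusted version pays a price for the added parameters.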
3 Forecasting Stationary processes
This chapter looks at stationary time series and how to forecast them. It looks in particular at how, by estimating moment structures – such as mean vectors and variance-covariance matrices – optimal linear predictors can be computed. It shows how certain models have moment structures which are estimable with the data available to us, and discusses how to select, fit and check these models in practice.
The stationary time series studied will form the basis of a much richer class of models which will have application to ‘real-world’ forecasting. How this is done is shown in Chapter 4.
3.1 Introduction
Definition 3.1.1. (Time series and finite dimensional distributions) A discrete time series is a set of random variables, {Xt}, indexed by t ∈ T ⊆ Z. For any finite subset of {t1, . . . , tn} ⊂ T the finite dimensional distributions are the joint distributions
F(xt1,··· ,xtn) := P(Xt1 ≤ xt1,··· ,Xtn ≤ xtn).
Example 3.1.2. (An i.i.d. sequence) A sequence of independent identically distributed (i.i.d.) random variables {Xt}, such that Xt ∼ FX for all t, forms a discrete time series, with

F(x_{t1}, · · · , x_{tn}) = Π_{i=1}^n FX(x_{ti}).
Example 3.1.3. (White noise) A sequence of uncorrelated random variables, {Xt}, each with E(Xt) = 0 and Var(Xt) = σ², for all t ∈ T, is called a white noise process and is denoted by WN(0, σ²). For this time series the finite dimensional distributions are not defined explicitly but the second order moment structure (i.e. all means, variances and covariances) is.
Example 3.1.4. (Gaussian process) Let T = Z, then {Xt} is a discrete Gaussian process if any finite subset {Xt1,…,Xtn} has an n-dimensional multivariate normal distribution.
This model is completely determined when the mean and variance-covariance structures are known.
Example 3.1.5. (Random walk) Let Zt, t ∈ Z be an i.i.d. sequence of random variables. The series defined by

Xt := Σ_{i=1}^t Zi,

for t = 1, 2, …, is called a random walk.
Example 3.1.6. (MA(1) process) Let Zt, t ∈ Z be an i.i.d. sequence of N(0, σ²) random variables or, more generally, Zt ∼ WN(0, σ²). The series defined by
Xt := Zt + θZt−1
for all t, is called a first order moving average process, and denoted by
MA(1).
Example 3.1.7. (AR(1) process) Let Zt, t ∈ Z be an i.i.d. sequence of N(0, σ²) random variables or, more generally, Zt ∼ WN(0, σ²). The series defined by
Xt = φXt−1 + Zt
for all t ∈ Z, is called a first order autoregressive (AR(1)) process.
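The three driving examples above are straightforward to simulate. A Python/NumPy sketch (the course's own computing is in R; parameter values here are arbitrary illustrations):

```python
# Simulate the basic building blocks: white noise, MA(1) and AR(1).
import numpy as np

rng = np.random.default_rng(4)
n, theta, phi = 500, 0.8, 0.6
Z = rng.normal(size=n + 1)          # i.i.d. N(0,1) noise, one extra value for the MA lag

ma1 = Z[1:] + theta * Z[:-1]        # X_t = Z_t + theta * Z_{t-1}

ar1 = np.zeros(n)                   # X_t = phi * X_{t-1} + Z_t, started from 0
for t in range(1, n):
    ar1[t] = phi * ar1[t - 1] + Z[t + 1]
```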
3.2 Stationary processes
Definition 3.2.1. (The auto-covariance function) If {Xt} is a time series with Var(Xt) < ∞ for all t ∈ T, then the auto-covariance function (acvf) is defined by
γ(r, s) = Cov(Xr, Xs),
for r, s ∈ T .
Definition 3.2.2. (Stationarity) The time series {Xt}t∈T is said to be sta- tionary if (i) E(|Xt|2) < ∞ for all t ∈ T, (ii) E(Xt) = μ, for all t ∈ T, and (iii) the auto-covariance function satisfies
γ(r, s) = γ(r + t, s + t) for all r, s, r + t, s + t ∈ T .
Definition 3.2.3. (Strict Stationarity) The time series {Xt}t∈T is said to be strictly stationary if the finite dimensional vectors (Xt1 , · · · , Xtn )
and (Xt1+h, · · · , Xtn+h) have the same joint distributions for all finite sub- sets of T and all h where the translation is defined.
Example 3.2.4. (White noise process) Example 3.1.3 defines the white noise process WN(0, σ²). This is stationary since (i) E(|Xt|²) = σ² + 0² < ∞ for all t, (ii) E(Xt) = 0 for all t and (iii)

γ(r, s) = σ² if r = s, and 0 if r ≠ s,

so that γ(r, s) = γ(r + t, s + t) for all r, s, t.
Since the finite dimensional distributions are not defined it is not strictly
stationary.
Example 3.2.5. (Cauchy example) A Cauchy distributed random variable X has density function

f(x) = 1 / (π(1 + x²)),

which does not have a mean or variance. So an i.i.d. sequence {Xn} is strictly stationary but not stationary.
Example 3.2.6. (Random walk example) The random walk, Xt := Σ_{i=1}^t Zi, for T = {1, 2, . . . } was defined in Example 3.1.5. It is clear that the auto-covariance function satisfies
γ(t, t) = tσ2, (3.1) so does not satisfy Condition (iii) in Definition 3.2.2, so {Xt} is not station-
ary.
Example 3.2.7. (Predictable process) Let Z1, Z2 be two independent N(0, σ²) random variables. We can define the discrete time series

Xt = Z1 cos(2πt/100) + Z2 sin(2πt/100).

From the definition of the auto-covariance function we have

γ(r, s) = σ² ( cos(2πr/100) cos(2πs/100) + sin(2πr/100) sin(2πs/100) )
        = σ² cos(2π(r − s)/100), (3.2)

which satisfies (iii) of Definition 3.2.2. From this we can easily check that {Xt} is stationary.
Theorem 3.2.8. (Properties of auto-covariance function) Assume γ(r, s) is the auto-covariance function of a stationary process {Xt}, then the following statements hold.
(i) The auto-covariance function can be written as γ(h) := γ(r + h, r) for all r,
(ii) γ(0) ≥ 0,
(iii) |γ(h)| ≤ γ(0),
(iv) The auto-covariance function is an even function, i.e. γ(h) = γ(−h).
Proof. (i) Since Xt is stationary γ(r, s) = γ(r + t, s + t) for all t, so substituting t = −s we have

γ(r, s) = γ(r − s, 0),

which is just a function of the one variable h := r − s. So, without loss, we can define γ(h) via

γ(h) := γ(r + h, r) = γ(h, 0) for all h ∈ Z. (3.3)

(ii) By definition γ(0) = Var(Xt) ≥ 0.

(iii) This follows from the Cauchy-Schwarz inequality,

|Cov(Xr, Xs)| ≤ √( Var(Xr) Var(Xs) ) = γ(0).

(iv) By definition

γ(−h) = Cov(X−h, X0) = Cov(X0, X0+h) = Cov(Xh, X0) = γ(h).
Definition 3.2.9. (Auto-correlation function) For a stationary process {Xt} the auto-correlation function is defined by

ρ(h) = γ(h)/γ(0).
3.3 Estimable but flexible models

One class of processes where there is no difference between strict stationarity and stationarity is the Gaussian processes first considered in Example 3.1.4. Consider the time series X := (X1, . . . , Xn) which has a Gaussian process as a model. Then the n-vector has an n-dimensional multivariate normal distribution.

Definition 3.3.1. (Multivariate normal) Let X be an n-dimensional multivariate normal random variable. Then the density of X is given by

(2π)^{−n/2} |det Σ|^{−1/2} exp( −(1/2)(x − μ)^T Σ^{−1} (x − μ) ),

where μ ∈ Rn is the mean and Σ is the n × n positive-definite variance-covariance matrix of X.
Now if we only assume that X has a multivariate normal model in a time series context and we want to estimate its parameters from the observed data x_{t1}, . . . , x_{tn} we have a problem. We have one observed vector with n components, and we want to estimate n parameters in μ and n(n + 1)/2 parameters in Σ. So there are far too many parameters compared to observations. We are therefore forced to restrict the model class to one which is estimable in practice.
First note that assuming stationarity helps, since the mean vector has only one parameter,

μ = (μ, μ, . . . , μ),

and Σ has to be constant on off-diagonals,

    ( σ²  a   b   c  ... )
    ( a   σ²  a   b  ... )
Σ = ( b   a   σ²  a  ... )
    ( c   b   a   σ² ... )
    ( ...            ... )

so it can be specified by n parameters, plus a check on positive definiteness. This gives n + 1 parameters and n observations: still too many parameters. We are therefore forced to make the model specification even stronger and look for parsimonious models with few parameters, but we will then need to check the quality of the fit. Examples include:
1. Example 3.1.2 is i.i.d. Normal which has 2 parameters μ,σ2. This in general will not fit commonly observed real data since the indepen- dence assumption is too strong.
2. Example 3.1.6 is the MA(1) process which has, in general, three pa- rameters μ, θ, σ2. Its covariance structure is defined in Theorem 3.6.1.
3. Example 3.1.7 is the AR(1) process which has, in general, three parameters μ, φ, σ², where here we need to check that the process is not a random walk, since that is not stationary. Its covariance structure is derived in Theorem 3.6.3.
3.4 Best linear predictor
We have seen from Chapter 2, §2.2 that the best predictor, in terms of mean square error, of a random variable Y given the information in a random variable X is E(Y|X). To compute this requires knowledge of the joint distribution and is computationally hard for many examples. An alternative is to look at a simpler class of predictors.
Example 3.4.1. (Best linear predictor) Consider two random variables, X and Y, with E(X) := μX, E(Y) := μY and all second moments finite. We want to compute the best linear predictor of Y given X, i.e. find Ŷ(X) := a + bX which minimises

E[ (Y − Ŷ(X))² ].

Minimising L(a, b) := E[ {Y − (a + bX)}² ] over a, b gives the solution. We have

∂L(a, b)/∂a = −2 E[Y − (a + bX)] = −2 ( E(Y) − a − b E(X) ) = 0.

So â = E(Y) − b E(X), and substituting into L(â, b) gives

E[ {(Y − μY) − b(X − μX)}² ],

and this is minimised by b̂ = Cov(X, Y)/Var(X). Thus the best linear predictor is

Ŷ = E(Y) + (Cov(X, Y)/Var(X)) (X − μX),

and the MSE of the predictor is Var(Y)(1 − Corr(X, Y)²).
Note that if X and Y are uncorrelated the best linear predictor of Y is just E(Y) and its MSE is Var(Y).
Example 3.4.2. (Non-linear prediction) Suppose X ∼ N(0,1) and Y = X2 − 1, then we have E(X) = E(Y ) = 0 and the best linear predictor is Y = 0, since Cov(X, Y ) = 0. Of course, the best predictor of Y given X is X2 − 1, which has a MSE of 0. So the best linear predictor might not be very good.
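Example 3.4.2 can be checked by simulation. A Python/NumPy sketch (illustrative; the course's computing is in R) showing that the sample covariance between X and Y = X² − 1 is near zero, so the best linear predictor is essentially the constant 0, while the non-linear predictor X² − 1 is perfect:

```python
# The best linear predictor of Y = X^2 - 1 from X ~ N(0,1) is just 0.
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=100_000)
y = x ** 2 - 1                        # E(X) = E(Y) = 0 by construction

cov_xy = np.mean(x * y)               # ~ E(X^3) = 0, so the best linear predictor is 0
mse_linear = np.mean(y ** 2)          # MSE of predicting 0; Var(Y) = E(X^4) - 1 = 2
mse_best = np.mean((y - (x ** 2 - 1)) ** 2)   # the best (non-linear) predictor has MSE 0
```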
Example 3.4.3. (Best linear predictor) Suppose now we wish to predict Y given X1, X2. We apply the same method used in Example 3.4.1, first writing the best linear predictor as

Ŷ = a0 + a1 X2 + a2 X1.

Differentiating with respect to a0 and substituting gives that we want to minimise

E[ {(Y − μY) − a1(X2 − μX2) − a2(X1 − μX1)}² ],

which is minimised by (a1, a2) being the solution to

( Var(X2)      Cov(X1, X2) ) ( a1 )   ( Cov(Y, X2) )
( Cov(X1, X2)  Var(X1)     ) ( a2 ) = ( Cov(Y, X1) ),

i.e., when the covariance matrix is non-singular the best linear predictor is

E(Y) + ( X2 − μX2, X1 − μX1 ) ( Var(X2)      Cov(X1, X2) )^{−1} ( Cov(Y, X2) )
                              ( Cov(X1, X2)  Var(X1)     )      ( Cov(Y, X1) ).
The methods and results of Examples 3.4.1 and 3.4.3 can be generalised to the following theorem.
Theorem 3.4.4. (Prediction operator) Let {Xt} be a stationary time series, with mean μ and auto-covariance function γ(h). The best linear predictor of X_{n+h} given the set Xn, . . . , X1 is

Pred(X_{n+h} | Xn, . . . , X1) := μ + (a1, . . . , an) (Xn − μ, X_{n−1} − μ, . . . , X1 − μ)^T, (3.4)

where a := (a1, . . . , an)^T satisfies the equation

Γ a = γ(n,h) := (γ(h), . . . , γ(h + n − 1))^T, (3.5)

where Γ is the n × n matrix with ijth element Γij = γ(|i − j|). The MSE is given by

γ(0) − a^T γ(n,h).
Theorem 3.4.4 shows that, at least as far as linear prediction for a sta- tionary process {Xt} is concerned, all that is required is knowledge of the mean of the process μ and the auto-correlation function γ(h) for a suitable range of h. Often the computation issues are concerned with inverting the n × n matrix Γ.
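Theorem 3.4.4 translates directly into a few lines of code. A Python/NumPy sketch (names are illustrative; the course's own computing is in R) that builds Γ and γ(n,h) from a supplied auto-covariance function and solves the linear system:

```python
# A direct implementation of the prediction operator of Theorem 3.4.4.
import numpy as np

def blp_forecast(x, mu, gamma, h=1):
    """Best linear predictor of X_{n+h} given observations x = (x_1, ..., x_n).

    gamma(k) must return the auto-covariance at lag k.
    Returns the forecast and its MSE, gamma(0) - a^T gamma_(n,h)."""
    n = len(x)
    Gamma = np.array([[gamma(abs(i - j)) for j in range(n)] for i in range(n)])
    g = np.array([gamma(h + k) for k in range(n)])   # (gamma(h), ..., gamma(h+n-1))
    a = np.linalg.solve(Gamma, g)
    past = np.asarray(x)[::-1] - mu                  # (x_n - mu, ..., x_1 - mu)
    return mu + a @ past, gamma(0) - a @ g
```

With the AR(1) auto-covariance γ(h) = σ²φ^|h|/(1 − φ²) this reproduces the hand calculation of Example 3.6.5 later in the chapter.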
Example 3.4.5. Figure 3.1 shows an example of an optimal linear predictor based on an observed sample with n = 200. In this plot the auto-covariance function was estimated (see §3.5) and the best linear predictor for h = 1, . . . , 20 was computed and plotted in blue. The error associated with the forecast is shown by the 95%-prediction interval, shown in red.
It is important to have a qualitative understanding of this example. For values of h where the auto-correlation is near zero the best linear predictor is simply the (sample) mean which is shown with the horizontal dashed line. We see that, in this example, the blue line converges to this for h ≈ 20. The error in the forecast also converges to a limit for large h. This is just determined by the variance of the sample.
Figure 3.1: Simple forecast example: blue line forecast, red lines ±1.96× standard error. The dashed line is mean of process.
How long after the last observation it takes before the best estimate is just the sample mean ±1.96 × the sample standard deviation depends on the shape of the auto-covariance function. If it decreases rapidly to zero then the convergence is fast. Figure 3.2 shows an example where there is a slower decay to zero. In this case we see that the convergence to the mean has not happened by h = 20.
3.5 Estimating the mean and auto-covariance functions
Since Theorem 3.4.4 shows that for optimal linear h-step forecasting we only need the mean and auto-covariance functions it is natural to ask if these functions can be directly estimated from observed data.
Definition 3.5.1. (Sample moments) Let x1, . . . , xn be observed values of a stationary time series. The sample mean is defined by

x̄ := (1/n) Σ_{t=1}^n x_t,

the sample auto-covariance function is defined by

γ̂(h) := (1/n) Σ_{t=1}^{n−|h|} (x_{t+|h|} − x̄)(x_t − x̄),

and the sample auto-correlation function is ρ̂(h) = γ̂(h)/γ̂(0).
Since the data is not i.i.d. in general, the usual properties of the sample means do not apply, but they are still well behaved statistically.
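Definition 3.5.1 can be coded directly. A minimal Python/NumPy sketch (the course uses R's `acf()` for this; function names here are illustrative):

```python
# Sample auto-covariance and auto-correlation exactly as in Definition 3.5.1.
import numpy as np

def sample_acvf(x, h):
    """Sample auto-covariance at lag h (note the division by n, not n - |h|)."""
    x = np.asarray(x, dtype=float)
    n, h = len(x), abs(h)
    xbar = x.mean()
    return np.sum((x[h:] - xbar) * (x[:n - h] - xbar)) / n

def sample_acf(x, h):
    return sample_acvf(x, h) / sample_acvf(x, 0)
```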
Figure 3.2: Simple forecast example: blue line forecast, red lines ±1.96× standard error. The dashed line is mean of process.
Theorem 3.5.2. (Properties of sample mean) For a stationary process {Xt} the estimator X̄n := (1/n) Σ_{t=1}^n Xt is an unbiased estimate of E(Xt) := μ. Further, its mean squared error is

E(X̄n − μ)² = (1/n) Σ_{h=−n}^{n} (1 − |h|/n) γ(h).

If γ(h) → 0 as h → ∞, then X̄n converges, in mean square, to μ, and

n E(X̄n − μ)² → Σ_{|h|<∞} γ(h)

if Σ_{h=−∞}^{∞} |γ(h)| < ∞.

Proof. We have that

E(X̄n − μ)² = Var(X̄n) = (1/n²) Σ_{i=1}^n Σ_{j=1}^n Cov(Xi, Xj) = (1/n²) Σ_{i=1}^n Σ_{j=1}^n γ(i − j) = (1/n²) Σ_{h=−n}^{n} (n − |h|) γ(h),

which gives the result.
The sample auto-covariance function, γ̂(h), will give a biased estimate of γ(h) – since we divide by n and not n − |h| – but the bias is small for n much larger than h and can be disregarded. The reason for using the 1/n term is the following result.
Theorem 3.5.3. (Properties of sample auto-covariance function) For each k ≥ 1 the k-dimensional sample covariance matrix

      ( γ̂(0)     γ̂(1)     ···  γ̂(k−1) )
Γ̂k := ( γ̂(1)     γ̂(0)     ···  γ̂(k−2) )
      ( ...                ···  ...    )
      ( γ̂(k−1)   γ̂(k−2)   ···  γ̂(0)   )

is non-negative definite.
Proof. See (Brockwell and Davis, 2002, Page 60).
This result is important since we often want to ‘plug-in’ estimates to replace the unknown true covariance. We therefore want that they have the same mathematical properties such as being invertible.
Clearly if h and n are of the same order of magnitude the estimate of γ(h) will be very poor. A guideline, proposed by Box and Jenkins (1976), says that we should use n ≥ 50 and h ≤ n/4 if possible. When R computes the sample auto-correlation function, see Example 3.5.4 and §3.11, it has a default maximum lag given n of h = 10 log10(n). So for n = 100 this gives h = 20, and if n = 500 then h = 27.
In using the sample auto-correlation function we need guidelines as to when an estimated correlation is ‘small enough’ that we could safely set it to zero.
Example 3.5.4. (Interpreting acf() plots) Suppose we look at i.i.d. N (0, 1) noise, then we know that the exact auto-correlation function is zero for all lags apart from h = 0. If we generate 100 observations from this model and compute the auto-correlation function we see that the estimated values are not exactly zero, for example see Fig. 3.1.
This is due to the sampling error of course. To deal with it we need to know the sampling distribution of ρ(h) for each value of h.
It can be shown (Brockwell and Davis, 2009, p. 222) that if the noise is i.i.d. normal then, approximately,

ρ̂(h) ∼ N(0, 1/n).

In the sample acf plot in Fig. 3.1 the blue dashed lines are set at ±1.96/√n, giving a 95%-confidence interval for each value of h. Now, care has to be taken with interpretation when an estimated correlation lies above, or below, these lines. Even when the model is exactly i.i.d. N(0, 1) we would expect that 5% of the estimates would, just by chance, lie outside. An example of this is the estimate at h = 12.
Figure 3.1: Gaussian noise and the corresponding sample auto-correlation function
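The "5% outside the band" point can be verified by simulation. A Python/NumPy sketch (illustrative, standing in for R's `acf()`): simulate i.i.d. noise, compute the sample acf, and count how many lags exceed ±1.96/√n.

```python
# For i.i.d. noise roughly 5% of sample autocorrelations should fall
# outside the +/- 1.96/sqrt(n) band.
import numpy as np

rng = np.random.default_rng(7)
n, max_lag = 1000, 20
x = rng.normal(size=n)
xbar = x.mean()
acvf = np.array([np.sum((x[h:] - xbar) * (x[:n - h] - xbar)) / n
                 for h in range(max_lag + 1)])
acf = acvf / acvf[0]
band = 1.96 / np.sqrt(n)
n_outside = int(np.sum(np.abs(acf[1:]) > band))   # expect about 1 of the 20 lags
```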
3.6 Computing the auto-covariance function
Rather than estimating the mean and variance structure purely from the data we can assume a model and compute the auto-covariance function exactly. This gains efficiency when the model is correct but, of course, could be a problem when the model is incorrectly specified, see §3.10.
Theorem 3.6.1. Consider an MA(1) process (see Example 3.1.6) of the form Xt := Zt + θZt−1, for Zt i.i.d. N(0, σ²) (or {Zt} ∼ WN(0, σ²)). Then we have the following results:

(i) The auto-covariance function is given by

γ(r, s) = σ²(1 + θ²)  for r − s = 0,
          σ²θ          for |r − s| = 1,
          0            for |r − s| > 1.

(ii) Xt is a stationary process for all values of θ and σ > 0.

Proof. For the first case r = s we have

γ(r, r) = Cov(Xr, Xr)
        = Cov(Zr + θZr−1, Zr + θZr−1)
        = Cov(Zr, Zr) + 2θ Cov(Zr, Zr−1) + θ² Cov(Zr−1, Zr−1)
        = σ² + 0 + σ²θ².

The other cases follow the same type of argument.

(ii) Follows by directly checking the conditions of Definition 3.2.2.
Example 3.6.2. (MA(1) example) Figure 3.1 shows an example of the auto- correlation function for two MA(1) processes. In the left hand panel θ = 0.6, while in the right panel θ = −0.6. We see that after h = 1 the values of the auto-correlation function are exactly zero.
Figure 3.1: The auto-correlation function of two MA(1) processes: left panel θ = 0.6, right panel θ = −0.6
Theorem 3.6.3. (AR(1) example) Let {Xt} be an AR(1) process defined by

Xt = φXt−1 + Zt,

for Zt i.i.d. N(0, σ²), or {Zt} ∼ WN(0, σ²).

Assuming that the process is stationary, |φ| < 1 and Xt is uncorrelated with Zt+h for h > 0, the auto-covariance function is given by

γ(h) = σ² φ^|h| / (1 − φ²).

Proof. By definition, and for h > 0,

γ(h) = Cov(Xt, Xt−h)
     = Cov(φXt−1 + Zt, Xt−h)
     = φ Cov(Xt−1, Xt−h) + Cov(Zt, Xt−h)
     = φ γ(h − 1) + 0.

Iterating this argument gives γ(h) = φ^h γ(0). Further,

γ(0) = Cov(Xt, Xt) = Cov(φXt−1 + Zt, φXt−1 + Zt) = φ² γ(0) + σ²,

hence

γ(0) = σ² / (1 − φ²).

Combining these results, and using the fact that γ(·) is an even function, gives

γ(h) = σ² φ^|h| / (1 − φ²).
Example 3.6.4. (AR(1) example) Figure 3.2 shows plots of the auto-correlation function for two AR(1) models. The left hand panel has φ = 0.6 and the right hand panel has φ = −0.6. For Theorem 3.6.3 to hold there are regularity conditions that need to be checked; in fact these conditions hold in these examples, as shown later in Theorem 3.8.4.

In both plots we see exponential decay, which is characteristic of the auto-correlation of an AR process. In the right hand plot the values of the correlation alternate between being positive and negative due to the sign of φ^h.
Figure 3.2: The auto-correlation function of two AR(1) process: left panel φ = 0.6, right panel φ = −0.6.
Example 3.6.5. (AR(1) forecast) Consider an AR(1) process with parameters μ = 2, φ = 0.5, σ² = 4 and suppose we have observed n = 3 observations: x1 = 1.48, x2 = 3.95, x3 = 0.38, and we want to forecast the next observation using the results of Theorem 3.4.4.

The auto-covariance is given by

γ(h) = σ² φ^|h| / (1 − φ²) = (16/3) (1/2^|h|).

So that in this case

γ(3,1) = (γ(1), γ(2), γ(3))^T = ( (16/3)(1/2), (16/3)(1/2²), (16/3)(1/2³) )^T

and

    ( γ(0) γ(1) γ(2) )         ( 16  8   4 )
Γ = ( γ(1) γ(0) γ(1) ) = (1/3) (  8 16   8 ),
    ( γ(2) γ(1) γ(0) )         (  4  8  16 )

with

                (  4 −2  0 )
Γ^{−1} = (1/16) ( −2  5 −2 ),
                (  0 −2  4 )

and we get

Γ^{−1} γ(3,1) = (0.5, 0, 0)^T.

The forecast for X4 would then be

μ + (0.5, 0, 0) (x3 − μ, x2 − μ, x1 − μ)^T = 2 + 0.5 (0.38 − 2) = 1.19.

Thus in this example the forecast of X4 only depends on the value of x3 and not on x1 and x2.

Example 3.6.6. (MA(1) forecast) Consider an MA(1) process with parameters μ = 2, θ = 2, σ² = 2 and suppose we have observed 10 observations: x1, …, x10. We want to forecast the value of X11. Using the methods of Theorem 3.4.4 we need the auto-covariance function, which is

γ(h) = σ²(1 + θ²) = 10  for h = 0,
       σ²θ = 4           for |h| = 1,
       0                 for |h| > 1.

To forecast X11 use n = 10 and h = 1, so

γ(10,1) = (γ(1), . . . , γ(10))^T = (4, 0, . . . , 0)^T

and

    ( 10  4  0  0 ···  0 )
    (  4 10  4  0 ···  0 )
Γ = (  0  4 10  4 ···  0 ),
    ( ...            ... )
    (  0 ···  0  0  4 10 )

giving a forecast

μ + (γ(1), . . . , γ(10)) Γ^{−1} (x10 − μ, . . . , x1 − μ)^T.

In this example we see that one of the problems with forecasting is that we may have to invert very large matrices. The matrix here has the structure of a so-called symmetric Toeplitz matrix, and there are fast algorithms which can be used to compute its inverse.
3.7 MA(q) processes
Definition 3.7.1. (Backward shift operator) The backward shift operator B acts on a time series {Xt} and is defined by

B Xt = Xt−1.

The difference operator ∇ is also defined on {Xt} via

∇Xt = Xt − Xt−1 = (1 − B) Xt,

where 1 here represents the identity operator.

We can combine operators as ‘polynomials’, thus

B⁰Xt = Xt, B Xt = Xt−1, B²Xt = Xt−2, B³Xt = Xt−3, …

and define

θ(B) := 1 + θ1 B + · · · + θq B^q.

Thus

θ(B) Xt = Xt + θ1 Xt−1 + · · · + θq Xt−q.
Definition 3.7.2. (MA(q) process) Let {Zt}, t ∈ Z be an i.i.d. sequence of N(0,σ2), or Zt ∼ WN(0,σ2), random variables. The series defined by
Xt :=θ(B)Zt =Zt +θ1Zt−1 +···+θqZt−q,
for all t, is called a qth-order moving average, and denoted by the notation
MA(q).
Theorem 3.7.3. (Moments of MA(q) process) (i) The mean of an MA(q) process is

E(Xt) = E(θ(B) Zt) = 0 for all t.

(ii) The auto-covariance of an MA(q) process is given by

γ(r, s) = σ² Σ_{j=0}^{q−|r−s|} θj θ_{j+|r−s|}   if |r − s| ≤ q,
          0                                     otherwise,

where θ0 := 1. (3.6)

(iii) All MA(q) processes are stationary.
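The auto-covariance formula in (ii) is one line of code. A Python/NumPy sketch (function name illustrative; the course uses R):

```python
# MA(q) auto-covariance, following Theorem 3.7.3 with theta_0 = 1.
import numpy as np

def ma_acvf(thetas, sigma2, h):
    """gamma(h) = sigma^2 * sum_{j=0}^{q-|h|} theta_j * theta_{j+|h|} for |h| <= q."""
    th = np.array([1.0] + list(thetas))
    h = abs(h)
    if h > len(thetas):
        return 0.0
    return sigma2 * float(np.sum(th[: len(th) - h] * th[h:]))
```

For q = 1 this reproduces Theorem 3.6.1: γ(0) = σ²(1 + θ²) and γ(1) = σ²θ.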
Example 3.7.4. (Auto-correlation of MA(2) process) The plots in Fig. 3.1 show the auto-correlation function ρ(h) for two MA(2) processes. The left hand panel shows the case θ1 = 0.6, θ2 = 0.4 and the right hand panel shows the case θ1 = −0.6, θ2 = 0.4. Both have the property that all correlations, and hence covariances, are zero after lag 2.
Figure 3.1: The auto-correlation function of two MA(2) processes: left panel θ1 = 0.6, θ2 = 0.4, right panel θ1 = −0.6, θ2 = 0.4.
3.8 The AR(p) process
Just as polynomials can be extended to infinite series, as long as we are careful about convergence, moving average processes can have infinite order. This is described in the Appendix 3.12. As an example of such an infinite moving average process consider the following argument about an AR(1) process.
Example 3.8.1. (Solving the AR(1) equation) The AR(1) process defined in Example 3.1.7 was, in Theorem 3.6.3, assumed to be stationary when |φ| < 1.

The process is defined implicitly as the solution of

Xt = φXt−1 + Zt,

for {Zt} ∼ WN(0, σ²). We can, at least informally, write it as

Xt = φXt−1 + Zt
   = φ(φXt−2 + Zt−1) + Zt
   = φ²Xt−2 + (Zt + φZt−1)
   = φ²(φXt−3 + Zt−2) + (Zt + φZt−1)
   = φ³Xt−3 + Zt + φZt−1 + φ²Zt−2
   ⋮
   = Zt + φZt−1 + φ²Zt−2 + φ³Zt−3 + · · · (3.7)

For this to be an MA(∞) process (Definition 3.12.1) we need the coefficients to form an absolutely convergent sum, i.e. Σ_{j=0}^∞ |φ^j| < ∞; but when |φ| < 1 we have, by the standard geometric series result,

Σ_{j=0}^∞ |φ|^j = (1 − |φ|)^{−1} < ∞.
We also see that in Equation (3.7) that Xt is only a function of Zs random variables where s ≤ t. Hence, from the properties of WN(0,σ2) we have that Xt is uncorrelated with Zt+h for h > 0, the last regularity condition of Theorem 3.6.3.
Definition 3.8.2. (The AR(p) process) Define the polynomial operator φ(B):=1−φ1B−φ2B2 −···−φpBp,
where B is the backward shift operator.
The AR(p) process is the process which is the stationary solution to the difference equation

φ(B) Xt = Zt, (3.8)

for {Zt} ∼ WN(0, σ²), when such a solution exists.
Definition 3.8.3. (Causal process) A causal process {Xt} generated by {Zt} ∼ WN(0,σ2) is one where each Xt is only a function of those Zs where s ≤ t.
Theorem 3.8.4. (Existence of stationary solution) There exists a station- ary solution to Equation (3.8) when, for z ∈ C, the polynomial
φ(z):=1−φ1z−φ2z2 −···−φpzp
has no roots which lie on the unit circle {z ∈ C||z| = 1}.
If all roots lie strictly outside the unit circle we say there is a causal
solution which can be written as
Xt = Σ_{j=0}^∞ ψj Zt−j, (3.9)
i.e. Xt is a function of ‘previous’ Zt values.
Proof. See Brockwell and Davis (2002) or Brockwell and Davis (2009). Example 3.8.5. (Examples of AR(p) processes) Assume that {Zt} ∼ WN(0, σ2).
(i) Let p = 2, and consider

Xt − (1/2)Xt−1 + (1/4)Xt−2 = Zt.

Look at the roots of the equation 1 − (1/2)z + (1/4)z² = 0: the roots, 1 ± √3 i, have modulus 2 and so lie outside the unit circle, hence there is a stationary solution and it has the causal form (3.9).

(ii) Let p = 2, and consider

Xt − Xt−1 + (1/4)Xt−2 = Zt.

The corresponding polynomial, 1 − z + (1/4)z², has the repeated root 2, which lies outside the unit circle, so there is a stationary causal solution.

(iii) Let p = 3, and consider

Xt − 5Xt−1 + 7Xt−2 − 3Xt−3 = Zt.

The corresponding polynomial has roots 1, 1, 1/3, so, since 1 lies on the unit circle, there is not a stationary solution to these equations.
Theorem 3.8.6. (Auto-covariance function for AR(p) process) If Xt is a causal, stationary AR(p) process it can be written as

Xt = Σ_{j=0}^∞ ψj Zt−j,

and hence its auto-covariance function can be written as

γ(h) = σ² Σ_{j=0}^∞ ψj ψ_{j+|h|}.
Example 3.8.7. (Examples of AR(p) processes) Figure 3.1 shows the acf for two stationary processes. The left hand panel is an AR(2) process defined by

Xt = 0.5Xt−1 + 0.25Xt−2 + Zt,

and the right hand panel an AR(3) process defined by

Xt = 0.5Xt−1 + 0.25Xt−2 − 0.1Xt−3 + Zt,

where Zt ∼ WN(0, σ²). Numerical checks of the polynomials φ(z) show all roots lie outside the unit circle, so there exist stationary solutions. Both panels show the characteristic exponential decay.
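The numerical root check mentioned above is a one-liner with a polynomial root finder. A Python/NumPy sketch (the course would do this in R; `numpy.roots` expects coefficients ordered from the highest power down):

```python
# Checking causality/stationarity by finding the roots of phi(z) numerically.
import numpy as np

def ar_roots(phis):
    """Roots of phi(z) = 1 - phi_1 z - ... - phi_p z^p."""
    coeffs = [-c for c in phis[::-1]] + [1.0]   # highest power first
    return np.roots(coeffs)

causal_left = bool(np.all(np.abs(ar_roots([0.5, 0.25])) > 1))         # AR(2) panel
causal_right = bool(np.all(np.abs(ar_roots([0.5, 0.25, -0.1])) > 1))  # AR(3) panel
```

For φ1 = 1 (the random walk) the single root is exactly on the unit circle, consistent with the non-stationarity noted earlier.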
Figure 3.1: The auto-correlation function of stationary AR(p) processes: left panel φ1 = 1/2, φ2 = 1/4; right panel φ1 = 1/2, φ2 = 1/4, φ3 = −0.1.
3.9 The ARMA(p,q) process
Definition 3.9.1. (The ARMA process) Let Zt ∼ WN(0, σ²), or i.i.d. N(0, σ²) random variables, and define the polynomial operators

φ(B) := 1 − φ1B − φ2B² − · · · − φpB^p

and

θ(B) := 1 + θ1B + · · · + θqB^q,

where B is the backward shift operator.

The ARMA(p,q) process is the process which is the stationary solution to the difference equations

φ(B) Xt = θ(B) Zt (3.10)

when such a solution exists, and we assume that the polynomials φ(z) and θ(z) share no common factors.
Theorem 3.9.2. (Existence of ARMA process) The ARMA(p,q) process defined by Equation (3.10) has a stationary solution if none of the roots of the polynomial φ(z) lie on the unit circle |z| = 1. Further, if the roots lie outside the unit circle (i.e.|z| > 1) then there is a causal stationary solution.
3.10 Estimation of ARMA models
In this section we shall look at the problem of the estimation of an ARMA(p, q)
model when p and q are considered known, and we are estimating the follow-
ing parameters for, an assumed stationary, time series. The second problem,
identifying p, q is discussed in §3.10.2. The parameters to be estimated are
the mean μ, the p-parameters in φ(B), the q-parameters in θ(B) and the
variance of the innovation process Zt ∼ WN(0,σ2), or if stronger paramet- i.i.d. 2
ric conditions are assumed, Zt ∼ N(0,σ ). There are a number of ways of estimating the parameters: method of moments (Yule-Walker), least squares and, if Normal errors are assumed, likelihood methods. We concentrate on the last of these in this course.
Throughout, we will assume that we have a zero mean stationary process. This is without loss of generality since, if E(Xt) = μ, we estimate μ (Theorem 3.5.2) with the sample mean and work with the process Xt − μ.
3.10.1 Likelihood estimation
In order to use likelihood methods for inference we need to make assumptions about the distributions involved in the ARMA(p, q) models. For simplicity we assume that the innovations Zt ∼ i.i.d. N(0, σ2). Let us further assume that we have observed X1 = x1, . . . , Xn = xn. For any ARMA(p, q) process the observed (X1, . . . , Xn) are a linear function of the unobserved (Z1, . . . , Zn) ∼ N(0n, σ2 In×n), thus the distribution of (X1, . . . , Xn) is also normal, and the computational issue is to compute the mean and the variance-covariance function.
Definition 3.10.1. (The likelihood function) The likelihood function for a mean zero, stationary ARMA(p, q) process is

Lik(φ, θ, σ2) = (2π)−n/2 det(Γn)−1/2 exp( −(1/2) XT Γn−1 X ),

where Γn is the variance-covariance matrix for X, the n × n matrix with (i, j)th entry γ(|i − j|):

Γn = [ γ(0)     γ(1)     ···  γ(n−1) ]
     [ γ(1)     γ(0)     ···  γ(n−2) ]
     [ ...      ...      ...  ...    ]
     [ γ(n−1)   γ(n−2)   ···  γ(0)   ].
The optimisation of the log-likelihood cannot be done exactly; rather, numerical methods such as the Newton-Raphson algorithm have to be used. In particular there are two non-linear functions involved. The first is the function which takes the parameters to the auto-covariance function,

φ, θ, σ2 → γ(h) → Γn. (3.11)

The second is the function which inverts the (potentially large) matrix,

Γn → Γn−1. (3.12)

Finally we estimate σ2 with the residual sum of squares, as in regression.

Theorem 3.10.2. (Properties of MLE) Let the parameters of an ARMA(p, q) model be given by

β = (φ1, ··· , φp, θ1, ··· , θq),

and let β̂ be the maximum likelihood estimate; then for large n the sampling distribution of β̂ is approximately

N(β, n−1 V(β)),

where V(β) is a (p + q) × (p + q) positive definite matrix.

Rather than give the general form of V(β) let us look at some special cases.
Theorem 3.10.3. (MLE for AR(p)) For an AR(p) model we have V(φ) = σ2 Γp−1, so that:

1. When p = 1,
   V(φ) = 1 − φ1².

2. When p = 2,
   V(φ) = [ 1 − φ2²       −φ1(1 + φ2) ]
          [ −φ1(1 + φ2)   1 − φ2²     ].

Theorem 3.10.4. (MLE for MA(q)) For an MA(q) model two important special cases are:

1. When q = 1,
   V(θ) = 1 − θ1².

2. When q = 2,
   V(θ) = [ 1 − θ2²       −θ1(1 + θ2) ]
          [ −θ1(1 + θ2)   1 − θ2²     ].

Theorem 3.10.5. (MLE for ARMA(1, 1)) For a causal and invertible ARMA(1, 1) model we have

V(φ, θ) = (1 + φθ)/(φ + θ)² × [ (1 − φ²)(1 + φθ)     −(1 − θ²)(1 − φ²) ]
                               [ −(1 − θ²)(1 − φ²)   (1 − θ²)(1 + φθ)  ].
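For instance, for an AR(1) model Theorem 3.10.3 gives V(φ) = 1 − φ1², so the large-sample standard error of the MLE is √((1 − φ²)/n). A small numerical sketch (in Python; the helper name is ours, not from any library):

```python
import math

def ar1_se(phi, n):
    """Large-sample standard error of the MLE of phi in an AR(1) model:
    sqrt(V(phi)/n) with V(phi) = 1 - phi^2."""
    return math.sqrt((1.0 - phi ** 2) / n)

# The closer |phi| is to 1, the more precisely it is estimated.
print(ar1_se(0.5, 100), ar1_se(0.9, 100))
```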
Of course, to use the likelihood approach we have to assume Normally distributed errors; if weaker assumptions are made, inference can still be done: the method of moments approach is called the Yule-Walker algorithm, and least squares methods also exist. In general all these algorithms require numerically inverting large matrices. When computing power was less than it is today, algorithms such as the innovations algorithm and the Durbin-Levinson algorithm were developed; for more details see Brockwell and Davis (2002) or Brockwell and Davis (2009).
3.10.2 Identification of ARMA models
The previous section worked with the assumption that p and q in the ARMA model were known. In practice we need to go through a model selection and checking procedure just as in regression modelling. As we have seen in Chapter 2, one major problem which can seriously affect the ability of a model to forecast correctly is over-fitting. Just as in Chapter 2, penalised likelihood methods such as AIC and AICc can be very helpful in finding the model structure. However, in general all such ‘black-box’ model selection procedures must be used with care and the analyst should always check that the fitted model makes sense in the context of the given problem.
Definition 3.10.6. (AIC) The Akaike information criterion (AIC) is de- fined as
AIC := −2l(βˆ) + 2k
where l(βˆ) is the maximum value of the log-likelihood for a given model, and k is the number of parameters in the model.
The AICc corrects for small samples and is given by

AICc := AIC + 2k(k + 1)/(n − k − 1).
For both criteria, in a given set of models the preferred model is the one with the minimum criterion value. The idea is to penalise more complicated models (i.e. ones with more parameters) and reward simple models, as long as they fit well, of course.
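As a concrete sketch of the two formulas (in Python; in R the AIC is reported directly by the arima( ) fit):

```python
def aic(loglik, k):
    """Akaike information criterion: -2 * maximised log-likelihood + 2k."""
    return -2.0 * loglik + 2 * k

def aicc(loglik, k, n):
    """Small-sample corrected AIC: AIC + 2k(k+1)/(n-k-1)."""
    return aic(loglik, k) + 2.0 * k * (k + 1) / (n - k - 1)

# e.g. a fit with maximised log-likelihood -713.57 and k = 4 estimated
# parameters (three ARMA coefficients plus sigma^2):
print(aic(-713.57, 4))        # approx. 1435.14
print(aicc(-713.57, 4, 500))  # slightly larger, the correction for n = 500
```

Minimising either criterion over a candidate set of models then performs the selection.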
We can also use graphical methods to get some idea of the structure of certain types of ARMA(p, q) models.
Example 3.10.7. (Using the ACF) From Theorem 3.7.3 and the example in Fig. 3.1 we see that if the process is MA(q) then its auto-covariance function will be zero after lag q. Hence inspection of the sample auto-covariance (or auto-correlation) plot gives information about q.
This method will not work for an AR(p) model; for example, Fig. 3.1 shows two models with different values of p having similar looking auto-correlation functions. There is a plot that will help to find p, called the partial auto-correlation plot; see Appendix 3.A for a formal definition. Here we only seek to interpret the plot.
Example 3.10.8. (Using the PACF) Figure 3.1 shows the pacf for the same models as above. The plots of the acf above both had the characteristic exponential decay of AR(p) processes. In contrast the pacfs, shown in Fig. 3.1, look different. The one for the AR(2) process has non-zero values only for h = 1, 2 while the one for the AR(3) process has non-zero values for h = 1, 2, 3. This shows how the pacf can be used to give information about p in an AR(p) process.
Theorem 3.10.9. (Partial auto-correlation function) If α(h) is the partial auto-correlation function for a stationary AR(p) process then α(k) = 0 for |k| > p.
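The cut-off in Theorem 3.10.9 can be checked numerically from a theoretical autocorrelation function via the Durbin–Levinson recursion (here a Python sketch of the recursion; R computes the sample version with pacf( )):

```python
def pacf_from_acf(rho, max_lag):
    """Partial autocorrelations alpha(1), ..., alpha(max_lag) from an
    autocorrelation function rho (with rho(0) = 1), computed with the
    Durbin-Levinson recursion."""
    coef = {1: rho(1)}          # coefficients a_{k,j} of the order-k predictor
    alpha = [rho(1)]
    for k in range(2, max_lag + 1):
        num = rho(k) - sum(coef[j] * rho(k - j) for j in range(1, k))
        den = 1.0 - sum(coef[j] * rho(j) for j in range(1, k))
        akk = num / den
        coef = {j: coef[j] - akk * coef[k - j] for j in range(1, k)}
        coef[k] = akk
        alpha.append(akk)
    return alpha

# For an AR(1) process rho(h) = phi^h, so alpha(1) = phi and alpha(h) = 0
# for h > 1, illustrating the cut-off at lag p.
phi = 0.7
alpha = pacf_from_acf(lambda h: phi ** h, 5)
print(alpha)
```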
Outside the cases of pure MA(q) and AR(p) these graphical methods cannot give definite answers for model selection; indeed, even in the cases they are designed for, it takes a good deal of experience (and luck) to be able to confidently find the ‘correct’ model.
Figure 3.1: The partial auto-correlation function of stationary AR(p) processes: left panel φ1 = 1/2, φ2 = 1/4; right panel φ1 = 1/2, φ2 = 1/4, φ3 = −0.1. [Both panels plot the pacf against lags 1–10.]
3.11 Using R for ARMA modelling
In this section we look at some basic R, (R Core Team, 2013), commands for working with ARMA-processes. We start by simulation and plotting of realisations of ARMA models.
Example 3.11.1. (Using R) The R command:
sim.ar2 <- arima.sim(n = 100, model=list(ar = c(0.5, 0.25)), sd = sqrt(2))

simulates a realisation of an AR(2) process of the form

φ(B)Xt = Zt,

where Zt ∼ N(0, 2) and

φ(B) := 1 − 0.5B − 0.25B2,

that is,

Xt = φ1Xt−1 + φ2Xt−2 + Zt = 0.5Xt−1 + 0.25Xt−2 + Zt.
The object sim.ar2 is a ts object in R and has some built in attributes, for example
> print.ts(sim.ar2)
Time Series:
Start = 1
End = 100
Frequency = 1
1.48885878 4.21586053 4.06171606 3.63421248
[ … ]
[97] 1.05016201 -1.54411515 -0.21416876 -0.79918265
We can plot the time series, its auto-correlation function and its partial
auto-correlation function using
plot.ts(sim.ar2)
acf(sim.ar2)
pacf(sim.ar2)
Figure 3.1: Simulated AR(2) model: the time series, its acf and pacf plots
If you try to simulate from parameter values which do not have a stationary solution you get an error message and no output.
> arima.sim(n = 100, model=list(ar = c(0.5, 0.5)), sd = sqrt(2))
Error in arima.sim(n = 100, model = list(ar = c(0.5, 0.5)), sd = sqrt(2)) :
’ar’ part of model is not stationary
Example 3.11.2. Generating an MA(q) time series is similar; for example we can generate a sample of size 500 from an MA(2) model such that
Xt = Zt + 2Zt−1 + 5Zt−2
where Zt ∼ N (0, 10) via:
sim.ma2 <- arima.sim(n = 500, model=list(ma = c(2, 5)), sd = sqrt(10))
Example 3.11.3. For an ARMA(1, 2) example, to generate a sample of size 500 from a model such that
Xt + 0.6Xt−1 = Zt + 0.6Zt−1 − 0.3Zt−2
we would use
sim.arma12 <- arima.sim(n = 500, list(ar=c(-0.6), ma = c(0.6, -0.3)),
sd = sqrt(1))
Figure 3.2: Simulated MA(2) model: the time series, its acf and pacf plots

Figure 3.3: Simulated ARMA(1, 2) model: the time series, its acf and pacf plots
Fitting models
If we assume, for the moment, that we know the values of p and q in the ARMA(p,q) structure then we can estimate the values of the parameters, using the arima( ) function.
Example 3.11.4. Let us go back to the ARMA(1, 2) model in Example 3.11.3. We use the function
> arima(sim.arma12, order=c(1, 0,2), include.mean=F)
Coefficients:
ar1 ma1 ma2
-0.5891 0.5152 -0.3640
s.e. 0.0653 0.0701 0.0512
sigma^2 estimated as 1.015: log likelihood = -713.57, aic = 1435.14
From this output we see the following. First we can compare the true parameter values, which we know since this is simulated data, to the estimated values and standard errors – and so 95%-confidence intervals. We see that the true values lie inside the confidence intervals, and the confidence intervals are quite narrow since this example uses a lot of data.

True        Estimated   S.E.     Confidence Interval
φ1 = −0.6   −0.5891     0.0653   (−0.72, −0.46)
θ1 = 0.6    0.5152      0.0701   (0.38, 0.65)
θ2 = −0.3   −0.3640     0.0512   (−0.46, −0.26)
In this fit we used the option include.mean=F since we assumed it was known that this is a mean zero time series. If the mean also had to be estimated then we would drop this option from the function call.
We can now look at the quality of the fit. Just as in regression analysis we use residuals, here estimates of the innovation process Zt, which should be white noise or, if we assume a Gaussian process, i.i.d. Normal data. We do this with the command
arima(sim.arma12, order=c(1, 0,2))$residual
which outputs the 500 residual values. Again, as in regression, we can use plots to evaluate the model fit. For example, in Fig. 3.4 we plot the estimated residuals, the acf plot of the residuals – which looks like white noise in this case – and a QQ plot, which shows the residuals do indeed look like they come from a normal distribution.
Estimating the structure
The previous section assumed that p and q are known. Of course in practice these are not known and a model selection procedure has to be undertaken.
Figure 3.4: Residual plots: the time series, its acf and QQ norm plot
This is typically done using a mixture of graphical and numerical tools. The following rules of thumb are helpful, but should not be over-interpreted. We first recall the definition of the

±1.96/√n

dashed lines which we get in R’s acf( ) and pacf( ) output. These were discussed in Example 3.5.4 and can be thought of as pointwise 95%-confidence intervals for the hypothesis that the correlation at lag h is zero.
Example 3.11.5. For an MA(q) we would expect that the auto-correlation function is zero for all lags greater than q, and it can be shown that the partial auto-correlation function has exponential decay.
For example, consider Fig. 3.2. The model which generated the data was MA(2) and we see that in the sample acf plot we have estimated values outside the dashed lines for h = 0, 1, 2. The partial auto-correlation function is negligible for all h > 1 apart from h = 22, and we can plausibly argue that this is just because we would expect 5% of the estimated values to lie outside the lines even if the true values were actually zero.
Example 3.11.6. For an AR(p) model we expect that the auto-correlation function shows exponential decay, while the partial auto-correlation function is zero for all h > p. We can see this in Fig. 3.1, where the pacf has non-zero values only for small h and the acf plot shows exponential decay.
For general ARMA(p, q) models there are unfortunately no simple patterns that always appear, and even the patterns described in Examples 3.11.5 and 3.11.6 tend to require quite large sample sizes to be used reliably.
The second way of doing model selection is to use the AIC or AICc values from the fits of the models. Since these use only the number of parameters, they will give the same penalty to, for example, an AR(2), an ARMA(1, 1) and an MA(2) model. In general it will rarely be the case that the model can be uniquely identified, and it is advisable to compare forecasts from different plausible models to see if there are any major differences.
Forecasting
Once a model has been fitted using the arima( ) function, then predict( , n.ahead= ) can be used to make forecasts h steps ahead and compute the standard error of the forecast. This has already been shown in Fig. 3.1 and 3.2. These were generated with the following code. The data was a ts-object called sim.ma2.forecast, and we assume this came from an AR(2) model; this was then fitted and its (estimated) mean and auto-covariance function were used, as in §3.4, to make a point forecast and estimate the variance around it. Forecasts for h = 1, 2, 3 are given by the code
> predict(arima(sim.ma2.forecast, order=c(2, 0,0)), n.ahead=3)
$pred
Time Series:
Start = 201
End = 203
Frequency = 1
[1] 13.30448 12.90422 12.75071
$se
Time Series:
Start = 201
End = 203
Frequency = 1
[1] 3.017765 3.780846 4.418923
3.12 MA(∞) processes
Definition 3.12.1. Let Zt, t ∈ Z be an i.i.d. sequence of N(0, σ2) (or Zt ∼ WN(0, σ2)) random variables. Let {ψj}, j = 0, 1, ··· be a sequence which is absolutely convergent, i.e.

Σ_{j=0}^{∞} |ψj| < ∞;

then the process defined by

Xt := Σ_{j=0}^{∞} ψj Zt−j,

for all t, is called an infinite-order moving average, or MA(∞), process.
Theorem 3.12.2. The MA(∞) process of Definition 3.12.1 is stationary with zero mean and auto-covariance function

γ(h) = σ2 Σ_{j=0}^{∞} ψj ψj+|h|.
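This sum can be checked numerically by truncation. For example, a causal AR(1) process has the MA(∞) representation ψj = φ^j, giving γ(h) = σ2 φ^|h|/(1 − φ²). A Python sketch (the truncation point N is an arbitrary choice of ours):

```python
def ma_inf_acvf(psi, sigma2, h, N=500):
    """Truncated version of gamma(h) = sigma2 * sum_{j>=0} psi(j) psi(j+|h|)
    for an MA(infinity) process with coefficients psi(0), psi(1), ..."""
    h = abs(h)
    return sigma2 * sum(psi(j) * psi(j + h) for j in range(N))

# AR(1) with phi = 0.5: psi(j) = phi^j, so gamma(h) = phi^|h| / (1 - phi^2).
phi = 0.5
gamma = lambda h: ma_inf_acvf(lambda j: phi ** j, 1.0, h)
print(gamma(0), gamma(3))
```

The truncation error is of order φ^(2N), negligible here.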
Theorem 3.12.3. (Wold decomposition theorem) Any stationary process can be written as the sum of an MA(∞) process and an independent deterministic process, where a deterministic process is any process whose complete realisation is a deterministic function of a finite number of its values.
Example 3.12.4. The process
Xt = Z1 cos(2πt/100) + Z2 sin(2πt/100),
defined in Example 3.2.7 is a deterministic process, where Z1,Z2 are two independent N(0, σ2) random variables.
Appendix
3.A Partial Correlation
Definition 3.A.1. If X, Y and Z are random variables the (sample) partial correlation of X and Y given Z, denoted by ρxy.z is the (sample) correlation between rX and rY , where these are the residuals from the linear regressions of X on Z and Y on Z, respectively.
Example 3.A.2. (Adapted from Mardia et al. (1979)[page 170]) Suppose we had 20 observations on verbal skills (x), weight (y) and age (z) for a group of children. If we plot the verbal skills score against weight (see Fig. 3.A.1 (a)) we see a high positive correlation (0.89).
This we think is spurious and just due to the effect of age. We can control for the effect of age; regress x and y on age z. If we look at the plot of the corresponding residuals (see Fig. 3.A.1 (b)) we see almost no correlation left (-0.21). Panels (c) and (d) show the very strong common relationship between the two variables and age, which is giving the confounding.
Definition 3.A.3. The partial auto-correlation function (pacf) for a stationary process, {Xt}, is defined by

α(h) = Cor(X1, X0)                     for |h| = 1,
       ρ_{Xh X0 · {Xh−1, ..., X1}}     for |h| > 1,

where ρ_{Xh X0 · {Xh−1, ..., X1}} is the partial correlation of Xh and X0 given the set {Xh−1, ..., X1}.
3.B Other stationary processes
The forecasting techniques in this chapter are designed for stationary processes of the form ARMA(p, q) where, in practice,

p + q ≪ n

to ensure that the number of parameters is not too large. These give rise to linear predictions which are basically short term – in the long term all forecasts are just the mean. The length of time which counts as short-term is determined by the structure of the auto-covariance function.
Figure 3.A.1: Partial Correlation: (a) scatterplot X, Y; (b) residuals rX, rY; (c) Y, Z; (d) X, Z
We finish this chapter with a couple of examples to show that there are time series outside this parsimonious ARMA(p, q) structure which can behave qualitatively quite differently.
3.B.1 Long memory processes
A long memory process is one where the auto-covariance function decays more slowly than the exponential decay that we see in a stationary ARMA process. A process like a random walk has an auto-covariance function which does not decay at all – it is constant. Long memory processes therefore lie somewhere between a stationary ARMA(p, q) process and a non-stationary random walk. They are stationary, but the auto-covariance decays like a polynomial in h – much slower than exponentially.
Figure 3.B.1 shows an example of such a process. In the left hand panel a realisation of n = 1000 points is shown and the corresponding (sample) auto-correlation function is plotted in the right hand panel. We see the very slow decay of the function clearly.
3.B.2 Gaussian processes
Another class of processes which can be used in modelling are the Gaussian processes defined in Example 3.1.4. These models make explicit assumptions about the joint distribution of the observed variables, i.e. {Xt1,...,Xtn}.
Figure 3.B.1: A time series with long-term dependency
Since the model is assumed Gaussian we need to define an n-dimensional mean and an n × n-dimensional variance-covariance matrix, and then add conditions which ensure stationarity.
One class of these is the Matern class, which has a single smoothness parameter ν. In Fig. 3.B.2 four realisations of Matern class processes are shown, with ν = 0.5, 0.75, 1 and 2. One thing is immediately clear: the larger the value of ν, the ‘smoother’ the realisation.
3.C Power Series Expansions

We use, at least formally, the expansion of rational functions of z ∈ C in terms of a power series. Here are some examples:

(1 − z)−1 = 1 + z + z2 + z3 + O(z4),
(1 + z)−1 = 1 − z + z2 − z3 + O(z4),
(1 − 2z)−1 = 1 + 2z + 4z2 + 8z3 + O(z4).

Within the radius of convergence the power series has absolute convergence properties, so behaves like a polynomial. For example you can multiply series term by term:

(1 − z + z2 − z3 + O(z4)) × (1 + 2z + 4z2 + 8z3 + O(z4))

can be written as

1 + (2 − 1)z + (4 − 2 + 1)z2 + (8 − 4 + 2 − 1)z3 + O(z4) = 1 + z + 3z2 + 5z3 + O(z4).
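This term-by-term multiplication is just a truncated Cauchy product, which is easy to verify numerically (a Python sketch with our own helper function):

```python
def multiply_series(a, b):
    """Cauchy product of two power series given by coefficient lists
    (coefficients of z^0, z^1, ...), truncated to the shorter length."""
    order = min(len(a), len(b))
    return [sum(a[i] * b[k - i] for i in range(k + 1)) for k in range(order)]

inv_1_plus_z = [1, -1, 1, -1]    # (1 + z)^-1 up to O(z^4)
inv_1_minus_2z = [1, 2, 4, 8]    # (1 - 2z)^-1 up to O(z^4)
print(multiply_series(inv_1_plus_z, inv_1_minus_2z))  # [1, 1, 3, 5]
```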
Figure 3.B.2: Simulations of four Matern class Gaussian processes: ν = 0.5, 0.75, 1 and 2. These are progressively smoother.
4 The Box-Jenkins approach to forecasting and control
4.1 Introduction
The key result of Chapter 3 is Theorem 3.4.4. This states that for a stationary time series {Xt}, with mean μ and auto-covariance function γ(h), the best linear predictor, in terms of mean square error, of Xn+h based on X1, . . . , Xn is given by

Pred(Xn+h|X1, . . . , Xn) := μ + (a1, . . . , an)T (Xn − μ, Xn−1 − μ, . . . , X1 − μ)T, (4.1)

where a := (a1, . . . , an) satisfies the equation

Γa = γ(n, h) := (γ(h), . . . , γ(h + n − 1))T,

and Γ is the n × n matrix with ijth element Γij = γ(|i − j|).
While this is important as a theoretical result it does not have many direct applications, because most observed time series are not stationary. Two key observations, though, mean we can extend the theorem to cases of practical importance.
1. Fortunately, a large number of observed time series do fall into the class of linear transformations of stationary processes. These include examples with non-constant means, strong seasonal effects, and non-constant variances, which are not themselves stationary. An example is Example 4.2.1.
2. Theorem 3.4.4 only uses linear functions, means and covariances and we know how all these behave under linear transformations. This allows a powerful generalisation, Theorem 4.1.1.
These generalisations do not come completely for ‘free’, and some extra assumptions will be needed.
The following theorem is a generalisation of Theorem 3.4.4 but note that it does not use the stationarity condition.
Theorem 4.1.1. Let U be a random variable, such that E(U2) < ∞, and W an n-dimensional random vector with a finite variance-covariance matrix

Γ := Cov(W, W).

In terms of mean square error the best linear predictor, Pred(U|W), of U based on W = (W1, . . . , Wn) satisfies:

(i) Pred(U|W) = E(U) + aT (W − E(W)), where Γa = Cov(U, W).

(ii) E([U − Pred(U|W)]) = 0, i.e. the estimator is unbiased.

(iii) E([U − Pred(U|W)] W) = 0, the zero vector.

(iv) Pred(α0 + α1U1 + α2U2|W) = α0 + α1 Pred(U1|W) + α2 Pred(U2|W), i.e. the predictor is linear.

(v) Pred(Wi|W) = Wi.
Proof. (i) follows by the same arguments as Theorem 3.4.4 – just the notation changes. The rest of these results follow directly from (i).
4.2 Non-stationary time series
The following set of examples illustrates the range of useful linear transfor- mations of stationary processes which can be used in modelling real data.
Example 4.2.1. Consider the random walk with drift, defined by an initial condition X0 and, for t ≥ 1,

Xt := X0 + Σ_{i=1}^{t} Zi,

where Zt ∼ i.i.d. N(μ, σ2); then E(Xt) = μt and Var(Xt) = tσ2. If μ ≠ 0 the mean depends on t – we say it is not mean stationary. Further, the variance depends on t, so it is not variance stationary.

We first note that {Xt} is a linear function of the stationary process {Zt} and X0, and that for t ≥ 1

Zt = ∇Xt = Xt − Xt−1 = (1 − B)Xt.

So there is a linear transformation, differencing, which was defined in Definition 3.7.1, which recovers the stationary process. Here we have

∇Xt ∼ μ + ARMA(0, 0), (4.2)

a stationary process with a mean that needs to be estimated.
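That differencing recovers the driving noise is a deterministic identity, easily checked by simulation (a Python sketch of what arima.sim and diff do in R):

```python
import random

random.seed(1)
mu, sigma, n = 0.5, 1.0, 200
z = [random.gauss(mu, sigma) for _ in range(n)]  # i.i.d. N(mu, sigma^2)

# Random walk with drift: X_t = X_0 + Z_1 + ... + Z_t.
x = [3.0]                                        # arbitrary X_0
for z_t in z:
    x.append(x[-1] + z_t)

# First differences recover the Z_t (up to floating-point rounding).
dx = [x[t] - x[t - 1] for t in range(1, len(x))]
print(max(abs(d - z_t) for d, z_t in zip(dx, z)))
```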
Example 4.2.2. Let Zt ∼ WN(0, σ2) and define Xt = α0 + α1 t + Zt, a linear transformation. We have that E(Xt) = α0 + α1 t, so {Xt} is not mean stationary. Applying the differencing operator gives

∇Xt = α1 + (Zt − Zt−1).

Now (Zt − Zt−1) is an MA(1) process with θ1 = −1, and is stationary; hence we have

∇Xt ∼ α1 + ARMA(0, 1), (4.3)

where a mean parameter needs to be estimated.
4.2.1 ARIMA modelling
Examples 4.2.1 and 4.2.2 show that non-stationary processes can be closely linked to stationary ones, with differencing being the key tool. We can define a broad class of such examples.
Definition 4.2.3. (dth-order differencing) The operator ∇d is called the dth-order differencing operator and is defined by

∇dXt = (1 − B)dXt.

For example,

∇2Xt = (1 − B)(Xt − Xt−1) = Xt − 2Xt−1 + Xt−2.
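A quick numerical check (in Python) that applying first differencing twice matches the expanded second-difference formula:

```python
def diff(x):
    """First difference: (1 - B) x."""
    return [x[t] - x[t - 1] for t in range(1, len(x))]

x = [2.0, 5.0, 1.0, 4.0, 9.0]
twice = diff(diff(x))                                     # (1 - B)^2 x
direct = [x[t] - 2 * x[t - 1] + x[t - 2] for t in range(2, len(x))]
print(twice, direct)  # both [-7.0, 7.0, 2.0]
```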
Definition 4.2.4. (ARIMA model) If d is a non-negative integer, then {Xt} is an ARIMA(p, d, q) process if
Yt := ∇dXt is a causal ARMA(p, q) process.
4.3 Forecasting ARIMA models
If a non-stationary time series is close to being stationary – in the sense that there exists a simple linear transformation which reduces the process to a stationary one with only a small number of parameters – then we might hope that optimal h-step ahead forecasting would be possible.
Example (4.2.1 revisited). Suppose we wish to compute the best linear prediction of Xn+1 based on X0, X1, . . . , Xn, i.e. Pred(Xn+1|X0, X1, . . . , Xn). Now, by Theorem 4.1.1, and since

Xn+1 := X0 + Σ_{i=1}^{n+1} Zi,

we have

Pred(Xn+1|X0, X1, . . . , Xn) = Pred(X0 + Σ_{i=1}^{n+1} Zi | X0, X1, . . . , Xn)
= Pred(Xn + Zn+1 | X0, X1, . . . , Xn)
= Xn + Pred(Zn+1 | X0, X1, . . . , Xn).

Since X0, X1, . . . , Xn are an invertible linear function of X0, Z1, . . . , Zn they will have the same best linear predictors, so we have

Pred(Xn+1|X0, X1, . . . , Xn) = Xn + Pred(Zn+1|X0, Z1, . . . , Zn). (4.4)

To compute this with the same tools we have seen in Chapter 3 we have to make an additional assumption: that Zt is uncorrelated with X0. This means we have

Pred(Xn+1|X0, X1, . . . , Xn) = Xn + Pred(Zn+1|Z1, . . . , Zn), (4.5)

where the second term is the best linear predictor for an ARMA(0, 0) process with an unknown mean.
Example 4.3.1. (ARIMA(1, 2, 1) forecast) Suppose that we have the model

Xt ∼ ARIMA(1, 2, 1),

and we want to forecast Xn+2 based on X−1, X0, X1, . . . , Xn. By definition we have that Yt := ∇2Xt = Xt − 2Xt−1 + Xt−2 is a causal ARMA(1, 1) process. So

Xn+2 = Yn+2 + 2Xn+1 − Xn,

and

P(Xn+2|X−1, X0, X1, . . . , Xn) = P(Yn+2|X−1, X0, X1, . . . , Xn) + 2P(Xn+1|X−1, X0, X1, . . . , Xn) − Xn.

We treat each of these terms in turn.

1. First we have, since Y1, . . . , Yn is a linear function of X−1, X0, X1, . . . , Xn, that

P(Yn+2|X−1, X0, X1, . . . , Xn) = P(Yn+2|X−1, X0, Y1, . . . , Yn),

and if we make the new assumption that the Y variables are uncorrelated with X−1, X0 then we have

P(Yn+2|X−1, X0, Y1, . . . , Yn) = P(Yn+2|Y1, . . . , Yn),

which is the h = 2 forecast of a causal ARMA(1, 1) process, which was studied in §3.9 via Theorem 3.4.4.

2. The term P(Xn+1|X−1, X0, X1, . . . , Xn), using the same argument as above, can be written as

P(Yn+1|Y1, . . . , Yn) + 2Xn − Xn−1,

and hence, again, is computable using the theory of stationary processes in Chapter 3.

Thus by recursively estimating previous terms we can build up an estimate of Xn+2, as long as we include extra conditions on the lack of correlation between initial conditions and the driving stationary process.

The ideas behind Example 4.3.1 extend directly to produce h-step ahead forecasts for any ARIMA(p, d, q) model, where we work recursively, building the h-step forecast from the (h − 1)-step forecast, etc. Each time we do need to add assumptions on the lack of correlation between initial conditions and the driving causal ARMA(p, q) process. For full details see (Brockwell and Davis, 2002, Page 198).
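For d = 2 the undifferencing recursion can be sketched directly: given forecasts of the differenced ARMA process, we rebuild the forecasts of Xt via Xt = Yt + 2Xt−1 − Xt−2. A Python sketch with made-up (hypothetical) history and forecast values:

```python
def second_diff(x):
    """(1 - B)^2 x, i.e. x_t - 2 x_{t-1} + x_{t-2}."""
    return [x[t] - 2 * x[t - 1] + x[t - 2] for t in range(2, len(x))]

history = [1.0, 1.5, 2.2]    # ..., X_{n-2}, X_{n-1}, X_n (hypothetical values)
y_hat = [0.30, 0.12]         # ARMA forecasts of Y_{n+1}, Y_{n+2} (hypothetical)

# Undifference recursively: X_t = Y_t + 2 X_{t-1} - X_{t-2}.
path = list(history)
for y in y_hat:
    path.append(y + 2 * path[-1] - path[-2])

# Differencing the extended path reproduces the Y forecasts.
print([round(v, 10) for v in second_diff(path)[-2:]])  # [0.3, 0.12]
```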
4.4 Using R for ARIMA modelling
We have already seen, in §3.11, how to use R functions to simulate, fit and predict from ARMA(p, q) processes, and almost exactly the same methods work for a general ARIMA(p, d, q) process. There are the same two types of problem: (i) how to work with the models for which p, d, q are known, and (ii) how to find appropriate values for p, d, q.
Fitting ARIMA models
Example 4.4.1. (ARIMA model in R) Suppose we have a time series in the ts-object sim.arima.forecast and that this has been correctly identified as being an ARIMA(0, 1, 1) model – that is, of the form of Example 4.2.2.
We plot the data in Fig. 4.1. The top left panel shows the time series and the right panel shows the plot of the first difference, by using the diff( ) function.
We see that the original data does not ‘look’ stationary – it is not centred around a time-independent mean with a fixed variance. If we use the acf( ) function – which really we shouldn’t, as it is only defined for stationary series – we get a plot where the correlations decay to zero only very slowly. On the other hand the auto-correlation of the first difference looks like that of a stationary series.
If we know the structure of the model we can fit it using the arima( ) function:
Figure 4.1: Simulation of ARIMA(0, 1, 1) model: top left panel simulated data, top right first difference process, bottom left panel auto-covariance function of series, bottom right panel auto-covariance of first difference
> arima(sim.arima.forecast, order=c(0,1,1))
Coefficients:
ma1
0.5407
s.e. 0.0414
sigma^2 estimated as 4.095: log likelihood = -1062.07, aic = 2128.14
getting the parameter estimates, standard errors, log-likelihood and AIC values with an assumption of Gaussian errors. The output from this function can go into the predict( ) function in the following way.
> predict(arima(sim.arima.forecast, order=c(0,1,1)), n.ahead=2) $pred
Time Series:
Start = 502
End = 503
Frequency = 1
[1] 44.97003 44.97003
$se
Time Series:
Start = 502
End = 503
Frequency = 1
[1] 2.023546 3.71677
Figure 4.2 shows the prediction of h = 20 steps ahead for this model. We see, unlike the stationary time series example, that the uncertainty in the forecast grows very large for large h. This is a result of the fact that we are trying to forecast a non-stationary process.
Forecasting
Figure 4.2: Forecasting using the R predict( ) function. Blue is the point forecast and red is ±1.96 standard errors
Estimating the ARIMA structure
All of the above analysis required us to know the model structure. In general it is not easy to identify the model structure exactly, but often the search can be reduced to a small number of plausible models. One key trick with a non-stationary time series is to difference it a small number of times and look to see if you can find an acceptable stationary ARMA(p, q) model for the differenced process.
Figure 4.3 shows a concrete example. The top row shows the auto-correlations for an undifferenced, once differenced and twice differenced simulated series, with the bottom row showing the partial acfs. We see the very slow decay of the auto-correlation which is typical of some non-stationary series. After two differences we see acf and pacf plots which are consistent with stationarity. We could interpret the acf plot as exponential decay and the pacf plot as zero after lag 3, so p = 3, q = 0 is possible. However, so is p = 1, q = 1 and p = 0, q = 2. In this case the data was generated by an ARIMA(1, 2, 1) model.
As an alternative to inspecting plots we could use the AIC value for model selection as in Chapter 2.
Figure 4.3: acf (top row) and pacf (bottom row) plots: no differencing (a, d), once differenced (b, e) and twice differenced (c, f)
4.5 SARIMA modelling
There are other forms of non-stationarity that the Box-Jenkins approach can tackle. The most important is seasonality.
Definition 4.5.1. (Seasonal components) Let Zt ∼ WN(0, σ2) and, for ease of interpretation, let us assume t is recorded on a yearly time scale, but we have one observation per month. Following R we would record January 2015 as 2015.00, February 2015 as 2015.083 = 2015 + 1/12, etc. Let αi, i = 1, 2, . . . , 12, be constants associated with the month (i − 1)/12, such that Σ_{i=1}^{12} αi = 0, and define mon(t) as the function which returns the month associated with time t.

The process Xt := αmon(t) + Zt, where {Zt} ∼ ARIMA(p, d, q), is said to have a seasonal component. It will not be stationary if any of the αi values are nonzero.
Example 4.5.2. (US accident data)

Figure 4.1: US accidental deaths data 1973–1978. Data and auto-correlation plot.
Definition 4.5.3. (Lag s difference) The lag s difference operator B^s is defined by

B^s Xt = Xt−s.

The corresponding difference operator ∆s is then defined as

∆s := (I − B^s).
Differencing at appropriate lags can remove seasonality, hence this is another example where a linear function of a non-stationary time series can be stationary. This gives rise, in a very similar way to the definition of the ARIMA model, to a wider class of models which can represent seasonal time series.
Definition 4.5.4. (Seasonal ARIMA (SARIMA) model) If d and D are nonnegative integers, then {Xt} is a seasonal ARIMA(p, d, q)(P, D, Q) process with period s if the differenced series

Yt := (1 − B)^d (1 − B^s)^D Xt

is a causal ARMA process defined by

φ(B)Φ(B^s)Yt = θ(B)Θ(B^s)Zt, where Zt ∼ WN(0, σ²), and

φ(z) := 1 − φ1 z − · · · − φp z^p,  Φ(z) := 1 − Φ1 z − · · · − ΦP z^P,
θ(z) := 1 + θ1 z + · · · + θq z^q,  Θ(z) := 1 + Θ1 z + · · · + ΘQ z^Q.

The corresponding process is causal if and only if all the roots of φ(z) and Φ(z) lie outside the unit circle.
4.6 Using R for SARIMA modelling
Just as in §4.4 there are the same two types of problem: (i) how to work with the models for which p, d, q, P, D, Q are known, and (ii) how to find appropriate values for p, d, q, P, D, Q.
Fitting SARIMA models
Example 4.6.1. (SARIMA model in R) Let us look at a particular example for concreteness. The R data object USAccDeaths contains a monthly time series of accidental deaths in the USA in the period 1973–1978. Figure 4.1 (a) shows the data and we see a strong seasonal effect. Differencing at lag 12 is shown in Panel (b) and there seems to be a trend, so another difference is taken to give Panel (c), which 'looks stationary'. The R code which did this is as follows.
> data(USAccDeaths)
> plot(USAccDeaths, main=”(a) Accidental deaths”, ylab=”Counts”)
> plot(diff(USAccDeaths, lag=12) )
> plot(diff(diff(USAccDeaths, lag=12)))
So it looks like a SARIMA model might be appropriate. Let us suppose that we know that (p,d,q) = (1,1,1) and (P,D,Q) = (0,1,1); we then fit the appropriate SARIMA model using:
> mod2 <- arima(USAccDeaths, order = c(1, 1, 1),
seasonal = list(order = c(0, 1, 1), period = 12))
Figure 4.1: US accident data. (a) Accidental deaths; (b) monthly differenced; (c) differenced monthly differenced.
where mod2 is an arima object containing the estimated parameters and other details of the fit i.e.
> mod2
Call:
arima(x = USAccDeaths, order = c(1, 1, 1),
seasonal = list(order = c(0, 1, 1), period = 12))
Coefficients:
ar1 ma1 sma1
0.0979 -0.5109 -0.5437
s.e. 0.3111 0.2736 0.1784
sigma^2 estimated as 99453: log likelihood = -425.39, aic = 858.78
We need to see if the fit is 'good', so we look at the residuals using
> acf(mod2$residuals, main=”ACF residuals”)
> acf(mod2$residuals, type=”partial”, main=”PACF residuals”)
Figure 4.2: US accident data. ACF and PACF of the residuals from the fitted model.
We can then make predictions for two years ahead using
> predict(mod2, n.ahead=24)
and the results can be found in Fig. 4.3, with the red line showing the point forecast and the blue lines the prediction intervals.
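A sketch of how such a forecast plot can be produced (assuming mod2 from above; the ±1.96 standard error limits are the usual approximate 95% intervals):

```r
# Sketch: plot the data, the 24-month point forecast and approximate
# 95% prediction intervals from the fitted SARIMA model mod2.
pred  <- predict(mod2, n.ahead = 24)
upper <- pred$pred + 1.96 * pred$se
lower <- pred$pred - 1.96 * pred$se
ts.plot(USAccDeaths, pred$pred, upper, lower,
        col = c("black", "red", "blue", "blue"),
        main = "Accidental deaths")
```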
Figure 4.3: US accident data. Forecasts of accidental deaths for two years ahead.
Estimating the SARIMA structure
As in the case of ARIMA modelling we can inspect acf and pacf plots to look for possible p, d, q and P, D, Q values. This usually requires a good deal of experience to do reliably, and a good deal of data.
Example (4.6.1 revisited). We have already seen by looking at differencing plots that d = 1, D = 1 are plausible – although not the only possible values. We will stick to those and look at acf and pacf plots of these differences in Fig. 4.4. To try and guess P, Q we look for the same types of patterns discussed above – i.e. exponential decay and finding a bound after which all estimates are zero – but only at integer values. That is, we look at the yearly lags 1, 2, . . .. Here we see P = 0, Q = 1 is possible, although not the only possible choice. To think about the p, q values we look for patterns at the monthly lags 1/12, 2/12, . . .. Here we see that p = 1, q = 1 is possible, as are other choices.
An alternative to interpreting these plots is to look at all models with small p, q, P, Q values and compare the AIC values given by the arima( ) function output.
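Such an AIC search might be sketched in R as follows, holding d = D = 1 and period 12 fixed; the order ranges searched are an illustrative choice.

```r
# Sketch: search small SARIMA orders for USAccDeaths with d = D = 1,
# period 12, keeping track of the smallest AIC found so far.
best <- NULL
for (p in 0:2) for (q in 0:2) for (P in 0:1) for (Q in 0:1) {
  fit <- try(arima(USAccDeaths, order = c(p, 1, q),
                   seasonal = list(order = c(P, 1, Q), period = 12)),
             silent = TRUE)
  if (inherits(fit, "try-error")) next
  if (is.null(best) || fit$aic < best$aic) {
    best <- fit
    cat("p =", p, "q =", q, "P =", P, "Q =", Q, "AIC =", fit$aic, "\n")
  }
}
best   # the model with the smallest AIC found
```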
Figure 4.4: US accident data. ACF and PACF of the differenced series.
4.7 The Box-Jenkins Approach

4.7.1 Summary
In this chapter we have looked at the Box-Jenkins methodology for forecasting. It is based on the idea that many stationary processes can be parsimoniously represented by ARMA(p,q) models – that is, only a relatively small number of parameters are needed. As we have seen in Chapter 2, if we can fit observed data with only a small number of parameters then often we can get reasonable forecasting models.
However, most applications involve some form of non-stationarity, and the methodology deals with this by appropriate differencing combined with associated independence assumptions. We can summarise the steps as follows.
(a) Model identification. This uses the R functions plot, diff, acf, etc.

(b) Estimation. Once the structure of the model is identified we use the arima function to estimate the parameters.

(c) Diagnostic checking. Using the residuals from the model fit we can check to see if the model is 'satisfactory'.

(d) Consider alternative models if necessary.
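The four steps might be sketched in R for the US accident data as follows; the orders in step (b) are those used in Example 4.6.1.

```r
# Sketch of the Box-Jenkins cycle on the built-in USAccDeaths series.
data(USAccDeaths)
x <- USAccDeaths

# (a) Identification: plot, difference, inspect acf/pacf
plot(x)
y <- diff(diff(x, lag = 12))
acf(y); pacf(y)

# (b) Estimation
fit <- arima(x, order = c(1, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))

# (c) Diagnostic checking: residuals should look like white noise
acf(fit$residuals)
Box.test(fit$residuals, lag = 12, type = "Ljung-Box")

# (d) If the diagnostics fail, return to (a) with alternative orders.
```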
4.7.2 Tests for stationarity
One of the important stages of the approach is to decide if a suitably differenced time series is stationary. This can be rather subjective if purely graphical methods are used. We can use more formal tests which focus on the different ways that stationarity can fail.
Definition 4.7.1. (Phillips-Perron test) The Phillips-Perron test is a test of the null hypothesis that the time series has a 'unit root', against a stationary alternative. Recall that we check for stationarity in an ARMA process by looking to see where the roots of the polynomial φ(z) lie. If they lie on the unit circle then we say the series has a unit root and is non-stationary. The random walk model is the typical example of this.
In R we use the function PP.test. For the data in Example 4.6.1 selecting d = D = 1 reduced the model to stationarity. If we try to test this formally we get
> PP.test(diff(diff(USAccDeaths, lag=12)))
Phillips-Perron Unit Root Test
data: diff(diff(USAccDeaths, lag = 12))
Dickey-Fuller = -11.8917, Truncation lag parameter = 3, p-value = 0.01
Definition 4.7.2. (Runs test) If there were a trend in the sequence, the series would show a pattern in whether it was above or below its mean. The runs test looks at the frequency of runs in a binary data set. The null hypothesis is that the series is random.
> library(tseries)
> x <- diff(diff(USAccDeaths, lag=12))
> y <- factor(sign(x - mean(x)))
> runs.test(y)
Runs Test
data: y
Standard Normal = 0.2439, p-value = 0.8073
alternative hypothesis: two.sided
Here we convert the continuous time series to a binary one and then use the runs.test( ) from the tseries library.
4.7.3 Criticisms of approach
The paper Chatfield and Prothero (1973) has some criticisms of the Box- Jenkins approach. In particular we note the following.
(a) Differencing to attain stationarity may introduce spurious auto-correlations. For example, with monthly data these can occur at lag 11.

(b) It can be difficult, just from data, to distinguish between different SARIMA models, and these may give different forecasts.

(c) A reasonably large number of data points is needed to implement the approach, with Box and Jenkins (1976) stating that 50–100 observations are needed.

(d) ARIMA models may be harder for practitioners to understand than linear trend plus seasonal effect models.
5
Forecast Evaluation
5.1 Introduction
We have seen, so far, the mathematical structure of stationary time series and its application in linear methods such as Exponential Smoothing, the Holt-Winters method, ARMA, ARIMA and SARIMA models in the Box- Jenkins framework. When there is non-stationarity we have tried linear transformations, such as differencing, to try to identify and estimate the non-stationary components of the time series. Overall we are modelling the systematic non-stationarity with simple, low-dimensional, linear models and modelling the random component with a stationary process which has been selected through a model building exercise.
In order to justify these models we have used some optimality criteria such as: the ‘optimality’ of the conditional expectation in Theorem 2.2.3, the ‘optimality’ of the conditional median in Theorem 2.2.4, the ‘optimality’ of the Bayes classifier in Theorem 2.2.6, the, so-called, ‘ideal procedure’ defined in Hastie et al. (2009, p. 222), and the ‘optimal’ linear predictor in Theorem 3.4.4.
It may seem odd that we have many different optimality conditions, but the definition of optimality depends on choices that the analyst makes: the choice of loss function, computational tractability, the amount of data, the choice of long or short term forecasts, the bias-variance trade-off, the quality and availability of predictors, etc.
However, not all methods are linear, and the last decade, or so, has seen an explosion of Machine Learning (ML) forecasting and classification methods which the analyst can use. These can be motivated by 'big data' considerations, such as the size and quality of the available data, and considerations of what is computable in a suitable time scale.
In later chapters we will look, in detail, at some of the more statistical non-linear models, such as ARCH, GARCH, and state space models, and also at models for which the speed of computation is the key consideration,
such as the Kalman filter. In this chapter we will look at the question of how to choose between methods in practice. Often, new ML methods are available as a 'black-box', where the analyst just understands the method as an input-output function and does not, and may not want to, understand exactly what is going on inside the code. Typically the black-box methods have a small number of tuning parameters which are under the control of the analyst. Examples of this, that we have already seen, are the λ parameter in Ridge Regression (Definition 2.4.7) and the LASSO (Definition 2.7). One way of thinking about these tuning parameters is that they define a relatively simple path through a very rich model space to find a balance between bias and variance. Often the values of the tuning parameters are decided using cross-validation, §2.5.2, and they may not have the direct interpretation that more traditional statistical parameters have.
Since the analyst does not explore the inside of the black box, they cannot evaluate the method through mathematically based optimality conditions. Instead they calibrate and compare models on well-known example sets and through simulations. We will explore some of these issues with a guided reading of some important papers in these areas and through some simple R experiments.
5.2 Evaluation of Forecasting Methods
As discussed above, determining which statistical or ML forecasting algorithm is 'best' is a very hard problem. Indeed we should never expect uniform dominance of one method over all others. Rankings can very much depend on the domain of the problem. A reference which covers the basics is Wang and Chaovalitwongse (2010).
There are cultural differences between the traditional statistics and ML communities in the way that methods are evaluated. For example, very commonly used statistical models such as ARMA or ARIMA are motivated by the fact that the best linear predictor for a squared error loss is well defined – basically it is a standard regression problem – and prediction errors can be computed if we make distributional assumptions. The main task for a given forecasting problem then becomes a model selection problem. Common approaches use a penalised likelihood methodology or cross validation to do the model selection. Evaluation of these models often uses the mathematics of asymptotic sample size analysis, for example showing a method is consistent or efficient. This is an analysis exploring what happens as the amount of data tends to infinity. Other mathematical results might be based on an oracle property. This compares the behaviour of the procedure to one for which the exact model, and its parameter values, are known. Of course the oracle will always win, but if the difference is 'small' we are reassured the methodology will give reasonably good results.
While asymptotic results might reassure us that, if we have a great deal of data, our model based prediction will be close to the correct answer, they do not really tell us what will happen when we have real data of fixed sample size. We also recall that in a time series, say (x0, x1, · · · , xN), even if N is large we still only have one realisation of the underlying stochastic process. In general, without knowing the underlying dependence structure, it is not clear what we should mean by the term 'sample size'. A common way to explore these issues, which we will do in §5.3, is through carefully designed simulation experiments.
5.3 Simulation experiments
In this section we will start by evaluating methods through well designed simulation exercises, which allow us to evaluate a method in a, hopefully realistic, finite sample context. Below is an example of what a simulation based exercise could look like. We will use R to evaluate how well the prediction function, predict, actually works for AR(2) processes.
Example 5.3.1. We explore, in two cases, how well the built-in R functions work for prediction in a time series context. In particular we explore the 1-step ahead prediction behaviour of the AR(2) model
Xt = 0.5Xt−1 + 0.3Xt−2 + Zt, where Zt ∼ N(0, 4) i.i.d. We will investigate two cases.
1. Case 1: We have fitted the model to 200 observations, x1,…,x200, and we are trying to predict X201.
2. Case 2: We now only have 20 observed values and we are trying to estimate X21.
When designing a simulation experiment we need to decide what question we are trying to answer. Different questions will, in general, lead to different designs.
Here we will focus on the frequentist behaviour of a nominally 95%- prediction interval. Recall that a 95%-prediction interval will contain the true, but unknown, future value of the time series 95% of the time the procedure is used.
Example (5.3.1 Again). In our example the problem is to evaluate whether the R function predict( ) has the claimed property: that it generates a prediction interval which contains the actual realisation of Xn+h | {x1, . . . , xn} 95% of the times it is used.
We use a Monte Carlo experiment to estimate the actual coverage percentage over a large number of independent trials. It is perhaps easiest to consider the experiment as a game played by two people. Person A is responsible for selecting the data generation process – the true model. For each running of the game A generates the data and passes it to person B. Person B does not know the true model, only the data. From the data B uses the statistical method being evaluated (in this chapter a forecasting method) to generate an answer (here a prediction). B then passes the answer back to A, who evaluates it compared to the truth. Recall that A knows everything, but B can only see the data. The game is repeated many times and so A can finally report what proportion of these times B was correct.
Of course in practice you, the analyst, play the role of both person A and person B and R does all the hard work. You must be careful, when writing your R code that you don’t cheat by letting B know things that only A should know.
In each of these simulation exercises we have the following general structure:
1. Select the model which will generate the data, its parameter values, the size of the training data and the number of time steps you wish to forecast. [This is person A's job]

2. Generate from the model both the training data and the data to be forecast. [This is person A's job]

3. Person A then passes the training data – and nothing else – to B.

4. Go through a model selection and fitting exercise on the training data. [This is person B's job]

5. From the fitted model construct the 95%-prediction interval. [This is person B's job]

6. The answer, as computed by B, is passed back to A for evaluation, and the evaluation is stored.

7. Repeat Steps 2-6 independently a large number of times and, from the stored evaluations, compute the proportion of times B was 'correct'.
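For Example 5.3.1 (Case 1) this structure might be sketched in R as follows. For brevity person B is told the true AR(2) order, so the model selection in Step 4 is omitted, and 1000 replications is an illustrative choice.

```r
# Monte Carlo estimate of the coverage of a nominal 95% one-step
# prediction interval for the AR(2) model of Example 5.3.1.
set.seed(443)
n.rep <- 1000
hit <- logical(n.rep)
for (r in 1:n.rep) {
  # Person A: generate n + 1 observations from the true model
  z <- arima.sim(n = 201, model = list(ar = c(0.5, 0.3)), sd = 2)
  train <- z[1:200]; truth <- z[201]
  # Person B: fit an AR(2) and construct the interval (model
  # selection skipped here; B is told the true order)
  fit <- arima(train, order = c(2, 0, 0))
  p <- predict(fit, n.ahead = 1)
  hit[r] <- abs(truth - p$pred) < 1.96 * p$se
}
mean(hit)   # estimated coverage; compare with the nominal 0.95
```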
Example (5.3.1 Again). In this example the data generation model is the AR(2) model

Xt = 0.5Xt−1 + 0.3Xt−2 + Zt,

where Zt ∼ N(0, 4) i.i.d., and A knows the model class and the corresponding parameter values. Person A uses R simulation code to generate the data. Person B only sees the data and has to use it to do a number of things: (i) select the ARMA model class, (ii) estimate the
parameter values, (iii) use the estimated model to make a prediction and quantify the uncertainty. Person B then passes back the prediction interval to A who then checks if the interval contains the ‘true value’.
This game is repeated many times and finally A reports whether the prediction interval method captures the truth in 95% of the games that have been played.
In Example 5.3.1 we assumed that the model class, that both A and B used, was an ARMA model. Under this assumption we know many mathematical properties of the model class. One of the powerful features of simulation methods is that Person A can, in fact, generate from a different class of models, such as the non-linear ARCH model considered in §8. Thus we can explore the methodology in cases where the assumptions of the model do not hold. This is called misspecification.
5.4 Using simulation studies to evaluate statistical methods
This section is based on the paper Morris et al. (2019), which is a recommended read. In the paper's abstract they state:
‘A key strength of simulation studies is the ability to understand the behavior of statistical methods because some “truth” (usu- ally some parameter/s of interest) is known from the process of generating the data. This allows us to consider properties of methods, such as bias.’
In this course we would be considering properties connected to forecasting: coverages, MSE, out of sample behaviour, etc. They go on to say: 'While widely used, simulation studies are often poorly designed, analyzed, and reported'.

When designing a study, (Morris et al., 2019, Table 1) gives very useful guidance. Here we give a simplified version showing the things you should consider.
1. Aims: identify the specific aims of the simulation study.

2. Data-generating mechanism: think hard about what mechanisms you want to use in your context and with your aims in mind.

3. Estimand/target of analysis: decide, using your aims and data-generating mechanism, precisely what target(s) you are going to investigate. In our work these are typically going to be some form of forecast.

4. Methods: for method comparison studies, make sure that you are including the relevant methods and include them in a fair way.
5. Performance measures: in our work these are typically ways of evaluating forecast methods. There are typically going to be a number of these, not just one.

6. Number of simulations: select a suitable number of simulations to achieve an acceptable Monte Carlo SE for key performance measures. Note that 'acceptable' depends on your aims.

7. Coding and execution: plenty of checks, often using careful choices of random number seed. Start small and build up the code.

8. Analysis: conduct exploratory analysis of results, particularly graphical exploration. This is part of the iterative code checking and making sure that your results do not contain any surprises. Always compute Monte Carlo standard errors for all estimates.

9. Reporting: describe the simulation study giving sufficient rationale for choices. Include Monte Carlo SEs for simulation uncertainty. Publish the code used to execute the simulation study, including user-written routines. This ensures that others can reproduce and check what you claim to be true.
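As a small numerical illustration of item 6 (the numbers here are mine, not from Morris et al.): for an estimated proportion p̂ based on n.rep independent replications, the Monte Carlo SE is sqrt(p̂(1 − p̂)/n.rep), which can be used to choose n.rep.

```r
# Monte Carlo standard error of an estimated proportion (e.g. coverage)
# based on n.rep independent replications.
mc.se <- function(p.hat, n.rep) sqrt(p.hat * (1 - p.hat) / n.rep)
mc.se(0.95, 1000)    # ~ 0.0069: coverage pinned down to roughly +/- 1.4%
mc.se(0.95, 10000)   # ~ 0.0022: ten times the work, ~3 times the precision
```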
5.5 Evaluation by competition
We can certainly evaluate the properties of ML methods using Monte Carlo simulations. However, on top of this, methods are often evaluated using competitions, where large sets of real data are made available and a developer of a new algorithm runs it on the publicly available test problems, with the winner minimising the given objective function.
5.5.1 Makridakis Competitions
An important example of such events are the Makridakis Competitions (also known as the M Competitions); for example see https://mofc.unic.ac.cy/m5-competition/ The site tells us 'The M4 extended and replicated the results of the previous three competitions, using an extended and diverse set of time series to identify the most accurate forecasting method(s) for different types of prediction'. The emphasis on diversity here is interesting since, as mentioned above, we do not expect uniform dominance of any one method over different domains.
The paper Makridakis et al. (2018) is a good place to see the results. From the abstract we see [my emphasis in bold]
'Machine Learning (ML) methods have been proposed in the academic literature as alternatives to statistical ones for time series forecasting. Yet, scant evidence is available about their relative performance in terms of accuracy and computational requirements. The purpose of this paper is to evaluate such performance across multiple forecasting horizons using a large subset of 1045 monthly time series used in the M3 Competition. After comparing the post-sample accuracy of popular ML methods with that of eight traditional statistical ones, we found that the former are dominated across both accuracy measures used and for all forecasting horizons examined. Moreover, we observed that their computational requirements are considerably greater than those of statistical methods. The paper discusses the results, explains why the accuracy of ML models is below that of statistical ones and proposes some possible ways forward. The empirical results found in our research stress the need for objective and unbiased ways to test the performance of forecasting methods that can be achieved through sizable and open competitions allowing meaningful comparisons and definite conclusions.'
Again this is consistent with some of the results of §5.6, with simple models tending either to win or to lose only by a small amount. The paper does recognise that ML methods can be exceptionally good in some kinds of problems, for example: 'Autonomous Vehicles (AV), intelligent robots, image and speech recognition, automatic translations, medical and law usage as well as beating champions in games like chess, Jeopardy, GO and poker'. However, success in these kinds of problems does not guarantee good performance in other domains.
The methods used are described in Table 1 of the paper. The overall conclusion was that simple models like the Theta method, Assimakopoulos and Nikolopoulos (2000), and forms of exponential smoothing were generally the best, at least with the selected loss function. They did find that some neural networks (NN) did quite well and had improved since the last competition. Details of the methods, and places where code (often R) can be found, are given in (Makridakis et al., 2018, §3), and the winning Theta method can be run in R: https://www.rdocumentation.org/packages/forecTheta/versions/2.2/topics/Theta%20Models
5.5.2 Using R based ML packages
If we want to run Monte Carlo experiments, or to compete in a contest, in order to evaluate forecasting methods, it is very helpful to have well-designed code. From what we have seen in this course we are familiar with the standard statistically based linear forecasting methods. Table 5.1 gives a subset of ML methods discussed in Makridakis et al. (2018) with their corresponding R packages.
ML method                                         R package
Multi-Layer Perceptron (MLP)                      The mlp function of the RSNNS R package
Bayesian Neural Network (BNN)                     The brnn function of the brnn R package
Radial Basis Functions (RBF)                      The rbf function of the RSNNS R package
Generalized Regression Neural Networks (GRNN)     The guess, learn and smooth functions of the grnn R package
CART regression trees (CART)                      The rpart R package

Table 5.1: R code packages
5.5.3 Tuning parameters
As explained above, finding the 'best' forecasting method for a given problem is very difficult. One issue that makes simple comparisons based on simulation studies complex is the question of tuning parameters. We have seen that these are a natural part of many statistical and ML methods, such as the LASSO or ridge regression. When we are comparing a number of methods we need to be completely clear about what procedures we have used in the selection of tuning parameters. For example, in Table 5.1 we see a number of ML methods which have been coded in R. If a method has a default value of a tuning parameter, a practitioner may not realise that the value of that tuning parameter matters. Any paper comparing different methods may have spent a lot of effort in optimising the choice of tuning parameter value for the new method, while using default values for the comparison methods. This often means results are hard to replicate.
5.5.4 Economic models versus time series approaches
The competition between ML and (linear) statistical methods recalls a similar debate between using economic models and simple statistical models for forecasting. Again we should not expect to find a universal method. We find in Granger et al. (1986, §3. Which Methods Work Best?) the statement: 'It is perhaps overly ambitious to hope for a simple prescription that provides a good forecasting method for all economic time-series.' This discussion, from 1986, shows that competition between computer-intensive and big data methods is far from new, as is the importance of not relying on one single methodology. Granger et al. (1986) closes with the statement: 'the future of economic forecasting promises the development of new, computer-intensive methods and the real-time availability of very large data sets [· · · we] underscore the importance of having a simple and reliable benchmark forecast against which alternative forecasts can be assessed'.
An example of this – now historical but still relevant – debate, in a very practical case study, can be found in Cranage and Andrew (1992), which looks at forecasting in the hospitality industry. The results of the study show that for the actual restaurant sales in this sample, 'time series models (specifically Box-Jenkins and exponential smoothing models) performed as well or better in forecasting sales than an econometric model.' The paper compares judgement methods, econometric models – these are essentially regression models which not only try to forecast but also explain the possibly causal relationships between economic variables – and pure time series models. One issue, which should not be forgotten, is that all methods have to be used by people who may have no experience with any kind of modelling.
5.6 Illusions of progress
As discussed in Hand (2006), using the Monte Carlo approach of §5.3 or the real-world competition approach of §5.5 can result in a large number of forecasting procedures having similar performance, and the differences can be dwarfed by issues like distributional drift, change points in time series data, problem uncertainty, missing data and using an inappropriate loss function.
For us, in this course, we look at the important paper Hand (2006). This is mostly focused on classification but the message also applies to forecasting. It looks at methods of classification, including classical statistical tools like linear discriminant analysis (LDA) or logistic regression and more Machine Learning based methods such as neural networks and support vector machines. It looks at the ways that we might judge that one method is superior to another, referencing many such studies. These comparisons are often based on the simulation methods of §5.3 since these have a ground truth which allows us to judge the performance. From the abstract Hand states
'This paper argues that these comparisons often fail to take into account important aspects of real problems, so that the apparent superiority of more sophisticated methods may be something of an illusion. In particular, simple methods typically yield performance almost as good as more sophisticated methods, to the extent that the difference in performance may be swamped by other sources of uncertainty that generally are not considered in the classical supervised classification paradigm'
In particular in (Hand, 2006, p. 3) it is argued that in many studies improvements of the ML algorithms over more traditional methods can be small, and 'aspects of real practical problems often render such small differences irrelevant, or even unreal'. He identifies four issues: firstly, that there are only small marginal gains in a model selection process, with most of the performance coming at the start from using simple models; secondly he argues that non-stationarity in the data is very common, but not included in the evaluation processes; thirdly he notes that there is arbitrariness in the definition of the classes that we are trying to classify the data into, which is ignored in evaluation studies. Finally he looks at what he calls problem uncertainty. An example of this is making incorrect assumptions on the costs of different types of classification error.
In (Hand, 2006, § 3.1) he states: 'A fundamental assumption of the classical paradigm is that the various distributions involved do not change over time. In fact, in many applications this is unrealistic and the population distributions are nonstationary.' As a statistical principle we should have a clear understanding of how the data that is going to be used for fitting a forecasting model ended up in our computers. This is sometimes called data provenance, Simmhan et al. (2005). The need to understand this is looked at in more detail in (Hand, 2006, § 3.2) Sample Selectivity Bias. An example of this could be a forecast/classification method which was trained, or fitted, on a set of volunteers from Waterloo undergraduates, but is then applied to a group with quite different demographic, social, economic and other characteristics. In §4 of Hand (2006) he looks at problem uncertainty issues such as errors in class labeling, or measurement error, arbitrariness in the class definition and concept drift. Finally §4 finishes by discussing the choices involved in fitting and using classification/forecasting methods. Typically we are optimising over some performance criterion – such as mean square error – but that criterion is itself fairly arbitrary, or inappropriate for a given problem.
It should be noted that Hand is not arguing that traditional, classical statistics methods are always better, but rather that the task of evaluating forecast or classification methods is more complex than many comparison studies make out. Typically the evaluation studies ignore the issues identified above, whose effects can dominate the relatively small perceived advantages of some new method. Further, simple methods can do almost as well as more complex ones in real world problems, and may have other benefits in terms of interpretation and transparency.
It should be noted that this general story is complex. For example Jie et al. (2019) state in the abstract ‘The objective of this study was to compare performance of logistic regression (LR) with machine learning (ML) for clinical prediction modelling in the literature.’ They do this via a meta-analysis of Medline and conclude ‘We found no evidence of superior performance of ML over LR. Improvements in methodology and reporting are needed for studies that compare modelling algorithms.’ However, in a study of mortgage default prediction, Fitzpatrick and Mues (2016) state: ‘The main findings are that the selected approaches have varying degrees of predictive power and that boosted regression trees significantly outperform logistic regression.’
In general it is clear that no one method is uniformly stronger than all the others and differences can be due to the domain specific properties of any given problem.
6 Bayesian and state space methods
6.1 Introduction
Most of the standard results in this section can be found in a number of places including Berger (2013) or Jackman (2009).
6.2 Bayesian methods
We now briefly review the ideas of Bayesian inference. We shall assume that we have a model for the data f(x|θ) and that we want to learn about the vector of parameters θ. In Bayesian statistics everything is considered a random variable and all statements about uncertainty are made using the language and calculus of probability theory.
Suppose that we consider that the parameter vector θ itself is random, and that it has a distribution, Pr(θ), which we shall call the prior distribution. This represents what we know about θ before we see the data. In the Bayesian viewpoint all uncertainty – and in particular what we know about the value of θ – can be represented by probability statements.
To see what we know about θ after we see the data we apply Bayes theorem to give
$$\Pr(\theta \mid \text{data}) = \frac{\Pr(\text{data} \mid \theta)\Pr(\theta)}{\Pr(\text{data})}. \qquad (6.1)$$
The term Pr(θ | data) is called the posterior and is interpreted as what we know about θ after we have seen the data. We can then write (6.1) as
Posterior(θ) ∝ Lik(θ) × prior(θ) (6.2)
That is, up to a constant, the likelihood function is the way we convert the information we have about θ before we have seen the data into the information we have about it after seeing the data.
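The proportionality in (6.2) can be made concrete numerically. The following sketch (in Python for illustration; the course software is R) evaluates likelihood × prior on a grid and renormalises, which recovers the posterior without ever knowing Pr(data) in closed form. The data values, grid, and function names here are arbitrary choices of mine.

```python
import numpy as np

def grid_posterior(theta, log_lik, log_prior):
    # posterior ∝ likelihood × prior; renormalise so it integrates to 1
    log_post = log_lik(theta) + log_prior(theta)
    log_post -= log_post.max()              # stabilise before exponentiating
    post = np.exp(log_post)
    dtheta = theta[1] - theta[0]
    return post / (post.sum() * dtheta)     # divide by the Pr(data) constant

# N(theta, 1) likelihood with three observations and an N(0, 1) prior
x = np.array([0.5, 1.2, 0.8])
theta = np.linspace(-5.0, 5.0, 2001)
post = grid_posterior(theta,
                      lambda t: sum(-0.5 * (xi - t) ** 2 for xi in x),
                      lambda t: -0.5 * t ** 2)
post_mean = (theta * post).sum() * (theta[1] - theta[0])
```

With this N(0, 1) prior the conjugate formula (6.3) below gives posterior mean $\sum x_i/(1+n) = 2.5/4 = 0.625$, which the grid answer reproduces.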
Example 6.2.1. (Normal Bayesian inference 1) Consider a simple normal example, where we have a model $X_i \overset{\text{i.i.d.}}{\sim} N(\theta,1)$ with a sample size of n, so that the likelihood function based on the observed data $x_1, \ldots, x_n$ is
$$\begin{aligned}
\mathrm{Lik}(\theta) = \Pr(\text{data}\mid\theta) &= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{(x_i-\theta)^2}{2}\right\}\\
&= \left(\frac{1}{\sqrt{2\pi}}\right)^{n} \exp\left\{-\sum_{i=1}^{n}\frac{(x_i-\theta)^2}{2}\right\}\\
&\propto \exp\left\{\theta\sum_{i=1}^{n}x_i - \frac{n}{2}\theta^2\right\}
\end{aligned}$$
This is then the key tool that we need to merge with the prior in order to get the posterior.
In the Bayesian viewpoint the ‘answer’ to an inference problem is captured in the posterior distribution. There are then a number of computational issues which we need to study. First we note that the very simple formula Equation (6.2) only defines the posterior up to a constant. Since the posterior is a probability mass function or a density function, we need to be able to deal with the fact that finding the normalising constant
$$\int_{\Theta} \mathrm{Posterior}(\theta)\, d\theta,$$
might be a difficult computational problem. This can be solved either by clever mathematics or by numerical integration methods. Secondly, we want to know how we can convert mathematical knowledge about the posterior into something interpretable. Again it looks like this will involve (potentially) difficult integration problems.
Example (6.2.1 revisited). Suppose that the prior distribution for θ was in the Normal family i.e.
$$\theta \sim N(\theta; \mu_{\text{prior}}, \sigma^2_{\text{prior}}) \propto \exp\left\{-\frac{(\theta-\mu_{\text{prior}})^2}{2\sigma^2_{\text{prior}}}\right\}$$
where $\mu_{\text{prior}}$ and $\sigma^2_{\text{prior}}$ are considered known constants. Then the posterior
can be written as
$$\begin{aligned}
\Pr(\theta\mid x_1,\ldots,x_n) &\propto \mathrm{Lik}(\theta) \times \mathrm{prior}(\theta)\\
&\propto \exp\left\{\theta\sum_{i=1}^{n}x_i - \frac{n}{2}\theta^2\right\} \times \exp\left\{-\frac{(\theta-\mu_{\text{prior}})^2}{2\sigma^2_{\text{prior}}}\right\}\\
&\propto \exp\left\{-\frac{1}{2}\,\frac{\left(1 + n\sigma^2_{\text{prior}}\right)\theta^2 - 2\left(\mu_{\text{prior}} + \sigma^2_{\text{prior}}\sum_{i=1}^{n}x_i\right)\theta}{\sigma^2_{\text{prior}}}\right\}\\
&= \exp\left\{-\frac{1}{2}\,\frac{\left(\theta - \dfrac{\mu_{\text{prior}} + \sigma^2_{\text{prior}}\sum_{i=1}^{n}x_i}{1 + n\sigma^2_{\text{prior}}}\right)^2}{\sigma^2_{\text{prior}}\big/\left(1 + n\sigma^2_{\text{prior}}\right)}\right\}
\end{aligned}$$
So that we have
$$\Pr(\theta\mid x_1,\ldots,x_n) \sim N\left(\frac{\mu_{\text{prior}} + \sigma^2_{\text{prior}}\sum_{i=1}^{n}x_i}{1 + n\sigma^2_{\text{prior}}},\; \frac{\sigma^2_{\text{prior}}}{1 + n\sigma^2_{\text{prior}}}\right) \qquad (6.3)$$
So, when we see data, the prior distribution $N(\theta;\mu_{\text{prior}},\sigma^2_{\text{prior}})$ is updated to another normal but with different mean and variance. As n gets larger the mean of the posterior satisfies
$$\frac{\mu_{\text{prior}} + \sigma^2_{\text{prior}}\sum_{i=1}^{n}x_i}{1 + n\sigma^2_{\text{prior}}} \approx \bar{x} \rightarrow \mu,$$
the mean of the model. The variance will satisfy
$$\frac{\sigma^2_{\text{prior}}}{1 + n\sigma^2_{\text{prior}}} \approx \frac{1}{n} \rightarrow 0.$$
We interpret this as saying, when we have a lot of data from the model the posterior converges to a spike distribution around the true value.
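A quick numerical check of (6.3) and these limits; a Python sketch in which the prior values, true mean and sample size are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(443)
mu_prior, var_prior = 2.0, 4.0            # prior N(mu_prior, var_prior)
theta_true, n = 3.0, 10_000
x = rng.normal(theta_true, 1.0, size=n)   # model: X_i ~ N(theta, 1)

# posterior mean and variance from equation (6.3)
post_mean = (mu_prior + var_prior * x.sum()) / (1 + n * var_prior)
post_var = var_prior / (1 + n * var_prior)
```

For this run post_mean is essentially the sample mean and post_var is essentially 1/n: the posterior has collapsed to a spike near the true value, as claimed.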
The nice properties of Example 6.2.1 hold in general for Bayesian methods. As we get more data we update the posterior and it will shrink around the ‘true’ values of the model.
One property that is special is that the prior distribution (a normal here) was updated to another normal. In general the posterior will not be a simple well-known distribution and the normalising factor has to be computed numerically.
Having seen how to compute the posterior distribution we now see how the Bayesian method deals with forecasting. The same principles apply: only use the rules of probability.
Definition 6.2.2. (Predictive distribution) The solution to a forecasting problem to forecast Xnew based on an observed sample X1, . . . , Xn which comes from a model f(X|θ) should be the conditional distribution P(Xnew|X1,…,Xn). This is called the predictive distribution and is calculated from the posterior by
$$P(X_{new}\mid X_1,\ldots,X_n) = \int_{\Theta} P(X_{new}\mid\theta)\, P(\theta\mid X_1,\ldots,X_n)\, d\theta.$$
In words we average over the model according to the posterior belief in the value of θ given the observed data.
Example (6.2.1 revisited). For our example we have the predictive distribution
$$\int \phi(x;\theta,1)\,\phi\!\left(\theta; \mu_{Post}, \sigma^2_{Post}\right) d\theta,$$
where φ() is the normal density function. After some algebra this can be shown to be
$$\phi\!\left(x; \mu_{Post},\, 1 + \sigma^2_{Post}\right) \qquad (6.4)$$
Definition 6.2.3. (Inverse gamma distribution) A positive random variable X ∼ InvGamma has an inverse gamma distribution if X−1 has a gamma distribution. Its density is
$$f_X(x\mid\alpha,\beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{-\alpha-1} \exp\{-\beta/x\}$$
for x > 0, α > 0 and β > 0. Here α is called the shape parameter and β the scale.
Example 6.2.4. (Normal Bayesian inference 2) Suppose now we have the model $X_i \overset{\text{i.i.d.}}{\sim} N(\theta,\sigma^2)$ with a sample size of n, where both θ and σ² are to be estimated. We can take the priors
$$\theta\mid\sigma^2 \sim N\!\left(\mu_0, \frac{\sigma^2}{n_0}\right),\qquad \sigma^2 \sim \mathrm{InvGamma}(\nu_0/2,\; \nu_0\sigma_0^2/2).$$
In this prior specification n0 is a known constant called the ‘prior sample size’ – a nice way of measuring the relative weight of the prior and the data. The terms ν0 and σ0² are considered known.
For this set up, after some algebra, we get the posterior for θ, σ² as
$$\theta\mid\sigma^2 \sim N\!\left(\mu_1, \frac{\sigma^2}{n_1}\right),\qquad \sigma^2 \sim \mathrm{InvGamma}(\nu_1/2,\; \nu_1\sigma_1^2/2)$$
where
$$\mu_1 := \frac{n_0\mu_0 + n\bar{x}}{n_0 + n},\qquad n_1 = n_0 + n,\qquad \nu_1 = \nu_0 + n$$
and
$$\nu_1\sigma_1^2 = \sum_{i=1}^{n}(x_i - \bar{x})^2 + \nu_0\sigma_0^2 + \frac{n_0 n}{n_0 + n}(\mu_0 - \bar{x})^2.$$
In Example 6.2.4 the priors have been selected in a clever way so that the functional form does not change before and after seeing the data. These are called conjugate priors. They are very convenient, and keep the mathematics very clean, but are not strictly necessary.
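The update formulas of Example 6.2.4 are easy to code directly. A Python sketch (the function name and the toy numbers are my own):

```python
import numpy as np

def normal_invgamma_update(x, mu0, n0, nu0, s0_sq):
    """Conjugate update for X_i ~ N(theta, sigma^2), theta and sigma^2 unknown."""
    n, xbar = len(x), float(np.mean(x))
    n1 = n0 + n
    nu1 = nu0 + n
    mu1 = (n0 * mu0 + n * xbar) / (n0 + n)
    nu1_s1_sq = (((x - xbar) ** 2).sum()
                 + nu0 * s0_sq
                 + n0 * n / (n0 + n) * (mu0 - xbar) ** 2)
    return mu1, n1, nu1, nu1_s1_sq / nu1   # posterior parameters

mu1, n1, nu1, s1_sq = normal_invgamma_update(np.array([1.0, 2.0, 3.0]),
                                             mu0=0.0, n0=1, nu0=1, s0_sq=1.0)
```

With prior sample size n0 = 1 and three observations, the posterior mean μ1 = 1.5 sits three quarters of the way from μ0 = 0 to x̄ = 2, exactly the weighting (n0μ0 + nx̄)/(n0 + n).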
6.2.1 Bayesian methods in regression
We can generalise the results of Example 6.2.4 to the standard regression framework we studied in Chapter 2. As Jackman (2009) shows we can use the same conjugate prior ‘trick’ and keep an analytic solution possible.
Definition 6.2.5. (Conjugate prior) A conjugate prior is a prior which, for a given model, has the property that the posterior lies in the same functional family as the prior. The updating using the data only changes the parameter values in the conjugate family.
Computationally we do not have to rely on conjugacy: in R we can handle much more general cases using Markov chain Monte Carlo (MCMC) analysis.
Definition 6.2.6. (MCMC) If we have a complex distribution P(θ) that is known up to a constant then rather than working directly with the distribu- tion we can compute most statistical functions – such as means, variances, marginal distributions – using a large sample θ1, θ2, . . . which is a Markov chain whose stationary (equilibrium) distribution is proportional to P(θ).
We will not, in this course, look at the details of how the sample θ1, θ2, . . . is constructed but rather use built in R functions.
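To give a flavour of what those built-in functions do, here is a minimal random-walk Metropolis sampler, written in Python as an illustrative sketch (it is not the algorithm MCMCpack actually uses). Note it only ever needs the target up to a constant, exactly the situation of Equation (6.2).

```python
import numpy as np

def metropolis(log_target, theta0, n_steps, step=1.0, seed=0):
    # random-walk Metropolis: log_target may omit the normalising constant
    rng = np.random.default_rng(seed)
    theta, lp = theta0, log_target(theta0)
    draws = np.empty(n_steps)
    for i in range(n_steps):
        proposal = theta + step * rng.normal()
        lp_prop = log_target(proposal)
        if np.log(rng.uniform()) < lp_prop - lp:   # accept with prob min(1, ratio)
            theta, lp = proposal, lp_prop
        draws[i] = theta
    return draws

# unnormalised N(3, 1) log-density as the target
draws = metropolis(lambda t: -0.5 * (t - 3.0) ** 2, 0.0, 50_000)
```

After discarding a burn-in the draws behave like a (correlated) sample from N(3, 1), so sample means and standard deviations estimate the corresponding posterior quantities.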
Example 6.2.7. (Bayesian regression) Suppose we have a model
$$y_i\mid x_i \sim N(x_i^T\beta,\; \sigma^2)$$
with semi-conjugate independent priors
$$\beta \sim MN_{p+1}\!\left(b_0, B_0^{-1}\right),\qquad \sigma^2 \sim \mathrm{InvGamma}(c_0/2,\; d_0/2);$$
then the posterior could be calculated, but it is easier to use MCMC.
Example (6.2.4 revisited). We can think of this example as a regression example with no covariates. In R we could use
> library(MCMCpack) #load library
> library(coda)
> y <- rnorm(30, mean=3, sd=0.2) #generate data
> out <- MCMCregress(y~ 1) #run mcmc
> out1 <- as.matrix(out) #store output as matrix
> dim(out1)
[1] 10000 2
The output is a set of two long series, one for θ and one for σ². These large samples represent the (joint) posterior of both parameters and are plotted as density estimates in Fig. 6.1 (b) and (c). Panel (a) is just a plot of the data used.
[Figure: (a) Data, (b) Posterior theta, (c) Posterior sigma2]
Figure 6.1: MCMC output for Example 6.2.4
We can look in more detail at this output. In Fig. 6.2, Panel (a) shows the first 500 points of the θ-series, noting that there are in fact 10,000 points in the series. Panel (b) shows the same for the σ² series and (c) shows a joint plot. This last shows how the MCMC output samples from the joint distribution of the posterior.
[Figure: (a) theta sample, (b) sigma2 sample, (c) Joint distribution]
Figure 6.2: MCMC output for Example 6.2.4 (a) θ sample, (b) σ² sample, (c) joint distribution
If we want to compute posterior means, standard deviations etc. we simply use the sample i.e.
> mean(out1[,1])
[1] 3.031442
> sd(out1[,1])
[1] 0.02407227
which give point estimates of the parameters and measures of (posterior) uncertainty.
Panel (c) shows a scatterplot of the output. This is a representation of the joint posterior distribution of the parameters. The fact that a Bayesian analysis gives joint inference is very different from the standard methods which give, say, confidence intervals for each parameter one at a time. In this simple random normal sample example, since the sampling distributions of the sample mean and sample variance are independent, this is rarely an issue. For example panel (c) looks independent, so the two marginal inferences are sufficient. In general, though, the dependence can really be important.
Example (2.3.1 revisited). We can return to the house price example of Chapter 2 and fit the model using MCMC.
> out <- MCMCregress(Y ~X1+X2+X3+X4+X5+X6+X7+X8+X9, data=house.price)
> summary(out)
                 Mean      SD  Naive SE  Time-series SE
(Intercept) 15.22298 6.45155 0.0645155       0.0645155
X1           1.93971 1.12235 0.0112235       0.0112235
X2           6.87051 4.67551 0.0467551       0.0480019
X3           0.13595 0.53313 0.0053313       0.0053274
X4           2.80602 4.75816 0.0475816       0.0475816
X5           2.04593 1.49712 0.0149712       0.0149712
X6          -0.50892 2.60459 0.0260459       0.0260459
X7          -1.29140 3.70783 0.0370783       0.0370783
X8          -0.03864 0.07188 0.0007188       0.0007188
X9           1.69108 2.09140 0.0209140       0.0209140
sigma2      10.25968 4.46910 0.0446910       0.0725849
We get very similar estimates to the OLS solution. But now, since we are getting a joint posterior for all the parameters, we can see something interesting about the multicollinearity issue seen before. Recall that we found it odd that the estimate for β6 was negative and this was explained by dependence in the explanatory variables. In the Bayesian analysis this dependence flows very naturally into the posterior, as shown in Fig. 6.3. We see our knowledge about the value of β6 is strongly tied to our knowledge about other parameters, say β2. This gives a much more complete picture of what we really know about the parameters even when there is multicollinearity.
How do we use the MCMC output to make predictions? We want to use the predictive distribution of Definition 6.2.2. This is an integral and with
[Figure: pairwise scatterplots of the posterior samples for the regression parameters]
Figure 6.3: MCMC output for Example 2.3.1
MCMC output all integrals – which are hard mathematically – are replaced with sums over the MCMC sample i.e.
$$\int_{\Theta} f(\theta)\, P(\theta\mid X_1,\ldots,X_n)\, d\theta \approx \frac{1}{N_{MCMC}}\sum_{i=1}^{N_{MCMC}} f(\theta_i) \qquad (6.5)$$
for any function f, where θi for i = 1,…,NMCMC is the MCMC sample.
Definition 6.2.8. (Building a predictive sample) If we want to do prediction using a posterior for θ we use a similar idea. We can take a sample from the predictive distribution in two steps:
(i) Draw a sample θ(i) from the posterior P(θ|data)
(ii) Given θ(i) draw a prediction from the model f(y|θ(i))
We repeat this, independently for many different values of i.
Example (2.3.1 revisited). In the regression example, given β(i) we take the simple forecast y(i) = x0T β(i) + ε(i), where ε(i) ∼ N(0, σ2(i)). We could look at the distribution of this over values of β from the posterior. In Figure 6.4 we select x0 to be the average house and generate a sample from the predictive distribution using the two steps of Definition 6.2.8. The blue vertical line is the sample mean of the house prices which, because of the choice of x0, should be the mean here.
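The two steps of Definition 6.2.8 look like this in code. This Python sketch stands in for the R session: instead of the real MCMC output it fabricates posterior draws with assumed values, purely to show the mechanics of predictive sampling.

```python
import numpy as np

rng = np.random.default_rng(1)
n_draws = 5_000

# stand-ins for posterior draws of (beta, sigma^2) from an MCMC run
beta_draws = rng.normal(2.0, 0.1, size=n_draws)     # hypothetical posterior
sigma2_draws = 0.25 + 0.05 * rng.random(n_draws)    # hypothetical posterior

x0 = 1.5
# step (i): a draw theta^(i) from the posterior (we already have the draws)
# step (ii): given theta^(i), draw a prediction from the model
y_pred = x0 * beta_draws + rng.normal(size=n_draws) * np.sqrt(sigma2_draws)
```

The histogram of y_pred is the predictive distribution; its spread combines posterior uncertainty about β with the model noise σ².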
6.2.2 Link to penalty methods
From Hastie et al. (2009, p. 64) we get the following link between the ridge regression method of Definition 2.4.7 and Bayesian methods. They show
[Figure: density estimate of the predictive distribution of price]
Figure 6.4: Predictive distribution for Example 2.3.1
that ridge regression estimates are modes of posterior distributions with suitably chosen priors.
Theorem 6.2.9. (Bayes and ridge regression) Suppose we have the model for Y conditionally on X as
$$Y_i \mid X, \beta \sim N(\beta_0 + x_i^T\beta,\; \sigma^2)$$
with independent priors βj ∼ N(0, τ2). Then, with τ, σ assumed known, the negative log-posterior for β is, up to constants, given by
$$\sum_{i=1}^{N}\left(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2 + \lambda\sum_{j=1}^{p}\beta_j^2,$$
which agrees with RSS(λ) in Equation 2.5 when λ = σ2/τ2. The LASSO can also be interpreted in Bayesian terms.
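Theorem 6.2.9 can be checked numerically: with λ = σ²/τ² the ridge solution (XᵀX + λI)⁻¹Xᵀy is a stationary point of the penalised sum of squares, hence the posterior mode. A Python sketch on simulated data (dropping the unpenalised intercept for simplicity; all numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 200, 3
X = rng.normal(size=(N, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)

sigma2, tau2 = 1.0, 0.5
lam = sigma2 / tau2                      # lambda = sigma^2 / tau^2

# ridge estimate = posterior mode under beta_j ~ N(0, tau^2) priors
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# gradient of ||y - X beta||^2 + lam * ||beta||^2 at the estimate
grad = -2 * X.T @ (y - X @ beta_ridge) + 2 * lam * beta_ridge
```

The gradient vanishes at beta_ridge, confirming it minimises the penalised criterion.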
Theorem 6.2.10. (Bayes and the Lasso) The Lasso estimate for linear regression parameters can be interpreted as a Bayesian posterior mode estimate when the priors on the regression parameters are independent double-exponential (Laplace) distributions.
Proof. See Park and Casella (2008).
6.3 State space model
Definition 6.3.1. (State space model) Let {Yt} be an (observed) time series with (unobserved) state process {θt}. In a state space model the dependence between these is defined by the graphical model shown in Fig. 6.1. The relationship defines the conditional independence relations: (i) θt is a Markov
[Figure: graphical model with Markov chain θ0 → θ1 → θ2 → … → θt−1 → θt, each observation Yt depending only on θt]
Figure 6.1: State space structure
chain; (ii) conditionally on θ0, θ1, . . . , θt, the Yt are independent, so that the joint distribution of (Y1,…,Yt)|(θ1,…,θt) is
$$\prod_{i=1}^{t} f(y_i\mid\theta_i).$$
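A concrete instance of Definition 6.3.1 is the linear Gaussian state space model. The Python sketch below (an assumed transition and observation equation, with arbitrary parameter values) simulates a Markov chain of states and observations that are conditionally independent given those states:

```python
import numpy as np

def simulate_lgss(T, phi=0.9, sigma_w=1.0, sigma_v=0.5, seed=0):
    """State:       theta_t = phi * theta_{t-1} + w_t,  w_t ~ N(0, sigma_w^2)
       Observation: Y_t     = theta_t + v_t,            v_t ~ N(0, sigma_v^2)"""
    rng = np.random.default_rng(seed)
    theta = np.empty(T)
    y = np.empty(T)
    prev = 0.0                                           # theta_0
    for t in range(T):
        theta[t] = phi * prev + sigma_w * rng.normal()   # Markov state
        y[t] = theta[t] + sigma_v * rng.normal()         # Y_t given theta_t
        prev = theta[t]
    return theta, y

theta, y = simulate_lgss(20_000)
```

For |phi| < 1 the state has stationary variance σ_w²/(1−φ²), and the observations add σ_v² of measurement noise on top.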
State space models are also called hidden Markov models.
6.4 Case Study
6.4.1 Introduction
The TSUNAMI group of insurance and reinsurance companies have made available various datasets which cover issues important for the insurance industry. This report concentrates on a seven-year series of commercial claims due to weather. The TSUNAMI group in particular have an interest in the following questions.
1. Temporal effects. Is there any significant variation across the years in the observed data, apart from that caused by inflation? Any such variation should be identified and described.
2. Aggregation. There is interest in the aggregation in the data, usually over three day periods. Suitable models for such summary statistics need to be found.
3. Extremes. The data set contains two ‘extreme’ events. These were the storms in January and February 1990. The modelling should be able to predict return periods for extreme events such as these.
The report is structured in the following way. Section 6.4.2 contains details of both the currently available dataset and important data which is potentially available. Section 6.4.3 examines the modelling approach used throughout, pointing out the differences with traditional time series models which we regard as inappropriate for this kind of data. Section 6.4.4 looks at the results of some fitted models, while the final section describes future work.
6.4.2 Data
The data covers a period from 1st January 1990 until 31st December 1996. It is a record of all claims for UK commercial policies which are coded as being caused by ‘weather’. For each claim the following information is available: the date of the reported claim incidence, the settlement date, and the claim amount. It is known that the data exhibits structure which is induced by the systematic way incidence dates are allocated when the exact date is unknown, forgotten or censored. For example, there are relatively more claims on the first day of each month, and there are relatively fewer claims at the weekend. It is important that the modelling takes into account this structure.
Totalling all the claims occurring on a particular day creates a time series of total daily claims. This series, for both the raw data and the log of the data, is shown in Figure 6.1. Alternatively the data can be represented in its original format, i.e. individual claim sizes and incidence dates, and an additional time series created which contains the number of claims made on each day. These are shown in Figures 6.2 and 6.3 respectively.
[Figure: scatterplots of daily claim size and log(daily claim size) against days from 1/1/1990, with a histogram of log(daily claim size)]
Figure 6.1: The total daily claim sizes from 1/1/1990 until 31/12/1996.
It is also important to consider information which, while unavailable at this stage of the modelling process, is in principle available to individual insurance companies and which will be clearly important in meeting the objectives in Section 6.4.1. This missing information will be important in the following issues.
1. Temporal effects. Currently not recorded are the total number of poli- cies providing cover on each day. This series is important in under- standing the change in behaviour over time of the data with which it is confounded.
[Figure: scatterplots of individual claim size and log(claim size) against days from 1/1/1990, with a histogram of log(claim size)]
Figure 6.2: The individual claim sizes on each day from 1/1/1990 until 31/12/1996.
2. Aggregation. Understanding the dependence structure across the different claims is an important part of the modelling. Information which would be useful here includes the geographical location of claims, the type of damage and the type of property concerned. This will be useful since when looking at aggregation it would be informative to know when different claims may be due to the same meteorological event. For example flood and wind damage may well have quite different aggregation properties.
3. Extremes. The geographical and claims information mentioned above will also be important in the modelling of extreme events. For example the relationship between the intensity of the meteorological event and its location will contribute to both the number and size of observed claims.
The models developed below attempt to take account of this missing information by using random effects. Further they are designed to extend naturally when and if the above data becomes available.
6.4.3 The modelling approach
The underlying structure of the data, with random numbers of claims on each day, is unsuitable for traditional time series methods. Therefore in this paper we follow an events based approach. The physical cause of the damage claims will always be some meteorological event, for example a storm or flood. Although these events are unobserved the structure of the model should take them into account. In particular a high proportion of the variation in the
[Figure: scatterplots of the number of claims and log(no. of claims) against days from 1/1/1990, with a histogram of log(no. of claims)]
Figure 6.3: The number of claims on each day from 1/1/1990 until 31/12/1996.
size of claims depends only on the weather on each day. We therefore model these meteorological events by using random effects modelling.
In general we are interested in identifying the joint distribution of the claim sizes and the number of claims on any particular day. This allows us to address all the issues of interest raised in Section 6.4.1. We denote the claim sizes by X, the number of claims by N, the day of interest as t, and the joint distribution on this day as ft(X, N). We then model this joint distribution as a product of the conditional distribution of X given N and the marginal distribution of N, i.e.
ft(X, N) = ft(X|N)ft(N).
Note that we will apply the events based approach by concentrating on the distribution of N. Informal data analysis shows that after conditioning on N the distribution of X remains approximately constant across time. This can be interpreted as meaning that the information in the meteorological events is filtered primarily through the number of claims.
Models for ft(N)
Let Nt denote the number of claims on day t. We shall consider the distribu- tion of Nt conditionally on both the weather for that day and the number of policies at risk. Standard modelling assumptions, Feller (1966), which rely on the independence of claims given the weather conditions dictate that Nt will follow a Poisson distribution if we further assume that the actual pro- portion of policies which give rise to a claim on this day is small. Thus we apply the approximation of a Poisson distribution with a rate parameter λt.
So we have
Nt ∼ Poisson(λt).
Note that the information about the number of policies acts as a multiplier for the rate parameter λt.
Since the actual weather and number of policies are unobserved we will treat these factors as random effects for each day. A large proportion of the information which determines the parameter λt represents the effects of the weather on day t. Clearly this will have a component of seasonal variation. It also will be correlated in time since a weather system typically takes about three days to pass across the country. Further the systematic structure of reporting claims mentioned in Section 6.4.2 will affect the value of λt, for example it should be higher on the first day of each month. Putting these factors together means we assume a model
log λt = Ht + εt,
where Ht is a deterministic function of time which represents the fixed tem- poral effects, while εt represents the random weather effects. To model the correlation across time we shall assume a time series structure for εt. As an example a simple possibility is
εt = ρεt−1 + νt (6.6)
where νt ∼ N(0, σt²).
All models in this paper will be based on this structure. We therefore
examine in Section 6.4.4 the appropriate form of the deterministic effect, Ht, and the time series structure of the random effects εt.
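The structure above is easy to simulate, which is also a useful model check. A Python sketch with a constant Ht and invented parameter values, for illustration only:

```python
import numpy as np

def simulate_claims(T, H, rho=0.5, sigma=0.3, seed=0):
    # N_t ~ Poisson(lambda_t) with log lambda_t = H_t + eps_t and
    # AR(1) random effects eps_t = rho * eps_{t-1} + nu_t  (Equation (6.6))
    rng = np.random.default_rng(seed)
    eps = 0.0
    counts = np.empty(T, dtype=np.int64)
    for t in range(T):
        eps = rho * eps + sigma * rng.normal()
        counts[t] = rng.poisson(np.exp(H[t] + eps))
    return counts

H = np.full(2_000, 3.0)          # flat deterministic effect for illustration
n_claims = simulate_claims(2_000, H)
```

Because the random effect enters on the log scale, the counts are overdispersed relative to a plain Poisson with rate exp(Ht), mimicking the clustering of claims around weather events.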
Models for ft(X|N)
Data analysis indicates that after conditioning on the number of claims there is very little seasonal variation in the distribution of X. It appears that a relatively simple model, such as a fixed t-distribution is appropriate here.
6.4.4 Results
Throughout we use a Bayesian approach implemented using Markov chain Monte Carlo (MCMC) methodology, see Gilks, Richardson, and Spiegelhalter (Gilks et al.) and Gelman et al. (2014). As far as possible non-informative priors were used, and in general the likelihood completely dominated the posterior distribution, which made the choice of prior relatively unimportant. There was one exception to this, mentioned in Section 6.4.4. In general a sequential Metropolis-Hastings algorithm was used, although for some of the models we used block-updating to increase efficiency, see Gilks, Richardson, and Spiegelhalter (Gilks et al.). The fact that there was
a large amount of data and that the models were clearly identified meant that the convergence of the MCMC algorithms gave few problems after parameterisations were chosen to aid efficiency, Brooks and Roberts (1998) and Cowles and Carlin (1996).
During the model selection process we attempted to keep the dimension of the model as small as possible, using the posterior distribution of both the parameter values and the likelihood as a guide.
The deterministic model
The deterministic part of the model Ht needs to include long term changes of the level, seasonal effects and the systematic features of the reporting mentioned in Section 6.4.2. This is currently modelled as
$$H_t = \alpha_0 + \alpha_1 t + \alpha_2 t^2 + \alpha_3 f_m(t) + \sum_{i=1}^{7}\beta_i d_i(t) + \sum_{i=1}^{12}\gamma_i m_i(t), \qquad (6.7)$$
where fm is an indicator function which is 1 if t is the first day of the month, and 0 otherwise, while mi is the indicator function for the ith month of the year and di for the ith day of the week.
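Building the design matrix for (6.7) is mostly bookkeeping with indicator columns. A Python sketch (the function name is mine, and the trend columns are left unscaled here):

```python
import numpy as np
import datetime as dt

def ht_design(dates):
    # columns: intercept, t, t^2, first-of-month flag f_m(t),
    #          7 day-of-week indicators d_i(t), 12 month indicators m_i(t)
    rows = []
    for t, d in enumerate(dates):
        dow = np.zeros(7)
        dow[d.weekday()] = 1.0
        month = np.zeros(12)
        month[d.month - 1] = 1.0
        fm = 1.0 if d.day == 1 else 0.0
        rows.append(np.concatenate(([1.0, t, t * t, fm], dow, month)))
    return np.array(rows)

start = dt.date(1990, 1, 1)
Xh = ht_design([start + dt.timedelta(days=k) for k in range(60)])
```

In practice one level of each indicator block is dropped, or the effects are constrained, to avoid confounding with the intercept.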
We first discuss the long term trends. The output from the model is shown in Figure 6.4. The left hand panels show the output of the MCMC algorithm for each of the three parameters, α0, α1 and α2. They show that the algorithm is mixing well and the output will give the posterior distribution of the parameters. The convergence seen here is typical of that for the other parameters in the model. The right hand panels show the marginal posterior distribution of the parameters. They show that there is reasonable evidence for both α1 and α2 being non-zero. However it is important to note that the unobserved number of policies would enter as an additive part of equation (6.7), hence will be confounded with these trend terms.
The seasonal effects also were found to be strongly significant in all models. Various versions of a seasonal component were tried, including sinusoidal components. The twelve level monthly model used in equation (6.7) was found to be a simple and effective description of this variation. Figure 6.5 shows the posterior output for the twelve levels. The seasonal effect, with winter being worse than summer, is very obvious.
The systematic reporting effects are picked up by the day of the week effect, shown in Figure 6.6, and the first day of the month effect, shown in Figure 6.7. For the weekly effect we see that there is an excess of claims on Monday and a corresponding decrease at the weekend. The first day of the month effect is shown to be significantly positive.
Overall the model has done very well in exploring the deterministic variation across days. It can detect long term variation and has quantified the systematic reporting effects very clearly.
[Figure: MCMC traces and marginal posteriors for the constant, linear and quadratic trend parameters]
Figure 6.4: The trend effect: left hand panel shows output from MCMC algorithm, right hand panel showing marginal distribution.
[Figure: marginal posteriors of the twelve monthly effects, January to December]
Figure 6.5: The seasonal monthly effect
[Figure: marginal posteriors of the seven day-of-week effects, Sunday to Saturday]
Figure 6.6: The weekly effect
[Figure: marginal posterior of the first-day-of-month effect]
Figure 6.7: The first day of the month effect
The random effects
The modelling for the random effects needs to take into account both the marginal and the dependence structure. A simple example of a model is given by Equation (6.6) which simply models the correlation across time using a simple autoregressive structure. The innovation process ν is given by a Normal model. The variance of this term is modelled as having seasonal variation as
σ_t² = σ² Σ_{i=1}^{12} γ_i m_i(t)

where m_i is the monthly indicator function defined in Equation (6.7). The seasonal variation in the variance appears to be strongly significant.
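Since m_i(t) is an indicator, the sum above simply selects the multiplier γ_i for the month containing day t. A minimal Python sketch of this lookup; the γ values below are illustrative placeholders, not the fitted case-study values:

```python
import datetime

sigma2 = 1.0
# Hypothetical monthly multipliers gamma_1..gamma_12 (illustrative only).
gamma = [1.4, 1.2, 1.0, 0.9, 0.8, 0.7, 0.7, 0.8, 0.9, 1.0, 1.2, 1.5]

def seasonal_variance(day):
    """sigma_t^2 = sigma^2 * sum_i gamma_i * m_i(t); since m_i(t) indicates
    whether day t falls in month i, the sum picks out exactly one term."""
    return sigma2 * gamma[day.month - 1]

print(seasonal_variance(datetime.date(1990, 1, 15)))  # -> 1.4
```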
In Figure 6.8 we plot the marginal distribution of the innovation process, based on this model. The left hand panel shows the marginal structure: the histogram shows a skewed distribution. This is also seen in the plot of the innovations against the day in the right hand panel. The very large values in this plot correspond to very large one-day jumps in the number of claims.
The model above is capable of modelling the bulk of this distribution well, but there is a question of how to model both of the tails. Other innovation processes which have been used include a t-distribution, and mixtures of Normals and t-distributions. For further details of this issue see Section 6.4.5.
Figure 6.8: The innovation process; the left hand panel shows the marginal distribution, the right hand panel the time series structure (innovations against days from 1/1/1990)
Model checking
One of the advantages of the MCMC methodology is that simple data analytic techniques can be used to give informative model diagnostics. One useful technique is to use the predictive distribution of the model and compare the output with the empirical distribution based on the data. As an example of this technique see Figure 6.9, a comparison of the predictive distribution to the observed data. In this example we look at the proportion of days with fewer than 10 claims. The histogram is the predictive distribution generated from the model. The vertical line is the observed proportion. We see that this is completely consistent with the model's predictions.
6.4.5 Conclusions and further work
The modelling methodology here is designed to reflect the physical process which is generating the data. Traditional time series approaches are forced to model some aggregation of the data, by modelling directly the total cost of claims on a day, for example. In contrast we model the full joint distribution of all the data. We feel that a model which is closer to the actual physical process offers great advantages in interpretation and predictive power.
There are a number of outstanding issues which need to be explored. We look at the three objectives raised in Section 6.4.1.
1. Temporal effects. The model is able to detect yearly variation, both using trends in the deterministic function, Ht, and potentially in the structure of the innovation process. The model recognises that changes in the numbers of claims are confounded with changes in the number of policies at risk. However, there is scope to incorporate the number of policies at risk in order to give a satisfactory model for this first question.
2. Aggregation. By modelling the full joint distribution we have implicitly modelled any aggregation process. This can be done either analytically or via simulation. Clearly there is further work to be done here.
3. Extremes. The most delicate part of the modelling concerns the incorporation of the extreme meteorological events in the dataset. Currently, different models of the innovation process can treat these observations very differently. Further work is clearly needed in order to clarify this issue. A more detailed dataset, as mentioned in Section 6.4.2, could help. Since we have modelled the data via the physical process driving it, there is also the possibility of using meteorological models directly in the modelling. This possibility opens up a whole new set of information which can greatly improve prediction of return levels of these extreme events.
Figure 6.9: An example of comparing the predictive distribution to the observed data. In this example we look at the proportion of days with fewer than 10 claims. The vertical line is the observed proportion
Appendix
6.A JAGS, BUGS and R
As we have seen above, we can use MCMC to compute any posterior quantity in a Bayesian setting. One major issue with MCMC is computational speed, and a number of packages have been written to help users efficiently implement MCMC in complex hierarchical models. We will focus on JAGS, which can be run directly from R.
BUGS and WinBUGS1 were highly successful packages, made freely available to users, which had a major impact on the use of MCMC methods in many areas. A very good and up to date introduction can be found in Lunn et al. (2012). BUGS stands for 'Bayesian inference Using Gibbs Sampling' and the WinBUGS package was developed to run on Windows machines.
In this course we will concentrate on the package JAGS, a clone of BUGS which runs on Windows, Mac and UNIX machines and, through the R package rjags, can easily be run directly from R. Note that the rjags package does not include a copy of the JAGS library, which must be installed separately2. JAGS (Just Another Gibbs Sampler) is licensed under the GNU General Public License and is thus freely available.
Let us look at a very simple example of running a Gibbs sampler from R using JAGS.
Example 6.A.1. (Independent normal example) Consider the model

X | θ, τ ∼ N(θ, τ⁻¹)
θ | τ ∼ N(0, 100/τ)
1/τ ∼ Gamma(1/1000, 1/1000)
First, in a text file – which I have called jagsmodel2.R, but could have any name – we have the JAGS (and in fact BUGS) code for this model. This is given by
1 http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/contents.shtml
2 For instructions on downloading JAGS, see the home page at http://mcmc- jags.sourceforge.net.
model {
  for (i in 1:N) {
    X[i] ~ dnorm(theta, tau)    # JAGS dnorm is parameterised by (mean, precision)
  }
  theta ~ dnorm(0, 0.01*tau)    # theta | tau ~ N(0, 100/tau)
  tau <- 1.0/tau1               # 1/tau ~ Gamma(1/1000, 1/1000)
  tau1 ~ dgamma(1.0E-3, 1.0E-3)
}
library(rjags)
X <- rnorm(50, mean=0, sd=100)
N <- 50
model2 <- jags.model(file="jagsmodel2.R", n.adapt=0)
update(model2, n.iter=5000)
output <- coda.samples(model2, c("theta","tau"), n.iter=10000)
plot(output)
which gives the plot shown in Fig. 6.A.1.
Figure 6.A.1: Output from Example 6.A.1: trace plots and marginal density estimates for tau and theta
A few points to note from Example 6.A.1. Firstly, you need to be careful that R and JAGS parameterise distributions in subtly different ways. Compare dnorm(theta, tau) in JAGS, which uses the mean and precision (the inverse of the variance), with rnorm(50, mean=0, sd=100) in R, which uses the mean and standard deviation. Further, the variables which the model uses (in the example above, X and N) need to exist in the R session.
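For this conjugate model the full conditionals are available in closed form, so the Gibbs sampler that JAGS runs can also be sketched by hand. Below is a hedged Python version of the mechanics; for simplicity it places the Gamma(1/1000, 1/1000) prior on τ itself rather than on 1/τ, so it is an illustration rather than an exact replica of the JAGS model:

```python
import math
import random

random.seed(1)
# Simulated data, as in the example: 50 draws from N(0, sd = 100).
X = [random.gauss(0.0, 100.0) for _ in range(50)]
n = len(X)
a, b = 1e-3, 1e-3  # Gamma(a, b) prior on tau itself (a simplification)

def gibbs(n_iter=5000, burn=1000):
    theta, tau = 0.0, 1.0
    draws = []
    for it in range(n_iter):
        # theta | tau, X ~ N( sum(X)/(n + 0.01), 1/(tau*(n + 0.01)) ),
        # since the prior theta | tau ~ N(0, 100/tau) has precision 0.01*tau.
        m = sum(X) / (n + 0.01)
        theta = random.gauss(m, math.sqrt(1.0 / (tau * (n + 0.01))))
        # tau | theta, X ~ Gamma(a + (n+1)/2, b + (sum((x-theta)^2) + 0.01*theta^2)/2)
        shape = a + (n + 1) / 2.0
        rate = b + (sum((x - theta) ** 2 for x in X) + 0.01 * theta ** 2) / 2.0
        tau = random.gammavariate(shape, 1.0 / rate)  # gammavariate takes (shape, scale)
        if it >= burn:
            draws.append((theta, tau))
    return draws

draws = gibbs()
theta_hat = sum(d[0] for d in draws) / len(draws)
tau_hat = sum(d[1] for d in draws) / len(draws)
```

The posterior mean of theta should sit near the (shrunk) sample mean, and tau_hat near the reciprocal of the sample variance.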
7 The Kalman Filter
7.1 Introduction
In this chapter we take a brief look at a number of other topics that are useful in time series modelling. We will not be looking at these in great depth, rather give pointers to where you can find ideas if you are interested.
One unifying theme is the idea of a state space, which captures the idea that the observed data do not, by themselves, define the underlying time series structure; rather, perhaps for theoretical reasons, there is an underlying and hidden structure.
7.1.1 State space model
Definition 7.1.1. (State space model) Let {Yt} be an (observed) time series with (unobserved) state process {θt}. In a state space model the dependence between these is defined by the graphical model shown in Fig. 7.1. The relationship defines the conditional independence relations: (i) {θt} is a Markov chain; (ii) conditionally on (θ1, . . . , θt) the Y1, . . . , Yt are independent, so that the joint distribution of (Y1, . . . , Yt) | (θ1, . . . , θt) is

Π_{i=1}^{t} f(y_i | θ_i).

Figure 7.1: State space structure

State space models are also called hidden Markov models.
When assuming a state space structure, we have an (unobserved) state variable at time s and a set of observations y1, . . . , yt. We can subdivide problems as follows:
1. filtering is the case s = t,
2. state prediction is the case s > t, and
3. smoothing is the case s < t.
7.2 State space representations of ARIMA models
The ARIMA models we studied in Chapter 4 can be written in a state space form and then the efficient prediction methods of the Kalman filter can be used directly. In fact the predict.Arima( ), predict( ) and arima( ) functions all use this method.
Example 7.2.1. (State space representation of AR(1) process) We can write an AR(1) model
Xt = φXt−1 + Zt
where Zt ∼ N(0,σ2) as a state space model trivially as
Yt = Xt,
Xt = φXt−1+Zt.
Example 7.2.2. (State space representation of AR(2) process) We can write an AR(2) model

X_t = φ1 X_{t−1} + φ2 X_{t−2} + Z_t,

where Z_t ∼ N(0, σ²), as a state space model by taking a two dimensional state variable X_t = (X_t, X_{t−1})^T and

Y_t = (1 0) X_t,
X_t = ( φ1  φ2 ; 1  0 ) X_{t−1} + ( Z_t ; 0 ).
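We can sanity-check this representation numerically: iterating the matrix recursion with the same innovations must reproduce the scalar AR(2) recursion exactly. A small Python sketch, with illustrative φ values:

```python
import random

random.seed(0)
phi1, phi2 = 0.5, 0.3
Z = [random.gauss(0.0, 1.0) for _ in range(100)]

# Scalar AR(2): X_t = phi1*X_{t-1} + phi2*X_{t-2} + Z_t, started at zero.
x1, x2 = 0.0, 0.0  # X_{t-1}, X_{t-2}
scalar = []
for z in Z:
    x = phi1 * x1 + phi2 * x2 + z
    scalar.append(x)
    x2, x1 = x1, x

# State space form: state = (X_t, X_{t-1})^T with transition
# [[phi1, phi2], [1, 0]] and observation Y_t = (1, 0) X_t.
state = [0.0, 0.0]
ys = []
for z in Z:
    state = [phi1 * state[0] + phi2 * state[1] + z, state[0]]
    ys.append(state[0])  # Y_t = (1 0) X_t
```

The two sequences `scalar` and `ys` agree term by term.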
Definition 7.2.3. (makeARIMA( ) function) In R, if we want the state space representation of an ARIMA(p, d, q) model we can use the makeARIMA( ) function. This outputs a state space model with the notation

y_t = Z a_t + η_t, η_t ∼ N(0, h),
a_t = T a_{t−1} + R e_t, e_t ∼ N(0, Q).

It can then be used, for example, in the KalmanLike( ) function to compute a likelihood.
7.3 The Kalman filter
7.3.1 Historical note
We now look at the Kalman filter and its computational properties. The filter, taking its name from the paper Kalman (1960), had its origins in engineering control problems and found early applications in controlling spacecraft in the Apollo program. The relationship between forecasting and control is a strong one – indeed the point of many forecasting exercises is to ask the 'what if?' question to see if a system needs to be changed in some way. These types of questions about engineering control also motivated many of the ideas of Box and Jenkins, and their book, Box and Jenkins (1976), is titled Time series analysis: forecasting and control, showing the importance of the link in their opinion.
In many of these applications the speed of computation was one of the most important features. With real time control problems you typically need to be able to compute predictions faster than you are gathering data: think about controlling a moon landing, for example. This concern with speed is reflected in the use of many recursive computation methods. Although computational power has increased exponentially since the development of these ideas, there are still many current applications where speed of computation is critical. So-called high-frequency trading in finance is one such case.
We note that the Kalman filter is also an efficient, recursive algorithm for computing the likelihood function for ARIMA models, recalling that the key computational issue involves calculating the determinant and inverse – or eigenvalues – of a large matrix.
7.4 The Kalman filter
Let us consider the case where we are interested in estimating the hidden state, and computing it and its associated uncertainty faster than the rate at which new data arrive.
One problem with the literature about the Kalman filter is that there are many different forms of notation used for setting up the problem. Here I will use that found in Time Series Analysis and its Applications by Shumway and Stoffer.1
Definition 7.4.1. (State space model) Consider a p-dimensional state variable x_t which satisfies the state equation

x_t = Φ x_{t−1} + Υ u_t + w_t, (7.1)

where u_t are a set of control or input variables and w_t ∼ i.i.d. MVN_p(0, Q). The observed part of the system is defined via the observation equation

y_t = A_t x_t + Γ u_t + v_t, (7.2)

where y_t is q-dimensional, v_t ∼ i.i.d. MVN_q(0, R), and the two noise terms w, v are independent of one another.

We treat the simplest case where the matrices Φ, Υ, A_t, Γ and the variances Q, R in (7.1) and (7.2) are considered known, while the state variable x_t is the object of interest and is unobserved. We think of the observation variable y_t, which we do observe, as a transformed noisy version of the state variable. We are trying to smooth, filter or forecast the state variable based on the observation variable, and to do this in real time. We may also want to use u_t to control the hidden state variable, again in real time.

Since Equation (7.1) is recursive we require an initial condition x_0, which is MVN_p(0, Σ_0). This is assumed to be independent of the noise terms and, again in the simplest example, we will consider Σ_0 known.

The Kalman filter was originally introduced primarily for use in aerospace-related research, including landing a spacecraft on the moon and radar tracking problems. The essential feature of these problems is that we need real time analysis. There are now many other applications, for example in economics, engineering and medicine.

1This version of the book can be downloaded from Stoffer's site https://www.stat.pitt.edu/stoffer/tsa4/ when on campus, using Waterloo's SpringerLink agreement. The site also has lots of good R code and libraries.
Figure 7.1: The ball tracking example and the Kalman filter: the KF makes one step ahead predictions; as the next new noisy observation (red dots) comes in, the KF almost immediately updates this estimate of the (now) current value (blue).

Example 7.4.2. (Ball tracking example) It is common to see ball-tracking algorithms when watching sports like baseball, cricket and tennis. These can be used to make decisions: in tennis, whether a ball was on the line or out; in cricket, to decide on an LBW appeal. They are also used to report statistics, such as the distance a home run was hit in baseball. The tracking comes from camera information, and the underlying physics, and has to be done in real time, or very close to it.
We will explore a simplified version of one of these algorithms, where we ignore everything but gravity and the initial conditions.
Suppose we have a falling body whose height at time t – which to start with is continuous – is z(t). We have, from physics, that

d²z/dτ²(τ) = −g,

and we can solve this differential equation. Integrating once gives

dz/dτ(τ) = −gτ + c_1,

and using the initial conditions at time t gives c_1 = dz/dτ(t) + gt, so we can write

dz/dτ(τ) = −g(τ − t) + dz/dτ(t).

Integrating again gives

z(τ) = −(g/2)τ² + gtτ + dz/dτ(t)τ + c_2,

and using the initial conditions at t to determine c_2 gives exactly

z(τ) = z(t) + dz/dτ(t)(τ − t) − (g/2)(τ − t)².

We work in discrete time for t = 0, 1, · · · with each time increment being ε seconds, and denote

z_t := z(εt), ż_t := dz/dτ(εt).
Thus we have

z_{t+1} = z_t + ε ż_t − (ε²/2) g,

giving the discrete time evolution of z_t in terms of ż_t. Furthermore, from the original equation d²z/dτ²(τ) = −g, by the finite difference method,

−g = d²z/dτ²(τ) ≈ ( dz/dτ(τ + ε) − dz/dτ(τ) ) / ε,

or, in terms of discrete time,

ż_{t+1} = ż_t − εg + o(ε),

giving the evolution of the velocity term. Thus the complete evolution equations are the pair

z_{t+1} = z_t + ε ż_t − (ε²/2) g,
ż_{t+1} = ż_t − εg + o(ε).

Let us therefore define the state vector x_t := (z_t, ż_t)^T; the physics then gives us the state equation

x_{t+1} = ( 1  ε ; 0  1 ) x_t − ( 0.5ε² ; ε ) g.

The discretisation error in the above calculation can be thought of as an error term, giving the state equation

x_{t+1} = ( 1  ε ; 0  1 ) x_t − ( 0.5ε² ; ε ) g + w_t.
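Because the acceleration is constant, the noise-free discrete evolution above is exact, not just an approximation: iterating it reproduces the closed form z(τ) = z(0) + ż(0)τ − (g/2)τ². A quick Python check with illustrative initial conditions:

```python
eps, g = 2e-3, 9.81
z, zdot = 1.0, 0.0  # initial height and velocity (illustrative values)

for _ in range(500):
    # z_{t+1} = z_t + eps*zdot_t - (eps^2/2) g ;  zdot_{t+1} = zdot_t - eps*g
    z, zdot = z + eps * zdot - 0.5 * eps ** 2 * g, zdot - eps * g

tau = 500 * eps  # elapsed time: one second
exact = 1.0 + 0.0 * tau - 0.5 * g * tau ** 2  # z(tau) = z(0) + zdot(0) tau - g tau^2 / 2
```

Up to floating point rounding, `z` and `exact` coincide.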
This determines the theoretical behaviour of the body, but might not be what is observed. In our example we have a camera which records a (noisy) measurement of z_t:

y_t = z_t + v_t = (1 0) x_t + v_t,

which we denote by

y_t = A_t x_t + v_t.

The terms Φ, A_t come from the physics and from understanding the measurement system.
The distributions of w_t and v_t are found by calibration experiments; in the example shown in Fig. 7.1,

w_t ∼ i.i.d. N( (0, 0)^T, diag(2 × 10⁻⁵, 10⁻²) ),
v_t ∼ i.i.d. N(0, 10⁻²),

and ε = 2 × 10⁻³.
In the ball tracking example the laws of physics and calibration experiments give us information which is clearly highly informative and can be used for forecasting, filtering and control. In this case the model comes from known physics and not from an empirical model selection process like Box-Jenkins.
Further, as we shall see, the Kalman filter behaves like exponential smoothing or Holt-Winters since it uses an error correcting approach. However, it also updates its estimate of uncertainty in real time, and it estimates the full state variable even though it is of larger dimension than the observation; see Fig. 7.2.
Figure 7.2: The ball tracking example and the Kalman filter: reconstructing the two dimensional state (height and velocity) from one dimensional observations
7.5 Algorithm
We start by reviewing notation.
Definition 7.5.1. (State structure for the KF) Consider a p-dimensional state variable x_t which satisfies the state equation

x_t = Φ x_{t−1} + Υ u_t + w_t, (7.3)

where u_t are a set of control or input variables and w_t ∼ i.i.d. MVN_p(0, Q). The observed part of the system is defined via the observation equation

y_t = A_t x_t + Γ u_t + v_t, (7.4)

where y_t is q-dimensional, v_t ∼ i.i.d. MVN_q(0, R), and the two noise terms w, v are independent of one another.

In the standard cases the matrices Φ, Υ, A_t, Γ are known from theory and the variances Q, R in (7.3) and (7.4) are considered known from calibration.
The state variable xt is the object of interest and is unobserved. We think of the observation variable, yt, as a transformed noisy version of the state variable which we do observe. We also may want to use ut to represent controls or outside influences.
Since Equation (7.3) is recursive we require an initial condition x_0, which is MVN_p(0, Σ_0). This is assumed to be independent of the noise terms and, again in the simplest examples, we will consider Σ_0 known from calibration.
The algorithm at each time period calculates a prediction, which is then corrected as soon as new information becomes available, and the design is such that each step is a low dimensional calculation: i.e. there are no large matrix inversions, rather lots of small ones.
The algorithm keeps track of, and updates, the forecast uncertainty. Let us define

x_t^s := E(x_t | y_1, . . . , y_s),

the best mean square estimate of x_t based on Y_s := (y_1, . . . , y_s), and

P_{t1,t2}^s := E[ (x_{t1} − x_{t1}^s)(x_{t2} − x_{t2}^s)^T ],

writing P_t^s when t1 = t2 = t.
Definition 7.5.2. (Kalman filter) For the state space model defined by (7.3) and (7.4) we have, from the conditional mean,

x_t^{t−1} = Φ x_{t−1}^{t−1} + Υ u_t. (7.5)

This is the prediction step. It has an MSE which satisfies

P_t^{t−1} = Φ P_{t−1}^{t−1} Φ^T + Q. (7.6)

We can then also predict the next observed value by

y_t^{t−1} = A_t x_t^{t−1} + Γ u_t. (7.7)

The correction step, or update, is defined by

x_t^t = x_t^{t−1} + K_t ( y_t − A_t x_t^{t−1} − Γ u_t ). (7.8)

It has an MSE/posterior variance which satisfies

P_t^t = [ I − K_t A_t ] P_t^{t−1}, (7.9)

where the term K_t is called the Kalman gain and is defined by

K_t := P_t^{t−1} A_t^T ( A_t P_t^{t−1} A_t^T + R )^{−1}.
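The recursions (7.5)-(7.9) are short enough to implement directly. The following is a hedged one-dimensional Python sketch with no control term (Υ = Γ = 0); with Φ = 1 and Q = 0 the filter reduces to Bayesian averaging of the observations:

```python
# One-dimensional Kalman filter: prediction (7.5)-(7.6), gain, correction (7.8)-(7.9).
# Phi, A, Q, R are treated as known scalars.
def kalman_filter(ys, Phi, A, Q, R, x0=0.0, P0=1.0):
    x, P = x0, P0
    out = []
    for y in ys:
        # Prediction: x_t^{t-1} = Phi x_{t-1}^{t-1};  P_t^{t-1} = Phi P Phi + Q
        xp = Phi * x
        Pp = Phi * P * Phi + Q
        # Kalman gain: K_t = P_t^{t-1} A (A P_t^{t-1} A + R)^{-1}
        K = Pp * A / (A * Pp * A + R)
        # Correction: x_t^t = x_t^{t-1} + K (y_t - A x_t^{t-1});  P_t^t = (1 - K A) P_t^{t-1}
        x = xp + K * (y - A * xp)
        P = (1.0 - K * A) * Pp
        out.append((x, P))
    return out

# A constant state (Phi = 1, Q = 0) observed with unit noise: after 5 observations
# and a N(0, 1) prior, the filtered mean is sum(y)/6 and the variance 1/6.
est = kalman_filter([1.1, 0.9, 1.0, 1.05, 0.95], Phi=1.0, A=1.0, Q=0.0, R=1.0)
```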
For a complete proof of the structure of the Kalman filter see Chapter 6 of Shumway and Stoffer. Here we will not go into all the details but will go through the key steps in the formulation.
The first is that the state equation

x_t = Φ x_{t−1} + Υ u_t + w_t

basically has an AR(1) structure, so is Markovian. The error term here is assumed Gaussian; that is used at a couple of stages in the proof. If we do not assume Gaussian errors the method can still be made to work by looking for best linear one-step ahead predictors.

Since we are using a Gaussian assumption, and the assumed independence structure, we can use the recursive nature of the Kalman filter to compute the likelihood function. If we tried to do this directly we would need to invert a large (n × n) variance-covariance matrix, which can be hard. The Kalman filter replaces this calculation with a sequence of low-dimensional matrix inversions. In fact R uses a version of the Kalman filter in the arima function to efficiently calculate the likelihood.
In R we can implement the algorithm using the library astsa and the Kfilter1 function.
7.6 The Kalman Filter and Bayesian Updating
One of the elegant aspects of Bayesian inference is the way that Bayes' theorem tells us exactly how to update our beliefs as we get new data. We note that the value of the Kalman filter is the clever way it updates using the prediction/correction method. In this section we explain that, in fact, these are the same thing.
Let us start with a simplified Bayesian example for the case y ∼ N(θ, 1). If the prior distribution for θ is in the Normal family, i.e.

θ ∼ N(μ_prior, σ_prior²) ∝ exp( −(θ − μ_prior)² / (2σ_prior²) ),

then the posterior, after seeing y, is

Pr(θ | y) ∝ Lik(θ) × prior(θ) = N( (μ_prior + y σ_prior²) / (1 + σ_prior²), σ_prior² / (1 + σ_prior²) ).

This defines the iterative update rule for the posterior mean:

μ_prior → μ_posterior := (μ_prior + y σ_prior²) / (1 + σ_prior²) → · · · → E(Y)

as more data arrive.
Recall we have the Kalman filter state space structure

x_t = Φ x_{t−1} + Υ u_t + w_t, (7.10)
y_t = A_t x_t + Γ u_t + v_t, (7.11)

where u_t are a set of control or input variables, w_t ∼ i.i.d. MVN_p(0, Q), y_t is q-dimensional, v_t ∼ i.i.d. MVN_q(0, R), and the two noise terms w, v are independent of one another. We want to get insight into how the structure of Equations (7.10, 7.11) leads to the prediction-correction iterative cycle of the Kalman filter and its error correcting structure.
We will simplify notation considerably to get to the essentials. First we drop the control terms Υu_t, Γu_t, and work in the one-dimensional case. This is not necessary, but it makes the story simpler.

Our estimates will be conditional expectations – i.e. posterior means – and we work under normality assumptions. Let us define some notation:

H_s = {y_s, y_{s−1}, · · · }.

Our target is to compute E(X_t | H_t) through an update rule:

1. Predict using E(X_t | H_{t−1}).
2. Correct when y_t becomes available.
Now suppose that, conditionally on H_{t−1}, we have the 'prior and likelihood'

X_t | H_{t−1} ∼ N(Φ x_{t−1}, P_{t−1}) =: N(μ_{t−1}, P_{t−1}), (7.12)
Y_t | x_t ∼ N(A x_t, R). (7.13)
Let us look at the Bayesian interpretation of Equations (7.12-7.13). Equation (7.12) describes how much we know about the state at time t given the observations up to t − 1. It is a full distribution, having the conditional expectation as point estimate and the conditional variance to measure uncertainty: it is our prior at time t − 1. Equation (7.13) describes the uncertainty in the observation given that we know the state at time t. We are going to combine them to get the posterior, which tells us what we know at time t.
One key assumption here is that of Normality. We will see that if Equation (7.12) is Normal at stage t − 1 then the corresponding term will also be Normal at stage t. This assumption can be analysed further, but if t is large it is usually reasonable to assume normality.
We can then think of Equation (7.12) as being a prior for X_t, and Equation (7.13) as the likelihood. We are updating prior mean and prior variance to posterior mean and posterior variance. In terms of the Bayesian setting and notation we have
Lik(y_t) = P(y_t | x_t) = (1/√(2πR)) exp( −(y_t − A x_t)² / (2R) ),
Prior(x_t) = (1/√(2πP_{t−1})) exp( −(x_t − μ_{t−1})² / (2P_{t−1}) ).

So we get the posterior

P(x_t | y_t) ∝ exp( −(y_t − A x_t)² / (2R) ) exp( −(x_t − μ_{t−1})² / (2P_{t−1}) )
∝ exp( −(1/2) [ (A²/R + 1/P_{t−1}) x_t² − 2 (μ_{t−1}/P_{t−1} + y_t A/R) x_t ] )
=: N(μ_t, P_t).

So we have updated our knowledge:

N(μ_{t−1}, P_{t−1}) →_{y_t} N(μ_t, P_t).
Direct calculation gives the posterior variance and mean as

P_t = ( A²/R + 1/P_{t−1} )^{−1} = R P_{t−1} / (A² P_{t−1} + R),

μ_t = ( μ_{t−1}/P_{t−1} + y_t A/R ) / ( A²/R + 1/P_{t−1} ) = μ_{t−1} + K_t ( y_t − A μ_{t−1} ),

by defining the gain

K_t := A P_{t−1} / ( A² P_{t−1} + R ).
So we have the update rule, which takes the form

μ_{t−1} → μ_t := μ_{t−1} + K_t ( y_t − A μ_{t−1} ) = μ_{t−1} + K_t × (error in prediction of y),
P_{t−1} → P_t := (1 − K_t A) P_{t−1}.

This form is the one dimensional version of the general case. It shows directly the error correcting form. It identifies K_t as the sample dependent – and therefore non-constant – version of the smoothing parameter in exponential smoothing. The generalisation to the general matrix form is the same idea and is simply (long) algebra.
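A quick numerical check, with illustrative numbers, that the completing-the-square form and the error correcting (gain) form of the update agree:

```python
# Prior N(mu_prev, P_prev), observation y ~ N(A x, R): compare the two
# algebraic forms of the posterior mean and variance derived above.
mu_prev, P_prev, A, R, y = 0.5, 2.0, 1.5, 0.8, 2.2  # illustrative values

# Completing-the-square form: precisions add.
P_post = 1.0 / (A * A / R + 1.0 / P_prev)
mu_post = P_post * (mu_prev / P_prev + y * A / R)

# Error-correcting (Kalman gain) form.
K = A * P_prev / (A * A * P_prev + R)
mu_gain = mu_prev + K * (y - A * mu_prev)
P_gain = (1.0 - K * A) * P_prev
```

The two routes give the same posterior mean and variance, which is the algebraic content of the equivalence above.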
8 ARCH and GARCH modelling
Example 8.0.1. (The Bollerslev-Ghysel benchmark dataset)
The data in Fig. 8.1 from Bollerslev and Ghysels (1996) is the daily percentage nominal returns for the Deutschemark-Pound exchange rate. We see the non-constant volatility and volatility clustering that is common in such financial data. We also can see from the marginal plots that, when the dynamics are excluded, we get non-normal behaviour: the data is 'heavier-tailed' than a constant variance Gaussian process.
Figure 8.1: Bollerslev-Ghysel benchmark dataset
We have already seen, in Chapter 1, a way of modelling data which has non-constant volatility. We recall the definition here.
Definition 8.0.2. (ARCH model) An AutoRegressive Conditional Heteroscedasticity (ARCH(p)) model is defined hierarchically: first define X_t = σ_t Z_t, where Z_t ∼ N(0, 1) i.i.d., but treat σ_t as being random such that

σ_t² = α_0 + α_1 X_{t−1}² + · · · + α_p X_{t−p}².
So the variance is time dependent – a large value of X_t will result in a period of high volatility.
We can write X_t as a non-linear function of the previous innovations in the following way. Note that this is different from the ARMA story where, if we have a causal stationary process, we write it as a linear function of the previous innovations. Let us just look at the ARCH(1) model. We have, following (Brockwell and Davis, 2002, Page 350),

X_t = σ_t Z_t, σ_t² = α_0 + α_1 X_{t−1}².

So we have

X_t² = σ_t² Z_t² = α_0 Z_t² + α_1 Z_t² X_{t−1}²
= α_0 Z_t² + α_1 Z_t² ( α_0 Z_{t−1}² + α_1 Z_{t−1}² X_{t−2}² )
= α_0 Z_t² + α_0 α_1 Z_t² Z_{t−1}² + α_1² Z_t² Z_{t−1}² X_{t−2}²
= α_0 Z_t² + α_0 α_1 Z_t² Z_{t−1}² + α_0 α_1² Z_t² Z_{t−1}² Z_{t−2}² + · · · + α_1^{n+1} Z_t² Z_{t−1}² · · · Z_{t−n}² X_{t−n−1}².

Now if we assume |α_1| < 1 this last term will converge to zero as n → ∞, and we get the expression

X_t² = α_0 ( Z_t² + α_1 Z_t² Z_{t−1}² + α_1² Z_t² Z_{t−1}² Z_{t−2}² + · · · ).

This is the non-linear causal representation of X_t in terms of the previous innovations Z_s. Taking expectations gives

E X_t² = α_0 ( 1 + α_1 + α_1² + · · · ) = α_0 / (1 − α_1).
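The unconditional variance formula E(X_t²) = α_0/(1 − α_1) can be checked by simulation. A Python sketch with illustrative parameters α_0 = 1, α_1 = 0.5, so the theoretical variance is 2:

```python
import random

random.seed(42)
a0, a1 = 1.0, 0.5  # alpha_0, alpha_1 with |alpha_1| < 1 (illustrative)
n = 200_000

x_prev, acc = 0.0, 0.0
for _ in range(n):
    sigma2 = a0 + a1 * x_prev ** 2       # sigma_t^2 = alpha_0 + alpha_1 X_{t-1}^2
    x = (sigma2 ** 0.5) * random.gauss(0.0, 1.0)  # X_t = sigma_t Z_t
    acc += x * x
    x_prev = x

sample_var = acc / n
theory = a0 / (1.0 - a1)  # = 2.0
```

The sample variance should be close to the theoretical value, up to Monte Carlo error.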
As with ARMA models we need to check if there are stationary solutions for any given sets of parameters. Using these results gives the following conclusion.
Theorem 8.0.3. (Existence of stationary solution) For an ARCH(1) model with |α_1| < 1 there is a unique causal stationary solution of the equations of Definition 8.0.2. It has the moment structure:

E(X_t) = E( E(X_t | Z_s, s < t) ) = 0,
Var(X_t) = α_0 / (1 − α_1),
Cov(X_{t+h}, X_t) = E( E(X_{t+h} X_t | Z_s, s < t + h) ) = 0, for h > 0.
Proof. (Sketch) For an ARCH(1) model we have

X_t = σ_t Z_t, σ_t² = α_0 + α_1 X_{t−1}²

where Z_t ∼ WN(0, 1). Hence we have

X_t² = σ_t² Z_t² = ( α_0 + α_1 X_{t−1}² ) Z_t²
= ( α_0 + α_1 ( α_0 + α_1 X_{t−2}² ) Z_{t−1}² ) Z_t²
= ( α_0 + α_1 ( α_0 + α_1 ( α_0 + α_1 X_{t−3}² ) Z_{t−2}² ) Z_{t−1}² ) Z_t²
= α_0 Z_t² + α_0 α_1 Z_t² Z_{t−1}² + α_0 α_1² Z_t² Z_{t−1}² Z_{t−2}² + · · · + α_1^{n+1} X_{t−n−1}² Z_t² Z_{t−1}² · · · Z_{t−n}².

If |α_1| < 1 the last term tends to zero as n → ∞. So we have

X_t² = α_0 Σ_{j=0}^∞ α_1^j Z_t² Z_{t−1}² · · · Z_{t−j}²,

E(X_t²) = α_0 Σ_{j=0}^∞ α_1^j = α_0 / (1 − α_1).

We also have that X_t is purely a function of {Z_s | s ≤ t} and, further, σ_t² is purely a function of {Z_s | s < t}. This means that E(X_t) = E(σ_t) E(Z_t) = 0, and the same conditioning argument gives the covariance result.
> Model2 <- ugarchspec(variance.model = list(model = "sGARCH",
garchOrder = c(1, 0) ),mean.model = list(armaOrder=c(0,0,0),
include.mean = FALSE) )
> fit <- ugarchfit(data = sim, spec = Model2)
> fit
*---------------------------------*
*          GARCH Model Fit        *
*---------------------------------*
Conditional Variance Dynamics
———————————–
GARCH Model : sGARCH(1,0)
Mean Model : ARFIMA(0,0,0)
Distribution : norm
Optimal Parameters
————————————
Estimate Std. Error t value Pr(>|t|)
omega 10.88071 0.596597 18.238 0
alpha1 0.83808 0.053021 15.806 0
As can be seen from the specification of Model2, the rugarch package can fit much more complex GARCH models than we have defined. For example, the mean model does not have to be an i.i.d. sequence, but can have its own time series structure, which is defined by the mean.model argument; above it has been set to the simplest possible. With such possible complexity, in practice the model selection part of using GARCH models can be rather involved.
8.1 Alternative State Space Models
In this section we show that the structure we saw in the data that motivated ARCH and GARCH can be given other modelling approaches. These are also state space models.

Let {Yt} be an (observed) time series with (unobserved) state process {θt}, where the dependence between these is defined by the graphical model shown in Fig. 8.1. Suppose that θi is a two state Markov chain and that Yi is normal with a mean and variance which depend only on the state θi. The states can represent the state of the market:
Figure 8.1: State space structure
1. State 1: the market is growing so positive mean and small volatility
2. State 2: the market is in recession so negative mean and high volatility
3. We transition between states with a transition matrix of the form

( 1−p   p )
(  q   1−q )

with p, q quite small, so that the market stays in the same state for a long time.
Consider the following simple simulation: There are two unobserved states, denoted by red and green. When in the red state the daily change in price is modelled as N(0.1,0.52) and when in the green state the daily change in price is modelled as N(−0.2,22). The probability of staying in the red state is 0.99 while the probability of staying in the green state is 0.98. We plot both the changes in price and the price itself in Fig. 8.2.
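This simulation is easy to reproduce; a Python sketch of the two-state experiment, which also checks the 'heavy tail' property by computing the sample kurtosis (a normal distribution has kurtosis 3):

```python
import random

random.seed(7)
# Two-state Markov chain: "red" (stay prob. 0.99) and "green" (stay prob. 0.98),
# with (mean, sd) of the daily change depending on the state, as in the text.
params = {"red": (0.1, 0.5), "green": (-0.2, 2.0)}
stay = {"red": 0.99, "green": 0.98}

state, changes = "red", []
for _ in range(100_000):
    mu, sd = params[state]
    changes.append(random.gauss(mu, sd))
    if random.random() > stay[state]:
        state = "green" if state == "red" else "red"

# Sample kurtosis: the mixture of normals is heavier-tailed than a single normal.
n = len(changes)
m = sum(changes) / n
var = sum((c - m) ** 2 for c in changes) / n
kurt = sum((c - m) ** 4 for c in changes) / n / var ** 2
```

For these parameters the marginal kurtosis is well above 3, reproducing the heavy tails seen in Fig. 8.3.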
[Two panels over 800 time steps: ‘Observed Change’ and ‘Market Value’.]

Figure 8.2: State space experiment
We can also look at the marginal distribution of the changes in price in Fig. 8.3. We see the ‘heavy tail’ issue as before; this is because we are modelling the changes in price with a mixture of normals.
[A histogram of the daily changes and a normal Q–Q plot, both showing heavy tails.]

Figure 8.3: State space experiment
8.1.1 Fitting the Hidden Markov Model
How do we fit an HMM? In fact, we have seen in the case study in Chapter 6 that Bayesian methods can be used. In R the library depmixS4 allows us to do this. In Fig. 8.4 we see the results of this approach.
In the top panel we see, in black, the observed data. We do not see the state, of course, but since this was a simulation we know its value, and it is plotted in colour in the same plot. The inference problem is to infer the state. The result of applying the depmix function is shown in the lower panel. At each point we have a posterior probability of being in the green state. In this example this is either close to one or close to zero, so the method has done a good job of sorting out the states.
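A minimal sketch of such a fit with depmixS4, assuming the simulated changes are stored in a vector `change` (the variable names here are illustrative):

```r
## Fit a two-state Gaussian hidden Markov model with depmixS4 (sketch)
library(depmixS4)
dat <- data.frame(obs = change)
mod <- depmix(obs ~ 1, data = dat, nstates = 2,
              family = gaussian())        # normal response in each state
fit.mod <- fit(mod)                       # EM estimation
post <- posterior(fit.mod)                # posterior state probabilities
plot(post[, 2], type = "l",
     ylab = "probability", main = "Posterior probability")
```

The plotted posterior probabilities correspond to the lower panel of Fig. 8.4.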
[Top panel: the observed changes with the true state shown in colour; lower panel: the posterior probability of the green state, over 1000 time steps.]

Figure 8.4: Fitting a HMM: library(depmixS4)
9 Frequency domain methods
9.0.1 Why the complex plane?
In this section we are going to be looking at the periodic part of time series analysis. We have, of course, looked at seasonality in previous sections, but here we consider the case where the important periods, or equivalently the frequencies, of the time series are not known a priori. Because periodic functions of different frequencies are going to be key, it is natural that we will be making use of trigonometric functions such as sines and cosines. What is perhaps less obvious is that we are also going to be using complex numbers. The following example motivates why it is very convenient to work in the complex plane. Recall that we link the trigonometric functions and the complex plane through the identity
cos(θ) + i sin(θ) = e^{iθ}.    (9.1)

To show how convenient it is to work in the complex plane, consider the following proof of this important trigonometrical identity.

Theorem 9.0.1. We have the identity

cos(u + v) = cos(u) cos(v) − sin(u) sin(v).    (9.2)

Proof. We write the left hand side, using Equation (9.1), as

cos(u + v) = Re(e^{i(u+v)}) = Re(e^{iu} e^{iv})
           = Re((cos(u) + i sin(u))(cos(v) + i sin(v)))
           = cos(u) cos(v) − sin(u) sin(v).
This is representative of the ease of handling trigonometrical functions after switching to the complex domain. The identity (9.2) is also important since it shows how to change the phase of a trigonometrical function and one way of explaining why we use the complex domain is that it allows us to separate the frequency from the phase.
9.0.2 Spectral density and discrete Fourier transforms
We have seen, in Example 3.2.7, an example of a ‘random’ sine wave as a ‘predictable process’, i.e. where Z_1, Z_2 are two independent N(0, σ²) random variables and we define the discrete time series

X_t = Z_1 cos(2πt/100) + Z_2 sin(2πt/100),

which can be thought of as a sine wave with random amplitude and phase. Using trigonometric identities such as

sin(u + v) = sin(u) cos(v) + cos(u) sin(v),

we can write X_t = √(Z_1² + Z_2²) sin(2πt/100 + φ), where the phase φ satisfies sin(φ) = Z_1 / √(Z_1² + Z_2²).

By allowing the frequency to vary we can have a class of time series.

Example 9.0.2. (Linear combinations of sinusoids) Consider the process defined by

X_t := ∑_{i=1}^{k} (A_i cos(2πω_i t) + B_i sin(2πω_i t))    (9.3)

where A_i, B_i are independent WN(0, σ_i²) processes.

Theorem 9.0.3. The auto-covariance function of (9.3) is

γ(h) = ∑_{i=1}^{k} σ_i² cos(2πω_i h),

so the series is stationary.

Proof. We use the identity

cos(u − v) = cos(u) cos(v) + sin(u) sin(v),

which can be proved easily in the complex plane.
In order to understand the effect of the frequency ω_i on a discretely observed sinusoidal function consider Fig. 9.1. In these plots we look at 50 discrete time points one unit apart. In the first plot the frequency is ω = 1/50, so that across the observed points we see exactly one complete period of the sinusoidal function. This is called a low frequency element. Indeed for any lower frequency, and these time points, we could not observe a complete cycle. In the middle two panels we increase the frequency to 7/50 and 18/50, where we see that we now observe more cycles but have fewer observations per cycle. The most extreme version of this is the bottom panel where the frequency is now ‘high’ at 1/2. At this level we only have two observations per cycle – the maximum and minimum.
We can therefore think of time series of the form (9.3) in terms of their high, medium and low frequency components.
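The effect of frequency shown in Fig. 9.1 can be reproduced with a few lines of R; the choice of a cosine and the plotting details are assumptions made here.

```r
## Discretely observed sinusoids at the four frequencies of Fig. 9.1
t <- 0:50
par(mfrow = c(4, 1), mar = c(4, 4, 2, 1))
for (omega in c(1/50, 7/50, 18/50, 1/2)) {
  plot(t, cos(2 * pi * omega * t), type = "b",
       xlab = "Time", ylab = "Value",
       main = paste("Frequency =", round(omega, 3)))
}
```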
[Four panels of 50 discretely observed sinusoids, one per frequency: 1/50, 7/50, 18/50 and 1/2.]
Figure 9.1: Discrete time series of simple periodic functions and the effect of changing the frequency
Example 9.0.4. (Linear combinations of sinusoids) Figure 9.2 shows two examples of processes of the form (9.3), both with k = 50 and having 500 observed time points. They both have σ_i² = 1 for all i. They differ in the way the 50 values of ω_i were selected. Following the discussion above we have the constraints 1/500 ≤ ω_i < 0.5. In the top row the middle histogram summarises the values of ω_i. They are all ‘low frequency’ components and we can see that the time series varies slowly over the time steps. In contrast, for the lower row the values of ω_i were selected at the top of the range. These are high frequency and we see the time series has much more rapid oscillations over much shorter time scales.
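A sketch of how the low frequency example could be simulated in R; the seed, the uniform sampling of the frequencies and the 0.1 upper limit for the ‘low frequency’ case are choices made here.

```r
## Simulate a process of the form (9.3): k = 50 sinusoids, 500 time points
set.seed(1)
k <- 50; n <- 500
t.list <- 1:n
nu <- runif(k, min = 1/n, max = 0.1)     # 'low frequency' components
A <- rnorm(k); B <- rnorm(k)             # WN(0, 1) amplitudes
out <- sapply(t.list, function(t)
  sum(A * cos(2 * pi * nu * t) + B * sin(2 * pi * nu * t)))
plot(t.list, out, type = "l")            # slowly varying series
hist(nu, main = "Histogram of nu")       # distribution of frequencies
spectrum(out, log = "no")                # raw periodogram
```

Replacing the range of `nu` by, say, (0.4, 0.5) gives the high frequency example in the lower row.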
[Each row shows a simulated series over 500 time points, a histogram of the frequencies nu, and the raw periodogram (bandwidth = 0.000577).]

Figure 9.2: Simple periodic functions: top low frequency, bottom high frequency
The right most plots show the data based estimates of the distribution of the frequencies – the so-called spectral densities.
Definition 9.0.5. (Spectral density) If a mean zero, stationary time series {X_t} has an auto-covariance function γ(h) such that ∑_{h=−∞}^{∞} |γ(h)| < ∞, then the spectral density is defined as

f(ν) = ∑_{h=−∞}^{∞} γ(h) e^{−2πiνh}

for −∞ < ν < ∞.
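As a worked example (not in the text), for an MA(1) process X_t = Z_t + θ Z_{t−1} with Var(Z_t) = σ², the only nonzero auto-covariances are γ(0) = σ²(1 + θ²) and γ(±1) = σ²θ, so the sum in Def. 9.0.5 collapses to f(ν) = σ²(1 + θ² + 2θ cos(2πν)). This is easy to plot in R:

```r
## Spectral density of an MA(1) process, computed from Def. 9.0.5
theta <- 0.7; sigma2 <- 1
nu <- seq(0, 0.5, length.out = 200)
f <- sigma2 * (1 + theta^2 + 2 * theta * cos(2 * pi * nu))
plot(nu, f, type = "l",
     xlab = "frequency", ylab = "spectral density")
```

With θ > 0 the density is largest at low frequencies, reflecting the positive lag-one correlation.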
Theorem 9.0.6. (Properties of spectral density) If f(ν) is the spectral density of a stationary time series then it has the following properties.

(i) For all ν, |f(ν)| < ∞.
(ii) f(ν) is periodic with period 1, so we can define its domain as −0.5 ≤ ν ≤ 0.5.

(iii) f(ν) = f(−ν), i.e. f is an even function, so we can effectively take the domain to be 0 ≤ ν ≤ 0.5.

(iv) f(ν) ≥ 0.
(v) Given the spectral density f(ν) we can reconstruct the auto-covariance function by the formula

γ(h) = ∫_{−0.5}^{0.5} e^{2πiνh} f(ν) dν.    (9.4)
Proof. (Sketch) (i) The first property comes from the fact that |e^{iθ}| = 1 together with the absolute convergence of γ from Def. 9.0.5.

(ii) This follows since, for integer h, e^{−2πi(ν+1)h} = e^{−2πiνh} e^{−2πih} = e^{−2πiνh}.

(iii) Follows from the elementary property γ(h) = γ(−h).
(iv) First define

F_N(λ) = E | (1/√N) ∑_{r=1}^{N} X_r e^{−2πirλ} |² ≥ 0.

This can be expanded as

F_N(λ) = (1/N) E ( ∑_{r=1}^{N} X_r e^{−2πirλ} ∑_{t=1}^{N} X_t e^{2πitλ} ) = (1/N) ∑_{|h|<N} (N − |h|) e^{−2πihλ} γ(h),

and letting N → ∞ gives F_N(λ) → f(λ), so that f(λ) ≥ 0.
> plot(kernel("daniell", c(3,3)))
> plot(kernel("daniell", c(4,4,4,4)))
The weights which are used in the moving average are shown in Fig. 9.5. We can see they range from a simple equally weighted moving average (left panel) through a triangular distribution (center) to a kernel that looks like a Gaussian one (right).
[Three panels of the weights W[k] against k: Daniell(1), Daniell(3,3) and Daniell(4,4,4,4).]

Figure 9.5: Three Daniell kernels
Example (9.0.4 revisited). If we go back to one of the examples in Example 9.0.4 we can see, in Fig. 9.6, the raw (unsmoothed) periodogram and two versions which have been smoothed with a Daniell kernel. The R code is given by
> spectrum(out, log = "no")
> spectrum(out, log = "no", spans = c(3,3))
> spectrum(out, log = "no", spans = c(4,4,4,4))
Example (1.3.6 revisited). In Chapter 1 we looked at the sunspot data set. The sunspot numbers from 1749 to 1983 were collected at the Swiss Federal Observatory, Zurich, until 1960, and then at the Tokyo Astronomical Observatory, and can be found in R.
R code for this example is given by the following.
[Three panels for the series x: the raw periodogram (bandwidth = 0.000577) and smoothed periodograms with bandwidths 0.00208 and 0.00493.]
Figure 9.6: The raw periodogram and two smoothed versions
> data(sunspots)
> k1 = kernel("daniell", 4)
> plot(sunspots)
> spec.pgram(sunspots, log = "no", xlim = c(0,1))
> spec.pgram(sunspots, k1, log = "no", xlim = c(0,1))
[Panels: the sunspot series, 1750–1950; its raw periodogram (bandwidth = 0.0012); and a smoothed periodogram (bandwidth = 0.0108).]
Figure 9.7: The sunspot data, its raw periodogram and a smoothed version

In Fig. 9.7 we see that there is a peak in the spectrum at ν = 0.09, which corresponds to a 1/0.09 ≈ 11.1 year cycle.

9.0.5 Filtering in the spectral domain
In this section we see how working in the spectral domain, i.e. working with frequencies, allows us to build tools which filter out ‘noise’, hopefully leaving ‘signal’. The most obvious application of this idea is in signal processing.
Definition 9.0.18. (Signal processing) From Priemer (1991) we have that, for an engineer, a signal is a function that conveys information about the behaviour of a system. These can be acoustic waves such as speech or music, electronic waves such as radio signals, or output from a medical device such as EEG. Signal processing is then an activity that transforms an input signal to an output signal, typically to amplify the power of the signal.
Example 9.0.19. (Audio signal) For an audio signal – say recorded and transmitted music – noise from the recording and transmission systems – which sounds like a ‘hiss’ – often has a different frequency spectrum than the (music) signal. We can therefore separate the ‘signal’ from the ‘noise’ by a transformation which has the effect of decreasing power in the high frequency part of the spectrum while preserving power in the parts of the spectrum which correspond to the music. A filter which does this is called a low pass filter since it only allows the lower parts of the spectrum through.

For example, Fig. 9.8 shows a simulated example. The left panel shows, in black, the raw signal, which is a simple sine wave plus i.i.d. noise. Its spectrum, on a log-scale, is shown in the middle panel. We see that the high frequency parts of this spectrum are purely due to the noise. After applying the low pass filter the spectrum changes to the plot in the right panel. The filter has removed some of the high frequency parts. The corresponding, transformed, signal is shown in red in the left panel. The code which does
[Left: the raw signal (black) and filtered signal (red); middle and right: raw periodograms, on a log scale, before and after filtering (bandwidth = 0.000289).]
Figure 9.8: Low pass filter example
this in R is given by
> library(signal)
> bf <- butter(3, 0.1)            # order 3 low-pass filter, normalised cutoff 0.1
> t <- seq(0, 1, len = 1000)      # 1 second sample
> x <- sin(2*pi*t*2.3) + 0.5*rnorm(length(t))   # 2.3 Hz sinusoid + noise
> z <- filter(bf, x)              # apply filter
Bibliography
Abraham, B. and J. Ledolter (2006). Introduction to regression modeling. Thomson Brooks/Cole.
Assimakopoulos, V. and K. Nikolopoulos (2000). The theta model: a de- composition approach to forecasting. International journal of forecast- ing 16(4), 521–530.
Berger, J. O. (2013). Statistical decision theory and Bayesian analysis. Springer Science & Business Media.
Bollerslev, T. (2008). Glossary to ARCH (GARCH). CREATES Research Paper 49.
Bollerslev, T. and E. Ghysels (1996). Periodic autoregressive conditional heteroscedasticity. Journal of Business & Economic Statistics 14 (2), 139– 151.
Box, G. E. and G. M. Jenkins (1976). Time series analysis: forecasting and control, revised ed. Holden-Day.
Brockwell, P. J. and R. A. Davis (2002). Introduction to time series and forecasting. Springer.
Brockwell, P. J. and R. A. Davis (2009). Time series: theory and methods. Springer Science & Business Media.
Brooks, S. P. and G. O. Roberts (1998). Assessing convergence of Markov chain Monte Carlo algorithms. Statistics and Computing 8(4), 319–335.

Chatfield, C. and D. Prothero (1973). Box-Jenkins seasonal forecasting: problems in a case-study. Journal of the Royal Statistical Society. Series A (General), 295–336.

Cowles, M. K. and B. P. Carlin (1996). Markov chain Monte Carlo convergence diagnostics: a comparative review. Journal of the American Statistical Association 91(434), 883–904.
169
170 BIBLIOGRAPHY
Cranage, D. A. and W. P. Andrew (1992). A comparison of time series and econometric models for forecasting restaurant sales. International Journal of Hospitality Management 11(2), 129–142.
Fitzpatrick, T. and C. Mues (2016). An empirical comparison of clas- sification algorithms for mortgage default prediction: evidence from a distressed mortgage market. European Journal of Operational Re- search 249(2), 427–439.
Francq, C. and J.-M. Zakoian (2011). GARCH models: structure, statistical inference and financial applications. John Wiley & Sons.
Gelman, A., J. B. Carlin, H. S. Stern, and D. B. Rubin (2014). Bayesian data analysis, Volume 2. Taylor & Francis.
Gilks, W. R., S. Richardson, and D. J. Spiegelhalter (1996). Markov Chain Monte Carlo in Practice. Chapman & Hall.
Gonzalez-Rivera, G. (2013). Forecasting for Economics and Business. The Pearson Series in Economics.
Granger, C. W. and M. Watson (1986). Time series: Economic forecasting.
Hand, D. J. (2006). Classifier technology and the illusion of progress. Sta- tistical science, 1–14.
Hastie, T., R. Tibshirani, and J. Friedman (2009). The elements of statistical learning, Volume 2. Springer.
Jackman, S. (2009). Bayesian analysis for the social sciences, Volume 846. John Wiley & Sons.
Jie, M., G. S. Collins, E. W. Steyerberg, J. Y. Verbakel, B. van Calster, et al. (2019). A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. Journal of clinical epidemiology.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Journal of Fluids Engineering 82(1), 35–45.
Linstone, H. A., M. Turoff, et al. (1975). The Delphi method: Techniques and applications, Volume 29. Addison-Wesley Reading, MA.
Lunn, D., C. Jackson, N. Best, A. Thomas, and D. Spiegelhalter (2012). The BUGS Book - A Practical Introduction to Bayesian Analysis. CRC Press / Chapman and Hall.
BIBLIOGRAPHY 171
Makridakis, S., E. Spiliotis, and V. Assimakopoulos (2018). Statistical and machine learning forecasting methods: Concerns and ways forward. PloS one 13(3).
Mardia, K. V., J. T. Kent, and J. M. Bibby (1979). Multivariate analysis. Academic press.
Morris, T. P., I. R. White, and M. J. Crowther (2019). Using simulation studies to evaluate statistical methods. Statistics in medicine 38(11), 2074–2102.
Park, T. and G. Casella (2008). The Bayesian lasso. Journal of the American Statistical Association 103(482), 681–686.
Priemer, R. (1991). Introductory signal processing, Volume 6. World Scien- tific.
R Core Team (2013). R: A Language and Environment for Statistical Com- puting. Vienna, Austria: R Foundation for Statistical Computing.
Sherman, H. J. and D. Kolk (1996). Business cycles and forecasting. Harper- Collins College Publishers.
Shumway, R. H. and D. S. Stoffer (2010). Time series analysis and its applications: with R examples. Springer Science & Business Media.
Simmhan, Y. L., B. Plale, and D. Gannon (2005). A survey of data prove- nance in e-science. ACM Sigmod Record 34(3), 31–36.
Stamey, T. A., J. N. Kabalin, J. E. McNeal, I. M. Johnstone, F. Freiha, E. A. Redwine, and N. Yang (1989). Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate. ii. radical prostatectomy treated patients. The Journal of urology 141(5), 1076–1083.
Van der Heijden, K. (2011). Scenarios: the art of strategic conversation. John Wiley & Sons.
Vasicek, O. (1977). An equilibrium characterization of the term structure. Journal of financial economics 5(2), 177–188.
Wang, S. and W. A. Chaovalitwongse (2010). Evaluating and comparing forecasting models. Wiley Encyclopedia of Operations Research and Man- agement Science.
Index
SARIMA model, 94
ARMA(p, q) process, 69
AR(p) process, 66
makeARIMA( ) function, 136
Absolute loss, 25
AIC, 35
AICc, 72
Akaike information criterion, 35, 72
AR(p) process, 67
AR(1) process, 10, 52
ARCH model, 10, 147 ARIMA model, 87
ARIMA models, 136 Asymptotic consistency, 102 Audio signal, 166 Auto-correlation function, 54 Auto-covariance function, 52
Properties, 54
Backward selection, 34 Backward shift operator, 65 Bayes classifier, 25 Bayesian inference, 111, 113 Bayesian regression, 117 Best linear predictor, 55
Properties, 86
Best subset regression, 33
Bias-variance decomposition, 32, 40
Big Data, 101
Bollerslev-Ghysels benchmark dataset, 147
Brownian motion, 11
BUGS, 133 Business cycle, 8
Calibration, 140
Case study, 122
Cauchy distribution, 53
Causal solution, 67
Change point, 5
Competitions, 106
Complex inner product space, 162 Conjugate prior, 117
Conjugate priors, 117
Cross validation, 43, 102
Daniell kernel, 165
Delphi method, 17
Denmark birth data, 8
Difference operator, 65, 86
Discrete Fourier transform, 163
Efficient market hypothesis, 14 Estimate of variance, 48
Estimate residuals, 27
Evaluation of Forecasting Methods,
102
Exponential smoothing, 21
Extrapolation, 30
Finite dimensional distributions, 51 Finite sample behaviour, 103 Fitted values, 27
Forecast Evaluation, 101
Forward selection, 34
Fourier frequencies, 162
Frequency domain methods, 155
GARCH model, 147, 149
Gaussian process, 11, 52, 54, 82
Gaussian processes, 82
Generalization error, 39
Hat matrix, 27
Hidden Markov model, 121, 135 Holt-Winters filter, 21
House price example, 26, 28, 29, 33,
34, 36, 37, 42, 119, 120
i.i.d. sequence, 51
In-sample error, 41
Internet network traffic, 7 Interpreting acf plots, 76 Interpreting sample auto-correlation
function plots, 60 Inverse gamma distribution, 116
JAGS, 133
K-fold cross-validation, 43
Kalman filter, 137
Kolmogorov’s existence theorem., 11
Lag s difference, 93 Lasso, 37
Leading indicators, 14
Level, 4
Likelihood estimation, 69
Likelihood for ARMA process, 70
Linear combinations of sinusoids, 156
Linear decomposition model, 18
Long memory processes, 82
Loss function, 23
MA(∞) process, 78
MA(q) process, 65
MA(1) process, 52
Machine Learning, 101
Markov chain Monte Carlo, 117 Markov property, 9
Mean squared error, 23
Mean stationary, 86 Misspecification, 105
Model identification, 71
Model with seasonal component, 18 Model with trend, 17
Monte Carlo experiment, 104 multicollinearity, 29
Multivariate Normal, 45 Multivariate normal distribution, 45,
54
Non-stationarity, 5
Non-statistical forecasting methods,
17
OLS estimation, 27
One step ahead prediction, 103 Optimality, 101
Optimism of training error, 41 Oracle property, 102 Ornstein-Uhlenbeck process, 12 Over smoothing, 20 Over-fitting, 32
Partial correlation, 81
Partial covariance, 81
Partial auto-correlation function, 81 Period of time series, 4 Periodogram, 160, 163 Phillips-Perron test, 98
Posterior distribution, 113 Predictable process, 53
Prediction interval, 29, 103 Prediction operator, 57
Predictive distribution, 116
Prior distribution, 113
Projection operator, 45
Prostate cancer example, 15
Quadratic loss, 23
Random walk, 10, 52, 53
Random walk with drift, 86 Regularisation and penalty methods,
120
Residual sum of squares, 26
residuals, 27
Ridge regression, 36 Risk, 23
Runs test, 98
Sample auto-correlation function, 58
Sample auto-covariance function, 58 Sample mean, 58
San Diego house price example, 13 Scenario analysis, 17
Seasonal ARIMA model, 94 Seasonal components, 93 Seasonality, 4
Signal processing, 166
Simple moving average filter, 18 Simulation experiments, 103 Spectral density, 158
Squared error, 23
Squared error loss, 23
State space methods, 15 State space model, 121, 135 Stationarity, 14, 52 Stochastic process, 10 Strict stationarity, 53 Subset selection, 33
Sum of squares, 47
Tests for stationarity, 97 Time series, 3, 51
Time series plot, 4 Training set, 26
Trend, 4
Tuning parameters, 102, 108
US accident data, 93
Variance stationary, 86 Volatility clustering, 149
White noise, 160
White noise process, 51, 53 Wiener process, 11
WinBugs, 133
Wold decomposition theorem, 79
Zero-one loss, 25