6 Bayesian and state space methods
6.1 Introduction
This is a shortened version of the material in Chapter 6 of the notes. First we briefly review the ideas of Bayesian inference.
6.2 Bayesian methods
Example 6.2.1. Consider a simple normal example, where we have a model Xi ∼ N(θ, 1), i.i.d., with a sample size of n, so that the likelihood function is based on the observed data x1, ..., xn. With frequentist inference, if we wanted to learn about θ we might use the MLE θ̂ and a confidence interval

(θ̂ − 1.96/√n, θ̂ + 1.96/√n)

based on the sampling distribution of the MLE.
A Bayesian approach is different. It describes what we have learned about θ by treating it as a random variable and computing its distribution. We shall assume that we have a model for the data f(x|θ) and that we want to learn about the vector of parameters θ. In Bayesian statistics everything is considered a random variable and all statements about uncertainty are made using the language and calculus of probability theory.
1. Suppose that we consider that the parameter vector θ itself is random, and that it has a distribution, Pr(θ), which we shall call the prior distribution.
2. This represents what we know about θ before we see the data. In the Bayesian viewpoint all uncertainty – and in particular what we know about the value of θ – can be represented by probability statements.
3. To see what we know about θ after we see the data we apply Bayes' theorem to give

Pr(θ | data) = Pr(data | θ) Pr(θ) / Pr(data). (6.1)

4. The term Pr(θ | data) is called the posterior and is interpreted as what we know about θ after we have seen the data.

5. We can then write Equation (6.1) as

Posterior(θ) ∝ Lik(θ) × prior(θ). (6.2)

6. That is, up to a constant, the likelihood function is the way we convert the information we have about θ before we have seen the data into the information we have about it after seeing the data.
Example 6.2.2. (Normal Bayesian inference 1) Consider a simple normal example, where we have a model Xi ∼ N(θ, 1), i.i.d., with a sample size of n, so that the likelihood function based on the observed data x1, ..., xn is

Lik(θ) = Pr(data | θ) = ∏_{i=1}^{n} (1/√(2π)) exp(−(xi − θ)²/2).

This is then the key tool that we need to merge with the prior in order to get the posterior.
In the Bayesian viewpoint the 'answer' to an inference problem is captured in the posterior distribution. There are then a number of computational issues which we need to study. First we note that the very simple formula Equation (6.2) only defines the posterior up to a constant. Since the posterior is a probability mass function or a density function, we need to be able to deal with the fact that finding the normalising constant

∫_Θ Posterior(θ) dθ

might be a difficult computational problem. This can be solved either by clever mathematics or by numerical integration methods. Secondly, we want to know how we can convert mathematical knowledge about the posterior into something interpretable. Again it looks like this will involve (potentially) difficult integration problems.
Example 6.2.3 (6.2.2 revisited). Suppose that the prior distribution for θ is in the normal family, i.e.

θ ∼ N(μ_prior, σ²_prior),
where μ_prior and σ²_prior are considered known constants. Then the posterior can be written as

Pr(θ | x1, ..., xn) ∝ Lik(θ) × prior(θ).

We can show

Pr(θ | x1, ..., xn) = N(μ_posterior, σ²_posterior) (6.3)

where

μ_posterior = (μ_prior + σ²_prior ∑_{i=1}^{n} xi) / (1 + n σ²_prior),   σ²_posterior = σ²_prior / (1 + n σ²_prior).
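These updating formulas are easy to check numerically. Here is a minimal sketch in R (the function name and the simulated data are our illustrative additions, not part of the notes):

normal_posterior <- function(x, mu_prior, sigma2_prior) {
  # Posterior for the N(theta, 1) model with a N(mu_prior, sigma2_prior) prior,
  # following Equation (6.3)
  n <- length(x)
  mu_post <- (mu_prior + sigma2_prior * sum(x)) / (1 + n * sigma2_prior)
  sigma2_post <- sigma2_prior / (1 + n * sigma2_prior)
  list(mean = mu_post, var = sigma2_post)
}

x <- rnorm(20, mean = 2, sd = 1)                     # simulated data
normal_posterior(x, mu_prior = 0, sigma2_prior = 10)

With a vague prior (large σ²_prior) the posterior mean is close to the sample mean and the posterior variance close to 1/n, as one would expect.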
Having seen how to compute the posterior distribution we now see how the Bayesian method deals with forecasting. The same principles apply: only use the rules of probability.
Definition 6.2.4. (Predictive distribution) The solution to the problem of forecasting Xnew based on an observed sample X1, ..., Xn which comes from a model f(X|θ) should be the conditional distribution P(Xnew | X1, ..., Xn). This is called the predictive distribution and is calculated from the posterior by

P(Xnew | X1, ..., Xn) = ∫_Θ P(Xnew | θ) P(θ | X1, ..., Xn) dθ.

In words we average over the model according to the posterior belief in the value of θ given the observed data.
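For the running normal example this integral can be done in closed form. A standard calculation (our addition: write Xnew = θ + ε with ε ∼ N(0, 1) independent of the posterior) gives

Xnew | x1, ..., xn ∼ N(μ_posterior, 1 + σ²_posterior),

so the predictive variance is the data variance plus the remaining posterior uncertainty about θ.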
6.3 Bayesian methods in regression
One of the nice things about modern Bayesian analysis is that we don't have to compute integrals analytically; rather we use a numerical method called Markov chain Monte Carlo (MCMC).
Definition 6.3.1. (MCMC) If we have a distribution P(θ|x1,…,xn) ∝ Lik(θ)P(θ)
that is known up to a constant then rather than working directly with the distribution we can compute most statistical functions – such as means, variances, marginal distributions – using a large sample θ1, θ2, . . . where
θi ∼ P(θ|x1,…,xn)
We will not, in this course, look at the details of how the sample θ1, θ2, ... is constructed, but rather use built-in R functions.
Example 6.3.2. Suppose now we have the model Xi ∼ N(θ, σ²), i.i.d., with a sample size of n, where both θ and σ² are to be estimated.
We can think of this example as a regression example with no covariates.
In R we could use
> library(MCMCpack) #load library
> library(coda)
> y <- rnorm(30, mean=3, sd=0.2) #generate data
> out <- MCMCregress(y~ 1) #run mcmc
> out1 <- as.matrix(out) #store output as matrix
> dim(out1)
[1] 10000 2
The output is a set of two long series, one for θ and one for σ². These large samples represent the (joint) posterior of both parameters and are plotted as density estimates in Fig. 6.1 (b) and (c). Panel (a) is just a plot of the data used.
Figure 6.1: MCMC output: (a) the data, (b) posterior density for θ, (c) posterior density for σ²
We can look in more detail at this output. In Fig. 6.2, Panel (a) shows the first 500 points of the θ-series, noting that there are in fact 10,000 points in the series. Panel (b) shows the same for the σ² series and (c) shows a joint plot. This last shows the MCMC output from the joint posterior distribution.
If we want to compute posterior means, standard deviations etc. we simply use the sample, i.e.
> mean(out1[,1])
[1] 3.031442
> sd(out1[,1])
[1] 0.02407227
which give point estimates of the parameters and measures of (posterior) uncertainty.
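Other posterior summaries are equally direct sample summaries. For example, a 95% credible interval for θ can be read off the empirical quantiles of the draws (a sketch using out1 from above):

> quantile(out1[,1], probs=c(0.025, 0.975))  # 95% credible interval for theta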
Figure 6.2: MCMC output (a) θ sample, (b) σ² sample, (c) joint distribution
Panel (c) shows a scatterplot of the output. This is a representation of the joint posterior distribution of the parameters. The fact that a Bayesian analysis gives joint inference is very different from the standard methods, which give, say, confidence intervals for each parameter one at a time. In this simple normal random sample example, since the sampling distributions of the sample mean and sample variance are independent, this is rarely an issue. For example panel (c) looks independent, so two marginal inferences are sufficient. In general though the dependence can really be important.
Example 6.3.3. We can return to the house price example of Chapter 2
and fit the model using MCMC.
> out <- MCMCregress(Y ~ X1+X2+X3+X4+X5+X6+X7+X8+X9, data=house.price)
> summary(out)
                Mean      SD       Naive SE   Time-series SE
(Intercept)  15.22298  6.45155   0.0645155   0.0645155
X1            1.93971  1.12235   0.0112235   0.0112235
X2            6.87051  4.67551   0.0467551   0.0480019
X3            0.13595  0.53313   0.0053313   0.0053274
X4            2.80602  4.75816   0.0475816   0.0475816
X5            2.04593  1.49712   0.0149712   0.0149712
X6           -0.50892  2.60459   0.0260459   0.0260459
X7           -1.29140  3.70783   0.0370783   0.0370783
X8           -0.03864  0.07188   0.0007188   0.0007188
X9            1.69108  2.09140   0.0209140   0.0209140
sigma2       10.25968  4.46910   0.0446910   0.0725849
We get very similar estimates to the OLS solution. But now, since we are getting a joint posterior for all the parameters, we can see something interesting about the multicollinearity issue seen before. Recall that we found it odd that the estimate for β6 was negative, and this was explained by dependence in the explanatory variables. In the Bayesian analysis this dependence flows very naturally into the posterior, as shown in Fig. 6.3.
We see that our knowledge about the value of β6 is strongly tied to our knowledge about other parameters, say β2. This gives a much more complete picture of what we really know about the parameters even when there is multicollinearity.
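One way to quantify this is via the posterior correlations between coefficients, computed directly from the MCMC draws. A sketch (the selected column names are our illustrative choice; they follow the MCMCregress output above):

> out1 <- as.matrix(out)
> round(cor(out1[, c("X2", "X6", "X7")]), 2)   # posterior correlation matrix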
Figure 6.3: MCMC output for Example 6.3.3
How do we use the MCMC output to make predictions? We want to use the predictive distribution of Definition 6.2.4. This is an integral, and with MCMC output all integrals – which are hard mathematically – are replaced with sums over the MCMC sample, i.e.

∫_Θ f(θ) P(θ | X1, ..., Xn) dθ ≈ (1/N_MCMC) ∑_{i=1}^{N_MCMC} f(θi) (6.4)

for any function f, where θi for i = 1, ..., N_MCMC is the MCMC sample.
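For instance, taking f to be an indicator function turns Equation (6.4) into a posterior probability estimate; with the sample out1 from Example 6.3.2 this is one line (the threshold is our illustrative choice):

> mean(out1[,1] > 3)   # estimate of Pr(theta > 3 | data)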
Definition 6.3.4. (Building a predictive sample) If we want to do prediction using a posterior for θ we use a similar idea. We can take a sample from the predictive distribution in two steps:
(i) Draw a sample θ(i) from the posterior P(θ|data)
(ii) Given θ(i) draw a prediction from the model f(y|θ(i))
We repeat this, independently for many different values of i.
Example 6.3.5. In the regression example, given β(i) and σ²(i) we take the simple forecast y(i) = x0ᵀβ(i) + ε(i), where ε(i) ∼ N(0, σ²(i)). We could look at the distribution of this over values of β from the posterior. In Figure 6.4 we select x0 to be the average house and generate a sample from the predictive
distribution using the two steps of Definition 6.3.4. The blue vertical line is the sample mean of the house prices which, because of the choice of x0, should be the mean here.
Figure 6.4: Predictive distribution for Example 6.3.5
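A sketch of how such a predictive sample might be built from the MCMCregress output of Example 6.3.3 (the construction of x0 and the column indexing are our illustrative assumptions about the house.price data):

out1 <- as.matrix(out)                                  # posterior draws: coefficients and sigma2
x0 <- c(1, colMeans(house.price[, paste0("X", 1:9)]))   # intercept plus the 'average house'
beta <- out1[, 1:10]                                    # (Intercept), X1, ..., X9
sigma2 <- out1[, "sigma2"]
# Step (i) of Definition 6.3.4 is already done: each row of out1 is a posterior draw.
# Step (ii): draw a prediction from the model for each posterior draw.
ypred <- as.vector(beta %*% x0) + rnorm(nrow(out1), sd = sqrt(sigma2))
hist(ypred, main = "Predictive distribution", xlab = "price")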
6.3.1 Link to penalty methods
We get the following link between the ridge regression method of Chapter 2 and Bayesian methods. It shows that ridge regression estimates are modes of posterior distributions with suitably chosen priors.
Theorem 6.3.6. (Bayes and ridge regression) Suppose we have the model for Y conditionally on X as

Yi | X, β ∼ N(β0 + xiᵀβ, σ²)

with independent priors βj ∼ N(0, τ²). Then, with τ, σ assumed known, the log-posterior for β is, up to an additive constant, proportional to the negative of

∑_{i=1}^{N} (yi − β0 − ∑_{j=1}^{p} xij βj)² + λ ∑_{j=1}^{p} βj²,

which agrees with RSS(λ) in the definition of ridge regression when λ = σ²/τ².
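To see where λ comes from, here is a short derivation (our addition; it is standard algebra). The log-posterior is

log Pr(β | y) = −(1/(2σ²)) ∑_{i=1}^{N} (yi − β0 − ∑_{j=1}^{p} xij βj)² − (1/(2τ²)) ∑_{j=1}^{p} βj² + const.

Multiplying by −2σ² shows that maximising the posterior is the same as minimising

∑_{i=1}^{N} (yi − β0 − ∑_{j=1}^{p} xij βj)² + (σ²/τ²) ∑_{j=1}^{p} βj²,

which is RSS(λ) with λ = σ²/τ².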
The LASSO can also be interpreted in Bayesian terms.
Theorem 6.3.7. (Bayes and the LASSO) The Lasso estimate for linear regression parameters can be interpreted as a Bayesian posterior mode estimate when the priors on the regression parameters are independent double-exponential (Laplace) distributions.
6.4 State space model
Definition 6.4.1. (State space model) Let {Yt} be an (observed) time series with an (unobserved) state process {θt}. In a state space model the dependence between these is defined by the graphical model shown in Fig. 6.5.
θ0 θ1 θ2 … θt−1 θt
Y1 Y2 Yt−1 Y t
Figure 6.5: State space structure
The relationship defines the conditional independence relations: (i) {θt} is a Markov chain; (ii) conditionally on θ0, θ1, ..., θt the Yt are independent, so that the joint distribution of (Y1, ..., Yt) | (θ1, ..., θt) is

∏_{i=1}^{t} f(yi | θi).

State space models are also called hidden Markov models.
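As a concrete illustration, here is a minimal sketch in R simulating a local-level state space model; the random-walk state equation and the parameter values are our illustrative assumptions, not part of the definition:

set.seed(1)
n <- 200
theta <- numeric(n)
theta[1] <- rnorm(1, mean = 0, sd = 0.1)                       # theta_1, starting from theta_0 = 0
for (t in 2:n) theta[t] <- theta[t - 1] + rnorm(1, sd = 0.1)   # state equation: a Markov chain
y <- theta + rnorm(n, sd = 0.5)                                # observations: Y_t | theta_t independent
plot(y, xlab = "t", ylab = "y")                                # observed series
lines(theta, col = "red")                                      # unobserved state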
6.5 Case Study
6.5.1 Introduction
The TSUNAMI group of insurance and reinsurance companies has made available various datasets which cover issues important for the insurance industry. This report concentrates on a seven-year series of commercial claims due to weather. The TSUNAMI group has a particular interest in the following questions.
1. Temporal effects. Is there any significant variation across the years in the observed data, apart from that caused by inflation? Any such variation should be identified and described.
2. Aggregation. There is interest in the aggregation in the data, usually over three day periods. Suitable models for such summary statistics need to be found.
3. Extremes. The data set contains two ‘extreme’ events. These were the storms in January and February 1990. The modelling should be able to predict return periods for extreme events such as these.
The report is structured in the following way. Section 6.5.2 contains details of both the currently available dataset and important data which is potentially available. Section 6.5.3 examines the modelling approach used throughout, pointing out the differences from traditional time series models, which we regard as inappropriate for this kind of data. Section 6.5.4 looks at the results of some fitted models, while the final section describes future work.
6.5.2 Data
The data covers a period from 1st January 1990 until 31st December 1996. It is a record of all claims for UK commercial policies which are coded as being caused by ‘weather’. For each claim the following information is available: the date of the reported claim incidence, the settlement date, and the claim amount. It is known that the data exhibits structure which is induced by the systematic way incidence dates are allocated when the exact date is unknown, forgotten or censored. For example, there are relatively more claims on the first day of each month, and there are relatively fewer claims at the weekend. It is important that the modelling takes into account this structure.
Totalling all the claims occurring on a particular day creates a time series of total daily claims. This series for both the raw data and the log of the data are shown in Figure 6.6. Alternatively the data can be represented in its original format, i.e. individual claim sizes and incidence dates, and an additional time series created which contains the number of claims made on each day. These are shown in Figures 6.7 and 6.8 respectively.
Figure 6.6: The total daily claim sizes from 1/1/1990 until 31/12/1996.
It is also important to consider information which, while unavailable at this stage of the modelling process, is in principle available to individual insurance companies and which will be clearly important in meeting the objectives in Section 6.5.1. This missing information will be important in the following issues.
Figure 6.7: The individual claim sizes on each day from 1/1/1990 until 31/12/1996.
1. Temporal effects. Currently not recorded is the total number of policies providing cover on each day. This series is important in understanding the change in behaviour over time of the data, with which it is confounded.
2. Aggregation. Understanding the dependence structure across the different claims is an important part of the modelling. Information which would be useful here includes the geographical location of claims, the type of damage and the type of property concerned. This will be useful since, when looking at aggregation, it would be informative to know when different claims may be due to the same meteorological event. For example flood and wind damage may well have quite different aggregation properties.
3. Extremes. The geographical and claims information mentioned above will also be important in the modelling of extreme events. For example the relationship between the intensity of the meteorological event and its location will contribute to both the number and size of observed claims.
The models developed below attempt to take account of this missing information by using random effects. Further they are designed to extend naturally when and if the above data becomes available.
6.5.3 The modelling approach
The underlying structure of the data, with random numbers of claims on each day, is unsuitable for traditional time series methods.
Figure 6.8: The number of claims on each day from 1/1/1990 until 31/12/1996.
Therefore in this paper we follow an events-based approach. The physical cause of the damage claims will always be some meteorological event, for example a storm or flood. Although these events are unobserved, the structure of the model should take them into account. In particular a high proportion of the variation in the size of claims depends only on the weather on each day. We therefore model these meteorological events by using random effects modelling.
In general we are interested in identifying the joint distribution of the claim sizes and the number of claims on any particular day. This allows us to address all the issues of interest raised in Section 6.5.1. We denote the claim sizes by X, the number of claims by N, the day of interest by t, and the joint distribution on this day by ft(X, N). We then model this joint distribution as a product of the conditional distribution of X given N and the marginal distribution of N, i.e.

ft(X, N) = ft(X | N) ft(N).

Note that we will apply the events-based approach by concentrating on the distribution of N. Informal data analysis shows that after conditioning on N the distribution of X remains approximately constant across time. This can be interpreted as meaning that the information in the meteorological events is filtered primarily through the number of claims.
Models for ft(N)
Let Nt denote the number of claims on day t. We shall consider the distribu- tion of Nt conditionally on both the weather for that day and the number of policies at risk. Standard modelling assumptions, Feller (1966), which rely on the independence of claims given the weather conditions dictate that Nt
will follow a Poisson distribution if we further assume that the actual pro- portion of policies which give rise to a claim on this day is small. Thus we apply the approximation of a Poisson distribution with a rate parameter λt. So we have
Nt ∼ Poisson(λt).
Note that the information about the number of policies acts as a multiplier for the rate parameter λt.
Since the actual weather and number of policies are unobserved we will treat these factors as random effects for each day. A large proportion of the information which determines the parameter λt represents the effects of the weather on day t. Clearly this will have a component of seasonal variation. It also will be correlated in time since a weather system typically takes about three days to pass across the country. Further the systematic structure of reporting claims mentioned in Section 6.5.2 will affect the value of λt, for example it should be higher on the first day of each month. Putting these factors together means we assume a model
log λt = Ht + εt,
where Ht is a deterministic function of time which represents the fixed tem- poral effects, while εt represents the random weather effects. To model the correlation across time we shall assume a time series structure for εt. As an example a simple possibility is
εt = ρεt−1 + νt (6.5)
where νt ∼ N(0, σ²t).
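To make the structure concrete, here is a minimal sketch in R simulating from this claims-count model; the values of ρ and σ, the sinusoidal stand-in for Ht, the constant innovation variance and the seed are our illustrative assumptions:

set.seed(2)
n <- 365 * 2                                      # two years of days
t <- 1:n
H <- 3 + 0.5 * sin(2 * pi * t / 365)              # stand-in for the deterministic effect H_t
rho <- 0.7; sigma <- 0.3
eps <- numeric(n)
eps[1] <- rnorm(1, sd = sigma / sqrt(1 - rho^2))  # stationary start
for (i in 2:n) eps[i] <- rho * eps[i - 1] + rnorm(1, sd = sigma)  # AR(1), Equation (6.5)
lambda <- exp(H + eps)                            # log lambda_t = H_t + eps_t
N <- rpois(n, lambda)                             # N_t ~ Poisson(lambda_t)
plot(t, N, type = "h", xlab = "day", ylab = "no. of claims")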
All models in this paper will be based on this structure. We therefore
examine in Section 6.5.4 the appropriate form of the deterministic effect, Ht, and the time series structure of the random effects εt.
Models for ft(X|N)
Data analysis indicates that after conditioning on the number of claims there is very little seasonal variation in the distribution of X. It appears that a relatively simple model, such as a fixed t-distribution is appropriate here.
6.5.4 Results
Throughout we use a Bayesian approach implemented using Markov chain Monte Carlo (MCMC) methodology, see ? and ?. As far as possible non- informative priors were used, and in general the likelihood completely dom- inated the posterior distribution which made the choice of prior relatively unimportant. There was one exception to this mentioned in Section 6.5.4. In general a sequential Metropolis-Hastings algorithm was used, although
for some of the models we used block-updating to increase efficiency, see ?. The fact that there was a large amount of data and that the models were clearly identified meant that the convergence of the MCMC algorithms gave few problems after parameterisations were chosen to aid efficiency, ? and ?.
During the model selection process we attempted to keep the dimension of the model as small as possible, using the posterior distribution of both the parameter values and the likelihood as a guide.
The deterministic model
The deterministic part of the model Ht needs to include long term changes of the level, seasonal effects and the systematic features of the reporting mentioned in Section 6.5.2. This is currently modelled as
Ht = α0 + α1 t + α2 t² + α3 fm(t) + ∑_{i=1}^{7} βi di(t) + ∑_{i=1}^{12} γi mi(t), (6.6)

where fm is an indicator function which is 1 if t is the first day of the month and 0 otherwise, mi is the indicator function for the ith month of the year, and di for the ith day of the week.
We first discuss the long term trends. The output from the model is shown in Figure 6.9. The left hand panels show the output of the MCMC algorithm for each of the three parameters α0, α1 and α2. They show that the algorithm is mixing well and the output will give the posterior distribution of the parameters. The convergence seen here is typical of that for the other parameters in the model. The right hand panels show the marginal posterior distributions of the parameters. They show that there is reasonable evidence for both α1 and α2 being non-zero. However it is important to note that the unobserved number of policies would enter as an additive part of equation (6.6), and hence will be confounded with these trend terms.
The seasonal effects were also found to be strongly significant in all models. Various versions of a seasonal component were tried, including sinusoidal components. The twelve-level monthly model used in equation (6.6) was found to be a simple and effective description of this variation. Figure 6.10 shows the posterior output for the twelve levels. The seasonal effect, with winter being worse than summer, is very obvious.
The systematic reporting effects are picked up by the day of the week effect, shown in Figure 6.11, and the first day of the month effect, shown in Figure 6.12. For the weekly effect we see that there is an excess of claims on Monday and a corresponding decrease at the weekend. The first day of the month effect is shown to be significantly positive.
Overall the model has done very well in exploring the deterministic vari- ation across days. It can detect long term variation and has quantified the systematic reporting effects very clearly.
Figure 6.9: The trend effect: the left hand panel shows output from the MCMC algorithm, the right hand panel the marginal distribution

Figure 6.10: The seasonal monthly effect
Figure 6.11: The weekly effect

Figure 6.12: The first day of the month effect
The random effects
The modelling for the random effects needs to take into account both the marginal and the dependence structure. A simple example of a model is given by Equation (6.5), which models the correlation across time using a simple autoregressive structure. The innovation process ν is given by a Normal model. The variance of this term is modelled as having seasonal variation:

σ²t = σ² ∑_{i=1}^{12} γi mi(t),

where mi is the monthly indicator function defined in Equation (6.6). The seasonal variation in the variance appears to be strongly significant.
In Figure 6.13 we plot the marginal distribution of the innovation process, based on this model. The left hand panel shows the marginal structure. The histogram shows a skewed (asymmetric) distribution. This is also shown in the plot of the innovations against the day in the right hand panel. The very large values in this plot correspond to very large one-day jumps in the number of claims.
The model above is capable of modelling the bulk of this distribution well, but there is a question of how to model both of the tails. Other innovation processes which have been used include a t-distribution, and mixtures of Normals and t-distributions. For further details of this issue see Section 6.5.5.
Figure 6.13: The innovation process; the left hand panel shows the marginal distribution, the right hand panel the time series structure
Model checking
One of the advantages of the MCMC methodology is that simple data analytic techniques can be used to give informative model diagnostics. One useful technique is to use the predictive distribution of the model and compare the output with the empirical distribution based on the data. As an example of this technique see Figure 6.14. This is a comparison of the predictive distribution to the observed data. In this example we look at the proportion of days with fewer than 10 claims. The histogram is the predictive distribution generated from the model. The vertical line is the observed proportion. We see that this is completely consistent with the model's predictions.
6.5.5 Conclusions and further work
The modelling methodology here is designed to reflect the physical process which is generating the data. Traditional time series approaches are forced to model some aggregation of the data, by modelling directly the total cost of claims on a day, for example. In contrast we model the full joint distribution of all the data. We feel that a model which is closer to the actual physical process offers great advantages in interpretation and predictive power.
There are a number of outstanding issues which need to be explored. We look at the three objectives raised in Section 6.5.1.
1. Temporal effects. The model is able to detect yearly variation, both using trends in the deterministic function, Ht, and potentially in the structure of the innovation process. The model recognises that changes in the numbers of claims are confounded with changes in the number of policies at risk. However there is scope to incorporate this statistic in order to give a satisfactory model for this first question.
2. Aggregation. By modelling the full joint distribution we have implicitly modelled any aggregation process. This can be done either analytically or via simulation. Clearly there is further work to be done here.
3. Extremes. The most delicate part of the modelling concerns the incorporation of the extreme meteorological events in the dataset. Currently different models of the innovation process can treat these observations very differently. Further work is clearly needed in order to clarify this issue. A more detailed dataset, as mentioned in Section 6.5.2, could help. Since we have modelled the data via the physical process driving it, there is also the possibility of using meteorological models directly in the modelling. This possibility opens up a whole new set of information which can greatly improve prediction of return levels of these extreme events.
Figure 6.14: An example of comparing the predictive distribution to the observed data. In this example we look at the proportion of days with fewer than 10 claims. The vertical line is the observed proportion
Appendix
6.A JAGS, BUGS and R
As we have seen above we can use MCMC to compute any posterior quan- tity in a Bayesian setting. One major issue with MCMC is computational speed and a number of packages have been written to help users efficiently implement MCMC in complex hierarchical models. We will focus on JAGS, which can be run directly from R.
BUGS and WinBUGS [1] were highly successful packages that were made available to users and had a major impact on the use of MCMC methods in many areas. A very good and up-to-date introduction can be found in ?. BUGS stands for 'Bayesian inference Using Gibbs Sampling' and the WinBUGS package was developed to run on Windows machines.
In this course we will concentrate on the package JAGS ('Just Another Gibbs Sampler'), a clone of BUGS which runs on Windows, Mac and UNIX machines and, through the R package rjags, can easily be run directly from R. Note that the rjags package does not include a copy of the JAGS library, which must be installed separately [2]. JAGS is licensed under the GNU General Public License and is thus freely available.
Let us look at a very simple example of running a Gibbs sampler from R using JAGS.
Example 6.A.1. (Independent normal example) Consider the model

X | θ, τ ∼ N(θ, τ⁻¹)
θ | τ ∼ N(0, 100/τ)
1/τ ∼ Gamma(1/1000, 1/1000)
First, in a text file – which I have called jagsmodel2.R, but could have any name – we have the JAGS (and in fact BUGS) code for this model. This is given by
[1] http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/contents.shtml
[2] For instructions on downloading JAGS, see the home page at http://mcmc-jags.sourceforge.net.
model {
  for (i in 1:N) {
    X[i] ~ dnorm(theta, tau)      # likelihood: normal with mean theta and precision tau
  }
  theta ~ dnorm(1, 0.01*tau)      # prior for theta; its precision scales with tau
  tau <- 1.0/sqrt(tau1)
  tau1 ~ dgamma(1.0E-3, 1.0E-3)   # vague gamma prior
}
library(rjags)
X <- rnorm(50, mean=0, sd=100)   # generate data
N <- 50
model2 <- jags.model(file="jagsmodel2.R", n.adapt=0)
update(model2, n.iter=5000)      # burn-in
output <- coda.samples(model2, c("theta","tau"), n.iter=10000)
plot(output)
which gives the plot shown in Fig. 6.A.1.
Figure 6.A.1: Output from Example 6.A.1
A few points to note from Example 6.A.1. Firstly, you need to be careful to note that the way R and JAGS parameterise distributions is subtly different. Compare dnorm(theta, tau) in JAGS, which uses the mean and precision (the inverse of the variance), with rnorm(50, mean=0, sd=100) in R, which uses the mean and standard deviation. Further, the variables which the model uses (in the example above, X and N) need to exist in the R session.
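If you prefer not to rely on the variables being found in the R session, they can also be passed explicitly via the data argument of jags.model, e.g.

model2 <- jags.model(file="jagsmodel2.R", data=list(X=X, N=N), n.adapt=0)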