STAT 443: Forecasting Fall 2020 Assignment 4
When you upload to Crowdmark please have a separate page for each of the 12 questions. You should submit your work before 5:00pm (Waterloo time) on the 20th November.
You are allowed to work with others on this assignment, but if you do, you must make a joint group submission to Crowdmark. You can create your own groups, any member of the group can submit on behalf of the group, and a single group assessment will be graded. All students in the group will receive a copy of the graded feedback in their Crowdmark portfolio.
Question Set 1: This set of questions looks at how to implement Bayesian ideas using R code. You will need the library MCMCpack. From Learn, download the file Assignment4Question1.csv and read it into an R session using:
y <- read.csv(file = "Assignment4Question1.csv", header = TRUE)
y <- y[[1]]  # extract the single column as a numeric vector
1. (5 marks) Consider random variables $Y_i \overset{\text{i.i.d.}}{\sim} N(\mu, 1)$, for $i = 1, \ldots, N$, where we are interested in inference on $\mu$ after we have seen $y_1, \ldots, y_N$.

(a) Show that the likelihood function for $\mu$, when we have observed $y_1, \ldots, y_N$, can be written as

$$\exp\left( \mu \sum_{i=1}^{N} y_i - \frac{N}{2}\mu^2 \right).$$
(b) Suppose the prior for $\mu$ is $N(0, 1)$. Show that the posterior satisfies

$$P(\mu \mid y_1, \ldots, y_N) \propto \exp\left( -\frac{1}{2}(N+1)\left( \mu - \frac{\sum_{i=1}^{N} y_i}{N+1} \right)^2 \right).$$

(c) Hence show the posterior for $\mu$ is $N\!\left( \dfrac{\sum_{i=1}^{N} y_i}{N+1},\; (N+1)^{-1} \right)$.
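For part (b), the key step is completing the square in $\mu$; a sketch of the algebra, combining the prior density with the likelihood from part (a):

$$\exp\!\left(-\tfrac{1}{2}\mu^2\right)\exp\!\left(\mu \sum_{i=1}^{N} y_i - \tfrac{N}{2}\mu^2\right) = \exp\!\left(-\tfrac{N+1}{2}\mu^2 + \mu \sum_{i=1}^{N} y_i\right) \propto \exp\!\left(-\tfrac{N+1}{2}\left(\mu - \frac{\sum_{i=1}^{N} y_i}{N+1}\right)^2\right).$$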
2. (5 marks) Consider a set of random variables $\{Y_i\} \overset{\text{i.i.d.}}{\sim} N(\mu, \sigma^2)$ for $i = 1, \ldots, N$. After downloading the data we can use the function MCMCregress to do Bayesian inference on $\mu, \sigma^2$ by running
library(MCMCpack)  # provides MCMCregress
output <- MCMCregress(y ~ 1, b0 = 0, B0 = 1)
summary(output)
where b0 is the mean and B0 is the precision (inverse of variance) of the prior.
[Note: the function works for general regression models, so if there are no explanatory variables the intercept is the mean.]
(a) Compare Bayesian inference on μ (point estimate and standard error), based on MCMCregress, with frequentist inference, based on the output of the t.test() function.
(b) What happens if we set B0 = 100? Can you explain what has changed in terms of the strength of the prior?
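A minimal sketch of how parts (a) and (b) might be run, assuming y has been loaded as the numeric vector above:

library(MCMCpack)

# Frequentist inference: point estimate and standard error of the mean
t.test(y)

# Bayesian inference with prior precision B0 = 1 (a weak prior)
weak <- MCMCregress(y ~ 1, b0 = 0, B0 = 1)
summary(weak)    # posterior mean and SD of the intercept, i.e. of mu

# Bayesian inference with prior precision B0 = 100 (a much stronger prior),
# which pulls the posterior for mu towards the prior mean b0 = 0
strong <- MCMCregress(y ~ 1, b0 = 0, B0 = 100)
summary(strong)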
3. (5 marks) The output from the MCMCregress function is a large sample of realisations from the posterior, stored in a matrix. The first column corresponds to the mean, and the second the variance.
(a) Use plots to visualise the posteriors for $\mu$ and $\sigma^2$ and the joint posterior of $(\mu, \sigma^2)$.
(b) You can also plot the first column as if it were a time series. What does the time series plot, the acf and the pacf tell us?
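A sketch of one way to produce these plots, assuming output is the object returned by MCMCregress above:

mu     <- output[, 1]   # posterior draws of the mean
sigma2 <- output[, 2]   # posterior draws of the variance

hist(mu, main = "Posterior of mu")
hist(sigma2, main = "Posterior of sigma^2")
plot(mu, sigma2, pch = ".", main = "Joint posterior")  # joint posterior of (mu, sigma^2)

# Treating the draws of mu as a time series
plot.ts(mu)
acf(mu)
pacf(mu)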
4. (5 marks) Forecasting in Bayesian inference is done with the predictive distribution, i.e.

$$\int\!\!\int \phi(y_0 \mid \mu, \sigma^2)\, P(\mu, \sigma^2 \mid y_1, \ldots, y_n)\, d\mu\, d\sigma^2,$$

where $\phi$ is the normal density.

We can use the output of MCMCregress to produce a sample from the predictive distribution. Rather than compute this integral, we take the sample $\{(\mu_i, \sigma_i^2)\}$, $i = 1, \ldots, 10\,000$, from MCMCregress. We approximate the predictive distribution with a sample of size 10 000 by generating a sequence of single observations

$$y_i \sim N(\mu_i, \sigma_i^2), \qquad i = 1, \ldots, 10\,000.$$

Showing your code, what do you estimate the mean and the standard deviation of the predictive sample to be?
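A minimal sketch of this simulation, assuming output is the MCMCregress object from Question 2, so each row of output is one posterior draw:

mu     <- output[, 1]
sigma2 <- output[, 2]

# One predictive draw per posterior draw: y_i ~ N(mu_i, sigma_i^2)
pred <- rnorm(length(mu), mean = mu, sd = sqrt(sigma2))

mean(pred)  # estimated mean of the predictive sample
sd(pred)    # estimated standard deviation of the predictive sample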
Question Set 2: This set of questions looks at the relationship between the Kalman Filter, Bayesian inference and updating rules.
We use the simple state space model
$$x_t = \Phi x_{t-1} + w_t \qquad (1)$$
$$y_t = A x_t + v_t \qquad (2)$$

where $w_t \overset{\text{i.i.d.}}{\sim} N(0, Q)$, $v_t \overset{\text{i.i.d.}}{\sim} N(0, R)$, $y_t$ and $x_t$ are 1-dimensional, and the two noise terms, $w$ and $v$, are independent of one another.
The {xt} process is the unobserved state variable and {yt} is the observed
process. Define
$$H_s = \{y_s, y_{s-1}, \ldots\}.$$
Our target is to compute $E(X_t \mid H_t)$ through an update rule: (i) predict using $E(X_t \mid H_{t-1})$; (ii) correct when $y_t$ becomes available. Here, conditionally on $H_{t-1}$, we have prior and likelihood:

$$X_t \mid H_{t-1} \sim N(\mu_{t-1}, P_{t-1}) \qquad (3)$$
$$Y_t \mid x_t \sim N(A x_t, R) \qquad (4)$$
5. (5 marks) First let us consider Bayesian learning as a sequence of updates.
When $Y \sim N(\theta, 1)$ and the prior for $\theta$ is $N(\mu_0, \sigma_0^2)$, we can show that the posterior mean and variance after observing $y$ are given by

$$\mu_0 \to \mu_1 := \frac{\mu_0 + y\,\sigma_0^2}{1 + \sigma_0^2}, \qquad \sigma_0^2 \to \sigma_1^2 := \frac{\sigma_0^2}{1 + \sigma_0^2}.$$

Suppose we have a sequence $\{y_i\}$ of i.i.d. realisations from this distribution; each time a new observation arrives we update the posterior according to the updating rules above. We denote by $\mu_n, \sigma_n^2$ the mean and variance of the posterior $P(\theta \mid y_1, \ldots, y_n)$.

It will be convenient to work with the precision $\rho_n$, defined as $1/\sigma_n^2$, the inverse of the variance.
(a) Show that the posterior precision, after $n$ updates, is $n + \rho_0$.
(b) If we think about the size of the variance of the posterior as the uncertainty we have about θ after seeing the data, what happens as n → ∞?
6. Now consider updating the mean.

(a) Show $\mu_1 = \dfrac{\rho_0}{\rho_1}\mu_0 + \dfrac{1}{\rho_1} y_1$.

(b) Show $\mu_2 = \dfrac{\rho_0}{\rho_2}\mu_0 + \dfrac{1}{\rho_2} y_1 + \dfrac{1}{\rho_2} y_2$.

(c) Deduce that for large $n$ we have $\mu_n \approx \dfrac{\sum_{i=1}^{n} y_i}{n}$.
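A short R sketch of these sequential updates; the true mean, prior values, and sample size here are illustrative assumptions:

set.seed(1)
n     <- 1000
theta <- 2                       # illustrative true mean
y     <- rnorm(n, mean = theta, sd = 1)

mu  <- 0                         # prior mean mu_0
rho <- 1                         # prior precision rho_0 = 1 / sigma_0^2
for (i in 1:n) {
  rho_new <- rho + 1             # each observation adds 1 to the precision
  mu      <- (rho / rho_new) * mu + (1 / rho_new) * y[i]
  rho     <- rho_new
}

rho               # equals n + rho_0
c(mu, mean(y))    # for large n the posterior mean is close to the sample mean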
7. (5 marks) The Kalman Filter for the model defined by (1) and (2) is based on normality assumptions and finding numerically efficient ways of updating using a Bayesian approach.
[Here we are working in the one dimensional case i.e. both the observed process and the state variable are one dimensional to avoid matrix inverses, but the arguments are, in fact, quite general]
In Section 7.6 we saw that the posterior update rules, for the state space model, in terms of mean and variance are

$$\mu_t = \frac{\dfrac{\mu_{t-1}}{P_{t-1}} + \dfrac{y_t A}{R}}{\dfrac{A^2}{R} + \dfrac{1}{P_{t-1}}}, \qquad P_t = \left( \frac{A^2}{R} + \frac{1}{P_{t-1}} \right)^{-1} = \frac{R\,P_{t-1}}{A^2 P_{t-1} + R}.$$

(a) Again defining the precision as $\rho_t = P_t^{-1}$, show the update rules can be written as

$$\rho_t = \frac{A^2 P_{t-1} + R}{R\,P_{t-1}} = \frac{A^2}{R} + \rho_{t-1}, \qquad \mu_t = \frac{\rho_{t-1}}{\rho_t}\mu_{t-1} + \frac{A}{R\,\rho_t}\, y_t.$$

(b) Write the update rule for $\mu_t$ in terms of the forecast error $(y_t - A\mu_{t-1})$ and $\rho_{t-1}$, and show the Kalman gain is $\dfrac{A}{A^2 + R\,\rho_{t-1}}$.
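A sketch of the predict/correct cycle in code, using the scalar gain above; the model parameters, simulated data, and starting values are all illustrative assumptions:

set.seed(2)
Phi <- 0.8; A <- 1; Q <- 1; R <- 0.5      # illustrative parameters
n <- 200
x <- numeric(n); y <- numeric(n)
x[1] <- rnorm(1)
y[1] <- A * x[1] + rnorm(1, sd = sqrt(R))
for (t in 2:n) {
  x[t] <- Phi * x[t - 1] + rnorm(1, sd = sqrt(Q))
  y[t] <- A * x[t] + rnorm(1, sd = sqrt(R))
}

mu <- 0; P <- 1                            # illustrative starting values
mu_filt <- numeric(n)
for (t in 1:n) {
  # Predict: push the current estimate through the state equation
  mu_pred <- Phi * mu
  P_pred  <- Phi^2 * P + Q
  # Correct: Kalman gain times forecast error; K = A / (A^2 + R / P_pred)
  K  <- A * P_pred / (A^2 * P_pred + R)
  mu <- mu_pred + K * (y[t] - A * mu_pred)
  P  <- (1 - K * A) * P_pred
  mu_filt[t] <- mu
}

plot.ts(x); lines(mu_filt, col = "red")    # state vs filtered estimate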
8. (5 marks) If we did not work sequentially using the Kalman Filter we could estimate the whole state process after seeing all the observed data.
(a) Explain why the distribution of (Y1, . . . , Yn) given (x1, . . . , xn) is multivariate normal.
(b) What could be a computational problem with finding the maximum likelihood estimates of $(x_1, \ldots, x_n)$ directly?
Question Set 3: This set of questions explores the ideas behind state space models and the information available, for forecasting, in knowing the state. Consider the state space model defined in Definition 7.1.1. Suppose that the state variable can take two values, i.e. $\theta_t \in \{1, 2\}$, and that $Y_t \mid \theta_t \sim N(\mu_{\theta_t}, \sigma^2_{\theta_t})$. Further, assume that the state variable evolves as a Markov chain such that

$$\Pr(\theta_{t+1} = 1 \mid \theta_t = 1) = p, \qquad \Pr(\theta_{t+1} = 2 \mid \theta_t = 2) = q,$$

so there is a transition matrix

$$P = \begin{pmatrix} p & 1-p \\ 1-q & q \end{pmatrix},$$

and we assume that $\{\theta_t\}$ is in equilibrium. Figure 1 shows a realisation of $Y_t$ in which the behaviour in different states is very different.
[Figure 1 here. Panels: (a) time series plot of the simulated HMM realisation against time, (b) QQ plot of sample quantiles against normal theoretical quantiles, (c) histogram of the data.]

Figure 1: (a) Realisation of the HMM, (b) QQ plot compared to normal, (c) histogram.
Fitting an HMM is more complex than fitting an ARIMA model, as the likelihood is complicated. You can use Bayesian MCMC methods or the EM algorithm. The R library depmixS4 uses the second approach, and we got the following estimates from the data. The transition matrix is estimated as

$$\hat{P} = \begin{pmatrix} 0.996 & 0.004 \\ 0.014 & 0.986 \end{pmatrix},$$
and its associated equilibrium distribution is $\pi_1 = 0.7777778$, $\pi_2 = 0.2222222$. The estimated means and standard deviations are $\mu_1 = 0.981$, $\sigma_1 = 0.505$, $\mu_2 = -0.444$, $\sigma_2 = 2.030$.

We also get posterior estimates of the unobserved state variable; in particular, the posterior probability that the most recent state, $s_{2000}$, equals 1 is 0.998.
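A sketch of how such a fit might be produced with depmixS4; the data frame construction and settings here are illustrative assumptions:

library(depmixS4)

d   <- data.frame(y = y)                     # y holds the observed series
mod <- depmix(y ~ 1, data = d, nstates = 2)  # two-state Gaussian HMM
fm  <- fit(mod)                              # fitted by the EM algorithm

summary(fm)           # estimated transition matrix, state means and sds
head(posterior(fm))   # posterior state estimates for each time point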
We are interested in whether $\{Y_t\}$ is stationary and how to forecast from it.
9. (5 marks) We start by looking at the mean of the process. We assume that the state process $\{\theta_t\}$ is in equilibrium, so that we have the unconditional probabilities $P(\theta_t = 1) = \pi_1$, $P(\theta_t = 2) = \pi_2$.
(a) Use the iterated expectation formula, $E(S) = E_T(E(S \mid T))$, to show

$$E(Y_t) = \pi_1 \mu_1 + \pi_2 \mu_2.$$
(b) Explain whether the process $\{Y_t\}$ is mean stationary or not.
(c) Using the numerical estimates above what is the estimate of E(Yt)?
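As a quick numerical sketch for part (c), the estimate can be computed directly from the reported values, treating them as known:

pi1 <- 0.7777778; pi2 <- 0.2222222
mu1 <- 0.981;     mu2 <- -0.444
pi1 * mu1 + pi2 * mu2   # estimated E(Y_t), roughly 0.66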
10. (5 marks) Now let's think about the variance.
(a) Just from Figure 1, do you think the process is variance stationary?
(b) Use the iterated variance formula

$$\mathrm{Var}(S) = E_T(\mathrm{Var}(S \mid T)) + \mathrm{Var}_T(E(S \mid T)),$$

and the equilibrium assumption $\Pr(\theta_t = 1) = \pi_1$, to show that

$$\mathrm{Var}(Y_t) = (\mu_1 - \mu_2)^2 \pi_1 \pi_2 + \sigma_1^2 \pi_1 + \sigma_2^2 \pi_2.$$
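A hint for the second term of the iterated variance formula: $E(Y_t \mid \theta_t) = \mu_{\theta_t}$ takes the value $\mu_1$ with probability $\pi_1$ and $\mu_2$ with probability $\pi_2$, so

$$\mathrm{Var}_T(E(Y_t \mid \theta_t)) = \pi_1 \mu_1^2 + \pi_2 \mu_2^2 - (\pi_1 \mu_1 + \pi_2 \mu_2)^2 = (\mu_1 - \mu_2)^2 \pi_1 \pi_2.$$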
11. (5 marks) Since we have good estimates for $\mu_1, \mu_2, \sigma_1^2, \sigma_2^2$ and $P$, and we can compute good estimates for the equilibrium probabilities $\pi_1, \pi_2$, treat these as fixed and known. A sensible estimate for $Y_{2001}$ based on these estimates would be

$$\hat{Y}_{2001} = E(Y_{2001}) = \pi_1 \mu_1 + \pi_2 \mu_2.$$
(a) Suppose that we were told that $s_{2000} = 1$. Show, using this extra information, that the bias of using $\hat{Y}_{2001}$, compared to the best (in terms of MSE) estimate, $E(Y_{2001} \mid s_{2000} = 1)$, is

$$\mu_1 P_{11} + \mu_2 P_{12} - (\pi_1 \mu_1 + \pi_2 \mu_2).$$
(b) Using the estimates above what is the numerical value of the bias?
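A quick numerical sketch for part (b), using the estimates reported above:

P11 <- 0.996; P12 <- 0.004
pi1 <- 0.7777778; pi2 <- 0.2222222
mu1 <- 0.981;     mu2 <- -0.444
mu1 * P11 + mu2 * P12 - (pi1 * mu1 + pi2 * mu2)   # bias, roughly 0.31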
12. (5 marks)
(a) In the previous question, what happens to the bias for the estimate of $Y_{2000+h}$ as $h \to \infty$?
(b) What happens to the bias when μ1 = μ2?
(c) From Figure 1, what do you conclude about the marginal distribution of $Y_t$?