STAT 443: Forecasting Fall 2020 Project
Introduction
The project for STAT 443 for Fall 2020 will be based on the theme of evaluating and comparing forecasting methods in a specific context. It is due at 5:00pm on December 7th via Crowdmark. Unlike the assignments, this is an individual piece of work.
The project will be released in two parts to match our progress through the course. This document is Part I, which is worth 40% of the final project mark. Part I focuses on Chapters 1-5 of the course, and I would recommend working on it during the period when we are studying Chapter 5. It is designed to explore the material on designing simulation studies from Chapter 5, illustrated with examples from Chapters 1-4.
Part II will involve you designing and running a more complex simulation study involving models that we will be studying in the second half of the course. It will be released as we start to study Chapter 6. We do not have quizzes or assignments scheduled for the last two weeks of term to give you time to work on Part II. Both parts of the project are due at the same time.
Note that some of the work involves writing and running R code. If you are not experienced with R, it often takes longer than you might at first expect, so budget your time accordingly.
Part I: Question Set 1
In the notes we called the time series $X := (X_t)$, defined by $X_t \overset{\text{i.i.d.}}{\sim} N(\mu, \sigma^2)$ with $t = 1, \dots, n$, a trivial case because it has the simplest dependence structure, i.e., all observations are independent. In earlier STAT courses you will have seen that a 95% prediction interval is a random interval $(L(X), U(X))$ which has the property that
$$\Pr\left(L(X) \le X_{n+h} \le U(X)\right) = 0.95. \tag{1}$$
The following R code tries to check, through a simulation exercise, that a particular choice of random functions L(X), U(X) does satisfy Equation (1):
check.coverage <- function(N.sim, n, mu.true, sigma.true, h)
{
  check <- rep(0, length = N.sim)
  set.seed(1000)
  for (i in 1:N.sim)
  {
    x <- rnorm(n, mean = mu.true, sd = sigma.true)
    mu.hat <- mean(x)
    sigma.hat <- sd(x)
    c <- qt(0.975, df = n - 1)
    L <- mu.hat - c * sigma.hat * sqrt(1 + 1/n)
    U <- mu.hat + c * sigma.hat * sqrt(1 + 1/n)
    xh <- rnorm(1, mean = mu.true, sd = sigma.true)
    check[i] <- (xh >= L) & (xh <= U)
  }
  mean(check)
}
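To see what a single pass through the loop produces, the sketch below computes one interval (L, U) and one future value outside the function. The seed and the values of n, mu.true and sigma.true are illustrative assumptions, not the settings asked for in the questions, and the name t.crit is introduced here only to avoid reusing `c` as a variable name.

```r
# Illustrative values only (assumed for this sketch, not taken from the questions)
set.seed(443)
n <- 50
mu.true <- 0
sigma.true <- 1

# One simulated "observed" series and its summary statistics
x <- rnorm(n, mean = mu.true, sd = sigma.true)
mu.hat <- mean(x)
sigma.hat <- sd(x)

# Same t quantile that check.coverage stores in `c`
t.crit <- qt(0.975, df = n - 1)

# Interval end points, using the same formula as inside the loop
half.width <- t.crit * sigma.hat * sqrt(1 + 1/n)
L <- mu.hat - half.width
U <- mu.hat + half.width

# One simulated future value, and whether the interval captures it
xh <- rnorm(1, mean = mu.true, sd = sigma.true)
covered <- (xh >= L) & (xh <= U)

c(L = L, U = U, xh = xh, covered = covered)
```

The logical value `covered` is what check.coverage stores in check[i] on each iteration.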
1. (5 marks) Explain in simple English what the term mean(check) represents and what its relationship is to Equation (1).
2. (5 marks) Equation (1) says that the prediction coverage should be 95%. If n, the length of the observed series, is 100, and the time series is $X_t \overset{\text{i.i.d.}}{\sim} N(2, 3^2)$, what does the algorithm estimate the actual prediction coverage to be for h = 1 and h = 100? Comment on your interpretation of the results. [Hint: take N.sim = 100,000.]
3. (5 marks) Annotate the code in terms of the Person A and Person B game discussed in Section 5.3 of the notes, to help explain the structure.
4. (5 marks) The code tries to estimate a probability, π, of correctly capturing the Xn+h value in the prediction interval (L,U).
Let us define
$$Z = \begin{cases} 1 & X_{n+h} \in [L, U] \\ 0 & X_{n+h} \notin [L, U] \end{cases}$$
Part of the code effectively generates N.sim i.i.d. random versions of this random variable, which we denote by $Z_i$.
(a) What distributions do $Z$ and $\sum_{i=1}^{N.sim} Z_i$ have in terms of π?
(b) Using your answer, what are the mean and variance of $\frac{1}{N.sim} \sum_{i=1}^{N.sim} Z_i$?
(c) Use your result to give a reason for selecting N.sim = 100,000 in Question 2 in terms of the accuracy of the output of the code. [This calculation is often used to compute the so-called Monte Carlo error.]
5. (5 marks) In Section 5.5 we have a list from Table 1 of Morris et al. (2019) on important things to consider in the design of a simulation study. In the context of the example of this question set identify the following:
i the specific aim of this simulation study.
ii the data-generating mechanism
iii the performance measure used
iv the number of simulations to achieve an acceptable Monte Carlo SE
v how you might use a graphical method in an exploratory analysis of your results
Question Set 2
As pointed out in Chapter 5, Morris et al. (2019) state that ‘A key strength of simulation studies is the ability to understand the behavior of statistical methods because some “truth” (usually some parameter/s of interest) is known from the process of generating the data.’ As an example of this, we can investigate how well a method works when the assumptions behind the model do not hold.
6. (5 marks) Suppose that a random variable X has an exponential distribution with mean 1. Show how to write an R function which returns n independent realisations of aX + b which have mean mu.true and standard deviation sigma.true.
Submit annotated code.
7. (5 marks) Describe how you checked your code using both graphical methods and summary statistics.
8. (5 marks) Show how you can adapt the code in check.coverage so that the data generation process is now a scaled and translated exponential and repeat the experiment in Question 2.
9. (5 marks) As part of the Analysis step comment on the results of your simulation.
Use graphical methods to explore anything in the results that you found surprising.
10. (5 marks) By comparing, in the exponential case, the results based on the method in check.coverage with a quantile-based analysis, explain why coverage may not be the only important performance measure in the experiment.
Simple experimental design
11. (15 marks) We have looked in the notes at the AirPassengers data from R. This was an example used by Box and Jenkins. We want to compare two forecasting methods for data of this kind. The data may come from different years, different geographical regions, different airlines, etc., but you can assume it has the same basic structure.
You have been asked to compare the Box-Jenkins methodology and Holt-Winters in terms of their performance with one-step-ahead prediction.
For this question you need to prepare an initial design of a simulation study that you feel will evaluate the two methods. You should explain your design in terms of Points 1-6 and 9 of the list in Section 5.4. For point 9 you need to explain how you plan to present results when you have them.
[Note: You are only planning the study here; you should not be writing code or running the experiment.]
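Purely to make the phrase "one-step-ahead prediction" concrete, the sketch below holds out the final month of AirPassengers and forecasts it with both methods using base R. It is not the design asked for in this question. The multiplicative seasonal option, the log transform and the seasonal ARIMA order are all assumptions made for illustration; a proper Box-Jenkins analysis would choose the model through its identification and diagnostic steps.

```r
# Hold out the final month so each one-step-ahead forecast has a known target
train  <- window(AirPassengers, end = c(1960, 11))
actual <- window(AirPassengers, start = c(1960, 12))

# Holt-Winters with multiplicative seasonality (an assumed choice)
hw.fit  <- HoltWinters(train, seasonal = "multiplicative")
hw.pred <- predict(hw.fit, n.ahead = 1)

# Seasonal ARIMA on the log scale; the (0,1,1)x(0,1,1) order with period 12
# is assumed here rather than selected by the Box-Jenkins methodology
bj.fit  <- arima(log(train), order = c(0, 1, 1),
                 seasonal = list(order = c(0, 1, 1), period = 12))

# Simple back-transform of the log-scale forecast (no bias adjustment)
bj.pred <- exp(predict(bj.fit, n.ahead = 1)$pred)

c(actual       = as.numeric(actual),
  holt.winters = as.numeric(hw.pred),
  box.jenkins  = as.numeric(bj.pred))
```

A single held-out month says very little about which method is better, which is why the question asks for a designed simulation study instead.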