Introduction to Transfer function noise model and assignment
February 9, 2020
1 Transfer Function Noise (TFN) Model
1.1 Model specification
Consider the time series regression as follows. \[y_t = \alpha +\sum^\infty_{i=0} v_i x_{t-i}+n_t,~~n_t \sim N(0,\sigma^2_n),~(1)\] where \(v_0,v_1,\dots\) are referred to as the impulse response function,1 and \[\phi(B)x_t =\theta(B)a_t,~a_t\sim N(0,\sigma^2_a).\] This model is also known as the distributed lag model in statistics or econometrics.2
1. For a dynamic system as specified by Equation (1) to be meaningful, the impulse responses must be absolutely summable, \(\sum^\infty_{i=0} |v_i|<\infty\)
2. The value \(g=\sum^\infty_{i=0} v_i\) is called the steady-state gain.
1.2 How to estimate infinite number of parameters
1. Unstructured approximation
2. Structured estimation/approximation, such as Koyck distributed lag model.
◦ The Koyck distributed lag model may be given by \[y_t = a + \sum^\infty_{i=0} \beta \lambda^ix_{t-i}+\sum^\infty_{i=0}\lambda^i \xi_{t-i},~~|\lambda|<1\] The above model is a special case of Jorgenson’s rational distributed lag model \[y_t = \frac{\delta(B)}{\vartheta(B)}x_t+n_t,\] where \[v(B) = \frac{\delta_0 +\delta_1B+\dots+\delta_rB^r}{1-\vartheta_1B-\dots-\vartheta_sB^s} = \frac{\delta(B)}{\vartheta(B)}\]
3. Others
1.3 Model building process (single input TFN model)
1. Preliminary identification of the impulse response coefficients \(\{v_i\}_{i=0}^\infty\);
2. Specification of the noise term \(n_t\);
3. Specification of the transfer function using rational polynomial in \(B\) if necessary;
4. Estimation of the TFN model specified in Step 2 and 3;
5. Model diagnostic checks.
1.4 Model identification using prewhitening
To discuss in class
2 Example
• Source: The 2016 SSC Case Study
• https://static.googleusercontent.com/media/research.google.com/en//archive/papers/detecting-influenza-epidemics.pdf
• Download the Google Flu Trends (GFT)3 and FluWatch data from https://ssc.ca/sites/default/files/case-studies/case_study_1_fluwatch.xlsx
• In this exercise, we try to use TFN models to build the relationship between GFT and flu tests and postive flu tests.
1. Are the seasonal trends in the GFT data more strongly associated with the number of influenza infections (i.e., number of positive tests) or the total number of tests? Does GFT lead or lag the number of positive influenza tests?
2. Are the seasonal trends in the GFT data associated with the number of tests and positive test results for other respiratory viruses?
3. How many weeks ahead can you predict the peak in the number of influenza positive tests? Does the data from GFT help to predict the peak?
2.1 Google Flu Trend data
• Import and analyze GFT data
################################################################
########### SSC Google Ful Trends ##############################
################################################################
GFluTrends<-read_excel("C:\\Users\\jenwe\\Downloads\\case_study_1_fluwatch.xlsx",sheet="Google Flu Trends", skip = 1)
head(names(GFluTrends))
## [1] "Date" "Canada" "Alberta"
## [4] "British Columbia" "Manitoba" "New Brunswick"
gft.Ontario<-as.matrix(GFluTrends[,"Ontario"]) # retrieve Ontario GFT
first(GFluTrends[,"Date"]) # 2003-09-28 00:00:00
Date
2003-09-28
last(GFluTrends[,"Date"]) # 2015-08-09 00:00:00
Date
2015-08-09
# Use timeSeries object for weekly data
tim<-timeSequence(from = "2003-09-28", to = "2015-08-09", by = "week")
gft.On<- timeSeries(GFluTrends[,"Ontario"], charvec = tim)
# visualization
par(mfrow = c(1,1), cex = 0.8)
plot(gft.On,type = "b", pch = 19, main = "Google Flu Trends data")
grid()

• Model GFT for prewhitening
a. Model selection using forecast package
mod<-auto.arima(gft.On, max.p = 60, max.q = 0, stationary = TRUE)
mod
## Series: gft.On
## ARIMA(3,0,0) with non-zero mean
##
## Coefficients:
## ar1 ar2 ar3 mean
## 1.5615 -0.7870 0.1516 1869.9983
## s.e. 0.0397 0.0671 0.0396 203.4424
##
## sigma^2 estimated as 143751: log likelihood=-4560.77
## AIC=9131.55 AICc=9131.64 BIC=9153.69
# par(mfrow = c(1,2))
# acf(gft.On)
# acf(mod$resid)
b) Model adequacy test
p1<-autoplot(acf(gft.On, plot = FALSE), conf.int.fill = '#0000FF', conf.int.value = 0.8, conf.int.type = 'ma')
p2<-autoplot(acf(mod$resid, plot = FALSE), conf.int.fill = '#0000FF', conf.int.value = 0.8, conf.int.type = 'ma')
grid.arrange(p1, p2, ncol=2)

c) Create `filter`
# filter for gft ontario
f1<-c(1,-mod$coef[1:3])
2.2 FluWatch data
• Import and analyze FluWatch data
###################################################
fluWatch.Ontario<-read_excel("C:\\Users\\jenwe\\Downloads\\case_study_1_fluwatch.xlsx", sheet="FluWatch-Ontario", skip = 2)
head(names(fluWatch.Ontario))
## [1] "Date" "FluApos" "FluBpos" "FluPos" "FluTest" "RSVtest"
#first(fluWatch.Ontario[,"Date"]) # 2003-09-07 00:00:00
#last(fluWatch.Ontario[,"Date"]) # 2015-08-23 00:00:00
tim1<-timeSequence(from = "2003-09-07", to = "2015-08-23", by = "week")
fluPos.On<- timeSeries(fluWatch.Ontario[,"FluPos"], charvec = tim1)
fluTest.On<- timeSeries(fluWatch.Ontario[,"FluTest"], charvec = tim1)
• Time series plot of numbers of flu tests and positive flu tests on Ontario
par(mfrow = c(1,1), cex =0.75)
plot(fluTest.On, type = "b", pch = 19, ylab = "#")
lines(fluPos.On, col =2, type = "b", pch = 20)
legend("topleft",c("flu test", "flu postive"), col = c(1,2), pch = c(19,20))
grid()

2.3 Prewhitening and model identification
• Prewhitening using CCF
fFluTest.On<-filter(window(fluTest.On, start = "2003-09-28", end = "2015-08-09"),f1,method=c("convo"),sides=1)
fFluPos.On<-filter(window(fluPos.On, start = "2003-09-28", end = "2015-08-09"),f1,method=c("convo"),sides=1)
par(mfrow = c(2,1))
yf<-fFluTest.On[-(1:3)] # loss degrees of freedom
xf<-mod$res[-(1:3)]
par(cex=0.75,bg="gray95")
ccf(xf, yf, lwd=1, ylab="Cross-correlation functions",
main="CCF of prewhitened GFT and flu test")
abline(v=0, col="gold", lwd=2, lty="dashed")
text(-1, 0.2, "-1", col=2)
text(-2, 0.2, "-2", col=2)
yf1<-fFluPos.On[-(1:3)] # loss degrees of freedom
par(cex=0.75,bg="gray95")
ccf(xf, yf1, lwd=1, ylab="Cross-correlation functions",
main="CCF of prewhitened GFT and flu postive")
abline(v=0, col="gold", lwd=2, lty="dashed")
text(-1, 0.2, "-1", col=2)

• Preliminary model identification based on prewhitening
According to the above prewhitening result, we have \[{T}_t = \alpha+v_0 {GFT}_t+v_1{GFT}_{t-1}+v_2{GFT}_{t-2}+n_{1,t},\] \[{P}_t = a+u_0 {GFT}_t+u_1{GFT}_{t-1}+n_{2,t},\] where \(T_t\) and \(P_t\) stands for the numbers of flu tests and postive flu tests at time \(t\), respectively, and \(G_t\) stands for GFT at time \(t\), \(n_{1,t}\) and \(n_{2,t}\) are the noise process for \(T_t\) and \(P_t\), respectively.
2.4 TFN Estimation
library(dynlm)
y<-fluTest.On
y1<-fluPos.On
x<-gft.On
dat<- cbind(y,x, lag(x), lag(x,2))[-(1:2),]
dat1<- cbind(y1, x,lag(x))[-1,]
colnames(dat)<-c("fluTest", "gtf", "gtfL1", "gftL2")
colnames(dat1)<-c("fluPos", "gtf", "gtfL1")
• TFN model for flu tests
In this section, we use two different methods to estimate the identified flu test model.
1) __Do not consider serially correlated error terms:__
mod.test<-dynlm(dat[,1]~dat[,-1])
summary(mod.test)
##
## Time series regression with "zoo" data:
## Start = 2003-10-12, End = 2015-08-09
##
## Call:
## dynlm(formula = dat[, 1] ~ dat[, -1])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2410.1 -264.4 -35.3 200.0 4202.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.08936 35.16598 0.145 0.884976
## dat[, -1]1 0.19514 0.05275 3.699 0.000236 ***
## dat[, -1]2 0.08081 0.08931 0.905 0.365924
## dat[, -1]3 0.21689 0.05278 4.109 4.51e-05 ***
## —
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
##
## Residual standard error: 502.2 on 614 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.6414, Adjusted R-squared: 0.6396
## F-statistic: 366 on 3 and 614 DF, p-value: < 2.2e-16
autoplot(acf(mod.test$resid, plot = FALSE), conf.int.fill = '#0000FF', conf.int.value = 0.8, conf.int.type = 'ma')

2) __Consider serially correlated error terms:__
mod.flutest<-auto.arima(dat[,1],xreg=dat[,-1], stationary = TRUE)
mod.flutest
## Series: dat[, 1]
## Regression with ARIMA(4,0,0) errors
##
## Coefficients:
## ar1 ar2 ar3 ar4 gtf gtfL1 gftL2
## 0.6931 0.1331 0.2578 -0.1884 0.2599 0.1036 0.1070
## s.e. 0.0396 0.0475 0.0472 0.0397 0.0239 0.0310 0.0238
##
## sigma^2 estimated as 55513: log likelihood=-4252.39
## AIC=8520.79 AICc=8521.02 BIC=8556.2
autoplot(acf(tseries::na.remove(mod.flutest$resid), plot = FALSE), conf.int.fill = '#0000FF', conf.int.value = 0.8, conf.int.type = 'ma')

Exercise: Read R output and write down the model in mathematics.
Exercise: Carry out the model diagnostic tests (Readmodel diagnostic lecture note for more details).
• TFN model for positive flu tests
mod.pos<-dynlm(dat1[,1]~dat1[,-1])
summary(mod.pos)
##
## Time series regression with "zoo" data:
## Start = 2003-10-05, End = 2015-08-09
##
## Call:
## dynlm(formula = dat1[, 1] ~ dat1[, -1])
##
## Residuals:
## Min 1Q Median 3Q Max
## -560.94 -90.43 10.93 56.41 1787.39
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -146.95652 10.92748 -13.448 < 2e-16 ***
## dat1[, -1]1 0.04308 0.01398 3.082 0.00215 **
## dat1[, -1]2 0.09855 0.01399 7.047 4.92e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 161.2 on 616 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.6013, Adjusted R-squared: 0.6
## F-statistic: 464.5 on 2 and 616 DF, p-value: < 2.2e-16
autoplot(acf(tseries::na.remove(mod.pos$resid), plot = FALSE), conf.int.fill = '#0000FF', conf.int.value = 0.8, conf.int.type = 'ma')

mod.flupos<-auto.arima(dat1[,1],xreg=dat1[,-1], stationary = TRUE)
mod.flupos
## Series: dat1[, 1]
## Regression with ARIMA(1,0,0) errors
##
## Coefficients:
## ar1 intercept gtf gtfL1
## 0.7963 -197.0446 0.1017 0.0666
## s.e. 0.0255 26.1465 0.0107 0.0105
##
## sigma^2 estimated as 10067: log likelihood=-3732
## AIC=7474.01 AICc=7474.11 BIC=7496.15
autoplot(acf(tseries::na.remove(mod.flupos$resid), plot = FALSE), conf.int.fill = '#0000FF', conf.int.value = 0.8, conf.int.type = 'ma')

mod.flupos1<-arima(dat1[,1],c(1,0,4), xreg=dat1[,-1])
mod.flupos1
##
## Call:
## arima(x = dat1[, 1], order = c(1, 0, 4), xreg = dat1[, -1])
##
## Coefficients:
## ar1 ma1 ma2 ma3 ma4 intercept gtf gtfL1
## 0.8030 -0.0462 0.0120 0.1827 -0.1260 -204.9571 0.1057 0.0669
## s.e. 0.0654 0.0810 0.0629 0.0550 0.0597 26.7236 0.0101 0.0100
##
## sigma^2 estimated as 9623: log likelihood = -3717.66, aic = 7453.31
autoplot(acf(tseries::na.remove(mod.flupos1$resid), plot = FALSE), conf.int.fill = '#0000FF', conf.int.value = 0.8, conf.int.type = 'ma')

Exercise: Read R output and write down the model in mathematics
2.5 Prediction
par(mfrow=c(2,1), cex = 0.75)
plot(fluTest.On, type = "b", pch = 21); grid()
plot(gft.On, type = "b", pch = 15); grid()

par(mfrow = c(1,1))
plot(fluTest.On, type = "b", pch = 21)
lines(timeSeries(c(rep(NA,7),predict(mod.test)),charvec = timeSequence(from = "2003-09-07", to = "2015-08-23", by = "week")),col = 2, pch = 15, type = "b", main = "# of flu tests and Google flu trends"); grid()

Exercise: Calculate MSE and MAE of the above prediction.
3 Assignment
1. Analyze the relationship between GTF and FluTest/FluPos as we did in Section 2 using Canada data
2. Also answer the exercise in Section 2
Copyright (c) 2020 JEN-WEN LIN
1. We can express Equation (1) using the compact notation \[y_t = \alpha+v(B)x_t+n_t,\] where \(v(B)=v_1B+v_2B^2+\dots = \sum^\infty_{i=0}v_iB^i\).↩
2. Distributed lag models require impulse response functions to be of the same sign.↩
3. https://en.wikipedia.org/wiki/Google_Flu_Trends↩