Applied Time Series Analysis Section 2: Exploratory Data Analysis
Linear Regression – Setup in Time Series Context
Let {Xt} be our target time series and {Zt,1}, …, {Zt,q} be our covariates. We regress Xt onto Zt,1, …, Zt,q; that is, we have the regression equation
Xt = Σ_{j=1}^{q} βj Zt,j + εt,
where {εt} is some (hopefully well-behaved) error process.
Examples for covariates:
• Indicators for distinct time periods, e.g. for daily data starting on Monday we can
use Zt,1 = 1{(t + 4)/7 ∈ Z} to regress on Wednesdays
• A time sequence, e.g., Zt,2 = t or Zt,3 = t2 (often with intercept Zt,4 = 1);
• Nonlinear functions of t, e.g., Zt,5 = sin(2πωt + φ) for some frequency ω;
• Lagged values of Xt, e.g. Zt,6 = Xt−1;
• Other time series and/or lagged values of other time series.
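For concreteness, here is a minimal numpy sketch (not from the slides) of how such covariates could be assembled into a design matrix; the series x, the length n, the Monday start, and the sinusoid's frequency and phase are illustrative assumptions.

```python
import numpy as np

# Hypothetical daily series of length n, assumed to start on a Monday (t = 1).
n = 200
rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(size=n))             # placeholder target series X_t

t = np.arange(1, n + 1)                       # time index t = 1, ..., n
intercept = np.ones(n)                        # Z_{t,4} = 1
linear = t.astype(float)                      # Z_{t,2} = t
quadratic = t.astype(float) ** 2              # Z_{t,3} = t^2
wednesday = ((t + 4) % 7 == 0).astype(float)  # indicator 1{(t+4)/7 in Z}
seasonal = np.sin(2 * np.pi * t / 7 + 0.5)    # nonlinear function of t (frequency/phase illustrative)
lag1 = np.concatenate(([np.nan], x[:-1]))     # Z_{t,6} = X_{t-1}

# Stack the covariates into the n x q design matrix Z
# (rows with NaN from the lag would be dropped before fitting).
Z = np.column_stack([intercept, linear, quadratic, seasonal, wednesday, lag1])
```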
Linear Regression – Setup in Time Series Context
Let us say we have n observations and let Zt = (Zt,1,…,Zt,q)⊤, Z = (Z1,…,Zn)⊤, X = (X1, . . . , Xn)⊤, and β = (β1, . . . , βq)⊤. Note that Z is an n × q matrix.
We can estimate β by using Ordinary Least Squares (OLS). That is, we minimize the residual sum of squares (RSS)
RSS(β) = Σ_{t=1}^{n} (Xt − Σ_{j=1}^{q} βj Zt,j)² = Σ_{t=1}^{n} (Xt − β⊤Zt)² = ∥X − Zβ∥².
The minimizer is
βˆ = argmin_β RSS(β) = (Z⊤Z)−1 Z⊤X,
where the last equality holds under the condition that Z⊤Z is of full rank q.
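A minimal numpy sketch of this computation; np.linalg.lstsq is used instead of forming (Z⊤Z)−1 explicitly, which is numerically safer but yields the same estimate when Z⊤Z has full rank.

```python
import numpy as np

def ols(Z, X):
    """OLS estimate beta_hat = (Z'Z)^{-1} Z'X, computed via least squares for stability."""
    beta_hat, *_ = np.linalg.lstsq(Z, X, rcond=None)
    residuals = X - Z @ beta_hat
    rss = float(residuals @ residuals)   # residual sum of squares RSS(beta_hat)
    return beta_hat, residuals, rss
```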
Linear Regression – Setup in Time Series Context
What do the condition 'Z⊤Z is of full rank q' and the construction of (Z⊤Z)−1 imply?
• q < n; we cannot include too many regressors. If the sample size is not really large, including some polynomial trend and also some seasonal indicators could be too much. We need to be selective there.
• Inverting the matrix Z⊤Z is easiest and numerically most stable if the matrix is close to being diagonal. Keep that in mind when constructing regressors.
Example - Regressing onto Quadratic Time
Suppose we want to regress on quadratic time, t = 1, …, n (here n = 100). One option is
Zt,1 = 1, Zt,2 = t, Zt,3 = t²,
which leads to
Z⊤Z/100 =
[ 1.00        50.50        3383.50      ]
[ 50.50       3383.50      255025.00    ]
[ 3383.50     255025.00    20503333.30  ]
We can also use centered and orthogonalized regressors,
Zt,1 = 1, Zt,2 = t − t̄, Zt,3 = (t − t̄)² − (1/n) Σ_{s=1}^{n} (s − t̄)²,
which leads to
Z⊤Z/100 =
[ 1.00    0.00      0.00       ]
[ 0.00    833.25    0.00       ]
[ 0.00    0.00      555277.80  ]
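The effect on conditioning can be checked numerically; the following sketch (assuming n = 100 as above) reproduces the two Z⊤Z/n matrices and compares their condition numbers.

```python
import numpy as np

n = 100
t = np.arange(1, n + 1, dtype=float)

# Raw polynomial regressors: 1, t, t^2
Z_raw = np.column_stack([np.ones(n), t, t ** 2])

# Centered/orthogonalized regressors: 1, t - tbar, (t - tbar)^2 - mean((t - tbar)^2)
c = t - t.mean()
Z_cent = np.column_stack([np.ones(n), c, c ** 2 - np.mean(c ** 2)])

print(np.linalg.cond(Z_raw.T @ Z_raw))    # very large condition number
print(np.linalg.cond(Z_cent.T @ Z_cent))  # much smaller: Z'Z is diagonal here
print((Z_cent.T @ Z_cent) / n)            # diag(1.00, 833.25, 555277.80)
```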
Linear Regression - Properties
If we assume Z⊤Z is of full rank q and
• E(εt) ≡ 0 and cov(εt,Zt) = 0, then βˆ is unbiased, i.e., E(βˆ) = β.
• εt iid N(0, σε2) and Z deterministic, then
• βˆ is also the maximum likelihood estimator of β;
• βˆ ∼ N(β, σε2 (Z⊤Z)−1);
• εˆ := X − Zβˆ ∼ N(0, σε2 (In − Z(Z⊤Z)−1Z⊤));
• βˆ and εˆ are independent.
We will discuss the case of autocorrelated errors later. In most cases, given E(εt) ≡ 0 and cov(εt, Zt) = 0, OLS is still consistent but no longer optimal. The asymptotic variance is also affected by the autocorrelated errors.
Example - Global Temperature
We model the global temperature as
Yt = β0 + β1 t + Xt,
where {Xt} is a stationary time series, and we test β1 > 0 in this model. We can do this by regressing Yt onto Zt,1 = 1 and Zt,2 = t.
The variance for βˆ on the previous slide was under the assumption of independent and normally distributed errors.
The independence assumption might be too strong here. We may rather assume that {Xt} is a stationary time series: {Xt} might be normally distributed but not independent, i.e., ρX(h) ≠ 0 for some |h| > 0.
One way of taking the dependence in the variance estimation of βˆ into account is using heteroscedasticity and autocorrelation consistent (HAC) covariance matrix estimators, also known as Newey–West estimator. We discuss this later in more detail.
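A sketch of how HAC (Newey–West) standard errors could be obtained with statsmodels; the temperature-like series generated below is only a placeholder, and the choice maxlags = 7 is illustrative.

```python
import numpy as np
import statsmodels.api as sm

# Placeholder data standing in for the observed temperature series y_t = b0 + b1*t + X_t.
n = 140
t = np.arange(1, n + 1, dtype=float)
rng = np.random.default_rng(1)
noise = np.convolve(rng.normal(size=n + 20), np.ones(5) / 5, mode="valid")[:n]  # autocorrelated X_t
y = 0.005 * t + noise

Z = sm.add_constant(t)                 # Z_{t,1} = 1, Z_{t,2} = t
fit_iid = sm.OLS(y, Z).fit()           # classical (iid) standard errors
fit_hac = sm.OLS(y, Z).fit(cov_type="HAC", cov_kwds={"maxlags": 7})  # Newey-West

print(fit_iid.bse)   # standard errors assuming iid errors
print(fit_hac.bse)   # HAC standard errors robust to autocorrelation
```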
Example – Global Temperature
Lower Confidence Intervals (CI) for the Linear Trend
[Figure: global temperature vs. time (year), 1880–2020, showing the estimated linear trend, the 5% lower CI assuming iid data, and the 5% lower CI assuming a stationary time series.]
Example – Global Temperature
ACF of detrended (linear) Global Temperature
[Figure: sample ACF of the linearly detrended global temperature series, lags 0–20.]
Example – Global Temperature
Lower Confidence Intervals (CI) for the Quadratic Trend
[Figure: global temperature vs. time (year), 1880–2020, showing the estimated quadratic trend, the 5% lower CI assuming iid data, and the 5% lower CI assuming a stationary time series.]
Example – Global Temperature
ACF of detrended (quadratic) Global Temperature
[Figure: sample ACF of the quadratically detrended global temperature series, lags 0–20.]
Linear Regression – Model Selection
We often face the question of which time series to include as covariates; e.g., how many lags of a time series do we need to include, should we regress on each weekday, etc.
A solution to this is to fit many models and compare them; which gives the best fit, i.e., the lowest RSS?
Problem: Including more covariates always decreases the RSS. We need to take the number of covariates into account.
Linear Regression – Model Selection
We focus here on information criteria (IC):1
Let l(βˆML) be the log-likelihood for β and let p denote the number of parameters. In general, an information criterion has the form
IC = −l(βˆML) + fn(p),
where fn is some function which determines the impact of the number of parameters.
The idea is to find a function fn which balances the error of the fit against the number of parameters in the model such that it leads to consistent model choices.
The best model is defined by the lowest IC. That also means the value itself is not important. In implementations, constants and scaling can differ.
1For nested models, Analysis of Variance (ANOVA) with F-tests is another approach.
Definition – Akaike’s Information Criteria (AIC)
Let l(βˆML) be the log-likelihood for β and let p denote the number of parameters.
Definition 2.1
The Akaike's Information Criteria (AIC) is defined as AIC = −l(βˆML) + 2p.
In the linear regression context with normal errors, this simplifies to
AIC = log(RSS/n) + (n + 2p)/n.
Definition – Bayesian Information Criteria (BIC)
Let l(βˆML) be the log-likelihood for β and let p denote the number of parameters.
Definition 2.2
The Bayesian Information Criteria (BIC) is defined as BIC = −l(βˆML) + log(n) p.
In the linear regression context with normal errors, this simplifies to
BIC = log(RSS/n) + p log(n)/n.
The BIC is also known as Schwarz’s Information Criteria (SIC).
For larger n, the BIC penalizes the number of parameters more than the AIC. In general, this leads to smaller models.
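A small helper, assuming the normal-error regression forms above, that computes both criteria from the RSS; as noted, constants and scaling differ between implementations, so only comparisons between models fitted to the same data are meaningful.

```python
import numpy as np

def reg_aic_bic(X, Z):
    """AIC/BIC in the normal-error regression form used above.
    Constants may differ from other implementations; only model comparisons matter."""
    n, p = Z.shape
    beta_hat, *_ = np.linalg.lstsq(Z, X, rcond=None)
    rss = float(np.sum((X - Z @ beta_hat) ** 2))
    aic = np.log(rss / n) + (n + 2 * p) / n
    bic = np.log(rss / n) + p * np.log(n) / n
    return aic, bic
```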
Example – Pollution, Temperature and Mortality
Weekly data for Los Angeles County
[Figure: weekly time series of cardiovascular mortality (70–130), temperature (50–100 °F), and particulates (20–100), 1970–1980, together with pairwise scatterplots of the three series.]
Example – Pollution, Temperature and Mortality
We are interested in the effect of pollution (particulates Pt) on cardiovascular mortality (Mt).
We see an overall downward trend.
The scatterplots show some correlation between Particulates and Mortality.
We also see a nonlinear connection between temperature (Tt) and mortality. A parabolic relationship centered around the mean temperature (T̄) can be a good approximation.
This leads to the following model:
Mt = β0 + β1 t + β2 (Tt − T̄) + β3 (Tt − T̄)² + β4 Pt + εt.
We compare the models with β3 = 0, β4 = 0; with β4 = 0; and with β3 = 0 against the full model.
Example – Pollution, Temperature and Mortality
model             AIC    BIC
β3 = 0, β4 = 0    5.14   5.17
β4 = 0            5.03   5.07
β3 = 0            4.84   4.88
full model        4.72   4.77
The full model is the best model according to both AIC and BIC.
β4 is highly significant (p-value < 0.001; under the normality assumption).
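A sketch of how such a comparison could be run with statsmodels; the series below are synthetic placeholders standing in for the weekly mortality (M), temperature (T), and particulates (P) data, so the printed values will not match the table, and statsmodels reports AIC/BIC on a different scale — only the ranking of models is comparable.

```python
import numpy as np
import statsmodels.api as sm

# Placeholder data standing in for the weekly LA County series; replace with the real data.
rng = np.random.default_rng(2)
nobs = 508
T = 70 + 10 * np.sin(2 * np.pi * np.arange(nobs) / 52) + rng.normal(0, 3, nobs)
P = 45 + 15 * np.sin(2 * np.pi * np.arange(nobs) / 52 + 1) + rng.normal(0, 8, nobs)
M = 90 - 0.02 * np.arange(nobs) + 0.02 * (T - T.mean()) ** 2 + 0.25 * P + rng.normal(0, 4, nobs)

t = np.arange(1, nobs + 1, dtype=float)
Tc = T - T.mean()                                       # T_t - Tbar
designs = {
    "b3 = 0, b4 = 0": np.column_stack([t, Tc]),
    "b4 = 0":         np.column_stack([t, Tc, Tc ** 2]),
    "b3 = 0":         np.column_stack([t, Tc, P]),
    "full model":     np.column_stack([t, Tc, Tc ** 2, P]),
}
for name, Z in designs.items():
    res = sm.OLS(M, sm.add_constant(Z)).fit()
    # statsmodels uses the -2*loglik scale, so values differ from the table; ranking is what matters.
    print(f"{name:15s}  AIC={res.aic:8.1f}  BIC={res.bic:8.1f}")
```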
Definition - Moving Average Smoother
Let {Xt} be a time series.
Definition 2.3
We define a linear filter by aj ≥ 0 and Σ_{j=−k}^{k} aj = 1. We call the series {Mt} obtained by
Mt = Σ_{j=−k}^{k} aj Xt−j
a moving average smoother.
Usually, the coefficients are symmetric, i.e., aj = a−j .
Example: Suppose we have daily data. We might be interested in a weekly (7 days), monthly (31 days), or annual (365 days) average.
For this, let k ∈ N and set aj = 1/(2k + 1), j = −k, …, k. Then, k = 3 gives us a weekly, k = 15 a monthly, and k = 182 an annual average.
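A minimal numpy sketch of such a symmetric moving average smoother (boundary values are left as NaN where the window is incomplete; an assumption, since the slides do not specify boundary handling).

```python
import numpy as np

def ma_smoother(x, k):
    """Symmetric moving average with weights a_j = 1/(2k+1), j = -k, ..., k.
    Returns NaN at the k boundary points on each side where the window is incomplete."""
    x = np.asarray(x, dtype=float)
    weights = np.full(2 * k + 1, 1.0 / (2 * k + 1))
    m = np.full_like(x, np.nan)
    m[k:len(x) - k] = np.convolve(x, weights, mode="valid")
    return m

# Example usage: weekly smoother (k = 3) of a daily series.
# daily = ...; weekly_avg = ma_smoother(daily, k=3)
```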
Example - NASDAQ 100 - 200-day Moving Average (MA)
[Figure: Nasdaq 100 returns, 2007-01-03 to 2023-01-11, with a 200-day moving average.]
Definition - Kernel Smoothing
A special case of the moving average smoother is kernel smoothing. In general, a kernel K : R → R is a function which is normalized to 1, i.e., ∫_{−∞}^{∞} K(x) dx = 1. Often, it is also required that the kernel is non-negative, symmetric, and has bounded support on [−1, 1] (i.e., the kernel is zero on (−∞, −1) and (1, ∞)).
Definition 2.4
A kernel smoother {Kt} with bandwidth b > 0 is defined by
Kt = Σ_{i=1}^{n} wi(t, b) Xi,
where wi(t, b) = K((t − i)/b) / Σ_{j=1}^{n} K((t − j)/b).
Remark - Kernel Smoothing
The amount of smoothing is controlled by the bandwidth: the larger b, the smoother the result. Usually, the choice of the bandwidth has more impact than the choice of the kernel.
To see that it is a special case of the moving average smoother, note that we can also write it as
Kt = Σ_{i=1}^{n} K((t − i)/b) Xi / Σ_{j=1}^{n} K((t − j)/b).
Examples for kernels: the uniform kernel and the Gaussian kernel (both used in the regression examples below).
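A minimal sketch of a kernel smoother over the time index, here with a Gaussian kernel; the kernel choice is an assumption for illustration, and any other normalized kernel could be substituted.

```python
import numpy as np

def kernel_smoother(x, b):
    """Gaussian-kernel smoother over the time index: K_t = sum_i w_i(t, b) X_i."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    t = np.arange(1, n + 1)
    # Matrix of kernel values K((t - i) / b), normalized over i for each t.
    u = (t[:, None] - t[None, :]) / b
    w = np.exp(-0.5 * u ** 2)
    w /= w.sum(axis=1, keepdims=True)
    return w @ x

# Larger b => smoother K_t; the bandwidth matters more than the kernel shape.
```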
Example - Trend or not?
[Figure: simulated time series Xt, t = 1, …, 500.]
Example - Trend or not?
[Figure: sample ACFs (on the covariance scale) of three series Xt, t = 1, …, 500, with mean 0.49 (sd 0.61), mean 0.99 (sd 0.72), and mean −0.01 (sd 0.84); lags 0–20.]
Example - Trend or not? Yt = Xt − Kt, Kt kernel smoothing
[Figure: sample ACFs of the smoothed-out series Yt = Xt − Kt, t = 1, …, 500; all three detrended series have mean −0.00 (sd 0.19); lags 0–20.]
Remark - Kernel Regression
The kernel smoothing can also be written as
K(t) = Kt = Σ_{i=1}^{n} K((t − i)/b) Xi / Σ_{j=1}^{n} K((t − j)/b).
We can generalize the kernel smoothing to kernel regression. In the kernel smoothing, we regress on time, 1, …, n. For point t, i.e., K(t) = Kt, we average all Xi for which the corresponding time points i are close (measured by the kernel) to t.
In general, we can regress onto another time series, say Zt. Then for point z, i.e., K(z), we average all Xi for which the corresponding Zi are close (according to the kernel) to z:
K(z) = Σ_{i=1}^{n} K((z − Zi)/b) Xi / Σ_{j=1}^{n} K((z − Zj)/b).
This is also known as Nadaraya–Watson kernel regression.
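A minimal sketch of the Nadaraya–Watson estimator with a Gaussian kernel; the variable names and the bandwidth in the usage comment are illustrative.

```python
import numpy as np

def nadaraya_watson(z_grid, Z, X, b):
    """Nadaraya-Watson estimate K(z) = sum_i K((z - Z_i)/b) X_i / sum_j K((z - Z_j)/b),
    here with a Gaussian kernel, evaluated at the points in z_grid."""
    Z = np.asarray(Z, dtype=float)
    X = np.asarray(X, dtype=float)
    u = (np.asarray(z_grid, dtype=float)[:, None] - Z[None, :]) / b
    w = np.exp(-0.5 * u ** 2)
    return (w @ X) / w.sum(axis=1)

# Usage sketch (illustrative names): mortality regressed on temperature.
# grid = np.linspace(temperature.min(), temperature.max(), 100)
# fit = nadaraya_watson(grid, temperature, mortality, b=5.0)
```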
Example - Temperature and Mortality
Weekly data for Los Angeles County
[Figure: weekly cardiovascular mortality and temperature series, 1970–1980.]
Example - Temperature and Mortality - Kernel (exponential/Gaussian) Regression
[Figure: cardiovascular mortality (70–130) vs. temperature (50–100 °F / 10–38 °C) with a Gaussian kernel regression fit.]
Example - Temperature and Mortality - Kernel (uniform) Regression
[Figure: cardiovascular mortality (70–130) vs. temperature (50–100 °F / 10–38 °C) with a uniform kernel regression fit.]
Example - Temperature and Particulates - Kernel Regression
Weekly data for Los Angeles County
[Figure: weekly particulates and temperature series, 1970–1980.]
Example - Temperature and Particulates - Kernel Regression
[Figure: particulates (20–100) vs. temperature (50–100 °F / 10–38 °C) with a kernel regression fit.]
Example - Temperature and Particulates - Kernel Regression Lags
[Figure: particulates (20–100) vs. lagged temperature, tempr(t) through tempr(t−8), each panel (temperature 50–100 °F) with a kernel regression fit.]
Removing a Deterministic Trend by Regression
Suppose our time series is of the form
Xt = μt + Yt,
where we observe Xt, Yt is a zero mean stationary time series, and μt is some deterministic trend.
If μt = δ1 + tδ2 is a linear trend, we can remove μt by regressing Xt onto t, see previous Section.
More generally, if μt = Σ_{j=0}^{q} δj t^j is a polynomial trend, we can estimate δj, j = 0, …, q, by regressing onto t^j, j = 0, …, q.
We obtain Ŷt = Xt − Σ_{j=0}^{q} δˆj t^j, which requires estimation of δj, j = 0, …, q.
Removing a Deterministic Trend by Taking Differences
Suppose our time series is of the form
Xt = μt + Yt,
where we observe Xt, Yt is a zero mean stationary time series, and μt = δ0 + tδ1 is a linear trend.
Note that
μt − μt−1 = δ0 + tδ1 − (δ0 + (t − 1)δ1) = δ1.
Hence,
Xt − Xt−1 = δ1 + Yt − Yt−1.
Let Zt = Yt − Yt−1. {Zt} is obtained by filtering a stationary time series, hence it is stationary.
=⇒ Xt − Xt−1 removes the linear trend and we obtain a stationary time series.
Definition - Backshift Operator
Let {Xt} be a time series.
Definition 2.5
We define the backshift operator B by
BXt = Xt−1.
It extends to powers, e.g., B2Xt = B(BXt) = BXt−1 = Xt−2. Thus BkXt = Xt−k. Additionally, we set
B−1Xt = Xt+1,
which leads to B−kXt = Xt+k.
Definition - Difference Operator
Let {Xt} be a time series.
Definition 2.6
We define the difference operator ∇ by
∇Xt = (1 − B)Xt = Xt − Xt−1.
We denote ∇Xt as the first difference. The difference of order d is given by ∇dXt = (1 − B)dXt.
Removing a Deterministic Trend by Taking Differences
Suppose our time series is of the form
Xt = μt + Yt,
where we observe Xt, Yt is a zero mean stationary time series, and μt = Σ_{j=0}^{q} δj t^j is a polynomial trend.
We can remove the polynomial trend and obtain a stationary time series by taking q differences
Zt := ∇qXt = q! δq + ∇qYt.
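A small numerical illustration (synthetic data, q = 2 assumed): np.diff with n=2 removes the quadratic trend, leaving a constant 2! δ2 plus the twice-differenced stationary part.

```python
import numpy as np

# Synthetic illustration: quadratic trend mu_t = 1 + 0.05 t + 0.002 t^2 plus noise.
n = 200
t = np.arange(1, n + 1, dtype=float)
rng = np.random.default_rng(3)
y = rng.normal(0, 1, n)                       # stationary Y_t (white noise here)
x = 1.0 + 0.05 * t + 0.002 * t ** 2 + y       # X_t = mu_t + Y_t

d1 = np.diff(x, n=1)     # first difference: removes the linear part, length n - 1
d2 = np.diff(x, n=2)     # second difference: removes the quadratic trend, length n - 2

# The deterministic part of d2 is the constant 2! * 0.002 = 0.004; what remains is
# the (stationary) twice-differenced noise.
print(np.polyfit(t, x, 1)[0])            # clear trend in x
print(np.polyfit(t[2:], d2, 1)[0])       # slope of d2 is essentially zero
```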
Detrending - Taking Differences vs. Regression
Xt = μt + Yt,

                Differences                                     Regression
Parameters      no parameters                                   up to q
Output          ∇qYt is only available for t = q+1, …, n,       Estimate of Yt
                i.e., we lose q observations
If we are interested in the trend parameters themselves and the underlying time series is stationary, then for larger sample sizes regression is the better approach.
However, for nonstationary time series like a random walk, the regression approach might not work at all.
Recall that if Yt is a random walk, ∇Yt is a stationary series.
⇒ Taking differences is the more robust approach. If in doubt about a random walk, it is better to detrend by taking differences.
Example - Linear Trend vs. Stochastic Trend
[Figure: a series with a linear trend and a series with a stochastic trend, t = 1, …, 100.]
Example - Linear Trend vs. Stochastic Trend
[Figure: differenced series diff Y_t, t = 1, …, 100, and its sample ACF.]
Example - Linear Trend vs. Stochastic Trend
[Figure: differenced series diff Z_t, t = 1, …, 100, and its sample ACF.]
Example - Linear Trend vs. Stochastic Trend
[Figure: Y_t detrended by regression, t = 1, …, 100, and its sample ACF.]
Example - Linear Trend vs. Stochastic Trend
[Figure: Z_t detrended by regression, t = 1, …, 100, and its sample ACF.]
Example - Global Temperature - Taking Differences vs. Regression
Detrending by difference
[Figure: differenced global temperature series vs. time (year), 1880–2020, and the sample ACF of the differenced series, lags 0–20.]
Example - Global Temperature - Taking Differences vs. Regression
Detrending by regression
[Figure: regression-detrended global temperature series vs. time (year), 1880–2020, and the sample ACF of the detrended series, lags 0–20.]
Example - Weekly Oil Price
[Figure: Weekly Cushing, OK WTI Spot Price FOB (Dollars per Barrel), 1986-01-03 to 2023-01-06.]
Returns and Log Returns
Let {Pt} be some price time series. The return series {rt} is defined as
rt = Pt/Pt−1 − 1 = (Pt − Pt−1)/Pt−1.
The log return series {lrt} is defined as
lrt = log(Pt/Pt−1) = log(Pt) − log(Pt−1).
                  log-returns                           returns
Symmetry          symmetric around zero                 non-symmetric
Possible values   (−∞, ∞)                               [−1, ∞)
Aggregation       Σ_{t=1}^{T} lrt = log(PT/P0)          Π_{t=1}^{T} (rt + 1) = PT/P0

=⇒ Linear statistical methods are more reasonable for log-returns.
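A tiny numerical check of these aggregation properties, using a placeholder price series.

```python
import numpy as np

# Toy price series (placeholder for the WTI prices).
P = np.array([50.0, 52.0, 51.0, 55.0, 54.0])

r = P[1:] / P[:-1] - 1            # simple returns
lr = np.log(P[1:] / P[:-1])       # log returns

# Aggregation over the whole period:
print(np.sum(lr), np.log(P[-1] / P[0]))   # equal: sum of log returns
print(np.prod(1 + r), P[-1] / P[0])       # equal: product of (1 + returns)
```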
Returns and Log Returns
[Figure: x vs. log(x + 1) for x in [−0.2, 0.2]; the two are close for small x.]
Example - Weekly Oil Price
[Figure: Weekly Cushing, OK WTI Spot Price FOB returns and log returns, 1986-01-03 to 2023-01-06.]
Definition - Power Transformation
Let {Xt} be a time series.
Definition 2.7
We call the nonlinear function g : R → R,
g(Xt) = (Xt^λ − 1)/λ for λ ≠ 0, and g(Xt) = log(Xt) for λ = 0,
a power transformation, and the series {Yt} given by Yt = g(Xt) the power transformation of the series {Xt}.
Such nonlinear transformations can help to improve the approximation to normality or to equalize variability over the length of a single series.
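A minimal sketch of the transformation, assuming the usual Box–Cox form and a positive-valued series; scipy's boxcox can additionally choose λ by maximum likelihood.

```python
import numpy as np
from scipy import stats

def power_transform(x, lam):
    """Power transformation of Definition 2.7: (x^lam - 1)/lam for lam != 0, log(x) for lam = 0.
    Requires a positive-valued series."""
    x = np.asarray(x, dtype=float)
    return np.log(x) if lam == 0 else (x ** lam - 1.0) / lam

# scipy can also pick lambda by maximum likelihood (assumes positive data):
# y, lam_hat = stats.boxcox(x)
```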