Financial Data And Statistics
Tomaso Aste
t.aste@ucl.ac.uk http://www.cs.ucl.ac.uk/staff/tomaso_aste/
2019
Part III Dependency & Causality
Spurious Correlations

Because quantifying dependency/independency and causality is not straightforward, it is easy to be fooled by spurious correlations, as in these examples from http://www.tylervigen.com/spurious-correlations:

[Figure: "Total revenue generated by arcades" correlates with "Computer science doctorates awarded in the US", 2000–2009. Correlation: 98.51% (r = 0.985065).]

[Figure: "Age of Miss America" correlates with "Murders by steam, hot vapours and hot objects", 1999–2009. Correlation: 87.01% (r = 0.870127). Data sources: Wikipedia and Centers for Disease Control & Prevention.]

Why is the age of Miss America related to freaky deaths?
Multivariate probabilities
Before addressing the problem of quantifying dependency, we must first introduce the basic concepts of probability associated with more than one variable.

Definition: The probability of observing both X=x and Y=y in conjunction is the joint probability f(X=x, Y=y).

Definition: The probability of observing X=x independently of the value of the variable Y is the marginal probability fX(X=x); similarly, the marginal probability of observing Y=y independently of the value of the variable X is fY(Y=y).
The marginal probabilities are related to the joint probabilities by:
$$ f_X(X=x) = \int_{-\infty}^{\infty} f(X=x, Y=y)\, dy $$

$$ f_Y(Y=y) = \int_{-\infty}^{\infty} f(X=x, Y=y)\, dx $$
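As a quick numerical illustration, the sketch below marginalizes a discretized bivariate Gaussian joint density over one variable; the grid, the correlation value and all names are illustrative assumptions, not part of the slides.

```python
# Minimal sketch of marginalization on a grid (assumed example density).
import numpy as np

x = np.linspace(-5, 5, 401)
y = np.linspace(-5, 5, 401)
X, Y = np.meshgrid(x, y, indexing="ij")

rho = 0.6  # assumed correlation of the example joint density
f_xy = np.exp(-(X**2 - 2*rho*X*Y + Y**2) / (2*(1 - rho**2))) \
       / (2*np.pi*np.sqrt(1 - rho**2))

# Marginalize: f_X(x) = integral of f(x, y) over y (and vice versa)
f_x = np.trapz(f_xy, y, axis=1)
f_y = np.trapz(f_xy, x, axis=0)

print(np.trapz(f_x, x))  # ~1.0: the marginal is a proper density
```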
Conditional probabilities
Definition: The probability of observing X=x given Y=y is the conditional probability f(X=x | Y=y); similarly, the conditional probability of observing Y=y given X=x is f(Y=y | X=x).

Theorem: The conditional probability, the joint probability and the marginal probability are related by Bayes' theorem:
$$ f(X=x \mid Y=y) = \frac{f(X=x, Y=y)}{f_Y(Y=y)} \qquad\text{and}\qquad f(Y=y \mid X=x) = \frac{f(X=x, Y=y)}{f_X(X=x)} $$
Dependency & Independency
Independent variables
The mutual dependence between two stochastic variables X and Y should be measurable from the 'difference' between the probability of observing them simultaneously and the probability of observing them separately.
We can then introduce the most important result concerning dependency between variables:
Two variables X and Y are independent if and only if
p(X = x,Y = y) = pX (X = x)pY (Y = y)
We shall see that all other results and approaches are in some way related to the above theorem.
Note that, from Bayes' theorem, for independent variables the conditional probability is

$$ f(X=x \mid Y=y) = \frac{f(X=x, Y=y)}{f_Y(Y=y)} = f_X(X=x), $$

which expresses the intuitive fact that two events are independent if and only if the occurrence of one event makes it neither more nor less probable that the other occurs.
If we look at the cumulative distribution we have an analogous result stating that two random variables X, Y are independent if and only if:
$$ P(X \le x, Y \le y) = P_X(X \le x)\, P_Y(Y \le y) $$
We have two cases:

1) Positive quadrant dependency ("when X is large, Y is also likely to be large"):
$$ P(X \le x, Y \le y) > P_X(X \le x)\, P_Y(Y \le y) $$

2) Negative quadrant dependency ("when X is large, Y is likely to be small"):
$$ P(X \le x, Y \le y) < P_X(X \le x)\, P_Y(Y \le y) $$

Lehmann, E.L., 1966. Some concepts of dependence. Ann. Math. Statist. 37, 1137–1153.
Covariance, Correlation and Linear Dependency
Covariance
It is quite clear that to establish the existence of some dependence between two variables it is necessary to measure the difference between the joint probability and the product of the marginal probabilities. Let us, for instance, measure the integral of this difference over the entire support:
$$ \mathrm{Cov}(X,Y) = \iint \big[ P(X \le x, Y \le y) - P_X(X \le x)\, P_Y(Y \le y) \big]\, dx\, dy $$

This quantity is the covariance (Hoeffding identity), which is

$$ \mathrm{Cov}(X,Y) = E\big[ (X - E[X])\,(Y - E[Y]) \big] $$
From this relation it is clear that Cov(X,Y) = 0 when the two variables are independent, but not necessarily vice versa. We also see that the covariance is an average measure of quadrant dependency and therefore, in some non-linear cases, it may fail to detect dependency.
Correlations
Definition:
The Pearson product-moment correlation coefficient ρi,j between two random variables Xi and Xj with expected values E[Xi] and E[Xj] and standard deviations σ(Xi) and σ(Xj) is defined as
$$ \rho_{i,j} = \mathrm{Corr}(X_i, X_j) = \frac{\mathrm{Cov}(X_i, X_j)}{\sigma(X_i)\, \sigma(X_j)} $$
with the variances

$$ \sigma^2(X_i) = E\big[ (X_i - E[X_i])^2 \big], \qquad \sigma^2(X_j) = E\big[ (X_j - E[X_j])^2 \big]. $$

The correlation coefficient is the covariance renormalized by the standard deviations.
We can see from the definition that the correlation coefficient is symmetric,

$$ \rho_{i,j} = \rho_{j,i}, $$

and from the Cauchy-Schwarz inequality ($E[XY]^2 \le E[X^2]\, E[Y^2]$) we have

$$ -1 \le \rho_{i,j} \le 1. $$

It is also straightforward to prove that, for $b, b' > 0$ and any $a, a' \in \mathbb{R}$,

$$ \mathrm{Corr}(a + b X_i,\, X_j) = \mathrm{Corr}(X_i, X_j) = \mathrm{Corr}(X_i,\, a' + b' X_j). $$
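A minimal sketch of this invariance on synthetic data (variable names and the affine constants a = 3, b = 2 are illustrative assumptions):

```python
# Pearson correlation is invariant under positive affine maps:
# Corr(a + b*Xi, Xj) = Corr(Xi, Xj) for b > 0.
import numpy as np

rng = np.random.default_rng(0)
xi = rng.standard_normal(100_000)
xj = 0.5 * xi + rng.standard_normal(100_000)  # correlated partner

r = np.corrcoef(xi, xj)[0, 1]
r_affine = np.corrcoef(3.0 + 2.0 * xi, xj)[0, 1]  # a = 3, b = 2 > 0
print(round(r, 4), round(r_affine, 4))  # identical up to floating point
```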
Correlation and Linear Dependence
When two variables are linearly related, which means

$$ X_j = a + b X_i, $$

there is a strict linear dependence between the two variables and the correlation coefficient is equal to 1 or −1. Indeed,

$$ \mathrm{Corr}(X_i,\, a + b X_i) = \frac{E\big[ (X_i - E[X_i])\,(b X_i - b\, E[X_i]) \big]}{\sqrt{E\big[ (X_i - E[X_i])^2 \big]\; E\big[ (b X_i - b\, E[X_i])^2 \big]}} = \frac{b}{|b|} = \begin{cases} +1 & \text{if } b > 0 \\ -1 & \text{if } b < 0 \end{cases} $$
If the linear dependence between the two variables is only approximate,

$$ X_j = a + b X_i + \varepsilon_{i,j}, $$

with $\varepsilon_{i,j}$ a noise term (also called residual) with $\mathrm{Cov}(X_i, \varepsilon_{i,j}) = 0$, then

$$ \mathrm{Corr}(X_i,\, X_j) = \frac{b}{\sqrt{b^2 + \dfrac{\mathrm{Var}(\varepsilon_{i,j})}{\mathrm{Var}(X_i)}}}, $$

which is a number between −1 and 1, with values closer to zero when the variance of the residuals is large with respect to the variance of the variable.
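A minimal simulation checking this formula (the parameter values a = 0.1, b = 0.8 and the noise scale are illustrative assumptions):

```python
# Verify Corr(Xi, Xj) = b / sqrt(b^2 + Var(eps)/Var(Xi))
# for Xj = a + b*Xi + eps.
import numpy as np

rng = np.random.default_rng(1)
a, b = 0.1, 0.8
xi = rng.standard_normal(200_000)          # Var(Xi) = 1
eps = 0.5 * rng.standard_normal(200_000)   # Var(eps) = 0.25
xj = a + b * xi + eps

empirical = np.corrcoef(xi, xj)[0, 1]
theoretical = b / np.sqrt(b**2 + 0.25 / 1.0)
print(round(empirical, 3), round(theoretical, 3))  # both ~0.848
```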
The correlation coefficient is

ρi,j = 0 if the two events i and j are independent and therefore uncorrelated;

ρi,j = 1 if the two events i and j are perfectly linearly dependent and correlated (e.g. Xj = b Xi + a, with b > 0);

ρi,j = −1 if the two events i and j are perfectly linearly dependent and anti-correlated (e.g. Xj = b Xi + a, with b < 0).

Note that, in general, two stochastic variables with zero correlation coefficient can still be (non-linearly) dependent.
Linear regression
Linear correlation is equivalent to investigating how well one variable can be described in terms of the other by means of a linear relation:

$$ X_j \approx a + b X_i $$

where Xj is referred to as the "dependent variable" and Xi as the "independent" (or explanatory) variable.

To measure how good the linear 'regression' of variable Xj with respect to variable Xi is, we can look at the difference

εi,j = Xj − (a + b Xi),

which is called the residual. A small Var(εi,j) with respect to Var(Xj) signifies that the model is a good fit of the data and, vice versa, a large Var(εi,j) means that the model is a bad fit of the data.
Example of linearly correlated datasets

[Figure: scatter plots of Xi vs. Xj with the fitted lines Xj = a + b Xi and Xi = a' + b' Xj. Blue: positively correlated, correlation coefficient 0.99. Green: negatively correlated, correlation coefficient −0.99. Red: uncorrelated, correlation coefficient 0.00.]

Note that the classification of the variables as "dependent" and "independent" is somewhat arbitrary, because the regressions of Xi vs. Xj and of Xj vs. Xi are strictly related.
Linear regression: best fitting coefficients
We must therefore search for the coefficient b that minimizes the expected value of the square of the residual:

$$ \operatorname*{argmin}_b\; E\big[ (\varepsilon_{i,j})^2 \big] $$

We have (see exercises for the derivation):

$$ b = \frac{\mathrm{Cov}(X_i, X_j)}{\mathrm{Var}(X_i)} \qquad\text{and}\qquad a = E[X_j] - b\, E[X_i]. $$

Substituting gives

$$ E\big[ (\varepsilon_{i,j})^2 \big] = \mathrm{Var}(X_j) - \frac{\mathrm{Cov}(X_i, X_j)^2}{\mathrm{Var}(X_i)} \quad\Longrightarrow\quad \frac{E\big[ (\varepsilon_{i,j})^2 \big]}{\mathrm{Var}(X_j)} = 1 - \rho_{i,j}^2. $$

$\rho_{i,j}^2$ is called the coefficient of determination. It is a measure of the goodness of the linear fit.
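A minimal sketch verifying these identities on synthetic data (all names and parameter values are illustrative assumptions):

```python
# Least-squares coefficients b = Cov/Var, a = E[Xj] - b E[Xi], and the
# identity Var(eps)/Var(Xj) = 1 - rho^2.
import numpy as np

rng = np.random.default_rng(2)
xi = rng.standard_normal(100_000)
xj = 1.0 + 0.7 * xi + 0.6 * rng.standard_normal(100_000)

b = np.cov(xi, xj, ddof=0)[0, 1] / np.var(xi)
a = xj.mean() - b * xi.mean()
eps = xj - (a + b * xi)

rho = np.corrcoef(xi, xj)[0, 1]
print(round(np.var(eps) / np.var(xj), 4), round(1 - rho**2, 4))  # match
```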
Coefficient of determination
The correlation coefficient can be seen as a measure of the 'quality' of modelling the original data Xi and Xj by means of the linear equations

Xj = a + b Xi + ε or Xi = a' + b' Xj + ε'.

Indeed we have b = ρi,j σj/σi and b' = ρi,j σi/σj, and therefore

$$ (\rho_{i,j})^2 = b\, b'. $$

The square of the correlation coefficient is equal to the product of the two linear regression coefficients.
For a given coefficient of determination ρij², we can say that "the linear equation a + b Xi predicts (ρij² × 100)% of the variance of the variable Xj" (example: if ρij = 0.4, then the linear equation a + b Xi predicts 16% of the variance of variable Xj).

Note that the residuals are not symmetric: εi,j = Xj − (a + b Xi) is different from εj,i = Xi − (a' + b' Xj). But the coefficient of determination is the same for both:

$$ 1 - \frac{\mathrm{Var}(\varepsilon_{i,j})}{\mathrm{Var}(X_j)} = 1 - \frac{\mathrm{Var}(\varepsilon_{j,i})}{\mathrm{Var}(X_i)} = \rho_{i,j}^2 = \rho_{j,i}^2 $$
Linear fit ONEOK INC vs. NICOR INC and vice-versa

[Figure: scatter plot of Y = ONEOK INC log-returns vs. X = NICOR INC log-returns, with the two regression lines. Coefficient of determination = 0.2910. The linear model 0.41 Y + 0.0 explains 29% of the variance of X; the linear model 0.71 X + 0.0003 explains 29% of the variance of Y.]
Linear correlations between N variables
We often have many more than 2 variables, and the dependency problem is a high-dimensional challenge. However, in most practical cases only dependencies between couples of variables, Xi, Xj, are considered, and therefore we apply the definitions and methods to measure dependency between each couple of variables, obtaining in general N(N−1) dependency relations.

For instance, all the correlation coefficients between N variables Xi, with i = 1...N, form an N×N matrix with entries

$$ \rho_{i,j} = \frac{E\big[ (X_i - E[X_i])\,(X_j - E[X_j]) \big]}{\sqrt{E\big[ (X_i - E[X_i])^2 \big]\; E\big[ (X_j - E[X_j])^2 \big]}} $$

which is symmetric (ρi,j = ρj,i) and has all the elements on the diagonal equal to 1, reducing to a set of N(N−1)/2 coefficients.
Pearson’s estimation of correlation coefficient
Given q measurements of the pair {xi(t), xj(t)}, t = 1...q, Pearson's estimator of the correlation coefficient is

$$ \hat\rho_{i,j} = \frac{\sum_{t=1}^{q} (x_i(t) - \hat m_i)\,(x_j(t) - \hat m_j)}{\sqrt{\sum_{t=1}^{q} (x_i(t) - \hat m_i)^2\; \sum_{t=1}^{q} (x_j(t) - \hat m_j)^2}} \qquad\text{with}\qquad \hat m_i = \frac{1}{q} \sum_{t=1}^{q} x_i(t) $$

the sample means.

Example: Pearson's cross-correlation coefficient ρi,j = 0.65 (for log-returns).
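A direct implementation of the estimator above (illustrative only; in practice np.corrcoef performs the same computation):

```python
import numpy as np

def pearson(x, y):
    """Pearson's estimator: demeaned cross-product over the product of norms."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    dx, dy = x - x.mean(), y - y.mean()      # subtract sample means m_i
    return (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

rng = np.random.default_rng(3)
x = rng.standard_normal(1_000)
y = 0.6 * x + 0.8 * rng.standard_normal(1_000)
print(round(pearson(x, y), 4), round(np.corrcoef(x, y)[0, 1], 4))
```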
Example: Pearson's cross-correlation coefficients for the log-returns of 401 firms on the US equity market, 1996–2009.

[Figure: the 401×401 correlation coefficient matrix and the distribution of its N(N−1)/2 = 80200 coefficients, which range roughly between −0.2 and 1.]
Linear correlations and common factors
The simplest interpretation of cross-correlations is to assume that the N stochastic variables are all (linearly) dependent on a given set of common "factors". Consider, for instance, the two stochastic variables

X1 = ε1 + b1,1 F1
X2 = ε2 + b2,1 F1

which are a combination of two independent random variables ε1 and ε2 (with Cov[ε1, ε2] = 0) and share a common 'factor': the random variable F1.

Let us assume E[ε1] = 0, E[ε2] = 0, E[F1] = 0, Var[ε1] = σ1², Var[ε2] = σ2² and Var[F1] = σF1². Then the variances are

Var[X1] = σ1² + b1,1² σF1²
Var[X2] = σ2² + b2,1² σF1²
The covariance is

$$ \mathrm{Cov}(X_1, X_2) = E[X_1 X_2] = E\big[ (\varepsilon_1 + b_{1,1} F_1)(\varepsilon_2 + b_{2,1} F_1) \big] = b_{1,1}\, b_{2,1}\, E[F_1^2] = b_{1,1}\, b_{2,1}\, \mathrm{Var}(F_1) $$

and the correlation coefficient is

$$ \rho_{1,2} = \frac{b_{1,1}\, b_{2,1}\, \sigma_{F_1}^2}{\sqrt{(\sigma_1^2 + b_{1,1}^2 \sigma_{F_1}^2)\,(\sigma_2^2 + b_{2,1}^2 \sigma_{F_1}^2)}}. $$

We can see that when b1,1 and b2,1 are large (and therefore the common factor is weighted more than the noise terms) the correlation coefficient becomes large and eventually ρ1,2 ≈ ±1 when |b1,1|, |b2,1| → ∞. Conversely, when b1,1 and b2,1 are small the two signals become independent and eventually ρ1,2 ≈ 0 when |b1,1| = |b2,1| = 0.
System of p linearly dependent variables
In general, it is possible to generate a set of N variables with any given cross-correlation matrix by using p common random variables F1, F2, ..., Fp (called 'components'). Specifically, if

$$ X_i = \varepsilon_i + \sum_{k=1}^{p} b_{i,k} F_k $$

with Cov(εi, εj) = 0 for i ≠ j, Cov(Fk, Fl) = 0 for k ≠ l, and Cov(εi, Fk) = 0 for all i, k, then the covariance is (i ≠ j)

$$ \mathrm{Cov}(X_i, X_j) = \sum_{k=1}^{p} b_{i,k}\, b_{j,k}\, \mathrm{Var}[F_k], $$

the variance is (i = j)

$$ \mathrm{Var}(X_i) = \mathrm{Var}[\varepsilon_i] + \sum_{k=1}^{p} b_{i,k}^2\, \mathrm{Var}[F_k], $$

and the correlation coefficients are

$$ \mathrm{Corr}(X_i, X_j) = \frac{\sum_{k=1}^{p} b_{i,k}\, b_{j,k}\, \mathrm{Var}[F_k]}{\sqrt{\Big( \mathrm{Var}[\varepsilon_i] + \sum_{k=1}^{p} b_{i,k}^2\, \mathrm{Var}[F_k] \Big)\Big( \mathrm{Var}[\varepsilon_j] + \sum_{k=1}^{p} b_{j,k}^2\, \mathrm{Var}[F_k] \Big)}}. $$
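A minimal simulation of this construction (the sizes N, p, q and the loadings are illustrative assumptions):

```python
# Generate N variables from p common factors and compare the sample
# correlation matrix with the theoretical one.
import numpy as np

rng = np.random.default_rng(4)
N, p, q = 5, 2, 200_000
B = rng.uniform(-1, 1, size=(N, p))      # loadings b_{i,k}
F = rng.standard_normal((q, p))          # independent factors, Var = 1
eps = rng.standard_normal((q, N))        # idiosyncratic noise, Var = 1
X = eps + F @ B.T                        # X_i = eps_i + sum_k b_{i,k} F_k

sample = np.corrcoef(X.T)
cov_theory = B @ B.T + np.eye(N)         # Var[F_k] = Var[eps_i] = 1
d = np.sqrt(np.diag(cov_theory))
theory = cov_theory / np.outer(d, d)
print(np.abs(sample - theory).max())     # small sampling error
```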
Bivariate Normal distribution
If the random variables ε1, ε2 and F1 are distributed according to a Normal distribution, then X1 and X2 must also be Normally distributed (they are sums of Normal variables). The joint distribution between the two variables is

$$ f(X_1 = x, X_2 = y) = \iiint N(\varepsilon_1)\, N(\varepsilon_2)\, N(F_1)\, \delta(\varepsilon_1 + b_{1,1} F_1 - x)\, \delta(\varepsilon_2 + b_{2,1} F_1 - y)\, d\varepsilon_1\, d\varepsilon_2\, dF_1 $$

giving

$$ f(X_1 = x, X_2 = y) = \frac{1}{2\pi\, \sigma_1 \sigma_2 \sqrt{1 - \rho_{1,2}^2}} \exp\!\left\{ -\frac{1}{2(1 - \rho_{1,2}^2)} \left[ \frac{(x - \mu_1)^2}{\sigma_1^2} + \frac{(y - \mu_2)^2}{\sigma_2^2} - \frac{2\rho_{1,2}}{\sigma_1 \sigma_2}\, (x - \mu_1)(y - \mu_2) \right] \right\} $$

which is called the bivariate Normal distribution.
Multivariate Normal distribution
When all variables εi and Fk follow Normal distributions, the joint distribution associated with this system of linearly dependent variables is

$$ f(X_1 = x_1, X_2 = x_2, ..., X_N = x_N) = \frac{1}{\sqrt{(2\pi)^N \det(\Sigma)}} \exp\!\left( -\frac{1}{2} (x - \mu)\, \Sigma^{-1} (x - \mu)^T \right) $$

with x = (x1, x2, ..., xN) a vector of values of the variables X1...XN and μ = (E[X1], E[X2], ..., E[XN]) the vector of their expectation values. The moments of this distribution are

$$ E[X_i] = \mu_i, \qquad E\big[ (X_i - E[X_i])\,(X_j - E[X_j]) \big] = \mathrm{Cov}(X_i, X_j) = \Sigma_{i,j}, $$

and all higher-order cumulants are zero. This is the multivariate Normal distribution.
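A minimal sketch of sampling from this distribution (μ and Σ below are arbitrary illustrative choices):

```python
# Sample from a multivariate normal and check the estimated covariance.
import numpy as np

mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[1.0, 0.6, 0.2],
                  [0.6, 1.0, 0.4],
                  [0.2, 0.4, 1.0]])

rng = np.random.default_rng(5)
X = rng.multivariate_normal(mu, Sigma, size=100_000)
print(np.round(np.cov(X.T), 2))  # close to Sigma
```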
Regression Coefficients
Given the factors Fk and the variables Xi, the problem of finding the coefficients bi,k that best approximate the Xi in terms of a linear combination of the Fk,

$$ X_i = \varepsilon_i + \sum_{k=1}^{p} b_{i,k} F_k, $$

is called linear regression. Mathematically, the problem is to find the coefficients bi,k that minimize the mean square error

$$ E[\varepsilon_i^2] = E\!\left[ \Big( X_i - \sum_{k=1}^{p} b_{i,k} F_k \Big)^{\!2} \right] = E[X_i^2] + \sum_{k,j=1}^{p} b_{i,k}\, b_{i,j}\, E[F_k F_j] - 2 \sum_{k=1}^{p} b_{i,k}\, E[X_i F_k]. $$

We search for the minimum by taking the first derivative with respect to bi,k and setting it equal to zero:

$$ 2 \sum_{j=1}^{p} b_{i,j}\, E[F_k F_j] - 2\, E[X_i F_k] = 0. $$

In matrix notation this is Σ(F,F) B = Σ(F,X), with Σ(F,F) the covariance matrix of the factors (a p×p matrix with elements Σ(F,F)k,k' = Cov(Fk, Fk')), Σ(F,X) the cross-covariance matrix between factors and X (a p×N matrix with elements Σ(F,X)k,i = Cov(Fk, Xi)), and B the p×N matrix of coefficients Bk,i = bi,k.

Solution:

$$ B = \Sigma^{-1}(F,F)\, \Sigma(F,X) $$
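A minimal sketch of this solution estimated from data (synthetic factors and loadings; sizes are illustrative assumptions):

```python
# Estimate B = Sigma(F,F)^{-1} Sigma(F,X) from sample moments.
import numpy as np

rng = np.random.default_rng(6)
q, p, N = 50_000, 3, 4
F = rng.standard_normal((q, p))
B_true = rng.uniform(-1, 1, size=(p, N))
X = F @ B_true + 0.5 * rng.standard_normal((q, N))

Sigma_FF = (F.T @ F) / q           # E[F_k F_j] (factors have zero mean)
Sigma_FX = (F.T @ X) / q           # E[F_k X_i]
B_hat = np.linalg.solve(Sigma_FF, Sigma_FX)
print(np.abs(B_hat - B_true).max())  # small estimation error
```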
Principal components analysis
For risk purposes it is useful to understand if two assets are exposed to a common factor. This requires discovering the factors Fk from the analysis of the variables Xi. Principal components analysis can be used to extract such factors.

In the Principal Components Analysis language, we apply a unit vector w1 with components (w1)i, i = 1...p, called loadings, which projects the set of original, centered variables (E[Xi] = 0, i = 1,...,p) onto a new variable F1 whose components are called scores:

$$ F_1 = \sum_{i=1}^{p} (w_1)_i\, X_i $$

We search for the vector w1 that maximizes the variance Var(F1); it turns out that it is the eigenvector e1 associated with the largest eigenvalue λ1 of the covariance matrix Σ. The explained variance associated with the factor F1 is λ1.
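A minimal sketch of PCA as an eigen-decomposition of the covariance matrix (the synthetic 'market mode' construction is an illustrative assumption):

```python
# The first loading vector w1 maximizes Var(F1), which equals the
# largest eigenvalue lambda1 of the covariance matrix.
import numpy as np

rng = np.random.default_rng(7)
market = rng.standard_normal(10_000)                 # a common 'market mode'
X = 0.8 * market[:, None] + rng.standard_normal((10_000, 6))

Sigma = np.cov(X.T)
eigval, eigvec = np.linalg.eigh(Sigma)               # ascending eigenvalues
w1 = eigvec[:, -1]                                   # loadings of the 1st PC
F1 = X @ w1                                          # scores
print(round(np.var(F1, ddof=1), 3), round(eigval[-1], 3))  # Var(F1) = lambda1
```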
[Figure: 341 stocks on the US market, daily data, 2/01/1997 – 31/12/2012. Correlation coefficient matrix, correlation coefficient distribution and distribution of the eigenvalues of the correlation coefficient matrix, before and after de-trending by the first factor (the 'market mode').]
Factor models
For risk purposes it is useful to understand if two assets are exposed to a common factor. This requires discovering the factors Fk from the analysis of the variables Xi. Principal components analysis can be used to extract such factors, but sometimes one wants to measure the dependence on some a priori given factors. We can write our set of N variables X1, ..., XN as linearly dependent on a set of a priori given factors F1, ..., Fp:

$$ X_i = \varepsilon_i + \sum_{k=1}^{p} b_{i,k} F_k $$

Note that, although the equation looks identical to the previous one, here we assume that the factors can be correlated. Our purpose is to find the coefficients bi,k that provide the best representation of Xi in terms of the factors Fk.
APT
Market risk management, reporting and portfolio construction
FIVE FACTS ABOUT USING FACTOR MODELS FOR MARKET RISK MANAGEMENT
Factor Models are now the widely accepted standard method for measuring market risk, as they show how the return on any portfolio of assets is influenced by different economic factors such as changes in interest rates,
FX rates, inflation, and oil prices. There are several types of factor models, but all are constructed using factor analysis techniques and can be divided into three basic categories: statistical, macroeconomic, and fundamental.
Different factor models include different factors – but which ones are the right ones to include?
The debate around fundamental factor models, which pre-specify factors, and statistical factor models, which use statistical processes to extract the factors from market prices, often draws on complicated mathematical arguments. Here we provide an overview of the two different approaches and summarise the debate in five points.
Overview: Factor models for market risk management
Five factor model facts
1. Why use factor models for market risk management?
› Nearly all investors believe that there are systematic drivers of risk and return, so being able to create an attribution of the total or headline risk measure to the various risk factors, as well as to the positions in any portfolio, is very valuable for investment managers.
› Risk attribution can be easily performed when using a factor risk model such as the SunGard APT model. This helps answer the question “Which factor bets am I taking, and how well are they hedged or diversified?”
› A factor model will be able to reveal style exposures (to value, growth and momentum) for an equities portfolio, and how much risk can be attributed to interest rate movements (shift, twist and butterfly yield curve factors) and how much to credit spread effects.
› The APT factor model provides insight into the risk factors that make up the systematic risk in the portfolio - it is vital to know the breakdown of total risk down to its systematic and specific parts.
› Using a factor model such as SunGard APT, it is also possible to calculate the portfolio beta to any market, industry or regional index.
2. The risks of factor models
› If we attempt to pre-specify (and likely mis-specify) factors when estimating a model from historical data, we cannot capture their effects as completely.
› There is a much higher likelihood of building a genuinely robust risk model when using a statistical methodology such as principal components modelling.
3. Using factor models for multi-asset class risk management
› The estimation of cross-asset-class correlations is quite natural within statistical models, whereas it is an ad-hoc process based on a separate methodology for pre-specified models. Thus the statistical factor modelling process generates a more coherent risk model than other approaches, and is less likely to create unreliable risk measures for portfolios containing assets across different classes.
› The problems of pre-specified factor models are multiplied when trying to create multi-asset-class (MAC) models. The lack of robustness associated with the judgmental approach to factor selection in each asset class becomes compounded when attempting to model the cross-asset-class correlations. A number of MAC risk modellers use only index-level correlation estimates between asset classes since their methods do not allow security-level estimation of these correlations. The estimation of factor correlations across asset classes is a judgmental exercise since there is no objective measure of how many factors are really required to explain cross-asset-class behaviour.
› The statistical factor model methodology provides a coherent approach to MAC modelling, in which macro- economic factors (whose influence extends across all asset classes) may be properly included within the estimation core. By including these macro factors, and carefully selecting the assets within the estimation core, we create a set of principal component factors which not only capture the systematic risks associated with equity/credit, rates, FX and commodity markets, but do so simultaneously all the cross-asset-class effects observed in the historical data.
› The most complete MAC model from APT contains 96 factors – 30 associated with rates, 20 associated with equity & credit, 26 associated with FX and 20 associated with commodities (to which all assets may be exposed). This is an economically sensible number of factors to represent the truly independent risk drivers of the global marketplace.
Fundamental factor models
› Start out by identifying micro-economic traits of assets, such as industry membership, financial ratios and exposure to technical indicators.
› Then the impact certain events may have on individual stocks is determined based on publicly available information.
› Next, a set of factors is pre-specified based on what the risk model provider deems logical, e.g. value, growth, sectors, interest rates, etc. Finally, statistical regression is applied to map historical stock prices onto factor values to infer the exposure of each stock to each factor.
› Only after all these different assumptions are made can the fundamental factor model be used for risk analysis.
› As every pre-specified factor model uses a different set of factors, it is recognized that this approach can introduce factor risk and lead to model mis-specification: Depending on the suppositions made, key factors may be missing and irrelevant factors may be included.
Statistical factor models
› Are economically motivated and consistent with asset pricing theory and the observed effects of arbitrage across markets.
› The SunGard APT approach is to build a dataset of asset returns and a rich set of explanatory factors, including market and sector equity indices, FX rates, interest rates, credit spreads, commodity indices, inflation etc. This dataset allows cross-asset-class effects to be captured within the risk models.
› The model does not pre-specify which factors are to be regarded as the sources of systematic risk within markets, and it does not assume that all the systematic risk can be captured with a named set of correlated factors.
› Instead, statistical factor models rely on fewer assumptions, and use robust statistical processes to extract the factors from the whole universe of market prices and macro explanatory factors.
› SunGard’s APT uses a proprietary process similar to Principal Components Analysis (PCA), which is why the model is called a ‘statistical model’.
Conclusion: A robust model of market risk must take account of the co-movement of asset returns both in normal times and when markets are stressed. Because statistical factor models make fewer assumptions about the systematic risk factors which drive markets, there is a better chance of capturing the effects of these factors.
4. Using factor models for equities market risk management
› The first factor risk models were built for equities, and included industry factors only. It quickly became apparent that for these simple multivariate regression models, adding style factors would improve the explanatory and forecast power of these models – but how many truly independent styles are there? When we build global equity models, is it better to include all country factors (say 40 factors or more) or to rely on regional factors? How many independent currency risk factors do we need? How much would the addition of commodity or macro-economic factors (such as oil, inflation, interest rates, credit spreads) improve a global equities model?
› These are difficult questions for risk modellers to answer, because all the obvious explanatory factors are correlated with one another, and there may be other important risk factors which are not obvious. In addition, there are transitory factors which can affect equities markets strongly during some periods but are much less influential during other times.
› Building an equity risk model with pre-specified factors is always based on judgment rather than objective methods, and the results are often far from robust. It is this difficulty which has led some pre-specified risk modellers to include principal components factors alongside fundamental factors in their models, to capture the structure that pre-specified factors cannot. Whether the regression model for estimating betas is implemented on time-series or cross-sectionally, ultimately this approach provides little confidence that the systematic part of the risk is correctly estimated. Many ‘surprises’ in realised risk compared to forecasts from pre-specified models have been observed because of market risk factors which were only “identified” with the benefit of hindsight.
5. Using factor models for bonds market risk management
› Similar problems occur with other asset classes such as bonds. The risk factors associated with yield curves (e.g. shift, twist, butterfly factors) are very highly correlated across currencies (for example Euro and Swiss Franc) and the total number of factors to include in a global bond model is far from obvious.
Conclusion:
Mis-specification of the risk factors can be avoided by starting with a methodology that is theory-based and objectively defined. This approach will also provide a robust measure of the systematic part of the risk on any portfolio.
The method of principal components, applied to the historical correlation matrix of a carefully selected “estimation core” of the assets within a single class such as equities, can provide a stable and robust set of systematic factors without the problems of co-linearity and arbitrary selection. By re-estimating the model factors every month, the problems associated with transitory factors are much reduced. Users can then overlay the explanatory factors of their choice within the portfolio analysis to provide intuitive attribution and scenario analysis.
Creating long-term and short-term risk models can be easily achieved by applying an appropriate influence function to the historical data. The statistical factor methodology does not necessitate arbitrary judgements about which factors may or may not be most strongly affected when changing the effective time horizon of the risk model.
Non-linear dependency
The use of the correlation coefficient to measure dependence between variables is very common and widespread. However, this measure can be very problematic and it might sometimes lead to serious faults. Indeed, in non-linear cases, strictly dependent variables can have zero correlation coefficient.

Other problems might arise with non-Normally distributed variables. Indeed, we already noticed that the standard deviation is not defined for random variables with fat-tailed power-law distributions with tail exponent smaller than or equal to 2. This implies that for these variables the correlation coefficient is not defined either. Moreover, when the tail index α belongs to the interval (2,4], the correlation coefficient exists but its Pearson estimator is highly unreliable, because its distribution has undefined second moments and it can therefore have large variations.
Example: non-linearly dependent variables

Consider the stochastic variable X1, which is Normally distributed with E[X1] = 0 and E[X1²] = 1. The variable

$$ X_2 = \frac{1}{\sqrt{2}}\, (X_1^2 - 1) $$

has zero correlation coefficient with the variable X1:

$$ \rho_{1,2} = \frac{E[X_1 X_2] - E[X_1]\, E[X_2]}{\sigma_1 \sigma_2} = \frac{1}{\sqrt{2}\, \sigma_1 \sigma_2}\Big( E\big[ X_1 (X_1^2 - 1) \big] - E[X_1]\, E\big[ X_1^2 - 1 \big] \Big) $$

$$ = \frac{1}{\sqrt{2}\, \sigma_1 \sigma_2}\Big( E[X_1^3] - E[X_1] - E[X_1]\, E[X_1^2] + E[X_1] \Big) = 0 \qquad\text{because } E[X_1] = E[X_1^3] = 0, $$

but they are not independent variables, because the value of X2 is completely determined by the value of X1.

Note that if, instead of the correlation coefficient between the two variables, we compute the correlation coefficient between the function g(X1) = (X1² − 1)/√2 and X2, we would get ρ'1,2 = E[g(X1) X2] = 1, showing 100% correlation.
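A minimal numerical check of this example (sample size is an arbitrary choice):

```python
# X2 = (X1^2 - 1)/sqrt(2) is fully determined by X1 yet has (almost)
# zero Pearson correlation with it.
import numpy as np

rng = np.random.default_rng(8)
x1 = rng.standard_normal(1_000_000)
x2 = (x1**2 - 1) / np.sqrt(2)

print(round(np.corrcoef(x1, x2)[0, 1], 3))                        # ~0.0
print(round(np.corrcoef((x1**2 - 1) / np.sqrt(2), x2)[0, 1], 3))  # 1.0
```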
Measure of Dependency
Let us first ask what properties are desirable, in general, for a measure of dependency. A measure of dependency C(X,Y) between two random variables X and Y must:

1) be symmetric: C(X,Y) = C(Y,X);

2) have values within [0,1]: 0 ≤ C(X,Y) ≤ 1;

3) be equal to 1 if there is a strict dependence between X and Y (i.e. if X = g(Y) or Y = f(X), with f(.), g(.) Borel-measurable functions): C(X,Y) = 1;

4) be invariant under Borel-measurable transformations of the variables: C(f(X), g(Y)) = C(X,Y);

5) reduce to the absolute correlation, C(X,Y) = |Corr(X,Y)|, if X, Y are linearly correlated (joint normal).

Rényi, Alfréd. "On measures of dependence." Acta Mathematica Hungarica 10.3 (1959): 441-451.
Let us note that if X and Y are independent variables, then E[f(X)g(Y)] = E[f(X)]E[g(Y)] and consequently Cov[f(X), g(Y)] = 0. We have that X and Y are independent if and only if Cov[f(X), g(Y)] = 0 for all Borel-measurable functions f(X) and g(Y).

The quantity

$$ \sup_{f,g}\; \mathrm{Corr}\big( f(X),\, g(Y) \big) $$

is a dependency measure that can account for any kind of non-linear dependency. However, the space of measurable functions is 'very large', containing all sorts of weird functions, and the functions f(X) and g(Y) that maximize the correlation are hard to find and might not even exist.

Rényi, Alfréd. "On measures of dependence." Acta Mathematica Hungarica 10.3 (1959): 441-451.
Spearman’s Rho correlation
As a particular case of non-linear correlation extracted from a function of the variables we have Spearman's Rho correlation. This is a non-parametric measure of correlation which assesses how well an arbitrary monotonic function (and not only a linear function, as for Pearson's correlation coefficient) could describe the relationship between two variables. It means that it quantifies the model

Xj = f(Xi)

with f(.) any monotonic function.
Given q observations of two variables,

{x(1), x(2), ..., x(q)} and {y(1), y(2), ..., y(q)},

Spearman's Rho can be computed as follows:

1) substitute each value with its rank:

{Rank(x(1)), Rank(x(2)), ..., Rank(x(q))} and {Rank(y(1)), Rank(y(2)), ..., Rank(y(q))}

(where Rank(x(t)) is the number of values in {x(1), x(2), ..., x(q)} which are smaller than or equal to x(t));
2) calculate the Pearson correlation coefficient from the two lists of ranks instead of the original variables (when rank ties are present, fractional ranks, i.e. ranks equal to the average rank, should be used).

Note that the Pearson correlation between ranks is the Pearson correlation between the (empirical) cumulative probability distributions:

ρs = Corr(F(X), F(Y)).
For a large class of dependencies (elliptic copulas) the Spearman's Rho is related to the Pearson correlation coefficient by

$$ \rho_S = \frac{6}{\pi} \arcsin\!\left( \frac{\rho}{2} \right) $$
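A minimal sketch of Spearman's Rho as the Pearson correlation of ranks, checked against scipy's estimator (the monotonic relation and sample size are illustrative assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
x = rng.standard_normal(5_000)
y = np.exp(x) + 0.1 * rng.standard_normal(5_000)   # monotonic in x

# Pearson correlation of the ranks = Spearman's Rho
rho_ranks = np.corrcoef(stats.rankdata(x), stats.rankdata(y))[0, 1]
print(round(rho_ranks, 4), round(stats.spearmanr(x, y)[0], 4))
```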
Kendall’s tau
Another non-parametric measure of correlation, which measures the probability that two random variables X, Y are concordant or discordant, is Kendall's tau. In particular, it takes all pairs of observations (x(t), x(s)) and (y(t), y(s)) and compares the sign of (x(t) − x(s)) with the sign of (y(t) − y(s)), counting the relative number of times that they are the same or different:

$$ \tau = \frac{1}{q(q-1)/2} \sum_{t=1}^{q-1} \sum_{s=t+1}^{q} \mathrm{sign}\big( (x(t) - x(s))\,(y(t) - y(s)) \big) $$

This returns a measure between −1 (maximal anti-dependence) and 1 (maximal dependence), with 0 associated with the independent case.
For a large class of dependencies (elliptic copulas) the Kendall tau is related to the Pearson correlation coefficient by

$$ \tau = \frac{2}{\pi} \arcsin(\rho) $$

Let us note that although the Spearman and Kendall correlations can capture non-linear dependencies, they would fail to correctly detect the dependency between X and Y = (X² − 1)/√2 in the previous example. Indeed, these measures capture monotonic dependency only.
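A minimal implementation of the pair-counting formula above (O(q²), fine for small q), checked against scipy:

```python
import numpy as np
from scipy import stats

def kendall_tau(x, y):
    """Kendall's tau by direct pair counting."""
    q = len(x)
    s = 0
    for t in range(q - 1):
        s += np.sign((x[t] - x[t + 1:]) * (y[t] - y[t + 1:])).sum()
    return s / (q * (q - 1) / 2)

rng = np.random.default_rng(10)
x = rng.standard_normal(500)
y = x**3 + 0.2 * rng.standard_normal(500)   # monotonic in x
print(round(kendall_tau(x, y), 4), round(stats.kendalltau(x, y)[0], 4))
```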
Rank correlations
The Kendall tau and Spearman's Rho belong to a more general class of correlations called rank correlations, which are Pearson correlations of 'ranks' of the variables:

$$ \Gamma = \frac{\sum_{t,s=1}^{q} a_{t,s}\, b_{t,s}}{\sqrt{\sum_{t,s=1}^{q} a_{t,s}^2\; \sum_{t,s=1}^{q} b_{t,s}^2}} = \mathrm{Corr}(a, b) $$

For the Kendall tau the ranks are: at,s = sign(Rank(x(t)) − Rank(x(s))), bt,s = sign(Rank(y(t)) − Rank(y(s))).

For the Spearman Rho the ranks are: at,s = Rank(x(t)) − Rank(x(s)), bt,s = Rank(y(t)) − Rank(y(s)).

Infinitely many other ranks can be elaborated.
Correlation ratio
We have discussed, for the linear case, that the correlation coefficient is related to the goodness of the fit between a variable, Y, and a linear function of the other: f(X) = a + b X.

In general, to measure dependency between two random variables X and Y we may look for a function f(X) that best fits Y and then measure the goodness of the fit (goodness of the regression). Clearly, if the fit is good then we must have a dependence between the two variables. In order to measure the quality of the fit we might look at the relative 'distance' between the data and the fit. A measure can be

$$ \phi_{X,Y} = \frac{\mathrm{Var}\big[ Y - f(X) \big]}{\mathrm{Var}[Y]}; $$

small values of this 'cost' function imply good fits and large values correspond instead to poor fits.
If, for instance, we assume a linear dependence f(X) = a + b X, then

$$ \phi_{X,Y} = \frac{\mathrm{Var}\big[ Y - (a + bX) \big]}{\mathrm{Var}[Y]} = 1 + b^2 \frac{\sigma_X^2}{\sigma_Y^2} - 2b\, \frac{\mathrm{Cov}(X,Y)}{\sigma_Y^2} = 1 - \rho_{X,Y}^2 $$

(note: we used b = ρX,Y σY/σX, which is the best linear fit). We see that φX,Y is indeed a 'cost', with larger values associated with poor correlations. In this case φX,Y correctly goes to zero when ρX,Y = ±1. We have indeed that, for the linear case, 1 − φX,Y is the coefficient of determination defined previously for linear regressions.

However, in general we do not know the function f(X); therefore the best we can do is to substitute f(X) with the expectation value of Y for a given X:

f(x) = E[Y = y | X = x] = E[Y | X].

This is the general definition of regression.
"$ "$
Var Y − E Y | X
# # %%
φX,Y =
#%
♯
Var"Y$
=
Var Y − Var E Y | X #% ## %%
Var"Y$ #%
"$ "$"$
"$ "$
!# =1- EY|X=x
Var E Y | X ## %%
x
"$
Var"Y$ #%
=1−η2 X,Y
As proposed by Karl Pearson, in analogy with the previous result for the linear dependence, we can use as measure of dependence 1- φ defining the correlation ratio as
η2 = X,Y
"$
!# !#
Var E Y | X "$
Var!Y# "$
COMPG001 - Dependency T. Aste UCL 2019 50
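A minimal sketch estimating the correlation ratio by approximating E[Y|X] with conditional means over bins of X (the bin count and the non-linear example are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(11)
x = rng.standard_normal(200_000)
y = (x**2 - 1) / np.sqrt(2) + 0.1 * rng.standard_normal(200_000)

bins = np.quantile(x, np.linspace(0, 1, 51))       # 50 equal-population bins
idx = np.clip(np.digitize(x, bins) - 1, 0, 49)
cond_mean = np.array([y[idx == k].mean() for k in range(50)])

eta2 = np.var(cond_mean[idx]) / np.var(y)          # Var(E[Y|X]) / Var(Y)
print(round(eta2, 3))  # close to 1: strong (non-linear) dependence
```

Note that, unlike the Pearson coefficient, this measure does detect the non-linear dependence of the earlier example Y = (X² − 1)/√2.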
Information theoretic measures of dependency
Mutual Information
From a general perspective, to quantify dependency we must measure the distance between the joint probability distribution function f(X,Y) and the product of the two marginal distributions fX(X)fY (Y). The two variables are independent if and only if such a distance is zero.
A general measure of distance between two probability distribution functions f1 and f2 is the Kullback–Leibler divergence DKL(f1||f2) (defined in Part I), which, for continuous variables, is

$$ D_{KL}(f_1 \| f_2) = \int f_1(x)\, \log \frac{f_1(x)}{f_2(x)}\, dx $$
By re-writing the definition as
$$ D_{KL}(f_1 \| f_2) = \int f_1(x)\, \big( \log f_1(x) - \log f_2(x) \big)\, dx, $$
this measure is an average of a difference between the logarithms of the two distributions and DKL(f1||f2)=0 when f2 = f1 .
We can therefore use this distance to introduce a new dependency measure:
I(X;Y) = DKL( f(X,Y) || fX(X)fY (Y) )
i.e. the Kullback–Leibler divergence between f(X,Y) and
fX(X) fY (Y).
This is called the mutual information.
Mutual information
$$ I(X;Y) = \iint f(x,y)\, \log \frac{f(x,y)}{f_X(x)\, f_Y(y)}\, dx\, dy $$
The mutual information measures how much the knowledge of one of these variables, say X, reduces our uncertainty about the other, say Y.
It is a symmetric measure I(X;Y) = I(Y;X)
We will see that mutual information is an information measure related to the entropies of the variables.
Indeed, dependency between two variables ultimately means that one variables carries some information about the other.
Entropy, Information and Dependency
Quantification of Information
Let us consider a time series of T observations of a variable X that can take a number, A, of discrete values; or, analogously, consider a "message" of length T written with an "alphabet" containing A characters. There are T positions where the first letter can be placed, T−1 positions where the second letter can be placed, and so on, until the last, T-th letter has only one position left. This gives

T(T−1)(T−2)...1 = T! combinations.

However, in a 'message' some 'characters' are repeated several times, and the exchange of two identical characters in different places does not change the sentence. Therefore, for each character "i" repeated ni times we must consider that ni! combinations are identical and do not carry any extra information; when we have repetitions, we must divide the total number of combinations, T!, by ni!.

It results that the total number of different possible combinations in a signal of length T containing a number n1 of character a1, a number n2 of character a2, ..., and a number nA of character aA is

$$ \Omega = \frac{T!}{n_1!\, n_2! \cdots n_A!} $$

This number is associated with the maximum amount of information that we can pack in a signal of length T where the value x1 is repeated n1 times, the value x2 is repeated n2 times, and so on. From a different perspective, given the probability distribution of a variable X we can compute the maximum amount of information it can carry. We can, for instance, see that if we use only one character (ni = T) then Ω = 1 (no information).
If we measure the information in terms of bits, G, we have Ω = 2^G, which means

$$ G = \log_2 \Omega = \log_2 \frac{T!}{n_1!\, n_2! \cdots n_A!} $$

We can now use Stirling's approximation,

$$ \log_2 n! \approx n \log_2 n - n \log_2 e $$

(a very accurate approximation of the factorial for large enough n; the last terms cancel in the difference), obtaining

$$ G = -\sum_{i=1}^{A} n_i \log_2 \frac{n_i}{T} $$
We might recognize that ni/T is the relative frequency of occurrence of the character "i" in the signal, and it tends to the probability p(xi) in the limit of an infinitely long signal (law of large numbers). Therefore, for large T,

$$ G = -T \sum_{i=1}^{A} p(x_i) \log_2 p(x_i) = T\, H(X). $$

The quantity

$$ H(X) = -\sum_{i=1}^{A} p(x_i) \log_2 p(x_i) $$

is called the Shannon/Gibbs entropy associated with a random variable X with probability distribution p(X).
Entropy is a measure of information: it counts the minimum number of bits that may be necessary to encode the information contained in the variable or, equivalently, the maximum number of bits per unit length that a signal with distribution p(X) can transmit.

Entropy is a measure of uncertainty: the larger the entropy, the smaller the possibility to predict the next output in a data series. Entropy is maximal when the distribution is constant, p(X) = p0; this corresponds to maximum uncertainty: when all 'characters' are equally likely to appear, prediction of the next character is hardest.

For a Normal distribution, H(X) = log(σ) + const.
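A minimal sketch of the discrete entropy, illustrating that the uniform distribution maximizes uncertainty (the example distributions are arbitrary):

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a discrete distribution p."""
    p = np.asarray(p, float)
    p = p[p > 0]                     # 0 * log 0 = 0 by convention
    return -(p * np.log2(p)).sum()

print(entropy_bits([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits = log2(A), maximal
print(entropy_bits([0.7, 0.1, 0.1, 0.1]))      # ~1.36 bits, more predictable
print(entropy_bits([1.0, 0.0, 0.0, 0.0]))      # 0 bits: no information
```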
In analogy with the definition for one variable, we can define the joint-entropy of two variables as
$$ H(X,Y) = -\iint f(x,y)\, \log f(x,y)\, dx\, dy $$

and, similarly, the conditional entropy

$$ H(X \mid Y) = -\iint f(x,y)\, \log f(x \mid y)\, dx\, dy. $$

This quantity measures the amount of uncertainty left about X given the knowledge of Y. For instance, if Y carries no information about X then H(X|Y) = H(X). In general, we must have H(X) ≥ H(X|Y).
The difference between H(X) and H(X|Y) is therefore a measure of dependency.
Indeed,
I(X;Y)= H(X) - H(X|Y) is the mutual information introduced previously.
Indeed, from the previous definition of the mutual information,

$$ I(X;Y) = \iint f(x,y)\, \log \frac{f(x,y)}{f_X(x)\, f_Y(y)}\, dx\, dy, $$

we can see directly (by using Bayes' theorem) that

$$ I(X;Y) = H(X) - H(X \mid Y). $$

Note that also

$$ I(X;Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X,Y), $$

where

$$ H(X \mid Y) = \int f_Y(y)\, H(X \mid Y = y)\, dy = -\iint f(x,y)\, \log f(x \mid y)\, dx\, dy. $$
[Figure 2.3: a Venn diagram demonstrating the relationship between the entropies H(X) and H(Y) of two variables, their joint entropy H(X,Y), the conditional entropies H(X|Y) and H(Y|X), and the mutual information I(X;Y).]

The mutual information quantifies the reduction in uncertainty about X produced by the knowledge of Y (and vice versa); analogously, H(X|Y) is the amount of uncertainty about X left by the full knowledge of Y. The mutual information is a measure of dependency, with larger values corresponding to larger dependency.
For linearly dependent (jointly Normal) variables, mutual information and correlation are simply related by

$$ I(X;Y) = -\frac{1}{2} \log\big( 1 - \rho_{X,Y}^2 \big) $$

and conversely

$$ \rho_{X,Y} = \sqrt{ 1 - \exp\big( -2\, I(X;Y) \big) }, $$

but this is not true in general.
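A minimal sketch estimating I(X;Y) from a 2D histogram for jointly Gaussian data and comparing it with the closed form above (the bin count and sample size are arbitrary choices, and the histogram estimator carries some bias):

```python
import numpy as np

rng = np.random.default_rng(12)
rho = 0.7
x = rng.standard_normal(1_000_000)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(1_000_000)

h, _, _ = np.histogram2d(x, y, bins=60)
pxy = h / h.sum()
px, py = pxy.sum(axis=1), pxy.sum(axis=0)
mask = pxy > 0
mi = (pxy[mask] * np.log(pxy[mask] / np.outer(px, py)[mask])).sum()  # nats

print(round(mi, 3), round(-0.5 * np.log(1 - rho**2), 3))  # close (~0.34)
```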
Non-simultaneous dependency
We can now ask the intriguing question whether events in the past may be related to events in the future. One simple test is to look for a dependency relation between events in the past and in the future. This can be achieved simply by taking two time series and shifting one with respect to the other by a certain number of time steps – a certain lag.

If there is a significant dependence between a signal x(t) at time t and another signal y(t−λ) at a previous time t−λ, then we can use the information about y to predict what might happen to x at a later time. In this case we say that Y has a leading lag-dependency on X at lag λ.
Causality
Was it the housing bubble that triggered the crisis in the banking sector, or was it the banking sector crisis that dragged down the housing market?

Dependency measures do not give any information about causal relations. To establish which is the cause and which is the effect of a given outcome is often a very complicated issue. One approach is to analyze the temporal sequence of events that lead to a given outcome and then establish which are the events in the past that contribute most to the present outcome.
Consider two variables X and Y. We aim to establish whether the observation x(t+h) is more the consequence of what occurred previously to X or to Y. Specifically, let us measure the additional information on the future value of the variable X,

x(t+h),

provided by the previous m values of the variable Y,

Y(t,m) = (y(t), ..., y(t−m+1)),

given the previous k values of the variable X,

X(t,k) = (x(t), ..., x(t−k+1)).
Causality: Transfer Entropy
We can quantify this information from the amount of uncertainty reduction in future values of X obtained by knowing the past values of Y. This is

H(x(t+h) | X(t,k)) − H(x(t+h) | X(t,k), Y(t,m)).

Using the Shannon entropy it is written as

$$ T(Y \to X) = \int \!\cdots\! \int f\big( x(t+h), X(t,k), Y(t,m) \big)\, \log \frac{f\big( x(t+h) \mid X(t,k), Y(t,m) \big)}{f\big( x(t+h) \mid X(t,k) \big)}\, dx(t+h)\, d^k x(t,k)\, d^m y(t,m) $$

We can see that if x(t+h) is random, or only dependent on the past values of X but not on Y, then p(x | X, Y) = p(x | X) and consequently T(Y → X) = 0.
From the expression

T(Y → X) = H(x | X) − H(x | X, Y)

we can see that T is a positive quantity measuring a difference in relative entropy. In other terms we can write

T(Y → X) = I(x ; Y | X),

where we define the conditional mutual information as

$$ I(X_A ; X_B \mid X_C) = D_{KL}\big(\, p(X_A, X_B, X_C)\; \big\|\; p(X_A \mid X_C)\, p(X_B \mid X_C)\, p(X_C)\, \big). $$

The transfer entropy is a conditional mutual information: it is indeed the information about A provided by B given the full knowledge of C.
Granger causality
The Granger causality test is a regression procedure to determine whether one time series is useful in forecasting another.
Consider a time series X and let us try to predict the present value of the series, x(t), from the knowledge of k previous values X(t−1,k) = (x(t−1), x(t−2), ..., x(t−k)). An autoregressive predictor uses a linear combination of the previous values to predict the current one:

x(t) = a0 + a1 x(t−1) + a2 x(t−2) + ... + ak x(t−k) + εx(t),

where εx(t) is the prediction error (residual). The autoregression is calibrated by finding the coefficients A = (a0, a1, a2, ..., ak) that minimize the sum of the squares of εx(t), which corresponds to minimizing Var[εx].
Let us now consider two time series X and Y and do the same kind of regression, but also using the information from the other series:

x(t) = b0 + b1 x(t−1) + b2 x(t−2) + ... + bk x(t−k) + c1 y(t−1) + c2 y(t−2) + ... + cm y(t−m) + εx|y(t);

the coefficients B = (b0, b1, b2, ..., bk) and C = (c1, c2, ..., cm) are the ones that minimize Var[εx|y].

By looking at the values of the residuals we define causality as follows: if Var[εx|y] is significantly smaller than Var[εx], then Y is said to have a causal influence on X: Y → X, where 'significantly' means with respect to the null hypothesis.
In matrix form these regressions are written as

x(t) = A X(t−1,k)ᵀ + εx(t)
x(t) = B X(t−1,k)ᵀ + C Y(t−1,m)ᵀ + εx|y(t),

with the optimal coefficient matrices A, B, C obtained by minimizing the regression residuals.
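A minimal sketch of this regression comparison on synthetic data where Y genuinely drives X (the lag orders, coefficients and sample size are illustrative assumptions):

```python
# Granger-style comparison: residual variance with and without the past of Y.
import numpy as np

rng = np.random.default_rng(13)
q = 20_000
y = rng.standard_normal(q)
x = np.zeros(q)
for t in range(1, q):                       # Y genuinely drives X
    x[t] = 0.3 * x[t - 1] + 0.5 * y[t - 1] + rng.standard_normal()

# Restricted model: x(t) on its own past; full model: add the past of y.
X_r = np.column_stack([np.ones(q - 1), x[:-1]])
X_f = np.column_stack([np.ones(q - 1), x[:-1], y[:-1]])
target = x[1:]

res_r = target - X_r @ np.linalg.lstsq(X_r, target, rcond=None)[0]
res_f = target - X_f @ np.linalg.lstsq(X_f, target, rcond=None)[0]
print(round(res_r.var(), 3), round(res_f.var(), 3))        # full is smaller
print(round(0.5 * np.log(res_r.var() / res_f.var()), 3))   # Granger/TE value
```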
Granger causality is strictly related to transfer entropy: it turns out that they are actually the same when the variables are linearly dependent. Indeed, for linearly dependent variables one has the identity

$$ T(Y \to X) = \frac{1}{2} \log \frac{\mathrm{Var}(\varepsilon_x)}{\mathrm{Var}(\varepsilon_{x|y})}, $$

since, for Gaussian residuals, H(x(t) | X(t−1,k), Y(t−1,m)) = ½ log Var[εx|y] + const and H(x(t) | X(t−1,k)) = ½ log Var[εx] + const.

L. Barnett, Phys. Rev. Lett. 103 (2009) 238701.
Estimation of dependency and causality
Multivariate sample moments
In general, we can estimate the expectation value of any function of the variables by using sample means:

$$ E\big[ f(X_1, X_2, ..., X_N) \big] \;\to\; \frac{1}{q} \sum_{t=1}^{q} f\big( x_1(t), x_2(t), ..., x_N(t) \big) $$

For instance, the covariance becomes

$$ \mathrm{Cov}[X_i, X_j] = \frac{1}{q} \sum_{t=1}^{q} \left( x_i(t) - \frac{1}{q} \sum_{s=1}^{q} x_i(s) \right)\! \left( x_j(t) - \frac{1}{q} \sum_{s=1}^{q} x_j(s) \right) $$

The law of large numbers assures convergence towards the expectation value for large q.
Estimate multivariate probabilities
As in the univariate case, we have two main approaches: parametric and non-parametric. The parametric case requires the estimation of the coefficients associated with the multivariate probability distribution function. For multivariate Normal distributions and several other common multivariate (location-scale) distributions, these parameters are the univariate means of all variables, their variances and the covariances. For N variables we typically have N means, N variances and N(N−1)/2 covariances.
The non-parametric approach requires the estimation of the joint distribution from observations. For this we can use the same methodologies we have seen for the univariate case, namely the binning into histograms and the kernel methods, extended to more than one variable.

Two-dimensional histogram

For instance, if we have two variables Xi and Xj, we can divide the observation domain into a grid and compute the frequencies of co-occurrence of each couple of observations, obtaining a two-dimensional histogram.

[Figure: a two-dimensional histogram of the joint observations of two variables.]

The relative frequencies will eventually converge towards the probabilities for a large number of observations. If we consider only couples of variables, we have N(N−1) relative frequencies to estimate.
Note that to evaluate the mutual information between two variables,

$$ I(X_i ; X_j) = \iint f(x_i, x_j)\, \log \frac{f(x_i, x_j)}{f_i(x_i)\, f_j(x_j)}\, dx_i\, dx_j, $$

the estimation of the bivariate probability distribution function is sufficient. However, if we want to estimate the transfer entropy, we already need to estimate joint probabilities of at least three variables:

$$ T(X_j \to X_i) = \iiint f\big( x_i(t), x_i(t-\lambda), x_j(t-\lambda) \big)\, \log \frac{f\big( x_i(t) \mid x_i(t-\lambda), x_j(t-\lambda) \big)}{f\big( x_i(t) \mid x_i(t-\lambda) \big)}\, dx_i(t)\, dx_i(t-\lambda)\, dx_j(t-\lambda), $$

where in this example we estimate the information on the variable Xi(t) provided by the variable Xj(t−λ) in excess of the information from Xi(t−λ).
Curse of dimensionality
When we have a number N of variables, the estimation of the joint probability distribution function requires the estimation of at least N(N−1)/2 quantities (i.e. the covariances or the joint frequencies between couples of variables). If we have q observations of the system, the information we acquire about the system grows at most as q·N, whereas the number of quantities to estimate grows at least as N². It is intuitive to understand that, unless the number of observations q is equal to or larger than N, we will not be able to estimate joint distribution functions with sufficient precision.
Inverse Covariance matrix
Formally, the problem with the dimensionality N with respect to the size of the observation set becomes evident in the parametric estimation of the multivariate Normal,

$$ f(X_1 = x_1, X_2 = x_2, ..., X_N = x_N) = \frac{1}{\sqrt{(2\pi)^N \det(\Sigma)}} \exp\!\left( -\frac{1}{2} (x - \mu)\, \Sigma^{-1} (x - \mu)^T \right), $$

where we must estimate the means μ and the inverse of the covariance matrix Σ (the matrix with elements Σi,j = Cov(Xi, Xj)). If we estimate Cov(Xi, Xj) from sample means, from linear algebra we know that the inverse is defined only if q > N, meaning that the number of observations should be larger than the number of variables in order to have a parametric estimation of the multivariate Normal.
Shrinkage and regularizations
There are ways to improve the estimation of the inverse covariance that can be very effective when the number of observations is of the same order as the number of variables, q ~ N. One of the most used is based on the idea of inverting

$$ \big( \Sigma + \gamma^2 I_N \big)^{-1}, $$

with IN the N×N identity matrix. This is called shrinkage. The additional term penalizes the sum of squared values of the coefficients of the inverse. By adding a constant γ² to the diagonal, the inverse is always defined (even for q < N).
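A minimal sketch of this effect (the sizes N, q and the value of γ² are arbitrary illustrative choices):

```python
# With q < N the sample covariance is singular, but Sigma + gamma^2 * I
# is always invertible.
import numpy as np

rng = np.random.default_rng(14)
N, q = 50, 30                                # fewer observations than variables
X = rng.standard_normal((q, N))
S = np.cov(X.T)                              # rank <= q - 1 < N: singular

print(np.linalg.matrix_rank(S))              # < N, plain inversion fails
gamma2 = 0.1
S_inv = np.linalg.inv(S + gamma2 * np.eye(N))  # well-defined for gamma2 > 0
print(S_inv.shape)
```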
The statistical significance of an observed dependency estimate ci,j can be assessed by comparing it with the estimates ai,j obtained from randomly permuted series:

pi,j = (number of times |ai,j| > |ci,j|) / (total number of permutations).

Clearly, pi,j must be small (typically smaller than 0.01 or 0.05) to reject the hypothesis that the observed dependency can be just the outcome of chance. Also in this case the Bonferroni correction must be applied.
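A minimal sketch of such a permutation test for a correlation estimate (true coupling, sample size and number of permutations are illustrative assumptions):

```python
# Shuffling one series destroys any true dependency, giving a null
# distribution against which the observed estimate is compared.
import numpy as np

rng = np.random.default_rng(15)
x = rng.standard_normal(200)
y = 0.3 * x + rng.standard_normal(200)

c_obs = abs(np.corrcoef(x, y)[0, 1])         # observed dependency c_ij
null = np.array([abs(np.corrcoef(x, rng.permutation(y))[0, 1])
                 for _ in range(2_000)])     # permuted estimates a_ij
p_value = (null > c_obs).mean()
print(round(c_obs, 3), p_value)  # small p: dependency unlikely by chance
```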
Example: 401 firms on the US equity market, 1996–2009.

[Figure: distribution of the estimated correlation coefficients together with the corresponding significance thresholds (0.1% and 5% levels).]
Example: bootstrap and hypothesis testing on the 'spurious correlations' series

[Figure: the tylervigen.com examples revisited (arcade revenue vs. computer science doctorates, Correlation: 98.51% (r = 0.985065); age of Miss America vs. murders by steam, Corr = 0.8701), with bootstrap confidence intervals (1%: −0.7209, 0.6933) and p-value = 0.0005. Null hypothesis not rejected? Are the two series dependent? What about multiple testing? Data sources: U.S. Census Bureau and National Science Foundation.]
Example

Data for the 'Age of Miss America' vs. 'Murders by steam, hot vapours and hot objects' example (tylervigen.com; data sources: Wikipedia and Centers for Disease Control & Prevention):

Year       1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
Murders       7    7    7    3    4    3    8    4    2    3    2
Age (yrs)    24   24   24   21   22   21   24   22   20   19   22

Correlation: 87.01% (r = 0.870127).
Additional material
Copulas
Let me here briefly mention a very important mathematical instrument in the study of dependent variables: copulas.

An N-dimensional copula is a joint cumulative distribution function of N variables with uniform marginal distribution functions:

C(u1, u2, …, uN) = P[U1 ≤ u1, U2 ≤ u2, …, UN ≤ uN],

where the Uk are uniformly distributed random variables.

Given N random variables X1, …, XN with joint distribution F(X1=x1, …, XN=xN) and marginal distributions FXk(Xk=xk) (k = 1…N), the copula provides a link between the joint distribution and the marginal distributions (Sklar's theorem):

F(X1 = x1, …, XN = xN) = C(FX1(X1 = x1), …, FXN(XN = xN))
and also

$$ C(u_1, \ldots, u_N) = F\big( F_{X_1}^{-1}(u_1), \ldots, F_{X_N}^{-1}(u_N) \big). $$
Copulas contain the full information about the structure of dependency between the variables. With respect to the joint distribution function, the copula has the advantage that one can consider only the marginal distributions FXk(Xk) and their copula C, without handling directly the multivariate joint distribution F(X1, …, XN). Sklar's theorem ensures that, given F(X1, …, XN) and the FXk(Xk), the copula function C is unique (if the marginals are continuous functions).
We will not enter here into any further technical detail on how best to measure dependency and causality. Let us assume that we have such a measure and call it generically "dependency", without necessarily referring to any specific form of linear or non-linear correlation. In general, such a "dependency" is a scalar ci,j, which measures the relative dependence between the variables Xi and Xj. Non-linearity may imply that different amplitudes of the signals are coupled with different strengths.