
V Lecture
Correlation, Partial Correlations, Multiple Correlations, Copulae:

Estimation and Testing

5.0. First of all, we would like to make some general comments on the similarities and differences between correlation and dependence.

Very often we are interested in correlations (dependencies) between a number of random variables and are trying to describe the “strength” of the (mutual) dependencies. For example, we would like to know whether there is a correlation (mutual, non-directed dependence) between the length of the arm and the length of the leg. But if we would like to obtain information about (or to predict) the length of the arm by measuring the length of the leg, we are dealing with the dependence of the arm’s length on the leg’s length. Both problems described in this example make sense.

On the other hand, there are other examples/situations in which only one of the two problems is interesting or makes sense. Studying the dependence between rain and crops makes perfect sense, but it makes no sense at all to study the (directed) influence of crops on rain.

In a nutshell, we can say that when studying mutual (linear) dependence we are dealing with correlation theory, whereas when studying the directed influence of one (input) variable on another (output) variable we are dealing with regression theory. It should be clearly pointed out, though, that correlation alone, no matter how strong, cannot help us identify the direction of influence and cannot help us in regression modelling. Our reasoning about the direction of influence has to come from outside statistical theory, from another (subject-matter) theory.

Another important point to always bear in mind is that, as already discussed in Lecture II, uncorrelated does not necessarily mean independent when the data are not multivariate normal. For multivariate normal data, however, the notions of “uncorrelated” and “independent” coincide.
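
To see the distinction concretely, take X ∼ N(0, 1) and Y = X²: then Cov(X, Y) = E(X³) = 0, so X and Y are uncorrelated, yet Y is a deterministic function of X. A minimal Python sketch illustrating this numerically, assuming numpy is available:

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
y = x**2                                 # y is a deterministic function of x

print(np.corrcoef(x, y)[0, 1])           # sample correlation: close to 0
print(np.mean(y[np.abs(x) > 2] > 4))     # yet |x| > 2 forces y > 4: prints 1.0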

5.1. Definition and estimation of partial correlation coefficients

In general, there are 3 types of correlation coefficients:

• The usual correlation coefficient between 2 variables

• Partial correlation coefficient between 2 variables after adjusting for the effect (regression, association) of a set of other variables.

• Multiple correlation between a single random variable and a set of p other variables

For $X \sim N_p(\mu, \Sigma)$ we defined the correlation coefficients
$$\rho_{ij} = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}}\sqrt{\sigma_{jj}}}, \qquad i, j = 1, 2, \ldots, p,$$
and discussed their MLEs $\hat{\rho}_{ij}$ in (3.6). It turned out that they coincide with the sample correlations $r_{ij}$ we introduced in the first lecture (formula (1.3)). To define partial correlation coefficients, recall Property 4 of the multivariate normal distribution from Lecture II:

If the vector $X \in \mathbb{R}^p$ is partitioned into
$$X = \begin{pmatrix} X^{(1)} \\ X^{(2)} \end{pmatrix}, \qquad X^{(1)} \in \mathbb{R}^r,\; r < p,\; X^{(2)} \in \mathbb{R}^{p-r},$$
and, according to this subdivision, the mean vector is $\mu = \begin{pmatrix} \mu^{(1)} \\ \mu^{(2)} \end{pmatrix}$ and the covariance matrix is partitioned as
$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$$
with $\Sigma_{22}$ of full rank, then the conditional density of $X^{(1)}$ given $X^{(2)} = x^{(2)}$ is
$$N_r\left(\mu^{(1)} + \Sigma_{12}\Sigma_{22}^{-1}(x^{(2)} - \mu^{(2)}),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right).$$

We define the partial correlations of $X^{(1)}$ given $X^{(2)} = x^{(2)}$ as the usual correlation coefficients calculated from the elements $\sigma_{ij.(r+1),(r+2),\ldots,p}$ of the matrix $\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$, i.e.
$$\rho_{ij.(r+1),(r+2),\ldots,p} = \frac{\sigma_{ij.(r+1),(r+2),\ldots,p}}{\sqrt{\sigma_{ii.(r+1),(r+2),\ldots,p}}\sqrt{\sigma_{jj.(r+1),(r+2),\ldots,p}}} \qquad (5.1)$$

To find ML estimates of these, we use the invariance property of the MLE: if
$$\hat{\Sigma} = \begin{pmatrix} \hat{\Sigma}_{11} & \hat{\Sigma}_{12} \\ \hat{\Sigma}_{21} & \hat{\Sigma}_{22} \end{pmatrix}$$
is the usual MLE of the covariance matrix, then $\hat{\Sigma}_{1|2} = \hat{\Sigma}_{11} - \hat{\Sigma}_{12}\hat{\Sigma}_{22}^{-1}\hat{\Sigma}_{21}$, with elements $\hat{\sigma}_{ij.(r+1),(r+2),\ldots,p}$, $i, j = 1, 2, \ldots, r$, is the MLE of $\Sigma_{1|2}$ and, correspondingly,
$$\hat{\rho}_{ij.(r+1),(r+2),\ldots,p} = \frac{\hat{\sigma}_{ij.(r+1),(r+2),\ldots,p}}{\sqrt{\hat{\sigma}_{ii.(r+1),(r+2),\ldots,p}}\sqrt{\hat{\sigma}_{jj.(r+1),(r+2),\ldots,p}}}, \qquad i, j = 1, 2, \ldots, r,$$
are the ML estimators of $\rho_{ij.(r+1),(r+2),\ldots,p}$, $i, j = 1, 2, \ldots, r$.

We call $\rho_{ij.(r+1),(r+2),\ldots,p}$ the correlation of the $i$th and $j$th components when the components $(r+1), (r+2), \ldots, p$ (i.e. the last $p-r$ components) are held fixed. The interpretation is that we are looking for the association (correlation) between the $i$th and $j$th components after eliminating the effect that the last $p-r$ components might have had on this association.

5.1.1. Simple formulae

When $p$ is not large, simple plug-in formulae, obtained as special cases of the general result above, express the partial correlation coefficients in terms of the usual correlation coefficients. We discuss such formulae now. (For higher-dimensional cases a computer should be used; the CORR procedure in SAS is one good option.) The formulae are given below:

i) partial correlation between the first and second variables, adjusting for the effect of the third:
$$\rho_{12.3} = \frac{\rho_{12} - \rho_{13}\rho_{23}}{\sqrt{(1-\rho_{13}^2)(1-\rho_{23}^2)}}$$

ii) partial correlation between the first and second variables, adjusting for the effects of the third and fourth variables:
$$\rho_{12.3,4} = \frac{\rho_{12.4} - \rho_{13.4}\rho_{23.4}}{\sqrt{(1-\rho_{13.4}^2)(1-\rho_{23.4}^2)}}$$

5.1.2. Example. Three variables have been measured for a set of schoolchildren:

i) $X_1$ - intelligence
ii) $X_2$ - weight
iii) $X_3$ - age

The number of observations was large enough that one can take the empirical correlation matrix $\hat{\rho} \in M_{3,3}$ to be the true correlation matrix:
$$\hat{\rho} = \begin{pmatrix} 1 & 0.6162 & 0.8267 \\ 0.6162 & 1 & 0.7321 \\ 0.8267 & 0.7321 & 1 \end{pmatrix}.$$
This suggests a high degree of positive dependence between weight and intelligence. But (do the calculation!) $\hat{\rho}_{12.3} = 0.0286$, so that, after the effect of age is adjusted for, there is virtually no correlation between weight and intelligence, i.e. weight plays little part in explaining intelligence.
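
As a numerical check of the example above (and of formula i) in 5.1.1), the partial correlation $\hat{\rho}_{12.3}$ can be computed either from the plug-in formula or directly from the conditional covariance block $\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$. A minimal Python sketch, assuming numpy is available (the variable names are mine):

import numpy as np

# Correlation matrix of (X1, X2, X3) = (intelligence, weight, age) from the example
R = np.array([[1.0,    0.6162, 0.8267],
              [0.6162, 1.0,    0.7321],
              [0.8267, 0.7321, 1.0   ]])

# Plug-in formula i) of Section 5.1.1
r12, r13, r23 = R[0, 1], R[0, 2], R[1, 2]
rho_12_3 = (r12 - r13 * r23) / np.sqrt((1 - r13**2) * (1 - r23**2))

# Equivalently, via the conditional block Sigma_{1|2} = Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21
S11, S12, S22 = R[:2, :2], R[:2, 2:], R[2:, 2:]
S_cond = S11 - S12 @ np.linalg.inv(S22) @ S12.T
rho_12_3_alt = S_cond[0, 1] / np.sqrt(S_cond[0, 0] * S_cond[1, 1])

print(round(rho_12_3, 4), round(rho_12_3_alt, 4))   # both 0.0286
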
5.2. Definition and estimation of the multiple correlation coefficient

Recall the discussion at the end of Section 2.2.2 of the best prediction in the mean-square sense in the case of multivariate normality: if we want to predict a random variable $Y$ that is correlated with $p$ random variables (predictors) $X = (X_1, X_2, \ldots, X_p)'$ by minimizing the expected value $E(Y - g(X))^2$, the optimal solution (i.e. the regression function) is $g^*(X) = E(Y|X)$. When the joint $(p+1)$-dimensional distribution of $Y$ and $X$ is normal, this function is linear in $X$. Given a specific realization $x$ of $X$ it is $b + \sigma_0'C^{-1}x$, where $b = E(Y) - \sigma_0'C^{-1}E(X)$, $C$ is the covariance matrix of the vector $X$ and $\sigma_0$ is the vector of covariances of $Y$ with $X_i$, $i = 1, \ldots, p$. The vector $C^{-1}\sigma_0 \in \mathbb{R}^p$ is the vector of regression coefficients.

Now let us define the multiple correlation coefficient between the random variable $Y$ and the random vector $X \in \mathbb{R}^p$ as the maximum correlation between $Y$ and any linear combination $\alpha'X$, $\alpha \in \mathbb{R}^p$. This makes sense: we look at the maximal correlation that can be achieved when trying to predict $Y$ by a linear function of the predictors. The solution, which also gives an algorithm to calculate (and estimate) the multiple correlation coefficient, is given in the next lemma.

Lemma 5.2.1. The multiple correlation coefficient is the ordinary correlation coefficient between $Y$ and $\sigma_0'C^{-1}X = \beta^{*\prime}X$.

Proof. Note that for any $\alpha \in \mathbb{R}^p$: $\mathrm{cov}(Y, \alpha'X) = \alpha'\sigma_0 = \alpha'C\beta^*$ and, in particular, $\mathrm{cov}(Y, \beta^{*\prime}X) = \beta^{*\prime}C\beta^*$. Using the Cauchy-Bunyakovsky-Schwarz inequality we have
$$[\mathrm{cov}(\alpha'X, \beta^{*\prime}X)]^2 \leq \mathrm{Var}(\alpha'X)\,\mathrm{Var}(\beta^{*\prime}X)$$
and therefore
$$\sigma_Y^2\,\rho^2(Y, \alpha'X) = \frac{(\alpha'\sigma_0)^2}{\alpha'C\alpha} = \frac{(\alpha'C\beta^*)^2}{\alpha'C\alpha} \leq \beta^{*\prime}C\beta^*,$$
where $\sigma_Y^2$ denotes the variance of $Y$. Equality in the last step is attained by choosing $\alpha = \beta^*$, i.e. the squared correlation coefficient $\rho^2(Y, \alpha'X)$ of $Y$ and $\alpha'X$ is maximized over $\alpha$ when $\alpha = \beta^*$.

From Lemma 5.2.1 we see that the maximum correlation between $Y$ and any linear combination $\alpha'X$, $\alpha \in \mathbb{R}^p$, is
$$R = \sqrt{\frac{\beta^{*\prime}C\beta^*}{\sigma_Y^2}}.$$
This is the multiple correlation coefficient. Its square $R^2$ is called the coefficient of determination. Bearing in mind that $\beta^* = C^{-1}\sigma_0$, we see that
$$R = \sqrt{\frac{\sigma_0'C^{-1}\sigma_0}{\sigma_Y^2}}.$$
If
$$\Sigma = \begin{pmatrix} \sigma_Y^2 & \sigma_0' \\ \sigma_0 & C \end{pmatrix} = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$$
is the partitioned covariance matrix of the $(p+1)$-dimensional vector $(Y, X')'$, then we know how to calculate the MLE of $\Sigma$,
$$\hat{\Sigma} = \begin{pmatrix} \hat{\Sigma}_{11} & \hat{\Sigma}_{12} \\ \hat{\Sigma}_{21} & \hat{\Sigma}_{22} \end{pmatrix},$$
so the MLE of $R$ is
$$\hat{R} = \sqrt{\frac{\hat{\Sigma}_{12}\hat{\Sigma}_{22}^{-1}\hat{\Sigma}_{21}}{\hat{\Sigma}_{11}}}.$$

5.2.2. Interpretation of R. At the end of Section 2.2.2 we derived the minimal value of the mean squared error when predicting $Y$ by a linear function of the vector $X$. It is achieved by the regression function and equals $\sigma_Y^2 - \sigma_0'C^{-1}\sigma_0$. This value can also be expressed through $R$: it equals $\sigma_Y^2(1 - R^2)$. Thus, when $R^2 = 0$ there is no predictive power at all, whereas in the opposite extreme case, $R^2 = 1$, $Y$ can be predicted without any error (it is a true linear function of $X$).

5.2.3. Numerical example. Let
$$\mu = \begin{pmatrix} \mu_Y \\ \mu_{X_1} \\ \mu_{X_2} \end{pmatrix} = \begin{pmatrix} 5 \\ 2 \\ 0 \end{pmatrix} \quad \text{and} \quad \Sigma = \begin{pmatrix} 10 & 1 & -1 \\ 1 & 7 & 3 \\ -1 & 3 & 2 \end{pmatrix} = \begin{pmatrix} \sigma_{YY} & \sigma_0' \\ \sigma_0 & \Sigma_{XX} \end{pmatrix}.$$
Calculate:

i) the best linear prediction of $Y$ using $X_1$ and $X_2$;
ii) the multiple correlation coefficient $R_{Y.(X_1,X_2)}$;
iii) the mean squared error of the best linear predictor.

Solution.
$$\beta^* = \Sigma_{XX}^{-1}\sigma_0 = \begin{pmatrix} 7 & 3 \\ 3 & 2 \end{pmatrix}^{-1}\begin{pmatrix} 1 \\ -1 \end{pmatrix} = \begin{pmatrix} 0.4 & -0.6 \\ -0.6 & 1.4 \end{pmatrix}\begin{pmatrix} 1 \\ -1 \end{pmatrix} = \begin{pmatrix} 1 \\ -2 \end{pmatrix}$$
and
$$b = \mu_Y - \beta^{*\prime}\mu_X = 5 - (1, -2)\begin{pmatrix} 2 \\ 0 \end{pmatrix} = 3.$$
Hence the best linear predictor is $3 + X_1 - 2X_2$. Further,
$$R_{Y.(X_1,X_2)} = \sqrt{\frac{(1, -1)\begin{pmatrix} 0.4 & -0.6 \\ -0.6 & 1.4 \end{pmatrix}\begin{pmatrix} 1 \\ -1 \end{pmatrix}}{10}} = \sqrt{\frac{3}{10}} = 0.548.$$
The mean squared error of prediction is $\sigma_Y^2(1 - R_{Y.(X_1,X_2)}^2) = 10(1 - \frac{3}{10}) = 7$.
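
The quantities in Example 5.2.3 can be verified directly from the partitioned covariance matrix; a minimal Python sketch, assuming numpy is available:

import numpy as np

mu_Y, mu_X = 5.0, np.array([2.0, 0.0])
sigma2_Y   = 10.0
sigma0     = np.array([1.0, -1.0])          # Cov(Y, X)
C          = np.array([[7.0, 3.0],
                       [3.0, 2.0]])         # Cov(X)

beta = np.linalg.solve(C, sigma0)           # regression coefficients: [1, -2]
b    = mu_Y - beta @ mu_X                   # intercept: 3
R2   = (sigma0 @ beta) / sigma2_Y           # squared multiple correlation: 0.3
mse  = sigma2_Y * (1 - R2)                  # mean squared prediction error: 7

print(beta, b, np.sqrt(R2), mse)            # R = sqrt(0.3) is approximately 0.548
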
5.2.4. Remark about the calculation of R². Sometimes only the correlation matrix may be available. It can be shown that in that case the relation
$$1 - R^2 = \frac{1}{\rho^{YY}} \qquad (5.2)$$
holds. In (5.2), $\rho^{YY}$ is the upper left-hand entry of the inverse of the correlation matrix $\rho \in M_{p+1,p+1}$ determined from $\Sigma$. We note that the relation $\rho = V^{-\frac{1}{2}}\Sigma V^{-\frac{1}{2}}$ holds with
$$V = \begin{pmatrix} \sigma_Y^2 & 0 & \cdots & 0 \\ 0 & c_{11} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & c_{pp} \end{pmatrix},$$
the $c_{ii}$ being the diagonal elements of $C$. One can use (5.2) to calculate $R^2$ by first calculating the right-hand side of (5.2).

To show equality (5.2) we note that, by the determinant formula for a partitioned matrix, $|\Sigma| = |C|\,(\sigma_Y^2 - \sigma_0'C^{-1}\sigma_0)$, so
$$1 - R^2 = \frac{\sigma_Y^2 - \sigma_0'C^{-1}\sigma_0}{\sigma_Y^2} = \frac{|C|}{|C|}\cdot\frac{\sigma_Y^2 - \sigma_0'C^{-1}\sigma_0}{\sigma_Y^2} = \frac{|\Sigma|}{|C|\,\sigma_Y^2}.$$
But $\sigma^{YY} = |C|/|\Sigma|$ is the entry in the first row and first column of $\Sigma^{-1}$. Since $\rho^{-1} = V^{\frac{1}{2}}\Sigma^{-1}V^{\frac{1}{2}}$, we see that $\rho^{YY} = \sigma^{YY}\sigma_Y^2$ holds. Therefore $1 - R^2 = \frac{1}{\rho^{YY}}$.

5.3. Testing of correlation coefficients

5.3.1. Usual correlation coefficients

When considering the distribution of a particular correlation coefficient $\hat{\rho}_{ij} = r_{ij}$, the problem becomes bivariate because only the variables $X_i$ and $X_j$ are involved. Direct transformations with the bivariate normal can be used to derive the exact distribution of $r_{ij}$ under the hypothesis $H_0: \rho_{ij} = 0$. It turns out that in this case the statistic
$$T = \frac{r_{ij}\sqrt{n-2}}{\sqrt{1-r_{ij}^2}} \sim t_{n-2},$$
and tests can be performed using tables of the $t$-distribution. For other hypothesized values the derivations are more involved. There is one frequently used approximation that holds no matter what the true value of $\rho_{ij}$ is; we discuss it here.

Consider Fisher's $z$ transformation $Z = \frac{1}{2}\log\frac{1+r_{ij}}{1-r_{ij}}$. Under the hypothesis $H_0: \rho_{ij} = \rho_0$ it holds that
$$Z \approx N\left(\frac{1}{2}\log\frac{1+\rho_0}{1-\rho_0},\; \frac{1}{n-3}\right).$$
In particular, in the most common situation, when one would like to test $H_0: \rho_{ij} = 0$ versus $H_1: \rho_{ij} \neq 0$, one rejects $H_0$ at the 5% significance level if $|Z|\sqrt{n-3} \geq 1.96$.

Based on the above, you should now be able to suggest how to test the hypothesis of equality of two correlation coefficients from two different populations(!).

5.3.2. Partial correlation coefficients

Turning to testing partial correlations, not much has to be changed. Fisher's $Z$ approximation can be used again in the following way: to test $H_0: \rho_{ij.r+1,r+2,\ldots,p} = \rho_0$ versus $H_1: \rho_{ij.r+1,r+2,\ldots,p} \neq \rho_0$ we construct $Z = \frac{1}{2}\log\frac{1+r_{ij.r+1,r+2,\ldots,p}}{1-r_{ij.r+1,r+2,\ldots,p}}$ and $a = \frac{1}{2}\log\frac{1+\rho_0}{1-\rho_0}$. Asymptotically $Z \sim N\left(a, \frac{1}{n-(p-r)-3}\right)$ holds. Hence the test statistic to be compared with the significance points of the standard normal is $\sqrt{n-(p-r)-3}\,|Z-a|$.

5.3.3. Multiple correlation coefficients

It turns out that under the hypothesis $H_0: R = 0$ the statistic
$$F = \frac{\hat{R}^2}{1-\hat{R}^2}\cdot\frac{n-p}{p-1} \sim F_{p-1,n-p}.$$
Hence, when testing the significance of the multiple correlation, the rejection region is $\left\{\frac{\hat{R}^2}{1-\hat{R}^2}\cdot\frac{n-p}{p-1} > F_{p-1,n-p}(\alpha)\right\}$ for a given significance level $\alpha$.


Remark. It should be stressed that the value of p in Section 5.3.3 refers to the total number of all variables (the output Y and all of the input variables in the input vector X). This is different from the value of p that was used in Section 5.2. In other words, the p in Section 5.3.3 is the p + 1 of Section 5.2.
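
A minimal Python sketch of the tests in 5.3.1-5.3.3, assuming scipy is available; the function names and the sample values of r, n and the estimated R² below are mine and purely illustrative:

import numpy as np
from scipy import stats

# 5.3.1 / 5.3.2: test H0: rho = rho0 via Fisher's z transformation.
# k is the number of variables held fixed (k = 0 for an ordinary
# correlation, k = p - r for a partial correlation as in 5.3.2).
def fisher_z_test(r, rho0, n, k=0):
    z = 0.5 * np.log((1 + r) / (1 - r))
    a = 0.5 * np.log((1 + rho0) / (1 - rho0))
    stat = np.sqrt(n - k - 3) * abs(z - a)
    return stat, 2 * (1 - stats.norm.cdf(stat))      # statistic, two-sided p-value

print(fisher_z_test(r=0.45, rho0=0.0, n=50))         # illustrative numbers

# 5.3.3: test H0: R = 0; here p is the total number of variables (Y and X).
def multiple_corr_test(R2_hat, n, p):
    F = R2_hat / (1 - R2_hat) * (n - p) / (p - 1)
    return F, 1 - stats.f.cdf(F, p - 1, n - p)       # statistic, p-value

print(multiple_corr_test(R2_hat=0.3, n=50, p=3))     # illustrative numbers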

5.4. Copulae. For the multivariate normal distribution, independence is equivalent to the absence of correlation between any two components. In this case the joint cdf is a product of the marginals. When independence is violated, the relation between the joint multivariate distribution and the marginals is more involved. An interesting concept that can be used to describe this more involved relation is the copula. We focus on the two-dimensional case for simplicity. A copula is a function C : [0, 1]² → [0, 1] with the properties:

i) C(0, u) = C(u, 0) = 0 for all u ∈ [0, 1].
ii) C(u, 1) = C(1, u) = u for all u ∈ [0, 1].
iii) For all pairs (u1, u2), (v1, v2) ∈ [0, 1]× [0, 1] with u1 ≤ v1, u2 ≤ v2 :

C(v1, v2)− C(v1, u2)− C(u1, v2) + C(u1, u2) ≥ 0.

The name reflects the fact that the copula links (couples) the multivariate distribution to its marginals. This is made precise in the following celebrated theorem of Sklar:

Let F (., .) be a joint cdf with marginal cdf ’s FX1(.) and FX2(.). Then there exists a
copula C(., .) with the property

F (x1, x2) = C(FX1(x1), FX2(x2))

for every pair (x1, x2) ∈ R². When FX1(.) and FX2(.) are continuous, the above copula is unique. Conversely, if C(., .) is a copula and FX1(.), FX2(.) are cdfs, then the function F(x1, x2) = C(FX1(x1), FX2(x2)) is a joint cdf with marginals FX1(.) and FX2(.).

Taking derivatives we also get
$$f(x_1, x_2) = c(F_{X_1}(x_1), F_{X_2}(x_2))\, f_{X_1}(x_1)\, f_{X_2}(x_2),$$
where
$$c(u, v) = \frac{\partial^2}{\partial u\,\partial v}\, C(u, v)$$
is the density of the copula. This relation clearly shows that the contribution to the joint density of X1, X2 comes from two parts: one that comes from the copula and is “responsible” for the dependence (the factor c(u, v)), and another which takes into account marginal information only (the factor fX1(x1)fX2(x2)).

It is also clear that independence of X1 and X2 corresponds to the copula Π(u, v) = uv (called the independence copula).

These concepts generalize to p dimensions with p > 2.

An interesting example is the Gaussian copula. For p = 2 it is equal to
$$C_\rho(u, v) = \int_{-\infty}^{\Phi^{-1}(u)} \int_{-\infty}^{\Phi^{-1}(v)} f_\rho(x_1, x_2)\, dx_2\, dx_1.$$
Here fρ(., .) is the joint bivariate normal density with zero means, unit variances and correlation ρ, and Φ−1(.) is the inverse of the cdf of the standard normal. (This is “the formula that killed Wall Street”.) When ρ = 0 we get C0(u, v) = uv, as is to be expected.
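
A minimal Python sketch of the bivariate Gaussian copula, assuming scipy is available: Cρ(u, v) is obtained by evaluating the bivariate normal cdf with correlation ρ at (Φ−1(u), Φ−1(v)), and for ρ = 0 it reduces to uv.

import numpy as np
from scipy import stats

def gaussian_copula(u, v, rho):
    # C_rho(u, v): bivariate normal cdf with correlation rho,
    # evaluated at (Phi^{-1}(u), Phi^{-1}(v)).
    x = stats.norm.ppf([u, v])
    cov = np.array([[1.0, rho], [rho, 1.0]])
    return stats.multivariate_normal(mean=[0.0, 0.0], cov=cov).cdf(x)

u, v = 0.3, 0.7
print(gaussian_copula(u, v, rho=0.0), u * v)   # both approximately 0.21
print(gaussian_copula(u, v, rho=0.8))          # larger: stronger positive dependence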

Non-Gaussian copulae are much more important in practice, and inference methods for copulae are a hot topic in Statistics. The reason for the importance of non-Gaussian copulae is that Gaussian copulae do not model tail dependence reasonably well: under a Gaussian copula, joint extreme events have virtually zero probability. Especially in financial applications, it is very important to be able to model dependence in the tails. The Gumbel-Hougaard copula is much more flexible in modeling dependence in the upper tails. For an arbitrary dimension p it is defined as

$$C^{GH}_{\theta}(u_1, u_2, \ldots, u_p) = \exp\Big\{-\Big[\sum_{j=1}^{p} (-\log u_j)^{\theta}\Big]^{1/\theta}\Big\},$$

where θ ∈ [1,∞) is a parameter that governs the strength of the dependence. You can easily check that the Gumbel-Hougaard copula reduces to the independence copula when θ = 1 and to the Fréchet-Hoeffding upper bound copula min(u1, . . . , up) when θ → ∞.
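
A minimal Python sketch of the Gumbel-Hougaard copula, assuming numpy is available, illustrating the two limiting cases just mentioned:

import numpy as np

def gumbel_hougaard(u, theta):
    # C^GH_theta(u_1, ..., u_p) for a vector u with entries in (0, 1].
    u = np.asarray(u, dtype=float)
    return np.exp(-np.sum((-np.log(u))**theta)**(1.0 / theta))

u = [0.3, 0.7, 0.9]
print(gumbel_hougaard(u, theta=1.0), np.prod(u))   # theta = 1: independence copula, both 0.189
print(gumbel_hougaard(u, theta=50.0), np.min(u))   # large theta: approaches min(u_1, ..., u_p)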

The Gumbel-Hougaard copula is also an example of the so-called Archimedean copulae. The latter are characterized by their generator φ(.): a continuous, strictly decreasing function from [0, 1] to [0,∞] such that φ(1) = 0. The Archimedean copula is then defined via its generator as
$$C(u_1, u_2, \ldots, u_p) = \phi^{-1}(\phi(u_1) + \cdots + \phi(u_p)).$$

Exercise. Show that the Gumbel-Hougaard copula is an Archimedean copula with generator φ(t) = (−log t)^θ.
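
The Archimedean construction itself is straightforward to implement once a generator is chosen. As a purely numerical illustration (not a substitute for the analytic argument asked for in the exercise), the following Python sketch builds the copula from the generator φ(t) = (−log t)^θ, whose inverse is φ^{-1}(s) = exp(−s^{1/θ}), and compares it with the direct Gumbel-Hougaard formula; numpy is assumed.

import numpy as np

theta = 2.5
phi     = lambda t: (-np.log(t))**theta           # generator of the Gumbel-Hougaard family
phi_inv = lambda s: np.exp(-s**(1.0 / theta))     # its inverse on [0, infinity)

def archimedean(u):
    # C(u_1, ..., u_p) = phi^{-1}(phi(u_1) + ... + phi(u_p))
    u = np.asarray(u, dtype=float)
    return phi_inv(np.sum(phi(u)))

u = [0.3, 0.7, 0.9]
direct = np.exp(-np.sum((-np.log(np.asarray(u)))**theta)**(1.0 / theta))   # C^GH_theta(u) directly
print(archimedean(u), direct)                     # the two values coincide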

The benefit of using Archimedean copulae is that they allow a simple description of the p-dimensional dependence through a function of one argument only (the generator). However, it is immediately seen that an Archimedean copula is symmetric in its arguments, which limits its applicability for modelling asymmetric dependencies. The so-called Liouville copulae are an extension of the Archimedean copulae and can also be used to model dependencies that are not symmetric in their arguments.

Inference procedures for copulae are implemented in the COPULA procedure of the SAS/ETS package.
