
MAST 90138: MULTIVARIATE STATISTICAL TECHNIQUES 3 MEAN, COVARIANCE, CORRELATION
Sections 3.1, 3.2, 3.3 in Härdle and Simar (2015).
3.2 COVARIANCE MATRIX
If X = (X1, . . . , Xp)T is a p-dimensional random vector, we can compute the covariance between each pair Xi and Xj and collect all pairwise covariances in the p × p covariance matrix Σ:
\Sigma = \begin{pmatrix} \sigma_{X_1X_1} & \cdots & \sigma_{X_1X_p} \\ \vdots & & \vdots \\ \sigma_{X_pX_1} & \cdots & \sigma_{X_pX_p} \end{pmatrix} = \begin{pmatrix} \sigma_{11} & \cdots & \sigma_{1p} \\ \vdots & & \vdots \\ \sigma_{p1} & \cdots & \sigma_{pp} \end{pmatrix}

• To highlight that it is the covariance of X we can write ΣX.
• Σ is symmetric: Σ = ΣT.
• Σ is positive semi-definite: Σ ≥ 0.
• In matrix notation
Σ = E{(X − μ)(X − μ)T },
where X and μ are written as column p-vectors.

• In practice Σ is mostly unknown.
• But we can estimate it from a sample X1, . . . , Xn by the sample covariance matrix
S = \begin{pmatrix} S_{X_1X_1} & \cdots & S_{X_1X_p} \\ \vdots & & \vdots \\ S_{X_pX_1} & \cdots & S_{X_pX_p} \end{pmatrix} = \begin{pmatrix} S_{11} & \cdots & S_{1p} \\ \vdots & & \vdots \\ S_{p1} & \cdots & S_{pp} \end{pmatrix},
where, for j, k = 1, …, p,
S_{X_jX_k} = S_{jk} = \frac{1}{n-1} \sum_{i=1}^{n} (X_{ij} - \bar X_j)(X_{ik} - \bar X_k)
is the sample covariance between Xj and Xk.
• Note that S is symmetric (S = ST) and positive semi-definite.

• In matrix notation
S = \frac{1}{n-1} (X - 1_n \bar X^T)^T (X - 1_n \bar X^T) = \frac{1}{n-1} X^T X - \frac{n}{n-1} \bar X \bar X^T ,
where X is the n × p data matrix, 1_n is the n × 1 vector of ones, and X̄ is written as a column p-vector.
• Note that S is also positive semi-definite.
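• As a quick numerical illustration (not part of the original notes), the following minimal numpy sketch computes S from the centred data matrix and checks it against both the alternative formula above and numpy's built-in estimator; the simulated data and variable names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))           # n x p data matrix

xbar = X.mean(axis=0)                 # sample mean, a p-vector
Xc = X - xbar                         # centred data, X - 1_n xbar^T
S = Xc.T @ Xc / (n - 1)               # sample covariance matrix

# Same result via the alternative form (1/(n-1)) X^T X - (n/(n-1)) xbar xbar^T
S_alt = X.T @ X / (n - 1) - n / (n - 1) * np.outer(xbar, xbar)

print(np.allclose(S, np.cov(X, rowvar=False)))  # True
print(np.allclose(S, S_alt))                    # True
```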

3.3 CORRELATION MATRIX
• Problem with covariance matrix: not unit invariant, i.e. if we change
the units, covariances change.
• Correlation: a measure of linear dependence which is unit invariant.
• The correlation matrix P of a random vector X = (X1,…,Xp)T is a p × p matrix defined by:
P = \begin{pmatrix} 1 & \rho_{12} & \cdots & \rho_{1p} \\ \rho_{21} & 1 & \cdots & \rho_{2p} \\ \vdots & & \ddots & \vdots \\ \rho_{p1} & \rho_{p2} & \cdots & 1 \end{pmatrix},
where
\rho_{ij} = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\sigma_{jj}}}
is the correlation between Xi and Xj.

• We always have −1 ≤ ρij ≤ 1.
• ρij is a measure of the linear relationship between Xi and Xj.
• |ρij| = 1 means a perfect linear relationship.
• ρij = 0 means absence of a linear relationship, but does not imply independence.

[Figures: scatterplots of Y against X illustrating different strengths and signs of the linear relationship between two variables.]

• In matrix notation
P = D−1/2ΣD−1/2 ,
where Σ is the p × p covariance matrix and
D = diag(σ11,…,σpp)
is the p × p diagonal matrix of variances.

• In practice P is mostly unknown. Can estimate it from a sample X1, . . . , Xn by the sample correlation matrix
R = \begin{pmatrix} R_{11} & \cdots & R_{1p} \\ \vdots & & \vdots \\ R_{p1} & \cdots & R_{pp} \end{pmatrix},
where, for j, k = 1, …, p,
R_{jk} = \frac{s_{jk}}{\sqrt{s_{jj}s_{kk}}}
is the sample correlation between Xj and Xk.

• In matrix notation we can write
R = D−1/2SD−1/2 ,
where S is the p × p sample covariance matrix and, on this occasion, D = diag(s11,…,spp)
is the p × p diagonal matrix of sample variances.
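• A minimal numpy sketch of the relation R = D−1/2SD−1/2, checked against numpy.corrcoef; the simulated data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                # arbitrary n x p data matrix

S = np.cov(X, rowvar=False)                  # sample covariance matrix
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(S)))
R = D_inv_sqrt @ S @ D_inv_sqrt              # R = D^{-1/2} S D^{-1/2}

print(np.allclose(R, np.corrcoef(X, rowvar=False)))  # True
```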

3.4 LINEAR TRANSFORMATIONS
Let X = (X1, …, Xp)T be a p-vector and let Y be a q-vector defined by
Y =AX+b,
where A is a q × p matrix and b is a q × 1 vector. Then we have
E(Y) = A E(X) + b,    Ȳ = A X̄ + b,
ΣY = A ΣX AT,    SY = A SX AT.
• The fact that ΣY = A ΣX AT can be used to show that any covariance matrix must be positive semi-definite. (How? Make sure you know!)
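• The identities Ȳ = A X̄ + b and SY = A SX AT hold exactly for any data set, which the following small numpy sketch (with an arbitrary A, b and simulated data) verifies numerically.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q = 500, 3, 2
X = rng.normal(size=(n, p))      # arbitrary data, rows are observations

A = rng.normal(size=(q, p))      # arbitrary q x p matrix
b = rng.normal(size=q)           # arbitrary q-vector
Y = X @ A.T + b                  # each row: Y_i = A X_i + b

S_X = np.cov(X, rowvar=False)
S_Y = np.cov(Y, rowvar=False)

print(np.allclose(Y.mean(axis=0), A @ X.mean(axis=0) + b))  # Ybar = A Xbar + b
print(np.allclose(S_Y, A @ S_X @ A.T))                       # S_Y = A S_X A^T
```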

4 MULTIVARIATE DISTRIBUTIONS
4.1 DISTRIBUTION AND DENSITY FUNCTION
Sections 4.1, 4.2 in Härdle and Simar (2015).
Let X = (X1, …, Xp)T be a random vector.
• For all x = (x1, . . . , xp)T ∈ Rp, the cumulative distribution function
(cdf), or distribution function, of X is defined by
F(x) = P(X ≤ x) = P(X1 ≤ x1,…,Xp ≤ xp)

• If X is continuous, the probability density function (pdf) or density, f, of X is a nonnegative function defined through the following equation:
F(x) = \int_{-\infty}^{x} f(u)\, du ;
it always satisfies
\int_{-\infty}^{\infty} f(u)\, du = 1.
• The integrals are p-variate, u ∈ Rp but f(u) ∈ R:
\int_{-\infty}^{x} f(u)\, du = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_p} f(u_1, \ldots, u_p)\, du_1 \cdots du_p .

• The marginal cdf of a subset of X is obtained from the cdf of X computed at that subset, letting the other arguments equal infinity.
• e.g. the marginal cdf of X1 is
FX1(x1) = P(X1 ≤ x1) = P(X1 ≤ x1, X2 ≤ ∞, …, Xp ≤ ∞) = FX(x1, ∞, …, ∞).
• e.g. the marginal cdf of (X1, X3) is
FX1,X3(x1, x3) = P(X1 ≤ x1, X3 ≤ x3) = P(X1 ≤ x1, X2 ≤ ∞, X3 ≤ x3, X4 ≤ ∞, …, Xp ≤ ∞) = FX(x1, ∞, x3, ∞, …, ∞).

• For a continuous random vector X, the marginal density of a subset of X is obtained from the joint density f of X by integrating out the other components.
• e.g. the marginal density of X1 is
f_{X_1}(x_1) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1, u_2, \ldots, u_p)\, du_2 \cdots du_p ;
• e.g. the marginal density of (X1, X3) is
f_{X_1,X_3}(x_1, x_3) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1, u_2, x_3, u_4, \ldots, u_p)\, du_2\, du_4 \cdots du_p .

• For two continuous random variables X1 and X2, the conditional pdf of X2 given X1 is obtained by taking
f(x2|x1) = f(x1, x2)/fX1(x1) . (Defined only for values x1 such that fX1(x1) > 0)
• Two continuous random variables X1 and X2 are independent if and only if
f(x1, x2) = fX1(x1)fX2(x2)
• If X1 and X2 are independent then
fX2|X1(x2|x1) = f(x1, x2)/fX1(x1) = fX1(x1)fX2(x2)/fX1(x1) = fX2(x2) .
(Knowing the value of X1 does not change the probability assessments on X2 and vice versa.)

• The mean μ ∈ Rp of X = (X1,…,Xp)T is defined by
μ1  E(X1) 􏱉 xfX (x)dx

μ= . = . = . .
μp E(Xp) 􏱉 xfXp(x) dx
 If X and Y are two p-vectors and α and β are constants then
E(αX + βY ) = αE(X) + βE(Y ).
IfX isap×1vectorwhichisindependentoftheq×1vectorY then
E(XY T ) = E(X)E(Y T ).
 Hint: Remember to always check that matrix dimensions are compatible.

The conditional expectation E(X2|X1 = x1) is defined by
E(X_2 | X_1 = x_1) = \int x_2 f_{X_2|X_1}(x_2 | x_1)\, dx_2
and the conditional (co)variance var(X2|X1 = x1) is defined by
var(X2|X1 = x1) = E(X2X2T |X1 = x1) − E(X2|X1 = x1)E(X2T |X1 = x1) ,
if X2 is a column vector.

• As seen earlier, the covariance Σ of a vector X of mean μ is defined by Σ = E{(X − μ)(X − μ)T } .
We write
X ∼ (μ, Σ)
to denote a vector X with mean μ and covariance Σ.
• We can also define a covariance matrix between a p × 1 vector X of mean μ and a q × 1 vector Y of mean ν by
ΣX,Y = cov(X, Y ) = E{(X − μ)(Y − ν)T } = E(XY T ) − E(X)E(Y T ).
The elements of this matrix are the pairwise covariances between the components of X and those of Y.

• We have
cov(X + Y, Z) = cov(X, Z) + cov(Y, Z).
• We have
var(X + Y ) = var(X) + cov(X, Y ) + cov(Y, X) + var(Y ).
• For matrices A and B and random vectors X and Y such that the quantities below are well defined we have
cov(AX, BY ) = A cov(X, Y ) BT .

4.2 MULTINORMAL DISTRIBUTION
Sections 4.4, 4.5, 5.1 in Härdle and Simar (2015).
• Arguably the most common distribution we encounter.
• Recall that in the univariate case, the density of a N(μ, σ2) is given by
f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\{-(x-\mu)^2/(2\sigma^2)\}.
• In the multivariate case, we need to deal with vectors and matrices.

• Recall that
\Sigma = \begin{pmatrix} \sigma_{11} & \cdots & \sigma_{1p} \\ \vdots & & \vdots \\ \sigma_{p1} & \cdots & \sigma_{pp} \end{pmatrix} = \begin{pmatrix} \sigma_1^2 & \cdots & \sigma_{1p} \\ \vdots & & \vdots \\ \sigma_{p1} & \cdots & \sigma_p^2 \end{pmatrix},
where σj2 = var(Xj).
• The density of the multinormal (or simply normal) distribution with mean μ and covariance Σ is given by
f(x) = |2\pi\Sigma|^{-1/2} \exp\{-\tfrac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\} .   (1)
• If the p-vector X is normal with mean μ and covariance Σ we write X ∼ Np(μ, Σ).
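• As a sanity check, the sketch below evaluates formula (1) directly and compares it with scipy's multivariate normal density; μ, Σ and x are arbitrary illustrative values.

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([0.3, -1.5])

# Density formula (1): |2 pi Sigma|^{-1/2} exp{ -(x-mu)^T Sigma^{-1} (x-mu) / 2 }
d = x - mu
quad = d @ np.linalg.solve(Sigma, d)
f = np.linalg.det(2 * np.pi * Sigma) ** -0.5 * np.exp(-0.5 * quad)

print(np.isclose(f, multivariate_normal(mean=mu, cov=Sigma).pdf(x)))  # True
```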

If the Xi’s are independent, then
\Sigma = \begin{pmatrix} \sigma_1^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sigma_p^2 \end{pmatrix} = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_p^2).
Thus
|2\pi\Sigma| = |\mathrm{diag}(2\pi\sigma_1^2, \ldots, 2\pi\sigma_p^2)| = (2\pi)^p \sigma_1^2 \cdots \sigma_p^2
and
\Sigma^{-1} = \mathrm{diag}(\sigma_1^{-2}, \ldots, \sigma_p^{-2}),
so that
f(x) = \frac{1}{\sqrt{(2\pi)^p}\, \prod_{j=1}^{p} \sigma_j} \exp\Big\{-\frac{1}{2}\sum_{j=1}^{p} (x_j - \mu_j)^2/\sigma_j^2\Big\} = \prod_{j=1}^{p} \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\big\{-(x_j - \mu_j)^2/(2\sigma_j^2)\big\}
is the product of densities of p univariate N(μj, σj2).
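• A quick numerical check (with arbitrary μj, σj and x) that the Np density with diagonal Σ factorises into the product of univariate N(μj, σj2) densities:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

mu = np.array([0.0, 1.0, -1.0])
sig = np.array([0.5, 1.0, 2.0])          # standard deviations sigma_j
x = np.array([0.2, 0.7, -2.0])

joint = multivariate_normal(mean=mu, cov=np.diag(sig**2)).pdf(x)
product = np.prod(norm.pdf(x, loc=mu, scale=sig))

print(np.isclose(joint, product))        # True: the density factorises when Sigma is diagonal
```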

• When we defined the multivariate normal distribution above, we implicitly assumed that Σ is non-singular, i.e. Σ > 0.
• Strictly speaking, we can also define a normal distribution when Σ is only positive semi-definite, i.e. Σ ≥ 0. However, in this case, we end up with a degenerate normal distribution for which a density function cannot be defined on Rp. (All probability mass lies in a lower-dimensional subspace of Rp.)
• In fact, even a constant number c ∈ R is trivially a degenerate normal random variable with variance 0!

We can see from (1) that the density of a Np(μ, Σ) is constant when (x−μ)TΣ−1(x−μ)
is constant. Now for positive constant c,
(x − μ)T Σ−1(x − μ) = c
corresponds to an ellipsoid. The quantity
\sqrt{(x-\mu)^T \Sigma^{-1}(x-\mu)}
is called the Mahalanobis distance between x and μ.
For example in p = 2 dimensions:

[Figure 4.3 of Härdle and Simar (MVAcontnorm): scatterplot of a bivariate normal sample together with the contour ellipses of its density, for μ = (3, 2)T and Σ with entries (1, −1.5; −1.5, 4). According to Theorem 2.7 in Sect. 2.6 of Härdle and Simar, the half-lengths of the axes of the contour ellipsoid are d√λi, where the λi are the eigenvalues of Σ and d is the contour level; the rectangle circumscribing the contour ellipse has sides of length 2d√λi.]
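• The sketch below computes the half-lengths d√λi of the contour ellipse axes from the eigendecomposition of Σ and verifies that every point of the resulting ellipse is at Mahalanobis distance d from μ; the μ and Σ used are illustrative values (any positive definite Σ would do).

```python
import numpy as np

mu = np.array([3.0, 2.0])
Sigma = np.array([[1.0, -1.5],
                  [-1.5, 4.0]])
d = 1.0                                   # contour level: (x-mu)^T Sigma^{-1} (x-mu) = d^2

# Eigen-decomposition Sigma = Gamma Lambda Gamma^T
lam, Gamma = np.linalg.eigh(Sigma)

# Half-lengths of the ellipse axes are d * sqrt(lambda_i),
# and the axes point in the directions of the eigenvectors.
half_lengths = d * np.sqrt(lam)
print(half_lengths)

# Points on the contour ellipse: mu + Gamma @ diag(d*sqrt(lam)) @ (cos t, sin t)^T
t = np.linspace(0, 2 * np.pi, 200)
circle = np.vstack([np.cos(t), np.sin(t)])
ellipse = mu[:, None] + Gamma @ (half_lengths[:, None] * circle)

# Every point on the ellipse has the same Mahalanobis distance d from mu.
diffs = ellipse - mu[:, None]
md = np.sqrt(np.einsum('ij,ij->j', diffs, np.linalg.solve(Sigma, diffs)))
print(np.allclose(md, d))                 # True
```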

• Let X ∼ Np(μ, Σ), A a q × p matrix and b a q × 1 vector. Then
Y = AX + b ∼ Nq(Aμ + b, AΣAT ).
• Let X = (X1T, X2T)T ∼ Np(μ, Σ) where
\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}
and var(X1) = Σ11, var(X2) = Σ22. Then
Σ12 = 0 ⇐⇒ X1 and X2 are independent.
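• A Monte Carlo sketch (with arbitrary μ, Σ, A and b) illustrating that Y = AX + b has mean Aμ + b and covariance AΣAT when X ∼ Np(μ, Σ):

```python
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([1.0, 0.0, -1.0])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.5],
                  [0.0, 0.5, 1.5]])
A = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, -1.0]])
b = np.array([0.5, -0.5])

X = rng.multivariate_normal(mu, Sigma, size=100_000)   # rows are draws from N_p(mu, Sigma)
Y = X @ A.T + b                                        # Y = A X + b, row by row

# Monte Carlo check of Y ~ N_q(A mu + b, A Sigma A^T)
print(np.round(Y.mean(axis=0) - (A @ mu + b), 2))              # approximately 0
print(np.round(np.cov(Y, rowvar=False) - A @ Sigma @ A.T, 2))  # approximately 0 (sampling error only)
```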

• If X ∼ Np(μ, Σ) and A and B are matrices with p columns, then
AX and BX are independent ⇐⇒ AΣBT = 0 . (2)
• If X1, …, Xn are i.i.d. ∼ Np(μ, Σ), then
X̄ ∼ Np(μ, Σ/n) . (3)
• If Z1, …, Zn are independent N(0, 1) then
X = \sum_{k=1}^{n} Z_k^2 ∼ χ2n
is a chi square with n degrees of freedom. (That’s how chi-square distributions are defined.)

• If X ∼ Np(μ, Σ) and Σ is invertible, then
Y = (X − μ)T Σ−1(X − μ) ∼ χ2p . (4)
– To show this: First write Σ = Σ1/2Σ1/2 with spectral decomposition. (How? Make sure you know!)
⇒ Σ−1 = Σ−1/2Σ−1/2
– But then Σ−1/2(X − μ) ≡ Z ∼ Np(0, Ip)
– So Y = Z T Z ∼ χ2p by the definition of the chi-square distributions.
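• A quick simulation check of (4): the empirical quantiles of (X − μ)T Σ−1(X − μ) should match the χ2p quantiles (arbitrary μ and Σ; agreement is up to sampling error only).

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(4)
p = 3
mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])

X = rng.multivariate_normal(mu, Sigma, size=100_000)
diff = X - mu
Y = np.einsum('ij,ij->i', diff, np.linalg.solve(Sigma, diff.T).T)  # (X-mu)^T Sigma^{-1} (X-mu)

# Compare empirical quantiles of Y with chi^2_p quantiles
qs = [0.5, 0.9, 0.99]
print(np.round(np.quantile(Y, qs), 2))
print(np.round(chi2.ppf(qs, df=p), 2))    # should be close
```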

4.3 WISHART DISTRIBUTION
• The Wishart distribution is a generalisation to multiple dimensions of the chi square distribution.
It depends on 3 parameters: p, a p × p scale matrix Σ and the number of degrees of freedom n:
Wp(Σ, n) .
• If M is a p × n matrix whose columns are independent and all have a Np(0, Σ) distribution, then
ℳ = MMT ∼ Wp(Σ, n).
• Note that in the definition above, Σ doesn’t have to be strictly positive definite, nor is there any restriction on n and p.
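• The sketch below builds Wishart draws directly from the definition ℳ = MMT (with M having independent Np(0, Σ) columns) and compares their average with the Wishart mean nΣ and with draws from scipy's wishart; Σ, n and the simulation sizes are arbitrary.

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(5)
p, n = 2, 10
Sigma = np.array([[1.0, 0.4],
                  [0.4, 2.0]])
n_rep = 50_000

# Columns of M are iid N_p(0, Sigma); each draw of M M^T is one Wishart realisation
cols = rng.multivariate_normal(np.zeros(p), Sigma, size=(n_rep, n))   # shape (n_rep, n, p)
W = np.einsum('rni,rnj->rij', cols, cols)                             # each W[r] = M M^T

# E{W_p(Sigma, n)} = n * Sigma; compare with the simulated mean and with scipy's Wishart
print(np.round(W.mean(axis=0), 2))
print(np.round(n * Sigma, 2))
W_scipy = wishart(df=n, scale=Sigma).rvs(size=n_rep, random_state=rng)
print(np.round(W_scipy.mean(axis=0), 2))   # should be close to n * Sigma as well
```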

• However, when ℳ is Wishart-distributed, it must be non-negative definite.
(By definition, ℳ can be represented as MMT for some M with independent normal columns. Hence it must be that xT ℳx = xT MMT x ≥ 0 for all p-vectors x.)
• Further, if ℳ is non-singular (i.e. positive definite) with probability 1, it is said to have a non-singular Wishart distribution.
• The following result can be found in Proposition 8.2 of the lecture notes on Canvas:
Suppose ℳ is Wishart-distributed with parameters Σ, p, n. Then ℳ has a non-singular Wishart distribution if and only if n ≥ p and Σ > 0, in which case ℳ has the density
f_{\Sigma,n}(\mathcal{M}) = \frac{|\mathcal{M}|^{(n-p-1)/2} \exp\{-\tfrac{1}{2}\,\mathrm{tr}(\mathcal{M}\Sigma^{-1})\}}{2^{pn/2}\, \pi^{p(p-1)/4}\, |\Sigma|^{n/2} \prod_{i=1}^{p} \Gamma((n+1-i)/2)}

• When σ is a scalar, a W1(σ2, n) is the same as σ2 times a χ2n.
• If a p × p random matrix Y ∼ Wp(Σ, n) and B is a q × p matrix then
BY BT ∼ Wq(BΣBT , n).
• If a p × p random matrix Y ∼ Wp(Σ, n) and a is a p × 1 vector such that aT Σa ≠ 0, then
aT Y a / aT Σa ∼ χ2n .
(How to show it? Make sure you know!)

• Recall the unbiased sample covariance matrix
S = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar X)(X_i - \bar X)^T .
It can be proved that
(n − 1)S ∼ Wp(Σ, n − 1).
• The above is essentially saying that if X1, . . . , Xn are iid N(μ, Σ) random vectors with sample mean X̄, then
\sum_{i=1}^{n} (X_i - \bar X)(X_i - \bar X)'
is distributed as \sum_{i=1}^{n-1} Z_i Z_i', where the Zi’s are iid N(0, Σ) random vectors.
• See Theorem 3.3.2 in An Introduction to Multivariate Statistical Analysis
by Anderson for a proof.
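• A simulation sketch of the statement (n − 1)S ∼ Wp(Σ, n − 1): the average of many simulated (n − 1)S matrices should be close to the Wishart mean (n − 1)Σ (arbitrary μ, Σ and n).

```python
import numpy as np

rng = np.random.default_rng(6)
p, n = 2, 8
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.6],
                  [0.6, 2.0]])
n_rep = 20_000

# Simulate (n-1)S repeatedly and compare its average with E{W_p(Sigma, n-1)} = (n-1) Sigma
M = np.empty((n_rep, p, p))
for r in range(n_rep):
    X = rng.multivariate_normal(mu, Sigma, size=n)
    M[r] = (n - 1) * np.cov(X, rowvar=False)

print(np.round(M.mean(axis=0), 2))
print(np.round((n - 1) * Sigma, 2))      # should be close
```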

4.4 HOTELLING DISTRIBUTION
• The Hotelling distribution is a generalisation to multiple dimensions of the Student tn distribution with n degrees of freedom.
• In the univariate case a variable X ∼ tn if it can be written as
X = Y \sqrt{n/Z},
where Y and Z are independent random variables, Y ∼ N(0, 1) and Z ∼ χ2n.
• Definition of Hotelling’s T2(p, n) distribution: If X ∼ Np(0, Ip) is independent of M ∼ Wp(Ip, n), then
n XT M−1X ∼ T^2_{p,n} .

• Theorem (p. 193 of Härdle and Simar): If X ∼ Np(μ, Σ) is independent of M ∼ Wp(Σ, n) with M being non-singular, then
n(X − μ)T M−1(X − μ) ∼ T^2_{p,n} .
Proof sketch: Consider Y ∼ Np(0, Ip); then Σ1/2Y ∼ Np(0, Σ). Note that Σ1/2 can be obtained by spectral decomposition.
• e.g. If X1, . . . , Xn are i.i.d. ∼ Np(μ, Σ), then the sample mean vector X̄ and the sample covariance matrix S are such that
n(X̄ − μ)T S−1(X̄ − μ) ∼ T^2_{p,n−1} .
This is true because S is independent of X̄ by Cochran’s theorem (Theorem 5.7 in Härdle and Simar).

• Hotelling’s T2 and the F-distribution are related by
T^2_{p,n} = \frac{np}{n-p+1} F_{p,\,n-p+1}.
(Recall that the square of a univariate t-distribution with n degrees of freedom is the same as an F1,n distribution.)
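• A simulation sketch of this relation: simulate n XT M−1X with X ∼ Np(0, Ip) independent of M ∼ Wp(Ip, n), and compare its quantiles with those of np/(n − p + 1) times an Fp,n−p+1 (arbitrary p and n; scipy assumed available).

```python
import numpy as np
from scipy.stats import f, wishart

rng = np.random.default_rng(7)
p, n = 3, 12
n_rep = 50_000

X = rng.normal(size=(n_rep, p))                                        # X ~ N_p(0, I_p)
M = wishart(df=n, scale=np.eye(p)).rvs(size=n_rep, random_state=rng)   # M ~ W_p(I_p, n)

# T2 = n X^T M^{-1} X for each simulated pair (X, M)
T2 = n * np.einsum('ri,ri->r', X, np.linalg.solve(M, X[..., None])[..., 0])

# T^2_{p,n} = n p / (n - p + 1) * F_{p, n-p+1}
qs = [0.5, 0.9, 0.95]
scale = n * p / (n - p + 1)
print(np.round(np.quantile(T2, qs), 2))
print(np.round(scale * f.ppf(qs, dfn=p, dfd=n - p + 1), 2))   # should be close
```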

• Hotelling’s T2 test is typically used for the following hypothesis testing problem (see Chapter 7.1 of Härdle and Simar):
• Suppose X1, . . . , Xn is an iid random sample from the Np(μ, Σ) population with Σ unknown. Test
H0 : μ = μ0 versus H1 : no constraints .
(This is the multivariate version of the usual univariate testing problem tackled by t statistics.)
• When H0 is true, n(X̄ − μ0)T S−1(X̄ − μ0) ∼ T^2_{p,n−1}. Naturally, we can use n(X̄ − μ0)T S−1(X̄ − μ0) as the test statistic, and calibrate the cutoff threshold based on the T^2_{p,n−1} (or F_{p,n−p}) distribution.
• It turns out the same test can also be derived as the likelihood ratio test for this testing problem (again, see Chapter 7.1 of Härdle and Simar).
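• A minimal sketch of the one-sample Hotelling T2 test (the function name and the simulated data are illustrative, not from the notes), calibrating the statistic via the equivalent Fp,n−p distribution:

```python
import numpy as np
from scipy.stats import f

def hotelling_t2_test(X, mu0):
    """One-sample Hotelling T^2 test of H0: mu = mu0 (sketch, not a vetted library routine)."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    diff = xbar - mu0
    T2 = n * diff @ np.linalg.solve(S, diff)          # n (xbar - mu0)^T S^{-1} (xbar - mu0)
    # Under H0, T2 ~ T^2_{p, n-1} = (n-1)p/(n-p) F_{p, n-p}
    F_stat = (n - p) / ((n - 1) * p) * T2
    p_value = f.sf(F_stat, dfn=p, dfd=n - p)
    return T2, p_value

rng = np.random.default_rng(8)
X = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=30)
print(hotelling_t2_test(X, mu0=np.array([0.0, 0.0])))   # large p-value expected
print(hotelling_t2_test(X, mu0=np.array([1.0, 1.0])))   # small p-value expected
```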