II Lecture
The Multivariate Normal Distribution
2.1. Standard facts about multivariate distributions
2.1.1. Random samples in multivariate analysis
In order to study the sampling variability of statistics like x̄ and Sn that we introduced in Lecture I, with the ultimate goal of making inferences, one needs to make some assumptions about the random variables whose values constitute the data set X ∈ Mp,n in (1.1). Suppose the data have not been observed yet but we intend to collect n sets of measurements on p variables. Since the actual observations cannot be predicted before the measurements are made, we treat them as random variables. Each set of p measurements can be considered as a realization of a p-dimensional random vector, and we have n independent realizations of such random vectors X_j, j = 1, 2, . . . , n, so we obtain the random matrix X ∈ Mp,n:
\[
X =
\begin{pmatrix}
X_{11} & X_{12} & \cdots & X_{1j} & \cdots & X_{1n} \\
X_{21} & X_{22} & \cdots & X_{2j} & \cdots & X_{2n} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
X_{i1} & X_{i2} & \cdots & X_{ij} & \cdots & X_{in} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
X_{p1} & X_{p2} & \cdots & X_{pj} & \cdots & X_{pn}
\end{pmatrix}
= [X_1, X_2, \ldots, X_n] \qquad (2.1)
\]
The vectors X_j, j = 1, 2, . . . , n are regarded as independent observations of a p-dimensional random vector. We now discuss the distribution of such a vector.
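As a concrete illustration (not part of the original notes), the following minimal numpy sketch arranges a simulated p × n sample as in (2.1) and computes the sample mean vector x̄ and sample covariance matrix Sn from Lecture I. The divisor n used for Sn is an assumption; replace it by n − 1 if Lecture I defines Sn that way.

```python
# Sketch: a p x n data matrix X as in (2.1), with columns X_1, ..., X_n
# being independent p-dimensional observations.
import numpy as np

rng = np.random.default_rng(0)
p, n = 3, 200                      # p variables, n independent observations
X = rng.normal(size=(p, n))        # illustrative data only

x_bar = X.mean(axis=1)             # sample mean vector (length p)
Xc = X - x_bar[:, None]            # centred data
S_n = Xc @ Xc.T / n                # sample covariance (divisor n assumed)

print(x_bar)
print(S_n)
```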
2.1.2. Joint, marginal, conditional distributions
Let the random vector

\[
X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_p \end{pmatrix} \in \mathbb{R}^p, \quad p \ge 2,
\]

have p components, each of which is a random variable with a cumulative distribution function (cdf) F_{X_i}(x_i), i = 1, 2, . . . , p. Each of the functions F_{X_i}(·) is called a marginal distribution. The joint cdf of the random vector X is

\[
F_X(x) = P(X_1 \le x_1, X_2 \le x_2, \ldots, X_p \le x_p) = F_X(x_1, x_2, \ldots, x_p).
\]
If X is a discrete random vector, the probability mass function is defined as

\[
P_X(x) = P(X_1 = x_1, X_2 = x_2, \ldots, X_p = x_p).
\]
If a density f_X(x) = f_X(x_1, x_2, . . . , x_p) exists such that

\[
F_X(x) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_p} f_X(t)\, dt_1 \cdots dt_p, \qquad (2.2)
\]

then X is a continuous random vector with a joint density f_X(x), a function of p arguments. From (2.2) we see that in this case

\[
f_X(x) = \frac{\partial^p F_X(x)}{\partial x_1 \partial x_2 \cdots \partial x_p}
\]

holds. If X has p independent components, then
\[
F_X(x) = F_{X_1}(x_1) F_{X_2}(x_2) \cdots F_{X_p}(x_p) \qquad (2.3)
\]

holds and, equivalently, also

\[
P_X(x) = P_{X_1}(x_1) P_{X_2}(x_2) \cdots P_{X_p}(x_p), \qquad f_X(x) = f_{X_1}(x_1) f_{X_2}(x_2) \cdots f_{X_p}(x_p) \qquad (2.4)
\]

holds.
The marginal cdf of the first k < p components of the vector X is defined in a natural way as follows:

\[
P(X_1 \le x_1, \ldots, X_k \le x_k) = P(X_1 \le x_1, \ldots, X_k \le x_k, X_{k+1} \le \infty, \ldots, X_p \le \infty) = F_X(x_1, \ldots, x_k, \infty, \ldots, \infty) \qquad (2.5)
\]

The marginal density of the first k components can be obtained by partial differentiation in (2.5), and we arrive at

\[
f_{X_1,\ldots,X_k}(x_1, \ldots, x_k) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_X(x_1, x_2, \ldots, x_p)\, dx_{k+1} \cdots dx_p.
\]

For any other subset of k < p components of the vector X, the marginal cdf and density can be obtained along the same lines.

The conditional density of X_1, \ldots, X_r given X_{r+1} = x_{r+1}, \ldots, X_p = x_p is defined by

\[
f_{(X_1,\ldots,X_r \mid X_{r+1},\ldots,X_p)}(x_1, \ldots, x_r \mid x_{r+1}, \ldots, x_p) = \frac{f_X(x)}{f_{X_{r+1},\ldots,X_p}(x_{r+1}, \ldots, x_p)} \qquad (2.6)
\]

This conditional density is interpreted as the joint density of X_1, \ldots, X_r when X_{r+1} = x_{r+1}, \ldots, X_p = x_p, and it is defined only when f_{X_{r+1},\ldots,X_p}(x_{r+1}, \ldots, x_p) ≠ 0.

We note that in the case of mutual independence of the p components, all conditional distributions do not depend on the conditioning values and the factorizations

\[
F_X(x) = \prod_{i=1}^{p} F_{X_i}(x_i), \qquad f_X(x) = \prod_{i=1}^{p} f_{X_i}(x_i)
\]

hold.

2.1.3. Moments

Given the density f_X(x) of the random vector X, the joint moments of order s_1, s_2, \ldots, s_p are defined, in analogy to the univariate case, as

\[
E(X_1^{s_1} \cdots X_p^{s_p}) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} x_1^{s_1} \cdots x_p^{s_p}\, f_X(x_1, \ldots, x_p)\, dx_1 \cdots dx_p \qquad (2.7)
\]

Note that if some of the s_i in (2.7) are equal to zero, then in effect we are calculating the joint moment of a subset of the p random variables.

2.1.4. Density transformation formula

Assume the p random variables X_1, X_2, \ldots, X_p with given density f_X(x) have been transformed by a smooth (i.e. differentiable) one-to-one transformation into p new random variables Y_1, Y_2, \ldots, Y_p, i.e. a new random vector Y ∈ R^p has been created by calculating

\[
Y_i = y_i(X_1, X_2, \ldots, X_p), \quad i = 1, 2, \ldots, p \qquad (2.8)
\]

The question is how to calculate the density g_Y(y) of Y from the transformation functions y_i(X_1, X_2, \ldots, X_p), i = 1, 2, \ldots, p, and the density f_X(x) of the original random vector. Since the transformation (2.8) is assumed to be one-to-one, its inverse transformation X_i = x_i(Y_1, Y_2, \ldots, Y_p), i = 1, 2, \ldots, p, also exists, and then the following density transformation formula applies:

\[
g_Y(y_1, \ldots, y_p) = f_X[x_1(y_1, \ldots, y_p), \ldots, x_p(y_1, \ldots, y_p)]\, |J(y_1, \ldots, y_p)| \qquad (2.9)
\]

where J(y_1, \ldots, y_p) is the Jacobian of the transformation:

\[
J(y_1, \ldots, y_p) =
\begin{vmatrix}
\frac{\partial x_1}{\partial y_1} & \frac{\partial x_1}{\partial y_2} & \cdots & \frac{\partial x_1}{\partial y_p} \\
\frac{\partial x_2}{\partial y_1} & \frac{\partial x_2}{\partial y_2} & \cdots & \frac{\partial x_2}{\partial y_p} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial x_p}{\partial y_1} & \frac{\partial x_p}{\partial y_2} & \cdots & \frac{\partial x_p}{\partial y_p}
\end{vmatrix}
\qquad (2.10)
\]

Note that in (2.9) the absolute value of the Jacobian is used.

2.1.5. Characteristic and moment generating functions

The characteristic function (cf) ϕ(t) of the random vector X ∈ R^p is a function of a p-dimensional argument. For any real vector t = (t_1, t_2, \ldots, t_p)′ ∈ R^p the characteristic function is defined as ϕ_X(t) = E(e^{it′X}), where i = √−1. Note that the cf always exists since |ϕ_X(t)| ≤ E(|e^{it′X}|) = 1 < ∞. Perhaps simpler (since it does not involve complex numbers) is the notion of the moment generating function (mgf), defined as M_X(t) = E(e^{t′X}). Note, however, that in some cases the mgf may not exist for values of t away from the zero vector.

Characteristic functions are in one-to-one correspondence with distributions, and this is the reason to use them as machinery to operate with in cases where direct operation with the distribution is not very convenient.
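To make the density transformation formula (2.9) concrete, here is a small Python sketch (not from the original notes). For a linear one-to-one map Y = AX + b the inverse is X = A^{-1}(Y − b) and |J| = |det A^{-1}|, so g_Y(y) = f_X(A^{-1}(y − b)) |det A|^{-1}. The particular matrix A, shift b, and density f_X below are illustrative assumptions; the sketch cross-checks the formula against a crude Monte Carlo estimate.

```python
# Sketch: verify the density transformation formula (2.9) for a linear map
# Y = A X + b applied to a bivariate density f_X (here: two independent
# standard normals, chosen purely for illustration).
import numpy as np
from scipy import stats

A = np.array([[2.0, 1.0],
              [0.5, 1.5]])           # assumed invertible transformation matrix
b = np.array([1.0, -1.0])
A_inv = np.linalg.inv(A)
abs_jac = abs(np.linalg.det(A_inv))  # |J| = |det(dX/dY)| = 1 / |det A|

def f_X(x):
    # joint density of two independent N(0,1) components
    return stats.norm.pdf(x[0]) * stats.norm.pdf(x[1])

def g_Y(y):
    # density of Y via (2.9): f_X evaluated at the inverse image, times |J|
    x = A_inv @ (y - b)
    return f_X(x) * abs_jac

# quick Monte Carlo sanity check of g_Y at one point y0
rng = np.random.default_rng(1)
X = rng.standard_normal((2, 200_000))
Y = A @ X + b[:, None]
y0, h = np.array([1.0, 0.0]), 0.2
in_box = np.all(np.abs(Y - y0[:, None]) < h / 2, axis=0)
print("formula:", g_Y(y0), " MC estimate:", in_box.mean() / h**2)
```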
In fact, when the density exists, under mild conditions the following simple inversion formula holds for a one-dimensional random variable:

\[
f_X(x) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} e^{-itx} \varphi(t)\, dt.
\]

This formula can also be generalised to random vectors.

One important property of the cf is the following: if the cf ϕ_X(t) of the random vector X ∈ R^p is given and Y = AX + b, b ∈ R^q, A ∈ M_{q,p}, is a linear transformation of X ∈ R^p into a new random vector Y ∈ R^q, then for all s ∈ R^q it holds that

\[
\varphi_Y(s) = e^{is'b}\, \varphi_X(A's)
\]

Proof: at lectures.

2.2. Multivariate Normal Distribution

2.2.1. Definition

The multivariate normal density is a generalization of the univariate normal for p ≥ 2 dimensions. Looking at the term ((x − µ)/σ)² = (x − µ)(σ²)^{-1}(x − µ) in the exponent of the well-known formula

\[
f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-[(x-\mu)/\sigma]^2/2}, \quad -\infty < x < \infty \qquad (2.11)
\]

for the univariate density function, a natural way to generalize this term to higher dimensions is to replace it by (x − µ)′Σ^{-1}(x − µ). Here µ = EX ∈ R^p is the expected value of the random vector X ∈ R^p, and the matrix

\[
\Sigma = E(X - \mu)(X - \mu)' =
\begin{pmatrix}
\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\
\sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp}
\end{pmatrix}
\in M_{p,p}
\]

is the covariance matrix. Note that on the diagonal of Σ we get the variances of each of the p random variables, whereas σ_{ij} = E[(X_i − E(X_i))(X_j − E(X_j))], i ≠ j, are the covariances between the ith and jth random variables. Sometimes we will also denote σ_{ii} by σ_i².

Of course, the above replacement would only make sense if Σ were positive definite. In general, however, we can only claim that Σ is (as any covariance matrix) non-negative definite (try to prove this claim, e.g. using Exercise 1 from Lecture 1 or some other argument). If Σ is positive definite, then the density of the random vector X can be written as

\[
f(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}}\, e^{-(x-\mu)'\Sigma^{-1}(x-\mu)/2}, \quad -\infty < x_i < \infty,\ i = 1, 2, \ldots, p \qquad (2.12)
\]

It can be checked directly that the random vector X ∈ R^p has EX = µ and E[(X − µ)(X − µ)′] = Σ. Since the density is uniquely defined by the mean vector and the covariance matrix, we will denote it by N_p(µ, Σ).

In these notes, however, we will introduce the multivariate normal distribution not through its density formula but through more general reasoning that also allows us to cover the case of singular Σ. We will utilize the famous Cramér–Wold argument, according to which the distribution of a p-dimensional random vector X is completely characterised by the one-dimensional distributions of all linear transformations T′X, T ∈ R^p. Indeed, if we consider E[e^{itT′X}] (which is assumed to be known for every t ∈ R¹, T ∈ R^p), then we see that by substituting t = 1 we can get E[e^{iT′X}], which is the cf of the vector X (and the latter uniquely specifies the distribution of X). Hence the following definition will be adopted here:

Definition 2.2.1. The random vector X ∈ R^p has a multivariate normal distribution if and only if (iff) any linear transformation T′X, T ∈ R^p, has a univariate normal distribution.

Theorem 2.2.1. Let X ∈ R^p be a random vector with a normal distribution according to Definition 2.2.1, and let E(X) = µ and D(X) = E[(X − µ)(X − µ)′] = Σ. Then:

i) for any fixed T ∈ R^p, T′X ∼ N(T′µ, T′ΣT), i.e. T′X has a one-dimensional normal distribution with expected value T′µ and variance T′ΣT.

ii) The cf of X ∈ R^p is

\[
\varphi_X(T) = e^{iT'\mu - \frac{1}{2} T'\Sigma T} \qquad (2.13)
\]

Proof: Part i) is obvious. For part ii) we first recall the well-known fact that the cf of the standard univariate normal random variable Z is e^{−t²/2}.
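As a quick numerical sanity check (not part of the original notes), the sketch below evaluates the density (2.12) directly from µ and Σ and compares it with scipy's implementation; the particular µ, Σ, and evaluation point are arbitrary illustrative choices.

```python
# Sketch: evaluate the N_p(mu, Sigma) density (2.12) by the formula and
# cross-check it against scipy.stats.multivariate_normal.
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])   # positive definite, illustrative
p = len(mu)
x = np.array([0.0, -1.0, 1.0])

diff = x - mu
quad = diff @ np.linalg.solve(Sigma, diff)            # (x-mu)' Sigma^{-1} (x-mu)
dens = np.exp(-quad / 2) / ((2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma)))

print(dens)
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # should agree
```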
Since any U ∼ N₁(µ₁, σ₁²) has the same distribution as µ₁ + σ₁Z, we have:

\[
\varphi_U(t) = e^{it\mu_1} \varphi_{\sigma_1 Z}(t) = e^{it\mu_1} E(e^{it\sigma_1 Z}) = e^{it\mu_1} \varphi_Z(t\sigma_1) = e^{it\mu_1 - \frac{1}{2} t^2 \sigma_1^2}
\]

But then, for the univariate random variable T′X ∼ N₁(T′µ, T′ΣT), we would have as a characteristic function

\[
\varphi_{T'X}(t) = e^{itT'\mu - \frac{1}{2} t^2 T'\Sigma T}.
\]

Substituting t = 1 in the latter formula we find that

\[
\varphi_X(T) = e^{iT'\mu - \frac{1}{2} T'\Sigma T}
\]

As an upshot, we see that given the expected value vector µ and the covariance matrix Σ, we can use the cf formula (2.13) rather than the density formula (2.12) to define the p-dimensional multivariate normal distribution. The advantage of the former in comparison to the latter is that (2.13) involves Σ itself rather than Σ^{-1}, i.e. this definition also makes sense in the case of a singular (i.e. non-invertible) Σ. We still want to know that in the case of non-singular Σ the more general definition gives rise to the density (2.12). This is the content of the next theorem.

Theorem 2.2.2. Assume the matrix Σ in (2.13) is nonsingular. Then the density of the random vector X ∈ R^p with cf as in (2.13) is given by (2.12).

Proof: Consider the vector Y ∈ R^p such that Y = Σ^{-1/2}(X − µ) (compare 4.2.5 in Lecture I). Since obviously E(Y) = 0 and D(Y) = E(YY′) = Σ^{-1/2} E[(X − µ)(X − µ)′] Σ^{-1/2} = I_p holds, we can substitute to get the cf of Y ∈ R^p:

\[
\varphi_Y(T) = e^{-\frac{1}{2} \sum_{i=1}^{p} t_i^2}.
\]

But the latter can be seen directly to be the characteristic function of a vector of p independent standard normal variables. Hence, from the relation Y = Σ^{-1/2}(X − µ) we can also conclude that X = µ + Σ^{1/2}Y, where the density of Y is

\[
f_Y(y) = \frac{1}{(2\pi)^{p/2}}\, e^{-\frac{1}{2}\sum_{i=1}^{p} y_i^2}.
\]

In other words, X is a linear transformation of Y, and the density of Y is known. We can therefore apply the density transformation approach (Section 2.1.4 of this lecture) to obtain:

\[
f_X(x) = f_Y(\Sigma^{-1/2}(x - \mu))\, |J(x_1, \ldots, x_p)|.
\]

It is easy to see (because of the linearity of the transformation) that |J(x₁, …, x_p)| = |Σ^{-1/2}| = |Σ^{1/2}|^{-1} = |Σ|^{-1/2}. Taking into account that

\[
\sum_{i=1}^{p} y_i^2 = y'y = (x - \mu)'\Sigma^{-1/2}\Sigma^{-1/2}(x - \mu) = (x - \mu)'\Sigma^{-1}(x - \mu),
\]

we finally arrive at the density formula (2.12) for f_X(x).

2.2.2. Properties of multivariate normal

The following properties of the multivariate normal can easily be derived using the machinery developed so far:

Property 1. If Σ = D(X) = Λ is a diagonal matrix, then the p components of X are independent. (Indeed, in this case

\[
\varphi_X(T) = e^{\sum_{j=1}^{p} \left( i t_j \mu_j - \frac{1}{2} t_j^2 \sigma_j^2 \right)},
\]

which can be seen to be the cf of a vector of p independent components, each distributed according to N(µ_j, σ_j²), j = 1, …, p.)

The above property can be paraphrased as "for a multivariate normal, if its components are uncorrelated they are also independent". On the other hand, it is well known that in general, i.e. not only for the normal, independence of components implies that they are uncorrelated. Therefore, for the multivariate normal distribution we can conclude that its components are independent if and only if they are uncorrelated!

Property 2. If X ∼ N_p(µ, Σ) and C ∈ M_{q,p} is an arbitrary matrix of real numbers, then Y = CX ∼ N_q(Cµ, CΣC′).

To prove this property note that (see 2.1.5) for any s ∈ R^q we have:

\[
\varphi_Y(s) = \varphi_X(C's) = e^{is'C\mu - \frac{1}{2} s'C\Sigma C's},
\]

which means that Y = CX ∼ N_q(Cµ, CΣC′). Note also that if C has full rank and rk(Σ) = p, then CΣC′ also has full rank, i.e. the distribution of Y is not degenerate in this case.
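Property 2 is easy to check empirically. The following sketch (not part of the original notes; µ, Σ, and C are arbitrary illustrative choices) simulates X ∼ N_p(µ, Σ) and verifies that the sample mean and covariance of Y = CX approach Cµ and CΣC′.

```python
# Sketch: empirical check of Property 2.  For Y = C X with X ~ N_p(mu, Sigma),
# the sample mean/covariance of Y should approach C mu and C Sigma C'.
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([1.0, 0.0, -1.0])
Sigma = np.array([[1.0, 0.4, 0.2],
                  [0.4, 2.0, 0.5],
                  [0.2, 0.5, 1.5]])
C = np.array([[1.0, -1.0, 0.0],
              [0.5,  0.5, 1.0]])                        # maps R^3 -> R^2

X = rng.multivariate_normal(mu, Sigma, size=100_000).T  # columns are draws
Y = C @ X

print("E(Y) ≈", Y.mean(axis=1), " vs  C mu =", C @ mu)
print("D(Y) ≈\n", np.cov(Y), "\n vs  C Sigma C' =\n", C @ Sigma @ C.T)
```

Property 3. (This is a finer version of Property 1; see below.)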
Assume the vector X ∈ R^p is divided into subvectors

\[
X = \begin{pmatrix} X_{(1)} \\ X_{(2)} \end{pmatrix},
\]

and, according to this subdivision, the mean vector is

\[
\mu = \begin{pmatrix} \mu_{(1)} \\ \mu_{(2)} \end{pmatrix}
\]

and the covariance matrix Σ has been subdivided into

\[
\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.
\]

Then the vectors X_{(1)} and X_{(2)} are independent iff Σ₁₂ = 0.

Proof: Exercise (see lecture).

Property 4. Let the vector X ∈ R^p be divided into subvectors X = (X_{(1)}′, X_{(2)}′)′, X_{(1)} ∈ R^r, r < p, X_{(2)} ∈ R^{p−r}, and, according to this subdivision, let the mean vector be µ = (µ_{(1)}′, µ_{(2)}′)′ and the covariance matrix be subdivided into

\[
\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.
\]

Assume for simplicity that Σ₂₂ has full rank. Then the conditional density of X_{(1)} given that X_{(2)} = x_{(2)} is

\[
N_r\!\left(\mu_{(1)} + \Sigma_{12}\Sigma_{22}^{-1}(x_{(2)} - \mu_{(2)}),\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right).
\]

Proof: Perhaps the easiest way to proceed is the following. Note that the expression µ_{(1)} + Σ₁₂Σ₂₂^{-1}(x_{(2)} − µ_{(2)}) (for which we want to show that it equals the conditional mean) is a function of x_{(2)}. Denote it by g(x_{(2)}) for short. Let us construct the random vectors Z = X_{(1)} − g(X_{(2)}) and Y = X_{(2)} − µ_{(2)}. Obviously EZ = 0 and EY = 0 hold. The vector (Z′, Y′)′ is a linear transformation of a normal vector:

\[
\begin{pmatrix} Z \\ Y \end{pmatrix} = A(X - \mu), \qquad A = \begin{pmatrix} I_r & -\Sigma_{12}\Sigma_{22}^{-1} \\ 0 & I_{p-r} \end{pmatrix},
\]

and hence its distribution is normal (Property 2). Calculating the covariance matrix of the vector (Z′, Y′)′ we find, after a simple exercise in block multiplication of matrices, that

\[
\operatorname{Cov}\begin{pmatrix} Z \\ Y \end{pmatrix} = A\Sigma A' = \begin{pmatrix} \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & 0 \\ 0 & \Sigma_{22} \end{pmatrix}.
\]

Hence the two vectors Z and Y are uncorrelated normal vectors and therefore independent (Property 3). But Y is a linear transformation of X_{(2)}, and this means that Z and X_{(2)} are independent. Hence the conditional density of Z given X_{(2)} = x_{(2)} does not depend on x_{(2)} and coincides with the unconditional density of Z. That is, it is normal with zero mean vector and covariance matrix

\[
\operatorname{Cov}(Z) = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} = \Sigma_{1|2}.
\]

Hence we can state that, given X_{(2)} = x_{(2)}, X_{(1)} − g(x_{(2)}) ∼ N(0, Σ_{1|2}), and correspondingly the conditional distribution of X_{(1)} given that X_{(2)} = x_{(2)} is

\[
N_r\!\left(\mu_{(1)} + \Sigma_{12}\Sigma_{22}^{-1}(x_{(2)} - \mu_{(2)}),\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right) \qquad (2.14)
\]

Exercise 2. As an immediate consequence of Property 4 we see that if p = 2, r = 1, then for a two-dimensional normal vector

\[
\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N\!\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix} \right),
\]

its conditional density f(x₁ | x₂) is

\[
N\!\left(\mu_1 + \frac{\sigma_{12}}{\sigma_2^2}(x_2 - \mu_2),\ \sigma_1^2 - \frac{\sigma_{12}^2}{\sigma_2^2}\right).
\]

As an exercise, try to derive this result by direct calculation, starting from the joint density f(x₁, x₂), passing to the marginal f(x₂) by integration, and finally getting f(x₁ | x₂) = f(x₁, x₂)/f(x₂).

Property 5. If X ∼ N_p(µ, Σ) and Σ is nonsingular, then

\[
(X - \mu)'\Sigma^{-1}(X - \mu) \sim \chi^2_p,
\]

where χ²_p denotes the chi-square distribution with p degrees of freedom.

Proof: It suffices to use the fact (see also Theorem 2.2.2) that the vector Y = Σ^{-1/2}(X − µ) ∼ N(0, I_p), i.e. it has p independent standard normal components. Then

\[
(X - \mu)'\Sigma^{-1}(X - \mu) = Y'Y = \sum_{i=1}^{p} Y_i^2 \sim \chi^2_p
\]

according to the definition of χ²_p as the distribution of the sum of squares of p independent standard normals.

Finally, one more interpretation of the result in Property 4 will be given. Assume we want, as is typical in statistics, to predict a random variable Y that is correlated with p random variables (predictors) X = (X₁, X₂, …, X_p)′. Trying to find the best predictor of Y, we would like to minimize the expected value E(Y − g(X))² over all possible choices of the function g such that Eg(X)² < ∞.
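The conditional-distribution formula (2.14) is straightforward to implement. The sketch below (not from the original notes; µ, Σ, and r are illustrative assumptions) computes the conditional mean and covariance of X_{(1)} given X_{(2)} = x_{(2)}, and then checks by Monte Carlo that a linear regression of simulated X_{(1)} on X_{(2)} recovers the coefficient matrix Σ₁₂Σ₂₂^{-1}.

```python
# Sketch: conditional mean and covariance of X_(1) | X_(2) = x_(2) as in (2.14),
# with a quick Monte Carlo regression check of the conditional-mean coefficients.
import numpy as np

mu = np.array([0.0, 1.0, -1.0, 2.0])
Sigma = np.array([[2.0, 0.6, 0.3, 0.2],
                  [0.6, 1.5, 0.4, 0.1],
                  [0.3, 0.4, 1.0, 0.2],
                  [0.2, 0.1, 0.2, 1.2]])
r = 2                                    # X_(1) = first r components

S11, S12 = Sigma[:r, :r], Sigma[:r, r:]
S21, S22 = Sigma[r:, :r], Sigma[r:, r:]
B = S12 @ np.linalg.inv(S22)             # coefficient matrix Sigma_12 Sigma_22^{-1}

def cond_dist(x2):
    """Mean and covariance of X_(1) | X_(2) = x2, as in (2.14)."""
    mean = mu[:r] + B @ (x2 - mu[r:])
    cov = S11 - B @ S21                  # Schur complement Sigma_{1|2}
    return mean, cov

print(cond_dist(np.array([0.0, 0.0])))

# Monte Carlo check: regressing X_(1) on X_(2) recovers B (approximately)
rng = np.random.default_rng(3)
X = rng.multivariate_normal(mu, Sigma, size=200_000)
X1, X2 = X[:, :r], X[:, r:]
design = np.column_stack([np.ones(len(X2)), X2 - mu[r:]])
coef, *_ = np.linalg.lstsq(design, X1 - mu[:r], rcond=None)
print("B from formula:\n", B, "\nB from regression:\n", coef[1:].T)
```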
A little careful work and the use of basic properties of conditional expectations lead us (see lecture) to the conclusion that the optimal solution to the above minimization problem is g*(X) = E(Y | X). This optimal solution is also called the regression function. Thus, given a particular realization x of the random vector X, the regression function is just the conditional expected value of Y given X = x. In general, the conditional expected value may be a complicated nonlinear function of the predictors. However, if we assume in addition that the joint (p + 1)-dimensional distribution of Y and X is normal, then by applying Property 4 we see that, given the realization x of X, the best prediction of the Y value is b + σ₀′C^{-1}x, where b = E(Y) − σ₀′C^{-1}E(X), C is the covariance matrix of the vector X, and σ₀ is the vector of covariances of Y with X_i, i = 1, …, p. Indeed, we know that when the joint (p + 1)-dimensional distribution of Y and X is normal, the regression function is given by E(Y) + σ₀′C^{-1}(x − E(X)). By introducing the notation b = E(Y) − σ₀′C^{-1}E(X) we can write this as b + σ₀′C^{-1}x. That is, in the case of normality, the optimal predictor of Y in the least squares sense turns out to be a very simple linear function of the predictors. The vector C^{-1}σ₀ ∈ R^p is the vector of regression coefficients. Substituting the optimal predictor, we get the minimal value of the expected squared error, which is equal to V(Y) − σ₀′C^{-1}σ₀. A sketch implementing this predictor is given after this section's normality-testing discussion below.

2.2.3. Tests for Multivariate Normality

We have seen that the assumption of multivariate normality may bring essential simplifications in analyzing data. But applying inference methods based on the multivariate normality assumption in cases where it is grossly violated may introduce serious defects in the quality of the analysis. It is therefore important to be able to check the multivariate normality assumption. Based on the properties of normal distributions discussed in this lecture, we know that all linear combinations of normal variables are normal and that the contours of the multivariate normal density are ellipsoids. Therefore we can (to some extent) check the multivariate normality hypothesis by:

i) checking if the marginal distributions of each component appear to be normal (by using Q-Q plots, for example);

ii) checking if the scatterplots of pairs of observations give the elliptical appearance expected from normal populations;

iii) checking whether there are any outlying observations that should be examined for accuracy.

All this can be done by applying univariate techniques and by drawing scatterplots, which are well developed in SAS. To some extent, however, there is a price to be paid for concentrating on univariate and bivariate examinations of normality. There is a need to construct a "good" overall test of multivariate normality. One of the simple and tractable ways to verify the multivariate normality assumption is by using tests based on Mardia's multivariate skewness and kurtosis measures. For any general multivariate distribution we define these, respectively, as

\[
\beta_{1,p} = E[(Y - \mu)'\Sigma^{-1}(X - \mu)]^3 \qquad (2.15)
\]

provided that X is independent of Y but has the same distribution, and

\[
\beta_{2,p} = E[(X - \mu)'\Sigma^{-1}(X - \mu)]^2 \qquad (2.16)
\]

(if the expectations in (2.15) and (2.16) exist). For the N_p(µ, Σ) distribution: β₁,p = 0 and β₂,p = p(p + 2). (Note that when p = 1, the quantity β₁,₁ is the square of the skewness coefficient E(X − µ)³/σ³, whereas β₂,₁ coincides with the kurtosis coefficient E(X − µ)⁴/σ⁴.)
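Here is the promised sketch of the best linear predictor b + σ₀′C^{-1}x under joint normality (not part of the original notes). The joint mean vector and covariance matrix below are illustrative assumptions; the first coordinate plays the role of Y and the rest are the predictors.

```python
# Sketch: best (least-squares) predictor of Y from X under joint normality,
# i.e. the regression function b + sigma0' C^{-1} x.
import numpy as np

# joint distribution of (Y, X1, X2): first coordinate is Y (illustrative values)
mu_joint = np.array([2.0, 0.0, 1.0])
Sigma_joint = np.array([[1.5, 0.7, 0.3],
                        [0.7, 1.0, 0.2],
                        [0.3, 0.2, 2.0]])

EY, EX = mu_joint[0], mu_joint[1:]
VY = Sigma_joint[0, 0]
sigma0 = Sigma_joint[0, 1:]          # covariances of Y with the predictors
C = Sigma_joint[1:, 1:]              # covariance matrix of the predictors

beta = np.linalg.solve(C, sigma0)    # regression coefficients C^{-1} sigma0
b = EY - sigma0 @ np.linalg.solve(C, EX)

def predict(x):
    """Regression function E(Y | X = x) under joint normality."""
    return b + sigma0 @ np.linalg.solve(C, x)

print("prediction at x = (1, 2):", predict(np.array([1.0, 2.0])))
print("minimal expected squared error:", VY - sigma0 @ beta)
```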
For a sample of size n, consistent estimates of β₁,p and β₂,p can be obtained as

\[
\hat\beta_{1,p} = \frac{1}{n^2} \sum_{i=1}^{n}\sum_{j=1}^{n} g_{ij}^3, \qquad
\hat\beta_{2,p} = \frac{1}{n} \sum_{i=1}^{n} g_{ii}^2 = \frac{1}{n} \sum_{i=1}^{n} d_i^4,
\]

where g_{ij} = (x_i − x̄)′ S_n^{-1} (x_j − x̄) and d_i = √g_{ii} is the Mahalanobis distance. Both quantities β̂₁,p and β̂₂,p are nonnegative, and for multivariate normal data one would expect them to be around zero and p(p + 2), respectively. If there is a departure from spherical symmetry (that is, zero correlation and equal variance), β̂₂,p will be large. Both quantities can be utilized to detect departures from multivariate normality. Mardia has shown that asymptotically

\[
k_1 = \frac{n\hat\beta_{1,p}}{6} \sim \chi^2_{p(p+1)(p+2)/6}, \qquad
k_2 = \frac{\hat\beta_{2,p} - p(p+2)}{\sqrt{8p(p+2)/n}} \sim N(0, 1).
\]

Thus we can use k₁ and k₂ to test the null hypothesis of multivariate normality. If neither test rejects, the multivariate normality assumption is in reasonable agreement with the data. It has also been observed that Mardia's multivariate kurtosis can be used as a measure to detect outliers in data that are supposedly distributed as multivariate normal. For a given data set, the multivariate kurtosis can be computed using the CALIS procedure in SAS. The quantity k₂ is called Normalized Multivariate Kurtosis there, whereas β̂₂,p − p(p + 2) bears the name Mardia's Multivariate Kurtosis.
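A compact sketch of Mardia's statistics and the two asymptotic tests follows (not part of the original notes). It assumes S_n is taken with divisor n, as in the formulas above; adjust if Lecture I defines S_n with divisor n − 1.

```python
# Sketch: Mardia's multivariate skewness/kurtosis estimates and the
# asymptotic tests based on k1 and k2.
import numpy as np
from scipy import stats

def mardia_tests(X):
    """X: (n, p) data matrix, rows are observations.
    Returns beta1_hat, beta2_hat and the p-values of the k1 and k2 tests."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    Xc = X - xbar
    S = Xc.T @ Xc / n                       # S_n with divisor n (assumed)
    G = Xc @ np.linalg.solve(S, Xc.T)       # G[i, j] = g_ij
    beta1 = (G ** 3).sum() / n ** 2
    beta2 = (np.diag(G) ** 2).sum() / n

    k1 = n * beta1 / 6
    df = p * (p + 1) * (p + 2) / 6
    p_skew = stats.chi2.sf(k1, df)          # skewness test p-value

    k2 = (beta2 - p * (p + 2)) / np.sqrt(8 * p * (p + 2) / n)
    p_kurt = 2 * stats.norm.sf(abs(k2))     # two-sided kurtosis test p-value

    return beta1, beta2, p_skew, p_kurt

# quick try on simulated normal data (both p-values should typically be large)
rng = np.random.default_rng(4)
print(mardia_tests(rng.standard_normal((500, 3))))
```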