
III Lecture
Estimation of the Mean Vector and Covariance Matrix

of Multivariate Normal Distribution

3.1. Maximum Likelihood Estimation

3.1.1. Likelihood function

Suppose we have observed n independent realizations of p-dimensional random vectors
from Np(µ,Σ). Suppose for simplicity that Σ is non-singular. The data matrix has the
form

\[
x = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1j} & \cdots & x_{1n} \\
x_{21} & x_{22} & \cdots & x_{2j} & \cdots & x_{2n} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
x_{i1} & x_{i2} & \cdots & x_{ij} & \cdots & x_{in} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
x_{p1} & x_{p2} & \cdots & x_{pj} & \cdots & x_{pn}
\end{pmatrix}
= [x_1, x_2, \ldots, x_n] \tag{3.1}
\]

The goal is to estimate the unknown mean vector and the covariance matrix of the multivariate
normal distribution by the Maximum Likelihood Estimation (MLE) method.
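
Numerical illustrations in this lecture will be given as small Python/NumPy sketches. The one below (all numerical values, and the dimensions p and n, are arbitrary illustrative choices and not part of the theory) simulates such a sample and stores it as the p × n data matrix of (3.1), whose columns are the observations x_1, …, x_n.

```python
import numpy as np

rng = np.random.default_rng(0)

p, n = 3, 200                       # dimension and sample size (illustrative choices)
mu = np.array([1.0, -2.0, 0.5])     # "true" mean vector, assumed for the example
Sigma = np.array([[2.0, 0.5, 0.3],  # "true" covariance matrix (symmetric positive definite)
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])

# NumPy returns an n x p array (rows = observations); transpose to obtain the
# p x n data matrix of (3.1), whose columns are the observations x_1, ..., x_n.
X = rng.multivariate_normal(mu, Sigma, size=n).T
print(X.shape)   # (3, 200)
```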

Based on our knowledge from II Lecture we can write down the Likelihood function

\[
L(x;\mu,\Sigma) = (2\pi)^{-\frac{np}{2}}\,|\Sigma|^{-\frac{n}{2}}\,
e^{-\frac{1}{2}\sum_{i=1}^{n}(x_i-\mu)'\Sigma^{-1}(x_i-\mu)} \tag{3.2}
\]

(Note that we have substituted the observations in (3.2) and consider L as a function of
the unknown parameters µ, Σ only.) Correspondingly, we get the log-likelihood function in
the form

\[
\log L(x;\mu,\Sigma) = -\frac{np}{2}\log(2\pi) - \frac{n}{2}\log(|\Sigma|)
- \frac{1}{2}\sum_{i=1}^{n}(x_i-\mu)'\Sigma^{-1}(x_i-\mu) \tag{3.3}
\]

It is well known that maximizing either (3.2) or (3.3) will give the same solution for the
MLE.

We start deriving the MLE by trying to maximize (3.3). To this end, first note that
by utilising properties of traces from I lecture, we can transform:

\[
\sum_{i=1}^{n}(x_i-\mu)'\Sigma^{-1}(x_i-\mu)
= \sum_{i=1}^{n}\operatorname{tr}\big[\Sigma^{-1}(x_i-\mu)(x_i-\mu)'\big]
= \operatorname{tr}\Big[\Sigma^{-1}\sum_{i=1}^{n}(x_i-\mu)(x_i-\mu)'\Big]
\]

(by adding ±x̄ = (1/n)∑_{i=1}^n x_i to each term (x_i − µ) in ∑_{i=1}^n (x_i − µ)(x_i − µ)′)

\[
= \operatorname{tr}\Big[\Sigma^{-1}\Big(\sum_{i=1}^{n}(x_i-\bar{x})(x_i-\bar{x})' + n(\bar{x}-\mu)(\bar{x}-\mu)'\Big)\Big]
= \operatorname{tr}\Big[\Sigma^{-1}\sum_{i=1}^{n}(x_i-\bar{x})(x_i-\bar{x})'\Big] + n(\bar{x}-\mu)'\Sigma^{-1}(\bar{x}-\mu)
\]

Thus

\[
\log L(x;\mu,\Sigma) = -\frac{np}{2}\log(2\pi) - \frac{n}{2}\log(|\Sigma|)
- \frac{1}{2}\operatorname{tr}\Big[\Sigma^{-1}\sum_{i=1}^{n}(x_i-\bar{x})(x_i-\bar{x})'\Big]
- \frac{n}{2}(\bar{x}-\mu)'\Sigma^{-1}(\bar{x}-\mu) \tag{3.4}
\]
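
The algebra above is easy to sanity-check numerically. The sketch below (assuming NumPy and SciPy are available; the simulated sample and the trial parameter values µ₀, Σ₀ are arbitrary choices) evaluates (3.4) and compares it with the log-likelihood (3.3) computed as a sum of log-densities.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)

p, n = 3, 50
mu = np.zeros(p)
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
X = rng.multivariate_normal(mu, Sigma, size=n).T           # p x n data matrix

# Trial parameter values at which to evaluate the log-likelihood.
mu0 = np.array([0.1, -0.2, 0.3])
Sigma0 = np.array([[1.5, 0.2, 0.1],
                   [0.2, 1.2, 0.0],
                   [0.1, 0.0, 1.0]])

xbar = X.mean(axis=1)
A = (X - xbar[:, None]) @ (X - xbar[:, None]).T             # sum_i (x_i - xbar)(x_i - xbar)'
Sinv = np.linalg.inv(Sigma0)

# The decomposed form (3.4).
loglik_34 = (-0.5 * n * p * np.log(2 * np.pi)
             - 0.5 * n * np.log(np.linalg.det(Sigma0))
             - 0.5 * np.trace(Sinv @ A)
             - 0.5 * n * (xbar - mu0) @ Sinv @ (xbar - mu0))

# Reference value: the sum of the n log-densities, i.e. (3.3).
loglik_ref = multivariate_normal.logpdf(X.T, mean=mu0, cov=Sigma0).sum()

print(np.allclose(loglik_34, loglik_ref))                   # True
```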

3.1.2 Maximum Likelihood Estimators

The MLE are the values that maximize (3.4). Looking at (3.4) we realize that (since
Σ, and hence Σ^{-1}, is positive definite) the term (n/2)(x̄ − µ)′Σ^{-1}(x̄ − µ) is non-negative; its
minimal value is zero and is attained when µ = x̄. It remains to find the optimal value for Σ. We will use the following

Theorem 3.1 (Anderson’s lemma) If A ∈ M_{p,p} is symmetric positive definite,
then the maximum of the function h(G) = −n log(|G|) − tr(G^{-1}A) (defined over the set
of symmetric positive definite matrices G ∈ M_{p,p}) exists, occurs at G = (1/n)A and has the
maximal value np log(n) − n log(|A|) − np.

Proof (sketch, details at lecture): Indeed (see properties of traces),

\[
\operatorname{tr}(G^{-1}A) = \operatorname{tr}\big((G^{-1}A^{\frac{1}{2}})A^{\frac{1}{2}}\big)
= \operatorname{tr}\big(A^{\frac{1}{2}}G^{-1}A^{\frac{1}{2}}\big)
\]

Let η_i, i = 1, …, p be the eigenvalues of A^{1/2}G^{-1}A^{1/2}. Then (since the matrix A^{1/2}G^{-1}A^{1/2} is
positive definite) η_i > 0, i = 1, …, p. Also, tr(A^{1/2}G^{-1}A^{1/2}) = ∑_{i=1}^p η_i and |A^{1/2}G^{-1}A^{1/2}| =
∏_{i=1}^p η_i hold (Why (?!) Use the spectral decomposition!). In particular, |G| = |A| / ∏_{i=1}^p η_i. Hence

\[
-n\log|G| - \operatorname{tr}(G^{-1}A) = n\sum_{i=1}^{p}\log\eta_i - n\log|A| - \sum_{i=1}^{p}\eta_i \tag{3.5}
\]

Considering the expression n∑_{i=1}^p log η_i − n log|A| − ∑_{i=1}^p η_i as a function of the eigen-
values η_i, i = 1, …, p, we realize that it has a maximum which is attained when all
η_i = n, i = 1, …, p. Indeed, the first partial derivatives with respect to η_i, i = 1, …, p are
equal to n/η_i − 1 and hence the stationary points are η_i* = n, i = 1, …, p. The matrix of sec-
ond derivatives calculated at η_i* = n, i = 1, …, p is equal to −(1/n)I_p, which is negative definite, and hence the stationary
point gives rise to a maximum of the function. Now, we can check directly by substituting
the η* values that the maximal value of the function is np log(n) − n log(|A|) − np. But
a direct substitution in the formula h(G) = −n log(|G|) − tr(G^{-1}A) with G = (1/n)A also
gives np log(n) − n log(|A|) − np, i.e. the maximum is attained at G = (1/n)A.

Using the structure of the log-likelihood function in (3.4) and Theorem 3.1 (applied
with A = ∑_{i=1}^n (x_i − x̄)(x_i − x̄)′ (!)), it is now easy to formulate the following

Theorem 3.2 Suppose x_1, x_2, …, x_n is a random sample from N_p(µ,Σ), p < n. Then µ̂ = x̄ and Σ̂ = (1/n)∑_{i=1}^n (x_i − x̄)(x_i − x̄)′ are the maximum likelihood estimators of µ and Σ, respectively.

Remark 3.1.3. Alternative proofs of Theorem 3.2 are also available that utilize some formal rules for vector and matrix differentiation that have been developed as standard machinery in multivariate analysis (recall that, according to the folklore, in order to find the maximum of the log-likelihood we need to differentiate it with respect to its arguments, i.e. with respect to the vector µ and to the matrix Σ, set the derivatives equal to zero and solve the corresponding equation system). If time permits, these matrix differentiation rules will also be discussed later in this course.

3.1.4. Application in correlation matrix estimation

The correlation matrix can be defined in terms of the elements of the covariance matrix Σ. The correlation coefficients ρ_{ij}, i = 1, …, p, j = 1, …, p are defined as

\[
\rho_{ij} = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}}\sqrt{\sigma_{jj}}}
\]

where Σ = (σ_{ij}, i = 1, …, p; j = 1, …, p) is the covariance matrix. Note that ρ_{ii} = 1, i = 1, …, p. To derive the MLE of ρ_{ij}, i = 1, …, p, j = 1, …, p, we note that these are continuous transformations of the covariances, whose maximum likelihood estimators have already been derived. Then we can claim (according to the transformation invariance property of MLE) that

\[
\hat{\rho}_{ij} = \frac{\hat{\sigma}_{ij}}{\sqrt{\hat{\sigma}_{ii}}\sqrt{\hat{\sigma}_{jj}}},\qquad i = 1,\ldots,p,\; j = 1,\ldots,p \tag{3.6}
\]

3.1.5. Sufficiency of µ̂ and Σ̂

Going back to (3.4) we can write the likelihood function as

\[
L(x;\mu,\Sigma) = \frac{1}{(2\pi)^{\frac{np}{2}}|\Sigma|^{\frac{n}{2}}}\,
e^{-\frac{1}{2}\operatorname{tr}\big[\Sigma^{-1}\big(\sum_{i=1}^{n}(x_i-\bar{x})(x_i-\bar{x})' + n(\bar{x}-\mu)(\bar{x}-\mu)'\big)\big]}
\]

which means that L(x;µ,Σ) can be factorised as L(x;µ,Σ) = g_1(µ,Σ) · g_2(µ,Σ; µ̂, Σ̂), i.e. the likelihood function depends on the observations only through the values of µ̂ = x̄ and Σ̂. Hence the pair µ̂ and Σ̂ are sufficient statistics for µ and Σ in the case of a sample from N_p(µ,Σ). Note that the structure of the multivariate normal density was essentially used here, thus underlining the importance of checking the adequacy of the multivariate normality assumption in practice. If testing indicates significant departures from multivariate normality, then inferences that are based solely on µ̂ and Σ̂ may not be very reliable.

3.2. Distributions of the MLE of the mean vector and covariance matrix of the multivariate normal distribution

Inference is not restricted to finding point estimators only; it also includes constructing confidence regions, testing hypotheses etc. To this end we need the distribution of the estimators (or of suitably chosen functions of them).

3.2.1. Sampling distribution of X̄

In the univariate case (p = 1) it is well known that for a sample of n observations from the normal distribution N(µ, σ²) the sample mean is normally distributed: N(µ, σ²/n). Moreover, the sample mean and the sample variance are independent in the case of sampling from a univariate normal population. This fact was very useful in developing t-statistics for testing hypotheses about the mean. Do we have similar statements about the sample mean and sample variance in the multivariate (p > 1) case?
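
Before deriving the answer formally, a small simulation sketch (NumPy; all numerical values are arbitrary illustrative choices) previews it. The first few lines also show the estimators of Theorem 3.2 and the correlation MLE (3.6) on a single sample; the Monte Carlo part suggests that X̄ behaves like N_p(µ, Σ/n) and is uncorrelated with Σ̂.

```python
import numpy as np

rng = np.random.default_rng(3)

p, n = 2, 40
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.6],
                  [0.6, 2.0]])

def mle(X):
    """Estimators of Theorem 3.2: x-bar and (1/n) * sum_i (x_i - x-bar)(x_i - x-bar)'."""
    xbar = X.mean(axis=1)
    C = X - xbar[:, None]
    return xbar, C @ C.T / X.shape[1]

# Estimators on a single sample, including the correlation MLE (3.6).
X = rng.multivariate_normal(mu, Sigma, size=n).T      # p x n data matrix
mu_hat, Sigma_hat = mle(X)
d = np.sqrt(np.diag(Sigma_hat))
rho_hat = Sigma_hat / np.outer(d, d)                  # entrywise version of (3.6)
print(mu_hat, rho_hat[0, 1])

# Monte Carlo preview: X-bar behaves like N_p(mu, Sigma/n) and shows no
# correlation with Sigma-hat (consistent with the independence shown below).
reps = 5000
xbars = np.empty((reps, p))
s11 = np.empty(reps)
for r in range(reps):
    xb, Sh = mle(rng.multivariate_normal(mu, Sigma, size=n).T)
    xbars[r] = xb
    s11[r] = Sh[0, 0]

print(np.cov(xbars, rowvar=False))            # approximately Sigma / n
print(np.corrcoef(xbars[:, 0], s11)[0, 1])    # approximately 0
```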

Let the random vector X̄ = (1/n)∑_{i=1}^n X_i ∈ R^p. For any L ∈ R^p, L′X̄ is a linear com-
bination of normals and hence is normal (see Definition 2.2.1). Since taking expected
value is a linear operation, we have EX̄ = (1/n)·nµ = µ. In analogy with the univariate case
we could formally write Cov X̄ = (1/n²)·n Cov X_1 = (1/n)Σ and hence X̄ ∼ N_p(µ, (1/n)Σ). But we
would like to develop a more appropriate machinery for the multivariate case that would
help us to more rigorously prove statements like the last one. It is based on operations
with Kronecker products.

Kronecker product of two matrices A ∈ Mm,n and B ∈ Mp,q is denoted by A⊗B
and is defined (in block matrix notation) as

\[
A \otimes B = \begin{pmatrix}
a_{11}B & a_{12}B & \ldots & a_{1n}B \\
a_{21}B & a_{22}B & \ldots & a_{2n}B \\
\vdots  & \vdots  &        & \vdots  \\
a_{m1}B & a_{m2}B & \ldots & a_{mn}B
\end{pmatrix} \tag{3.7}
\]

The following basic properties of Kronecker products will be used:

(A ⊗ B) ⊗ C = A ⊗ (B ⊗ C)

(A + B) ⊗ C = A ⊗ C + B ⊗ C

(A ⊗ B)′ = A′ ⊗ B′

(A ⊗ B)^{-1} = A^{-1} ⊗ B^{-1}

(A ⊗ B)(C ⊗ D) = AC ⊗ BD

(whenever the corresponding matrix products and inverses exist)

tr(A ⊗ B) = tr(A) tr(B)

|A ⊗ B| = |A|^p |B|^m

(in case A ∈ M_{m,m}, B ∈ M_{p,p})

In addition, the vec operation (denoted by an arrow, →) on a matrix A ∈ M_{m,n} will be defined. This operation
creates a vector \(\vec{A} \in R^{mn}\) which is composed by stacking the n columns of the matrix
A ∈ M_{m,n} under each other (the second below the first etc). For matrices A, B and C (of
suitable dimensions) it holds that

\[
\overrightarrow{ABC} = (C' \otimes A)\vec{B}
\]
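
These identities are easy to check numerically with NumPy's np.kron (a small sketch; the dimensions and random matrices are arbitrary). Note that NumPy stores arrays row-major, so the column-stacking vec corresponds to reshaping in Fortran (column-major) order.

```python
import numpy as np

rng = np.random.default_rng(4)

m, n, p, q = 3, 4, 2, 5
A = rng.normal(size=(m, n))
B = rng.normal(size=(n, q))
C = rng.normal(size=(q, p))

# vec(ABC) = (C' ⊗ A) vec(B); "vec" stacks columns, i.e. Fortran order in NumPy.
vec = lambda M: M.reshape(-1, order="F")
print(np.allclose(vec(A @ B @ C), np.kron(C.T, A) @ vec(B)))

# (A ⊗ B)(C ⊗ D) = AC ⊗ BD for conformable matrices.
A1, B1 = rng.normal(size=(3, 4)), rng.normal(size=(2, 6))
C1, D1 = rng.normal(size=(4, 5)), rng.normal(size=(6, 2))
print(np.allclose(np.kron(A1, B1) @ np.kron(C1, D1), np.kron(A1 @ C1, B1 @ D1)))

# |A ⊗ B| = |A|^p |B|^m for A in M_{m,m} and B in M_{p,p}.
Am, Bp = rng.normal(size=(m, m)), rng.normal(size=(p, p))
print(np.allclose(np.linalg.det(np.kron(Am, Bp)),
                  np.linalg.det(Am) ** p * np.linalg.det(Bp) ** m))
```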

Let us see how we could utilize the above to derive the distribution of X̄. Denote by 1_n
the vector of n ones. Note that if X is the random data matrix (see (2.1) in II Lecture), then
\(\vec{X} \sim N(1_n \otimes \mu,\, I_n \otimes \Sigma)\) and \(\bar{X} = \frac{1}{n}(1_n' \otimes I_p)\vec{X}\). Hence X̄ is multivariate normal with

\[
E\bar{X} = \frac{1}{n}(1_n' \otimes I_p)(1_n \otimes \mu) = \frac{1}{n}(1_n'1_n \otimes \mu) = \frac{1}{n}\,n\mu = \mu
\]

\[
\operatorname{Cov}\bar{X} = n^{-2}(1_n' \otimes I_p)(I_n \otimes \Sigma)(1_n \otimes I_p) = n^{-2}(1_n'1_n \otimes \Sigma) = n^{-1}\Sigma
\]
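
The Kronecker computation for Cov X̄ can be confirmed directly (a sketch; Σ and the dimensions are placeholder values):

```python
import numpy as np

n, p = 6, 3
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
ones = np.ones((n, 1))
Ip, In = np.eye(p), np.eye(n)

# Cov(X-bar) = n^{-2} (1_n' ⊗ I_p)(I_n ⊗ Σ)(1_n ⊗ I_p) = Σ / n
L = np.kron(ones.T, Ip)                      # the p x np matrix (1_n' ⊗ I_p)
cov_xbar = L @ np.kron(In, Sigma) @ L.T / n**2
print(np.allclose(cov_xbar, Sigma / n))      # True
```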

How can we show that X̄ and Σ̂ are independent? Recall the likelihood function

\[
L(x;\mu,\Sigma) = \frac{1}{(2\pi)^{\frac{np}{2}}|\Sigma|^{\frac{n}{2}}}\,
e^{-\frac{1}{2}\operatorname{tr}\big[\Sigma^{-1}\big(\sum_{i=1}^{n}(x_i-\bar{x})(x_i-\bar{x})' + n(\bar{x}-\mu)(\bar{x}-\mu)'\big)\big]}
\]

We have two summands in the exponent, of which one depends on the observations only
through nΣ̂ = ∑_{i=1}^n (x_i − x̄)(x_i − x̄)′ and the other depends on the observations
through X̄ only. The idea is now to transform the original data matrix X ∈ M_{p,n} into a
new matrix Z ∈ M_{p,n} of n independent normal vectors (each with covariance Σ) in such a way that X̄ would only
be a function of Z_1 whereas ∑_{i=1}^n (x_i − x̄)(x_i − x̄)′ would only be a function of Z_2, …, Z_n.
If we succeed, then clearly X̄ and ∑_{i=1}^n (x_i − x̄)(x_i − x̄)′ = nΣ̂ would be independent.

Now the claim is that the sought-after transformation is given by Z = XA with
A ∈ M_{n,n} being an orthogonal matrix whose first column equals (1/√n)1_n. Note that the
first column of Z would then be √n X̄. (An explicit form of the matrix A will be discussed
at the lecture.) Since \(\vec{Z} = \overrightarrow{I_pXA} = (A' \otimes I_p)\vec{X}\), the Jacobian of the transformation (of \(\vec{X}\)
into \(\vec{Z}\)) is |A′ ⊗ I_p| = |A|^p = ±1 (note that A is orthogonal). Therefore, the absolute
value of the Jacobian is equal to one. For \(\vec{Z}\) we have:

\[
E(\vec{Z}) = (A' \otimes I_p)(1_n \otimes \mu) = A'1_n \otimes \mu =
\begin{pmatrix} \sqrt{n} \\ 0 \\ \vdots \\ 0 \end{pmatrix} \otimes \mu
\]

Further,

\[
\operatorname{Cov}(\vec{Z}) = (A' \otimes I_p)(I_n \otimes \Sigma)(A \otimes I_p) = A'A \otimes I_p\Sigma I_p = I_n \otimes \Sigma
\]

which means that the Z_i, i = 1, …, n are independent. Note that Z_1 = √n X̄ holds (because
of the choice of the first column of the orthogonal matrix A). Further,

\[
\sum_{i=1}^{n}(X_i - \bar{X})(X_i - \bar{X})'
= \sum_{i=1}^{n}X_iX_i' - \frac{1}{n}\Big(\sum_{i=1}^{n}X_i\Big)\Big(\sum_{i=1}^{n}X_i'\Big)
= ZA'AZ' - Z_1Z_1'
= \sum_{i=1}^{n}Z_iZ_i' - Z_1Z_1'
= \sum_{i=2}^{n}Z_iZ_i'
\]
Hence we proved the following

Theorem 3.3 For a sample of size n from N_p(µ,Σ), p < n, the sample average X̄ ∼ N_p(µ, (1/n)Σ). Moreover, the MLE µ̂ = X̄ and Σ̂ are independent.

3.2.2. Sampling distribution of the MLE of Σ

Definition. A random matrix U ∈ M_{p,p} has a Wishart distribution with parameters Σ, p, n (denoted U ∼ W_p(Σ, n)) if there exist n independent random vectors Y_1, …, Y_n, each with N_p(0,Σ) distribution, such that the distribution of ∑_{i=1}^n Y_iY_i′ coincides with the distribution of U. Note that we require p < n here and that U is necessarily non-negative definite.

Having in mind the proof of Theorem 3.3, we can claim that the distribution of the matrix nΣ̂ = ∑_{i=1}^n (X_i − X̄)(X_i − X̄)′ is the same as that of ∑_{i=2}^n Z_iZ_i′ and therefore is Wishart with parameters Σ, p, n − 1. That is, we can write nΣ̂ ∼ W_p(Σ, n − 1).

The density formula for the Wishart distribution is given in several sources but we will not deal with it in this course. Some properties of the Wishart distribution will be mentioned though, since we will make use of them later in the course:

• If p = 1 and if we denote the “matrix” Σ by σ² (as usual), then W_1(Σ, n)/σ² ∼ χ²_n. In particular, when σ² = 1 we see that W_1(1, n) is exactly a χ²_n random variable. In that sense we can state that the Wishart distribution is a generalisation (with respect to the dimension p) of the chi-square distribution.

• For an arbitrary fixed matrix H ∈ M_{k,p}, k ≤ p, one has nHΣ̂H′ ∼ W_k(HΣH′, n − 1) (Why? Show it!).

• Refer to the previous case for the particular value k = 1. The matrix H ∈ M_{1,p} is just a p-dimensional row vector that we could denote by c′. Then:

i) n c′Σ̂c / (c′Σc) ∼ χ²_{n−1}

ii) n c′Σ^{-1}c / (c′Σ̂^{-1}c) ∼ χ²_{n−p}

• Let us partition S = (1/(n−1))∑_{i=1}^n (X_i − X̄)(X_i − X̄)′ ∈ M_{p,p} into

\[
S = \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix},\; S_{11} \in M_{r,r},\; r < p, \qquad
\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},\; \Sigma_{11} \in M_{r,r},\; r < p.
\]

Further, denote

\[
S_{1|2} = S_{11} - S_{12}S_{22}^{-1}S_{21}, \qquad \Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.
\]

Then it holds that

\[
(n-1)S_{11} \sim W_r(\Sigma_{11},\, n-1)
\]

\[
(n-1)S_{1|2} \sim W_r(\Sigma_{1|2},\, n-p+r-1)
\]
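
To close, a sketch (NumPy/SciPy; the construction of A via QR and all numerical values are illustrative choices, not the explicit A from the lecture) that builds an orthogonal A with first column (1/√n)1_n, verifies the two key identities from the proof of Theorem 3.3, and checks property i) above by simulation.

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(5)

p, n = 3, 25
mu = np.array([0.5, -1.0, 2.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])

# Build an orthogonal A whose first column is (1/sqrt(n)) 1_n, via QR applied to
# [1_n | random columns]; this is one possible explicit choice of such an A.
M = np.column_stack([np.ones(n), rng.normal(size=(n, n - 1))])
A, _ = np.linalg.qr(M)
A[:, 0] *= np.sign(A[0, 0])          # fix the sign so the first column is +(1/sqrt(n)) 1_n

X = rng.multivariate_normal(mu, Sigma, size=n).T        # p x n data matrix
Z = X @ A
xbar = X.mean(axis=1)

print(np.allclose(Z[:, 0], np.sqrt(n) * xbar))                      # Z_1 = sqrt(n) X-bar
print(np.allclose(Z[:, 1:] @ Z[:, 1:].T,
                  (X - xbar[:, None]) @ (X - xbar[:, None]).T))     # sum_{i>=2} Z_i Z_i'

# Monte Carlo check of property i): n c' Sigma-hat c / (c' Sigma c) ~ chi^2_{n-1}.
c = np.array([1.0, -2.0, 0.5])
stats = np.empty(4000)
for r in range(stats.size):
    Xr = rng.multivariate_normal(mu, Sigma, size=n).T
    xb = Xr.mean(axis=1)
    Sigma_hat = (Xr - xb[:, None]) @ (Xr - xb[:, None]).T / n
    stats[r] = n * c @ Sigma_hat @ c / (c @ Sigma @ c)

print(kstest(stats, "chi2", args=(n - 1,)).pvalue)      # typically not small
```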