IV Lecture
Some important test statistics and their distributions
4.1. Tests and confidence regions for the multivariate normal mean
4.1. Hotelling’s T²
Suppose again that, as in III Lecture, we have observed n independent realizations
of p-dimensional random vectors from N_p(µ, Σ). Suppose for simplicity that Σ is non-singular. The data matrix has the form

x = [ x_11  x_12  …  x_1j  …  x_1n ]
    [ x_21  x_22  …  x_2j  …  x_2n ]
    [  ⋮     ⋮         ⋮         ⋮  ]
    [ x_i1  x_i2  …  x_ij  …  x_in ]
    [  ⋮     ⋮         ⋮         ⋮  ]
    [ x_p1  x_p2  …  x_pj  …  x_pn ]
  = [x_1, x_2, …, x_n],

so that the j-th column x_j ∈ R^p contains the j-th observation.
Based on our knowledge from III Lecture we can claim that X̄ ∼ N_p(µ, (1/n)Σ) and
nΣ̂ ∼ W_p(Σ, n−1).
Consequently, any linear combination c′X̄, c ≠ 0 ∈ R^p, follows N(c′µ, (1/n)c′Σc), and the
quadratic form nc′Σ̂c/c′Σc ∼ χ²_{n−1}. Further, we have shown that X̄ and Σ̂ are independently distributed and hence

T = √n c′(X̄ − µ) / √(c′ (n/(n−1)) Σ̂ c) ∼ t_{n−1},

i.e. T follows the t-distribution with (n−1) degrees of freedom.
This result has useful applications in testing for contrasts.
Indeed, if we would like to test H0 : c′µ = Σ_{i=1}^p c_i µ_i = 0, we note that under H0, T
becomes simply

T = √n c′X̄ / √(c′Sc),

that is, it does not involve the unknown µ and can be used as a test statistic whose
distribution under H0 is known. If |T| > t_{α/2, n−1} we should reject H0 in favour of
H1 : c′µ = Σ_{i=1}^p c_i µ_i ≠ 0.
The formulation of the test for other (one-sided) alternatives is left for you as an
exercise.
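For illustration, a minimal Python sketch of this contrast test (numpy and scipy are assumed; the function name, the n × p data array X whose rows are the observations, the contrast c and the level α are illustrative placeholders):

import numpy as np
from scipy import stats

def contrast_test(X, c, alpha=0.05):
    """Test H0: c'mu = 0 via T = sqrt(n) c'xbar / sqrt(c'Sc), which is t_{n-1} under H0."""
    n = X.shape[0]
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)              # unbiased sample covariance (divides by n - 1)
    T = np.sqrt(n) * (c @ xbar) / np.sqrt(c @ S @ c)
    crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
    return T, crit, abs(T) > crit            # reject H0 when |T| exceeds t_{alpha/2, n-1}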
More often we are interested in testing the mean vector of a multivariate normal. First
consider the case of a known covariance matrix Σ (a known variance σ² in the univariate case). The
standard univariate (p = 1) test for this purpose is the following: to test H0 : µ = µ0 versus
H1 : µ ≠ µ0 at level of significance α, we look at

U = √n (X̄ − µ0)/σ

and reject H0 if |U| exceeds the upper (α/2)·100% point of the standard normal distribution. Checking if |U| is
large enough is equivalent to checking if U² = n(X̄ − µ0)(σ²)⁻¹(X̄ − µ0) is large enough.
We can now easily generalize the above test statistic in a natural way to the multivariate
(p > 1) case: calculate U² = n(X̄ − µ0)′Σ⁻¹(X̄ − µ0) and reject the null hypothesis
when U² is large enough. Similarly to the proof of Property 5 of the multivariate normal
distribution (II Lecture) and by using Theorem 3.3 of III Lecture you can convince yourself
(do it!) that U² ∼ χ²_p under the null hypothesis. Hence, tables of the χ²-distribution
suffice to perform the above test in the multivariate case.
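A corresponding sketch for the known-Σ case (again numpy/scipy assumed; the function name and the inputs X, µ0, Σ and α are illustrative, with the observations in the rows of X):

import numpy as np
from scipy import stats

def mean_test_known_sigma(X, mu0, Sigma, alpha=0.05):
    """Reject H0: mu = mu0 when U^2 = n (xbar - mu0)' Sigma^{-1} (xbar - mu0) exceeds the chi2_p critical value."""
    n, p = X.shape
    d = X.mean(axis=0) - mu0
    U2 = n * d @ np.linalg.solve(Sigma, d)   # avoids forming Sigma^{-1} explicitly
    return U2, U2 > stats.chi2.ppf(1 - alpha, df=p)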
Now let us turn to the (practically more relevant) case of an unknown covariance matrix
Σ. The standard univariate (p = 1) test for this purpose is the t-test. Let us recall it: to
test H0 : µ = µ0 versus H1 : µ ≠ µ0 at level of significance α, we look at

T = √n (X̄ − µ0)/S,   S² = (1/(n−1)) Σ_{i=1}^n (X_i − X̄)²

and reject H0 if |T| exceeds the upper (α/2)·100% point of the t-distribution with n − 1
degrees of freedom. We note that checking if |T| is large enough is equivalent to checking
if T² = n(X̄ − µ0)(S²)⁻¹(X̄ − µ0) is large enough. Of course, under H0, the statistic T²
is F-distributed: T² ∼ F_{1,n−1}, which means that H0 would be rejected at level α when
T² > F_{α;1,n−1}. We can now easily generalize the above test statistic in a natural way to
the multivariate (p > 1) case:
Definition 4.1.1 (Hotelling’s T²). The statistic

T² = n(X̄ − µ0)′S⁻¹(X̄ − µ0),   (4.1)

where X̄ = (1/n) Σ_{i=1}^n X_i, S = (1/(n−1)) Σ_{i=1}^n (X_i − X̄)(X_i − X̄)′, µ0 ∈ R^p, X_i ∈ R^p, i = 1, …, n, is
named after Harold Hotelling.
Obviously, the test procedure based on Hotelling’s statistic will reject the null hypothesis H0 : µ = µ0 if the value of T² is significantly large. It turns out we do not need
special tables for the distribution of T² under the null hypothesis because of the following
basic result (which represents a true generalisation of the univariate (p = 1) case):
Theorem 4.1.2. Under the null hypothesis H0 : µ = µ0, Hotelling’s T² is distributed
as ((n−1)p/(n−p)) F_{p,n−p}, where F_{p,n−p} denotes the F-distribution with p and (n − p) degrees of
freedom.
Proof. Indeed, we can write the T² statistic in the form

T² = [n(X̄ − µ0)′S⁻¹(X̄ − µ0) / n(X̄ − µ0)′Σ⁻¹(X̄ − µ0)] · n(X̄ − µ0)′Σ⁻¹(X̄ − µ0).

Denote c = √n(X̄ − µ0). Conditionally on c, the first factor

n(X̄ − µ0)′S⁻¹(X̄ − µ0) / n(X̄ − µ0)′Σ⁻¹(X̄ − µ0) = c′S⁻¹c / c′Σ⁻¹c

has a distribution that depends on the data only through S⁻¹. Noting that nΣ̂ = (n−1)S
and having in mind the third property of Wishart distributions from III Lecture, we can
claim that this distribution is the same as that of (n−1)/χ²_{n−p}. Note also that the distribution
does not depend on the particular c. The second factor n(X̄ − µ0)′Σ⁻¹(X̄ − µ0) ∼ χ²_p and
its distribution depends on the data through X̄ only. Because of the independence of the
mean and covariance estimators, we have that the distribution of T² is the same as the
distribution of (n−1)χ²_p / χ²_{n−p}, where the two chi-squares are independent. But this means that
T²(n−p)/(p(n−1)) ∼ F_{p,n−p} and hence T² ∼ (p(n−1)/(n−p)) F_{p,n−p}.
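Definition 4.1.1 and Theorem 4.1.2 together give a complete recipe for the test; below is a minimal Python sketch (numpy/scipy assumed; the function name and the inputs X (n × p, rows are observations), µ0 and α are illustrative):

import numpy as np
from scipy import stats

def hotelling_t2_test(X, mu0, alpha=0.05):
    """Hotelling's T^2 test of H0: mu = mu0, with the F-based p-value from Theorem 4.1.2."""
    n, p = X.shape
    d = X.mean(axis=0) - mu0
    S = np.cov(X, rowvar=False)                  # S as in (4.1), divides by n - 1
    T2 = n * d @ np.linalg.solve(S, d)
    F = T2 * (n - p) / (p * (n - 1))             # F ~ F_{p, n-p} under H0
    p_value = stats.f.sf(F, p, n - p)
    return T2, p_value, p_value < alpha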
4.1.3. Remark. It is possible to extend the definition of the Wishart distribution
in 3.2.2 by allowing the random vectors Y_i, i = 1, …, n there to be independent with
N_p(µ_i, Σ) distributions (instead of just having all µ_i = 0). One arrives in this way at the noncentral Wishart distribution with parameters Σ, p, n−1, Γ (denoted also as W_p(Σ, n−1, Γ)).
Here Γ = MM′ ∈ M_{p,p}, M = [µ_1, µ_2, …, µ_n] ∈ M_{p,n}, is called the noncentrality parameter. When
all columns of M are zero, this is the usual (central) Wishart distribution. Theorem 4.1.2 can be extended to derive the distribution of the T² statistic under alternatives,
i.e. the distribution of T² = n(X̄ − µ0)′S⁻¹(X̄ − µ0) when the true mean is µ ≠ µ0. This distribution turns out
to be related to the noncentral F-distribution. It is helpful in studying the power of the test of
H0 : µ = µ0 versus H1 : µ ≠ µ0. We shall spare the details here.
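Although we skip the noncentral-F formulas, the power against a fixed alternative can be approximated by simulation; the sketch below (numpy/scipy assumed) draws repeated samples from N_p(µ_true, Σ) and counts rejections of H0 : µ = µ0, with the function name and all inputs chosen for illustration only:

import numpy as np
from scipy import stats

def t2_power_mc(mu_true, mu0, Sigma, n, alpha=0.05, n_rep=2000, seed=0):
    """Monte Carlo estimate of the power of the T^2 test against the alternative mu_true."""
    rng = np.random.default_rng(seed)
    p = len(mu0)
    crit = p * (n - 1) / (n - p) * stats.f.ppf(1 - alpha, p, n - p)   # T^2 critical value
    rejections = 0
    for _ in range(n_rep):
        X = rng.multivariate_normal(mu_true, Sigma, size=n)
        d = X.mean(axis=0) - mu0
        T2 = n * d @ np.linalg.solve(np.cov(X, rowvar=False), d)
        rejections += T2 > crit
    return rejections / n_rep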
4.1.4. T² as a likelihood ratio statistic
It is worth mentioning that Hotelling’s T², which we introduced by analogy with the
univariate squared t statistic, can in fact also be derived as the likelihood ratio test statistic
for testing H0 : µ = µ0 versus H1 : µ ≠ µ0. This safeguards the asymptotic optimality of
the test suggested in 4.1.1–4.1.2. To see this, first recall the likelihood function (3.2). Its
unconstrained maximization gives as a maximum value

L(x; µ̂, Σ̂) = (2π)^{−np/2} |Σ̂|^{−n/2} e^{−np/2}.
On the other hand, under H0,

max_Σ L(x; µ0, Σ) = max_Σ (2π)^{−np/2} |Σ|^{−n/2} exp(−(1/2) Σ_{i=1}^n (x_i − µ0)′Σ⁻¹(x_i − µ0)).
Since log L(x; µ0, Σ) = −(np/2) log(2π) − (n/2) log|Σ| − (1/2) tr[Σ⁻¹(Σ_{i=1}^n (x_i − µ0)(x_i − µ0)′)], on
applying Anderson’s lemma (see Theorem 3.1 in III Lecture) we find that the maximum of
log L(x; µ0, Σ) (whence also of L(x; µ0, Σ)) is obtained at Σ̂0 = (1/n) Σ_{i=1}^n (x_i − µ0)(x_i − µ0)′
and the maximal value is

(2π)^{−np/2} |Σ̂0|^{−n/2} e^{−np/2}.
Hence the likelihood ratio is

Λ = max_Σ L(x; µ0, Σ) / max_{µ,Σ} L(x; µ, Σ) = (|Σ̂| / |Σ̂0|)^{n/2}.   (4.2)

The equivalent statistic Λ^{2/n} = |Σ̂|/|Σ̂0| is called Wilks’ lambda. Small values of Wilks’
lambda lead to rejecting H0 : µ = µ0.
The following theorem shows the relation between Wilks’ lambda and T²:
Theorem 4.1.5. The likelihood ratio test is equivalent to the test based on T², since
Λ^{2/n} = (1 + T²/(n−1))⁻¹ holds.
Proof. Consider the matrix A ∈ M_{p+1,p+1}:

A = [ Σ_{i=1}^n (x_i − x̄)(x_i − x̄)′   √n(x̄ − µ0) ]   =   [ A_11  A_12 ]
    [ √n(x̄ − µ0)′                      −1          ]       [ A_21  A_22 ]

It is easy to check that |A| = |A_22| |A_11 − A_12 A_22⁻¹ A_21| = |A_11| |A_22 − A_21 A_11⁻¹ A_12| holds, from
which we get

(−1) |Σ_{i=1}^n (x_i − x̄)(x_i − x̄)′ + n(x̄ − µ0)(x̄ − µ0)′| =
|Σ_{i=1}^n (x_i − x̄)(x_i − x̄)′| · (−1 − n(x̄ − µ0)′(Σ_{i=1}^n (x_i − x̄)(x_i − x̄)′)⁻¹(x̄ − µ0)).

Hence, using (n−1)S = Σ_{i=1}^n (x_i − x̄)(x_i − x̄)′ and the identity Σ_{i=1}^n (x_i − µ0)(x_i − µ0)′ = Σ_{i=1}^n (x_i − x̄)(x_i − x̄)′ + n(x̄ − µ0)(x̄ − µ0)′, we obtain
(−1)|Σ_{i=1}^n (x_i − µ0)(x_i − µ0)′| = |Σ_{i=1}^n (x_i − x̄)(x_i − x̄)′| (−1)(1 + T²/(n−1)). Thus |Σ̂0| =
|Σ̂|(1 + T²/(n−1)), i.e.

Λ^{2/n} = (1 + T²/(n−1))⁻¹.   (4.3)
Hence H0 is rejected for small values of Λ^{2/n} or, equivalently, for large values of T². The
critical values for T² are determined from Theorem 4.1.2.
4.1.6. Note about numerical calculation. Relation (4.3) can be used to calculate
T² from Λ^{2/n} = |Σ̂|/|Σ̂0|, thus avoiding the need to invert the matrix S when calculating
T²!
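A minimal sketch of this determinant-based computation (numpy assumed; the function name and the inputs are illustrative), which recovers T² by inverting relation (4.3):

import numpy as np

def t2_via_determinants(X, mu0):
    """Compute T^2 from |Sigma_hat| / |Sigma_hat_0| without inverting S."""
    n = X.shape[0]
    xbar = X.mean(axis=0)
    Sigma_hat = (X - xbar).T @ (X - xbar) / n        # unrestricted MLE of Sigma
    Sigma_hat0 = (X - mu0).T @ (X - mu0) / n         # MLE of Sigma under H0: mu = mu0
    wilks = np.linalg.det(Sigma_hat) / np.linalg.det(Sigma_hat0)   # Lambda^{2/n}
    return (n - 1) * (1.0 / wilks - 1.0)             # T^2 = (n-1)(1/Lambda^{2/n} - 1)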
4.1.7. Asymptotic distribution of T². Since S⁻¹ is a consistent estimator of Σ⁻¹,
the limiting distribution of T² coincides with that of n(x̄ − µ)′Σ⁻¹(x̄ − µ), which,
as we know already, is χ²_p. This coincides with a general claim of asymptotic theory which
states that −2 log Λ is asymptotically distributed as χ²_p. Indeed,

−2 log Λ = n log(1 + T²/(n−1)) ≈ (n/(n−1)) T² ≈ T²

(by using the fact that log(1 + x) ≈ x for small x).
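As a quick numerical illustration of the approximation (scipy assumed; p, n and α below are arbitrary choices), one can compare the exact critical value of T² from Theorem 4.1.2 with its χ²_p approximation:

from scipy import stats

p, n, alpha = 4, 50, 0.05
exact = p * (n - 1) / (n - p) * stats.f.ppf(1 - alpha, p, n - p)   # exact T^2 critical value
approx = stats.chi2.ppf(1 - alpha, p)                              # asymptotic chi-square value
print(exact, approx)   # the two critical values get closer as n grows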
4.2. Confidence regions for the mean vector and for its components
4.2.1. Confidence region for the mean vector
For a given confidence level (1 − α) it can be constructed in the form

{µ : n(x̄ − µ)′S⁻¹(x̄ − µ) ≤ (p(n−1)/(n−p)) F_{p,n−p}(α)},

where F_{p,n−p}(α) is the upper α·100% percentage point of the F-distribution with (p, n−p) df. This confidence region has the form of an ellipsoid in R^p centered at x̄. The
axes of this confidence ellipsoid are directed along the eigenvectors e_i of the matrix
S = (1/(n−1)) Σ_{i=1}^n (x_i − x̄)(x_i − x̄)′. The half-lengths of the axes are given by

√λ_i √(p(n−1)F_{p,n−p}(α)/(n(n−p))),  i = 1, …, p,

λ_i being the corresponding eigenvalues, i.e. Se_i = λ_i e_i, i = 1, …, p.
For illustration: see numerical example 5.3, pages 221–223, Johnson and Wichern.
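A minimal sketch of the computation of this ellipsoid (numpy/scipy assumed; the function name and the inputs are illustrative): it returns the centre x̄, the axis directions e_i and the corresponding half-lengths.

import numpy as np
from scipy import stats

def mean_confidence_ellipsoid(X, alpha=0.05):
    """Centre, axis directions and half-lengths of the (1 - alpha) confidence ellipsoid for mu."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    lam, E = np.linalg.eigh(S)                      # eigenvalues lam_i, eigenvectors e_i (columns of E)
    c2 = p * (n - 1) / (n * (n - p)) * stats.f.ppf(1 - alpha, p, n - p)
    half_lengths = np.sqrt(lam * c2)                # sqrt(lam_i) * sqrt(p(n-1)F_{p,n-p}(alpha)/(n(n-p)))
    return xbar, E, half_lengths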
4.2.2. Simultaneous confidence statements
For a given confidence level (1 − α) the confidence ellipsoids in 4.2.1 correctly reflect
the joint (multivariate) knowledge about plausible values of µ ∈ R^p, but nevertheless one
is often interested in confidence intervals for the means of the individual components. We
would like to formulate these statements in such a form that all of the separate confidence
statements hold simultaneously with a prespecified probability. This is why we
speak about simultaneous confidence intervals.
First, note that if the vector X ∼ N_p(µ, Σ) then for any l ∈ R^p: l′X ∼ N_1(l′µ, l′Σl)
and, hence, for any fixed l we can construct a (1 − α)·100% confidence interval for l′µ in
the following simple way:

(l′x̄ − t_{n−1}(α/2) √(l′Sl)/√n,  l′x̄ + t_{n−1}(α/2) √(l′Sl)/√n).   (4.4)

By taking l′ = [1, 0, …, 0] or l′ = [0, 1, 0, …, 0] etc. we obtain from (4.4) the usual
confidence interval for each separate component of the mean. Note however that the
confidence level for all these statements taken together is not (1 − α). To make it
(1 − α) for all possible choices simultaneously we need to take a larger constant than
t_{n−1}(α/2) on the right-hand side of the inequality |√n(l′x̄ − l′µ)/√(l′Sl)| ≤ t_{n−1}(α/2) (or, equivalently,
n(l′x̄ − l′µ)²/(l′Sl) ≤ t²_{n−1}(α/2)).
Theorem 4.2.3. Simultaneously for all l ∈ R^p, the interval

(l′x̄ − √((p(n−1)/(n(n−p))) F_{p,n−p}(α) l′Sl),  l′x̄ + √((p(n−1)/(n(n−p))) F_{p,n−p}(α) l′Sl))

will contain l′µ with probability at least (1 − α).
Proof. Note that according to the Cauchy–Bunyakovsky–Schwarz inequality,

[l′(x̄ − µ)]² = [(S^{1/2}l)′ S^{−1/2}(x̄ − µ)]² ≤ ‖S^{1/2}l‖² · ‖S^{−1/2}(x̄ − µ)‖² = l′Sl · (x̄ − µ)′S⁻¹(x̄ − µ).

Therefore

max_l n(l′(x̄ − µ))²/(l′Sl) ≤ n(x̄ − µ)′S⁻¹(x̄ − µ) = T².   (4.5)

Inequality (4.5) helps us to claim that whenever a constant c is such that T² ≤ c², then also
n(l′x̄ − l′µ)²/(l′Sl) ≤ c² holds for any l ∈ R^p, l ≠ 0. Equivalently,

l′x̄ − c√(l′Sl/n) ≤ l′µ ≤ l′x̄ + c√(l′Sl/n)   (4.6)

for every l. Now it remains to choose c² = p(n−1)F_{p,n−p}(α)/(n−p) to make sure that
1 − α = P(T² ≤ c²) holds, and this will automatically ensure that (4.6) will contain l′µ
with probability 1 − α.
Illustration: Example 5.4, p. 226 in Johnson and Wichern.
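A small sketch of the simultaneous (T²-based) intervals of Theorem 4.2.3 for several linear combinations at once (numpy/scipy assumed; the function name, X, the matrix L whose rows hold the vectors l′, and α are illustrative inputs):

import numpy as np
from scipy import stats

def simultaneous_intervals(X, L, alpha=0.05):
    """T^2-based simultaneous intervals for l'mu, one row of L per linear combination l."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    c = np.sqrt(p * (n - 1) / (n - p) * stats.f.ppf(1 - alpha, p, n - p))
    half = c * np.sqrt(np.einsum('ij,jk,ik->i', L, S, L) / n)   # c * sqrt(l'Sl / n) for each row l
    centre = L @ xbar
    return np.column_stack([centre - half, centre + half])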
The simultaneous confidence intervals, when applied for the vectors l′ = [1, 0, …, 0], l′ =
[0, 1, 0, …, 0], etc., are much more reliable at a given confidence level than the one-at-a-time
intervals. Note that the former also utilize the covariance structure of all p variables in
their construction. However, sometimes we can do better in cases where one is interested
in a small number of individual confidence statements.
In this latter case, the simultaneous confidence intervals may give too large a region and the Bonferroni method may prove more efficient instead. The idea of the
Bonferroni approach is based on a simple probabilistic inequality. Assume that simultaneous confidence statements about m linear combinations l′_1µ, l′_2µ, …, l′_mµ are required. If
C_i, i = 1, 2, …, m, denotes the ith confidence statement and P(C_i true) = 1 − α_i, then

P(all C_i true) = 1 − P(at least one C_i false) ≥ 1 − Σ_{i=1}^m P(C_i false) = 1 − Σ_{i=1}^m (1 − P(C_i true)) = 1 − (α_1 + α_2 + … + α_m).

Hence, if we choose α_i = α/m, i = 1, 2, …, m (that is, if we make each separate statement at
level α/m instead of α), then the overall significance level will not exceed α.
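For comparison, a sketch of the Bonferroni intervals for m linear combinations (numpy/scipy assumed; the function name, X, L and α are again illustrative): each one-at-a-time t-interval is simply built at level α/m.

import numpy as np
from scipy import stats

def bonferroni_intervals(X, L, alpha=0.05):
    """Bonferroni intervals for l'mu: m t-intervals, each at individual level alpha/m."""
    n = X.shape[0]
    m = L.shape[0]
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    t = stats.t.ppf(1 - alpha / (2 * m), df=n - 1)
    half = t * np.sqrt(np.einsum('ij,jk,ik->i', L, S, L) / n)   # t_{n-1}(alpha/(2m)) * sqrt(l'Sl / n)
    centre = L @ xbar
    return np.column_stack([centre - half, centre + half])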
4.2.4. Remark. Finally, let us note that the comparison of the mean vectors of two or
more different multivariate populations, when there are independent observations
from each of the populations, is a practically important problem. It is discussed in
the master’s course on Longitudinal data analysis.