Lecture 0: Preliminaries

Lecture 0: Preliminaries
0.1 Matrix algebra
0.2 Standard facts about multivariate distributions
0.3 Additional resources
0.4 Exercises

UNSW MATH5855 2021T3 Lecture 0 Slide 1

. Preliminaries
0.1 Matrix algebra
0.1.1 Vectors and matrices
0.1.2 Inverse matrices
0.1.3 Rank
0.1.4 Orthogonal matrices
0.1.5 Eigenvalues and eigenvectors
0.1.6 Cholesky Decomposition
0.1.7 Orthogonal Projection
0.2 Standard facts about multivariate distributions
0.3 Additional resources
0.4 Exercises

UNSW MATH5855 2021T3 Lecture 0 Slide 2

. Preliminaries
0.1 Matrix algebra
0.1.1 Vectors and matrices
0.1.2 Inverse matrices
0.1.3 Rank
0.1.4 Orthogonal matrices
0.1.5 Eigenvalues and eigenvectors
0.1.6 Cholesky Decomposition
0.1.7 Orthogonal Projection

UNSW MATH5855 2021T3 Lecture 0 Slide 3

Matrix operations

X ∈Mp,n: X is a matrix with p rows and n columns
x ∈ Rn: x is a n-dimensional column vector a.k.a. x ∈Mn,1
>: transposition: X ∈Mp,n =⇒ X> ∈Mn,p
I a column vector x ∈ Rn =⇒ a row vector x> ∈ M1,n

Scalar operations:
I matrix (vector) multiplied by scalar
I two matrices (vectors) of the same dimension can be added

or subtracted
Euclidean norm ‖x‖ of a vector x = (x1 x2 · · · xp)> ∈ Rp:

‖x‖ = √(∑_{i=1}^p xi²).

UNSW MATH5855 2021T3 Lecture 0 Slide 4

Inner product

inner product (a.k.a. scalar product) of x , y ∈ Rp is denoted and
defined in the following way:

〈x , y〉 = x>y = ∑_{i=1}^p xi yi (0.1)

I ‖x‖2 = 〈x , x〉
I If θ is the angle between x and y ,

〈x , y〉 = ‖x‖‖y‖ cos(θ) (0.2)

=⇒ Since |cos(θ)| ≤ 1, |〈x , y〉| ≤ ‖x‖‖y‖ (Cauchy–Bunyakovsky–Schwarz Inequality)

orthogonal projection of x on y : ((x>y)/(y>y)) y .

UNSW MATH5855 2021T3 Lecture 0 Slide 5

Matrix multiplication

matrix multiplication: if X ∈Mp,k and Y ∈Mk,n (#columns ofX
= # rows in Y , a.k.a. conformable) then XY exists:
Z = XY ∈Mp,n with elements

z_{i,j}, i = 1, 2, . . . , p, j = 1, 2, . . . , n : z_{i,j} = ∑_{m=1}^k x_{i,m} y_{m,j} (0.3)

I element in the ith row and jth column of Z is a scalar product
of the ith row of X and the jth column of Y

I not commutative, even if both XY and YX exist
I Important example: if x ∈ Rp, x>x ∈ R, but xx> ∈Mp,p!

I transpose of product = product of transposes, in reverse order:

(XY )> = Y>X> (0.4)

UNSW MATH5855 2021T3 Lecture 0 Slide 6

Symmetric and identity matrices

symmetric matrix: a square matrix X ∈Mp,p for which xi ,j = xj ,i ,
i = 1, 2, . . . , p, j = 1, 2, . . . , p holds.
I X> = X

identity matrix: I ∈ Mp,p with (I )ij = δij , i , j = 1, 2, . . . , p (i.e. ones
on the diagonal and zeros outside the diagonal)
I if conformable, X I = X and IX = X

UNSW MATH5855 2021T3 Lecture 0 Slide 7

Matrix trace
The trace of a square matrix X ∈ Mp,p is defined as tr(X ) = ∑_{i=1}^p xii .
The following properties of traces are easy to obtain:
i) tr(X + Y ) = tr(X ) + tr(Y )
ii) tr(XY ) = tr(YX )
iii) tr(X−1YX ) = tr(Y )
iv) If a ∈ Rp and X ∈Mp,p then a>Xa = tr(Xaa>)

UNSW MATH5855 2021T3 Lecture 0 Slide 8

. Preliminaries
0.1 Matrix algebra
0.1.1 Vectors and matrices
0.1.2 Inverse matrices
0.1.3 Rank
0.1.4 Orthogonal matrices
0.1.5 Eigenvalues and eigenvectors
0.1.6 Cholesky Decomposition
0.1.7 Orthogonal Projection

UNSW MATH5855 2021T3 Lecture 0 Slide 9

Matrix determinant

I For a square matrix X ∈Mp,p a number |X | ≡ det(X )
I Defined as

|X | = ∑ ±x_{1i} x_{2j} · · · x_{pm}

where the summation is over all permutations (i , j , . . . ,m) of
the numbers (1, 2, . . . , p) by taking into account the sign
rule: summands with an even permutation get a (+) whereas
the ones with an odd permutation get a (−) sign.

UNSW MATH5855 2021T3 Lecture 0 Slide 10

Matrix determinant calculations
I when p = 1 (scalar), X = a and |X | = a
I when p = 2,
  | x11 x12 ; x21 x22 | = x11x22 − x12x21
I when p = 3,
  | x11 x12 x13 ; x21 x22 x23 ; x31 x32 x33 | = x11x22x33 + x12x23x31 + x21x32x13
                                              − x31x22x13 − x11x23x32 − x12x21x33 (0.5)

I recursively, for X ∈ Mp,p,

  |X | = ∑_i (−1)^{i+j} xij |Xij | (for any given j)
       = ∑_j (−1)^{i+j} xij |Xij | (for any given i)

  Xij : the matrix constructed by deleting the ith row and jth column of X .

UNSW MATH5855 2021T3 Lecture 0 Slide 11

Matrix determinant properties

i) If one row or one column of the matrix contains zeros only,
then the value of the determinant is zero.

ii) |X>| = |X |
iii) If one row (or one column) of the matrix is multiplied by a scalar c then so is the value of the determinant.

iv) |cX | = c^p |X |
v) If X ,Y ∈ Mp,p then |XY | = |X ||Y |
vi) If the matrix X is diagonal (i.e. all non-diagonal elements are zero) then |X | = ∏_{i=1}^p xii . In particular, the determinant of the identity matrix is always equal to one.

UNSW MATH5855 2021T3 Lecture 0 Slide 12

Matrix inverse
If,
I |X | ≠ 0
I i.e., X ∈Mp,p is nonsingular

then,
I An inverse matrix X−1 ∈ Mp,p exists s.t. XX−1 = Ip,p.
I (X−1)ji = (−1)^{i+j} |Xij | / |X |

  |Xij | as before the (i , j)th minor of X .

UNSW MATH5855 2021T3 Lecture 0 Slide 13

Matrix inverse properties

i) XX−1 = X−1X = I
ii) (X−1)> = (X>)−1

iii) (XY )−1 = Y−1X−1 when both X and Y are nonsingular
square matrices of the same dimension.

iv) |X−1| = |X |−1

v) If X is diagonal and nonsingular then all its diagonal elements
are nonzero and X−1 is again diagonal with diagonal elements
equal to 1/xii , i = 1, 2, . . . , p.

UNSW MATH5855 2021T3 Lecture 0 Slide 14

. Preliminaries
0.1 Matrix algebra
0.1.1 Vectors and matrices
0.1.2 Inverse matrices
0.1.3 Rank
0.1.4 Orthogonal matrices
0.1.5 Eigenvalues and eigenvectors
0.1.6 Cholesky Decomposition
0.1.7 Orthogonal Projection

UNSW MATH5855 2021T3 Lecture 0 Slide 15

Linear dependence
A set of vectors x1, x2, . . . , xk ∈ Rn is linearly dependent if there
exist k numbers a1, a2, . . . , ak not all zero such that

a1x1 + a2x2 + · · ·+ akxk = 0 (0.6)

holds.
I Otherwise the vectors are linearly independent.
I For k linearly independent vectors, (0.6) would only be

possible if all numbers a1, a2, . . . , ak were zero.

UNSW MATH5855 2021T3 Lecture 0 Slide 16

Matrix rank

row rank: the maximum number of linearly independent row vectors
column rank: the rank of its set of column vectors
I Always equal
I denoted rk(X )

full rank: If X ∈Mp,n and rk(X ) = min(p, n)
I square matrix A ∈Mp,p is full rank if rk(A) = p
I implies that |A| ≠ 0 (Rouché–Capelli theorem a.k.a. Kronecker–Capelli theorem)
I Let b ∈ Rp be a given vector. Then the linear equation system Ax = b has a unique solution x = A−1b ∈ Rp.

UNSW MATH5855 2021T3 Lecture 0 Slide 17

. Preliminaries
0.1 Matrix algebra
0.1.1 Vectors and matrices
0.1.2 Inverse matrices
0.1.3 Rank
0.1.4 Orthogonal matrices
0.1.5 Eigenvalues and eigenvectors
0.1.6 Cholesky Decomposition
0.1.7 Orthogonal Projection

UNSW MATH5855 2021T3 Lecture 0 Slide 18

Orthogonal matrix
A square matrix X ∈Mp,p is orthogonal if XX> = X>X = Ip,p
holds. The following properties of orthogonal matrices are obvious:

i) X is of full rank (rk(X ) = p) and X−1 = X>

ii) The name orthogonal of the matrix originates from the fact
that the scalar product of each two different column vectors
equals zero. The same holds for the scalar product of each two
different row vectors of the matrix. The norm of each column
vector (or each row vector) is equal to one. These properties
are equivalent to the definition.

iii) |X | = ±1

UNSW MATH5855 2021T3 Lecture 0 Slide 19

. Preliminaries
0.1 Matrix algebra
0.1.1 Vectors and matrices
0.1.2 Inverse matrices
0.1.3 Rank
0.1.4 Orthogonal matrices
0.1.5 Eigenvalues and eigenvectors
0.1.6 Cholesky Decomposition
0.1.7 Orthogonal Projection

UNSW MATH5855 2021T3 Lecture 0 Slide 20

Eigenvalues
For any square matrix X ∈Mp,p we can define the characteristic
polynomial equation of degree p,

f (λ) = |X − λI | = 0. (0.7)

I Has exactly p roots.
I Some may be complex and some may coincide.
I Since the coefficients are real, if there is a complex root of

(0.7) then also its complex conjugate must be a root of the
same equation.

I Each of the above p roots is called an eigenvalue of the matrix X .
I tr(X ) = ∑_{i=1}^p λi
I |X | = ∏_{i=1}^p λi

UNSW MATH5855 2021T3 Lecture 0 Slide 21

Eigenvectors

I For any such eigenvalue λ∗, X − λ∗I is singular.
=⇒ There exists a non-zero vector y ∈ Rp s.t. (X − λ∗I )y = 0.

I Such a y is called an eigenvector of X corresponding to the eigenvalue λ∗.
I Not unique: µy for any real non-zero µ is also an eigenvector for the same eigenvalue.

UNSW MATH5855 2021T3 Lecture 0 Slide 22

Uniqueness of eigenvectors
Sparing some details of the derivation, we shall formulate the
following basic result:

Theorem 0.1.
When the matrix X is real symmetric then all of its p eigenvalues
are real. If the eigenvalues are all different then all the p
eigenvectors that correspond to them, are orthogonal (and hence
form a basis in Rp). These eigenvectors are also unique (up to the
norming constant µ above). If some of the eigenvalues coincide
then the eigenvectors corresponding to them are not necessarily
unique but even in this case they can be chosen to be mutually
orthogonal.

UNSW MATH5855 2021T3 Lecture 0 Slide 23

Spectral decomposition
For each of the p eigenvalues λi , i = 1, 2, . . . , p, of X , denote its
corresponding set of mutually orthogonal eigenvectors of unit
length by ei , i = 1, 2, . . . , p, i.e.

  Xei = λi ei , i = 1, 2, . . . , p, ‖ei‖ = 1, ei>ej = 0, i ≠ j

holds. Then it can be shown that the following decomposition
(spectral decomposition) of any symmetric matrix X ∈ Mp,p holds:

  X = λ1 e1 e1> + λ2 e2 e2> + · · · + λp ep ep>. (0.8)

Equivalently, X = PΛP> where Λ = diag(λ1, . . . , λp) is diagonal
and P ∈ Mp,p is an orthogonal matrix whose columns are the p orthogonal
eigenvectors e1, e2, . . . , ep.

UNSW MATH5855 2021T3 Lecture 0 Slide 24

Powers of a matrix: inverse
A symmetric matrix X ∈Mp,p is positive definite if all of its
eigenvalues are positive. (It is called non-negative definite if all
eigenvalues are ≥ 0.) For a symmetric positive definite matrix we
have all λi , i = 1, 2, . . . , p, to be positive in the spectral
decomposition (0.8).
But then

  X−1 = (P>)−1Λ−1P−1 = PΛ−1P> = ∑_{i=1}^p (1/λi ) ei ei> (0.9)

UNSW MATH5855 2021T3 Lecture 0 Slide 25

Powers of a matrix: square root
Moreover we can define the square root of the symmetric
non-negative definite matrix X in a natural way:

  X^{1/2} = ∑_{i=1}^p √λi ei ei> (0.10)

I makes sense since X^{1/2} X^{1/2} = X
I X^{1/2} is also symmetric and non-negative definite
I X^{−1/2} = ∑_{i=1}^p λi^{−1/2} ei ei> = PΛ^{−1/2}P>
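As a quick illustration (an added sketch, not part of the original slides; the matrix below is an arbitrary example), eigen() in R recovers P and Λ, from which the square root can be formed:

  X <- matrix(c(2, 1, 1, 2), 2, 2)               # symmetric positive definite example
  ev <- eigen(X, symmetric = TRUE)               # P = ev$vectors, Lambda = diag(ev$values)
  P <- ev$vectors
  Xhalf <- P %*% diag(sqrt(ev$values)) %*% t(P)  # X^{1/2} = P Lambda^{1/2} P'
  all.equal(Xhalf %*% Xhalf, X)                  # TRUE: X^{1/2} X^{1/2} = X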

UNSW MATH5855 2021T3 Lecture 0 Slide 26

Example 0.2.
Let X ∈ Mp,p be a symmetric positive definite matrix with
eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λp > 0 and associated eigenvectors of
unit length e1, e2, . . . , ep. Show that

I max_{y≠0} (y>Xy)/(y>y) = λ1, attained when y = e1.
I min_{y≠0} (y>Xy)/(y>y) = λp, attained when y = ep.

UNSW MATH5855 2021T3 Lecture 0 Slide 27

Let X = PΛP> be the decomposition (0.8) for X . Denote
z = P>y . Note that y ≠ 0 implies z ≠ 0. Thus

  (y>Xy)/(y>y) = (y>PΛP>y)/(y>y) = (z>Λz)/(z>z)
               = (∑_{i=1}^p λi zi²)/(∑_{i=1}^p zi²) ≤ λ1 (∑_{i=1}^p zi²)/(∑_{i=1}^p zi²) = λ1

If we take y = e1 then, having in mind the structure of the matrix
P, we have z = P>e1 = (1 0 · · · 0)> and for this choice of y
also (z>Λz)/(z>z) = λ1/1 = λ1. The first part of the exercise is shown.

Similar arguments (just changing the sign of the inequality) apply
to show the second part.
In addition, you can try to show that max_{y≠0, y⊥e1} (y>Xy)/(y>y) = λ2
holds. How?

UNSW MATH5855 2021T3 Lecture 0 Slide 28

. Preliminaries
0.1 Matrix algebra
0.1.1 Vectors and matrices
0.1.2 Inverse matrices
0.1.3 Rank
0.1.4 Orthogonal matrices
0.1.5 Eigenvalues and eigenvectors
0.1.6 Cholesky Decomposition
0.1.7 Orthogonal Projection

UNSW MATH5855 2021T3 Lecture 0 Slide 29

Numerical stability
I Computers have finite precision
  I around 16 decimal significant figures
  I scientific notation =⇒ absolute magnitude has little effect on precision, but mixing magnitudes produces rounding
  I E.g., 1 × 10^18 + 1 × 10^0 = 1,000,000,000,000,000,000 + 1 = 1,000,000,000,000,000,000
I For matrices, the condition number |λ1/λp| (for a pos. def. matrix) is used to assess the potential error =⇒ higher = worse
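A quick R illustration of both points (an added sketch; the near-singular matrix is an arbitrary example):

  1e18 + 1 == 1e18                          # TRUE: the added 1 is lost to rounding
  X <- matrix(c(1, 0.999, 0.999, 1), 2, 2)  # nearly singular, hence ill-conditioned
  kappa(X, exact = TRUE)                    # condition number, about 2000 here
  eigen(X)$values                           # the ratio |lambda1/lambda2| tells the same story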

UNSW MATH5855 2021T3 Lecture 0 Slide 30

Cholesky decomposition
I For a symm. pos. def. matrix X ∈ Mp,p, a unique matrix U ∈ Mp,p exists that is:
  I upper triangular
  I U>U = X
I Many authors use LL> = X for a lower-triangular matrix instead.
  I L ≡ U>
I In SAS/IML, root(x) gives this.
I In R, the function is chol().
I Useful for generating correlated variables (see the sketch below).
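For instance (an added sketch, not from the slides; the target covariance below is an assumed example), chol() turns i.i.d. standard normals into draws with a prescribed covariance:

  set.seed(1)
  Sigma <- matrix(c(4, 1.9, 1.9, 1), 2, 2)   # assumed target covariance matrix
  U <- chol(Sigma)                            # upper triangular with U'U = Sigma
  Z <- matrix(rnorm(2 * 10000), ncol = 2)     # i.i.d. N(0,1) draws, one row per case
  X <- Z %*% U                                # rows of X now have covariance ~ Sigma
  cov(X)                                      # close to Sigma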

UNSW MATH5855 2021T3 Lecture 0 Slide 31

. Preliminaries
0.1 Matrix algebra
0.1.1 Vectors and matrices
0.1.2 Inverse matrices
0.1.3 Rank
0.1.4 Orthogonal matrices
0.1.5 Eigenvalues and eigenvectors
0.1.6 Cholesky Decomposition
0.1.7 Orthogonal Projection

UNSW MATH5855 2021T3 Lecture 0 Slide 32

Orthogonal projection matrix: necessary conditions

I Let L(X ) be the space spanned by the columns of the matrix X ∈ Mn,p.
I Project a vector y ∈ Rn onto it with matrix P ∈ Mn,n (an orthogonal projector):
  I Let z = Py ∈ Rn be the projection.
  I z ∈ L(X ) =⇒ projection of z on L(X ) is z itself:
    Py = z = Pz = PPy = P²y =⇒ (P − P²)y = 0 for all y =⇒ P² = P
    =⇒ P should be idempotent.
  I ∀y : (y − z)>z = 0 =⇒ ∀y : y>(P> − I )Py = 0
    =⇒ (P> − I )P = 0 =⇒ P>P = P; transposing, P>P = P>
    =⇒ P = P> =⇒ P is symmetric.
=⇒ The orthogonal projector is a symmetric and idempotent matrix.

UNSW MATH5855 2021T3 Lecture 0 Slide 33

Orthogonal projection matrix: sufficient conditions
I Let P be symmetric and idempotent.
=⇒ For any y ∈ Rn, z = Py =⇒ Pz = P²y = Py =⇒ P(y − z) = 0 (and also P>(y − z) = 0 since P = P>)
I Consider L(P) (the space generated by the rows/columns of P).
  I z = Py =⇒ z ∈ L(P)
  I P>(y − z) = 0 means that y − z is perpendicular to L(P).
=⇒ Py is the projection of y on L(P).
=⇒ P ∈ Mn,n is an orthogonal projection matrix if and only if it is a symmetric and idempotent matrix.

UNSW MATH5855 2021T3 Lecture 0 Slide 34

Orthogonal projection matrix: properties
I If P is an orthogonal projection on a given linear space M of dimension dim(M) then I − P is an orthogonal projection on the orthocomplement of M.
  I rk(P) = dim(M).
I The rank of an orthogonal projector is equal to the sum of its diagonal elements.

UNSW MATH5855 2021T3 Lecture 0 Slide 35

Orthogonal projection matrix: form

I If the matrix X has full rank then the projector is P_{L(X )} = X (X>X )−1X>
I If the matrix X is not of full rank then the generalised inverse (X>X )− of X>X can be used instead.
  I Not unique
  I But X (X>X )−X> is unique
  I It is the orthogonal projector on the space L(X ) spanned by the columns of X when X is not full rank.
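As an added illustration (arbitrary simulated X, not from the slides), the full-rank projector and its defining properties can be checked in R:

  set.seed(2)
  X <- matrix(rnorm(5 * 2), 5, 2)         # full-rank 5 x 2 example
  P <- X %*% solve(t(X) %*% X) %*% t(X)   # P = X (X'X)^{-1} X'
  all.equal(P, t(P))                       # symmetric
  all.equal(P %*% P, P)                    # idempotent
  sum(diag(P))                             # trace = rank = 2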

UNSW MATH5855 2021T3 Lecture 0 Slide 36

. Preliminaries
0.1 Matrix algebra
0.2 Standard facts about multivariate distributions
0.2.1 Random samples in multivariate analysis
0.2.2 Joint, marginal, conditional distributions
0.2.3 Moments
0.2.4 Density transformation formula
0.2.5 Characteristic and moment generating functions
0.3 Additional resources
0.4 Exercises

UNSW MATH5855 2021T3 Lecture 0 Slide 37

. Preliminaries
0.2 Standard facts about multivariate distributions
0.2.1 Random samples in multivariate analysis
0.2.2 Joint, marginal, conditional distributions
0.2.3 Moments
0.2.4 Density transformation formula
0.2.5 Characteristic and moment generating functions

UNSW MATH5855 2021T3 Lecture 0 Slide 38

Random dataset

I Inference depends on variability of statistics.
I Some assumptions are required about the data matrix (1.1).
I n observations of p-variate random vectors =⇒ random matrix X ∈ Mp,n:

  X = [ X11 X12 · · · X1j · · · X1n
        X21 X22 · · · X2j · · · X2n
        ...               ...
        Xi1 Xi2 · · · Xij · · · Xin
        ...               ...
        Xp1 Xp2 · · · Xpj · · · Xpn ] = [X1,X2, . . . ,Xn]   (0.11)

I Xi , i = 1, 2, . . . , n assumed independent

UNSW MATH5855 2021T3 Lecture 0 Slide 39

. Preliminaries
0.2 Standard facts about multivariate distributions
0.2.1 Random samples in multivariate analysis
0.2.2 Joint, marginal, conditional distributions
0.2.3 Moments
0.2.4 Density transformation formula
0.2.5 Characteristic and moment generating functions

UNSW MATH5855 2021T3 Lecture 0 Slide 40

Random vector cdf, pmf, and/or density

I Random vector X = (X1 X2 · · · Xp)> ∈ Rp, p ≥ 2 has
joint cdf

FX (x) = P(X1 ≤ x1,X2 ≤ x2, . . . ,Xp ≤ xp) = FX (x1, x2, . . . , xp)

I If discrete, the probability mass function is

  PX (x) = P(X1 = x1,X2 = x2, . . . ,Xp = xp)

I If a density fX (x) = fX (x1, x2, . . . , xp) exists such that

  FX (x) = ∫_{−∞}^{x1} · · · ∫_{−∞}^{xp} fX (t) dt1 . . . dtp (0.12)

  then X is continuous.
=⇒ fX (x) = ∂^p FX (x) / (∂x1 ∂x2 · · · ∂xp)

UNSW MATH5855 2021T3 Lecture 0 Slide 41

Marginal distribution

I marginal cdf of the first k < p components of the vector X is

  P(X1 ≤ x1, . . . ,Xk ≤ xk) = P(X1 ≤ x1, . . . ,Xk ≤ xk ,Xk+1 ≤ ∞, . . . ,Xp ≤ ∞)
                             = FX (x1, x2, . . . , xk ,∞,∞, . . . ,∞) (0.13)

I marginal density can be obtained by partial differentiation in (0.13):

  ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} fX (x1, x2, . . . , xp) dxk+1 . . . dxp

I Similarly for any other set of components.
I Each component Xi has marginal cdf FXi (xi ), i = 1, 2, . . . , p.

UNSW MATH5855 2021T3 Lecture 0 Slide 42

Conditional distribution

I conditional density of X1, . . . ,Xr given Xr+1 = xr+1, . . . ,Xp = xp is

  f(X1,...,Xr |Xr+1,...,Xp)(x1, . . . , xr |xr+1, . . . , xp) = fX (x) / fXr+1,...,Xp(xr+1, . . . , xp) (0.14)

I joint density of X1, . . . ,Xr when Xr+1 = xr+1, . . . ,Xp = xp
I only defined when fXr+1,...,Xp(xr+1, . . . , xp) ≠ 0

UNSW MATH5855 2021T3 Lecture 0 Slide 43

Independence

I If X has p independent components =⇒

  FX (x) = FX1(x1)FX2(x2) · · ·FXp(xp) (0.15)

I Equivalently,

  PX (x) = PX1(x1)PX2(x2) · · ·PXp(xp)
  fX (x) = fX1(x1)fX2(x2) · · · fXp(xp) (0.16)

I Conditional distributions do not depend on the conditions
I Functions factorise: FX (x) = ∏_{i=1}^p FXi (xi ), fX (x) = ∏_{i=1}^p fXi (xi )

UNSW MATH5855 2021T3 Lecture 0 Slide 44

. Preliminaries
0.2 Standard facts about multivariate distributions
0.2.1 Random samples in multivariate analysis
0.2.2 Joint, marginal, conditional distributions
0.2.3 Moments
0.2.4 Density transformation formula
0.2.5 Characteristic and moment generating functions

UNSW MATH5855 2021T3 Lecture 0 Slide 45

Moments

I For density fX (x) joint moments of order s1, s2, . . . , sp are

  E(X1^{s1} · · ·Xp^{sp}) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} x1^{s1} · · · xp^{sp} fX (x1, . . . , xp) dx1 . . . dxp (0.17)

I some si = 0 =⇒ calculating the joint moment of a subset of the random variables

UNSW MATH5855 2021T3 Lecture 0 Slide 46

Common multivariate moments
Now, let X ∈ Rp and Y ∈ Rq with densities as above. The following moments are commonly used:

Expectation: µX = E(X ) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} x fX (x1, . . . , xp) dx1 . . . dxp ∈ Rp.

Variance–covariance matrix: (a.k.a. variance or covariance matrix)

  ΣX = Var(X ) = Cov(X ) = E(X − µX )(X − µX )> = E XX> − µX µX>
     = [ σ11 σ12 · · · σ1p
         σ21 σ22 · · · σ2p
         ...          ...
         σp1 σp2 · · · σpp ] ∈ Mp,p.

UNSW MATH5855 2021T3 Lecture 0 Slide 47

Covariance matrix:

  ΣX ,Y = Cov(X ,Y ) = E(X − µX )(Y − µY )> = E XY> − µX µY>
        = [ σX1Y1 σX1Y2 · · · σX1Yq
            σX2Y1 σX2Y2 · · · σX2Yq
            ...                ...
            σXpY1 σXpY2 · · · σXpYq ] ∈ Mp,q.

UNSW MATH5855 2021T3 Lecture 0 Slide 48

Linear transformations of moments
Let A ∈ Mp′,p and B ∈ Mq′,q be fixed and known. Then,
I µAX = AµX ∈ Rp′
I ΣAX = AΣX A> ∈ Mp′,p′
I ΣAX ,BY = AΣX ,Y B> ∈ Mp′,q′

As a corollary, if X ′, Y ′, A′ and B ′ are variables and matrices with
the same dimensions as the originals (but possibly different distributions and values),
I E(AX + A′X ′) = AµX + A′µX ′
I Var(AX + A′X ′) = AΣX A> + AΣX ,X ′(A′)> + A′ΣX ′,X A> + A′ΣX ′(A′)>
I Cov(AX + A′X ′,BY + B ′Y ′) = AΣX ,Y B> + AΣX ,Y ′(B′)> + A′ΣX ′,Y B> + A′ΣX ′,Y ′(B′)>

These identities are also useful when p = p′ = q = q′ = 1 (i.e., scalars).

UNSW MATH5855 2021T3 Lecture 0 Slide 49

. Preliminaries
0.2 Standard facts about multivariate distributions
0.2.1 Random samples in multivariate analysis
0.2.2 Joint, marginal, conditional distributions
0.2.3 Moments
0.2.4 Density transformation formula
0.2.5 Characteristic and moment generating functions

UNSW MATH5855 2021T3 Lecture 0 Slide 50

Density transformation
I p existing random variables X1,X2, . . . ,Xp with density fX (x)
  transformed into p new random variables Y1,Y2, . . . ,Yp (Y ∈ Rp), via functions

  Yi = yi (X1,X2, . . . ,Xp), i = 1, 2, . . . , p (0.18)

I Must be smooth and one-to-one, so invertible on the codomain of y(·).
I Call the inverse Xi = xi (Y1,Y2, . . . ,Yp), i = 1, 2, . . . , p
I Then,

  fY (y1, . . . , yp) = fX [x1(y1, . . . , yp), . . . , xp(y1, . . . , yp)] |J(y1, . . . , yp)| (0.19)

  where J(y1, . . . , yp) is the Jacobian of the transformation:

  J(y1, . . . , yp) = |∂x/∂y | ≡ det [ ∂x1/∂y1 · · · ∂x1/∂yp ; ... ; ∂xp/∂y1 · · · ∂xp/∂yp ] ≡ |∂y/∂x |^{−1} (0.20)

I In (0.19) the absolute value of the Jacobian is substituted.

UNSW MATH5855 2021T3 Lecture 0 Slide 51

. Preliminaries
0.2 Standard facts about multivariate distributions
0.2.1 Random samples in multivariate analysis
0.2.2 Joint, marginal, conditional distributions
0.2.3 Moments
0.2.4 Density transformation formula
0.2.5 Characteristic and moment generating functions

UNSW MATH5855 2021T3 Lecture 0 Slide 52

Characteristic function
I characteristic function (cf) ϕX (t) of the random vector

X ∈ Rp is a function of a p-dimensional argument
t = (t1 t2 · · · tp)> ∈ Rp

I defined as

  ϕX (t) = E(e^{i t>X })

  where i = √−1.

I always exists (since |ϕX (t)| ≤ E(|e^{i t>X }|) = 1 < ∞)
I Related to the moment generating function (mgf): MX (t) = E(e^{t>X })
  I mgf may not exist for all t
I cf’s have one-to-one correspondence with distributions
I under some conditions, can go the other way:

  fX (x) = (1/(2π)) ∫_{−∞}^{+∞} e^{−itx} ϕX (t) dt (for p = 1)

  fX (x) = (2π)^{−p} ∫_{Rp} e^{−i t>x} ϕX (t) dt

UNSW MATH5855 2021T3 Lecture 0 Slide 53

Characteristic function: linear transformation

Theorem 0.3.
If the cf of the random vector X ∈ Rp is ϕX (t) and
Y = AX + b,b ∈ Rq,A ∈Mq,p is a linear transformation, then it
holds for all s ∈ Rq that

ϕY (s) = e^{i s>b} ϕX (A>s)

Proof.
at lecture.

UNSW MATH5855 2021T3 Lecture 0 Slide 54

. Preliminaries
0.1 Matrix algebra
0.2 Standard facts about multivariate distributions
0.3 Additional resources
0.4 Exercises

UNSW MATH5855 2021T3 Lecture 0 Slide 55

Additional resources

I JW Ch. 2–3.

UNSW MATH5855 2021T3 Lecture 0 Slide 56

. Preliminaries
0.1 Matrix algebra
0.2 Standard facts about multivariate distributions
0.3 Additional resources
0.4 Exercises

UNSW MATH5855 2021T3 Lecture 0 Slide 57

Exercise 0.1
In an ecological experiment, colonies of 2 different species of insect
are confined to the same habitat. The survival times of the two
species (in days) are random variables X1 and X2 respectively. It is
thought that X1 and X2 have a joint density of the form

fX (x1, x2) = θ x1 e^{−x1(θ+x2)} (0 < x1, x2) for some constant θ > 0.
(a) Show that fX (x1, x2) is a valid density.
(b) Find the probability that both species die out within t days of the start of the experiment.
(c) Derive the marginal density of X1. Identify this distribution and write down E(X1) and Var(X1).
(d) Derive the marginal density of X2, and the conditional density of X2 given X1 = x1.
(e) What evidence do you now have that X1 and X2 are not independent?
UNSW MATH5855 2021T3 Lecture 0 Slide 58

Exercise 0.2
Let X = [X1,X2]> be a random vector with E(X ) = µ and

  Var(X ) = Σ = σ² ( 1 ρ ; ρ 1 ).

(a) Find Cov(X1 − X2,X1 + X2).
(b) Find Cov(X1,X2 − ρX1).
(c) Choose b to minimise Var(X2 − bX1).

UNSW MATH5855 2021T3 Lecture 0 Slide 59

Exercise 0.3
Suppose X is a p-dimensional random vector with cf ϕX (t). If X is
partitioned as [X(1); X(2)], where X(1) is a p1-dimensional subvector,
then show that

(a) X(1) has cf ϕX(1)(t(1)) = ϕX ([t(1); 0]), t(1) ∈ Rp1 .

(b) X(1) and X(2) are independent if and only if

  ϕX (t) = ϕX ([t(1); 0]) ϕX ([0; t(2)]), ∀t(1) ∈ Rp1 , ∀t(2) ∈ Rp−p1 .

UNSW MATH5855 2021T3 Lecture 0 Slide 60

Exercise 0.4
Let X ∈ Mp,p be a symmetric positive definite matrix with
eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λp > 0 and associated eigenvectors of
unit length ei , i = 1, 2, . . . , p that give rise to the following spectral
decomposition:

  X = λ1 e1 e1> + λ2 e2 e2> + · · · + λp ep ep>

It is known that max_{y≠0} (y>Xy)/(y>y) = λ1. Now, show that
max_{y≠0, 〈y ,e1〉=0} (y>Xy)/(y>y) = λ2. Can you find further
generalisations of this claim?

UNSW MATH5855 2021T3 Lecture 0 Slide 61

Exercise 0.5
We know that an orthogonal projection matrix has only 0 or 1 as
possible eigenvalues. Using this property or otherwise, show that
the rank of an orthogonal projector is equal to the sum of its
diagonal elements.

UNSW MATH5855 2021T3 Lecture 0 Slide 62

Lecture 1: Exploratory Data Analysis of Multivariate Data
1.1 Data organisation
1.2 Basic summaries
1.3 Visualisation
1.4 Software

UNSW MATH5855 2021T3 Lecture 1 Slide 1

1. Exploratory Data Analysis of Multivariate Data
1.1 Data organisation
1.2 Basic summaries
1.3 Visualisation
1.4 Software

UNSW MATH5855 2021T3 Lecture 1 Slide 2

Representation

case (a.k.a. item, individual, or experimental trial) p ≥ 1 variables
recorded on each unit of analysis

xij ith (of p) variable observed on jth (of n) case
data matrix:

  X (p×n) = [ x11 x12 · · · x1j · · · x1n
              x21 x22 · · · x2j · · · x2n
              ...              ...
              xi1 xi2 · · · xij · · · xin
              ...              ...
              xp1 xp2 · · · xpj · · · xpn ]   (1.1)

UNSW MATH5855 2021T3 Lecture 1 Slide 3

1. Exploratory Data Analysis of Multivariate Data
1.1 Data organisation
1.2 Basic summaries
1.3 Visualisation
1.4 Software

UNSW MATH5855 2021T3 Lecture 1 Slide 4

Univariate summaries

sample mean (of variable i): x̄i = (1/n) ∑_{j=1}^n xij

sample variance (of variable i): si² = (1/n) ∑_{j=1}^n (xij − x̄i )²

I Sometimes, we will use a divisor of n − 1 instead.

UNSW MATH5855 2021T3 Lecture 1 Slide 5

Bivariate summaries

sample covariance (of variables i and k): sik = (1/n) ∑_{j=1}^n (xij − x̄i )(xkj − x̄k)

I Linear association only!
I Symmetric: sik ≡ ski .

sample correlation (of variables i and k): rik = sik / √(sii skk) ≡ sik / (si sk)

I A unitless measure.
I Also symmetric.
I Cauchy–Bunyakovsky–Schwarz Inequality =⇒ |rik | ≤ 1.
I Also linear; can use the quotient correlation instead for nonlinear association.

UNSW MATH5855 2021T3 Lecture 1 Slide 6

Calculations on matrix data
The descriptive statistics that we discussed until now are usually
organised into arrays, namely:

Vector of sample means: x̄ = (x̄1 x̄2 · · · x̄p)>

Matrix of sample variances and covariances:

  S (p×p) = [ s11 s12 · · · s1p
              s21 s22 · · · s2p
              ...          ...
              sp1 sp2 · · · spp ]   (1.2)

Matrix of sample correlations:

  R (p×p) = [ 1   r12 · · · r1p
              r21 1   · · · r2p
              ...          ...
              rp1 rp2 · · · 1   ]   (1.3)

UNSW MATH5855 2021T3 Lecture 1 Slide 7

1. Exploratory Data Analysis of Multivariate Data
1.1 Data organisation
1.2 Basic summaries
1.3 Visualisation
1.4 Software

UNSW MATH5855 2021T3 Lecture 1 Slide 8

Graphical representations
Some simple characteristics of the data are worth studying before
the actual multivariate analysis would begin:
I drawing scatterplot of the data;
I calculating simple univariate descriptive statistics for each

variable;
I calculating sample correlation and covariance coefficients; and
I linking multiple two-dimensional scatterplots.

UNSW MATH5855 2021T3 Lecture 1 Slide 9

1. Exploratory Data Analysis of Multivariate Data
1.1 Data organisation
1.2 Basic summaries
1.3 Visualisation
1.4 Software

UNSW MATH5855 2021T3 Lecture 1 Slide 10

SAS In SAS, the procedures that are used for this purpose are
called proc means, proc plot and proc corr. Please study
their short description in the included SAS handout.

R In R, these are implemented in base::rowMeans,
base::colMeans, stats::cor, graphics::plot,
graphics::pairs, GGally::ggpairs. Here, the format is
PACKAGE::FUNCTION, and you can learn more by running

library(PACKAGE)
? FUNCTION

UNSW MATH5855 2021T3 Lecture 1 Slide 11

Lecture 2: The Multivariate Normal Distribution
2.1 Definition
2.2 Properties of multivariate normal
2.3 Tests for Multivariate Normality
2.4 Software
2.5 Examples
2.6 Additional resources
2.7 Exercises

UNSW MATH5855 2021T3 Lecture 2 Slide 1

2. The Multivariate Normal Distribution
2.1 Definition
2.2 Properties of multivariate normal
2.3 Tests for Multivariate Normality
2.4 Software
2.5 Examples
2.6 Additional resources
2.7 Exercises

UNSW MATH5855 2021T3 Lecture 2 Slide 2

Generalising the Normal Distribution
I Generalisation of the univariate normal for p ≥ 2 dimensions
I Consider replacing ((x − µ)/σ)² = (x − µ)(σ²)−1(x − µ) in

  f (x) = (1/√(2πσ²)) e^{−[(x−µ)/σ]²/2}, −∞ < x < ∞ (2.1)

  by (x − µ)>Σ−1(x − µ).
I µ = E X ∈ Rp the expected value of the random vector X ∈ Rp
I covariance matrix

  Σ = E(X − µ)(X − µ)> = [ σ11 σ12 · · · σ1p
                           σ21 σ22 · · · σ2p
                           ...          ...
                           σp1 σp2 · · · σpp ] ∈ Mp,p

I diagonal of Σ =⇒ variances of each of the p random variables
I σij = E[(Xi − E(Xi ))(Xj − E(Xj ))], i ≠ j =⇒ covariances between the ith and jth random variables
I σii ≡ σi²
I Only makes sense if Σ pos. def.
UNSW MATH5855 2021T3 Lecture 2 Slide 3

Multivariate Normal Distribution density

I Σ pos. def. =⇒ density of X is

  f (x) = (1/((2π)^{p/2} |Σ|^{1/2})) e^{−(x−µ)>Σ−1(x−µ)/2}, −∞ < xi < ∞, i = 1, 2, . . . , p (2.2)

I E X = µ
I E[(X − µ)(X − µ)>] = Σ
I Notation: Np(µ,Σ).

UNSW MATH5855 2021T3 Lecture 2 Slide 4

Cramer–Wold argument

I We also want MVN for singular Σ (nonneg. def.)
I Use Cramer–Wold argument:

  The distribution of a p-dimensional random vector X is
  completely characterised by the one-dimensional distributions
  of all linear transformations t>X , t ∈ Rp.

I I.e., consider E[e^{i s (t>X )}] (assumed known for every scalar s ∈ R1 and every t ∈ Rp).
I Substitute s = 1 to get E[e^{i t>X }], the cf of the vector X .

Definition 2.1.
The random vector X ∈ Rp has a multivariate normal distribution
if and only if (iff) any linear transformation t>X , t ∈ Rp has a
univariate normal distribution.

UNSW MATH5855 2021T3 Lecture 2 Slide 5

Lemma 2.2.
The characteristic function of the (univariate) standard normal
random variable X ∼ N(0, 1) is

ψX (t) = exp(−t²/2).

UNSW MATH5855 2021T3 Lecture 2 Slide 6

Proof.
(optional, not examinable) Sketch (full details in the handout):
1. ψX (t) = E exp(itX ) = ∫_{−∞}^{+∞} exp(itx) (1/√(2π)) exp(−x²/2) dx
2. Completing the square and factoring,

   ψX (t) = exp(−t²/2) [the cf] × lim_{h→∞} ∫_{−h+it}^{+h+it} (1/√(2π)) exp(−z²/2) dz [a complex integral].

3. Use Cauchy’s Theorem and contour integration to show that
the complex integral above equals 1.

Aside: We could use the moment generating function (mgf)
MX (t) = E exp(tX ) = exp(t²/2) instead.

UNSW MATH5855 2021T3 Lecture 2 Slide 7

Theorem 2.3.
Suppose that for a random vector X ∈ Rp with a normal
distribution according to Definition 2.1 we have E(X ) = µ and
D(X ) = E[(X − µ)(X − µ)>] = Σ. Then:
i) For any fixed t ∈ Rp, t>X ∼ N(t>µ, t>Σt), i.e. t>X has a
one-dimensional normal distribution with expected value t>µ
and variance t>Σt.

ii) The cf of X ∈ Rp is

   ϕX (t) = e^{i t>µ − (1/2) t>Σt}. (2.3)

UNSW MATH5855 2021T3 Lecture 2 Slide 8

Proof.
Part i) is obvious. For part ii),
I cf of the standard univariate normal random variable Z is e^{−t²/2}
I Any U ∼ N1(µ1, σ1²) has a distribution that coincides with the distribution of µ1 + σ1 Z .
I Then,

  ϕU(t) = e^{itµ1} ϕσ1Z (t) = e^{itµ1} E(e^{itσ1Z }) = e^{itµ1} ϕZ (tσ1) = e^{itµ1 − (1/2) t²σ1²}

I So, for t>X ∼ N1(t>µ, t>Σt) (univariate), the cf is ϕt>X (s) = e^{i s t>µ − (1/2) s² t>Σt}.
I Substituting s = 1,

  ϕX (t) = e^{i t>µ − (1/2) t>Σt}

UNSW MATH5855 2021T3 Lecture 2 Slide 9

=⇒ Given µ and Σ use cf formula (2.3) rather than the density
formula (2.2).
I cf formula defined for singular Σ.
I Still need to show density (2.2) for invertible Σ.

Theorem 2.4.
Assume the matrix Σ in (2.3) is nonsingular. Then the density of
the random vector X ∈ Rp with cf as in (2.3) is given by (2.2).

UNSW MATH5855 2021T3 Lecture 2 Slide 10

Proof.
I Consider the vector Y = Σ^{−1/2}(X − µ) ∈ Rp (compare (0.10) in Section 0.1.5)
I E(Y ) = 0
I D(Y ) = E(YY>) = Σ^{−1/2} E[(X − µ)(X − µ)>] Σ^{−1/2} = Ip
=⇒ substitute to get the cf of Y : ϕY (t) = e^{−(1/2) ∑_{i=1}^p ti²}
I This is the cf of p independent N(0, 1)
I Y = Σ^{−1/2}(X − µ) =⇒ X = µ + Σ^{1/2}Y
I Density fY (y) = (1/(2π)^{p/2}) e^{−(1/2) ∑_{i=1}^p yi²}
I Use the density transformation formula (Section 0.2.4):
  fX (x) = fY (Σ^{−1/2}(x − µ)) |J(x1, . . . , xp)|
I By linearity: |J(x1, . . . , xp)| = |Σ^{−1/2}| = |Σ^{1/2}|^{−1} = |Σ|^{−1/2}
I ∑_{i=1}^p yi² = y>y = (x − µ)>Σ^{−1/2}Σ^{−1/2}(x − µ) = (x − µ)>Σ−1(x − µ)
=⇒ density formula (2.2) for fX (x)

UNSW MATH5855 2021T3 Lecture 2 Slide 11

2. The Multivariate Normal Distribution
2.1 Definition
2.2 Properties of multivariate normal
2.3 Tests for Multivariate Normality
2.4 Software
2.5 Examples
2.6 Additional resources
2.7 Exercises

UNSW MATH5855 2021T3 Lecture 2 Slide 12

Property 1
If Σ = D(X ) = Λ is a diagonal matrix then the p components of X
are independent.

I I.e., then ϕX (t) = e^{i ∑_{j=1}^p tj µj − (1/2) ∑_{j=1}^p tj² σj²} decomposes into cf’s of p
  independent components each distributed according to N(µj , σj²), j = 1, . . . , p

I “for a multivariate normal, if its components are uncorrelated
they are also independent”

I converse (if independent, then uncorrelated) true for any
distribution

I For the multivariate normal distribution, we can conclude that
its components are independent if and only if they are
uncorrelated!

UNSW MATH5855 2021T3 Lecture 2 Slide 13

Example 2.5 (Random variables that are marginally
normal and uncorrelated but not independent).
Consider two variables Z1 = (2W − 1)Y and Z2 = Y , where
Y ∼ N1(0, 1) and, independently, W ∼ Binomial(1, 1/2) (so
2W − 1 takes −1 and +1 with equal probability).

UNSW MATH5855 2021T3 Lecture 2 Slide 14

Property 2
If X ∼ Np(µ,Σ) and C ∈Mq,p is an arbitrary matrix of real
numbers then

Y = CX ∼ Nq(Cµ,CΣC>).

I From Section 0.2.5, for any s ∈ Rq,

  ϕY (s) = ϕX (C>s) = e^{i s>Cµ − (1/2) s>CΣC>s}

=⇒ Y = CX ∼ Nq(Cµ,CΣC>)
I If C is of full rank and rk(Σ) = p then the rank of CΣC> is also full
  I I.e. the distribution of Y would not be degenerate in this case.

UNSW MATH5855 2021T3 Lecture 2 Slide 15

Property 3
(This is a finer version of Property 1.) Assume the vector X ∈ Rp
is divided into subvectors X = (X(1); X(2)) and, according to this
subdivision, the vector mean is µ = (µ(1); µ(2)) and the covariance
matrix Σ has been subdivided into Σ = (Σ11 Σ12; Σ21 Σ22). Then the
vectors X(1) and X(2) are independent iff Σ12 = 0.

Proof.
(Exercise (see lecture)).

UNSW MATH5855 2021T3 Lecture 2 Slide 16

Property 4

Let the vector X ∈ Rp be divided into subvectors X = (X(1); X(2)),
X(1) ∈ Rr , r < p, X(2) ∈ Rp−r , and according to this subdivision the
vector mean is µ = (µ(1); µ(2)) and the covariance matrix Σ has been
subdivided into Σ = (Σ11 Σ12; Σ21 Σ22). Assume for simplicity that
the rank of Σ22 is full. Then the conditional density of X(1) given
that X(2) = x(2) is

  Nr (µ(1) + Σ12Σ22^{−1}(x(2) − µ(2)), Σ11 − Σ12Σ22^{−1}Σ21) (2.4)

UNSW MATH5855 2021T3 Lecture 2 Slide 17

Proof.
I The expression µ(1) + Σ12Σ22^{−1}(x(2) − µ(2)) is a function of x(2); denote it as g(x(2)).
  Construct the r.v.’s Z = X(1) − g(X(2)) and Y = X(2) − µ(2). Observe E Z = 0 and E Y = 0.
I (Z ; Y ) = A(X − µ) with A = ( Ir  −Σ12Σ22^{−1} ; 0  Ip−r ) =⇒ normal (Property 2).
I Var(Z ; Y ) = AΣA> = ( Σ11 − Σ12Σ22^{−1}Σ21  0 ; 0  Σ22 ) {block multiplication}
=⇒ Z and Y uncorrelated and jointly normal =⇒ independent (Property 3).
I Y is a linear transformation of X(2) =⇒ Z and X(2) indep.
=⇒ Conditional density of Z given X(2) = x(2) will not depend on x(2) and coincides with the unconditional density of Z .
=⇒ Z normal with Cov(Z ) = Σ11 − Σ12Σ22^{−1}Σ21 = Σ1|2
=⇒ X(1) − g(x(2)) ∼ N(0,Σ1|2)
=⇒ (2.4)
UNSW MATH5855 2021T3 Lecture 2 Slide 18

Example 2.6.
As an immediate consequence of Property 4 we see that if
p = 2, r = 1 then for a two-dimensional normal vector
(X1; X2) ∼ N{(µ1; µ2), (σ1² σ12; σ12 σ2²)} its conditional density f (x1|x2)
is N(µ1 + (σ12/σ2²)(x2 − µ2), σ1² − σ12²/σ2²).

As an exercise, try to derive the above result by direct calculations,
starting from the joint density f (x1, x2), going over to the marginal
f (x2) by integration and finally getting f (x1|x2) = f (x1, x2)/f (x2).

UNSW MATH5855 2021T3 Lecture 2 Slide 19

Property 5
If X ∼ Np(µ,Σ) and Σ is nonsingular then
(X − µ)>Σ−1(X − µ) ∼ χ²p where χ²p denotes the chi-square
distribution with p degrees of freedom.

Proof.
It suffices to use the fact that (see also Theorem 2.4) the vector
Y = Σ^{−1/2}(X − µ) ∈ Rp is N(0, Ip), i.e. it has p independent
standard normal components. Then

  (X − µ)>Σ−1(X − µ) = Y>Y = ∑_{i=1}^p Yi² ∼ χ²p

according to the definition of χ²p as the distribution of the sum of
squares of p independent standard normals.
UNSW MATH5855 2021T3 Lecture 2 Slide 20

Prediction: “Best Predictor”

I A corollary of Property 4
I Predict Y from p predictors X = (X1 X2 · · · Xp)> by
  choosing g(·) to minimise E[{Y − g(X )}²|X = x ] s.t. E g(X )² < ∞.
I Optimal g∗(x) = E(Y |X = x): the regression function

UNSW MATH5855 2021T3 Lecture 2 Slide 21

Prediction: Best Predictor for MVN

I In general, g∗(x) = E(Y |X = x) can be complicated.
I For MVN, much simpler.
I If (Y ; X ) ∈ R1+p is normal, apply Property 4 =⇒
  g∗(x) = b + σ0>C−1x for b = E(Y ) − σ0>C−1 E(X ),
  C = Cov(X ), and σ0 = Cov(X ,Y ).
I I.e.,
  g∗(x) = E(Y ) + σ0>C−1(x − E(X )).
I In case of joint normality, the best prediction turns out to be linear.
I C−1σ0 ∈ Rp is the vector of the regression coefficients.
I Results in (conditional) variance Var(Y ) − σ0>C−1σ0
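A small added R sketch computing the best linear predictor from assumed MVN moments (the mean and covariance below are illustrative only):

  mu  <- c(Y = 1, X1 = 0, X2 = 2)                  # assumed joint mean of (Y, X1, X2)
  Sig <- matrix(c(2.0, 0.8, 0.5,
                  0.8, 1.0, 0.3,
                  0.5, 0.3, 1.5), 3, 3)            # assumed joint covariance
  C      <- Sig[2:3, 2:3]                          # Cov(X)
  sigma0 <- Sig[2:3, 1]                            # Cov(X, Y)
  beta   <- solve(C, sigma0)                       # regression coefficients C^{-1} sigma0
  b      <- mu[1] - sum(beta * mu[2:3])            # intercept E(Y) - sigma0' C^{-1} E(X)
  predvar <- Sig[1, 1] - sum(sigma0 * beta)        # Var(Y) - sigma0' C^{-1} sigma0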

UNSW MATH5855 2021T3 Lecture 2 Slide 22

2. The Multivariate Normal Distribution
2.1 Definition
2.2 Properties of multivariate normal
2.3 Tests for Multivariate Normality
2.4 Software
2.5 Examples
2.6 Additional resources
2.7 Exercises

UNSW MATH5855 2021T3 Lecture 2 Slide 23

Graphical diagnostics

I Normality makes things easier.
I Is also sometimes an important assumption.
I Since margins and linear combinations of MVN are normal,

1. check marginal distributions (e.g., Q–Q plots, the
Shapiro–Wilk test);

2. check scatterplots of pairs of observations;
3. note outliers.

I Only good for bivariate normality.

UNSW MATH5855 2021T3 Lecture 2 Slide 24

Mardia’s Multivarate Skewness and Kurtosis

Multivariate skewness: For Y independent of X but with the same distribution,

  β1,p = E[(Y − µ)>Σ−1(X − µ)]³ (2.5)

Multivariate kurtosis:

  β2,p = E[(X − µ)>Σ−1(X − µ)]² (2.6)

I Assuming these expectations exist.
I For Np(µ,Σ), β1,p = 0 and β2,p = p(p + 2).
I p = 1 =⇒ β1,1 = (E(X − µ)³/σ³)², β2,1 = E(X − µ)⁴/σ⁴
I Estimated as

  β̂1,p = (1/n²) ∑_{i=1}^n ∑_{j=1}^n gij³,   β̂2,p = (1/n) ∑_{i=1}^n gii²

  where gij = (xi − x̄)> Sn^{−1} (xj − x̄), with Sn the covariance matrix using divisor n.
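These estimates can be computed directly in R (an added sketch on simulated data; Sn uses the divisor n, matching the formula above, and k1, k2 are the test statistics defined on the next slide):

  set.seed(4)
  x <- matrix(rnorm(100 * 3), 100, 3)           # n = 100 cases (rows), p = 3
  n <- nrow(x); p <- ncol(x)
  xc <- scale(x, center = TRUE, scale = FALSE)  # centred data
  Sn <- crossprod(xc) / n                        # MLE covariance (divisor n)
  G  <- xc %*% solve(Sn) %*% t(xc)               # matrix of g_ij values
  b1 <- sum(G^3) / n^2                           # Mardia's skewness estimate
  b2 <- sum(diag(G)^2) / n                       # Mardia's kurtosis estimate
  k1 <- n * b1 / 6                               # ~ chi-square, df = p(p+1)(p+2)/6
  k2 <- (b2 - p * (p + 2)) / sqrt(8 * p * (p + 2) / n)  # ~ N(0, 1)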
UNSW MATH5855 2021T3 Lecture 2 Slide 25

Mardia’s test statistics

I β̂1,p ≥ 0 and β̂2,p ≥ 0
I For MVN, β̂1,p ≈ 0 and β̂2,p ≈ p(p + 2), respectively.
I Asymptotically, k1 = n β̂1,p/6 ∼ χ²_{p(p+1)(p+2)/6}, and
  k2 = [β̂2,p − p(p + 2)]/[8p(p + 2)/n]^{1/2} ∼ N(0, 1).

=⇒ Use k1 and k2 to test the null hypothesis of multivariate
normality.

I If neither hypothesis is rejected MVN assumption is in
reasonable agreement with the data.

I Mardia’s multivariate kurtosis can also be used to detect
outliers.

UNSW MATH5855 2021T3 Lecture 2 Slide 26

Caveat: Overreliance on tests

I Shapiro–Wilk, Mardia, etc. =⇒ H0 : population is (multivariate) normal
I Any deviation from normality =⇒ Pr(reject H0) → 1 as n → ∞
I CLT =⇒ X̄ → MVN as n → ∞ regardless of population distribution
  I S too, but much more slowly
=⇒ As n increases,
  I more likely for the test to conclude population non-normality.
  I need population normality less in the first place.

=⇒ Particularly for large datasets, don’t overrely on tests.

UNSW MATH5855 2021T3 Lecture 2 Slide 27

2. The Multivariate Normal Distribution
2.1 Definition
2.2 Properties of multivariate normal
2.3 Tests for Multivariate Normality
2.4 Software
2.5 Examples
2.6 Additional resources
2.7 Exercises

UNSW MATH5855 2021T3 Lecture 2 Slide 28

SAS Use CALIS procedure. The quantity k2 is called Normalised
Multivariate Kurtosis there, whereas β̂2,p − p(p + 2) bears the
name Mardia’s Multivariate Kurtosis.

R MVN::mvn, psych::mardia

UNSW MATH5855 2021T3 Lecture 2 Slide 29

2. The Multivariate Normal Distribution
2.1 Definition
2.2 Properties of multivariate normal
2.3 Tests for Multivariate Normality
2.4 Software
2.5 Examples
2.6 Additional resources
2.7 Exercises

UNSW MATH5855 2021T3 Lecture 2 Slide 30

Example 2.7.
Testing multivariate normality of microwave oven radioactivity
measurements (JW).

UNSW MATH5855 2021T3 Lecture 2 Slide 31

2. The Multivariate Normal Distribution
2.1 Definition
2.2 Properties of multivariate normal
2.3 Tests for Multivariate Normality
2.4 Software
2.5 Examples
2.6 Additional resources
2.7 Exercises

UNSW MATH5855 2021T3 Lecture 2 Slide 32

Additional resources

I JW Sec. 4.1–4.2, 4.6.

UNSW MATH5855 2021T3 Lecture 2 Slide 33

2. The Multivariate Normal Distribution
2.1 Definition
2.2 Properties of multivariate normal
2.3 Tests for Multivariate Normality
2.4 Software
2.5 Examples
2.6 Additional resources
2.7 Exercises

UNSW MATH5855 2021T3 Lecture 2 Slide 34

Exercise 2.1
Let X1 and X2 denote i.i.d. N(0, 1) r.v.’s.
(a) Show that the r.v.’s Y1 = X1 − X2 and Y2 = X1 + X2 are
    independent, and find their marginal densities.
(b) Find P(X1² + X2² < 2.41).

UNSW MATH5855 2021T3 Lecture 2 Slide 35

Exercise 2.2
Let X ∼ N3(µ,Σ) where

  µ = (3, −1, 2)>,   Σ = [ 3 2 1
                           2 3 1
                           1 1 2 ].

(a) For A = ( 1  1 1
              1 −2 1 )
    find the distribution of Z = AX and find the correlation between the two components of Z .
(b) Find the conditional distribution of [X1,X3]> given X2 = 0.

UNSW MATH5855 2021T3 Lecture 2 Slide 36

Exercise 2.3
Suppose that X1, . . . ,Xn are independent random vectors, with
each Xi ∼ Np(µi ,Σi ). Let a1, . . . , an be real constants. Using
characteristic functions, show that

  a1X1 + · · · + anXn ∼ Np(a1µ1 + · · · + anµn, a1²Σ1 + · · · + an²Σn)

Therefore, deduce that, if X1, . . . ,Xn form a random sample from
the Np(µ,Σ) distribution, then the sample mean vector,
X̄ = (1/n) ∑_{i=1}^n Xi , has distribution

  X̄ ∼ Np(µ, (1/n)Σ).

UNSW MATH5855 2021T3 Lecture 2 Slide 37

Exercise 2.4
Prove that if X1 ∼ Nr (µ1,Σ11) and
(X2|X1 = x1) ∼ Np−r (Ax1 + b,Ω) where Ω does not depend on x1

then X = (X1; X2) ∼ Np(µ,Σ) where

  µ = ( µ1 ; Aµ1 + b ),   Σ = ( Σ11      Σ11A>
                                AΣ11   Ω + AΣ11A> ).

UNSW MATH5855 2021T3 Lecture 2 Slide 38

Exercise 2.5
Knowing that,
i) Z ∼ N1(0, 1)
ii) Y |Z = z ∼ N1(1 + z , 1)
iii) X |(Y ,Z ) = (y , z) ∼ N1(1− y , 1)

(a) Find the distribution of (X , Y , Z )> and of Y |(X ,Z ).

(b) Find the distribution of (U; V ) = (1 + Z ; 1 − Y ).

(c) Compute E(Y |U = 2).

UNSW MATH5855 2021T3 Lecture 2 Slide 39

Lecture 3: Estimation of the Mean Vector and Covariance
Matrix of Multivariate Normal Distribution

3.1 Maximum Likelihood Estimation
3.2 Distributions of MLE of mean vector and covariance matrix of

multivariate normal distribution
3.3 Additional resources
3.4 Exercises

UNSW MATH5855 2021T3 Lecture 3 Slide 1

3. Estimation of the Mean Vector and Covariance Matrix of
Multivariate Normal Distribution
3.1 Maximum Likelihood Estimation
3.1.1 Likelihood function
3.1.2 Maximum Likelihood Estimators
3.1.3 Alternative proofs
3.1.4 Application in correlation matrix estimation
3.1.5 Sufficiency of µ̂ and Σ̂
3.2 Distributions of MLE of mean vector and covariance matrix of

multivariate normal distribution
3.3 Additional resources
3.4 Exercises

UNSW MATH5855 2021T3 Lecture 3 Slide 2

3. Estimation of the Mean Vector and Covariance Matrix of
Multivariate Normal Distribution
3.1 Maximum Likelihood Estimation
3.1.1 Likelihood function
3.1.2 Maximum Likelihood Estimators
3.1.3 Alternative proofs
3.1.4 Application in correlation matrix estimation
3.1.5 Sufficiency of µ̂ and Σ̂

UNSW MATH5855 2021T3 Lecture 3 Slide 3

Data
Suppose we have observed n independent realisations of
p-dimensional random vectors from Np(µ,Σ). Suppose for
simplicity that Σ is non-singular. The data matrix has the form

X = [ X11 X12 · · · X1j · · · X1n
      X21 X22 · · · X2j · · · X2n
      ...              ...
      Xi1 Xi2 · · · Xij · · · Xin
      ...              ...
      Xp1 Xp2 · · · Xpj · · · Xpn ] = [X1,X2, . . . ,Xn] (3.1)

Goal: Estimate unknown mean vector and the covariance matrix
using Maximum Likelihood Estimation.

UNSW MATH5855 2021T3 Lecture 3 Slide 4

Likelihood function

I Lecture 2 =⇒ Likelihood function

  L(x ;µ,Σ) = (2π)^{−np/2} |Σ|^{−n/2} e^{−(1/2) ∑_{i=1}^n (xi−µ)>Σ−1(xi−µ)} (3.2)

I Fix x = data matrix; focus on µ,Σ.
I Log-likelihood function

  log L(x ;µ,Σ) = −(np/2) log(2π) − (n/2) log(|Σ|) − (1/2) ∑_{i=1}^n (xi − µ)>Σ−1(xi − µ) (3.3)

I Same maximum.

UNSW MATH5855 2021T3 Lecture 3 Slide 5

Utilising properties of traces from Section 0.1.1, we can transform:

  ∑_{i=1}^n (xi − µ)>Σ−1(xi − µ) = ∑_{i=1}^n tr[Σ−1(xi − µ)(xi − µ)>]
    = tr[Σ−1 ∑_{i=1}^n (xi − µ)(xi − µ)>]
    = tr[Σ−1 (∑_{i=1}^n (xi − x̄)(xi − x̄)> + n(x̄ − µ)(x̄ − µ)>)]
    = tr[Σ−1 ∑_{i=1}^n (xi − x̄)(xi − x̄)>] + n(x̄ − µ)>Σ−1(x̄ − µ).

=⇒ log L(x ;µ,Σ) = −(np/2) log(2π) − (n/2) log(|Σ|)
                    − (1/2) tr[Σ−1 ∑_{i=1}^n (xi − x̄)(xi − x̄)>] − (1/2) n(x̄ − µ)>Σ−1(x̄ − µ)  (3.4)
UNSW MATH5855 2021T3 Lecture 3 Slide 6

3. Estimation of the Mean Vector and Covariance Matrix of
Multivariate Normal Distribution
3.1 Maximum Likelihood Estimation
3.1.1 Likelihood function
3.1.2 Maximum Likelihood Estimators
3.1.3 Alternative proofs
3.1.4 Application in correlation matrix estimation
3.1.5 Sufficiency of µ̂ and Σ̂

UNSW MATH5855 2021T3 Lecture 3 Slide 7

I Σ (hence Σ−1) nonnegative def. =⇒ (1/2) n(x̄ − µ)>Σ−1(x̄ − µ) ≥ 0 is minimised when µ̂ = x̄
  I What if Σ is singular?
I What about Σ?

UNSW MATH5855 2021T3 Lecture 3 Slide 8

Theorem 3.1 (Anderson’s lemma).
If A ∈Mp,p is symmetric positive definite, then the maximum of
the function h(G ) = −n log(|G |)− tr(G−1A) (defined over the set
of symmetric positive definite matrices G ∈Mp,p) exists, occurs at
G = (1/n)A and has the maximal value of np log(n) − n log(|A|) − np.

UNSW MATH5855 2021T3 Lecture 3 Slide 9

Proof.
(sketch, details at lecture):

I tr(G−1A) = tr((G−1A^{1/2})A^{1/2}) = tr(A^{1/2}G−1A^{1/2})
I Let ηi , i = 1, . . . , p be the eigenvalues of A^{1/2}G−1A^{1/2}
  I Pos. def. =⇒ ηi > 0, i = 1, . . . , p
I tr(A^{1/2}G−1A^{1/2}) = ∑_{i=1}^p ηi and |A^{1/2}G−1A^{1/2}| = ∏_{i=1}^p ηi

=⇒ −n log|G | − tr(G−1A) = n ∑_{i=1}^p log ηi − n log|A| − ∑_{i=1}^p ηi  (3.5)

I maximised when ηi = n for all i
=⇒ Maximum is np log(n) − n log(|A|) − np.
I For G = (1/n)A,
  h(G ) = −n log(|G |) − tr(G−1A) = np log(n) − n log(|A|) − np.
=⇒ G = (1/n)A maximises.

UNSW MATH5855 2021T3 Lecture 3 Slide 10

Theorem 3.1 for A = ∑_{i=1}^n (xi − x̄)(xi − x̄)> =⇒

Theorem 3.2.
Suppose X1,X2, . . . ,Xn is a random sample from Np(µ,Σ), p < n.
Then µ̂ = X̄ and Σ̂ = (1/n) ∑_{i=1}^n (xi − x̄)(xi − x̄)> are the maximum
likelihood estimators of µ and Σ, respectively.
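In R (an added sketch on simulated data), the MLEs are just the sample mean vector and the sample covariance matrix rescaled to divisor n:

  set.seed(5)
  x <- matrix(rnorm(200), ncol = 2)      # n = 100 rows, p = 2 columns
  n <- nrow(x)
  mu.hat    <- colMeans(x)               # MLE of the mean vector
  Sigma.hat <- cov(x) * (n - 1) / n      # MLE of Sigma (divisor n, not n - 1)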

UNSW MATH5855 2021T3 Lecture 3 Slide 11

3. Estimation of the Mean Vector and Covariance Matrix of
Multivariate Normal Distribution
3.1 Maximum Likelihood Estimation
3.1.1 Likelihood function
3.1.2 Maximum Likelihood Estimators
3.1.3 Alternative proofs
3.1.4 Application in correlation matrix estimation
3.1.5 Sufficiency of µ̂ and Σ̂

UNSW MATH5855 2021T3 Lecture 3 Slide 12

I Alternative proofs also available, using vector and matrix
calculus.

I May be covered later, time permitting.

UNSW MATH5855 2021T3 Lecture 3 Slide 13

3. Estimation of the Mean Vector and Covariance Matrix of
Multivariate Normal Distribution
3.1 Maximum Likelihood Estimation
3.1.1 Likelihood function
3.1.2 Maximum Likelihood Estimators
3.1.3 Alternative proofs
3.1.4 Application in correlation matrix estimation
3.1.5 Sufficiency of µ̂ and Σ̂

UNSW MATH5855 2021T3 Lecture 3 Slide 14

I The correlation matrix is a function of the covariance matrix:

  ρij = σij / √(σii σjj )

I MLE is invariant under transformation.
=⇒
  ρ̂ij = σ̂ij / √(σ̂ii σ̂jj ), i = 1, . . . , p, j = 1, . . . , p. (3.6)

UNSW MATH5855 2021T3 Lecture 3 Slide 15

3. Estimation of the Mean Vector and Covariance Matrix of
Multivariate Normal Distribution
3.1 Maximum Likelihood Estimation
3.1.1 Likelihood function
3.1.2 Maximum Likelihood Estimators
3.1.3 Alternative proofs
3.1.4 Application in correlation matrix estimation
3.1.5 Sufficiency of µ̂ and Σ̂

UNSW MATH5855 2021T3 Lecture 3 Slide 16

I Recall:

  L(x ;µ,Σ) = (1/((2π)^{np/2} |Σ|^{n/2})) e^{−(1/2) tr[Σ−1(∑_{i=1}^n (xi−x̄)(xi−x̄)> + n(x̄−µ)(x̄−µ)>)]}

I Factorisation L(x ;µ,Σ) = g1(x) g2(µ,Σ; µ̂, Σ̂) =⇒ µ̂ and Σ̂
  are (collectively) a sufficient statistic for µ and Σ in the case
  of a sample from Np(µ,Σ).
I The structure of the normal density is important for this.
=⇒ Non-normality can break inferences that are based solely on µ̂ and Σ̂.

UNSW MATH5855 2021T3 Lecture 3 Slide 17

3. Estimation of the Mean Vector and Covariance Matrix of
Multivariate Normal Distribution
3.1 Maximum Likelihood Estimation
3.2 Distributions of MLE of mean vector and covariance matrix of

multivariate normal distribution
3.2.1 Sampling distribution of X̄
3.2.2 Sampling distribution of the MLE of Σ
3.2.3 Aside: The Gramm–Schmidt Process (not examinable)
3.3 Additional resources
3.4 Exercises

UNSW MATH5855 2021T3 Lecture 3 Slide 18

I Inference is not just point estimates.
I Need to quantify uncertainty as well.

=⇒ Need sampling distributions of estimators as well.

UNSW MATH5855 2021T3 Lecture 3 Slide 19

3. Estimation of the Mean Vector and Covariance Matrix of
Multivariate Normal Distribution
3.2 Distributions of MLE of mean vector and covariance matrix of

multivariate normal distribution
3.2.1 Sampling distribution of X̄
3.2.2 Sampling distribution of the MLE of Σ
3.2.3 Aside: The Gramm–Schmidt Process (not examinable)

UNSW MATH5855 2021T3 Lecture 3 Slide 20

p = 1: sample of size n from N(µ, σ²) =⇒ the sample mean is N(µ, σ²/n)
I sample mean and sample variance are independent (Basu’s Lemma)
I What about p > 1?

p > 1:
I Let X̄ = (1/n) ∑_{i=1}^n Xi ∈ Rp.
I For any l ∈ Rp: l>X̄ is a linear combination of normals and hence is normal (Definition 2.1).
I E X̄ = (1/n) nµ = µ
I Cov X̄ = (1/n²) n Cov X1 = (1/n)Σ
=⇒ X̄ ∼ Np(µ, (1/n)Σ).
I Need more rigorous machinery.
=⇒ Kronecker products

UNSW MATH5855 2021T3 Lecture 3 Slide 21

Kronecker product A ⊗ B of two matrices A ∈ Mm,n and B ∈ Mp,q:

  A ⊗ B = [ a11B a12B · · · a1nB
            a21B a22B · · · a2nB
            ...              ...
            am1B am2B · · · amnB ]   (3.7)

I Properties (assuming conformable, inverses exist, etc.):

  (A ⊗ B) ⊗ C = A ⊗ (B ⊗ C )
  (A + B) ⊗ C = A ⊗ C + B ⊗ C
  (A ⊗ B)> = A> ⊗ B>
  (A ⊗ B)−1 = A−1 ⊗ B−1
  (A ⊗ B)(C ⊗ D) = AC ⊗ BD

  For square A ∈ Mm,m and B ∈ Mp,p:

  tr(A ⊗ B) = tr(A) tr(B)
  |A ⊗ B| = |A|^p |B|^m
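In R (an added illustration), the Kronecker product is %x% (equivalently kronecker()), and the mixed-product property above can be checked numerically:

  A <- matrix(1:4, 2, 2); B <- diag(2)
  C <- matrix(c(2, 0, 1, 3), 2, 2); D <- matrix(1, 2, 2)
  all.equal(A %x% B, kronecker(A, B))                           # same operator
  all.equal((A %x% B) %*% (C %x% D), (A %*% C) %x% (B %*% D))   # (A⊗B)(C⊗D) = AC ⊗ BD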

UNSW MATH5855 2021T3 Lecture 3 Slide 22

Stacking columns (vec): For A ∈ Mm,n, vec(A) ∈ Rmn is the vector composed
by stacking the n columns of A.
I Matrices A, B and C (of suitable dimensions):

  vec(ABC ) = (C> ⊗ A) vec(B)

Application: Let 1n ∈ Rn be the vector of n ones, X be the random
data matrix (see (0.11) in Lecture 0.2).

=⇒ vec(X ) ∼ N(1n ⊗ µ, In ⊗ Σ) and X̄ = (1/n)(1n> ⊗ Ip) vec(X )

=⇒ X̄ is MVN,

  E X̄ = (1/n)(1n> ⊗ Ip)(1n ⊗ µ) = (1/n)(1n>1n ⊗ µ) = (1/n) nµ = µ

  Cov X̄ = n−2 (1n> ⊗ Ip)(In ⊗ Σ)(1n ⊗ Ip) = n−2 (1n>1n ⊗ Σ) = n−1Σ

UNSW MATH5855 2021T3 Lecture 3 Slide 23

Independence of X̄ and Σ̂

I Recall the likelihood function:

  L(x ;µ,Σ) = (1/((2π)^{np/2} |Σ|^{n/2})) e^{−(1/2) tr[Σ−1(∑_{i=1}^n (xi−x̄)(xi−x̄)> + n(x̄−µ)(x̄−µ)>)]}

I Two summands in the exponent:
  I one is a function of the observations through nΣ̂ = ∑_{i=1}^n (xi − x̄)(xi − x̄)> only
  I one depends on the observations through x̄ only.

Idea: Transform the original data matrix X ∈ Mp,n into a new
matrix Z ∈ Mp,n of n independent vectors s.t.
I X̄ would only be a function of Z1
I ∑_{i=1}^n (xi − x̄)(xi − x̄)> would only be a function of Z2, . . . ,Zn
I If we succeed then clearly X̄ and ∑_{i=1}^n (xi − x̄)(xi − x̄)> = nΣ̂ would be independent.
  I Functions of independent variables are independent.
UNSW MATH5855 2021T3 Lecture 3 Slide 24

I Want A orthogonal.
I Want X̄ depending only on Z1.
=⇒ First column of A: a1 = (1/√n) 1n.
=⇒ First column of Z = XA: Z1 = √n X̄ .
I Rest of A: Gram–Schmidt Process
I vec(Z ) = vec(Ip X A) = (A> ⊗ Ip) vec(X ) =⇒ Jacobian of vec(X ) 7→ vec(Z ) is |A> ⊗ Ip| = |A|^p = ±1.
=⇒ absolute value of the Jacobian is 1
=⇒ For vec(Z ) we have:

  E(vec(Z )) = (A> ⊗ Ip)(1n ⊗ µ) = A>1n ⊗ µ = (√n, 0, . . . , 0)> ⊗ µ
UNSW MATH5855 2021T3 Lecture 3 Slide 25

I Then,

  Cov(vec(Z )) = (A> ⊗ Ip)(In ⊗ Σ)(A ⊗ Ip) = A>A ⊗ IpΣIp = In ⊗ Σ

=⇒ Zi , i = 1, . . . , n independent
I Z1 = √n X̄ .
I Also,

  ∑_{i=1}^n (Xi − X̄ )(Xi − X̄ )> = ∑_{i=1}^n Xi Xi> − (1/n)(∑_{i=1}^n Xi )(∑_{i=1}^n Xi>)
    = ZA>AZ> − Z1Z1> = ∑_{i=1}^n Zi Zi> − Z1Z1> = ∑_{i=2}^n Zi Zi>

=⇒

Theorem 3.3.
For a sample of size n from Np(µ,Σ), p < n, the sample average
X̄ ∼ Np(µ, (1/n)Σ). Moreover, the MLE µ̂ = X̄ and Σ̂ are independent.

UNSW MATH5855 2021T3 Lecture 3 Slide 26

3. Estimation of the Mean Vector and Covariance Matrix of
Multivariate Normal Distribution
3.2 Distributions of MLE of mean vector and covariance matrix of
multivariate normal distribution
3.2.1 Sampling distribution of X̄
3.2.2 Sampling distribution of the MLE of Σ
3.2.3 Aside: The Gramm–Schmidt Process (not examinable)

UNSW MATH5855 2021T3 Lecture 3 Slide 27

Definition 3.4.
A random matrix U ∈ Mp,p has a Wishart distribution with
parameters Σ, p, n (denoting this by U ∼ Wp(Σ, n)) if there exist n
independent random vectors Y1, . . . ,Yn each with Np(0,Σ)
distribution such that the distribution of ∑_{i=1}^n Yi Yi> coincides
with the distribution of U .
Note that we require that p < n and that U be non-negative definite.

UNSW MATH5855 2021T3 Lecture 3 Slide 28

I Given the proof of Theorem 3.3, the distribution of
  nΣ̂ = ∑_{i=1}^n (Xi − X̄ )(Xi − X̄ )> is the same as that of ∑_{i=2}^n Zi Zi>:

  nΣ̂ ∼ Wp(Σ, n − 1)

I Don’t worry about the density formula.
Some properties:
1. p = 1 and Σ = (σ²) =⇒ W1(Σ, n)/σ² = χ²n
   =⇒ σ² = 1 =⇒ W1(1, n) =d χ²n
   I I.e., a generalisation.
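A brief added check in R: stats::rWishart simulates Wishart matrices (the parameters below are arbitrary), and the average of many draws approximates E(U) = (degrees of freedom) × Σ:

  Sigma <- diag(2)
  W <- rWishart(1000, df = 9, Sigma = Sigma)  # 1000 draws from W_2(Sigma, 9), a 2 x 2 x 1000 array
  apply(W, 1:2, mean)                          # close to 9 * Sigma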

UNSW MATH5855 2021T3 Lecture 3 Slide 29

2. For an arbitrary fixed matrix H ∈Mk,p, k ≤ p one has:

nHΣ̂H> ∼Wk(HΣH>, n − 1).

(Why? Show it!)

UNSW MATH5855 2021T3 Lecture 3 Slide 30

3. Refer to the previous case for the particular value of k = 1.
The matrix H ∈M1,p is just a p-dimensional row vector that
we could denote by c>. Then:

i) n c>Σ̂c / (c>Σc) ∼ χ²_{n−1}

ii) n c>Σ−1c / (c>Σ̂−1c) ∼ χ²_{n−p}

UNSW MATH5855 2021T3 Lecture 3 Slide 31

4. Let us partition S = (1/(n−1)) ∑_{i=1}^n (Xi − X̄ )(Xi − X̄ )> ∈ Mp,p into

   S = ( S11 S12 ; S21 S22 ), S11 ∈ Mr ,r , r < p,   Σ = ( Σ11 Σ12 ; Σ21 Σ22 ), Σ11 ∈ Mr ,r , r < p.

   Further, denote S1|2 = S11 − S12S22^{−1}S21, Σ1|2 = Σ11 − Σ12Σ22^{−1}Σ21. Then it holds

   (n − 1)S11 ∼ Wr (Σ11, n − 1)
   (n − 1)S1|2 ∼ Wr (Σ1|2, n − p + r − 1)

UNSW MATH5855 2021T3 Lecture 3 Slide 32

3. Estimation of the Mean Vector and Covariance Matrix of
Multivariate Normal Distribution
3.2 Distributions of MLE of mean vector and covariance matrix of
multivariate normal distribution
3.2.1 Sampling distribution of X̄
3.2.2 Sampling distribution of the MLE of Σ
3.2.3 Aside: The Gramm–Schmidt Process (not examinable)

UNSW MATH5855 2021T3 Lecture 3 Slide 33

Have: A = [a1, . . . , an] ∈ Mn,n an arbitrary full-rank matrix
Want: An orthogonal matrix whose first column ∝ a1
Gram–Schmidt process:
1. For each i = 2, . . . , n,
2.   For each j = 1, . . . , i − 1,
3.     Update ai = ai − (〈ai , aj〉/〈aj , aj〉) aj .
4. For each k = 1, . . . , n,
5.   Update ak = ak/‖ak‖.

UNSW MATH5855 2021T3 Lecture 3 Slide 34

Example 3.5.
Gram–Schmidt process implemented in R.

UNSW MATH5855 2021T3 Lecture 3 Slide 35

3. Estimation of the Mean Vector and Covariance Matrix of
Multivariate Normal Distribution
3.1 Maximum Likelihood Estimation
3.2 Distributions of MLE of mean vector and covariance matrix of
multivariate normal distribution
3.3 Additional resources
3.4 Exercises

UNSW MATH5855 2021T3 Lecture 3 Slide 36

Additional resources

I JW Sec. 4.3–4.5.

UNSW MATH5855 2021T3 Lecture 3 Slide 37

3. Estimation of the Mean Vector and Covariance Matrix of
Multivariate Normal Distribution
3.1 Maximum Likelihood Estimation
3.2 Distributions of MLE of mean vector and covariance matrix of
multivariate normal distribution
3.3 Additional resources
3.4 Exercises

UNSW MATH5855 2021T3 Lecture 3 Slide 38

Exercise 3.1
Find the product A ⊗ B if A = ( 1 2 ; 3 4 ), B = ( 5 0 ; 2 1 ).

UNSW MATH5855 2021T3 Lecture 3 Slide 39

Lecture 4: Confidence Intervals and Hypothesis Tests for the Mean Vector
4.1 Hypothesis tests for the multivariate normal mean
4.2 Confidence regions for the mean vector and for its components
4.3 Comparison of two or more mean vectors
4.4 Software
4.5 Additional resources
4.6 Exercises

UNSW MATH5855 2021T3 Lecture 4 Slide 1

4. Confidence Intervals and Hypothesis Tests for the Mean Vector
4.1 Hypothesis tests for the multivariate normal mean
4.1.1 Hotelling’s T 2
4.1.2 Sampling distribution of T 2
4.1.3 Noncentral Wishart
4.1.4 T 2 as a likelihood ratio statistic
4.1.5 Wilks’ lambda and T 2
4.1.6 Numerical calculation of T 2
4.1.7 Asymptotic distribution of T 2
4.2 Confidence regions for the mean vector and for its components
4.3 Comparison of two or more mean vectors
4.4 Software
4.5 Additional resources
4.6 Exercises

UNSW MATH5855 2021T3 Lecture 4 Slide 2

4. Confidence Intervals and Hypothesis Tests for the Mean Vector
4.1 Hypothesis tests for the multivariate normal mean
4.1.1 Hotelling’s T 2
4.1.2 Sampling distribution of T 2
4.1.3 Noncentral Wishart
4.1.4 T 2 as a likelihood ratio statistic
4.1.5 Wilks’ lambda and T 2
4.1.6 Numerical calculation of T 2
4.1.7 Asymptotic distribution of T 2

UNSW MATH5855 2021T3 Lecture 4 Slide 3

Suppose:
I n independent realisations of p-dimensional random vectors from Np(µ,Σ)
I Σ non-singular
I Data matrix:

  x = [ x11 x12 · · · x1j · · · x1n
        x21 x22 · · · x2j · · · x2n
        ...              ...
        xi1 xi2 · · · xij · · · xin
        ...              ...
        xp1 xp2 · · · xpj · · · xpn ] = [x1, x2, . . . , xn]

=⇒ (from Section 3.2)
I X̄ ∼ Np(µ, (1/n)Σ)
I nΣ̂ ∼ Wp(Σ, n − 1).
UNSW MATH5855 2021T3 Lecture 4 Slide 4

Testing contrasts

Let c ≠ 0 ∈ Rp.
I c>X̄ ∼ N(c>µ, (1/n) c>Σc)
I n c>Σ̂c / (c>Σc) ∼ χ²_{n−1}
I X̄ and Σ̂ are independent
=⇒ T = √n c>(X̄ − µ) / √(c>(n/(n−1))Σ̂c) ∼ t_{n−1}
=⇒ Use to test contrasts
I E.g., under H0 : c>µ = ∑_{i=1}^p ciµi = 0,
    T = √n c>X̄ / √(c>Sc)
I Does not depend on µ (for any H0 value).
I Reject H0 in favour of H1 : c>µ = ∑_{i=1}^p ciµi ≠ 0 if |T| > t_{1−α/2,n−1}.
I One-sided alternatives possible as well.

UNSW MATH5855 2021T3 Lecture 4 Slide 5
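A minimal R sketch of the contrast test above; the data matrix X and the contrast vector cvec are made-up placeholders standing in for real data.

## One-sample contrast test of H0: c'mu = 0 -- a sketch with hypothetical inputs.
set.seed(1)
X    <- matrix(rnorm(30 * 3), ncol = 3)      # hypothetical n x p data matrix
cvec <- c(1, -1, 0)                          # contrast of interest
n    <- nrow(X)
xbar <- colMeans(X)
S    <- cov(X)                               # unbiased sample covariance
Tstat <- sqrt(n) * sum(cvec * xbar) / sqrt(drop(t(cvec) %*% S %*% cvec))
pval  <- 2 * pt(abs(Tstat), df = n - 1, lower.tail = FALSE)
c(T = Tstat, p.value = pval)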

Testing mean vectors (variance known)

I Suppose we know Σ (or σ²)
I p = 1, H0 : µ = µ0 vs. H1 : µ ≠ µ0, at level α =⇒
    U = √n (X̄ − µ0)/σ
I Reject H0 if |U| > z_{1−α/2}
I |U| > c ⇔ U² = n(X̄ − µ0)(σ²)^{-1}(X̄ − µ0) > c²
=⇒ For p > 1:
I U² = n(X̄ − µ0)>Σ^{-1}(X̄ − µ0)
I Reject H0 : µ = µ0 when U² is large enough.
I U² ∼ χ²_p under H0
I Proved similarly to the proof of Property 5 of the multivariate
  normal distribution (Section 2.2) and by using Theorem 3.3 of
  Section 3.2.

UNSW MATH5855 2021T3 Lecture 4 Slide 6
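A short R sketch of the known-variance test just described; Sigma, mu0 and the simulated sample are hypothetical.

## Known-variance chi-square test of H0: mu = mu0 -- a sketch.
Sigma <- matrix(c(3, 1, 1, 2), 2, 2)                         # assumed known covariance
mu0   <- c(0, 0)
X     <- MASS::mvrnorm(25, mu = c(0.2, 0), Sigma = Sigma)    # hypothetical sample
n <- nrow(X); p <- ncol(X)
xbar <- colMeans(X)
U2   <- n * drop(t(xbar - mu0) %*% solve(Sigma) %*% (xbar - mu0))
c(U2 = U2, crit = qchisq(0.95, df = p),
  p.value = pchisq(U2, df = p, lower.tail = FALSE))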

Testing mean vectors (variance unknown)

I p = 1, t-test: H0 : µ = µ0 vs. H1 : µ ≠ µ0 at level α =⇒
    T = √n (X̄ − µ0)/S, where S² = (1/(n−1)) ∑_{i=1}^n (Xi − X̄)²
I Reject H0 if |T| > t_{1−α/2,n−1}
I Similar as before: check whether T² = n(X̄ − µ0)(s²)^{-1}(X̄ − µ0) is large enough
I Under H0, the statistic T² ∼ F_{1,n−1} =⇒ reject when T² > F_{1−α;1,n−1}

Definition 4.1 (Hotelling’s T²).
The statistic
    T² = n(X̄ − µ0)>S^{-1}(X̄ − µ0)                                        (4.1)
where X̄ = (1/n) ∑_{i=1}^n Xi, S = (1/(n−1)) ∑_{i=1}^n (Xi − X̄)(Xi − X̄)>,
µ0 ∈ Rp, Xi ∈ Rp, i = 1, . . . , n, is named after Harold Hotelling.

UNSW MATH5855 2021T3 Lecture 4 Slide 7
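A minimal R sketch of (4.1); the data matrix X and the hypothesised mean mu0 are hypothetical placeholders.

## Hotelling's T^2 statistic (4.1) -- a sketch.
hotelling_T2 <- function(X, mu0) {
  n    <- nrow(X)
  xbar <- colMeans(X)
  S    <- cov(X)                       # (1/(n-1)) sum (Xi - Xbar)(Xi - Xbar)'
  n * drop(t(xbar - mu0) %*% solve(S, xbar - mu0))
}
X <- MASS::mvrnorm(20, mu = c(1, 2), Sigma = diag(2))   # hypothetical data
hotelling_T2(X, mu0 = c(1, 2))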

4. Confidence Intervals and Hypothesis Tests for the Mean Vector
4.1 Hypothesis tests for the multivariate normal mean
4.1.1 Hotelling’s T 2

4.1.2 Sampling distribution of T 2

4.1.3 Noncentral Wishart
4.1.4 T 2 as a likelihood ratio statistic
4.1.5 Wilks’ lambda and T 2

4.1.6 Numerical calculation of T 2

4.1.7 Asymptotic distribution of T 2

UNSW MATH5855 2021T3 Lecture 4 Slide 8

Null distribution of the T² statistic

I Reject H0 : µ = µ0 if the value of T² is high
I Turns out T² has a (scaled) F distribution.

Theorem 4.2.
Under the null hypothesis H0 : µ = µ0, Hotelling’s T² is distributed as
((n−1)p/(n−p)) Fp,n−p, where Fp,n−p denotes the F-distribution with p and
n − p degrees of freedom.

UNSW MATH5855 2021T3 Lecture 4 Slide 9
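A short R sketch of how Theorem 4.2 is used in practice: the scaled-F critical value and the p-value for an observed T². The values of n, p and T2 below are hypothetical numbers.

## Using Theorem 4.2: critical value and p-value for an observed T^2.
n <- 20; p <- 3; T2 <- 12.4; alpha <- 0.05
crit <- (n - 1) * p / (n - p) * qf(1 - alpha, p, n - p)
pval <- pf(T2 * (n - p) / (p * (n - 1)), p, n - p, lower.tail = FALSE)
c(critical.value = crit, p.value = pval)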

Proof.
I Write T² = [n(X̄ − µ0)>S^{-1}(X̄ − µ0) / n(X̄ − µ0)>Σ^{-1}(X̄ − µ0)] × n(X̄ − µ0)>Σ^{-1}(X̄ − µ0)
I Let C = √n(X̄ − µ0); given C = c,
    n(X̄ − µ0)>S^{-1}(X̄ − µ0) / n(X̄ − µ0)>Σ^{-1}(X̄ − µ0) = c>S^{-1}c / c>Σ^{-1}c
I Depends on the data only through S^{-1}.
I nΣ̂ = (n − 1)S
I Recall, Section 3.2.2, the Wishart distribution's third property:
    n c>Σ^{-1}c / (c>Σ̂^{-1}c) ∼ χ²_{n−p}
I Does not depend on c!
I n(X̄ − µ0)>Σ^{-1}(X̄ − µ0) ∼ χ²_p depends on the data through X̄ only
  =⇒ independent of the fraction
I T² =d [χ²_p · (n−1)] / χ²_{n−p} (the χ²s independent)
=⇒ T²(n−p)/(p(n−1)) ∼ Fp,n−p =⇒ T² ∼ (p(n−1)/(n−p)) Fp,n−p

UNSW MATH5855 2021T3 Lecture 4 Slide 10

4. Confidence Intervals and Hypothesis Tests for the Mean Vector
4.1 Hypothesis tests for the multivariate normal mean
4.1.1 Hotelling’s T 2

4.1.2 Sampling distribution of T 2

4.1.3 Noncentral Wishart
4.1.4 T 2 as a likelihood ratio statistic
4.1.5 Wilks’ lambda and T 2

4.1.6 Numerical calculation of T 2

4.1.7 Asymptotic distribution of T 2

UNSW MATH5855 2021T3 Lecture 4 Slide 11

I Suppose Yi , i = 1, . . . , n have different µi s: Np(µi ,Σ).
=⇒ noncentral Wishart distribution with parameters Σ, p, n− 1, Γ

I i.e., Wp(Σ, n − 1, Γ)
I noncentrality parameter Γ = MM> ∈Mp,p where

M = [µ1,µ2, . . . ,µn]
I M = 0 =⇒ usual (central) Wishart distribution

I The distribution of T² = n(X̄ − µ0)>S^{-1}(X̄ − µ0) when the true mean µ ≠ µ0
  leads to the noncentral F-distribution
I Used to study the power of the test of H0 : µ = µ0 vs.
  H1 : µ ≠ µ0.

UNSW MATH5855 2021T3 Lecture 4 Slide 12

4. Confidence Intervals and Hypothesis Tests for the Mean Vector
4.1 Hypothesis tests for the multivariate normal mean
4.1.1 Hotelling’s T 2

4.1.2 Sampling distribution of T 2

4.1.3 Noncentral Wishart
4.1.4 T 2 as a likelihood ratio statistic
4.1.5 Wilks’ lambda and T 2

4.1.6 Numerical calculation of T 2

4.1.7 Asymptotic distribution of T 2

UNSW MATH5855 2021T3 Lecture 4 Slide 13

I Recall maximised likelihood (3.2):
    L(x; µ̂, Σ̂) = (2π)^{-np/2} |Σ̂|^{-n/2} e^{-np/2}
I Under H0 :
    max_Σ L(x; µ0, Σ) = max_Σ (2π)^{-np/2} |Σ|^{-n/2} e^{-(1/2) ∑_{i=1}^n (xi−µ0)>Σ^{-1}(xi−µ0)}
=⇒ log L(x; µ0, Σ) = −(np/2) log(2π) − (n/2) log|Σ| − (1/2) tr[Σ^{-1}(∑_{i=1}^n (xi − µ0)(xi − µ0)>)]
I Anderson’s Lemma (Theorem 3.1) =⇒
    Σ̂0 = (1/n) ∑_{i=1}^n (xi − µ0)(xi − µ0)>
=⇒ max_Σ L(x; µ0, Σ) = (2π)^{-np/2} |Σ̂0|^{-n/2} e^{-np/2}
=⇒
    Λ = max_Σ L(x; µ0, Σ) / max_{µ,Σ} L(x; µ, Σ) = (|Σ̂|/|Σ̂0|)^{n/2}      (4.2)
=⇒ Wilks’ lambda: Λ^{2/n} = |Σ̂|/|Σ̂0|
I Reject H0 : µ = µ0 when it is small.
UNSW MATH5855 2021T3 Lecture 4 Slide 14

4. Confidence Intervals and Hypothesis Tests for the Mean Vector
4.1 Hypothesis tests for the multivariate normal mean
4.1.1 Hotelling’s T 2

4.1.2 Sampling distribution of T 2

4.1.3 Noncentral Wishart
4.1.4 T 2 as a likelihood ratio statistic
4.1.5 Wilks’ lambda and T 2

4.1.6 Numerical calculation of T 2

4.1.7 Asymptotic distribution of T 2

UNSW MATH5855 2021T3 Lecture 4 Slide 15

The following theorem shows the relation between Wilks’ lambda and T²:

Theorem 4.3.
The likelihood ratio test is equivalent to the test based on T², since
Λ^{2/n} = (1 + T²/(n−1))^{-1} holds.

UNSW MATH5855 2021T3 Lecture 4 Slide 16

Proof.
I Let A ∈ Mp+1,p+1:
    A = ( ∑_{i=1}^n (xi − x̄)(xi − x̄)>   √n(x̄ − µ0)
          √n(x̄ − µ0)>                    −1          ) = ( A11 A12
                                                            A21 A22 )
I |A| = |A22||A11 − A12A22^{-1}A21| = |A11||A22 − A21A11^{-1}A12|          (4.3)
=⇒ (−1)|∑_{i=1}^n (xi − x̄)(xi − x̄)> + n(x̄ − µ0)(x̄ − µ0)>| =
    |∑_{i=1}^n (xi − x̄)(xi − x̄)>| × (−1 − n(x̄ − µ0)>(∑_{i=1}^n (xi − x̄)(xi − x̄)>)^{-1}(x̄ − µ0))
=⇒ (−1)|∑_{i=1}^n (xi − µ0)(xi − µ0)>| = |∑_{i=1}^n (xi − x̄)(xi − x̄)>| (−1)(1 + T²/(n−1))
=⇒ |Σ̂0| = |Σ̂|(1 + T²/(n−1)) and

    Λ^{2/n} = (1 + T²/(n − 1))^{-1}                                        (4.4)

UNSW MATH5855 2021T3 Lecture 4 Slide 17

4. Confidence Intervals and Hypothesis Tests for the Mean Vector
4.1 Hypothesis tests for the multivariate normal mean
4.1.1 Hotelling’s T 2

4.1.2 Sampling distribution of T 2

4.1.3 Noncentral Wishart
4.1.4 T 2 as a likelihood ratio statistic
4.1.5 Wilks’ lambda and T 2

4.1.6 Numerical calculation of T 2

4.1.7 Asymptotic distribution of T 2

UNSW MATH5855 2021T3 Lecture 4 Slide 18

Hence H0 is rejected for small values of Λ^{2/n} or, equivalently, for
large values of T². The critical values for T² are determined from
Theorem 4.2. Relation (4.4) can be used to calculate T² from
Λ^{2/n} = |Σ̂|/|Σ̂0|, thus avoiding the need to invert the matrix S when
calculating T²!

UNSW MATH5855 2021T3 Lecture 4 Slide 19
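A short R sketch of the determinant-based computation just described, checked against the direct formula (4.1); the data matrix and mu0 are hypothetical.

## T^2 via determinants (relation 4.4) vs. the direct formula (4.1) -- a sketch.
X   <- MASS::mvrnorm(15, mu = c(0, 0, 0), Sigma = diag(3))   # hypothetical data
mu0 <- c(0, 0, 0)
n <- nrow(X); xbar <- colMeans(X)
Sig.hat  <- crossprod(sweep(X, 2, xbar)) / n                 # MLE of Sigma
Sig.hat0 <- crossprod(sweep(X, 2, mu0))  / n                 # MLE under H0
T2.det    <- (n - 1) * (det(Sig.hat0) / det(Sig.hat) - 1)    # from (4.4)
T2.direct <- n * drop(t(xbar - mu0) %*% solve(cov(X), xbar - mu0))
c(T2.det = T2.det, T2.direct = T2.direct)                    # should agree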

4. Confidence Intervals and Hypothesis Tests for the Mean Vector
4.1 Hypothesis tests for the multivariate normal mean
4.1.1 Hotelling’s T 2

4.1.2 Sampling distribution of T 2

4.1.3 Noncentral Wishart
4.1.4 T 2 as a likelihood ratio statistic
4.1.5 Wilks’ lambda and T 2

4.1.6 Numerical calculation of T 2

4.1.7 Asymptotic distribution of T 2

UNSW MATH5855 2021T3 Lecture 4 Slide 20

I S^{-1} is consistent for Σ^{-1}
I T² d→ n(x̄ − µ)>Σ^{-1}(x̄ − µ) ∼ χ²_p
I General asymptotic theory: −2 log Λ d→ χ²_p
I Here:
    −2 log Λ = n log(1 + T²/(n − 1)) ≈ (n/(n − 1)) T² ≈ T²
I using that log(1 + x) ≈ x for small x

UNSW MATH5855 2021T3 Lecture 4 Slide 21

4. Confidence Intervals and Hypothesis Tests for the Mean Vector
4.1 Hypothesis tests for the multivariate normal mean
4.2 Confidence regions for the mean vector and for its components
4.2.1 Confidence region for the mean vector
4.2.2 Simultaneous confidence statements
4.2.3 Simultaneous confidence ellipsoid

4.3 Comparison of two or more mean vectors
4.4 Software
4.5 Additional resources
4.6 Exercises

UNSW MATH5855 2021T3 Lecture 4 Slide 22

4. Confidence Intervals and Hypothesis Tests for the Mean Vector
4.2 Confidence regions for the mean vector and for its components
4.2.1 Confidence region for the mean vector
4.2.2 Simultaneous confidence statements
4.2.3 Simultaneous confidence ellipsoid

UNSW MATH5855 2021T3 Lecture 4 Slide 23

I For CL (1 − α)·100%, a confidence region can be constructed in the form
    {µ | n(x̄ − µ)>S^{-1}(x̄ − µ) ≤ (p(n − 1)/(n − p)) F_{1−α,p,n−p}}
I F_{1−α,p,n−p} = upper α·100% percentage point of the F distribution with (p, n − p) df
I an ellipsoid in Rp centred at x̄
I axes of this confidence ellipsoid are directed along the eigenvectors ei of the matrix
  S = (1/(n−1)) ∑_{i=1}^n (xi − x̄)(xi − x̄)>
I half-lengths of the axes are √λi √(p(n−1)F_{1−α,p,n−p}/(n(n−p))), with λi, i = 1, . . . , p,
  being the corresponding eigenvalues.

Example 4.4.

Microwave ovens (Example 5.3., pages 221–223, JW).

UNSW MATH5855 2021T3 Lecture 4 Slide 24
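A hedged R sketch of the ellipsoid just described: centre, axis directions and half-lengths. The simulated data are hypothetical stand-ins (not the JW microwave-oven data).

## Confidence ellipsoid for mu: centre, axis directions, half-lengths -- a sketch.
X <- MASS::mvrnorm(42, mu = c(0.56, 0.60),
                   Sigma = matrix(c(0.014, 0.012, 0.012, 0.015), 2))   # hypothetical
n <- nrow(X); p <- ncol(X); alpha <- 0.05
xbar <- colMeans(X); S <- cov(X)
ee   <- eigen(S)                         # eigenvectors give the axis directions
half <- sqrt(ee$values) * sqrt(p * (n - 1) * qf(1 - alpha, p, n - p) / (n * (n - p)))
list(centre = xbar, directions = ee$vectors, half.lengths = half)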

4. Confidence Intervals and Hypothesis Tests for the Mean Vector
4.2 Confidence regions for the mean vector and for its components
4.2.1 Confidence region for the mean vector
4.2.2 Simultaneous confidence statements
4.2.3 Simultaneous confidence ellipsoid

UNSW MATH5855 2021T3 Lecture 4 Slide 25

I Ellipsoid in Section 4.2.1 is for the whole vector at once
I We might want CIs for each element, difference, etc.
=⇒ simultaneous confidence intervals
I X ∼ Np(µ,Σ) =⇒ ∀l ∈ Rp, l>X ∼ N1(l>µ, l>Σl)
=⇒ For any given l, a (1 − α)·100% CI for l>µ is
    ( l>x̄ − t_{1−α/2,n−1} √(l>Sl/n),  l>x̄ + t_{1−α/2,n−1} √(l>Sl/n) )      (4.5)
I Can get elementwise CIs.
I I.e., |√n(l>x̄ − l>µ)/√(l>Sl)| ≤ t_{1−α/2,n−1} (or, equivalently,
  n(l>x̄ − l>µ)²/(l>Sl) ≤ t²_{1−α/2,n−1})
I Not simultaneous: for that, a larger multiplier is needed on the right-hand side of
  the inequality.

UNSW MATH5855 2021T3 Lecture 4 Slide 26

4. Confidence Intervals and Hypothesis Tests for the Mean Vector
4.2 Confidence regions for the mean vector and for its components
4.2.1 Confidence region for the mean vector
4.2.2 Simultaneous confidence statements
4.2.3 Simultaneous confidence ellipsoid

UNSW MATH5855 2021T3 Lecture 4 Slide 27

Theorem 4.5.
Simultaneously for all l ∈ Rp, the interval
    ( l>x̄ − √((p(n − 1)/(n(n − p))) F_{1−α,p,n−p} l>Sl),
      l>x̄ + √((p(n − 1)/(n(n − p))) F_{1−α,p,n−p} l>Sl) )
will contain l>µ with probability at least (1 − α).

Example 4.6.

Microwave Ovens (Example 5.4, p. 226 in JW).

UNSW MATH5855 2021T3 Lecture 4 Slide 28
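A hedged R sketch of the T²-based simultaneous intervals of Theorem 4.5 applied to the coordinate directions; the simulated data are again hypothetical placeholders.

## T^2-based simultaneous intervals (Theorem 4.5), coordinate directions -- a sketch.
X <- MASS::mvrnorm(42, mu = c(0.56, 0.60),
                   Sigma = matrix(c(0.014, 0.012, 0.012, 0.015), 2))   # hypothetical
n <- nrow(X); p <- ncol(X); alpha <- 0.05
xbar <- colMeans(X); S <- cov(X)
mult <- sqrt(p * (n - 1) * qf(1 - alpha, p, n - p) / (n - p))   # T^2 multiplier
se   <- sqrt(diag(S) / n)                                       # l = unit vectors
cbind(lower = xbar - mult * se, upper = xbar + mult * se)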

Proof.
I [l>(x̄ − µ)]² = [(S^{1/2}l)>S^{-1/2}(x̄ − µ)]²
I Cauchy–Bunyakovsky–Schwartz Inequality =⇒
    [(S^{1/2}l)>S^{-1/2}(x̄ − µ)]² ≤ ‖S^{1/2}l‖²‖S^{-1/2}(x̄ − µ)‖²
I ‖S^{1/2}l‖²‖S^{-1/2}(x̄ − µ)‖² = (l>Sl)(x̄ − µ)>S^{-1}(x̄ − µ)
I =⇒ [l>(x̄ − µ)]² ≤ (l>Sl)(x̄ − µ)>S^{-1}(x̄ − µ) and
    max_l n(l>(x̄ − µ))²/(l>Sl) ≤ n(x̄ − µ)>S^{-1}(x̄ − µ) = T²              (4.6)
=⇒ Then if T² ≤ c², then n(l>x̄ − l>µ)²/(l>Sl) ≤ c² for any l ∈ Rp, l ≠ 0
=⇒ For every l,
    l>x̄ − c √(l>Sl/n) ≤ l>µ ≤ l>x̄ + c √(l>Sl/n)                            (4.7)
I Choose c² = p(n − 1)F_{1−α,p,n−p}/(n − p) to make sure that
  1 − α = P(T² ≤ c²) holds
=⇒ (4.7) will contain l>µ with probability 1 − α.
UNSW MATH5855 2021T3 Lecture 4 Slide 29

Bonferroni Method

I The simultaneous intervals are more reliable than one-at-a-time intervals.
I They utilise the covariance structure of all p variables in their construction.
I What if only a few intervals are needed?
I The Bonferroni method may then be more efficient:
  I Suppose we want to make m statements.
  I Let Ci, i = 1, 2, . . . ,m be the ith confidence statement.
  I P(Ci true) = 1 − αi =⇒
      P(all Ci true) = 1 − P(at least one Ci false)
                     ≥ 1 − ∑_{i=1}^m P(Ci false) = 1 − ∑_{i=1}^m (1 − P(Ci true)) = 1 − (α1 + α2 + · · · + αm)
I Choose αi = α/m, i = 1, 2, . . . ,m (i.e., each CL (1 − α/m)·100% instead of (1 − α)·100%)
=⇒ The probability of any statement being false will not exceed α.

Example 4.7.

Microwave Ovens (based on JW Example 5.4, p. 226).
UNSW MATH5855 2021T3 Lecture 4 Slide 30
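A tiny R sketch comparing the three multipliers used for elementwise intervals (one-at-a-time t, Bonferroni, and T²); n, p and m are hypothetical.

## Interval multipliers: one-at-a-time t, Bonferroni, T^2 -- a sketch.
n <- 42; p <- 2; alpha <- 0.05     # hypothetical sample size and dimension
m <- p                             # number of Bonferroni statements
c(one.at.a.time = qt(1 - alpha / 2, n - 1),
  bonferroni    = qt(1 - alpha / (2 * m), n - 1),
  T2            = sqrt(p * (n - 1) * qf(1 - alpha, p, n - p) / (n - p)))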

4. Confidence Intervals and Hypothesis Tests for the Mean Vector
4.1 Hypothesis tests for the multivariate normal mean
4.2 Confidence regions for the mean vector and for its components
4.3 Comparison of two or more mean vectors
4.3.1 Reducing to a single population
4.3.2 The two-sample T 2-test

4.4 Software
4.5 Additional resources
4.6 Exercises

UNSW MATH5855 2021T3 Lecture 4 Slide 31

I Two samples, X1,X2, . . . ,XnX ∈ R
p and

Y1,Y2, . . . ,YnY ∈ R
p

I means µX ∈ Rp and µY ∈ Rp
I variances ΣX ∈Mp,p and ΣY ∈Mp,p
I Test (typically) H0 : µX − µY = δ0.

I Multivariate ANOVA for comparing more than two
populations =⇒ Lecture 8.

UNSW MATH5855 2021T3 Lecture 4 Slide 32

4. Confidence Intervals and Hypothesis Tests for the Mean Vector
4.3 Comparison of two or more mean vectors
4.3.1 Reducing to a single population
4.3.2 The two-sample T 2-test

UNSW MATH5855 2021T3 Lecture 4 Slide 33

Paired Samples

I analogously to the paired t-test; let nX = nY = n
1. Take Di = Xi − Yi for i = 1, . . . , n.
2. Proceed as if with a 1-sample T² test:
    T² = n(D̄ − δ0)>S_D^{-1}(D̄ − δ0) ∼ ((n − 1)p/(n − p)) Fp,n−p,          (4.8)
I D̄ = (1/n) ∑_{i=1}^n Di ∈ Rp and S_D = (1/(n−1)) ∑_{i=1}^n (Di − D̄)(Di − D̄)>
I Requires the differences to be MVN, or n large.
I “Multivariate” form: let the contrast matrix C ∈ Mp,p+p have +1 in position i and
  −1 in position p + i of row i (i.e. C = [Ip  −Ip])
=⇒ Di = C ( Xi
            Yi ),   H0 : C ( µX
                             µY ) = δ0.
=⇒ The test statistic reduces to (4.8).
UNSW MATH5855 2021T3 Lecture 4 Slide 34
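A hedged R sketch of the paired-sample T² computed from the differences, as in (4.8); X, Y and δ0 are hypothetical.

## Paired-sample T^2 via differences (4.8) -- a sketch with hypothetical data.
nn <- 30; p <- 2
X <- MASS::mvrnorm(nn, mu = c(10, 20), Sigma = diag(2))
Y <- MASS::mvrnorm(nn, mu = c(10, 19), Sigma = diag(2))
D <- X - Y; delta0 <- c(0, 0)
Dbar <- colMeans(D); SD <- cov(D)
T2   <- nn * drop(t(Dbar - delta0) %*% solve(SD, Dbar - delta0))
pval <- pf(T2 * (nn - p) / (p * (nn - 1)), p, nn - p, lower.tail = FALSE)
c(T2 = T2, p.value = pval)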

Repeated Measures

I a series of p treatment outcomes on each sampling unit
I Xi : individual i’s measurements as a vector
I Test whether all outcomes are the same in expectation:
  1. Form
       C = ( 1 −1
               1 −1
                  . . .
                     1 −1 ) ∈ Mp−1,p
     (row i has +1 in position i and −1 in position i + 1).
  2. Test H0 : CµX = 0p−1.
I CµX = 0p−1 ⇐⇒ all elements of µX are equal

UNSW MATH5855 2021T3 Lecture 4 Slide 35

4. Confidence Intervals and Hypothesis Tests for the Mean Vector
4.3 Comparison of two or more mean vectors
4.3.1 Reducing to a single population
4.3.2 The two-sample T 2-test

UNSW MATH5855 2021T3 Lecture 4 Slide 36

Independent Samples (pooled)

I Sometimes, X and Y are, in fact, independent samples
I As with univariate, assume ΣX = ΣY = Σ or not.
I If pooled,
    Spooled = ((nX − 1)SX + (nY − 1)SY) / (nX + nY − 2)
    Var(X̄ − Ȳ) = Σ/nX + Σ/nY ≈ Spooled (1/nX + 1/nY) =: S̄p
I X̄ − Ȳ ∼ Np(µX − µY, Σ(nX^{-1} + nY^{-1})) =⇒
    T² = (X̄ − Ȳ − δ0)>S̄p^{-1}(X̄ − Ȳ − δ0)                              (4.9)
  has distribution ((nX + nY − 2)p/(nX + nY − p − 1)) Fp,nX+nY−p−1

UNSW MATH5855 2021T3 Lecture 4 Slide 37
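A hedged R sketch of the pooled two-sample T² in (4.9); the data are hypothetical. Packaged versions of the same test appear on the Software slide (e.g. rrcov::T2.test).

## Pooled two-sample T^2 (4.9) -- a sketch with hypothetical data.
nX <- 25; nY <- 30; p <- 3
X <- MASS::mvrnorm(nX, mu = rep(0, p),   Sigma = diag(p))
Y <- MASS::mvrnorm(nY, mu = rep(0.3, p), Sigma = diag(p))
delta0 <- rep(0, p)
Sp   <- ((nX - 1) * cov(X) + (nY - 1) * cov(Y)) / (nX + nY - 2)
Sbar <- Sp * (1 / nX + 1 / nY)
d    <- colMeans(X) - colMeans(Y) - delta0
T2   <- drop(t(d) %*% solve(Sbar, d))
Fval <- T2 * (nX + nY - p - 1) / ((nX + nY - 2) * p)
c(T2 = T2, p.value = pf(Fval, p, nX + nY - p - 1, lower.tail = FALSE))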

I Construct a confidence region:
    {δ | (x̄ − ȳ − δ)>S̄p^{-1}(x̄ − ȳ − δ) ≤ ((nX + nY − 2)p/(nX + nY − p − 1)) F_{1−α,p,nX+nY−p−1}}
I Simultaneous contrast confidence intervals:
    l>(x̄ − ȳ) ± √(((nX + nY − 2)p/(nX + nY − p − 1)) F_{1−α,p,nX+nY−p−1} l>S̄pl)

UNSW MATH5855 2021T3 Lecture 4 Slide 38

Independent Samples (unpooled)

I Var(X̄ − Ȳ) ≈ SX/nX + SY/nY =: S̄up
    T² = (X̄ − Ȳ − δ0)>S̄up^{-1}(X̄ − Ȳ − δ0)
I The distribution of this T² is approximate: (νp/(ν − p + 1)) Fp,ν−p+1, with

    ν = (p + p²) / { ∑_{i=1}^2 (1/ni) [ tr({ (1/ni)Si ((1/n1)S1 + (1/n2)S2)^{-1} }²)
                                         + ( tr{ (1/ni)Si ((1/n1)S1 + (1/n2)S2)^{-1} } )² ] }

UNSW MATH5855 2021T3 Lecture 4 Slide 39
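A hedged R sketch of the unpooled T² with the approximate degrees of freedom ν as displayed on the previous slide; the data are hypothetical and the implementation is a direct transcription of that formula.

## Unpooled two-sample T^2 with approximate df nu -- a sketch.
nX <- 25; nY <- 40; p <- 2
X <- MASS::mvrnorm(nX, mu = c(0, 0),   Sigma = diag(2))
Y <- MASS::mvrnorm(nY, mu = c(0.4, 0), Sigma = 2 * diag(2))
Sup <- cov(X) / nX + cov(Y) / nY
d   <- colMeans(X) - colMeans(Y)
T2  <- drop(t(d) %*% solve(Sup, d))
tr  <- function(M) sum(diag(M))
term <- function(S, n) {                      # one summand of the nu denominator
  A <- (S / n) %*% solve(Sup)
  (tr(A %*% A) + tr(A)^2) / n
}
nu <- (p + p^2) / (term(cov(X), nX) + term(cov(Y), nY))
c(T2 = T2, nu = nu,
  p.value = pf(T2 * (nu - p + 1) / (nu * p), p, nu - p + 1, lower.tail = FALSE))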

I Confidence regions:
    {δ | (x̄ − ȳ − δ)>S̄up^{-1}(x̄ − ȳ − δ) ≤ (νp/(ν − p + 1)) F_{1−α,p,ν−p+1}}
I Simultaneous contrast confidence intervals:
    l>(x̄ − ȳ) ± √((νp/(ν − p + 1)) F_{1−α,p,ν−p+1} l>S̄upl)

UNSW MATH5855 2021T3 Lecture 4 Slide 40

4. Confidence Intervals and Hypothesis Tests for the Mean Vector
4.1 Hypothesis tests for the multivariate normal mean
4.2 Confidence regions for the mean vector and for its components
4.3 Comparison of two or more mean vectors
4.4 Software
4.5 Additional resources
4.6 Exercises

UNSW MATH5855 2021T3 Lecture 4 Slide 41

R: car::confidenceEllipse, package Hotelling,
rrcov::T2.test, ergm::approx.hotelling.diff.test,
MVTests::TwoSamplesHT2

SAS: See IML implementations.

UNSW MATH5855 2021T3 Lecture 4 Slide 42

4. Confidence Intervals and Hypothesis Tests for the Mean Vector
4.1 Hypothesis tests for the multivariate normal mean
4.2 Confidence regions for the mean vector and for its components
4.3 Comparison of two or more mean vectors
4.4 Software
4.5 Additional resources
4.6 Exercises

UNSW MATH5855 2021T3 Lecture 4 Slide 43

Additional resources

I JW Sec. 5.1–5.5, 6.

UNSW MATH5855 2021T3 Lecture 4 Slide 44

4. Confidence Intervals and Hypothesis Tests for the Mean Vector
4.1 Hypothesis tests for the multivariate normal mean
4.2 Confidence regions for the mean vector and for its components
4.3 Comparison of two or more mean vectors
4.4 Software
4.5 Additional resources
4.6 Exercises

UNSW MATH5855 2021T3 Lecture 4 Slide 45

Exercise 4.1

Suppose X1,X2, . . . ,Xn are independent Np(µ,Σ) random vectors
with sample mean vector X̄ and sample covariance matrix S . We
wish to test the hypothesis

H0 : µ2 − µ1 = µ3 − µ2 = · · · = µp − µp−1 = 1

where µ1, µ2, . . . , µp are the elements of µ.

(a) Determine a (p − 1)× p matrix C so that H0 may be written
equivalently as H0 : Cµ = 1 where 1 is a (p− 1)× 1 vector of
ones.

(b) Make an appropriate transformation of the vectors
Xi , i = 1, 2, . . . , n and hence find the rejection region of a size
α test of H0 in terms of X̄ , S , and C .

UNSW MATH5855 2021T3 Lecture 4 Slide 46

Exercise 4.2

A sample of 50 vector observations, each containing three
components, is drawn from a normal distribution having covariance
matrix

Σ = ( 3 1 1
      1 4 1
      1 1 2 ).

The components of the sample mean are 0.8, 1.1 and 0.6. Can you
reject the null hypothesis of zero distribution mean against a
general alternative?

UNSW MATH5855 2021T3 Lecture 4 Slide 47

Exercise 4.3

Evaluate Hotelling’s statistic T 2 for testing H0 : µ =

(
7

11

)
using

the data matrix X =
(

2 8 6 8
12 9 9 10

)
. Test the hypothesis H0 at

level α = 0.05. What conclusion is reached?

UNSW MATH5855 2021T3 Lecture 4 Slide 48

Exercise 4.4

Let X1, . . . ,Xn1 , i.i.d. Np(µ1,Σ) independently of Y1, . . .Yn2 i.i.d.
Np(µ2,Σ), Σ known. Prove that X̄ ∼ Np(µ1, 1n1 Σ) and
Ȳ ∼ Np(µ2, 1n2 Σ). Hence
W = X̄ − Ȳ ∼ N(µ1 − µ2,

(
1
n1

+ 1
n2

)
Σ) so that

X̄ − Ȳ − (µ1 − µ2) ∼ N(0,
(

1
n1

+ 1
n2

)
Σ). Construct a test of

H0 : µ1 = µ2.

UNSW MATH5855 2021T3 Lecture 4 Slide 49

Exercise 4.5

Let X̄ and S be based on n observations from Np(µ,Σ) and let X
be an additional observation from Np(µ,Σ). Show that
X − X̄ ∼ Np(0, (1 + 1/n)Σ). Find the distribution of
(n/(n + 1))(X − X̄)>S^{-1}(X − X̄) and suggest how to use this result to

give a (1− α) prediction region for X based on X̄ and S (i.e., a
region in Rp such that one has a given confidence (1− α) that the
next observation will fall into it).

UNSW MATH5855 2021T3 Lecture 4 Slide 50

Lecture 5: Correlation, Partial Correlation, Multiple
Correlation

5.1 Partial correlation
5.2 Multiple correlation
5.3 Testing of correlation coefficients
5.4 Additional resources
5.5 Exercises

UNSW MATH5855 2021T3 Lecture 5 Slide 1

I Correlation = undirected measure of linear dependence

I Prediction also incorporates direction of dependence
I Direction of dependence must be inferred from substantive

reasoning
I When temporal data available, may be detected from data

(e.g., Granger Causality)

UNSW MATH5855 2021T3 Lecture 5 Slide 2

I Correlation measures linear only dependence
=⇒ Uncorrelated variables may still be dependent

I (But independent variables are always uncorrelated.)

I For jointly MVN variables specifically,

uncorrelated⇐⇒ independent

UNSW MATH5855 2021T3 Lecture 5 Slide 3

In general, there are 3 types of correlation coefficients:

I The usual correlation coefficient between 2 variables

I Partial correlation coefficient between 2 variables after
adjusting for the effect (regression, association ) of set of
other variables.

I Multiple correlation between a single random variable and a
set of p other variables

UNSW MATH5855 2021T3 Lecture 5 Slide 4

5. Correlation, Partial Correlation, Multiple Correlation
5.1 Partial correlation
5.1.1 Simple formulae
5.1.2 Software
5.1.3 Examples

5.2 Multiple correlation
5.3 Testing of correlation coefficients
5.4 Additional resources
5.5 Exercises

UNSW MATH5855 2021T3 Lecture 5 Slide 5

“The Usual” Correlation

I For X ∼ Np(µ,Σ),
    ρij = σij/√(σii σjj), i, j = 1, 2, . . . , p
I The MLE ρ̂ij in (3.6) coincides with the sample correlation rij (1.3)

UNSW MATH5855 2021T3 Lecture 5 Slide 6

Partial Correlation Coefficients

I Section 2.2 MVN Property 4:
I Divide X ∈ Rp into X = ( X(1)
                           X(2) ), X(1) ∈ Rr, r < p, X(2) ∈ Rp−r, MVN with
    µ = ( µ(1)
          µ(2) ),   Σ = ( Σ11 Σ12
                          Σ21 Σ22 )
I Σ22 full rank =⇒
    X(1)|X(2) = x(2) ∼ Nr(µ(1) + Σ12Σ22^{-1}(x(2) − µ(2)), Σ11 − Σ12Σ22^{-1}Σ21).

UNSW MATH5855 2021T3 Lecture 5 Slide 7

partial correlations of X(1) given X(2) = x(2): the usual correlation coefficients
calculated from the elements σij.(r+1),(r+2),...,p of the matrix
Σ1|2 = Σ11 − Σ12Σ22^{-1}Σ21, i.e.

    ρij.(r+1),(r+2),...,p = σij.(r+1),(r+2),...,p / (√σii.(r+1),(r+2),...,p √σjj.(r+1),(r+2),...,p)    (5.1)

ρij.(r+1),(r+2),...,p is the correlation of the ith and jth components when the
components (r + 1), (r + 2), etc. up to the pth (i.e. the last p − r components)
have been held fixed.
=⇒ association (correlation) between the ith and jth components after eliminating
the effect that the last p − r components might have had on this association

UNSW MATH5855 2021T3 Lecture 5 Slide 8

Estimation

I Invariance property of the MLE =⇒
    ρ̂ij.(r+1),(r+2),...,p = σ̂ij.(r+1),(r+2),...,p / (√σ̂ii.(r+1),(r+2),...,p √σ̂jj.(r+1),(r+2),...,p), i, j = 1, 2, . . . , r
  will be the ML estimators of ρij.(r+1),(r+2),...,p, i, j = 1, 2, . . . , r.

UNSW MATH5855 2021T3 Lecture 5 Slide 9

5. Correlation, Partial Correlation, Multiple Correlation
5.1 Partial correlation
5.1.1 Simple formulae
5.1.2 Software
5.1.3 Examples

UNSW MATH5855 2021T3 Lecture 5 Slide 10

I Few variables involved =⇒ simple formulae:
  i) partial correlation between the first and second variable, adjusting for the
     effect of the third:
       ρ12.3 = (ρ12 − ρ13ρ23) / √((1 − ρ13²)(1 − ρ23²)).
  ii) partial correlation between the first and second variable, adjusting for the
     effects of the third and fourth variables:
       ρ12.3,4 = (ρ12.4 − ρ13.4ρ23.4) / √((1 − ρ13.4²)(1 − ρ23.4²)).
I More variables =⇒ software.

UNSW MATH5855 2021T3 Lecture 5 Slide 11

5. Correlation, Partial Correlation, Multiple Correlation
5.1 Partial correlation
5.1.1 Simple formulae
5.1.2 Software
5.1.3 Examples

UNSW MATH5855 2021T3 Lecture 5 Slide 12

SAS: PROC CORR
R: ggm::pcor, ggm::parcor

UNSW MATH5855 2021T3 Lecture 5 Slide 13

5. Correlation, Partial Correlation, Multiple Correlation
5.1 Partial correlation
5.1.1 Simple formulae
5.1.2 Software
5.1.3 Examples

UNSW MATH5855 2021T3 Lecture 5 Slide 14

Example 5.1.
Three variables have been measured for a set of schoolchildren:
 i) X1: Intelligence
 ii) X2: Weight
 iii) X3: Age
The number of observations was large enough that one can assume the empirical
correlation matrix ρ̂ ∈ M3,3 to be the true correlation matrix:

    ρ̂ = ( 1      0.6162 0.8267
           0.6162 1      0.7321
           0.8267 0.7321 1      ).

This suggests there is a high degree of positive dependence between weight and
intelligence. But (do the calculation!) ρ̂12.3 = 0.0286, so that, after the effect
of age is adjusted for, there is virtually no correlation between weight and
intelligence, i.e. weight obviously plays little part in explaining intelligence.

UNSW MATH5855 2021T3 Lecture 5 Slide 15

5. Correlation, Partial Correlation, Multiple Correlation
5.1 Partial correlation
5.2 Multiple correlation
5.2.1 Multiple correlation coefficient as ordinary correlation coefficient of transformed data
5.2.2 Interpretation of R
5.2.3 Remark about the calculation of R2
5.2.4 Examples
5.3 Testing of correlation coefficients
5.4 Additional resources
5.5 Exercises

UNSW MATH5855 2021T3 Lecture 5 Slide 16

Recall (Section 2.2):
I Predict a random variable Y from X = ( X1 X2 · · · Xp )>
I Minimise the value E[{Y − g(X )}2|X = x ] =⇒

g∗(X ) = E(Y |X )
I Y and X jointly MVN =⇒ linear g∗(x) = b + σ>0 C

−1x ,
where
I b = E(Y )− σ>0 C

−1 E(X )
I C = Var(X )
I σ0 = Cov(Y ,X )

=⇒ C−1σ0 ∈ Rp = vector of the regression coefficients

multiple correlation coefficient between Y ∈ R and X ∈ Rp:
maximum correlation between Y and any linear combination
α>X , α ∈ Rp.

UNSW MATH5855 2021T3 Lecture 5 Slide 17

5. Correlation, Partial Correlation, Multiple Correlation
5.2 Multiple correlation
5.2.1 Multiple correlation coefficient as ordinary correlation coefficient of

transformed data
5.2.2 Interpretation of R
5.2.3 Remark about the calculation of R2

5.2.4 Examples

UNSW MATH5855 2021T3 Lecture 5 Slide 18

Lemma 5.2.
The multiple correlation coefficient is the ordinary correlation coefficient
between Y and σ0>C^{-1}X ≡ β*>X. (I.e., β* ≡ C^{-1}σ0.)

Proof.
I ∀α ∈ Rp, Cov(Y, α>X) = α>σ0 = α>(CC^{-1}σ0) = α>Cβ*
I Set α = β* =⇒ Cov(Y, β*>X) = β*>Cβ*
I Cauchy–Bunyakovsky–Schwartz inequality =⇒
    [Cov(α>X, β*>X)]² ≤ Var(α>X) Var(β*>X)
I Write down [Cov(Y, α>X)]² ≡ σY² σ²_{α>X} ρ²_{Y,α>X} =⇒
    σY² ρ²_{Y,α>X} = (α>σ0)²/(α>Cα) = (α>Cβ*)²/(α>Cα) ≤ β*>Cβ*
I Equality when α = β*.
=⇒ ρ²_{Y,α>X} of Y and α>X is maximised over α when α = β*.
UNSW MATH5855 2021T3 Lecture 5 Slide 19

Coefficient of Determination

I The maximum correlation between Y and any linear combination α>X, α ∈ Rp, is
    R = √(β*>Cβ*/σY²)
=⇒ the multiple correlation coefficient
I R² is the coefficient of determination
I β* = C^{-1}σ0 =⇒ R = √(σ0>C^{-1}σ0/σY²)
I Let Σ = Var( Y
               X ) = ( σY²  σ0>
                       σ0   C   ) = ( Σ11 Σ12
                                      Σ21 Σ22 ).
I MLE Σ̂ = ( Σ̂11 Σ̂12
             Σ̂21 Σ̂22 ) =⇒ MLE R̂ = √(Σ̂12Σ̂22^{-1}Σ̂21/Σ̂11)

UNSW MATH5855 2021T3 Lecture 5 Slide 20

5. Correlation, Partial Correlation, Multiple Correlation
5.2 Multiple correlation
5.2.1 Multiple correlation coefficient as ordinary correlation coefficient of

transformed data
5.2.2 Interpretation of R
5.2.3 Remark about the calculation of R2

5.2.4 Examples

UNSW MATH5855 2021T3 Lecture 5 Slide 21

I From Section 2.2, the minimal value of the MSE for predicting Y from a
  linear function of X is σY² − σ0>C^{-1}σ0.
I Equivalent to σY²(1 − R²)
=⇒ When R² = 0, no predictive power at all; when R² = 1, Y can be predicted
   without any error at all (it is a true linear function of X).

UNSW MATH5855 2021T3 Lecture 5 Slide 22

5. Correlation, Partial Correlation, Multiple Correlation
5.2 Multiple correlation
5.2.1 Multiple correlation coefficient as ordinary correlation coefficient of

transformed data
5.2.2 Interpretation of R
5.2.3 Remark about the calculation of R2

5.2.4 Examples

UNSW MATH5855 2021T3 Lecture 5 Slide 23

I What if only the correlation matrix is available?
I Then, with ρYY ≡ (ρ^{-1})11 for the correlation matrix ρ ∈ Mp+1,p+1
  determined from Σ as defined earlier,
    1 − R² = 1/ρYY                                                         (5.2)
I Recall (4.3):
    1 − R² = (σY² − σ0>C^{-1}σ0)/σY² = (|C|/|C|)(σY² − σ0>C^{-1}σ0)/σY² = |Σ|/(|C|σY²)
I Recall from Section 0.1.2: (X^{-1})ji = (|X−(i,j)|/|X|)(−1)^{i+j}
=⇒ |C|/|Σ| = σ^{YY} ≡ (Σ^{-1})11
I Now, ρ = V^{-1/2} Σ V^{-1/2} for
    V = ( σY² 0   · · · 0
          0   c11 · · · 0
                  . . .
          0   0   · · · cpp )
I ρ^{-1} = V^{1/2} Σ^{-1} V^{1/2} =⇒ ρYY = σ^{YY} σY² =⇒ (5.2)
UNSW MATH5855 2021T3 Lecture 5 Slide 24

5. Correlation, Partial Correlation, Multiple Correlation
5.2 Multiple correlation
5.2.1 Multiple correlation coefficient as ordinary correlation coefficient of

transformed data
5.2.2 Interpretation of R
5.2.3 Remark about the calculation of R2

5.2.4 Examples

UNSW MATH5855 2021T3 Lecture 5 Slide 25

Example 5.3.

Let µ = ( µY
          µX1
          µX2 ) = ( 5
                    2
                    0 ) and

Σ = ( 10  1 −1
       1  7  3
      −1  3  2 ) = ( σYY σ0>
                     σ0  ΣXX ). Calculate:

(a) The best linear prediction of Y using X1 and X2.

(b) The multiple correlation coefficient R²_{Y.(X1,X2)}.

(c) The mean squared error of the best linear predictor.

UNSW MATH5855 2021T3 Lecture 5 Slide 26

Solution

β* = ΣXX^{-1}σ0 = ( 7 3
                    3 2 )^{-1} ( 1
                                −1 ) = ( .4 −.6
                                        −.6 1.4 ) ( 1
                                                   −1 ) = ( 1
                                                           −2 )
and
    b = µY − β*>µX = 5 − (1,−2) ( 2
                                  0 ) = 3.

Hence the best linear predictor is given by 3 + X1 − 2X2. The value of:

    R_{Y.(X1,X2)} = √( (1,−1) ( .4 −.6
                               −.6 1.4 ) ( 1
                                          −1 ) / 10 ) = √(3/10) = .548

The mean squared error of prediction is:
    σY²(1 − R²_{Y.(X1,X2)}) = 10(1 − 3/10) = 7.

UNSW MATH5855 2021T3 Lecture 5 Slide 27
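A minimal R sketch verifying the arithmetic of Example 5.3 numerically.

## Numerical check of Example 5.3 -- a sketch.
Sigma  <- matrix(c(10, 1, -1,
                    1, 7,  3,
                   -1, 3,  2), 3, 3, byrow = TRUE)
mu     <- c(Y = 5, X1 = 2, X2 = 0)
sigma0 <- Sigma[2:3, 1]; C <- Sigma[2:3, 2:3]; s2Y <- Sigma[1, 1]
beta   <- solve(C, sigma0)                  # regression coefficients (1, -2)
b      <- mu[1] - sum(beta * mu[2:3])       # intercept 3
R2     <- drop(t(sigma0) %*% solve(C, sigma0)) / s2Y
c(b = unname(b), beta, R = sqrt(R2), MSE = s2Y * (1 - R2))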

Example 5.4.

Relationship between multiple correlation and regression, and
equivalent ways of computing it.

UNSW MATH5855 2021T3 Lecture 5 Slide 28

5. Correlation, Partial Correlation, Multiple Correlation
5.1 Partial correlation
5.2 Multiple correlation
5.3 Testing of correlation coefficients
5.3.1 Usual correlation coefficients
5.3.2 Partial correlation coefficients
5.3.3 Multiple correlation coefficients
5.3.4 Software
5.3.5 Examples

5.4 Additional resources
5.5 Exercises

UNSW MATH5855 2021T3 Lecture 5 Slide 29

5. Correlation, Partial Correlation, Multiple Correlation
5.3 Testing of correlation coefficients
5.3.1 Usual correlation coefficients
5.3.2 Partial correlation coefficients
5.3.3 Multiple correlation coefficients
5.3.4 Software
5.3.5 Examples

UNSW MATH5855 2021T3 Lecture 5 Slide 30

I A bivariate problem.
I H0 : ρij = 0 =⇒ statistic T = rij √((n − 2)/(1 − rij²)) ∼ tn−2
=⇒ a t-test
I Otherwise, exact distribution complicated.
=⇒ Fisher’s Z transformation: Z = (1/2) log[(1 + rij)/(1 − rij)]
I H0 : ρij = ρ0 =⇒ Z ≈ N((1/2) log[(1 + ρ0)/(1 − ρ0)], 1/(n − 3))
I How would you test for equality of two correlation coefficients
  from two independent samples?

UNSW MATH5855 2021T3 Lecture 5 Slide 31
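A hedged R sketch of the Fisher Z test of H0 : ρ = ρ0; the sample correlation r, the sample size n and ρ0 are hypothetical numbers.

## Fisher's Z test of H0: rho = rho0 -- a sketch with hypothetical numbers.
r <- 0.42; n <- 50; rho0 <- 0.2
Z <- 0.5 * log((1 + r) / (1 - r))
a <- 0.5 * log((1 + rho0) / (1 - rho0))
z <- sqrt(n - 3) * (Z - a)
2 * pnorm(abs(z), lower.tail = FALSE)    # approximate two-sided p-value
## For rho0 = 0 the exact t-test is available via cor.test(x, y).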

5. Correlation, Partial Correlation, Multiple Correlation
5.3 Testing of correlation coefficients
5.3.1 Usual correlation coefficients
5.3.2 Partial correlation coefficients
5.3.3 Multiple correlation coefficients
5.3.4 Software
5.3.5 Examples

UNSW MATH5855 2021T3 Lecture 5 Slide 32

I Not much has to be changed.
I To test H0 : ρij.r+1,r+2,...,r+k = ρ0 versus H1 : ρij.r+1,r+2,...,r+k ≠ ρ0
  (i.e., given k variables):
  1. Construct Z = (1/2) log[(1 + rij.r+1,r+2,...,r+k)/(1 − rij.r+1,r+2,...,r+k)] and
     a = (1/2) log[(1 + ρ0)/(1 − ρ0)].
  2. Asymptotically Z ∼ N(a, 1/(n − k − 3)).
  3. Calculate √(n − k − 3)|Z − a| and test against the standard normal.
I For ρ0 = 0, the t-test can still be used, with n − 2 → n − k − 2.

UNSW MATH5855 2021T3 Lecture 5 Slide 33

5. Correlation, Partial Correlation, Multiple Correlation
5.3 Testing of correlation coefficients
5.3.1 Usual correlation coefficients
5.3.2 Partial correlation coefficients
5.3.3 Multiple correlation coefficients
5.3.4 Software
5.3.5 Examples

UNSW MATH5855 2021T3 Lecture 5 Slide 34

I Under H0 : R = 0,
    F = (R̂²/(1 − R̂²)) × ((n − p)/(p − 1)) ∼ Fp−1,n−p
I Equivalent to the ANOVA F-test: R̂² ≡ SSR/SST
=⇒ 1 − R̂² ≡ SSE/SST
=⇒ F = (SSR/SST)/(SSE/SST) × (n − p)/(p − 1) =⇒ F = (SSR/(p − 1))/(SSE/(n − p)) = MSR/MSE
I Here, p is the total number of all variables (the output Y and all of the input
  variables in the input vector X).
I In Section 5.2, it was just the dimension of X alone.

UNSW MATH5855 2021T3 Lecture 5 Slide 35

5. Correlation, Partial Correlation, Multiple Correlation
5.3 Testing of correlation coefficients
5.3.1 Usual correlation coefficients
5.3.2 Partial correlation coefficients
5.3.3 Multiple correlation coefficients
5.3.4 Software
5.3.5 Examples

UNSW MATH5855 2021T3 Lecture 5 Slide 36

SAS: PROC CORR

R: ggm::pcor.test

UNSW MATH5855 2021T3 Lecture 5 Slide 37

5. Correlation, Partial Correlation, Multiple Correlation
5.3 Testing of correlation coefficients
5.3.1 Usual correlation coefficients
5.3.2 Partial correlation coefficients
5.3.3 Multiple correlation coefficients
5.3.4 Software
5.3.5 Examples

UNSW MATH5855 2021T3 Lecture 5 Slide 38

Example 5.5.

Testing ordinary correlations: age, height, and intelligence.

UNSW MATH5855 2021T3 Lecture 5 Slide 39

Example 5.6.

Testing partial correlations: age, height, and intelligence.

UNSW MATH5855 2021T3 Lecture 5 Slide 40

5. Correlation, Partial Correlation, Multiple Correlation
5.1 Partial correlation
5.2 Multiple correlation
5.3 Testing of correlation coefficients
5.4 Additional resources
5.5 Exercises

UNSW MATH5855 2021T3 Lecture 5 Slide 41

Additional resources

I JW Sec. 7.8.

UNSW MATH5855 2021T3 Lecture 5 Slide 42

5. Correlation, Partial Correlation, Multiple Correlation
5.1 Partial correlation
5.2 Multiple correlation
5.3 Testing of correlation coefficients
5.4 Additional resources
5.5 Exercises

UNSW MATH5855 2021T3 Lecture 5 Slide 43

Exercise 5.1 I

Suppose X ∼ N4(µ,Σ) where µ =




1
2
3
4


 and

Σ =




3 1 0 1
1 4 0 0
0 0 1 4
1 0 4 20


. Determine:

(a) the distribution of




X1
X2
X3

X1 + X2 + X4


;

(b) the conditional mean and variance of X1 given x2, x3, and x4;

(c) the partial correlation coefficients ρ12.3, ρ12.4;

UNSW MATH5855 2021T3 Lecture 5 Slide 44

Exercise 5.1 II

(d) the multiple correlation between X1 and (X2,X3,X4).
Compare it to ρ12 and comment.

(e) Justify that


X2X3
X4


 is independent of

X1 − (1 0 1)


4 0 00 1 4

0 4 20


−1


X2X3
X4


 .

UNSW MATH5855 2021T3 Lecture 5 Slide 45

Exercise 5.2

A random vector X ∼ N3(µ,Σ) with µ =


 2−3

1


 and

Σ =


1 1 11 3 2

1 2 2


.

(a) Find the distribution of 3X1 − 2X2 + X3.

(b) Find a vector a ∈ R2 such that X2 and X2 − a>
(
X1
X3

)
are

independent.

UNSW MATH5855 2021T3 Lecture 5 Slide 46

Lecture 6: Principal Components Analysis
6.1 Introduction
6.2 Precise mathematical formulation
6.3 Estimation of the Principal Components
6.4 Deciding how many principal components to include
6.5 Software
6.6 Examples
6.7 PCA and Factor Analysis
6.8 Application to finance: Portfolio optimisation
6.9 Additional resources
6.10 Exercises

UNSW MATH5855 2021T3 Lecture 6 Slide 1

6. Principal Components Analysis
6.1 Introduction
6.2 Precise mathematical formulation
6.3 Estimation of the Principal Components
6.4 Deciding how many principal components to include
6.5 Software
6.6 Examples
6.7 PCA and Factor Analysis
6.8 Application to finance: Portfolio optimisation
6.9 Additional resources
6.10 Exercises

UNSW MATH5855 2021T3 Lecture 6 Slide 2

Motivation

I mainly a variable reduction procedure

I applied when there is a large number possibly highly
correlated

I “condense” the information to redundancy by “summarising”
in transformations

I principal components are artificial variables (constructs) that
account for most of the variability in the observed variables.
I formulated as linear combinations of variables
I try to absorb as much variation as possible
I can use more than one
I can then be used in subsequent procedures
I also a form of factor analysis

UNSW MATH5855 2021T3 Lecture 6 Slide 3

6. Principal Components Analysis
6.1 Introduction
6.2 Precise mathematical formulation
6.3 Estimation of the Principal Components
6.4 Deciding how many principal components to include
6.5 Software
6.6 Examples
6.7 PCA and Factor Analysis
6.8 Application to finance: Portfolio optimisation
6.9 Additional resources
6.10 Exercises

UNSW MATH5855 2021T3 Lecture 6 Slide 4

Definition

I X ∼ Np(µ,Σ), p relatively large
Goal: find a linear combination α1>X with α1 ∈ Rp s.t.
I Var(α1>X) is maximised
  I i.e., Var(α1>X) = α1>Σα1
I but ‖α1‖² = α1>α1 = 1

UNSW MATH5855 2021T3 Lecture 6 Slide 5

Derivation

i) construct the Lagrangian function
     Lag(α1, λ) = α1>Σα1 + λ(1 − α1>α1)
   where λ ∈ R1 is the Lagrange multiplier;
ii) take the partial derivative with respect to α1 and equate it to zero:
     2Σα1 − 2λα1 = 0 =⇒ (Σ − λIp)α1 = 0                                     (6.1)
   From (6.1) we see that α1 must be an eigenvector of Σ and, since we know from
   the first lecture what the maximal value of α>Σα/(α>α) is, we conclude that α1
   should be the eigenvector that corresponds to the largest eigenvalue λ̄1 of Σ.
   The random variable α1>X is called the first principal component.

UNSW MATH5855 2021T3 Lecture 6 Slide 6

Second Principal Component

I Var(α2>X) is maximised, s.t.
  I α2>α2 = 1
  I Cov(α1>X, α2>X) = α1>Σα2 = 0
I Lagrange function:
    Lag1(α2, λ1, λ2) = α2>Σα2 + λ1(1 − α2>α2) + λ2α1>Σα2
I Differentiate w.r.t. α2 and set to 0:
    2Σα2 − 2λ1α2 + λ2Σα1 = 0                                                (6.2)
I Pre-multiply (6.2) by α1> and use α2>α2 = 1 and α2>Σα1 = 0:
    −2λ1α1>α2 + λ2α1>Σα1 = 0 =⇒ λ2 = 0
  I WHY? Hint: α1 is an eigenvector of Σ.
I Then (6.2) =⇒ (Σ − λ1Ip)α2 = 0 =⇒ α2 is an eigenvector of Σ with eigenvalue λ1
I To maximise the variance, α2>Σα2 = α2>α2λ1 = λ1,
=⇒ α2 is the normalised eigenvector that corresponds to the second largest
   eigenvalue λ̄2 of Σ.
I Repeat to get α3 with λ̄3, etc.
UNSW MATH5855 2021T3 Lecture 6 Slide 7

All principal components

I Extracting all p PCs =⇒
    ∑_{i=1}^p Var(αi>X) = ∑_{i=1}^p λ̄i = tr(Σ) = Σ11 + · · · + Σpp
=⇒ Taking a small number k < p of PCs =⇒ explaining
    (Var(α1>X) + · · · + Var(αk>X))/(Σ11 + · · · + Σpp) × 100% = (λ̄1 + · · · + λ̄k)/(Σ11 + · · · + Σpp) × 100%
   of the total population variance Σ11 + · · · + Σpp.

UNSW MATH5855 2021T3 Lecture 6 Slide 8
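A hedged R sketch of the above: principal components and the proportion of variance explained, extracted from a (hypothetical) covariance matrix via eigen().

## Principal components from a covariance matrix via eigen() -- a sketch.
Sigma <- matrix(c(4.0, 1.2, 0.5,
                  1.2, 2.0, 0.3,
                  0.5, 0.3, 1.0), 3, 3, byrow = TRUE)   # hypothetical Sigma
ee <- eigen(Sigma)
alpha1 <- ee$vectors[, 1]                                # loadings of the first PC
prop   <- cumsum(ee$values) / sum(diag(Sigma))           # cumulative variance explained
rbind(eigenvalue = ee$values, cum.prop = prop)
## With raw data, stats::prcomp(X, scale. = TRUE) works from R (correlation);
## scale. = FALSE works from S (covariance).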

6. Principal Components Analysis
6.1 Introduction
6.2 Precise mathematical formulation
6.3 Estimation of the Principal Components
6.4 Deciding how many principal components to include
6.5 Software
6.6 Examples
6.7 PCA and Factor Analysis
6.8 Application to finance: Portfolio optimisation
6.9 Additional resources
6.10 Exercises

UNSW MATH5855 2021T3 Lecture 6 Slide 9

Estimation

I In practice, Σ is estimated.
I PCs from Σ are influenced by the scale of measurement.
I Large variance =⇒ large component in the first PC
=⇒ An alternative is to use the correlation matrix ρ instead.
I I.e., standardise the variables first:
    Zi = ( (X1i − X̄1)/√s11, (X2i − X̄2)/√s22, · · · , (Xpi − X̄p)/√spp )>, i = 1, . . . , n.
=⇒ The standardised observations matrix Z = [Z1,Z2, . . . ,Zn] ∈ Mp,n gives
   Z̄ = (1/n)Z1n = 0 and a sample covariance matrix SZ = (1/(n−1))ZZ> = R.

Example 6.1 (Eigenvalues obtained from Covariance and Correlation Matrices: see JW p. 437).

It demonstrates the great effect standardisation may have on the principal
components. The relative magnitudes of the weights after standardisation (i.e.
from ρ) may become in direct opposition to the weights attached to the same
variables in the principal component obtained from Σ.

UNSW MATH5855 2021T3 Lecture 6 Slide 10

6. Principal Components Analysis
6.1 Introduction
6.2 Precise mathematical formulation
6.3 Estimation of the Principal Components
6.4 Deciding how many principal components to include
6.5 Software
6.6 Examples
6.7 PCA and Factor Analysis
6.8 Application to finance: Portfolio optimisation
6.9 Additional resources
6.10 Exercises

UNSW MATH5855 2021T3 Lecture 6 Slide 11

Selecting k

Wanted: k as small as possible, but ψk = (λ̄1 + · · · + λ̄k)/(λ̄1 + · · · + λ̄p) as large as
possible—a trade-off.

“scree plot”: Plot λ̄k against k and see where it levels out.

Total variation explained:
1. Choose a constant c ∈ (0, 1).
   I Usually, c = 0.9, but this is an arbitrary choice.
2. Choose the smallest k s.t. ψk ≥ c.

Kaiser’s rule: Keep those PCs that (individually) explain at least (1/p)·100% of the
total variance.
I I.e., exclude a PC if Var(αi>Z) < 1.
I I.e., exclude if λ̄i < ¯̄λ (the average eigenvalue).
I Popular, but hard to defend on theoretical grounds.

UNSW MATH5855 2021T3 Lecture 6 Slide 12

Formal tests of significance

I It does not make sense to test H0 : λ̄k+1 = · · · = λ̄p = 0
  I H0 =⇒ Σ singular =⇒ Σ̂ singular =⇒ the estimated λ̄k for k + 1, . . . , p are also zero a.s.
I Instead test H0 : λ̄k+1 = · · · = λ̄p
  I any common value
  I a more quantitative version of the scree test
I One possible test:
  1. Compute:
     A0 = arithmetic mean of the last p − k estimated eigenvalues
     G0 = geometric mean of the last p − k estimated eigenvalues
  2. −2 log Λ = n(p − k) log(A0/G0) ∼ χ²_ν where ν = (p − k + 2)(p − k − 1)/2 (asymptotically).
I More details in Mardia, Kent, and Bibby (1979, p. 235–237), including a more precise form.
I Requires MVN.
I Only valid if based on S; is conservative for R.

UNSW MATH5855 2021T3 Lecture 6 Slide 13

6. Principal Components Analysis
6.1 Introduction
6.2 Precise mathematical formulation
6.3 Estimation of the Principal Components
6.4 Deciding how many principal components to include
6.5 Software
6.6 Examples
6.7 PCA and Factor Analysis
6.8 Application to finance: Portfolio optimisation
6.9 Additional resources
6.10 Exercises

UNSW MATH5855 2021T3 Lecture 6 Slide 14

SAS PROC PRINCOMP, PROC FACTOR
R stats::prcomp, stats::princomp, or about half a dozen other implementations

UNSW MATH5855 2021T3 Lecture 6 Slide 15

6. Principal Components Analysis
6.1 Introduction
6.2 Precise mathematical formulation
6.3 Estimation of the Principal Components
6.4 Deciding how many principal components to include
6.5 Software
6.6 Examples
6.7 PCA and Factor Analysis
6.8 Application to finance: Portfolio optimisation
6.9 Additional resources
6.10 Exercises

UNSW MATH5855 2021T3 Lecture 6 Slide 16

Example 6.2.
The Crime Rates example will be discussed at the lecture. The data gives crime
rates per 100,000 people in seven categories for each of the 50 states in the USA
in 1997. Principal components are used to summarise the 7-dimensional data in 2
or 3 dimensions only and help to visualise and interpret the data.

UNSW MATH5855 2021T3 Lecture 6 Slide 17

6. Principal Components Analysis
6.1 Introduction
6.2 Precise mathematical formulation
6.3 Estimation of the Principal Components
6.4 Deciding how many principal components to include
6.5 Software
6.6 Examples
6.7 PCA and Factor Analysis
6.8 Application to finance: Portfolio optimisation
6.9 Additional resources
6.10 Exercises

UNSW MATH5855 2021T3 Lecture 6 Slide 18

Relationship to Factor Analysis

I PCA can provide an initial view in factor analysis.
I But Principal component analysis ≠ Factor analysis.
Factor analysis: The covariation in the observed variables is due to the presence
  of one or more latent variables (factors) that exert causal influence on the
  observed variables, and we wish to infer them.
  I More on this later.
PCA: There is no prior assumption about an underlying causal model: we just wish
  to reduce the number of variables.

UNSW MATH5855 2021T3 Lecture 6 Slide 19

6. Principal Components Analysis
6.1 Introduction
6.2 Precise mathematical formulation
6.3 Estimation of the Principal Components
6.4 Deciding how many principal components to include
6.5 Software
6.6 Examples
6.7 PCA and Factor Analysis
6.8 Application to finance: Portfolio optimisation
6.9 Additional resources
6.10 Exercises

UNSW MATH5855 2021T3 Lecture 6 Slide 20

Mean–Variance-Efficient Portfolio

I Other problems in MV statistics lead to approaches similar to PCA.
I A p-dimensional vector X of returns of the p assets is given, with E X = µ,
  Var(X) = Σ.
I A portfolio with these assets with weights (c1, c2, . . . , cp) (with ∑p i=1 ci = 1): I has return Q = c>X .
I has expected return EQ = c>µ.
I has risk Var(Q) = c>Σc .

I We want to choose the weights c :
I to achieve prespecified expected return µ̄,
I while minimising the risk.

UNSW MATH5855 2021T3 Lecture 6 Slide 21

Derivation

I Lagrangian function:
    Lag(λ1, λ2) = c>Σc + λ1(µ̄ − c>µ) + λ2(1 − c>1p)                        (6.3)
I Differentiate (6.3) w.r.t. c:
    2Σc − λ1µ − λ21p = 0                                                    (6.4)
I Suppose there is no riskless asset with a fixed return.
=⇒ Σ is pos. def. and Σ^{-1} exists
I Then, solving for c:
    c = (1/2)Σ^{-1}(λ1µ + λ21p)                                             (6.5)

UNSW MATH5855 2021T3 Lecture 6 Slide 22

I Pre-multiply by 1p>:
    1 = (1/2)1p>Σ^{-1}(λ1µ + λ21p)                                          (6.6)
I Solve for λ2 = (2 − λ11p>Σ^{-1}µ)/(1p>Σ^{-1}1p)
I Substitute into (6.5):
    c = (1/2)λ1(Σ^{-1}µ − (1p>Σ^{-1}µ/(1p>Σ^{-1}1p))Σ^{-1}1p) + Σ^{-1}1p/(1p>Σ^{-1}1p)    (6.7)
I Pre-multiply (6.5) by µ> and use µ>c = µ̄:
    λ1 = (2µ̄ − λ2µ>Σ^{-1}1p)/(µ>Σ^{-1}µ)
=⇒ The linear system of 2 equations w.r.t. λ1 and λ2 can be solved and substituted
   into (6.7) to get the final expression for c using µ, µ̄ and Σ.

UNSW MATH5855 2021T3 Lecture 6 Slide 23

Variance-Efficient Portfolio

I no prespecified mean return
=⇒ only required to minimise the variance
=⇒ in (6.3), λ1 = 0
I From (6.7), the optimal weights for the variance-efficient portfolio are:
    copt = Σ^{-1}1p/(1p>Σ^{-1}1p).

UNSW MATH5855 2021T3 Lecture 6 Slide 24
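A minimal R sketch of the variance-efficient weights copt above; the return covariance matrix is hypothetical.

## Variance-efficient portfolio weights c = Sigma^{-1} 1p / (1p' Sigma^{-1} 1p) -- a sketch.
Sigma <- matrix(c(0.04, 0.01, 0.00,
                  0.01, 0.09, 0.02,
                  0.00, 0.02, 0.16), 3, 3, byrow = TRUE)   # hypothetical return covariance
ones  <- rep(1, ncol(Sigma))
w     <- solve(Sigma, ones)
c.opt <- w / sum(w)          # weights sum to 1 and minimise c' Sigma c subject to that
c.opt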

6. Principal Components Analysis
6.1 Introduction
6.2 Precise mathematical formulation
6.3 Estimation of the Principal Components
6.4 Deciding how many principal components to include
6.5 Software
6.6 Examples
6.7 PCA and Factor Analysis
6.8 Application to finance: Portfolio optimisation
6.9 Additional resources
6.10 Exercises

UNSW MATH5855 2021T3 Lecture 6 Slide 25

Additional resources

I JW Ch. 8.

UNSW MATH5855 2021T3 Lecture 6 Slide 26

6. Principal Components Analysis
6.1 Introduction
6.2 Precise mathematical formulation
6.3 Estimation of the Principal Components
6.4 Deciding how many principal components to include
6.5 Software
6.6 Examples
6.7 PCA and Factor Analysis
6.8 Application to finance: Portfolio optimisation
6.9 Additional resources
6.10 Exercises

UNSW MATH5855 2021T3 Lecture 6 Slide 27

Exercise 6.1

A random vector Y =


Y1Y2
Y3


 is normally distributed with zero

mean vector and Σ =


 1 ρ/2 0ρ/2 1 ρ

0 ρ 1


 where ρ is positive.

(a) Find the coefficients of the first principal component and the
variance of that component. What percentage of the overall
variability does it explain?

(b) Find the joint distribution of Y1,Y2 and Y1 + Y2 + Y3.

(c) Find the conditional distribution of Y1,Y2 given Y3 = y3.

(d) Find the multiple correlation of Y3 with Y1,Y2.

UNSW MATH5855 2021T3 Lecture 6 Slide 28

Lecture 7: Canonical Correlation Analysis
7.1 Introduction
7.2 Application in testing for independence of sets of variables
7.3 Precise mathematical formulation and solution to the problem
7.4 Estimating and testing canonical correlations
7.5 Software
7.6 Some important computational issues
7.7 Examples
7.8 Additional resources
7.9 Exercises

UNSW MATH5855 2021T3 Lecture 7 Slide 1

7. Canonical Correlation Analysis
7.1 Introduction
7.2 Application in testing for independence of sets of variables
7.3 Precise mathematical formulation and solution to the problem
7.4 Estimating and testing canonical correlations
7.5 Software
7.6 Some important computational issues
7.7 Examples
7.8 Additional resources
7.9 Exercises

UNSW MATH5855 2021T3 Lecture 7 Slide 2

Canonical Correlation Analysis

I Two sets of variables.

I Interested in association between the sets.
I Consider the largest possible correlation between any linear

combination of variables in the first set and the second set.

=⇒ first canonical variables and first canonical correlation
I Like PCA, can take second canonical variables/correlation,

etc.

I Like PCA, summarising complex relationships in small number
of correlations on combinations of variables.

I Correlation coefficient is a special case. (Both sets contain
only one variable each.)

I Multiple correlation coefficient is a special case. (One of the
sets contains only one variable.)

UNSW MATH5855 2021T3 Lecture 7 Slide 3

7. Canonical Correlation Analysis
7.1 Introduction
7.2 Application in testing for independence of sets of variables
7.3 Precise mathematical formulation and solution to the problem
7.4 Estimating and testing canonical correlations
7.5 Software
7.6 Some important computational issues
7.7 Examples
7.8 Additional resources
7.9 Exercises

UNSW MATH5855 2021T3 Lecture 7 Slide 4

I Under MVN, independence ⇐⇒ uncorrelatedness
I Suppose X ∼ Np(µ,Σ), partitioned into two components X(1) ∈ Rr, X(2) ∈ Rq, with r + q = p.
I Partition Σ = ( Σ11 Σ12
                  Σ21 Σ22 ), and assume Σ, Σ11, and Σ22 are nonsingular
I To test H0 : Σ12 = 0,
  1. For fixed vectors a ∈ Rr, b ∈ Rq, let Z1 = a>X(1) and Z2 = b>X(2), giving
     ρa,b = Cor(Z1, Z2) = a>Σ12b / √(a>Σ11a b>Σ22b).
  2. H0 is equivalent to H0 : ∀ a ∈ Rr, b ∈ Rq: ρa,b = 0.
I For given a, b, H0 would not be rejected if |ra,b| = |a>S12b| / √(a>S11a b>S22b) ≤ k
  for a certain critical value k.
=⇒ The acceptance region for H0 would be given in the form {X ∈ Mp,n : max_{a,b} r²a,b ≤ k²}.
=⇒ I.e., maximise (a>S12b)² under the constraints a>S11a = b>S22b = 1: canonical correlation.

UNSW MATH5855 2021T3 Lecture 7 Slide 5

7. Canonical Correlation Analysis
7.1 Introduction
7.2 Application in testing for independence of sets of variables
7.3 Precise mathematical formulation and solution to the problem
7.4 Estimating and testing canonical correlations
7.5 Software
7.6 Some important computational issues
7.7 Examples
7.8 Additional resources
7.9 Exercises

UNSW MATH5855 2021T3 Lecture 7 Slide 6

I Wanted: Z1 = a>X(1) and Z2 = b>X(2), where a ∈ Rr, b ∈ Rq are obtained by
  I maximising (a>Σ12b)²
  I s.t. a>Σ11a = b>Σ22b = 1.
I Lagrangian:
    Lag(a, b, λ1, λ2) = (a>Σ12b)² + λ1(a>Σ11a − 1) + λ2(b>Σ22b − 1)
I Differentiate w.r.t. a and b (individually):
    2(a>Σ12b)Σ12b + 2λ1Σ11a = 0 ∈ Rr                                        (7.1)
    2(a>Σ12b)Σ21a + 2λ2Σ22b = 0 ∈ Rq                                        (7.2)
I Pre-multiply (7.1) by the vector a> and (7.2) by b> and subtract them:
    λ1 = λ2 = −(a>Σ12b)² = −µ²
=⇒
    Σ12b = µΣ11a                                                            (7.3)
    Σ21a = µΣ22b                                                            (7.4)

UNSW MATH5855 2021T3 Lecture 7 Slide 7

I Pre-multiply (7.3) by Σ21Σ11^{-1}, multiply both sides of (7.4) by the scalar µ, and add:
    (Σ21Σ11^{-1}Σ12 − µ²Σ22)b = 0                                           (7.5)
I For (7.5) to have a solution with b ≠ 0 =⇒
    |Σ21Σ11^{-1}Σ12 − µ²Σ22| = 0                                            (7.6)
I Multiply both sides by |Σ22^{-1/2}|² and use the product-of-determinants rule:
    |Σ22^{-1/2}||Σ21Σ11^{-1}Σ12 − µ²Σ22||Σ22^{-1/2}| = |Σ22^{-1/2}Σ21Σ11^{-1}Σ12Σ22^{-1/2} − µ²Iq| = 0
=⇒ µ² is an eigenvalue of Σ22^{-1/2}Σ21Σ11^{-1}Σ12Σ22^{-1/2}
=⇒ b = Σ22^{-1/2}b̂, where b̂ is the eigenvector of Σ22^{-1/2}Σ21Σ11^{-1}Σ12Σ22^{-1/2}
   corresponding to this eigenvalue (WHY?!).

UNSW MATH5855 2021T3 Lecture 7 Slide 8

I Want to maximise a>Σ12b, and µ² is an eigenvalue of
  Σ22^{-1/2}Σ21Σ11^{-1}Σ12Σ22^{-1/2} (or Σ22^{-1}Σ21Σ11^{-1}Σ12) =⇒ it is the
  largest eigenvalue
I From (7.3), a = (1/µ)Σ11^{-1}Σ12b
=⇒ First canonical variables Z1 = a>X(1) and Z2 = b>X(2) are thereby determined, and
   the value of the first canonical correlation is just µ.
I The orientation (sign) of b is chosen such that the sign of µ is positive.
I Second canonical correlation = second largest eigenvalue
  I Automatically ensures the second pair of canonical variables is uncorrelated with the first
  I Can have at most min(q, r) canonical correlations
  I Usually far fewer are used.
I First canonical correlation ≥ the highest multiple correlation between any variable
  and the opposite set of variables.
=⇒ Often the first canonical correlation is large, with the subsequent ones small

UNSW MATH5855 2021T3 Lecture 7 Slide 9
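A hedged R sketch of the eigenvalue computation above, using a simulated (hypothetical) data set, checked against stats::cancor.

## Canonical correlations: eigenvalues of S22^{-1} S21 S11^{-1} S12 vs. stats::cancor -- a sketch.
set.seed(2)
X <- MASS::mvrnorm(200, mu = rep(0, 4), Sigma = diag(4) + 0.3)   # hypothetical data
S <- cov(X); r <- 1:2; q <- 3:4
M <- solve(S[q, q], S[q, r]) %*% solve(S[r, r], S[r, q])
sqrt(eigen(M)$values)          # canonical correlations from the covariance matrix
cancor(X[, r], X[, q])$cor     # should agree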

7. Canonical Correlation Analysis
7.1 Introduction
7.2 Application in testing for independence of sets of variables
7.3 Precise mathematical formulation and solution to the problem
7.4 Estimating and testing canonical correlations
7.5 Software
7.6 Some important computational issues
7.7 Examples
7.8 Additional resources
7.9 Exercises

UNSW MATH5855 2021T3 Lecture 7 Slide 10

I Estimate as in Section 7.3, substituting Sij in place of Σij.
I To test H0 : ∀ a ∈ Rr, b ∈ Rq: ρa,b = 0 (as in Section 7.2), accept for
    {X ∈ Mp,n : largest eigenvalue of S22^{-1/2}S21S11^{-1}S12S22^{-1/2} ≤ kα}
I The constant kα has been worked out and is given in the so-called Heck’s charts.
  This distribution depends on three parameters:
  I s = min(r, q)
  I m = (|r − q| − 1)/2
  I N = (n − r − q − 2)/2, where n is the sample size
I Can also use good F-distribution-based approximations for (transformations of) this
  distribution, such as Wilks’ lambda, Pillai’s trace, the Hotelling trace, and Roy’s greatest root.

UNSW MATH5855 2021T3 Lecture 7 Slide 11

7. Canonical Correlation Analysis
7.1 Introduction
7.2 Application in testing for independence of sets of variables
7.3 Precise mathematical formulation and solution to the problem
7.4 Estimating and testing canonical correlations
7.5 Software
7.6 Some important computational issues
7.7 Examples
7.8 Additional resources
7.9 Exercises

UNSW MATH5855 2021T3 Lecture 7 Slide 12

SAS PROC CANCORR

R

I stats::cancor
I package CCA for computing and visualisation
I package CCP for testing canonical correlations

UNSW MATH5855 2021T3 Lecture 7 Slide 13

7. Canonical Correlation Analysis
7.1 Introduction
7.2 Application in testing for independence of sets of variables
7.3 Precise mathematical formulation and solution to the problem
7.4 Estimating and testing canonical correlations
7.5 Software
7.6 Some important computational issues
7.7 Examples
7.8 Additional resources
7.9 Exercises

UNSW MATH5855 2021T3 Lecture 7 Slide 14

I Calculating X^{-1/2} and X^{1/2} for a symmetric positive definite matrix X using
  the spectral decomposition may be numerically unstable
  I I.e., when the condition number λ̄1/λ̄p is high.
I The Cholesky decomposition (Section 0.1.6) can be used instead.
I In (7.5), let U>U = Σ22^{-1}
=⇒ µ² is an eigenvalue of the matrix A = UΣ21Σ11^{-1}Σ12U>.
  I I.e., pre-multiplying by U and post-multiplying by U> in (7.6):
      |A − µ²UΣ22U>| = 0
  I But UΣ22U> = U(U>U)^{-1}U> = UU^{-1}(U>)^{-1}U> = I holds.

UNSW MATH5855 2021T3 Lecture 7 Slide 15

7. Canonical Correlation Analysis
7.1 Introduction
7.2 Application in testing for independence of sets of variables
7.3 Precise mathematical formulation and solution to the problem
7.4 Estimating and testing canonical correlations
7.5 Software
7.6 Some important computational issues
7.7 Examples
7.8 Additional resources
7.9 Exercises

UNSW MATH5855 2021T3 Lecture 7 Slide 16

Example 7.1.

Canonical Correlation Analysis of the Fitness Club Data.
Three physiological and three exercise variables were measured on
twenty middle aged men in a fitness club. Canonical correlation is
used to determine if the physiological variables are related in any
way to the exercise variables.

UNSW MATH5855 2021T3 Lecture 7 Slide 17

Example 7.2.

JW Example 10.4, p. 552 Studying canonical correlations
between leg and head bone measurements: X1,X2 are skull length
and skull breadth, respectively; X3,X4 are leg bone measurements:
femur and tibia length, respectively. Observations have been taken
on n = 276 White Leghorn chicken. The example is chosen to also
illustrate how a canonical correlation analysis can be performed
when the original data is not given but the empirical correlation
matrix (or empirical covariance matrix) is available.

UNSW MATH5855 2021T3 Lecture 7 Slide 18

7. Canonical Correlation Analysis
7.1 Introduction
7.2 Application in testing for independence of sets of variables
7.3 Precise mathematical formulation and solution to the problem
7.4 Estimating and testing canonical correlations
7.5 Software
7.6 Some important computational issues
7.7 Examples
7.8 Additional resources
7.9 Exercises

UNSW MATH5855 2021T3 Lecture 7 Slide 19

Additional resources

I JW Ch. 10.

UNSW MATH5855 2021T3 Lecture 7 Slide 20

7. Canonical Correlation Analysis
7.1 Introduction
7.2 Application in testing for independence of sets of variables
7.3 Precise mathematical formulation and solution to the problem
7.4 Estimating and testing canonical correlations
7.5 Software
7.6 Some important computational issues
7.7 Examples
7.8 Additional resources
7.9 Exercises

UNSW MATH5855 2021T3 Lecture 7 Slide 21

Exercise 7.1

Let the components of X correspond to scores on tests in
arithmetic speed (X1), arithmetic power (X2), memory for words
(X3), memory for meaningful symbols (X4), and memory for
meaningless symbols (X5). The observed correlations in a sample
of 140 are


1.0000 0.4248 0.0420 0.0215 0.0573

1.0000 0.1487 0.2489 0.2843
1.0000 0.6693 0.4662

1.0000 0.6915
1.0000


 .

Find the canonical correlations and canonical variates between the
first two variates and the last three variates. Comment. Write a
SAS-IML or R code to implement the required calculations.

UNSW MATH5855 2021T3 Lecture 7 Slide 22

Exercise 7.2

Students sit 5 different papers, two of which are closed book and
the rest open book. For the 88 students who sat these exams the
sample covariance matrix is

S =




302.3 125.8 100.4 105.1 116.1
170.9 84.2 93.6 97.9

111.6 110.8 120.5
217.9 153.8

294.4


 .

Find the canonical correlations and canonical variates between the
first two variates (closed book exams) and the last three variates
(open book exams). Comment.

UNSW MATH5855 2021T3 Lecture 7 Slide 23

Exercise 7.3

A random vector X ∼ N4(µ,Σ) with µ =




0
0
0
0


 and




1 2ρ ρ ρ
2ρ 1 ρ ρ
ρ ρ 1 2ρ
ρ ρ 2ρ 1


 where ρ is a small enough positive constant.

(a) Find the two canonical correlations between

(
X1
X2

)
and(

X3
X4

)
. Comment.

(b) Find the first pair of canonical variables.

UNSW MATH5855 2021T3 Lecture 7 Slide 24

Exercise 7.4

Consider the following covariance matrix Σ of a four dimensional

normal vector: Σ =

(
Σ11 Σ12
Σ21 Σ22

)
=




100 0 0 0
0 1 0.95 0

0 0.95 1 0
0 0 0 100


.

Verify that the first pair of canonical variates are just the second
and the third component of the vector and the canonical
correlation equals .95.

UNSW MATH5855 2021T3 Lecture 7 Slide 25

Lecture 8: Multivariate Linear Models and Multivariate
ANOVA

8.1 Univariate linear models and ANOVA
8.2 Multivariate Linear Model and MANOVA
8.3 Computations used in the MANOVA tests
8.4 Software
8.5 Examples
8.6 Additional resources

UNSW MATH5855 2021T3 Lecture 8 Slide 1

8. Multivariate Linear Models and Multivariate ANOVA
8.1 Univariate linear models and ANOVA
8.2 Multivariate Linear Model and MANOVA
8.3 Computations used in the MANOVA tests
8.4 Software
8.5 Examples
8.6 Additional resources

UNSW MATH5855 2021T3 Lecture 8 Slide 2

I For observations i = 1, 2, . . . , n, let
    the response variable Yi = xiβ + εi, for
    predictor row vector xi ∈ Rk, assumed fixed and known,
    coefficient vector β ∈ Rk, fixed and unknown,
    stochastic errors εi i.i.d. ∼ N(0, σ²)
=⇒ In matrix form, Y = ( Y1 Y2 · · · Yn )> and X = ( x1> x2> · · · xn> )> ∈ Mn,k
I assume X contains an intercept
=⇒ Y = Xβ + ε, ε ∼ Nn(0, Inσ²)
I The MLE for β minimises
    ∑_{i=1}^n (Yi − xiβ)² = ‖Y − Xβ‖² = (Y − Xβ)>(Y − Xβ)
I to get
    β̂ = (X>X)^{-1}X>Y
    Var(β̂) = (X>X)^{-1}X> Var(Y) X(X>X)^{-1} = (X>X)^{-1}σ²

UNSW MATH5855 2021T3 Lecture 8 Slide 3

▶ Consider projection matrices
    A = In − X(X⊤X)^{-1}X⊤
    B = X(X⊤X)^{-1}X⊤ − 1n(1n⊤1n)^{-1}1n⊤
  =⇒ AY = Y − X{(X⊤X)^{-1}X⊤Y} = Y − Ŷ = residuals
  =⇒ BY = X{(X⊤X)^{-1}X⊤}Y − 1n(1n⊤1n)^{-1}1n⊤Y = Ŷ − 1nȲ = fitted values over and above the mean
▶ Also, if X has an intercept,
    Cov(AY, BY) = A Var(Y) B⊤ = σ^2 AB⊤ , where
    AB⊤ = X(X⊤X)^{-1}X⊤ − X(X⊤X)^{-1}X⊤X(X⊤X)^{-1}X⊤ − 1n(1n⊤1n)^{-1}1n⊤ + X(X⊤X)^{-1}X⊤1n(1n⊤1n)^{-1}1n⊤
        = (1/n)(X(X⊤X)^{-1}X⊤1n − 1n)1n⊤ = 0
  =⇒ SSE = Y⊤AY ∼ σ^2 χ^2_{n−k} and SSA = Y⊤BY ∼ σ^2 χ^2_{k−1}, independent
  =⇒ Can set up F = {SSA/(k − 1)} / {SSE/(n − k)} ∼ F_{k−1,n−k}
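A minimal R sketch (on simulated data, not part of the course materials) checking that the projection-matrix construction above reproduces the usual ANOVA F statistic reported by lm()/anova():

  set.seed(1)
  n <- 30; x <- rnorm(n); y <- 1 + 2 * x + rnorm(n)
  X <- cbind(1, x); k <- ncol(X)
  A <- diag(n) - X %*% solve(crossprod(X)) %*% t(X)          ## residual projector
  one <- matrix(1, n, 1)
  B <- X %*% solve(crossprod(X)) %*% t(X) -
       one %*% solve(crossprod(one)) %*% t(one)              ## fitted-above-mean projector
  SSE <- drop(t(y) %*% A %*% y); SSA <- drop(t(y) %*% B %*% y)
  Fstat <- (SSA / (k - 1)) / (SSE / (n - k))
  c(projection = Fstat, lm = anova(lm(y ~ x))$`F value`[1])  ## should agree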

UNSW MATH5855 2021T3 Lecture 8 Slide 4

8. Multivariate Linear Models and Multivariate ANOVA
8.1 Univariate linear models and ANOVA
8.2 Multivariate Linear Model and MANOVA
8.3 Computations used in the MANOVA tests
8.4 Software
8.5 Examples
8.6 Additional resources

UNSW MATH5855 2021T3 Lecture 8 Slide 5

response matrix:

    Y = ( Y1⊤ )   ( Y11  Y12  · · ·  Y1p )
        ( Y2⊤ )   ( Y21  Y22  · · ·  Y2p )
        (  ⋮  ) = (  ⋮    ⋮   ⋱      ⋮  )
        ( Yn⊤ )   ( Yn1  Yn2  · · ·  Ynp )  ∈ Mn,p

predictors: xi and X as before
coefficient matrix: β ∈ Mk,p
stochastic error vectors: εi ∼ Np(0,Σ), Σ ∈ Mp,p symmetric positive definite
▶ Then,
    Yi⊤ = xiβ + εi⊤
    Y = Xβ + E ,  E = (ε1  ε2  · · ·  εn)⊤ ∈ Mn,p
  =⇒ vec(E) ∼ Nnp(0, Σ ⊗ In) and vec(E⊤) ∼ Nnp(0, In ⊗ Σ)
  =⇒ vec(Y) ∼ Nnp({β⊤ ⊗ In} vec(X), Σ ⊗ In) and vec(Y⊤) ∼ Nnp({In ⊗ β⊤} vec(X⊤), In ⊗ Σ)

UNSW MATH5855 2021T3 Lecture 8 Slide 6

▶ MLE is again OLS, minimising
    Σ_{i=1}^n tr{(Yi − xiβ)(Yi − xiβ)⊤} = tr{(Y − Xβ)⊤(Y − Xβ)},
  =⇒ β̂ = (X⊤X)^{-1}X⊤Y

    Var(vec(β̂⊤)) = Var(vec(Y⊤X(X⊤X)^{-1})) = Var{((X⊤X)^{-1}X⊤ ⊗ Ip) vec(Y⊤)}
                 = ((X⊤X)^{-1}X⊤ ⊗ Ip)(In ⊗ Σ)((X⊤X)^{-1}X⊤ ⊗ Ip)⊤
                 = ((X⊤X)^{-1}X⊤ ⊗ Ip)((X⊤X)^{-1}X⊤ ⊗ Σ)⊤
                 = (X⊤X)^{-1} ⊗ Σ

    Var(vec(β̂)) = Σ ⊗ (X⊤X)^{-1}

▶ Projection matrices A and B still work (check it!)
  =⇒ SSE = Y⊤AY ∼ Wp(Σ, n − k)
  =⇒ SSA = Y⊤BY ∼ Wp(Σ, k − 1)
▶ Now matrices.

UNSW MATH5855 2021T3 Lecture 8 Slide 7

8. Multivariate Linear Models and Multivariate ANOVA
8.1 Univariate linear models and ANOVA
8.2 Multivariate Linear Model and MANOVA
8.3 Computations used in the MANOVA tests
8.3.1 Roots distributions
8.3.2 Comparisons

8.4 Software
8.5 Examples
8.6 Additional resources

UNSW MATH5855 2021T3 Lecture 8 Slide 8

▶ In (univariate) ANOVA, with the usual assumptions, we test for a factor with the F test:
  ▶ Decompose SST = SSA + SSE. Then,
    i) SSE and SSA are independent scaled-χ^2 distributed;
    ii) divided by degrees of freedom,
      ▶ MSE is always unbiased for σ^2;
      ▶ MSA is only unbiased for σ^2 under H0 and is larger (in expectation) otherwise.
  =⇒ For F = MSA/MSE, under H0 the σ^2s cancel and the distribution is just an unscaled F.
  ▶ I.e., reject when F is higher than the critical value.
▶ Multivariate ANOVA (MANOVA) =⇒ χ^2 → Wishart
  =⇒ “Ratio” of matrices.
  ▶ No unambiguous choice.
  ▶ Turns out to be related to the eigenvalues of scaled Wishart-distributed matrices related to the decomposition SST = SSA + SSE in the multivariate case.

UNSW MATH5855 2021T3 Lecture 8 Slide 9

8. Multivariate Linear Models and Multivariate ANOVA
8.3 Computations used in the MANOVA tests
8.3.1 Roots distributions
8.3.2 Comparisons

UNSW MATH5855 2021T3 Lecture 8 Slide 10

▶ Let Yi , i = 1, 2, . . . , n, be independently ∼ Np(µi ,Σ), with data matrix

    Y = ( Y1⊤ )   ( Y11  Y12  · · ·  Y1p )
        ( Y2⊤ )   ( Y21  Y22  · · ·  Y2p )
        (  ⋮  ) = (  ⋮    ⋮   ⋱      ⋮  )
        ( Yn⊤ )   ( Yn1  Yn2  · · ·  Ynp )  ∈ Mn,p

▶ Call E(Y) = M, Var(vec(Y)) = Σ ⊗ In
▶ Let A and B be projectors such that Q1 = Y⊤AY and Q2 = Y⊤BY are two independent Wp(Σ, v) and Wp(Σ, q) matrices, respectively.
▶ E.g., for a multivariate linear model,
    Y = Xβ + E ,  Ŷ = Xβ̂
    A = In − X(X⊤X)^−X⊤,  B = X(X⊤X)^−X⊤ − 1n(1n⊤1n)^{-1}1n⊤
  and the corresponding decomposition
    Y⊤[In − 1n(1n⊤1n)^{-1}1n⊤]Y = Y⊤BY + Y⊤AY = Q2 + Q1
  of SST = SSA + SSE = Q2 + Q1, where Q2 is the “hypothesis matrix” and Q1 is the “error matrix”.

UNSW MATH5855 2021T3 Lecture 8 Slide 11

Lemma 8.1.
Let Q1, Q2 ∈ Mp,p be two positive definite symmetric matrices. Then the roots of the determinant equation |Q2 − θ(Q1 + Q2)| = 0 are related to the roots of the equation |Q2 − λQ1| = 0 by λi = θi/(1 − θi) (or, equivalently, θi = λi/(1 + λi)).

UNSW MATH5855 2021T3 Lecture 8 Slide 12

Lemma 8.2.
Let Q1, Q2 ∈ Mp,p be two positive definite symmetric matrices. Then the roots of the determinant equation |Q1 − v(Q1 + Q2)| = 0 are related to the roots of the equation |Q2 − λQ1| = 0 by λi = (1 − vi)/vi (or, equivalently, vi = 1/(1 + λi)).

UNSW MATH5855 2021T3 Lecture 8 Slide 13

▶ Then, if λi , vi , θi are the roots of
    |Q2 − λQ1| = 0,  |Q1 − v(Q1 + Q2)| = 0,  |Q2 − θ(Q1 + Q2)| = 0,
  respectively:

    Λ = |Q1(Q1 + Q2)^{-1}| = ∏_{i=1}^p (1 + λi)^{-1}   (Wilks’s Criterion statistic)

    |Q2 Q1^{-1}| = ∏_{i=1}^p λi = ∏_{i=1}^p (1 − vi)/vi = ∏_{i=1}^p θi/(1 − θi)

    |Q2(Q1 + Q2)^{-1}| = ∏_{i=1}^p θi = ∏_{i=1}^p λi/(1 + λi) = ∏_{i=1}^p (1 − vi)

  and others only depend on p (dimension) and v and q (Wishart dfs).

UNSW MATH5855 2021T3 Lecture 8 Slide 14

Common MANOVA test statistics

▶ Λ as above (Wilks’s Lambda)
▶ tr(Q2 Q1^{-1}) = tr(Q1^{-1} Q2) = Σ_{i=1}^p λi (Lawley–Hotelling trace)
▶ max_i λi (Roy’s criterion)
▶ V = tr[Q2(Q1 + Q2)^{-1}] = Σ_{i=1}^p λi/(1 + λi) (Pillai statistic / Pillai’s trace)
▶ No simple forms for exact distributions.
  ▶ We have powerful computers now, however.
▶ Q1 is called the “error matrix” (also denoted by E)
▶ Q2 is the “hypothesis matrix” (also denoted by H)

UNSW MATH5855 2021T3 Lecture 8 Slide 15

▶ Distribution of the statistics depends on
  ▶ p = the number of responses
  ▶ q = νh = degrees of freedom for the hypothesis
  ▶ v = νe = degrees of freedom for the error
▶ Then, compute:
  ▶ s = min(p, q)
  ▶ m = 0.5(|p − q| − 1)
  ▶ n = 0.5(v − p − 1)
  ▶ r = v − 0.5(p − q + 1)
  ▶ u = 0.25(pq − 2)
  ▶ t = √{(p^2 q^2 − 4)/(p^2 + q^2 − 5)} if p^2 + q^2 − 5 > 0, and t = 1 otherwise
▶ Order the eigenvalues of E^{-1}H = Q1^{-1}Q2 as λ1 ≥ λ2 ≥ · · · ≥ λp.

UNSW MATH5855 2021T3 Lecture 8 Slide 16

Then, (exactly if s = 1 or 2; otherwise approximately):

Wilks’s test: Λ = |E|/|E + H| = ∏_{i=1}^p 1/(1 + λi):
    F = {(1 − Λ^{1/t})/Λ^{1/t}} · (rt − 2u)/(pq) ∼ F_{pq, rt−2u} (Rao’s F).

Lawley–Hotelling trace test: U = tr(E^{-1}H) = λ1 + · · · + λp:
    F = 2(sn + 1)U / {s^2(2m + s + 1)} ∼ F_{s(2m+s+1), 2(sn+1)}.

Pillai’s test: V = tr(H(H + E)^{-1}) = λ1/(1 + λ1) + · · · + λp/(1 + λp):
    F = {(2n + s + 1)/(2m + s + 1)} × V/(s − V) ∼ F_{s(2m+s+1), s(2n+s+1)}.

Roy’s maximum root criterion: The test statistic is just the largest eigenvalue λ1.
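A minimal R sketch (simulated data; group sizes and the mean shift are made up for illustration) computing the four statistics from the eigenvalues of E^{-1}H and comparing with what summary.manova() reports:

  set.seed(2)
  g <- gl(3, 20)
  Y <- matrix(rnorm(60 * 2), 60, 2)
  Y[g == "2", 1] <- Y[g == "2", 1] + 1            ## shift group 2 so H0 is false
  fit <- manova(Y ~ g)
  sm <- summary(fit, test = "Wilks")
  H <- sm$SS$g                                     ## hypothesis matrix Q2
  E <- sm$SS$Residuals                             ## error matrix Q1
  lam <- Re(eigen(solve(E) %*% H)$values)          ## eigenvalues of E^{-1} H
  c(Wilks = prod(1 / (1 + lam)), LawleyHotelling = sum(lam),
    Pillai = sum(lam / (1 + lam)), Roy = max(lam))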

UNSW MATH5855 2021T3 Lecture 8 Slide 17

▶ An older and very universal approximation to the distribution of Λ is due to Bartlett (1927):
  ▶ Level of −[νe − (p − νh + 1)/2] log Λ = c(p, νh, M) × level of χ^2_{pνh}, where the constant c(p, νh, M = νe − p + 1) is given in tables.
▶ Such tables are prepared for levels α = 0.10, 0.05, 0.025, etc.

UNSW MATH5855 2021T3 Lecture 8 Slide 18

First Canonical Correlation

▶ When testing the significance of the first canonical correlation,
    E = S22 − S21 S11^{-1} S12,  H = S21 S11^{-1} S12
▶ Wilks’s statistic becomes |S|/(|S11||S22|) (Recall (4.3)!)
▶ The µi^2 were the squared canonical correlations =⇒ µ1^2 was defined as the maximal eigenvalue of S22^{-1}H, that is, it is a solution to |(E + H)^{-1}H − µ1^2 I| = 0
▶ Setting λ1 = µ1^2/(1 − µ1^2),
    |(E + H)^{-1}H − µ1^2 I| = 0 =⇒ |H − µ1^2(E + H)| = 0
    =⇒ |H − {µ1^2/(1 − µ1^2)} E| = 0 =⇒ |E^{-1}H − λ1 I| = 0
  =⇒ λ1 is an eigenvalue of E^{-1}H.
▶ Similarly with the remaining λi = µi^2/(1 − µi^2) values.
▶ Degrees of freedom of E and H?
UNSW MATH5855 2021T3 Lecture 8 Slide 19

8. Multivariate Linear Models and Multivariate ANOVA
8.3 Computations used in the MANOVA tests
8.3.1 Roots distributions
8.3.2 Comparisons

UNSW MATH5855 2021T3 Lecture 8 Slide 20

I Wilks’s lambda is the most popular.
I Convenient.
I Related to the LRT.

I No universally best test.

I Few power analyses available.

UNSW MATH5855 2021T3 Lecture 8 Slide 21

8. Multivariate Linear Models and Multivariate ANOVA
8.1 Univariate linear models and ANOVA
8.2 Multivariate Linear Model and MANOVA
8.3 Computations used in the MANOVA tests
8.4 Software
8.5 Examples
8.6 Additional resources

UNSW MATH5855 2021T3 Lecture 8 Slide 22

SAS PROC GLM, PROC REG

R stats::lm
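A minimal R sketch (simulated data) of fitting a multivariate linear model with stats::lm by supplying a matrix response, then requesting MANOVA-type tests via the mlm method of anova():

  set.seed(5)
  x <- rnorm(40)
  Y <- cbind(y1 = 1 + x + rnorm(40), y2 = 2 - x + rnorm(40))
  mfit <- lm(Y ~ x)
  coef(mfit)                   ## one column of coefficients per response
  anova(mfit, test = "Wilks")  ## MANOVA table with Wilks's Lambda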

UNSW MATH5855 2021T3 Lecture 8 Slide 23

8. Multivariate Linear Models and Multivariate ANOVA
8.1 Univariate linear models and ANOVA
8.2 Multivariate Linear Model and MANOVA
8.3 Computations used in the MANOVA tests
8.4 Software
8.5 Examples
8.6 Additional resources

UNSW MATH5855 2021T3 Lecture 8 Slide 24

Example 8.3.

Multivariate linear modelling of the Fitness dataset.

UNSW MATH5855 2021T3 Lecture 8 Slide 25

8. Multivariate Linear Models and Multivariate ANOVA
8.1 Univariate linear models and ANOVA
8.2 Multivariate Linear Model and MANOVA
8.3 Computations used in the MANOVA tests
8.4 Software
8.5 Examples
8.6 Additional resources

UNSW MATH5855 2021T3 Lecture 8 Slide 26

Additional resources

I JW Ch. 7.

UNSW MATH5855 2021T3 Lecture 8 Slide 27

Lecture 9: Tests of a Covariance Matrix
9.1 Test of Σ = Σ0
9.2 Sphericity test
9.3 General situation
9.4 Software
9.5 Exercises

UNSW MATH5855 2021T3 Lecture 9 Slide 1

9. Tests of a Covariance Matrix
9.1 Test of Σ = Σ0
9.2 Sphericity test
9.3 General situation
9.4 Software
9.5 Exercises

UNSW MATH5855 2021T3 Lecture 9 Slide 2

▶ Sample X1, X2, . . . , Xn from a Np(µ,Σ).
▶ Test H0 : Σ = Σ0 against the alternative H1 : Σ ≠ Σ0.
▶ Let Yi = Σ0^{-1/2} Xi ; then
    Yi i.i.d. ∼ Np(Σ0^{-1/2}µ, Σ0^{-1/2} Σ (Σ0^{-1/2})⊤), which under H0 is Np(Σ0^{-1/2}µ, Ip)
  =⇒ can transform and test just H̄0 : Σ = Ip

UNSW MATH5855 2021T3 Lecture 9 Slide 3

▶ For the LRT: the likelihood function is
    L(x; µ, Σ) = (2π)^{-np/2} |Σ|^{-n/2} exp{−(1/2) Σ_{i=1}^n (xi − µ)⊤Σ^{-1}(xi − µ)}
               = (2π)^{-np/2} |Σ|^{-n/2} exp{−(1/2) tr[Σ^{-1} Σ_{i=1}^n (xi − µ)(xi − µ)⊤]}
▶ Under H̄0 =⇒ µ̂ = x̄
▶ Under H1 =⇒ maximise with respect to both µ and Σ
  ▶ From Section 3.1.2, µ̂ = x̄ and Σ̂ = (1/n) Σ_{i=1}^n (xi − x̄)(xi − x̄)⊤
  and
    Λ = max_µ L(x; µ, Ip) / max_{µ,Σ} L(x; µ, Σ) = e^{−(1/2) tr V} / {|V|^{-n/2} n^{np/2} e^{-np/2}}
  where V = Σ_{i=1}^n (xi − x̄)(xi − x̄)⊤

    −2 log Λ = np log n − n log|V| + tr V − np,   (9.1)

▶ asymptotically ∼ χ^2_{p(p+1)/2}
  ▶ I.e., df = the number of “free” elements in a p × p symmetric matrix.
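A minimal R sketch (simulated data) of the statistic (9.1); it assumes the data have already been transformed by Σ0^{-1/2}, so that H̄0 : Σ = Ip is what is being tested:

  set.seed(3)
  n <- 100; p <- 3
  Y <- matrix(rnorm(n * p), n, p)           ## H0 holds for these data
  V <- crossprod(scale(Y, scale = FALSE))   ## sum of (y_i - ybar)(y_i - ybar)'
  stat <- n * p * log(n) - n * log(det(V)) + sum(diag(V)) - n * p
  pchisq(stat, df = p * (p + 1) / 2, lower.tail = FALSE)   ## asymptotic p-value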

UNSW MATH5855 2021T3 Lecture 9 Slide 4

9. Tests of a Covariance Matrix
9.1 Test of Σ = Σ0
9.2 Sphericity test
9.3 General situation
9.4 Software
9.5 Exercises

UNSW MATH5855 2021T3 Lecture 9 Slide 5

▶ What if Σ is known only up to a constant?
▶ The Section 9.1 transformation =⇒ w.l.o.g. H0 : Σ = σ^2 Ip against a general alternative.
  ▶ the “sphericity test”
▶ LR:
    −2 log Λ = np log(nσ̂^2) − n log|V|
  ▶ σ̂^2 = (1/(np)) Σ_{i=1}^n (xi − x̄)⊤(xi − x̄)
▶ Under H0 =⇒ asymptotically ∼ χ^2_{p(p+1)/2−1} (WHY?!)
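A minimal R sketch (simulated data for which H0 is true by construction) of the sphericity LR statistic above:

  set.seed(6)
  n <- 80; p <- 4
  X <- matrix(rnorm(n * p, sd = 2), n, p)   ## Sigma = 4 I_p, so H0 holds
  Xc <- scale(X, scale = FALSE)
  V <- crossprod(Xc)                         ## sum of (x_i - xbar)(x_i - xbar)'
  sigma2hat <- sum(Xc^2) / (n * p)
  stat <- n * p * log(n * sigma2hat) - n * log(det(V))
  pchisq(stat, df = p * (p + 1) / 2 - 1, lower.tail = FALSE)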

UNSW MATH5855 2021T3 Lecture 9 Slide 6

9. Tests of a Covariance Matrix
9.1 Test of Σ = Σ0
9.2 Sphericity test
9.3 General situation
9.4 Software
9.5 Exercises

UNSW MATH5855 2021T3 Lecture 9 Slide 7

Goal: Given samples from k different multivariate normal populations Np(µi ,Σi ), i = 1, 2, . . . , k, test H0 : Σ1 = · · · = Σk.
▶ Useful for MANOVA and discriminant analysis in particular.
▶ Let
    k be the number of populations;
    p the dimension of the vector;
    n the total sample size, n = n1 + n2 + · · · + nk,
    ni being the sample size for each population.
=⇒ LR test statistic
    −2 log { ∏_{i=1}^k |Σ̂i|^{ni/2} / |Σ̂pooled|^{n/2} }
  ▶ Σ̂i is the MLE sample variance (with denominator ni) of population i,
  ▶ Σ̂pooled = (1/n) Σ_{i=1}^k ni Σ̂i
  ▶ asymptotically ∼ χ^2_{(k−1)p(p+1)/2}
▶ Results in a biased test: there exist µi ,Σi for which the probability of rejecting H0 when it is false is lower than when it is true.

UNSW MATH5855 2021T3 Lecture 9 Slide 8

▶ Further let N = n − k and Ni = ni − 1.
▶ Replace the Σ̂s with the Ss.
▶ Let ρ = 1 − [(Σ_{i=1}^k 1/Ni) − 1/N] (2p^2 + 3p − 1) / {6(p + 1)(k − 1)}
=⇒ Under H0, the modified LR

    −2ρ log { ∏_{i=1}^k |Si|^{Ni/2} / |Spooled|^{N/2} } ,   (9.2)

  ▶ Si is the sample variance (with denominator ni − 1) of population i,
  ▶ Spooled = (1/N) Σ_{i=1}^k Ni Si
  ▶ is asymptotically ∼ χ^2_{(k−1)p(p+1)/2}
▶ Reject H0 when it is high.
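A minimal R sketch of this test as implemented in the heplots package (one of the implementations referred to in Section 9.4 below), illustrated on the built-in iris data; it assumes heplots is installed:

  library(heplots)
  boxM(iris[, 1:4], iris$Species)   ## modified LR (9.2) with its chi-squared p-value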

UNSW MATH5855 2021T3 Lecture 9 Slide 9

▶ For details, see
  Muirhead, R. (1982) Aspects of Multivariate Statistical Theory. Wiley, New York.
▶ The modified LR involves:
  ▶ Replacing ni and n by Ni and N
    ▶ Ni and N are “degrees of freedom”
  ▶ The scaling factor ρ = 1 − [(Σ_{i=1}^k 1/Ni) − 1/N] (2p^2 + 3p − 1)/{6(p + 1)(k − 1)},
    ▶ close to 1 anyway if all ni are large;
    ▶ improves the quality of the asymptotic approximation.
  ▶ An application of a Bartlett correction:
    ▶ asymptotically negligible scalar transformations of the LR statistic;
    ▶ the approach to χ^2 is at the rate O(1/n^2) instead of O(1/n).

UNSW MATH5855 2021T3 Lecture 9 Slide 10

9. Tests of a Covariance Matrix
9.1 Test of Σ = Σ0
9.2 Sphericity test
9.3 General situation
9.4 Software
9.5 Exercises

UNSW MATH5855 2021T3 Lecture 9 Slide 11

SAS: PROC CALIS, PROC DISCRIM (option)

R: heplots::boxM, MVTests::BoxM

The statistic (9.2) is the one that is implemented in software
packages.

UNSW MATH5855 2021T3 Lecture 9 Slide 12

9. Tests of a Covariance Matrix
9.1 Test of Σ = Σ0
9.2 Sphericity test
9.3 General situation
9.4 Software
9.5 Exercises

UNSW MATH5855 2021T3 Lecture 9 Slide 13

Exercise 9.1

Follow the discussion about the sphericity test. Argue that if λ̂i , i = 1, 2, . . . , p, denote the eigenvalues of the empirical covariance matrix S, then

    −2 log Λ = np log { (arithmetic mean of the λ̂i) / (geometric mean of the λ̂i) } .

Of course, the above statistic is asymptotically χ^2_{(p+2)(p−1)/2} distributed under H0, since it only represents the sphericity test in a different form.

UNSW MATH5855 2021T3 Lecture 9 Slide 14

Exercise 9.2

Show that the likelihood ratio test of

H0 : Σ is a diagonal matrix

rejects H0 when −n log|R| is larger than χ^2_{1−α, p(p−1)/2}. (Here R is the empirical correlation matrix, p is the dimension of the multivariate normal and n is the sample size.)
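A minimal R sketch (simulated data for which H0 is true) that simply evaluates the statistic −n log|R| of this exercise; the derivation itself is left as the exercise:

  set.seed(7)
  n <- 60; p <- 5
  X <- matrix(rnorm(n * p), n, p)   ## independent columns, so Sigma is diagonal
  R <- cor(X)
  stat <- -n * log(det(R))
  pchisq(stat, df = p * (p - 1) / 2, lower.tail = FALSE)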

UNSW MATH5855 2021T3 Lecture 9 Slide 15

Lecture 10: Factor Analysis
10.1 ML Estimation
10.2 Hypothesis testing under multivariate normality assumption
10.3 Varimax method of rotating the factors
10.4 Relationship to Principal Component Analysis
10.5 Software
10.6 Examples
10.7 Additional resources

UNSW MATH5855 2021T3 Lecture 10 Slide 1

Let

Yi , i = 1, 2, . . . , n, be independent Np(µ,Σ) variables.
▶ E.g., Yi as the results of a battery of p tests applied to the ith individual.

Fundamental assumption in factor analysis:

Yi = Λfi + ei (10.1)

Λ ∈ Mp,k factor loading matrix (full rank);
fi ∈ Rk (k < p) factor variable. ▶ Components are latent factors. ▶ Usually, fi ∼ N(α, Ik) (i.e., “orthogonal”) but sometimes “oblique” factors used. ei i.i.d Np(θ,Σe) with Σe = diag(σ21, σ 2 2, . . . , σ 2 p). Also the es are independent of the f s. UNSW MATH5855 2021T3 Lecture 10 Slide 2 ▶ Then, µ = Λα+ θ; Σ = ΛΛ⊤ +Σe ▶ I.e., Var(Yir ) = k∑ j=1 λ2rj + σ 2 r = communality + uniqueness Cov(Yir ,Yis) = k∑ j=1 λrjλsj Idea: Describe the covariance relationships among many variables (p “large”) in terms of few (k “small”) underlying, not observable (latent) random quantities (the factors). ▶ I.e., suppose variables can be grouped by their correlations: high (+ or −) correlation within group but low between groups. =⇒ Each group of variables represents a single underlying construct (factor) that is “responsible” for the observed correlations. UNSW MATH5855 2021T3 Lecture 10 Slide 3 Important notes ▶ (10.1) is similar to a LM, but “predictors” fi random and are not observable. ▶ Λ known or estimated =⇒ α̂ = (Λ⊤Λ)−1Λ⊤Ȳ ; θ̂ = Ȳ − Λα̂ =⇒ Only µ, Λ, and σ2i , i = 1, 2, . . . , p unknown parameters. ▶ There is a fundamental indeterminacy even if Var(f ) = Ik : for any orthogonal matrix P ∈ Mk,k , ΛΛ⊤ = ΛP(ΛP)⊤; Λfi = (ΛP)(P⊤fi ). =⇒ Hence replacing Λ by ΛP and fi by P⊤fi leads to the same equations. UNSW MATH5855 2021T3 Lecture 10 Slide 4 10. Factor Analysis 10.1 ML Estimation 10.2 Hypothesis testing under multivariate normality assumption 10.3 Varimax method of rotating the factors 10.4 Relationship to Principal Component Analysis 10.5 Software 10.6 Examples 10.7 Additional resources UNSW MATH5855 2021T3 Lecture 10 Slide 5 ▶ Observing Y1,Y2, . . . ,Yn ∈ Rp, likelihood L(Y ;µ,Λ, σ21, σ 2 2, . . . , σ 2 p) = (2π)−np/2|Σ|−n/2 exp[− 1 2 n∑ i=1 (Yi − µ)⊤Σ−1(Yi − µ)] = (2π)−np/2|Σ|−n/2 exp[− n 2 (tr(Σ−1S) + (Ȳ − µ)⊤Σ−1(Ȳ − µ))] with S = 1 n ∑n i=1(Yi − Ȳ )(Yi − Ȳ ) ⊤ log L(Y ;µ,Λ, σ21, σ 2 2, . . . , σ 2 p) = − np 2 log(2π)− n 2 log(|Σ|)− n 2 [tr(Σ−1S)+(Ȳ−µ)⊤Σ−1(Ȳ−µ))] UNSW MATH5855 2021T3 Lecture 10 Slide 6 ▶ Differentiate w.r.t. µ: ∂ log L ∂µ = nΣ−1(Ȳ − µ) = 0 =⇒ µ̂ = Ȳ ▶ Substitute µ̂ = Ȳ , Σ = ΛΛ⊤ +Σe , negate, and drop constants =⇒ minimise Q = 1 2 log|ΛΛ⊤ +Σe |+ 1 2 tr(ΛΛ⊤ +Σe) −1S ▶ Relevant matrix differentiation rules: ∂ ∂Λ log|ΛΛ⊤ +Σe | = 2(ΛΛ⊤ +Σe)−1Λ (10.2) ∂ ∂A tr(A−1B) = −(A−1BA−1)⊤ (10.3) ▶ (10.3) and the chain rule =⇒ ∂ ∂Λ tr[(ΛΛ⊤ +Σe) −1S] = −2(ΛΛ⊤ +Σe)−1S(ΛΛ⊤ +Σe)−1Λ UNSW MATH5855 2021T3 Lecture 10 Slide 7 ▶ Substitute: ∂ ∂Λ Q = (ΛΛ⊤+Σe) −1Λ− (ΛΛ⊤+Σe)−1S(ΛΛ⊤+Σe)−1Λ = (ΛΛ⊤ +Σe) −1[ΛΛ⊤ +Σe − S](ΛΛ⊤ +Σe)−1Λ = 0 (10.4) ▶ Woodbury Matrix Identity (A+ UCV )−1 = A−1 − A−1U(C−1 + VA−1U)−1VA−1 =⇒ (ΛΛ⊤+Σe) −1 = Σ−1e −Σ −1 e Λ(I +Λ ⊤Σ−1e Λ) −1Λ⊤Σ−1e (10.5) ▶ (10.4) and (10.5) =⇒ [ΛΛ⊤ +Σe − S]Σ−1e Λ{I − (I + Λ ⊤Σ−1e Λ) −1Λ⊤Σ−1e Λ} = 0 (10.6) ▶ Matrix in curly brackets is full rank =⇒ [ΛΛ⊤ +Σe − S]Σ−1e Λ = 0 =⇒ SΣ −1 e Λ = Λ(I + Λ ⊤Σ−1e Λ) ▶ Factor Σ−1e = Σ −1/2 e Σ −1/2 e and premultiply by Σ −1/2 e : (Σ −1/2 e SΣ −1/2 e )Σ −1/2 e Λ = Σ −1/2 e Λ(I + Λ ⊤Σ−1e Λ) (10.7) UNSW MATH5855 2021T3 Lecture 10 Slide 8 ▶ Let us require that Λ⊤Σ−1e Λ to be diagonal. =⇒ (10.7) =⇒ Σ−1/2e Λ has as its columns k eigenvectors of Σ −1/2 e SΣ −1/2 e ▶ Q minimised when these correspond to the largest eigenvalues of Σ −1/2 e SΣ −1/2 e =⇒ Iterative algorithm (due to Lawley): 1. With an initial guess Σ̃e , calculate Σ̃ −1/2 e Λ̃ = eigenvectors of the k largest eigenvalues of Σ̃ −1/2 e SΣ̃ −1/2 e . 2. Λ̃ = Σ̃ 1/2 e (Σ̃ −1/2 e Λ̃). 3. 
Plug Λ̃ in and minimise Σ∗e = argminΣ̃e Q̃(Σ̃e) = 1 2 log|Λ̃Λ̃⊤+Σ̃e |+ 12 tr(Λ̃Λ̃ ⊤+Σ̃e) −1S ▶ Recall, Σ̃e diagonal, so only p variables. 4. Set Σ̃e = Σ ∗ e and repeat from Step 1 until convergence. UNSW MATH5855 2021T3 Lecture 10 Slide 9 10. Factor Analysis 10.1 ML Estimation 10.2 Hypothesis testing under multivariate normality assumption 10.3 Varimax method of rotating the factors 10.4 Relationship to Principal Component Analysis 10.5 Software 10.6 Examples 10.7 Additional resources UNSW MATH5855 2021T3 Lecture 10 Slide 10 ▶ H0 : k factors vs. H1 : ̸= k factors. =⇒ Likelihoods: log L1 = − np 2 log(2π)− n 2 log|S | − np 2 log L0 = − np 2 log(2π)− n 2 log|Σ̂|− n 2 tr(Σ̂−1S), for Σ̂ = Λ̂Λ̂⊤+Σ̂e =⇒ −2 log L0 L1 = n[log|Σ̂| − log|S |+ tr(Σ̂−1S)− p] ▶ Asymptotically χ2 with df = p(p+1) 2 − [pk + p − k(k−1) 2 ] = 1 2 [(p − k)2 − p − k]. Why? UNSW MATH5855 2021T3 Lecture 10 Slide 11 10. Factor Analysis 10.1 ML Estimation 10.2 Hypothesis testing under multivariate normality assumption 10.3 Varimax method of rotating the factors 10.4 Relationship to Principal Component Analysis 10.5 Software 10.6 Examples 10.7 Additional resources UNSW MATH5855 2021T3 Lecture 10 Slide 12 ▶ Recall: Λ̂0 is MLE =⇒ Λ̂ = Λ̂0P for any orthogonal P ∈ Mk,k also MLE. =⇒ Choose P such that Λ̂ has some desirable properties. ▶ Let dr = ∑p i=1 λ 2 ir . ▶ The varimax method of rotating the factors consists in choosing P to maximise Sd = k∑ r=1 { p∑ i=1 (λ2ir − dr p )2} = k∑ r=1 { p∑ i=1 λ4ir − ( ∑p i=1 λ 2 ir ) 2 p } ▶ I.e., for each factor, some loadings large, others small. ▶ Optimise numerically. ▶ Particularly important if loadings obtained by ML, since we chose Λ̂0 s.t. Λ̂ ⊤ 0 Σ −1 e Λ̂0 is diagonal. ▶ Good for computation, bad for interpretation. UNSW MATH5855 2021T3 Lecture 10 Slide 13 10. Factor Analysis 10.1 ML Estimation 10.2 Hypothesis testing under multivariate normality assumption 10.3 Varimax method of rotating the factors 10.4 Relationship to Principal Component Analysis 10.4.1 The principal component solution of the factor model 10.4.2 The Principal Factor Solution 10.5 Software 10.6 Examples 10.7 Additional resources UNSW MATH5855 2021T3 Lecture 10 Slide 14 10. Factor Analysis 10.4 Relationship to Principal Component Analysis 10.4.1 The principal component solution of the factor model 10.4.2 The Principal Factor Solution UNSW MATH5855 2021T3 Lecture 10 Slide 15 ▶ Start with sample variance matrix: S = 1 n n∑ i=1 (Yi − Ȳ )(Yi − Ȳ )⊤ ▶ Can write using all of its p eigenvalues and eigenvectors. ▶ Perfect reconstruction, but as many factors as variables. =⇒ Approximate reconstruction using k highest eigenvalues and their eigenvectors: S ≈ k∑ i=1 τi a⃗i a⃗⊤i = ΛΛ ⊤ ▶ k is the right number of factors =⇒ all communalities have been taken into account =⇒ sii − ∑k j=1 λ 2 ij estimates the uniquenesses =⇒ the principal component solution of the factor model UNSW MATH5855 2021T3 Lecture 10 Slide 16 10. Factor Analysis 10.4 Relationship to Principal Component Analysis 10.4.1 The principal component solution of the factor model 10.4.2 The Principal Factor Solution UNSW MATH5855 2021T3 Lecture 10 Slide 17 ▶ Related, but extraction not on S directly. ▶ Suppose the uniquenesses Σe are known. ▶ Then, S = Sr +Σe =⇒ Λ̂ should satisfy Sr = S − Σe = Λ̂Λ̂⊤ =⇒ Get Λ̂ by PCA on Sr . ▶ If Sr = ∑p i=1 tibib ⊤ i , take k biggest ti s (w.o.l.g. t1, t2, .., tk) and let B = ( b1 b2 · · · bk ) ; ∆ = diag(t1, t2, . . . , tk) =⇒ Λ̂ = B∆1/2 ▶ Can do it also iteratively! 
UNSW MATH5855 2021T3 Lecture 10 Slide 18 Some problems: i) There is no reliable estimate of Σe available. ▶ The most commonly: get correlation matrix R is σ2ei = 1/r ii where r ii is the ith diagonal element of R−1. ii) How to select k? Note: − The methods in Section 10.4.2 not as efficient as ML method. + Can be used when not MVN. ▶ k chosen by combining subject matter knowledge, “reasonableness” of results and by looking at proportion variance explained. UNSW MATH5855 2021T3 Lecture 10 Slide 19 10. Factor Analysis 10.1 ML Estimation 10.2 Hypothesis testing under multivariate normality assumption 10.3 Varimax method of rotating the factors 10.4 Relationship to Principal Component Analysis 10.5 Software 10.6 Examples 10.7 Additional resources UNSW MATH5855 2021T3 Lecture 10 Slide 20 SAS PROC FACTOR: ▶ To extract different numbers of factors, run the procedure once for each number of factors. ▶ Iterative process can lead to “correlations” > 1 =⇒
ultra-Heywood cases;
▶ the Heywood option sets them to one, allowing iterations to continue;
▶ the scree option can be used to produce a plot of the eigenvalues of Σ that is helpful in deciding how many factors to use;
▶ besides method=ml you can use method=principal;
▶ with the ML method option, the AIC and BIC are included.
  ▶ These can be used for model selection (smaller = better).

UNSW MATH5855 2021T3 Lecture 10 Slide 21

R

▶ stats::factanal() is the built-in implementation.
▶ Package psych contains additional functions and utilities, as

well as its own implementation, psych::fa().
▶ Model selection tools as well.

▶ Package nFactors contains utilities for determining the
number of factors (e.g., scree plots).
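A minimal R sketch (simulated two-factor data; the loading matrix below is made up for illustration) of ML factor analysis with varimax rotation via the built-in stats::factanal():

  set.seed(4)
  n <- 200
  f <- matrix(rnorm(n * 2), n, 2)
  Lambda <- rbind(c(0.9, 0), c(0.8, 0.1), c(0.7, 0),
                  c(0, 0.9), c(0.1, 0.8), c(0, 0.7))
  Y <- f %*% t(Lambda) + matrix(rnorm(n * 6, sd = 0.5), n, 6)
  fa <- factanal(Y, factors = 2, rotation = "varimax")
  fa$loadings       ## rotated loading matrix
  fa$uniquenesses   ## estimated uniquenesses sigma_i^2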

UNSW MATH5855 2021T3 Lecture 10 Slide 22

10. Factor Analysis
10.1 ML Estimation
10.2 Hypothesis testing under multivariate normality assumption
10.3 Varimax method of rotating the factors
10.4 Relationship to Principal Component Analysis
10.5 Software
10.6 Examples
10.7 Additional resources

UNSW MATH5855 2021T3 Lecture 10 Slide 23

Example 10.1.

Data about five socioeconomic variables for 12 census tracts in the Los Angeles area. The five variables represent total population, median school years, total unemployment, miscellaneous professional services, and median house value. Use the ML method and varimax rotation.

▶ Try to run the above model with n = 3 factors. The message “WARNING: Too many factors for a unique solution” appears. This is not surprising, as the number of parameters in the model will exceed the number of elements in Σ ((1/2)[(p − k)^2 − p − k] = −2). In this example you can run the procedure for n = 1 and for n = 2 only (do it!) and you will see that n = 2 gives an adequate representation.

▶ Try using psych::fa.parallel() to search for optimal
number of factors.

UNSW MATH5855 2021T3 Lecture 10 Slide 24

10. Factor Analysis
10.1 ML Estimation
10.2 Hypothesis testing under multivariate normality assumption
10.3 Varimax method of rotating the factors
10.4 Relationship to Principal Component Analysis
10.5 Software
10.6 Examples
10.7 Additional resources

UNSW MATH5855 2021T3 Lecture 10 Slide 25

Additional resources

▶ JW Ch. 9.

UNSW MATH5855 2021T3 Lecture 10 Slide 26

Lecture 11: Structural Equation Modelling
11.1 General form of the model
11.2 Estimation
11.3 Model evaluation
11.4 Some particular SEM
11.5 Relationship between exploratory and confirmatory FA
11.6 Software
11.7 Examples

UNSW MATH5855 2021T3 Lecture 11 Slide 1

▶ More general idea: model the covariances rather than
individual observations.
▶ Factor analysis (FA) is only one example
▶ Input factors latent =⇒ no regression.
▶ Our “data” was S and our parameters were σ2i and Λ.

▶ Methods minimise not difference between observed and
predicted individual values but differences between sample
covariances and covariances predicted by the model.

▶ I.e., test

H0 : Σ = Σ(θ) against H1 : Σ ̸= Σ(θ)

▶ Σ has p(p + 1)/2 unknown elements (estimated by S)
▶ Modelled with k = dim(θ) < p(p + 1)/2 parameters. ▶ More generally, can still model means and covariances, or means and covariances and higher moments to a given structure. ▶ Regression analysis with random inputs, simultaneous equations systems, confirmatory factor analysis, canonical correlations, (M)ANOVA = special cases. UNSW MATH5855 2021T3 Lecture 11 Slide 2 ▶ Structural Equation Modelling (SEM) an important tool in economics and behavioural sciences. ▶ Relationships among several variables ▶ directly observed (manifest) ▶ unobserved hypothetical variables (latent) ▶ In structural models, as opposed to functional models, all variables are taken to be random rather than having fixed levels. ▶ Approximate MVN assumed. UNSW MATH5855 2021T3 Lecture 11 Slide 3 11. Structural Equation Modelling 11.1 General form of the model 11.2 Estimation 11.3 Model evaluation 11.4 Some particular SEM 11.5 Relationship between exploratory and confirmatory FA 11.6 Software 11.7 Examples UNSW MATH5855 2021T3 Lecture 11 Slide 4 η = Bη + Γξ + ζ. (11.1) Here, η ∈ Rm vector of output latent variables; ξ ∈ Rn ′ vector of input latent variables; B ∈ Mm,m, Γ ∈ Mm,n′ coefficient matrices; Note: (I − B) is assumed to be nonsingular. ζ ∈ Rm disturbance vector with E ζ = 0. To this modelling equation (11.1) we attach 2 measurement equations: Y = ΛYη + ϵ; (11.2) X = ΛXξ + δ; (11.3) Y ∈ Rp, X ∈ Rq; ΛY ∈ mp×m,ΛX ∈ mq×n′ with ϵ ∈ Rp, δ ∈ Rq zero-mean measurement errors. These errors are assumed to be uncorrelated with ξ and ζ and with each other. UNSW MATH5855 2021T3 Lecture 11 Slide 5 Generative model for X and Y Y X ϵ δ η ξ ζ B Γ ΛX ΛY UNSW MATH5855 2021T3 Lecture 11 Slide 6 ▶ General model (11.1)–(11.2)–(11.3) is called Keesling–Wiley–Jöreskog model. ▶ Input and output latent variables ξ and η are connected by a system of linear equations (11.1) with coefficient matrices B and Γ and an error vector ζ. ▶ The random vectors Y and X represent the observable vectors (measurements). ▶ Implied covariance matrix: let Var(ξ) = Φ; Var(ζ) = Ψ; Var(ϵ) = θϵ; Var(δ) = θδ ▶ Then, Σ = Σ(θ) = ( ΣYY (θ) ΣYX (θ) ΣXY (θ) ΣXX (θ) ) (11.4) ΣYY (θ) = ΛY (I − B)−1(ΓΦΓ⊤ +Ψ)[(I − B)−1]⊤Λ⊤Y + θϵ ΣYX (θ) = ΛY (I − B)−1ΓΦΛ⊤X ΣXY (θ) = ΛXΦΓ ⊤[(I − B)−1]⊤Λ⊤Y ΣXX (θ) = ΛXΦΛ ⊤ X + θδ UNSW MATH5855 2021T3 Lecture 11 Slide 7 11. Structural Equation Modelling 11.1 General form of the model 11.2 Estimation 11.3 Model evaluation 11.4 Some particular SEM 11.5 Relationship between exploratory and confirmatory FA 11.6 Software 11.7 Examples UNSW MATH5855 2021T3 Lecture 11 Slide 8 ▶ MVN =⇒ use MLE ▶ “data” is S = 1 n − 1 n∑ i=1 {( Yi − Ȳ Xi − X̄ )( Yi − Ȳ Xi − X̄ )⊤} =⇒ (n − 1)S ∼ Wp+q(n − 1,Σ) =⇒ Wishart density (up to a constant): log L(S ,Σ(θ)) = constant− n − 1 2 {log|Σ(θ)|+ tr[SΣ−1(θ)]} =⇒ minimise FML(θ) = log|Σ(θ)|+ tr[SΣ−1(θ)]− log|S | − (p + q) (11.5) ▶ FML = 0 under “saturated model” with Σ̂ = S ▶ I.e., a perfect fit is indicated by zero. UNSW MATH5855 2021T3 Lecture 11 Slide 9 11. Structural Equation Modelling 11.1 General form of the model 11.2 Estimation 11.3 Model evaluation 11.4 Some particular SEM 11.5 Relationship between exploratory and confirmatory FA 11.6 Software 11.7 Examples UNSW MATH5855 2021T3 Lecture 11 Slide 10 ▶ MVN =⇒ asymptotic χ2-test. ▶ Under H0 : Σ = Σ(θ) versus H1 : Σ ̸= Σ(θ) =⇒ T = (n − 1)FML(θ̂ML) ∼ χ2 with df = (p+q)(p+q+1) 2 − dim(θ) under H0. Reason: log L0 = log L(S , Σ̂MLE) = log L(S ,Σ(θ̂ML)) = − n − 1 2 {log|Σ̂MLE|+ tr[SΣ̂−1MLE]}+ constant; log L1 = log L(S ,S) = − n − 1 2 {log|S |+ (p + q)}+ constant. 
Then, −2 log L0 L1 = (n − 1){log|Σ̂MLE|+ tr(SΣ̂−1MLE)− log|S | − (p + q)} = (n − 1)FML(θ̂ML). UNSW MATH5855 2021T3 Lecture 11 Slide 11 11. Structural Equation Modelling 11.1 General form of the model 11.2 Estimation 11.3 Model evaluation 11.4 Some particular SEM 11.5 Relationship between exploratory and confirmatory FA 11.6 Software 11.7 Examples UNSW MATH5855 2021T3 Lecture 11 Slide 12 From the general model (11.1)–(11.2)–(11.3), we can obtain following particular models: A) ΛY = Im , ΛX = In′ ; p = m; q = n ′; θϵ = 0 ; θδ = 0 =⇒ Y = BY + ΓX + ζ (the classical econometric model). B) ΛY = Ip , ΛX = Iq =⇒ The measurement error model: ▶ η = Bη + Γξ + ζ ▶ Y = η + ϵ ▶ X = ξ + δ C) Factor Analysis Models: Just take the measurement part X = ΛXξ + δ. UNSW MATH5855 2021T3 Lecture 11 Slide 13 11. Structural Equation Modelling 11.1 General form of the model 11.2 Estimation 11.3 Model evaluation 11.4 Some particular SEM 11.5 Relationship between exploratory and confirmatory FA 11.6 Software 11.7 Examples UNSW MATH5855 2021T3 Lecture 11 Slide 14 EFA: ▶ number of latent variables not determined in advance ▶ measurement errors are assumed uncorrelated CFA: ▶ model is constructed in advance ▶ number of latent variables ξ is set by the analyst ▶ latent variable influences on observed variables specified ▶ some direct effects of latent on observed values are fixed at 0 or other value ▶ measurement errors δ may correlate ▶ covariance of latent variables can be either estimated or set In practice, more blurred: ▶ Researchers using traditional EFA procedures may restrict their analysis to a group of indicators that they believe are influenced by one factor. ▶ Researchers with poorly fitting models in CFA often modify their model in an exploratory way with the goal of improving fit. UNSW MATH5855 2021T3 Lecture 11 Slide 15 11. Structural Equation Modelling 11.1 General form of the model 11.2 Estimation 11.3 Model evaluation 11.4 Some particular SEM 11.5 Relationship between exploratory and confirmatory FA 11.6 Software 11.7 Examples UNSW MATH5855 2021T3 Lecture 11 Slide 16 SAS In SAS, the standard PROC CALIS is used for fitting Structural Equation Models, and it has been significantly upgraded in SAS 9.3. In particular, now you can analyse means and covariance (or even higher order) structures (instead of just covariance structures like in the classical SEM). UNSW MATH5855 2021T3 Lecture 11 Slide 17 R There are two packages for SEM in R: lavaan and sem. sem is an older package, whereas lavaan aims to provide an extensible framework for SEMs and their extensions: ▶ can mimic commercial packages (including those below) ▶ provides convenience functions for specifying simple special cases (such as CFA) but also a more flexible interface for advanced users ▶ mean structures and multiple groups ▶ different estimators and standard errors (including robust) ▶ handling of missing data ▶ linear and nonlinear equality and inequality constraints ▶ categorical data support ▶ multilevel SEMs ▶ package blavaan for Bayesian estimation ▶ etc. UNSW MATH5855 2021T3 Lecture 11 Slide 18 Others ▶ General form of the SEM model given here is only one possible description due to Karl Jöreskog. ▶ First implemented in the software called LISREL (Linear Structural Relationships). ▶ Other equivalent descriptions due to Bentler and Weeks, to McDonald and some other prominent researchers in the field. ▶ The EQS program for PC that deals with the Bentler/Weeks model. 
▶ The latest “hit” in the area is the program MPLUS (M is for Bength Muthén). MPLUS capabilities include: ▶ Exploratory factor analysis ▶ Structural equation modelling ▶ Item response theory analysis ▶ Growth curve modelling ▶ Mixture modelling (latent class analysis) ▶ Longitudinal mixture modelling (hidden Markov, latent transition analysis, latent class growth analysis, growth mixture analysis) ▶ Survival analysis (continuous- and discrete-time) ▶ Multilevel analysis ▶ Bayesian analysis ▶ etc. UNSW MATH5855 2021T3 Lecture 11 Slide 19 11. Structural Equation Modelling 11.1 General form of the model 11.2 Estimation 11.3 Model evaluation 11.4 Some particular SEM 11.5 Relationship between exploratory and confirmatory FA 11.6 Software 11.7 Examples UNSW MATH5855 2021T3 Lecture 11 Slide 20 Example 11.1. Wheaton, Muthen, Alwin, and Summers (1977) Anomie example. UNSW MATH5855 2021T3 Lecture 11 Slide 21 Lecture 12: Discrimination and Classification 12.1 Separation and Classification for two populations 12.2 Classification errors 12.3 Summarising 12.4 Optimal classification rules 12.5 Classification with two multivariate normal populations 12.6 Classification with more than 2 normal populations 12.7 Software 12.8 Examples 12.9 Additional resources 12.10Exercises UNSW MATH5855 2021T3 Lecture 12 Slide 1 12. Discrimination and Classification 12.1 Separation and Classification for two populations 12.2 Classification errors 12.3 Summarising 12.4 Optimal classification rules 12.5 Classification with two multivariate normal populations 12.6 Classification with more than 2 normal populations 12.7 Software 12.8 Examples 12.9 Additional resources 12.10Exercises UNSW MATH5855 2021T3 Lecture 12 Slide 2 Goals: discriminant analysis terminology: separating sets of objects classification theory terminology: allocating new objects to given groups ▶ Discriminant analysis is more exploratory than classification. ▶ In practice, one can lead to the other. ▶ Focus on two populations (classes of objects) first. UNSW MATH5855 2021T3 Lecture 12 Slide 3 Notation ▶ Call two classes by π1 and π2. ▶ Separate based on random vectors X ∈ Rp. ▶ Distribution of X depends on π1 ( =⇒ f1(x)) and π2 ( =⇒ f2(x)). ▶ Observe a learning/training sample: measurements from known classes. Goal: Partition sample space into 2 mutually exclusive regions R1 and R2, such that: ▶ If a new observation falls in R1, it is allocated to π1. ▶ If it falls in R2, it is allocated to π2. UNSW MATH5855 2021T3 Lecture 12 Slide 4 12. Discrimination and Classification 12.1 Separation and Classification for two populations 12.2 Classification errors 12.3 Summarising 12.4 Optimal classification rules 12.5 Classification with two multivariate normal populations 12.6 Classification with more than 2 normal populations 12.7 Software 12.8 Examples 12.9 Additional resources 12.10Exercises UNSW MATH5855 2021T3 Lecture 12 Slide 5 ▶ Always a chance of misclassification, which we want to minimise. ▶ Populations may have different sizes in the first place =⇒ prior probabilities. ▶ Cost also matters: errors can have asymmetric costs. ▶ The conditional probabilities for misclassification: Pr(2|1) = Pr(X ∈ R2|π1) = ∫ R2 f1(x)dx (12.1) Pr(1|2) = Pr(X ∈ R1|π2) = ∫ R1 f2(x)dx (12.2) UNSW MATH5855 2021T3 Lecture 12 Slide 6 12. 
Discrimination and Classification 12.1 Separation and Classification for two populations 12.2 Classification errors 12.3 Summarising 12.4 Optimal classification rules 12.5 Classification with two multivariate normal populations 12.6 Classification with more than 2 normal populations 12.7 Software 12.8 Examples 12.9 Additional resources 12.10Exercises UNSW MATH5855 2021T3 Lecture 12 Slide 7 confusion matrix: a contingency table showing counts of correct classifications and misclassifications from each class to each other class: Predicted class 1 2 Actual class 1 Members of 1 correctly classified Members of 1 misclassified as 2 2 Members of 2 misclassified as 1 Members of 2 correctly classified UNSW MATH5855 2021T3 Lecture 12 Slide 8 Negative/Positive context Predicted class Negative Positive Actual class Negative True Negative (TN) False Positive (FP) Positive False Negative (FN) True Positive (TP) sensitivity (TPR): Pr(Pred. pos.|Act. pos.) = TP TP+FN specificity (TNR): Pr(Pred. neg.|Act. neg.) = TN TN+FP false positive rate (FPR): Pr(Pred. pos.|Act. neg.) = 1− TNR accuracy: TP+TN TP+FP+TN+FN total probability of misclassification (TPM): 1− accuracy precision: Pr(Act. pos.|Pred. pos.) = TP TP+FP negative predictive value: Pr(Act. neg.|Pred. neg.) = TN TN+FN F1 score: 2TP 2TP+FP+FN ▶ For continuous prediction scores, ROC curve (TPR against FPR) for various thresholds. UNSW MATH5855 2021T3 Lecture 12 Slide 9 12. Discrimination and Classification 12.1 Separation and Classification for two populations 12.2 Classification errors 12.3 Summarising 12.4 Optimal classification rules 12.4.1 Rules that minimise the expected cost of misclassification (ECM) 12.4.2 Rules that minimise the total probability of misclassification (TPM) 12.4.3 Bayesian approach 12.5 Classification with two multivariate normal populations 12.6 Classification with more than 2 normal populations 12.7 Software 12.8 Examples 12.9 Additional resources 12.10Exercises UNSW MATH5855 2021T3 Lecture 12 Slide 10 12. Discrimination and Classification 12.4 Optimal classification rules 12.4.1 Rules that minimise the expected cost of misclassification (ECM) 12.4.2 Rules that minimise the total probability of misclassification (TPM) 12.4.3 Bayesian approach UNSW MATH5855 2021T3 Lecture 12 Slide 11 Lemma 12.1. Denote by pi the prior probability of πi , i = 1, 2, p1 + p2 = 1. Then the overall probabilities of incorrectly classifying objects will be: Pr(misclassified as π1) = Pr(1|2)p2 and Pr(misclassified as π2) = Pr(2|1)p1. Further, let c(i |j), i ̸= j , i , j = 1, 2 be the misclassification costs. Then the expected cost of misclassification is ECM = c(2|1) Pr(2|1)p1 + c(1|2) Pr(1|2)p2 (12.3) The regions R1 and R2 that minimise ECM are given by R1 = {x : f1(x) f2(x) ≥ c(1|2) c(2|1) p2 p1 } (12.4) and R2 = {x : f1(x) f2(x) < c(1|2) c(2|1) p2 p1 }. (12.5) UNSW MATH5855 2021T3 Lecture 12 Slide 12 Proof. ▶ ECM = ∫ R1 [c(1|2)p2f2(x)− c(2|1)p1f1(x)]dx + c(2|1)p1 =⇒ minimised if R1 = {x : [c(1|2)p2f2(x)− c(2|1)p1f1(x)] ≤ 0} only. ▶ only ratios involved. ▶ Cost ratios usually easier to elicit than costs. ▶ Your own exercise: suppose that p2 = p1 and/or c(1|2) = c(2|1). What are classification regions like then? UNSW MATH5855 2021T3 Lecture 12 Slide 13 12. 
Discrimination and Classification 12.4 Optimal classification rules 12.4.1 Rules that minimise the expected cost of misclassification (ECM) 12.4.2 Rules that minimise the total probability of misclassification (TPM) 12.4.3 Bayesian approach UNSW MATH5855 2021T3 Lecture 12 Slide 14 ▶ total probability of misclassification (TPM): TPM = p1 ∫ R2 f1(x)dx + p2 ∫ R1 f2(x)dx =⇒ c(1|2) = c(2|1) in Lemma 12.1 UNSW MATH5855 2021T3 Lecture 12 Slide 15 12. Discrimination and Classification 12.4 Optimal classification rules 12.4.1 Rules that minimise the expected cost of misclassification (ECM) 12.4.2 Rules that minimise the total probability of misclassification (TPM) 12.4.3 Bayesian approach UNSW MATH5855 2021T3 Lecture 12 Slide 16 ▶ Allocate a new observation x0 to the population with the larger posterior probability Pr(πi |x0), i = 1, 2. =⇒ Bayes’s formula: Pr(π1|x0) = p1f1(x0) p1f1(x0) + p2f2(x0) , Pr(π2|x0) = p2f2(x0) p1f1(x0) + p2f2(x0) =⇒ Classify x0 as π1 iff Pr(π1|x0) > Pr(π2|x0)
▶ Again a special case of Lemma 12.1 when c(1|2) = c(2|1)

(Why?)

UNSW MATH5855 2021T3 Lecture 12 Slide 17

12. Discrimination and Classification
12.1 Separation and Classification for two populations
12.2 Classification errors
12.3 Summarising
12.4 Optimal classification rules
12.5 Classification with two multivariate normal populations
12.5.1 Case of equal covariance matrices Σ1 = Σ2 = Σ
12.5.2 Case of different covariance matrices (Σ1 ̸= Σ2)
12.5.3 Optimum error rate and Mahalanobis distance

12.6 Classification with more than 2 normal populations
12.7 Software
12.8 Examples
12.9 Additional resources
12.10Exercises

UNSW MATH5855 2021T3 Lecture 12 Slide 18

12. Discrimination and Classification
12.5 Classification with two multivariate normal populations
12.5.1 Case of equal covariance matrices Σ1 = Σ2 = Σ
12.5.2 Case of different covariance matrices (Σ1 ̸= Σ2)
12.5.3 Optimum error rate and Mahalanobis distance

UNSW MATH5855 2021T3 Lecture 12 Slide 20

▶ Assume that π1 and π2 are Np(µ1,Σ) and Np(µ2,Σ). Then

(12.4) =⇒ R1 = {x : exp[−(1/2)(x − µ1)⊤Σ^{-1}(x − µ1) + (1/2)(x − µ2)⊤Σ^{-1}(x − µ2)] ≥ (c(1|2)/c(2|1)) × (p2/p1)}

(12.5) =⇒ R2 = {x : exp[−(1/2)(x − µ1)⊤Σ^{-1}(x − µ1) + (1/2)
(x − µ2)⊤Σ−1(x − µ2)] < c(1|2) c(2|1) × p2 p1 } UNSW MATH5855 2021T3 Lecture 12 Slide 21 Theorem 12.2. Under the above assumptions, the allocation rule that minimises the ECM is given by: 1. allocate x0 to π1 if (µ1−µ2)⊤Σ−1x0− 1 2 (µ1−µ2)⊤Σ−1(µ1+µ2) ≥ log[ c(1|2) c(2|1) × p2 p1 ]. 2. Otherwise, allocate x0 to π2. UNSW MATH5855 2021T3 Lecture 12 Slide 22 ▶ Usually, we don’t know µ1, µ2, and Σ. ▶ Suppose, ▶ n1 and n2 sample sizes ▶ x̄1 and x̄2 their sample mean vectors ▶ S1 and S2 their sample covariance matrices ▶ Assume Σ1 = Σ2 = Σ =⇒ pooled covariance matrix estimator Spooled = (n1−1)S1+(n2−1)S2 n1+n2−2 =⇒ sample classification rule: 1. allocate x0 to π1 if (x̄1−x̄2)⊤S−1pooledx0− 1 2 (x̄1−x̄2)⊤S−1pooled(x̄1+x̄2) ≥ log[ c(1|2) c(2|1) × p2 p1 ] (12.6) 2. Otherwise, allocate x0 to π2. UNSW MATH5855 2021T3 Lecture 12 Slide 23 ▶ Allocation rule based on Fisher’s discriminant function: (x̄1 − x̄2)⊤S−1pooledx0 − 1 2 (x̄1 − x̄2)⊤S−1pooled(x̄1 + x̄2) ▶ Function itself called Fisher’s linear discriminant function. ▶ Only an estimate of the optimal rule. ▶ linear in the new observation x0 UNSW MATH5855 2021T3 Lecture 12 Slide 24 12. Discrimination and Classification 12.5 Classification with two multivariate normal populations 12.5.1 Case of equal covariance matrices Σ1 = Σ2 = Σ 12.5.2 Case of different covariance matrices (Σ1 ̸= Σ2) 12.5.3 Optimum error rate and Mahalanobis distance UNSW MATH5855 2021T3 Lecture 12 Slide 25 Theorem 12.3. ▶ Assume π1 and π2 are Np(µ1,Σ1) and Np(µ2,Σ2). ▶ Same steps as in Theorem 12.2 =⇒ R1 = {x : − 1 2 x⊤(Σ−11 − Σ −1 2 )x + (µ ⊤ 1 Σ −1 1 − µ ⊤ 2 Σ −1 2 )x − k ≥ log[ c(1|2) c(2|1) × p2 p1 ]} R2 = {x : − 1 2 x⊤(Σ−11 − Σ −1 2 )x + (µ ⊤ 1 Σ −1 1 − µ ⊤ 2 Σ −1 2 )x − k < log[ c(1|2) c(2|1) × p2 p1 ]} where k = 1 2 log( |Σ1| |Σ2| ) + 1 2 (µ⊤1 Σ −1 1 µ1 − µ ⊤ 2 Σ −1 2 µ2) ▶ Classification regions now quadratic functions of x0. UNSW MATH5855 2021T3 Lecture 12 Slide 26 ▶ One obtains the following rule: 1. allocate x0 to π1 if − 1 2 x⊤0 (S −1 1 −S −1 2 )x0+(x̄ ⊤ 1 S −1 1 −x̄ ⊤ 2 S −1 2 )x0−k̂ ≥ log[ c(1|2) c(2|1) × p2 p1 ] where k̂ is the empirical analog of k . 2. Allocate x0 to π2 otherwise. UNSW MATH5855 2021T3 Lecture 12 Slide 27 ▶ Σ1 = Σ2 =⇒ quadratic term disappears =⇒ Theorem 12.2 ▶ Theorem 12.3 is more general. ▶ p ≥ 2, quadratic rules may not behave well. ▶ More sensitive to non-normality and differences in covariance matrices =⇒ Transform the data if needed. =⇒ Use cautiously. =⇒ Use tests in Lecture 9 to check if equal variance assumption is valid. UNSW MATH5855 2021T3 Lecture 12 Slide 28 12. 
Discrimination and Classification 12.5 Classification with two multivariate normal populations 12.5.1 Case of equal covariance matrices Σ1 = Σ2 = Σ 12.5.2 Case of different covariance matrices (Σ1 ̸= Σ2) 12.5.3 Optimum error rate and Mahalanobis distance UNSW MATH5855 2021T3 Lecture 12 Slide 29 optimum error rate (OER): smallest TPM attainable for any R1 and R2 ▶ characterises difficulty of the problem ▶ E.g., Given two normal populations with Σ1 = Σ2 = Σ and prior probabilities p1 = p2 = 1 2 , TPM = 1 2 ∫ R2 f1(x)dx + 1 2 ∫ R1 f2(x)dx ▶ OER obtained with R1 = {x : (µ1−µ2)⊤Σ−1x− 1 2 (µ1−µ2)⊤Σ−1(µ1+µ2) ≥ 0} R2 = {x : (µ1−µ2)⊤Σ−1x− 1 2 (µ1−µ2)⊤Σ−1(µ1+µ2) < 0} UNSW MATH5855 2021T3 Lecture 12 Slide 30 ▶ Let Y = (µ1 − µ2)⊤Σ−1X = l⊤X =⇒ Y |i ∼ N1(µiY ,∆2) where µiY = (µ1 − µ2)⊤Σ−1µi ▶ ∆ = √ (µ1 − µ2)⊤Σ−1(µ1 − µ2) = Mahalanobis distance between the two normal populations =⇒ For Φ(·) = standard normal CDF, Pr(2|1) = Pr(Y < 1 2 (µ1 − µ2)⊤Σ−1(µ1 + µ2)) = Pr( Y − µ1Y ∆ < − ∆ 2 ) = Φ(− ∆ 2 ) =⇒ Pr(1|2) = Φ(−∆ 2 ) =⇒ OER = minimum TPM = Φ(−∆ 2 ) ▶ ∆ → ∆̂ = √ (x̄1 − x̄2)⊤S−1pooled(x̄1 − x̄2). UNSW MATH5855 2021T3 Lecture 12 Slide 31 12. Discrimination and Classification 12.1 Separation and Classification for two populations 12.2 Classification errors 12.3 Summarising 12.4 Optimal classification rules 12.5 Classification with two multivariate normal populations 12.6 Classification with more than 2 normal populations 12.7 Software 12.8 Examples 12.9 Additional resources 12.10Exercises UNSW MATH5855 2021T3 Lecture 12 Slide 32 ▶ Generalising to g > 2 groups π1, π2, . . . , πg is straightforward.
▶ Optimal error rate analysis is difficult. It is easy to see that the ECM classification rule with equal misclassification costs now becomes (compare to (12.4) and (12.5)):
  1. Allocate x0 to πk if pk fk(x0) > pi fi(x0) for all i ≠ k.
  ▶ Equivalently, if log pk fk(x0) > log pi fi(x0) for all i ≠ k.
▶ For g normal populations fi(x) ∼ Np(µi ,Σi ), i = 1, 2, . . . , g =⇒
  1. Allocate x0 to πk if
     log pk fk(x0) = log pk − (p/2) log(2π) − (1/2) log|Σk| − (1/2)(x0 − µk)⊤Σk^{-1}(x0 − µk) = max_i log pi fi(x0)
▶ Ignoring the constant (p/2) log(2π) =⇒ the quadratic discriminant score for the ith population:

     dQ_i(x) = −(1/2) log|Σi| − (1/2)(x − µi)⊤Σi^{-1}(x − µi) + log pi   (12.7)

▶ Allocate x to the population with the largest quadratic discriminant score.
▶ Estimate the unknown quantities in (12.7) from data =⇒ the estimated minimum total probability of misclassification rule. (You formulate the precise statement (!))

UNSW MATH5855 2021T3 Lecture 12 Slide 33

▶ If all covariance matrices for the g populations are equal =⇒ simpler:
  ▶ define the linear discriminant score:
      di(x) = µi⊤Σ^{-1}x − (1/2)µi⊤Σ^{-1}µi + log pi .
  ▶ sample version: x̄i instead of µi and
      Spooled = {(n1 − 1)S1 + · · · + (ng − 1)Sg} / (n1 + n2 + · · · + ng − g)
    instead of Σ
  =⇒ d̂i(x) = x̄i⊤ Spooled^{-1} x − (1/2) x̄i⊤ Spooled^{-1} x̄i + log pi
=⇒ Estimated Minimum TPM Rule for Equal Covariance Normal Populations:
  1. Allocate x to πk if d̂k(x) is the largest of the g values d̂i(x), i = 1, 2, . . . , g.

UNSW MATH5855 2021T3 Lecture 12 Slide 34
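A minimal R sketch computing the sample linear discriminant scores d̂i(x) of the previous slide directly from the group means and Spooled, using the built-in iris data with equal priors (illustrative only):

  X <- as.matrix(iris[, 1:4]); grp <- iris$Species
  g <- nlevels(grp); p_i <- rep(1 / g, g)                       ## equal priors
  xbar <- lapply(levels(grp), function(l) colMeans(X[grp == l, , drop = FALSE]))
  Sp <- Reduce(`+`, lapply(levels(grp), function(l)
          (sum(grp == l) - 1) * cov(X[grp == l, , drop = FALSE]))) / (nrow(X) - g)
  Spinv <- solve(Sp)
  dhat <- function(x) sapply(seq_len(g), function(i)
    drop(xbar[[i]] %*% Spinv %*% x - 0.5 * xbar[[i]] %*% Spinv %*% xbar[[i]]) + log(p_i[i]))
  levels(grp)[which.max(dhat(X[1, ]))]   ## allocate the first observation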

12. Discrimination and Classification
12.1 Separation and Classification for two populations
12.2 Classification errors
12.3 Summarising
12.4 Optimal classification rules
12.5 Classification with two multivariate normal populations
12.6 Classification with more than 2 normal populations
12.7 Software
12.8 Examples
12.9 Additional resources
12.10Exercises

UNSW MATH5855 2021T3 Lecture 12 Slide 35

SAS: PROC DISCRIM

R: MASS::lda, MASS::qda

UNSW MATH5855 2021T3 Lecture 12 Slide 36

12. Discrimination and Classification
12.1 Separation and Classification for two populations
12.2 Classification errors
12.3 Summarising
12.4 Optimal classification rules
12.5 Classification with two multivariate normal populations
12.6 Classification with more than 2 normal populations
12.7 Software
12.8 Examples
12.9 Additional resources
12.10Exercises

UNSW MATH5855 2021T3 Lecture 12 Slide 37

Example 12.4.

Linear and quadratic discriminant analysis for the Edgar
Anderson’s Iris data, and using cross-validation to assess classifiers.
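A minimal R sketch of the kind of analysis in Example 12.4, using MASS::lda and MASS::qda with leave-one-out cross-validation (CV = TRUE); the confusion matrices summarise the cross-validated allocations:

  library(MASS)
  lfit <- lda(Species ~ ., data = iris, CV = TRUE)
  table(Predicted = lfit$class, Actual = iris$Species)   ## LDA confusion matrix
  qfit <- qda(Species ~ ., data = iris, CV = TRUE)
  table(Predicted = qfit$class, Actual = iris$Species)   ## QDA confusion matrix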

UNSW MATH5855 2021T3 Lecture 12 Slide 38

12. Discrimination and Classification
12.1 Separation and Classification for two populations
12.2 Classification errors
12.3 Summarising
12.4 Optimal classification rules
12.5 Classification with two multivariate normal populations
12.6 Classification with more than 2 normal populations
12.7 Software
12.8 Examples
12.9 Additional resources
12.10Exercises

UNSW MATH5855 2021T3 Lecture 12 Slide 39

Additional resources

▶ JW Sec. 11.1–11.6.

UNSW MATH5855 2021T3 Lecture 12 Slide 40

12. Discrimination and Classification
12.1 Separation and Classification for two populations
12.2 Classification errors
12.3 Summarising
12.4 Optimal classification rules
12.5 Classification with two multivariate normal populations
12.6 Classification with more than 2 normal populations
12.7 Software
12.8 Examples
12.9 Additional resources
12.10Exercises

UNSW MATH5855 2021T3 Lecture 12 Slide 41

Exercise 12.1

Three bivariate normal populations, labelled i = 1, 2, 3, have the same covariance matrix, given by

    Σ = ( 1    0.5 )
        ( 0.5  1   ) ,

and means µ1 = (1, 1)⊤, µ2 = (1, 0)⊤, µ3 = (0, 1)⊤, respectively.

(a) Suggest a classification rule for an observation x = (x1, x2)⊤ that corresponds to one of the three populations. You may assume equal priors for the three populations and equal misclassification costs.

(b) Classify the following observations to one of the three distributions: (0.2, 0.6)⊤, (2, 0.8)⊤, (0.75, 1)⊤.

(c) Show that in R2, the 3 classification regions are bounded by
straight lines and draw a graph of these three regions.

UNSW MATH5855 2021T3 Lecture 12 Slide 42

Lecture 13: Support Vector Machines
13.1 Introduction and motivation
13.2 Expected versus Empirical Risk minimisation
13.3 Basic idea of SVMs
13.4 Estimation
13.5 Nonlinear SVMs
13.6 Multiple classes
13.7 SVM specification and tuning
13.8 Examples
13.9 Conclusion

UNSW MATH5855 2021T3 Lecture 13 Slide 1

13. Support Vector Machines
13.1 Introduction and motivation
13.2 Expected versus Empirical Risk minimisation
13.3 Basic idea of SVMs
13.4 Estimation
13.5 Nonlinear SVMs
13.6 Multiple classes
13.7 SVM specification and tuning
13.8 Examples
13.9 Conclusion

UNSW MATH5855 2021T3 Lecture 13 Slide 2

▶ Recall Lecture 12: classifying between 2 p-dimensional MVN
populations,
▶ scores are either linear or quadratic
▶ Optimal, but only for MVN.

▶ Non-MVN =⇒ nonlinear/nonquadratic boundaries
▶ Support vector machines (SVM) is a nonlinear technique that

often performs well.

▶ We will formulate it as an empirical risk minimisation problem and solve it under additional restrictions on the allowed (nonlinear) classifier functions.

UNSW MATH5855 2021T3 Lecture 13 Slide 3

13. Support Vector Machines
13.1 Introduction and motivation
13.2 Expected versus Empirical Risk minimisation
13.3 Basic idea of SVMs
13.4 Estimation
13.5 Nonlinear SVMs
13.6 Multiple classes
13.7 SVM specification and tuning
13.8 Examples
13.9 Conclusion

UNSW MATH5855 2021T3 Lecture 13 Slide 4

▶ Let
    Y be the group “indicator”, with values +1 and −1,
    x ∈ Rp the vector based on which we wish to classify the observation.
Wanted: A “best” classifier in a class F of functions f .
▶ f (x) maps x onto +1 or −1
▶ Minimise the expected risk
    R(f ) = E_{X,Y}((1/2)|f (X) − Y|) = ∫ (1/2)|f (x) − y| dP(x, y)
▶ The joint distribution P(x, y) is unknown in practice =⇒ the empirical risk over a training set (xi , yi ), i = 1, 2, . . . , n, is
    R̂(f ) = (1/n) Σ_{i=1}^n (1/2)|f (xi) − yi|
▶ I.e., the “zero-one loss” is given by
    L(x, y) = (1/2)|f (x) − y| .

UNSW MATH5855 2021T3 Lecture 13 Slide 5

▶ Minimising the empirical risk =⇒ find fn = argmin_{f∈F} R̂(f ) as an approximation to fopt = argmin_{f∈F} R(f ).
  ▶ Not the same thing, and the two may be very different.
▶ Vapnik: If F is not too large and n → ∞, an upper bound on their difference holds with probability 1 − η:
    R(f ) ≤ R̂(f ) + ϕ(h/n, (log η)/n)
  h is the Vapnik–Chervonenkis (VC) dimension (i.e., a measure of the complexity of the class F).
  ϕ is monotone increasing in h (at least for large enough sample sizes n).
=⇒ Test error is bounded from above by the sum of the training error and the complexity of the set of models under consideration.
=⇒ Limiting the complexity of the model limits the discrepancy between training and test error.

UNSW MATH5855 2021T3 Lecture 13 Slide 6

▶ A linear classification rule has the form f (x) = sign(w⊤x + b) for some w ∈ Rp and b ∈ R.
▶ For a linear classification rule,
  ▶ ϕ(h/n, (log η)/n) = √{ [h(log(2n/h) + 1) − log(η/4)] / n }
  ▶ VC dimension h = p + 1
▶ Now,
    ∂/∂h [ {h(log(2n/h) + 1) − log(η/4)} / n ] = (1/n) log(2n/h) > 0

(as long as h < 2n). ▶ In general, the VC dimension of a given set of functions is equal to the maximal number of points that can be “shattered”—separated in all possible ways by that set of functions. UNSW MATH5855 2021T3 Lecture 13 Slide 7 ▶ “Richer” function class F =⇒ less training classification error. ▶ Will overfit and not generalise, however. ▶ “More rich” =⇒ higher value of h =⇒ higher ϕ (for large enough n) =⇒ more discrepancy between R(f ) and R̂(f ) ▶ The rest of the lecture focuses on ways to solve (or solve approximately) this minimisation problem for some classes F . ▶ For additional information, see references in the notes. UNSW MATH5855 2021T3 Lecture 13 Slide 8 13. Support Vector Machines 13.1 Introduction and motivation 13.2 Expected versus Empirical Risk minimisation 13.3 Basic idea of SVMs 13.4 Estimation 13.5 Nonlinear SVMs 13.6 Multiple classes 13.7 SVM specification and tuning 13.8 Examples 13.9 Conclusion UNSW MATH5855 2021T3 Lecture 13 Slide 9 Linear Classifiers ▶ A linear classifier is one that given feature vector xnew and weights w , classifies ynew based on the value of w⊤xnew; for example, ŷnew = { +1 if w⊤xnew + b > 0
−1 if w⊤xnew + b < 0 for a threshold −b. ▶ Every element of x , xi , gets a weight wi : Sign of wi determines whether increasing xi pushes the prediction toward yi = −1 or yi = +1. Magnitude of wi determines how strongly. UNSW MATH5855 2021T3 Lecture 13 Slide 10 Hyperplane Interpretation and Linear Separability ▶ We separate +1s from −1s at w⊤x + b = 0. ▶ Points x that satisfy that equation exactly form a line (if d = 2), a plane (if d = 3), or a hyperplane (if d ≥ 3). ▶ Data are linearly separable if a hyperplane that separates them exists: x1 x2 −1 +1 y separating line w −b ∥w ∥ UNSW MATH5855 2021T3 Lecture 13 Slide 11 Maximum Margin ▶ Usually, there are many different hyperplanes which could be used to separate a linearly separable dataset. ▶ The “best” choice can be regarded as the middle of the widest empty strip (or higher dimensional analogue) between the two classes. x1 x2 −1 +1 yw ⊤ x + b = 0 w ⊤ x + b + = 0 w ⊤ x + b − = 0 |b+− b− | ∥w∥ =⇒ We want to make the margin |b+−b−||w | as big as possible. UNSW MATH5855 2021T3 Lecture 13 Slide 12 ▶ The scale of w and b is arbitrary: for arbitrary α ̸= 0, any x that satisfies w⊤x + b = 0 also satisfies (αw)⊤x + (αb) = α(w⊤x + b) = 0, so (w , b) and (αw , αb) define the same plane. =⇒ We fix |b+ − b| = |b− − b| = 1, and only vary w : our “outer” hyperplanes become w⊤x + (b − 1) = 0 w⊤x + (b + 1) = 0 =⇒ A margin of |b+−b−|∥w∥ = 2 ∥w∥ is maximised by minimising ∥w∥. A Linear Support Vector Machine minimises ∥w∥2 subject to separating −1s and +1s. UNSW MATH5855 2021T3 Lecture 13 Slide 13 13. Support Vector Machines 13.1 Introduction and motivation 13.2 Expected versus Empirical Risk minimisation 13.3 Basic idea of SVMs 13.4 Estimation 13.4.1 Linear SVM: Separable Case 13.4.2 Linear SVM: Nonseparable Case 13.5 Nonlinear SVMs 13.6 Multiple classes 13.7 SVM specification and tuning 13.8 Examples 13.9 Conclusion UNSW MATH5855 2021T3 Lecture 13 Slide 14 13. Support Vector Machines 13.4 Estimation 13.4.1 Linear SVM: Separable Case 13.4.2 Linear SVM: Nonseparable Case UNSW MATH5855 2021T3 Lecture 13 Slide 15 Linear SVM: Separable Case ▶ Write w⊤x + (b − 1) = 0 =⇒ w⊤x + b = +1 w⊤x + (b + 1) = 0 =⇒ w⊤x + b = −1 ▶ Recall ŷi = { +1 if w⊤xi + b > 0
−1 if w⊤xi + b < 0 = sign(w⊤xi + b). =⇒ If w⊤x + b = 0 separates −1s and +1s (i.e., yi = ŷi for all i = 1, . . . , n.), yi (w⊤xi + b) ≥ 1 =⇒ A linear SVM learning task for can be expressed as a constrained optimisation problem: argmin w 1 2 ∥w∥2 subject to yi (w⊤xi + b) ≥ 1, i = 1, . . . , n. UNSW MATH5855 2021T3 Lecture 13 Slide 16 Lagrange Multiplier Technique ▶ The objective is quadratic (convex) and the constraints are linear. ▶ This can be solved by Lagrange multipliers. 1. Rewrite the objective function as the Lagrangian: (note the use of αi s instead of λi s): Lag(w , b;α) = 1 2 ∥w∥2 − n∑ i=1 αi [ yi (w⊤xi + b)− 1 ] . 2. As the constraints are inequalities rather than equalities, apply the so-called KKT (Karush–Kuhn–Tucker) conditions: the saddle point (w , b,α) : Lag′(w , b;α) = 0 will be the constrained optimum if αi ≥ 0, i = 1, . . . , n. =⇒ Solve for Lag′(w , b;α) = 0 subject to αi ≥ 0. UNSW MATH5855 2021T3 Lecture 13 Slide 17 3. Set derivatives of Lag with respect to w and b equal to zero: ∂L ∂w = w − n∑ i=1 αiyixi = 0 =⇒ w = n∑ i=1 αiyixi , ∂L ∂b = − n∑ i=1 αiyi = 0 =⇒ n∑ i=1 αiyi = 0. 4. Note, also, that yi (w⊤xi + b)− 1 ≥ 0, i = 1, . . . , n, αi ( yi (w⊤xi + b)− 1 ) = 0, i = 1, . . . , n. =⇒ Each αi must be zero unless yi (w⊤xi + b) = 1, in which case the training instance lies on a corresponding hyperplane and is known as a support vector. UNSW MATH5855 2021T3 Lecture 13 Slide 18 Dual Optimisation Problem ▶ Substituting the expression of w in terms of α and expanding ∥w∥2, we get LagD(α) = n∑ i=1 αi − 1 2 n∑ i=1 n∑ j=1 αiαjyiyjx⊤i xj , to be maximised subject to αi ≥ 0, i = 1, . . . , n n∑ i=1 αiyi = 0. ▶ This is a quadratic programming problem, for which many software tools are available. UNSW MATH5855 2021T3 Lecture 13 Slide 19 13. Support Vector Machines 13.4 Estimation 13.4.1 Linear SVM: Separable Case 13.4.2 Linear SVM: Nonseparable Case UNSW MATH5855 2021T3 Lecture 13 Slide 20 Linear SVM: Nonseparable Case ▶ In many real-world problems, it is not possible to find hyperplanes which perfectly separate the target classes. ▶ The soft margin approach considers a trade-off between margin width and number of training misclassifications. ▶ Slack variables ξi ≥ 0 are included in the constraints yi (w⊤xi + b) ≥ 1− ξi . (13.1) UNSW MATH5855 2021T3 Lecture 13 Slide 21 ▶ Optimisation becomes argmin w ,ξ ( 1 2 ∥w∥2 + C n∑ i=1 ξi ) subject to yi (w⊤xi + b) ≥ 1− ξi , i = 1, . . . , n, for a tuning constant C . ▶ Small C : lots of slack. ▶ Large C : little slack. ▶ C = ∞: hard margin. ▶ Now, yi (w⊤xi + b) ≥ 1− ξi =⇒ ξi ≥ 1− yi (w⊤xi + b). ▶ We want to make ξi as small as possible. =⇒ ξi = max{0, 1− yi (w⊤xi + b)}. UNSW MATH5855 2021T3 Lecture 13 Slide 22 Dual Optimisation Problem ▶ The Lagrangian is now (with additional multipliers µ), Lag(w , b, ξ;α,µ) = 1 2 ∥w∥2 + C n∑ i=1 ξi − n∑ i=1 αi [ yi (w⊤xi + b)− 1 + ξi ] − n∑ i=1 µiξi . UNSW MATH5855 2021T3 Lecture 13 Slide 23 ▶ Now, ∂L ∂w = w − n∑ i=1 αiyixi = 0 =⇒ w = n∑ i=1 αiyixi ∂L ∂b = − n∑ i=1 αiyi = 0 =⇒ n∑ i=1 αiyi = 0 ∂L ∂ξ = C1n −α− µ = 0 =⇒ C − αi − µi = 0, i = 1, . . . , n. with additional KKT conditions for i = 1, . . . , n: αi ≥ 0 µi ≥ 0 αi ( yi (w⊤xi + b)− 1 + ξi ) = 0. UNSW MATH5855 2021T3 Lecture 13 Slide 24 ▶ Substituting into the Lagrangian leads to LagD(α,µ) = n∑ i=1 αi− 1 2 n∑ j=1 n∑ k=1 αjαkyjyk(x ⊤ j xk)+ n∑ i=1 ξi (C−αi−µi ). 
▶ But C − αi − µi = 0, so as long as αi ≤ C , µi ≥ 0 is completely determined by αi , and we get a dual problem argmax α   n∑ i=1 αi − 1 2 n∑ j=1 n∑ k=1 αjαkyjyk(x ⊤ j xk)   subject to n∑ i=1 αiyi = 0 and 0 ≤ αi ≤ C , i = 1, . . . , n. UNSW MATH5855 2021T3 Lecture 13 Slide 25 The consequences Primal: ŷ(x) = sign(w⊤x + b) (13.2) Dual: ŷ(x) = sign{ n∑ j=1 αjyj(x⊤j x) + b} (13.3) ▶ Primal (w) form requires d parameters, while dual (α) form requires n parameters. ▶ If d ≫ n, dual is more efficient. ▶ But, notice that only the xi s closest to the hyperplane matter in determining w , so most of them will have no effect. =⇒ Most αjs in w = ∑n j=1 αjyjxj will be 0! =⇒ Computationally, effective “n” is actually much smaller than the sample size. =⇒ Those xi s that “support” the hyperplane are called support vectors. ▶ Also, notice that the dual form only depends on (x⊤j xk)s. UNSW MATH5855 2021T3 Lecture 13 Slide 26 13. Support Vector Machines 13.1 Introduction and motivation 13.2 Expected versus Empirical Risk minimisation 13.3 Basic idea of SVMs 13.4 Estimation 13.5 Nonlinear SVMs 13.6 Multiple classes 13.7 SVM specification and tuning 13.8 Examples 13.9 Conclusion UNSW MATH5855 2021T3 Lecture 13 Slide 27 Nonlinear SVM ▶ SVM methodology can be modified to create nonlinear decision boundaries. ▶ Consider: x1 x2 UNSW MATH5855 2021T3 Lecture 13 Slide 28 ▶ The technique involves transforming the original x space so that a linear decision boundary can separate instances in the transformed space. ▶ Suppose we augmented our x with squared terms: (x1, x2) → (x1, x2, x21 , x 2 2 ) : x1 0 .0 0 .3 0 .6 0.0 0.2 0.4 0.6 0 .0 0 .2 0 .4 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 x2 x1.2 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.1 0.2 0.3 0.4 0.5 0 .0 0 .3 0 .6 0 .0 0 .2 0 .4 x2.2 =⇒ Nonlinear problem becomes a linear problem. UNSW MATH5855 2021T3 Lecture 13 Slide 29 Kernel Trick ▶ The dual form depends only on dot products x⊤i xj . =⇒ We can specify other kernels k(xi , xj). ▶ E.g., a “kernel” function of the form k(u, v) = (u⊤v + 1)2 can be regarded as a dot product u21v 2 1 + u2v 2 2 + 2u1v1 + 2u2v2 + 1 = (u21 , u 2 2 , √ 2u1, √ 2u2, 1) ⊤(v21 , v 2 2 , √ 2v1, √ 2v2, 1) ▶ In general, kernel functions can be expressed in terms of high dimensional dot products. ▶ Computing dot products via kernel functions is computationally “cheaper” than using transformed attributes directly. UNSW MATH5855 2021T3 Lecture 13 Slide 30 Radial Basis Function ▶ A radial basis function is a function of distance from the origin, or from another fixed point v . ▶ Usually distance is Euclidean, i.e. ∥u − v∥ = √ (u1 − v1)2 + · · ·+ (un − vn)2 ▶ A common form of radial basis function is Gaussian: ϕ(∥u − v∥) = exp ( −γ∥u − v∥2 ) (Maximum of 1 occurs when u = v , decreases towards zero as u moves away from v .) ▶ We can use ϕ(·, ·) as our SVM kernel. UNSW MATH5855 2021T3 Lecture 13 Slide 31 13. Support Vector Machines 13.1 Introduction and motivation 13.2 Expected versus Empirical Risk minimisation 13.3 Basic idea of SVMs 13.4 Estimation 13.5 Nonlinear SVMs 13.6 Multiple classes 13.7 SVM specification and tuning 13.8 Examples 13.9 Conclusion UNSW MATH5855 2021T3 Lecture 13 Slide 32 Multiclass SVMs ▶ The SVM technique can be adapted to handle multiclass problems (K categories) rather than binary classification problems (2 categories): One-against-rest: ▶ Recall that w⊤xi gives us a “score” that we normally compare to b, but we don’t have to. ▶ For each k = 1, . . . ,K fit a separate SVM (i.e., wk and bk) for whether an observation is in k vs. not. 
▶ Predict ŷnew by evaluating w⊤k xnew + bk for each k and taking the biggest one. One-against-one: ▶ For every pair k1, k2 = 1, . . . ,K , fit an SVM for k1 vs. k2. ▶ Requires K (K − 1)/2 binary classifiers. UNSW MATH5855 2021T3 Lecture 13 Slide 33 13. Support Vector Machines 13.1 Introduction and motivation 13.2 Expected versus Empirical Risk minimisation 13.3 Basic idea of SVMs 13.4 Estimation 13.5 Nonlinear SVMs 13.6 Multiple classes 13.7 SVM specification and tuning 13.8 Examples 13.9 Conclusion UNSW MATH5855 2021T3 Lecture 13 Slide 34 User Control ▶ Categorical data can be handled by introducing binary dummy variables to indicate each attribute value. ▶ The user must specify some control parameters, e.g. type of kernel function and cost constant C for slack variables. ▶ The following kernel functions available via the R e1071 package: linear: u⊤v polybomial: (γu⊤v + c0)p radial basis: exp(−γ∥u − v∥2) sigmoid: tanh(−γu⊤v + c0) for constants γ, p, and c0. UNSW MATH5855 2021T3 Lecture 13 Slide 35 13. Support Vector Machines 13.1 Introduction and motivation 13.2 Expected versus Empirical Risk minimisation 13.3 Basic idea of SVMs 13.4 Estimation 13.5 Nonlinear SVMs 13.6 Multiple classes 13.7 SVM specification and tuning 13.8 Examples 13.9 Conclusion UNSW MATH5855 2021T3 Lecture 13 Slide 36 Example 13.1. SVM classification for the Edgar Anderson’s Iris data, and using ROC curves. UNSW MATH5855 2021T3 Lecture 13 Slide 37 13. Support Vector Machines 13.1 Introduction and motivation 13.2 Expected versus Empirical Risk minimisation 13.3 Basic idea of SVMs 13.4 Estimation 13.5 Nonlinear SVMs 13.6 Multiple classes 13.7 SVM specification and tuning 13.8 Examples 13.9 Conclusion UNSW MATH5855 2021T3 Lecture 13 Slide 38 Advantages and Disadvantages + SVM training can be formulated as a convex optimisation problem, with efficient algorithms for finding the global minimum. + SVM involves support vectors rather than the whole training set, so outliers have less effect than for other methods. − Much harder to interpret than model-based classification techniques. − Does not directly provide class probability estimates, although these can be estimated by cross-validation. UNSW MATH5855 2021T3 Lecture 13 Slide 39 Lecture 14: Cluster Analysis 14.1 “Classical” 14.2 Model-based clustering 14.3 Additional resources UNSW MATH5855 2021T3 Lecture 14 Slide 1 Goal Given: An unlabelled sample x1, . . . , xn ∈ Rp. Wanted: A grouping of observations such that more similar observations are placed in the group ▶ I.e., assign to each xi a group index Gi ∈ {1, . . . ,K} s.t. if Gi = Gj , xi and xj are “on average” more similar in some sense than if Gi ̸= Gj . ▶ Let G = (G1, . . . ,Gn)⊤ for conciseness. ▶ Equivalently, partition i = 1, . . . , n into K sets S1, . . . ,SK so that if two points belong to the same set, they are more similar “on average” than if they do not. ▶ Call S = (S1, . . . ,SK ) (collection of sets) for conciseness. ▶ Differs from SVM and discriminant analysis in that no labels are provided in the data. ▶ An example of unsupervised learning. UNSW MATH5855 2021T3 Lecture 14 Slide 2 Approaches “classical”: An algorithm that seeks to put more similar (in some sense) observations into the same cluster hierarchical: Produces a hierarchy of nested clusterings non-hierarchical: Just a single clustering ▶ Often seeks to minimise some objective function model-based: An MLE or Bayesian solution to a mixture model UNSW MATH5855 2021T3 Lecture 14 Slide 3 14. 
Cluster Analysis 14.1 “Classical” 14.1.1 Components 14.1.2 Example: K -means 14.1.3 Extension: K -medioids 14.1.4 Hierarchical clustering 14.1.5 Software 14.1.6 Assessing 14.1.7 Examples 14.2 Model-based clustering 14.3 Additional resources UNSW MATH5855 2021T3 Lecture 14 Slide 4 14. Cluster Analysis 14.1 “Classical” 14.1.1 Components 14.1.2 Example: K -means 14.1.3 Extension: K -medioids 14.1.4 Hierarchical clustering 14.1.5 Software 14.1.6 Assessing 14.1.7 Examples UNSW MATH5855 2021T3 Lecture 14 Slide 5 Specify, proximity measure: some function d(x1, x2) that determines difference between two observations (or a similarity score) e.g., Euclidean: ∥x1 − x2∥ taxicab/Manhattan: ∥x1 − x2∥1 = ∑p j=1|x1j − x2j | Gower: p−1 ∑p j=1 I(x1j ̸= x2j) (xij binary) ▶ Should be substantively meaningful. algorithm choice: an algorithm that minimises within-cluster and maximises between-cluster distances in some sense UNSW MATH5855 2021T3 Lecture 14 Slide 6 14. Cluster Analysis 14.1 “Classical” 14.1.1 Components 14.1.2 Example: K -means 14.1.3 Extension: K -medioids 14.1.4 Hierarchical clustering 14.1.5 Software 14.1.6 Assessing 14.1.7 Examples UNSW MATH5855 2021T3 Lecture 14 Slide 7 ▶ Simple, intuitive algorithm. ▶ For a Euclidean distance, minimise argmin S K∑ k=1 1 2|Sk | ∑ i ,j∈Sk ∥xi − xj∥2 ▶ Equivalent to minimising argmin S K∑ k=1 ∑ i∈Sk ∥xi − x̄Sk∥ 2, x̄Sk = 1 |Sk | ∑ i∈Sk xi UNSW MATH5855 2021T3 Lecture 14 Slide 8 Procedure 1. Randomly assign a cluster index to each element of G (0). 2. Calculate cluster means (centroids): x̄ S (t−1) k = 1 |S (t−1)k | ∑ i∈S(t−1) k xi , k = 1, . . . ,K . 3. Calculate distances of each data point from each mean: dik = ∥xi − x̄S(t−1) k ∥, i = 1, . . . , n, k = 1, . . . ,K . 4. Reassign each point to its nearest mean: G (t) i = argmin k dik . 5. Repeat from Step 2 until G (t) = G (t−1). A toy example is given in lecture. 
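As a concrete illustration of the procedure above, a minimal R sketch using stats::kmeans (listed under Software later); the toy data match the worked example on the following slides, and the settings (seed, nstart) are illustrative only.

## Minimal sketch: K-means on the 5-point toy dataset from the worked example
x <- matrix(c(1, 1,
              2, 2,
              4, 5,
              5, 4,
              4, 4), ncol = 2, byrow = TRUE)
set.seed(1)                      # the result depends on the random initial clustering
km <- kmeans(x, centers = 2, nstart = 10)
km$cluster                       # group index G_i for each observation
km$centers                       # cluster means (centroids)
km$tot.withinss                  # total squared distance of points to their own centroid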
UNSW MATH5855 2021T3 Lecture 14 Slide 9 Data Data: Index V1 V2 1 1 1 2 2 2 3 4 5 4 5 4 5 4 4 1 2 3 4 5 1 2 3 4 5 1 2 3 45 UNSW MATH5855 2021T3 Lecture 14 Slide 10 Iteration 0: Initial clustering (random) Data: Index V1 V2 C 1 1 1 1 2 2 2 2 3 4 5 1 4 5 4 2 5 4 4 1 1 2 3 4 5 1 2 3 4 5 1 2 3 45 UNSW MATH5855 2021T3 Lecture 14 Slide 11 Iteration 1a: Calculate centroids Data: Index V1 V2 C 1 1 1 1 2 2 2 2 3 4 5 1 4 5 4 2 5 4 4 1 Centroids: C V1 V2 1 3.0 3.333333 2 3.5 3.000000 1 2 3 4 5 1 2 3 4 5 1 2 3 45 UNSW MATH5855 2021T3 Lecture 14 Slide 12 Iteration 1b: Update clustering Data: Index V1 V2 C 1 1 1 1 2 2 2 1 3 4 5 1 4 5 4 2 5 4 4 2 Distances to centroid: Index 1 2 1 3.073182 3.201562 2 1.666667 1.802776 3 1.943651 2.061553 4 2.108185 1.802776 5 1.201850 1.118034 1 2 3 4 5 1 2 3 4 5 1 2 3 45 UNSW MATH5855 2021T3 Lecture 14 Slide 13 Iteration 2a: Calculate centroids Data: Index V1 V2 C 1 1 1 1 2 2 2 1 3 4 5 1 4 5 4 2 5 4 4 2 Centroids: C V1 V2 1 2.333333 2.666667 2 4.500000 4.000000 1 2 3 4 5 1 2 3 4 5 1 2 3 45 UNSW MATH5855 2021T3 Lecture 14 Slide 14 Iteration 2b: Update clustering Data: Index V1 V2 C 1 1 1 1 2 2 2 1 3 4 5 2 4 5 4 2 5 4 4 2 Distances to centroid: Index 1 2 1 2.134375 4.609772 2 0.745356 3.201562 3 2.867442 1.118034 4 2.981424 0.500000 5 2.134375 0.500000 1 2 3 4 5 1 2 3 4 5 1 2 3 45 UNSW MATH5855 2021T3 Lecture 14 Slide 15 Iteration 3a: Calculate centroids Data: Index V1 V2 C 1 1 1 1 2 2 2 1 3 4 5 2 4 5 4 2 5 4 4 2 Centroids: C V1 V2 1 1.500000 1.500000 2 4.333333 4.333333 1 2 3 4 5 1 2 3 4 5 1 2 3 45 UNSW MATH5855 2021T3 Lecture 14 Slide 16 Iteration 3b: Update clustering Data: Index V1 V2 C 1 1 1 1 2 2 2 1 3 4 5 2 4 5 4 2 5 4 4 2 Distances to centroid: Index 1 2 1 0.7071068 4.7140452 2 0.7071068 3.2998316 3 4.3011626 0.7453560 4 4.3011626 0.7453560 5 3.5355339 0.4714045 1 2 3 4 5 1 2 3 4 5 1 2 3 45 Clustering unchanged. Finished! UNSW MATH5855 2021T3 Lecture 14 Slide 17 14. Cluster Analysis 14.1 “Classical” 14.1.1 Components 14.1.2 Example: K -means 14.1.3 Extension: K -medioids 14.1.4 Hierarchical clustering 14.1.5 Software 14.1.6 Assessing 14.1.7 Examples UNSW MATH5855 2021T3 Lecture 14 Slide 18 medioid x̃k of cluster k a specific observation that has the closest summed distance to all other observations in Sk : x̃Sk = argminxi ∑ i∈Sk d(xj , xi ) ▶ Method of K -medioids or partitioning around medioids (PAM) minimises absolute distances: argmin S K∑ k=1 ∑ i∈Sk d(xi , x̃Sk ) ▶ Much slower than K -means, but more robust to outliers UNSW MATH5855 2021T3 Lecture 14 Slide 19 Procedure 1. Randomly assign a cluster index to each element of G (0). 2. Calculate cluster medioids: x̃ S (t−1) k = argmin xi ∑ j∈S(t−1) k d(xj , xi ), k = 1, . . . ,K . 3. Calculate distances of each data point from each medioid: dik = d(xi , x̃S(t−1) k ), i = 1, . . . , n, k = 1, . . . ,K . 4. Reassign each point to its nearest medioid: G (t) i = argmin k dik . 5. Repeat from Step 2 until G (t) = G (t−1). UNSW MATH5855 2021T3 Lecture 14 Slide 20 14. Cluster Analysis 14.1 “Classical” 14.1.1 Components 14.1.2 Example: K -means 14.1.3 Extension: K -medioids 14.1.4 Hierarchical clustering 14.1.5 Software 14.1.6 Assessing 14.1.7 Examples UNSW MATH5855 2021T3 Lecture 14 Slide 21 Two approaches: Agglomerative: Combine nearest observations into clusters, nearest clusters into bigger clusters, etc.. ▶ Need to define “distance” between clusters. Divisive: Partition the whole dataset into clusters, clusters into smaller clusters, etc.. ▶ Need to define criterion based on which a cluster is partitioned. 
▶ Typically visualised in a dendrogram. UNSW MATH5855 2021T3 Lecture 14 Slide 22 Distance between clusters Single linkage d(S1, S2) = min{d(xi , xj) : i ∈ S1, j ∈ S2} Complete linkage d(S1, S2) = max{d(xi , xj) : i ∈ S1, j ∈ S2} Average linkage (unweighted) d(S1,S2) = 1 |S1||S2| ∑ i∈S1 ∑ j∈S2 d(xi , xj) Average linkage (weighted) d(S1 ∪ S2,S3) = d(S1,S3)+d(S2,S3) 2 Centroid d(S1,S2) = ∥x̄S1 − x̄S2∥ Ward d(S1,S2) = ∑ i∈S1∪S2∥xi − x̄S1∪S2∥ 2 − ∑ i∈S1∥xi − x̄S1∥ 2 − ∑ i∈S2∥xi − x̄S2∥ 2 = |S1||S2| |S1|+|S2| ∥x̄S1 − x̄S2∥ 2 UNSW MATH5855 2021T3 Lecture 14 Slide 23 Lance–Williams framework Express distance recursively: d(S1 ∪ S2, S3) = α1d(S1,S3) + α2d(S2,S3) + βd(S1, S2) + γ|d(S1,S3)− d(S2, S3)| Then, for, e.g., ▶ Unweighted average linkage: d(S1 ∪ S2,S3) = 1 |S1 ∪ S2||S3| ∑ i∈S1∪S2 ∑ j∈S3 d(xi , xj) = 1 (|S1|+ |S2|)|S3|  ∑ i∈S1 ∑ j∈S3 d(xi , xj) + ∑ i∈S2 ∑ j∈S3 d(xi , xj)   = |S1||S3|d(S1, S3) + |S2||S3|d(S2,S3) (|S1|+ |S2|)|S3| =⇒ α1 = |S1| |S1|+ |S2| , α2 = |S2| |S1|+ |S2| , β = γ = 0. UNSW MATH5855 2021T3 Lecture 14 Slide 24 Ward’s method ▶ Use squared Euclidean distance: d(xi , xj) = ∥xi − xj∥2. ▶ Use α1 = |S1|+ |S3| |S1|+ |S2|+ |S3| , α2 = |S2|+ |S3| |S1|+ |S2|+ |S3| , β = −|S3| |S1|+ |S2|+ |S3| , γ = 0. ▶ Ward’s method joins the groups that will increase the within-group variance least. UNSW MATH5855 2021T3 Lecture 14 Slide 25 14. Cluster Analysis 14.1 “Classical” 14.1.1 Components 14.1.2 Example: K -means 14.1.3 Extension: K -medioids 14.1.4 Hierarchical clustering 14.1.5 Software 14.1.6 Assessing 14.1.7 Examples UNSW MATH5855 2021T3 Lecture 14 Slide 26 SAS: Hierarchical: PROC CLUSTER (PROC TREE to visualise, PROC DISTANCE to preprocess), PROC VARCLUS Non-hierarchical: PROC FASTCLUS, PROC MODECLUS, PROC FASTKNN R: Hierarchical: stats::hclust, cluster::agnes Non-hierarchical: stats::kmeans, cluster::pam ▶ Many others UNSW MATH5855 2021T3 Lecture 14 Slide 27 14. Cluster Analysis 14.1 “Classical” 14.1.1 Components 14.1.2 Example: K -means 14.1.3 Extension: K -medioids 14.1.4 Hierarchical clustering 14.1.5 Software 14.1.6 Assessing 14.1.7 Examples UNSW MATH5855 2021T3 Lecture 14 Slide 28 UNSW MATH5855 2021T3 Lecture 14 Slide 29 Silhouettes ▶ Popular method, inspired by K -medioid clustering. ▶ For each i = 1, . . . , n, let a(i) = 1 |SGi | − 1 ∑ j∈SGi d(xi , xj) b(i) = min k ̸=Gi 1 |Sk | ∑ j∈Sk d(xi , xj). ▶ Then, silhouette of i s(i) = { b(i)−a(i) max(a(i),b(i)) if |SGi | > 1

0 otherwise.

=⇒ I.e., how much closer is i to the rest of its own cluster than to the nearest other cluster?

▶ −1 ≤ s(i) ≤ +1, higher =⇒ better
▶ Mean silhouette n^{−1} ∑_{i=1}^n s(i) measures the quality of the clustering.
UNSW MATH5855 2021T3 Lecture 14 Slide 30
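A minimal sketch (not part of the original slides) computing a(i), b(i), and s(i) directly from the definitions above; the helper name silhouette_by_hand is illustrative, and cluster::silhouette gives the same values.

## Silhouettes from a distance matrix and a clustering vector G
silhouette_by_hand <- function(x, G) {
  D <- as.matrix(dist(x))               # Euclidean distances d(x_i, x_j)
  n <- nrow(D)
  sapply(seq_len(n), function(i) {
    own <- which(G == G[i])             # indices in i's own cluster
    if (length(own) == 1) return(0)     # singleton cluster => s(i) = 0
    a <- sum(D[i, setdiff(own, i)]) / (length(own) - 1)
    b <- min(sapply(setdiff(unique(G), G[i]),
                    function(k) mean(D[i, G == k])))
    (b - a) / max(a, b)
  })
}

## Example: the toy data and a 2-cluster split
x <- matrix(c(1, 1, 2, 2, 4, 5, 5, 4, 4, 4), ncol = 2, byrow = TRUE)
G <- c(1, 1, 2, 2, 2)
s <- silhouette_by_hand(x, G)
mean(s)                                 # mean silhouette: overall clustering quality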

14. Cluster Analysis
14.1 “Classical”
14.1.1 Components
14.1.2 Example: K -means
14.1.3 Extension: K -medioids
14.1.4 Hierarchical clustering
14.1.5 Software
14.1.6 Assessing
14.1.7 Examples

UNSW MATH5855 2021T3 Lecture 14 Slide 31

Example 14.1.

Hierarchical, non-hierarchical clustering and assessment illustrated
on the Edgar Anderson’s Iris data.

UNSW MATH5855 2021T3 Lecture 14 Slide 32
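A sketch of the sort of workflow Example 14.1 refers to (an assumed workflow, not the lecture's own code): hierarchical and non-hierarchical clustering of the iris measurements, assessed by silhouettes.

library(cluster)

x <- scale(iris[, 1:4])                 # standardise the four measurements
d <- dist(x)                            # Euclidean distances

## Hierarchical (agglomerative), Ward's method; cut into K = 3 clusters
hc  <- hclust(d, method = "ward.D2")
Ghc <- cutree(hc, k = 3)
plot(hc)                                # dendrogram

## Non-hierarchical: K-means and K-medioids (PAM)
km <- kmeans(x, centers = 3, nstart = 20)
pm <- pam(x, k = 3)

## Assessment: mean silhouette for each solution
mean(silhouette(Ghc, d)[, "sil_width"])
mean(silhouette(km$cluster, d)[, "sil_width"])
pm$silinfo$avg.width                    # pam stores its own silhouette summary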

14. Cluster Analysis
14.1 “Classical”
14.2 Model-based clustering
14.2.1 Mixture Models
14.2.2 Multivariate normal clusters
14.2.3 Model selection
14.2.4 Software
14.2.5 Examples
14.2.6 Expectation–Maximisation Algorithm

14.3 Additional resources

UNSW MATH5855 2021T3 Lecture 14 Slide 33

14. Cluster Analysis
14.2 Model-based clustering
14.2.1 Mixture Models
14.2.2 Multivariate normal clusters
14.2.3 Model selection
14.2.4 Software
14.2.5 Examples
14.2.6 Expectation–Maximisation Algorithm

UNSW MATH5855 2021T3 Lecture 14 Slide 34

Finite mixture model

finite mixture model: a probability model under which each
observation comes from one of several distributions, but we do
not observe from which one

Typically, given:

K the number of distributions

fk(xi ;θk), k = 1, . . . ,K a collection of density functions on the
support of xi with parameter vectors θk

πk , k = 1, . . . ,K probabilities of an observation coming from
density k (0 ≤ πk ≤ 1, ∑_{k=1}^K πk = 1)

▶ π = (π1, . . . , πK )

Ψ = {θ1, . . . ,θK ,π} collection of mixture model parameters
For each i = 1, . . . , n,

1. Sample Gi ∈ {1, . . . ,K} with Pr(Gi = k ;π) = πk .
2. Sample Xi |Gi ∼ fGi (·;θGi ).
3. Observe Xi , and “forget” Gi .

UNSW MATH5855 2021T3 Lecture 14 Slide 35
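A minimal sketch (with made-up parameter values) of the generative process just described: sample Gi, then Xi | Gi, then "forget" Gi. It also evaluates the mixture density (14.1) and log-likelihood (14.2) defined on the next slide; K = 2 univariate normal components are used purely for illustration.

set.seed(1)
n    <- 500
pi_k <- c(0.3, 0.7)                  # mixing probabilities (sum to 1)
mu   <- c(-2, 2)                     # component means
sd_k <- c(1, 0.5)                    # component standard deviations

G <- sample(1:2, n, replace = TRUE, prob = pi_k)   # latent cluster labels
x <- rnorm(n, mean = mu[G], sd = sd_k[G])          # observed data; G is then discarded

## Mixture density (14.1) and log-likelihood (14.2) at the true parameters
mix_dens <- function(x) pi_k[1] * dnorm(x, mu[1], sd_k[1]) +
                        pi_k[2] * dnorm(x, mu[2], sd_k[2])
loglik <- sum(log(mix_dens(x)))
hist(x, breaks = 40, freq = FALSE); curve(mix_dens, add = TRUE)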

=⇒ mixture density

fXi (xi ; Ψ) = ∑_{k=1}^K πk fk(xi ;θk)   (14.1)

▶ We wish to estimate Ψ from the sample of x = [x1, . . . , xn].
▶ Likelihood

Lx(Ψ) = ∏_{i=1}^n ∑_{k=1}^K πk fk(xi ;θk).   (14.2)

▶ Probability model =⇒ soft clustering possible, can be
embedded in a hierarchy

▶ Likelihood =⇒ model selection

UNSW MATH5855 2021T3 Lecture 14 Slide 36

14. Cluster Analysis
14.2 Model-based clustering
14.2.1 Mixture Models
14.2.2 Multivariate normal clusters
14.2.3 Model selection
14.2.4 Software
14.2.5 Examples
14.2.6 Expectation–Maximisation Algorithm

UNSW MATH5855 2021T3 Lecture 14 Slide 37

fk(xi ;θk) = (2π)^{−p/2} |Σ(θk)|^{−1/2} exp{−(1/2)(xi − µ(θk))⊤ Σ(θk)^{−1} (xi − µ(θk))}

▶ µ(θk) = µk , Σ(θk) a variance model

▶ Important special case

▶ Strictly speaking, we may also have different clusters “share”
elements of θ:

fk(xi ;θ) = (2π)^{−p/2} |Σk(θ)|^{−1/2} exp{−(1/2)(xi − µk(θ))⊤ Σk(θ)^{−1} (xi − µk(θ))}   (14.3)
▶ µk(θ) and Σk(θ) “extract” the appropriate elements from θ.

UNSW MATH5855 2021T3 Lecture 14 Slide 38

Parametrising the variance matrix

▶ Recall, Σ = PΛP⊤, P orthogonal, Λ diagonal and
nonnegative.

▶ Further parametrise
Σ = λPAP⊤

P ∈ Mp,p orthogonal
A ∈ Mp,p diagonal and nonnegative with |A| = 1

R ∋ λ > 0
=⇒ |Σ| = λ^p|P||A||P⊤| = λ^p =⇒ λ is the “spread” of the cluster
=⇒ A = Ip =⇒ Σ = λPAP⊤ = λPP⊤ = λIp =⇒ spherical
=⇒ uncorrelated, equal variances
=⇒ A controls the shape of the ellipsoid (relative scales for the different dimensions)
=⇒ P = Ip =⇒ Σ = λPAP⊤ = λA =⇒ ellipsoidal with axes parallel to the coordinate axes
=⇒ uncorrelated, unequal variances
=⇒ P controls the rotation of the ellipsoid (correlation)

=⇒ Constraining A and P controls the shape and orientation of the cluster.

UNSW MATH5855 2021T3 Lecture 14 Slide 39
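A small numeric sketch (with illustrative values) of the Σ = λPAP⊤ parametrisation above, in p = 2 dimensions, checking that |Σ| = λ^p.

p      <- 2
lambda <- 2                                  # overall "spread"
A      <- diag(c(4, 1/4))                    # shape; note det(A) = 1
theta  <- pi / 6                             # rotation angle
P      <- matrix(c(cos(theta), sin(theta),   # orthogonal rotation matrix
                  -sin(theta), cos(theta)), 2, 2)
Sigma  <- lambda * P %*% A %*% t(P)

det(Sigma)        # equals lambda^p = 4, as claimed above
lambda^p
## A = I => Sigma = lambda*I (spherical); P = I => axis-aligned ellipsoid.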

Variance matrix specification and degrees of freedom

In a mixture of K clusters, estimate:

π1, . . . , πK (K − 1 parameters)
µ1, . . . ,µK (Kp parameters)

λ:

E: λ1 = λ2 = · · · = λK (1 parameter)
V: λks vary (K parameters)

A:

I: A1 = A2 = · · · = AK = Ip (0 parameters)
E: A1 = A2 = · · · = AK (p − 1 parameters)
V: Aks vary (K (p − 1) parameters)

P:

I: P1 = P2 = · · · = PK = Ip (0 parameters)
E: P1 = P2 = · · · = PK (\binom{p}{2} parameters (why?))
V: Pks vary (K\binom{p}{2} parameters)

UNSW MATH5855 2021T3 Lecture 14 Slide 40

Incorporated under the terms of Creative Commons Attribution 3.0 Unported license from Figure 2 of:

Luca Scrucca, Michael Fop, T. Brendan Murphy, and Adrian E. Raftery (2016). mclust 5: Clustering,
Classification and Density Estimation Using Gaussian Finite Mixture Models. The R Journal 8:1,
pages 289–317.

UNSW MATH5855 2021T3 Lecture 14 Slide 41

14. Cluster Analysis
14.2 Model-based clustering
14.2.1 Mixture Models
14.2.2 Multivariate normal clusters
14.2.3 Model selection
14.2.4 Software
14.2.5 Examples
14.2.6 Expectation–Maximisation Algorithm

UNSW MATH5855 2021T3 Lecture 14 Slide 42

Model selection

Need to select:
▶ Number of clusters K

▶ Within-cluster models (e.g., model for Σ)

▶ Number of parameters grows quickly for “XXV” models in
particular.

▶ BIC often used:

BICν = −2 log Lx(Ψ̂) + ν log n

ν the number of parameters estimated

▶ Lower BIC =⇒ better
▶ Some authors use 2 log Lx(Ψ̂) − ν log n, with higher =⇒ better.

▶ Substantive considerations also matter, e.g.,
▶ How many clusters does our research hypothesis predict?
▶ Do we expect correlations between dimensions to vary between

clusters?
UNSW MATH5855 2021T3 Lecture 14 Slide 43

14. Cluster Analysis
14.2 Model-based clustering
14.2.1 Mixture Models
14.2.2 Multivariate normal clusters
14.2.3 Model selection
14.2.4 Software
14.2.5 Examples
14.2.6 Expectation–Maximisation Algorithm

UNSW MATH5855 2021T3 Lecture 14 Slide 44

SAS: PROC MBC

R: package mclust and others

UNSW MATH5855 2021T3 Lecture 14 Slide 45

14. Cluster Analysis
14.2 Model-based clustering
14.2.1 Mixture Models
14.2.2 Multivariate normal clusters
14.2.3 Model selection
14.2.4 Software
14.2.5 Examples
14.2.6 Expectation–Maximisation Algorithm

UNSW MATH5855 2021T3 Lecture 14 Slide 46

Example 14.2.

Model-based clustering and model selection illustrated on the
Edgar Anderson’s Iris data.

UNSW MATH5855 2021T3 Lecture 14 Slide 47
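A sketch of the sort of analysis Example 14.2 refers to (an assumed workflow, not the lecture's own code), using the mclust package: model-based clustering of the iris measurements with the number of clusters and the covariance model chosen by BIC.

library(mclust)

fit <- Mclust(iris[, 1:4], G = 1:5)   # tries K = 1..5 and the covariance models
summary(fit)                          # chosen covariance model, K, log-likelihood
plot(fit, what = "BIC")               # BIC across models and K
head(fit$z)                           # soft clustering: posterior Pr(G_i = k)
table(fit$classification, iris$Species)
## Note: mclust reports BIC as 2*logLik - nu*log(n), so *higher* is better there.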

14. Cluster Analysis
14.2 Model-based clustering
14.2.1 Mixture Models
14.2.2 Multivariate normal clusters
14.2.3 Model selection
14.2.4 Software
14.2.5 Examples
14.2.6 Expectation–Maximisation Algorithm

UNSW MATH5855 2021T3 Lecture 14 Slide 48

E-M Algorithm

▶ Lx(Ψ) is computationally tractable to evaluate, but log Lx(Ψ) does not simplify or decompose:

Lx(Ψ) = ∏_{i=1}^n ∑_{k=1}^K πk fk(xi ;θk).

=⇒ Expectation–Maximisation (EM) algorithm is normally used.
1. Introduce an unobserved (latent) variable Gi , i = 1, . . . , n

giving the cluster membership of i .
2. Suppose that G1, . . . ,Gn are observed; then

Lx,G1,…,Gn(Ψ) = ∏_{i=1}^n πGi fGi(xi ;θGi)

log Lx,G1,…,Gn(Ψ) = ∑_{i=1}^n log πGi + ∑_{i=1}^n log fGi(xi ;θGi).   (14.4)

3. Start with an initial guess Ψ(0).
4. Iterate E-step and M-step to convergence.

UNSW MATH5855 2021T3 Lecture 14 Slide 49

E-step

Let
Q(Ψ|Ψ(t−1)) = EG1,…,Gn|x ;Ψ(t−1)(log Lx ,G1,…,Gn(Ψ)).

Compute

q_{ik}^{(t−1)} = Pr(Gi = k | x ; Ψ^{(t−1)}) = π_k^{(t−1)} fk(xi ;θ_k^{(t−1)}) / ∑_{k′=1}^K π_{k′}^{(t−1)} fk′(xi ;θ_{k′}^{(t−1)}),

i = 1, . . . , n, k = 1, . . . ,K.

Then,

Q(Ψ|Ψ^{(t−1)}) = ∑_{i=1}^n ∑_{k=1}^K q_{ik}^{(t−1)} log πk + ∑_{i=1}^n ∑_{k=1}^K q_{ik}^{(t−1)} log fk(xi ;θk)   (14.5)

UNSW MATH5855 2021T3 Lecture 14 Slide 50

M-step

Find Ψ^{(t)} = argmax_Ψ Q(Ψ|Ψ^{(t−1)}), s.t. ∑_{k=1}^K πk = 1:

Lag(π) = ∑_{i=1}^n ∑_{k=1}^K q_{ik}^{(t−1)} log πk − α(∑_{k=1}^K πk − 1)

Lag′_k(π) = ∑_{i=1}^n q_{ik}^{(t−1)} π_k^{−1} − α = 0 (set) =⇒ πk = ∑_{i=1}^n q_{ik}^{(t−1)} / α

∑_{k=1}^K πk = (1/α) ∑_{k=1}^K ∑_{i=1}^n q_{ik}^{(t−1)} = 1 =⇒ α = ∑_{k=1}^K ∑_{i=1}^n q_{ik}^{(t−1)}

=⇒ π_k^{(t)} = ∑_{i=1}^n q_{ik}^{(t−1)} / ∑_{k=1}^K ∑_{i=1}^n q_{ik}^{(t−1)}

∂Q(Ψ|Ψ^{(t−1)}) / ∂θk = ∑_{i=1}^n q_{ik}^{(t−1)} ∂ log fk(xi ;θk) / ∂θk = 0 (set) =⇒ weighted MLE

UNSW MATH5855 2021T3 Lecture 14 Slide 51
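A compact sketch (illustrative only; univariate normal components rather than one of the mclust covariance models) of the E- and M-steps just derived. The helper name em_mixnorm and the starting values are arbitrary.

em_mixnorm <- function(x, K, iter = 100) {
  n    <- length(x)
  pi_k <- rep(1 / K, K)
  mu   <- sample(x, K)                      # crude initial guess Psi^(0)
  sd_k <- rep(sd(x), K)
  for (t in seq_len(iter)) {
    ## E-step: q_ik = pi_k f_k(x_i) / sum_k' pi_k' f_k'(x_i)
    q <- sapply(1:K, function(k) pi_k[k] * dnorm(x, mu[k], sd_k[k]))
    q <- q / rowSums(q)
    ## M-step: weighted MLE updates
    nk   <- colSums(q)
    pi_k <- nk / n
    mu   <- colSums(q * x) / nk
    sd_k <- sqrt(colSums(q * outer(x, mu, "-")^2) / nk)
  }
  list(pi = pi_k, mu = mu, sd = sd_k, q = q)
}

set.seed(2)
x <- c(rnorm(200, -2, 1), rnorm(300, 2, 0.5))
em_mixnorm(x, K = 2)[c("pi", "mu", "sd")]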

“Sharing” θs

▶ Strictly speaking, when we select one of the “E” models, we
no longer have a separate θk for every fk .
▶ θ ∈ RKp+1 or more contains parameters for all groups (separate

means, distinct variance parameters, etc.)
▶ fk “extracts” those elements of θ that it needs
▶ Ψ = (θ,π)

▶ Inferentially, θ replaces θk in all derivations above. In
particular,

Q(Ψ|Ψ^{(t−1)}) = ∑_{i=1}^n ∑_{k=1}^K q_{ik}^{(t−1)} log πk + ∑_{i=1}^n ∑_{k=1}^K q_{ik}^{(t−1)} log fk(xi ;θ)

=⇒ ∂Q(Ψ|Ψ^{(t−1)}) / ∂θ = ∑_{i=1}^n ∑_{k=1}^K q_{ik}^{(t−1)} ∂ log fk(xi ;θ) / ∂θ = 0 (set)

=⇒ Still weighted MLE, but now joint for all groups, and without
simplification.

UNSW MATH5855 2021T3 Lecture 14 Slide 52

14. Cluster Analysis
14.1 “Classical”
14.2 Model-based clustering
14.3 Additional resources

UNSW MATH5855 2021T3 Lecture 14 Slide 53

Additional resources

▶ JW Sec. 12.1–12.5.

▶ Scrucca, L., Fop, M., Murphy, T. B., & Raftery, A. E. (2016).
mclust 5: clustering, classification and density estimation
using Gaussian finite mixture models. The R Journal, 8(1),
289–317.

UNSW MATH5855 2021T3 Lecture 14 Slide 54

Lecture 15: Copulae
15.1 Formulation
15.2 Common copula types
15.3 Margins, estimation, and simulation
15.4 Software
15.5 Examples
15.6 Exercises

UNSW MATH5855 2021T3 Lecture 15 Slide 1

15. Copulae
15.1 Formulation
15.2 Common copula types
15.3 Margins, estimation, and simulation
15.4 Software
15.5 Examples
15.6 Exercises

UNSW MATH5855 2021T3 Lecture 15 Slide 2

▶ MVN =⇒ independence ≡ no correlation
▶ For other families, more complicated

▶ Under independence, joint cdf is product of marginals.

=⇒ copulae “entangle” marginal distributions to produce a joint
multivariate one

▶ 2-dimensions =⇒ copula is a function C : [0, 1]2 → [0, 1]
with the properties:

i) C (0, u) = C (u, 0) = 0 for all u ∈ [0, 1].
ii) C (u, 1) = C (1, u) = u for all u ∈ [0, 1].
iii) For all pairs (u1, u2), (v1, v2) ∈ [0, 1]× [0, 1] with

u1 ≤ v1, u2 ≤ v2 :

C (v1, v2)− C (v1, u2)− C (u1, v2) + C (u1, u2) ≥ 0.

UNSW MATH5855 2021T3 Lecture 15 Slide 3

Theorem 15.1 (Sklar’s Theorem).

Let F (·, ·) be a joint cdf with marginal cdfs FX1(·) and FX2(·).
Then there exists a copula C (·, ·) with the property

F (x1, x2) = C (FX1(x1),FX2(x2))

for every pair (x1, x2) ∈ R2. When FX1(·) and FX2(·) are
continuous, the above copula is unique. Vice versa, if C (·, ·) is a
copula and FX1(·), FX2(·) are cdfs, then the function
F (x1, x2) = C (FX1(x1),FX2(x2)) is a joint cdf with marginals FX1(·)
and FX2(·).

UNSW MATH5855 2021T3 Lecture 15 Slide 4

Copula density

▶ Take derivatives:

f (x1, x2) = c(FX1(x1),FX2(x2))fX1(x1)fX2(x2) (15.1)

where

c(u, v) = ∂²C (u, v) / ∂u∂v

▶ Contribution to the joint density of X1,X2 comes from two
parts:

c(u, v) = ∂²C (u, v)/∂u∂v: dependence from the copula

fX1(x1)fX2(x2): marginals

▶ Independence =⇒ C (u, v) = Π(u, v) = uv (independence
copula)

UNSW MATH5855 2021T3 Lecture 15 Slide 5

[Figure: perspective and contour plots of the copula C(u,v) and its density c(u,v); Independence copula, dim. d = 2]

UNSW MATH5855 2021T3 Lecture 15 Slide 6

15. Copulae
15.1 Formulation
15.2 Common copula types
15.2.1 Elliptical copulae
15.2.2 Archimedean copulae

15.3 Margins, estimation, and simulation
15.4 Software
15.5 Examples
15.6 Exercises

UNSW MATH5855 2021T3 Lecture 15 Slide 7

15. Copulae
15.2 Common copula types
15.2.1 Elliptical copulae
15.2.2 Archimedean copulae

UNSW MATH5855 2021T3 Lecture 15 Slide 8

Gaussian copula

For p = 2, let

Cρ(u, v) = Φρ(Φ^{−1}(u), Φ^{−1}(v)) = ∫_{−∞}^{Φ^{−1}(u)} ∫_{−∞}^{Φ^{−1}(v)} fρ(x1, x2) dx2 dx1

for

fρ(·, ·) the density of N((0, 0)⊤, (1 ρ; ρ 1))
Φρ(·, ·) its cdf
Φ^{−1}(·) the inverse cdf of N(0, 1)
▶ ρ = 0 =⇒ C0(u, v) = uv
▶ “The formula that killed Wall Street.”

▶ Models tail behaviour poorly: almost no chance of joint
extreme events.

▶ Multivariate t does better: Z ∼ N(0,Σ) and X ∼ χ²_ν
(independently) =⇒ T = Z/√(X/ν).

▶ Var(T ) ̸= Σ!
UNSW MATH5855 2021T3 Lecture 15 Slide 9
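A minimal sketch using the copula package (listed under Software later): the bivariate Gaussian copula with ρ = 0.9, and a t-copula with the same ρ to compare tail behaviour; the parameter values are illustrative.

library(copula)

nc <- normalCopula(param = 0.9, dim = 2)
pCopula(c(0.5, 0.5), nc)     # C_rho(0.5, 0.5)
dCopula(c(0.5, 0.5), nc)     # copula density c_rho(0.5, 0.5)
u <- rCopula(1000, nc)       # uniform margins, Gaussian dependence
plot(u)                      # note the weak dependence in the joint tails

## A t-copula with the same rho but 4 df puts more mass in the joint tails:
tc <- tCopula(param = 0.9, dim = 2, df = 4)
plot(rCopula(1000, tc))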

[Figure: perspective and contour plots of C(u,v) and c(u,v); Normal copula, dim. d = 2, param.: (rho.1 = 0.9)]

UNSW MATH5855 2021T3 Lecture 15 Slide 10

[Figure: perspective and contour plots of C(u,v) and c(u,v); t-copula, dim. d = 2, param.: (rho.1 = 0.9, df = 4.0)]

UNSW MATH5855 2021T3 Lecture 15 Slide 11

15. Copulae
15.2 Common copula types
15.2.1 Elliptical copulae
15.2.2 Archimedean copulae

UNSW MATH5855 2021T3 Lecture 15 Slide 12

Gumbel–Hougaard copula

▶ Much more flexible in modelling dependence in the upper tails.

▶ For dimension p,

C^{GH}_θ(u1, u2, . . . , up) = exp{−[∑_{j=1}^p (− log uj)^θ]^{1/θ}}

▶ θ ∈ [1,∞) governs the strength of the dependence
▶ θ = 1 =⇒ independence
▶ θ → ∞ =⇒ min(u1, . . . , up) (Fréchet–Hoeffding upper

bound copula)

UNSW MATH5855 2021T3 Lecture 15 Slide 13
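A minimal sketch checking the formula above numerically against copula::gumbelCopula; the helper name gh, the evaluation point, and θ are arbitrary.

library(copula)

gh <- function(u, theta) exp(-(sum((-log(u))^theta))^(1/theta))

u     <- c(0.3, 0.6, 0.8)      # a point in [0,1]^3
theta <- 2

gh(u, theta)
pCopula(u, gumbelCopula(param = theta, dim = 3))   # should agree

gh(u, theta = 1)               # theta = 1: independence, equals prod(u)
prod(u)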

[Figure: perspective and contour plots of C(u,v) and c(u,v); Gumbel copula, dim. d = 2, param.: 2]

UNSW MATH5855 2021T3 Lecture 15 Slide 14

Archimedean copulae

▶ characterised by generator ϕ(·):
▶ a continuous, strictly decreasing function from [0, 1] to [0,∞)
▶ ϕ(1) = 0

▶ Then,

C (u1, u2, . . . , up) = ϕ^{−1}(ϕ(u1) + · · · + ϕ(up))

▶ ϕ^{−1}(t) is defined to be 0 if t ∉ ϕ([0, 1])

UNSW MATH5855 2021T3 Lecture 15 Slide 15

Example 15.2.

Show that the Gumbel–Hougaard copula is an Archimedean copula
with the generator ϕ(t) = (− log t)^θ.

UNSW MATH5855 2021T3 Lecture 15 Slide 16

Archimedean copulae advantages and disadvantages

+ simple description of the p-dim dependence by using a
function of one argument only (the generator)

− symmetric in its arguments
▶ Liouville copulae are a non-symmetric extension

UNSW MATH5855 2021T3 Lecture 15 Slide 17

15. Copulae
15.1 Formulation
15.2 Common copula types
15.3 Margins, estimation, and simulation
15.4 Software
15.5 Examples
15.6 Exercises

UNSW MATH5855 2021T3 Lecture 15 Slide 18

Copula margins: parametric

▶ Specification also requires FX1(·), FX2(·), fX1(·), fX2(·)
▶ Can be any univariate continuous distributions

▶ Density (15.1) provides likelihood:

L(ρ,θ1,θ2) = fρ,θ1,θ2(x1, x2)

= cρ(FX1|θ1(x1),FX2|θ2(x2))fX1|θ1(x1)fX2|θ2(x2)

▶ Maximise w.r.t. parameters (ρ,θ1,θ2)
▶ Usually done numerically

UNSW MATH5855 2021T3 Lecture 15 Slide 19
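A sketch of the fully parametric approach above (an assumed workflow): a Gaussian copula with gamma margins, fitted jointly by maximum likelihood via copula::mvdc and copula::fitMvdc. The data are simulated so the example is self-contained; all parameter values are illustrative.

library(copula)

mv <- mvdc(copula  = normalCopula(0.5, dim = 2),
           margins = c("gamma", "gamma"),
           paramMargins = list(list(shape = 1, rate = 1),
                               list(shape = 1, rate = 1)))

set.seed(3)
x <- rMvdc(500, mv)                    # simulated bivariate data

start <- c(1, 1, 1, 1, 0.5)            # (shape1, rate1, shape2, rate2, rho)
fit   <- fitMvdc(x, mv, start = start) # numerical joint ML over margins and copula
summary(fit)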

Copula margins: empirical

▶ Xij , i = 1, 2, j = 1, . . . , n observations

▶ edf: F̂Xi (x) = n^{−1} ∑_{j=1}^n I(Xij ≤ x)

=⇒ F (x1, x2) = C (F̂X1(x1), F̂X2(x2))

Estimation:

▶ Likelihood no longer available
=⇒ Convert each variable into empirical quantiles:

Pij = n/(n + 1) · F̂Xi (Xij)

▶ Pij uniform, but approx. correlation maintained
▶ Tune copula parameters to match observed correlations.

Simulation:

1. Simulate from C (·, ·) and/or c(·, ·) to obtain vector of
quantiles P⋆ = [P1⋆,P2⋆]⊤.

2. Evaluate Xi⋆ = F̂Xi^{−1}(Pi⋆), i = 1, 2.

▶ F̂Xi^{−1}(·) may be smoothed in some way.

UNSW MATH5855 2021T3 Lecture 15 Slide 20
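A sketch of the empirical-margin approach above (an assumed workflow): pseudo-observations via copula::pobs, pseudo-ML fitting with fitCopula, then simulation and back-transformation through the empirical quantile functions. The data-generating step is only there to make the sketch runnable end to end.

library(copula)

set.seed(4)
x <- rMvdc(500, mvdc(normalCopula(0.7), c("gamma", "gamma"),
                     list(list(shape = 2, rate = 1), list(shape = 5, rate = 1))))

P   <- pobs(x)                                   # P_ij = n/(n+1) * edf, as above
fit <- fitCopula(normalCopula(dim = 2), P, method = "mpl")
coef(fit)                                        # estimated rho

## Simulation: draw quantiles from the fitted copula, then invert the edf's
Pstar <- rCopula(1000, normalCopula(coef(fit), dim = 2))
xstar <- cbind(quantile(x[, 1], Pstar[, 1], type = 8),
               quantile(x[, 2], Pstar[, 2], type = 8))
plot(xstar)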

15. Copulae
15.1 Formulation
15.2 Common copula types
15.3 Margins, estimation, and simulation
15.4 Software
15.5 Examples
15.6 Exercises

UNSW MATH5855 2021T3 Lecture 15 Slide 21

SAS: PROC COPULA

R: Packages copula, VineCopula, and others.

UNSW MATH5855 2021T3 Lecture 15 Slide 22

15. Copulae
15.1 Formulation
15.2 Common copula types
15.3 Margins, estimation, and simulation
15.4 Software
15.5 Examples
15.6 Exercises

UNSW MATH5855 2021T3 Lecture 15 Slide 23

Example 15.3.

Microwave Ovens example (with empirical and gamma margins).

UNSW MATH5855 2021T3 Lecture 15 Slide 24

Example 15.4.

Stock and portfolio modelling.

UNSW MATH5855 2021T3 Lecture 15 Slide 25

15. Copulae
15.1 Formulation
15.2 Common copula types
15.3 Margins, estimation, and simulation
15.4 Software
15.5 Examples
15.6 Exercises

UNSW MATH5855 2021T3 Lecture 15 Slide 26

Exercise 15.1

The (p-dimensional) Clayton copula is defined for a given
parameter θ > 0 as

Cθ(u1, u2, . . . , up) = [∑_{i=1}^p ui^{−θ} − p + 1]^{−1/θ}.

Show that it is an Archimedean copula and that its generator is
ϕ(x) = θ^{−1}(x^{−θ} − 1).

UNSW MATH5855 2021T3 Lecture 15 Slide 27
