Lecture 0: Preliminaries
0.1 Matrix algebra
0.2 Standard facts about multivariate distributions
0.3 Additional resources
0.4 Exercises
UNSW MATH5855 2021T3 Lecture 0 Slide 1
0. Preliminaries
0.1 Matrix algebra
0.1.1 Vectors and matrices
0.1.2 Inverse matrices
0.1.3 Rank
0.1.4 Orthogonal matrices
0.1.5 Eigenvalues and eigenvectors
0.1.6 Cholesky Decomposition
0.1.7 Orthogonal Projection
0.2 Standard facts about multivariate distributions
0.3 Additional resources
0.4 Exercises
Matrix operations
X ∈Mp,n: X is a matrix with p rows and n columns
x ∈ Rn: x is a n-dimensional column vector a.k.a. x ∈Mn,1
>: transposition: X ∈ Mp,n =⇒ X> ∈ Mn,p
I a column vector x ∈ Rn =⇒ a row vector x> ∈ M1,n
Scalar operations:
I matrix (vector) multiplied by scalar
I two matrices (vectors) of the same dimension can be added
or subtracted
Euclidean norm ‖x‖ of a vector x = (x1 x2 · · · xp)> ∈ Rp:
‖x‖ = √(∑_{i=1}^p xi²).
Inner product
inner product (a.k.a. scalar product) of x , y ∈ Rp is denoted and
defined in the following way:
〈x , y〉 = x>y = ∑_{i=1}^p xi yi (0.1)
I ‖x‖² = 〈x , x〉
I If θ is the angle between x and y ,
〈x , y〉 = ‖x‖‖y‖ cos(θ) (0.2)
=⇒ Since |cos(θ)| ≤ 1, |〈x , y〉| ≤ ‖x‖‖y‖ (Cauchy–Bunyakovsky–Schwarz Inequality)
orthogonal projection of x on y : (x>y / y>y) y .
Matrix multiplication
matrix multiplication: if X ∈Mp,k and Y ∈Mk,n (#columns ofX
= # rows in Y , a.k.a. conformable) then XY exists:
Z = XY ∈Mp,n with elements
zi,j , i = 1, 2, . . . , p, j = 1, 2, . . . , n: zi,j = ∑_{m=1}^k xi,m ym,j (0.3)
I element in the ith row and jth column of Z is a scalar product
of the ith row of X and the jth column of Y
I not commutative, even if both XY and YX exist
I Important example: if x ∈ Rp, x>x ∈ R, but xx> ∈Mp,p!
I transpose of product = product of transposes, in reverse order:
(XY )> = Y>X> (0.4)
Symmetric and identity matrices
symmetric matrix: a square matrix X ∈Mp,p for which xi ,j = xj ,i ,
i = 1, 2, . . . , p, j = 1, 2, . . . , p holds.
I X> = X
identity matrix: I ∈ Mp,p with entries δij , i , j = 1, 2, . . . , p (i.e. ones
on the diagonal and zeros outside the diagonal)
I if conformable, X I = X and IX = X
Matrix trace
The trace of a square matrix X ∈Mp,p is denoted by
tr(X ) = ∑_{i=1}^p xii . The following properties of traces are easy to
obtain:
i) tr(X + Y ) = tr(X ) + tr(Y )
ii) tr(XY ) = tr(YX )
iii) tr(X−1YX ) = tr(Y )
iv) If a ∈ Rp and X ∈Mp,p then a>Xa = tr(Xaa>)
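These identities are easy to check numerically; a small numpy sanity check (illustration only, not part of the slides), using random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 4))
Y = rng.standard_normal((4, 4))
a = rng.standard_normal(4)

# i) trace is additive
assert np.isclose(np.trace(X + Y), np.trace(X) + np.trace(Y))
# ii) cyclic property
assert np.isclose(np.trace(X @ Y), np.trace(Y @ X))
# iii) similarity invariance (a random Gaussian X is a.s. nonsingular)
assert np.isclose(np.trace(np.linalg.inv(X) @ Y @ X), np.trace(Y))
# iv) quadratic form as a trace
assert np.isclose(a @ X @ a, np.trace(X @ np.outer(a, a)))
```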
Matrix determinant
I For a square matrix X ∈Mp,p a number |X | ≡ det(X )
I Defined as
|X | = ∑ ±x1i x2j · · · xpm
where the summation is over all permutations (i , j , . . . ,m) of
the numbers (1, 2, . . . , p) by taking into account the sign
rule: summands with an even permutation get a (+) whereas
the ones with an odd permutation get a (−) sign.
Matrix determinant calculations
I when p = 1 (scalar) X = a, |X | = a
I when p = 2 then
|X | = x11x22 − x12x21
I when p = 3 then
|X | = x11x22x33 + x12x23x31 + x21x32x13
− x31x22x13 − x11x23x32 − x12x21x33 (0.5)
I recursively, for X ∈ Mp,p,
|X | = ∑_i (−1)^{i+j} xij |Xij | (for any given j)
= ∑_j (−1)^{i+j} xij |Xij | (for any given i)
where Xij is the matrix constructed by deleting the ith row and jth
column of X .
Matrix determinant properties
i) If one row or one column of the matrix contains zeros only,
then the value of the determinant is zero.
ii) |X>| = |X |
iii) If one row (or one column) of the matrix is modified by
multiplying with a scalar c then so is the value of the
determinant.
iv) |cX | = cp|X |
v) If X ,Y ∈Mp,p then |XY | = |X ||Y |
vi) If the matrix X is diagonal (i.e. all non-diagonal elements are
zero) then |X | = ∏_{i=1}^p xii . In particular, the determinant of
the identity matrix is always equal to one.
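The same properties can be verified numerically; a small numpy sketch (illustration only, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
p = 3
X = rng.standard_normal((p, p))
Y = rng.standard_normal((p, p))
c = 2.5

assert np.isclose(np.linalg.det(X.T), np.linalg.det(X))           # ii)
assert np.isclose(np.linalg.det(c * X), c**p * np.linalg.det(X))  # iv)
assert np.isclose(np.linalg.det(X @ Y),
                  np.linalg.det(X) * np.linalg.det(Y))            # v)
assert np.isclose(np.linalg.det(np.diag([1., 2., 3.])), 6.0)      # vi)
```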
Matrix inverse
If
I |X | ≠ 0,
I i.e., X ∈ Mp,p is nonsingular,
then
I an inverse matrix X−1 ∈ Mp,p exists s.t. XX−1 = Ip,p;
I (X−1)ji = (−1)^{i+j} |Xij | / |X |, with |Xij | as before the (i , j)th minor of X .
Matrix inverse properties
i) XX−1 = X−1X = I
ii) (X−1)> = (X>)−1
iii) (XY )−1 = Y−1X−1 when both X and Y are nonsingular
square matrices of the same dimension.
iv) |X−1| = |X |−1
v) If X is diagonal and nonsingular then all its diagonal elements
are nonzero and X−1 is again diagonal with diagonal elements
equal to 1/xii , i = 1, 2, . . . , p.
Linear dependence
A set of vectors x1, x2, . . . , xk ∈ Rn is linearly dependent if there
exist k numbers a1, a2, . . . , ak not all zero such that
a1x1 + a2x2 + · · ·+ akxk = 0 (0.6)
holds.
I Otherwise the vectors are linearly independent.
I For k linearly independent vectors, (0.6) would only be
possible if all numbers a1, a2, . . . , ak were zero.
Matrix rank
row rank: the maximum number of linearly independent row vectors
column rank: the rank of its set of column vectors
I Always equal
I denoted rk(X )
full rank: If X ∈Mp,n and rk(X ) = min(p, n)
I square matrix A ∈Mp,p is full rank if rk(A) = p
I implies that |A| ≠ 0 (Rouché–Capelli theorem, a.k.a.
Kronecker–Capelli theorem)
I Let b ∈ Rp be a given vector. Then the linear equation
system Ax = b has a unique solution x = A−1b ∈ Rp.
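A quick numerical illustration (not part of the slides): for a full-rank square A, `numpy.linalg.solve` gives the unique solution of Ax = b, which agrees with A⁻¹b.

```python
import numpy as np

A = np.array([[2., 1.],
              [1., 3.]])          # det(A) = 5 ≠ 0, so A is full rank
b = np.array([3., 5.])

x = np.linalg.solve(A, b)         # numerically preferable to forming inv(A)
assert np.allclose(A @ x, b)                      # x solves the system
assert np.allclose(x, np.linalg.inv(A) @ b)       # and equals A^{-1} b
```

In practice `solve` is preferred over explicitly inverting A for both speed and numerical accuracy.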
Orthogonal matrix
A square matrix X ∈Mp,p is orthogonal if XX> = X>X = Ip,p
holds. The following properties of orthogonal matrices are obvious:
i) X is of full rank (rk(X ) = p) and X−1 = X>
ii) The name orthogonal of the matrix originates from the fact
that the scalar product of each two different column vectors
equals zero. The same holds for the scalar product of each two
different row vectors of the matrix. The norm of each column
vector (or each row vector) is equal to one. These properties
are equivalent to the definition.
iii) |X | = ±1
Eigenvalues
For any square matrix X ∈Mp,p we can define the characteristic
polynomial equation of degree p,
f (λ) = |X − λI | = 0. (0.7)
I Has exactly p roots.
I Some may be complex and some may coincide.
I Since the coefficients are real, if there is a complex root of
(0.7) then also its complex conjugate must be a root of the
same equation.
I Each of the above p roots is called eigenvalue of the matrix X .
I tr(X ) = ∑_{i=1}^p λi
I |X | = ∏_{i=1}^p λi
Eigenvectors
I For any such eigenvalue λ∗, X − λ∗I is singular.
=⇒ There exists a non-zero vector y ∈ Rp s.t. (X − λ∗I )y = 0.
I An eigenvector of X that corresponds to the eigenvalue λ∗.
I Not unique: µy for any real non-zero µ also an eigenvector for
the same eigenvalue.
Uniqueness of eigenvectors
Sparing some details of the derivation, we shall formulate the
following basic result:
Theorem 0.1.
When the matrix X is real symmetric then all of its p eigenvalues
are real. If the eigenvalues are all different then the p eigenvectors
that correspond to them are orthogonal (and hence form a basis in
Rp). These eigenvectors are also unique (up to the norming
constant µ above). If some of the eigenvalues coincide then the
eigenvectors corresponding to them are not necessarily unique, but
even in this case they can be chosen to be mutually orthogonal.
Spectral decomposition
For each of the p eigenvalues λi , i = 1, 2, . . . , p, of X , denote its
corresponding set of mutually orthogonal eigenvectors of unit
length by ei , i = 1, 2, . . . , p, i.e.
Xei = λiei , i = 1, 2, . . . , p, ‖ei‖ = 1, ei>ej = 0, i ≠ j
holds. Then it can be shown that the following decomposition
(spectral decomposition) of any symmetric matrix X ∈ Mp,p holds:
X = λ1e1e1> + λ2e2e2> + · · ·+ λpepep>. (0.8)
Equivalently, X = PΛP> where Λ = diag(λ1, . . . , λp) is diagonal
and P ∈ Mp,p is an orthogonal matrix whose columns are the p
orthonormal eigenvectors e1, e2, . . . , ep.
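As a numerical sketch (not in the slides), `numpy.linalg.eigh` computes exactly this decomposition for a symmetric matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))
X = A + A.T                            # make X symmetric

lam, P = np.linalg.eigh(X)             # eigenvalues (ascending), orthonormal columns
assert np.allclose(P.T @ P, np.eye(4))            # P is orthogonal
assert np.allclose(X, P @ np.diag(lam) @ P.T)     # X = P Λ P'
# the same thing written as a sum of rank-one terms λ_i e_i e_i'
assert np.allclose(X, sum(l * np.outer(e, e) for l, e in zip(lam, P.T)))
```

`eigh` exploits symmetry, so it is both faster and more stable than the general `eig` here.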
Powers of a matrix: inverse
A symmetric matrix X ∈Mp,p is positive definite if all of its
eigenvalues are positive. (It is called non-negative definite if all
eigenvalues are ≥ 0.) For a symmetric positive definite matrix we
have all λi , i = 1, 2, . . . , p, to be positive in the spectral
decomposition (0.8).
But then
X−1 = (P>)−1Λ−1P−1 = PΛ−1P> = ∑_{i=1}^p (1/λi ) eiei> (0.9)
Powers of a matrix: square root
Moreover we can define the square root of the symmetric
non-negative definite matrix X in a natural way:
X^{1/2} = ∑_{i=1}^p √λi eiei> (0.10)
I makes sense since X^{1/2}X^{1/2} = X
I X^{1/2} also symmetric and non-negative definite
I X^{−1/2} = ∑_{i=1}^p λi^{−1/2} eiei> = PΛ^{−1/2}P>
Example 0.2.
Let X ∈ Mp,p be a symmetric positive definite matrix with
eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λp > 0 and associated eigenvectors of
unit length e1, e2, . . . , ep. Show that
I max_{y≠0} y>Xy/(y>y) = λ1, attained when y = e1;
I min_{y≠0} y>Xy/(y>y) = λp, attained when y = ep.
Let X = PΛP> be the decomposition (0.8) for X . Denote
z = P>y . Note that y ≠ 0 implies z ≠ 0. Thus
y>Xy/(y>y) = y>PΛP>y/(y>y) = z>Λz/(z>z)
= (∑_{i=1}^p λi zi²)/(∑_{i=1}^p zi²) ≤ λ1 (∑_{i=1}^p zi²)/(∑_{i=1}^p zi²) = λ1
If we take y = e1 then, having in mind the structure of the matrix
P, we have z = P>e1 = (1 0 · · · 0)> and for this choice of y
also z>Λz/(z>z) = λ1/1 = λ1. The first part of the exercise is shown.
Similar arguments (just changing the sign of the inequality) apply
to show the second part.
In addition, you can try to show that max_{y≠0, y⊥e1} y>Xy/(y>y) = λ2
holds. How?
Numerical stability
I Computers have finite precision
I around 16 decimal significant figures
I scientific notation =⇒ absolute magnitude has little effect on
precision, different magnitudes produce rounding
I E.g., 1 × 10^18 + 1 × 10^0 = 1,000,000,000,000,000,000 + 1 =
1,000,000,000,000,000,000
I For matrices, the condition number |λ1/λp| (for a pos. def.
matrix) is used to assess the potential error =⇒ higher = worse
Cholesky decomposition
I For a symm. pos.def. matrix X ∈Mp,p, a unique matrix
U ∈Mp,p exists that is:
I upper triangular
I U>U = X
I Many authors use LL> = X for a lower-triangular matrix
instead.
I L ≡ U>
I In SAS/IML, root(x) gives this.
I In R, the function is chol().
I Useful for generating correlated variables.
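A short numpy sketch (not in the slides; numpy's `cholesky` returns the lower-triangular factor L = U>, matching the LL> = X convention) of the decomposition and the "generating correlated variables" use:

```python
import numpy as np

rng = np.random.default_rng(4)
Sigma = np.array([[4.0, 2.4],
                  [2.4, 9.0]])         # target covariance matrix (pos. def.)

L = np.linalg.cholesky(Sigma)          # lower triangular, L @ L.T == Sigma
assert np.allclose(L @ L.T, Sigma)
assert np.allclose(L, np.tril(L))      # indeed lower triangular

# generating correlated variables: if Z has Var(Z) = I, then
# X = L Z has Var(X) = L I L' = Sigma
Z = rng.standard_normal((2, 100_000))  # uncorrelated standard normals
X = L @ Z
print(np.cov(X))                       # ≈ Sigma, up to Monte Carlo error
```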
Orthogonal projection matrix: necessary conditions
I Let L(X ) be space spanned by the columns of the matrix
X ∈Mn,p.
I Project a vector y ∈ Rn onto it with a matrix P ∈ Mn,n (an
orthogonal projector):
I Let z = Py ∈ Rn be the projection.
I z ∈ L(X ) =⇒ projection of z on L(X ) is z itself:
Py = z = Pz = PPy = P2y =⇒
(P − P2)y = 0 =⇒ P2 = P
=⇒ P should be idempotent.
I ∀y : (y − z)>z = 0 =⇒ ∀y : y>(P> − I )Py = 0
=⇒ (P> − I )P = 0 =⇒ P>P = P
=⇒ transposing, P>P = P> =⇒ P = P> =⇒ P is symmetric.
=⇒ The orthogonal projector is a symmetric and idempotent
matrix.
Orthogonal projection matrix: sufficient conditions
I Let P be symmetric and idempotent.
=⇒ For any y ∈ Rn,
z = Py =⇒ Pz = P2y = Py =⇒ P(y − z) = 0 (and also
P>(y − z) = 0 since P = P>)
I Consider L(P) (the space generated by the rows/columns of
P).
I z = Py =⇒ z ∈ L(P)
I P>(y − z) = 0 means that y − z is perpendicular to L(P).
=⇒ Py is the projection of y on L(P).
=⇒ P ∈Mn,n is an orthogonal projection matrix if and only if it is
a symmetric and idempotent matrix.
Orthogonal projection matrix: properties
I If P is an orthogonal projection on a given linear spaceM of
dimension dim(M) then I −P an orthogonal projection on the
orthocomplement ofM.
I rk(P) = dim(M).
I The rank of an orthogonal projector is equal to the sum of its
diagonal elements.
Orthogonal projection matrix: form
I If the matrix X has full rank then the projector onto L(X ) is
PL(X ) = X (X>X )−1X>
I If the matrix X is not of full rank then the generalised inverse
(X>X )− of X>X can be defined instead.
I Not unique
I But X (X>X )−X> is unique
I Is the orthogonal projector on the space L(X ) spanned by the
columns of X when X is not full rank.
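The full-rank case is easy to verify numerically; a small sketch (illustration only), checking that X(X>X)⁻¹X> is symmetric, idempotent, has rank = trace, and leaves residuals orthogonal to L(X):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((6, 2))           # full column rank (almost surely)
P = X @ np.linalg.inv(X.T @ X) @ X.T      # orthogonal projector onto L(X)

assert np.allclose(P, P.T)                # symmetric
assert np.allclose(P @ P, P)              # idempotent
assert np.isclose(np.trace(P), 2)         # rank = trace = dim L(X)

y = rng.standard_normal(6)
z = P @ y                                 # projection of y onto L(X)
assert np.allclose(X.T @ (y - z), 0)      # residual ⟂ columns of X
```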
0. Preliminaries
0.1 Matrix algebra
0.2 Standard facts about multivariate distributions
0.2.1 Random samples in multivariate analysis
0.2.2 Joint, marginal, conditional distributions
0.2.3 Moments
0.2.4 Density transformation formula
0.2.5 Characteristic and moment generating functions
0.3 Additional resources
0.4 Exercises
Random dataset
I Inference depends on variability of statistics.
I Some assumptions are required about the data matrix (1.1).
I n observations of p-variate random vectors =⇒ random
matrix X ∈Mp,n:
X =
X11 X12 · · · X1j · · · X1n
X21 X22 · · · X2j · · · X2n
...
Xi1 Xi2 · · · Xij · · · Xin
...
Xp1 Xp2 · · · Xpj · · · Xpn
 = [X1,X2, . . . ,Xn] (0.11)
I Xi , i = 1, 2, . . . , n assumed independent
Random vector cdf, pmf, and/or density
I Random vector X = (X1 X2 · · · Xp)> ∈ Rp, p ≥ 2 has
joint cdf
FX (x) = P(X1 ≤ x1,X2 ≤ x2, . . . ,Xp ≤ xp) = FX (x1, x2, . . . , xp)
I If discrete the probability mass function
PX (x) = P(X1 = x1,X2 = x2, . . . ,Xp = xp)
I If a density fX (x) = fX (x1, x2, . . . , xp) exists such that
FX (x) = ∫_{−∞}^{x1} · · · ∫_{−∞}^{xp} fX (t)dt1 . . . dtp (0.12)
then X is continuous.
=⇒ fX (x) = ∂^p FX (x)/(∂x1∂x2 · · · ∂xp)
Marginal distribution
I marginal cdf of the first k < p components of the vector X is
P(X1 ≤ x1,X2 ≤ x2, . . . ,Xk ≤ xk)
= P(X1 ≤ x1, . . . ,Xk ≤ xk ,Xk+1 ≤ ∞, . . . ,Xp ≤ ∞)
= FX (x1, x2, . . . , xk ,∞,∞, . . . ,∞) (0.13)
I the marginal density, obtained by partial differentiation in (0.13), is
∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} fX (x1, x2, . . . , xp)dxk+1 . . . dxp
I Similarly for any other set of components.
I Each component Xi has marginal cdf FXi (xi ), i = 1, 2, . . . , p.
Conditional distribution
I conditional density of X given Xr+1 = xr+1, . . . ,Xp = xp is
f(X1,...,Xr |Xr+1,...,Xp)(x1, . . . , xr |xr+1, . . . , xp) = fX (x)/fXr+1,...,Xp (xr+1, . . . , xp) (0.14)
I joint density of X1, . . . ,Xr when Xr+1 = xr+1, . . . ,Xp = xp
I only defined when fXr+1,...,Xp (xr+1, . . . , xp) ≠ 0
Independence
I If X has p independent components =⇒
FX (x) = FX1(x1)FX2(x2) · · ·FXp (xp) (0.15)
I Equivalently,
PX (x) = PX1(x1)PX2(x2) · · ·PXp (xp)
fX (x) = fX1(x1)fX2(x2) · · · fXp (xp) (0.16)
I Conditional distributions do not depend on the conditions
I Functions factorise: FX (x) = ∏_{i=1}^p FXi (xi ), fX (x) = ∏_{i=1}^p fXi (xi )
Moments
I For density fX (x), joint moments of order s1, s2, . . . , sp are
E(X1^{s1} · · ·Xp^{sp}) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} x1^{s1} · · · xp^{sp} fX (x1, . . . , xp)dx1 . . . dxp (0.17)
I some si = 0 =⇒ calculating the joint moment of a subset of
the random variables
Common multivariate moments
Now, let X ∈ Rp and Y ∈ Rq with densities as above. The
following moments are commonly used:
Expectation: µX = E(X ) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} x fX (x1, . . . , xp)dx1 . . . dxp ∈ Rp.
Variance–covariance matrix: (a.k.a. variance or covariance matrix)
ΣX = Var(X ) = Cov(X ) = E(X − µX )(X − µX )> = E XX> − µXµX>
=
σ11 σ12 · · · σ1p
σ21 σ22 · · · σ2p
...
σp1 σp2 · · · σpp
 ∈ Mp,p.
Covariance matrix:
ΣX ,Y = Cov(X ,Y ) = E(X − µX )(Y − µY )> = E XY> − µXµY>
=
σX1Y1 σX1Y2 · · · σX1Yq
σX2Y1 σX2Y2 · · · σX2Yq
...
σXpY1 σXpY2 · · · σXpYq
 ∈ Mp,q.
Linear transformations of moments
Let A ∈ Mp′,p and B ∈ Mq′,q be fixed and known. Then,
I µAX = AµX ∈ Rp′
I ΣAX = AΣXA> ∈ Mp′,p′
I ΣAX ,BY = AΣX ,YB> ∈ Mp′,q′
As a corollary, if X ′, Y ′, A′ and B ′ are variables and matrices with
the same dimensions as the originals (but possibly different
distributions and values),
I E(AX + A′X ′) = AµX + A′µX ′
I Var(AX + A′X ′) = AΣXA> + AΣX ,X ′(A′)> + A′ΣX ′,XA> + A′ΣX ′(A′)>
I Cov(AX + A′X ′,BY + B ′Y ′) = AΣX ,YB> + AΣX ,Y ′(B ′)> + A′ΣX ′,YB> + A′ΣX ′,Y ′(B ′)>
These identities are also useful when p = p′ = q = q′ = 1 (i.e.,
scalars).
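These rules can be checked by simulation; a Monte Carlo sketch (not in the slides, matrices chosen arbitrarily) comparing the sample mean and covariance of Y = AX with Aµ and AΣA>:

```python
import numpy as np

rng = np.random.default_rng(6)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])      # positive definite
A = np.array([[1.0, 2.0, 0.0],
              [0.0, -1.0, 1.0]])         # A ∈ M_{2,3}

# columns of X are independent draws of a 3-variate random vector
X = rng.multivariate_normal(mu, Sigma, size=200_000).T
Y = A @ X

print(Y.mean(axis=1))   # ≈ A @ mu
print(np.cov(Y))        # ≈ A @ Sigma @ A.T
```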
Density transformation
I p existing random variables X1,X2, . . . ,Xp with density fX (x)
transformed into p new random variables Y1,Y2 . . . ,Yp
(Y ∈ Rp), via function
Yi = yi (X1,X2 . . . ,Xp), i = 1, 2, . . . , p (0.18)
I Must be smooth and one-to-one so invertible on codomain of
y(·).
I Call inverse Xi = xi (Y1,Y2 . . . ,Yp), i = 1, 2, . . . , p
I Then,
fY (y1, . . . , yp) = fX [x1(y1, . . . , yp), . . . , xp(y1, . . . , yp)]|J(y1, . . . , yp)|
(0.19)
where J(y1, . . . , yp) is the Jacobian of the transformation:
J(y1, . . . , yp) = |∂x/∂y | = det[∂xi/∂yj ]_{i ,j=1}^p ≡ |∂y/∂x |−1 (0.20)
I In (0.19) the absolute value of the Jacobian is substituted.
Characteristic function
I characteristic function (cf) ϕX (t) of the random vector
X ∈ Rp is a function of a p-dimensional argument
t = (t1 t2 · · · tp)> ∈ Rp
I defined as ϕX (t) = E(e^{it>X}), where i = √−1
I always exists (since |ϕX (t)| ≤ E(|e^{it>X}|) = 1 < ∞)
I Related to the moment generating function (mgf):
MX (t) = E(e^{t>X})
I mgf may not exist for all t
I cf’s have one-to-one correspondence with distributions
I under some conditions, can go the other way:
fX (x) = (1/2π) ∫_{−∞}^{+∞} e^{−itx} ϕX (t)dt (p = 1)
fX (x) = (2π)^{−p} ∫_{Rp} e^{−it>x} ϕX (t)dt
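The definition suggests a simple empirical estimate: average e^{itX} over a sample. A numpy sketch (illustration only), comparing the empirical cf of an N(0, 1) sample with the known cf exp(−t²/2):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.standard_normal(100_000)       # sample from N(0, 1)
t = np.linspace(-2, 2, 9)              # grid of cf arguments

# empirical cf: sample average of e^{itX} at each t
phi_hat = np.exp(1j * np.outer(t, x)).mean(axis=1)
phi = np.exp(-t**2 / 2)                # cf of N(0, 1)

print(np.max(np.abs(phi_hat - phi)))   # small Monte Carlo error
```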
Characteristic function: linear transformation
Theorem 0.3.
If the cf of the random vector X ∈ Rp is ϕX (t) and
Y = AX + b,b ∈ Rq,A ∈Mq,p is a linear transformation, then it
holds for all s ∈ Rq that
ϕY (s) = e^{is>b} ϕX (A>s)
Proof.
at lecture.
Additional resources
I JW Ch. 2–3.
Exercise 0.1
In an ecological experiment, colonies of 2 different species of insect
are confined to the same habitat. The survival times of the two
species (in days) are random variables X1 and X2 respectively. It is
thought that X1 and X2 have a joint density of the form
fX (x1, x2) = θx1 e^{−x1(θ+x2)} (0 < x1, x2)
for some constant θ > 0.
(a) Show that fX (x1, x2) is a valid density.
(b) Find the probability that both species die out within t days of
the start of the experiment.
(c) Derive the marginal density of X1. Identify this distribution
and write down E(X1) and Var(X1).
(d) Derive the marginal density of X2, and the conditional density
of X2 given X1 = x1.
(e) What evidence do you now have that X1 and X2 are not
independent?
Exercise 0.2
Let X = [X1,X2]> be a random vector with E(X ) = µ and
Var(X ) = Σ = σ²
(1 ρ
 ρ 1).
(a) Find Cov(X1 − X2,X1 + X2).
(b) Find Cov(X1,X2 − ρX1).
(c) Choose b to minimise Var(X2 − bX1).
Exercise 0.3
Suppose X is a p-dimensional random vector with cf ϕX (t). If X is
partitioned as [X(1); X(2)], where X(1) is a p1-dimensional subvector,
then show that
(a) X(1) has cf ϕX^{(1)}(t(1)) = ϕX ([t(1); 0]), t(1) ∈ Rp1 .
(b) X(1) and X(2) are independent if and only if
ϕX (t) = ϕX ([t(1); 0]) ϕX ([0; t(2)]), ∀t(1) ∈ Rp1 , ∀t(2) ∈ Rp−p1 .
Exercise 0.4
Let X ∈ Mp,p be a symmetric positive definite matrix with
eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λp > 0 and associated eigenvectors of
unit length ei , i = 1, 2, . . . , p, that give rise to the following spectral
decomposition:
X = λ1e1e1> + λ2e2e2> + · · ·+ λpepep>
It is known that max_{y≠0} y>Xy/(y>y) = λ1. Now, you show that
max_{y≠0,〈y ,e1〉=0} y>Xy/(y>y) = λ2. Can you find further generalisations of
this claim?
Exercise 0.5
We know that an orthogonal projection matrix has only 0 or 1 as
possible eigenvalues. Using this property or otherwise, show that
the rank of an orthogonal projector is equal to the sum of its
diagonal elements.
Lecture 1: Exploratory Data Analysis of Multivariate Data
1.1 Data organisation
1.2 Basic summaries
1.3 Visualisation
1.4 Software
Representation
case (a.k.a. item, individual, or experimental trial): the unit of
analysis; p ≥ 1 variables are recorded on each
xij : ith (of p) variable observed on jth (of n) case
data matrix:
X ∈ Mp,n, X =
x11 x12 · · · x1j · · · x1n
x21 x22 · · · x2j · · · x2n
...
xi1 xi2 · · · xij · · · xin
...
xp1 xp2 · · · xpj · · · xpn
 (1.1)
Univariate summaries
sample mean (of variable i): x̄i = (1/n) ∑_{j=1}^n xij
sample variance (of variable i): si² = (1/n) ∑_{j=1}^n (xij − x̄i )²
I Sometimes, we will use a divisor of n − 1 instead.
Bivariate summaries
sample covariance (of variables i and k):
sik = (1/n) ∑_{j=1}^n (xij − x̄i )(xkj − x̄k)
I Linear association only!
I Symmetric: sik ≡ ski .
sample correlation (of variables i and k): rik = sik/(√sii √skk) ≡ sik/(si sk)
I A unitless measure.
I Also symmetric.
I Cauchy–Bunyakovsky–Schwarz Inequality =⇒ |rik | ≤ 1.
I Also linear; can use quotient correlation instead for nonlinear.
Calculations on matrix data
The descriptive statistics that we discussed until now are usually
organised into arrays, namely:
Vector of sample means: x̄ = (x̄1 x̄2 · · · x̄p)>
Matrix of sample variances and covariances: S ∈ Mp,p,
S =
s11 s12 · · · s1p
s21 s22 · · · s2p
...
sp1 sp2 · · · spp
 (1.2)
Matrix of sample correlations: R ∈ Mp,p,
R =
1 r12 · · · r1p
r21 1 · · · r2p
...
rp1 rp2 · · · 1
 (1.3)
Graphical representations
Some simple characteristics of the data are worth studying before
the actual multivariate analysis begins:
I drawing scatterplot of the data;
I calculating simple univariate descriptive statistics for each
variable;
I calculating sample correlation and covariance coefficients; and
I linking multiple two-dimensional scatterplots.
SAS In SAS, the procedures that are used for this purpose are
called proc means, proc plot and proc corr. Please study
their short description in the included SAS handout.
R In R, these are implemented in base::rowMeans,
base::colMeans, stats::cor, graphics::plot,
graphics::pairs, GGally::ggpairs. Here, the format is
PACKAGE::FUNCTION, and you can learn more by running
library(PACKAGE)
? FUNCTION
Lecture 2: The Multivariate Normal Distribution
2.1 Definition
2.2 Properties of multivariate normal
2.3 Tests for Multivariate Normality
2.4 Software
2.5 Examples
2.6 Additional resources
2.7 Exercises
Generalising the Normal Distribution
I Generalisation of the univariate normal for p ≥ 2 dimensions
I Consider replacing ((x − µ)/σ)² = (x − µ)(σ²)−1(x − µ) in
f (x) = (1/√(2πσ²)) e^{−[(x−µ)/σ]²/2}, −∞ < x < ∞ (2.1)
by (x − µ)>Σ−1(x − µ).
I µ = E X ∈ Rp the expected value of the random vector
X ∈ Rp
I covariance matrix
Σ = E(X − µ)(X − µ)> =
σ11 σ12 · · · σ1p
σ21 σ22 · · · σ2p
...
σp1 σp2 · · · σpp
 ∈ Mp,p
I diagonal of Σ =⇒ variances of each of the p random variables
I σij = E[(Xi − E(Xi ))(Xj − E(Xj))], i ≠ j =⇒ covariances
between the ith and jth random variable
I σii ≡ σi²
I Only makes sense if Σ pos. def.
Multivariate Normal Distribution density
I Σ pos. def. =⇒ density of X is
f (x) = (1/((2π)^{p/2}|Σ|^{1/2})) e^{−(x−µ)>Σ−1(x−µ)/2}, −∞ < xi < ∞, i = 1, 2, . . . , p
(2.2)
I E X = µ
I E[(X − µ)(X − µ)>] = Σ
I Notation: Np(µ,Σ).
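Formula (2.2) can be coded directly; a minimal numpy sketch (not in the slides; `dmvnorm` is a name chosen here for illustration), with a sanity check that for diagonal Σ the density factorises into univariate normal densities:

```python
import numpy as np

def dmvnorm(x, mu, Sigma):
    """Density (2.2) of N_p(mu, Sigma) at x, for nonsingular Sigma."""
    p = len(mu)
    d = x - mu
    quad = d @ np.linalg.solve(Sigma, d)              # (x-mu)' Sigma^{-1} (x-mu)
    norm = (2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-quad / 2) / norm

# for diagonal Sigma the components are independent (cf. Property 1),
# so the joint density is a product of univariate normal densities
mu = np.array([1.0, -1.0])
Sigma = np.diag([4.0, 0.25])
x = np.array([0.5, 0.0])

uni = np.prod([np.exp(-(xi - mi) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
               for xi, mi, s2 in zip(x, mu, Sigma.diagonal())])
assert np.isclose(dmvnorm(x, mu, Sigma), uni)
```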
Cramér–Wold argument
I We also want the MVN for singular Σ (nonneg. def.)
I Use the Cramér–Wold argument:
The distribution of a p-dimensional random vector X is
completely characterised by the one-dimensional distribu-
tions of all linear transformations t>X , t ∈ Rp.
I I.e., consider E[e^{is(t>X )}] (assumed known for every
s ∈ R, t ∈ Rp).
I Substitute s = 1 to get E[e^{it>X}], the cf of the vector X .
Definition 2.1.
The random vector X ∈ Rp has a multivariate normal distribution
if and only if (iff) any linear transformation t>X , t ∈ Rp has a
univariate normal distribution.
Lemma 2.2.
The characteristic function of the (univariate) standard normal
random variable X ∼ N(0, 1) is
ψX (t) = exp(−t²/2).
Proof.
(optional, not examinable) Sketch (full details in the handout):
1. ψX (t) = E exp(itX ) = ∫_{−∞}^{+∞} exp(itx) (1/√(2π)) exp(−x²/2)dx
2. Completing the square and factoring,
ψX (t) = exp(−t²/2) [the cf] × lim_{h→∞} ∫_{−h+it}^{+h+it} (1/√(2π)) exp(−z²/2)dz [a complex integral].
3. Use Cauchy’s Theorem and contour integration to show that
the complex integral above equals 1.
Aside: We could use the moment generating function (mgf)
MX (t) = E exp(tX ) = exp(t²/2) instead.
Theorem 2.3.
Suppose that for a random vector X ∈ Rp with a normal
distribution according to Definition 2.1 we have E(X ) = µ and
D(X ) = E[(X − µ)(X − µ)>] = Σ. Then:
i) For any fixed t ∈ Rp, t>X ∼ N(t>µ, t>Σt), i.e. t>X has a
one-dimensional normal distribution with expected value t>µ
and variance t>Σt.
ii) The cf of X ∈ Rp is
ϕX (t) = e^{it>µ − (1/2)t>Σt}. (2.3)
Proof.
Part i) is obvious. For part ii),
I cf of the standard univariate normal random variable Z is
e^{−t²/2}
I Any U ∼ N1(µ1, σ1²) has a distribution that coincides with the
distribution of µ1 + σ1Z .
I Then,
ϕU(t) = e^{itµ1} ϕσ1Z (t) = e^{itµ1} E(e^{itσ1Z})
= e^{itµ1} ϕZ (tσ1) = e^{itµ1 − (1/2)t²σ1²}
I So, for t>X ∼ N1(t>µ, t>Σt) (univariate), the cf in a scalar
argument s is
ϕt>X (s) = e^{ist>µ − (1/2)s²t>Σt}.
I Substituting s = 1,
ϕX (t) = e^{it>µ − (1/2)t>Σt}
=⇒ Given µ and Σ use cf formula (2.3) rather than the density
formula (2.2).
I cf formula defined for singular Σ.
I Still need to show density (2.2) for invertible Σ.
Theorem 2.4.
Assume the matrix Σ in (2.3) is nonsingular. Then the density of
the random vector X ∈ Rp with cf as in (2.3) is given by (2.2).
Proof.
I Consider the vector Y = Σ^{−1/2}(X − µ) ∈ Rp (compare (0.10) in
Section 0.1.5)
I E(Y ) = 0
I D(Y ) = E(YY>) = Σ^{−1/2} E[(X − µ)(X − µ)>]Σ^{−1/2} = Ip
=⇒ substitute to get the cf of Y : ϕY (t) = e^{−(1/2)∑_{i=1}^p ti²}
I This is the cf of p independent N(0, 1)
I Y = Σ^{−1/2}(X − µ) =⇒ X = µ + Σ^{1/2}Y
I Density fY (y) = (2π)^{−p/2} e^{−(1/2)∑_{i=1}^p yi²}
I Use the density transformation formula (Section 0.2.4):
fX (x) = fY (Σ^{−1/2}(x − µ))|J(x1, . . . , xp)|
I By linearity: |J(x1, . . . , xp)| = |Σ^{−1/2}| = |Σ^{1/2}|−1
I ∑_{i=1}^p yi² = y>y = (x − µ)>Σ^{−1/2}Σ^{−1/2}(x − µ) = (x − µ)>Σ−1(x − µ)
=⇒ density formula (2.2) for fX (x)
Property 1
If Σ = D(X ) = Λ is a diagonal matrix then the p components of X
are independent.
I I.e., then ϕX(t) = e^{∑_{j=1}^p (itjµj − (1/2)tj²σj²)} decomposes into cf's of p
independent components each distributed according to
N(µj, σj²), j = 1, . . . , p
I “for a multivariate normal, if its components are uncorrelated
they are also independent”
I converse (if independent, then uncorrelated) true for any
distribution
I For the multivariate normal distribution, we can conclude that
its components are independent if and only if they are
uncorrelated!
UNSW MATH5855 2021T3 Lecture 2 Slide 13
Example 2.5 (Random variables that are marginally
normal and uncorrelated but not independent).
Consider two variables Z1 = (2W − 1)Y and Z2 = Y, where
Y ∼ N1(0, 1) and, independently, W ∼ Binomial(1, 1/2) (so
2W − 1 takes −1 and +1 with equal probability). Then each of Z1
and Z2 is marginally N(0, 1) and
Cov(Z1, Z2) = E[(2W − 1)Y²] = E(2W − 1) E(Y²) = 0, yet Z1 and
Z2 are not independent since |Z1| = |Z2| always.
UNSW MATH5855 2021T3 Lecture 2 Slide 14
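Example 2.5 is easy to check by simulation. A minimal sketch in Python (NumPy assumed; the course's own worked examples use R): the sample correlation of Z1 and Z2 is near zero, yet |Z1| = |Z2| exactly, so they cannot be independent.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
y = rng.standard_normal(n)            # Y ~ N(0, 1)
w = rng.integers(0, 2, size=n)        # W ~ Binomial(1, 1/2)
z1 = (2 * w - 1) * y                  # marginally N(0, 1)
z2 = y

corr = np.corrcoef(z1, z2)[0, 1]      # approximately 0: uncorrelated
# but |Z1| = |Z2| exactly, so Z1 and Z2 are clearly dependent
```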
Property 2
If X ∼ Np(µ,Σ) and C ∈Mq,p is an arbitrary matrix of real
numbers then
Y = CX ∼ Nq(Cµ,CΣC>).
I From Section 0.2.5, for any s ∈ Rq,
ϕY(s) = ϕX(C>s) = e^{is>Cµ − (1/2)s>CΣC>s}
=⇒ Y = CX ∼ Nq(Cµ, CΣC>)
I If C is of full rank q ≤ p and rk(Σ) = p then CΣC> is also of
full rank
I I.e. the distribution of Y would not be degenerate in this case.
UNSW MATH5855 2021T3 Lecture 2 Slide 15
Property 3
(This is a finer version of Property 1.) Assume the vector X ∈ Rp
is divided into subvectors X = (X(1)> X(2)>)> and according to this
subdivision the mean vector is µ = (µ(1)> µ(2)>)> and the covariance
matrix Σ has been subdivided into Σ = (Σ11 Σ12; Σ21 Σ22). Then the
vectors X(1) and X(2) are independent iff Σ12 = 0.
Proof.
(Exercise (see lecture)).
UNSW MATH5855 2021T3 Lecture 2 Slide 16
Property 4
Let the vector X ∈ Rp be divided into subvectors X = (X(1)> X(2)>)>,
X(1) ∈ Rr, r < p, X(2) ∈ R^{p−r}, and according to this subdivision the
mean vector is µ = (µ(1)> µ(2)>)> and the covariance matrix Σ has been
subdivided into Σ = (Σ11 Σ12; Σ21 Σ22). Assume for simplicity that the
rank of Σ22 is full. Then the conditional density of X(1) given that
X(2) = x(2) is
Nr(µ(1) + Σ12Σ22^{−1}(x(2) − µ(2)), Σ11 − Σ12Σ22^{−1}Σ21) (2.4)
UNSW MATH5855 2021T3 Lecture 2 Slide 17
Proof.
I Expression µ(1) + Σ12Σ22^{−1}(x(2) − µ(2)) is a function of x(2); denote
it as g(x(2)). Construct r.v.'s Z = X(1) − g(X(2)) and
Y = X(2) − µ(2). Observe E Z = 0 and E Y = 0.
I (Z> Y>)> = A(X − µ) for A = (Ir −Σ12Σ22^{−1}; 0 Ip−r) =⇒ normal (Property 2).
I Var((Z> Y>)>) = AΣA> = (Σ11 − Σ12Σ22^{−1}Σ21 0; 0 Σ22) {block multiplication}
=⇒ Z and Y uncorr. and jointly normal =⇒ independent (Property 3).
I Y is a linear transformation of X(2) =⇒ Z and X(2) indep.
=⇒ Conditional density of Z given X(2) = x(2) will not depend on
x(2) and coincides with the unconditional density of Z.
=⇒ Z normal with Cov(Z) = Σ11 − Σ12Σ22^{−1}Σ21 = Σ1|2
=⇒ X(1) − g(x(2)) ∼ N(0, Σ1|2)
=⇒ (2.4)
UNSW MATH5855 2021T3 Lecture 2 Slide 18
Example 2.6.
As an immediate consequence of Property 4 we see that if
p = 2, r = 1 then for a two-dimensional normal vector
(X1, X2)> ∼ N{(µ1, µ2)>, (σ1² σ12; σ12 σ2²)} its conditional density
f(x1|x2) is N(µ1 + (σ12/σ2²)(x2 − µ2), σ1² − σ12²/σ2²).
As an exercise, try to derive the above result by direct calculations
starting from the joint density f(x1, x2), going over to the marginal
f(x2) by integration and finally getting f(x1|x2) = f(x1, x2)/f(x2).
UNSW MATH5855 2021T3 Lecture 2 Slide 19
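Property 4 and Example 2.6 amount to a small matrix computation. A sketch in Python (NumPy assumed; the helper name `conditional_mvn` is ours, not from the course):

```python
import numpy as np

def conditional_mvn(mu, Sigma, r, x2):
    """Parameters of X(1) | X(2) = x2 for X ~ N(mu, Sigma), X(1) of dim r."""
    mu1, mu2 = mu[:r], mu[r:]
    S11, S12 = Sigma[:r, :r], Sigma[:r, r:]
    S21, S22 = Sigma[r:, :r], Sigma[r:, r:]
    S22inv = np.linalg.inv(S22)
    cond_mean = mu1 + S12 @ S22inv @ (x2 - mu2)   # µ(1) + Σ12 Σ22^{-1}(x2 - µ(2))
    cond_cov = S11 - S12 @ S22inv @ S21            # Σ11 - Σ12 Σ22^{-1} Σ21
    return cond_mean, cond_cov

# bivariate case of Example 2.6 with illustrative numbers: p = 2, r = 1
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
m, v = conditional_mvn(mu, Sigma, 1, np.array([3.0]))
# by hand: m = µ1 + (σ12/σ2²)(x2 − µ2) = 1 + 0.6·1 = 1.6,
#          v = σ1² − σ12²/σ2² = 2 − 0.36 = 1.64
```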
Property 5
If X ∼ Np(µ,Σ) and Σ is nonsingular then
(X − µ)>Σ−1(X − µ) ∼ χ2p where χ2p denotes the chi-square
distribution with p degrees of freedom.
Proof.
It suffices to use the fact that (see also Theorem 2.4) the vector
Y = Σ^{−1/2}(X − µ) ∈ Rp ∼ N(0, Ip), i.e. it has p independent
standard normal components. Then
(X − µ)>Σ^{−1}(X − µ) = Y>Y = ∑_{i=1}^p Yi² ∼ χ²p
according to the definition of χ²p as a distribution of the sum of
squares of p independent standard normals.
UNSW MATH5855 2021T3 Lecture 2 Slide 20
Prediction: “Best Predictor”
I A corollary of Property 4
I Predict Y from p predictors X = (X1, X2, . . . , Xp)> by
choosing g(·) to minimise E[{Y − g(X)}²] s.t. E g(X)² < ∞.
I Optimal g∗(x) = E(Y |X = x): regression function
UNSW MATH5855 2021T3 Lecture 2 Slide 21
Prediction: Best Predictor for MVN
I In general, g∗(x) = E(Y |X = x) can be complicated.
I For MVN, much simpler.
I If (Y X>)> ∈ R^{1+p} is normal, apply Property 4
=⇒ g*(x) = b + σ0>C^{−1}x for b = E(Y) − σ0>C^{−1} E(X),
C = Cov(X), and σ0 = Cov(X, Y).
I I.e.,
g*(x) = E(Y) + σ0>C^{−1}(x − E(X)).
I In case of joint normality, prediction turns out linear.
I C−1σ0 ∈ Rp is the vector of the regression coefficients.
I results in prediction variance Var(Y) − σ0>C^{−1}σ0
UNSW MATH5855 2021T3 Lecture 2 Slide 22
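The best-predictor formula g*(x) = E(Y) + σ0>C⁻¹(x − E(X)) can be illustrated numerically. A hedged sketch in Python (NumPy assumed; the data-generating coefficients are our illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(1)
# illustrative joint-normal setup: Y = 1 + X1 - 2*X2 + noise
n = 50_000
X = rng.standard_normal((n, 2))
Y = 1.0 + X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.standard_normal(n)

C = np.cov(X, rowvar=False)                                       # Cov(X)
sigma0 = np.array([np.cov(X[:, j], Y)[0, 1] for j in range(2)])   # Cov(X, Y)
beta = np.linalg.solve(C, sigma0)     # C^{-1} σ0: regression coefficients
b = Y.mean() - X.mean(axis=0) @ beta  # intercept E(Y) - σ0' C^{-1} E(X)

def g_star(x):
    """Best (linear, under joint normality) predictor of Y at x."""
    return b + x @ beta
```

With this setup the recovered coefficients should be close to (1, −2) and the intercept close to 1.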
2. The Multivariate Normal Distribution
2.1 Definition
2.2 Properties of multivariate normal
2.3 Tests for Multivariate Normality
2.4 Software
2.5 Examples
2.6 Additional resources
2.7 Exercises
UNSW MATH5855 2021T3 Lecture 2 Slide 23
Graphical diagnostics
I Normality makes things easier.
I Is also sometimes an important assumption.
I Since margins and linear combinations of MVN are normal,
1. check marginal distributions (e.g., Q–Q plots, the
Shapiro–Wilk test);
2. check scatterplots of pairs of observations;
3. note outliers.
I These checks only address univariate and bivariate normality.
UNSW MATH5855 2021T3 Lecture 2 Slide 24
Mardia’s Multivariate Skewness and Kurtosis
Multivariate skewness: For Y independent of X but with the same
distribution,
β1,p = E[(Y − µ)>Σ−1(X − µ)]3 (2.5)
Multivariate kurtosis:
β2,p = E[(X − µ)>Σ−1(X − µ)]2 (2.6)
I Assuming these expectations exist.
I For Np(µ,Σ), β1,p = 0 and β2,p = p(p + 2).
I p = 1 =⇒ β1,1 = (E(X − µ)³/σ³)², β2,1 = E(X − µ)⁴/σ⁴
I Estimated as
β̂1,p = (1/n²) ∑_{i=1}^n ∑_{j=1}^n gij³, β̂2,p = (1/n) ∑_{i=1}^n gii²
where gij = (xi − x̄)>Sn^{−1}(xj − x̄).
UNSW MATH5855 2021T3 Lecture 2 Slide 25
Mardia’s test statistics
I β̂1,p ≥ 0 and β̂2,p ≥ 0
I For MVN, β̂1,p ≈ 0 and β̂2,p ≈ p(p + 2), respectively.
I Asymptotically, k1 = nβ̂1,p/6 ∼ χ²_{p(p+1)(p+2)/6}, and
k2 = [β̂2,p − p(p + 2)]/[8p(p + 2)/n]^{1/2} ∼ N(0, 1).
=⇒ Use k1 and k2 to test the null hypothesis of multivariate
normality.
I If neither null hypothesis is rejected, the MVN assumption is in
reasonable agreement with the data.
I Mardia’s multivariate kurtosis can also be used to detect
outliers.
UNSW MATH5855 2021T3 Lecture 2 Slide 26
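Mardia's estimates and test statistics are straightforward to compute from the matrix of the gij. A Python sketch (NumPy assumed; the function name `mardia` is illustrative — the course points to MVN::mvn and psych::mardia in R):

```python
import numpy as np

def mardia(x):
    """Mardia's multivariate skewness/kurtosis estimates and test statistics.
    x: (n, p) data matrix, rows = observations."""
    n, p = x.shape
    xc = x - x.mean(axis=0)
    Sn = xc.T @ xc / n                     # MLE covariance (divisor n)
    G = xc @ np.linalg.inv(Sn) @ xc.T      # g_ij = (x_i - x̄)' S_n^{-1} (x_j - x̄)
    b1 = (G ** 3).sum() / n ** 2           # skewness estimate (>= 0)
    b2 = (np.diag(G) ** 2).sum() / n       # kurtosis estimate
    k1 = n * b1 / 6                        # ~ chi^2 with p(p+1)(p+2)/6 df
    k2 = (b2 - p * (p + 2)) / np.sqrt(8 * p * (p + 2) / n)  # ~ N(0, 1)
    return b1, b2, k1, k2

rng = np.random.default_rng(2)
b1, b2, k1, k2 = mardia(rng.standard_normal((500, 3)))
# for MVN data with p = 3, expect b1 near 0 and b2 near p(p + 2) = 15
```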
Caveat: Overreliance on tests
I Shapiro–Wilk, Mardia, etc. =⇒
H0 : population is (multivariate) normal
I Any deviation from normality =⇒ Pr(reject H0) → 1 as n → ∞
I CLT =⇒ X̄ → MVN as n → ∞ regardless of the population distribution
I S too, but much more slowly
=⇒ As n increases,
I more likely for test to conclude population non-normality.
I need population normality less in the first place.
=⇒ Particularly for large datasets, don’t overrely on tests.
UNSW MATH5855 2021T3 Lecture 2 Slide 27
2. The Multivariate Normal Distribution
2.1 Definition
2.2 Properties of multivariate normal
2.3 Tests for Multivariate Normality
2.4 Software
2.5 Examples
2.6 Additional resources
2.7 Exercises
UNSW MATH5855 2021T3 Lecture 2 Slide 28
SAS Use CALIS procedure. The quantity k2 is called Normalised
Multivariate Kurtosis there, whereas β̂2,p − p(p + 2) bears the
name Mardia’s Multivariate Kurtosis.
R MVN::mvn, psych::mardia
UNSW MATH5855 2021T3 Lecture 2 Slide 29
2. The Multivariate Normal Distribution
2.1 Definition
2.2 Properties of multivariate normal
2.3 Tests for Multivariate Normality
2.4 Software
2.5 Examples
2.6 Additional resources
2.7 Exercises
UNSW MATH5855 2021T3 Lecture 2 Slide 30
Example 2.7.
Testing multivariate normality of microwave oven radioactivity
measurements (JW).
UNSW MATH5855 2021T3 Lecture 2 Slide 31
2. The Multivariate Normal Distribution
2.1 Definition
2.2 Properties of multivariate normal
2.3 Tests for Multivariate Normality
2.4 Software
2.5 Examples
2.6 Additional resources
2.7 Exercises
UNSW MATH5855 2021T3 Lecture 2 Slide 32
Additional resources
I JW Sec. 4.1–4.2, 4.6.
UNSW MATH5855 2021T3 Lecture 2 Slide 33
2. The Multivariate Normal Distribution
2.1 Definition
2.2 Properties of multivariate normal
2.3 Tests for Multivariate Normality
2.4 Software
2.5 Examples
2.6 Additional resources
2.7 Exercises
UNSW MATH5855 2021T3 Lecture 2 Slide 34
Exercise 2.1
Let X1 and X2 denote i.i.d. N(0, 1) r.v.’s.
(a) Show that the r.v.’s Y1 = X1 − X2 and Y2 = X1 + X2 are
independent, and find their marginal densities.
(b) Find P(X1² + X2² < 2.41).
UNSW MATH5855 2021T3 Lecture 2 Slide 35
Exercise 2.2
Let X ∼ N3(µ,Σ) where
µ = (3, −1, 2)>, Σ = (3 2 1; 2 3 1; 1 1 2).
(a) For A = (1 1 1; 1 −2 1) find the distribution of Z = AX and
find the correlation between the two components of Z.
(b) Find the conditional distribution of [X1,X3]> given X2 = 0.
UNSW MATH5855 2021T3 Lecture 2 Slide 36
Exercise 2.3
Suppose that X1, . . . ,Xn are independent random vectors, with
each Xi ∼ Np(µi ,Σi ). Let a1, . . . , an be real constants. Using
characteristic functions, show that
a1X1 + · · · + anXn ∼ Np(a1µ1 + · · · + anµn, a1²Σ1 + · · · + an²Σn).
Therefore, deduce that, if X1, . . . , Xn form a random sample from
the Np(µ, Σ) distribution, then the sample mean vector,
X̄ = (1/n)∑_{i=1}^n Xi, has distribution
X̄ ∼ Np(µ, (1/n)Σ).
UNSW MATH5855 2021T3 Lecture 2 Slide 37
Exercise 2.4
Prove that if X1 ∼ Nr (µ1,Σ11) and
(X2|X1 = x1) ∼ Np−r (Ax1 + b,Ω) where Ω does not depend on x1
then X = (X1> X2>)> ∼ Np(µ, Σ) where
µ = (µ1; Aµ1 + b), Σ = (Σ11 Σ11A>; AΣ11 Ω + AΣ11A>).
UNSW MATH5855 2021T3 Lecture 2 Slide 38
Exercise 2.5
Knowing that,
i) Z ∼ N1(0, 1)
ii) Y |Z = z ∼ N1(1 + z , 1)
iii) X |(Y ,Z ) = (y , z) ∼ N1(1− y , 1)
(a) Find the distribution of (X, Y, Z)> and of Y | (X, Z).
(b) Find the distribution of (U, V)> = (1 + Z, 1 − Y)>.
(c) Compute E(Y |U = 2).
UNSW MATH5855 2021T3 Lecture 2 Slide 39
Lecture 3: Estimation of the Mean Vector and Covariance
Matrix of Multivariate Normal Distribution
3.1 Maximum Likelihood Estimation
3.2 Distributions of MLE of mean vector and covariance matrix of
multivariate normal distribution
3.3 Additional resources
3.4 Exercises
UNSW MATH5855 2021T3 Lecture 3 Slide 1
3. Estimation of the Mean Vector and Covariance Matrix of
Multivariate Normal Distribution
3.1 Maximum Likelihood Estimation
3.1.1 Likelihood function
3.1.2 Maximum Likelihood Estimators
3.1.3 Alternative proofs
3.1.4 Application in correlation matrix estimation
3.1.5 Sufficiency of µ̂ and Σ̂
3.2 Distributions of MLE of mean vector and covariance matrix of
multivariate normal distribution
3.3 Additional resources
3.4 Exercises
UNSW MATH5855 2021T3 Lecture 3 Slide 2
3. Estimation of the Mean Vector and Covariance Matrix of
Multivariate Normal Distribution
3.1 Maximum Likelihood Estimation
3.1.1 Likelihood function
3.1.2 Maximum Likelihood Estimators
3.1.3 Alternative proofs
3.1.4 Application in correlation matrix estimation
3.1.5 Sufficiency of µ̂ and Σ̂
UNSW MATH5855 2021T3 Lecture 3 Slide 3
Data
Suppose we have observed n independent realisations of
p-dimensional random vectors from Np(µ,Σ). Suppose for
simplicity that Σ is non-singular. The data matrix has the form
X =
X11 X12 · · · X1j · · · X1n
X21 X22 · · · X2j · · · X2n
⋮ ⋮ ⋱ ⋮ ⋱ ⋮
Xi1 Xi2 · · · Xij · · · Xin
⋮ ⋮ ⋱ ⋮ ⋱ ⋮
Xp1 Xp2 · · · Xpj · · · Xpn
= [X1, X2, . . . , Xn] (3.1)
Goal: Estimate unknown mean vector and the covariance matrix
using Maximum Likelihood Estimation.
UNSW MATH5855 2021T3 Lecture 3 Slide 4
Likelihood function
I Lecture 2 =⇒ Likelihood function
L(x; µ, Σ) = (2π)^{−np/2}|Σ|^{−n/2} e^{−(1/2)∑_{i=1}^n (xi−µ)>Σ^{−1}(xi−µ)} (3.2)
I Fix x = data matrix; focus on µ,Σ.
I Log-likelihood function
log L(x; µ, Σ) = −(np/2) log(2π) − (n/2) log(|Σ|)
− (1/2)∑_{i=1}^n (xi − µ)>Σ^{−1}(xi − µ) (3.3)
I Same maximum.
UNSW MATH5855 2021T3 Lecture 3 Slide 5
Utilising properties of traces from Section 0.1.1, we can transform:
∑_{i=1}^n (xi − µ)>Σ^{−1}(xi − µ) = ∑_{i=1}^n tr[Σ^{−1}(xi − µ)(xi − µ)>]
= tr[Σ^{−1}(∑_{i=1}^n (xi − µ)(xi − µ)>)]
= tr[Σ^{−1}(∑_{i=1}^n (xi − x̄)(xi − x̄)> + n(x̄ − µ)(x̄ − µ)>)]
= tr[Σ^{−1}(∑_{i=1}^n (xi − x̄)(xi − x̄)>)] + n(x̄ − µ)>Σ^{−1}(x̄ − µ).
=⇒ log L(x; µ, Σ) = −(np/2) log(2π) − (n/2) log(|Σ|)
− (1/2) tr[Σ^{−1}(∑_{i=1}^n (xi − x̄)(xi − x̄)>)] − (n/2)(x̄ − µ)>Σ^{−1}(x̄ − µ)
(3.4)
UNSW MATH5855 2021T3 Lecture 3 Slide 6
3. Estimation of the Mean Vector and Covariance Matrix of
Multivariate Normal Distribution
3.1 Maximum Likelihood Estimation
3.1.1 Likelihood function
3.1.2 Maximum Likelihood Estimators
3.1.3 Alternative proofs
3.1.4 Application in correlation matrix estimation
3.1.5 Sufficiency of µ̂ and Σ̂
UNSW MATH5855 2021T3 Lecture 3 Slide 7
I Σ nonnegative def. =⇒ the term (1/2)n(x̄ − µ)>Σ^{−1}(x̄ − µ)
is minimised when µ̂ = x̄
I What if Σ singular?
I What about Σ?
UNSW MATH5855 2021T3 Lecture 3 Slide 8
Theorem 3.1 (Anderson’s lemma).
If A ∈Mp,p is symmetric positive definite, then the maximum of
the function h(G ) = −n log(|G |)− tr(G−1A) (defined over the set
of symmetric positive definite matrices G ∈ Mp,p) exists, occurs at
G = (1/n)A and has the maximal value of np log(n) − n log(|A|) − np.
UNSW MATH5855 2021T3 Lecture 3 Slide 9
Proof.
(sketch, details at lecture):
I tr(G^{−1}A) = tr((G^{−1}A^{1/2})A^{1/2}) = tr(A^{1/2}G^{−1}A^{1/2})
I Let ηi, i = 1, . . . , p be the eigenvalues of A^{1/2}G^{−1}A^{1/2}
I Pos. def. =⇒ ηi > 0, i = 1, . . . , p
I tr(A^{1/2}G^{−1}A^{1/2}) = ∑_{i=1}^p ηi and |A^{1/2}G^{−1}A^{1/2}| = ∏_{i=1}^p ηi
=⇒ −n log|G| − tr(G^{−1}A) = n ∑_{i=1}^p log ηi − n log|A| − ∑_{i=1}^p ηi
(3.5)
I maximised when ηi = n for all i
=⇒ Maximum is np log(n) − n log(|A|) − np.
I For G = (1/n)A,
h(G) = −n log(|G|) − tr(G^{−1}A) = np log(n) − n log(|A|) − np.
=⇒ G = (1/n)A maximises.
UNSW MATH5855 2021T3 Lecture 3 Slide 10
Theorem 3.1 for A = ∑_{i=1}^n (xi − x̄)(xi − x̄)> =⇒
Theorem 3.2.
Suppose X1, X2, . . . , Xn is a random sample from Np(µ, Σ), p < n.
Then µ̂ = X̄ and Σ̂ = (1/n)∑_{i=1}^n (Xi − X̄)(Xi − X̄)> are the maximum
likelihood estimators of µ and Σ, respectively.
UNSW MATH5855 2021T3 Lecture 3 Slide 11
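Theorem 3.2's estimators are one line each in code. A NumPy sketch (note the MLE divisor n, not the unbiased n − 1; the parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
n = 20_000
x = rng.multivariate_normal(mu, Sigma, size=n)   # rows = observations

mu_hat = x.mean(axis=0)                          # MLE of µ: the sample mean
xc = x - mu_hat
Sigma_hat = xc.T @ xc / n                        # MLE of Σ (divisor n, not n-1)
```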
3. Estimation of the Mean Vector and Covariance Matrix of
Multivariate Normal Distribution
3.1 Maximum Likelihood Estimation
3.1.1 Likelihood function
3.1.2 Maximum Likelihood Estimators
3.1.3 Alternative proofs
3.1.4 Application in correlation matrix estimation
3.1.5 Sufficiency of µ̂ and Σ̂
UNSW MATH5855 2021T3 Lecture 3 Slide 12
I Alternative proofs also available, using vector and matrix
calculus.
I May be covered later, time permitting.
UNSW MATH5855 2021T3 Lecture 3 Slide 13
3. Estimation of the Mean Vector and Covariance Matrix of
Multivariate Normal Distribution
3.1 Maximum Likelihood Estimation
3.1.1 Likelihood function
3.1.2 Maximum Likelihood Estimators
3.1.3 Alternative proofs
3.1.4 Application in correlation matrix estimation
3.1.5 Sufficiency of µ̂ and Σ̂
UNSW MATH5855 2021T3 Lecture 3 Slide 14
I Correlation matrix is a function of the covariance matrix:
ρij = σij/(√σii √σjj)
I MLE is invariant under transformations.
=⇒
ρ̂ij = σ̂ij/(√σ̂ii √σ̂jj), i = 1, . . . , p, j = 1, . . . , p. (3.6)
UNSW MATH5855 2021T3 Lecture 3 Slide 15
3. Estimation of the Mean Vector and Covariance Matrix of
Multivariate Normal Distribution
3.1 Maximum Likelihood Estimation
3.1.1 Likelihood function
3.1.2 Maximum Likelihood Estimators
3.1.3 Alternative proofs
3.1.4 Application in correlation matrix estimation
3.1.5 Sufficiency of µ̂ and Σ̂
UNSW MATH5855 2021T3 Lecture 3 Slide 16
I Recall:
L(x; µ, Σ) = (2π)^{−np/2}|Σ|^{−n/2} e^{−(1/2) tr[Σ^{−1}(∑_{i=1}^n (xi−x̄)(xi−x̄)> + n(x̄−µ)(x̄−µ)>)]}
I Factorisation L(x ;µ,Σ) = g1(x)g2(µ,Σ; µ̂, Σ̂) =⇒ µ̂ and Σ̂
are (collectively) a sufficient statistic for µ and Σ in the case
of a sample from Np(µ,Σ).
I Structure of normal density important for this.
=⇒ Non-normality can break inferences that are based solely on µ̂
and Σ̂.
UNSW MATH5855 2021T3 Lecture 3 Slide 17
3. Estimation of the Mean Vector and Covariance Matrix of
Multivariate Normal Distribution
3.1 Maximum Likelihood Estimation
3.2 Distributions of MLE of mean vector and covariance matrix of
multivariate normal distribution
3.2.1 Sampling distribution of X̄
3.2.2 Sampling distribution of the MLE of Σ
3.2.3 Aside: The Gram–Schmidt Process (not examinable)
3.3 Additional resources
3.4 Exercises
UNSW MATH5855 2021T3 Lecture 3 Slide 18
I Inference is not just point estimates.
I Need to quantify uncertainty as well.
=⇒ Need sampling distributions of estimators as well.
UNSW MATH5855 2021T3 Lecture 3 Slide 19
3. Estimation of the Mean Vector and Covariance Matrix of
Multivariate Normal Distribution
3.2 Distributions of MLE of mean vector and covariance matrix of
multivariate normal distribution
3.2.1 Sampling distribution of X̄
3.2.2 Sampling distribution of the MLE of Σ
3.2.3 Aside: The Gram–Schmidt Process (not examinable)
UNSW MATH5855 2021T3 Lecture 3 Slide 20
p = 1: sample of size n from N(µ, σ²) =⇒ the sample mean is
N(µ, σ²/n)
I sample mean and sample variance are independent (Basu’s
Lemma)
I What about p > 1?
p > 1:
I Let X̄ = (1/n)∑_{i=1}^n Xi ∈ Rp.
I For any l ∈ Rp: l>X̄ is a linear combination of normals and
hence is normal (Definition 2.1).
I E X̄ = (1/n)nµ = µ
I Cov X̄ = (1/n²) n Cov X1 = (1/n)Σ
=⇒ X̄ ∼ Np(µ, (1/n)Σ).
I Need more rigorous machinery.
=⇒ Kronecker products
UNSW MATH5855 2021T3 Lecture 3 Slide 21
Kronecker product A⊗ B of two matrices A ∈Mm,n and
B ∈Mp,q:
A ⊗ B = (a11B a12B · · · a1nB; a21B a22B · · · a2nB; . . . ; am1B am2B · · · amnB) (3.7)
I Properties (assuming conformable, inverses exist, etc.):
(A⊗ B)⊗ C = A⊗ (B ⊗ C )
(A + B)⊗ C = A⊗ C + B ⊗ C
(A⊗ B)> = A> ⊗ B>
(A⊗ B)−1 = A−1 ⊗ B−1
(A⊗ B)(C ⊗ D) = AC ⊗ BD
For square A and B :
tr(A⊗ B) = tr(A) tr(B)
|A⊗ B| = |A|p|B|m
UNSW MATH5855 2021T3 Lecture 3 Slide 22
Stacking columns vec(·): For A ∈ Mm,n, vec(A) ∈ R^{mn} is the vector
composed by stacking the n columns of A.
I Matrices A, B and C (of suitable dimensions):
vec(ABC) = (C> ⊗ A) vec(B)
Application: Let 1n ∈ Rn be the vector of n ones, X be the random
data matrix (see (0.11) in Lecture 0.2).
=⇒ vec(X) ∼ N(1n ⊗ µ, In ⊗ Σ) and X̄ = (1/n)(1n> ⊗ Ip) vec(X)
=⇒ X̄ is MVN,
E X̄ = (1/n)(1n> ⊗ Ip)(1n ⊗ µ) = (1/n)(1n>1n ⊗ µ) = (1/n)nµ = µ
Cov X̄ = n^{−2}(1n> ⊗ Ip)(In ⊗ Σ)(1n ⊗ Ip) = n^{−2}(1n>1n ⊗ Σ) = n^{−1}Σ
UNSW MATH5855 2021T3 Lecture 3 Slide 23
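The Kronecker-product properties and the vec identity above can be verified numerically. A NumPy sketch (np.kron; vec implemented as column-major reshape; the dimensions are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)

def vec(M):
    return M.reshape(-1, order="F")   # stack the columns of M

# vec(ABC) = (C' ⊗ A) vec(B)
A = rng.standard_normal((3, 2))
B = rng.standard_normal((2, 4))
C = rng.standard_normal((4, 5))
lhs_vec = vec(A @ B @ C)
rhs_vec = np.kron(C.T, A) @ vec(B)

# mixed-product property: (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD)
A2, C2 = rng.standard_normal((3, 2)), rng.standard_normal((2, 4))
B2, D2 = rng.standard_normal((2, 3)), rng.standard_normal((3, 2))
lhs_mix = np.kron(A2, B2) @ np.kron(C2, D2)
rhs_mix = np.kron(A2 @ C2, B2 @ D2)
```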
Independence of X̄ and Σ̂
I Recall the likelihood function:
L(x; µ, Σ) = (2π)^{−np/2}|Σ|^{−n/2} e^{−(1/2) tr[Σ^{−1}(∑_{i=1}^n (xi−x̄)(xi−x̄)> + n(x̄−µ)(x̄−µ)>)]}
I Two summands in the exponent:
I one a function of the observations through
nΣ̂ = ∑_{i=1}^n (xi − x̄)(xi − x̄)> only
I one depends on the observations through x̄ only.
Idea: Transform the original data matrix X ∈ Mp,n into a new
matrix Z ∈ Mp,n of n independent vectors s.t.
I X̄ would only be a function of Z1
I ∑_{i=1}^n (xi − x̄)(xi − x̄)> would only be a function of Z2, . . . , Zn
I If we succeed then clearly X̄ and
∑_{i=1}^n (xi − x̄)(xi − x̄)> = nΣ̂ would be independent.
I Functions of independent variables are independent.
UNSW MATH5855 2021T3 Lecture 3 Slide 24
I Want A orthogonal.
I Want X̄ depending only on Z1.
=⇒ First column of A: a1 = (1/√n)1n.
=⇒ First column of Z: Z1 = √n X̄.
I Rest: Gram–Schmidt Process
I vec(Z) = vec(IpXA) = (A> ⊗ Ip) vec(X) =⇒ Jacobian of
vec(X) ↦ vec(Z) is |A> ⊗ Ip| = |A|^p = ±1.
=⇒ absolute value of the Jacobian is 1
=⇒ For vec(Z) we have:
E(vec(Z)) = (A> ⊗ Ip)(1n ⊗ µ) = A>1n ⊗ µ = (√n, 0, . . . , 0)> ⊗ µ
UNSW MATH5855 2021T3 Lecture 3 Slide 25
I Then,
Cov(vec(Z)) = (A> ⊗ Ip)(In ⊗ Σ)(A ⊗ Ip) = A>A ⊗ IpΣIp = In ⊗ Σ
=⇒ Zi, i = 1, . . . , n independent
I Z1 = √n X̄.
I Also,
∑_{i=1}^n (Xi − X̄)(Xi − X̄)> = ∑_{i=1}^n XiXi> − (1/n)(∑_{i=1}^n Xi)(∑_{i=1}^n Xi>)
= ZA>AZ> − Z1Z1> = ∑_{i=1}^n ZiZi> − Z1Z1> = ∑_{i=2}^n ZiZi>
=⇒
Theorem 3.3.
For a sample of size n from Np(µ, Σ), p < n, the sample average
X̄ ∼ Np(µ, (1/n)Σ). Moreover, the MLEs µ̂ = X̄ and Σ̂ are
independent.
UNSW MATH5855 2021T3 Lecture 3 Slide 26
3. Estimation of the Mean Vector and Covariance Matrix of
Multivariate Normal Distribution
3.2 Distributions of MLE of mean vector and covariance matrix of
multivariate normal distribution
3.2.1 Sampling distribution of X̄
3.2.2 Sampling distribution of the MLE of Σ
3.2.3 Aside: The Gram–Schmidt Process (not examinable)
UNSW MATH5855 2021T3 Lecture 3 Slide 27
Definition 3.4.
A random matrix U ∈ Mp,p has a Wishart distribution with
parameters Σ, p, n (denoting this by U ∼ Wp(Σ, n)) if there exist n
independent random vectors Y1, . . . , Yn each with Np(0, Σ)
distribution such that the distribution of ∑_{i=1}^n YiYi> coincides
with the distribution of U.
Note that we require that p < n and that U be non-negative
definite.
UNSW MATH5855 2021T3 Lecture 3 Slide 28
I Given the proof of Theorem 3.3, the distribution of
nΣ̂ = ∑_{i=1}^n (Xi − X̄)(Xi − X̄)> is the same as that of ∑_{i=2}^n ZiZi>:
nΣ̂ ∼ Wp(Σ, n − 1)
I Don't worry about the density formula.
Some properties:
1. p = 1 and Σ = (σ²) =⇒ W1(Σ, n)/σ² = χ²n
=⇒ σ² = 1 =⇒ W1(1, n) =d χ²n
I I.e., a generalisation.
UNSW MATH5855 2021T3 Lecture 3 Slide 29
2. For an arbitrary fixed matrix H ∈Mk,p, k ≤ p one has:
nHΣ̂H> ∼Wk(HΣH>, n − 1).
(Why? Show it!)
UNSW MATH5855 2021T3 Lecture 3 Slide 30
3. Refer to the previous case for the particular value of k = 1.
The matrix H ∈M1,p is just a p-dimensional row vector that
we could denote by c>. Then:
i) n c>Σ̂c/(c>Σc) ∼ χ²_{n−1}
ii) n c>Σ^{−1}c/(c>Σ̂^{−1}c) ∼ χ²_{n−p}
UNSW MATH5855 2021T3 Lecture 3 Slide 31
4. Let us partition S = (1/(n−1))∑_{i=1}^n (Xi − X̄)(Xi − X̄)> ∈ Mp,p
into
S = (S11 S12; S21 S22), S11 ∈ Mr,r, r < p
Σ = (Σ11 Σ12; Σ21 Σ22), Σ11 ∈ Mr,r, r < p.
Further, denote
S1|2 = S11 − S12S22^{−1}S21, Σ1|2 = Σ11 − Σ12Σ22^{−1}Σ21.
Then it holds
(n − 1)S11 ∼ Wr(Σ11, n − 1)
(n − 1)S1|2 ∼ Wr(Σ1|2, n − p + r − 1)
UNSW MATH5855 2021T3 Lecture 3 Slide 32
3. Estimation of the Mean Vector and Covariance Matrix of
Multivariate Normal Distribution
3.2 Distributions of MLE of mean vector and covariance matrix of
multivariate normal distribution
3.2.1 Sampling distribution of X̄
3.2.2 Sampling distribution of the MLE of Σ
3.2.3 Aside: The Gram–Schmidt Process (not examinable)
UNSW MATH5855 2021T3 Lecture 3 Slide 33
Have: A = [a1, . . . , an] ∈Mn,n arbitrary full-rank matrix
Want: An orthogonal matrix whose first column ∝ a1
Gram–Schmidt process:
1. For each i = 2, . . . , n,
2. For each j = 1, . . . , i − 1,
3. Update ai ← ai − (〈ai, aj〉/〈aj, aj〉) aj.
4. For each k = 1, . . . , n,
5. Update ak ← ak/‖ak‖.
UNSW MATH5855 2021T3 Lecture 3 Slide 34
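The course's Example 3.5 implements this process in R; a Python sketch of the same steps (NumPy assumed; the function name `gram_schmidt` is ours) that keeps the first output column proportional to the first input column:

```python
import numpy as np

def gram_schmidt(A):
    """Orthonormalise the columns of A in order; column 0 of the result
    is proportional to column 0 of the input."""
    A = np.array(A, dtype=float)
    n = A.shape[1]
    for i in range(1, n):
        for j in range(i):   # subtract projections onto earlier columns
            A[:, i] -= (A[:, i] @ A[:, j]) / (A[:, j] @ A[:, j]) * A[:, j]
    return A / np.linalg.norm(A, axis=0)   # normalise every column

n = 5
# full-rank start whose first column is 1_n, as needed in Section 3.2.1
A = np.column_stack([np.ones(n), np.eye(n)[:, 1:]])
Q = gram_schmidt(A)
# Q is orthogonal and Q[:, 0] = 1_n / sqrt(n)
```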
Example 3.5.
Gram–Schmidt process implemented in R.
UNSW MATH5855 2021T3 Lecture 3 Slide 35
3. Estimation of the Mean Vector and Covariance Matrix of
Multivariate Normal Distribution
3.1 Maximum Likelihood Estimation
3.2 Distributions of MLE of mean vector and covariance matrix of
multivariate normal distribution
3.3 Additional resources
3.4 Exercises
UNSW MATH5855 2021T3 Lecture 3 Slide 36
Additional resources
I JW Sec. 4.3–4.5.
UNSW MATH5855 2021T3 Lecture 3 Slide 37
3. Estimation of the Mean Vector and Covariance Matrix of
Multivariate Normal Distribution
3.1 Maximum Likelihood Estimation
3.2 Distributions of MLE of mean vector and covariance matrix of
multivariate normal distribution
3.3 Additional resources
3.4 Exercises
UNSW MATH5855 2021T3 Lecture 3 Slide 38
Exercise 3.1
Find the product A ⊗ B if A = (1 2; 3 4), B = (5 0; 2 1).
UNSW MATH5855 2021T3 Lecture 3 Slide 39
Lecture 4: Confidence Intervals and Hypothesis Tests for
the Mean Vector
4.1 Hypothesis tests for the multivariate normal mean
4.2 Confidence regions for the mean vector and for its components
4.3 Comparison of two or more mean vectors
4.4 Software
4.5 Additional resources
4.6 Exercises
UNSW MATH5855 2021T3 Lecture 4 Slide 1
4. Confidence Intervals and Hypothesis Tests for the Mean Vector
4.1 Hypothesis tests for the multivariate normal mean
4.1.1 Hotelling’s T 2
4.1.2 Sampling distribution of T 2
4.1.3 Noncentral Wishart
4.1.4 T 2 as a likelihood ratio statistic
4.1.5 Wilks’ lambda and T 2
4.1.6 Numerical calculation of T 2
4.1.7 Asymptotic distribution of T 2
4.2 Confidence regions for the mean vector and for its components
4.3 Comparison of two or more mean vectors
4.4 Software
4.5 Additional resources
4.6 Exercises
UNSW MATH5855 2021T3 Lecture 4 Slide 2
4. Confidence Intervals and Hypothesis Tests for the Mean Vector
4.1 Hypothesis tests for the multivariate normal mean
4.1.1 Hotelling’s T 2
4.1.2 Sampling distribution of T 2
4.1.3 Noncentral Wishart
4.1.4 T 2 as a likelihood ratio statistic
4.1.5 Wilks’ lambda and T 2
4.1.6 Numerical calculation of T 2
4.1.7 Asymptotic distribution of T 2
UNSW MATH5855 2021T3 Lecture 4 Slide 3
Suppose:
I n independent realisations of p-dimensional random vectors
from Np(µ,Σ)
I Σ non-singular
I Data matrix:
x =
x11 x12 · · · x1j · · · x1n
x21 x22 · · · x2j · · · x2n
⋮ ⋮ ⋱ ⋮ ⋱ ⋮
xi1 xi2 · · · xij · · · xin
⋮ ⋮ ⋱ ⋮ ⋱ ⋮
xp1 xp2 · · · xpj · · · xpn
= [x1, x2, . . . , xn]
=⇒ (from Section 3.2)
I X̄ ∼ Np(µ, (1/n)Σ)
I nΣ̂ ∼ Wp(Σ, n − 1).
UNSW MATH5855 2021T3 Lecture 4 Slide 4
Testing contrasts
=⇒ Fix c ≠ 0 ∈ Rp
I c>X̄ ∼ N(c>µ, (1/n)c>Σc)
I n c>Σ̂c/(c>Σc) ∼ χ²_{n−1}
I X̄ and Σ̂ are independent
=⇒ T = √n c>(X̄ − µ)/√(c>(n/(n−1))Σ̂c) ∼ t_{n−1}
=⇒ Use to test contrasts
I E.g., under H0: c>µ = ∑_{i=1}^p ciµi = 0,
T = √n c>X̄/√(c>Sc)
I Does not depend on µ (for any H0 value).
I Reject H0 in favour of H1: c>µ = ∑_{i=1}^p ciµi ≠ 0 if
|T| > t_{1−α/2,n−1}.
I One-sided alternatives possible as well.
I One-sided alternatives possible as well.
UNSW MATH5855 2021T3 Lecture 4 Slide 5
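The contrast statistic T = √n c>X̄/√(c>Sc) takes a few lines in code. A NumPy sketch under an illustrative setup in which H0: c>µ = 0 is true, so T should look like a t_{n−1} draw:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 40, 3
# illustrative sample; true mean (1, 2, 3) satisfies µ1 + µ2 − µ3 = 0
x = rng.multivariate_normal([1.0, 2.0, 3.0], np.eye(3), size=n)

c = np.array([1.0, 1.0, -1.0])       # contrast: H0: µ1 + µ2 − µ3 = 0
xbar = x.mean(axis=0)
S = np.cov(x, rowvar=False)          # unbiased sample covariance (divisor n-1)
T = np.sqrt(n) * (c @ xbar) / np.sqrt(c @ S @ c)   # ~ t_{n-1} under H0
```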
Testing mean vectors (variance known)
I Suppose we know Σ (or σ2)
I p = 1, H0: µ = µ0 vs. H1: µ ≠ µ0, at level α =⇒
U = √n(X̄ − µ0)/σ
I Reject H0 if |U| > z_{1−α/2}
I |U| > c ⇔ U² = n(X̄ − µ0)(σ²)^{−1}(X̄ − µ0) > c²
=⇒ For p > 1:
I U² = n(X̄ − µ0)>Σ^{−1}(X̄ − µ0)
I Reject H0: µ = µ0 when U² is large enough.
I U² ∼ χ²p under H0
I Proved similarly to the proof of Property 5 of the multivariate
normal distribution (Section 2.2) and by using Theorem 3.3 of
Section 3.2.
UNSW MATH5855 2021T3 Lecture 4 Slide 6
Testing mean vectors (variance unknown)
I p = 1, t-test: H0: µ = µ0 vs. H1: µ ≠ µ0 at level α =⇒
T = √n(X̄ − µ0)/S where S² = (1/(n−1))∑_{i=1}^n (Xi − X̄)²
I Reject H0 if |T| > t_{1−α/2,n−1}
I Similar as before: check if T² = n(X̄ − µ0)(s²)^{−1}(X̄ − µ0) is
large enough
I Under H0, the statistic T² ∼ F_{1,n−1} =⇒ reject when
T² > F_{1−α;1,n−1}
Definition 4.1 (Hotelling's T²).
The statistic
T² = n(X̄ − µ0)>S^{−1}(X̄ − µ0) (4.1)
where X̄ = (1/n)∑_{i=1}^n Xi, S = (1/(n−1))∑_{i=1}^n (Xi − X̄)(Xi − X̄)>,
µ0 ∈ Rp, Xi ∈ Rp, i = 1, . . . , n, is named after Harold Hotelling.
UNSW MATH5855 2021T3 Lecture 4 Slide 7
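Definition 4.1 in code. A hedged NumPy sketch (the function name `hotelling_T2` is ours); it also returns the F-scaled version T²(n − p)/(p(n − 1)), which Theorem 4.2 (below) shows is F_{p,n−p} under H0:

```python
import numpy as np

def hotelling_T2(x, mu0):
    """Hotelling's T² for H0: µ = µ0; x is an (n, p) data matrix."""
    n, p = x.shape
    xbar = x.mean(axis=0)
    S = np.cov(x, rowvar=False)              # divisor n-1
    d = xbar - mu0
    T2 = n * d @ np.linalg.solve(S, d)       # n (x̄-µ0)' S^{-1} (x̄-µ0)
    F = T2 * (n - p) / (p * (n - 1))         # ~ F_{p, n-p} under H0
    return T2, F

rng = np.random.default_rng(6)
x = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], size=30)
T2, F = hotelling_T2(x, np.zeros(2))         # H0 true in this illustration
```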
4. Confidence Intervals and Hypothesis Tests for the Mean Vector
4.1 Hypothesis tests for the multivariate normal mean
4.1.1 Hotelling’s T 2
4.1.2 Sampling distribution of T 2
4.1.3 Noncentral Wishart
4.1.4 T 2 as a likelihood ratio statistic
4.1.5 Wilks’ lambda and T 2
4.1.6 Numerical calculation of T 2
4.1.7 Asymptotic distribution of T 2
UNSW MATH5855 2021T3 Lecture 4 Slide 8
Null distribution of the T 2 statistic
I Reject H0: µ = µ0 if the value of T² is high
I Turns out T² has an F distribution.
Theorem 4.2.
Under the null hypothesis H0: µ = µ0, Hotelling's T² is
distributed as ((n − 1)p/(n − p)) F_{p,n−p} where F_{p,n−p} denotes the
F-distribution with p and n − p degrees of freedom.
UNSW MATH5855 2021T3 Lecture 4 Slide 9
Proof.
I Write T² = [n(X̄ − µ0)>S^{−1}(X̄ − µ0)/(n(X̄ − µ0)>Σ^{−1}(X̄ − µ0))] · n(X̄ − µ0)>Σ^{−1}(X̄ − µ0)
I Let C = √n(X̄ − µ0); given C = c,
n(X̄ − µ0)>S^{−1}(X̄ − µ0)/(n(X̄ − µ0)>Σ^{−1}(X̄ − µ0)) = (c>S^{−1}c)/(c>Σ^{−1}c)
I Depends on data only through S^{−1}.
I nΣ̂ = (n − 1)S
I Recall, Section 3.2.2 Wishart distribution third property:
n c>Σ^{−1}c/(c>Σ̂^{−1}c) ∼ χ²_{n−p}
I Does not depend on c!
I n(X̄ − µ0)>Σ^{−1}(X̄ − µ0) ∼ χ²p depends on the data through
X̄ only =⇒ independent of the fraction
I T² =d χ²p(n − 1)/χ²_{n−p} (χ²'s independent)
=⇒ T²(n − p)/(p(n − 1)) ∼ F_{p,n−p} =⇒ T² ∼ (p(n − 1)/(n − p)) F_{p,n−p}
UNSW MATH5855 2021T3 Lecture 4 Slide 10
4. Confidence Intervals and Hypothesis Tests for the Mean Vector
4.1 Hypothesis tests for the multivariate normal mean
4.1.1 Hotelling’s T 2
4.1.2 Sampling distribution of T 2
4.1.3 Noncentral Wishart
4.1.4 T 2 as a likelihood ratio statistic
4.1.5 Wilks’ lambda and T 2
4.1.6 Numerical calculation of T 2
4.1.7 Asymptotic distribution of T 2
UNSW MATH5855 2021T3 Lecture 4 Slide 11
I Suppose Yi , i = 1, . . . , n have different µi s: Np(µi ,Σ).
=⇒ noncentral Wishart distribution with parameters Σ, p, n− 1, Γ
I i.e., Wp(Σ, n − 1, Γ)
I noncentrality parameter Γ = MM> ∈Mp,p where
M = [µ1,µ2, . . . ,µn]
I M = 0 =⇒ usual (central) Wishart distribution
I The distribution of T² = n(X̄ − µ0)>S^{−1}(X̄ − µ0) for µ ≠ µ0
leads to the noncentral F-distribution
I Used to study the power of the test of H0: µ = µ0 vs.
H1: µ ≠ µ0.
UNSW MATH5855 2021T3 Lecture 4 Slide 12
4. Confidence Intervals and Hypothesis Tests for the Mean Vector
4.1 Hypothesis tests for the multivariate normal mean
4.1.1 Hotelling’s T 2
4.1.2 Sampling distribution of T 2
4.1.3 Noncentral Wishart
4.1.4 T 2 as a likelihood ratio statistic
4.1.5 Wilks’ lambda and T 2
4.1.6 Numerical calculation of T 2
4.1.7 Asymptotic distribution of T 2
UNSW MATH5855 2021T3 Lecture 4 Slide 13
I Recall maximised likelihood (3.2):
L(x; µ̂, Σ̂) = (2π)^{−np/2}|Σ̂|^{−n/2} e^{−np/2}
I Under H0:
max_Σ L(x; µ0, Σ) = max_Σ (2π)^{−np/2}|Σ|^{−n/2} e^{−(1/2)∑_{i=1}^n (xi−µ0)>Σ^{−1}(xi−µ0)}
=⇒ log L(x; µ0, Σ) =
−(np/2) log(2π) − (n/2) log|Σ| − (1/2) tr[Σ^{−1}(∑_{i=1}^n (xi − µ0)(xi − µ0)>)]
I Anderson's Lemma (Theorem 3.1) =⇒
Σ̂0 = (1/n)∑_{i=1}^n (xi − µ0)(xi − µ0)>
=⇒ max_Σ L(x; µ0, Σ) = (2π)^{−np/2}|Σ̂0|^{−n/2} e^{−np/2}
=⇒
Λ = max_Σ L(x; µ0, Σ)/max_{µ,Σ} L(x; µ, Σ) = (|Σ̂|/|Σ̂0|)^{n/2} (4.2)
=⇒ Wilks' lambda: Λ^{2/n} = |Σ̂|/|Σ̂0|
I Reject H0: µ = µ0 when small.
UNSW MATH5855 2021T3 Lecture 4 Slide 14
4. Confidence Intervals and Hypothesis Tests for the Mean Vector
4.1 Hypothesis tests for the multivariate normal mean
4.1.1 Hotelling’s T 2
4.1.2 Sampling distribution of T 2
4.1.3 Noncentral Wishart
4.1.4 T 2 as a likelihood ratio statistic
4.1.5 Wilks’ lambda and T 2
4.1.6 Numerical calculation of T 2
4.1.7 Asymptotic distribution of T 2
UNSW MATH5855 2021T3 Lecture 4 Slide 15
The following theorem shows the relation between Wilks’ lambda
and T 2:
Theorem 4.3.
The likelihood ratio test is equivalent to the test based on T² since
Λ^{2/n} = (1 + T²/(n − 1))^{−1} holds.
UNSW MATH5855 2021T3 Lecture 4 Slide 16
Proof.
I Consider A ∈ M_{p+1,p+1}:
A = (∑_{i=1}^n (xi − x̄)(xi − x̄)> √n(x̄ − µ0); √n(x̄ − µ0)> −1) = (A11 A12; A21 A22)
I |A| = |A22||A11 − A12A22^{−1}A21| = |A11||A22 − A21A11^{−1}A12| (4.3)
=⇒
(−1)|∑_{i=1}^n (xi − x̄)(xi − x̄)> + n(x̄ − µ0)(x̄ − µ0)>| =
|∑_{i=1}^n (xi − x̄)(xi − x̄)>| · (−1 − n(x̄ − µ0)>(∑_{i=1}^n (xi − x̄)(xi − x̄)>)^{−1}(x̄ − µ0))
=⇒ (−1)|∑_{i=1}^n (xi − µ0)(xi − µ0)>| =
|∑_{i=1}^n (xi − x̄)(xi − x̄)>|(−1)(1 + T²/(n − 1))
=⇒ |Σ̂0| = |Σ̂|(1 + T²/(n − 1)) and
Λ^{2/n} = (1 + T²/(n − 1))^{−1} (4.4)
UNSW MATH5855 2021T3 Lecture 4 Slide 17
4. Confidence Intervals and Hypothesis Tests for the Mean Vector
4.1 Hypothesis tests for the multivariate normal mean
4.1.1 Hotelling’s T 2
4.1.2 Sampling distribution of T 2
4.1.3 Noncentral Wishart
4.1.4 T 2 as a likelihood ratio statistic
4.1.5 Wilks’ lambda and T 2
4.1.6 Numerical calculation of T 2
4.1.7 Asymptotic distribution of T 2
UNSW MATH5855 2021T3 Lecture 4 Slide 18
Hence H0 is rejected for small values of Λ^{2/n} or, equivalently, for
large values of T². The critical values for T² are determined from
Theorem 4.2. Relation (4.4) can be used to calculate T² from
Λ^{2/n} = |Σ̂|/|Σ̂0|, thus avoiding the need to invert the matrix S when
calculating T²!
UNSW MATH5855 2021T3 Lecture 4 Slide 19
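The determinant route via (4.4) can be checked numerically against direct inversion. A NumPy sketch (the data here are an illustrative H0-true sample; the common factor n^p in |Σ̂| and |Σ̂0| cancels, so the raw scatter matrices suffice):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 25, 3
x = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)
mu0 = np.zeros(p)

xbar = x.mean(axis=0)
A0 = (x - mu0).T @ (x - mu0)     # n Σ̂0 = Σ (x_i - µ0)(x_i - µ0)'
A = (x - xbar).T @ (x - xbar)    # n Σ̂  = Σ (x_i - x̄)(x_i - x̄)'
# |Σ̂0| = |Σ̂|(1 + T²/(n−1))  =>  T² = (n−1)(|A0|/|A| − 1)
T2_det = (n - 1) * (np.linalg.det(A0) / np.linalg.det(A) - 1)

# direct computation for comparison: invert S
S = A / (n - 1)
d = xbar - mu0
T2_direct = n * d @ np.linalg.solve(S, d)
```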
4. Confidence Intervals and Hypothesis Tests for the Mean Vector
4.1 Hypothesis tests for the multivariate normal mean
4.1.1 Hotelling’s T 2
4.1.2 Sampling distribution of T 2
4.1.3 Noncentral Wishart
4.1.4 T 2 as a likelihood ratio statistic
4.1.5 Wilks’ lambda and T 2
4.1.6 Numerical calculation of T 2
4.1.7 Asymptotic distribution of T 2
UNSW MATH5855 2021T3 Lecture 4 Slide 20
I S^{−1} consistent for Σ^{−1}
I T² →d χ²p, since T² ≈ n(X̄ − µ0)>Σ^{−1}(X̄ − µ0) ∼ χ²p under H0
I General asymptotic theory: −2 log Λ →d χ²p
I Here:
−2 log Λ = n log(1 + T²/(n − 1)) ≈ (n/(n − 1))T² ≈ T²
I using that log(1 + x) ≈ x for small x
UNSW MATH5855 2021T3 Lecture 4 Slide 21
4. Confidence Intervals and Hypothesis Tests for the Mean Vector
4.1 Hypothesis tests for the multivariate normal mean
4.2 Confidence regions for the mean vector and for its components
4.2.1 Confidence region for the mean vector
4.2.2 Simultaneous confidence statements
4.2.3 Simultaneous confidence ellipsoid
4.3 Comparison of two or more mean vectors
4.4 Software
4.5 Additional resources
4.6 Exercises
UNSW MATH5855 2021T3 Lecture 4 Slide 22
4. Confidence Intervals and Hypothesis Tests for the Mean Vector
4.2 Confidence regions for the mean vector and for its components
4.2.1 Confidence region for the mean vector
4.2.2 Simultaneous confidence statements
4.2.3 Simultaneous confidence ellipsoid
UNSW MATH5855 2021T3 Lecture 4 Slide 23
I A confidence region at CL (1 − α) · 100% can be constructed in the form
{µ | n(x̄ − µ)>S⁻¹(x̄ − µ) ≤ (p(n − 1)/(n − p)) F1−α,p,n−p}
I F1−α,p,n−p = upper α · 100% percentage point of the F
distribution with (p, n − p) df
I an ellipsoid in Rp centred at x̄
I axes of this confidence ellipsoid are directed along the
eigenvectors ei of the matrix S = 1/(n − 1) ∑ni=1(xi − x̄)(xi − x̄)>
I half-lengths of the axes are √λi √(p(n − 1)F1−α,p,n−p/(n(n − p))),
λi, i = 1, . . . , p being the corresponding eigenvalues.
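The axis construction can be checked numerically: the points x̄ ± (half-length)·ei should lie exactly on the boundary of the ellipsoid. A Python/NumPy/SciPy sketch with simulated data (the course uses R/SAS):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p, alpha = 42, 2, 0.05
X = rng.normal(size=(n, p))           # rows = observations
xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)

lam, E = np.linalg.eigh(S)            # eigenvalues (ascending), columns = e_i
Fcrit = stats.f.ppf(1 - alpha, p, n - p)
half = np.sqrt(lam) * np.sqrt(p * (n - 1) * Fcrit / (n * (n - p)))

# Boundary check: for mu = xbar + half_i * e_i, the quadratic form equals
# the critical value p(n-1)/(n-p) * F_{1-alpha,p,n-p}
crit = p * (n - 1) / (n - p) * Fcrit
qs = [n * (h * e) @ np.linalg.solve(S, h * e) for h, e in zip(half, E.T)]
print(qs, crit)
```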
Example 4.4.
Microwave ovens (Example 5.3., pages 221–223, JW).
UNSW MATH5855 2021T3 Lecture 4 Slide 24
4. Confidence Intervals and Hypothesis Tests for the Mean Vector
4.2 Confidence regions for the mean vector and for its components
4.2.1 Confidence region for the mean vector
4.2.2 Simultaneous confidence statements
4.2.3 Simultaneous confidence ellipsoid
UNSW MATH5855 2021T3 Lecture 4 Slide 25
I Ellipsoid in Section 4.2.1 is for the whole vector at once
I We might want CIs for each element, differences, etc.
=⇒ simultaneous confidence intervals
I X ∼ Np(µ,Σ) =⇒ ∀l ∈ Rp, l>X ∼ N1(l>µ, l>Σl)
=⇒ For any given l, the (1 − α) · 100% CI for l>µ is
(l>x̄ − t1−α/2,n−1 √(l>Sl/n), l>x̄ + t1−α/2,n−1 √(l>Sl/n)) (4.5)
I Can get elementwise CIs.
I I.e., |√n(l>x̄ − l>µ)/√(l>Sl)| ≤ t1−α/2,n−1 (or, equivalently,
n(l>x̄ − l>µ)²/(l>Sl) ≤ t²1−α/2,n−1)
I Not simultaneous: for that, a larger multiplier is needed on the
right-hand side of the inequality.
UNSW MATH5855 2021T3 Lecture 4 Slide 26
4. Confidence Intervals and Hypothesis Tests for the Mean Vector
4.2 Confidence regions for the mean vector and for its components
4.2.1 Confidence region for the mean vector
4.2.2 Simultaneous confidence statements
4.2.3 Simultaneous confidence ellipsoid
UNSW MATH5855 2021T3 Lecture 4 Slide 27
Theorem 4.5.
Simultaneously for all l ∈ Rp, the interval
(l>x̄ − √((p(n − 1)/(n(n − p))) F1−α,p,n−p l>Sl),
l>x̄ + √((p(n − 1)/(n(n − p))) F1−α,p,n−p l>Sl))
will contain l>µ with probability at least (1 − α).
Example 4.6.
Microwave Ovens (Example 5.4, p. 226 in JW).
UNSW MATH5855 2021T3 Lecture 4 Slide 28
Proof.
I [l>(x̄ − µ)]² = [(S^(1/2)l)>S^(−1/2)(x̄ − µ)]²
I Cauchy–Bunyakovsky–Schwarz inequality =⇒
[(S^(1/2)l)>S^(−1/2)(x̄ − µ)]² ≤ ‖S^(1/2)l‖²‖S^(−1/2)(x̄ − µ)‖²
I ‖S^(1/2)l‖²‖S^(−1/2)(x̄ − µ)‖² = (l>Sl)(x̄ − µ)>S⁻¹(x̄ − µ)
I =⇒ [l>(x̄ − µ)]² ≤ (l>Sl)(x̄ − µ)>S⁻¹(x̄ − µ) and
max_l n(l>(x̄ − µ))²/(l>Sl) ≤ n(x̄ − µ)>S⁻¹(x̄ − µ) = T² (4.6)
=⇒ If T² ≤ c², then n(l>x̄ − l>µ)²/(l>Sl) ≤ c² for any l ∈ Rp, l ≠ 0
=⇒ For every l,
l>x̄ − c√(l>Sl/n) ≤ l>µ ≤ l>x̄ + c√(l>Sl/n) (4.7)
I Choose c² = p(n − 1)F1−α,p,n−p/(n − p) to make sure that
1 − α = P(T² ≤ c²) holds
=⇒ (4.7) will contain l>µ with probability 1 − α.
UNSW MATH5855 2021T3 Lecture 4 Slide 29
Bonferroni Method
I More reliable than one-at-a-time intervals.
I Utilise the covariance structure of all p variables in their
construction.
I What if only a few intervals are needed?
I The Bonferroni method may then be efficient:
I Suppose we want to make m statements.
I Let Ci, i = 1, 2, . . . ,m be the ith confidence statement.
I P(Ci true) = 1 − αi =⇒
P(all Ci true) = 1 − P(at least one Ci false)
≥ 1 − ∑mi=1 P(Ci false) = 1 − ∑mi=1(1 − P(Ci true)) = 1 − (α1 + α2 + · · · + αm)
I Choose αi = α/m, i = 1, 2, . . . ,m (i.e., each at CL (1 − α/m) · 100%
instead of (1 − α) · 100%)
=⇒ The probability of any statement being false will not exceed α.
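The three multipliers (one-at-a-time t, Bonferroni t, and the T²-based simultaneous multiplier of Theorem 4.5) can be compared directly. A Python/SciPy sketch with illustrative n, p (the course uses R/SAS):

```python
import numpy as np
from scipy import stats

n, p, alpha = 30, 4, 0.05
m = p                                  # one statement per mean component

mult_one  = stats.t.ppf(1 - alpha/2, n - 1)          # one-at-a-time t
mult_bonf = stats.t.ppf(1 - alpha/(2*m), n - 1)      # Bonferroni t
mult_T2   = np.sqrt(p*(n - 1)/(n - p) * stats.f.ppf(1 - alpha, p, n - p))

print(mult_one, mult_bonf, mult_T2)
```

For these values the multipliers increase in that order, so with only a few statements the Bonferroni intervals are narrower than the T²-simultaneous ones while still controlling the overall error rate.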
Example 4.7.
Microwave Ovens (based on JW Example 5.4, p. 226).
UNSW MATH5855 2021T3 Lecture 4 Slide 30
4. Confidence Intervals and Hypothesis Tests for the Mean Vector
4.1 Hypothesis tests for the multivariate normal mean
4.2 Confidence regions for the mean vector and for its components
4.3 Comparison of two or more mean vectors
4.3.1 Reducing to a single population
4.3.2 The two-sample T 2-test
4.4 Software
4.5 Additional resources
4.6 Exercises
UNSW MATH5855 2021T3 Lecture 4 Slide 31
I Two samples, X1, X2, . . . , XnX ∈ Rp and Y1, Y2, . . . , YnY ∈ Rp
I means µX ∈ Rp and µY ∈ Rp
I variances ΣX ∈ Mp,p and ΣY ∈ Mp,p
I Test (typically) H0 : µX − µY = δ0.
I Multivariate ANOVA for comparing more than two
populations =⇒ Lecture 8.
UNSW MATH5855 2021T3 Lecture 4 Slide 32
4. Confidence Intervals and Hypothesis Tests for the Mean Vector
4.3 Comparison of two or more mean vectors
4.3.1 Reducing to a single population
4.3.2 The two-sample T 2-test
UNSW MATH5855 2021T3 Lecture 4 Slide 33
Paired Samples
I analogously to the paired t-test; let nX = nY = n
1. Take Di = Xi − Yi for i = 1, . . . , n.
2. Proceed as if with a 1-sample T² test:
T² = n(D̄ − δ0)>SD⁻¹(D̄ − δ0) ∼ ((n − 1)p/(n − p)) Fp,n−p, (4.8)
I D̄ = (1/n) ∑ni=1 Di ∈ Rp and SD = 1/(n − 1) ∑ni=1(Di − D̄)(Di − D̄)>
I Requires the differences to be MVN, or n large.
I “Multivariate” form: let the contrast matrix be
C = (+Ip | −Ip) ∈ Mp,p+p (row i has +1 in position i and −1 in position p + i)
=⇒ Di = C (Xi
Yi), H0 : C (µX
µY) = δ0.
=⇒ Test statistic reduces to (4.8).
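The paired procedure (difference, then a 1-sample T² as in (4.8)) can be sketched as follows in Python/NumPy/SciPy (simulated paired data; the course software is R/SAS):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, p = 25, 3
X = rng.normal(size=(n, p))
Y = X + rng.normal(scale=0.5, size=(n, p))   # paired with X

# Step 1: differences; step 2: one-sample T^2 on them
D = X - Y
dbar = D.mean(axis=0)
SD = np.cov(D, rowvar=False)
delta0 = np.zeros(p)

T2 = n * (dbar - delta0) @ np.linalg.solve(SD, dbar - delta0)
Fstat = T2 * (n - p) / ((n - 1) * p)         # scale to an F_{p, n-p} statistic
pval = stats.f.sf(Fstat, p, n - p)
print(T2, pval)
```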
UNSW MATH5855 2021T3 Lecture 4 Slide 34
Repeated Measures
I a series of p treatment outcomes on each sampling unit
I Xi : individual i ’s measurements as a vector
I Test whether all outcomes are the same in expectation:
1. Form
C =
1 −1… . . .
1 −1
∈Mp−1,p
2. Test H0 : CµX = 0p−1.
I CµX = 0p−1 ⇐⇒ all elements of µX are equal
UNSW MATH5855 2021T3 Lecture 4 Slide 35
4. Confidence Intervals and Hypothesis Tests for the Mean Vector
4.3 Comparison of two or more mean vectors
4.3.1 Reducing to a single population
4.3.2 The two-sample T 2-test
UNSW MATH5855 2021T3 Lecture 4 Slide 36
Independent Samples (pooled)
I Sometimes, X and Y are, in fact, independent samples
I As in the univariate case, assume either ΣX = ΣY = Σ (pooled) or not.
I If pooled,
Spooled = ((nX − 1)SX + (nY − 1)SY)/(nX + nY − 2)
Var(X̄ − Ȳ) = Σ/nX + Σ/nY ≈ Spooled (1/nX + 1/nY) =: S̄p
I X̄ − Ȳ ∼ Np(µX − µY, Σ(nX⁻¹ + nY⁻¹)) =⇒
T² = (X̄ − Ȳ − δ0)>S̄p⁻¹(X̄ − Ȳ − δ0) (4.9)
has distribution ((nX + nY − 2)p/(nX + nY − p − 1)) Fp,nX+nY−p−1
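The pooled two-sample statistic (4.9) can be sketched as follows (Python/NumPy/SciPy; simulated data with a true difference of zero, so a large p-value is typical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
p, nX, nY = 2, 20, 25
X = rng.normal(size=(nX, p))
Y = rng.normal(size=(nY, p))

SX, SY = np.cov(X, rowvar=False), np.cov(Y, rowvar=False)
Spooled = ((nX - 1)*SX + (nY - 1)*SY) / (nX + nY - 2)
Sbar = Spooled * (1/nX + 1/nY)               # estimated Var(Xbar - Ybar)

d = X.mean(axis=0) - Y.mean(axis=0)          # testing delta0 = 0
T2 = d @ np.linalg.solve(Sbar, d)
scale = (nX + nY - 2) * p / (nX + nY - p - 1)
pval = stats.f.sf(T2 / scale, p, nX + nY - p - 1)
print(T2, pval)
```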
UNSW MATH5855 2021T3 Lecture 4 Slide 37
I Construct a confidence region:
{δ | (x̄ − ȳ − δ)>S̄p⁻¹(x̄ − ȳ − δ) ≤ ((nX + nY − 2)p/(nX + nY − p − 1)) F1−α,p,nX+nY−p−1}
I Simultaneous contrast confidence intervals:
l>(x̄ − ȳ) ± √(((nX + nY − 2)p/(nX + nY − p − 1)) F1−α,p,nX+nY−p−1 l>S̄pl)
UNSW MATH5855 2021T3 Lecture 4 Slide 38
Independent Samples (unpooled)
I Var(X̄ − Ȳ) ≈ SX/nX + SY/nY =: S̄up
T² = (X̄ − Ȳ − δ0)>S̄up⁻¹(X̄ − Ȳ − δ0)
I Distribution of this T² is approximate: (νp/(ν − p + 1)) Fp,ν−p+1, with
ν = (p + p²) [∑2i=1 (1/ni) (tr{((1/ni)Si((1/n1)S1 + (1/n2)S2)⁻¹)²}
+ [tr{(1/ni)Si((1/n1)S1 + (1/n2)S2)⁻¹}]²)]⁻¹
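The approximate degrees of freedom ν can be computed directly from S1 and S2. A Python/NumPy sketch, assuming the reading of the formula above (sums of tr{A²} and (tr A)² terms over the two samples); all data are simulated:

```python
import numpy as np

rng = np.random.default_rng(4)
p, n1, n2 = 3, 15, 20
S1 = np.cov(rng.normal(size=(n1, p)), rowvar=False)
S2 = np.cov(rng.normal(size=(n2, p)), rowvar=False)

Sup = S1/n1 + S2/n2                  # unpooled variance estimate of Xbar - Ybar
Sup_inv = np.linalg.inv(Sup)

denom = 0.0
for ni, Si in ((n1, S1), (n2, S2)):
    A = (Si / ni) @ Sup_inv
    denom += (np.trace(A @ A) + np.trace(A)**2) / ni
nu = (p + p**2) / denom
print(nu)
```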
UNSW MATH5855 2021T3 Lecture 4 Slide 39
I Confidence regions:
{δ | (x̄ − ȳ − δ)>S̄up⁻¹(x̄ − ȳ − δ) ≤ (νp/(ν − p + 1)) F1−α,p,ν−p+1}
I Simultaneous contrast confidence intervals:
l>(x̄ − ȳ) ± √((νp/(ν − p + 1)) F1−α,p,ν−p+1 l>S̄upl)
UNSW MATH5855 2021T3 Lecture 4 Slide 40
4. Confidence Intervals and Hypothesis Tests for the Mean Vector
4.1 Hypothesis tests for the multivariate normal mean
4.2 Confidence regions for the mean vector and for its components
4.3 Comparison of two or more mean vectors
4.4 Software
4.5 Additional resources
4.6 Exercises
UNSW MATH5855 2021T3 Lecture 4 Slide 41
R: car::confidenceEllipse, package Hotelling,
rrcov::T2.test, ergm::approx.hotelling.diff.test,
MVTests::TwoSamplesHT2
SAS: See IML implementations.
UNSW MATH5855 2021T3 Lecture 4 Slide 42
4. Confidence Intervals and Hypothesis Tests for the Mean Vector
4.1 Hypothesis tests for the multivariate normal mean
4.2 Confidence regions for the mean vector and for its components
4.3 Comparison of two or more mean vectors
4.4 Software
4.5 Additional resources
4.6 Exercises
UNSW MATH5855 2021T3 Lecture 4 Slide 43
Additional resources
I JW Sec. 5.1–5.5, 6.
UNSW MATH5855 2021T3 Lecture 4 Slide 44
4. Confidence Intervals and Hypothesis Tests for the Mean Vector
4.1 Hypothesis tests for the multivariate normal mean
4.2 Confidence regions for the mean vector and for its components
4.3 Comparison of two or more mean vectors
4.4 Software
4.5 Additional resources
4.6 Exercises
UNSW MATH5855 2021T3 Lecture 4 Slide 45
Exercise 4.1
Suppose X1,X2, . . . ,Xn are independent Np(µ,Σ) random vectors
with sample mean vector X̄ and sample covariance matrix S . We
wish to test the hypothesis
H0 : µ2 − µ1 = µ3 − µ2 = · · · = µp − µp−1 = 1
where µ1, µ2, . . . , µp are the elements of µ.
(a) Determine a (p − 1)× p matrix C so that H0 may be written
equivalently as H0 : Cµ = 1 where 1 is a (p− 1)× 1 vector of
ones.
(b) Make an appropriate transformation of the vectors
Xi , i = 1, 2, . . . , n and hence find the rejection region of a size
α test of H0 in terms of X̄ , S , and C .
UNSW MATH5855 2021T3 Lecture 4 Slide 46
Exercise 4.2
A sample of 50 vector observations, each containing three
components, is drawn from a normal distribution having covariance
matrix
Σ =
(3 1 1
 1 4 1
 1 1 2).
The components of the sample mean are 0.8, 1.1 and 0.6. Can you
reject the null hypothesis that the distribution mean is zero against a
general alternative?
UNSW MATH5855 2021T3 Lecture 4 Slide 47
Exercise 4.3
Evaluate Hotelling’s statistic T² for testing H0 : µ = (7, 11)> using
the data matrix
X =
(2 8 6 8
 12 9 9 10).
Test the hypothesis H0 at level α = 0.05. What conclusion is reached?
UNSW MATH5855 2021T3 Lecture 4 Slide 48
Exercise 4.4
Let X1, . . . ,Xn1 be i.i.d. Np(µ1,Σ) independently of Y1, . . . ,Yn2 i.i.d.
Np(µ2,Σ), Σ known. Prove that X̄ ∼ Np(µ1, (1/n1)Σ) and
Ȳ ∼ Np(µ2, (1/n2)Σ). Hence W = X̄ − Ȳ ∼ N(µ1 − µ2, (1/n1 + 1/n2)Σ)
so that X̄ − Ȳ − (µ1 − µ2) ∼ N(0, (1/n1 + 1/n2)Σ). Construct a test of
H0 : µ1 = µ2.
UNSW MATH5855 2021T3 Lecture 4 Slide 49
Exercise 4.5
Let X̄ and S be based on n observations from Np(µ,Σ) and let X
be an additional observation from Np(µ,Σ). Show that
X − X̄ ∼ Np(0, (1 + 1/n)Σ). Find the distribution of
(n/(n + 1))(X − X̄)>S⁻¹(X − X̄) and suggest how to use this result to
give a (1 − α) prediction region for X based on X̄ and S (i.e., a
region in Rp such that one has a given confidence (1 − α) that the
next observation will fall into it).
UNSW MATH5855 2021T3 Lecture 4 Slide 50
Lecture 5: Correlation, Partial Correlation, Multiple
Correlation
5.1 Partial correlation
5.2 Multiple correlation
5.3 Testing of correlation coefficients
5.4 Additional resources
5.5 Exercises
UNSW MATH5855 2021T3 Lecture 5 Slide 1
I Correlation = undirected measure of linear dependence
I Prediction also incorporates direction of dependence
I Direction of dependence must be inferred from substantive
reasoning
I When temporal data available, may be detected from data
(e.g., Granger Causality)
UNSW MATH5855 2021T3 Lecture 5 Slide 2
I Correlation measures only linear dependence
=⇒ Uncorrelated variables may still be dependent
I (But independent variables are always uncorrelated.)
I For jointly MVN variables specifically,
uncorrelated⇐⇒ independent
UNSW MATH5855 2021T3 Lecture 5 Slide 3
In general, there are 3 types of correlation coefficients:
I The usual correlation coefficient between 2 variables
I Partial correlation coefficient between 2 variables after
adjusting for the effect (regression, association) of a set of
other variables.
I Multiple correlation between a single random variable and a
set of p other variables
UNSW MATH5855 2021T3 Lecture 5 Slide 4
5. Correlation, Partial Correlation, Multiple Correlation
5.1 Partial correlation
5.1.1 Simple formulae
5.1.2 Software
5.1.3 Examples
5.2 Multiple correlation
5.3 Testing of correlation coefficients
5.4 Additional resources
5.5 Exercises
UNSW MATH5855 2021T3 Lecture 5 Slide 5
“The Usual” Correlation
I For X ∼ Np(µ,Σ)
I ρij = σij/(√σii √σjj), i, j = 1, 2, . . . , p
I MLE ρ̂ij in (3.6) coincides with the sample correlations rij (1.3)
UNSW MATH5855 2021T3 Lecture 5 Slide 6
Partial Correlation Coefficients
I Section 2.2 MVN Property 4:
I Divide X ∈ Rp into X = (X(1)
X(2)),
X(1) ∈ Rr, r < p, X(2) ∈ Rp−r, MVN with
µ = (µ(1)
µ(2)), Σ = (Σ11 Σ12
Σ21 Σ22)
I Σ22 full rank
=⇒
X(1)|X(2) = x(2) ∼ Nr(µ(1) + Σ12Σ22⁻¹(x(2) − µ(2)), Σ11 − Σ12Σ22⁻¹Σ21).
UNSW MATH5855 2021T3 Lecture 5 Slide 7
partial correlations of X(1) given X(2) = x(2): the usual correlation
coefficients calculated from the elements σij.(r+1),(r+2),...,p of the
matrix Σ1|2 = Σ11 − Σ12Σ22⁻¹Σ21, i.e.
ρij.(r+1),(r+2),...,p = σij.(r+1),(r+2),...,p/(√σii.(r+1),(r+2),...,p √σjj.(r+1),(r+2),...,p) (5.1)
ρij.(r+1),(r+2),...,p is the correlation of the ith and jth components
when the components (r + 1), (r + 2), etc. up to the pth (i.e.
the last p − r components) have been held fixed.
=⇒ association (correlation) between the ith and jth components
after eliminating the effect that the last p − r components
might have had on this association
UNSW MATH5855 2021T3 Lecture 5 Slide 8
Estimation
I Invariance property of the MLE =⇒
ρ̂ij.(r+1),(r+2),...,p = σ̂ij.(r+1),(r+2),...,p/(√σ̂ii.(r+1),(r+2),...,p √σ̂jj.(r+1),(r+2),...,p),
i, j = 1, 2, . . . , r
will be the ML estimators of ρij.(r+1),(r+2),...,p, i, j = 1, 2, . . . , r.
UNSW MATH5855 2021T3 Lecture 5 Slide 9
5. Correlation, Partial Correlation, Multiple Correlation
5.1 Partial correlation
5.1.1 Simple formulae
5.1.2 Software
5.1.3 Examples
UNSW MATH5855 2021T3 Lecture 5 Slide 10
I Few variables involved =⇒ simple formulas:
i) partial correlation between the first and second variable by
adjusting for the effect of the third:
ρ12.3 = (ρ12 − ρ13ρ23)/√((1 − ρ²13)(1 − ρ²23)).
ii) partial correlation between the first and second variable by
adjusting for the effects of the third and fourth variables:
ρ12.3,4 = (ρ12.4 − ρ13.4ρ23.4)/√((1 − ρ²13.4)(1 − ρ²23.4)).
I More variables =⇒ software.
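The first-order formula can be checked against the matrix route (partial correlations read off Σ11 − Σ12Σ22⁻¹Σ21). A Python/NumPy sketch with a made-up 3×3 correlation matrix:

```python
import numpy as np

def partial_corr(r12, r13, r23):
    """First-order partial correlation rho_{12.3}."""
    return (r12 - r13*r23) / np.sqrt((1 - r13**2) * (1 - r23**2))

# Matrix route for the same quantity
R = np.array([[1.0, 0.5, 0.3],
              [0.5, 1.0, 0.4],
              [0.3, 0.4, 1.0]])
S11, S12 = R[:2, :2], R[:2, 2:]
S21, S22 = R[2:, :2], R[2:, 2:]
C = S11 - S12 @ np.linalg.inv(S22) @ S21      # Sigma_{1|2}
rho_matrix = C[0, 1] / np.sqrt(C[0, 0] * C[1, 1])
rho_formula = partial_corr(0.5, 0.3, 0.4)
print(rho_matrix, rho_formula)                # identical
```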
UNSW MATH5855 2021T3 Lecture 5 Slide 11
5. Correlation, Partial Correlation, Multiple Correlation
5.1 Partial correlation
5.1.1 Simple formulae
5.1.2 Software
5.1.3 Examples
UNSW MATH5855 2021T3 Lecture 5 Slide 12
SAS: PROC CORR
R: ggm::pcor, ggm::parcor
UNSW MATH5855 2021T3 Lecture 5 Slide 13
5. Correlation, Partial Correlation, Multiple Correlation
5.1 Partial correlation
5.1.1 Simple formulae
5.1.2 Software
5.1.3 Examples
UNSW MATH5855 2021T3 Lecture 5 Slide 14
Example 5.1.
Three variables have been measured for a set of schoolchildren:
i) X1: Intelligence
ii) X2: Weight
iii) X3: Age
The number of observations was large enough that one can
assume the empirical correlation matrix ρ̂ ∈ M3,3 to be the true
correlation matrix:
ρ̂ =
(1      0.6162 0.8267
 0.6162 1      0.7321
 0.8267 0.7321 1).
This suggests there is a high degree of positive dependence between
weight and intelligence. But (do the calculation!)
ρ̂12.3 = 0.0286, so that, after the effect of age is adjusted for, there
is virtually no correlation between weight and intelligence, i.e.
weight evidently plays little part in explaining intelligence.
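The calculation in Example 5.1 takes two lines in Python/NumPy:

```python
import numpy as np

rho = np.array([[1.0,    0.6162, 0.8267],
                [0.6162, 1.0,    0.7321],
                [0.8267, 0.7321, 1.0]])
r12, r13, r23 = rho[0, 1], rho[0, 2], rho[1, 2]

# rho_{12.3}: weight vs intelligence, adjusting for age
r12_3 = (r12 - r13*r23) / np.sqrt((1 - r13**2) * (1 - r23**2))
print(round(r12_3, 4))   # 0.0286
```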
UNSW MATH5855 2021T3 Lecture 5 Slide 15
5. Correlation, Partial Correlation, Multiple Correlation
5.1 Partial correlation
5.2 Multiple correlation
5.2.1 Multiple correlation coefficient as ordinary correlation coefficient of
transformed data
5.2.2 Interpretation of R
5.2.3 Remark about the calculation of R2
5.2.4 Examples
5.3 Testing of correlation coefficients
5.4 Additional resources
5.5 Exercises
UNSW MATH5855 2021T3 Lecture 5 Slide 16
Recall (Section 2.2):
I Predict a random variable Y from X = (X1 X2 · · · Xp)>
I Minimise the value E[{Y − g(X)}²|X = x] =⇒ g*(X) = E(Y|X)
I Y and X jointly MVN =⇒ linear: g*(x) = b + σ0>C⁻¹x,
where
I b = E(Y) − σ0>C⁻¹ E(X)
I C = Var(X)
I σ0 = Cov(Y, X)
=⇒ C⁻¹σ0 ∈ Rp = vector of the regression coefficients
multiple correlation coefficient between Y ∈ R and X ∈ Rp:
maximum correlation between Y and any linear combination
α>X, α ∈ Rp.
UNSW MATH5855 2021T3 Lecture 5 Slide 17
5. Correlation, Partial Correlation, Multiple Correlation
5.2 Multiple correlation
5.2.1 Multiple correlation coefficient as ordinary correlation coefficient of
transformed data
5.2.2 Interpretation of R
5.2.3 Remark about the calculation of R2
5.2.4 Examples
UNSW MATH5855 2021T3 Lecture 5 Slide 18
Lemma 5.2.
The multiple correlation coefficient is the ordinary correlation
coefficient between Y and σ0>C⁻¹X ≡ β*>X. (I.e., β* ≡ C⁻¹σ0.)
Proof.
I ∀α ∈ Rp, Cov(Y, α>X) = α>σ0 = α>(CC⁻¹σ0) = α>Cβ*
I Set α = β* =⇒ Cov(Y, β*>X) = β*>Cβ*
I Cauchy–Bunyakovsky–Schwarz inequality =⇒
[Cov(α>X, β*>X)]² ≤ Var(α>X) Var(β*>X)
I Write down [Cov(Y, α>X)]² ≡ σ²Y σ²α>X ρ²Y,α>X =⇒
σ²Y ρ²Y,α>X = (α>σ0)²/(α>Cα) = (α>Cβ*)²/(α>Cα) ≤ β*>Cβ*
I Equality when α = β*.
=⇒ ρ²Y,α>X of Y and α>X is maximised over α when α = β*.
UNSW MATH5855 2021T3 Lecture 5 Slide 19
Coefficient of Determination
I maximum correlation between Y and any linear combination
α>X, α ∈ Rp, is R = √(β*>Cβ*/σ²Y)
=⇒ the multiple correlation coefficient
I R² is the coefficient of determination
I β* = C⁻¹σ0 =⇒ R = √(σ0>C⁻¹σ0/σ²Y)
I Let Σ = Var(Y
X) = (σ²Y σ0>
σ0  C) = (Σ11 Σ12
Σ21 Σ22).
I MLE Σ̂ = (Σ̂11 Σ̂12
Σ̂21 Σ̂22) =⇒ MLE R̂ = √(Σ̂12Σ̂22⁻¹Σ̂21/Σ̂11)
UNSW MATH5855 2021T3 Lecture 5 Slide 20
5. Correlation, Partial Correlation, Multiple Correlation
5.2 Multiple correlation
5.2.1 Multiple correlation coefficient as ordinary correlation coefficient of
transformed data
5.2.2 Interpretation of R
5.2.3 Remark about the calculation of R2
5.2.4 Examples
UNSW MATH5855 2021T3 Lecture 5 Slide 21
I From Section 2.2, the minimal value of the MSE predicting Y from a
linear function of X is σ²Y − σ0>C⁻¹σ0.
I Equivalent to σ²Y(1 − R²)
=⇒ When R² = 0, no predictive power at all; when R² = 1, Y
can be predicted without any error (it is an exact linear
function of X).
UNSW MATH5855 2021T3 Lecture 5 Slide 22
5. Correlation, Partial Correlation, Multiple Correlation
5.2 Multiple correlation
5.2.1 Multiple correlation coefficient as ordinary correlation coefficient of
transformed data
5.2.2 Interpretation of R
5.2.3 Remark about the calculation of R2
5.2.4 Examples
UNSW MATH5855 2021T3 Lecture 5 Slide 23
I What if only the correlation matrix is available?
I Then for ρ^YY ≡ (ρ⁻¹)11, where ρ ∈ Mp+1,p+1 is the correlation
matrix determined from Σ as defined earlier,
1 − R² = 1/ρ^YY (5.2)
I Recall (4.3):
1 − R² = (σ²Y − σ0>C⁻¹σ0)/σ²Y = (|C|/|C|) · (σ²Y − σ0>C⁻¹σ0)/σ²Y = |Σ|/(|C|σ²Y)
I Recall from Section 0.1.2: (X⁻¹)ji = (|X−(i,j)|/|X|)(−1)^(i+j)
=⇒ |C|/|Σ| = σ^YY ≡ (Σ⁻¹)11
I Now, ρ = V^(−1/2)ΣV^(−1/2) for V = diag(σ²Y, c11, . . . , cpp)
I ρ⁻¹ = V^(1/2)Σ⁻¹V^(1/2) =⇒ ρ^YY = σ^YY σ²Y
=⇒ (5.2)
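Identity (5.2) is easy to verify numerically. A Python/NumPy sketch, reusing the covariance matrix of Example 5.3 as the joint covariance of (Y, X1, X2):

```python
import numpy as np

Sigma = np.array([[10., 1., -1.],
                  [ 1., 7.,  3.],
                  [-1., 3.,  2.]])
s2Y, sigma0, C = Sigma[0, 0], Sigma[1:, 0], Sigma[1:, 1:]

# R^2 from the covariance route
R2 = sigma0 @ np.linalg.solve(C, sigma0) / s2Y

# rho = V^{-1/2} Sigma V^{-1/2}; then 1 - R^2 = 1 / (rho^{-1})_{11}
v = np.sqrt(np.diag(Sigma))
rho = Sigma / np.outer(v, v)
rho_YY = np.linalg.inv(rho)[0, 0]

print(1 - R2, 1/rho_YY)   # both equal 0.7 here
```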
UNSW MATH5855 2021T3 Lecture 5 Slide 24
5. Correlation, Partial Correlation, Multiple Correlation
5.2 Multiple correlation
5.2.1 Multiple correlation coefficient as ordinary correlation coefficient of
transformed data
5.2.2 Interpretation of R
5.2.3 Remark about the calculation of R2
5.2.4 Examples
UNSW MATH5855 2021T3 Lecture 5 Slide 25
Example 5.3.
Let µ =
(µY
 µX1
 µX2) =
(5
 2
 0) and
Σ =
(10  1 −1
  1  7  3
 −1  3  2) =
(σYY σ0>
 σ0  ΣXX). Calculate:
(a) The best linear prediction of Y using X1 and X2.
(b) The multiple correlation coefficient R²Y.(X1,X2).
(c) The mean squared error of the best linear predictor.
UNSW MATH5855 2021T3 Lecture 5 Slide 26
Solution
β* = ΣXX⁻¹σ0 = (7 3
3 2)⁻¹ (1
−1) = (.4 −.6
−.6 1.4) (1
−1) = (1
−2)
and
b = µY − β*>µX = 5 − (1, −2)(2
0) = 3.
Hence the best linear predictor is given by 3 + X1 − 2X2. The value
of:
RY.(X1,X2) = √((1, −1)(.4 −.6
−.6 1.4)(1
−1)/10) = √(3/10) = .548
The mean squared error of prediction is:
σ²Y(1 − R²Y.(X1,X2)) = 10(1 − 3/10) = 7.
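The solution to Example 5.3 can be reproduced in a few lines of Python/NumPy:

```python
import numpy as np

SXX = np.array([[7., 3.],
                [3., 2.]])
sigma0 = np.array([1., -1.])
muX = np.array([2., 0.])
muY, s2Y = 5., 10.

beta = np.linalg.solve(SXX, sigma0)   # regression coefficients beta*
b = muY - beta @ muX                  # intercept
R2 = sigma0 @ beta / s2Y              # squared multiple correlation
mse = s2Y * (1 - R2)                  # MSE of the best linear predictor
print(beta, b, R2, mse)               # [1, -2], 3, 0.3, 7
```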
UNSW MATH5855 2021T3 Lecture 5 Slide 27
Example 5.4.
Relationship between multiple correlation and regression, and
equivalent ways of computing it.
UNSW MATH5855 2021T3 Lecture 5 Slide 28
5. Correlation, Partial Correlation, Multiple Correlation
5.1 Partial correlation
5.2 Multiple correlation
5.3 Testing of correlation coefficients
5.3.1 Usual correlation coefficients
5.3.2 Partial correlation coefficients
5.3.3 Multiple correlation coefficients
5.3.4 Software
5.3.5 Examples
5.4 Additional resources
5.5 Exercises
UNSW MATH5855 2021T3 Lecture 5 Slide 29
5. Correlation, Partial Correlation, Multiple Correlation
5.3 Testing of correlation coefficients
5.3.1 Usual correlation coefficients
5.3.2 Partial correlation coefficients
5.3.3 Multiple correlation coefficients
5.3.4 Software
5.3.5 Examples
UNSW MATH5855 2021T3 Lecture 5 Slide 30
I A bivariate problem.
I H0 : ρij = 0 =⇒ statistic T = rij√((n − 2)/(1 − r²ij)) ∼ tn−2
=⇒ a t-test
I Otherwise, exact distribution complicated.
=⇒ Fisher’s Z transformation: Z = (1/2) log[(1 + rij)/(1 − rij)]
I H0 : ρij = ρ0 =⇒ Z ≈ N((1/2) log[(1 + ρ0)/(1 − ρ0)], 1/(n − 3))
I How would you test for equality of two correlation coefficients
from two independent samples?
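Both routes (exact t-test for ρ0 = 0, Fisher Z otherwise) can be sketched as one small function (Python/SciPy; the function name and interface are illustrative):

```python
import numpy as np
from scipy import stats

def corr_test(r, n, rho0=0.0):
    """Two-sided p-value for H0: rho = rho0 from a sample correlation r."""
    if rho0 == 0.0:
        # exact t-test: T = r sqrt((n-2)/(1-r^2)) ~ t_{n-2}
        t = r * np.sqrt((n - 2) / (1 - r**2))
        return 2 * stats.t.sf(abs(t), n - 2)
    # Fisher's Z: Z approx N(a, 1/(n-3)) with a = atanh(rho0)
    z = 0.5 * np.log((1 + r) / (1 - r))
    a = 0.5 * np.log((1 + rho0) / (1 - rho0))
    return 2 * stats.norm.sf(np.sqrt(n - 3) * abs(z - a))

p0 = corr_test(0.5, 30)         # H0: rho = 0  -> small p-value
p1 = corr_test(0.5, 30, 0.5)    # H0 matches r -> p-value 1
print(p0, p1)
```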
UNSW MATH5855 2021T3 Lecture 5 Slide 31
5. Correlation, Partial Correlation, Multiple Correlation
5.3 Testing of correlation coefficients
5.3.1 Usual correlation coefficients
5.3.2 Partial correlation coefficients
5.3.3 Multiple correlation coefficients
5.3.4 Software
5.3.5 Examples
UNSW MATH5855 2021T3 Lecture 5 Slide 32
I Not much has to be changed.
I To test H0 : ρij.r+1,r+2,...,r+k = ρ0 versus
H1 : ρij.r+1,r+2,...,r+k ≠ ρ0 (i.e., given k variables):
1. Construct Z = (1/2) log[(1 + rij.r+1,r+2,...,r+k)/(1 − rij.r+1,r+2,...,r+k)]
and a = (1/2) log[(1 + ρ0)/(1 − ρ0)].
2. Asymptotically Z ∼ N(a, 1/(n − k − 3)).
3. Calculate √(n − k − 3)|Z − a| and test against the standard normal.
I For ρ0 = 0, the t-test can still be used, with n − 2 replaced by n − k − 2.
UNSW MATH5855 2021T3 Lecture 5 Slide 33
5. Correlation, Partial Correlation, Multiple Correlation
5.3 Testing of correlation coefficients
5.3.1 Usual correlation coefficients
5.3.2 Partial correlation coefficients
5.3.3 Multiple correlation coefficients
5.3.4 Software
5.3.5 Examples
UNSW MATH5855 2021T3 Lecture 5 Slide 34
I Under H0 : R = 0,
F = (R̂²/(1 − R̂²)) × ((n − p)/(p − 1)) ∼ Fp−1,n−p
I Equivalent to the ANOVA F-test: R̂² ≡ SSR/SST =⇒ 1 − R̂² ≡ SSE/SST
=⇒ F = ((SSR/SST)/(SSE/SST)) × ((n − p)/(p − 1)) = (SSR/(p − 1))/(SSE/(n − p)) = MSR/MSE
I Here, p is the total number of all variables (the output Y and
all of the input variables in the input vector X).
I In Section 5.2, it was just the dimension of X alone.
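The equivalence of the R²-based statistic and the ANOVA F can be confirmed on simulated regression data (Python/NumPy/SciPy; here p = 3 counts Y plus two inputs, matching the convention on this slide):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, p = 50, 3                       # p = Y plus two input variables
X = rng.normal(size=(n, p - 1))
y = X @ np.array([1.0, -0.5]) + rng.normal(size=n)

# Least squares with intercept
Xd = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(Xd, y, rcond=None)
SSE = np.sum((y - Xd @ coef)**2)
SST = np.sum((y - y.mean())**2)
R2 = 1 - SSE/SST

F_from_R2 = R2/(1 - R2) * (n - p)/(p - 1)
F_anova = ((SST - SSE)/(p - 1)) / (SSE/(n - p))   # MSR/MSE
pval = stats.f.sf(F_from_R2, p - 1, n - p)
print(F_from_R2, F_anova, pval)
```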
UNSW MATH5855 2021T3 Lecture 5 Slide 35
5. Correlation, Partial Correlation, Multiple Correlation
5.3 Testing of correlation coefficients
5.3.1 Usual correlation coefficients
5.3.2 Partial correlation coefficients
5.3.3 Multiple correlation coefficients
5.3.4 Software
5.3.5 Examples
UNSW MATH5855 2021T3 Lecture 5 Slide 36
SAS: PROC CORR
R: ggm::pcor.test
UNSW MATH5855 2021T3 Lecture 5 Slide 37
5. Correlation, Partial Correlation, Multiple Correlation
5.3 Testing of correlation coefficients
5.3.1 Usual correlation coefficients
5.3.2 Partial correlation coefficients
5.3.3 Multiple correlation coefficients
5.3.4 Software
5.3.5 Examples
UNSW MATH5855 2021T3 Lecture 5 Slide 38
Example 5.5.
Testing ordinary correlations: age, height, and intelligence.
UNSW MATH5855 2021T3 Lecture 5 Slide 39
Example 5.6.
Testing partial correlations: age, height, and intelligence.
UNSW MATH5855 2021T3 Lecture 5 Slide 40
5. Correlation, Partial Correlation, Multiple Correlation
5.1 Partial correlation
5.2 Multiple correlation
5.3 Testing of correlation coefficients
5.4 Additional resources
5.5 Exercises
UNSW MATH5855 2021T3 Lecture 5 Slide 41
Additional resources
I JW Sec. 7.8.
UNSW MATH5855 2021T3 Lecture 5 Slide 42
5. Correlation, Partial Correlation, Multiple Correlation
5.1 Partial correlation
5.2 Multiple correlation
5.3 Testing of correlation coefficients
5.4 Additional resources
5.5 Exercises
UNSW MATH5855 2021T3 Lecture 5 Slide 43
Exercise 5.1 I
Suppose X ∼ N4(µ,Σ) where µ =
(1
 2
 3
 4) and
Σ =
(3 1 0  1
 1 4 0  0
 0 0 1  4
 1 0 4 20). Determine:
(a) the distribution of
(X1
 X2
 X3
 X1 + X2 + X4);
(b) the conditional mean and variance of X1 given x2, x3, and x4;
(c) the partial correlation coefficients ρ12.3, ρ12.4;
UNSW MATH5855 2021T3 Lecture 5 Slide 44
Exercise 5.1 II
(d) the multiple correlation between X1 and (X2,X3,X4).
Compare it to ρ12 and comment.
(e) Justify that
(X2
 X3
 X4) is independent of
X1 − (1 0 1)
(4 0  0
 0 1  4
 0 4 20)⁻¹
(X2
 X3
 X4).
UNSW MATH5855 2021T3 Lecture 5 Slide 45
Exercise 5.2
A random vector X ∼ N3(µ,Σ) with µ =
(2
 −3
 1) and
Σ =
(1 1 1
 1 3 2
 1 2 2).
(a) Find the distribution of 3X1 − 2X2 + X3.
(b) Find a vector a ∈ R² such that X2 and X2 − a>(X1
X3) are
independent.
UNSW MATH5855 2021T3 Lecture 5 Slide 46
Lecture 6: Principal Components Analysis
6.1 Introduction
6.2 Precise mathematical formulation
6.3 Estimation of the Principal Components
6.4 Deciding how many principal components to include
6.5 Software
6.6 Examples
6.7 PCA and Factor Analysis
6.8 Application to finance: Portfolio optimisation
6.9 Additional resources
6.10 Exercises
UNSW MATH5855 2021T3 Lecture 6 Slide 1
6. Principal Components Analysis
6.1 Introduction
6.2 Precise mathematical formulation
6.3 Estimation of the Principal Components
6.4 Deciding how many principal components to include
6.5 Software
6.6 Examples
6.7 PCA and Factor Analysis
6.8 Application to finance: Portfolio optimisation
6.9 Additional resources
6.10 Exercises
UNSW MATH5855 2021T3 Lecture 6 Slide 2
Motivation
I mainly a variable reduction procedure
I applied when there is a large number of possibly highly
correlated variables
I “condense” the information, reducing redundancy, by
“summarising” it in a few transformations
I principal components are artificial variables (constructs) that
account for most of the variability in the observed variables.
I formulated as linear combinations of variables
I try to absorb as much variation as possible
I can use more than one
I can then be used in subsequent procedures
I also a form of factor analysis
UNSW MATH5855 2021T3 Lecture 6 Slide 3
6. Principal Components Analysis
6.1 Introduction
6.2 Precise mathematical formulation
6.3 Estimation of the Principal Components
6.4 Deciding how many principal components to include
6.5 Software
6.6 Examples
6.7 PCA and Factor Analysis
6.8 Application to finance: Portfolio optimisation
6.9 Additional resources
6.10 Exercises
UNSW MATH5855 2021T3 Lecture 6 Slide 4
Definition
I X ∼ Np(µ,Σ), p relatively large
Goal: find a linear combination α1>X with α1 ∈ Rp s.t.
I Var(α1>X) is maximised
I i.e., Var(α1>X) = α1>Σα1
I but ‖α1‖² = α1>α1 = 1
Derivation
i) construct the Lagrangian function
Lag(α1, λ) = α1>Σα1 + λ(1 − α1>α1)
where λ ∈ R is the Lagrange multiplier;
ii) take the partial derivative with respect to α1 and equate it to
zero:
2Σα1 − 2λα1 = 0 =⇒ (Σ − λIp)α1 = 0 (6.1)
From (6.1) we see that α1 must be an eigenvector of Σ and,
since we know from the first lecture what the maximal value
of α>Σα/(α>α) is, we conclude that α1 should be the eigenvector
that corresponds to the largest eigenvalue λ̄1 of Σ. The
random variable α1>X is called the first principal
component.
UNSW MATH5855 2021T3 Lecture 6 Slide 6
Second Principal Component
I Var(α2>X) is maximised, s.t.
I α2>α2 = 1
I Cov(α1>X, α2>X) = α1>Σα2 = 0
I Lagrange function:
Lag1(α2, λ1, λ2) = α2>Σα2 + λ1(1 − α2>α2) + λ2α1>Σα2
I Differentiate w.r.t. α2 and set to 0:
2Σα2 − 2λ1α2 + λ2Σα1 = 0 (6.2)
I Pre-multiply (6.2) by α1> and use α2>α2 = 1 and
α2>Σα1 = 0:
−2λ1α1>α2 + λ2α1>Σα1 = 0 =⇒ λ2 = 0
I WHY? Hint: α1 is an eigenvector of Σ.
I Then (6.2) =⇒ (Σ − λ1Ip)α2 = 0 =⇒ α2 an eigenvector
of Σ with eigenvalue λ1
I To maximise variance, α2>Σα2 = α2>α2λ1 = λ1,
=⇒ α2 is the normalised eigenvector that corresponds to the
second largest eigenvalue λ̄2 of Σ.
I Repeat to get α3 with λ̄3, etc.
UNSW MATH5855 2021T3 Lecture 6 Slide 7
All principal components
I Extracting all p PCs =⇒
∑pi=1 Var(αi>X) = ∑pi=1 λ̄i = tr(Σ) = Σ11 + · · · + Σpp
=⇒ Taking a small number k < p of PCs =⇒ explaining
((Var(α1>X) + · · · + Var(αk>X))/(Σ11 + · · · + Σpp)) × 100%
= ((λ̄1 + · · · + λ̄k)/(Σ11 + · · · + Σpp)) × 100%
of the total population variance Σ11 + · · · + Σpp.
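The whole construction reduces to an eigendecomposition: eigenvectors are the PC weights, eigenvalues the PC variances, and their sum is tr(Σ). A Python/NumPy sketch with a made-up covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(6)
Sigma = np.array([[4.0, 1.2, 0.5],
                  [1.2, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])
p = Sigma.shape[0]

lam, A = np.linalg.eigh(Sigma)      # ascending eigenvalues, orthonormal columns
lam, A = lam[::-1], A[:, ::-1]      # reorder so lam[0] is the largest

total = lam.sum()                   # equals tr(Sigma)
k = 2
psi_k = lam[:k].sum() / np.trace(Sigma)   # proportion explained by first k PCs

# alpha_1 maximises the Rayleigh quotient a'Sigma a over unit vectors a:
for _ in range(200):
    v = rng.normal(size=p)
    v /= np.linalg.norm(v)
    assert v @ Sigma @ v <= lam[0] + 1e-9

print(total, psi_k)
```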
UNSW MATH5855 2021T3 Lecture 6 Slide 8
6. Principal Components Analysis
6.1 Introduction
6.2 Precise mathematical formulation
6.3 Estimation of the Principal Components
6.4 Deciding how many principal components to include
6.5 Software
6.6 Examples
6.7 PCA and Factor Analysis
6.8 Application to finance: Portfolio optimisation
6.9 Additional resources
6.10 Exercises
UNSW MATH5855 2021T3 Lecture 6 Slide 9
Estimation
I In practice, Σ is estimated.
I PCs from Σ are influenced by the scale of measurement.
I Large variance =⇒ large component in the first PC
=⇒ Alternative is to use the correlation matrix ρ instead.
I I.e., standardise the variables first:
Zi = ((X1i − X̄1)/√s11, (X2i − X̄2)/√s22, · · · , (Xpi − X̄p)/√spp)>, i = 1, . . . , n.
=⇒ standardised observations matrix Z = [Z1, Z2, . . . , Zn] ∈ Mp,n
gives Z̄ = (1/n)Z1n = 0 and a sample covariance matrix
SZ = (1/(n − 1))ZZ> = R.
Example 6.1 (Eigenvalues obtained from Covariance
and Correlation Matrices: see JW p. 437).
It demonstrates the great effect standardisation may have on the
principal components. The relative magnitudes of the weights after
standardisation (i.e. from ρ) may come to be in direct opposition to
the weights attached to the same variables in the principal
component obtained from Σ.
UNSW MATH5855 2021T3 Lecture 6 Slide 10
6. Principal Components Analysis
6.1 Introduction
6.2 Precise mathematical formulation
6.3 Estimation of the Principal Components
6.4 Deciding how many principal components to include
6.5 Software
6.6 Examples
6.7 PCA and Factor Analysis
6.8 Application to finance: Portfolio optimisation
6.9 Additional resources
6.10 Exercises
UNSW MATH5855 2021T3 Lecture 6 Slide 11
Selecting k
Wanted: k as small as possible, but ψk = (λ̄1 + · · · + λ̄k)/(λ̄1 + · · · + λ̄p)
as large as possible: a trade-off.
“scree plot”: Plot λ̄k against k and see where it levels out.
Total variation explained:
1. Choose a constant c ∈ (0, 1).
I Usually c = 0.9, but this is an arbitrary choice.
2. Choose the smallest k s.t. ψk ≥ c.
Kaiser’s rule: Keep those components that (individually) explain at least
(1/p) · 100% of the total variance.
I I.e., exclude if Var(αi>Z) < 1.
I I.e., exclude if λ̄i is below the average of the eigenvalues.
I Popular, but hard to defend on theoretical grounds.
UNSW MATH5855 2021T3 Lecture 6 Slide 12
Formal tests of significance
I It does not make sense to test H0 : λ̄k+1 = · · · = λ̄p = 0
I H0 =⇒ Σ singular =⇒ Σ̂ singular =⇒ the estimated λ̄j for
j = k + 1, . . . , p are also zero a.s.
I Instead test H0 : λ̄k+1 = · · · = λ̄p
I any common value
I a more quantitative version of the scree test
I One possible test:
1. Compute:
A0 = arithmetic mean of the last p − k estimated eigenvalues
G0 = geometric mean of the last p − k estimated eigenvalues
2. −2 log Λ = n(p − k) log(A0/G0) ∼ χ²ν where ν = (p − k + 2)(p − k − 1)/2
(asymptotically).
I More details in Mardia, Kent, and Bibby (1979, pp. 235–237),
including a more precise form.
I Requires MVN.
I Only valid if based on S; conservative for R.
UNSW MATH5855 2021T3 Lecture 6 Slide 13
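The test statistic above is straightforward to compute from the sample eigenvalues; note A0 ≥ G0 always holds (AM–GM), so the statistic is nonnegative. A Python/NumPy/SciPy sketch on data simulated under H0 (Σ = I, so all eigenvalues are equal):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, p, k = 100, 5, 2
X = rng.normal(size=(n, p))                      # Sigma = I: H0 holds
lam = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]

tail = lam[k:]                                   # last p - k eigenvalues
A0 = tail.mean()                                 # arithmetic mean
G0 = np.exp(np.log(tail).mean())                 # geometric mean
stat = n * (p - k) * np.log(A0 / G0)
nu = (p - k + 2) * (p - k - 1) // 2
pval = stats.chi2.sf(stat, nu)
print(stat, pval)
```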
6. Principal Components Analysis
6.1 Introduction
6.2 Precise mathematical formulation
6.3 Estimation of the Principal Components
6.4 Deciding how many principal components to include
6.5 Software
6.6 Examples
6.7 PCA and Factor Analysis
6.8 Application to finance: Portfolio optimisation
6.9 Additional resources
6.10 Exercises
UNSW MATH5855 2021T3 Lecture 6 Slide 14
SAS PROC PRINCOMP, PROC FACTOR
R stats::prcomp, stats::princomp, or about a half-dozen other
implementations
UNSW MATH5855 2021T3 Lecture 6 Slide 15
6. Principal Components Analysis
6.1 Introduction
6.2 Precise mathematical formulation
6.3 Estimation of the Principal Components
6.4 Deciding how many principal components to include
6.5 Software
6.6 Examples
6.7 PCA and Factor Analysis
6.8 Application to finance: Portfolio optimisation
6.9 Additional resources
6.10 Exercises
UNSW MATH5855 2021T3 Lecture 6 Slide 16
Example 6.2.
The Crime Rates example will be discussed at the lecture. The
data give crime rates per 100,000 people in seven categories for
each of the 50 states of the USA in 1997. Principal components are
used to summarise the 7-dimensional data in 2 or 3 dimensions
only and to help visualise and interpret the data.
UNSW MATH5855 2021T3 Lecture 6 Slide 17
6. Principal Components Analysis
6.1 Introduction
6.2 Precise mathematical formulation
6.3 Estimation of the Principal Components
6.4 Deciding how many principal components to include
6.5 Software
6.6 Examples
6.7 PCA and Factor Analysis
6.8 Application to finance: Portfolio optimisation
6.9 Additional resources
6.10 Exercises
UNSW MATH5855 2021T3 Lecture 6 Slide 18
Relationship to Factor Analysis
I PCA can provide an initial view in factor analysis.
I But Principal component analysis ≠ Factor analysis.
Factor analysis: The covariation in the observed variables is due to
the presence of one or more latent variables (factors) that exert
causal influence on the observed variables, and we wish to infer
them.
I More on this later.
PCA: There is no prior assumption about an underlying causal
model: we just wish to reduce the number of variables.
UNSW MATH5855 2021T3 Lecture 6 Slide 19
6. Principal Components Analysis
6.1 Introduction
6.2 Precise mathematical formulation
6.3 Estimation of the Principal Components
6.4 Deciding how many principal components to include
6.5 Software
6.6 Examples
6.7 PCA and Factor Analysis
6.8 Application to finance: Portfolio optimisation
6.9 Additional resources
6.10 Exercises
UNSW MATH5855 2021T3 Lecture 6 Slide 20
Mean–Variance-Efficient Portfolio
I Other problems in MV statistics lead to similar approaches to
PCA.
I p-dimensional vector X of returns of the p assets is given,
with E X = µ, Var(X ) = Σ.
I A portfolio with these assets with weights (c1, c2, . . . , cp)
(with ∑_{i=1}^p ci = 1):
I has return Q = c>X .
I has expected return EQ = c>µ.
I has risk Var(Q) = c>Σc .
I We want to choose the weights c :
I to achieve prespecified expected return µ̄,
I while minimising the risk.
UNSW MATH5855 2021T3 Lecture 6 Slide 21
Derivation
I Lagrangian function:
Lag(λ1, λ2) = c>Σc + λ1(µ̄− c>µ) + λ2(1− c>1p) (6.3)
I Differentiate (6.3) w.r.t. c :
2Σc − λ1µ − λ2 1p = 0 (6.4)
I Suppose no riskless asset with a fixed return.
=⇒ Σ is pos. def. and Σ−1 exists
I Then, solving for c :
c = (1/2) Σ^(-1)(λ1µ + λ2 1p) (6.5)
UNSW MATH5855 2021T3 Lecture 6 Slide 22
I Pre-multiply by 1p>:
1 = (1/2) 1p> Σ^(-1)(λ1µ + λ2 1p) (6.6)
I Solve for λ2 = (2 − λ1 1p>Σ^(-1)µ) / (1p>Σ^(-1)1p)
I Substitute into (6.5):
c = (λ1/2) (Σ^(-1)µ − ((1p>Σ^(-1)µ)/(1p>Σ^(-1)1p)) Σ^(-1)1p)
+ Σ^(-1)1p / (1p>Σ^(-1)1p) (6.7)
I Pre-multiply (6.5) by µ> and use µ>c = µ̄:
λ1 = (2µ̄ − λ2 µ>Σ^(-1)1p) / (µ>Σ^(-1)µ)
=⇒ linear system of 2 equations w.r.t. λ1 and λ2 can be solved
and substituted into (6.7) to get the final expression for c
using µ, µ̄ and Σ.
UNSW MATH5855 2021T3 Lecture 6 Slide 23
Variance-Efficient Portfolio
I no prespecified mean return
=⇒ only required to minimise the variance
=⇒ in (6.3), λ1 = 0
I From (6.7) optimal weights for the variance efficient portfolio:
copt = Σ^(-1)1p / (1p>Σ^(-1)1p).
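A minimal NumPy sketch of the variance-efficient weights (the function name is ours; the course software is SAS and R):

```python
import numpy as np

def variance_efficient_weights(Sigma):
    """Minimum-variance portfolio: c_opt = Sigma^{-1} 1_p / (1_p' Sigma^{-1} 1_p)."""
    ones = np.ones(Sigma.shape[0])
    w = np.linalg.solve(Sigma, ones)  # Sigma^{-1} 1_p without forming the inverse
    return w / (ones @ w)
```

For example, for Σ = diag(1, 2, 4) the weights are proportional to (1, 1/2, 1/4), i.e. (4/7, 2/7, 1/7).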
UNSW MATH5855 2021T3 Lecture 6 Slide 24
6. Principal Components Analysis
6.1 Introduction
6.2 Precise mathematical formulation
6.3 Estimation of the Principal Components
6.4 Deciding how many principal components to include
6.5 Software
6.6 Examples
6.7 PCA and Factor Analysis
6.8 Application to finance: Portfolio optimisation
6.9 Additional resources
6.10 Exercises
UNSW MATH5855 2021T3 Lecture 6 Slide 25
Additional resources
I JW Ch. 8.
UNSW MATH5855 2021T3 Lecture 6 Slide 26
6. Principal Components Analysis
6.1 Introduction
6.2 Precise mathematical formulation
6.3 Estimation of the Principal Components
6.4 Deciding how many principal components to include
6.5 Software
6.6 Examples
6.7 PCA and Factor Analysis
6.8 Application to finance: Portfolio optimisation
6.9 Additional resources
6.10 Exercises
UNSW MATH5855 2021T3 Lecture 6 Slide 27
Exercise 6.1
A random vector Y = (Y1, Y2, Y3)> is normally distributed with zero
mean vector and
Σ =
( 1   ρ/2 0 )
( ρ/2 1   ρ )
( 0   ρ   1 )
where ρ is positive.
(a) Find the coefficients of the first principal component and the
variance of that component. What percentage of the overall
variability does it explain?
(b) Find the joint distribution of Y1,Y2 and Y1 + Y2 + Y3.
(c) Find the conditional distribution of Y1,Y2 given Y3 = y3.
(d) Find the multiple correlation of Y3 with Y1,Y2.
UNSW MATH5855 2021T3 Lecture 6 Slide 28
Lecture 7: Canonical Correlation Analysis
7.1 Introduction
7.2 Application in testing for independence of sets of variables
7.3 Precise mathematical formulation and solution to the problem
7.4 Estimating and testing canonical correlations
7.5 Software
7.6 Some important computational issues
7.7 Examples
7.8 Additional resources
7.9 Exercises
UNSW MATH5855 2021T3 Lecture 7 Slide 1
7. Canonical Correlation Analysis
7.1 Introduction
7.2 Application in testing for independence of sets of variables
7.3 Precise mathematical formulation and solution to the problem
7.4 Estimating and testing canonical correlations
7.5 Software
7.6 Some important computational issues
7.7 Examples
7.8 Additional resources
7.9 Exercises
UNSW MATH5855 2021T3 Lecture 7 Slide 2
Canonical Correlation Analysis
I Two sets of variables.
I Interested in association between the sets.
I Consider the largest possible correlation between any linear
combination of variables in the first set and any linear
combination of variables in the second set.
=⇒ first canonical variables and first canonical correlation
I Like PCA, can take second canonical variables/correlation,
etc.
I Like PCA, summarising complex relationships in small number
of correlations on combinations of variables.
I Correlation coefficient is a special case. (Both sets contain
only one variable each.)
I Multiple correlation coefficient is a special case. (One of the
sets contains only one variable.)
UNSW MATH5855 2021T3 Lecture 7 Slide 3
7. Canonical Correlation Analysis
7.1 Introduction
7.2 Application in testing for independence of sets of variables
7.3 Precise mathematical formulation and solution to the problem
7.4 Estimating and testing canonical correlations
7.5 Software
7.6 Some important computational issues
7.7 Examples
7.8 Additional resources
7.9 Exercises
UNSW MATH5855 2021T3 Lecture 7 Slide 4
I Under MVN, independence ⇐⇒ uncorrelatedness
I Suppose X ∼ Np(µ,Σ), partitioned into two components
X (1) ∈ Rr ,X (2) ∈ Rq, with r + q = p.
I Partition Σ =
( Σ11 Σ12 )
( Σ21 Σ22 )
and assume Σ, Σ11, and Σ22 nonsingular
I To test H0 : Σ12 = 0,
1. For fixed vectors a ∈ Rr, b ∈ Rq let Z1 = a>X (1) and
Z2 = b>X (2), giving
ρa,b = Cor(Z1, Z2) = a>Σ12b / √(a>Σ11a · b>Σ22b).
2. H0 is equivalent to H0 : ∀a∈Rr, b∈Rq ρa,b = 0.
I For given a, b, H0 would not be rejected if
|ra,b| = |a>S12b| / √(a>S11a · b>S22b) ≤ k for a certain
critical value k.
=⇒ Acceptance region for H0 would be given in the form
{X ∈ Mp,n : maxa,b r²a,b ≤ k²}.
=⇒ I.e., maximise (a>S12b)² under constraints
a>S11a = b>S22b = 1: canonical correlation.
UNSW MATH5855 2021T3 Lecture 7 Slide 5
7. Canonical Correlation Analysis
7.1 Introduction
7.2 Application in testing for independence of sets of variables
7.3 Precise mathematical formulation and solution to the problem
7.4 Estimating and testing canonical correlations
7.5 Software
7.6 Some important computational issues
7.7 Examples
7.8 Additional resources
7.9 Exercises
UNSW MATH5855 2021T3 Lecture 7 Slide 6
I Wanted: Z1 = a>X (1) and Z2 = b>X (2) where
a ∈ Rr, b ∈ Rq are obtained by
I maximising (a>Σ12b)²
I s.t. a>Σ11a = b>Σ22b = 1.
I Lagrangian:
Lag(a, b, λ1, λ2) = (a>Σ12b)² + λ1(a>Σ11a − 1) + λ2(b>Σ22b − 1)
I Differentiate w.r.t. a and b (individually):
2(a>Σ12b)Σ12b + 2λ1Σ11a = 0 ∈ Rr (7.1)
2(a>Σ12b)Σ21a + 2λ2Σ22b = 0 ∈ Rq (7.2)
I Pre-multiply (7.1) by the vector a> and (7.2) by b> and subtract
them:
λ1 = λ2 = −(a>Σ12b)² = −µ²
=⇒
Σ12b = µΣ11a (7.3)
Σ21a = µΣ22b (7.4)
UNSW MATH5855 2021T3 Lecture 7 Slide 7
I Pre-multiply (7.3) by Σ21Σ11^(-1), both sides of (7.4) by the
scalar µ, and add:
(Σ21Σ11^(-1)Σ12 − µ²Σ22)b = 0 (7.5)
I For (7.5) to have a solution with b ≠ 0 =⇒
|Σ21Σ11^(-1)Σ12 − µ²Σ22| = 0 (7.6)
I Multiply both sides by |Σ22^(-1/2)|² and use the
product-of-determinants rule:
|Σ22^(-1/2)| |Σ21Σ11^(-1)Σ12 − µ²Σ22| |Σ22^(-1/2)|
= |Σ22^(-1/2)Σ21Σ11^(-1)Σ12Σ22^(-1/2) − µ²Iq| = 0
=⇒ µ² is an eigenvalue of Σ22^(-1/2)Σ21Σ11^(-1)Σ12Σ22^(-1/2)
=⇒ b = Σ22^(-1/2)b̂ where b̂ is the eigenvector of
Σ22^(-1/2)Σ21Σ11^(-1)Σ12Σ22^(-1/2) corresponding to this
eigenvalue (WHY?!).
UNSW MATH5855 2021T3 Lecture 7 Slide 8
I Want to maximise a>Σ12b ∧ µ² is an eigenvalue of
Σ22^(-1/2)Σ21Σ11^(-1)Σ12Σ22^(-1/2) (or Σ22^(-1)Σ21Σ11^(-1)Σ12)
=⇒ it is the largest eigenvalue
I From (7.3), a = (1/µ) Σ11^(-1)Σ12b
=⇒ First canonical variables Z1 = a>X (1) and Z2 = b>X (2) are
determined and the value of the first canonical correlation is
just µ.
I The orientation (sign) of b is chosen such that the sign of µ is
positive.
I Second canonical correlation = second largest eigenvalue
I Automatically ensures second pair of canonical variables is
uncorrelated with the first
I Can have at most min(q, r) canonical correlations
I Usually much fewer used.
I First canonical correlation ≥ highest multiple correlation
between any variable and the opposite set of variables.
=⇒ Often the first canonical correlation is large, with subsequent
ones small.
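The derivation above can be sketched numerically (Python/NumPy rather than the course's SAS/R; the function name is ours): the canonical correlations are the square roots of the eigenvalues of Σ22^(-1/2)Σ21Σ11^(-1)Σ12Σ22^(-1/2). The covariance matrix of Exercise 7.4 makes a handy check, since there the first canonical correlation should be 0.95.

```python
import numpy as np

def canonical_correlations(Sigma, r):
    """Canonical correlations between the first r and the remaining coordinates:
    square roots of the eigenvalues of
    Sigma22^{-1/2} Sigma21 Sigma11^{-1} Sigma12 Sigma22^{-1/2}."""
    S11, S12 = Sigma[:r, :r], Sigma[:r, r:]
    S21, S22 = Sigma[r:, :r], Sigma[r:, r:]
    w, V = np.linalg.eigh(S22)                  # spectral decomposition of S22
    S22_mh = V @ np.diag(w ** -0.5) @ V.T       # symmetric S22^{-1/2}
    M = S22_mh @ S21 @ np.linalg.solve(S11, S12) @ S22_mh
    mu2 = np.sort(np.linalg.eigvalsh(M))[::-1]  # mu^2, largest first
    return np.sqrt(np.clip(mu2, 0.0, None))
```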
UNSW MATH5855 2021T3 Lecture 7 Slide 9
7. Canonical Correlation Analysis
7.1 Introduction
7.2 Application in testing for independence of sets of variables
7.3 Precise mathematical formulation and solution to the problem
7.4 Estimating and testing canonical correlations
7.5 Software
7.6 Some important computational issues
7.7 Examples
7.8 Additional resources
7.9 Exercises
UNSW MATH5855 2021T3 Lecture 7 Slide 10
I Estimate as in Section 7.3, substituting Sij in place of Σij .
I To test H0 : ∀a∈Rr, b∈Rq ρa,b = 0 (in Section 7.2), accept for
{X ∈ Mp,n : largest eigenvalue of
S22^(-1/2)S21S11^(-1)S12S22^(-1/2) ≤ kα}
I The constant kα has been worked out and is given in the
so-called Heck’s charts. This distribution depends on three
parameters:
I s = min(r , q)
I m = (|r − q| − 1)/2
I N = (n − r − q − 2)/2, where n is the sample size
I Can also use good F -distribution-based approximations for
(transformations of) this distribution like Wilks’s lambda,
Pillai’s trace, the Lawley–Hotelling trace, and Roy’s greatest
root.
UNSW MATH5855 2021T3 Lecture 7 Slide 11
7. Canonical Correlation Analysis
7.1 Introduction
7.2 Application in testing for independence of sets of variables
7.3 Precise mathematical formulation and solution to the problem
7.4 Estimating and testing canonical correlations
7.5 Software
7.6 Some important computational issues
7.7 Examples
7.8 Additional resources
7.9 Exercises
UNSW MATH5855 2021T3 Lecture 7 Slide 12
SAS PROC CANCORR
R
I stats::cancor
I package CCA for computing and visualisation
I package CCP for testing canonical correlations
UNSW MATH5855 2021T3 Lecture 7 Slide 13
7. Canonical Correlation Analysis
7.1 Introduction
7.2 Application in testing for independence of sets of variables
7.3 Precise mathematical formulation and solution to the problem
7.4 Estimating and testing canonical correlations
7.5 Software
7.6 Some important computational issues
7.7 Examples
7.8 Additional resources
7.9 Exercises
UNSW MATH5855 2021T3 Lecture 7 Slide 14
I Calculating X^(-1/2) and X^(1/2) for a symm. pos. def. matrix X
using the spectral decomposition may be numerically unstable
I I.e., when the condition number λ̄1/λ̄p is high.
I Cholesky decomposition (Section 0.1.6) can be used
instead.
I In (7.5), let U>U = Σ22^(-1)
=⇒ µ² is an eigenvalue of the matrix A = UΣ21Σ11^(-1)Σ12U>.
I I.e., pre-multiplying by U and post-multiplying by U> in (7.6):
|A − µ²UΣ22U>| = 0
I But UΣ22U> = U(U>U)^(-1)U> = UU^(-1)(U>)^(-1)U> = I holds.
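A sketch of the Cholesky route (an assumed NumPy implementation; name ours): with Σ22 = LL>, taking U = L^(-1) gives U>U = Σ22^(-1), and the µ² are the eigenvalues of A = UΣ21Σ11^(-1)Σ12U>.

```python
import numpy as np

def canonical_corr_cholesky(Sigma, r):
    """Canonical correlations via the Cholesky factor of Sigma22 instead of
    its spectral inverse square root."""
    S11, S12 = Sigma[:r, :r], Sigma[:r, r:]
    S21, S22 = Sigma[r:, :r], Sigma[r:, r:]
    L = np.linalg.cholesky(S22)              # S22 = L L', so U = L^{-1}
    US21 = np.linalg.solve(L, S21)           # U Sigma21
    A = US21 @ np.linalg.solve(S11, US21.T)  # U S21 S11^{-1} S12 U' (S12 = S21')
    mu2 = np.sort(np.linalg.eigvalsh(A))[::-1]
    return np.sqrt(np.clip(mu2, 0.0, None))
```

On the covariance matrix of Exercise 7.4 this agrees with the spectral-decomposition route.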
UNSW MATH5855 2021T3 Lecture 7 Slide 15
7. Canonical Correlation Analysis
7.1 Introduction
7.2 Application in testing for independence of sets of variables
7.3 Precise mathematical formulation and solution to the problem
7.4 Estimating and testing canonical correlations
7.5 Software
7.6 Some important computational issues
7.7 Examples
7.8 Additional resources
7.9 Exercises
UNSW MATH5855 2021T3 Lecture 7 Slide 16
Example 7.1.
Canonical Correlation Analysis of the Fitness Club Data.
Three physiological and three exercise variables were measured on
twenty middle-aged men in a fitness club. Canonical correlation is
used to determine if the physiological variables are related in any
way to the exercise variables.
UNSW MATH5855 2021T3 Lecture 7 Slide 17
Example 7.2.
JW Example 10.4, p. 552. Studying canonical correlations
between leg and head bone measurements: X1,X2 are skull length
and skull breadth, respectively; X3,X4 are leg bone measurements:
femur and tibia length, respectively. Observations have been taken
on n = 276 White Leghorn chickens. The example is chosen to also
illustrate how a canonical correlation analysis can be performed
when the original data is not given but the empirical correlation
matrix (or empirical covariance matrix) is available.
UNSW MATH5855 2021T3 Lecture 7 Slide 18
7. Canonical Correlation Analysis
7.1 Introduction
7.2 Application in testing for independence of sets of variables
7.3 Precise mathematical formulation and solution to the problem
7.4 Estimating and testing canonical correlations
7.5 Software
7.6 Some important computational issues
7.7 Examples
7.8 Additional resources
7.9 Exercises
UNSW MATH5855 2021T3 Lecture 7 Slide 19
Additional resources
I JW Ch. 10.
UNSW MATH5855 2021T3 Lecture 7 Slide 20
7. Canonical Correlation Analysis
7.1 Introduction
7.2 Application in testing for independence of sets of variables
7.3 Precise mathematical formulation and solution to the problem
7.4 Estimating and testing canonical correlations
7.5 Software
7.6 Some important computational issues
7.7 Examples
7.8 Additional resources
7.9 Exercises
UNSW MATH5855 2021T3 Lecture 7 Slide 21
Exercise 7.1
Let the components of X correspond to scores on tests in
arithmetic speed (X1), arithmetic power (X2), memory for words
(X3), memory for meaningful symbols (X4), and memory for
meaningless symbols (X5). The observed correlations in a sample
of 140 are
1.0000 0.4248 0.0420 0.0215 0.0573
       1.0000 0.1487 0.2489 0.2843
              1.0000 0.6693 0.4662
                     1.0000 0.6915
                            1.0000
(upper triangle shown; the matrix is symmetric).
Find the canonical correlations and canonical variates between the
first two variates and the last three variates. Comment. Write a
SAS-IML or R code to implement the required calculations.
UNSW MATH5855 2021T3 Lecture 7 Slide 22
Exercise 7.2
Students sit 5 different papers, two of which are closed book and
the rest open book. For the 88 students who sat these exams the
sample covariance matrix is
S =
302.3 125.8 100.4 105.1 116.1
      170.9  84.2  93.6  97.9
            111.6 110.8 120.5
                  217.9 153.8
                        294.4
(upper triangle shown; the matrix is symmetric).
Find the canonical correlations and canonical variates between the
first two variates (closed book exams) and the last three variates
(open book exams). Comment.
UNSW MATH5855 2021T3 Lecture 7 Slide 23
Exercise 7.3
A random vector X ∼ N4(µ,Σ) with µ = (0, 0, 0, 0)> and
Σ =
( 1  2ρ ρ  ρ  )
( 2ρ 1  ρ  ρ  )
( ρ  ρ  1  2ρ )
( ρ  ρ  2ρ 1  )
where ρ is a small enough positive constant.
(a) Find the two canonical correlations between (X1, X2)> and
(X3, X4)>. Comment.
(b) Find the first pair of canonical variables.
UNSW MATH5855 2021T3 Lecture 7 Slide 24
Exercise 7.4
Consider the following covariance matrix Σ of a four-dimensional
normal vector:
Σ =
( Σ11 Σ12 )
( Σ21 Σ22 )
=
( 100 0    0    0   )
( 0   1    0.95 0   )
( 0   0.95 1    0   )
( 0   0    0    100 )
.
Verify that the first pair of canonical variates are just the second
and the third component of the vector and the canonical
correlation equals .95.
UNSW MATH5855 2021T3 Lecture 7 Slide 25
Lecture 8: Multivariate Linear Models and Multivariate
ANOVA
8.1 Univariate linear models and ANOVA
8.2 Multivariate Linear Model and MANOVA
8.3 Computations used in the MANOVA tests
8.4 Software
8.5 Examples
8.6 Additional resources
UNSW MATH5855 2021T3 Lecture 8 Slide 1
8. Multivariate Linear Models and Multivariate ANOVA
8.1 Univariate linear models and ANOVA
8.2 Multivariate Linear Model and MANOVA
8.3 Computations used in the MANOVA tests
8.4 Software
8.5 Examples
8.6 Additional resources
UNSW MATH5855 2021T3 Lecture 8 Slide 2
I For observations i = 1, 2, . . . , n, let
the response variable Yi = xiβ + εi , for
predictor row vector xi ∈ Rk assumed fixed and known
coefficient vector β ∈ Rk fixed and unknown
stochastic errors εi i.i.d. ∼ N(0, σ²)
=⇒ matrix form, Y = (Y1 Y2 · · · Yn)> and
X = (x1> x2> · · · xn>)> ∈ Mn,k
I assume X contains an intercept
=⇒ Y = Xβ + ε, ε ∼ Nn(0, Inσ²)
I MLE for β minimises
∑_{i=1}^n (Yi − xiβ)² = ‖Y − Xβ‖² = (Y − Xβ)>(Y − Xβ)
I To get
β̂ = (X>X )^(-1)X>Y
Var(β̂) = (X>X )^(-1)X> Var(Y )X (X>X )^(-1) = (X>X )^(-1)σ²
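A minimal NumPy sketch of these two formulas (names ours; the course software is SAS and R):

```python
import numpy as np

def ols_fit(X, y, sigma2=1.0):
    """beta_hat = (X'X)^{-1} X'y and Var(beta_hat) = (X'X)^{-1} sigma^2."""
    XtX = X.T @ X
    beta = np.linalg.solve(XtX, X.T @ y)  # solve rather than invert explicitly
    var_beta = np.linalg.inv(XtX) * sigma2
    return beta, var_beta
```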
UNSW MATH5855 2021T3 Lecture 8 Slide 3
I Consider projection matrices
A = In − X (X>X )^(-1)X>
B = X (X>X )^(-1)X> − 1n(1n>1n)^(-1)1n>
=⇒ AY = Y − X{(X>X )^(-1)X>Y } = Y − Ŷ = residuals
=⇒ BY = X{(X>X )^(-1)X>}Y − 1n(1n>1n)^(-1)1n>Y = Ŷ − 1nȲ =
fitted values over and above the mean
I Also, if X has an intercept,
Cov(AY ,BY ) = A Var(Y )B> = σ²AB>, where
AB> = X (X>X )^(-1)X> − X (X>X )^(-1)X>X (X>X )^(-1)X>
− 1n(1n>1n)^(-1)1n> + X (X>X )^(-1)X>1n(1n>1n)^(-1)1n>
= (1/n)(X (X>X )^(-1)X>1n − 1n)1n> = 0
=⇒ SSE = Y>AY ∼ σ²χ²_{n−k} and SSA = Y>BY ∼ σ²χ²_{k−1},
independent
=⇒ Can set up F = (SSA/(k − 1)) / (SSE/(n − k)) ∼ F_{k−1,n−k}
UNSW MATH5855 2021T3 Lecture 8 Slide 4
8. Multivariate Linear Models and Multivariate ANOVA
8.1 Univariate linear models and ANOVA
8.2 Multivariate Linear Model and MANOVA
8.3 Computations used in the MANOVA tests
8.4 Software
8.5 Examples
8.6 Additional resources
UNSW MATH5855 2021T3 Lecture 8 Slide 5
response matrix: Y = (Y1 Y2 · · · Yn)> ∈ Mn,p, with rows Yi> and
entries Yij , i = 1, . . . , n, j = 1, . . . , p
predictors: xi and X as before
coefficient matrix: β ∈ Mk,p
stochastic error vectors: εi ∼ Np(0,Σ), Σ ∈ Mp,p symmetric
positive definite
I Then,
Yi> = xiβ + εi>
Y = Xβ + E , E = (ε1 ε2 · · · εn)> ∈ Mn,p
=⇒ vec(E ) ∼ Nnp(0, Σ⊗ In) and vec(E>) ∼ Nnp(0, In ⊗ Σ)
=⇒ vec(Y ) ∼ Nnp({β> ⊗ In} vec(X ), Σ⊗ In) and
vec(Y>) ∼ Nnp({In ⊗ β>} vec(X>), In ⊗ Σ)
UNSW MATH5855 2021T3 Lecture 8 Slide 6
I MLE is again OLS, minimising
∑_{i=1}^n tr{(Yi − xiβ)(Yi − xiβ)>} = tr{(Y − Xβ)>(Y − Xβ)},
=⇒
β̂ = (X>X )^(-1)X>Y
Var(vec(β̂>)) = Var(vec(Y>X (X>X )^(-1)))
= Var{((X>X )^(-1)X> ⊗ Ip) vec(Y>)}
= ((X>X )^(-1)X> ⊗ Ip)(In ⊗ Σ)((X>X )^(-1)X> ⊗ Ip)>
= ((X>X )^(-1)X> ⊗ Ip)((X>X )^(-1)X> ⊗ Σ)>
= (X>X )^(-1) ⊗ Σ
Var(vec(β̂)) = Σ⊗ (X>X )^(-1)
I Projection matrices A and B still work (check it!)
=⇒ SSE = Y>AY ∼ Wp(Σ, n − k)
=⇒ SSA = Y>BY ∼ Wp(Σ, k − 1)
I Now matrices.
UNSW MATH5855 2021T3 Lecture 8 Slide 7
8. Multivariate Linear Models and Multivariate ANOVA
8.1 Univariate linear models and ANOVA
8.2 Multivariate Linear Model and MANOVA
8.3 Computations used in the MANOVA tests
8.3.1 Roots distributions
8.3.2 Comparisons
8.4 Software
8.5 Examples
8.6 Additional resources
UNSW MATH5855 2021T3 Lecture 8 Slide 8
I In (univariate) ANOVA, with usual assumptions we test for a
factor with the F test:
I Decompose SST = SSA + SSE. Then,
i) SSE and SSA are independent scaled-χ2 distributed;
ii) Divided by degrees of freedom,
I MSE is always unbiased for σ2.
I MSA only unbiased for σ2 under H0 and is larger (in
expectation) otherwise.
=⇒ For F = MSA/MSE under H0, the σ²s cancel and the
distribution is just unscaled F .
I I.e., reject when F is higher than the critical value.
I Multivariate ANOVA (MANOVA) =⇒ χ2 →Wishart
=⇒ “Ratio” of matrices.
I No unambiguous choice.
I Turns out to be related to the eigenvalues of scaled
Wishart-distributed matrices related to decomposition SST =
SSA + SSE in the multivariate case.
UNSW MATH5855 2021T3 Lecture 8 Slide 9
8. Multivariate Linear Models and Multivariate ANOVA
8.3 Computations used in the MANOVA tests
8.3.1 Roots distributions
8.3.2 Comparisons
UNSW MATH5855 2021T3 Lecture 8 Slide 10
I Let Yi , i = 1, 2, . . . , n ind.∼ Np(µi ,Σ), with data matrix
Y = (Y1 Y2 · · · Yn)> ∈ Mn,p, with rows Yi> and entries Yij
I Call E(Y ) = M, Var(vec(Y )) = Σ⊗ In
I Let A and B be projectors such that Q1 = Y>AY and
Q2 = Y>BY are two independent Wp(Σ, v) and Wp(Σ, q)
matrices, respectively.
I E.g., for a multivariate linear model,
Y = Xβ + E , Ŷ = X β̂
A = In − X (X>X )−X>, B = X (X>X )−X> − 1n(1n>1n)^(-1)1n>
and the corresponding decomposition
Y>[In − 1n(1n>1n)^(-1)1n>]Y = Y>BY + Y>AY = Q2 + Q1
of SST = SSA + SSE = Q2 + Q1, where Q2 is the “hypothesis
matrix” and Q1 is the “error matrix”.
UNSW MATH5855 2021T3 Lecture 8 Slide 11
Lemma 8.1.
Let Q1,Q2 ∈ Mp,p be two positive definite symmetric matrices.
Then the roots of the determinant equation
|Q2 − θ(Q1 + Q2)| = 0 are related to the roots of the equation
|Q2 − λQ1| = 0 by: λi = θi/(1 − θi ) (or θi = λi/(1 + λi )).
UNSW MATH5855 2021T3 Lecture 8 Slide 12
Lemma 8.2.
Let Q1,Q2 ∈ Mp,p be two positive definite symmetric matrices.
Then the roots of the determinant equation
|Q1 − v(Q1 + Q2)| = 0 are related to the roots of the equation
|Q2 − λQ1| = 0 by: λi = (1 − vi )/vi (or vi = 1/(1 + λi )).
UNSW MATH5855 2021T3 Lecture 8 Slide 13
I Then, if λi , vi , θi are the roots of
|Q2 − λQ1| = 0, |Q1 − v(Q1 + Q2)| = 0, |Q2 − θ(Q1 + Q2)| = 0,
respectively:
Λ = |Q1(Q1 + Q2)^(-1)| = ∏_{i=1}^p (1 + λi )^(-1) (Wilks’s Criterion statistic)
|Q2Q1^(-1)| = ∏_{i=1}^p λi = ∏_{i=1}^p (1 − vi )/vi = ∏_{i=1}^p θi/(1 − θi )
|Q2(Q1 + Q2)^(-1)| = ∏_{i=1}^p θi = ∏_{i=1}^p λi/(1 + λi ) = ∏_{i=1}^p (1 − vi )
and others only depend on p (dimension) and v and q (Wishart
dfs).
UNSW MATH5855 2021T3 Lecture 8 Slide 14
Common MANOVA test statistics
I Λ as above (Wilks’s Lambda)
I tr(Q2Q1^(-1)) = tr(Q1^(-1)Q2) = ∑_{i=1}^p λi (Lawley–Hotelling trace)
I max_i λi (Roy’s criterion)
I V = tr[Q2(Q1 + Q2)^(-1)] = ∑_{i=1}^p λi/(1 + λi ) (Pillai statistic /
Pillai’s trace)
I No simple forms for exact distributions
I We have powerful computers now, however.
I Q1 is called the “error matrix” (also denoted by E )
I Q2 is the “hypothesis matrix” (also denoted by H)
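Given E = Q1 and H = Q2, the four criteria follow directly from the eigenvalues of Q1^(-1)Q2 (a NumPy sketch; names ours):

```python
import numpy as np

def manova_stats(Q1, Q2):
    """Wilks, Lawley-Hotelling, Pillai, and Roy criteria from the
    eigenvalues of Q1^{-1} Q2 (Q1 = error matrix, Q2 = hypothesis matrix)."""
    lam = np.sort(np.real(np.linalg.eigvals(np.linalg.solve(Q1, Q2))))[::-1]
    wilks = np.prod(1.0 / (1.0 + lam))      # prod (1 + lambda_i)^{-1}
    lawley_hotelling = lam.sum()            # sum lambda_i
    pillai = np.sum(lam / (1.0 + lam))      # sum lambda_i / (1 + lambda_i)
    roy = lam[0]                            # largest eigenvalue
    return wilks, lawley_hotelling, pillai, roy
```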
UNSW MATH5855 2021T3 Lecture 8 Slide 15
I Distribution of statistics depends on
I p = the number of responses
I q = νh = degrees of freedom for the hypothesis
I v = νe = degrees of freedom for the error
I Then, compute:
I s = min(p, q)
I m = 0.5(|p − q| − 1)
I n = 0.5(v − p − 1)
I r = v − 0.5(p − q + 1)
I u = 0.25(pq − 2)
I t = √((p²q² − 4)/(p² + q² − 5)) if p² + q² − 5 > 0; t = 1 otherwise
I Order the eigenvalues of E^(-1)H = Q1^(-1)Q2 as
λ1 ≥ λ2 ≥ · · · ≥ λp.
UNSW MATH5855 2021T3 Lecture 8 Slide 16
Then, (exactly if s = 1 or 2; otherwise approximately):
Wilks’s test: Λ = |E |/|E + H| = ∏_{i=1}^p 1/(1 + λi ):
F = ((1 − Λ^(1/t))/Λ^(1/t)) · (rt − 2u)/(pq) ∼ F with
pq and rt − 2u df (Rao’s F).
Lawley–Hotelling trace test: U = tr(E^(-1)H) = λ1 + · · ·+ λp:
F = 2(sn + 1)U/(s²(2m + s + 1)) ∼ F with
s(2m + s + 1) and 2(sn + 1) df.
Pillai’s test: V = tr(H(H + E )^(-1)) = λ1/(1 + λ1) + · · ·+ λp/(1 + λp):
F = ((2n + s + 1)/(2m + s + 1)) · V /(s − V ) ∼ F with
s(2m + s + 1) and s(2n + s + 1) df.
Roy’s maximum root criterion: The test statistic is just the largest
eigenvalue λ1.
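Rao's F for Wilks's Λ can be sketched as follows (Python; names ours). As a sanity check, for p = q = 1 it reduces to the familiar univariate F = ((1 − Λ)/Λ) · νe with 1 and νe df:

```python
import math

def rao_F(wilks, p, q, v):
    """Rao's F approximation for Wilks's Lambda (exact when min(p, q) <= 2).
    p: number of responses; q: hypothesis df; v: error df."""
    r = v - 0.5 * (p - q + 1)
    u = 0.25 * (p * q - 2)
    t = math.sqrt((p * p * q * q - 4) / (p * p + q * q - 5)) \
        if p * p + q * q - 5 > 0 else 1.0
    df1, df2 = p * q, r * t - 2 * u
    F = (1 - wilks ** (1 / t)) / wilks ** (1 / t) * df2 / df1
    return F, df1, df2
```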
UNSW MATH5855 2021T3 Lecture 8 Slide 17
I An older and very universal approximation to the distribution
of Λ due to Bartlett (1927):
I Level of −[νe − (p − νh + 1)/2] log Λ = c(p, νh,M) × level of
χ²_{pνh}, where the constant c(p, νh,M = νe − p + 1) is given
in tables.
I Such tables are prepared for levels α = 0.10, 0.05, 0.025, etc.
UNSW MATH5855 2021T3 Lecture 8 Slide 18
First Canonical Correlation
I When testing the significance of the first canonical correlation,
E = S22 − S21S11^(-1)S12, H = S21S11^(-1)S12
I Wilks’s statistic becomes |S|/(|S11||S22|) (Recall (4.3)!)
I µi² were the squared canonical correlations =⇒ µ1² was
defined as the maximal eigenvalue of S22^(-1)H , that is, it is a
solution to |(E + H)^(-1)H − µ1²I | = 0
I Setting λ1 = µ1²/(1 − µ1²),
|(E + H)^(-1)H − µ1²I | = 0 =⇒ |H − µ1²(E + H)| = 0
=⇒ |H − (µ1²/(1 − µ1²))E | = 0 =⇒ |E^(-1)H − λ1I | = 0
=⇒ λ1 is an eigenvalue of E^(-1)H .
I Similarly with the remaining λi = µi²/(1 − µi²) values
I Degrees of freedom of E and H?
UNSW MATH5855 2021T3 Lecture 8 Slide 19
8. Multivariate Linear Models and Multivariate ANOVA
8.3 Computations used in the MANOVA tests
8.3.1 Roots distributions
8.3.2 Comparisons
UNSW MATH5855 2021T3 Lecture 8 Slide 20
I Wilks’s lambda is the most popular.
I Convenient.
I Related to the LRT.
I No universally best test.
I Few power analyses available.
UNSW MATH5855 2021T3 Lecture 8 Slide 21
8. Multivariate Linear Models and Multivariate ANOVA
8.1 Univariate linear models and ANOVA
8.2 Multivariate Linear Model and MANOVA
8.3 Computations used in the MANOVA tests
8.4 Software
8.5 Examples
8.6 Additional resources
UNSW MATH5855 2021T3 Lecture 8 Slide 22
SAS PROC GLM, PROC REG
R stats::lm
UNSW MATH5855 2021T3 Lecture 8 Slide 23
8. Multivariate Linear Models and Multivariate ANOVA
8.1 Univariate linear models and ANOVA
8.2 Multivariate Linear Model and MANOVA
8.3 Computations used in the MANOVA tests
8.4 Software
8.5 Examples
8.6 Additional resources
UNSW MATH5855 2021T3 Lecture 8 Slide 24
Example 8.3.
Multivariate linear modelling of the Fitness dataset.
UNSW MATH5855 2021T3 Lecture 8 Slide 25
8. Multivariate Linear Models and Multivariate ANOVA
8.1 Univariate linear models and ANOVA
8.2 Multivariate Linear Model and MANOVA
8.3 Computations used in the MANOVA tests
8.4 Software
8.5 Examples
8.6 Additional resources
UNSW MATH5855 2021T3 Lecture 8 Slide 26
Additional resources
I JW Ch. 7.
UNSW MATH5855 2021T3 Lecture 8 Slide 27
Lecture 9: Tests of a Covariance Matrix
9.1 Test of Σ = Σ0
9.2 Sphericity test
9.3 General situation
9.4 Software
9.5 Exercises
UNSW MATH5855 2021T3 Lecture 9 Slide 1
9. Tests of a Covariance Matrix
9.1 Test of Σ = Σ0
9.2 Sphericity test
9.3 General situation
9.4 Software
9.5 Exercises
UNSW MATH5855 2021T3 Lecture 9 Slide 2
I Sample X1,X2, . . . ,Xn from a Np(µ,Σ).
I Test H0 : Σ = Σ0 against the alternative H1 : Σ ≠ Σ0.
I Let Yi = Σ0^(-1/2)Xi ; then
Yi i.i.d.∼ Np(Σ0^(-1/2)µ, Σ0^(-1/2)Σ(Σ0^(-1/2))>),
which is Np(Σ0^(-1/2)µ, Ip) under H0
=⇒ can transform and test just H̄0 : Σ = Ip
UNSW MATH5855 2021T3 Lecture 9 Slide 3
I For LRT: likelihood function is
L(x ;µ,Σ) = (2π)^(−np/2)|Σ|^(−n/2) exp{−(1/2)∑_{i=1}^n (xi − µ)>Σ^(-1)(xi − µ)}
= (2π)^(−np/2)|Σ|^(−n/2) exp{−(1/2) tr[Σ^(-1)∑_{i=1}^n (xi − µ)(xi − µ)>]}
I H0 =⇒ µ̂ = x̄
I H1 =⇒ maximise with respect to both µ and Σ
I From Section 3.1.2, µ̂ = x̄ and Σ̂ = (1/n)∑_{i=1}^n (xi − x̄)(xi − x̄)>,
and
Λ = maxµ L(x ;µ, Ip) / maxµ,Σ L(x ;µ,Σ)
= e^(−(1/2) tr V ) / (|V |^(−n/2) n^(np/2) e^(−np/2))
where V = ∑_{i=1}^n (xi − x̄)(xi − x̄)>, so that
−2 log Λ = np log n − n log|V |+ tr V − np, (9.1)
I asymptotically ∼ χ²_{p(p+1)/2}
I I.e., the number of “free” elements in a p × p symmetric matrix.
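Equation (9.1) is easy to compute directly (a NumPy sketch; names ours). If the centred SSCP matrix V equals n·Ip exactly, the statistic is 0:

```python
import numpy as np

def lrt_sigma_identity(X):
    """-2 log Lambda of (9.1) for H0: Sigma = I_p (apply the Sigma0^{-1/2}
    transformation first for a general Sigma0):
    np log n - n log|V| + tr V - np."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    V = Xc.T @ Xc                      # centred SSCP matrix
    logdetV = np.linalg.slogdet(V)[1]  # log|V|, numerically stable
    return n * p * np.log(n) - n * logdetV + np.trace(V) - n * p
```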
UNSW MATH5855 2021T3 Lecture 9 Slide 4
9. Tests of a Covariance Matrix
9.1 Test of Σ = Σ0
9.2 Sphericity test
9.3 General situation
9.4 Software
9.5 Exercises
UNSW MATH5855 2021T3 Lecture 9 Slide 5
I What if Σ is known up to a constant?
I Section 9.1 transformation =⇒ w.l.o.g. H0 : Σ = σ²Ip
against a general alternative.
I “sphericity test”
I LR:
−2 log Λ = np log(nσ̂²) − n log|V |
I σ̂² = (1/(np))∑_{i=1}^n (xi − x̄)>(xi − x̄)
I H0 =⇒ asymptotically ∼ χ²_{p(p+1)/2−1} (WHY?!)
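A NumPy sketch of the sphericity statistic (names ours). As Exercise 9.1 shows, it equals np times the log of the ratio of the arithmetic to the geometric mean of the eigenvalues of S, so it is always nonnegative:

```python
import numpy as np

def sphericity_lrt(X):
    """-2 log Lambda for H0: Sigma = sigma^2 I_p:
    np log(n sigma2_hat) - n log|V|, with sigma2_hat = tr V / (np)."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    V = Xc.T @ Xc
    sigma2 = np.trace(V) / (n * p)     # sigma^2 hat
    logdetV = np.linalg.slogdet(V)[1]
    return n * p * np.log(n * sigma2) - n * logdetV
```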
UNSW MATH5855 2021T3 Lecture 9 Slide 6
9. Tests of a Covariance Matrix
9.1 Test of Σ = Σ0
9.2 Sphericity test
9.3 General situation
9.4 Software
9.5 Exercises
UNSW MATH5855 2021T3 Lecture 9 Slide 7
Goal: Given samples from k different multivariate normal
populations Np(µi ,Σi ), i = 1, 2, . . . , k , test
H0 : Σ1 = · · · = Σk .
I Useful for MANOVA and discriminant analysis in particular.
I Let,
k be the number of populations;
p the dimension of the vector;
n the total sample size n = n1 + n2 + · · ·+ nk ,
ni being the sample size for each population.
=⇒ LR test statistic
−2 log (∏_{i=1}^k |Σ̂i |^(ni/2) / |Σ̂pooled|^(n/2))
I Σ̂i is the MLE sample variance (with denominator ni ) of
population i ,
I Σ̂pooled = (1/n)∑_{i=1}^k ni Σ̂i
I asymptotically ∼ χ²_{(k−1)p(p+1)/2}
I Results in a biased test: ∃µi ,Σi for which the probability of
rejecting H0 when false is lower than when true.
UNSW MATH5855 2021T3 Lecture 9 Slide 8
I Further let N = n − k and Ni = ni − 1
I Replace Σ̂s with Ss.
I Let ρ = 1 − [(∑_{i=1}^k 1/Ni ) − 1/N] (2p² + 3p − 1)/(6(p + 1)(k − 1))
=⇒ H0 =⇒ modified LR
−2ρ log (∏_{i=1}^k |Si |^(Ni/2) / |Spooled|^(N/2)), (9.2)
I Si is the sample variance (with denominator ni − 1) of
population i ,
I Spooled = (1/N)∑_{i=1}^k NiSi
I asymptotically ∼ χ²_{(k−1)p(p+1)/2}
I Reject H0 when high.
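Box's M statistic (9.2) with Bartlett's scaling ρ can be sketched as follows (NumPy; names ours). When all groups share the same sample covariance, the statistic is 0:

```python
import numpy as np

def box_m(samples):
    """rho * M for H0: equal covariance matrices across k groups.
    samples: list of (n_i x p) data arrays. Returns (statistic, df)."""
    k = len(samples)
    p = samples[0].shape[1]
    Ns = [x.shape[0] - 1 for x in samples]  # N_i = n_i - 1
    N = sum(Ns)                             # N = n - k
    Ss = [np.cov(x, rowvar=False) for x in samples]  # denominators n_i - 1
    Sp = sum(Ni * Si for Ni, Si in zip(Ns, Ss)) / N  # pooled covariance
    M = N * np.linalg.slogdet(Sp)[1] - sum(
        Ni * np.linalg.slogdet(Si)[1] for Ni, Si in zip(Ns, Ss))
    rho = 1 - (sum(1.0 / Ni for Ni in Ns) - 1.0 / N) * \
        (2 * p * p + 3 * p - 1) / (6.0 * (p + 1) * (k - 1))
    df = (k - 1) * p * (p + 1) // 2
    return rho * M, df
```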
UNSW MATH5855 2021T3 Lecture 9 Slide 9
I For details, see
Muirhead, R. (1982) Aspects of Multivariate Statistical
Theory. Wiley, New York.
I The modified LR is
I Replacing ni and n by Ni and N
I Ni and N are “degrees of freedom”
I Scaling factor
ρ = 1 − [(∑_{i=1}^k 1/Ni ) − 1/N] (2p² + 3p − 1)/(6(p + 1)(k − 1)),
I Close to 1 anyway if all ni large
I improves the quality of the asymptotic approximation
I Application of the Bartlett correction
I Asymptotically negligible scalar transformations of the LR
statistic.
I Approaches χ² at the rate O(1/n²) instead of O(1/n).
UNSW MATH5855 2021T3 Lecture 9 Slide 10
9. Tests of a Covariance Matrix
9.1 Test of Σ = Σ0
9.2 Sphericity test
9.3 General situation
9.4 Software
9.5 Exercises
UNSW MATH5855 2021T3 Lecture 9 Slide 11
SAS: PROC CALIS, PROC DISCRIM (option)
R: heplots::boxM, MVTests::BoxM
The statistic (9.2) is the one that is implemented in software
packages.
UNSW MATH5855 2021T3 Lecture 9 Slide 12
9. Tests of a Covariance Matrix
9.1 Test of Σ = Σ0
9.2 Sphericity test
9.3 General situation
9.4 Software
9.5 Exercises
UNSW MATH5855 2021T3 Lecture 9 Slide 13
Exercise 9.1
Follow the discussion about the sphericity test. Argue that if
λ̂i , i = 1, 2, . . . , p denote the eigenvalues of the empirical
covariance matrix S then
−2 log Λ = np log (arithm. mean of the λ̂i / geom. mean of the λ̂i ).
Of course, the above statistic is asymptotically χ²_{(p+2)(p−1)/2}
distributed under H0 since it only represents the sphericity test in a
different form.
UNSW MATH5855 2021T3 Lecture 9 Slide 14
Exercise 9.2
Show that the likelihood ratio test of
H0 : Σ is a diagonal matrix
rejects H0 when −n log|R| is larger than χ²_{1−α,p(p−1)/2}. (Here R is
the empirical correlation matrix, p is the dimension of the
multivariate normal and n is the sample size.)
UNSW MATH5855 2021T3 Lecture 9 Slide 15
Lecture 10: Factor Analysis
10.1 ML Estimation
10.2 Hypothesis testing under multivariate normality assumption
10.3 Varimax method of rotating the factors
10.4 Relationship to Principal Component Analysis
10.5 Software
10.6 Examples
10.7 Additional resources
UNSW MATH5855 2021T3 Lecture 10 Slide 1
Let
Yi , i = 1, 2, . . . , n be independent Np(µ,Σ) variables
▶ E.g., Yi s as the results of a battery of p tests applied to the
ith individual.
Fundamental assumption in factor analysis:
Yi = Λfi + ei (10.1)
Λ ∈ Mp,k factor loading matrix (full rank);
fi ∈ Rk (k < p) factor variable.
▶ Components are latent factors.
▶ Usually, fi ∼ N(α, Ik) (i.e., “orthogonal”) but sometimes
“oblique” factors are used.
ei i.i.d. Np(θ,Σe) with Σe = diag(σ1², σ2², . . . , σp²).
Also the es are independent of the f s.
UNSW MATH5855 2021T3 Lecture 10 Slide 2
▶ Then,
µ = Λα+ θ; Σ = ΛΛ⊤ +Σe
▶ I.e.,
Var(Yir ) = ∑_{j=1}^k λrj² + σr² = communality + uniqueness
Cov(Yir ,Yis) = ∑_{j=1}^k λrjλsj
Idea: Describe the covariance relationships among many
variables (p “large”) in terms of few (k “small”) underlying, not
observable (latent) random quantities (the factors).
▶ I.e., suppose variables can be grouped by their correlations:
high (+ or −) correlation within group but low between
groups.
=⇒ Each group of variables represents a single underlying
construct (factor) that is “responsible” for the observed
correlations.
UNSW MATH5855 2021T3 Lecture 10 Slide 3
Important notes
▶ (10.1) is similar to a LM, but “predictors” fi random and are
not observable.
▶ Λ known or estimated =⇒
α̂ = (Λ⊤Λ)−1Λ⊤Ȳ ; θ̂ = Ȳ − Λα̂
=⇒ Only µ, Λ, and σi², i = 1, 2, . . . , p are unknown parameters.
▶ There is a fundamental indeterminacy even if Var(f ) = Ik :
for any orthogonal matrix P ∈ Mk,k ,
ΛΛ⊤ = ΛP(ΛP)⊤; Λfi = (ΛP)(P⊤fi ).
=⇒ Hence replacing Λ by ΛP and fi by P⊤fi leads to the same
equations.
UNSW MATH5855 2021T3 Lecture 10 Slide 4
10. Factor Analysis
10.1 ML Estimation
10.2 Hypothesis testing under multivariate normality assumption
10.3 Varimax method of rotating the factors
10.4 Relationship to Principal Component Analysis
10.5 Software
10.6 Examples
10.7 Additional resources
UNSW MATH5855 2021T3 Lecture 10 Slide 5
▶ Observing Y1,Y2, . . . ,Yn ∈ Rp, likelihood
L(Y ;µ,Λ, σ1², σ2², . . . , σp²)
= (2π)^(−np/2)|Σ|^(−n/2) exp[−(1/2)∑_{i=1}^n (Yi − µ)⊤Σ^(-1)(Yi − µ)]
= (2π)^(−np/2)|Σ|^(−n/2) exp[−(n/2)(tr(Σ^(-1)S) + (Ȳ − µ)⊤Σ^(-1)(Ȳ − µ))]
with S = (1/n)∑_{i=1}^n (Yi − Ȳ )(Yi − Ȳ )⊤
log L(Y ;µ,Λ, σ1², σ2², . . . , σp²) =
−(np/2) log(2π) − (n/2) log|Σ| − (n/2)[tr(Σ^(-1)S) + (Ȳ − µ)⊤Σ^(-1)(Ȳ − µ)]
UNSW MATH5855 2021T3 Lecture 10 Slide 6
▶ Differentiate w.r.t. µ:
    ∂ log L/∂µ = nΣ⁻¹(Ȳ − µ) = 0  =⇒  µ̂ = Ȳ
▶ Substitute µ̂ = Ȳ, Σ = ΛΛ⊤ + Σe, negate, and drop constants =⇒ minimise
    Q = (1/2) log|ΛΛ⊤ + Σe| + (1/2) tr[(ΛΛ⊤ + Σe)⁻¹S]
▶ Relevant matrix differentiation rules:
    ∂/∂Λ log|ΛΛ⊤ + Σe| = 2(ΛΛ⊤ + Σe)⁻¹Λ    (10.2)
    ∂/∂A tr(A⁻¹B) = −(A⁻¹BA⁻¹)⊤    (10.3)
▶ (10.3) and the chain rule =⇒
    ∂/∂Λ tr[(ΛΛ⊤ + Σe)⁻¹S] = −2(ΛΛ⊤ + Σe)⁻¹S(ΛΛ⊤ + Σe)⁻¹Λ
UNSW MATH5855 2021T3 Lecture 10 Slide 7
▶ Substitute:
    ∂Q/∂Λ = (ΛΛ⊤ + Σe)⁻¹Λ − (ΛΛ⊤ + Σe)⁻¹S(ΛΛ⊤ + Σe)⁻¹Λ
          = (ΛΛ⊤ + Σe)⁻¹[ΛΛ⊤ + Σe − S](ΛΛ⊤ + Σe)⁻¹Λ = 0    (10.4)
▶ Woodbury Matrix Identity
    (A + UCV)⁻¹ = A⁻¹ − A⁻¹U(C⁻¹ + VA⁻¹U)⁻¹VA⁻¹  =⇒
    (ΛΛ⊤ + Σe)⁻¹ = Σe⁻¹ − Σe⁻¹Λ(I + Λ⊤Σe⁻¹Λ)⁻¹Λ⊤Σe⁻¹    (10.5)
▶ (10.4) and (10.5) =⇒
    [ΛΛ⊤ + Σe − S]Σe⁻¹Λ{I − (I + Λ⊤Σe⁻¹Λ)⁻¹Λ⊤Σe⁻¹Λ} = 0    (10.6)
▶ Matrix in curly brackets is full rank =⇒
    [ΛΛ⊤ + Σe − S]Σe⁻¹Λ = 0  =⇒  SΣe⁻¹Λ = Λ(I + Λ⊤Σe⁻¹Λ)
▶ Factor Σe⁻¹ = Σe^{−1/2}Σe^{−1/2} and premultiply by Σe^{−1/2}:
    (Σe^{−1/2}SΣe^{−1/2})Σe^{−1/2}Λ = Σe^{−1/2}Λ(I + Λ⊤Σe⁻¹Λ)    (10.7)
UNSW MATH5855 2021T3 Lecture 10 Slide 8
▶ Let us require Λ⊤Σe⁻¹Λ to be diagonal.
=⇒ (10.7) =⇒ Σe^{−1/2}Λ has as its columns k eigenvectors of Σe^{−1/2}SΣe^{−1/2}.
▶ Q is minimised when these correspond to the largest eigenvalues of Σe^{−1/2}SΣe^{−1/2}.
=⇒ Iterative algorithm (due to Lawley):
1. With an initial guess Σ̃e, calculate Σ̃e^{−1/2}Λ̃ = eigenvectors of the k largest eigenvalues of Σ̃e^{−1/2}SΣ̃e^{−1/2}.
2. Λ̃ = Σ̃e^{1/2}(Σ̃e^{−1/2}Λ̃).
3. Plug Λ̃ in and minimise
    Σe* = argmin_{Σ̃e} Q̃(Σ̃e) = (1/2) log|Λ̃Λ̃⊤ + Σ̃e| + (1/2) tr[(Λ̃Λ̃⊤ + Σ̃e)⁻¹S]
   ▶ Recall, Σ̃e is diagonal, so only p variables.
4. Set Σ̃e = Σe* and repeat from Step 1 until convergence.
UNSW MATH5855 2021T3 Lecture 10 Slide 9
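A rough sketch of this iteration follows. Step 3's inner minimisation is simplified here to updating the uniquenesses with the diagonal of S − Λ̃Λ̃⊤ (a common shortcut, not Lawley's exact step), and all numerical values are made up:

```python
import numpy as np

def lawley_fa(S, k, n_iter=200):
    """Sketch of a Lawley-style iteration for ML factor analysis."""
    psi = 0.5 * np.diag(S)                 # initial guess for uniquenesses
    for _ in range(n_iter):
        d = 1.0 / np.sqrt(psi)
        M = d[:, None] * S * d[None, :]    # Sigma_e^{-1/2} S Sigma_e^{-1/2}
        tau, V = np.linalg.eigh(M)         # eigh returns ascending eigenvalues
        tau, V = tau[::-1][:k], V[:, ::-1][:, :k]
        # Column j of Sigma_e^{-1/2} Lambda is the jth eigenvector scaled
        # so that its squared norm is tau_j - 1.
        Lam = np.sqrt(psi)[:, None] * V * np.sqrt(np.maximum(tau - 1, 0))
        # Simplified uniqueness update (stands in for the inner minimisation):
        psi = np.maximum(np.diag(S - Lam @ Lam.T), 1e-6)
    return Lam, psi

# On a covariance matrix generated exactly from a 1-factor model, the
# iteration recovers the decomposition:
true_L = np.array([[0.9], [0.8], [0.7], [0.6], [0.5]])
true_psi = np.array([0.19, 0.36, 0.51, 0.64, 0.75])
S = true_L @ true_L.T + np.diag(true_psi)
Lam, psi = lawley_fa(S, k=1)
assert np.linalg.norm(S - (Lam @ Lam.T + np.diag(psi))) < 0.05
```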
10. Factor Analysis
10.1 ML Estimation
10.2 Hypothesis testing under multivariate normality assumption
10.3 Varimax method of rotating the factors
10.4 Relationship to Principal Component Analysis
10.5 Software
10.6 Examples
10.7 Additional resources
UNSW MATH5855 2021T3 Lecture 10 Slide 10
▶ H0 : k factors vs. H1 : ≠ k factors.
=⇒ Likelihoods:
    log L1 = −(np/2) log(2π) − (n/2) log|S| − np/2
    log L0 = −(np/2) log(2π) − (n/2) log|Σ̂| − (n/2) tr(Σ̂⁻¹S), for Σ̂ = Λ̂Λ̂⊤ + Σ̂e
=⇒ −2 log(L0/L1) = n[log|Σ̂| − log|S| + tr(Σ̂⁻¹S) − p]
▶ Asymptotically χ² with
    df = p(p + 1)/2 − [pk + p − k(k − 1)/2] = (1/2)[(p − k)² − p − k]. Why?
UNSW MATH5855 2021T3 Lecture 10 Slide 11
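The degree-of-freedom count above can be sketched as follows (the helper name `fa_df` is ours):

```python
# Degrees of freedom for the k-factor goodness-of-fit test: free elements
# of Sigma minus the free parameters (p*k loadings + p uniquenesses, less
# the k(k-1)/2 rotational constraints).
def fa_df(p, k):
    return p * (p + 1) // 2 - (p * k + p - k * (k - 1) // 2)

# Agrees with the closed form ((p - k)^2 - p - k)/2:
assert fa_df(5, 2) == ((5 - 2) ** 2 - 5 - 2) // 2 == 1
assert fa_df(6, 3) == ((6 - 3) ** 2 - 6 - 3) // 2 == 0
```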
10. Factor Analysis
10.1 ML Estimation
10.2 Hypothesis testing under multivariate normality assumption
10.3 Varimax method of rotating the factors
10.4 Relationship to Principal Component Analysis
10.5 Software
10.6 Examples
10.7 Additional resources
UNSW MATH5855 2021T3 Lecture 10 Slide 12
▶ Recall: if Λ̂0 is an MLE =⇒ Λ̂ = Λ̂0P for any orthogonal P ∈ M_{k,k} is also an MLE.
=⇒ Choose P such that Λ̂ has some desirable properties.
▶ Let dr = ∑_{i=1}^p λir².
▶ The varimax method of rotating the factors consists in choosing P to maximise
    Sd = ∑_{r=1}^k {∑_{i=1}^p (λir² − dr/p)²} = ∑_{r=1}^k {∑_{i=1}^p λir⁴ − (∑_{i=1}^p λir²)²/p}
▶ I.e., for each factor, some loadings are large and others small.
▶ Optimise numerically.
▶ Rotation is particularly important if the loadings were obtained by ML, since there we chose Λ̂0 s.t. Λ̂0⊤Σe⁻¹Λ̂0 is diagonal.
▶ Good for computation, bad for interpretation.
UNSW MATH5855 2021T3 Lecture 10 Slide 13
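A small sketch of the raw varimax criterion Sd, checking that a "simple structure" loading matrix scores higher than a rotated version of itself (the loading values are made up):

```python
import numpy as np

def varimax_criterion(L):
    """Raw varimax criterion S_d: for each factor, the (p-scaled) variance
    of the squared loadings, summed over factors."""
    L2 = L**2
    return np.sum(L2**2) - np.sum(L2.sum(axis=0)**2) / L.shape[0]

# A "simple structure" matrix scores higher than its 45-degree rotation.
simple = np.array([[0.9, 0.0], [0.8, 0.0], [0.0, 0.9], [0.0, 0.8]])
c, s = np.cos(np.pi / 4), np.sin(np.pi / 4)
rotated = simple @ np.array([[c, -s], [s, c]])
assert varimax_criterion(simple) > varimax_criterion(rotated)
```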
10. Factor Analysis
10.1 ML Estimation
10.2 Hypothesis testing under multivariate normality assumption
10.3 Varimax method of rotating the factors
10.4 Relationship to Principal Component Analysis
10.4.1 The principal component solution of the factor model
10.4.2 The Principal Factor Solution
10.5 Software
10.6 Examples
10.7 Additional resources
UNSW MATH5855 2021T3 Lecture 10 Slide 14
10. Factor Analysis
10.4 Relationship to Principal Component Analysis
10.4.1 The principal component solution of the factor model
10.4.2 The Principal Factor Solution
UNSW MATH5855 2021T3 Lecture 10 Slide 15
▶ Start with the sample variance matrix:
    S = (1/n) ∑_{i=1}^n (Yi − Ȳ)(Yi − Ȳ)⊤
▶ It can be written using all of its p eigenvalues and eigenvectors.
▶ Perfect reconstruction, but as many factors as variables.
=⇒ Approximate reconstruction using the k largest eigenvalues and their eigenvectors:
    S ≈ ∑_{i=1}^k τi ai ai⊤ = ΛΛ⊤
▶ If k is the right number of factors
=⇒ all communalities have been taken into account
=⇒ sii − ∑_{j=1}^k λij² estimates the uniquenesses
=⇒ the principal component solution of the factor model
UNSW MATH5855 2021T3 Lecture 10 Slide 16
10. Factor Analysis
10.4 Relationship to Principal Component Analysis
10.4.1 The principal component solution of the factor model
10.4.2 The Principal Factor Solution
UNSW MATH5855 2021T3 Lecture 10 Slide 17
▶ Related, but the extraction is not done on S directly.
▶ Suppose the uniquenesses Σe are known.
▶ Then,
    S = Sr + Σe
=⇒ Λ̂ should satisfy
    Sr = S − Σe = Λ̂Λ̂⊤
=⇒ Get Λ̂ by PCA on Sr.
▶ If Sr = ∑_{i=1}^p ti bi bi⊤, take the k biggest ti s (w.l.o.g. t1, t2, . . . , tk) and let
    B = (b1 b2 · · · bk);    ∆ = diag(t1, t2, . . . , tk)
=⇒ Λ̂ = B∆^{1/2}
▶ It can also be done iteratively!
UNSW MATH5855 2021T3 Lecture 10 Slide 18
Some problems:
i) There is no reliable estimate of Σe available.
   ▶ Most commonly: work with the correlation matrix R and set σei² = 1/r^{ii}, where r^{ii} is the ith diagonal element of R⁻¹.
ii) How to select k?
Note:
− The methods in Section 10.4.2 are not as efficient as the ML method.
+ They can be used when the data are not MVN.
▶ k is chosen by combining subject-matter knowledge, the “reasonableness” of the results, and the proportion of variance explained.
UNSW MATH5855 2021T3 Lecture 10 Slide 19
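A sketch of the common initial estimate from item (i), using a made-up correlation matrix:

```python
import numpy as np

# Initial uniqueness estimates sigma_ei^2 = 1/r^{ii}, where r^{ii} is the
# ith diagonal element of R^{-1} (R a correlation matrix).
R = np.array([[1.0, 0.6, 0.3],
              [0.6, 1.0, 0.4],
              [0.3, 0.4, 1.0]])
uniqueness = 1.0 / np.diag(np.linalg.inv(R))

# 1/r^{ii} equals one minus the squared multiple correlation of variable i
# on the others, so it always lies in (0, 1].
assert np.all(uniqueness > 0) and np.all(uniqueness <= 1)
```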
10. Factor Analysis
10.1 ML Estimation
10.2 Hypothesis testing under multivariate normality assumption
10.3 Varimax method of rotating the factors
10.4 Relationship to Principal Component Analysis
10.5 Software
10.6 Examples
10.7 Additional resources
UNSW MATH5855 2021T3 Lecture 10 Slide 20
SAS
PROC FACTOR:
▶ To extract different numbers of factors, run the procedure
once for each number of factors.
▶ The iterative process can lead to “correlations” > 1 =⇒ an ultra-Heywood case.
▶ The heywood option sets them to one, allowing iterations to continue;
▶ the scree option can be used to produce a plot of the eigenvalues of Σ that is helpful in deciding how many factors to use;
▶ besides method=ml you can use method=principal;
▶ with the ML method, the AIC and BIC are included.
▶ These can be used for model selection (smaller = better).
UNSW MATH5855 2021T3 Lecture 10 Slide 21
R
▶ stats::factanal() is the built-in implementation.
▶ Package psych contains additional functions and utilities, as
well as its own implementation, psych::fa().
▶ Model selection tools as well.
▶ Package nFactors contains utilities for determining the
number of factors (e.g., scree plots).
UNSW MATH5855 2021T3 Lecture 10 Slide 22
10. Factor Analysis
10.1 ML Estimation
10.2 Hypothesis testing under multivariate normality assumption
10.3 Varimax method of rotating the factors
10.4 Relationship to Principal Component Analysis
10.5 Software
10.6 Examples
10.7 Additional resources
UNSW MATH5855 2021T3 Lecture 10 Slide 23
Example 10.1.
Data on five socioeconomic variables for 12 census tracts in the Los Angeles area. The five variables represent total population, median school years, total unemployment, miscellaneous professional services, and median house value. Use the ML method and varimax rotation.
▶ Try to run the above model with k = 3 factors. The message “WARNING: Too many factors for a unique solution” appears. This is not surprising, as the number of parameters in the model would exceed the number of elements in Σ ((1/2)[(p − k)² − p − k] = −2 here). In this example you can run the procedure for k = 1 and k = 2 only (do it!) and you will see that k = 2 gives an adequate representation.
▶ Try using psych::fa.parallel() to search for optimal
number of factors.
UNSW MATH5855 2021T3 Lecture 10 Slide 24
10. Factor Analysis
10.1 ML Estimation
10.2 Hypothesis testing under multivariate normality assumption
10.3 Varimax method of rotating the factors
10.4 Relationship to Principal Component Analysis
10.5 Software
10.6 Examples
10.7 Additional resources
UNSW MATH5855 2021T3 Lecture 10 Slide 25
Additional resources
▶ JW Ch. 9.
UNSW MATH5855 2021T3 Lecture 10 Slide 26
Lecture 11: Structural Equation Modelling
11.1 General form of the model
11.2 Estimation
11.3 Model evaluation
11.4 Some particular SEM
11.5 Relationship between exploratory and confirmatory FA
11.6 Software
11.7 Examples
UNSW MATH5855 2021T3 Lecture 11 Slide 1
▶ More general idea: model the covariances rather than the individual observations.
▶ Factor analysis (FA) is only one example:
  ▶ input factors latent =⇒ no regression.
  ▶ our “data” was S and our parameters were the σi² and Λ.
▶ These methods minimise not the difference between observed and predicted individual values but the differences between the sample covariances and the covariances predicted by the model.
▶ I.e., test
    H0 : Σ = Σ(θ) against H1 : Σ ≠ Σ(θ)
▶ Σ has p(p + 1)/2 unknown elements (estimated by S).
▶ Modelled with k = dim(θ) < p(p + 1)/2 parameters.
▶ More generally, one can fit the means and covariances, or even the means, covariances, and higher moments, to a given structure.
▶ Regression analysis with random inputs, simultaneous equations systems, confirmatory factor analysis, canonical correlations, and (M)ANOVA are special cases.
UNSW MATH5855 2021T3 Lecture 11 Slide 2
▶ Structural Equation Modelling (SEM) an important tool in
economics and behavioural sciences.
▶ Relationships among several variables
▶ directly observed (manifest)
▶ unobserved hypothetical variables (latent)
▶ In structural models, as opposed to functional models, all
variables are taken to be random rather than having fixed
levels.
▶ Approximate MVN assumed.
UNSW MATH5855 2021T3 Lecture 11 Slide 3
11. Structural Equation Modelling
11.1 General form of the model
11.2 Estimation
11.3 Model evaluation
11.4 Some particular SEM
11.5 Relationship between exploratory and confirmatory FA
11.6 Software
11.7 Examples
UNSW MATH5855 2021T3 Lecture 11 Slide 4
η = Bη + Γξ + ζ. (11.1)
Here,
η ∈ R^m: vector of output latent variables;
ξ ∈ R^{n′}: vector of input latent variables;
B ∈ M_{m,m}, Γ ∈ M_{m,n′}: coefficient matrices;
  Note: (I − B) is assumed to be nonsingular.
ζ ∈ R^m: disturbance vector with E ζ = 0.
To this modelling equation (11.1) we attach 2 measurement equations:
    Y = ΛY η + ϵ;    (11.2)
    X = ΛX ξ + δ;    (11.3)
with Y ∈ R^p, X ∈ R^q; ΛY ∈ M_{p,m}, ΛX ∈ M_{q,n′}; and ϵ ∈ R^p, δ ∈ R^q zero-mean measurement errors. These errors are assumed to be uncorrelated with ξ and ζ and with each other.
UNSW MATH5855 2021T3 Lecture 11 Slide 5
Generative model for X and Y
[Path diagram: ξ → η via Γ, with feedback B and disturbance ζ on η; η → Y via ΛY with error ϵ; ξ → X via ΛX with error δ.]
UNSW MATH5855 2021T3 Lecture 11 Slide 6
▶ General model (11.1)–(11.2)–(11.3) is called
Keesling–Wiley–Jöreskog model.
▶ Input and output latent variables ξ and η are connected by a
system of linear equations (11.1) with coefficient matrices B
and Γ and an error vector ζ.
▶ The random vectors Y and X represent the observable
vectors (measurements).
▶ Implied covariance matrix: let
    Var(ξ) = Φ; Var(ζ) = Ψ; Var(ϵ) = θϵ; Var(δ) = θδ
▶ Then,
    Σ = Σ(θ) = ( ΣYY(θ)  ΣYX(θ) ; ΣXY(θ)  ΣXX(θ) )    (11.4)
    ΣYY(θ) = ΛY(I − B)⁻¹(ΓΦΓ⊤ + Ψ)[(I − B)⁻¹]⊤ΛY⊤ + θϵ
    ΣYX(θ) = ΛY(I − B)⁻¹ΓΦΛX⊤
    ΣXY(θ) = ΛXΦΓ⊤[(I − B)⁻¹]⊤ΛY⊤
    ΣXX(θ) = ΛXΦΛX⊤ + θδ
UNSW MATH5855 2021T3 Lecture 11 Slide 7
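The implied covariance (11.4) can be assembled numerically; a sketch with made-up parameter values for one input and one output latent variable with two indicators each:

```python
import numpy as np

# Hypothetical parameter values: m = 1 output and n' = 1 input latent
# variable, p = q = 2 indicators each.
B = np.array([[0.0]])            # no eta -> eta paths
Gamma = np.array([[0.5]])
Phi = np.array([[1.0]])          # Var(xi)
Psi = np.array([[0.4]])          # Var(zeta)
LY = np.array([[1.0], [0.8]])
LX = np.array([[1.0], [0.9]])
Te = 0.3 * np.eye(2)             # Var(eps)
Td = 0.2 * np.eye(2)             # Var(delta)

A = np.linalg.inv(np.eye(1) - B)              # (I - B)^{-1}
Syy = LY @ A @ (Gamma @ Phi @ Gamma.T + Psi) @ A.T @ LY.T + Te
Syx = LY @ A @ Gamma @ Phi @ LX.T
Sxx = LX @ Phi @ LX.T + Td
Sigma = np.block([[Syy, Syx], [Syx.T, Sxx]])

assert np.allclose(Sigma, Sigma.T)
assert np.all(np.linalg.eigvalsh(Sigma) > 0)  # a valid covariance matrix
```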
11. Structural Equation Modelling
11.1 General form of the model
11.2 Estimation
11.3 Model evaluation
11.4 Some particular SEM
11.5 Relationship between exploratory and confirmatory FA
11.6 Software
11.7 Examples
UNSW MATH5855 2021T3 Lecture 11 Slide 8
▶ MVN =⇒ use MLE
▶ “data” is S = (1/(n − 1)) ∑_{i=1}^n zi zi⊤, where zi = ((Yi − Ȳ)⊤, (Xi − X̄)⊤)⊤ is the stacked, centred observation
=⇒ (n − 1)S ∼ W_{p+q}(n − 1, Σ)
=⇒ Wishart log-density (up to a constant):
    log L(S, Σ(θ)) = constant − ((n − 1)/2){log|Σ(θ)| + tr[SΣ⁻¹(θ)]}
=⇒ minimise
    FML(θ) = log|Σ(θ)| + tr[SΣ⁻¹(θ)] − log|S| − (p + q)    (11.5)
▶ FML = 0 under the “saturated model” with Σ̂ = S
▶ I.e., a perfect fit is indicated by zero.
UNSW MATH5855 2021T3 Lecture 11 Slide 9
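A minimal sketch of the discrepancy function (11.5), checking that it is zero at the saturated fit and positive otherwise (the matrices are made up):

```python
import numpy as np

def f_ml(S, Sigma):
    """ML discrepancy F_ML(theta): zero iff Sigma = S (saturated fit)."""
    p = S.shape[0]
    _, logdet_Sigma = np.linalg.slogdet(Sigma)
    _, logdet_S = np.linalg.slogdet(S)
    return (logdet_Sigma + np.trace(S @ np.linalg.inv(Sigma))
            - logdet_S - p)

S = np.array([[2.0, 0.5], [0.5, 1.0]])
assert np.isclose(f_ml(S, S), 0.0)   # perfect fit
assert f_ml(S, np.eye(2)) > 0        # any misfit is penalised
```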
11. Structural Equation Modelling
11.1 General form of the model
11.2 Estimation
11.3 Model evaluation
11.4 Some particular SEM
11.5 Relationship between exploratory and confirmatory FA
11.6 Software
11.7 Examples
UNSW MATH5855 2021T3 Lecture 11 Slide 10
▶ MVN =⇒ asymptotic χ²-test.
▶ Under H0 : Σ = Σ(θ) versus H1 : Σ ≠ Σ(θ),
    T = (n − 1)FML(θ̂ML) ∼ χ² with df = (p + q)(p + q + 1)/2 − dim(θ)
  under H0.
Reason:
    log L0 = log L(S, Σ̂MLE) = log L(S, Σ(θ̂ML))
           = −((n − 1)/2){log|Σ̂MLE| + tr[SΣ̂MLE⁻¹]} + constant;
    log L1 = log L(S, S) = −((n − 1)/2){log|S| + (p + q)} + constant.
Then,
    −2 log(L0/L1) = (n − 1){log|Σ̂MLE| + tr(SΣ̂MLE⁻¹) − log|S| − (p + q)}
                  = (n − 1)FML(θ̂ML).
UNSW MATH5855 2021T3 Lecture 11 Slide 11
11. Structural Equation Modelling
11.1 General form of the model
11.2 Estimation
11.3 Model evaluation
11.4 Some particular SEM
11.5 Relationship between exploratory and confirmatory FA
11.6 Software
11.7 Examples
UNSW MATH5855 2021T3 Lecture 11 Slide 12
From the general model (11.1)–(11.2)–(11.3), we can obtain the following particular models:
A) ΛY = Im, ΛX = I_{n′}; p = m; q = n′; θϵ = 0; θδ = 0 =⇒
   Y = BY + ΓX + ζ (the classical econometric model).
B) ΛY = Ip, ΛX = Iq =⇒ the measurement error model:
   ▶ η = Bη + Γξ + ζ
   ▶ Y = η + ϵ
   ▶ X = ξ + δ
C) Factor analysis models: just take the measurement part X = ΛX ξ + δ.
UNSW MATH5855 2021T3 Lecture 11 Slide 13
11. Structural Equation Modelling
11.1 General form of the model
11.2 Estimation
11.3 Model evaluation
11.4 Some particular SEM
11.5 Relationship between exploratory and confirmatory FA
11.6 Software
11.7 Examples
UNSW MATH5855 2021T3 Lecture 11 Slide 14
EFA:
▶ number of latent variables not determined in advance
▶ measurement errors are assumed uncorrelated
CFA:
▶ model is constructed in advance
▶ number of latent variables ξ is set by the analyst
▶ latent variable influences on observed variables specified
▶ some direct effects of latent on observed values are fixed at
0 or other value
▶ measurement errors δ may correlate
▶ covariance of latent variables can be either estimated or set
In practice, more blurred:
▶ Researchers using traditional EFA procedures may restrict
their analysis to a group of indicators that they believe are
influenced by one factor.
▶ Researchers with poorly fitting models in CFA often modify
their model in an exploratory way with the goal of improving
fit.
UNSW MATH5855 2021T3 Lecture 11 Slide 15
11. Structural Equation Modelling
11.1 General form of the model
11.2 Estimation
11.3 Model evaluation
11.4 Some particular SEM
11.5 Relationship between exploratory and confirmatory FA
11.6 Software
11.7 Examples
UNSW MATH5855 2021T3 Lecture 11 Slide 16
SAS
In SAS, the standard PROC CALIS is used for fitting Structural
Equation Models, and it has been significantly upgraded in SAS
9.3. In particular, now you can analyse means and covariance
(or even higher order) structures (instead of just covariance
structures like in the classical SEM).
UNSW MATH5855 2021T3 Lecture 11 Slide 17
R
There are two packages for SEM in R: lavaan and sem. sem is an
older package, whereas lavaan aims to provide an extensible
framework for SEMs and their extensions:
▶ can mimic commercial packages (including those below)
▶ provides convenience functions for specifying simple special
cases (such as CFA) but also a more flexible interface for
advanced users
▶ mean structures and multiple groups
▶ different estimators and standard errors (including robust)
▶ handling of missing data
▶ linear and nonlinear equality and inequality constraints
▶ categorical data support
▶ multilevel SEMs
▶ package blavaan for Bayesian estimation
▶ etc.
UNSW MATH5855 2021T3 Lecture 11 Slide 18
Others
▶ The general form of the SEM given here is only one possible description; it is due to Karl Jöreskog.
▶ It was first implemented in the software called LISREL (LInear Structural RELationships).
▶ Other equivalent descriptions are due to Bentler and Weeks, to McDonald, and to other prominent researchers in the field.
▶ The EQS program deals with the Bentler–Weeks model.
▶ The latest “hit” in the area is the program MPLUS (M is for Bengt Muthén). MPLUS capabilities include:
▶ Exploratory factor analysis
▶ Structural equation modelling
▶ Item response theory analysis
▶ Growth curve modelling
▶ Mixture modelling (latent class analysis)
▶ Longitudinal mixture modelling (hidden Markov, latent
transition analysis, latent class growth analysis, growth
mixture analysis)
▶ Survival analysis (continuous- and discrete-time)
▶ Multilevel analysis
▶ Bayesian analysis
▶ etc.
UNSW MATH5855 2021T3 Lecture 11 Slide 19
11. Structural Equation Modelling
11.1 General form of the model
11.2 Estimation
11.3 Model evaluation
11.4 Some particular SEM
11.5 Relationship between exploratory and confirmatory FA
11.6 Software
11.7 Examples
UNSW MATH5855 2021T3 Lecture 11 Slide 20
Example 11.1.
Wheaton, Muthen, Alwin, and Summers (1977) Anomie example.
UNSW MATH5855 2021T3 Lecture 11 Slide 21
Lecture 12: Discrimination and Classification
12.1 Separation and Classification for two populations
12.2 Classification errors
12.3 Summarising
12.4 Optimal classification rules
12.5 Classification with two multivariate normal populations
12.6 Classification with more than 2 normal populations
12.7 Software
12.8 Examples
12.9 Additional resources
12.10 Exercises
UNSW MATH5855 2021T3 Lecture 12 Slide 1
12. Discrimination and Classification
12.1 Separation and Classification for two populations
12.2 Classification errors
12.3 Summarising
12.4 Optimal classification rules
12.5 Classification with two multivariate normal populations
12.6 Classification with more than 2 normal populations
12.7 Software
12.8 Examples
12.9 Additional resources
12.10 Exercises
UNSW MATH5855 2021T3 Lecture 12 Slide 2
Goals:
discriminant analysis terminology: separating sets of objects
classification theory terminology: allocating new objects to given groups
▶ Discriminant analysis is more exploratory than classification.
▶ In practice, one can lead to the other.
▶ Focus on two populations (classes of objects) first.
UNSW MATH5855 2021T3 Lecture 12 Slide 3
Notation
▶ Call the two classes π1 and π2.
▶ Separate based on random vectors X ∈ R^p.
▶ The distribution of X depends on the class: density f1(x) under π1 and f2(x) under π2.
▶ Observe a learning/training sample: measurements from
known classes.
Goal: Partition sample space into 2 mutually exclusive regions R1
and R2, such that:
▶ If a new observation falls in R1, it is allocated to π1.
▶ If it falls in R2, it is allocated to π2.
UNSW MATH5855 2021T3 Lecture 12 Slide 4
12. Discrimination and Classification
12.1 Separation and Classification for two populations
12.2 Classification errors
12.3 Summarising
12.4 Optimal classification rules
12.5 Classification with two multivariate normal populations
12.6 Classification with more than 2 normal populations
12.7 Software
12.8 Examples
12.9 Additional resources
12.10 Exercises
UNSW MATH5855 2021T3 Lecture 12 Slide 5
▶ Always a chance of misclassification, which we want to
minimise.
▶ Populations may have different sizes in the first place =⇒
prior probabilities.
▶ Cost also matters: errors can have asymmetric costs.
▶ The conditional probabilities of misclassification:
    Pr(2|1) = Pr(X ∈ R2 | π1) = ∫_{R2} f1(x)dx    (12.1)
    Pr(1|2) = Pr(X ∈ R1 | π2) = ∫_{R1} f2(x)dx    (12.2)
UNSW MATH5855 2021T3 Lecture 12 Slide 6
12. Discrimination and Classification
12.1 Separation and Classification for two populations
12.2 Classification errors
12.3 Summarising
12.4 Optimal classification rules
12.5 Classification with two multivariate normal populations
12.6 Classification with more than 2 normal populations
12.7 Software
12.8 Examples
12.9 Additional resources
12.10 Exercises
UNSW MATH5855 2021T3 Lecture 12 Slide 7
confusion matrix: a contingency table showing counts of correct classifications and misclassifications from each class to each other class:
                          Predicted class
                          1                      2
Actual class  1           Members of 1           Members of 1
                          correctly classified   misclassified as 2
              2           Members of 2           Members of 2
                          misclassified as 1     correctly classified
UNSW MATH5855 2021T3 Lecture 12 Slide 8
Negative/Positive context
                          Predicted class
                          Negative               Positive
Actual class  Negative    True Negative (TN)     False Positive (FP)
              Positive    False Negative (FN)    True Positive (TP)
sensitivity (TPR): Pr(Pred. pos. | Act. pos.) = TP/(TP + FN)
specificity (TNR): Pr(Pred. neg. | Act. neg.) = TN/(TN + FP)
false positive rate (FPR): Pr(Pred. pos. | Act. neg.) = 1 − TNR
accuracy: (TP + TN)/(TP + FP + TN + FN)
total probability of misclassification (TPM): 1 − accuracy
precision: Pr(Act. pos. | Pred. pos.) = TP/(TP + FP)
negative predictive value: Pr(Act. neg. | Pred. neg.) = TN/(TN + FN)
F1 score: 2TP/(2TP + FP + FN)
▶ For continuous prediction scores, plot the ROC curve (TPR against FPR) over various thresholds.
UNSW MATH5855 2021T3 Lecture 12 Slide 9
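The metrics above can be computed directly from confusion-matrix counts; a sketch with made-up counts:

```python
# Classification metrics from a confusion matrix; the counts are made up.
TN, FP, FN, TP = 50, 10, 5, 35

sensitivity = TP / (TP + FN)               # true positive rate
specificity = TN / (TN + FP)               # true negative rate
fpr = 1 - specificity                      # false positive rate
accuracy = (TP + TN) / (TP + FP + TN + FN)
tpm = 1 - accuracy                         # total prob. of misclassification
precision = TP / (TP + FP)
npv = TN / (TN + FN)
f1 = 2 * TP / (2 * TP + FP + FN)

assert sensitivity == 35 / 40 and specificity == 50 / 60
assert accuracy == 85 / 100 and precision == 35 / 45
assert f1 == 70 / 85
```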
12. Discrimination and Classification
12.1 Separation and Classification for two populations
12.2 Classification errors
12.3 Summarising
12.4 Optimal classification rules
12.4.1 Rules that minimise the expected cost of misclassification (ECM)
12.4.2 Rules that minimise the total probability of misclassification (TPM)
12.4.3 Bayesian approach
12.5 Classification with two multivariate normal populations
12.6 Classification with more than 2 normal populations
12.7 Software
12.8 Examples
12.9 Additional resources
12.10 Exercises
UNSW MATH5855 2021T3 Lecture 12 Slide 10
12. Discrimination and Classification
12.4 Optimal classification rules
12.4.1 Rules that minimise the expected cost of misclassification (ECM)
12.4.2 Rules that minimise the total probability of misclassification (TPM)
12.4.3 Bayesian approach
UNSW MATH5855 2021T3 Lecture 12 Slide 11
Lemma 12.1.
Denote by pi the prior probability of πi, i = 1, 2, p1 + p2 = 1. Then the overall probabilities of incorrectly classifying objects are Pr(misclassified as π1) = Pr(1|2)p2 and Pr(misclassified as π2) = Pr(2|1)p1. Further, let c(i|j), i ≠ j, i, j = 1, 2, be the misclassification costs. Then the expected cost of misclassification is
    ECM = c(2|1) Pr(2|1)p1 + c(1|2) Pr(1|2)p2    (12.3)
The regions R1 and R2 that minimise the ECM are given by
    R1 = {x : f1(x)/f2(x) ≥ [c(1|2)/c(2|1)] × [p2/p1]}    (12.4)
and
    R2 = {x : f1(x)/f2(x) < [c(1|2)/c(2|1)] × [p2/p1]}.    (12.5)
UNSW MATH5855 2021T3 Lecture 12 Slide 12
Proof.
▶ ECM = ∫_{R1}[c(1|2)p2f2(x) − c(2|1)p1f1(x)]dx + c(2|1)p1
=⇒ minimised iff R1 contains exactly those x with c(1|2)p2f2(x) − c(2|1)p1f1(x) ≤ 0.
▶ Only ratios are involved.
▶ Cost ratios are usually easier to elicit than the costs themselves.
▶ Your own exercise: suppose that p2 = p1 and/or c(1|2) = c(2|1). What do the classification regions look like then?
UNSW MATH5855 2021T3 Lecture 12 Slide 13
12. Discrimination and Classification
12.4 Optimal classification rules
12.4.1 Rules that minimise the expected cost of misclassification (ECM)
12.4.2 Rules that minimise the total probability of misclassification (TPM)
12.4.3 Bayesian approach
UNSW MATH5855 2021T3 Lecture 12 Slide 14
▶ total probability of misclassification (TPM):
    TPM = p1 ∫_{R2} f1(x)dx + p2 ∫_{R1} f2(x)dx
=⇒ set c(1|2) = c(2|1) in Lemma 12.1
UNSW MATH5855 2021T3 Lecture 12 Slide 15
12. Discrimination and Classification
12.4 Optimal classification rules
12.4.1 Rules that minimise the expected cost of misclassification (ECM)
12.4.2 Rules that minimise the total probability of misclassification (TPM)
12.4.3 Bayesian approach
UNSW MATH5855 2021T3 Lecture 12 Slide 16
▶ Allocate a new observation x0 to the population with the larger posterior probability Pr(πi|x0), i = 1, 2.
=⇒ Bayes’s formula:
    Pr(π1|x0) = p1f1(x0)/[p1f1(x0) + p2f2(x0)],    Pr(π2|x0) = p2f2(x0)/[p1f1(x0) + p2f2(x0)]
=⇒ Classify x0 as π1 iff Pr(π1|x0) > Pr(π2|x0)
▶ Again a special case of Lemma 12.1 with c(1|2) = c(2|1). (Why?)
UNSW MATH5855 2021T3 Lecture 12 Slide 17
12. Discrimination and Classification
12.1 Separation and Classification for two populations
12.2 Classification errors
12.3 Summarising
12.4 Optimal classification rules
12.5 Classification with two multivariate normal populations
12.5.1 Case of equal covariance matrices Σ1 = Σ2 = Σ
12.5.2 Case of different covariance matrices (Σ1 ̸= Σ2)
12.5.3 Optimum error rate and Mahalanobis distance
12.6 Classification with more than 2 normal populations
12.7 Software
12.8 Examples
12.9 Additional resources
12.10 Exercises
UNSW MATH5855 2021T3 Lecture 12 Slide 18
12. Discrimination and Classification
12.5 Classification with two multivariate normal populations
12.5.1 Case of equal covariance matrices Σ1 = Σ2 = Σ
12.5.2 Case of different covariance matrices (Σ1 ̸= Σ2)
12.5.3 Optimum error rate and Mahalanobis distance
UNSW MATH5855 2021T3 Lecture 12 Slide 20
▶ Assume π1 and π2 are Np(µ1, Σ) and Np(µ2, Σ).
(12.4) =⇒ R1 = {x : exp[−(1/2)(x − µ1)⊤Σ⁻¹(x − µ1) + (1/2)(x − µ2)⊤Σ⁻¹(x − µ2)] ≥ [c(1|2)/c(2|1)] × [p2/p1]}
(12.5) =⇒ R2 = {x : exp[−(1/2)(x − µ1)⊤Σ⁻¹(x − µ1) + (1/2)(x − µ2)⊤Σ⁻¹(x − µ2)] < [c(1|2)/c(2|1)] × [p2/p1]}
UNSW MATH5855 2021T3 Lecture 12 Slide 21
Theorem 12.2.
Under the above assumptions, the allocation rule that minimises the ECM is given by:
1. Allocate x0 to π1 if
    (µ1 − µ2)⊤Σ⁻¹x0 − (1/2)(µ1 − µ2)⊤Σ⁻¹(µ1 + µ2) ≥ log{[c(1|2)/c(2|1)] × [p2/p1]}.
2. Otherwise, allocate x0 to π2.
UNSW MATH5855 2021T3 Lecture 12 Slide 22
▶ Usually, we don’t know µ1, µ2, and Σ.
▶ Suppose
  ▶ n1 and n2 are the sample sizes,
  ▶ x̄1 and x̄2 their sample mean vectors,
  ▶ S1 and S2 their sample covariance matrices.
▶ Assume Σ1 = Σ2 = Σ
=⇒ pooled covariance matrix estimator Spooled = [(n1 − 1)S1 + (n2 − 1)S2]/(n1 + n2 − 2)
=⇒ sample classification rule:
1. Allocate x0 to π1 if
    (x̄1 − x̄2)⊤Spooled⁻¹x0 − (1/2)(x̄1 − x̄2)⊤Spooled⁻¹(x̄1 + x̄2) ≥ log{[c(1|2)/c(2|1)] × [p2/p1]}    (12.6)
2. Otherwise, allocate x0 to π2.
UNSW MATH5855 2021T3 Lecture 12 Slide 23
▶ Allocation rule based on Fisher’s discriminant function:
    (x̄1 − x̄2)⊤Spooled⁻¹x0 − (1/2)(x̄1 − x̄2)⊤Spooled⁻¹(x̄1 + x̄2)
▶ The function itself is called Fisher’s linear discriminant function.
▶ It is only an estimate of the optimal rule.
▶ It is linear in the new observation x0.
12. Discrimination and Classification
12.5 Classification with two multivariate normal populations
12.5.1 Case of equal covariance matrices Σ1 = Σ2 = Σ
12.5.2 Case of different covariance matrices (Σ1 ̸= Σ2)
12.5.3 Optimum error rate and Mahalanobis distance
UNSW MATH5855 2021T3 Lecture 12 Slide 25
Theorem 12.3.
▶ Assume π1 and π2 are Np(µ1, Σ1) and Np(µ2, Σ2).
▶ Same steps as in Theorem 12.2 =⇒
    R1 = {x : −(1/2)x⊤(Σ1⁻¹ − Σ2⁻¹)x + (µ1⊤Σ1⁻¹ − µ2⊤Σ2⁻¹)x − k ≥ log{[c(1|2)/c(2|1)] × [p2/p1]}}
    R2 = {x : −(1/2)x⊤(Σ1⁻¹ − Σ2⁻¹)x + (µ1⊤Σ1⁻¹ − µ2⊤Σ2⁻¹)x − k < log{[c(1|2)/c(2|1)] × [p2/p1]}}
where k = (1/2) log(|Σ1|/|Σ2|) + (1/2)(µ1⊤Σ1⁻¹µ1 − µ2⊤Σ2⁻¹µ2)
▶ The classification regions are now quadratic functions of x0.
UNSW MATH5855 2021T3 Lecture 12 Slide 26
▶ One obtains the following rule:
1. Allocate x0 to π1 if
    −(1/2)x0⊤(S1⁻¹ − S2⁻¹)x0 + (x̄1⊤S1⁻¹ − x̄2⊤S2⁻¹)x0 − k̂ ≥ log{[c(1|2)/c(2|1)] × [p2/p1]},
   where k̂ is the empirical analogue of k.
2. Allocate x0 to π2 otherwise.
UNSW MATH5855 2021T3 Lecture 12 Slide 27
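A sketch of the sample quadratic rule with made-up estimates and equal priors and costs (so the log threshold on the right-hand side is zero):

```python
import numpy as np

# Made-up sample means and covariance matrices for the two populations.
x1b, x2b = np.array([0.0, 0.0]), np.array([2.0, 2.0])
S1 = np.array([[1.0, 0.3], [0.3, 1.0]])
S2 = np.array([[2.0, -0.2], [-0.2, 1.5]])

def quad_score(x, xb, S):
    # -1/2 log|S| - 1/2 (x - xb)^T S^{-1} (x - xb); with equal priors,
    # log p_i is a common constant and can be dropped.
    d = x - xb
    return -0.5 * np.linalg.slogdet(S)[1] - 0.5 * d @ np.linalg.inv(S) @ d

x0 = np.array([0.2, -0.1])
allocate = 1 if quad_score(x0, x1b, S1) >= quad_score(x0, x2b, S2) else 2
assert allocate == 1  # x0 lies near the first sample mean
```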
▶ Σ1 = Σ2 =⇒ the quadratic term disappears =⇒ Theorem 12.2.
▶ Theorem 12.3 is more general.
▶ For p ≥ 2, quadratic rules may not behave well.
▶ They are more sensitive to non-normality and to differences in the covariance matrices.
=⇒ Transform the data if needed.
=⇒ Use cautiously.
=⇒ Use the tests in Lecture 9 to check whether the equal-variance assumption is valid.
UNSW MATH5855 2021T3 Lecture 12 Slide 28
12. Discrimination and Classification
12.5 Classification with two multivariate normal populations
12.5.1 Case of equal covariance matrices Σ1 = Σ2 = Σ
12.5.2 Case of different covariance matrices (Σ1 ̸= Σ2)
12.5.3 Optimum error rate and Mahalanobis distance
UNSW MATH5855 2021T3 Lecture 12 Slide 29
optimum error rate (OER): the smallest TPM attainable for any R1 and R2
▶ characterises the difficulty of the problem
▶ E.g., given two normal populations with Σ1 = Σ2 = Σ and prior probabilities p1 = p2 = 1/2,
    TPM = (1/2)∫_{R2} f1(x)dx + (1/2)∫_{R1} f2(x)dx
▶ The OER is obtained with
    R1 = {x : (µ1 − µ2)⊤Σ⁻¹x − (1/2)(µ1 − µ2)⊤Σ⁻¹(µ1 + µ2) ≥ 0}
    R2 = {x : (µ1 − µ2)⊤Σ⁻¹x − (1/2)(µ1 − µ2)⊤Σ⁻¹(µ1 + µ2) < 0}
UNSW MATH5855 2021T3 Lecture 12 Slide 30
▶ Let Y = (µ1 − µ2)⊤Σ⁻¹X = l⊤X =⇒ Y |i ∼ N1(µiY, ∆²), where µiY = (µ1 − µ2)⊤Σ⁻¹µi
▶ ∆ = √[(µ1 − µ2)⊤Σ⁻¹(µ1 − µ2)] is the Mahalanobis distance between the two normal populations
=⇒ For Φ(·) the standard normal CDF,
    Pr(2|1) = Pr(Y < (1/2)(µ1 − µ2)⊤Σ⁻¹(µ1 + µ2)) = Pr((Y − µ1Y)/∆ < −∆/2) = Φ(−∆/2)
=⇒ Pr(1|2) = Φ(−∆/2)
=⇒ OER = minimum TPM = Φ(−∆/2)
▶ In practice, ∆ → ∆̂ = √[(x̄1 − x̄2)⊤Spooled⁻¹(x̄1 − x̄2)].
UNSW MATH5855 2021T3 Lecture 12 Slide 31
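A sketch computing the Mahalanobis distance and the resulting optimum error rate Φ(−∆/2) for made-up parameters:

```python
import numpy as np
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Made-up population parameters with a common covariance matrix.
mu1, mu2 = np.array([2.0, 0.0]), np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])

# Mahalanobis distance between the two populations.
delta = np.sqrt((mu1 - mu2) @ np.linalg.solve(Sigma, mu1 - mu2))
oer = Phi(-delta / 2)  # optimum error rate = minimum TPM

# With any separation, the OER is below 1/2 (guessing between two
# balanced classes).
assert 0 < oer < 0.5
```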
12. Discrimination and Classification
12.1 Separation and Classification for two populations
12.2 Classification errors
12.3 Summarising
12.4 Optimal classification rules
12.5 Classification with two multivariate normal populations
12.6 Classification with more than 2 normal populations
12.7 Software
12.8 Examples
12.9 Additional resources
12.10 Exercises
UNSW MATH5855 2021T3 Lecture 12 Slide 32
▶ Generalising to g > 2 groups π1, π2, . . . , πg is straightforward.
▶ Optimal error rate analysis is difficult, but it is easy to see that the ECM classification rule with equal misclassification costs now becomes (compare to (12.4) and (12.5)):
1. Allocate x0 to πk if pk fk(x0) > pi fi(x0) for all i ≠ k.
▶ Equivalently, if log pk fk(x0) > log pi fi(x0) for all i ≠ k.
▶ For g normal populations fi(x) ∼ Np(µi, Σi), i = 1, 2, . . . , g =⇒
1. Allocate x0 to πk if
    log pk fk(x0) = log pk − (p/2) log(2π) − (1/2) log|Σk| − (1/2)(x0 − µk)⊤Σk⁻¹(x0 − µk) = max_i log pi fi(x0)
▶ Ignoring the constant (p/2) log(2π) =⇒ quadratic discriminant score for the ith population:
    dQi(x) = −(1/2) log|Σi| − (1/2)(x − µi)⊤Σi⁻¹(x − µi) + log pi    (12.7)
▶ Allocate x to the population with the largest quadratic discriminant score.
▶ Estimate the unknown quantities in (12.7) from the data =⇒ the estimated minimum total probability of misclassification rule. (You formulate the precise statement!)
UNSW MATH5855 2021T3 Lecture 12 Slide 33
▶ If all covariance matrices for the g populations are equal =⇒ simpler:
▶ define the linear discriminant score:
    di(x) = µi⊤Σ⁻¹x − (1/2)µi⊤Σ⁻¹µi + log pi.
▶ Sample version: x̄i instead of µi and
    Spooled = [(n1 − 1)S1 + · · · + (ng − 1)Sg]/(n1 + n2 + · · · + ng − g) instead of Σ
=⇒
    d̂i(x) = x̄i⊤Spooled⁻¹x − (1/2)x̄i⊤Spooled⁻¹x̄i + log pi
=⇒ Estimated Minimum TPM Rule for Equal-Covariance Normal Populations:
1. Allocate x to πk if d̂k(x) is the largest of the g values d̂i(x), i = 1, 2, . . . , g.
UNSW MATH5855 2021T3 Lecture 12 Slide 34
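A sketch of the linear-discriminant-score rule for g = 3 populations with equal priors (all parameter values are made up):

```python
import numpy as np

# Made-up means and common covariance for g = 3 populations; equal priors
# mean the log p_i terms are a shared constant and can be dropped.
mus = [np.array([0.0, 0.0]), np.array([3.0, 0.0]), np.array([0.0, 3.0])]
Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])
Si = np.linalg.inv(Sigma)

def lin_score(x, mu):
    # d_i(x) = mu_i^T Sigma^{-1} x - (1/2) mu_i^T Sigma^{-1} mu_i
    return mu @ Si @ x - 0.5 * mu @ Si @ mu

def allocate(x):
    # Allocate to the population with the largest linear score.
    return int(np.argmax([lin_score(x, mu) for mu in mus])) + 1

assert allocate(np.array([2.8, 0.2])) == 2
assert allocate(np.array([0.1, 0.1])) == 1
```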
12. Discrimination and Classification
12.1 Separation and Classification for two populations
12.2 Classification errors
12.3 Summarising
12.4 Optimal classification rules
12.5 Classification with two multivariate normal populations
12.6 Classification with more than 2 normal populations
12.7 Software
12.8 Examples
12.9 Additional resources
12.10 Exercises
UNSW MATH5855 2021T3 Lecture 12 Slide 35
SAS: PROC DISCRIM
R: MASS::lda(), MASS::qda()
UNSW MATH5855 2021T3 Lecture 12 Slide 36
12. Discrimination and Classification
12.1 Separation and Classification for two populations
12.2 Classification errors
12.3 Summarising
12.4 Optimal classification rules
12.5 Classification with two multivariate normal populations
12.6 Classification with more than 2 normal populations
12.7 Software
12.8 Examples
12.9 Additional resources
12.10 Exercises
UNSW MATH5855 2021T3 Lecture 12 Slide 37
Example 12.4.
Linear and quadratic discriminant analysis for the Edgar
Anderson’s Iris data, and using cross-validation to assess classifiers.
UNSW MATH5855 2021T3 Lecture 12 Slide 38
12. Discrimination and Classification
12.1 Separation and Classification for two populations
12.2 Classification errors
12.3 Summarising
12.4 Optimal classification rules
12.5 Classification with two multivariate normal populations
12.6 Classification with more than 2 normal populations
12.7 Software
12.8 Examples
12.9 Additional resources
12.10 Exercises
UNSW MATH5855 2021T3 Lecture 12 Slide 39
Additional resources
▶ JW Sec. 11.1–11.6.
UNSW MATH5855 2021T3 Lecture 12 Slide 40
12. Discrimination and Classification
12.1 Separation and Classification for two populations
12.2 Classification errors
12.3 Summarising
12.4 Optimal classification rules
12.5 Classification with two multivariate normal populations
12.6 Classification with more than 2 normal populations
12.7 Software
12.8 Examples
12.9 Additional resources
12.10 Exercises
UNSW MATH5855 2021T3 Lecture 12 Slide 41
Exercise 12.1
Three bivariate normal populations, labelled i = 1, 2, 3, have the same
covariance matrix

Σ = ( 1    0.5
      0.5  1 )

and means

µ1 = (1, 1)⊤, µ2 = (1, 0)⊤, µ3 = (0, 1)⊤,

respectively.
(a) Suggest a classification rule for an observation x = (x1, x2)⊤ that
corresponds to one of the three populations. You may assume
equal priors for the three populations and equal
misclassification costs.
(b) Classify the following observations to one of the three
distributions: (0.2, 0.6)⊤, (2, 0.8)⊤, (0.75, 1)⊤.
(c) Show that in R2, the 3 classification regions are bounded by
straight lines and draw a graph of these three regions.
UNSW MATH5855 2021T3 Lecture 12 Slide 42
Lecture 13: Support Vector Machines
13.1 Introduction and motivation
13.2 Expected versus Empirical Risk minimisation
13.3 Basic idea of SVMs
13.4 Estimation
13.5 Nonlinear SVMs
13.6 Multiple classes
13.7 SVM specification and tuning
13.8 Examples
13.9 Conclusion
UNSW MATH5855 2021T3 Lecture 13 Slide 1
13. Support Vector Machines
13.1 Introduction and motivation
13.2 Expected versus Empirical Risk minimisation
13.3 Basic idea of SVMs
13.4 Estimation
13.5 Nonlinear SVMs
13.6 Multiple classes
13.7 SVM specification and tuning
13.8 Examples
13.9 Conclusion
UNSW MATH5855 2021T3 Lecture 13 Slide 2
▶ Recall Lecture 12: classifying between 2 p-dimensional MVN
populations,
▶ scores are either linear or quadratic
▶ Optimal, but only for MVN.
▶ Non-MVN =⇒ nonlinear/nonquadratic boundaries
▶ Support vector machines (SVMs) are a nonlinear technique that
often performs well.
▶ We will formulate classification as an empirical risk minimisation
problem and solve it under additional restrictions on the
allowed (nonlinear) classifier functions.
UNSW MATH5855 2021T3 Lecture 13 Slide 3
13. Support Vector Machines
13.1 Introduction and motivation
13.2 Expected versus Empirical Risk minimisation
13.3 Basic idea of SVMs
13.4 Estimation
13.5 Nonlinear SVMs
13.6 Multiple classes
13.7 SVM specification and tuning
13.8 Examples
13.9 Conclusion
UNSW MATH5855 2021T3 Lecture 13 Slide 4
▶ Let
Y group “indicator” with values +1 and −1
x ∈ Rp vector based on which we wish to classify the
observation
Wanted: A “best” classifier in a class F of functions f .
▶ f (x) maps x onto +1 or −1
▶ Minimise expected risk
R(f) = E_{X,Y}[(1/2)|f(X) − Y|] = ∫ (1/2)|f(x) − y| dP(x, y)
▶ Joint distribution P(x, y) is unknown in practice =⇒
empirical risk over a training set (xi, yi), i = 1, 2, . . . , n is:

R̂(f) = (1/n) ∑_{i=1}^n (1/2)|f(xi) − yi|
▶ I.e., “zero-one loss” given by

L(x, y) = (1/2)|f(x) − y|.
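The empirical risk under the zero-one loss can be computed directly; a minimal sketch in Python (the data and the classifier f(x) = sign(x) are illustrative, not from the lecture):

```python
import numpy as np

# Toy labels in {-1, +1} and an illustrative classifier f(x) = sign(x).
x = np.array([-2.0, -0.5, 0.3, 1.0, 2.0])
y = np.array([-1, -1, +1, -1, +1])

f_x = np.sign(x)                           # predicted labels in {-1, +1}
emp_risk = np.mean(0.5 * np.abs(f_x - y))  # R-hat(f): each error contributes 1/n
print(emp_risk)  # one of the five points is misclassified -> 0.2
```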
UNSW MATH5855 2021T3 Lecture 13 Slide 5
▶ Minimising empirical risk =⇒ find fn = argminf ∈F R̂(f ) as
an approximation to fopt = argminf ∈F R(f ).
▶ Not the same thing and may be very different.
▶ Vapnik: If F is not too large and n → ∞, an upper bound on
their difference with probability 1− η:
R(f) ≤ R̂(f) + ϕ(h/n, (log η)/n)
h is Vapnik–Chervonenkis (VC) dimension (i.e., a measure
of the complexity of the class F).
ϕ is monotone increasing in h (at least for large enough
sample sizes n).
=⇒ Test error is bounded from above by the sum of the training
error and the complexity of the set of models under
consideration.
=⇒ Limiting the complexity of the model limits the
discrepancy between training and test error.
UNSW MATH5855 2021T3 Lecture 13 Slide 6
▶ A linear classification rule has the form f (x) = sign(w⊤x + b)
for some w ∈ Rp and b ∈ R.
▶ For a linear classification rule,

ϕ(h/n, (log η)/n) = √( [h(log(2n/h) + 1) − log(η/4)] / n )
▶ VC dimension h = p + 1
▶ Now,

(∂/∂h)[ (h(log(2n/h) + 1) − log(η/4)) / n ] = (1/n) log(2n/h) > 0

(as long as h < 2n).
▶ In general, the VC dimension of a given set of functions is
equal to the maximal number of points that can be
“shattered”—separated in all possible ways by that set of
functions.
UNSW MATH5855 2021T3 Lecture 13 Slide 7
▶ “Richer” function class F =⇒ less training classification
error.
▶ Will overfit and not generalise, however.
▶ “Richer” =⇒ higher value of h =⇒ higher ϕ (for large
enough n) =⇒ more discrepancy between R(f) and R̂(f)
▶ The rest of the lecture focuses on ways to solve (or solve
approximately) this minimisation problem for some classes F .
▶ For additional information, see references in the notes.
UNSW MATH5855 2021T3 Lecture 13 Slide 8
13. Support Vector Machines
13.1 Introduction and motivation
13.2 Expected versus Empirical Risk minimisation
13.3 Basic idea of SVMs
13.4 Estimation
13.5 Nonlinear SVMs
13.6 Multiple classes
13.7 SVM specification and tuning
13.8 Examples
13.9 Conclusion
UNSW MATH5855 2021T3 Lecture 13 Slide 9
Linear Classifiers
▶ A linear classifier is one that given feature vector xnew and
weights w , classifies ynew based on the value of w⊤xnew; for
example,
ŷnew = +1 if w⊤xnew + b > 0, and ŷnew = −1 if w⊤xnew + b < 0,
for a threshold −b.
▶ Every element of x, xi, gets a weight wi:
Sign of wi determines whether increasing xi pushes the
prediction toward ŷnew = −1 or ŷnew = +1.
Magnitude of wi determines how strongly.
UNSW MATH5855 2021T3 Lecture 13 Slide 10
Hyperplane Interpretation and Linear Separability
▶ We separate +1s from −1s at w⊤x + b = 0.
▶ Points x that satisfy that equation exactly form a line (if
d = 2), a plane (if d = 3), or, in general, a hyperplane.
▶ Data are linearly separable if a hyperplane that separates them
exists:
(Figure: two linearly separable classes, −1 and +1, in the (x1, x2)
plane, with a separating line; the normal vector w points away from
the line, which lies at a signed distance −b/∥w∥ from the origin.)
UNSW MATH5855 2021T3 Lecture 13 Slide 11
Maximum Margin
▶ Usually, there are many different hyperplanes which could be
used to separate a linearly separable dataset.
▶ The “best” choice can be regarded as the middle of the
widest empty strip (or higher dimensional analogue) between
the two classes.
(Figure: the two classes with the separating hyperplane w⊤x + b = 0
and parallel hyperplanes w⊤x + b+ = 0 and w⊤x + b− = 0 touching each
class; the empty strip between them has width |b+ − b−|/∥w∥.)

=⇒ We want to make the margin |b+ − b−|/∥w∥ as big as possible.
UNSW MATH5855 2021T3 Lecture 13 Slide 12
▶ The scale of w and b is arbitrary: for arbitrary α ̸= 0, any x
that satisfies w⊤x + b = 0 also satisfies
(αw)⊤x + (αb) = α(w⊤x + b) = 0, so (w , b) and (αw , αb)
define the same plane.
=⇒ We fix |b+ − b| = |b− − b| = 1, and only vary w: our “outer”
hyperplanes become

w⊤x + (b − 1) = 0
w⊤x + (b + 1) = 0

=⇒ A margin of |b+ − b−|/∥w∥ = 2/∥w∥ is maximised by minimising ∥w∥.
A Linear Support Vector Machine minimises ∥w∥2 subject to
separating −1s and +1s.
UNSW MATH5855 2021T3 Lecture 13 Slide 13
13. Support Vector Machines
13.1 Introduction and motivation
13.2 Expected versus Empirical Risk minimisation
13.3 Basic idea of SVMs
13.4 Estimation
13.4.1 Linear SVM: Separable Case
13.4.2 Linear SVM: Nonseparable Case
13.5 Nonlinear SVMs
13.6 Multiple classes
13.7 SVM specification and tuning
13.8 Examples
13.9 Conclusion
UNSW MATH5855 2021T3 Lecture 13 Slide 14
13. Support Vector Machines
13.4 Estimation
13.4.1 Linear SVM: Separable Case
13.4.2 Linear SVM: Nonseparable Case
UNSW MATH5855 2021T3 Lecture 13 Slide 15
Linear SVM: Separable Case
▶ Write

w⊤x + (b − 1) = 0 =⇒ w⊤x + b = +1
w⊤x + (b + 1) = 0 =⇒ w⊤x + b = −1

▶ Recall

ŷi = sign(w⊤xi + b), i.e., +1 if w⊤xi + b > 0 and −1 if w⊤xi + b < 0.

=⇒ If w⊤x + b = 0 separates −1s and +1s (i.e., yi = ŷi for all
i = 1, . . . , n),

yi(w⊤xi + b) ≥ 1
=⇒ A linear SVM learning task can be expressed as a
constrained optimisation problem:

argmin_w (1/2)∥w∥²  subject to  yi(w⊤xi + b) ≥ 1, i = 1, . . . , n.
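To make the constraint concrete, here is a small numerical check (data and the candidate (w, b) are chosen by hand for illustration): a (w, b) satisfying yi(w⊤xi + b) ≥ 1 for all i separates the classes with margin 2/∥w∥:

```python
import numpy as np

# Toy separable data (illustrative, not from the lecture).
X = np.array([[1.0, 1.0], [2.0, 2.0], [4.0, 5.0], [5.0, 4.0]])
y = np.array([-1, -1, +1, +1])

# Hand-chosen hyperplane; for these points it meets every constraint
# y_i (w^T x_i + b) >= 1, with equality for the points on the margin.
w, b = np.array([0.4, 0.4]), -2.6
constraints = y * (X @ w + b)
margin = 2.0 / np.linalg.norm(w)
print(constraints, margin)  # constraints all >= 1; margin = 2/||w|| ~ 3.54
```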
UNSW MATH5855 2021T3 Lecture 13 Slide 16
Lagrange Multiplier Technique
▶ The objective is quadratic (convex) and the constraints are
linear.
▶ This can be solved by Lagrange multipliers.
1. Rewrite the objective function as the Lagrangian (note the
use of αi s instead of λi s):

Lag(w, b; α) = (1/2)∥w∥² − ∑_{i=1}^n αi [yi(w⊤xi + b) − 1].
2. As the constraints are inequalities rather than equalities, apply
the so-called KKT (Karush–Kuhn–Tucker) conditions: the
saddle point (w , b,α) : Lag′(w , b;α) = 0 will be the
constrained optimum if αi ≥ 0, i = 1, . . . , n.
=⇒ Solve for Lag′(w , b;α) = 0 subject to αi ≥ 0.
UNSW MATH5855 2021T3 Lecture 13 Slide 17
3. Set derivatives of Lag with respect to w and b equal to zero:

∂Lag/∂w = w − ∑_{i=1}^n αi yi xi = 0 =⇒ w = ∑_{i=1}^n αi yi xi,
∂Lag/∂b = −∑_{i=1}^n αi yi = 0 =⇒ ∑_{i=1}^n αi yi = 0.
4. Note, also, that

yi(w⊤xi + b) − 1 ≥ 0, i = 1, . . . , n,
αi [yi(w⊤xi + b) − 1] = 0, i = 1, . . . , n.
=⇒ Each αi must be zero unless yi (w⊤xi + b) = 1, in which case
the training instance lies on a corresponding hyperplane and is
known as a support vector.
UNSW MATH5855 2021T3 Lecture 13 Slide 18
Dual Optimisation Problem
▶ Substituting the expression of w in terms of α and expanding
∥w∥², we get

LagD(α) = ∑_{i=1}^n αi − (1/2) ∑_{i=1}^n ∑_{j=1}^n αi αj yi yj x⊤i xj,

to be maximised subject to

αi ≥ 0, i = 1, . . . , n, and ∑_{i=1}^n αi yi = 0.
▶ This is a quadratic programming problem, for which many
software tools are available.
UNSW MATH5855 2021T3 Lecture 13 Slide 19
13. Support Vector Machines
13.4 Estimation
13.4.1 Linear SVM: Separable Case
13.4.2 Linear SVM: Nonseparable Case
UNSW MATH5855 2021T3 Lecture 13 Slide 20
Linear SVM: Nonseparable Case
▶ In many real-world problems, it is not possible to find
hyperplanes which perfectly separate the target classes.
▶ The soft margin approach considers a trade-off between
margin width and number of training misclassifications.
▶ Slack variables ξi ≥ 0 are included in the constraints
yi (w⊤xi + b) ≥ 1− ξi . (13.1)
UNSW MATH5855 2021T3 Lecture 13 Slide 21
▶ Optimisation becomes

argmin_{w,ξ} [ (1/2)∥w∥² + C ∑_{i=1}^n ξi ]
subject to yi(w⊤xi + b) ≥ 1 − ξi, i = 1, . . . , n,

for a tuning constant C.
▶ Small C : lots of slack.
▶ Large C : little slack.
▶ C = ∞: hard margin.
▶ Now, yi (w⊤xi + b) ≥ 1− ξi =⇒ ξi ≥ 1− yi (w⊤xi + b).
▶ We want to make ξi as small as possible.
=⇒ ξi = max{0, 1− yi (w⊤xi + b)}.
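The slack values follow mechanically from (w, b); a sketch with an illustrative, deliberately imperfect hyperplane:

```python
import numpy as np

# Slack xi_i = max(0, 1 - y_i (w^T x_i + b)) for an illustrative w, b.
X = np.array([[1.0, 1.0], [2.0, 2.0], [4.0, 5.0], [5.0, 4.0]])
y = np.array([-1, -1, +1, +1])
w, b = np.array([1.0, 1.0]), -3.0

xi = np.maximum(0.0, 1.0 - y * (X @ w + b))  # zero iff the margin is met
print(xi)  # only the second point violates its margin constraint
```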
UNSW MATH5855 2021T3 Lecture 13 Slide 22
Dual Optimisation Problem
▶ The Lagrangian is now (with additional multipliers µ):

Lag(w, b, ξ; α, µ) = (1/2)∥w∥² + C ∑_{i=1}^n ξi
    − ∑_{i=1}^n αi [yi(w⊤xi + b) − 1 + ξi] − ∑_{i=1}^n µi ξi.
UNSW MATH5855 2021T3 Lecture 13 Slide 23
▶ Now,

∂Lag/∂w = w − ∑_{i=1}^n αi yi xi = 0 =⇒ w = ∑_{i=1}^n αi yi xi,
∂Lag/∂b = −∑_{i=1}^n αi yi = 0 =⇒ ∑_{i=1}^n αi yi = 0,
∂Lag/∂ξ = C 1n − α − µ = 0 =⇒ C − αi − µi = 0, i = 1, . . . , n,

with additional KKT conditions for i = 1, . . . , n:

αi ≥ 0,
µi ≥ 0,
αi [yi(w⊤xi + b) − 1 + ξi] = 0.
UNSW MATH5855 2021T3 Lecture 13 Slide 24
▶ Substituting into the Lagrangian leads to

LagD(α, µ) = ∑_{i=1}^n αi − (1/2) ∑_{j=1}^n ∑_{k=1}^n αj αk yj yk (x⊤j xk)
             + ∑_{i=1}^n ξi (C − αi − µi).

▶ But C − αi − µi = 0, so as long as αi ≤ C, µi ≥ 0 is
completely determined by αi, and we get a dual problem

argmax_α ∑_{i=1}^n αi − (1/2) ∑_{j=1}^n ∑_{k=1}^n αj αk yj yk (x⊤j xk)

subject to ∑_{i=1}^n αi yi = 0 and 0 ≤ αi ≤ C, i = 1, . . . , n.
UNSW MATH5855 2021T3 Lecture 13 Slide 25
The consequences
Primal: ŷ(x) = sign(w⊤x + b) (13.2)
Dual: ŷ(x) = sign{ ∑_{j=1}^n αj yj (x⊤j x) + b } (13.3)
▶ Primal (w) form requires d parameters, while dual (α) form
requires n parameters.
▶ If d ≫ n, dual is more efficient.
▶ But, notice that only the xi s closest to the hyperplane matter
in determining w , so most of them will have no effect.
=⇒ Most αj s in w = ∑_{j=1}^n αj yj xj will be 0!
=⇒ Computationally, effective “n” is actually much smaller than
the sample size.
=⇒ Those xi s that “support” the hyperplane are called support
vectors.
▶ Also, notice that the dual form only depends on (x⊤j xk)s.
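The sparsity of the dual solution can be seen by solving the soft-margin dual directly for a tiny dataset; a sketch using scipy's general-purpose SLSQP solver (a dedicated quadratic programming solver would normally be used, and the data are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 2.0], [4.0, 5.0], [5.0, 4.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
C = 10.0
Q = (y[:, None] * y[None, :]) * (X @ X.T)  # Q_jk = y_j y_k (x_j^T x_k)

# Maximise sum(alpha) - (1/2) alpha^T Q alpha, i.e. minimise its negative,
# subject to sum_i alpha_i y_i = 0 and 0 <= alpha_i <= C.
res = minimize(lambda a: 0.5 * a @ Q @ a - a.sum(), np.zeros(4),
               bounds=[(0.0, C)] * 4,
               constraints={"type": "eq", "fun": lambda a: a @ y})
alpha = res.x
w = (alpha * y) @ X                        # w = sum_j alpha_j y_j x_j
print(np.round(alpha, 3))                  # most alpha_j are (numerically) 0
```

Here only three of the four points end up with a nonzero αj: they are the support vectors.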
UNSW MATH5855 2021T3 Lecture 13 Slide 26
13. Support Vector Machines
13.1 Introduction and motivation
13.2 Expected versus Empirical Risk minimisation
13.3 Basic idea of SVMs
13.4 Estimation
13.5 Nonlinear SVMs
13.6 Multiple classes
13.7 SVM specification and tuning
13.8 Examples
13.9 Conclusion
UNSW MATH5855 2021T3 Lecture 13 Slide 27
Nonlinear SVM
▶ SVM methodology can be modified to create nonlinear
decision boundaries.
▶ Consider:
(Figure: a two-class dataset in the (x1, x2) plane that is not
linearly separable.)
UNSW MATH5855 2021T3 Lecture 13 Slide 28
▶ The technique involves transforming the original x space so
that a linear decision boundary can separate instances in the
transformed space.
▶ Suppose we augmented our x with squared terms:
(x1, x2) → (x1, x2, x1², x2²):

(Figure: pairwise scatterplots of the augmented coordinates
x1, x2, x1², x2².)
=⇒ Nonlinear problem becomes a linear problem.
UNSW MATH5855 2021T3 Lecture 13 Slide 29
Kernel Trick
▶ The dual form depends only on dot products x⊤i xj .
=⇒ We can specify other kernels k(xi , xj).
▶ E.g., a “kernel” function of the form k(u, v) = (u⊤v + 1)²
can be regarded as a dot product:

(u⊤v + 1)² = u1²v1² + u2²v2² + 2u1u2v1v2 + 2u1v1 + 2u2v2 + 1
           = (u1², u2², √2 u1u2, √2 u1, √2 u2, 1)⊤ (v1², v2², √2 v1v2, √2 v1, √2 v2, 1)
▶ In general, kernel functions can be expressed in terms of high
dimensional dot products.
▶ Computing dot products via kernel functions is
computationally “cheaper” than using transformed attributes
directly.
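This equivalence can be checked numerically; a sketch for 2-dimensional inputs (the feature map phi below includes the √2·u1u2 cross term from expanding the square):

```python
import numpy as np

def phi(u):
    # Explicit feature map whose dot product reproduces (u^T v + 1)^2.
    return np.array([u[0]**2, u[1]**2, np.sqrt(2) * u[0] * u[1],
                     np.sqrt(2) * u[0], np.sqrt(2) * u[1], 1.0])

u, v = np.array([0.7, -1.2]), np.array([2.0, 0.5])
k_uv = (u @ v + 1.0)**2
print(np.isclose(k_uv, phi(u) @ phi(v)))  # True: kernel = dot product of features
```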
UNSW MATH5855 2021T3 Lecture 13 Slide 30
Radial Basis Function
▶ A radial basis function is a function of distance from the
origin, or from another fixed point v .
▶ Usually distance is Euclidean, i.e.

∥u − v∥ = √((u1 − v1)² + · · · + (un − vn)²)
▶ A common form of radial basis function is Gaussian:

ϕ(∥u − v∥) = exp(−γ∥u − v∥²)
(Maximum of 1 occurs when u = v , decreases towards zero as
u moves away from v .)
▶ We can use k(u, v) = ϕ(∥u − v∥) as our SVM kernel.
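A sketch of the Gaussian RBF kernel and its two basic properties (maximum of 1 at u = v, decay with distance):

```python
import numpy as np

def rbf_kernel(u, v, gamma=1.0):
    # Gaussian radial basis function: exp(-gamma * ||u - v||^2).
    return np.exp(-gamma * np.sum((u - v)**2))

u = np.array([1.0, 2.0])
print(rbf_kernel(u, u))         # 1.0: the maximum, attained at u = v
print(rbf_kernel(u, u + 10.0))  # essentially 0 far away from v
```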
UNSW MATH5855 2021T3 Lecture 13 Slide 31
13. Support Vector Machines
13.1 Introduction and motivation
13.2 Expected versus Empirical Risk minimisation
13.3 Basic idea of SVMs
13.4 Estimation
13.5 Nonlinear SVMs
13.6 Multiple classes
13.7 SVM specification and tuning
13.8 Examples
13.9 Conclusion
UNSW MATH5855 2021T3 Lecture 13 Slide 32
Multiclass SVMs
▶ The SVM technique can be adapted to handle multiclass
problems (K categories) rather than binary classification
problems (2 categories):
One-against-rest:
▶ Recall that w⊤xi gives us a “score” that we normally
compare to b, but we don’t have to.
▶ For each k = 1, . . . ,K fit a separate SVM (i.e., wk and bk)
for whether an observation is in class k vs. not.
▶ Predict ŷnew by evaluating w⊤k xnew + bk for each k and
taking the biggest one.
One-against-one:
▶ For every pair k1, k2 = 1, . . . ,K , fit an SVM for k1 vs. k2.
▶ Requires K (K − 1)/2 binary classifiers.
UNSW MATH5855 2021T3 Lecture 13 Slide 33
13. Support Vector Machines
13.1 Introduction and motivation
13.2 Expected versus Empirical Risk minimisation
13.3 Basic idea of SVMs
13.4 Estimation
13.5 Nonlinear SVMs
13.6 Multiple classes
13.7 SVM specification and tuning
13.8 Examples
13.9 Conclusion
UNSW MATH5855 2021T3 Lecture 13 Slide 34
User Control
▶ Categorical data can be handled by introducing binary dummy
variables to indicate each attribute value.
▶ The user must specify some control parameters, e.g. type of
kernel function and cost constant C for slack variables.
▶ The following kernel functions are available via the R e1071
package:
linear: u⊤v
polynomial: (γu⊤v + c0)^p
radial basis: exp(−γ∥u − v∥²)
sigmoid: tanh(γu⊤v + c0)
for constants γ, p, and c0.
UNSW MATH5855 2021T3 Lecture 13 Slide 35
13. Support Vector Machines
13.1 Introduction and motivation
13.2 Expected versus Empirical Risk minimisation
13.3 Basic idea of SVMs
13.4 Estimation
13.5 Nonlinear SVMs
13.6 Multiple classes
13.7 SVM specification and tuning
13.8 Examples
13.9 Conclusion
UNSW MATH5855 2021T3 Lecture 13 Slide 36
Example 13.1.
SVM classification for the Edgar Anderson’s Iris data, and using
ROC curves.
UNSW MATH5855 2021T3 Lecture 13 Slide 37
13. Support Vector Machines
13.1 Introduction and motivation
13.2 Expected versus Empirical Risk minimisation
13.3 Basic idea of SVMs
13.4 Estimation
13.5 Nonlinear SVMs
13.6 Multiple classes
13.7 SVM specification and tuning
13.8 Examples
13.9 Conclusion
UNSW MATH5855 2021T3 Lecture 13 Slide 38
Advantages and Disadvantages
+ SVM training can be formulated as a convex optimisation
problem, with efficient algorithms for finding the global
minimum.
+ SVM involves support vectors rather than the whole training
set, so outliers have less effect than for other methods.
− Much harder to interpret than model-based classification
techniques.
− Does not directly provide class probability estimates, although
these can be estimated by cross-validation.
UNSW MATH5855 2021T3 Lecture 13 Slide 39
Lecture 14: Cluster Analysis
14.1 “Classical”
14.2 Model-based clustering
14.3 Additional resources
UNSW MATH5855 2021T3 Lecture 14 Slide 1
Goal
Given: An unlabelled sample x1, . . . , xn ∈ Rp.
Wanted: A grouping of observations such that more similar
observations are placed in the same group
▶ I.e., assign to each xi a group index Gi ∈ {1, . . . ,K} s.t. if
Gi = Gj , xi and xj are “on average” more similar in some
sense than if Gi ̸= Gj .
▶ Let G = (G1, . . . ,Gn)⊤ for conciseness.
▶ Equivalently, partition i = 1, . . . , n into K sets S1, . . . ,SK
so that if two points belong to the same set, they are more
similar “on average” than if they do not.
▶ Call S = (S1, . . . ,SK ) (collection of sets) for conciseness.
▶ Differs from SVM and discriminant analysis in that no labels
are provided in the data.
▶ An example of unsupervised learning.
UNSW MATH5855 2021T3 Lecture 14 Slide 2
Approaches
“classical”: An algorithm that seeks to put more similar (in some
sense) observations into the same cluster
hierarchical: Produces a hierarchy of nested clusterings
non-hierarchical: Just a single clustering
▶ Often seeks to minimise some objective function
model-based: An MLE or Bayesian solution to a mixture model
UNSW MATH5855 2021T3 Lecture 14 Slide 3
14. Cluster Analysis
14.1 “Classical”
14.1.1 Components
14.1.2 Example: K -means
14.1.3 Extension: K -medoids
14.1.4 Hierarchical clustering
14.1.5 Software
14.1.6 Assessing
14.1.7 Examples
14.2 Model-based clustering
14.3 Additional resources
UNSW MATH5855 2021T3 Lecture 14 Slide 4
14. Cluster Analysis
14.1 “Classical”
14.1.1 Components
14.1.2 Example: K -means
14.1.3 Extension: K -medoids
14.1.4 Hierarchical clustering
14.1.5 Software
14.1.6 Assessing
14.1.7 Examples
UNSW MATH5855 2021T3 Lecture 14 Slide 5
Specify:
proximity measure: some function d(x1, x2) that determines the
difference between two observations (or a similarity score), e.g.,
Euclidean: ∥x1 − x2∥
taxicab/Manhattan: ∥x1 − x2∥1 = ∑_{j=1}^p |x1j − x2j|
Gower: p⁻¹ ∑_{j=1}^p I(x1j ≠ x2j) (xij binary)
▶ Should be substantively meaningful.
algorithm choice: an algorithm that minimises within-cluster and
maximises between-cluster distances in some sense
UNSW MATH5855 2021T3 Lecture 14 Slide 6
14. Cluster Analysis
14.1 “Classical”
14.1.1 Components
14.1.2 Example: K -means
14.1.3 Extension: K -medoids
14.1.4 Hierarchical clustering
14.1.5 Software
14.1.6 Assessing
14.1.7 Examples
UNSW MATH5855 2021T3 Lecture 14 Slide 7
▶ Simple, intuitive algorithm.
▶ For a Euclidean distance, minimise

argmin_S ∑_{k=1}^K (1/(2|Sk|)) ∑_{i,j∈Sk} ∥xi − xj∥²

▶ Equivalent to minimising

argmin_S ∑_{k=1}^K ∑_{i∈Sk} ∥xi − x̄Sk∥²,  where  x̄Sk = (1/|Sk|) ∑_{i∈Sk} xi
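The equivalence of the two objectives rests on the identity ∑_{i,j∈S} ∥xi − xj∥² = 2|S| ∑_{i∈S} ∥xi − x̄S∥²; a numerical check on arbitrary simulated data and an arbitrary clustering:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))
G = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 0])  # an arbitrary 2-cluster labelling

lhs = rhs = 0.0
for k in (0, 1):
    Xk = X[G == k]
    diffs = Xk[:, None, :] - Xk[None, :, :]
    lhs += np.sum(diffs**2) / (2 * len(Xk))    # pairwise form
    rhs += np.sum((Xk - Xk.mean(axis=0))**2)   # distance-to-centroid form
print(np.isclose(lhs, rhs))  # True
```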
UNSW MATH5855 2021T3 Lecture 14 Slide 8
Procedure
1. Randomly assign a cluster index to each element of G (0).
2. Calculate cluster means (centroids):

   x̄_{Sk(t−1)} = (1/|Sk(t−1)|) ∑_{i∈Sk(t−1)} xi, k = 1, . . . ,K .

3. Calculate distances of each data point from each mean:

   dik = ∥xi − x̄_{Sk(t−1)}∥, i = 1, . . . , n, k = 1, . . . ,K .

4. Reassign each point to its nearest mean:

   Gi(t) = argmin_k dik.
5. Repeat from Step 2 until G (t) = G (t−1).
A toy example is given in lecture.
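The procedure is short enough to sketch directly; run on the toy data of the following slides, it reproduces the worked iterations:

```python
import numpy as np

X = np.array([[1, 1], [2, 2], [4, 5], [5, 4], [4, 4]], dtype=float)
G = np.array([0, 1, 0, 1, 0])  # iteration-0 clustering from the slides

while True:
    # Step 2: centroids; Steps 3-4: distances and reassignment.
    centroids = np.array([X[G == k].mean(axis=0) for k in (0, 1)])
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    G_new = d.argmin(axis=1)
    if np.array_equal(G_new, G):  # Step 5: stop when nothing changes
        break
    G = G_new
print(G)  # [0 0 1 1 1]: points 1-2 vs. points 3-5, as in the worked example
```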
UNSW MATH5855 2021T3 Lecture 14 Slide 9
Data
Data:
Index V1 V2
1 1 1
2 2 2
3 4 5
4 5 4
5 4 4
(Plot of the five data points in the (V1, V2) plane, labelled by index.)
UNSW MATH5855 2021T3 Lecture 14 Slide 10
Iteration 0: Initial clustering (random)
Data:
Index V1 V2 C
1 1 1 1
2 2 2 2
3 4 5 1
4 5 4 2
5 4 4 1
(Plot of the points, coloured by the random initial clustering.)
UNSW MATH5855 2021T3 Lecture 14 Slide 11
Iteration 1a: Calculate centroids
Data:
Index V1 V2 C
1 1 1 1
2 2 2 2
3 4 5 1
4 5 4 2
5 4 4 1
Centroids:
C V1 V2
1 3.0 3.333333
2 3.5 3.000000
(Plot of the points with the current centroids marked.)
UNSW MATH5855 2021T3 Lecture 14 Slide 12
Iteration 1b: Update clustering
Data:
Index V1 V2 C
1 1 1 1
2 2 2 1
3 4 5 1
4 5 4 2
5 4 4 2
Distances to centroid:
Index 1 2
1 3.073182 3.201562
2 1.666667 1.802776
3 1.943651 2.061553
4 2.108185 1.802776
5 1.201850 1.118034
(Plot of the points, coloured by the updated clustering.)
UNSW MATH5855 2021T3 Lecture 14 Slide 13
Iteration 2a: Calculate centroids
Data:
Index V1 V2 C
1 1 1 1
2 2 2 1
3 4 5 1
4 5 4 2
5 4 4 2
Centroids:
C V1 V2
1 2.333333 2.666667
2 4.500000 4.000000
(Plot of the points with the updated centroids marked.)
UNSW MATH5855 2021T3 Lecture 14 Slide 14
Iteration 2b: Update clustering
Data:
Index V1 V2 C
1 1 1 1
2 2 2 1
3 4 5 2
4 5 4 2
5 4 4 2
Distances to centroid:
Index 1 2
1 2.134375 4.609772
2 0.745356 3.201562
3 2.867442 1.118034
4 2.981424 0.500000
5 2.134375 0.500000
(Plot of the points, coloured by the updated clustering.)
UNSW MATH5855 2021T3 Lecture 14 Slide 15
Iteration 3a: Calculate centroids
Data:
Index V1 V2 C
1 1 1 1
2 2 2 1
3 4 5 2
4 5 4 2
5 4 4 2
Centroids:
C V1 V2
1 1.500000 1.500000
2 4.333333 4.333333
(Plot of the points with the updated centroids marked.)
UNSW MATH5855 2021T3 Lecture 14 Slide 16
Iteration 3b: Update clustering
Data:
Index V1 V2 C
1 1 1 1
2 2 2 1
3 4 5 2
4 5 4 2
5 4 4 2
Distances to centroid:
Index 1 2
1 0.7071068 4.7140452
2 0.7071068 3.2998316
3 4.3011626 0.7453560
4 4.3011626 0.7453560
5 3.5355339 0.4714045
(Plot of the points, coloured by the final clustering.)
Clustering unchanged.
Finished!
UNSW MATH5855 2021T3 Lecture 14 Slide 17
14. Cluster Analysis
14.1 “Classical”
14.1.1 Components
14.1.2 Example: K -means
14.1.3 Extension: K -medoids
14.1.4 Hierarchical clustering
14.1.5 Software
14.1.6 Assessing
14.1.7 Examples
UNSW MATH5855 2021T3 Lecture 14 Slide 18
medoid x̃k of cluster k: a specific observation that has the smallest
summed distance to the other observations in Sk :

x̃Sk = argmin_{xi : i∈Sk} ∑_{j∈Sk} d(xj, xi)

▶ The method of K -medoids, or partitioning around medoids (PAM),
minimises absolute distances:

argmin_S ∑_{k=1}^K ∑_{i∈Sk} d(xi, x̃Sk)

▶ Much slower than K -means, but more robust to outliers.
UNSW MATH5855 2021T3 Lecture 14 Slide 19
Procedure
1. Randomly assign a cluster index to each element of G (0).
2. Calculate cluster medoids:

   x̃_{Sk(t−1)} = argmin_{xi : i∈Sk(t−1)} ∑_{j∈Sk(t−1)} d(xj, xi), k = 1, . . . ,K .

3. Calculate distances of each data point from each medoid:

   dik = d(xi, x̃_{Sk(t−1)}), i = 1, . . . , n, k = 1, . . . ,K .

4. Reassign each point to its nearest medoid:

   Gi(t) = argmin_k dik.
5. Repeat from Step 2 until G (t) = G (t−1).
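A minimal sketch of this iteration, assuming Euclidean distance for d, run on the same toy data as the K-means example and starting from the same random assignment:

```python
import numpy as np

def k_medoids(X, G, K, max_iter=100):
    for _ in range(max_iter):
        # Step 2: each cluster's medoid is the member minimising its
        # summed distance to the other members.
        medoids = []
        for k in range(K):
            idx = np.flatnonzero(G == k)
            D = np.linalg.norm(X[idx][:, None] - X[idx][None, :], axis=2)
            medoids.append(idx[D.sum(axis=1).argmin()])
        # Steps 3-4: reassign each point to its nearest medoid.
        d = np.linalg.norm(X[:, None] - X[medoids][None, :], axis=2)
        G_new = d.argmin(axis=1)
        if np.array_equal(G_new, G):  # Step 5: stop when stable
            break
        G = G_new
    return G, medoids

X = np.array([[1, 1], [2, 2], [4, 5], [5, 4], [4, 4]], dtype=float)
G, medoids = k_medoids(X, np.array([0, 1, 0, 1, 0]), K=2)
print(G, medoids)  # same partition as K-means; the medoids are observations
```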
UNSW MATH5855 2021T3 Lecture 14 Slide 20
14. Cluster Analysis
14.1 “Classical”
14.1.1 Components
14.1.2 Example: K -means
14.1.3 Extension: K -medoids
14.1.4 Hierarchical clustering
14.1.5 Software
14.1.6 Assessing
14.1.7 Examples
UNSW MATH5855 2021T3 Lecture 14 Slide 21
Two approaches:
Agglomerative: Combine nearest observations into clusters, nearest
clusters into bigger clusters, etc.
▶ Need to define “distance” between clusters.
Divisive: Partition the whole dataset into clusters, clusters into
smaller clusters, etc.
▶ Need to define a criterion based on which a cluster is
partitioned.
▶ Typically visualised in a dendrogram.
UNSW MATH5855 2021T3 Lecture 14 Slide 22
Distance between clusters
Single linkage: d(S1, S2) = min{d(xi, xj) : i ∈ S1, j ∈ S2}
Complete linkage: d(S1, S2) = max{d(xi, xj) : i ∈ S1, j ∈ S2}
Average linkage (unweighted): d(S1, S2) = (1/(|S1||S2|)) ∑_{i∈S1} ∑_{j∈S2} d(xi, xj)
Average linkage (weighted): d(S1 ∪ S2, S3) = (d(S1, S3) + d(S2, S3))/2
Centroid: d(S1, S2) = ∥x̄S1 − x̄S2∥
Ward: d(S1, S2) = ∑_{i∈S1∪S2} ∥xi − x̄S1∪S2∥² − ∑_{i∈S1} ∥xi − x̄S1∥² − ∑_{i∈S2} ∥xi − x̄S2∥²
                = (|S1||S2|/(|S1|+|S2|)) ∥x̄S1 − x̄S2∥²
UNSW MATH5855 2021T3 Lecture 14 Slide 23
Lance–Williams framework
Express distance recursively:

d(S1 ∪ S2, S3) = α1 d(S1, S3) + α2 d(S2, S3)
               + β d(S1, S2) + γ |d(S1, S3) − d(S2, S3)|

Then, for, e.g.,
▶ Unweighted average linkage:

d(S1 ∪ S2, S3) = (1/(|S1 ∪ S2||S3|)) ∑_{i∈S1∪S2} ∑_{j∈S3} d(xi, xj)
               = (1/((|S1|+|S2|)|S3|)) [∑_{i∈S1} ∑_{j∈S3} d(xi, xj) + ∑_{i∈S2} ∑_{j∈S3} d(xi, xj)]
               = (|S1||S3| d(S1, S3) + |S2||S3| d(S2, S3)) / ((|S1|+|S2|)|S3|)

=⇒ α1 = |S1|/(|S1|+|S2|), α2 = |S2|/(|S1|+|S2|), β = γ = 0.
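The recursion for unweighted average linkage can be verified numerically on random clusters (an illustrative check with simulated data):

```python
import numpy as np

def avg_link(A, B):
    # Unweighted average linkage: mean pairwise distance between clusters.
    return np.mean(np.linalg.norm(A[:, None] - B[None, :], axis=2))

rng = np.random.default_rng(2)
S1, S2, S3 = rng.normal(size=(3, 2)), rng.normal(size=(4, 2)), rng.normal(size=(5, 2))

direct = avg_link(np.vstack([S1, S2]), S3)
n1, n2 = len(S1), len(S2)
recursive = (n1 * avg_link(S1, S3) + n2 * avg_link(S2, S3)) / (n1 + n2)
print(np.isclose(direct, recursive))  # True: alpha_1 = |S1|/(|S1|+|S2|), etc.
```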
UNSW MATH5855 2021T3 Lecture 14 Slide 24
Ward’s method
▶ Use squared Euclidean distance:
d(xi , xj) = ∥xi − xj∥2.
▶ Use

α1 = (|S1| + |S3|)/(|S1| + |S2| + |S3|),
α2 = (|S2| + |S3|)/(|S1| + |S2| + |S3|),
β = −|S3|/(|S1| + |S2| + |S3|), γ = 0.
▶ Ward’s method joins the groups that will increase the
within-group variance least.
UNSW MATH5855 2021T3 Lecture 14 Slide 25
14. Cluster Analysis
14.1 “Classical”
14.1.1 Components
14.1.2 Example: K -means
14.1.3 Extension: K -medoids
14.1.4 Hierarchical clustering
14.1.5 Software
14.1.6 Assessing
14.1.7 Examples
UNSW MATH5855 2021T3 Lecture 14 Slide 26
SAS:
Hierarchical: PROC CLUSTER (PROC TREE to visualise, PROC
DISTANCE to preprocess), PROC VARCLUS
Non-hierarchical: PROC FASTCLUS, PROC MODECLUS, PROC
FASTKNN
R:
Hierarchical: stats::hclust, cluster::agnes
Non-hierarchical: stats::kmeans, cluster::pam
▶ Many others
UNSW MATH5855 2021T3 Lecture 14 Slide 27
14. Cluster Analysis
14.1 “Classical”
14.1.1 Components
14.1.2 Example: K -means
14.1.3 Extension: K -medoids
14.1.4 Hierarchical clustering
14.1.5 Software
14.1.6 Assessing
14.1.7 Examples
UNSW MATH5855 2021T3 Lecture 14 Slide 28
UNSW MATH5855 2021T3 Lecture 14 Slide 29
Silhouettes
▶ Popular method, inspired by K -medioid clustering.
▶ For each i = 1, . . . , n, let

a(i) = (1/(|SGi| − 1)) ∑_{j∈SGi} d(xi, xj)
b(i) = min_{k≠Gi} (1/|Sk|) ∑_{j∈Sk} d(xi, xj).

▶ Then, the silhouette of i is

s(i) = (b(i) − a(i))/max{a(i), b(i)} if |SGi| > 1, and 0 otherwise.
=⇒ I.e. how much closer is i to the rest of its cluster than it is to
its nearest cluster?
▶ −1 ≤ s(i) ≤ +1, higher =⇒ better
▶ Mean silhouette (1/n) ∑_{i=1}^n s(i) measures the quality of the
clustering.
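A direct implementation of these definitions (Euclidean distance assumed), applied to the toy K-means data with its final clustering:

```python
import numpy as np

def silhouettes(X, G):
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = np.flatnonzero(G == G[i])
        if len(own) == 1:
            continue                          # s(i) = 0 for singleton clusters
        a = D[i, own].sum() / (len(own) - 1)  # mean distance within own cluster
        b = min(D[i, G == k].mean()           # nearest other cluster
                for k in np.unique(G) if k != G[i])
        s[i] = (b - a) / max(a, b)
    return s

X = np.array([[1, 1], [2, 2], [4, 5], [5, 4], [4, 4]], dtype=float)
s = silhouettes(X, np.array([0, 0, 1, 1, 1]))
print(s.mean())  # well-separated clusters give a mean silhouette near +1
```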
UNSW MATH5855 2021T3 Lecture 14 Slide 30
14. Cluster Analysis
14.1 “Classical”
14.1.1 Components
14.1.2 Example: K -means
14.1.3 Extension: K -medioids
14.1.4 Hierarchical clustering
14.1.5 Software
14.1.6 Assessing
14.1.7 Examples
UNSW MATH5855 2021T3 Lecture 14 Slide 31
Example 14.1.
Hierarchical, non-hierarchical clustering and assessment illustrated
on the Edgar Anderson’s Iris data.
UNSW MATH5855 2021T3 Lecture 14 Slide 32
14. Cluster Analysis
14.1 “Classical”
14.2 Model-based clustering
14.2.1 Mixture Models
14.2.2 Multivariate normal clusters
14.2.3 Model selection
14.2.4 Software
14.2.5 Examples
14.2.6 Expectation–Maximisation Algorithm
14.3 Additional resources
UNSW MATH5855 2021T3 Lecture 14 Slide 33
14. Cluster Analysis
14.2 Model-based clustering
14.2.1 Mixture Models
14.2.2 Multivariate normal clusters
14.2.3 Model selection
14.2.4 Software
14.2.5 Examples
14.2.6 Expectation–Maximisation Algorithm
UNSW MATH5855 2021T3 Lecture 14 Slide 34
Finite mixture model
finite mixture model: a probability model under which each
observation comes from one of several distributions, but we do
not observe from which one
Typically, given:
K the number of distributions
fk(xi ;θk), k = 1, . . . ,K a collection of density functions on the
support of xi with parameter vectors θk
πk , k = 1, . . . ,K probabilities of an observation coming from
density k (0 ≤ πk ≤ 1, ∑_{k=1}^K πk = 1)
▶ π = (π1, . . . , πK )⊤
Ψ = {θ1, . . . ,θK ,π} collection of mixture model parameters
For each i = 1, . . . , n,
1. Sample Gi ∈ {1, . . . ,K} with Pr(Gi = k ;π) = πk .
2. Sample Xi |Gi ∼ fGi (·;θGi ).
3. Observe Xi , and “forget” Gi .
UNSW MATH5855 2021T3 Lecture 14 Slide 35
=⇒ mixture density

fXi (xi ; Ψ) = ∑_{k=1}^K πk fk(xi ;θk) (14.1)

▶ We wish to estimate Ψ from the sample x = [x1, . . . , xn].
▶ Likelihood

Lx(Ψ) = ∏_{i=1}^n ∑_{k=1}^K πk fk(xi ;θk). (14.2)
▶ Probability model =⇒ soft clustering possible, can be
embedded in a hierarchy
▶ Likelihood =⇒ model selection
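For intuition, the likelihood (14.2) for a two-component univariate normal mixture can be evaluated directly (a sketch; the data are simulated for illustration):

```python
import numpy as np

def npdf(x, m, s):
    # Univariate normal density N(m, s^2).
    return np.exp(-0.5 * ((x - m) / s)**2) / (s * np.sqrt(2 * np.pi))

def mixture_loglik(x, pi, mu, sigma):
    # log L_x(Psi) = sum_i log sum_k pi_k f_k(x_i; theta_k)
    dens = sum(p * npdf(x, m, s) for p, m, s in zip(pi, mu, sigma))
    return np.log(dens).sum()

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(0, 1, 70), rng.normal(5, 1, 30)])

ll_true = mixture_loglik(x, [0.7, 0.3], [0.0, 5.0], [1.0, 1.0])
ll_bad = mixture_loglik(x, [0.5, 0.5], [-5.0, 10.0], [1.0, 1.0])
print(ll_true > ll_bad)  # True: parameters near the truth fit far better
```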
UNSW MATH5855 2021T3 Lecture 14 Slide 36
14. Cluster Analysis
14.2 Model-based clustering
14.2.1 Mixture Models
14.2.2 Multivariate normal clusters
14.2.3 Model selection
14.2.4 Software
14.2.5 Examples
14.2.6 Expectation–Maximisation Algorithm
UNSW MATH5855 2021T3 Lecture 14 Slide 37
fk(xi ;θk) = (2π)^{−p/2} |Σ(θk)|^{−1/2} exp{−(1/2)(xi − µ(θk))⊤ Σ(θk)^{−1} (xi − µ(θk))}
▶ µ(θk) = µk , Σ(θk) a variance model
▶ Important special case
▶ Strictly speaking, we may also have different clusters “share”
elements of θ:
fk(xi ;θ) = (2π)^{−p/2} |Σk(θ)|^{−1/2} exp{−(1/2)(xi − µk(θ))⊤ Σk(θ)^{−1} (xi − µk(θ))} (14.3)
▶ µk(θ) and Σk(θ) “extract” the appropriate elements from θ.
UNSW MATH5855 2021T3 Lecture 14 Slide 38
Parametrising the variance matrix
▶ Recall, Σ = PΛP⊤, P orthogonal, Λ diagonal and
nonnegative.
▶ Further parametrise Σ = λPAP⊤ with
P ∈ Mp,p orthogonal,
A ∈ Mp,p diagonal and nonnegative with |A| = 1,
λ > 0.
=⇒ |Σ| = λᵖ|P||A||P⊤| = λᵖ =⇒ λ is the “spread” of the cluster.
=⇒ A = Ip =⇒ Σ = λPAP⊤ = λPP⊤ = λIp =⇒ spherical:
uncorrelated, equal variances.
=⇒ A controls the shape of the ellipsoid (relative scales for
different dimensions).
=⇒ P = Ip =⇒ Σ = λPAP⊤ = λA =⇒ ellipsoidal with axes
parallel to the coordinate axes: uncorrelated, unequal variances.
=⇒ P controls the rotation of the ellipsoid (correlation).
=⇒ Constraining A and P controls the shape and orientation of
the cluster.
UNSW MATH5855 2021T3 Lecture 14 Slide 39
Variance matrix specification and degrees of freedom
In a mixture of K clusters, estimate:
π1, . . . , πK (K − 1 parameters)
µ1, . . . ,µK (Kp parameters)
λ:
E: λ1 = λ2 = · · · = λK (1 parameter)
V: λk s vary (K parameters)
A:
I: A1 = A2 = · · · = AK = Ip (0 parameters)
E: A1 = A2 = · · · = AK (p − 1 parameters)
V: Ak s vary (K (p − 1) parameters)
P:
I: P1 = P2 = · · · = PK = Ip (0 parameters)
E: P1 = P2 = · · · = PK ((p choose 2) parameters (why?))
V: Pk s vary (K (p choose 2) parameters)
UNSW MATH5855 2021T3 Lecture 14 Slide 40
Incorporated under the terms of Creative Commons Attribution 3.0 Unported license from Figure 2 of:
Luca Scrucca, Michael Fop, T. Brendan Murphy, and Adrian E. Raftery (2016). mclust 5: Clus-
tering, Classification and Density Estimation Using Gaussian Finite Mixture Models. The R Journal
8:1, pages 289-317.
UNSW MATH5855 2021T3 Lecture 14 Slide 41
14. Cluster Analysis
14.2 Model-based clustering
14.2.1 Mixture Models
14.2.2 Multivariate normal clusters
14.2.3 Model selection
14.2.4 Software
14.2.5 Examples
14.2.6 Expectation–Maximisation Algorithm
UNSW MATH5855 2021T3 Lecture 14 Slide 42
Model selection
Need to select:
▶ Number of clusters K
▶ Within-cluster models (e.g., model for Σ)
▶ Number of parameters grows quickly for “XXV” models in
particular.
▶ BIC often used:
BICν = −2 log Lx(Ψ̂) + ν log n
ν the number of parameters estimated
▶ Lower BIC =⇒ better
▶ Some authors use 2 log Lx(Ψ̂)− ν log n with higher =⇒
better.
▶ Substantive considerations also matter, e.g.,
▶ How many clusters does our research hypothesis predict?
▶ Do we expect correlations between dimensions to vary between
clusters?
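A sketch of the trade-off, with hypothetical fitted log-likelihoods for two models on n = 150 observations (the numbers are made up for illustration):

```python
import numpy as np

def bic(loglik, nu, n):
    # BIC_nu = -2 log L + nu log n; lower is better under this convention.
    return -2.0 * loglik + nu * np.log(n)

# The richer model fits slightly better but pays a larger penalty.
print(bic(-180.0, nu=5, n=150))   # simpler model
print(bic(-178.0, nu=14, n=150))  # richer model: larger (worse) BIC here
```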
UNSW MATH5855 2021T3 Lecture 14 Slide 43
14. Cluster Analysis
14.2 Model-based clustering
14.2.1 Mixture Models
14.2.2 Multivariate normal clusters
14.2.3 Model selection
14.2.4 Software
14.2.5 Examples
14.2.6 Expectation–Maximisation Algorithm
UNSW MATH5855 2021T3 Lecture 14 Slide 44
SAS: PROC MBC
R: package mclust and others
UNSW MATH5855 2021T3 Lecture 14 Slide 45
14. Cluster Analysis
14.2 Model-based clustering
14.2.1 Mixture Models
14.2.2 Multivariate normal clusters
14.2.3 Model selection
14.2.4 Software
14.2.5 Examples
14.2.6 Expectation–Maximisation Algorithm
UNSW MATH5855 2021T3 Lecture 14 Slide 46
Example 14.2.
Model-based clustering and model selection illustrated on the
Edgar Anderson’s Iris data.
UNSW MATH5855 2021T3 Lecture 14 Slide 47
14. Cluster Analysis
14.2 Model-based clustering
14.2.1 Mixture Models
14.2.2 Multivariate normal clusters
14.2.3 Model selection
14.2.4 Software
14.2.5 Examples
14.2.6 Expectation–Maximisation Algorithm
UNSW MATH5855 2021T3 Lecture 14 Slide 48
E–M Algorithm
▶ Lx(Ψ) is computationally tractable to evaluate, but log Lx(Ψ)
does not simplify or decompose:

Lx(Ψ) = ∏_{i=1}^n ∑_{k=1}^K πk fk(xi ;θk).
=⇒ Expectation–Maximisation (EM) algorithm is normally used.
1. Introduce an unobserved (latent) variable Gi , i = 1, . . . , n
giving the cluster membership of i .
2. Suppose that G1, . . . ,Gn are observed; then

   Lx,G1,…,Gn(Ψ) = ∏_{i=1}^n πGi fGi (xi ;θGi )
   log Lx,G1,…,Gn(Ψ) = ∑_{i=1}^n log πGi + ∑_{i=1}^n log fGi (xi ;θGi ). (14.4)
3. Start with an initial guess Ψ(0).
4. Iterate E-step and M-step to convergence.
UNSW MATH5855 2021T3 Lecture 14 Slide 49
E-step
Let
Q(Ψ|Ψ(t−1)) = EG1,…,Gn|x ;Ψ(t−1)(log Lx ,G1,…,Gn(Ψ)).
Compute
q_{ik}^{(t−1)} = Pr(Gi = k | x ; Ψ^{(t−1)}) = π_k^{(t−1)} fk(xi ;θ_k^{(t−1)}) / ∑_{k′=1}^K π_{k′}^{(t−1)} fk′(xi ;θ_{k′}^{(t−1)}),
i = 1, . . . , n, k = 1, . . . ,K .
Then,
Q(Ψ|Ψ^{(t−1)}) = ∑_{i=1}^n ∑_{k=1}^K q_{ik}^{(t−1)} log πk + ∑_{i=1}^n ∑_{k=1}^K q_{ik}^{(t−1)} log fk(xi ;θk) (14.5)
UNSW MATH5855 2021T3 Lecture 14 Slide 50
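The E-step is just a row-normalised table of weighted densities. A minimal sketch for a univariate two-component Gaussian mixture (all parameter and data values here are made up for illustration):

```python
import numpy as np
from scipy.stats import norm

def e_step(x, pi, mu, sigma):
    """Responsibilities q_ik = pi_k f_k(x_i) / sum_k' pi_k' f_k'(x_i)."""
    dens = pi * norm.pdf(x[:, None], loc=mu, scale=sigma)  # (n, K) table
    return dens / dens.sum(axis=1, keepdims=True)          # rows sum to 1

x = np.array([-2.0, -1.5, 1.0, 2.5])
q = e_step(x, pi=np.array([0.5, 0.5]),
           mu=np.array([-1.5, 1.5]), sigma=np.array([1.0, 1.0]))
print(q.round(3))  # points near -1.5 load on component 1, near 1.5 on component 2
```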
M-step
Find Ψ^{(t)} = argmax_Ψ Q(Ψ|Ψ^{(t−1)}), s.t. ∑_{k=1}^K πk = 1:

Lag(π) = ∑_{i=1}^n ∑_{k=1}^K q_{ik}^{(t−1)} log πk − α(∑_{k=1}^K πk − 1)

Setting Lag′_k(π) = ∑_{i=1}^n q_{ik}^{(t−1)} πk^{−1} − α = 0 =⇒ πk = ∑_{i=1}^n q_{ik}^{(t−1)}/α

∑_{k=1}^K πk = (1/α) ∑_{k=1}^K ∑_{i=1}^n q_{ik}^{(t−1)} = 1 =⇒ α = ∑_{k=1}^K ∑_{i=1}^n q_{ik}^{(t−1)}

=⇒ π_k^{(t)} = ∑_{i=1}^n q_{ik}^{(t−1)} / ∑_{k=1}^K ∑_{i=1}^n q_{ik}^{(t−1)}

Setting ∂Q(Ψ|Ψ^{(t−1)})/∂θk = ∑_{i=1}^n q_{ik}^{(t−1)} ∂ log fk(xi ;θk)/∂θk = 0 =⇒ weighted MLE
UNSW MATH5855 2021T3 Lecture 14 Slide 51
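Putting the E- and M-steps together for a univariate two-component Gaussian mixture (an illustrative sketch on simulated data, not the mclust or PROC MBC implementation; the M-step uses π_k^{(t)} = ∑_i q_ik/n together with the closed-form weighted Gaussian MLE):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(2, 1, 300)])
n, K = len(x), 2
pi, mu, sigma = np.full(K, 1 / K), np.array([-1.0, 1.0]), np.ones(K)

for _ in range(50):
    # E-step: responsibilities q_ik
    dens = pi * norm.pdf(x[:, None], mu, sigma)
    q = dens / dens.sum(axis=1, keepdims=True)
    # M-step: pi_k = sum_i q_ik / n, then weighted MLE for mu_k, sigma_k
    nk = q.sum(axis=0)
    pi = nk / n
    mu = (q * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((q * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print(pi.round(2), mu.round(2), sigma.round(2))
```

With this seed the estimates recover the simulated mixing proportions (0.4, 0.6), means (−2, 2), and unit standard deviations.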
“Sharing” θs
▶ Strictly speaking, when we select one of the “E” models, we
no longer have a separate θk for every fk .
▶ θ ∈ RKp+1 or more contains parameters for all groups (separate
means, distinct variance parameters, etc.)
▶ fk “extracts” those elements of θ that it needs
▶ Ψ = (θ,π)
▶ Inferentially, θ replaces θk in all derivations above. In
particular,
Q(Ψ|Ψ^{(t−1)}) = ∑_{i=1}^n ∑_{k=1}^K q_{ik}^{(t−1)} log πk + ∑_{i=1}^n ∑_{k=1}^K q_{ik}^{(t−1)} log fk(xi ;θ)

=⇒ setting ∂Q(Ψ|Ψ^{(t−1)})/∂θ = ∑_{i=1}^n ∑_{k=1}^K q_{ik}^{(t−1)} ∂ log fk(xi ;θ)/∂θ = 0
=⇒ Still weighted MLE, but now joint for all groups, and without
simplification.
UNSW MATH5855 2021T3 Lecture 14 Slide 52
14. Cluster Analysis
14.1 “Classical”
14.2 Model-based clustering
14.3 Additional resources
UNSW MATH5855 2021T3 Lecture 14 Slide 53
Additional resources
▶ JW Sec. 12.1–12.5.
▶ Scrucca, L., Fop, M., Murphy, T. B., & Raftery, A. E. (2016).
mclust 5: clustering, classification and density estimation
using Gaussian finite mixture models. The R Journal, 8(1),
289.
UNSW MATH5855 2021T3 Lecture 14 Slide 54
Lecture 15: Copulae
15.1 Formulation
15.2 Common copula types
15.3 Margins, estimation, and simulation
15.4 Software
15.5 Examples
15.6 Exercises
UNSW MATH5855 2021T3 Lecture 15 Slide 1
15. Copulae
15.1 Formulation
15.2 Common copula types
15.3 Margins, estimation, and simulation
15.4 Software
15.5 Examples
15.6 Exercises
UNSW MATH5855 2021T3 Lecture 15 Slide 2
▶ MVN =⇒ independence ≡ no correlation
▶ For other families, more complicated
▶ Under independence, joint cdf is product of marginals.
=⇒ copulae “entangle” marginal distributions to produce a joint
multivariate one
▶ In 2 dimensions, a copula is a function C : [0, 1]² → [0, 1]
with the properties:
i) C (0, u) = C (u, 0) = 0 for all u ∈ [0, 1].
ii) C (u, 1) = C (1, u) = u for all u ∈ [0, 1].
iii) For all pairs (u1, u2), (v1, v2) ∈ [0, 1]× [0, 1] with
u1 ≤ v1, u2 ≤ v2 :
C (v1, v2)− C (v1, u2)− C (u1, v2) + C (u1, u2) ≥ 0.
UNSW MATH5855 2021T3 Lecture 15 Slide 3
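The three defining properties can be verified numerically for the simplest case, the independence copula Π(u, v) = uv (a small illustrative check, not part of the lecture code; property (iii) holds here because the rectangle sum factors as (v1 − u1)(v2 − u2) ≥ 0):

```python
import itertools
import numpy as np

C = lambda u, v: u * v  # independence copula

us = np.linspace(0, 1, 11)
assert all(C(0.0, u) == 0.0 and C(u, 0.0) == 0.0 for u in us)  # property (i)
assert all(C(u, 1.0) == u and C(1.0, u) == u for u in us)      # property (ii)
# property (iii): the "rectangle" (2-increasing) inequality
for u1, v1, u2, v2 in itertools.product(us, repeat=4):
    if u1 <= v1 and u2 <= v2:
        assert C(v1, v2) - C(v1, u2) - C(u1, v2) + C(u1, u2) >= -1e-12
print("Pi(u, v) = uv satisfies (i)-(iii)")
```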
Theorem 15.1 (Sklar’s Theorem).
Let F (·, ·) be a joint cdf with marginal cdfs FX1(·) and FX2(·).
Then there exists a copula C (·, ·) with the property
F (x1, x2) = C (FX1(x1),FX2(x2))
for every pair (x1, x2) ∈ R². When FX1(·) and FX2(·) are
continuous, the above copula is unique. Conversely, if C (·, ·) is a
copula and FX1(·),FX2(·) are cdfs, then the function
F (x1, x2) = C (FX1(x1),FX2(x2)) is a joint cdf with marginals FX1(·)
and FX2(·).
UNSW MATH5855 2021T3 Lecture 15 Slide 4
Copula density
▶ Take derivatives:
f (x1, x2) = c(FX1(x1),FX2(x2))fX1(x1)fX2(x2) (15.1)
where
c(u, v) = ∂²C (u, v)/(∂u∂v)
▶ Contribution to the joint density of X1,X2 comes from two
parts:
c(u, v) = ∂²C (u, v)/(∂u∂v): dependence from the copula
fX1(x1)fX2(x2): marginals
▶ Independence =⇒ C (u, v) = Π(u, v) = uv (independence
copula)
UNSW MATH5855 2021T3 Lecture 15 Slide 5
[Figure: perspective and contour plots of C(u,v) and c(u,v). Independence copula, dim. d = 2]
UNSW MATH5855 2021T3 Lecture 15 Slide 6
15. Copulae
15.1 Formulation
15.2 Common copula types
15.2.1 Elliptical copulae
15.2.2 Archimedean copulae
15.3 Margins, estimation, and simulation
15.4 Software
15.5 Examples
15.6 Exercises
UNSW MATH5855 2021T3 Lecture 15 Slide 7
15. Copulae
15.2 Common copula types
15.2.1 Elliptical copulae
15.2.2 Archimedean copulae
UNSW MATH5855 2021T3 Lecture 15 Slide 8
Gaussian copula
For p = 2, let
Cρ(u, v) = Φρ(Φ^{−1}(u),Φ^{−1}(v)) = ∫_{−∞}^{Φ^{−1}(u)} ∫_{−∞}^{Φ^{−1}(v)} fρ(x1, x2) dx2 dx1
for
fρ(·, ·) the density of N((0, 0)^⊤, (1 ρ; ρ 1))
Φρ(·, ·) its cdf
Φ^{−1}(·) inverse-cdf of N(0, 1)
▶ ρ = 0 =⇒ C0(u, v) = uv
▶ “The formula that killed Wall Street.”
▶ Models tail behaviour poorly: almost no chance of joint
extreme events.
▶ Multivariate t does better: Z ∼ N(0,Σ) and X ∼ χ²ν
(independently) =⇒ T = Z/√(X/ν).
▶ Var(T ) ̸= Σ!
UNSW MATH5855 2021T3 Lecture 15 Slide 9
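The bivariate Gaussian copula can be evaluated straight from its definition using standard normal quantiles and a bivariate normal cdf (a sketch using scipy, not the lecture's software; ρ = 0 must reduce to the independence copula uv):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def gaussian_copula_cdf(u, v, rho):
    """C_rho(u, v) = Phi_rho(Phi^{-1}(u), Phi^{-1}(v))."""
    cov = [[1.0, rho], [rho, 1.0]]
    z = [norm.ppf(u), norm.ppf(v)]  # standard normal quantiles
    return float(multivariate_normal(mean=[0.0, 0.0], cov=cov).cdf(z))

print(gaussian_copula_cdf(0.3, 0.7, 0.0))  # ~ 0.3 * 0.7 = 0.21
print(gaussian_copula_cdf(0.3, 0.7, 0.9))  # pulled up towards min(u, v) = 0.3
```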
[Figure: perspective and contour plots of C(u,v) and c(u,v). Normal copula, dim. d = 2; param.: (rho.1 = 0.9)]
UNSW MATH5855 2021T3 Lecture 15 Slide 10
[Figure: perspective and contour plots of C(u,v) and c(u,v). t-copula, dim. d = 2; param.: (rho.1 = 0.9, df = 4.0)]
UNSW MATH5855 2021T3 Lecture 15 Slide 11
15. Copulae
15.2 Common copula types
15.2.1 Elliptical copulae
15.2.2 Archimedean copulae
UNSW MATH5855 2021T3 Lecture 15 Slide 12
Gumbel–Hougaard copula
▶ Much more flexible in modelling dependence in the upper tails.
▶ For dimension p,
C^{GH}_θ(u1, u2, . . . , up) = exp{−[∑_{j=1}^p (− log uj)^θ]^{1/θ}}
▶ θ ∈ [1,∞) governs the strength of the dependence
▶ θ = 1 =⇒ independence
▶ θ → ∞ =⇒ min(u1, . . . , up) (Fréchet–Hoeffding upper
bound copula)
UNSW MATH5855 2021T3 Lecture 15 Slide 13
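Both limiting cases of θ can be checked directly from the formula (an illustrative sketch; the values of u below are arbitrary):

```python
import numpy as np

def gumbel_hougaard(u, theta):
    """C^GH_theta(u_1,...,u_p) = exp(-[sum_j (-log u_j)^theta]^(1/theta))."""
    u = np.asarray(u, dtype=float)
    return float(np.exp(-np.sum((-np.log(u)) ** theta) ** (1.0 / theta)))

u = [0.3, 0.6, 0.8]
print(gumbel_hougaard(u, 1.0))   # theta = 1: product 0.3*0.6*0.8 (independence)
print(gumbel_hougaard(u, 50.0))  # large theta: approaches min(u) = 0.3
```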
[Figure: perspective and contour plots of C(u,v) and c(u,v). Gumbel copula, dim. d = 2; param.: 2]
UNSW MATH5855 2021T3 Lecture 15 Slide 14
Archimedean copulae
▶ characterised by generator ϕ(·):
▶ a continuous, strictly decreasing function from [0, 1] to [0,∞)
▶ ϕ(1) = 0
▶ Then,
C (u1, u2, . . . , up) = ϕ^{−1}(ϕ(u1) + · · ·+ ϕ(up))
▶ ϕ−1(t) is defined to be 0 if t /∈ ϕ([0, 1])
UNSW MATH5855 2021T3 Lecture 15 Slide 15
Example 15.2.
Show that the Gumbel–Hougaard copula is an Archimedean copula
with generator ϕ(t) = (− log t)^θ.
UNSW MATH5855 2021T3 Lecture 15 Slide 16
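A numerical companion to Example 15.2 (it checks the identity at one point only and does not replace the analytic derivation): building the copula from the generator ϕ(t) = (− log t)^θ reproduces the Gumbel–Hougaard formula.

```python
import numpy as np

theta = 2.0
phi = lambda t: (-np.log(t)) ** theta            # proposed generator
phi_inv = lambda s: np.exp(-s ** (1.0 / theta))  # its inverse on [0, inf)

def archimedean(u):
    # C(u_1,...,u_p) = phi^{-1}(phi(u_1) + ... + phi(u_p))
    return phi_inv(sum(phi(ui) for ui in u))

def gumbel_hougaard(u):
    u = np.asarray(u, dtype=float)
    return np.exp(-np.sum((-np.log(u)) ** theta) ** (1.0 / theta))

u = [0.2, 0.5, 0.9]
print(archimedean(u), gumbel_hougaard(u))  # the two agree
```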
Archimedean copulae advantages and disadvantages
+ simple description of the p-dim dependence by using a
function of one argument only (the generator)
− symmetric in its arguments
▶ Liouville copulae are a non-symmetric extension
UNSW MATH5855 2021T3 Lecture 15 Slide 17
15. Copulae
15.1 Formulation
15.2 Common copula types
15.3 Margins, estimation, and simulation
15.4 Software
15.5 Examples
15.6 Exercises
UNSW MATH5855 2021T3 Lecture 15 Slide 18
Copula margins: parametric
▶ Specification also requires FX1(·), FX2(·), fX1(·), fX2(·)
▶ Can be any univariate continuous distributions
▶ Density (15.1) provides likelihood:
L(ρ,θ1,θ2) = fρ,θ1,θ2(x1, x2)
= cρ(FX1|θ1(x1),FX2|θ2(x2))fX1|θ1(x1)fX2|θ2(x2)
▶ Maximise w.r.t. parameters (ρ,θ1,θ2)
▶ Usually done numerically
UNSW MATH5855 2021T3 Lecture 15 Slide 19
Copula margins: empirical
▶ Xij , i = 1, 2, j = 1, . . . , n observations
▶ edf: F̂_{Xi}(x) = n^{−1} ∑_{j=1}^n I(Xij ≤ x)
=⇒ F (x1, x2) = C (F̂_{X1}(x1), F̂_{X2}(x2))
Estimation:
▶ Likelihood no longer available
=⇒ Convert each variable into empirical quantiles:
Pij = (n/(n + 1)) F̂_{Xi}(Xij)
▶ Pij uniform, but approx. correlation maintained
▶ Tune copula parameters to match observed correlations.
Simulation:
1. Simulate from C (·, ·) and/or c(·, ·) to obtain vector of
quantiles P⋆ = [P1⋆,P2⋆]^⊤.
2. Evaluate Xi⋆ = F̂^{−1}_{Xi}(Pi⋆), i = 1, 2.
▶ F̂^{−1}_{Xi}(·) may be smoothed in some way.
UNSW MATH5855 2021T3 Lecture 15 Slide 20
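The conversion to empirical quantiles Pij = n/(n + 1) · F̂_{Xi}(Xij) amounts to rescaled ranks; a sketch for one margin (the gamma data here are simulated purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.gamma(shape=2.0, scale=3.0, size=200)  # one margin, any continuous dist.

# Fhat(X_j) = rank(X_j)/n, so n/(n+1) * Fhat(X_j) = rank(X_j)/(n+1),
# which keeps the pseudo-observations strictly inside (0, 1)
n = len(x)
ranks = np.argsort(np.argsort(x)) + 1  # 1-based rank of each observation
p = ranks / (n + 1)

print(p.min(), p.max(), p.mean())  # inside (0, 1), mean exactly 0.5
```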
15. Copulae
15.1 Formulation
15.2 Common copula types
15.3 Margins, estimation, and simulation
15.4 Software
15.5 Examples
15.6 Exercises
UNSW MATH5855 2021T3 Lecture 15 Slide 21
SAS: PROC COPULA
R: Packages copula, VineCopula, and others.
UNSW MATH5855 2021T3 Lecture 15 Slide 22
15. Copulae
15.1 Formulation
15.2 Common copula types
15.3 Margins, estimation, and simulation
15.4 Software
15.5 Examples
15.6 Exercises
UNSW MATH5855 2021T3 Lecture 15 Slide 23
Example 15.3.
Microwave Ovens example (with empirical and gamma margins).
UNSW MATH5855 2021T3 Lecture 15 Slide 24
Example 15.4.
Stock and portfolio modelling.
UNSW MATH5855 2021T3 Lecture 15 Slide 25
15. Copulae
15.1 Formulation
15.2 Common copula types
15.3 Margins, estimation, and simulation
15.4 Software
15.5 Examples
15.6 Exercises
UNSW MATH5855 2021T3 Lecture 15 Slide 26
Exercise 15.1
The (p-dimensional) Clayton copula is defined for a given
parameter θ > 0 as
Cθ(u1, u2, . . . , up) = [∑_{i=1}^p u_i^{−θ} − p + 1]^{−1/θ}.
Show that it is an Archimedean copula and that its generator is
ϕ(x) = θ−1(x−θ − 1).
UNSW MATH5855 2021T3 Lecture 15 Slide 27
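A quick numerical check of Exercise 15.1 (it verifies the identity at one point only and is no substitute for the proof): the Archimedean construction with ϕ(x) = θ^{−1}(x^{−θ} − 1) reproduces the Clayton formula.

```python
import numpy as np

theta = 1.5
phi = lambda x: (x ** -theta - 1.0) / theta            # candidate generator
phi_inv = lambda t: (1.0 + theta * t) ** (-1.0 / theta)

def clayton(u, theta):
    u = np.asarray(u, dtype=float)
    return float((np.sum(u ** -theta) - len(u) + 1.0) ** (-1.0 / theta))

u = [0.3, 0.6, 0.8]
lhs = phi_inv(sum(phi(ui) for ui in u))
print(lhs, clayton(u, theta))  # identical, as the exercise claims
```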