MAST 90138: MULTIVARIATE STATISTICAL TECHNIQUES
1 INTRODUCTION
Goal of the course: Describe, understand and discover properties of data in p > 1 dimensions.
We are interested in analysing a sample of n random vectors X1, …, Xn, also denoted by Xi for i = 1, …, n, where X1, X2, etc. all belong to Rp:
X1 = (X11, …, X1p) ∈ Rp
X2 = (X21, …, X2p) ∈ Rp
⋮
Xn = (Xn1, …, Xnp) ∈ Rp .
This means that we have a sample of n individuals, and that for each individual we observe p variables (sometimes called features).
Example:
• Consider a health study involving n = 100 patients.
• On each patient we measure p = 4 quantities: age, weight, body mass index, systolic blood pressure.
• For the ith individual, where i = 1, …, 100, we observe Xi = (Xi1, Xi2, Xi3, Xi4).
• Xi1 = age of the ith patient
Xi2 = weight of the ith patient
Xi3 = body mass index of the ith patient
Xi4 = systolic blood pressure of the ith patient.
• Often we gather all the observations into a single n × p matrix
X = ( X11 … X1p
      X21 … X2p
       ⋮        ⋮
      Xn1 … Xnp ) .
Each Xij, for i = 1,…,n and j = 1,…,p is a random variable. X has n rows and p columns.
The ith row represents the p variables corresponding to the ith individual, and the jth column represents the jth variable for all n individuals.
• We call the value taken by Xi = (Xi1, …, Xip) the observed value, or realization.
• Often we use a lower case to denote the observed value, i.e.
xi = (xi1, . . . , xip) is the realization of the random vector Xi.
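For concreteness, here is a minimal R sketch of such an n × p data matrix for the health-study example; the values are simulated and purely illustrative (none of the numbers come from a real study).

# n = 100 patients, p = 4 variables (simulated data)
n <- 100
p <- 4
set.seed(1)
X <- cbind(
  age = round(runif(n, 20, 80)),   # X_i1: age of the ith patient
  wgt = rnorm(n, 75, 12),          # X_i2: weight
  bmi = rnorm(n, 25, 4),           # X_i3: body mass index
  sbp = rnorm(n, 120, 15)          # X_i4: systolic blood pressure
)
dim(X)    # n x p, i.e. 100 x 4
X[3, ]    # realization x_3 of the random vector X_3 (third individual)
X[, 2]    # second variable (weight) for all n individuals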
What we will do in the course:
• The techniques learned in basic statistics courses are not focused on multivariate data.
• E.g., how would we draw a scatterplot in more than p = 2 dimensions? How can we graphically represent multivariate data?
• In this course we will learn techniques especially designed for multivariate data.
• Revision of basic matrix results needed in the course.
As this is just a revision, we won't spend much time on examples, but the reference book has examples.
• Revision of basic results about multivariate data and multinormal distributions.
As this is just a revision, we won't spend much time on examples, but the reference book has examples.
2 REVIEW OF MATRIX PROPERTIES
Sections 2.1, 2.2, 2.3, 2.4, 2.6 and 2.7 in Härdle and Simar (2015).
2.1 ELEMENTARY OPERATIONS
A matrix A is a system of numbers with n rows and p columns:
A = ( a11 a12 … a1p
      a21 a22 … a2p
       ⋮            ⋮
      an1 an2 … anp ) .
We can refer to A as (aij) or (aij)1≤i≤n,1≤j≤p.
The dimension of the matrix is denoted by n × p, where n is the number of rows and p is the number of columns.
We use AT to denote the transpose of the matrix A:
AT = ( a11 a21 … an1
       a12 a22 … an2
        ⋮            ⋮
       a1p a2p … anp )
(rows become columns and columns become rows).
Special matrices and vectors (Table 2.1, taken from Härdle and Simar):

Name                      Definition                     Notation      Example
Scalar                    p = n = 1                      a             3
Column vector             p = 1                          a             (1, 3)T
Row vector                n = 1                          aT            (1  3)
Vector of ones            (1, …, 1)T with n elements     1n            (1, 1)T
Vector of zeros           (0, …, 0)T with n elements     0n            (0, 0)T
Square matrix             n = p                          A(p × p)      (2 0; 0 2)
Diagonal matrix           aij = 0 for i ≠ j, n = p       diag(aii)     (1 0; 0 2)
Identity matrix           diag(1, …, 1) with p ones      Ip            (1 0; 0 1)
Unit matrix               aij = 1, n = p                 1n 1nT        (1 1; 1 1)
Symmetric matrix          aij = aji                                    (1 2; 2 3)
Null matrix               aij = 0                        0             (0 0; 0 0)
Upper triangular matrix   aij = 0 for i > j                            (1 2 4; 0 1 3; 0 0 1)
Idempotent matrix         AA = A                                       (1 0 0; 0 1/2 1/2; 0 1/2 1/2)
Orthogonal matrix         ATA = I = AAT                                (1/√2 1/√2; 1/√2 −1/√2)

(In each example, rows are separated by semicolons.) We also write (aij) for A and A(n × p) to indicate the numbers of rows and columns. Vectors are matrices with one column and are denoted by x or x(p × 1). Note that small letters are used for scalars as well as for vectors.
Matrix operations:
• Sum of two n × p matrices A = (aij) and B = (bij)
D = A + B = ( a11 + b11   a12 + b12   …   a1p + b1p
              a21 + b21   a22 + b22   …   a2p + b2p
                  ⋮            ⋮                 ⋮
              an1 + bn1   an2 + bn2   …   anp + bnp ) .
Thus the (i, j)th element dij of D (found at the ith row and jth column of D) is equal to aij + bij, the sum of the corresponding elements of A and B. D is also an n × p matrix.
• Likewise
A − B = ( a11 − b11   a12 − b12   …   a1p − b1p
          a21 − b21   a22 − b22   …   a2p − b2p
              ⋮            ⋮                 ⋮
          an1 − bn1   an2 − bn2   …   anp − bnp ) .
• A constant c times an n × p matrix A = (aij) gives the following n × p matrix:
cA = ( ca11   ca12   …   ca1p
       ca21   ca22   …   ca2p
         ⋮       ⋮              ⋮
       can1   can2   …   canp ) ,
so that each aij is multiplied by c.
• Product of two matrices A, an n × p matrix, and B, a p × m matrix:
AB = C ,
where C is an n × m matrix whose (i, j)th element, for i = 1, …, n and j = 1, …, m, is given by
cij = Σ_{k=1}^p aik bkj .
• Scalar product (or dot product) of two p-dimensional vectors x = (x1, …, xp)T (a column vector) and y = (y1, …, yp)T:
xT y = Σ_{j=1}^p xj yj
is a real number (not a vector).

Properties of matrix operations:
A + B = B + A
A(B + C) = AB + AC
A(BC) = (AB)C
(AT)T = A
(AB)T = BT AT
BUT in general AC ≠ CA (matrix multiplication is not commutative).
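As a quick numerical illustration of these operations and properties, here is a small R sketch with arbitrary matrices (the values have no particular meaning):

A <- matrix(1:6, nrow = 2)      # a 2 x 3 matrix
B <- matrix(6:1, nrow = 3)      # a 3 x 2 matrix
A + t(B)                        # sum of two 2 x 3 matrices; t() is the transpose
2 * A                           # multiplication by the constant c = 2
A %*% B                         # matrix product: a 2 x 2 matrix
t(A %*% B)                      # identical to t(B) %*% t(A)
x <- c(1, 2, 3); y <- c(4, 5, 6)
sum(x * y)                      # scalar product x^T y = 32; same as drop(t(x) %*% y)
C1 <- matrix(c(1, 0, 1, 1), 2, 2)
C2 <- matrix(c(0, 1, 1, 0), 2, 2)
C1 %*% C2                       # differs from C2 %*% C1: multiplication is not commutative
C2 %*% C1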
Matrix Characteristics:
• Rank: The rank of an n × p matrix A, denoted by rank(A), is defined as the maximum number of linearly independent rows (or columns).
A set of k rows (or columns) a1, …, ak of A is said to be linearly independent if none of them can be expressed as a nontrivial linear combination of the other k − 1 rows (or columns), i.e. it is not possible to write
aj = Σ_{i ≠ j} ci ai
with not all of the ci's equal to zero, or equivalently
Σ_{i=1}^k ci ai = 0  ⇒  c1 = … = ck = 0.
We always have
rank(A) ≤ min(n, p) .
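In R, the rank can be computed, for instance, from a QR decomposition. A minimal sketch with an arbitrary 3 × 3 matrix whose third column is the sum of the first two:

A <- cbind(c(1, 2, 3), c(0, 1, 1), c(1, 3, 4))  # third column = first + second
qr(A)$rank                                       # 2: only two linearly independent columns
# rank(A) <= min(n, p) always holds; here min(3, 3) = 3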
• Trace: The trace of a square p × p matrix A, denoted by tr(A), is the sum of its diagonal elements:
tr(A) = Σ_{i=1}^p aii .
• Determinant: The determinant of a square p × p matrix A, denoted by det(A) or |A|, is a number computed from the matrix which plays an important role in all sorts of problems. For a 2 × 2 matrix
A = ( a11 a12; a21 a22 )
it is computed by
|A| = a11 a22 − a21 a12 .
For larger matrices, the determinant is computed recursively. If you have forgotten how a determinant is computed, see Appendix A.2.3 of Mardia et al. (1979) or
https://en.wikipedia.org/wiki/Determinant
• Inverse: If |A| ≠ 0, the inverse of a square p × p matrix A exists. It is denoted by A−1 and is such that
AA−1 = A−1A = Ip
(Ip = p × p identity matrix).
See Härdle and Simar, page 57, for how to compute the inverse.
We have
|A−1| = 1/|A| .
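A small R sketch (arbitrary 2 × 2 matrix) illustrating the trace, the determinant and the inverse:

A <- matrix(c(2, 1, 1, 3), nrow = 2)   # a small 2 x 2 matrix
sum(diag(A))       # trace: 2 + 3 = 5
det(A)             # determinant: 2*3 - 1*1 = 5
Ainv <- solve(A)   # inverse (exists because det(A) != 0)
A %*% Ainv         # identity matrix I_2 (up to rounding)
det(Ainv)          # equals 1/det(A)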
• Eigenvalues and eigenvectors of a square p × p matrix A:
The (non-zero) p × 1 vector v is an eigenvector of A with eigenvalue λ if
Av = λv .
Note that λ is a number (a scalar), not a vector.
If the matrix A is symmetric then it has p real eigenvalues (not necessarily all distinct) and p associated eigenvectors, which can be taken orthonormal.
All eigenvalues satisfy
|A − λIp| = 0
(they are the p roots of the above polynomial of order p in λ).
In practice we can compute them with software, e.g. R.
Constant multiples of an eigenvector v with eigenvalue λ are also eigenvectors with eigenvalue λ.
Therefore we usually define eigenvectors to be scaled so that they have norm 1:
∥v∥ = √(vT v) = 1 .
• Suppose the square p × p matrix A has eigenvalues λ1, . . . , λp.
Let Λ be the diagonal matrix Λ = diag(λ1, . . . , λp) with the λi’s on the
diagonal and 0 everywhere else. Then we have
det(A) = |A| = |Λ| = Π_{i=1}^p λi
and
tr(A) = tr(Λ) = Σ_{i=1}^p λi .
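A small R sketch (arbitrary symmetric matrix) illustrating eigenvalues, eigenvectors and these two identities:

A <- matrix(c(2, 1, 1, 3), nrow = 2)   # symmetric 2 x 2 matrix
e <- eigen(A)
e$values           # the eigenvalues lambda_1, lambda_2
e$vectors          # eigenvectors of norm 1, stored as columns
prod(e$values)     # equals det(A)
sum(e$values)      # equals tr(A) = sum(diag(A))
A %*% e$vectors[, 1] - e$values[1] * e$vectors[, 1]   # approximately 0: A v = lambda v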
Properties of matrix characteristics:

For A(n × n), B(n × n) and c ∈ R:
tr(A + B) = tr(A) + tr(B)
tr(cA) = c tr(A)
|cA| = c^n |A|
|AB| = |BA| = |A| |B|

For A(n × p) and B(p × n):
tr(AB) = tr(BA)
rank(A) ≤ min(n, p)
rank(A) ≥ 0
rank(A) = rank(AT)
rank(AT A) = rank(A)
rank(A + B) ≤ rank(A) + rank(B)
rank(AB) ≤ min{rank(A), rank(B)}

For A(n × p), B(p × q) and C(q × n):
tr(ABC) = tr(BCA) = tr(CAB)
rank(ABC) = rank(B) if A and C are square and nonsingular

For a square A(p × p):
|A−1| = |A|−1
rank(A) = p if and only if A is nonsingular.

Summary:
• The determinant |A| is the product of the eigenvalues of A.
• The inverse of a matrix A exists if |A| ≠ 0.
• The trace tr(A) is the sum of the eigenvalues of A.
2.2 SPECTRAL DECOMPOSITIONS
Spectral decomposition
• Spectral decomposition: Suppose A is a square and symmetric p × p matrix and let
λ1, …, λp denote its p eigenvalues and
v1, …, vp denote the associated p × 1 eigenvectors, of norm 1 and orthogonal to each other.
Note: two p × 1 vectors v and w are orthogonal if
vT w = Σ_{i=1}^p vi wi = 0 .
Then we can always express A in the following way, which is called the spectral decomposition of A:
A = Σ_{j=1}^p λj vj vjT .
This can also be written in matrix form: if we let
Λ = diag(λ1, …, λp)
and
Γ = (v1, …, vp) ,
the p × p orthogonal matrix whose columns are the p eigenvectors, then
A = Γ Λ ΓT .
• In the above notation, if A = ΓΛΓT, then for a power of A, for example Aα, we find
Aα = ΓΛαΓT .
This is because the vj’s are orthogonal and of norm 1.
For example
A2 =ΓΛΓT ΓΛΓT = ΓΛ2ΓT .
This also works for negative powers if A is invertible (which happens if and only if the eigenvalues are all nonzero). For example,
A−1 = ΓΛ−1ΓT (A−1: the inverse of the matrix A).
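A short R sketch (arbitrary symmetric matrix) illustrating the spectral decomposition and its use for computing powers of A:

A <- matrix(c(2, 1, 1, 3), nrow = 2)        # symmetric, positive definite
e <- eigen(A)
Gamma  <- e$vectors                         # orthogonal matrix of eigenvectors
Lambda <- diag(e$values)                    # diagonal matrix of eigenvalues
Gamma %*% Lambda %*% t(Gamma)               # reconstructs A
Gamma %*% Lambda^2 %*% t(Gamma)             # equals A %*% A (elementwise square of a diagonal matrix is its matrix square)
Gamma %*% diag(1 / e$values) %*% t(Gamma)   # equals solve(A), since all eigenvalues are nonzero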
Singular value decomposition
More generally, a similar decomposition exists for matrices that are not necessarily square. In particular, any n × p matrix A with rank r can be decomposed as
A = Γ Λ ∆T ,
where the n × r matrix Γ and the p × r matrix ∆ are column orthonormal, which means that their columns are orthonormal, that is
ΓT Γ = ∆T ∆ = Ir
and
Λ = diag(√λ1, …, √λr) ,
where each λi > 0.
The λi's are the nonzero eigenvalues of the matrices AAT and AT A; the columns of Γ and ∆ are the corresponding r eigenvectors of those matrices.
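In R this decomposition is available through svd(); a minimal sketch with an arbitrary 4 × 2 matrix:

A <- matrix(c(1, 2, 3, 4, 2, 4, 6, 9), nrow = 4)   # arbitrary 4 x 2 matrix of rank 2
s <- svd(A)
s$d                              # the singular values, i.e. the square roots of the lambda_i's
s$u %*% diag(s$d) %*% t(s$v)     # reconstructs A (Gamma = s$u, Delta = s$v)
s$d^2                            # nonzero eigenvalues of t(A) %*% A (and of A %*% t(A))
eigen(t(A) %*% A)$values         # compare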
2.3 QUADRATIC FORMS
• A quadratic form Q(x) of the p-vector x = (x1, …, xp)T is defined by
Q(x) = Σ_{i=1}^p Σ_{j=1}^p aij xi xj = xT A x ,
where aij is the (i, j)th element of a symmetric p × p matrix A.
• If
Q(x) ≥ 0 for all x ≠ 0
then the matrix A is called positive semidefinite, which is denoted by A ≥ 0.
• However, if the quadratic form satisfies
Q(x) > 0 for all x ≠ 0
then the matrix A is called positive definite, which is denoted by A > 0.
• A > 0 is equivalent to all the eigenvalues of A being strictly positive: λ1 > 0, …, λp > 0.
In that case |A| > 0 and A−1 exists.
• If A ≥ 0 and rank(A) = r ≤ p, then p − r eigenvalues of A are equal to zero while the other r are strictly positive.
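A small R sketch (arbitrary matrices) checking definiteness through the eigenvalues:

A <- matrix(c(2, 1, 1, 3), nrow = 2)   # symmetric
eigen(A)$values                        # all > 0, so A > 0 (positive definite)
B <- matrix(c(1, 1, 1, 1), nrow = 2)   # symmetric, rank 1
eigen(B)$values                        # 2 and 0: B >= 0 but B is not positive definite
x <- c(1, -1)
drop(t(x) %*% B %*% x)                 # quadratic form Q(x) = 0 for this x != 0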
2.4 GEOMETRICAL ASPECTS
Distance
• The Euclidean distance d(x, y) between two vectors x, y ∈ Rp is defined by
d(x, y) = √( Σ_{i=1}^p (xi − yi)^2 ) = √( (x − y)T (x − y) ) .
Example in R2, where x = (x1, x2) and y = (y1, y2): [figure omitted]
• A weighted version of this distance can be defined as
d(x, y) = √( Σ_{i=1}^p wi (xi − yi)^2 ) = √( (x − y)T W (x − y) ) ,
where each wi > 0 and W = diag(w1, …, wp).
• This can be further generalised into the following distance:
d(x, y) = √( (x − y)T A (x − y) ) ,
where A is a positive definite matrix.
Norm
• The (Euclidean) norm of a vector x ∈ Rp is defined by
∥x∥ = √( Σ_{i=1}^p xi^2 ) = √( xT x ) .
• A unit vector is a vector of norm 1.
• The norm can be generalised into a norm with respect to a positive definite matrix A:
∥x∥A = √( xT A x ) .
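A minimal R sketch (arbitrary vectors and weights) computing these distances and norms:

x <- c(1, 2); y <- c(4, 6)
sqrt(sum((x - y)^2))                    # Euclidean distance d(x, y) = 5
sqrt(drop(t(x - y) %*% (x - y)))        # same thing in matrix form
W <- diag(c(2, 0.5))                    # weights w_1 = 2, w_2 = 0.5
sqrt(drop(t(x - y) %*% W %*% (x - y)))  # weighted distance
sqrt(sum(x^2))                          # Euclidean norm of x
A <- matrix(c(2, 1, 1, 3), 2, 2)        # positive definite matrix
sqrt(drop(t(x) %*% A %*% x))            # norm of x with respect to A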
Angle between two vectors
• The angle θ between two vectors x, y ∈ Rp is defined through the cosine of θ by
cos(θ) = xT y / ( ∥x∥ ∥y∥ ) .
Example in R2: [figure omitted]
If xT y = 0 then cos(θ) = 0 and the angle θ is equal to π/2 (the vectors are orthogonal).
• From trigonometry, in a right-angled triangle ACB with the right angle at C, the cosine of the angle at A equals the length b of the segment AC divided by the length c of the hypotenuse AB. Applying this to the triangle formed by x and its projection px on y (see figure), we get
|cos(θ)| = ∥px∥ / ∥x∥ ,
where px is called the projection of x on y.
• Since cos(θ) = xT y / ( ∥x∥ ∥y∥ ), we can find px by taking
px = ( cos(θ) ∥x∥ ) y/∥y∥ = ( xT y / ∥y∥^2 ) y ,
where y/∥y∥ is the unit vector in the direction of y; xT y / ∥y∥ is the coordinate of px along y, so that
∥x∥ |cos(θ)| = |xT y| / ∥y∥ .
• With respect to the metric defined by a positive definite matrix A, the angle can be defined analogously through
cos(θ) = xT A y / ( ∥x∥A ∥y∥A ) .
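A short R sketch (arbitrary vectors) computing the cosine of the angle and the projection:

x <- c(2, 1); y <- c(3, 0)
cos_theta <- sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
cos_theta                        # cosine of the angle between x and y
acos(cos_theta)                  # the angle theta, in radians
p_x <- (sum(x * y) / sum(y^2)) * y
p_x                              # projection of x on y: here (2, 0)
abs(cos_theta) * sqrt(sum(x^2))  # equals the norm of p_x, here 2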
Rotation
• When we work with vectors in Rp we generally describe them through a system of p axes (see figure on p. 9 here for example: the arrows represent the axes) and give the coordinates of x in that p coordinate system.
• In multivariate statistics it is sometimes useful to rotate the axes (all of them at the same time) by an angle θ, creating in this way a new p coordinate system.
• In R2, we can describe a rotation of angle θ via the orthogonal matrix
Γ = (  cos(θ)   sin(θ)
      −sin(θ)   cos(θ) ) .
Specifically, if the original set of axes is rotated counter-clockwise through the origin by an angle θ, then the new coordinates y of a point with coordinates x in the original system of axes are given by
y = Γ x .
If the rotation is clockwise, then instead we have y = ΓT x .
• More generally, premultiplying a vector x by an orthogonal matrix Γ geometrically corresponds to a rotation of the system of axes.
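A quick R sketch of this coordinate change, using a 90-degree rotation of the axes and an arbitrary point:

theta <- pi / 2                                  # rotate the axes by 90 degrees
Gamma <- matrix(c(cos(theta),  sin(theta),
                 -sin(theta),  cos(theta)), nrow = 2, byrow = TRUE)
x <- c(1, 0)                                     # coordinates in the original system of axes
Gamma %*% x                                      # new coordinates y, approximately (0, -1)
round(t(Gamma) %*% Gamma)                        # identity matrix: Gamma is orthogonal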
3 MEAN, COVARIANCE, CORRELATION
Sections 3.1, 3.2, 3.3 in Härdle and Simar (2015).
3.1 MEAN
• The mean μ ∈ Rp of a random vector X = (X1,…,Xp)T is defined by
μ = ( μ1, …, μp )T = ( E(X1), …, E(Xp) )T .
• In practice we can't compute μ (we don't observe the whole population), but we can estimate it from a sample X1, …, Xn by the sample mean
X̄ = ( X̄1, …, X̄p )T ,
where, for j = 1, …, p,
X̄j = (1/n) Σ_{i=1}^n Xij
is the sample mean of the jth component.
• Recall the notation
X = ( X11 … X1p
      X21 … X2p
       ⋮        ⋮
      Xn1 … Xnp )
and 1n = (1, …, 1)T, a column vector of length n. We can express X̄ in matrix notation as
X̄ = (1/n) XT 1n .
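A minimal R sketch (small simulated data matrix) showing both ways of computing the sample mean:

set.seed(1)
n <- 5; p <- 3
X <- matrix(rnorm(n * p), nrow = n, ncol = p)   # an n x p data matrix (simulated)
colMeans(X)                                      # sample mean, component by component
(1 / n) * t(X) %*% rep(1, n)                     # same values via (1/n) X^T 1_n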
3.2 COVARIANCE MATRIX
• The covariance σXY between two random variables X and Y is a measure of the linear dependence between those random variables:
σXY = cov(X, Y) = E(XY) − E(X) E(Y) .
In particular, σXX = var(X).
If X and Y are independent then σXY = 0.
However σXY = 0 does not imply that X and Y are independent (there could be a nonlinear dependence).
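A small simulated R sketch of a clearly positive sample covariance, and of a case with near-zero covariance despite dependence:

set.seed(1)
X <- rnorm(1000)
Y <- 2 * X + rnorm(1000)   # Y depends linearly on X
cov(X, Y)                  # sample covariance, clearly positive
Z <- X^2                   # Z depends on X, but only nonlinearly
cov(X, Z)                  # close to 0 even though X and Z are dependent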
[Figures: two scatterplots of Y against X, one illustrating positive covariance, the other illustrating negative covariance.]
Can you picture a situation where X and Y are dependent but have zero covariance?