STA 314: Statistical Methods for Machine Learning I
Lecture 8 – Principal Components Analysis
Chris J. Maddison
University of Toronto
Multivariate Gaussian Model
x ∼ N (µ,Σ), a Gaussian (or normal) distribution defined as
p(x) = 1 / ( (2π)^{D/2} |Σ|^{1/2} ) · exp[ −(1/2) (x − µ)⊤ Σ^{−1} (x − µ) ]
To understand the shape of the density, we will now study how the
standard normal N(0, I) is transformed to produce N(µ, Σ).
▶ Last week I mentioned that the multivariate Gaussian requires
understanding multivariate scaling by positive definite matrices.
▶ I didn’t do a great job of explaining this, so I’m going to try again.
Recall some definitions (details optional)
First, recall:
Definition. A symmetric matrix A is positive semidefinite if x⊤Ax ≥ 0 for all x. It is positive definite if x⊤Ax > 0 for all non-zero x.
▶ Any positive definite matrix is positive semidefinite.
▶ Positive definite matrices have positive eigenvalues, and positive
semidefinite matrices have non-negative eigenvalues.
▶ For any matrix X, X⊤X and XX⊤ are positive semidefinite.
Theorem (Unique Positive Square Root). Let A be a positive semidefinite real matrix. Then there is a unique positive semidefinite matrix B such that A = B⊤B = BB. We call A^{1/2} ≜ B the positive square root of A.
Theorem (Spectral Theorem). If A ∈ R^{D×D} is symmetric, then
1. R^D has an orthonormal basis consisting of the eigenvectors of A.
2. There exists an orthonormal matrix Q and a diagonal matrix Λ such that A = QΛQ⊤. This is called the spectral decomposition of A.
▶ The columns of Q are (unit) eigenvectors of A.
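To make these facts concrete, here is a small NumPy sketch (not from the lecture; the matrix A below is an arbitrary example) that verifies positive semidefiniteness, the spectral decomposition, and the positive square root numerically.

```python
import numpy as np

# Arbitrary symmetric PSD example: A = X^T X is always positive semidefinite.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
A = X.T @ X

# Spectral decomposition A = Q Lambda Q^T (eigh is for symmetric matrices).
lam, Q = np.linalg.eigh(A)
assert np.all(lam >= -1e-10)                       # PSD <=> non-negative eigenvalues
assert np.allclose(Q @ np.diag(lam) @ Q.T, A)      # A = Q Lambda Q^T

# Positive square root B = Q Lambda^{1/2} Q^T, which satisfies B B = A.
B = Q @ np.diag(np.sqrt(np.clip(lam, 0.0, None))) @ Q.T
assert np.allclose(B @ B, A)
```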
Matrix Square Roots & the Multivariate Gaussian
Suppose x is a standard Gaussian in D dimensions with density
p(x) = 1 / (2π)^{D/2} · exp[ −∥x∥² / 2 ]
Transform x to y = µ + Σ^{1/2} x. Then by change of variables,
p(y) = 1 / (2π)^{D/2} · exp[ −∥Σ^{−1/2}(y − µ)∥² / 2 ] · |Σ^{−1/2}|
     = 1 / ( (2π)^{D/2} |Σ|^{1/2} ) · exp[ −(1/2) (y − µ)⊤ Σ^{−1} (y − µ) ]
▶ Be careful: this derivation uses many facts about determinants, inverses,
and square roots that one would have to verify.
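As a numerical sanity check on this derivation (a sketch, not from the lecture; the µ and Σ below are made-up values), we can transform standard normal samples by y = µ + Σ^{1/2}x and confirm that the sample mean and covariance come out close to µ and Σ.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])                       # example mean (made up)
Sigma = np.array([[1.0, 0.8], [0.8, 1.0]])       # example covariance (made up)

# Sigma^{1/2} via the spectral decomposition Sigma = Q Lambda Q^T.
lam, Q = np.linalg.eigh(Sigma)
Sigma_half = Q @ np.diag(np.sqrt(lam)) @ Q.T

# y = mu + Sigma^{1/2} x for standard normal x.
x = rng.standard_normal((100_000, 2))
y = mu + x @ Sigma_half                          # Sigma_half is symmetric, so no transpose needed

print(y.mean(axis=0))                            # approximately mu
print(np.cov(y, rowvar=False))                   # approximately Sigma
```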
Matrix Square Roots & the Multivariate Gaussian
So N(µ, Σ) is N(0, I) shifted by µ and “scaled” by Σ^{1/2}.
How can you think of “scaling” space by the square root of a matrix?
For a PSD matrix Σ, find its spectral decomposition: Σ = QΛQ⊤.
Since Q is orthonormal, we have Q⊤Q = I, and that Σ^{1/2} = QΛ^{1/2}Q⊤.
Matrix Square Roots & the Multivariate Gaussian
We want to understand what it means to scale space by Σ^{1/2}, i.e. to map x to Σ^{1/2}x.
Multiplying a vector x by Q⊤ is the same as projecting x onto the
columns of Q, so this is like rotating space so that the basis of Q
becomes the standard basis.
Since Λ is diagonal, it is easy to calculate
Λ^{1/2} = ⎛ √λ1   0   ⋯   0  ⎞
          ⎜  0   √λ2  ⋯   0  ⎟
          ⎜  ⋮    ⋮   ⋱   ⋮  ⎟
          ⎝  0    0   ⋯  √λD ⎠
and multiplying by it is the same as scaling the (current) standard basis
by √λi.
Multiplying by Q rotates the standard basis back into the basis of Q.
Matrix Square Roots & the Multivariate Gaussian
To summarize, you can think of scaling space by Σ^{1/2} as the effect
of rotating the standard basis into the eigenvectors of Σ^{1/2} and scaling
space along those orthogonal directions.
So multivariate “scaling” has both magnitude and direction.
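A small numerical illustration of this rotate–scale–rotate picture (a sketch; the Σ and v below are arbitrary): applying Q⊤, then Λ^{1/2}, then Q to a vector gives the same result as multiplying by Σ^{1/2} directly.

```python
import numpy as np

Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])       # arbitrary PSD example
lam, Q = np.linalg.eigh(Sigma)
v = np.array([1.0, 1.0])                         # arbitrary vector

step1 = Q.T @ v                                  # rotate into the eigenbasis
step2 = np.sqrt(lam) * step1                     # scale each coordinate by sqrt(lambda_i)
step3 = Q @ step2                                # rotate back into the original basis

Sigma_half = Q @ np.diag(np.sqrt(lam)) @ Q.T     # Sigma^{1/2} = Q Lambda^{1/2} Q^T
assert np.allclose(step3, Sigma_half @ v)
```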
Bivariate Gaussian
Σ = ( 1 0; 0 1 )        Σ = 0.5 ( 1 0; 0 1 )        Σ = 2 ( 1 0; 0 1 )
[Figure: probability density function and contour plot of the pdf for each Σ]
Bivariate Gaussian
Σ = ( 1 0; 0 1 )        Σ = ( 2 0; 0 1 )        Σ = ( 1 0; 0 2 )
[Figure: probability density function and contour plot of the pdf for each Σ]
Bivariate Gaussian
Σ = ( 1 0; 0 1 )    Σ = ( 1 0.5; 0.5 1 ) = Q1 ( 1.5 0; 0 0.5 ) Q1⊤    Σ = ( 1 0.8; 0.8 1 ) = Q2 ( 1.8 0; 0 0.2 ) Q2⊤
Test your intuition: Does Q1 = Q2?
[Figure: probability density function and contour plot of the pdf for each Σ]
Bivariate Gaussian
Σ = ( 1 0; 0 1 )    Σ = ( 1 0.5; 0.5 1 ) = Q1 ( 1.5 0; 0 0.5 ) Q1⊤    Σ = ( 1 −0.5; −0.5 1 ) = Q2 ( λ1 0; 0 λ2 ) Q2⊤
Test your intuition: Does Q1 = Q2? What are λ1 and λ2?
[Figure: probability density function and contour plot of the pdf for each Σ]
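To check your intuition numerically (a quick sketch, not part of the slides), eigendecompose these covariance matrices and compare the eigenvalues and eigenvectors you get.

```python
import numpy as np

for rho in [0.5, 0.8, -0.5]:
    Sigma = np.array([[1.0, rho], [rho, 1.0]])
    lam, Q = np.linalg.eigh(Sigma)               # eigenvalues in ascending order
    print(f"rho = {rho:+.1f}  eigenvalues = {lam}")
    print(Q)                                     # eigenvectors as columns

# For Sigma = [[1, rho], [rho, 1]] the eigenvalues are 1 + rho and 1 - rho,
# with eigenvectors along (1, 1) and (1, -1) respectively, whatever rho is.
```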
Back to PCA
Back to principal component analysis (PCA)
Dimensionality reduction: map data to a lower dimensional space
PCA is a linear model. It’s useful for understanding lots of other
algorithms.
PCA is about finding linear structure in data.
Recall: Multivariate Parameters
Setup: given an iid dataset D = {x(1), . . . , x(N)} ⊂ R^D.
N instances/observations/examples
X = ⎡ [x(1)]⊤ ⎤   ⎡ x(1)_1  x(1)_2  ⋯  x(1)_D ⎤
    ⎢ [x(2)]⊤ ⎥ = ⎢ x(2)_1  x(2)_2  ⋯  x(2)_D ⎥
    ⎢    ⋮    ⎥   ⎢    ⋮       ⋮    ⋱     ⋮   ⎥
    ⎣ [x(N)]⊤ ⎦   ⎣ x(N)_1  x(N)_2  ⋯  x(N)_D ⎦
Mean and Covariance Estimators
We can estimate the mean µ and covariance Σ under the multivariate Gaussian
model using these sample approximations:
Sample mean: µ̂ = (1/N) ∑_{i=1}^N x(i)
Sample covariance: Σ̂ = (1/N) ∑_{i=1}^N (x(i) − µ̂)(x(i) − µ̂)⊤ = (1/N) (X − 1µ̂⊤)⊤(X − 1µ̂⊤)
µ̂ quantifies where your data is located in space (shift)
Σ̂ quantifies the shape of the spread of your data points (scale)
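A minimal NumPy sketch of these two estimators (the data matrix X here is random, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))                # N = 100 points in D = 3 dimensions

mu_hat = X.mean(axis=0)                          # sample mean, shape (D,)

# Sample covariance: average outer product of the centered points,
# equivalently (1/N) (X - 1 mu_hat^T)^T (X - 1 mu_hat^T).
Xc = X - mu_hat
Sigma_hat = Xc.T @ Xc / X.shape[0]

# Matches NumPy's estimator with the 1/N normalization.
assert np.allclose(Sigma_hat, np.cov(X, rowvar=False, bias=True))
```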
Low dimensional representation
In practice, even though data is very high dimensional, its important features
can be accurately captured in a low dimensional subspace.
Image credit: Elements of Statistical Learning
Find a low dimensional representation of your data.
▶ Computational benefits
▶ Interpretability, visualization
▶ Generalization
Projection onto a subspace
Set-up: given a dataset D = {x(1), . . . , x(N)} ⊂ RD
Set µ̂ to the sample mean of the data, µ̂ = (1/N) ∑_{i=1}^N x(i)
Goal: find a K-dimensional subspace S ⊂ R^D such that x(n) − µ̂ is
“well-represented” by its projection onto S
Recall: The projection of a point x onto S is the point in S closest to
x. More on this coming soon.
We are looking for directions
For example, in a 2-dimensional problem, we are looking for the
direction u1 along which the data is well represented: (?)
▶ e.g. direction of higher variance
▶ e.g. direction of minimum difference after projection
▶ turns out they are the same!
First step: Center data
[Figure: data points with candidate directions u1 and u3]
Directions we compute will pass through the origin, and should represent
the direction of highest variance.
We need to center our data since we don’t want the location of the data to
influence our calculations. We are only interested in finding the
direction of highest variance, which is independent of the mean.
⟹ We are not interested in u3, we are interested in u1.
Second step: Project onto lower dimensional space S
A projection is just a multivariate “scale” by 0 in the pruned directions. You
already know how to do this!
Use a positive semidefinite matrix:
Proj_u1 = Q ( 1 0; 0 0 ) Q⊤   where Q = ( u1/∥u1∥   u2/∥u2∥ )  (columns)
This is the same as:
Proj_u1 = Q ( 1 0; 0 0 ) Q⊤ = UU⊤   where U = ( u1/∥u1∥ )
Third step: Add back mean
Summary for a given point x:
1. Subtract mean: x − µ̂
2. Project on S: UU⊤(x − µ̂), where columns of U are unit eigenvectors for
largest K eigenvalues of Σ̂ (K directions of highest variance)
3. Add back mean: x̃ = µ̂ + UU⊤(x − µ̂)
Here, z = U⊤(x − µ̂) is a lower-dimensional representation of x.
And that’s it! We’ve done Principal Components Analysis (PCA)!
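Putting the three steps together, here is a minimal NumPy sketch of this procedure (the random data and the choice K = 2 are arbitrary, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))                # N = 200 points in D = 5 dimensions
K = 2                                            # dimension of the subspace

# Empirical mean and covariance.
mu_hat = X.mean(axis=0)
Xc = X - mu_hat                                  # step 1: subtract mean
Sigma_hat = Xc.T @ Xc / X.shape[0]

# Columns of U: unit eigenvectors of Sigma_hat for the K largest eigenvalues.
lam, Q = np.linalg.eigh(Sigma_hat)               # eigenvalues in ascending order
U = Q[:, ::-1][:, :K]

Z = Xc @ U                                       # step 2: codes z = U^T (x - mu_hat)
X_tilde = mu_hat + Z @ U.T                       # step 3: x~ = mu_hat + U U^T (x - mu_hat)
```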
Goal: find a low dimensional representation z of data x.
Outline for PCA:
Review projection onto a K dimensional subspace S.
Selecting the best affine space onto which to project.
Project x onto the affine space to get our low dimensional
representation z.
Euclidean projection
Projection onto a 1-D subspace
[Figure: projection of x onto a 1-D subspace S spanned by a unit vector u (∥u∥ = 1); θ is the angle between x and u, and ∥x̃∥ = ∥x∥ cos(θ)]
▶ {u} is a basis for S: any point in
S can be written as zu for some z .
Projection of x on S is denoted by Proj_S(x)
Recall: x⊤u = ∥x∥ ∥u∥ cos(θ) = ∥x∥ cos(θ)
Proj_S(x) = (x⊤u) u = ∥x̃∥ u
(x⊤u is the length of the projection; u is its direction)
General subspaces
How to project onto a K -dimensional subspace?
▶ Idea: choose an orthonormal basis {u1,u2,⋯,uK} for S (i.e. all unit
vectors and orthogonal to each other)
▶ Project onto each unit vector individually (as in previous slide), and
sum together the projections.
Mathematically, the projection is given as:
Proj_S(x) = ∑_{i=1}^K z_i u_i   where z_i = x⊤u_i.
In vector form:
Proj_S(x) = Uz   where z = U⊤x
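A short sketch of this formula (the orthonormal basis below comes from a QR decomposition of a random matrix, just to have something concrete): summing the K one-dimensional projections agrees with the vector form Uz.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 5, 2
U, _ = np.linalg.qr(rng.standard_normal((D, K)))     # orthonormal basis u_1, ..., u_K for S
x = rng.standard_normal(D)

# Project onto each basis vector individually and sum ...
proj_sum = sum((x @ U[:, i]) * U[:, i] for i in range(K))

# ... which equals the vector form U z with z = U^T x.
z = U.T @ x
assert np.allclose(proj_sum, U @ z)
```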
Projection onto a Subspace
So far, we assumed the subspace passes through 0.
In mathematical terminology, the “subspaces” we want to project
onto are really affine spaces, and can have an arbitrary origin µ̂.
z = U⊤(x − µ̂)
x̃ = UU⊤(x − µ̂) + µ̂
In machine learning, x̃ is also called the reconstruction of x.
z is its representation, or code.
Projection onto a Subspace
If we have a K -dimensional subspace in a
D-dimensional input space, then x ∈ RD and
z ∈ RK .
If the data points x all lie close to their
reconstructions, then we can approximate
distances, etc. in terms of these same
operations on the code vectors z.
If K ≪ D, then it’s much cheaper to work
with z than x.
A mapping to a space that’s easier to
manipulate or visualize is called a
representation, and learning such a mapping
is representation learning.
Mapping data to a low-dimensional space is
called dimensionality reduction.
Learning a Subspace
How to choose a good subspace S?
▶ Origin µ̂ is the empirical mean of the data
▶ Need to choose a D × K matrix U with orthonormal columns.
Two criteria:
▶ Minimize the reconstruction error:
min_U (1/N) ∑_{i=1}^N ∥x(i) − x̃(i)∥²
▶ Maximize the variance of reconstructions: find a subspace where the data
has the most variability.
max_U (1/N) ∑_i ∥x̃(i) − µ̂∥²
▶ Note: The data and its reconstruction have the same means (exercise)!
Learning a Subspace
These two criteria are equivalent! I.e., we’ll show
(1/N) ∑_{i=1}^N ∥x(i) − x̃(i)∥² = const − (1/N) ∑_i ∥x̃(i) − µ̂∥²
Recall x̃(i) = µ̂ + Uz(i) and z(i) = U⊤(x(i) − µ̂).
Learning a Subspace
Warmup Observation: Because the columns of U are orthonormal,
U⊤U = I, so
∥x̃ − µ̂∥² = ∥Uz∥² = z⊤U⊤Uz = z⊤z = ∥z∥².
⟹ norm of centered reconstruction is equal to norm of representation.
(If you draw it, this is obvious).
▶ Variance of reconstructions is equal to variance of code vectors:
(1/N) ∑_i ∥x̃(i) − µ̂∥² = (1/N) ∑_i ∥z(i)∥²   (exercise: show (1/N) ∑_i z(i) = 0)
Pythagorean Theorem
Key Observation: orthogonality of x̃(i) − µ̂ and x̃(i) − x(i).
(Two vectors a, b are orthogonal ⟺ a⊤b = 0)
Recall x̃(i) = µ̂ + UU⊤(x(i) − µ̂).
(x̃(i) − µ̂)⊤(x̃(i) − x(i)) = (x(i) − µ̂)⊤UU⊤( µ̂ − x(i) + UU⊤(x(i) − µ̂) )
= −(x(i) − µ̂)⊤UU⊤(x(i) − µ̂) + (x(i) − µ̂)⊤UU⊤UU⊤(x(i) − µ̂)
= 0     (since U⊤U = I, the two terms cancel)
Pythagorean Theorem
The Pythagorean Theorem tells us:
∥x̃(i) − µ̂∥2 + ∥x(i) − x̃(i)∥2 = ∥x(i) − µ̂∥2 for each i
By averaging this identity over the data points, we obtain
(1/N) ∑_{i=1}^N ∥x̃(i) − µ̂∥²  +  (1/N) ∑_{i=1}^N ∥x(i) − x̃(i)∥²  =  (1/N) ∑_{i=1}^N ∥x(i) − µ̂∥²
(projected variance)               (reconstruction error)                  (constant)
Therefore,
projected variance = constant − reconstruction error
Maximizing the variance is equivalent to minimizing the reconstruction error!
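This identity is easy to verify numerically (a sketch with random data and an arbitrary orthonormal U, not part of the slides): it holds for any subspace, which is exactly why maximizing one side is the same as minimizing the other.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 4))
mu_hat = X.mean(axis=0)
Xc = X - mu_hat

# Any U with orthonormal columns works; the identity does not depend on which one.
U, _ = np.linalg.qr(rng.standard_normal((4, 2)))
X_tilde = mu_hat + Xc @ U @ U.T

projected_var = np.mean(np.sum((X_tilde - mu_hat) ** 2, axis=1))
recon_error = np.mean(np.sum((X - X_tilde) ** 2, axis=1))
total_var = np.mean(np.sum(Xc ** 2, axis=1))

assert np.isclose(projected_var + recon_error, total_var)
```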
Principal Component Analysis
Choosing a subspace to maximize the projected variance, or minimize the
reconstruction error, is called principal component analysis (PCA).
Consider the empirical covariance matrix:
Σ̂ = (1/N) ∑_{i=1}^N (x(i) − µ̂)(x(i) − µ̂)⊤
Recall: Σ̂ is symmetric and positive semidefinite.
The optimal PCA subspace is spanned
by the top K eigenvectors of Σ̂.
▶ More precisely, choose the first K of
any orthonormal eigenbasis for Σ̂.
▶ We’ll show this for K = 1.
These eigenvectors are called principal
components, analogous to the principal
axes of an ellipse.
Supplement: Deriving PCA
For K = 1, we are fitting a unit vector u, and the code is a scalar
z(i) = u⊤(x(i) − µ̂). Let’s maximize the projected variance. From our
warmup observation, we have
(1/N) ∑_i ∥x̃(i) − µ̂∥² = (1/N) ∑_i [z(i)]² = (1/N) ∑_i (u⊤(x(i) − µ̂))²
= (1/N) ∑_{i=1}^N u⊤(x(i) − µ̂)(x(i) − µ̂)⊤u        [(a⊤b)² = a⊤b b⊤a]
= u⊤ [ (1/N) ∑_{i=1}^N (x(i) − µ̂)(x(i) − µ̂)⊤ ] u
= u⊤ Σ̂ u
= u⊤ QΛQ⊤ u                                        [spectral decomposition Σ̂ = QΛQ⊤]
= a⊤ Λ a      for a = Q⊤u
= ∑_{j=1}^D λ_j a_j²
Supplement: Deriving PCA
Maximize a⊤Λa = ∑_{j=1}^D λ_j a_j² for a = Q⊤u.
▶ This is a change-of-basis to the eigenbasis of Σ.
Assume the λi are in sorted order, λ1 ≥ λ2 ≥ ⋯ ≥ λD.
Observation: since u is a unit vector and Q is orthonormal, a is also a unit
vector: a⊤a = u⊤QQ⊤u = u⊤u = 1, i.e., ∑_j a_j² = 1.
By inspection, set a1 = ±1 and aj = 0 for j ≠ 1.
Hence, u = Qa = q1 (the top eigenvector).
A similar argument shows that the kth principal component is the kth
eigenvector of Σ.
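A quick numerical check of this conclusion (a sketch; the anisotropic random data and the comparison directions are arbitrary): the projected variance u⊤Σ̂u along the top eigenvector is at least as large as along any other unit vector.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 4)) @ np.diag([3.0, 2.0, 1.0, 0.5])   # anisotropic data
Xc = X - X.mean(axis=0)
Sigma_hat = Xc.T @ Xc / X.shape[0]

lam, Q = np.linalg.eigh(Sigma_hat)
q1 = Q[:, -1]                                    # eigenvector of the largest eigenvalue

best = q1 @ Sigma_hat @ q1                       # equals the largest eigenvalue
for _ in range(5):
    u = rng.standard_normal(4)
    u /= np.linalg.norm(u)                       # random unit direction
    assert u @ Sigma_hat @ u <= best + 1e-12
```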
Decorrelation
Interesting fact: the dimensions of z are decorrelated. For now, let
Cov denote the empirical covariance.
Cov(z) = Cov(U⊤(x − µ))
= U⊤ Cov(x) U
= U⊤ Σ U
= U⊤ QΛQ⊤ U                  ▷ spectral decomposition
= ( I 0 ) Λ ( I 0 )⊤          ▷ by orthogonality, since the columns of U are the first K columns of Q
= top-left K × K block of Λ
Since this covariance matrix is diagonal, the dimensions of z are
uncorrelated.
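A short sketch verifying this on random data (not from the slides): the empirical covariance of the codes z comes out as the top-left K × K block of Λ, hence diagonal, up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 4)) @ rng.standard_normal((4, 4))    # correlated data
mu_hat = X.mean(axis=0)
Xc = X - mu_hat
Sigma_hat = Xc.T @ Xc / X.shape[0]

lam, Q = np.linalg.eigh(Sigma_hat)               # eigenvalues in ascending order
K = 2
U = Q[:, ::-1][:, :K]                            # top-K eigenvectors

Z = Xc @ U                                       # codes z = U^T (x - mu_hat)
Cov_z = Z.T @ Z / Z.shape[0]

print(Cov_z)                                     # (approximately) diagonal
print(lam[::-1][:K])                             # the top K eigenvalues on the diagonal
```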
Recap
Recap:
Dimensionality reduction aims to find a low-dimensional
representation of the data.
PCA projects the data onto a subspace which maximizes the
projected variance, or equivalently, minimizes the reconstruction error.
The optimal subspace is given by the top eigenvectors of the
empirical covariance matrix.
PCA gives a set of decorrelated features.
Applying PCA to faces
Consider running PCA on 2429 19×19 grayscale images (CBCL data)
Can get good reconstructions with only 3 components
PCA for pre-processing: can apply classifier to latent representation
▶ Original data is 361 dimensional
▶ For face recognition PCA with 3 components obtains 79% accuracy on
face/non-face discrimination on test data vs. 76.8% for a Gaussian
mixture model (GMM) with 84 states. (We’ll cover GMMs later in the
course.)
Can also be good for visualization
Applying PCA to faces: Learned basis
Principal components of face images (“eigenfaces”)
Applying PCA to digits
Next
One more interpretation of PCA, which has an interesting generalization:
Matrix factorization.