University of Toronto
Intro ML (UofT) STA314-Lec8 1 / 58
STA 314: Statistical Methods for Machine Learning I
Lecture 8 – Principal Components Analysis
Multivariate Gaussian Model
x ∼ N(μ, Σ), a Gaussian (or normal) distribution defined as

p(x) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left[ -\tfrac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right]
To understand the shape of the density, we will now study how the standard normal N(0, I) is transformed to produce N(μ, Σ).
▶ Last week I mentioned that the multivariate Gaussian requires
understanding multivariate scaling by positive definite matrices.
▶ I didn’t do a great job of explaining this, so I’m going to try again.
Intro ML (UofT) STA314-Lec8
Recall some definitions (details optional)
First, recall:
Definition. A symmetric matrix A is positive semidefinite if x⊤Ax ≥ 0 for all x. It is positive definite if x⊤Ax > 0 for all non-zero x.
▶ Any positive definite matrix is positive semidefinite.
▶ Positive definite matrices have positive eigenvalues, and positive
semidefinite matrices have non-negative eigenvalues.
▶ For any matrix X, X⊤X and XX⊤ are positive semidefinite.
Theorem (Unique Positive Square Root). Let A be a positive semidefinite real matrix. Then there is a unique positive semidefinite matrix B such that A = B^\top B = BB. We call A^{1/2} ≜ B the positive square root of A.

Theorem (Spectral Theorem). If A ∈ R^{d×d} is symmetric, then
1. R^d has an orthonormal basis consisting of eigenvectors of A.
2. There exists orthonormal matrix Q and diagonal matrix Λ such that
A = QΛQT . This is called the spectral decomposition of A.
▶ The columns of Q are (unit) eigenvectors of A.
Intro ML (UofT) STA314-Lec8
Matrix Square Roots & the Multivariate Gaussian
Suppose x is a standard Gaussian in D dimensions with density

p(x) = \frac{1}{(2\pi)^{D/2}} \exp\left[ -\tfrac{1}{2} \|x\|^2 \right]

Transform x to y = μ + Σ^{1/2} x. Then by the change-of-variables formula,

p(y) = \frac{1}{(2\pi)^{D/2}} \, |\Sigma^{-1/2}| \, \exp\left[ -\tfrac{1}{2} \| \Sigma^{-1/2} (y - \mu) \|^2 \right]
     = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left[ -\tfrac{1}{2} (y - \mu)^\top \Sigma^{-1} (y - \mu) \right]
▶ Be careful: this derivation uses several facts about determinants, inverses, and square roots that one would have to verify.
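The sketch below (not from the slides) illustrates this transformation in numpy: draw standard normal samples and map them through y = μ + Σ^{1/2} x. The particular μ and Σ values are made-up illustrations, and Σ^{1/2} is built from the spectral decomposition described on the next slide.

```python
# Minimal sketch (illustrative values): sample from N(mu, Sigma) by transforming
# standard normal draws with y = mu + Sigma^{1/2} x.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])            # assumed positive definite

# Positive square root via the spectral decomposition Sigma = Q Lambda Q^T
evals, Q = np.linalg.eigh(Sigma)
Sigma_half = Q @ np.diag(np.sqrt(evals)) @ Q.T

x = rng.standard_normal((10000, 2))       # rows are draws from N(0, I)
y = mu + x @ Sigma_half.T                 # rows are (approximately) N(mu, Sigma)

print(y.mean(axis=0))                     # close to mu
print(np.cov(y, rowvar=False))            # close to Sigma
```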
Intro ML (UofT) STA314-Lec8 4 / 58
Matrix Square Roots & the Multivariate Gaussian
So N(μ, Σ) is N(0, I) shifted by μ and “scaled” by Σ^{1/2}.

How can you think of “scaling” space by the square root of a matrix? For a PSD matrix Σ, find its spectral decomposition:

Σ = Q Λ Q^\top

Since Q is orthonormal, we have Q^\top Q = I, and therefore

Σ^{1/2} = Q Λ^{1/2} Q^\top

(Indeed, (Q Λ^{1/2} Q^\top)(Q Λ^{1/2} Q^\top) = Q Λ^{1/2} (Q^\top Q) Λ^{1/2} Q^\top = Q Λ Q^\top = Σ.)
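A quick numpy check of this identity, using a small hand-picked PSD Σ (the values are illustrative only, not from the slides):

```python
# Verify Sigma^{1/2} = Q Lambda^{1/2} Q^T squares back to Sigma.
import numpy as np

Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])

evals, Q = np.linalg.eigh(Sigma)                 # Sigma = Q diag(evals) Q^T
Sigma_half = Q @ np.diag(np.sqrt(evals)) @ Q.T   # rotate, scale by sqrt, rotate back

print(np.allclose(Sigma_half @ Sigma_half, Sigma))   # True
```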
Intro ML (UofT) STA314-Lec8 5 / 58
Matrix Square Roots & the Multivariate Gaussian
We want to understand what it means to scale space by Σ^{1/2}, i.e., to map x to Σ^{1/2} x = Q Λ^{1/2} Q^\top x.

Multiplying a vector x by Q^\top computes the coordinates of x in the basis given by the columns of Q, so this is like rotating space so that the basis of Q becomes the standard basis.

Since Λ is diagonal, it is easy to calculate

Λ^{1/2} = \begin{pmatrix} \sqrt{\lambda_1} & 0 & \cdots & 0 \\ 0 & \sqrt{\lambda_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sqrt{\lambda_D} \end{pmatrix}

and multiplying by Λ^{1/2} is the same as scaling the (current) standard basis directions by √λ1, …, √λD.
Multiplying by Q rotates the standard basis back into the basis of Q.
Intro ML (UofT) STA314-Lec8 6 / 58
Matrix Square Roots & the Multivariate Gaussian
To summarize, you can think of scaling space by Σ^{1/2} as the effect of rotating the standard basis into the eigenvectors of Σ and then scaling space along those orthogonal directions by √λ1, …, √λD.
So multivariate “scaling” has both magnitude and direction.
Intro ML (UofT) STA314-Lec8 7 / 58
Bivariate Gaussian
Σ = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \qquad Σ = 0.5 \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \qquad Σ = 2 \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}

[Figures: probability density function and contour plot of the pdf]
Intro ML (UofT) STA314-Lec8 8 / 58
Bivariate Gaussian
Σ = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \qquad Σ = \begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix} \qquad Σ = \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}

[Figures: probability density function and contour plot of the pdf]
Intro ML (UofT) STA314-Lec8 9 / 58
Bivariate Gaussian
Σ = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \qquad Σ = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix} = Q_1 \begin{pmatrix} 1.5 & 0 \\ 0 & 0.5 \end{pmatrix} Q_1^\top \qquad Σ = \begin{pmatrix} 1 & 0.8 \\ 0.8 & 1 \end{pmatrix} = Q_2 \begin{pmatrix} 1.8 & 0 \\ 0 & 0.2 \end{pmatrix} Q_2^\top

Test your intuition: Does Q1 = Q2?

[Figures: probability density function and contour plot of the pdf]

Intro ML (UofT) STA314-Lec8 10 / 58
Bivariate Gaussian
Σ = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \qquad Σ = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix} = Q_1 \begin{pmatrix} 1.5 & 0 \\ 0 & 0.5 \end{pmatrix} Q_1^\top \qquad Σ = \begin{pmatrix} 1 & -0.5 \\ -0.5 & 1 \end{pmatrix} = Q_2 \begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix} Q_2^\top

Test your intuition: Does Q1 = Q2? What are λ1 and λ2?

[Figures: probability density function and contour plot of the pdf]

Intro ML (UofT) STA314-Lec8 11 / 58
Back to PCA
Back to principal component analysis (PCA)
Dimensionality reduction: map data to a lower dimensional space
PCA is a linear model. It’s useful for understanding lots of other algorithms.
PCA is about finding linear structure in data.
Intro ML (UofT) STA314-Lec8
Recall: Multivariate Parameters
Setup: given an iid dataset D = {x^{(1)}, …, x^{(N)}} ⊂ R^D of N instances/observations/examples.
X = \begin{bmatrix} [x^{(1)}]^\top \\ [x^{(2)}]^\top \\ \vdots \\ [x^{(N)}]^\top \end{bmatrix} = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} & \cdots & x_D^{(1)} \\ x_1^{(2)} & x_2^{(2)} & \cdots & x_D^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ x_1^{(N)} & x_2^{(N)} & \cdots & x_D^{(N)} \end{bmatrix}
STA314-Lec8 13 / 58
Mean and Covariance Estimators
We can estimate the mean μ and covariance Σ under the multivariate Gaussian model using these sample approximations:

Sample mean: \hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} x^{(i)}

Sample covariance:

\hat{\Sigma} = \frac{1}{N} \sum_{i=1}^{N} (x^{(i)} - \hat{\mu})(x^{(i)} - \hat{\mu})^\top = \frac{1}{N} (X - \mathbf{1}\hat{\mu}^\top)^\top (X - \mathbf{1}\hat{\mu}^\top)
μ̂ quantifies where your data is located in space (shift)
Σ̂ quantifies the shape of the spread of your data points (scale)
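As a minimal sketch (assuming a data matrix X with one example per row; the random data is only for illustration), these estimators look like this in numpy:

```python
# Sample mean and (1/N) sample covariance for an N x D data matrix X.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))            # hypothetical data, N = 500, D = 3

mu_hat = X.mean(axis=0)                      # shape (D,)
Xc = X - mu_hat                              # centered data, X - 1 mu_hat^T
Sigma_hat = Xc.T @ Xc / X.shape[0]           # shape (D, D); note 1/N, not 1/(N-1)

print(np.allclose(Sigma_hat, np.cov(X, rowvar=False, bias=True)))  # True
```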
Intro ML (UofT) STA314-Lec8
Low dimensional representation
In practice, even though data is very high dimensional, its important features can be accurately captured in a low dimensional subspace.
Image credit: Elements of Statistical Learning
Find a low dimensional representation of your data.
▶ Computational benefits
▶ Interpretability, visualization
▶ Generalization
Intro ML (UofT) STA314-Lec8 15 / 58
Projection onto a subspace
Set-up: given a dataset D = {x(1), . . . , x(N)} ⊂ RD
Set μˆ to the sample mean of the data, μˆ = N1 ∑Ni=1 x(i)
Goal: find a K-dimensional subspace S ⊂ R^D such that x^{(n)} − μ̂ is “well-represented” by its projection onto S.
Recall: The projection of a point x onto S is the point in S closest to x. More on this coming soon.
Intro ML (UofT) STA314-Lec8 16 / 58
We are looking for directions
For example, in a 2-dimensional problem, we are looking for a direction u1 along which the data is well represented:
▶ e.g. direction of higher variance
▶ e.g. direction of minimum difference after projection
▶ turns out they are the same!
Intro ML (UofT) STA314-Lec8
First step: Center data
The directions we compute will pass through the origin, and should represent the direction of highest variance.
We need to center our data, since we don’t want the location of the data to influence our calculations. We are only interested in finding the direction of highest variance, which is independent of the mean.
⟹ We are not interested in u3, we are interested in u1.
Intro ML (UofT) STA314-Lec8 18 / 58
Second step: Project onto lower dimensional space S
A projection is just a multivariate “scaling” by 0 in the pruned directions and by 1 in the kept directions. You already know how to do this!

Use a positive semidefinite matrix:

Proj = Q \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} Q^\top \quad \text{where} \quad Q = \begin{pmatrix} \frac{u_1}{\|u_1\|} & \frac{u_2}{\|u_2\|} \end{pmatrix}

This is the same as:

Proj = Q \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} Q^\top = U U^\top \quad \text{where} \quad U = \begin{pmatrix} \frac{u_1}{\|u_1\|} \end{pmatrix}
Intro ML (UofT) STA314-Lec8 19 / 58
Third step: Add back mean
Summary for a given point x:
1. Subtract mean: x − μ̂
2. Project on S: UU⊤(x − μ̂), where the columns of U are unit eigenvectors for the largest K eigenvalues of Σ̂ (the K directions of highest variance)
3. Add back mean: x̃ = μ̂ + UU⊤(x − μ̂)
Here, z = U⊤(x − μ̂) is a lower dimensional representation of x. And that’s it! We’ve done Principal Components Analysis (PCA)!
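A minimal numpy sketch of these three steps, assuming a data matrix X of shape (N, D) and a target dimension K (the function and variable names are illustrative, not from the slides):

```python
# Center, project onto the top-K eigenvectors of the empirical covariance,
# and add the mean back to form the reconstruction.
import numpy as np

def pca_project(X, K):
    mu_hat = X.mean(axis=0)                    # sample mean
    Xc = X - mu_hat                            # step 1: subtract mean
    Sigma_hat = Xc.T @ Xc / X.shape[0]         # empirical covariance
    evals, Q = np.linalg.eigh(Sigma_hat)       # eigenvalues in ascending order
    U = Q[:, np.argsort(evals)[::-1][:K]]      # top-K unit eigenvectors (D x K)
    Z = Xc @ U                                 # codes z = U^T (x - mu_hat)
    X_tilde = mu_hat + Z @ U.T                 # steps 2-3: project, add back mean
    return Z, X_tilde, U, mu_hat

# Example usage on random data:
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
Z, X_tilde, U, mu_hat = pca_project(X, K=2)
print(Z.shape, X_tilde.shape)                  # (200, 2) (200, 5)
```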
Intro ML (UofT) STA314-Lec8 20 / 58
Goal: find a low dimensional representation z of data x.
Outline for PCA:
Review projection onto a K-dimensional subspace S.
Select the best affine space onto which to project.
Project x onto the affine space to get our low dimensional representation z.
Intro ML (UofT) STA314-Lec8 21 / 58
Euclidean projection
Projection onto a 1-D subspace
Projection of x on S is denoted by Proj_S(x).

Recall: x⊤u = ∥x∥∥u∥ cos(θ) = ∥x∥ cos(θ), since u is a unit vector.

Proj_S(x) = \underbrace{x^\top u}_{\text{length of proj}} \; \underbrace{u}_{\text{direction of proj}} = \|\tilde{x}\| \, u
Subspace S is the line along the unit vector u
▶ {u} is a basis for S: any point in S can be written as zu for some z.
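A tiny numeric illustration (values chosen for illustration only) of projecting onto a 1-D subspace spanned by a unit vector u:

```python
# Project x onto the line spanned by the unit vector u.
import numpy as np

x = np.array([3.0, 1.0])
u = np.array([1.0, 1.0]) / np.sqrt(2.0)   # unit vector spanning S

z = x @ u                                 # length of the projection, z = x^T u
proj = z * u                              # Proj_S(x) = (x^T u) u

print(z)                                  # 2.828... = ||x|| cos(theta)
print(proj)                               # [2. 2.]
```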
Intro ML (UofT)
STA314-Lec8 22 / 58
[Figure annotation: ∥x̃∥ = ∥x∥ cos(θ)]
General subspaces
How to project onto a K-dimensional subspace?
▶ Idea: choose an orthonormal basis {u1, u2, ⋯, uK } for S (i.e. all unit
vectors and orthogonal to each other)
▶ Project onto each unit vector individually (as in previous slide), and
sum together the projections. Mathematically, the projection is given as:
Proj_S(x) = \sum_{i=1}^{K} z_i u_i \quad \text{where } z_i = x^\top u_i

In vector form:

Proj_S(x) = U z \quad \text{where } z = U^\top x
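In numpy, the vector form is a single matrix product; this sketch assumes U already has orthonormal columns (the example basis and point are illustrative):

```python
# Project onto a K-dimensional subspace with orthonormal basis in the columns of U.
import numpy as np

U = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])          # orthonormal basis of a 2-D subspace of R^3
x = np.array([2.0, -1.0, 5.0])

z = U.T @ x                         # code: coordinates of x in the basis of U
proj = U @ z                        # Proj_S(x) = U U^T x

print(z)                            # [ 2. -1.]
print(proj)                         # [ 2. -1.  0.]
```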
Intro ML (UofT) STA314-Lec8 23 / 58
Projection onto a Subspace
So far, we assumed the subspace passes through 0.
In mathematical terminology, the “subspaces” we want to project onto are really affine spaces, and can have an arbitrary origin μˆ.
z = U⊤(x − μ̂)
x̃ = UU⊤(x − μ̂) + μ̂
In machine learning, ̃x is also called the reconstruction of x.
z is its representation, or code.
Intro ML (UofT) STA314-Lec8 24 / 58
Projection onto a Subspace
If we have a K-dimensional subspace in a D-dimensional input space, then x ∈ R^D and z ∈ R^K.
If the data points x all lie close to their reconstructions, then we can approximate distances, etc. in terms of these same operations on the code vectors z.
If K ≪ D, then it’s much cheaper to work with z than x.
A mapping to a space that’s easier to manipulate or visualize is called a representation, and learning such a mapping is representation learning.
Mapping data to a low-dimensional space is called dimensionality reduction.
Intro ML (UofT) STA314-Lec8 25 / 58
Learning a Subspace
How to choose a good subspace S?
▶ Origin μˆ is the empirical mean of the data
▶ Need to choose a D × K matrix U with orthonormal columns.
Two criteria:
▶ Minimize the reconstruction error:

\min_U \; \frac{1}{N} \sum_{i=1}^{N} \| x^{(i)} - \tilde{x}^{(i)} \|^2

▶ Maximize the variance of the reconstructions: find a subspace where the data has the most variability.

\max_U \; \frac{1}{N} \sum_{i=1}^{N} \| \tilde{x}^{(i)} - \hat{\mu} \|^2
▶ Note: The data and its reconstructions have the same mean (exercise)!
Intro ML (UofT) STA314-Lec8 26 / 58
Learning a Subspace
These two criteria are equivalent! I.e., we’ll show
\frac{1}{N} \sum_{i=1}^{N} \| x^{(i)} - \tilde{x}^{(i)} \|^2 = \text{const} - \frac{1}{N} \sum_{i=1}^{N} \| \tilde{x}^{(i)} - \hat{\mu} \|^2

Recall x̃^{(i)} = μ̂ + Uz^{(i)} and z^{(i)} = U⊤(x^{(i)} − μ̂).
Intro ML (UofT) STA314-Lec8
Learning a Subspace
Warmup Observation: Because the columns of U are orthonormal, U⊤U = I, so

\| \tilde{x} - \hat{\mu} \|^2 = \| U z \|^2 = z^\top U^\top U z = z^\top z = \| z \|^2
⟹ norm of centered reconstruction is equal to norm of representation.
(If you draw it, this is obvious).
▶ Variance of reconstructions is equal to variance of code vectors:

\frac{1}{N} \sum_i \| \tilde{x}^{(i)} - \hat{\mu} \|^2 = \frac{1}{N} \sum_i \| z^{(i)} \|^2 \qquad \left(\text{exercise: } \frac{1}{N} \sum_i z^{(i)} = 0\right)
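A quick numeric check of the warmup observation (the random orthonormal U and code vector z below are arbitrary illustrations):

```python
# For U with orthonormal columns, ||U z||^2 equals ||z||^2.
import numpy as np

rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((5, 2)))   # 5 x 2, orthonormal columns
z = rng.standard_normal(2)

print(np.allclose(np.linalg.norm(U @ z)**2, np.linalg.norm(z)**2))  # True
```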
Intro ML (UofT) STA314-Lec8
Pythagorean Theorem
Key Observation: orthogonality of x̃^{(i)} − μ̂ and x̃^{(i)} − x^{(i)}. (Two vectors a, b are orthogonal ⟺ a⊤b = 0.)
Recall x̃^{(i)} = μ̂ + UU⊤(x^{(i)} − μ̂).
(\tilde{x}^{(i)} - \hat{\mu})^\top (\tilde{x}^{(i)} - x^{(i)})
  = (x^{(i)} - \hat{\mu})^\top U U^\top \left( \hat{\mu} - x^{(i)} + U U^\top (x^{(i)} - \hat{\mu}) \right)
  = (x^{(i)} - \hat{\mu})^\top U U^\top (\hat{\mu} - x^{(i)}) + (x^{(i)} - \hat{\mu})^\top U U^\top U U^\top (x^{(i)} - \hat{\mu})
  = -(x^{(i)} - \hat{\mu})^\top U U^\top (x^{(i)} - \hat{\mu}) + (x^{(i)} - \hat{\mu})^\top U U^\top (x^{(i)} - \hat{\mu}) \qquad (U^\top U = I)
  = 0
Intro ML (UofT) STA314-Lec8 29 / 58
Pythagorean Theorem
The Pythagorean Theorem tells us:

\| \tilde{x}^{(i)} - \hat{\mu} \|^2 + \| x^{(i)} - \tilde{x}^{(i)} \|^2 = \| x^{(i)} - \hat{\mu} \|^2 \quad \text{for each } i

By averaging over the data and using the warmup observation, we obtain

\underbrace{ \frac{1}{N} \sum_{i=1}^{N} \| \tilde{x}^{(i)} - \hat{\mu} \|^2 }_{\text{projected variance}} + \underbrace{ \frac{1}{N} \sum_{i=1}^{N} \| x^{(i)} - \tilde{x}^{(i)} \|^2 }_{\text{reconstruction error}} = \frac{1}{N} \sum_{i=1}^{N} \| x^{(i)} - \hat{\mu} \|^2

The right-hand side does not depend on the choice of subspace, so it is a constant. Therefore,

projected variance = constant − reconstruction error

Maximizing the variance is equivalent to minimizing the reconstruction error!
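A quick numeric check of this decomposition (random data and a random orthonormal U, purely for illustration): projected variance plus reconstruction error equals the total variance regardless of which subspace we choose.

```python
# Verify: projected variance + reconstruction error = total variance.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 4))
mu_hat = X.mean(axis=0)
Xc = X - mu_hat

U, _ = np.linalg.qr(rng.standard_normal((4, 2)))   # any orthonormal basis, K = 2
X_tilde = mu_hat + (Xc @ U) @ U.T                  # reconstructions

proj_var  = np.mean(np.sum((X_tilde - mu_hat)**2, axis=1))
recon_err = np.mean(np.sum((X - X_tilde)**2, axis=1))
total_var = np.mean(np.sum((X - mu_hat)**2, axis=1))

print(np.isclose(proj_var + recon_err, total_var))  # True
```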
Intro ML (UofT) STA314-Lec8
Principal Component Analysis
Choosing a subspace to maximize the projected variance, or minimize the reconstruction error, is called principal component analysis (PCA).
Consider the empirical covariance matrix:

\hat{\Sigma} = \frac{1}{N} \sum_{i=1}^{N} (x^{(i)} - \hat{\mu})(x^{(i)} - \hat{\mu})^\top
Recall: Σˆ is symmetric and positive semidefinite.
The optimal PCA subspace is spanned by the top K eigenvectors of Σˆ .
▶ More precisely, choose the first K of any orthonormal eigenbasis for Σˆ.
▶ We’ll show this for K = 1.
These eigenvectors are called principal components, analogous to the principal axes of an ellipse.
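One practical detail worth a sketch: np.linalg.eigh returns eigenvalues in ascending order, so selecting the top-K principal components requires reversing that order. The example covariance below is the one from the earlier bivariate slides; the helper name is illustrative.

```python
# Extract the top-K eigenvectors (principal components) of Sigma_hat.
import numpy as np

def top_k_eigenvectors(Sigma_hat, K):
    evals, Q = np.linalg.eigh(Sigma_hat)     # ascending eigenvalues
    order = np.argsort(evals)[::-1]          # largest eigenvalues first
    return Q[:, order[:K]], evals[order[:K]]

Sigma_hat = np.array([[1.0, 0.5],
                      [0.5, 1.0]])
U, lams = top_k_eigenvectors(Sigma_hat, K=1)
print(lams)                                  # [1.5], matching the earlier decomposition
```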
Intro ML (UofT) STA314-Lec8 31 / 58
Supplement: Deriving PCA
For K = 1, we are fitting a unit vector u, and the code is a scalar z^{(i)} = u⊤(x^{(i)} − μ̂). Let’s maximize the projected variance. From our warmup observation, we have

\frac{1}{N} \sum_i \| \tilde{x}^{(i)} - \hat{\mu} \|^2
  = \frac{1}{N} \sum_i [z^{(i)}]^2
  = \frac{1}{N} \sum_i \left( u^\top (x^{(i)} - \hat{\mu}) \right)^2
  = \frac{1}{N} \sum_i u^\top (x^{(i)} - \hat{\mu})(x^{(i)} - \hat{\mu})^\top u \qquad (a^\top b)^2 = a^\top b \, b^\top a
  = u^\top \left[ \frac{1}{N} \sum_{i=1}^{N} (x^{(i)} - \hat{\mu})(x^{(i)} - \hat{\mu})^\top \right] u
  = u^\top Q \Lambda Q^\top u \qquad \text{spectral decomposition } \hat{\Sigma} = Q \Lambda Q^\top
  = a^\top \Lambda a = \sum_{j=1}^{D} \lambda_j a_j^2 \qquad \text{for } a = Q^\top u
STA314-Lec8
Supplement: Deriving PCA
Maximize a⊤Λa = \sum_{j=1}^{D} \lambda_j a_j^2 for a = Q⊤u.
▶ This is a change of basis to the eigenbasis of Σ̂.
Assume the λ_i are in sorted order, λ1 ≥ λ2 ≥ ⋯ ≥ λD.
Observation: since u is a unit vector and Q is orthogonal, a is also a unit vector: a⊤a = u⊤QQ⊤u = u⊤u = 1, i.e., \sum_j a_j^2 = 1.
By inspection, the maximum is attained by setting a1 = ±1 and aj = 0 for j ≠ 1. Hence, u = Qa = q1 (the top eigenvector).
A similar argument shows that the kth principal component is the kth eigenvector of Σ.
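A small numeric sanity check (not from the slides): among many random unit vectors, none achieves a larger projected variance u⊤Σ̂u than the top eigenvector q1.

```python
# The top eigenvector maximizes u^T Sigma_hat u over unit vectors u.
import numpy as np

rng = np.random.default_rng(0)
Sigma_hat = np.array([[1.0, 0.8],
                      [0.8, 1.0]])
evals, Q = np.linalg.eigh(Sigma_hat)
q1 = Q[:, np.argmax(evals)]                        # top eigenvector, lambda_1 = 1.8

u = rng.standard_normal((1000, 2))
u /= np.linalg.norm(u, axis=1, keepdims=True)      # 1000 random unit vectors
vals = np.einsum('nd,de,ne->n', u, Sigma_hat, u)   # u^T Sigma_hat u for each u

print(vals.max() <= q1 @ Sigma_hat @ q1 + 1e-12)   # True
```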
Intro ML (UofT) STA314-Lec8 33 / 58
Decorrelation
Interesting fact: the dimensions of z are decorrelated. For now, let Cov denote the empirical covariance.
Cov(z) = Cov(U^\top (x - \hat{\mu})) = U^\top \, \text{Cov}(x) \, U
       = U^\top Q \Lambda Q^\top U \qquad \triangleright \text{ spectral decomposition}
       = \begin{pmatrix} I & 0 \end{pmatrix} \Lambda \begin{pmatrix} I \\ 0 \end{pmatrix} \qquad \triangleright \text{ by orthogonality, } Q^\top U = \begin{pmatrix} I \\ 0 \end{pmatrix}
       = \text{top left } K \times K \text{ block of } \Lambda
If the covariance matrix is diagonal, this means the features are
uncorrelated.
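A quick numeric check of the decorrelation claim (random correlated data, illustrative only): the empirical covariance of the code vectors Z is diagonal up to numerical precision.

```python
# Cov(z) is diagonal when U holds eigenvectors of the empirical covariance.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 4)) @ rng.standard_normal((4, 4))  # correlated data
mu_hat = X.mean(axis=0)
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / X.shape[0]

evals, Q = np.linalg.eigh(Sigma_hat)
U = Q[:, np.argsort(evals)[::-1][:2]]        # top-2 principal directions
Z = (X - mu_hat) @ U

Cov_Z = Z.T @ Z / Z.shape[0]
print(np.round(Cov_Z, 6))                    # off-diagonal entries are ~0
```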
Intro ML (UofT) STA314-Lec8 34 / 58
Dimensionality reduction aims to find a low-dimensional representation of the data.
PCA projects the data onto a subspace which maximizes the projected variance, or equivalently, minimizes the reconstruction error.
The optimal subspace is given by the top eigenvectors of the empirical covariance matrix.
PCA gives a set of decorrelated features.
Intro ML (UofT) STA314-Lec8 35 / 58
Applying PCA to faces
Consider running PCA on 2429 19×19 grayscale images (CBCL data).
Can get good reconstructions with only 3 components.
PCA for pre-processing: can apply classifier to latent representation
▶ Original data is 361 dimensional
▶ For face recognition PCA with 3 components obtains 79% accuracy on
face/non-face discrimination on test data vs. 76.8% for a Gaussian mixture model (GMM) with 84 states. (We’ll cover GMMs later in the course.)
Can also be good for visualization
Intro ML (UofT) STA314-Lec8 36 / 58
Applying PCA to faces: Learned basis
Principal components of face images (“eigenfaces”)
Intro ML (UofT) STA314-Lec8 37 / 58
Applying PCA to digits
Intro ML (UofT) STA314-Lec8 38 / 58
One more interpretation of PCA, which has an interesting generalization: Matrix factorization.
Intro ML (UofT) STA314-Lec8 39 / 58