Topic 5: Principal component analysis

5.1 Covariance matrices
Suppose we are interested in a population whose members are represented by vectors in Rd. We model the population as a probability distribution P over Rd, and let X be a random vector with distribution P. The mean of X is the “center of mass” of P. The covariance of X is also a kind of “center of mass”, but it turns out to reveal quite a lot of other information.
Note: if we have a finite collection of data points x1,x2,…,xn ∈ Rd, then it is common to arrange these vectors as rows of a matrix A ∈ Rn×d. In this case, we can think of P as the uniform distribution over the n points x1, x2, . . . , xn. The mean of X ∼ P can be written as
E(X) = (1/n) A⊤1,
and the covariance of X is
cov(X) = (1/n) A⊤A − ((1/n) A⊤1)((1/n) A⊤1)⊤ = (1/n) Ā⊤Ā,
where Ā := A − (1/n) 11⊤A. We often call these the empirical mean and empirical covariance of the data x1, x2, ..., xn.
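To make these matrix formulas concrete, here is a small numpy sketch (our own illustration with arbitrary synthetic data, not part of the notes) that checks (1/n)A⊤1 against the empirical mean and (1/n)Ā⊤Ā against the empirical covariance computed directly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 3
A = rng.normal(size=(n, d))                 # rows are the data points x_1, ..., x_n

ones = np.ones(n)
mean = A.T @ ones / n                       # (1/n) A^T 1
A_bar = A - np.outer(ones, ones) @ A / n    # A - (1/n) 1 1^T A  (rows centered at the mean)
cov = A_bar.T @ A_bar / n                   # (1/n) Abar^T Abar

# Direct computation from the definitions agrees.
assert np.allclose(mean, A.mean(axis=0))
assert np.allclose(cov, np.cov(A, rowvar=False, bias=True))   # bias=True divides by n
```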
Covariance matrices are always symmetric by definition. Moreover, they are always positive semidefinite, since for any non-zero z ∈ Rd,
z⊤ cov(X) z = z⊤ E[(X − E(X))(X − E(X))⊤] z = E[⟨z, X − E(X)⟩²] ≥ 0.
This also shows that for any unit vector u, the variance of X in direction u is
var(⟨u, X⟩) = E[⟨u, X − E(X)⟩²] = u⊤ cov(X) u.
Consider the following question: in what direction does X have the highest variance? It turns out this is given by an eigenvector corresponding to the largest eigenvalue of cov(X). This follows from the following variational characterization of the eigenvalues of symmetric matrices.
Theorem 5.1. Let M ∈ Rd×d be a symmetric matrix with eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λd and corresponding orthonormal eigenvectors v1, v2, . . . , vd. Then
max_{u≠0} (u⊤Mu)/(u⊤u) = λ1,
min_{u≠0} (u⊤Mu)/(u⊤u) = λd.
These are achieved by v1 and vd, respectively. (The ratio u⊤Mu/u⊤u is called the Rayleigh quotient associated with M in direction u.)
Proof. Following Theorem 4.1, write the eigendecomposition of M as M = V ΛV⊤, where V =
[v1|v2|···|vd] is orthogonal and Λ = diag(λ1, λ2, ..., λd) is diagonal. For any u ≠ 0,
(u⊤Mu)/(u⊤u) = (u⊤V ΛV⊤u)/(u⊤V V⊤u)   (since V V⊤ = I)
= (w⊤Λw)/(w⊤w)   (using w := V⊤u)
= (w1²λ1 + w2²λ2 + ··· + wd²λd)/(w1² + w2² + ··· + wd²).
This final ratio represents a convex combination of the scalars λ1, λ2, ..., λd. Its largest value is λ1, achieved by w = e1 (and hence u = V e1 = v1), and its smallest value is λd, achieved by w = ed (and hence u = V ed = vd).
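As a quick numerical sanity check of Theorem 5.1 (an illustrative sketch with a made-up symmetric matrix), one can compare the Rayleigh quotient at the extreme eigenvectors and at random directions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
B = rng.normal(size=(d, d))
M = B @ B.T                               # a symmetric test matrix

eigvals, eigvecs = np.linalg.eigh(M)      # eigh returns eigenvalues in ascending order
lam_min, lam_max = eigvals[0], eigvals[-1]
v_min, v_max = eigvecs[:, 0], eigvecs[:, -1]

rayleigh = lambda u: u @ M @ u / (u @ u)

# The eigenvectors achieve the extreme values ...
assert np.isclose(rayleigh(v_max), lam_max)
assert np.isclose(rayleigh(v_min), lam_min)
# ... and random directions always fall in between.
for _ in range(1000):
    u = rng.normal(size=d)
    assert lam_min - 1e-9 <= rayleigh(u) <= lam_max + 1e-9
```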
Corollary 5.1. Let v1 be a unit-length eigenvector of cov(X) corresponding to the largest eigenvalue of cov(X). Then
var(⟨v1, X⟩) = max_{u∈S^{d−1}} var(⟨u, X⟩).
Now suppose we are interested in the k-dimensional subspace of Rd that captures the “most” variance of X. Recall that a k-dimensional subspace W ⊆ Rd can always be specified by a collection of k orthonormal vectors u1,u2,…,uk ∈ W. By the orthogonal projection to W, we mean the linear map
x → U⊤x,   where U = [u1|u2|···|uk] ∈ Rd×k.
The covariance of U⊤X, a k×k covariance matrix, is simply given by cov(U⊤X) = U⊤ cov(X)U .
The “total” variance in this subspace is often measured by the trace of the covariance: tr(cov(U⊤X)). Recall, the trace of a square matrix is the sum of its diagonal entries, and it is a linear function.
Fact 5.1. For any U ∈ Rd×k, tr(cov(U⊤X)) = E∥U⊤(X − E(X))∥². Furthermore, if U⊤U = I, then tr(cov(U⊤X)) = E∥UU⊤(X − E(X))∥².
Theorem 5.2. Let M ∈ Rd×d be a symmetric matrix with eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λd and corresponding orthonormal eigenvectors v1, v2, . . . , vd. Then for any k ∈ [d],
max_{U∈Rd×k : U⊤U=I} tr(U⊤MU) = λ1 + λ2 + ··· + λk,
min_{U∈Rd×k : U⊤U=I} tr(U⊤MU) = λd−k+1 + λd−k+2 + ··· + λd.
The max is achieved by an orthogonal projection to the span of v1,v2,…,vk, and the min is achieved by an orthogonal projection to the span of vd−k+1, vd−k+2, . . . , vd.
Proof. Let u1, u2, ..., uk denote the columns of U. Then, writing M = ∑_{j=1}^d λj vjvj⊤ (Theorem 4.1),
tr(U⊤MU) = ∑_{i=1}^k ui⊤Mui = ∑_{i=1}^k ui⊤ (∑_{j=1}^d λj vjvj⊤) ui = ∑_{j=1}^d λj ∑_{i=1}^k ⟨vj, ui⟩² = ∑_{j=1}^d cj λj,
where cj := ∑_{i=1}^k ⟨vj, ui⟩² for each j ∈ [d]. We'll show that each cj ∈ [0, 1], and ∑_{j=1}^d cj = k. First, it is clear that cj ≥ 0 for each j ∈ [d]. Next, extending u1, u2, ..., uk to an orthonormal basis u1, u2, ..., ud for Rd, we have for each j ∈ [d],
cj = ∑_{i=1}^k ⟨vj, ui⟩² ≤ ∑_{i=1}^d ⟨vj, ui⟩² = 1.
Finally, since v1, v2, ..., vd is an orthonormal basis for Rd,
∑_{j=1}^d cj = ∑_{j=1}^d ∑_{i=1}^k ⟨vj, ui⟩² = ∑_{i=1}^k ∑_{j=1}^d ⟨vj, ui⟩² = ∑_{i=1}^k ∥ui∥² = k.
The maximum value of ∑_{j=1}^d cjλj over all choices of c1, c2, ..., cd ∈ [0, 1] with ∑_{j=1}^d cj = k is λ1 + λ2 + ··· + λk. This is achieved when c1 = c2 = ··· = ck = 1 and ck+1 = ··· = cd = 0, i.e., when span(v1, v2, ..., vk) = span(u1, u2, ..., uk). The minimum value of ∑_{j=1}^d cjλj over all choices of c1, c2, ..., cd ∈ [0, 1] with ∑_{j=1}^d cj = k is λd−k+1 + λd−k+2 + ··· + λd. This is achieved when c1 = ··· = cd−k = 0 and cd−k+1 = cd−k+2 = ··· = cd = 1, i.e., when span(vd−k+1, vd−k+2, ..., vd) = span(u1, u2, ..., uk).
We’ll refer to the k largest eigenvalues of a symmetric matrix M as the top-k eigenvalues of M, and the k smallest eigenvalues as the bottom-k eigenvalues of M. We analogously use the term top-k (resp., bottom-k) eigenvectors to refer to orthonormal eigenvectors corresponding to the top-k (resp., bottom-k) eigenvalues. Note that the choice of top-k (or bottom-k) eigenvectors is not necessarily unique.
Corollary 5.2. Let v1, v2, . . . , vk be top-k eigenvectors of cov(X), and let V k := [v1|v2| · · · |vk]. Then
tr(cov(Vk⊤X)) = max_{U∈Rd×k : U⊤U=I} tr(cov(U⊤X)).
An orthogonal projection given by top-k eigenvectors of cov(X) is called a (rank-k) principal component analysis (PCA) projection. Corollary 5.2 reveals an important property of a PCA projection: it maximizes the variance captured by the subspace.
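The following sketch (our own illustration, with synthetic data) computes a rank-k PCA projection from the empirical covariance and confirms that the variance it captures equals the sum of the top-k eigenvalues, and that a randomly chosen orthonormal U captures no more:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 1000, 6, 2
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))   # correlated synthetic data

Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / n                                      # empirical covariance

eigvals, eigvecs = np.linalg.eigh(cov)                   # ascending eigenvalue order
V_k = eigvecs[:, ::-1][:, :k]                            # top-k eigenvectors as columns

def captured_variance(U):
    """tr(cov(U^T X)) = U^T cov(X) U traced, for the empirical distribution."""
    return np.trace(U.T @ cov @ U)

# Variance captured by the PCA projection = sum of the top-k eigenvalues.
assert np.isclose(captured_variance(V_k), eigvals[::-1][:k].sum())

# Any other orthonormal U in R^{d x k} captures no more variance.
Q, _ = np.linalg.qr(rng.normal(size=(d, k)))
assert captured_variance(Q) <= captured_variance(V_k) + 1e-9
```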
5.2 Best affine and linear subspaces
PCA has another important property: it gives an affine subspace A ⊆ Rd that minimizes the expected squared distance between X and A.
Recall that a k-dimensional affine subspace A is specified by a k-dimensional (linear) subspace W ⊆ Rd—say, with orthonormal basis u1, u2, . . . , uk—and a displacement vector u0 ∈ Rd:
A = {u0 +α1u1 +α2u2 +···+αkuk :α1,α2,…,αk ∈R}.
Let U := [u1|u2|···|uk]. Then, for any x ∈ Rd, the point in A closest to x is given by u0 +
UU⊤(x − u0), and hence the squared distance from x to A is ∥(I − UU⊤)(x − u0)∥2.
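A tiny numerical check of this projection formula (a sketch with a made-up subspace and point): the point u0 + UU⊤(x − u0) is at least as close to x as any other point of A.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 2
U, _ = np.linalg.qr(rng.normal(size=(d, k)))   # orthonormal basis of the subspace W
u0 = rng.normal(size=d)                        # displacement vector
x = rng.normal(size=d)

closest = u0 + U @ U.T @ (x - u0)
dist_sq = np.linalg.norm((np.eye(d) - U @ U.T) @ (x - u0)) ** 2
assert np.isclose(np.linalg.norm(x - closest) ** 2, dist_sq)

# No other point u0 + U @ alpha of the affine subspace is closer.
for _ in range(1000):
    other = u0 + U @ rng.normal(size=k)
    assert np.linalg.norm(x - other) >= np.linalg.norm(x - closest) - 1e-9
```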
Theorem 5.3. Let v1, v2, . . . , vk be top-k eigenvectors of cov(X), let V k := [v1|v2| · · · |vk], and
v0 := E(X). Then
E∥(I − VkVk⊤)(X − v0)∥² = min_{U∈Rd×k, u0∈Rd : U⊤U=I} E∥(I − UU⊤)(X − u0)∥².
Proof. For any d×d matrix M, the function u0 → E∥M(X − u0)∥² is minimized when Mu0 = M E(X) (Fact 5.2). Therefore, we can plug in E(X) for u0 in the minimization problem, whereupon it reduces to
min_{U∈Rd×k : U⊤U=I} E∥(I − UU⊤)(X − E(X))∥².
The objective function can be rewritten as
E∥(I − UU⊤)(X − E(X))∥² = E∥X − E(X)∥² − E∥UU⊤(X − E(X))∥²
= E∥X − E(X)∥² − tr(cov(U⊤X)),
where the first equality uses the Pythagorean theorem (when U⊤U = I, the vectors UU⊤(X − E(X)) and (I − UU⊤)(X − E(X)) are orthogonal), and the second equality comes from Fact 5.1. Therefore, minimizing the objective is equivalent to maximizing tr(cov(U⊤X)), which is achieved by PCA (Corollary 5.2).
The proof of Theorem 5.3 depends on the following simple but useful fact.
Fact 5.2 (Bias-variance decomposition). Let Y be a random vector in Rd, and b ∈ Rd be any fixed vector. Then
E∥Y − b∥² = E∥Y − E(Y)∥² + ∥E(Y) − b∥²
(which, as a function of b, is minimized when b = E(Y)).
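Fact 5.2 is easy to verify numerically; the sketch below (with arbitrary synthetic samples) checks the identity for the empirical distribution of the samples, for which it holds exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 3
Y = rng.normal(size=(n, d)) + np.array([1.0, -2.0, 0.5])   # samples of Y
b = rng.normal(size=d)                                      # an arbitrary fixed vector

lhs = np.mean(np.sum((Y - b) ** 2, axis=1))                 # E||Y - b||^2
Ymean = Y.mean(axis=0)
rhs = np.mean(np.sum((Y - Ymean) ** 2, axis=1)) + np.sum((Ymean - b) ** 2)
assert np.isclose(lhs, rhs)   # exact identity for the empirical distribution
```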
A similar statement can be made about (linear) subspaces by using top-k eigenvectors of E(XX⊤) instead of cov(X). This is sometimes called uncentered PCA.
Theorem 5.4. Let v1, v2, . . . , vk be top-k eigenvectors of E(XX⊤), and let V k := [v1|v2| · · · |vk]. Then
E∥(I − VkVk⊤)X∥² = min_{U∈Rd×k : U⊤U=I} E∥(I − UU⊤)X∥².
5.3 Noisy affine subspace recovery
Suppose there are n points t1,t2,…,tn ∈ Rd that lie on an affine subspace A⋆ of dimension k. In this scenario, you don’t directly observe the ti; rather, you only observe noisy versions of these points: Y 1,Y 2,…,Y n, where for some σ1,σ2,…,σn > 0,
Yj ∼ N(tj, σj²I)   for all j ∈ [n],
and Y 1,Y 2,…,Y n are independent. The observations Y 1,Y 2,…,Y n no longer all lie in the affine subspace A⋆, but by applying PCA to the empirical covariance of Y 1, Y 2, . . . , Y n, you can hope to approximately recover A⋆.
Regard X as a random vector whose conditional distribution given the noisy points is uniform over Y 1, Y 2, . . . , Y n. In fact, the distribution of X is given by the following generative process:
1. Draw J ∈ [n] uniformly at random.
2. Given J, draw Z ∼ N(0, σJ²I).
3. Set X := tJ + Z.
Note that the empirical covariance based on Y 1, Y 2, . . . , Y n is not exactly cov(X), but it can be a good approximation when n is large (with high probability). Similarly, the empirical average of Y 1, Y 2, . . . , Y n is a good approximation to E(X) when n is large (with high probability). So here, we assume for simplicity that both cov(X) and E(X) are known exactly. We show that PCA produces a k-dimensional affine subspace that contains all of the tj.
Theorem 5.5. Let X be the random vector as defined above, v1,v2,…,vk be top-k eigenvectors of cov(X), and v0 := E(X). Then the affine subspace
A := {v0 + α1v1 + α2v2 + ··· + αkvk : α1, α2, ..., αk ∈ R}
contains t1, t2, ..., tn.
Proof. Theorem 5.3 says that the matrix Vk := [v1|v2|···|vk] minimizes E∥(I − UU⊤)(X − v0)∥² (as a function of U ∈ Rd×k, subject to U⊤U = I), or equivalently, maximizes tr(cov(U⊤X)). This maximization objective can be written as
tr(cov(U⊤X)) = E∥UU⊤(X − v0)∥²   (by Fact 5.1)
= (1/n) ∑_{j=1}^n E[ ∥UU⊤(tj − v0 + Z)∥² | J = j ]
= (1/n) ∑_{j=1}^n E[ ∥UU⊤(tj − v0)∥² + 2⟨UU⊤(tj − v0), UU⊤Z⟩ + ∥UU⊤Z∥² | J = j ]
= (1/n) ∑_{j=1}^n ( ∥UU⊤(tj − v0)∥² + E[ ∥UU⊤Z∥² | J = j ] )
= (1/n) ∑_{j=1}^n ( ∥UU⊤(tj − v0)∥² + kσj² ),
where the penultimate step uses the fact that the conditional distribution of Z given J = j is N(0, σj²I), and the final step uses the fact that ∥UU⊤Z∥² has the same conditional distribution (given J = j) as the squared length of a N(0, σj²I) random vector in Rk. Since UU⊤(tj − v0) is the orthogonal projection of tj − v0 onto the subspace spanned by the columns of U (call it W),
∥UU⊤(tj − v0)∥² ≤ ∥tj − v0∥²   for all j ∈ [n].
The inequalities above are equalities precisely when tj − v0 ∈ W for all j ∈ [n]. This is indeed the case when W is the translate A⋆ − v0 := {a − v0 : a ∈ A⋆}, which is a k-dimensional linear subspace because v0 = E(X) = (1/n)(t1 + t2 + ··· + tn) lies in A⋆. Since Vk maximizes the objective, its columns must span a k-dimensional subspace W that also contains all of the tj − v0; hence the affine subspace A = {v0 + x : x ∈ W} contains all of the tj.
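A small simulation in the spirit of this section (all choices below, such as dimensions and noise levels, are our own): plant points tj on a k-dimensional affine subspace, observe noisy versions, and check that the PCA subspace of the noisy points nearly contains the tj once n is large.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 10, 2, 5000

# Plant t_1, ..., t_n on a k-dimensional affine subspace  t = w0 + W alpha.
W, _ = np.linalg.qr(rng.normal(size=(d, k)))
w0 = rng.normal(size=d)
T = w0 + rng.normal(size=(n, k)) @ W.T                 # rows are the t_j

# Observe noisy versions  Y_j ~ N(t_j, sigma_j^2 I).
sigma = rng.uniform(0.1, 0.5, size=n)
Y = T + sigma[:, None] * rng.normal(size=(n, d))

# PCA of the noisy points: empirical mean and top-k eigenvectors of the covariance.
v0 = Y.mean(axis=0)
cov = (Y - v0).T @ (Y - v0) / n
V_k = np.linalg.eigh(cov)[1][:, ::-1][:, :k]           # top-k eigenvectors

# Residual of each t_j outside the recovered affine subspace v0 + span(V_k).
resid = (T - v0) - (T - v0) @ V_k @ V_k.T
print("max distance of a t_j from the recovered subspace:",
      np.linalg.norm(resid, axis=1).max())             # small, and shrinks as n grows
```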
5.4 Singular value decomposition
Let A be any n×d matrix. Our aim is to define an extremely useful decomposition of A called the singular value decomposition (SVD). Our derivation starts by considering two related matrices, A⊤A and AA⊤; their eigendecompositions will lead to the SVD of A.
Fact 5.3. A⊤A and AA⊤ are symmetric and positive semidefinite.
It is clear that the eigenvalues of A⊤A and AA⊤ are non-negative. In fact, the non-zero
eigenvalues of A⊤A and AA⊤ are exactly the same.
Lemma 5.1. Let λ be an eigenvalue of A⊤A with corresponding eigenvector v.
• If λ > 0, then λ is a non-zero eigenvalue of AA⊤ with corresponding eigenvector Av.
• If λ = 0, then Av = 0.
Proof. First suppose λ > 0. Then
AA⊤(Av) = A(A⊤Av) = A(λv) = λ(Av),
so λ is an eigenvalue of AA⊤ with corresponding eigenvector Av (note that Av ≠ 0, since ∥Av∥² = v⊤A⊤Av = λ∥v∥² > 0).
Now suppose λ = 0 (which is the only remaining case, as per Fact 5.3). Then ∥Av∥2 = v⊤A⊤Av = v⊤(λv) = 0.
Since only the zero vector has length 0, it must be that Av = 0.
(We can apply Lemma 5.1 to both A and A⊤ to conclude that A⊤A and AA⊤ have the same
non-zero eigenvalues.)
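A quick numerical illustration of this conclusion (a sketch with an arbitrary rectangular matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(7, 4))              # n = 7, d = 4, so A^T A is 4x4 and A A^T is 7x7

ev_small = np.linalg.eigvalsh(A.T @ A)   # 4 eigenvalues, all positive for generic A
ev_big = np.linalg.eigvalsh(A @ A.T)     # 7 eigenvalues, 3 of them (numerically) zero

nonzero = lambda ev: np.sort(ev[ev > 1e-8])
assert np.allclose(nonzero(ev_small), nonzero(ev_big))
```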
Theorem 5.6 (Singular value decomposition). Let A be an n×d matrix. Let v1, v2, . . . , vd ∈ Rd be orthonormal eigenvectors of A⊤A corresponding to eigenvalues λ1 ≥ λ2 ≥ ··· ≥ λd ≥ 0. Let r be the number of positive λi. Define
ui := Avi/∥Avi∥2 = Avi/√(vi⊤A⊤Avi) = Avi/√λi   for each i ∈ [r].
Then
A = [u1|u2|···|ur] · diag(√λ1, √λ2, ..., √λr) · [v1|v2|···|vr]⊤ = ∑_{i=1}^r √λi ui vi⊤,
where the three factors in the product have dimensions n×r, r×r, and r×d, respectively, and u1, u2, ..., ur are orthonormal.
Proof. It suffices to show that, for some set of d linearly independent vectors q1, q2, ..., qd ∈ Rd,
Aqj = ∑_{i=1}^r √λi ui vi⊤ qj   for all j ∈ [d].
We'll use v1, v2, ..., vd. Observe that
Avj = √λj uj if 1 ≤ j ≤ r,   and   Avj = 0 if r < j ≤ d
(the latter by Lemma 5.1). On the other hand, since v1, v2, ..., vd are orthonormal, ∑_{i=1}^r √λi ui vi⊤ vj also equals √λj uj if 1 ≤ j ≤ r and 0 if r < j ≤ d, so the two sides agree on every vj. Finally, u1, u2, ..., ur are orthonormal because ui⊤uj = (Avi)⊤(Avj)/(√λi√λj) = vi⊤(λj vj)/(√λi√λj), which is 1 if i = j and 0 otherwise.
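The construction in Theorem 5.6 can be carried out directly in numpy (a sketch; np.linalg.svd is of course the standard routine, and we use it here only as a cross-check):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 5
A = rng.normal(size=(n, d))

lam, V = np.linalg.eigh(A.T @ A)       # eigenvalues in ascending order
lam, V = lam[::-1], V[:, ::-1]         # reorder so that lambda_1 >= ... >= lambda_d
r = int(np.sum(lam > 1e-10))           # rank of A

U = (A @ V[:, :r]) / np.sqrt(lam[:r])  # u_i = A v_i / sqrt(lambda_i)
assert np.allclose(U.T @ U, np.eye(r))                              # u_1, ..., u_r orthonormal
assert np.allclose(A, U @ np.diag(np.sqrt(lam[:r])) @ V[:, :r].T)   # A = sum sqrt(lam_i) u_i v_i^T

# Consistency with numpy's SVD: the singular values are sqrt(lambda_i).
assert np.allclose(np.linalg.svd(A, compute_uv=False), np.sqrt(lam[:r]))
```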
For each i ∈ [r], the value σi := √λi is the i-th singular value of A, and ui and vi are corresponding left and right singular vectors. The singular values satisfy the variational characterization
σi = max_{x∈S^{d−1} : ⟨vj,x⟩=0 ∀j<i}  max_{y∈S^{n−1}} y⊤Ax = ui⊤Avi.
For any k ≤ r, a rank-k SVD of A is obtained by just keeping the first k components (corresponding to the k largest singular values), and this yields a matrix Ak ∈ Rn×d with rank k:
Ak := ∑_{i=1}^k σi ui vi⊤.    (5.1)
This matrix Ak is the best rank-k approximation to A in the sense that it minimizes the Frobenius norm error ∥A − Ak∥F over all matrices of rank (at most) k, where ∥X∥F denotes the Frobenius norm of X (the square root of the sum of its squared entries). This is remarkable because the set of matrices of rank at most k is not a set over which it is typically easy to optimize. (For instance, it is not a convex set.)
Theorem 5.7. Let A ∈ Rn×d be any matrix, with SVD as given in Theorem 5.6, and Ak as defined
in (5.1). Then:
1. The rows of Ak are the orthogonal projections of the corresponding rows of A to the k-
dimensional subspace spanned by top-k right singular vectors v1, v2, . . . , vk of A.
2. ∥A − Ak∥F ≤ min{∥A − B∥F : B ∈ Rn×d, rank(B) ≤ k}.
3. If a1, a2, ..., an ∈ Rd are the rows of A, and â1, â2, ..., ân ∈ Rd are the rows of Ak, then
∑_{i=1}^n ∥ai − âi∥² ≤ ∑_{i=1}^n ∥ai − bi∥²
for any vectors b1, b2, . . . , bn ∈ Rd that span a subspace of dimension at most k.
Proof. The orthogonal projection to the subspace Wk spanned by v1, v2, ..., vk is given by x → VkVk⊤x, where Vk := [v1|v2|···|vk]. Since VkVk⊤vi = vi for i ∈ [k] and VkVk⊤vi = 0 for i > k,
AVkVk⊤ = ∑_{i=1}^r σi ui vi⊤ VkVk⊤ = ∑_{i=1}^k σi ui vi⊤ = Ak.
This equality says that the rows of Ak are the orthogonal projections of the rows of A onto Wk. This proves the first claim.
Consider any matrix B ∈ Rn×d with rank(B) ≤ k, and let W be the subspace spanned by the rows of B. Let ΠW denote the orthogonal projector to W. Since each row of AΠW is the closest point of W to the corresponding row of A, while the corresponding row of B also lies in W, we have ∥A − AΠW∥F ≤ ∥A − B∥F. This means that
min_{B∈Rn×d : rank(B)≤k} ∥A − B∥F² = min_{subspace W⊆Rd : dim W ≤ k} ∥A − AΠW∥F² = min_{subspace W⊆Rd : dim W ≤ k} ∑_{i=1}^n ∥(I − ΠW)ai∥²,
where ai ∈ Rd denotes the i-th row of A. In fact, it is clear that we can take the minimization over subspaces W with dimW = k. Since the orthogonal projector to a subspace of dimension k is of the form UU⊤ for some U ∈ Rd×k satisfying U⊤U = I, it follows that the expression above is the same as
min_{U∈Rd×k : U⊤U=I} ∑_{i=1}^n ∥(I − UU⊤)ai∥².
Observe that (1/n) ∑_{i=1}^n ai ai⊤ = (1/n) A⊤A, so Theorem 5.6 implies that top-k eigenvectors of (1/n) ∑_{i=1}^n ai ai⊤ are top-k right singular vectors of A. By Theorem 5.4, the minimization problem above is achieved when U = Vk. This proves the second claim. The third claim is just a different interpretation of the second claim.
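To see Theorem 5.7 numerically, the sketch below (with made-up data) forms Ak by truncating numpy's SVD and compares its Frobenius error against other rank-k approximations obtained by projecting the rows onto random k-dimensional subspaces.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 20, 8, 3
A = rng.normal(size=(n, d))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] * s[:k] @ Vt[:k, :]          # rank-k truncated SVD, Eq. (5.1)

err_opt = np.linalg.norm(A - A_k, "fro")
# Frobenius error of the truncation = sqrt of the sum of the discarded squared singular values.
assert np.isclose(err_opt, np.sqrt(np.sum(s[k:] ** 2)))

# Other rank-k approximations: project the rows of A onto random k-dimensional subspaces.
for _ in range(200):
    Q, _ = np.linalg.qr(rng.normal(size=(d, k)))   # orthonormal basis of a random subspace
    B = A @ Q @ Q.T                                # a matrix of rank <= k
    assert np.linalg.norm(A - B, "fro") >= err_opt - 1e-9
```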
Best rank-k approximation in spectral norm
Another important matrix norm is the spectral norm: for a matrix X ∈ Rn×d,
∥X∥2 := max_{u∈S^{d−1}} ∥Xu∥2.
By Theorem 5.6, the spectral norm of X is equal to its largest singular value.
Fact 5.6. Let the SVD of a matrix A ∈ Rn×d be as given in Theorem 5.6, with r = rank(A).
• For any x ∈ Rd, ∥Ax∥2 ≤ σ1∥x∥2.
• For any x in the span of v1, v2, ..., vr, ∥Ax∥2 ≥ σr∥x∥2.
Unlike the Frobenius norm, the spectral norm does not arise from a matrix inner product. Nevertheless, it can be checked that it has the required properties of a norm: it satisfies ∥cX∥2 = |c|∥X∥2 and ∥X +Y ∥2 ≤ ∥X∥2 +∥Y ∥2, and the only matrix with ∥X∥2 = 0 is X = 0. Because of this, the spectral norm also provides a metric between matrices, dist(X, Y ) = ∥X −Y ∥2, satisfying the properties given in Section 1.1.
The rank-k SVD of a matrix A also provides the best rank-k approximation in terms of spectral norm error.
Theorem 5.8. Let A ∈ Rn×d be any matrix, with SVD as given in Theorem 5.6, and Ak as defined in (5.1). Then ∥A − Ak∥2 ≤ min{∥A − B∥2 : B ∈ Rn×d, rank(B) ≤ k}.
Proof. Since the largest singular value of A − Ak = ∑_{i=k+1}^r σi ui vi⊤ is σk+1, it follows that ∥A − Ak∥2 = σk+1.
Consider any matrix B ∈ Rn×d with rank(B) ≤ k. Its null space ker(B) has dimension at least d − rank(B) ≥ d − k. On the other hand, the span Wk+1 of v1, v2, ..., vk+1 has dimension k + 1. Since (d − k) + (k + 1) > d, there must be some non-zero vector x ∈ ker(B) ∩ Wk+1. For any such vector x,
∥A − B∥2 ≥ ∥(A − B)x∥2 / ∥x∥2   (by the definition of the spectral norm)
= ∥Ax∥2 / ∥x∥2   (since x is in the null space of B)
= √( ∥Ak+1x∥² + ∥(A − Ak+1)x∥² ) / ∥x∥2   (since Ak+1x and (A − Ak+1)x are orthogonal)
≥ ∥Ak+1x∥2 / ∥x∥2
≥ σk+1   (by Fact 5.6, applied to Ak+1, since x ∈ Wk+1).
Therefore ∥A − B∥2 ≥ ∥A − Ak∥2.
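A matching sketch for the spectral-norm statement (again with made-up data; np.linalg.norm(·, 2) of a matrix returns its largest singular value):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 20, 8, 3
A = rng.normal(size=(n, d))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] * s[:k] @ Vt[:k, :]                           # rank-k truncated SVD

# ||A - A_k||_2 equals the (k+1)-th singular value ...
assert np.isclose(np.linalg.norm(A - A_k, 2), s[k])

# ... and no rank-k matrix does better in spectral norm.
for _ in range(200):
    B = rng.normal(size=(n, k)) @ rng.normal(size=(k, d))    # a random matrix of rank <= k
    assert np.linalg.norm(A - B, 2) >= s[k] - 1e-9
```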