Topic 5: Principal component analysis

5.1 Covariance matrices

Suppose we are interested in a population whose members are represented by vectors in Rd. We
model the population as a probability distribution P over Rd, and let X be a random vector with
distribution P. The mean of X is the “center of mass” of P. The covariance of X is also an average,
namely the mean of the outer product (X − E(X))(X − E(X))⊤, and it turns out to reveal quite a lot of
additional information about P.

Note: if we have a finite collection of data points x1,x2, . . . ,xn ∈ Rd, then it is common to
arrange these vectors as rows of a matrix A ∈ Rn×d. In this case, we can think of P as the uniform
distribution over the n points x1,x2, . . . ,xn. The mean of X ∼ P can be written as

\[ \mathbb{E}(X) = \frac{1}{n} A^\top \mathbf{1} , \]

and the covariance of X is

\[ \operatorname{cov}(X) = \frac{1}{n} A^\top A
   - \left( \frac{1}{n} A^\top \mathbf{1} \right) \left( \frac{1}{n} A^\top \mathbf{1} \right)^{\!\top}
   = \frac{1}{n} \tilde{A}^\top \tilde{A} , \]

where Ã = A − (1/n)11⊤A. We often call these the empirical mean and empirical covariance of
the data x1,x2, . . . ,xn.
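
As a concrete check of these formulas, here is a minimal NumPy sketch (using a small randomly generated data matrix as a stand-in for real data; not part of the notes) that computes the empirical mean and covariance via the matrix expressions above and compares them with direct averaging.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 3
A = rng.normal(size=(n, d))                   # rows are the data points x_1, ..., x_n
ones = np.ones(n)

mean = A.T @ ones / n                         # E(X) = (1/n) A^T 1
A_tilde = A - np.outer(ones, ones) @ A / n    # centered data matrix A~
cov = A_tilde.T @ A_tilde / n                 # cov(X) = (1/n) A~^T A~

# Direct averaging gives the same quantities.
assert np.allclose(mean, A.mean(axis=0))
assert np.allclose(cov, np.cov(A, rowvar=False, bias=True))
```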

Covariance matrices are always symmetric by definition. Moreover, they are always positive
semidefinite, since for any non-zero z ∈ Rd,

\[ z^\top \operatorname{cov}(X)\, z
   = z^\top\, \mathbb{E}\!\left[ (X - \mathbb{E}(X))(X - \mathbb{E}(X))^\top \right] z
   = \mathbb{E}\!\left[ \langle z, X - \mathbb{E}(X) \rangle^2 \right] \ge 0 . \]

This also shows that for any unit vector u, the variance of X in direction u is

\[ \operatorname{var}(\langle u, X \rangle)
   = \mathbb{E}\!\left[ \langle u, X - \mathbb{E}(X) \rangle^2 \right]
   = u^\top \operatorname{cov}(X)\, u . \]

Consider the following question: in what direction does X have the highest variance? It turns
out that this direction is given by an eigenvector corresponding to the largest eigenvalue of cov(X). This follows
from the following variational characterization of the eigenvalues of symmetric matrices.

Theorem 5.1. Let M ∈ Rd×d be a symmetric matrix with eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λd and
corresponding orthonormal eigenvectors v1,v2, . . . ,vd. Then

\[ \max_{u \neq 0} \frac{u^\top M u}{u^\top u} = \lambda_1 , \qquad
   \min_{u \neq 0} \frac{u^\top M u}{u^\top u} = \lambda_d . \]

These are achieved by v1 and vd, respectively. (The ratio u⊤Mu/u⊤u is called the Rayleigh
quotient associated with M in direction u.)


Proof. Following Theorem 4.1, write the eigendecomposition of M as M = V ΛV ⊤ where V =
[v1|v2| · · · |vd] is orthogonal and Λ = diag(λ1, λ2, . . . , λd) is diagonal. For any u ≠ 0,

\[
\begin{aligned}
\frac{u^\top M u}{u^\top u}
&= \frac{u^\top V \Lambda V^\top u}{u^\top V V^\top u} && \text{(since } V V^\top = I\text{)} \\
&= \frac{w^\top \Lambda w}{w^\top w} && \text{(using } w := V^\top u\text{)} \\
&= \frac{w_1^2 \lambda_1 + w_2^2 \lambda_2 + \cdots + w_d^2 \lambda_d}{w_1^2 + w_2^2 + \cdots + w_d^2} .
\end{aligned}
\]

This final ratio represents a convex combination of the scalars λ1, λ2, . . . , λd. Its largest value is λ1,
achieved by w = e1 (and hence u = V e1 = v1), and its smallest value is λd, achieved by w = ed
(and hence u = V ed = vd).
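
As a numerical illustration of Theorem 5.1, here is a sketch (not part of the notes, assuming NumPy, whose eigh returns eigenvalues in ascending order for symmetric matrices): the Rayleigh quotient of every direction lies between the smallest and largest eigenvalues, and the extreme eigenvectors attain the extremes.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
B = rng.normal(size=(d, d))
M = (B + B.T) / 2                           # a symmetric matrix

eigvals, eigvecs = np.linalg.eigh(M)        # ascending eigenvalues, orthonormal eigenvectors
lam_min, lam_max = eigvals[0], eigvals[-1]
v_min, v_max = eigvecs[:, 0], eigvecs[:, -1]

def rayleigh(u):
    return (u @ M @ u) / (u @ u)

# The top/bottom eigenvectors achieve the extreme values ...
assert np.isclose(rayleigh(v_max), lam_max)
assert np.isclose(rayleigh(v_min), lam_min)

# ... and no direction does better.
U = rng.normal(size=(10000, d))
quotients = np.einsum('id,dj,ij->i', U, M, U) / np.einsum('ij,ij->i', U, U)
assert quotients.max() <= lam_max + 1e-9
assert quotients.min() >= lam_min - 1e-9
```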

Corollary 5.1. Let v1 be a unit-length eigenvector of cov(X) corresponding to the largest
eigenvalue of cov(X). Then

\[ \operatorname{var}(\langle v_1, X \rangle) = \max_{u \in S^{d-1}} \operatorname{var}(\langle u, X \rangle) . \]

Now suppose we are interested in the k-dimensional subspace of Rd that captures the “most”
variance of X. Recall that a k-dimensional subspace W ⊆ Rd can always be specified by a collection
of k orthonormal vectors u1,u2, . . . ,uk ∈ W . By the orthogonal projection to W , we mean the
linear map

\[ x \mapsto U^\top x , \quad \text{where } U = \big[\, u_1 \,\big|\, u_2 \,\big|\, \cdots \,\big|\, u_k \,\big] \in \mathbb{R}^{d \times k} . \]

The covariance of U⊤X, a k×k covariance matrix, is simply given by

\[ \operatorname{cov}(U^\top X) = U^\top \operatorname{cov}(X)\, U . \]

The “total” variance in this subspace is often measured by the trace of the covariance, tr(cov(U⊤X)).
Recall that the trace of a square matrix is the sum of its diagonal entries, and that the trace is a linear function.

Fact 5.1. For any U ∈ Rd×k, tr(cov(U⊤X)) = E ‖U⊤(X − E(X))‖₂². Furthermore, if U⊤U = I,
then tr(cov(U⊤X)) = E ‖UU⊤(X − E(X))‖₂².

Theorem 5.2. Let M ∈ Rd×d be a symmetric matrix with eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λd and
corresponding orthonormal eigenvectors v1,v2, . . . ,vd. Then for any k ∈ [d],

\[ \max_{U \in \mathbb{R}^{d \times k}:\, U^\top U = I} \operatorname{tr}(U^\top M U) = \lambda_1 + \lambda_2 + \cdots + \lambda_k , \]
\[ \min_{U \in \mathbb{R}^{d \times k}:\, U^\top U = I} \operatorname{tr}(U^\top M U) = \lambda_{d-k+1} + \lambda_{d-k+2} + \cdots + \lambda_d . \]

The max is achieved by an orthogonal projection to the span of v1,v2, . . . ,vk, and the min is
achieved by an orthogonal projection to the span of vd−k+1,vd−k+2, . . . ,vd.

Proof. Let u1,u2, . . . ,uk denote the columns of U . Then, writing M = ∑_{j=1}^d λj vj vj⊤ (Theorem 4.1),

\[ \operatorname{tr}(U^\top M U)
   = \sum_{i=1}^{k} u_i^\top M u_i
   = \sum_{i=1}^{k} u_i^\top \left( \sum_{j=1}^{d} \lambda_j v_j v_j^\top \right) u_i
   = \sum_{j=1}^{d} \lambda_j \sum_{i=1}^{k} \langle v_j, u_i \rangle^2
   = \sum_{j=1}^{d} c_j \lambda_j \]


where cj := ∑_{i=1}^k 〈vj ,ui〉² for each j ∈ [d]. We’ll show that each cj ∈ [0, 1], and ∑_{j=1}^d cj = k.

First, it is clear that cj ≥ 0 for each j ∈ [d]. Next, extending u1,u2, . . . ,uk to an orthonormal
basis u1,u2, . . . ,ud for Rd, we have for each j ∈ [d],

\[ c_j = \sum_{i=1}^{k} \langle v_j, u_i \rangle^2 \le \sum_{i=1}^{d} \langle v_j, u_i \rangle^2 = 1 . \]

Finally, since v1,v2, . . . ,vd is an orthonormal basis for Rd,
\[ \sum_{j=1}^{d} c_j
   = \sum_{j=1}^{d} \sum_{i=1}^{k} \langle v_j, u_i \rangle^2
   = \sum_{i=1}^{k} \sum_{j=1}^{d} \langle v_j, u_i \rangle^2
   = \sum_{i=1}^{k} \| u_i \|_2^2 = k . \]

The maximum value of ∑_{j=1}^d cjλj over all choices of c1, c2, . . . , cd ∈ [0, 1] with ∑_{j=1}^d cj = k
is λ1 + λ2 + · · · + λk. This is achieved when c1 = c2 = · · · = ck = 1 and ck+1 = · · · = cd = 0,
i.e., when span(v1,v2, . . . ,vk) = span(u1,u2, . . . ,uk). The minimum value of ∑_{j=1}^d cjλj over all
choices of c1, c2, . . . , cd ∈ [0, 1] with ∑_{j=1}^d cj = k is λd−k+1 + λd−k+2 + · · · + λd. This is achieved when
c1 = · · · = cd−k = 0 and cd−k+1 = cd−k+2 = · · · = cd = 1, i.e., when span(vd−k+1,vd−k+2, . . . ,vd) =
span(u1,u2, . . . ,uk).

We’ll refer to the k largest eigenvalues of a symmetric matrix M as the top-k eigenvalues of
M , and the k smallest eigenvalues as the bottom-k eigenvalues of M . We analogously use the
term top-k (resp., bottom-k) eigenvectors to refer to orthonormal eigenvectors corresponding to the
top-k (resp., bottom-k) eigenvalues. Note that the choice of top-k (or bottom-k) eigenvectors is
not necessarily unique.

Corollary 5.2. Let v1,v2, . . . ,vk be top-k eigenvectors of cov(X), and let V k := [v1|v2| · · · |vk].
Then

\[ \operatorname{tr}(\operatorname{cov}(V_k^\top X))
   = \max_{U \in \mathbb{R}^{d \times k}:\, U^\top U = I} \operatorname{tr}(\operatorname{cov}(U^\top X)) . \]

An orthogonal projection given by top-k eigenvectors of cov(X) is called a (rank-k) principal
component analysis (PCA) projection. Corollary 5.2 reveals an important property of a PCA
projection: it maximizes the variance captured by the subspace.
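
The sketch below illustrates Corollary 5.2 on synthetic data (an illustration only, not from the notes): the trace of the covariance of the rank-k PCA projection equals the sum of the top-k eigenvalues, and no other k-dimensional orthogonal projection captures more variance.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 1000, 6, 2
A = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))   # synthetic, correlated data
cov = np.cov(A, rowvar=False, bias=True)                 # empirical covariance of X

eigvals, eigvecs = np.linalg.eigh(cov)                   # ascending eigenvalue order
V_k = eigvecs[:, -k:]                                    # top-k eigenvectors of cov(X)
pca_variance = np.trace(V_k.T @ cov @ V_k)               # tr(cov(V_k^T X))
assert np.isclose(pca_variance, eigvals[-k:].sum())      # = lambda_1 + ... + lambda_k

# Random orthonormal U in R^{d x k} never captures more variance.
for _ in range(1000):
    U, _ = np.linalg.qr(rng.normal(size=(d, k)))         # U^T U = I
    assert np.trace(U.T @ cov @ U) <= pca_variance + 1e-9
```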

5.2 Best affine and linear subspaces

PCA has another important property: it gives an affine subspace A ⊆ Rd that minimizes the
expected squared distance between X and A.

Recall that a k-dimensional affine subspace A is specified by a k-dimensional (linear) subspace
W ⊆ Rd—say, with orthonormal basis u1,u2, . . . ,uk—and a displacement vector u0 ∈ Rd:

A = {u0 + α1u1 + α2u2 + · · ·+ αkuk : α1, α2, . . . , αk ∈ R} .

Let U := [u1|u2| · · · |uk]. Then, for any x ∈ Rd, the point in A closest to x is given by u0 +
UU⊤(x − u0), and hence the squared distance from x to A is ‖(I − UU⊤)(x − u0)‖₂².

Theorem 5.3. Let v1,v2, . . . ,vk be top-k eigenvectors of cov(X), let V k := [v1|v2| · · · |vk], and
v0 := E(X). Then

\[ \mathbb{E}\, \| (I - V_k V_k^\top)(X - v_0) \|_2^2
   = \min_{\substack{U \in \mathbb{R}^{d \times k},\, u_0 \in \mathbb{R}^d : \\ U^\top U = I}}
     \mathbb{E}\, \| (I - U U^\top)(X - u_0) \|_2^2 . \]


Proof. For any d×d matrix M , the function u0 ↦ E ‖M(X − u0)‖₂² is minimized when
Mu0 = M E(X) (Fact 5.2, applied to Y := MX and b := Mu0). Therefore, we can plug in E(X) for u0 in the minimization problem,
whereupon it reduces to

\[ \min_{U \in \mathbb{R}^{d \times k}:\, U^\top U = I} \mathbb{E}\, \| (I - U U^\top)(X - \mathbb{E}(X)) \|_2^2 . \]

The objective function is equivalent to

\[
\begin{aligned}
\mathbb{E}\, \| (I - U U^\top)(X - \mathbb{E}(X)) \|_2^2
&= \mathbb{E}\, \| X - \mathbb{E}(X) \|_2^2 - \mathbb{E}\, \| U U^\top (X - \mathbb{E}(X)) \|_2^2 \\
&= \mathbb{E}\, \| X - \mathbb{E}(X) \|_2^2 - \operatorname{tr}(\operatorname{cov}(U^\top X)) ,
\end{aligned}
\]

where the second equality comes from Fact 5.1. Therefore, minimizing the objective is equivalent
to maximizing tr(cov(U⊤X)), which is achieved by PCA (Corollary 5.2).

The proof of Theorem 5.3 depends on the following simple but useful fact.

Fact 5.2 (Bias-variance decomposition). Let Y be a random vector in Rd, and b ∈ Rd be any fixed
vector. Then

\[ \mathbb{E}\, \| Y - b \|_2^2 = \mathbb{E}\, \| Y - \mathbb{E}(Y) \|_2^2 + \| \mathbb{E}(Y) - b \|_2^2 \]

(which, as a function of b, is minimized when b = E(Y )).
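
Fact 5.2 is easy to check numerically; in the sketch below (not from the notes) the expectations are taken with respect to the empirical distribution over a finite sample, so the identity holds exactly.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 200, 4
Y = rng.normal(size=(n, d)) + 1.5       # sample points; expectations = empirical averages
b = rng.normal(size=d)                  # an arbitrary fixed vector

mean_Y = Y.mean(axis=0)
lhs = np.mean(np.sum((Y - b) ** 2, axis=1))                                   # E||Y - b||^2
rhs = np.mean(np.sum((Y - mean_Y) ** 2, axis=1)) + np.sum((mean_Y - b) ** 2)  # variance + bias^2
assert np.isclose(lhs, rhs)
```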

A similar statement can be made about (linear) subspaces by using top-k eigenvectors of
E(XX⊤) instead of cov(X). This is sometimes called uncentered PCA.

Theorem 5.4. Let v1,v2, . . . ,vk be top-k eigenvectors of E(XX⊤), and let V k := [v1|v2| · · · |vk].
Then

\[ \mathbb{E}\, \| (I - V_k V_k^\top) X \|_2^2
   = \min_{U \in \mathbb{R}^{d \times k}:\, U^\top U = I} \mathbb{E}\, \| (I - U U^\top) X \|_2^2 . \]

5.3 Noisy affine subspace recovery

Suppose there are n points t1, t2, . . . , tn ∈ Rd that lie on an affine subspace A⋆ of dimension k.
In this scenario, you don’t directly observe the ti; rather, you only observe noisy versions of these
points: Y 1,Y 2, . . . ,Y n, where for some σ1, σ2, . . . , σn > 0,

\[ Y_j \sim \mathrm{N}(t_j, \sigma_j^2 I) \quad \text{for all } j \in [n] , \]

and Y 1,Y 2, . . . ,Y n are independent. The observations Y 1,Y 2, . . . ,Y n no longer all lie in the
affine subspace A⋆, but by applying PCA to the empirical covariance of Y 1,Y 2, . . . ,Y n, you can
hope to approximately recover A⋆.

Regard X as a random vector whose conditional distribution given the noisy points is uniform
over Y 1,Y 2, . . . ,Y n. In fact, the distribution of X is given by the following generative process:

1. Draw J ∈ [n] uniformly at random.

2. Given J , draw Z ∼ N(0, σ_J² I).

3. Set X := tJ +Z.


Note that the empirical covariance based on Y 1,Y 2, . . . ,Y n is not exactly cov(X), but it can be
a good approximation when n is large (with high probability). Similarly, the empirical average
of Y 1,Y 2, . . . ,Y n is a good approximation to E(X) when n is large (with high probability). So
here, we assume for simplicity that both cov(X) and E(X) are known exactly. We show that PCA
produces a k-dimensional affine subspace that contains all of the tj .

Theorem 5.5. Let X be the random vector as defined above, v1,v2, . . . ,vk be top-k eigenvectors
of cov(X), and v0 := E(X). Then the affine subspace

\[ \hat{A} := \{ v_0 + \alpha_1 v_1 + \alpha_2 v_2 + \cdots + \alpha_k v_k : \alpha_1, \alpha_2, \ldots, \alpha_k \in \mathbb{R} \} \]

contains t1, t2, . . . , tn.

Proof. Theorem 5.3 says that the matrix V k := [v1|v2| · · · |vk] minimizes E ‖(I − UU⊤)(X − v0)‖₂²
(as a function of U ∈ Rd×k, subject to U⊤U = I), or equivalently, maximizes tr(cov(U⊤X)). This
maximization objective can be written as

\[
\begin{aligned}
\operatorname{tr}(\operatorname{cov}(U^\top X))
&= \mathbb{E}\, \| U U^\top (X - v_0) \|_2^2 && \text{(by Fact 5.1)} \\
&= \frac{1}{n} \sum_{j=1}^{n} \mathbb{E}\!\left[ \| U U^\top (t_j - v_0 + Z) \|_2^2 \,\middle|\, J = j \right] \\
&= \frac{1}{n} \sum_{j=1}^{n} \mathbb{E}\!\left[ \| U U^\top (t_j - v_0) \|_2^2
   + 2 \langle U U^\top (t_j - v_0),\, U U^\top Z \rangle + \| U U^\top Z \|_2^2 \,\middle|\, J = j \right] \\
&= \frac{1}{n} \sum_{j=1}^{n} \left\{ \| U U^\top (t_j - v_0) \|_2^2
   + \mathbb{E}\!\left[ \| U U^\top Z \|_2^2 \,\middle|\, J = j \right] \right\} \\
&= \frac{1}{n} \sum_{j=1}^{n} \left\{ \| U U^\top (t_j - v_0) \|_2^2 + k \sigma_j^2 \right\} ,
\end{aligned}
\]

where the penultimate step uses the fact that the conditional distribution of Z given J = j is
N(0, σj²I), and the final step uses the fact that ‖UU⊤Z‖₂² has the same conditional distribution
(given J = j) as the squared length of a N(0, σj²I) random vector in Rk. Since UU⊤(tj − v0) is
the orthogonal projection of tj − v0 onto the subspace spanned by the columns of U (call it W ),

\[ \| U U^\top (t_j - v_0) \|_2^2 \le \| t_j - v_0 \|_2^2 \quad \text{for all } j \in [n] . \]

The inequalities above are equalities precisely when tj − v0 ∈ W for all j ∈ [n]. This is indeed
the case for the subspace A⋆ − {v0}. Since V k maximizes the objective, its columns must span
a k-dimensional subspace Ŵ that also contains all of the tj − v0; hence the affine subspace Â =
{v0 + x : x ∈ Ŵ} contains all of the tj.
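
The following sketch simulates the setting of this section with finitely many samples (synthetic data, chosen only for illustration; not part of the notes). Since it uses the empirical mean and covariance rather than the exact ones, the recovered affine subspace only approximately contains the tj, in line with the discussion preceding Theorem 5.5.

```python
import numpy as np

rng = np.random.default_rng(4)
d, k, n = 10, 2, 5000

# A true k-dimensional affine subspace A* = {t0 + W alpha : alpha in R^k}.
W, _ = np.linalg.qr(rng.normal(size=(d, k)))     # orthonormal basis of the direction space
t0 = rng.normal(size=d)
T = t0 + rng.normal(size=(n, k)) @ W.T           # points t_1, ..., t_n lying on A*

# Noisy observations Y_j ~ N(t_j, sigma_j^2 I), independent across j.
sigmas = rng.uniform(0.05, 0.2, size=n)
Y = T + sigmas[:, None] * rng.normal(size=(n, d))

# Rank-k PCA of the noisy points: empirical mean and top-k eigenvectors of the empirical covariance.
v0 = Y.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Y, rowvar=False, bias=True))
V_k = eigvecs[:, -k:]

# Distance from each true point t_j to the recovered affine subspace; should be small.
residual = (T - v0) - (T - v0) @ V_k @ V_k.T
print("max distance from t_j to recovered subspace:", np.linalg.norm(residual, axis=1).max())
```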

5.4 Singular value decomposition

Let A be any n×d matrix. Our aim is to define an extremely useful decomposition of A called
the singular value decomposition (SVD). Our derivation starts by considering two related matrices,
A⊤A and AA⊤; their eigendecompositions will lead to the SVD of A.

Fact 5.3. A⊤A and AA⊤ are symmetric and positive semidefinite.


It is clear that the eigenvalues of A⊤A and AA⊤ are non-negative. In fact, the non-zero
eigenvalues of A⊤A and AA⊤ are exactly the same.

Lemma 5.1. Let λ be an eigenvalue of A⊤A with corresponding eigenvector v.

• If λ > 0, then λ is a non-zero eigenvalue of AA⊤ with corresponding eigenvector Av.

• If λ = 0, then Av = 0.

Proof. First suppose λ > 0. Then

\[ A A^\top (A v) = A (A^\top A v) = A (\lambda v) = \lambda (A v) , \]

and Av ≠ 0 since ‖Av‖₂² = v⊤A⊤Av = λ‖v‖₂² > 0; so λ is an eigenvalue of AA⊤ with corresponding eigenvector Av.
Now suppose λ = 0 (which is the only remaining case, as per Fact 5.3). Then

\[ \| A v \|_2^2 = v^\top A^\top A v = v^\top (\lambda v) = 0 . \]

Since only the zero vector has length 0, it must be that Av = 0.

(We can apply Lemma 5.1 to both A and A⊤ to conclude that A⊤A and AA⊤ have the same
non-zero eigenvalues.)

Theorem 5.6 (Singular value decomposition). Let A be an n×d matrix. Let v1,v2, . . . ,vd ∈ Rd
be orthonormal eigenvectors of A⊤A corresponding to eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λd ≥ 0. Let r
be the number of positive λi. Define

\[ u_i := \frac{A v_i}{\| A v_i \|_2}
       = \frac{A v_i}{\sqrt{v_i^\top A^\top A v_i}}
       = \frac{A v_i}{\sqrt{\lambda_i}} \quad \text{for each } i \in [r] . \]

Then

\[ A = \underbrace{\big[\, u_1 \,\big|\, u_2 \,\big|\, \cdots \,\big|\, u_r \,\big]}_{n \times r}\;
       \underbrace{\begin{bmatrix} \sqrt{\lambda_1} & & & \\ & \sqrt{\lambda_2} & & \\ & & \ddots & \\ & & & \sqrt{\lambda_r} \end{bmatrix}}_{r \times r}\;
       \underbrace{\begin{bmatrix} v_1^\top \\ v_2^\top \\ \vdots \\ v_r^\top \end{bmatrix}}_{r \times d}
   \;=\; \sum_{i=1}^{r} \sqrt{\lambda_i}\, u_i v_i^\top , \]

and u1,u2, . . . ,ur are orthonormal.

Proof. It suffices to show that for some set of d linearly independent vectors q1, q2, . . . , qd ∈ Rd,

\[ A q_j = \left( \sum_{i=1}^{r} \sqrt{\lambda_i}\, u_i v_i^\top \right) q_j \quad \text{for all } j \in [d] . \]

We’ll use v1,v2, . . . ,vd. Observe that

\[ A v_j = \begin{cases} \sqrt{\lambda_j}\, u_j & \text{if } 1 \le j \le r , \\ 0 & \text{if } r < j \le d , \end{cases} \]

by definition of ui and by Lemma 5.1. Moreover,

\[ \left( \sum_{i=1}^{r} \sqrt{\lambda_i}\, u_i v_i^\top \right) v_j
   = \sum_{i=1}^{r} \sqrt{\lambda_i}\, \langle v_j, v_i \rangle\, u_i
   = \begin{cases} \sqrt{\lambda_j}\, u_j & \text{if } 1 \le j \le r , \\ 0 & \text{if } r < j \le d , \end{cases} \]

since v1,v2, . . . ,vd are orthonormal. We conclude that Avj = (∑_{i=1}^r √λi ui vi⊤)vj for all j ∈ [d],
and hence A = ∑_{i=1}^r √λi ui vi⊤.

Note that

\[ u_i^\top u_j = \frac{v_i^\top A^\top A v_j}{\sqrt{\lambda_i \lambda_j}}
              = \frac{\lambda_j\, v_i^\top v_j}{\sqrt{\lambda_i \lambda_j}}
              = 0 \quad \text{for all } 1 \le i < j \le r , \]

where the last step follows since v1,v2, . . . ,vd are orthonormal. Since each ui also has unit length
by construction, this implies that u1,u2, . . . ,ur are orthonormal.

The decomposition of A into the sum A = ∑_{i=1}^r √λi ui vi⊤ from Theorem 5.6 is called the
singular value decomposition (SVD) of A. The u1,u2, . . . ,ur are the left singular vectors, and the
v1,v2, . . . ,vr are the right singular vectors. The scalars √λ1 ≥ √λ2 ≥ · · · ≥ √λr are the (positive)
singular values corresponding to the left/right singular vector pairs (u1,v1), (u2,v2), . . . , (ur,vr). The
representation A = ∑_{i=1}^r √λi ui vi⊤ is typically called the thin SVD of A. The number r
of positive λi is the rank of A, which is at most the smaller of n and d.
Of course, one can extend u1,u2, . . . ,ur to an orthonormal basis for Rn. Define the matrices
U := [u1|u2| · · · |un] ∈ Rn×n and V := [v1|v2| · · · |vd] ∈ Rd×d. Also define S ∈ Rn×d to be the matrix
whose only non-zero entries are √λi in the (i, i)-th position, for 1 ≤ i ≤ r. Then A = USV ⊤.
This matrix factorization of A is typically called the full SVD of A. (The vectors ur+1,ur+2, . . . ,un
and vr+1,vr+2, . . . ,vd are also regarded as singular vectors of A; they correspond to the singular
value equal to zero.)
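
In NumPy, both the thin and the full SVD are available through numpy.linalg.svd; the sketch below (a check on a random matrix, not part of the notes) confirms the factorizations of Theorem 5.6 and the relation between the singular values of A and the eigenvalues of A⊤A.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 8, 5
A = rng.normal(size=(n, d))

# Thin SVD: U is n x r, s holds the positive singular values, Vt is r x d (here r = d).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
assert np.allclose(A, U @ np.diag(s) @ Vt)
assert np.allclose(A, sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(len(s))))

# Full SVD: U is n x n, S is n x d (zero except on the diagonal), V is d x d.
U_full, s_full, Vt_full = np.linalg.svd(A, full_matrices=True)
S = np.zeros((n, d))
S[:len(s_full), :len(s_full)] = np.diag(s_full)
assert np.allclose(A, U_full @ S @ Vt_full)

# Eigenvalues of A^T A are the squared singular values of A.
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]      # descending order
assert np.allclose(eigvals, s ** 2)
```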

Just as before, we’ll refer to the k largest singular values of A as the top-k singular values of
A, and the k smallest singular values as the bottom-k singular values of A. We analogously use the
term top-k (resp., bottom-k) singular vectors to refer to orthonormal singular vectors corresponding
to the top-k (resp., bottom-k) singular values. Again, the choice of top-k (or bottom-k) singular
vectors is not necessarily unique.

Relationship between PCA and SVD

As seen above, the eigenvectors of A⊤A are the right singular vectors of A, and the eigenvectors of
AA⊤ are the left singular vectors of A.

Suppose there are n data points a1,a2, . . . ,an ∈ Rd, arranged as the rows of the matrix A ∈
Rn×d. Now regard X as a random vector with the uniform distribution on the n data points. Then
\[ \mathbb{E}(X X^\top) = \frac{1}{n} \sum_{i=1}^{n} a_i a_i^\top = \frac{1}{n} A^\top A , \]

so top-k eigenvectors of (1/n)A⊤A are top-k right singular vectors of A. Hence, rank-k uncentered
PCA (as in Theorem 5.4) corresponds to the subspace spanned by the top-k right singular vectors
of A.
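
A quick numerical check of this correspondence (a sketch on random data, not from the notes): the top-k eigenvectors of (1/n)A⊤A and the top-k right singular vectors of A may differ by signs, but they span the same subspace, so the corresponding orthogonal projectors agree.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, k = 100, 6, 2
A = rng.normal(size=(n, d))

# Top-k eigenvectors of (1/n) A^T A ...
_, eigvecs = np.linalg.eigh(A.T @ A / n)     # ascending eigenvalue order
V_eig = eigvecs[:, -k:]

# ... versus top-k right singular vectors of A.
_, _, Vt = np.linalg.svd(A, full_matrices=False)
V_svd = Vt[:k].T

# Same k-dimensional subspace: the orthogonal projectors coincide.
assert np.allclose(V_eig @ V_eig.T, V_svd @ V_svd.T)
```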

Variational characterization of singular values

Given the relationship between the singular values of A and the eigenvalues of A⊤A and AA⊤,
it is easy to obtain variational characterizations of the singular values. We can also obtain the
characterization directly.


Fact 5.4. Let the SVD of a matrix A ∈ Rn×d be given by A = ∑_{i=1}^r σi ui vi⊤, where σ1 ≥ σ2 ≥
· · · ≥ σr > 0. For each i ∈ [r],

\[ \sigma_i = \max_{x \in S^{d-1} :\, \langle v_j, x \rangle = 0\ \forall j < i} \| A x \|_2 = u_i^\top A v_i . \]

Relationship between eigendecomposition and SVD

If M ∈ Rd×d is symmetric and has eigendecomposition M = ∑_{i=1}^d λi vi vi⊤, then its singular values
are the absolute values of the λi. We can take v1,v2, . . . ,vd as corresponding right singular vectors.
For corresponding left singular vectors, we can take ui := vi whenever λi ≥ 0 (which is the case for
all i if M is also psd), and ui := −vi whenever λi < 0. Therefore, we have the following variational
characterization of the singular values of M .

Fact 5.5. Let the eigendecomposition of a symmetric matrix M ∈ Rd×d be given by M =
∑_{i=1}^d λi vi vi⊤, where |λ1| ≥ |λ2| ≥ · · · ≥ |λd|. For each i ∈ [d],

\[ |\lambda_i| = \max_{x \in S^{d-1} :\, \langle v_j, x \rangle = 0\ \forall j < i} \| M x \|_2 = |v_i^\top M v_i| . \]

Moore-Penrose pseudoinverse

The SVD defines a kind of matrix inverse that is applicable to non-square matrices A ∈ Rn×d
(where possibly n ≠ d). Let the thin SVD be given by A = USV ⊤, where U ∈ Rn×r and V ∈ Rd×r
satisfy U⊤U = V ⊤V = I, and S ∈ Rr×r is diagonal with positive diagonal entries. Here, the rank
of A is r. The Moore-Penrose pseudoinverse of A is given by

\[ A^{\dagger} := V S^{-1} U^\top \in \mathbb{R}^{d \times n} . \]

Note that A† is well-defined: S is invertible because its diagonal entries are all strictly positive.
What is the effect of multiplying A by A† on the left? Using the SVD of A,

\[ A^{\dagger} A = V S^{-1} U^\top U S V^\top = V V^\top \in \mathbb{R}^{d \times d} , \]

which is the orthogonal projection to the row space of A. In particular, this means that

\[ A A^{\dagger} A = A . \]

Similarly, AA† = UU⊤ ∈ Rn×n, the orthogonal projection to the column space of A. Note that if
r = d, then A†A = I, as the row space of A is simply Rd; similarly, if r = n, then AA† = I.

The Moore-Penrose pseudoinverse is also related to least squares. For any y ∈ Rn, the
vector AA†y is the orthogonal projection of y onto the column space of A. This means that
‖Ax − y‖₂² is minimized over x ∈ Rd by x = A†y. The more familiar expression for the least squares
solution, x = (A⊤A)⁻¹A⊤y, only applies in the special case where A⊤A is invertible. The connection
to the general form of the solution can be seen by using the easily verified identity

\[ A^{\dagger} = (A^\top A)^{\dagger} A^\top \]

and the fact that (A⊤A)† = (A⊤A)⁻¹ when A⊤A is invertible.
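
NumPy exposes the Moore-Penrose pseudoinverse as numpy.linalg.pinv. The sketch below (random data, not from the notes; here A has full column rank, so r = d) checks the projection identities and the least-squares connection.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 10, 4
A = rng.normal(size=(n, d))              # full column rank with probability 1, so r = d
A_pinv = np.linalg.pinv(A)               # Moore-Penrose pseudoinverse, d x n

# A^+ A projects onto the row space of A (= R^d here, so it is the identity), and A A^+ A = A.
assert np.allclose(A_pinv @ A, np.eye(d))
assert np.allclose(A @ A_pinv @ A, A)

# Least squares: x = A^+ y minimizes ||Ax - y||_2 over x.
y = rng.normal(size=n)
x_pinv = A_pinv @ y
x_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)
x_normal = np.linalg.solve(A.T @ A, A.T @ y)    # valid here because A^T A is invertible
assert np.allclose(x_pinv, x_lstsq)
assert np.allclose(x_pinv, x_normal)
```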


5.5 Matrix norms and low rank SVD

Matrix inner product and the Frobenius norm

The space of n×d real matrices is a real vector space in its own right, and it can, in fact, be viewed
as a Euclidean space with inner product given by 〈X,Y 〉 := tr(X⊤Y ). It can be checked that this
indeed is a valid inner product. For instance, the fact that the trace function is linear can be used
to establish linearity in the first argument:

\[ \langle cX + Y, Z \rangle = \operatorname{tr}((cX + Y)^\top Z)
   = \operatorname{tr}(c X^\top Z + Y^\top Z)
   = c \operatorname{tr}(X^\top Z) + \operatorname{tr}(Y^\top Z)
   = c \langle X, Z \rangle + \langle Y, Z \rangle . \]

The inner product naturally induces an associated norm X ↦ √〈X,X〉. Viewing X ∈ Rn×d
as a data matrix whose rows are the vectors x1,x2, . . . ,xn ∈ Rd, we see that

\[ \langle X, X \rangle = \operatorname{tr}(X^\top X)
   = \operatorname{tr}\!\left( \sum_{i=1}^{n} x_i x_i^\top \right)
   = \sum_{i=1}^{n} \operatorname{tr}(x_i x_i^\top)
   = \sum_{i=1}^{n} \operatorname{tr}(x_i^\top x_i)
   = \sum_{i=1}^{n} \| x_i \|_2^2 . \]

Above, we make use of the fact that for any matrices A,B ∈ Rn×d,

\[ \operatorname{tr}(A^\top B) = \operatorname{tr}(B A^\top) , \]

which is called the cyclic property of the matrix trace. Therefore, the square of the induced norm
is simply the sum-of-squares of the entries in the matrix. We call this norm the Frobenius norm of
the matrix X, and denote it by ‖X‖F. It can be checked that this matrix inner product and norm
are exactly the same as the Euclidean inner product and norm when you view the n×d matrices as
nd-dimensional vectors obtained by stacking columns on top of each other (or rows side-by-side).

Suppose a matrix X has thin SVD X = USV ⊤, where S = diag(σ1, σ2, . . . , σr) and U⊤U =
V ⊤V = I. Then its squared Frobenius norm is

\[ \| X \|_F^2 = \operatorname{tr}(V S U^\top U S V^\top)
   = \operatorname{tr}(V S^2 V^\top)
   = \operatorname{tr}(S^2 V^\top V)
   = \operatorname{tr}(S^2)
   = \sum_{i=1}^{r} \sigma_i^2 , \]

the sum-of-squares of X’s singular values.
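
A short numerical confirmation of the two descriptions of the Frobenius norm (a sketch, not from the notes): the sum of squared entries, tr(X⊤X), and the sum of squared singular values all agree.

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(7, 4))

fro_sq = np.linalg.norm(X, 'fro') ** 2
assert np.isclose(fro_sq, np.sum(X ** 2))                                   # sum of squared entries
assert np.isclose(fro_sq, np.trace(X.T @ X))                                # tr(X^T X)
assert np.isclose(fro_sq, np.sum(np.linalg.svd(X, compute_uv=False) ** 2))  # sum of sigma_i^2
```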

Best rank-k approximation in Frobenius norm

Let the SVD of a matrix A ∈ Rn×d be given by A = ∑_{i=1}^r σi ui vi⊤. Here, we assume σ1 ≥ σ2 ≥
· · · ≥ σr > 0. For any k ≤ r, a rank-k SVD of A is obtained by just keeping the first k components
(corresponding to the k largest singular values), and this yields a matrix Ak ∈ Rn×d with rank k:

\[ A_k := \sum_{i=1}^{k} \sigma_i u_i v_i^\top . \tag{5.1} \]

This matrix Ak is the best rank-k approximation to A in the sense that it minimizes the Frobenius
norm error over all matrices of rank (at most) k. This is remarkable because the set of matrices
of rank at most k is not a set over which it is typically easy to optimize. (For instance, it is not a
convex set.)


Theorem 5.7. Let A ∈ Rn×d be any matrix, with SVD as given in Theorem 5.6, and Ak as defined
in (5.1). Then:

1. The rows of Ak are the orthogonal projections of the corresponding rows of A to the k-
dimensional subspace spanned by top-k right singular vectors v1,v2, . . . ,vk of A.

2. ‖A−Ak‖F ≤ min{‖A−B‖F : B ∈ Rn×d , rank(B) ≤ k}.

3. If a1,a2, . . . ,an ∈ Rd are the rows of A, and â1, â2, . . . , ân ∈ Rd are the rows of Ak, then
\[ \sum_{i=1}^{n} \| a_i - \hat{a}_i \|_2^2 \le \sum_{i=1}^{n} \| a_i - b_i \|_2^2 \]

for any vectors b1, b2, . . . , bn ∈ Rd that span a subspace of dimension at most k.

Proof. The orthogonal projection to the subspace Wk spanned by v1,v2, . . . ,vk is given by x ↦
V kV k⊤x, where V k := [v1|v2| · · · |vk]. Since V kV k⊤vi = vi for i ∈ [k] and V kV k⊤vi = 0 for i > k,

\[ A V_k V_k^\top = \sum_{i=1}^{r} \sigma_i u_i v_i^\top V_k V_k^\top
   = \sum_{i=1}^{k} \sigma_i u_i v_i^\top = A_k . \]

This equality says that the rows of Ak are the orthogonal projections of the rows of A onto Wk.
This proves the first claim.

Consider any matrix B ∈ Rn×d with rank(B) ≤ k, and let W be the subspace spanned by the
rows of B. Let ΠW denote the orthogonal projector to W . Then ‖A − AΠW ‖F ≤ ‖A − B‖F,
since each row of AΠW is the closest point in W to the corresponding row of A, while the
corresponding row of B also lies in W . This means that

\[ \min_{\substack{B \in \mathbb{R}^{n \times d} : \\ \operatorname{rank}(B) \le k}} \| A - B \|_F^2
   = \min_{\substack{\text{subspace } W \subseteq \mathbb{R}^d : \\ \dim W \le k}} \| A - A \Pi_W \|_F^2
   = \min_{\substack{\text{subspace } W \subseteq \mathbb{R}^d : \\ \dim W \le k}} \sum_{i=1}^{n} \| (I - \Pi_W) a_i \|_2^2 , \]

where ai ∈ Rd denotes the i-th row of A. In fact, it is clear that we can take the minimization over
subspaces W with dimW = k. Since the orthogonal projector to a subspace of dimension k is of
the form UU⊤ for some U ∈ Rd×k satisfying U⊤U = I, it follows that the expression above is the
same as

\[ \min_{\substack{U \in \mathbb{R}^{d \times k} : \\ U^\top U = I}} \sum_{i=1}^{n} \| (I - U U^\top) a_i \|_2^2 . \]

Observe that (1/n)∑_{i=1}^n ai ai⊤ = (1/n)A⊤A, so Theorem 5.6 implies that top-k eigenvectors of
(1/n)∑_{i=1}^n ai ai⊤ are top-k right singular vectors of A. By Theorem 5.4, the minimum above is
achieved when U = V k. This proves the second claim. The third claim is just a different
interpretation of the second claim.
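
The sketch below (random matrices, not from the notes) forms Ak from the rank-k SVD and checks Theorem 5.7 empirically: the Frobenius error of Ak is sqrt(σk+1² + · · · + σr²), and neither random rank-k matrices nor projections of the rows onto a random k-dimensional subspace do better.

```python
import numpy as np

rng = np.random.default_rng(9)
n, d, k = 20, 8, 3
A = rng.normal(size=(n, d))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]                     # rank-k SVD of A

err_k = np.linalg.norm(A - A_k, 'fro')
assert np.isclose(err_k, np.sqrt(np.sum(s[k:] ** 2)))        # Frobenius error of A_k

# Random matrices of rank at most k never do better ...
for _ in range(500):
    B = rng.normal(size=(n, k)) @ rng.normal(size=(k, d))    # rank(B) <= k
    assert err_k <= np.linalg.norm(A - B, 'fro') + 1e-9

# ... nor does projecting the rows of A onto a random k-dimensional subspace.
W, _ = np.linalg.qr(rng.normal(size=(d, k)))
assert err_k <= np.linalg.norm(A - A @ W @ W.T, 'fro') + 1e-9
```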

Best rank-k approximation in spectral norm

Another important matrix norm is the spectral norm: for a matrix X ∈ Rn×d,

\[ \| X \|_2 := \max_{u \in S^{d-1}} \| X u \|_2 . \]

By Theorem 5.6, the spectral norm of X is equal to its largest singular value.


Fact 5.6. Let the SVD of a matrix A ∈ Rn×d be as given in Theorem 5.6, with r = rank(A).

• For any x ∈ Rd,
‖Ax‖2 ≤ σ1‖x‖2 .

• For any x in the span of v1,v2, . . . ,vr,

‖Ax‖2 ≥ σr‖x‖2 .

Unlike the Frobenius norm, the spectral norm does not arise from a matrix inner product.
Nevertheless, it can be checked that it has the required properties of a norm: it satisfies ‖cX‖2 =
|c|‖X‖2 and ‖X+Y ‖2 ≤ ‖X‖2 +‖Y ‖2, and the only matrix with ‖X‖2 = 0 is X = 0. Because of
this, the spectral norm also provides a metric between matrices, dist(X,Y ) = ‖X−Y ‖2, satisfying
the properties given in Section 1.1.

The rank-k SVD of a matrix A also provides the best rank-k approximation in terms of spectral
norm error.

Theorem 5.8. Let A ∈ Rn×d be any matrix, with SVD as given in Theorem 5.6, and Ak as defined
in (5.1). Then ‖A−Ak‖2 ≤ min{‖A−B‖2 : B ∈ Rn×d , rank(B) ≤ k}.

Proof. Since the largest singular value of A − Ak = ∑_{i=k+1}^r σi ui vi⊤ is σk+1, it follows that

\[ \| A - A_k \|_2 = \sigma_{k+1} . \]

Consider any matrix B ∈ Rn×d with rank(B) ≤ k. Its null space ker(B) has dimension at least
d − rank(B) ≥ d − k. On the other hand, the span Wk+1 of v1,v2, . . . ,vk+1 has dimension
k + 1. Therefore, there must be some non-zero vector x ∈ ker(B) ∩ Wk+1. For any such vector x,

\[
\begin{aligned}
\| A - B \|_2
&\ge \frac{\| (A - B) x \|_2}{\| x \|_2} && \text{(by Fact 5.6)} \\
&= \frac{\| A x \|_2}{\| x \|_2} && \text{(since $x$ is in the null space of $B$)} \\
&= \frac{\sqrt{\| A_{k+1} x \|_2^2 + \| (A - A_{k+1}) x \|_2^2}}{\| x \|_2} \\
&\ge \frac{\| A_{k+1} x \|_2}{\| x \|_2} \\
&\ge \sigma_{k+1} && \text{(by Fact 5.6)} .
\end{aligned}
\]

Therefore ‖A − B‖2 ≥ ‖A − Ak‖2.
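
Finally, a sketch of Theorem 5.8 (random data, not from the notes): the spectral-norm error of Ak equals σk+1, and random rank-k competitors incur at least that much error.

```python
import numpy as np

rng = np.random.default_rng(10)
n, d, k = 15, 9, 4
A = rng.normal(size=(n, d))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]                 # rank-k SVD of A

assert np.isclose(np.linalg.norm(A - A_k, 2), s[k])      # ||A - A_k||_2 = sigma_{k+1}

for _ in range(500):
    B = rng.normal(size=(n, k)) @ rng.normal(size=(k, d))    # a random matrix with rank(B) <= k
    assert np.linalg.norm(A - B, 2) >= s[k] - 1e-9
```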