Dimensionality Reduction with Principal Component Analysis
Liang Zheng Australian National University liang.zheng@anu.edu.au
Idea of PCA
(Scatter plot: the five houses plotted with house area on one axis and house price on the other.)

House   House price (million)   House area (100 m²)
a       10                       10
b       2                        2
c       7                        7
d       1                        1
e       5                        5
We subtract the mean from the data points
House   House price (normalised)   House area (normalised)
a       5                          5
b       -3                         -3
c       2                          2
d       -4                         -4
e       0                          0
Idea of PCA
(Scatter plot: the centred data shown with the rotated axes, labelled first and second principal component.)

We rotate the price and area axes onto the principal-component axes.

House   House price (normalised)   House area (normalised)   First principal component   Second principal component
a       5                          5                         7.07                        0
b       -3                         -3                        -4.24                       0
c       2                          2                         2.82                        0
d       -4                         -4                        -5.66                       0
e       0                          0                         0                           0
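The numbers in these tables can be reproduced with a few lines of NumPy. The following is a minimal sketch; the rotation direction $(1, 1)/\sqrt{2}$ is read off from this perfectly correlated toy data rather than computed from an eigendecomposition.

```python
import numpy as np

# Toy house data from the table above: (price in millions, area in 100 m^2).
X = np.array([[10, 10], [2, 2], [7, 7], [1, 1], [5, 5]], dtype=float)

# Step 1: subtract the mean -> the "normalised" table (5, -3, 2, -4, 0 in both columns).
X_centered = X - X.mean(axis=0)

# Step 2: rotate onto the principal axes. For this perfectly correlated toy data the
# first principal component is (1, 1)/sqrt(2) and the second is (-1, 1)/sqrt(2).
B = np.array([[1, -1],
              [1,  1]]) / np.sqrt(2)   # columns: first and second principal component

Z = X_centered @ B   # coordinates in the rotated (principal-component) axes
print(Z)             # first column ~ [7.07, -4.24, 2.83, -5.66, 0], second column ~ 0
```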
Motivation
• High-dimensional data, such as images, is hard to analyze, interpret, and visualize, and expensive to store.
• Good news
• high-dimensional data is often overcomplete, i.e., many dimensions are redundant and can be explained by a combination of other dimensions
• Furthermore, dimensions in high-dimensional data are often correlated so that the data possesses an intrinsic lower-dimensional structure.
The data in (a) does not vary much in the $x_2$-direction, so we can express it as if it were on a line, with nearly no loss; see (b).
To describe the data in (b), only the $x_1$-coordinate is required; the data lies in a one-dimensional subspace of $\mathbb{R}^2$.
10.1 Problem Setting
• In PCA, we are interested in finding projections $\tilde{\mathbf{x}}_n$ of data points $\mathbf{x}_n$ that are as similar to the original data points as possible, but which have a significantly lower intrinsic dimensionality.
• We consider an i.i.d. dataset $\mathcal{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, $\mathbf{x}_n \in \mathbb{R}^D$, with mean $\mathbf{0}$, which possesses the data covariance matrix
$$\mathbf{S} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n \mathbf{x}_n^\top.$$
• We assume there exists a low-dimensional compressed representation (code)
$$\mathbf{z}_n = \mathbf{B}^\top \mathbf{x}_n \in \mathbb{R}^M$$
of $\mathbf{x}_n$, where we define the projection matrix
$$\mathbf{B} := [\mathbf{b}_1, \ldots, \mathbf{b}_M] \in \mathbb{R}^{D \times M}.$$
• Example (Coordinate Representation/Code)
• Consider $\mathbb{R}^2$ with the canonical basis $\mathbf{e}_1 = [1, 0]^\top$, $\mathbf{e}_2 = [0, 1]^\top$.
• Any $\mathbf{x} \in \mathbb{R}^2$ can be represented as a linear combination of these basis vectors, e.g.,
$$\begin{bmatrix} 5 \\ 3 \end{bmatrix} = 5\mathbf{e}_1 + 3\mathbf{e}_2.$$
• However, vectors of the form
$$\tilde{\mathbf{x}} = \begin{bmatrix} 0 \\ z \end{bmatrix} \in \mathbb{R}^2, \quad z \in \mathbb{R},$$
can always be written as $0\mathbf{e}_1 + z\mathbf{e}_2$.
• To represent these vectors, it is sufficient to store the coordinate/code $z$ of $\tilde{\mathbf{x}}$ with respect to the $\mathbf{e}_2$ vector.
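As a small illustration of the code idea, the following NumPy sketch stores only the single coordinate $z$ for vectors in the subspace spanned by $\mathbf{e}_2$; the value $z = 4$ is an arbitrary example.

```python
import numpy as np

# Canonical basis of R^2.
e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])

x = 5 * e1 + 3 * e2           # a generic vector [5, 3]^T needs two coordinates

# Vectors of the form x_tilde = [0, z]^T lie in the subspace spanned by e2,
# so a single code z = e2^T x_tilde suffices to represent them.
B = e2.reshape(2, 1)          # projection matrix B = [e2] in R^{2x1}
x_tilde = np.array([0.0, 4.0])
z = B.T @ x_tilde             # code: z = B^T x_tilde = [4.]
x_rec = B @ z                 # reconstruction: 0*e1 + z*e2 = [0, 4]
print(z, x_rec)
```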
10.2 PCA from Maximum Variance Perspective
• We ignore the $x_2$-coordinate of the data because it does not add much information: the compressed data in (b) is similar to the original data in (a).
• We derive PCA so as to maximize the variance in the low-dimensional representation of the data to retain as much information as possible
• Retaining most information after data compression is equivalent to capturing the largest amount of variance in the low-dimensional code (Hotelling, 1933)
10.2.1 Direction with Maximal Variance
• Data centering
• In the data covariance matrix, we assume centered data.
$$\mathbf{S} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n \mathbf{x}_n^\top$$
• Let us assume that $\boldsymbol{\mu}$ is the mean of the data. Using the properties of the variance, we obtain
$$\mathbb{V}_{\mathbf{z}}[\mathbf{z}] = \mathbb{V}_{\mathbf{x}}[\mathbf{B}^\top(\mathbf{x} - \boldsymbol{\mu})] = \mathbb{V}_{\mathbf{x}}[\mathbf{B}^\top\mathbf{x} - \mathbf{B}^\top\boldsymbol{\mu}] = \mathbb{V}_{\mathbf{x}}[\mathbf{B}^\top\mathbf{x}]$$
• That is, the variance of the low-dimensional code does not depend on the mean of the data.
• With this assumption, the mean of the low-dimensional code is also $\mathbf{0}$, since $\mathbb{E}_{\mathbf{z}}[\mathbf{z}] = \mathbb{E}_{\mathbf{x}}[\mathbf{B}^\top\mathbf{x}] = \mathbf{B}^\top\mathbb{E}_{\mathbf{x}}[\mathbf{x}] = \mathbf{0}$.
• To maximize the variance of the low-dimensional code, we first seek a single vector $\mathbf{b}_1 \in \mathbb{R}^D$ that maximizes the variance of the projected data, i.e., we aim to maximize the variance of the first coordinate $z_1$ of $\mathbf{z} \in \mathbb{R}^M$ so that
$$V_1 := \mathbb{V}[z_1] = \frac{1}{N} \sum_{n=1}^{N} z_{1n}^2$$
is maximized, where $z_{1n}$ is the first coordinate of the low-dimensional representation $\mathbf{z}_n \in \mathbb{R}^M$ of $\mathbf{x}_n \in \mathbb{R}^D$. It is given by
$$z_{1n} = \mathbf{b}_1^\top \mathbf{x}_n,$$
i.e., it is the coordinate of the orthogonal projection of $\mathbf{x}_n$ onto the one-dimensional subspace spanned by $\mathbf{b}_1$. Substituting $z_{1n}$ into $V_1$, we obtain
$$V_1 = \frac{1}{N} \sum_{n=1}^{N} (\mathbf{b}_1^\top \mathbf{x}_n)^2 = \frac{1}{N} \sum_{n=1}^{N} \mathbf{b}_1^\top \mathbf{x}_n \mathbf{x}_n^\top \mathbf{b}_1 = \mathbf{b}_1^\top \left( \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n \mathbf{x}_n^\top \right) \mathbf{b}_1 = \mathbf{b}_1^\top \mathbf{S} \mathbf{b}_1,$$
where $\mathbf{S}$ is the data covariance matrix.
• We further restrict all solutions to $\|\mathbf{b}_1\|^2 = 1$.
• We have the following constrained optimization problem:
$$\max_{\mathbf{b}_1} \; \mathbf{b}_1^\top \mathbf{S} \mathbf{b}_1 \quad \text{subject to} \quad \|\mathbf{b}_1\|^2 = 1$$
• We obtain the Lagrangian (not required in this course)
$$\mathfrak{L}(\mathbf{b}_1, \lambda_1) = \mathbf{b}_1^\top \mathbf{S} \mathbf{b}_1 + \lambda_1 (1 - \mathbf{b}_1^\top \mathbf{b}_1)$$
• The partial derivatives of $\mathfrak{L}$ with respect to $\mathbf{b}_1$ and $\lambda_1$ are
$$\frac{\partial \mathfrak{L}}{\partial \mathbf{b}_1} = 2\mathbf{b}_1^\top \mathbf{S} - 2\lambda_1 \mathbf{b}_1^\top, \qquad \frac{\partial \mathfrak{L}}{\partial \lambda_1} = 1 - \mathbf{b}_1^\top \mathbf{b}_1$$
• Setting these partial derivatives to $\mathbf{0}$ gives us the relations
$$\mathbf{S}\mathbf{b}_1 = \lambda_1 \mathbf{b}_1, \qquad \mathbf{b}_1^\top \mathbf{b}_1 = 1$$
• We see that $\mathbf{b}_1$ is an eigenvector of $\mathbf{S}$, and $\lambda_1$ is the corresponding eigenvalue. We can rewrite our objective as
$$V_1 = \mathbf{b}_1^\top \mathbf{S} \mathbf{b}_1 = \lambda_1 \mathbf{b}_1^\top \mathbf{b}_1 = \lambda_1,$$
i.e., the variance of the data projected onto a one-dimensional subspace equals the eigenvalue that is associated with the basis vector $\mathbf{b}_1$ that spans this subspace.
• To maximize the variance of the low-dimensional code, we choose the basis vector associated with the largest eigenvalue of the data covariance matrix. This eigenvector is called the first principal component.
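A minimal NumPy sketch of this result: the first principal component is the unit eigenvector of the data covariance matrix with the largest eigenvalue, and that eigenvalue is the captured variance. The helper name first_principal_component is illustrative only.

```python
import numpy as np

def first_principal_component(X):
    """Return (b1, lambda1) for data X whose rows are D-dimensional points.

    b1 is the unit eigenvector of the data covariance matrix S with the largest
    eigenvalue; lambda1 = b1^T S b1 is the variance captured along b1.
    """
    Xc = X - X.mean(axis=0)               # centre the data
    S = Xc.T @ Xc / X.shape[0]            # S = (1/N) sum_n x_n x_n^T
    eigvals, eigvecs = np.linalg.eigh(S)  # eigh: S is symmetric, eigenvalues ascending
    b1 = eigvecs[:, -1]                   # eigenvector of the largest eigenvalue
    return b1, eigvals[-1]

# Example on the toy house data used earlier.
X = np.array([[10, 10], [2, 2], [7, 7], [1, 1], [5, 5]], dtype=float)
b1, lam1 = first_principal_component(X)
print(b1, lam1)   # b1 ~ (1, 1)/sqrt(2) up to sign, lam1 = variance along b1
```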
10.2.2 𝑀-dimensional Subspace with Maximal Variance
• Assume we have found the 𝑚 − 1 eigenvectors of 𝑺 that are associated with
the largest 𝑚 − 1 eigenvalues.
• We want to find the 𝑚th principal component.
• We subtract the effect of the first $m-1$ principal components $\mathbf{b}_1, \ldots, \mathbf{b}_{m-1}$ from the data and look for principal components that compress the remaining information. We then arrive at the new data matrix
$$\hat{\mathbf{X}} := \mathbf{X} - \sum_{i=1}^{m-1} \mathbf{b}_i \mathbf{b}_i^\top \mathbf{X} = \mathbf{X} - \mathbf{B}_{m-1} \mathbf{X},$$
where $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_N] \in \mathbb{R}^{D \times N}$ contains the data points as column vectors and $\mathbf{B}_{m-1} := \sum_{i=1}^{m-1} \mathbf{b}_i \mathbf{b}_i^\top$ is the projection matrix that projects onto the subspace spanned by $\mathbf{b}_1, \ldots, \mathbf{b}_{m-1}$.
• To find the $m$th principal component, we maximize the variance
$$V_m = \mathbb{V}[z_m] = \frac{1}{N} \sum_{n=1}^{N} z_{mn}^2 = \frac{1}{N} \sum_{n=1}^{N} (\mathbf{b}_m^\top \hat{\mathbf{x}}_n)^2 = \mathbf{b}_m^\top \hat{\mathbf{S}} \mathbf{b}_m,$$
subject to $\|\mathbf{b}_m\|^2 = 1$, where $\hat{\mathbf{S}}$ is the data covariance matrix of the transformed dataset $\hat{\mathcal{X}} := \{\hat{\mathbf{x}}_1, \ldots, \hat{\mathbf{x}}_N\}$.
• The optimal solution $\mathbf{b}_m$ is the eigenvector of $\hat{\mathbf{S}}$ that is associated with the largest eigenvalue of $\hat{\mathbf{S}}$.
• In fact, we can derive that
$$\hat{\mathbf{S}} \mathbf{b}_m = \mathbf{S} \mathbf{b}_m = \lambda_m \mathbf{b}_m \qquad (1)$$
• $\mathbf{b}_m$ is not only an eigenvector of $\mathbf{S}$ but also of $\hat{\mathbf{S}}$.
• Specifically, $\lambda_m$ is the largest eigenvalue of $\hat{\mathbf{S}}$ and the $m$th largest eigenvalue of $\mathbf{S}$, and both have the associated eigenvector $\mathbf{b}_m$.
• Moreover, $\mathbf{b}_1, \ldots, \mathbf{b}_{m-1}$ are also eigenvectors of $\hat{\mathbf{S}}$, but they are associated with eigenvalue 0.
• Considering (1) and $\mathbf{b}_m^\top \mathbf{b}_m = 1$, the variance of the data projected onto the $m$th principal component is
$$V_m = \mathbf{b}_m^\top \mathbf{S} \mathbf{b}_m = \lambda_m \mathbf{b}_m^\top \mathbf{b}_m = \lambda_m$$
• This means that the variance of the data, when projected onto an 𝑀- dimensional subspace, equals the sum of the eigenvalues that are associated with the corresponding eigenvectors of the data covariance matrix.
MNIST dataset
• 60,000 examples of handwritten digits 0 through 9.
• Each digit is a grayscale image of size 28×28, i.e., it contains 784 pixels.
• We can interpret every image in this dataset as a vector $\mathbf{x} \in \mathbb{R}^{784}$.
Example – Eigenvalues of MNIST digit “8”
(Figure: (a) top 200 largest eigenvalues; (b) variance captured by the principal components.)
• A 784-dim vector is used to represent an image
• Taking all images of “8” in MNIST, we compute the eigenvalues of the data
covariance matrix.
• We see that only a few of them have a value that differs significantly from 0.
• Most of the variance, when projecting data onto the subspace spanned by the corresponding eigenvectors, is captured by only a few principal components
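The eigenvalue spectrum can be reproduced along the following lines. Note that fetch_openml is just one convenient (but assumed) way to obtain MNIST, and the first call downloads the dataset.

```python
import numpy as np
from sklearn.datasets import fetch_openml   # assumes scikit-learn is available

# Load MNIST (70,000 images of 28x28 = 784 pixels) and keep the digit "8".
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X = mnist.data[mnist.target == "8"].astype(float)

Xc = X - X.mean(axis=0)                    # centre the data
S = Xc.T @ Xc / Xc.shape[0]                # 784 x 784 data covariance matrix
eigvals = np.linalg.eigvalsh(S)[::-1]      # eigenvalues, largest first

print(eigvals[:10])                        # only a few eigenvalues dominate
print(np.cumsum(eigvals) / eigvals.sum())  # cumulative fraction of variance captured
```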
Overall
• To find an $M$-dimensional subspace of $\mathbb{R}^D$ that retains as much information as possible,
• We choose the columns of $\mathbf{B} = [\mathbf{b}_1, \ldots, \mathbf{b}_M] \in \mathbb{R}^{D \times M}$ as the $M$ eigenvectors of the data covariance matrix $\mathbf{S}$ that are associated with the $M$ largest eigenvalues.
• The maximum amount of variance PCA can capture with the first $M$ principal components is
$$V_M = \sum_{m=1}^{M} \lambda_m,$$
where the $\lambda_m$ are the $M$ largest eigenvalues of the data covariance matrix $\mathbf{S}$.
• The variance lost by data compression via PCA is
$$J_M = \sum_{j=M+1}^{D} \lambda_j = V_D - V_M$$
• Instead of these absolute quantities, we can define the relative variance captured as $V_M / V_D$, and the relative variance lost by compression as $1 - V_M / V_D$.
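A small sketch of these quantities, assuming the data covariance matrix S has already been computed; the helper name pca_variance_split is illustrative.

```python
import numpy as np

def pca_variance_split(S, M):
    """Given a data covariance matrix S and target dimension M, return
    (V_M, J_M, relative_captured): the variance captured by the first M
    principal components, the variance lost, and V_M / V_D."""
    eigvals = np.sort(np.linalg.eigvalsh(S))[::-1]   # lambda_1 >= ... >= lambda_D
    V_M = eigvals[:M].sum()       # sum of the M largest eigenvalues
    V_D = eigvals.sum()           # total variance = trace(S)
    J_M = V_D - V_M               # variance lost by compression
    return V_M, J_M, V_M / V_D
```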
10.3 PCA from Projection Perspective
• Previously, we derived PCA by maximizing the variance in the projected space to retain as much information as possible
$$\max_{\mathbf{b}_1} \; \mathbf{b}_1^\top \mathbf{S} \mathbf{b}_1 \quad \text{subject to} \quad \|\mathbf{b}_1\|^2 = 1$$
• Alternatively, we derive PCA as an algorithm that directly minimizes the average reconstruction error
10.3.1 Setting and Objective
(Figure: (a) a vector $\mathbf{x} \in \mathbb{R}^2$ (red cross) shall be projected onto a one-dimensional subspace $U \subseteq \mathbb{R}^2$ spanned by $\mathbf{b}$; (b) differences $\mathbf{x} - \tilde{\mathbf{x}}_i$ for 50 different candidates $\tilde{\mathbf{x}}_i$ are shown by the red lines.)
• We wish to project $\mathbf{x}$ to $\tilde{\mathbf{x}}$ in a lower-dimensional space such that $\tilde{\mathbf{x}}$ is similar to the original data point $\mathbf{x}$. That is,
• We aim to minimize the (Euclidean) distance $\|\mathbf{x} - \tilde{\mathbf{x}}\|$.
• Given an orthonormal basis $(\mathbf{b}_1, \ldots, \mathbf{b}_D)$ of $\mathbb{R}^D$, any $\mathbf{x} \in \mathbb{R}^D$ can be written as a linear combination of the basis vectors of $\mathbb{R}^D$:
$$\mathbf{x} = \sum_{d=1}^{D} \zeta_d \mathbf{b}_d = \sum_{m=1}^{M} \zeta_m \mathbf{b}_m + \sum_{j=M+1}^{D} \zeta_j \mathbf{b}_j$$
for suitable coordinates $\zeta_d \in \mathbb{R}$.
• We aim to find vectors $\tilde{\mathbf{x}} \in \mathbb{R}^D$, which live in an intrinsically lower-dimensional subspace $U \subseteq \mathbb{R}^D$ with $\dim(U) = M$, so that
$$\tilde{\mathbf{x}} = \sum_{m=1}^{M} z_m \mathbf{b}_m \in U \subseteq \mathbb{R}^D$$
is as similar to $\mathbf{x}$ as possible.
• We have a dataset $\mathcal{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, $\mathbf{x}_n \in \mathbb{R}^D$, centered at $\mathbf{0}$, i.e., $\mathbb{E}[\mathcal{X}] = \mathbf{0}$.
• We want to find the best linear projection of $\mathcal{X}$ onto a lower-dimensional subspace $U \subseteq \mathbb{R}^D$ with $\dim(U) = M$ and orthonormal basis vectors $\mathbf{b}_1, \ldots, \mathbf{b}_M$.
• We call this subspace 𝑈 the principal subspace.
• The projections of the data points are denoted by
$$\tilde{\mathbf{x}}_n := \sum_{m=1}^{M} z_{mn} \mathbf{b}_m = \mathbf{B} \mathbf{z}_n \in \mathbb{R}^D,$$
where $\mathbf{z}_n := [z_{1n}, \ldots, z_{Mn}]^\top \in \mathbb{R}^M$ is the coordinate vector of $\tilde{\mathbf{x}}_n$ with respect to the basis $(\mathbf{b}_1, \ldots, \mathbf{b}_M)$.
• We want $\tilde{\mathbf{x}}_n$ to be as similar to $\mathbf{x}_n$ as possible.
• We define our objective as minimizing the average squared Euclidean distance (reconstruction error)
$$J_M := \frac{1}{N} \sum_{n=1}^{N} \|\mathbf{x}_n - \tilde{\mathbf{x}}_n\|^2$$
• We need to find the orthonormal basis of the principal subspace and the coordinates $\mathbf{z}_n \in \mathbb{R}^M$ of the projections with respect to this basis.
10.3.2 Finding Optimal Coordinates
(Figure repeated: a vector $\mathbf{x} \in \mathbb{R}^2$ is projected onto a one-dimensional subspace $U \subseteq \mathbb{R}^2$ spanned by $\mathbf{b}$; the red lines show $\mathbf{x} - \tilde{\mathbf{x}}_i$ for 50 candidates $\tilde{\mathbf{x}}_i$.)
• We want to find the $\tilde{\mathbf{x}}$ in the subspace spanned by $\mathbf{b}$ that minimizes $\|\mathbf{x} - \tilde{\mathbf{x}}\|$.
• Intuitively, this will be the orthogonal projection.
• Given an ONB $(\mathbf{b}_1, \ldots, \mathbf{b}_M)$ of $U \subseteq \mathbb{R}^D$, to find the optimal coordinates $z_{in}$ with respect to this basis, we calculate the partial derivatives
$$\frac{\partial J_M}{\partial z_{in}} = \frac{\partial J_M}{\partial \tilde{\mathbf{x}}_n} \frac{\partial \tilde{\mathbf{x}}_n}{\partial z_{in}},$$
$$\frac{\partial J_M}{\partial \tilde{\mathbf{x}}_n} = -\frac{2}{N} (\mathbf{x}_n - \tilde{\mathbf{x}}_n)^\top \in \mathbb{R}^{1 \times D},$$
$$\frac{\partial \tilde{\mathbf{x}}_n}{\partial z_{in}} = \frac{\partial}{\partial z_{in}} \left( \sum_{m=1}^{M} z_{mn} \mathbf{b}_m \right) = \mathbf{b}_i$$
for $i = 1, \ldots, M$, such that we obtain
$$\frac{\partial J_M}{\partial z_{in}} = -\frac{2}{N} (\mathbf{x}_n - \tilde{\mathbf{x}}_n)^\top \mathbf{b}_i = -\frac{2}{N} \left( \mathbf{x}_n - \sum_{m=1}^{M} z_{mn} \mathbf{b}_m \right)^\top \mathbf{b}_i \overset{\text{ONB}}{=} -\frac{2}{N} (\mathbf{x}_n^\top \mathbf{b}_i - z_{in} \mathbf{b}_i^\top \mathbf{b}_i) = -\frac{2}{N} (\mathbf{x}_n^\top \mathbf{b}_i - z_{in})$$
• Setting this partial derivative to 0 yields immediately the optimal coordinates
$$z_{in} = \mathbf{x}_n^\top \mathbf{b}_i = \mathbf{b}_i^\top \mathbf{x}_n$$
for $i = 1, \ldots, M$ and $n = 1, \ldots, N$.
• The optimal coordinates $z_{in}$ of the projection $\tilde{\mathbf{x}}_n$ are the coordinates of the orthogonal projection of the original data point $\mathbf{x}_n$ onto the one-dimensional subspace spanned by $\mathbf{b}_i$.
• The optimal linear projection $\tilde{\mathbf{x}}_n$ of $\mathbf{x}_n$ is an orthogonal projection.
• The coordinates of $\tilde{\mathbf{x}}_n$ with respect to the basis $(\mathbf{b}_1, \ldots, \mathbf{b}_M)$ are the coordinates of the orthogonal projection of $\mathbf{x}_n$ onto the principal subspace.
(Figure: (a) a vector $\mathbf{x} \in \mathbb{R}^2$ (red cross) shall be projected onto a one-dimensional subspace $U \subseteq \mathbb{R}^2$ spanned by $\mathbf{b}$; (b) differences $\mathbf{x} - \tilde{\mathbf{x}}_i$ for 50 different candidates $\tilde{\mathbf{x}}_i$; (c) distances $\|\mathbf{x} - \tilde{\mathbf{x}}\|$ for some $\tilde{\mathbf{x}} = z_1 \mathbf{b} \in U = \mathrm{span}[\mathbf{b}]$; (d) the vector $\tilde{\mathbf{x}}$ that minimizes $\|\mathbf{x} - \tilde{\mathbf{x}}\|$ is the orthogonal projection of $\mathbf{x}$ onto $U$.)
• We briefly recap orthogonal projections from Section 3.8 (Analytic Geometry).
• If $(\mathbf{b}_1, \ldots, \mathbf{b}_D)$ is an orthonormal basis of $\mathbb{R}^D$, then
$$\tilde{\mathbf{x}} = \frac{\mathbf{b}_j^\top \mathbf{x}}{\|\mathbf{b}_j\|^2} \mathbf{b}_j = \mathbf{b}_j \mathbf{b}_j^\top \mathbf{x} \in \mathbb{R}^D$$
is the orthogonal projection of $\mathbf{x}$ onto the subspace spanned by the $j$th basis vector, and $z_j = \mathbf{b}_j^\top \mathbf{x}$ is the coordinate of this projection with respect to the basis vector $\mathbf{b}_j$ that spans that subspace, since $z_j \mathbf{b}_j = \tilde{\mathbf{x}}$.
• More generally, if we aim to project onto an $M$-dimensional subspace of $\mathbb{R}^D$, we obtain the orthogonal projection of $\mathbf{x}$ onto the $M$-dimensional subspace with orthonormal basis vectors $\mathbf{b}_1, \ldots, \mathbf{b}_M$ as
$$\tilde{\mathbf{x}} = \mathbf{B} (\underbrace{\mathbf{B}^\top \mathbf{B}}_{= \mathbf{I}})^{-1} \mathbf{B}^\top \mathbf{x} = \mathbf{B} \mathbf{B}^\top \mathbf{x},$$
where we defined $\mathbf{B} := [\mathbf{b}_1, \ldots, \mathbf{b}_M] \in \mathbb{R}^{D \times M}$. The coordinates of this projection with respect to the ordered basis $(\mathbf{b}_1, \ldots, \mathbf{b}_M)$ are $\mathbf{z} := \mathbf{B}^\top \mathbf{x}$.
• Although $\tilde{\mathbf{x}} \in \mathbb{R}^D$, we only need $M$ coordinates to represent $\tilde{\mathbf{x}}$; the other $D - M$ coordinates with respect to the basis vectors $(\mathbf{b}_{M+1}, \ldots, \mathbf{b}_D)$ are always 0.
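A short NumPy sketch of these projection formulas, using a hypothetical orthonormal basis (the first two canonical basis vectors of $\mathbb{R}^3$, chosen only for illustration):

```python
import numpy as np

# B has orthonormal columns b_1, ..., b_M (here D = 3, M = 2).
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])

x = np.array([2.0, -1.0, 3.0])

z = B.T @ x          # coordinates/code with respect to (b_1, ..., b_M)
x_tilde = B @ z      # orthogonal projection x_tilde = B B^T x = [2, -1, 0]

# The displacement x - x_tilde is orthogonal to the principal subspace.
print(x_tilde, B.T @ (x - x_tilde))   # second result is ~[0, 0]
```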
10.3.3 Finding the Basis of the Principal Subspace
• So far we have shown that, for a given ONB, we can find the optimal coordinates of $\tilde{\mathbf{x}}$ by an orthogonal projection onto the principal subspace. In the following, we will determine what the best basis is.
• Recall that the optimal coordinates of $\tilde{\mathbf{x}}_n$ given the ONB are
$$z_{in} = \mathbf{x}_n^\top \mathbf{b}_i = \mathbf{b}_i^\top \mathbf{x}_n$$
• We have
$$\tilde{\mathbf{x}}_n = \sum_{m=1}^{M} z_{mn} \mathbf{b}_m = \sum_{m=1}^{M} (\mathbf{x}_n^\top \mathbf{b}_m) \mathbf{b}_m$$
• We now exploit the symmetry of the dot product, which yields
$$\tilde{\mathbf{x}}_n = \sum_{m=1}^{M} (\mathbf{b}_m^\top \mathbf{x}_n) \mathbf{b}_m = \sum_{m=1}^{M} \mathbf{b}_m (\mathbf{b}_m^\top \mathbf{x}_n) = \left( \sum_{m=1}^{M} \mathbf{b}_m \mathbf{b}_m^\top \right) \mathbf{x}_n$$
• Since we can generally write the original data point $\mathbf{x}_n$ as a linear combination of all basis vectors, it holds that
$$\mathbf{x}_n = \sum_{d=1}^{D} z_{dn} \mathbf{b}_d = \sum_{d=1}^{D} (\mathbf{x}_n^\top \mathbf{b}_d) \mathbf{b}_d = \left( \sum_{d=1}^{D} \mathbf{b}_d \mathbf{b}_d^\top \right) \mathbf{x}_n = \left( \sum_{m=1}^{M} \mathbf{b}_m \mathbf{b}_m^\top \right) \mathbf{x}_n + \left( \sum_{j=M+1}^{D} \mathbf{b}_j \mathbf{b}_j^\top \right) \mathbf{x}_n,$$
where we split the sum with $D$ terms into a sum over $M$ terms and a sum over $D - M$ terms.
• With these results, the displacement vector $\mathbf{x}_n - \tilde{\mathbf{x}}_n$, i.e., the difference vector between the original data point and its projection, is
$$\mathbf{x}_n - \tilde{\mathbf{x}}_n = \left( \sum_{j=M+1}^{D} \mathbf{b}_j \mathbf{b}_j^\top \right) \mathbf{x}_n = \sum_{j=M+1}^{D} (\mathbf{x}_n^\top \mathbf{b}_j) \mathbf{b}_j$$
• The displacement vector $\mathbf{x}_n - \tilde{\mathbf{x}}_n$ is exactly the projection of the data point onto the orthogonal complement of the principal subspace.
• $\mathbf{x}_n - \tilde{\mathbf{x}}_n$ lies in the subspace that is orthogonal to the principal subspace.
• We identify the matrix $\sum_{j=M+1}^{D} \mathbf{b}_j \mathbf{b}_j^\top$ in the equation above as the projection matrix that performs this projection.
(Figure: orthogonal projection and displacement vectors. When projecting data points $\mathbf{x}_n$ (blue) onto the subspace $U_1$, we obtain $\tilde{\mathbf{x}}_n$ (orange). The displacement vector $\mathbf{x}_n - \tilde{\mathbf{x}}_n$ lies completely in the orthogonal complement $U_2$ of $U_1$.)
• Now we reformulate the loss function.
$$J_M = \frac{1}{N} \sum_{n=1}^{N} \|\mathbf{x}_n - \tilde{\mathbf{x}}_n\|^2 = \frac{1}{N} \sum_{n=1}^{N} \left\| \sum_{j=M+1}^{D} (\mathbf{b}_j^\top \mathbf{x}_n) \mathbf{b}_j \right\|^2$$
• We explicitly compute the squared norm and exploit the fact that the $\mathbf{b}_j$ form an ONB:
$$J_M = \frac{1}{N} \sum_{n=1}^{N} \sum_{j=M+1}^{D} (\mathbf{b}_j^\top \mathbf{x}_n)^2 = \frac{1}{N} \sum_{n=1}^{N} \sum_{j=M+1}^{D} \mathbf{b}_j^\top \mathbf{x}_n \mathbf{b}_j^\top \mathbf{x}_n = \frac{1}{N} \sum_{n=1}^{N} \sum_{j=M+1}^{D} \mathbf{b}_j^\top \mathbf{x}_n \mathbf{x}_n^\top \mathbf{b}_j,$$
where we exploited the symmetry of the dot product in the last step to write $\mathbf{b}_j^\top \mathbf{x}_n = \mathbf{x}_n^\top \mathbf{b}_j$. We now swap the sums and obtain
$$J_M = \sum_{j=M+1}^{D} \mathbf{b}_j^\top \underbrace{\left( \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n \mathbf{x}_n^\top \right)}_{=:\,\mathbf{S}} \mathbf{b}_j = \sum_{j=M+1}^{D} \mathbf{b}_j^\top \mathbf{S} \mathbf{b}_j = \sum_{j=M+1}^{D} \operatorname{tr}(\mathbf{b}_j^\top \mathbf{S} \mathbf{b}_j) = \sum_{j=M+1}^{D} \operatorname{tr}(\mathbf{S} \mathbf{b}_j \mathbf{b}_j^\top) = \operatorname{tr}\Bigl( \underbrace{\Bigl( \sum_{j=M+1}^{D} \mathbf{b}_j \mathbf{b}_j^\top \Bigr)}_{\text{projection matrix}} \mathbf{S} \Bigr),$$
where we exploited the property that the trace operator $\operatorname{tr}(\cdot)$ is linear and invariant to cyclic permutations of its arguments. We therefore have
$$J_M = \sum_{j=M+1}^{D} \mathbf{b}_j^\top \mathbf{S} \mathbf{b}_j = \operatorname{tr}\Bigl( \Bigl( \sum_{j=M+1}^{D} \mathbf{b}_j \mathbf{b}_j^\top \Bigr) \mathbf{S} \Bigr)$$
• The loss can therefore be formulated in terms of the covariance matrix of the data, projected onto the orthogonal complement of the principal subspace.
• Minimizing the average squared reconstruction error is therefore equivalent to minimizing the variance of the data when projected onto the subspace we ignore, i.e., the orthogonal complement of the principal subspace.
• Equivalently, we maximize the variance of the projection that we retain in the principal subspace, which links the projection loss immediately to the maximum-variance formulation of PCA in Section 10.2.
• Using the results from Section 10.2, the average squared reconstruction error, when projecting onto the $M$-dimensional principal subspace, is
$$J_M = \sum_{j=M+1}^{D} \lambda_j,$$
where the $\lambda_j$ are the eigenvalues of the data covariance matrix.
• To minimize it, we need to select the smallest 𝐷 − 𝑀 eigenvalues. Their corresponding eigenvectors are the basis of the orthogonal complement of the principal subspace.
• Consequently, this means that the basis of the principal subspace comprises the eigenvectors 𝒃), … , 𝒃8 that are associated with the largest 𝑀 eigenvalues of the data covariance matrix.
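The following sketch checks this relationship numerically on random data: the average squared reconstruction error $J_M$ obtained with the top-$M$ eigenvectors matches the sum of the $D - M$ smallest eigenvalues.

```python
import numpy as np

# Numerical check (on random, correlated data) that the average squared
# reconstruction error J_M equals the sum of the D - M smallest eigenvalues of S.
rng = np.random.default_rng(0)
N, D, M = 500, 5, 2
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))    # correlated data
X -= X.mean(axis=0)                                      # centre the data

S = X.T @ X / N
eigvals, eigvecs = np.linalg.eigh(S)                     # ascending order
B = eigvecs[:, -M:]                                      # top-M eigenvectors

X_tilde = X @ B @ B.T                                    # projections B B^T x_n
J_M = np.mean(np.sum((X - X_tilde) ** 2, axis=1))        # reconstruction error

print(J_M, eigvals[:-M].sum())                           # the two values agree
```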
10.5 PCA in High Dimensions
• In order to do PCA, we need to compute the data covariance matrix 𝑺
• In 𝐷 dimensions, 𝑺 is a 𝐷×𝐷 matrix.
• Computing the eigenvalues and eigenvectors of this matrix is computationally expensive as it scales cubically in 𝐷.
• Therefore, PCA will be infeasible in very high dimensions
• For example, if 𝒙& are images with 10,000 pixels, we would need to compute
the eigendecomposition of a 10,000×10,000 matrix.
• We provide a solution to this problem for the case that we have substantially
fewer data points than dimensions, i.e., 𝑁 ≪ 𝐷
• Assume we have a centered dataset $\mathbf{x}_1, \ldots, \mathbf{x}_N$, $\mathbf{x}_n \in \mathbb{R}^D$. Then the data covariance matrix is given as
$$\mathbf{S} = \frac{1}{N} \mathbf{X} \mathbf{X}^\top \in \mathbb{R}^{D \times D},$$
where $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_N]$ is a $D \times N$ matrix whose columns are the data points.
• We now assume that 𝑁 ≪ 𝐷, i.e., the number of data points is smaller than the dimensionality of the data.
• With 𝑁 ≪ 𝐷 data points, the rank of the covariance matrix 𝑆 is at most 𝑁, so it has at least 𝐷 − 𝑁 eigenvalues that are 0.
• Intuitively, this means that there are some redundancies. In the following, we will exploit this and turn the 𝐷×𝐷 covariance matrix into an 𝑁 × 𝑁 covariance matrix whose eigenvalues are all positive.
• In PCA, we ended up with the eigenvector equation
$$\mathbf{S} \mathbf{b}_m = \lambda_m \mathbf{b}_m, \quad m = 1, \ldots, M,$$
where $\mathbf{b}_m$ is a basis vector of the principal subspace. Let us rewrite this equation a bit: with $\mathbf{S} = \frac{1}{N} \mathbf{X} \mathbf{X}^\top \in \mathbb{R}^{D \times D}$, we obtain
$$\mathbf{S} \mathbf{b}_m = \frac{1}{N} \mathbf{X} \mathbf{X}^\top \mathbf{b}_m = \lambda_m \mathbf{b}_m$$
• We now multiply $\mathbf{X}^\top \in \mathbb{R}^{N \times D}$ from the left-hand side, which yields
$$\frac{1}{N} \mathbf{X}^\top \mathbf{X} \underbrace{\mathbf{X}^\top \mathbf{b}_m}_{=:\,\mathbf{c}_m} = \lambda_m \mathbf{X}^\top \mathbf{b}_m \iff \frac{1}{N} \mathbf{X}^\top \mathbf{X} \mathbf{c}_m = \lambda_m \mathbf{c}_m,$$
where $\frac{1}{N} \mathbf{X}^\top \mathbf{X}$ is an $N \times N$ matrix.
• We get a new eigenvector/eigenvalue equation: $\lambda_m$ remains an eigenvalue, which confirms our result from Exercise 4.11 that the nonzero eigenvalues of $\mathbf{X} \mathbf{X}^\top$ equal the nonzero eigenvalues of $\mathbf{X}^\top \mathbf{X}$.
• We obtain the eigenvector of the matrix $\frac{1}{N} \mathbf{X}^\top \mathbf{X} \in \mathbb{R}^{N \times N}$ associated with $\lambda_m$ as $\mathbf{c}_m := \mathbf{X}^\top \mathbf{b}_m$.
• This also implies that $\frac{1}{N} \mathbf{X}^\top \mathbf{X}$ has the same (nonzero) eigenvalues as the data covariance matrix $\mathbf{S}$.
• But $\mathbf{X}^\top \mathbf{X}$ is now an $N \times N$ matrix, so we can compute its eigenvalues and eigenvectors much more efficiently than for the original $D \times D$ data covariance matrix.
• Now that we have the eigenvectors of $\frac{1}{N} \mathbf{X}^\top \mathbf{X}$, we are going to recover the original eigenvectors, which we still need for PCA. If we left-multiply our eigenvalue/eigenvector equation with $\mathbf{X}$, we get
$$\underbrace{\frac{1}{N} \mathbf{X} \mathbf{X}^\top}_{\mathbf{S}} \mathbf{X} \mathbf{c}_m = \lambda_m \mathbf{X} \mathbf{c}_m,$$
and we recover the data covariance matrix again. This also means that we recover $\mathbf{X} \mathbf{c}_m$ as an eigenvector of $\mathbf{S}$.
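A sketch of this $N \ll D$ trick in NumPy (the helper name pca_basis_high_dim is illustrative). Note that the recovered vectors $\mathbf{X}\mathbf{c}_m$ are eigenvectors of $\mathbf{S}$ but not unit length, so we normalize them.

```python
import numpy as np

def pca_basis_high_dim(X, M):
    """Sketch of PCA for N << D. X is D x N with centred data points as columns.
    Works with the N x N matrix (1/N) X^T X instead of the D x D covariance."""
    N = X.shape[1]
    K = X.T @ X / N                          # N x N matrix, same nonzero eigenvalues as S
    eigvals, C = np.linalg.eigh(K)           # columns of C are the c_m, ascending order
    eigvals, C = eigvals[::-1], C[:, ::-1]   # sort largest first
    Bm = X @ C[:, :M]                        # X c_m are eigenvectors of S ...
    Bm /= np.linalg.norm(Bm, axis=0)         # ... which we normalise to unit length
    return Bm, eigvals[:M]

# Example: N = 20 points in D = 1000 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
X -= X.mean(axis=1, keepdims=True)           # centre: subtract the mean data point
B, lam = pca_basis_high_dim(X, M=3)
print(B.shape, lam)                          # (1000, 3) and the 3 largest eigenvalues
```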
10.6 Key Steps of PCA in Practice
• Step 1. Mean subtraction
• We center the data by computing the mean 𝜇 of the dataset and subtracting it
from every single data point. This ensures that the dataset has mean 0.
• Step 2. Standardization. Divide the data points by the standard deviation $\sigma_d$ of the dataset for every dimension $d = 1, \ldots, D$. Now the data has variance 1 along each axis.
• Step 3. Eigendecomposition of the covariance matrix
• Compute the data covariance matrix and its eigenvalues and corresponding eigenvectors. The eigenvectors associated with the largest eigenvalues span the principal subspace $U$.
• Step 4. Projection. We can project any data point $\mathbf{x}_* \in \mathbb{R}^D$ onto the principal subspace. To get this right, we need to standardize $\mathbf{x}_*$ using the mean $\mu_d$ and standard deviation $\sigma_d$ of the training data in the $d$th dimension, so that
$$x_{*d} \leftarrow \frac{x_{*d} - \mu_d}{\sigma_d}, \qquad d = 1, \ldots, D,$$
where $x_{*d}$ is the $d$th component of $\mathbf{x}_*$.
• We obtain the projection as
$$\tilde{\mathbf{x}}_* = \mathbf{B} \mathbf{B}^\top \mathbf{x}_*$$
with coordinates
$$\mathbf{z}_* = \mathbf{B}^\top \mathbf{x}_*$$
with respect to the basis of the principal subspace. Here, $\mathbf{B}$ is the matrix that contains the eigenvectors that are associated with the largest eigenvalues of the data covariance matrix as columns.
• Note that PCA returns the coordinates $\mathbf{z}_*$, not the projections $\tilde{\mathbf{x}}_*$.
• Having standardized our dataset, $\tilde{\mathbf{x}}_* = \mathbf{B} \mathbf{B}^\top \mathbf{x}_*$ only yields the projection in the context of the standardized dataset.
• To obtain our projection in the original data space (i.e., before standardization), we need to undo the standardization: multiply by the standard deviation before adding the mean.
• We obtain
$$\tilde{x}_{*d} \leftarrow \tilde{x}_{*d}\,\sigma_d + \mu_d, \qquad d = 1, \ldots, D$$
• Figure 10.10(f) illustrates the projection in the original data space.
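The key steps can be summarized in a short sketch; the function names pca_fit and pca_project are illustrative, and we assume every dimension has nonzero standard deviation.

```python
import numpy as np

def pca_fit(X_train, M):
    """Key steps of PCA in practice (a sketch): returns (mu, sigma, B) from
    training data X_train whose rows are D-dimensional points."""
    mu = X_train.mean(axis=0)              # Step 1: mean
    sigma = X_train.std(axis=0)            # Step 2: per-dimension std (assumed nonzero)
    Xs = (X_train - mu) / sigma            # standardised data
    S = Xs.T @ Xs / Xs.shape[0]            # covariance of the standardised data
    eigvals, eigvecs = np.linalg.eigh(S)   # Step 3: eigendecomposition (ascending)
    B = eigvecs[:, ::-1][:, :M]            # eigenvectors of the M largest eigenvalues
    return mu, sigma, B

def pca_project(x_star, mu, sigma, B):
    """Step 4: project a new point and map the projection back to the original space."""
    xs = (x_star - mu) / sigma             # standardise with the training statistics
    z = B.T @ xs                           # coordinates/code z_* = B^T x_*
    x_tilde_std = B @ z                    # projection in the standardised space
    x_tilde = x_tilde_std * sigma + mu     # undo the standardisation
    return z, x_tilde
```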