WiSe 2021/22
Machine Learning 1/1-X
Lecture 3 PCA
Dimensionality reduction
Principal Component Analysis (PCA)
What are principal components? How to find/calculate them
Lagrange multipliers
Eigenvalue formulation
Iterative approaches
What can we do with them? / Applications
Curse of Dimensionality
In many machine learning problems, the data are high-dimensional.
Standard regression/classification techniques can become ill-defined for d > N, and ill-conditioned/numerically unstable even for d ≤ N.
Example of Ill-Conditioning: Bayes Classifier with Estimated Covariance
Assume our data is generated for each class ωj according to the multivariate Gaussian distribution p(x|ωj ) = N (μj , Σ) and with class priors P (ωj ). The Bayes optimal classifier is derived as
arg max_j { P(ωj | x) }
= arg max_j { log p(x | ωj) + log P(ωj) }
= arg max_j { x⊤Σ⁻¹μj − (1/2) μj⊤Σ⁻¹μj + log P(ωj) }
In practice, the true Σ is unknown, so we use some estimator Σ̂ in place of Σ.
Can we compute Σ̂⁻¹, and can we do so accurately?
Example of Ill-Conditioning: Bayes Classifier with Estimated Covariance
Example of estimator:
Maximum likelihood estimator:
Σ̂ = (1/N) X̄X̄⊤
where X̄ is the matrix of size d × N containing the centered data.
The matrix Σ̂⁻¹ used in the classifier only exists when d ≤ N and the matrix X̄ has full rank.
The matrix Σ̂⁻¹ used in the classifier is typically accurate only when d ≪ N.
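A minimal numpy sketch (on hypothetical random data) illustrating the problem: as soon as d > N, the maximum likelihood estimate Σ̂ is rank-deficient and cannot be inverted as the classifier requires.

```python
import numpy as np

rng = np.random.default_rng(0)

N, d = 50, 100                            # fewer samples than dimensions (d > N)
X = rng.normal(size=(d, N))               # d x N data matrix, columns are samples

Xbar = X - X.mean(axis=1, keepdims=True)  # center the data
Sigma_hat = Xbar @ Xbar.T / N             # maximum likelihood estimate (d x d)

# Xbar has rank at most N - 1 = 49 after centering, so Sigma_hat is singular.
print(np.linalg.matrix_rank(Sigma_hat))   # 49, far below d = 100
```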
Solutions:
Reduce the dimensionality of the data (today)
Regularize the model (later)
Curse of Dimensionality (2)
When the dimensionality increases, the volume of the space increases so fast that the available data becomes sparse.
The amount of data needed for a reliable result (e.g. no overfitting) often grows exponentially with the dimensionality.
Here again, this can be addressed by reducing the dimensionality of the data (today) or regularizing (later).
Principal Component Analysis (PCA)
PCA is an analysis that linearly maps the data from a high-dimensional space to a low-dimensional space. In the example below, from a multi-dimensional space (2d) to a line (1d).
[Figure: 2d data points and the line spanned by the 1st principal component]
Reference: Bishop, Pattern Recognition and Machine Learning, Chapter 12.1
Which line fits data best? The line w that minimizes the noise and maximizes the signal [Pearson 1901].
PCA as Noise Minimization
Given a dataset x1, . . . , xN ∈ Rd, find a one-dimensional subspace such that the data projected on that subspace has minimum distance to the original data.
The reconstruction model is given by x ↦ ww⊤x, with ∥w∥ = 1.
Minimizing the noise (i.e. the reconstruction error) can be written as:
arg min_w (1/N) ∑_{k=1}^N ∥xk − ww⊤xk∥²
PCA as Signal Maximization
Given a dataset x1, . . . , xN ∈ Rd, find a one-dimensional subspace such that the projected data has maximum variance.
The projection model is given by z = w⊤x, with ∥w∥ = 1.
Maximizing the signal (i.e. the variance of the projected data) can be written as:
arg max_w (1/N) ∑_{k=1}^N (w⊤xk − E[w⊤x])²
and assuming centered data, i.e. E[x] = 0, it simplifies to:
arg max_w (1/N) ∑_{k=1}^N (w⊤xk)²
From Min Noise to Max Signal
arg min_w (1/N) ∑_{k=1}^N ∥xk − ww⊤xk∥²
= arg min_w (1/N) ∑_{k=1}^N (xk − ww⊤xk)⊤(xk − ww⊤xk)
= arg min_w (1/N) ∑_{k=1}^N [ xk⊤xk − 2 xk⊤ww⊤xk + (ww⊤xk)⊤(ww⊤xk) ]
(the term xk⊤xk is constant in w, and the last term equals (xk⊤w)² since w⊤w = 1)
= arg min_w (1/N) ∑_{k=1}^N [ −2(xk⊤w)² + (xk⊤w)² ]
= arg max_w (1/N) ∑_{k=1}^N (xk⊤w)²
Data needs to be centered
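The identity behind this derivation, ∥xk − ww⊤xk∥² = ∥xk∥² − (w⊤xk)² for unit-norm w, can be checked numerically. The sketch below (numpy, hypothetical centered data and a random direction) confirms that reconstruction error plus projected variance equals the total variance, so minimizing the one maximizes the other.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 5
X = rng.normal(size=(d, N))
X = X - X.mean(axis=1, keepdims=True)      # centered data, columns are samples

w = rng.normal(size=d)
w = w / np.linalg.norm(w)                  # arbitrary unit-norm direction

proj = w @ X                               # w^T x_k for all k
noise = np.mean(np.sum((X - np.outer(w, proj))**2, axis=0))   # reconstruction error
signal = np.mean(proj**2)                  # variance of the projected data
total = np.mean(np.sum(X**2, axis=0))      # total variance (independent of w)

print(np.isclose(noise + signal, total))   # True: min noise <=> max signal
```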
What if the Data is Uncentered?
[Figure: uncentered data showing the direction obtained if we forget to center vs. the true PCA direction]
Is PCA Robust to Outliers?
Not very robust
Computing Principal Components
So far, we have formulated PCA as maximizing some objective f(w) subject to a constraint (∥w∥ = 1), but we do not yet have a procedure to quickly find the solution.
Now, we will use the method of Lagrange multipliers to effectively find a solution w.
Lagrange multipliers
General framework for finding solutions of constrained optimization problems of the type arg max f (θ) subject to g(θ) = 0.
Step 1: Construct the ‘Lagrangian’:
L(θ, λ) = f (θ) + λg(θ)
where λ is called the Lagrange multiplier.
Step 2: Solve the equation ∇L(θ, λ) = 0
which is a necessary condition for the solution.
Intuition for step 2: the equation includes the equation ∇f (θ) = −λ∇g(θ), i.e. the gradient of objective and constraint are aligned, but point in opposite directions (cf. 2d plot).
Lagrange Multipliers: 2D Example
Consider the constrained optimization problem
arg max_θ [ 1 − θ1² − θ2² ]   s.t.   θ1 + θ2 = 1
Step 1: Build the Lagrangian
L(θ, λ) = 1 − θ1² − θ2² + λ · (θ1 + θ2 − 1)
Step 2: Set gradient of Lagrangian to zero:
∇θ1 L(θ, λ) = 0 ⇒ 2θ1 = λ
∇θ2 L(θ, λ) = 0 ⇒ 2θ2 = λ
∇λ L(θ, λ) = 0 ⇒ θ1 + θ2 = 1
which gives us the solution:
θ = (0.5, 0.5)
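The three stationarity conditions form a linear system in (θ1, θ2, λ). A small, purely illustrative numpy sketch recovers the same solution:

```python
import numpy as np

# Unknowns: theta1, theta2, lambda
#  2*theta1 - lambda = 0
#  2*theta2 - lambda = 0
#  theta1 + theta2   = 1
A = np.array([[2.0, 0.0, -1.0],
              [0.0, 2.0, -1.0],
              [1.0, 1.0,  0.0]])
b = np.array([0.0, 0.0, 1.0])

theta1, theta2, lam = np.linalg.solve(A, b)
print(theta1, theta2)   # 0.5 0.5  (with lambda = 1)
```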
The Solution of PCA
Before applying Lagrange multipliers, we first observe that the PCA optimization problem given in slide 8:
arg max_w (1/N) ∑_{k=1}^N (xk⊤w)²   s.t. ∥w∥ = 1
can be rewritten as:
= arg max_w (1/N) ∑_{k=1}^N (w⊤xk)(xk⊤w)   s.t. ∥w∥² = 1
= arg max_w w⊤[ (1/N) ∑_{k=1}^N xk xk⊤ ] w   s.t. ∥w∥² = 1
= arg max_w w⊤Σw   s.t. ∥w∥² = 1
where Σ is the covariance matrix (recall: the data is assumed to be centered). The covariance matrix does not depend on w and can be precomputed.
The Solution of PCA
Starting from our rewritten PCA optimization problem
arg max_w [ w⊤Σw ]   s.t. ∥w∥² = 1
we look for a solution by applying the method of Lagrange multipliers.
Step 1: Build the Lagrangian
L(w, λ) = w⊤Σw + λ · (1 − ∥w∥²)
Step 2: Set the gradient of the Lagrangian to zero:
∇w L(w, λ) = 0 ⇒ Σw = λw
∇λ L(w, λ) = 0 ⇒ ∥w∥² = 1
⇒ The PCA solution w is an eigenvector of Σ.
(Homework: Show that it is the eigenvector with the highest eigenvalue.)
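A minimal numpy sketch (hypothetical anisotropic data, using np.linalg.eigh as the eigensolver) illustrating this result: the variance attained by the top eigenvector of Σ equals its largest eigenvalue, and randomly drawn unit vectors never exceed it.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 500)) * np.array([[3.0], [1.0], [0.3]])  # anisotropic data
X = X - X.mean(axis=1, keepdims=True)
Sigma = X @ X.T / X.shape[1]                  # covariance matrix

eigvals, eigvecs = np.linalg.eigh(Sigma)      # eigenvalues in ascending order
w = eigvecs[:, -1]                            # eigenvector with the largest eigenvalue

print(w @ Sigma @ w, eigvals[-1])             # attained variance = top eigenvalue

# random unit vectors never exceed the top eigenvalue
V = rng.normal(size=(3, 1000))
V = V / np.linalg.norm(V, axis=0)
print(np.max(np.einsum('ik,ij,jk->k', V, Sigma, V)) <= eigvals[-1])   # True
```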
Solving PCA (more components)
PCA can be generalized to extracting h components w1,…,wh.
Conceptually, it can be described by the following algorithm, which repeatedly (i) computes the first principal component and (ii) removes it from the data.
Finding multiple PCA components:
for i = 1 to h do
    wi ← PC1(X̄)            (i)
    X̄ ← X̄ − wi wi⊤ X̄      (ii)
end for
It can be shown that the resulting sequence of principal components w1,…,wh corresponds to the leading eigenvectors of the matrix Σ.
Therefore, in practice, we just need to compute the leading eigenvectors of Σ, without an iterative procedure.
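In practice, this amounts to a single eigendecomposition. A minimal numpy sketch (hypothetical data, assuming h ≤ d):

```python
import numpy as np

def pca_components(Xbar, h):
    """Return the h leading principal components of the centered d x N data Xbar."""
    Sigma = Xbar @ Xbar.T / Xbar.shape[1]
    eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:h]      # indices of the h largest eigenvalues
    return eigvecs[:, order]                   # d x h matrix (w1 | ... | wh)

# usage on hypothetical centered data
rng = np.random.default_rng(0)
Xbar = rng.normal(size=(10, 200))
Xbar = Xbar - Xbar.mean(axis=1, keepdims=True)
W = pca_components(Xbar, h=3)
print(W.shape)   # (10, 3)
```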
Interpreting PCA
PCA rotates the data into a new coordinate system in which the directions of largest variance become the new coordinate axes.
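This interpretation can be checked directly: expressing the centered data in the basis of principal components yields coordinates whose covariance matrix is diagonal (numpy sketch on hypothetical 2d data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[4.0, 1.5], [1.5, 1.0]], size=500).T  # 2 x N
X = X - X.mean(axis=1, keepdims=True)

Sigma = X @ X.T / X.shape[1]
_, U = np.linalg.eigh(Sigma)                # columns = principal directions

Z = U.T @ X                                 # data in the rotated coordinate system
print(np.round(Z @ Z.T / Z.shape[1], 3))    # (approximately) diagonal covariance
```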
Stable PCA Computation via SVD
Singular Value Decomposition (SVD)
A singular value decomposition factorizes a matrix M as M = UΛV⊤, where
U contains the eigenvectors of MM⊤,
V contains the eigenvectors of M⊤M, and
the square roots of the eigenvalues of MM⊤ are on the diagonal of Λ.
Observation:
Setting M = (1/√N) X̄ and computing the SVD results in
a matrix U = (w1| . . . |wd) that contains the eigenvectors of Σ, i.e. the principal components, and
a matrix Λ = diag(√λ1, . . . , √λd) (but applying the square root does not change the ordering of the eigenvalues).
SVD is computationally faster and more stable than covariance computation + eigenvalue decomposition.
Comparing Classical PCA and SVD
Eigenvalues are the same.
PCA projection vector is the same (up to a sign flip).
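Both statements can be checked numerically. The sketch below (numpy, hypothetical data) compares the eigenvalues of Σ with the squared singular values of M = X̄/√N, and the leading eigenvector with the first left singular vector (up to a sign flip):

```python
import numpy as np

rng = np.random.default_rng(0)
Xbar = rng.normal(size=(5, 300))
Xbar = Xbar - Xbar.mean(axis=1, keepdims=True)
N = Xbar.shape[1]

# Classical PCA: eigendecomposition of the covariance matrix
Sigma = Xbar @ Xbar.T / N
eigvals, eigvecs = np.linalg.eigh(Sigma)
w_eig = eigvecs[:, -1]                        # leading eigenvector

# PCA via SVD of M = Xbar / sqrt(N)
U, s, Vt = np.linalg.svd(Xbar / np.sqrt(N), full_matrices=False)
w_svd = U[:, 0]                               # first left singular vector

print(np.allclose(np.sort(s**2), np.sort(eigvals)))               # same eigenvalues
print(np.allclose(w_eig, w_svd) or np.allclose(w_eig, -w_svd))    # same direction up to sign
```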
PCA Computation via SVD, limitations
The SVD has computational complexity of O(min(N2d,d2N)).
When d ≈ N, the complexity is roughly as bad as that of
computing the eigenvectors of Σ, that is, O(d³).
This makes the computation of principal components using SVD
prohibitive for large datasets.
In practice, we often need only the first few principal components, whereas SVD computes all of them.
Can we design a faster procedure that only extracts the first few principal components?
Power Iteration, Algorithm
Find the first component by applying the iteration
v(t) ← Σ w(t−1)
w(t) ← v(t) / ∥v(t)∥
T times, starting from a random (or any) unit-norm vector w(0).
Questions:
Does it always converge? Yes.
How fast does it converge? Exponentially fast.
(Homework: prove this.)
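A minimal sketch of the iteration (numpy, hypothetical data, with the number of iterations T fixed to 100 as an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
Xbar = rng.normal(size=(4, 200))
Xbar = Xbar - Xbar.mean(axis=1, keepdims=True)
Sigma = Xbar @ Xbar.T / Xbar.shape[1]   # covariance matrix

T = 100
w = rng.normal(size=Sigma.shape[0])
w = w / np.linalg.norm(w)               # random unit-norm start w(0)
for _ in range(T):
    v = Sigma @ w                       # v(t) <- Sigma w(t-1)
    w = v / np.linalg.norm(v)           # w(t) <- v(t) / ||v(t)||

# w now approximates the leading eigenvector of Sigma
print(w @ Sigma @ w, np.linalg.eigvalsh(Sigma)[-1])   # nearly identical values
```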
Power Iteration (more components)
The power iteration (POWIT) method only gives us the leading eigenvector. If we want further PCA components, we need to compute them iteratively.
Naive approach (cf. slide 17): recompute the covariance at each step
for i = 1 to h do
    Σ ← (1/N) X̄X̄⊤
    wi ← POWIT(Σ)
    X̄ ← X̄ − wi wi⊤ X̄
end for
Better approach: Work directly in covariance space
Σ ← (1/N) X̄X̄⊤
for i = 1 to h do
    wi ← POWIT(Σ)
    Σ ← Σ − wi wi⊤ Σ
end for
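A self-contained sketch of the better approach (numpy, hypothetical data; the function powit plays the role of POWIT and repeats the iteration from the previous sketch):

```python
import numpy as np

def powit(Sigma, T=100, rng=None):
    """Power iteration: approximate the leading eigenvector of Sigma."""
    rng = np.random.default_rng() if rng is None else rng
    w = rng.normal(size=Sigma.shape[0])
    w = w / np.linalg.norm(w)
    for _ in range(T):
        v = Sigma @ w
        w = v / np.linalg.norm(v)
    return w

def powit_pca(Xbar, h, T=100, rng=None):
    """Extract h principal components by power iteration with covariance deflation."""
    Sigma = Xbar @ Xbar.T / Xbar.shape[1]       # covariance computed only once
    W = []
    for _ in range(h):
        w = powit(Sigma, T=T, rng=rng)
        W.append(w)
        Sigma = Sigma - np.outer(w, w) @ Sigma  # deflation: Sigma <- Sigma - w w^T Sigma
    return np.stack(W, axis=1)                  # d x h matrix (w1 | ... | wh)

# usage on hypothetical centered data
rng = np.random.default_rng(0)
Xbar = rng.normal(size=(6, 300))
Xbar = Xbar - Xbar.mean(axis=1, keepdims=True)
print(powit_pca(Xbar, h=2, rng=rng).shape)      # (6, 2)
```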
Applications of PCA: Data Visualization
PCA can be used to visualize how examples of different classes (e.g. different species) are related, and whether they form clusters.
Canonical coordinates representing individual features can also be projected into the PCA space. This gives a 'biplot'.
Applications of PCA: Data Compression
Applications of PCA: Denoising
Applications of PCA: Artifact Removal
Beyond PCA
Many methods that extend PCA or relate to it have been developed:
Kernel PCA: Perform PCA in feature space (where input features are transformed nonlinearly). Gives better compression/denoising.
CCA, ICA, etc. These methods do not maximize variance but some other quantity of interest (e.g. correlation, kurtosis).
Autoencoder Networks. PCA is equivalent to a linear autoencoder network. The analysis can be made nonlinear by incorporating nonlinearities in the autoencoder.
Sparse Coding. Minimizes the reconstruction error subject to constraints in the projected space (e.g. sparsity).
Kernel PCA will be presented in ML1. Other approaches will be presented in ML2.
Summary
Learning algorithms can be unstable in high-dimensional spaces. An approach to alleviate this problem is to reduce the dimensionality of the data.
Principal Component Analysis is a dimensionality reduction technique that implements [Pearson 1901]’s principle of minimizing noise and maximizing signal. (It does both simultaneously!).
The framework of Lagrange multipliers shows us that the solution of PCA is an eigenvector of the data covariance.
Several methods exist to compute principal components (e.g. SVD, power iteration); which one to choose depends on the considered scenario, e.g. how large the data is and how many PCA components we need.
PCA has many applications (e.g. visualization, compression, denoising, artifact reduction).