WiSe 2021/22
Machine Learning 1/1-X
Lecture 3 PCA
Dimensionality reduction
Principal Component Analysis (PCA)
What are principal components? How to find/calculate them
Lagrange multipliers
Eigenvalue formulation
Iterative approaches
What can we do with them? / Applications
Curse of Dimensionality
In many machine learning problems, the data are high-dimensional.
Standard regression/classification techniques can become ill-defined for d > N, and ill-conditioned/numerically unstable even for d ≤ N.
Example of Ill-Conditioning: Bayes Classifier with Estimated Covariance
Assume our data is generated for each class ωj according to the multivariate Gaussian distribution p(x|ωj ) = N (μj , Σ) and with class priors P (ωj ). The Bayes optimal classifier is derived as
arg max_j { P(ωj | x) }
= arg max_j { log p(x | ωj) + log P(ωj) }
= arg max_j { x⊤Σ⁻¹μj − (1/2) μj⊤Σ⁻¹μj + log P(ωj) }
In practice, the true Σ is unknown, so we use some estimator Σ̂ in place of Σ.
Can we compute Σ̂⁻¹, and can we do so accurately?
Example of Ill-Conditioning: Bayes Classifier with Estimated Covariance
Example of estimator:
Maximum likelihood estimator:
Σ̂ = (1/N) X̄X̄⊤
where X̄ is the matrix of size d × N containing the centered data.
The matrix Σ̂⁻¹ used in the classifier only exists when d ≤ N and the matrix X̄ has full rank.
The matrix Σ̂⁻¹ used in the classifier is typically accurate only when d ≪ N.
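A minimal numpy sketch (on hypothetical random data) illustrating the problem: as soon as d > N, the maximum likelihood estimate Σ̂ is rank-deficient and cannot be inverted as the classifier requires.

```python
import numpy as np

rng = np.random.default_rng(0)

N, d = 50, 100                            # fewer samples than dimensions (d > N)
X = rng.normal(size=(d, N))               # d x N data matrix, columns are samples

Xbar = X - X.mean(axis=1, keepdims=True)  # center the data
Sigma_hat = Xbar @ Xbar.T / N             # maximum likelihood estimate (d x d)

# Xbar has rank at most N - 1 = 49 after centering, so Sigma_hat is singular.
print(np.linalg.matrix_rank(Sigma_hat))   # 49, far below d = 100
```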
Solutions:
Reduce the dimensionality of the data (today)
Regularize the model (later)
Curse of Dimensionality (2)
When the dimensionality increases, the volume of the space increases so fast that the available data becomes sparse.
The amount of data needed for a reliable result (e.g. no overfitting) often grows exponentially with the dimensionality.
Here again, this can be addressed by reducing the dimensionality of the data (today) or regularizing (later).
Principal Component Analysis (PCA)
PCA is an analysis that linearly maps the data from a high-dimensional space to a low-dimensional space. In the example below, from a multi-dimensional space (2d) to a line (1d).
[Figure: 2d data points and the line spanned by the 1st principal component]
Reference: Bishop, Pattern Recognition and Machine Learning, Chapter 12.1
Which line fits data best? The line w that minimizes the noise and maximizes the signal [Pearson 1901].
PCA as Noise Minimization
Given a dataset x1, . . . , xN ∈ Rd, find a one-dimensional subspace such that the data projected on that subspace has minimum distance to the original data.
The reconstruction model is given by x ↦ ww⊤x, with ∥w∥ = 1.
Minimizing the noise (i.e. the reconstruction error) can be written as:
arg min_w (1/N) ∑_{k=1}^N ∥xk − ww⊤xk∥²
PCA as Signal Maximization
Given a dataset x1, . . . , xN ∈ Rd, find a one-dimensional subspace such that the projected data has maximum variance.
The projection model is given by z = w⊤x, with ∥w∥ = 1.
Maximizing the signal (i.e. the variance of the projected data) can be written as:
arg max_w (1/N) ∑_{k=1}^N (w⊤xk − E[w⊤x])²
and assuming centered data, i.e. E[x] = 0, it simplifies to:
arg max_w (1/N) ∑_{k=1}^N (w⊤xk)²
From Min Noise to Max Signal
arg min_w (1/N) ∑_{k=1}^N ∥xk − ww⊤xk∥²
= arg min_w (1/N) ∑_{k=1}^N (xk − ww⊤xk)⊤(xk − ww⊤xk)
= arg min_w (1/N) ∑_{k=1}^N [ xk⊤xk − 2 xk⊤ww⊤xk + (ww⊤xk)⊤(ww⊤xk) ]
(the term xk⊤xk is constant in w, and the last term equals (xk⊤w)² since w⊤w = 1)
= arg min_w (1/N) ∑_{k=1}^N [ −2(xk⊤w)² + (xk⊤w)² ]
= arg max_w (1/N) ∑_{k=1}^N (xk⊤w)²
Data needs to be centered
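The identity behind this derivation, ∥xk − ww⊤xk∥² = ∥xk∥² − (w⊤xk)² for unit-norm w, can be checked numerically. The sketch below (numpy, hypothetical centered data and a random direction) confirms that reconstruction error plus projected variance equals the total variance, so minimizing the one maximizes the other.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 5
X = rng.normal(size=(d, N))
X = X - X.mean(axis=1, keepdims=True)      # centered data, columns are samples

w = rng.normal(size=d)
w = w / np.linalg.norm(w)                  # arbitrary unit-norm direction

proj = w @ X                               # w^T x_k for all k
noise = np.mean(np.sum((X - np.outer(w, proj))**2, axis=0))   # reconstruction error
signal = np.mean(proj**2)                  # variance of the projected data
total = np.mean(np.sum(X**2, axis=0))      # total variance (independent of w)

print(np.isclose(noise + signal, total))   # True: min noise <=> max signal
```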
What if the Data is Uncentered?
[Figure: uncentered data showing the direction obtained if we forget to center vs. the true PCA direction]
Is PCA Robust to Outliers?
Not very robust
Computing Principal Components
So far, we have formulated PCA as maximizing some objective f(w) subject to a constraint (∥w∥ = 1), but we do not yet have a procedure to quickly find the solution.
Now, we will use the method of Lagrange multipliers to effectively find a solution w.
Lagrange multipliers
General framework for finding solutions of constrained optimization problems of the type arg max f (θ) subject to g(θ) = 0.
Step 1: Construct the ‘Lagrangian’:
L(θ, λ) = f (θ) + λg(θ)
where λ is called the Lagrange multiplier.
Step 2: Solve the equation ∇L(θ, λ) = 0
which is a necessary condition for the solution.
Intuition for step 2: the equation includes the equation ∇f (θ) = −λ∇g(θ), i.e. the gradient of objective and constraint are aligned, but point in opposite directions (cf. 2d plot).
Lagrange Multipliers: 2D Example
Consider the constrained optimization problem
arg max_θ [ 1 − θ1² − θ2² ]   s.t.   θ1 + θ2 = 1
Step 1: Build the Lagrangian
L(θ, λ) = 1 − θ1² − θ2² + λ · (θ1 + θ2 − 1)
Step 2: Set gradient of Lagrangian to zero:
∇θ1 L(θ, λ) = 0 ⇒ 2θ1 = λ
∇θ2 L(θ, λ) = 0 ⇒ 2θ2 = λ
∇λ L(θ, λ) = 0 ⇒ θ1 + θ2 = 1
which gives us the solution:
θ = (0.5, 0.5)
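The three stationarity conditions form a linear system in (θ1, θ2, λ). A small, purely illustrative numpy sketch recovers the same solution:

```python
import numpy as np

# Unknowns: theta1, theta2, lambda
#  2*theta1 - lambda = 0
#  2*theta2 - lambda = 0
#  theta1 + theta2   = 1
A = np.array([[2.0, 0.0, -1.0],
              [0.0, 2.0, -1.0],
              [1.0, 1.0,  0.0]])
b = np.array([0.0, 0.0, 1.0])

theta1, theta2, lam = np.linalg.solve(A, b)
print(theta1, theta2)   # 0.5 0.5  (with lambda = 1)
```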
The Solution of PCA
Before applying Lagrange multipliers, we first observe that the PCA optimization problem given in slide 8:
arg max_w (1/N) ∑_{k=1}^N (xk⊤w)²   s.t. ∥w∥ = 1
can be rewritten as:
= arg max_w (1/N) ∑_{k=1}^N (w⊤xk)(xk⊤w)   s.t. ∥w∥² = 1
= arg max_w w⊤[ (1/N) ∑_{k=1}^N xk xk⊤ ] w   s.t. ∥w∥² = 1
= arg max_w w⊤Σw   s.t. ∥w∥² = 1
where Σ is the covariance matrix (recall: the data is assumed to be centered). The covariance matrix does not depend on w and can be precomputed.
The Solution of PCA
Starting from our rewritten PCA optimization problem
arg max_w [ w⊤Σw ]   s.t. ∥w∥² = 1
we look for a solution by applying the method of Lagrange multipliers.
Step 1: Build the Lagrangian
L(w, λ) = w⊤Σw + λ · (1 − ∥w∥²)
Step 2: Set the gradient of the Lagrangian to zero:
∇w L(w, λ) = 0 ⇒ Σw = λw
∇λ L(w, λ) = 0 ⇒ ∥w∥² = 1
⇒ The PCA solution w is an eigenvector of Σ.
(Homework: Show that it is the eigenvector with the highest eigenvalue.)
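A minimal numpy sketch (hypothetical anisotropic data, using np.linalg.eigh as the eigensolver) illustrating this result: the variance attained by the top eigenvector of Σ equals its largest eigenvalue, and randomly drawn unit vectors never exceed it.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 500)) * np.array([[3.0], [1.0], [0.3]])  # anisotropic data
X = X - X.mean(axis=1, keepdims=True)
Sigma = X @ X.T / X.shape[1]                  # covariance matrix

eigvals, eigvecs = np.linalg.eigh(Sigma)      # eigenvalues in ascending order
w = eigvecs[:, -1]                            # eigenvector with the largest eigenvalue

print(w @ Sigma @ w, eigvals[-1])             # attained variance = top eigenvalue

# random unit vectors never exceed the top eigenvalue
V = rng.normal(size=(3, 1000))
V = V / np.linalg.norm(V, axis=0)
print(np.max(np.einsum('ik,ij,jk->k', V, Sigma, V)) <= eigvals[-1])   # True
```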
Solving PCA (more components)
PCA can be generalized to extracting h components w1,…,wh.
Conceptually, it can be described by the following algorithm, which repeatedly (i) computes the first principal component and (ii) removes it from the data.
Finding multiple PCA components:
for i = 1 to h do
    wi ← PC1(X̄)            (i)
    X̄ ← X̄ − wi wi⊤ X̄      (ii)
end for
It can be shown that the resulting sequence of principal components w1,…,wh corresponds to the leading eigenvectors of the matrix Σ.
Therefore, in practice, we just need to compute the leading eigenvectors of Σ, without an iterative procedure.
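In practice, this amounts to a single eigendecomposition. A minimal numpy sketch (hypothetical data, assuming h ≤ d):

```python
import numpy as np

def pca_components(Xbar, h):
    """Return the h leading principal components of the centered d x N data Xbar."""
    Sigma = Xbar @ Xbar.T / Xbar.shape[1]
    eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:h]      # indices of the h largest eigenvalues
    return eigvecs[:, order]                   # d x h matrix (w1 | ... | wh)

# usage on hypothetical centered data
rng = np.random.default_rng(0)
Xbar = rng.normal(size=(10, 200))
Xbar = Xbar - Xbar.mean(axis=1, keepdims=True)
W = pca_components(Xbar, h=3)
print(W.shape)   # (10, 3)
```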
Interpreting PCA
PCA rotates the data into a new coordinate system in which the directions of largest variance become the new coordinate axes.
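This interpretation can be checked directly: expressing the centered data in the basis of principal components yields coordinates whose covariance matrix is diagonal (numpy sketch on hypothetical 2d data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[4.0, 1.5], [1.5, 1.0]], size=500).T  # 2 x N
X = X - X.mean(axis=1, keepdims=True)

Sigma = X @ X.T / X.shape[1]
_, U = np.linalg.eigh(Sigma)                # columns = principal directions

Z = U.T @ X                                 # data in the rotated coordinate system
print(np.round(Z @ Z.T / Z.shape[1], 3))    # (approximately) diagonal covariance
```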
Stable PCA Computation via SVD
Singular Value Decomposition (SVD)
A singular value decomposition factorizes a matrix M as M = UΛV⊤, where
U contains the eigenvectors of MM⊤,
V contains the eigenvectors of M⊤M, and
the square roots of the eigenvalues of MM⊤ are on the diagonal of Λ.
Observation:
Setting M = (1/√N) X̄ and computing the SVD results in
a matrix U = (w1| . . . |wd) that contains the eigenvectors of Σ, i.e. the principal components, and
a matrix Λ = diag(√λ1, . . . , √λd) (but applying the square root does not change the ordering of the eigenvalues).
SVD is computationally faster and more stable than covariance computation + eigenvalue decomposition.
Comparing Classical PCA and SVD
Eigenvalues are the same.
PCA projection vector is the same (up to a sign flip).
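Both statements can be checked numerically. The sketch below (numpy, hypothetical data) compares the eigenvalues of Σ with the squared singular values of M = X̄/√N, and the leading eigenvector with the first left singular vector (up to a sign flip):

```python
import numpy as np

rng = np.random.default_rng(0)
Xbar = rng.normal(size=(5, 300))
Xbar = Xbar - Xbar.mean(axis=1, keepdims=True)
N = Xbar.shape[1]

# Classical PCA: eigendecomposition of the covariance matrix
Sigma = Xbar @ Xbar.T / N
eigvals, eigvecs = np.linalg.eigh(Sigma)
w_eig = eigvecs[:, -1]                        # leading eigenvector

# PCA via SVD of M = Xbar / sqrt(N)
U, s, Vt = np.linalg.svd(Xbar / np.sqrt(N), full_matrices=False)
w_svd = U[:, 0]                               # first left singular vector

print(np.allclose(np.sort(s**2), np.sort(eigvals)))               # same eigenvalues
print(np.allclose(w_eig, w_svd) or np.allclose(w_eig, -w_svd))    # same direction up to sign
```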
PCA Computation via SVD, limitations
The SVD has computational complexity of O(min(N2d,d2N)).
When d ≈ N, the complexity is roughly as bad as that of
computing the eigenvectors of Σ, that is, O(d³).
This makes the computation of principal components using SVD
prohibitive for large datasets.
In practice, we often need only the first few principal components, whereas SVD computes all of them.
Can we design a faster procedure that only extracts the first few principal components?
Power Iteration, Algorithm
Find the first component by applying the iteration
v(t) ← Σ w(t−1)
w(t) ← v(t) / ∥v(t)∥
T times, starting from a random (or any) unit-norm vector w(0).
Questions:
Does it always converge? Yes.
How fast does it converge? Exponentially fast.
(Homework: prove this.)
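A minimal sketch of the iteration (numpy, hypothetical data, with the number of iterations T fixed to 100 as an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
Xbar = rng.normal(size=(4, 200))
Xbar = Xbar - Xbar.mean(axis=1, keepdims=True)
Sigma = Xbar @ Xbar.T / Xbar.shape[1]   # covariance matrix

T = 100
w = rng.normal(size=Sigma.shape[0])
w = w / np.linalg.norm(w)               # random unit-norm start w(0)
for _ in range(T):
    v = Sigma @ w                       # v(t) <- Sigma w(t-1)
    w = v / np.linalg.norm(v)           # w(t) <- v(t) / ||v(t)||

# w now approximates the leading eigenvector of Sigma
print(w @ Sigma @ w, np.linalg.eigvalsh(Sigma)[-1])   # nearly identical values
```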
Power Iteration (more components)
The power iteration (POWIT) method only gives us the leading eigenvector. If we want further PCA components, we need to compute them iteratively.
Naive approach (cf. slide 17): recompute the covariance at each step
for i = 1 to h do
    Σ ← (1/N) X̄X̄⊤
    wi ← POWIT(Σ)
    X̄ ← X̄ − wi wi⊤ X̄
end for
Better approach: Work directly in covariance space
Σ ← (1/N) X̄X̄⊤
for i = 1 to h do
    wi ← POWIT(Σ)
    Σ ← Σ − wi wi⊤ Σ
end for
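A self-contained sketch of the better approach (numpy, hypothetical data; the function powit plays the role of POWIT and repeats the iteration from the previous sketch):

```python
import numpy as np

def powit(Sigma, T=100, rng=None):
    """Power iteration: approximate the leading eigenvector of Sigma."""
    rng = np.random.default_rng() if rng is None else rng
    w = rng.normal(size=Sigma.shape[0])
    w = w / np.linalg.norm(w)
    for _ in range(T):
        v = Sigma @ w
        w = v / np.linalg.norm(v)
    return w

def powit_pca(Xbar, h, T=100, rng=None):
    """Extract h principal components by power iteration with covariance deflation."""
    Sigma = Xbar @ Xbar.T / Xbar.shape[1]       # covariance computed only once
    W = []
    for _ in range(h):
        w = powit(Sigma, T=T, rng=rng)
        W.append(w)
        Sigma = Sigma - np.outer(w, w) @ Sigma  # deflation: Sigma <- Sigma - w w^T Sigma
    return np.stack(W, axis=1)                  # d x h matrix (w1 | ... | wh)

# usage on hypothetical centered data
rng = np.random.default_rng(0)
Xbar = rng.normal(size=(6, 300))
Xbar = Xbar - Xbar.mean(axis=1, keepdims=True)
print(powit_pca(Xbar, h=2, rng=rng).shape)      # (6, 2)
```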
Applications of PCA: Data Visualization
PCA can be used to visualize how examples of different classes (e.g. different species) are related, and whether they form clusters.
Canonical coordinates representing individual features can also be projected into the PCA space. This gives a 'biplot'.
Applications of PCA: Data Compression
Applications of PCA: Denoising
Applications of PCA: Artifact Removal
Beyond PCA
Many methods that extend PCA or relate to it have been developed:
Kernel PCA: Perform PCA in feature space (where input features are transformed nonlinearly). Gives better compression/denoising.
CCA, ICA, etc. These methods do not maximize variance but some other quantity of interest (e.g. correlation, kurtosis).
Autoencoder Networks. PCA is equivalent to a linear autoencoder network. The analysis can be made nonlinear by incorporating nonlinearities in the autoencoder.
Sparse Coding. Minimizes the reconstruction error subject to constraints in the projected space (e.g. sparsity).
Kernel PCA will be presented in ML1. Other approaches will be presented in ML2.
Summary
Learning algorithms can be unstable in high-dimensional spaces. An approach to alleviate this problem is to reduce the dimensionality of the data.
Principal Component Analysis is a dimensionality reduction technique that implements [Pearson 1901]’s principle of minimizing noise and maximizing signal. (It does both simultaneously!).
The framework of Lagrange multipliers shows us that the solution of PCA is an eigenvector of the data covariance.
Several methods exist to compute principal components (e.g. SVD, power iteration); which one to choose depends on the considered scenario, e.g. how large the data is and how many PCA components we need.
PCA has many applications (e.g. visualization, compression, denoising, artifact reduction).