Wojciech Samek & Grégoire Montavon
About myself
1. Interpretability & Explainability
2. Neural Network Compression
3. Federated Learning
4. Applications of Deep Learning
ML1 Lecture 3: Dimensionality Reduction and Principal Component Analysis
This Lecture
1. Dimensionality reduction
2. Principal Component Analysis
   1. What are principal components?
   2. How to find/calculate them
      1. Lagrange multipliers
   3. What can we do with them? / Applications
Curse of dimensionality

In many machine learning problems, the data are high-dimensional.
(From Limb and Braun, 2008, University of Colorado; IEEE Spectrum)

Standard regression/classification techniques can become
• ill-defined for M >> N
• ill-conditioned/numerically unstable even for M < N

Example: the discriminant direction w = Σ⁻¹(μ₁ − μ₂) requires inverting a covariance/scatter matrix Σ, which becomes singular or ill-conditioned in high dimensions.
Regularization

Idea: impose constraints on the parameters to stabilize the solution. One way: introduce a prior probability.

Linear regression setting: y = Xβ + ε,  ε ∼ N(0, σ²I),  β̂ = (XᵀX)⁻¹Xᵀy

Prior: β ∼ N(0, τ²I)

The MAP estimate under this Gaussian prior is the ridge solution β̂ = (XᵀX + λI)⁻¹Xᵀy with λ = σ²/τ², i.e. regularization = favor a "simple" model.
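A minimal numpy sketch of this MAP/ridge estimate on a synthetic ill-conditioned problem (the data, the dimensions, and the value of λ are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic ill-conditioned setting: more features (m) than samples (n)
n, m = 50, 200
X = rng.normal(size=(n, m))
beta_true = np.zeros(m)
beta_true[:5] = 1.0
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Ordinary least squares is ill-defined here because X^T X is singular.
# The MAP estimate with a Gaussian prior N(0, tau^2 I) is the ridge solution:
lam = 1.0  # lambda = sigma^2 / tau^2 (assumed value)
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)
print(beta_ridge[:5])
```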
Dimensionality reduction

Problem: we still have to invert an M-dimensional matrix → expensive.
Goal: reduce the data to the features most relevant for the learning task.

One way: significance tests for single features.
Another way: find "relevant" directions/subspaces in correlated data.
Why dimensionality reduction?
Principal Component Analysis (PCA)

Which line fits the data best? The line w that minimizes the noise and maximizes the signal [Pearson 1901].

[Figure: 2-D point cloud with a fitted line w; the spread along the line is the signal, the perpendicular distances to it are the noise. This line is the 1st principal component.]

Reference: Bishop, Pattern Recognition and Machine Learning, Chapter 12.1
Problem Definition

Given x₁, …, xₙ ∈ ℝᵈ, find a k-dimensional subspace so that the data projected onto that subspace
1) is as close to the original data as possible (minimum noise)
2) has maximum variance (maximum signal).

Two motivations: do they give the same result?

Assume the data are centered: μ = (1/n) Σ_{i=1}^n xᵢ = 0
Minimum distance

We are only interested in the direction of w, not its length, i.e. we set ||w|| = 1 (wᵀw = 1).

Reminder: the projection of a vector a onto b is (aᵀb/||b||) · b/||b||, i.e. it has length aᵀb/||b||.

min_w Σ_{i=1}^n ||(wᵀxᵢ)w − xᵢ||²
  = min_w Σ_{i=1}^n ( wᵀxᵢxᵢᵀw · wᵀw − 2 wᵀxᵢxᵢᵀw + xᵢᵀxᵢ )
  = min_w Σ_{i=1}^n ( wᵀxᵢxᵢᵀw − 2 wᵀxᵢxᵢᵀw )        (wᵀw = 1; the xᵢᵀxᵢ term does not depend on w)
  = max_w Σ_{i=1}^n wᵀxᵢxᵢᵀw
  = max_w wᵀXXᵀw

where X = [x₁, …, xₙ] is the d × n matrix of (centered) data points.
Maximum Variance

max_w Var(wᵀX) = max_w E[(wᵀX)²]                (X is centered)
  = max_w (1/n) Σ_{i=1}^n wᵀxᵢxᵢᵀw
  = max_w wᵀXXᵀw                                (same as minimum distance)

Introduce the scatter matrix S = XXᵀ. This leads to the constrained optimization problem:

max_w wᵀSw   s.t. ||w|| = 1
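A quick numerical sanity check (not from the slides) of why the two motivations coincide: for any unit vector w, the reconstruction error plus the projected sum of squares Σᵢ (wᵀxᵢ)² equals the total sum of squares, so minimizing the one maximizes the other.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 100))            # d x n data matrix
X = X - X.mean(axis=1, keepdims=True)    # center the data

w = rng.normal(size=5)
w = w / np.linalg.norm(w)                # random unit direction

proj = w @ X                                         # w^T x_i for all i
recon_error = np.sum((np.outer(w, proj) - X) ** 2)   # sum_i ||(w^T x_i) w - x_i||^2
signal = proj @ proj                                 # sum_i (w^T x_i)^2 = w^T X X^T w
total = np.sum(X ** 2)                               # sum_i ||x_i||^2

print(np.isclose(recon_error + signal, total))       # True for any unit w
```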
Lagrange Multipliers

Imagine you have a constrained optimization problem:

max_x f(x)   s.t. g(x) = 0

(read: maximize the objective function f(x) subject to the constraint g(x) = 0)

Example:

max_x 1 − x₁² − x₂²   s.t. x₁ + x₂ − 1 = 0

Reference: Bishop, Pattern Recognition and Machine Learning, Appendix E
Geometric intuition

max_x f(x)   s.t. g(x) = 0

Intuition 1: ∇g must be normal to the constraint line g(x, y) = c.

Intuition 2: At an optimum, f(x*, y*) cannot increase in the direction of a neighboring point where g(x, y) = c. Thus, at (x*, y*), ∇f must be perpendicular to the line g(x, y) = c.

It follows: at (x*, y*) the gradients ∇g and ∇f are parallel!

[Figure: contour lines of f(x, y) and the constraint line g(x, y) = c.]
The Lagrange Multiplier

At (x*, y*) the gradients ∇g and ∇f are parallel. Thus, for a maximum,

∇f = λ∇g

for some scalar λ, the Lagrange multiplier.

Lagrangian function:

L(x, λ) = f(x) + λ·g(x)

∇L = 0 is a necessary (but not sufficient) condition for a solution of the constrained optimization problem.

Thus, to solve a constrained optimization problem, we can define the Lagrangian and look for the points where its gradient vanishes.
Example

max_x 1 − x₁² − x₂²   s.t. x₁ + x₂ − 1 = 0

1. Define the Lagrangian:
L(x, λ) = 1 − x₁² − x₂² + λ(x₁ + x₂ − 1)

2. Set the gradient of the Lagrangian to zero:
−2x₁ + λ = 0
−2x₂ + λ = 0
x₁ + x₂ − 1 = 0

3. Solve the equations: (x₁*, x₂*) = (1/2, 1/2),  λ = 1
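Since ∇L = 0 is a linear system in this example, the solution can be checked numerically (a small sketch, not part of the slides):

```python
import numpy as np

# Stationarity conditions of L(x, lam) = 1 - x1^2 - x2^2 + lam*(x1 + x2 - 1):
#   -2*x1        + lam = 0
#         -2*x2  + lam = 0
#    x1  +  x2         = 1
A = np.array([[-2.0,  0.0, 1.0],
              [ 0.0, -2.0, 1.0],
              [ 1.0,  1.0, 0.0]])
b = np.array([0.0, 0.0, 1.0])

x1, x2, lam = np.linalg.solve(A, b)
print(x1, x2, lam)   # 0.5 0.5 1.0
```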
PCA optimization problem

max_w wᵀSw   s.t. ||w|| = 1

1. Define the Lagrangian:
L(w, λ) = wᵀSw + λ(1 − wᵀw)

2. Compute the gradient:
∂L(w, λ)/∂w = 2Sw − 2λw

3. Set it to zero:
Sw = λw

This is an eigenvalue problem!
Eigenvalue problems

Sw = λw   ⟺   (S − λI)w = 0

Solutions are found at the roots of the characteristic polynomial

p(λ) := det(S − λI) = 0

In general: d roots (eigenvalues) λ₁, …, λ_d with corresponding eigenvectors w₁, …, w_d.

This leads to the decomposition

WᵀSW = Λ   ⟺   S = WΛWᵀ

where W = [w₁, …, w_d] is orthogonal (WᵀW = I) and Λ is diagonal with Λᵢᵢ = λᵢ.
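A short numpy check of this decomposition for a scatter matrix (the data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(4, 200))
X = X - X.mean(axis=1, keepdims=True)   # center
S = X @ X.T                             # scatter matrix (symmetric, d x d)

eigvals, W = np.linalg.eigh(S)          # eigenvalues in ascending order
Lam = np.diag(eigvals)

print(np.allclose(W @ Lam @ W.T, S))    # S = W Lambda W^T
print(np.allclose(W.T @ W, np.eye(4)))  # W is orthogonal
```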
PCA algorithm

The i-th eigenvalue gives the variance in the direction of the i-th eigenvector:

Var(wᵢᵀx) = (1/n) wᵢᵀSwᵢ = (1/n) λᵢ wᵢᵀwᵢ = λᵢ/n

The direction of largest variance corresponds to the largest eigenvector (= the eigenvector with the largest eigenvalue).
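A quick numerical check of this relation (illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(4, 500))
X = X - X.mean(axis=1, keepdims=True)

S = X @ X.T
eigvals, W = np.linalg.eigh(S)           # ascending eigenvalues, matching columns of W

proj_var = np.var(W.T @ X, axis=1)       # empirical variance of w_i^T x over the samples
print(np.allclose(proj_var, eigvals / X.shape[1]))   # Var(w_i^T x) = lambda_i / n
```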
Solving PCA with SVD

A singular value decomposition factorizes a matrix as

M = UΣVᵀ

where
• the columns of U are the eigenvectors of MMᵀ,
• the columns of V are the eigenvectors of MᵀM,
• the square roots of the eigenvalues of MMᵀ are on the diagonal of Σ.

Instead of calculating the eigenvalue decomposition of S, we can compute the SVD of X. This is numerically much more stable (e.g. for the Läuchli matrix).
SVD in numpy
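A minimal sketch of PCA via numpy's SVD (illustrative; assuming a d × n centered data matrix X):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(4, 300))
X = X - X.mean(axis=1, keepdims=True)    # center the data

# Thin SVD of the data matrix: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Columns of U are the principal directions (eigenvectors of S = X X^T),
# and s**2 are the corresponding eigenvalues of S.
eigvals = np.linalg.eigvalsh(X @ X.T)
print(np.allclose(np.sort(s**2), np.sort(eigvals)))   # True
```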
Power iteration

The SVD has computational complexity O(min(n²d, d²n)).

But: we often only need the few largest principal components, not all of them. One option is power iteration: start from a random vector x and repeatedly apply x ← Ax / ||Ax||. The iteration converges to the first eigenvector of A.
Power iteration intuition

Assume a diagonalizable matrix A = UΛUᵀ.

After step k, power iteration has computed Aᵏx, where x is a random start vector.

We have Aᵏ = UΛᵏUᵀ because UᵀU = I.

Transform x into the new coordinate system: x̃ = Uᵀx. Then

Aᵏx = UΛᵏUᵀx = UΛᵏx̃ = Σ_{i=1}^d (λᵢᵏ x̃ᵢ) uᵢ = λ₁ᵏ Σ_{i=1}^d ((λᵢ/λ₁)ᵏ x̃ᵢ) uᵢ

⇒ Aᵏx / ||Aᵏx|| → u₁  as k → ∞,

because (λᵢ/λ₁)ᵏ → 0 for i > 1 when λ₁ > λ₂ ≥ ··· ≥ λ_d ≥ 0.
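A minimal numpy sketch of power iteration on a scatter matrix (the fixed iteration count and the data are illustrative choices):

```python
import numpy as np

def power_iteration(A, num_iters=1000, seed=0):
    """Approximate the leading eigenvalue/eigenvector of a symmetric matrix A."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=A.shape[0])
    for _ in range(num_iters):
        x = A @ x
        x = x / np.linalg.norm(x)
    return x @ A @ x, x          # Rayleigh quotient and direction

# Example: leading principal direction of a scatter matrix
rng = np.random.default_rng(5)
X = rng.normal(size=(4, 300))
X = X - X.mean(axis=1, keepdims=True)
S = X @ X.T

lam1, w1 = power_iteration(S)
print(np.isclose(lam1, np.linalg.eigvalsh(S)[-1]))   # matches the largest eigenvalue
```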
Deflation

To find the second eigenvector, remember that S = WΛWᵀ = Σ_{i=1}^d λᵢwᵢwᵢᵀ.

Thus, run power iteration on the deflated matrix Ŝ = S − λ₁w₁w₁ᵀ.
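A self-contained sketch of deflation combined with power iteration (illustrative data and iteration counts):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(4, 300))
X = X - X.mean(axis=1, keepdims=True)
S = X @ X.T

# Leading eigenpair via power iteration
w1 = rng.normal(size=4)
for _ in range(1000):
    w1 = S @ w1
    w1 /= np.linalg.norm(w1)
lam1 = w1 @ S @ w1

# Deflation: remove the first component, then iterate again for the second one
S_hat = S - lam1 * np.outer(w1, w1)
w2 = rng.normal(size=4)
for _ in range(1000):
    w2 = S_hat @ w2
    w2 /= np.linalg.norm(w2)

print(abs(w1 @ w2))   # ~0: the two principal directions are orthogonal
```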
Application: Dimensionality Reduction

We can reduce the dimensionality of our dataset by projecting it onto the first k principal components.

Projection onto k principal components:

(w₁ᵀx, …, wₖᵀx)ᵀ = [w₁ ··· wₖ]ᵀ x

How much signal are we going to lose? The discarded variance is

Σ_{i=k+1}^d Var(wᵢᵀx) = (1/n) Σ_{i=k+1}^d λᵢ

[Figure: signal and noise relative to the direction w; axis label: number of components.]
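A small numpy sketch of the projection and of the discarded variance (k and the data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(10, 500))           # d x n data
X = X - X.mean(axis=1, keepdims=True)

eigvals, W = np.linalg.eigh(X @ X.T)
eigvals, W = eigvals[::-1], W[:, ::-1]   # sort eigenpairs in descending order

k = 3
Z = W[:, :k].T @ X                       # k x n matrix of projections (w_i^T x)

discarded = eigvals[k:].sum() / X.shape[1]               # (1/n) * sum_{i>k} lambda_i
print(np.isclose(discarded, np.var(W[:, k:].T @ X, axis=1).sum()))   # True
```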
Application: Eigenfaces

Idea: faces look very similar compared to other random images. How many principal components would we need to accurately describe all faces?

A 64×64 pixel image of a face can be represented as a 4096-dimensional vector where each entry holds the pixel's grayscale value.

Reference: Turk, Matthew A. and Pentland, Alex P. Face recognition using eigenfaces. Computer Vision and Pattern Recognition, 1991.
Eigenfaces

The principal components are directions in our face space. Thus, each principal component is a face representation, too.

[Figures: eigenfaces for principal components 1, 2, 3 and 350; variance in the first 50 directions.]
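As an illustrative sketch, eigenfaces can be computed like this with the 64×64 Olivetti faces dataset shipped with scikit-learn (an assumption; not necessarily the dataset used in the lecture):

```python
import numpy as np
from sklearn.datasets import fetch_olivetti_faces

faces = fetch_olivetti_faces()           # 400 grayscale images of 64x64 = 4096 pixels
X = faces.data                           # shape (400, 4096), one face per row
X = X - X.mean(axis=0, keepdims=True)    # center

# Principal directions via SVD of the (n x d) data matrix: rows of Vt are the eigenfaces
U, s, Vt = np.linalg.svd(X, full_matrices=False)

eigenface_1 = Vt[0].reshape(64, 64)      # first principal component, viewable as an image
explained = s**2 / np.sum(s**2)
print(explained[:50].sum())              # fraction of variance in the first 50 directions
```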
Application: Denoising

We can use PCA to denoise data:

Step 1: Reduce the dimensionality to filter out "noise" directions: (w₁ᵀx, …, wₖᵀx)ᵀ

Step 2: Project back into the original space: Σ_{i=1}^k (wᵢᵀx)wᵢ

Step 3: Undo the centering: Σ_{i=1}^k (wᵢᵀx)wᵢ + (1/n) Σ_{i=1}^n xᵢ
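A compact numpy sketch of these three steps (the data, the noise level and k are illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)

# Low-dimensional signal embedded in d dimensions, plus isotropic noise
n, d, k = 500, 20, 3
basis = np.linalg.qr(rng.normal(size=(d, k)))[0]
X = basis @ rng.normal(size=(k, n)) + 0.2 * rng.normal(size=(d, n))

mu = X.mean(axis=1, keepdims=True)
Xc = X - mu                                   # center

eigvals, W = np.linalg.eigh(Xc @ Xc.T)
W_k = W[:, ::-1][:, :k]                       # top-k principal directions

Z = W_k.T @ Xc                                # Step 1: project onto k components
X_denoised = W_k @ Z + mu                     # Steps 2 & 3: back-project and undo the centering
```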
[Figure: denoising example, from Mann et al., 2013.]
Application: EEG artifacts

In electroencephalographic (EEG) recordings, eye-blink artifacts can be stronger than the neuronal activity (example data: BCI2000).

→ It is reasonable to remove the first principal components.
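A sketch of this idea, i.e. projecting out the leading components rather than keeping them (the toy data, the channel count and the number of removed components are illustrative):

```python
import numpy as np

rng = np.random.default_rng(9)

# Toy 'EEG': 32 channels x 2000 samples with one strong shared artifact component
n_channels, n_samples = 32, 2000
artifact = 50.0 * rng.normal(size=(1, n_samples))
X = rng.normal(size=(n_channels, n_samples)) + rng.normal(size=(n_channels, 1)) @ artifact

mu = X.mean(axis=1, keepdims=True)
Xc = X - mu

eigvals, W = np.linalg.eigh(Xc @ Xc.T)
W = W[:, ::-1]                                 # descending order of eigenvalues

r = 1                                          # number of leading components to remove
X_clean = W[:, r:] @ (W[:, r:].T @ Xc) + mu    # keep everything except the first r directions
```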
How many principal components?
Heuristics
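One common heuristic (an illustration; the lecture may use a different one) is to keep enough components to explain a chosen fraction of the total variance:

```python
import numpy as np

rng = np.random.default_rng(10)
X = rng.normal(size=(30, 1000))
X = X - X.mean(axis=1, keepdims=True)

eigvals = np.linalg.eigvalsh(X @ X.T)[::-1]      # eigenvalues in descending order
explained = np.cumsum(eigvals) / eigvals.sum()   # cumulative explained variance ratio

threshold = 0.95                                 # assumed target fraction
k = int(np.searchsorted(explained, threshold)) + 1
print(k, explained[k - 1])                       # smallest k reaching the threshold
```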
How robust is PCA to outliers?
Not very robust: a few large outliers can dominate the scatter matrix and strongly rotate the leading principal directions.