
Wojciech Samek & Grégoire Montavon

About myself

1. Interpretability & Explainability
2. Neural Network Compression
3. Federated Learning
4. Applications of Deep Learning

ML1 Lecture 3: Dimensionality Reduction and Principal Component Analysis

This Lecture

1. Dimensionality reduction
2. Principal Component Analysis
   1. What are principal components?
   2. How to find/calculate them
      1. Lagrange multipliers
   3. What can we do with them? / Applications

Curse of dimensionality

In many machine learning problems, the data are high-dimensional.
[Figure; credit: Limb and Braun, 2008, University of Colorado / IEEE Spectrum]

Standard regression/classification techniques can become
• ill-defined for M >> N (many more feature dimensions M than samples N)
• ill-conditioned/numerically unstable even for M < N

Example: a discriminant direction of the form w = Σ^{-1}(μ_1 − μ_2) requires inverting a covariance estimate Σ that becomes singular or ill-conditioned in high dimensions.
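To make the instability concrete, here is a minimal numpy sketch (not from the original slides) that reports the rank and condition number of XᵀX for a fixed number of samples N and a growing number of feature dimensions M; the Gaussian toy data and the particular values of M and N are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 50                      # number of samples
for M in (5, 40, 50, 200):  # number of feature dimensions
    X = rng.standard_normal((N, M))
    XtX = X.T @ X           # the M x M matrix that least squares has to invert
    # rank < M means X^T X is singular (M >> N); a huge condition number
    # means the inversion is numerically unstable (even for M < N)
    print(f"M={M:3d}  rank={np.linalg.matrix_rank(XtX):3d}  "
          f"cond={np.linalg.cond(XtX):.2e}")
```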
Regularization

Idea: impose constraints on the parameters to stabilize the solution. One way: introduce a prior probability over the parameters.

Linear regression setting:

  y = Xβ + ε,   ε ~ N(0, σ²I),   β̂ = (XᵀX)^{-1} Xᵀ y

Prior: β ~ N(0, τ²I), i.e. favor a "simple" model with small weights. The MAP estimate under this prior is the ridge-regression solution β̂ = (XᵀX + λI)^{-1} Xᵀ y with λ = σ²/τ²; the matrix XᵀX + λI is always invertible.

Dimensionality reduction

Problem: we still have to invert an M-dimensional matrix → expensive.
Goal: reduce the data to the features most relevant for the learning task.
One way: a significance test for single features.
Another way: find "relevant" directions/subspaces in correlated data.

Principal Component Analysis (PCA)

Which line fits the data best? The line w that minimizes the noise (the distance of the points to the line) and maximizes the signal (the variance along the line) [Pearson 1901]. This direction is the first principal component.
(Reference: Bishop, Pattern Recognition and Machine Learning, Chapter 12.1)

Problem definition

Given x_1, ..., x_n ∈ R^d, find a k-dimensional subspace such that the data projected onto that subspace
1) is as close to the original data as possible (minimum noise),
2) has maximum variance (maximum signal).
Do these two motivations lead to the same result? Assume the data are centered: μ = (1/n) Σ_{i=1}^n x_i = 0.

Minimum distance

We are only interested in the direction of w, not its length, so we set ||w||² = wᵀw = 1. Let X = [x_1, ..., x_n] be the d×n matrix of centered data points. (Reminder: the projection of a onto b is (aᵀb/||b||) · b/||b||, so the projection of x_i onto the unit vector w is (wᵀx_i)w.) Then

  min_w Σ_{i=1}^n ||(wᵀx_i)w − x_i||²
    = min_w Σ_{i=1}^n ( wᵀx_i x_iᵀ w · wᵀw − 2 wᵀx_i x_iᵀ w + x_iᵀx_i )
    = min_w Σ_{i=1}^n ( − wᵀx_i x_iᵀ w ) + const      [using wᵀw = 1]
    = max_w Σ_{i=1}^n wᵀx_i x_iᵀ w
    = max_w wᵀ X Xᵀ w

Maximum variance

  max_w Var(wᵀX) = E[(wᵀX)²] = (1/n) Σ_{i=1}^n wᵀx_i x_iᵀ w = (1/n) wᵀ X Xᵀ w

(X is centered, and the constant factor 1/n does not change the maximizer, so this is the same objective as for minimum distance.)

Introduce the scatter matrix S = XXᵀ. This leads to the constrained optimization problem

  max_w wᵀSw   s.t.   ||w|| = 1

Lagrange multipliers

Imagine you have a constrained optimization problem

  max_x f(x)   s.t.   g(x) = 0

with objective function f and constraint g ("s.t." reads "subject to").

Example:  max_x 1 − x_1² − x_2²   s.t.   x_1 + x_2 − 1 = 0
(Reference: Bishop, Pattern Recognition and Machine Learning, Appendix E)

Geometric intuition

Intuition 1: ∇g must be normal to the constraint curve g(x, y) = c.
Intuition 2: At an optimum (x*, y*), f cannot increase in the direction of a neighboring point that also satisfies g(x, y) = c. Thus, at (x*, y*), ∇f must also be perpendicular to the constraint curve.
It follows that at (x*, y*) the gradients ∇f and ∇g are parallel.
[Figure: contour lines of f(x, y) together with the constraint curve]

The Lagrange multiplier

At (x*, y*) the gradients ∇f and ∇g are parallel, i.e. ∇f + λ∇g = 0 for some scalar λ, the Lagrange multiplier. Define the Lagrangian function

  L(x, λ) = f(x) + λ g(x).

∇L = 0 is a necessary (but not sufficient) condition for the solution of the constrained problem: the derivatives with respect to x give the parallelism condition, the derivative with respect to λ recovers the constraint g(x) = 0. Thus, to solve a constrained optimization problem, we can define the Lagrangian and look for the points where its gradient vanishes.

Example:  max_x 1 − x_1² − x_2²   s.t.   x_1 + x_2 − 1 = 0

1. Define the Lagrangian:  L(x, λ) = 1 − x_1² − x_2² + λ(x_1 + x_2 − 1)
2. Set the gradient of the Lagrangian to zero:
     −2x_1 + λ = 0
     −2x_2 + λ = 0
     x_1 + x_2 − 1 = 0
3. Solve the equations:  (x_1*, x_2*) = (1/2, 1/2),  λ = 1
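As a sanity check, the same toy problem can also be handed to a symbolic solver. The following sketch (my own; it uses sympy, which is not part of the lecture material) implements exactly the three steps above and should return x_1 = x_2 = 1/2 and λ = 1.

```python
import sympy as sp

x1, x2, lam = sp.symbols("x1 x2 lam", real=True)
f = 1 - x1**2 - x2**2          # objective
g = x1 + x2 - 1                # constraint g(x) = 0

# Step 1: define the Lagrangian L = f + lam * g
L = f + lam * g
# Steps 2 and 3: set the gradient of L to zero and solve
sol = sp.solve([sp.diff(L, v) for v in (x1, x2, lam)], (x1, x2, lam), dict=True)
print(sol)                     # [{x1: 1/2, x2: 1/2, lam: 1}]
```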
PCA optimization problem

  max_w wᵀSw   s.t.   ||w|| = 1

1. Define the Lagrangian:  L(w, λ) = wᵀSw + λ(1 − wᵀw)
2. Compute the gradient:   ∂L(w, λ)/∂w = 2Sw − 2λw
3. Set it to zero:         Sw = λw

This is an eigenvalue problem!

Eigenvalue problems

  Sw = λw   ⟺   (S − λI)w = 0

Solutions are found at the roots of the characteristic polynomial

  p(λ) := det(S − λI) = 0.

In general there are d roots (eigenvalues) λ_1, ..., λ_d with corresponding eigenvectors w_1, ..., w_d. This leads to the decomposition

  WᵀSW = Λ,   i.e.   S = WΛWᵀ,

where W = [w_1, ..., w_d] is orthogonal (WᵀW = I) and Λ is diagonal with Λ_ii = λ_i.

PCA algorithm

The i-th eigenvalue (divided by n) is the variance in the direction of the i-th eigenvector:

  Var(w_iᵀx) = (1/n) w_iᵀ S w_i = (1/n) λ_i w_iᵀ w_i = (1/n) λ_i.

The direction of largest variance therefore corresponds to the eigenvector with the largest eigenvalue.

Solving PCA with SVD

A singular value decomposition factorizes a matrix as

  M = UΣVᵀ,

where the columns of U are the eigenvectors of MMᵀ, the columns of V are the eigenvectors of MᵀM, and the square roots of the eigenvalues of MMᵀ (equivalently MᵀM) are on the diagonal of Σ.

Instead of calculating the eigenvalue decomposition of S = XXᵀ, we can therefore compute the SVD of X. This is computationally much more stable (e.g. for the Läuchli matrix, where forming XXᵀ explicitly loses precision). In numpy the SVD is available as np.linalg.svd; see the sketch after this section.

Power iteration

The SVD has computational complexity O(min(n²d, d²n)). But we often only need the few largest principal components, not all of them. The power iteration repeatedly applies A to a (random) start vector and renormalizes,

  w ← Aw / ||Aw||,

and converges to the first eigenvector of A.

Power iteration: intuition

Assume a diagonalizable (symmetric) matrix A = UΛUᵀ. After k steps the power iteration has computed A^k x, where x is a random start vector. We have A^k = UΛ^kUᵀ because UᵀU = I. Transforming x into the eigenbasis, x̃ = Uᵀx,

  A^k x = UΛ^kUᵀx = UΛ^k x̃ = Σ_{i=1}^d (λ_i^k x̃_i) u_i = λ_1^k Σ_{i=1}^d ((λ_i/λ_1)^k x̃_i) u_i.

Hence A^k x / ||A^k x|| → u_1 as k → ∞, because (λ_i/λ_1)^k → 0 for λ_1 > λ_2 ≥ ... ≥ λ_d ≥ 0.

Deflation

To find the second eigenvector, remember that S = WΛWᵀ = Σ_{i=1}^d λ_i w_i w_iᵀ. Thus, run the power iteration on the deflated matrix Ŝ = S − λ_1 w_1 w_1ᵀ.
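The original "SVD in numpy" slide contained code that is not reproduced here. As a substitute, the following is a minimal sketch, under the conventions used above (X is d × n and already centered), that computes the leading principal components once with np.linalg.svd and once with power iteration plus deflation; the toy data and the iteration count are arbitrary.

```python
import numpy as np

def pca_svd(X, k):
    """First k principal directions of centered data X (d x n) via the SVD."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    # columns of U are the eigenvectors of S = X X^T, s**2 its eigenvalues
    return U[:, :k], s[:k] ** 2

def power_iteration(A, num_iter=500, seed=0):
    """Leading eigenvector/eigenvalue of a symmetric PSD matrix A."""
    w = np.random.default_rng(seed).standard_normal(A.shape[0])
    for _ in range(num_iter):
        w = A @ w
        w /= np.linalg.norm(w)           # w <- A w / ||A w||
    return w, w @ A @ w                  # Rayleigh quotient = eigenvalue

def pca_power(X, k, num_iter=500):
    """First k principal directions via power iteration + deflation."""
    S = X @ X.T                          # scatter matrix (d x d)
    ws, lams = [], []
    for _ in range(k):
        w, lam = power_iteration(S, num_iter)
        ws.append(w)
        lams.append(lam)
        S = S - lam * np.outer(w, w)     # deflation: S_hat = S - lam_1 w_1 w_1^T
    return np.stack(ws, axis=1), np.array(lams)

# toy data: d = 5 dimensions, n = 200 centered samples
rng = np.random.default_rng(1)
X = rng.standard_normal((5, 200)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])[:, None]
X = X - X.mean(axis=1, keepdims=True)

W_svd, lam_svd = pca_svd(X, k=2)
W_pow, lam_pow = pca_power(X, k=2)
# both give the same directions (up to sign) and the same eigenvalues
print(np.abs(np.diag(W_svd.T @ W_pow)))  # ~ [1, 1]
print(lam_svd, lam_pow)
```

Note that the two methods may return each direction with opposite sign, since eigenvectors are only determined up to ±1; the eigenvalues themselves must agree.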
Application: dimensionality reduction

We can reduce the dimensionality of our dataset by projecting it onto the first k principal components:

  (w_1ᵀx, ..., w_kᵀx)ᵀ = [w_1 ··· w_k]ᵀ x.

How much signal do we lose? Exactly the variance in the discarded directions:

  Σ_{i=k+1}^d Var(w_iᵀx) = (1/n) Σ_{i=k+1}^d λ_i.

[Figure: retained variance as a function of the number of components]

Application: eigenfaces

Idea: faces look very similar compared to other random images. How many principal components would we need to accurately describe all faces? A 64×64 pixel image of a face can be represented as a 4096-dimensional vector in which each entry holds a pixel's grayscale value. The principal components are directions in this face space, so each principal component can itself be displayed as a face-like image (an "eigenface").
[Figure: principal components 1, 2, 3 and 350 of a face dataset; variance in the first 50 directions]
(Reference: Turk and Pentland, Face recognition using eigenfaces, CVPR 1991)

Application: denoising

We can use PCA to denoise data:
Step 1: Reduce the dimensionality to filter out the "noise" directions: (w_1ᵀx, ..., w_kᵀx)ᵀ
Step 2: Project back into the original space: Σ_{i=1}^k (w_iᵀx) w_i
Step 3: Undo the centering: Σ_{i=1}^k (w_iᵀx) w_i + (1/n) Σ_{i=1}^n x_i
[Figure: denoising example from Mann et al., 2013]

Application: EEG artifacts

In electroencephalographic (EEG) recordings, eye-blink artifacts can be stronger than the neuronal activity (example: BCI2000 data). It is therefore reasonable to remove the first principal components before further analysis.

How many principal components?

There is no universal answer; in practice one relies on heuristics, e.g. inspecting the eigenvalue spectrum (the variance explained as a function of the number of components).

How robust is PCA to outliers?

Not very robust: since PCA is driven by the scatter matrix, a few extreme points can dominate the variance and rotate the leading components.
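To tie the application slides together, here is a small end-to-end sketch (my own, not from the slides) of fitting PCA, projecting onto k components, reconstructing ("denoising") and reporting the explained variance; it again assumes data points are the columns of X, and function names such as pca_fit are illustrative only.

```python
import numpy as np

def pca_fit(X_raw, k):
    """Return mean, first k principal directions, and all eigenvalues of S."""
    mu = X_raw.mean(axis=1, keepdims=True)
    X = X_raw - mu                               # center the data
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return mu, U[:, :k], s ** 2                  # s**2 = eigenvalues of S = X X^T

def pca_transform(X_raw, mu, W):
    """Project data onto the principal components: z_i = W^T (x_i - mu)."""
    return W.T @ (X_raw - mu)

def pca_reconstruct(Z, mu, W):
    """Map back to the original space and undo the centering: x_hat = W z + mu."""
    return W @ Z + mu

# toy example: 10 dimensions with decreasing variance, 300 samples
rng = np.random.default_rng(0)
X_raw = rng.standard_normal((10, 300)) * np.linspace(3, 0.1, 10)[:, None]

mu, W, lam = pca_fit(X_raw, k=2)
Z = pca_transform(X_raw, mu, W)        # k x n low-dimensional representation
X_hat = pca_reconstruct(Z, mu, W)      # denoised / compressed reconstruction

# how much signal do we keep? variance in the retained directions
explained = lam[:2].sum() / lam.sum()
print(f"explained variance with k=2: {explained:.2%}")
print("relative reconstruction error:",
      np.linalg.norm(X_raw - X_hat) / np.linalg.norm(X_raw))
```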