
Lecture 19. Dimensionality Reduction
COMP90051 Statistical Machine Learning
Semester 2, 2019 Lecturer: Ben Rubinstein
Copyright: University of Melbourne

This lecture
• Principal components analysis
∗ Linear dimensionality reduction method
∗ Diagonalising covariance matrix
• Kernel PCA

True dimensionality of data?
[Figure omitted: illustration of the true dimensionality of data. Image adapted from Wikipedia, original image: Olivier Grisel]

Dimensionality reduction
• Previously in unsupervised learning: clustering
• Dimensionality reduction refers to representing the data using a smaller number of variables (dimensions) while preserving the “interesting” structure of the data
• Such a reduction can serve several purposes
∗ Visualisation (e.g., mapping multidimensional data to 2D)
∗ Computational efficiency in a pipeline
∗ Data compression or statistical efficiency in a pipeline

Exploiting data structure
• Dimensionality reduction in general results in a loss of information
• The trick is to ensure that most of the “interesting” information (signal) is preserved, while what is lost is mostly noise
• This is often possible because real data may have inherently fewer dimensions than recorded variables
• Example: GPS coordinates are 3D, while car locations on a flat road actually lie on a 2D manifold
• Example: Marks* for Knowledge Technology and Statistical Machine Learning
* synthetic data 🙂
[Scatter plot: SML mark vs. KT mark; see the sketch below]
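To make this concrete, here is a minimal NumPy sketch (an illustration only, not the lecture's synthetic marks data; the factor model and variable names are assumptions) in which the two recorded marks are driven by a single underlying factor, so almost all of the variance lies along one direction:

```python
import numpy as np

# Hypothetical synthetic marks: one underlying "ability" factor drives both
# recorded variables, so the 2-D point cloud is essentially 1-D plus noise.
rng = np.random.default_rng(42)
ability = rng.uniform(50, 95, size=300)           # single underlying factor
kt_mark = ability + rng.normal(0, 3, size=300)    # Knowledge Technology mark
sml_mark = ability + rng.normal(0, 3, size=300)   # Statistical ML mark

marks = np.vstack([kt_mark, sml_mark])            # 2 variables x 300 points
cov = np.cov(marks)                               # 2 x 2 covariance matrix
eigvals = np.linalg.eigvalsh(cov)[::-1]           # variances along the axes
print(eigvals[0] / eigvals.sum())                 # roughly 0.97: one direction
                                                  # carries nearly all variance
```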

Principal Component Analysis
Finding a rotation of data that minimises covariance between (new) variables

Principal component analysis
• Principal component analysis (PCA) is a popular method for dimensionality reduction and data analysis in general
• Given a dataset $\mathbf{x}_1, \dots, \mathbf{x}_n$, $\mathbf{x}_i \in \mathbb{R}^m$, PCA aims to find a new coordinate system such that most of the variance is concentrated along the first coordinate, then most of the remaining variance along the second coordinate, etc.
• Dimensionality reduction is then based on discarding all coordinates except the first $l < m$

[Figure: original data is 2D → find new axes (new coordinate system, still 2D) → keep only the first coordinate]

Naïve PCA algorithm
1. Choose a direction as the new axis, such that the variance along this axis is maximised
2. Choose the next direction/axis perpendicular to all axes so far, such that the (remaining) variance along this axis is maximised
3. Repeat 2, until you have the same number of axes (i.e., dimensions) as in the original data
4. Project the original data onto the new axes. This gives new coordinates (“PCA coordinates”)
5. For each point, keep only the first $l$ coordinates
Such an algorithm, if implemented directly, would work, but there is a better solution

Formulating the problem
• The core of PCA is finding the new coordinate system, such that most of the variation is captured by “earlier” axes
• Let’s write down this aim formally and see how it can be achieved
• First, recall the geometric definition of a dot product: $\mathbf{u} \cdot \mathbf{v} = u_{\mathbf{v}} \|\mathbf{v}\|$, where $u_{\mathbf{v}} = \|\mathbf{u}\|\cos\theta$ is the length of the projection of $\mathbf{u}$ onto $\mathbf{v}$ and $\theta$ is the angle between them
• Suppose $\|\mathbf{v}\| = 1$, so $\mathbf{u} \cdot \mathbf{v} = u_{\mathbf{v}}$
• Vector $\mathbf{v}$ can be considered a candidate coordinate axis, and $u_{\mathbf{v}}$ the coordinate of point $\mathbf{u}$ along that axis

Data transformation
• So the new coordinate system is a set of vectors $\mathbf{p}_1, \dots, \mathbf{p}_m$, where each $\|\mathbf{p}_i\| = 1$
• Consider an original data point $\mathbf{x}_j$, $j = 1, \dots, n$, and a principal axis $\mathbf{p}_i$, $i = 1, \dots, m$
• The corresponding $i$-th coordinate of the first point after the transformation is $\mathbf{p}_i'\mathbf{x}_1$
∗ For the second point it is $\mathbf{p}_i'\mathbf{x}_2$, etc.
• Collate all these numbers into a vector $(\mathbf{p}_i'\mathbf{x}_1, \dots, \mathbf{p}_i'\mathbf{x}_n)' = (\mathbf{p}_i'\mathbf{X})' = \mathbf{X}'\mathbf{p}_i$, where $\mathbf{X}$ has the original data points in columns

Refresher: basic stats
• Consider a random variable $U$ and the corresponding sample $\mathbf{u} = (u_1, \dots, u_n)'$
• Sample mean $\bar{u} \equiv \frac{1}{n}\sum_{i=1}^n u_i$. Sample variance $\frac{1}{n-1}\sum_{i=1}^n (u_i - \bar{u})^2$
• Suppose the mean was subtracted beforehand (the sample is centered). In this case, the variance is a scaled dot product $\frac{1}{n-1}\mathbf{u}'\mathbf{u}$
• Similarly, if we have a centered random sample $\mathbf{v}$ from another random variable, the sample covariance is $\frac{1}{n-1}\mathbf{u}'\mathbf{v}$
• Finally, if our data $\mathbf{x}_1 = (u_1, v_1)', \dots, \mathbf{x}_n = (u_n, v_n)'$ is organised into a matrix $\mathbf{X}$ with data points in columns and centered variables in rows, the covariance matrix is $\mathbf{\Sigma}_X \equiv \frac{1}{n-1}\mathbf{X}\mathbf{X}'$

The objective of PCA
• We shall assume that the data is centered
• Let’s start with the objective for the first principal axis. The data projected onto this axis is $\mathbf{X}'\mathbf{p}_1$
• Accordingly, the variance along this principal axis is $\frac{1}{n-1}(\mathbf{X}'\mathbf{p}_1)'(\mathbf{X}'\mathbf{p}_1) = \frac{1}{n-1}\mathbf{p}_1'\mathbf{X}\mathbf{X}'\mathbf{p}_1 = \mathbf{p}_1'\mathbf{\Sigma}_X\mathbf{p}_1$
∗ Here $\mathbf{\Sigma}_X$ is the covariance matrix of the original data
• PCA should therefore find $\mathbf{p}_1$ to maximise $\mathbf{p}_1'\mathbf{\Sigma}_X\mathbf{p}_1$, subject to $\|\mathbf{p}_1\| = 1$
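To make the objective concrete, here is a minimal NumPy sketch (not part of the subject materials; the synthetic data and names are assumptions) checking that, for centred data with points in columns, the sample variance of the projected coordinates $\mathbf{X}'\mathbf{p}$ equals the quadratic form $\mathbf{p}'\mathbf{\Sigma}_X\mathbf{p}$ for a unit-length direction $\mathbf{p}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: m = 2 variables in rows, n = 500 points in columns,
# matching the lecture's convention that data points sit in the columns of X.
n = 500
X = rng.multivariate_normal([0.0, 0.0], [[3.0, 1.2], [1.2, 1.0]], size=n).T
X = X - X.mean(axis=1, keepdims=True)            # centre each variable

Sigma_X = (X @ X.T) / (n - 1)                    # covariance matrix (2 x 2)

p = np.array([2.0, 1.0])
p = p / np.linalg.norm(p)                        # candidate unit-length axis

projected = X.T @ p                              # coordinates X'p along the axis
var_projected = projected @ projected / (n - 1)  # sample variance of projection
quadratic_form = p @ Sigma_X @ p                 # p' Sigma_X p

print(var_projected, quadratic_form)             # the two values agree
```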
Solving the optimisation
• PCA aims to find $\mathbf{p}_1$ that maximises $\mathbf{p}_1'\mathbf{\Sigma}_X\mathbf{p}_1$, subject to $\|\mathbf{p}_1\| = \mathbf{p}_1'\mathbf{p}_1 = 1$
• Constrained optimisation → Lagrange multipliers. Introduce multiplier $\lambda_1$; set the derivatives of the Lagrangian to zero, then solve
• $L = \mathbf{p}_1'\mathbf{\Sigma}_X\mathbf{p}_1 - \lambda_1(\mathbf{p}_1'\mathbf{p}_1 - 1)$
• $\frac{\partial L}{\partial \mathbf{p}_1} = 2\mathbf{\Sigma}_X\mathbf{p}_1 - 2\lambda_1\mathbf{p}_1 = 0$
• $\mathbf{\Sigma}_X\mathbf{p}_1 = \lambda_1\mathbf{p}_1$
• This precisely defines $\mathbf{p}_1$ as an eigenvector of $\mathbf{\Sigma}_X$, with $\lambda_1$ being the corresponding eigenvalue

Refresher on eigenvectors (1/2)
• Given a square matrix $\mathbf{A}$, a column vector $\mathbf{e}$ is called an eigenvector if $\mathbf{A}\mathbf{e} = \lambda\mathbf{e}$. Here $\lambda$ is the corresponding eigenvalue
• Geometric interpretation: compare $\mathbf{A}\mathbf{e}$ with $\mathbf{P}\mathbf{x}_i$ from the previous slides. Here $\mathbf{A}$ is a transformation matrix (“new axes”) applied to some vector $\mathbf{e}$. Vector $\mathbf{e}$ is such that it still points in the same direction after the transformation
[Figure adapted from Lantonov at Wikimedia Commons (CC4)]

Refresher on eigenvalues (2/2)
• Algebraic interpretation: if $\mathbf{A}\mathbf{e} = \lambda\mathbf{e}$ then $(\mathbf{A} - \lambda\mathbf{I})\mathbf{e} = 0$, where $\mathbf{I}$ is the identity matrix
• This equation has a non-zero solution $\mathbf{e}$ if and only if the determinant is zero: $|\mathbf{A} - \lambda\mathbf{I}| = 0$. Eigenvalues are the roots of this equation, called the characteristic equation
• Eigenvectors and eigenvalues are prominent concepts in linear algebra and arise in many practical applications
• The spectrum of a matrix is the set of its eigenvalues
• There are efficient algorithms for computing eigenvectors (not covered)

Variance captured by PCs
• In summary: we choose $\mathbf{p}_1$ as the eigenvector of the centered covariance matrix $\mathbf{\Sigma}_X$ with the largest eigenvalue
• Variance of data captured by $\mathbf{p}_1$:
∗ Note we have shown $\lambda_1 = \mathbf{p}_1'\mathbf{\Sigma}_X\mathbf{p}_1$
∗ But $\mathbf{p}_1'\mathbf{\Sigma}_X\mathbf{p}_1$ is the variance of the projected data
∗ So the first eigenvalue is the variance captured by the first PC
• Choose the number of dimensions to keep from the “knee” in the scree plot

Efficient solution for PCA
• The same pattern can be used to find all PCs
∗ The constraint $\|\mathbf{p}_i\| = 1$ prevents the variance $\mathbf{p}_i'\mathbf{\Sigma}_X\mathbf{p}_i$ from diverging by rescaling $\mathbf{p}_i$
∗ Each time we add the additional constraint that the next PC be orthogonal to all previous PCs. Equivalently, we search in their complement.
• The solution is to set $\mathbf{p}_1, \dots, \mathbf{p}_m$ as the eigenvectors of the centered data covariance matrix $\mathbf{\Sigma}_X$, in decreasing order of eigenvalue
• Is this really possible with any $\mathbf{\Sigma}_X$?
• Lemma: a real symmetric $m \times m$ matrix has $m$ real eigenvalues, and the corresponding eigenvectors are orthogonal
• Lemma: a PSD matrix further has non-negative eigenvalues

Summary of PCA (1/2)
• Assume the data points are arranged in the columns of $\mathbf{X}$. That means the variables are in rows
• Ensure that the data is centered: subtract the data centroid (the mean data point) from each data point, so that every variable (row) has zero mean
• We seek an orthonormal basis $\mathbf{p}_1, \dots, \mathbf{p}_m$
∗ Each basis vector is of unit length and perpendicular to every other
• Find the eigenvalues of the centered data covariance matrix $\mathbf{\Sigma}_X \equiv \frac{1}{n-1}\mathbf{X}\mathbf{X}'$
∗ Always possible, relatively efficiently

Summary of PCA (2/2)
• Sort the eigenvalues from largest to smallest
∗ Each eigenvalue equals the variance of the data along the corresponding PC
• Set $\mathbf{p}_1, \dots, \mathbf{p}_m$ as the corresponding eigenvectors
• Project the data $\mathbf{X}$ onto these new axes to get the coordinates of the transformed data
• Keep only the first $s$ coordinates to reduce dimensionality
• Another view of PCA: the $s$-dimensional plane minimising the residual sum of squares to the data (this plane is exactly spanned by the $s$ chosen PCs)
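The summary above maps directly onto a short implementation. The following is a minimal NumPy sketch (illustrative only, not the subject's reference code; the function name `pca` and the toy data are assumptions):

```python
import numpy as np

def pca(X, s):
    """PCA via eigendecomposition of the covariance matrix.

    X: (m, n) array with data points in columns and variables in rows.
    s: number of principal components to keep.
    Returns the (s, n) transformed coordinates, the (s, m) principal axes
    (as rows), and all m eigenvalues (variances along the PCs).
    """
    m, n = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)    # centre each variable
    Sigma = (Xc @ Xc.T) / (n - 1)             # m x m covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # real, since Sigma is symmetric
    order = np.argsort(eigvals)[::-1]         # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    P = eigvecs.T                             # principal axes in rows
    Z = P[:s] @ Xc                            # project, keep first s coordinates
    return Z, P[:s], eigvals

# Toy data: 3 recorded variables driven by 2 underlying factors plus noise.
rng = np.random.default_rng(1)
latent = rng.normal(size=(2, 200))
X = rng.normal(size=(3, 2)) @ latent + 0.05 * rng.normal(size=(3, 200))
Z, P, eigvals = pca(X, s=2)
print(eigvals)   # the third eigenvalue is tiny: the data is essentially 2-D
```

Using `np.linalg.eigh` exploits the symmetry of the covariance matrix, which is where the lemma about real eigenvalues and orthogonal eigenvectors matters in practice.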
PCA vs. linear regression
• PCA minimises the residuals of points projected (orthogonally) onto the linear hyperplane; linear regression minimises the errors of the predicted labels

Additional effect of PCA
• PCA aims to find axes such that the variance along each subsequent axis is maximised
• Consider candidate axes $i$ and $(i+1)$. Informally, if there is a correlation between them, this means that axis $i$ can be rotated further to capture more variance
• PCA should end up finding new axes (i.e., the transformation) such that the transformed data is uncorrelated
[Figure: original data is 2D; correlation indicates suboptimal axes placement; no correlation after rotation]

Spectral theorem for symmetric matrices
• In order to explore this effect further, we need to refer to one of the fundamental results in linear algebra
∗ The proof is outside the scope of this subject
∗ This is a special case of the singular value decomposition theorem
• Theorem: for any real symmetric matrix $\mathbf{\Sigma}_X$ there exist a real orthogonal matrix $\mathbf{P}$, with the eigenvectors of $\mathbf{\Sigma}_X$ arranged in rows, and a diagonal matrix of eigenvalues $\mathbf{\Lambda}$, such that $\mathbf{\Sigma}_X = \mathbf{P}'\mathbf{\Lambda}\mathbf{P}$

Diagonalising covariance matrix (1/2)
• Form the transformation matrix $\mathbf{P}$ with the eigenvectors (new axes) as rows
∗ By our problem formulation, $\mathbf{P}$ is an orthonormal matrix
• Note that $\mathbf{P}'\mathbf{P} = \mathbf{I}$, where $\mathbf{I}$ is the identity matrix
∗ To see this, recall that each element of the resulting matrix multiplication is a dot product of the corresponding row and column
∗ So element $(i,j)$ of $\mathbf{P}'\mathbf{P}$ is the dot product $\mathbf{p}_i'\mathbf{p}_j$, which is 1 if $i = j$, and 0 otherwise
• The transformed data is $\mathbf{P}\mathbf{X}$
∗ Similarly to the above, note that element $(i,j)$ of $\mathbf{P}\mathbf{X}$ is the dot product $\mathbf{p}_i'\mathbf{x}_j$, which is the projection of $\mathbf{x}_j$ onto axis $\mathbf{p}_i$, i.e., the new $i$-th coordinate of the $j$-th point

Diagonalising covariance matrix (2/2)
• The covariance of the transformed data is $\mathbf{\Sigma}_{\mathbf{P}\mathbf{X}} \equiv \frac{1}{n-1}(\mathbf{P}\mathbf{X})(\mathbf{P}\mathbf{X})' = \frac{1}{n-1}\mathbf{P}\mathbf{X}\mathbf{X}'\mathbf{P}' = \mathbf{P}\mathbf{\Sigma}_X\mathbf{P}'$
• By the spectral decomposition theorem we have $\mathbf{\Sigma}_X = \mathbf{P}'\mathbf{\Lambda}\mathbf{P}$
• Therefore $\mathbf{\Sigma}_{\mathbf{P}\mathbf{X}} = \mathbf{P}\mathbf{P}'\mathbf{\Lambda}\mathbf{P}\mathbf{P}' = \mathbf{\Lambda}$
• The covariance matrix of the transformed data is diagonal, with the eigenvalues on the diagonal of $\mathbf{\Lambda}$
• The transformed data is uncorrelated

Non-linear data and kernel PCA
• A low-dimensional approximation need not be linear
• Kernel PCA: map the data to a feature space, then run PCA there (see the sketch at the end of these notes)
∗ Express the principal components in terms of the data points. The solution uses $\mathbf{X}'\mathbf{X}$, which can be kernelised: $(\mathbf{X}'\mathbf{X})_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$
∗ The solution strategy differs from regular PCA
∗ Modular: changing the kernel leads to a different feature space transformation
[Figure: PCA vs. kernel PCA on non-linearly structured data]

This lecture
• Principal components analysis
∗ Linear dimensionality reduction method
∗ Diagonalising covariance matrix
• Kernel PCA
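As referenced on the kernel PCA slide above, here is a minimal sketch of one standard way to kernelise PCA, via eigendecomposition of the centred Gram matrix, assuming an RBF kernel. The helper names `rbf_kernel` and `kernel_pca`, the kernel choice, and the toy data are all assumptions for illustration; the lecture itself does not prescribe this implementation:

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gram matrix of the RBF kernel for data with points in rows (n x d)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * sq_dists)

def kernel_pca(K, n_components=2):
    """Kernel PCA from a precomputed n x n kernel (Gram) matrix K."""
    n = K.shape[0]
    # Centre the kernel matrix: corresponds to centring the data in feature space.
    one_n = np.full((n, n), 1.0 / n)
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # Symmetric eigendecomposition; eigenvalues come back in ascending order.
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Scale coefficient vectors so the implicit feature-space PCs have unit norm.
    alphas = eigvecs / np.sqrt(np.maximum(eigvals, 1e-12))
    # Coordinates of the training points along the kernel principal components.
    return Kc @ alphas

# Toy example: two concentric circles, a non-linear structure that a single
# linear axis cannot capture.
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, size=200)
radii = np.repeat([1.0, 3.0], 100)
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
Z = kernel_pca(rbf_kernel(X, gamma=0.5), n_components=2)
print(Z.shape)   # (200, 2): new coordinates for each training point
```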