Lecture 19. Dimensionality Reduction
COMP90051 Statistical Machine Learning
Semester 2, 2019 Lecturer: Ben Rubinstein
Copyright: University of Melbourne
This lecture
• Principal components analysis
∗ Linear dimensionality reduction method
∗ Diagonalising covariance matrix
• Kernel PCA
True dimensionality of data?
Image adapted from Wikipedia, original image: Olivier Grisel
Dimensionality reduction
• Previously in unsupervised learning: clustering
• Dimensionality reduction refers to representing the data using a smaller number of variables (dimensions) while preserving the “interesting” structure of the data
• Such a reduction can serve several purposes
∗ Visualisation (e.g., mapping multidimensional data to 2D)
∗ Computational efficiency in a pipeline
∗ Data compression or statistical efficiency in a pipeline
Exploiting data structure
• Dimensionality reduction in general results in loss of information
• The trick is to ensure that most of the “interesting” information (signal) is preserved, while what is lost is mostly noise
• This is often possible because real data may have inherently fewer dimensions than recorded variables
• Example: GPS coordinates are 3D, while car locations on a flat road are actually on a 2D manifold
• Example: Marks* for Knowledge Technology and Statistical Machine Learning
* synthetic data 🙂
[Figure: scatter plot of SML mark vs. KT mark]
Principal Component Analysis
Finding a rotation of data that minimises covariance between (new) variables
Principal component analysis
• Principal component analysis (PCA) is a popular method for dimensionality reduction and data analysis in general
• Given a dataset $\mathbf{x}_1, \dots, \mathbf{x}_n$, $\mathbf{x}_i \in \mathbb{R}^m$, PCA aims to find a new coordinate system such that most of the variance is concentrated along the first coordinate, then most of the remaining variance along the second coordinate, etc.
• Dimensionality reduction is then based on discarding all coordinates except the first $l < m$
[Figure: original data is 2D → find new axes → new coordinate system, still 2D → leave only the first coordinate]
Naïve PCA algorithm
1. Choose a direction as the new axis, such that the variance along this axis is maximised
2. Choose the next direction/axis perpendicular to all axes so far, such that the (remaining) variance along this axis is maximised
3. Repeat step 2 until you have the same number of axes (i.e., dimensions) as in the original data
4. Project the original data onto the new axes. This gives the new coordinates (“PCA coordinates”)
5. For each point, keep only the first $l$ coordinates
Such an algorithm, if implemented directly, would work, but there is a better solution
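A minimal 2D sketch of this greedy recipe (a brute-force search over candidate angles, not the efficient eigen-based solution derived later); the synthetic data and the grid of 1000 candidate directions are assumptions made purely for illustration.

```python
import numpy as np

# Naive greedy PCA in 2D on assumed synthetic correlated data
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[3.0, 1.2], [1.2, 1.0]], size=200).T  # data in columns
X = X - X.mean(axis=1, keepdims=True)                                      # centre the data

# Step 1: brute-force search over candidate unit directions for maximum variance
angles = np.linspace(0, np.pi, 1000, endpoint=False)
candidates = np.stack([np.cos(angles), np.sin(angles)])   # each column is a unit vector
variances = np.var(candidates.T @ X, axis=1, ddof=1)      # variance of projected data
p1 = candidates[:, np.argmax(variances)]

# Step 2: in 2D the only remaining perpendicular direction is unique
p2 = np.array([-p1[1], p1[0]])

# Steps 4-5: project onto the new axes and keep only the first coordinate
P = np.stack([p1, p2])          # new axes as rows
Z = P @ X                       # PCA coordinates
Z_reduced = Z[:1, :]            # keep only the first coordinate (l = 1)
print(np.var(Z[0], ddof=1), np.var(Z[1], ddof=1))  # most variance lies along the first axis
```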
Formulating the problem
• The core of PCA is finding the new coordinate system, such that most of the variation is captured by “earlier” axes
• Let’s write down this aim formally and see how it can be achieved
• First, recall the geometric definition of a dot product: $\mathbf{u} \cdot \mathbf{v} = u_{\mathbf{v}} \|\mathbf{v}\|$, where $u_{\mathbf{v}} = \|\mathbf{u}\| \cos\theta$ is the (signed) length of the projection of $\mathbf{u}$ onto the direction of $\mathbf{v}$, and $\theta$ is the angle between the two vectors
[Figure: vectors $\mathbf{u}$ and $\mathbf{v}$ with angle $\theta$ between them]
• Suppose $\|\mathbf{v}\| = 1$, so $\mathbf{u} \cdot \mathbf{v} = u_{\mathbf{v}}$
• Vector $\mathbf{v}$ can be considered a candidate coordinate axis, and $u_{\mathbf{v}}$ the coordinate of point $\mathbf{u}$ along that axis
Data transformation
• So the new coordinate system is a set of vectors $\mathbf{p}_1, \dots, \mathbf{p}_m$, where each $\|\mathbf{p}_i\| = 1$
• Consider an original data point $\mathbf{x}_j$, $j = 1, \dots, n$, and a principal axis $\mathbf{p}_i$, $i = 1, \dots, m$
• The corresponding $i$th coordinate for the first point after the transformation is $\mathbf{p}_i'\mathbf{x}_1$
∗ For the second point it is $\mathbf{p}_i'\mathbf{x}_2$, etc.
• Collate all these numbers into a vector $(\mathbf{p}_i'\mathbf{x}_1, \dots, \mathbf{p}_i'\mathbf{x}_n)' = (\mathbf{p}_i'\mathbf{X})' = \mathbf{X}'\mathbf{p}_i$, where $\mathbf{X}$ has the original data points in columns
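A minimal numpy illustration of this transformation on assumed synthetic data: with data points in the columns of $\mathbf{X}$, the product $\mathbf{X}'\mathbf{p}_i$ collects the new coordinate of every point along the candidate axis $\mathbf{p}_i$.

```python
import numpy as np

# With data points in the columns of X, X.T @ p holds the new coordinate of
# every point along the candidate unit axis p (data here is assumed/synthetic).
rng = np.random.default_rng(1)
X = rng.normal(size=(3, 5))                 # m = 3 variables (rows), n = 5 points (columns)
p = np.array([1.0, 2.0, 2.0])
p = p / np.linalg.norm(p)                   # candidate axis must have unit length

coords = X.T @ p                            # shape (n,): one coordinate per data point
print(coords)
print(np.allclose(coords[0], p @ X[:, 0]))  # first entry is p' x_1, as on the slide
```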
Refresher: basic stats
• Consider a random variable $U$ and the corresponding sample $\mathbf{u} = (u_1, \dots, u_n)'$
• Sample mean $\bar{u} \equiv \frac{1}{n}\sum_{i=1}^n u_i$. Sample variance $\frac{1}{n-1}\sum_{i=1}^n (u_i - \bar{u})^2$
• Suppose the mean was subtracted beforehand (the sample is centered). In this case, the variance is a scaled dot product $\frac{1}{n-1}\mathbf{u}'\mathbf{u}$
• Similarly, if we have a centered random sample $\mathbf{v}$ from another random variable, the sample covariance is $\frac{1}{n-1}\mathbf{u}'\mathbf{v}$
• Finally, if our data $\mathbf{x}_1 = (u_1, v_1)', \dots, \mathbf{x}_n = (u_n, v_n)'$ is organised into a matrix $\mathbf{X}$ with data points in columns and centered variables in rows, we have that the covariance matrix is $\mathbf{\Sigma}_X \equiv \frac{1}{n-1}\mathbf{X}\mathbf{X}'$
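A quick sanity check, on assumed random data, that this convention matches numpy: with data points in columns and centred variables in rows, $\frac{1}{n-1}\mathbf{X}\mathbf{X}'$ agrees with np.cov.

```python
import numpy as np

# np.cov also expects variables in rows and observations in columns, and
# divides by n - 1 by default, so it matches the slide's formula.
rng = np.random.default_rng(2)
X = rng.normal(size=(2, 100))                    # 2 variables, 100 observations
X = X - X.mean(axis=1, keepdims=True)            # centre each variable (row)

Sigma_X = (X @ X.T) / (X.shape[1] - 1)
print(np.allclose(Sigma_X, np.cov(X)))           # True
```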
The objective of PCA
• We shall assume that the data is centered
• Let’s start with the objective for the first principal axis. The data projected on this axis is $\mathbf{X}'\mathbf{p}_1$
• Accordingly, the variance along this principal axis is $\frac{1}{n-1}(\mathbf{X}'\mathbf{p}_1)'(\mathbf{X}'\mathbf{p}_1) = \frac{1}{n-1}\mathbf{p}_1'\mathbf{X}\mathbf{X}'\mathbf{p}_1 = \mathbf{p}_1'\mathbf{\Sigma}_X\mathbf{p}_1$
∗ Here $\mathbf{\Sigma}_X$ is the covariance matrix of the original data
• PCA should therefore find $\mathbf{p}_1$ to maximise $\mathbf{p}_1'\mathbf{\Sigma}_X\mathbf{p}_1$, subject to $\|\mathbf{p}_1\| = 1$
Solving the optimisation
• PCA aims to find $\mathbf{p}_1$ that maximises $\mathbf{p}_1'\mathbf{\Sigma}_X\mathbf{p}_1$, subject to $\mathbf{p}_1'\mathbf{p}_1 = \|\mathbf{p}_1\|^2 = 1$
• Constrained optimisation: use Lagrange multipliers. Introduce multiplier $\lambda_1$; set the derivatives of the Lagrangian to zero and solve
• $L = \mathbf{p}_1'\mathbf{\Sigma}_X\mathbf{p}_1 - \lambda_1(\mathbf{p}_1'\mathbf{p}_1 - 1)$
• $\frac{\partial L}{\partial \mathbf{p}_1} = 2\mathbf{\Sigma}_X\mathbf{p}_1 - 2\lambda_1\mathbf{p}_1 = 0$
• $\mathbf{\Sigma}_X\mathbf{p}_1 = \lambda_1\mathbf{p}_1$
• This precisely defines $\mathbf{p}_1$ as an eigenvector of $\mathbf{\Sigma}_X$, with $\lambda_1$ being the corresponding eigenvalue
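A small numerical sanity check of this result, using an assumed random positive semidefinite matrix in place of $\mathbf{\Sigma}_X$: the unit vector maximising $\mathbf{p}'\mathbf{\Sigma}\mathbf{p}$ is the leading eigenvector, and the maximum attained is its eigenvalue.

```python
import numpy as np

# Check: variance along the top eigenvector equals lambda_1, and no random
# unit vector does better (matrix and vectors here are assumed/synthetic).
rng = np.random.default_rng(3)
A = rng.normal(size=(4, 4))
Sigma = A @ A.T                                   # symmetric PSD, stands in for Sigma_X

evals, evecs = np.linalg.eigh(Sigma)              # ascending eigenvalues, orthonormal columns
p1, lam1 = evecs[:, -1], evals[-1]
print(np.isclose(p1 @ Sigma @ p1, lam1))          # variance along p1 equals lambda_1

random_p = rng.normal(size=(4, 1000))
random_p /= np.linalg.norm(random_p, axis=0)      # 1000 random unit vectors
vals = np.sum(random_p * (Sigma @ random_p), axis=0)   # p' Sigma p for each one
print(np.all(vals <= lam1 + 1e-9))                # none exceeds lambda_1
```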
Refresher on eigenvectors (1/2)
• Given a square matrix $\mathbf{A}$, a non-zero column vector $\mathbf{e}$ is called an eigenvector if $\mathbf{A}\mathbf{e} = \lambda\mathbf{e}$. Here $\lambda$ is the corresponding eigenvalue
• Geometric interpretation: compare $\mathbf{A}\mathbf{e}$ with $\mathbf{P}\mathbf{x}_i$ from the previous slides. Here $\mathbf{A}$ is a transformation matrix (“new axes”) applied to some vector $\mathbf{e}$. Vector $\mathbf{e}$ is such that it still points in the same direction after the transformation
Adapted from Lantonov at Wikimedia Commons (CC4)
Refresher on eigenvalues (2/2)
• Algebraic interpretation: if $\mathbf{A}\mathbf{e} = \lambda\mathbf{e}$ then $(\mathbf{A} - \lambda\mathbf{I})\mathbf{e} = 0$, where $\mathbf{I}$ is the identity matrix
• This equation has a non-zero solution $\mathbf{e}$ if and only if the determinant is zero: $\det(\mathbf{A} - \lambda\mathbf{I}) = 0$. Eigenvalues are the roots of this equation, which is called the characteristic equation
• Eigenvectors and eigenvalues are prominent concepts in linear algebra and arise in many practical applications
• The spectrum of a matrix is the set of its eigenvalues
• There are efficient algorithms for computing eigenvectors (not covered)
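A tiny numerical illustration of both views, using an assumed example matrix (not from the slides):

```python
import numpy as np

# For each eigenpair: A e = lambda e, and lambda is a root of det(A - lambda I) = 0.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
evals, evecs = np.linalg.eig(A)

for lam, e in zip(evals, evecs.T):
    print(np.allclose(A @ e, lam * e))                        # A e = lambda e
    print(np.isclose(np.linalg.det(A - lam * np.eye(2)), 0))  # characteristic equation holds
```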
Variance captured by PCs
• In summary: we choose $\mathbf{p}_1$ as the eigenvector of the centered covariance matrix $\mathbf{\Sigma}_X$ with the largest eigenvalue
• Variance of the data captured by $\mathbf{p}_1$:
∗ Note we’ve shown $\lambda_1 = \mathbf{p}_1'\mathbf{\Sigma}_X\mathbf{p}_1$
∗ But $\mathbf{p}_1'\mathbf{\Sigma}_X\mathbf{p}_1$ is the variance of the projected data, so the first eigenvalue is the variance captured
• Choose the number of dimensions to keep from the “knee” in the scree plot
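A minimal scree-plot sketch on assumed synthetic data with an essentially 2-dimensional structure; the eigenvalues of the covariance matrix are plotted in decreasing order, and the “knee” after the second component suggests keeping two dimensions.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic 10-variable data generated from 2 latent factors plus noise
# (an assumption for illustration), with data points in columns as on the slides.
rng = np.random.default_rng(4)
X = rng.normal(size=(10, 2)) @ rng.normal(size=(2, 500)) + 0.1 * rng.normal(size=(10, 500))
X = X - X.mean(axis=1, keepdims=True)

# Eigenvalues of the covariance matrix, sorted from largest to smallest
evals = np.linalg.eigvalsh((X @ X.T) / (X.shape[1] - 1))[::-1]

plt.plot(np.arange(1, len(evals) + 1), evals, marker='o')
plt.xlabel('principal component')
plt.ylabel('eigenvalue (variance captured)')
plt.title('Scree plot')
plt.show()
```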
Efficient solution for PCA
• The same pattern can be used to find all PCs
∗ The constraint $\|\mathbf{p}_i\| = 1$ prevents the variance $\mathbf{p}_i'\mathbf{\Sigma}_X\mathbf{p}_i$ from diverging by rescaling $\mathbf{p}_i$
∗ Each time we add the additional constraint that the next PC be orthogonal to all previous PCs. Equivalently, we search in their orthogonal complement
• Solution: set the $\mathbf{p}_i$ to be the eigenvectors of the centered data covariance matrix $\mathbf{\Sigma}_X$, in decreasing order of eigenvalue
• Is this really possible with any $\mathbf{\Sigma}_X$?
• Lemma: a real symmetric $m \times m$ matrix has $m$ real eigenvalues, and the corresponding eigenvectors are orthogonal
• Lemma: a PSD matrix further has non-negative eigenvalues
Summary of PCA (1/2)
• Assume data points are arranged in the columns of $\mathbf{X}$. That means that the variables are in rows
• Ensure that the data is centered: subtract from each row (variable) its mean; equivalently, subtract the data centroid from each column (data point)
• We seek an orthonormal basis $\mathbf{p}_1, \dots, \mathbf{p}_m$
∗ Each basis vector is of unit length and perpendicular to every other
• Find the eigenvalues of the centered data covariance matrix $\mathbf{\Sigma}_X \equiv \frac{1}{n-1}\mathbf{X}\mathbf{X}'$
∗ Always possible, relatively efficiently
Summary of PCA (2/2)
• Sort eigenvalues from largest to smallest
∗ Each eigenvalue equals the variance of the data along the corresponding PC
• Set $\mathbf{p}_1, \dots, \mathbf{p}_m$ as the corresponding eigenvectors
• Project the data $\mathbf{X}$ onto these new axes to get the coordinates of the transformed data
• Keep only the first $s$ coordinates to reduce dimensionality
• Another view of PCA: the $s$-dimensional plane minimising the residual sum of squares to the data. (This plane is exactly the one spanned by the $s$ chosen PCs)
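Putting the two summary slides together, here is a minimal numpy sketch of the whole procedure; the function name pca, the synthetic data, and the use of np.linalg.eigh are assumptions for illustration, not the subject's reference implementation. (Library implementations such as sklearn.decomposition.PCA instead expect data points in rows, and eigenvector signs are arbitrary.)

```python
import numpy as np

def pca(X, s):
    """Illustrative helper (assumed name): X has data points in columns and
    variables in rows; returns the top-s principal axes (as rows) and the
    s-dimensional coordinates of the data."""
    Xc = X - X.mean(axis=1, keepdims=True)             # centre each variable
    Sigma = (Xc @ Xc.T) / (Xc.shape[1] - 1)            # covariance matrix
    evals, evecs = np.linalg.eigh(Sigma)               # ascending order for symmetric matrices
    order = np.argsort(evals)[::-1]                    # largest eigenvalue first
    P = evecs[:, order].T                              # principal axes as rows
    return P[:s], P[:s] @ Xc                           # axes and transformed coordinates

# Example usage on assumed synthetic data with an essentially 2D structure
rng = np.random.default_rng(5)
X = rng.normal(size=(5, 2)) @ rng.normal(size=(2, 300)) + 0.05 * rng.normal(size=(5, 300))
axes, Z = pca(X, s=2)
print(axes.shape, Z.shape)                             # (2, 5) and (2, 300)
```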
PCA vs. Linear regression
PCA minimises the distances of points to the fitted hyperplane (orthogonal residuals); linear regression minimises the errors of predicted labels (vertical residuals)
Additional effect of PCA
• PCA aims to find axes such that the variance along each subsequent axis is maximised
• Consider candidate axes $i$ and $(i + 1)$. Informally, if there is a correlation between them, this means that axis $i$ can be rotated further to capture more variance
• PCA should end up finding new axes (i.e., the transformation) such that the transformed data is uncorrelated
[Figure: the original 2D data exhibits correlation because of suboptimal axis placement; after rotation to the new axes there is no correlation]
Spectral theorem for symmetric matrices
• In order to explore this effect further, we need to refer to one of the fundamental results in linear algebra
∗ The proof is outside the scope of this subject
∗ This is a special case of the singular value decomposition theorem
• Theorem: for any real symmetric matrix $\mathbf{\Sigma}_X$ there exist a real orthogonal matrix $\mathbf{P}$, with the eigenvectors of $\mathbf{\Sigma}_X$ arranged in its rows, and a diagonal matrix of eigenvalues $\mathbf{\Lambda}$, such that $\mathbf{\Sigma}_X = \mathbf{P}'\mathbf{\Lambda}\mathbf{P}$
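A quick numerical check of the theorem on an assumed random symmetric matrix, using numpy's eigendecomposition:

```python
import numpy as np

# With eigenvectors as the rows of P and eigenvalues on the diagonal of Lambda,
# Sigma = P' Lambda P, and P is orthogonal.
rng = np.random.default_rng(6)
A = rng.normal(size=(4, 4))
Sigma = (A + A.T) / 2                      # real symmetric matrix

evals, evecs = np.linalg.eigh(Sigma)
P = evecs.T                                # eigenvectors in rows
Lam = np.diag(evals)
print(np.allclose(Sigma, P.T @ Lam @ P))   # True
print(np.allclose(P @ P.T, np.eye(4)))     # True
```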
Diagonalising covariance matrix (1/2)
• Form the transformation matrix $\mathbf{P}$ with the eigenvectors (new axes) as rows
∗ By our problem formulation, $\mathbf{P}$ is an orthonormal matrix
• Note that $\mathbf{P}'\mathbf{P} = \mathbf{I}$, where $\mathbf{I}$ is the identity matrix
∗ To see this, recall that each element of the resulting matrix multiplication is a dot product of the corresponding row and column
∗ So element $(i, j)$ of $\mathbf{P}'\mathbf{P}$ is the dot product $\mathbf{p}_i'\mathbf{p}_j$, which is 1 if $i = j$, and 0 otherwise
• The transformed data is $\mathbf{P}\mathbf{X}$
∗ Similar to the above, note that element $(i, j)$ of $\mathbf{P}\mathbf{X}$ is the dot product $\mathbf{p}_i'\mathbf{x}_j$, which is the projection of $\mathbf{x}_j$ on axis $\mathbf{p}_i$, i.e., the new $i$th coordinate of the $j$th point
Diagonalising covariance matrix (2/2)
• The covariance of the transformed data is
• $\mathbf{\Sigma}_{\mathbf{P}\mathbf{X}} \equiv \frac{1}{n-1}(\mathbf{P}\mathbf{X})(\mathbf{P}\mathbf{X})' = \frac{1}{n-1}\mathbf{P}\mathbf{X}\mathbf{X}'\mathbf{P}' = \mathbf{P}\mathbf{\Sigma}_X\mathbf{P}'$
• By the spectral decomposition theorem we have $\mathbf{\Sigma}_X = \mathbf{P}'\mathbf{\Lambda}\mathbf{P}$
• Therefore $\mathbf{\Sigma}_{\mathbf{P}\mathbf{X}} = (\mathbf{P}\mathbf{P}')\mathbf{\Lambda}(\mathbf{P}\mathbf{P}') = \mathbf{\Lambda}$
• The covariance matrix of the transformed data is diagonal, with the eigenvalues on the diagonal of $\mathbf{\Lambda}$
• The transformed data is uncorrelated
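A numerical check of this diagonalisation on assumed synthetic data: after transforming with $\mathbf{P}$ (eigenvectors of the covariance matrix as rows), the covariance of $\mathbf{P}\mathbf{X}$ is diagonal with the eigenvalues on the diagonal.

```python
import numpy as np

# Verify that cov(PX) = Lambda, i.e. the transformed data is uncorrelated.
rng = np.random.default_rng(7)
X = rng.multivariate_normal([0, 0, 0], [[4, 2, 0], [2, 3, 1], [0, 1, 2]], size=1000).T
X = X - X.mean(axis=1, keepdims=True)

Sigma_X = (X @ X.T) / (X.shape[1] - 1)
evals, evecs = np.linalg.eigh(Sigma_X)
P = evecs.T                                    # new axes as rows

Sigma_PX = (P @ X) @ (P @ X).T / (X.shape[1] - 1)
print(np.allclose(Sigma_PX, np.diag(evals)))   # True
```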
Non-linear data and kernel PCA
• Low dimensional approximation need not be linear
[Figure: non-linearly structured data approximated by regular PCA vs. kernel PCA]
• Kernel PCA: map data to feature space, then run PCA
∗ Express principal components in terms of data points. The solution uses $\mathbf{X}'\mathbf{X}$, which can be kernelised: $[\mathbf{X}'\mathbf{X}]_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$
∗ The solution strategy differs from regular PCA
∗ Modular: changing the kernel leads to a different feature space transformation
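A minimal kernel PCA sketch under assumed choices (the RBF kernel, the helper names rbf_kernel and kernel_pca, and the synthetic data are all illustrative; the lecture does not prescribe this exact routine): build the kernel matrix, centre it in feature space, eigendecompose it, and read off the projections of the training points. With a linear kernel the matrix below is exactly the $\mathbf{X}'\mathbf{X}$ from the slide.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    # X has data points in columns, matching the slides' convention
    sq = np.sum(X**2, axis=0)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X.T @ X))

def kernel_pca(K, s):
    n = K.shape[0]
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one              # centre in feature space
    evals, evecs = np.linalg.eigh(Kc)
    evals, evecs = evals[::-1][:s], evecs[:, ::-1][:, :s]   # top-s components
    return evecs * np.sqrt(np.maximum(evals, 0))            # projections of training points

# Example usage on assumed synthetic 2D data
rng = np.random.default_rng(8)
X = rng.normal(size=(2, 100))
Z = kernel_pca(rbf_kernel(X, gamma=0.5), s=2)               # (100, 2) non-linear coordinates
print(Z.shape)
```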
This lecture
• Principal components analysis
∗ Linear dimensionality reduction method
∗ Diagonalising covariance matrix
• Kernel PCA