COMS 4771 Dimensionality Reduction
Nakul Verma
Example: Handwritten digits
Handwritten digit data, but with no labels
What can we do?
• Suppose we know that there are 10 groupings, can we find the groups?
• What if we don’t know there are 10 groups?
• How can we discover/explore other structure in such data?
A 2D visualization of the digits dataset
Dimensionality Reduction
Data: x_1, …, x_n ∈ R^d
Goal: find a ‘useful’ transformation that helps in the downstream prediction task.
Some previously seen useful transformations:
• z-scoring (keeps the same dimensionality, but with better scaling; see the quick sketch below)
• Kernel transformations (higher dimensionality, making data linearly separable)
How about lower dimensionality while keeping the relevant information?
What are other desirable feature transformations?
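As a quick reminder from earlier lectures, z-scoring takes only a line or two; this is a minimal numpy sketch (the data matrix X and its shape are placeholder choices for illustration):

import numpy as np

# Hypothetical data: n = 100 points in d = 20 dimensions (rows are points).
X = np.random.randn(100, 20) * 5.0 + 3.0

# z-scoring: subtract the per-feature mean and divide by the per-feature standard deviation.
X_z = (X - X.mean(axis=0)) / X.std(axis=0)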
Principal Components Analysis (PCA)
Data: x_1, …, x_n ∈ R^d
Goal: find the linear transformation (to k dimensions) that best maintains reconstruction accuracy.
Equivalently, minimize aggregate residual error
Define: Π_k = a k-dimensional orthogonal linear projector
minimize_{Π_k}  Σ_i ||x_i − Π_k x_i||²
How do we optimize this?
Dimensionality Reduction via Projections
A k-dimensional subspace can be represented by k orthonormal vectors q_1, …, q_k.
The projection of any x ∈ R^d onto the subspace is given by
Π_k x = Σ_j (q_j ⋅ x) q_j
(Figure: projecting x onto a single direction q gives the point (q ⋅ x) q.)
To represent it in the lower-dimensional space (using the basis q_1, …, q_k), the coefficients simply are
(q_1 ⋅ x, …, q_k ⋅ x)
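To make the projection formula concrete, here is a minimal numpy sketch (the basis Q, the dimensions, and the point x are placeholder choices):

import numpy as np

# Hypothetical orthonormal basis: columns q_1, ..., q_k of a d x k matrix Q.
d, k = 5, 2
Q, _ = np.linalg.qr(np.random.randn(d, k))   # QR factorization gives orthonormal columns

x = np.random.randn(d)                       # an arbitrary point in R^d

w = Q.T @ x          # coefficients w_j = q_j . x  (the k-dimensional representation)
x_proj = Q @ w       # projection Pi_k x = sum_j (q_j . x) q_j, back in R^d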
PCA: k = 1 case
If the projection dimension k = 1, then we are looking for a q such that:
minimize_{||q||=1}  Σ_i ||x_i − (q ⋅ x_i) q||²
Equivalent formulation:
maximize_{||q||=1}  (1/n) Σ_i (q ⋅ x_i)²  =  q^T ( (1/n) X X^T ) q
How to solve?
Eigenvectors and Eigenvalues
Recall that for any matrix M, the (λ, v) pairs satisfying the fixed-point equation M v = λ v (with v ≠ 0) are the eigenvalues and eigenvectors of M.
So, the solution to
maximize_{||q||=1}  q^T ( (1/n) X X^T ) q     (i.e., q of unit length)
is basically the top eigenvector of the matrix (1/n) X X^T!
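One way to compute the top eigenvector numerically is power iteration; a minimal sketch, assuming a mean-centered d x n data matrix X (all names here are placeholders, and np.linalg.eigh would work just as well):

import numpy as np

def top_eigenvector(M, num_iters=1000):
    # Power iteration: repeatedly apply M and renormalize; for a symmetric PSD
    # matrix such as (1/n) X X^T this converges to the top eigenvector.
    q = np.random.randn(M.shape[0])
    q /= np.linalg.norm(q)
    for _ in range(num_iters):
        q = M @ q
        q /= np.linalg.norm(q)
    return q

# Example: X is a hypothetical d x n data matrix with columns x_1, ..., x_n.
d, n = 10, 200
X = np.random.randn(d, n)
X = X - X.mean(axis=1, keepdims=True)    # center so that (1/n) X X^T is the covariance
q = top_eigenvector((X @ X.T) / n)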
PCA: k = 1 case
maximize_{||q||=1}  q^T ( (1/n) X X^T ) q
(1/n) X X^T is the covariance of the data (if the mean = 0). For any q, the quadratic form q^T ( (1/n) X X^T ) q is the empirical variance of the data in the direction q.
Therefore, the top eigenvector solution implies that the direction of maximum variance minimizes the residual error!
(why? since ||q|| = 1, we have ||x_i − (q ⋅ x_i) q||² = ||x_i||² − (q ⋅ x_i)², so minimizing the aggregate residual error is the same as maximizing Σ_i (q ⋅ x_i)², the variance in the direction q.)
What about general k?
PCA: general k case
Solution: it is basically the span of the top k eigenvectors of the matrix (1/n) X X^T!
This is the k-dimensional subspace preserving the maximum amount of variance.
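A minimal numpy sketch of the general-k solution, assuming the data are the columns of a d x n matrix X (the function name and data layout are just illustrative):

import numpy as np

def pca_top_k(X, k):
    # X: d x n data matrix whose columns are the datapoints.
    Xc = X - X.mean(axis=1, keepdims=True)      # center the data
    cov = (Xc @ Xc.T) / X.shape[1]              # (1/n) X X^T
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :k]              # columns = top-k principal directions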
PCA: Example Handwritten Digits
Images of handwritten 3s in R^784
Top principal directions (shown as images): q_1, q_2, q_3, q_4
Data reconstruction of any example x:
x ≈ w_1 q_1 + w_2 q_2 + …     (where w_j = q_j ⋅ x)
We can compress each datapoint to just k numbers!
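As a concrete (if smaller-scale) illustration, a sketch using scikit-learn's 8x8 digits dataset (R^64 rather than the R^784 images above); the choice of k = 10 is arbitrary:

import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data                   # shape (n, 64), each row is an 8x8 digit image

pca = PCA(n_components=10)               # keep k = 10 principal directions
W = pca.fit_transform(X)                 # each image compressed to just 10 numbers
X_hat = pca.inverse_transform(W)         # reconstruction from those 10 coefficients

print("avg squared reconstruction error:", np.mean(np.sum((X - X_hat) ** 2, axis=1)))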
Other Popular Dimension Reduction Methods
Multi-dimensional Scaling
Independent Component Analysis (ICA) (for blind source separation)
Non-negative Matrix Factorization (to create additive models)
Dictionary Learning
Random Projections
…
All of them are linear methods
Non-Linear Dimensionality Reduction
Consider non-linear data
Figure: a linear embedding vs. a non-linear embedding of the data
Non-Linear Dimensionality Reduction
Basic optimization criterion:
Find an embedding that:
• Keeps neighboring points close
• Keeps far-off points far
Example variation 1:
Distort neighboring distances by at most a (1±ε) factor, while maximizing non-neighbor distances.
Example variation 2:
Compute geodesic (local hop) distances, and find an embedding that best preserves geodesics.
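As an illustration of variation 2, a minimal sketch using scikit-learn's Isomap on a synthetic swiss-roll dataset (the dataset and parameter values are arbitrary choices for illustration):

import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# A synthetic non-linear dataset: points on a 2D "swiss roll" surface embedded in R^3.
X, _ = make_swiss_roll(n_samples=1000, noise=0.05)

# Isomap: estimate geodesic (local hop) distances on a k-nearest-neighbor graph,
# then find a low-dimensional embedding that best preserves those geodesic distances.
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
print(embedding.shape)    # (1000, 2)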
Non-linear embedding: Example
Popular Non-Linear Methods
Locally Linear Embedding (LLE)
Isometric Mapping (IsoMap)
Laplacian Eigenmaps (LE)
Local Tangent Space Alignment (LTSA)
Maximum Variance Unfolding (MVU)
…
What We Learned…
• Dimensionality Reduction
• Linear vs. non-linear dimensionality reduction
• Principal Component Analysis
Questions?