Dimensionality Reduction I: Principal Component Analysis
ISML_5: Dimensionality Reduction
Lingqiao Liu
Outline
• Overview of dimensionality reduction
– The benefits
– Why we can perform dimensionality reduction
– Types of dimensionality reduction methods
• Principal component analysis
– Mathematical basics
– Decorrelation and dimension selection
– Eigenface and high-dimensionality issue
• Linear Discriminant Analysis
• Summary
What is dimensionality reduction?
• Reduce dimensions of data
[Figure: an n × p data matrix A is reduced to an n × k matrix X, with k ≪ p]
Dimensionality Reduction: why?
• Extract underlying factors
Dimensionality Reduction: why?
• Reduce data noise
– Face recognition
– Applied to image de-noising
Image courtesy of Charles-Alban Deledalle, Joseph Salmon, and Arnak Dalalyan, "Image denoising with patch-based PCA: local versus global", BMVC 2011
Dimensionality Reduction: why?
• Reduce the number of model parameters
– Avoid over-fitting
– Reduce computational cost
Dimensionality Reduction: why?
• Visualization
Dimensionality Reduction
• General principle:
Preserve “useful” information in the low-dimensional data
• How to define “usefulness”?
– Many possible definitions
– An active research direction in machine learning
• Taxonomy
– Supervised or Unsupervised
– Linear or nonlinear
• Commonly used methods:
– PCA, LDA (linear discriminant analysis), and more.
• Feature selection vs. dimensionality reduction: feature selection keeps a subset of the original dimensions, while dimensionality reduction builds new dimensions from combinations of the original ones
Outline
• Overview of dimensionality reduction
– The benefits
– Why we can perform dimensionality reduction
– Types of dimensionality reduction methods
• Principal component analysis
– Mathematical basics
– Decorrelation and dimension selection
– Eigenface and high-dimensionality issue
• Linear Discriminant Analysis
• Summary
Principal Component Analysis: Motivation
• Principal Component Analysis (PCA) is an unsupervised
dimensionality reduction method
– Transform data to remove redundant information
– Keep the most informative dimensions after the transformation
Data correlation & information redundancy
PCA explained: De-correlating data
• Dependency vs. correlation
Dependence is a stronger condition than correlation
• The two are equivalent when the data follows a Gaussian distribution
• PCA only de-correlates the data
This is one limitation of PCA; Independent Component Analysis (ICA) targets full independence, but it is more complicated
PCA explained: De-correlating data
• Geometric interpretation of correlation
Exercise: which figure shows the highest data correlation?
PCA explained: De-correlating data
• Correlation can be removed by rotating the data points or, equivalently, the coordinate system
PCA explained: SNR maximization
• Maximize the signal-to-noise ratio (SNR): model the data matrix as signal plus noise, and after a linear transform keep the directions where the signal variance dominates, $\mathrm{SNR} = \sigma^2_{\text{signal}} / \sigma^2_{\text{noise}}$
PCA explained: SNR maximization
[Figure: the direction of large variance is labelled as signal; the direction of small variance as noise]
Discussion: why do we assume that a dimension with larger variance contains more information?
PCA explained
• Target
– 1: Find a new coordinate system in which different dimensions have zero correlation
– 2: Keep the top-k dimensions with the largest variance in the new coordinate system
• Method
Rotate the data points or, equivalently, the coordinate system
• Mathematically speaking…
– How do we rotate?
– How do we express our criterion?
Mathematical Basics
• Mean, variance, covariance
• Matrix norm and trace
• Orthogonal matrix, basis
• Eigen decomposition
Mathematical Basics
• (Sample) Mean: $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$
• (Sample) Variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2$
• (Sample) Covariance: $\mathrm{cov}(x, y) = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})$
(Some texts use $\frac{1}{N-1}$ instead of $\frac{1}{N}$; the choice does not affect the PCA eigenvectors.)
Mathematical Basics
• Covariance matrix: $C_{ij} = \mathrm{cov}(x^{(i)}, x^{(j)})$, the covariance between the $i$-th and $j$-th feature dimensions
Or compactly $C = \frac{1}{N} X X^\top$, where $X$ is the centralized $d \times N$ data matrix (with the mean subtracted); a sketch follows below
When $C$ is a diagonal matrix, the feature dimensions have zero correlation
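A minimal MATLAB sketch of this computation (the matrix name X and the d × N, one-sample-per-column layout are assumptions chosen to match the convention above):
% X: d x N centralized data matrix (mean already subtracted from each column)
[d, N] = size(X);
C = (X * X') / N;         % d x d covariance matrix
var_per_dim = diag(C);    % diagonal entries: per-dimension variances
% off-diagonal C(i,j): covariance between dimension i and dimension j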
Mathematical Basics
• Orthogonal matrix: $Q^\top Q = Q Q^\top = I$, i.e. $Q^{-1} = Q^\top$
• Rotation effect: multiplying by an orthogonal matrix preserves lengths and angles, $\|Qx\| = \|x\|$ (see the sketch below)
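A minimal sketch illustrating these two properties with a 2-D rotation matrix (the example angle and point are assumptions, not taken from the slides):
theta = pi / 4;                                % rotate by 45 degrees
Q = [cos(theta) -sin(theta); sin(theta) cos(theta)];
disp(Q' * Q);                                  % identity matrix: Q is orthogonal
x = [2; 2];
disp(norm(Q * x) - norm(x));                   % ~0: rotation preserves length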
Mathematical Basics
• Relationship to a coordinate system
– A point = a linear combination of basis vectors
– The combination weights = the coordinates
• Each row (or column) of $Q$ = a basis vector
– The basis is not unique
– Relation to coordinate rotation
• New coordinates: $y = Q x$, the projections of $x$ onto the rows of $Q$
Mathematical Basics
[Figure: the point (2, 2) expressed in the old coordinate system and in a rotated (new) coordinate system; a numeric sketch follows below]
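A numeric sketch of this example; the 45-degree rotation angle is an assumption made for illustration, since the original figure is not reproduced here:
theta = pi / 4;                                          % new basis: old basis rotated by 45 degrees
Q = [cos(theta) sin(theta); -sin(theta) cos(theta)];     % rows of Q are the new basis vectors
x = [2; 2];                                              % coordinates in the old system
y = Q * x                                                % approx [2.83; 0]: the point lies on the first new axis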
Mathematical Basics
• Eigen-decomposition: $A v = \lambda v$; stacking the eigenvectors, $A = E \Lambda E^{-1}$
• If $A$ is symmetric, then $A = E \Lambda E^\top$ with an orthogonal $E$ (a sketch follows below)
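A minimal sketch checking these facts on a small symmetric matrix (the example matrix A is an assumption):
A = [2 1; 1 2];                % a small symmetric matrix
[E, D] = eig(A);               % columns of E: eigenvectors; diag(D): eigenvalues
disp(norm(A - E * D * E'));    % ~0: A = E * D * E'
disp(E' * E);                  % identity: the eigenvectors are orthonormal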
PCA: solution
• Target 1: de-correlation
Original covariance matrix of the data: $C = \frac{1}{N} X X^\top$
We are looking to transform the centralized data with a matrix $P$: $y = P x$, giving the new covariance matrix $C_y = P C P^\top$
Require $C_y = P C P^\top$ to be a diagonal matrix
Sounds familiar? Use the eigen-decomposition $C = E D E^\top$
Set $P = E^\top$
so $C_y = P C P^\top = E^\top E D E^\top E = D$, which is diagonal (a verification sketch follows below)
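A minimal sketch verifying the de-correlation step on synthetic data (the generated data and all variable names are assumptions):
rng(0);
X = randn(2, 500);
X(2, :) = 0.8 * X(1, :) + 0.3 * X(2, :);       % make the two dimensions correlated
X = X - mean(X, 2);                            % centralize
C = (X * X') / size(X, 2);                     % original covariance matrix
[E, D] = eig(C);
P = E';                                        % rows of P = eigenvectors of C
Y = P * X;                                     % transformed data
disp((Y * Y') / size(Y, 2));                   % approximately diagonal: de-correlated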
Recap: Matrix decomposition
• A matrix can be decomposed into a combination (usually a product) of special matrices
• Eigen decomposition: $A = E \Lambda E^{-1}$
– When $A$ is symmetric, i.e. $A = A^\top$, then $A = E \Lambda E^\top$ with orthogonal $E$
• Related topic: singular value decomposition (SVD)
A closer look at the covariance matrix of y
• The covariance matrix of the transformed data $y = Px$ is a diagonal matrix; what are the diagonal elements?
• From the perspective of the covariance matrix: the $(i, j)$-th element is the covariance between the $i$-th and $j$-th feature dimensions, so the $i$-th diagonal element is the variance of the $i$-th dimension of $y$
• From the perspective of the eigen-decomposition: the diagonal matrix contains the eigenvalues of $C$, so the variance of the $i$-th new dimension equals the $i$-th eigenvalue
How to select the most informative dimensions?
• We rank the dimensions by their variance, which is equivalent to ranking them by their corresponding eigenvalues
– Note that the $i$-th dimension of $y$ is obtained by using the $i$-th row of $P$
– If we only want $k$ dimensions, we simply select the eigenvectors corresponding to the top-$k$ eigenvalues to form the projection matrix $P$
PCA: algorithm
• 1. Subtract mean
• 2. Calculate the covariance matrix
• 3. Calculate eigenvectors and eigenvalues of the
covariance matrix
• 4. Rank the eigenvectors by their corresponding eigenvalues
• 5. Form P with its row vectors set to the top-k eigenvectors
PCA: MATLAB code
Note: fea has size N × d (N samples, d dimensions).
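The code itself is not reproduced on this slide; the following is a minimal sketch of the algorithm above under that convention (k, the number of kept dimensions, is an assumed input):
% fea: N x d data matrix (one sample per row); k: number of dimensions to keep
mu = mean(fea, 1);                    % 1 x d mean vector
X  = fea - mu;                        % centralized data, N x d
C  = (X' * X) / size(X, 1);           % d x d covariance matrix
[E, D] = eig(C);                      % eigenvectors (columns of E) and eigenvalues
[~, idx] = sort(diag(D), 'descend');  % rank by eigenvalue
P = E(:, idx(1:k))';                  % k x d projection matrix: rows = top-k eigenvectors
Y = X * P';                           % N x k low-dimensional features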
PCA: reconstruction
• From a new feature $y$, we can reconstruct an approximation of the original feature $x$: $\hat{x} = P^\top y + \mu$
Similar to the rotation-back operation: since the rows of $P$ are orthonormal, projecting back just uses $P^\top$
• Fun fact: we can also derive the PCA algorithm from the following optimization problem — minimize the reconstruction error $\min_P \sum_i \| x_i - P^\top P x_i \|^2$ subject to $P P^\top = I$ (for centralized $x_i$)
PCA: reconstruction
• This is equivalent to maximizing a Rayleigh quotient, $\max_w \frac{w^\top C w}{w^\top w}$
• Solution = PCA (the top eigenvectors of the covariance matrix $C$)
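Continuing the variable names from the PCA sketch above (fea, mu, P, Y are assumptions), the reconstruction and its error can be computed as:
X_rec   = Y * P + mu;                          % N x d approximate reconstruction
rec_err = mean(sum((fea - X_rec).^2, 2));      % mean squared reconstruction error
% Because the rows of P are orthonormal, P' simply rotates the data back.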
Application: Eigen-face method
• Sirovich and Kirby (1987) showed that PCA could be
used on a collection of face images to form a set of
basis features.
• Not limited to face recognition
• Steps (see the sketch below)
– Represent each image as a high-dimensional feature vector (stack its pixels)
– Apply PCA; the top eigenvectors, reshaped back into images, are the eigenfaces
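A minimal eigenface sketch, assuming a matrix faces of size N × (h·w) whose rows are vectorised h × w face images (all names are assumptions); for large images the Gram-matrix trick covered later in this lecture should be used instead of forming the full covariance matrix:
mu = mean(faces, 1);
X  = faces - mu;
C  = (X' * X) / size(X, 1);                    % (h*w) x (h*w) covariance matrix
[E, D] = eig(C);
[~, idx] = sort(diag(D), 'descend');
eigenface1 = reshape(E(:, idx(1)), h, w);      % top eigenvector reshaped into an image
imagesc(eigenface1); colormap gray; axis image;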
Application: Eigen-face method
Application: Reconstruction
Reconstructed from top-2 eigenvectors
Application: Reconstruction
Reconstructed from top-15 eigenvectors
Application: Reconstruction
Reconstructed from top-40 eigenvectors
Application: Eigen-face method
• From large to small eigenvalues, the eigenfaces progress from common patterns, to discriminative patterns, to noisy patterns
High dimensionality issue
• For high-dimensional data, the feature dimension $d$ is large
• The number of samples $N$ is relatively small, so the $d \times d$ covariance matrix $X X^\top$ can be too large to store and eigen-decompose
• Solution: a useful relation
Suppose $X$ is the $d \times N$ centralized data matrix with $N \ll d$; we first consider the eigenvalues and eigenvectors of the $N \times N$ kernel (Gram) matrix $X^\top X$: $X^\top X v = \lambda v$
Multiply $X$ on both sides of the equation: $X X^\top (X v) = \lambda (X v)$
If we define $u = X v$
Then we know $X X^\top u = \lambda u$ — exactly the definition of the eigen-system of the covariance matrix (up to the $1/N$ scaling), so $u$ is an eigenvector of $X X^\top$ with the same eigenvalue
High dimensionality issue
• 1. Centralize the data: subtract the mean from every sample to form $X$ ($d \times N$)
• 2. Calculate the kernel matrix $K = X^\top X$ ($N \times N$)
• 3. Perform eigen-decomposition on the kernel matrix and obtain its eigenvectors $v_i$ and eigenvalues $\lambda_i$
• 4. Obtain the eigenvectors of the covariance matrix by $u_i = X v_i$ and normalize them to unit length (a sketch follows below)
• Question: how many eigenvectors can you obtain in this way?
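A minimal sketch of these steps (X is the d × N centralized data matrix and k the number of components; both names are assumptions):
K = X' * X;                            % N x N kernel (Gram) matrix instead of the d x d covariance
[V, D] = eig(K);
[~, idx] = sort(diag(D), 'descend');
U = X * V(:, idx(1:k));                % u = X * v maps Gram eigenvectors to covariance eigenvectors
U = U ./ sqrt(sum(U.^2, 1));           % normalize each column to unit length
Y = U' * X;                            % k x N low-dimensional features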
Mathematical Basics
• Rank of a matrix
• In linear algebra, the rank of a matrix A is the dimension of the
vector space generated (or spanned) by its columns
• It is identical to the dimension of the vector space spanned by
its rows
• Important relationship: $\mathrm{rank}(X X^\top) = \mathrm{rank}(X^\top X) = \mathrm{rank}(X) \le \min(d, N)$
• Relationship to the eigenvalues:
the rank of a (symmetric) matrix equals the number of non-zero eigenvalues of the matrix
Back to the previous question
• So the rank of $C = \frac{1}{N} X X^\top$ is at most $N$ (at most $N - 1$ after mean subtraction, since the centralized samples sum to zero)
• So we obtain at most $N - 1$ meaningful projections from PCA
The eigenvectors corresponding to zero eigenvalues carry no variance and are meaningless in PCA
Outline
• Overview of dimensionality reduction
– The benefits
– Why we can perform dimensionality reduction
– Types of dimensionality reduction methods
• Principal component analysis
– Mathematical basics
– Decorrelation and dimension selection
– Eigenface and high-dimensionality issue
• Linear Discriminant Analysis
• Summary
Discriminative dimensionality reduction
• General principle:
Preserve “useful” information in the low-dimensional data
PCA: measures “usefulness” through the reconstruction error or the covariance structure
Useful for reconstruction ≠ useful for classification
• General principle for discriminative dimensionality reduction:
Preserve “discriminative” information in the low-dimensional data
Linear Discriminant Analysis: Basic idea
• Linear Discriminant Analysis (LDA)
– Discriminative dimensionality reduction
– Linear dimensionality reduction
• Supervised information
– Class labels
– Data from the same class => become close after projection
– Data from different classes => stay far from each other
Linear Discriminant Analysis (LDA): Objective
• Two classes: project each sample onto a direction $w$, $y = w^\top x$, and choose $w$ so that the two classes are well separated after projection
Linear Discriminant Analysis (LDA): Objective
• Mean after projection: $\tilde{m}_c = w^\top m_c$, where $m_c$ is the mean of class $c$
• Variance (scatter) after projection: $\tilde{s}_c^2 = \sum_{i \in c} (w^\top x_i - \tilde{m}_c)^2$
Linear Discriminant Analysis (LDA): Objective
• Mean distance after projection: $(\tilde{m}_1 - \tilde{m}_2)^2 = (w^\top (m_1 - m_2))^2 = w^\top S_B w$
• Between-class scatter: $S_B = (m_1 - m_2)(m_1 - m_2)^\top$; within-class scatter: $S_W = \sum_{c}\sum_{i \in c} (x_i - m_c)(x_i - m_c)^\top$, with $\tilde{s}_1^2 + \tilde{s}_2^2 = w^\top S_W w$
Linear Discriminant Analysis (LDA): Solution
• Objective (Fisher criterion): $\max_w J(w) = \frac{w^\top S_B w}{w^\top S_W w}$
• Solution: $w \propto S_W^{-1}(m_1 - m_2)$, the top eigenvector of $S_W^{-1} S_B$
Linear Discriminant Analysis (LDA): Solution
• Eigenvectors corresponding to the largest eigenvalues of $S_W^{-1} S_B$
– Why? At the optimum we have $S_B w = \lambda S_W w$ (a generalized eigenvalue problem) with $J(w) = \lambda$, so the largest eigenvalue gives the best direction
• Implementation details (see the sketch below)
– What if $S_W$ is not invertible?
Use $S_W + \epsilon I$ (a small regularization term) instead
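A minimal two-class LDA sketch (the matrices X1 and X2 holding the samples of the two classes row-wise are assumptions, and the 1e-6 ridge is the regularization mentioned above):
% X1: N1 x d samples of class 1; X2: N2 x d samples of class 2
m1 = mean(X1, 1)';  m2 = mean(X2, 1)';                        % class means, d x 1
Sw = (X1 - m1')' * (X1 - m1') + (X2 - m2')' * (X2 - m2');     % within-class scatter
w  = (Sw + 1e-6 * eye(size(Sw, 1))) \ (m1 - m2);              % S_W^{-1}(m1 - m2), regularized
w  = w / norm(w);                                             % projection direction
y1 = X1 * w;  y2 = X2 * w;                                    % 1-D projections of each class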
Linear Discriminant Analysis: Multi-class
• Generalized to multiple classes: $S_B = \sum_{c=1}^{C} N_c (m_c - m)(m_c - m)^\top$ and $S_W = \sum_{c=1}^{C}\sum_{i \in c} (x_i - m_c)(x_i - m_c)^\top$, where $m$ is the global mean
• Solution (a sketch follows below):
– Top eigenvectors of $S_W^{-1} S_B$
– Discussion: how many projections can you have?
Check the rank of $S_B$: it is a sum of $C$ rank-one matrices whose weighted mean directions sum to zero, so its rank is at most $C - 1$
At most $C - 1$ meaningful projections!
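A minimal multi-class LDA sketch (X of size N × d, an integer label vector labels with classes 1..C, and k ≤ C − 1 projections are all assumptions):
d  = size(X, 2);
m  = mean(X, 1);
Sw = zeros(d);  Sb = zeros(d);
for c = 1:C
    Xc = X(labels == c, :);
    mc = mean(Xc, 1);
    Sw = Sw + (Xc - mc)' * (Xc - mc);                 % within-class scatter
    Sb = Sb + size(Xc, 1) * (mc - m)' * (mc - m);     % between-class scatter
end
[W, D]   = eig((Sw + 1e-6 * eye(d)) \ Sb);            % eigenvectors of S_W^{-1} S_B
[~, idx] = sort(real(diag(D)), 'descend');            % real() guards against tiny imaginary parts
P = real(W(:, idx(1:k)));                             % keep the top-k directions (at most C-1 useful)
Y = (X - m) * P;                                      % N x k projected data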
Summary
• The benefit of dimensionality reduction
• Principal component analysis
– Two basic steps: decorrelation and selecting the dimensions with the largest variances
– How to perform the data transformation
– Its application, eigenfaces, and the meaning behind the eigenvectors
– High dimensionality issue
• Linear discriminant analysis
– Basic objective
– Two-class case
– Multi-class case