ANLY-601
Advanced Pattern Recognition
Spring 2018
L14 – Principal Components
Feature Extraction for Signal Representation
Principal Components
• Principal component analysis is a classical statistical
technique that eliminates correlation among variables and
can be used to reduce data dimensionality for
– Visualization
– Data compression (transform coding)
– Dimension reduction prior to classification
• The method is covariance-based and doesn’t “know” about
higher order correlation in the data.
• We’ll look at the classic treatment and a probabilistic
interpretation.
Eigenvectors and Eigenvalues of Covariance
The covariance matrix $\Sigma_x$ of an n-dimensional random
variable x is a real, symmetric matrix. So there is a
complete set of orthonormal eigenvectors $\{e_i\}$ that forms a
spanning set (complete basis) for $\mathbb{R}^n$:
$$\Sigma_x e_i = \lambda_i e_i, \qquad i = 1, \dots, n$$
$$e_i^T e_j = \delta_{ij} \quad \text{(orthonormal)}$$
$$\sum_{i=1}^{n} e_i e_i^T = I \quad \text{(complete)}$$
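Here is a minimal numpy sketch (not from the lecture; the correlated data are made up) that eigendecomposes a sample covariance and checks the three relations above.

```python
# Minimal sketch, assuming synthetic correlated data; checks the eigenvalue
# equation, orthonormality, and completeness relations stated above.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5)) @ rng.standard_normal((5, 5))  # correlated samples

Sigma = np.cov(X, rowvar=False)          # n x n sample covariance
lam, E = np.linalg.eigh(Sigma)           # eigenvalues lam[i], eigenvectors in columns of E

assert np.allclose(Sigma @ E, E * lam)   # Sigma e_i = lambda_i e_i
assert np.allclose(E.T @ E, np.eye(5))   # e_i^T e_j = delta_ij (orthonormal)
assert np.allclose(E @ E.T, np.eye(5))   # sum_i e_i e_i^T = I (complete)
```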
Positive Semi-Definite Covariance
The covariance is positive semi-definite:
$$\sigma_V^2 = \frac{V^T \Sigma_x V}{V^T V} = \frac{V^T\, E\big[(x - E[x])(x - E[x])^T\big]\, V}{V^T V} = \frac{E\big[\big(V^T (x - E[x])\big)^2\big]}{V^T V} \ge 0 .$$
Since the $e_i$ form a complete set, we can expand V as
$$V = \sum_{i=1}^{n} c_i e_i \qquad \text{with} \qquad c_i = e_i^T V ,$$
then the inner product above becomes
$$V^T \Sigma_x V = \sum_{i,j} c_i c_j\, e_i^T \Sigma_x e_j = \sum_{i,j} c_i c_j \lambda_j\, e_i^T e_j = \sum_i c_i^2 \lambda_i \ge 0 .$$
Choosing $V = e_i$ (so that $c_j = \delta_{ij}$) gives $\lambda_i \ge 0$.
Positive Semi-Definite
The two definitions of positive semi-definite,
$$\frac{V^T \Sigma_x V}{V^T V} \ge 0 \ \text{ for all } V \qquad \text{and} \qquad \lambda_i \ge 0 \ \text{ for all } i,$$
are equivalent.
Usually, unless there is a strict linear constraint in the system,
the covariance is positive-definite
Zero Eigenvalues,
Singularities, and Correlations
To get a zero eigenvalue, it must be true that the corresponding
eigenvector $e_k$ is in the null space of the covariance,
$$\Sigma_x e_k = 0 ,$$
and hence the covariance is singular. A reflection of this is that the
determinant vanishes,
$$|\Sigma_x| = \prod_{i=1}^{n} \lambda_i = 0 .$$
Geometrically, we get a singular covariance if the data has zero
spread along $e_k$. Consequently, some components
of x are perfectly correlated, and the
data are on an (n-1)-dimensional
(or smaller) sub-space.
Correlations in Real Data
In real data, components are almost never perfectly correlated.
However, high dimensional data often has many strongly correlated
components. This is reflected in the eigenvalue spectrum of the
covariance having some very small (but non-zero) eigenvalues.
Here’s the eigenvalue spectrum of the ‘m’ phoneme data.
Correlations in Real Data
Here’s the covariance eigenspectrum for 8×8 blocks (64-dim
vectors) of pixel intensities from a grayscale image (4096
vectors) – note log scale!
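A rough sketch of how such a block eigenspectrum could be computed (not the code used for the figure; the smooth "image" below is made up just to exercise the function):

```python
# Sketch, assuming an arbitrary 2-D grayscale array; computes the covariance
# eigenspectrum of non-overlapping 8x8 pixel blocks (64-dimensional vectors).
import numpy as np

def block_eigenspectrum(image, b=8):
    h, w = image.shape
    blocks = [image[i:i + b, j:j + b].ravel()
              for i in range(0, h - b + 1, b)
              for j in range(0, w - b + 1, b)]
    Sigma = np.cov(np.asarray(blocks), rowvar=False)   # 64 x 64 covariance
    return np.sort(np.linalg.eigvalsh(Sigma))[::-1]    # eigenvalues, descending

# Synthetic smooth image (512x512 gives 4096 blocks); view on a log-scale y axis.
rng = np.random.default_rng(0)
t = np.linspace(0, 10, 512)
img = np.add.outer(np.sin(t), np.cos(t)) + 0.05 * rng.standard_normal((512, 512))
spectrum = block_eigenspectrum(img)
```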
PCA
Transform to the basis of eigenvectors:
$$y = \Phi^T x \qquad \text{or} \qquad y_i = e_i^T x, \qquad \text{where } \Phi = (e_1\ e_2\ \cdots\ e_n).$$
The y’s are the principal components. Note that $\Phi\,\Phi^T = \Phi^T\Phi = I$ (the
identity matrix). Then the covariance of y is
$$\Sigma_y = E\big[(y - E[y])(y - E[y])^T\big] = \Phi^T \Sigma_x \Phi = \begin{pmatrix} \lambda_1 & & 0 \\ & \lambda_2 & \\ 0 & & \ddots \end{pmatrix}.$$
Since cov(y) is diagonal, the components of y are
uncorrelated.
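A small numerical check (synthetic data, not from the lecture): transform to the eigenbasis (centering first, which does not change the covariance) and confirm that the covariance of y is diagonal with the eigenvalues on the diagonal.

```python
# Sketch, assuming synthetic data; y = Phi^T (x - mean) should have a diagonal
# covariance whose entries are the eigenvalues of the covariance of x.
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 4)) @ rng.standard_normal((4, 4)) + 3.0

lam, Phi = np.linalg.eigh(np.cov(X, rowvar=False))   # columns of Phi are eigenvectors
Y = (X - X.mean(axis=0)) @ Phi                       # principal components, row-wise

Sigma_y = np.cov(Y, rowvar=False)
assert np.allclose(Sigma_y, np.diag(lam), atol=1e-8)  # uncorrelated, variances = eigenvalues
```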
Maximal Variance Directions
The variance of the data along an arbitrary direction V
(with |V| = 1) is
$$\sigma_V^2 = V^T \Sigma_x V .$$
Let’s find the V that maximizes this subject to the constraint that V
is unit norm. Construct the Lagrangian
$$L = V^T \Sigma_x V - \lambda\,(V^T V - 1)$$
and find its critical points by setting the derivative with respect
to the components of V equal to 0.
Maximal Variance Directions
Again, with the Lagrangian
$$L = V^T \Sigma_x V - \lambda\,(V^T V - 1),$$
setting the derivative with respect to the components of V to zero gives
$$\frac{\partial L}{\partial V_l} = \frac{\partial}{\partial V_l}\Big( \sum_{k,i} V_k\, \Sigma_{x,ki}\, V_i - \lambda\big( \sum_k V_k^2 - 1 \big) \Big) = 2 \sum_k \Sigma_{x,lk} V_k - 2 \lambda V_l = 0 ,$$
so
$$\Sigma_x V = \lambda V .$$
Maximal Variance
We have $\Sigma_x V = \lambda V$, so the data has maximal variance along an eigenvector of
the covariance. Which one? The variance along eigenvector $e_i$ is
$$\sigma_{e_i}^2 = e_i^T \Sigma_x e_i = \lambda_i\, e_i^T e_i = \lambda_i ,$$
so the variance is maximum along the eigenvector
corresponding to the largest eigenvalue.
Typically we order the indices such that
$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n .$$
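A quick numerical sanity check (made-up covariance, not from the lecture): no unit-norm direction gives a larger projected variance than the leading eigenvector.

```python
# Sketch, assuming a synthetic covariance; compares V^T Sigma V over many random
# unit vectors against the largest eigenvalue lambda_1.
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 5))
Sigma = A @ A.T                                   # a valid (PSD) covariance
lam, E = np.linalg.eigh(Sigma)                    # ascending, lam[-1] = largest

V = rng.standard_normal((5, 10000))
V /= np.linalg.norm(V, axis=0)                    # random unit-norm directions
var_along_V = np.einsum('ij,ik,kj->j', V, Sigma, V)

assert var_along_V.max() <= lam[-1] + 1e-9
assert np.isclose(E[:, -1] @ Sigma @ E[:, -1], lam[-1])   # attained by the top eigenvector
```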
Deflation
So the leading variance direction is $e_1$. Let’s project the data
orthogonal to this eigenvector with a projection operator
$$x' = (I - e_1 e_1^T)\, x = P_1 x ,$$
where the projection operator $P_1$ satisfies
$$P_1^T = P_1, \qquad P_1 P_1 = P_1 .$$
The x’ have covariance
$$\Sigma_{x'} = P_1 \Sigma_x P_1$$
with eigenvalues and eigenvectors
$$\Sigma_{x'} e_i = \lambda_i e_i, \qquad i \ne 1 .$$
The highest variance direction of the new data x’ is the old
second eigenvector $e_2$ … etc.
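A short check of the deflation step (synthetic covariance assumed): after projecting out $e_1$, the leading eigenpair of the deflated covariance is the old second one.

```python
# Sketch, assuming a made-up covariance; P1 = I - e1 e1^T deflates the leading
# direction, and the new top eigenpair is the old second one.
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))
Sigma = A @ A.T
lam, E = np.linalg.eigh(Sigma)                    # ascending: lam[-1] largest
e1 = E[:, -1]

P1 = np.eye(4) - np.outer(e1, e1)                 # P1^T = P1, P1 P1 = P1, P1 e1 = 0
Sigma_def = P1 @ Sigma @ P1                       # covariance of x' = P1 x

lam_def, E_def = np.linalg.eigh(Sigma_def)
assert np.isclose(lam_def[-1], lam[-2])                   # new top eigenvalue = old lambda_2
assert np.isclose(abs(E_def[:, -1] @ E[:, -2]), 1.0)      # new top eigenvector = old e_2 (up to sign)
```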
Dimension Reduction
For visualization and as a precursor to compression or other
processing, PCA is useful as a dimension reduction
technique.
Approximate the vector x by its expansion in terms of the
leading m eigenvectors.
The map from s into x, $x = W s + m$, has an image in the m-dimensional
hyperplane S. The matrix W is $n \times m$.
A. Basilevsky. Statistical Factor Analysis and Related Methods, Wiley 1994.
(Diagram: the latent coordinate s maps to the point $W s + m$ on the hyperplane S containing the approximation of x.)
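A minimal sketch of this reduction (synthetic data, and with W taken here to be the leading eigenvectors rather than the ML estimate discussed later): map each x to its m-dimensional coordinates s, then back to $W s + m$ on the hyperplane.

```python
# Sketch, assuming synthetic data; W holds the leading m eigenvectors, s are the
# coordinates in the hyperplane S, and W s + m is the reconstruction of x.
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((500, 3)) @ rng.standard_normal((3, 10))   # ~3-dim data in R^10
X += 0.05 * rng.standard_normal(X.shape)                           # small off-plane scatter

m_dim = 3
mu = X.mean(axis=0)
lam, E = np.linalg.eigh(np.cov(X, rowvar=False))
W = E[:, ::-1][:, :m_dim]                    # n x m, leading m eigenvectors

S = (X - mu) @ W                             # s: coordinates in the hyperplane
X_hat = S @ W.T + mu                         # W s + m: image on the m-dim hyperplane
print("mean squared reconstruction error:", np.mean(np.sum((X - X_hat) ** 2, axis=1)))
```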
Probabilistic PCA
Next add spherical Gaussian noise e with density $N(0, \sigma_e^2 I)$. The
resulting
$$x = W s + m + e$$
also has a Gaussian density, and occupies $\mathbb{R}^n$ rather than the subspace
S.
(Diagram: the noise e displaces x from the point $W s + m$ on the hyperplane S.)
Probabilistic PCA
To see that x is Gaussian, note that
$$p(x) = \int p(x \mid s)\, p(s)\, ds ,$$
where $p(x \mid s)$ is normal with mean $W s + m$ and covariance $\sigma_e^2 I$,
and $p(s)$ is normal with mean zero and covariance I.
It’s easy to show that x has mean m and covariance
$$\Sigma_x = W W^T + \sigma_e^2 I .$$
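A generative sketch of this model (all parameter values below are made up): sample s and e, form $x = W s + m + e$, and check that the sample covariance approaches $W W^T + \sigma_e^2 I$.

```python
# Sketch with made-up W, m, and sigma_e^2; the empirical covariance of x should
# approach W W^T + sigma_e^2 I for large N.
import numpy as np

rng = np.random.default_rng(5)
n, m_dim, sigma2, N = 6, 2, 0.1, 200_000
W = rng.standard_normal((n, m_dim))
m_vec = rng.standard_normal(n)

S = rng.standard_normal((N, m_dim))                      # s ~ N(0, I)
E_noise = np.sqrt(sigma2) * rng.standard_normal((N, n))  # e ~ N(0, sigma_e^2 I)
X = S @ W.T + m_vec + E_noise                            # x = W s + m + e

assert np.allclose(X.mean(axis=0), m_vec, atol=0.05)
assert np.allclose(np.cov(X, rowvar=False), W @ W.T + sigma2 * np.eye(n), atol=0.05)
```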
Fitting Model Parameters to Data
Maximum Likelihood Estimates
• MLE of m is the data mean.
• MLE of W is
$$W = U\, \Gamma^{1/2}$$
where U are the leading m eigenvectors of the data covariance and $\Gamma$
is a diagonal matrix with entries
$$\Gamma_{ii} = \lambda_i - \sigma_e^2, \qquad i = 1, \dots, m ,$$
where the $\lambda_i$ are the leading m eigenvalues of the data covariance.
• MLE estimate of the noise variance is the average of the trailing
eigenvalues of the data covariance,
$$\hat{\sigma}_e^2 = \frac{1}{n - m} \sum_{i = m + 1}^{n} \lambda_i .$$
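A sketch of these estimates in code (the helper name `ppca_mle` is hypothetical, not library code; X is any N x n data matrix):

```python
# Sketch of the ML estimates above: W = U Gamma^(1/2), the data mean, and the
# noise variance as the average of the trailing eigenvalues.
import numpy as np

def ppca_mle(X, m):
    mu = X.mean(axis=0)                          # MLE of the mean
    lam, E = np.linalg.eigh(np.cov(X, rowvar=False))
    lam, E = lam[::-1], E[:, ::-1]               # sort eigenpairs descending
    sigma2 = lam[m:].mean()                      # average of trailing n - m eigenvalues
    W = E[:, :m] * np.sqrt(lam[:m] - sigma2)     # U Gamma^(1/2), Gamma_ii = lambda_i - sigma^2
    return W, mu, sigma2
```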
Probabilistic Interpretation of
Projection onto the Hyperplane S
• Projection of x back onto S corresponds to a two-step process
– Given x, find the most likely s that generated it,
$$\hat{s}(x) = \arg\max_s\, p(s \mid x)$$
– Then map this s onto the hyperplane by
$$W \hat{s}(x) + m$$
This corresponds exactly to the projection of x onto the
space spanned by the leading m eigenvectors of the data
covariance.
PCA and SVD
PCA and singular value decomposition (SVD) are often
spoken of as if they are identical. Here’s the connection.
The SVD theorem says that any matrix $X_{N \times n}$ can be
decomposed as the product
$$X = U\, S\, V^T$$
where $U_{N \times n}$ is column orthogonal, $S_{n \times n}$ is diagonal (the ‘singular values’ of X),
and $V_{n \times n}$ is orthogonal.
PCA and SVD
Suppose we have a collection of N vectors, each of
dimension n, with zero mean (not absolutely necessary).
Load these vectors row-wise into the matrix X
$$X = \begin{pmatrix} x^{(1)} \\ x^{(2)} \\ \vdots \\ x^{(N)} \end{pmatrix}$$
PCA and SVD
• Next, show that
$$X^T X = N\, \hat{\Sigma}_x .$$
• Substitute the SVD of X into the above,
$$X^T X = V S\, U^T U\, S V^T = V S^2 V^T = N\, \hat{\Sigma}_x ,$$
and rearrange to give
$$\hat{\Sigma}_x = \frac{1}{N}\, V S^2 V^T .$$
Compare with the eigenvector equation for the covariance (from the earlier
Eigenvectors and Eigenvalues slide) to conclude that
the columns of V are the eigenvectors of $\hat{\Sigma}_x$,
and the eigenvalues of $\hat{\Sigma}_x$ are $\frac{1}{N} S_{ii}^2$.
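A quick numerical confirmation (synthetic zero-mean data assumed): the right singular vectors and scaled squared singular values from the SVD of X match the eigenvectors and eigenvalues of the covariance.

```python
# Sketch, assuming synthetic zero-mean data; compares the SVD route with direct
# eigendecomposition of Sigma_hat = X^T X / N.
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((500, 6)) @ rng.standard_normal((6, 6))
X -= X.mean(axis=0)                                     # zero-mean rows
N = X.shape[0]

U, s, Vt = np.linalg.svd(X, full_matrices=False)        # X = U S V^T, s descending
lam, E = np.linalg.eigh(X.T @ X / N)                    # Sigma_hat, lam ascending

assert np.allclose(s[::-1] ** 2 / N, lam)               # eigenvalues of Sigma_hat are S_ii^2 / N
for k in range(6):                                      # columns of V = eigenvectors (up to sign)
    assert np.isclose(abs(Vt[k] @ E[:, -(k + 1)]), 1.0)
```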
PCA and SVD: Numerical Considerations
• Whenever you do algebraic operations on a computer you
have to worry about truncation error and its propagation.
If the ratio of the largest to the smallest eigenvalue
(or singular value) of a matrix approaches the reciprocal of the
machine precision, you’re sure to be in trouble. This ratio is
called the condition number of the matrix.
• Notice that the eigenvalues of the covariance are $S_{ii}^2/N$. So
the condition number of the covariance matrix is the square
of the condition number of X.
Numerically, you’re better off using SVD than computing the
covariance and diagonalizing it directly.
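A small illustration of the point (deliberately badly scaled, made-up data):

```python
# Sketch with badly scaled synthetic columns; cond(X^T X) is (up to rounding)
# the square of cond(X), so forming the covariance explicitly squares the
# numerical difficulty.
import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((1000, 8)) * (10.0 ** np.arange(8))   # columns spanning 8 orders of magnitude
X -= X.mean(axis=0)

print("cond(X)     =", np.linalg.cond(X))        # ratio of largest to smallest singular value
print("cond(X^T X) =", np.linalg.cond(X.T @ X))  # roughly cond(X) ** 2
```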
Application: Dimensionality Reduction
of Faces
64×64 pixel 8-bit grayscale images
(64×64=4096 dimensional points).
20 different subjects, 160 total
images.
PCA reduction to 5 dimensions.
PCA – Linear Subspace Model
Nonlinear PCA Curved Manifold Model
(Deep learning — cf 1995)
Approximating the Manifold
By Local Tangent Spaces
PCA & Nonlinear Dimension
Reduction