
ANLY-601
Advanced Pattern Recognition

Spring 2018

L14 – Principal Components

Feature Extraction for Signal Representation
Principal Components

• Principal component analysis is a classical statistical
technique that eliminates correlation among variables and
can be used to reduce data dimensionality for

– Visualization

– Data compression (transform coding)

– Dimension reduction prior to classification

• The method is covariance-based and doesn’t “know” about
higher order correlation in the data.

• We’ll look at the classic treatment and a probabilistic
interpretation.


Eigenvectors and Eigenvalues of Covariance

The covariance matrix Σ_x of an n-dimensional random variable x is a
real, symmetric matrix. So there is a complete set of orthonormal
eigenvectors φ_i that form a spanning set (complete basis) for R^n:

\Sigma_x \phi_i = \lambda_i \phi_i , \quad i = 1, \ldots, n

\phi_i^T \phi_j = \delta_{ij} \quad \text{(orthonormal)}

\sum_{i=1}^{n} \phi_i \phi_i^T = I \quad \text{(complete)}
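A minimal numpy sketch of these facts (not from the slides; the data and variable names are illustrative): diagonalize a sample covariance and check orthonormality, completeness, and the spectral decomposition.

import numpy as np

# Illustrative sketch: diagonalize a sample covariance and check the properties above.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 4)) @ rng.standard_normal((4, 4))   # 500 samples, n = 4
Xc = X - X.mean(axis=0)                                           # center the data
Sigma_x = Xc.T @ Xc / len(Xc)                                     # real, symmetric covariance

lam, Phi = np.linalg.eigh(Sigma_x)   # eigenvalues (ascending) and eigenvectors (columns phi_i)

assert np.allclose(Phi.T @ Phi, np.eye(4))                 # orthonormal: phi_i^T phi_j = delta_ij
assert np.allclose(Phi @ Phi.T, np.eye(4))                 # complete:   sum_i phi_i phi_i^T = I
assert np.allclose(Phi @ np.diag(lam) @ Phi.T, Sigma_x)    # Sigma_x = sum_i lambda_i phi_i phi_i^T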

Positive Semi-Definite Covariance

The covariance is positive semi-definite:

\sigma^2_{x \cdot V} \equiv V^T \Sigma_x V = V^T E\left[(x - E[x])(x - E[x])^T\right] V = E\left[\left(V^T (x - E[x])\right)^2\right] \geq 0

Since the φ_i form a complete set, we can expand V as

V = \sum_{i=1}^{n} c_i \phi_i , \quad \text{with} \quad c_i = \phi_i^T V

Then the inner product above becomes

V^T \Sigma_x V = \sum_{i,j} c_i c_j \, \phi_i^T \Sigma_x \phi_j = \sum_{i,j} c_i c_j \lambda_j \, \phi_i^T \phi_j = \sum_{i,j} c_i c_j \lambda_j \delta_{ij} = \sum_i c_i^2 \lambda_i \geq 0

Choosing V = φ_i (so that c_j = δ_ij) picks out a single term. Hence λ_i ≥ 0.


Positive Semi-Definite

The two definitions of positive semi-definite,

\sigma^2_{x \cdot V} = V^T \Sigma_x V \geq 0 \quad \text{for all } V

and

\lambda_i \geq 0 \quad \text{for all } i,

are equivalent.

Usually, unless there is a strict linear constraint in the system,

the covariance is positive-definite


Zero Eigenvalues,
Singularities, and Correlations

To get a zero eigenvalue, it must be true that the corresponding
eigenvector is in the null space of the covariance,

\Sigma_x \phi_k = 0

and hence the covariance is singular. A reflection of this is that the
determinant vanishes:

\left| \Sigma_x \right| = \prod_{i=1}^{n} \lambda_i = 0

Geometrically, we get a singular covariance if the data has zero
spread along φ_k. Consequently, some components of x are perfectly
correlated, and the data are on an (n-1)-dimensional (or smaller)
subspace.


Correlations in Real Data

In real data, components are almost never perfectly correlated.

However, high dimensional data often has many strongly correlated

components. This is reflected in the eigenvalue spectrum of the

covariance having some very small (but non-zero) eigenvalues.

Here’s the eigenvalue spectrum of the ‘m’ phoneme data.


Correlations in Real Data

Here’s the covariance eigenspectrum for 8×8 blocks (64-dim

vectors) of pixel intensities from a grayscale image (4096

vectors) – note log scale!


PCA
Transform to the basis of eigenvectors:

y_i = \phi_i^T x \quad \text{or} \quad y = \Phi^T x , \quad \text{where} \quad \Phi = \left( \phi_1 \; \phi_2 \; \cdots \; \phi_n \right)

The y’s are the principal components. Note that \Phi \Phi^T = \Phi^T \Phi = I (the
identity matrix). Then the covariance of y is

\Sigma_y = E\left[ (y - E[y])(y - E[y])^T \right] = \Phi^T \Sigma_x \Phi = \Lambda , \quad \text{where} \quad \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)

Since cov(y) is diagonal, the components of y are
uncorrelated.

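A short numpy sketch of this transform (illustrative, not from the slides): rotate correlated data into the eigenvector basis and confirm that the covariance of y is diagonal, i.e. the principal components are uncorrelated.

import numpy as np

# Sketch: rotate correlated data into the eigenvector basis, y = Phi^T x,
# and check that the covariance of y is diagonal with entries lambda_i.
rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 3)) @ rng.standard_normal((3, 3))   # correlated data
Xc = X - X.mean(axis=0)

Sigma_x = Xc.T @ Xc / len(Xc)
lam, Phi = np.linalg.eigh(Sigma_x)        # columns of Phi are the eigenvectors phi_i

Y = Xc @ Phi                              # row-wise version of y = Phi^T x
Sigma_y = Y.T @ Y / len(Y)

assert np.allclose(Sigma_y, np.diag(lam), atol=1e-10)   # diagonal: components of y uncorrelated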

Maximal Variance Directions

The variance of the data along an arbitrary direction V (with |V| = 1) is

\sigma^2_{x \cdot V} = V^T \Sigma_x V

Let’s find the V that maximizes this subject to the constraint that V
is unit norm. Construct the Lagrangian

L = V^T \Sigma_x V - \lambda \left( V^T V - 1 \right)

and find its critical points by setting the derivative with respect
to the components of V equal to 0.


Maximal Variance Directions

Let’s find the V that maximizes this subject to the constraint that V
is unit norm. Construct the Lagrangian

L = V^T \Sigma_x V - \lambda \left( V^T V - 1 \right)

and find its critical points by setting the derivative with respect
to the components of V equal to 0:

\frac{\partial L}{\partial V_l} = \frac{\partial}{\partial V_l} \left[ \sum_{k, k'} V_k \Sigma_{x, kk'} V_{k'} - \lambda \left( \sum_k V_k V_k - 1 \right) \right] = 2 \sum_k \Sigma_{x, lk} V_k - 2 \lambda V_l = 0

So \Sigma_x V = \lambda V : the critical points are the eigenvectors, V = \phi_i with \lambda = \lambda_i.


Maximal Variance

The data has maximal variance along an eigenvector of
the covariance. Which one?

We have \Sigma_x V = \lambda V , and along the eigenvector φ_i the variance is

\sigma^2_{x \cdot \phi_i} = \phi_i^T \Sigma_x \phi_i = \lambda_i \, \phi_i^T \phi_i = \lambda_i

The variance is maximum along the eigenvector
corresponding to the largest eigenvalue.

Typically we order the indices such that

\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n

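A quick numerical illustration of this result (an illustrative sketch, not part of the slides): the variance along any unit vector never exceeds λ_1, and it is attained at φ_1.

import numpy as np

# Sketch: the variance along any unit vector V is at most lambda_1,
# and it is attained at the leading eigenvector phi_1.
rng = np.random.default_rng(2)
X = rng.standard_normal((5000, 5)) @ rng.standard_normal((5, 5))
Xc = X - X.mean(axis=0)
Sigma_x = Xc.T @ Xc / len(Xc)

lam, Phi = np.linalg.eigh(Sigma_x)
lam, Phi = lam[::-1], Phi[:, ::-1]        # order lambda_1 >= lambda_2 >= ... >= lambda_n

for _ in range(1000):                     # random unit-norm directions
    V = rng.standard_normal(5)
    V /= np.linalg.norm(V)
    assert V @ Sigma_x @ V <= lam[0] + 1e-9

assert np.isclose(Phi[:, 0] @ Sigma_x @ Phi[:, 0], lam[0])   # attained at phi_1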

Deflation

So the leading variance direction is φ_1. Let’s project the data
orthogonal to this eigenvector with a projection operator

x' = \left( I - \phi_1 \phi_1^T \right) x \equiv P_1 x

where the projection operator P_1 satisfies

P_1^T = P_1 , \quad P_1 P_1 = P_1

The x' have covariance

\Sigma_{x'} = P_1 \Sigma_x P_1

with eigenvalues and eigenvectors

\Sigma_{x'} \phi_i = \lambda_i \phi_i , \quad i \neq 1

The highest variance direction of the new data x' is the old
second eigenvector φ_2 ... etc.

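Deflation is how eigenvectors can be pulled out one at a time. Below is a minimal sketch (assuming a simple power-iteration helper, leading_eigvec, which is not from the slides): find φ_1, project it out with P_1 = I − φ_1 φ_1^T, and the leading direction of P_1 Σ_x P_1 is φ_2.

import numpy as np

def leading_eigvec(S, n_iter=500):
    """Power iteration: leading eigenvector of a symmetric positive semi-definite S."""
    v = np.ones(S.shape[0]) / np.sqrt(S.shape[0])
    for _ in range(n_iter):
        v = S @ v
        v /= np.linalg.norm(v)
    return v

# Build a covariance with a known spectrum so the answer is easy to check.
rng = np.random.default_rng(3)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))      # random orthonormal eigenvectors
Sigma_x = Q @ np.diag([4.0, 2.0, 1.0, 0.5]) @ Q.T

phi1 = leading_eigvec(Sigma_x)                        # leading variance direction
P1 = np.eye(4) - np.outer(phi1, phi1)                 # projector orthogonal to phi_1
Sigma_x_deflated = P1 @ Sigma_x @ P1                  # covariance of x' = P_1 x

phi2 = leading_eigvec(Sigma_x_deflated)               # top direction of x' = old phi_2
assert np.isclose(abs(phi2 @ Q[:, 1]), 1.0)           # matches, up to sign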

Dimension Reduction

For visualization and as a precursor to compression or other
processing, PCA is useful as a dimension reduction
technique.

Approximate the vector x by its expansion in terms of the
leading m eigenvectors.

The map from s into x, W s + μ, has its image in the m-dimensional
hyperplane S. The matrix W is n × m.

A. Basilevsky. Statistical Factor Analysis and Related Methods. Wiley, 1994.

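A small numpy sketch of this reduction (illustrative; here W is taken to be the leading m eigenvectors and μ the sample mean, one natural choice rather than the slides' general W): project onto the m-dimensional hyperplane and check that the mean squared reconstruction error equals the sum of the trailing eigenvalues.

import numpy as np

# Sketch: W = leading m eigenvectors, mu = sample mean; reconstruct x as W s + mu
# with s = W^T (x - mu), and measure the reconstruction error.
rng = np.random.default_rng(4)
X = rng.standard_normal((1000, 6)) @ rng.standard_normal((6, 6))   # N = 1000, n = 6
mu = X.mean(axis=0)
Xc = X - mu
Sigma_x = Xc.T @ Xc / len(Xc)

lam, Phi = np.linalg.eigh(Sigma_x)
lam, Phi = lam[::-1], Phi[:, ::-1]        # descending eigenvalues

m = 2
W = Phi[:, :m]                            # n x m matrix of leading eigenvectors
S_coords = Xc @ W                         # s = W^T (x - mu) for every sample
X_hat = S_coords @ W.T + mu               # W s + mu lies in the m-dimensional hyperplane S

mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
assert np.isclose(mse, lam[m:].sum())     # error = sum of the trailing eigenvalues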

Probabilistic PCA

Next add spherical Gaussian noise e with density N(0, σ_e² I). The resulting

x = W s + \mu + e

also has a Gaussian density, and occupies R^n rather than the subspace S.


Probabilistic PCA

To see that x is Gaussian, note that

p(x) = \int p(x \mid s)\, p(s)\, ds

where p(x | s) is normal with mean W s + μ and covariance σ_e² I,
and p(s) is normal with mean zero and covariance I.

It’s easy to show that x has mean μ, and covariance

\Sigma_x = W W^T + \sigma_e^2 I

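A sampling sketch of the generative model (the parameters W, μ, σ_e below are made up for illustration): draw s and e, form x = W s + μ + e, and compare the empirical covariance with W W^T + σ_e² I.

import numpy as np

# Sketch of the generative model x = W s + mu + e, with made-up example parameters.
rng = np.random.default_rng(5)
n, m, sigma_e = 5, 2, 0.3
W = rng.standard_normal((n, m))
mu = rng.standard_normal(n)

N = 200_000
S = rng.standard_normal((N, m))                    # latent s ~ N(0, I)
E = sigma_e * rng.standard_normal((N, n))          # noise  e ~ N(0, sigma_e^2 I)
X = S @ W.T + mu + E                               # x = W s + mu + e, row-wise

Sigma_model = W @ W.T + sigma_e**2 * np.eye(n)     # predicted covariance W W^T + sigma_e^2 I
Sigma_hat = np.cov(X, rowvar=False)                # empirical covariance
print(np.max(np.abs(Sigma_hat - Sigma_model)))     # small; shrinks like 1/sqrt(N)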

Fitting Model Parameters to Data
Maximum Likelihood Estimates

• MLE of μ is the data mean.

• MLE of W is

W = U \Gamma^{1/2}

where U holds the leading m eigenvectors of the data covariance and Γ
is a diagonal matrix with entries

\Gamma_{ii} = \lambda_i - \sigma_e^2 , \quad i = 1, \ldots, m

where the λ_i are the leading m eigenvalues of the data covariance.

• MLE estimate of the noise variance is the average of the trailing
eigenvalues of the data covariance:

\hat{\sigma}_e^2 = \frac{1}{n - m} \sum_{i = m+1}^{n} \lambda_i

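These estimates are easy to compute from an eigendecomposition of the sample covariance. A minimal sketch (illustrative; the ML solution for W is determined only up to a rotation of the latent space, which is ignored here):

import numpy as np

def ppca_mle(X, m):
    """Closed-form ML estimates for probabilistic PCA (illustrative sketch)."""
    mu_hat = X.mean(axis=0)                          # MLE of mu: the data mean
    Xc = X - mu_hat
    lam, U = np.linalg.eigh(Xc.T @ Xc / len(Xc))     # data covariance eigen-decomposition
    lam, U = lam[::-1], U[:, ::-1]                   # descending eigenvalues

    sigma2_hat = lam[m:].mean()                      # average of the trailing eigenvalues
    Gamma = np.diag(lam[:m] - sigma2_hat)            # Gamma_ii = lambda_i - sigma_e^2
    W_hat = U[:, :m] @ np.sqrt(Gamma)                # W = U Gamma^(1/2)
    return mu_hat, W_hat, sigma2_hat

# Usage: mu_hat, W_hat, s2_hat = ppca_mle(X, m=2) for any (N x n) data array X.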

Probabilistic Interpretation of
Projection onto the Hyperplane S

• Projection of x back onto S corresponds to a two-step process

– Given x, find the most likely s that generated it

\hat{s}(x) = \arg\max_s \, p(x \mid s)

– Then map this s onto the hyperplane by

W \hat{s}(x) + \mu

This corresponds exactly to the projection of x onto the
space spanned by the leading m eigenvectors of the data
covariance.

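Maximizing p(x | s) over s is a least-squares problem in s, since the exponent of p(x | s) is −||x − W s − μ||² / (2σ_e²). The sketch below (illustrative names and data, not from the slides) solves it and checks that W ŝ(x) + μ coincides with the direct projection onto the leading m eigenvectors.

import numpy as np

# Sketch: step 1 (most likely s) is least squares; step 2 maps it back through W s + mu.
# The result matches the orthogonal projection onto the leading m eigenvectors.
rng = np.random.default_rng(6)
n, m = 5, 2
X = rng.standard_normal((1000, n)) @ rng.standard_normal((n, n))
mu = X.mean(axis=0)
Xc = X - mu

lam, Phi = np.linalg.eigh(Xc.T @ Xc / len(Xc))
lam, Phi = lam[::-1], Phi[:, ::-1]
U_m = Phi[:, :m]                                        # leading m eigenvectors
sigma2 = lam[m:].mean()
W = U_m @ np.diag(np.sqrt(lam[:m] - sigma2))            # ML estimate of W (up to rotation)

x = X[0]
s_hat, *_ = np.linalg.lstsq(W, x - mu, rcond=None)      # step 1: most likely s given x
proj_two_step = W @ s_hat + mu                          # step 2: map onto the hyperplane S

proj_pca = U_m @ U_m.T @ (x - mu) + mu                  # direct projection onto leading eigenvectors
assert np.allclose(proj_two_step, proj_pca)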

PCA and SVD

PCA and singular value decomposition (SVD) are often

spoken of as if they are identical. Here’s the connection.

The SVD theorem says that any N × n matrix X can be
decomposed as the product

X = U S V^T

where U is N × n and column orthogonal (U^T U = I),
S is n × n and diagonal, holding the ‘singular values’ of X,
and V is n × n and orthogonal.


PCA and SVD

Suppose we have a collection of N vectors, each of

dimension n, with zero mean (not absolutely necessary).

Load these vectors row-wise into the matrix X:

X = \begin{pmatrix} x^{(1)} \\ x^{(2)} \\ \vdots \\ x^{(N)} \end{pmatrix}


PCA and SVD

• Next, show that the sample covariance is

\hat{\Sigma}_x = X^T X / N

• Substitute the SVD of X into the above,

\hat{\Sigma}_x = X^T X / N = V S U^T U S V^T / N = V S^2 V^T / N

and rearrange (multiply on the right by V and use V^T V = I) to give

\hat{\Sigma}_x V = V S^2 \tfrac{1}{N}

Compare with the eigenvector equation \Sigma_x \phi_i = \lambda_i \phi_i to conclude that

the columns of V are the eigenvectors of \hat{\Sigma}_x , and

the eigenvalues of \hat{\Sigma}_x are \tfrac{1}{N} S^2 (the squared singular values divided by N).

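A numpy check of this correspondence (illustrative data): compute the eigendecomposition of X^T X / N and the SVD of X, and compare the eigenvalues with S²/N and the eigenvectors with the columns of V.

import numpy as np

# Sketch: the covariance-eigendecomposition route and the SVD route agree.
rng = np.random.default_rng(7)
N, n = 1000, 4
X = rng.standard_normal((N, n)) @ rng.standard_normal((n, n))
X = X - X.mean(axis=0)                               # zero-mean data, loaded row-wise

Sigma_hat = X.T @ X / N                              # route 1: form and diagonalize the covariance
lam, V_eig = np.linalg.eigh(Sigma_hat)
lam, V_eig = lam[::-1], V_eig[:, ::-1]

U, s, Vt = np.linalg.svd(X, full_matrices=False)     # route 2: X = U S V^T

assert np.allclose(lam, s**2 / N)                            # eigenvalues are S^2 / N
assert np.allclose(np.abs(Vt @ V_eig), np.eye(n), atol=1e-6) # same eigenvectors, up to sign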

PCA and SVD: Numerical Considerations

• Whenever you do algebraic operations on a computer you
have to worry about truncation error and its propagation.

If the ratio of the smallest to the largest eigenvalue (or singular
value) of a matrix approaches the machine precision, you’re sure to
be in trouble. The ratio of the largest to the smallest singular value
is called the condition number of the matrix.

• Notice that the eigenvalues of the covariance Σ̂_x are S²/N. So
the condition number of the covariance matrix is the square
of the condition number of X.

Numerically, you’re better off using the SVD of X than computing the
covariance and diagonalizing it directly.

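A quick numerical illustration of the squaring (illustrative data with deliberately mismatched column scales):

import numpy as np

# Sketch: the covariance X^T X / N squares the condition number of X, which is why
# working with the SVD of X directly is numerically safer.
rng = np.random.default_rng(8)
X = rng.standard_normal((500, 3)) * np.array([1.0, 1e-2, 1e-4])   # very different column scales
X = X - X.mean(axis=0)

cond_X = np.linalg.cond(X)                     # sigma_max / sigma_min of X
cond_Sigma = np.linalg.cond(X.T @ X / 500)     # condition number of the covariance

print(cond_X, cond_Sigma)
assert np.isclose(cond_Sigma, cond_X**2, rtol=1e-3)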

Application: Dimensionality Reduction
of Faces

64×64 pixel 8-bit grayscale images
(64×64=4096 dimensional points).

20 different subjects, 160 total
images.

PCA reduction to 5 dimensions.


PCA – Linear Subspace Model



Nonlinear PCA Curved Manifold Model
(Deep learning — cf 1995)

Approximating the Manifold
By Local Tangent Spaces


PCA & Nonlinear Dimension
Reduction
