CS373 Data Mining and Machine Learning
Lecture 12
Jean Honorio
Purdue University
Except for the first two and last two slides, the presentation was taken from
http://www.cs.cmu.edu/~bapoczos/Classes/ML10715_2015Fall/
Linear Algebra: Eigen Decomposition
• Symmetric matrix: Σ = U Λ Uᵀ
• Orthonormal matrix U: Uᵀ U = I
• Diagonal matrix of eigenvalues: Λ
• Eigenvectors: columns of U
• Python:
import numpy as np
import numpy.linalg as la
Sigma = np.array([[ 5.2, 3.3, -2],
[ 3.3, 8.1, 4],
[ -2, 4, 6.5]])
lam, U = la.eig(Sigma)
>>> lam
array([ 0.38, 7.7 , 11.71])
>>> U
array([[-0.61, 0.75, 0.25],
[ 0.55, 0.18, 0.81],
[-0.56, -0.64, 0.53]])
>>> np.dot(U.T,U)
array([[ 1., 0., 0.],
[ 0., 1., 0.],
[ 0., 0., 1.]])
>>> np.dot(U,np.dot(np.diag(lam),U.T))
array([[ 5.2, 3.3, -2. ],
[ 3.3, 8.1, 4. ],
[-2. , 4. , 6.5]])
Linear Algebra: Singular Value Decomposition
• Matrix: X = U S Vᵀ
• Orthonormal matrices U, V: Uᵀ U = I, Vᵀ V = I
• Diagonal matrix of singular values: S
• Python:
import numpy as np
import numpy.linalg as la
X = np.array([[ 0.6, -0.7],
[ 2.5, 1.9],
[ -1.6, -0.9],
[ -2.8, 0.8]])
U, s, Vt = la.svd(X,False)
>>> U
array([[-0.09, 0.39],
[-0.69, -0.54],
[ 0.42, 0.2 ],
[ 0.58, -0.72]])
>>> s
array([ 4.24, 2.13])
>>> Vt
array([[-0.96, -0.27],
[ 0.27, -0.96]])
>>> np.dot(Vt,Vt.T)
array([[ 1., 0.],
[ 0., 1.]])
# U*Diag(s)*Vt
>>> np.dot(U,np.dot(np.diag(s),Vt))
array([[ 0.6, -0.7],
[ 2.5, 1.9],
[-1.6, -0.9],
[-2.8, 0.8]])
Motivation
PCA Applications
• Data Visualization
• Data Compression
• Noise Reduction
Data Visualization
Example:
• Given 53 blood and urine measurements (features) from 65 people.
• How can we visualize the measurements?
Data Visualization
• Matrix format (65×53): rows = instances, columns = features
     H-WBC    H-RBC    H-Hgb    H-Hct    H-MCV    H-MCH   H-MCHC
A1 8.0000 4.8200 14.1000 41.0000 85.0000 29.0000 34.0000
A2 7.3000 5.0200 14.7000 43.0000 86.0000 29.0000 34.0000
A3 4.3000 4.4800 14.1000 41.0000 91.0000 32.0000 35.0000
A4 7.5000 4.4700 14.9000 45.0000 101.0000 33.0000 33.0000
A5 7.3000 5.5200 15.4000 46.0000 84.0000 28.0000 33.0000
A6 6.9000 4.8600 16.0000 47.0000 97.0000 33.0000 34.0000
A7 7.8000 4.6800 14.7000 43.0000 92.0000 31.0000 34.0000
A8 8.6000 4.8200 15.8000 42.0000 88.0000 33.0000 37.0000
A9 5.1000 4.7100 14.0000 43.0000 92.0000 30.0000 32.0000
Difficult to see the correlations between the features…
Data Visualization
• Spectral format (65 curves, one for each person)
[Figure: measurement value vs. measurement index, one curve per person]
Difficult to compare the different patients…
Data Visualization
[Figure: bi-variate scatter plot of C-Triglycerides vs. C-LDH, and tri-variate scatter plot of C-Triglycerides, C-LDH, and M-EPI]
How can we visualize the other variables???
… difficult to see in 4 or higher dimensional spaces…
Data Visualization
• Is there a representation better than the coordinate axes?
• Is it really necessary to show all the 53 dimensions?
• … what if there are strong correlations between the
features?
• How could we find
the smallest subspace of the 53-D space that
keeps the most information about the original data?
• A solution: Principal Component Analysis
PCA Algorithms
PCA: Principal Component Analysis
Orthogonal projection of the data onto a lower-dimensional linear space that…
• maximizes the variance of the projected data (purple line)
• minimizes the mean squared distance between
  • each data point and
  • its projection (sum of blue lines)
Principal Component Analysis
Idea:
• Given data points in a d-dimensional space, project them into a lower-dimensional space while preserving as much information as possible.
  • Find the best planar approximation of 3D data
  • Find the best 12-D approximation of 10⁴-D data
• In particular, choose the projection that minimizes the squared error in reconstructing the original data.
Principal Component Analysis
Properties:
• PCA vectors originate from the center of mass.
• Principal component #1 points in the direction of the largest variance.
• Each subsequent principal component
  • is orthogonal to the previous ones, and
  • points in the direction of the largest variance of the residual subspace.
2D Gaussian dataset
1st PCA axis
2nd PCA axis
PCA algorithm I (sequential)
Given the centered data {x1, …, xm}, compute the principal vectors:
1st PCA vector: to find w1, maximize the variance of the projection of x,
   w1 = argmax_{||w||=1} (1/m) Σ_{i=1..m} (wᵀxi)²
2nd PCA vector: to find w2, maximize the variance of the projection in the residual subspace,
   w2 = argmax_{||w||=1} (1/m) Σ_{i=1..m} [wᵀ(xi − w1(w1ᵀxi))]²
[Figure: a point x, its PCA reconstruction x’ = w1(w1ᵀx) along w1, and the residual x − x’ in the direction of w2]
PCA algorithm I (sequential)
Given w1, …, wk−1, we compute the kth principal vector wk as before:
maximize the variance of the projection of x in the residual subspace,
   wk = argmax_{||w||=1} (1/m) Σ_{i=1..m} [wᵀ(xi − Σ_{j=1..k−1} wj(wjᵀxi))]²
[Figure: a point x and its PCA reconstruction x’ = w1(w1ᵀx) + w2(w2ᵀx) from the first two principal vectors w1 and w2]
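To make the sequential procedure concrete, here is a minimal NumPy sketch (my own illustration, not taken from the slides): each principal vector is found by power iteration on the covariance of the current residual data, which is one way to solve the argmax above, and the data are then deflated before moving to the next vector.
import numpy as np

def sequential_pca(X, k, n_iter=500):
    """X: m x d centered data matrix (rows = data points). Returns a d x k matrix [w1 ... wk]."""
    m, d = X.shape
    R = X.copy()                        # residual data: each xi minus its projection on w1..w_{j-1}
    W = np.zeros((d, k))
    rng = np.random.default_rng(0)
    for j in range(k):
        C = R.T @ R / m                 # covariance of the residual data
        w = rng.normal(size=d)
        w /= np.linalg.norm(w)
        for _ in range(n_iter):         # power iteration: solves argmax_{||w||=1} (1/m) sum_i (w^T r_i)^2
            w = C @ w
            w /= np.linalg.norm(w)
        W[:, j] = w
        R = R - np.outer(R @ w, w)      # deflation: remove the component along w from every point
    return W

# usage (hypothetical data): W = sequential_pca(X_centered, k=2); W[:, 0] is w1, W[:, 1] is w2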
PCA algorithm II (sample covariance matrix)
• Given data {x1, …, xm}, compute the covariance matrix Σ:
   Σ = (1/m) Σ_{i=1..m} (xi − x̄)(xi − x̄)ᵀ,   where x̄ = (1/m) Σ_{i=1..m} xi
• PCA basis vectors = the eigenvectors of Σ
• Larger eigenvalue → more important eigenvector
PCA algorithm II (sample covariance matrix)
PCA algorithm(X, k): top k eigenvalues/eigenvectors
% X = N × m data matrix,
% … each data point xi = column vector, i = 1..m
• x̄ ← (1/m) Σ_{i=1..m} xi
• X ← subtract mean x̄ from each column vector xi in X
• Σ ← X Xᵀ … covariance matrix of X
• { λi, ui }i=1..N = eigenvalues/eigenvectors of Σ, with λ1 ≥ λ2 ≥ … ≥ λN
• Return { λi, ui }i=1..k
% top k PCA components
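Here is a minimal NumPy sketch of the pseudocode above (function and variable names are my own): center the columns, form the sample covariance matrix, and return the top-k eigenvalue/eigenvector pairs sorted by decreasing eigenvalue.
import numpy as np
import numpy.linalg as la

def pca_cov(X, k):
    """X: N x m data matrix, each column is a data point xi. Returns the top-k (eigenvalues, eigenvectors)."""
    xbar = X.mean(axis=1, keepdims=True)     # xbar = (1/m) * sum_i xi
    Xc = X - xbar                            # subtract the mean from each column xi
    Sigma = Xc @ Xc.T / X.shape[1]           # sample covariance (the 1/m scaling does not change the eigenvectors)
    lam, U = la.eigh(Sigma)                  # eigh: Sigma is symmetric
    order = np.argsort(lam)[::-1][:k]        # sort so that lambda_1 >= lambda_2 >= ...
    return lam[order], U[:, order]           # top-k PCA components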
PCA algorithm III (SVD of the data matrix)
Singular Value Decomposition of the centered data matrix X:
   X_{features × samples} = U S Vᵀ
[Figure: block diagram of X = U S Vᵀ, separating the significant singular values/vectors from the noise part]
PCA algorithm III
• Columns of U
  • the principal vectors, { u(1), …, u(k) }
  • orthogonal and of unit norm, so UᵀU = I
  • can reconstruct the data using linear combinations of { u(1), …, u(k) }
• Matrix S
  • diagonal
  • shows the importance of each eigenvector
• Columns of Vᵀ
  • the coefficients for reconstructing the samples
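The same computation via the SVD route, as a hedged sketch (the helper name and the returned quantities are my own choices): take the SVD of the centered features × samples matrix, keep the k most significant components, and note that the columns of U, the diagonal S, and the columns of Vᵀ play exactly the roles listed above.
import numpy as np
import numpy.linalg as la

def pca_svd(X, k):
    """X: features x samples. Returns the principal vectors, singular values, coefficients, and rank-k reconstruction."""
    Xc = X - X.mean(axis=1, keepdims=True)        # center each feature across the samples
    U, s, Vt = la.svd(Xc, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]   # keep the k most significant components
    # s_k**2 / Xc.shape[1] are the corresponding eigenvalues of the sample covariance matrix
    scores = np.diag(s_k) @ Vt_k                  # coefficients for reconstructing each sample (column)
    X_hat = U_k @ scores                          # rank-k reconstruction of the data
    return U_k, s_k, scores, X_hat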
Applications
Face Recognition
• Want to identify a specific person, based on a facial image
• Robust to glasses, lighting, …
• Can’t just use the given 256 × 256 pixels
Applying PCA: Eigenfaces
• Example data set: images of faces
  • Eigenface approach [Turk & Pentland], [Sirovich & Kirby]
• Each face x is …
  • 256 × 256 values (luminance at location)
  • x ∈ ℝ^(256×256) (view as a 64K-dim vector)
• Form X = [ x1, …, xm ] from the centered data
• Compute Σ = X Xᵀ
• Problem: Σ is 64K × 64K … HUGE!!!
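The slide ends with the problem that Σ = XXᵀ is 64K × 64K. As a hedged sketch of a standard workaround (not spelled out above): never form Σ; either take the economy SVD of X directly, or eigendecompose the small m × m matrix XᵀX and map its eigenvectors back, since only m faces are available and m is far smaller than 64K.
import numpy as np
import numpy.linalg as la

def eigenfaces_svd(X, k):
    """X: d x m matrix of centered face vectors (d = 64K pixels, m = number of faces)."""
    U, s, Vt = la.svd(X, full_matrices=False)   # economy SVD: avoids forming the d x d covariance
    return U[:, :k], s[:k]                      # top-k eigenfaces and their singular values

def eigenfaces_gram(X, k):
    """Same top-k eigenfaces via the m x m matrix X^T X."""
    lam, V = la.eigh(X.T @ X)                   # m x m eigenproblem instead of d x d
    order = np.argsort(lam)[::-1][:k]
    U = X @ V[:, order]                         # if (X^T X) v = lam v, then (X X^T)(X v) = lam (X v)
    U /= la.norm(U, axis=0)                     # normalize the columns to unit norm
    return U, lam[order]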
[Figure: X = [ x1, …, xm ], m faces, each with 256 × 256 real values]
Happiness subspace
Disgust subspace
Facial Expression Recognition Movies
Image Compression
• Divide the original 372×492 image into patches:
  • each patch is an instance that contains 12×12 pixels on a grid
• Consider each patch as a 144-D vector (a compression sketch in Python follows below)
Original Image
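A minimal sketch of the patch-based compression described above (my own code; the grayscale image array img is an assumption): cut the image into non-overlapping 12×12 patches, treat each patch as a 144-D instance, and reconstruct every patch from its top-k principal components, e.g. k = 60 as on the later slide. Conveniently, 372 and 492 are both multiples of 12.
import numpy as np
import numpy.linalg as la

def compress_patches(img, patch=12, k=60):
    """img: 2-D grayscale array whose sides are multiples of `patch`. Returns the PCA reconstruction."""
    H, W = img.shape
    # cut the image into non-overlapping patch x patch blocks, one 144-D row per block
    blocks = (img.reshape(H // patch, patch, W // patch, patch)
                 .transpose(0, 2, 1, 3)
                 .reshape(-1, patch * patch))
    mean = blocks.mean(axis=0)
    Xc = blocks - mean
    _, _, Vt = la.svd(Xc, full_matrices=False)
    comps = Vt[:k]                      # top-k principal directions (each one a 144-D "eigen-patch")
    codes = Xc @ comps.T                # k coefficients per patch instead of 144 pixel values
    recon = codes @ comps + mean        # reconstruct each patch from its k coefficients
    # reassemble the reconstructed patches into an image
    return (recon.reshape(H // patch, W // patch, patch, patch)
                 .transpose(0, 2, 1, 3)
                 .reshape(H, W))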
L2 error and PCA dim
PCA compression: 144D → 60D
60 most important eigenvectors
Looks like the discrete cosine bases of JPG!…
PCA Shortcomings
Problematic Data Set for PCA
PCA doesn’t know about class labels!
PCA vs Fisher Linear Discriminant
• PCA maximizes variance, independent of class → magenta line
• FLD attempts to separate the classes → green line
PCA versus linear classifiers
A linear classifier attempts to separate the classes
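To make the contrast concrete, here is a hedged NumPy sketch on toy data of my own (not the slides' dataset): two elongated classes that differ only along their short axis. PCA, which ignores the labels, picks the long axis; Fisher's linear discriminant, computed here as S_W⁻¹(μ₁ − μ₀), picks the direction that separates the classes.
import numpy as np
import numpy.linalg as la

rng = np.random.default_rng(0)
noise = rng.normal(size=(200, 2)) * np.array([5.0, 0.5])   # long in x, short in y
X0 = noise + np.array([0.0, -2.0])                         # class 0
X1 = noise + np.array([0.0, +2.0])                         # class 1, offset along the short axis
X = np.vstack([X0, X1])

# PCA direction: top right-singular vector of the centered pooled data (labels are ignored)
_, _, Vt = la.svd(X - X.mean(axis=0), full_matrices=False)
w_pca = Vt[0]

# FLD direction: S_W^{-1} (mu1 - mu0), which uses the labels
Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)   # within-class scatter (up to scaling)
w_fld = la.solve(Sw, X1.mean(axis=0) - X0.mean(axis=0))
w_fld /= la.norm(w_fld)

print("PCA direction:", np.round(w_pca, 2))   # ~ [1, 0] up to sign: the long, class-independent axis
print("FLD direction:", np.round(w_fld, 2))   # ~ [0, 1] up to sign: separates the two classes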
Feature Selection and Cross-Validation
• Incorrect way: DO NOT do feature selection (or dimensionality reduction) on the whole dataset, and then cross-validation
  • feature selection and dimensionality reduction on the whole dataset destroys cross-validation
  • the reduced training set would depend on the validation set
  • thus, training is looking at the supposedly “unseen” data!
[Figure: training set + validation set → feature selection / dimensionality reduction → reduced training set + reduced validation set]
Feature Selection and Cross-Validation
• Correct way: feature selection (or dimensionality reduction) inside cross-validation, applied only to the training set
[Figure: feature sel. / dim. reduc. applied to the training set alone, yielding a reduced training set; the validation set is held aside]
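A minimal NumPy sketch of the correct procedure (the fold splitting and the nearest-centroid classifier are my own illustrative choices): inside every cross-validation fold, the dimensionality reduction (PCA here) is fit on the training part only and then merely applied to the held-out validation part, so the reduced training set never depends on the validation set.
import numpy as np
import numpy.linalg as la

def pca_fit(Xtr, k):
    mean = Xtr.mean(axis=0)
    _, _, Vt = la.svd(Xtr - mean, full_matrices=False)
    return mean, Vt[:k]                          # the reduction is fit on the training fold only

def pca_apply(X, mean, comps):
    return (X - mean) @ comps.T                  # the same transform is merely applied elsewhere

def cross_validate(X, y, k=5, n_folds=5):
    folds = np.array_split(np.random.default_rng(0).permutation(len(X)), n_folds)
    accs = []
    for f in range(n_folds):
        val = folds[f]
        tr = np.hstack([folds[g] for g in range(n_folds) if g != f])
        mean, comps = pca_fit(X[tr], k)                        # never sees the validation fold
        Ztr, Zval = pca_apply(X[tr], mean, comps), pca_apply(X[val], mean, comps)
        classes = np.unique(y[tr])
        cents = np.array([Ztr[y[tr] == c].mean(axis=0) for c in classes])   # nearest-centroid classifier
        pred = classes[np.argmin(((Zval[:, None, :] - cents[None]) ** 2).sum(-1), axis=1)]
        accs.append((pred == y[val]).mean())
    return float(np.mean(accs))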