Data Representation for High Dimensional Data
Lecture 6: Dimensionality Reduction
1
CMSC5741 Big Data Tech. & Apps.
Prof. Michael R. Lyu
Computer Science & Engineering Dept.
The Chinese University of Hong Kong
A Compression Example
2
Outline
Motivation
SVD
CUR
Application of SVD and CUR
PCA
Extension to robust PCA
3
Dimensionality Reduction Motivation
Assumption: Data lie on or near a low d-dimensional subspace
Axes of this subspace are effective representation of the data
4
Dimensionality Reduction Motivation
Compress / reduce dimensionality:
10^6 rows; 10^3 columns; no updates
Random access to any cell(s); small error: OK
5
The above matrix is really “2-dimensional.” All rows can be reconstructed by scaling [1 1 1 0 0] or [0 0 0 1 1]
Rank of a Matrix
Q: What is rank of a matrix A?
A: No. of linearly independent rows/columns of A
For example:
Matrix A = [[1 2 1], [-2 -3 1], [3 5 0]] has rank r=2
Why? The first two rows are linearly independent, so the rank is at least 2, but all three rows are linearly dependent (the first is equal to the sum of the second and third) so the rank must be less than 3.
Why do we care about low rank?
We can write the rows of A in terms of two “basis” vectors: [1 2 1] and [-2 -3 1]
The new coordinates of the rows are: [1 0], [0 1], [1 -1]
6
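A sanity check (mine, not the lecture's): a short NumPy sketch that verifies the rank and recovers the new coordinates, assuming the example matrix is [[1, 2, 1], [-2, -3, 1], [3, 5, 0]], which is consistent with the basis vectors and coordinates quoted above.

    import numpy as np

    A = np.array([[1, 2, 1],
                  [-2, -3, 1],
                  [3, 5, 0]], dtype=float)
    print(np.linalg.matrix_rank(A))        # 2: only two linearly independent rows

    # Express every row of A in the 2-D basis {[1, 2, 1], [-2, -3, 1]}.
    B = np.array([[1, 2, 1],
                  [-2, -3, 1]], dtype=float)
    # Solve coords @ B = A for the new coordinates (exact here, so lstsq suffices).
    coords, *_ = np.linalg.lstsq(B.T, A.T, rcond=None)
    print(coords.T)                        # rows ~ [1, 0], [0, 1], [1, -1]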
Rank is “Dimensionality”
Cloud of points in 3D space:
Think of point positions
as a matrix:
We can rewrite coordinates more efficiently!
Old basis vectors: [1 0 0] [0 1 0] [0 0 1]
New basis vectors: [1 2 1] [-2 -3 1]
Then A has new coordinates: [1 0], B: [0 1], C: [1 -1]
Notice: We reduced the number of coordinates!
1 row per point: A, B, C
7
Dimensionality Reduction
Goal of dimensionality reduction is to
discover the axis of data!
Rather than representing every point with 2 coordinates, we represent each point with 1 coordinate (corresponding to the position of the point on the red line).
By doing this we incur a bit of error, as the points do not exactly lie on the line.
8
Why Reduce Dimensions?
Why reduce dimensions?
Discover hidden correlations/topics
Words that occur commonly together
Remove redundant and noisy features
Not all words are useful
Interpretation and visualization
Easier storage and processing of the data
9
SVD: Dimensionality Reduction
10
For an m × n matrix A, we can decompose it as A = UΣVᵀ, where
U is an m × m real or complex orthonormal matrix (i.e., UᵀU = UUᵀ = I),
Σ is an m × n rectangular diagonal matrix with non-negative real numbers on the diagonal, and
Vᵀ (the conjugate transpose of V, or simply the transpose of V if V is real) is an n × n real or complex orthonormal matrix.
SVD: Singular Value Decomposition
11
When rank(A) = r:
A: input data matrix
m × n matrix (e.g., m documents, n terms)
U: left singular vectors
m × r matrix (m documents, r topics)
Σ: singular values
r × r diagonal matrix (strength of each “topic”)
r: rank of matrix A
V: right singular vectors
n × r matrix (n terms, r topics)
SVD: Singular Value Decomposition
12
A = UΣVᵀ = σ1u1∘v1ᵀ + σ2u2∘v2ᵀ + ⋯
σᵢ: scalar
uᵢ: vector
vᵢ: vector
SVD: Singular Value Decomposition
13
A = σ1u1∘v1ᵀ + σ2u2∘v2ᵀ + ⋯: each term σᵢuᵢ∘vᵢᵀ is a rank-1 matrix
σᵢ: scalar
uᵢ: vector
vᵢ: vector
SVD Properties
It is always possible to do SVD, i.e., decompose a matrix A into A = UΣVᵀ, where
U, Σ, V: unique
U, V: column orthonormal
UᵀU = I, VᵀV = I (I: identity matrix)
Σ: diagonal
Entries (singular values) are non-negative and sorted in decreasing order (σ1 ≥ σ2 ≥ ⋯ ≥ σr ≥ 0).
14
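A small NumPy sketch (not part of the slides) illustrating these properties on an arbitrary matrix; numpy.linalg.svd returns the three factors directly.

    import numpy as np

    A = np.array([[1., 1., 1., 0., 0.],
                  [3., 3., 3., 0., 0.],
                  [0., 0., 0., 2., 2.]])

    U, s, Vt = np.linalg.svd(A, full_matrices=False)    # thin SVD: A = U @ diag(s) @ Vt
    print(np.allclose(A, U @ np.diag(s) @ Vt))          # reconstruction holds
    print(np.allclose(U.T @ U, np.eye(U.shape[1])))     # U is column-orthonormal
    print(np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0])))  # V is column-orthonormal
    print(s)                                            # non-negative, decreasing order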
SVD Example
We give an example of a simple SVD decomposition
15
SVD: Example – Users-to-Movies
A = UΣVᵀ – example: Users to Movies
SciFi
Romance
“topics” or “concepts”
aka. Latent dimensions
aka. Latent factors
16
The Avengers
Star Wars
Twilight
Matrix
In this example, there are two “concepts” underlying the movies: science fiction and romance.
SVD: Example – Users-to-Movies
A = UΣVᵀ – example
SciFi
Romance
17
The Avengers
Star Wars
Twilight
Matrix
SciFi-concept
Romance-concept
Think of the rows of M as people and the columns of M as movies.
SVD: Example – Users-to-Movies
A = UΣVᵀ – example
SciFi
Romance
18
The Avengers
Star Wars
Twilight
Matrix
SciFi-concept
Romance-concept
U is “user-to-concept” similarity matrix
Think of the rows of M as people and the columns of M as movies.
The matrix U connects users to concepts.
The value 0.24 in the first row and first column of U is smaller than some of the other entries in that column because, although the first person watches only science fiction, he does not rate those movies highly. The entry in the second column of the first row of U is nearly zero, because he does not rate romance movies at all.
SVD: Example – Users-to-Movies
A = UΣVᵀ – example
SciFi
Romance
19
The Avengers
Star Wars
Twilight
Matrix
SciFi-concept
“strength” of the SciFi-concept
The matrix Σ gives the strength of each of the concepts. In this example, the strength of the science-fiction concept is 11.9, while the strength of the romance concept is 7.1. Intuitively, the science-fiction concept is stronger because the data provide more movies of that genre and more people who like them.
SVD: Example – Users-to-Movies
A = UΣVᵀ – example
SciFi
Romance
20
The Avengers
Star Wars
Twilight
Matrix
SciFi-concept
V is “movie-to-concept” similarity matrix
SciFi-concept
Q: Does the movie “Twilight” relate to concept “Romance”?
The matrix V relates movies to concepts. The 0.5x values indicate that the first three movies are of the science-fiction genre, while the small absolute value in the last column says that those movies do not partake much of the concept of romance.
The second row indicates that the movie Twilight is related to romance.
SVD: Interpretation #1
“movies”, “users” and “concepts”
U: user-to-concept similarity matrix
V: movie-to-concept similarity matrix
Σ: its diagonal elements give the
‘strength’ of each concept
21
SVD: Interpretation #2
SVD gives ‘best’ axis to project on
‘best’ = minimal sum of squares of projection errors
In other words,
minimum reconstruction error
22
SVD: Interpretation #2
A = UΣVᵀ – example
𝑈: user-to-concept matrix
𝑉: movie-to-concept matrix
23
SVD: Interpretation #2
A = UΣVᵀ – example
24
variance (“spread”) on the v1 axis
SVD: Interpretation #2
A = UΣVᵀ – example
UΣ: the coordinates of the points in the projection axis
25
Projection of users on the “Sci-Fi” axis
SVD: Interpretation #2
Q: how exactly is dimension reduction done?
26
SVD: Interpretation #2
Q: how exactly is dimension reduction done?
A: Set smallest singular values to zero
27
SVD: Interpretation #2
Q: how exactly is dimension reduction done?
A: Set smallest singular values to zero
Approximate original matrix by low-rank matrices
28
SVD: Interpretation #2
Q: how exactly is dimension reduction done?
A: Set smallest singular values to zero
Approximate original matrix by low-rank matrices
29
SVD: Best Low Rank Approximation
A ≈ B
B is a rank-k matrix
B is the best rank-k approximation of matrix A
30
SVD: Best Low Rank Approximation
Theorem: Let A = UΣVᵀ (rank(A) = r), and B = UΣ_kVᵀ, where Σ_k is the diagonal matrix in which σ1, …, σk are kept and the remaining singular values are set to 0 (k < r).
Then B is the best rank-k approximation to A:
B = argmin over rank(X) ≤ k of ‖A − X‖_F
Intuition (spectral decomposition)
Why is setting the small σᵢ to 0 the right thing to do?
Vectors uᵢ and vᵢ are unit length, so σᵢ scales them.
Therefore, zeroing the small σᵢ introduces less error.
31
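A brief sketch (mine) of the rank-k truncation B = U_k Σ_k V_kᵀ; for the Frobenius norm, the reconstruction error equals the square root of the sum of the discarded σᵢ².

    import numpy as np

    def best_rank_k(A, k):
        """Keep only the k largest singular values/vectors of A."""
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

    rng = np.random.default_rng(0)
    A = rng.normal(size=(6, 5))
    B = best_rank_k(A, k=2)
    print(np.linalg.matrix_rank(B))            # 2
    print(np.linalg.norm(A - B, 'fro'))        # reconstruction error
    _, s, _ = np.linalg.svd(A)
    print(np.sqrt((s[2:] ** 2).sum()))         # equals the Frobenius error above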
SVD: Interpretation #2
Q: How many σᵢ to keep?
A: Rule of thumb:
Keep 80–90% of the “energy” (= Σᵢ σᵢ²)
32
𝜎1𝑢1∘𝑣1𝑇+ 𝜎2𝑢2∘𝑣2𝑇 +⋯
Assume: 𝜎1 ≥ 𝜎2 ≥⋯
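A minimal sketch (assuming the 80–90% energy rule stated above) of how k could be chosen from the singular values; the helper name is mine.

    import numpy as np

    def choose_k(singular_values, energy=0.9):
        """Smallest k whose leading singular values retain `energy` of sum(sigma_i^2)."""
        sq = np.asarray(singular_values, dtype=float) ** 2
        cumulative = np.cumsum(sq) / sq.sum()
        return int(np.searchsorted(cumulative, energy) + 1)

    print(choose_k([11.9, 7.1, 1.3], energy=0.9))   # 2: keeps the two strong "concepts"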
SVD: Complexity
SVD for a full m × n matrix: O(min(nm², n²m))
But:
Less work, if we only want to compute singular values
or if we only want first k singular vectors (thin-svd).
or if the matrix is sparse (sparse svd).
Stable implementations
LINPACK, Matlab, Splus, Mathematica…
Available in most common languages
33
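A sketch of the "first k singular vectors only / sparse matrix" options mentioned above, using SciPy's truncated sparse SVD (scipy.sparse.linalg.svds); the matrix sizes are arbitrary.

    import numpy as np
    from scipy.sparse import random as sparse_random
    from scipy.sparse.linalg import svds

    A = sparse_random(1000, 300, density=0.01, random_state=0)  # sparse input matrix
    U, s, Vt = svds(A, k=10)            # only the 10 largest singular triplets
    order = np.argsort(s)[::-1]         # svds does not guarantee descending order
    U, s, Vt = U[:, order], s[order], Vt[order, :]
    print(s[:3])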
SVD: Conclusions so far
SVD: A = UΣVᵀ: unique
U: user-to-concept similarities
V: movie-to-concept similarities
Σ: strength of each concept
Dimensionality reduction
Keep the few largest singular values (80-90% of “energy”)
SVD: picks up linear correlations
34
SVD: Relationship to Eigen-decomposition
SVD gives us: A = UΣVᵀ
Eigen-decomposition: A = XΛXᵀ
A is symmetric
U, V, X are orthonormal (UᵀU = VᵀV = XᵀX = I)
Σ, Λ are diagonal
35
SVD: Relationship to Eigen-decomposition
Eigen-decomposition of AAᵀ and AᵀA
So, AAᵀ = UΣVᵀ(UΣVᵀ)ᵀ = UΣΣᵀUᵀ, and AᵀA = VΣᵀΣVᵀ
That is, U is the matrix of eigenvectors of AAᵀ, and
V is the matrix of eigenvectors of AᵀA
This shows how to use eigen-decomposition to compute SVD
The singular values of A are the square roots of the corresponding eigenvalues of AᵀA (or AAᵀ)
Note: AAᵀ and AᵀA are the dataset covariance matrices (for centered data)
36
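A quick numerical check (mine, not from the slides) of this relationship: the eigenvalues of AᵀA are the squared singular values of A, and its eigenvectors match the right singular vectors up to sign.

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.normal(size=(5, 3))

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    eigvals, eigvecs = np.linalg.eigh(A.T @ A)          # A^T A is symmetric, so use eigh
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # reorder to descending

    print(np.allclose(np.sqrt(eigvals), s))                           # sigma_i = sqrt(eigenvalue_i)
    print(np.allclose(np.abs(Vt @ eigvecs), np.eye(3), atol=1e-6))    # eigenvectors = V, up to sign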
A Brief Review of Eigen-Decomposition
Eigenvalues and eigenvectors
A: an n × n matrix.
λ: eigenvalue of A, x: eigenvector of A if Ax = λx; (λ, x): an eigenpair.
Simple computational method of eigenvalues
Solve the characteristic equation det(A − λI) = 0
Example
Given the matrix A shown on the slide, form A − λI
Then compute det(A − λI)
Then solve det(A − λI) = 0; we get the eigenvalues
37
A Brief Review of Eigen-Decomposition
Example (continued)
Solving det(A − λI) = 0, we get the eigenvalues λ1 and λ2
Now we compute the eigenvectors
For eigenvalue λ1 we need to find a vector x1 such that (A − λ1I)x1 = 0
Solve this linear system
Since x1 needs to be a unit vector, we normalize the solution
Similarly, we can compute the eigenvector x2 for λ2
38
Computing Eigenvalues: Power Method
Power method
Choose an arbitrary starting vector x0
Iterate xk+1 = Axk
Theorem: the sequence xk / ‖xk‖ converges to the principal eigenvector (i.e., the eigenvector corresponding to the largest eigenvalue)
Normalized power method
Choose an arbitrary starting vector x0
Iterate xk+1 = Axk / ‖Axk‖
Theorem: the sequence xk converges to the principal eigenvector.
39
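A sketch of the normalized power method described above; the fixed iteration count and the 2 × 2 test matrix are my own choices.

    import numpy as np

    def power_method(A, num_iters=100, seed=0):
        """Normalized power iteration: returns (largest eigenvalue, its eigenvector)."""
        rng = np.random.default_rng(seed)
        x = rng.normal(size=A.shape[0])      # arbitrary starting vector x0
        for _ in range(num_iters):
            x = A @ x
            x = x / np.linalg.norm(x)        # normalize at every step
        return x @ A @ x, x                  # Rayleigh quotient, eigenvector

    A = np.array([[2., 1.],
                  [1., 2.]])
    lam, v = power_method(A)
    print(lam)    # ~3, the largest eigenvalue
    print(v)      # ~[0.707, 0.707], the principal eigenvector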
In-class Practice
Go to practice
40
SVD Case Study: How to Query?
Q: Find users that like “The Avengers”
A: Map query into a “concept space” – how?
41
SciFi
Romance
The Avengers
Star Wars
Twilight
Matrix
Q: Find users that like “The Avengers”
A: Map query into a “concept space” – how?
Case Study: How to Query?
42
The Avengers
Twilight
Matrix
Star Wars
v1
q
Project into concept space:
Inner product with each concept vector vi
v2
This new user is not one of the people represented by the original matrix, but he wants to use the system to find out what movies he would like.
He has only seen one movie, Matrix, and rated it 5. Thus, we can represent him by the vector q = [5, 0, 0, 0], as if this were one of the rows of the original matrix.
Q: Find users that like “The Avengers”
A: Map query into a “concept space” – how?
Case Study: How to Query?
43
The Avengers
Twilight
Matrix
Star Wars
v1
q
Project into concept space:
Inner product with each concept vector vi
v2
q*v1
Compactly, we have
𝑞𝑐=𝑞𝑉
Case Study: How to Query?
movie-to-concept
similarities (V)
SciFi-concept
44
The Avengers
Twilight
Matrix
Star Wars
qc is high in science-fiction interest, and has nearly no interest in romance.
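A sketch of mapping queries into concept space (q_c = qV) and comparing users there; the movie ordering, the ratings, and the two-concept V below are illustrative stand-ins, not the exact numbers from the slides.

    import numpy as np

    # Hypothetical movie-to-concept matrix V (rows: Matrix, The Avengers, Star Wars, Twilight)
    V = np.array([[0.58, 0.0],
                  [0.58, 0.0],
                  [0.58, 0.0],
                  [0.0,  1.0]])

    q = np.array([5., 0., 0., 0.])    # new user q: rated only "Matrix"
    d = np.array([0., 0., 4., 0.])    # user d: rated only "Star Wars"

    q_c, d_c = q @ V, d @ V           # concept-space representations
    cos = q_c @ d_c / (np.linalg.norm(q_c) * np.linalg.norm(d_c))
    print(q_c, d_c, cos)              # both lean SciFi; cosine similarity ~1 in concept space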
How would the user d that rated (‘Star Wars’, ‘Matrix’) be handled?
d𝑐=d 𝑉
Case Study: How to Query?
movie-to-concept
similarities (V)
SciFi-concept
45
The Avengers
Twilight
Matrix
Star Wars
dc is high in science-fiction interest, and has nearly no interest in romance.
Observation
User d, who rated (‘Star Wars’), will be similar to user q, who rated (‘The Avengers’), although d and q have zero ratings in common!
Case Study: How to Query?
SciFi-concept
46
The Avengers
Twilight
Matrix
Star Wars
Zero ratings in common
Similarity = 0 in the original rating space
Cosine similarity in concept space: 0.99
SVD: Drawbacks
Optimal low-rank approximation
in terms of Euclidean norm
Interpretability problem:
A singular vector specifies a linear
combination of all input columns or rows
Lack of sparsity:
Singular vectors are dense!
47
A = U Σ Vᵀ
CUR Decomposition
Goal: express A as a product of matrices C, U, R: A ≈ C·U·R
Minimize ‖A − C·U·R‖_F
Constraints on C and R
48
CUR Decomposition
Goal: express A as a product of matrices C, U, R: A ≈ C·U·R
Minimize ‖A − C·U·R‖_F
Constraints on C and R: C is a set of actual columns of A, and R is a set of actual rows of A
49
CUR: Good Approximation to SVD
Let A_k be the best rank-k approximation to A (obtained by SVD)
Theorem
The CUR algorithm, in O(m·n) time, achieves
‖A − C·U·R‖_F ≤ ‖A − A_k‖_F + ε‖A‖_F
with probability at least 1 − δ, by picking
O(k log(1/δ)/ε²) columns and
O(k² log³(1/δ)/ε⁶) rows
(in practice, choose ~4k columns/rows)
50
CUR: How it Works
Sampling columns (similarly for rows):
1. Compute column probabilities P(j) = Σᵢ A(i, j)² / Σᵢ,ⱼ A(i, j)²
2. Repeat c times: pick a column j of A with probability P(j)
3. If column j is selected, scale it by 1/√(c·P(j)) and add it as a column of C
51
For the scaling step: having selected each of the columns of A, we scale each column by dividing its elements by the square root of the expected number of times this column would be picked. That is, we divide the elements of the jth column of A, if it is selected, by sqrt(c·P(j)).
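A sketch of the column-sampling step as described in the note above: sample c columns with probability proportional to their squared norms, then rescale each picked column by 1/sqrt(c·P(j)). The function name and matrix sizes are mine.

    import numpy as np

    def sample_columns(A, c, seed=0):
        rng = np.random.default_rng(seed)
        col_sq = (A ** 2).sum(axis=0)
        P = col_sq / col_sq.sum()                   # column distribution P(j)
        idx = rng.choice(A.shape[1], size=c, replace=True, p=P)
        C = A[:, idx] / np.sqrt(c * P[idx])         # rescale each selected column
        return C, idx

    A = np.random.default_rng(2).normal(size=(8, 6))
    C, idx = sample_columns(A, c=4)
    print(idx, C.shape)                             # 4 (possibly repeated) columns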
CUR: Computing U
Let W be the “intersection” of the sampled columns C and rows R
Let the SVD of W be W = XZYᵀ
Then: U = W⁺ = YZ⁺Xᵀ, where
Z⁺ is the “Moore-Penrose pseudo-inverse” of Z:
Z⁺ᵢᵢ = 1/Zᵢᵢ if Zᵢᵢ ≠ 0, and 0 otherwise.
52
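A sketch of computing U from the intersection W following the recipe above: take the SVD of W and pseudo-invert its singular values.

    import numpy as np

    def compute_cur_U(W):
        X, z, Yt = np.linalg.svd(W, full_matrices=False)
        z_plus = np.zeros_like(z)
        nonzero = z > 1e-10
        z_plus[nonzero] = 1.0 / z[nonzero]          # reciprocals of non-zero singular values
        return Yt.T @ np.diag(z_plus) @ X.T         # U = W+ = Y Z+ X^T

    W = np.array([[5., 5.],
                  [0., 2.]])
    U = compute_cur_U(W)
    print(np.allclose(W @ U @ W, W))                # pseudo-inverse property holds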
CUR: Pros & Cons
+ easy interpretation
the basis vectors are actual columns and rows
– duplicate columns and rows
columns of large norms will be sampled many times
53
CUR: Duplicate Columns
If we want to get rid of the duplicates
Throw them away
Scale the columns/rows by the square root of the number of duplicates
54
SVD vs CUR
55
Question: Large or small? Dense or sparse?
SVD vs CUR
56
SVD & CUR: Simple Experiments
DBLP data
author-to-conference matrix
very sparse
A(i, j): number of papers published by author i at conference j.
428k authors (rows)
3659 conferences (columns)
Dimensionality reduction
Running time?
Space?
Reconstruction error?
57
Results: DBLP
accuracy: 1-relative sum squared errors
space ratio: # output non-zero matrix entries / # input non-zero matrix entries
Courtesy: Sun, Faloutsos: “Less is More: Compact Matrix Decomposition for Large Sparse Graphs,” SDM’07
58
The Linearity Assumption
SVD is limited to linear projections
Data lies on a low-dimensional linear space
Non-linear methods: Isomap
Data lies on a low-dimensional manifold
Non-linear
How?
Build adjacency graph
SVD the graph adjacency matrix
Further reading: the Wikipedia page on Isomap
59
PCA: An Application of SVD
PCA = Principal Component Analysis
Motivation
Visualization
60
PCA: Data Visualization
Example:
Given 53 blood samples (features) from 65 people (data items or instances)
How can we visualize the samples?
61
PCA: Data Visualization
How can we visualize the other variables???
… difficult to see in 4 or higher dimensional spaces …
62
PCA: Data Visualization
Is there a representation better than the coordinate axes?
Is it really necessary to show all the 53 dimensions?
What if there are strong correlations between the features?
How could we find the smallest subspace of the 53-D space that keeps the most information about the original data?
A solution: Principal Component Analysis
An application of SVD.
63
PCA: Definition and Algorithms
PCA
Orthogonal projection of the data onto a lower-dimensional linear space such that
Maximize variance of projected data (purple line)
Minimize the mean squared distance between each data point and its projection (sum of blue lines)
Look at the data from a (literally) different angle.
64
PCA: Idea
Given data points in a d-dimensional space, project them into a lower dimensional space while preserving as much information as possible.
Find best planar approximation to 3D data
Find best 12-D approximation to 10^4-D data
In particular, choose projection that minimizes squared error in reconstructing the original data.
Implement through SVD
65
PCA
PCA Vectors originate from the center of mass.
Principal component #1: points in the direction of the largest variance.
Each subsequent principal component
is orthogonal to the previous ones, and
points in the directions of the largest variance of the residual subspace
66
PCA: 2D Gaussian dataset
67
1st PCA axis
68
2nd PCA axis
69
PCA: Algorithm
Given centered data x1, …, xm, compute the principal vectors
1st principal vector w1:
maximizes the variance of the projection of the data, w1 = argmax over ‖w‖ = 1 of (1/m) Σᵢ (wᵀxᵢ)²
70
PCA: Algorithm by SVD
A = UΣVᵀ: SVD of the centered data matrix A
n: number of instances (rows of A)
d: dimension (columns of A)
71
PCA: Algorithm by SVD
Columns of V
are exactly the principal vectors,
orthogonal and of unit norm
Matrix Σ
diagonal
strength of each eigenvector
Columns of UΣ
coefficients for reconstructing the samples.
72
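A sketch of PCA via SVD as outlined above: center the data, take the SVD, read the principal vectors off the columns of V, and project. Variable names are mine.

    import numpy as np

    def pca_svd(X, k):
        """X: n x d data matrix (rows are instances). Returns (components, projections)."""
        Xc = X - X.mean(axis=0)                     # center the data
        U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
        components = Vt[:k].T                       # columns of V = principal vectors
        projections = Xc @ components               # = U_k * Sigma_k, sample coordinates
        return components, projections

    rng = np.random.default_rng(3)
    X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # correlated features
    W, Z = pca_svd(X, k=2)
    print(W.shape, Z.shape)                         # (5, 2) (100, 2)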
Application: Face Recognition
Want to identify specific person, based on facial image
Can’t just use the given 256 x 256 pixels
73
Applying PCA
Method A: Build a PCA subspace for each person and check which subspace can reconstruct the test image the best
Method B: Build one PCA database for the whole dataset and then classify based on the weights.
Example data set: Images of faces
Each face is 256 × 256 = 65,536 values
74
Each face is converted to a 256*256-d vector (not matrix).
Principal Components (Method B)
75
Reconstructing … (Method B)
76
Happiness Subspace (Method A)
77
Suppose we have a set of faces known as happy faces (i.e., we have a label for each face). Then, we may compute the principal vectors of this subset of faces. These principal vectors are shown here.
Similarly, in the next slide, we can compute the principal vectors of disgust faces.
Therefore, to predict the label of a test face, we can use these two sets of principal vectors to reconstruct the test face and see which set reconstructs it better (Method A).
Disgust Subspace (Method A)
78
Image Compression
Divide the original 372×492 image into patches:
Each patch is an instance that contains 12×12 pixels on a grid
View each as a 144-D vector
79
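A sketch (assuming the patch layout described above) of compressing an image by running PCA on its 12×12 patches and keeping only a few components per patch; the random array stands in for the real 372×492 image.

    import numpy as np

    def compress_patches(img, patch=12, k=6):
        h, w = img.shape
        h, w = h - h % patch, w - w % patch                  # crop to a multiple of the patch size
        blocks = (img[:h, :w]
                  .reshape(h // patch, patch, w // patch, patch)
                  .swapaxes(1, 2)
                  .reshape(-1, patch * patch))               # one 144-D row per patch
        mean = blocks.mean(axis=0)
        U, s, Vt = np.linalg.svd(blocks - mean, full_matrices=False)
        W = Vt[:k].T                                         # top-k principal vectors (144 x k)
        coeffs = (blocks - mean) @ W                         # k numbers per patch instead of 144
        return mean, W, coeffs

    img = np.random.default_rng(4).random((372, 492))
    mean, W, coeffs = compress_patches(img)
    print(W.shape, coeffs.shape)                             # (144, 6) (1271, 6)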
PCA Compression: 144D => 60D
80
PCA Compression: 144D => 16D
81
PCA Compression: 144D => 6D
82
6 Most Important Eigenvectors
83
PCA Compression: 144D => 3D
84
3 Most Important Eigenvectors
85
Noisy Image
86
Denoised Image using 15 PCA Components
87
PCA: Shortcomings
PCA cannot capture non-linear structure
Similar to SVD
88
PCA: Shortcomings
PCA does not know labels
89
PCA: Conclusions
PCA
find orthonormal basis for data
sort dimensions in order of “strength”
discard low significance dimensions
Uses
Get compact description
Ignore noise
Improve classification (hopefully)
Not magic:
Doesn’t know class labels
Can only capture linear variations
One of many tricks to reduce dimensionality!
90
Extra: Compute PCA Using Eigen-Decomposition
Given centered data x1, …, xn, compute the covariance matrix (1/n) Σᵢ xᵢxᵢᵀ
Top PCA components = top eigenvectors of the covariance matrix.
Equivalence between eigen-decomposition and SVD
SVD decomposition of the data matrix A: A = UΣVᵀ
SVD-based algorithm for PCA
Eigen-decomposition of AᵀA (proportional to the covariance matrix): AᵀA = VΣᵀΣVᵀ
Eigen-based algorithm for PCA
The equivalence gives: the eigenvectors of AᵀA are the right singular vectors V, and its eigenvalues are the squared singular values σᵢ².
91
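A short numerical check (mine) that the eigen-based and SVD-based routes give the same principal vectors, up to sign.

    import numpy as np

    rng = np.random.default_rng(5)
    X = rng.normal(size=(200, 4))
    Xc = X - X.mean(axis=0)

    # Eigen-based route: top eigenvectors of the covariance matrix
    C = Xc.T @ Xc / Xc.shape[0]
    eigvals, eigvecs = np.linalg.eigh(C)
    top2_eig = eigvecs[:, ::-1][:, :2]           # descending eigenvalue order

    # SVD-based route: right singular vectors of the centered data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    top2_svd = Vt[:2].T

    print(np.allclose(np.abs(top2_eig.T @ top2_svd), np.eye(2), atol=1e-6))  # same vectors, up to sign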
One-slide Takeaway
Dimensionality reduction
compress/reduce dimension
reconstruct the original matrix by two or more smaller matrices
Singular value decomposition (SVD)
decompose a matrix A into A = UΣVᵀ
U, V: column-orthonormal; Σ: diagonal matrix
CUR decomposition
C: a set of actual columns of A; R: a set of actual rows of A
Principal component analysis (PCA)
reconstruct the data matrix by a smaller number of eigenvectors
view the data from a (literally) different angle.
92
In-class Practice
1. Describe briefly (informally or formally) the relationship between singular value decomposition and eigenvalue decomposition.
2.1 Compute the eigenvalues and eigenvectors of matrix
2.2 Let It is easy to check that . What are the singular values of ?
2.3 Obtain the SVD of A, where A = UΣVᵀ
93