
Data Representation for High Dimensional Data

Lecture 6: Dimensionality Reduction
1
CMSC5741 Big Data Tech. & Apps.

Prof. Michael R. Lyu
Computer Science & Engineering Dept.
The Chinese University of Hong Kong

A Compression Example

2

Outline
Motivation
SVD
CUR
Application of SVD and CUR
PCA
Extension to robust PCA

3

Dimensionality Reduction Motivation
Assumption: Data lie on or near a low d-dimensional subspace
Axes of this subspace are effective representation of the data
4

Dimensionality Reduction Motivation
Compress / reduce dimensionality:
10⁶ rows; 10³ columns; no updates
Random access to any cell(s); small error: OK
5

The above matrix is really “2-dimensional.” All rows can be reconstructed by scaling [1 1 1 0 0] or [0 0 0 1 1]

Rank of a Matrix
Q: What is the rank of a matrix A?
A: No. of linearly independent rows/columns of A
For example:
Matrix A with rows [1 2 1], [-2 -3 1], [3 5 0] has rank r = 2

Why? The first two rows are linearly independent, so the rank is at least 2, but all three rows are linearly dependent (the first is equal to the sum of the second and third) so the rank must be less than 3.
Why do we care about low rank?
We can represent A using two “basis” vectors: [1 2 1] and [-2 -3 1]
And the new coordinates of the rows are: [1 0], [0 1], [1 -1]

6
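A small numerical sketch of this rank and basis-change idea (assuming the 3 × 3 matrix implied by the basis vectors and coordinates above; purely illustrative, not part of the original slides):

```python
import numpy as np

# The 3x3 matrix implied by the basis vectors and row coordinates above (illustrative)
A = np.array([[ 1,  2, 1],
              [-2, -3, 1],
              [ 3,  5, 0]])            # third row = first basis vector minus second
print(np.linalg.matrix_rank(A))        # -> 2

B = np.array([[ 1,  2, 1],             # "basis" vectors
              [-2, -3, 1]])
coords = np.array([[1,  0],            # new 2-D coordinates of the three rows
                   [0,  1],
                   [1, -1]])
print(np.allclose(coords @ B, A))      # -> True: each row is rebuilt from 2 numbers
```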

Rank is “Dimensionality”
Cloud of points in a 3D space: think of the point positions as a matrix, one row per point (A, B, C):

We can rewrite coordinates more efficiently!
Old basis vectors: [1 0 0] [0 1 0] [0 0 1]
New basis vectors: [1 2 1] [-2 -3 1]
Then A has new coordinates [1 0], B: [0 1], C: [1 -1]
Notice: We reduced the number of coordinates!

7

Dimensionality Reduction
The goal of dimensionality reduction is to discover the axis of the data!

Rather than representing every point with 2 coordinates, we represent each point with 1 coordinate (corresponding to the position of the point on the red line).

By doing this we incur a bit of error, as the points do not exactly lie on the line.
8

Why Reduce Dimensions?
Why reduce dimensions?
Discover hidden correlations/topics
Words that occur commonly together
Remove redundant and noisy features
Not all words are useful
Interpretation and visualization
Easier storage and processing of the data
9

SVD: Dimensionality Reduction


10

For an m × n matrix A, we can decompose it as A = U Σ Vᵀ, where
U is an m × m real or complex orthonormal matrix (i.e., UᵀU = UUᵀ = I),
Σ is an m × n rectangular diagonal matrix with non-negative real numbers on the diagonal, and
Vᵀ (the conjugate transpose of V, or simply the transpose of V if V is real) is an n × n real or complex orthonormal matrix.
SVD: Singular Value Decomposition
11
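As a quick, hedged illustration of this definition (not part of the original slides), NumPy's `svd` returns the three factors, and we can verify the reconstruction and the orthonormality conditions:

```python
import numpy as np

A = np.random.rand(5, 3)                             # an arbitrary m x n matrix
U, s, Vt = np.linalg.svd(A, full_matrices=True)

Sigma = np.zeros_like(A)                             # m x n rectangular diagonal matrix
Sigma[:len(s), :len(s)] = np.diag(s)

print(np.allclose(A, U @ Sigma @ Vt))                # A = U Sigma V^T
print(np.allclose(U.T @ U, np.eye(U.shape[0])))      # U^T U = I
print(np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0])))   # V^T V = I
print(s)                                             # non-negative, sorted in decreasing order
```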

When rank(A) = r:

A: input data matrix (m × n matrix, e.g., m documents, n terms)
U: left singular vectors (m × r matrix: documents, topics)
Σ: singular values (r × r diagonal matrix; strength of each “topic”)
r: rank of matrix A
V: right singular vectors (n × r matrix: terms, topics)
SVD: Singular Value Decomposition
12

SVD: Singular Value Decomposition
A = σ1 u1 ∘ v1ᵀ + σ2 u2 ∘ v2ᵀ + ⋯
σi: scalar (singular value)
ui: vector (left singular vector)
vi: vector (right singular vector)

13

SVD Properties
It is always possible to do SVD, i.e., decompose a matrix A into A = U Σ Vᵀ, where
U, Σ, V: unique
U, V: column orthonormal
UᵀU = I, VᵀV = I (I: identity matrix)
Σ: diagonal
Entries (singular values) are non-negative,
sorted in decreasing order (σ1 ≥ σ2 ≥ ⋯ ≥ σr ≥ 0).
14

SVD Example
We give an example of a simple SVD decomposition
15


SVD: Example – Users-to-Movies
Example: the users-to-movies rating matrix
[Figure: rating matrix; rows: SciFi and Romance users; columns: The Avengers, Star Wars, Twilight, Matrix]
The “topics” or “concepts”, a.k.a. latent dimensions or latent factors

16
In this example, there are two “concepts” underlying the movies: science fiction and romance.

16

SVD: Example – Users-to-Movies
Example (continued)
[Figure: A = U × Σ × Vᵀ for the users-to-movies matrix; rows: SciFi and Romance users; columns: The Avengers, Star Wars, Twilight, Matrix; concepts: SciFi-concept, Romance-concept]

17
The rows of M are the people and the columns of M are the movies.

17

SVD: Example – Users-to-Movies
Example (continued)
[Figure: A = U × Σ × Vᵀ; rows: SciFi and Romance users; columns: The Avengers, Star Wars, Twilight, Matrix; concepts: SciFi-concept, Romance-concept]
U is the “user-to-concept” similarity matrix

18

The rows of M are the people and the columns of M are the movies.

The matrix U connects users to concepts.

The value 0.24 in the first row and first column of U is smaller than some of the other entries in that column because, while the first person watches only science fiction, he does not rate those movies highly. The entry in the first row and second column of U is nearly zero, because he does not rate romance movies at all.

18

SVD: Example – Users-to-Movies
Example (continued)
[Figure: A = U × Σ × Vᵀ; rows: SciFi and Romance users; columns: The Avengers, Star Wars, Twilight, Matrix]
The diagonal of Σ gives the “strength” of the SciFi-concept

19

The matrix Σ gives the strength of each of the concepts. In this example, the strength of the science fiction concept is 11.9, while the strength of the romance concept is 7.1. Intuitively, the science-fiction concept is stronger because the data provides more movies of that genre and more people who like them.
19

SVD: Example – Users-to-Movies
Example (continued)
[Figure: A = U × Σ × Vᵀ; rows: SciFi and Romance users; columns: The Avengers, Star Wars, Twilight, Matrix; concepts: SciFi-concept, Romance-concept]
V is the “movie-to-concept” similarity matrix

20
Q: Does the movie “Twilight” relate to concept “Romance”?

The matrix V relates movies to concepts. The 0.5x entries indicate that the first three movies are of the science-fiction genre, while the small absolute value in the last column says that the movie does not partake much of the concept of romance.

The second row indicates that the movie Twilight is related to romance.
20

SVD: Interpretation #1
“movies”, “users” and “concepts”
U: user-to-concept similarity matrix
V: movie-to-concept similarity matrix
Σ: its diagonal elements give the ‘strength’ of each concept
21

SVD: Interpretation #2
SVD gives ‘best’ axis to project on
‘best’ = minimal sum of squares of projection errors
In other words,
minimum reconstruction error
22

SVD: Interpretation #2
– example
𝑈: user-to-concept matrix
𝑉: movie-to-concept matrix

23

SVD: Interpretation #2
– example

[Figure: A = U × Σ × Vᵀ; annotation: variance (“spread”) on the v1 axis]

24

SVD: Interpretation #2
– example
UΣ: the coordinates of the points in the projection axis

25

Projection of users on the “Sci-Fi” axis

25

SVD: Interpretation #2
Q: how exactly is dimension reduction done?
26


SVD: Interpretation #2
Q: how exactly is dimension reduction done?
A: Set smallest singular values to zero

27


SVD: Interpretation #2
Q: how exactly is dimension reduction done?
A: Set smallest singular values to zero
Approximate original matrix by low-rank matrices

28


SVD: Interpretation #2
Q: how exactly is dimension reduction done?
A: Set smallest singular values to zero
Approximate original matrix by low-rank matrices

29

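A minimal sketch of this truncation step, assuming NumPy (the function name `low_rank_approx` is illustrative):

```python
import numpy as np

def low_rank_approx(A, k):
    """Rank-k approximation of A: keep only the k largest singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_trunc = s.copy()
    s_trunc[k:] = 0.0                       # set the smallest singular values to zero
    return U @ np.diag(s_trunc) @ Vt

A = np.random.rand(7, 5)
B = low_rank_approx(A, k=2)
print(np.linalg.matrix_rank(B))             # -> 2
print(np.linalg.norm(A - B))                # Frobenius-norm reconstruction error
```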

SVD: Best Low Rank Approximation
A ≈ B
B is a rank-k matrix
B is the best rank-k approximation of matrix A
30

SVD: Best Low Rank Approximation
Theorem: Let A = U Σ Vᵀ (rank(A) = r > k), and let
B = U S Vᵀ, where S is a diagonal r × r matrix with sᵢ = σᵢ (i = 1, …, k) and sᵢ = 0 (i > k).
Then B is the best rank-k approximation to A:
B = argmin_{X: rank(X) = k} ‖A − X‖_F
Intuition (spectral decomposition): A = σ1 u1 ∘ v1ᵀ + σ2 u2 ∘ v2ᵀ + ⋯

Why is setting the small σᵢ to 0 the right thing to do?
Vectors uᵢ and vᵢ are unit length, so σᵢ scales them.
Therefore, zeroing the small σᵢ introduces less error.

31

SVD: Interpretation #2
Q: How many 𝜎𝑖 to keep?
A: Rule of thumb
Keep 80~90% of the “energy” (∑ᵢ σᵢ²)

32

A = σ1 u1 ∘ v1ᵀ + σ2 u2 ∘ v2ᵀ + ⋯
Assume: σ1 ≥ σ2 ≥ ⋯
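One possible way to code this rule of thumb (a sketch; the 90% threshold is just the heuristic above, and `choose_rank` is an illustrative name):

```python
import numpy as np

def choose_rank(singular_values, energy=0.90):
    """Smallest k whose top-k singular values retain `energy` of the total sum of sigma_i^2."""
    sq = np.asarray(singular_values) ** 2
    cumulative = np.cumsum(sq) / np.sum(sq)
    return int(np.searchsorted(cumulative, energy) + 1)

s = np.array([11.9, 7.1, 1.3, 0.4])   # example singular values, sorted decreasingly
print(choose_rank(s))                 # -> 2: the first two singular values hold over 90% of the energy
```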

SVD: Complexity
SVD for a full m × n matrix takes O(nm²) or O(n²m) time (whichever is smaller)

But:
Less work if we only want to compute the singular values,
or if we only want the first k singular vectors (thin SVD),
or if the matrix is sparse (sparse SVD).
Stable implementations
LINPACK, Matlab, Splus, Mathematica…
Available in most common languages

33
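For the thin / sparse cases, a hedged sketch using SciPy's `scipy.sparse.linalg.svds`, which computes only the k leading singular triplets (the matrix below is a random sparse stand-in):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

A = sparse_random(2000, 500, density=0.01, format="csr", random_state=0)

# Compute only the k = 10 leading singular triplets instead of the full SVD
U, s, Vt = svds(A, k=10)
print(np.sort(s)[::-1])        # the 10 largest singular values, sorted in decreasing order
```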

SVD: Conclusions so far
SVD: A = U Σ Vᵀ: unique
U: user-to-concept similarities
V: movie-to-concept similarities
Σ: strength of each concept
Dimensionality reduction
Keep the few largest singular values (80-90% of “energy”)
SVD: picks up linear correlations
34

SVD: Relationship to Eigen-decomposition
SVD gives us: A = U Σ Vᵀ

Eigen-decomposition: A = X Λ Xᵀ

A is symmetric
U, V, X are orthonormal (UᵀU = VᵀV = XᵀX = I)
Σ, Λ are diagonal

35

SVD: Relationship to Eigen-decomposition
Eigen-decomposition of A Aᵀ and Aᵀ A:
A Aᵀ = U Σ Σᵀ Uᵀ and Aᵀ A = V Σᵀ Σ Vᵀ
So, U is the matrix of eigenvectors of A Aᵀ, and
V is the matrix of eigenvectors of Aᵀ A
This shows how to use eigen-decomposition to compute the SVD
The singular values of A are the square roots of the corresponding eigenvalues of Aᵀ A
Note: A Aᵀ and Aᵀ A are the dataset covariance matrices (up to centering and scaling)
36

36
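A quick numerical check of this relationship (a sketch, not from the slides): the eigenvalues of AᵀA should equal the squared singular values of A, and its eigenvectors should match the columns of V up to sign.

```python
import numpy as np

A = np.random.rand(6, 4)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

eigvals, eigvecs = np.linalg.eigh(A.T @ A)                # A^T A is symmetric
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]        # reorder decreasingly

print(np.allclose(eigvals, s ** 2))                       # eigenvalues of A^T A = sigma_i^2
print(np.allclose(np.abs(eigvecs), np.abs(Vt.T)))         # eigenvectors = columns of V (up to sign)
```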

A Brief Review of Eigen-Decomposition
Eigenvalues and eigenvectors
Let A be an n × n matrix. If A x = λ x for some x ≠ 0, then λ is an eigenvalue of A and x is an eigenvector of A; (λ, x) is an eigenpair.
Simple computational method for eigenvalues:
Solve the equation det(A − λ I) = 0
Example
[The slide works through a 2 × 2 example: form A − λ I, take its determinant, and solve det(A − λ I) = 0 for λ.]

37

A Brief Review of Eigen-Decomposition
Example (continued)

Solving det(A − λ I) = 0, we get the eigenvalues λ1 and λ2.
Now we compute the eigenvectors.
For eigenvalue λ1 we need to find x such that (A − λ1 I) x = 0.
Solving this linear system gives x; since x needs to be a unit vector, we normalize it.
Similarly, we can compute the eigenvector for λ2.

38
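Since the slide's concrete 2 × 2 example did not survive the export, here is a hedged stand-in that runs the same procedure numerically (the matrix below is illustrative, not the one from the slide):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])                   # illustrative 2x2 matrix, not the slide's

# Characteristic polynomial det(A - lambda*I) = lambda^2 - tr(A)*lambda + det(A)
a, b = A[0]
c, d = A[1]
char_poly = [1.0, -(a + d), a * d - b * c]
print(np.roots(char_poly))                   # eigenvalues: 3 and 1

# Eigenpairs directly; returned eigenvectors already have unit norm
eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)
print(eigvecs)                               # columns are the unit eigenvectors
print(np.allclose(A @ eigvecs[:, 0], eigvals[0] * eigvecs[:, 0]))   # A x = lambda x
```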

Computing Eigenvalues: Power Method
Power method
Choose an arbitrary x0
x_{k+1} = A x_k
Theorem: the sequence x1, x2, … converges (up to scaling) to the principal eigenvector (i.e., the eigenvector corresponding to the largest eigenvalue)
Normalized power method
Choose an arbitrary x0
x_{k+1} = A x_k / ‖A x_k‖
Theorem: the sequence converges to the principal eigenvector.

39
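A minimal sketch of the normalized power method described above (illustrative; it converges when the largest eigenvalue is strictly dominant):

```python
import numpy as np

def power_method(A, num_iters=100, seed=0):
    """Normalized power iteration: x_{k+1} = A x_k / ||A x_k||."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(A.shape[1])          # arbitrary starting vector
    for _ in range(num_iters):
        x = A @ x
        x = x / np.linalg.norm(x)                # normalize at every step
    return x @ A @ x, x                          # Rayleigh quotient estimate, eigenvector

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
lam, v = power_method(A)
print(lam)      # ~3, the largest eigenvalue
print(v)        # ~[0.707, 0.707], the principal eigenvector
```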

In-class Practice
Go to practice
40

SVD Case Study: How to Query?
Q: Find users that like “The Avengers”
A: Map query into a “concept space” – how?

41

[Figure: the users-to-movies matrix and its SVD factors; rows: SciFi and Romance users; columns: The Avengers, Star Wars, Twilight, Matrix]

Q: Find users that like “The Avengers”
A: Map query into a “concept space” – how?

Case Study: How to Query?

42
[Figure: the query vector q in movie space (axes: The Avengers, Twilight, Matrix, Star Wars), together with the concept axes v1 and v2]
Project into concept space: inner product of q with each concept vector vi

A is not one of the people represented by the original matrix, but he wants to use the system to know what movies he would like.

He has only seen one movie, Matrix, and rated it 5. Thus, we can represent A by the vector q = [5, 0, 0, 0], as if this were one of the rows of the original matrix.
42

Q: Find users that like “The Avengers”
A: Map query into a “concept space” – how?

Case Study: How to Query?

43
[Figure: the query q projected onto the concept axis v1; the projection value is q·v1]
Project into concept space: inner product of q with each concept vector vi
43

Compactly, we have
q_c = q V

Case Study: How to Query?

[Figure: q_c = q × V; V holds the movie-to-concept similarities, and the first coordinate of q_c is the SciFi-concept score; movies: The Avengers, Twilight, Matrix, Star Wars]

44

The concept representation q_c shows high interest in science fiction and almost no interest in romance.
44

How would the user d that rated (‘Star Wars’, ‘Matrix’) be handled?
d_c = d V

Case Study: How to Query?

[Figure: d_c = d × V; V holds the movie-to-concept similarities, and the first coordinate of d_c is the SciFi-concept score]

45

The concept representation d_c shows high interest in science fiction and almost no interest in romance.
45

Observation
User d that rated (‘Star Wars’) will be similar to user q that rated (‘The Avengers’), although d and q have zero ratings in common!

Case Study: How to Query?

[Figure: q_c and d_c plotted along the SciFi-concept axis]

46
Zero ratings in common, yet similarity ≠ 0: cosine similarity in concept space is 0.99
46
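A hedged sketch of the whole query story (the matrix V below is illustrative, not the exact numbers from the slides; the point is that q and d become similar once mapped into concept space):

```python
import numpy as np

# Movie order: The Avengers, Twilight, Matrix, Star Wars (values are illustrative)
V = np.array([[0.58, 0.00],     # movie-to-concept similarities:
              [0.00, 0.71],     #   column 0 = SciFi-concept
              [0.58, 0.00],     #   column 1 = Romance-concept
              [0.58, 0.00]])

q = np.array([5, 0, 0, 0])      # user q rated only 'The Avengers'
d = np.array([0, 0, 5, 4])      # user d rated 'Matrix' and 'Star Wars'

q_c, d_c = q @ V, d @ V         # map both users into concept space
cos = (q_c @ d_c) / (np.linalg.norm(q_c) * np.linalg.norm(d_c))
print(q_c, d_c)                 # both load on the SciFi-concept
print(cos)                      # ~1.0, although q and d share no rated movie
```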

SVD: Drawbacks
Optimal low-rank approximation
in terms of Euclidean norm
Interpretability problem:
A singular vector specifies a linear
combination of all input columns or rows
Lack of sparsity:
Singular vectors are dense!
47

[Figure: A = U Σ Vᵀ; the factors U and Vᵀ are dense even when A is sparse]

CUR Decomposition
Goal: express A as a product of matrices C, U, R: A ≈ C · U · R
Minimize ‖A − C·U·R‖_F
Constraints on C and R

48

CUR Decomposition
Goal: express A as a product of matrices C, U, R: A ≈ C · U · R
Minimize ‖A − C·U·R‖_F
Constraints on C and R: C is a set of actual columns of A, and R is a set of actual rows of A

49

CUR: Good Approximation to SVD
Let A_k be the best rank-k approximation of A (obtained by SVD)
Theorem
The CUR algorithm, in O(m·n) time, achieves
‖A − C·U·R‖_F ≤ ‖A − A_k‖_F + ε ‖A‖_F
with probability at least 1 − δ, by picking
O(k log(k/δ)/ε²) columns and
O(k² log³(k/δ)/ε⁶) rows
(in practice, choose about 4k columns/rows)
50

CUR: How it Works
Sampling columns (similarly for rows):

51

For step 5: Having selected each of the columns in A, we scale each column by dividing its elements by the square root of the expected number of times this column would be picked. That is, we divide the elements of the jth column of A, if it is selected, by sqrt(cP(j)).
51
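A sketch of this column-sampling step as described in the note above, assuming sampling probabilities proportional to squared column norms and the 1/sqrt(c·P(j)) rescaling (function and variable names are illustrative):

```python
import numpy as np

def sample_columns(A, c, seed=0):
    """Sample c columns of A with probability proportional to their squared norm."""
    rng = np.random.default_rng(seed)
    col_sq_norms = np.sum(A ** 2, axis=0)
    P = col_sq_norms / col_sq_norms.sum()        # P(j) = ||A[:, j]||^2 / ||A||_F^2
    idx = rng.choice(A.shape[1], size=c, replace=True, p=P)
    C = A[:, idx] / np.sqrt(c * P[idx])          # rescale each picked column by 1/sqrt(c*P(j))
    return C, idx

A = np.random.rand(8, 6)
C, idx = sample_columns(A, c=3)
print(idx, C.shape)                              # sampled column indices; C is 8 x 3
```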

CUR: Computing U
Let W be the “intersection” of sampled columns C and rows R
Let the SVD of W be W = X Z Yᵀ
Then: U = W⁺ = Y Z⁺ Xᵀ, where
Z⁺ is the “Moore-Penrose pseudo-inverse”:
z⁺ᵢᵢ = 1/zᵢᵢ if zᵢᵢ ≠ 0, and 0 otherwise.

52
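Continuing the sketch, U can be taken as the Moore-Penrose pseudo-inverse of the intersection block W via `np.linalg.pinv`; for brevity this version samples rows and columns without the importance rescaling of the previous sketch:

```python
import numpy as np

def cur_sketch(A, c, r, seed=0):
    """Illustrative CUR: C = sampled columns, R = sampled rows, U = pinv of their intersection."""
    rng = np.random.default_rng(seed)
    cols = rng.choice(A.shape[1], size=c, replace=False)
    rows = rng.choice(A.shape[0], size=r, replace=False)
    C, R = A[:, cols], A[rows, :]
    W = A[np.ix_(rows, cols)]           # "intersection" of the sampled rows and columns
    U = np.linalg.pinv(W)               # Moore-Penrose pseudo-inverse of W
    return C, U, R

rng = np.random.default_rng(0)
A = rng.random((30, 12)) @ rng.random((12, 20))            # a rank-12 matrix
C, U, R = cur_sketch(A, c=12, r=12)
print(np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A))   # tiny relative error
```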

CUR: Pros & Cons
+ easy interpretation
the basis vectors are actual columns and rows
– duplicate columns and rows
columns of large norms will be sampled many times
53

CUR: Duplicate Columns
If we want to get rid of the duplicates
Throw them away
Scale the columns/rows by the square root of the number of duplicates
54

SVD vs CUR

55

Question: Large or small? Dense or sparse?

SVD vs CUR

56

SVD & CUR: Simple Experiments
DBLP data
author-to-conference matrix
very sparse
A(i, j): number of papers published by author i at conference j
428k authors (rows)
3659 conferences (columns)
Dimensionality reduction
Running time?
Space?
Reconstruction error?

57

Results: DBLP

accuracy: 1 - relative sum of squared errors
space ratio: # output non-zero matrix entries / # input non-zero matrix entries

courtesy: Sun, Faloutsos: Less is More: Compact Matrix Decomposition for Large Sparse Graphs, SDM’07
58

58

The Linearity Assumption
SVD is limited to linear projections
Data lies on a low-dimensional linear space
Non-linear methods: Isomap
Data lies on a low-dimensional manifold
Non-linear
How?
Build an adjacency graph
Apply SVD to the graph adjacency matrix
Further reading: the Wikipedia page on Isomap

59

PCA: An Application of SVD
PCA = Principal Component Analysis

Motivation
Visualization

60

PCA: Data Visualization
Example:
Given 53 blood measurements (features) for each of 65 people (data items or instances)

How can we visualize the samples?

61

PCA: Data Visualization

How can we visualize the other variables???
… difficult to see in 4 or higher dimensional spaces …
62

PCA: Data Visualization
Is there a representation better than the coordinate axes?
Is it really necessary to show all the 53 dimensions?

What if there are strong correlations between the features?
How could we find the smallest subspace of the 53-D space that keeps the most information about the original data?

A solution: Principal Component Analysis
An application of SVD.
63

PCA: Definition and Algorithms
PCA
Orthogonal projection of the data onto a lower-dimensional linear space such that
Maximize variance of projected data (purple line)
Minimize the mean squared distance between each data point and its projection (the sum of the blue lines)
Look at the data from a (literally) different angle.

64

PCA: Idea
Given data points in a d-dimensional space, project them into a lower dimensional space while preserving as much information as possible.
Find best planar approximation to 3D data
Find the best 12-D approximation to 10⁴-D data
In particular, choose projection that minimizes squared error in reconstructing the original data.
Implement through SVD

65

PCA
PCA Vectors originate from the center of mass.
Principal component #1: points in the direction of the largest variance.
Each subsequent principal component
is orthogonal to the previous ones, and
points in the directions of the largest variance of the residual subspace

66

PCA: 2D Gaussian dataset

67

1st PCA axis

68

2nd PCA axis

69

PCA: Algorithm
Given centered data x1, …, xm, compute the principal vectors
1st principal vector w1:
w1 = argmax_{‖w‖ = 1} (1/m) ∑ᵢ (wᵀ xᵢ)²
i.e., maximize the variance of the projection of the data onto w

70

PCA: Algorithm by SVD
SVD of the centered n × d data matrix X: X = U Σ Vᵀ
n: number of instances
d: dimension

71

PCA: Algorithm by SVD
Columns of V
are exactly the principal vectors,
orthogonal and of unit norm
Matrix Σ
diagonal
strength of each eigenvector (the singular values)
Columns of U Σ
coefficients for reconstructing the samples.
72
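A minimal sketch of PCA via SVD following this recipe (center the data, take the SVD, read the principal vectors from V and the sample coordinates from UΣ; the data below is a random stand-in for the 65 × 53 example):

```python
import numpy as np

def pca_svd(X, k):
    """PCA of an (n_instances x d) data matrix via SVD of the centered data."""
    X_centered = X - X.mean(axis=0)                    # center each feature
    U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:k]                                # top-k principal vectors (rows)
    coords = U[:, :k] * s[:k]                          # sample coordinates = U_k * Sigma_k
    return components, coords, s

rng = np.random.default_rng(0)
X = rng.normal(size=(65, 53))                          # stand-in: 65 people x 53 measurements
components, coords, s = pca_svd(X, k=2)
print(components.shape, coords.shape)                  # (2, 53) and (65, 2): a 2-D view to plot
```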

Application: Face Recognition
Want to identify specific person, based on facial image
Can’t just use the given 256 x 256 pixels

73

Applying PCA
Method A: Build a PCA subspace for each person and check which subspace can reconstruct the test image the best
Method B: Build one PCA database for the whole dataset and then classify based on the weights.

Example data set: Images of faces
Each face is a 256 × 256 image,
i.e., 65,536 pixel values

74

Each face is converted to a 256*256-d vector (not matrix).
74

Principal Components (Method B)

75

Reconstructing … (Method B)

76

Happiness Subspace (Method A)

77

Suppose we have a set of faces known as happy faces (i.e., we have labels for each face). Then, we may compute the principal vectors of this subset of faces. These principal vectors are shown here.
Similarly, in the next slide, we can compute the principal vectors of disgust faces.
Therefore, to predict the label of a test face, we can use these two sets of principal vectors to reconstruct the test face and see which set reconstructs it better (Method A).

77

Disgust Subspace (Method A)

78

Image Compression

Divide the original 372×492 image into patches:
Each patch is an instance that contains 12×12 pixels on a grid
View each as a 144-D vector

79
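A rough sketch of this patch-based compression, assuming a grayscale image whose sides are multiples of 12 (372 = 31·12 and 492 = 41·12, so the image above qualifies); the image here is a random stand-in:

```python
import numpy as np

def compress_patches(img, patch=12, k=16):
    """Split img into patch x patch blocks, project each onto the top-k PCA components."""
    h, w = img.shape
    blocks = (img.reshape(h // patch, patch, w // patch, patch)
                 .transpose(0, 2, 1, 3)
                 .reshape(-1, patch * patch))          # each row is a 144-D patch vector
    mean = blocks.mean(axis=0)
    U, s, Vt = np.linalg.svd(blocks - mean, full_matrices=False)
    comp = Vt[:k]                                      # top-k "eigen-patches"
    coded = (blocks - mean) @ comp.T                   # k numbers per patch instead of 144
    recon = coded @ comp + mean                        # rebuild every patch from its k coefficients
    return (recon.reshape(h // patch, w // patch, patch, patch)
                 .transpose(0, 2, 1, 3)
                 .reshape(h, w))

img = np.random.rand(372, 492)                         # stand-in for the grayscale image
print(compress_patches(img, k=16).shape)               # (372, 492), reconstructed from 16-D codes
```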

PCA Compression: 144D => 60D

80

PCA Compression: 144D => 16D

81

PCA Compression: 144D => 6D

82

6 Most Important Eigenvectors

83

PCA Compression: 144D => 3D

84

3 Most Important Eigenvectors

85

Noisy Image

86

Denoised Image using 15 PCA Components
87

PCA: Shortcomings
PCA cannot capture non-linear structure
Similar to SVD

88

PCA: Shortcomings
PCA does not know labels

89

PCA: Conclusions
PCA
find orthonormal basis for data
sort dimensions in order of “strength”
discard low significance dimensions
Uses
Get compact description
Ignore noise
Improve classification (hopefully)
Not magic:
Doesn’t know class labels
Can only capture linear variations
One of many tricks to reduce dimensionality!
90

Extra: Compute PCA Using Eigen-Decomposition
Given centered data x1, …, xm (rows of the data matrix X), compute the covariance matrix
C = (1/m) ∑ᵢ xᵢ xᵢᵀ = (1/m) Xᵀ X
Top PCA components = top eigenvectors of C.
Equivalence between eigen-decomposition and SVD:
SVD decomposition of X: X = U Σ Vᵀ
gives the SVD-based algorithm for PCA
Eigen-decomposition of Xᵀ X: Xᵀ X = V Λ Vᵀ
gives the eigen-based algorithm for PCA
The equivalence gives Λ = Σ², i.e., λᵢ = σᵢ².

91
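A short numerical check of this equivalence (a sketch): both routes give the same principal directions up to sign, and the eigenvalues of the covariance equal σᵢ²/m because of the 1/m factor.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Xc = X - X.mean(axis=0)                                  # centered data

# Eigen-based PCA: eigenvectors of the covariance matrix
cov = Xc.T @ Xc / Xc.shape[0]
eigvals, eigvecs = np.linalg.eigh(cov)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]       # decreasing order

# SVD-based PCA: right singular vectors of the centered data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

print(np.allclose(eigvals, s ** 2 / Xc.shape[0]))        # lambda_i = sigma_i^2 / m
print(np.allclose(np.abs(eigvecs), np.abs(Vt.T)))        # same principal directions (up to sign)
```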

One-slide Takeaway
Dimensionality reduction
compress/reduce dimension
reconstruct the original matrix by two or more smaller matrices
Singular value decomposition (SVD)
decompose a matrix into A = U Σ Vᵀ
U, V: column-orthonormal; Σ: diagonal matrix

CUR decomposition
A ≈ C · U · R
C: a set of actual columns of A; R: a set of actual rows of A
Principal component analysis (PCA)
reconstruct the data matrix with a smaller number of eigenvectors
view the data from a (literally) different angle.
92

In-class Practice
1. Describe briefly (informally or formally) the relationship between singular value decomposition and eigenvalue decomposition.
2.1 Compute the eigenvalues and eigenvectors of matrix
2.2 Let It is easy to check that . What are the singular values of ?
2.3 Obtain the SVD of A, where A = U Σ Vᵀ

93
