Lecture 8
Scalable PCA/SVD
Dimensionality Reduction & Factor Analysis
Haiping Lu
http://www.dcs.shef.ac.uk/~haiping
COM6012 Scalable Machine Learning
Spring 2018
Week 8 Contents
• Unsupervised Learning
• PCA – Dimensionality Reduction
• SVD – Factor Analysis
• Scalable PCA in Spark
Unsupervised Learning
• Supervised methods: predict our data as a function of other data, y = f(X)
• Unsupervised methods: find structure in the data on its own, f(X)
Three Topics
• Principal component analysis (PCA) & SVD
• Dimensionality reduction & factor analysis
• K-means
• Clustering
• Matrix factorisation (with missing information)
• Collaborative filtering (recommender systems)
• Scale these algorithms for big data
Week 8 Contents
• Unsupervised Learning
• PCA – Dimensionality Reduction
• SVD – Factor Analysis
• Scalable PCA in Spark
Dimensionality Reduction
• Assumption: data lies on or near a low d-dimensional subspace
• The axes of this subspace provide an effective representation of the data
Why Reduce Dimensions?
• Discover hidden correlations/topics
• Words that occur commonly together
• Remove redundant and noisy features
• Not all words are useful
• Interpretation and visualization
• Easier storage and processing of the data
Dimensionality Reduction
• Raw data is complex and high-dimensional
• Dimensionality reduction describes the data using a simpler, more compact representation
• This representation may make interesting patterns in the data clearer or easier to see
Dimensionality Reduction
• Goal: Find a ‘better’ representation for data
• How do we define ‘better’?
• For example
• Minimise reconstruction error
• Maximise variance
• Both give the same solution: PCA!
PCA Algorithm
• Input: N data points, each a D-dimensional vector
• PCA algorithm (see the sketch below)
• 1. X₀ ← form the N × D data matrix, with one row vector xₙ per data point
• 2. X ← subtract the mean x̄ from each row vector xₙ in X₀
• 3. Σ ← XᵀX, the Gramian (scatter) matrix of X
• 4. Find the eigenvectors and eigenvalues of Σ
• PCs: U (D × d), the d eigenvectors with the largest eigenvalues
• PCA feature for a D-dimensional vector y: Uᵀy (d-dimensional)
• Zero correlations, ordered by variance
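A minimal NumPy sketch of these steps (illustrative only; the function and variable names are my own, not from the lecture):

import numpy as np

def pca(X0, d):
    """PCA via eigen-decomposition of the scatter matrix (teaching sketch)."""
    X = X0 - X0.mean(axis=0)                   # step 2: centre each feature
    Sigma = X.T @ X                            # step 3: D x D Gramian (scatter) matrix
    vals, vecs = np.linalg.eigh(Sigma)         # symmetric eigen-decomposition (ascending order)
    U = vecs[:, np.argsort(vals)[::-1][:d]]    # D x d: eigenvectors with the d largest eigenvalues
    return U, X @ U                            # PCs and the d-dimensional PCA features

U, Z = pca(np.random.randn(100, 5), d=2)       # toy usage: 100 points, 5 dims -> 2 dims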
2D Data
[Scatter plot of the example 2D data]
Principal Components
[Plot of the 2D data with the 1st and 2nd principal vectors drawn as arrows]
• The best axis to project on
• Minimum RMS error
• Principal vectors are orthogonal
How Many Components?
• Check the distribution of eigenvalues
• Take enough eigenvectors to cover 80–90% of the variance (see the sketch below)
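A possible NumPy sketch of this rule (the toy data and the 90% threshold are my own choices):

import numpy as np

X0 = np.random.randn(100, 5)                      # toy data
X = X0 - X0.mean(axis=0)                          # centre the data
eigvals = np.linalg.eigvalsh(X.T @ X)[::-1]       # eigenvalues of the scatter matrix, largest first
explained = np.cumsum(eigvals) / eigvals.sum()    # cumulative fraction of variance covered
d = int(np.searchsorted(explained, 0.90)) + 1     # smallest d covering ~90% of the variance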
Other Practical Tips
• PCA assumptions (linearity, orthogonality) are not always appropriate
• Various extensions to PCA make different underlying assumptions, e.g., manifold learning, kernel PCA, ICA
• Centring is crucial: preprocess the data so that all features have zero mean before applying PCA
• PCA results depend on the scaling of the data
• Data is therefore sometimes rescaled in practice before applying PCA
Problems and Limitations
• What if the data is very high-dimensional?
• e.g., images: D ≥ 10⁴ (= 100 × 100 pixels)
• Problem:
• The Gramian matrix Σ has size O(D²)
• D = 10⁴ ⇒ Σ has 10⁸ entries
• Solution: Singular Value Decomposition (SVD)!
• Efficient algorithms are available
• Some implementations find just the top d eigenvectors
Week 8 Contents
• Unsupervised Learning
• PCA – Dimensionality Reduction
• SVD – Factor Analysis
• Scalable PCA in Spark
Singular Value Decomposition
• Factorization (decomposition) problem
• #1: Find concepts/topics/genres → factor analysis
• #2: Reduce dimensionality
The example matrix (shown on the following slides) is actually “2-dimensional”: all of its rows can be reconstructed by scaling [1 1 1 0 0] or [0 0 0 1 1], so D = 5 and d = 2
SVD – Definition
A[n × m] = U[n × r] Λ[r × r] (V[m × r])ᵀ
• A: n × m matrix (e.g., n documents, m terms)
• U: n × r matrix (n documents, r concepts)
• Λ: r × r diagonal matrix (strength of each ‘concept’; r is the rank of the matrix)
• V: m × r matrix (m terms, r concepts)
SVD – Properties
It is always possible to decompose a matrix A into A = U Λ Vᵀ, where
• U, Λ, V: unique (*)
• U, V: column-orthonormal (i.e., columns are unit vectors, orthogonal to each other)
• UᵀU = I; VᵀV = I (I: identity matrix)
• Λ: diagonal, with singular values positive and sorted in decreasing order
SVD Eigen-decomposition
• SVD gives us: A = U Λ Vᵀ
• Eigen-decomposition: B = W Σ Wᵀ
• U, V, W are column-orthonormal (UᵀU = I); Λ, Σ are diagonal
• Relationship:
• AAᵀ = U Λ Vᵀ (U Λ Vᵀ)ᵀ = U Λ Vᵀ (V Λᵀ Uᵀ) = U Λ Λᵀ Uᵀ
• AᵀA = V Λᵀ Uᵀ (U Λ Vᵀ) = V Λᵀ Λ Vᵀ = V Λ² Vᵀ
• So B = AᵀA = W Σ Wᵀ with W = V and Σ = Λ²
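A quick NumPy check of this relationship on a random matrix (not from the lecture; purely illustrative):

import numpy as np

A = np.random.randn(7, 5)
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) V^T
w, W = np.linalg.eigh(A.T @ A)                     # eigen-decomposition of A^T A

# the eigenvalues of A^T A equal the squared singular values of A (up to ordering)
assert np.allclose(np.sort(w), np.sort(s ** 2))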
SVD for PCA
• PCA by SVD (see the sketch below):
• 1. X₀ ← form the N × D data matrix, with one row vector xₙ per data point
• 2. X ← subtract the mean x̄ from each row vector xₙ in X₀
• 3. U Λ Vᵀ ← SVD of X
• The right singular vectors V of X are the eigenvectors of XᵀX, i.e., the PCs
• The singular values in Λ are the square roots of the eigenvalues of XᵀX
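The same recipe in NumPy, avoiding the explicit D × D scatter matrix (a sketch with my own naming, not the course code):

import numpy as np

def pca_svd(X0, d):
    """PCA of an N x D data matrix via the thin SVD of the centred data."""
    X = X0 - X0.mean(axis=0)                          # centre the data
    U, s, Vt = np.linalg.svd(X, full_matrices=False)  # X = U diag(s) V^T
    pcs = Vt[:d].T                                    # right singular vectors = PCs (D x d)
    return pcs, X @ pcs, s[:d] ** 2                   # PCs, features, eigenvalues of X^T X

pcs, features, eigvals = pca_svd(np.random.randn(100, 5), d=2)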
SVD – Properties
‘Spectral decomposition’ of the matrix:

  1 1 1 0 0
  2 2 2 0 0
  1 1 1 0 0
  5 5 5 0 0   =   [u₁ u₂]  ×  diag(λ₁, λ₂)  ×  [v₁ v₂]ᵀ
  0 0 0 2 2
  0 0 0 3 3
  0 0 0 1 1
SVD – Interpretation
‘documents’, ‘terms’ and ‘concepts’:
• U: document-to-concept similarity matrix
• V: term-to-concept similarity matrix
• Λ: its diagonal elements give the ‘strength’ of each concept
Projection:
• Best axis to project on (‘best’ = minimum sum of squares of projection errors)
SVD – Example
• A = U Λ Vᵀ – example (terms: data, inf., retrieval, brain, lung; rows 1–4 are CS documents, rows 5–7 are MD documents):

  A                U              Λ              Vᵀ
  1 1 1 0 0        0.18 0
  2 2 2 0 0        0.36 0
  1 1 1 0 0        0.18 0        9.64 0         0.58 0.58 0.58 0    0
  5 5 5 0 0   =    0.90 0    ×   0    5.29   ×  0    0    0    0.71 0.71
  0 0 0 2 2        0    0.53
  0 0 0 3 3        0    0.80
  0 0 0 1 1        0    0.27
SVD – Example
• A = U Λ Vᵀ – same example:
• The two columns of U correspond to the CS-concept and the MD-concept
• U is the doc-to-concept similarity matrix
SVD – Example
• A = U Λ Vᵀ – same example:
• The first diagonal entry of Λ (9.64) is the ‘strength’ of the CS-concept
SVD – Example
• A = U Λ Vᵀ – same example:
• V is the term-to-concept similarity matrix; its first column corresponds to the CS-concept
SVD – Dimensionality Reduction
• Q: how exactly is (further) dimensionality reduction done?
• A: set the smallest singular values to zero
• Note: 3 zero singular values have already been removed, so only the two non-zero ones (9.64 and 5.29) remain
• In the example, the smaller singular value (5.29) is set to zero, leaving only the CS-concept
SVD – Dimensionality Reduction
  1 1 1 0 0        0.18
  2 2 2 0 0        0.36
  1 1 1 0 0        0.18
  5 5 5 0 0   ≈    0.90   ×   9.64   ×   [0.58 0.58 0.58 0 0]
  0 0 0 2 2        0
  0 0 0 3 3        0
  0 0 0 1 1        0
SVD – Dimensionality Reduction
  1 1 1 0 0        1 1 1 0 0
  2 2 2 0 0        2 2 2 0 0
  1 1 1 0 0        1 1 1 0 0
  5 5 5 0 0   ≈    5 5 5 0 0
  0 0 0 2 2        0 0 0 0 0
  0 0 0 3 3        0 0 0 0 0
  0 0 0 1 1        0 0 0 0 0
• Best rank-1 approximation
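This rank-1 approximation can be reproduced in a few lines of NumPy (the matrix is the 7 × 5 example above):

import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A1 = s[0] * np.outer(U[:, 0], Vt[0])     # keep only the largest singular value
print(np.round(A1, 2))                   # first four rows recovered, last three ~0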
Week 8 Contents
• Unsupervised Learning
• PCA – Dimensionality Reduction
• SVD – Factor Analysis
• Scalable PCA in Spark
PCA & SVD in Spark MLlib
• Not scalable: computePrincipalComponents() from RowMatrix
• Scalable: computeSVD() from RowMatrix
• Code: https://github.com/apache/spark/blob/v2.1.0/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala
• Documentation: https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix
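A minimal PySpark sketch of the two calls (assumes an existing SparkContext `sc`; the toy data is my own, and the Python RowMatrix methods may need a newer PySpark than the 2.1.0 docs linked above):

from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix

rows = sc.parallelize([Vectors.dense([1.0, 1.0, 1.0, 0.0, 0.0]),
                       Vectors.dense([2.0, 2.0, 2.0, 0.0, 0.0]),
                       Vectors.dense([0.0, 0.0, 0.0, 3.0, 3.0])])
mat = RowMatrix(rows)

pcs = mat.computePrincipalComponents(2)    # local (driver-side) computation
svd = mat.computeSVD(2, computeU=True)     # can be computed in a distributed way
print(svd.s)                               # top-2 singular values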
PCA in Spark MLlib (RDD)
• https://spark.apache.org/docs/2.1.0/mllib-dimensionality-reduction.html
• Not scalable: local computation
• Notebook 8
PCA in Spark ML (DF)
• Now in https://spark.apache.org/docs/2.1.0/ml-features.html#pca
• Under features
• Scalable? Not likely
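A short DataFrame-based sketch (assumes an existing SparkSession `spark`; the feature values are made up):

from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame([(Vectors.dense([1.0, 0.0, 7.0]),),
                            (Vectors.dense([2.0, 1.0, 5.0]),),
                            (Vectors.dense([4.0, 3.0, 2.0]),)], ["features"])

pca = PCA(k=2, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)                                  # the fit itself is not distributed (cf. ‘Scalable? Not likely’)
model.transform(df).select("pcaFeatures").show(truncate=False)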
SVD in Spark MLlib (RDD)
• https://spark.apache.org/docs/2.1.0/mllib-dimensionality-reduction.html
• With distributed implementations
SVD in Spark MLlib (RDD)
• An m × n data matrix A with m > n (note the different notation)
• For large matrices, we usually don’t need the complete factorization, only the top k singular values and their associated singular vectors
• This saves storage, de-noises, and recovers the low-rank structure of the matrix (dimensionality reduction)
SVD in Spark MLlib (RDD)
• An m × n data matrix A
• Assume m > n; SVD: A = U Λ Vᵀ
• The singular values and the right singular vectors are derived from the eigenvalues and eigenvectors of AᵀA (which is smaller than A)
• The left singular vectors are computed via matrix multiplication as U = A V Λ⁻¹, if requested by the user via the computeU parameter (see the sketch below)
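In the PySpark API this corresponds to something like the following sketch (the matrix is the 7 × 5 example from the SVD slides; `sc` is an existing SparkContext, and the Python API may require a newer release than 2.1.0):

from pyspark.mllib.linalg.distributed import RowMatrix

rows = sc.parallelize([[1.0, 1.0, 1.0, 0.0, 0.0],
                       [2.0, 2.0, 2.0, 0.0, 0.0],
                       [1.0, 1.0, 1.0, 0.0, 0.0],
                       [5.0, 5.0, 5.0, 0.0, 0.0],
                       [0.0, 0.0, 0.0, 2.0, 2.0],
                       [0.0, 0.0, 0.0, 3.0, 3.0],
                       [0.0, 0.0, 0.0, 1.0, 1.0]])
svd = RowMatrix(rows).computeSVD(k=2, computeU=True)  # computeU=True triggers U = A V Λ⁻¹
print(svd.s)                # top-2 singular values, approximately [9.64, 5.29]
print(svd.V)                # local matrix of right singular vectors
svd.U.rows.foreach(print)   # U is a distributed RowMatrix (only present when computeU=True)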
Selection of SVD Computation
• Auto
• If n is small (n < 100) or k is large compared with n (k > n/2), compute AᵀA first and then compute its top eigenvalues and eigenvectors locally on the driver
• Otherwise, compute the matrix–vector products (AᵀA)v in a distributed way and send them to ARPACK, which computes the top eigenvalues and eigenvectors on the driver node
Selection of SVD Computation
• Auto (default)
Selection of SVD Computation
• Specify computeMode (private)
Selection of SVD Computation
• computeMode (note brzSvd.SVD is local)
Remark
• Acknowledgement
• Some slides are adapted from slides by Jure Leskovec et al., http://www.mmds.org
• References
• http://infolab.stanford.edu/~ullman/mmds/ch11.pdf
• http://www.mmds.org
• https://en.wikipedia.org/wiki/Principal_component_analysis
• https://spark.apache.org/docs/2.1.0/mllib-dimensionality-reduction.html