Notebook 8 – PCA
Principal component analysis (PCA)
Principal component analysis (PCA) is a statistical method for finding a rotation of the data such that the first coordinate has the largest possible variance, and each succeeding coordinate in turn has the largest possible variance subject to being orthogonal to the preceding coordinates.
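In symbols (the standard formulation, not anything Spark-specific): for a centered data matrix $X$, the first principal direction is
$$w_1 = \arg\max_{\lVert w \rVert = 1} \operatorname{Var}(Xw),$$
and each subsequent direction $w_k$ maximizes the same quantity subject to being orthogonal to $w_1, \dots, w_{k-1}$. The columns of the matrix returned by Spark's computePrincipalComponents are these directions.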
Usual initializations and the relevant imports:
In [ ]:
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
spark.mllib supports PCA for tall-and-skinny matrices stored in row-oriented format and for any Vectors. The cell below demonstrates how to compute principal components on a RowMatrix and how to use them to project the vectors into a low-dimensional space.
In [ ]:
val sparkSession = SparkSession
.builder()
.master("local[1]")
.appName("PCA")
.getOrCreate()
val data = Array(
Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0))
val dataRDD = sparkSession.sparkContext.parallelize(data, 2)
val mat: RowMatrix = new RowMatrix(dataRDD)
// Compute the top 4 principal components.
// Principal components are stored in a local dense matrix.
val pc: Matrix = mat.computePrincipalComponents(4)
// Project the rows to the linear space spanned by the top 4 principal components.
val projected: RowMatrix = mat.multiply(pc)
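For a quick sanity check (not part of the original example), both the principal components and the projected rows can be brought back to the driver; this is only sensible for small data:
println(pc)
projected.rows.collect().foreach(println)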
In the HPC experiments, we will use the NIPS 2014 paper along with the bag-of-words data mentioned in it, but those datasets are too large for this notebook. We therefore create a small dataset from the following documents:
D1: the cat sat on the mat
D2: the cat sat on the cat
D3: the cat sat
D4: the mat sat
Numbering the words
0 the
1 cat
2 sat
3 on
4 mat
the documents can be represented using the following sparse vectors:
Vectors.sparse(5, Seq((0, 2.0), (1, 1.0), (2, 1.0), (3, 1.0), (4, 1.0)))
Vectors.sparse(5, Seq((0, 2.0), (1, 2.0), (2, 1.0), (3, 1.0)))
Vectors.sparse(5, Seq((0, 1.0), (1, 1.0), (2, 1.0)))
Vectors.sparse(5, Seq((0, 1.0), (2, 1.0), (4, 1.0)))
Equally, the collection could be represented by triples of document_id word_id freq, as follows
1 0 2.0
1 1 1.0
1 2 1.0
1 3 1.0
1 4 1.0
2 0 2.0
2 1 2.0
2 2 1.0
2 3 1.0
3 0 1.0
3 1 1.0
3 2 1.0
4 0 1.0
4 2 1.0
4 4 1.0
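As a rough sketch of how such triples could be turned back into sparse vectors (assuming they are already available as an RDD of (docID, wordID, count) tuples; the vocabulary size and the grouping via groupByKey here are just one way to do it):
val vocabSize = 5
val triples: RDD[(Int, Int, Double)] = sparkSession.sparkContext.parallelize(Seq(
  (1, 0, 2.0), (1, 1, 1.0), (1, 2, 1.0), (1, 3, 1.0), (1, 4, 1.0),
  (2, 0, 2.0), (2, 1, 2.0), (2, 2, 1.0), (2, 3, 1.0),
  (3, 0, 1.0), (3, 1, 1.0), (3, 2, 1.0),
  (4, 0, 1.0), (4, 2, 1.0), (4, 4, 1.0)))
// Group the (wordID, count) pairs per document and build one sparse vector per document.
val docVectors: RDD[Vector] = triples
  .map { case (doc, word, count) => (doc, (word, count)) }
  .groupByKey()
  .map { case (_, wordCounts) => Vectors.sparse(vocabSize, wordCounts.toSeq) }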
(This conversion will come in useful for processing the bag of words data.) We now generate the principal component vectors for this dataset.
In [ ]:
val sc = sparkSession.sparkContext
// The data
val data = Array(
Vectors.sparse(5, Seq((0, 2.0), (1, 1.0), (2, 1.0), (3, 1.0), (4, 1.0))),
Vectors.sparse(5, Seq((0, 2.0), (1, 2.0), (2, 1.0), (3, 1.0))),
Vectors.sparse(5, Seq((0, 1.0), (1, 1.0), (2, 1.0))),
Vectors.sparse(5, Seq((0, 1.0), (2, 1.0), (4, 1.0))))
val dataRDD = sc.parallelize(data)
val mat: RowMatrix = new RowMatrix(dataRDD)
// Compute the top 4 principal components.
// Principal components are stored in a local dense matrix.
val pc: Matrix = mat.computePrincipalComponents(4)
// Project the rows to the linear space spanned by the top 4 principal components.
val projected: RowMatrix = mat.multiply(pc)
Exercises
Exercise 1
Take a look at the bagofwords NIPS data. The format of this data is:
Number of documents
Number of words in the vocabulary
Total number of words in the collection
docID wordID count
docID wordID count
…
docID wordID count
Initially, we need to read this data in; the steps are roughly as follows (a sketch of step 1 appears after this exercise):
1) extract the number of documents and the size of the vocabulary, and strip off the first three lines
2) combine the words per document
3) create sparse vectors
Create this as a standalone program. You can use .partitions.size to check how many partitions your data is divided into, and you should keep everything as parallel as possible. You will benefit from creating a very small example to test your work, and then checking that your work scales up to the NYTIMES bagofwords data.
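A minimal sketch of step 1 (the file path is hypothetical; also check whether the word IDs in the file are 1-based, as in the UCI bag-of-words files, and shift them to 0-based indices if so):
val raw = sparkSession.sparkContext.textFile("docword.nips.txt")  // hypothetical path
// The first three lines hold the document count, vocabulary size and total word count;
// take() only brings those three lines to the driver.
val header = raw.take(3)
val numDocs = header(0).trim.toInt
val vocabSize = header(1).trim.toInt
// Drop the header lines without collecting the rest, then parse "docID wordID count".
val triples = raw.zipWithIndex()
  .filter { case (_, idx) => idx >= 3 }
  .map { case (line, _) =>
    val Array(doc, word, count) = line.trim.split("\\s+")
    (doc.toInt, word.toInt - 1, count.toDouble)  // assumes 1-based word IDs in the file
  }
Steps 2 and 3 can then follow the grouping sketch shown earlier for the toy dataset.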
Exercise 2
If you try to run the PCA program from the notebook on a large dataset, it is likely to run out of memory, because computePrincipalComponents constructs the covariance matrix locally on the driver (and then uses SVD to generate the principal components). For this exercise, you are to implement an alternative method of computing principal components: center your matrix (subtract the per-word column means from each document vector) and then use SVD on the distributed matrix. Implementations of this approach exist, and you can follow this link to a Matlab implementation of disPCA.
You should be able to take the sparse vectors you generated in Exercise 1, center these and then use SVD. A small-scale example will allow you to check that your first principal components match those generated by computePrincipalComponents.
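One possible shape for the computation (a sketch only: the variable names are illustrative, docVectors is assumed to be the RDD[Vector] built in Exercise 1, and centering densifies each row):
val mat = new RowMatrix(docVectors)
// Column means, i.e. the mean frequency of each word across documents.
val mean = mat.computeColumnSummaryStatistics().mean
// Center each row by subtracting the column means (note that this densifies the rows).
val centeredRows = mat.rows.map { v =>
  Vectors.dense(v.toArray.zip(mean.toArray).map { case (x, m) => x - m })
}
val centered = new RowMatrix(centeredRows)
// The right singular vectors of the centered matrix are the principal components.
val k = 4
val svd = centered.computeSVD(k, computeU = true)
val principalComponents = svd.V   // columns may differ in sign from computePrincipalComponents(k)
val projected = centered.multiply(svd.V)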
Run PCA on the full NIPS data and the NYTIMES data, varying the number of principal components generated.
Exercise 3
Run $k$-means with the generated vectors for both datasets.
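A minimal sketch of how the projected vectors might be fed into spark.mllib's k-means (assuming projected is the RowMatrix of projected rows from Exercise 2; the cluster count and iteration count here are arbitrary):
import org.apache.spark.mllib.clustering.KMeans

val numClusters = 10
val maxIterations = 20
// Cluster the rows of the projected matrix (an RDD[Vector]).
val model = KMeans.train(projected.rows, numClusters, maxIterations)
println(s"Within-set sum of squared errors: ${model.computeCost(projected.rows)}")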