Lecture 25: More dimensionality reduction
CS 189 (CDSS offering)
2022/03/30
Today’s lecture
We will continue our discussion on dimensionality reduction and first finish up principal component analysis (PCA)
Then, we will briefly discuss a few other dimensionality reduction tools
Remember, this week, we are focused on unsupervised learning
• Generally speaking, rather than having data (x1, y1), …, (xN, yN), our data is instead of the form x1, …, xN
Not enough time to cover everything! If this topic is interesting or relevant to
you, let me know and I can provide additional resources
Recall: PCA
What is the “best” linear projection? PCA says: the one that retains the most information about the data in the form of variance
We compute the d × k matrix via the eigendecomposition of X⊤X
Writing X⊤X = QDQ⊤, where D is the diagonal matrix of eigenvalues in non-increasing order and Q is the orthogonal matrix of corresponding eigenvectors:
• The d × k matrix we need is the first k columns of Q (write this as Q1:k)
• XQ1:k thus gives a k-dimensional data representation
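As a concrete illustration (not from the slides), here is a minimal NumPy sketch of this recipe on synthetic data; the variable names X, Q, and k are just placeholders for the quantities above:

```python
import numpy as np

# Synthetic, centered data standing in for the N x d design matrix X
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))       # N = 100, d = 5
X = X - X.mean(axis=0)              # PCA assumes centered features
k = 2                               # target dimensionality

# Eigendecomposition X^T X = Q D Q^T; eigh returns eigenvalues in
# ascending order, so reorder them to be non-increasing as in the slides
eigvals, Q = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1]
eigvals, Q = eigvals[order], Q[:, order]

Q_1k = Q[:, :k]                     # first k columns of Q (Q1:k)
Z = X @ Q_1k                        # k-dimensional representation XQ1:k
print(Z.shape)                      # (100, 2)
```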
Important: the singular value decomposition
Computing the eigendecomposition of a d × d matrix takes O(d³) time
Often, a better approach is to use the singular value decomposition (SVD), which is valid for any matrix
The SVD of our N × d design matrix is X = UΣV⊤, where U contains the eigenvectors of XX⊤, V contains the eigenvectors of X⊤X, U and V are both orthogonal, and Σ contains the square roots of the eigenvalues of both XX⊤ and X⊤X (they’re the same) along the diagonal
V is the same as Q from before! So we can use it in exactly the same way
Time complexity: O(Nd² + N²d), and practically this can be further improved through truncation (only computing the k components that we care about)
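Here is a hedged sketch of the same projection computed via the SVD, again on synthetic data; np.linalg.svd computes the full decomposition, so the truncated/randomized variants mentioned above are the practical choice for large matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))       # N x d design matrix
X = X - X.mean(axis=0)              # centered, as before
k = 2

U, S, Vt = np.linalg.svd(X, full_matrices=False)   # X = U Σ V^T
V = Vt.T                                           # V plays the role of Q
Z = X @ V[:, :k]                                   # k-dim representation (up to sign flips)

# Sanity check: squared singular values match the eigenvalues of X^T X
eigvals = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]
print(np.allclose(S**2, eigvals))                  # True
```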
Choosing k
In practice, how is the number of principal components, k, chosen?
We can inspect a quantity known as explained variance, which is just how
much of the overall variance we capture with the first k principal components
This is calculated as EV(k) = (∑ᵢ₌₁ᵏ Dᵢᵢ) / (∑ⱼ₌₁ᵈ Dⱼⱼ), i.e., the sum of the top k eigenvalues divided by the sum of all d eigenvalues
We may evaluate this on a validation set
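For concreteness, a small sketch (with made-up data and a made-up 90% threshold) of computing EV(k) from the eigenvalues and picking the smallest k that exceeds the threshold:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X = X - X.mean(axis=0)

# Eigenvalues of X^T X in non-increasing order (the diagonal of D)
eigvals = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]
ev = np.cumsum(eigvals) / eigvals.sum()       # ev[k-1] = EV(k)

# e.g., choose the smallest k with at least 90% explained variance
k = int(np.searchsorted(ev, 0.90)) + 1
print(ev.round(3), k)
```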
PCA and reconstruction
Another view we can take on PCA is that of reconstruction: reducing the dimensionality of X and then attempting to reconstruct X from the reduced representation
• How do we reconstruct X from XQ1:k?
• “Expand” back to the original basis by right multiplying by Q⊤1:k, i.e., X ≈ XQ1:kQ⊤1:k
• Remember that Q is orthogonal, so equality holds when using the full Q
PCA can also be understood as finding the linear projection that minimizes reconstruction error! Specifically, Q1:k minimizes ∥X − XWW⊤∥F out of all d × k matrices W
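A short sketch of this reconstruction view on synthetic data, showing that the error with Q1:k is nonzero while the full Q reconstructs X exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)
k = 2

eigvals, Q = np.linalg.eigh(X.T @ X)
Q = Q[:, np.argsort(eigvals)[::-1]]           # columns ordered by decreasing eigenvalue
Q_1k = Q[:, :k]

X_hat = X @ Q_1k @ Q_1k.T                     # X ≈ X Q1:k Q1:k^T
err_k = np.linalg.norm(X - X_hat, "fro")      # reconstruction error with k components
err_full = np.linalg.norm(X - X @ Q @ Q.T, "fro")
print(err_k, err_full)                        # err_full is numerically ~0
```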
Example: eigenfaces
PCA: the tradeoffs
PCA has enjoyed considerable success in applications involving data analysis and preprocessing for machine learning
PCA is well motivated, analytically tractable, and can be solved efficiently
But this is because PCA is restricted to learning linear projections…
Let’s briefly discuss some nonlinear methods for dimensionality reduction
Briefly: kernel PCA
We won’t go into detail, but it turns out that PCA can be kernelized
Why should we do this? A linear projection may not be suitable in the original feature representation but may work well in a higher dimensional, more expressive featurization
Sound familiar? This is one example of how, as alluded to last time, dimensionality reduction and dimensionality “lifting” (e.g., kernelization) do not have to be mutually exclusive!
One downside is that we still have to pick an appropriate kernel
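Assuming scikit-learn is available, here is an illustrative kernel PCA sketch on a toy dataset where a linear projection is unhelpful; the RBF kernel and gamma value are arbitrary choices, not recommendations from the lecture:

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two concentric circles: no 1-D linear projection separates them
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

Z_linear = PCA(n_components=2).fit_transform(X)
Z_kernel = KernelPCA(n_components=2, kernel="rbf", gamma=10.0).fit_transform(X)

# The first kernel PCA component tends to separate the two rings,
# while the linear components are just a rotation of the original data
print(Z_linear.shape, Z_kernel.shape)
```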
Briefly: autoencoders
The interpretation of PCA as minimizing reconstruction error also naturally provides us with inspiration for how to do nonlinear dimensionality reduction
• Instead of reducing dimensionality (“encoding”) with Q1:k and reconstructing (“decoding”) with Q⊤1:k, make these learnable nonlinear functions
• I.e., define an encoder with parameters θ that takes a d-dimensional data point and outputs a k-dimensional representation
• And define a decoder with parameters φ that takes a k-dimensional representation and outputs a d-dimensional reconstruction
Briefly: autoencoders
Usually, both the encoder and decoder are neural networks (more on this next week)
The encoder and decoder share the same objective, which is to minimize the reconstruction loss ∥X − dφ(eθ(X))∥F (or something similar to this)
Optimization is usually gradient based and usually stochastic
There are many flavors of autoencoders, e.g., denoising autoencoders, variational
autoencoders (covered in CS 182), etc., but these are out of scope
We will discuss in a bit more detail, when we talk about neural networks, how they can
be used for dimensionality reduction even without the explicit reconstruction loss
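To make the encode/decode picture concrete, here is a minimal PyTorch sketch; the architecture, layer sizes, and training details are illustrative assumptions, not the lecture’s prescription:

```python
import torch
import torch.nn as nn

d, k = 20, 2
encoder = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, k))   # e_theta
decoder = nn.Sequential(nn.Linear(k, 64), nn.ReLU(), nn.Linear(64, d))   # d_phi

X = torch.randn(500, d)              # stand-in for a centered design matrix
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

for step in range(200):              # stochastic, gradient-based optimization
    idx = torch.randint(0, X.shape[0], (64,))    # random minibatch
    batch = X[idx]
    recon = decoder(encoder(batch))
    loss = ((batch - recon) ** 2).mean()         # (squared) reconstruction loss
    opt.zero_grad()
    loss.backward()
    opt.step()

Z = encoder(X).detach()              # learned k-dimensional representation
print(Z.shape)                       # torch.Size([500, 2])
```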
Briefly: neighbor embedding methods
Neighbor embedding methods try to learn low dimensional representations of the data where data points that are close to each other remain close
The most common approach, t-distributed stochastic neighbor embedding (t-SNE), is a popular tool for data visualization but is not widely used otherwise
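For example, a t-SNE visualization can be produced in a few lines with scikit-learn (the dataset and hyperparameters here are placeholders):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)           # 1797 points, 64 dimensions
Z = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X)
print(Z.shape)                                # (1797, 2); usually scatter-plotted, colored by y
```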