
Lecture 24: Principal component analysis
CS 189 (CDSS offering)
2022/03/28


Today’s lecture
• This week, we switch our focus from supervised to unsupervised learning
• Generally speaking, rather than having data (x1, y1), …, (xN, yN), our data is
instead of the form x1, …, xN
• Sometimes, this is all we can get for certain real world applications
• The objective in this setting is much less well defined, but roughly, we may still be able to learn about the structure of the data and analyze it in interesting ways
• We will discuss two directions of analysis this week: reducing the dimensionality of the data (today and Wednesday) and clustering the data (Friday)

Featurization, in general
• Recall the question from the end of lecture 1: machine learning is (often) about predicting y from x, but what is x?
• One way we looked at this question was through the lens of kernelization, where we “lifted” the data into a higher dimensional featurization such that linear models could be effective even for data with nonlinear relationships
• Today, we are going in the opposite direction: going from high dimensional data to a lower dimensional featurization! This is dimensionality reduction
• Funny as it seems, these two ideas are not mutually exclusive: we can perform dimensionality reduction after kernelization, and vice versa

A motivating picture for dimensionality reduction
• What dimensionality is this data?
• We have plotted it as 3-dimensional; however, pretty much all of the information about the data can be retained in a 2-dimensional space
• Note, however, that we aren’t simply doing feature selection! The 2-dimensional representation of the data is not just two of the original data dimensions

Dimensionality reduction
• For certain applications, the data may naturally come in one representation (e.g., pixels of an image), but there are other representations of the data that are lower dimensional (e.g., think about the space of possible natural images)
• We call the original data dimensionality the ambient dimensionality and the possible lower dimensionality the intrinsic dimensionality
• As an example we’ll see later, let’s consider 32 × 32 grayscale images of faces (an ambient dimensionality of 1024)
• Might the intrinsic dimensionality be lower?

But why do we care?
• What is useful about getting closer to the intrinsic dimensionality of the data?
• We may wish to better understand the problem and the properties of the data
• We may wish to make it easier to visualize the data (e.g., reduce to 2 dims)
• We may wish to speed up supervised learning, since the runtime of many methods depends on the data dimensionality
• We may wish to get rid of noise in the data
• And so on, and so forth…

Principal component analysis
• Let’s first study the most canonical approach to dimensionality reduction, called
principal component analysis (PCA)
• The idea behind PCA is that we wish to figure out a linear projection of our data that will result in a lower dimensional data representation
• That is, if X is our N × d design matrix, we wish to construct a d × k matrix that we can right-multiply X by to get a k-dimensional data representation
• What is the “best” linear projection?

PCA, the intuition
• (Figure not reproduced: data points and candidate directions to project onto)
• More information is retained by maximizing the variance of the projected data
• Not pictured: we center the data first

The first principal component, the math
• We want the projection direction that maximizes the variance of the (centered) projected data:
  w1 = arg max_{||w||_2 = 1} ||Xw||_2^2 = arg max_{||w||_2 = 1} w^T X^T X w
• Take the spectral decomposition X^T X = QDQ^T, where Q is orthogonal and D is diagonal
• Change of variables: let z = Q^T w; then ||w||_2 = 1 implies ||z||_2 = 1, and the objective becomes
  arg max_{||z||_2 = 1} z^T D z = arg max_{||z||_2 = 1} sum_i D_ii z_i^2
• This is maximized by putting all of the weight on the largest eigenvalue, i.e., z = e1, which gives w1 = Qe1 = q1, the eigenvector corresponding to the largest eigenvalue of X^T X (sketched in code below)
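As a quick numerical check of this derivation (not part of the original slides; the toy data and variable names below are my own), the following NumPy sketch computes w1 from the eigendecomposition of X^T X and verifies that no random unit vector achieves a larger projected variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: N points in d dimensions, with different scales per axis.
N, d = 500, 3
X = rng.normal(size=(N, d)) @ np.diag([3.0, 1.0, 0.3])
X = X - X.mean(axis=0)  # center the data first

# Spectral decomposition of X^T X (symmetric, so eigh applies);
# eigh returns eigenvalues in ascending order.
eigvals, Q = np.linalg.eigh(X.T @ X)
w1 = Q[:, -1]  # eigenvector of the largest eigenvalue = first principal component

# No random unit vector should beat w1's projected variance ||Xw||^2.
best = np.linalg.norm(X @ w1) ** 2
for _ in range(1000):
    w = rng.normal(size=d)
    w /= np.linalg.norm(w)
    assert np.linalg.norm(X @ w) ** 2 <= best + 1e-8
```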

Extending the idea to more dimensions
• The first principal component of X is given by w1 = arg max_{||w||_2 = 1} ||Xw||_2^2, and we saw that w1 is the eigenvector corresponding to the largest eigenvalue of X^T X
• What if we want more than one dimension?
• We can compute the second principal component! Same objective, but w2 is
further constrained to be orthogonal to w1
• It turns out, w2 is the eigenvector corresponding to the second largest eigenvalue (checked numerically in the sketch below)
• w1, w2, …, wk form an orthonormal basis for our k-dimensional data representation
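The claim about w2 can be verified the same way; this small sketch (my own, not the lecture’s code) samples random unit vectors orthogonal to w1 and confirms that none beats the second eigenvector:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4)) @ np.diag([4.0, 2.0, 1.0, 0.5])
X = X - X.mean(axis=0)

eigvals, Q = np.linalg.eigh(X.T @ X)  # ascending eigenvalues
w1, w2 = Q[:, -1], Q[:, -2]           # top two eigenvectors

# Among unit vectors orthogonal to w1, w2 should maximize ||Xw||^2.
best = np.linalg.norm(X @ w2) ** 2
for _ in range(1000):
    w = rng.normal(size=4)
    w -= (w @ w1) * w1                # remove the w1 component
    w /= np.linalg.norm(w)
    assert np.linalg.norm(X @ w) ** 2 <= best + 1e-8
```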

Summarizing PCA
• What is the “best” linear projection? PCA says: the one that retains the most information about the data in the form of variance
• We compute the d × k matrix via the eigendecomposition of X^T X
• Writing X^T X = QDQ^T, where D is the diagonal matrix of eigenvalues in non-increasing order, and Q is the orthogonal matrix of corresponding eigenvectors
• The d × k matrix we need is the first k columns of Q (write this as Q1:k)
• XQ1:k thus gives a k-dimensional data representation (see the code sketch below)
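Putting the recipe together, here is a minimal sketch in NumPy; the function name pca_project and its interface are my own, not something defined in the lecture:

```python
import numpy as np

def pca_project(X, k):
    """Project the N x d data matrix X onto its top-k principal components.

    Returns (Z, Q_k): the N x k representation X @ Q_{1:k} and the d x k matrix Q_{1:k}.
    """
    Xc = X - X.mean(axis=0)                 # center the data
    eigvals, Q = np.linalg.eigh(Xc.T @ Xc)  # X^T X = Q D Q^T, ascending order
    order = np.argsort(eigvals)[::-1]       # reorder to non-increasing eigenvalues
    Q_k = Q[:, order[:k]]                   # first k columns of (reordered) Q
    return Xc @ Q_k, Q_k

# Example usage on random data.
X = np.random.default_rng(2).normal(size=(100, 10))
Z, Q_k = pca_project(X, k=2)
print(Z.shape, Q_k.shape)  # (100, 2) (10, 2)
```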

PCA and reconstruction
• Another view we can take on PCA is that of reconstruction: reducing the dimensionality of X and then attempting to reconstruct X from the reduced representation
• How do we reconstruct X from XQ1:k? We “expand” back to the original basis by right-multiplying by Q1:k^T, i.e., X ≈ XQ1:kQ1:k^T
• Remember that Q is orthogonal, so equality holds when using the full Q
• PCA can also be understood as finding the linear projection that minimizes reconstruction error! Specifically, Q1:k minimizes ||X − XWW^T||_F out of all d × k matrices W (illustrated in the sketch below)
• A generalization of this idea to nonlinear projections is called autoencoders
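To illustrate the reconstruction view, the sketch below (my own, not from the lecture) reconstructs centered data as XQ1:kQ1:k^T and shows the Frobenius error shrinking to (numerically) zero as k reaches d:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 20))
Xc = X - X.mean(axis=0)

eigvals, Q = np.linalg.eigh(Xc.T @ Xc)
Q = Q[:, ::-1]  # columns ordered by non-increasing eigenvalue

for k in (1, 5, 10, 20):
    Q_k = Q[:, :k]
    X_hat = Xc @ Q_k @ Q_k.T  # X reconstructed as X Q_{1:k} Q_{1:k}^T
    err = np.linalg.norm(Xc - X_hat, ord="fro")
    print(f"k = {k:2d}, reconstruction error = {err:.4f}")
```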

Final example: eigenfaces
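The face images from the lecture are not reproduced in these notes. As a stand-in, here is a sketch using scikit-learn’s Olivetti faces dataset (64 × 64 grayscale, so the ambient dimensionality is 4096 rather than the 1024 mentioned earlier); it assumes scikit-learn is installed and simply reshapes the top principal directions back into images, which is what “eigenfaces” are:

```python
import numpy as np
from sklearn.datasets import fetch_olivetti_faces

faces = fetch_olivetti_faces()  # downloads on first use; 400 images of 64 x 64 pixels
X = faces.data                  # shape (400, 4096)
Xc = X - X.mean(axis=0)

eigvals, Q = np.linalg.eigh(Xc.T @ Xc)
Q = Q[:, ::-1]                  # non-increasing eigenvalue order

k = 16
eigenfaces = Q[:, :k].T.reshape(k, 64, 64)  # each principal direction viewed as an image
print(eigenfaces.shape)                     # (16, 64, 64)
```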
