
CSC 311: Introduction to Machine Learning
Lecture 9 – PCA, Matrix Completion, Autoencoders
Roger G., University of Toronto, Fall 2021


So far in this course: supervised learning
Today we start unsupervised learning
▶ No labels, so the purpose is to find patterns in data
▶ Need to specify what kind of patterns to look for
This week: dimensionality reduction
▶ Linear dimensionality reduction (Principal Component Analysis)
▶ Matrix completion (needed for the project) is closely related to PCA.
▶ Nonlinear dimensionality reduction (autoencoders)
Week 11: clustering

Motivating Examples
Energy disaggregation
Kolter and Johnson, “REDD: A public data set for energy disaggregation research”

Motivating Examples
Modeling the change in scientific topics over time
Griffiths and Steyvers, “Finding scientific topics”


Motivating Examples
The models for those tasks are fairly complicated. In this course, we’ll focus on two simpler instances of unsupervised learning:
Clustering
Dimensionality Reduction

Linear Dimensionality Reduction
We’ll start with a simpler form of dimensionality reduction: linear dimensionality reduction
Example: suppose you’re a psychologist interested in modeling the variation in human personality
▶ You’ve asked lots of participants to take a survey with lots of personality questions.
▶ By figuring out which questions are highly correlated with each other, you can uncover the main factors describing human personality.
A linear dimensionality reduction model called factor analysis found five key personality traits called the Big Five:
▶ extraversion, agreeableness, openness to experience,
conscientiousness, neuroticism
In this lecture, we’ll consider a different but closely related model
called Principal Component Analysis (PCA).

PCA: Overview
Principal Component Analysis (PCA) is our first unsupervised learning algorithm, and an example of linear dimensionality reduction.
Dimensionality reduction: map data to a lower dimensional space
▶ Save computation/memory
▶ Reduce overfitting, achieve better generalization
▶ Visualize in 2 dimensions
Since PCA is a linear model, this mapping will be a projection.
Image credit: Elements of Statistical Learning

Euclidean projection
Projection onto a 1-D subspace
Subspace S is the line along the unit vector u
▶ {u} is a basis for S: any point in S can be written as zu for some scalar z
Projection of x onto S is denoted by Proj_S(x)
Recall: x⊤u = ∥x∥ ∥u∥ cos(θ) = ∥x∥ cos(θ), since ∥u∥ = 1
Proj_S(x) = (x⊤u) · u = ∥x̃∥ u, where x⊤u = ∥x̃∥ is the length of the projection and u is its direction
(From the figure: ∥x̃∥ = ∥x∥ cos(θ))
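A quick sanity check of this formula, as a minimal NumPy sketch (the vectors x and u below are made-up illustrative values):

import numpy as np

x = np.array([3.0, 4.0])       # an arbitrary example point (illustrative values)
u = np.array([1.0, 0.0])       # a unit vector spanning the 1-D subspace S

length = x @ u                 # x^T u = ||x~||, the (signed) length of the projection
proj = length * u              # Proj_S(x) = (x^T u) u

assert np.isclose((x - proj) @ u, 0.0)   # the residual x - proj is orthogonal to u
print(proj)                              # [3. 0.]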

General subspaces
How to project onto a K-dimensional subspace?
▶ Idea: choose an orthonormal basis {u1, u2, ⋯, uK } for S (i.e. all
unit vectors and orthogonal to each other)
▶ Project onto each unit vector individually (as in previous slide), and
sum together the projections. Mathematically, the projection is given as:
Proj_S(x) = ∑_{i=1}^K z_i u_i   where z_i = x⊤u_i.
In vector form:
Proj_S(x) = Uz   where z = U⊤x   (the columns of U are u_1, …, u_K)

Projection onto a Subspace
So far, we assumed the subspace passes through 0.
In mathematical terminology, the “subspaces” we want to project
onto are really affine spaces, and can have an arbitrary origin μˆ.
z = U⊤(x − μ̂),   x̃ = μ̂ + Uz
In machine learning, x̃ is also called the reconstruction of x, and z is its representation, or code.
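A small NumPy sketch of the projection and reconstruction above. The basis U here is an arbitrary orthonormal basis built via QR from random numbers, purely for illustration (it is not yet the PCA subspace), and the data is synthetic:

import numpy as np

rng = np.random.default_rng(0)
N, D, K = 100, 5, 2

X = rng.normal(size=(N, D))                   # made-up data, one row per point
mu = X.mean(axis=0)                           # origin of the affine subspace

U, _ = np.linalg.qr(rng.normal(size=(D, K)))  # some D x K orthonormal basis

Z = (X - mu) @ U                              # codes: z = U^T (x - mu)
X_tilde = mu + Z @ U.T                        # reconstructions: x~ = mu + U z

print(Z.shape, X_tilde.shape)                 # (100, 2) (100, 5)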

Projection onto a Subspace
If we have a K-dimensional subspace in a D-dimensional input space, then x ∈ RD and z ∈ RK.
If the data points x all lie close to their reconstructions, then we can approximate distances, etc. in terms of these same operations on the code vectors z.
If K ≪ D, then it’s much cheaper to work with z than x.
A mapping to a space that’s easier to manipulate or visualize is called a representation, and learning such a mapping is representation learning.
Mapping data to a low-dimensional space is called dimensionality reduction.

Learning a Subspace
How to choose a good subspace S?
▶ Origin μˆ is the empirical mean of the data
▶ Need to choose a D × K matrix U with orthonormal columns.
Two criteria:
▶ Minimize the reconstruction error:
  min (1/N) ∑_{i=1}^N ∥x(i) − x̃(i)∥²
▶ Maximize the variance of reconstructions: Find a subspace where data has the most variability.
  max (1/N) ∑_i ∥x̃(i) − μ̂∥²
▶ Note: The data and its reconstruction have the same means (exercise)!

Learning a Subspace
These two criteria are equivalent! I.e., we’ll show
(1/N) ∑_{i=1}^N ∥x(i) − x̃(i)∥² = const − (1/N) ∑_i ∥x̃(i) − μ̂∥²
Recall x̃(i) = μ̂ + Uz(i) and z(i) = U⊤(x(i) − μ̂).
Observation 1: Because the columns of U are orthonormal, U⊤U = I, so
∥x̃ − μ̂∥² = ∥Uz∥² = z⊤U⊤Uz = z⊤z = ∥z∥².
⟹ the norm of the centered reconstruction equals the norm of the representation. (If you draw it, this is obvious.)

Pythagorean Theorem
Observation 1: ∥x̃(i) − μ̂∥² = ∥z(i)∥²
▶ Variance of reconstructions is equal to variance of code vectors:
  (1/N) ∑_i ∥x̃(i) − μ̂∥² = (1/N) ∑_i ∥z(i)∥²   (exercise: (1/N) ∑_i z(i) = 0)
Observation 2: orthogonality of x̃(i) − μ̂ and x̃(i) − x(i)
(Two vectors a, b are orthogonal ⟺ a⊤b = 0)
Recall x̃(i) = μ̂ + UU⊤(x(i) − μ̂). Then
(x̃(i) − μ̂)⊤(x̃(i) − x(i))
= (x(i) − μ̂)⊤UU⊤((μ̂ − x(i)) + UU⊤(x(i) − μ̂))
= (x(i) − μ̂)⊤UU⊤(μ̂ − x(i)) + (x(i) − μ̂)⊤UU⊤(x(i) − μ̂)   (using U⊤U = I)
= 0

Pythagorean Theorem
The Pythagorean Theorem, together with Observation 2, tells us that for each i:
∥x̃(i) − μ̂∥² + ∥x(i) − x̃(i)∥² = ∥x(i) − μ̂∥²
By averaging over the data, we obtain
(1/N) ∑_{i=1}^N ∥x̃(i) − μ̂∥² (projected variance) + (1/N) ∑_{i=1}^N ∥x(i) − x̃(i)∥² (reconstruction error) = (1/N) ∑_{i=1}^N ∥x(i) − μ̂∥² (a constant, independent of the subspace)
Therefore,
projected variance = constant − reconstruction error
Maximizing the variance is equivalent to minimizing the reconstruction error!
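This equivalence is easy to check numerically. A sketch using the same kind of synthetic setup as before (random data and a random orthonormal U, so not necessarily the optimal subspace), verifying that projected variance plus reconstruction error equals the constant total variance:

import numpy as np

rng = np.random.default_rng(1)
N, D, K = 200, 6, 3
X = rng.normal(size=(N, D))                      # synthetic data
mu = X.mean(axis=0)
U, _ = np.linalg.qr(rng.normal(size=(D, K)))     # any orthonormal basis

Z = (X - mu) @ U
X_tilde = mu + Z @ U.T

proj_var = np.mean(np.sum((X_tilde - mu) ** 2, axis=1))   # projected variance
recon_err = np.mean(np.sum((X - X_tilde) ** 2, axis=1))   # reconstruction error
total_var = np.mean(np.sum((X - mu) ** 2, axis=1))        # constant, independent of U

assert np.isclose(proj_var + recon_err, total_var)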

Principal Component Analysis
Choosing a subspace to maximize the projected variance, or minimize the reconstruction error, is called principal component analysis (PCA).
Consider the empirical covariance matrix
Σ̂ = (1/N) ∑_{i=1}^N (x(i) − μ̂)(x(i) − μ̂)⊤
Recall: Σ̂ is symmetric and positive semidefinite.
The optimal PCA subspace is spanned by the top K eigenvectors of Σ̂.
▶ More precisely, choose the first K of any orthonormal eigenbasis for Σ̂.
▶ The general case is tricky, but we’ll show this for K = 1.
These eigenvectors are called principal components, analogous to the
principal axes of an ellipse.
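A minimal sketch of PCA as described above: form the empirical covariance matrix, take its top-K eigenvectors as the columns of U, and project. The variable names and synthetic data are my own; np.linalg.eigh returns eigenvalues in ascending order, so the columns are reversed before truncating.

import numpy as np

def pca(X, K):
    """Return the mean, the top-K principal components (D x K), and the codes (N x K)."""
    mu = X.mean(axis=0)
    Xc = X - mu
    Sigma_hat = Xc.T @ Xc / X.shape[0]        # empirical covariance, D x D
    evals, evecs = np.linalg.eigh(Sigma_hat)  # ascending eigenvalues, orthonormal eigenvectors
    U = evecs[:, ::-1][:, :K]                 # top-K eigenvectors as columns
    Z = Xc @ U                                # code vectors, one row per data point
    return mu, U, Z

# illustrative use on synthetic, correlated data
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))
mu, U, Z = pca(X, K=2)
X_tilde = mu + Z @ U.T                        # reconstructions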

Supplement: Deriving PCA
For K = 1, we are fitting a unit vector u, and the code is a scalar z(i) = u⊤(x(i) − μˆ). Let’s maximize the projected variance. From observation 1, we have
(1/N) ∑_i ∥x̃(i) − μ̂∥² = (1/N) ∑_i [z(i)]² = (1/N) ∑_i (u⊤(x(i) − μ̂))²
= (1/N) ∑_i u⊤(x(i) − μ̂)(x(i) − μ̂)⊤u    (using (a⊤b)² = a⊤b b⊤a)
= u⊤ [ (1/N) ∑_{i=1}^N (x(i) − μ̂)(x(i) − μ̂)⊤ ] u
= u⊤ Σ̂ u = u⊤QΛQ⊤u    (spectral decomposition Σ̂ = QΛQ⊤)
= a⊤Λa = ∑_{j=1}^D λ_j a_j²    for a = Q⊤u

Supplement: Deriving PCA
Maximize a⊤Λa = ∑_{j=1}^D λ_j a_j² for a = Q⊤u.
▶ This is a change-of-basis to the eigenbasis of Σ̂.
Assume the λ_j are in sorted order, λ_1 ≥ λ_2 ≥ …
Observation: since u is a unit vector and Q is orthogonal, a is also a unit vector: a⊤a = u⊤QQ⊤u = u⊤u = 1, i.e., ∑_j a_j² = 1.
By inspection, set a_1 = ±1 and a_j = 0 for j ≠ 1. Hence, u = Qa = q_1 (the top eigenvector).
A similar argument shows that the kth principal component is the kth eigenvector of Σ̂.

Decorrelation
Interesting fact: the dimensions of z are decorrelated. For now, let Cov denote the empirical covariance.
Cov(z) = Cov(U⊤(x − μ)) = U⊤ Cov(x) U
       = U⊤ QΛQ⊤ U    ▷ spectral decomposition
       = (I 0) Λ (I 0)⊤    ▷ by orthogonality, since the columns of U are the first K columns of Q
       = top left K × K block of Λ
If the covariance matrix is diagonal, this means the features are
uncorrelated.
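A short numerical check of this fact on synthetic data (a standalone sketch repeating the eigendecomposition route from the earlier example): the empirical covariance of the PCA codes comes out diagonal, with the top eigenvalues on the diagonal.

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 6)) @ rng.normal(size=(6, 6))   # synthetic correlated data
Xc = X - X.mean(axis=0)

Sigma_hat = Xc.T @ Xc / X.shape[0]
evals, evecs = np.linalg.eigh(Sigma_hat)
U = evecs[:, ::-1][:, :3]                 # top-3 principal components

Z = Xc @ U                                # PCA codes (zero mean)
Cov_z = Z.T @ Z / Z.shape[0]              # empirical covariance of the codes

print(np.round(Cov_z, 6))                 # off-diagonal entries are ~0
print(evals[::-1][:3])                    # matches the diagonal of Cov_z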

Dimensionality reduction aims to find a low-dimensional representation of the data.
PCA projects the data onto a subspace which maximizes the projected variance, or equivalently, minimizes the reconstruction error.
The optimal subspace is given by the top eigenvectors of the empirical covariance matrix.
PCA gives a set of decorrelated features.

Applying PCA to faces
Consider running PCA on 2429 19×19 grayscale images (CBCL data)
Can get good reconstructions with only 3 components
PCA for pre-processing: can apply classifier to latent representation
▶ Original data is 361 dimensional
▶ For face recognition PCA with 3 components obtains 79% accuracy
on face/non-face discrimination on test data vs. 76.8% for a Gaussian mixture model (GMM) with 84 states. (We’ll cover GMMs later in the course.)
Can also be good for visualization
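A hedged sketch of the pre-processing idea, not a reproduction of the experiment above: project the inputs onto a few principal components, then train a classifier on the codes. The load_face_patches() helper is a hypothetical placeholder (it just returns random arrays standing in for the CBCL data), and scikit-learn's PCA and LogisticRegression are used for brevity.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def load_face_patches():
    # Hypothetical loader: X is (num_images, 361) flattened 19x19 patches, y is face/non-face.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(2429, 361))
    y = rng.integers(0, 2, size=2429)
    return X, y

X, y = load_face_patches()
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = make_pipeline(PCA(n_components=3), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))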

Applying PCA to faces: Learned basis
Principal components of face images (“eigenfaces”)

Applying PCA to digits

Two more interpretations of PCA, which have interesting generalizations.
1. Matrix factorization
2. Autoencoder

Some recommender systems in action
Ideally recommendations should combine global and seasonal interests, look at your history if available, should adapt with time, be coherent and diverse, etc.


The Netflix problem
Movie recommendation: Users watch movies and rate them out of 5⭑.
[Table: example user–movie ratings (1–5 stars) for movies including Frozen, Chained, Bambi, Titanic, Goodfellas, Dumbo, and Twilight]
Because users only rate a few items, one would like to infer their preference for unrated items

Netflix Prize

PCA as Matrix Factorization
Recall PCA: each input vector x(i) ∈ R^D is approximated as
x(i) ≈ x̃(i) = μ̂ + Uz(i),
where μ̂ = (1/N) ∑_i x(i) is the data mean, U ∈ R^{D×K} is the orthonormal basis for the principal subspace, z(i) ∈ R^K is the code vector, and x̃(i) ∈ R^D is x(i)'s reconstruction or approximation.
Assume for simplicity that the data is centered: μ̂ = 0. Then the approximation looks like
x(i) ≈ x̃(i) = Uz(i).

PCA as Matrix Factorization
PCA (on centered data): input vector x(i) is approximated as Uz(i):
x(i) ≈ Uz(i)
Writing this in matrix form, we have X ≈ ZU⊤, where X and Z are matrices with one row per data point:
X = [ [x(1)]⊤ ; [x(2)]⊤ ; ⋮ ; [x(N)]⊤ ] ∈ R^{N×D}   and   Z = [ [z(1)]⊤ ; [z(2)]⊤ ; ⋮ ; [z(N)]⊤ ] ∈ R^{N×K}   (rows stacked vertically)
We can write the squared reconstruction error as
∑_{i=1}^N ∥x(i) − Uz(i)∥² = ∥X − ZU⊤∥²_F ,
where ∥·∥_F denotes the Frobenius norm:
∥Y∥²_F = ∥Y⊤∥²_F = ∑_{i,j} y_{ij}² = ∑_i ∥y(i)∥².

PCA as Matrix Factorization
So PCA is approximating X ≈ ZU⊤, or equivalently X⊤ ≈ UZ⊤.
Based on the sizes of the matrices, this is a rank-K approximation. Since U was chosen to minimize reconstruction error, this is the optimal rank-K approximation, in terms of the error ∥X⊤ − UZ⊤∥²_F .

Supplement: Singular-Value Decomposition (SVD)
This has a close relationship to the Singular Value Decomposition (SVD) of X, which is a matrix factorization technique. Consider an N × D matrix X ∈ R^{N×D} with SVD
X = QSU⊤
Q, S, and U⊤ provide a real-valued matrix factorization of X.
Properties:
Q is an N × D matrix with orthonormal columns, Q⊤Q = ID,
where ID is the D × D identity matrix.
U is an orthonormal D × D matrix, U⊤ = U−1.
S is a D × D diagonal matrix, with non-negative singular values, s1,s2,…,sD, on the diagonal, where the singular values are conventionally ordered from largest to smallest.
Note that standard SVD notation is X = UDV⊤. We are using X = QSU⊤
for notational convenience.
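A brief NumPy sketch of this connection on synthetic data, following the slide's X = QSU⊤ convention: the right singular vectors of the centered data matrix are the eigenvectors of Σ̂ (with eigenvalues s²/N), and truncating the SVD gives a rank-K approximation.

import numpy as np

rng = np.random.default_rng(4)
N, D, K = 300, 8, 2
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))    # synthetic data
Xc = X - X.mean(axis=0)                                   # centered

Q, s, Ut = np.linalg.svd(Xc, full_matrices=False)         # Xc = Q S U^T (Ut holds U^T)
U = Ut.T[:, :K]                                           # top-K right singular vectors

# The eigenvalues of Sigma_hat = Xc^T Xc / N are s^2 / N, in the same order.
Sigma_hat = Xc.T @ Xc / N
top_evals = np.linalg.eigvalsh(Sigma_hat)[::-1][:K]
assert np.allclose(top_evals, s[:K] ** 2 / N)

# Truncating the SVD gives the best rank-K approximation of Xc (in Frobenius norm).
Xc_rank_K = Q[:, :K] @ np.diag(s[:K]) @ Ut[:K, :]
print(np.linalg.norm(Xc - Xc_rank_K, "fro") ** 2)         # minimal reconstruction error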

Matrix Completion
We just saw that PCA gives the optimal low-rank matrix factorization to a matrix X.
Can we generalize this to the case where X is only partially observed?
▶ A sparse 1000 × 1000 matrix with 50,000 observations (only 5% observed).
▶ A rank 5 approximation requires only 10,000 parameters, so it’s reasonable to fit this.
▶ Unfortunately, no closed form solution.

The Netflix problem
Movie recommendation: Users watch movies and rate them as good or bad.
[Table: example user–movie ratings (1–5 stars) for movies including Frozen, Chained, Bambi, Titanic, Goodfellas, Dumbo, and Twilight]
Because users only rate a few items, one would like to infer their
preference for unrated items

Matrix Completion
Matrix completion problem: Transform the table into an N users by M movies matrix R
[Figure: sparse N × M rating matrix R — rows are users (e.g. Ninja, Cat, Angel, Neutral), columns are movies (Frozen, Chained, Bambi, Titanic, Goodfellas, Dumbo, Twilight); a few entries contain observed star ratings and most are missing, shown as "?"]
Data: Users rate some movies, R_{user,movie}. Very sparse
Task: Predict missing entries, i.e. how a user would rate a movie they haven’t previously rated
Evaluation Metric: Squared error (used by Netflix Competition). Is this a reasonable metric?

Matrix Completion
In our current setting, latent factor models attempt to explain the ratings by characterizing both movies and users on a number of factors K inferred from the ratings patterns.
That is, we seek representations for movies and users as vectors in RK that can ultimately be translated to ratings.
For simplicity, we can associate these factors (i.e. the dimensions of the vectors) with idealized concepts, e.g. how much a movie is a comedy, a drama, or a horror film
▶ But also uninterpretable dimensions
Can we use the sparse ratings matrix R to find these latent factors automatically?

Matrix Completion
Let the representation of user i in the K-dimensional space be ui and the representation of movie j be zj
▶ Intuition: maybe the first entry of ui says how much the user likes horror films, and the first entry of zj says how much movie j is a horror film.
Assume the rating user i gives to movie j is given by a dot product: Rij ≈ u⊤i zj
In matrix form, if:
U = [ —u_1⊤— ; ⋮ ; —u_N⊤— ] (one row per user)   and   Z⊤ = [ z_1 ⋯ z_M ] (one column per movie)
then: R ≈ UZ⊤
This is a matrix factorization problem!

Matrix Completion
Recall PCA: To enforce X⊤ ≈ UZ⊤, we minimized
min_{U,Z} ∥X⊤ − UZ⊤∥²_F = ∑_{i,j} (x_{ji} − u_i⊤ z_j)²
where ui and zi are the i-th rows of matrices U and Z,
respectively.
What’s different about the Netflix problem?
▶ Most entries are missing!
▶ We only want to count the error for the observed entries.

Matrix Completion
Let O = {(n, m) ∶ entry (n, m) of matrix R is observed}
Using the squared error loss, matrix completion requires solving
min_{U,Z} (1/2) ∑_{(i,j)∈O} (R_{ij} − u_i⊤ z_j)²
The objective is non-convex in U and Z jointly, and in fact it’s generally
NP-hard to minimize the above cost function exactly.
As a function of either U or Z individually, the problem is convex and easy to optimize. We can use coordinate descent, just like with K-means and mixture models!
Alternating Least Squares (ALS): fix Z and optimize U, then fix U and optimize Z, and so on until convergence.

Alternating Least Squares
Want to minimize the squared error cost with respect to the factor U. (The case of Z is exactly symmetric.)
We can decompose the cost into a sum of independent terms:
∑_{(i,j)∈O} (R_{ij} − u_i⊤ z_j)² = ∑_i [ ∑_{j : (i,j)∈O} (R_{ij} − u_i⊤ z_j)² ]
(each term in square brackets only depends on u_i)
This can be minimized independently for each u_i.
This is a linear regression problem in disguise. Its optimal solution is:
u_i = ( ∑_{j : (i,j)∈O} z_j z_j⊤ )^{−1} ∑_{j : (i,j)∈O} R_{ij} z_j

Alternating Least Squares
ALS for Matrix Completion problem
Initialize U and Z randomly
repeat until convergence:
    for i = 1, …, N do
        u_i ← ( ∑_{j : (i,j)∈O} z_j z_j⊤ )^{−1} ∑_{j : (i,j)∈O} R_{ij} z_j
    for j = 1, …, M do
        z_j ← ( ∑_{i : (i,j)∈O} u_i u_i⊤ )^{−1} ∑_{i : (i,j)∈O} R_{ij} u_i
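A compact NumPy sketch of the ALS updates above. The (i, j, r) triple format and the variable names are my own choices, and the small ridge term lam is an addition not on the slide, included only to keep the K × K systems well conditioned when a user or movie has very few ratings.

import numpy as np

def als(N, M, observed, K=5, lam=1e-3, num_iters=20, seed=0):
    """observed: list of (i, j, r) triples — user i gave movie j rating r."""
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(N, K))
    Z = rng.normal(scale=0.1, size=(M, K))

    by_user = {i: [] for i in range(N)}    # ratings grouped by user
    by_movie = {j: [] for j in range(M)}   # ratings grouped by movie
    for i, j, r in observed:
        by_user[i].append((j, r))
        by_movie[j].append((i, r))

    ridge = lam * np.eye(K)
    for _ in range(num_iters):
        for i, ratings in by_user.items():     # u_i <- (sum z_j z_j^T)^-1 sum R_ij z_j
            if ratings:
                Zi = Z[[j for j, _ in ratings]]
                ri = np.array([r for _, r in ratings])
                U[i] = np.linalg.solve(Zi.T @ Zi + ridge, Zi.T @ ri)
        for j, ratings in by_movie.items():    # z_j <- (sum u_i u_i^T)^-1 sum R_ij u_i
            if ratings:
                Uj = U[[i for i, _ in ratings]]
                rj = np.array([r for _, r in ratings])
                Z[j] = np.linalg.solve(Uj.T @ Uj + ridge, Uj.T @ rj)
    return U, Z

# Predicted rating for user i, movie j: U[i] @ Z[j]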

Two more interpretations of PCA, which have interesting generalizations.
1. Matrix factorization
2. Autoencoder

Autoencoders
An autoencoder is a feed-forward neural net whose job is to take an input x and predict x.
To make this non-trivial, we need to add a bottleneck layer whose dimension is much smaller than the input.

Linear Autoencoders
Why autoencoders?
Map high-dimensional data to two dimensions for visualization
Learn abstract features in an unsupervised way so you can apply them to a supervised task
▶ Unlabeled data can be much more plentiful than labeled data

Linear Autoencoders
The simplest kind of autoencoder has one hidden layer, linear activations, and squared error loss.
L ( x , x ̃ ) = ∥ x − x ̃ ∥ 2
This network computes x̃ = W2W1x, which is a linear function.
If K ≥ D, we can choose W2 and W1 such that W2W1 is the identity matrix. This isn't very interesting.
But suppose K < D:
▶ W1 maps x to a K-dimensional space, so it's doing dimensionality reduction.

Linear Autoencoders
Observe that the output of the autoencoder must lie in a K-dimensional subspace spanned by the columns of W2. This is because x̃ = W2z.
We saw that the best possible (minimum-error) K-dimensional linear subspace, in terms of reconstruction error, is the PCA subspace.
The autoencoder can achieve this by setting W1 = U⊤ and W2 = U.
Therefore, the optimal weights for a linear autoencoder are just the principal components!

Nonlinear Autoencoders
Deep nonlinear autoencoders learn to project the data, not onto a subspace, but onto a nonlinear manifold.
This manifold is the image of the decoder.
This is a kind of nonlinear dimensionality reduction.
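To make this concrete, here is a minimal autoencoder sketch in PyTorch. The layer sizes, learning rate, and random stand-in data are illustrative choices, not from the slides; removing the ReLU activations recovers the linear autoencoder discussed above.

import torch
from torch import nn

D, K = 784, 2                      # input dimension and bottleneck size (illustrative)

encoder = nn.Sequential(nn.Linear(D, 256), nn.ReLU(), nn.Linear(256, K))
decoder = nn.Sequential(nn.Linear(K, 256), nn.ReLU(), nn.Linear(256, D))
model = nn.Sequential(encoder, decoder)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()             # squared error reconstruction loss

X = torch.randn(512, D)            # stand-in for real (unlabeled) training data
for epoch in range(100):
    x_tilde = model(X)             # reconstruction: decoder(encoder(x))
    loss = loss_fn(x_tilde, X)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

with torch.no_grad():
    codes = encoder(X)             # K-dimensional representation, e.g. for visualization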