Machine learning lecture slides
COMS 4771 Fall 2020
Multivariate Gaussians and PCA
Outline
I Multivariate Gaussians
I Eigendecompositions and covariance matrices
I Principal component analysis
I Principal component regression and spectral regularization
I Singular value decomposition
I Examples: topic modeling and matrix completion
Multivariate Gaussians: Isotropic Gaussians
I Start with X = (X1, . . . , Xd) ∼ N(0, I), i.e., X1, . . . , Xd are
iid N(0, 1) random variables.
I Probability density function is product of (univariate) Gaussian
densities
I E(Xi) = 0
I var(Xi) = cov(Xi, Xi) = 1, cov(Xi, Xj) = 0 for i ≠ j
I Arrange in mean vector E(X) = 0, covariance matrix
cov(X) = I
Figure 1: Density function for isotropic Gaussian in R^2
Figure 2: Density function level sets for isotropic Gaussian in R^2
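As a quick sanity check of the isotropic case, here is a small numpy sketch: draw many samples with iid N(0, 1) coordinates and confirm that the empirical mean vector and covariance matrix are close to 0 and I. The dimension, sample size, and seed below are arbitrary choices.

```python
# Sketch: X = (X1, ..., Xd) with iid N(0,1) coordinates has mean 0 and covariance I.
import numpy as np

rng = np.random.default_rng(0)
n, d = 100_000, 3
X = rng.standard_normal((n, d))              # each row is one draw of X ~ N(0, I)

print(np.round(X.mean(axis=0), 3))           # approximately the zero vector
print(np.round(np.cov(X, rowvar=False), 3))  # approximately the identity matrix
```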
Affine transformations of random vectors
I Start with any random vector Z, then apply linear
transformation, followed by translation
I X := MZ + µ, for M ∈ R^{k×d} and µ ∈ R^k
I Fact: E(X) = M E(Z) + µ, cov(X) = M cov(Z) M^T
I E.g., let u ∈ R^d be a unit vector (‖u‖2 = 1), and X := u^T Z
(projection of Z along direction u). Then E(X) = u^T E(Z), and
var(X) = u^T cov(Z) u.
I Note: These transformations work for random vectors with any
distribution, not just Gaussian distributions.
I However, it is convenient to illustrate the effect of these
transformations on Gaussian distributions, since the “shape” of
the Gaussian pdf is easy to understand.
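A small numpy sketch (with an arbitrary choice of M and µ) that checks the two facts above by simulation; the projection example u^T Z is the special case k = 1.

```python
# Sketch: verify E(X) = M E(Z) + mu and cov(X) = M cov(Z) M^T by simulation,
# for X = M Z + mu with Z ~ N(0, I).  M and mu below are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200_000, 3, 2
M = np.array([[1.0, 2.0, 0.0],
              [0.5, 0.0, 1.0]])              # k x d
mu = np.array([1.0, -1.0])                   # in R^k

Z = rng.standard_normal((n, d))              # rows ~ N(0, I) in R^d
X = Z @ M.T + mu                             # rows are M z_i + mu

print(np.round(X.mean(axis=0), 2))           # ~ mu, since E(Z) = 0
print(np.round(np.cov(X, rowvar=False), 2))  # ~ M M^T, since cov(Z) = I
print(np.round(M @ M.T, 2))
```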
Multivariate Gaussians: General Gaussians
I If Z ∼ N(0, I) and X = MZ + µ, we have E(X) = µ and
cov(X) = MM^T
I Assume M ∈ R^{d×d} is invertible (else we get a degenerate
Gaussian distribution).
I We say X ∼ N(µ, MM^T)
I Density function given by

p(x) = 1 / ( (2π)^{d/2} |MM^T|^{1/2} ) · exp( −(1/2) (x − µ)^T (MM^T)^{-1} (x − µ) ).

I Note: every non-singular covariance matrix Σ can be written as
MM^T for some non-singular matrix M. (We’ll see why soon.)
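A sketch evaluating the density formula above directly and, assuming scipy is available, comparing it against scipy.stats.multivariate_normal as a sanity check; M, µ, and x are arbitrary choices.

```python
# Sketch: evaluate p(x) for N(mu, Sigma) with Sigma = M M^T using the formula
# above, and compare with scipy's implementation (assumes scipy is installed).
import numpy as np
from scipy.stats import multivariate_normal

M = np.array([[2.0, 0.0],
              [1.0, 1.0]])                   # invertible, so Sigma is non-singular
mu = np.array([1.0, -1.0])
Sigma = M @ M.T

def gaussian_pdf(x, mu, Sigma):
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)        # (x-mu)^T Sigma^{-1} (x-mu)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

x = np.array([0.5, 0.5])
print(gaussian_pdf(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # should agree
```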
Figure 3: Density function level sets for anisotropic Gaussian in R^2
Inference with multivariate Gaussians (1)
I Bivariate case: (X1, X2) ∼ N(µ, Σ) in R^2

µ = [ µ1 ]        Σ = [ Σ1,1  Σ1,2 ]
    [ µ2 ] ,          [ Σ2,1  Σ2,2 ]
I What is the distribution of X2?
I N(µ2, Σ2,2)
I What is the distribution of X2 | X1 = x1?
I Miracle 1: it is a Gaussian distribution
I Miracle 2: mean provided by linear prediction of X2 from X1
with smallest MSE
I Miracle 3: variance doesn’t depend on x1
Inference with multivariate Gaussians (2)
I What is the distribution of X2 | X1 = x1?
I Miracle 1: it is a Gaussian distribution
I Miracle 2: mean provided by linear prediction of X2 from X1
with smallest MSE
I Miracle 3: variance doesn’t depend on x1
I OLS with X1 as input variable and X2 as output variable:
x1 ↦ m̂ x1 + θ̂
where
m̂ = cov(X1, X2) / var(X1) = Σ1,2 / Σ1,1 ,
θ̂ = E(X2) − m̂ E(X1) = µ2 − m̂ µ1.
I Therefore:
E[X2 | X1 = x1] = m̂ x1 + θ̂
               = µ2 + m̂ (x1 − µ1)
               = µ2 + (Σ1,2 / Σ1,1) (x1 − µ1)
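A simulation sketch of Miracle 2 with an arbitrary choice of µ and Σ: an ordinary least-squares fit of X2 on X1 recovers the slope Σ1,2/Σ1,1 and intercept µ2 − m̂ µ1 derived above.

```python
# Sketch: the least-squares line for predicting X2 from X1 has slope
# Sigma_{1,2} / Sigma_{1,1} and intercept mu_2 - slope * mu_1.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 1.2],
                  [1.2, 1.5]])               # an arbitrary valid covariance

X = rng.multivariate_normal(mu, Sigma, size=200_000)
x1, x2 = X[:, 0], X[:, 1]

slope, intercept = np.polyfit(x1, x2, deg=1)      # empirical OLS fit
print(slope, intercept)
print(Sigma[0, 1] / Sigma[0, 0],                  # theoretical slope m-hat
      mu[1] - Sigma[0, 1] / Sigma[0, 0] * mu[0])  # theoretical intercept
```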
Inference with multivariate Gaussians (3)
I What is the distribution of X2 | X1 = x1?
I Miracle 1: it is a Gaussian distribution
I Miracle 2: mean provided by linear prediction of X2 from X1
with smallest MSE
I Miracle 3: variance doesn’t depend on x1
var(X2 | X1 = x1) = E[var(X2 | X1)]               (using Miracle 3: no dependence on x1)
                  = var(X2) − var(E[X2 | X1])     (law of total variance)
                  = Σ2,2 − var(m̂ X1 + θ̂)
                  = Σ2,2 − m̂^2 var(X1)
                  = Σ2,2 − (Σ1,2^2 / Σ1,1^2) Σ1,1
                  = Σ2,2 − Σ1,2^2 / Σ1,1 .
Inference with multivariate Gaussians (4)
I Beyond bivariate Gaussians: same as above, but just writing
things properly using matrix notations
E[X2 | X1 = x1] = µ2 + Σ2,1 Σ1,1^{-1} (x1 − µ1)
cov(X2 | X1 = x1) = Σ2,2 − Σ2,1 Σ1,1^{-1} Σ1,2
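A sketch implementing these block formulas for a general partition; the function name and index-set arguments are my own choices, and the bivariate example at the end checks it against the expressions on the previous slides.

```python
# Sketch: conditional mean and covariance of X2 | X1 = x1 for (X1, X2) ~ N(mu, Sigma),
# where idx1 / idx2 are the coordinate indices of the two blocks.
import numpy as np

def gaussian_conditional(mu, Sigma, idx1, idx2, x1):
    mu1, mu2 = mu[idx1], mu[idx2]
    S11 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S21 = Sigma[np.ix_(idx2, idx1)]
    S22 = Sigma[np.ix_(idx2, idx2)]
    K = S21 @ np.linalg.inv(S11)              # Sigma_{2,1} Sigma_{1,1}^{-1}
    cond_mean = mu2 + K @ (x1 - mu1)
    cond_cov = S22 - K @ S12
    return cond_mean, cond_cov

# Bivariate check against the previous slides' formulas:
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 1.2],
                  [1.2, 1.5]])
m, C = gaussian_conditional(mu, Sigma, [0], [1], np.array([0.5]))
print(m, C)   # mu_2 + (1.2/2.0)(0.5 - 1.0)  and  1.5 - 1.2^2/2.0
```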
Eigendecomposition (1)
I Every symmetric matrix M ∈ Rd×d has d real eigenvalues,
which we arrange as
λ1 ≥ · · · ≥ λd
I Can choose corresponding orthonormal eigenvectors
v1, . . . , vd ∈ Rd
I This means
Mvi = λi vi
for each i = 1, . . . , d, and
vi^T vj = 1{i=j}
Eigendecomposition (2)
I Arrange v1, . . . , vd in an orthogonal matrix V := [v1| · · · |vd]
I V^T V = I and V V^T = ∑_{i=1}^d vi vi^T = I
I Therefore,
M = M V V^T = ∑_{i=1}^d M vi vi^T = ∑_{i=1}^d λi vi vi^T
I This is our preferred way to express the eigendecomposition
I Also called spectral decomposition
I Can also write M = V Λ V^T, where Λ = diag(λ1, . . . , λd)
I The matrix V diagonalizes M:
V^T M V = Λ
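A numpy sketch of the eigendecomposition using np.linalg.eigh (which assumes a symmetric input and returns eigenvalues in ascending order, so they are reversed here to match the λ1 ≥ · · · ≥ λd convention above).

```python
# Sketch: eigendecomposition of a symmetric matrix M = sum_i lambda_i v_i v_i^T.
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
M = B + B.T                                    # an arbitrary symmetric matrix

lam, V = np.linalg.eigh(M)                     # ascending eigenvalues
lam, V = lam[::-1], V[:, ::-1]                 # reorder: lambda_1 >= ... >= lambda_d

print(np.allclose(V.T @ V, np.eye(4)))         # orthonormal eigenvectors
print(np.allclose(M, V @ np.diag(lam) @ V.T))  # M = V Lambda V^T
print(np.allclose(V.T @ M @ V, np.diag(lam)))  # V diagonalizes M
```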
Covariance matrix (1)
I A ∈ R^{n×d} is the data matrix, with rows scaled by 1/√n:
A = (1/√n) [x1 | · · · | xn]^T
I Σ := A^T A = (1/n) ∑_{i=1}^n xi xi^T is the
(empirical) second-moment matrix
I If (1/n) ∑_{i=1}^n xi = 0 (data are “centered”), this is the
(empirical) covariance matrix
I For purpose of exposition, just say/write “(co)variance” even
though “second-moment” is technically correct
I For any unit vector u ∈ R^d,
u^T Σ u = (1/n) ∑_{i=1}^n (u^T xi)^2
is (empirical) variance of data along direction u
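A sketch of the identity above on synthetic data: with A = (1/√n)[x1 | · · · | xn]^T and Σ = A^T A, the quadratic form u^T Σ u equals the empirical variance of the data along the unit direction u.

```python
# Sketch: empirical covariance via the scaled data matrix, and the variance of
# the data along a (random) unit direction u.  Data here are synthetic.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))  # arbitrary data
X = X - X.mean(axis=0)                    # center, so second moments = covariance

A = X / np.sqrt(n)                        # rows are x_i^T / sqrt(n)
Sigma = A.T @ A                           # = (1/n) sum_i x_i x_i^T

u = rng.standard_normal(d)
u = u / np.linalg.norm(u)                 # random unit direction
print(u @ Sigma @ u)
print(np.mean((X @ u) ** 2))              # (1/n) sum_i (u^T x_i)^2 -- same value
```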
Covariance matrix (2)
I Note: some pixels in OCR data have very little (or zero!)
variation
I These are “coordinate directions” (e.g., u = (1, 0, . . . , 0))
I Probably can/should ignore these!
Figure 4: Which pixels are likely to have very little variance?
Top eigenvector
I Σ is symmetric, so can write eigendecomposition
Σ = ∑_{i=1}^d λi vi vi^T
I In which direction is variance maximized?
I Answer: v1, corresponding to largest eigenvalue λ1
I Called the top eigenvector
I This follows from the following characterization of v1:
v1^T Σ v1 = max_{u ∈ R^d : ‖u‖2 = 1} u^T Σ u = λ1.
Figure 5: What is the direction of the top eigenvector for the covariance of this Gaussian?
Top k eigenvectors
I What about among directions orthogonal to v1?
I Answer: v2, corresponding to second largest eigenvalue λ2
I (Note: all eigenvalues of Σ are non-negative!)
I For any k, Vk := [v1 | · · · | vk] satisfies
∑_{i=1}^k vi^T Σ vi = tr(Vk^T Σ Vk) = max_{U ∈ R^{d×k} : U^T U = I} tr(U^T Σ U) = ∑_{i=1}^k λi
(the top k eigenvectors)
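A sketch of these variational characterizations on a synthetic covariance matrix: v1 attains λ1, the top-k eigenvectors attain λ1 + · · · + λk for the trace objective, and a random orthonormal U does no better.

```python
# Sketch: v_1 maximizes u^T Sigma u over unit vectors, and the top-k eigenvectors
# maximize tr(U^T Sigma U) over matrices with orthonormal columns.
import numpy as np

rng = np.random.default_rng(0)
d, k = 6, 3
B = rng.standard_normal((100, d))
Sigma = B.T @ B / 100                          # a PSD (covariance-like) matrix

lam, V = np.linalg.eigh(Sigma)
lam, V = lam[::-1], V[:, ::-1]                 # descending eigenvalues

print(V[:, 0] @ Sigma @ V[:, 0], lam[0])       # v_1^T Sigma v_1 = lambda_1

Vk = V[:, :k]
print(np.trace(Vk.T @ Sigma @ Vk), lam[:k].sum())   # = lambda_1 + ... + lambda_k

U, _ = np.linalg.qr(rng.standard_normal((d, k)))    # random orthonormal columns
print(np.trace(U.T @ Sigma @ U))                    # never exceeds the sum above
```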
Principal component analysis
I k-dimensional principal components analysis (PCA) mapping:
ϕ(x) = (x^T v1, . . . , x^T vk) = Vk^T x ∈ R^k
where Vk = [v1 | · · · | vk] ∈ R^{d×k}
I (Only really makes sense when λk > 0.)
I This is a form of dimensionality reduction when k < d.
Figure 6: Fraction of residual variance from projections of varying dimension k (coordinate projections vs. PCA projections)
Covariance of data upon PCA mapping
I Covariance of data upon PCA mapping:
(1/n) ∑_{i=1}^n ϕ(xi) ϕ(xi)^T = (1/n) ∑_{i=1}^n Vk^T xi xi^T Vk = Vk^T Σ Vk = Λk
where Λk is diagonal matrix with λ1, . . . , λk along diagonal.
I In particular, coordinates in ϕ(x)-representation are
uncorrelated.
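A sketch combining the PCA mapping ϕ(x) = Vk^T x with the claim above: after mapping centered (synthetic) data, the empirical covariance of the new coordinates is diag(λ1, . . . , λk), i.e., the coordinates are uncorrelated.

```python
# Sketch: k-dimensional PCA mapping phi(x) = V_k^T x, and a check that the
# mapped (centered) data have diagonal covariance Lambda_k.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 2000, 5, 2
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))
X = X - X.mean(axis=0)                         # center the data

Sigma = X.T @ X / n
lam, V = np.linalg.eigh(Sigma)
lam, V = lam[::-1], V[:, ::-1]
Vk = V[:, :k]                                  # top-k eigenvectors

Phi = X @ Vk                                   # rows are phi(x_i) = V_k^T x_i
print(np.round(Phi.T @ Phi / n, 6))            # ~ diag(lambda_1, ..., lambda_k)
print(np.round(lam[:k], 6))
```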
PCA and linear regression
I Use k-dimensional PCA mapping ϕ(x) = V Tk x with ordinary
least squares
I (Assume rank of A is at least k, so A^T A has λk > 0)
I Data matrix is
(1/√n) [← ϕ(x1)^T →; . . . ; ← ϕ(xn)^T →] = (1/√n) [← x1^T Vk →; . . . ; ← xn^T Vk →] = A Vk ∈ R^{n×k}
I Therefore, OLS solution is
β̂ = (Vk^T A^T A Vk)^{-1} (A Vk)^T b = Λk^{-1} Vk^T A^T b
(Note: here β̂ ∈ R^k.)
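A sketch on synthetic data checking the closed form above: Λk^{-1} Vk^T A^T b agrees with running least squares directly on the mapped data A Vk.

```python
# Sketch: OLS on the k-dimensional PCA features equals Lambda_k^{-1} V_k^T A^T b.
# A and b are synthetic; A carries the 1/sqrt(n) scaling as on the slide.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 500, 8, 3
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))
y = rng.standard_normal(n)

A = X / np.sqrt(n)                             # scaled data matrix
b = y / np.sqrt(n)                             # labels scaled the same way

lam, V = np.linalg.eigh(A.T @ A)
lam, V = lam[::-1], V[:, ::-1]
Vk, Lk = V[:, :k], lam[:k]

beta_closed = (Vk.T @ A.T @ b) / Lk            # Lambda_k^{-1} V_k^T A^T b
beta_lstsq, *_ = np.linalg.lstsq(A @ Vk, b, rcond=None)
print(np.allclose(beta_closed, beta_lstsq))    # True
```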
Principal component regression
I Use β̂ = Λk^{-1} Vk^T A^T b to predict on new x ∈ R^d:
ϕ(x)^T β̂ = (Vk^T x)^T Λk^{-1} Vk^T A^T b = x^T (Vk Λk^{-1} Vk^T)(A^T b)
I So “effective” weight vector (that acts directly on x rather
than ϕ(x)) is given by
ŵ := (Vk Λk^{-1} Vk^T)(A^T b).
I This is called principal component regression (PCR) (here, k is
hyperparameter)
I Alternative hyper-parameterization: λ > 0; same as before but
using the largest k such that λk ≥ λ.
Spectral regularization
I PCR and ridge regression are examples of
spectral regularization.
I For a function g : R → R, write g(M) to mean
g(M) = ∑_{i=1}^d g(λi) vi vi^T
where M has eigendecomposition M = ∑_{i=1}^d λi vi vi^T.
I I.e., g is applied to eigenvalues of M
I Generalizes effect of polynomials: e.g., for g(z) = z^2,
M^2 = (V Λ V^T)(V Λ V^T) = V Λ^2 V^T.
I Claim: Can write each of PCR and ridge regression as
ŵ = g(A^T A) A^T b
for appropriate function g (depending on λ).
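A sketch of applying a scalar function to the eigenvalues of a symmetric matrix, with the polynomial case g(z) = z^2 from the slide as a sanity check.

```python
# Sketch: g(M) = sum_i g(lambda_i) v_i v_i^T for a symmetric matrix M.
# For g(z) = z^2 this reproduces the ordinary matrix square M @ M.
import numpy as np

def apply_spectral(g, M):
    lam, V = np.linalg.eigh(M)                 # M is assumed symmetric
    return V @ np.diag(g(lam)) @ V.T           # V g(Lambda) V^T

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
M = B + B.T

print(np.allclose(apply_spectral(lambda z: z ** 2, M), M @ M))   # True
```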
Comparing ridge regression and PCR
I ŵ = g(A^T A) A^T b
I Ridge regression (with parameter λ): g(z) = 1/(z + λ)
I PCR (with parameter λ): g(z) = 1{z ≥ λ} · (1/z)
I Interpretation:
I PCR uses directions with sufficient variability; ignores the rest
I Ridge artificially inflates the variance in all directions
Figure 7: Comparison of ridge regression and PCR (g(z) as a function of z)
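A sketch of the spectral view ŵ = g(A^T A) A^T b on synthetic data: with g(z) = 1/(z + λ) it reproduces the usual ridge solution (A^T A + λI)^{-1} A^T b, and with g(z) = 1{z ≥ λ}/z it gives the PCR weights.

```python
# Sketch: ridge regression and PCR as spectral regularizers g(A^T A) A^T b.
import numpy as np

def apply_spectral(g, M):
    lam, V = np.linalg.eigh(M)
    return V @ np.diag(g(lam)) @ V.T

rng = np.random.default_rng(0)
n, d = 200, 6
A = rng.standard_normal((n, d)) / np.sqrt(n)
b = rng.standard_normal(n) / np.sqrt(n)
reg = 0.1                                      # the hyperparameter lambda

w_ridge = apply_spectral(lambda z: 1.0 / (z + reg), A.T @ A) @ A.T @ b
w_ridge_direct = np.linalg.solve(A.T @ A + reg * np.eye(d), A.T @ b)
print(np.allclose(w_ridge, w_ridge_direct))    # True

w_pcr = apply_spectral(lambda z: (z >= reg) / np.maximum(z, 1e-12), A.T @ A) @ A.T @ b
print(w_pcr)                                   # uses only directions with lambda_i >= reg
```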
Matrix factorization
I Let A = [← x1^T →; . . . ; ← xn^T →] ∈ R^{n×d} (forget the 1/√n scaling)
I Try to approximate A with BC, where B ∈ R^{n×k} and
C ∈ R^{k×d}, to minimize ‖A − BC‖_F^2.
I Here, ‖ · ‖F is a matrix norm called Frobenius norm, which
treats the n× d matrix as a vector in nd-dimensional Euclidean
space
I Think of B as the encodings of the data in A
I “Dimension reduction” when k < d
I Theorem (Schmidt, 1907; Eckart-Young, 1936): Optimal
solution is given by truncating the
singular value decomposition (SVD) of A
Singular value decomposition
I Every matrix A ∈ Rn×d—say, with rank r—can be written as
A = ∑_{i=1}^r σi ui vi^T
where
I σ1 ≥ · · · ≥ σr > 0 (singular values)
I u1, . . . , ur ∈ Rn (orthonormal left singular vectors)
I v1, . . . , vr ∈ Rd (orthonormal right singular vectors)
I Can also write as
A = U S V^T
where
I U = [u1 | · · · | ur] ∈ R^{n×r}, satisfies U^T U = I
I S = diag(σ1, . . . , σr) ∈ R^{r×r}
I V = [v1 | · · · | vr] ∈ R^{d×r}, satisfies V^T V = I
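A numpy sketch of the SVD: np.linalg.svd with full_matrices=False returns the thin factors above, and the eigenvalues of A^T A are the squared singular values.

```python
# Sketch: thin SVD A = U S V^T, and the link to the eigendecomposition of A^T A.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # s sorted in decreasing order
print(np.allclose(A, U @ np.diag(s) @ Vt))         # A = U S V^T
print(np.allclose(U.T @ U, np.eye(5)))             # orthonormal left singular vectors
print(np.allclose(Vt @ Vt.T, np.eye(5)))           # orthonormal right singular vectors

lam = np.linalg.eigvalsh(A.T @ A)[::-1]            # eigenvalues of A^T A, descending
print(np.allclose(lam, s ** 2))                    # squared singular values
```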
Truncated SVD
I Let A have SVD A = ∑_{i=1}^r σi ui vi^T (rank of A is r)
I Truncate at rank k (for any k ≤ r): rank-k SVD
Ak := ∑_{i=1}^k σi ui vi^T
I Can write as Ak := Uk Sk Vk^T, where
I Uk = [u1 | · · · | uk] ∈ R^{n×k}, satisfies Uk^T Uk = I
I Sk = diag(σ1, . . . , σk) ∈ R^{k×k}
I Vk = [v1 | · · · | vk] ∈ R^{d×k}, satisfies Vk^T Vk = I
I Theorem (Schmidt/Eckart-Young):
‖A − Ak‖_F^2 = min_{M : rank(M) = k} ‖A − M‖_F^2 = ∑_{i=k+1}^r σi^2
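A sketch checking the Schmidt/Eckart-Young identity numerically: the squared Frobenius error of the rank-k truncation equals the sum of the discarded squared singular values.

```python
# Sketch: rank-k truncated SVD A_k and the error formula
# ||A - A_k||_F^2 = sigma_{k+1}^2 + ... + sigma_r^2.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 7))
k = 3

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]         # rank-k truncation

err = np.linalg.norm(A - Ak, 'fro') ** 2
print(err, np.sum(s[k:] ** 2))                     # the two numbers agree
```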
Encoder/decoder interpretation (1)
I Encoder: x ↦ ϕ(x) = Vk^T x ∈ R^k
I Encoding rows of A: A Vk = Uk Sk
I Decoder: z ↦ Vk z ∈ R^d
I Decoding rows of Uk Sk: Uk Sk Vk^T = Ak
I Same as k-dimensional PCA mapping!
I A^T A = V S^2 V^T, so eigenvectors of A^T A are right singular
vectors of A, non-zero eigenvalues are squares of the singular
values
I PCA/SVD finds k-dimensional subspace of smallest sum of
squared distances to data points.
Figure 8: Geometric picture of PCA
Encoder/decoder interpretation (2)
I Example: OCR data, compare original image to decoding of
k-dimensional PCA encoding (k ∈ {1, 10, 50, 200})
Figure 9: PCA compression of an MNIST digit (original vs. reconstructions from k-dimensional encodings)
Application: Topic modeling (1)
I Start with n documents, represent using “bag-of-words” count
vectors
I Arrange in matrix A ∈ Rn×d, where d is vocabulary size
           aardvark   abacus   abalone   · · ·
  doc 1        3         0         0     · · ·
  doc 2        7         0         4     · · ·
  doc 3        2         4         0     · · ·
   ...        ...       ...       ...
Application: Topic modeling (2)
I Rank k SVD provides an approximate factorization
A ≈ BC
where B ∈ Rn×k and C ∈ Rk×d
I This use of SVD is called Latent Semantic Analysis (LSA)
I Interpret rows of C as “topics”
I Bi,t is “weight” of document i on topic t
I If rows of C were probability distributions, could interpret
Ct,w as probability that word w appears in topic t
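A toy LSA sketch on a small made-up count matrix (the documents, vocabulary, and counts below are invented for illustration): the rank-2 SVD gives B = Uk Sk and C = Vk^T, with rows of C playing the role of topics.

```python
# Sketch: Latent Semantic Analysis on a tiny, made-up term-count matrix.
# Rows = documents, columns = words; the data and vocabulary are invented.
import numpy as np

# vocabulary: [ball, goal, team, election, vote, senate]
A = np.array([[3, 2, 4, 0, 0, 0],      # sports-flavored doc
              [2, 3, 3, 0, 1, 0],      # sports-flavored doc
              [0, 0, 1, 4, 3, 2],      # politics-flavored doc
              [0, 1, 0, 3, 4, 3]],     # politics-flavored doc
             dtype=float)

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
B = U[:, :k] * s[:k]                   # document weights on the k latent "topics"
C = Vt[:k, :]                          # each row of C is a "topic" over words

print(np.round(B, 2))                  # document loadings on the 2 latent directions
print(np.round(C, 2))
print(np.round(B @ C, 1))              # rank-2 approximation of A
```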
Application: Matrix completion (1)
I Start with ratings of movies given by users
I Arrange in a matrix A ∈ Rn×d, where Ai,j is rating given by
user i for movie j.
I Netflix: n = 480000, d = 18000; on average, each user rates
200 movies
I Most entries of A are unknown
I Idea: Approximate A with low-rank matrix, i.e., find
B = [← b1^T →; . . . ; ← bn^T →] ∈ R^{n×k},   C = [c1 | · · · | cd] ∈ R^{k×d}
with goal of minimizing ‖A − BC‖_F^2
I Note: If all entries of A were observed, we could do this with
truncated SVD.
Application: Matrix completion (2)
I Need to find a low-rank approximation without all of A:
(low-rank) matrix completion
I Lots of ways to do this
I Popular way (used in Netflix competition): based on “stochastic
gradient descent” (discussed later)
I Another way: fill in missing entries with plug-in estimates (based
on a statistical model), then compute truncated SVD as usual
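A hedged sketch of the SGD idea mentioned above (not the actual method used in the Netflix competition): repeatedly pick an observed entry (i, j) and take a gradient step on b_i and c_j for the squared error (A_{i,j} − b_i^T c_j)^2. The data, rank, step size, and number of passes below are all illustrative choices.

```python
# Sketch: low-rank matrix completion by stochastic gradient descent on the
# observed entries only.  All data and hyperparameters here are synthetic.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 30, 20, 3

# ground-truth low-rank matrix and a random set Omega of observed entries
A_true = rng.standard_normal((n, k)) @ rng.standard_normal((k, d))
mask = rng.random((n, d)) < 0.5                # ~50% of entries observed
obs = np.argwhere(mask)                        # observed (i, j) pairs

B = 0.1 * rng.standard_normal((n, k))          # factors to be learned
C = 0.1 * rng.standard_normal((k, d))
lr = 0.02

for epoch in range(200):
    rng.shuffle(obs)                           # visit observed entries in random order
    for i, j in obs:
        b_i = B[i].copy()                      # cache the current row of B
        err = A_true[i, j] - b_i @ C[:, j]     # residual on one observed entry
        B[i]    += lr * err * C[:, j]          # gradient step on b_i
        C[:, j] += lr * err * b_i              # gradient step on c_j

rmse = np.sqrt(np.mean((A_true[~mask] - (B @ C)[~mask]) ** 2))
print("RMSE on unobserved entries:", round(rmse, 3))
```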
Feature representations from matrix completion
I MovieLens data set (n = 6040 users, d = 3952 movies,
|Ω| = 800000 ratings)
I Fit B and C by using a standard matrix completion method
(based on SGD, discussed later)
I Are c1, . . . , cd ∈ Rk useful feature vectors for movies?
I Some nearest-neighbor pairs (cj ,NN(cj)):
I Toy Story (1995), Toy Story 2 (1999)
I Sense and Sensibility (1995), Emma (1996)
I Heat (1995), Carlito’s Way (1993)
I The Crow (1994), Blade (1998)
I Forrest Gump (1994), Dances with Wolves (1990)
I Mrs. Doubtfire (1993), The Bodyguard (1992)