Machine learning lecture slides
COMS 4771 Fall 2020
Multivariate Gaussians and PCA
Outline
Multivariate Gaussians
Eigendecompositions and covariance matrices
Principal component analysis
Principal component regression and spectral regularization
Singular value decomposition
Examples: topic modeling and matrix completion
Multivariate Gaussians: Isotropic Gaussians
Start with X = (X1,…,Xd) ∼ N(0,I), i.e., X1,…,Xd are iid N(0, 1) random variables.
Probability density function is product of (univariate) Gaussian densities
E(X_i) = 0
var(X_i) = cov(X_i, X_i) = 1, cov(X_i, X_j) = 0 for i ≠ j
Arrange in mean vector E(X) = 0 and covariance matrix cov(X) = I
Figure 1: Density function for isotropic Gaussian in R^2 (axes x_1, x_2; vertical axis: density)
Figure 2: Density function level sets for isotropic Gaussian in R^2 (axes x_1, x_2)
Affine transformations of random vectors
Start with any random vector Z, then apply a linear transformation, followed by a translation:
X := MZ + μ, for M ∈ R^{k×d} and μ ∈ R^k
Fact: E(X) = M E(Z) + μ, cov(X) = M cov(Z) M^T
E.g., let u ∈ R^d be a unit vector (∥u∥_2 = 1), and X := u^T Z
(the projection of Z along direction u). Then E(X) = u^T E(Z), and var(X) = u^T cov(Z) u.
Note: These transformations work for random vectors with any distribution, not just Gaussian distributions.
However, it is convenient to illustrate the effect of these transformations on Gaussian distributions, since the “shape” of the Gaussian pdf is easy to understand.
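A minimal numpy sketch (not from the slides) that checks the fact E(X) = M E(Z) + μ and cov(X) = M cov(Z) M^T empirically; the particular M, μ, and sample size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 3, 2, 200_000

M = rng.normal(size=(k, d))      # arbitrary linear map R^d -> R^k
mu = np.array([1.0, -2.0])       # translation

Z = rng.normal(size=(n, d))      # rows are iid draws of Z ~ N(0, I)
X = Z @ M.T + mu                 # X = M Z + mu, applied row by row

print(X.mean(axis=0))            # ~ M E(Z) + mu = mu
print(np.cov(X, rowvar=False))   # ~ M cov(Z) M^T = M M^T
print(M @ M.T)
```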
Multivariate Gaussians: General Gaussians
If Z ∼ N(0, I) and X = MZ + μ, we have E(X) = μ and cov(X) = MM^T
Assume M ∈ R^{d×d} is invertible (else we get a degenerate Gaussian distribution).
We say X ∼ N(μ, MM^T). Density function given by
p(x) = 1 / ((2π)^{d/2} |MM^T|^{1/2}) · exp( −(1/2) (x − μ)^T (MM^T)^{−1} (x − μ) ).
Note: every non-singular covariance matrix Σ can be written as MM^T for some non-singular matrix M. (We’ll see why soon.)
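A short sketch (assuming SciPy is available) that evaluates this density formula directly and compares it against scipy.stats.multivariate_normal; the matrix M, mean μ, and test point are made up for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

M = np.array([[1.0, 0.5],
              [0.0, 2.0]])        # invertible, so Sigma = M M^T is non-singular
mu = np.array([1.0, -1.0])
Sigma = M @ M.T

def gaussian_pdf(x, mu, Sigma):
    """Density of N(mu, Sigma) evaluated at x, via the formula above."""
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

x = np.array([0.3, 0.7])
print(gaussian_pdf(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))   # should agree
```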
Figure 3: Density function level sets for anisotropic Gaussian in R^2 (axes x_1, x_2)
Inference with multivariate Gaussians (1)
Bivariate case: (X_1, X_2) ∼ N(μ, Σ) in R^2
μ = (μ_1, μ_2), Σ = [ Σ_{1,1}  Σ_{1,2} ; Σ_{2,1}  Σ_{2,2} ]
What is the distribution of X_2? N(μ_2, Σ_{2,2})
What is the distribution of X2 | X1 = x1?
Miracle 1: it is a Gaussian distribution
Miracle 2: mean provided by linear prediction of X2 from X1
with smallest MSE
Miracle 3: variance doesn’t depend on x1
Inference with multivariate Gaussians (2)
What is the distribution of X2 | X1 = x1?
Miracle 1: it is a Gaussian distribution
Miracle 2: mean provided by linear prediction of X2 from X1
with smallest MSE
Miracle 3: variance doesn’t depend on x1
OLS with X1 as input variable and X2 as output variable:
x_1 → m̂ x_1 + θ̂
where
m̂ = cov(X_1, X_2) / var(X_1) = Σ_{1,2} / Σ_{1,1},
θ̂ = E(X_2) − m̂ E(X_1) = μ_2 − m̂ μ_1.
Therefore:
E[X_2 | X_1 = x_1] = m̂ x_1 + θ̂
= μ_2 + m̂ (x_1 − μ_1)
= μ_2 + (Σ_{1,2} / Σ_{1,1}) (x_1 − μ_1)
Inference with multivariate Gaussians (3)
What is the distribution of X2 | X1 = x1?
Miracle 1: it is a Gaussian distribution
Miracle 2: mean provided by linear prediction of X2 from X1
with smallest MSE
Miracle 3: variance doesn’t depend on x1
var(X_2 | X_1 = x_1) = E[var(X_2 | X_1)]   (using Miracle 3)
= var(X_2) − var(E[X_2 | X_1])   (law of total variance)
= Σ_{2,2} − var(m̂ X_1 + θ̂)
= Σ_{2,2} − m̂^2 var(X_1)
= Σ_{2,2} − (Σ_{1,2} / Σ_{1,1})^2 Σ_{1,1}
= Σ_{2,2} − Σ_{1,2}^2 / Σ_{1,1}.
Inference with multivariate Gaussians (4)
Beyond bivariate Gaussians: same as above, but just writing things properly using matrix notations
E[X_2 | X_1 = x_1] = μ_2 + Σ_{2,1} Σ_{1,1}^{−1} (x_1 − μ_1)
cov(X_2 | X_1 = x_1) = Σ_{2,2} − Σ_{2,1} Σ_{1,1}^{−1} Σ_{1,2}
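A small numpy sketch of these conditioning formulas for an arbitrary block partition; the helper name gaussian_condition and the example μ, Σ are illustrative, not from the slides.

```python
import numpy as np

def gaussian_condition(mu, Sigma, idx1, idx2, x1):
    """Mean and covariance of X2 | X1 = x1 when (X1, X2) ~ N(mu, Sigma)."""
    mu1, mu2 = mu[idx1], mu[idx2]
    S11 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S21 = Sigma[np.ix_(idx2, idx1)]
    S22 = Sigma[np.ix_(idx2, idx2)]
    cond_mean = mu2 + S21 @ np.linalg.solve(S11, x1 - mu1)
    cond_cov = S22 - S21 @ np.linalg.solve(S11, S12)
    return cond_mean, cond_cov

mu = np.array([1.0, -1.0, 0.5])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
m, C = gaussian_condition(mu, Sigma, idx1=[0], idx2=[1, 2], x1=np.array([2.0]))
print(m)
print(C)
```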
Eigendecomposition (1)
Every symmetric matrix M ∈ R^{d×d} has d real eigenvalues, which we arrange as
λ_1 ≥ ··· ≥ λ_d
Can choose corresponding orthonormal eigenvectors v_1, …, v_d ∈ R^d
This means
M v_i = λ_i v_i for each i = 1, …, d, and v_i^T v_j = 1{i = j}
Eigendecomposition (2)
Arrange v_1, …, v_d in an orthogonal matrix V := [v_1 | ··· | v_d]
V^T V = I and V V^T = Σ_{i=1}^d v_i v_i^T = I
Therefore,
M = M V V^T = Σ_{i=1}^d M v_i v_i^T = Σ_{i=1}^d λ_i v_i v_i^T
This is our preferred way to express the eigendecomposition
Also called spectral decomposition
Can also write M = V Λ V^T, where Λ = diag(λ_1, …, λ_d)
The matrix V diagonalizes M: V^T M V = Λ
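A quick numpy check of the eigendecomposition facts above; note that np.linalg.eigh returns eigenvalues in ascending order, so they are flipped to match the slides' convention λ_1 ≥ ··· ≥ λ_d.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
M = (B + B.T) / 2                    # a symmetric matrix

evals, evecs = np.linalg.eigh(M)     # ascending eigenvalues
lam = evals[::-1]                    # lambda_1 >= ... >= lambda_d
V = evecs[:, ::-1]                   # columns v_1, ..., v_d

print(np.allclose(V.T @ V, np.eye(4)))           # orthonormal eigenvectors
print(np.allclose(V @ np.diag(lam) @ V.T, M))    # M = V Lambda V^T
print(np.allclose(V.T @ M @ V, np.diag(lam)))    # V diagonalizes M
```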
Covariance matrix (1)
A ∈ R^{n×d} is the data matrix (with rows x_i^T / √n)
Σ := A^T A = (1/n) Σ_{i=1}^n x_i x_i^T is the (empirical) second-moment matrix
If (1/n) Σ_{i=1}^n x_i = 0 (data are “centered”), this is the (empirical) covariance matrix
For purpose of exposition, just say/write “(co)variance” even though “second-moment” is technically correct
For any unit vector u ∈ R^d,
u^T Σ u = (1/n) Σ_{i=1}^n (u^T x_i)^2
is the (empirical) variance of the data along direction u
Covariance matrix (2)
Note: some pixels in OCR data have very little (or zero!) variation
These are “coordinate directions” (e.g., u = (1, 0, …, 0))
Probably can/should ignore these!
Figure 4: Which pixels are likely to have very little variance?
Top eigenvector
Σ is symmetric, so can write eigendecomposition
Σ = Σ_{i=1}^d λ_i v_i v_i^T
In which direction is variance maximized?
Answer: v1, corresponding to largest eigenvalue λ1
Called the top eigenvector
This follows from the following characterization of v1:
v_1^T Σ v_1 = max_{u ∈ R^d : ∥u∥_2 = 1} u^T Σ u = λ_1.
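A small sketch (synthetic data, arbitrary linear map) checking that the variance along the top eigenvector equals λ_1 and is at least the variance along other unit directions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2)) @ np.array([[2.0, 0.0],
                                           [0.8, 0.5]])   # anisotropic data
X = X - X.mean(axis=0)                  # center
Sigma = X.T @ X / len(X)                # empirical covariance

evals, evecs = np.linalg.eigh(Sigma)
lam1, v1 = evals[-1], evecs[:, -1]      # top eigenvalue / eigenvector

print(np.isclose(v1 @ Sigma @ v1, lam1))   # variance along v1 is lambda_1
for _ in range(5):                         # no unit direction does better
    u = rng.normal(size=2)
    u /= np.linalg.norm(u)
    print(u @ Sigma @ u <= lam1 + 1e-12)
```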
Figure 5: What is the direction of the top eigenvector for the covariance of this Gaussian? (axes x_1, x_2)
Top k eigenvectors
What about among directions orthogonal to v1?
Answer: v2, corresponding to second largest eigenvalue λ2
(Note: all eigenvalues of Σ are non-negative!)
For any k, V_k := [v_1 | ··· | v_k] (the top k eigenvectors) satisfies
Σ_{i=1}^k v_i^T Σ v_i = tr(V_k^T Σ V_k) = max_{U ∈ R^{d×k} : U^T U = I} tr(U^T Σ U) = Σ_{i=1}^k λ_i
Principal component analysis
k-dimensional principal components analysis (PCA) mapping: φ(x) = (x^T v_1, …, x^T v_k) = V_k^T x ∈ R^k
where V_k = [v_1 | ··· | v_k] ∈ R^{d×k}
(Only really makes sense when λk > 0.)
This is a form of dimensionality reduction when k < d.
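A minimal sketch of the PCA mapping via the eigendecomposition of the empirical covariance; the helper pca_map and the synthetic data are illustrative only.

```python
import numpy as np

def pca_map(X, k):
    """Top-k eigenvectors V_k of the (co)variance of X, so phi(x) = V_k^T x."""
    Sigma = X.T @ X / len(X)
    evals, evecs = np.linalg.eigh(Sigma)   # ascending eigenvalues
    return evecs[:, ::-1][:, :k]           # columns v_1, ..., v_k

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))
X = X - X.mean(axis=0)                     # center the data

Vk = pca_map(X, k=3)
Phi = X @ Vk                               # rows are phi(x_i) in R^3
print(Phi.shape)                           # (500, 3)
```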
Figure 6: Fraction of residual variance from projections of varying dimension (curves: coordinate projections vs. PCA projections; horizontal axis: dimension of projections k; vertical axis: fraction of residual variance)
Covariance of data upon PCA mapping
Covariance of data upon PCA mapping:
(1/n) Σ_{i=1}^n φ(x_i) φ(x_i)^T = V_k^T ( (1/n) Σ_{i=1}^n x_i x_i^T ) V_k = V_k^T Σ V_k = Λ_k
where Λ_k is the diagonal matrix with λ_1, …, λ_k along the diagonal.
In particular, coordinates in the φ(x)-representation are uncorrelated.
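Continuing the previous sketch, a quick check that the covariance of the mapped data is the diagonal matrix Λ_k (again with made-up data).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))
X = X - X.mean(axis=0)
Sigma = X.T @ X / len(X)

evals, evecs = np.linalg.eigh(Sigma)
lam = evals[::-1]                          # descending eigenvalues
Vk = evecs[:, ::-1][:, :3]                 # top-3 eigenvectors

Phi = X @ Vk
Cov_phi = Phi.T @ Phi / len(X)             # covariance of mapped data
print(np.allclose(Cov_phi, np.diag(lam[:3])))   # equals Lambda_k (diagonal)
```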
PCA and linear regression
Use k-dimensional PCA mapping φ(x) = V_k^T x with ordinary least squares
(Assume rank of A is at least k, so A^T A has λ_k > 0)
Data matrix is
(1/√n) [φ(x_1)^T; … ; φ(x_n)^T] = (1/√n) [x_1^T V_k; … ; x_n^T V_k] = A V_k ∈ R^{n×k}
Therefore, OLS solution is
β̂ = (V_k^T A^T A V_k)^{−1} (A V_k)^T b = Λ_k^{−1} V_k^T A^T b
(Note: here β̂ ∈ R^k.)
Principal component regression
Use β̂ = Λ_k^{−1} V_k^T A^T b to predict on new x ∈ R^d:
φ(x)^T β̂ = (V_k^T x)^T Λ_k^{−1} V_k^T A^T b = x^T (V_k Λ_k^{−1} V_k^T)(A^T b)
So “effective” weight vector (that acts directly on x rather than φ(x)) is given by
ŵ := (V_k Λ_k^{−1} V_k^T)(A^T b).
This is called principal component regression (PCR) (here, k is a hyperparameter)
Alternative hyper-parameterization: λ > 0; same as before but using the largest k such that λ_k ≥ λ.
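A compact sketch of PCR as described above. It assumes the course's convention that A has rows x_i^T/√n and b is the correspondingly scaled response vector; the helper name pcr_weights and the toy data are illustrative.

```python
import numpy as np

def pcr_weights(A, b, k):
    """Effective PCR weights w_hat = V_k Lambda_k^{-1} V_k^T A^T b."""
    evals, evecs = np.linalg.eigh(A.T @ A)    # eigendecomposition of A^T A
    lam = evals[::-1][:k]                     # top-k eigenvalues (assumed > 0)
    Vk = evecs[:, ::-1][:, :k]
    return Vk @ ((Vk.T @ (A.T @ b)) / lam)

# toy usage: rows of A are x_i^T / sqrt(n), b is the scaled response vector
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + 0.1 * rng.normal(size=n)
A, b = X / np.sqrt(n), y / np.sqrt(n)

print(pcr_weights(A, b, k=3))   # w_hat acts directly on x: prediction is x @ w_hat
```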
Spectral regularization
PCR and ridge regression are examples of spectral regularization.
For a function g: R → R, write g(M) to mean
g(M) = Σ_{i=1}^d g(λ_i) v_i v_i^T
where M has eigendecomposition M = Σ_{i=1}^d λ_i v_i v_i^T. I.e., g is applied to the eigenvalues of M
Generalizes effect of polynomials: e.g., for g(z) = z^2,
M^2 = (V Λ V^T)(V Λ V^T) = V Λ^2 V^T.
Claim: Can write each of PCR and ridge regression as
ŵ = g(A^T A) A^T b
for appropriate function g (depending on λ).
Comparing ridge regression and PCR
ŵ = g(A^T A) A^T b
Ridge regression (with parameter λ): g(z) = 1 / (z + λ)
PCR (with parameter λ): g(z) = 1{z ≥ λ} · (1/z)
Interpretation:
PCR uses directions with sufficient variability; ignores the rest
Ridge artificially inflates the variance in all directions
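A sketch of the general recipe ŵ = g(A^T A) A^T b with the two choices of g above; the regularization value λ = 0.1 and the toy A, b are arbitrary.

```python
import numpy as np

def spectral_weights(A, b, g):
    """w_hat = g(A^T A) A^T b, with g applied to the eigenvalues of A^T A."""
    evals, evecs = np.linalg.eigh(A.T @ A)
    return evecs @ (g(evals) * (evecs.T @ (A.T @ b)))

lam = 0.1                                         # regularization parameter

def g_ridge(z):
    return 1.0 / (z + lam)                        # ridge regression

def g_pcr(z):
    # 1{z >= lam} * 1/z; the maximum() guard only avoids 0-division warnings
    return np.where(z >= lam, 1.0 / np.maximum(z, 1e-12), 0.0)

rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.normal(size=(n, d)) / np.sqrt(n)
b = rng.normal(size=n) / np.sqrt(n)

print(spectral_weights(A, b, g_ridge))
print(spectral_weights(A, b, g_pcr))
```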
Figure 7: Comparison of ridge regression and PCR (curves: ridge, pcr; horizontal axis: z, vertical axis: g(z))
Matrix factorization
Let A ∈ R^{n×d} be the matrix with rows x_1^T, …, x_n^T (forget the 1/√n scaling)
Try to approximate A with BC, where B ∈ R^{n×k} and C ∈ R^{k×d}, to minimize ∥A − BC∥_F^2.
Here, ∥·∥_F is a matrix norm called the Frobenius norm, which treats the n × d matrix as a vector in nd-dimensional Euclidean space
Think of B as the encodings of the data in A
“Dimension reduction” when k < d
Theorem (Schmidt, 1907; Eckart-Young, 1936): Optimal solution is given by truncating the
singular value decomposition (SVD) of A
Singular value decomposition
Every matrix A ∈ R^{n×d}—say, with rank r—can be written as
A = Σ_{i=1}^r σ_i u_i v_i^T
where
σ_1 ≥ ··· ≥ σ_r > 0 (singular values)
u_1, …, u_r ∈ R^n (orthonormal left singular vectors)
v_1, …, v_r ∈ R^d (orthonormal right singular vectors)
Can also write as
A = U S V^T
where
U = [u_1 | ··· | u_r] ∈ R^{n×r}, satisfies U^T U = I
S = diag(σ_1, …, σ_r) ∈ R^{r×r}
V = [v_1 | ··· | v_r] ∈ R^{d×r}, satisfies V^T V = I
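In numpy, np.linalg.svd computes this decomposition (singular values in descending order); a quick check on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))

# full_matrices=False gives the "thin" SVD; singular values come in
# descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(np.allclose(A, U @ np.diag(s) @ Vt))          # A = U S V^T
print(np.allclose(U.T @ U, np.eye(U.shape[1])))     # orthonormal columns of U
print(np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0])))  # orthonormal columns of V
```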
Truncated SVD
Let A have SVD A = Σ_{i=1}^r σ_i u_i v_i^T (rank of A is r)
Truncate at rank k (for any k ≤ r): rank-k SVD
A_k := Σ_{i=1}^k σ_i u_i v_i^T
Can write as A_k := U_k S_k V_k^T, where
U_k = [u_1 | ··· | u_k] ∈ R^{n×k}, satisfies U_k^T U_k = I
S_k = diag(σ_1, …, σ_k) ∈ R^{k×k}
V_k = [v_1 | ··· | v_k] ∈ R^{d×k}, satisfies V_k^T V_k = I
Theorem (Schmidt/Eckart-Young):
∥A − A_k∥_F^2 = min_{M : rank(M) = k} ∥A − M∥_F^2 = Σ_{i=k+1}^r σ_i^2
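A short check of the truncated SVD and the Schmidt/Eckart-Young error formula on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 5))
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]      # rank-k truncated SVD

err = np.linalg.norm(A - A_k, "fro") ** 2
print(np.isclose(err, np.sum(s[k:] ** 2)))       # sum of squared tail singular values
```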
Encoder/decoder interpretation (1)
Encoder: x → φ(x) = V_k^T x ∈ R^k
Encoding rows of A: A V_k = U_k S_k
Decoder: z → V_k z ∈ R^d
Decoding rows of U_k S_k: U_k S_k V_k^T = A_k
Same as k-dimensional PCA mapping!
A^T A = V S^2 V^T, so eigenvectors of A^T A are the right singular vectors of A, and the non-zero eigenvalues are the squares of the singular values
PCA/SVD finds k-dimensional subspace of smallest sum of
squared distances to data points.
Figure 8: Geometric picture of PCA
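A small numerical check (random matrix) of the claim that the eigenvectors of A^T A are the right singular vectors of A and its non-zero eigenvalues are the squared singular values; eigenvectors are only determined up to sign, hence the absolute values.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
evals, evecs = np.linalg.eigh(A.T @ A)               # ascending order

print(np.allclose(evals[::-1], s ** 2))              # eigenvalues = sigma_i^2
# eigenvectors equal right singular vectors up to sign
print(np.allclose(np.abs(evecs[:, ::-1]), np.abs(Vt.T)))
```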
Encoder/decoder interpretation (2)
Example: OCR data, compare original image to decoding of k-dimensional PCA encoding (k ∈ {1, 10, 50, 200})
Figure 9: PCA compression of MNIST digit (panels: Original, M = 1, M = 10, M = 50, M = 250)
Application: Topic modeling (1)
Start with n documents, represent using “bag-of-words” count vectors
Arrange in matrix A ∈ Rn×d, where d is vocabulary size
        aardvark  abacus  abalone  ···
doc1        3        0        0    ···
doc2        7        0        4    ···
doc3        2        4        0    ···
 ...
Application: Topic modeling (2)
Rank-k SVD provides an approximate factorization A ≈ BC
where B ∈ R^{n×k} and C ∈ R^{k×d}
This use of SVD is called Latent Semantic Analysis (LSA)
Interpret rows of C as “topics”
Bi,t is “weight” of document i on topic t
If rows of C were probability distributions, could interpret C_{t,w} as the probability that word w appears in topic t
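A toy LSA sketch: the count matrix below is made up, and B, C are read off from the rank-k SVD as in the encoder/decoder view (B = U_k S_k, C = V_k^T).

```python
import numpy as np

# Toy bag-of-words count matrix A (n documents x d vocabulary words).
A = np.array([[3, 0, 0, 2, 1],
              [7, 0, 4, 0, 0],
              [2, 4, 0, 1, 3],
              [0, 5, 1, 0, 2]], dtype=float)
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
B = U[:, :k] * s[:k]            # document-by-topic weights (n x k), B = U_k S_k
C = Vt[:k, :]                   # topic-by-word matrix (k x d), C = V_k^T

print(np.linalg.norm(A - B @ C, "fro"))   # rank-k approximation error
```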
Application: Matrix completion (1)
Start with ratings of movies given by users
Arrange in a matrix A ∈ Rn×d, where Ai,j is rating given by
user i for movie j.
Netflix: n = 480000, d = 18000; on average, each user rates
200 movies
Most entries of A are unknown
Idea: Approximate A with a low-rank matrix, i.e., find
B ∈ R^{n×k} with rows b_1^T, …, b_n^T, and C ∈ R^{k×d} with columns c_1, …, c_d,
with goal of minimizing ∥A − BC∥_F^2
Note: If all entries of A were observed, we could do this with
truncated SVD.
Application: Matrix completion (2)
Need to find a low-rank approximation without all of A: (low-rank) matrix completion
Lots of ways to do this
Popular way (used in Netflix competition): based on “stochastic
gradient descent” (discussed later)
Another way: fill in missing entries with plug-in estimates (based
on a statistical model), then compute truncated SVD as usual
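A sketch of the second approach (plug-in estimates, then truncated SVD). Filling each movie's missing entries with its observed column mean is an assumed plug-in choice for illustration; the function and toy data are not from the slides.

```python
import numpy as np

def complete_by_svd(A_obs, mask, k):
    """Fill missing entries with per-movie means, then truncate the SVD at rank k.

    A_obs: ratings matrix, with arbitrary values where mask is False.
    mask:  boolean matrix, True where the rating was observed.
    """
    filled = A_obs.astype(float).copy()
    for j in range(filled.shape[1]):
        obs = mask[:, j]
        mean_j = filled[obs, j].mean() if obs.any() else 0.0
        filled[~obs, j] = mean_j                     # plug-in estimate
    U, s, Vt = np.linalg.svd(filled, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # rank-k approximation

# toy usage
rng = np.random.default_rng(0)
A = rng.integers(1, 6, size=(8, 6)).astype(float)    # "true" ratings 1..5
mask = rng.random(size=A.shape) < 0.6                # ~60% observed
print(np.round(complete_by_svd(A, mask, k=2), 1))
```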
Feature representations from matrix completion
MovieLens data set (n = 6040 users, d = 3952 movies, |Ω| = 800000 ratings)
Fit B and C by using a standard matrix completion method (based on SGD, discussed later)
Are c_1, …, c_d ∈ R^k useful feature vectors for movies?
Some nearest-neighbor pairs (c_j, NN(c_j)):
Toy Story (1995), Toy Story 2 (1999)
Sense and Sensibility (1995), Emma (1996)
Heat (1995), Carlito’s Way (1993)
The Crow (1994), Blade (1998)
Forrest Gump (1994), Dances with Wolves (1990)
Mrs. Doubtfire (1993), The Bodyguard (1992)