Machine learning lecture slides
COMS 4771 Fall 2020
Multivariate Gaussians and PCA
Outline
I Multivariate Gaussians
I Eigendecompositions and covariance matrices
I Principal component analysis
I Principal component regression and spectral regularization
I Singular value decomposition
I Examples: topic modeling and matrix completion
Multivariate Gaussians: Isotropic Gaussians
I Start with X = (X_1, …, X_d) ∼ N(0, I), i.e., X_1, …, X_d are iid N(0, 1) random variables.
I Probability density function is the product of (univariate) Gaussian densities
I E(X_i) = 0, var(X_i) = cov(X_i, X_i) = 1, cov(X_i, X_j) = 0 for i ≠ j
I Arrange in mean vector E(X) = 0, covariance matrix cov(X) = I (a quick numerical check follows below)
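A minimal NumPy sketch checking these facts by simulation; the dimension, sample size, and seed below are arbitrary choices:

```python
import numpy as np

# Draw n samples of X = (X_1, ..., X_d) with X_i iid N(0, 1).
rng = np.random.default_rng(0)
n, d = 100_000, 3
X = rng.standard_normal((n, d))

print(X.mean(axis=0))           # empirical mean: close to the zero vector
print(np.cov(X, rowvar=False))  # empirical covariance: close to the identity I
```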
Figure 1: Density function for isotropic Gaussian in R² (axes x_1, x_2; vertical axis: density)
Figure 2: Density function level sets for isotropic Gaussian in R² (axes x_1, x_2)
Affine transformations of random vectors
I Start with any random vector Z, then apply a linear transformation, followed by a translation: X := MZ + μ, for M ∈ R^{k×d} and μ ∈ R^k
I Fact: E(X) = M E(Z) + μ, cov(X) = M cov(Z) M^T
I E.g., let u ∈ R^d be a unit vector (‖u‖_2 = 1), and X := u^T Z (projection of Z along direction u). Then E(X) = u^T E(Z), and var(X) = u^T cov(Z) u.
I Note: These transformations work for random vectors with any distribution, not just Gaussian distributions.
I However, it is convenient to illustrate the effect of these transformations on Gaussian distributions, since the "shape" of the Gaussian pdf is easy to understand.
Multivariate Gaussians: General Gaussians
I If Z ∼ N(0, I) and X = MZ + μ, we have E(X) = μ and cov(X) = MM^T (see the simulation sketch below)
I Assume M ∈ R^{d×d} is invertible (else we get a degenerate Gaussian distribution).
I We say X ∼ N(μ, MM^T)
I Density function given by

    p(x) = (2π)^{-d/2} |MM^T|^{-1/2} exp( -(1/2) (x - μ)^T (MM^T)^{-1} (x - μ) ).

I Note: every non-singular covariance matrix Σ can be written as MM^T for some non-singular matrix M. (We'll see why soon.)
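A small simulation consistent with these formulas (a sketch; the particular M, μ, sample size, and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200_000, 2

# An invertible M and a mean vector mu (arbitrary example values).
M = np.array([[1.0, 0.5],
              [0.0, 2.0]])
mu = np.array([1.0, -1.0])

# X = M Z + mu with Z ~ N(0, I).
Z = rng.standard_normal((n, d))
X = Z @ M.T + mu

print(X.mean(axis=0))            # ~ mu
print(np.cov(X, rowvar=False))   # ~ M M^T
print(M @ M.T)
```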
Figure 3: Density function level sets for anisotropic Gaussian in R² (axes x_1, x_2)
Inference with multivariate Gaussians (1)
I Bivariate case: (X_1, X_2) ∼ N(μ, Σ) in R²

    μ = [ μ_1 ]    Σ = [ Σ_{1,1}  Σ_{1,2} ]
        [ μ_2 ]        [ Σ_{2,1}  Σ_{2,2} ]

I What is the distribution of X_2?
I N(μ_2, Σ_{2,2})
I What is the distribution of X_2 | X_1 = x_1?
I Miracle 1: it is a Gaussian distribution
I Miracle 2: mean provided by linear prediction of X_2 from X_1 with smallest MSE
I Miracle 3: variance doesn't depend on x_1
Inference with multivariate Gaussians (2)
I What is the distribution of X_2 | X_1 = x_1?
I Miracle 1: it is a Gaussian distribution
I Miracle 2: mean provided by linear prediction of X_2 from X_1 with smallest MSE
I Miracle 3: variance doesn't depend on x_1
I OLS with X_1 as input variable and X_2 as output variable: x_1 ↦ m̂ x_1 + θ̂, where

    m̂ = cov(X_1, X_2) / var(X_1) = Σ_{1,2} / Σ_{1,1},
    θ̂ = E(X_2) - m̂ E(X_1) = μ_2 - m̂ μ_1.

I Therefore:

    E[X_2 | X_1 = x_1] = m̂ x_1 + θ̂
                       = μ_2 + m̂ (x_1 - μ_1)
                       = μ_2 + (Σ_{1,2} / Σ_{1,1}) (x_1 - μ_1)
Inference with multivariate Gaussians (3)
I What is the distribution of X_2 | X_1 = x_1?
I Miracle 1: it is a Gaussian distribution
I Miracle 2: mean provided by linear prediction of X_2 from X_1 with smallest MSE
I Miracle 3: variance doesn't depend on x_1

    var(X_2 | X_1 = x_1) = E[var(X_2 | X_1)]
                         = var(X_2) - var(E[X_2 | X_1])
                         = Σ_{2,2} - var(m̂ X_1 + θ̂)
                         = Σ_{2,2} - m̂² var(X_1)
                         = Σ_{2,2} - Σ_{1,2}² / Σ_{1,1}.
Inference with multivariate Gaussians (4)
I Beyond bivariate Gaussians (X now a random vector in R^d): same as above, but just writing things properly using matrix notations

    E[X_2 | X_1 = x_1]   = μ_2 + Σ_{2,1} Σ_{1,1}^{-1} (x_1 - μ_1)
    cov(X_2 | X_1 = x_1) = Σ_{2,2} - Σ_{2,1} Σ_{1,1}^{-1} Σ_{1,2}

  (numerical check in the sketch below)
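These matrix formulas are easy to verify numerically; below is a minimal sketch with an arbitrary 3-dimensional example, partitioning X into X_1 ∈ R² and X_2 ∈ R:

```python
import numpy as np

# Example joint Gaussian in R^3, partitioned into X1 = (first two coords), X2 = (last coord).
mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[2.0, 0.5, 0.8],
                  [0.5, 1.5, 0.3],
                  [0.8, 0.3, 1.0]])

idx1, idx2 = [0, 1], [2]
mu1, mu2 = mu[idx1], mu[idx2]
S11 = Sigma[np.ix_(idx1, idx1)]
S12 = Sigma[np.ix_(idx1, idx2)]
S21 = Sigma[np.ix_(idx2, idx1)]
S22 = Sigma[np.ix_(idx2, idx2)]

x1 = np.array([0.5, 1.2])  # observed value of X1

# E[X2 | X1 = x1] = mu2 + S21 S11^{-1} (x1 - mu1)
cond_mean = mu2 + S21 @ np.linalg.solve(S11, x1 - mu1)
# cov(X2 | X1 = x1) = S22 - S21 S11^{-1} S12
cond_cov = S22 - S21 @ np.linalg.solve(S11, S12)
print(cond_mean, cond_cov)
```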
Announcements
I HW 2 due tomorrow
I HW 3 probably out Thursday
I Quiz 2 on Friday; covers all material through HW 2 and last week's lectures
Eigendecomposition (1)
I Every symmetric matrix M ∈ R^{d×d} has d real eigenvalues, which we arrange as λ_1 ≥ ··· ≥ λ_d
I Can choose corresponding orthonormal eigenvectors v_1, …, v_d ∈ R^d
I This means M v_i = λ_i v_i for each i = 1, …, d, and v_i^T v_j = 1{i=j}
Eigendecomposition (2)
I Arrange v_1, …, v_d in an orthogonal matrix V := [v_1 | ··· | v_d]
I V^T V = I and V V^T = ∑_{i=1}^d v_i v_i^T = I
I Therefore,

    M = M V V^T = ∑_{i=1}^d M v_i v_i^T = ∑_{i=1}^d λ_i v_i v_i^T

I This is our preferred way to express the eigendecomposition
I Also called spectral decomposition
I Can also write M = V Λ V^T, where Λ = diag(λ_1, …, λ_d)
I The matrix V diagonalizes M: V^T M V = Λ (see the sketch below)
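In NumPy, np.linalg.eigh computes this decomposition for symmetric matrices; it returns eigenvalues in ascending order, so they are reversed below to match the λ_1 ≥ ··· ≥ λ_d convention (a minimal sketch with an arbitrary symmetric matrix):

```python
import numpy as np

# A symmetric matrix M (arbitrary example).
rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
M = (B + B.T) / 2

evals, evecs = np.linalg.eigh(M)   # ascending eigenvalues for symmetric input
lam = evals[::-1]                  # lambda_1 >= ... >= lambda_d
V = evecs[:, ::-1]                 # columns v_1, ..., v_d (orthonormal)

print(np.allclose(V.T @ V, np.eye(4)))          # V^T V = I
print(np.allclose(V @ np.diag(lam) @ V.T, M))   # M = V diag(lambda) V^T
print(np.allclose(V.T @ M @ V, np.diag(lam)))   # V diagonalizes M
```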
Covariance matrix (1)
I A ∈ R^{n×d} is the data matrix (rows are x_1^T, …, x_n^T, scaled by 1/√n)
I Σ := A^T A = (1/n) ∑_{i=1}^n x_i x_i^T is the (empirical) second-moment matrix
I If (1/n) ∑_{i=1}^n x_i = 0 (data are "centered"), this is the (empirical) covariance matrix
I For purpose of exposition, just say/write "(co)variance" even though "second-moment" is technically correct
I For any unit vector u ∈ R^d,

    u^T Σ u = (1/n) ∑_{i=1}^n (u^T x_i)²

  is the (empirical) variance of the data along direction u
Covariance matrix (2)
I Note: some pixels in OCR data have very little (or zero!) variation
I These are "coordinate directions" (e.g., u = (1, 0, …, 0))
I Probably can/should ignore these!

Figure 4: Which pixels are likely to have very little variance?
Top eigenvector
I Σ is symmetric, so can write eigendecomposition Σ = ∑_{i=1}^d λ_i v_i v_i^T
I In which direction is variance maximized?
I Answer: v_1, corresponding to the largest eigenvalue λ_1
I Called the top eigenvector
I This follows from the following (variational) characterization of v_1:

    v_1^T Σ v_1 = max_{u ∈ R^d : ‖u‖_2 = 1} u^T Σ u = λ_1.
Figure 5: What is the direction of the top eigenvector for the covariance of this Gaussian? (axes x_1, x_2)
Top k eigenvectors
I What about among directions orthogonal to v_1?
I Answer: v_2, corresponding to the second largest eigenvalue λ_2
I (Note: all eigenvalues of Σ are non-negative!)
I For any k, V_k := [v_1 | ··· | v_k] (the top k eigenvectors) satisfies

    ∑_{i=1}^k v_i^T Σ v_i = tr(V_k^T Σ V_k) = max_{U ∈ R^{d×k} : U^T U = I} tr(U^T Σ U) = ∑_{i=1}^k λ_i

I Note: picking U = [u_1 | ··· | u_k] such that U^T U = I is the same as picking a k-dimensional subspace with u_1, …, u_k as an orthonormal basis
Principal component analysis
I k-dimensional principal components analysis (PCA) mapping:

    φ(x) = (x^T v_1, …, x^T v_k) = V_k^T x ∈ R^k

  where V_k = [v_1 | ··· | v_k] ∈ R^{d×k}
I (Only really makes sense when λ_k > 0.)
I This is a form of dimensionality reduction when k < d (a small sketch follows below).
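A sketch of the PCA mapping computed from the empirical covariance of synthetic data; the data-generating matrix below is an arbitrary choice, and the data are centered first so second moments equal covariances:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 1000, 5, 2

# Synthetic data with correlated coordinates (arbitrary example), then centered.
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))
X = X - X.mean(axis=0)

Sigma = X.T @ X / n                      # empirical covariance matrix
evals, evecs = np.linalg.eigh(Sigma)
Vk = evecs[:, ::-1][:, :k]               # top-k eigenvectors as columns

Phi = X @ Vk                             # rows are phi(x_i) = Vk^T x_i in R^k
print(Phi.shape)                         # (n, k): dimensionality reduction

# Covariance of the mapped data is diag(lambda_1, ..., lambda_k):
# the coordinates of the PCA representation are uncorrelated.
print(np.round(Phi.T @ Phi / n, 3))
print(np.round(evals[::-1][:k], 3))
```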
Figure 6: Fraction of residual variance from projections of varying dimension (x-axis: dimension of projection k; y-axis: fraction of residual variance; curves: coordinate projections vs. PCA projections)
Covariance of data upon PCA mapping
I Covariance of data upon PCA mapping:

    (1/n) ∑_{i=1}^n φ(x_i) φ(x_i)^T = (1/n) ∑_{i=1}^n V_k^T x_i x_i^T V_k = V_k^T Σ V_k = Λ_k

  where Λ_k is the diagonal matrix with λ_1, …, λ_k along the diagonal.
I In particular, coordinates in the φ(x)-representation are uncorrelated.
PCA and linear regression
I Use k-dimensional PCA mapping φ(x) = V_k^T x with ordinary least squares
I (Assume rank of A is at least k, so A^T A has λ_k > 0)
I Data matrix is

    (1/√n) [ ← φ(x_1)^T → ]          [ ← x_1^T V_k → ]
           [      ⋮       ] = (1/√n) [      ⋮        ] = A V_k ∈ R^{n×k}
           [ ← φ(x_n)^T → ]          [ ← x_n^T V_k → ]

I Therefore, OLS solution is

    β̂ = (V_k^T A^T A V_k)^{-1} (A V_k)^T b = Λ_k^{-1} V_k^T A^T b

  (Note: here β̂ ∈ R^k.)
Principal component regression
I Use β̂ = Λ_k^{-1} V_k^T A^T b to predict on new x ∈ R^d:

    φ(x)^T β̂ = (V_k^T x)^T Λ_k^{-1} V_k^T A^T b
             = x^T (V_k Λ_k^{-1} V_k^T) (A^T b)

I So the "effective" weight vector (that acts directly on x rather than φ(x)) is given by

    ŵ := (V_k Λ_k^{-1} V_k^T) (A^T b).

I This is called principal component regression (PCR) (here, k is the hyperparameter)
I Alternative hyper-parameterization: λ > 0; same as before, but using the largest k such that λ_k ≥ λ.
Spectral regularization
I PCR and ridge regression are examples of spectral regularization.
I For a function g: R → R, write g(M) to mean

    g(M) = ∑_{i=1}^d g(λ_i) v_i v_i^T

  where M has eigendecomposition M = ∑_{i=1}^d λ_i v_i v_i^T.
I I.e., g is applied to the eigenvalues of M
I Generalizes the effect of polynomials: e.g., g(z) = z² gives M² = (V Λ V^T)(V Λ V^T) = V Λ² V^T.
I Claim: Can write each of PCR and ridge regression as

    ŵ = g(A^T A) A^T b

  for an appropriate function g (depending on λ).
Comparing ridge regression and PCR
I ŵ = g(A^T A) A^T b (sketch below)
I Ridge regression (with parameter λ): g(z) = 1/(z + λ)
I PCR (with parameter λ): g(z) = 1{z ≥ λ} · (1/z)
I Interpretation:
  I Ridge artificially inflates the variance in all directions
  I PCR uses directions with sufficient variability; ignores the rest

Figure 7: Comparison of ridge regression and PCR (plot of g(z) vs. z)
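A sketch implementing both filters through the eigendecomposition of A^T A (synthetic data; the particular λ, dimensions, and seed below are arbitrary choices):

```python
import numpy as np

def spectral_weights(A, b, g):
    """Compute w = g(A^T A) A^T b by applying g to the eigenvalues of A^T A."""
    evals, V = np.linalg.eigh(A.T @ A)
    return V @ (g(evals) * (V.T @ (A.T @ b)))

rng = np.random.default_rng(0)
n, d = 200, 10
A = rng.standard_normal((n, d)) / np.sqrt(n)   # data matrix with the 1/sqrt(n) scaling
b = rng.standard_normal(n)
lam = 0.1

# Ridge regression: g(z) = 1 / (z + lambda)
w_ridge = spectral_weights(A, b, lambda z: 1.0 / (z + lam))

# PCR: g(z) = 1{z >= lambda} * (1 / z)
w_pcr = spectral_weights(A, b,
                         lambda z: np.where(z >= lam, 1.0 / np.maximum(z, 1e-12), 0.0))

# Sanity check: the ridge filter matches the closed form (A^T A + lambda I)^{-1} A^T b.
print(np.allclose(w_ridge, np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)))
```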
Matrix factorization
I Let A = [ ← x_1^T → ; … ; ← x_n^T → ] ∈ R^{n×d} (forget the 1/√n scaling)
I Try to approximate A with BC, where B ∈ R^{n×k} and C ∈ R^{k×d}, to minimize ‖A − BC‖²_F.
I Here, ‖·‖_F is a matrix norm called the Frobenius norm, which treats the n×d matrix as a vector in nd-dimensional Euclidean space
I Think of B as the encodings of the data in A
I "Dimension reduction" when k < d
I Theorem (Schmidt, 1907; Eckart–Young, 1936): Optimal solution is given by truncating the singular value decomposition (SVD) of A
Singular value decomposition
I Every matrix A ∈ R^{n×d}—say, with rank r—can be written as

    A = ∑_{i=1}^r σ_i u_i v_i^T

  where
  I σ_1 ≥ ··· ≥ σ_r > 0 (singular values)
  I u_1, …, u_r ∈ R^n (orthonormal left singular vectors)
  I v_1, …, v_r ∈ R^d (orthonormal right singular vectors)
I Can also write as A = U S V^T, where
  I U = [u_1 | ··· | u_r] ∈ R^{n×r}, satisfies U^T U = I
  I S = diag(σ_1, …, σ_r) ∈ R^{r×r}
  I V = [v_1 | ··· | v_r] ∈ R^{d×r}, satisfies V^T V = I
Truncated SVD
I Let A have SVD A = ∑_{i=1}^r σ_i u_i v_i^T (rank of A is r)
I Truncate at rank k (for any k ≤ r): rank-k SVD

    A_k := ∑_{i=1}^k σ_i u_i v_i^T

I Can write as A_k := U_k S_k V_k^T, where
  I U_k = [u_1 | ··· | u_k] ∈ R^{n×k}, satisfies U_k^T U_k = I
  I S_k = diag(σ_1, …, σ_k) ∈ R^{k×k}
  I V_k = [v_1 | ··· | v_k] ∈ R^{d×k}, satisfies V_k^T V_k = I
I Theorem (Schmidt/Eckart–Young):

    ‖A − A_k‖²_F = min_{M : rank(M) = k} ‖A − M‖²_F = ∑_{i=k+1}^r σ_i²

  (numerical check in the sketch below)
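A numerical check of the truncated SVD and the Schmidt/Eckart–Young identity; np.linalg.svd returns singular values in descending order (the matrix size and k below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # thin SVD: A = U S V^T

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # rank-k truncated SVD

# Squared Frobenius error equals the sum of the discarded squared singular values.
err = np.linalg.norm(A - A_k, ord="fro") ** 2
print(np.isclose(err, np.sum(s[k:] ** 2)))
```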
Encoder/decoder interpretation (1)
I Encoder: x ↦ φ(x) = V_k^T x ∈ R^k
I Encoding rows of A: A V_k = U_k S_k
I Decoder: z ↦ V_k z ∈ R^d
I Decoding rows of U_k S_k: U_k S_k V_k^T = A_k
I Same as the k-dimensional PCA mapping!
I A^T A = V S² V^T, so the eigenvectors of A^T A are the right singular vectors of A, and its non-zero eigenvalues are the squares of the singular values
I PCA/SVD finds the k-dimensional subspace with the smallest sum of squared distances to the data points.

Figure 8: Geometric picture of PCA
Encoder/decoder interpretation (2)
I Example: OCR data, compare original image to decoding of its k-dimensional PCA encoding (k ∈ {1, 10, 50, 200})

Figure 9: PCA compression of MNIST digit (original image and reconstructions labeled M = 1, 10, 50, 250)
Application: Topic modeling (1)
I Start with n documents, represent using "bag-of-words" count vectors
I Arrange in matrix A ∈ R^{n×d}, where d is the vocabulary size

           aardvark   abacus   abalone   ···
    doc1       3         0        0      ···
    doc2       7         0        4      ···
    doc3       2         4        0      ···
     ⋮         ⋮         ⋮        ⋮
Application: Topic modeling (2)
I Rank-k SVD provides an approximate factorization

    A ≈ BC

  where B ∈ R^{n×k} and C ∈ R^{k×d}
I This use of SVD is called Latent Semantic Analysis (LSA) (toy sketch below)
I Interpret rows of C as "topics"
I B_{i,t} is the "weight" of document i on topic t
I If rows of C were probability distributions, could interpret C_{t,w} as the probability that word w appears in topic t
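A toy LSA sketch on the tiny count matrix from the earlier slide (purely illustrative; real LSA pipelines typically reweight raw counts, e.g., with tf-idf, before factorizing):

```python
import numpy as np

# Toy bag-of-words count matrix A: rows = documents, columns = words.
vocab = ["aardvark", "abacus", "abalone"]
A = np.array([[3.0, 0.0, 0.0],
              [7.0, 0.0, 4.0],
              [2.0, 4.0, 0.0]])

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)

B = U[:, :k] * s[:k]   # document-by-topic weights (n x k); B[i, t] = weight of doc i on topic t
C = Vt[:k, :]          # topic-by-word matrix (k x d); rows are "topics"

print(np.round(B, 2))
print(np.round(C, 2))
print(np.linalg.norm(A - B @ C, ord="fro"))   # error of the rank-k approximation A ~ BC
```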
Application: Matrix completion (1)
I Start with ratings of movies given by users
I Arrange in a matrix A ∈ R^{n×d}, where A_{i,j} is the rating given by user i for movie j.
I Netflix: n = 480000, d = 18000; on average, each user rates 200 movies
I Most entries of A are unknown
I Idea: Approximate A with a low-rank matrix, i.e., find

    B = [ ← b_1^T → ; … ; ← b_n^T → ] ∈ R^{n×k},    C = [ c_1 | ··· | c_d ] ∈ R^{k×d}

  with the goal of minimizing ‖A − BC‖²_F
I Note: If all entries of A were observed, we could do this with truncated SVD.
Application: Matrix completion (2)
I Need to find a low-rank approximation without all of A: (low-rank) matrix completion
I Lots of ways to do this
I Popular way (used in Netflix competition): based on "stochastic gradient descent" (discussed later); a rough sketch follows below
I Another way: fill in missing entries with plug-in estimates (based on a statistical model), then compute truncated SVD as usual
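A rough sketch of the SGD approach on observed entries only (this is not the exact recipe used in the Netflix competition; the rank, step size, regularization, and synthetic data below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 40, 3

# Synthetic low-rank "ratings" matrix; only ~30% of entries are observed.
A_true = rng.standard_normal((n, k)) @ rng.standard_normal((k, d))
mask = rng.random((n, d)) < 0.3
obs = np.array(np.nonzero(mask)).T              # observed (i, j) index pairs

# Factors B (n x k) and C (k x d), fit by SGD on squared error over observed entries.
B = 0.1 * rng.standard_normal((n, k))
C = 0.1 * rng.standard_normal((k, d))
lr, reg = 0.02, 0.01

for epoch in range(50):
    for i, j in obs[rng.permutation(len(obs))]:
        b_i = B[i].copy()
        err = A_true[i, j] - b_i @ C[:, j]
        B[i] += lr * (err * C[:, j] - reg * b_i)       # gradient step for row (user) factor
        C[:, j] += lr * (err * b_i - reg * C[:, j])    # gradient step for column (movie) factor

# Since A_true is fully known here, we can measure error on the unobserved entries.
pred = B @ C
print(np.sqrt(np.mean((pred[~mask] - A_true[~mask]) ** 2)))
```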
Feature representations from matrix completion
I MovieLens data set (n = 6040 users, d = 3952 movies, 800000 observed ratings)
I Fit B and C by using a standard matrix completion method (based on SGD, discussed later)
I Are c_1, …, c_d ∈ R^k useful feature vectors for movies?
I Some nearest-neighbor pairs (c_j, NN(c_j)):
  I Toy Story (1995), Toy Story 2 (1999)
  I Sense and Sensibility (1995), Emma (1996)
  I Heat (1995), Carlito's Way (1993)
  I The Crow (1994), Blade (1998)
  I Forrest Gump (1994), Dances with Wolves (1990)
  I Mrs. Doubtfire (1993), The Bodyguard (1992)