
Machine learning lecture slides
COMS 4771 Fall 2020

Multivariate Gaussians and PCA

Outline
- Multivariate Gaussians
- Eigendecompositions and covariance matrices
- Principal component analysis
- Principal component regression and spectral regularization
- Singular value decomposition
- Examples: topic modeling and matrix completion

Multivariate Gaussians: Isotropic Gaussians
- Start with $X = (X_1, \dots, X_d) \sim N(0, I)$, i.e., $X_1, \dots, X_d$ are iid $N(0, 1)$ random variables.
- The probability density function is the product of the (univariate) Gaussian densities.
- $E(X_i) = 0$
- $\mathrm{var}(X_i) = \mathrm{cov}(X_i, X_i) = 1$, and $\mathrm{cov}(X_i, X_j) = 0$ for $i \neq j$
- Arrange these in the mean vector $E(X) = 0$ and covariance matrix $\mathrm{cov}(X) = I$

Figure 1: Density function for isotropic Gaussian in $\mathbb{R}^2$

Figure 2: Density function level sets for isotropic Gaussian in $\mathbb{R}^2$

Affine transformations of random vectors
- Start with any random vector $Z$, then apply a linear transformation followed by a translation:
  $X := MZ + \mu$, for $M \in \mathbb{R}^{k \times d}$ and $\mu \in \mathbb{R}^k$
- Fact: $E(X) = M E(Z) + \mu$ and $\mathrm{cov}(X) = M \, \mathrm{cov}(Z) \, M^T$
- E.g., let $u \in \mathbb{R}^d$ be a unit vector ($\|u\|_2 = 1$), and let $X := u^T Z$ (the projection of $Z$ along direction $u$). Then $E(X) = u^T E(Z)$ and $\mathrm{var}(X) = u^T \mathrm{cov}(Z) u$.
- Note: these transformations work for random vectors with any distribution, not just Gaussian distributions.
- However, it is convenient to illustrate the effect of these transformations on Gaussian distributions, since the "shape" of the Gaussian pdf is easy to understand. (A small numerical check follows below.)
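The following is a minimal NumPy sketch (not from the slides; the variable names and example numbers are my own) of the fact above: starting from $Z \sim N(0, I)$, the transformed vector $X = MZ + \mu$ has empirical mean close to $\mu$ and empirical covariance close to $M M^T$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 3, 2, 200_000

# Z ~ N(0, I) in R^d; rows are iid samples.
Z = rng.standard_normal((n, d))

# Arbitrary linear map M in R^{k x d} and translation mu in R^k.
M = rng.standard_normal((k, d))
mu = np.array([1.0, -2.0])

# Affine transformation X = M Z + mu, applied to each sample (row).
X = Z @ M.T + mu

# Empirical mean and covariance of X.
emp_mean = X.mean(axis=0)
centered = X - emp_mean
emp_cov = centered.T @ centered / n

print(np.round(emp_mean, 2))           # approximately mu
print(np.round(emp_cov, 2))            # approximately M M^T (since cov(Z) = I)
print(np.round(M @ M.T, 2))

# Variance along a unit direction u: var(u^T X) = u^T cov(X) u.
u = np.array([1.0, 1.0]) / np.sqrt(2.0)
print(np.var(X @ u), u @ emp_cov @ u)  # the two numbers should agree
```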

Multivariate Gaussians: General Gaussians
- If $Z \sim N(0, I)$ and $X = MZ + \mu$, we have $E(X) = \mu$ and $\mathrm{cov}(X) = M M^T$
- Assume $M \in \mathbb{R}^{d \times d}$ is invertible (else we get a degenerate Gaussian distribution).
- We say $X \sim N(\mu, M M^T)$
- Density function given by
  \[ p(x) = \frac{1}{(2\pi)^{d/2} \, |M M^T|^{1/2}} \exp\left( -\tfrac{1}{2} (x - \mu)^T (M M^T)^{-1} (x - \mu) \right). \]
- Note: every non-singular covariance matrix $\Sigma$ can be written as $M M^T$ for some non-singular matrix $M$. (We'll see why soon.)
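Here is a small sketch (mine, assuming NumPy and SciPy are available) that evaluates the density formula above directly and cross-checks it against `scipy.stats.multivariate_normal`.

```python
import numpy as np
from scipy.stats import multivariate_normal  # only used as a cross-check

def gaussian_pdf(x, mu, Sigma):
    """Density of N(mu, Sigma) at x, for a non-singular covariance Sigma."""
    d = mu.shape[0]
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

# Build Sigma = M M^T from an (almost surely) invertible M, as in the slides.
rng = np.random.default_rng(0)
d = 3
M = rng.standard_normal((d, d)) + 2 * np.eye(d)
mu = np.array([1.0, 0.0, -1.0])
Sigma = M @ M.T

x = np.array([0.5, -0.5, 0.0])
print(gaussian_pdf(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # should match
```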

Figure 3: Density function level sets for anisotropic Gaussian in $\mathbb{R}^2$

Inference with multivariate Gaussians (1)

- Bivariate case: $(X_1, X_2) \sim N(\mu, \Sigma)$ in $\mathbb{R}^2$, with
  \[ \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{1,1} & \Sigma_{1,2} \\ \Sigma_{2,1} & \Sigma_{2,2} \end{pmatrix} \]
- What is the distribution of $X_2$?
  - $N(\mu_2, \Sigma_{2,2})$
- What is the distribution of $X_2 \mid X_1 = x_1$?
  - Miracle 1: it is a Gaussian distribution
  - Miracle 2: its mean is given by the linear prediction of $X_2$ from $X_1$ with smallest MSE
  - Miracle 3: its variance doesn't depend on $x_1$

Inference with multivariate Gaussians (2)
- What is the distribution of $X_2 \mid X_1 = x_1$?
  - Miracle 1: it is a Gaussian distribution
  - Miracle 2: its mean is given by the linear prediction of $X_2$ from $X_1$ with smallest MSE
  - Miracle 3: its variance doesn't depend on $x_1$
- OLS with $X_1$ as the input variable and $X_2$ as the output variable: $x_1 \mapsto \hat{m} x_1 + \hat{\theta}$, where
  \[ \hat{m} = \frac{\mathrm{cov}(X_1, X_2)}{\mathrm{var}(X_1)} = \frac{\Sigma_{1,2}}{\Sigma_{1,1}}, \qquad \hat{\theta} = E(X_2) - \hat{m} E(X_1) = \mu_2 - \hat{m} \mu_1. \]
- Therefore:
  \[ E[X_2 \mid X_1 = x_1] = \hat{m} x_1 + \hat{\theta} = \mu_2 + \hat{m}(x_1 - \mu_1) = \mu_2 + \frac{\Sigma_{1,2}}{\Sigma_{1,1}} (x_1 - \mu_1) \]

Inference with multivariate Gaussians (3)
- What is the distribution of $X_2 \mid X_1 = x_1$?
  - Miracle 1: it is a Gaussian distribution
  - Miracle 2: its mean is given by the linear prediction of $X_2$ from $X_1$ with smallest MSE
  - Miracle 3: its variance doesn't depend on $x_1$
- Therefore:
  \[ \begin{aligned} \mathrm{var}(X_2 \mid X_1 = x_1) &= E[\mathrm{var}(X_2 \mid X_1)] \\ &= \mathrm{var}(X_2) - \mathrm{var}(E[X_2 \mid X_1]) \\ &= \Sigma_{2,2} - \mathrm{var}(\hat{m} X_1 + \hat{\theta}) \\ &= \Sigma_{2,2} - \hat{m}^2 \, \mathrm{var}(X_1) \\ &= \Sigma_{2,2} - \frac{\Sigma_{1,2}^2}{\Sigma_{1,1}^2} \Sigma_{1,1} \\ &= \Sigma_{2,2} - \frac{\Sigma_{1,2}^2}{\Sigma_{1,1}}. \end{aligned} \]

Inference with multivariate Gaussians (4)
- Beyond bivariate Gaussians: same as above, but writing things properly using matrix notation:
  \[ E[X_2 \mid X_1 = x_1] = \mu_2 + \Sigma_{2,1} \Sigma_{1,1}^{-1} (x_1 - \mu_1) \]
  \[ \mathrm{cov}(X_2 \mid X_1 = x_1) = \Sigma_{2,2} - \Sigma_{2,1} \Sigma_{1,1}^{-1} \Sigma_{1,2} \]
  (see the numerical sketch below)
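A minimal NumPy sketch of the conditioning formulas above (not from the slides; the covariance values and variable names are my own): partition a covariance matrix, compute the conditional mean and covariance of $X_2$ given $X_1 = x_1$, and compare against samples whose first coordinate is near $x_1$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Block structure: X_1 is the first coordinate, X_2 the remaining two.
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.8, 0.3],
                  [0.8, 1.5, 0.4],
                  [0.3, 0.4, 1.0]])
S11, S12 = Sigma[:1, :1], Sigma[:1, 1:]
S21, S22 = Sigma[1:, :1], Sigma[1:, 1:]

x1 = np.array([1.2])  # value we condition on

# Conditional mean and covariance from the formulas on the slide.
cond_mean = mu[1:] + S21 @ np.linalg.solve(S11, x1 - mu[:1])
cond_cov = S22 - S21 @ np.linalg.solve(S11, S12)

# Monte Carlo check: keep samples whose X_1 is close to x1.
L = np.linalg.cholesky(Sigma)
X = mu + rng.standard_normal((500_000, 3)) @ L.T
near = np.abs(X[:, 0] - x1[0]) < 0.05
print(np.round(cond_mean, 3), np.round(X[near, 1:].mean(axis=0), 3))
print(np.round(cond_cov, 3))
print(np.round(np.cov(X[near, 1:].T), 3))
```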

Eigendecomposition (1)
- Every symmetric matrix $M \in \mathbb{R}^{d \times d}$ has $d$ real eigenvalues, which we arrange as
  \[ \lambda_1 \geq \dots \geq \lambda_d \]
- Can choose corresponding orthonormal eigenvectors $v_1, \dots, v_d \in \mathbb{R}^d$
- This means
  \[ M v_i = \lambda_i v_i \]
  for each $i = 1, \dots, d$, and
  \[ v_i^T v_j = \mathbf{1}\{i = j\} \]

Eigendecomposition (2)
- Arrange $v_1, \dots, v_d$ in an orthogonal matrix $V := [v_1 | \cdots | v_d]$
- $V^T V = I$ and $V V^T = \sum_{i=1}^d v_i v_i^T = I$
- Therefore,
  \[ M = M V V^T = \sum_{i=1}^d M v_i v_i^T = \sum_{i=1}^d \lambda_i v_i v_i^T \]
- This is our preferred way to express the eigendecomposition
- Also called the spectral decomposition
- Can also write $M = V \Lambda V^T$, where $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_d)$
- The matrix $V$ diagonalizes $M$:
  \[ V^T M V = \Lambda \]
  (a quick numerical check follows)
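Here is a short NumPy illustration of the decomposition above (my own sketch, not part of the slides), using `np.linalg.eigh` for symmetric matrices; note that `eigh` returns eigenvalues in ascending order, so they are flipped to match the $\lambda_1 \geq \dots \geq \lambda_d$ convention.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# A random symmetric matrix M.
B = rng.standard_normal((d, d))
M = (B + B.T) / 2

# eigh is for symmetric matrices; eigenvalues come back ascending.
evals, V = np.linalg.eigh(M)
evals, V = evals[::-1], V[:, ::-1]          # reorder so lambda_1 >= ... >= lambda_d
Lam = np.diag(evals)

print(np.allclose(V.T @ V, np.eye(d)))      # orthonormal eigenvectors
print(np.allclose(M, V @ Lam @ V.T))        # M = V Lambda V^T
print(np.allclose(V.T @ M @ V, Lam))        # V diagonalizes M

# Same thing written as a sum of rank-one terms lambda_i v_i v_i^T.
M_sum = sum(evals[i] * np.outer(V[:, i], V[:, i]) for i in range(d))
print(np.allclose(M, M_sum))
```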

Covariance matrix (1)
- $A \in \mathbb{R}^{n \times d}$ is the data matrix (with rows $x_i^T / \sqrt{n}$, so the $1/n$ scaling below works out)
- $\Sigma := A^T A = \frac{1}{n} \sum_{i=1}^n x_i x_i^T$ is the (empirical) second-moment matrix
- If $\frac{1}{n} \sum_{i=1}^n x_i = 0$ (data are "centered"), this is the (empirical) covariance matrix
- For purposes of exposition, we just say/write "(co)variance" even though "second-moment" is technically correct
- For any unit vector $u \in \mathbb{R}^d$,
  \[ u^T \Sigma u = \frac{1}{n} \sum_{i=1}^n (u^T x_i)^2 \]
  is the (empirical) variance of the data along direction $u$ (see the sketch below)
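A minimal NumPy sketch of the (co)variance computation above (my own, not from the slides): build $\Sigma$ from centered data with the $1/n$ convention used here, and check that $u^T \Sigma u$ matches the empirical variance of the projections $u^T x_i$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5

X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))  # raw data, rows x_i^T
X = X - X.mean(axis=0)                 # center so second moment = covariance

A = X / np.sqrt(n)                     # data matrix with the 1/sqrt(n) scaling
Sigma = A.T @ A                        # = (1/n) sum_i x_i x_i^T

u = rng.standard_normal(d)
u /= np.linalg.norm(u)                 # unit vector

var_along_u = np.mean((X @ u) ** 2)    # (1/n) sum_i (u^T x_i)^2
print(np.isclose(u @ Sigma @ u, var_along_u))      # True

# Note: np.cov uses the 1/(n-1) convention by default; bias=True gives 1/n.
print(np.allclose(Sigma, np.cov(X.T, bias=True)))
```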

Covariance matrix (2)
- Note: some pixels in the OCR data have very little (or zero!) variation
- These are "coordinate directions" (e.g., $u = (1, 0, \dots, 0)$)
- We probably can/should ignore these!

Figure 4: Which pixels are likely to have very little variance?

Top eigenvector
- $\Sigma$ is symmetric, so we can write its eigendecomposition
  \[ \Sigma = \sum_{i=1}^d \lambda_i v_i v_i^T \]
- In which direction is the variance maximized?
- Answer: $v_1$, corresponding to the largest eigenvalue $\lambda_1$
  - Called the top eigenvector
- This follows from the following characterization of $v_1$ (checked numerically below):
  \[ v_1^T \Sigma v_1 = \max_{u \in \mathbb{R}^d : \|u\|_2 = 1} u^T \Sigma u = \lambda_1. \]
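The following sketch (mine, assuming NumPy) compares $u^T \Sigma u$ for many random unit vectors $u$ against $v_1^T \Sigma v_1 = \lambda_1$; no random direction should beat the top eigenvector.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6

# A random symmetric PSD "covariance" matrix.
B = rng.standard_normal((d, d))
Sigma = B.T @ B / d

evals, V = np.linalg.eigh(Sigma)
lam1, v1 = evals[-1], V[:, -1]            # largest eigenvalue and its eigenvector

print(np.isclose(v1 @ Sigma @ v1, lam1))  # v_1^T Sigma v_1 = lambda_1

# Random unit vectors never exceed lambda_1 (up to floating point).
U = rng.standard_normal((10_000, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)
vals = np.einsum("nd,de,ne->n", U, Sigma, U)   # u^T Sigma u for each row u
print(vals.max() <= lam1 + 1e-12, lam1)
```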

Figure 5: What is the direction of the top eigenvector for the covariance of this Gaussian?

Top k eigenvectors
- What about among directions orthogonal to $v_1$?
- Answer: $v_2$, corresponding to the second largest eigenvalue $\lambda_2$
- (Note: all eigenvalues of $\Sigma$ are non-negative!)
- For any $k$, the matrix $V_k := [v_1 | \cdots | v_k]$ of the top $k$ eigenvectors satisfies
  \[ \sum_{i=1}^k v_i^T \Sigma v_i = \mathrm{tr}(V_k^T \Sigma V_k) = \max_{U \in \mathbb{R}^{d \times k} : U^T U = I} \mathrm{tr}(U^T \Sigma U) = \sum_{i=1}^k \lambda_i \]
  (checked numerically below)
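A quick numerical illustration of the trace characterization above (my own sketch): $\mathrm{tr}(V_k^T \Sigma V_k)$ equals the sum of the top $k$ eigenvalues, and a random orthonormal $U$ (obtained here via a QR factorization) never does better.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 3

B = rng.standard_normal((d, d))
Sigma = B.T @ B / d                      # symmetric PSD "covariance"

evals, V = np.linalg.eigh(Sigma)         # ascending eigenvalues
Vk = V[:, -k:]                           # top-k eigenvectors
top_k_sum = evals[-k:].sum()

print(np.isclose(np.trace(Vk.T @ Sigma @ Vk), top_k_sum))

# Random orthonormal U in R^{d x k}; its objective never exceeds the top-k sum.
best = -np.inf
for _ in range(5000):
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    best = max(best, np.trace(Q.T @ Sigma @ Q))
print(best <= top_k_sum + 1e-12, best, top_k_sum)
```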

Principal component analysis
- $k$-dimensional principal component analysis (PCA) mapping:
  \[ \phi(x) = (x^T v_1, \dots, x^T v_k) = V_k^T x \in \mathbb{R}^k \]
  where $V_k = [v_1 | \cdots | v_k] \in \mathbb{R}^{d \times k}$
- (Only really makes sense when $\lambda_k > 0$.)
- This is a form of dimensionality reduction when $k < d$.

Figure 6: Fraction of residual variance from projections of varying dimension (coordinate projections vs. PCA projections)

Covariance of data upon PCA mapping

- Covariance of the data upon the PCA mapping:
  \[ \frac{1}{n} \sum_{i=1}^n \phi(x_i) \phi(x_i)^T = \frac{1}{n} \sum_{i=1}^n V_k^T x_i x_i^T V_k = V_k^T \Sigma V_k = \Lambda_k, \]
  where $\Lambda_k$ is the diagonal matrix with $\lambda_1, \dots, \lambda_k$ along the diagonal.
- In particular, the coordinates in the $\phi(x)$-representation are uncorrelated.

PCA and linear regression

- Use the $k$-dimensional PCA mapping $\phi(x) = V_k^T x$ with ordinary least squares
- (Assume the rank of $A$ is at least $k$, so $A^T A$ has $\lambda_k > 0$)
- The data matrix is
  \[ \frac{1}{\sqrt{n}} \begin{pmatrix} \phi(x_1)^T \\ \vdots \\ \phi(x_n)^T \end{pmatrix} = \frac{1}{\sqrt{n}} \begin{pmatrix} x_1^T V_k \\ \vdots \\ x_n^T V_k \end{pmatrix} = A V_k \in \mathbb{R}^{n \times k} \]
- Therefore, the OLS solution is
  \[ \hat{\beta} = (V_k^T A^T A V_k)^{-1} (A V_k)^T b = \Lambda_k^{-1} V_k^T A^T b \]
  (Note: here $\hat{\beta} \in \mathbb{R}^k$; a numerical sketch follows below.)
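Below is a small NumPy sketch (my own; names like `Phi` and `beta_hat` are not from the slides) that computes the PCA mapping from the eigendecomposition of $\Sigma = A^T A$, checks that the mapped coordinates have diagonal covariance $\Lambda_k$, and forms the OLS solution on the PCA features.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 500, 10, 3

X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))
X = X - X.mean(axis=0)                    # center
A = X / np.sqrt(n)                        # scaled data matrix
Sigma = A.T @ A

evals, V = np.linalg.eigh(Sigma)
evals, V = evals[::-1], V[:, ::-1]        # descending eigenvalues
Vk, Lam_k = V[:, :k], np.diag(evals[:k])

# PCA mapping phi(x) = V_k^T x, applied to all rows of X.
Phi = X @ Vk

# Covariance of the mapped data equals Lambda_k (up to floating point).
print(np.allclose(Phi.T @ Phi / n, Lam_k))

# OLS on the PCA features, for some response vector b (scaled like A).
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
b = y / np.sqrt(n)
beta_hat = np.linalg.inv(Lam_k) @ Vk.T @ A.T @ b
beta_lstsq, *_ = np.linalg.lstsq(A @ Vk, b, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))  # matches the generic least-squares solver
```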

Principal component regression
- Use $\hat{\beta} = \Lambda_k^{-1} V_k^T A^T b$ to predict on a new $x \in \mathbb{R}^d$:
  \[ \phi(x)^T \hat{\beta} = (V_k^T x)^T \Lambda_k^{-1} V_k^T A^T b = x^T (V_k \Lambda_k^{-1} V_k^T)(A^T b) \]
- So the "effective" weight vector (which acts directly on $x$ rather than $\phi(x)$) is given by
  \[ \hat{w} := (V_k \Lambda_k^{-1} V_k^T)(A^T b). \]
- This is called principal component regression (PCR); here $k$ is a hyperparameter (a sketch follows below)
- Alternative hyperparameterization: $\lambda > 0$; same as before, but using the largest $k$ such that $\lambda_k \geq \lambda$.
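Continuing in the same spirit (again my own code, not from the slides), the PCR weight vector $\hat{w} = V_k \Lambda_k^{-1} V_k^T A^T b$ acts directly on $x$ and gives the same predictions as applying $\hat{\beta}$ to $\phi(x)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 500, 10, 3

X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))
X = X - X.mean(axis=0)
A = X / np.sqrt(n)
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
b = y / np.sqrt(n)

evals, V = np.linalg.eigh(A.T @ A)
evals, V = evals[::-1], V[:, ::-1]
Vk, inv_Lam_k = V[:, :k], np.diag(1.0 / evals[:k])

# PCR "effective" weight vector acting on raw x.
w_hat = Vk @ inv_Lam_k @ Vk.T @ (A.T @ b)

# Same predictions as the PCA-feature regression phi(x)^T beta_hat.
beta_hat = inv_Lam_k @ Vk.T @ A.T @ b
x_new = rng.standard_normal(d)
print(np.isclose(x_new @ w_hat, (Vk.T @ x_new) @ beta_hat))   # True
```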

Spectral regularization
- PCR and ridge regression are examples of spectral regularization.
- For a function $g \colon \mathbb{R} \to \mathbb{R}$, write $g(M)$ to mean
  \[ g(M) = \sum_{i=1}^d g(\lambda_i) v_i v_i^T \]
  where $M$ has eigendecomposition $M = \sum_{i=1}^d \lambda_i v_i v_i^T$.
- I.e., $g$ is applied to the eigenvalues of $M$.
- This generalizes the effect of polynomials: e.g., for $g(z) = z^2$,
  \[ M^2 = (V \Lambda V^T)(V \Lambda V^T) = V \Lambda^2 V^T. \]
- Claim: each of PCR and ridge regression can be written as
  \[ \hat{w} = g(A^T A) A^T b \]
  for an appropriate function $g$ (depending on $\lambda$); see the sketch below.
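A minimal implementation of the spectral mapping $g(M) = \sum_i g(\lambda_i) v_i v_i^T$ (my own sketch, assuming NumPy), checked on the polynomial example $g(z) = z^2$ and on a shifted-inverse example.

```python
import numpy as np

def apply_spectral(g, M):
    """Apply a scalar function g to the eigenvalues of a symmetric matrix M."""
    evals, V = np.linalg.eigh(M)
    return (V * g(evals)) @ V.T      # same as V @ diag(g(evals)) @ V^T

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
M = (B + B.T) / 2

# g(z) = z^2 reproduces the matrix square.
print(np.allclose(apply_spectral(lambda z: z ** 2, M), M @ M))

# g(z) = 1/(z + 10) reproduces the inverse of (M + 10 I).
print(np.allclose(apply_spectral(lambda z: 1.0 / (z + 10.0), M),
                  np.linalg.inv(M + 10.0 * np.eye(5))))
```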

Comparing ridge regression and PCR
- $\hat{w} = g(A^T A) A^T b$
- Ridge regression (with parameter $\lambda$): $g(z) = \dfrac{1}{z + \lambda}$
- PCR (with parameter $\lambda$): $g(z) = \mathbf{1}\{z \geq \lambda\} \cdot \dfrac{1}{z}$
- Interpretation:
  - PCR uses directions with sufficient variability and ignores the rest
  - Ridge artificially inflates the variance in all directions

Figure 7: Comparison of ridge regression and PCR (plot of $g(z)$ against $z$ for both methods)
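Building on the `apply_spectral` helper sketched earlier (my own construction, not from the slides), the two regularizers differ only in the scalar function $g$; the ridge version can be cross-checked against the usual closed form $(A^T A + \lambda I)^{-1} A^T b$.

```python
import numpy as np

def apply_spectral(g, M):
    evals, V = np.linalg.eigh(M)
    return (V * g(evals)) @ V.T

rng = np.random.default_rng(0)
n, d, lam = 200, 8, 0.1

A = rng.standard_normal((n, d)) / np.sqrt(n)
b = rng.standard_normal(n) / np.sqrt(n)

# Ridge: g(z) = 1 / (z + lambda).
w_ridge = apply_spectral(lambda z: 1.0 / (z + lam), A.T @ A) @ A.T @ b
w_ridge_direct = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)
print(np.allclose(w_ridge, w_ridge_direct))

# PCR: g(z) = 1{z >= lambda} * (1/z); keep only directions with enough variance.
def g_pcr(z, lam=lam):
    out = np.zeros_like(z)
    keep = z >= lam
    out[keep] = 1.0 / z[keep]
    return out

w_pcr = apply_spectral(g_pcr, A.T @ A) @ A.T @ b
print(w_ridge[:3], w_pcr[:3])   # the two weight vectors generally differ
```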

Matrix factorization
- Let $A = \begin{pmatrix} x_1^T \\ \vdots \\ x_n^T \end{pmatrix} \in \mathbb{R}^{n \times d}$ (forget the $1/\sqrt{n}$ scaling)
- Try to approximate $A$ with $BC$, where $B \in \mathbb{R}^{n \times k}$ and $C \in \mathbb{R}^{k \times d}$, so as to minimize $\|A - BC\|_F^2$.
- Here, $\| \cdot \|_F$ is the matrix norm called the Frobenius norm, which treats an $n \times d$ matrix as a vector in $nd$-dimensional Euclidean space
- Think of $B$ as the encodings of the data in $A$
- "Dimension reduction" when $k < d$
- Theorem (Schmidt, 1907; Eckart-Young, 1936): the optimal solution is given by truncating the singular value decomposition (SVD) of $A$

Singular value decomposition

- Every matrix $A \in \mathbb{R}^{n \times d}$ (say, with rank $r$) can be written as
  \[ A = \sum_{i=1}^r \sigma_i u_i v_i^T \]
  where
  - $\sigma_1 \geq \dots \geq \sigma_r > 0$ (singular values)
  - $u_1, \dots, u_r \in \mathbb{R}^n$ (orthonormal left singular vectors)
  - $v_1, \dots, v_r \in \mathbb{R}^d$ (orthonormal right singular vectors)
- Can also write this as
  \[ A = U S V^T \]
  where
  - $U = [u_1 | \cdots | u_r] \in \mathbb{R}^{n \times r}$ satisfies $U^T U = I$
  - $S = \mathrm{diag}(\sigma_1, \dots, \sigma_r) \in \mathbb{R}^{r \times r}$
  - $V = [v_1 | \cdots | v_r] \in \mathbb{R}^{d \times r}$ satisfies $V^T V = I$

Truncated SVD
- Let $A$ have SVD $A = \sum_{i=1}^r \sigma_i u_i v_i^T$ (the rank of $A$ is $r$)
- Truncate at rank $k$ (for any $k \leq r$): the rank-$k$ SVD is
  \[ A_k := \sum_{i=1}^k \sigma_i u_i v_i^T \]
- Can write this as $A_k := U_k S_k V_k^T$, where
  - $U_k = [u_1 | \cdots | u_k] \in \mathbb{R}^{n \times k}$ satisfies $U_k^T U_k = I$
  - $S_k = \mathrm{diag}(\sigma_1, \dots, \sigma_k) \in \mathbb{R}^{k \times k}$
  - $V_k = [v_1 | \cdots | v_k] \in \mathbb{R}^{d \times k}$ satisfies $V_k^T V_k = I$
- Theorem (Schmidt/Eckart-Young):
  \[ \|A - A_k\|_F^2 = \min_{M : \mathrm{rank}(M) = k} \|A - M\|_F^2 = \sum_{i=k+1}^r \sigma_i^2 \]
  (see the numerical check below)
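A short NumPy check of the SVD and of the Schmidt/Eckart-Young error formula above (my own sketch): `np.linalg.svd` returns $U$, the singular values, and $V^T$; truncate at rank $k$ and compare Frobenius errors.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 12, 7, 3

A = rng.standard_normal((n, d))

# Reduced SVD: U is n x d, s holds the singular values, Vt is d x d (since n >= d).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.allclose(A, (U * s) @ Vt))               # A = U S V^T

# Rank-k truncation A_k = U_k S_k V_k^T.
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]

frob_err = np.linalg.norm(A - A_k, "fro") ** 2
print(np.isclose(frob_err, np.sum(s[k:] ** 2)))   # = sum of discarded sigma_i^2
```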

Encoder/decoder interpretation (1)
- Encoder: $x \mapsto \phi(x) = V_k^T x \in \mathbb{R}^k$
  - Encoding the rows of $A$: $A V_k = U_k S_k$
- Decoder: $z \mapsto V_k z \in \mathbb{R}^d$
  - Decoding the rows of $U_k S_k$: $U_k S_k V_k^T = A_k$
- Same as the $k$-dimensional PCA mapping!
- $A^T A = V S^2 V^T$, so the eigenvectors of $A^T A$ are the right singular vectors of $A$, and its non-zero eigenvalues are the squares of the singular values
- PCA/SVD finds the $k$-dimensional subspace with the smallest sum of squared distances to the data points.

Figure 8: Geometric picture of PCA

Encoder/decoder interpretation (2)
- Example: OCR data; compare the original image to the decoding of its $k$-dimensional PCA encoding ($k \in \{1, 10, 50, 200\}$); see the sketch below

Figure 9: PCA compression of an MNIST digit (original vs. reconstructions)
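A sketch of the encode/decode round trip (my own; it assumes a data matrix `X` of flattened 28x28 digit images, and a random stand-in is used here in place of real MNIST data): project the centered data onto the top-$k$ right singular vectors and map back.

```python
import numpy as np

# Assume X is an (n, 784) array of flattened 28x28 digit images (loading omitted).
rng = np.random.default_rng(0)
X = rng.random((1000, 784))          # stand-in data; replace with real images

mean = X.mean(axis=0)
Xc = X - mean                        # center the data

# Top-k right singular vectors of the centered data = top-k PCA directions.
k = 50
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Vk = Vt[:k].T                        # shape (784, k)

# Encode then decode one image.
x = X[0]
code = Vk.T @ (x - mean)             # phi(x) in R^k
recon = mean + Vk @ code             # back in R^784

err = np.linalg.norm(x - recon) / np.linalg.norm(x)
print(code.shape, round(float(err), 3))
# recon.reshape(28, 28) can then be displayed next to x.reshape(28, 28).
```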

Application: Topic modeling (1)
- Start with $n$ documents, represented using "bag-of-words" count vectors
- Arrange these in a matrix $A \in \mathbb{R}^{n \times d}$, where $d$ is the vocabulary size:

           aardvark   abacus   abalone   ...
  doc1         3         0        0      ...
  doc2         7         0        4      ...
  doc3         2         4        0      ...
   ...

Application: Topic modeling (2)
- A rank-$k$ SVD provides an approximate factorization $A \approx BC$, where $B \in \mathbb{R}^{n \times k}$ and $C \in \mathbb{R}^{k \times d}$
- This use of the SVD is called Latent Semantic Analysis (LSA); a small sketch follows below
- Interpret the rows of $C$ as "topics"
- $B_{i,t}$ is the "weight" of document $i$ on topic $t$
- If the rows of $C$ were probability distributions, we could interpret $C_{t,w}$ as the probability that word $w$ appears in topic $t$
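A toy LSA sketch (mine; the tiny vocabulary and counts are made up purely for illustration): build a small document-word count matrix, take a rank-2 truncated SVD, and read off the document weights $B$ and topic rows $C$.

```python
import numpy as np

# Toy document-word count matrix (rows: documents, columns: vocabulary).
vocab = ["ball", "game", "score", "stock", "market", "price"]
A = np.array([
    [3, 2, 1, 0, 0, 0],   # sports-ish document
    [2, 3, 2, 0, 1, 0],   # sports-ish document
    [0, 0, 1, 3, 2, 2],   # finance-ish document
    [0, 1, 0, 2, 3, 3],   # finance-ish document
], dtype=float)

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)

B = U[:, :k] * s[:k]      # document "weights" on the k latent topics (n x k)
C = Vt[:k, :]             # topic rows (k x d)

print(np.round(B, 2))
print(np.round(C, 2))
print(np.linalg.norm(A - B @ C, "fro"))   # approximation error of the rank-k LSA
for t in range(k):
    top = np.argsort(-np.abs(C[t]))[:3]
    print(f"topic {t}:", [vocab[j] for j in top])
```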

Application: Matrix completion (1)
- Start with ratings of movies given by users
- Arrange these in a matrix $A \in \mathbb{R}^{n \times d}$, where $A_{i,j}$ is the rating given by user $i$ for movie $j$
- Netflix: $n = 480{,}000$, $d = 18{,}000$; on average, each user rates 200 movies
- Most entries of $A$ are unknown
- Idea: approximate $A$ with a low-rank matrix, i.e., find
  \[ B = \begin{pmatrix} b_1^T \\ \vdots \\ b_n^T \end{pmatrix} \in \mathbb{R}^{n \times k}, \qquad C = [\, c_1 \mid \cdots \mid c_d \,] \in \mathbb{R}^{k \times d}, \]
  with the goal of minimizing $\|A - BC\|_F^2$
- Note: if all entries of $A$ were observed, we could do this with the truncated SVD.

Application: Matrix completion (2)
- Need to find a low-rank approximation without all of $A$: (low-rank) matrix completion
- Lots of ways to do this
  - Popular way (used in the Netflix competition): based on "stochastic gradient descent" (discussed later); a rough sketch appears below
  - Another way: fill in the missing entries with plug-in estimates (based on a statistical model), then compute the truncated SVD as usual
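Below is a rough sketch of SGD-style matrix completion (my own illustration, not the specific method used in the lecture or in the Netflix competition): only observed entries $(i, j)$ contribute to the squared-error objective $\sum_{(i,j) \in \Omega} (A_{i,j} - b_i^T c_j)^2$, and each step updates one row of $B$ and one column of $C$. The step size and epoch count are arbitrary choices for this synthetic example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 60, 40, 3

# Synthetic low-rank ratings matrix, with only 30% of entries observed.
A_true = rng.standard_normal((n, k)) @ rng.standard_normal((k, d))
observed = rng.random((n, d)) < 0.3
omega = np.argwhere(observed)                 # list of observed (i, j) pairs

B = 0.1 * rng.standard_normal((n, k))
C = 0.1 * rng.standard_normal((k, d))
lr = 0.02

for epoch in range(50):
    rng.shuffle(omega)                        # visit observed entries in random order
    for i, j in omega:
        err = A_true[i, j] - B[i] @ C[:, j]   # residual on one observed entry
        b_i = B[i].copy()
        B[i] += lr * err * C[:, j]            # gradient step on row b_i
        C[:, j] += lr * err * b_i             # gradient step on column c_j

rmse = np.sqrt(np.mean((A_true[observed] - (B @ C)[observed]) ** 2))
print(round(float(rmse), 3))                  # training error on observed entries
```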

Feature representations from matrix completion
- MovieLens data set ($n = 6040$ users, $d = 3952$ movies, $|\Omega| = 800{,}000$ ratings)
- Fit $B$ and $C$ using a standard matrix completion method (based on SGD, discussed later)
- Are $c_1, \dots, c_d \in \mathbb{R}^k$ useful feature vectors for movies?
- Some nearest-neighbor pairs $(c_j, \mathrm{NN}(c_j))$:
  - Toy Story (1995), Toy Story 2 (1999)
  - Sense and Sensibility (1995), Emma (1996)
  - Heat (1995), Carlito's Way (1993)
  - The Crow (1994), Blade (1998)
  - Forrest Gump (1994), Dances with Wolves (1990)
  - Mrs. Doubtfire (1993), The Bodyguard (1992)