
CSC 311: Introduction to Machine Learning
Lecture 8 – Multivariate Gaussians, GDA
Roger Grosse, University of Toronto, Fall 2021

Last week, we started our tour of probabilistic models and introduced the fundamental concepts in the discrete setting. This lecture covers:
Continuous random variables:
I Manipulating Gaussians to tackle interesting problems requires lots of linear algebra, so we’ll begin with a linear algebra review.
I Additional reference: See also Chapter 4 of Mathematics for
Machine Learning, by Deisenroth et al.
https://mml-book.github.io/
Regression: Linear regression as maximum likelihood estimation under a Gaussian distribution.
Generative classifier for continuous data: Gaussian discriminant analysis, a Bayes classifier for continuous variables.
Next week’s lecture (PCA) draws heavily on today’s linear algebra content, so be sure to review it offline.

Linear Algebra Review

Eigenvectors and Eigenvalues
Let B be a square matrix. An eigenvector of B is a nonzero vector v such that
Bv = λv
for a scalar λ, which is called an eigenvalue.
A matrix of size D × D has at most D distinct eigenvalues, but may have fewer.
I will have very little to say about the general case, since in this course we will only be concerned with the case of symmetric matrices, which is much simpler.
I Today’s tutorial covers the general case, as well as how to compute eigenvectors/eigenvalues.
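For concreteness, here is a minimal NumPy sketch (my own example, not from the slides) of computing eigenvalues and eigenvectors of a symmetric matrix:

```python
import numpy as np

# An arbitrary symmetric example matrix (assumption for illustration).
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# np.linalg.eigh is specialized for symmetric matrices: it returns real
# eigenvalues (in ascending order) and orthonormal eigenvectors (as columns).
eigvals, Q = np.linalg.eigh(A)
print(eigvals)   # [1. 3.]

# Check the defining property A v = lambda v for each eigenpair.
for lam, v in zip(eigvals, Q.T):
    assert np.allclose(A @ v, lam * v)
```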

Spectral Decomposition
If a matrix A is symmetric, then the situation is much simpler, due to a result called the Spectral Theorem.
I All of the eigenvalues are real-valued.
I There is a full set of linearly independent eigenvectors (i.e. D for a D × D matrix).
I I.e., these eigenvectors form a basis for RD.
I These eigenvectors can be chosen to be real-valued.
I The eigenvectors can be chosen to be orthonormal.
In this class, we will only need to use eigenvectors and eigenvalues in the symmetric case. But it’s important to remember why this case is so special.

Spectral Decomposition
Equivalently to the Spectral Theorem, a symmetric matrix A can be factorized with the Spectral Decomposition
A = QΛQ⊤
where:
I Q is an orthogonal matrix.
I The columns qi of Q are eigenvectors.
I Λ is a diagonal matrix.
I The diagonal entries λi are the corresponding eigenvalues.
Check that this is reasonable: AQ = QΛQ⊤Q = QΛ, i.e. Aqi = λi qi for each column qi.

Spectral Decomposition
Because A has a full set of orthonormal eigenvectors {qi}, we can use these as an orthonormal basis for RD.
I.e., a vector x can be written in an alternate coordinate system:
x = x̃1 q1 + · · · + x̃D qD
Converting between the two coordinate systems: x̃ = Q⊤x and x = Qx̃.
In the alternate coordinate system, A acts by rescaling the individual coordinates (i.e. “stretching” the space):
Ax = x̃1 Aq1 + · · · + x̃D AqD = λ1 x̃1 q1 + · · · + λD x̃D qD
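A small NumPy sketch of this change of coordinates (my own illustration; the matrix A below is an arbitrary symmetric example):

```python
import numpy as np

A = np.array([[2.0, 1.0],            # symmetric example matrix (assumption)
              [1.0, 2.0]])
eigvals, Q = np.linalg.eigh(A)       # A = Q diag(eigvals) Q^T

x = np.array([0.3, -1.2])            # an arbitrary vector

x_tilde = Q.T @ x                    # coordinates of x in the eigenbasis
Ax_rescaled = Q @ (eigvals * x_tilde)  # rescale each coordinate by lambda_i, map back

assert np.allclose(A @ x, Ax_rescaled)  # same as applying A directly
```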

PSD Matrices
Symmetric matrices are important because they represent quadratic forms, f(v) = v⊤Av.
Figure: quadratic forms f(v) = v⊤Av for positive definite, non-strictly PSD, negative definite, and indefinite A
If v⊤Av > 0 for all v ̸= 0, i.e. the quadratic form curves upwards, we say that A is positive definite and denote this A ≻ 0.
If v⊤Av ≥ 0 for all v, we say A is positive semidefinite (PSD), denoted A ≽ 0.
If v⊤Av < 0 for all v ≠ 0, we say A is negative definite, denoted A ≺ 0.
If v⊤Av can be positive or negative, then A is indefinite.

PSD Matrices
Exercise: Show from the definition that nonnegative linear combinations of PSD matrices are PSD.
Related: If A is a random matrix which is always PSD, then E[A] is PSD. (The discrete case is a special case of the above.)
Exercise: Show that for any matrix B, the matrix BB⊤ is PSD.
Corollary: For a random vector x, the covariance matrix Cov(x) = E[(x − μ)(x − μ)⊤] is a PSD matrix. (Special case of the above, since x − μ is a column vector, i.e. a D × 1 matrix.)

PSD Matrices
Claim: A is positive definite iff all of its eigenvalues are positive. It is PSD iff all of its eigenvalues are nonnegative.
I Expressing v in terms of the eigenbasis, ṽ = Q⊤v:
v⊤Av = v⊤QΛQ⊤v = ṽ⊤Λṽ = ∑i λi ṽi²
I This is positive (nonnegative) for all v iff all the λi are positive (nonnegative).

PSD Matrices
If A is positive definite, then the contours of the quadratic form are elliptical. If A is both diagonal and positive definite (i.e. its diagonal entries are positive), then the ellipses are axis-aligned.
A = [0.5 0; 0 1],    f(v) = v⊤Av = ∑i ai vi²

PSD Matrices
For general positive definite A = QΛQ⊤, the contours of the quadratic form are elliptical, and the principal axes of the ellipses are aligned with the eigenvectors.
A = [1 −1; −1 2],    f(v) = v⊤QΛQ⊤v = ṽ⊤Λṽ = ∑i λi ṽi²
In this example, λ1 > λ2.
All symmetric matrices are diagonal if you choose the right coordinate system.
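These facts are easy to check numerically. A rough NumPy sketch (my own, with arbitrary random data) verifying that BB⊤ is PSD, both through its eigenvalues and through its quadratic form:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(3, 3))          # arbitrary matrix
A = B @ B.T                          # symmetric and PSD by construction

eigvals = np.linalg.eigvalsh(A)      # eigenvalues of a symmetric matrix
print(eigvals)                       # all >= 0 (up to round-off)
assert np.all(eigvals >= -1e-10)

# The quadratic form v^T A v is nonnegative for random test vectors.
for _ in range(100):
    v = rng.normal(size=3)
    assert v @ A @ v >= -1e-10
```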

Matrix Powers
The Spectral Decomposition makes it easy to compute powers of a matrix. Observe that
A² = (QΛQ⊤)² = QΛ(Q⊤Q)ΛQ⊤ = QΛ²Q⊤, since Q⊤Q = I.
Iterating this, for any integer k > 0, Aᵏ = QΛᵏQ⊤.
Similarly, if A is invertible, then
A−1 = (Q⊤)−1Λ−1Q−1 = QΛ−1Q⊤.
If A is PSD, then we can easily define the matrix square root:
A1/2 = QΛ1/2Q⊤.
Observe that A1/2 is PSD and (A1/2)2 = A. This is the unique PSD matrix with this property (but we won’t show this).
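A minimal sketch (my own, using an arbitrary positive definite example) of the matrix square root and inverse via the spectral decomposition:

```python
import numpy as np

A = np.array([[2.0, 1.0],    # a positive definite example matrix (assumption)
              [1.0, 2.0]])

eigvals, Q = np.linalg.eigh(A)
A_sqrt = Q @ np.diag(np.sqrt(eigvals)) @ Q.T   # A^{1/2} = Q Lambda^{1/2} Q^T
A_inv  = Q @ np.diag(1.0 / eigvals) @ Q.T      # A^{-1}  = Q Lambda^{-1} Q^T

assert np.allclose(A_sqrt @ A_sqrt, A)
assert np.allclose(A_inv @ A, np.eye(2))
```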

Determinant
The determinant |B| of a square matrix B determines how volumes change under linear transformation by B.
The definition of the determinant is complicated, and we won’t need it in this course.
Figure: Mathematics for Machine Learning

Determinant
Some basic properties:
I |BC| = |B| · |C|
I |B| = 0 iff B is singular
I |B⁻¹| = |B|⁻¹ if B is invertible (nonsingular)
I |B⊤| = |B|
I If Q is orthogonal, then |Q| = ±1 (i.e. orthogonal transformations preserve volume)
I If Λ is diagonal with entries {λi}, then |Λ| = ∏i λi.
The determinant of a matrix equals the product of its eigenvalues. This is easy to show in the symmetric case:
|A| = |QΛQ⊤| = |Q| |Λ| |Q⊤| = |Λ| = ∏i λi.
Corollary: the determinant of a PSD matrix is nonnegative, and the determinant of a positive definite matrix is positive.
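A quick numerical check (my own example values) that the determinant equals the product of the eigenvalues:

```python
import numpy as np

A = np.array([[2.0, 1.0],            # symmetric example matrix (assumption)
              [1.0, 2.0]])

eigvals = np.linalg.eigvalsh(A)
print(np.prod(eigvals))              # 3.0 (product of eigenvalues)
print(np.linalg.det(A))              # 3.0 (same, up to round-off)
```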

Multivariate Gaussian Distribution

Univariate Gaussian distribution
Recall the Gaussian, or normal, distribution:
N(x; μ, σ²) = 1/(√(2π) σ) · exp(−(x − μ)²/(2σ²))
Parameterized by mean μ and variance σ2.
The Central Limit Theorem says that sums of lots of independent random variables are approximately Gaussian.
In machine learning, we use Gaussians a lot because they make the calculations easy.

Multivariate Mean and Covariance
Mean: μ = E[x] = (μ1, . . . , μD)⊤
Covariance:
Σ = Cov(x) = E[(x − μ)(x − μ)⊤] =
  [ σ1²   σ12  · · ·  σ1D
    σ21   σ2²  · · ·  σ2D
     ⋮      ⋮    ⋱      ⋮
    σD1   σD2  · · ·  σD² ]
The statistics (μ and Σ) uniquely define a multivariate Gaussian (or multivariate Normal) distribution, denoted N(μ, Σ) or N(x; μ, Σ).
I This is not true for distributions in general!

Multivariate Gaussian Distribution
PDF of the multivariate Gaussian distribution:
N(x; μ, Σ) = 1/((2π)^(d/2) |Σ|^(1/2)) · exp(−½ (x − μ)⊤ Σ⁻¹ (x − μ))
Compare to the univariate case (d = 1, Σ = σ²):
N(x; μ, σ²) = 1/(√(2π) σ) · exp(−(x − μ)²/(2σ²))
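As an illustration (my own sketch, with arbitrary example values for μ and Σ), the density can be evaluated directly from this formula:

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density N(x; mu, Sigma), evaluated from the formula above."""
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

mu = np.array([0.0, 0.0])            # example parameters (assumptions)
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
print(gaussian_pdf(np.array([0.2, -0.1]), mu, Sigma))
```

If SciPy is available, this should agree with scipy.stats.multivariate_normal(mean=mu, cov=Sigma).pdf(x).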

Bivariate Gaussian
Σ = [1 0; 0 1]    Σ = 0.5 [1 0; 0 1]    Σ = 2 [1 0; 0 1]
Figure: Probability density function
Figure: Contour plot of the pdf

Bivariate Gaussian
Σ = [1 0; 0 1]    Σ = [2 0; 0 1]    Σ = [1 0; 0 2]
Figure: Probability density function
Figure: Contour plot of the pdf

Bivariate Gaussian
Σ = [1 0; 0 1]    Σ = [1 0.5; 0.5 1] = Q1 [1.5 0; 0 0.5] Q1⊤    Σ = [1 0.8; 0.8 1] = Q2 [1.8 0; 0 0.2] Q2⊤
Test your intuition: Does Q1 = Q2?
Figure: Probability density function
Figure: Contour plot of the pdf

Bivariate Gaussian
Σ = [1 0; 0 1]    Σ = [1 0.5; 0.5 1] = Q1 [1.5 0; 0 0.5] Q1⊤    Σ = [1 −0.5; −0.5 1] = Q2 Λ Q2⊤
Test your intuition: Does Q1 = Q2? What are λ1 and λ2?
Figure: Probability density function
Figure: Contour plot of the pdf

Gaussian Intuition: (Multivariate) Shift + Scale
Recall that in the univariate case, all normal distributions are shaped like the standard normal distribution
The densities are related to the standard normal by a shift (μ), a scale (or stretch, or dilation) σ, and a normalization factor

Shift + Scale: Multivariate Case
Start with a standard (spherical) Gaussian x ∼ N (0, I). So E[x] = 0 and Cov(x) = I.
Consider what happens if we map x̂ = Sx + b. By linearity of expectation,
E[x̂] = S E[x] + b = b.
By the linear transformation rule for covariance, Cov(xˆ) = S Cov(x)S⊤ = SS⊤.
It’s possible to show that xˆ is also Gaussian distributed (but we won’t show this here).

Shift + Scale: Multivariate Case
E[Sx + b] = b,    Cov(Sx + b) = SS⊤.
In the univariate case, we obtain N(μ, σ²) by starting with N(0, 1), shifting by μ, and stretching by σ.
In the multivariate case, to obtain N(μ, Σ), we start with N(0, I) and shift by μ and scale by the matrix square root Σ^(1/2).
I Recall: Σ^(1/2) = QΛ^(1/2)Q⊤.
I Intuition: for each eigenvector qi with corresponding eigenvalue λi, we stretch by a factor of √λi in the direction qi.
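A small sketch (my own, with arbitrary example parameters) of sampling N(μ, Σ) by shift + scale of a standard normal:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])                 # example parameters (assumptions)
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])

# Matrix square root via the spectral decomposition.
eigvals, Q = np.linalg.eigh(Sigma)
S = Q @ np.diag(np.sqrt(eigvals)) @ Q.T    # Sigma^{1/2}

# Shift + scale a standard normal: x_hat = S x + mu.
x = rng.normal(size=(100000, 2))           # rows ~ N(0, I)
x_hat = x @ S.T + mu                       # rows ~ N(mu, Sigma)

print(x_hat.mean(axis=0))                  # approximately mu
print(np.cov(x_hat, rowvar=False))         # approximately Sigma
```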

Gaussian Maximum Likelihood
Suppose we want to model the distribution of highest and lowest temperatures in Toronto in March, and we've recorded the following observations:
(-2.5, -7.5)  (-9.9, -14.9)  (-12.1, -17.5)  (-8.9, -13.9)  (-6.0, -11.1)
Assume they're drawn from a Gaussian distribution with mean μ and covariance Σ. We want to estimate these using data.
Log-likelihood function:
ℓ(μ, Σ) = log ∏_{i=1}^N [ 1/((2π)^(d/2) |Σ|^(1/2)) · exp(−½ (x(i) − μ)⊤ Σ⁻¹ (x(i) − μ)) ]
        = ∑_{i=1}^N log [ 1/((2π)^(d/2) |Σ|^(1/2)) · exp(−½ (x(i) − μ)⊤ Σ⁻¹ (x(i) − μ)) ]
        = ∑_{i=1}^N [ −log (2π)^(d/2) − log |Σ|^(1/2) − ½ (x(i) − μ)⊤ Σ⁻¹ (x(i) − μ) ]
Optional intuition building: why does |Σ|^(1/2) show up in the Gaussian density p(x)?
Hint: determinant is product of eigenvalues

Gaussian Maximum Likelihood
Maximize the log-likelihood by setting the derivative to zero:
0 = dℓ/dμ = −∑_{i=1}^N d/dμ [ ½ (x(i) − μ)⊤ Σ⁻¹ (x(i) − μ) ]
          = ∑_{i=1}^N Σ⁻¹ (x(i) − μ)
Here we use the identity ∇x x⊤Ax = 2Ax for symmetric A. Solving, we get μ̂ = (1/N) ∑_{i=1}^N x(i). In general, "hat" means estimator.
This is just the sample mean of the observed values, or the empirical mean.

Gaussian Maximum Likelihood
We can do a similar calculation for the covariance matrix Σ (we skip the details).
Setting the partial derivatives to zero, just like before, we get:
0 = ∂ℓ/∂Σ  =⇒  Σ̂ = (1/N) ∑_{i=1}^N (x(i) − μ̂)(x(i) − μ̂)⊤ = (1/N) (X − 1μ̂⊤)⊤(X − 1μ̂⊤)
where 1 is an N-dimensional vector of 1s.
This is called the empirical covariance and comes up quite often
(e.g., PCA soon!)
Derivation in multivariate case is tedious. No need to worry about it. But it is good practice to derive this in one dimension. See supplement (next slide).
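As a concrete check (my own sketch), here are the ML estimates computed on the five temperature observations from the earlier slide:

```python
import numpy as np

# The five (high, low) temperature observations from the slide.
X = np.array([[-2.5,  -7.5],
              [-9.9, -14.9],
              [-12.1, -17.5],
              [-8.9, -13.9],
              [-6.0, -11.1]])

N = X.shape[0]
mu_hat = X.mean(axis=0)                          # sample (empirical) mean
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / N    # ML estimate (divides by N, not N-1)

print(mu_hat)
print(Sigma_hat)
# np.cov(X, rowvar=False, bias=True) gives the same Sigma_hat.
```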

Supplement: MLE for univariate Gaussian
∂ℓ/∂σ = ∂/∂σ ∑_{i=1}^N [ −½ log 2π − log σ − (x(i) − μ)²/(2σ²) ]
      = ∑_{i=1}^N [ −½ ∂/∂σ log 2π − ∂/∂σ log σ − ∂/∂σ (x(i) − μ)²/(2σ²) ]
      = ∑_{i=1}^N [ 0 − 1/σ + (x(i) − μ)²/σ³ ]
      = −N/σ + (1/σ³) ∑_{i=1}^N (x(i) − μ)²
Setting ∂ℓ/∂σ = 0 and solving:
σ̂_ML = √( (1/N) ∑_{i=1}^N (x(i) − μ)² )

Revisiting Linear Regression

Recap: Linear Regression
Given a training set of inputs and targets {(x(i), t(i))}_{i=1}^N
Linear model:
y = w⊤ψ(x)
Squared error loss:
L(y, t) = ½ (t − y)²
L2 regularization:
R(w) = (λ/2) ∥w∥²
Solution 1: solve analytically by setting the gradient to 0:
w = (Ψ⊤Ψ + λI)⁻¹ Ψ⊤t
Solution 2: solve approximately using gradient descent:
w ← (1 − αλ)w − αΨ⊤(y − t)
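A short NumPy sketch of Solution 1 (my own, on synthetic data; the feature map psi below is a hypothetical example, not from the slides):

```python
import numpy as np

def psi(x):
    """Hypothetical feature map: [1, x], i.e. a bias feature plus a linear term."""
    return np.stack([np.ones_like(x), x], axis=1)

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=50)
t = 2.0 * x - 1.0 + rng.normal(scale=0.5, size=50)   # synthetic targets (assumption)

Psi = psi(x)                   # N x D design matrix
lam = 0.1                      # regularization strength (assumption)

# Closed-form regularized least squares: w = (Psi^T Psi + lam I)^{-1} Psi^T t
w = np.linalg.solve(Psi.T @ Psi + lam * np.eye(Psi.shape[1]), Psi.T @ t)
print(w)                       # approximately [-1, 2]
```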

Linear Regression as Maximum Likelihood
We can give linear regression a probabilistic interpretation by assuming a Gaussian noise model:
t|x∼N(w⊤ψ(x), σ2)
Linear regression is just maximum likelihood under this model:
(1/N) ∑_{i=1}^N log p(t(i) | x(i); w, b) = (1/N) ∑_{i=1}^N log N(t(i); w⊤ψ(x(i)), σ²)
  = (1/N) ∑_{i=1}^N log [ 1/(√(2π) σ) · exp(−(t(i) − w⊤ψ(x(i)))²/(2σ²)) ]
  = const − (1/(2Nσ²)) ∑_{i=1}^N (t(i) − w⊤ψ(x(i)))²

Regularization as MAP Inference
We can view an L2 regularizer as MAP inference with a Gaussian prior. Recall MAP inference:
argmax_w log p(w | D) = argmax_w [ log p(w) + log p(D | w) ]
We just derived the likelihood term log p(D | w):
log p(D | w) = −(1/(2σ²)) ∑_{i=1}^N (t(i) − w⊤x(i) − b)² + const
Assume a Gaussian prior, w ∼ N(m, S):
log p(w) = log N(w; m, S)
         = log [ 1/((2π)^(D/2) |S|^(1/2)) · exp(−½ (w − m)⊤ S⁻¹ (w − m)) ]
         = −½ (w − m)⊤ S⁻¹ (w − m) + const
Commonly, m = 0 and S = ηI, so
log p(w) = −(1/(2η)) ∥w∥² + const.
This is just L2 regularization!

Gaussian Discriminant Analysis

Generative vs Discriminative (Recap)
Two approaches to classification:
Discriminative approach: estimate parameters of decision boundary/class separator directly from labeled examples.
I Model p(t|x) directly (logistic regression models)
I Learn mappings from inputs to classes (linear/logistic regression,
decision trees etc)
I Tries to solve: How do I separate the classes?
Generative approach: model the distribution of inputs characteristic of the class (Bayes classifier).
I Model p(x|t)
I Apply Bayes Rule to derive p(t|x).
I Tries to solve: What does each class ”look” like?

Classification: Diabetes Example
Gaussian discriminant analysis (GDA) is a Bayes classifier for continuous-valued inputs.
Observation per patient: White blood cell count & glucose value.
p(x | t = k) for each class is shaped like an ellipse
=⇒ we model each class as a multivariate Gaussian

Gaussian Discriminant Analysis
Gaussian Discriminant Analysis in its general form assumes that p(x|t) is distributed according to a multivariate Gaussian distribution
Multivariate Gaussian distribution:
p(x | t = k) = 1/((2π)^(D/2) |Σk|^(1/2)) · exp(−½ (x − μk)⊤ Σk⁻¹ (x − μk))
where |Σk| denotes the determinant of the matrix.
Each class k has an associated mean vector μk and covariance matrix Σk.
How many parameters?
I Each μk has D parameters, for DK total.
I Each Σk has O(D2) parameters, for O(D2K) — could be hard to
estimate (more on that later).

GDA: Learning
Learn the parameters for each class using maximum likelihood.
For simplicity, assume binary classification:
p(t | φ) = φ^t (1 − φ)^(1−t)
You can compute the ML estimates in closed form (φ and μk are easy, Σk is tricky):
μk = ( ∑_{i=1}^N r_k(i) x(i) ) / ( ∑_{i=1}^N r_k(i) )
Σk = ( 1 / ∑_{i=1}^N r_k(i) ) ∑_{i=1}^N r_k(i) (x(i) − μk)(x(i) − μk)⊤
where r_k(i) = 1[t(i) = k].
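A rough implementation sketch of these estimates (my own, assuming integer class labels t(i) ∈ {0, ..., K−1} and synthetic example data):

```python
import numpy as np

def fit_gda(X, t, K):
    """Closed-form ML estimates for GDA: class priors, means, and covariances."""
    N, D = X.shape
    phi = np.zeros(K)
    mu = np.zeros((K, D))
    Sigma = np.zeros((K, D, D))
    for k in range(K):
        r = (t == k)                      # indicator r_k(i)
        Nk = r.sum()
        phi[k] = Nk / N                   # class prior
        mu[k] = X[r].mean(axis=0)         # class mean
        diff = X[r] - mu[k]
        Sigma[k] = diff.T @ diff / Nk     # class covariance (ML estimate)
    return phi, mu, Sigma

# Example with synthetic data (assumption).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[0, 0], scale=1.0, size=(100, 2)),
               rng.normal(loc=[3, 3], scale=1.0, size=(100, 2))])
t = np.array([0] * 100 + [1] * 100)
phi, mu, Sigma = fit_gda(X, t, K=2)
print(phi, mu, sep="\n")
```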

GDA Decision Boundary
Recall: for Bayes classifiers, we compute the decision boundary with Bayes’ Rule:
p(t | x) = p(t) p(x | t) / ∑_{t′} p(t′) p(x | t′)
Plug in the Gaussian p(x | t):
log p(tk | x) = log p(x | tk) + log p(tk) − log p(x)
             = −(D/2) log(2π) − ½ log |Σk| − ½ (x − μk)⊤ Σk⁻¹ (x − μk) + log p(tk) − log p(x)
Decision boundary:
(x − μk)⊤ Σk⁻¹ (x − μk) = (x − μl)⊤ Σl⁻¹ (x − μl) + Const
What’s the shape of the boundary?
I We have a quadratic function in x, so the decision boundary is a conic section!

GDA Decision Boundary
Figure: class likelihoods, posterior for t1, and the discriminant P(t1 | x) = 0.5

GDA Decision Boundary
Our equation for the decision boundary:
(x − μk)⊤ Σk⁻¹ (x − μk) = (x − μl)⊤ Σl⁻¹ (x − μl) + Const
Expand the product and factor out constants (w.r.t. x):
x⊤Σk⁻¹x − 2μk⊤Σk⁻¹x = x⊤Σl⁻¹x − 2μl⊤Σl⁻¹x + Const
What if all classes share the same covariance Σ? Then the quadratic terms cancel:
−2μk⊤Σ⁻¹x = −2μl⊤Σ⁻¹x + Const
(μk − μl)⊤Σ⁻¹x = Const
I We get a linear decision boundary!

GDA Decision Boundary: Shared Covariances
Figure: decision boundaries with shared covariances (variances may be different)

GDA vs Logistic Regression
Binary classification: If you examine p(t = 1 | x) under GDA and assume Σ0 = Σ1 = Σ, you will find that it looks like this:
p(t = 1 | x, φ, μ0, μ1, Σ) = 1 / (1 + exp(−w⊤x − b))
where (w, b) are chosen based on (φ, μ0, μ1, Σ). Same model as logistic regression!

GDA vs Logistic Regression
When should we prefer GDA to logistic regression, and vice versa?
GDA makes a stronger modeling assumption: assumes class-conditional data is multivariate Gaussian
I If this is true, GDA is asymptotically efficient (best model in limit of large N)
I If it's not true, the quality of the predictions might suffer.
Many class-conditional distributions lead to a logistic classifier.
I When these distributions are non-Gaussian (i.e., almost always), LR usually beats GDA.
GDA can easily handle missing features (how do you do that with LR?)

Gaussian Naive Bayes
What if x is high-dimensional?
I The Σk have O(D2K) parameters, which can be a problem if D is large.
I We already saw we can save a factor of K by using a shared covariance for the classes.
I Any other idea you can think of?
Naive Bayes: Assumes features independent given the class
p(x | t = k) = ∏_j p(xj | t = k)
Assuming the likelihoods are Gaussian, how many parameters are required for the Naive Bayes classifier?
I This is equivalent to assuming the xj are uncorrelated, i.e. Σ is diagonal.
I Hence, only D parameters for Σ!

Gaussian Naïve Bayes
Gaussian Naïve Bayes classifier assumes that the likelihoods are Gaussian:
p(xj | t = k) = 1/(√(2π) σjk) · exp(−(xj − μjk)²/(2σjk²))
(this is just a 1-dim Gaussian, one for each input dimension)
Model is the same as GDA with a diagonal covariance matrix.
Maximum likelihood estimates of the parameters:
μjk = ( ∑_{i=1}^N r_k(i) xj(i) ) / ( ∑_{i=1}^N r_k(i) )
σjk² = ( ∑_{i=1}^N r_k(i) (xj(i) − μjk)² ) / ( ∑_{i=1}^N r_k(i) )
where r_k(i) = 1[t(i) = k].
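A minimal sketch of these per-dimension estimates (my own, with the same label convention as the GDA snippet above):

```python
import numpy as np

def fit_gaussian_nb(X, t, K):
    """Per-class, per-feature means and variances (i.e. diagonal-covariance GDA)."""
    N, D = X.shape
    mu = np.zeros((K, D))
    var = np.zeros((K, D))
    for k in range(K):
        Xk = X[t == k]                    # rows with t(i) = k
        mu[k] = Xk.mean(axis=0)           # mu_{jk} for each feature j
        var[k] = Xk.var(axis=0)           # sigma^2_{jk} (ML estimate, divides by N_k)
    return mu, var
```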

Decision Boundary: Isotropic
We can go even further and assume the covariances are spherical, or isotropic.
In this case: Σ = σ²I (just need one parameter!)
Going back to the class posterior for GDA:
log p(tk | x) = log p(x | tk) + log p(tk) − log p(x)
             = −(D/2) log(2π) − ½ log |Σk| − ½ (x − μk)⊤ Σk⁻¹ (x − μk) + log p(tk) − log p(x)
Suppose for simplicity that p(t) is uniform. Plugging in Σ = σ²I and simplifying a bit,
log p(tk | x) − log p(tl | x) = −(1/(2σ²)) [ (x − μk)⊤(x − μk) − (x − μl)⊤(x − μl) ]
                              = −(1/(2σ²)) [ ∥x − μk∥² − ∥x − μl∥² ]
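With a uniform prior and Σ = σ²I, classification therefore reduces to picking the nearest class mean. A tiny sketch of that rule (my own; the class means below are arbitrary examples):

```python
import numpy as np

def predict_isotropic(x, mu):
    """Assign x to the class whose mean is closest in Euclidean distance."""
    dists = np.linalg.norm(mu - x, axis=1)   # ||x - mu_k|| for each class k
    return np.argmin(dists)

mu = np.array([[0.0, 0.0], [3.0, 3.0]])      # example class means (assumption)
print(predict_isotropic(np.array([1.0, 0.5]), mu))   # 0
```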

Decision Boundary: Isotropic
The decision boundary bisects the class means!


Generative models – Recap
GDA has quadratic (conic) decision boundary.
With shared covariance, GDA is similar to logistic regression.
Generative models:
I Flexible models, easy to add/remove class.
I Handle missing data naturally.
I More "natural" way to think about things, but usually doesn't work as well.
I Tries to solve a hard problem (model p(x)) in order to solve an easy problem (model p(t | x)).
Next up: Unsupervised learning with PCA!
