STA 314: Statistical Methods for Machine Learning I
Lecture 10 – Probabilistic Models
University of Toronto
Wrapping up inference and decision-making. Gaussian generative models.
Last time we discussed the maximum likelihood estimation view of machine learning:
Specify a family of distributions p(x ∣ θ) parameterized by θ ∈ Θ. Observe a data set D = {x^{(1)}, …, x^{(N)}}.
Under an IID assumption, MLE corresponds to
θ̂_ML = argmax_{θ ∈ Θ} ∑_{i=1}^N log p(x^{(i)} ∣ θ)
MLE issue: Data Sparsity
Maximum likelihood has a pitfall: if you have too little data, it can overfit.
E.g., what if you flip the coin twice and get H both times?
θ̂_ML = N_H / (N_H + N_T) = 2 / (2 + 0) = 1
Because it never observed T, it assigns this outcome probability 0.
This problem is known as data sparsity.
Bayesian Parameter Estimation
Somehow we want to reflect our uncertainty in the true value of θ.
Maybe the problem was that we summarized D with a single setting of the parameters, θ̂_ML.
What if we instead summarized D using a distribution? This lets us consider a variety of possible parameter values, each weighted by some probability. This is the spirit behind Bayesian inference.
Bayesian Parameter Estimation
In maximum likelihood, the observations are treated as random variables, but the parameters are not.
The Bayesian approach treats the parameters as random variables as well; the hyperparameters β are the parameters of the prior distribution of θ.
To define a Bayesian model, we need to specify two distributions:
▶ The prior distribution p(θ), which encodes our beliefs about the
parameters before we observe the data
▶ The likelihood p(D ∣ θ), same as in maximum likelihood
Bayesian Parameter Estimation
The posterior distribution is the distribution that we will use to summarize D.
Using Bayes’ Rule:
p(θ ∣ D) = p(θ) p(D ∣ θ) / ∫ p(θ′) p(D ∣ θ′) dθ′
We rarely ever compute the denominator explicitly. In general, it is computationally intractable.
Bayesian Parameter Estimation
Let’s revisit the coin example. We already know the likelihood:
L(θ) = p(D ∣ θ) = θ^{N_H} (1 − θ)^{N_T}
It remains to specify the prior p(θ).
▶ We can choose an uninformative prior, which assumes as little as
possible. A reasonable choice is the uniform prior.
▶ But our experience tells us 0.5 is more likely than 0.99. One particularly useful prior that lets us specify this is the beta distribution:
p(θ; a, b) = Γ(a + b) / (Γ(a) Γ(b)) · θ^{a−1} (1 − θ)^{b−1}.
▶ This notation for proportionality lets us ignore the normalization constant:
p(θ; a, b) ∝ θ^{a−1} (1 − θ)^{b−1}.
Bayesian Parameter Estimation
Beta distribution for various values of a, b:
Some observations:
▶ The expectation E[θ] = a/(a + b) (easy to derive).
▶ The distribution gets more peaked when a and b are large.
▶ The uniform distribution is the special case where a = b = 1.
The beta distribution is commonly used as a prior for the Bernoulli distribution.
Bayesian Parameter Estimation
Computing the posterior distribution:
p(θ ∣ D) ∝ p(θ) p(D ∣ θ)
∝ [θ^{a−1} (1 − θ)^{b−1}] [θ^{N_H} (1 − θ)^{N_T}]
= θ^{a−1+N_H} (1 − θ)^{b−1+N_T}.
This is just a beta distribution with parameters N_H + a and N_T + b.
The parameters a and b of the prior can be thought of as pseudo-counts.
▶ The reason this works is that the prior and likelihood have the same functional form. This phenomenon is known as conjugacy (conjugate priors), and it’s very useful.
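A minimal sketch (not from the slides) of this conjugate update in Python; the prior hyperparameters a = b = 2 are an assumed choice, and SciPy is used only to represent the posterior:

```python
import numpy as np
from scipy import stats

# Coin-flip counts from the small-data example: 2 heads, 0 tails.
N_H, N_T = 2, 0

# Hypothetical prior hyperparameters (a = b = 2 mildly favours theta near 0.5).
a, b = 2.0, 2.0

# Conjugacy: Beta prior + Bernoulli likelihood -> Beta posterior with pseudo-counts added.
posterior = stats.beta(a + N_H, b + N_T)

# Posterior mean (N_H + a) / (N_H + N_T + a + b); for a Bernoulli likelihood this
# also equals the posterior predictive probability of heads.
print(posterior.mean())
```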
Bayesian Parameter Estimation
Bayesian inference for the coin flip example:
Small data setting: N_H = 2, N_T = 0
Large data setting: N_H = 55, N_T = 45
[posterior plots for the two settings]
When you have enough observations, the data overwhelm the prior.
Bayesian Parameter Estimation
What do we actually do with the posterior?
The posterior predictive distribution is the distribution over future observables given the past observations. We compute this by marginalizing out the parameter(s):
p(D′ ∣ D) = ∫ p(θ ∣ D) p(D′ ∣ θ) dθ.
For the coin flip example:
Pr(x′ = H ∣ D) = ∫ p(θ ∣ D) Pr(x′ = H ∣ θ) dθ
= ∫ Beta(θ; N_H + a, N_T + b) · θ dθ
= E_{Beta(θ; N_H + a, N_T + b)}[θ]
= (N_H + a) / (N_H + N_T + a + b)
Bayesian Parameter Estimation
Maybe we can summarize the posterior using a single value?
One option is to use the posterior expectation of θ.
For the coin flip example it coincides with the probability of heads:
E[θ ∣ D] = (N_H + a) / (N_H + N_T + a + b)
Maximum A-Posteriori Estimation
Another option is maximum a-posteriori (MAP) estimation: summarize the posterior by the parameter setting that is most likely under it.
Maximum A-Posteriori Estimation
This converts the Bayesian parameter estimation problem into a maximization problem
θ̂_MAP = argmax_θ p(θ ∣ D)
= argmax_θ p(θ, D)
= argmax_θ p(θ) p(D ∣ θ)
= argmax_θ [ log p(θ) + log p(D ∣ θ) ]
We already saw an example of this in the homework.
Maximum A-Posteriori Estimation
Joint probability in the coin flip example:
log p(θ, D) = log p(θ) + log p(D ∣ θ)
= Const + (a − 1) log θ + (b − 1) log(1 − θ) + N_H log θ + N_T log(1 − θ)
= Const + (N_H + a − 1) log θ + (N_T + b − 1) log(1 − θ)
Maximize by finding a critical point:
0 = d/dθ log p(θ, D) = (N_H + a − 1)/θ − (N_T + b − 1)/(1 − θ)
Solving for θ:
θ̂_MAP = (N_H + a − 1) / (N_H + N_T + a + b − 2)
Maximum A-Posteriori Estimation
Comparison of estimates in the coin flip example (prior hyperparameters a = b = 2):

Estimator formulas:
θ̂_ML = N_H / (N_H + N_T),   E[θ ∣ D] = (N_H + a) / (N_H + N_T + a + b),   θ̂_MAP = (N_H + a − 1) / (N_H + N_T + a + b − 2)

N_H = 2,  N_T = 0:    θ̂_ML = 1,              E[θ ∣ D] = 4/6 ≈ 0.67,      θ̂_MAP = 3/4 = 0.75
N_H = 55, N_T = 45:   θ̂_ML = 55/100 = 0.55,  E[θ ∣ D] = 57/104 ≈ 0.548,  θ̂_MAP = 56/102 ≈ 0.549

θ̂_MAP assigns nonzero probability to both outcomes as long as a, b > 1.
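As a quick numerical check (an illustrative sketch, not part of the slides; the helper name and the a = b = 2 prior are assumptions), the three estimators can be compared directly:

```python
def estimates(N_H, N_T, a=2.0, b=2.0):
    """ML, posterior-mean, and MAP estimates of theta under a Beta(a, b) prior."""
    theta_ml = N_H / (N_H + N_T)
    theta_mean = (N_H + a) / (N_H + N_T + a + b)
    theta_map = (N_H + a - 1) / (N_H + N_T + a + b - 2)
    return theta_ml, theta_mean, theta_map

print(estimates(2, 0))     # small data: (1.0, 0.667, 0.75) -- the MLE overfits
print(estimates(55, 45))   # large data: (0.55, 0.548, 0.549) -- the estimates agree
```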
We took a probabilistic perspective on parameter estimation.
We modeled a biased coin as a Bernoulli random variable with parameter θ,
which we estimated using:
▶ maximum likelihood estimation:
θ̂_ML = argmax_θ p(D ∣ θ)
▶ expected Bayesian posterior:
E[θ ∣ D] where p(θ ∣ D) ∝ p(θ)p(D ∣ θ) by Bayes’ Rule.
▶ Maximum a-posteriori (MAP) estimation: θ̂_MAP = argmax_θ p(θ ∣ D)
We also saw parameter estimation in the context of a Naïve Bayes classifier. Today we will continue developing the probabilistic perspective:
▶ Gaussian Discriminant Analysis: Use Gaussian generative model of the data for classification
▶ Gaussian Mixture Model: Gaussian generative model view of clustering
Motivation
Generative models – model p(x∣t = k)
Instead of trying to separate classes, try to model what each class
”looks like”.
Recall that p(x∣t = k) may be very complex
p(x_1, ⋯, x_d, y) = p(x_1 ∣ x_2, ⋯, x_d, y) ⋯ p(x_{d−1} ∣ x_d, y) p(x_d, y)
Naïve Bayes used a conditional independence assumption. What else could we do? Choose a simple distribution.
Today we will discuss fitting Gaussian distributions to our data. First, a review of our setting and MLE in Gaussians.
Multivariate Data
Multiple measurements (sensors)
d inputs/features/attributes
N instances/observations/examples
X =
⎡ x_1^{(1)}  x_2^{(1)}  ⋯  x_d^{(1)} ⎤
⎢ x_1^{(2)}  x_2^{(2)}  ⋯  x_d^{(2)} ⎥
⎢     ⋮          ⋮      ⋱      ⋮     ⎥
⎣ x_1^{(N)}  x_2^{(N)}  ⋯  x_d^{(N)} ⎦
Multivariate Parameters
Mean: E[x] = μ = [μ_1, ⋯, μ_d]^T
Covariance:
Σ = Cov(x) = E[(x − μ)(x − μ)^T] =
⎡ σ_1²    σ_{12}  ⋯  σ_{1d} ⎤
⎢ σ_{12}  σ_2²    ⋯  σ_{2d} ⎥
⎢   ⋮       ⋮     ⋱    ⋮    ⎥
⎣ σ_{d1}  σ_{d2}  ⋯  σ_d²   ⎦
Multivariate Gaussian Distribution
x ∼ N(μ, Σ), a Gaussian (or normal) distribution, defined as
p(x) = 1 / ((2π)^{d/2} ∣Σ∣^{1/2}) · exp[ −½ (x − μ)^T Σ^{−1} (x − μ) ]
The Mahalanobis distance (x − μ_k)^T Σ^{−1} (x − μ_k) measures the distance from x to μ_k in terms of Σ.
It normalizes for difference in variances and correlations
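As an illustrative sketch (not part of the slides; the specific numbers are made up), the density and the Mahalanobis distance can be checked numerically with SciPy:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-D Gaussian parameters, chosen only for illustration.
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([1.0, 2.0])

# Squared Mahalanobis distance: (x - mu)^T Sigma^{-1} (x - mu).
diff = x - mu
maha_sq = diff @ np.linalg.solve(Sigma, diff)

# Density from the closed form above; should match scipy's pdf.
d = len(mu)
density = np.exp(-0.5 * maha_sq) / ((2 * np.pi) ** (d / 2) * np.linalg.det(Sigma) ** 0.5)
print(density, multivariate_normal(mean=mu, cov=Sigma).pdf(x))
```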
Gaussian Maximum Likelihood
Suppose we want to model the distribution of highest and lowest temperatures in Toronto in March, and we’ve recorded the following observations
(-2.5,-7.5) (-9.9,-14.9) (-12.1,-17.5) (-8.9,-13.9) (-6.0,-11.1) Assume they’re drawn from a Gaussian distribution with mean μ, and
covariance Σ. We want to estimate these using data. Log-likelihood function:
ℓ(μ, Σ) = log ∏_{i=1}^N [ 1 / ((2π)^{d/2} ∣Σ∣^{1/2}) · exp{ −½ (x^{(i)} − μ)^T Σ^{−1} (x^{(i)} − μ) } ]
= ∑_{i=1}^N log [ 1 / ((2π)^{d/2} ∣Σ∣^{1/2}) · exp{ −½ (x^{(i)} − μ)^T Σ^{−1} (x^{(i)} − μ) } ]
= ∑_{i=1}^N [ − log (2π)^{d/2} − log ∣Σ∣^{1/2} − ½ (x^{(i)} − μ)^T Σ^{−1} (x^{(i)} − μ) ]
Optional intuition building: why does ∣Σ∣^{1/2} show up in the Gaussian density p(x)? Hint: the determinant is the product of the eigenvalues.
Gaussian Maximum Likelihood
Maximize the log-likelihood by setting the derivative to zero:
0 = dℓ/dμ = − ∑_{i=1}^N d/dμ [ ½ (x^{(i)} − μ)^T Σ^{−1} (x^{(i)} − μ) ]
= ∑_{i=1}^N Σ^{−1} (x^{(i)} − μ)
Here we use the identity ∂(x^⊤ A x)/∂x = 2Ax for symmetric A. Solving, we get μ̂ = (1/N) ∑_{i=1}^N x^{(i)}. In general, a “hat” denotes an estimator.
This is just the sample mean of the observed values, or the empirical mean.
Gaussian Maximum Likelihood
We can do a similar calculation for the covariance matrix Σ (we skip the details).
Setting the partial derivatives to zero, just like before, we get:
0 = ∂ℓ/∂Σ  ⟹  Σ̂ = (1/N) ∑_{i=1}^N (x^{(i)} − μ̂)(x^{(i)} − μ̂)^⊤
= (1/N) (X − 1μ̂^⊤)^⊤ (X − 1μ̂^⊤)
where 1 is an N-dimensional vector of 1s.
This is called the empirical covariance and comes up quite often (e.g.,
PCA soon!)
Derivation in multivariate case is tedious. No need to worry about it. But it is good practice to derive this in one dimension. See supplement (next slide).
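As a quick sanity check, here is a minimal NumPy sketch (not part of the slides) evaluating the MLE formulas on the temperature data from the earlier slide:

```python
import numpy as np

# (high, low) temperature observations from the example above.
X = np.array([[-2.5,  -7.5],
              [-9.9, -14.9],
              [-12.1, -17.5],
              [-8.9, -13.9],
              [-6.0, -11.1]])
N = X.shape[0]

mu_hat = X.mean(axis=0)                  # empirical mean, (1/N) sum_i x^(i)
centered = X - mu_hat                    # X - 1 mu_hat^T
Sigma_hat = centered.T @ centered / N    # MLE covariance (divides by N, not N - 1)

print(mu_hat)
print(Sigma_hat)                         # np.cov(X.T, bias=True) gives the same matrix
```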
Supplement: MLE for univariate Gaussian
0 = ∂ℓ/∂μ = (1/σ²) ∑_{i=1}^N (x^{(i)} − μ)   ⟹   μ̂_ML = (1/N) ∑_{i=1}^N x^{(i)}

0 = ∂ℓ/∂σ = ∂/∂σ [ ∑_{i=1}^N ( −½ log 2π − log σ − (1/(2σ²)) (x^{(i)} − μ)² ) ]
= ∑_{i=1}^N [ −½ ∂/∂σ log 2π − ∂/∂σ log σ − ∂/∂σ (1/(2σ²)) (x^{(i)} − μ)² ]
= ∑_{i=1}^N [ 0 − 1/σ + (1/σ³) (x^{(i)} − μ)² ]
= − N/σ + (1/σ³) ∑_{i=1}^N (x^{(i)} − μ)²
⟹   σ̂_ML = √( (1/N) ∑_{i=1}^N (x^{(i)} − μ̂_ML)² )
Bayes Classifier
Let’s take a step back… The Bayes classifier:
h(x) = argmax_k p(t = k ∣ x) = argmax_k [ p(x ∣ t = k) p(t = k) / p(x) ]
= argmax_k p(x ∣ t = k) p(t = k)
We talked about discrete x; what if x is continuous?
Classification: Diabetes Example
Observation per patient: White blood cell count & glucose value.
How can we model p(x∣t = k)? Multivariate Gaussian
Gaussian Discriminant Analysis (Gaussian Bayes Classifier)
Gaussian Discriminant Analysis in its general form assumes that p(x∣t) is distributed according to a multivariate normal (Gaussian) distribution
Multivariate Gaussian distribution:
p(x ∣ t = k) = 1 / ((2π)^{d/2} ∣Σ_k∣^{1/2}) · exp[ −½ (x − μ_k)^T Σ_k^{−1} (x − μ_k) ]
where ∣Σ_k∣ denotes the determinant of the matrix and d is the dimension of x.
Each class k has an associated mean vector μ_k and covariance matrix Σ_k.
Each Σ_k has O(d²) parameters – could be hard to estimate (more on that later).
Learn the parameters for each class using maximum likelihood. Assume the prior is Bernoulli (we have two classes):
p(t ∣ φ) = φ^t (1 − φ)^{1−t}.
You can compute the MLE in closed form (good exercise!)
φ̂ = (1/N) ∑_{n=1}^N 1[t^{(n)} = 1]
μ̂_k = ∑_{n=1}^N 1[t^{(n)} = k] x^{(n)} / ∑_{n=1}^N 1[t^{(n)} = k]
Σ̂_k = (1 / ∑_{n=1}^N 1[t^{(n)} = k]) ∑_{n=1}^N 1[t^{(n)} = k] (x^{(n)} − μ̂_k)(x^{(n)} − μ̂_k)^T
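A minimal NumPy sketch of these estimators (not from the slides; the helper name is arbitrary and binary labels are assumed):

```python
import numpy as np

def fit_gda(X, t):
    """MLE for two-class Gaussian discriminant analysis.

    X: (N, d) inputs; t: (N,) labels in {0, 1}.
    Returns the class prior phi and per-class means and covariances.
    """
    phi = np.mean(t == 1)                     # MLE of the Bernoulli prior
    means, covs = [], []
    for k in (0, 1):
        Xk = X[t == k]                        # points belonging to class k
        mu_k = Xk.mean(axis=0)
        diff = Xk - mu_k
        covs.append(diff.T @ diff / len(Xk))  # class-conditional MLE covariance
        means.append(mu_k)
    return phi, means, covs
```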
Gaussian Discriminant Analysis (Gaussian Bayes Classifier)
GDA (GBC) decision boundary is based on class posterior. Make decisions by comparing class probabilities:
log p(t_k ∣ x) = log p(x ∣ t_k) + log p(t_k) − log p(x)
= −(d/2) log(2π) − ½ log ∣Σ_k∣ − ½ (x − μ_k)^T Σ_k^{−1} (x − μ_k) + log p(t_k) − log p(x)
Decision boundary (log p(t_k ∣ x) = log p(t_l ∣ x)):
(x − μ_k)^T Σ_k^{−1} (x − μ_k) = (x − μ_l)^T Σ_l^{−1} (x − μ_l) + C_{k,l}
x^T Σ_k^{−1} x − 2 μ_k^T Σ_k^{−1} x = x^T Σ_l^{−1} x − 2 μ_l^T Σ_l^{−1} x + C_{k,l}
Quadratic relation in x ⟹ quadratic (conic) decision boundary
So it is sometimes called “Quadratic Discriminant Analysis” (QDA)
Decision Boundary
[Figure: class likelihoods, the discriminant P(t_1 ∣ x) = 0.5, and the posterior for t_1]
Simplifying the Model
What if x is high-dimensional?
For the Gaussian Bayes classifier, if the input x is high-dimensional, then each covariance matrix has O(d²) parameters.
Save some parameters by using a shared covariance for the classes,
i.e. Σ_k = Σ_l. The MLE in this case:
Σ̂ = (1/N) ∑_{n=1}^N (x^{(n)} − μ̂_{t^{(n)}})(x^{(n)} − μ̂_{t^{(n)}})^T
Linear decision boundary (at home: verify this mathematically!). ▶ In Scikit-Learn this is called “Linear Discriminant Analysis” (LDA)
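For reference, a short scikit-learn sketch (synthetic data; the generated numbers are only illustrative): QuadraticDiscriminantAnalysis uses per-class covariances, while LinearDiscriminantAnalysis uses a shared one.

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

rng = np.random.default_rng(0)

# Synthetic two-class Gaussian data with different means and covariances.
X0 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], size=200)
X1 = rng.multivariate_normal([2.0, 2.0], [[1.5, -0.4], [-0.4, 0.5]], size=200)
X = np.vstack([X0, X1])
t = np.concatenate([np.zeros(200), np.ones(200)])

qda = QuadraticDiscriminantAnalysis().fit(X, t)  # quadratic decision boundary
lda = LinearDiscriminantAnalysis().fit(X, t)     # linear decision boundary (shared covariance)

print(qda.predict_proba(X[:3]))   # class posteriors p(t = k | x)
print(lda.predict(X[:3]))
```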
Decision Boundary: Shared Variances (between Classes)
[Figure: decision boundary with a shared covariance; the per-feature variances may be different]
Gaussian Discriminative Analysis vs Logistic Regression
Binary classification: If you examine p(t = 1∣x) under GDA and assume Σ0 = Σ1 = Σ, you will find that it looks like this:
p(t = 1 ∣ x, φ, μ_0, μ_1, Σ) = 1 / (1 + exp(−w^T x))
where w is an appropriate function of (φ, μ_0, μ_1, Σ) and φ = p(t = 1). GDA is similar to logistic regression (LR), but the parameter estimates are computed differently.
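A minimal numerical check of this claim (a sketch, not from the slides; the bias term b is written out explicitly rather than folded into w via a constant feature, and the parameter values are made up): with a shared Σ, the generative posterior equals a logistic function of a linear score with w = Σ⁻¹(μ_1 − μ_0).

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical shared-covariance GDA parameters.
mu0, mu1 = np.array([0.0, 0.0]), np.array([1.5, -0.5])
Sigma = np.array([[1.0, 0.2], [0.2, 0.8]])
phi = 0.4                     # p(t = 1)
x = np.array([0.7, 0.3])

# Posterior p(t = 1 | x) computed directly from the generative model.
p1 = phi * multivariate_normal(mu1, Sigma).pdf(x)
p0 = (1 - phi) * multivariate_normal(mu0, Sigma).pdf(x)
posterior = p1 / (p0 + p1)

# The same posterior as a logistic (sigmoid) function of a linear score.
Sinv = np.linalg.inv(Sigma)
w = Sinv @ (mu1 - mu0)
b = -0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu0 @ Sinv @ mu0 + np.log(phi / (1 - phi))
print(posterior, 1 / (1 + np.exp(-(w @ x + b))))   # these should match
```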
When should we prefer GDA to LR, and vice versa?
Gaussian Discriminative Analysis vs Logistic Regression
GDA is a generative model, LR is a discriminative model.
GDA makes stronger modeling assumption: assumes class-conditional data is multivariate Gaussian.
If this is true, GDA is asymptotically efficient (best model in limit of large N)
But LR is more robust, less sensitive to incorrect modeling assumptions (what loss is it optimizing?)
Many class-conditional distributions lead to logistic classifier.
When these distributions are non-Gaussian (true almost always), LR usually beats GDA
A Generative View of Clustering
What if we do not observe the targets?
A Generative View of Clustering
We covered the hard and soft K-means algorithms for clustering.
Today: statistical formulation of clustering → principled, justification for updates
We need a sensible measure of what it means to cluster the data well
▶ This makes it possible to judge different methods
▶ It may help us decide on the number of clusters
An obvious approach is to imagine that the data was produced by a
generative model
▶ Then we adjust the model parameters to maximize the probability that it would produce exactly the data we observed
Latent Variable Models
To incorporate the idea of clusters, we model a joint distribution
p(x, z) = p(x∣z)p(z)
between the data and an unobserved cluster id z ∈ {1, . . . , K }.
The “label” or cluster id z is not observed, so we call it a latent variable.
Because z is unobserved, we cannot just maximize log p(x, z). Instead, we must maximize just the likelihood of the data x:
p(x) = ∑_z p(x, z) = ∑_z p(x ∣ z) p(z)
This is an instance of a mixture model or more generally, a latent variable model.
Gaussian Mixture Model (GMM)
Most common mixture model: the Gaussian mixture model (GMM). A GMM represents a distribution as
p(x) = ∑_{k=1}^K π_k N(x ∣ μ_k, Σ_k)
with π_k the mixing coefficients, where
∑_{k=1}^K π_k = 1 and π_k ≥ 0 ∀k
GMM is a density estimator
In general mixture models are very powerful, but harder to optimize
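A minimal sketch (not from the slides) of evaluating this mixture density; the component parameters are arbitrary:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, pis, mus, Sigmas):
    """Evaluate p(x) = sum_k pi_k N(x | mu_k, Sigma_k) at a single point x."""
    return sum(pi * multivariate_normal(mu, Sigma).pdf(x)
               for pi, mu, Sigma in zip(pis, mus, Sigmas))

# Hypothetical 2-component mixture in 1-D (covariances are 1x1 matrices here).
pis = [0.7, 0.3]
mus = [np.array([0.0]), np.array([4.0])]
Sigmas = [np.array([[1.0]]), np.array([[0.5]])]

print(gmm_density(np.array([1.0]), pis, mus, Sigmas))
```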
Visualizing a Mixture of Gaussians – 1D Gaussians
If you fit a Gaussian to data:
Now, we are trying to fit a GMM (with K = 2 in this example):
[Slide credit: K. Kutulakos]
Visualizing a Mixture of Gaussians – 2D Gaussians
Fitting GMMs: Maximum Likelihood
Maximum likelihood maximizes
ln p(X ∣ π, μ, Σ) = ∑_{n=1}^N ln ( ∑_{k=1}^K π_k N(x^{(n)} ∣ μ_k, Σ_k) )
w.r.t. Θ = {π_k, μ_k, Σ_k}. Problems:
▶ Singularities: Arbitrarily large likelihood when a Gaussian explains a single point
▶ Identifiability: Solution is invariant to permutations
▶ Non-convex
How would you optimize this?
Could try gradient descent, but don’t forget to satisfy the constraints on πk
Expectation Maximization
Typically a latent variable model is fit with the Expectation Maximization (EM) algorithm, or variants of it.
The EM algorithm can be seen as a type of coordinate descent, just like K-means and our method for matrix completion.
We won’t go into details to justify the convergence of the algorithm, but I will show you the high-level algorithm for Gaussian mixture models and compare it to K-means.
Intuitively, How Can We Fit a Mixture of Gaussians?
1. E-step: Compute the posterior probability over z given our current model – i.e. how much do we think each Gaussian generates each datapoint.
2. M-step: Assuming that the data really was generated this way, change the parameters of each Gaussian to maximize the probability that it would generate the data it is currently responsible for.
Relation to k-Means
The K-Means Algorithm:
1. Assignment step: Assign each data point to the closest cluster
2. Refitting step: Move each cluster center to the center of gravity of the
data assigned to it
The EM Algorithm:
1. E-step: Compute the posterior probability over z given our current model
2. M-step: Maximize the probability that it would generate the data it is currently responsible for.
EM Algorithm for GMM
Initialize the means μ_k, covariances Σ_k, and mixing coefficients π_k. Iterate until convergence:
▶ E-step: Evaluate the responsibilities given the current parameters:
γ_k^{(n)} = p(z^{(n)} = k ∣ x^{(n)}) = π_k N(x^{(n)} ∣ μ_k, Σ_k) / ∑_{j=1}^K π_j N(x^{(n)} ∣ μ_j, Σ_j)
▶ M-step: Re-estimate the parameters given current responsibilities
μ_k = (1/N_k) ∑_{n=1}^N γ_k^{(n)} x^{(n)}
Σ_k = (1/N_k) ∑_{n=1}^N γ_k^{(n)} (x^{(n)} − μ_k)(x^{(n)} − μ_k)^T
π_k = N_k / N with N_k = ∑_{n=1}^N γ_k^{(n)}
▶ Evaluate log likelihood and check for convergence
ln p(X ∣ π, μ, Σ) = ∑_{n=1}^N ln ( ∑_{k=1}^K π_k N(x^{(n)} ∣ μ_k, Σ_k) )
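A compact NumPy sketch of one EM iteration for a GMM (illustrative only, with no convergence loop or log-likelihood check; the function name and array layout are assumptions):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pis, mus, Sigmas):
    """One EM iteration. X: (N, d); pis: (K,); mus: (K, d); Sigmas: (K, d, d)."""
    N, K = X.shape[0], len(pis)

    # E-step: responsibilities gamma[n, k] = p(z^(n) = k | x^(n)).
    gamma = np.stack([pis[k] * multivariate_normal(mus[k], Sigmas[k]).pdf(X)
                      for k in range(K)], axis=1)
    gamma /= gamma.sum(axis=1, keepdims=True)

    # M-step: re-estimate parameters from the soft assignments.
    Nk = gamma.sum(axis=0)                        # effective number of points per component
    mus_new = (gamma.T @ X) / Nk[:, None]
    Sigmas_new = np.empty((K, X.shape[1], X.shape[1]))
    for k in range(K):
        diff = X - mus_new[k]
        Sigmas_new[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    pis_new = Nk / N
    return pis_new, mus_new, Sigmas_new, gamma
```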
EM Algorithm for GMM
One can show that the EM algorithm monotonically improves the log-likelihood. Evaluate the log-likelihood and check for convergence:
ln p(X ∣ π, μ, Σ) = ∑_{n=1}^N ln ( ∑_{k=1}^K π_k N(x^{(n)} ∣ μ_k, Σ_k) )
Mixture of Gaussians vs. K-means
EM for mixtures of Gaussians is just like a soft version of K-means, with
fixed priors and covariance
Instead of hard assignments in the E-step, we do soft assignments based on the softmax of the squared Mahalanobis distance from each point to each cluster.
Each center moved by weighted means of the data, with weights given by soft assignments
In K-means, weights are 0 or 1. Confirm this at home!!!
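An illustrative sketch of the soft-assignment claim (not from the slides; equal mixing weights and a shared spherical covariance σ²I are assumed): the responsibilities are a softmax of the negative squared distances, and shrinking σ recovers the hard 0/1 assignments of K-means.

```python
import numpy as np

def soft_assignments(X, centers, sigma):
    """Responsibilities for a GMM with equal mixing weights and shared covariance sigma^2 * I.

    As sigma -> 0, these soft assignments approach the hard 0/1 assignments of K-means.
    """
    # Squared distances from each point to each center: shape (N, K).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2 / (2 * sigma ** 2)
    logits -= logits.max(axis=1, keepdims=True)        # for numerical stability
    gamma = np.exp(logits)
    return gamma / gamma.sum(axis=1, keepdims=True)     # softmax over clusters

X = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 3.0]])
centers = np.array([[0.0, 0.0], [3.0, 3.0]])
print(soft_assignments(X, centers, sigma=1.0))    # soft responsibilities
print(soft_assignments(X, centers, sigma=0.01))   # essentially hard (0/1) assignments
```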