Lecture 10: Mixture modeling
, CSC 311 Fall 2020
Learning goals
Know what generative process is assumed in a mixture model, and what sort of data it is intended to model
Copyright By PowCoder代写 加微信 powcoder
Be able to perform posterior inference in a mixture model, in particular – compute the posterior distribution over the latent variable
– compute the posterior predictive distribution BeabletolearntheparametersofamixturemodelusingtheExpectation-Maximization
(E-M) algorithm
Unsupervised learning
Through Lecture 7 we focused on supervised learning, where we assumed we had a set of training examples labeled with the correct output of the algorithm. Starting Lecture 8 we shifted focus to unsupervised learning, where we don’t know the right answer ahead of time. Instead, we have a collection of data, and we want to find interesting patterns in the data. Here are some examples:
• You have a collection of scientific papers and you want to understand how the field’s research focus has evolved over time. (For instance, maybe neural nets were a popular topic among AI researchers in the ’80s and early ’90s, then died out, and then came back with a vengeance around 2010.) You fit a model to a dataset of scientific papers which tries to identify different topics in the papers, and for each paper, automatically list the topics that it covers. The model wouldn’t know what to call each topic, but you can attach labels manually by inspecting the frequent words in each topic. You then plot how often each topic is discussed in each year.
• You’re a biologist interested in studying behavior, and you’ve collected lots of videos of bees. You want to “cluster” their behaviors into meaningful groups, so that you can look at each cluster and figure out what it means.
• You want to know how to reduce your energy consumption. You have a device which measures the total energy usage for your house (as a scalar value) for each hour over the course of a month. You want to decompose this signal into a sum of components which you can then try to match to various devices in your house (e.g. computer, refrigerator, washing machine), so that you can figure out which one is wasting the most electricity.
All of these are situations that call out for unsupervised learning. You don’t know the right answer ahead of time, but just want to spot interesting patterns. In general, our strategy for unsupervised learning will be to formulate a probabilistic model which postulates certain unobserved random variables (called latent variables) which correspond to things we’re interested in inferring. We then try to infer the values of the latent variables for all the data points, as well as parameters which relate the latent variables to the observations. In this lecture, we’ll look at one type of latent variable model, namely mixture models.
3 Mixture models
In Lectures 7-8, we looked at some methods for learning probabilistic models which took the form of simple distributions (e.g. Bernoulli or Gaussian). But often the data we’re trying to model is more complex. For instance, it might be multimodal. This means that there are several different modes, or regions of high probability mass, and regions of smaller probability mass in between.
For instance, suppose we’ve collected the high temperatures for every day in March 2014 for both Toronto and Miami, Florida, but forgot to write down which city is associated with each temperature. The values are plotted in Figure 1; we can see the distribution has two modes.
In this situation, we might model the data in terms of a mixture of several compo- nents, where each component has a simple parametric form (such as a Gaussian). In other words, we assume each data point belongs to one of the components, and we try to infer the distribution for each component separately. In this example, we happen to know that the two mixture components should correspond to the two cities. The model itself doesn’t “know” anything about cities, though: this is just something we would read into it at the very end, when we analyze the results. In general, we won’t always know precisely what meaning should be attached to the latent variables.
In order to represent this mathematically, we formulate the model in terms of latent variables, usually denoted z. These are variables which are never observed, and where we don’t know the correct values in advance. They are roughly analogous to hidden units, in that the learning algorithm needs to figure out what they should represent, without a human specifying it by hand. Variables which are always observed, or even sometimes
Figure 1: A histogram of daily high temperatures in ◦C for Toronto and Miami in March 2014. The distribution clearly has two modes.
Figure 2: An example of a univariate mixture of Gaussians model.
observed, are referred to as observables. In the above example, the city is the latent variable and the temperature is the observable.
In mixture models, the latent variable corresponds to the mixture component. It takes values in a discrete set, which we’ll denote {1, . . . , K }. (For now, assume K is fixed; we’ll talk later about how to choose it.) In general, a mixture model assumes the data are generated by the following process: first we sample z, and then we sample the observables x from a distribution which depends on z, i.e.
p(z, x) = p(z) p(x | z).
In mixture models, p(z) is always a multinomial distribution. p(x | z) can take a variety of parametric forms, but for this note we’ll assume it’s a Gaussian distribution. We refer to such a model as a mixture of Gaussians.
Figure 2 shows an example of a mixture of Gaussians model with 2 components. It has the following generative process:
• With probability 0.7, choose component 1, otherwise choose component 2 3
• If we chose component 1, sample x from a Gaussian with mean 0 and standard deviation 1
• If we chose component 2, sample x from a Gaussian with mean 6 and standard deviation 2
This can be written in a more compact mathematical notation:
For the general case,
z ∼ Multinomial(0.7, 0.3) (1) x|z = 1 ∼ Gaussian(0,1) (2) x|z = 2 ∼ Gaussian(6,2) (3)
z ∼ Multinomial(π) (4) x | z = k ∼ Gaussian(μk, σk). (5)
Here, π is a vector of probabilities known as the mixing proportions.
In general, we can compute the probability density function (PDF) over x by marginal-
izing out, or summing out, z:
Note: Equations 6 and 7 are two different ways of writing the PDF; the first is more general (since it applies to other latent variable models), while the second emphasizes the meaning of the clustering model itself. In the example above, this gives us:
p(x) = 0.7 · Gaussian(0, 1) + 0.3 · Gaussian(6, 2). (8)
This PDF is a convex combination, or weighted average, of the PDFs of the component distributions. The PDFs of the component distributions, and the mixture, are shown in Figure 2.
The general problem of grouping data points into clusters, where data points in the same cluster are more similar than data points in different clusters, is known as clustering. Learning a mixture model is one approach to clustering, but we should mention that there are a number of other approaches, most notably the K-means algorithm covered in class.
p(z) p(x | z) (6) Pr(z = k)p(x|z = k) (7)
4 Posterior inference
Before we talk about learning a mixture model, let’s first talk about posterior inference. Here, we assume we’ve already chosen the parameters of the model. We want to infer, given a data point x, which component it is likely to belong to. More precisely, we want to infer the posterior distribution p(z | x). Just like in Bayesian parameter estimation, we can use Bayes’ Rule:
p(z|x) ∝ p(z)p(x|z). (9)
Recall that ∝ means “equal up to a constant.” In other words, we can evaluate the right- hand side for all values of z, and then renormalize so that the values sum to 1.
Example 1. Suppose we observe x compute:
= 2 in the model described above. We can
= Gaussian(2; 0, 1) ≈ 0.054
= Gaussian(2; 6, 2) ≈ 0.027
And therefore,
Pr(z = 1|x) = =
Pr(z = 1)p(x|z = 1)
Pr(z = 1)p(x|z = 1) + Pr(z = 2)p(x|z = 2)
0.7 · 0.054
0.7 · 0.054 + 0.3 · 0.027
Pr(z = 2) p(x | z = 1)
p(x | z = 2)
What if we repeat this calculation for different values of x? Figure 2 shows the
posterior probability Pr(z = 1 | x) as a function of x.
Sometimes only some of the observables are actually observed. The others are said to be missing. One of the important uses of probabilistic models is to make predictions about the missing data given the observed data.
Suppose, for instance, that our observations are two-dimensional, and that we observe x1 and want to make predictions about x2. We can do this using the posterior predictive
distribution:
p(x2 |x1) =
p(z|x1)p(x2 |z,x1). (10)
Figure 3: Left: Samples from the mixture of Gaussians model of Example 2. Right: The PDF of the posterior predictive distribution p(x2 | x1), for various values of x1.
Sometimes we make the assumption that the xi values are conditionally independent given z. This means that, if we are told the value of z, then x1 and x2 are independent. However, if z is unknown, then x1 and x2 are not independent, because x1 gives information about z, which gives information about x2. (For instance, the pixels in one half of an image are clearly not independent of the pixels in the other half. But maybe they are roughly independent, conditioned on a detailed description of everything going on in the image.)
Example 2. Consider the following two-dimensional mixture of Gaussians model, where x1 and x2 are conditionally independent given z:
z ∼ Multinomial(0.4, 0.6) x1 | z = 1 ∼ Gaussian(0, 1)
x2 | z = 1 ∼ Gaussian(6, 1) x1 | z = 2 ∼ Gaussian(6, 2) x2 | z = 2 ∼ Gaussian(3, 2)
The distribution is plotted in Figure 3. Suppose we observe x1 = 3. We can compute the posterior distribution just like in the previous example:
Pr(z = 1|x1) = Pr(z = 1)p(x|z = 1)
Pr(z = 1)p(x|z = 1) + Pr(z = 2)p(x|z = 2)
= 0.4 · 0.0175
0.4 · 0.0175 + 0.6 · 0.0431
Using the posterior, we compute the posterior predictive distribution:
p(x2 |x1) = Pr(z = 1|x1)p(x2 |z = 1) + Pr(z = 2|x1)p(x2 |z = 2) = 0.213 · Gaussian(x2; 6, 1) + 0.787 · Gaussian(x2; 3, 2)
Hence, the posterior predictive distribution p(x2 | x1) is a mixture of two Gaussians, but with different mixing proportions than the original mixture model. Figure 3 shows the posterior predictive distribution for various values of x1.
5 Learning
Now let’s see how to learn the parameters of a mixture model. This doesn’t immediately seem to have much to do with inference, but it’ll turn out we need to do inference repeatedly in order to learn the parameters.
So far, we’ve been a bit vague about exactly what are the parameters of the Gaussian mixture model. We need two sets of parameters:
• The mean μk and standard deviation σk associated with each component k. (Other kinds of mixture models will have other sets of parameters.)
• The mixing proportions πk, defined as Pr(z = k).
In Lecture 7, we discussed three methods of parameter learning: maximum likelihood (ML), the maximum a-posteriori approximation (MAP), and the full Bayesian (FB) ap- proach. For mixture models, we’re going to focus on the first two. (There are fully Bayesian methods which look very much like the algorithm we’ll discuss, but these are beyond the scope of the class.)
5.1 Log-likelihood derivatives (optional)
We’ve been computing derivatives all term. Therefore, it might make sense to give a fairly nontraditional motivation for the E-M algorithm in terms of the derivatives of the log- likelihood function. If this section helps you, then great. If it doesn’t, feel free to skip it; you’re not responsible for it.
Now we compute the log-likelihood derivative with respect to a parameter θ; this could stand for any of the parameters we want to learn, such as the mixing proportions or the
mean or standard deviation of one of the components.
dθ log p(x) = dθ log p(z, x) (11)
d p(z,x) z dθ
=∑′ ∑z′ p(z,x)
z z′p(z,x) dθ ∑d
= p(z|x)dθlogp(z,x) (15) z[d]
=Ep(z|x) dθlogp(z,x) (16)
This is pretty cool — the derivative of the marginal log-probability p(x) is simply the expected derivative of the joint log-probability, where the expectation is with respect to the posterior distribution. This applies to any latent variable model, not just mixture models, since our derivation didn’t rely on anything specific to mixture models.
That derivation was rather long, but observe that each of the steps was a completely mechanical substitution, except for the step labeled (13). In that step, we used the formula
d logp(z,x)= 1 dp(z,x). (17) dθ p(z, x) dθ
You’re probably used to using this formula to get rid of the log inside the derivative. We instead used it to introduce the log. We did this because it’s much easier to deal with derivatives of log probabilities than derivatives of probabilities. This trick comes up quite a lot in probabilistic modeling, and we’ll see it twice more: once when we talk about Boltzmann machines, and once when we talk about reinforcement learning.
If we can compute the expectation (16) summed over all training cases, that gives us the log-likelihood gradient.
Example 3. Let’s return to the mixture model from Example 1, where we’re consid- ering the training case x = 2. Let’s compute the log-likelihood gradient with respect
p(z, x) d =z∑dθ
log p(z, x) z′ p(z′,x)
∑( p(z,x) ) d
= ∑ ′ logp(z,x) (14)
to μ1, the mean of the first mixture component. ∂[∂]
logp(x) = Ep(z|x) ∂μ logp(z,x) 1[1]
= E ∂ logp(z)+ ∂ logp(x|z) p(z|x) ∂μ1[ ∂μ1
=Pr(z=1|x) ∂ logPr(z=1)+ ∂ logp(x|z=1) + ∂μ1 ∂μ1
+Pr(z = 2|x) ∂μ logPr(z = 2)+ ∂μ
logGaussian(x;μ1,σ1)
logp(x|z = 2) (20) +0.176·[0+0] (21)
= 0.824· 0+ ∂μ =0.824·x−μ1
all training cases, we find:
=0.824·2 = 1.648
∂l ∑N ∂μ =
Pr(z(i) =1|x(i))· σ2 (25)
So essentially, we’re taking a weighted sum of the conditional probability gradients for all the training cases, where each training case is weighted by its probability of belonging to the first component.
Since we can compute the log-likelihood gradient, we can maximize the log-likelihood using gradient ascent. This is, in fact, one way to learn the parameters of a mixture of Gaussians. However, we’re not going to pursue this further, since we’ll see an even more elegant and efficient learning algorithm for mixture models, called Expectation-Maximization.
5.2 Expectation-Maximization
As mentioned above, one way to fit a mixture of Gaussians model is with gradient ascent. But a mixture of Gaussians is built out of multinomial and Gaussian distributions, and we have closed-form solutions for maximum likelihood for these distributions. Shouldn’t we be able to make use of these closed-form solutions when fitting mixture models?
log p(x | z = 1) are zero because changing This only covers a single training case. But if we take the sum of this formula over
In the step (21), all of the terms except
the parameters of the first component doesn’t affect the other probabilities involved.
Look at the log-likelihood derivative ∂l/∂μ1 from Example 3, Eqn 25. It’s a bit clunky to write out the conditional probabilities every time, so let’s introduce the variables r(i) =
k Pr(z(i) = 1 | x(i)), which are called responsibilities. These say how strongly a data point
“belongs” to each component. Eqn 25 can then be written in the more concise form
μ=i=11 . (27) 1 ∑N r(i)
In other words, it’s a weighted average of the training cases (essentially, the “center of
gravity”), where the weights are given by the responsibilities. The problem is, we can’t
just solve for μ1 in Eqn 26, because the responsibilities r(i) depend on μ1. When we apply k
Eqn 27, the derivative in Eqn 25 won’t be 0, because the posterior probabilities will have changed!
But that’s OK. Even though the update Eqn 27 doesn’t give the optimal parame- ters, it still turns out to be quite a good update. This is the basis of the Expectation- Maximization (E-M) algorithm. This algorithm gets its name because the update we just performed involves two steps: computing the responsibilities, and applying the maxi- mum likelihood update with those responsibilities. More precisely, we apply the following algorithm:
Repeat until converged:
E-step.1 Compute the expectations, or responsibilities, of the latent variables:
r(i) ← Pr(z(i) = k | x(i)) (28) k
M-step. Compute the maximum likelihood parameters given these responsibil- ities:
θ ← argmax r(i) logPr(z(i) = k)+logp(x(i) |z(i) = k) (29)
∂l∑N x(i)−μ
=r(i)· 1 (26)
k σ 12 ∑N r(i)x(i)
It would be tempting to set this partial to 0, and then solve for μ1. This would give us:
θk i=1 k=1
Typically, only one of the two terms within the bracket will depend on any given parameter, which simplifies the computations.
1This is called the Expectation step for the following reason: it’s common to represent the mixture component z with a 1-of-K encoding. In this representation, the responsibilities are simply the expectations of the latent variables.
Generally, it takes a modest number of iterations (e.g. 20-50) to get close to a local optimum. Now let’s derive the E-M algorithm for Gaussian mixture models.
Example 4. Let’s derive the full E-M update for Gaussian mixtures. The E-step simply consists of computing the posterior probabilities, as we did in Example 1:
r(i) ← Pr(z(i) = k | x(i)) k
= ∑Pr(z(i) = k)p(x(i) |z(i) = k)
k′ Pr(z(i) = k′) p(x(i) | z(i) = k′)
= ∑ πk · Gaussian(x(i); μk, σk)
k′ πk′ ·Gaussian(x(i);μk′,σk′)
Now let’s turn to the M-step. Consider the mixing proportions πk first. Before we go through the derivation, we can probably guess what the solution should be. A good estimate of the probability of an outcome is the proportion of times it appears, i.e. πk = Nk/N, where Nk is the count for outcome k, and N is the number of observations. If we replace Nk with the sum of the responsibilities (which can be thought of as “fractional counts”), we should get
πk ← r(i). (30)
∑N ∑K ∑N ∑K r(i) log Pr(z(i) = k) =
Let’s see if we were right. Observe that the mixing proportions don’t affect the second term inside the sum in Eqn 29, so we only need to worry about the first term, log Pr(z(i) = k). We want to maximize
r(i) log πk. (31) kk
i=1 k=1 i=1 k=1
subject to the normalization constraint ∑ πk = 1. We can do this by computing
the Lagrangian and setting its partial derivatives to zero. (If you don’t know what the Lagrangian is, you can skip this step. We’re not going to need it again.)
πk (32) (33)
Setting the partial derivative to zero, we see that
∑N r(i) λ= i=1 k
r(i) log πk + λ k
= i=1 k −λ
for each k. For this to be true, πk must be proportional to ∑N r(i). Therefore, i=1 k
∑N r(i) π ← i=1 k
∑N r(i) = i=1 k ,
k ∑K ∑N (i) k′=1 i=1 rk′
confirming our original guess, Eqn 30.
Finally, let’s turn to the Gaussian parameters μk and σk. Observe that these pa- rameters don’t affect the first term inside the sum in Eqn 29, so we only need to worry about the second term, log p(x(i) | z(i) = k). Also, the parameters associated with mixture component k only affect the data points assigned to component k. Therefore, we want to maximize
But this is exactly the maximum likelihood estimation problem for Gaussians, which we solved in Lecture 8. (The only difference here is the use of responsibilities.) Thus, the updates are
i=1 k i=1 1 ∑N
Nr(i) k i=1 k i=1
These updates are visualized for the temperature data in Figure 4.
This example demonstrates a general recipe for deriving the M-step: apply the standard maximum likelihood formulas, except weight each data point according to its responsibility. This is true in general for mixture models, which is all we’re going to talk about in this class. However, the E-M algorithm is much
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com