4. Latent Variable Models and EM
Gholamreza Haffari
Contents
1 Clustering and Kmeans
2 Gaussian Mixture Models and Expectation-Maximisation
3 A General View to EM
4 Activity 4.1: EM for GMM
5 Latent Variable Models for Document Analysis
6 Activity 4.2: Document Clustering
7 Appendix A: Constrained Optimisation
1 Clustering and Kmeans
Data Clustering
Data clustering, or cluster analysis, is one of the most common forms of unsupervised learning; it aims to find a sensible structure in unlabeled data. In other words, a clustering algorithm groups data into natural categories based on the similarities between them, without knowing their actual groups. Clustering algorithms help data scientists with several tasks: revealing the structure of the data (which has many applications in hypothesis generation and anomaly detection), categorising data when labeling (e.g., by human experts) is costly, and compressing data (storing only the cluster representatives/prototypes or the parameters of the generative models).
Some of the challenges in data clustering are scalability to big data, the ability to handle different types of data with a variety of distributions, and insensitivity to noise and outliers. Many clustering algorithms have been developed by statisticians and computer scientists. Center-based partitioning (e.g., KMeans), density-based (https://en.wikipedia.org/wiki/Cluster_analysis#Density-based_clustering) clustering (e.g., DBSCAN), hierarchical (https://en.wikipedia.org/wiki/Hierarchical_clustering) clustering, and graph-based (http://www.slideshare.net/ssakpi/graph-based-clustering) clustering are just a few examples of popular clustering approaches. Figure 4.1.1 compares partitioning clustering (left) with hierarchical clustering (right).
Figure 4.1.1. Partitioning (left) vs. hierarchical clustering (right). Source: http://www.slideshare.net/aorriols/lecture17
In this chapter, we focus on the KMeans algorithm, not only because of its simplicity but also because it provides useful anchors to important concepts and techniques that will be discussed in the following chapters.
The KMeans Algorithm
Suppose $N$ unlabeled data points $x_1, \dots, x_N$ are given, and the goal is to partition them into $K$ distinct groups (or clusters) such that similar points are grouped together. Similarity is usually measured by the inverse of a distance measure, e.g., the Euclidean distance. The simplest center-based algorithm to solve the clustering problem is KMeans.
KMeans is an iterative algorithm that starts with initial random guesses of the cluster centers $\mu_1, \dots, \mu_K$ and continues by iteratively executing the following two steps until a stopping criterion is met:
1. Update Assignment of Datapoints to Clusters: Calculate the distance of each data point to all cluster centers, and assign the datapoint to the cluster with the minimum distance.
2. Update Centers of the Clusters: For each cluster, calculate the new center as the average of all datapoints assigned to it. More formally,
$$\mu_k = \frac{\sum_{n=1}^{N} r_{nk}\, x_n}{\sum_{n=1}^{N} r_{nk}}$$
where $r_{nk} \in \{0, 1\}$ indicates whether $x_n$ is assigned to cluster $k$:
$$r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_{j} \lVert x_n - \mu_j \rVert^2 \\ 0 & \text{otherwise.} \end{cases}$$
Based on the above description, in one step KMeans assigns data points to their closest centres, and in the consecutive step it updates each cluster mean by simply averaging over all the datapoints that are currently assigned to that cluster. Figure 4.1.2 illustrates the KMeans steps.
Figure 4.1.2. Illustration of KMeans steps.
The green points in Plot (a) of Figure 4.1.2 are unlabeled data points, and the crosses are the randomly generated cluster centers, which are far from the centers of the clusters. The red and blue points in Plot (b) indicate the samples assigned to their closest clusters, and the pink solid line is the border between the two resulting clusters. Plot (c) shows the cluster prototypes (i.e., the crosses) after an update. Plots (d)-(h) illustrate the next steps of KMeans, and Plot (i) shows the final cluster centers and clustered points.
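The two steps above can be written compactly in R. The following is a minimal sketch written for this module as an illustration (the function and variable names are assumptions, and base R also ships a built-in kmeans() function):

```r
# A minimal KMeans sketch: X is an N x D matrix, K the number of clusters.
simple_kmeans <- function(X, K, max_iter = 100) {
  N <- nrow(X)
  # Initialise centers with K randomly chosen data points.
  mu <- X[sample(N, K), , drop = FALSE]
  for (iter in 1:max_iter) {
    # Assignment step: squared distance of every point to every center.
    d2 <- sapply(1:K, function(k) rowSums(sweep(X, 2, mu[k, ])^2))
    z  <- max.col(-d2)                        # index of the closest center
    # Update step: each center becomes the mean of its assigned points
    # (for simplicity we assume no cluster becomes empty).
    new_mu <- t(sapply(1:K, function(k) colMeans(X[z == k, , drop = FALSE])))
    if (all(abs(new_mu - mu) < 1e-8)) break   # stopping criterion
    mu <- new_mu
  }
  list(centers = mu, assignment = z)
}
```

For example, simple_kmeans(as.matrix(iris[, 3:4]), K = 3) groups the iris petal measurements into three clusters.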
Important Remarks
1. We can view KMeans as an optimisation algorithm that minimises the following objective function:
$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \lVert x_n - \mu_k \rVert^2$$
with the constraints that $r_{nk} \in \{0, 1\}$ and $\sum_{k=1}^{K} r_{nk} = 1$ for all $n$. Note that $J$ is a function of the cluster centres $\{\mu_k\}$ and the cluster assignments $\{r_{nk}\}$, and is optimised with respect to these variables.
2. KMeans is sensitive to initial values: different executions of KMeans (with different initial cluster centers) may result in different solutions.
3. KMeans is a non-probabilistic algorithm which only supports hard assignment. This means a data point can be assigned to one and only one of the clusters.
Application
Figure 4.1.3 shows an application of KMeans to image compression. In this application, the pixel colors (i.e., RGB intensities) are treated as the datapoints. We cluster the pixel colors into $K$ clusters, and then replace the colors in each cluster with the cluster centroid. Therefore, the color resolution of the picture is reduced, which in turn leads to compression. The lower $K$ is set, the higher the compression rate, and accordingly, the lower the image quality.
Figure 4.1.3. Application of KMeans algorithm in image compression.
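A rough sketch of this application in R, using the built-in kmeans() function (the random array below stands in for a real image, which could instead be loaded as an H x W x 3 array with, e.g., the jpeg package):

```r
set.seed(1)
H <- 60; W <- 80
img <- array(runif(H * W * 3), dim = c(H, W, 3))  # stand-in for a real RGB image in [0, 1]

K <- 8                                             # number of color clusters
pixels <- matrix(img, ncol = 3)                    # one row per pixel: (R, G, B)
fit <- kmeans(pixels, centers = K)

# Replace each pixel's color by the centroid of its cluster.
compressed <- fit$centers[fit$cluster, ]
img_compressed <- array(compressed, dim = dim(img))

# Storing only the K centroids plus one cluster index per pixel compresses the image.
```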
2 Gaussian Mixture Models and Expectation-Maximisation
Consider the clustering problem that we saw in the previous chapter, where we wanted to partition a set of training data points $x_1, \dots, x_N$ into $K$ groups of similar data points. Thinking differently, we would like to assign one of the labels $\{1, \dots, K\}$ to each data point, where a label is a group identifier.
As opposed to supervised learning, we are not given labels for the training data points in the clustering problem. In other words, the labels of the training data points are latent or hidden, and we denote them by latent variables. The latent variable is an important concept in probabilistic modelling, and its usage goes beyond the clustering problem.
In this chapter, we give a probabilistic modelling viewpoint of the clustering problem which involves latent variables. The parameters of the probabilistic model and the best values for the latent variables are then learned based on the maximum likelihood principle. This is in contrast to the KMeans algorithm for clustering, which is merely designed around minimising an intuitively reasonable objective function (and is not based on any probabilistic principle).
Gaussian Mixture Models (GMMs)
A Generative Story. Consider the following hypothetical generative story for generating a label-datapoint pair $(z_n, x_n)$:
First, generate a cluster label $z_n \in \{1, \dots, K\}$ by tossing a dice with $K$ faces, where each face of the dice corresponds to a cluster label.
Second, generate the data point $x_n$ by sampling from the distribution corresponding to the cluster label $z_n$.
We use the term oracle to refer to the person who has hypothetically generated the data based on the above process. After generating the data, the oracle hides the label $z_n$ from us and only gives us the datapoint $x_n$. We denote the label which is hidden from us by a random variable $z_n$, which we call a latent variable. Now, given the training data, we would like to find the best values for the latent variables and the best estimates for the parameters of the above generative story.
The Probabilistic Generative Model. Let us formulate the probabilistic model corresponding to the above generative story:
Tossing a dice with $K$ faces is the same as sampling from a multinomial distribution on $K$ elements; the parameters of the multinomial are the probabilities $\pi_1, \dots, \pi_K$ of selecting the elements, where $0 \le \pi_k \le 1$ and $\sum_{k=1}^{K} \pi_k = 1$.
For the face $k$, we assume data points are sampled from a Gaussian distribution $\mathcal{N}(x \mid \mu_k, \Sigma_k)$, whose parameters include the mean $\mu_k$ and covariance $\Sigma_k$. Note that we have a collection of $K$ of these Gaussian distributions, each of which corresponds to one of the $K$ dice faces.
As said, the oracle hides the labels and only gives us the datapoints $x_1, \dots, x_N$, and we aim to best guess the latent variables $z_1, \dots, z_N$, where each $z_n$ represents the latent label for the datapoint $x_n$. Furthermore, we would like to find the best estimate for the model parameters $\theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K}$. We achieve these two goals by resorting to the maximum likelihood principle.
Gaussian Mixture Model and the Likelihood. If we were given a complete data point $(x_n, z_n)$ where the label was not hidden, then the probability of the pair according to our generative story would be
$$p(x_n, z_n = k \mid \theta) = \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$$
where the probability of the dice face $k$ is $\pi_k$, and the probability of the datapoint given the dice face is $\mathcal{N}(x_n \mid \mu_k, \Sigma_k)$.
In practice, the oracle has not given us the complete datapoint; instead it has given us only the incomplete data (or observed data) $x_n$. Denoting the latent label by the latent variable $z_n$, we can write
$$p(x_n \mid \theta) = \sum_{k=1}^{K} p(x_n, z_n = k \mid \theta) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$$
where we have marginalised out the random variable $z_n$ to get the probability of the incomplete data. This model is called the Gaussian mixture model, which is composed of mixing coefficients (corresponding to the multinomial distribution parameterised by $\pi$) and a collection of $K$ Gaussian components. We can now write down the log-likelihood of the observed data:
$$\ln p(x_1, \dots, x_N \mid \theta) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}$$
where $\theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K}$ are the model parameters.
The Prediction Rule. After estimating the model parameters, we can ask: to which cluster should an observed datapoint $x_n$ be assigned? Using Bayes' rule, we can find the conditional probability of a cluster $k$ for a given datapoint $x_n$:
$$\gamma(z_{nk}) := p(z_n = k \mid x_n, \theta) = \frac{\pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}.$$
Interestingly, this gives us a partial assignment, also called a soft assignment, of a datapoint to the clusters based on our probabilistic model. Recall that in KMeans a datapoint belongs to one and only one cluster (i.e., hard assignment), and there was no notion of partial assignment.
Alternatively, we can view $\pi_k$ as the prior probability of $z_n = k$, and the quantity $\gamma(z_{nk})$ as the corresponding posterior probability once we have observed $x_n$. We refer to $\gamma(z_{nk})$ as the responsibility that cluster $k$ takes for explaining the observation $x_n$.
We will see in the next section that cluster prediction is intertwined with parameter estimation: they are both done simultaneously in an iterative algorithm, called the Expectation Maximisation algorithm, which finds both the best latent variable assignments and the best parameter estimates.
Expectation Maximisation (EM) for GMMs
The maximum likelihood principle instructs us to maximise the likelihood as a function of model
parameters. Let us begin by writing down the conditions that must be satisfied at a maximum of the log-
likelihood function.
The Mean Parameters. Firstly, setting the gradient of $\ln p(X \mid \theta)$ with respect to the means $\mu_k$ of the Gaussian components to zero, we obtain
$$0 = \sum_{n=1}^{N} \gamma(z_{nk})\, \Sigma_k^{-1} (x_n - \mu_k)$$
where the responsibility quantities $\gamma(z_{nk})$ naturally appear on the right hand side. Rearranging the terms, we obtain
$$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n$$
where we have defined $N_k = \sum_{n=1}^{N} \gamma(z_{nk})$. We can interpret $N_k$ as the effective number of points assigned to cluster $k$. Note carefully the form of this solution. We see that the mean $\mu_k$ for the $k$th Gaussian component is obtained by taking a weighted mean of all of the points in the data set, in which the weighting factor for datapoint $x_n$ is given by the posterior probability $\gamma(z_{nk})$ that component $k$ was responsible for generating $x_n$.
The Covariance Matrices. Secondly, setting the gradient of $\ln p(X \mid \theta)$ with respect to $\Sigma_k$ to zero, and following a similar line of reasoning, we obtain
$$\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, (x_n - \mu_k)(x_n - \mu_k)^{\top}$$
which has the same form as the corresponding result for a single Gaussian fitted to the data set, but again with each data point weighted by the corresponding posterior probability and with the denominator given by the effective number of points associated with the corresponding component.
The Mixing Coefficients. Finally, we maximise $\ln p(X \mid \theta)$ with respect to the mixing coefficients $\pi_k$. Here we must take account of the constraint that the mixing coefficients have to sum to one. This can be achieved using a Lagrange multiplier and maximising the following objective function
$$\ln p(X \mid \theta) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right)$$
which, after setting the gradient with respect to $\pi_k$ to zero, gives
$$0 = \sum_{n=1}^{N} \frac{\mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} + \lambda.$$
Making use of the constraint $\sum_{k=1}^{K} \pi_k = 1$, we find $\lambda = -N$. Hence, eliminating $\lambda$, we obtain
$$\pi_k = \frac{N_k}{N}$$
so that the mixing coefficient for the $k$th component is given by the average responsibility which that component takes for explaining the data points.
EM Algorithm for Gaussian Mixture Models. It is worth emphasising that the above results for the optimal parameter estimates do not constitute a closed-form solution, because the responsibilities depend on those parameters in a complex way. However, these results do suggest a simple iterative scheme for finding a solution to the maximum likelihood problem, which, as we shall see, turns out to be an instance of the Expectation Maximisation (EM) algorithm for the particular case of the Gaussian mixture model.
Here is the EM algorithm for Gaussian mixture models. We first choose some initial values for the means, covariances, and mixing coefficients. Then we alternate between the following two updates, which we shall call the E step and the M step (for reasons that will become apparent shortly), until a stopping condition is met (a minimal sketch of one such cycle in R is given after the two steps):
In the expectation step, or E step, we use the current values of the parameters to evaluate the posterior probabilities (responsibilities) $\gamma(z_{nk})$.
In the maximisation step, or M step, we use these responsibilities to re-estimate the means $\mu_k$, covariances $\Sigma_k$, and mixing coefficients $\pi_k$ using the results derived above.
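The following sketch, written for this module as an illustration, performs one E step and one M step. It assumes the mvtnorm package for the multivariate Gaussian density; X is an N x D data matrix, mu a K x D matrix of means, Sigma a list of K covariance matrices, and pi_k a vector of mixing coefficients (all names are assumptions, not fixed by the module):

```r
library(mvtnorm)  # assumed available, for dmvnorm()

em_step <- function(X, pi_k, mu, Sigma) {
  N <- nrow(X); K <- length(pi_k)

  # E step: gamma[n, k] = pi_k * N(x_n | mu_k, Sigma_k) / sum_j pi_j * N(x_n | mu_j, Sigma_j)
  dens  <- sapply(1:K, function(k) pi_k[k] * dmvnorm(X, mean = mu[k, ], sigma = Sigma[[k]]))
  gamma <- dens / rowSums(dens)

  # M step: re-estimate the parameters using the responsibilities.
  Nk     <- colSums(gamma)                           # effective number of points per cluster
  pi_new <- Nk / N
  mu_new <- t(sapply(1:K, function(k) colSums(gamma[, k] * X) / Nk[k]))
  Sigma_new <- lapply(1:K, function(k) {
    Xc <- sweep(X, 2, mu_new[k, ])                   # centre the data at the new mean
    crossprod(Xc * gamma[, k], Xc) / Nk[k]           # weighted covariance
  })
  list(pi = pi_new, mu = mu_new, Sigma = Sigma_new, gamma = gamma)
}
```

Iterating em_step from some initial guess until the log-likelihood (or the parameters) stops changing reproduces the algorithm described above.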
Relation to KMeans. The above algorithm is reminiscent of the KMeans algorithm, with some differences: (i) the only parameters in KMeans are the cluster means (there are no covariances $\Sigma_k$ or mixing coefficients $\pi_k$), and (ii) the assignment of data points to clusters is hard in KMeans, where each data point is associated uniquely with one cluster; in contrast, the EM algorithm makes soft assignments based on the posterior probabilities $\gamma(z_{nk})$.
Important Remarks.
1. Each update to the parameters resulting from an E step followed by an M step is guaranteed to increase the log-likelihood function (we do not prove this here).
2. In practice, the algorithm is deemed to have converged when the change in the log-likelihood function, or alternatively in the parameters, falls below some threshold.
3. The EM algorithm takes many more iterations to reach convergence than the KMeans algorithm, and each cycle requires significantly more computation.
Example. We illustrate the EM algorithm for a mixture of two Gaussians on the data points shown in Figure 4.2.1. Here a mixture of two Gaussians is used, with centres initialised using the same values as for the KMeans algorithm in Figure 4.1.2, and with covariance matrices initialised to be proportional to the unit matrix.
Figure 4.2.1. Illustration of EM steps in GMM.
Plot (a) shows the data points in green, together with the initial configuration of the mixture model in
which the one standard-deviation contours for the two Gaussian components are shown as blue and red
circles.
Plot (b) shows the result of the initial E step, in which each data point is depicted using a proportion of
blue ink equal to the posterior probability of having been generated from the blue component, and a
corresponding proportion of red ink given by the posterior probability of having been generated by the
red component. Thus, points that have a significant probability for belonging to either cluster appear
purple.
The situation after the first M step is shown in plot (c), in which the mean of the blue Gaussian has moved
to the mean of the data set, weighted by the probabilities of each data point belonging to the blue cluster,
in other words it has moved to the centre of mass of the blue ink. Similarly, the covariance of the blue
Gaussian is set equal to the covariance of the blue ink. Analogous results hold for the red component.
Plots (d), (e), and (f) show the results after 2, 5, and 20 complete cycles of EM, respectively. In Plot (f)
the algorithm is close to convergence.
3 A General View to EM
In this section, we present a general view of the EM algorithm that recognizes the key role played by
latent variables. We discuss this approach first of all in an abstract setting, and then for illustration we
consider once again the case of Gaussian mixtures.
The EM Algorithm: General Case
The Training Objective. The goal of the EM algorithm is to find maximum likelihood solutions for models having latent variables. We denote the set of all observed data by $X$, and similarly we denote the set of all latent variables by $Z$. The set of all model parameters is denoted by $\theta$, and so the log-likelihood function is given by
$$\ln p(X \mid \theta) = \ln \left\{ \sum_{Z} p(X, Z \mid \theta) \right\}.$$
Note that our discussion applies equally well to continuous latent variables simply by replacing the sum over $Z$ with an integral.
Suppose that, for each observation in $X$, we were told the corresponding value of the latent variable. We shall call $\{X, Z\}$ the complete data set, and we shall refer to the actual observed data $X$ as incomplete. The likelihood function for the complete data set simply takes the form $\ln p(X, Z \mid \theta)$, and we shall suppose that maximisation of this complete-data log-likelihood function is straightforward.
Maximising the Likelihood of Incomplete Data. In practice, we are not given the complete data set $\{X, Z\}$, but only the incomplete data $X$. Our state of knowledge of the values of the latent variables in $Z$ is given only by the posterior distribution $p(Z \mid X, \theta)$. Because we cannot use the complete-data log-likelihood, we consider instead its expected value under the posterior distribution of the latent variables, which corresponds to the E step of the EM algorithm. In the subsequent M step, we maximise this expectation as a function of the model parameters. If the current estimate for the parameters is denoted $\theta^{\text{old}}$, then a pair of successive E and M steps gives rise to a revised estimate $\theta^{\text{new}}$. The algorithm is initialised by choosing some starting value for the parameters. The expectation and maximisation steps are iterated until a convergence condition is met.
The EM Algorithm. More specifically, in the E step, we use the current parameter values $\theta^{\text{old}}$ to find the posterior distribution of the latent variables, given by $p(Z \mid X, \theta^{\text{old}})$. We then use this posterior distribution to find the expectation of the complete-data log-likelihood evaluated for some general parameter value $\theta$. This expectation, denoted $Q(\theta, \theta^{\text{old}})$, is given by
$$Q(\theta, \theta^{\text{old}}) = \sum_{Z} p(Z \mid X, \theta^{\text{old}})\, \ln p(X, Z \mid \theta).$$
In the M step, we determine the revised parameter estimate $\theta^{\text{new}}$ by maximising this function:
$$\theta^{\text{new}} = \arg\max_{\theta}\, Q(\theta, \theta^{\text{old}}).$$
Note that in the definition of $Q(\theta, \theta^{\text{old}})$, the logarithm acts directly on the joint distribution $p(X, Z \mid \theta)$, and so the corresponding M-step maximisation will, by supposition, be tractable.
To summarise, the general EM algorithm is as follows:
Choose an initial setting for the parameters $\theta^{\text{old}}$
While convergence is not met:
E step: Evaluate $p(Z \mid X, \theta^{\text{old}})$
M step: Evaluate $\theta^{\text{new}}$ given by $\theta^{\text{new}} = \arg\max_{\theta} Q(\theta, \theta^{\text{old}})$, then set $\theta^{\text{old}} \leftarrow \theta^{\text{new}}$
The incomplete-data log-likelihood is guaranteed to increase in each cycle of the EM algorithm (we do not prove this here).
The Hard-EM Algorithm. In the Hard-EM algorithm, there is no expectation over the latent variables in the definition of the $Q$ function. Instead, only the most probable value of the latent variables is chosen to define the objective. In summary, this algorithm is as follows:
Choose an initial setting for the parameters $\theta^{\text{old}}$
While convergence is not met:
E step: Set $Z^{*} = \arg\max_{Z}\, p(Z \mid X, \theta^{\text{old}})$
M step: Set $\theta^{\text{new}} = \arg\max_{\theta}\, \ln p(X, Z^{*} \mid \theta)$, then set $\theta^{\text{old}} \leftarrow \theta^{\text{new}}$
It can be shown that the incomplete-data log-likelihood is guaranteed to increase in each cycle of the Hard-EM algorithm (we do not prove this here).
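For the GMM of the previous chapter, the hard E step simply replaces the responsibilities by a one-hot assignment to the most probable cluster. A small illustrative fragment in R (the gamma matrix below is a made-up stand-in for any N x K responsibility matrix, e.g. the one produced by the earlier sketch):

```r
# Hard E step: keep only the most probable cluster for each data point.
gamma <- matrix(c(0.9, 0.1,
                  0.2, 0.8,
                  0.6, 0.4), ncol = 2, byrow = TRUE)
z_star <- max.col(gamma)                       # most probable cluster per point: 1, 2, 1
gamma_hard <- 0 * gamma
gamma_hard[cbind(seq_len(nrow(gamma)), z_star)] <- 1
# The M step is then run with gamma_hard in place of gamma.
```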
EM for GMMs: Revisited
The Complete Data Likelihood. Assume that in addition to the observed data set $X = \{x_1, \dots, x_N\}$, we were also given the values of the corresponding latent variables $Z = \{z_1, \dots, z_N\}$; then
$$p(X, Z \mid \theta) = \prod_{n=1}^{N} \prod_{k=1}^{K} \left[ \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right]^{z_{nk}}$$
where $z_n = (z_{n1}, \dots, z_{nK})$ is the cluster assignment vector for the $n$th datapoint, in which $z_{nk} = 1$ if this datapoint belongs to cluster $k$ and zero otherwise. Note that only one element of the cluster assignment vector is 1, and the rest are zero.
Maximising the complete data likelihood to find the parameters of the model is straightforward:
The mixing components: $\pi_k = \frac{N_k}{N}$ where $N_k = \sum_{n=1}^{N} z_{nk}$
The mean parameters: $\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} z_{nk}\, x_n$
The covariance matrices: $\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} z_{nk}\, (x_n - \mu_k)(x_n - \mu_k)^{\top}$
The $Q$ Function. In practice we are not given the values of the latent variables, and we need to resort to the EM algorithm to find the best values for the parameters and the latent variables. The $Q$ function is
$$Q(\theta, \theta^{\text{old}}) = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \left[ \ln \pi_k + \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right]$$
where $\gamma(z_{nk})$ is indeed the responsibility that we defined before:
$$\gamma(z_{nk}) = \frac{\pi_k^{\text{old}}\, \mathcal{N}(x_n \mid \mu_k^{\text{old}}, \Sigma_k^{\text{old}})}{\sum_{j=1}^{K} \pi_j^{\text{old}}\, \mathcal{N}(x_n \mid \mu_j^{\text{old}}, \Sigma_j^{\text{old}})}.$$
Maximising the $Q$ function leads to the following updated parameters:
The mixing components: $\pi_k = \frac{N_k}{N}$ where $N_k = \sum_{n=1}^{N} \gamma(z_{nk})$
The mean parameters: $\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n$
The covariance matrices: $\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, (x_n - \mu_k)(x_n - \mu_k)^{\top}$
The EM Algorithm for GMMs. The EM algorithm is thus as follows:
Choose an initial setting for the parameters $\theta^{\text{old}} = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K}$
While convergence is not met:
E step: Set the responsibilities $\gamma(z_{nk})$ based on $\theta^{\text{old}}$
M step: Set $\theta^{\text{new}} = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K}$ based on the update equations above, then set $\theta^{\text{old}} \leftarrow \theta^{\text{new}}$
This is exactly the EM algorithm that we saw in the previous chapter.
4 Activity 4.1: EM for GMM
In this activity, we implement soft and hard expectation maximisation for training Gaussian mixture models. This activity helps you to complete Assignment 3.
Please download this (https://www.alexandriarepository.org/wp-content/uploads/20160701174342/Activity4.zip) zip file (right
click and then choose “save link as”) that contains the dataset (.csv), R script (.r), Jupyter notebook
(.ipynb) and output file (.html). If you have access to a Jupyter instance, the notebook (and dataset) is
enough to run the experiments. Otherwise, you may read the instructions and discussions from the HTML
file and execute the R script.
For a detailed discussion of setting up your programming environment, please refer to Appendix B: Setting up Your Programming Environment (https://www.alexandriarepository.org/?post_type=module&p=97654).
5 Latent Variable Models for Document Analysis
Document Clustering
Given a collection of $N$ documents, we would like to partition them into $K$ clusters.
Document Representation. Each document is made of some text. We treat a document as a bag of the words in its text, irrespective of their positions; e.g., a document consisting of the text “Andi went to school to meet Bob” is represented as the following bag of words: {Andi, went, to, school, to, meet, Bob}. Note that if we had another document with the text “Bob went to school to meet Andi”, it would have the same representation as the previous one. We refer to this as the bag-of-words representation of the document. We also assume the words appearing in our collection of documents come from a dictionary denoted by $A$.
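As a quick illustration in R, the bag-of-words counts for the example sentence above can be computed as follows (a minimal sketch; lower-casing is an extra choice made here):

```r
doc    <- "Andi went to school to meet Bob"
words  <- strsplit(tolower(doc), " ")[[1]]   # split the text into words
counts <- table(words)                       # bag-of-words: count of each word
counts                                       # "to" appears twice, every other word once
```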
Generative Model. We assume the following hypothetical generative story for generating the collection of our documents:
For each document $d$:
toss the $K$-face dice (with parameter $\pi$) to choose the face $k$ (i.e., the cluster) that the $d$th document belongs to;
for each word placeholder in the document, generate the word by tossing the dice (with parameter $\varphi_k$) corresponding to the face $k$.
The parameters of the model are (1) the cluster proportions $\pi = (\pi_1, \dots, \pi_K)$, which is a probability vector of size $K$ where $\sum_{k=1}^{K} \pi_k = 1$, and (2) the word proportions $\varphi_k = (\varphi_{k,1}, \dots, \varphi_{k,|A|})$ corresponding to the $k$th face of the dice, where $\sum_{w \in A} \varphi_{k,w} = 1$; note that we have $K$ such word proportion vectors, each of which corresponds to a face of the dice (or cluster).
The probability of generating a pair of a document and its cluster, $(k, d)$, according to our generative story, is
$$p(k, d) = \pi_k \prod_{w \in A} \varphi_{k,w}^{\,c(w,d)}$$
where $c(w, d)$ is simply the number of occurrences of the word $w$ in the document $d$.
In practice, the document cluster ids are not given to us, so we use latent variables to denote the cluster
assignments.
Complete Data
The Likelihood. Assume that in addition to the documents $1, \dots, N$, we were also given the values of the corresponding latent variables $z_1, \dots, z_N$; then
$$p(\text{documents}, z_{1:N} \mid \pi, \varphi) = \prod_{d=1}^{N} \prod_{k=1}^{K} \left[ \pi_k \prod_{w \in A} \varphi_{k,w}^{\,c(w,d)} \right]^{z_{dk}}$$
where $z_d = (z_{d1}, \dots, z_{dK})$ is the cluster assignment vector for the $d$th document, in which $z_{dk} = 1$ if this document belongs to cluster $k$ and zero otherwise. Note that only one element of the cluster assignment vector is 1, and the rest are zero. To summarise, the log-likelihood of the complete data is
$$\ln p(\text{documents}, z_{1:N} \mid \pi, \varphi) = \sum_{d=1}^{N} \sum_{k=1}^{K} z_{dk} \left[ \ln \pi_k + \sum_{w \in A} c(w, d) \ln \varphi_{k,w} \right].$$
Learning the Parameters. Maximising the complete-data log-likelihood to learn the parameters of the model leads to the following solution:
The mixing components: $\pi_k = \frac{N_k}{N}$ where $N_k = \sum_{d=1}^{N} z_{dk}$
The word proportion parameters for each cluster: $\varphi_{k,w} = \dfrac{\sum_{d=1}^{N} z_{dk}\, c(w, d)}{\sum_{w' \in A} \sum_{d=1}^{N} z_{dk}\, c(w', d)}$
The above results can be obtained by forming the Lagrangian (to enforce the constraints, e.g., that the mixing coefficients must sum to one), and then setting the derivatives with respect to the parameters to zero (see Appendix A).
The above estimates for the optimal parameters are very intuitive. For example, the best value for $\varphi_k$ is obtained by counting the number of times that each word of the dictionary has been seen in the documents belonging to cluster $k$, and then normalising this count vector so that it sums to 1 (by dividing each element of the count vector by the sum of the counts).
Incomplete Data and EM
The Likelihood. In practice, the document clusters are not given to us, so $z_d$ is latent. The probability of the observed documents is therefore
$$p(\text{documents} \mid \pi, \varphi) = \prod_{d=1}^{N} \sum_{k=1}^{K} \pi_k \prod_{w \in A} \varphi_{k,w}^{\,c(w,d)}.$$
To summarise, the log-likelihood is
$$\ln p(\text{documents} \mid \pi, \varphi) = \sum_{d=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \prod_{w \in A} \varphi_{k,w}^{\,c(w,d)} \right\}.$$
To maximise the above incomplete data log-likelihood objective, we resort to the EM Algorithm.
The $Q$ Function. Let us look at the $Q$ function, which will be the basis of our EM algorithm:
$$Q(\theta, \theta^{\text{old}}) = \sum_{d=1}^{N} \sum_{k=1}^{K} \gamma(z_{dk}) \left[ \ln \pi_k + \sum_{w \in A} c(w, d) \ln \varphi_{k,w} \right]$$
where $\theta = \{\pi, \varphi_1, \dots, \varphi_K\}$ is the collection of model parameters, and
$$\gamma(z_{dk}) = \frac{\pi_k^{\text{old}} \prod_{w \in A} \left(\varphi_{k,w}^{\text{old}}\right)^{c(w,d)}}{\sum_{j=1}^{K} \pi_j^{\text{old}} \prod_{w \in A} \left(\varphi_{j,w}^{\text{old}}\right)^{c(w,d)}}$$
are the responsibility factors.
To maximise the $Q$ function, we can form the Lagrangian to enforce the constraints (see Appendix A) and set the derivatives to zero, which leads to the following solution for the model parameters:
The mixing components: $\pi_k = \frac{N_k}{N}$ where $N_k = \sum_{d=1}^{N} \gamma(z_{dk})$
The word proportion parameters for each cluster: $\varphi_{k,w} = \dfrac{\sum_{d=1}^{N} \gamma(z_{dk})\, c(w, d)}{\sum_{w' \in A} \sum_{d=1}^{N} \gamma(z_{dk})\, c(w', d)}$
The EM Algorithm. Now let us put everything together to construct our EM algorithm, which learns the parameters and finds the best values for the latent variables (a minimal sketch in R follows these steps):
Choose an initial setting for the parameters $\theta^{\text{old}} = \{\pi, \varphi_1, \dots, \varphi_K\}$
While convergence is not met:
E step: Set the responsibilities $\gamma(z_{dk})$ based on $\theta^{\text{old}}$
M step: Set $\theta^{\text{new}} = \{\pi, \varphi_1, \dots, \varphi_K\}$ based on the update equations above, then set $\theta^{\text{old}} \leftarrow \theta^{\text{new}}$
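Here is a sketch of one EM cycle for this multinomial mixture, computed in log space for numerical stability (written for this module as an illustration; C is assumed to be an N x |A| matrix of word counts c(w, d), and all variable names are illustrative):

```r
em_step_docs <- function(C, log_pi, log_phi) {
  # C:       N x V matrix of word counts, one row per document (V = dictionary size)
  # log_pi:  length-K vector of log cluster proportions
  # log_phi: K x V matrix of log word proportions
  N <- nrow(C); K <- length(log_pi)

  # E step: log p(z_d = k, d) = log pi_k + sum_w c(w, d) * log phi_{k, w}
  log_joint <- C %*% t(log_phi) + matrix(log_pi, N, K, byrow = TRUE)
  log_norm  <- apply(log_joint, 1, function(v) max(v) + log(sum(exp(v - max(v)))))
  gamma     <- exp(log_joint - log_norm)           # responsibilities gamma[d, k]

  # M step: re-estimate the parameters from the responsibilities.
  Nk      <- colSums(gamma)
  pi_new  <- Nk / N
  phi_new <- t(gamma) %*% C                        # expected word counts per cluster
  phi_new <- phi_new / rowSums(phi_new)            # normalise each cluster's word proportions
  # (In practice a small smoothing constant is often added to avoid zero word counts.)

  list(log_pi = log(pi_new), log_phi = log(phi_new), gamma = gamma)
}
```

Iterating em_step_docs until convergence gives the full algorithm; the hard-EM variant simply replaces gamma by a one-hot assignment, as in the previous chapter.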
6 Activity 4.2: Document Clustering
In this activity, we implement soft and hard expectation maximisation for the document clustering problem. This activity helps you to complete Assignment 3.
Please download this (https://www.alexandriarepository.org/wp-content/uploads/20160701174342/Activity4.zip) zip file (right
click and then choose “save link as”) that contains the dataset (.csv), R script (.r), Jupyter notebook
(.ipynb) and output file (.html). If you have access to a Jupyter instance, the notebook (and dataset) is
enough to run the experiments. Otherwise, you may read the instructions and discussions from the HTML
file and execute the R script.
For a detailed discussion of setting up your programming environment, please refer to Appendix B: Setting up Your Programming Environment (https://www.alexandriarepository.org/?post_type=module&p=97654).
7 Appendix A: Constrained Optimisation
Constrained Optimisation
In these problems, we would like to maximise or minimise an objective function subject to some constraints. A constraint is a condition that the solution must satisfy. In this unit, the constraints are equalities, but in general they can be inequalities as well. Briefly speaking, the constrained optimisation problem that we consider has the following form:
$$\max_{x}\; f(x) \quad \text{subject to} \quad g_i(x) = 0, \quad i = 1, \dots, m.$$
Remember that previously we saw gradient-based algorithms for solving unconstrained optimisation problems, i.e., there were no constraints on the solution other than achieving the best value of the objective function. Our strategy for solving unconstrained optimisation problems was to set the partial derivatives to zero and either find a solution analytically or use an iterative algorithm like BGD or SGD. Our strategy for solving constrained optimisation problems is to convert them to unconstrained optimisation problems using a technique called the method of Lagrange multipliers, and then solve them using the gradient-based methods that we already know.
Lagrange Multipliers
The method of Lagrange multipliers is a strategy for finding the local maxima and minima of a function subject to equality constraints. More specifically, for the constrained optimisation problem mentioned above, we introduce the Lagrangian as follows:
$$L(x, \lambda) = f(x) + \sum_{i=1}^{m} \lambda_i\, g_i(x)$$
where $\lambda = (\lambda_1, \dots, \lambda_m)$ are called the Lagrange multipliers.
If $x^{*}$ is a maximum of $f(x)$ for the original constrained optimisation problem, then there exists $\lambda^{*}$ such that $(x^{*}, \lambda^{*})$ is a stationary point of the Lagrange function (stationary points are those points where the partial derivatives of $L$ are zero). Therefore, to solve a constrained optimisation problem, the recipe is as follows: (1) construct the Lagrangian, and (2) set the partial derivatives to zero (with respect to the original variables and all the Lagrange multipliers).
To summarise, the original constrained optimisation problem is converted to an unconstrained optimisation problem via the Lagrangian. We then use gradient-based techniques to find a stationary point of the Lagrangian, which in turn (hopefully) gives a solution of the original problem.
Example
Suppose we want to solve an optimisation problem with two variables and a single equality constraint. We form the corresponding Lagrangian and set its partial derivatives to zero, which gives three equations. Solving the first two equations for the two optimisation variables and substituting into the third (constraint) equation yields the Lagrange multiplier, and hence the optimal values of the variables.
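For concreteness, here is a small worked instance of this recipe (the objective and constraint below are chosen purely for illustration and are not necessarily those of the original example): maximise $f(x, y) = x y$ subject to $x + y = 1$. The Lagrangian and its stationarity conditions are
$$L(x, y, \lambda) = x y + \lambda\,(x + y - 1)$$
$$\frac{\partial L}{\partial x} = y + \lambda = 0, \qquad \frac{\partial L}{\partial y} = x + \lambda = 0, \qquad \frac{\partial L}{\partial \lambda} = x + y - 1 = 0.$$
The first two equations give $x = y = -\lambda$; substituting into the third gives $-2\lambda = 1$, i.e., $\lambda = -\tfrac{1}{2}$, and hence $x = y = \tfrac{1}{2}$, which indeed maximises $x y$ on the constraint line.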
If you are interested in knowing more about the method of Lagrange multipliers, please see the Wikipedia page (https://en.wikipedia.org/wiki/Lagrange_multiplier) on this subject.