3 A General View of EM
In this section, we present a general view of the EM algorithm that recognizes the key role played by
latent variables. We discuss this approach first of all in an abstract setting, and then for illustration we
consider once again the case of Gaussian mixtures.
The EM Algorithm: General Case
The Training Objective. The goal of the EM algorithm is to find maximum likelihood solutions for models having latent variables. We denote the observed data by $\mathbf{X}$, and similarly we denote the latent variables by $\mathbf{Z}$. The set of all model parameters is denoted by $\boldsymbol{\theta}$, and so the log likelihood function is given by
$$\ln p(\mathbf{X} \mid \boldsymbol{\theta}) = \ln \left\{ \sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) \right\}.$$
Note that our discussion will apply equally well to continuous latent variables simply by replacing the sum over $\mathbf{Z}$ with an integral.
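To make this concrete, recall the Gaussian mixture model of the previous chapter, where $\mathbf{Z}$ collects one-hot cluster indicators; the marginal log likelihood then takes the familiar form
$$\ln p(\mathbf{X} \mid \boldsymbol{\theta}) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\},$$
and it is the sum over components inside the logarithm that rules out a closed-form maximum likelihood solution.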
Suppose that, for each observation in $\mathbf{X}$, we were told the corresponding value of the latent variable $\mathbf{Z}$. We shall call $\{\mathbf{X}, \mathbf{Z}\}$ the complete data set, and we shall refer to the actual observed data $\mathbf{X}$ as incomplete. The likelihood function for the complete data set simply takes the form $p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$, and we shall suppose that maximization of the corresponding complete-data log likelihood $\ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$ is straightforward.
Maximising the Likelihood of Incomplete Data. In practice, we are not given the complete data set $\{\mathbf{X}, \mathbf{Z}\}$, but only the incomplete data $\mathbf{X}$. Our state of knowledge of the values of the latent variables in $\mathbf{Z}$ is given only by the posterior distribution $p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})$. Because we cannot use the complete-data log likelihood, we consider instead its expected value under the posterior distribution of the latent variables, which corresponds to the E step of the EM algorithm. In the subsequent M step, we maximize this expectation as a function of the model parameters. If the current estimate for the parameters is denoted $\boldsymbol{\theta}^{\text{old}}$, then a pair of successive E and M steps gives rise to a revised estimate $\boldsymbol{\theta}^{\text{new}}$. The algorithm is initialized by choosing some starting value for the parameters $\boldsymbol{\theta}_0$. The expectation and maximisation steps are iterated until a convergence condition is met.
The EM Algorithm. More specifically, in the E step, we use the current parameter values $\boldsymbol{\theta}^{\text{old}}$ to find the posterior distribution of the latent variables, given by $p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}})$. We then use this posterior distribution to find the expectation of the complete-data log likelihood evaluated for some general parameter value $\boldsymbol{\theta}$. This expectation, denoted $Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}})$, is given by
$$Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}) = \sum_{\mathbf{Z}} p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}}) \, \ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}).$$
In the M step, we determine the revised parameter estimate $\boldsymbol{\theta}^{\text{new}}$ by maximising this function:
$$\boldsymbol{\theta}^{\text{new}} = \arg\max_{\boldsymbol{\theta}} Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}).$$
Note that in the definition of $Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}})$, the logarithm acts directly on the joint distribution $p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$, and so the corresponding M-step maximization will, by supposition, be tractable.
To summarise, the general EM algorithm is as follows:
Choose an initial setting for the parameters $\boldsymbol{\theta}^{\text{old}}$.
While the convergence condition is not met:
    E step: Evaluate $p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}})$.
    M step: Evaluate $\boldsymbol{\theta}^{\text{new}}$ given by $\boldsymbol{\theta}^{\text{new}} = \arg\max_{\boldsymbol{\theta}} Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}})$, then set $\boldsymbol{\theta}^{\text{old}} \leftarrow \boldsymbol{\theta}^{\text{new}}$.
The incomplete-data log likelihood $\ln p(\mathbf{X} \mid \boldsymbol{\theta})$ is guaranteed to increase in each cycle of the EM algorithm (we do not prove this here).
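As an illustration, the sketch below shows what this generic loop might look like in Python; the functions e_step, m_step, and log_likelihood are hypothetical placeholders for the model-specific computation of the posterior $p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}})$, the maximiser of $Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}})$, and the incomplete-data log likelihood, respectively.

```python
import numpy as np

def em(X, theta0, e_step, m_step, log_likelihood, tol=1e-6, max_iter=200):
    """Generic EM loop (sketch): alternates E and M steps until the
    incomplete-data log likelihood stops improving."""
    theta = theta0
    prev_ll = -np.inf
    for _ in range(max_iter):
        posterior = e_step(X, theta)      # E step: p(Z | X, theta_old)
        theta = m_step(X, posterior)      # M step: argmax_theta Q(theta, theta_old)
        ll = log_likelihood(X, theta)     # ln p(X | theta), used for the convergence check
        if ll - prev_ll < tol:            # EM never decreases this quantity
            break
        prev_ll = ll
    return theta
```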
The Hard-EM Algorithm. In the Hard-EM algorithm, there is no expectation over the latent variables in the definition of the objective. Instead, only the most probable value of the latent variables is used in place of the full posterior. In summary, this algorithm is as follows:
Choose an initial setting for the parameters $\boldsymbol{\theta}^{\text{old}}$.
While the convergence condition is not met:
    E step: Set $\mathbf{Z}^{*} = \arg\max_{\mathbf{Z}} p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}})$.
    M step: Set $\boldsymbol{\theta}^{\text{new}} = \arg\max_{\boldsymbol{\theta}} \ln p(\mathbf{X}, \mathbf{Z}^{*} \mid \boldsymbol{\theta})$, then set $\boldsymbol{\theta}^{\text{old}} \leftarrow \boldsymbol{\theta}^{\text{new}}$.
It can be shown that each cycle of the Hard-EM algorithm is guaranteed not to decrease its objective, the complete-data log likelihood $\ln p(\mathbf{X}, \mathbf{Z}^{*} \mid \boldsymbol{\theta})$ evaluated at the most probable assignments (we do not prove this here); unlike standard EM, however, it does not in general guarantee an increase of the incomplete-data log likelihood.
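Relative to the generic sketch above, the only change is in the E step, which now returns a single hard assignment rather than a posterior distribution. A minimal sketch, assuming the posterior factorises across datapoints (as it does for mixture models) and that posterior_fn is a hypothetical helper returning the per-datapoint posterior:

```python
import numpy as np

def hard_e_step(X, theta, posterior_fn):
    """Hard E step: pick the single most probable latent value for each datapoint,
    Z* = argmax_Z p(Z | X, theta), instead of keeping the full posterior."""
    posterior = posterior_fn(X, theta)    # assumed shape (N, K): p(z_n = k | x_n, theta)
    return np.argmax(posterior, axis=1)   # hard cluster index for each datapoint
```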
EM for GMMs: Revisited
The Complete Data Likelihood. Assume that, in addition to the observed data set $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$, we were also given the values of the corresponding latent variables $\mathbf{Z} = \{\mathbf{z}_1, \dots, \mathbf{z}_N\}$; then
$$p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \prod_{n=1}^{N} \prod_{k=1}^{K} \left[ \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right]^{z_{nk}},$$
where $\mathbf{z}_n = (z_{n1}, \dots, z_{nK})^{\mathsf{T}}$ is the cluster assignment vector for the $n$th datapoint, in which $z_{nk} = 1$ if this datapoint belongs to the $k$th cluster and zero otherwise. Note that only one element of the cluster assignment vector is 1, and the rest are zero.
Maximising the complete-data likelihood to find the parameters of the model is straightforward (a sketch in code follows the list):
The mixing components: $\pi_k = \dfrac{N_k}{N}$, where $N_k = \sum_{n=1}^{N} z_{nk}$.
The mean parameters: $\boldsymbol{\mu}_k = \dfrac{1}{N_k} \sum_{n=1}^{N} z_{nk} \, \mathbf{x}_n$.
The covariance matrices: $\boldsymbol{\Sigma}_k = \dfrac{1}{N_k} \sum_{n=1}^{N} z_{nk} \, (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{\mathsf{T}}$.
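As a concrete illustration, here is a minimal NumPy sketch of these closed-form estimates, assuming X is an (N, D) data array and Z is an (N, K) one-hot assignment matrix (the function and variable names are assumptions):

```python
import numpy as np

def gmm_complete_data_mle(X, Z):
    """Closed-form maximum likelihood estimates for a GMM given the data X (N, D)
    and one-hot cluster assignments Z (N, K)."""
    N, D = X.shape
    K = Z.shape[1]
    Nk = Z.sum(axis=0)                  # N_k = sum_n z_nk, shape (K,)
    pi = Nk / N                         # mixing coefficients
    mu = (Z.T @ X) / Nk[:, None]        # means: per-cluster averages, shape (K, D)
    Sigma = np.zeros((K, D, D))
    for k in range(K):
        diff = X - mu[k]                # deviations from the k-th mean, shape (N, D)
        Sigma[k] = (Z[:, k, None] * diff).T @ diff / Nk[k]   # weighted outer products
    return pi, mu, Sigma
```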
The $Q$ function. In practice we are not given the values of the latent variables, and we need to resort to the EM algorithm to find the best values for the parameters and the latent variables. The expected complete-data log likelihood is
$$Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}) = \mathbb{E}_{\mathbf{Z}}\!\left[ \ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) \right] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \left\{ \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\},$$
where $\gamma(z_{nk}) = \mathbb{E}[z_{nk} \mid \mathbf{x}_n]$ is indeed the responsibility that we defined before,
$$\gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)},$$
evaluated at the current parameter values $\boldsymbol{\theta}^{\text{old}}$.
Maximising the $Q$ function leads to the following updated parameters:
The mixing components: $\pi_k^{\text{new}} = \dfrac{N_k}{N}$, where $N_k = \sum_{n=1}^{N} \gamma(z_{nk})$.
The mean parameters: $\boldsymbol{\mu}_k^{\text{new}} = \dfrac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, \mathbf{x}_n$.
The covariance matrices: $\boldsymbol{\Sigma}_k^{\text{new}} = \dfrac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, (\mathbf{x}_n - \boldsymbol{\mu}_k^{\text{new}})(\mathbf{x}_n - \boldsymbol{\mu}_k^{\text{new}})^{\mathsf{T}}$.
These have the same form as the complete-data estimates above, with the hard assignments $z_{nk}$ replaced by the soft responsibilities $\gamma(z_{nk})$.
The EM Algorithm for GMMs. The EM algorithm is thus as follows:
Choose an initial setting for the parameters $\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}$.
While the convergence condition is not met:
    E step: Set the responsibilities $\gamma(z_{nk})$ based on the current parameters $\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}$.
    M step: Set the parameters $\boldsymbol{\pi}^{\text{new}}, \boldsymbol{\mu}^{\text{new}}, \boldsymbol{\Sigma}^{\text{new}}$ based on the responsibilities $\gamma(z_{nk})$, using the update equations above.
This is exactly the EM algorithm that we saw in the previous chapter.
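For illustration, a compact NumPy sketch of one such cycle is given below; the function and variable names are assumptions, and numerical safeguards such as covariance regularisation are omitted.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em_step(X, pi, mu, Sigma):
    """One EM cycle for a GMM: the E step computes responsibilities,
    the M step re-estimates the parameters from them."""
    N, D = X.shape
    K = len(pi)

    # E step: responsibilities gamma(z_nk) proportional to pi_k * N(x_n | mu_k, Sigma_k)
    gamma = np.zeros((N, K))
    for k in range(K):
        gamma[:, k] = pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
    gamma /= gamma.sum(axis=1, keepdims=True)

    # M step: closed-form updates using the responsibilities as soft counts
    Nk = gamma.sum(axis=0)
    pi_new = Nk / N
    mu_new = (gamma.T @ X) / Nk[:, None]
    Sigma_new = np.zeros((K, D, D))
    for k in range(K):
        diff = X - mu_new[k]
        Sigma_new[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    return pi_new, mu_new, Sigma_new
```

Iterating gmm_em_step until the incomplete-data log likelihood stops improving recovers the full algorithm described above.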