If we define a joint distribution over observed and latent variables,
the corresponding distribution of the observed variables alone is
obtained by marginalization. This allows relatively complex marginal
distributions over observed variables to be expressed in terms of more
tractable joint distributions over the expanded space of observed and
latent variables.
Mixture distributions, such as the Gaussian mixture, can be
interpreted in terms of discrete latent variables. These mixture
models are also useful in clustering data. We first discuss the
problem of finding clusters using a non-probabilistic technique called
K-means clustering. Then we introduce the latent variable view of
mixture distributions where the discrete latent variables can be
interpreted as defining assignments of data points to specific
components of the mixture.
The EM algorithm is a general approach to finding maximum likelihood estimators in latent variable models. We will see that K-means clustering corresponds to a particular non-probabilistic limit of the EM algorithm applied to mixtures of Gaussians.
K-means Clustering
We formulate the K-means clustering problem as follows.
Suppose we have a data set $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ consisting of $N$ observations of a random $D$-dimensional Euclidean variable $\mathbf{x}$. Our goal is to partition the data set into a
fixed number of clusters $K$. We formalize this by introducing a set of $D$-dimensional vectors
$\boldsymbol{\mu}_k$, for $k = 1, \ldots, K$, where $\boldsymbol{\mu}_k$ represents the center of the $k$th cluster, together with binary indicator variables $r_{nk} \in \{0, 1\}$ specifying which cluster each data point $\mathbf{x}_n$ is assigned to. Our goal
is then to minimize the objective function (distortion measure):
$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert \mathbf{x}_n - \boldsymbol{\mu}_k \rVert^2.$$
The K-means algorithm works as follows:

1. Choose some initial values for the $\boldsymbol{\mu}_k$.
2. Minimize $J$ with respect to the $r_{nk}$, keeping the $\boldsymbol{\mu}_k$ fixed.
3. Minimize $J$ with respect to the $\boldsymbol{\mu}_k$, keeping the $r_{nk}$ fixed.
4. Repeat steps 2 and 3 until convergence.
We shall see that updating $r_{nk}$ and $\boldsymbol{\mu}_k$ corresponds respectively
to the E-step and M-step of the EM algorithm.
Consider first the determination of the $r_{nk}$. Because $J$ is a linear
function of $r_{nk}$, this optimization can be performed easily to
give a closed-form solution:
$$r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \lVert \mathbf{x}_n - \boldsymbol{\mu}_j \rVert^2, \\ 0 & \text{otherwise.} \end{cases}$$
In other words, each data point is assigned to its nearest cluster center.
Now consider the optimization of the $\boldsymbol{\mu}_k$ with the $r_{nk}$ held fixed.
The objective function $J$ is a quadratic function of $\boldsymbol{\mu}_k$, and can be
minimized by setting its derivative with respect to $\boldsymbol{\mu}_k$ to zero, which gives
$$\boldsymbol{\mu}_k = \frac{\sum_n r_{nk}\, \mathbf{x}_n}{\sum_n r_{nk}}.$$
This result can be interpreted as setting $\boldsymbol{\mu}_k$ equal to the mean of
all the data points assigned to cluster $k$.
Because each step reduces the objective function $J$, the algorithm is
guaranteed to converge to some minimum. This minimum may be a local
minimum rather than a global one.
One way of choosing the initial cluster centers $\boldsymbol{\mu}_k$ is to pick a random
subset of $K$ data points as the centers. K-means is often used to
initialize the parameters of a Gaussian mixture model before applying the
EM algorithm.
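
As a concrete illustration of the two alternating updates above, here is a minimal NumPy sketch; the function name, the random-subset initialization, and the convergence test are choices made for this illustration rather than anything prescribed by the text.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means: alternate hard assignments and mean updates."""
    rng = np.random.default_rng(seed)
    N, _ = X.shape
    # Initialize the centres mu_k with a random subset of K data points.
    mu = X[rng.choice(N, size=K, replace=False)]
    for _ in range(n_iters):
        # Assignment step (E-step analogue): r_nk = 1 for the nearest centre.
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # shape (N, K)
        assign = dists.argmin(axis=1)
        # Update step (M-step analogue): each centre becomes the mean of its points.
        new_mu = np.array([X[assign == k].mean(axis=0) if np.any(assign == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    J = dists[np.arange(N), assign].sum()   # value of the distortion measure J
    return mu, assign, J
```

Since convergence is only to a (possibly local) minimum of $J$, it is common in practice to run the algorithm from several random initializations and keep the solution with the lowest $J$.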
The K-means algorithm described above can be generalized to data sets in which a
general dissimilarity measure $\mathcal{V}(\mathbf{x}, \mathbf{x}')$ is used instead of the squared
Euclidean distance, by minimizing the distortion measure
$$\widetilde{J} = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \mathcal{V}(\mathbf{x}_n, \boldsymbol{\mu}_k),$$
which gives the K-medoids algorithm, in which each prototype $\boldsymbol{\mu}_k$ is constrained to be one of the data points assigned to cluster $k$.
The computational cost of the E-step is $O(KN)$ evaluations of the dissimilarity, while the M-step for cluster $k$
involves $O(N_k^2)$ evaluations of $\mathcal{V}(\cdot, \cdot)$, where $N_k$ is the number of points assigned to that cluster.
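
A small sketch of this M-step is shown below, assuming the dissimilarity is supplied as a Python function `dissim(a, b)` and that every cluster is non-empty; the nested loop makes the $O(N_k^2)$ cost explicit.

```python
import numpy as np

def kmedoids_m_step(X, assign, K, dissim):
    """Choose each cluster's prototype as the member with minimal total dissimilarity."""
    medoids = []
    for k in range(K):
        members = X[assign == k]          # assumed non-empty
        # O(N_k^2) dissimilarity evaluations within cluster k.
        costs = [sum(dissim(a, b) for b in members) for a in members]
        medoids.append(members[int(np.argmin(costs))])
    return np.array(medoids)

# Example dissimilarity: squared Euclidean distance.
sq_euclidean = lambda a, b: float(((a - b) ** 2).sum())
```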
One feature of the K-means algorithm is that it assigns each data point
uniquely to one of the clusters. There may, however, be
data points that lie roughly halfway between two cluster centers. By
adopting a probabilistic approach, we can instead obtain ‘soft’ assignments
that reflect the level of uncertainty over the most appropriate
assignment.
The Gaussian Mixture Model
The Gaussian mixture model allows us to model more complex
distributions. We can formulate the Gaussian mixture model in terms of
discrete latent variables. The Gaussian mixture distribution can be
written as a linear superposition of Gaussians of the form:
$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k).$$
We can introduce a $K$-dimensional binary random variable $\mathbf{z}$ having
a 1-of-$K$ representation, such that all elements are 0 except for a single element $z_k$ equal to 1. We can then define the joint distribution
$p(\mathbf{x}, \mathbf{z}) = p(\mathbf{z})\, p(\mathbf{x} \mid \mathbf{z})$ corresponding to the following graphical
model.
Figure 1: Mixture of Gaussians
The mixing coefficients $\pi_k$ must satisfy:
$$0 \leq \pi_k \leq 1, \qquad \sum_{k=1}^{K} \pi_k = 1.$$
We can then represent the marginal distribution over $\mathbf{z}$ as:
$$p(\mathbf{z}) = \prod_{k=1}^{K} \pi_k^{z_k}.$$
The conditional distribution of $\mathbf{x}$ given $\mathbf{z}$ can also be written as:
$$p(\mathbf{x} \mid \mathbf{z}) = \prod_{k=1}^{K} \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)^{z_k}.$$
Since the joint distribution is given by
$p(\mathbf{x}, \mathbf{z}) = p(\mathbf{z})\, p(\mathbf{x} \mid \mathbf{z})$, the marginal distribution of $\mathbf{x}$
is obtained by summing the joint distribution over all
possible states of $\mathbf{z}$:
$$p(\mathbf{x}) = \sum_{\mathbf{z}} p(\mathbf{z})\, p(\mathbf{x} \mid \mathbf{z}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),$$
which recovers the mixture distribution written above.
We define another quantity, $\gamma(z_k) \equiv p(z_k = 1 \mid \mathbf{x})$, whose
value can be found via Bayes' theorem:
$$\gamma(z_k) = \frac{\pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}.$$
We shall view $\pi_k$ as the prior probability that $z_k = 1$, and
$\gamma(z_k)$ as the corresponding posterior probability once $\mathbf{x}$ has been observed: the responsibility that component $k$ takes for
explaining the observation $\mathbf{x}$.
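
This formula translates directly into code. Below is a small sketch using NumPy and SciPy's `multivariate_normal`, evaluated for a whole data set at once; the parameter names (`pis`, `mus`, `covs`) are illustrative choices.

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pis, mus, covs):
    """gamma[n, k] = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)."""
    N, K = X.shape[0], len(pis)
    weighted = np.empty((N, K))
    for k in range(K):
        weighted[:, k] = pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=covs[k])
    return weighted / weighted.sum(axis=1, keepdims=True)
```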
Suppose we have a data set of observations $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, and wish to model these data using a mixture of
Gaussians. We can find maximum likelihood estimates of the mixture
model parameters by maximizing the log likelihood function
$$\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}.$$
Before proceeding, it is worth noting that the
maximum likelihood approach has a severe problem: maximizing the log likelihood function is not a well-posed problem,
because there are singularities at which the log likelihood goes to infinity in the limit where the variance of a component shrinks to zero. This occurs
when a Gaussian component collapses onto a single data point, and it can
be interpreted as a form of overfitting. We can hope to avoid these
singularities by using suitable heuristics (for example, detecting and resetting a collapsing component), or by adopting a MAP or fully Bayesian
approach.
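
As a sketch of such a heuristic (my own illustration, under the assumption that we simply floor the eigenvalues of a collapsing covariance matrix; the text does not prescribe a specific fix):

```python
import numpy as np

def guard_covariance(cov, min_var=1e-6):
    """Heuristic: keep a covariance matrix away from singularity by flooring its eigenvalues."""
    vals, vecs = np.linalg.eigh(cov)
    vals = np.maximum(vals, min_var)     # prevent any variance from collapsing to zero
    return (vecs * vals) @ vecs.T        # reconstruct V diag(vals) V^T
```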
EM for Gaussian Mixture Model
We can find maximum likelihood solutions for models with latent
variables using the EM algorithm. We demonstrate this for the case of
the Gaussian mixture model.
First, we set the derivative of the log likelihood $\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})$ with respect to $\boldsymbol{\mu}_k$ to zero:
$$0 = -\sum_{n=1}^{N} \underbrace{\frac{\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_j \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}}_{\gamma(z_{nk})} \, \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k).$$
Multiplying by $\boldsymbol{\Sigma}_k$ (which we assume to be
non-singular), we get:
$$\boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, \mathbf{x}_n, \qquad \text{where } N_k = \sum_{n=1}^{N} \gamma(z_{nk}).$$
We can interpret $N_k$ as the effective number of points assigned to
cluster $k$.
If we set the derivative of $\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})$ with respect to $\boldsymbol{\Sigma}_k$ to zero and work through the algebra, we also get:
$$\boldsymbol{\Sigma}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{\mathsf{T}},$$
and maximizing with respect to the mixing coefficients (using a Lagrange multiplier to enforce $\sum_k \pi_k = 1$) gives
$$\pi_k = \frac{N_k}{N}.$$
These results are not closed-form solutions for the parameters of the
mixture model, since they depend on the responsibilities $\gamma(z_{nk})$, which in turn depend on those parameters. However, the
iterative procedure of the EM algorithm allows us to choose some
initial values and then alternate E- and M-steps until convergence.
In the expectation step (E-step), we use the current values of the
parameters to evaluate the posterior probabilities, or
responsibilities. We then use these responsibilities in the maximization
step (M-step) to re-estimate the means, covariances and mixing
coefficients.
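
Putting the two steps together gives the following sketch in NumPy/SciPy. The initialization scheme, the log-likelihood convergence test, and the small ridge added to each covariance (to guard against the singularities discussed earlier) are choices made for this illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, tol=1e-6, seed=0):
    """EM for a Gaussian mixture: E-step responsibilities, M-step closed-form updates."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Initialization: random means, shared data covariance, uniform mixing coefficients.
    mus = X[rng.choice(N, size=K, replace=False)]
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    pis = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iters):
        # E-step: evaluate responsibilities gamma[n, k] with the current parameters.
        weighted = np.column_stack([
            pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=covs[k]) for k in range(K)])
        gamma = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: re-estimate means, covariances and mixing coefficients.
        Nk = gamma.sum(axis=0)                    # effective number of points per component
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            covs[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pis = Nk / N
        # Monitor the incomplete-data log likelihood for convergence.
        ll = np.log(weighted.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pis, mus, covs
```

In practice the means would typically be initialized by running K-means first, as noted above.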
General EM
The goal of the EM algorithm is to find maximum likelihood solutions
for models having latent variables. We denote the set of all observed
data by $\mathbf{X}$, in which the $n$th row represents
$\mathbf{x}_n^{\mathsf{T}}$, and similarly we denote the set of all latent
variables by $\mathbf{Z}$, with corresponding rows $\mathbf{z}_n^{\mathsf{T}}$. The set
of all model parameters is denoted by $\boldsymbol{\theta}$. The
log-likelihood function is given by:
$$\ln p(\mathbf{X} \mid \boldsymbol{\theta}) = \ln \left\{ \sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) \right\}.$$
A key observation is that the summation over the latent variables occurs inside the
logarithm. Even if the joint distribution $p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$ belongs to the exponential
family, the marginal distribution $p(\mathbf{X} \mid \boldsymbol{\theta})$ generally does
not because of this summation. The presence of the sum prevents the
logarithm from acting directly on the joint distribution, resulting in
complicated expressions for the maximum likelihood solution.
Since we are in general not given the complete data set $\{\mathbf{X}, \mathbf{Z}\}$, but only the incomplete data $\mathbf{X}$, our state of
knowledge of the values of the latent variables is given only by the
posterior distribution $p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})$.
We therefore cannot use the complete-data log likelihood directly; instead, we consider its expected value under the posterior
distribution of the latent variables, which corresponds to the E-step
of the EM algorithm. In the subsequent M-step, we maximize this
expectation.
In the E-step, we use the current parameter values $\boldsymbol{\theta}^{\text{old}}$ to
find the posterior distribution of the latent variables, given by
$p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}})$. We use this
posterior distribution to find the expectation of the complete-data log
likelihood evaluated for some general parameter value $\boldsymbol{\theta}$. This
expectation, denoted $\mathcal{Q}(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}})$, is given by:
$$\mathcal{Q}(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}) = \sum_{\mathbf{Z}} p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}}) \, \ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}).$$
In the M-step, we revise the parameter estimate
by maximizing this function:
$$\boldsymbol{\theta}^{\text{new}} = \arg\max_{\boldsymbol{\theta}} \mathcal{Q}(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}).$$
In the definition of $\mathcal{Q}$, the logarithm acts directly on the joint
distribution, making the M-step tractable.
In general, we suppose that the direct optimization of $p(\mathbf{X} \mid \boldsymbol{\theta})$ is difficult, whereas the optimization of the complete-data likelihood $p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$ is significantly easier.
We introduce a distribution $q(\mathbf{Z})$ over the latent variables,
and we observe that for any choice of $q(\mathbf{Z})$, the following
decomposition holds:
$$\ln p(\mathbf{X} \mid \boldsymbol{\theta}) = \mathcal{L}(q, \boldsymbol{\theta}) + \mathrm{KL}(q \,\Vert\, p),$$
where
$$\mathcal{L}(q, \boldsymbol{\theta}) = \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \left\{ \frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{q(\mathbf{Z})} \right\}$$
and
$$\mathrm{KL}(q \,\Vert\, p) = -\sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \left\{ \frac{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})}{q(\mathbf{Z})} \right\}.$$
Because $\mathrm{KL}(q \,\Vert\, p) \geq 0$, the quantity $\mathcal{L}(q, \boldsymbol{\theta})$ is a lower bound on the
log likelihood. The EM algorithm alternates between computing this lower bound for the current parameter values (the E-step, which sets $q(\mathbf{Z}) = p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}})$ so that the KL term vanishes) and maximizing
the bound with respect to $\boldsymbol{\theta}$ to obtain new parameter values (the M-step).
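
To make the connection with the $\mathcal{Q}$ function explicit, substituting $q(\mathbf{Z}) = p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}})$ into the lower bound gives
$$\mathcal{L}(q, \boldsymbol{\theta}) = \sum_{\mathbf{Z}} p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}}) \ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) \;-\; \sum_{\mathbf{Z}} p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}}) \ln p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}}) = \mathcal{Q}(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}) + \text{const},$$
where the constant is the entropy of the posterior and does not depend on $\boldsymbol{\theta}$, so maximizing the bound in the M-step is equivalent to maximizing $\mathcal{Q}$.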
For complex models, the E-step or the M-step may still be intractable.
The generalized EM (GEM) algorithm addresses the problem of an
intractable M-step: instead of maximizing $\mathcal{L}(q, \boldsymbol{\theta})$ with respect to
$\boldsymbol{\theta}$, it seeks only to change the parameters so as to increase its
value. Similarly, one can address an intractable E-step
by seeking only to partially optimize $\mathcal{L}(q, \boldsymbol{\theta})$ with respect to
$q(\mathbf{Z})$.