Lecture-77
Expectation Maximization
Here we are assuming two things. We are assuming that there is some parameterized family (i.e.,
you are doing some parameterized fitting) for which the joint likelihood is easy to compute. So we
denote by {𝑋, 𝑍} the complete data (the observed data plus the hidden data), and we assume some
parameterized family from which this data is generated (like a Gaussian or an exponential), and we
do not know the parameters, which we want to estimate.
So we start seeing connections with what we have seen in the case of Gaussian mixture because if
you take the marginal probability here, you again see a summation coming inside the 𝑙𝑜𝑔, and this
again poses problems. Also, the marginals need not be of the same family. For example, if the joint
probability distribution is an exponential distribution, it does not mean that the corresponding
marginal probability distributions also come from the exponential family.
EM as it is most commonly used today was proposed by Dempster, Laird and Rubin in their 1977
seminal paper “Maximum Likelihood from incomplete data via the EM algorithm”. Even before
1977, many statisticians had developed EM-like algorithms, but when we cite EM, we usually
cite this paper.
The complete data is the combination of the incomplete data and hidden data. We assume some
parameterized family for the complete data, and want to estimate the parameters. This is the
problem that EM solves.
So we decide to not compute the maximum likelihood in this way, but instead compute the
maximum taking the expectation of the log likelihood of the complete data under the distribution
of Z. But we do not know the real distribution of Z because we do not know the parameters. So
we take the guess that we had in the previous round. We compute the expectation of the complete
data likelihood under the distribution of Z given the current guess of the parameters.
This works. We will see why it works. So the entire EM algorithm can actually be represented by
just this one line. You start with some guess, and then for the next guess you calculate the
expectation of the complete data log likelihood under the distribution of the missing data using the
previous guess. This can actually be broken down into two steps, where in the E step you compute
this expectation, and in the M step you maximize this Q and get the next set of parameters.
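In symbols, using 𝜗^(m) for the guess at round m (the same notation used below), the two steps of one iteration are:

E step:
$$Q(\vartheta, \vartheta^{(m-1)}) = \mathbb{E}_{Z \mid X, \vartheta^{(m-1)}}\big[\log p(X, Z \mid \vartheta)\big]$$

M step:
$$\vartheta^{(m)} = \arg\max_{\vartheta}\; Q(\vartheta, \vartheta^{(m-1)})$$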
This slide recalls the Gaussian mixture model, where we want to estimate all the parameters, collectively represented by 𝜗. We have 𝐾 components, and each component has parameters 𝜋_k, 𝜇_k and Σ_k. What we want to find is the maximum likelihood estimate.
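For reference, the mixture density being fitted is the usual one, with 𝜗 = {𝜋_k, 𝜇_k, Σ_k} for k = 1, ..., K:

$$p(x \mid \vartheta) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} \pi_k = 1$$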
If we want to get the maximum likelihood parameters, we first need to compute this Q function.
Then usually the M step is easier. It’s just the expectation computation in E step that requires some
work. Once you get Q, then the maximization step is just computing the derivatives.
As you can see, this works only if the E step gives you something which you can easily maximize.
In the case of Gaussian mixture, we will see that by using the complete data likelihood, we will be
able to get something that we can easily maximize.
So the complete data log likelihood here is the sum, over all the data points, of the log likelihood of (x_n, z_n) given the unknown parameters, so the Q function becomes:

$$Q(\vartheta, \vartheta^{(m-1)}) = \mathbb{E}_{Z \mid X, \vartheta^{(m-1)}}\left[\sum_{n=1}^{N} \log p(x_n, z_n \mid \vartheta)\right]$$
We can take this summation outside by linearity of expectations. Rewriting the complete data log likelihood, the equation assumes the form:
$$Q(\vartheta, \vartheta^{(m-1)}) = \sum_{n=1}^{N} \mathbb{E}_{Z \mid X, \vartheta^{(m-1)}}\left[\log \prod_{k=1}^{K} \big(\pi_k\, p(x_n \mid \theta_k)\big)^{\mathbb{I}(z_n = k)}\right] \qquad (3)$$
We can derive this formally, but intuitively it is very clear. 𝕀(z_n = k) is just an indicator function which takes the value 1 when z_n = k and the value 0 when z_n ≠ k. So in this product, all the terms except one get an exponent of 0 (and therefore those terms become 1). So only one term out of the K remains for each data point. The log likelihood follows straight away from this formula once the indicator function is used, and this gives us the complete data likelihood.
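Concretely, for a single data point the complete data likelihood written with the indicator exponent is:

$$p(x_n, z_n \mid \vartheta) = \prod_{k=1}^{K} \big(\pi_k\, p(x_n \mid \theta_k)\big)^{\mathbb{I}(z_n = k)}$$

which picks out exactly the one component that z_n points to; its logarithm is what sits inside the expectation in equation (3).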
So then the product becomes a sum when we take it outside the log, and the exponent term 𝕀(𝑧𝑛 =
𝑘) comes down. Again the expectation can be brought inside by linearity, and so we get both the
summations out.
$$Q(\vartheta, \vartheta^{(m-1)}) = \sum_{n=1}^{N} \sum_{k=1}^{K} \mathbb{E}_{Z \mid X, \vartheta^{(m-1)}}\Big[\mathbb{I}(z_n = k)\, \log\big(\pi_k\, p(x_n \mid \theta_k)\big)\Big] \qquad (4)$$
With respect to the distribution of 𝑍 using the previously guessed parameters 𝜗 (𝑚−1) ,
log(𝜋𝑘 𝑝(𝑥𝑛 |𝜃𝑘 )) is just a constant. So the expectation is only over the indicator function 𝕀(𝑧𝑛 =
𝑘).
Expectation of an indicator function just gives us the probability, and we have the probability of
𝑧𝑛 = 𝑘, again given 𝑋 and the previously guessed parameter values 𝜗 (𝑚−1) . Log just remains.
$$Q(\vartheta, \vartheta^{(m-1)}) = \sum_{n=1}^{N} \sum_{k=1}^{K} p(z_n = k \mid X, \vartheta^{(m-1)})\, \log\big(\pi_k\, p(x_n \mid \theta_k)\big)$$
So 𝑝(𝑧𝑛 = 𝑘|𝑋, 𝜗 (𝑚−1) ) is again the responsibility, which is the posterior probability of 𝑧𝑛 = 𝑘.
In this case, this responsibility is not with respect to the original parameters, but this posterior
probability is with respect to the guessed parameters. This has been indicated with the subscript.
So it is the responsibility times the log that remains.
$$Q(\vartheta, \vartheta^{(m-1)}) = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk})\big|_{\vartheta^{(m-1)}}\, \log\big(\pi_k\, p(x_n \mid \theta_k)\big)$$
So, expressing the log of a product as a summation of individual log terms, we get:
$$Q(\vartheta, \vartheta^{(m-1)}) = \sum_{n=1}^{N} \sum_{k=1}^{K} \Big[\gamma(z_{nk})\big|_{\vartheta^{(m-1)}} \log \pi_k + \gamma(z_{nk})\big|_{\vartheta^{(m-1)}} \log p(x_n \mid \theta_k)\Big] \qquad (5)$$
So what have we achieved? We have got an expression for Q again in terms of the responsibility, but this time the responsibility is with respect to the guessed parameters.
If you just look at the Q function in equation (5), you can see that it is easier to differentiate because
the summations are all outside and the normal distribution term comes towards the end of the
equation. The differentiation of the normal distribution term will be just like what you do in the
case of fitting a single Gaussian. How did this nice mathematical form come about? It happened
mainly because we were taking the complete data likelihood as shown in equation (2). The
complete data likelihood gave us a nice mathematical form in equation (3), and due to the
expectation, all the summations got pushed out in equation (4). So we got a nice form for 𝑄 as
shown in equation (5).
So is it only because the summation is inside the logarithm that we need to do all this? Otherwise, could we skip the whole thing?
Yes, but that situation arises in many contexts, not just in the case of the Gaussian mixture. In a lot of those cases, EM is useful. If we could get the maximum likelihood easily, we would not need to use EM for the Gaussian mixture.
To maximize Q with respect to 𝜇_k, the whole first term inside the summations (which is γ(z_nk)|_𝜗^(m−1) log 𝜋_k) is not needed, since it does not depend on 𝜇_k. We focus only on the second term inside the summations (which is γ(z_nk)|_𝜗^(m−1) log p(x_n|θ_k)). For each component, the full normal density appears inside this logarithm.
$$\frac{\partial Q}{\partial \mu_k} = \frac{\partial}{\partial \mu_k}\left\{\sum_{n=1}^{N} \gamma(z_{nk})\big|_{\vartheta^{(m-1)}} \log p(x_n \mid \theta_k)\right\}, \qquad k = 1, \dots, K$$

$$\frac{\partial Q}{\partial \mu_k} = \frac{\partial}{\partial \mu_k}\left\{\sum_{n=1}^{N} \gamma(z_{nk})\big|_{\vartheta^{(m-1)}} \log\left[\frac{1}{(2\pi)^{p/2}\, |\Sigma_k|^{1/2}} \exp\left\{-\frac{1}{2}(x_n - \mu_k)^{T} \Sigma_k^{-1} (x_n - \mu_k)\right\}\right]\right\}$$
There are no summations inside the logarithm. This is exactly the same derivation as for a single
Gaussian. We use matrix derivatives to get very simple forms here.
Setting ∂Q/∂𝜇_k = 0, we get:

$$\mu_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk})\big|_{\vartheta^{(m-1)}}\, x_n}{\sum_{n=1}^{N} \gamma(z_{nk})\big|_{\vartheta^{(m-1)}}}, \qquad k = 1, \dots, K$$
We will see that we again get the same form for 𝜇_k that we saw earlier for our ad hoc iterative algorithm, except that these responsibilities are with respect to the previously guessed parameters.
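As a quick sanity check of this formula, the update is essentially a one-liner in NumPy; here `gamma` is an N×K array of responsibilities computed under the previous guess 𝜗^(m−1) and `X` is the N×p data matrix (both names are just placeholders for this sketch):

```python
import numpy as np

def update_means(X, gamma):
    """Responsibility-weighted mean update for mu_k, k = 1..K.

    X     : (N, p) data matrix
    gamma : (N, K) responsibilities gamma(z_nk) under the previous parameter guess
    """
    # Numerator: sum_n gamma_nk * x_n for each k  -> shape (K, p)
    # Denominator: sum_n gamma_nk for each k      -> shape (K, 1)
    return (gamma.T @ X) / gamma.sum(axis=0)[:, None]
```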
Next we differentiate with respect to Σ_k. Again, we do not need to worry about the first term within the summations, since it does not depend on Σ_k; we only differentiate the second term.
$$\frac{\partial Q}{\partial \Sigma_k} = \frac{\partial}{\partial \Sigma_k}\left\{\sum_{n=1}^{N} \gamma(z_{nk})\big|_{\vartheta^{(m-1)}} \log\left[\frac{1}{(2\pi)^{p/2}\, |\Sigma_k|^{1/2}} \exp\left\{-\frac{1}{2}(x_n - \mu_k)^{T} \Sigma_k^{-1} (x_n - \mu_k)\right\}\right]\right\}$$
We can simplify this further by first expanding the logarithm over each of these factors.
$$\frac{\partial Q}{\partial \Sigma_k} = \frac{\partial}{\partial \Sigma_k}\left\{\sum_{n=1}^{N} \gamma(z_{nk})\big|_{\vartheta^{(m-1)}} \left[\log\frac{1}{(2\pi)^{p/2}} - \frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x_n - \mu_k)^{T} \Sigma_k^{-1} (x_n - \mu_k)\right]\right\}$$
Then the derivative of log|Σ_k| with respect to Σ_k is given by a simple matrix-calculus formula (it is Σ_k^{-1} for symmetric Σ_k). Applying the matrix derivative formulas here, and setting ∂Q/∂Σ_k = 0, we get back the same form for Σ_k that we found earlier.
$$\Sigma_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk})\big|_{\vartheta^{(m-1)}}\, (x_n - \mu_k)(x_n - \mu_k)^{T}}{\sum_{n=1}^{N} \gamma(z_{nk})\big|_{\vartheta^{(m-1)}}}, \qquad k = 1, \dots, K$$
Finally, we maximize with respect to the mixture weights. The second term within the summations (which is γ(z_nk)|_𝜗^(m−1) log p(x_n|θ_k)) is a constant with respect to 𝜋_k and therefore goes away. We focus on the differentiation of the first term within the summations (which is γ(z_nk)|_𝜗^(m−1) log 𝜋_k). Since the 𝜋_k s must satisfy the constraint ∑_{k=1}^{K} 𝜋_k = 1, we use a Lagrange multiplier to get the formulation for 𝜋_k. The Lagrange function J can be written as:

$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk})\big|_{\vartheta^{(m-1)}} \log \pi_k \;+\; \lambda\left(\sum_{k=1}^{K} \pi_k - 1\right)$$
By differentiating the Lagrange function J with respect to 𝜋_k and setting the derivative to 0, we get:

$$\pi_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk})\big|_{\vartheta^{(m-1)}}}{N}, \qquad k = 1, \dots, K$$
We plug the formulas for 𝜇𝑘 , Σ𝑘 and 𝜋𝑘 into the EM framework. In the EM framework, we start
with a guess for these parameters. We then iterate by first finding the posterior distribution of Z
(which gives us the responsibilities and can be represented as 𝑝(𝑍|𝑋, 𝜗 (𝑚−1) )), and then we find
the expected complete likelihood under this distribution of Z (which can be represented as
𝑄(𝜗, 𝜗 (𝑚−1) ) = 𝔼𝑍|𝑋,𝜗(𝑚−1) log 𝑝(𝑋, 𝑍|𝜗)), and we finally maximize the Q function to get the new
guesses for the parameters.
This gives us the next set of guesses, and we can iterate, checking for convergence. When the likelihood does not change much, we stop; that is the EM algorithm for GMM. I have still not told you why this works, but we will see the theoretical properties of why it works well. I just wanted to show you that what we have got by doing all the math for EM is exactly the same as what we found with the iterative algorithm that we guessed earlier.
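Putting the E step and the three M-step formulas together, one round of this EM loop for the GMM looks roughly like the following NumPy sketch (a minimal illustration, not the lecture's code; the function name `em_gmm`, the initialization, and the use of `scipy.stats.multivariate_normal` are my assumptions, and numerical safeguards such as log-sum-exp are omitted):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, max_iters=100, tol=1e-6, seed=0):
    """Minimal EM loop for a Gaussian mixture (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    N, p = X.shape
    # Initial guess theta^(0): K random data points as means, identity covariances, uniform weights
    mu = X[rng.choice(N, size=K, replace=False)].copy()
    Sigma = np.array([np.eye(p) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    prev_ll = -np.inf

    for _ in range(max_iters):
        # E step: responsibilities gamma(z_nk) under the previous-round parameters
        dens = np.stack([pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
                         for k in range(K)], axis=1)            # shape (N, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)

        # M step: the closed-form maximizers of Q derived above
        Nk = gamma.sum(axis=0)                                   # effective counts, shape (K,)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        pi = Nk / N

        # Stop when the (incomplete-data) log likelihood stops improving,
        # with max_iters as a hard bound on the number of iterations
        ll = np.log(dens.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pi, mu, Sigma
```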
Now consider a special case of the Gaussian mixture in which the covariance of every component is fixed to Σ_k = 𝜖𝕀 and the mixing weights are fixed to 𝜋_k = 1/K, so that each component gives exactly the same contribution towards the Gaussian mixture. The only parameters left to estimate are the 𝜇_k s.
Since Σ𝑘 = 𝜖𝕀, the formula for the normal distribution simplifies to:
$$p(x \mid \theta_k) = \frac{1}{(2\pi)^{p/2}\, |\Sigma_k|^{1/2}} \exp\left\{-\frac{1}{2}(x - \mu_k)^{T} \Sigma_k^{-1} (x - \mu_k)\right\}$$

$$p(x \mid \theta_k) = \frac{1}{(2\pi\epsilon)^{p/2}} \exp\left\{-\frac{1}{2\epsilon}\|x - \mu_k\|^{2}\right\} \qquad (6)$$
The formula for the responsibility also simplifies. We plug p(x_n|θ_k) from equation (6) and also substitute 𝜋_k = 1/K into the formula for the responsibilities γ(z_nk).
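After plugging in, the factors 1/K and 1/(2πϵ)^{p/2} cancel between the numerator and the denominator of the usual responsibility formula (an intermediate step written out here for clarity):

$$\gamma(z_{nk}) = \frac{\exp\left\{-\frac{1}{2\epsilon}\|x_n - \mu_k\|^{2}\right\}}{\sum_{j=1}^{K} \exp\left\{-\frac{1}{2\epsilon}\|x_n - \mu_j\|^{2}\right\}}$$

As ϵ becomes very small, the term with the smallest ‖x_n − μ_j‖² dominates the denominator, so this ratio goes to 1 for the closest mean and to 0 for every other component.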
So this is the special case of hard clustering that was mentioned earlier. What it turns out to be is
just setting the responsibility of 𝑥𝑛 to 1 for that component for which the data point is closest to
the mean, and setting the responsibility of 𝑥𝑛 to zero for all other components.
$$\gamma(z_{nk}) = \begin{cases} 1 & \text{if } k = \arg\min_j \|x_n - \mu_j\|^{2} \\ 0 & \text{otherwise} \end{cases}$$
The responsibility is just the posterior probability of z_n being equal to k. For the data point x_n, we want to know which component it has come from. For the data point x_n, the responsibility is one for the component whose mean the data point is closest to, and it is 0 for all other components.
This means that if you look at the data and look at 𝑥𝑛 , and want to know the posterior probability
of which component it came from, it is that component whose mean that data point is closest to.
The formula for Q is the same as before because we are doing Gaussian mixture. It’s just a special
Gaussian mixture.
Substituting 𝜋_k = 1/K, and replacing log p(x_n|θ_k) with the expansion from equation (6) for this special Gaussian mixture, we get:

$$Q = \sum_{n=1}^{N} \sum_{k=1}^{K} \left[\gamma(z_{nk}) \log\frac{1}{K} + \gamma(z_{nk}) \log\left(\frac{1}{(2\pi\epsilon)^{p/2}} \exp\left\{-\frac{1}{2\epsilon}\|x_n - \mu_k\|^{2}\right\}\right)\right]$$
We do the differentiation of the Q function which has this simplified normal distribution in it.
$$\frac{\partial Q}{\partial \mu_k} = \frac{\partial}{\partial \mu_k}\left\{\sum_{n=1}^{N} \gamma(z_{nk}) \log\left(\frac{1}{(2\pi\epsilon)^{p/2}} \exp\left\{-\frac{1}{2\epsilon}\|x_n - \mu_k\|^{2}\right\}\right)\right\}$$
$$\frac{\partial Q}{\partial \mu_k} = 0 \;\Rightarrow\; \mu_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk})\, x_n}{\sum_{n=1}^{N} \gamma(z_{nk})}$$
We again get the same formula for 𝜇_k; the only difference is that the responsibility is now defined in the hard-clustering way described above.
What is this basically saying? For the k-th component, only the responsibilities of the data points assigned to the k-th component will be 1; for all other data points, the responsibility with respect to the k-th component is 0. So we take all the data points assigned to the k-th component, compute their mean, and update the mean parameter of the k-th component, 𝜇_k, with this value.
This is the general EM algorithm. We first calculate the posterior distribution of Z, which we saw
is exactly given by:
$$\gamma(z_{nk}) = \begin{cases} 1 & \text{if } k = \arg\min_j \|x_n - \mu_j\|^{2} \\ 0 & \text{otherwise} \end{cases}$$
We assign the latent variable of 𝑥𝑛 to the closest mean, and then set the new mean as the mean of
all data points with the same latent variable. This is exactly also the k-means algorithm.
So we are assigning 𝑥𝑛 to the closest cluster, with the cluster center 𝜇𝑘 , and then reestimating the
cluster centers as the mean of all data points that are assigned to that cluster. So k-means is just a
special case of Gaussian mixture, where the covariance matrix is epsilon times the identity matrix.
That is why we just have to compute the means, and do not have to worry about the covariance.
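Since these two alternating steps are exactly the k-means updates, here is a minimal sketch of that special case (the function name `kmeans`, the initialization, and the stopping test are illustrative assumptions, not from the lecture):

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Hard-assignment EM: assign each point to the closest mean, then recompute the means."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()      # initial guess for the means
    for _ in range(max_iters):
        # "E step": responsibility 1 for the closest mean, 0 for every other component
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K) squared distances
        z = d2.argmin(axis=1)                                      # hard assignment for each point
        # "M step": each mean becomes the average of the points assigned to it
        new_mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):                                # stop when the means stop moving
            break
        mu = new_mu
    return mu, z
```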
So I think we will stop here because after this we will talk about all the theoretical properties of
EM. I wanted to show you one more thing.
(Refer Slide Time: 26:59)
The data is generated from three Gaussians. The three Gaussians and their cluster centers are
shown here. The previous figure shows how the data looks.
(Refer Slide Time: 27:18)
Now let's run the EM algorithm ten times, and plot it. In this plot, on the x axis we have iterations
and on the y axis we have the likelihood. With each iteration, the likelihood keeps increasing until it reaches a point and remains steady there. This is a property of the EM algorithm which we will
prove in the coming lecture.
We always have a hard bound 𝑇 on the number of iterations because sometimes the likelihood may
not converge. So we usually give an upper bound on the number of iterations as well and stop it
there.
When we ran EM ten times on the data, the 10 final negative log-likelihood values were -1490.738,
-1495.891, -1496.862, -1496.869, -1491.275, -1490.691, -1496.861, -1491.737, -1493.812 and -
1495.482. The minimum negative log likelihood was achieved for the fourth run of the EM.
I gave 𝑘 = 5, so it has estimated five different components. The means of the components are
shown here. So it has tried to fit five Gaussians.