
NPTEL

NPTEL ONLINE CERTIFICATION COURSE

Introduction to Machine Learning

Lecture-77
Expectation Maximization

Prof. Balaraman Ravindran


Computer Science and Engineering
Indian Institute of Technology Madras

(Refer Slide Time: 00:15)

Expectation Maximization (EM) is a way to do maximum likelihood estimation. It was originally proposed as a way to do maximum likelihood estimation when you have missing data. Suppose you are given data 𝑋, which is what you observe and is known to be incomplete (i.e., there are some values that are missing), and you want to get the maximum likelihood estimates of the parameters, which are unknown.

Here we are assuming two things. We are assuming that there is some parameterized family (i.e., you are doing some parameterized fitting) for which the joint likelihood is easy to compute. So we denote by {𝑋, 𝑍} the complete data (the observed data plus the hidden data), and we assume some parameterized family from which this data is generated (like a Gaussian or an exponential), whose parameters are unknown and are what we want to estimate.

So we start seeing connections with what we have seen in the case of Gaussian mixtures, because if you take the marginal probability here, you again see a summation coming inside the log, and this again poses problems. Also, the marginals need not be of the same family. For example, if the joint probability distribution is an exponential distribution, it does not mean that the corresponding marginal probability distributions also come from the exponential family.

(Refer Slide Time: 02:04)

EM as it is most commonly used today was proposed by Dempster, Laird and Rubin in their seminal 1977 paper “Maximum Likelihood from Incomplete Data via the EM Algorithm”. Even before 1977, many statisticians had developed EM-like algorithms, but when we cite EM, we usually cite this paper.

(Refer Slide Time: 02:42)


We have observed or incomplete data and hidden data. For the rest of the discussion, we will assume that this hidden data is discrete, but all the derivations work if we assume the hidden data to be continuous as well, in which case we just replace the summations with integrals.

The complete data is the combination of the incomplete data and hidden data. We assume some
parameterized family for the complete data, and want to estimate the parameters. This is the
problem that EM solves.

(Refer Slide Time: 02:32)


EM is an iterative algorithm, just like the one we designed for Gaussian mixtures. We again start with a guess of the parameters. We evaluate the likelihood under this first guess, and then we iteratively do two steps. First, we compute the posterior distribution of 𝑍 given the current estimate of the parameters. After that, we compute the expected complete log likelihood under this distribution, which we call the Q function. Notice that this expectation takes the complete data likelihood. Here the parameters are unknown, and the expectation is with respect to the distribution of Z given the parameters guessed in the previous round. Then we get a new guess for the parameters by maximizing this Q function, i.e., taking the argument 𝜗 which maximizes it.

(Refer Slide Time: 05:08)


What we wanted was to get the maximum likelihood estimate of 𝜗. We see 𝑋 but 𝑋 is not complete.
𝑋 has some missing data 𝑍. So we express the likelihood of observed data 𝑋 as a marginal
distribution of the complete data, and we get into the same problem of the summation being inside
the logarithm.

So we decide to not compute the maximum likelihood in this way, but instead compute the
maximum taking the expectation of the log likelihood of the complete data under the distribution
of Z. But we do not know the real distribution of Z because we do not know the parameters. So
we take the guess that we had in the previous round. We compute the expectation of the complete
data likelihood under the distribution of Z given the current guess of the parameters.

This works. We will see why it works. So the entire EM algorithm can actually be represented by
just this one line. You start with some guess, and then for the next guess you calculate the
expectation of the complete data log likelihood under the distribution of the missing data using the
previous guess. This can actually be broken down into two steps, where in the E step you compute
this expectation, and in the M step you maximize this Q and get the next set of parameters.
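To make the two-step structure concrete, here is a minimal Python sketch of the generic EM loop. The names em, e_step, m_step, log_lik and theta0 are placeholders introduced for illustration; the actual E and M computations depend on the model, and this is only one reasonable way to organize them.

```python
import numpy as np

def em(X, theta0, e_step, m_step, log_lik, max_iter=100, tol=1e-6):
    """Generic EM loop.  e_step(X, theta) -> posterior over hidden variables,
    m_step(X, posterior) -> new parameter guess, log_lik(X, theta) -> observed-data
    log-likelihood, which EM never decreases and which we use as a stopping test."""
    theta, prev = theta0, -np.inf
    for _ in range(max_iter):
        posterior = e_step(X, theta)   # E step: p(Z | X, theta^(m-1))
        theta = m_step(X, posterior)   # M step: argmax_theta Q(theta, theta^(m-1))
        cur = log_lik(X, theta)
        if cur - prev < tol:           # likelihood barely improved: stop
            break
        prev = cur
    return theta
```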

(Refer Slide Time: 07:39)


So let us see how we can get the EM algorithm for Gaussian mixtures. The key thing here is that the mixture model does not explicitly have missing data; the trick is to treat the latent variables as the hidden variables. This is how EM is used in a lot of different models, not just Gaussian mixtures but many latent variable models: you assume that the latent variables are hidden, i.e., you do not know them, and run the whole EM machinery.

This slide recalls the Gaussian mixture model, where we want to estimate all the parameters, represented collectively by 𝜗. We have 𝐾 components, and each component has parameters 𝜋𝑘, 𝜇𝑘 and Σ𝑘. What we want to find is the maximum likelihood estimate.

(Refer Slide Time: 08:31)


All right.

(Refer Slide Time: 09:20)

So let’s write this down here.


E-step:  𝑄(𝜗, 𝜗^(𝑚−1)) = 𝔼_{𝑍|𝑋, 𝜗^(𝑚−1)} [ log 𝑝(𝑋, 𝑍 | 𝜗) ]

The most important thing to remember is that the distribution in this expectation is taken with respect to the previous guess, whereas the complete log-likelihood inside it is a function of the unknown parameters. The M step just gets the next guess:

M-step:  𝜗^(𝑚) = argmax_𝜗 𝑄(𝜗, 𝜗^(𝑚−1))    (1)

If we want to get the maximum likelihood parameters, we first need to compute this Q function.
Then usually the M step is easier. It’s just the expectation computation in E step that requires some
work. Once you get Q, then the maximization step is just computing the derivatives.

(Refer Slide Time: 10:45)

As you can see, this works only if the E step gives you something which you can easily maximize.
In the case of Gaussian mixture, we will see that by using the complete data likelihood, we will be
able to get something that we can easily maximize.

(Refer Slide Time: 11:06)


The Q function, by definition, is the expectation of the complete data likelihood under the
distribution of Z given the previously guessed parameters.
𝑄(𝜗, 𝜗^(𝑚−1)) = 𝔼_{𝑍|𝑋, 𝜗^(𝑚−1)} [ log 𝑝(𝑋, 𝑍 | 𝜗) ]

So the log likelihood here is the likelihood of 𝑥𝑛 , 𝑧𝑛 given the unknown parameters, over all the
data points.
𝑄(𝜗, 𝜗^(𝑚−1)) = 𝔼_{𝑍|𝑋, 𝜗^(𝑚−1)} [ ∑_{𝑛=1}^{𝑁} log 𝑝(𝑥𝑛, 𝑧𝑛 | 𝜗) ]    (2)

We can take this summation outside by linearity of expectation. Rewriting the complete log likelihood, the equation assumes the form:

𝑄(𝜗, 𝜗^(𝑚−1)) = ∑_{𝑛=1}^{𝑁} 𝔼_{𝑍|𝑋, 𝜗^(𝑚−1)} [ log ∏_{𝑘=1}^{𝐾} (𝜋𝑘 𝑝(𝑥𝑛|𝜃𝑘))^𝕀(𝑧𝑛=𝑘) ]    (3)

So we can derive this formally, but intuitively it is very clear. 𝕀(𝑧𝑛 = 𝑘) is just an indicator function which takes the value 1 when 𝑧𝑛 = 𝑘, and the value 0 when 𝑧𝑛 ≠ 𝑘. So in this product, all the terms except one get an exponent of 0 (and therefore those terms equal 1). So only one term out of the 𝐾 remains for each data point. The log-likelihood comes straight away from this formula once we use the indicator function here, and this gives us the complete data likelihood.

So then the product becomes a sum when we take it outside the log, and the exponent term 𝕀(𝑧𝑛 =
𝑘) comes down. Again the expectation can be brought inside by linearity, and so we get both the
summations out.
= ∑_{𝑛=1}^{𝑁} ∑_{𝑘=1}^{𝐾} 𝔼_{𝑍|𝑋, 𝜗^(𝑚−1)} [ 𝕀(𝑧𝑛 = 𝑘) ] log(𝜋𝑘 𝑝(𝑥𝑛|𝜃𝑘))    (4)

With respect to the distribution of 𝑍 using the previously guessed parameters 𝜗 (𝑚−1) ,
log(𝜋𝑘 𝑝(𝑥𝑛 |𝜃𝑘 )) is just a constant. So the expectation is only over the indicator function 𝕀(𝑧𝑛 =
𝑘).

Expectation of an indicator function just gives us the probability, and we have the probability of
𝑧𝑛 = 𝑘, again given 𝑋 and the previously guessed parameter values 𝜗 (𝑚−1) . Log just remains.
= ∑_{𝑛=1}^{𝑁} ∑_{𝑘=1}^{𝐾} 𝑝(𝑧𝑛 = 𝑘 | 𝑋, 𝜗^(𝑚−1)) log(𝜋𝑘 𝑝(𝑥𝑛|𝜃𝑘))

So 𝑝(𝑧𝑛 = 𝑘|𝑋, 𝜗 (𝑚−1) ) is again the responsibility, which is the posterior probability of 𝑧𝑛 = 𝑘.
In this case, this responsibility is not with respect to the original parameters, but this posterior
probability is with respect to the guessed parameters. This has been indicated with the subscript.
So it is the responsibility times the log that remains.
= ∑_{𝑛=1}^{𝑁} ∑_{𝑘=1}^{𝐾} 𝛾(𝑧𝑛𝑘)|_{𝜗^(𝑚−1)} log(𝜋𝑘 𝑝(𝑥𝑛|𝜃𝑘))

So, expressing the log of a product as a summation of individual log terms, we get:
𝑄(𝜗, 𝜗^(𝑚−1)) = ∑_{𝑛=1}^{𝑁} ∑_{𝑘=1}^{𝐾} [ 𝛾(𝑧𝑛𝑘)|_{𝜗^(𝑚−1)} log 𝜋𝑘 + 𝛾(𝑧𝑛𝑘)|_{𝜗^(𝑚−1)} log 𝑝(𝑥𝑛|𝜃𝑘) ]    (5)
So what have we achieved? We have got an expression for Q in terms of, again, the responsibility, but this time the responsibility is with respect to the guessed parameters.

If you just look at the Q function in equation (5), you can see that it is easier to differentiate because
the summations are all outside and the normal distribution term comes towards the end of the
equation. The differentiation of the normal distribution term will be just like what you do in the
case of fitting a single Gaussian. How did this nice mathematical form come about? It happened
mainly because we were taking the complete data likelihood as shown in equation (2). The
complete data likelihood gave us a nice mathematical form in equation (3), and due to the
expectation, all the summations got pushed out in equation (4). So we got a nice form for 𝑄 as
shown in equation (5).
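As a concrete illustration of this E step, here is a minimal numpy sketch that computes the responsibilities 𝛾(𝑧𝑛𝑘) under the previously guessed parameters. The function name e_step_gmm and the argument names are placeholders of my own choosing; scipy.stats.multivariate_normal supplies the component densities.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step_gmm(X, pis, mus, Sigmas):
    """Responsibilities gamma[n, k] = p(z_n = k | x_n, previous parameters)."""
    N, K = X.shape[0], len(pis)
    gamma = np.zeros((N, K))
    for k in range(K):
        # pi_k * N(x_n | mu_k, Sigma_k) for every data point
        gamma[:, k] = pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
    gamma /= gamma.sum(axis=1, keepdims=True)  # normalize over the K components
    return gamma
```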

So is it only because the summation is inside the logarithm that we need to do all this? Otherwise, can we skip the whole thing?

Yes, but that problem comes up in many contexts, not just in the case of Gaussian mixtures, and in a lot of those cases EM is useful. If we could get the maximum likelihood easily, we would not need to use EM for Gaussian mixtures.

(Refer Slide Time: 16:12)


Now, the M step is just what is shown in equation (1): we differentiate the 𝑄 function with respect to each of the parameters. The 𝑄 function here is the same 𝑄 function that is expanded in equation (5). One thing to remember is that we know the guessed parameters from the previous iteration, 𝜗^(𝑚−1), so we know the responsibilities. The responsibilities are therefore just constants in this case, and differentiating the 𝑄 function with respect to the parameters becomes very easy.

So here we differentiate the 𝑄 function from equation (5) with respect to 𝜇𝑘 .


∂𝑄/∂𝜇𝑘 = ∂/∂𝜇𝑘 { ∑_{𝑛=1}^{𝑁} ∑_{𝑘=1}^{𝐾} 𝛾(𝑧𝑛𝑘)|_{𝜗^(𝑚−1)} log 𝜋𝑘 + 𝛾(𝑧𝑛𝑘)|_{𝜗^(𝑚−1)} log 𝑝(𝑥𝑛|𝜃𝑘) }

The whole first term inside the summations, 𝛾(𝑧𝑛𝑘)|_{𝜗^(𝑚−1)} log 𝜋𝑘, does not depend on 𝜇𝑘, so its derivative is zero and we can drop it. We focus only on the second term inside the summations, 𝛾(𝑧𝑛𝑘)|_{𝜗^(𝑚−1)} log 𝑝(𝑥𝑛|𝜃𝑘).

For each of the components, the full normal density appears inside the logarithm here.
∂𝑄/∂𝜇𝑘 = ∂/∂𝜇𝑘 { ∑_{𝑛=1}^{𝑁} 𝛾(𝑧𝑛𝑘)|_{𝜗^(𝑚−1)} log 𝑝(𝑥𝑛|𝜃𝑘) },   𝑘 = 1, …, 𝐾

∂𝑄/∂𝜇𝑘 = ∂/∂𝜇𝑘 { ∑_{𝑛=1}^{𝑁} 𝛾(𝑧𝑛𝑘)|_{𝜗^(𝑚−1)} log [ (1 / ((2𝜋)^{𝑝/2} |Σ𝑘|^{1/2})) exp{ −(1/2)(𝑥𝑛 − 𝜇𝑘)^𝑇 Σ𝑘^{−1} (𝑥𝑛 − 𝜇𝑘) } ] }

There are no summations inside the logarithm. This is exactly the same derivation as for a single
Gaussian. We use matrix derivatives to get very simple forms here.

Setting ∂𝑄/∂𝜇𝑘 = 0, we get:

𝜇𝑘 = ( ∑_{𝑛=1}^{𝑁} 𝛾(𝑧𝑛𝑘)|_{𝜗^(𝑚−1)} 𝑥𝑛 ) / ( ∑_{𝑛=1}^{𝑁} 𝛾(𝑧𝑛𝑘)|_{𝜗^(𝑚−1)} ),   𝑘 = 1, …, 𝐾

We see that we again get the same form for 𝜇𝑘 that we saw earlier for our ad hoc iterative algorithm, except that these responsibilities are with respect to the previously guessed parameters.
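Here is a small numpy sketch of this mean update, assuming gamma is the N×K responsibility matrix produced in the E step (for instance by the hypothetical e_step_gmm sketch above) and X is the N×p data array:

```python
def update_means(X, gamma):
    """mu_k = (sum_n gamma[n, k] * x_n) / (sum_n gamma[n, k]) for each component."""
    Nk = gamma.sum(axis=0)                # effective number of points per component
    return (gamma.T @ X) / Nk[:, None]    # result has shape (K, p)
```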

(Refer Slide Time: 17:43)


Here, similar to what we did for 𝜇𝑘 , we find the derivative of 𝑄 from equation (5) with respect to
Σ𝑘 and equate it to zero.
∂𝑄/∂Σ𝑘 = ∂/∂Σ𝑘 { ∑_{𝑛=1}^{𝑁} ∑_{𝑘=1}^{𝐾} 𝛾(𝑧𝑛𝑘)|_{𝜗^(𝑚−1)} log 𝜋𝑘 + 𝛾(𝑧𝑛𝑘)|_{𝜗^(𝑚−1)} log 𝑝(𝑥𝑛|𝜃𝑘) }

Again, we do not need to worry about the first term within the summations. We only find the
derivative for the second term within the summations.

∂𝑄/∂Σ𝑘 = ∂/∂Σ𝑘 { ∑_{𝑛=1}^{𝑁} 𝛾(𝑧𝑛𝑘)|_{𝜗^(𝑚−1)} log [ (1 / ((2𝜋)^{𝑝/2} |Σ𝑘|^{1/2})) exp{ −(1/2)(𝑥𝑛 − 𝜇𝑘)^𝑇 Σ𝑘^{−1} (𝑥𝑛 − 𝜇𝑘) } ] }

You can simplify this further by expanding the logarithm over each of these factors.
∂𝑄/∂Σ𝑘 = ∂/∂Σ𝑘 { ∑_{𝑛=1}^{𝑁} 𝛾(𝑧𝑛𝑘)|_{𝜗^(𝑚−1)} [ log(1/(2𝜋)^{𝑝/2}) − (1/2) log|Σ𝑘| − (1/2)(𝑥𝑛 − 𝜇𝑘)^𝑇 Σ𝑘^{−1} (𝑥𝑛 − 𝜇𝑘) ] }

Then, the derivative of the log-determinant is given by a simple formula (for symmetric Σ𝑘, ∂ log|Σ𝑘| / ∂Σ𝑘 = Σ𝑘^{−1}). Applying the matrix derivative formulas here and setting ∂𝑄/∂Σ𝑘 = 0, we get back the same form for Σ𝑘 which we found earlier.

Σ𝑘 = ( ∑_{𝑛=1}^{𝑁} 𝛾(𝑧𝑛𝑘)|_{𝜗^(𝑚−1)} (𝑥𝑛 − 𝜇𝑘)(𝑥𝑛 − 𝜇𝑘)^𝑇 ) / ( ∑_{𝑛=1}^{𝑁} 𝛾(𝑧𝑛𝑘)|_{𝜗^(𝑚−1)} ),   𝑘 = 1, …, 𝐾
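A corresponding numpy sketch of this covariance update, again assuming gamma is the N×K responsibility matrix and mus holds the freshly updated means:

```python
import numpy as np

def update_covariances(X, gamma, mus):
    """Sigma_k = weighted scatter of (x_n - mu_k), with weights gamma[n, k]."""
    N, p = X.shape
    K = gamma.shape[1]
    Sigmas = np.zeros((K, p, p))
    for k in range(K):
        diff = X - mus[k]                                  # shape (N, p)
        Sigmas[k] = (gamma[:, k, None] * diff).T @ diff    # weighted scatter matrix
        Sigmas[k] /= gamma[:, k].sum()                     # normalize by sum of weights
    return Sigmas
```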

(Refer Slide Time: 18:22)


So similarly we perform the M-step computation ∂𝑄/∂𝜋𝑘 for 𝜋𝑘. However, this time the second term within the summations, 𝛾(𝑧𝑛𝑘)|_{𝜗^(𝑚−1)} log 𝑝(𝑥𝑛|𝜃𝑘), is a constant with respect to 𝜋𝑘 and therefore goes away. We focus on differentiating the first term within the summations, 𝛾(𝑧𝑛𝑘)|_{𝜗^(𝑚−1)} log 𝜋𝑘. Since the 𝜋𝑘's must satisfy the constraint ∑_{𝑘=1}^{𝐾} 𝜋𝑘 = 1, we use Lagrange multipliers to get the formula for 𝜋𝑘. The Lagrangian 𝐽 can be written as:
𝐽 = ∑_{𝑘=1}^{𝐾} ∑_{𝑛=1}^{𝑁} 𝛾(𝑧𝑛𝑘)|_{𝜗^(𝑚−1)} log 𝜋𝑘 + 𝜆 ( ∑_{𝑘=1}^{𝐾} 𝜋𝑘 − 1 )

By differentiating the Lagrangian 𝐽 with respect to 𝜋𝑘 and setting the derivative to 0, we get:

𝜋𝑘 = ( ∑_{𝑛=1}^{𝑁} 𝛾(𝑧𝑛𝑘)|_{𝜗^(𝑚−1)} ) / 𝑁,   𝑘 = 1, …, 𝐾
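The corresponding numpy sketch for the mixing weights, assuming the same N×K responsibility matrix gamma:

```python
def update_mixing_weights(gamma):
    """pi_k = (sum_n gamma[n, k]) / N."""
    return gamma.sum(axis=0) / gamma.shape[0]
```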

(Refer Slide Time: 18:50)


In the previous lecture, we first found formulas for 𝜇𝑘, Σ𝑘 and 𝜋𝑘 by assuming that we know the responsibilities. In this lecture, we used the E and M steps to find exactly the same formulas for 𝜇𝑘, Σ𝑘 and 𝜋𝑘 that we found earlier.

(Refer Slide Time: 19:13)

We plug the formulas for 𝜇𝑘, Σ𝑘 and 𝜋𝑘 into the EM framework. In the EM framework, we start with a guess for these parameters. We then iterate: we first find the posterior distribution of Z (which gives us the responsibilities and can be represented as 𝑝(𝑍|𝑋, 𝜗^(𝑚−1))), then we compute the expected complete log-likelihood under this distribution of Z (which is 𝑄(𝜗, 𝜗^(𝑚−1)) = 𝔼_{𝑍|𝑋, 𝜗^(𝑚−1)} log 𝑝(𝑋, 𝑍|𝜗)), and finally we maximize the Q function to get the new guesses for the parameters.

(Refer Slide Time: 19:43)

This gives us the next set of guesses, and we iterate, checking for convergence. When the likelihood does not change much, we stop; that is the EM algorithm for GMM. I have still not told you why this works, but we will see the theoretical properties of why it works well. I just wanted to show you that what we get by doing all the math for EM is exactly the same as what we found with the iterative algorithm that we guessed.
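Putting the pieces together, here is a minimal sketch of the full EM loop for a Gaussian mixture. It reuses the hypothetical e_step_gmm, update_means, update_covariances and update_mixing_weights sketches defined above, and monitors the observed-data log-likelihood for convergence.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, pis, mus, Sigmas):
    """log p(X | parameters) = sum_n log sum_k pi_k N(x_n | mu_k, Sigma_k)."""
    dens = np.column_stack([pis[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
                            for k in range(len(pis))])
    return np.log(dens.sum(axis=1)).sum()

def gmm_em(X, pis, mus, Sigmas, max_iter=200, tol=1e-6):
    prev_ll = -np.inf
    for _ in range(max_iter):
        gamma = e_step_gmm(X, pis, mus, Sigmas)      # E step: responsibilities
        mus = update_means(X, gamma)                 # M step: means
        Sigmas = update_covariances(X, gamma, mus)   #         covariances
        pis = update_mixing_weights(gamma)           #         mixing weights
        ll = log_likelihood(X, pis, mus, Sigmas)
        if ll - prev_ll < tol:                       # stop when the likelihood stops improving
            break
        prev_ll = ll
    return pis, mus, Sigmas, ll
```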

(Refer Slide Time: 20:31)


Let’s look at a special case. Let’s assume a Gaussian mixture model where the covariance of each component is fixed to be 𝜖𝕀, where 𝜖 is a fixed constant and 𝕀 is the identity matrix. A covariance matrix Σ𝑘 = 𝜖𝕀 gives us a spherical Gaussian. We also fix each 𝜋𝑘 to be 1/𝐾, so each component gives exactly the same contribution towards the Gaussian mixture. Now the only parameters to estimate are the 𝜇𝑘's.

Since Σ𝑘 = 𝜖𝕀, the formula for the normal distribution simplifies to:

𝑝(𝑥|𝜃𝑘) = (1 / ((2𝜋)^{𝑝/2} |Σ𝑘|^{1/2})) exp{ −(1/2)(𝑥 − 𝜇𝑘)^𝑇 Σ𝑘^{−1} (𝑥 − 𝜇𝑘) }

𝑝(𝑥|𝜃𝑘) = (1 / (2𝜋𝜖)^{𝑝/2}) exp{ −‖𝑥 − 𝜇𝑘‖² / 2𝜖 }    (6)
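As a quick numerical sanity check of equation (6), here is a small sketch that compares the simplified spherical formula against scipy's general multivariate normal density with covariance 𝜖𝕀; the variable names and numbers are arbitrary.

```python
import numpy as np
from scipy.stats import multivariate_normal

eps, p = 0.5, 2
mu = np.array([1.0, -2.0])
x = np.array([0.3, -1.1])

# Simplified spherical form from equation (6)
simplified = (2 * np.pi * eps) ** (-p / 2) * np.exp(-np.sum((x - mu) ** 2) / (2 * eps))

# General Gaussian density with covariance eps * I
general = multivariate_normal.pdf(x, mean=mu, cov=eps * np.eye(p))

print(np.isclose(simplified, general))  # expected: True
```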

The formula for the responsibility also simplifies. We plug 𝑝(𝑥𝑛|𝜃𝑘) from equation (6) and also substitute 𝜋𝑘 = 1/𝐾 into the formula for the responsibilities 𝛾(𝑧𝑛𝑘), and we get:

𝛾(𝑧𝑛𝑘) = 𝜋𝑘 𝑝(𝑥𝑛|𝜃𝑘) / ∑_{𝑗=1}^{𝐾} 𝜋𝑗 𝑝(𝑥𝑛|𝜃𝑗) = exp{ −‖𝑥𝑛 − 𝜇𝑘‖² / 2𝜖 } / ∑_{𝑗=1}^{𝐾} exp{ −‖𝑥𝑛 − 𝜇𝑗‖² / 2𝜖 }    (7)
Now if we look at this expression in equation (7) and see what happens as 𝜖 → 0, the term for which the squared distance ‖𝑥𝑛 − 𝜇𝑗‖² is smallest will go to 0 most slowly. So the responsibility for 𝑥𝑛 will tend to 1 for that particular component (the 𝑗-th component with the smallest distance), because in the limit the numerator will equal the denominator, and for all other components (𝑘 ≠ 𝑗) the responsibility will go to 0.

𝛾(𝑧𝑛𝒋 ) → 1 and 𝛾(𝑧𝑛𝒌 ) → 0, ∀ 𝑘 ≠ 𝑗

So this is the special case of hard clustering that was mentioned earlier. What it turns out to be is
just setting the responsibility of 𝑥𝑛 to 1 for that component for which the data point is closest to
the mean, and setting the responsibility of 𝑥𝑛 to zero for all other components.

𝛾(𝑧𝑛𝑘) = 1 if 𝑘 = argmin_𝑗 ‖𝑥𝑛 − 𝜇𝑗‖², and 0 otherwise

The responsibility is just the posterior probability of 𝑧𝑛 being equal to 𝑘. For the data point 𝑥𝑛, we want to know which component it came from. The responsibility is one for the component whose mean the data point is closest to, and 0 for all other components. In other words, if you look at 𝑥𝑛 and ask for the posterior probability of which component it came from, all the probability goes to the component whose mean that data point is closest to.
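A tiny numpy sketch of this hard assignment; the function name hard_responsibilities and its arguments (the data X and the current means mus) are placeholders of my own:

```python
import numpy as np

def hard_responsibilities(X, mus):
    """gamma[n, k] = 1 if mu_k is the closest mean to x_n, else 0."""
    # Squared distances from every point to every mean: shape (N, K)
    d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=-1)
    gamma = np.zeros_like(d2)
    gamma[np.arange(X.shape[0]), d2.argmin(axis=1)] = 1.0
    return gamma
```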

(Refer Slide Time: 24:08)


Let us do the EM for this special case of hard clustering in which for each of the components, we
have set Σ𝑘 = 𝜖𝕀 and 𝜋𝑘 = 1/𝐾. The first step in EM is to calculate Q.

From equation (5), we have:


𝑄 = ∑_{𝑛=1}^{𝑁} ∑_{𝑘=1}^{𝐾} [ 𝛾(𝑧𝑛𝑘) log 𝜋𝑘 + 𝛾(𝑧𝑛𝑘) log 𝑝(𝑥𝑛|𝜃𝑘) ]

The formula for Q is the same as before because we are doing Gaussian mixture. It’s just a special
Gaussian mixture.

Substituting 𝜋𝑘 = 1/𝐾, and replacing log 𝑝(𝑥𝑛|𝜃𝑘) with the expansion from equation (6) for this special case of Gaussian, we have:

𝑄 = ∑_{𝑛=1}^{𝑁} ∑_{𝑘=1}^{𝐾} [ 𝛾(𝑧𝑛𝑘) log(1/𝐾) + 𝛾(𝑧𝑛𝑘) log( (1 / (2𝜋𝜖)^{𝑝/2}) exp{ −‖𝑥𝑛 − 𝜇𝑘‖² / 2𝜖 } ) ]

We do the differentiation of the Q function which has this simplified normal distribution in it.
∂𝑄/∂𝜇𝑘 = ∂/∂𝜇𝑘 { ∑_{𝑛=1}^{𝑁} 𝛾(𝑧𝑛𝑘) log( (1 / (2𝜋𝜖)^{𝑝/2}) exp{ −‖𝑥𝑛 − 𝜇𝑘‖² / 2𝜖 } ) }

∂𝑄/∂𝜇𝑘 = 0 ⇒ 𝜇𝑘 = ( ∑_{𝑛=1}^{𝑁} 𝛾(𝑧𝑛𝑘) 𝑥𝑛 ) / ( ∑_{𝑛=1}^{𝑁} 𝛾(𝑧𝑛𝑘) )

We again get the same formula for 𝜇𝑘; the only difference is that this responsibility is now the hard assignment described above (for hard clustering).

What is it basically saying? For the 𝑘-th component, only the responsibilities of the data points assigned to the 𝑘-th component will be 1, and the responsibilities of those same data points for the other components will be 0. So we take all the data points that are assigned to the 𝑘-th component, compute their mean, and update the mean parameter 𝜇𝑘 of the 𝑘-th component with this value.

(Refer Slide Time: 25:13)

This is the general EM algorithm. We first calculate the posterior distribution of Z, which we saw
is exactly given by:
𝛾(𝑧𝑛𝑘) = 1 if 𝑘 = argmin_𝑗 ‖𝑥𝑛 − 𝜇𝑗‖², and 0 otherwise

We assign the latent variable of 𝑥𝑛 to the closest mean, and then set the new mean as the mean of
all data points with the same latent variable. This is exactly also the k-means algorithm.

So we are assigning 𝑥𝑛 to the closest cluster, with the cluster center 𝜇𝑘 , and then reestimating the
cluster centers as the mean of all data points that are assigned to that cluster. So k-means is just a
special case of Gaussian mixture, where the covariance matrix is epsilon times the identity matrix.
That is why we just have to compute the means, and do not have to worry about the covariance.

This entire procedure follows the same framework of EM that we saw.
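To make the connection explicit, here is a minimal k-means sketch written in exactly this EM form, reusing the hypothetical hard_responsibilities function from above; the initialization and the fixed iteration count are arbitrary choices for illustration.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """k-means as hard-assignment EM: E step = assign each point to the closest
    mean, M step = recompute each mean from its assigned points."""
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(X.shape[0], size=K, replace=False)]  # initial guess for the means
    for _ in range(n_iter):
        gamma = hard_responsibilities(X, mus)                # E step (hard assignments)
        # M step: cluster means; assumes no cluster becomes empty
        mus = (gamma.T @ X) / gamma.sum(axis=0)[:, None]
    return mus, gamma.argmax(axis=1)
```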

(Refer Slide Time: 26:31)

So I think we will stop here because after this we will talk about all the theoretical properties of
EM. I wanted to show you one more thing.
(Refer Slide Time: 26:59)

Let us consider this data.

(Refer Slide Time: 27:05)

The data is generated from three Gaussians. The three Gaussians and their cluster centers are
shown here. The previous figure shows how the data looks.
(Refer Slide Time: 27:18)

This is the density plot corresponding to the X coordinates of the data.

(Refer Slide Time: 27:30)


If we run EM once on it, and plot it, we see that it recovers the means and covariances exactly the
way the data was generated.

(Refer Slide Time: 28:02)

Now let's run the EM algorithm ten times and plot the results. In this plot, the x axis shows the iterations and the y axis shows the likelihood. In each run, the likelihood keeps increasing with the iterations until it reaches a point and stays steady there. This is a property of the EM algorithm which we will prove in the coming lecture.

(Refer Slide Time: 28:35)


Let us look at these ten likelihood values. The 10 negative log-likelihood values corresponding to
when we stopped the iteration for each of the 10 runs are:

-1490.738 -1495.891 -1496.862 -1496.869 -1491.275


-1490.691 -1496.861 -1491.737 -1493.812 -1495.482

(Refer Slide Time: 28:54)

We always have a hard bound 𝑇 on the number of iterations because sometimes the likelihood may
not converge. So we usually give an upper bound on the number of iterations as well and stop it
there.

Of these ten runs, the minimum negative log-likelihood was achieved on the fourth run of EM.

This is a very easy case.
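In practice, running EM with several random restarts and keeping the run with the best likelihood is standard. Here is a sketch of that idea using scikit-learn's GaussianMixture (which implements EM for GMMs); the data array X and the choice of 3 components are placeholders, not the data used in the lecture.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))          # placeholder data; substitute your own

best = None
for seed in range(10):                 # ten independent EM runs
    gm = GaussianMixture(n_components=3, max_iter=200, tol=1e-4,
                         random_state=seed).fit(X)
    # score(X) is the average log-likelihood per sample; keep the best run
    if best is None or gm.score(X) > best.score(X):
        best = gm

print(best.means_)                     # means from the best-likelihood run
```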

(Refer Slide Time: 29:43)

I gave 𝑘 = 5, so it has estimated five different components. The means of the components are
shown here. So it has tried to fit five Gaussians.

(Refer Slide Time: 30:35)


So this is a more difficult case because the three components are not very well separated. Now we
run the EM algorithm ten times on this data. In this case, the data is as shown here.

(Refer Slide Time: 31:45)

As EM ran, the likelihood always increased.

(Refer Slide Time: 31:48)


But what it estimated is shown here. So when the data is very well separated, like we saw earlier,
EM will usually give very good results, but when the data is not very well separated like this, it
starts giving very weird results. But no matter what it does, the likelihoods will always increase.

IIT Madras Production

Funded by
Department of Higher Education
Ministry of Human Resource Development
Government of India

www.nptel.ac.in

Copyrights Reserved
