Lecture-77
Expectation Maximization
Here we are assuming two things. We are assuming that there is some parameterized family (i.e.,
you are doing some parameterized fitting) for which the joint likelihood is easy to compute. So we
denote by {𝑋, 𝑍} the complete data (the observed data plus the hidden data), and we assume some
parameterized family from which this data is generated (like a Gaussian or an exponential), and we
do not know the parameters, which we want to estimate.
So we start seeing connections with what we have seen in the case of Gaussian mixture because if
you take the marginal probability here, you again see a summation coming inside the 𝑙𝑜𝑔, and this
again poses problems. Also, the marginals need not be of the same family. For example, if the joint
probability distribution is an exponential distribution, it does not mean that the corresponding
marginal probability distributions also come from the exponential family.
EM as it is most commonly used today was proposed by Dempster, Laird and Rubin in their 1977
seminal paper “Maximum Likelihood from incomplete data via the EM algorithm”. Even before
1977, many statisticians had developed EM-like algorithms, but when we cite EM, we usually
cite this paper.
The complete data is the combination of the incomplete data and hidden data. We assume some
parameterized family for the complete data, and want to estimate the parameters. This is the
problem that EM solves.
So we decide to not compute the maximum likelihood in this way, but instead compute the
maximum taking the expectation of the log likelihood of the complete data under the distribution
of Z. But we do not know the real distribution of Z because we do not know the parameters. So
we take the guess that we had in the previous round. We compute the expectation of the complete
data likelihood under the distribution of Z given the current guess of the parameters.
This works. We will see why it works. So the entire EM algorithm can actually be represented by
just this one line. You start with some guess, and then for the next guess you calculate the
expectation of the complete data log likelihood under the distribution of the missing data using the
previous guess. This can actually be broken down into two steps, where in the E step you compute
this expectation, and in the M step you maximize this Q and get the next set of parameters.
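In symbols, using 𝜗^(m) for the guess at round m (the same notation used below), the two steps of one iteration are:

E step:
$$Q(\vartheta, \vartheta^{(m-1)}) = \mathbb{E}_{Z \mid X, \vartheta^{(m-1)}}\big[\log p(X, Z \mid \vartheta)\big]$$

M step:
$$\vartheta^{(m)} = \arg\max_{\vartheta}\; Q(\vartheta, \vartheta^{(m-1)})$$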
This slide recalls the Gaussian mixture model, where we want to estimate all the parameters, collectively represented by 𝜗. We have 𝐾 components, and each component has parameters 𝜋_k, 𝜇_k and Σ_k. What we want to find is the maximum likelihood estimate.
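For reference, the mixture density being fitted is the usual one, with 𝜗 = {𝜋_k, 𝜇_k, Σ_k} for k = 1, ..., K:

$$p(x \mid \vartheta) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} \pi_k = 1$$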
If we want to get the maximum likelihood parameters, we first need to compute this Q function.
Then usually the M step is easier. It’s just the expectation computation in E step that requires some
work. Once you get Q, then the maximization step is just computing the derivatives.
As you can see, this works only if the E step gives you something which you can easily maximize.
In the case of Gaussian mixture, we will see that by using the complete data likelihood, we will be
able to get something that we can easily maximize.
So the complete data log likelihood here is the sum, over all the data points, of the log likelihood of (x_n, z_n) given the unknown parameters, so the Q function becomes:

$$Q(\vartheta, \vartheta^{(m-1)}) = \mathbb{E}_{Z \mid X, \vartheta^{(m-1)}}\left[\sum_{n=1}^{N} \log p(x_n, z_n \mid \vartheta)\right]$$
We can take this summation outside by linearity of expectations. Rewriting the complete data log likelihood, the equation assumes the form:
$$Q(\vartheta, \vartheta^{(m-1)}) = \sum_{n=1}^{N} \mathbb{E}_{Z \mid X, \vartheta^{(m-1)}}\left[\log \prod_{k=1}^{K} \big(\pi_k\, p(x_n \mid \theta_k)\big)^{\mathbb{I}(z_n = k)}\right] \qquad (3)$$
We can derive this formally, but intuitively it is very clear. 𝕀(z_n = k) is just an indicator function which takes the value 1 when z_n = k and the value 0 when z_n ≠ k. So in this product, all the terms except one get an exponent of 0 (and therefore those terms become 1). So only one term out of the K remains for each data point. The log likelihood follows straight away from this formula once the indicator function is used, and this gives us the complete data likelihood.
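Concretely, for a single data point the complete data likelihood written with the indicator exponent is:

$$p(x_n, z_n \mid \vartheta) = \prod_{k=1}^{K} \big(\pi_k\, p(x_n \mid \theta_k)\big)^{\mathbb{I}(z_n = k)}$$

which picks out exactly the one component that z_n points to; its logarithm is what sits inside the expectation in equation (3).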
So then the product becomes a sum when we take it outside the log, and the exponent term 𝕀(𝑧𝑛 =
𝑘) comes down. Again the expectation can be brought inside by linearity, and so we get both the
summations out.
$$Q(\vartheta, \vartheta^{(m-1)}) = \sum_{n=1}^{N} \sum_{k=1}^{K} \mathbb{E}_{Z \mid X, \vartheta^{(m-1)}}\Big[\mathbb{I}(z_n = k)\, \log\big(\pi_k\, p(x_n \mid \theta_k)\big)\Big] \qquad (4)$$
With respect to the distribution of 𝑍 using the previously guessed parameters 𝜗 (𝑚−1) ,
log(𝜋𝑘 𝑝(𝑥𝑛 |𝜃𝑘 )) is just a constant. So the expectation is only over the indicator function 𝕀(𝑧𝑛 =
𝑘).
Expectation of an indicator function just gives us the probability, and we have the probability of
𝑧𝑛 = 𝑘, again given 𝑋 and the previously guessed parameter values 𝜗 (𝑚−1) . Log just remains.
$$Q(\vartheta, \vartheta^{(m-1)}) = \sum_{n=1}^{N} \sum_{k=1}^{K} p(z_n = k \mid X, \vartheta^{(m-1)})\, \log\big(\pi_k\, p(x_n \mid \theta_k)\big)$$
So 𝑝(𝑧𝑛 = 𝑘|𝑋, 𝜗 (𝑚−1) ) is again the responsibility, which is the posterior probability of 𝑧𝑛 = 𝑘.
In this case, this responsibility is not with respect to the original parameters, but this posterior
probability is with respect to the guessed parameters. This has been indicated with the subscript.
So it is the responsibility times the log that remains.
$$Q(\vartheta, \vartheta^{(m-1)}) = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk})\big|_{\vartheta^{(m-1)}}\, \log\big(\pi_k\, p(x_n \mid \theta_k)\big)$$
So, expressing the log of a product as a summation of individual log terms, we get:
$$Q(\vartheta, \vartheta^{(m-1)}) = \sum_{n=1}^{N} \sum_{k=1}^{K} \Big[\gamma(z_{nk})\big|_{\vartheta^{(m-1)}} \log \pi_k + \gamma(z_{nk})\big|_{\vartheta^{(m-1)}} \log p(x_n \mid \theta_k)\Big] \qquad (5)$$
So what have we achieved? We have got an expression for Q again in terms of the responsibility, but this time the responsibility is with respect to the guessed parameters.
If you just look at the Q function in equation (5), you can see that it is easier to differentiate because
the summations are all outside and the normal distribution term comes towards the end of the
equation. The differentiation of the normal distribution term will be just like what you do in the
case of fitting a single Gaussian. How did this nice mathematical form come about? It happened
mainly because we were taking the complete data likelihood as shown in equation (2). The
complete data likelihood gave us a nice mathematical form in equation (3), and due to the
expectation, all the summations got pushed out in equation (4). So we got a nice form for 𝑄 as
shown in equation (5).
So is it only because the summation is inside the logarithm that we need to do all this? Otherwise, could we skip the whole thing?
Yes, but that situation arises in many contexts, not just in the case of the Gaussian mixture. In a lot of those cases, EM is useful. If we could get the maximum likelihood easily, we would not need to use EM for the Gaussian mixture.
To maximize Q with respect to 𝜇_k, the whole first term inside the summations (which is γ(z_nk)|_𝜗^(m−1) log 𝜋_k) is not needed, since it does not depend on 𝜇_k. We focus only on the second term inside the summations (which is γ(z_nk)|_𝜗^(m−1) log p(x_n|θ_k)). For each component, the full normal density appears inside this logarithm.
$$\frac{\partial Q}{\partial \mu_k} = \frac{\partial}{\partial \mu_k}\left\{\sum_{n=1}^{N} \gamma(z_{nk})\big|_{\vartheta^{(m-1)}} \log p(x_n \mid \theta_k)\right\}, \qquad k = 1, \dots, K$$

$$\frac{\partial Q}{\partial \mu_k} = \frac{\partial}{\partial \mu_k}\left\{\sum_{n=1}^{N} \gamma(z_{nk})\big|_{\vartheta^{(m-1)}} \log\left[\frac{1}{(2\pi)^{p/2}\, |\Sigma_k|^{1/2}} \exp\left\{-\frac{1}{2}(x_n - \mu_k)^{T} \Sigma_k^{-1} (x_n - \mu_k)\right\}\right]\right\}$$
There are no summations inside the logarithm. This is exactly the same derivation as for a single
Gaussian. We use matrix derivatives to get very simple forms here.
Setting ∂Q/∂𝜇_k = 0, we get:

$$\mu_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk})\big|_{\vartheta^{(m-1)}}\, x_n}{\sum_{n=1}^{N} \gamma(z_{nk})\big|_{\vartheta^{(m-1)}}}, \qquad k = 1, \dots, K$$
We will see that we again get the same form for 𝜇_k that we saw earlier for our ad hoc iterative algorithm, except that these responsibilities are with respect to the previously guessed parameters.
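As a quick sanity check of this formula, the update is essentially a one-liner in NumPy; here `gamma` is an N×K array of responsibilities computed under the previous guess 𝜗^(m−1) and `X` is the N×p data matrix (both names are just placeholders for this sketch):

```python
import numpy as np

def update_means(X, gamma):
    """Responsibility-weighted mean update for mu_k, k = 1..K.

    X     : (N, p) data matrix
    gamma : (N, K) responsibilities gamma(z_nk) under the previous parameter guess
    """
    # Numerator: sum_n gamma_nk * x_n for each k  -> shape (K, p)
    # Denominator: sum_n gamma_nk for each k      -> shape (K, 1)
    return (gamma.T @ X) / gamma.sum(axis=0)[:, None]
```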
Next we differentiate with respect to Σ_k. Again, we do not need to worry about the first term within the summations, since it does not depend on Σ_k; we only differentiate the second term.
$$\frac{\partial Q}{\partial \Sigma_k} = \frac{\partial}{\partial \Sigma_k}\left\{\sum_{n=1}^{N} \gamma(z_{nk})\big|_{\vartheta^{(m-1)}} \log\left[\frac{1}{(2\pi)^{p/2}\, |\Sigma_k|^{1/2}} \exp\left\{-\frac{1}{2}(x_n - \mu_k)^{T} \Sigma_k^{-1} (x_n - \mu_k)\right\}\right]\right\}$$
We can simplify this further by first expanding the logarithm over each of these factors.
$$\frac{\partial Q}{\partial \Sigma_k} = \frac{\partial}{\partial \Sigma_k}\left\{\sum_{n=1}^{N} \gamma(z_{nk})\big|_{\vartheta^{(m-1)}} \left[\log\frac{1}{(2\pi)^{p/2}} - \frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x_n - \mu_k)^{T} \Sigma_k^{-1} (x_n - \mu_k)\right]\right\}$$
Then the derivative of log|Σ_k| with respect to Σ_k is given by a simple matrix-calculus formula (it is Σ_k^{-1} for symmetric Σ_k). Applying the matrix derivative formulas here, and setting ∂Q/∂Σ_k = 0, we get back the same form for Σ_k that we found earlier.
$$\Sigma_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk})\big|_{\vartheta^{(m-1)}}\, (x_n - \mu_k)(x_n - \mu_k)^{T}}{\sum_{n=1}^{N} \gamma(z_{nk})\big|_{\vartheta^{(m-1)}}}, \qquad k = 1, \dots, K$$
Finally, we maximize with respect to the mixture weights. The second term within the summations (which is γ(z_nk)|_𝜗^(m−1) log p(x_n|θ_k)) is a constant with respect to 𝜋_k and therefore goes away. We focus on the differentiation of the first term within the summations (which is γ(z_nk)|_𝜗^(m−1) log 𝜋_k). Since the 𝜋_k s must satisfy the constraint ∑_{k=1}^{K} 𝜋_k = 1, we use a Lagrange multiplier to get the formulation for 𝜋_k. The Lagrange function J can be written as:

$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk})\big|_{\vartheta^{(m-1)}} \log \pi_k \;+\; \lambda\left(\sum_{k=1}^{K} \pi_k - 1\right)$$
By differentiating the Lagrange function J with respect to 𝜋_k and setting the derivative to 0, we get:

$$\pi_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk})\big|_{\vartheta^{(m-1)}}}{N}, \qquad k = 1, \dots, K$$
We plug the formulas for 𝜇𝑘 , Σ𝑘 and 𝜋𝑘 into the EM framework. In the EM framework, we start
with a guess for these parameters. We then iterate by first finding the posterior distribution of Z
(which gives us the responsibilities and can be represented as 𝑝(𝑍|𝑋, 𝜗 (𝑚−1) )), and then we find
the expected complete likelihood under this distribution of Z (which can be represented as
𝑄(𝜗, 𝜗 (𝑚−1) ) = 𝔼𝑍|𝑋,𝜗(𝑚−1) log 𝑝(𝑋, 𝑍|𝜗)), and we finally maximize the Q function to get the new
guesses for the parameters.
This gives us the next set of guesses, and we can iterate, checking for convergence. When the likelihood does not change much, we stop; that is the EM algorithm for GMM. I have still not told you why this works, but we will see the theoretical properties of why it works well. I just wanted to show you that what we have got by doing all the math for EM is exactly the same as what we found with the iterative algorithm that we guessed earlier.
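Putting the E step and the three M-step formulas together, one round of this EM loop for the GMM looks roughly like the following NumPy sketch (a minimal illustration, not the lecture's code; the function name `em_gmm`, the initialization, and the use of `scipy.stats.multivariate_normal` are my assumptions, and numerical safeguards such as log-sum-exp are omitted):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, max_iters=100, tol=1e-6, seed=0):
    """Minimal EM loop for a Gaussian mixture (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    N, p = X.shape
    # Initial guess theta^(0): K random data points as means, identity covariances, uniform weights
    mu = X[rng.choice(N, size=K, replace=False)].copy()
    Sigma = np.array([np.eye(p) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    prev_ll = -np.inf

    for _ in range(max_iters):
        # E step: responsibilities gamma(z_nk) under the previous-round parameters
        dens = np.stack([pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
                         for k in range(K)], axis=1)            # shape (N, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)

        # M step: the closed-form maximizers of Q derived above
        Nk = gamma.sum(axis=0)                                   # effective counts, shape (K,)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        pi = Nk / N

        # Stop when the (incomplete-data) log likelihood stops improving,
        # with max_iters as a hard bound on the number of iterations
        ll = np.log(dens.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pi, mu, Sigma
```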
Now consider a special case of the Gaussian mixture in which the covariance of every component is fixed to Σ_k = 𝜖𝕀 and the mixing weights are fixed to 𝜋_k = 1/K, so that each component gives exactly the same contribution towards the Gaussian mixture. The only parameters left to estimate are the 𝜇_k s.
Since Σ𝑘 = 𝜖𝕀, the formula for the normal distribution simplifies to:
$$p(x \mid \theta_k) = \frac{1}{(2\pi)^{p/2}\, |\Sigma_k|^{1/2}} \exp\left\{-\frac{1}{2}(x - \mu_k)^{T} \Sigma_k^{-1} (x - \mu_k)\right\}$$

$$p(x \mid \theta_k) = \frac{1}{(2\pi\epsilon)^{p/2}} \exp\left\{-\frac{1}{2\epsilon}\|x - \mu_k\|^{2}\right\} \qquad (6)$$
The formula for the responsibility also simplifies. We plug p(x_n|θ_k) from equation (6) and also substitute 𝜋_k = 1/K into the formula for the responsibilities γ(z_nk).
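After plugging in, the factors 1/K and 1/(2πϵ)^{p/2} cancel between the numerator and the denominator of the usual responsibility formula (an intermediate step written out here for clarity):

$$\gamma(z_{nk}) = \frac{\exp\left\{-\frac{1}{2\epsilon}\|x_n - \mu_k\|^{2}\right\}}{\sum_{j=1}^{K} \exp\left\{-\frac{1}{2\epsilon}\|x_n - \mu_j\|^{2}\right\}}$$

As ϵ becomes very small, the term with the smallest ‖x_n − μ_j‖² dominates the denominator, so this ratio goes to 1 for the closest mean and to 0 for every other component.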
So this is the special case of hard clustering that was mentioned earlier. What it turns out to be is
just setting the responsibility of 𝑥𝑛 to 1 for that component for which the data point is closest to
the mean, and setting the responsibility of 𝑥𝑛 to zero for all other components.
$$\gamma(z_{nk}) = \begin{cases} 1 & \text{if } k = \arg\min_j \|x_n - \mu_j\|^{2} \\ 0 & \text{otherwise} \end{cases}$$
The responsibility is just the posterior probability of z_n being equal to k. For the data point x_n, we want to know which component it has come from. For the data point x_n, the responsibility is one for the component whose mean the data point is closest to, and it is 0 for all other components.
This means that if you look at the data and look at 𝑥𝑛 , and want to know the posterior probability
of which component it came from, it is that component whose mean that data point is closest to.
The formula for Q is the same as before because we are doing Gaussian mixture. It’s just a special
Gaussian mixture.
Substituting 𝜋_k = 1/K, and replacing log p(x_n|θ_k) with the expansion from equation (6) for this special Gaussian mixture, we get:

$$Q = \sum_{n=1}^{N} \sum_{k=1}^{K} \left[\gamma(z_{nk}) \log\frac{1}{K} + \gamma(z_{nk}) \log\left(\frac{1}{(2\pi\epsilon)^{p/2}} \exp\left\{-\frac{1}{2\epsilon}\|x_n - \mu_k\|^{2}\right\}\right)\right]$$
We do the differentiation of the Q function which has this simplified normal distribution in it.
$$\frac{\partial Q}{\partial \mu_k} = \frac{\partial}{\partial \mu_k}\left\{\sum_{n=1}^{N} \gamma(z_{nk}) \log\left(\frac{1}{(2\pi\epsilon)^{p/2}} \exp\left\{-\frac{1}{2\epsilon}\|x_n - \mu_k\|^{2}\right\}\right)\right\}$$
$$\frac{\partial Q}{\partial \mu_k} = 0 \;\Rightarrow\; \mu_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk})\, x_n}{\sum_{n=1}^{N} \gamma(z_{nk})}$$
We again get the same formula for 𝜇_k; the only difference is that the responsibility is now defined in the hard-clustering way described above.
What is this basically saying? For the k-th component, only the responsibilities of the data points assigned to the k-th component will be 1; for all other data points, the responsibility with respect to the k-th component is 0. So we take all the data points assigned to the k-th component, compute their mean, and update the mean parameter of the k-th component, 𝜇_k, with this value.
This is the general EM algorithm. We first calculate the posterior distribution of Z, which we saw
is exactly given by:
$$\gamma(z_{nk}) = \begin{cases} 1 & \text{if } k = \arg\min_j \|x_n - \mu_j\|^{2} \\ 0 & \text{otherwise} \end{cases}$$
We assign the latent variable of 𝑥𝑛 to the closest mean, and then set the new mean as the mean of
all data points with the same latent variable. This is exactly also the k-means algorithm.
So we are assigning 𝑥𝑛 to the closest cluster, with the cluster center 𝜇𝑘 , and then reestimating the
cluster centers as the mean of all data points that are assigned to that cluster. So k-means is just a
special case of Gaussian mixture, where the covariance matrix is epsilon times the identity matrix.
That is why we just have to compute the means, and do not have to worry about the covariance.
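Since these two alternating steps are exactly the k-means updates, here is a minimal sketch of that special case (the function name `kmeans`, the initialization, and the stopping test are illustrative assumptions, not from the lecture):

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Hard-assignment EM: assign each point to the closest mean, then recompute the means."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()      # initial guess for the means
    for _ in range(max_iters):
        # "E step": responsibility 1 for the closest mean, 0 for every other component
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K) squared distances
        z = d2.argmin(axis=1)                                      # hard assignment for each point
        # "M step": each mean becomes the average of the points assigned to it
        new_mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):                                # stop when the means stop moving
            break
        mu = new_mu
    return mu, z
```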
So I think we will stop here because after this we will talk about all the theoretical properties of
EM. I wanted to show you one more thing.
(Refer Slide Time: 26:59)
The data is generated from three Gaussians. The three Gaussians and their cluster centers are
shown here. The previous figure shows how the data looks.
(Refer Slide Time: 27:18)
Now let's run the EM algorithm ten times, and plot it. In this plot, on the x axis we have iterations
and on the y axis we have the likelihood. With each iteration, the likelihood keeps increasing until it reaches a point and remains steady there. This is a property of the EM algorithm which we will
prove in the coming lecture.
We always have a hard bound 𝑇 on the number of iterations because sometimes the likelihood may
not converge. So we usually give an upper bound on the number of iterations as well and stop it
there.
When we ran EM ten times on the data, the 10 final negative log-likelihood values were -1490.738,
-1495.891, -1496.862, -1496.869, -1491.275, -1490.691, -1496.861, -1491.737, -1493.812 and -
1495.482. The minimum negative log likelihood was achieved for the fourth run of the EM.
I gave 𝑘 = 5, so it has estimated five different components. The means of the components are
shown here. So it has tried to fit five Gaussians.