Maximum Likelihood Estimation (MLE)
\[
\max_{\theta \in \Theta} \ell(y; \theta)
\]
• Note that since the solution to an optimization problem is invariant to a strictly monotone increasing transformation of the objective function, a MLE can also be obtained as a solution to the following problem:
\[
\max_{\theta \in \Theta} L(y; \theta), \qquad \text{where } L(y; \theta) = \log \ell(y; \theta) \text{ is the log-likelihood function.}
\]
Proposition 2 (Sufficient condition for existence) If the parameter space $\Theta$ is compact and if the likelihood function $\theta \mapsto \ell(y; \theta)$ is continuous on $\Theta$, then a MLE exists.
Proposition 3 (Sufficient condition for uniqueness of MLE) If the parameter space $\Theta$ is convex and if the likelihood function $\theta \mapsto \ell(y; \theta)$ is strictly concave in $\theta$, then the MLE is unique when it exists.
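For example, with i.i.d. $N(\mu, 1)$ observations and a convex parameter space $\Theta \subset \mathbb{R}$, the log-likelihood
\[
L(y; \mu) = -\frac{n}{2} \log(2\pi) - \frac{1}{2} \sum_{i=1}^{n} (y_i - \mu)^2
\]
is strictly concave in $\mu$, so the maximizer is unique: the MLE is $\hat{\mu} = \bar{y}$ whenever $\bar{y} \in \Theta$.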
• If the observations on $Y$ are i.i.d. with density $f(y_i; \theta)$ for each observation, then we can write the likelihood function as
\[
\ell(y; \theta) = \prod_{i=1}^{n} f(y_i; \theta) \quad \Longrightarrow \quad L(y; \theta) = \sum_{i=1}^{n} \log f(y_i; \theta).
\]
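For example, for i.i.d. exponential observations with density $f(y_i; \theta) = \theta e^{-\theta y_i}$, $\theta > 0$,
\[
L(y; \theta) = n \log \theta - \theta \sum_{i=1}^{n} y_i, \qquad \frac{\partial L(y; \theta)}{\partial \theta} = \frac{n}{\theta} - \sum_{i=1}^{n} y_i = 0 \;\Longrightarrow\; \hat{\theta}_n = \frac{1}{\bar{y}}.
\]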
Properties of MLE
Proposition 4 (Functional invariance of MLE) Suppose $g : \Theta \to \Lambda$ is a bijective function, where $\Lambda \subset \mathbb{R}^q$, and $\hat{\theta}$ is a MLE of $\theta$. Then $\hat{\lambda} = g(\hat{\theta})$ is a MLE of $\lambda \in \Lambda$.
⇒ Since $g$ is bijective, $\ell(y; \hat{\theta}) \geq \ell(y; \theta)$ for all $\theta \in \Theta$ holds, or equivalently, $\hat{\lambda} \in \Lambda$ and
\[
\ell\left(y; g^{-1}(\hat{\lambda})\right) \geq \ell\left(y; g^{-1}(\lambda)\right), \quad \forall \lambda \in \Lambda,
\]
which implies that $\hat{\lambda} = g(\hat{\theta})$ is a MLE of $\lambda$ in a model with density $\ell\left(y; g^{-1}(\lambda)\right)$.
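For example, in the exponential model above the mean is $\mu = g(\theta) = 1/\theta$, a bijection on $(0, \infty)$, so by Proposition 4 the MLE of the mean is $\hat{\mu} = g(\hat{\theta}_n) = \bar{y}$.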
⇒ Let $S(Y)$ be a sufficient statistic. From the factorization theorem for a sufficient statistic, the density function can be written as $\ell(y; \theta) = \Psi(S(y); \theta)\, h(y)$, i.e., $L(y; \theta) = \log \Psi(S(y); \theta) + \log h(y)$. Hence maximizing $\ell(y; \theta)$ with respect to $\theta$ is equivalent to maximizing $\log \Psi(S(y); \theta)$ with respect to $\theta$. Therefore, the MLE depends on $Y$ only through $S(Y)$.
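For example, for i.i.d. Bernoulli($\theta$) observations, $\ell(y; \theta) = \theta^{\sum_i y_i} (1 - \theta)^{n - \sum_i y_i}$, so the factorization holds with $S(y) = \sum_{i=1}^{n} y_i$ and $h(y) = 1$, and the MLE $\hat{\theta}_n = \bar{y}$ indeed depends on the data only through $S(y)$.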
• To discuss the asymptotic properties of the MLE, which are the main reason we study and use MLE in practice, we need some so-called regularity conditions. These conditions should be checked, not taken for granted, before we use MLE, although checking them is difficult, and often impossible, in practice.
Regularity Conditions
1. The variables $Y_i$, $i = 1, 2, \dots$, are independent and identically distributed with density $f(y; \theta)$.
2. The parameter space $\Theta$ is compact.
3. The true but unknown parameter value $\theta_0$ is identified, i.e., $f(y; \theta) = f(y; \theta_0)$ for almost all $y$ implies $\theta = \theta_0$.
4. The log-likelihood function is continuous in $\theta$.
5. $E_{\theta_0} \log f(Y; \theta)$ exists.
6. The log-likelihood function is such that $\frac{1}{n} L(y; \theta)$ converges almost surely (in probability) to $E_{\theta_0} \log f(Y_i; \theta)$ uniformly in $\theta \in \Theta$, i.e.,
\[
\sup_{\theta \in \Theta} \left| \frac{1}{n} L(y; \theta) - E_{\theta_0} \log f(Y_i; \theta) \right| \longrightarrow 0 \quad \text{almost surely (in probability).}
\]
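For instance, in the exponential model above with $\Theta = [a, b]$, $0 < a < b < \infty$, conditions 1–5 are immediate, and condition 6 follows from a uniform law of large numbers: $\log f(y; \theta) = \log \theta - \theta y$ is continuous in $\theta$ and dominated on $\Theta$ by the integrable function $\max(|\log a|, |\log b|) + b y$.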
Proposition 6 Under conditions 1–6, there exists a sequence of MLEs converging almost surely (in probability) to the true parameter value $\theta_0$. That is, the MLE is a consistent estimator, i.e.,
\[
\hat{\theta}_n \xrightarrow{a.s.} \theta_0 \quad \left( \hat{\theta}_n \xrightarrow{p} \theta_0 \right).
\]
Note that the identifiability condition 3 is what ensures the convergence of $\hat{\theta}_n$ to $\theta_0$: by the information inequality, $E_{\theta_0} \log f(Y; \theta) \leq E_{\theta_0} \log f(Y; \theta_0)$ with equality only if $f(\cdot; \theta) = f(\cdot; \theta_0)$, so under identification the limiting objective is uniquely maximized at $\theta_0$.
• Additional regularity assumptions, in particular differentiability of the log-likelihood in $\theta$, enable us to use differential methods to obtain the MLE and its asymptotic distribution.
Lemma 7
\[
E_{\theta_0}\left[ \frac{\partial \log f(Y; \theta_0)}{\partial \theta} \right] = 0.
\]
⇒ Note that
\[
E_{\theta_0}\left[ \frac{\partial \log f(Y; \theta_0)}{\partial \theta} \right]
= \int \frac{1}{f(y; \theta_0)} \frac{\partial f(y; \theta_0)}{\partial \theta} f(y; \theta_0)\, dy
= \int \frac{\partial f(y; \theta_0)}{\partial \theta}\, dy.
\]
However, $\int f(y; \theta_0)\, dy = 1$ by definition, so differentiating both sides with respect to $\theta$ (interchanging differentiation and integration),
\[
\int \frac{\partial f(y; \theta_0)}{\partial \theta}\, dy = \frac{\partial}{\partial \theta} \int f(y; \theta_0)\, dy = 0.
\]
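For example, for the $N(\theta, 1)$ density $f(y; \theta) = (2\pi)^{-1/2} e^{-(y - \theta)^2 / 2}$, the score is $\partial \log f(y; \theta) / \partial \theta = y - \theta$, and indeed $E_{\theta_0}[Y - \theta_0] = 0$.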
Lemma 8
\[
E_{\theta_0}\left[ \frac{\partial \log f(Y; \theta_0)}{\partial \theta} \frac{\partial \log f(Y; \theta_0)}{\partial \theta'} \right]
= E_{\theta_0}\left[ -\frac{\partial^2 \log f(Y; \theta_0)}{\partial \theta\, \partial \theta'} \right].
\]
⇒ Note that
\[
E_{\theta_0}\left[ \frac{\partial^2 \log f(Y; \theta_0)}{\partial \theta\, \partial \theta'} \right]
= \int \frac{\partial^2 \log f(y; \theta_0)}{\partial \theta\, \partial \theta'} f(y; \theta_0)\, dy
= \int \frac{\partial}{\partial \theta} \left( \frac{\partial \log f(y; \theta_0)}{\partial \theta'} \right) f(y; \theta_0)\, dy
\]
\[
= \int \frac{\partial}{\partial \theta} \left( \frac{1}{f(y; \theta_0)} \frac{\partial f(y; \theta_0)}{\partial \theta'} \right) f(y; \theta_0)\, dy
\]
\[
= \int \left( -\frac{1}{\left( f(y; \theta_0) \right)^2} \frac{\partial f(y; \theta_0)}{\partial \theta} \frac{\partial f(y; \theta_0)}{\partial \theta'} + \frac{1}{f(y; \theta_0)} \frac{\partial^2 f(y; \theta_0)}{\partial \theta\, \partial \theta'} \right) f(y; \theta_0)\, dy
\]
\[
= -\int \frac{1}{f(y; \theta_0)} \frac{\partial f(y; \theta_0)}{\partial \theta} \cdot \frac{1}{f(y; \theta_0)} \frac{\partial f(y; \theta_0)}{\partial \theta'} \, f(y; \theta_0)\, dy + \int \frac{\partial^2 f(y; \theta_0)}{\partial \theta\, \partial \theta'}\, dy
\]
\[
= -\int \frac{\partial \log f(y; \theta_0)}{\partial \theta} \frac{\partial \log f(y; \theta_0)}{\partial \theta'} f(y; \theta_0)\, dy
= -E_{\theta_0}\left[ \frac{\partial \log f(Y; \theta_0)}{\partial \theta} \frac{\partial \log f(Y; \theta_0)}{\partial \theta'} \right].
\]
The last line follows from the fact that $\int \frac{\partial^2 f(y; \theta_0)}{\partial \theta\, \partial \theta'}\, dy = 0$, by the same differentiation-under-the-integral argument as in Lemma 7.
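For example, for a Bernoulli($\theta$) observation, $\log f(y; \theta) = y \log \theta + (1 - y) \log(1 - \theta)$, so
\[
\frac{\partial \log f(y; \theta)}{\partial \theta} = \frac{y - \theta}{\theta (1 - \theta)}, \qquad -\frac{\partial^2 \log f(y; \theta)}{\partial \theta^2} = \frac{y}{\theta^2} + \frac{1 - y}{(1 - \theta)^2},
\]
and taking expectations at $\theta_0$ gives
\[
E_{\theta_0}\left[ \left( \frac{Y - \theta_0}{\theta_0 (1 - \theta_0)} \right)^2 \right] = \frac{\theta_0 (1 - \theta_0)}{\theta_0^2 (1 - \theta_0)^2} = \frac{1}{\theta_0 (1 - \theta_0)} = \frac{1}{\theta_0} + \frac{1}{1 - \theta_0} = E_{\theta_0}\left[ \frac{Y}{\theta_0^2} + \frac{1 - Y}{(1 - \theta_0)^2} \right],
\]
confirming Lemma 8.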
⇒ A Taylor series expansion of the first order condition around the true value of $\theta$, $\theta_0$, yields
\[
\frac{\partial L(\hat{\theta}_n)}{\partial \theta} = \frac{\partial L(\theta_0)}{\partial \theta} + \frac{\partial^2 L(\theta^*)}{\partial \theta\, \partial \theta'} \left( \hat{\theta}_n - \theta_0 \right),
\]
where $\theta^*$ is on the line segment connecting $\hat{\theta}_n$ and $\theta_0$. From the first order condition, we have
\[
0 = \frac{\partial L(\theta_0)}{\partial \theta} + \frac{\partial^2 L(\theta^*)}{\partial \theta\, \partial \theta'} \left( \hat{\theta}_n - \theta_0 \right).
\]
Therefore,
\[
\sqrt{n} \left( \hat{\theta}_n - \theta_0 \right) = \left( -\frac{1}{n} \frac{\partial^2 L(\theta^*)}{\partial \theta\, \partial \theta'} \right)^{-1} \frac{1}{\sqrt{n}} \frac{\partial L(\theta_0)}{\partial \theta}.
\]
As $n \to \infty$,
\[
-\frac{1}{n} \frac{\partial^2 L(\theta^*)}{\partial \theta\, \partial \theta'} = -\frac{1}{n} \sum_{i=1}^{n} \frac{\partial^2 \log f(Y_i; \theta^*)}{\partial \theta\, \partial \theta'}
\]
converges almost surely to
\[
I(\theta_0) = E_{\theta_0}\left[ -\frac{\partial^2 \log f(Y; \theta_0)}{\partial \theta\, \partial \theta'} \right]
\]
by the strong law of large numbers and the fact that $\theta^* \xrightarrow{a.s.} \theta_0$. Moreover,
\[
\frac{1}{\sqrt{n}} \frac{\partial L(\theta_0)}{\partial \theta} = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{\partial \log f(Y_i; \theta_0)}{\partial \theta}
= \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \left( \frac{\partial \log f(Y_i; \theta_0)}{\partial \theta} - E_{\theta_0}\left[ \frac{\partial \log f(Y; \theta_0)}{\partial \theta} \right] \right),
\]
where the second equality uses Lemma 7 (the score has mean zero at $\theta_0$). By the central limit theorem and Lemma 8, this converges in distribution to $N(0, I(\theta_0))$, and hence
\[
\sqrt{n} \left( \hat{\theta}_n - \theta_0 \right) \xrightarrow{d} N\left( 0, I(\theta_0)^{-1} \right).
\]
• The asymptotic distribution itself is not directly usable, since the information matrix has to be evaluated at the true parameter value. However, we can consistently estimate the asymptotic variance of the MLE by evaluating the information matrix at the MLE, i.e.,
\[
\sqrt{n} \left( \hat{\theta}_n - \theta_0 \right) \xrightarrow{d} N\left( 0, I\left( \hat{\theta}_n \right)^{-1} \right).
\]
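As a concrete illustration, the following minimal Python sketch computes the MLE $\hat{\theta}_n = 1/\bar{y}$ in the exponential model and a standard error from the information matrix evaluated at the MLE, $I(\hat{\theta}_n)^{-1}/n = \hat{\theta}_n^2 / n$; the true rate and sample size are arbitrary simulation choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate i.i.d. exponential data: f(y; theta) = theta * exp(-theta * y).
theta0 = 2.0   # true rate (arbitrary simulation choice)
n = 1_000
y = rng.exponential(scale=1.0 / theta0, size=n)

# MLE: L(y; theta) = n*log(theta) - theta*sum(y) is maximized at 1/ybar.
theta_hat = 1.0 / y.mean()

# Fisher information per observation is I(theta) = 1/theta^2, so the
# asymptotic variance of theta_hat is I(theta_hat)^{-1} / n = theta_hat^2 / n.
se = theta_hat / np.sqrt(n)

print(f"theta_hat = {theta_hat:.4f}, asymptotic s.e. = {se:.4f}")
print(f"95% CI: [{theta_hat - 1.96 * se:.4f}, {theta_hat + 1.96 * se:.4f}]")
```

With the seed fixed as above, the reported interval should cover the true rate used in the simulation, in line with the asymptotic normality result.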