Chapter 6 Maximum Likelihood Methods
6.2 Rao–Cramér Lower Bound and Efficiency
Boxiang Wang
STAT 4101, Spring 2021
Outline
1 Fisher information
2 Rao–Cramér inequality
3 Explanation of Fisher information
4 Proof of the Rao–Cramér inequality
5 Asymptotic properties of the MLE
Fisher Information
Definition of Fisher Information
Suppose X = {X1, . . . , Xn} are iid random variables drawn from f(x; θ), and the observed data are x1, . . . , xn. The Fisher information contained in the n samples is
\[ I_n(\theta) = E\left[-\frac{d^2}{d\theta^2}\log f(\mathbf{X};\theta)\right]. \]
If we have only one sample, the Fisher information is
\[ I_1(\theta) = E\left[-\frac{d^2}{d\theta^2}\log f(X_1;\theta)\right], \]
and under the iid assumption
\[ I_n(\theta) = nI_1(\theta). \]
Remark: This is not the original definition of the Fisher information, but it provides a fast way to compute it. The original definition is discussed later, in the "Explanation of the Fisher Information" part of this section.
Example (6.2.1): Information for Bernoulli R.V.s
Let X1, . . . , Xn ∼ Bern(θ). Compute the Fisher information In(θ).
Solution: Step 1: compute the second-order derivative:
\[ \log f(x;\theta) = x\log\theta + (1-x)\log(1-\theta), \]
\[ \frac{\partial\log f(x;\theta)}{\partial\theta} = \frac{x}{\theta} - \frac{1-x}{1-\theta}, \qquad \frac{\partial^2\log f(x;\theta)}{\partial\theta^2} = -\frac{x}{\theta^2} - \frac{1-x}{(1-\theta)^2}. \]
Step 2: compute the expectation of the negative 2nd-order derivative:
\[ I_1(\theta) = -E\left[-\frac{X}{\theta^2} - \frac{1-X}{(1-\theta)^2}\right] = \frac{\theta}{\theta^2} + \frac{1-\theta}{(1-\theta)^2} = \frac{1}{\theta} + \frac{1}{1-\theta} = \frac{1}{\theta(1-\theta)}. \]
Step 3: \( I_n(\theta) = nI_1(\theta) = \dfrac{n}{\theta(1-\theta)}. \)
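As a quick numerical sanity check (a sketch, not part of the original example; the value of θ, the sample size, and the seed are arbitrary choices), we can approximate I1(θ) = E[−∂² log f(X;θ)/∂θ²] by a Monte Carlo average and compare it with the closed form 1/(θ(1−θ)):

```python
import numpy as np

# Monte Carlo check of the Bernoulli Fisher information I_1(theta) = 1/(theta*(1-theta)).
rng = np.random.default_rng(0)
theta = 0.3
x = rng.binomial(1, theta, size=1_000_000)

# Negative second derivative of log f(x; theta) = x*log(theta) + (1-x)*log(1-theta):
#   -d^2/dtheta^2 log f = x/theta^2 + (1-x)/(1-theta)^2
neg_hessian = x / theta**2 + (1 - x) / (1 - theta)**2

print("Monte Carlo estimate of I_1(theta):", neg_hessian.mean())
print("Closed form 1/(theta*(1-theta)):   ", 1 / (theta * (1 - theta)))
```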
Example (6.2.1) cont’d
If X1, . . . , X160 are iid Bern(0.2), what is the value of the Fisher information?
Answer: In(0.2) = 160/(0.2 × 0.8) = 1000.
What is the mle of the Fisher information?
Answer: The mle of θ is X̄, so the mle of In(θ) is n/(X̄(1 − X̄)).
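A small simulation of the plug-in estimate (illustrative only; the seed and the simulated data are not from the slides). By the invariance of the mle, n/(X̄(1 − X̄)) estimates In(θ), and for n = 160 and θ = 0.2 it should be near the true value 1000:

```python
import numpy as np

# Plug-in (mle) estimate of the Fisher information for Bern(theta), n = 160, theta = 0.2.
rng = np.random.default_rng(1)
n, theta = 160, 0.2
x = rng.binomial(1, theta, size=n)

xbar = x.mean()                      # mle of theta
info_hat = n / (xbar * (1 - xbar))   # mle of I_n(theta) by the invariance property

print("xbar =", xbar)
print("estimated I_n =", info_hat)   # should be near 1000
print("true I_n      =", n / (theta * (1 - theta)))
```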
Example: Information for Poisson R.V.s
Let X1, . . . , Xn ∼ Pois(λ). Compute the Fisher information In(λ).
Solution: Step 1: compute the second-order derivative:
\[ f(x;\lambda) = e^{-\lambda}\frac{\lambda^x}{x!}, \]
\[ \log f(x;\lambda) = -\lambda + x\log\lambda - \log(x!), \]
\[ \frac{d}{d\lambda}\log f(x;\lambda) = -1 + \frac{x}{\lambda}, \qquad -\frac{d^2}{d\lambda^2}\log f(x;\lambda) = \frac{x}{\lambda^2}. \]
Step 2: compute the expectation of the negative 2nd-order derivative:
\[ I_1(\lambda) = E\left[-\frac{d^2}{d\lambda^2}\log f(X;\lambda)\right] = \frac{E[X]}{\lambda^2} = \frac{1}{\lambda}. \]
Step 3: \( I_n(\lambda) = nI_1(\lambda) = \dfrac{n}{\lambda}. \)
See the efficiency example for the Poisson distribution later in this section.
Example: Information for Normal R.V.s
Let X1, . . . , Xn ∼ N(μ, σ2). Suppose σ is known. Compute the Fisher information In(μ).
Solution:
Step 1: compute the second-order derivative:
\[ f(x;\mu) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \]
\[ \log f(x;\mu) = \log\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{(x-\mu)^2}{2\sigma^2}, \]
\[ \frac{d}{d\mu}\log f(x;\mu) = \frac{x-\mu}{\sigma^2}, \qquad -\frac{d^2}{d\mu^2}\log f(x;\mu) = \frac{1}{\sigma^2}. \]
Example: Information for Normal R.V.s (cont’d)
Step 2: compute the expectation of the negative 2nd-order derivative:
\[ I_1(\mu) = E\left[-\frac{d^2}{d\mu^2}\log f(X;\mu)\right] = \frac{1}{\sigma^2}. \]
Step 3: \( I_n(\mu) = nI_1(\mu) = \dfrac{n}{\sigma^2}. \)
See the efficiency example for the normal mean later in this section.
Example: Information for Normal R.V.s
Let X1, . . . , Xn ∼ N(μ, σ2). Suppose μ is known. Compute the Fisher information In(σ2).
Solution: Step 1: For convenience, let θ = σ²:
\[ f(x;\theta) = \frac{1}{\sqrt{2\pi\theta}}\, e^{-\frac{(x-\mu)^2}{2\theta}}, \]
\[ \log f(x;\theta) = \log\frac{1}{\sqrt{2\pi}} - \frac{1}{2}\log\theta - \frac{(x-\mu)^2}{2\theta}, \]
\[ \frac{d}{d\theta}\log f(x;\theta) = -\frac{1}{2\theta} + \frac{(x-\mu)^2}{2\theta^2}, \qquad -\frac{d^2}{d\theta^2}\log f(x;\theta) = -\frac{1}{2\theta^2} + \frac{(x-\mu)^2}{\theta^3}. \]
Step 2: \( \displaystyle E\left[-\frac{d^2}{d\theta^2}\log f(X;\theta)\right] = -\frac{1}{2\theta^2} + \frac{\theta}{\theta^3} = \frac{1}{2\theta^2}. \)
Step 3: \( I_n(\sigma^2) = nI_1(\sigma^2) = \dfrac{n}{2\sigma^4}. \)
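As a sketch (not in the original slides; the values of μ, σ², the sample size, and the seed are arbitrary), the result I1(σ²) = 1/(2σ⁴) can be checked by averaging the negative second derivative of log f with respect to θ = σ² over simulated data:

```python
import numpy as np

# Monte Carlo check of I_1(sigma^2) = 1/(2*sigma^4) for N(mu, sigma^2) with mu known.
rng = np.random.default_rng(2)
mu, sigma2 = 1.0, 2.0
x = rng.normal(mu, np.sqrt(sigma2), size=1_000_000)

# With theta = sigma^2:  -d^2/dtheta^2 log f = -1/(2*theta^2) + (x - mu)^2 / theta^3
neg_hessian = -1 / (2 * sigma2**2) + (x - mu)**2 / sigma2**3

print("Monte Carlo estimate of I_1(sigma^2):", neg_hessian.mean())
print("Closed form 1/(2*sigma^4):           ", 1 / (2 * sigma2**2))
```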
Rao–Cramér Lower Bound
Regularity Conditions for the Rao–Cramér Lower Bound
(R0) The pdfs are distinct; i.e., θ ≠ θ′ ⇒ f(xᵢ; θ) ≠ f(xᵢ; θ′).
(R1) The pdfs have common support for all θ.
(R2) The true parameter θ0 is an interior point in Ω.
(R3) The pdf f (x; θ) is twice differentiable as a function of θ.
(R4) The integral ∫ f(x; θ) dx can be differentiated twice under the integral sign as a function of θ.
Remark: with those conditions, we can interchange integration and differentiation with respect to θ.
Theorem (6.2.1): Rao–Cramér Lower Bound
Let X1, · · · , Xn be i.i.d. with common pdf f(x; θ) for θ ∈ Ω. Assume the regularity conditions (R0) – (R4) hold. Let Y = y(X1, · · · , Xn) be a statistic with mean E(Y) = k(θ). Then
\[ \mathrm{Var}(Y) \ge \frac{[k'(\theta)]^2}{nI_1(\theta)}. \]
Corollary (6.2.1)
Under the assumptions of Theorem 6.2.1, if Y = y(X1, · · · , Xn) is an unbiased estimator of θ, then the Rao–Cramér inequality becomes
\[ \mathrm{Var}(Y) \ge \frac{1}{I_n(\theta)} = \frac{1}{nI_1(\theta)}. \]
The variance of an unbiased estimator of θ is no less than the inverse of the Fisher information.
Application of the Rao–Cramér Inequality
Consider the Bernoulli model with probability of success θ, which was treated in Example 6.2.1.
For any θ̂ such that E(θ̂) = θ,
\[ \mathrm{Var}(\hat\theta) \ge \frac{\theta(1-\theta)}{n}. \]
This is because I1(θ) = [θ(1 − θ)]⁻¹, and the Rao–Cramér inequality gives
\[ \mathrm{Var}(\hat\theta) \ge \frac{1}{I_n(\theta)} = \frac{\theta(1-\theta)}{n}. \]
Consider θ̂_MLE = X̄; then E(θ̂_MLE) = θ, so it is unbiased.
We also have Var(θ̂_MLE) = θ(1 − θ)/n, so in this case the variance of the mle attains the Rao–Cramér lower bound.
We say X̄ has the minimum variance among all unbiased estimators of θ.
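A short simulation (illustrative settings; the values of n, θ, the number of replications, and the seed are not from the slides) comparing the Monte Carlo variance of X̄ with the Rao–Cramér bound θ(1 − θ)/n:

```python
import numpy as np

# Variance of the unbiased estimator xbar for Bern(theta) vs. the Rao-Cramer bound.
rng = np.random.default_rng(3)
n, theta, reps = 50, 0.2, 200_000

x = rng.binomial(1, theta, size=(reps, n))
xbar = x.mean(axis=1)                 # one estimate of theta per replication

print("Var(xbar) by simulation:          ", xbar.var())
print("Rao-Cramer bound theta(1-theta)/n:", theta * (1 - theta) / n)  # the bound is attained
```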
Definition of Efficient Estimator
Let Y be an unbiased estimator of a parameter θ in a point estimation problem. The statistic Y is called an efficient estimator of θ if and only if the variance of Y attains the Rao–Cramér lower bound.
Remark: Y is efficient if Y is unbiased and it has the minimum variance among all unbiased estimators of θ.
Example (6.2.3): Efficient Estimator for Poisson(λ)
Let X1, . . . , Xn ∼ Pois(λ). Find an efficient estimator of λ.
Solution:
The Fisher information is In(λ) = n/λ (see the Poisson example above).
Rao–Cramér lower bound: λ/n.
Consider X̄. It is unbiased, and its variance is λ/n.
Thus X ̄ is an efficient estimator of λ.
Example (6.2.3): Efficient Estimator for N(μ, σ2)
Let X1, . . . , Xn ∼ N(μ, σ2). Suppose σ is known. Find an efficient estimator of μ.
Solution:
The Fisher information is In(μ) = n/σ² (see the normal-mean example above).
Rao–Cramér lower bound: σ²/n.
Consider X̄. It is unbiased, and its variance is σ²/n.
Thus X ̄ is an efficient estimator of μ.
Explanation of the Fisher Information
Consider X ∼ f(x; θ). Let us derive the following fast-computation formula for the Fisher information:
\[ I_1(\theta) = E\left[-\frac{d^2}{d\theta^2}\log f(X;\theta)\right]. \]
We begin with
\[ 1 = \int_{-\infty}^{\infty} f(x;\theta)\,dx, \]
and differentiating with respect to θ gives
\[ 0 = \int_{-\infty}^{\infty} \frac{\partial f(x;\theta)}{\partial\theta}\,dx. \]
The last expression can be rewritten as
\[ 0 = \int_{-\infty}^{\infty} \frac{\partial f(x;\theta)/\partial\theta}{f(x;\theta)}\, f(x;\theta)\,dx = \int_{-\infty}^{\infty} \frac{\partial\log f(x;\theta)}{\partial\theta}\, f(x;\theta)\,dx. \]
The important function
\[ \frac{\partial \log f(X;\theta)}{\partial\theta} \]
is called the score function. We have established that
\[ E\left[\frac{\partial\log f(X;\theta)}{\partial\theta}\right] = 0, \]
from
\[ 0 = \int_{-\infty}^{\infty} \frac{\partial\log f(x;\theta)}{\partial\theta}\, f(x;\theta)\,dx. \]
Let us differentiate the last equation again.
It follows that
\[ 0 = \int_{-\infty}^{\infty} \frac{\partial^2\log f(x;\theta)}{\partial\theta^2}\, f(x;\theta)\,dx + \int_{-\infty}^{\infty} \frac{\partial\log f(x;\theta)}{\partial\theta}\,\frac{\partial\log f(x;\theta)}{\partial\theta}\, f(x;\theta)\,dx. \]
We observe that
\[ 0 = E\left[\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\right] + E\left[\left(\frac{\partial\log f(X;\theta)}{\partial\theta}\right)^2\right]. \]
Define the Fisher information:
\[ I_1(\theta) = E\left[\left(\frac{\partial\log f(X;\theta)}{\partial\theta}\right)^2\right]. \]
This yields the fast-computation formula that we have used:
\[ I_1(\theta) = -E\left[\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\right]. \]
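As a numerical illustration (a sketch using the Poisson model from the earlier example; λ, the sample size, and the seed are arbitrary), the score-based definition E[(∂ log f/∂λ)²] and the fast-computation formula −E[∂² log f/∂λ²] give the same value, here 1/λ:

```python
import numpy as np

# Both definitions of the Fisher information agree; illustrated with Pois(lambda), I_1 = 1/lambda.
rng = np.random.default_rng(4)
lam = 3.0
x = rng.poisson(lam, size=1_000_000)

score = -1 + x / lam                        # d/dlambda log f(x; lambda)
neg_hessian = x / lam**2                    # -d^2/dlambda^2 log f(x; lambda)

print("E[score] (should be ~0):", score.mean())
print("E[score^2]:             ", (score**2).mean())   # definition via the score
print("-E[second derivative]:  ", neg_hessian.mean())  # fast-computation formula
print("1/lambda:               ", 1 / lam)
```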
How to interpret
\[ I_1(\theta) = E\left[\left(\frac{\partial\log f(X;\theta)}{\partial\theta}\right)^2\right]? \]
Recall
\[ E\left[\frac{\partial\log f(X;\theta)}{\partial\theta}\right] = 0. \qquad (*) \]
Thus the Fisher information is the variance of the random variable ∂ log f(X; θ)/∂θ, i.e., the score function:
\[ I_1(\theta) = \mathrm{Var}\left[\frac{\partial\log f(X;\theta)}{\partial\theta}\right]. \]
Remark: Setting the score function to zero gives the estimating equation for the mle; i.e., θ̂_MLE solves
\[ \sum_{i=1}^{n}\frac{\partial\log f(x_i;\theta)}{\partial\theta} = 0. \]
The equation (*) will also be used in the proof of the Rao–Cramér lower bound below.
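A minimal sketch of solving the estimating equation numerically (assuming the Bernoulli model, where the root is known to be X̄, so the root-finder should reproduce it; the data, seed, and bracketing interval are arbitrary choices):

```python
import numpy as np
from scipy.optimize import brentq

# Solve the estimating equation sum_i d/dtheta log f(x_i; theta) = 0 numerically (Bernoulli case).
rng = np.random.default_rng(5)
x = rng.binomial(1, 0.3, size=100)

def total_score(theta):
    # sum of d/dtheta [x*log(theta) + (1-x)*log(1-theta)] over the sample
    return np.sum(x / theta - (1 - x) / (1 - theta))

theta_hat = brentq(total_score, 1e-6, 1 - 1e-6)   # root of the score equation
print("root of the score equation:", theta_hat)
print("xbar:                      ", x.mean())     # the two agree
```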
Let X1, . . . , Xn be a random sample from f(x; θ). The likelihood L(θ) is the joint pdf of the random sample, L(θ; X) = ∏ᵢ₌₁ⁿ f(Xᵢ; θ), so
\[ \frac{\partial\log L(\theta;\mathbf{X})}{\partial\theta} = \sum_{i=1}^{n}\frac{\partial\log f(X_i;\theta)}{\partial\theta}. \]
Thus the Fisher information contained in the n samples is
\[ I_n(\theta) = \mathrm{Var}\left[\frac{\partial\log L(\theta;\mathbf{X})}{\partial\theta}\right] = \mathrm{Var}\left[\sum_{i=1}^{n}\frac{\partial\log f(X_i;\theta)}{\partial\theta}\right] = \sum_{i=1}^{n}\mathrm{Var}\left[\frac{\partial\log f(X_i;\theta)}{\partial\theta}\right] = nI_1(\theta). \]
Remark: The iid assumption is the key.
Our derivation is for the continuous case, but the discrete case can be handled in a similar manner.
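As a sketch (the Bernoulli model and all numerical settings below are illustrative choices, not from the slides), the identity In(θ) = Var(∂ log L/∂θ) = nI1(θ) can be checked by Monte Carlo:

```python
import numpy as np

# Check I_n(theta) = Var(d/dtheta log L) = n * I_1(theta) for Bern(theta) by Monte Carlo.
rng = np.random.default_rng(6)
n, theta, reps = 20, 0.4, 200_000

x = rng.binomial(1, theta, size=(reps, n))
total_score = np.sum(x / theta - (1 - x) / (1 - theta), axis=1)  # d/dtheta log L, one value per sample

print("Var(total score):", total_score.var())
print("n * I_1(theta):  ", n / (theta * (1 - theta)))
```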
Proof of the Rao–Cramér Lower Bound
Theorem (6.2.1): Rao–Cramér Lower Bound
Let X1, · · · , Xn be i.i.d. with common pdf f(x; θ) for θ ∈ Ω. Assume the regularity conditions (R0) – (R4) hold. Let Y = y(X1, · · · , Xn) be a statistic with mean E(Y) = k(θ). Then
\[ \mathrm{Var}(Y) \ge \frac{[k'(\theta)]^2}{nI(\theta)}. \]
The proof is for the continuous case, but the proof for the discrete case is quite similar.
For Y = y(X1, · · · , Xn), its mean is
\[ E(Y) = k(\theta) = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} y(x_1,\ldots,x_n)\, f(x_1;\theta)\cdots f(x_n;\theta)\,dx_1\cdots dx_n. \]
Differentiating with respect to θ,
\[ k'(\theta) = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} y(x_1,\ldots,x_n)\left[\sum_{i=1}^{n}\frac{1}{f(x_i;\theta)}\frac{\partial f(x_i;\theta)}{\partial\theta}\right] f(x_1;\theta)\cdots f(x_n;\theta)\,dx_1\cdots dx_n \]
\[ \phantom{k'(\theta)} = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} y(x_1,\ldots,x_n)\left[\sum_{i=1}^{n}\frac{\partial\log f(x_i;\theta)}{\partial\theta}\right] f(x_1;\theta)\cdots f(x_n;\theta)\,dx_1\cdots dx_n. \]
Define
\[ Z = \sum_{i=1}^{n}\frac{\partial\log f(X_i;\theta)}{\partial\theta}. \]
Then the equation on the last slide can be expressed as k′(θ) = E(YZ).
Since Cov(Y, Z) = ρσYσZ and σZ = √Var(Z) = √(nI(θ)), we further have
\[ k'(\theta) = E(YZ) = E(Y)E(Z) + \rho\,\sigma_Y\sqrt{nI(\theta)}. \]
Recall we have shown that E(Z) = 0 (the score has mean zero), which gives
\[ \rho = \frac{k'(\theta)}{\sigma_Y\sqrt{nI(\theta)}}. \]
Because ρ² ≤ 1, we have
\[ \frac{[k'(\theta)]^2}{\sigma_Y^2\, nI(\theta)} \le 1, \quad\text{i.e.,}\quad \mathrm{Var}(Y) = \sigma_Y^2 \ge \frac{[k'(\theta)]^2}{nI(\theta)}. \]
Asymptotic Properties of MLE
Corollary (6.1.1): Consistency of MLE
Assume X1, . . . , Xn are iid with f(x; θ0) under some regularity conditions.
Assume the likelihood equation has a unique solution θ̂n, i.e.,
\[ \left.\frac{\partial}{\partial\theta}\, l(\theta)\right|_{\theta=\hat\theta_n} = 0. \]
Then
\[ \hat\theta_n \xrightarrow{P} \theta_0. \]
Remark: Under some regularity conditions, the mle is a consistent estimator of the population parameter.
Theorem (6.2.2): Asymptotic Normality of MLE
Assume X1, . . . , Xn are iid with f(x; θ0) under some regularity conditions. Assume the likelihood equation has a unique solution
θ̂n. Then
\[ \sqrt{n}\left(\hat\theta_n - \theta_0\right) \xrightarrow{D} N\!\left(0,\ \frac{1}{I(\theta_0)}\right). \]
Remark:
We have established a central limit theorem for the mle.
Roughly speaking, when n is large, the distribution of the mle θ̂n is approximately
\[ \hat\theta_n \overset{\cdot}{\sim} N\!\left(\theta_0,\ \frac{1}{nI(\theta_0)}\right). \]
Recall the Rao–Cramér inequality: for an unbiased estimator of θ,
\[ \mathrm{Var}(\hat\theta) \ge \frac{1}{nI(\theta_0)}. \]
Thus we say the mle θˆn is asymptotically unbiased and also asymptotically efficient.
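A small simulation of the asymptotic normality (a sketch; the Bernoulli model and the numerical settings are illustrative choices). For Bern(θ0), θ̂n = X̄ and 1/I1(θ0) = θ0(1 − θ0), so √n(θ̂n − θ0) should have variance close to θ0(1 − θ0):

```python
import numpy as np

# Asymptotic normality of the mle: sqrt(n)*(thetahat - theta0) ~ N(0, 1/I(theta0)).
# Illustrated with Bern(theta0), where thetahat = xbar and 1/I_1(theta0) = theta0*(1-theta0).
rng = np.random.default_rng(7)
n, theta0, reps = 400, 0.3, 100_000

x = rng.binomial(1, theta0, size=(reps, n))
z = np.sqrt(n) * (x.mean(axis=1) - theta0)

print("simulated mean (should be ~0):", z.mean())
print("simulated variance:           ", z.var())
print("1/I_1(theta0):                ", theta0 * (1 - theta0))
```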
Large-sample Confidence Interval
The asymptotic standard deviation of the mle θ̂n is
\[ [nI(\theta_0)]^{-1/2}. \]
Because I(θ) is a continuous function of θ,
\[ I(\hat\theta_n) \xrightarrow{P} I(\theta_0). \]
An approximate (1 − α)100% confidence interval for θ is
\[ \left( \hat\theta_n - z_{\alpha/2}\,\frac{1}{\sqrt{nI(\hat\theta_n)}},\ \ \hat\theta_n + z_{\alpha/2}\,\frac{1}{\sqrt{nI(\hat\theta_n)}} \right). \]
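A minimal sketch computing this interval (assuming the Bernoulli model, where 1/(nI(θ̂n)) = θ̂n(1 − θ̂n)/n; the data, seed, and confidence level are illustrative):

```python
import numpy as np
from scipy.stats import norm

# Large-sample (Wald) interval: thetahat +/- z_{alpha/2} / sqrt(n * I(thetahat)), Bernoulli case.
rng = np.random.default_rng(8)
n, theta, alpha = 200, 0.3, 0.05

x = rng.binomial(1, theta, size=n)
theta_hat = x.mean()
se = np.sqrt(theta_hat * (1 - theta_hat) / n)   # 1 / sqrt(n * I(thetahat))
z = norm.ppf(1 - alpha / 2)

print("95% CI for theta:", (theta_hat - z * se, theta_hat + z * se))
```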
Corollary (6.2.2): Delta-method
Under the assumptions of Theorem 6.2.2, suppose g(x) is a continuous function of x which is differentiable at θ0 such that g′(θ0) ≠ 0. Then
\[ \sqrt{n}\left(g(\hat\theta_n) - g(\theta_0)\right) \xrightarrow{D} N\!\left(0,\ \frac{g'(\theta_0)^2}{I(\theta_0)}\right). \]
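A sketch applying the corollary (the Bernoulli model and g(θ) = θ/(1 − θ), the odds, are illustrative choices). Here g′(θ0) = 1/(1 − θ0)², so g′(θ0)²/I1(θ0) = θ0/(1 − θ0)³, which a simulation should reproduce:

```python
import numpy as np

# Delta method: sqrt(n)*(g(thetahat) - g(theta0)) ~ N(0, g'(theta0)^2 / I(theta0)).
# Illustrated with g(theta) = theta/(1-theta) (the odds) in the Bernoulli model,
# where g'(theta0)^2 / I_1(theta0) = theta0 / (1 - theta0)^3.
rng = np.random.default_rng(9)
n, theta0, reps = 500, 0.3, 100_000

x = rng.binomial(1, theta0, size=(reps, n))
theta_hat = x.mean(axis=1)
z = np.sqrt(n) * (theta_hat / (1 - theta_hat) - theta0 / (1 - theta0))

print("simulated variance:     ", z.var())
print("delta-method asymptotic:", theta0 / (1 - theta0)**3)
```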