Asymptotics & optimality
(Module 11)
Statistics (MAST20005) & Elements of Statistics (MAST90058)
School of Mathematics and Statistics University of Melbourne
Semester 2, 2022
Aims of this module
• Explain some of the theory that we skipped in previous modules
• Show why the MLE is usually a good (or best) estimator
• Explain some related important theoretical concepts
Likelihood theory
Asymptotic distribution of the MLE
Cramér–Rao lower bound
Sufficient statistics
Factorisation theorem
Optimal tests
Previous claims (from modules 2 & 4)
The MLE is asymptotically:
• unbiased
• efficient (has the optimal variance)
• normally distributed
Can use the negative 2nd derivative of the log-likelihood (the ‘observed information function’) to get a standard error for the MLE.
Motivating example (non-zero binomial)
• Consider a factory producing items in batches. Let θ denote the proportion of defective items. From each batch 3 items are sampled at random and the number of defectives is determined. However, records are only kept if there is at least one defective.
• Let Y be the number of defectives in a batch.
• Then Y ∼ Bi(3, θ), with
Pr(Y = y) = C(3, y) θ^y (1−θ)^{3−y},   y = 0, 1, 2, 3,
where C(3, y) denotes the binomial coefficient.
• But we only take an observation if Y > 0, so the pmf is
Pr(Y = y | Y > 0) = C(3, y) θ^y (1−θ)^{3−y} / [1 − (1−θ)³],   y = 1, 2, 3
• Let Xi be the number of times we observe i defectives and let n = X1 + X2 + X3 be the total number of observations.
• The likelihood is,
L(θ) = [3θ(1−θ)² / (1 − (1−θ)³)]^{x1} [3θ²(1−θ) / (1 − (1−θ)³)]^{x2} [θ³ / (1 − (1−θ)³)]^{x3}
• This simplifies to,
L(θ) ∝ θ^{x1 + 2x2 + 3x3} (1−θ)^{2x1 + x2} / [1 − (1−θ)³]^n
• After taking logarithms and derivatives, the MLE is found to be the smaller root of
tθ² − 3tθ + 3(t − n) = 0,   where t = x1 + 2x2 + 3x3.
• This gives:
θ̂ = [3t − √(12tn − 3t²)] / (2t)
• We now have the MLE…
• . . . but finding its sampling distribution is not straightforward!
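• To make this concrete, here is a small Python sketch (my own illustration, not part of the lecture notes; the counts x1, x2, x3 are hypothetical) that computes the MLE above and checks it against direct numerical maximisation of the log-likelihood:

    import numpy as np
    from scipy.optimize import minimize_scalar

    x1, x2, x3 = 20, 9, 3          # hypothetical counts of batches with 1, 2, 3 defectives
    n = x1 + x2 + x3
    t = x1 + 2*x2 + 3*x3

    # Closed-form MLE: the smaller root of t*theta^2 - 3*t*theta + 3*(t - n) = 0
    theta_hat = (3*t - np.sqrt(12*t*n - 3*t**2)) / (2*t)

    # Check against direct maximisation of the log-likelihood
    def negloglik(theta):
        return -(t*np.log(theta) + (2*x1 + x2)*np.log(1 - theta)
                 - n*np.log(1 - (1 - theta)**3))

    opt = minimize_scalar(negloglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
    print(theta_hat, opt.x)        # the two values should agree closely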
• In general, finding the exact distribution of a statistic is often difficult.
• We’ve used the Central Limit Theorem to approximate the distribution of the sample mean.
• Gave us approximate CIs for a population mean μ of the form,
x̄ ± Φ⁻¹(1 − α/2) × s/√n
• Similar results hold more generally for MLEs (and other estimators)
Definitions
• Start with the log-likelihood:
l(θ) = ln L(θ)
• Taking the first derivative gives the score function (also known simply as the score). Let’s call it U,
U(θ) = ∂l/∂θ
• Note: we solve U(θ̂) = 0 to get the MLE
• Taking the second derivative, and then its negative, gives the observed information function (also known simply as the observed information). Let’s call it V,
V(θ) = −∂U/∂θ = −∂²l/∂θ²
• This represents the curvature of the log-likelihood. Greater curvature ⇒ narrower likelihood around a certain value ⇒ the likelihood is more informative.
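• As a numerical illustration (my own sketch, not lecture code), the following Python lines compute the score and observed information for an exponential sample, where l(θ) = −n ln θ − ∑xi/θ, and check the observed information against a finite-difference second derivative:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.exponential(scale=2.0, size=50)   # hypothetical data, true theta = 2
    n = len(x)

    def loglik(theta):                 # l(theta) = -n*log(theta) - sum(x)/theta
        return -n*np.log(theta) - x.sum()/theta

    def score(theta):                  # U(theta) = dl/dtheta
        return -n/theta + x.sum()/theta**2

    def obs_info(theta):               # V(theta) = -d^2 l / dtheta^2
        return -n/theta**2 + 2*x.sum()/theta**3

    theta_hat = x.mean()               # solves U(theta) = 0
    h = 1e-5
    print(score(theta_hat))            # approximately 0 at the MLE
    print(obs_info(1.5),               # matches the finite-difference curvature
          -(loglik(1.5 + h) - 2*loglik(1.5) + loglik(1.5 - h)) / h**2)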
Fisher information
• All of the above are functions of the data (and parameters). Therefore they are random variables and have sampling distributions.
• For example, we can show that E(U(θ)) = 0.
• An important quantity is I(θ) = E(V (θ)), which is the Fisher information function (or just the Fisher information). It is also known as the expected information function (or simply as the expected information).
• Many results are based on the Fisher information.
• For example, we can show that var(U(θ)) = I(θ).
• More importantly, it arises in theory about the distribution of the MLE.
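• A quick Monte Carlo check of these two facts (my own sketch, using a Poisson model, for which U(λ) = ∑Xi/λ − n and I(λ) = n/λ):

    import numpy as np

    rng = np.random.default_rng(0)
    lam, n, reps = 3.0, 25, 100_000
    x = rng.poisson(lam, size=(reps, n))
    U = x.sum(axis=1)/lam - n          # score evaluated at the true parameter

    print(U.mean())                    # close to 0, i.e. E(U) = 0
    print(U.var(), n/lam)              # both close to I(lambda) = n/lambda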
Asymptotic distribution
• The following is a key result:
θ̂ ≈ N(θ, 1/I(θ))   as n → ∞
• It requires some conditions to hold. The main one is that the parameter should not define a boundary of the sample space (e.g. as in the boundary problem examples we’ve looked at).
• Let’s see a proof…
Asymptotic distribution (derivation)
• Assumptions:
◦ X1, …, Xn is a random sample from f(x, θ)
◦ Continuous pdf, f(x, θ)
◦ θ is not a boundary parameter
• Suppose the MLE satisfies:
U(θ̂) = ∂ln L(θ̂)/∂θ = 0
Note: this requires that θ is not a boundary parameter.
• Taylor series approximation for U(θ̂) about θ:
0 = U(θ̂) = ∂ln L(θ̂)/∂θ ≈ ∂ln L(θ)/∂θ + (θ̂ − θ) ∂²ln L(θ)/∂θ²
= U(θ) − (θ̂ − θ) V(θ)
• We can write this as:
V(θ)(θ̂ − θ) ≈ U(θ)
• Remember that we have a random sample (iid rvs), so we have,
U(θ) = ∂ln L(θ)/∂θ = ∑_{i=1}^n ∂ln f(Xi, θ)/∂θ
• Since the Xi are iid, so are:
Ui = ∂ln f(Xi, θ)/∂θ,   i = 1, …, n.
• And the same for:
Vi = −∂²ln f(Xi, θ)/∂θ²,   i = 1, …, n.
• Determine E(Ui) by writing it as an integral and exchanging the order of integration and differentiation,
E(Ui) = ∫_{−∞}^{∞} [∂ln f(x, θ)/∂θ] f(x, θ) dx
= ∫_{−∞}^{∞} [∂f(x, θ)/∂θ] / f(x, θ) × f(x, θ) dx
= ∫_{−∞}^{∞} ∂f(x, θ)/∂θ dx
= ∂/∂θ ∫_{−∞}^{∞} f(x, θ) dx = ∂/∂θ (1) = 0
• To get the variance of Ui, we start with one of the above results,
∫_{−∞}^{∞} [∂ln f(x, θ)/∂θ] f(x, θ) dx = 0
• Taking another derivative of both sides (using the product rule) gives,
∫_{−∞}^{∞} { [∂²ln f(x, θ)/∂θ²] f(x, θ) + [∂ln f(x, θ)/∂θ] [∂f(x, θ)/∂θ] } dx = 0
• Combining this with
∂f(x, θ)/∂θ = [∂ln f(x, θ)/∂θ] f(x, θ)
gives,
∫_{−∞}^{∞} [∂ln f(x, θ)/∂θ]² f(x, θ) dx = −∫_{−∞}^{∞} [∂²ln f(x, θ)/∂θ²] f(x, θ) dx
• In other words,
E(Ui²) = E(Vi)
• Since E(Ui) = 0 we also have E(Ui²) = var(Ui), so we can conclude,
var(Ui) = E(Vi)
• Thus U = ∑i Ui is the sum of iid rvs with mean 0, and so
var(U) = n E(Vi)
• Also, since V = ∑i Vi, we can conclude that,
E(V) = n E(Vi)
• Note that this is just the Fisher information, i.e.
E(V) = var(U) = I(θ)
• Looking back at,
V(θ)(θ̂ − θ) ≈ U(θ)
we want to know what happens to U and V as the sample size gets large.
• U has mean 0 and variance I(θ).
• Central Limit Theorem ⇒ U ≈ N(0, I(θ)).
• V has mean I(θ).
• Law of Large Numbers ⇒ V → I(θ)
• Putting these together gives, as n → ∞,
I(θ)(θ̂ − θ) ≈ N(0, I(θ))
• Equivalently,
θ̂ ≈ N(θ, 1/I(θ))
• This is a very powerful result. For large (or even modest) samples we do not need to find the exact distribution of the MLE but can use this approximation.
• In other words, as a standard error of the MLE we can use:
se(θ̂) = 1/√I(θ̂)
if we know I(θ), or otherwise replace it with its observed version,
se(θ̂) = 1/√V(θ̂)
• Furthermore, we use the normal distribution to construct
approximate confidence intervals.
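• Returning to the motivating example (a simulation sketch of my own, not from the lecture): even though the exact sampling distribution of the non-zero binomial MLE is awkward, its simulated standard deviation is well approximated by the standard error 1/√V(θ̂), with V evaluated numerically:

    import numpy as np

    rng = np.random.default_rng(2)
    theta, n, reps, h = 0.2, 200, 2000, 1e-5

    def loglik(th, x1, x2, x3):
        nn, t = x1 + x2 + x3, x1 + 2*x2 + 3*x3
        return (t*np.log(th) + (2*x1 + x2)*np.log(1 - th)
                - nn*np.log(1 - (1 - th)**3))

    # pmf of Y given Y > 0, for y = 1, 2, 3
    p = np.array([3*theta*(1 - theta)**2, 3*theta**2*(1 - theta), theta**3])
    p = p / p.sum()

    mles, ses = [], []
    for _ in range(reps):
        y = rng.choice([1, 2, 3], size=n, p=p)
        x1, x2, x3 = (y == 1).sum(), (y == 2).sum(), (y == 3).sum()
        t = x1 + 2*x2 + 3*x3
        mle = (3*t - np.sqrt(12*t*n - 3*t**2)) / (2*t)
        V = -(loglik(mle + h, x1, x2, x3) - 2*loglik(mle, x1, x2, x3)
              + loglik(mle - h, x1, x2, x3)) / h**2   # observed information
        mles.append(mle)
        ses.append(1/np.sqrt(V))

    print(np.std(mles), np.mean(ses))  # the two should be close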
Example (exponential distribution)
• X1, …, Xn random sample from
f(x | θ) = (1/θ) e^{−x/θ},   0 < x < ∞
• Here ln f(x | θ) = −ln θ − x/θ, so
Vi = −∂²ln f(Xi | θ)/∂θ² = −1/θ² + 2Xi/θ³
• Hence E(Vi) = −1/θ² + 2θ/θ³ = 1/θ², so I(θ) = n/θ²
• The MLE is θ̂ = X̄, and so θ̂ ≈ N(θ, θ²/n)
Example (Poisson distribution)
• X1, …, Xn random sample from a Poisson distribution with mean λ.
• ln f(x | λ) = x ln λ − λ − ln(x!), so
∂ln f(x | λ)/∂λ = x/λ − 1,   and   ∂²ln f(x | λ)/∂λ² = −x/λ²
• Hence E(Vi) = −E(−Xi/λ²) = λ/λ² = 1/λ, so I(λ) = n/λ
• Then λ̂ = X̄ ≈ N(λ, λ/n)
• Suppose we observe n = 40 and x̄ = 2.225. An approximate 90% confidence interval for λ is
2.225 ± 1.645 × √(2.225/40) = (1.837, 2.613)
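• Verifying this interval (a minimal sketch, not lecture code):

    from scipy.stats import norm

    n, xbar, alpha = 40, 2.225, 0.10
    z = norm.ppf(1 - alpha/2)                # about 1.645
    half = z * (xbar / n) ** 0.5             # z * sqrt(lambda_hat / n)
    print(xbar - half, xbar + half)          # approximately (1.84, 2.61)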
Cramér–Rao lower bound
• How good can our estimator get?
• Suppose we know that it is unbiased.
• What is the minimum variance we can achieve?
• Under similar assumptions to before (esp. the parameter must not define a boundary), we can find a lower bound on the variance
• This is known as the Cramér–Rao lower bound
• It is equal to the asymptotic variance of the MLE.
• In other words, if we take any unbiased estimator T, then
var(T) ≥ 1/I(θ)
Cramér–Rao lower bound (proof)
• Let T be an unbiased estimator of θ
• Consider its covariance with the score function,
cov(T, U) = E(TU) − E(T) E(U) = E(TU)
= ∫ T (∂ln L/∂θ) L dx = ∫ T (∂L/∂θ) dx
= ∂/∂θ ∫ T L dx = ∂/∂θ E(T) = ∂θ/∂θ = 1
• Using the fact that cor(T, U)² ≤ 1, we have
cov(T, U)² ≤ var(T) var(U)
and since cov(T, U) = 1 and var(U) = I(θ), this gives
var(T) ≥ 1/var(U) = 1/I(θ)
Implications of the Cram ́er–Rao lower bound
• If an unbiased estimator attains this bound, then it is best in the sense that it has minimum variance compared with other unbiased estimators.
• Therefore, MLEs are approximately (or exactly) optimal for large sample size because:
◦ They are asymptotically unbiased
◦ Their variance meets the Cramér–Rao lower bound asymptotically
Efficiency
• We can compare any unbiased estimator against the lower bound
• We define the efficiency of the unbiased estimator T as the lower bound relative to its variance,
eff(T) = (1/I(θ)) / var(T) = 1 / (I(θ) var(T))
• Note that 0 ≤ eff(T) ≤ 1
• If eff(T) ≈ 1 we say that T is an efficient estimator
Example (exponential distribution)
• Sampling from an exponential distribution
• We saw that I(θ) = n/θ2
• Therefore, the Cramér–Rao lower bound is θ²/n.
• Any unbiased estimator must have variance at least as large as this.
• The MLE in this case is the sample mean, θ̂ = X̄
• Therefore, var(θ̂) = var(X1)/n = θ²/n
• So the MLE is efficient (for all sample sizes!)
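• A simulation sketch of this (my own illustration): the empirical variance of X̄ matches the Cramér–Rao lower bound θ²/n even for a small sample size:

    import numpy as np

    rng = np.random.default_rng(3)
    theta, n, reps = 2.0, 10, 200_000
    xbar = rng.exponential(scale=theta, size=(reps, n)).mean(axis=1)

    print(xbar.var())                  # empirical variance of the MLE
    print(theta**2 / n)                # Cramer-Rao lower bound = 0.4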
Likelihood theory
Asymptotic distribution of the MLE
Cramér–Rao lower bound
Sufficient statistics
Factorisation theorem
Optimal tests
Sufficiency: a starting example
• We toss a coin 10 times
• Want to estimate the probability of heads, θ
• Xi ∼ Be(θ)
• Suppose we use θ̂ = (X1 + X2)/2
• Only uses the first 2 coin tosses
• Clearly, we have not used all of the available information!
Motivation
• Point estimation reduces the whole sample to a few statistics.
• Different methods of estimation can yield different statistics.
• Is there a preferred reduction?
• Toss a coin with probability of heads θ 10 times.
Observe T H T H T H H T T T.
• Intuitively, knowing we have 4 heads in 10 tosses is all we need.
• But are we missing something? Does the length of the longest run give extra information?
Definition
• Intuition: want to find a statistic so that any other statistic provides no additional information about the value of the parameter
• Definition: the statistic T = g(X1, . . . , Xn) is sufficient for an underlying parameter θ if the conditional probability distribution of the data (X1, . . . , Xn), given the statistic g(X1, . . . , Xn), does not depend on the parameter θ.
• Sometimes need more than one statistic, e.g. T1 and T2, in which case we say they are jointly sufficient for θ
Example (binomial)
• The pmf is, f(x | p) = p^x (1−p)^{1−x},   x = 0, 1
• The likelihood is,
∏_{i=1}^n f(xi | p) = p^{∑xi} (1−p)^{n−∑xi}
• Let Y = ∑Xi, so that Y ∼ Bi(n, p), and then,
Pr(X1 = x1, …, Xn = xn | Y = y) = Pr(X1 = x1, …, Xn = xn) / Pr(Y = y)
= [p^{x1}(1−p)^{1−x1} ⋯ p^{xn}(1−p)^{1−xn}] / [C(n, y) p^y (1−p)^{n−y}]
= 1 / C(n, y)
• Given Y = y, the conditional distribution of X1, …, Xn does not depend on p.
• Therefore, Y is sufficient for p.
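• A simulation sketch (my own, not from the lecture): conditioning on Y = y, every arrangement of the 0s and 1s is equally likely, whatever the value of p, so the individual observations carry no extra information about p:

    import numpy as np

    rng = np.random.default_rng(4)
    n, y = 5, 2

    for p in (0.2, 0.7):
        x = rng.binomial(1, p, size=(500_000, n))
        keep = x[x.sum(axis=1) == y]                  # condition on Y = y
        patterns, counts = np.unique(keep, axis=0, return_counts=True)
        print(p, np.round(counts / counts.sum(), 3))  # all near 1/C(5, 2) = 0.1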
Factorisation theorem
• Let X1, …, Xn have joint pdf or pmf f(x1, …, xn | θ)
• Y = g(X1, …, Xn) is sufficient for θ if and only if
f(x1, …, xn | θ) = φ{g(x1, …, xn) | θ} h(x1, …, xn)
where φ depends on x1, …, xn only through g(x1, …, xn) and h doesn’t depend on θ.
Example (binomial)
• The pmf is, f(x | p) = p^x (1−p)^{1−x},   x = 0, 1
• The likelihood is,
∏_{i=1}^n f(xi | p) = p^{∑xi} (1−p)^{n−∑xi}
• So y = ∑xi is sufficient for p, since we can factorise the likelihood into:
φ(y, p) = p^y (1−p)^{n−y}   and   h(x1, …, xn) = 1
• So in the coin tossing example, the total number of heads is sufficient for θ.
Example (Poisson)
• X1, . . . , Xn random sample from a Poisson distribution with mean λ.
• The likelihood is,
∏_{i=1}^n f(xi | λ) = λ^{∑xi} e^{−nλ} / (x1! ⋯ xn!) = (λ^{n x̄} e^{−nλ}) × 1/(x1! ⋯ xn!)
• We see that X ̄ is sufficient for λ.
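• A quick numerical check (my own sketch, with hypothetical data): two Poisson samples with the same total have likelihoods that differ only by the factor h(x) = 1/(x1! ⋯ xn!), so their ratio does not involve λ:

    import numpy as np
    from scipy.stats import poisson

    x_a = np.array([1, 4, 2, 3])       # hypothetical data, total = 10
    x_b = np.array([0, 5, 5, 0])       # different data, same total = 10

    for lam in (1.0, 2.5, 4.0):
        L_a = poisson.pmf(x_a, lam).prod()
        L_b = poisson.pmf(x_b, lam).prod()
        print(lam, L_a / L_b)          # the same ratio for every lambda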
Exponential family of distributions
• We often use distributions which have pdfs of the form:
f(x | θ) = exp{K(x) p(θ) + S(x) + q(θ)}
• This is called the exponential family.
• Let X1, …, Xn be iid from an exponential family. Then ∑_{i=1}^n K(Xi) is sufficient for θ.
• To prove this, note that the joint pdf is
exp{ p(θ) ∑ K(xi) + ∑ S(xi) + n q(θ) }
= exp{ p(θ) ∑ K(xi) + n q(θ) } × exp{ ∑ S(xi) }
• The factorisation theorem then shows sufficiency.
Example (exponential)
• The pdf is,
f(x | θ) = (1/θ) e^{−x/θ} = exp{ x (−1/θ) − ln θ },   0 < x < ∞
• This has the exponential family form with K(x) = x, p(θ) = −1/θ, S(x) = 0 and q(θ) = −ln θ, so ∑ Xi (equivalently X̄) is sufficient for θ.
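• A small numerical confirmation of this form (my own sketch):

    import numpy as np

    theta, x = 2.0, np.array([0.3, 1.7, 4.2])
    pdf = (1/theta) * np.exp(-x/theta)                          # (1/theta) e^{-x/theta}
    family_form = np.exp(x * (-1/theta) + 0 - np.log(theta))    # K(x)p(theta) + S(x) + q(theta)
    print(np.allclose(pdf, family_form))                        # True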