Asymptotics & optimality
(Module 11)
Statistics (MAST20005) & Elements of Statistics (MAST90058) Semester 2, 2022
1 Likelihood theory
  1.1 Asymptotic distribution of the MLE
  1.2 Cramér–Rao lower bound
2 Sufficient statistics
  2.1 Factorisation theorem
3 Optimal tests
Aims of this module
• Explain some of the theory that we skipped in previous modules
• Show why the MLE is usually a good (or best) estimator
• Explain some related important theoretical concepts
1 Likelihood theory
Previous claims (from modules 2 & 4)
The MLE is asymptotically:
• unbiased
• efficient (has the optimal variance)
• normally distributed
Can use the 2nd derivative of the log-likelihood (the ‘observed information function’) to get a standard error for the MLE.
Motivating example (non-zero binomial)
• Consider a factory producing items in batches. Let θ denote the proportion of defective items. From each batch 3 items are sampled at random and the number of defectives is determined. However, records are only kept if there is at least one defective.
• Let Y be the number of defectives in a batch.
• Then Y ∼ Bi(3, θ), with pmf
  Pr(Y = y) = (3 choose y) θ^y (1 − θ)^(3−y),  y = 0, 1, 2, 3
• But we only take an observation if Y > 0, so the pmf is
  Pr(Y = y | Y > 0) = (3 choose y) θ^y (1 − θ)^(3−y) / (1 − (1 − θ)³),  y = 1, 2, 3
• Let Xi be the number of times we observe i defectives and let n = X1 + X2 + X3 be the total number of observations.
• The likelihood is,
  L(θ) = [3θ(1 − θ)² / (1 − (1 − θ)³)]^x1 × [3θ²(1 − θ) / (1 − (1 − θ)³)]^x2 × [θ³ / (1 − (1 − θ)³)]^x3
• This simplifies to,
  L(θ) ∝ θ^(x1 + 2x2 + 3x3) (1 − θ)^(2x1 + x2) / (1 − (1 − θ)³)^n
• After taking logarithms and derivatives, the MLE is found to be the smaller root of
  tθ² − 3tθ + 3(t − n) = 0,
  where t = x1 + 2x2 + 3x3.
• This gives:
  θ̂ = [3t − √(12tn − 3t²)] / (2t)
• We now have the MLE…
• …but finding its sampling distribution is not straightforward!
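As a quick sanity check on the closed form above, here is a short Python sketch (the counts x1, x2, x3 are made up purely for illustration) that computes θ̂ from the quadratic and compares it with a direct numerical maximisation of the log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical counts of batches observed with 1, 2 and 3 defectives (illustration only)
x1, x2, x3 = 20, 6, 1
n = x1 + x2 + x3
t = x1 + 2 * x2 + 3 * x3

# Closed-form MLE: the smaller root of t*theta^2 - 3*t*theta + 3*(t - n) = 0
theta_hat = (3 * t - np.sqrt(12 * t * n - 3 * t**2)) / (2 * t)

# Log-likelihood of the truncated (non-zero) binomial model, up to a constant
def negloglik(theta):
    return -(t * np.log(theta) + (2 * x1 + x2) * np.log(1 - theta)
             - n * np.log(1 - (1 - theta)**3))

res = minimize_scalar(negloglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(theta_hat, res.x)  # the two values should agree closely
```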
• In general, finding the exact distribution of a statistic is often difficult.
• We’ve used the Central Limit Theorem to approximate the distribution of the sample mean.
• Gave us approximate CIs for a population mean μ of the form,
  x̄ ± Φ^(−1)(1 − α/2) × s/√n
• Similar results hold more generally for MLEs (and other estimators)
1.1 Asymptotic distribution of the MLE
Definitions
• Start with the log-likelihood:
  l(θ) = ln L(θ)
• Taking the first derivative gives the score function (also known simply as the score). Let’s call it U,
U(θ) = ∂l/∂θ
• Note: we solve U(θˆ) = 0 to get the MLE
• Taking the second derivative, and then its negative, gives the observed information function (also known simply as the observed information). Let's call it V,
  V(θ) = −∂U/∂θ = −∂²l/∂θ²
• This represents the curvature of the log-likelihood. Greater curvature ⇒ narrower likelihood around a certain value ⇒ the likelihood is more informative.
Fisher information
• All of the above are functions of the data (and parameters). Therefore they are random variables and have sampling distributions.
• For example, we can show that E(U(θ)) = 0.
• An important quantity is I(θ) = E(V(θ)), which is the Fisher information function (or just the Fisher information). It is also known as the expected information function (or simply as the expected information).
• Many results are based on the Fisher information.
• For example, we can show that var(U(θ)) = I(θ).
• More importantly, it arises in theory about the distribution of the MLE.
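These two facts are easy to check by simulation. A minimal sketch, assuming an exponential model f(x | θ) = (1/θ)e^(−x/θ) (so each observation's score contribution is x/θ² − 1/θ) and arbitrary illustrative values of θ and n:

```python
import numpy as np

# Monte Carlo check that E(U(theta)) = 0 and var(U(theta)) = I(theta)
rng = np.random.default_rng(42)
theta, n, reps = 2.0, 30, 200_000   # illustrative values only

x = rng.exponential(scale=theta, size=(reps, n))

# Score evaluated at the true theta: U(theta) = sum_i (x_i/theta^2 - 1/theta)
U = (x / theta**2 - 1 / theta).sum(axis=1)

print(U.mean())                # close to 0
print(U.var(), n / theta**2)   # close to the Fisher information I(theta) = n/theta^2
```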
Asymptotic distribution
• The following is a key result:
θ̂ ≈ N(θ, 1/I(θ))  as n → ∞
• It requires some conditions for it to hold. The main one is that the parameter should not define a boundary of the sample space (e.g. as in the boundary problem examples we've looked at).
• Let's see a proof…
Asymptotic distribution (derivation)
• Assumptions:
– X1, …, Xn is a random sample from f(x, θ)
– Continuous pdf, f(x, θ)
– θ is not a boundary parameter
• Suppose the MLE satisfies:
U(θ̂) = ∂ln L(θ̂)/∂θ = 0
Note: this requires that θ is not a boundary parameter.
• Taylor series approximation for U(θˆ) about θ:
0 = U(θ̂) = ∂ln L(θ̂)/∂θ ≈ ∂ln L(θ)/∂θ + (θ̂ − θ) ∂²ln L(θ)/∂θ²
  = U(θ) − (θ̂ − θ) V(θ)
• We can write this as:
  V(θ)(θ̂ − θ) ≈ U(θ)
• Remember that we have a random sample (iid rvs), so we have,
  U(θ) = ∂ln L(θ)/∂θ = Σ_{i=1}^n ∂ln f(Xi, θ)/∂θ
• Since the Xi are iid so are:
  Ui = ∂ln f(Xi, θ)/∂θ,  i = 1, …, n
• And the same for:
  Vi = −∂²ln f(Xi, θ)/∂θ²,  i = 1, …, n
• Determine E(Ui) by writing it as an integral, substituting ∂ln f/∂θ = (∂f/∂θ)/f, and exchanging the order of integration and differentiation:
  E(Ui) = ∫ [∂ln f(x, θ)/∂θ] f(x, θ) dx
        = ∫ [∂f(x, θ)/∂θ] / f(x, θ) × f(x, θ) dx
        = ∫ ∂f(x, θ)/∂θ dx
        = ∂/∂θ ∫ f(x, θ) dx = ∂/∂θ 1 = 0
• To get the variance of Ui, we start with one of the above results,
  ∂f(x, θ)/∂θ = [∂ln f(x, θ)/∂θ] f(x, θ),
  which gives
  ∫ [∂ln f(x, θ)/∂θ] f(x, θ) dx = 0
• Taking another derivative of both sides (exchanging integration and differentiation again) gives,
  ∫ [∂²ln f(x, θ)/∂θ²] f(x, θ) dx + ∫ [∂ln f(x, θ)/∂θ] [∂f(x, θ)/∂θ] dx = 0
• Combining the previous two equations gives,
  ∫ [∂ln f(x, θ)/∂θ]² f(x, θ) dx = −∫ [∂²ln f(x, θ)/∂θ²] f(x, θ) dx
• In other words,
  E(Ui²) = E(Vi)
• Since E(Ui) = 0 we also have E(Ui²) = var(Ui), so we can conclude,
  var(Ui) = E(Vi)
• Thus U = Σ_i Ui is the sum of iid rvs with mean 0 and this variance, so
  var(U) = n E(Vi)
• Also, since V = Σ_i Vi, we can conclude that,
  E(V) = n E(Vi)
• Note that this is just the Fisher information, i.e.
  E(V) = var(U) = I(θ)
• Looking back at,
  V(θ)(θ̂ − θ) ≈ U(θ)
  we want to know what happens to U and V as the sample size gets large.
• U has mean 0 and variance I(θ).
• Central Limit Theorem ⇒ U ≈ N(0, I(θ)).
• V has mean I(θ).
• Law of Large Numbers ⇒ V → I(θ)
• Putting these together gives, as n → ∞,
  I(θ)(θ̂ − θ) ≈ N(0, I(θ))
• Equivalently,
  θ̂ ≈ N(θ, 1/I(θ))
• This is a very powerful result. For large (or even modest) samples we do not need to find the exact distribution of the MLE but can use this approximation.
• In other words, as a standard error of the MLE we can use:
  se(θ̂) = 1/√I(θ̂)
  if we know I(θ), or otherwise replace it with its observed version,
  se(θ̂) = 1/√V(θ̂)
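As a sketch of how this works in practice, we can return to the non-zero binomial example: approximate V(θ̂) by a finite-difference second derivative of the log-likelihood and use 1/√V(θ̂) as the standard error (the counts are again hypothetical, the same illustrative values as in the earlier sketch):

```python
import numpy as np

# Hypothetical counts from the non-zero binomial example (illustration only)
x1, x2, x3 = 20, 6, 1
n = x1 + x2 + x3
t = x1 + 2 * x2 + 3 * x3

def loglik(theta):
    return (t * np.log(theta) + (2 * x1 + x2) * np.log(1 - theta)
            - n * np.log(1 - (1 - theta)**3))

theta_hat = (3 * t - np.sqrt(12 * t * n - 3 * t**2)) / (2 * t)

# Observed information V(theta_hat) = -l''(theta_hat), via central differences
h = 1e-4
V = -(loglik(theta_hat + h) - 2 * loglik(theta_hat) + loglik(theta_hat - h)) / h**2

se = 1 / np.sqrt(V)
print(theta_hat, se)
print(theta_hat - 1.96 * se, theta_hat + 1.96 * se)  # approximate 95% CI
```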
• Furthermore, we use the normal distribution to construct approximate confidence intervals.
Example (exponential distribution)
• X1, …, Xn random sample from
  f(x | θ) = (1/θ) e^(−x/θ),  0 < x < ∞
• MLE is θ̂ = X̄.
• ln f(x | θ) = −ln θ − x/θ, so
  ∂ln f(x | θ)/∂θ = −1/θ + x/θ²  and  ∂²ln f(x | θ)/∂θ² = 1/θ² − 2x/θ³
• Since E(X) = θ,
  I(θ) = −n E(∂²ln f(X | θ)/∂θ²) = −n(1/θ² − 2θ/θ³) = n/θ²
• Then θ̂ ≈ N(θ, θ²/n)
• Suppose we observe n = 20 and x̄ = 3.7. An approximate 95% CI is,
  3.7 ± 1.96 × 3.7/√20 = (2.08, 5.32)
Example (Poisson distribution)
• X1, …, Xn random sample from a Poisson distribution with mean λ. The MLE is λ̂ = X̄.
• ln f(x | λ) = x ln λ − λ − ln x!, so
  ∂ln f(x | λ)/∂λ = x/λ − 1  and  ∂²ln f(x | λ)/∂λ² = −x/λ²
• Hence
  I(λ) = −n E(−X/λ²) = n λ/λ² = n/λ
• Then λ̂ ≈ N(λ, λ/n)
• Suppose we observe n = 40 and x̄ = 2.225. An approximate 90% CI is,
  2.225 ± 1.645 × √(2.225/40) = (1.837, 2.612)
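A short sketch of how the two intervals above can be computed, using only the summary values stated on the slide:

```python
from math import sqrt
from scipy.stats import norm

# Exponential example: I(theta) = n/theta^2, so se = xbar/sqrt(n)
n, xbar = 20, 3.7
se = xbar / sqrt(n)
z = norm.ppf(0.975)
print(xbar - z * se, xbar + z * se)   # approximate 95% CI, about (2.08, 5.32)

# Poisson example: I(lambda) = n/lambda, so se = sqrt(xbar/n)
n, xbar = 40, 2.225
se = sqrt(xbar / n)
z = norm.ppf(0.95)
print(xbar - z * se, xbar + z * se)   # approximate 90% CI, about (1.84, 2.61)
```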
1.2 Cramér–Rao lower bound
Cramér–Rao lower bound
• How good can our estimator get?
• Suppose we know that it is unbiased.
• What is the minimum variance we can achieve?
• Under similar assumptions to before (esp. the parameter must not define a boundary), we can find a lower bound on the variance
• This is known as the Cramér–Rao lower bound
• It is equal to the asymptotic variance of the MLE.
• In other words, if we take any unbiased estimator T, then
  var(T) ≥ 1/I(θ)
Cramér–Rao lower bound (proof)
• Let T be an unbiased estimator of θ
• Consider its covariance with the score function,
  cov(T, U) = E(TU) − E(T) E(U) = E(TU)
            = ∫ T (∂ln L/∂θ) L dx = ∫ T (∂L/∂θ) dx
            = ∂/∂θ ∫ T L dx = ∂/∂θ E(T) = ∂/∂θ θ = 1
• Using the fact that cor(T, U)² ≤ 1,
  cov(T, U)² ≤ var(T) var(U)  ⇒  1 ≤ var(T) I(θ)  ⇒  var(T) ≥ 1/I(θ)
Implications of the Cramér–Rao lower bound
• If an unbiased estimator attains this bound, then it is best in the sense that it has minimum variance compared with other unbiased estimators.
• Therefore, MLEs are approximately (or exactly) optimal for large sample size because:
  – They are asymptotically unbiased
  – Their variance meets the Cramér–Rao lower bound asymptotically
Efficiency
• We can compare any unbiased estimator against the lower bound
• We define the efficiency of the unbiased estimator T as its variance relative to the lower bound,
eff(T) = [1/I(θ)] / var(T) = 1 / [I(θ) var(T)]
• Note that 0 ≤ eff(T) ≤ 1
• If eff(T ) ≈ 1 we say that T is an efficient estimator
Example (exponential distribution)
• Sampling from an exponential distribution
• We saw that I(θ) = n/θ²
• Therefore, the Cramér–Rao lower bound is θ²/n.
• Any unbiased estimator must have variance at least as large as this.
• The MLE in this case is the sample mean, θ̂ = X̄
• Therefore, var(θ̂) = var(X)/n = θ²/n
• So the MLE is efficient (for all sample sizes!)
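To see what an inefficient (but still unbiased) estimator looks like, here is a small simulation sketch. It compares X̄ with n × X_(1) (n times the sample minimum), which is also unbiased for θ but has variance θ², so its efficiency is only 1/n; the particular θ and n are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n, reps = 2.0, 25, 100_000   # illustrative values only

x = rng.exponential(scale=theta, size=(reps, n))
mle = x.mean(axis=1)        # the MLE; its variance should sit at the CRLB theta^2/n
alt = n * x.min(axis=1)     # also unbiased for theta, but with variance theta^2

crlb = theta**2 / n
print(mle.mean(), alt.mean())   # both close to theta = 2
print(mle.var(), crlb)          # MLE attains the bound
print(crlb / alt.var())         # efficiency of the alternative: about 1/n = 0.04
```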
2 Sufficient statistics
Sufficiency: a starting example
• We toss a coin 10 times
• Want to estimate the probability of heads, θ
• Xi ∼ Be(θ)
• Suppose we use θ̂ = (X1 + X2)/2
• Only uses the first 2 coin tosses
• Clearly, we have not used all of the available information!
Motivation
• Point estimation reduces the whole sample to a few statistics.
• Different methods of estimation can yield different statistics.
• Is there a preferred reduction?
• Toss a coin with probability of heads θ 10 times. Observe T H T H T H H T T T.
• Intuitively, knowing we have 4 heads in 10 tosses is all we need.
• But are we missing something? Does the length of the longest run give extra information?
Definition
• Intuition: want to find a statistic so that any other statistic provides no additional information about the value of the parameter
• Definition: the statistic T = g(X1, . . . , Xn) is sufficient for an underlying parameter θ if the conditional proba- bility distribution of the data (X1, . . . , Xn), given the statistic g(X1, . . . , Xn), does not depend on the parameter θ.
• Sometimes need more than one statistic, e.g. T1 and T2, in which case we say they are jointly sufficient for θ
Example (binomial)
• The pmf is f(x | p) = p^x (1 − p)^(1−x), x = 0, 1
• The likelihood is,
  ∏_{i=1}^n f(xi | p) = p^(Σ xi) (1 − p)^(n − Σ xi)
• Let Y = Σ Xi; we have that Y ∼ Bi(n, p) and then,
  Pr(X1 = x1, …, Xn = xn | Y = y) = Pr(X1 = x1, …, Xn = xn) / Pr(Y = y)
    = [p^x1 (1 − p)^(1−x1) ⋯ p^xn (1 − p)^(1−xn)] / [(n choose y) p^y (1 − p)^(n−y)]
    = 1 / (n choose y)
• Given Y = y, the conditional distribution of X1, …, Xn does not depend on p.
• Therefore, Y is sufficient for p.
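This can also be seen by simulation: generate Bernoulli(p) sequences, keep only those with a fixed total y, and check that the conditional distribution of the kept sequences looks the same whatever p is. A minimal sketch (with arbitrary n, y and p values):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
n, y, reps = 5, 2, 200_000   # illustrative values only

def conditional_frequencies(p):
    x = rng.binomial(1, p, size=(reps, n))
    keep = x[x.sum(axis=1) == y]                 # condition on Y = y
    counts = Counter(map(tuple, keep))
    total = sum(counts.values())
    return sorted(round(c / total, 3) for c in counts.values())

# Each of the C(5,2) = 10 sequences with two 1s should appear with frequency
# about 1/10, regardless of p: the conditional distribution is free of p.
for p in (0.2, 0.7):
    print(p, conditional_frequencies(p))
```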
2.1 Factorisation theorem
Factorisation theorem
• Let X1, …, Xn have joint pdf or pmf f(x1, …, xn | θ)
• Y = g(X1, …, Xn) is sufficient for θ if and only if
  f(x1, …, xn | θ) = φ{g(x1, …, xn) | θ} h(x1, …, xn)
• φ depends on x1, …, xn only through g(x1, …, xn) and h doesn't depend on θ.
Example (binomial)
• The pmf is f(x | p) = p^x (1 − p)^(1−x), x = 0, 1
• The likelihood is,
  ∏_{i=1}^n f(xi | p) = p^(Σ xi) (1 − p)^(n − Σ xi) = p^y (1 − p)^(n−y),  where y = Σ xi
• So y = Σ xi is sufficient for p, since we can factorise the likelihood into:
  φ(y, p) = p^y (1 − p)^(n−y)  and  h(x1, …, xn) = 1
• So in the coin tossing example, the total number of heads is sufficient for θ.
Example (Poisson)
• X1, …, Xn random sample from a Poisson distribution with mean λ.
• The likelihood is,
  ∏_{i=1}^n f(xi | λ) = λ^(Σ xi) e^(−nλ) / (x1! ⋯ xn!) = (λ^(n x̄) e^(−nλ)) × 1/(x1! ⋯ xn!)
• We see that X ̄ is sufficient for λ.
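The factorisation can be checked numerically: two samples with the same sum (hence the same x̄) have likelihoods that differ only by the h(x) = 1/(x1! ⋯ xn!) factor, so their ratio does not change with λ. A small sketch with made-up data:

```python
import numpy as np
from scipy.stats import poisson

# Two hypothetical samples with the same sum (and hence the same sufficient statistic)
a = np.array([1, 4, 0, 3, 2])
b = np.array([2, 2, 2, 2, 2])
assert a.sum() == b.sum()

lams = np.linspace(0.5, 6, 12)
L_a = np.array([poisson.pmf(a, lam).prod() for lam in lams])
L_b = np.array([poisson.pmf(b, lam).prod() for lam in lams])

# The ratio is constant in lambda: it only reflects the h(x) factors
print(L_a / L_b)
```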
Exponential family of distributions
• We often use distributions which have pdfs of the form:
f(x | θ) = exp{K(x)p(θ) + S(x) + q(θ)}
• This is called the exponential family.
• Let X1, …, Xn be iid from an exponential family. Then Σ_{i=1}^n K(Xi) is sufficient for θ.
• To prove this note that the joint pdf is
  exp{p(θ) Σ K(xi) + Σ S(xi) + n q(θ)} = exp{p(θ) Σ K(xi) + n q(θ)} × exp{Σ S(xi)}
• The factorisation theorem then shows sufficiency.
Example (exponential)
• The pdf is,
  f(x | θ) = (1/θ) e^(−x/θ),  0 < x < ∞
• This is of the form
  f(x | θ) = exp{x(−1/θ) − ln θ}
• So K(x) = x and Σ Xi is sufficient for θ (and so is X̄ = Σ Xi / n).
Sufficiency and MLEs
• If there exist sufficient statistics, the MLE will be a function of them.
• Factorise the likelihood:
  L(θ) = f(x1, …, xn | θ) = φ{g(x1, …, xn) | θ} h(x1, …, xn)
• We find the MLE by maximizing φ{g(x1, . . . , xn) | θ} which is a function of the sufficient statistics and θ
• So the MLE must be a function of the sufficient statistics
Importance of sufficiency
• Why are sufficient statistics important?
• Once the sufficient statistics are known there is no additional information on the parameter in the sample
• Samples that have the same values of the sufficient statistic yield the same estimates
• The optimal estimators/tests are based on sufficient statistics (such as the MLE)
• A lot of statistical theory is based on them
• Easy to find the sufficient statistics in some special cases (e.g. exponential family)
Disclaimer
But… the concept of sufficiency relies on knowing the population distribution
So, it is mostly important for theoretical work.
In practice, we want to also look at all aspects of our data
That is, we should go beyond any putative sufficient statistics, as a sanity check of our assumptions (e.g. QQ plots).
3 Optimal tests