
Asymptotics & optimality
(Module 11)
Statistics (MAST20005) & Elements of Statistics (MAST90058)
School of Mathematics and Statistics, University of Melbourne


Semester 2, 2022

Aims of this module
• Explain some of the theory that we skipped in previous modules
• Show why the MLE is usually a good (or best) estimator
• Explain some related important theoretical concepts

Likelihood theory
Asymptotic distribution of the MLE
Cramér–Rao lower bound
Sufficient statistics
Factorisation theorem
Optimal tests

Previous claims (from modules 2 & 4)
The MLE is asymptotically:
• unbiased
• efficient (has the optimal variance)
• normally distributed
Can use the 2nd derivative of the log-likelihood (the ‘observed information function’) to get a standard error for the MLE.

Motivating example (non-zero binomial)
• Consider a factory producing items in batches. Let θ denote the proportion of defective items. From each batch 3 items are sampled at random and the number of defectives is determined. However, records are only kept if there is at least one defective.
• Let Y be the number of defectives in a batch.
• Then Y ∼ Bi(3, θ), with pmf
  Pr(Y = y) = C(3, y) θ^y (1 − θ)^(3−y),  y = 0, 1, 2, 3
• But we only take an observation if Y > 0, so the pmf is
  Pr(Y = y | Y > 0) = C(3, y) θ^y (1 − θ)^(3−y) / (1 − (1 − θ)³),  y = 1, 2, 3

• Let Xi be the number of times we observe i defectives and let n = X1 + X2 + X3 be the total number of observations.
• The likelihood is
  L(θ) = [3θ(1 − θ)² / (1 − (1 − θ)³)]^x1 × [3θ²(1 − θ) / (1 − (1 − θ)³)]^x2 × [θ³ / (1 − (1 − θ)³)]^x3
• This simplifies to
  L(θ) ∝ θ^(x1 + 2x2 + 3x3) (1 − θ)^(2x1 + x2) / (1 − (1 − θ)³)^n
• After taking logarithms and derivatives, the MLE is found to be the smaller root of
  tθ² − 3tθ + 3(t − n) = 0,  where t = x1 + 2x2 + 3x3.
• This gives (the larger root exceeds 1 and so is discarded):
  θ̂ = (3t − √(12tn − 3t²)) / (2t)
  (A numerical sketch of this calculation follows.)
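• A minimal numerical sketch of this calculation (not from the original slides; the counts x1, x2, x3 below are made up): it computes θ̂ from the quadratic and checks it against direct maximisation of the log-likelihood.

    import numpy as np
    from scipy.optimize import minimize_scalar

    x1, x2, x3 = 20, 6, 1              # hypothetical counts of batches with 1, 2, 3 defectives
    n = x1 + x2 + x3
    t = x1 + 2 * x2 + 3 * x3

    # Smaller root of t*theta^2 - 3*t*theta + 3*(t - n) = 0
    theta_hat = (3 * t - np.sqrt(12 * t * n - 3 * t**2)) / (2 * t)

    # Check against direct numerical maximisation (note 2*x1 + x2 = 3*n - t)
    def negloglik(theta):
        return -(t * np.log(theta) + (3 * n - t) * np.log(1 - theta)
                 - n * np.log(1 - (1 - theta)**3))

    opt = minimize_scalar(negloglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
    print(theta_hat, opt.x)            # the two values should agree closely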

• We now have the MLE…
• . . . but finding its sampling distribution is not straightforward!
• In general, finding the exact distribution of a statistic is often difficult.
• We’ve used the Central Limit Theorem to approximate the distribution of the sample mean.
• Gave us approximate CIs for a population mean μ of the form
  x̄ ± Φ^(−1)(1 − α/2) × s/√n
• Similar results hold more generally for MLEs (and other estimators)

Definitions
• Start with the log-likelihood:
l(θ) = ln L(θ)
• Taking the first derivative gives the score function (also known simply as the score). Let’s call it U,
U(θ) = ∂l/∂θ
• Note: we solve U(θ̂) = 0 to get the MLE

• Taking the second derivative, and then its negative, gives the observed information function (also known simply as the observed information). Let’s call it V,
  V(θ) = −∂U/∂θ = −∂²l/∂θ²
• This represents the curvature of the log-likelihood. Greater curvature ⇒ narrower likelihood around a certain value ⇒ the likelihood is more informative. (A small numerical sketch follows.)
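• As an illustration (a sketch with made-up data, assuming the exponential model f(x | θ) = (1/θ) e^(−x/θ), whose MLE is θ̂ = x̄), the score and observed information can be evaluated as numerical derivatives of the log-likelihood:

    import numpy as np

    x = np.array([1.2, 0.4, 2.7, 0.9, 1.6])    # hypothetical sample
    n = len(x)

    def loglik(theta):                          # l(theta) for the assumed exponential model
        return -n * np.log(theta) - x.sum() / theta

    def score(theta, h=1e-5):                   # U(theta) = dl/dtheta, central difference
        return (loglik(theta + h) - loglik(theta - h)) / (2 * h)

    def obs_info(theta, h=1e-5):                # V(theta) = -d2l/dtheta2
        return -(loglik(theta + h) - 2 * loglik(theta) + loglik(theta - h)) / h**2

    theta_hat = x.mean()                        # the MLE for this model
    print(score(theta_hat))                     # approximately 0 at the MLE
    print(obs_info(theta_hat))                  # positive: the log-likelihood is concave here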

Fisher information
• All of the above are functions of the data (and parameters). Therefore they are random variables and have sampling distributions.
• For example, we can show that E(U(θ)) = 0.
• An important quantity is I(θ) = E(V (θ)), which is the Fisher information function (or just the Fisher information). It is also known as the expected information function (or simply as the expected information).
• Many results are based on the Fisher information.
• For example, we can show that var(U(θ)) = I(θ).
• More importantly, it arises in theory about the distribution of the MLE.

Asymptotic distribution
• The following is a key result:
  θ̂ ≈ N(θ, 1/I(θ))  as n → ∞
• It requires some conditions to hold. The main one is that the parameter should not define a boundary of the sample space (e.g. as in the boundary-problem examples we have looked at).
• Let’s see a proof…

Asymptotic distribution (derivation)
• Assumptions:
◦ X1, …, Xn is a random sample from f(x, θ)
◦ Continuous pdf, f(x, θ)
◦ θ is not a boundary parameter
• Suppose the MLE satisfies:
  U(θ̂) = ∂ln L(θ̂)/∂θ = 0
  Note: this requires that θ is not a boundary parameter.
• Taylor series approximation for U(θ̂) about θ:
  0 = U(θ̂) = ∂ln L(θ̂)/∂θ ≈ ∂ln L(θ)/∂θ + (θ̂ − θ) ∂²ln L(θ)/∂θ² = U(θ) − (θ̂ − θ) V(θ)

• We can write this as:
  V(θ)(θ̂ − θ) ≈ U(θ)
• Remember that we have a random sample (iid rvs), so we have
  U(θ) = ∂ln L(θ)/∂θ = Σᵢ ∂ln f(Xi, θ)/∂θ
• Since the Xi are iid, so are:
  Ui = ∂ln f(Xi, θ)/∂θ,  i = 1, …, n
• And the same for:
  Vi = −∂²ln f(Xi, θ)/∂θ²,  i = 1, …, n

• Determine E(Ui) by writing ∂ln f(x, θ)/∂θ = [∂f(x, θ)/∂θ] / f(x, θ) and exchanging the order of integration and differentiation:
  E(Ui) = ∫ [∂ln f(x, θ)/∂θ] f(x, θ) dx
        = ∫ [∂f(x, θ)/∂θ] / f(x, θ) × f(x, θ) dx
        = ∫ ∂f(x, θ)/∂θ dx
        = ∂/∂θ ∫ f(x, θ) dx = ∂/∂θ (1) = 0

• To get the variance of Ui, we start with one of the above results:
  ∫ [∂ln f(x, θ)/∂θ] f(x, θ) dx = 0
• Taking another derivative of both sides gives:
  ∫ { [∂²ln f(x, θ)/∂θ²] f(x, θ) + [∂ln f(x, θ)/∂θ] [∂f(x, θ)/∂θ] } dx = 0
• Combining this with
  ∂f(x, θ)/∂θ = [∂ln f(x, θ)/∂θ] f(x, θ)
  gives:
  ∫ [∂ln f(x, θ)/∂θ]² f(x, θ) dx = −∫ [∂²ln f(x, θ)/∂θ²] f(x, θ) dx
• In other words:
  E(Ui²) = E(Vi)

• Since E(Ui) = 0, we also have E(Ui²) = var(Ui), so we can conclude:
var(Ui) = E(Vi)
• Thus U = Σᵢ Ui is the sum of iid rvs with mean 0, and so
  var(U) = n var(Ui) = n E(Vi)
• Also, since V = Σᵢ Vi, we can conclude that
  E(V) = n E(Vi)
• Note that this common value is just the Fisher information, i.e.
  E(V) = var(U) = I(θ)
  (These identities are checked by simulation below.)
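• A quick Monte Carlo sketch (using a Poisson model with assumed values of λ and n, not an example from the slides) checking E(U) = 0 and var(U) = E(V) = I(θ) at the true parameter:

    import numpy as np

    rng = np.random.default_rng(1)
    lam, n, reps = 2.0, 30, 20000               # assumed true parameter and sample size

    X = rng.poisson(lam, size=(reps, n))
    U = (X / lam - 1).sum(axis=1)               # score of each simulated sample (Poisson model)
    V = (X / lam**2).sum(axis=1)                # observed information of each sample

    print(U.mean())                             # ~ 0
    print(U.var(), V.mean(), n / lam)           # all three ~ I(lambda) = n/lambda = 15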

• Looking back at
  V(θ)(θ̂ − θ) ≈ U(θ)
  we want to know what happens to U and V as the sample size gets large.
• U has mean 0 and variance I(θ).
• Central Limit Theorem ⇒ U ≈ N(0, I(θ)).
• V has mean I(θ).
• Law of Large Numbers ⇒ V ≈ I(θ) for large n.
• Putting these together gives, as n → ∞,
  I(θ)(θ̂ − θ) ≈ N(0, I(θ))

• Equivalently,
  θ̂ ≈ N(θ, 1/I(θ))
• This is a very powerful result. For large (or even modest) samples we do not need to find the exact distribution of the MLE but can use this approximation.
• In other words, as a standard error of the MLE we can use
  se(θ̂) = 1/√I(θ̂)
  if we know I(θ), or otherwise replace it with its observed version,
  se(θ̂) = 1/√V(θ̂)
• Furthermore, we use the normal distribution to construct approximate confidence intervals (a short numerical sketch follows).
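• A short sketch of this recipe (continuing the hypothetical exponential data from the earlier sketch): a standard error from the observed information, then an approximate 95% interval.

    import numpy as np

    x = np.array([1.2, 0.4, 2.7, 0.9, 1.6])     # same made-up sample as before
    n = len(x)
    theta_hat = x.mean()
    V_hat = n / theta_hat**2                    # observed information at the MLE (exponential model)
    se = 1 / np.sqrt(V_hat)                     # equals theta_hat / sqrt(n)
    print(theta_hat - 1.96 * se, theta_hat + 1.96 * se)   # approximate 95% CI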

Example (exponential distribution)
• X1, …, Xn random sample from
  f(x | θ) = (1/θ) e^(−x/θ),  0 < x < ∞,  θ > 0
• We have seen that the MLE is θ̂ = X̄.
• ln f(x | θ) = −ln θ − x/θ, so
  ∂ln f(x | θ)/∂θ = x/θ² − 1/θ,  and  ∂²ln f(x | θ)/∂θ² = 1/θ² − 2x/θ³
• Hence E(Vi) = −E(1/θ² − 2X/θ³) = 2θ/θ³ − 1/θ² = 1/θ², so I(θ) = n/θ² and θ̂ ≈ N(θ, θ²/n).

Example (Poisson distribution)
• X1, …, Xn random sample from a Poisson distribution with mean λ. We have seen λ̂ = X̄.
• ln f(x | λ) = x ln λ − λ − ln(x!), so
  ∂ln f(x | λ)/∂λ = x/λ − 1,  and  ∂²ln f(x | λ)/∂λ² = −x/λ²
• Hence E(Vi) = −E(−X/λ²) = λ/λ² = 1/λ, so I(λ) = n/λ.
• Then λ̂ ≈ N(λ, λ/n).
• Suppose we observe n = 40 and x̄ = 2.225. An approximate 90% confidence interval for λ is then (see the numerical check below)
  2.225 ± 1.645 √(2.225/40) = (1.837, 2.612)
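• A quick numerical check (sketch) of the interval quoted above:

    import numpy as np
    from scipy.stats import norm

    n, xbar = 40, 2.225
    z = norm.ppf(0.95)                          # 1.645 for a 90% interval
    se = np.sqrt(xbar / n)                      # estimated sqrt(lambda/n)
    print(xbar - z * se, xbar + z * se)         # close to (1.837, 2.612), up to rounding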

Cramér–Rao lower bound
• How good can our estimator get?
• Suppose we know that it is unbiased.
• What is the minimum variance we can achieve?
• Under similar assumptions to before (esp. the parameter must not define a boundary), we can find a lower bound on the variance
• This is known as the Cramér–Rao lower bound.
• It is equal to the asymptotic variance of the MLE.
• In other words, if we take any unbiased estimator T, then
  var(T) ≥ 1/I(θ)

Cramér–Rao lower bound (proof)
• Let T be an unbiased estimator of θ
• Consider its covariance with the score function:
  cov(T, U) = E(TU) − E(T) E(U) = E(TU)
            = ∫ T (∂ln L/∂θ) L dx = ∫ T (∂L/∂θ) dx
            = ∂/∂θ ∫ T L dx = ∂/∂θ E(T) = ∂θ/∂θ = 1
• Using the fact that cor(T, U)² ≤ 1,
  cov(T, U)² ≤ var(T) var(U)
  so 1 ≤ var(T) var(U), which gives
  var(T) ≥ 1/var(U) = 1/I(θ)

Implications of the Cramér–Rao lower bound
• If an unbiased estimator attains this bound, then it is best in the sense that it has minimum variance compared with other unbiased estimators.
• Therefore, MLEs are approximately (or exactly) optimal for large sample size because:
◦ They are asymptotically unbiased
◦ Their variance meets the Cramér–Rao lower bound asymptotically

Efficiency
• We can compare any unbiased estimator against the lower bound
• We define the efficiency of the unbiased estimator T as the lower bound relative to its variance,
  eff(T) = [1/I(θ)] / var(T) = 1 / (I(θ) var(T))
• Note that 0 ≤ eff(T) ≤ 1
• If eff(T) ≈ 1 we say that T is an efficient estimator

Example (exponential distribution)
• Sampling from an exponential distribution
• We saw that I(θ) = n/θ²
• Therefore, the Cramér–Rao lower bound is θ²/n.
• Any unbiased estimator must have variance at least as large as this.
• The MLE in this case is the sample mean, θ̂ = X̄
• Therefore, var(θ̂) = var(X)/n = θ²/n
• So the MLE is efficient (for all sample sizes!) (a simulation check follows below)
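• A small simulation sketch (with assumed θ and n) checking that var(X̄) matches the Cramér–Rao bound θ²/n:

    import numpy as np

    rng = np.random.default_rng(2)
    theta, n, reps = 3.0, 25, 50000             # assumed values

    samples = rng.exponential(scale=theta, size=(reps, n))
    theta_hat = samples.mean(axis=1)            # MLE for each simulated sample

    print(theta_hat.var(), theta**2 / n)        # both close to 0.36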

Likelihood theory
Asymptotic distribution of the MLE
Cramér–Rao lower bound
Sufficient statistics
Factorisation theorem
Optimal tests

Sufficiency: a starting example
• We toss a coin 10 times
• Want to estimate the probability of heads, θ
• Xi ∼ Be(θ)
• Suppose we use θ̂ = (X1 + X2)/2
• Only uses the first 2 coin tosses
• Clearly, we have not used all of the available information! (A quick simulation comparison follows.)
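• A tiny simulation sketch (θ = 0.3 is an assumed value) comparing the two estimators: both are unbiased, but the one using only two tosses has a much larger variance.

    import numpy as np

    rng = np.random.default_rng(3)
    theta, reps = 0.3, 100000                   # assumed probability of heads

    tosses = rng.binomial(1, theta, size=(reps, 10))
    est_two = tosses[:, :2].mean(axis=1)        # uses X1, X2 only
    est_all = tosses.mean(axis=1)               # uses all 10 tosses

    print(est_two.mean(), est_all.mean())       # both ~ 0.3 (unbiased)
    print(est_two.var(), est_all.var())         # ~ 0.105 versus ~ 0.021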

Motivation
• Point estimation reduces the whole sample to a few statistics.
• Different methods of estimation can yield different statistics.
• Is there a preferred reduction?
• Toss a coin with probability of heads θ 10 times.
Observe T H T H T H H T T T.
• Intuitively, knowing we have 4 heads in 10 tosses is all we need.
• But are we missing something? Does the length of the longest run give extra information?

Definition
• Intuition: want to find a statistic so that any other statistic provides no additional information about the value of the parameter
• Definition: the statistic T = g(X1, . . . , Xn) is sufficient for an underlying parameter θ if the conditional probability distribution of the data (X1, . . . , Xn), given the statistic g(X1, . . . , Xn), does not depend on the parameter θ.
• Sometimes need more than one statistic, e.g. T1 and T2, in which case we say they are jointly sufficient for θ

Example (binomial)
• The pmf is f(x | p) = p^x (1 − p)^(1−x),  x = 0, 1
• The likelihood is
  ∏ᵢ f(xi | p) = p^(Σ xi) (1 − p)^(n − Σ xi)
• Let Y = Σ Xi. We have that Y ∼ Bi(n, p) and then (for x1, …, xn with Σ xi = y),
  Pr(X1 = x1, …, Xn = xn | Y = y) = Pr(X1 = x1, …, Xn = xn) / Pr(Y = y)
    = [p^x1 (1 − p)^(1−x1) ⋯ p^xn (1 − p)^(1−xn)] / [C(n, y) p^y (1 − p)^(n−y)]
    = 1 / C(n, y)

• Given Y = y, the conditional distribution of X1, …, Xn does not depend on p.
• Therefore, Y is sufficient for p (illustrated by simulation below).
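• A simulation sketch (settings made up) illustrating this: conditional on Y = y, the distribution of an individual Xi is the same whatever the value of p.

    import numpy as np

    rng = np.random.default_rng(4)
    n, y, reps = 5, 2, 200000                   # made-up settings

    for p in (0.2, 0.7):
        X = rng.binomial(1, p, size=(reps, n))
        keep = X.sum(axis=1) == y               # condition on the sufficient statistic
        print(p, X[keep, 0].mean())             # ~ y/n = 0.4 for both values of p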

Factorisation theorem
• Let X1, …, Xn have joint pdf or pmf f(x1, …, xn | θ).
• Y = g(x1, …, xn) is sufficient for θ if and only if
  f(x1, …, xn | θ) = φ{g(x1, …, xn) | θ} h(x1, …, xn)
• Here φ depends on x1, …, xn only through g(x1, …, xn), and h does not depend on θ.

Example (binomial)
• The pmf is f(x | p) = p^x (1 − p)^(1−x),  x = 0, 1
• The likelihood is
  ∏ᵢ f(xi | p) = p^(Σ xi) (1 − p)^(n − Σ xi)
• So y = Σ xi is sufficient for p, since we can factorise the likelihood into:
  φ(y, p) = p^y (1 − p)^(n−y)  and  h(x1, …, xn) = 1
• So in the coin tossing example, the total number of heads is sufficient for θ.

Example (Poisson)
• X1, . . . , Xn random sample from a Poisson distribution with mean λ.
• The likelihood is
  ∏ᵢ f(xi | λ) = λ^(Σ xi) e^(−nλ) / (x1! ⋯ xn!) = (λ^(n x̄) e^(−nλ)) × (1 / (x1! ⋯ xn!))
• We see that X̄ is sufficient for λ (see the numerical illustration below).
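• A sketch (with hypothetical data) of what sufficiency buys us here: the likelihood ratio L(λ₁)/L(λ₂) depends on the data only through Σ xi, so two samples with the same total give identical ratios.

    import numpy as np
    from scipy.stats import poisson

    x_a = np.array([1, 4, 0, 3, 2])             # hypothetical sample, total = 10
    x_b = np.array([2, 2, 2, 2, 2])             # different sample, same total

    def lik(x, lam):
        return poisson.pmf(x, lam).prod()

    for x in (x_a, x_b):
        print(lik(x, 1.5) / lik(x, 2.5))        # identical ratio for both samples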

Exponential family of distributions
• We often use distributions which have pdfs of the form:
  f(x | θ) = exp{K(x) p(θ) + S(x) + q(θ)}
• This is called the exponential family.
• Let X1, …, Xn be iid from an exponential family. Then Σᵢ K(Xi) is sufficient for θ.
• To prove this, note that the joint pdf is
  exp{p(θ) Σ K(xi) + Σ S(xi) + n q(θ)} = [exp{p(θ) Σ K(xi) + n q(θ)}] × exp{Σ S(xi)}
• The factorisation theorem then shows sufficiency.

Example (exponential)
• The pdf is
  f(x | θ) = (1/θ) e^(−x/θ) = exp{x (−1/θ) − ln θ},  0 < x < ∞
• This has exponential family form with K(x) = x, p(θ) = −1/θ, S(x) = 0 and q(θ) = −ln θ, so Σᵢ Xi (equivalently X̄) is sufficient for θ.