F71SM STATISTICAL METHODS
7 ESTIMATION
7.1 Introduction
The aim of statistical inference is to extract relevant, useful information from a set of data x = {x1,x2,…,xn} and to use this information in an efficient way to inform us about the population from which the data have arisen.
We perform inference about a population in the presence of a model, which is a mathematical representation of the random process which is assumed to have generated the data. The model involves a probability distribution for a population variable X. This distribution contains one (or more) unknown parameter(s), θ (or a parameter vector θ), whose value(s) we want to estimate.
In the classical (frequentist) approach we think of θ as having a fixed (but unknown) value, and we regard the data x as coming from a repeatable sampling procedure.
We have a random sample X = (X1, X2, . . . , Xn) available. Each observation xi is a sampled value of the random variable X with pdf f(x;θ).
A statistic is a function of X which does not involve θ. It is a random variable with its own probability distribution, called its sampling distribution, the properties of which depend on those of X.
An estimator of θ is a statistic whose value is used as an estimate of θ.
We denote the estimator θ̂(X) or just θ̂ (other common notations are θ̃ or θ*), and the estimate (its value for a particular observed sample) θ̂(x), or again just θ̂.
Example 7.1 Suppose we have a random sample x = (x1, x2, . . . , xn) of X ∼ Poisson(θ).
Consider the following four estimators of θ:
(i) θ̂(X) = (1/n) Σ_{i=1}^n Xi = X̄
(ii) θ̂(X) = X1
(iii) θ̂(X) = Σ_{i=1}^n wiXi, where w1, . . . , wn are a known set of constants
(iv) θ̂(X) = (1/n) Σ_{i=1}^n Xi²
Which do you think is the ‘best’ one to use? Why? What makes for a good estimator?
We consider desirable properties of θ̂, recognising that it has its own sampling distribution with its own properties.
The SD of an estimator θ̂ is referred to as its standard error. We write se(θ̂), or ese(θ̂) for the estimated standard error.
7.2 Properties of estimators
Unbiasedness
An estimator θ̂ is unbiased for θ if E[θ̂] = θ
The bias of θ̂ is B[θ̂] = E[θ̂ − θ] = E[θ̂] − θ
θ̂ is asymptotically unbiased for θ if E[θ̂] → θ as n → ∞, i.e. if B[θ̂] → 0 as n → ∞
Example 7.2 Sampling from X with mean μ and variance σ²
E[X̄] = μ so X̄ is unbiased for μ
E[S²] = σ² so S² is unbiased for σ²
E[S] ≠ σ so S is biased for σ
Definition. The mean square error of an estimator θ̂ is MSE[θ̂] = E[(θ̂ − θ)²]
Note the difference from the expression for the variance: Var[θ̂] = E[(θ̂ − E[θ̂])²]; the MSE and the variance of an estimator are the same in the case of an unbiased estimator.
It is easy to show MSE[θ̂] = Var[θ̂] + (B[θ̂])²
So, if Var[θ̂] and B[θ̂] → 0 as n → ∞, then MSE[θ̂] → 0 as n → ∞
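The decomposition follows by writing θ̂ − θ = (θ̂ − E[θ̂]) + (E[θ̂] − θ) and expanding the square; a short derivation (a sketch in LaTeX notation, with m = E[θ̂]) is:

```latex
\begin{aligned}
\mathrm{MSE}[\hat\theta] &= E\big[(\hat\theta-\theta)^2\big]
                          = E\big[(\hat\theta-m+m-\theta)^2\big] \\
 &= E\big[(\hat\theta-m)^2\big] + 2(m-\theta)\,E[\hat\theta-m] + (m-\theta)^2 \\
 &= \mathrm{Var}[\hat\theta] + \big(B[\hat\theta]\big)^2
    \qquad\text{since } E[\hat\theta-m]=0 .
\end{aligned}
```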
Consistency
An estimator θ̂ is consistent (some authors say ‘consistent in probability’) for θ if, for any ε > 0, P(|θ̂ − θ| ≥ ε) → 0 as n → ∞
From Markov’s inequality we get P(|θ̂ − θ| ≥ ε) ≤ (1/ε²) MSE[θ̂]
Hence Var[θ̂] → 0 and B[θ̂] → 0 as n → ∞ ⇒ MSE[θ̂] → 0 ⇒ θ̂ is consistent for θ
Example 7.3 Sampling from X with mean μ and variance σ²
X̄ is unbiased for μ, with variance σ²/n → 0 as n → ∞, so X̄ is consistent for μ
Example 7.4 Sampling from X ∼ N(μ, σ²)
(n − 1)S²/σ² ∼ χ²_{n−1} ⇒ Var[S²] = 2σ⁴/(n − 1)
S² is unbiased for σ², and its variance → 0 as n → ∞, so S² is consistent for σ²
Efficiency
An estimator θ̂1 is more efficient than θ̂2 if MSE[θ̂1] < MSE[θ̂2]
Let θ̂1 and θ̂2 be unbiased estimators of θ with Var[θ̂1] ≤ Var[θ̂2]. Then the relative efficiency of the estimators is Var[θ̂1]/Var[θ̂2]
In Example 7.1, X̄ is unbiased for θ with variance θ/n; X1 is unbiased for θ with variance θ; so the relative efficiency of the estimators is (θ/n)/θ = 1/n.
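A small simulation (not in the original notes; the value θ = 3 and the sample size are arbitrary choices) illustrates this comparison of the estimators from Example 7.1:

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n, reps = 3.0, 25, 100_000

samples = rng.poisson(theta, size=(reps, n))
xbar = samples.mean(axis=1)   # estimator (i): the sample mean
x1 = samples[:, 0]            # estimator (ii): the first observation only

for name, est in [("X-bar", xbar), ("X1", x1)]:
    print(f"{name}: bias ~ {est.mean() - theta:.4f}, "
          f"variance ~ {est.var():.4f}, MSE ~ {np.mean((est - theta)**2):.4f}")
# Both are (essentially) unbiased, but Var[X-bar] ~ theta/n = 0.12 while
# Var[X1] ~ theta = 3, giving relative efficiency about 1/n as stated above.
```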
Definition. The likelihood function for a parameter θ, given a sample x, denoted L(θ;x), is the joint pdf evaluated at the sample values and regarded as a function of the unknown parameter(s).
L(θ;x) = fX(x1;θ)fX(x2;θ)...fX(xn;θ)
It represents a relative measure of how strongly the data support different possible values
of the parameter θ, and is often written simply as L(θ).
Log-likelihood: l(θ) = l(θ; x) = ln(L(θ; x))
Definition. The score function is U(θ) = U(θ; x) = ∂l(θ)/∂θ
For any value of θ, we can think of L(θ;x) as a realised value of a r.v. L(θ;X). Similarly for l(θ;x) and U(θ;x). The score function U(θ) is a random variable.
Since l(θ) is a log of a product (and hence is a sum of the individual logs) then l = Σ li and U = Σ Ui, where li and Ui are the corresponding functions for the ith observation xi alone. The Ui’s are i.i.d. r.v.s.
Note: For the proofs which follow we assume that certain ‘regularity conditions’ hold — the main ones we require are that the derivatives used exist and that the range of the distribution of X does not depend on the parameter θ.
Properties of the score function
Consider the situation for a single observation (the ith) and a single unknown parameter θ. Write f(xi;θ) simply as f. Then l = l(θ) = ln(f).
Starting from the identity ∫ f dx = 1, differentiate with respect to θ:
0 = d/dθ ∫ f dx = ∫ (df/dθ) dx = ∫ f (1/f)(df/dθ) dx = ∫ f (d ln(f)/dθ) dx = ∫ f (dl/dθ) dx = E[dl/dθ] = E[Ui(θ)]
So E[Ui(θ)] = 0, and for the whole sample, E[U(θ)] = Σ E[Ui(θ)] = 0
Starting from the relation 0 = ∫ f (dl/dθ) dx and differentiating again with respect to θ:
0 = ∫ [ f (d²l/dθ²) + (df/dθ)(dl/dθ) ] dx = ∫ f (d²l/dθ²) dx + ∫ f (1/f)(df/dθ)(dl/dθ) dx
  = ∫ f (d²l/dθ²) dx + ∫ f (dl/dθ)² dx = E[d²l/dθ²] + E[(dl/dθ)²]
So E[(dl/dθ)²] = −E[d²l/dθ²], that is, E[Ui²] = −E[dUi/dθ]
To see that this result holds for the whole sample, note that E[U²] = Σ E[Ui²] since the Ui are i.i.d. with zero mean, and so E[U²] = −Σ E[dUi/dθ] = −E[dU/dθ]
Definition. Fisher’s Information about θ contained in the sample is I(θ) = E[(∂l/∂θ)²]
1. I(θ) = E[(U(θ))²] = Var[U(θ)] = Σ_{i=1}^n Var[Ui(θ)]
2. I(θ) for a sample of size n is n × I(θ) for a sample of size 1 (i.e. a single observation).
3. Important and useful alternative form: I(θ) = −E[∂²l/∂θ²] = −E[∂U/∂θ]
The Cramer-Rao inequality
There is a theoretical minimum value for the variance of an unbiased estimator θ̂ of θ.
The Cramer-Rao inequality states that if θ̂ is an unbiased estimator of θ then Var[θ̂] ≥ 1/I(θ)
(Proof omitted.)
So the more ‘information’ we have, the lower the bound on Var[θ̂].
Lemma on attaining the Cramer-Rao lower bound (CRLB)
There exists an unbiased estimator θ̂ of θ which attains the CRLB if and only if U(θ) can be expressed in the form U(θ) = k(θ)(θ̂ − θ)
(Proof omitted.)
Summary of key points regarding CRLB:
If we can express U(θ) in the form U(θ) = k(θ)(θ̂ − θ), then θ̂ is unbiased for θ, it attains the CRLB, and k(θ) = I(θ) = 1/Var[θ̂]
(It also follows that θ̂ is sufficient for θ.)
An unbiased estimator which attains the CRLB is an MVUE (minimum variance unbiased estimator) – but note that the converse is not necessarily true.
In Examples 7.5 and 7.6 below the bound is attained in each case.
Example 7.5 Random sample, size n, of X ∼ N(μ, 1).
L(μ) = (2π)^(−n/2) exp(−(1/2) Σ (xi − μ)²)
⇒ l(μ) = −(n/2) ln(2π) − (1/2) Σ (xi − μ)²
⇒ U(μ) = dl/dμ = Σ (xi − μ) = n(x̄ − μ)
⇒ I(μ) = −E[dU/dμ] = −E[−n] = n
Alternatively, I(μ) = E[n²(X̄ − μ)²] = n² Var[X̄] = n² × (1/n) = n
Note: E[U(μ)] = 0 and Var[U(μ)] = n² Var[X̄] = I(μ)
Consider the estimator μ̂ = X̄. First, it is unbiased. Next, Var[X̄] = 1/n, so it is consistent.
Finally, U(μ) = I(μ)(X̄ − μ), so it attains the CRLB – it is MVUE.
Example 7.6 X ∼ b(n,θ). θ is the ‘population proportion’ of successes in a series of n
Bernoulli trials.
We have a single observation of X. That is, we observe x successes in n trials.
L(θ; x) = k θ^x (1 − θ)^(n−x)
⇒ l(θ) = k1 + x ln(θ) + (n − x) ln(1 − θ)
⇒ U(θ) = dl/dθ = x/θ − (n − x)/(1 − θ) = (x − nθ)/(θ(1 − θ))
⇒ I(θ) = (1/(θ(1 − θ))²) E[(X − nθ)²] = (1/(θ(1 − θ))²) Var[X] = nθ(1 − θ)/(θ(1 − θ))² = n/(θ(1 − θ))
Alternatively, I(θ) = −E[dU/dθ] = −E[−X/θ² − (n − X)/(1 − θ)²] = nθ/θ² + n(1 − θ)/(1 − θ)² = n/θ + n/(1 − θ) = n/(θ(1 − θ))
Note: E[U(θ)] = 0 and Var[U(θ)] = I(θ)
Consider the estimator (unbiased) θ̂ = X/n. That is, use the sample proportion of successes to estimate the population success probability.
Since Var[θ̂] = θ(1 − θ)/n, it is consistent.
Further, U(θ) = I(θ)(X/n − θ), hence it attains the CRLB – it is MVUE.
Graphs of L(θ) and l(θ) in the case in which we observe 7 successes in 10 trials (n = 10, x = 7).
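The graphs themselves are not reproduced here; a minimal plotting sketch (assuming matplotlib and scipy are available) that recreates them is:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import comb

n, x = 10, 7
theta = np.linspace(0.001, 0.999, 500)
L = comb(n, x) * theta**x * (1 - theta)**(n - x)   # likelihood L(theta)
l = np.log(L)                                      # log-likelihood l(theta)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3.5))
ax1.plot(theta, L); ax1.set_title("L(theta)")
ax2.plot(theta, l); ax2.set_title("l(theta)")
for ax in (ax1, ax2):
    ax.axvline(x / n, linestyle="--")   # maximum at theta-hat = x/n = 0.7
    ax.set_xlabel("theta")
plt.tight_layout()
plt.show()
```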
7.3 Methods of constructing estimators
Method of moments
Simply equate the population moments (about the origin or about the mean) to the corresponding sample moments (in as convenient a manner as possible) and solve for parameter estimates. The method is often convenient, but does not always produce efficient estimators.
Example 7.7 X ∼ U(0, θ)
Have E[X] = θ/2, so set θ̃/2 = X̄ ⇒ MME is θ̃ = 2X̄
Example 7.8 X ∼ N(0, σ²), estimate σ²
First moment E[X] = 0. This is no use, so try second moment. E[X²] = σ², so set σ̃² = (1/n) Σ Xi² and solve for the MME σ̃ = √((1/n) Σ Xi²)
Example 7.9 X ∼ Γ(r, λ), estimate both r and λ
First moment E[X] = r/λ, second moment (about the mean) Var[X] = r/λ²
So set r̃/λ̃ = X̄, r̃/λ̃² = S² and solve ⇒ MMEs are λ̃ = X̄/S², r̃ = X̄²/S²
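A minimal numerical sketch of Example 7.9 (the data below are simulated, so the true parameter values are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
r_true, lam_true = 2.5, 1.5
x = rng.gamma(shape=r_true, scale=1 / lam_true, size=500)

xbar = x.mean()
s2 = x.var(ddof=1)        # sample variance S^2

lam_mme = xbar / s2       # lambda-tilde = X-bar / S^2
r_mme = xbar**2 / s2      # r-tilde = X-bar^2 / S^2
print(f"MMEs: r ~ {r_mme:.3f}, lambda ~ {lam_mme:.3f}")
```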
Method of least squares
Choose parameter values which minimise the sum of the squares of the deviations of the observations from their means as given by the model.
Example 7.10 X has an unspecified distribution with mean μ.
We can write X = μ + ε where ε is a variable with mean 0; ε represents the deviation of X
from its mean, called an ‘error’ or ‘noise’ variable.
Sum of squares S = Σ εi² = Σ (Xi − μ)² ⇒ dS/dμ = −2 Σ (Xi − μ) = 0 ⇒ LSE is μ̂ = X̄
Least squares estimation is heavily used in regression analysis; see Chapter 10.
Method of maximum likelihood
If we know the value of θ, then L(θ; x) gives the probability of (or value of the pdf associated with) observing that particular sample. But we don’t know θ – it’s the xi values that are known.
Definition. The maximum likelihood estimator (MLE) of θ is the value θ̂ that maximises L(θ) (or equivalently l(θ)).
Example 7.11 X ∼ b(10, θ). Suppose we observe 7 successes in the 10 trials.
L(θ) = (10 choose 7) θ⁷(1 − θ)³
⇒ l(θ) = k + 7 ln(θ) + 3 ln(1 − θ)
dl/dθ = 7/θ − 3/(1 − θ)
⇒ MLE is θ̂ = 0.7
General case: X ∼ b(n, θ), observe X = x
L(θ; x) = k θ^x (1 − θ)^(n−x)
⇒ l(θ) = k1 + x ln(θ) + (n − x) ln(1 − θ)
dl/dθ = x/θ − (n − x)/(1 − θ) = (x − nθ)/(θ(1 − θ))
⇒ MLE is θ̂ = X/n
The MLE of a binomial probability (a population proportion) is the sample proportion
Example 7.12 Random sample, size n, of X ∼ Poisson(λ)
L(λ) = e^(−nλ) λ^(Σ xi) / Π xi!
⇒ l(λ) = −nλ + (Σ xi) ln(λ) + k
⇒ dl/dλ = −n + (Σ xi)/λ ⇒ MLE is λ̂ = X̄
The MLE of a Poisson mean is the sample mean
From Example 7.5, random sample, size n, of X ∼ N(μ, 1), we have dl/dμ = n(x̄ − μ) and so MLE is μ̂ = X̄. Also true for X ∼ N(μ, σ²).
The MLE of a normal distribution mean is the sample mean
(i) General shape of L(θ) is as in the graph in Example 7.6 above, but take care — especially if the range of the r.v. involves the parameter.
(ii) MLEs are invariant under transformations of the parameter, so if θ̂ is the MLE of θ then g(θ̂) is the MLE of g(θ)
e.g. if θ̂ = 0.7 then the MLE of η = θ² is η̂ = 0.7² = 0.49
(iii) It is not always possible to solve the equation(s) for the MLE(s) explicitly — we may need a numerical solution (see the sketch after these notes).
(iv) Likelihood methods are not restricted to situations in which we have full data from a random sample from a single distribution. The method can be used whenever we can specify the joint probability distribution of the observed data. For instance, the observations may come from independent, but not identically distributed, r.v.s; or we may have incomplete or censored data.
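To illustrate note (iii): the two-parameter Γ(r, λ) likelihood has no closed-form MLE for r, but it can be maximised numerically. A minimal sketch using scipy (the simulated data and starting values are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import gamma

rng = np.random.default_rng(3)
x = rng.gamma(shape=2.5, scale=1 / 1.5, size=200)   # simulated sample

def neg_log_lik(params):
    r, lam = params
    if r <= 0 or lam <= 0:
        return np.inf                                # keep the search in the valid region
    return -np.sum(gamma.logpdf(x, a=r, scale=1 / lam))

start = [x.mean()**2 / x.var(ddof=1), x.mean() / x.var(ddof=1)]   # method-of-moments start
res = minimize(neg_log_lik, start, method="Nelder-Mead")
print("numerical MLEs (r, lambda):", res.x)
```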
Example 7.13 A case of incomplete information: random sample, size n = 10, of X ∼ exp(λ). We observe only that 7 observations are less than 5 (and 3 exceed 5). Calculate the MLE of λ.
Let θ = P(X > 5) = exp(−5λ), and think of ‘proportions’. The MLE of θ is the proportion of observations greater than 5 in the sample (here 3/10). So the MLE of θ = exp(−5λ) is θ̂ = 0.3. So the MLE of λ is given by solving exp(−5λ̂) = 0.3 ⇒ λ̂ = 0.241
Classical large sample properties of MLEs
MLEs have very attractive properties — at least for large samples. Firstly, it can be shown that the MLE is consistent for θ.
The ‘maximum likelihood theorem’ states that in the limit as n → ∞,
(θ̂ − θ)√I(θ) ∼ N(0, 1)
So, in large samples, an MLE θ̂ is approximately:
• unbiased;
• efficient (achieves the CRLB);
• normally distributed.
To sum up: asymptotically,
θ̂ ∼ N(θ, 1/I(θ))
In practice, since se(θ̂) = 1/√I(θ), we estimate the standard error by ese(θ̂) = 1/√I(θ̂)
Example 7.14 X ∼ Poisson(λ) (continuing from Example 7.12)
λ̂ = X̄, Var[λ̂] = Var[X̄] = Var[X]/n = λ/n and ese(λ̂) = √(λ̂/n)
Example 7.15 Random sample, size n (large), of X ∼ exp(λ)
L(λ) = λⁿ exp(−λ Σ xi)
⇒ l(λ) = n ln λ − λ Σ xi
⇒ dl/dλ = n/λ − Σ xi
⇒ λ̂ = 1/X̄
In this case λ̂ is biased, but asymptotically unbiased. (In fact E[λ̂] = nλ/(n − 1). Check by noting that Σ Xi ∼ Γ(n, λ) and for n > 1, Y ∼ Γ(n, λ) ⇒ E[1/Y] = λ/(n − 1).)
d²l/dλ² = −n/λ²
⇒ I(λ) = E[n/λ²] = n/λ²
so asymptotically, λ̂ ∼ N(λ, λ²/n)
Note that ese(λ̂) = λ̂/√n
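A quick simulation check of this limiting distribution (the choices λ = 2 and n = 100 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
lam, n, reps = 2.0, 100, 50_000

x = rng.exponential(scale=1 / lam, size=(reps, n))
lam_hat = 1 / x.mean(axis=1)      # MLE 1/X-bar in each replicate

print("mean of lambda-hat:", round(lam_hat.mean(), 3))   # ~ n*lam/(n-1) = 2.02
print("sd of lambda-hat:  ", round(lam_hat.std(), 3))    # ~ lam/sqrt(n) = 0.20
```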
MLEs in multi-parameter models
We now have an r-vector of parameters θ = (θ1, θ2, . . . , θr) and we use partial differentiation to maximise L.
The Fisher information matrix I(θ) is the r × r matrix with (i, j)th element
Iij(θ) = −E[∂²l(θ)/∂θi∂θj]
and for large samples, θ̂ ∼ MVN(θ, I(θ)⁻¹)
Example 7.16 Normal distribution with mean and variance both unknown. Random sample, size n, of X ∼ N(μ, σ²). Find the MLE of θ = (μ, φ) where φ = σ²
l(μ, φ) = −(n/2) ln(2πφ) − Σ (xi − μ)²/(2φ)
∂l/∂μ = Σ (xi − μ)/φ ⇒ μ̂ = X̄
∂l/∂φ = −n/(2φ) + Σ (xi − μ)²/(2φ²) ⇒ φ̂ = (1/n) Σ (Xi − μ̂)² = (1/n) Σ (Xi − X̄)²
Note that φ̂ is biased: E[φ̂] = E[(n − 1)S²/n] = (n − 1)σ²/n. It is, however, asymptotically unbiased.
Note that the MLE of σ is σ̂ = √φ̂
You can verify that I(θ) = [ n/φ, 0 ; 0, n/(2φ²) ]
and so, asymptotically, (μ̂, φ̂) ∼ MVN( (μ, φ), [ φ/n, 0 ; 0, 2φ²/n ] )
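A short sketch computing these MLEs and their estimated standard errors from the diagonal of I(θ̂)⁻¹ (the data are simulated here, so μ = 10 and σ = 2 are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(loc=10.0, scale=2.0, size=200)
n = x.size

mu_hat = x.mean()                     # MLE of mu
phi_hat = np.mean((x - mu_hat)**2)    # MLE of phi = sigma^2 (divides by n, not n-1)

ese_mu = np.sqrt(phi_hat / n)         # sqrt of [I(theta)^-1]_11 at the MLE
ese_phi = np.sqrt(2 * phi_hat**2 / n) # sqrt of [I(theta)^-1]_22 at the MLE
print(f"mu-hat  = {mu_hat:.3f}  (ese {ese_mu:.3f})")
print(f"phi-hat = {phi_hat:.3f}  (ese {ese_phi:.3f})")
```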
7.4 Further worked examples
7.17 Random sample, size 5, of X ∼ U(0, θ). Data x = (0.46, 1.14, 0.83, 0.21, 0.59).
(a) Find the MME and the MLE of θ
(b) Compare the expectations and the MSEs of the two estimators.
(a) MME: θ̃ = 2X̄ (see Example 7.7). Here θ̃ = 1.292
MLE: f(x; θ) = 1/θ for 0 < x < θ (and zero otherwise), so the value of θ must exceed all the sample values. This is a situation in which the range of values assumed by the r.v. depends on the unknown parameter (so the regularity conditions are violated). We must be careful when specifying the likelihood function.
L(θ) = (1/θ)⁵ for θ > 1.14 (and 0 otherwise)
The likelihood function is maximised at the highest sample value, i.e. θ̂ = 1.14.
Graph of likelihood function:
(b) E[θ̃] = E[2X̄] = 2E[X̄] = 2(θ/2) = θ, so θ̃ is unbiased (but may give an inadmissible estimate)
MSE[θ̃] = Var[θ̃] = 4Var[X̄] = 4Var[X]/n = 4θ²/12n = θ²/3n
To find the properties of θ̂, let Y = θ̂ = max(xi), the largest observation in the sample.
FY(y) = P(Y ≤ y) = P(all n observations are ≤ y) = (y/θ)ⁿ
so fY(y) = dF/dy = n y^(n−1)/θⁿ
From this, can show that E[Y] = nθ/(n + 1), E[Y²] = nθ²/(n + 2), and Var[Y] = nθ²/((n + 1)²(n + 2)) (check)
Then E[θ̂] = E[Y] = nθ/(n + 1), so θ̂ is biased, but asymptotically unbiased.
MSE[θ̂] = Var[θ̂] + (B[θ̂])² = nθ²/((n + 1)²(n + 2)) + (nθ/(n + 1) − θ)² = 2θ²/((n + 1)(n + 2))
Hence MSE[θ̂] ≤ MSE[θ̃] for n ≥ 2 (with equality only at n = 2), and in both cases MSE → 0 as n → ∞
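A Monte Carlo check of part (b), with θ = 1 and n = 5 so that the exact values above are MSE[θ̃] = θ²/3n ≈ 0.0667 and MSE[θ̂] = 2θ²/((n+1)(n+2)) ≈ 0.0476:

```python
import numpy as np

rng = np.random.default_rng(6)
theta, n, reps = 1.0, 5, 200_000

x = rng.uniform(0, theta, size=(reps, n))
mme = 2 * x.mean(axis=1)   # theta-tilde = 2 * X-bar
mle = x.max(axis=1)        # theta-hat  = max(X_i)

for name, est in [("MME 2*X-bar", mme), ("MLE max(X)", mle)]:
    print(f"{name}: mean ~ {est.mean():.4f}, MSE ~ {np.mean((est - theta)**2):.4f}")
# expected: MME mean ~ 1, MSE ~ 0.0667; MLE mean ~ n/(n+1) = 0.833, MSE ~ 0.0476
```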
7.18 Random sample, size n, of a r.v. X with pdf f(x) = 2θx exp(−θx²), x > 0 (zero otherwise), where θ > 0.
(a) Find the MME of θ.
(b) Find the MLE of θ, the score function U(θ), and Fisher’s information function I(θ).
State the limiting distribution of the MLE.
(a) MME: E[X] = ∫₀^∞ 2θx² exp(−θx²) dx = (1/√θ) ∫₀^∞ u^(1/2) e^(−u) du = (1/2)√(π/θ)
Setting X̄ = (1/2)√(π/θ̃) and solving for θ̃ gives θ̃ = π/(4X̄²)
(b) L(θ) = k θⁿ exp(−θ Σ xi²) ⇒ l(θ) = c + n ln θ − θ Σ xi²
dl/dθ = n/θ − Σ xi² ⇒ θ̂ = n/Σ xi²
U(θ) = dl/dθ = n/θ − Σ Xi²
I(θ) = −E[d²l/dθ²] = E[n/θ²] = n/θ²
The limiting distribution is θ̂ ∼ N(θ, θ²/n)
7.19 Random sample, size n, from X ∼ N (μ, 1). All we know is that k of the n observations are positive. Find an expression for the MLE of μ and evaluate it in the case n = 20, k = 14.
P(X > 0) = P(Z > −μ) = P(Z < μ) = Φ(μ) where Φ(·) is the distribution function of a standard normal random variable N(0,1)
The MLE of the population proportion (of observations which are positive) is the sample proportion k/n, so k/n is the MLE of Φ(μ). It follows that μ̂ is given by solving Φ(μ̂) = k/n, which gives μ̂ = Φ⁻¹(k/n).
In the case n = 20, k = 14 we have μ̂ = Φ⁻¹(0.7) = 0.5244 (from tables).
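The same value can be checked without tables, e.g. with scipy:

```python
from scipy.stats import norm

n, k = 20, 14
mu_hat = norm.ppf(k / n)    # inverse standard normal cdf at 0.7
print(round(mu_hat, 4))     # 0.5244
```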
7.20 A crude model for the number of boys, X, in families with three children is
x           0     1      2      3
P(X = x)    θ     3θ/2   3θ/2   1 − 4θ
where θ is a parameter to be estimated. In a random sample of n such families there were ni with i boys, i = 0,1,2,3.
(a) Find the method of moments estimator of θ.
(b) Find the maximum likelihood estimator of θ and its asymptotic distribution.
(a) E[X] = 1 × 3θ/2 + 2 × 3θ/2 + 3 × (1 − 4θ) = 3 − (15/2)θ
Sample mean = (n1 + 2n2 + 3n3)/n,
so we get the MME by setting 3 − (15/2)θ̃ = (n1 + 2n2 + 3n3)/n ⇒ θ̃ = 2(3n0 + 2n1 + n2)/(15n)
(b) L(θ) = k θ^(n0+n1+n2) (1 − 4θ)^(n3)
l(θ) = c + (n0 + n1 + n2) ln θ + n3 ln(1 − 4θ)
⇒ dl/dθ = (n0 + n1 + n2)/θ − 4n3/(1 − 4θ) ⇒ θ̂ = (n0 + n1 + n2)/(4n)
(can you see a simple justification for this answer?)
d²l/dθ² = −(n0 + n1 + n2)/θ² − 16n3/(1 − 4θ)²
I(θ) = −E[d²l/dθ²] = n(θ + 3θ/2 + 3θ/2)/θ² + 16n(1 − 4θ)/(1 − 4θ)² = 4n/θ + 16n/(1 − 4θ) = 4n/(θ(1 − 4θ))
Asymptotic distribution is θ̂ ∼ N(θ, θ(1 − 4θ)/(4n))
7.21 The lifetime T of a bulb of a certain type is to be modelled as an exp(θ) r.v. A random sample of n such bulbs are put on test and are observed for a period t0. The times to failure of bulbs which fail are recorded. It is observed that by time t0, m of the n bulbs fail, with lifetimes t1, t2, . . . , tm respectively. The remaining n − m bulbs are still working at the end of the observation period.
For T ∼ exp(θ), P(T > t0) = exp(−θt0), and the overall likelihood for the whole sample is given by
L(θ) = P(n − m bulbs last longer than t0) × Π_{i=1}^m fT(ti)
     = (exp(−θt0))^(n−m) × θ^m exp(−θ Σ_{i=1}^m ti)
     = θ^m exp(−θ(n − m)t0 − θ Σ_{i=1}^m ti)
⇒ l(θ) = m ln θ − θ(n − m)t0 − θ Σ_{i=1}^m ti
⇒ dl/dθ = m/θ − (n − m)t0 − Σ_{i=1}^m ti
⇒ θ̂ = m/((n − m)t0 + Σ_{i=1}^m ti)
This is an example of censored data.
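A minimal numerical sketch of this censored-data MLE (the failure times, t0 and n below are made-up illustrative values):

```python
import numpy as np

t0 = 10.0                                    # length of the observation period
times = np.array([1.2, 3.4, 4.1, 6.8, 7.5])  # observed failure times (m = 5 bulbs failed)
n = 8                                        # total number of bulbs on test
m = times.size

theta_hat = m / ((n - m) * t0 + times.sum()) # theta-hat = m / ((n-m)*t0 + sum of t_i)
print(f"theta-hat = {theta_hat:.4f}")
```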