5 Point estimation
5.1 Introduction
The objective of a statistical analysis is to make inferences about a population
based on a sample. Usually we begin by assuming that the data were generated
by a probability model for the population. Such a model will typically contain
one or more parameters θ whose value is unknown. The value of θ needs to be
estimated using the sample data. For example, in previous chapters we have used
the sample mean to estimate the population mean, and the sample proportion
to estimate the population proportion.
A given estimation procedure will typically yield different results for different samples; thus, under random sampling from the population, the result of the estimation will be a random variable with its own sampling distribution. In this
chapter, we will discuss further the properties that we would like an estimation
procedure to have. We begin to answer questions such as:
• Is my estimation procedure a good one or not?
• What properties would we like the sampling distribution to have?
5.2 General framework
Let X1, . . . , Xn be a random sample from a distribution with c.d.f. FX(x; θ),
where θ is a parameter whose value is unknown. A (point) estimator of θ,
denoted by θ̂, is a real, single-valued function of the sample, i.e.
θ̂ = h(X1, . . . , Xn) .
As we have seen already, because the Xi are random variables, the estimator θ̂
is also a random variable whose probability distribution is called its sampling
distribution.
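To make this idea concrete, here is a minimal simulation sketch (in Python with NumPy, not part of these notes; the population N(10, 4), the sample size and the choice of the sample mean as the estimator are all illustrative assumptions). Drawing many samples and applying the same estimator to each produces draws from the estimator's sampling distribution.

import numpy as np

rng = np.random.default_rng(1)

# Illustrative population and sample size (assumed values for the demonstration only).
mu, sigma, n = 10.0, 2.0, 25

def h(sample):
    # The estimator: a function of the sample, here the sample mean.
    return sample.mean()

# Each repetition draws a fresh random sample and applies the same estimator;
# the collection of results approximates the sampling distribution of the estimator.
estimates = np.array([h(rng.normal(mu, sigma, size=n)) for _ in range(10_000)])

print(estimates.mean())  # close to mu = 10: the sampling distribution is centered near mu
print(estimates.std())   # close to sigma / n**0.5 = 0.4: its spread

The printed mean and standard deviation approximate the centre and spread of the sampling distribution, which are formalized below as bias and standard error.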
The value θ̂ = h(x1, . . . , xn) assumed for a particular sample x1, . . . , xn of
observed data is called a (point) estimate of θ. Note that the point estimate will
almost never be exactly equal to the true value of θ, because of sampling error.
Often θ may in fact be a vector of p scalar parameters. In this case, we require a separate estimator for each of the p components of θ. For example,
the normal distribution has two scalar parameters µ and σ². These could be
combined into a single parameter vector, θ = (µ, σ²), for which one possible estimator is θ̂ = (X̄, S²).
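To illustrate the distinction between an estimator and an estimate, the short sketch below (Python/NumPy; the seven data values are invented purely for illustration) applies the estimator θ̂ = (X̄, S²) to one particular observed sample, yielding a single point estimate of θ = (µ, σ²).

import numpy as np

# A hypothetical observed sample x1, ..., xn (values invented for illustration only).
x = np.array([4.8, 5.1, 5.6, 4.9, 5.3, 5.0, 5.4])

# Applying the estimator theta-hat = (X-bar, S^2) to the observed data gives a point
# estimate of theta = (mu, sigma^2); ddof=1 uses the divisor n - 1 in S^2.
theta_hat = (x.mean(), x.var(ddof=1))
print(theta_hat)

A different observed sample would give a different estimate, which is exactly why the sampling distribution of the estimator matters.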
5.3 Properties of estimators
We would like an estimator θ̂ of θ to be such that:
(i) the sampling distribution of θ̂ is centered about the target parameter, θ.
(ii) the spread of the sampling distribution of θ̂ is small.
If an estimator has properties (i) and (ii) above then we can expect estimates re-
sulting from statistical experiments to be close to the true value of the population
parameter we are trying to estimate.
We now define some mathematical concepts formalizing these notions. The
bias of a point estimator θ̂ is bias(θ̂) = E(θ̂) − θ. The estimator is said to be
unbiased if
E(θ̂) = θ ,
i.e. if bias(θ̂) = 0. Unbiasedness corresponds to property (i) above, and is gen-
erally seen as a desirable property for an estimator. Note that sometimes bi-
ased estimators can be modified to obtain unbiased estimators. For example, if
E(θ̂) = kθ, where k ≠ 1 is a constant, then bias(θ̂) = (k − 1)θ. However, θ̂/k is an
unbiased estimator of θ.
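For a concrete instance of this correction (not taken from these notes, but a standard case): if X1, . . . , Xn ∼ U(0, θ), the sample maximum satisfies E(max Xi) = nθ/(n + 1), so k = n/(n + 1) and ((n + 1)/n) max Xi is unbiased for θ. A minimal simulation check in Python/NumPy, with the arbitrary values θ = 5 and n = 10:

import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 5.0, 10, 50_000   # arbitrary illustrative values

# Many samples from U(0, theta); the sample maximum is a biased estimator of theta.
samples = rng.uniform(0.0, theta, size=(reps, n))
max_est = samples.max(axis=1)

print(max_est.mean())                  # about n/(n+1) * theta = 4.55: biased below theta
print(((n + 1) / n * max_est).mean())  # about theta = 5.0: dividing by k = n/(n+1) removes the bias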
The spread of the sampling distribution can be measured by Var(θ̂). In this context, the standard deviation of θ̂, i.e. √Var(θ̂), is called the standard
error. Suppose that we have two different unbiased estimators of θ, called θ̂1
and θ̂2, which are both based on samples of size n. By principle (ii) above, we
would prefer to use the estimator with the smallest variance, i.e. choose θ̂1 if
Var(θ̂1) < Var(θ̂2), otherwise choose θ̂2.

Example 5.1. Let X1, . . . , Xn be a random sample from a N(µ, σ²) distribution where σ² is assumed known. Recall that the Xi ∼ N(µ, σ²) independently in this case. We can estimate µ by the sample mean, i.e.

µ̂ = X̄ = (1/n) ∑ᵢ₌₁ⁿ Xi .

We have already seen that E(X̄) = µ, thus bias(X̄) = 0. Moreover, Var(X̄) = σ²/n. Note that Var(X̄) → 0 as n → ∞. Thus, as the sample size increases, the sampling distribution of X̄ becomes more concentrated about the true parameter value µ. The standard error of X̄ is

s.e.(X̄) = √Var(X̄) = σ/√n .

Note that if σ² were in fact unknown, then this standard error would also need to be estimated from the data, via

ŝ.e.(X̄) = s/√n .

Importantly, the results E(X̄) = µ and Var(X̄) = σ²/n also hold if X1, . . . , Xn are sampled independently from any continuous or discrete distribution with mean µ and variance σ². Thus the sample mean is always an unbiased estimator of the population mean.

Example 5.2. Suppose now that n = 5, X1, . . . , X5 ∼ N(µ, σ²), and an alternative estimator of µ is given by

µ̃ = (1/9)X1 + (2/9)X2 + (3/9)X3 + (2/9)X4 + (1/9)X5 .

We have that

E[µ̃] = µ/9 + 2µ/9 + 3µ/9 + 2µ/9 + µ/9 = µ ,

and

Var[µ̃] = σ²/81 + 4σ²/81 + 9σ²/81 + 4σ²/81 + σ²/81 = 19σ²/81 .

Thus, µ̃ is an unbiased estimator of µ with variance 19σ²/81. The sample mean µ̂ = X̄ is also unbiased for µ and has variance σ²/5. The two estimators µ̂ and µ̃ both have normal sampling distributions centered on µ, but the variance of the sampling distribution of µ̂ is smaller than that of µ̃ because σ²/5 < 19σ²/81. Hence, in practice, we would prefer to use µ̂.

Example 5.3. Let X1, . . . , Xn be a random sample from a N(µ, σ²) distribution where now both µ and σ² are assumed to be unknown. We can use X̄ as an estimator of µ and S² as an estimator of σ². We have already seen that

σ̂² = S² = (1/(n − 1)) ∑ᵢ₌₁ⁿ (Xi − X̄)²

is an unbiased estimator of σ², i.e. E[S²] = σ² and bias(S²) = E[S²] − σ² = σ² − σ² = 0. If we instead consider the estimator

σ̃² = (1/n) ∑ᵢ₌₁ⁿ (Xi − X̄)² ,

we see that E[σ̃²] = ((n − 1)/n) σ². Thus σ̃² is a biased estimator of σ², with bias −σ²/n. Notice that bias(σ̃²) → 0 as n → ∞. We say that σ̃² is asymptotically unbiased. It is common practice to use S², with the denominator n − 1 rather than n, since this gives an unbiased estimator of σ² for all values of n.

Exactly the same argument could also be made for using S² as an estimator of the variance of the population distribution if the data were from another, non-normal, continuous distribution, or even a discrete distribution. The only prerequisite is that σ² is finite in the population distribution. Therefore, calculations of the sample variance for any set of data should always use the divisor (n − 1).

Example 5.4. Let X1, . . . , Xn be a random sample of Bernoulli random variables with parameter p, which is unknown. Thus, Xi ∼ Bi(1, p) for i = 1, . . . , n, so that E(Xi) = p and Var(Xi) = p(1 − p) for each i. If we estimate p by the proportion of ‘successes’ in the sample, then we have

p̂ = (1/n) ∑ᵢ₌₁ⁿ Xi ,

so that

E(p̂) = (1/n) ∑ᵢ₌₁ⁿ E(Xi) = (1/n) np = p .

Also, by independence,

Var(p̂) = (1/n²) ∑ᵢ₌₁ⁿ Var(Xi) = (1/n²) np(1 − p) = p(1 − p)/n .

Hence, p̂ is an unbiased estimator of p with variance p(1 − p)/n. Notice that the variance of this estimator also tends towards zero as n gets larger.

Example 5.5. Let X1, . . . , Xn be a random sample from a U[θ, θ + 1] distribution where θ is unknown. Thus, the data are uniformly distributed on an interval of unit length, but the location of that interval is unknown. Consider using the estimator θ̂ = X̄. Now,

E(X̄) = (θ + (θ + 1))/2 = (2θ + 1)/2 = θ + 1/2 ,

so bias(X̄) = θ + 1/2 − θ = 1/2, while Var(X̄) = 1/(12n). However, if we instead define θ̂ = X̄ − 1/2, then E(θ̂) = θ and Var(θ̂) = 1/(12n), so the modified estimator is unbiased with the same variance.
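The bias and variance claims in Examples 5.1–5.5 can all be checked by simulation. The sketch below (Python with NumPy, not part of the derivations; the particular values of µ, σ², p, θ and the sample sizes are arbitrary choices for illustration) generates many samples, applies each estimator, and compares the empirical mean and variance of the estimates with the theoretical results above.

import numpy as np

rng = np.random.default_rng(42)
reps = 100_000

# --- Examples 5.1 and 5.2: estimators of mu for N(mu, sigma^2), n = 5 ---
mu, sigma, n = 3.0, 2.0, 5
X = rng.normal(mu, sigma, size=(reps, n))

xbar = X.mean(axis=1)                            # mu-hat = X-bar
w = np.array([1, 2, 3, 2, 1]) / 9                # weights defining mu-tilde
mu_tilde = X @ w

print("X-bar:    mean", xbar.mean(), "var", xbar.var())          # approx mu and sigma^2/5 = 0.8
print("mu-tilde: mean", mu_tilde.mean(), "var", mu_tilde.var())  # approx mu and 19 sigma^2/81 ≈ 0.94

# --- Example 5.3: variance estimators with divisor n-1 (S^2) and n (sigma-tilde^2) ---
S2 = X.var(axis=1, ddof=1)
sig_tilde2 = X.var(axis=1, ddof=0)
print("S^2:           mean", S2.mean())          # approx sigma^2 = 4 (unbiased)
print("sigma-tilde^2: mean", sig_tilde2.mean())  # approx (n-1)/n * sigma^2 = 3.2 (biased)

# --- Example 5.4: proportion of successes for Bernoulli(p) ---
p, n_b = 0.3, 50
B = rng.binomial(1, p, size=(reps, n_b))
p_hat = B.mean(axis=1)
print("p-hat: mean", p_hat.mean(), "var", p_hat.var())  # approx p and p(1-p)/n = 0.0042

# --- Example 5.5: U[theta, theta+1]; X-bar is biased by 1/2, X-bar - 1/2 is not ---
theta, n_u = 1.7, 20
U = rng.uniform(theta, theta + 1, size=(reps, n_u))
ubar = U.mean(axis=1)
print("X-bar:       mean", ubar.mean())          # approx theta + 1/2
print("X-bar - 1/2: mean", (ubar - 0.5).mean())  # approx theta; variance 1/(12n) in both cases

In each case the empirical mean agrees with the theoretical expectation up to simulation noise, and the biased variance estimator σ̃² is visibly centered below σ².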
Summary of point estimation

The key ingredients are:

• A probability model for the data.
• Unknown model parameter(s) to be estimated.
• An estimation procedure, or estimator.
• The sampling distribution of the estimator.

The main points are:

• Application of the estimation procedure, or estimator, to a particular observed data set results in an estimate of the unknown value of the parameter. The estimate will be different for different random data sets.
• The properties of the sampling distribution (bias, variance) tell us how good our estimator is, and hence how good our estimate is likely to be.
• Estimation procedures can occasionally give poor estimates due to random sampling error. For good estimators, the probability of obtaining a poor estimate is lower.