3 Sufficient Statistics

1. Sufficiency Principle. An experimenter uses the information in a sample X1, X2, . . . , Xn
to make inferences about an unknown parameter θ. If n is large, then the observed sample
x1, x2, . . . , xn is a long list of numbers that may be hard to interpret. An experimenter
might therefore wish to summarize the information in a sample by determining a few key features
of the sample values. This is usually done by computing statistics, i.e., functions of the sample.
Any statistic, T(X), defines a form of data reduction or data summary. An experimenter
who uses only the observed value of the statistic, T(x), rather than the entire observed sample,
x, will treat as equal two samples, x and y, that satisfy T(x) = T(y), even though the actual
sample values may differ in some ways.

A sufficient statistic for a parameter θ is a statistic that, in a certain sense, captures all
the information about θ contained in a sample. Any additional information in the sample,
beyond the value of the sufficient statistic, provides no further information about
θ. These considerations lead to the data reduction technique known as the Sufficiency
Principle.

Sufficiency Principle: If T(X) is a sufficient statistic for θ, then any inference about θ
should depend on the sample X only through T(X). That is, if x and y are two sample
points such that T(x) = T(y), then the inference about θ should be the same whether
X = x or X = y is observed.

2. Example. In earlier courses in statistics we relied on our intuition to define the
statistics we used for estimation and hypothesis testing. For example, in a sequence of
Bernoulli trials, in each of which there is a probability θ of success, we used ∑_{i=1}^n x_i as the
basis for our estimate of θ and in our hypothesis tests about θ. [All x_i are 0 or 1.] Other
information, such as the order in which the 0s and 1s appear, was not considered relevant,
and the statistic T(X) = ∑_{i=1}^n X_i (the number of X_i's that equal 1) was sufficient for our
purposes. We now give this notion a general mathematical formulation.

Suppose that we are given a value of T(X) = ∑_{i=1}^n X_i = t. If we know the value t, can
we gain any further information about θ by looking at other functions of X1, X2, . . . , Xn?

One way to answer this question is by looking at the conditional distribution of X1, X2, . . . , Xn
given T(X) = t:

Pr(X_1 = x_1, X_2 = x_2, . . . , X_n = x_n | T(X) = t) = θ^t(1−θ)^{n−t} / [ \binom{n}{t} θ^t(1−θ)^{n−t} ] = 1 / \binom{n}{t}.
Thus the conditional probability does not depend on θ; that is, once ∑_{i=1}^n X_i is known, no
other function of X1, X2, . . . , Xn will shed additional light on the possible value of θ.
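The same point can be checked by simulation. The sketch below is not part of the original notes; it assumes NumPy is available and uses helper names of my own choosing. It draws Bernoulli samples for two different values of θ, conditions on T(X) = t, and compares the resulting conditional distributions of the 0/1 pattern:

```python
import numpy as np

# Simulation sketch (not from the notes): for two values of theta, draw
# Bernoulli samples of size n, keep only those with sum(X) = t, and tabulate
# the conditional distribution of the observed 0/1 pattern.  If T = sum(X_i)
# is sufficient, the frequencies should agree for both thetas, up to
# Monte Carlo error.
rng = np.random.default_rng(0)
n, t, reps = 5, 2, 200_000

def conditional_pattern_freqs(theta):
    samples = rng.binomial(1, theta, size=(reps, n))
    kept = samples[samples.sum(axis=1) == t]          # condition on T(X) = t
    patterns, counts = np.unique(kept, axis=0, return_counts=True)
    return patterns, counts / counts.sum()            # empirical conditional p.f.

for theta in (0.3, 0.7):
    _, freqs = conditional_pattern_freqs(theta)
    print(f"theta = {theta}: conditional pattern frequencies {np.round(freqs, 3)}")

# Both runs should give frequencies close to 1 / C(5, 2) = 0.1 for each of the
# ten patterns with exactly two 1s, illustrating that the conditional
# distribution does not depend on theta.
```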

3. Definition. A statistic T(X) is a sufficient statistic for θ if the conditional
distribution of the sample X1, X2, . . . , Xn given the value of T(X) does not depend on θ.

4. Factorization criterion. Let f(x, θ) denote the joint p.d.f./p.f. of a sample X. A
statistic T(X) is a sufficient statistic for the parameter θ if and only if we can write
f(x, θ) = h(T(x), θ) c(x) for appropriate functions h and c of the indicated variables
(that is, c does not depend on θ). [Recall that this includes the statement that the set of
values x where c(x) ≠ 0 does not depend on θ.]
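As an illustration (a sketch added here, not part of the original notes), the Bernoulli setting of Example 2 can be checked directly against the factorization criterion. With all x_i equal to 0 or 1,

f(x, θ) = ∏_{i=1}^n θ^{x_i}(1−θ)^{1−x_i} = θ^{∑ x_i}(1−θ)^{n−∑ x_i} = h(T(x), θ) c(x),

with T(x) = ∑_{i=1}^n x_i, h(t, θ) = θ^t(1−θ)^{n−t} and c(x) ≡ 1, so T(X) = ∑_{i=1}^n X_i is sufficient for θ.
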
5. Examples. Let X1, X2, . . . , Xn be i.i.d. (a) N(µ, σ²) r.v.s, where σ is known; (b) M(θ);
(c) U(0, θ).
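Sketches for (a) and (c) (outlines only, added here; the full worked solutions belong to the lecture):

(a) With σ known,

f(x, µ) = (2πσ²)^{−n/2} exp{ −(1/(2σ²)) ∑_{i=1}^n (x_i − µ)² } = exp{ (µ/σ²) ∑ x_i − nµ²/(2σ²) } · (2πσ²)^{−n/2} exp{ −(1/(2σ²)) ∑ x_i² },

so taking T(x) = ∑ x_i, the first factor is h(T(x), µ) and the rest is c(x); hence ∑ X_i is sufficient for µ.

(c) For U(0, θ),

f(x, θ) = θ^{−n} ∏_{i=1}^n 1(0 < x_i < θ) = θ^{−n} 1( max_i x_i < θ ) · 1( min_i x_i > 0 ),

so with T(x) = max_i x_i we may take h(t, θ) = θ^{−n} 1(t < θ) and c(x) = 1(min_i x_i > 0); hence X_(n) = max_i X_i is sufficient for θ.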


6. Note. In all the previous examples, the sufficient statistic is a real-valued function
of the sample. All the information about θ in a sample x is summarised in the single
number T(x). Sometimes the information cannot be summarised in a single number and
several numbers are required instead. In such cases a sufficient statistic is a vector, say
T(x) = (T_1(x), . . . , T_r(x)). This situation often occurs when the parameter is also a vector,
say θ = (θ_1, . . . , θ_k), and it is usually the case that the sufficient statistic and the vector
of parameters are of equal length, that is, r = k.

7. Example. Let X1, X2, . . . , Xn be i.i.d. N(µ, σ²) r.v.s, where both µ and σ² are unknown.
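A sketch of the factorization (an outline added here; details as in the lecture): expanding the exponent,

f(x, µ, σ²) = (2πσ²)^{−n/2} exp{ −(1/(2σ²)) ∑_{i=1}^n x_i² + (µ/σ²) ∑_{i=1}^n x_i − nµ²/(2σ²) },

which depends on x only through (∑ x_i, ∑ x_i²), so it is of the form h((∑ x_i, ∑ x_i²), (µ, σ²)) c(x) with c(x) ≡ 1. Hence T(X) = (∑ X_i, ∑ X_i²) is a sufficient statistic for θ = (µ, σ²).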

8. Proposition. Let X1, X2, . . . , Xn be i.i.d. observations from a p.d.f./p.f. f(x, θ) that
belongs to the exponential family, i.e.,

f(x, θ) = exp{ ∑_{i=1}^m p_i(θ) K_i(x) + S(x) + q(θ) }

for all {x : f(x, θ) ≠ 0}, where θ = (θ_1, . . . , θ_k) and the set {x : f(x, θ) ≠ 0} does not
depend on θ. Then

T(X) = ( ∑_{j=1}^n K_1(X_j), ∑_{j=1}^n K_2(X_j), . . . , ∑_{j=1}^n K_m(X_j) )

is a sufficient statistic for θ.

9. Example. Let X1, X2, . . . , Xn be i.i.d. N(µ, σ²) r.v.s, where both µ and σ² are unknown.
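A sketch of how Proposition 8 applies (an outline added here, not a full solution): the N(µ, σ²) density can be written as

f(x, µ, σ²) = exp{ −x²/(2σ²) + (µ/σ²) x − µ²/(2σ²) − (1/2) log(2πσ²) },

which is of the stated exponential family form with p_1(θ) = −1/(2σ²), K_1(x) = x², p_2(θ) = µ/σ², K_2(x) = x, S(x) = 0 and q(θ) = −µ²/(2σ²) − (1/2) log(2πσ²). By Proposition 8, T(X) = (∑_{j=1}^n X_j², ∑_{j=1}^n X_j) is sufficient for (µ, σ²), in agreement with Example 7.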

10. Remarks. (i) In the examples above we found one sufficient statistic for each model
considered. In any problem, there are many sufficient statistics. It is always true that the
complete sample, X, is a sufficient statistic.

(ii) It follows from the factorization criterion that any one-to-one function of a sufficient statistic is also a sufficient statistic.

(iii) Because there are numerous sufficient statistics in a model, we might ask whether one
sufficient statistic is any better than another. Recall that the purpose of a sufficient
statistic is to achieve data reduction without loss of information about the parameter
θ; thus, a statistic that achieves the most data reduction while still retaining all the
information about θ might be considered preferable. The definition of such a statistic is
formalized below.

11. Definition. A sufficient statistic T(X) is called a minimal sufficient statistic if,
for any other sufficient statistic T′(X), T(x) is a function of T′(x).

12. Example. (Two normal sufficient statistics) In Example 5(a) with N(µ, σ²) r.v.s
and σ² known, we concluded that T(X) = ∑ X_i is a sufficient statistic for µ. Instead,
we could write down the factorisation from Example 7 for this problem (σ² is a known value
now) and correctly conclude that T′(X) = (∑ X_i, ∑ X_i²) is a sufficient statistic for µ
in this problem. Clearly, T(X) achieves a greater data reduction than T′(X); we can
write T(X) as a function of T′(X). (Indeed, define v(a, b) = a; then T(X) = ∑ X_i =
v(∑ X_i, ∑ X_i²) = v(T′(X)).) Since T(X) and T′(X) are both sufficient statistics, they
both contain the same information about µ. Thus the additional information about the
value of ∑ X_i² does not add to our knowledge of µ, since σ² is known. Of course, if σ² is
unknown, as in Example 7, T(X) = ∑ X_i is not a sufficient statistic and T′(X) contains
more information about (µ, σ²) than does T(X).

13. Theorem. (Lehmann–Scheffé criterion) Let f(x, θ) be the joint p.d.f./p.f. of a sample
X. Suppose that there exists a function T(x) such that, for any two sample points x and
y, the ratio f(x, θ)/f(y, θ) is constant as a function of θ if and only if T(x) = T(y). Then
T(X) is a minimal sufficient statistic for θ.

14. Example. Let X1, X2, . . . , Xn be i.i.d. (a) B(1, θ); (b) Γ(α, β).
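A sketch for (a) (an outline added here; the Γ(α, β) case is handled similarly): for sample points x, y ∈ {0, 1}^n,

f(x, θ)/f(y, θ) = θ^{∑ x_i}(1−θ)^{n−∑ x_i} / [ θ^{∑ y_i}(1−θ)^{n−∑ y_i} ] = [ θ/(1−θ) ]^{∑ x_i − ∑ y_i},

which is constant as a function of θ if and only if ∑ x_i = ∑ y_i. By the Lehmann–Scheffé criterion, T(X) = ∑_{i=1}^n X_i is a minimal sufficient statistic for θ.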

15. Theorem. If T is a sufficient statistic for the parameter θ and θ̂ is a maximum
likelihood estimator of θ, then we may write θ̂ = θ̂(T ).

Proof. By the factorization criterion, we have f(x, θ) = h(T(x), θ) c(x), where c(x) does not
depend on θ. Thus, maximizing f(x, θ) over θ for fixed x (and, therefore, fixed T(x)) is
equivalent to maximizing h(T(x), θ), and the solution θ̂ will be a function of T(x). The
corresponding estimator θ̂ is thus a function of T.
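For instance (a sketch, not part of the original notes), in the Bernoulli model of Example 2 the likelihood is h(t, θ) = θ^t(1−θ)^{n−t} with t = T(x) = ∑ x_i; maximizing over θ gives θ̂ = t/n = T(X)/n, which is indeed a function of the sufficient statistic T.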
