3 Principles of data reductions and inference
3.1 Data reduction in statistical inference
3.2 Sufficient partition example
3.3 Sufficiency principle
3.4 Neyman Fisher factorization criterion
3.5 Sufficiency examples
3.6 Lehmann and Scheffe’s method for constructing a minimal sufficient partition
3.7 Minimal sufficient examples
3.8 One parameter exponential family densities
3.9 Generalization to a k-parameter exponential family
3.10 Ancillary statistic and ancillarity principle
3.11 Ancillary examples
3.12 Maximum likelihood inference
3.13 Maximum likelihood estimation - an introduction
3.14 Information and likelihood
3.1 Data reduction in statistical inference
In Chapter 1 we dealt with different probability models. These models can be used to describe the population of interest, and finding a good model is our ultimate goal. On the way towards this goal, we use the data to identify the parameter θ ∈ Θ that best describes the model.
The statistical inference problem arises because we do not know the exact value of the parameter. The word parameter is used very generally here. The parameter could be a scalar (e.g., the probability of success in a binomial experiment), or a vector parameter (e.g., the vector with two components (μ,σ2) for the normal distribution), or even a function if we perform nonparametric inference.
Suppose a vector X = (X1, X2, ..., Xn) of n i.i.d. random variables, each with density f(x; θ), is to be observed, and inference on θ ∈ Θ is to be made based on the observations x1, x2, ..., xn.
Let X take values in X, the sample space. The statistician will use the information in the observations x1, x2, ..., xn to make inference about θ.
The statistician wants to summarise the information in the sample by determining a few key features of the sample values, that is, by transforming the sample. Such a transformation T = T(X) is called a statistic.
Typically, dim(T) ≪ n, i.e. by using the statistic we achieve the goal of data reduction. The statistic summarises the data in that, rather than reporting the entire sample x, it reports only that T(x) = t.
The main purpose of this chapter is to discuss how the data reduction is to be performed in some sort of optimal way.
Data reduction in terms of a particular statistic can be thought of as partitioning the sample space X into disjoint subsets
A_t = {x ∈ X : T(x) = t}.
If τ = {t : t = T(x) for some x ∈ X}, then X is represented as a union of disjoint sets (i.e. is partitioned):
X = ⋃_{t ∈ τ} A_t.
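As a small illustrative sketch (not part of the original slides), the following Python snippet enumerates the sample space of n = 3 Bernoulli observations and groups its points by the value of the statistic T(x) = ∑ xi, producing exactly the disjoint sets A_t described above.

from itertools import product
from collections import defaultdict

# Sample space of n = 3 Bernoulli observations: all vectors in {0,1}^3.
n = 3
sample_space = list(product([0, 1], repeat=n))

# Group sample points by the value of the statistic T(x) = sum(x).
partition = defaultdict(list)
for x in sample_space:
    partition[sum(x)].append(x)

# The sets A_t are disjoint and their union is the whole sample space.
for t, A_t in sorted(partition.items()):
    print(f"A_{t} = {A_t}")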
Goal in data reduction:
When using only the value of the statistic T(x), we want to "not lose information" about θ: the whole information about θ should be contained in the statistic.
In particular, we will treat as equal any two samples x and y that satisfy
T(x) = T(y),
even though the actual sample values may be different. Hence we
arrive at the definition of sufficiency.
The information in X about θ can be discussed in terms of partitions
of the sample space.
Definition 3.5 (Sufficient partition)
Suppose that for any set At in a particular partition A = {At, t ∈ τ} the conditional probability
P{X = x | X ∈ At}
does not depend on θ. Then A is a sufficient partition for θ.
Remark 3.5
The partition is defined through a suitable statistic. If the statistic T is such that it generates a sufficient partition of the sample space then the statistic itself is sufficient.
3.2 Sufficient partition example
Exercise 3.13 (at lecture)
Suppose X = (X1, X2, ..., Xn) are i.i.d. Bernoulli with parameter θ, i.e. P(Xi = xi) = θ^{xi}(1 − θ)^{1−xi}, xi = 0, 1.
The partition A = (A0, A1, ..., An), where x ∈ Ar if and only if (iff)
∑_{i=1}^n xi = r,
is sufficient for θ. Correspondingly, the statistic
T(X) = ∑_{i=1}^n Xi
is sufficient for θ.
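As a quick numerical sanity check of Exercise 3.13 (an illustrative sketch, not the lecture solution), the snippet below computes P(X = x | ∑ Xi = r) for two different values of θ; both evaluate to 1/C(n, r), confirming that the conditional distribution is free of θ.

from itertools import product
from math import comb

def cond_prob(x, theta):
    """P(X = x | sum(X) = sum(x)) for i.i.d. Bernoulli(theta) observations."""
    n, r = len(x), sum(x)
    p_x = theta**r * (1 - theta)**(n - r)                  # P(X = x)
    p_r = sum(theta**sum(y) * (1 - theta)**(n - sum(y))    # P(sum(X) = r)
              for y in product([0, 1], repeat=n) if sum(y) == r)
    return p_x / p_r

x = (1, 0, 1, 1, 0)
for theta in (0.2, 0.7):
    # Both values of theta give 0.1 = 1 / C(5, 3).
    print(theta, cond_prob(x, theta), 1 / comb(len(x), sum(x)))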
Given observed value t of T, we know that the observed value x of X is in the partition set At.
Sufficiency means that P(X = x | T = t) is a function of x and t only (i.e., is not a function of θ).
Thus, once the particular realisation t of T has been observed, knowing in addition the particular value x of X would not help to better identify θ. Hence we arrive at the sufficiency principle:
3.3 Sufficiency principle
Remark 3.6 (Sufficiency principle)
The sufficiency principle implies that if T is sufficient for θ and x and y are such that T(x) = T(y), then inference about θ should be the same whether X = x or X = y is observed.
This leads to the following useful criterion:
3.4 Neyman Fisher factorization criterion
Theorem 3.10 (Neyman Fisher Factorization Criterion)
If Xi ∼ f(x, θ), then T(X) = T(X1, X2, ..., Xn) is sufficient for θ iff
L(X, θ) = fθ(X1, X2, ..., Xn) = g(T(X), θ) h(X),
where X, T, θ may all be vectors and g ≥ 0, h ≥ 0. Proof: at lecture.
In other words, we try to factorise the joint distribution into a product of two non-negative functions of which one factor (g(T(X),θ)) may depend on the parameter θ but depends on the sample only through the value of the statistic T whereas the other factor (h(X)) does not depend on the parameter θ.
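As an illustration of the factorization criterion (my own sketch, using the Exponential(θ) family rather than anything specific from the slides): the joint density θ^n exp(−θ ∑ xi) factorizes with g(t, θ) = θ^n exp(−θ t) and h(x) = 1, and the code below checks this numerically for several values of θ.

import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=6)     # an arbitrary sample

def joint_density(x, theta):
    return np.prod(theta * np.exp(-theta * x))

def g(t, theta, n):
    return theta**n * np.exp(-theta * t)   # depends on the data only through t = T(x)

h = 1.0                                    # h(x) = 1 for this family
T = x.sum()                                # the sufficient statistic T(x) = sum of the observations

for theta in (0.5, 1.3, 4.0):
    assert np.isclose(joint_density(x, theta), g(T, theta, len(x)) * h)
print("L(x, theta) = g(T(x), theta) * h(x) verified numerically")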
Sufficient partitions can be ordered and the coarsest partition (i.e. the one that contains the smallest number of sets) is called the minimal sufficient partition.
Suppose T is sufficient and T(X) = g1(U(X)), where U is a statistic and g1 is a known function. Then U must also be sufficient for θ. Indeed:
L(X, θ) = g(T(X), θ) h(X) = g(g1(U(X)), θ) h(X) = ḡ(U(X), θ) h(X),
which means that U(X) is also sufficient.
But, generally speaking, U induces a finer (or at least no coarser) partition than T, since it might happen that U1 ≠ U2 and yet g1(U1) = g1(U2). Thus any refinement of a sufficient partition is also sufficient.
In applications we look for the coarsest partition that is still sufficient, because this means the greatest data reduction without loss of information on θ.
From the above, we see that the statistic that induces this coarsest partition will be a function of any other sufficient statistic. Such a statistic is called the minimal sufficient statistic.
Properties of sufficient statistics:
i) If T is sufficient, so is any one-to-one function of T (since it generates the same partition);
ii) If T is minimal sufficient, it is necessarily a function of all other possible sufficient statistics;
iii) If T is sufficient then P(x | t) does not depend on θ. The observed t is a summary of x that contains all the information about θ in the data, under the given family of models. It divides the sample space X into disjoint subsets At, each containing all possible observations x with the same value t.
3.5 Sufficiency examples
Exercise 3.14 (at lecture)
For the following distributions find the sufficient statistic and provide justification.
i) Bernoulli
ii) Univariate normal distribution with unknown μ and σ2
iii) Uniform distribution in [0, θ)
iv) Multivariate normal with unknown μ and Σ.
See the next two slides for some useful formulas.
Fundamental equality 1-dimensional
∑_{i=1}^n (Xi − μ)² = ∑_{i=1}^n (Xi − X̄)² + n(X̄ − μ)²
Indeed,
∑_{i=1}^n (Xi − μ)² = ∑_{i=1}^n (Xi − X̄ + X̄ − μ)²
= ∑_{i=1}^n (Xi − X̄)² + 2 ∑_{i=1}^n (Xi − X̄)(X̄ − μ) + ∑_{i=1}^n (X̄ − μ)²
= ∑_{i=1}^n (Xi − X̄)² + 2(X̄ − μ) ∑_{i=1}^n (Xi − X̄) + n(X̄ − μ)²
= ∑_{i=1}^n (Xi − X̄)² + n(X̄ − μ)²,
since ∑_{i=1}^n (Xi − X̄) = 0.
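A short numerical verification of the fundamental equality (an illustrative sketch only):

import numpy as np

rng = np.random.default_rng(1)
x, mu = rng.normal(size=20), 0.7
lhs = np.sum((x - mu)**2)
rhs = np.sum((x - x.mean())**2) + len(x) * (x.mean() - mu)**2
assert np.isclose(lhs, rhs)   # the identity holds for any sample and any mu
print(lhs, rhs)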
Fundamental equality p-dimensional
A p-dimensional version of the fundamental equality also exists:
∑_{i=1}^n (Xi − μ)(Xi − μ)⊤ = ∑_{i=1}^n (Xi − X̄)(Xi − X̄)⊤ + n(X̄ − μ)(X̄ − μ)⊤
We also need to use the following property of traces:
X⊤AX = tr(A(XX⊤)),
where tr(A) is the sum of the diagonal elements of the matrix A.
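Both the p-dimensional identity and the trace property can be checked numerically as well (again an illustrative sketch, with arbitrarily chosen dimensions and values):

import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 3
X = rng.normal(size=(n, p))               # rows are the observations X_1, ..., X_n
mu = np.array([0.5, -1.0, 2.0])
xbar = X.mean(axis=0)

lhs = sum(np.outer(xi - mu, xi - mu) for xi in X)
rhs = sum(np.outer(xi - xbar, xi - xbar) for xi in X) + n * np.outer(xbar - mu, xbar - mu)
assert np.allclose(lhs, rhs)              # p-dimensional fundamental equality

# Trace property: v'Av = tr(A vv') for a vector v and a square matrix A.
A = rng.normal(size=(p, p))
v = rng.normal(size=p)
assert np.isclose(v @ A @ v, np.trace(A @ np.outer(v, v)))
print("both identities verified")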
3.6 Lehmann and Scheffe’s method for constructing a minimal sufficient partition
Often the sufficient statistic found by the factorization criterion turns out to be minimal sufficient. However, finding a general method for constructing a minimal sufficient statistic is a difficult task.
Theorem 3.11 (Lehmann–Scheffé's method)
Consider a partition A of X defined, for any x ∈ X, by
A(x) = { y : L(y, θ)/L(x, θ) does not depend on θ }.
That is, the ratio is a function of the type h(y, x). The sets {A(x), x ∈ X} defined above form a partition of X, and this partition is minimal sufficient.
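As a numerical illustration of the criterion (a sketch of my own, for the N(μ, σ²) family): the likelihood ratio L(y, θ)/L(x, θ) is constant in (μ, σ²) exactly when the two samples share the same sum and sum of squares.

import numpy as np
from scipy.stats import norm

def likelihood_ratio(y, x, mu, sigma):
    return np.prod(norm.pdf(y, mu, sigma)) / np.prod(norm.pdf(x, mu, sigma))

x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 2.0, 1.0])     # same sum and sum of squares as x
z = np.array([2.0, 2.0, 2.0])     # same sum as x, but different sum of squares

for mu, sigma in [(0.0, 1.0), (5.0, 2.0)]:
    print(likelihood_ratio(y, x, mu, sigma),   # equals 1 for every (mu, sigma): y is in A(x)
          likelihood_ratio(z, x, mu, sigma))   # varies with (mu, sigma): z is not in A(x)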
Proof. (Discrete case only, for simplicity; the proof may be skipped if you are not interested in the details.)
Step one: We have to show that the sets {A(x), x ∈ X} form a partition. To this end, we have to show that they are either disjoint or coincide.
If we assume there exists a common element z ∈ A(x) ∩ A(u), then A(x) and A(u) must coincide. To see this, consider the following:
i) Take any other x0 ∈ A(x) and any u0 ∈ A(u).
ii) Then
L(z, θ)/L(x0, θ) = [L(z, θ)/L(x, θ)] · [L(x, θ)/L(x0, θ)]
is not a function of θ, because each of the two ratios on the RHS is not a function of θ.
iii) L(z, θ)/L(u, θ) is also not a function of θ, since z ∈ A(u).
iv) But then
L(x0, θ)/L(u, θ) = [L(x0, θ)/L(x, θ)] · [L(x, θ)/L(z, θ)] · [L(z, θ)/L(u, θ)]
is not a function of θ, since the RHS is not a function of θ.
The conclusion from this chain of statements is that an arbitrary x0 ∈ A(x) must belong to A(u) as well.
In the same way it can be argued that u0 ∈ A(x), and u0 was arbitrary in A(u). Therefore A(x) = A(u) must hold if they have one common element z. This shows that {A(x), x ∈ X} is a partition.
Step two: We want to show that the above defined partition A is minimal sufficient. Remember that we consider the discrete case only. First, we show that the partition is sufficient. Fix x and consider the conditional probability
P_θ(Y = y | Y ∈ A(x)) = P_θ(Y = y, Y ∈ A(x)) / P_θ(Y ∈ A(x)),
which equals 0 if y is not in A(x), and equals P_θ(Y = y) / P_θ(Y ∈ A(x)) if y ∈ A(x).
Since for every z ∈ A(x) the ratio L(z, θ)/L(x, θ) = h(z, x) is not a function of θ, we have P_θ(Y = z) = h(z, x) P_θ(Y = x), and hence
P_θ(Y ∈ A(x)) = ∑_{z ∈ A(x)} P_θ(Y = z) = ∑_{z ∈ A(x)} h(z, x) P_θ(Y = x).
Therefore, for y ∈ A(x),
P_θ(Y = y | Y ∈ A(x)) = h(y, x) / ∑_{z ∈ A(x)} h(z, x),
which does not depend on θ; that is, A is a sufficient partition.
Now we want to show that A is a minimal sufficient partition. Take any A(x) and fix any y ∈ A(x). Assume that v = v(Y) is also sufficient.
If y and z are such that v(y) = v(z), then by the factorization theorem (v is assumed to be sufficient) we have
L(y, θ) = g(v(y), θ) h*(y) = g(v(z), θ) h*(y) = L(z, θ) h*(y)/h*(z),
i.e. L(z, θ)/L(y, θ) = h*(z)/h*(y) is not a function of θ.
But this means that
L(z, θ)/L(x, θ) = [L(z, θ)/L(y, θ)] / [L(x, θ)/L(y, θ)]
is not a function of θ, because the RHS is not. Hence y and z are in the same class A(x). This means that every class of the partition generated by v is contained in a class of A, and so A must be the coarsest sufficient partition. □
3.7 Minimal sufficient examples
Exercise 3.15 (at lecture)
For the following distributions find the minimal sufficient statistic and provide justification.
i) i.i.d. Bernoulli;
ii) i.i.d. normal with unknown μ and σ2;
iii) i.i.d. uniform in (0, θ);
iv) i.i.d. Cauchy(θ): an example showing that sometimes the dimension of the minimal sufficient statistic can be quite large, even equal to the sample size n itself (a numerical illustration is sketched below).
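The Cauchy case in iv) can be previewed numerically (an illustrative sketch of mine, not the lecture solution): two samples sharing the same sum and sum of squares still have a likelihood ratio that changes with θ, so no statistic of that form is sufficient; the minimal sufficient statistic is essentially the ordered sample itself.

import numpy as np
from scipy.stats import cauchy

x = np.array([0.0, 1.0, 5.0])
y = np.array([-1.0, 3.0, 4.0])    # same sum (6) and same sum of squares (26) as x

def likelihood(sample, theta):
    return np.prod(cauchy.pdf(sample, loc=theta))

for theta in (-2.0, 0.0, 2.0):
    # The ratio varies with theta, so x and y lie in different Lehmann-Scheffe classes.
    print(theta, likelihood(y, theta) / likelihood(x, theta))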
3.8 One parameter exponential family densities
A density f(x, θ) is a one-parameter exponential family density if θ ∈ Θ ⊆ R¹ and
f(x, θ) = a(θ) b(x) exp(c(θ) d(x)),
with c(θ) strictly monotone.
The joint density of n i.i.d. observations is:
L(x, θ) = ∏_{i=1}^n a(θ) b(xi) exp(c(θ) d(xi)) = a(θ)^n ( ∏_{i=1}^n b(xi) ) exp( c(θ) ∑_{i=1}^n d(xi) ).
In this case we have:
L(x, θ)/L(y, θ) = ( ∏_{i=1}^n b(xi)/b(yi) ) exp( c(θ) [ ∑_{i=1}^n d(xi) − ∑_{i=1}^n d(yi) ] ),
which is not a function of θ iff ∑_{i=1}^n d(xi) = ∑_{i=1}^n d(yi).
So, if x is any point in X, then
A(x) = { y : ∑_{i=1}^n d(xi) = ∑_{i=1}^n d(yi) },
and the sets in the minimal sufficient partition are contours of ∑_{i=1}^n d(xi). Hence, T = ∑_{i=1}^n d(Xi) is minimal sufficient.
Remark 3.7
It turns out that a lot of the standard distributions belong to the one-parameter exponential family:
Exponential(θ): f(x, θ) = θ exp(−θx), x > 0, θ > 0;
Poisson(θ);
Bernoulli(θ);
and others. However, there are many distributions outside the above class, too. For example, the uniform (0, θ) distribution and the Cauchy(θ) distribution do not belong to the exponential family.
Example 3.19
The exponential density:
f(x,θ) = θexp(−θx), x > 0, θ > 0
belongs to the one parameter exponential family of densities, with
a(θ) = θ, b(x) = 1, c(θ) = −θ and d(x) = x. Therefore, T(X) = ∑_{i=1}^n Xi is minimal sufficient.
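A quick numerical confirmation of Example 3.19 (my own sketch): two samples with equal ∑ xi have identical Exponential(θ) likelihoods for every θ, exactly as the minimal sufficient partition predicts.

import numpy as np

def exp_likelihood(x, theta):
    return np.prod(theta * np.exp(-theta * x))

x = np.array([0.5, 1.5, 2.0])
y = np.array([1.0, 1.0, 2.0])     # same sum as x, different individual values

for theta in (0.3, 1.0, 2.5):
    assert np.isclose(exp_likelihood(x, theta), exp_likelihood(y, theta))
print("samples with equal sums carry the same information about theta")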
Exercise 3.16 (at lecture)
Show that the following densities belong to the exponential family of densities and identify the minimal sufficient statistic for each of the distributions.
i) Poisson(θ): f(x, θ) = e^{−θ} θ^x / x!, x ∈ {0, 1, 2, …}, θ > 0
ii) Bernoulli(θ): f(x, θ) = θ^x (1 − θ)^{1−x}, x ∈ {0, 1}, θ ∈ (0, 1)
iii) Normal N(θ, 1): f(x, θ) = (1/√(2π)) e^{−(x−θ)²/2}, x ∈ R
iv) Normal N(0, θ²): f(x, θ) = (1/√(2πθ²)) e^{−x²/(2θ²)}, x ∈ R, θ² > 0.
3.9 Generalization to a k- parameter exponential family
It is natural to define the k-parameter exponential family (k ≥ 1) via:
f(x; θ1, …, θk) = a(θ1, …, θk) b(x) exp( ∑_{j=1}^k cj(θ1, …, θk) dj(x) ),
where the cj(·) are certain smooth functions of the k-dimensional parameter vector θ = (θ1, …, θk)′.
In order to avoid degenerate cases, it is also required that the k × k matrix of partial derivatives
( ∂cj(θ1, …, θk)/∂θl ),  j = 1, …, k;  l = 1, …, k,
has a non-zero determinant.
The minimal sufficient (vector) statistic is:
T = ( ∑_{i=1}^n d1(Xi), …, ∑_{i=1}^n dk(Xi) )⊤.
Example 3.20
Consider a Normal density, which belongs to the two-parameter exponential family of densities:
f(x; μ, σ²) = (1/√(2πσ²)) exp( −(x − μ)²/(2σ²) )
= (1/√(2πσ²)) exp( −(x² − 2xμ + μ²)/(2σ²) )
= (1/√(2πσ²)) exp( −μ²/(2σ²) ) exp( (μ/σ²) x − x²/(2σ²) ).
Hence d1(x) = x, d2(x) = x², and
T = ( ∑_{i=1}^n Xi, ∑_{i=1}^n Xi² )⊤
is minimal sufficient for θ = (μ, σ²)⊤.
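For Example 3.20, here is a numerical sketch (my own) checking that two samples sharing (∑ xi, ∑ xi²) have identical normal likelihoods for every choice of (μ, σ²):

import numpy as np
from scipy.stats import norm

def normal_likelihood(x, mu, sigma2):
    return np.prod(norm.pdf(x, loc=mu, scale=np.sqrt(sigma2)))

x = np.array([0.0, 1.0, 5.0])
y = np.array([-1.0, 3.0, 4.0])    # same sum and sum of squares as x

for mu, sigma2 in [(0.0, 1.0), (2.0, 4.0), (-1.0, 0.5)]:
    assert np.isclose(normal_likelihood(x, mu, sigma2),
                      normal_likelihood(y, mu, sigma2))
print("(sum, sum of squares) determines the normal likelihood completely")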
Exercise 3.17 (at lecture)
Show that the Beta(θ1,θ2) density belongs to the two-parameter exponential family of densities and identify the sufficient statistic for θ1 and θ2 given that
f(x, θ1, θ2) = (1/B(θ1, θ2)) x^{θ1−1} (1 − x)^{θ2−1},  x ∈ (0, 1),  θ1, θ2 > 0.
Here B(θ1, θ2) is the Beta function that has given rise to the name of the density family. It serves to normalise the density so that it integrates to one. By definition,
B(θ1, θ2) = ∫₀¹ x^{θ1−1} (1 − x)^{θ2−1} dx.
3.10 Ancillary statistic and ancillarity principle
Consider again X = (X1, X2, …, Xn): i.i.d. with f(x, θ), θ ∈ Rᵏ.
Definition 3.6 (Ancillary statistic)
A statistic is called ancillary if its distribution does not depend on θ.
Intuitively, it would seem that knowledge of an ancillary should not help in inference about θ. However, an ancillary, when used in conjunction with another statistic, sometimes does help in inference about θ. Inference for θ could be improved if done conditionally on the ancillary.
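As a small simulation sketch of ancillarity (my own illustration, not from the slides): for i.i.d. N(θ, 1) data the sample range X(n) − X(1) is ancillary, and its simulated quantiles agree, up to Monte Carlo error, for very different values of θ.

import numpy as np

rng = np.random.default_rng(3)

def simulated_range_quantiles(theta, n=5, reps=100_000):
    samples = rng.normal(loc=theta, scale=1.0, size=(reps, n))
    ranges = samples.max(axis=1) - samples.min(axis=1)
    return np.quantile(ranges, [0.25, 0.5, 0.75])

for theta in (-10.0, 0.0, 10.0):
    # The quantiles are essentially identical: the range alone tells us nothing about theta.
    print(theta, simulated_range_quantiles(theta))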
The most important case occurs when a statistic T is minimal sufficient for θ but its dimension is greater than that of θ.
We can write T = (T1′, T2′)′, where T2 has a marginal distribution not depending on θ. The d