F71SM STATISTICAL METHODS
6 SAMPLING DISTRIBUTIONS, CENTRAL LIMIT THEOREM, t and F DISTRIBUTIONS
6.1 Introduction – sampling distributions
Let Xi, i = 1, 2, …, n be i.i.d. r.v.s, each with the same distribution as a r.v. X with mean μ and variance σ².
Then X = (X1, X2, . . . , Xn) is a random sample of (or from) the population variable X.
A function of a random sample which does not involve unknown parameters is called a statistic. A statistic is a random variable. The value assumed by a statistic for any particular sample can be calculated from the sample data, and the values of the statistic vary from sample to sample. The distribution of a statistic is called a sampling distribution – its properties depend on those of the population variable X and on the sample size. Important examples of statistics are the sample mean and the sample variance. Other examples are the sample median, the sample maximum, and the sample range.
Sample mean: X̄ = (1/n) Σ Xᵢ        Sample variance: S² = (1/(n − 1)) Σ (Xᵢ − X̄)²        (sums over i = 1, …, n)

S² can also be computed as S² = (1/(n − 1)) [Σ Xᵢ² − (1/n)(Σ Xᵢ)²]
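The two forms of S² are easy to compare numerically; a minimal sketch in R (the data vector x is made up for illustration):

```r
x <- c(5.1, 4.8, 6.0, 5.5, 4.6)                # hypothetical sample data
n <- length(x)
xbar   <- sum(x) / n                           # sample mean
s2     <- sum((x - xbar)^2) / (n - 1)          # sample variance (definition form)
s2_alt <- (sum(x^2) - sum(x)^2 / n) / (n - 1)  # computational form
c(xbar, s2, s2_alt)                            # s2 and s2_alt agree, and equal var(x)
```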
6.2 Distribution of the sample sum and sample mean
E[Σ Xᵢ] = Σ E[Xᵢ] = nμ

Var[Σ Xᵢ] = Σ Var[Xᵢ] = nσ² (using independence)

E[X̄] = (1/n) E[Σ Xᵢ] = (1/n) × nμ = μ

Var[X̄] = (1/n²) Var[Σ Xᵢ] = (1/n²) × nσ² = σ²/n
Note that the expected value of the sample mean is the population mean, and the variance of
the sample mean is the population variance divided by the sample size.
SD[X̄] = σ/√n is called the standard error of the sample mean, is denoted s.e.(X̄), and is
a measure of the precision of the sample mean as an estimate of the population mean – the smaller s.e.(X ̄) is, the better. As n increases, s.e.(X ̄) decreases – larger samples provide more precise estimates of population parameters.
Further properties of the distribution of X̄
1. Sampling from a normal population: X ∼ N ⇒ X̄ ∼ N (since X̄ is a sum of independent normal r.v.s), and so (X̄ − μ)/(σ/√n) ∼ N(0, 1)
2. Weak law of large numbers (generalising the version given in section 4.3).
Applying Chebyshev’s inequality to X̄, we get that, for any constant k > 0,

P(|X̄ − μ| ≥ k) ≤ σ²/(nk²) ⇒ P(|X̄ − μ| ≥ k) → 0 as n → ∞.
The probability that the sample mean differs from the population mean by more than any specified amount (however small) tends to zero as the sample size increases.
This result generalises the version of the law relating to a sample proportion to a version relating to a general sample mean.
The histograms below show three simulated data sets on the same scale – the population variable is N(10, 2²). The first plot shows 200 observations from the population; the second shows the means of 200 samples of size 2 from the population; the third shows the means of 200 samples of size 30 from the population.
Summary statistics of the three data sets:
          Population data   Means of samples of size 2   Means of samples of size 30
mean           9.85                   10.03                        10.01
sd             2.06                    1.42                         0.41
min            4.07                    6.23                         8.89
max           15.00                   14.18                        11.04
The three data sets are centred on the same value (the population mean 10) and the spread of the means decreases as the sample size increases. Further, the three data sets all look as though they could come from normal distributions.
These results illustrate well the results above on the distribution of the sample mean (in this case sampling from a normal distribution).
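A simulation along these lines is easy to reproduce; a minimal sketch in R (the seed is an arbitrary choice, not the one used for the plots above):

```r
set.seed(42)                                             # arbitrary seed
obs       <- rnorm(200, mean = 10, sd = 2)               # 200 single observations
means_n2  <- replicate(200, mean(rnorm(2,  10, 2)))      # means of 200 samples of size 2
means_n30 <- replicate(200, mean(rnorm(30, 10, 2)))      # means of 200 samples of size 30
sapply(list(obs, means_n2, means_n30), mean)             # all close to 10
sapply(list(obs, means_n2, means_n30), sd)               # close to 2, 2/sqrt(2) = 1.41, 2/sqrt(30) = 0.37
```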
Note: be careful not to confuse the sum of n i.i.d. r.v.s with n times any one of them. e.g. n = 2: let X1 and X2 be i.i.d. with mean μ and variance σ².

E[2X1] = 2μ and E[X1 + X2] = 2μ, but Var[2X1] = 4σ² whereas Var[X1 + X2] = 2σ². The r.v.s 2X1 and X1 + X2 have the same mean but different variances.
6.3 The Central Limit Theorem (CLT)
The CLT is a cornerstone of statistical theory and practice. Informally, it states that sample means become ‘more and more normally distributed’ as the size of the sample on which the means are based increases, almost regardless of the form of the distribution of the population from which the sample is drawn.
Theorem Let X = (X1, X2, …, Xn) be a random sample of a population variable X with mean μ and variance σ², and let X̄ be the sample mean. Then Z = (X̄ − μ)/(σ/√n) → N(0, 1) as n → ∞.

Proof (for the case in which X has a mgf)
Let S = X1 + X2 + · · · + Xn.
Then S has mgf M_S(t) = (M_X(t))ⁿ, and so X̄ = S/n has mgf M_X̄(t) = (M_X(t/n))ⁿ.
Now Z = (X̄ − μ)/(σ/√n) = (√n/σ)X̄ − μ√n/σ, so Z has mgf M_Z(t) = e^(−tμ√n/σ) M_X̄(t√n/σ). That is,

M_Z(t) = e^(−tμ√n/σ) (M_X(t/(σ√n)))ⁿ

⇒ ln M_Z(t) = −tμ√n/σ + n ln M_X(t/(σ√n))

= −tμ√n/σ + n ln(1 + μt/(σ√n) + μ′₂t²/(2σ²n) + o(1/n))

where μ′₂ = E[X²].

Using the Maclaurin expansion ln(1 + x) = x − x²/2 + x³/3 − ⋯ we have

ln M_Z(t) = −tμ√n/σ + n (μt/(σ√n) + μ′₂t²/(2σ²n) − μ²t²/(2σ²n) + o(1/n))

= μ′₂t²/(2σ²) − μ²t²/(2σ²) + o(1) = (μ′₂ − μ²)t²/(2σ²) + o(1) = t²/2 + o(1)

Hence ln M_Z(t) → t²/2 as n → ∞, and so M_Z(t) → e^(t²/2) as n → ∞. Hence, in the limit as n → ∞, Z has the N(0, 1) distribution.
From this we have the asymptotic (‘large sample’) distribution of X̄:

For large n, X̄ ∼ N(μ, σ²/n), approximately.

Equivalently, we have the asymptotic (‘large sample’) distribution of the sample sum: for large n, Σ Xᵢ ∼ N(nμ, nσ²), approximately.
This result is true regardless of the nature of the population variable X (provided only it has finite mean and variance). Of course, the result is exact for all n if we are sampling from a normal distribution in the first place, since X ∼ N ⇒ X̄ ∼ N.
The histograms below show three simulated data sets on the same scale – the population variable is positively skewed, and has mean 4 and variance 8. The first plot shows 200 observations from the population; the second shows the means of 200 samples of size 2 from the population; the third shows the means of 200 samples of size 30 from the population.
Summary statistics of the three data sets
          Population data   Means of samples of size 2   Means of samples of size 30
mean           3.96                    3.99                         4.03
sd             2.61                    2.01                         0.48
min            0.10                    0.60                         2.98
max           15.43                   16.07                         5.64
The three data sets are centred on the same value (the population mean 4) and the spread of the means decreases as the sample size on which the means are based increases. Further, the third data set (means of samples of size 30) looks as though it could come from a normal distribution.
These results illustrate well the results above on the distribution of the sample mean – in this case in sampling from a skewed, non-normal distribution (in fact it was a chi-squared distribution with 4 df).
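The same experiment is easy to run for the skewed case; a minimal sketch in R using the χ²₄ population (arbitrary seed, not the one used above):

```r
set.seed(7)                                              # arbitrary seed
obs       <- rchisq(200, df = 4)                         # skewed single observations (mean 4, variance 8)
means_n30 <- replicate(200, mean(rchisq(30, df = 4)))    # means of 200 samples of size 30
c(mean(means_n30), sd(means_n30))                        # close to 4 and sqrt(8/30) = 0.52
hist(obs); hist(means_n30)                               # skewed vs roughly bell-shaped
```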
Warning! The CLT is sometimes misunderstood. It concerns the distribution of sample means/sums, not of individual values from a population. A large sample from a skewed population will be a skewed sample. The statement ‘we have a lot of data, so it will be approximately normally distributed’ is nonsense. The statement ‘we have a large sample, so the mean (or sum) of the sample is approximately normally distributed’ is correct.
6.4 The CLT in practice – important special cases
1. The normal approximation to the binomial distribution.
X ∼ b(n, p) is the sum of n i.i.d. r.v.s, each b(1, p), and so the CLT applies.
For large n, X ∼ N (np, np(1 − p)) approximately.
2. The normal approximation to the distribution of a sample proportion.
Let P be the proportion of successes in a series of n i.i.d. trials. That is, P = X/n where X ∼ b(n, p).

P has mean p and standard error √(p(1 − p)/n), and the CLT applies.

For large n, P ∼ N(p, p(1 − p)/n) approximately.
3. Any variable which can be expressed as the sum of a large number of i.i.d. r.v.s (with finite mean and variance) can be approximated by a normal distribution.
e.g. Let X ∼ P(λ); then for λ a positive integer, X is the sum of λ i.i.d. r.v.s, each distributed as P(1), and so for large λ, X ∼ N(λ, λ) approximately (also true if λ is not an integer).

e.g. Let X ∼ χ²ₙ; then for n a positive integer, X is the sum of n i.i.d. r.v.s, each distributed as χ²₁, and so for large n, X ∼ N(n, 2n) approximately.
Continuity correction
When we use a normal approximation to calculate a probability associated with a discrete r.v., we are using a continuous distribution to approximate a discrete one. We are effectively superimposing a continuous pdf curve over a set of pmf rectangles and matching the areas. This is best done if we think of the rectangle for, say, X = 16 as covering the interval from X = 15.5 to X = 16.5, and use this interval with the approximating distribution – this improves the approximation.
So, for example, for X ∼ b(200, 0.4), we approximate the probability P(X ≥ 84) by calculating P(X > 83.5), and P(73 ≤ X ≤ 85) by calculating P(72.5 < X < 85.5), where X ∼ N(80, 48).
For X ∼ P(120), we approximate P(X ≥ 110) by calculating P(X > 109.5) where X ∼ N(120,120) and so on.
We do not need a continuity correction when we are approximating a continuous distribution – for example, for X ∼ χ²₁₂₀, we approximate the probability P(X > 150) by calculating P(X > 150) where X ∼ N(120, 240). [Note: in this case other good approximations are available.]
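These approximations are easy to check against the exact discrete probabilities; a quick sketch in R for the b(200, 0.4) example:

```r
# Exact binomial probability P(X >= 84) for X ~ b(200, 0.4)
1 - pbinom(83, size = 200, prob = 0.4)
# Normal approximation with continuity correction: P(X > 83.5), X ~ N(80, 48)
1 - pnorm(83.5, mean = 80, sd = sqrt(48))
# Without the correction the approximation is noticeably worse
1 - pnorm(84, mean = 80, sd = sqrt(48))
```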
6.5 Distribution of the sample variance
As before let X = (X1, X2, …, Xn) be a random sample of a variable X with mean μ and variance σ².
Sample variance: S² = (1/(n − 1)) Σ (Xᵢ − X̄)²

To find the mean of the distribution of S², write

Σ (Xᵢ − μ)² = Σ ((Xᵢ − X̄) + (X̄ − μ))²

= Σ (Xᵢ − X̄)² + Σ (X̄ − μ)² + 2(X̄ − μ) Σ (Xᵢ − X̄)

= Σ (Xᵢ − X̄)² + n(X̄ − μ)²   (since Σ (Xᵢ − X̄) = 0)

Taking expectations,

n Var[X] = E[(n − 1)S²] + n Var[X̄] ⇒ nσ² = (n − 1)E[S²] + nσ²/n

⇒ E[S²] = σ²
The expected value of the sample variance is the population variance.
6.6 Sampling from a normal variable – structural and distributional results
Let X = (X1, X2, …, Xn) be a random sample of a variable X ∼ N(μ, σ²).

Result 1: X̄ and S² are independent r.v.s.

This follows from an argument which is best done in a matrix formulation and is beyond the scope of this course.

Result 2: (n − 1)S²/σ² ∼ χ²ₙ₋₁

From above, we have

Σ (Xᵢ − μ)² = (n − 1)S² + n(X̄ − μ)²

The left-hand r.v. has a (scaled) χ²ₙ distribution, while the furthest right r.v. has a (scaled) χ²₁ distribution. It follows that the other r.v. on the right-hand side of the equation has a (scaled) χ²ₙ₋₁ distribution (proof by mgfs).

That is, a scaled version of the sample variance S² has a chi-squared distribution with parameter n − 1 (‘n − 1 degrees of freedom’). This gives us the sampling distribution of the sample variance.

Note that taking expectations gives E[(n − 1)S²/σ²] = n − 1 and hence E[S²] = σ² (as already noted).
Also, since χ²ₙ₋₁ has variance 2(n − 1), we have Var[S²] = 2σ⁴/(n − 1).
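Result 2 can be checked by simulation; a minimal sketch in R (the values of n, σ² and the seed are arbitrary choices):

```r
set.seed(1)
n <- 10; sigma2 <- 4                                      # arbitrary choices
s2 <- replicate(10000, var(rnorm(n, mean = 0, sd = sqrt(sigma2))))
mean(s2)                      # close to sigma2 = 4
var(s2)                       # close to 2 * sigma2^2 / (n - 1) = 3.56
mean((n - 1) * s2 / sigma2)   # close to n - 1 = 9, the mean of chi-squared(n-1)
```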
Result 3: (X̄ − μ)/(S/√n) ∼ tₙ₋₁   [compare (X̄ − μ)/(σ/√n) ∼ N(0, 1)]

This result concerns the distribution of the standardised sample mean when the population variance σ² is not known.

When σ is known, we have (X̄ − μ)/(σ/√n) ∼ N(0, 1).
When σ2 is not known, it is estimated by S2 and the resulting distribution is no longer normal but has a closely-related distribution called ‘Student’s t’ or just ‘t’ (the name ‘Student’s t’ is after W. S. Gosset, whose work was published under the pseudonym ‘Student’). The t-distribution is symmetrical but has higher variation than N(0,1). It has a single parameter, which is referred to as the number of ‘degrees of freedom’.
Definition of tₙ: For n ≥ 1, tₙ = U/√(V/n) where U ∼ N(0, 1), V ∼ χ²ₙ, and U, V are independent.
The t distribution is like N(0,1) but with ‘fatter tails’.
tₙ has mean 0 and variance n/(n − 2) for n > 2 (variance → 1 as n → ∞).

tₙ → N(0, 1) as n → ∞.
For the record, the pdf of tₙ is f(t) = [Γ((n + 1)/2) / (Γ(n/2)√(nπ))] (1 + t²/n)^(−(n+1)/2).
Tables are available: NCST, cdf p42–44, percentage points p45 (note in each column the percentage point tends to the corresponding N(0,1) point as the number of degrees of freedom increases – the bottom row corresponds to N(0,1)).
Looking at a plot of the pdfs of N(0, 1), t₃ and t₁₀ (below), note that t₁₀ is a closer match to N(0, 1) than is t₃.
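The convergence of the percentage points to those of N(0, 1) is easy to see in R:

```r
qt(0.975, df = c(3, 10, 30, 1000))  # 3.182, 2.228, 2.042, 1.962
qnorm(0.975)                        # 1.960 -- the limiting value
```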
6.7 Two-sample situations
In some situations we want to study the difference, if any, between the means of two populations. We do this via the distribution of the difference between the two sample means.
In other cases, we may want to study the difference, if any, between the variances of two populations. We do this via the distribution of the ratio of the sample variances.
Let X = (X1, X2, …, Xn) be a random sample of size n of a population variable X with mean μX and variance σX², and let Y = (Y1, Y2, …, Ym) be a random sample of size m of a population variable Y with mean μY and variance σY², with X and Y independent. Let the sample means be X̄, Ȳ and the sample variances be SX², SY².
Difference between the sample means for two independent random samples

X̄ − Ȳ has mean μX − μY and variance σX²/n + σY²/m.

For large n, m, we have X̄ − Ȳ ∼ N(μX − μY, σX²/n + σY²/m) approximately (exact for X, Y normal).
Equal variance case: in the case of sampling from normal distributions with σX² = σY² = σ², we have

X̄ − Ȳ ∼ N(μX − μY, σ²(1/n + 1/m))

so

((X̄ − Ȳ) − (μX − μY)) / (σ√(1/n + 1/m)) ∼ N(0, 1),

and ((X̄ − Ȳ) − (μX − μY)) / (Sp√(1/n + 1/m)) ∼ tₙ₊ₘ₋₂, where Sp² is a ‘pooled’ estimate of σ², given by

Sp² = ((n − 1)SX² + (m − 1)SY²) / (n + m − 2)
Ratio of two sample variances – sampling from independent normal distributions
First note (n − 1)SX²/σX² ∼ χ²ₙ₋₁ and (m − 1)SY²/σY² ∼ χ²ₘ₋₁, independent.

Definition: For n, m ≥ 1, Fₙ,ₘ = (U/n)/(V/m) where U ∼ χ²ₙ, V ∼ χ²ₘ and U, V are independent.

Note that X ∼ Fₙ,ₘ ⇔ 1/X ∼ Fₘ,ₙ.

Tables of upper percentage points are available: NCST, p50–55.

We scale the sample variances by the corresponding population variances and construct the ratio. The resulting two-parameter distribution is called F (after R. A. Fisher), and

(SX²/σX²) / (SY²/σY²) ∼ Fₙ₋₁,ₘ₋₁
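The reciprocal property links upper and lower percentage points, which is how the tables are used in practice; a quick check in R:

```r
qf(0.95, df1 = 5, df2 = 10)      # upper 5% point of F(5,10), about 3.33
1 / qf(0.05, df1 = 10, df2 = 5)  # the same value, via 1/X ~ F(10,5)
```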
6.8 Worked examples
6.1 A random sample of 9 observations is taken from a N(35,16) distribution. Find the probability that the sample mean exceeds 36.
X̄ ∼ N(35, 16/9), so

P(X̄ > 36) = P(Z > (36 − 35)/(4/3)) = P(Z > 0.75) = 1 − 0.7734 = 0.2266
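A one-line check in R:

```r
1 - pnorm(36, mean = 35, sd = 4/3)  # 0.2266
```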
6.2 A random sample of 200 observations is taken from a r.v. with mean 35 and variance 16. Find the (approximate) probability that the sample mean assumes a value between 34.6 and 35.3.
X̄ ∼ N(35, 16/200) approximately (by CLT), so

P(34.6 < X̄ < 35.3) = P((34.6 − 35)/√(16/200) < Z < (35.3 − 35)/√(16/200)) = P(−1.414 < Z < 1.061) = 0.8556 − 0.0786 ≈ 0.777
6.3 [Statement missing from the source; only the final step of the solution survives:]

P(Z > (60000 − 50000)/√…) = P(Z > 1.414) = 0.079
6.4 In sampling from a N(μ, 25) r.v., find how many observations are required to ensure that the sample mean differs from the population mean by at most 1 with probability at least 0.9, i.e. to ensure P(|X̄ − μ| ≤ 1) ≥ 0.9.
Let the sample size be n. Then X̄ ∼ N(μ, 25/n), and

P(|X̄ − μ| ≤ 1) ≥ 0.9 ⇒ P(|Z| ≤ 1/(5/√n)) ≥ 0.9 ⇒ √n/5 ≥ 1.6449 ⇒ n ≥ 67.6
So we require at least 68 observations.
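The same sample size falls out directly in R (a sketch):

```r
ceiling((5 * qnorm(0.95))^2)  # (5 * 1.6449)^2 = 67.6, so n = 68
```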
6.5 Find the (approximate) probability of getting
(a) at least 108 heads when a fair coin is tossed 200 times; and (b) at most 138 heads when a fair coin is tossed 300 times.
(a) Let X be the number of heads in 200 tosses.
X ∼ b(200, 0.5) so X ∼ N (100, 50) approximately
P(X ≥ 108) = P(X > 107.5) ≈ P(Z > (107.5 − 100)/√50) = P(Z > 1.06) = 0.1446
(b) Let X be the number of heads in 300 tosses.
X ∼ b(300, 0.5) so X ∼ N (150, 75) approximately
P(X ≤ 138) = P(X < 138.5) ≈ P(Z < (138.5 − 150)/√75) = P(Z < −1.33) = 0.0918
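Both parts can be checked against the exact binomial probabilities in R:

```r
1 - pbinom(107, 200, 0.5)        # exact P(X >= 108)
1 - pnorm(107.5, 100, sqrt(50))  # approximation, 0.1446
pbinom(138, 300, 0.5)            # exact P(X <= 138)
pnorm(138.5, 150, sqrt(75))      # approximation, 0.0918
```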
6.6 Claims arise on a group of policies as a Poisson process with average rate 5 per month. Find the probability that at most 75 claims will arise on this group of policies in a given year.
Number of claims in a year: X ∼ P(60), so X ∼ N(60, 60) approximately, by the CLT.
P(X ≤ 75) = P(X < 75.5) ≈ P(Z < (75.5 − 60)/√60) = P(Z < 2.00) = 0.97725
Using R, the exact probability using X ∼ P (60) is 0.974.
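The two R calls behind these figures:

```r
pnorm(75.5, mean = 60, sd = sqrt(60))  # normal approximation, 0.977
ppois(75, lambda = 60)                 # exact Poisson probability, 0.974
```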
6.7 Consider a random sample of size 16 from a N(100, 400) distribution. Find P(X̄ < 97 and S² > 167).
X̄ ∼ N(100, 400/16), i.e. N(100, 25), so

P(X̄ < 97) = P(Z < (97 − 100)/5) = P(Z < −0.6) = 0.2743

15S²/400 ∼ χ²₁₅, so

P(S² > 167) = P(χ²₁₅ > 15 × 167/400) = P(χ²₁₅ > 6.2625) = 0.975
Now X ̄ and S2 are independent random variables and so
P(X̄ < 97 and S² > 167) = P(X̄ < 97) × P(S² > 167) = 0.2743 × 0.975 = 0.267
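The whole calculation in R (a sketch):

```r
p1 <- pnorm(97, mean = 100, sd = 5)        # P(Xbar < 97) = 0.2743
p2 <- 1 - pchisq(15 * 167 / 400, df = 15)  # P(S^2 > 167) = 0.975
p1 * p2                                    # 0.267, using independence
```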
6.8 A company wishes to estimate the proportion, p, of policies of a particular type that have a certain property. The proportion is to be estimated by finding the proportion, P, of such policies with the property in a random sample.
How large a sample should be selected to be 90% confident that the error of estimation does not exceed 0.015?
Let X be the number of policies in the sample with the property. Assuming the sample size n is large, X ∼ N(np, np(1 − p)) approximately.
We require P(|P − p| ≤ 0.015) ≥ 0.90, i.e. P(|X − np| ≤ 0.015n) ≥ 0.90. Noting that p(1 − p) ≤ 0.25, we have
P(|X − np| ≤ 0.015n) ≈ P(|Z| < (0.015n + 0.5)/√(np(1 − p))) ≥ P(|Z| < (0.015n + 0.5)/(0.5√n))

So we require that (0.015n + 0.5)/(0.5√n) ≥ 1.6449 ⇒ n ≥ 2940
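A sketch in R that finds the smallest such n numerically:

```r
n  <- 1:5000
ok <- (0.015 * n + 0.5) / (0.5 * sqrt(n)) >= qnorm(0.95)
min(n[ok])  # 2940
```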
6.9 Let X1, X2, . . . , Xn be i.i.d. Poisson(λ) r.v.s and let S = X1 + X2 + · · · + Xn. (a) State the distribution of S and write down an expression for P (S = n).
(b) In the case λ = 1, use the CLT to obtain an approximation for P (S = n) in terms of Φ(z) = P(Z ≤ z), the cdf of a standard normal r.v.
(a) S ∼ P(nλ) and P(S = n) = e^(−nλ)(nλ)ⁿ/n!
(b) With λ = 1, S ∼ P(n) and P(S = n) = e⁻ⁿnⁿ/n!
But P (S = n) ≈ P (n − 0.5 < X < n + 0.5), where X ∼ N (n, n)
P(S = n) ≈ P(−1/(2√n) < Z < 1/(2√n)) = Φ(1/(2√n)) − Φ(−1/(2√n)) = 2Φ(1/(2√n)) − 1
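Comparing the exact pmf value with this CLT approximation for a few values of n (a sketch):

```r
n <- c(10, 100, 1000)
dpois(n, lambda = n)              # exact P(S = n) = exp(-n) n^n / n!
2 * pnorm(1 / (2 * sqrt(n))) - 1  # CLT approximation; e.g. 0.0399 vs 0.0399 at n = 100
```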