3 Probability models for data
Let x1, . . . , xn be the observed values in a particular random sample of the
random variable X, whose distribution is unknown. We may wish to use these
data to estimate the probability of an event {X ∈ A}, A ⊆ RX . One way is to
use the empirical probability of the event, in other words the proportion of
the sample values that lie in A,
P̂(A) = #{i : xi ∈ A} / n .
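In R, the empirical probability of an event is simply a proportion. A minimal sketch, using a small placeholder sample (the vector x and the interval (3, 5) are illustrative, not data from this chapter):

```r
# Illustrative sample values (placeholder data, not from this chapter)
x <- c(3.1, 4.7, 2.2, 5.9, 4.1, 3.8, 6.3, 2.9, 4.4, 5.2)

# Empirical probability of the event {3 < X < 5}:
# the proportion of sample values lying in the interval
p_hat <- mean(x > 3 & x < 5)
p_hat  # 5 of the 10 values lie in (3, 5), so 0.5
```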
An alternative approach is to assume that the data were generated as a
random sample from a particular parametric probability model, e.g. N(µ, σ²).
Such models usually contain unknown parameters, e.g. in the previous example
the parameters µ and σ² are unknown. We can use the sample to estimate the
parameters of the distribution, thereby fitting the model to the data. A fitted
model can be used to calculate probabilities of events of interest.
If the chosen model is a good fit, then the empirical and model-based estimated
probabilities of the event should be similar. Small differences between the two
estimates are to be expected, because we have observed only a random sample and
not the entire population: both estimates exhibit random variation around the
true population probability. Large differences between the empirical and
model-based probabilities, however, may indicate that the chosen parametric
model is a poor approximation of the true data-generating process. This is best
illustrated by studying some examples.
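This comparison can be illustrated by simulation. A sketch with made-up parameters: we draw a sample from a known normal distribution, then compare the empirical and fitted-model estimates of one event's probability with the true value:

```r
set.seed(1)
n <- 50
x <- rnorm(n, mean = 100, sd = 10)   # pretend this is the observed sample

# Empirical estimate of P(95 < X < 105): proportion of sample in the interval
p_emp <- mean(x > 95 & x < 105)

# Model-based estimate: fit N(mu, sigma^2) by the sample mean and sd,
# then compute the probability under the fitted model
p_mod <- pnorm(105, mean(x), sd(x)) - pnorm(95, mean(x), sd(x))

# Both estimates vary randomly around the true population probability
p_true <- pnorm(105, 100, 10) - pnorm(95, 100, 10)   # about 0.383
c(empirical = p_emp, model = p_mod, true = p_true)
```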
3.1 Continuous data
Component lifetime data
A sample of n = 50 components was taken from a production line, and their
lifetimes (in hours) determined. A tabulation of the sample values is given
overleaf. A possible parametric model for these data is to assume that they are
a random sample from a normal distribution N(µ, σ²). The parameters µ and
σ² can be estimated from the sample by µ̂ = x̄ = 334.6 and σ̂² = s² = 15.288.
We can informally investigate how well this distribution fits the data by
superimposing the probability density function of a N(334.6, 3.912²) distribution
onto a histogram of the data. This is illustrated in the figure overleaf, which
shows the fit to be reasonably good, particularly for data greater than the mean.
Interval (hours)      Frequency   Percent
323.75 to 326.25          1          2
326.25 to 328.75          0          0
328.75 to 331.25          9         18
331.25 to 333.75         12         24
333.75 to 336.25         11         22
336.25 to 338.75         10         20
338.75 to 341.25          5         10
341.25 to 343.75          1          2
343.75 to 346.25          1          2
Total                    50        100
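The tabulated frequencies give empirical probabilities directly. For example, the empirical probability that a lifetime falls in (331.25, 338.75] is (12 + 11 + 10)/50 = 0.66. A short sketch using the frequencies from the table above:

```r
# Frequencies from the table (interval width 2.5, from 323.75 to 346.25)
freq <- c(1, 0, 9, 12, 11, 10, 5, 1, 1)
prop <- freq / sum(freq)       # empirical probability of each interval

# Empirical probability of (331.25, 338.75]: the 4th, 5th and 6th intervals
sum(prop[4:6])                 # (12 + 11 + 10)/50 = 0.66
```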
Figure: histogram of the component lifetime data together with a
N(334.6, 3.912²) p.d.f. (x-axis: lifetime (hours); y-axis: density).
This figure can be obtained using the R code below. The lines command draws
a curve through the (x, y) co-ordinates provided.
xx <- comp_lifetime$lifetime
xv <- seq(320, 350, 0.1)
yv <- dnorm(xv, mean = mean(xx), sd = sd(xx))
hist(xx, freq = FALSE, breaks = seq(from = 323.75, to = 346.25, by = 2.5),
     xlim = c(320, 350), ylim = c(0, 0.12),
     main = "Histogram of lifetime data with Normal pdf",
     xlab = "lifetime (hours)")
lines(xv, yv)

The fitted normal distribution appears to be a reasonably good fit to the
observed data, so we may use it to calculate estimated probabilities. For
example, consider the question ‘what is the estimated probability that a randomly
selected component lasts between 330 and 340 hours?’. To answer this, let the
random variable X be the lifetime of a randomly selected component. We require
P(330 < X < 340) under the fitted normal model, X ∼ N(334.6, 3.912²):

P(330 < X < 340) = P( (330.0 − 334.6)/3.912 < (X − 334.6)/3.912 < (340.0 − 334.6)/3.912 )
                 = P(−1.18 < Z < 1.38) ,  where Z ∼ N(0, 1)
                 = Φ(1.38) − Φ(−1.18)
                 = 0.9162 − 0.1190
                 = 0.7972 .

Hence, using the fitted normal model we estimate that 79.72% of randomly
selected components will have lifetimes between 330 and 340 hours.

Manchester income data

If we superimpose a normal density curve onto the histogram for these data,
then we see that the symmetric normal distribution is a poor fit, since the data
are skewed. In particular, the normal density extends to negative income values
despite the fact that all of the incomes in the sample are positive.

Figure: histogram of the income data with the p.d.f.
of the fitted normal distribution (x-axis: income (GBP x 1000); y-axis: density).

This figure can be obtained using the following R code:

xx <- income$income
xv <- seq(0, 200, 0.5)
yv <- dnorm(xv, mean = mean(xx), sd = sd(xx))
hist(xx, freq = FALSE, breaks = seq(from = 5, to = 195, by = 10),
     ylim = c(0, 0.030), xlab = "income (GBP x 1000)",
     main = "Histogram of income data with Normal pdf")
lines(xv, yv)

One way forward is to look for a transformation that makes the data appear
more normally distributed. Because the data are strongly positively skewed on
the positive real line, one possibility is to take logarithms. In the figure
below, we see a histogram of the log-transformed income data. The fit of the
superimposed normal p.d.f. now looks reasonable, although there are perhaps
slightly fewer sample observations than might be expected under the normal
model in the left-hand tail and centre. There are also some outliers in the
right-hand tail.

Figure: histogram of log(income) with a normal p.d.f.
(x-axis: log(income); y-axis: density).

This figure can be obtained using the following R code:

lxx <- log(xx)
lxv <- seq(1, 6, 0.05)
lyv <- dnorm(lxv, mean = mean(lxx), sd = sd(lxx))
hist(lxx, freq = FALSE, breaks = seq(from = 1, to = 6, by = 0.5),
     ylim = c(0, 0.80), xlab = "log(income)",
     main = "Histogram of log(income) data with Normal pdf")
lines(lxv, lyv)

Even if it is not clear whether or not we can find a completely satisfactory
parametric model, we will see in a later section that we can still make
approximate inferences about the mean income in the population by appealing to
the central limit theorem.

3.2 Discrete data

Opinion poll data

Let X be the party supported by a randomly selected voter,

X = Conservative        with probability pC
    Labour              with probability pL
    Liberal Democrats   with probability pLD
    UKIP                with probability pU
    Other               with probability pO ,

where ‘Other’ includes all other parties.
As suggested earlier, we can estimate the probabilities pC, pL, etc. by the
proportions of sampled individuals supporting the corresponding party.
Specifically, we obtain the following estimates:

p̂C  = P̂(X = Conservative)      = 369/1000 = 0.369 ,
p̂L  = P̂(X = Labour)            = 314/1000 = 0.314 ,
p̂LD = P̂(X = Liberal Democrats) = 75/1000  = 0.075 ,
p̂U  = P̂(X = UKIP)              = 118/1000 = 0.118 ,
p̂O  = P̂(X = Other party)       = 124/1000 = 0.124 .

It is beyond the scope of this module to consider a joint probability model
for the vector (nC, nL, nLD, nU, nO) containing the numbers of individuals
supporting each of the five possible choices in a sample of size n. However, we
may simplify the situation slightly by focusing on whether or not a randomly
chosen voter supports Labour. Let the random variable XL denote the number of
voters out of the 1000 who support Labour. An appropriate model may be

XL ∼ Bi(n, pL) ,

with n = 1000, and pL estimated by p̂L = 0.314.

We may use the fitted model to answer various questions, e.g. ‘what is the
estimated probability that in a random sample of 1000 voters at least 330 will
support Labour?’. We require P(XL ≥ 330) under the fitted model Bi(1000, 0.314).
It is easiest to use a normal approximation to the binomial distribution, which,
with a continuity correction, gives

P(XL ≥ 330) ≈ 1 − Φ( (329.5 − 1000 × 0.314) / √(1000 × 0.314 × 0.686) )
            = 1 − Φ(1.0561)
            = 0.1455 .

For further details on the normal approximation to the binomial distribution,
see the lecture slides or the supplementary notes available on Blackboard.

An interesting question is whether, in the population, voters are equally
likely to support Labour as they are to support the Conservatives, i.e. is it
true that pL = pC? Even if the population proportions pL and pC are equal, the
numbers supporting Labour and Conservative in the sample will usually differ
slightly, simply due to random variation in the sample selection.
Thus, the sample only contains significant evidence that pL ≠ pC if the
difference between the numbers of people in the sample supporting Labour and
Conservative is ‘large’. However, how do we decide how large the difference
needs to be in order to support the conclusion pL ≠ pC? This kind of question
will be addressed in a later chapter on Hypothesis Testing.
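As a check, both worked probability calculations in this chapter can be reproduced directly in R with pnorm and pbinom, using the fitted parameter values quoted in the text (small discrepancies with the hand calculations arise only from rounding the standardised values to two decimal places):

```r
# Lifetime example: P(330 < X < 340) under X ~ N(334.6, 3.912^2)
p_life <- pnorm(340, mean = 334.6, sd = 3.912) -
          pnorm(330, mean = 334.6, sd = 3.912)
round(p_life, 4)   # close to the hand calculation, 0.7972

# Opinion poll example: P(XL >= 330) for XL ~ Bi(1000, 0.314)
p_exact  <- 1 - pbinom(329, size = 1000, prob = 0.314)   # exact binomial tail
p_approx <- 1 - pnorm(329.5, mean = 1000 * 0.314,
                      sd = sqrt(1000 * 0.314 * 0.686))   # normal approximation
round(c(exact = p_exact, approx = p_approx), 4)          # both near 0.1455
```

Comparing p_exact with p_approx also shows how accurate the continuity-corrected normal approximation is at this sample size.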