STAT 513/413: Lecture 7 What we are after: means
(what is probability, after all?)
Obtaining 2 or fewer heads in 3 coin tosses
How to count the number of heads in repeated tosses of 3 coins – and subsequently estimate the probabilities as frequencies? First, we generate the tosses:
> numrep=10
> set.seed(1)
> alltosses=matrix(floor(2*runif(3*numrep)),numrep,3)
> alltosses
[,1] [,2] [,3]
[1,] 0 0 1
[2,] 0 0 0
[3,] 1 1 1
[4,] 1 0 0
[5,] 0 1 0
[6,] 1 0 0
[7,] 1 1 0
[8,] 1 1 0
[9,] 1 0 1
[10,] 0 1 0
And then, the frequencies and relative frequencies
(continuation of the previous computation)
> apply(alltosses,1,sum)
[1] 1 0 3 1 1 1 2 2 2 1
> apply(alltosses,1,sum) <= 2
[1] TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
> as.numeric(apply(alltosses,1,sum) <= 2)
[1] 1 1 0 1 1 1 1 1 1 1
The latter may better conform to good programming practice, but note that the same result is obtained when the logical values are summed directly:
> sum(apply(alltosses,1,sum) <= 2)
[1] 9
> sum(as.numeric(apply(alltosses,1,sum) <= 2))
[1] 9
> sum(apply(alltosses,1,sum) <= 2)/numrep
[1] 0.9
And the latter is the same as when we simply run
> mean(apply(alltosses,1,sum) <= 2)
[1] 0.9
So, are we done here?
> mean(apply(matrix(floor(2*runif(3*1)),1,3),1,sum) <= 2)
[1] 0
> mean(apply(matrix(floor(2*runif(3*10)),10,3),1,sum) <= 2)
[1] 0.8
> mean(apply(matrix(floor(2*runif(3*10)),10,3),1,sum) <= 2)
[1] 0.7
> mean(apply(matrix(floor(2*runif(3*10)),10,3),1,sum) <= 2)
How about this?
> mean(apply(matrix(floor(2*runif(3*1000000)),1000000,3),1,sum) <=2)
[1] 0.875483
> mean(apply(matrix(floor(2*runif(3*1000000)),1000000,3),1,sum) <=2)
[1] 0.875564
> mean(apply(matrix(floor(2*runif(3*1000000)),1000000,3),1,sum) <=2)
[1] 0.875285
> mean(apply(matrix(floor(2*runif(3*1000000)),1000000,3),1,sum) <=2)
[1] 0.87549
> pbinom(2,3,0.5)
[1] 0.875
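The exact value returned by pbinom can be verified by direct counting of outcomes; a quick Python sketch (illustrative):

```python
from math import comb

# P(at most 2 heads in 3 fair tosses) = (C(3,0) + C(3,1) + C(3,2)) / 2^3 = 7/8
p_at_most_2 = sum(comb(3, k) for k in range(3)) / 2**3
print(p_at_most_2)  # 0.875
```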
After all, what is “probability”?
Philosophical interpretations can be varied, and lengthy…
…but in our practice here we stick to the following principles:
• we evaluate probabilities only in repeatable situations (“chance setups”): we have a well-defined random experiment, such an experiment has outcomes in the sample space of the experiment, and an event is defined as a particular subset of the sample space to be considered; for any given event, only one of two possibilities may hold: it occurs or it does not;
• probability is something that is estimated as the relative frequency of occurrence of an event in a number of repetitions of the experiment; we believe that increasing the number of repetitions gives us a better estimate of the probability
Digression: a “biased coin” – a thumbtack
What is the probability that it falls in the “dangerous” way? (Is there any? If yes, how to find it?)
Further digression: Tour de France
Jacob Bernoulli (1655-1705): Ars Conjectandi (1713)
Law of large numbers, in the version proved later by Kolmogorov:
If X1, X2, . . . , Xn are independent random variables with the same distribution, such that the expected value
μ = E(X1) = E(X2) = . . .
of X1, and thus of all Xi (same distribution!), is well defined, then the mean
X̄ = (1/n) Σ Xi (sum over i = 1, . . . , n)
approaches, for n → ∞, the expectation μ – in the following sense: for every ε > 0,
P[ |X̄ − μ| ≥ ε ] → 0
Usage: to “estimate” (that is, calculate) μ, we take X̄
Note also that μ is not only the expected value of all Xi, but also of X̄ itself:
E(X̄) = (1/n) Σ E(Xi) = (1/n)(nμ) = μ
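The law of large numbers can be watched in action; a minimal Python sketch (my own illustration, using fair 0-1 tosses so that μ = 0.5):

```python
import random

random.seed(7)

def mean_gap(n):
    """|X_bar - mu| for n fair 0-1 tosses, where mu = 0.5."""
    xs = [random.randint(0, 1) for _ in range(n)]
    return abs(sum(xs) / n - 0.5)

# the gap tends to shrink as n grows (convergence in probability)
for n in (100, 10000, 100000):
    print(n, mean_gap(n))
```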
Probabilities are a special case of this scheme
We are interested now in p = P(E), the probability of some event E. We define random variables
Xi = 1 if event E happens in the i-th trial
Xi = 0 if event E does not happen in the i-th trial
If the trials are independent, then the Xi are independent too, and their common mean is
μ = E(Xi) = 1 · P[Xi = 1] + 0 · P[Xi = 0] = P[Xi = 1] = p
So, then we take
X̄ = (1/n) Σ Xi = card{i : Xi = 1} / n
to estimate (that is, calculate) p
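That the mean of 0-1 indicators is exactly the relative frequency can be seen on any toy data; an illustrative Python check (the data are made up):

```python
xs = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]  # made-up 0-1 outcomes of 10 trials

xbar = sum(xs) / len(xs)       # mean of the indicators
freq = xs.count(1) / len(xs)   # card{i : x_i = 1} / n
print(xbar, freq)  # both 0.6
```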
How does it work for probabilities?
> mean(apply(matrix(floor(2*runif(3*10)),10,3),1,sum) <=2)
[1] 0.7
> mean(apply(matrix(floor(2*runif(3*100)),100,3),1,sum) <=2)
[1] 0.92
> mean(apply(matrix(floor(2*runif(3*1000)),1000,3),1,sum) <=2)
[1] 0.854
> mean(apply(matrix(floor(2*runif(3*10000)),10000,3),1,sum) <=2)
[1] 0.873
> mean(apply(matrix(floor(2*runif(3*100000)),100000,3),1,sum) <=2)
[1] 0.87687
> mean(apply(matrix(floor(2*runif(3*1000000)),1000000,3),1,sum) <=2)
[1] 0.874671
And for expected values that are not probabilities
> mean(apply(matrix(floor(2*runif(3*10)),10,3),1,sum))
[1] 1.4
> mean(apply(matrix(floor(2*runif(3*100)),100,3),1,sum))
[1] 1.4
> mean(apply(matrix(floor(2*runif(3*1000)),1000,3),1,sum))
[1] 1.482
> mean(apply(matrix(floor(2*runif(3*10000)),10000,3),1,sum))
[1] 1.493
> mean(apply(matrix(floor(2*runif(3*100000)),100000,3),1,sum))
[1] 1.49576
> mean(apply(matrix(floor(2*runif(3*1000000)),1000000,3),1,sum))
[1] 1.49995
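The value these runs converge to is the exact expectation of the number of heads in 3 fair tosses; a direct Python computation (illustrative):

```python
from math import comb

# E(number of heads in 3 fair tosses) = sum of k * P(X = k), X ~ Binomial(3, 1/2)
mu = sum(k * comb(3, k) for k in range(4)) / 2**3
print(mu)  # 1.5
```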
Measure of precision: variance
Let X1, X2, . . . , Xn be independent random variables with the same distribution:
they (all) have (the same) expected value μ = E(Xi) and also the same variance σ² = Var(Xi)
What is the variance of X̄ = (1/n) Σ Xi? Standard calculation…
Var(X̄) = Var( (1/n) Σ Xi ) = (1/n²) Σ Var(Xi) = (1/n²)(nσ²) = σ²/n
This will serve as the measure of precision – if we know σ²
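The formula Var(X̄) = σ²/n can be checked by simulation; a minimal Python sketch (names and parameters are my own, using fair 0-1 tosses so that σ² = 1/4):

```python
import random

random.seed(13)

def var_of_mean(n, reps):
    """Sample variance of X_bar over many repeated samples of n fair tosses."""
    means = [sum(random.randint(0, 1) for _ in range(n)) / n for _ in range(reps)]
    m = sum(means) / reps
    return sum((x - m) ** 2 for x in means) / (reps - 1)

# theory says Var(X_bar) = sigma^2 / n = 0.25 / n for fair 0-1 tosses
print(var_of_mean(10, 20000))  # about 0.025
```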
Performance guarantee
Once we know σ², we have a bound on the probability that our result deviates from μ by more than ε
Chebyshev(-Bienaymé) inequality: for any random variable Y with finite E(Y) and finite Var(Y),
P[ |Y − E(Y)| ≥ ε ] ≤ Var(Y)/ε²
We apply it to X̄ (with the expected value E(X̄) = μ):
P[ |X̄ − μ| ≥ ε ] ≤ Var(X̄)/ε² = σ²/(nε²)
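The applied bound is easy to compute numerically; an illustrative Python helper (the function name and the example numbers are my own):

```python
def chebyshev_bound(sigma2, n, eps):
    """Chebyshev bound on P(|X_bar - mu| >= eps), given Var(X_i) = sigma2."""
    return sigma2 / (n * eps ** 2)

# e.g. fair-coin indicators: sigma^2 = 1/4, n = 1000, eps = 0.05
print(chebyshev_bound(0.25, 1000, 0.05))  # 0.1, up to rounding
```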
But what if we do not know σ2 (as is often the case)?
Sometimes we can find an upper bound on it
Return to probabilities: we define Xi with values 0 and 1 as above. These Xi happen to have expected value equal to p, and
Var(Xi) = (1 − p)(0 − p)² + p(1 − p)² = p²(1 − p) + p(1 − p)² = p(1 − p)
Given that p is in [0, 1], we have the inequality p(1 − p) ≤ 1/4
and therefore Var(X̄) = p(1 − p)/n ≤ 1/(4n)
So, for 0-1 Xi, we now get from the Chebyshev inequality
P[ |X̄ − μ| ≥ ε ] ≤ σ²/(nε²) ≤ 1/(4nε²)
If we want the probability of an error ε = 0.005 not to exceed δ = 0.01, then we set 1/(4nε²) = δ, from which
n = 1/(4δε²) = 1/(4 × 0.01 × 0.005²) = 1000000
or larger…
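The sample-size calculation can be packaged as a tiny function; a Python sketch (illustrative, the name is my own):

```python
def chebyshev_n(delta, eps):
    """n solving 1/(4 n eps^2) = delta, from the Chebyshev bound for 0-1 data."""
    return 1 / (4 * delta * eps ** 2)

print(chebyshev_n(0.01, 0.005))  # about 1000000
```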
For bounded Xi we can do even better
Hoeffding inequality: Suppose that X1, X2, . . . , Xn are independent random variables, each Xi bounded by an interval [ai, bi] (that is, Xi takes values in [ai, bi] with probability one). If μ = E(X̄), then
P[ |X̄ − μ| ≥ ε ] ≤ 2 exp( −2n²ε² / Σ (bi − ai)² )
In particular, if all [ai, bi] = [0, 1], then P[ |X̄ − μ| ≥ ε ] ≤ 2 e^(−2nε²)
So, if we want this bound on the probability of error ε to be at most δ, we need at least
n = −(1/(2ε²)) log(δ/2)
For δ = 0.01 and ε = 0.005 we obtain n ≥ 105967
For n = 1000000 and ε = 0.005 we actually have δ ≤ 3.8575 × 10⁻²²
But for n = 1000000 and ε = 0.0005 we only obtain the bound δ ≤ 1.213 – which is trivial, because we know that δ ≤ 1 as a probability
To achieve δ = 0.01 for ε = 0.0005, we need n ≥ 10596635
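Both directions of the Hoeffding calculation (bound from a given n, and required n from a target bound) can be sketched in Python (illustrative helpers; names are my own):

```python
from math import ceil, exp, log

def hoeffding_delta(n, eps):
    """Hoeffding bound 2 exp(-2 n eps^2) for [0,1]-valued X_i."""
    return 2 * exp(-2 * n * eps ** 2)

def hoeffding_n(delta, eps):
    """Smallest n making that bound at most delta."""
    return ceil(-log(delta / 2) / (2 * eps ** 2))

print(hoeffding_n(0.01, 0.005))       # 105967
print(hoeffding_delta(10**6, 0.005))  # about 3.86e-22
```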
Sometimes we only can estimate σ²
The well-known estimate of σ² is
s² = (1/(n − 1)) Σ (Xi − X̄)²
and Var(X̄) = σ²/n is then estimated by s²/n
Note: sometimes instead of s² = (1/(n − 1)) Σ (Xi − X̄)² they use (1/n) Σ (Xi − X̄)²
(compare textbook, pp. 151-152 and also pp. 52-53)
The reason is that n is usually large…
…so 1/n differs very little from 1/(n − 1)
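An illustrative Python version of these two estimates (made-up 0-1 data; the function name is my own):

```python
def s_squared(xs):
    """Unbiased estimate of sigma^2: divide the sum of squares by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

xs = [0, 1, 0, 1, 1, 1, 0, 1]   # made-up 0-1 data
print(s_squared(xs))             # estimate of sigma^2
print(s_squared(xs) / len(xs))   # estimate of Var(X_bar) = sigma^2 / n
```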
In fact, it may work that way for probabilities
Suppose again that X1, X2, . . . , Xn are independent random variables such that
P[X1 = 1] = p and P[X1 = 0] = 1 − p
We know that μ = E(Xi) = p, and also that σ² = Var(Xi) = p(1 − p)
So we could estimate this variance by X̄(1 − X̄), right?
But also by (1/(n − 1)) Σ (Xi − X̄)²
And perhaps also by (1/n) Σ (Xi − X̄)²
Are those related in some way?
A computational experiment first
> tosses = floor(2*runif(20))
> tosses
[1] 0 1 0 1 0 0 1 0 0 1 1 0 0 1 1 0 1 1 1 0
> mean(tosses)*(1-mean(tosses))
[1] 0.25
> var(tosses)
[1] 0.2631579
> var(tosses)*19/20
[1] 0.25
Maybe only a coincidence… OK, another one:
> tosses = floor(2*runif(30))
> tosses
[1] 1 1 0 0 0 1 1 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 0 1 1 0 0 1 1 0
> mean(tosses)*(1-mean(tosses))
[1] 0.24
> var(tosses)
[1] 0.2482759
> var(tosses)*29/30
[1] 0.24
So your theory now:
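One way to check the conjectured relation: for 0-1 data, Xi² = Xi, so (1/n) Σ (Xi − X̄)² = X̄ − X̄² = X̄(1 − X̄) exactly. A Python verification on the 20 tosses from the first experiment above (illustrative; function names are my own):

```python
def pq_form(xs):
    """X_bar * (1 - X_bar)."""
    xbar = sum(xs) / len(xs)
    return xbar * (1 - xbar)

def n_divided_ss(xs):
    """(1/n) * sum of (x_i - X_bar)^2."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / n

# the 20 tosses from the first experiment above
tosses = [0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0]
print(pq_form(tosses), n_divided_ss(tosses))  # both 0.25
```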
Appendix: some philosophy for those who like it
The notion of probability useful here is the one that works with repetitions: the probability of a random event is the relative frequency of occurrence of an experiment’s outcome when the experiment is repeated; it is a “limiting” frequency, a frequency in the “long run”.
Infinity is perhaps where we wouldn’t like to be
> mean(apply(matrix(floor(2*runif(3*1000000)),1000000,3),1,sum) <= 2)
[1] 0.874483
... almost immediately
> mean(apply(matrix(floor(2*runif(3*10000000)),10000000,3),1,sum) <= 2)
[1] 0.8750248
... after about half a minute
> mean(apply(matrix(floor(2*runif(3*100000000)),100000000,3),1,sum) <= 2)
[1] 0.8750126
... after a couple of minutes
So: if you couldn’t wait, or wouldn’t be able to, or the computer couldn’t handle it...
Or perhaps merely a very big number
What is the probability that a randomly drawn Canadian is female?
Simple:
number of Canadian females / total number of Canadians