
STAT 513/413: Lecture 7 What we are after: means
(what is probability, after all?)

Obtaining 2 or fewer heads in 3 coin tosses
How do we count the number of heads in repeated sets of 3 coin tosses – and subsequently estimate the probabilities as relative frequencies? First, we generate the tosses:
> numrep=10
> set.seed(1)
> alltosses=matrix(floor(2*runif(3*numrep)),numrep,3)
> alltosses
[,1] [,2] [,3]
[1,] 0 0 1
[2,] 0 0 0
[3,] 1 1 1
[4,] 1 0 0
[5,] 0 1 0
[6,] 1 0 0
[7,] 1 1 0
[8,] 1 1 0
[9,] 1 0 1
[10,] 0 1 0

And then, the frequencies and relative frequencies
(continuation of the previous computation)
> apply(alltosses,1,sum)
[1] 1 0 3 1 1 1 2 2 2 1
> apply(alltosses,1,sum) <= 2
 [1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
> as.numeric(apply(alltosses,1,sum) <= 2)
 [1] 1 1 0 1 1 1 1 1 1 1

The latter may better conform to good programming practices, but note that the same result is obtained when the logical values are summed directly:

> sum(apply(alltosses,1,sum) <= 2)
[1] 9
> sum(as.numeric(apply(alltosses,1,sum) <= 2))
[1] 9
> sum(apply(alltosses,1,sum) <= 2)/numrep
[1] 0.9

And the latter is the same as when we simply run

> mean(apply(alltosses,1,sum) <= 2)
[1] 0.9

So, are we done here?

> mean(apply(matrix(floor(2*runif(3*1)),1,3),1,sum) <= 2)
[1] 0
> mean(apply(matrix(floor(2*runif(3*10)),10,3),1,sum) <= 2)
[1] 0.8
> mean(apply(matrix(floor(2*runif(3*10)),10,3),1,sum) <= 2)
[1] 0.7
> mean(apply(matrix(floor(2*runif(3*10)),10,3),1,sum) <= 2)

How about this?

> mean(apply(matrix(floor(2*runif(3*1000000)),1000000,3),1,sum) <=2)
[1] 0.875483
> mean(apply(matrix(floor(2*runif(3*1000000)),1000000,3),1,sum) <=2)
[1] 0.875564
> mean(apply(matrix(floor(2*runif(3*1000000)),1000000,3),1,sum) <=2)
[1] 0.875285
> mean(apply(matrix(floor(2*runif(3*1000000)),1000000,3),1,sum) <=2)
[1] 0.87549
> pbinom(2,3,0.5)
[1] 0.875

After all, what is “probability”?
Philosophical interpretations can be varied, and lengthy…
…but in our practice here we stick to the following principles:
• we evaluate probabilities only in a repeatable situation (a “chance setup”): we have a well-defined random experiment, such an experiment has outcomes in the sample space of the experiment, and an event is defined as a particular subset of the sample space; for any given event, only one of two possibilities may hold: it occurs or it does not;
• probability is something that is estimated as the relative frequency of occurrence of an event in a number of repetitions of the experiment; we believe that increasing the number of repetitions gives us a better estimate of the probability

Digression: a “biased coin” – a thumbtack
What is the probability that it falls in the “dangerous” way? (Is there any? If yes, how to find it?)

Further digression: Tour de France

Jacob Bernoulli (1655-1705): Ars Conjectandi (1713)
Law of large numbers, in the version proved later by Kolmogorov:
If X1, X2, . . . , Xn are independent random variables with the same distribution, such that the expected value
μ = E(X1) = E(X2) = . . .
of X1 and thus of all of Xi (same distribution!) is well defined
then the mean

    X̄ = (1/n) Σ_{i=1}^n Xi

approaches, for n → ∞, the expectation μ – in the following sense: for every ε > 0,

    P[ |X̄ − μ| ≥ ε ] → 0

Usage: to “estimate” (that is, calculate) μ, we take X̄

Note also that μ is not only the expected value of all Xi, but also the expected value of X̄:

    E(X̄) = (1/n) Σ_{i=1}^n E(Xi) = (1/n)(nμ) = μ
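The law can be watched at work in R; a minimal sketch, assuming Uniform(0,1) variables (so that μ = 0.5) purely for illustration:

```r
# Law of large numbers at work: the running mean of iid Uniform(0,1)
# variables approaches mu = 0.5 as n grows
set.seed(1)
x = runif(1000000)
for (n in c(10, 1000, 1000000)) cat(n, ":", mean(x[1:n]), "\n")
```

The first few means wander; the last ones sit close to 0.5.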

Probabilities are a special case of this scheme
We are interested now in p = P(E), the probability of some event E. We define random variables

    Xi = 1 if event E happens in the i-th trial
    Xi = 0 if event E does not happen in the i-th trial

If the trials are independent, then the Xi are independent too, and their common mean is

    μ = E(Xi) = 1·P[Xi = 1] + 0·P[Xi = 0] = P[Xi = 1] = p

So, then we take

    X̄ = (1/n) Σ_{i=1}^n Xi = card{i : Xi = 1} / n

to estimate (that is, calculate) p
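A quick R illustration of this identity between the mean of indicators and the relative frequency (the value p = 0.3 and the construction of the indicators are illustrative assumptions, not part of the coin example):

```r
# For 0-1 indicator variables, the mean is the relative frequency
# card{i : X_i = 1}/n of the event
set.seed(1)
x = as.numeric(runif(1000) < 0.3)  # indicators of an event with p = 0.3
mean(x)                            # estimate of p
sum(x == 1)/length(x)              # the same relative frequency
```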

How does it work for probabilities
> mean(apply(matrix(floor(2*runif(3*10)),10,3),1,sum) <=2)
[1] 0.7
> mean(apply(matrix(floor(2*runif(3*100)),100,3),1,sum) <=2)
[1] 0.92
> mean(apply(matrix(floor(2*runif(3*1000)),1000,3),1,sum) <=2)
[1] 0.854
> mean(apply(matrix(floor(2*runif(3*10000)),10000,3),1,sum) <=2)
[1] 0.873
> mean(apply(matrix(floor(2*runif(3*100000)),100000,3),1,sum) <=2)
[1] 0.87687
> mean(apply(matrix(floor(2*runif(3*1000000)),1000000,3),1,sum) <=2)
[1] 0.874671

And for expected values that are not probabilities

> mean(apply(matrix(floor(2*runif(3*10)),10,3),1,sum))
[1] 1.4
> mean(apply(matrix(floor(2*runif(3*100)),100,3),1,sum))
[1] 1.4
> mean(apply(matrix(floor(2*runif(3*1000)),1000,3),1,sum))
[1] 1.482
> mean(apply(matrix(floor(2*runif(3*10000)),10000,3),1,sum))
[1] 1.493
> mean(apply(matrix(floor(2*runif(3*100000)),100000,3),1,sum))
[1] 1.49576
> mean(apply(matrix(floor(2*runif(3*1000000)),1000000,3),1,sum))
[1] 1.49995

Measure of precision: variance
Let X1, X2, …, Xn be independent random variables with the same distribution:
they (all) have (the same) expected value μ = E(X) and they also have the same variance σ² = Var(X)

What is the variance of X̄ = (1/n) Σ_{i=1}^n Xi? Standard calculation (independence is what allows the variance of the sum to split):

    Var(X̄) = Var( (1/n) Σ_{i=1}^n Xi ) = (1/n²) Σ_{i=1}^n Var(Xi) = (1/n²)(nσ²) = σ²/n

This will serve as the measure of precision – if we know σ²
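The formula can be checked by simulation; a sketch assuming Uniform(0,1) variables, for which σ² = 1/12:

```r
# Var(Xbar) = sigma^2/n, checked by simulation: the empirical variance
# of many replicated means of n Uniform(0,1) variables
set.seed(1)
n = 100
xbars = replicate(10000, mean(runif(n)))
var(xbars)  # empirical variance of the means
(1/12)/n    # theoretical value sigma^2/n
```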

Performance guarantee
Once we know σ², we have a bound on the probability that our result deviates from μ by more than ε

Chebyshev(-Bienaymé) inequality: for any random variable Y with well-defined E(Y) and finite Var(Y),

    P[ |Y − E(Y)| ≥ ε ] ≤ Var(Y)/ε²

We apply it to X̄ (with the expected value E(X̄) = μ):

    P[ |X̄ − μ| ≥ ε ] ≤ Var(X̄)/ε² = σ²/(nε²)
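The bound is easy to evaluate numerically; a minimal sketch (the helper name and the illustrative values of σ², n, ε are mine):

```r
# Chebyshev bound on P[|Xbar - mu| >= eps]: sigma^2/(n*eps^2)
chebyshev_bound = function(sigma2, n, eps) sigma2/(n*eps^2)
chebyshev_bound(sigma2 = 0.25, n = 10000, eps = 0.05)  # 0.01
```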
But what if we do not know σ2 (as is often the case)?

Sometimes we can find an upper bound on it
Return to probabilities: we define Xi with values 0 and 1 as above. These Xi happen to have expected value equal to p, and

    Var(Xi) = (1−p)(0−p)² + p(1−p)² = p²(1−p) + p(1−p)² = p(1−p)

Given that p is in [0, 1], we have the inequality p(1 − p) ≤ 1/4

And therefore Var(X̄) = p(1 − p)/n ≤ 1/(4n)

So, for 0-1 Xi, we now get from the Chebyshev inequality

    P[ |X̄ − μ| ≥ ε ] ≤ σ²/(nε²) ≤ 1/(4nε²)

If we want the probability of an error ε = 0.005 not to exceed δ = 0.01, then we set 1/(4nε²) = δ, from which

    n = 1/(4δε²) = 1/(4 × 0.01 × 0.005²) = 1000000

or larger…
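The same sample-size calculation, wrapped in a small R helper (the function name is illustrative):

```r
# Repetitions needed so that the worst-case Chebyshev bound
# 1/(4*n*eps^2) equals the target delta
n_chebyshev = function(eps, delta) 1/(4*delta*eps^2)
n_chebyshev(eps = 0.005, delta = 0.01)  # one million repetitions
```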

For bounded Xi we can do even better
Hoeffding inequality: Suppose that X1, X2, …, Xn are independent random variables, each Xi bounded by an interval [ai, bi] (that is, Xi takes values in [ai, bi] with probability one). If μ = E(X̄), then

    P[ |X̄ − μ| ≥ ε ] ≤ 2 exp( −2n²ε² / Σ_{i=1}^n (bi − ai)² )

In particular, if all [ai, bi] = [0, 1], then

    P[ |X̄ − μ| ≥ ε ] ≤ 2 e^(−2nε²)

So, if we want the probability of error ε bounded by δ, we need at least

    n = −(1/(2ε²)) log(δ/2)

For δ = 0.01 and ε = 0.005 we obtain n ≥ 105967

For n = 1000000 and ε = 0.005 we actually have δ ≤ 3.8575 × 10⁻²²

But for n = 1000000 and ε = 0.0005 we actually obtain the bound δ ≤ 1.213 – which is trivial, because we know that δ ≤ 1 as a probability

To achieve δ = 0.01 for ε = 0.0005, we need n ≥ 10596635
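These numbers are easy to reproduce; a sketch for the [0, 1] case (the helper names are illustrative):

```r
# Hoeffding for [0,1]-valued X_i: the bound delta = 2*exp(-2*n*eps^2),
# and the n needed to push that bound below a given delta
hoeffding_delta = function(n, eps) 2*exp(-2*n*eps^2)
n_hoeffding = function(eps, delta) ceiling(-log(delta/2)/(2*eps^2))
n_hoeffding(eps = 0.005, delta = 0.01)     # 105967
hoeffding_delta(n = 1000000, eps = 0.005)  # about 3.86e-22
hoeffding_delta(n = 1000000, eps = 0.0005) # about 1.213 -- trivial
```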

Sometimes we can only estimate σ²
The well-known estimate of σ² is

    s² = (1/(n−1)) Σ_{i=1}^n (Xi − X̄)²

and Var(X̄) is then estimated by

    s²/n = (1/(n(n−1))) Σ_{i=1}^n (Xi − X̄)²

Note: sometimes instead of s² = (1/(n−1)) Σ_{i=1}^n (Xi − X̄)² they use (1/n) Σ_{i=1}^n (Xi − X̄)²
(compare textbook, pp. 151-152 and also pp. 52-53)
The reason is that n is usually large…
…so 1/n differs very little from 1/(n−1)
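In R, var() uses the divisor n − 1; a small sketch comparing the two versions on simulated data:

```r
# The two variance estimates differ only by the factor (n-1)/n
set.seed(1)
x = runif(50)
n = length(x)
s2  = sum((x - mean(x))^2)/(n - 1)  # divisor n-1: what var() computes
s2n = sum((x - mean(x))^2)/n        # divisor n
c(s2, var(x), s2n, var(x)*(n - 1)/n)
```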

In fact, it may work that way for probabilities
Suppose again that X1, X2, …, Xn are independent random variables such that P[X1 = 1] = p and P[X1 = 0] = 1 − p

We know that μ = E(Xi) = p, and also that σ² = Var(Xi) = p(1 − p)

So we could estimate σ² = Var(Xi) by X̄(1 − X̄), right?

But also by s² = (1/(n−1)) Σ_{i=1}^n (Xi − X̄)²

And perhaps also by (1/n) Σ_{i=1}^n (Xi − X̄)²

Are those related in some way?

A computational experiment first
> tosses = floor(2*runif(20))
> tosses
[1] 0 1 0 1 0 0 1 0 0 1 1 0 0 1 1 0 1 1 1 0
> mean(tosses)*(1-mean(tosses))
[1] 0.25
> var(tosses)
[1] 0.2631579
> var(tosses)*19/20
[1] 0.25
Maybe only a coincidence… OK, another one:
> tosses = floor(2*runif(30))
> tosses
[1] 1 1 0 0 0 1 1 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 0 1 1 0 0 1 1 0
> mean(tosses)*(1-mean(tosses))
[1] 0.24
> var(tosses)
[1] 0.2482759
> var(tosses)*29/30
[1] 0.24
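It is not a coincidence; a sketch checking the same agreement on one more simulated sample (the seed is arbitrary):

```r
# For 0-1 data, xbar*(1 - xbar) agrees with sum((x - xbar)^2)/n,
# the divisor-n variance estimate
set.seed(3)
tosses = floor(2*runif(50))
n = length(tosses)
xbar = mean(tosses)
xbar*(1 - xbar)
sum((tosses - xbar)^2)/n
var(tosses)*(n - 1)/n   # the same quantity via var()
```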

So your theory now:

Appendix: some philosophy for those who like it
The notion of probability useful here is the probability that works with repetitions: the probability of a random event is the relative frequency of occurrence of an experiment’s outcome when the experiment is repeated; it is a “limiting” frequency, a frequency in the “long run”.

Infinity is perhaps where we wouldn’t like to be
> mean(apply(matrix(floor(2*runif(3*1000000)),1000000,3),1,sum) <= 2)
[1] 0.874483
... almost immediately
> mean(apply(matrix(floor(2*runif(3*10000000)),10000000,3),1,sum) <= 2)
[1] 0.8750248
... after about half a minute
> mean(apply(matrix(floor(2*runif(3*100000000)),100000000,3),1,sum) <= 2)
[1] 0.8750126
... after a couple of minutes

So: if you couldn’t wait, or wouldn’t be able to, or your computer couldn’t handle it...

Or perhaps merely a very big number

What is the probability that a randomly drawn Canadian is female? Simple:

    number of Canadian females / total number of Canadians