
STAT 513/413: Lecture 12 Classical Monte Carlo
(probabilistic theorems)

Recall: classical Monte Carlo
Classical Monte Carlo deals with the estimation of expected values, with probabilities also falling under this case. This particular problem allows for more than a merely intuitive approach: it is possible to assess the precision more rigorously, via existing mathematical theorems.
Notation: we are still after quantity θ
we still compute its estimate TN
And the subscript N still indicates how many random numbers were used in the computation of TN: the higher N, the more numbers used
But now we have θ = E(TN)
Or even more precisely: we are after θ = E(X), where X has the same distribution as the random numbers X1, X2, …, XN; that is, θ = E(X1) = E(X2) = … = E(XN)
This particular case covers quite a lot of instances; in particular…
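(An editorial sketch, not part of the original slides: in R, this setup is just a mean of simulated values. The target θ = E(X²) for X standard normal, whose true value is 1, is only an illustrative choice.)
# a minimal sketch: estimate theta = E(X^2) for X standard normal (true value 1)
N <- 100000                # how many random numbers we use
x <- rnorm(N)              # X1, X2, ..., XN with the same distribution as X
TN <- mean(x^2)            # the estimate TN of theta
TN                         # should come out close to 1 for large N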
1

Probabilities come as a special case of this scheme
We are interested in p = P(E), the probability of some event E. We define an indicator random variable Y to be
Y = 1 if event E happens, and Y = 0 if event E does not happen
The expected value of Y is
E(Y) = 1·P[Y = 1] + 0·P[Y = 0] = P[Y = 1] = p
Clear?
2

We estimate expected values by averages
(or means by means – but this formulation may confuse some)
We make (that is, generate) N replications of X
X1, X2, …, XN
which are considered independent and having the same distribution as X. Then
θ = E(X1) = E(X2) = … = E(XN)   and   TN = X̄ = (1/N) Σ_{i=1}^{N} Xi
Note that then TN is an unbiased estimator of θ:
E(TN) = E(X̄) = (1/N) Σ_{i=1}^{N} E(Xi) = (1/N)·(Nθ) = θ
Why does Monte Carlo work in this case? A classical result:
3

Jacob Bernoulli (1655-1705), Ars Conjectandi (1713)
Law of large numbers (in the version proved later by Kolmogorov)
If X1, X2, . . . , XN are independent random variables with the same distribution, such that the expected value of all of them (or of one: same distribution!) [exists and] is finite
μ = E(X1) = E(X2) = …
then the average X̄ = (1/N) Σ_{i=1}^{N} Xi converges to the mean μ for N → ∞, in probability:
for every ε > 0,  P[|X̄ − μ| ≥ ε] → 0 as N → ∞
If our estimated value is θ = μ and its estimate is TN = X̄, then this establishes consistency, right?
(A fine point: TN being merely unbiased would not be enough, we still need something more… and there are also other fine points for math enthusiasts)
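(A small sketch, not in the original slide, illustrating the Law of large numbers numerically: running averages of uniform random numbers settle down near μ = 1/2. The distribution, seed and N are arbitrary illustrative choices.)
# running averages of X1, ..., XN drift toward mu (here mu = 0.5)
set.seed(413)                        # arbitrary seed, kept for reproducibility
N <- 10000
x <- runif(N)                        # uniform on (0,1), so mu = 0.5
running <- cumsum(x) / (1:N)         # the average after 1, 2, ..., N numbers
running[c(10, 100, 1000, 10000)]     # gets closer to 0.5 as N grows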
4

For math enthusiasts
5

Probabilities are again a special case
We are interested in p = P(E), the probability of some event E. We define indicator random variables
Yi = 1 if event E happens in the i-th repetition (trial), and Yi = 0 if event E does not happen in the i-th repetition
If the repetitions are independent, then also the Yi are independent and their common mean is
E(Yi) = 1·P[Yi = 1] + 0·P[Yi = 0] = P[Yi = 1] = p
So, to estimate (that is, compute) p we then calculate
Ȳ = (1/N) Σ_{i=1}^{N} Yi = card{i : Yi = 1} / N
The consistency follows from the Law of large numbers
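(Again an editorial sketch, not from the slides: the scheme in R for an arbitrary event, here E = {Z > 1.96} with Z standard normal, so that the true p = 1 − pnorm(1.96) ≈ 0.025 can be checked exactly.)
# estimate p = P(Z > 1.96) by the average of 0-1 indicators
N <- 100000
z <- rnorm(N)
y <- as.numeric(z > 1.96)   # Yi = 1 if the event happens in the i-th repetition
mean(y)                     # Y-bar; compare with 1 - pnorm(1.96)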
6

Let us try it on toy examples
Recall: what is the probability of obtaining 2 or fewer heads in 3 coin tosses? (Answer: 0.875)
A problem for expected values proper (not probabilities): what is the expected number of heads in 3 coin tosses?
(Well, everybody knows that it is 1.5 – but we need a simple problem to start with.)
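(The slides show the run for the probability on the next page; for the expected number of heads the analogous computation might look like the sketch below. This is a guess at the same idiom, not the original code, and its output will vary from run to run.)
# average the simulated numbers of heads in 3 tosses (true value 1.5)
mean(apply(replicate(1000, rbinom(3, 1, 0.5)), 2, sum))
mean(apply(replicate(1000000, rbinom(3, 1, 0.5)), 2, sum))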
7

First: it works for estimating a probability
(We should have kept the seed. Just in case.)
> mean(apply(replicate(1000,rbinom(3,1,0.5)),2,sum) <= 2)
[1] 0.858
> mean(apply(replicate(1000,rbinom(3,1,0.5)),2,sum) <= 2)
[1] 0.854
> mean(apply(replicate(1000,rbinom(3,1,0.5)),2,sum) <= 2)
[1] 0.888
> mean(apply(replicate(10000,rbinom(3,1,0.5)),2,sum) <= 2)
[1] 0.8806
> mean(apply(replicate(10000,rbinom(3,1,0.5)),2,sum) <= 2)
[1] 0.8761
> mean(apply(replicate(1000000,rbinom(3,1,0.5)),2,sum) <= 2)
[1] 0.875105
> mean(apply(replicate(1000000,rbinom(3,1,0.5)),2,sum) <= 2)
[1] 0.874815
> mean(apply(replicate(1000000,rbinom(3,1,0.5)),2,sum) <= 2)
[1] 0.875326
> mean(apply(replicate(10000000,rbinom(3,1,0.5)),2,sum) <= 2)
[1] 0.8748622
8

Works for the expected values as well
9

The evaluation of precision
Classical Monte Carlo has one big advantage over the intuitive approach: we can evaluate the precision in an exact way
For starters, let us - tentatively! - measure it once again by the mean squared error E[(TN − θ)²]
(Why? Because it is quite reasonable and we are able to handle the mathematics - compared to other possible choices.)
An important case is when E(TN) = θ for all θ under consideration
Recall: such estimators are called unbiased
For unbiased estimators, the mean squared error is equal to their variance
E[(TN − θ)²] = E[(TN − E(TN))²] = Var(TN)
Sometimes, we may also look at the square root of the variance, the standard deviation √Var(TN)
10

Performance guarantees II: precision via variance
Variance has direct consequences for the precision via the Chebyshev(-Bienaymé) inequality, which in its general form says that for any random variable Z with both E(Z) and Var(Z) finite
P[|Z − E(Z)| ≥ ε] ≤ Var(Z)/ε²
Thus, if TN is an unbiased estimator of θ, then for every ε > 0,
P[|TN − θ| ≥ ε] ≤ Var(TN)/ε²
So, the variance (or its square root, the standard deviation) is an important indicator of the precision of the estimate TN of θ
And it is evaluated explicitly via the Chebyshev inequality, so no mean squared error anymore! We can do better here!
A side product: if TN is unbiased and we can show mathematically that Var(TN) → 0 for N → ∞, then consistency of TN immediately follows
11

So, again the canonical case
Let X1, X2, …, XN be random variables with the same distribution:
they (all) have (the same) expected value μ and they also have the same variance σ²
What is the variance of X̄ = (1/N) Σ_{i=1}^{N} Xi? Standard calculation…
Var(X̄) = Var((1/N) Σ_{i=1}^{N} Xi) = (1/N²) Σ_{i=1}^{N} Var(Xi) = (1/N²)·Nσ² = σ²/N
This may serve as the measure of precision – only we need to know σ²
(which we may not, however)
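(A quick simulation check, not from the slides: for Xi uniform on (0,1) we know σ² = 1/12, so averages of N = 100 of them should have variance close to 1/1200. The number of replications is arbitrary.)
# empirical variance of X-bar against sigma^2 / N
N <- 100
xbars <- replicate(10000, mean(runif(N)))   # many independent averages X-bar
var(xbars)                                  # should be close to (1/12)/N
(1/12) / N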
12

Performance guarantees II: X̄ via Chebyshev
Once we know σ², we have a bound on the probability that our result deviates from μ by more than ε:
Knowing that E(X̄) = θ, we apply the Chebyshev inequality to X̄
P[|X̄ − θ| ≥ ε] ≤ Var(X̄)/ε² = σ²/(Nε²)
This gives us a lucid expression of precision (as long as we know σ²)
Also, note that, as σ²/(Nε²) → 0 with N → ∞ (whatever σ² is), we automatically have consistency here
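(A small sketch, not in the slides, of what this bound gives numerically; σ² = 1/4 and the values of N and ε below are arbitrary illustrative choices.)
# the Chebyshev bound sigma^2 / (N * eps^2) on P[|X-bar - theta| >= eps]
cheb.bound <- function(sigma2, N, eps) sigma2 / (N * eps^2)
cheb.bound(1/4, 10000, 0.01)      # = 0.25
cheb.bound(1/4, 1000000, 0.01)    # = 0.0025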
Return to precision: what if σ² is not known (as is often the case)?
13

An upper bound can help
First, recall "the other formula" for the variance: this yields Var(Z) = E(Z²) − (E(Z))² ≤ E(Z²)
Note: as E(Z) is usually finite (we are estimating E(Z), so if it were not properly defined, it would be hopeless), the crux is in E(Z²)
And geeks know: E(Z²) < ∞ implies not only Var(Z) < ∞, but also that E(|Z|) < ∞, that is, E(Z) exists and is finite
Recall some very simple, but also very useful principles: if for random variables U and V we have U ≤ V (with probability one), then E(U) ≤ E(V)
In particular, if X² ≤ Y² (with probability one), then E(X²) ≤ E(Y²)
(Beware: nothing like that necessarily follows for variances)
14

A very often used upper bound on variance
If a random variable Z has values in [a, b] (with probability one), then Var(Z) ≤ (b − a)²/4 (try to prove that!)
Combined with the Chebyshev inequality for X̄, we get
P[|X̄ − θ| ≥ ε] ≤ Var(X̄)/ε² ≤ (b − a)²/(4Nε²)
whenever θ = E(X̄) and all Xi forming X̄ have values in [a, b] (with probability one)
15

Probabilities are a special case
In particular, for probabilities: we define Yi with values 0 and 1 as above
these Yi happen to have expected value equal to p and
Var(Yi) = (1 − p)(0 − p)² + p(1 − p)² = p²(1 − p) + p(1 − p)² = p(1 − p)
Given that p is in [0, 1], we have an inequality p(1 − p) ≤ 1/4
And therefore Var(Ȳ) = p(1 − p)/N ≤ 1/(4N)
So, for 0-1 Yi, we now get from the Chebyshev inequality
P[|Ȳ − μ| ≥ ε] ≤ σ²/(Nε²) ≤ 1/(4Nε²)
16

Return to precision: how is this used?
We first select ε to express the precision we would like to achieve
for instance, if we want 2 decimal digits, then ε = 0.005 (or slightly less)
Then, we select the "margin of error" δ: if we are to be outside the bounds with some probability, what probability would we tolerate
for instance, δ = 0.01
We are interested in N that will guarantee us the chosen ε and δ
The Chebyshev inequality says that it will be equal to or larger than the N satisfying σ²/(Nε²) = δ, which is N = σ²/(δε²)
if estimating a probability, we need N ≥ 1/(4δε²) = 1000000
Note: to increase precision by one decimal digit we have to divide ε by 10, hence multiply N by 100
and to decrease the "margin of error" by one decimal digit we have to divide δ by 10, hence multiply N by 10
17

Performance guarantees II: X̄ for bounded Xi
Hoeffding inequality. Suppose that X1, X2, …, XN are independent random variables, each Xi bounded by an interval [ai, bi] (that is, Xi takes values in [ai, bi] with probability one). If θ = E(X̄), then
P[|X̄ − θ| ≥ ε] ≤ 2·exp(−2N²ε² / Σ_{i=1}^{N} (bi − ai)²)
In particular, if all [ai, bi] = [a, b], then
P[|X̄ − θ| ≥ ε] ≤ 2·exp(−2Nε²/(b − a)²)
And then, if [a, b] = [0, 1], P[|X̄ − θ| ≥ ε] ≤ 2·exp(−2Nε²)
(the latter case is also known as the Chernoff bound)
18

The use of this one
Suppose that we have [ai, bi] = [a, b] - and as before we want precision ε and the "margin of error" δ
To have N guaranteeing this, it has to be equal to or greater than the one satisfying
N = −(b − a)²/(2ε²) · log(δ/2)
for indicator variables it is N = −1/(2ε²) · log(δ/2)
So, for δ = 0.01 and ε = 0.005 we get N ≥ 105967
Note: to increase precision by one decimal digit we have to divide ε by 10, hence multiply N by 100
this is the same as for the Chebyshev
But, decreasing the "margin of error" by one decimal digit: dividing δ by 10 means adding log(10)/(2ε²) ≈ 1.1513/ε² to N
That looks cheaper...
19

Some remarks
Note first: all the mentioned inequalities bound some probability from above
thus, to be nontrivial, the bound has to be < 1
Some examples now:
For δ = 0.01 and ε = 0.005 we get N ≥ 105967 by the Hoeffding
In the same situation, for N = 1000000 and ε = 0.005, Hoeffding says we get δ ≤ 3.8575 × 10⁻²²
Great, this is way better than what we hoped for - but fine nonetheless...
Can we expect also better precision here - maybe with slightly higher probability of error?
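(An editorial aside, not from the slides: a quick R check of the numbers quoted above for δ = 0.01 and ε = 0.005, in the indicator/[0,1] case.)
# required N by Chebyshev and by Hoeffding, and the delta Hoeffding gives
N.cheb  <- function(eps, delta) 1 / (4 * delta * eps^2)        # Chebyshev, 0-1 case
N.hoeff <- function(eps, delta) -log(delta / 2) / (2 * eps^2)  # Hoeffding, [0,1] case
N.cheb(0.005, 0.01)               # 1000000
ceiling(N.hoeff(0.005, 0.01))     # 105967
2 * exp(-2 * 1000000 * 0.005^2)   # about 3.8575e-22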
If we try ε = 0.0005, still with N = 1000000, we actually obtain the bound δ ≤ 1.213 - which does not say anything nontrivial: it is a bound on a probability, which is always < 1
To achieve δ = 0.01 for ε = 0.0005, we need N ≥ 10596634
And also: Hoeffding works for bounded Xi, but in that case we also have an upper bound on variance!
20

Chebyshev strikes back
If Xi is within [a, b] (with probability one), then Chebyshev and the inequality on variance together yield
P[|X̄ − θ| ≥ ε] ≤ Var(X̄)/ε² ≤ (b − a)²/(4Nε²)
and the Hoeffding inequality gives
P[|X̄ − θ| ≥ ε] ≤ 2·exp(−2Nε²/(b − a)²)
So, writing δ = (b − a)²/(4Nε²), the two bounds are respectively δ and 2·exp(−1/(2δ))
Analyzing this, we figure out that for small δ, the second bound is better; the situation reverses at about δ ≈ 0.2322026
However, we are interested in small δ, and in such a situation Hoeffding is better
Finally: for given precision ε, we decrease the "margin of error" by increasing N - but we may also do it by decreasing (b − a)² or σ²
21

A pictorial comparison of bounds
[Two plots: the Hoeffding bound and the Chebyshev bound as functions of δ, for δ between 0.00 and 0.25]
22

The comparison of the required N's
For a given precision ε, the required N is greater than or equal to
(b − a)²/(4δε²) = ((b − a)²/(2ε²)) · 1/(2δ) by the Chebyshev bound
and (b − a)²/(2ε²) · (−log(δ/2)) by the Hoeffding bound
Once ε is given, then (b − a)²/(2ε²) is given as well
and we can compare 1/(2δ) and −log(δ/2)
Again, they are equal for δ ≈ 0.2322026, and then for smaller δ Hoeffding gives smaller N
in fact, much smaller for very small δ
23

The comparison of required N for probabilities
[Plot: 1/(2δ) and −log(δ/2) against δ, for δ between 0.0 and 0.5]
24

Sometimes, however, σ² can only be estimated
The well-known estimate of σ² is s² = (1/(N − 1)) Σ_{i=1}^{N} (Xi − X̄)², and Var(X̄) is then estimated by s²/N
Note: sometimes instead of s² = (1/(N − 1)) Σ_{i=1}^{N} (Xi − X̄)² they use (1/N) Σ_{i=1}^{N} (Xi − X̄)²
(compare textbook, pp. 151-152 and also pp. 52-53)
The reason is that - in the present context - N is usually large...
...so 1/N differs very little from 1/(N − 1)
25

In fact, it may work that way for probabilities
Suppose again that Y1, Y2, …, YN are independent random variables such that P[Y1 = 1] = p and P[Y1 = 0] = 1 − p
We know that θ = p = E(Yi) and also that σ² = Var(Yi) = p(1 − p)
We estimate θ by Ȳ = (1/N) Σ_{i=1}^{N} Yi
So we could perhaps also estimate Var(Ȳ) using Ȳ(1 − Ȳ)
but we could do it also using (1/(N − 1)) Σ_{i=1}^{N} (Yi − Ȳ)²
and perhaps also using (1/N) Σ_{i=1}^{N} (Yi − Ȳ)²
Are those related in some way?
26

Let us do a computational experiment first
> tosses = floor(2*runif(20)) ; tosses
[1] 0 1 0 1 0 0 1 0 0 1 1 0 0 1 1 0 1 1 1 0
> mean(tosses)*(1-mean(tosses))
[1] 0.25
> var(tosses)
[1] 0.2631579
> var(tosses)*19/20
[1] 0.25
Maybe only a coincidence… OK, another one:
> tosses = floor(2*runif(30))
> tosses
[1] 1 1 0 0 0 1 1 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 0 1 1 0 0 1 1 0
> mean(tosses)*(1-mean(tosses))
[1] 0.24
> var(tosses)
[1] 0.2482759
> var(tosses)*29/30
[1] 0.24
So, a rule here?
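(A sketch, not in the slides, repeating the experiment for several sample sizes; the two quantities appear to agree every time.)
# check: for 0-1 data, mean(tosses)*(1-mean(tosses)) versus var(tosses)*(n-1)/n
for (n in c(10, 50, 100, 1000)) {
  tosses <- floor(2*runif(n))
  cat(n, mean(tosses)*(1-mean(tosses)), var(tosses)*(n-1)/n, "\n")
}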
27

Your theory for that:
28