
STAT 513/413: Lecture 14 Variance reduction
(the staple of Monte Carlo courses) Rizzo 6.3, 6.4, 6.5, 6.6, 6.7, 6.8

The beginning of our soap opera
In the forthcoming series of transparencies (“soap opera”), we will entertain two test (“guinea pig”) functions: g1 and g2.
They will be introduced shortly; for now, only the R code for them.
> g1 <- function(x) exp(-x^2)        # once again
> g2 <- function(x) exp(-x)/(1+x^2)
Later, we will also need to know what √π is.
> sqrt(pi)
[1] 1.772454

Guinea pig I

The first function is g1(x) = e^(−x^2)

We want to compute the integral

    ∫_{−∞}^{∞} e^(−x^2) dx

Well, we actually know it is equal to √π, but this knowledge
enables us to assess the precision of the method(s)

In the algorithms, we used to approximate the integral above by

    ∫_{−10}^{10} e^(−x^2) dx    or also by    2 ∫_{0}^{10} e^(−x^2) dx

but we know now that

    ∫_{−4}^{4} e^(−x^2) dx    or also    2 ∫_{0}^{4} e^(−x^2) dx    is just enough

[Figure: plot of exp(-x^2) for x from −10 to 10]

Once again: rectangular better than Monte Carlo
…once again – check it out:
> sqrt(pi)
[1] 1.772454
> x=seq(-4,4,length=11)[1:10]
> mean(g1(x))*8
[1] 1.772454
> x=seq(-10,10,length=11)[1:10]
> mean(g1(x))*20
[1] 2.073263
> x=seq(-10,10,length=21)[1:20]
> mean(g1(x))*20
[1] 1.772637
> x=seq(-10,10,length=31)[1:30]
> mean(g1(x))*20
[1] 1.772454
> integrate(g1,-Inf,Inf)
1.772454 with absolute error < 4.3e-06

Guinea pig II

The other function is g2(x) = e^(−x)/(1 + x^2)

and we want to compute the integral

    ∫_{0}^{1} e^(−x)/(1 + x^2) dx

For this one, we compute the “truth” numerically:

> x=seq(0,1,length=10001); sum(g2(x))*diff(x)[1]
[1] 0.5248563
> x=seq(0,1,length=100001); sum(g2(x))*diff(x)[1]
[1] 0.5248031
> x=seq(0,1,length=1000001); sum(g2(x))*diff(x)[1]
[1] 0.5247977
> x=seq(0,1,length=10000001); sum(g2(x))*diff(x)[1]
[1] 0.5247972
> x=seq(0,1,length=100000001); sum(g2(x))*diff(x)[1]
[1] 0.5247971
> x=seq(0,1,length=200000001); sum(g2(x))*diff(x)[1]
[1] 0.5247971
> integrate(g2,0,1)
0.5247971 with absolute error < 5.8e-15

Also here is rectangular better than Monte Carlo

> x=seq(0,1,length=100000001); sum(g2(x))*diff(x)[1]
[1] 0.5247971
> mean(g2(runif(100000001)))
[1] 0.5248177
> mean(g2(runif(100000001)))
[1] 0.5248393
> mean(g2(runif(100000001)))
[1] 0.5247994

The graph of g2
[Figure: plot of g2(x) = e^(−x)/(1 + x^2) for x in [0,1]]

Variance reduction
In classical Monte Carlo, using the average to estimate an expected value, we are looking at σ²/N as a measure of precision (to use it in a Chebyshev inequality, for instance)

Now:

we are not able to do anything with N
(that is, with how many repetitions we need)

but we perhaps can do something with the variance σ²

→ Variance Reduction Techniques

1. Antithetic variables (variates)
This strategy requires that the integrand, g, is monotone
We explain the idea on pairs: suppose that we have U and V independent, with the same distribution, and these two are used for computing (estimating) the integral of g as follows:

    (1/2)(g(U) + g(V))

The corresponding variance is then

    (1/4) Var(g(U) + g(V)) = (1/4)(Var(g(U)) + Var(g(V))) = (1/2) Var(g(U))

That is less than Var(g(U)), right?

(In practice, of course, we will use not one pair, but many pairs of random numbers)

Antithetic variables: idea
What if U and V were not independent (still with the same distribution)? Then the variance of the same estimator

    (1/2)(g(U) + g(V))

would be

    (1/4)(Var(g(U)) + 2 Cov(g(U), g(V)) + Var(g(V)))

which could be less than the original, as long as Cov(g(U), g(V)) < 0

How to achieve that? If we take V = 1 − U, and U is uniform on [0,1], then V has the same distribution as U (right?) and Cov(U, V) is

    Cov(U, 1 − U) = Cov(U, −U) = − Cov(U, U) = − Var(U) ≤ 0

Typically, it is not only − Var(U) ≤ 0, but − Var(U) < 0

However, we need it not for U and V, but for g(U) and g(V)

(And then we will use it not on one pair, but on many pairs of random numbers)

Lemma (Hoeffding, 1940)

Suppose that g and h are monotone, same sense (that is: either both nonincreasing, or both nondecreasing), and X is a random variable. Then

    Cov(g(X), h(X)) ≥ 0

Usage: take U uniform on [0,1] and, say, g(u) nonincreasing; define V = 1 − U, h(U) = −g(1 − U) = −g(V). Clearly, V has the same distribution as U, and as h(u) = −g(1 − u) is nonincreasing too, the Hoeffding lemma gives

    0 ≤ Cov(g(U), h(U)) = Cov(g(U), −g(V)) = − Cov(g(U), g(V))

and therefore Cov(g(U), g(V)) ≤ 0

The case g(u) nondecreasing works the same way (check it out!)

We can take it further: suppose that we also generate some distribution there, with the inversion method using the quantile (inverse cdf) function Q(u) = F⁻¹(u). Then the pair g(Q(u)) and −g(Q(1 − u)) has the same monotonicity properties as g(u) and −g(1 − u)

Let us try it on Guinea pig I

So now not one, but many pairs of random numbers:

> u=runif(10000,0,4); v=4-u; mean(g1(c(u,v)))*8
[1] 1.77311
> u=runif(10000,0,4); v=4-u; mean(g1(c(u,v)))*8
[1] 1.768021
> u=runif(10000,0,4); v=4-u; mean(g1(c(u,v)))*8
[1] 1.783613
Nothing extra… Well, yeah:
> 2*cov(g1(u),g1(v))
[1] -0.09921393
> v2=var(g1(u)) + 2*cov(g1(u),g1(v)) + var(g1(v)); v2
[1] 0.09401626
> v1=var(g1(u)) + var(g1(v)); v1
[1] 0.1097085
> v2/v1
[1] 0.8569642
But note: essentially, we had to generate only half of the random numbers, u; the others, v, are computed in a straightforward way
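The quantile-function extension was not tried in code above, so here is a minimal sketch of my own (the assumptions are mine, not the lecture's: the target is E(g1(X)) for X exponential with rate 1, so Q = qexp):

# A minimal sketch (my own): antithetic variates combined with the inversion
# method, Q = qexp, for the target E(g1(X)) with X ~ Exp(1); g1 is monotone
# on [0, Inf), so the Hoeffding lemma applies to g1(Q(u)) and -g1(Q(1-u))
n  <- 50000
u  <- runif(n)
x  <- qexp(u)                              # X  = Q(U)
xa <- qexp(1 - u)                          # X' = Q(1 - U), the antithetic partner
mean((g1(x) + g1(xa))/2)                   # antithetic estimate of E(g1(X))
mean(g1(qexp(runif(2*n))))                 # plain estimate, same 2n evaluations
cov(g1(x), g1(xa))                         # negative, as the lemma promises
var(g1(x) + g1(xa))/(2*var(g1(x)))         # variance ratio antithetic/plain, < 1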

Now the second one
> u=runif(10000); v=1-u; mean(g2(c(u,v)))
[1] 0.5250257
> u=runif(10000); v=1-u; mean(g2(c(u,v)))
[1] 0.5250572
> u=runif(10000); v=1-u; mean(g2(c(u,v)))
[1] 0.5247662
> 2*cov(g2(u),g2(v))
[1] -0.1155884
> v2=var(g2(u)) + 2*cov(g2(u),g2(v)) + var(g2(v)); v2
[1] 0.004348575
> v1=var(g2(u)) + var(g2(v)); v1
[1] 0.1199369
> v2/v1
[1] 0.03625719
> (v1-v2)/v1
[1] 0.9637428
And again, only half of the random numbers had to be generated

Another example: recognize?
> U=runif(100000000)
> V=1-U
> var(exp(U))
[1] 0.2420872
> -exp(2)/2+2*exp(1)-3/2
[1] 0.2420356
> cov(exp(U),exp(V))
[1] -0.2342585
> -exp(2)+3*exp(1)-1
[1] -0.2342106
>
> var(exp(U)+exp(V))
[1] 0.01565305
> -3*exp(2)+10*exp(1)-5
[1] 0.01564999
>
> (var(exp(U)+exp(V)))/(var(exp(U))+var(exp(V)))
[1] 0.03232965
> (-3*exp(2)+10*exp(1)-5)/(-exp(2)+4*exp(1)-3)
[1] 0.03232993

Proof of the Hoeffding lemma
Let g and h be same-sense monotone; then for any x, y

    (g(x) − g(y))(h(x) − h(y)) ≥ 0

(nice, eh? it works also when they are both nonincreasing!)

Now, take Y with the same distribution as X, and independent. Then
we have (recall: if W ≥ 0, then E(W) ≥ 0)

    0 ≤ E[(g(X) − g(Y))(h(X) − h(Y))]
      = E[g(X)h(X)] + E[g(Y)h(Y)] − E[g(X)h(Y)] − E[g(Y)h(X)]

and as X and Y have the same distribution (recall: the expected value
depends only on the distribution), we can continue

      = 2 E[g(X)h(X)] − E[g(X)h(Y)] − E[g(Y)h(X)]

and then use the independence of X and Y (recall…)

      = 2 E[g(X)h(X)] − E[g(X)] E[h(Y)] − E[g(Y)] E[h(X)]

and again the same distribution of X and Y

      = 2 E[g(X)h(X)] − 2 E[g(X)] E[h(X)] = 2 Cov(g(X), h(X))

Hence Cov(g(X), h(X)) ≥ 0

Further extension
This works also in the multivariate situation, when seeking an integral of g(x1, x2, …, xp)
and g is monotone in every variable. The pair

    g(Q(U1), Q(U2), …, Q(Up))    and    g(Q(1 − U1), Q(1 − U2), …, Q(1 − Up))

works in the same way as in the univariate case
By the way: should g be same sense monotone in every variable? (Beware the textbook notation: my N is their m, my p is their n)
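A minimal sketch of my own for the bivariate case (the integrand exp(−x1−x2) on [0,1]², monotone in the same sense in each variable, is my choice for illustration; its exact integral is (1 − 1/e)²):

# A minimal sketch (my own): antithetic pairs in two dimensions, for
# g2d(x1,x2) = exp(-x1-x2) on [0,1]^2, nonincreasing in each variable;
# the exact value of the integral is (1 - exp(-1))^2
g2d <- function(x1, x2) exp(-x1 - x2)
n  <- 50000
u1 <- runif(n); u2 <- runif(n)
anti  <- (g2d(u1, u2) + g2d(1 - u1, 1 - u2))/2   # n antithetic pairs
plain <- g2d(runif(2*n), runif(2*n))             # 2n independent evaluations
c(mean(anti), mean(plain), (1 - exp(-1))^2)      # estimates vs the exact value
var(2*anti)/(2*var(plain))                       # variance ratio, < 1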

Unbiased estimator?

Well, everything is fine, the total variance is reduced…
… but are we still computing the right thing?
In other words, is it still an unbiased estimator?

Again, we demonstrate it on pairs:

    E[(1/2)(g(U) + g(V))] = (1/2) E(g(U) + g(V))
                          = (1/2)(E(g(U)) + E(g(V))) = (1/2) · 2 E(g(U)) = E(g(U))

It works regardless of whether U and V are independent or not: the
expected value of a sum is always the sum of expected values

We only need that U and V, and then g(U) and g(V), have the same distribution, as then E(g(U)) = E(g(V))

2. Control variates (variables)
Now we need a function f(x) we know the integral of – let us say it is ν. It will be good also when f(x) is close to the integrand g(x) – we can measure this closeness by correlation (recall the definition)

    Cor(f(X), g(X)) = Cov(f(X), g(X)) / √( Var(f(X)) Var(g(X)) )

(Which means: f(x) and g(x) are close if this correlation is positive and high – that is, the closer the correlation is to 1, the closer f(x) and g(x) are considered to be)

We then estimate the integral via a regression

    E(g(X)) − [ Cov(g(X), f(X)) / Var(f(X)) ] ( E(f(X)) − ν )

with a possibility to estimate the coefficient

    A = − Cov(g(X), f(X)) / Var(f(X))

in a preliminary (smaller) Monte Carlo experiment

We compute

    Â = − [ Σ_{i=1}^{m} ( g(Yi) − (1/m) Σ_{j=1}^{m} g(Yj) ) ( f(Yi) − (1/m) Σ_{j=1}^{m} f(Yj) ) ]
          / [ Σ_{i=1}^{m} ( f(Yi) − (1/m) Σ_{j=1}^{m} f(Yj) )² ]

and then estimate the integral by

    (1/n) Σ_{i=1}^{n} ( g(Xi) + Â (f(Xi) − ν) )  =  (1/n) Σ_{i=1}^{n} g(Xi) + Â ( (1/n) Σ_{i=1}^{n} f(Xi) − ν )

That is

    (1/n) Σ_{i=1}^{n} g(Xi) + Â (1/n) Σ_{i=1}^{n} f(Xi) − Â ν
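To have the two-stage recipe in one place, a minimal sketch of my own (the function name control_variate and its default sizes are mine; it assumes, as in the g2 examples, an integral over [0,1] estimated with uniform random numbers):

# A minimal sketch (my own wrapper): control-variate estimation with A-hat from
# a preliminary run of size m and the integral estimated from a fresh run of
# size n; nu is the known integral of the control variate f over [0,1]
control_variate <- function(g, f, nu, m = 10000, n = 100000) {
  y <- runif(m)                                  # preliminary run: estimate A
  A <- -cov(g(y), f(y))/var(f(y))
  x <- runif(n)                                  # main run: estimate the integral
  h <- g(x) + A*(f(x) - nu)
  list(estimate = mean(h), A = A, variance = var(h)/n)
}
# for instance, with the control variate f2(x) = 1 - x (integral 1/2):
# control_variate(g2, function(x) 1 - x, nu = 0.5)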

Now, for our Guinea pig II, g2
For the control variate, the textbook suggests f1(x) = e^(−0.5)/(1 + x^2),
but my choice would be simply f2(x) = 1 − x
(or f3(x) = 1 − 0.9x?)

[Figure: g2 together with f1, f2, and f3, plotted over [0,1]]

So, let us roll like in the textbook
> f1 = function (x) exp(-0.5)/(1+x^2)
> set.seed(007)
> u = runif(10000)
> cor(g2(u),f1(u))
[1] 0.9737252
> A = -cov(g2(u),f1(u))/var(f1(u)) ; A
[1] -2.44699
> u = runif(100000)
> mean(g2(u))
[1] 0.5253908
> mean(g2(u) + A*(f1(u)- exp(-0.5)*(pi/4)))
[1] 0.5250312
> v1=var(g2(u)) ; v1
[1] 0.06027686
> v2=var(g2(u) + A*(f1(u)- exp(-0.5)*(pi/4)))
> v2
[1] 0.003117086
> (v1-v2)/v1
[1] 0.9482872

And now like in a bigger book
> f2 = function(x) 1-x
> set.seed(007)
> u = runif(10000)
> cor(g2(u),f2(u))
[1] 0.9905712
> A = -cov(g2(u),f2(u))/var(f2(u)) ; A
[1] -0.8407234
> u = runif(100000)
> mean(g2(u))
[1] 0.5253908
> mean(g2(u) + A*(f2(u) - 0.5))
[1] 0.5249112
> v1=var(g2(u)) ; v1
[1] 0.06027686
> v2=var(g2(u) + A*(f2(u) - 0.5))
> v2
[1] 0.001115375
> (v1-v2)/v1
[1] 0.9814958

And now: you roll! (Exactly as I did before)
That is, with f3, and in the same way as for f1 and f2 before. Can you do even better?

Preliminary Monte Carlo to estimate A?
This suggestion has more to do with a certain theoretical vindication of the technique than with practice (in practice we can do it one way or another, with little impact on the result)

If the estimate Â comes from a preliminary Monte Carlo run, it means that the random numbers it depends on are different than the Xi's that enter

    (1/n) Σ_{i=1}^{n} g(Xi) + Â ( (1/n) Σ_{i=1}^{n} f(Xi) − ν )

If the random numbers Yi are different from the Xi, then they are independent, and consequently Â, which depends on the Yi, can be considered independent of the random quantities dependent on the Xi – so that
(Recall: E(XY) = E(X) E(Y), if X and Y are independent)

The estimator is unbiased
If the Xi all have the same distribution, so that

    E(g(X1)) = E(g(X2)) = ··· = μ    and also    E(f(X1)) = E(f(X2)) = ··· = ν,

and Â is independent of the Xi's, then

    E[ (1/N) Σ_{i=1}^{N} g(Xi) + Â (1/N) Σ_{i=1}^{N} (f(Xi) − ν) ]

      = (1/N) Σ_{i=1}^{N} E(g(Xi)) + E[ Â (1/N) Σ_{i=1}^{N} (f(Xi) − ν) ]

      = (1/N) Σ_{i=1}^{N} E(g(Xi)) + E(Â) E[ (1/N) Σ_{i=1}^{N} (f(Xi) − ν) ]

      = (1/N) Σ_{i=1}^{N} E(g(Xi)) + E(Â) ( (1/N) Σ_{i=1}^{N} E(f(Xi)) − ν )

      = (1/N) Σ_{i=1}^{N} E(g(Xi)) + E(Â) · 0 = (1/N) Σ_{i=1}^{N} E(g(Xi)) = μ

Other than that?
Other than that, the Yi’s and Xi’s could be the same random numbers – but then we
could not vindicate the estimate by proving that it is unbiased
Nevertheless, the way of estimation can still be vindicated by the fact that for n → ∞ the computed expression still approaches the correct quantity μ (in the sense of convergence in probability)
but yeah, if it is not unbiased, it is difficult to get the exact performance guarantees
But, as I said already, numerically the difference is minimal and the regression interpretation nicely stands out

Numerically, the difference is minimal
> set.seed(007)
> u=runif(10000)
> A = -cov(g2(u),f2(u))/var(f2(u))
> A
[1] -0.8407234
> mean(g2(u))
[1] 0.524227
> mean(g2(u) + A*(f2(u)-0.5))
[1] 0.5247263
> v1=var(g2(u))
> v1
[1] 0.0599499
> v2=var(g2(u) + A*(f2(u)-0.5))
> v2
[1] 0.001125176
> (v1-v2)/v1
[1] 0.9812314

And it is indeed regression
> set.seed(007)
> u=runif(10000)
> z=f2(u)
> reg=lm(g2(u)~z)
> summary(reg)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.1043645  0.0006707   155.6   <2e-16 ***
z           0.8407234  0.0011629   723.0   <2e-16 ***
...
> predict(reg,list(z=0.5))
1 0.5247263
This shows connection to the regression estimator from sample survey methodology
And also shows a path to generalization: if we have several control variates, we can do “multiple regression” then!

Beware, however
> predict(reg,list(z=0.5),se=TRUE)$se.fit^2
[1] 1.125293e-07
Yeah, divided by N, or N − 1, or N − 2… Not really:
> var(g2(u) - coef(reg)[2]*(f2(u)-0.5))
[1] 0.001125176
> predict(reg,list(z=0.5),se=TRUE)$se.fit^2*(length(u))
[1] 0.001125293
> predict(reg,list(z=0.5),se=TRUE)$se.fit^2*(length(u)-1)
[1] 0.001125181
> predict(reg,list(z=0.5),se=TRUE)$se.fit^2*(length(u)-2)
[1] 0.001125068
Hmm, which one…?
> summary(reg)$sigma^2*(length(u)-2)/(length(u)-1)
[1] 0.001125176
This one!

Just for fun
> set.seed(007)
> u = runif(10000)
> z=f2(u); y=f1(u)
> reg=lm(g2(u)~z+y); cf=coef(reg)
> predict(reg,list(z=0.5,y=exp(-.5)*(pi/4)))
1 0.5247242
> mean(g2(u)-cf[2]*(f2(u)-0.5)-cf[3]*(f1(u)-exp(-.5)*(pi/4)))
[1] 0.5247242
> v3=var(g2(u)-cf[2]*(f2(u)-0.5)-cf[3]*(f1(u)-exp(-.5)*(pi/4)))
> v3
[1] 0.0005562679
> (v1-v3)/v1
[1] 0.9907211
Yeah, and also: why length(u)-3 and not length(u)-2 this time?
> summary(reg)$sigma^2*(length(u)-3)/(length(u)-1)
[1] 0.0005562679

The difference seems to be really small
> set.seed(007)
> u = runif(100000)
> z = f2(u); reg1=lm(g2(u)~z)
> coef(reg1)
(Intercept) z
0.1042988 0.8411960
> predict(reg1,list(z=0.5))
1 0.5248968
> uu=runif(100000)
> mean(g2(uu)-coef(reg1)[2]*(f2(uu)-0.5))
[1] 0.524816
> zz=f2(uu); reg2=lm(g2(uu)~zz)
> predict(reg2,list(zz=0.5))
1 0.5248153
> reg3=reg2; reg3$coefficients[2] = coef(reg1)[2]
> predict(reg3,list(zz=0.5))
1 0.5251825

A little bit more fun
> set.seed(007)
> u = runif(10000)
> z=f2(u); y=f1(u)
> reg=lm(g2(u)~z+y); cf=coef(reg)
> u=runif(100000)
> mean(g2(u))
[1] 0.5253908
> mean(g2(u)-cf[2]*(f2(u)-0.5)-cf[3]*(f1(u)-exp(-.5)*(pi/4)))
[1] 0.5248127
> v1=var(g2(u))
> v1
[1] 0.06027686
> v3=var(g2(u)-cf[2]*(f2(u)-0.5)-cf[3]*(f1(u)-exp(-.5)*(pi/4)))
> v3
[1] 0.0005455055
> (v1-v3)/v1
[1] 0.99095
Of course, summary(reg) does not work now (but is close…)
> summary(reg)$sigma^2*(length(u)-3)/(length(u)-1)
[1] 0.0005563681

However
However, if we want to maintain unbiasedness in theory, we are still supposed to estimate the regression coefficients in a preliminary Monte Carlo run…
… thus losing the nice possibility of using lm()

3. Importance sampling
Simple Monte Carlo: suppose that f(x) = sl(x) is a probability density
The integral

    ∫ g(x) l(x) dx = (1/s) ∫ g(x) sl(x) dx

is estimated by

    (1/(sN)) Σ_{i=1}^{N} g(Xi)    where the Xi follow the density sl(x) = f(x)

In the typical case, l(x) = 1 and s = 1/(v−u), so that f(x) is the density
of the uniform distribution on [u,v], and the integral

    ∫_u^v g(x) dx = (v−u) ∫_u^v g(x) (1/(v−u)) dx

is estimated by

    ((v−u)/N) Σ_{i=1}^{N} g(Xi)

The idea: how about taking for sl something better than uniform?

What can be better?
We estimate

    ∫ g(x) dx = ∫ [ g(x)/(sl(x)) ] sl(x) dx = (1/s) ∫ [ g(x)/l(x) ] f(x) dx

by

    (1/(sN)) Σ_{i=1}^{N} g(Yi)/l(Yi)    where the Yi follow the density sl(x) = f(x)

Which f will yield “better” estimates here?

Those that will yield smaller Var( g(Yi)/l(Yi) ) –
for instance, those that make g(Yi)/l(Yi) more “constant”
– which can be interpreted as f targeting the more “important” places regarding g – hence “importance sampling”

Also: is it still an unbiased estimator? Of course – it is pretty much simple Monte Carlo, only with a bit more sophisticated l

A paradox
Think again of the integral

    ∫_0^1 1/√(x(1 − x)) dx

Somebody would tell us: indeed, 1/(π√(x(1 − x))) is a density
of a so-called arc-sine distribution
  its name coming from the fact that its cumulative distribution function involves the arc-sine function
  arc-sine is easily inverted to a sine
  so we can easily program a function generating random numbers from this distribution
and then put s = 1/π and let l be the integrand above
and then we are to integrate a function h = 1 over [0, 1]

which is sooo easy:

    (1/N) Σ_{i=1}^{N} h(Xi) = (1/N) Σ_{i=1}^{N} 1 = 1 …

The catch
… and the integral is then 1 multiplied by 1/s = π – it is π?
The catch is that we knew, more or less, the integral of g so why compute something we know?
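To see the paradox, and the catch, in code – a minimal sketch of my own (the inversion formula Q(u) = sin(πu/2)² for the arc-sine distribution on (0,1) is standard, but the function names here are mine):

# A minimal sketch (my own): the paradox in code. The integrand is
# garc(x) = 1/sqrt(x*(1-x)); take l = garc and s = 1/pi, so f = s*l is the
# arc-sine density, sampled by inversion with Q(u) = sin(pi*u/2)^2
garc <- function(x) 1/sqrt(x*(1 - x))
rarcsine <- function(n) sin(pi*runif(n)/2)^2
y  <- rarcsine(100000)
fg <- garc(y)/garc(y)            # g(Y)/l(Y) is identically 1: zero variance...
mean(fg)/(1/pi)                  # ...so the "estimate" is exactly pi, every time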

Textbook, Example 6.11 (2019 edition)

Take Guinea pig II

    ∫_0^1 e^(−x)/(1 + x^2) dx

with five choices for f(x) = cl(x), and one more added here:

    f0(x) = 1,                0 < x < 1     (simple Monte Carlo, uniform f)
    f1(x) = e^(−x),           0 < x < ∞     (exponential distribution with λ = 1)
    f2(x) = 1/(π(1 + x^2)),   −∞ < x < ∞    (Cauchy distribution, t1)
    f3(x) = c e^(−x),         0 < x < 1     (c = 1/(1 − 1/e))
    f4(x) = 4/(π(1 + x^2)),   0 < x < 1
    f5(x) = c(1 − 0.8x),      0 < x < 1     (a density, with c = 1/0.6)

In other words: why not verify your calculus numerically as well

> x <- y <- seq(0,1,length=1000)
> plot(x,(10/6)*cumsum(1-0.8*x)/1000,
+ type="l",lwd=16,col="yellow")
> lines(x,(10/6)*(x-0.4*x^2),col="red",lwd=8)
> lines((5-sqrt(25-24*y))/4,y,lwd=2)

[Figure: the three curves coincide – the numerically integrated cdf, the cdf (10/6)(x − 0.4x^2) of f5, and its inverse (5 − √(25 − 24y))/4 plotted against y]

The code, 1st part

# m <- 10000
theta.hat <- se <- numeric(6)
g <- function(x) exp(-x)/(1+x^2)*(x>=0)*(x<=1)

f0 <- 1
x <- runif(m)
fg <- g(x)/f0
theta.hat[1] <- mean(fg)
se[1] <- sd(fg)

f1 <- function(x) exp(-x)
x <- rexp(m,1)
fg <- g(x)/f1(x)
theta.hat[2] <- mean(fg)
se[2] <- sd(fg)

The code, 2nd part

f2 <- function(x) 1/(1+x^2)
con <- 1/pi
x <- rcauchy(m)
i <- c(which(x>1),which(x<0))
x[i] <- 2
fg <- g(x)/f2(x)
theta.hat[3] <- mean(fg)/con
se[3] <- sd(fg)/con

f3 <- f1
con <- 1/(1-exp(-1))
u <- runif(m)
x <- -log(1-u/con)
fg <- g(x)/f3(x)
theta.hat[4] <- mean(fg)/con
se[4] <- sd(fg)/con

The code, 3rd (and last) part

f4 <- f2
con <- 4/pi
u <- runif(m)
x <- tan(pi*u/4)
fg <- g(x)/f4(x)
theta.hat[5] <- mean(fg)/con
se[5] <- sd(fg)/con

f5 <- function(x) 1-0.8*x
con <- 1/(1-0.8/2)
u <- runif(m)
x <- (5-sqrt(25-24*u))/4
fg <- g(x)/f5(x)
theta.hat[6] <- mean(fg)/con
se[6] <- sd(fg)/con

And the Oscar goes to...

> m=100000; source("import.R")
> rbind(round(theta.hat,7),round(se/sqrt(m),7),rank(se))
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0.5244982 0.5239369 0.5235613 0.5241246 0.5255023 0.5246502
[2,] 0.0007743 0.0013231 0.0030062 0.0003067 0.0004470 0.0001551
[3,] 4.0000000 5.0000000 6.0000000 2.0000000 3.0000000 1.0000000
> source(“import.R”)
> rbind(round(theta.hat,7),round(se/sqrt(m),7),rank(se))
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0.5232864 0.5260607 0.5252418 0.5248760 0.5256166 0.5248048
[2,] 0.0007729 0.0013235 0.0030139 0.0003067 0.0004461 0.0001555
[3,] 4.0000000 5.0000000 6.0000000 2.0000000 3.0000000 1.0000000
> source(“import.R”)
> rbind(round(theta.hat,7),round(se/sqrt(m),7),rank(se))
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0.5255765 0.5248700 0.5247340 0.5250224 0.5245876 0.5248881
[2,] 0.0007743 0.0013232 0.0030035 0.0003072 0.0004457 0.0001554
[3,] 4.0000000 5.0000000 6.0000000 2.0000000 3.0000000 1.0000000
> (se[1]^2-se[6]^2)/se[1]^2
[1] 0.9600556

4. Stratified sampling: divide et impera
Divide the domain of integration into several subdomains (some overlaps on boundaries are allowed when the sampled distributions are continuous) and then compute the integrals separately, splitting random numbers among them
As before, f(x) = sl(x)

So, we are computing

    μ = ∫_A g(x)f(x) dx

      = ∫_{A1} g(x)f(x) dx + ∫_{A2} g(x)f(x) dx + ··· + ∫_{Ar} g(x)f(x) dx

after splitting the domain of integration: A = A1 ∪ A2 ∪ ··· ∪ Ar

Some more details and notation
The density f on Ai needs to be made a probability density:

    fAi(x) = f(x)/pi    where    pi = ∫_{Ai} f(x) dx

The integrals in the strata are

    ∫_{Ai} g(x)f(x) dx = pi ∫_{Ai} g(x) (f(x)/pi) dx = pi ∫_{Ai} g(x) fAi(x) dx

We use the notation

    μi = ∫_{Ai} g(x) fAi(x) dx
         – the expected value of g(Y) whenever Y follows the distribution with density fAi

We add also:

    σi² = (1/pi) ∫_{Ai} (g(x) − μi)² f(x) dx = ∫_{Ai} (g(x) − μi)² fAi(x) dx
         – the variance of g(Y) whenever Y follows the distribution with density fAi

The algorithm

The integral is then computed (estimated) as the sum of the integrals in the strata; the latter are computed separately, with respectively Ni repetitions and Yij sampled independently from the distribution with density fAi:

    (p1/N1) Σ_{j=1}^{N1} g(Y1j) + (p2/N2) Σ_{j=1}^{N2} g(Y2j) + ··· + (pr/Nr) Σ_{j=1}^{Nr} g(Yrj)

We denote the variance of this sum by σ²(N1, N2, …, Nr)

It is equal to

    p1²σ1²/N1 + ··· + pr²σr²/Nr
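A minimal sketch of my own, making the scheme concrete for g2 on [0,1] with uniform f and r equal-width strata (so pi = 1/r); the function name and arguments are mine:

# A minimal sketch (my own): the stratified estimator above for g2 on [0,1],
# uniform f, r equal-width strata, allocations N = (N_1, ..., N_r);
# the variance is estimated via the sum of p_i^2 sigma_i^2 / N_i
stratified <- function(g, r, N) {
  p <- rep(1/r, r)                          # strata [(i-1)/r, i/r], each p_i = 1/r
  est <- vr <- 0
  for (i in 1:r) {
    y <- ((i - 1) + runif(N[i]))/r          # Y_ij uniform on stratum i
    est <- est + p[i]*mean(g(y))
    vr  <- vr  + p[i]^2*var(g(y))/N[i]
  }
  c(estimate = est, variance = vr)
}
stratified(g2, r = 10, N = rep(1000, 10))   # proportional allocation, N = 10000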

Is it worth the effort?
We want to compare the variance σ²(N1, N2, …, Nr) of the introduced stratified scheme to the variance, denoted by σ²(N), of the simple Monte Carlo estimator

    (1/N) Σ_{j=1}^{N} g(Xj)    with N = N1 + N2 + ··· + Nr    and the random numbers Xj sampled from f

We can also put it this way: for given N, is there a way to allocate the Ni to the strata so that

    when N = N1 + N2 + ··· + Nr, then σ²(N1, N2, …, Nr) ≤ σ²(N)

In particular, is it always possible?

Let us try splitting the domain into halves for g2
> set.seed(007)
> u=runif(10000)
> mean(g2(u))
[1] 0.524227
> v1=var(g2(u))/10000
> v1
[1] 5.99499e-06
> u1=u[1:5000]/2
> u2=1/2+u[5001:10000]/2
> (1/2)*mean(g2(u1)) + (1/2)*mean(g2(u2))
[1] 0.5248533
> v2=(1/4)*var(g2(u1))/5000 + (1/4)*var(g2(u2))/5000
> v2
[1] 1.526541e-06
> (v1-v2)/v1
[1] 0.7453639

Yes, it is always possible
… and there is a theorem for that – which says:
    If p1 + p2 + ··· + pr = 1, then σ²(Np1, Np2, …, Npr) ≤ σ²(N)

Interestingly, it is not the best we can do. The best (minimum variance) allocation is

    N σ1 p1 / (σ1 p1 + ··· + σr pr),   N σ2 p2 / (σ1 p1 + ··· + σr pr),   …,   N σr pr / (σ1 p1 + ··· + σr pr)

so if we happen to know the σi …
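And a minimal sketch of my own of that optimal allocation, with the σi estimated from a small pilot run (the pilot size and the number of strata are arbitrary choices of mine):

# A minimal sketch (my own): optimal (minimum-variance) allocation for g2 on
# [0,1] with r equal-width strata; sigma_i estimated from a pilot run, then
# N_i taken proportional to sigma_i * p_i, and the two variances compared
r <- 10; p <- rep(1/r, r); N <- 10000
sigma <- sapply(1:r, function(i) sd(g2(((i - 1) + runif(200))/r)))   # pilot
Nopt  <- round(N*sigma*p/sum(sigma*p))      # optimal N_i ~ sigma_i p_i
rbind(proportional = rep(N/r, r), optimal = Nopt)
sum(p^2*sigma^2/rep(N/r, r))                # variance with proportional allocation
sum(p^2*sigma^2/Nopt)                       # variance with optimal allocation (no larger)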

Proof of the theorem
    σ²(N) = (1/N) ∫ (g(x) − μ)² f(x) dx = (1/N) Σ_{i=1}^{r} ∫_{Ai} (g(x) − μ)² f(x) dx

          = (1/N) Σ_{i=1}^{r} ∫_{Ai} (g(x) − μi + μi − μ)² f(x) dx

          = (1/N) Σ_{i=1}^{r} ∫_{Ai} (g(x) − μi)² f(x) dx + (1/N) Σ_{i=1}^{r} ∫_{Ai} (μi − μ)² f(x) dx
            + (2/N) Σ_{i=1}^{r} ∫_{Ai} (g(x) − μi)(μi − μ) f(x) dx

The last term is

    (2/N) Σ_{i=1}^{r} (μi − μ) pi ∫_{Ai} (g(x) − μi) (f(x)/pi) dx

  = (2/N) Σ_{i=1}^{r} (μi − μ) pi ( ∫_{Ai} g(x) fAi(x) dx − μi ∫_{Ai} fAi(x) dx )

  = (2/N) Σ_{i=1}^{r} (μi − μ) pi (μi − μi) = 0

Hence (proof continued)

    σ²(N) = (1/N) Σ_{i=1}^{r} ∫_{Ai} (g(x) − μi)² f(x) dx + (1/N) Σ_{i=1}^{r} ∫_{Ai} (μi − μ)² f(x) dx

          = (1/N) Σ_{i=1}^{r} pi ∫_{Ai} (g(x) − μi)² (f(x)/pi) dx + (1/N) Σ_{i=1}^{r} (μi − μ)² ∫_{Ai} f(x) dx

          = (1/N) Σ_{i=1}^{r} pi ∫_{Ai} (g(x) − μi)² fAi(x) dx + (1/N) Σ_{i=1}^{r} (μi − μ)² pi

          = (1/N) Σ_{i=1}^{r} pi σi² + (1/N) Σ_{i=1}^{r} (μi − μ)² pi

          = Σ_{i=1}^{r} pi²σi²/(Npi) + (1/N) Σ_{i=1}^{r} (μi − μ)² pi

          = σ²(Np1, Np2, …, Npr) + (1/N) Σ_{i=1}^{r} (μi − μ)² pi

Which means that σ²(N) ≥ σ²(Np1, Np2, …, Npr), as

    0 ≤ (1/N) Σ_{i=1}^{r} (μi − μ)² pi

(and the difference is bigger if the μi are more spread)

Note: the ultimate stratification is…
> u=runif(200)
> uu=matrix(u/100,100,2)
> mean(g2(u))
[1] 0.5202573
> v1=var(g2(u))
> v1
[1] 0.05879549
> z=seq(0,1,len=101)[-101]
> sum(apply(g2(uu+cbind(z,z)),1,mean)*(1/100))
[1] 0.5248075
> v2=sum(apply(g2(uu+cbind(z,z)),1,var)*(1/10000))
> v2
[1] 6.269604e-08
> (v1-v2)/v1
[1] 0.9999989

Finally: is it an unbiased estimator?

That is: what is the expected value

    E[ (p1/N1) Σ_{j=1}^{N1} g(Y1j) + (p2/N2) Σ_{j=1}^{N2} g(Y2j) + ··· + (pr/Nr) Σ_{j=1}^{Nr} g(Yrj) ]  =  ?

Some summary at the end
Monte Carlo: we use random numbers as if they were governed by the rules of probability (→ Creed of Monte Carlo) – probability that we know –
to compute (→ estimate) some quantity
(usually one that is difficult to obtain otherwise)
Because we believe in the probabilistic behavior of random numbers (→ Creed of Monte Carlo), the mathematical analysis of Monte Carlo is conducted using probability theory, which helps us to figure out
how close our computed results (→ estimates)
are to the desired quantities (→ estimated targets)
The minimal requirement here is consistency: it formalizes, in a probabilistic way (→ convergence in probability),
that our estimates approach the estimated targets as the number of repetitions grows
Without consistency, our computational efforts are doomed

Estimating expected values
Common estimated targets are expected values (→ integrals) of various kinds (including probabilities as a special case)
For this particular case, it is important that the expected value (integral) exists and is finite, as then
the law of large numbers (Kolmogorov version) provides consistency, once expected values are estimated by averages
It is also important that the estimators are unbiased (as averages usually are), because then we can obtain performance guarantees (similar to error estimates in classical numerical analysis) via
powerful inequalities of probability theory: Chebyshev, Hoeffding
(And so the reason why we need estimators to be unbiased is that all the above involve deviations of a random variable from its expected value)

Performance guarantees for expected values
The Hoeffding inequality gives the sharpest performance guarantees, but it applies only to averages of random variables that have a bounded range
The Chebyshev inequality is more general (at the price of being less powerful): it gives nontrivial results whenever the variance of the estimator – its crucial component – is finite
When the variance, or its upper bound, is calculated exactly, then the performance guarantee has the highest level of exactness
In practice, we may often know the variance, or a bound in the Hoeffding inequality, only approximately – for instance
we compute them only numerically, up to some precision
or, which can be seen as a special case of the previous, we estimate them via Monte Carlo
Then we have (very) slightly less exact performance guarantees
An even less exact performance guarantee is obtained if we use a central limit theorem to approximate the distribution of the estimator
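A tiny sketch of my own, only to recall how such a guarantee is used in practice (the variances plugged in below are roughly the v1 and v2 observed earlier for g2 with the control variate f2):

# A tiny sketch (my own): a Chebyshev-style guarantee in use. If the estimator
# has variance sigma2/N, then P(|estimate - mu| >= eps) <= sigma2/(N*eps^2);
# solving for N gives the repetitions needed for given eps and alpha
chebyshev_N <- function(sigma2, eps, alpha) ceiling(sigma2/(alpha*eps^2))
chebyshev_N(sigma2 = 0.06,   eps = 0.001, alpha = 0.05)   # plain Monte Carlo for g2
chebyshev_N(sigma2 = 0.0011, eps = 0.001, alpha = 0.05)   # with the control variate f2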

The outreach to numerical analysis
As expected values are integrals, Monte Carlo technology is useful not only for computing statistical and probabilistic quantities, but also as a purely numerical method to compute integrals
in particular high-dimensional ones
While it can be successful in high-dimensional contexts compared to classical methods of numerical analysis, it has its limitations too
The success of integration methods based on the evaluation of a function at scattered points depends not only on their probability distribution, but also on how evenly the N points are spread over the integration domain

Right panel has more evenly spread points
[Figure: two panels of points in the unit square – left: runif(500) against runif(500); right: a jittered regular grid, xx[, 1] + rnorm(400, 0, 0.01) against xx[, 2] + rnorm(400, 0, 0.01) – the right panel is more evenly spread]
Quasi Monte-Carlo methods
Points that behave as results of independent random variables with the same distribution cannot overcome certain limits in this regard
Recall: after all, every probability inequality told us that in order to get performance guarantee for another digit of the result – increase the precision by factor 1/10 – we have to multiply the number of repetitions by 100.
There are methods that perform better – by providing collections of points in higher dimensions that are more evenly spread than those obtained by random sampling: quasi Monte-Carlo methods
but these are the subject of specialized monographs (and beyond the scope of this course)