
STAT 513/413: Lecture 11 Sampling and resampling
(bootstrap, permutation tests, and all that)
What we cannot establish by probabilistic calculations, we accomplish by simulations
Rizzo 7.1, 7.2, 7.3, 8.1

The median: recall
The median of a probability distribution P is defined to be a number m such that
P((−∞, m]) ≥ 1/2 and P([m, +∞)) ≥ 1/2
A very common situation is that
P((−∞, m]) = P([m, +∞)) = 1/2
In such a case, the median is Q(0.5) = F⁻¹(0.5), where F is the cumulative distribution function of P
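(For instance, for the exponential distribution with λ = 1 we have F(x) = 1 − e^(−x), so the median is Q(0.5) = −log(0.5) = log 2 ≈ 0.6931 – a value that will come up repeatedly below.)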
Right now, we are not interested in what the median m of the distribution is; we are interested in its estimator M10, the sample median, computed from a sample of 10 numbers that look like the outcomes of independent random variables with the exponential distribution with λ = 1: how does it perform?
For starters: one of the performance measures of an estimator is its variance

Variance is easy: but is it the right thing here?
> var(replicate(100,median(rexp(10))))
[1] 0.09402476
> var(replicate(100,median(rexp(10))))
[1] 0.09007331
> var(replicate(100,median(rexp(10))))
[1] 0.08155923
> var(replicate(1000,median(rexp(10))))
[1] 0.09270415
> var(replicate(1000,median(rexp(10))))
[1] 0.09723465
> var(replicate(1000,median(rexp(10))))
[1] 0.09982106
> var(replicate(100000,median(rexp(10))))
[1] 0.09628878
> var(replicate(100000,median(rexp(10))))
[1] 0.09590175
Note the difference: M10 estimates m out of n = 10 numbers, while TN, the Monte Carlo estimator of its variance, estimates Var(M10) out of N = 100000 results

Is M10 an unbiased estimator of m?
A general measure of the quality of an estimator is the mean squared error E[(M10 − m)²]. The variance is linked to it through the bias-variance decomposition
E[(M10 − m)²] = E[(M10 − E(M10))²] + (E(M10) − m)² + E[2(M10 − E(M10))(E(M10) − m)]
= Var(M10) + (m − E(M10))²
as
E[2(M10 − E(M10))(E(M10) − m)] = 2(E(M10) − m) E[M10 − E(M10)]
= 2(E(M10) − m)(E(M10) − E(M10)) = 0
The expression m − E(M10) is called the bias – hence bias-variance decomposition (which apparently holds in greater generality, not just for M10)
Estimators whose bias is equal to zero – and hence whose expected value is equal to the estimated quantity – are called unbiased
Is E(M10) = m – that is, is M10 an unbiased estimator of m?

A quick check of bias
> mean(replicate(100000,median(rexp(10))))
[1] 0.745127
> median(rexp(100000))
[1] 0.6890036
We can see that bias is likely not equal to 0 – even without knowing the exact value
> mm=-log(0.5)
> mm
[1] 0.6931472
We estimated it by 0.6890036, and the difference from 0.745127 seems too big to be attributed just to chance
So M10 is apparently not unbiased. However, if we increase n and take M100 instead, we can see the situation improving
> mean(replicate(100000,median(rexp(100))))
[1] 0.6987744
Seems like some consistency holds there…
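A quick way to watch the bias shrink with n – a minimal sketch (the sample sizes and the 10000 replications here are an arbitrary choice):
mm <- -log(0.5)
for (n in c(10, 100, 1000))
  print(mean(replicate(10000, median(rexp(n)))) - mm)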

Back to mean squared error
So, let us do mean squared error now. First, there is a possibility that we actually know the estimated value
> mm=-log(0.5)
> mm
[1] 0.6931472
> mean(replicate(10,(median(rexp(10))-mm)^2))
[1] 0.06314839
> mean(replicate(100,(median(rexp(10))-mm)^2))
[1] 0.08828418
> mean(replicate(100,(median(rexp(10))-mm)^2))
[1] 0.1249923
> mean(replicate(100,(median(rexp(10))-mm)^2))
[1] 0.06841089
> mean(replicate(100000,(median(rexp(10))-mm)^2))
[1] 0.09911663
> mean(replicate(100000,(median(rexp(10))-mm)^2))
[1] 0.09856143
> mean(replicate(100000,(median(rexp(10))-mm)^2))
[1] 0.0979045

What if we do not know the estimated value?
Well, then instead of -log(0.5) we have to do something else, and if it involves random numbers, we may need to wait a bit longer
> mm=median(rexp(10000000)) ## consistency, remember?
> mm
[1] 0.6930446
> mean(replicate(100000,(median(rexp(10))-mm)^2))
[1] 0.09919107
> mean(replicate(100000,(median(rexp(10))-mm)^2))
[1] 0.0988128
> mean(replicate(100000,(median(rexp(10))-mm)^2))
[1] 0.09927002

Confronting the large-n approximation
Actually, theory does not help us too much here. The only useful theoretical result concerns Mn for large n. In such a case, Mn is very close to m, and the mean squared error is thus very close to the variance, which in turn can be approximated by
1/(4 n f(m)²)
(a theorem by Kolmogorov)
where f is the density of the distribution underlying the sample in our case.
Looks like this is not that bad an approximation already for n = 100
> mm = -log(0.5)
> mean(replicate(100000,(median(rexp(100))-mm)^2))
[1] 0.01004371
> 1/(4*100*dexp(log(2))^2)
[1] 0.01
but not that good for n = 10 (which is apparently not that large)
> 1/(4*10*dexp(log(2))^2)
[1] 0.1 # compare to the Monte Carlo value above

A research story
So, we have the estimator, the sample median M10, of m. This estimator works in general – for various distributions:
say, for the exponential distribution with λ = 1
or for the exponential distribution with unknown λ
However, when the data follow an exponential distribution – that is, an exponential distribution with unknown λ –
we may actually devise a better estimator of m. How about Q(0.5)? Well, actually it is Qλ(0.5)
and λ we do not know
but we can estimate it by 1/X̄
Well, and how do we know it is better? – well, mean squared error, is it not?

Theory for mean-squared error
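A minimal sketch of the calculation (a reconstruction; it agrees with the closed form (-log(0.5)/(las*sqrt(n)))^2 checked on a later slide): since Qλ(0.5) = log 2/λ and λ is estimated by 1/X̄, the proposed estimator is (log 2) X̄. Then
E((log 2) X̄) = (log 2)/λ = m
so this estimator is even unbiased; and since Var(X̄) = 1/(n λ²) for the exponential distribution,
MSE = Var((log 2) X̄) = (log 2)²/(n λ²) = (−log(0.5)/(λ√n))²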

Computer experiment
Let us calculate the mean squared error – in the way already shown above – for a couple of selected λ, and we will see (as the blind man said to the deaf). A tiny script:
N <- 100000
n <- 10
las <- c(0.01,0.1,1,10,100)
rslt <- matrix(0,2,length(las))
for (k in (1:length(las))) {
  mm <- qexp(0.5,las[k])
  rslt[1,k] <- mean(replicate(N,(median(rexp(n,las[k])) - mm)^2))
  rslt[2,k] <- mean(replicate(N,(qexp(0.5,1/mean(rexp(n,las[k]))) - mm)^2))
}
print(rslt)

And the result
> source("twomed.R")
[,1] [,2] [,3] [,4] [,5]
[1,] 998.0411 9.915687 0.09901681 0.0009827265 9.896574e-06
[2,] 477.9990 4.822185 0.04794337 0.0004825622 4.823859e-06
So, we seem to be good…
for five values of λ
and the N = 100000 used
and when the data come from an exponential distribution and if we did not have an error in the code

An error???
For instance, we could change n to n = 100 now
> source("twomed.R")
[,1] [,2] [,3] [,4] [,5]
[1,] 99.81847 0.9905156 0.009954382 1.010295e-04 9.962715e-07
[2,] 47.99285 0.4794193 0.004777554 4.797323e-05 4.821234e-07
and then try the large-sample approximation for the first line
> las
[1] 1e-02 1e-01 1e+00 1e+01 1e+02
> 1/(4*100*dexp(qexp(0.5,las),las)^2)
[1] 1e+02 1e+00 1e-02 1e-04 1e-06
and for the second one
> (-log(0.5)/(las*sqrt(n)))^2
[1] 4.80453e+01 4.80453e-01 4.80453e-03 4.80453e-05 4.80453e-07
Seems like we are good…

Aside: an even slightly better way
N <- 100000
n <- 10
las <- c(0.01,0.1,1,10,100)
rslt <- matrix(0,2,length(las))
mses <- matrix(0,2,N)
for (k in (1:length(las))) {
  for (rep in 1:N) {
    mm <- qexp(0.5,las[k])
    sss <- rexp(n,las[k])
    mses[1,rep] <- (median(sss) - mm)^2
    mses[2,rep] <- (qexp(0.5,1/mean(sss)) - mm)^2
  }
  rslt[1,k] <- mean(mses[1,])
  rslt[2,k] <- mean(mses[2,])
}
print(rslt)

Not that different results, however
> source("twomod.R")
[,1] [,2] [,3] [,4] [,5]
[1,] 987.1046 9.903584 0.09875735 0.0009891487 9.820579e-06
[2,] 483.8226 4.801344 0.04818387 0.0004835474 4.800250e-06
Recall the earlier
> source("twomed.R")
[,1] [,2] [,3] [,4] [,5]
[1,] 998.0411 9.915687 0.09901681 0.0009827265 9.896574e-06
[2,] 477.9990 4.822185 0.04794337 0.0004825622 4.823859e-06

And now for something a bit different
Fine… so the above gave me the mean squared error (which may be close to variance or not) for a sample median out of n = 10 when sampled from the exponential distribution with λ = 1
But now I have the following 10 numbers; I compute the median from them – any possibility to assess the precision of that?
> exs
[1] 2.710660 1.100322 11.934400 1.419022 1.523077
[6] 5.384740 2.801879 2.163556 4.837161 10.374159
> mm=median(exs)
> mm
[1] 2.75627
> var(exs)
[1] 14.67475
Yes, but does this say anything at all here?
> mean((exs-mm)^2)
[1] 15.99159
And this one??

Bootstrap!
How about taking the mean squared differences of the sample median of “the sample” (=original batch of numbers) from the sample medians repeatedly calculated… from the data at hand!
> mean((replicate(10,median(sample(exs,10)))-mm)^2)
[1] 0
Well, if we sample n numbers from n numbers (“resample”), then we always get the same thing; we need to do it with replacement
> mean((replicate(10,median(sample(exs,10,replace=TRUE)))-mm)^2)
[1] 2.362092
> mean((replicate(10,median(sample(exs,10,replace=TRUE)))-mm)^2)
[1] 3.502714
> mean((replicate(10,median(sample(exs,10,replace=TRUE)))-mm)^2)
[1] 3.041924
> mean((replicate(100000,median(sample(exs,10,replace=TRUE)))-mm)^2)
[1] 2.125004
> mean((replicate(100000,median(sample(exs,10,replace=TRUE)))-mm)^2)
[1] 2.111911
> mean((replicate(100000,median(sample(exs,10,replace=TRUE)))-mm)^2)
[1] 2.125408

The scheme of bootstrap
We are trying to figure out a quantity that may involve unknown constant(s) – here let it be just one, m –
and that also depends on random variables X1, …, Xn
independent and with the same distribution
which are believed to be those whose outcomes model the behavior of the observations x1, …, xn (“sample”)
A bootstrap estimate of the desired quantity is then obtained by
estimating the unknown constant(s) from the original x1, x2, …, xn
and then estimating the quantity itself from N batches of n numbers x∗1i, x∗2i, …, x∗ni sampled from the original x1, x2, …, xn with replacement (“resampling”)
For instance, we are after
E[(M10(X1, …, X10) − m)²]
which we estimate by
(1/N) Σ_{i=1}^{N} (M10(x∗1i, …, x∗10,i) − M10(x1, …, x10))²
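In code, the whole scheme takes a few lines – a minimal sketch (the function boot_mse and its arguments are illustrative names, not from the lecture):
boot_mse <- function(x, stat = median, N = 10000) {
  t0 <- stat(x)  # statistic on the original sample: the estimated unknown constant
  # resample with replacement, recompute the statistic, average the squared deviations
  mean(replicate(N, (stat(sample(x, length(x), replace = TRUE)) - t0)^2))
}
so that boot_mse(exs, median, 100000) reproduces the computation of the previous slide.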

Exercise: bias, variance

Bias and variance
We know the sample median is in general biased… Can we obtain an estimate of the bias? (bias = the difference between the estimated value and the expected value of the estimator; equal to zero for unbiased estimators)
> mm-mean(replicate(10000,median(sample(exs,10,replace=TRUE))))
[1] -0.4691844
What is the variance? (It may be needed for confidence intervals, say)
> var(replicate(100000,median(sample(exs,10,replace=TRUE))))
[1] 1.895233
Efron & Tibshirani say that 200 would be enough (instead of 10000)… They may be right…
> var(replicate(200,median(sample(exs,10,replace=TRUE))))
[1] 1.831597
> var(replicate(200,median(sample(exs,10,replace=TRUE))))
[1] 2.062268
> var(replicate(10000,median(sample(exs,10,replace=TRUE))))
[1] 1.931406

And now, hocus-pocus
Bias-variance decomposition again: is it true here?
> set.seed(007); mse=mean((replicate(1000000,
+ median(sample(exs,10,replace=TRUE)))-mm)^2)
> mse
[1] 2.116876
> set.seed(007); mvr=var(replicate(1000000,
+ median(sample(exs,10,replace=TRUE))))
> mvr
[1] 1.881946
> set.seed(007); bias=mm-mean(replicate(1000000,
+ median(sample(exs,10,replace=TRUE))))
> bias
[1] -0.4846973
> mvr+bias^2
[1] 2.116877
> mse
[1] 2.116876
If time permits, we may return to cover bootstrap confidence intervals
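For the record, a minimal sketch of the simplest such interval, the percentile method (an aside, not developed here; the 95% level and the 10000 resamples are an arbitrary choice):
quantile(replicate(10000, median(sample(exs, 10, replace=TRUE))), c(0.025, 0.975))
the 0.025 and 0.975 empirical quantiles of the resampled medians serving as the interval endpoints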

And again different: permutation tests
We have two batches of 10 numbers:
> s1
[1] 3.7551030 11.7892438 4.1296516 0.9881743 1.1722081
[6] 1.1131551 7.5318461 4.9694114 0.6259583 1.3886535
> s2
[1] 1.2574818 1.9363749 1.4094235 0.9201450 2.4456553
[6] 4.9157286 0.7597842 0.5466244 0.7560977 1.5443949
Their means turn out to be quite different
> mean(s1)
[1] 3.746341
> mean(s2)
[1] 1.649171
> dss=mean(s1)-mean(s2)
> dss
[1] 2.097169
Is this difference significant at level 0.05?

The usual take on it
> t.test(s1,s2,alternative="greater")
Welch Two Sample t-test

t = 1.7312, df = 11.26, p-value = 0.05534
alternative hypothesis: true difference in means is greater than 0

mean of x mean of y
3.746341 1.649171
> t.test(s1,s2,var.equal=TRUE,alternative="greater")
Two Sample t-test
t = 1.7312, df = 18, p-value = 0.05026
alternative hypothesis: true difference in means is greater than 0

mean of x mean of y
3.746341 1.649171

The underlying assumptions?
[Figure: strip plot of the pooled data – c(s1, s2) plotted against factor(rep(0:1, c(10, 10)))]

In particular, normality?
[Figure: normal Q-Q plot – Theoretical Quantiles vs Sample Quantiles]
Note: putting both together is a bit tricky here… is it clear why?

Permutation test!
If you think of both batches as exhibiting no difference, you can think of their assignment into s1 and s2 as pure chance: it is then equally likely that they end up as they did, with difference in means dss = 2.097169, as the other way round, with difference in means dss = -2.097169 – and most differences will be around 0 anyway…
However, the observed difference dss = 2.097169: is it not somewhat unlikely? To see that, we could take all possible outcomes under random allocations into s1 and s2 – and figure out how many differences of means exceed 2.097169
That certainly can be done, if we have enough time to wait for the result of all (20 choose 10) = 184756 allocations… if not, then…
… then we can do just random sampling of those, cannot we?
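(Enumerating all of them is in fact still feasible here – a sketch, not from the lecture, with combn doing the enumeration:
s12 <- c(s1, s2)      # pool the two samples
idx <- combn(20, 10)  # all choose(20,10) = 184756 ways to form s1
diffs <- apply(idx, 2, function(i) mean(s12[i]) - mean(s12[-i]))
mean(diffs > dss)     # the exact one-sided permutation p-value
but random sampling, as pursued below, scales better.)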

So, now only how to do it
We first illustrate the code on some very simple numbers – so that we can see what is to be done
> s1
[1] 15 16 17 18 19
> s2
[1] 5 6 7 8 9
> s12 = c(s1,s2) ; s12
[1] 15 16 17 18 19  5  6  7  8  9
> mean(s1)
[1] 17
> mean(s2)
[1] 7
> dss = mean(s1) - mean(s2) ; dss
[1] 10
> sss = sum(s12)/5 ; sss
[1] 24
> sss - 2*mean(s2) ## this is my trick to have the code short
[1] 10
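(Why the trick works: sss = sum(s12)/5 = mean(s1) + mean(s2), so sss - 2*mean(s2) = mean(s1) - mean(s2).)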

So, here is what we are going to do
The above was for the observed s1 and s2. Having combined their values into s12, we are now going to sample new s1 and s2, again and again. For each of those, we compare the difference of their means with dss, the original difference of the means of s1 and s2, and count the proportion of times it gets exceeded
> sample(s12,5)
[1] 17 16  5  6 19
> sss-2*mean(sample(s12,5)) ## note: sample is different – know why?
[1] 7.6
> replicate(10,sss-2*mean(sample(s12,5)))
[1] 1.6 -6.0 -5.6 1.6 -1.6 -1.6 3.6 1.2 2.4 5.2
> replicate(10,sss-2*mean(sample(s12,5))) > dss
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> mean(replicate(10,sss-2*mean(sample(s12,5))) > dss)
[1] 0 ## also here it could be different, but isn’t – guess why

And now with real s1 and s2
> s12=c(s1,s2)
> s12
[1] 3.7551030 11.7892438 4.1296516 0.9881743 1.1722081
[6] 1.1131551 7.5318461 4.9694114 0.6259583 1.3886535
[11] 1.2574818 1.9363749 1.4094235 0.9201450 2.4456553
[16] 4.9157286 0.7597842 0.5466244 0.7560977 1.5443949
> sss=sum(s12)/10 ; sss
[1] 5.395512
> dss
[1] 2.097169
> mean(replicate(100,sss-2*mean(sample(s12,10))) > dss)
[1] 0.06
> mean(replicate(100,sss-2*mean(sample(s12,10))) > dss)
[1] 0.07
Seems like it works now; only N = 100 does not yield a stable result

The final run
> mean(replicate(100000,sss-2*mean(sample(s12,10))) > dss)
[1] 0.04684
> mean(replicate(100000,sss-2*mean(sample(s12,10))) > dss)
[1] 0.04713
> mean(replicate(100000,sss-2*mean(sample(s12,10))) > dss)
[1] 0.04674
> mean(replicate(100000,sss-2*mean(sample(s12,10))) > dss)
[1] 0.04633
Seems like this is stable enough, and consistently below 0.05 – that is, the test would reject the null hypothesis that there is no difference. We can still give it a final run
> set.seed(007)
> mean(replicate(10000000,sss-2*mean(sample(s12,10))) > dss)
and when we return after getting a beer from the fridge, we find
[1] 0.0467867