Bootstrap
Jennifer Wilcock STAT221 2020 S2
Final Topic: Bootstrapping
Bootstrapping is typically used in point estimation, (confidence) interval estimation and hypothesis testing.
Two types of bootstrapping:
• nonparametric, and
• parametric.
We’ll focus on nonparametric bootstrapping.
The idea behind the bootstrap is that we want to make inferences about an unknown population. Instead of making assumptions about the population distribution, we assume that the sample is sufficient to represent the population.
Sampling distribution of a statistic
In general, the sampling distribution for a statistic can be used to explore the variability (uncertainty) due to the sampling process itself, i.e. to summarise the information content in samples of size n.
This is the idea underpinning ‘frequentist’ statistics.
When we obtain multiple samples all sampled from the same population, then there will be sampling variability, and different samples will give slightly different estimates of the thing we want to infer using the sample (for example we might want to infer the population mean, using the sample mean).
Source: Moore & McCabe, Introduction to the Practice of Statistics, Chapter 14 on "Bootstrap Methods and Permutation Tests", freely available from http://bcs.whfreeman.com/ips5e/content/cat_080/pdf/moore14.pdf.

The standard deviation of the sampling distribution is called the standard error. In practice, though, we only ever have one sample.

The problem: How can we know what this sampling distribution is?

Approach 1: Use theory:

If we know the true population distribution then, generally, we can determine the theoretical sampling distribution. This is an approach we might take if we are doing probability theory, say, which is a branch of mathematics.

But in most real world applications we will not know the population distribution. So what can we do?

Approach 2: Use asymptotics:

Traditionally, statisticians have used asymptotic models for sampling distributions under various assumptions about the population (the 'classical' approach, for example in STAT213):
• Key problem: the results may be invalid if the assumptions are incorrect (Remember: all models are wrong!)
• with every assumption the maths involved gets simpler, but
• with every assumption there is a risk that the assumption is wrong.

Approach 3: Use resampling:

An alternative, 'modern' approach is to use a resampling approach called the 'bootstrap':
• we repeatedly resample (with replacement) from the original sample (say, 1000 times)
• this gives a large set of 'resamples' (or 'bootstrap samples'), all with the same sample size, for which we calculate the sample statistic (such as the mean) . . .
• the distribution of the resample statistic approximates the sampling distribution that we need.
• this is like treating the original sample as the population.

[FIGURE 14.4 (a) The idea of the sampling distribution of the sample mean x̄: take very many samples, collect the x̄-values from each, and look at the distribution of these values. (b) The theory shortcut: if we know that the population values follow a normal distribution, theory tells us that the sampling distribution of x̄ is also normal. (c) The bootstrap idea: when theory fails and we can afford only one sample, that sample stands in for the population, and the distribution of x̄ in many resamples stands in for the sampling distribution.]
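To make the resampling idea concrete, here is a minimal sketch in R (the sample values are made up purely for illustration):

x = c(4.2, 5.1, 3.8, 6.0, 4.9)    # a hypothetical sample
xstar = sample(x, replace = TRUE) # one bootstrap resample
mean(xstar)                       # the statistic for this resample

Repeating the last two lines many times builds up the bootstrap distribution of the statistic.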
From this:

[Figure 14.4 (a): the sampling distribution, built from many SRSs of size n drawn from the population]

to this:

[Figure 14.4 (c): the bootstrap distribution, built from many resamples of size n drawn from one SRS]
That bootstrapping 'works' as a method was still established mathematically using asymptotic results, but once we have established that the method works, we can apply bootstrapping as a method very generally.

Advantage of the bootstrap:
• it is very simple to implement.

Disadvantages of the bootstrap:
• it is asymptotically consistent, but is not guaranteed to be consistent in finite samples: it is highly dependent on the initial sample collected.
• there are still assumptions underlying the method, but they tend not to be explicitly stated as they would be in a more classical context. So there is still a risk of these assumptions being violated (for example, that the resamples are independent).
• doing the resampling can be time-consuming.
Think about the usual (parametric) approach to confidence intervals
Suppose we were actually interested in the population mean, estimated using the sample mean X̄.
To generate a confidence interval for the population mean, the usual approach is to assume that X̄ is approximately normally distributed.
This assumption is justified by the Central Limit Theorem, which says that sample means are approximately normally distributed for large samples, provided the population has finite variance. The CLT applies even for highly skewed distributions like the exponential, and for discrete distributions like the Poisson.
Since we know (approximately) the distribution of X̄, we can generate an approximate 100(1 − α)% confidence interval for μ using
X̄ ± z(α/2) × se(X̄)
where se(X̄) = s/√n is the standard error of the sample mean (s is the sample standard deviation).
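As a quick sketch of this normal-theory interval in R (the data here are simulated purely for illustration):

x = rnorm(30, mean = 10, sd = 2) # hypothetical sample
mean(x) + c(-1, 1) * qnorm(1 - 0.05/2) * sd(x) / sqrt(length(x)) # 95% CI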
Nonparametric bootstrap confidence interval for the median
So to generate a confidence interval for the median, we'd like to do something like:

Xm ± (something) × se(Xm)
where (something) is a critical value and se(Xm) is the standard error of the sample median. Since se(Xm) is unknown, we can instead simulate the distribution of Xm.
Note: this formulation assumes the distribution of Xm is symmetric, but for other sample statistics it may be asymmetric.
Example: Nonparametric bootstrapping for the median
Let’s start with an example where we think about the median of a sample.
Suppose we have an iid sample x1, . . . , xn and we are interested in estimating the population median. A point estimate of the population median is the sample median.
To find the median we need to put the observations in order, and we denote the ordered sample data by x(1) ≤ x(2) ≤ … ≤ x(n).
The sample median is then:
• for odd n: xm = x((n+1)/2)
• for even n: xm = (x(n/2) + x(n/2+1)) / 2
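For example, checking the odd-n formula against R's median function (on made-up data):

x = c(3, 1, 4, 1, 5)          # hypothetical data, n = 5 (odd)
sort(x)[(length(x) + 1) / 2]  # x((n+1)/2), gives 3
median(x)                     # agrees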
The question is: How good is this estimate? Can we estimate a confidence interval for the population median?
We can explore this using the iNZight VIT software, using a small dataset called course.csv, to see how the resampling takes place.
Resampling: the bootstrap compared to a permutation test
Both methods of resampling are very popular but they are primarily used for different purposes:
• bootstrapping is best for estimating confidence intervals, and
• a permutation test is best for testing hypotheses (particularly for t-tests and ANOVA).
Bootstrapping is similar to a permutation test in that we are resampling the original sample. However:
• in bootstrapping we resample with replacement,
• in a permutation test we resample without replacement
and
• in bootstrapping we resample the observed values, so that we are sampling (approximately) from the actual data generating distribution,
• in a permutation test we resample the labels, so that we are sampling from a data distribution that satisfies the null hypothesis
and, for hypothesis tests,
• a bootstrapped hypothesis test is less powerful than the corresponding permutation test,
• so while you can actually do a hypothesis test using bootstrapping, you wouldn't in practice.
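The with/without replacement distinction is easy to see in R (a toy sample for illustration):

x = c(5, 7, 9)
sample(x, replace = TRUE) # bootstrap resample: values can repeat
sample(x)                 # permutation: the same values in a new order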
Sampling distributions of the sample mean and median in R
Consider many small samples of size n = 5: one set all sampled from the same Normal distribution (in this case the standard Normal, with mean 0 and sd 1), and a second set all sampled from the same exponential distribution (with rate 1).
We can compare the distribution of the sample mean and the sample median for each:
set.seed(1)
n = 5 # sample size
N = 10000 # number of samples
# create some empty vectors to store the results in
norm.means = norm.medians = exp.means = exp.medians = numeric(N)
for(i in 1:N) { # loop over the 10000 samples
x = rnorm(n)
norm.means[i] = mean(x) # save the mean of each sample
norm.medians[i] = median(x) # and the median
y = rexp(n)
exp.means[i] = mean(y)
exp.medians[i] = median(y)
}
par(mfrow=c(2,2))
hist(norm.means, breaks=seq(-2.5,2.5,0.1), prob=T, cex.axis=1.5, cex.lab=1.5)
hist(norm.medians, breaks=seq(-2.5,2.5,0.1), prob=T, cex.axis=1.5, cex.lab=1.5)
hist(exp.means, breaks=seq(0,4.5,0.1), prob=T, cex.axis=1.5, cex.lab=1.5)
hist(exp.medians, breaks=seq(0,4.5,0.1), prob=T, cex.axis=1.5, cex.lab=1.5)
[Figure: density histograms of norm.means, norm.medians, exp.means and exp.medians]
par(mfrow=c(1,1))
print(c(sd(norm.means), sd(norm.medians), sd(exp.means), sd(exp.medians)))
## [1] 0.4452831 0.5368943 0.4477151 0.4717130
What can we see from these plots?
For the normal samples, the distributions of the mean X̄ and the median Xm both look similar to a normal distribution, but the histogram is slightly more spread out for the sample median than for the sample mean, suggesting that Xm has higher variance than X̄. In fact, this is why the mean is better than the median for inferring the center of a normal distribution. Both X̄ and Xm are unbiased for estimating μ, but X̄ is less variable, so we can be more certain about the mean.
For the exponential samples, both distributions look roughly like a χ2 or Gamma distribution (the sample mean is in fact Gamma). The median looks similar but has increased probability of values close to 0, and has a lower center (compare the distributions near 1 on the horizontal axis).
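We can check the Gamma claim by overlaying the exact density on the histogram: for the standard exponential, the mean of n = 5 observations is Gamma with shape 5 and rate 5 (a quick sketch using the simulated exp.means from above):

hist(exp.means, breaks = seq(0, 4.5, 0.1), prob = T)
curve(dgamma(x, shape = 5, rate = 5), add = TRUE, lwd = 2)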
Start of lecture 14
Bootstrap confidence intervals for the population median
The idea is to simulate the sampling distribution of your statistic (mean, median, mode, interquartile range, 80th percentile, etc.) by resampling from the sample data, and treating the sample as representative of the original (unknown) population.
The general idea makes most sense with larger samples. For smaller samples, it is harder to justify (e.g. extreme observations are unlikely to be observed). But in this situation verifying population
distribution assumptions is also challenging.
Example: Bootstrap for the median
Consider a sample x of 11 observations x = (x1, x2, . . . , x11):

x = c(2, 6, 2, 7, 2, 0, 0, 4, 7, 0, 1)
We can sort the data to see the median directly:
sort(x)
## [1] 0 0 0 1 2 2 2 4 6 7 7
The sample median is x(6) = 2.
To get a bootstrap resample x∗, we sample the n = 11 values with replacement:

set.seed(1)
xstar = sample(x, replace = T)
xstar

## [1] 7 7 0 2 6 0 1 6 1 2 2

sort(xstar)

## [1] 0 0 1 1 2 2 2 6 6 7 7

The median of this resample is 2.
Notice that in the resample the original observations appear different numbers of times. For example, 0
shows up 2 times (rather than 3), and 6 shows up twice (rather than once in the original sample). Further, the observation 4 did not show up at all in the resample.
This variation is due to the effect of sampling with replacement.
An additional bootstrap resample might be:

set.seed(6)
xstar = sample(x, replace = T)
median(sort(xstar))

## [1] 4
In this case, the sample median changes from 2 to 4.
Resampling over and over again gives the sampling distribution for the sample median.
If the resample median is very steady, we can be fairly confident about what the population median is. If it varies a lot from resample to resample, then there is less information in the original sample with which to infer the population median. How about in this data?
For this data, there are a small set of possible resample medians which can be summarized in a table:
x = c(2, 6, 2, 7, 2, 0, 0, 4, 7, 0, 1)
set.seed(1)
N = 10000 # number of bootstrap resamples
boot.medians = numeric(N)
for(i in 1:N) {
  xstar = sample(x, replace = T)
  boot.medians[i] = median(xstar)
}
table(boot.medians)
## boot.medians
## 0 1 2 4 6 7
## 513 1114 6579 1291 427 76
barplot(table(boot.medians))
[Figure: barplot of the bootstrap median frequencies]
We can see that medians of 3 or 5 never occur – looking back at the 11 values in the original sample:
table(x)
## x
## 0 1 2 4 6 7
## 3 1 3 1 1 2
barplot(table(x))
[Figure: barplot of the frequencies of the original sample values]
we can see that, with 11 observations, the median can only take the values 0, 1, 2, 4, 6, or 7 – it can't take the values 3 or 5 since these values are not in the original sample.
We can see that every value observed in the data shows up as a sample median in the bootstrap replicates, but some values are very infrequent.
The proportions corresponding to the table output can be calculated using the prop.table function, which estimates the probability mass function for the bootstrap distribution of the median:
prop.table(table(boot.medians))
## boot.medians
## 0 1 2 4 6 7
## 0.0513 0.1114 0.6579 0.1291 0.0427 0.0076
The medians 0, 6 and 7 are infrequent at 5.1%, 4.3% and 0.8%.
The proportion of times that the median was greater than 0 and less than 6 was 89.8% (0.1114 + 0.6579 +
0.1291 = 0.8984).
If this data is representative of the population as a whole, we might therefore be fairly confident that the population median is greater than 0 and less than 6, and we can think of [1, 6) as a 90% bootstrap confidence interval.
Here, [1, 6) means the value 1 is included in the interval, and the value 6 is excluded.
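We can confirm the coverage of [1, 6) directly from the resample medians:

mean(boot.medians >= 1 & boot.medians < 6)

## [1] 0.8984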
There wasn’t much ability to narrow down the population median for the last example due to the small
sample size, with n = 11.
We can explicitly look at the effect of increasing the size of the sample on the variability of the estimated
population median:
Example with sample size of 11
We can generate a sample of discrete data from a Poisson distribution with rate λ = 3, with n = 11 observations:
n = 11 # sample size
set.seed(1)
(x = rpois(n, 3))
## [1] 2 2 3 5 2 5 6 4 3 1 2
table(x)
## x
## 1 2 3 4 5 6
## 1 4 2 1 2 1
N = 10000 # number of bootstrap resamples
boot.medians = numeric(N)
for(i in 1:N) {
  xstar = sample(x, replace = T)
  boot.medians[i] = median(xstar)
}
prop.table(table(boot.medians)) # proportion
## boot.medians
## 1 2 3 4 5 6
## 0.0002 0.3671 0.4513 0.1262 0.0551 0.0001
cumsum(prop.table(table(boot.medians))) # cumulative proportion

##      1      2      3      4      5      6
## 0.0002 0.3673 0.8186 0.9448 0.9999 1.0000
Here, 94.4% of the medians were either 2, 3 or 4. So we have a 94% confidence interval for the median of [2,4].
Example with sample size of 29
We can compare the above results with data generated from the same Poisson distribution with rate λ = 3, but now with n = 29 observations (chosen for convenience to be an odd number so that the medians are all integers):
n = 29 # sample size
set.seed(1)
(x = rpois(n, 3))

## [1] 2 2 3 5 2 5 6 4 3 1 2 1 4 2 4 3 4 8 2 4 6 2 4 1 2 2 0 2 5

table(x)

## x
## 0  1  2  3  4  5  6  8
## 1  3 10  3  6  3  2  1
N = 10000 # number of bootstrap resamples
boot.medians = numeric(N)
for(i in 1:N) {
  xstar = sample(x, replace = T)
  boot.medians[i] = median(xstar)
}
prop.table(table(boot.medians)) # proportion
## boot.medians
11
## 2 3 4 5
## 0.4245 0.3976 0.1778 0.0001
cumsum(prop.table(table(boot.medians))) # cumulative proportion

##      2      3      4      5
## 0.4245 0.8221 0.9999 1.0000
This time, 82.2% of the medians were either 2 or 3, so increasing the sample size has reduced the variability quite a lot.
Example with sample size of 39
We can compare the above results with data generated from the same Poisson distribution with rate λ = 3, but now with n = 39 observations:
n = 39 # sample size
set.seed(1)
(x = rpois(n, 3))

## [1] 2 2 3 5 2 5 6 4 3 1 2 1 4 2 4 3 4 8 2 4 6 2 4 1 2 2 0 2 5 2 3 3 3 1 5
## [36] 4 4 1 4

table(x)

## x
## 0  1  2  3  4  5  6  8
## 1  5 11  6  9  4  2  1
N = 10000 # number of bootstrap resamples
boot.medians = numeric(N)
for(i in 1:N) {
  xstar = sample(x, replace = T)
  boot.medians[i] = median(xstar)
}
prop.table(table(boot.medians)) # proportion
## boot.medians
## 2 3 4
## 0.2023 0.6691 0.1286
cumsum(prop.table(table(boot.medians))) # cumulative proportion

##      2      3      4
## 0.2023 0.8714 1.0000
This time, 66.9% of the medians were 3, so increasing the sample size has reduced the variability quite a lot.
Theoretically, the median for this population is approximately ⌊λ + 1/3 − 0.02/λ⌋, where ⌊·⌋ rounds down to the nearest integer:

lambda = 3
floor(lambda + (1/3) - (0.02/lambda))

## [1] 3

The same bootstrap procedure can be used for continuous variables.
Percentile Bootstrap Confidence Interval
The examples above show that it can be a bit of a struggle to decide how to define confidence intervals under bootstrapping.
In practice, a simple and common approach to obtain a confidence interval is to identify the percentiles that cover the middle 100(1 − α)% of the bootstrap distribution.
For example, to obtain a 95% confidence interval you could take the 2.5% and 97.5% percentiles to define the lower and upper limits respectively.
This is referred to as the percentile bootstrap confidence interval, or percentile interval for short.
With N = 10,000 bootstrap resamples, the central 9500 observations will correspond to the most likely values for the population statistic, and to obtain this interval, we can use the 250th and 9751st ordered observations as the bootstrap confidence interval limits.
Percentiles
Why the 9751st observation?
For a 95% confidence interval the endpoints are the 2.5 and 97.5 percentiles.
For the 97.5% percentile, this should mean that roughly 97.5% of the bootstrap statistics should fall below the value and 2.5% should fall above it.
If we start counting from the left hand end of the distribution, then the pth percentile corresponds to the (np)th ordered observation when np is an integer, and approximately the (np)th observation otherwise.
Suppose there are N = 100 resamples, and that for each of these we calculate a statistic x. We order the resample statistics so that x1 ≤ x2 ≤ · · · ≤ x100. Then a crude approach to get the value corresponding to the 2.5 percentile might be to:
• round up to the 3rd value, x3; or
• average the 2nd and 3rd values, (x2 + x3)/2.
If you have N = 10, 000 values, then 10000 × 0.025 = 250, so that it looks like using the 250th value is
the way to go for the 2.5 percentile.
However, if we use the 250th value for the 2.5 percentile, then remember that there are 249 points,
x(1), . . . , x(249) below this lower limit of the interval.
To be consistent with this definition, we also need 249 points above the upper limit.
These will be the 249 points x(9752), . . . , x(10000), and we need to use the 9751st value as the 97.5 percentile.
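Pulling these limits out manually in R (a sketch, assuming a vector boot.medians of N = 10,000 bootstrap statistics as in the examples above):

sorted = sort(boot.medians)
c(sorted[250], sorted[9751]) # the 2.5 and 97.5 percentile limits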
So . . . one common solution is to make things 'symmetric' by using x(9751) for the upper limit. Another solution is to make the limits

(x(250) + x(251))/2 and (x(9750) + x(9751))/2.

If there are no ties in the bootstrap statistics, this will ensure exactly 250 data points on either side of the interval. (If there are ties, the averaging might not make a difference.)

Percentiles in R
A more general approach is to let the pth percentile be a weighted average of the observations indexed by ⌊np⌋ and ⌊np + 1⌋, where ⌊·⌋ rounds a decimal down to an integer.
In fact, when calculating percentiles in R using the quantile function, there are 9 different approaches for computing the percentile, see help(quantile) for details.
From this point in the lectures we will just use the default setting of the quantile function.
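For instance, the default method (type = 7) can be compared with one of the alternatives, using a vector boot.medians of bootstrap statistics as above (the choice rarely matters much for large N):

quantile(boot.medians, probs = c(0.025, 0.975)) # default, type = 7
quantile(boot.medians, probs = c(0.025, 0.975), type = 6)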
Example: Percentile interval for the median of an exponential
Consider a sample of size n = 25 from an exponential distribution with a rate of 3.
Recall that this is the probability distribution of the time between events in a Poisson point process, so a
process in which events occur continuously and independently at a constant average rate.
n = 25 # sample size
N = 10000 # number of bootstrap resamples
set.seed(1)
x = rexp(n, rate = 3)
x

## [1] 0.25172728 0.39388093 0.04856891 0.04659842 0.14535621 0.96498951
## [7] 0.40985402 0.17989428 0.31885583 0.04901533 0.46357838 0.25400995
## [13] 0.41253452 1.47464474 0.35151439 0.34508132 0.62534506 0.21824888
## [19] 0.11231116 0.19615991 0.78817175 0.21396420 0.09804013 0.18862184
## [25] 0.03535754

boot.medians = numeric(N)
for(i in 1:N) {
  xstar = sample(x, replace = T)
  boot.medians[i] = median(xstar)
}
boot.medians[1:50]

## [1] 0.2182489 0.2540100 0.3450813 0.2517273 0.2517273 0.3188558 0.3188558
## [8] 0.2182489 0.1961599 0.2540100 0.3188558 0.3450813 0.2182489 0.1453562
## [15] 0.2139642 0.2517273 0.2182489 0.3450813 0.3450813 0.3450813 0.1961599
## [22] 0.2182489 0.2517273 0.2517273 0.2517273 0.2182489 0.2517273 0.2182489
## [29] 0.3188558 0.1798943 0.2139642 0.2517273 0.2540100 0.2139642 0.1886218
## [36] 0.2182489 0.1961599 0.1961599 0.2517273 0.2182489 0.1961599 0.2517273
## [43] 0.2540100 0.3450813 0.1961599 0.2540100 0.1961599 0.2182489 0.2517273
## [50] 0.3188558
So some of the resample medians repeat.
table(boot.medians)
## boot.medians
## 0.0490153301679842 0.0980401292132835 0.11231115879491
## 1 5 20
## 0.145356208593058 0.179894279999038 0.18862184152628
## 76 220 483
## 0.196159907151014 0.213964196077238 0.218248879071325
## 833 1127 1457
## 0.251727277709448 0.254009951823358 0.318855831232818
## 1644 1457 1168
## 0.34508131534358 0.351514389103556 0.393880926369035
## 764 425 218
## 0.409854017804964 0.41253451691161 0.463578376270143
## 77 22 2
## 0.625345057469416
## 1
barplot(table(boot.medians))
[Figure: barplot of the bootstrap median frequencies]
The cumsum function calculates the cumulative sum, which is useful for identifying the 250th and 9751st values:
cumsum(table(boot.medians))
## 0.0490153301679842 0.0980401292132835 0.11231115879491
## 1 6 26
## 0.145356208593058 0.179894279999038 0.18862184152628
## 102 322 805
## 0.196159907151014 0.213964196077238 0.218248879071325
## 1638 2765 4222
## 0.251727277709448 0.254009951823358 0.318855831232818
## 5866 7323 8491
## 0.34508131534358 0.351514389103556 0.393880926369035
## 9255 9680 9898
## 0.409854017804964 0.41253451691161 0.463578376270143
## 9975 9997 9999
## 0.625345057469416
## 10000
By inspection we can see that the 2.5% and 97.5% percentiles are 0.1799 and 0.3939 respectively. We can easily get these values out using the quantile function:

alpha = 0.05 # significance level
quantile(boot.medians, probs = c(alpha/2, 1 - alpha/2))

## 2.5% 97.5%
## 0.1798943 0.3938809

As above, the 2.5% and 97.5% percentiles are 0.1799 and 0.3939 respectively.

For a large sample (from the original population), without too many ties, these values could be reported as the 95% confidence interval.

However for small sample problems (or more generally with a small set of bootstrap statistics), you should estimate the confidence level using the sample proportion between these values:
mean((boot.medians >= quantile(boot.medians, alpha/2)) &
     (boot.medians <= quantile(boot.medians, 1 - alpha/2)))

## [1] 0.9796

So this is actually (approximately) a 98% confidence interval for the median. So, don't be overly attached to the 90%, 95% and 99% confidence levels only!

Start of lecture 15

The boot function in R

R has many inbuilt functions to carry out bootstrapping. The starting point is the boot function in the library of the same name:

library(boot)
help(boot)

The boot function has many arguments:

boot(data, statistic, R, sim = "ordinary", stype = c("i", "f", "w"),
     strata = rep(1, n), L = NULL, m = 0, weights = NULL,
     ran.gen = function(d, p) d, mle = NULL, simple = FALSE, ...,
     parallel = c("no", "multicore", "snow"),
     ncpus = getOption("boot.ncpus", 1L), cl = NULL)

Using the boot function involves two steps. The first is to create a function that:
• resamples the data, and
• calculates the bootstrap statistic (the median in this case).

This function must have (at least) two arguments. The first argument is the original data (a vector, a matrix, or a data frame), and the second is a vector that gives the 'indices' of a resample.

boot.stat = function(x, indices) {
  xstar = x[indices]
  return(median(xstar))
}

The point of this being a separate function is that it can be as complicated as you want or need. For the median of a single sample it looks a bit trivial, but in general these functions are not trivial.

This can then be plugged in to the boot function:

n = 25 # sample size
N = 10000 # number of bootstrap resamples
alpha = 0.05 # significance level
set.seed(1)
x = rexp(n, rate = 3)
boot(x, statistic = boot.stat, R = N)

##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = x, statistic = boot.stat, R = N)
##
##
## Bootstrap Statistics :
##      original        bias    std. error
## t1* 0.2517273  0.004938495  0.05742262

There are multiple different types of bootstrap resampling available:
• ordinary - simple random sampling with replacement (what we have looked at so far), which is the default.
• parametric - parametric resampling assuming a particular population distribution.
• balanced and antithetic - attempt to balance the resampling over strata.
• permutation - permutation of the original sample (without replacement), which is only relevant when the bootstrap statistic depends on the order of the observations.

We'll use ordinary and parametric in this section of the course.

The statistic argument is a function which calculates the bootstrap statistic (e.g. the median in the above example) on each of the bootstrap resamples. The stype argument tells the boot function what the second argument of the bootstrap statistic function represents (the default being 'indices'). R is the number of bootstrap resamples.

Many objects are calculated by the boot function and available to look at:

results = boot(x, statistic = boot.stat, R = N)
names(results)

## [1] "t0"        "t"         "R"         "data"      "seed"
## [6] "statistic" "sim"       "call"      "stype"     "strata"
## [11] "weights"

The output t is the bootstrap statistic calculated on each bootstrap resample. These can be used to get the 95% confidence interval in the same way as we used earlier:

quantile(results$t, probs = c(alpha/2, 1 - alpha/2))

## 2.5% 97.5%
## 0.1798943 0.3938809

and we get exactly the same result as we got when we did the bootstrapping manually ourselves.
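It can also be helpful to peek at the boot object directly (a small sketch using the results object from above):

results$t0      # the statistic calculated on the original sample
head(results$t) # the first few bootstrap statistics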
The boot.ci function in R

R also has a function called boot.ci to directly calculate the confidence interval:

boot.ci(boot.out, conf = 0.95, type = "all",
        index = 1:min(2, length(boot.out$t0)), var.t0 = NULL,
        var.t = NULL, t0 = NULL, t = NULL, L = NULL,
        h = function(t) t, hdot = function(t) rep(1, length(t)),
        hinv = function(t) t, ...)

The main input is the argument boot.out, which is the output of the boot function. conf is the confidence level (as a decimal).

Multiple approaches are available for calculating bootstrap confidence intervals (using the type argument). The percentile interval we used earlier is available using type = "perc":

results = boot(x, statistic = boot.stat, R = N)
boot.ci(results, type = "perc")

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = results, type = "perc")
##
## Intervals :
## Level     Percentile
## 95%   ( 0.1799,  0.3939 )
## Calculations and Intervals on Original Scale

The estimated 95% confidence interval is exactly the same as before. Notice that the confidence level is not corrected for the finite set of possible test statistics.

By default, the boot.ci function calculates all 5 available types of confidence intervals (the studentized interval is omitted below because it requires variance estimates):

results = boot(x, statistic = boot.stat, R = N)
boot.ci(results, type = "all")

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = results, type = "all")
##
## Intervals :
## Level      Normal              Basic
## 95%   ( 0.1347,  0.3585 )   ( 0.1096,  0.3236 )
##
## Level     Percentile            BCa
## 95%   ( 0.1799,  0.3939 )   ( 0.1799,  0.3515 )
## Calculations and Intervals on Original Scale

Bootstrapping of more than one statistic

We can bootstrap more than one statistic at a time, by returning a vector of statistics from the boot.stat function. For example, to return both the sample mean and median on each bootstrap resample:

boot.stat = function(x, indices) {
  xstar = x[indices]
  return(c(mean(xstar), median(xstar)))
}

So here the vector returned is c(mean(xstar), median(xstar)).

Example: Bootstrap for the mean and the median

We will again consider the X ∼ Exponential(λ = 3) example from earlier, with n = 25. The true mean and median of this exponential are

E(X) = 1/λ = 1/3 and x0.5 = log(2)/λ = log(2)/3 = 0.231.

We can do the bootstrapping using the function boot:

boot.stat = function(x, indices) {
  xstar = x[indices]
  return(c(mean(xstar), median(xstar)))
}
n = 25 # sample size
N = 10000 # number of bootstrap resamples
alpha = 0.05 # significance level
set.seed(1)
x = rexp(n, rate = 3)
results = boot(x, statistic = boot.stat, R = N)

Then, to use the boot.ci function, we need to specify the correct statistic to use.
The sample mean is the first statistic:

boot.ci(results, type = "perc", index = 1)

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = results, type = "perc", index = 1)
##
## Intervals :
## Level     Percentile
## 95%   ( 0.2290,  0.4832 )
## Calculations and Intervals on Original Scale

The sample median is the second statistic:

boot.ci(results, type = "perc", index = 2)

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = results, type = "perc", index = 2)
##
## Intervals :
## Level     Percentile
## 95%   ( 0.1799,  0.3939 )
## Calculations and Intervals on Original Scale

The confidence limits are stored as the last two elements (4:5) of the percent object in the results:

boot.ci(results, type = "perc", index = 1)$percent

##      conf
## [1,] 0.95 250.03 9750.98 0.2289523 0.4832073

So the confidence intervals can be added to a density histogram of the bootstrap distribution of the statistic as vertical lines:

hist(results$t[, 1], breaks = 100, prob = T, xlab = "Sample Means",
     main = "Density Histogram of Bootstrap Sample Means")
abline(v = boot.ci(results, type = "perc", index = 1)$percent[4:5])

[Figure: density histogram of the bootstrap sample means, with the percentile interval limits marked]

hist(results$t[, 2], breaks = 100, prob = T, xlab = "Sample Medians",
     main = "Density Histogram of Bootstrap Sample Medians")
abline(v = boot.ci(results, type = "perc", index = 2)$percent[4:5])

[Figure: density histogram of the bootstrap sample medians, with the percentile interval limits marked]

The bootstrap distribution of the sample means is essentially continuous, so the confidence interval limits will be close to the nominal 2.5% and 97.5% percentiles. By contrast, the bootstrap distribution of the sample median is discrete, due to the limited number of unique values in the sample (because of the small sample size). As previously noted, the confidence interval limits for the median will rarely align with the nominal 2.5% and 97.5% percentiles.

Bootstrapping with extra options

For more complex problems it may be necessary to provide extra inputs to the bootstrap statistic function. For example, we can bootstrap the 10% and 90% percentiles of the distribution:

boot.quant = function(x, indices, q.prob) {
  xstar = x[indices]
  return(quantile(xstar, prob = q.prob))
}

The extra argument q.prob is used in the quantile function to specify the percentiles.

Example: Bootstrapping quantiles

Again we consider the X ∼ Exponential(λ = 3) example from earlier, with n = 25.
In this case we know the true 10% and 90% quantiles are given by:

qexp(c(0.1, 0.9), rate = 3)

## [1] 0.03512017 0.76752836

The bootstrap is carried out in the usual way, but with the extra q.prob argument specified:

boot.quant = function(x, indices, q.prob) {
  xstar = x[indices]
  return(quantile(xstar, prob = q.prob))
}
n = 25 # sample size
N = 10000 # number of bootstrap resamples
alpha = 0.05 # significance level
set.seed(1)
x = rexp(n, rate = 3)
results = boot(x, statistic = boot.quant, R = N, q.prob = c(0.1, 0.9))

The 10% percentile is the first statistic:

boot.ci(results, type = "perc", index = 1)

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = results, type = "perc", index = 1)
##
## Intervals :
## Level     Percentile
## 95%   ( 0.0399,  0.1592 )
## Calculations and Intervals on Original Scale

The 90% percentile is the second statistic:

boot.ci(results, type = "perc", index = 2)

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = results, type = "perc", index = 2)
##
## Intervals :
## Level     Percentile
## 95%   ( 0.4035,  1.2708 )
## Calculations and Intervals on Original Scale

And we can plot the bootstrap distribution of each percentile, adding the confidence intervals:

hist(results$t[, 1], breaks = 100, prob = T, xlab = "Sample 10% Percentile",
     main = "Density Histogram of Bootstrap Sample 10% Percentiles")
abline(v = boot.ci(results, type = "perc", index = 1)$percent[4:5])

[Figure: density histogram of the bootstrap sample 10% percentiles, with interval limits marked]

hist(results$t[, 2], breaks = 100, prob = T, xlab = "Sample 90% Percentile",
     main = "Density Histogram of Bootstrap Sample 90% Percentiles")
abline(v = boot.ci(results, type = "perc", index = 2)$percent[4:5])

[Figure: density histogram of the bootstrap sample 90% percentiles, with interval limits marked]

Start of lecture 16

Two sample hypothesis testing

So far we've looked at problems involving only one sample. Now we apply the bootstrap to two sample problems. Here we might be interested in testing whether two population means are different (a hypothesis test), or we might wish to construct a confidence interval for the difference between two means.

The easiest way to test a hypothesis in these examples is not to do a test, but to construct the bootstrap percentile confidence interval and check whether it includes 0 or not, since 0 is the value of the difference under the null hypothesis. If the interval excludes 0, then the data provide evidence against the null, and suggest that the population means (or medians) are different.

We will consider 4 examples. In the first two we will have paired data, and will look at bootstrapping the mean and the median. In the final two examples we will have two independent samples, and again we will look at comparing the means and then the medians of these samples.

Comparisons with paired observations are easier. In this case, each pair (xi, yi) is treated as a single observation, so the bootstrap resamples are taken over these pairs. With two independent samples we will need to bootstrap the two samples independently.
Two sample example 1: Bootstrap for paired sample means

There are lots of ways in which the original data can be arranged; here we consider the two samples x and y when they form the two columns of a matrix:

n = 25 # sample size
N = 10000 # number of bootstrap resamples
alpha = 0.05 # significance level
set.seed(1)
# Simulate correlated samples (paired) from standard normals
x = rnorm(n)
y = rnorm(n, 0.5 * x + 0.5)
plot(y ~ x)

[Figure: scatterplot of y against x]

# Combine the paired data into a matrix with two columns
xy = cbind(x, y) # 'cbind' means 'column bind'

In calculating the bootstrap statistic (difference in means) the resampled datasets x∗ and y∗ are extracted in pairs:

boot.pairedmean = function(xy, indices) {
  xystar = xy[indices, ] # extract resampled x and y datasets
  xstar = xystar[, 1]
  ystar = xystar[, 2]
  return(mean(xstar) - mean(ystar))
}

We can do the bootstrapping:

results = boot(xy, statistic = boot.pairedmean, R = N)
boot.ci(results, type = "perc")

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = results, type = "perc")
##
## Intervals :
## Level     Percentile
## 95%   (-0.8014, -0.1265 )
## Calculations and Intervals on Original Scale

and plot the results:

hist(results$t, breaks = 100, prob = T, ylim = c(0, 4),
     xlab = "Sample Difference in Paired Means",
     main = "Density Histogram of\nBootstrap Sample Difference in Paired Means")
abline(v = boot.ci(results, type = "perc")$percent[4:5])
abline(v = -0.5, col = "red", lwd = 3)
abline(v = mean(x) - mean(y), col = "blue", lwd = 3)
abline(v = 0, lty = 2)
legend("topright",
       legend = c("True Difference", "Empirical Difference", "95% CI", "Zero"),
       col = c("red", "blue", "black", "black"),
       lty = c(1, 1, 1, 2), lwd = c(2, 2, 1, 1), bg = "white")

[Figure: density histogram of the bootstrap differences in paired means, with the true difference, empirical difference, 95% CI limits and zero marked]

The 95% confidence interval is (-0.80, -0.13). This interval covers a range of values that are entirely negative, so the confidence interval doesn't contain zero (the value under the null hypothesis). Hence, we can conclude that there is evidence at the 95% confidence level that the two population means are different.

This 95% confidence interval (-0.80, -0.13) is very similar to that obtained from the usual paired sample t-test, which gives a 95% confidence interval of (-0.81, -0.09).
This gives some confidence that the bootstrap procedure is behaving as we would expect:

t.test(x, y, paired = T)

##
## Paired t-test
##
## data: x and y
## t = -2.5476, df = 24, p-value = 0.01767
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.81075115 -0.08504635
## sample estimates:
## mean of the differences
## -0.4478987

Two sample example 2: Bootstrap for paired sample medians

The statistic function is the same as that for the paired means, except we now use medians:

boot.pairedmedian = function(xy, indices) {
  xystar = xy[indices, ] # extract resampled x and y datasets
  xstar = xystar[, 1]
  ystar = xystar[, 2]
  return(median(xstar) - median(ystar))
}

We can use exactly the same data as in the first example, since the only difference between the examples is what we calculate from the data (medians instead of means) rather than the data itself:

results = boot(xy, statistic = boot.pairedmedian, R = N)
boot.ci(results, type = "perc")

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = results, type = "perc")
##
## Intervals :
## Level     Percentile
## 95%   (-0.7532,  0.1744 )
## Calculations and Intervals on Original Scale

and we can plot the results:

hist(results$t, breaks = 100, prob = T, ylim = c(0, 8),
     xlab = "Sample Difference in Paired Medians",
     main = "Density Histogram of\nBootstrap Sample Difference in Paired Medians")
abline(v = boot.ci(results, type = "perc")$percent[4:5])
abline(v = -0.5, col = "red", lwd = 3)
abline(v = median(x) - median(y), col = "blue", lwd = 3)
abline(v = 0, lty = 2)
legend("topleft",
       legend = c("True Difference", "Empirical Difference", "95% CI", "Zero"),
       col = c("red", "blue", "black", "black"),
       lty = c(1, 1, 1, 2), lwd = c(2, 2, 1, 1), bg = "white")

[Figure: density histogram of the bootstrap differences in paired medians, with the true difference, empirical difference, 95% CI limits and zero marked]

The bootstrapping gives a 95% confidence interval for the difference in the medians of (-0.75, 0.17). This is wider than for the means and does include zero. Hence, we conclude that there is not enough evidence at the 95% confidence level of a difference between the two population medians.

Problems with two independent samples

With two independent samples, bootstrapping is slightly more complicated since the bootstrapping is applied to the two samples independently. So there will be two resamples in each bootstrap iteration, one for each of the two original samples.

Two sample example 3: Bootstrap for independent sample means

The most basic approach would be to bootstrap each sample separately, and then combine the resamples.
This would require a function that calculates the mean for a single resample:

boot.mean = function(x, indices) {
  xstar = x[indices]
  return(mean(xstar))
}

and would use this function to find the bootstrap mean for the two samples separately:

nx = 20 # sample size for x
ny = 30 # sample size for y
N = 10000 # number of bootstrap resamples
alpha = 0.05 # significance level
set.seed(1)
# Simulate two samples from standard normals, but with different means
x = rnorm(nx, mean = 0)
y = rnorm(ny, mean = 1)
results.x = boot(x, statistic = boot.mean, R = N)
results.y = boot(y, statistic = boot.mean, R = N)

and we could then look at the difference:

mean.diffs = results.x$t - results.y$t
quantile(mean.diffs, probs = c(alpha/2, 1 - alpha/2))

## 2.5% 97.5%
## -1.3351673 -0.3765136

This gives a confidence interval for the difference in the means of (-1.3351, -0.3765).

However a better approach is to bootstrap both together, which means that we can then use the boot.ci function. We generate the original data as we did before:

nx = 20 # sample size for x
ny = 30 # sample size for y
N = 10000 # number of bootstrap resamples
alpha = 0.05 # significance level
set.seed(1)
# Simulate two samples from standard normals, but with different means
x = rnorm(nx, mean = 0)
y = rnorm(ny, mean = 1)
However we now concatenate the x and y samples together and also create a grouping factor id (which creates a label similar to that used in the permutation test for two independent samples):

xy = c(x, y)
id = as.factor(c(rep("x", nx), rep("y", ny)))

We can start by writing a statistic function that 'hard codes' the sample sizes when extracting the indices 1:20 and 21:50:

boot.twomeans = function(xydata, indices) {
  # extract resampled x and y datasets
  xstar = xydata[indices[1:20]]
  ystar = xydata[indices[21:50]]
  return(mean(xstar) - mean(ystar))
}

and then calculate the confidence interval. In boot we can now use the strata = id argument, which uses id to identify the two 'strata' (or groups) within which the resampling is carried out:

results = boot(xy, statistic = boot.twomeans, strata = id, R = N)
boot.ci(results, type = "perc")

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = results, type = "perc")
##
## Intervals :
## Level     Percentile
## 95%   (-1.3368, -0.3754 )
## Calculations and Intervals on Original Scale

This gives a confidence interval for the difference in the means of (-1.3368, -0.3754), which is very similar to the interval we found before, of (-1.3351, -0.3765).

A more general version of the statistic function takes sample sizes as extra arguments, so that it can be used with arbitrary sample sizes:

boot.twomeans = function(xydata, indices, nx, ny) {
  # extract resampled x and y datasets
  xstar = xydata[indices[1:nx]]
  ystar = xydata[indices[(nx + 1):(nx + ny)]]
  return(mean(xstar) - mean(ystar))
}
results = boot(xy, statistic = boot.twomeans, strata = id, R = N,
               nx = nx, ny = ny)
boot.ci(results, type = "perc")

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = results, type = "perc")
##
## Intervals :
## Level     Percentile
## 95%   (-1.3374, -0.3723 )
## Calculations and Intervals on Original Scale

Again, we can plot the results:

hist(results$t, breaks = 100, prob = T, xlab = "Sample Difference in Means",
     main = "Density Histogram of\nBootstrap Sample Difference in Independent Means")
abline(v = boot.ci(results, type = "perc")$percent[4:5])
abline(v = -1, col = "red", lwd = 3)
abline(v = mean(x) - mean(y), col = "blue", lwd = 3)
abline(v = 0, lty = 2)
legend("topright",
       legend = c("True Difference", "Empirical Difference", "95% CI", "Zero"),
       col = c("red", "blue", "black", "black"),
       lty = c(1, 1, 1, 2), lwd = c(2, 2, 1, 1), bg = "white")

[Figure: density histogram of the bootstrap differences in independent means, with the true difference, empirical difference, 95% CI limits and zero marked]

Since the 95% confidence interval for the difference in means is (-1.33, -0.37), which does not include zero, we conclude there is enough evidence at the 95% confidence level to say that the two population means are different.

Two sample example 4: Bootstrap for independent sample medians

We can do the same thing but now looking at the difference between the medians of two independent samples. We can use the same original data as in example 3:

boot.twomedians = function(xydata, indices, nx, ny) {
  # extract resampled x and y datasets
  xstar = xydata[indices[1:nx]]
  ystar = xydata[indices[(nx + 1):(nx + ny)]]
  return(median(xstar) - median(ystar))
}
results = boot(xy, statistic = boot.twomedians, strata = id, R = N,
               nx = nx, ny = ny)
boot.ci(results, type = "perc")

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = results, type = "perc")
##
## Intervals :
## Level     Percentile
## 95%   (-1.4182, -0.2641 )
## Calculations and Intervals on Original Scale

We can plot the results. One thing we can see is that the interval for the difference between medians is asymmetric around the estimated difference in the median.
hist(results$t, breaks = 100, prob = T, xlab = "Sample Difference in Medians",
     main = "Density Histogram of\nBootstrap Sample Difference in Independent Medians")
abline(v = boot.ci(results, type = "perc")$percent[4:5])
abline(v = -1, col = "red", lwd = 3)
abline(v = median(x) - median(y), col = "blue", lwd = 3)
abline(v = 0, lty = 2)
legend("topright",
       legend = c("True Difference", "Empirical Difference", "95% CI", "Zero"),
       col = c("red", "blue", "black", "black"),
       lty = c(1, 1, 1, 2), lwd = c(2, 2, 1, 1), bg = "white")

[Figure: density histogram of the bootstrap differences in independent medians, with the true difference, empirical difference, 95% CI limits and zero marked]

The 95% confidence interval for the difference in the medians is (-1.41, -0.26), which does not include zero. Thus we conclude that there is enough evidence at the 95% confidence level to conclude that the two population medians are different.

The effect of outliers

Outliers can have strange effects on bootstrap sampling distributions. Consider the bootstrap distribution for the mean of a single sample, where the sample has had a single, extreme outlier added to it:

boot.mean = function(x, indices) {
  xstar = x[indices]
  return(mean(xstar))
}
set.seed(1)
x = rnorm(20)
x = c(x, 100) # add one (extreme) outlier to the sample
results = boot(x, statistic = boot.mean, R = 10000)
hist(results$t, breaks = 100, prob = T, xlab = "Sample Means",
     main = "Density Histogram of Bootstrap Sample Means")

[Figure: multimodal density histogram of the bootstrap sample means]

This outlier is so extreme that it would have a significant effect on the mean of any given resample if that outlier is included in the resample. It will have an even bigger effect on the resample mean if it is included in the resample more than once. So the histogram above is an exaggeration of the effect of there being an outlier in the original sample.

So . . . let's look at the probability that a (single) outlier is included in a bootstrap resample . . .

The calculation is made more complicated by the fact that we resample with replacement. Think about creating a single resample, and consider the probability of sampling just one of the resample values. Each observation in the original sample has a probability of 1/n of being sampled, and thus a probability of 1 − 1/n of not being sampled. Therefore, as we are sampling with replacement, the probability that an outlier is not included in a single resample is (1 − 1/n)^n. It turns out that this converges to exp(−1) ≈ 0.37 as n → ∞. So, in slightly less than 2/3 of bootstrap resamples the outlier will be included.

The number of times the outlier is included is a random variable over 0, 1, 2, . . . times. Consequently the bootstrap distribution for the mean can be multimodal, with modes corresponding to the number of times the outlier appears. This is what we can see in our histogram above.

For less extreme outliers than this, the bootstrap distribution might appear bimodal, for example, or the multiple modes may merge together to increase the variability in the sampling distribution.
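We can check this inclusion-probability calculation numerically (using the sample size from the outlier example above, n = 20 + 1 = 21):

n = 21
(1 - 1/n)^n # probability the outlier misses a given resample, about 0.36
exp(-1)     # the limiting value, approximately 0.37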
Parametric vs nonparametric bootstrapping

So far we've looked at nonparametric bootstrapping, in which we resample from the original data. This is called nonparametric because we don't assume anything about the distribution that the data come from; we only assume that the data are representative of the population.

In parametric inference, we assume that the data come from some family of distributions indexed by one or more parameters. For example, we might assume that the data come from a Normal(μ, σ2), or a Poisson(λ), etc, and the goal is usually to estimate the parameter(s), or a function of the parameters.

Parametric bootstrap

The idea behind the parametric bootstrap is to assume that the sample comes from a particular population, F(θ). The random bootstrap resamples are then drawn from that distribution, rather than simply resampling from the original data. The parameter θ of the distribution is estimated from the original sample data, and the estimate is denoted θ̂. The bootstrap resamples are then said to come from the fitted distribution F(θ̂).

This requires knowing something about (or assuming something about) the distribution of the data, which may or may not be realistic. This approach is useful if we wish to compute quantities that are difficult to derive theoretically, or when the sampling distribution is limited due to small samples or restricted support due to the sample statistic.

Example: Revisiting the bootstrap for the median

In this example we go back and look again at the earlier example of the Exponential(λ = 3) population median. In this case we know that the true median is x0.5 = log(2)/λ = log(2)/3 = 0.231.

n = 25 # sample size
N = 10000 # number of bootstrap resamples
alpha = 0.05 # significance level
set.seed(1)
x = rexp(n, rate = 3)
boot.medians = numeric(N)
for(i in 1:N) {
  xstar = sample(x, replace = T)
  boot.medians[i] = median(xstar)
}
quantile(boot.medians, probs = c(alpha/2, 1 - alpha/2))

## 2.5% 97.5%
## 0.1798943 0.3938809

Here the sample size is small, and it is also the case that, because we are considering the median, the statistic itself limits the number of possible outcomes. We therefore have a sampling distribution that is discrete, with only a few outcomes that have a non-negligible chance of occurring.

hist(boot.medians, breaks = 100, prob = T, xlab = "Sample Medians",
     main = "Density Histogram of Bootstrap Sample Medians")

[Figure: discrete-looking density histogram of the bootstrap sample medians]

If we assume that the original data do come from an exponential distribution then we are in the realms of the parametric bootstrap, and . . .

We know that the expected value (the mean) of the exponential is E(X) = 1/λ. So if we work out the mean of the original sample x̄, this gives us an estimate of 1/λ. So we can use this result to give an intuitive estimate of λ, which would be 1/x̄ (in fact this is the maximum likelihood estimator, which is covered in STAT213 . . . ).
So we can estimate λ from the original sample to get λ̂, and then we can 'plug in' this estimate and generate our resamples from the assumed data generating distribution Exponential(λ̂):

n = 25 # sample size
N = 10000 # number of bootstrap resamples
alpha = 0.05 # significance level
set.seed(1)
x = rexp(n, rate = 3)
# Estimate the exponential parameter
lambdahat = 1 / mean(x)
boot.medians = numeric(N)
for(i in 1:N) {
  xstar = rexp(n, rate = lambdahat) # sample from the 'fitted distribution'
  boot.medians[i] = median(xstar)
}
quantile(boot.medians, probs = c(alpha/2, 1 - alpha/2))

## 2.5% 97.5%
## 0.1278102 0.3984479

We can plot these results:

hist(boot.medians, breaks = 100, prob = T, xlab = "Sample Medians",
     main = "Density Histogram of Bootstrap Sample Medians")

[Figure: essentially continuous density histogram of the parametric bootstrap sample medians]

As a result, the sampling distribution for the median is now continuous, which is more realistic. But!!! the improvement in performance is reliant on the assumed population distribution being correct and the parameter estimates being reliable.

Example: A more complicated parametric bootstrap

Now suppose we believe the data are generated from a population that follows a Poisson distribution, but that we don't know the parameter of the distribution, λ (which is both the expected value and the variance of the Poisson).

For a random sample X taken from the same distribution as this data, we might wish to estimate something like:
1. E[√X], or
2. P(X is an odd number).

Both these quantities arise in practice:
1. E[√X]: count data are often transformed using square roots to make them less skewed, rather than taking the logarithm, since taking the square root of 0 is not a problem (whereas log 0 is undefined).
2. P(X is an odd number): The probability that X is odd is used in genetics, where genes are shuffled around when chromosomes are formed in eggs and sperm:
• if there is an odd number of 'cross-over' events between two genes, the two genes will appear recombined; and
• if it is an even number, they do not appear recombined.
This is like flipping a switch. If a switch is in the 'on' position and is flipped an odd number of times, it changes to 'off'. If it is flipped an even number of times, it ends up being 'on' again.

Example 1: A difficult expected value

For the first example, the problem is how to deal with the expected value of a square root. Possible approaches include:
1. transforming the random variable and taking the expected value of this new variable,
2. evaluating

E[√X] = Σ_{x=0}^∞ √x P(X = x),

or, for a continuous random variable,

E[√X] = ∫ √x f(x) dx,

where f(x) is the density; however this is often difficult to evaluate,
3. approximating √X using a Taylor series (the approach in MATH103 and STAT213),
4. bootstrapping its distribution.

Here we consider this final approach using a parametric bootstrap. If we know that our sample X comes from a Poisson, then we can estimate λ using the sample mean λ̂ = x̄. We can then bootstrap resamples from a Poisson(λ̂), and take the average of √x1, √x2, √x3, . . . , √xn. If we do this N times, we get a bootstrap distribution which we can use to estimate E[√X] and get a confidence interval.
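For reference, E[√X] can also be approximated directly by truncating the infinite sum at a large value (a quick numerical check, assuming the true λ = 3; the value is roughly 1.63, comfortably inside the bootstrap interval below):

sum(sqrt(0:100) * dpois(0:100, lambda = 3)) # approx E[sqrt(X)] for Poisson(3)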
We first generate a sample that we can use to demonstrate this method, and then do the parametric bootstrapping:

n = 25 # sample size
N = 10000 # number of bootstrap resamples
alpha = 0.05 # significance level
set.seed(1)
x = rpois(n, lambda = 3)
# Estimate the Poisson parameter
lambdahat = mean(x)
boot.expected.values = numeric(N)
for(i in 1:N) {
  xstar = rpois(n, lambda = lambdahat) # sample from the fitted distribution
  boot.expected.values[i] = mean(sqrt(xstar))
}
quantile(boot.expected.values, probs = c(alpha/2, 1 - alpha/2))

## 2.5% 97.5%
## 1.490392 1.935225

This gives the bootstrap distribution:

hist(boot.expected.values, breaks = 100, prob = T,
     xlab = "Sample Expected Values",
     main = "Density Histogram of Bootstrap Expected Value of Square Root")

[Figure: density histogram of the bootstrap expected values of the square root]

Doing a parametric bootstrap using the boot function requires an extra function to specify how the random samples from the fitted distribution are generated; we then pass this function to boot via the ran.gen argument, where ran.gen stands for 'random number generator'. The estimated parameters can be directly specified using the mle argument, rather than having to explicitly calculate them.

boot.pois = function(x, param) {
  xstar = rpois(length(x), lambda = param)
  return(xstar)
}
boot.root = function(x) {
  return(mean(sqrt(x)))
}
results = boot(x, statistic = boot.root, sim = "parametric",
               ran.gen = boot.pois, mle = lambdahat, R = 10000)
boot.ci(results, type = "perc")

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = results, type = "perc")
##
## Intervals :
## Level     Percentile
## 95%   ( 1.482,  1.934 )
## Calculations and Intervals on Original Scale

Example 2: A difficult probability

Suppose we are interested in estimating P(X is odd), where we have a sample drawn from a population that we assume is Poisson. To apply the parametric bootstrap we:
1. assume X ∼ Poisson(λ) and estimate the parameter λ using λ̂ = x̄,
2. generate bootstrap resamples from the Poisson(λ̂),
3. for each resample record the proportion of odd numbers,
4. repeat the resampling steps 2-3 many times.

We first generate a sample that we can use to demonstrate this method:

n = 100 # sample size
N = 10000 # number of bootstrap resamples
alpha = 0.05 # significance level
set.seed(1)
x = rpois(n, lambda = 3)

We need a function to calculate the statistic:

boot.odd = function(x) {
  is.odd = (x %% 2) == 1
  return(mean(is.odd))
}

Here we have used the fact that:
• (x %% 2) (x mod 2) gives 0 when even and 1 when odd, so that
• (x %% 2) == 1 gives TRUE when odd, and thus mean() of this vector gives the proportion.
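A quick check of the odd/even logic on some made-up values:

v = c(0, 1, 2, 3, 4, 7) # hypothetical counts
v %% 2                  # 0 1 0 1 0 1
mean((v %% 2) == 1)     # 0.5, the proportion that are odd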
We also need a function that will generate resamples from a Poisson distribution:

boot.pois = function(x, param) {
  xstar = rpois(length(x), lambda = param)
  return(xstar)
}

And we can now do the (parametric) bootstrapping:

results = boot(x, statistic = boot.odd, sim = "parametric",
               ran.gen = boot.pois, mle = lambdahat, R = 10000)
boot.ci(results, type = "perc") # 95% CI is (0.4, 0.6)

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = results, type = "perc")
##
## Intervals :
## Level     Percentile
## 95%   ( 0.4,  0.6 )
## Calculations and Intervals on Original Scale

(Note that lambdahat here was estimated from the earlier n = 25 sample; strictly, it should be re-estimated from this new sample as mean(x).)

and plot the results:

plot(prop.table(table(results$t)), xlab = "Sample Proportion Odd",
     ylab = "Probability",
     main = "Probability Mass Function of Bootstrap Proportion Odd")

[Figure: probability mass function of the bootstrap proportion odd, centred near 0.5]

And that's the end of STAT221 for 2020 !!