---
title: "Week 11 Lab, MAST20005/MAST90058"
author: "School of Mathematics and Statistics, University of Melbourne"
date: "Semester 2, 2022"
institute: "University of Melbourne"
fontsize: 11pt
classoption: a4paper
output:
  html_document:
    toc: true
    toc_float:
      collapsed: false
---
**Goals:** (i) Properties of order statistics; (ii) Confidence
intervals for quantiles; (iii) An introduction to the bootstrap and an example
of its use.
```{r echo=FALSE}
set.seed(2331)
```

# Order statistics

Let $X_{(1)} < X_{(2)} < X_{(3)}$ be order statistics of a random sample
$X_1,X_2,X_3$ from the uniform distribution $\mathrm{Unif}(\theta - 0.5, \theta +
0.5)$. Three possible estimators for the median are the sample mean $W_1 =
\bar{X}$, the sample median $W_2 = X_{(2)}$, and the midrange $W_3 = (X_{(1)} +
X_{(3)}) / 2$.
The theoretical pdf for the smallest order statistic, $X_{(1)}$, is
\[ g_1(x) = 3 \left(1 - F(x)\right)^2 f(x) = 3 \left(\theta + 0.5 - x\right)^2, \quad
\theta - 0.5 < x < \theta + 0.5, \]
which for $\theta = 0.5$ simplifies to $g_1(x) = 3(1 - x)^2$ on $0 < x < 1$.
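As a quick sanity check, we can verify numerically that this density (for
$\theta = 0.5$) integrates to 1:

```{r}
# Numerical check that g is a valid pdf on (0, 1).
integrate(function(x) 3 * (1 - x)^2, lower = 0, upper = 1)
```
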
Simulate the distribution of the first order statistic $X_{(1)}$ and compare
the resulting density with the above theoretical density, using some particular
value for $\theta$ (for example $\theta = 0.5$).
```{r fig.width=6, fig.height=3.5}
theta <- 0.5
x1.simulated <- numeric(1000)  # initialise an empty vector
for (i in 1:1000) {
    x <- runif(3, theta - 0.5, theta + 0.5)
    x1.simulated[i] <- min(x)
}
g <- function(x)
    3 * (1 - x)^2
hist(x1.simulated, breaks = 20, freq = FALSE, col = "lightblue",
     xlim = c(0, 1), ylim = c(0, g(0)),
     main = NULL, xlab = expression(x[(1)]))
curve(g, from = 0, to = 1, add = TRUE, col = "blue", lwd = 2)
```
Simulate 100 samples of size 3 from the uniform model above and calculate
and store values of $W_1$, $W_2$ and $W_3$.
```{r}
w.simulated <- matrix(nrow = 100, ncol = 3)  # initialise an empty matrix
for (i in 1:100) {
    x <- runif(3, theta - 0.5, theta + 0.5)
    x <- sort(x)
    w1 <- mean(x)
    w2 <- x[2]
    w3 <- (x[1] + x[3]) / 2
    w <- c(w1, w2, w3)
    w.simulated[i, ] <- w
}
```
Compare the values of the sample means and sample variances for $W_1$,
$W_2$ and $W_3$. Which of these statistics is the best estimator of $\theta$?
```{r}
# Compute `mean` and `var` for each column of `w.simulated`.
means <- apply(w.simulated, 2, mean)
vars <- apply(w.simulated, 2, var)
```
Here, `apply()` applies the given function (3rd argument) to each column of
the given matrix (1st argument). If the second argument were `1` instead of
`2`, it would apply the function to each row.
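For example, on a small matrix:

```{r}
m <- matrix(1:6, nrow = 2)  # 2 rows, 3 columns (filled column by column)
apply(m, 2, sum)  # column sums: 3, 7, 11
apply(m, 1, sum)  # row sums: 9, 12
```
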
Simulated estimates are affected by random sampling error, so one could
question whether 100 runs are enough to give reliable results. Therefore, it
is typical either to run a very large number of simulations or to present
simulation results as interval estimates. By the Central Limit Theorem, the
means across simulation runs are approximately normally distributed. That
allows us to calculate 95\% confidence intervals for $E(W_1)$, $E(W_2)$ and
$E(W_3)$ in the following way:
```{r}
# CI for E(W1).
means[1] + c(-1, 1) * qnorm(0.975) * sqrt(vars[1]) / sqrt(100)
```
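The same calculation can be vectorised over all three estimators at once; a
small sketch using the `means` and `vars` computed above:

```{r}
# 95% CIs for E(W1), E(W2) and E(W3), one per row.
ci <- cbind(lower = means - qnorm(0.975) * sqrt(vars / 100),
            upper = means + qnorm(0.975) * sqrt(vars / 100))
rownames(ci) <- c("W1", "W2", "W3")
ci
```
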
*Note:* It turns out that $E(W_1) = E(W_2) = E(W_3) = 0.5$ (you may check
this as a homework problem). The intervals above should all contain the true
value $0.5$.
# Confidence intervals for quantiles
Let $X \sim \mathrm{Unif}(0,1)$ and consider a random sample of size 11 from $X$. In
the lectures we saw that if $m$ is the median and $X_{(1)},\dots,X_{(n)}$ are
the order statistics then
\[ \Pr(X_{(i)} < m < X_{(j)})
= \sum_{k = i}^{j - 1} \binom{n}{k} \left(\frac{1}{2}\right)^k
\left(\frac{1}{2}\right)^{n-k}. \]
We will check this formula using R by computing confidence intervals for the
median of $X$.
Use `qbinom()` to compute quantiles of the
$\mathrm{Bi}(11, 0.5)$ distribution (e.g.\ find $\pi_{0.975}$ so that
$\Pr(Y \leqslant \pi_{0.975}) \approx 0.975$, where $Y \sim \mathrm{Bi}(11, 0.5)$).

```{r}
qbinom(c(0.025, 0.975), 11, 0.5)
```
Note that these are only approximate quantiles, since the distribution is
discrete. However, they tell us which order statistics to use as the
endpoints of the desired confidence interval: here, $X_{(2)}$ and $X_{(9)}$.
Determine $\Pr(X_{(2)} < m < X_{(9)})$:

```{r}
pbinom(8, 11, 0.5) - pbinom(1, 11, 0.5)
```
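Equivalently, we can sum the binomial pmf directly over $k = 2, \dots, 8$,
matching the formula above:

```{r}
sum(dbinom(2:8, 11, 0.5))  # same value as pbinom(8, ...) - pbinom(1, ...)
```
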
Note that the confidence interval is only approximate. However, it is still
useful since it ensures a confidence level slightly larger than 95\%.
The R command `X <- runif(11)` simulates 11 observations from $X$,
while `sort()` computes the order statistics. We automate calculation of
$X_{(2)}$ and $X_{(9)}$ by the following function:
```{r}
f <- function() {
    X <- runif(11)
    Y <- sort(X)
    c(Y[2], Y[9])
}
f()  # try it out
```
Next we check the coverage probability for the interval above by
simulation.
```{r fig.height=3.5, fig.width=6}
nsimulations <- 100
C <- t(replicate(nsimulations, f()))
matplot(C, type = "l", xlab = "Simulated sample", ylab = "CI")
abline(h = 0.5, lty = 2, col = "darkgrey")
mean((C[, 1] < 0.5) & (0.5 < C[, 2]))
```
The above code computes the proportion of simulated intervals that contain the
true median value $0.5$. Is this close to the probability calculated above?
The associated plot shows the confidence interval endpoints for each simulation
run. If you want more precision, repeat with `nsimulations = 1000`.
# Bootstrap*
## Introduction
The bootstrap is a computational technique for approximating the sampling
distribution of almost any statistic. It is an example of a *resampling
method*, so called because it involves repeatedly drawing samples from the
original sample. The resulting approximate distribution can be used to obtain
confidence intervals or to carry out hypothesis tests.
Suppose we have an iid sample $X_1, \dots, X_n$ from some unknown distribution.
Our main interest is to find the distribution of some statistic, say
$\hat{\theta}$ (e.g. sample median, sample variance, estimate of a regression
coefficient). Such a distribution can be approximated by the following simple
procedure:
1. Obtain a new sample, $X_1^\ast, \dots, X^\ast_n$, by drawing **with
replacement** from the original observations $X_1, \dots, X_n$.
2. Using the sample $X_1^\ast, \dots, X^\ast_n$, compute and store the
statistic $\hat{\theta}^\ast$.
3. Repeat steps 1 and 2 many times, say $B$, where $B$ might be
1\,000, 2\,000 or even 10\,000, thus obtaining $\hat{\theta}^\ast_1, \dots,
\hat{\theta}^\ast_B$. These are known as the *bootstrapped statistics*.
4. The empirical distribution of the bootstrapped statistics can be regarded
as an approximation of the distribution for the statistic $\hat{\theta}$
computed from the original sample (thus, they can be used to find confidence
intervals, etc.).
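To make the procedure concrete, here is a minimal sketch of steps 1 to 3 as a
helper function (the name `boot_stat` is only for illustration; the lab code
below writes the loop out explicitly):

```{r}
# Resample with replacement B times and apply `stat` to each resample.
boot_stat <- function(x, stat, B = 1000) {
    replicate(B, stat(sample(x, replace = TRUE)))
}
```
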
Advanced theory shows that the bootstrap approximation works well even if $n$
is quite small. This is in contrast to procedures relying on the Central Limit
Theorem, which typically require much larger samples. For example, let us
consider iid samples from the Cauchy distribution with pdf,
\[ f(x \mid \theta) = \frac{1}{\pi [1 + (x - \theta)^2]},
\quad -\infty < x < \infty. \]
The parameter $\theta$ represents the median. In R we can generate a sample
of size $n = 25$ with $\theta = 5$ as follows:
```{r}
x <- rcauchy(25, location = 5)
```
Interestingly, many values are between 3 and 7, so they resemble a sample from
a normal distribution with mean 5 and standard deviation 1. However, note that
there are also some outliers with quite extreme values. These occur because
the Cauchy distribution has very heavy tails. The presence of outliers
suggests that $\bar{X}$ is not a very good estimator of the location.
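To see both the resemblance and the difference in the tails, we can overlay
the two densities:

```{r fig.width=6, fig.height=4}
curve(dcauchy(x, location = 5), from = 0, to = 10, col = "blue", lwd = 2,
      ylab = "Density")
curve(dnorm(x, mean = 5, sd = 1), from = 0, to = 10, add = TRUE, lty = 2)
legend("topright", legend = c("Cauchy(5)", "N(5, 1)"),
       col = c("blue", "black"), lty = c(1, 2), lwd = c(2, 1))
```
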
Next compare the distribution of the sample mean $\bar{X}$ with that of the
trimmed mean $\bar{X}_\text{tr}$ (mean without the most extreme observations in
each tail). The statistics computed from the original sample are:
```{r}
x.bar <- mean(x)
x.bar.tr <- mean(x, trim = 0.35)  # exclude 35% of observations from each tail
```
The statistics can be bootstrapped as follows:

```{r}
B <- 1000  # number of bootstrap resamples
x.bar.boot <- numeric(B)
x.bar.tr.boot <- numeric(B)
for (i in 1:B) {
    x.ast <- sample(x, size = 25, replace = TRUE)
    x.bar.boot[i] <- mean(x.ast)
    x.bar.tr.boot[i] <- mean(x.ast, trim = 0.35)
}
```
Plot the distribution of the bootstrapped statistics, using a common scale:
```{r fig.width=6, fig.height=8.2}
xlim <- range(x.bar.boot, x.bar.tr.boot)
ylim <- c(0, 0.7)
par(mfrow = c(2, 1), mar = c(5.1, 4.1, 1, 1))
hist(x.bar.boot, xlab = expression(bar(X)), freq = FALSE,
     xlim = xlim, ylim = ylim, col = "lightblue", main = NULL)
hist(x.bar.tr.boot, xlab = expression(bar(X)[tr]), freq = FALSE,
     xlim = xlim, ylim = ylim, col = "lightblue", main = NULL)
```
To find a 95\% confidence interval for $\theta$ we will use the
percentile bootstrap method. This involves simply calculating the sample
quantiles of the bootstrapped statistics; for example, the 2.5\% and 97.5\%
sample quantiles in order to get a 95\% confidence interval. In R, sample
quantiles are computed using `quantile()`.
```{r}
quantile(x.bar.tr.boot, c(0.025, 0.975))
```
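For comparison, applying the same method to the bootstrapped (untrimmed)
sample means typically gives a much wider interval, reflecting how unstable
$\bar{X}$ is under the Cauchy model:

```{r}
quantile(x.bar.boot, c(0.025, 0.975))
```
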
The percentile bootstrap is a distribution-free method, since we make no
specific assumptions about the underlying distribution.
## Old Faithful geyser data
In this section we use the waiting times (in minutes) between the starts of
successive eruptions of the Old Faithful geyser in Yellowstone National Park
(Wyoming, USA). The data were collected continuously from 1 August until 15
August, 1985. We examine various features of this data using the bootstrap
method. The data should be available in the data frame `faithful` in
any standard installation of R.
The following computes the sample mean and shows a histogram of the
waiting times.

```{r fig.width=6, fig.height=4}
x <- faithful$waiting
mean(x)
hist(x, col = "lightblue")
```
Generate 10,000 bootstrap replicates of $\bar{X}$ and plot the bootstrap
distribution.
```{r fig.width=6, fig.height=5}
B <- 10000
x.bar.boot <- numeric(B)
for (i in 1:B) {
    x.ast <- sample(x, replace = TRUE)
    x.bar.boot[i] <- mean(x.ast)
}
hist(x.bar.boot, xlab = expression(bar(X)), freq = FALSE,
     col = "lightblue", main = NULL)
```
A 95\% confidence interval for the true mean waiting time $\mu$ is
obtained as follows:

```{r}
quantile(x.bar.boot, c(0.025, 0.975))
```
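As a point of comparison (not required for the bootstrap itself), the usual
CLT-based interval should be very similar here, since the sample is large:

```{r}
mean(x) + c(-1, 1) * qnorm(0.975) * sd(x) / sqrt(length(x))
```
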
Note that the bootstrap interval uses no information about the true
distribution generating the data. It is then easy to see how this procedure
gets its name: it is like "pulling yourself up by your own bootstraps",
with the empirical distribution acting as the bootstrap.
# Exercises
1. Consider a random sample of size 4 from an exponential distribution with
rate parameter 1. Simulate the distribution of the first order statistic. Draw
a histogram of the simulated values and superimpose the theoretical pdf (which
you will need to derive).
```{r}
# Insert answer here...
```
2. Consider the shifted exponential distribution with pdf:
\[ f(x \mid \theta) = e^{-(x - \theta)} \quad (x > \theta). \]
We discussed this in the lectures early on. Two estimators we proposed were
$T_1 = \bar{X} - 1$ and $T_2 = X_{(1)} - \frac{1}{n}$. Using $\theta = 3$ and
a sample size of $n = 10$, use simulations to show that both of these are
unbiased and that $T_2$ has clearly smaller variance.
```{r}
# Insert answer here...
```
3. Consider the scenario in the [Order statistics] section.
a) Consider the estimator $W_4 = X_{(3)} - 0.5$. Use simulations to show
that it is biased.
```{r}
# Insert answer here...
```
b) Determine a value of $a$ that makes $W_5 = X_{(3)} - a$ an unbiased
estimator.
```{r}
# Insert answer here...
```
c) Use simulations to compare the variance of $W_5$ to that of $W_1$, $W_2$
and $W_3$.
```{r}
# Insert answer here...
```
4. Calculate a 95\% confidence interval for the simulated coverage estimate in
the [Confidence intervals for quantiles] section. Repeat for 1000 simulations.
```{r}
# Insert answer here...
```
5. Do question 5 from the tutorial problems. Also, find an approximate 95\%
confidence interval for the first quartile.
```{r}
# Insert answer here...
```
6. Consider the following random sample on $X$:
0.252, 0.287, 0.537, 0.511, 0.054,
0.022, 0.142, 0.021, 0.155, 0.241
Calculate the statistic $T = 1 / \bar{X}$. Suppose this is an estimator
for some underlying parameter $\theta$. Calculate a 95\% confidence
interval for $\theta$ using the percentile bootstrap.
```{r}
# Insert answer here...
```