---
title: "Week 11 Lab, MAST20005/MAST90058"
author: "School of Mathematics and Statistics, University of Melbourne"
date: "Semester 2, 2022"
institute: "University of Melbourne"
fontsize: 11pt
classoption: a4paper
output:
  html_document:
    toc: true
    toc_float:
      collapsed: false
---
**Goals:** (i) Properties of order statistics; (ii) Confidence
intervals for quantiles; (iii) An introduction to the bootstrap and an example
of its use.
```{r echo=FALSE}
set.seed(2331)
```

# Order statistics

Let $X_{(1)} < X_{(2)} < X_{(3)}$ be order statistics of a random sample
$X_1,X_2,X_3$ from the uniform distribution $\mathrm{Unif}(\theta - 0.5, \theta +
0.5)$. Three possible estimators for the median are the sample mean $W_1 =
\bar{X}$, the sample median $W_2 = X_{(2)}$, and the midrange $W_3 = (X_{(1)} +
X_{(3)}) / 2$.
The theoretical pdf for the smallest order statistic, $X_{(1)}$, is
\[ g_1(x) = 3 \left(1 - F(x)\right)^2 f(x) = 3 \left(\theta + 0.5 - x\right)^2, \quad
\theta - 0.5 < x < \theta + 0.5, \]
which for $\theta = 0.5$ simplifies to $g_1(x) = 3(1 - x)^2$ on $0 < x < 1$.
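As a quick sanity check, we can verify numerically that this density (for
$\theta = 0.5$) integrates to 1:

```{r}
# Numerical check that g is a valid pdf on (0, 1).
integrate(function(x) 3 * (1 - x)^2, lower = 0, upper = 1)
```
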
Simulate the distribution of the first order statistic $X_{(1)}$ and compare
the resulting density with the above theoretical density, using some particular
value for $\theta$ (for example $\theta = 0.5$).
```{r fig.width=6, fig.height=3.5}
theta <- 0.5
x1.simulated <- numeric(1000)  # initialise an empty vector
for (i in 1:1000) {
    x <- runif(3, theta - 0.5, theta + 0.5)
    x1.simulated[i] <- min(x)
}
g <- function(x)
    3 * (1 - x)^2
hist(x1.simulated, breaks = 20, freq = FALSE, col = "lightblue",
     xlim = c(0, 1), ylim = c(0, g(0)),
     main = NULL, xlab = expression(x[(1)]))
curve(g, from = 0, to = 1, add = TRUE, col = "blue", lwd = 2)
```
Simulate 100 samples of size 3 from the uniform model above and calculate
and store values of $W_1$, $W_2$ and $W_3$.
```{r}
w.simulated <- matrix(nrow = 100, ncol = 3)  # initialise an empty matrix
for (i in 1:100) {
    x <- runif(3, theta - 0.5, theta + 0.5)
    x <- sort(x)
    w1 <- mean(x)
    w2 <- x[2]
    w3 <- (x[1] + x[3]) / 2
    w <- c(w1, w2, w3)
    w.simulated[i, ] <- w
}
```
Compare the values of the sample means and sample variances for $W_1$,
$W_2$ and $W_3$. Which of these statistics is the best estimator of $\theta$?
```{r}
# Compute `mean` and `var` for each column of `w.simulated`.
means <- apply(w.simulated, 2, mean)
vars <- apply(w.simulated, 2, var)
```
Here, `apply()` applies the given function (3rd argument) to each column of
the given matrix (1st argument). If the second argument were `1` instead of
`2`, it would apply the function to each row.
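For example, on a small matrix:

```{r}
m <- matrix(1:6, nrow = 2)  # 2 rows, 3 columns (filled column by column)
apply(m, 2, sum)  # column sums: 3, 7, 11
apply(m, 1, sum)  # row sums: 9, 12
```
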
Simulated estimates are affected by random sampling error, so one could
question whether 100 runs are enough to give reliable results. Therefore, it
is typical either to run a very large number of simulations or to present
simulation results as interval estimates. By the Central Limit Theorem, the
means across simulation runs are approximately normally distributed. That
allows us to calculate 95\% confidence intervals for $E(W_1)$, $E(W_2)$ and
$E(W_3)$ in the following way:
```{r}
# CI for E(W1).
means[1] + c(-1, 1) * qnorm(0.975) * sqrt(vars[1]) / sqrt(100)
```
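The same calculation can be vectorised over all three estimators at once; a
small sketch using the `means` and `vars` computed above:

```{r}
# 95% CIs for E(W1), E(W2) and E(W3), one per row.
ci <- cbind(lower = means - qnorm(0.975) * sqrt(vars / 100),
            upper = means + qnorm(0.975) * sqrt(vars / 100))
rownames(ci) <- c("W1", "W2", "W3")
ci
```
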
*Note:* It turns out that $E(W_1) = E(W_2) = E(W_3) = 0.5$ (you may check
this as a homework problem). The intervals above should all contain the true
value $0.5$.
# Confidence intervals for quantiles
Let $X \sim \mathrm{Unif}(0,1)$ and consider a random sample of size 11 from $X$. In
the lectures we saw that if $m$ is the median and $X_{(1)},\dots,X_{(n)}$ are
the order statistics then
\[ \Pr(X_{(i)} < m < X_{(j)})
= \sum_{k = i}^{j - 1} \binom{n}{k} \left(\frac{1}{2}\right)^k
\left(\frac{1}{2}\right)^{n-k}. \]
We will check this formula using R by computing confidence intervals for the
median of $X$.
Use `qbinom()` to compute quantiles of the
$\mathrm{Bi}(11, 0.5)$ distribution (e.g.\ find $\pi_{0.975}$ so that
$\Pr(Y \leqslant \pi_{0.975}) \approx 0.975$, where $Y \sim \mathrm{Bi}(11, 0.5)$).

```{r}
qbinom(c(0.025, 0.975), 11, 0.5)
```
Note that these are only approximate quantiles, since the distribution is
discrete. However, they tell us which order statistics to use as the
endpoints of the desired confidence interval: here, $X_{(2)}$ and $X_{(9)}$.
Determine $\Pr(X_{(2)} < m < X_{(9)})$:

```{r}
pbinom(8, 11, 0.5) - pbinom(1, 11, 0.5)
```
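Equivalently, we can sum the binomial pmf directly over $k = 2, \dots, 8$,
matching the formula above:

```{r}
sum(dbinom(2:8, 11, 0.5))  # same value as pbinom(8, ...) - pbinom(1, ...)
```
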
Note that the confidence interval is only approximate. However, it is still
useful since it ensures a confidence level slightly larger than 95\%.
The R command `X <- runif(11)` simulates 11 observations from $X$,
while `sort()` computes the order statistics. We automate calculation of
$X_{(2)}$ and $X_{(9)}$ by the following function:
```{r}
f <- function() {
    X <- runif(11)
    Y <- sort(X)
    c(Y[2], Y[9])
}
f()  # try it out
```
Next we check the coverage probability for the interval above by
simulation.
```{r fig.height=3.5, fig.width=6}
nsimulations <- 100
C <- t(replicate(nsimulations, f()))
matplot(C, type = "l", xlab = "Simulated sample", ylab = "CI")
abline(h = 0.5, lty = 2, col = "darkgrey")
mean((C[, 1] < 0.5) & (0.5 < C[, 2]))
```
The above code computes the proportion of simulated intervals that contain the
true median value $0.5$. Is this close to the probability calculated above?
The associated plot shows the confidence interval endpoints for each simulation
run. If you want more precision, repeat with `nsimulations = 1000`.
# Bootstrap*
## Introduction
The bootstrap is a computational technique for approximating the sampling
distribution of almost any statistic. It is an example of a *resampling
method*, so called because it involves repeatedly drawing samples from the
original sample. The resulting approximate distribution can be used to obtain
confidence intervals or to carry out hypothesis tests.
Suppose we have an iid sample $X_1, \dots, X_n$ from some unknown distribution.
Our main interest is to find the distribution of some statistic, say
$\hat{\theta}$ (e.g. sample median, sample variance, estimate of a regression
coefficient). Such a distribution can be approximated by the following simple
procedure:
1. Obtain a new sample, $X_1^\ast, \dots, X^\ast_n$, by drawing **with
replacement** from the original observations $X_1, \dots, X_n$.
2. Using the sample $X_1^\ast, \dots, X^\ast_n$, compute and store the
statistic $\hat{\theta}^\ast$.
3. Repeat steps 1 and 2 many times, say $B$, where $B$ might be
1\,000, 2\,000 or even 10\,000, thus obtaining $\hat{\theta}^\ast_1, \dots,
\hat{\theta}^\ast_B$. These are known as the *bootstrapped statistics*.
4. The empirical distribution of the bootstrapped statistics can be regarded
as an approximation of the distribution for the statistic $\hat{\theta}$
computed from the original sample (thus, they can be used to find confidence
intervals, etc.).
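To make the procedure concrete, here is a minimal sketch of steps 1 to 3 as a
helper function (the name `boot_stat` is only for illustration; the lab code
below writes the loop out explicitly):

```{r}
# Resample with replacement B times and apply `stat` to each resample.
boot_stat <- function(x, stat, B = 1000) {
    replicate(B, stat(sample(x, replace = TRUE)))
}
```
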
Advanced theory shows that the bootstrap approximation works well even if $n$
is quite small. This is in contrast to procedures relying on the Central Limit
Theorem, which typically require much larger samples. For example, let us
consider iid samples from the Cauchy distribution with pdf,
\[ f(x \mid \theta) = \frac{1}{\pi [1 + (x - \theta)^2]},
\quad -\infty < x < \infty. \]
The parameter $\theta$ represents the median. In R we can generate a sample
of size $n = 25$ with $\theta = 5$ as follows:
```{r}
x <- rcauchy(25, location = 5)
```
Interestingly, many values are between 3 and 7, so they resemble a sample from
a normal distribution with mean 5 and standard deviation 1. However, note that
there are also some outliers with quite extreme values. These occur because
the Cauchy distribution has very heavy tails. The presence of outliers
suggests that $\bar{X}$ is not a very good estimator of the location.
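To see both the resemblance and the difference in the tails, we can overlay
the two densities:

```{r fig.width=6, fig.height=4}
curve(dcauchy(x, location = 5), from = 0, to = 10, col = "blue", lwd = 2,
      ylab = "Density")
curve(dnorm(x, mean = 5, sd = 1), from = 0, to = 10, add = TRUE, lty = 2)
legend("topright", legend = c("Cauchy(5)", "N(5, 1)"),
       col = c("blue", "black"), lty = c(1, 2), lwd = c(2, 1))
```
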
Next compare the distribution of the sample mean $\bar{X}$ with that of the
trimmed mean $\bar{X}_\text{tr}$ (mean without the most extreme observations in
each tail). The statistics computed from the original sample are:
```{r}
x.bar <- mean(x)
x.bar.tr <- mean(x, trim = 0.35)  # exclude 35% of observations from each tail
```
The statistics can be bootstrapped as follows:

```{r}
B <- 1000  # number of bootstrap resamples
x.bar.boot <- numeric(B)
x.bar.tr.boot <- numeric(B)
for (i in 1:B) {
    x.ast <- sample(x, size = 25, replace = TRUE)
    x.bar.boot[i] <- mean(x.ast)
    x.bar.tr.boot[i] <- mean(x.ast, trim = 0.35)
}
```
Plot the distribution of the bootstrapped statistics, using a common scale:
```{r fig.width=6, fig.height=8.2}
xlim <- range(x.bar.boot, x.bar.tr.boot)
ylim <- c(0, 0.7)
par(mfrow = c(2, 1), mar = c(5.1, 4.1, 1, 1))
hist(x.bar.boot, xlab = expression(bar(X)), freq = FALSE,
     xlim = xlim, ylim = ylim, col = "lightblue", main = NULL)
hist(x.bar.tr.boot, xlab = expression(bar(X)[tr]), freq = FALSE,
     xlim = xlim, ylim = ylim, col = "lightblue", main = NULL)
```
To find a 95\% confidence interval for $\theta$ we will use the
percentile bootstrap method. This involves simply calculating the sample
quantiles of the bootstrapped statistics; for example, the 2.5\% and 97.5\%
sample quantiles in order to get a 95\% confidence interval. In R, sample
quantiles are computed using `quantile()`.
```{r}
quantile(x.bar.tr.boot, c(0.025, 0.975))
```
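For comparison, applying the same method to the bootstrapped (untrimmed)
sample means typically gives a much wider interval, reflecting how unstable
$\bar{X}$ is under the Cauchy model:

```{r}
quantile(x.bar.boot, c(0.025, 0.975))
```
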
The percentile bootstrap is a distribution-free method, since we make no
specific assumptions about the underlying distribution.
## Old Faithful geyser data
In this section we use the waiting times (in minutes) between the starts of
successive eruptions of the Old Faithful geyser in Yellowstone National Park
(Wyoming, USA). The data were collected continuously from 1 August until 15
August, 1985. We examine various features of this data using the bootstrap
method. The data should be available in the data frame `faithful` in
any standard installation of R.
The following computes the sample mean and shows a histogram of the
waiting times.

```{r fig.width=6, fig.height=4}
x <- faithful$waiting
mean(x)
hist(x, col = "lightblue")
```
Generate 10,000 bootstrap replicates of $\bar{X}$ and plot the bootstrap
distribution.
```{r fig.width=6, fig.height=5}
B <- 10000
x.bar.boot <- numeric(B)
for (i in 1:B) {
    x.ast <- sample(x, replace = TRUE)
    x.bar.boot[i] <- mean(x.ast)
}
hist(x.bar.boot, xlab = expression(bar(X)), freq = FALSE,
     col = "lightblue", main = NULL)
```
A 95\% confidence interval for the true mean waiting time $\mu$ is
obtained as follows:

```{r}
quantile(x.bar.boot, c(0.025, 0.975))
```
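As a point of comparison (not required for the bootstrap itself), the usual
CLT-based interval should be very similar here, since the sample is large:

```{r}
mean(x) + c(-1, 1) * qnorm(0.975) * sd(x) / sqrt(length(x))
```
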
Note that the bootstrap interval uses no information about the true
distribution generating the data. It is then easy to see how this procedure
gets its name: it is like "pulling yourself up by your own bootstraps",
with the empirical distribution acting as the bootstrap.
# Exercises
1. Consider a random sample of size 4 from an exponential distribution with
rate parameter 1. Simulate the distribution of the first order statistic. Draw
a histogram of the simulated values and superimpose the theoretical pdf (which
you will need to derive).
```{r}
# Insert answer here...
```
2. Consider the shifted exponential distribution with pdf:
\[ f(x \mid \theta) = e^{-(x - \theta)} \quad (x > \theta). \]
We discussed this in the lectures early on. Two estimators we proposed were
$T_1 = \bar{X} - 1$ and $T_2 = X_{(1)} - \frac{1}{n}$. Using $\theta = 3$ and
a sample size of $n = 10$, use simulations to show that both of these are
unbiased and that $T_2$ has clearly smaller variance.
```{r}
# Insert answer here...
```
3. Consider the scenario in the [Order statistics] section.
a) Consider the estimator $W_4 = X_{(3)} - 0.5$. Use simulations to show
that it is biased.
```{r}
# Insert answer here...
```
b) Determine a value of $a$ that makes $W_5 = X_{(3)} - a$ an unbiased
estimator.
```{r}
# Insert answer here...
```
c) Use simulations to compare the variance of $W_5$ to that of $W_1$, $W_2$
and $W_3$.
```{r}
# Insert answer here...
```
4. Calculate a 95\% confidence interval for the simulated coverage estimate in
the [Confidence intervals for quantiles] section. Repeat for 1000 simulations.
```{r}
# Insert answer here...
```
5. Do question 5 from the tutorial problems. Also, find an approximate 95\%
confidence interval for the first quartile.
```{r}
# Insert answer here...
```
6. Consider the following random sample on $X$:
0.252, 0.287, 0.537, 0.511, 0.054,
0.022, 0.142, 0.021, 0.155, 0.241
Calculate the statistic $T = 1 / \bar{X}$. Suppose this is an estimator
for some underlying parameter $\theta$. Calculate a 95\% confidence
interval for $\theta$ using the percentile bootstrap.
```{r}
# Insert answer here...
```