---
title: "HW06"
author: "Your Name, Your Uniqname"
date: "Due Wednesday October 22, 2019 at 10pm on Canvas"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(ggplot2)
```

For question 2, you will need to load the file `xy.csv`. Set your working directory using Session -> Set Working Directory -> To Source File Location.

```{r}
xy <- read.csv("xy.csv")
```

## Question 1 (4 pts)

A Poisson process $N(t)$ with rate parameter $\lambda$ is such that $N(0) = 0$ and, for any $t > 0$,
$$P(N(s + t) - N(s) = n) = \frac{e^{-\lambda t} (\lambda t)^n}{n!}, \quad n \ge 0, \; t, s > 0$$

### Part (a) (1 pt)

Show that the mean of $N(t)$ (remembering that $t$ is a fixed index element) is
$$E(N(t)) = \lambda t$$
Recall that
$$e^a = \sum_{i=0}^\infty \frac{a^i}{i!}$$

### Part (b) (2 pts)

A compound Poisson process is a stochastic process $\{X(t), t \ge 0\}$ such that
$$X(t) = \sum_{i = 1}^{N(t)} Y_i$$
where $Y_i$ are independently distributed from some other distribution.

Rizzo section 3.7 provides a way to simulate from a regular Poisson process. Use this method to simulate from a compound Poisson(4)-Gamma(shape = 2, rate = 4) process.

Estimate the mean of $X(t)$ (by generating 1000 Poisson processes up to $t = 10$). Relate this mean to what you found in Part (a) and what you know about the mean of Gamma(2, 4).
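One way this simulation could be sketched is below. This is an illustrative setup, not a required solution: the function name `sim_compound_X` and the oversized buffer of interarrival times are assumptions of this sketch. It uses the fact that Poisson-process interarrival times are Exponential($\lambda$).

```r
# Sketch (assumed setup): simulate X(10) for a compound
# Poisson(4)-Gamma(shape = 2, rate = 4) process.
set.seed(42)

sim_compound_X <- function(t, lambda, shape, rate) {
  # Generate more than enough Exponential(lambda) interarrival times,
  # then count the arrivals that land in [0, t] to get N(t).
  arrivals <- cumsum(rexp(ceiling(3 * lambda * t) + 20, rate = lambda))
  n_t <- sum(arrivals <= t)
  sum(rgamma(n_t, shape = shape, rate = rate))  # X(t) = sum of the Y_i
}

x10 <- replicate(1000, sim_compound_X(10, lambda = 4, shape = 2, rate = 4))
mean(x10)  # compare with lambda * t * E(Y) = 4 * 10 * (2 / 4) = 20
```

The buffer of `3 * lambda * t + 20` exponentials is a crude way to ensure enough arrivals are generated; the expected count is only $\lambda t$.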

### Part (c) (1 pt)
For any type of $Y_i$ (not just Gamma), $\lambda$ and $t$, find

$$E(X(t))$$

## Question 2 (3 pts)

Consider sampling $n$ pairs $(Y_i, X_i)$ from a very large population of size $N$. We will assume that the population is so large that we can treat $n/N \approx 0$, so that all pairs in our sample are effectively independent.

```{r}
ggplot(xy, aes(x = x, y = y)) + geom_point()
```

For the population, you want to relate $Y$ and $X$ as a linear function:
$$Y_i = \beta_0 + \beta_1 X_i + R_i$$
where
\[
\begin{aligned}
\beta_1 &= \frac{\text{Cov}(X,Y)}{\text{Var}(X)} \\
\beta_0 &= E(Y) - \beta_1 E(X) \\
R_i &= Y_i – \beta_0 – \beta_1 X_i
\end{aligned}
\]

The line described by $\beta_0$ and $\beta_1$ is the "population regression line". We don't get to observe $R_i$ for our sample, but we can estimate $\beta_0$ and $\beta_1$ to get estimates of $R_i$.

### Part (a) (1 pt)

The `lm` function in R can estimate $\beta_0$ and $\beta_1$ using sample means and variances. Since these estimators are based on sample means, we can use the **central limit theorem** to justify confidence intervals for $\beta_0$ and $\beta_1$.

Use the `lm` function to estimate $\beta_0$ and $\beta_1$. Apply the `confint` function to the results to get 95% confidence intervals for the $\beta$ parameters.
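A minimal sketch of this workflow follows. Since `xy.csv` is not reproduced here, the `demo` data frame below is simulated stand-in data (an assumption of the sketch); with the real data, use the `xy` data frame loaded in the setup chunk instead.

```r
# Stand-in data -- replace `demo` with the `xy` data frame.
set.seed(1)
demo <- data.frame(x = runif(100))
demo$y <- 1 + 2 * demo$x + rnorm(100, sd = 0.5)

fit <- lm(y ~ x, data = demo)       # estimates beta_0 and beta_1
coef(fit)                           # beta_0-hat and beta_1-hat
confint(fit, level = 0.95)          # CLT-based 95% CIs, one row per parameter
```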

### Part (b) (2 pts)

You can use the `coef` function to get just the estimators $\hat \beta_0$ and $\hat \beta_1$. Use the `boot` package to get basic and percentile confidence intervals for just $\beta_1$. You will need to write a custom function to give as the `statistic` argument to `boot`. Use at least 1000 bootstrap samples. You can use `boot.ci` for the confidence intervals.

Compare these intervals to part (a) and comment on the assumptions required for the bootstrap intervals.
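A possible shape for the `statistic` function is sketched below, again on simulated stand-in data (the `demo` frame and `beta1_stat` name are assumptions of this sketch; substitute the `xy` data frame). `boot` passes the resampled row indices as the second argument, so the statistic refits the regression on `data[idx, ]`.

```r
library(boot)

# Stand-in data -- replace `demo` with the `xy` data frame.
set.seed(2)
demo <- data.frame(x = runif(100))
demo$y <- 1 + 2 * demo$x + rnorm(100, sd = 0.5)

beta1_stat <- function(data, idx) {
  coef(lm(y ~ x, data = data[idx, ]))[2]  # slope only
}

b <- boot(demo, statistic = beta1_stat, R = 1000)
boot.ci(b, type = c("basic", "perc"))
```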

## Question 3 (4 pts)

Suppose that instead of sampling pairs, we first identified some important values of $x$ that we wanted to investigate. Treating these values as fixed, we sampled a varying number of $Y_i$ for each $x$ value. For these data, we’ll attempt to model the conditional distribution of $Y \, | \, x$ as:
$$Y \, | \, x = \beta_0 + \beta_1 x + \epsilon$$
where $\epsilon$ is assumed to be symmetric about zero (therefore, $E(\epsilon) = 0$) and the variance of $\epsilon$ does not depend on $x$ (a property called "homoskedasticity"). These assumptions are very similar to the population regression line model (as $E(R_i) = 0$ by construction), but cover the case where we want to design the study around particular values (a common case is a randomized trial where $x$ values are assigned from a known procedure and $Y$ is measured after).

### Part (a) (2 pts)

Let’s start with some stronger assumptions and then relax them in the subsequent parts of the question.

Suppose we think that $\epsilon$ follows a scaled $t$-distribution with 4 degrees of freedom (i.e., has fatter tails than the Normal distribution):
$$\epsilon \sim \frac{\sigma}{\sqrt{2}} t(4) \Rightarrow \text{Var}(\epsilon) = \sigma^2$$
(The $\sqrt{2}$ is there just to scale the $t$-distribution to have a variance of 1. More generally, if we picked a different degrees-of-freedom parameter $v$, this would be replaced with $\sqrt{v/(v-2)}$.)

One way to get an estimate of the distribution of $\hat \beta_1$ is the following algorithm:

1. Estimate $\beta_0$, $\beta_1$, and $\sigma^2$ using linear regression
2. For all the $x_i$ in the sample, generate $\hat y_i = \hat \beta_0 + \hat \beta_1 x_i$
3. For $B$ replications, generate $Y_i^* = \hat y_i + \epsilon_i^*$, where
$$\epsilon^* \sim \frac{\sqrt{\hat \sigma^2}}{\sqrt{2}} t(4)$$
4. For each replication, use linear regression to estimate $\hat \beta_0^*$ and $\hat \beta_1^*$.
5. Use the $\alpha/2$ and $1 - \alpha/2$ quantiles of the bootstrap distribution to get the confidence intervals:
$$[2 \hat \beta_j - \hat \beta_j^*(1 - \alpha/2), \; 2 \hat \beta_j - \hat \beta_j^*(\alpha/2)], \quad j = 0, 1$$
To avoid double subscripts, I've written $\hat \beta^*_j(1 - \alpha/2)$ for the upper $1 - \alpha/2$ quantile of the bootstrap distribution (and likewise for the lower $\alpha/2$ quantile).

You may note that this is a "basic" bootstrap interval. In fact, this procedure (fitting parameters, then simulating from the fitted model) is known as a **parametric bootstrap**.

Use the algorithm above to generate confidence intervals for the $\beta$ parameters. Compare them to the fully parametric intervals produced in Question 2(a).

Note: The `boot` function does have the option of performing a parametric bootstrap (via `sim = "parametric"` and a user-supplied `ran.gen` function). Feel free to use this functionality, but you may find it easier to implement the algorithm directly.
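The direct implementation of steps 1-5 could look roughly like this. The data here are simulated stand-ins (an assumption of the sketch); in the actual solution, `x` and `y` would come from this question's design.

```r
# Stand-in data for illustration only.
set.seed(3)
n <- 100
x <- runif(n)
y <- 1 + 2 * x + (0.5 / sqrt(2)) * rt(n, df = 4)

fit <- lm(y ~ x)                 # step 1: beta-hats and sigma-hat
b1_hat <- coef(fit)[2]
sigma_hat <- summary(fit)$sigma
y_hat <- fitted(fit)             # step 2: y-hat_i at each x_i

B <- 1000
b1_star <- replicate(B, {        # steps 3-4: simulate errors, refit
  eps_star <- (sigma_hat / sqrt(2)) * rt(n, df = 4)
  coef(lm(y_hat + eps_star ~ x))[2]
})

# step 5: basic bootstrap interval for beta_1
ci_b1 <- c(2 * b1_hat - quantile(b1_star, 0.975),
           2 * b1_hat - quantile(b1_star, 0.025))
ci_b1
```

The same loop with `coef(...)[1]` gives the interval for $\beta_0$.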

### Part (b) (2 pts)

As an alternative to sampling from an assumed distribution for $\epsilon$, we can replace step (3) in the previous algorithm with

3. Draw a sample (with replacement) from $\hat \epsilon_i$ and make $Y_i^* = \hat y_i + \epsilon_i^*$

Implement this version of a parametric bootstrap. Feel free to use the `boot` package. Compare the results to Part (a) of this question.
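A sketch of the residual-resampling variant is below: the scaled $t$ draws are replaced by draws (with replacement) from the fitted residuals. As in Part (a), the data here are simulated stand-ins; reuse this question's data in the actual solution.

```r
# Stand-in data for illustration only.
set.seed(4)
n <- 100
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)

fit <- lm(y ~ x)
y_hat <- fitted(fit)
eps_hat <- resid(fit)            # estimated residuals to resample from

B <- 1000
b1_star <- replicate(B, {
  # modified step 3: resample residuals instead of drawing from t(4)
  eps_star <- sample(eps_hat, size = n, replace = TRUE)
  coef(lm(y_hat + eps_star ~ x))[2]
})
quantile(b1_star, c(0.025, 0.975))  # bootstrap distribution of beta_1-hat
```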

## Question 4 (1 pt)

Read the paper “THE RISK OF CANCER ASSOCIATED WITH SPECIFIC MUTATIONS OF BRCA1 AND BRCA2 AMONG ASHKENAZI JEWS.” Briefly summarize the paper. Make sure to discuss the research question, data source, methods, and results. How did the authors use the bootstrap procedure in this paper?