STAT 3701 Homework 3
Summer 2020 June 22, 2020
Show all work. Submit your solutions in a pdf document on Canvas. Include your R code (which must be commented and properly indented) in the pdf file. We recommend that you also submit one text file (.txt) with all your R code (comments and all) clearly labeled with the problem it goes with. This must be properly indented. Before every solution with random sampling use set.seed(3701).
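As a brief illustration of why the seed matters, the following minimal sketch resets the seed before each draw and checks that the draws match. The variable names are illustrative and not part of the assignment.

```r
# Resetting the seed reproduces the same pseudo-random draws.
set.seed(3701)
first.draw <- rnorm(3)

set.seed(3701)
second.draw <- rnorm(3)

# Both vectors hold the same three values, so this comparison is TRUE.
same.draws <- identical(first.draw, second.draw)
```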
Question 1 (15 points).
Let’s consider a slightly different data generating model for paired data. Suppose we want to compare students’ scores on a standardized test before and after taking a prep course. There are n students who will be taking the test. Let X be the yet-to-be observed random variable of the score before the course and Y be the yet-to-be observed random variable of the score after the course. Moreover, we let X and Y be linear combinations of two common latent variables, i.e., the model for the ith student is
\[ X_i = \mu_1 + a_1 A_i + b_1 B_i, \qquad Y_i = \mu_2 + a_2 A_i + b_2 B_i, \]
where μj, aj, bj, j = 1, 2 are constants. Ai, Bi are independent both within the same subject and across different subjects. The Ai’s are iid ∼ N(0, σA²) and the Bi’s are iid ∼ N(0, σB²).
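The data generating model can be sketched in R as follows. This is only an illustration under stated assumptions: `sim.pairs` is a hypothetical helper name, its arguments take standard deviations rather than variances, and the parameter values are borrowed from part (b).

```r
# Hypothetical helper (not required by the assignment): draw n pairs (X_i, Y_i)
# from the latent-variable model, with sigmaA and sigmaB given as standard deviations.
sim.pairs <- function(n, mu1, mu2, a1, b1, a2, b2, sigmaA, sigmaB) {
  A <- rnorm(n, mean = 0, sd = sigmaA)  # latent variable shared by X and Y
  B <- rnorm(n, mean = 0, sd = sigmaB)  # second shared latent variable
  list(x = mu1 + a1 * A + b1 * B,
       y = mu2 + a2 * A + b2 * B)
}

set.seed(3701)
# A variance of 2 for A corresponds to a standard deviation of sqrt(2).
draws <- sim.pairs(n = 5, mu1 = 68, mu2 = 70, a1 = 2, b1 = 1,
                   a2 = 1, b2 = 2, sigmaA = sqrt(2), sigmaB = 1)
```

Because the same Ai and Bi enter both Xi and Yi, the two scores of a student are correlated, which is the feature that distinguishes paired data from two independent samples.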
(a) (4 points) What are the distributions of Xi and Yi, respectively? If we let Zi = Xi − Yi, does this data generating model still satisfy the assumptions of the paired t-test? Why?
(b) (4 points) Let μ1 = 68, μ2 = 70, σA² = 2, σB² = 1, a1 = 2, b1 = 1, a2 = 1, b2 = 2. It can be shown that E[X] = 68, E[Y] = 70. Use simulation to produce a 99% confidence interval for var(X) = E[(X − 68)²] and a 99% confidence interval for var(Y) = E[(Y − 70)²]. Set reps = 10^5.
(c) (3 points) Now we are interested in the paired t-test for
H0: μ1 = μ2
Ha: μ1 ≠ μ2,
with data assumed to come from our data generating model. Create a function named mypaired.pval that generates realizations of the random p-value of this test. Your function should have the following arguments:
• mu1, mean of X
• mu2, mean of Y
• a1, b1, a2, b2, as defined in our data generating model
• sigmaA, standard deviation of A
• sigmaB, standard deviation of B
• n, sample size (number of pairs)
• reps, number of replications.
The function should output a vector containing the realizations of the random p-value.
(d) (4 points) Use μ1 = 68, μ2 = 68.5, σA² = 2, σB² = 1, a1 = 2, b1 = 1, a2 = 1, b2 = 2, α = 0.05. Set reps = 1000. Test your function by creating a simulation-based estimate of the power curve with the sample size n ∈ {20, 25, …, 200}. Produce a pointwise 95% confidence interval for the power curve using the Clopper-Pearson CI, and include the curves for the lower and upper bounds of the CI in the same plot as the estimated power curve.
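One way to obtain a Clopper-Pearson interval in R is through `binom.test`, whose `conf.int` component is the exact (Clopper-Pearson) interval. The sketch below uses made-up counts purely for illustration; they are not simulation results.

```r
# Illustrative numbers only: suppose 412 of 1000 replications rejected H0.
rejections <- 412
reps <- 1000

# binom.test() reports the exact Clopper-Pearson interval in $conf.int.
cp <- binom.test(x = rejections, n = reps, conf.level = 0.95)$conf.int
lower <- cp[1]
upper <- cp[2]

# The same bounds written with beta quantiles, as a cross-check.
lower.beta <- qbeta(0.025, rejections, reps - rejections + 1)
upper.beta <- qbeta(0.975, rejections + 1, reps - rejections)
```

Applying this at each sample size gives the pointwise lower and upper curves to draw alongside the estimated power.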
Question 2 (15 points)
For the two-independent-samples model, we assume that the Xi’s are iid N(μ1, σ²) for i = 1, …, n1, the Yj’s are iid N(μ2, σ²) for j = 1, …, n2, and all observations are independent. We provided the 100(1 − α)% confidence interval for μ1 − μ2 as
\[ \bar{X} - \bar{Y} \pm t_{1-\alpha/2,\, n_1+n_2-2}\, S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}. \]
In this question, we are interested in simultaneous confidence intervals for both μ1 and μ2, i.e., instead of a CI for μ1 − μ2, we want one CI [L1, U1] for μ1 and another CI [L2, U2] for μ2 such that [L1, U1] and [L2, U2] capture the corresponding means simultaneously with probability larger than 1 − α. Here L1, L2, U1, U2 are lower and upper bounds calculated from the data, and 1 − α is our confidence level.
(a) (3 points) A very natural (but potentially wrong) idea would be
\[ [L_1, U_1] = \bar{X} \pm t_{1-\alpha/2,\, n_1-1} \frac{S_1}{\sqrt{n_1}}, \qquad [L_2, U_2] = \bar{Y} \pm t_{1-\alpha/2,\, n_2-1} \frac{S_2}{\sqrt{n_2}}, \]
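where, as in the notes, S1² is the sample variance of {Xi} and S2² is the sample variance of {Yi}. Calculate the coverage probability for these simultaneous CIs, that is, calculate
\[ P\left( \bar{X} - t_{1-\alpha/2,\, n_1-1} \frac{S_1}{\sqrt{n_1}} \le \mu_1 \le \bar{X} + t_{1-\alpha/2,\, n_1-1} \frac{S_1}{\sqrt{n_1}},\ \ \bar{Y} - t_{1-\alpha/2,\, n_2-1} \frac{S_2}{\sqrt{n_2}} \le \mu_2 \le \bar{Y} + t_{1-\alpha/2,\, n_2-1} \frac{S_2}{\sqrt{n_2}} \right). \]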
(b) (4 points) Let μ1 = 68, μ2 = 70, σ = 3, n1 = n2 = 20, α = 0.05. Use simulation to estimate the coverage probability defined in (a). Set reps = 10^5. Create a 95% conservative CI for this coverage probability. Is 1 − α in this CI?
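The sketch below shows one common reading of a "conservative" CI for a probability estimated by simulation. It assumes the conservative interval is the normal-approximation interval with p(1 − p) replaced by its upper bound 1/4; check this against the definition in the course notes. The function name `conservative.ci` and the input estimate 0.9 are hypothetical.

```r
# Assumed definition: since p * (1 - p) <= 1/4, the half-width
# z * sqrt(p * (1 - p) / reps) is at most z / (2 * sqrt(reps)).
conservative.ci <- function(p.hat, reps, conf.level = 0.95) {
  z <- qnorm(1 - (1 - conf.level) / 2)
  half.width <- z / (2 * sqrt(reps))
  c(lower = p.hat - half.width, upper = p.hat + half.width)
}

# Illustrative input only: an estimated proportion of 0.9 from 1e5 replications.
ci <- conservative.ci(p.hat = 0.9, reps = 1e5)
```

The half-width depends only on reps and the confidence level, so it can be computed before running the simulation.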
(c) (4 points) Now consider a Bonferroni corrected simultaneous confidence interval, where
\[ [L_1, U_1] = \bar{X} \pm t_{1-\alpha/4,\, n_1-1} \frac{S_1}{\sqrt{n_1}}, \qquad [L_2, U_2] = \bar{Y} \pm t_{1-\alpha/4,\, n_2-1} \frac{S_2}{\sqrt{n_2}}. \]
Show that this set of simultaneous CIs has a coverage probability larger than 1 − α. You may want to apply the Bonferroni inequality P(A ∩ B) ≥ 1 − (1 − P(A)) − (1 − P(B)).
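For reference, the Bonferroni inequality itself follows from De Morgan's law and the union bound. The short derivation below covers only the general inequality for arbitrary events A and B, not its application to the intervals in this part.

```latex
\begin{align*}
P(A \cap B) &= 1 - P\big((A \cap B)^c\big) \\
            &= 1 - P(A^c \cup B^c) \\
            &\ge 1 - P(A^c) - P(B^c) \\
            &= 1 - \big(1 - P(A)\big) - \big(1 - P(B)\big).
\end{align*}
```

The inequality in the third line uses P(A^c ∪ B^c) ≤ P(A^c) + P(B^c), which holds whether or not A and B are independent.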
(d) (4 points) Let μ1 = 68, μ2 = 70, σ = 3, n1 = n2 = 20, α = 0.05. Use simulation to estimate the coverage probability of the CIs defined in (c). Set reps = 10^5. Create a 95% conservative CI for this coverage probability.
Question 3 (20 points)
We have explored two-sample procedures for means, but in this problem we will write a function to carry out these procedures for proportions. Consider two binomial random variables V and W. More specifically, V ∼ Binom(n1, θ1) and W ∼ Binom(n2, θ2), where θ1, θ2 ∈ (0, 1). We are interested in testing
H0: θ1 = θ2
Ha: θ1 ≠ θ2.
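Let $\hat{\theta}_1 = V/n_1$, $\hat{\theta}_2 = W/n_2$ and $\hat{\theta} = (V + W)/(n_1 + n_2)$. Then the test statistic for the test is
\[ T = \frac{\hat{\theta}_1 - \hat{\theta}_2}{\sqrt{\hat{\theta}(1 - \hat{\theta}) \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}. \]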
Under the null hypothesis, T ∼ N(0,1) approximately.
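To make the statistic concrete, here is a minimal sketch that evaluates T and the two-sided normal p-value for a single pair of observed counts. The helper name `prop.stat` and the counts 45 and 38 are hypothetical, chosen only for illustration.

```r
# Hypothetical helper: pooled two-proportion statistic T for observed counts v and w.
prop.stat <- function(v, w, n1, n2) {
  theta1.hat <- v / n1
  theta2.hat <- w / n2
  theta.hat <- (v + w) / (n1 + n2)  # pooled estimate, valid under H0
  (theta1.hat - theta2.hat) /
    sqrt(theta.hat * (1 - theta.hat) * (1 / n1 + 1 / n2))
}

# Illustrative counts only.
t.obs <- prop.stat(v = 45, w = 38, n1 = 100, n2 = 100)

# Two-sided p-value from the N(0, 1) approximation.
p.val <- 2 * pnorm(-abs(t.obs))
```

As a sanity check, the square of T should match the chi-squared statistic that `prop.test` reports when called with `correct = FALSE`.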
(a) (5 points) Let θ1 = θ2 = 0.4. Use simulation to evaluate the distribution of T for the cases
• n1 = n2 = 15
• n1 = n2 = 100
In each case, create a QQ-plot to compare the distribution of T with N(0,1) and comment on the plot. Set reps = 10000. You may use the qnorm function when calculating the percentiles of the standard normal distribution. You are not allowed to use rbinom to generate from the binomial distribution.
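Since rbinom is off limits, one possible approach (an assumption, not the required method) is to build each binomial draw as a sum of Bernoulli indicators generated with runif. The helper name `my.rbinom` is hypothetical.

```r
# Hypothetical helper: each row of u holds `size` uniform draws, and the count
# of entries below `prob` in a row is one Binom(size, prob) realization.
my.rbinom <- function(reps, size, prob) {
  u <- matrix(runif(reps * size), nrow = reps, ncol = size)
  rowSums(u < prob)
}

set.seed(3701)
v <- my.rbinom(reps = 10000, size = 15, prob = 0.4)

# The sample mean should be close to size * prob = 6.
v.mean <- mean(v)
```

This works because P(U < θ) = θ for U ∼ Unif(0, 1), so each indicator is a Bernoulli(θ) trial and the row sums are binomial.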
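(b) (5 points) Besides the test, we can also provide an approximate 100(1 − α)% CI for θ1 − θ2 using the formula
\[ (\hat{\theta}_1 - \hat{\theta}_2) \pm z_{1-\alpha/2} \sqrt{\frac{\hat{\theta}_1 (1 - \hat{\theta}_1)}{n_1} + \frac{\hat{\theta}_2 (1 - \hat{\theta}_2)}{n_2}}. \]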
Create an R function myprop.test that generates the data, calculates the test statistic T, and then outputs the p-value of the test and the CI for θ1 − θ2. The function should take the following arguments:
• n1, number of trials for V
• n2, number of trials for W
• theta1, success probability for V
• theta2, success probability for W
• alpha, significance level
• reps, number of replications
and the function outputs a list with three components:
• pval, vector of p-values from each replication
• upper, vector of upper bounds of the CIs from each replication
• lower, vector of lower bounds of the CIs from each replication
(c) (5 points) Test your function by estimating the coverage probability of the CI when θ1 = 0.6, θ2 = 0.4, n1 = n2 = 20, α = 0.05. Use reps = 10^5. Create a 95% conservative CI for the coverage probability. Does it contain 1 − α?
(d) (5 points) Use your function to create a power curve as we increase the distance between θ1 and θ2. More specifically, let n1 = n2 = 100, θ2 = 0.52, α = 0.05, and let θ1 take values in {0.52, 0.525, …, 0.680}. Set reps = 1000. Make a plot of the estimated power curve and also include the lower and upper bounds of a 95% Clopper-Pearson confidence interval using dashed lines.