MAST20005/MAST90058: Week 4 Lab
Goals: (i) Compute and interpret confidence intervals; (ii) Assess properties of estimators
using numerical simulations.
Data for Section 1: Weight of primary tumour in mice (PTweight.txt). Observations represent measurements on primary tumour weight (in micrograms) in mice randomly assigned to stress and control groups. Measurements are taken after a follow-up period. The goal of the study is to compare tumour development in the two groups of mice. (Data source: Sloan Biology Lab, Monash University)
Data for Section 2: Invadopodia data (invadopodia.txt). Invadopodia are protrusions in cells which are often associated with cancer invasiveness and metastasis. The dataset consists of counts of cells susceptible to invadopodia development from mice tissue samples. The first column (Condition) denotes treatment group (1 = no treatment, 2 = isopropile (iso), 3 = propile (pro), 4 = iso + pro), while the second column (Total) denotes cell counts. (Data source: Sloan Biology Lab, Monash University)
1 Confidence intervals
1. While constructing confidence intervals, we will often use R to find quantiles of common distributions, including the χ²-, t- and F-distributions. For each of the following distributions, find the 0.9, 0.95 and 0.975 quantiles.
(a) Standard normal, N(0, 1):

p <- c(0.9, 0.95, 0.975)
qnorm(p)

## [1] 1.281552 1.644854 1.959964

(b) N(5, 3²):

qnorm(p, 5, 3)

(c) t_5:

qt(p, 5)

(d) χ²_1:

qchisq(p, 1)

(e) χ²_5:

# your turn…

(f) F_{12,4}:

qf(p, 12, 4)
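Part (e) is left for you; a possible completion, assuming the same quantile vector p as above:

```r
# quantiles of the chi-squared distribution with 5 degrees of freedom
p <- c(0.9, 0.95, 0.975)
qchisq(p, 5)   # approximately 9.24, 11.07, 12.83
```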
2. Load the primary tumour data into a data frame called PTweight (how?). Find an approximate 90% confidence interval for the mean tumour weight in the control group.
x <- PTweight[1:5, 2] # control group data
n <- length(x)
x.bar <- mean(x)
s <- sd(x)
t <- qt(0.95, n - 1)
x.bar + c(-1, 1) * t * s / sqrt(n)
## [1] 189.0001 791.1999
Notice that in the last line the vector (-1, 1) is multiplied and added to scalars. This produces a two-element vector corresponding to x̄ ± t_{n-1}(0.05) s/√n. Next compare your result with the in-built function t.test.
t.test(x, conf.level = 0.90)
##
## One Sample t-test
##
## data: x
## t = 3.47, df = 4, p-value = 0.02558
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
## 189.0001 791.1999
## sample estimates:
## mean of x
## 490.1
The last command is what we use in practice. We will look at the hypothesis testing part of the output later in the course.
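If you need the interval itself for further computation, it can be extracted from the object returned by t.test. A minimal sketch using simulated stand-in data (this block does not assume PTweight is loaded):

```r
set.seed(1)
x <- rnorm(5, mean = 500, sd = 300)   # illustrative data standing in for the control group
out <- t.test(x, conf.level = 0.90)
ci <- out$conf.int                    # two-element vector holding the 90% CI
attr(ci, "conf.level")                # the confidence level is stored as an attribute
```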
3. Construct a 95% confidence interval for the difference of means μX − μY in the stress and control groups. Can you conclude that stress and tumour growth are associated? (Assume equal variances in the two groups, i.e. σ²_X = σ²_Y.)

y <- PTweight[6:10, 2] # stress group data
y.bar <- mean(y)
s.p <- sqrt((4 * var(x) + 4 * var(y)) / 8) # pooled sample sd
x.bar - y.bar + c(-1, 1) * qt(0.975, df = 8) * s.p * sqrt(1 / 5 + 1 / 5)

## [1] -662.5751 153.2951

The value μX − μY = 0 is inside the interval. Thus, there is not enough evidence in these data to claim that the means in the two groups are different.
Compare now with the result from t.test.
t.test(x, y, var.equal = TRUE)
##
## Two Sample t-test
##
## data:  x and y
## t = -1.4394, df = 8, p-value = 0.188
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -662.5751 153.2951
## sample estimates:
## mean of x mean of y
## 490.10 744.74
4. Is it reasonable to assume that the two variances are the same in the two groups? This can be checked formally by computing a confidence interval for the variance ratio σ²_X / σ²_Y as follows:
var.test(x, y)
##
## F test to compare two variances
##
## data:  x and y
## F = 1.7583, num df = 4, denom df = 4, p-value = 0.598
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.183065 16.887192
## sample estimates:
## ratio of variances
## 1.758253
Since the value σ²_X / σ²_Y = 1 is inside the CI, there is not enough evidence to suggest that the variances are unequal.
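The interval reported by var.test can also be computed by hand from F quantiles: the 95% CI for σ²_X/σ²_Y runs from (s²_x/s²_y)/F_{0.975} to (s²_x/s²_y)/F_{0.025}. A sketch with illustrative data (the PTweight groups are not assumed loaded here):

```r
set.seed(2)
x <- rnorm(5); y <- rnorm(5)                         # illustrative stand-in data
ratio <- var(x) / var(y)
manual <- ratio / qf(c(0.975, 0.025), df1 = 4, df2 = 4)
manual   # same interval as var.test(x, y)$conf.int
```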
2 Comparing estimators by simulation
1. Let X_1, ..., X_n be a random sample of size n from a Poisson distribution with mean λ. Let X̄ and S² denote the sample mean and sample variance, respectively. Note that both are unbiased estimators for λ since

E(X̄) = E(S²) = λ.

While we already know that var(X̄) = λ/n (recall how), an expression for var(S²) is harder to compute. In such situations computer simulations can help us compare estimators. First let us check that the two estimators are unbiased.
lambda <- 10
B <- 1000 # simulation runs
n <- 10 # sample size
xbar <- 1:B  # we will collect results in these two vectors
s2 <- 1:B
for (b in 1:B) {  # repeat B times for b = 1, ..., B
  x <- rpois(n, lambda)
  xbar[b] <- mean(x)  # compute and store X-bar
  s2[b] <- var(x)     # compute and store S^2
}
The LLN for iid variables Z_1, ..., Z_B states that (1/B) Σ_{b=1}^{B} Z_b → E(Z_1) as B → ∞. This can be used to approximate E(X̄) and E(S²) from our simulations as follows:
mean(xbar)
## [1] 9.9721
mean(s2)
## [1] 9.8617
Both estimators seem to be unbiased for λ.
Using the LLN we can also approximate the variance of the two estimators as follows:
var(xbar)
## [1] 1.054046
var(s2)
## [1] 20.92035
While we already know that var(X̄) = λ/n, a more involved calculation shows that var(S²) = λ(2nλ + n − 1)/[n(n − 1)]. The values obtained from the above simulation are quite close to these theoretical values:
lambda / n
## [1] 1
lambda * (2 * n * lambda + n - 1) / (n * (n - 1))
## [1] 23.22222
Both estimators seem to be unbiased for λ, but the first estimator is clearly superior in terms of variance. Try different values of the true parameter λ and n in the above code and see what happens. Also, check what happens when you increase B.
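The loop above can also be written more compactly with replicate(); a sketch of the same simulation with a larger B to reduce Monte Carlo error:

```r
set.seed(42)                   # illustrative seed
B <- 10000; n <- 10; lambda <- 10
sims <- replicate(B, {         # each column holds (X-bar, S^2) for one sample
  x <- rpois(n, lambda)
  c(mean(x), var(x))
})
rowMeans(sims)       # approximates E(X-bar) and E(S^2); both should be near lambda
apply(sims, 1, var)  # approximates var(X-bar) and var(S^2)
```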
2. Compare the accuracy of the estimators graphically, for example using boxplots.
boxplot(xbar, s2, names = c("X-bar", "S-squared"),
col = "lightblue", horizontal = TRUE)
# Draw a dashed vertical line showing the true parameter value.
abline(v = lambda, lty = 2, lwd = 2, col = "magenta")
[Figure: horizontal boxplots of X-bar and S-squared on a common axis from 0 to 30, with a dashed magenta line marking the true value λ = 10.]
From the box plots, the distribution of both estimators is clearly centred around the true λ, but the distribution of X̄ has a much smaller spread. Note also that the distribution of S² is not symmetric.
3. Load the invadopodia data. Assume a Poisson model with mean λ_j, j = 1, ..., 4, with different means corresponding to the treatment groups, and estimate λ_j. For the first two groups we have:
4. Construct approximate 95% confidence intervals for λ_j, recalling that λ̂ = X̄ obeys the Central Limit Theorem. Specifically, λ̂ ≈ N(λ, λ/n), so approximate 95% confidence intervals can be computed using:

λ̂_j ± 1.96 × √(λ̂_j / n),   j = 1, ..., 4.
invadopodia <- read.table("invadopodia.txt")
x1 <- invadopodia[invadopodia$Condition == 1, 2]
x2 <- invadopodia[invadopodia$Condition == 2, 2]
x.bar1 <- mean(x1)
x.bar2 <- mean(x2)
# 95% CI for group 1.
x.bar1 + c(-1, 1) * 1.96 * sqrt(x.bar1 / length(x1))
## [1] 4.871097 6.684459
# 95% CI for group 2.
x.bar2 + c(-1, 1) * 1.96 * sqrt(x.bar2 / length(x2))
## [1] 3.057910 4.293441
Note that 1.96 in the above code gives 95% confidence intervals. To compute 90% or 99% CIs, replace 1.96 by the appropriate standard normal quantiles:
qnorm(0.95) # use this to obtain a 90% CI

## [1] 1.644854

qnorm(0.995) # use this to obtain a 99% CI

## [1] 2.575829
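The per-group CI can also be wrapped in a small helper and applied to all four conditions at once with tapply. A sketch; the data frame here is simulated stand-in data so the block runs on its own (with the real data, use the invadopodia object loaded above):

```r
set.seed(7)
invado <- data.frame(Condition = rep(1:4, each = 20),  # stand-in for invadopodia
                     Total = rpois(80, lambda = 5))
wald_ci <- function(x, z = 1.96) {  # CLT-based approximate CI for a Poisson mean
  mean(x) + c(-1, 1) * z * sqrt(mean(x) / length(x))
}
tapply(invado$Total, invado$Condition, wald_ci)  # one CI per condition
```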
# 95% CI for difference between groups 1 and 2.
x.bar1 - x.bar2 + c(-1, 1) * 1.96 *
sqrt(x.bar1 / length(x1) + x.bar2 / length(x2))
## [1] 1.004967 3.199237
5. The CIs do not overlap, suggesting that the two means are actually different. To carry out proper inference for the difference of means, λ_1 − λ_2, we need to derive a specific CI. Since the first two groups are independent we have

λ̂_1 − λ̂_2 ≈ N(λ_1 − λ_2, λ_1/n_1 + λ_2/n_2).

The interval is clearly above the value λ_1 − λ_2 = 0, suggesting that the true difference is likely to be different from zero. Hence, we conclude that we have evidence that the isopropile treatment reduces the number of susceptible cells in mice tissues.
3 Simulating discrete distributions
The function sample() carries out sampling from a discrete distribution in a few different ways. For example, to simulate 10 coin tosses you would use:
sample(c("tails", "heads"), 10, replace = TRUE)
## [1] "tails" "tails" "heads" "heads" "tails" "heads" "heads" "heads"
## [9] "tails" "tails"
The first argument specifies the possible values to observe. For example, we could pick a random day of the week using:

days <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
sample(days, 1)

## [1] "Fri"

By default, all items are given equal probability. This can be changed by specifying different values using the prob argument. For example, biased coin tosses can be obtained with:

sample(c("tails", "heads"), 10, prob = c(0.2, 0.8), replace = TRUE)

## [1] "heads" "heads" "heads" "heads" "heads" "tails" "heads" "heads"
## [9] "heads" "tails"

The vector passed to prob should have the same length as the set of possible values (the first argument).
The argument replace specifies whether to do sampling with or without replacement[1]. The default is without replacement, which means we always need to write replace = TRUE to get iid samples.
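For contrast, with the default replace = FALSE each value can be drawn at most once, as in a lottery-style draw (a hypothetical example, not from the lab data):

```r
# 6 distinct numbers from 1 to 45; without replacement no value can repeat
sample(1:45, 6)
```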
Exercises
1. Give estimates and 95% CIs for 3 and 4.
2. The investigators are wondering whether isopropile treatment affects the number of susceptible cells in mice tissues when propile treatment is already provided. What analysis answers that question? Carry out this analysis.
3. Consider question 1 from the tutorial problems. Derive a 75% CI for the population mean.
4. Consider question 2 from the week 3 tutorial problems. This involved a random sample of n observations on X having the following pmf:

x      0       1      2
p(x)   1 − θ   3θ/4   θ/4

For the case of n = 10 and θ = 0.6, use numerical simulations to show:
(a) T1 = (4/5)X̄ and T2 = 1 − (1/n) Σ_{i=1}^{n} I(X_i = 0) are unbiased
(b) var(T1) > var(T2)
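To get started on this exercise, one simulated sample from the pmf can be drawn with sample() as in the previous section; a sketch (the full exercise repeats this B times and compares the empirical means and variances of T1 and T2):

```r
theta <- 0.6; n <- 10
probs <- c(1 - theta, 3 * theta / 4, theta / 4)   # p(0), p(1), p(2)
x <- sample(0:2, n, replace = TRUE, prob = probs)
T1 <- (4 / 5) * mean(x)   # E(X) = 5*theta/4, so T1 is unbiased for theta
T2 <- 1 - mean(x == 0)    # P(X = 0) = 1 - theta, so T2 is unbiased too
c(T1, T2)
```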
[1] 'Replacement' refers to putting the sampled item back into the pool of possible values before taking the next sample.