---
title: "STAT340 Lecture 13: Multiple comparison problem"
author: " and Wu"
date: "October 2021"
output: html_document
knit: (function(inputFile,encoding){rmarkdown::render(inputFile,encoding=encoding,output_file=file.path(dirname(inputFile),'index.html'))})
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE,
cache = TRUE, autodep = TRUE, cache.comments = FALSE)
library(Rfast)
library(tidyverse)
```
# What is multiple comparison? (and why is it a problem?)
# People v. Puckett
### Background
The following facts are from a real case ^[http://www.personal.psu.edu/dhk3/dhblog/ROB%28Puckett-CA%29.pdf] ^[https://www.prisonlegalnews.org/news/2009/jan/15/cold-case-hits-use-vastly-exaggerated-dna-match-statistics-8232upheld-by-california-supreme-court/].
On December 22, 1972, around 8:20am, 22-year-old Diana was found dead in her San Francisco apartment. Her landlord called the police after hearing loud noises, "a violent pounding on the floor", and a woman's screams. Diana was found on the floor unclothed with multiple stab wounds to her chest. She appeared to have been sexually assaulted, then strangled. DNA samples were collected and tested, but no primary suspect was identified and the case went cold for 30 years.
In 2003, the case was reopened, and the samples were reanalyzed and uploaded to the California DNA database ($N\approx338,000$). The initial DNA sample reportedly didn't contain enough identifiable markers to be of use and was extrapolated by an analyst based on "inconclusive" markers to meet the minimum search length requirements. 70-year-old Puckett was identified as a "[cold hit](https://www.encyclopedia.com/science/encyclopedias-almanacs-transcripts-and-maps/cold-hit)" from this partial match.
There was no other direct evidence connecting him to the case, so during the trial, prosecution relied heavily on the DNA match, which they claimed based on “expert testimony” had a 1 in 1.1 million chance of matching a random person. Puckett was convicted in 2008 on one count of first degree murder and sentenced to life imprisonment (w/ possibility of parole).
Later, in a paper published in Forensic Science Communications—a peer-reviewed journal published by the Federal Bureau of Investigation (FBI)—researchers pointed out the actual probability of finding a match was closer to 1 in 3 (a fact that judge Benson barred the defense from presenting to the jury during trial) ^[https://archives.fbi.gov/archives/about-us/lab/forensic-science-communications/fsc/july2009/undermicroscope/2009_07_micro01.htm#differentquestions].
So which number is correct? The short answer is they are both right, depending on what question you’re asking.
### Different questions, different answers
Assume the DNA analysis expert is right, and the sample had a 1 in 1.1 million probability of matching a random, unrelated, innocent person. In other words, the probability of a false positive is
$$
P(\text{false}\,+)=P(+\,|\,\text{innocent}):=\alpha=\frac1{1.1\times10^6}\approx0.00000091
$$
However, the California DNA database at the time had around $N=338,000$ samples in it. Assume for a moment Diana’s aggressor was not in the database (i.e. the samples in the database are all completely unrelated to this case) and that the samples in the database are independent. Then, each sample has $\alpha$ probability of returning a false positive. We can then easily calculate the probability of ***at least one match***
$$
\begin{aligned}
P(\text{at least 1 match})&=1-P(\text{no matches})\\
&=1-\prod_{i=1}^NP(i\text{-th sample is NOT a match})\\
&=1-\prod_{i=1}^N\Big(1-P(\text{$i$-th sample IS a match})\Big)\\
&=1-(1-\alpha)^N\\
&=1-\left(1-\tfrac1{1,100,000}\right)^{338,000}\\
&\approx0.26455\\
&\approx\tfrac1{3.78}
\end{aligned}
$$
In other words, the actual probability of at least one match in the database was about 26\%, or about a 1 in 3.8 chance. If the defense had been allowed to present this to the jury, would Puckett still have been convicted? We will never know.
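For reference, the same number can be computed directly in R from the quantities above:

```{r}
alpha = 1/1.1e6    # false positive probability for a single comparison
N     = 338000     # approximate size of the California DNA database
1 - (1 - alpha)^N  # probability of at least one false positive; about 0.26, i.e. roughly 1 in 3.8
```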
The problem here lies in multiple testing. By uploading the sample to the database, detectives were not just running 1 test against a database, but $338,000$ individual tests. Any single, individual sample has a 1 in 1.1 million chance of returning a false positive match, but the entire database has a 26\% chance of returning a false positive match.
[**Yikes!**](https://en.wikipedia.org/wiki/Prosecutor%27s_fallacy)
### Tangent on [replication crisis](https://en.wikipedia.org/wiki/Replication_crisis)
This was an extreme example, but the same principle applies whenever you perform many *post-hoc* tests (i.e. not a few specific, predetermined questions) to search for any significant results without taking the problem of multiple testing into consideration. This is often called [data dredging](https://en.wikipedia.org/wiki/Data_dredging) or data fishing, and is a form of $p$-hacking, a broad term referring to any misuse of statistical methods to exaggerate the significance of results (e.g. $p$-values). This is motivated by the fact that significant results are traditionally more likely to be published (and researchers and PIs often feel heavily pressured to publish).
In recent years, there has been a slowly growing awareness of these issues in the scientific community. In 2005, John Ioannidis published the highly influential paper [Why Most Published Research Findings Are False](https://doi.org/10.1371/journal.pmed.0020124), in which he argued that "most claimed research findings are false" due to many factors, including a "high rate of nonreplication (lack of confirmation)"; bias in design, data, analysis, or presentation; selective reporting by editors/publishers of journals; effect sizes often being small compared to the power of the statistical method; and even intentional manipulation and misreporting of results.
Indeed, in 2015, a [large replication study](https://doi.org/10.1126/science.aac4716) of papers from prestigious psychology journals found about 63% of the significant results could NOT be reproduced. In 2016, a [survey of 1,576 researchers](https://doi.org/10.1038/533452a) across many fields found that 70\% “tried and failed to reproduce another scientist’s experiments”, and more than half “failed to reproduce their own experiments”. In 2019, a [replication study of hydrology journals](https://www.nature.com/articles/sdata201930) found with 95\% confidence the proportion of reproducible papers in the sample of nearly 2000 articles was between 0.6\% and 6.8\%.
# Simulation study
Getting back to the topic of multiple comparisons, let’s further illustrate the point by running a simulation study. Suppose we draw $g$ groups of $n$ observations ***all from the same normal distribution*** ^[the distribution type doesn’t really matter here, but normal was chosen to keep the testing procedure simple, since it’s not the focus here]. Suppose we are genuinely curious if any of the groups are different, so we do the naive thing: we conduct $g\choose2$ pairwise $t$-tests all at the 95\% confidence level.
The following script automates this and empirically estimates the probability of a false positive in **at least one of the tests** (i.e. at least one significant result, even though none of them should be significant).
```{r}
library(Rfast)
library(tidyverse)

# estimate P(at least one significant pairwise t-test) when all g groups
# of n observations come from the same N(0,1) distribution
mult.comp.sim = function(g,n,i=5000,alpha=0.05){
  sapply(g,function(..){
    replicate(i,ttests.pairs(matrix(rnorm(..*n),ncol=..,nrow=n))$pvalue %>%
      {.[lower.tri(.)]} %>% {any(. < alpha)}) %>% mean
  })
}
```
```{r}
# brief check that it's working
mult.comp.sim(2,20)
```

```{r}
# choose some arbitrary n, let's say 20
# we can then show the probability for any given
# g of getting a false positive in at least ONE of the pairwise tests
set.seed(1)
2:30 %>%
  {data.frame(g=.,p=mult.comp.sim(.,20))} %>%
  ggplot(aes(x=g,y=p)) + geom_point() + geom_smooth(se=F) +
  geom_hline(aes(yintercept=0.05),linetype="dashed",color="red") +
  scale_y_continuous(limits=c(0,1),expand=c(0,0)) +
  labs(title=expression(paste("Probability of false positive in ",
                              bgroup("(",atop(g,2),")")," tests")))
```
As you can see, the problem is quite severe: the probability of a false positive grows out of control as the number of tests we need to perform increases.
How do we prevent this from happening?
# Bonferroni
One of the most popular and broadly applicable ways is called the Bonferroni correction, named after the [Bonferroni inequalities](https://en.wikipedia.org/wiki/Boole%27s_inequality#Bonferroni_inequalities) that it makes use of in its derivation. It controls what’s called the **family-wise error rate** (FWER)—i.e. the probability that at least ONE test in the family results in a false positive.
Suppose we must conduct $m$ tests, and we wish to control the overall FWER at $\alpha$. Let $E_i$ represent the event that the $i$-th test results in a false positive. Suppose we perform each individual test at some more strict significance level $\alpha^*$. How small do we need to make $\alpha^*$ such that $\text{FWER}\leq\alpha$? We solve
$$
\text{FWER}=P\left(\bigcup_{i=1}^mE_i\right)\leq\sum_{i=1}^mP(E_i)=m\alpha^*
$$
Note if we choose $\alpha^*=\frac\alpha m$, then we have
$$
\text{FWER}\leq m\alpha^*\leq\alpha
$$
Thus, the Bonferroni method says if we perform each test at the significance level $\alpha/m$ instead of $\alpha$, then the overall FWER is controlled at $\alpha$.
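As a quick illustration (a minimal sketch reusing the five $p$-values from the Holm-Bonferroni example further below), the correction can be applied either by comparing each raw $p$-value to $\alpha/m$ or, equivalently, by adjusting the $p$-values with base R's `p.adjust()` and comparing to $\alpha$:

```{r}
p = c(0.002, 0.012, 0.187, 0.508, 0.674)    # hypothetical p-values from m = 5 tests
alpha = 0.05
m = length(p)

p < alpha/m                                 # compare each p-value to the stricter level alpha/m
p.adjust(p, method = "bonferroni") < alpha  # equivalent: inflate the p-values, compare to alpha
```

With these values, only the smallest $p$-value survives the Bonferroni correction; the Holm-Bonferroni example below also keeps the second smallest, illustrating its extra power.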
Let’s use this and revisit the earlier simulation example, this time running each test at $\alpha/{g\choose2}$ level and see what the results look like.
```{r}
mult.comp.sim.bonf = function(g,n,i=5000,alpha=0.05){
  sapply(g,function(..){
    replicate(i,ttests.pairs(matrix(rnorm(..*n),ncol=..,nrow=n))$pvalue %>%
      {.[lower.tri(.)]} %>% {any(.<(alpha/choose(..,2)))}) %>% mean
  })
}
```
```{r}
set.seed(1)
bonf = 2:30 %>%
  {data.frame(g=.,p=mult.comp.sim.bonf(.,20),method="Bonferroni")}

ggplot(bonf,aes(x=g,y=p)) + geom_point() + geom_smooth(se=F) +
  geom_hline(aes(yintercept=0.05),linetype="dashed",color="red") +
  scale_y_continuous(breaks=c(seq(0,1,.2),.05),limits=c(0,1),expand=c(0,0)) +
  labs(x="Number of groups",y="FWER",
       title=expression(paste("Probability of false positive in ",
                              bgroup("(",atop(g,2),")")," tests with Bonferroni correction")))
```
Wow, that looks much better! Some more comments about Bonferroni:
– Advantages:
– easy to use
– very good at controlling FWER
– easy to apply to any test (just lower $\alpha$)
– Disadvantages:
– slightly conservative
– may have lower power than other methods
There are some other methods that are sometimes used, but broadly speaking they all make the following tradeoff:

|        more liberal |                            | more conservative     |
|--------------------:|:--------------------------:|:----------------------|
|        higher power | $\Leftarrow===\Rightarrow$ | lower power           |
| weaker FWER control |                            | stronger FWER control |
Another way of saying this is that you cannot improve both the type I error rate (false positives) and the type II error rate (false negatives) at the same time. If you want stricter control of type I errors, your test must produce fewer positive results, which necessarily raises the type II error rate and lowers power. Conversely, if you want higher power, your test must produce more positive results, which necessarily raises the type I error rate.
# Holm–Bonferroni method
A slightly more complex method that is less conservative (and more powerful) than Bonferroni is the [Holm-Bonferroni](https://en.wikipedia.org/wiki/Holm%E2%80%93Bonferroni_method) approach. This approach also has the benefit that it can be applied to basically any method that produces a $p$-value, since it only relies on changing the significance level at which each test is performed.
Procedure:
1. Sort test $p$-values from lowest to highest.
2. Match the smallest $p$-value with the significance level $\alpha/m$, then match the next smallest one to $\alpha/(m-1)$, then match the next to $\alpha/(m-2)$, etc… until you match the largest to $\alpha$.
3. Look for the first time a $p$-value is greater than its matched significance level.
4. Reject the null for every test ***before*** this point.
For example, suppose we had the $p$-values 0.187, 0.508, 0.002, 0.012, 0.674. There are only 5 tests, so we can write
|            |       |        |        |       |       |
|------------|-------|--------|--------|-------|-------|
| $p$-values | 0.002 | 0.012  | 0.187  | 0.508 | 0.674 |
| $\alpha/i$ | 0.01  | 0.0125 | 0.0167 | 0.025 | 0.05  |
Note the 1^st^ and 2^nd^ $p$-values are lower than $\alpha/i$ but the 3^rd^ is not. Therefore, the 1^st^ and 2^nd^ tests are significant, but the rest are not.
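The same decisions can be reproduced with base R's `p.adjust()`, which the simulation below also uses; here is a quick sketch with the $p$-values above:

```{r}
p = c(0.187, 0.508, 0.002, 0.012, 0.674)  # the same five p-values, in original order

# Holm-adjusted p-values can be compared directly against alpha
p.adjust(p, method = "holm")
p.adjust(p, method = "holm") < 0.05       # TRUE only for the two smallest p-values
```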
The Holm method requires a bit more work than the standard Bonferroni, but achieves higher power while still controlling the FWER at $\alpha$.
Illustration on simulation study example:
```{r}
mult.comp.sim.holm = function(g,n,i=5000,alpha=0.05){
  sapply(g,function(..){
    replicate(i,ttests.pairs(matrix(rnorm(..*n),ncol=..,nrow=n))$pvalue %>%
      {.[lower.tri(.)]} %>% {any(p.adjust(.,"holm") < alpha)}) %>% mean
  })
}
```
```{r}
set.seed(1)
holm = 2:30 %>%
  {data.frame(g=.,p=mult.comp.sim.holm(.,20),method="Holm-Bonferroni")}

ggplot(holm,aes(x=g,y=p)) + geom_point() + geom_smooth(se=F) +
  geom_hline(aes(yintercept=0.05),linetype="dashed",color="red") +
  scale_y_continuous(breaks=c(seq(0,1,.2),.05),limits=c(0,1),expand=c(0,0)) +
  labs(x="Number of groups",y="FWER",
       title=expression(paste("Probability of false positive in ",
                              bgroup("(",atop(g,2),")")," tests with Holm-Bonferroni correction")))
```
Since the Holm-Bonferroni method offers an improvement in power over the traditional Bonferroni method with no increase in type I error rate and no loss of generality, it is now commonly used and generally preferred over the standard Bonferroni correction.
# Tukey range test
Also known as Tukey’s Honest Significant Difference (or HSD). If you are specifically conducting ***all possible pairwise $t$-tests***, this is an additional method that can be used to very quickly and easily determine which groups have means that are significantly different.
Note it is often wrongly believed that Tukey’s HSD must be preceded by a significant ANOVA $F$-test (which we will learn about next week). This is false, and in fact doing so may in certain circumstances be slightly harmful ^[Hsu, J. C. (1996). Multiple comparisons: Theory and methods. Chapman & Hall. p.177]. You can use Tukey HSD whenever you wish to compare all possible group means by conducting all pairwise $t$-tests while controlling the FWER at $\alpha$.
First, compute the *pooled variance* $s_p^2$, a quantity that sort of represents the “overall” variance for all the data if we ignored the groups. In the formula below, $i=1,2,…,k$ represents the $i$-th group, and $n_i$ and $s_i^2$ represent the sample size and sample variance of the $i$-th group.
$$
s_{p}^{2}={\frac {\sum _{i=1}^{k}(n_{i}-1)s_{i}^{2}}{\sum _{i=1}^{k}(n_{i}-1)}}={\frac {(n_{1}-1)s_{1}^{2}+(n_{2}-1)s_{2}^{2}+\cdots +(n_{k}-1)s_{k}^{2}}{n_{1}+n_{2}+\cdots +n_{k}-k}}
$$
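For concreteness, here is a minimal sketch of computing $s_p^2$ in R; the list `groups` is hypothetical, standing in for your $k$ groups of observations:

```{r}
groups = list(rnorm(20), rnorm(20), rnorm(25))   # hypothetical k = 3 groups
n.i  = sapply(groups, length)                    # group sizes n_i
s2.i = sapply(groups, var)                       # group sample variances s_i^2
s.p2 = sum((n.i - 1) * s2.i) / sum(n.i - 1)      # pooled variance
s.p2
```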
From here, there are 2 slightly different approaches, each with their own merits.
### Threshold of difference
This method is slightly easier to use if your data is “balanced”, i.e. all groups have the same number of observations. Using your pooled variance $s_p^2$, find the following
$$
HSD=q_{\alpha,k,N-k}\sqrt{\frac{s_p^2}{n}}
$$
where $k$ is the number of groups, $n$ is the number of observations in each group, $N$ is the total number of observations, and $q_{\alpha,k,N-k}$ is a quantile of the [studentized range distribution](https://en.wikipedia.org/wiki/Studentized_range_distribution) (we won't go into much detail about it, but think of it as the distribution of the (normalized) largest absolute difference between the group means). Since it describes absolute differences, it is positive-valued, so "more extreme" always means the upper tail. Thus we use $1-\alpha=0.95$ as the first argument to `qtukey()` below; the 2^nd^ and 3^rd^ arguments are just the parameters of the distribution.
We can conveniently use `qtukey()` to compute this for our case. Suppose we have $g=5$ groups with $n=20$ people each. Then, there are $N=100$ observations in total. Our critical value can be found by
```{r}
qtukey(0.95,5,100-5)
```
We can then compare this HSD to any pair of groups in our sample. In particular, for some groups $g_i,g_j$ in our sample, if
$$
|\bar{y}_i-\bar{y}_j|\geq HSD
$$
then that result is significant.
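Putting the pieces together, here is a minimal sketch for the balanced $g=5$, $n=20$ case; the matrix `y` of simulated observations (one group per column) is hypothetical:

```{r}
set.seed(2)
y = matrix(rnorm(100), ncol = 5)   # 20 observations per group, one group per column

k = ncol(y); n = nrow(y); N = k * n
s.p2 = sum((n - 1) * apply(y, 2, var)) / (N - k)   # pooled variance (balanced case)
HSD  = qtukey(0.95, k, N - k) * sqrt(s.p2 / n)

# any pair of groups whose means differ by at least HSD is significantly different
abs(outer(colMeans(y), colMeans(y), "-")) >= HSD
```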
### Tukey-Kramer method

If your data is unbalanced, i.e. the groups don't all have the same number of observations, you have to resort to the Tukey-Kramer method, which involves constructing a confidence interval for each pair of groups. In particular, construct for each pair $i,j$ the interval
$$
\bar{y}_i-\bar{y}_j\,\pm\,q_{\alpha,k,N-k}\sqrt{\frac{s_p^2}{2}\left(\frac1{n_i}+\frac1{n_j}\right)}
$$
To tell whether a pair is significantly different, simply check whether the interval contains 0.
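In practice you rarely need to build these intervals by hand: base R's `TukeyHSD()` applied to an `aov()` fit computes the pairwise intervals and handles unequal group sizes. A minimal sketch with hypothetical long-format data:

```{r}
df = data.frame(y     = rnorm(55),
                group = factor(rep(c("A","B","C"), times = c(20, 20, 15))))

# intervals that do not contain 0 indicate significantly different group means
TukeyHSD(aov(y ~ group, data = df), conf.level = 0.95)
```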
Let’s also show this on the simulation example:
```{r}
mult.comp.sim.tukey = function(g,n,i=5000,alpha=0.05){
  sapply(g,function(..){
    q = qtukey(1-alpha,..,..*n-..)
    replicate(i,matrix(rnorm(..*n),ncol=..,nrow=n) %>%
      {diff(range(colmeans(.))) >
         q*sqrt(sum(colVars(.)*(n-1))/(..*n-..)/n)}) %>% mean
  })
}
```
```{r}
set.seed(1)
tukey = 2:30 %>%
  {data.frame(g=.,p=mult.comp.sim.tukey(.,20),method="Tukey")}

ggplot(tukey,aes(x=g,y=p)) + geom_point() + geom_smooth(se=F) +
  geom_hline(aes(yintercept=0.05),linetype="dashed",color="red") +
  scale_y_continuous(breaks=c(seq(0,1,.2),.05),limits=c(0,1),expand=c(0,0)) +
  labs(x="Number of groups",y="FWER",
       title=expression(paste("Probability of false positive in ",
                              bgroup("(",atop(g,2),")")," pairwise t-tests using Tukey HSD")))
```
### Comparison with previous methods
Again, remember the Tukey method is ***only applicable for pairwise $t$-tests*** where you wish to compare to see if the means are significantly different. It is NOT a general method that can be applied to any $p$-values, unlike the Bonferroni or the Holm-Bonferroni method.
However, when pairwise $t$-tests are what you need, Tukey HSD gives one of the best combinations of controlling the type I error rate while preserving high power. This is evidenced by the fact that the curve above very closely hugs the $\alpha=0.05$ line, whereas the other two methods are somewhat conservative and push the error rate below the nominal level. That might not seem like a problem, but remember that a type I error rate lower than your desired $\alpha$ always comes at the cost of reduced power, which is bad.
Here is a plot comparing all three methods in the case of ***pairwise $t$-tests***, showing their behavior more closely.
```{r}
rbind(bonf,holm,tukey) %>%
  ggplot(aes(x=g,y=p,color=method,shape=method)) + geom_point(aes(size=method)) + geom_smooth(se=F) +
  geom_hline(aes(yintercept=0.05),linetype="dashed",color="red") +
  scale_y_continuous(limits=c(0,.1),expand=c(0,0)) + scale_shape_manual(values=c(4,20,15)) +
  scale_size_manual(values=c(4,3,1.5)) + scale_color_brewer(palette="Dark2") +
  labs(x="Number of groups",y="FWER")
```