Week 8: p-values
Given some observed data, it would be useful to know how surprising they are. Are these data consistent with what I’d expect by chance? If not, something more interesting might be going on.
Take Laplace’s question of the birth rate of boys vs. girls in Paris, for example. He observed 251,527 boys in 493,472 births. Is this 0.51 frequency surprisingly different from what we’d expect by chance?
To be quantitative, we have to specify exactly what chance means. We formulate a hypothesis called the “null hypothesis”, $H_0$, so that we can calculate $P(D \mid H_0)$, the probability distribution over different data outcomes, given the null hypothesis.
In the Laplace example, the obvious null hypothesis is that boys and girls are equiprobable: $p = 0.5$. The possible data outcomes, in $n = 493472$ total births, are $c = 0, 1, \ldots, n$ boys. The probability of any given count of boys is a binomial distribution:

$$P(c \mid p, n) = \binom{n}{c} p^c (1-p)^{n-c}$$
As Laplace was aware, the probability of any specific outcome might be absurdly small, especially as $n$ gets large. A specific outcome can be unlikely (in the sense that you wouldn’t have bet on exactly that outcome beforehand), but unsurprising (in the sense that it’s one of many outcomes that are consistent with the null hypothesis). If $n = 493472$, the probability of getting exactly 50% boys ($c = 246736$) is tiny, about 0.001. But the probability of getting a number within 1000 of 246736 is more than 99%.
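We can check these two numbers with SciPy’s binomial distribution (a quick sketch; the variable names are mine):

from scipy.stats import binom

n = 493472    # total births
p = 0.5       # null hypothesis: boys and girls equiprobable
c = n // 2    # exactly 50% boys: c = 246736

binom.pmf(c, n, p)                                      # ~0.001: any one exact outcome is unlikely
binom.cdf(c + 1000, n, p) - binom.cdf(c - 1001, n, p)   # >0.99: a count within 1000 of 246736 is unsurprising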
Indeed, we may be able to find a range of data that are consistent with the null hypothesis, even if any particular one outcome is unlikely, and then ask if our observed data is outside that plausible range. This requires that the data can be arranged in some sort of linear order, so that it makes sense to talk about a range, and about data outside that range. That’s true for our count of boys $c$, and it’s also true of a wide range of “statistics” that we might calculate to summarize a dataset (for example, the mean $\bar{x}$ of a bunch of observations $x_1 \ldots x_n$).
For example, what’s the probability that we would observe $c = 251527$ boys or more in Laplace’s problem, if $p = 0.5$? For this, we (and Laplace) use a cumulative probability function (CDF), the probability of getting a result of $x$ or less:

$$P(X \le x \mid \theta) = \sum_{x' \le x} P(X = x' \mid \theta)$$

Our boy count is discrete (and only defined on $c \ge 0$), so we can get $P(C \ge 251527) = 1 - P(C \le 251526 \mid p)$ in Python’s SciPy stats.binom module:
import scipy.stats as stats

p = 0.5       # null hypothesis: boys and girls equiprobable
c = 251527    # observed number of boys
n = 493472    # total births
1 - stats.binom.cdf(c-1, n, p)   # P(C >= c) = 1 - P(C <= c-1)

which gives us 1.1e-16. That’s not what Laplace got!

The trouble with computers
This number is totally wrong. The math is right, but computers are annoying. The number is so close to zero, we get an artifact from a floating-point rounding error. When numbers get very small, we have to worry about pesky details. So let’s take a detour for a second into how machines do arithmetic, and where it can go wrong if you’re not paying enough attention.
On a machine, in floating point math, $1 - x = 1$ for any $x$ smaller than some small threshold $\epsilon$. In double-precision floating-point math (what Python uses internally), the machine $\epsilon$ is 1.1e-16. This is the smallest relative unit of magnitude that two floating point numbers can differ by. The result of stats.binom.cdf() is so close to 1 that the machine can’t keep track of the precision; it just left its return value at one epsilon less than 1, and gives us $1 - (1 - \epsilon) = \epsilon$.
We have to make sure that we never try to represent $1 - x$ if we know $x$ might be small; we need to work with $x$ itself instead. Here that means we want SciPy to tell us 1 − CDF directly, instead of the CDF. That’s got a name: the survival function, .sf() in SciPy. Let’s try again:
p = 0.5
c = 251527
n = 493472
stats.binom.sf(c-1, n, p)   # survival function: P(C >= c), computed directly

Now we get 1.2e-42, which is right. There’s a tiny probability that we’d observe 251,527 boys or more, if the boy-girl ratio is 50:50.
Definition of a p-value
A p-value is the probability that we would have gotten a result at least this extreme, if the null hypothesis were true.
We get the p-value from a cumulative probability function, so it has to make sense to calculate a CDF. There has to be an order to the data, so that “more extreme” is meaningful. Usually this means we’re representing the data as a single number: either the data is itself a number (the count of boys $c$, in the Laplace example), or a summary statistic like a mean.
For example, it wouldn’t make sense to talk about the p-value of the result of rolling a die $n$ times. The observed data are six values (the counts $c_1 \ldots c_6$ of each face), and it’s not obvious how to order them. We could calculate the p-value of observing $c_6$ sixes or more out of $n$ rolls, though. Similarly, it wouldn’t make sense to talk about the p-value of a specific poker hand, but you could talk about the p-value of drawing a pair or better, because the value of a poker hand is orderable.
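For instance (a made-up example, not from the text above): if we saw 30 sixes in 100 rolls of a fair die, the p-value for “that many sixes or more” is a binomial tail probability:

from scipy.stats import binom

n_rolls = 100    # hypothetical number of rolls
n_sixes = 30     # hypothetical observed count of sixes
binom.sf(n_sixes - 1, n_rolls, 1/6)   # P(X >= 30) for a fair die; about 7e-4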
A p-value is a false positive rate
We will get to this in more detail when we cover classifiers, but there is something called a false positive rate. If you have a group of items belonging to one class (“0”) or another (“1”), and then you apply some strategy to guess the class of each item, you can assess the performance of your strategy by calculating the false positive rate: the fraction of false positives out of all negatives, $\frac{FP}{FP + TN}$. If we consider our test statistic $s$ to be the threshold for defining positives, i.e. everything that scores at least $s$ is called positive, then the p-value and the false positive rate are the same thing: for data samples generated by the null hypothesis (negatives), what fraction of the time do they nonetheless score $s$ or greater?
This idea leads to a simple way of calculating p-values, using order statistics. Generate synthetic negative datasets, calculate the score (test statistic) for each of them, and count the fraction of times that you get $s$ or more; that’s the p-value for score $s$.
Any experimentalist is familiar with this idea. Do negative controls. Simulate negative datasets and count how frequently a negative dataset gets a score of your threshold or more.
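Here is a minimal sketch of that recipe (everything about it is illustrative: the null model is standard normal noise and the test statistic is just the sample mean):

import numpy as np

rng = np.random.default_rng(0)

def test_statistic(x):
    # the score used to summarize a dataset; here, simply the mean
    return np.mean(x)

def empirical_pvalue(observed_score, n_null=10000, sample_size=50):
    # generate synthetic negative datasets from the null hypothesis (standard normal noise)
    null_scores = np.array([test_statistic(rng.normal(0, 1, sample_size))
                            for _ in range(n_null)])
    # fraction of null datasets scoring at least as high as the observed score;
    # the +1's keep the estimate conservative (never exactly zero)
    return (np.sum(null_scores >= observed_score) + 1) / (n_null + 1)

print(empirical_pvalue(0.5))   # a mean of 0.5 is ~3.5 standard errors above 0 here, so p is small (order 1e-4)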
p-values are uniformly distributed on (0,1)
If the data were actually generated by the null hypothesis, and you did repeated experiments, calculating a p-value for each observed data sample, you would see that the p-value is uniformly distributed. By construction – simply because it’s a cumulative distribution! 5% of the time, if the null hypothesis is true, we’ll get a p-value of 0.05 or less; 50% of the time, we’ll get a p-value of 0.5 or less.
Understanding this uniform distribution of p-values is important. Sometimes people say that a result with a p-value of 0.7 is “less significant” than a result with a p-value of 0.3, but in repeated samples from the null hypothesis, you expect to obtain the full range of possible p-values from 0..1 in a uniform distribution. Seeing a p-value of 0.7 is literally equally probable as seeing a p-value of 0.3, or 0.999, or 0.001, under the null hypothesis. Indeed, seeing a uniform distribution under the null hypothesis is a good check that you’re calculating p-values correctly.
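A quick simulation to see this (a sketch, with an arbitrary choice of null model and test; here each “experiment” is a one-tailed Z-test of 20 standard-normal observations, so the null hypothesis is true every time):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_experiments = 100000
sample_size = 20

# repeated experiments in which the null hypothesis (mean 0, sd 1) is actually true
data = rng.normal(0, 1, size=(n_experiments, sample_size))
z = data.mean(axis=1) * np.sqrt(sample_size)   # Z-score of each sample mean (SE = 1/sqrt(n))
pvals = norm.sf(z)                             # one-tailed p-values

# each decile of (0,1) should hold about 10% of the p-values
print(np.histogram(pvals, bins=10, range=(0, 1))[0] / n_experiments)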
Null hypothesis significance testing
P-values were introduced in the 1920s by the biologist and statistician Ronald Fisher. He intended them to be used as a tool for detecting unusual results:

“Personally, the writer prefers to set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach this level. A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.”
There are three important things in this passage. First, it introduced $P < 0.05$ as a standard of scientific evidence. Second, Fisher recognized that this was a “low standard”. Third, by saying “rarely fails”, Fisher meant it to be used in the context of repeated experiments, not a single experiment: a true effect should reproducibly and repeatedly be distinguishable from chance.
Many fields of science promptly forgot about the latter two points and adopted $P < 0.05$ as a hard standard of scientific evidence. A result is said to be “statistically significant” if it achieves $P < 0.05$. Sometimes, contrary to both logic and what Fisher intended, a single result with $P < 0.05$ is publishable in some fields.
Nowadays there’s a backlash. Some people want to change the 0.05 threshold to 0.005, which rather misses the point. Some people want to ban P-values altogether.
P-values are useful, if you’re using them the way Fisher intended. It is useful to know when the observed data aren’t matching well to an expected null hypothesis, alerting you to the possibility that something else may be going on. But 5% is a low standard – even if the null hypothesis is true, 5% of the time you’re going to get results with $P < 0.05$. You need to see your unusual result reproduce consistently before you’re going to believe in it.
When you say that you’ve rejected the null hypothesis $H_0$, and therefore your hypothesis $H_1$ is true. A tiny p-value doesn’t necessarily mean the data support some other hypothesis of yours, just because the data don’t agree with the null hypothesis. Nothing about a p-value calculation tests any other hypothesis, other than the null hypothesis.
When you equate “statistical significance” with effect size. A minuscule difference can become statistically significant, given large sample sizes. The p-value is a function of both the sample size and the effect size. In a sufficiently large dataset, it is easy to get small p-values, because real data always depart from simple null hypotheses. This is often the case in large, complex biological datasets.
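A sketch of that point with made-up numbers: a shift of 0.01 standard deviations in the mean is invisible with 100 observations but becomes “highly significant” with a million, even though the effect size never changes.

import numpy as np
from scipy.stats import norm

effect = 0.01                      # hypothetical true difference in means, in units of sigma
for n in (100, 10_000, 1_000_000):
    z = effect * np.sqrt(n)        # Z = effect / SE, with SE = sigma / sqrt(n) and sigma = 1
    print(n, 2 * norm.sf(z))       # two-tailed p-value: ~0.92, ~0.32, ~1.5e-23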
When you do multiple tests but you don’t correct for it. Remember that the p-value is the probability that your test statistic would be at least this extreme if the null hypothesis is true. If you choose $P < 0.05$ (the “standard” significance threshold), you’re going to get values that small 5% of the time, even if the null hypothesis is true: that is, you are setting your expected false positive rate to 5%. Suppose there’s nothing going on and your samples are generated by the null hypothesis. If you test one sample, you have a 5% chance of erroneously rejecting the null. But if you test a million samples, 50,000 of them will be “statistically significant”.
Most importantly, using a p-value to test whether your favorite hypothesis is supported by the data is fundamentally illogical. A p-value test never even considers your hypothesis $H_1$; it only considers the null hypothesis $H_0$. “Your model is unlikely; therefore my model is right!” is just not the way logic works.
Multiple testing correction
Suppose you do test one million things. What do you need your p-value to be (per test), to decide that any positive result you get in $10^6$ tests is statistically significant?
Well, with a per-test p-value threshold of $p$, you expect $np$ false positives in $n = 10^6$ tests. The probability of obtaining one or more false positives is (by Poisson) $1 - e^{-np}$. This is still a p-value, but with a different meaning, conditioned on the fact that we did $n$ tests: now we’re asking, what is the probability that we get a result at least this extreme (at least one positive prediction), given the null hypothesis, when we do $n$ independent experiments? For small $x$, $1 - e^{-x} \simeq x$, so the multiple-test-corrected p-value is approximately $np$. That is, multiply your per-test p-value by the number of tests you did to get a “corrected p-value”. Like many simple ideas, this simple idea has a fancy name: it’s called a Bonferroni correction. It’s considered to be a very conservative correction.
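The arithmetic, as a sketch (the per-test p-value here is hypothetical):

from math import exp

n_tests = 1_000_000
p_per_test = 1e-8                              # hypothetical per-test p-value

exact = 1 - exp(-n_tests * p_per_test)         # probability of >= 1 false positive (Poisson)
bonferroni = min(1.0, n_tests * p_per_test)    # the n*p approximation, capped at 1
print(exact, bonferroni)                       # ~0.00995 vs 0.01: nearly the same when small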
The false discovery rate (FDR)
One reason that the Bonferroni correction is conservative is the following. Suppose you run a genome-wide screen and you make 80,000 predictions. Do you really need all of them to be “statistically significant” on their own? That is, do you really need to know that the probability of even one false positive in that search is less than 0.05, or whatever? More reasonably, you might say you’d like to know that 99% of your 80,000 results are true positives, and 1% or less of them are false positives.
Suppose you tested a million samples to get your 80,000 positives, at a per-test p-value threshold of $P < 0.05$. By the definition of the p-value you expected up to 50,000 false positives, because in the worst case, all million samples are in fact from the null hypothesis, and at a significance threshold of 0.05, you expect 5% of them to be called as (false) positives. So if you trust your numbers, at least 30,000 of your 80,000 predictions (80,000 positives − 50,000 expected false positives) are expected to be true positives. You could say that the expected fraction of false positives in your 80,000 positives is 50000/80000 = 62.5%.
This is called a false discovery rate calculation – specifically, it is called the Benjamini-Hochberg FDR.
The false discovery rate (FDR) is the proportion of your called “positives” that are expected to be false positives, given your p-value threshold, the number of samples you tested, and the number of positives that were “statistically significant”.
Getting to the FDR from the p-value is straightforward. Suppose we’ve ranked all $n$ tests by their p-value, and we set a cutoff threshold at the $r$th best test – i.e. we take the top $r$ tests and call them statistically significant. Let the p-value of the $r$th test be $p_r$; then we expect up to $n p_r$ false positives with this p-value or better (because the p-value is literally the false positive rate). We made $r$ predictions, and we expect up to $n p_r$ of them to be false positives... the FDR is the fraction of the predictions that we expect to be false, $\frac{n p_r}{r}$. People typically choose FDR thresholds of 0.05 or so.
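Here is a minimal sketch of that calculation on a list of p-values (a bare-bones Benjamini-Hochberg-style estimate, not a reimplementation of any particular library routine):

import numpy as np

def bh_fdr(pvals):
    # rank the n tests by p-value; at rank r (1-based) with p-value p_r we expect
    # up to n * p_r false positives among the top r calls, so FDR ~ n * p_r / r
    p = np.sort(np.asarray(pvals))
    n = len(p)
    r = np.arange(1, n + 1)
    fdr = p * n / r
    # enforce monotonicity so the estimated FDR never decreases as the cutoff loosens
    return np.minimum.accumulate(fdr[::-1])[::-1]

# hypothetical example: how many tests pass an FDR threshold of 0.05?
pvals = [1e-6, 1e-4, 0.003, 0.004, 0.04, 0.2, 0.5, 0.7]
print(np.sum(bh_fdr(pvals) <= 0.05))   # 4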
What Bayes says about p-values
A good way to see the issues with using p-values for hypothesis testing is to look at a Bayesian posterior probability calculation. Suppose we’re testing our favorite hypothesis $H_1$ against a null hypothesis $H_0$, and we’ve collected some data $D$. What’s the probability that $H_1$ is true? That’s its posterior:

$$P(H_1 \mid D) = \frac{P(D \mid H_1)\, P(H_1)}{P(D \mid H_1)\, P(H_1) + P(D \mid H_0)\, P(H_0)}$$
To put numbers into this, we need to be able to calculate the likelihoods $P(D \mid H_1)$ and $P(D \mid H_0)$, and we need to know the priors $P(H_1)$ and $P(H_0)$ – how likely $H_1$ and $H_0$ were before the data arrived.
What does p-value testing give us? It gives us $P(s(D) \ge x \mid H_0)$: the cumulative probability that some statistic $s(D)$ of the data has a value at least as extreme as the observed value $x$, under the null hypothesis.
We don’t know anything about how likely the data are under our hypothesis $H_1$. We don’t know how likely $H_1$ or $H_0$ were in the first place. And we don’t even know $P(D \mid H_0)$, really, because all we know is a related cumulative probability function of $H_0$ and the data.
Therefore it is utterly impossible (in general) to calculate a Bayesian posterior probability, given a p-value – which means, a p-value simply cannot tell you how much evidence your data give in support of your hypothesis $H_1$.
(This was the fancy way of saying that just because the data are unlikely given $H_0$ does not logically mean that $H_1$ must be true.)
What Nature (2014) said about p-values
The journal Nature ran a commentary article in 2014 called “Statistical errors” (https://www.nature.com/articles/506150a), about the fallacies of p-value testing. The article showed one figure, reproduced to the right. The figure shows how a p-value corresponds to a Bayesian posterior probability, under three different assumptions of the prior odds, for $P = 0.05$ or $P = 0.01$. It shows, for example, that a result with a “very significant” p-value can still leave the posterior probability of the null hypothesis quite high: 30%, if the null hypothesis was pretty likely to be true to begin with. The commentary was trying to illustrate the point that the p-value is not a posterior probability, and that a “significant” p-value does not move the evidence as much as you might guess.
But I just told you that it is utterly impossible (in general) to calculate a posterior probability from a p-value, and here we have Nature doing exactly that?
The key detail in the Nature commentary flashes by in a phrase – “According to one widely used calculation...” – and references a 2001 paper from the statistician Steven Goodman (https://pubmed.ncbi.nlm.nih.gov/11337600/). Let’s look at that calculation and see if we can understand it, and if we agree with its premises.
First we need some background information on statistical tests involving Gaussian distributions.
Differences of means of Gaussian distributions
Suppose I’ve collected a dataset with a mean value of $\bar{x}$, and suppose I have reason to expect, in replicate experiments, that the mean is normally (Gaussian) distributed with a standard error $\sigma = \mathrm{SE}$. (For now just think of the standard error as the standard deviation of our observed means, in repeated experiments.) My null hypothesis is that the true mean is $\mu$. I want to know if my observed $\bar{x}$ is surprisingly far from $\mu$ – that is: what is the probability that I would have observed an absolute difference $|\bar{x} - \mu|$ at least this large, if I were sampling from a Gaussian of mean $\mu$ and standard deviation $\sigma$?
A Gaussian probability density function is defined as:

$$P(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
A useful thing to notice about Gaussian distributions is that they’re identical under a translation of $x$ and $\mu$, and under a multiplicative rescaling of $x$ and $\sigma$. The probability only depends on the ratio $(x - \mu)/\sigma$: that is, on the number of standard deviations away from the parametric mean. So, if I calculate a so-called Z-score:

$$Z = \frac{x - \mu}{\sigma}$$
then I can talk in terms of a simplified standard normal distribution:

$$P(Z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{Z^2}{2}}$$
This is a very useful simplification: among other things, it’s easier to remember things in units of “number of standard deviations away from the mean”, and it will help you develop general, quantitative intuition for Gaussian-distributed quantities.
The probability that we get a Z score at least as extreme as our observed $z$ is an example of a p-value:

$$P(Z \ge z) = \int_z^\infty \frac{1}{\sqrt{2\pi}}\, e^{-\frac{u^2}{2}}\, du$$
We might be interested not just in whether our mean is surprisingly larger than $\mu$, but also if it’s surprisingly smaller. That’s the difference between what statisticians call a “one-tailed” test versus a “two-tailed” test. In a one-tailed test, I’m specifically testing whether $Z \ge z$, for example; in a two-tailed test, I’m testing the absolute value, $|Z| \ge z$. The Gaussian is symmetric, so

$$P(|Z| \ge z) = P(Z \le -z) + P(Z \ge z) = 2\, P(Z \ge z)$$

We get $P(Z \ge z)$ from the Gaussian cumulative distribution function (CDF):

$$P(Z \ge z) = 1 - P(Z < z) = 1 - \mathrm{CDF}(z)$$

(Because this is a continuous function, $P(Z < z) = P(Z \le z)$ asymptotically; there’s asymptotically zero mass exactly at $z$.)
There’s no analytical expression for a Gaussian CDF, but it can be computed numerically. In Python, the scipy.stats.norm module includes a CDF method and more:
from scipy.stats import norm

z = 1.96
p = 0.05

1 - norm.cdf(z)   # gives one-tailed p-value P(Z >= 1.96) = 0.025
norm.sf(z)        # 1-CDF(z) is the “survival function”
# you can invert the survival function with `.isf()`:
norm.isf(p)       # given a 1-tailed p-value P(Z > z), what Z do you need? (gives 1.64)
norm.isf(p/2)     # or for a two-tailed P(|Z| > z) (gives 1.96)
Goodman’s calculation
Crucially, for a Gaussian-distributed statistic, if I tell you the p-value, you can calculate the Z-score (by inverting the CDF); and given the Z-score, you can calculate the likelihood $P(Z \mid H_0)$, the value of the standard normal density at $Z$. Thus in this case we can convert a P-value to a likelihood of the null hypothesis for a $Z$-score (that we got from our mean $\bar{x}$):

$$P(Z \mid H_0) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{Z^2}{2}}$$
For example, a p-value of 0.05 for a two-tailed test implies Z = 1.96, by inverting the CDF. Roughly speaking, 5% of the time, we expect a result more than 2 standard deviations away from the mean on either side.
So now we’ve got $P(D \mid H_0)$. That’s one of our missing terms dealt with.
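In SciPy, that conversion looks like this (a sketch for a two-tailed Gaussian test; the p-value is a placeholder):

from scipy.stats import norm

p = 0.05                  # an observed two-tailed p-value
z = norm.isf(p / 2)       # invert the CDF: z = 1.96
lik_null = norm.pdf(z)    # density of the observed Z under the null, ~0.058
print(z, lik_null)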