Power of hypothesis tests
Jennifer Wilcock STAT221 2020 S2
Review of earlier notes: significance and power
Remember the possible outcomes of hypothesis testing:
                 H0 Not Rejected          H0 Rejected
H0 true          Right Decision (1 − α)   Type I Error (α)
H1 true          Type II Error (β)        Right Decision (1 − β)
where α and β are the probabilities of getting the conclusion wrong.
The power of the test is the probability of correctly rejecting the null, (1 − β).
The effect of getting α wrong is typically discussed at 100-level, but not the effect of getting β ‘wrong’. If we set α = 0.05 then we are accepting that, when the null hypothesis is true, 1 time out of every 20 hypothesis tests we will conclude that there is a ‘significant effect’ when there isn’t.
Ideally, in a statistical test we want to have a small α and a small β (to minimise being ‘wrong’).
In practice, we control α (we choose α) but have less control over β (because it depends on the context).
There is always a trade-off between the power of a test and the likelihood of false positives.
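To see this trade-off numerically, here is a small illustrative check (an addition to these notes, using hypothetical settings with R’s built-in power.t.test function): tightening α from 5% to 1% reduces the chance of a false positive, but it also reduces the power to detect a real effect.
# Illustrative only: power of a one-sample t-test with n = 20 observations,
# true mean 0.5 and sd 1 (hypothetical values), at two significance levels
power.t.test(n = 20, delta = 0.5, sd = 1, sig.level = 0.05,
             type = "one.sample")$power   # larger alpha, larger power
power.t.test(n = 20, delta = 0.5, sd = 1, sig.level = 0.01,
             type = "one.sample")$power   # smaller alpha, smaller power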
Choice of a parametric or nonparametric test
Should we use a t-test or use a permutation test for a particular problem? Or more generally a parametric versus nonparametric test?
If the assumptions of a parametric t-test are violated strongly enough, we might prefer doing a nonparametric permutation test. However, nonparametric tests sometimes have less power:
• the ability to reject the null hypothesis when it is false.
As a general rule of thumb, if the assumptions underlying a parametric test (e.g. a t-test) are correct then it will be more powerful than a permutation test (or other nonparametric test).
The factors that affect power
One way of choosing which test to use is to try to determine which has the most statistical power, where the power of a test is:
• the probability that the test rejects the null hypothesis given that the null is false.
Power depends on:
• the true population distribution the observations are from,
• the test statistic used,
• the ’sided-ness’ of the test (one-sided or two-sided), and
• the sample size.
For example consider a one sample test comparing
H0 : μ = 0   against   H1 : μ ≠ 0
where the observed data come from a normal distribution with mean μ and standard deviation σ.
The power of the test will be greater when (holding all else constant):
• there is a large sample size,
– so n = 1000 will provide more power than n = 10
• the variance of the normal distribution is small,
– so a variance of 0.001 will result in more power than a variance of 10 (with a large variance the ‘signal’ will be lost in the noise),
• the mean of the alternative distribution is far from that of the null hypothesis (μ = 0 in this case),
– so a mean of 1000 for the alternative hypothesis will result in more power than when the mean is 0.1
– (this is called the ‘effect size’, so a larger effect size results in greater power)
• a one-sided alternative is used rather than a two-sided alternative,
– since the rejection region in the relevant tail is larger and so there is a greater probability of rejecting the null for an observed difference.
Finally, the power is likely to differ between different testing procedures, so the usual tests for the mean (a t-test, a z-test, and a permutation test) may have different power.
Usually power is determined through simulation since it can be very difficult to derive a formula for the probability of correctly rejecting the null, except in simple cases.
Understanding the distribution of a test statistic (like the sample mean) when the null hypothesis is false is also more difficult than knowing the distribution of a test statistic when the null is true.
Sample size calculations
We can effectively trade off sample size, power, significance level, and effect size.
An important application of power analyses is to determine the minimum sample size needed to reject the null hypothesis with a prescribed power.
‘Power calculations’ are usually required before obtaining ethics approval to carry out a study.
We can specify an α level and a power level that we wish to achieve (typically 5% and 80%), then estimate what we think the effect size is that we believe we need to be able to detect, and then work out the minimum sample size needed to achieve this.
We can then play around with these assumptions and see how sensitive the sample size is to them.
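As a rough sketch of such a calculation (an addition to these notes, with hypothetical numbers), R’s built-in power.t.test function will solve for the minimum sample size given the other quantities:
# Hypothetical example: smallest n for a one-sample, two-sided t-test to detect
# an effect size of 0.5 (sd = 1) with 80% power at the 5% significance level
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.8,
             type = "one.sample", alternative = "two.sided")
The returned n is rounded up to the next whole number in practice, and we can then vary delta, power, or sig.level to see how sensitive the required sample size is to each assumption.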
An example
Suppose we have an observed sample of size n = 10 from a normal distribution with μ = 1 and standard deviation σ = 1.
Suppose we want to use the usual one sample Student-t test with hypotheses:
• H0 : μ = 0
• H1 : μ ≠ 0.
What is the probability of rejecting the null hypothesis?
Here we have fixed the sample size, the effect size, and the α-level, and we want to find the power of the test.
We can obtain the power of the test using simulations. We can:
• simulate many samples of size n = 10 from a normal with μ = 1 and σ = 1,
• test the null hypothesis at the α = 5% significance level,
• record the proportion of times the test is (correctly) rejected.
Here is the R code to do this:
# Sample size
n = 10
# Significance level of test
alpha = 0.05
# Number of simulations
N = 1000
set.seed(1)
# Create vector to store p-values from each test
pvalues = numeric(N)
for (i in 1:N) {
  x = rnorm(n, mean = 1, sd = 1)   # simulate dataset from alternative hypothesis
  pvalues[i] = t.test(x)$p.value   # store p-value from t-test
}
# Estimate power using sample proportion of rejected nulls
power.of.test = mean(pvalues <= alpha)
power.of.test
## [1] 0.766
hbreaks = seq(0, 1, 0.01)
upperbreaks = hbreaks[-1]
hist(pvalues, breaks = hbreaks, prob = T, col = ifelse(upperbreaks <= alpha, "red", "blue"),
     xlab = "p-values", main = "p-value Distribution and Significance Level")
abline(v = alpha, col = "red")
[Figure: ‘p-value Distribution and Significance Level’ — histogram of the simulated p-values (x-axis: p-values, y-axis: Density), with the bars below α = 0.05 coloured red and a vertical red line at α.]
The estimated power is 76.6%, so about 77% of the time the null hypothesis would be correctly rejected. In practice, 80% is the number typically used for sample size calculations.
The hist function in the above code colours the bars using: col = ifelse(upperbreaks <= alpha, "red", "blue")
Simulating power functions in general
The general procedure for simulating power is to:
• fix a sample size and population distribution to sample from, including its parameters
• simulate a dataset
• perform the hypothesis test at the significance level α
• repeat this process for a large number of simulations, keeping track of the number of times the null hypothesis was rejected (which estimates the power of the test).
The above process is then repeated for different values for the factors that influence the power.
Typically, we want to calculate the power over a range of possible sample sizes and effect sizes. When the power is defined over a range of sample sizes or effect sizes, they are referred to as power functions.
To assist with this, we can write an R function to calculate power:
power.simulation = function(n = 10, pop.mean = 1, pop.sd = 1, N = 1000,
                            alpha = 0.05, alt = "two.sided") {
  pvalues = numeric(N)
  for (i in 1:N) {
    x = rnorm(n, pop.mean, pop.sd)                      # simulate dataset from alternative hypothesis
    pvalues[i] = t.test(x, alternative = alt)$p.value   # store p-value from t-test
  }
  # Return estimated power
  return(mean(pvalues <= alpha))
}
We can test this function on the previous example:
set.seed(1)
power.simulation(n = 10, pop.mean = 1, pop.sd = 1, N = 1000,
                 alpha = 0.05, alt = "two.sided")
## [1] 0.766
Since default parameters have been provided for all the input arguments, these values do not actually need to be specified, so that:
set.seed(1)
power.simulation()
## [1] 0.766
returns exactly the same result.
Simulating a power function for different assumed population means
We can consider many possible values for the population mean, with all other simulation parameters left at their default values:
set.seed(1)
pop.means = seq(-2, 2, 0.1)
n.means = length(pop.means)
power.values = numeric(n.means)
for (i in 1:n.means) {
  power.values[i] = power.simulation(pop.mean = pop.means[i], N = 10000)
}
and plot the results:
plot(pop.means, power.values, type = "b", ylim = c(0, 1),
     xlab = "Assumed Population Mean", ylab = "Power of t-test",
     main = "One Sample two-sided t-test Monte Carlo Power Estimates",
     sub = "H0: mu = 0, alpha=5%")
[Figure: ‘One Sample two-sided t-test Monte Carlo Power Estimates’ — x-axis: Assumed Population Mean (−2 to 2), y-axis: Power of t-test; H0: mu = 0, alpha = 5%.]
Here, this power function is symmetric because the t-test is symmetric (negative and positive differences are treated in the same way). However, in general a power function is not necessarily symmetric for a two-tailed test. For one-tailed tests, the power function is an asymmetric, monotonically increasing or decreasing function.
The power increases the further the assumed population mean is away from the null hypothesis, i.e. as the population becomes increasingly different from the null (or as the ‘effect size’ increases).
The power is exactly equal to the significance level α = 0.05 when the population mean is zero (i.e. the value assumed under the null hypothesis), since in that case any rejection of the null is a Type I error.
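As a cross-check on the simulation (an addition to these notes; it assumes the same settings of n = 10, σ = 1 and α = 5%), the exact power of the one-sample t-test can be computed with R’s power.t.test and overlaid on the Monte Carlo estimates from the plot above:
# Exact power over the same grid of assumed means; the two-sided power is
# symmetric in the effect size, so abs() is used, and strict = TRUE counts
# rejections in both tails
exact.power = sapply(pop.means, function(m)
  power.t.test(n = 10, delta = abs(m), sd = 1, sig.level = 0.05,
               type = "one.sample", alternative = "two.sided",
               strict = TRUE)$power)
lines(pop.means, exact.power, lty = 2)   # dashed curve on the previous plot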
Effect size
‘Effect size’ is generally thought of as a measure of the strength of impact of a phenomenon.
The p-value gives a measure of statistical significance of the effect, whereas the effect size is more useful for assessing the physical relevance.
It is often used to describe the difference between a given population parameter θ under the null (which we often call θ0) and the true θ under the alternative hypothesis.
Since we generally don’t actually know the true effect size, θ − θ0, the relevant sample estimate is reported instead.
The effect size can be:
• unstandardized (e.g. x̄ − ȳ for the estimated difference in two population means), or
• standardized (e.g. the t-statistic t = (x̄ − ȳ)/sx̄−ȳ, the difference scaled by its standard error).
In the previous plot, power is plotted as a function of the difference in the assumed population mean θ = μ and the value under the null hypothesis θ0 = 0.
In this case, the x-axis is also the effect size, as θ − θ0 = μ − 0 = μ.
The effect size could be a difference in the mean, or other quantities like the standard deviation, population
proportion, or population odds, log-odds, or odds ratios.
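As a small hypothetical illustration (added here; the sample sizes and means are made up), both versions of the effect size can be computed directly from two simulated samples:
set.seed(2)
x = rnorm(30, mean = 1.0, sd = 1)   # hypothetical sample 1
y = rnorm(30, mean = 0.4, sd = 1)   # hypothetical sample 2
mean(x) - mean(y)                   # unstandardized: difference in sample means
t.test(x, y)$statistic              # standardized: the two-sample t-statistic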
Comparing the power of one-sided and two-sided tests
In practice, whether a test is one-sided or two-sided normally depends on the context, but we can think about whether one is more powerful in general.
We can investigate whether a one-sided test is more powerful than a two-sided test using our power function:
pop.means = seq(-2, 2, 0.1)
n.means = length(pop.means)
power.onesided = numeric(n.means)
power.twosided = numeric(n.means)
for (i in 1:n.means) {
  power.onesided[i] = power.simulation(pop.mean = pop.means[i], N = 10000, alt = "greater")
  power.twosided[i] = power.simulation(pop.mean = pop.means[i], N = 10000, alt = "two.sided")
}
and plot the results:
plot(pop.means, power.twosided, type = "b", col = "red", ylim = c(0, 1),
     xlab = "Assumed Population Mean (Effect Size)", ylab = "Power of t-test",
     main = "One Sample t-test (two-sided and one-sided)",
     sub = "H0: mu = 0 and H0: mu <= 0, alpha=5%")
lines(pop.means, power.onesided, type = "b", col = "blue")
legend("bottomright", legend = c("Two-sided", "One-sided"), lty = 1, col = c("red", "blue"))
[Figure: ‘One Sample t-test (two-sided and one-sided)’ — x-axis: Assumed Population Mean (Effect Size), y-axis: Power of t-test; red = two-sided, blue = one-sided; H0: mu = 0 and H0: mu <= 0, alpha = 5%.]
We can think about what is happening in three sections of the plot:
• for positive effect sizes
– the one-sided test has more power than the corresponding two-sided test (blue points above red)
– since the critical value will be lower, with a larger tail area, so that the null hypothesis is rejected more frequently (thus giving higher power).
• when the population mean is the same as the null hypothesis of μ = 0,
– the one-sided and two-sided tests are equally powerful
• for negative effect sizes
– the one-sided test has lower power than the corresponding two-sided test (blue points below red),
– since the population mean is behaving in the opposite manner to that considered in the alternative hypothesis.
We conclude that, so long as the null/alternative hypothesis is in the appropriate direction, a one-sided test is to be preferred as it is more powerful.
Power functions for multiple effect and sample sizes
Here we compare many possible values for both:
• the population mean, and
• the sample size.
# sequence of population means and sample sizes
pop.means = seq(0.5, 3, 0.5)
ns = seq(2, 20)
n.means = length(pop.means)
n.size = length(ns)
# Loop over all the sample sizes and population means
power.values = matrix(NA, nrow = n.size, ncol = n.means)
for (j in 1:n.means) {
  for (i in 1:n.size) {
    power.values[i, j] = power.simulation(n = ns[i], pop.mean = pop.means[j], N = 10000)
  }
}
Using these results, a statistician or applied researcher can determine the required sample size needed to find a significant effect, for a given level of power and assumed effect size:
matplot(ns, power.values, type = "l", lty = 1, ylim = c(0, 1),
        xlab = "Sample Size", ylab = "Estimated Power of t-test",
        main = "One Sample two-sided t-test Monte Carlo Power Estimates",
        sub = "H0: mu = 0, alpha=5%")
legend("bottomright", legend = c("mu=0.5", "mu=1", "mu=1.5", "mu=2", "mu=2.5", "mu=3"),
       lty = 1, col = c("black", "red", "green", "blue", "cyan", "purple"))
[Figure: ‘One Sample two-sided t-test Monte Carlo Power Estimates’ — x-axis: Sample Size (2 to 20), y-axis: Estimated Power of t-test; one line per mu = 0.5, 1, 1.5, 2, 2.5, 3; H0: mu = 0, alpha = 5%.]
As the sample size increases, the power increases (as variability of test statistic sampling distribution reduces).
As the effect size (μ − μ0) increases, so that the population mean is further away from the null hypothesis of zero, the power increases much more quickly with sample size.
Thus, to achieve a fixed level of power (80%, say), a larger effect size requires a smaller sample size. For example, for a fixed power of 80%:
• a minimum sample size of 5 is needed when μ = 2,
• a minimum sample size of 10 is needed when μ = 1.
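One possible way to read such minimum sample sizes directly off the simulation results (a sketch added to these notes, using the power.values matrix and ns vector defined above):
# For each assumed mean, the smallest simulated n whose estimated power reaches 80%
min.n = apply(power.values, 2, function(p) ns[which(p >= 0.8)[1]])
names(min.n) = paste0("mu=", pop.means)
min.n   # NA means 80% power was not reached for any n up to 20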
Power functions for different mean and variance parameters
Here we compare many possible values of both:
• the population mean, and
• the population variance,
for a fixed sample size n = 10.
# sequence of population means and variances
pop.means = seq(0.5, 3, 0.5)
pop.var = seq(0.5, 3, 0.5)
n.means = length(pop.means)
n.sds = length(pop.var)
# Loop over all the population variances and means
power.values = matrix(NA, nrow = n.sds, ncol = n.means)
for (j in 1:n.means) {
  for (i in 1:n.sds) {
    power.values[i, j] = power.simulation(pop.sd = sqrt(pop.var[i]),
                                          pop.mean = pop.means[j], N = 10000)
  }
}
matplot(pop.var, power.values, type = "l", lty = 1, ylim = c(0, 1),
        xlab = "Population Variance", ylab = "Estimated Power of t-test",
        main = "One Sample two-sided t-test Monte Carlo Power Estimates",
        sub = "H0: mu = 0, alpha=5%")
legend("bottomright", legend = c("mu=0.5", "mu=1", "mu=1.5", "mu=2", "mu=2.5", "mu=3"),
       lty = 1, col = c("black", "red", "green", "blue", "cyan", "purple"))
[Figure: ‘One Sample two-sided t-test Monte Carlo Power Estimates’ — x-axis: Population Variance (0.5 to 3.0), y-axis: Estimated Power of t-test; one line per mu = 0.5 to 3; H0: mu = 0, alpha = 5%.]
As the variability of the population increases, the power of the test reduces.
Therefore, to maintain a fixed power level (of say 80%), the sample size would need to increase as the population variance increases.
As a consequence, it is harder to find evidence for an effect in a more variable population, since the sample size needs to be larger to achieve the same power.
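A quick sketch of this point (added here, with hypothetical values) uses power.t.test to solve for the sample size needed for 80% power at a fixed effect of μ = 1 as the population standard deviation grows:
sds = c(0.5, 1, 2)   # hypothetical population standard deviations
sapply(sds, function(s)
  ceiling(power.t.test(delta = 1, sd = s, sig.level = 0.05, power = 0.8,
                       type = "one.sample")$n))   # required n rises with the sd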
End of notes for Power. Next is ‘bootstrapping’.