Hypothesis testing
(Module 6)
Statistics (MAST20005) & Elements of Statistics (MAST90058) Semester 2, 2022
1 Preface
  1.1 A cautionary word
  1.2 A motivating example
2 Classical hypothesis testing (Neyman-Pearson)
  2.1 Hypotheses
  2.2 Tests & statistics
  2.3 Errors (Type I, Type II)
  2.4 Significance level & power
  2.5 Alternative formulations
3 Significance testing (Fisher)
4 Modern hypothesis testing
5 Common scenarios
  5.1 Single proportion
  5.2 Two proportions
  5.3 Single mean
  5.4 Single variance
  5.5 Two means
  5.6 Two variances
6 Usage & (mis)interpretation
Aims of this module
• Introduce the concepts behind statistical hypothesis testing
• Explain the connections between estimation and testing
• Work through a number of common testing scenarios
• Emphasise the shortcomings of hypothesis testing
1.1 A cautionary word
What we are about to do…
• Over the next three weeks we will learn about hypothesis testing
• This is an approach to inference that dominates much of statistical practice. . .
• . . . by non-statisticians.
• It probably shouldn’t!
• The approaches described here are largely considered NOT best practice by professional statisticians
• More appropriate procedures usually exist
• . . . and we have already learnt some of them!
• But we need to learn these anyway because:
– Hypothesis testing is ubiquitous
– Need to understand its weaknesses
– Sometimes it’s useful, or at least convenient
1.2 A motivating example
Factory example
You run a factory that produces electronic devices
Currently, about 6% of the devices are faulty
You want to try a new manufacturing process to reduce this
How do you know if it is better? Should you switch or keep the old one?
Run an experiment: make n = 200 devices with the new process and summarise this by the number, Y , that are faulty
You decide that if Y ≤ 7 (i.e. Y/n ≤ 0.035, or 3.5%) then you will switch to the new process.
Is this a sensible procedure?
We can formulate this as a statistical hypothesis test
2 Classical hypothesis testing (Neyman-Pearson)
Research questions as hypotheses
• Research questions / studies are often framed in terms of hypotheses
• Run an experiment / collect data and then ask:
• Do the data support/contradict the hypothesis?
• Can we frame statistical inference around this paradigm?
• Classical hypothesis testing (due to Neyman & Pearson) aims to do this

2.1 Hypotheses
Describing hypotheses
• A hypothesis is a statement about the population distribution
• A parametric hypothesis is a statement about the parameters of the population distribution
• A null hypothesis is a hypothesis that specifies ‘no effect’ or ‘no change’, usually denoted H0
• An alternative hypothesis is a hypothesis that specifies the effect of interest, usually denoted H1
Null hypotheses
• Special importance is placed on the null hypothesis.
• When the aim of the study/experiment is to demonstrate an effect (as it often is), the ‘onus of proof’ is to show
there is sufficient evidence against the null hypothesis.
• I.e. we assume the null unless proven otherwise.
• Note: what is taken as the null hypothesis (i.e. the actual meaning of ‘no change’) will depend on the context of the study and where the onus of proof is deemed to lie.
For our factory example:
• We hypothesise that the new process will lead to fewer faulty devices
• Experiment gives: Y ∼ Bi(200, p), where p is the proportion of faulty devices
• Null hypothesis: H0 : p = 0.06
• Alternative hypothesis: H1 : p < 0.06

Types of parametric hypotheses
• A simple hypothesis, also called a sharp hypothesis, specifies only one value for the parameter(s)
• A composite hypothesis specifies many possible values
• Null hypotheses are almost always simple
• Alternative hypotheses are typically composite
Specification of hypotheses
• Usually, the null hypothesis is on the boundary of the alternative hypothesis (here, p = 0.06 versus p < 0.06)
• It is the ‘least favourable’ element for the alternative hypothesis: it is harder to differentiate between p = 0.06 and p = 0.05 (close to the boundary) than it is between p = 0.06 and p = 0.001 (far away from the boundary).
• For single parameters, the null is typically of the form θ = θ0 and the alternative is either one-sided, taking the form θ < θ0 or θ > θ0, or two-sided, written as θ ≠ θ0.
2.2 Tests & statistics
Describing tests
• A statistical test (or hypothesis test or statistical hypothesis test, or simply a test) is a decision rule for deciding between H0 and H1.
• A test statistic, T , is a statistic on which the test is based
• The decision rule usually takes the form:
reject H0 if T ∈ A
• The set A is called the critical region, or sometimes the rejection region.1 If it is an interval, the boundary value
is called the critical value.
1Extra notes (not discussed in the lecture, for your reference only): Some authors are specific with their terminology, referring to A only as the ‘rejection region’ and reserving the term ‘critical region’ to refer to the set of values of the data (rather than the set of values of the statistic) that give rise to a rejection, i.e. {x : T(x) ∈ A}. This is not very common. Other authors may use the same term(s) to refer to both of these sets.
• For our example, the test statistic is Y , the decision rule is to reject H0 if Y ≤ 7, the critical region is (−∞, 7] (equivalently, {0, 1, . . . , 7}) and the critical value is 7.
Describing test outcomes
Only two possible outcomes:
1. Reject H0
2. Fail to reject H0
We never say that we ‘accept H0’. Rather, we conclude that there is not enough evidence to reject it.
Often we don’t actually believe the null hypothesis. Rather, it serves as the default position of a skeptical judge,
whom we must convince otherwise.
Similar to a court case: innocent until proven guilty (H0 until proven not H0)
2.3 Errors (Type I, Type II)
Type I error
• What could go wrong with our decision rule for the factory example?
• The new process might produce the same number of faulty devices on average, but by chance we observe at most
7 failures
• Then we would switch to the new process despite not getting any benefit
• We have rejected H0 when H0 is actually true; this is called a Type I error
• This could be quite costly—changing a production line without reducing faults would be expensive
• (Controlling the probability of a Type I error will help to mitigate against this; see later. . . )
Type II error
• Could anything else go wrong if the Type I error is managed?
• The new process might reduce faults, but by chance we observe more than 7 failures
• Then we would give up on the new process, forgoing its benefits
• We have failed to reject H0 when H0 is false; this is called a Type II error
• In this case, the error would be less costly in the short term but might be much more costly long-term
• (So, whilst Type I error is often the one that is specifically controlled, Type II error remains important)
Summary of outcomes
                    H0 is true      H0 is false
Do not reject H0    Correct!        Type II error
Reject H0           Type I error    Correct!
2.4 Significance level & power
Significance level
α = Pr(Type I error) = Pr(reject H0 | H0 true)
• This is called the significance level, or sometimes the size, of the test.
• In our example, under H0 we have p = 0.06 and therefore Y ∼ Bi(200, 0.06), giving:
α = Pr(Y ≤ 7 | p = 0.06) = 0.0829
• Calculate in R using: pbinom(7, 200, 0.06)

Probability of type II error
β = Pr(Type II error) = Pr(do not reject H0 | H0 false)
. . . but need to actually condition on a simple hypothesis (an actual value of p) in order for β to be well-defined.
In our example, suppose the new process actually works better and produces only 3% faulty devices on average. Then we have Y ∼ Bi(200,0.03), giving β = Pr(Y > 7 | p = 0.03) = 0.254.
We have halved the rate of faulty devices but still have a 25% chance of not adopting the new process!
More commonly, we would report the power of the test, which is defined as: 1 − β = Pr(reject H0 | H0 false)
Typically, we would present this as a function of the true parameter value, e.g. K(θ).
For our example, we have shown that K(0.03) = 1 − 0.254 = 0.746
[Figure: power function K(p) for the factory example, plotted for p between 0 and 0.06]
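The α, β and power values above can be checked directly in R. Here is a small sketch (not from the original slides) using the same decision rule, reject H0 if Y ≤ 7:

# Significance level: probability of rejecting (Y <= 7) when p = 0.06
alpha <- pbinom(7, 200, 0.06)       # approximately 0.083

# Type II error probability when the true proportion is p = 0.03
beta <- 1 - pbinom(7, 200, 0.03)    # approximately 0.254

# Power function K(p) = Pr(Y <= 7 | p), as shown in the figure
K <- function(p) pbinom(7, 200, p)
K(0.03)                             # approximately 0.746
curve(K, from = 0, to = 0.06, xlab = "p", ylab = "Power")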
Remarks about power
• Power is a function, not a single value: need to assume a value of p in order to calculate it
• This point is often forgotten because people talk about ‘the’ power of a study
• As might be expected, the test is good at detecting values of p that are close to zero but not so good when p is close to p0 = 0.06.
• K(p0) = α, the type I error rate
Controlling errors
• Typically, we construct a test so that it has a specified significance level, α, and then maximise power while respecting that constraint
• In other words, we set the probability of a type I error to be some value (we ‘control’ it) and then try to minimise the probability of a type II error.
• A widespread convention is to set α = 0.05
• I.e. we will incorrectly reject the null hypothesis about 1 time in 20
• Since K(p0) = α, how can we increase power while α is fixed?
• Can do this by:
– Choosing good/optimal test statistics (see later. . . )
– Increasing the sample size (see the sketch below)
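To illustrate the sample-size point, here is a rough R sketch (using assumed values for the factory example, p0 = 0.06 and α = 0.05, not from the original slides) showing that, for a fixed significance level, the power at p = 0.03 grows with n:

# For a given n, find the largest cutoff c with Pr(Y <= c | p0) <= alpha,
# then report the power Pr(Y <= c | p) at a chosen alternative p.
power_at <- function(n, p, p0 = 0.06, alpha = 0.05) {
  c <- qbinom(alpha, n, p0) - 1   # largest c with pbinom(c, n, p0) <= alpha
  pbinom(c, n, p)
}
power_at(200, 0.03)   # roughly 0.6
power_at(400, 0.03)   # noticeably larger, roughly 0.85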
2.5 Alternative formulations
Different ways to present a test
• There are other ways to present the result of a test
• These are all mathematically equivalent
• However, some are more popular than others, because they provide, or seem to provide, more information
Alternative formulation 1: based on a CI
• Instead of comparing a test statistic against a critical region. . .
• Calculate a 100 · (1 − α)% confidence interval for the parameter of interest
• Reject H0 if p0 is not in the interval
• This gives a test with significance level α
• If the CI is constructed from a statistic T , this test is equivalent to using T as a test statistic.
• The convention of using 95% CIs is related to the convention of setting α = 0.05
Alternative formulation 2: based on a p-value
• Instead of comparing a test statistic against a critical region. . .
• Calculate a p-value for the data
• The p-value is the probability of observing data (in a hypothetical repetition of the experiment) that is as or more extreme than what was actually observed, under the assumption that H0 is true.
• It is typically a tail probability of the test statistic, taking the tail(s) that are more likely under H1 as compared to H0. (So, the exact details of this will vary between scenarios.)
• Reject H0 if the p-value is less than the significance level
• Note: p-values are, strictly speaking, not part of classical hypothesis testing, but have been adopted as part of
modern practice (more info later)
• P-values are like a ‘short cut’ to avoid calculating a critical value.
• If the test statistic is T and the decision rule is to reject H0 if T < c, then the p-value is calculated as
p = Pr(T < tobs | H0), where tobs is the observed value of T.
• In this case, values of T that are smaller are ‘more extreme’, in the sense of being more compatible with H1 rather than H0.
• If tobs = c, the p-value is the same as the significance level, α.
• If tobs < c, the p-value is less than α.
• By calculating the p-value, we avoid calculating c, but the decision procedure is mathematically equivalent.
• Many different ways that people refer to p-values: P, p, P-value, p-value, P value, p value
P-values for two-sided alternatives
• When we have a two-sided alternative hypothesis, typically the decision rule is of the form: reject H0 if |T | > c
• Then the p-value is p = Pr(|T| > |tobs|)
• This is a two-tailed probability
• The easy way to calculate this is to simply double the probability of one tail:
p = Pr(|T| > |tobs|) = 2 × Pr(T > |tobs|)
• For more general two-sided rejection regions, we also always double the relevant tail probability. This gives an implicit definition for what it means to be ‘more extreme’ when the two tails are not symmetric to each other. (See the examples of testing variances later on, for which the distribution of the test statistic is χ2)
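For a test statistic that is standard normal under H0, the doubling step is a one-liner in R (a generic sketch, not from the original slides; the value 1.67 is just for illustration):

z_obs <- 1.67                  # an illustrative observed value of a N(0, 1) test statistic
1 - pnorm(z_obs)               # one-sided p-value, Pr(Z > z_obs)
2 * (1 - pnorm(abs(z_obs)))    # two-sided p-value, Pr(|Z| > |z_obs|)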
We run our factory experiment. We obtain y = 6 faulty devices out of a total n = 200.
According to our original decision rule (Y ≤ 7), we reject H0 and decide to adopt the new process.
Let’s try it using a CI…
Recall that α = 0.083. Calculate a one-sided 91.7% confidence interval that gives an upper bound for p. The upper bound is 5.4%. This is less than p0 = 6%, so we reject H0.
Let’s try it using a p-value. . .
The p-value is a binomial probability, Pr(Y ≤ 6 | p = p0) = 0.04. This is less than α, so we reject H0.
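These calculations can be reproduced in R. A possible sketch (the slides do not state which CI method was used, so the exact Clopper-Pearson interval from binom.test is an assumption here, but its upper bound comes out close to the 5.4% quoted above):

# p-value: Pr(Y <= 6 | p = 0.06)
pbinom(6, 200, 0.06)                       # approximately 0.04

# One-sided 91.7% CI giving an upper bound for p
binom.test(6, 200, p = 0.06, alternative = "less",
           conf.level = 0.917)$conf.int    # upper bound around 0.054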
3 Significance testing (Fisher)
Significance testing
Pre-dating the classical theory of hypothesis testing was ‘significance testing’, developed by Fisher. The main differences to the classical theory are:
• Only use a null hypothesis, no reference to an alternative
• Use the p-value to assess the level of significance
• If the p-value is low, use as informal evidence that the null hypothesis is unlikely to be true.
• Otherwise, suspend judgement and collect more data.
• Use this procedure only if not much is yet known about the problem, to draw provisional conclusions only.
• This is not a decision procedure; do not talk about accepting or rejecting hypotheses.
Disputes & disagreements
• Bitter clashes between proponents!
• Fisher vs Neyman & Pearson
• In particular, Fisher thought the classical approach was ill-suited for scientific research
• Disputes never resolved (by the proponents)
4 Modern hypothesis testing
Modern practice
The two approaches have merged in current practice
It has led to an inconsistent/illogical hybrid
Largely use the terminology and formulation of the classical theory (Neyman & Pearson) but commonly report the results using a p-value and talk about ‘not rejecting’ rather than ‘accepting’ the null (both of which are ideas from Fisher)
This has given rise to many problems
Will come back to discuss these at the end. . .
5 Common scenarios
Common scenarios: overview
Proportions:
• Single proportion
• Two proportions

Normal distribution:
• Single mean
• Single variance
• Two means
• Two variances
5.1 Single proportion
Single proportion
• Observe n Bernoulli trials with unknown probability p
• Summarise by Y ∼ Bi(n, p)
• Test H0 : p = p0 versus H1 : p > p0, and take α = 0.05
• Reject H0 if the observed value of Y is too large. That is, if Y ≥ c for some c.
• Choosing c: need Pr(Y ≥ c | p = p0) = α
• For large n, when H0 is true,
Z = (Y − np0) / √(np0(1 − p0)) ≈ N(0, 1)
• This implies,
c = np0 + Φ−1(1 − α) √(np0(1 − p0))

Example (single proportion)
• We buy some dice and suspect they are not properly weighted, meaning that the probability, p, of rolling a six is higher than usual.
• Want to conduct the test H0 : p = 1/6 versus H1 : p > 1/6
• Roll the dice n = 8000 times and observe Y sixes.
• The critical value is
c = 8000/6 + 1.645 × √(8000 × (1/6) × (5/6)) = 1388.162
• We observe y = 1389 so we reject H0 at the 5% level of significance and conclude that the die comes up with 6 too often
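A quick R check of this critical value and decision (a sketch, not from the original slides):

n <- 8000; p0 <- 1/6; alpha <- 0.05
c <- n * p0 + qnorm(1 - alpha) * sqrt(n * p0 * (1 - p0))
c            # approximately 1388.2
y <- 1389
y >= c       # TRUE, so reject H0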
Single proportion, cont’d
• It is more common to use standardised test statistics
• Here, report Z instead of Y and compare to Φ−1(1 − α) instead of c
• Express Z as the standardised proportion of 6’s,
Z = (Y/n − p0) / √(p0(1 − p0)/n) ≈ N(0, 1)
• Decision rule: reject H0 if Z > Φ−1(1 − α)
• In the previous example,
z = (1389/8000 − 1/6) / √((1/6)(5/6)/8000) = 1.67
and since z > Φ−1(0.95) = 1.645 we reject H0.
• Suppose we used a two-sided alternative, H1 : p ≠ 1/6
• This would mean we want to be able to detect deviations in either direction: whether the probability of rolling a six is lower or higher than usual.
• We still compute the same test statistic,
Z = (Y/n − p0) / √(p0(1 − p0)/n) ≈ N(0, 1)
• but the critical region has changed: we reject H0 at level α if |Z| > Φ−1(1 − α/2)
• In the previous example, we would use Φ−1(1 − α/2) = 1.96. Since z = 1.67, we would not reject H0.
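In R, the one-sided and two-sided decisions for this example look like the following (a small sketch, not from the original slides):

z <- (1389/8000 - 1/6) / sqrt((1/6) * (5/6) / 8000)
z                      # approximately 1.67
z > qnorm(0.95)        # TRUE: reject H0 for the one-sided alternative
abs(z) > qnorm(0.975)  # FALSE: do not reject H0 for the two-sided alternative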
Summary of tests for a single proportion
H0        H1         Reject H0 at significance level α if
p = p0    p > p0     Z > Φ−1(1 − α)
p = p0    p < p0     Z < −Φ−1(1 − α)
p = p0    p ≠ p0     |Z| > Φ−1(1 − α/2)

where Z = (Y/n − p0) / √(p0(1 − p0)/n).
Example 2 (single proportion)
• A woman claims she can tell whether the tea or milk was added first to a cup of tea
• Given 40 cups of tea and for each cup the order was determined by tossing a coin
• The woman gave the correct answer 29 times out of 40
• Is this evidence (at the 5% level of significance) that her claim is valid?
• Let p be the probability the woman gets the correct order for a single cup of tea
H0 : p = 0.5 versus H1 : p > 0.5
• Since we need evidence against the hypothesis that she is simply guessing, the one-sided alternative is appropriate here.
• Data: y/n = 29/40 = 0.725, giving
z = (0.725 − 0.5) / √(0.5 × 0.5/40) = 2.84
• Critical value Φ−1(0.95) = 1.645, therefore reject H0 and conclude that the data supports the woman’s claim
• Alternatively, we could do this via a p-value:
p-value = Pr(Z > 2.84) = Φ(−2.84) = 0.00226
• Since 0.00226 < 0.05, we reject H0.
R code examples
> p1 = prop.test(29, 40, p = 0.5,
+     alternative = "greater", correct = FALSE)
> p1

1-sample proportions test without continuity correction

data:  29 out of 40, null probability 0.5
X-squared = 8.1, df = 1, p-value = 0.002213
alternative hypothesis: true p is greater than 0.5
95 percent confidence interval:
 0.597457 1.000000
sample estimates:
    p
0.725

> sqrt(p1$statistic)
X-squared
  2.84605
> 1 - pnorm(2.846)
[1] 0.002213610
There is also an exact test based on the binomial probabilities:
> binom.test(29, 40, p = 0.5, alternative = "greater")

Exact binomial test

data:  29 and 40
number of successes = 29, number of trials = 40,
p-value = 0.003213
alternative hypothesis: true probability of success is greater than 0.5
95 percent confidence interval:
 0.5861226 1.0000000
sample estimates:
probability of success
                 0.725
[Figures: pmf of Y ∼ Bi(40, 0.5) with the observed value y = 29 marked, and the corresponding standard normal distribution for Z]
5.2 Two proportions
Two proportions
• Comparing two proportions: p1 and p2 are the probabilities of success in two different populations.
• Wish to test:
H0 : p1 = p2 versus H1 : p1 > p2
based on independent samples (from the two populations) of size n1 and n2 with Y1 and Y2 successes.
• In general,
Z = (Y1/n1 − Y2/n2 − (p1 − p2)) / √(p1(1 − p1)/n1 + p2(1 − p2)/n2) ≈ N(0, 1)
• Under H0 we can assume that p1 = p2 = p, so
Z = (Y1/n1 − Y2/n2) / √(p(1 − p)(1/n1 + 1/n2)) ≈ N(0, 1)
• Let p̂1 = y1/n1, p̂2 = y2/n2, p̂ = (y1 + y2)/(n1 + n2).
• Reject H0 at level α if
z = (p̂1 − p̂2) / √(p̂(1 − p̂)(1/n1 + 1/n2)) > Φ−1(1 − α)

Example (two proportions)
We run a trial of two insecticides. The standard one kills 425 out of 500 mosquitoes, while the experimental one kills 459 out of 500. Is the experimental insecticide more effective?
Let p1 and p2 be the proportions of all mosquitoes killed by the experimental and standard sprays, respectively.
H0 : p1 = p2 versus H1 : p1 > p2
> x <- c(459, 425)
> n <- c(500, 500)
> p.hat <- (x[1] + x[2]) / (n[1] + n[2])
> p1 <- x[1] / n[1]
> p2 <- x[2] / n[2]
> z <- (p1 - p2) / sqrt(p.hat * (1 - p.hat) *
+ (1 / n[1] + 1 / n[2]))
> pvalue <- 1 - pnorm(z)
> print(c(p1, p2, z, pvalue), digits = 3)
[1] 0.918000 0.850000 3.357560 0.000393
Alternatively, we can use the R function prop.test(), which calculates the statistic X² = Z² and compares it against a χ² distribution with 1 degree of freedom (see the example below).
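A possible call for this example (a sketch, not from the original slides; with a one-sided alternative and no continuity correction it should match the z-based calculation above, via the X² = z² correspondence):

# Two-sample test for equality of proportions (score / chi-squared form)
prop.test(c(459, 425), c(500, 500), alternative = "greater", correct = FALSE)
# Expect X-squared = z^2 (about 11.27) and the same p-value, about 0.00039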