
Hypothesis testing
(Module 6)
Statistics (MAST20005) & Elements of Statistics (MAST90058)
School of Mathematics and Statistics University of Melbourne


Semester 2, 2022

Aims of this module
• Introduce the concepts behind statistical hypothesis testing
• Explain the connections between estimation and testing
• Work through a number of common testing scenarios
• Emphasise the shortcomings of hypothesis testing

A cautionary word
A motivating example
Classical hypothesis testing (Neyman-Pearson)
  Hypotheses
  Tests & statistics
  Errors (Type I, Type II)
  Significance level & power
  Alternative formulations
Significance testing (Fisher)
Modern hypothesis testing
Common scenarios
  Single proportion
  Two proportions
  Single mean
  Single variance

What we are about to do…
• Over the next three weeks we will learn about hypothesis testing
• This is an approach to inference that dominates much of statistical practice. . .
• . . . by non-statisticians.
• It probably shouldn’t!

• The approaches described here are largely considered NOT best practice by professional statisticians
• More appropriate procedures usually exist
• . . . and we have already learnt some of them!
• But we need to learn these anyway because:
◦ Hypothesis testing is ubiquitous
◦ Need to understand its weaknesses
◦ Sometimes it’s useful, or at least convenient

Factory example
• You run a factory that produces electronic devices
• Currently, about 6% of the devices are faulty
• You want to try a new manufacturing process to reduce this
• How do you know if it is better? Should you switch or keep the old one?
• Run an experiment: make n = 200 devices with the new process and summarise this by the number, Y , that are faulty
• You decide that if Y ≤ 7 (i.e. Y/n ≤ 0.035, or 3.5%) then you will switch to the new process
• Is this a sensible procedure?
• We can formulate this as a statistical hypothesis test


Research questions as hypotheses
• Research questions / studies are often framed in terms of hypotheses
• Run an experiment / collect data and then ask:
• Do the data support/contradict the hypothesis?
• Can we frame statistical inference around this paradigm?
• Classical hypothesis testing (due to Neyman & Pearson) aims to do this

Describing hypotheses
• A hypothesis is a statement about the population distribution
• A parametric hypothesis is a statement about the parameters of
the population distribution
• A null hypothesis is a hypothesis that specifies ‘no effect’ or ‘no
change’, usually denoted H0
• An alternative hypothesis is a hypothesis that specifies the effect of
interest, usually denoted H1

Null hypotheses
• Special importance is placed on the null hypothesis.
• When the aim of the study/experiment is to demonstrate an effect (as it often is), the ‘onus of proof’ is to show there is sufficient evidence against the null hypothesis.
• I.e. we assume the null unless proven otherwise.
• Note: what is taken as the null hypothesis (i.e. the actual meaning of ‘no change’) will depend on the context of the study and where the onus of proof is deemed to lie.

For our factory example:
• We hypothesise that the new process will lead to fewer faulty devices
• Experiment gives: Y ∼ Bi(200, p), where p is the proportion of faulty devices
• Null hypothesis: H0 : p = 0.06
• Alternative hypothesis: H1 : p < 0.06

Types of parametric hypotheses
• A simple hypothesis, also called a sharp hypothesis, specifies only one value for the parameter(s)
• A composite hypothesis specifies many possible values
• Null hypotheses are almost always simple
• Alternative hypotheses are typically composite

Specification of hypotheses
• Usually, the null hypothesis is on the boundary of the alternative hypothesis (here, p = 0.06 versus p < 0.06)
• It is the ‘least favourable’ element for the alternative hypothesis: it is harder to differentiate between p = 0.06 and p = 0.05 (close to the boundary) than between p = 0.06 and p = 0.001 (far from the boundary).
• For single parameters, the null is typically of the form θ = θ0 and the alternative is either one-sided, taking the form θ < θ0 or θ > θ0, or two-sided, written as θ ≠ θ0.

Describing tests
• A statistical test (or hypothesis test or statistical hypothesis test, or simply a test) is a decision rule for deciding between H0 and H1.
• A test statistic, T , is a statistic on which the test is based
• The decision rule usually takes the form:
reject H0 if T ∈ A
• The set A is called the critical region, or sometimes the rejection region. If it is an interval, the boundary value is called the critical value.
• For our example, the test statistic is Y, the decision rule is to reject H0 if Y ≤ 7, the critical region is (−∞, 7] and the critical value is 7.

Describing test outcomes
Only two possible outcomes:
1. Reject H0
2. Fail to reject H0
We never say that we ‘accept H0’. Rather, we conclude that there is not enough evidence to reject it.
Often we don’t actually believe the null hypothesis. Rather, it serves as the default position of a skeptical judge, whom we must convince otherwise.
Similar to a court case: innocent until proven guilty (H0 until proven not H0)

Type I error
• What could go wrong with our decision rule for the factory example?
• The new process might produce the same number of faulty devices on average, but by chance we observe at most 7 failures
• Then we would switch to the new process despite not getting any benefit
• We have rejected H0 when H0 is actually true; this is called a Type I error
• This could be quite costly—changing a production line without reducing faults would be expensive
• (Controlling the probability of a Type I error will help to mitigate against this; see later. . . )

Type II error
• Could anything else go wrong if Type I error is managed?
• The new process might reduce faults, but by chance we observe
more than 7 failures
• Then we would give up on the new process, forgoing its benefits
• We have failed to reject H0 when H0 is false; this is called a Type II error
• In this case, the error would be less costly in the short term but might be much more costly long-term
• (So, whilst Type I error is often the one that is specifically controlled, Type II error remains important)

Summary of outcomes
                    H0 is true      H0 is false
Do not reject H0    Correct!        Type II error
Reject H0           Type I error    Correct!

Significance level
α = Pr(Type I error) = Pr(reject H0 | H0 true)
• This is called the significance level, or sometimes the size, of the test
• In our example, under H0 we have p = 0.06 and therefore
Y ∼ Bi(200, 0.06), giving:
α = Pr(Y ≤ 7 | p = 0.06) = 0.0829
• Calculate in R using: pbinom(7, 200, 0.06)

Probability of type II error
β = Pr(Type II error) = Pr(do not reject H0 | H0 false)
. . . but need to actually condition on a simple hypothesis (an actual
value of p) in order for β to be well-defined.
In our example, suppose the new process actually works better and produces only 3% faulty devices on average. Then we have
Y ∼ Bi(200, 0.03), giving β = Pr(Y > 7 | p = 0.03) = 0.254.
We have halved the rate of faulty devices but still have a 25% chance of not adopting the new process!
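As a check, here is a minimal R sketch of these calculations (the code is illustrative, not from the original slides, but uses the same values):

# Significance level: Pr(Y <= 7) under H0, i.e. p = 0.06
alpha <- pbinom(7, size = 200, prob = 0.06)     # approx 0.083
# Type II error probability if the true fault rate is p = 0.03
beta <- 1 - pbinom(7, size = 200, prob = 0.03)  # approx 0.254
# Power at p = 0.03
1 - beta                                        # approx 0.746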

More commonly, we would report the power of the test, which is defined as:
1 − β = Pr(reject H0 | H0 false)
Typically, we would present this as a function of the true parameter
value, e.g. K(θ)
For our example, we have shown that K(0.03) = 1 − 0.254 = 0.746

[Figure: power function K(p) plotted against p for 0 ≤ p ≤ 0.06, with the vertical axis running from 0 to 1.]
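A short R sketch that reproduces a power curve of this kind, assuming the figure plotted K(p) = Pr(Y ≤ 7) for Y ∼ Bi(200, p):

p <- seq(0, 0.06, by = 0.0005)
K <- pbinom(7, size = 200, prob = p)        # power function K(p)
plot(p, K, type = "l", xlab = "p", ylab = "K(p)", ylim = c(0, 1))
abline(h = pbinom(7, 200, 0.06), lty = 2)   # K(p0) = alpha, approx 0.083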

Remarks about power
• Power is a function, not a single value: need to assume a value of p in order to calculate it
• This point is often forgotten because people talk about ‘the’ power of a study
• As might be expected, the test is good at detecting values of p that are close to zero but not so good when p is close to p0 = 0.06.
• K(p0) = α, the type I error rate

Controlling errors
• Typically, we construct a test so that it has a specified significance level, α, and then maximise power while respecting that constraint
• In other words, we set the probability of a type I error to be some value (we ‘control’ it) and then try to minimise the probability of a type II error.
• A widespread convention is to set α = 0.05
• I.e. we will incorrectly reject the null hypothesis about 1 time in 20
• Since K(p0) = α, how can we increase power while α is fixed?
• Can do this by:
◦ Choosing good/optimal test statistics (see later. . . )
◦ Increasing the sample size

Different ways to present a test
• There are other ways to present the result of a test
• These are all mathematically equivalent
• However, some are more popular than others, because they provide, or seem to provide, more information

Alternative formulation 1: based on a CI
• Instead of comparing a test statistic against a critical region. . .
• Calculate a 100 · (1 − α)% confidence interval for the parameter of interest
• Reject H0 if p0 is not in the interval
• This gives a test with significance level α
• If the CI is constructed from a statistic T , this test is equivalent to using T as a test statistic.
• The convention of using 95% CIs is related to the convention of setting α = 0.05

Alternative formulation 2: based on a p-value
• Instead of comparing a test statistic against a critical region. . .
• Calculate a p-value for the data
• The p-value is the probability of observing data (in a hypothetical repetition of the experiment) that is as or more extreme than what was actually observed, under the assumption that H0 is true.
• It is typically a tail probability of the test statistic, taking the tail(s) that are more likely under H1 as compared to H0. (So, the exact details of this will vary between scenarios.)
• Reject H0 if the p-value is less than the significance level
• Note: p-values are, strictly speaking, not part of classical hypothesis testing, but have been adopted as part of modern practice (more info later)

• P-values are like a ‘short cut’ to avoid calculating a critical value.
• If the test statistic is T and the decision rule is to reject H0 if T < c, then the p-value is calculated as p = Pr(T < tobs).
• In this case, values of T that are smaller are ‘more extreme’, in the sense of being more compatible with H1 than with H0.
• If tobs = c, the p-value is the same as the significance level, α.
• If tobs < c, the p-value is less than α.
• By calculating the p-value, we avoid calculating c, but the decision procedure is mathematically equivalent.
• Many different ways that people refer to p-values: P, p, P-value, p-value, P value, p value (with varying capitalisation and italics)

P-values for two-sided alternatives
• When we have a two-sided alternative hypothesis, typically the decision rule is of the form: reject H0 if |T| > c
• Then the p-value is p = Pr(|T| > |tobs|)
• This is a two-tailed probability
• The easy way to calculate this is to simply double the probability of one tail:
p = Pr(|T| > |tobs|) = 2 × Pr(T > |tobs|)
• For more general two-sided rejection regions, we also always double the relevant tail probability. This gives an implicit definition for what it means to be ‘more extreme’ when the two tails are not symmetric to each other. (See the examples of testing variances later on, for which the distribution of the test statistic is χ².)
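For a normal test statistic, this doubling is a one-liner in R (z_obs below is just an illustrative value, not from the slides):

z_obs <- 1.67                  # hypothetical observed test statistic
2 * (1 - pnorm(abs(z_obs)))    # two-sided p-value, Pr(|Z| > |z_obs|)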

• We run our factory experiment. We obtain y = 6 faulty devices out of a total n = 200.
• According to our original decision rule (Y ≤ 7), we reject H0 and decide to adopt the new process.
• Let’s try it using a CI…
• Recall that α = 0.083. Calculate a one-sided 91.7% confidence interval that gives an upper bound for p. The upper bound is 5.4%, which is less than p0 = 6%, so we reject H0.
• Let’s try it using a p-value. . .
• The p-value is a binomial probability, Pr(Y ≤ 6 | p = p0) = 0.04. This is less than α, so we reject H0.
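Both calculations can be checked in R. The confidence bound shown here is a sketch assuming an exact (Clopper-Pearson) one-sided interval, since the slides do not say which interval construction was used:

# p-value: Pr(Y <= 6 | p = 0.06)
pbinom(6, size = 200, prob = 0.06)              # approx 0.04, less than alpha = 0.083
# One-sided upper confidence bound for p at level 1 - alpha = 0.917
binom.test(6, 200, alternative = "less",
           conf.level = 1 - 0.083)$conf.int[2]  # approx 0.054, below p0 = 0.06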


Significance testing
Pre-dating the classical theory of hypothesis testing was ‘significance testing’, developed by Fisher.
The main differences from the classical theory are:
• Only use a null hypothesis, no reference to an alternative
• Use the p-value to assess the level of significance
• If the p-value is low, use it as informal evidence that the null hypothesis is unlikely to be true.
• Otherwise, suspend judgement and collect more data.
• Use this procedure only if not much is yet known about the problem, to draw provisional conclusions only.
• This is not a decision procedure; do not talk about accepting or rejecting hypotheses.

Disputes & disagreements
• Bitter clashes between proponents!
• Fisher vs Neyman & Pearson
• In particular, Fisher thought the classical approach was ill-suited for scientific research
• Disputes never resolved (by the proponents)


Modern practice
• The two approaches have merged in current practice
• It has led to an inconsistent/illogical hybrid
• Largely use the terminology and formulation of the classical theory (Neyman & Pearson) but commonly report the results using a p-value and talk about ‘not rejecting’ rather than ‘accepting’ the null (both of which are ideas from Fisher)
• This has given rise to many problems
• Will come back to discuss these at the end. . .


Common scenarios: overview
Proportions:
• Single proportion
• Two proportions
Normal distribution:
• Single mean
• Single variance
• Two means
• Two variances

Single proportion
• Observe n Bernoulli trials with unknown probability p
• Summarise by Y ∼ Bi(n, p)
• Test H0 : p = p0 versus H1 : p > p0, and take α = 0.05
• Reject H0 if observed value of Y is too large.
That is, if Y ≥ c for some c.
• Choosing c: need Pr(Y ≥ c | p = p0) = α
• For large n, when H0 is true,
Z = (Y − np0) / √(np0(1 − p0)) ≈ N(0, 1)
• This implies,
c = np0 + Φ−1(1 − α) √(np0(1 − p0))

Example (single proportion)
• We buy some dice and suspect they are not properly weighted, meaning that the probability, p, of rolling a six is higher than usual.
• Want to conduct the test H0 : p = 1/6 versus H1 : p > 1/6
• Roll the dice n = 8000 times and observe Y sixes.
• The critical value is
c = 8000/6 + 1.645 × √(8000 × (1/6) × (5/6)) = 1388.162
• We observe y = 1389 so we reject H0 at the 5% level of significance and conclude that the die comes up with 6 too often
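As a quick R check of this critical value (a sketch using the numbers above; variable names are illustrative):

n <- 8000; p0 <- 1/6; alpha <- 0.05
c_val <- n * p0 + qnorm(1 - alpha) * sqrt(n * p0 * (1 - p0))  # approx 1388.2
y <- 1389
y >= c_val   # TRUE, so reject H0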

Single proportion, cont’d
• It is more common to use standardised test statistics
• Here, report Z instead of Y and compare to Φ−1(1 − α) instead of c
• Express Z as the standardised proportion of 6’s,
Z = (Y/n − p0) / √(p0(1 − p0)/n) ≈ N(0, 1)
• Decision rule: reject H0 if Z > Φ−1(1 − α)
• In the previous example,
z = (1389/8000 − 1/6) / √((1/6)(5/6)/8000) = 1.67
and since z > Φ−1(0.95) = 1.645 we reject H0.

• Suppose we used a two-sided alternative, H1 : p ̸= 1/6
• This would mean we want to be able to detect deviations in either direction: whether the probability of rolling a six is lower or higher than usual.
• We still compute the same test statistic,
Z = (Y/n − p0) / √(p0(1 − p0)/n) ≈ N(0, 1)
• but the critical region has changed: we reject H0 at level α if |Z| > Φ−1(1 − α/2)
• In the previous example, we would use Φ−1(1 − α/2) = 1.96. Since z = 1.67, we would not reject H0.
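A short R sketch of the standardised statistic and both decision rules for the dice data (illustrative code, not from the slides):

z <- (1389/8000 - 1/6) / sqrt((1/6) * (5/6) / 8000)   # approx 1.67
z > qnorm(0.95)          # one-sided test: TRUE, reject H0
abs(z) > qnorm(0.975)    # two-sided test: FALSE, do not reject H0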

Summary of tests for a single proportion
H0        H1         Test statistic                           Reject H0 if
p = p0    p > p0     z = (y/n − p0)/√(p0(1 − p0)/n)           z > Φ−1(1 − α)
p = p0    p < p0     z = (y/n − p0)/√(p0(1 − p0)/n)           z < Φ−1(α)
p = p0    p ≠ p0     |z| = |y/n − p0|/√(p0(1 − p0)/n)         |z| > Φ−1(1 − α/2)

Example 2 (single proportion)
• A woman claims she can tell whether the tea or milk was added first to a cup of tea
• Given 40 cups of tea and for each cup the order was determined by tossing a coin
• The woman gave the correct answer 29 times out of 40
• Is this evidence (at the 5% level of significance) that her claim is true?

• Let p be the probability the woman gets the correct order for a single cup of tea
H0 : p = 0.5 versus H1 : p > 0.5
• We need evidence against the hypothesis that she is simply guessing, so the one-sided alternative is appropriate here.
• Data: y/n = 29/40 = 0.725
z = (0.725 − 0.5) / √(0.5 × 0.5/40) = 2.84
• Critical value Φ−1(0.95) = 1.645, therefore reject H0 and conclude that the data supports the woman’s claim
• Alternatively, we could do this via a p-value:
p-value = Pr(Z > 2.84) = Φ(−2.84) = 0.00226
• Since 0.00226 < 0.05, we reject H0.

R code examples
> p1 = prop.test(29, 40, p = 0.5,
+     alternative = "greater", correct = FALSE)
1-sample proportions test without continuity correction
data: 29 out of 40, null probability 0.5
X-squared = 8.1, df = 1, p-value = 0.002213
alternative hypothesis: true p is greater than 0.5
95 percent confidence interval:
0.597457 1.000000
sample estimates:
    p
0.725

> sqrt(p1$statistic)
X-squared
  2.84605
> 1 - pnorm(2.846)
[1] 0.002213610

[Figure: density of Z, distributed standard normal, with the observed value in the upper tail.]

There is also an exact test based on the binomial probabilities:
> binom.test(29, 40, p = 0.5, alternative = "greater")
Exact binomial test

data: 29 and 40
number of successes = 29, number of trials = 40,
p-value = 0.003213
alternative hypothesis: true probability of success
is greater than 0.5
95 percent confidence interval:
 0.5861226 1.0000000
sample estimates:
probability of success
                 0.725

[Figure: pmf of Y distributed Binomial with n = 40, p = 0.5, with the observed value y = 29 marked in the upper tail.]
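A sketch that reproduces a figure of this kind (pmf of Bi(40, 0.5) with the observed count highlighted; illustrative code, not from the slides):

y <- 0:40
plot(y, dbinom(y, size = 40, prob = 0.5), type = "h",
     xlab = "y", ylab = "pmf")
points(29, dbinom(29, 40, 0.5), pch = 19)   # observed value y = 29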

Two proportions
• Comparing two proportions: p1 and p2 are the probabilities of success in two different populations.
• Wish to test:
H0 : p1 = p2 versus H1 : p1 > p2
based on independent samples (from the two populations) of size n1 and n2 with Y1 and Y2 successes.
• Know that
Z = (Y1/n1 − Y2/n2 − (p1 − p2)) / √(p1(1 − p1)/n1 + p2(1 − p2)/n2) ≈ N(0, 1)

• Under H0 we can assume that p1 = p2 = p, so
Z = (Y1/n1 − Y2/n2) / √(p(1 − p)(1/n1 + 1/n2)) ≈ N(0, 1)
• Let p̂1 = y1/n1, p̂2 = y2/n2, p̂ = (y1 + y2)/(n1 + n2).
• Reject H0 at level α if
z = (p̂1 − p̂2) / √(p̂(1 − p̂)(1/n1 + 1/n2)) > Φ−1(1 − α)
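A minimal R sketch of this pooled two-proportion test (the helper function name and arguments are illustrative, not from the slides; prop.test is the built-in equivalent):

two_prop_z <- function(y1, n1, y2, n2, alpha = 0.05) {
  phat1 <- y1 / n1
  phat2 <- y2 / n2
  phat  <- (y1 + y2) / (n1 + n2)   # pooled estimate of p under H0
  z <- (phat1 - phat2) / sqrt(phat * (1 - phat) * (1/n1 + 1/n2))
  c(z = z, reject_H0 = z > qnorm(1 - alpha))
}
# Built-in equivalent (reports the chi-squared statistic z^2):
# prop.test(c(y1, y2), c(n1, n2), alternative = "greater", correct = FALSE)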

Example (two proportions)
We run a trial of two insecticides. The standard
