DSME5110F: Statistical Analysis
Confidence Interval Estimation and Introduction to Hypothesis Test
Outline
• Confidence Interval Estimation
– Confidence Interval for $\mu$
• When $\sigma$ is known
• When $\sigma$ is unknown
• Sample Size Determination
– Confidence Interval for Proportion $p$
• Sample Size Determination
• Introduction to Hypothesis Test
• Three Forms of Hypothesis Statements
• $p$-value
• Two types of errors (significance level)
Interval Estimation
• Point estimation is good in the sense that it provides a very sharp estimation for the unknown population parameter.
• However, with only a single point, it is unlikely that the point estimator will actually hit the target. You may get very close, but it’s hard to be exactly right.
• Statisticians often use an interval, rather than a single point, to make estimation.
Confidence Interval for $\mu$ when $\sigma$ is known
• By the Central Limit Theorem, the distribution of the sample mean will be approximately normal with $\mu_{\bar{x}} = \mu$ and $\sigma_{\bar{x}} = \sigma/\sqrt{n}$ when the sample size $n$ is sufficiently large.
• Let $Z$ be a standard normal variable with mean equal to 0 and standard deviation equal to 1, and let $z_{1-\alpha/2}$ be the $(1-\alpha/2)100$th percentile of $Z$.
• Now, if we construct an interval that extends $z_{1-\alpha/2}$ standard errors to the left and right of the sample mean, i.e.,
$$\bar{x} \pm z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}},$$
then among all the intervals so constructed, $(1-\alpha)100\%$ of them will contain the true value of the population mean $\mu$.
• See the file CI Simulator.xlsm.
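The coverage claim above can be checked by simulation. The following is a minimal sketch, not the contents of CI Simulator.xlsm; the values of mu, sigma, n, and reps are illustrative assumptions.

```r
# Draw many samples from a normal population with known sigma and count
# how often the z-interval contains the true mean.
# mu, sigma, n and reps are illustrative assumptions, not course data.
set.seed(1)
mu <- 50
sigma <- 10
n <- 30
alpha <- 0.05
reps <- 10000

z <- qnorm(1 - alpha / 2)             # critical value z_{1 - alpha/2}
half_width <- z * sigma / sqrt(n)     # margin of error (same for every sample)

covered <- replicate(reps, {
  x_bar <- mean(rnorm(n, mean = mu, sd = sigma))
  (x_bar - half_width <= mu) && (mu <= x_bar + half_width)
})

coverage <- mean(covered)             # proportion of intervals containing mu
print(coverage)                       # expected to be close to 1 - alpha
```

With a different seed the coverage changes slightly from run to run, but it should stay near the nominal 95%.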
Some Key Definitions
• Confidence interval – a range of values that is likely to contain the population parameter being estimated
• Margin of error – a value added to and subtracted from a point estimate for the purpose of developing an interval estimate of a population parameter
• Confidence level – every interval estimate has an associated confidence level
– 90% of the time, a 90% interval estimate is expected to contain the population parameter
– 95% of the time, a 95% interval estimate is expected to contain the population parameter
Find the Critical $z_{1-\alpha/2}$ Value
• The critical value $z_{1-\alpha/2}$ can be found by using the R function: qnorm(1 - alpha/2).
• The three commonly used confidence levels are 90%, 95%, and 99%.
– 90% confidence interval:
• $\bar{x} \pm 1.645\,\sigma/\sqrt{n}$.
• $z_{0.95} = 1.645$ can be found by the formula: qnorm(0.95)
– 95% confidence interval:
• $\bar{x} \pm 1.96\,\sigma/\sqrt{n}$.
• $z_{0.975} = 1.96$ can be found by the formula: qnorm(0.975)
– 99% confidence interval:
• $\bar{x} \pm 2.576\,\sigma/\sqrt{n}$.
• $z_{0.995} = 2.576$ can be found by the formula: qnorm(0.995).
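The three critical values can be obtained in one call. This is a small sketch; the variable names conf_levels and z_crit are chosen here for illustration.

```r
# Critical z values for the three common confidence levels.
conf_levels <- c(0.90, 0.95, 0.99)
alpha <- 1 - conf_levels
z_crit <- qnorm(1 - alpha / 2)   # z_{1 - alpha/2} for each level
print(round(z_crit, 3))
```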
Example 7.1
• The height of female adults follows an approximately normal distribution with an unknown mean but a known standard deviation of 7.5 cm.
• To estimate the unknown population mean, a random sample of 100 women was collected, and the sample mean was found to be 160.5 cm.
• Construct a 95% confidence interval to estimate the true average height of female adults.
• The 95% confidence interval for $\mu$ is:
$$\bar{x} \pm 1.96\,\frac{\sigma}{\sqrt{n}} = 160.5 \pm 1.96\,\frac{7.5}{\sqrt{100}} = [159.03,\ 161.97]$$
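The calculation in Example 7.1 can be reproduced in R. This is a sketch using the numbers from the example; the variable names are illustrative.

```r
# Example 7.1: z-interval for the mean with known sigma.
x_bar <- 160.5   # sample mean (cm)
sigma <- 7.5     # known population standard deviation (cm)
n <- 100         # sample size
conf <- 0.95     # confidence level

z <- qnorm(1 - (1 - conf) / 2)
margin <- z * sigma / sqrt(n)
ci <- c(lower = x_bar - margin, upper = x_bar + margin)
print(round(ci, 2))
```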
Confidence Interval for $\mu$ when $\sigma$ is Unknown
• In the previous example, we assumed that $\sigma$ is known.
• When $\sigma$ is known,
$$Z = \frac{\bar{X} - \mu}{\sigma_{\bar{x}}} = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$
follows the standard normal distribution (mean 0 and s.d. 1, Z-distribution) when the sample size is sufficiently large.
• Unfortunately, in reality, $\sigma$ is rarely known. In such a case, it can be replaced by its sample estimate $s$.
• When $\sigma$ is unknown,
$$t = \frac{\bar{X} - \mu}{s/\sqrt{n}}$$
follows a t-distribution with $n - 1$ degrees of freedom.
Z and t Distributions
• See the file Z and T.xlsm for the difference between Z and t distributions.
Both are bell-shaped and centered at 0 (the mean).
However, the t-distribution is more spread out than the normal distribution. When $n$ is small, t is a lot more spread out than Z.
As $n$ becomes larger, t gets closer and closer to Z.
Eventually, as $n$ approaches infinity, there is practically no difference between t and Z.
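One way to see the extra spread without the spreadsheet is to compare tail probabilities. The sketch below uses the cutoff 2 and an arbitrary list of degrees of freedom, both chosen only for illustration.

```r
# Probability of falling more than 2 units from 0 under t and under Z.
df_values <- c(2, 5, 10, 30, 100, 1000)
tail_t <- 2 * pt(-2, df = df_values)   # P(|t| > 2) for each df
tail_z <- 2 * pnorm(-2)                # P(|Z| > 2)

print(data.frame(df = df_values, tail_t = round(tail_t, 4)))
print(round(tail_z, 4))
```

The t tail probability shrinks toward the normal value as the degrees of freedom grow, which is the convergence described above.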
Confidence Interval for $\mu$ when $\sigma$ is Unknown
• So, a confidence interval for $\mu$ when $\sigma$ is unknown can be constructed as:
$$\bar{x} \pm t_{n-1,\,1-\alpha/2}\,\frac{s}{\sqrt{n}}$$
• Here, $t_{n-1,\,1-\alpha/2}$ is the $(1-\alpha/2)100$th percentile of the t-distribution with $n - 1$ degrees of freedom.
– In other words, $(1-\alpha)100\%$ of the t values are within $\pm t_{n-1,\,1-\alpha/2}$.
• The critical t can be found by the R function: qt(1 - alpha/2, n - 1). For example:
– qt(0.95, 10) = 1.8125
– qt(0.95, 30) = 1.6973
– qt(0.95, 100) = 1.6602
• If $n$ is sufficiently large, it really doesn't matter whether we use Z or t. The formula below can be used as well:
– $\bar{x} \pm z_{1-\alpha/2}\,\dfrac{s}{\sqrt{n}}$.
– This is because, for very large $n$, $z_{1-\alpha/2} \approx t_{n-1,\,1-\alpha/2}$. For example, $z_{0.975} = 1.96 \approx t_{300,\,0.975} = 1.9679$.
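The quoted critical values can be reproduced directly with qt() and qnorm(). This is a short sketch; the variable names are illustrative.

```r
# Critical t values at the 95th percentile for several degrees of freedom.
t_crit <- qt(0.95, df = c(10, 30, 100))
print(round(t_crit, 4))

# For large df, the t critical value is close to the z critical value.
z_975 <- qnorm(0.975)
t_300 <- qt(0.975, df = 300)
print(round(c(z = z_975, t = t_300), 4))
```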
Example 7.2: Annual Household Income in HK
• The file income.csv contains survey data on the annual household income of 1020 randomly selected families in HK in 2011.
• Use the data to find a 95% confidence interval estimate for the mean annual household income in HK in 2011.
• The R function t.test() can calculate the confidence interval for $\mu$.
• According to the R output, the 95% confidence interval is: [401454.0, 457814.9]
Example 7.2: Annual Household Income in HK
• We can verify the result of t.test() by using the formula: $\bar{x} \pm t_{n-1,\,1-\alpha/2}\,\dfrac{s}{\sqrt{n}}$.
• Implementing the above formula directly in R gives the same result as t.test().
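The agreement between t.test() and the formula can be sketched as follows. Since income.csv is not reproduced here, the code uses a simulated stand-in vector; the log-normal parameters are purely hypothetical, so the numbers will not match the interval reported in the example.

```r
# Hypothetical stand-in for the income column of income.csv (assumption).
set.seed(2011)
income <- rlnorm(1020, meanlog = 12.5, sdlog = 1)

# Interval reported by t.test().
ci_ttest <- as.numeric(t.test(income, conf.level = 0.95)$conf.int)

# Same interval computed from x_bar +/- t * s / sqrt(n).
n <- length(income)
margin <- qt(0.975, df = n - 1) * sd(income) / sqrt(n)
ci_manual <- c(mean(income) - margin, mean(income) + margin)

print(ci_ttest)
print(ci_manual)
```

With the real data, replacing the simulated vector by the income column read from the file should give the two identical intervals in the same way.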
Sample Size Determination in the Case of $\mu$
• Recall that the interval estimate of $\mu$ takes this form:
$\bar{x} \pm$ margin of error
• If we want our prediction to be within a specified margin of error with some confidence, how big a sample size $n$ should we take?
– If $\sigma$ is known: solving for the sample size $n$ that results in a specified level of confidence and precision (the desired margin of error, PME), we get
$$z\,\frac{\sigma}{\sqrt{n}} = \text{PME} \;\Rightarrow\; n = \frac{z^2\sigma^2}{(\text{PME})^2}$$
(Here, $z = z_{1-\alpha/2}$ to make notation compact.)
– If $\sigma$ is unknown, replace $\sigma$ with $s$, the sample standard deviation,
$$n = \frac{z^2 s^2}{(\text{PME})^2}$$
where $s$ could be obtained from a pilot study or from other studies.
Example 7.2 (Continued)
• Based on the result of Example 7.2, the margin of error of the 95% confidence interval is
(457814.9 − 401454.0)/2 = 28180.45
a. If we want the margin of error to be 15000 with 95% confidence,
how big a sample size should we take? (Ans: 3592)
b. If we want the margin of error to be 15000 with 99% confidence, how big a sample size should we take? (Ans: 6204)
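The two answers can be reproduced from the reported interval alone. This sketch assumes the interval came from the t formula, so that $s$ can be backed out of the margin of error; the helper name sample_size_mean is hypothetical.

```r
# Back out s from the 95% interval reported in Example 7.2 (n = 1020),
# assuming it was computed as x_bar +/- t * s / sqrt(n).
ci <- c(401454.0, 457814.9)
n0 <- 1020
margin0 <- diff(ci) / 2
s <- margin0 * sqrt(n0) / qt(0.975, df = n0 - 1)

# Hypothetical helper: n = z^2 s^2 / PME^2, rounded up to a whole unit.
sample_size_mean <- function(s, pme, conf) {
  z <- qnorm(1 - (1 - conf) / 2)
  ceiling((z * s / pme)^2)
}

n_95 <- sample_size_mean(s, pme = 15000, conf = 0.95)
n_99 <- sample_size_mean(s, pme = 15000, conf = 0.99)
print(c(n_95 = n_95, n_99 = n_99))
```

Rounding up with ceiling() is a deliberate choice: rounding down would leave the margin of error slightly above the target.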
Confidence Interval for Proportion
• Recall that, for a large enough sample size, sample proportions follow an approximately normal distribution with mean equal to the true population proportion (i.e., $\mu_{\bar{p}} = p$) and standard error $\sigma_{\bar{p}} = \sqrt{p(1-p)/n}$.
• Since $p$ is unknown (we are developing the interval estimate of $p$), we must use something in place of $p$.
• Thus, a $(1-\alpha)100\%$ confidence interval for the population proportion is:
$$\bar{p} \pm z_{1-\alpha/2}\,\sigma_{\bar{p}} = \bar{p} \pm z_{1-\alpha/2}\sqrt{\frac{\bar{p}(1-\bar{p})}{n}},$$
where $\bar{p}$ is the sample proportion.
• R function: t.test()
– the raw data must first be coded as a dummy variable
– the result obtained by t.test() will be slightly different (it uses $t_{n-1,\,1-\alpha/2}$ instead of $z_{1-\alpha/2}$). When the sample size is large, these two methods tend to produce similar results with negligible differences. But when the sample size is small, they can produce very different results.
Example 7.3: Proportion of Low-Income Family
• Consider again the household income data (File: income.csv).
• Suppose the government wants to subsidize families with annual income under $100,000 and needs to estimate the proportion of families that will receive the subsidy.
• Construct a 95% confidence interval to estimate the proportion of families whose income is less than $100,000.
Confidence Interval for p with t.test()
To use t.test() to calculate a confidence interval for p, we need to first code the raw data into either 1 or 0, where
– 1 = Families with income less than $100,000, and
– 0 = otherwise.
– This type of variable is referred to as a "dummy variable".
– We can use R's ifelse() function to code the variable:
> income2 <- ifelse(income < 100000, 1, 0)  # coding: 1 for low income, 0 for non-low income
> head(data.frame(income, income2), 10)  # display income and income2 side by side to view the result of coding
Then, use t.test(income2).
– The sample proportion of low-income families is 8.92%, and we are 95% confident that the true proportion of low-income families is between 7.17% and 10.67%.
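The z-based formula and the t.test() approach can be compared side by side. Because income.csv is not reproduced here, this sketch builds a hypothetical dummy vector with 91 ones out of 1020, a count chosen only because it matches the reported 8.92%.

```r
# Hypothetical dummy variable: 91 low-income families out of 1020 (assumption).
n <- 1020
low <- 91
income2 <- c(rep(1, low), rep(0, n - low))

p_bar <- mean(income2)                       # sample proportion

# z-based interval: p_bar +/- z * sqrt(p_bar (1 - p_bar) / n)
z <- qnorm(0.975)
ci_z <- p_bar + c(-1, 1) * z * sqrt(p_bar * (1 - p_bar) / n)

# t-based interval from t.test() applied to the dummy variable
ci_t <- as.numeric(t.test(income2)$conf.int)

print(round(100 * ci_z, 2))
print(round(100 * ci_t, 2))
```

With a sample this large the two intervals agree to about two decimal places in percentage terms, which illustrates the "negligible differences" remark.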
Sample Size Determination in the Case of $p$
• The interval estimate of $p$ takes this form:
$\bar{p} \pm$ margin of error, where the margin of error is $z_{1-\alpha/2}\sqrt{\bar{p}(1-\bar{p})/n}$.
• If we want our prediction to be within a specified margin of error with some confidence, how big a sample size $n$ should we take?
• Solving for the sample size $n$ that results in a specified level of confidence and precision, PME, we get (here, $z = z_{1-\alpha/2}$ to make notation compact)
$$z\sqrt{\frac{\bar{p}(1-\bar{p})}{n}} = \text{PME} \;\Rightarrow\; n = \frac{z^2\,\bar{p}(1-\bar{p})}{(\text{PME})^2}$$
• However, $\bar{p}$ is unknown until the sample has been collected, and so some other value must be used in its place. If we designate the substituted value as $\dot{p}$, the sample size determination expression becomes
$$n = \frac{z^2\,\dot{p}(1-\dot{p})}{(\text{PME})^2}$$
How to Set $\dot{p}$?
• There are several approaches to setting a value for $\dot{p}$.
– If a study similar to the current one has been done in the past, we could use $\bar{p}$ from the earlier study.
– But if there has been no previous study, and the current one is the first of its kind, we may run a pilot study and use the $\bar{p}$ from that pilot sample.
– Finally, the safest approach is to assign $\dot{p} = 0.5$, which results in the most conservative choice of $n$. This is because, when $\dot{p} = 0.5$, the resulting $n$ is larger than the $n$ resulting from any other value of $\dot{p}$. So, with this approach, we may take a larger sample than we need, but we will never take a smaller one.
Table: Different values for $\dot{p}$ and the product $\dot{p}(1-\dot{p})$
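A table of this kind can be regenerated in R. The grid of values from 0.1 to 0.9 is an assumption made for illustration and may differ from the grid used on the original slide.

```r
# Product p_dot * (1 - p_dot) over a grid of candidate values (grid is an assumption).
p_dot <- seq(0.1, 0.9, by = 0.1)
product <- p_dot * (1 - p_dot)
tab <- data.frame(p_dot = p_dot, product = product)
print(tab)
```

The product peaks at 0.25 when the substituted value is 0.5 and is symmetric around it, which is why 0.5 gives the largest required sample size.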
Example 7.4: Presidential Election
• In a coming presidential election with two candidates, A and B, if a pollster wants to predict the percentage of votes for candidate A to be within ±2% (margin of error) with 95% confidence, how big a sample size should he take?
• With only two candidates, it is not unreasonable to assume that both of them will get close to 50% of the votes. So, we can assume $\dot{p} = 0.5$. Hence,
$$n = \frac{z^2\,\dot{p}(1-\dot{p})}{(\text{PME})^2} = \frac{1.96^2(0.5)(0.5)}{(0.02)^2} = 2401.$$
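The same calculation can be wrapped in a small function. This is a sketch; the function name sample_size_prop and its defaults are hypothetical, and it uses the unrounded critical value from qnorm() rather than 1.96.

```r
# Hypothetical helper: n = z^2 p_dot (1 - p_dot) / PME^2, rounded up.
sample_size_prop <- function(pme, conf = 0.95, p_dot = 0.5) {
  z <- qnorm(1 - (1 - conf) / 2)
  ceiling(z^2 * p_dot * (1 - p_dot) / pme^2)
}

n_poll <- sample_size_prop(pme = 0.02)              # Example 7.4 settings
n_alt <- sample_size_prop(pme = 0.02, p_dot = 0.3)  # a less conservative guess
print(c(n_poll = n_poll, n_alt = n_alt))
```

Any substituted value other than 0.5 yields a smaller sample size, consistent with 0.5 being the conservative choice.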
Introduction
• Quite often, an analyst has a particular theory, or hypothesis, that he or she would like to test.
– Alternative hypothesis: The hypothesis that the analyst is attempting to prove is called the alternative hypothesis (denoted $H_1$). It is also frequently called the research hypothesis.
– Null hypothesis: The opposite of the alternative hypothesis is called the null hypothesis (denoted $H_0$). It usually represents the current thinking or status quo.
– That is, the null hypothesis is usually the accepted theory that the analyst is trying to disprove.
Concepts in Hypothesis Testing
• The null and alternative hypotheses divide all possibilities into two non-overlapping sets, exactly one of which must be true.
• To reject or not to reject:
– Traditionally, hypothesis testing has been phrased as a decision- making problem, where an analyst decides either to reject the null hypothesis or not to reject it (which is the same as accept it), based on the sample evidence.
• When sample information is used to test the hypotheses, the benefit of the doubt is given to $H_0$ and the burden of proof is on $H_1$.
• In other words, $H_0$ usually will not be rejected (and $H_1$ accepted) unless the sample evidence is strongly against $H_0$.
How Do We Decide Whether or Not to Reject a Null Hypothesis?
• In order to decide whether or not the null hypothesis should be rejected, we ask the following question:
– “If the null hypothesis is true, how likely would it be to get such a sample or a more extreme sample?”
• If it is not too unlikely, the sample evidence is considered not strong enough for us to reject the null hypothesis.
• On the other hand, if it is very unlikely, then this suggests that the null hypothesis is more likely to be untrue and therefore should be rejected.
• Test statistic – empirical result of the hypothesis test used to either reject or not reject the null hypothesis
• Rejection region (RR) – specifies range of values test statistic might assume that would lead to rejection of the null hypothesis
Three Forms of Hypothesis Statements
• Let $\theta$ be any population parameter (e.g., $\mu$, $p$, $\sigma$, etc.) and $\theta_0$ its hypothesized value. Then, we have the following three forms of hypothesis statement.
• Form I: Two-tailed Test
– $H_0: \theta = \theta_0$
– $H_1: \theta \ne \theta_0$
With this form, $H_0$ will be rejected when the value of the test statistic obtained is either very small or very large.
• Form II: Left-tailed Test
– $H_0: \theta \ge \theta_0$
– $H_1: \theta < \theta_0$
With this form, $H_0$ will be rejected when the value of the test statistic obtained is very small.
• Form III: Right-tailed Test
– $H_0: \theta \le \theta_0$
– $H_1: \theta > \theta_0$
With this form, $H_0$ will be rejected when the value of the test statistic obtained is very large.
• Note that $H_0$ always includes "=".
Example 7.5: Average Height
• Suppose the Government claims that the average height of male adults (which is impossible to know unless we measure everyone) is at least 173 cm. Suppose it is known that the population standard deviation of height is 7.5 cm. How do we test the Government’s claim?
• The hypotheses to be tested are
– $H_0: \mu \ge 173$
– $H_1: \mu < 173$
• To test the above hypotheses, a sample of 900 randomly selected male adults was taken and the sample mean is 172.1 cm.
• Based on this sample mean, what can we say about the Government’s claim?
Example 7.5: Average Height
• Recall that, by the Central Limit Theorem, for a large enough sample size, the sampling distribution of the sample mean will be approximately normal with $\mu_{\bar{x}} = \mu$ and $\sigma_{\bar{x}} = \sigma/\sqrt{n}$.
• Hence, if $H_0$ is true, the distribution of the sample mean should be normal with $\mu_{\bar{x}} = 173$ or larger and standard error $\sigma_{\bar{x}} = 7.5/\sqrt{900} = 0.25$.
• Then, if the null hypothesis is true, the sample mean 172.1 would be at least 3.6 standard errors below the mean.
• How likely is it for us to get such a sample if the null hypothesis is true? To answer this question, we need to calculate
– $P(\bar{x} \le 172.1 \mid H_0 \text{ is true}) = P(\bar{x} \le 172.1 \mid \mu = 173,\ \sigma_{\bar{x}} = 0.25)$
– Using the R command pnorm(172.1, 173, 0.25), the probability is found to be about 0.00016.
• So, it is not very likely. This means that, if $H_0$ is true, there is less than a 0.02% chance of getting such a sample mean or a more extreme one.
• So, do you think $H_0$ is likely to be true?
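The steps of this calculation can be sketched in R using the numbers from the example; the variable names are illustrative.

```r
# Example 7.5: how unusual is a sample mean of 172.1 if mu = 173?
mu0 <- 173       # boundary value of the null hypothesis
sigma <- 7.5     # known population standard deviation
n <- 900         # sample size
x_bar <- 172.1   # observed sample mean

se <- sigma / sqrt(n)                        # standard error of the mean
z <- (x_bar - mu0) / se                      # standard errors below the mean
p_left <- pnorm(x_bar, mean = mu0, sd = se)  # P(x_bar <= 172.1 | mu = 173)

print(c(se = se, z = z))
print(p_left)
```

Using the boundary value 173 is the case most favourable to the null hypothesis; any larger mean would make the observed sample even less likely.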
Example 7.5: Average Height
• What if the sample mean is 172.75, instead of 172.1?
• In this case, if $H_0$ is true (i.e., $\mu \ge 173$), the sample mean could be just one standard error below the mean (if $\mu$ is taken to be 173), and the probability of getting such a sample or a more extreme one is about 15.87%, which is not very unusual.
– R command: pnorm(172.75, 173, 0.25).
• That is, even if $H_0$ is true (i.e., $\mu \ge 173$), there is still about a 16% chance of getting such a sample or a more extreme one.
– The sample evidence is not strong enough to reject $H_0$.
The $p$-value of the Test Statistic
• The sample statistic used to test a null hypothesis is called the “test statistic” (e.g., $\bar{x}$ in Example 7.5).
• In the language of hypothesis testing, the probability of getting a particular test statistic or a more extreme one when $H_0$ is true is referred to as the $p$-value.
• When the $p$-value is “small”, it is an indication that $H_0$ is unlikely to be true and hence can be rejected.
• But the question is “how small is considered small”?
• This will depend on how much risk the investigator is willing to take of falsely rejecting a true $H_0$.
• For a left-tailed test, if $\theta^*$ is the test statistic obtained, then its $p$-value is the shaded region to its left, i.e., $P(\theta \le \theta^*)$.
• For a right-tailed test, if $\theta^*$ is the test statistic obtained, then its $p$-value is the shaded region to its right, i.e., $P(\theta \ge \theta^*)$.
• For a two-tailed test, if $\theta^*$ is the test statistic obtained, then its $p$-value is two times the shaded region to its right, i.e., $P(\theta \ge \theta^*) \times 2$.
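The three cases can be collected in one helper for a normally distributed test statistic. This is a sketch: the function name p_value and its arguments are hypothetical, and the two-tailed branch doubles the smaller tail, which is an assumption made here so that a statistic falling on the left of the mean is handled as well.

```r
# Hypothetical helper: p-value of a normally distributed test statistic.
# stat  = observed value, mean0 = mean under H0, se = standard error.
p_value <- function(stat, mean0, se, tail = c("left", "right", "two")) {
  tail <- match.arg(tail)
  left <- pnorm(stat, mean = mean0, sd = se)                       # P(stat or smaller)
  right <- pnorm(stat, mean = mean0, sd = se, lower.tail = FALSE)  # P(stat or larger)
  switch(tail,
         left = left,
         right = right,
         two = 2 * min(left, right))  # double the smaller tail (assumption)
}

# Sampling distribution from Example 7.5: mean 173, standard error 0.25.
p_left <- p_value(172.75, mean0 = 173, se = 0.25, tail = "left")
p_two <- p_value(172.75, mean0 = 173, se = 0.25, tail = "two")
print(round(c(left = p_left, two = p_two), 4))
```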
Example 7.5: Average Height
1. For the test discussed in Example 7.5, what is the corresponding $p$-value if the sample mean ($\bar{x}$) is 172.6?
2. If the hypotheses to be tested are changed to a two-tailed test:
$H_0: \mu = 173$
$H_1: \mu \ne 173$
what is the $p$-value if $\bar{x} = 172.2$?
3. If the hypotheses to be tested in Example 7.5 are changed to a right-tailed test:
$H_0: \mu \le 173$
$H_1: \mu > 173$
what is the $p$-value if $\bar{x} = 173.6$?
Solutions:
1. The test in Example 7.5 is a left-tailed test. So,
– $p$-value $= P(\bar{x} \le 172.6 \mid \mu = 173,\ \sigma_{\bar{x}} = 0.25) = 0.0548$
– R code: pnorm(172.6, 173, 0.25)
2. This is a two-tailed test. So,
– $p$-value $= P(\bar{x} \le 172.2 \mid \mu = 173,\ \sigma_{\bar{x}} = 0.25) \times 2 = 0.000687 \times 2 = 0.0014$
– R code: pnorm(172.2, 173, 0.25)*2
3. This is a right-tailed test. So,
– $p$-value $= P(\bar{x} \ge 173.6 \mid \mu = 173,\ \sigma_{\bar{x}} = 0.25) = 0.0082$
– R code: 1 - pnorm(173.6, 173, 0.25)