Probability in Computing Spring 2017

Lab 5 – Hypothesis testing and linear regression.
Due: Wednesday May 3rd 11.59PM in PDF form via Websubmit.

Note that the whole coding part of this lab must be done in R.

1 What to submit (MANDATORY)

1. PDF file (lab5 firstname lastname.pdf) including plots and a snapshot of the code used to answer
the questions.

• Names of the collaborators.
• Number of late days for this assignment.
• Number of late days so far.
• References used

2. R script (lab5 firstname lastname.R) with the code used.

Failing to meet any of the above requirements will lower your grade.

2 Background

2.1 Confidence interval for the population mean

In class we learned that, according to the central limit theorem, the distribution of the sample mean Xn
is approximately normal with mean µ (the population mean) and standard deviation σ/√n (where σ is the
population standard deviation). For a random variable with a normal distribution, the probability that
its value is within 2 standard deviations of its mean is about 0.95. If there is a certain distance between
the sample mean (recall that the sample mean is Xn = (1/n) ∑_{i=1}^{n} Xi) and the population mean, we
can describe that distance by starting at either value. So, if the sample mean Xn falls within a certain
distance of the population mean µ, then the population mean µ falls within the same distance of the
sample mean. Therefore, the statement “There is a 95% chance that the sample mean Xn falls within 2
standard deviations of µ” can be rephrased as “We are 95% confident that the population mean µ falls
within 2 standard deviation units of Xn”, where one standard deviation unit is σ/√n, the standard
deviation of the sample mean. This second statement is exactly the interpretation of the confidence
interval. Similarly, if our hypothesis is that the population mean is equal to µ0, and µ0 is within 2
standard deviation units of Xn, we say that the hypothesis is not rejected at a significance level of
α = 0.05.
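
The rephrasing above is an algebraic rearrangement of one and the same event. As a sketch, using the standard deviation σ/√n of the sample mean and the approximate multiplier 2 from the text, the two statements can be written side by side:

```latex
\[
\Pr\left( \left| X_n - \mu \right| \le 2\,\frac{\sigma}{\sqrt{n}} \right) \approx 0.95
\quad\Longleftrightarrow\quad
\Pr\left( X_n - 2\,\frac{\sigma}{\sqrt{n}} \le \mu \le X_n + 2\,\frac{\sigma}{\sqrt{n}} \right) \approx 0.95
\]
```

The left-hand side measures the distance starting from µ, the right-hand side measures the same distance starting from Xn.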

Definition:
Given a sample of size n, and assuming that we know the population standard deviation σ, the two-sided
confidence interval for the population mean is computed as follows:

$$X_n \pm z \times \frac{\sigma}{\sqrt{n}} \qquad (1)$$

where Xn is the sample mean and z is a multiplier that depends on the level of significance α.

Some important values of z are :

• For α = 0.1 (90% confidence interval), zα/2 = 1.645

• For α = 0.05 (95% confidence interval), zα/2 = 1.96

• For α = 0.01 (99% confidence interval), zα/2 = 2.576

Note that to compute a one-sided confidence interval we have to use zα instead of zα/2. This is because
in a one-sided interval we are interested only in the lower value (when the alternative hypothesis
is “greater than”) or the upper value (when the alternative hypothesis is “less than”), so all of α is
placed in a single tail.
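
The multipliers listed above can be reproduced in R with qnorm, the quantile function of the standard normal distribution. The sketch below is only an illustration: the variable names and the numbers used for the interval (sample mean 52, σ = 10, n = 25) are hypothetical and are not taken from any problem in this lab.

```r
# Significance levels from the list above.
alpha <- c(0.10, 0.05, 0.01)

# Two-sided multiplier z_{alpha/2}: leaves alpha/2 in each tail.
z_two_sided <- qnorm(1 - alpha / 2)

# One-sided multiplier z_alpha: leaves all of alpha in a single tail.
z_one_sided <- qnorm(1 - alpha)

print(round(data.frame(alpha, z_two_sided, z_one_sided), 3))

# Two-sided 95% interval from equation (1), using hypothetical numbers.
x_bar <- 52   # sample mean (hypothetical)
sigma <- 10   # known population standard deviation (hypothetical)
n     <- 25   # sample size (hypothetical)

half_width <- qnorm(1 - 0.05 / 2) * sigma / sqrt(n)
interval <- c(lower = x_bar - half_width, upper = x_bar + half_width)
print(round(interval, 2))
```

For any given α the one-sided multiplier is smaller than the two-sided one, because the whole of α sits in one tail.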

Figure 1: Different types of confidence intervals, for all three figures α = 0.10.

2.2 Hypothesis testing for population mean

Recall that there are basically 4 steps in the process of hypothesis testing:

1. State the null (H0) and alternative hypotheses (H1).

2. Collect relevant data from a random sample and summarize them (using a test statistic).

3. Find the p-value, the probability of observing data at least as extreme as those observed, assuming that H0 is true.

4. Based on the p-value, decide whether we have enough evidence to reject H0 (and accept H1), and
draw our conclusions in context. To make a decision we have to choose a significance level. In this lab,
unless explicitly stated otherwise, we will use a significance level of 0.05.

Assume that µ is our population mean. Note that the null hypothesis always takes the form H0 : µ = µ0
(where µ0 is some value). The alternative hypothesis, and with it the type of test, can take one of the
following three forms:

1. H1 : µ > µ0 (right-tailed test)

2. H1 : µ < µ0 (left-tailed test)

3. H1 : µ ≠ µ0 (two-tailed test)

In hypothesis testing we have to distinguish between two cases: 1) the case where the population standard
deviation (σ) is known, and 2) the case where σ is unknown. In the first case the test we will use is called
the z-test for the population mean µ. In the second case, the test is called the t-test for the population
mean µ. In the first case, the test statistic will have a standard normal (z) distribution (when H0 is true),
and in the second case, the test statistic will have a t-distribution (when H0 is true).

3 z-test for the population mean (σ is known)

3.1 Learning example

The SAT is constructed so that scores in each portion have a national average of 500 and standard deviation
of 100. The distribution is close to normal. The dean of students of Ross College suspects that in recent
years the college attracts students who are more quantitatively inclined. A random sample of 4 students
from a recent entering class at Ross College had an average math SAT (SAT-M) score of 550. Does this
provide enough evidence for the dean to conclude that the mean SAT-M of all Ross College students is
higher than the national mean of 500? Assume that the scores of all Ross College students are also normally
distributed with a standard deviation of 100.

1. State the null and alternative hypotheses.

When we discussed probability models based on sampling distributions, we concluded that the sample mean,
Xn, is a random variable with the following properties:

• The mean is the same as the population mean, µ.

• The standard deviation is σ/√n, where σ is the standard deviation of the population.

• The sample means are normally distributed if the underlying variable being sampled is normally
distributed in the population or the sample size is large enough to guarantee approximate normality.
Recall that this last statement is the Central Limit Theorem. As a general guideline, if n > 30, the
Central Limit Theorem applies and we can use the normal distribution to model the distribution of
Xn.
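
The properties listed above can be checked empirically with a small simulation. The sketch below is an illustration only: the exponential population, the seed, the sample size n = 40 and the number of repetitions are all hypothetical choices, picked because the exponential distribution is clearly not normal and, with rate 1, has mean 1 and standard deviation 1.

```r
set.seed(1)      # hypothetical seed, used only for reproducibility

n    <- 40       # sample size (hypothetical, larger than 30)
reps <- 10000    # number of simulated samples (hypothetical)
rate <- 1        # exponential rate: population mean = 1, population sd = 1

# Each column of the matrix is one sample of size n;
# colMeans returns one sample mean per simulated sample.
samples      <- matrix(rexp(n * reps, rate = rate), nrow = n)
sample_means <- colMeans(samples)

observed_mean  <- mean(sample_means)      # compare with the population mean 1 / rate
observed_sd    <- sd(sample_means)        # compare with sigma / sqrt(n)
theoretical_sd <- (1 / rate) / sqrt(n)

print(c(observed_mean = observed_mean,
        observed_sd = observed_sd,
        theoretical_sd = theoretical_sd))

# A histogram of the simulated sample means shows the approximate normality.
hist(sample_means, breaks = 50, main = "Simulated sample means", xlab = "Sample mean")
```

Repeating the simulation with a much smaller n should show a visibly skewed histogram, which is why the n > 30 guideline matters for non-normal populations.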

Based on this description of the sampling distribution of the sample mean Xn, we can define a test
statistic that measures the distance between the hypothesized value of µ (denoted µ0) and the sample mean
(determined by the data) in standard deviation units. The test statistic is:

$$Z_n = \frac{X_n - \mu_0}{\sigma / \sqrt{n}} \qquad (2)$$
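
As a minimal sketch of how equation (2) and the three forms of the alternative hypothesis translate into R, the example below uses hypothetical numbers (sample mean 52, µ0 = 50, σ = 10, n = 25) that do not come from any problem in this lab. The variable names are illustrative as well. Only pnorm, which the lab itself mentions, is used.

```r
# Hypothetical inputs, chosen only to illustrate the formula.
x_bar <- 52   # sample mean
mu0   <- 50   # value of the population mean under H0
sigma <- 10   # known population standard deviation
n     <- 25   # sample size

standard_error <- sigma / sqrt(n)      # standard deviation of the sample mean
z <- (x_bar - mu0) / standard_error    # test statistic from equation (2)

# p-value for each form of the alternative hypothesis.
p_greater   <- pnorm(z, lower.tail = FALSE)           # H1: mu > mu0, Pr(Z >= z)
p_less      <- pnorm(z)                               # H1: mu < mu0, Pr(Z <= z)
p_two_sided <- 2 * pnorm(abs(z), lower.tail = FALSE)  # H1: mu != mu0, 2 * Pr(Z >= |z|)

print(round(c(z = z, greater = p_greater, less = p_less, two_sided = p_two_sided), 4))
```

With these numbers the sample mean lies exactly one standard error above µ0, so z = 1.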

Comments

• Note that our test statistic (because it is a z-score) tells us how far Xn is from the null value µ0,
measured in standard deviations. Since Xn represents the data and µ0 represents the null hypothesis,
the test statistic is a measure of how different our data are from what is claimed in the null hypothesis.
The larger the test statistic is in the direction of H1, the more evidence we have against H0, since what
we saw in our data is very different from what H0 claims.

• All inference procedures are based on probability. We are trying to determine if our sample results
are likely or unlikely based on our assumptions about the population. This requires that we have a
probability model that describes the long-term behavior of sample results that are randomly collected
from a population that fits our hypothesis. For this reason, the Central Limit Theorem gives us criteria
for deciding if the z-test for the population mean can be used. We need to verify:

1. The sample is random (or at least can be considered as random in context).

2. We are in one of the three situations marked with yes in the following table:

Conditions: z-test for a population mean

                                                   Small sample size (n ≤ 30)    Large sample size (n > 30)
Variable xi from a normal distribution             YES                           YES
Variable xi not from a normal distribution         NO                            YES

3. If the conditions are met, then the values of Xn = (1/n) ∑_{i=1}^{n} xi vary normally, or at least
close enough to normally to use a normal model to calculate probabilities. When the Xn values are
normal, the z-scores are normally distributed with a mean of 0 and a standard deviation of 1.

Now let’s get back to our SAT example.

2. Can we use the z-test to do our analysis? Hint: recall the conditions we have to check.

3. What is the value of the sample mean Xn?

4. What is the value of population standard deviation σ?

5. What is the value of the sample size n?

6. Compute the z-statistic and explain how one should interpret the result.

7. Find the p-value of the test using the normal table (http://www.normaltable.com/). Hint: Recall that
the p-value when H1 is “greater than” (right-tailed z-test) is Pr(Z ≥ z). The normal table shows
Pr(Z < z).

8. Suppose we reject the null hypothesis if our results are significant at the 5% level. Can we reject the
null hypothesis given the p-value we obtained?

9. What would be the minimum sample size we need to reject the null hypothesis at a significance level
of α = 0.05 (95% confidence)? Hint: first you have to find the z value for which p-value ≤ 0.05. Then
you can compute the n needed.

10. Now let’s verify our results with code. For this we are going to use R. Create a function called
significance that, given the sample size n, the population standard deviation σ, the population
mean µ and the sample mean Xn, computes z and the p-value. Hint: in R the function pnorm computes
Pr(Z < z). To create a function in R you write function_name = function(parameters). Submit your
code.

11. Execute the significance function for increasing values of the sample size (starting with n = 4 and
incrementing by 1 each time) until the results are statistically significant, i.e., p-value ≤ 0.05.
Provide a results table with the following 4 columns: n, z (test statistic), p-value and significant
(yes/no). What is the minimum sample size for which we can reject the null hypothesis? Using R you
can test all the values of n from 4 to 14 at once by passing the vector 4:14 as the sample size, e.g.
significance(4:14). Submit your code and table.

3.2 Problem

Every year, the Environmental Protection Agency (EPA) collects data on fuel economy (randomly sampling
from the entire population). With rising gasoline prices, consumers are using these figures as they decide
which automobile to purchase. We will look at two-seater automobiles, many of which are sporty vehicles.
Based upon the latest 2017 EPA sample, we wish to test the hypothesis that the combined city and highway
miles per gallon (mpg) of two-seater automobiles is greater than 20. The standard deviation for all vehicles
is 4.7 mpg.
The dataset containing the data is epa.csv and the column you are interested in is COMB.MPG.

12. State the null and alternative hypotheses.

13. Have the conditions that allow us to safely use the z-test been met?

14. Compute the test statistic and the p-value using the normal table (http://www.normaltable.com/).

15. Extend the function you wrote for question 10 so that, given

• the sample size n
• the population standard deviation σ
• the population mean µ
• the sample mean Xn
• alternative: either “less”, “greater” or “two.sided”, indicating the form of the alternative hypothesis,

it computes and outputs the sample mean, sample size, z and the p-value. Hint: recall that for the
two-sided test the p-value = 2 × Pr(Z ≥ |z|). Submit your code.

16. Provide the output of the function and verify that it matches the theoretical values computed above.

17. Draw conclusions based on the context of the problem.

18. Compute the one-sided confidence interval for α = 0.05 (95% confidence interval). Provide both upper
and lower values.

19. Use R to plot the confidence interval computed above. Recall that we assume that the sample mean
Xn is normally distributed. Therefore you need to create a normal variable with mean equal to the
sample mean and standard deviation equal to the population standard deviation and plot its probability
density function. Then add to the same figure two vertical lines corresponding to the lower and upper
limits of the confidence interval computed above. Hint: the function abline is used to add vertical or
horizontal reference lines to a plot in R. Submit your code and plot.

20. What would be the minimum sample size we need to reject the null hypothesis at a significance level
of α = 0.05 (95% confidence)?

3.3 Relating Hypothesis Tests and Confidence Intervals

Suppose we want to test H0 : µ = µ0 vs. H1 : µ ≠ µ0 using a significance level of α = 0.05. An alternative
way to perform this test is to find a 95% confidence interval for µ and make the following conclusions:

• If µ0 falls outside the confidence interval, reject H0.

• If µ0 falls inside the confidence interval, do not reject H0.

21. Compute the one-sided confidence interval for the SAT problem for α = 0.05. Provide both the upper
and lower values of the confidence interval.

22. Does µ0 (the hypothesized population mean) fall outside or inside the confidence interval?

23. Now compute the confidence interval assuming n = 11.

24. Does µ0 (the hypothesized population mean) fall outside or inside the confidence interval?

4 t-test for the population mean (σ is unknown)

Unfortunately, only in a few cases is it reasonable to assume that the population standard deviation (σ) is
known. What can we use to replace σ? If you don’t know the population standard deviation, the best you
can do is find the sample standard deviation S, given by

$$S = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - X_n)^2}$$

and use it instead of σ. In doing so we also have to change the test we use in the hypothesis testing, which
is now the t-test. The conditions under which we can apply the t-test are the same as those stated for the
z-test (see Table 2). The test statistic for the t-test is defined as:

$$t = \frac{X_n - \mu_0}{S / \sqrt{n}} \qquad (3)$$

In the denominator we are using S instead of σ. This change has an effect on the distribution of the
t-test statistic, which no longer follows a normal distribution. Instead it follows a distribution called the
t distribution or Student's t distribution. The t distribution has slightly less area near the expected central
value than the normal distribution does, and correspondingly more area in the “tails”. Therefore, the t
distribution ends up being the appropriate model in certain cases where there is more variability than would
be predicted by the normal distribution. There are actually many different t distributions.
The particular form of the t distribution is determined by its degrees of freedom. The degrees of freedom
refers to the number of independent observations in a set of data. When estimating a mean score or a
proportion from a single sample, the number of independent observations is equal to the sample size minus
one. This is important when we want to compute the p-values for our hypothesis testing exercise: if the
sample size is n = 10, we compute the p-value from the t distribution with n − 1 = 9 degrees of freedom,
t(9).

25. In order to compare the normal distribution with the t-distribution, plot the density of a normal
distribution for 100 values in the range [−4, 4]. Then, to the same plot, add the density function of a
t-distribution for the following values of degrees of freedom: df = {1, 3, 8, 30}. Submit your code and
plot.

26. What happens to the t-distribution when we increase the degrees of freedom?

4.1 Problem

We are going to use the SAT problem we analyzed in the previous section, with a small modification: now
we don’t know σ. Instead we will use the sample standard deviation S (which we can compute from the
sample) as an approximation for σ. This change implies that the z-test is no longer appropriate and we
need to use the t-test.

27. Can we use the t-test to do our analysis? Hint: recall the conditions we have to check (see Table 2).

28. How many degrees of freedom do we have?

29. Given S = 100, compute the t-statistic and explain how one should interpret the result.

30. Find the p-value of the test using R. Hint: the function is called pt. Recall that the p-value when H1
is “greater than” (right-tailed test) is Pr(T ≥ t), while pt by default computes Pr(T < t).

31. Is the p-value for the t-test larger or smaller than the p-value we computed with the z-test? Is it
surprising?

32. Suppose we reject the null hypothesis if our results are significant at the 5% level. Can we reject the
null hypothesis given the p-value we obtained?

33. Compute the 95% one-sided confidence interval (α = 0.05). In order to compute it you need to find
the t statistic value tα. Provide both lower and upper values of the interval. Is the confidence interval
wider than the one computed using the population standard deviation in part 21? Why? Hint: the R
function qt returns the quantile of the t distribution, which gives the t value for a one-sided t-test. You
need to use S = 100 in the confidence interval formula since we do not have σ.
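
The t-test calculations described in this section can be sketched in R as follows. This is an illustration under stated assumptions, not a solution to the questions above: the inputs (sample mean 52, µ0 = 50, S = 10, n = 25) are hypothetical and the variable names are made up. It uses pt and qt, which the hints mention, and pnorm and qnorm for comparison with the normal case.

```r
# Hypothetical inputs, chosen only to illustrate the calculations.
x_bar <- 52     # sample mean
mu0   <- 50     # value of the population mean under H0
s     <- 10     # sample standard deviation S
n     <- 25     # sample size
alpha <- 0.05   # significance level
df    <- n - 1  # degrees of freedom

standard_error <- s / sqrt(n)
t_stat <- (x_bar - mu0) / standard_error   # test statistic from equation (3)

# Right-tailed p-values under the t distribution and, for comparison, the normal.
p_t <- pt(t_stat, df = df, lower.tail = FALSE)
p_z <- pnorm(t_stat, lower.tail = FALSE)

# One-sided critical values and the corresponding lower confidence bounds.
t_alpha <- qt(1 - alpha, df = df)
z_alpha <- qnorm(1 - alpha)
lower_t <- x_bar - t_alpha * standard_error
lower_z <- x_bar - z_alpha * standard_error

print(round(c(t = t_stat, p_t = p_t, p_z = p_z,
              t_alpha = t_alpha, z_alpha = z_alpha,
              lower_t = lower_t, lower_z = lower_z), 4))
```

Because the t distribution has more area in the tails, the same statistic gives a larger right-tailed p-value under t than under the normal, and the critical value tα exceeds zα.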