CMDA-2006 Fall 2017 – Exam 2 (Statistics Exam 1)
Discussion of the problems with others cannot be prevented on a take-home exam, so feel free to do so. However, your work must still be your own! Outright copying is a violation of the honor code under any testing situation.
To discourage any temptation to copy, every student has a unique dataset for each data problem, so every student will have different answers and potentially different conclusions!
Copying will therefore be caught very easily, so just don’t do it!
Primary Instruction:
Download the .zip file (that contains your data) corresponding to your PID from Canvas under Files > Tests and Exams > Test 2 (Stat Exam 1) > Data Files, then complete the exam to the best of your ability.
Be efficient and Don’t Panic! This exam is meant to be done entirely in R and in very few lines! You’ve seen how to do all of these types of questions in the class notes or solutions by now. Be sure to justify your methods and write appropriate conclusions.
You must show your work in your write-up, meaning you need to show the code that you used to solve the problems. Use R Markdown, or copy and paste your code and results into a Word document and edit. Don’t forget to properly answer the questions!
For problem 5, you should include a .R file that contains only that function; name it Lastname chisq.R.
Finally, submit your solutions in a formal writeup .pdf file (Lastname Firstname Exam2.pdf) by 2:59pm on Canvas. You should start your upload by 2:50pm just to be on the safe side. A 3:01pm submission is late and will not be accepted.
Every problem identifies how much it is worth.
1. (10 pts) Methamphetamine comes in several forms and can be smoked, inhaled (snorted), injected, or orally ingested. The time it takes for a euphoric response after a release of dopamine (rush) in the brain, regardless of method of delivery, is exponentially distributed.
Suppose we wish to compare the dopamine release time when methamphetamine is smoked versus injected in rats. It is hypothesized that smoking might lead to a quicker dopamine release than injection.
To investigate, each method of delivery was run 12 times, and the time it takes for a dopamine rush was measured. The results, in milliseconds, are found in meth.csv.
- (a) Based upon the data, what method should you use to analyze the data? Use plots and statistical tests to justify your answer.
- (b) State the null and alternative hypotheses in words and/or symbols.
- (c) Using the method that you selected and justified, carry out that method. Test at the 5% level. Show the appropriate R output.
- (d) State the conclusion of the test and provide an interpretation of the conclusion in the context of this particular problem.
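As a rough illustration of how parts (a) and (c) can fit together, the sketch below checks normality within each group and then runs a rank-based two-sample test. It is only one possible workflow, not a required method. The data are simulated placeholders rather than the contents of meth.csv, and the column names time and method are assumptions. Plots are omitted here for brevity.

```r
# Simulated stand-in for meth.csv; the values and column names are assumptions.
set.seed(1)
meth <- data.frame(
  time   = c(rexp(12, rate = 1 / 40), rexp(12, rate = 1 / 60)),
  method = rep(c("smoked", "injected"), each = 12)
)

# Shapiro-Wilk P-value for each delivery method (small values cast doubt on normality).
norm.p <- tapply(meth$time, meth$method, function(v) shapiro.test(v)$p.value)
norm.p

# If normality looks doubtful, a Wilcoxon rank-sum test is one option.
# Group labels sort alphabetically (injected, smoked), so alternative = "greater"
# asks whether injected times tend to be larger than smoked times.
wt <- wilcox.test(time ~ method, data = meth, alternative = "greater")
wt$p.value
```

The direction of the alternative depends on the order of the group labels, so it is worth confirming that order before reading the P-value.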
2. (20 pts) Patients suffering from rheumatic diseases or osteoporosis often suffer critical losses in bone mineral density (BMD). Alendronate is one medication prescribed to build or prevent further loss of BMD. Holcomb and Rothenberg looked at 96 women taking alendronate to determine if a difference existed in the mean percent change in BMD among five different primary diagnosis classifications. Group 1 patients were diagnosed with rheumatoid arthritis (RA). Group 2 patients were a mixed collection of patients with diseases including lupus, Wegener’s granulomatosis and polyarteritis, and other vasculitic diseases (LUPUS). Group 3 patients had polymyalgia rheumatica or temporal arthritis (PMRTA). Group 4 patients had osteoarthritis (OA) and group 5 patients had osteoporosis (O) with no other rheumatic diseases identified in the medical record.
The observations recorded were the changes in BMD for the 96 patients across their different groups. The data can be found in alendronate.csv.
Given the nature of this dataset, what general conclusions can you make about the effectiveness of prescribing alendronate for patients in the various groups? Test at the 5% level.
- (a) Plot the data using the most appropriate visualization tool for this kind of dataset. Label the aspects of the plot appropriately and provide a title.
- (b) Obtain summary statistics for each group, i.e. sample sizes, sample means, and sample standard deviations (can be done in 3 lines of R or less).
- (c) State the null and alternative hypotheses, using words (in the context of the study) and symbols.
- (d) Carry out the most appropriate statistical analysis using R to compare the change in BMD across all of the groups. Justify the method that you have chosen by checking the assumptions.
- (e) If a follow-up analysis seems appropriate, carry out that analysis and comment on the results.
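The steps in parts (b), (d), and (e) can be sketched as follows, assuming a one-way ANOVA turns out to be suitable. The data are simulated placeholders, not the contents of alendronate.csv, and the column names change and group are assumptions.

```r
# Simulated stand-in for alendronate.csv; the values and column names are assumptions.
set.seed(2)
groups <- c("RA", "LUPUS", "PMRTA", "OA", "O")
bmd <- data.frame(
  change = rnorm(50, mean = rep(c(5, 3, 4, 6, 5), each = 10), sd = 4),
  group  = factor(rep(groups, each = 10), levels = groups)
)

# Per-group sample size, mean, and standard deviation.
stats <- aggregate(change ~ group, data = bmd,
                   FUN = function(v) c(n = length(v), mean = mean(v), sd = sd(v)))
stats

# One-way ANOVA, followed by checks of its assumptions.
fit <- aov(change ~ group, data = bmd)
summary(fit)
shapiro.test(residuals(fit))               # normality of the residuals
bartlett.test(change ~ group, data = bmd)  # equality of the group variances

# Pairwise follow-up comparisons, meaningful only if the overall F-test rejects.
tukey <- TukeyHSD(fit)
tukey
```

If the assumption checks fail, a rank-based alternative such as kruskal.test() may be the better choice.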
3. (20 pts) The dataset fruitfly.csv contains observations on the longevity (lifetime in days) of male flies in one of 5 daily activity groups described in further detail below.
The data mimics the data that was published by L. Partridge and M. Farquhar in their paper “Sexual Activity and the Lifespan of Male Fruitflies”, Nature, 1981, 580-581.
The dataset consists of a variable for the lifespan in days of the male fly (longevity), and a categorical variable that indicates which activity group the fly was in.
The groups are:
• isolated = Male fly kept solitary,
• low = Male fly kept with one female virgin fruitfly,
• high = Male fly kept with eight female virgin fruitflies,
• one = Male fly kept with one pregnant female fruitfly,
• many = Male fly kept with eight pregnant female fruitflies.
The last two groups were included as controls, as pregnant fruitflies will not mate.
- (a) Plot the raw data using the most appropriate visualization tool for this kind of dataset. Label the aspects of the plot appropriately and provide a title.
- (b) Conduct an analysis of the relationship between sexual activity and longevity. Prior to the collection of data, there were some questions that Partridge & Farquhar wished to answer.
- Do isolated male flies have a different lifespan on average compared to flies that are able to meet with other flies?
- Do male flies who have the ability to have sexual relationships have different lifespans on average than those who do not?
- Do male flies who live in larger communities have different lifespans on average than those who do not?
- Is there a difference in the average lifespans of male fruitflies if they live around just one pregnant female fruitfly compared to many pregnant fruitflies?
Include all necessary plots that support your claims. Comment on all the interesting results that you see during your analyses.
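Questions posed before data collection, like the four above, are often handled as planned contrasts. The sketch below shows one way to compute such contrasts by hand in base R. It is an illustration under stated assumptions, not the required analysis: the data are simulated with arbitrary group means, the column names longevity and group are assumptions, and the contrast coefficients are one possible reading of each question.

```r
# Simulated stand-in for fruitfly.csv; the values and column names are assumptions.
set.seed(3)
lev <- c("isolated", "low", "high", "one", "many")
fly <- data.frame(
  longevity = rnorm(100, mean = rep(c(60, 55, 40, 62, 61), each = 20), sd = 15),
  group     = factor(rep(lev, each = 20), levels = lev)
)

fit   <- aov(longevity ~ group, data = fly)
mse   <- deviance(fit) / df.residual(fit)               # pooled error variance
means <- as.numeric(tapply(fly$longevity, fly$group, mean))
n     <- as.numeric(tapply(fly$longevity, fly$group, length))

# One row per planned question; columns follow the order in lev.
K <- rbind(
  isolated.vs.rest = c( 4, -1, -1, -1, -1),
  virgin.vs.rest   = c(-2,  3,  3, -2, -2),
  eight.vs.fewer   = c(-2, -2,  3, -2,  3),
  one.vs.many      = c( 0,  0,  0,  1, -1)
)

est  <- drop(K %*% means)                # contrast estimates
se   <- sqrt(mse * drop(K^2 %*% (1 / n)))  # standard errors
tval <- est / se
pval <- 2 * pt(-abs(tval), df = df.residual(fit))
round(cbind(estimate = est, se = se, t = tval, p = pval), 4)
```

Each row of K sums to zero, which is what makes it a contrast. These four contrasts are not mutually orthogonal, so a multiplicity adjustment may be worth considering.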
4. (20 pts) In August and September 2005, Hurricanes Katrina and Rita caused extraordinary flooding in New Orleans, Louisiana. Many homes were severely damaged or destroyed; of those that survived, many required extensive cleaning. It was thought that cleaning flood-damaged homes might present a health hazard due to the large amounts of mold present in many of the homes. The article “Health Effects of Exposure to Water-Damaged New Orleans Homes Six Months After Hurricanes Katrina and Rita” (K. Cummings, J. Cox-Ganser, et al., American Journal of Public Health, 2008:869-875) looked at a sample of residents who had participated in the cleaning of one or more homes and a sample of residents who had not participated in cleaning.
Some members of each group experienced symptoms of wheezing. The data can be found in hurricane.csv.
The focus of this study is to compare the proportion of people who develop wheezing symptoms in the two population groups (those who participated in cleanup and those who did not).
- (a) Make a table of the results.
- (b) Generate a 95% confidence interval (using the Agresti Method) for the difference in the proportions for those with wheezing symptoms in the two groups. (This cannot be done directly with any of the R functions that I have taught you, but can be done in a couple of lines that you write yourself easily enough.) Be sure to interpret this confidence interval in the context of the study.
- (c) Suppose someone makes the claim that the frequency of wheezing symptoms is greater among those residents who participated in the cleaning of flood-damaged homes. State the null and alternative hypotheses using symbols for this situation.
- (d) Conduct the hypothesis test in part (c) using a z-test at the 5% significance level. Be sure to state the conclusion of the test and provide an interpretation of the conclusion in the context of this particular problem.
- (e) Instead of conducting a z-test to answer the question in part (c), conduct a chi-square test instead.
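The calculations in parts (b), (d), and (e) can be sketched as follows. This assumes that "the Agresti Method" refers to the Agresti-Caffo adjustment, which adds one success and one failure to each group before forming a Wald interval. The counts are made up for illustration and are not the contents of hurricane.csv.

```r
# Made-up counts for illustration only (not the contents of hurricane.csv).
x1 <- 30; n1 <- 100   # wheezing / total, participated in cleaning
x2 <- 20; n2 <- 100   # wheezing / total, did not participate

# Agresti-Caffo interval (assumed meaning of "the Agresti Method"):
# add one success and one failure to each group, then use the Wald formula.
p1.t <- (x1 + 1) / (n1 + 2)
p2.t <- (x2 + 1) / (n2 + 2)
se.t <- sqrt(p1.t * (1 - p1.t) / (n1 + 2) + p2.t * (1 - p2.t) / (n2 + 2))
ci   <- (p1.t - p2.t) + c(-1, 1) * qnorm(0.975) * se.t
ci

# One-sided pooled z-test of H0: p1 = p2 against Ha: p1 > p2.
p.pool <- (x1 + x2) / (n1 + n2)
z      <- (x1 / n1 - x2 / n2) / sqrt(p.pool * (1 - p.pool) * (1 / n1 + 1 / n2))
p.z    <- pnorm(z, lower.tail = FALSE)
c(z = z, p.value = p.z)

# The chi-square statistic without continuity correction equals z^2.
# Its P-value corresponds to the two-sided alternative.
chi <- prop.test(c(x1, x2), c(n1, n2), correct = FALSE)
chi
```

Because the chi-square test is inherently two-sided, its P-value is twice the one-sided z-test P-value whenever the observed difference is in the hypothesized direction.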
5. (30 pts) After reading the article “Determination of Carboxyhemoglobin Levels and Health Effects on Officers Working at the Istanbul Bosphorus Bridge” (G. Kocasoy and H. Yalin, Journal of Environmental Science and Health, 2004:1129-1139), you felt compelled to conduct your own study. This study, much like the one presented in the above paper, is concerned with assessing health outcomes of people working in an environment with high levels of carbon monoxide (CO).
To obtain data, you contacted a regional factory that routinely has workers who work in one of three shifts (Morning, Evening, Night). This factory also keeps a strict record of employees who report various medical ailments that may or may not be job related. These symptoms include Influenza, Headache, Weakness, and Shortness of Breath.
The data, found in workerhealth.csv, contains observations noting the shift of the worker and the symptom being reported.
Can you conclude that the proportions of workers with the various symptoms differ among the shifts? Test at the 1% level.
To answer this question, do the following:
- (a) Write a function in R called my.chi.test() that can carry out both chi-square tests for a single categorical variable and chi-square tests for contingency tables and has exactly the following format (I must be able to call your function):
my.chi.test <- function(x, y = NULL, p = rep(1/length(x), length(x))) {
  # your code goes here...
  return(c("X-squared" = xs, "df" = df.xs, "P-value" = pvalue))
}
- (b) Make a table of the results for all of the data. You should include all marginal totals and the grand total as well; the table() function will give you a partial answer.
- (c) Plot the data using a stacked relative frequency barplot with shift on the x-axis.
- (d) State the null hypothesis (using symbols only) and alternative hypothesis (using words only).
- (e) You must use your R function to compute the test statistic, the degrees of freedom, and the P-value. Of course, you can check your answer with chisq.test() to make sure your function is working. Show the R output when using your function on the data.
- (f) State the conclusion of the test and provide an interpretation of the conclusion in the context of this particular problem.
- (g) Regardless of the time of shift, it has been hypothesized that the proportion of workers who report influenza is 25%, headaches is 40%, weakness is 20%, and shortness of breath is 15%. Is the data consistent with this hypothesis?
Test this claim by first stating the null hypothesis (in symbols) and alternative hypothesis (in words). Show the appropriate data that is used as evidence. Then use your function (show the output) to conduct the test at the 5% significance level. Comment on your results.
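A minimal sketch of a function with the required signature is given below. It is one possible design, not the expected solution, and it rests on assumptions the exam does not state: x is taken to be either a vector (or one-way table) of counts, a two-way table of counts, or a raw categorical vector when y is also supplied. The data are simulated placeholders, not the contents of workerhealth.csv.

```r
my.chi.test <- function(x, y = NULL, p = rep(1/length(x), length(x))) {
  # Assumption: when y is given, x and y are raw categorical vectors.
  if (!is.null(y)) x <- table(x, y)
  if (is.matrix(x)) {
    # Contingency table: expected = row total * column total / grand total.
    expected <- outer(rowSums(x), colSums(x)) / sum(x)
    df.xs <- (nrow(x) - 1) * (ncol(x) - 1)
  } else {
    # Single variable: expected counts from the hypothesized proportions.
    expected <- sum(x) * p
    df.xs <- length(x) - 1
  }
  xs <- sum((x - expected)^2 / expected)
  pvalue <- pchisq(xs, df = df.xs, lower.tail = FALSE)
  return(c("X-squared" = xs, "df" = df.xs, "P-value" = pvalue))
}

# Simulated stand-in for workerhealth.csv; the values are assumptions.
set.seed(5)
shift   <- sample(c("Morning", "Evening", "Night"), 200, replace = TRUE)
symptom <- sample(c("Influenza", "Headache", "Weakness", "Shortness of Breath"),
                  200, replace = TRUE)

addmargins(table(symptom, shift))      # counts with marginal and grand totals
res.tab <- my.chi.test(shift, symptom) # test on the contingency table

# table() sorts labels alphabetically, so p must follow that order:
# Headache, Influenza, Shortness of Breath, Weakness.
p0 <- c(0.40, 0.25, 0.15, 0.20)
res.gof <- my.chi.test(table(symptom), p = p0)
res.tab
res.gof
```

Both results can be compared against chisq.test() as a check. The ordering of p is the most likely source of silent errors, since a misordered vector still produces a valid-looking statistic.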