MM914 Foundations Of Probability And Statistics
Statistical Inference Minitab Lab Assignment
Louise Kelly, Johnathan Love & Kate Pyper
Instructions
This assignment is worth 15% of the overall mark for this course.
Your project should be submitted as a Word document by 12 noon on 12 November 2018. At the beginning of the document, include your name and registration number.
The project involves a statistical analysis of data using Minitab, and you can download the data from MYPLACE. Below are a number of questions relating to the data sets; you should address these questions using the appropriate Minitab commands. Your final report should provide the answers to these questions with any required interpretation. There are marks for the correct answer and marks for using Minitab to compute the answer. The answer and the Minitab output used to compute it should be clearly shown in your report.
Your final submission should be clear and concise with only the minimum amount of relevant computer output included.
Question 1
The Minitab worksheet Movies contains data on 96 movies released in the period 2014-2015. There are 6 variables in the data set:
| Column | Name | Description |
|---|---|---|
| C1 | Release Date | Date movie was released |
| C2 | Release Year | Year movie was released |
| C3 | Movie | Title of movie |
| C4 | Budget (M $) | Cost of producing the movie |
| C5 | Box Office (M $) | Box office takings from U.S. release of movie |
| C6 | Genre | Type of movie |
- (a) Provide histograms and the appropriate measures of location and spread to summarise the movie budgets and their box office takings. For each, give a reason for your choice. (4 marks)
- (b) Recode the Horror, Musical, Romance and Thriller genres into an Other classification and summarise in a table the number and percentage of movies in each genre present in the data set. What are the two most common genres in this data set? (3 marks)
- (c) Perform a hypothesis test at the 5% significance level to determine if certain types of movies were more likely to be released in 2014 or in 2015, giving a clear interpretation of the results. (4 marks)
- (d) Create a scatterplot with Box Office on the y-axis and Budget on the x-axis with a clear title and axis labels. Describe what is observed from this plot. (3 marks)
- (e) What would be considered a valid reason for removing data from a data set? Assuming this is the case for the three highest valued budgets in the Movies data, remove these and compute the correlation coefficient between Box Office and Budget with the reduced data set. Explain what this correlation measures and how it should be interpreted. Justify whether or not the correlation is significant at the 1% significance level. (6 marks)
- (f) For the reduced data set, compute the least squares estimates for the regression line which could be used to predict Box Office from Budget. Interpret these estimates in the context of this scenario. (5 marks)
- (g) Produce the diagnostic plots and validate the assumptions associated with the least squares regression line you fitted in (f). (5 marks)
- (h) Using the least squares regression model produced in (f), determine the predicted box office takings for a movie with a total budget of $100,000,000. Compute the coefficient of determination and interpret this with respect to the fit of your regression model. Using this coefficient, comment on the reliability of your prediction. (5 marks)
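The quantities asked for in parts (f) and (h) can be checked by hand against Minitab's regression output. The sketch below is a minimal illustration under stated assumptions: the budget and box office values are made-up placeholders, not values from the Movies worksheet, and the helper name `fit_line` is hypothetical.

```python
# Minimal least squares sketch. All data values below are hypothetical
# placeholders and are NOT taken from the Movies worksheet.

def fit_line(x, y):
    """Return (intercept, slope, r_squared) for a simple linear regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Corrected sums of squares and cross-products
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx                      # least squares slope
    intercept = mean_y - slope * mean_x    # least squares intercept
    r_squared = sxy ** 2 / (sxx * syy)     # coefficient of determination
    return intercept, slope, r_squared


# Hypothetical values in the worksheet's units (millions of dollars)
budget = [1.0, 2.0, 3.0, 4.0, 5.0]
box_office = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope, r_squared = fit_line(budget, box_office)

# Prediction at a hypothetical budget of 3.5 (M $)
predicted = intercept + slope * 3.5

print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
print(f"R-squared = {r_squared:.4f}, prediction at 3.5 = {predicted:.3f}")
```

Because Budget is recorded in millions of dollars, the value substituted into the fitted line must be in the same units as column C4.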
Question 2
Medical researchers recorded blood cholesterol levels of 28 heart-attack victims 2, 4 and 14 days following the attack. The data are stored in the Minitab worksheet Cholesterol and coded as follows:
- (a) Produce a box-plot to show the cholesterol levels for each day and comment on the distribution of the cholesterol levels over time.
(4 marks)
- (b) Perform an ANOVA (using the one-way ANOVA option in Minitab) to assess whether there are any differences between the groups. Comment on the mean cholesterol level 4 days after a heart attack.
(4 marks)
- (c) Produce pairwise confidence intervals using the Tukey correction for multiple comparisons to determine which days, if any, have significantly different cholesterol levels.
(5 marks)
- (d) Discuss the assumptions of the ANOVA model and whether they are met for this data.
(5 marks)
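As a rough check on the one-way ANOVA table that Minitab reports, the F statistic can be built from the between-group and within-group sums of squares. The sketch below assumes three small made-up groups (not values from the Cholesterol worksheet) and uses a hypothetical helper name, `one_way_anova_f`. It stops at the F statistic and its degrees of freedom; the p-value requires the F distribution, which Python's standard library does not provide.

```python
# One-way ANOVA F statistic from first principles.
# The groups below are hypothetical placeholders, NOT the Cholesterol data.

def one_way_anova_f(groups):
    """Return (f_statistic, df_between, df_within) for a list of groups."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: variation of group means around the grand mean
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Within-group sum of squares: variation of observations around their group mean
    ss_within = sum(
        sum((v - m) ** 2 for v in g) for g, m in zip(groups, group_means)
    )

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_statistic = (ss_between / df_between) / (ss_within / df_within)
    return f_statistic, df_between, df_within


# Three hypothetical groups of equal size
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_statistic, df_between, df_within = one_way_anova_f(groups)
print(f"F = {f_statistic:.2f} on ({df_between}, {df_within}) degrees of freedom")
```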
Question 3
A rehabilitation centre is designing a study to look at the effects of their treatments on reducing smoking in their patients. The study will involve patient volunteers and the proposed measure of outcome is the level of nicotine (mg) detected in a patient’s body one week after being on one of two treatments being considered: either nicotine patches or e-cigarettes. The variability in levels of nicotine is likely to be similar to previous studies, and is estimated to be 9 mg. If the difference between the levels of nicotine was at least 2.5 mg on average, this would represent a clinically important difference between the treatments. It is likely that around one-fifth of patient volunteers will be lost to follow up.
- (a) State the appropriate null and alternative hypotheses for these tests.
(3 marks)
- (b) What is the total study size required in order to have an 80% probability of detecting such a difference between the treatments, at a 1% significance level, should it exist?
(4 marks)
- (c) What would be the difference in the total study sizes required, if one were to consider increasing the significance level to 5%? What effect does increasing the significance level have on the total study size required?
(3 marks)
- (d) By considering the design of the study in its current format, what two things would the rehabilitation centre need to ensure in order for the analysis you have conducted to be statistically valid?
(2 marks)
- (e) Data on nicotine levels in patients assigned to e-cigarettes are provided in the Minitab worksheet Nicotine. Perform an appropriate hypothesis test to assess whether the difference in nicotine levels is clinically important. Justify your choice of test and the significance level used.
(5 marks)
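The logic behind a sample size calculation for comparing two group means, including an allowance for loss to follow up, can be sketched as below. This is a minimal illustration under stated assumptions: it uses the normal approximation for a two-sided test with equal group sizes and a common standard deviation, the input values are hypothetical and deliberately differ from those in the question, and the function name `total_sample_size` is invented for the example. Minitab's own power and sample size results may differ slightly from this approximation.

```python
from math import ceil
from statistics import NormalDist


def total_sample_size(sigma, delta, alpha, power, dropout):
    """Normal-approximation sample size for a two-sided, two-group comparison of means.

    sigma   : assumed common standard deviation
    delta   : smallest difference in means worth detecting
    alpha   : significance level
    power   : desired probability of detecting delta
    dropout : expected proportion lost to follow up
    Returns (n_per_group, n_recruited_per_group, n_recruited_total).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = z.inv_cdf(power)
    # Completers needed in each group
    n_per_group = ceil(2 * (sigma * (z_alpha + z_power) / delta) ** 2)
    # Inflate recruitment so enough patients remain after dropout
    n_recruited = ceil(n_per_group / (1 - dropout))
    return n_per_group, n_recruited, 2 * n_recruited


# Hypothetical inputs, NOT the values given in the question
n_per_group, n_recruited, n_total = total_sample_size(
    sigma=10.0, delta=5.0, alpha=0.05, power=0.90, dropout=0.10
)
print(n_per_group, n_recruited, n_total)
```

Dividing by one minus the dropout proportion, rather than multiplying by one plus it, makes the number of patients expected to complete the study match the required sample size.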