G. Umphrey STAT*6950 Statistics for the Life Sciences Fall 2020 Assignment #4 (In Part)
Note: This assignment will be graded, in whole or in part (but mainly "in whole"). Details on submitting your work through Crowdmark will be forthcoming. Assume a 5% significance level unless otherwise specified. These are the first two questions; there are more to come!
1. W. S. Wilkinson et al. (1954) conducted an experiment to study the effect of two factors on plasma phospholipid in lambs. The two factors were time of bleeding (factor A), which was done either in the morning or afternoon, and the estrogenic compound diethylstilbestrol (factor B), which was either not administered (control) or administered (treated). The study utilized 20 lambs; five lambs were assigned at random to each of the four treatments. The data (from Steel and Torrie, 1980) are given in the table below. The units of plasma phospholipid seem to be mg per 100 ml plasma.
               AM (a1)                                PM (a2)
control 1 (a1b1)   treated 1 (a1b2)   control 2 (a2b1)   treated 2 (a2b2)
      8.53              17.53              39.14              32.00
     20.53              21.07              26.20              23.80
     12.53              20.80              31.33              28.87
     14.00              17.33              45.80              25.06
     10.80              20.07              40.20              29.33
You are going to analyze the data in R using multiple regression with the lm() function as well as using an "experimental design" approach using the aov() function. One objective for doing so is to gain some understanding of how the two approaches are related. We will also discover an advantage of having designs that are balanced. The data set, appended with various codings, is provided on CourseLink.
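As a starting point, the table above might be entered by hand and passed to both functions roughly as follows. This is only a sketch under stated assumptions: the column names (Response, Treatment, Time, Di) are placeholders chosen for illustration, and the file on CourseLink may name and code its variables differently.

```r
# Plasma phospholipid data (Steel and Torrie, 1980), entered by treatment.
# All column names are placeholders, not necessarily the CourseLink names.
a1b1 <- c( 8.53, 20.53, 12.53, 14.00, 10.80)  # AM, control
a1b2 <- c(17.53, 21.07, 20.80, 17.33, 20.07)  # AM, treated
a2b1 <- c(39.14, 26.20, 31.33, 45.80, 40.20)  # PM, control
a2b2 <- c(32.00, 23.80, 28.87, 25.06, 29.33)  # PM, treated

lambs <- data.frame(
  Response  = c(a1b1, a1b2, a2b1, a2b2),
  Treatment = factor(rep(c("a1b1", "a1b2", "a2b1", "a2b2"), each = 5)),
  Time      = factor(rep(c("AM", "PM"), each = 10)),
  Di        = factor(rep(rep(c("No", "Yes"), each = 5), times = 2))
)

# Check the balance of the design: counts per Time-by-Di cell
table(lambs$Time, lambs$Di)

# Experimental-design approach: one treatment factor, then 2 x 2 factorial
fit_crd <- aov(Response ~ Treatment, data = lambs)
fit_fac <- aov(Response ~ Time * Di, data = lambs)
summary(fit_crd)
summary(fit_fac)

# Regression approach on the same factors; anova() on an lm fit
# reports sequential sums of squares in the order the terms are entered
fit_lm <- lm(Response ~ Time * Di, data = lambs)
anova(fit_lm)
```

Entering the data by treatment column keeps the code visibly parallel to the printed table, which makes transcription errors easier to spot.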
(a). Analyze the data as a CRD with four treatments (one treatment factor), using R's aov() function. The model statement is in the general form "Response ~ Treatment". (The "Treatment" column will be coded with four levels. "Response" is whatever the response variable has been named; here that is the plasma phospholipid measurement.)
(b). Analyze the data as a CRD with a 2×2 factorial treatment design using the aov() function. The model statement is in the form "Response ~ A*B". (Substitute A and B with your factor names.)
(c). What is the relationship between the SS(Treatments) generated in (a) and the sums of squares generated in (b)?
(d). For the multiple regression analysis, two indicator variable coding schemes have been set up. The first coding scheme defines factor codes for the factor levels as follows. For Time, TimeCode = -1 if a.m., 1 if p.m. For Diethylstilbestrol, DiCode = -1 if No, 1 if Yes. Note the interaction variable, TimeXDi = TimeCode × DiCode. Run a multiple regression analysis using R's lm() function to predict the plasma phospholipid response, with independent variables TimeCode, DiCode, and TimeXDi. Enter the variables in the order listed.
(e). Run another regression analysis, but reorder model entry to: DiCode, TimeCode, TimeXDi.
(f). Run another regression analysis, but reorder model entry to: TimeXDi, TimeCode, DiCode.
(g). In Questions (d) to (f), how does the order of entry of the variables affect the Sum of Squares in the ANOVA table? Feel free to try other orders of entry if you wish.
(h). How do the ANOVAs from the regression analysis conducted in (d) to (f) compare to the ANOVA in (b)?
(i). Repeat Questions (d) to (f), but using the variables that result from a (0,1) indicator variable coding. When you compare ANOVA tables generated by the different dummy variable coding schemes, what key difference is apparent?
(j). We will now look at how loss of balance affects the analysis. Delete the last three observations from the treatment a2b1. Repeat questions (d) to (f) with this revised data set. How does the order of entry of the variables affect the Sum of Squares in the ANOVA table?
(k). Return to the full (balanced) data set. Obtain side-by-side interaction plots, varying which factor is on the X-axis. (Label your graphs well.) Briefly interpret the nature of the apparent interaction.
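The regression runs in (d) to (j) and the plots in (k) could be organized along the following lines. This is a hedged sketch, not a prescribed solution: the helper seq_ss(), the data frame names, and the (0,1) column names Time01, Di01 and TimeXDi01 are hypothetical, and the coded columns in the CourseLink file may be named differently.

```r
# Data entered by treatment; the coded columns follow the (-1, 1) scheme in (d).
lambs <- data.frame(
  Response = c( 8.53, 20.53, 12.53, 14.00, 10.80,   # a1b1: AM, control
               17.53, 21.07, 20.80, 17.33, 20.07,   # a1b2: AM, treated
               39.14, 26.20, 31.33, 45.80, 40.20,   # a2b1: PM, control
               32.00, 23.80, 28.87, 25.06, 29.33),  # a2b2: PM, treated
  TimeCode = rep(c(-1, 1), each = 10),
  DiCode   = rep(rep(c(-1, 1), each = 5), times = 2)
)
lambs$TimeXDi <- lambs$TimeCode * lambs$DiCode

# (0, 1) indicator coding for (i); these column names are hypothetical
lambs$Time01    <- (lambs$TimeCode + 1) / 2
lambs$Di01      <- (lambs$DiCode + 1) / 2
lambs$TimeXDi01 <- lambs$Time01 * lambs$Di01

# Hypothetical helper: sequential sums of squares for a given order of entry
seq_ss <- function(terms, data) {
  fit <- lm(reformulate(terms, response = "Response"), data = data)
  tab <- anova(fit)
  setNames(tab[["Sum Sq"]], rownames(tab))
}

orders <- list(
  d = c("TimeCode", "DiCode", "TimeXDi"),
  e = c("DiCode", "TimeCode", "TimeXDi"),
  f = c("TimeXDi", "TimeCode", "DiCode")
)

# Balanced data, (-1, 1) coding, three orders of entry
bal <- lapply(orders, seq_ss, data = lambs)
bal

# Balanced data, (0, 1) coding, two of the orders for comparison
seq_ss(c("Time01", "Di01", "TimeXDi01"), lambs)
seq_ss(c("TimeXDi01", "Time01", "Di01"), lambs)

# Unbalanced data for (j): drop the last three a2b1 observations (rows 13 to 15)
unbal <- lambs[-(13:15), ]
unb <- lapply(orders, seq_ss, data = unbal)
unb

# Side-by-side interaction plots for (k), swapping the factor on the X-axis
Time <- factor(lambs$TimeCode, levels = c(-1, 1), labels = c("AM", "PM"))
Di   <- factor(lambs$DiCode, levels = c(-1, 1), labels = c("Control", "Treated"))
op <- par(mfrow = c(1, 2))
interaction.plot(Time, Di, lambs$Response, trace.label = "Diethylstilbestrol",
                 xlab = "Time of bleeding",
                 ylab = "Mean plasma phospholipid (mg per 100 ml)")
interaction.plot(Di, Time, lambs$Response, trace.label = "Time of bleeding",
                 xlab = "Diethylstilbestrol",
                 ylab = "Mean plasma phospholipid (mg per 100 ml)")
par(op)
```

Collecting the sums of squares in named vectors, rather than reading them off separate printouts, makes the comparison across orders of entry and coding schemes easier to lay out side by side.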
2. I found the data set "Animals" floating about on the internet; it is being provided as I found it. The data set contains Body Weight (in kg) and Brain Weight (in g) measurements for 27 species of animals; most are mammals but three are dinosaurs. We'll use simple linear regression and correlation analysis to investigate the association of brain weight with body weight. (Only parts of this will be handed in. You may not want to save all of the graphs, once you have taken a look at them.)
(a). First produce a scatterplot (in R) of brain weight on body weight. This (to me) looks like a bit of a mess. What characteristic of the data makes it difficult to visualize the relationship between the two variables on this graph?
(b). Logarithmically transform both variables using base 10 logarithms. Produce a scatterplot of log brain weight on log body weight, and superimpose the simple linear regression line. Would you agree that this looks more promising?
(c). You will note that the three species of dinosaurs really distort the fitted line, because dinosaurs have very small brains relative to the size of their bodies. Obtain measures of leverage and influence for these three data points and briefly interpret them.
(d). Create a variant data set with the three dinosaur species deleted from it. Now repeat question (b) with this revised data set. Obtain a box plot of the residuals from the model. What residual values have been denoted as outliers? What species are the outliers associated with? What measures of influence (i.e., Cook's Distance) do you get for these outliers?
(e). In (d) you should have found that the most extreme outlier belonged to the mountain beaver. Taking a closer look at the data, I thought it made no sense that a 1.35 kg rodent would have a brain weight of 465 g; that is about 1/3 of the animal's weight! So I checked the source from which the data had been harvested, Rousseeuw & Leroy's Robust Regression and Outlier Detection. Sure enough, the correct value for the brain weight is 8.1 g. So reanalyze the mammal data with this value corrected (that is, repeat question (d) with the corrected data).
(f). What are the two species with the most extreme outliers now? Are you suspicious of the brain and body weight measurements for these two species? Why or why not?
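The workflow in (b) to (e) might be sketched as below. The course file is not reproduced here, so the sketch assumes the Animals data frame from the MASS package as a stand-in. That version is not identical to the file described above (it lists 28 species and already carries 8.1 g for the mountain beaver), so its output will not match what the course data give. The object names are placeholders.

```r
library(MASS)  # Animals: body weight (kg) and brain weight (g); row names are species

# Stand-in for the course file (an assumption; see the note above)
animals <- Animals
animals$logBody  <- log10(animals$body)
animals$logBrain <- log10(animals$brain)

# (b) Log-log scatterplot with the fitted simple linear regression line
fit_all <- lm(logBrain ~ logBody, data = animals)
plot(logBrain ~ logBody, data = animals,
     xlab = "log10 body weight (kg)", ylab = "log10 brain weight (g)")
abline(fit_all)

# (c) Leverage and Cook's distance for the three dinosaurs.
# Pattern matching is used because species spellings vary between sources.
is_dino  <- grepl("docus|Triceratops|Brachiosaurus", rownames(animals))
diag_all <- data.frame(leverage = hatvalues(fit_all),
                       cooks    = cooks.distance(fit_all))
diag_all[is_dino, ]

# (d) Refit without the dinosaurs and examine the residuals
mammals <- animals[!is_dino, ]
fit_mam <- lm(logBrain ~ logBody, data = mammals)
res     <- residuals(fit_mam)
boxplot(res, ylab = "Residual (log10 g)")
out <- boxplot.stats(res)$out        # points beyond 1.5 x IQR from the hinges
out
cooks.distance(fit_mam)[names(out)]  # influence of the flagged species

# (e) With the course file, the mountain beaver's brain weight would be
# corrected to 8.1 g at this point and the steps in (d) rerun.
```

Working from row names keeps each residual and diagnostic tied to its species, which is what parts (d) and (f) ask about.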