```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r echo = FALSE}
library(car)
library(kableExtra)

model.aov <- function(lm.model){
  # sources of variation
  source.of.variation <- c("Model", "Residuals", "Total")
  # sum of squares
  outcome.var <- as.matrix(lm.model$model[1])
  ssm <- sum((lm.model$fitted.values - mean(outcome.var))^2)
  sse <- sum((outcome.var - lm.model$fitted.values)^2)
  tss <- ssm + sse
  sum.of.squares <- c(ssm, sse, tss)
  # degrees of freedom
  n <- length(lm.model$fitted.values)
  p <- length(lm.model$coefficients) - 1
  df.1 <- p
  df.2 <- n - p - 1
  df.total <- n - 1
  df.list <- c(df.1, df.2, df.total)
  # mean square
  mod.ms <- ssm / df.1
  mse <- sse / df.2
  mean.square <- c(mod.ms, mse, NA)
  # F
  f.value <- c(mod.ms / mse, NA, NA)
  # significance
  p.f <- pf(f.value[1], df1 = df.1, df2 = df.2, lower.tail = F)
  sig <- c(p.f, NA, NA)
  # adjusted r squared
  adj.r2 <- c(summary(lm.model)$adj.r.squared, NA, NA)
  # final model table
  table.prep <- signif(cbind(sum.of.squares, df.list, mean.square, f.value, sig, adj.r2), 6)
  table <- as.table(cbind(source.of.variation, table.prep))
  return(table)
}
```

The purpose of this assignment is to give you practice exploring your data, constructing and interpreting linear models that include both continuous and categorical predictors, and evaluating those models using diagnostic tests. Lab 9 should provide you with most of the steps and R code needed to complete this assignment.

You are interested in determining _what influences the overall rating of a professor_. Using the Ratemyprof.csv dataset, you are going to construct different regression models to see which factors may be related to the overall rating. To review how the variables were measured, please use Lab 2.

The written portion of the assignment should be brief. Please include _at least two_ predictor variables in your models. Consider using both continuous and categorical variables.

Total: 32 marks

### Explore the data

First read in the data. You may also wish to take a look at the data set structure and remind yourself of what the variables are.

### Setting up the model

Complete the following questions for each possible explanatory variable in the data set with the outcome variable. [Total: 7 marks]

1.a) Construct appropriate graphs showing the relationship between each explanatory variable (there are 7) and the response variable. [3.5 marks]

```{r}
#Type your code for the plots here
```

1.b) Indicate what statistical test would be most appropriate to check whether there is a statistical relationship between each explanatory variable and the response variable (e.g. two-sample t-test). Include results here, with one sentence for each stating whether the result was statistically significant. [3.5 marks]

```{r}
#Type your code for the statistical tests here
```

### Building candidate models

Create two candidate models. Choose whichever variables you think are important to be your explanatory variables (use your answers from question 1 to guide you). Feel free to add some interaction variables. You may wish to construct more than two models to determine which variables are most important, but only report two.

2.a) Provide a regression table (`summary()`) for each candidate model showing variable names, unstandardized coefficients, standard errors and p-values. [2 marks]

2.b) Provide a model ANOVA table (`model.aov()`) for each candidate model that shows the model, residuals and total rows with the sum of squares, df, mean square values, the F-value, significance, and adjusted R-squared value. [2 marks]

3. From the candidate models you created, choose one to be your final model. Provide a justification for your choice.
Note: Consider performing all analyses mentioned in this assignment for _both_ models to help provide justification. [2 marks]

_FOR THE REMAINING QUESTIONS, ONLY PROVIDE RESULTS FROM YOUR FINAL MODEL (CHOSEN IN QUESTION 3)_

### Interpretation

4.a) What is the generic statistical hypothesis for _a coefficient_ (null and alternate)? [1 mark]

4.b) _Interpret_ each coefficient in your model. Indicate direction, magnitude and significance, with a maximum of one sentence per coefficient. (Please also include your final table of coefficients here.) [3 marks]

These questions relate to your model ANOVA table. [Total: 3 marks]

5.a) What is the statistical hypothesis for the overall model (null and alternate)? [1 mark]

5.b) Which test statistic is testing this hypothesis? [1 mark]

5.c) What does the adjusted R^2 value tell you? Interpret the value given to you in your model. [1 mark]

### Evaluating assumptions

6.a) Two assumptions of a linear model are normality of residuals and no multicollinearity. Does your model violate these assumptions? Assess the normality of residuals using appropriate plots, and assess multicollinearity using `vif()` (a VIF > 5 indicates high correlation). [2 marks]
6.b) Now name two other assumptions of a linear model and test to see if your model violates them. Justify each answer. Limit your answer to one to two sentences per point. Please include graphs and test results where appropriate. [2 marks]
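As a rough guide, below is a minimal sketch of how the model fitting and these assumption checks might look. The data frame name `ratemyprof`, the model object `final.model`, and the column names `overall.rating`, `easiness`, and `gender` are placeholders rather than the actual variables in Ratemyprof.csv, so the chunk is set to `eval = FALSE`; substitute the variables you chose.

```{r, eval = FALSE}
# Sketch only: all object and column names below are placeholders.
ratemyprof <- read.csv("Ratemyprof.csv")

# A candidate model with one continuous and one categorical predictor
final.model <- lm(overall.rating ~ easiness + gender, data = ratemyprof)

# Regression table (2.a) and model ANOVA table (2.b)
summary(final.model)
kable(model.aov(final.model))

# Normality of residuals (6.a): normal quantile plot of the residuals
qqnorm(residuals(final.model))
qqline(residuals(final.model))

# Multicollinearity (6.a): VIFs above 5 suggest problematic correlation
vif(final.model)
```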
### Checking your residuals
7. Check for outliers and look for patterns in your residuals. Provide one or two sentences describing each plot. Provide one residual plot:
and one normal quantile plot:
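A minimal sketch of the two plots, again assuming a fitted model object called `final.model` (a placeholder name, so the chunk is set to `eval = FALSE`):

```{r, eval = FALSE}
# Residuals vs. fitted values: look for outliers and any pattern or funnel shape
plot(final.model, which = 1)

# Normal quantile (Q-Q) plot of the standardized residuals
plot(final.model, which = 2)
```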
### Evaluating your model
8. In a few sentences, evaluate your model. Describe at least one good aspect and one drawback of your model. [2 marks]
### Calculating an F-value
9. You are interested in seeing what factors influence the number of bees that visit a plant. You have 100 plants and are able to monitor the number of bees that visit each plant in a day. You take measurements of plant height, the number of flowers, and the number of other insects present on each plant. You use these variables to construct your model.
The sum of squares has already been calculated for you. Please complete the rest of the calculations for the table using the code below as a scaffold. Values you need to calculate are indicated. It may be helpful to write out the model for yourself. [2 marks]
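For reference, one way to write out the model, using shorthand names for the three measurements described above:

$$\text{bees}_i = \beta_0 + \beta_1\,\text{height}_i + \beta_2\,\text{flowers}_i + \beta_3\,\text{insects}_i + \varepsilon_i$$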
```{r}
#sources of variation
source.of.variation<-c("Model", "Residuals", "Total")
#sum of squares
sum.of.squares<-c(29.1,1954.6,1983.7)
#degrees of freedom
#You'll need to calculate the number of degrees of freedom for the model and for the residuals
n<-NA #change this
p<-NA #change this
df.1<-p
df.2<-n-p-1
df.total<-n-1
df<-c(df.1, df.2, df.total)
#Mean Square
#calculate model MS and MS error
mod.ms<-NA #change this
mse<-NA #change this
mean.square<-c(mod.ms, mse, NA)
#F
#Calculate F value
calc.f <- NA #change this
f.value<-c(calc.f, NA,NA)
#sig
p.f<-pf(f.value[1], df1=df.1, df2=df.2, lower.tail=F)
sig<-c(p.f, NA, NA)
#final models
table.prep<-signif(cbind(sum.of.squares, df, mean.square, f.value, sig),5)
table<-as.table(cbind(source.of.variation, table.prep))
kable(table)
```