GGR376
Assignment 2: Regression (44 Marks)
Regression: Modelling the relationship between a response (or dependent variable) and one or more explanatory variables (or independent variables). Linear regression is a linear approach to modelling that relationship.
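As a minimal illustration of the definition above, the sketch below fits a linear regression to simulated data. The variable names and numbers are invented for the example and are not census fields or expected results.

```r
# Simulated data only: names and values are placeholders for illustration.
set.seed(376)
median_income <- runif(100, min = 40, max = 120)                 # explanatory variable
house_price   <- 150 + 3 * median_income + rnorm(100, sd = 20)   # response variable

# Model the response as a linear function of the explanatory variable:
# house_price = intercept + slope * median_income + error
example_model <- lm(house_price ~ median_income)

# Coefficients, R-squared, and the overall F-test for the fitted model.
summary(example_model)
```

The estimated slope should land close to the value of 3 used to simulate the data, which is a quick way to confirm the model is recovering the relationship.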
Before completing the assignment, review the example R markdown document from Tutorial 4.
NOTE: Join the spatial data at the beginning, as it causes issues to do it at the end.
Research Problem:
Produce an explanatory regression model for the variation in housing costs by census tract in the City of Hamilton, Ontario, Canada.
Data:
Hamilton Census Tract boundaries, which include the average house price and the unique identifier: CTUID.
You can access the data with the following command and URL:
library(rgdal)
rgdal::readOGR("https://raw.githubusercontent.com/gisUTM/GGR376/master/Lab_1/houseValues.geojson")
You will need to obtain 10 potential explanatory variables from the 2016 Census Data, available from CHASS: http://dc2.chass.utoronto.ca.myaccess.library.utoronto.ca/census/
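The join between the tract data and your census variables might look like the following minimal sketch. It uses two small invented tables instead of the real files: only the CTUID column name comes from the assignment, and every other name and value is made up. It also assumes the identifier is stored as text, which you should confirm in your own data.

```r
# Stand-in for the attribute table of the census tract boundaries (invented values).
tract_table <- data.frame(
  CTUID = c("5370001.01", "5370001.10", "5370002.00"),
  house_value = c(410000, 385000, 520000),
  stringsAsFactors = FALSE
)

# Stand-in for a census download, written to a temporary CSV file.
census_csv <- tempfile(fileext = ".csv")
write.csv(
  data.frame(
    CTUID = c("5370001.01", "5370001.10", "5370002.00"),
    median_income = c(71000, 64000, 98000)
  ),
  census_csv,
  row.names = FALSE
)

# Read the identifier as text. Read as a number, "5370001.10" would become
# 5370001.1 and would no longer match the identifier in the boundary data.
census_table <- read.csv(census_csv, colClasses = c(CTUID = "character"))

# Join on the shared identifier, keeping every tract.
joined <- merge(tract_table, census_table, by = "CTUID", all.x = TRUE)

# Every tract should have picked up its census value.
stopifnot(nrow(joined) == nrow(tract_table), !anyNA(joined$median_income))
```

Checking the row count and missing values immediately after the join makes mismatched identifiers visible before any modelling starts.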
Assignment Format:
The assignment submission will be composed of three files.
1. An R script of your code produced during the project, with the .R file extension.
2. A CSV file of the additional input data you utilized in your model (one table).
3. Answers to the questions listed below in a PDF file.
All three files must be submitted online.
Assignment Requirements:
• Ensure all procedures from the lab tutorial are replicated in your work.
• Fit and test 10 linear regression models.
o Example model names: model_1, model_2, etc.
o All models should remain in the code.
o Rename your final model: final_model
• The final model must meet all assumptions, with one possible exception:
o Independent errors, due to spatial autocorrelation.
▪ Validate the independent errors assumption in your model with spatial autoregressive modelling.
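The naming requirements above could be met along the lines of this sketch. The data are simulated and the variable names are placeholders. Only model_1, model_2, and final_model follow the naming given in the assignment, and the choice of model_2 as the final model is arbitrary here.

```r
# Simulated tract data: all variable names and values are placeholders.
set.seed(1)
tracts <- data.frame(
  pct_university = runif(150, 5, 60),
  median_income  = runif(150, 40, 120),
  pct_renters    = runif(150, 5, 80)
)
tracts$log_house_value <- 12 + 0.01 * tracts$pct_university +
  0.004 * tracts$median_income + rnorm(150, sd = 0.15)

# Each candidate model keeps its own name so that all of them remain in the script.
model_1 <- lm(log_house_value ~ pct_university, data = tracts)
model_2 <- lm(log_house_value ~ pct_university + median_income, data = tracts)
model_3 <- update(model_2, . ~ . + pct_renters)

# Compare the candidates, here by adjusted R-squared.
sapply(
  list(model_1 = model_1, model_2 = model_2, model_3 = model_3),
  function(model) summary(model)$adj.r.squared
)

# Copy the chosen model to the required name; the numbered models stay in place.
final_model <- model_2
```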
GRADING
R Script: 10 Marks
The script you submit should be fully reproducible, which means the TA should be able to run your script without modification. The only allowable modification would be the file path for the CSV file of your additional input variables. Review the R Script grading scale below.
The general structure of your R script should follow:
1. Data Munging:
a. Reading Data
b. Merging Data
2. Graphical Analysis Pre-Check
3. Data Transformations
4. Correlation Assessment
5. Model Fitting and model assumption assessment (10 models)
a. If one assumption is broken you can continue to the next model.
i. No need to test every assumption in that case
6. Spatial Autocorrelation Assessment
7. Spatial Autoregressive Modelling
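As a rough sketch of steps 4 and 5, the code below screens simulated explanatory variables for strong pairwise correlation and then checks one fitted model with base R tools. The 0.8 cut-off, the variable names, and the choice of the Shapiro-Wilk test are assumptions made for this example; follow the tests used in the tutorial for your own work.

```r
# Simulated explanatory variables: names and values are placeholders.
set.seed(42)
n_tracts <- 200
median_income  <- rnorm(n_tracts, mean = 80, sd = 15)
average_income <- 1.1 * median_income + rnorm(n_tracts, sd = 3)  # nearly duplicates median_income
pct_renters    <- runif(n_tracts, 5, 80)
explanatory <- data.frame(median_income, average_income, pct_renters)

# Step 4: correlation assessment. Flag pairs above an assumed threshold of 0.8.
correlations <- cor(explanatory)
high_pairs <- which(abs(correlations) > 0.8 & upper.tri(correlations), arr.ind = TRUE)
print(high_pairs)  # row and column positions of strongly correlated pairs

# Step 5: fit a model that uses only one variable from the correlated pair.
house_value <- 200 + 2 * median_income - 0.5 * pct_renters + rnorm(n_tracts, sd = 10)
check_model <- lm(house_value ~ median_income + pct_renters)

# Normality of residuals (Shapiro-Wilk): a small p-value suggests non-normal residuals.
normality <- shapiro.test(residuals(check_model))
print(normality$p.value)

# The four standard diagnostic plots, arranged in a 2 x 2 grid.
par(mfrow = c(2, 2))
plot(check_model)
par(mfrow = c(1, 1))
```

If a check fails for a model, the structure above allows moving straight on to the next model, as noted in step 5a.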
R Script Grading:
10 / 10: The code is properly documented with comments and detailed variable names. No issues are present in the code. A person versed in R should be able to read through the code in one attempt.
9 / 10: The code is well documented. A single error, inconsistency, poor variable name or documentation is present. A reviewer may need to make a single check of previous code to interpret.
8 / 10: The code is documented. A couple of errors, inconsistencies, poor variable names or documentation issues are present. A reviewer may need to make multiple checks of previous code to interpret.
7 / 10: The code is documented. A few errors, inconsistencies, poor variable names or documentation issues are present. A reviewer needs to make multiple checks of previous code to interpret but can understand all sections of the code.
6 / 10: The code is partially documented. Errors, inconsistencies, and poor variable names are present. A reviewer needs to make multiple checks of previous code to interpret and may not completely understand all sections of the code.
5 / 10: The code is sparsely documented. Many errors, inconsistencies, and poor variable names are present. A reviewer needs to make multiple checks of previous code to interpret and does not completely understand all sections of the code.
4 or below: Many inconsistencies in the code. It could not be reproduced by another researcher without many questions directed to the original author.
Missing assignment requirements in the code will also reduce your mark.
• Too few linear models in the code (-1 for each missing model)
• Final model is not renamed to final_model (-1)
• Model Assumptions not tested (-1 for each assumption)
• Moran’s I not tested correctly (-2)
• Code will not run when tested (-3)
• Other errors will be penalized as appropriate.
CSV File: 2 Marks
The CSV file should contain all the variables that you obtained from the Census for testing in your model. It must contain 10 variables.
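Before submitting, a quick check of the CSV file might look like the sketch below. It builds a small invented table with placeholder column names, confirms that it holds 10 variables besides the identifier, and computes the minimum, maximum, and mean of each. The column names and values are assumptions for illustration only.

```r
# Invented table: CTUID plus ten placeholder variables with made-up values.
census_variables <- data.frame(
  CTUID = c("5370001.01", "5370001.02", "5370002.00"),
  stringsAsFactors = FALSE
)
set.seed(10)
for (i in 1:10) {
  census_variables[[paste0("placeholder_", i)]] <- round(runif(3, 0, 100), 1)
}

# Write the single table to a CSV file (a temporary file here).
csv_path <- tempfile(fileext = ".csv")
write.csv(census_variables, csv_path, row.names = FALSE)

# Read it back the way a marker would and count the variables.
submitted <- read.csv(csv_path, colClasses = c(CTUID = "character"))
variable_columns <- setdiff(names(submitted), "CTUID")
stopifnot(length(variable_columns) == 10)

# Minimum, maximum, and mean of each variable.
variable_summary <- data.frame(
  variable = variable_columns,
  min  = sapply(submitted[variable_columns], min),
  max  = sapply(submitted[variable_columns], max),
  mean = sapply(submitted[variable_columns], mean),
  row.names = NULL
)
print(variable_summary)
```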
Questions (32 Marks)
All figures must include a figure caption.
1. Complete the following table. (1 Mark)
2. Complete the following table. (2 Marks)
3. Produce a publication quality histogram of the dependent variable (transformed if you did a transformation). (3 Marks)
4. Write 50 words on why you did or did not transform your dependent variable based on the assumptions of the linear regression model. (2 Marks)
5. Describe in 200 words your process of model fitting. Address the selection of variables, how you decided to remove or add variables, and the way you assessed each assumption. (4 Marks)
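For the histogram and transformation questions above, one possible starting point is sketched below with simulated, right-skewed values. The log transformation, the axis labels, and the skewness helper are assumptions for illustration. Note that the normality assumption concerns the model residuals, so a histogram of the dependent variable is only a pre-check.

```r
# Simulated right-skewed values standing in for average house prices.
set.seed(7)
house_value <- rlnorm(180, meanlog = 12.9, sdlog = 0.4)
log_house_value <- log(house_value)

# Simple moment-based skewness: values near 0 indicate a symmetric distribution.
skewness <- function(values) {
  mean((values - mean(values))^3) / sd(values)^3
}
skewness(house_value)      # before transformation
skewness(log_house_value)  # after transformation

# Histogram with descriptive axis labels and no default title,
# since the figure caption carries the description.
hist(
  log_house_value,
  breaks = 20,
  main = "",
  xlab = "Log of average house value (CAD)",
  ylab = "Number of census tracts",
  col = "grey80",
  border = "white"
)
```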
Note on the R script grade: to achieve a mark above 8 / 10, you will likely need to rewrite your code after working through the assignment to ensure clarity.
Tables for Questions 1 and 2:
Variable Name in CSV File | Min | Max | Mean | Variable Description
Variable Name in CSV File | Reason why you selected the variable
6. Complete the following table. (2 Marks)
7. For your final linear regression model, produce a figure from the 4 plots generated by plot(linear_model). (2 Marks)
8. Produce a publication quality figure of residuals vs fitted values for your final linear regression model. (3 Marks)
9. Calculate Moran’s I for your residuals. Report, in 50 words, your value of Moran’s I and how you interpret it. (3 Marks)
10. Write 150 words interpreting your final linear regression model. (4 Marks)
11. Would you require a spatial autoregressive model? Explain how you would have chosen the model to use. (3 Marks)
12. Produce a map of a spatial autoregressive model’s residuals. (3 Marks)
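To make the Moran’s I calculation concrete, the sketch below computes it by hand in base R on a simulated 8 x 8 grid, with a permutation test for significance. The grid, the rook-contiguity weights, and the simulated residuals are all assumptions for illustration. With real census tracts you would build the neighbours from the tract polygons, for example with the spdep package and its moran.test function, as in the tutorial.

```r
# Simulated 8 x 8 grid of cells standing in for census tracts.
set.seed(376)
grid_size <- 8
cells <- expand.grid(col = 1:grid_size, row = 1:grid_size)
n_cells <- nrow(cells)

# Rook contiguity: cells that share an edge are neighbours (Manhattan distance of 1).
distance <- as.matrix(dist(cells, method = "manhattan"))
weights <- (distance == 1) * 1
weights <- weights / rowSums(weights)  # row-standardise so each row sums to 1

# Moran's I = (n / sum of weights) * (weighted cross-products / sum of squares).
morans_i <- function(values, weights) {
  deviations <- values - mean(values)
  n <- length(values)
  (n / sum(weights)) * sum(weights * outer(deviations, deviations)) / sum(deviations^2)
}

# Residuals with an east-west trend are spatially clustered; pure noise is not.
clustered_residuals <- cells$col + rnorm(n_cells, sd = 0.5)
random_residuals <- rnorm(n_cells)

observed_i <- morans_i(clustered_residuals, weights)
random_i <- morans_i(random_residuals, weights)
print(c(clustered = observed_i, random = random_i))

# Permutation test: shuffle the values across cells to build a reference distribution.
permuted_i <- replicate(999, morans_i(sample(clustered_residuals), weights))
p_value <- (sum(permuted_i >= observed_i) + 1) / (999 + 1)
print(p_value)
```

A clearly positive value with a small p-value indicates that similar residuals cluster in space, which is the situation in which the independent errors assumption is in doubt.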
Table for Question 6:
Model Name | R² | p < 0.05 (Y/N) | List Assumption(s) Violated or All assumptions met?
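The model name, R², and significance columns of the model table could be filled in programmatically, as in this sketch on simulated data. The data, the variable names, and the helper overall_p_value are hypothetical. The assumptions column is left empty because it depends on your own diagnostic checks.

```r
# Simulated tract data: names and values are placeholders.
set.seed(2016)
tracts <- data.frame(
  median_income = runif(120, 40, 120),
  pct_renters   = runif(120, 5, 80)
)
tracts$log_house_value <- 12 + 0.005 * tracts$median_income + rnorm(120, sd = 0.2)

model_1 <- lm(log_house_value ~ pct_renters, data = tracts)
model_2 <- lm(log_house_value ~ pct_renters + median_income, data = tracts)

# Hypothetical helper: p-value of the overall F-test reported by summary().
overall_p_value <- function(model) {
  f <- summary(model)$fstatistic
  unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
}

# One row per model; the assumptions column is filled in by hand afterwards.
models <- list(model_1 = model_1, model_2 = model_2)
model_table <- data.frame(
  model_name = names(models),
  r_squared = sapply(models, function(model) summary(model)$r.squared),
  significant = ifelse(sapply(models, overall_p_value) < 0.05, "Y", "N"),
  assumptions_violated = NA_character_,
  row.names = NULL
)
print(model_table)
```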