Lecture 6: Regression Part 2
GGR376
Dr. Adams
Model Interpretation: Coefficients
library(ggplot2) # provides the mpg data set
model_mpg <- lm(cty ~ displ + cyl, data = mpg)
summary(model_mpg)$coefficients
             Estimate Std. Error   t value      Pr(>|t|)
(Intercept) 28.288512  0.6876399 41.138555 2.721700e-108
displ       -1.197882  0.3407738 -3.515181  5.287524e-04
cyl         -1.234654  0.2731967 -4.519285  9.908652e-06
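To make the interpretation concrete, the sketch below plugs the reported estimates into the fitted equation. The car used here (2.0 L engine, 4 cylinders) and the helper name `predict_cty` are illustrative assumptions, not part of the lecture.

```r
# Coefficient estimates copied from the summary output above
b <- c(intercept = 28.288512, displ = -1.197882, cyl = -1.234654)

# Predicted city mpg for a hypothetical car (helper name is illustrative)
predict_cty <- function(displ, cyl) {
  unname(b["intercept"] + b["displ"] * displ + b["cyl"] * cyl)
}

base_car   <- predict_cty(displ = 2.0, cyl = 4)
bigger_eng <- predict_cty(displ = 3.0, cyl = 4)

# Holding cyl fixed, one extra litre of displacement changes the
# prediction by exactly the displ coefficient
bigger_eng - base_car
```

Each coefficient is the expected change in cty for a one-unit increase in that predictor, with the other predictor held constant.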
Model Interpretation: Adjusted \(R^2\)
summary(model_mpg)$adj.r.squared
[1] 0.6641936
Application
How do we use a linear regression model?
Explanatory
Used to understand the relationships in existing data.
Coefficients: when x increases by one unit, how does y change?
Predictive
Extending the known relationships in our data to unseen observations.
Powerful, but requires more analysis steps.
Cross-Validation
Leave One Out (LOO)
Useful for smaller data samples
Sub-setting
Training Data
Testing Data
Required for Predictive Models!
Cross-Validation LOO
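A minimal sketch of leave-one-out cross-validation in base R. It uses the built-in `mtcars` data as a stand-in (an assumption, chosen so the example runs without extra packages) and a model of the same shape as the lecture's `cty ~ displ + cyl`.

```r
data(mtcars)

n <- nrow(mtcars)
loo_pred <- numeric(n)

for (i in seq_len(n)) {
  # Fit on every row except i, then predict the held-out row
  fit <- lm(mpg ~ disp + cyl, data = mtcars[-i, ])
  loo_pred[i] <- predict(fit, newdata = mtcars[i, ])
}

# Out-of-sample error: each prediction came from a model that never saw that row
loo_rmse <- sqrt(mean((mtcars$mpg - loo_pred)^2))

# In-sample error of the model fitted to all rows, for comparison
full_fit <- lm(mpg ~ disp + cyl, data = mtcars)
in_sample_rmse <- sqrt(mean(resid(full_fit)^2))

c(loo = loo_rmse, in_sample = in_sample_rmse)
```

The LOO error is the more honest estimate of how the model would perform on new observations.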
Cross-Validation Subsetting
Predictive Modelling
Split the data
Training Data ~80%
Testing Data, remaining
Fit the model to the training data.
predict() the testing data using the model.
Compare predicted vs. actual of testing data.
Repeat
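The steps above can be sketched as follows. The data set (`mtcars`), the seed, the number of repeats and the helper name `split_rmse` are all assumptions made for illustration; the in-class demo used `mpg`.

```r
data(mtcars)
set.seed(376) # arbitrary seed so the splits are reproducible

# One round: split, fit on training data, predict testing data, compare
split_rmse <- function(data) {
  n <- nrow(data)
  train_rows <- sample(n, size = floor(0.8 * n)) # ~80% for training
  train <- data[train_rows, ]
  test  <- data[-train_rows, ]                   # remaining rows for testing

  fit  <- lm(mpg ~ disp + cyl, data = train)
  pred <- predict(fit, newdata = test)

  sqrt(mean((test$mpg - pred)^2))                # predicted vs. actual
}

# Repeat with different random splits to see how stable the error is
rmse_runs <- replicate(100, split_rmse(mtcars))
summary(rmse_runs)
```

Repeating the split shows how much the test error depends on which rows happen to land in the testing set.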
Predictive Modelling Demo
In Class Demo
lm(cty~displ+cyl, data = mpg)
dplyr::slice
Variable Selection
How do we determine which variables are included in the final model, and how?
Manual
Step-wise
All subsets
Manual Selection
Requires some expert knowledge
Typically begins by including strongest predictor
Strategically add and remove variables
Step-wise
MASS::stepAIC()
Forward selection, begin with no variables
Add a variable
Test whether it improves the model
Repeat
Backward elimination, begin with all candidate variables
Test the loss in model fit from removing each variable
Delete a variable from the model if its removal makes no significant difference
Bidirectional elimination, a combination of the above
Testing at each step for variables to be included or excluded.
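A minimal sketch of the three directions with `MASS::stepAIC()`, using the built-in `swiss` data as an assumed example. Note that `stepAIC()` compares models by AIC rather than by a significance test.

```r
data(swiss)

full_model <- lm(Fertility ~ ., data = swiss) # all candidate variables
null_model <- lm(Fertility ~ 1, data = swiss) # intercept only

# Candidate variables that forward and bidirectional search may add
search_scope <- list(
  lower = ~ 1,
  upper = ~ Agriculture + Examination + Education + Catholic + Infant.Mortality
)

# Backward elimination: start from the full model and drop terms
backward <- MASS::stepAIC(full_model, direction = "backward", trace = FALSE)

# Forward selection: start from no variables and add terms
forward <- MASS::stepAIC(null_model, scope = search_scope,
                         direction = "forward", trace = FALSE)

# Bidirectional: terms may be added or removed at each step
both <- MASS::stepAIC(null_model, scope = search_scope,
                      direction = "both", trace = FALSE)

formula(backward)
formula(forward)
formula(both)
```

Setting `trace = TRUE` prints the AIC of every candidate at each step, which is useful for seeing why a variable was kept or dropped.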
All Subsets
Test all combinations
Useful for smaller sets of data
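library(caret)
# Note: method = "lm" fits a single model with every predictor;
# it does not by itself search over subsets of variables.
leaps <- train(y ~ .,
               data = mydata,
               method = "lm")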
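An exhaustive search over every combination of predictors can be sketched in base R. This is an illustrative assumption rather than the lecture's method: it uses the built-in `swiss` data and ranks models by adjusted \(R^2\); other criteria (AIC, cross-validated error) would work the same way.

```r
data(swiss)

predictors <- setdiff(names(swiss), "Fertility")

# Every non-empty combination of predictors: 2^5 - 1 = 31 models
subsets <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit each candidate model and record its adjusted R-squared
adj_r2 <- vapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = "Fertility"), data = swiss)
  summary(fit)$adj.r.squared
}, numeric(1))

results <- data.frame(
  model  = vapply(subsets, paste, character(1), collapse = " + "),
  adj_r2 = adj_r2
)

# Best models first
results <- results[order(-results$adj_r2), ]
head(results, 3)
```

The number of models doubles with every added predictor, which is why this approach is only practical for a small number of candidate variables.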
All Subsets Example I
library(caret)
data(swiss)
Swiss Fertility and Socioeconomic Indicators
Fertility, Ig, ‘common standardized fertility measure’
Agriculture, % of males involved in agriculture as occupation
Examination, % draftees receiving highest mark on army examination
Education, % education beyond primary school for draftees.
Catholic, % ‘catholic’ (as opposed to ‘protestant’).
Infant.Mortality, live births who live less than 1 year.
All Subsets Example II
all <- train(Fertility ~ ., data = swiss, method = "lm")
all$finalModel
Call:
lm(formula = .outcome ~ ., data = dat)
Coefficients:
     (Intercept)       Agriculture       Examination         Education
         66.9152           -0.1721           -0.2580           -0.8709
        Catholic  Infant.Mortality
          0.1041            1.0770
All Subsets Example III
options(scipen = 999)
summary(all$finalModel)$coefficients[,c(1,3,4)]
                   Estimate   t value        Pr(>|t|)
(Intercept)      66.9151817  6.250229 0.0000001906051
Agriculture      -0.1721140 -2.448142 0.0187271543852
Examination      -0.2580082 -1.016268 0.3154617231437
Education        -0.8709401 -4.758492 0.0000243060459
Catholic          0.1041153  2.952969 0.0051900785452
Infant.Mortality  1.0770481  2.821568 0.0073357153206
P-hacking
Prediction Activity
Five Dice
Roll n dice and sum values.
For n = 1, 2, 3, 4, 5.
Predict the value if you were to roll 6, 10, and 20 dice?
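One way to approach this activity is to simulate the rolls and fit a regression. The number of simulated rolls and the seed below are assumptions. Since a fair die averages 3.5, the sum of n dice should average 3.5n, so the fitted slope should land near 3.5.

```r
set.seed(1) # arbitrary seed for reproducibility

# Roll n dice and sum the values, 200 times for each n = 1, ..., 5
n_dice <- rep(1:5, each = 200)
total  <- vapply(n_dice,
                 function(n) sum(sample(6, n, replace = TRUE)),
                 numeric(1))

fit <- lm(total ~ n_dice)
coef(fit)

# Predict outside the range of the data used to fit the model
new_rolls <- data.frame(n_dice = c(6, 10, 20))
pred <- predict(fit, newdata = new_rolls, interval = "prediction")
cbind(new_rolls, pred)
```

Predictions for 10 and 20 dice are extrapolations: they are only trustworthy here because we know the relationship stays linear beyond n = 5.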
Spatial Correlation
“everything is related to everything else, but near things are more related than distant things.”
Waldo Tobler
Temporal Correlation
set.seed(100)
# Generate a random sequence of numbers
t <- sample(100, 10)
# Vector with last value removed
t_reg <- t[-length(t)]
t_reg[1:5]
[1] 31 26 55 6 45
# Vector of lags
t_lag <- t[-1]
t_lag[1:5]
[1] 26 55 6 45 46
Random Values Test
Temperature data
temp <- airquality$Temp
temp_reg <- temp[-length(temp)]
temp_lag <- temp[-1]
Correlation
cor(t_reg,t_lag)
[1] -0.2921794
cor.test(t_reg,t_lag)$p.value
[1] 0.4455116
cor(temp_reg, temp_lag)
[1] 0.8154956
Temporal Lag Plot
acf(temp, cex.lab = 1.3)
Spatial Autocorrelation
Time has one dimension
Space has at least two dimensions
It is less clear how to measure “near”
Measure of Spatial Autocorrelation
A measure of spatial autocorrelation (SA) describes the degree to which an object's values are similar to those of nearby objects.
Moran’s I
Global test statistic
Overall test for spatial autocorrelation
Moran’s I
Ranges from -1 to +1
Negative 1
Dissimilar values are near each other
Positive 1
Similar values are near each other
Zero, no spatial autocorrelation
Moran’s I & Spatial Correlation
(Radil 2011)
Moran’s I and Spatial Weights
Moran’s I Formula:
Similar to correlation coefficient
Spatial Weights Matrix \(w_{ij}\)
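For reference, the standard form of global Moran's I is given below. This is an assumption about what the slide showed; the symbols \(n\) (number of objects), \(x_i\) (value at object \(i\)), \(\bar{x}\) (mean of the values) and \(S_0\) (sum of all weights) are introduced here for illustration.

\[
I = \frac{n}{S_0} \cdot \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad S_0 = \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}
\]

The numerator is a weighted cross-product of deviations from the mean, which is why the statistic resembles a correlation coefficient: only pairs with \(w_{ij} > 0\) (neighbours) contribute.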
Spatial Weights
A measure of how “near” objects are in space.
Points
Calculate a distance
Polygons
Could use distance, e.g. between centroids?
Based on contiguity
Contiguity
(Tenney 2013)
Weights Matrix (Row Standardized)
Modified from https://pqstat.com/?mod_f=macwag
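A small base R sketch of row standardization. The four regions and their neighbour relationships are hypothetical: A, B, C and D sit in a row, so each region touches only the one(s) beside it.

```r
# Hypothetical binary contiguity matrix: 1 if two regions share a border
regions <- c("A", "B", "C", "D")
adj <- matrix(0, nrow = 4, ncol = 4, dimnames = list(regions, regions))
adj["A", "B"] <- adj["B", "A"] <- 1
adj["B", "C"] <- adj["C", "B"] <- 1
adj["C", "D"] <- adj["D", "C"] <- 1

# Row standardization: divide each row by its number of neighbours,
# so the weights in every row sum to 1
w <- adj / rowSums(adj)
w

rowSums(w)
```

Region B has two neighbours, so each gets a weight of 0.5; region A has one neighbour, which gets a weight of 1. In `spdep`, `nb2listw()` with `style = "W"` produces row-standardized weights.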
Spatial Weights Exercise
Calculate Moran’s I in R
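# Spatial Dependence Library
library(spdep)
# x: numeric vector of values; listw: spatial weights list (e.g. from nb2listw())
# Moran's I Test - Analytical
moran.test(x, listw)
# Monte Carlo Simulation
moran.mc(x, listw, nsim = 999)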
Monte Carlo
Randomly reassign the observed values to polygons and calculate I
Repeat many times to form a distribution
Calculate I for the observed data
Is the observed I likely to be a random draw from that distribution?
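The Monte Carlo procedure can be sketched in base R without `spdep`. Everything about the data is hypothetical: a 5 x 5 grid of cells with rook contiguity and values that increase smoothly across the grid. The helper `morans_i` is an illustrative implementation of the standard formula, not the `spdep` code.

```r
# Moran's I for values x and a weights matrix w (illustrative helper)
morans_i <- function(x, w) {
  z <- x - mean(x)
  (length(x) / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

# Hypothetical 5 x 5 grid; cells are neighbours if they share an edge
side   <- 5
coords <- expand.grid(row = 1:side, col = 1:side)
d      <- as.matrix(dist(coords, method = "manhattan"))
w      <- (d == 1) * 1

# Values form a smooth gradient, so similar values sit near each other
x <- coords$row + coords$col

set.seed(1) # arbitrary seed for reproducibility
n_sim <- 999

# Calculate I for the observed data
i_obs <- morans_i(x, w)

# Shuffle the values across cells and recalculate I, many times
i_sim <- replicate(n_sim, morans_i(sample(x), w))

# Share of simulated I values at least as large as the observed one
p_value <- (sum(i_sim >= i_obs) + 1) / (n_sim + 1)

c(observed = i_obs, p_value = p_value)
hist(i_sim, main = "Simulated Moran's I under random assignment")
abline(v = i_obs, lty = 2)
```

If the observed I falls far in the tail of the simulated distribution, it is unlikely to be a random draw, which is evidence of spatial autocorrelation.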
Autocorrelation: Residuals
The linear regression model requires the residuals to be independent.
Autocorrelation violates this assumption
Temporal Autocorrelation
Spatial Autocorrelation
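For the temporal case, residuals can be checked with the same lag approach used earlier for the temperature series. The model below (`Temp ~ Wind` on the built-in `airquality` data) is an assumed example, not one from the lecture.

```r
# Daily observations in time order; Temp and Wind have no missing values
fit <- lm(Temp ~ Wind, data = airquality)
res <- resid(fit)

# Correlate each residual with the next day's residual
res_reg <- res[-length(res)]
res_lag <- res[-1]

lag1_cor <- cor(res_reg, res_lag)
lag1_cor
cor.test(res_reg, res_lag)$p.value

# Autocorrelation of the residuals at all lags
acf(res)
```

A lag-1 correlation well above zero indicates the residuals are not independent, so the standard errors and p-values from `summary(fit)` should not be taken at face value.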
Spatial Autocorrelation
Model residuals need to be tested with Moran’s I for spatial autocorrelation.
What to do after?
Additional Variable
Spatial Autoregressive Models
Spatial Lag Model
Spatial Error Model
Spatial Autoregressive Models
For this course you need to be aware of these two models.
- Their interpretation is challenging.
- When to use either model is at times unclear.
- Models are estimated with maximum likelihood.
Spatial Error Model
Captures the influence of unmeasured independent variables.
Treats clustering in the unexplained portion of the response variable as clustering of the error terms.
Spatial Lag Model
Implies an influence from values of the response variable in neighbouring locations
Not an artifact of unmeasured variables
The value of an outcome variable in one location affects the outcome variable in neighbouring locations.
Choosing a model
Lagrange Multiplier diagnostics for spatial dependence in linear models
lm_tests <- spdep::lm.LMtests(model, # Linear Model
                              listw, # Spatial Weights
                              test = "all")
summary(lm_tests)
Lagrange Multiplier Output
Lagrange multiplier diagnostics for spatial dependence
data:
model: lm(formula = CRIME ~ HOVAL + INC, data = COL.OLD)
weights: nb2listw(COL.nb)
       statistic parameter  p.value
LMerr   5.723131         1 0.016743 *
LMlag   9.363684         1 0.002213 **
RLMerr  0.079495         1 0.777983
RLMlag  3.720048         1 0.053763 .
SARMA   9.443178         2 0.008901 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
May need an underlying theory to support your ideas.
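One commonly used rule of thumb for reading this output can be sketched as below. The function name `choose_model`, the 0.05 threshold and the rule itself are assumptions for illustration; the lecture does not prescribe a specific procedure, and theory should still guide the final choice.

```r
# p-values copied from the Lagrange multiplier output above
p <- c(LMerr = 0.016743, LMlag = 0.002213,
       RLMerr = 0.777983, RLMlag = 0.053763)

# Illustrative decision rule (hypothetical helper):
# 1. If neither LMerr nor LMlag is significant, keep the ordinary linear model.
# 2. If only one is significant, use that model.
# 3. If both are significant, compare the robust versions and
#    pick the one with the smaller p-value.
choose_model <- function(p, alpha = 0.05) {
  err_sig <- p[["LMerr"]] < alpha
  lag_sig <- p[["LMlag"]] < alpha

  if (!err_sig && !lag_sig) return("linear model")
  if (err_sig && !lag_sig)  return("spatial error model")
  if (lag_sig && !err_sig)  return("spatial lag model")

  if (p[["RLMlag"]] < p[["RLMerr"]]) "spatial lag model" else "spatial error model"
}

choice <- choose_model(p)
choice
```

Here both LMerr and LMlag are significant, so the robust tests decide: RLMlag has a much smaller p-value than RLMerr, which points toward the spatial lag model.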
References
Radil, Steven M. 2011. “Spatializing Social Networks: Making Space for Theory in Spatial Analysis.” Dissertation.
Tenney, Matthew. 2013. “A Conceptual Model of Exploration Wayfinding: An Integrated Theoretical Framework and Computational Methodology.” Dissertation. ProQuest Dissertations and Theses.