CS计算机代考程序代写 Lecture 6: Regression Part 2

Lecture 6: Regression Part 2

Lecture 6: Regression Part 2

GGR376

Dr. Adams

Model Interpretation: Coefficients

model_mpg <- lm(cty~displ+cyl, data = mpg) summary(model_mpg)$coefficients Estimate Std. Error t value Pr(>|t|)
(Intercept) 28.288512 0.6876399 41.138555 2.721700e-108
displ -1.197882 0.3407738 -3.515181 5.287524e-04
cyl -1.234654 0.2731967 -4.519285 9.908652e-06

Model Interpretation: \(R^2\)

summary(model_mpg)$adj.r.squared

[1] 0.6641936

Application

How do we use a linear regression model?

Explanatory

Used to understand the relationships in existing data.
Coefficients, when x increases how does Y change

Predictive

Predicting the known relationships in our data into the unknown.
Powerful, but requires more analysis steps.

Cross-Validation

Leave One Out (LOO)
Useful for smaller data samples

Sub-setting
Training Data
Testing Data

Required for Predictive Models!

Cross-Validation LOO

Cross-Validation Subsetting

Predictive Modelling

Split the data
Training Data ~80%
Testing Data, remaining

Fit the model to the training data.
predict() the testing data using the model.
Compare predicted vs. actual of testing data.
Repeat

Predictive Modelling Demo

In Class Demo
lm(cty~displ+cyl, data = mpg)
dplyr::slice

Variable Selection

How do we determine how and which variables are included in the final model.

Manual
Step-wise
All subsets

Manual Selection

Requires some expert knowledge
Typically begins by including strongest predictor
Strategically add and remove variables

Step-wise

MASS::stepAIC()

Forward selection, begin with no variables
Add a variable
Test if improves model
Repeat

Backward elimination, begin with all candidate variables
Test loss in model by removal of each variable
Delete variable from model if no significant difference

Bidirectional elimination, a combination of the above
Testing at each step for variables to be included or excluded.

All Subsets

Test all combinations
Useful for smaller sets of data

library(caret)
leaps<-train(y ~ ., data=mydata, method = "lm") All Subsets Example I library(caret) data(swiss) Swiss Fertility and Socioeconomic Indicators Fertility, lg, ‘common standardized fertility measure’ Agriculture, % of males involved in agriculture as occupation Examination, % draftees receiving highest mark on army examination Education, % education beyond primary school for draftees. Catholic, % ‘catholic’ (as opposed to ‘protestant’). Infant.Mortality, live births who live less than 1 year. All Subsets Example II all <- train(Fertility ~ ., data = swiss, method = "lm") all$finalModel Call: lm(formula = .outcome ~ ., data = dat) Coefficients: (Intercept) Agriculture Examination Education 66.9152 -0.1721 -0.2580 -0.8709 Catholic Infant.Mortality 0.1041 1.0770 All Subsets Example III options(scipen = 999) summary(all$finalModel)$coefficients[,c(1,3,4)] Estimate t value Pr(>|t|)
(Intercept) 66.9151817 6.250229 0.0000001906051
Agriculture -0.1721140 -2.448142 0.0187271543852
Examination -0.2580082 -1.016268 0.3154617231437
Education -0.8709401 -4.758492 0.0000243060459
Catholic 0.1041153 2.952969 0.0051900785452
Infant.Mortality 1.0770481 2.821568 0.0073357153206

P-hacking

Prediction Activity

Five Dice
Roll n dice and sum values.
For n = 1, 2, 3, 4, 5.
Predict the value if you were to roll 6, 10, and 20 dice?

Spatial Correlation

“everything is related to everything else, but near things are more related than distant things.”

Waldo Tobler

Temporal Correlation

set.seed(100)

# Generate a random sequence of numbers
t <- sample(100, 10) # Vector with last value removed t_reg <- t[-length(t)] t_reg[1:5] [1] 31 26 55 6 45 # Vector of lags t_lag <- t[-1] t_lag[1:5] [1] 26 55 6 45 46 Random Values Test Temperature data temp <- airquality$Temp temp_reg <- temp[-length(temp)] temp_lag <- temp[-1] Correlation cor(t_reg,t_lag) [1] -0.2921794 cor.test(t_reg,t_lag)$p.value [1] 0.4455116 cor(temp_reg, temp_lag) [1] 0.8154956 Temporal Lag Plot acf(temp, cex.lab = 1.3) Spatial Autocorrelation Time is in one dimension Space dealing with, at least, two dimensions Less clear how to measure “near” Measure of Spatial Autocorrelation A measure of SA describes the degree to which values are smilair to other nearby objects. Moran’s I Global test statistic Overall test for spatial autocorrelation Moran’s I Ranges from -1 to +1 Negative 1 Dissimilar values are near each other Positive 1 Similar values are near each other Zero, no spatial autocorrelation Moran’s I & Spatial Correlation (Radil 2011) Moran’s I and Spatial Weights Moran’s I Formula: Similar to correlation coefficient Spatial Weights Matrix \(w_{ij}\) Spatial Weights The measure of how “near” are objects in space. Points Calculate a distance Polygons Could use distance, centroid? Based on contiguity Contiguity (Tenney 2013) Weights Matrix (Row Standardized) Modified from https://pqstat.com/?mod_f=macwag Spatial Weights Exercise Calculate Moran’s I in R # Spatial Dependence Library library(spdep) # Moran's I Test - Analytical moran.test() # Monte Carlo Simulation moran.mc() Monte Carlo Assign values to random polygons and calculate I Repeat several time to form a distribution Calculate I for observed data Is it likely the observed is a random draw Autocorrelation: Residuals The linear regression model requires the residuals to be independent. Auto-correlation violates this assumptions Temporal Autocorrelation Spatial Autocorrelation Spatial Autocorrelation Model residuals need to be tested with Moran’s I for spatial autocorrelation. What to do after? Additional Variable Spatial Autoregressive Models Spatial Lag Model Spatial Error Model Spatial Autoregresion Models For this course you need to be aware of these two models. - Their interpretation is challenging. - When to use either model is at times unclear. - Models are estimated with maximum liklihood Spatial Error Model Captures the influence of unmeasured independent variables. Examines the clustering in unexplained portion of the response variable with clustering of the error terms. Spatial Lag Model Implies an influence from neighbouring variables Not an artifact of unmeasured variables The value of an outcome variable in one location affects the outcome variable in neighbouring locations. Choosing a model Lagrange Multiplier diagnostics for spatial dependence in linear models spdep::lm.LMtests(model, # Linear Model listw, # Spatial Weights test = "all") summary(lm.LMtests()) Lagrange Multiplier Output Lagrange multiplier diagnostics for spatial dependence data: model: lm(formula = CRIME ~ HOVAL + INC, data = COL.OLD) weights: nb2listw(COL.nb) statistic parameter p.value LMerr 5.723131 1 0.016743 * LMlag 9.363684 1 0.002213 ** RLMerr 0.079495 1 0.777983 RLMlag 3.720048 1 0.053763 . SARMA 9.443178 2 0.008901 ** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 May need an underlying theory to support your ideas. References Radil, Steven M. 2011. “Spatializing Social Networks: Making Space for Theory in Spatial Analysis.” Dissertation. Tenney, Matthew. 2013. “A conceptual model of exploration wayfinding: An integrated theoretical framework and computational methodology.” ProQuest Dissertations and Theses, no. April 2013: 172. http://prx.library.gatech.edu/login?url=http://search.proquest.com/docview/1353676596?accountid=11107{\%}5Cnhttp://primo-pmtna03.hosted.exlibrisgroup.com/openurl/01GALI{\_}GIT/01GALI{\_}GIT{\_}SERVICES??url{\_}ver=Z39.88-2004{\&}rft{\_}val{\_}fmt=info:ofi/fmt:kev:mtx:dissertatio.