Lecture 6: Regression Part 2
GGR376
Dr. Adams
Model Interpretation: Coefficients
library(ggplot2)  # provides the mpg data set
model_mpg <- lm(cty ~ displ + cyl, data = mpg)
summary(model_mpg)$coefficients
             Estimate Std. Error   t value      Pr(>|t|)
(Intercept) 28.288512  0.6876399 41.138555 2.721700e-108
displ       -1.197882  0.3407738 -3.515181  5.287524e-04
cyl         -1.234654  0.2731967 -4.519285  9.908652e-06
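As a rough illustration of how to read the table above, the sketch below plugs the reported estimates into the regression equation by hand. The helper `predict_cty()` and the two example cars are hypothetical and not part of the lecture; only the three coefficient values come from the output.

```r
# Coefficients copied from the summary table above
b0      <- 28.288512   # (Intercept)
b_displ <- -1.197882   # change in cty per extra litre of displacement
b_cyl   <- -1.234654   # change in cty per extra cylinder

# Hypothetical helper (not from the lecture): fitted city mpg computed by hand
predict_cty <- function(displ, cyl) {
  b0 + b_displ * displ + b_cyl * cyl
}

base_car   <- predict_cty(displ = 2.0, cyl = 4)
bigger_car <- predict_cty(displ = 3.0, cyl = 4)

# Holding cyl fixed, one extra litre shifts the prediction by b_displ
bigger_car - base_car
```

The difference equals the `displ` coefficient, which is the usual "all else held constant" reading of a regression slope.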
Model Interpretation: \(R^2\)
summary(model_mpg)$adj.r.squared
[1] 0.6641936
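The adjusted value penalises \(R^2\) for the number of predictors: \(1 - (1 - R^2)(n - 1)/(n - p - 1)\), with \(n\) observations and \(p\) predictors. The sketch below checks that formula against what `summary()` reports. It assumes the base R `mtcars` data as a stand-in for `mpg`, so the numbers differ from the slide.

```r
# mtcars ships with base R; it stands in for the mpg data here (assumption)
fit <- lm(mpg ~ disp + cyl, data = mtcars)

n  <- nrow(mtcars)             # number of observations
p  <- length(coef(fit)) - 1    # number of predictors, excluding the intercept
r2 <- summary(fit)$r.squared

# Adjusted R^2 computed by hand and as reported by summary()
adj_manual   <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
adj_reported <- summary(fit)$adj.r.squared

c(r2 = r2, adj_manual = adj_manual, adj_reported = adj_reported)
```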
Application
How do we use a linear regression model?
Explanatory
▪ Used to understand the relationships in existing data.
• Coefficients: when x increases, how does y change?
Predictive
▪ Extending the known relationships in our data to unseen cases.
• Powerful, but requires more analysis steps.
Cross-Validation
▪ Leave One Out (LOO)
• Useful for smaller data samples
▪ Sub-setting
• Training Data
• Testing Data
▪ Required for Predictive Models!
Cross-Validation LOO
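A minimal sketch of leave-one-out cross-validation, assuming the base R `mtcars` data as a stand-in for `mpg` and root mean squared error (RMSE) as the comparison metric. Neither choice comes from the lecture.

```r
# mtcars (base R) stands in for the lecture's mpg data (assumption)
dat <- mtcars
n   <- nrow(dat)
loo_pred <- numeric(n)

for (i in seq_len(n)) {
  # Fit the model without row i, then predict the held-out row
  fit_i <- lm(mpg ~ disp + cyl, data = dat[-i, ])
  loo_pred[i] <- predict(fit_i, newdata = dat[i, , drop = FALSE])
}

# Out-of-sample error from the held-out predictions
loo_rmse <- sqrt(mean((dat$mpg - loo_pred)^2))

# In-sample error from a single fit to all rows, for comparison
fit_all <- lm(mpg ~ disp + cyl, data = dat)
in_rmse <- sqrt(mean(residuals(fit_all)^2))

c(loo = loo_rmse, in_sample = in_rmse)
```

The leave-one-out error is larger than the in-sample error, because each prediction is made for a row the model never saw.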

Cross-Validation Subsetting

Predictive Modelling
1. Split the data
▪ Training Data ~80%
▪ Testing Data, the remaining ~20%
2. Fit the model to the training data.
3. predict() the testing data using the model.
4. Compare predicted vs. actual of testing data.
5. Repeat
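The five steps above can be sketched as follows. This is an illustration under assumptions, not the in-class demo: it uses the base R `mtcars` data in place of `mpg`, base indexing in place of `dplyr::slice`, RMSE as the comparison metric, and a hypothetical helper `split_rmse()`.

```r
# Hypothetical helper: one random 80/20 split, returning the test-set RMSE
split_rmse <- function(dat, prop = 0.8) {
  train_idx <- sample(nrow(dat), floor(prop * nrow(dat)))  # 1. split the data
  train <- dat[train_idx, ]
  test  <- dat[-train_idx, ]
  fit   <- lm(mpg ~ disp + cyl, data = train)              # 2. fit on training data
  pred  <- predict(fit, newdata = test)                    # 3. predict testing data
  sqrt(mean((test$mpg - pred)^2))                          # 4. predicted vs. actual
}

set.seed(376)  # arbitrary seed, for reproducibility only
rmse_runs <- replicate(100, split_rmse(mtcars))            # 5. repeat
summary(rmse_runs)
```

Repeating the split shows how much the error estimate depends on which rows happen to land in the test set.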
Predictive Modelling Demo
▪ In Class Demo
• lm(cty~displ+cyl, data = mpg)
• dplyr::slice
Variable Selection
How do we determine which variables are included in the final model, and how?
▪ Manual
▪ Step-wise
▪ All subsets
Manual Selection
▪ Requires some expert knowledge
▪ Typically begins by including strongest predictor
▪ Strategically add and remove variables
Step-wise
MASS::stepAIC()
▪ Forward selection, begin with no variables
• Add a variable
• Test whether it improves the model
• Repeat
▪ Backward elimination, begin with all candidate variables
• Test the loss in model fit from removing each variable
• Delete a variable from the model if its removal makes no significant difference
▪ Bidirectional elimination, a combination of the above
• Testing at each step for variables to be included or excluded.
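A minimal sketch of the three directions with `MASS::stepAIC()`, assuming the base R `swiss` data and a linear model of `Fertility` on all other columns. Note that `stepAIC()` compares candidate models by AIC rather than by a significance test.

```r
library(MASS)  # provides stepAIC()

full_model <- lm(Fertility ~ ., data = swiss)   # all candidate variables
null_model <- lm(Fertility ~ 1, data = swiss)   # intercept only

# Largest model the search may reach, written out explicitly
upper <- ~ Agriculture + Examination + Education + Catholic + Infant.Mortality

# Backward elimination: start from the full model
backward <- stepAIC(full_model, direction = "backward", trace = 0)

# Forward selection: start from no variables
forward <- stepAIC(null_model, scope = list(lower = ~ 1, upper = upper),
                   direction = "forward", trace = 0)

# Bidirectional: variables may be added or dropped at each step
both <- stepAIC(null_model, scope = list(lower = ~ 1, upper = upper),
                direction = "both", trace = 0)

formula(backward)
formula(forward)
formula(both)
```

Setting `trace = 1` prints the AIC of each candidate at each step, which is useful for following the search.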
All Subsets
▪ Test all combinations
▪ Useful for smaller sets of data
library(caret)
# Note: method = "lm" fits one model containing every predictor (assessed by
# resampling); it does not itself search over subsets of variables.
leaps <- train(y ~ ., data = mydata, method = "lm")
All Subsets Example I
library(caret)
data(swiss)
Swiss Fertility and Socioeconomic Indicators
▪ Fertility, Ig, ‘common standardized fertility measure’
▪ Agriculture, % of males involved in agriculture as occupation
▪ Examination, % draftees receiving highest mark on army examination
▪ Education, % education beyond primary school for draftees.
▪ Catholic, % ‘catholic’ (as opposed to ‘protestant’).
▪ Infant.Mortality, live births who live less than 1 year.
All Subsets Example II
all <- train(Fertility ~ ., data = swiss, method = "lm")
all$finalModel

Call:
lm(formula = .outcome ~ ., data = dat)

Coefficients:
     (Intercept)       Agriculture       Examination         Education
         66.9152           -0.1721           -0.2580           -0.8709
        Catholic  Infant.Mortality
          0.1041            1.0770
All Subsets Example III
options(scipen = 999)
summary(all$finalModel)$coefficients[, c(1, 3, 4)]
                   Estimate   t value        Pr(>|t|)
(Intercept)      66.9151817  6.250229 0.0000001906051
Agriculture      -0.1721140 -2.448142 0.0187271543852
Examination      -0.2580082 -1.016268 0.3154617231437
Education        -0.8709401 -4.758492 0.0000243060459
Catholic          0.1041153  2.952969 0.0051900785452
Infant.Mortality  1.0770481  2.821568 0.0073357153206
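To show what "test all combinations" means in practice, the sketch below enumerates every non-empty subset of the five `swiss` predictors in base R and ranks the fitted models. The helper `fit_subset()` and the use of AIC and adjusted \(R^2\) as ranking criteria are assumptions for illustration. Dedicated tools such as `leaps::regsubsets()` are commonly used for this search.

```r
predictors <- setdiff(names(swiss), "Fertility")

# Every non-empty combination of predictors: 2^5 - 1 = 31 candidate models
subsets <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Hypothetical helper: fit Fertility on one subset of predictors
fit_subset <- function(vars) {
  lm(reformulate(vars, response = "Fertility"), data = swiss)
}

results <- data.frame(
  model  = vapply(subsets, paste, character(1), collapse = " + "),
  adj_r2 = vapply(subsets, function(v) summary(fit_subset(v))$adj.r.squared,
                  numeric(1)),
  aic    = vapply(subsets, function(v) AIC(fit_subset(v)), numeric(1))
)

# Five best candidates by AIC (lower is better)
head(results[order(results$aic), ], 5)
```

The number of candidate models doubles with each extra predictor, which is why the approach suits smaller sets of variables.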
P-hacking

Prediction Activity
▪ Five Dice
▪ Roll n dice and sum values.
▪ For n = 1, 2, 3, 4, 5.
▪ Predict the value if you were to roll 6, 10, and 20 dice?
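One way to try the activity without physical dice is to simulate it. The sketch below assumes fair six-sided dice, 200 simulated rolls for each n, and a simple linear model of the sum on n. The helper `roll_sum()` is hypothetical.

```r
set.seed(6)  # arbitrary seed, for reproducibility only

# Hypothetical helper: sum of n fair six-sided dice
roll_sum <- function(n) sum(sample(1:6, n, replace = TRUE))

# Simulate rolls for n = 1, ..., 5 dice
n_dice <- rep(1:5, each = 200)
dice   <- data.frame(n = n_dice,
                     total = vapply(n_dice, roll_sum, numeric(1)))

dice_model <- lm(total ~ n, data = dice)

# Predict beyond the observed range of n
new_rolls <- data.frame(n = c(6, 10, 20))
pred <- predict(dice_model, newdata = new_rolls, interval = "prediction")
pred

# Theoretical expectation for comparison: 3.5 per die
3.5 * new_rolls$n
```

The further n lies from the observed range of 1 to 5, the wider the prediction interval becomes, which is the usual caution about extrapolating.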
Spatial Correlation
“everything is related to everything else, but near things are more related than distant things.”
▪ Waldo Tobler
Temporal Correlation
set.seed(100)

# Generate a random sequence of numbers
t <- sample(100, 10)

# Vector with last value removed
t_reg <- t[-length(t)]
t_reg[1:5]
[1] 31 26 55  6 45

# Vector of lags
t_lag <- t[-1]
t_lag[1:5]
[1] 26 55  6 45 46
Random Values Test

Temperature data
temp <- airquality$Temp
temp_reg <- temp[-length(temp)]
temp_lag <- temp[-1]

Correlation
cor(t_reg, t_lag)
[1] -0.2921794
cor.test(t_reg, t_lag)$p.value
[1] 0.4455116
cor(temp_reg, temp_lag)
[1] 0.8154956
Temporal Lag Plot
acf(temp, cex.lab = 1.3)

Spatial Autocorrelation
▪ Time has one dimension
▪ Space has at least two dimensions
• Less clear how to measure “near”

Measure of Spatial Autocorrelation
A measure of spatial autocorrelation (SA) describes the degree to which values are similar to those of other nearby objects.
▪ Moran’s I
• Global test statistic
◦ Overall test for spatial autocorrelation

Moran’s I
▪ Ranges from -1 to +1
▪ Negative 1
• Dissimilar values are near each other
▪ Positive 1
• Similar values are near each other
▪ Zero, no spatial autocorrelation
Moran’s I & Spatial Correlation

(Radil 2011)
Moran’s I and Spatial Weights
Moran’s I Formula:
\[ I = \frac{n}{\sum_i \sum_j w_{ij}} \cdot \frac{\sum_i \sum_j w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\sum_i (x_i - \bar{x})^2} \]
▪ Similar to correlation coefficient
▪ Spatial Weights Matrix \(w_{ij}\)
Spatial Weights
The measure of how “near” objects are in space.
▪ Points
• Calculate a distance
▪ Polygons
• Could use distance, e.g. between centroids?
• Based on contiguity
Contiguity

(Tenney 2013)
Weights Matrix (Row Standardized)

Modified from https://pqstat.com/?mod_f=macwag
Spatial Weights Exercise

Calculate Moran’s I in R
# Spatial Dependence Library
library(spdep)
# x: numeric vector of values; listw: spatial weights list (e.g. from nb2listw())
# Moran's I Test - Analytical
moran.test(x, listw)
# Monte Carlo Simulation
moran.mc(x, listw, nsim = 999)
Monte Carlo
1. Assign the values to polygons at random and calculate I
2. Repeat many times to form a distribution
3. Calculate I for the observed data
4. Ask how likely it is that the observed I is a random draw from that distribution
Autocorrelation: Residuals
The linear regression model requires the residuals to be independent.
▪ Autocorrelation violates this assumption
1. Temporal Autocorrelation
2. Spatial Autocorrelation
Spatial Autocorrelation
▪ Model residuals need to be tested with Moran’s I for spatial autocorrelation.
What to do after?
▪ Additional Variable
▪ Spatial Autoregressive Models
• Spatial Lag Model
• Spatial Error Model
Spatial Autoregressive Models
For this course you need to be aware of these two models.
- Their interpretation is challenging.
- When to use either model is at times unclear.
- Models are estimated with maximum likelihood.
Spatial Error Model
▪ Captures the influence of unmeasured independent variables.
• Examines clustering in the unexplained portion of the response variable through clustering of the error terms.
Spatial Lag Model
▪ Implies an influence from neighbouring observations
• Not an artifact of unmeasured variables
The value of an outcome variable in one location affects the outcome variable in neighbouring locations.
Choosing a model
▪ Lagrange Multiplier diagnostics for spatial dependence in linear models
lm_tests <- spdep::lm.LMtests(model,        # Linear model
                              listw,        # Spatial weights
                              test = "all")
summary(lm_tests)
Lagrange Multiplier Output
Lagrange multiplier diagnostics for spatial dependence
data:
model: lm(formula = CRIME ~ HOVAL + INC, data = COL.OLD)
weights: nb2listw(COL.nb)

       statistic parameter  p.value
LMerr   5.723131         1 0.016743 *
LMlag   9.363684         1 0.002213 **
RLMerr  0.079495         1 0.777983
RLMlag  3.720048         1 0.053763 .
SARMA   9.443178         2 0.008901 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
▪ May need an underlying theory to support your ideas.
References
Radil, Steven M. 2011. “Spatializing Social Networks: Making Space for Theory in Spatial Analysis.” Dissertation.
Tenney, Matthew. 2013. “A conceptual model of exploration wayfinding: An integrated theoretical framework and computational methodology.” ProQuest Dissertations and Theses, no. April 2013: 172.
http://prx.library.gatech.edu/login?url=http://search.proquest.com/docview/1353676596?accountid=11107{\%}5Cnhttp://primo-pmtna03.hosted.exlibrisgroup.com/openurl/01GALI{\_}GIT/01GALI{\_}GIT{\_}SERVICES??url{\_}ver=Z39.88-2004{\&}rft{\_}val{\_}fmt=info:ofi/fmt:kev:mtx:dissertatio.
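To make the Monte Carlo steps for Moran’s I listed earlier concrete, the sketch below computes I by hand in base R and builds the permutation distribution. Everything in it is an assumption for illustration: a hypothetical 5 x 5 grid of polygons, row-standardised rook contiguity weights, made-up values with a west-to-east trend, and the helper `morans_i()`. In practice `spdep::moran.test()` and `spdep::moran.mc()` do this work.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# Hypothetical 5 x 5 grid of polygons, identified by row and column
side  <- 5
cells <- expand.grid(row = 1:side, col = 1:side)
n     <- nrow(cells)

# Rook contiguity: cells sharing an edge are neighbours (Manhattan distance 1)
d <- as.matrix(dist(cells, method = "manhattan"))
w <- (d == 1) * 1
w <- w / rowSums(w)  # row-standardise so each row sums to 1

# Made-up values with a west-to-east trend, so similar values sit together
x <- cells$col + rnorm(n, sd = 0.5)

# Moran's I: weighted cross-products of deviations over the total variation
morans_i <- function(x, w) {
  z <- x - mean(x)
  (length(x) / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

# Step 3: I for the observed arrangement
i_obs <- morans_i(x, w)

# Steps 1 and 2: shuffle the values across polygons and recompute I
i_sim <- replicate(999, morans_i(sample(x), w))

# Step 4: share of permutations with I at least as large as observed
p_value <- (sum(i_sim >= i_obs) + 1) / (length(i_sim) + 1)
c(I = i_obs, p = p_value)
```

The same helper can be applied to `residuals(model)` in place of `x` to check regression residuals for spatial autocorrelation, given a weights matrix that matches the observations.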