Lecture 5: Regression Part 1
Spatial Data Science II
Dr. Adams
Tidyverse
library(tidyverse)
Review
1. Examining data distributions
2. Data Visualization
3. Data Management
Tidy data
Correlation – Review
A measure of the dependence between two variables.
Pearson’s product-moment coefficient
Measure of the strength of a linear relationship between two variables.
Coefficient is represented with r
Ranges from -1 to +1
H0: r = 0
H1: r ≠ 0
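A minimal sketch of testing these hypotheses in R, using the built-in mtcars data as an illustration (an assumption; any two numeric vectors work):

# Test H0: r = 0 against H1: r != 0
cor.test(mtcars$wt, mtcars$mpg, method = "pearson")
# Output reports the sample estimate r, a t-statistic,
# and the p-value for the null hypothesis of zero correlation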
Examples
[Four scatterplots of y against x illustrating r = −0.5, r = 0, r = 0.33, and r = 0.85.]
Pearson’s r Assumptions
Normally distributed
Linear relationship

[Scatterplot of y against x with a curved pattern: the linear assumption is not met.]
Spearman’s rank correlation coefficient
Spearman’s ρ or rs
Non-parametric (distribution-free) rank statistic
Relationship does not need to be linear
Accepts non-interval data
Spearman’s ρ
It is possible for r to be positive while ρ is negative.
“Make sure not to overinterpret Spearman’s rank correlation coefficient as a significant measure of the strength of the associations between two variables” (Hauke and Kossowski 2011)
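A small sketch with simulated data (an assumption, not from the lecture) showing how the two coefficients can disagree: the relationship below is perfectly monotonic, so ρ = 1, while r < 1 because it is not linear.

x <- 1:20
y <- exp(x)                    # monotonic but strongly non-linear
cor(x, y, method = "pearson")  # well below 1
cor(x, y, method = "spearman") # exactly 1: the ranks agree perfectly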
Spurious Correlations I
Spurious Correlations II
Spurious Correlations III
Correlation / Causation
Moving On
How much does a house price increase when we increase its square footage?
Take a minute and think about how you could answer this question.
Linear Regression Model
Model a continuous variable as a linear function of one or more independent variables.
This allows us to understand if and how an attribute contributes to an outcome.
Model Use In Data Science
Linear Regression
A statistical model that:
Predicts a continuous variable: Y
Uses one or more independent variables: xn
Calculates a set of multipliers: βn (the regression coefficients)
Includes an intercept: β0
The linear regression formula is: Y = β0 + β1x1 + β2x2 + … + βnxn + ε, where ε is our error or noise.
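To make the notation concrete, here is a sketch (simulated data, an assumption) that generates data from a known Y = β0 + β1x1 + ε and recovers the coefficients with lm(), which we meet later in this lecture:

set.seed(42)                        # reproducibility
x1 <- runif(100, 0, 10)
eps <- rnorm(100, mean = 0, sd = 1) # the error term epsilon
y <- 2 + 3 * x1 + eps               # true beta0 = 2, beta1 = 3
coef(lm(y ~ x1))                    # estimates land close to 2 and 3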
Linear Regression Visualized
[Scatterplot of the dependent variable (MPG) against the independent variable (weight, 1000 lbs), annotated with the fitted line Y = β0 + β1x1 = 37.2 + (−5.344)x1.]
Housing Example – Explanatory
House Price ($)
Number of bedrooms
Square Footage
Number of bathrooms
House Sales Data
library(tidyverse)
house = read_delim(
  "http://www.rossmanchance.com/iscam2/data/housing.txt",
  delim = "\t") # Tab delimited

## Parsed with column specification:
## cols(
##   sqft = col_integer(),
##   price = col_integer(),
##   City = col_character(),
##   bedrooms = col_integer(),
##   baths = col_double()
## )
Take a look at our data
## # A tibble: 83 x 5
##     sqft  price City            bedrooms baths
##    <int>  <int> <chr>              <int> <dbl>
##  1  3392 339000 Dublin                 3  2.10
##  2  4100 899900 pleasanton             4  3.00
##  3  3200 448641 Clayton                5  4.00
##  4  1436 239999 Moraga                 4  3.00
##  5  1944 377500 Antioch                3  2.00
##  6  1500 299900 Danville               3  2.50
##  7  1700 265000 El Dorado Hills        4  3.00
##  8  2507 449000 Shingle Springs        4  3.00
##  9  1580 439950 McKinleyville          3  2.00
## 10  1500 699888 Marina                 4  2.00
## # … with 73 more rows
Housing Example – Explanatory
House Price (price)
Number of bedrooms (bedrooms)
Square Footage (sqft)
Number of bathrooms (baths)
Data Validation
Before we fit our model, we need to ensure the data meets the linear regression model's assumptions:
1. Linearity between independent variables (xn) and dependent variable (Y)
2. No outliers in xn
3. Normally distributed xn and Y
Linearity: Price and Square Footage
ggplot(data = house, mapping = aes(y = price, x = sqft)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
[Scatterplot of price against sqft with a fitted regression line.]
Linearity: Price and Baths
[Scatterplot of price against baths.]
Linearity: Price and Bedrooms
[Scatterplot of price against bedrooms.]
Outliers – Box Plot
Data values outside of 1.5 * the interquartile range (IQR) may be considered outliers.
The IQR is the distance from the 25th percentile to the 75th percentile.
We often visualize this with the box and whisker plot
Box Plot
Box and whisker plots:
bottom and top of the box are the first and third quartiles
band inside the box is the second quartile (the median)
geom_boxplot:
Whiskers extend to the largest or smallest value within 1.5 * IQR
Points mark values outside of 1.5 * IQR
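A minimal sketch of the same 1.5 * IQR rule computed by hand on the square-footage column (this filter logic is illustrative, not part of geom_boxplot):

q <- quantile(house$sqft, c(0.25, 0.75)) # first and third quartiles
fence <- 1.5 * IQR(house$sqft)
house %>%
  filter(sqft < q[1] - fence | sqft > q[2] + fence) # candidate outliers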
Box Plot Visualized
Check for outliers: Square Footage
ggplot(data = house) +
  geom_boxplot(mapping = aes(y = sqft, x = ""))
[Box plot of sqft.]
Distribution: Price
ggplot(data = house) +
  geom_histogram(mapping = aes(x = price))
[Histogram of price.]
Distribution: log(Price)
ggplot(data = house) +
  geom_histogram(mapping = aes(x = price)) +
  scale_x_log10()
[Histogram of price on a log10 x-axis.]
Distribution: Independent Variables
par(mfrow=c(1, 3))
hist(house$sqft); hist(house$bedrooms); hist(house$baths)

[Histograms of house$sqft, house$bedrooms, and house$baths.]
Transform Variables
house %>%
  mutate(log_price = log(price)) %>%
  mutate(log_sqft = log(sqft)) %>%
  mutate(log_baths = log(baths)) -> house

house %>% select(log_baths)
## # A tibble: 83 x 1
##    log_baths
##        <dbl>
##  1     0.742
##  2     1.10
##  3     1.39
##  4     1.10
##  5     0.693
##  6     0.916
##  7     1.10
##  8     1.10
Transformed
par(mfrow=c(1, 2))
hist(house$baths); hist(house$log_baths)
[Histograms of house$baths and house$log_baths.]
Review: Graphical Analysis Pre-Check
Scatter plot: Check for linear relationships between xn and Y
Box plot: Outlier check
Histogram: Check variables for normal distributions
Explore Relationships: Correlation Strength
It is good if an xn is correlated with Y
It is problematic when multiple xn are correlated with each other
Multicollinearity: we will address this later in the lecture
Correlation Matrix
cor_mat <- cor(house %>%
  select(bedrooms, log_price, sqft, log_baths))
cor_mat
## bedrooms log_price sqft log_baths
## bedrooms 1.0000000 0.3188020 0.5869470 0.7657905
## log_price 0.3188020 1.0000000 0.6638824 0.3554938
## sqft 0.5869470 0.6638824 1.0000000 0.6068155
## log_baths 0.7657905 0.3554938 0.6068155 1.0000000
Visualize Correlation: Plots
corrplot::corrplot(cor_mat, cl.pos = "b", tl.pos = "d",
                   addCoef.col = "black") # print coefficients on the plot

[Correlation plot of bedrooms, log_price, sqft, and log_baths with pairwise coefficients shown.]
Tidyverse and modelling
The tidyverse has yet to really address statistical modelling:
“You can see some of the pieces in the recipes and rsample packages but we do not yet have a cohesive system that solves a wide range of challenges. This work will largely replace the modelr package used in R4DS.” - Tidyverse site
The broom package may be helpful.
Fit the linear regression model
The function we use in R is stats::lm()
The formula is:
dep ~ indep_1 + indep_2 + … + indep_n
Y = β0 + β1x1 + β2x2 + … + βnxn + ε
?lm: lm is used to fit linear models. It can be used to carry out regression, single stratum analysis of variance and analysis of co-variance (although aov may provide a more convenient interface for these).
R Formulas
Formulas use the ~ (tilde, read “by”) and + (plus) characters:
Y ~ var1 + var2 + var3 + … + varN
Example:
lm(hwy~displ+year+cyl, data = mpg)
Linear Model
house_reg <- lm(log_price ~ sqft, data = house)
house_reg
##
## Call:
## lm(formula = log_price ~ sqft, data = house)
##
## Coefficients:
## (Intercept) sqft
## 1.204e+01 4.274e-04
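Because the response is log(price), the sqft coefficient has a multiplicative reading. A small interpretation sketch (the arithmetic here is an illustration, not the lecture's code):

b1 <- coef(house_reg)["sqft"] # 4.274e-04
exp(b1)                       # price multiplier per additional square foot
(exp(b1) - 1) * 100           # roughly a 0.043% price increase per sqft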
summary(lm())
summary(house_reg)
Accessing parts of the model
An object of class “lm” is a list containing at least the following components:
coefficients: a named vector of coefficients
residuals: the residuals, that is, response minus fitted values
fitted.values: the fitted mean values
house_reg$residuals[1:5]
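As noted earlier, the broom package offers a tidier way in than indexing the list directly. A sketch, assuming broom is installed:

library(broom)
tidy(house_reg)    # coefficients, std. errors, t-values, p-values as a tibble
glance(house_reg)  # one-row model summary (R-squared, F-statistic, ...)
augment(house_reg) # original data plus fitted values and residuals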
Assessing the model
1. Are the coefficients statistically significant?
Check with each coefficient's p-value
Uses the t-value
2. Is the model statistically significant overall?
Check with the overall p-value
Uses the F-test
Coefficients and t-value
The statistical significance of each coefficient is tested with the t-value:
t = coefficient / std. error
|t| should be greater than 1.96 for the p-value to be less than 0.05
We reject the null hypothesis when p < 0.05
When p > 0.05 we remove this variable from the model
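A quick sketch verifying this by hand from the coefficient table that summary() returns:

ctab <- coef(summary(house_reg))          # Estimate, Std. Error, t value, Pr(>|t|)
ctab[, "Estimate"] / ctab[, "Std. Error"] # reproduces the "t value" column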
F-statistic
The F-statistic assesses the overall model.
Null hypothesis:
The model fits no better than a model with zero predictors
Alternative hypothesis:
The model performs better than an intercept-only model
If the p-value associated with the F-statistic is < 0.05 we reject H0
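A sketch of recovering the overall p-value from the F-statistic stored in the model summary:

f <- summary(house_reg)$fstatistic # named vector: value, numdf, dendf
pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)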
Refresher on p-values if you need it
https://www.youtube.com/watch?v=128yz0OCG-I
Sidenote: Alpha, p-value, Type I and II errors
Type I Error:
Incorrect rejection of a true null hypothesis
False positive
Type II Error:
Incorrectly retaining a false null hypothesis
False negative
Alpha
As we decrease our chance of a Type I error, we increase our risk of a Type II error.
R Squared & Adjusted R Squared
R2 is the proportion of variation in the dependent (response) variable that is explained by the model.
The adjusted R2 corrects for the fact that R2 increases whenever you add more independent variables.
It increases only if a new term improves the model more than expected by chance.
It decreases when a predictor improves the model by less than expected by chance.
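Both quantities are stored in the fitted model's summary; a quick sketch:

summary(house_reg)$r.squared     # proportion of variance explained
summary(house_reg)$adj.r.squared # penalized for the number of predictors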
Assumptions of the model
Checking your model:
1. The mean of the residuals is zero
2. Homoscedasticity of residuals or equal variance
3. Multicollinearity
4. The xn variables and residuals are uncorrelated
5. The variability in X values is positive
6. The number of observations must be greater than the number of xn
7. Normality of residuals
8. No auto-correlation of residuals
Residuals Zero Mean
The mean of the residuals is zero.
Residuals are the differences between model estimates and actual values.
mean(house_reg$residuals)
## [1] 2.70306e-17
Homoscedasticity: Equal variance across values
Homoscedasticity:
Requires variance of residuals to be the same across the fitted values.
Heteroscedasticity:
When the size of the error term differs across values of an independent variable.
Violation of homoscedasticity
Linear Regression (Ordinary Least Squares) seeks to minimize residuals
OLS equally weights all observations
Cases with larger errors have more effect on the model estimation.
Checking with plot(lm(y~x, data = data))
par(mfrow = c(2, 2)) # set 2 rows and 2 columns plot layout
plot(house_reg)
[Diagnostic plots for house_reg: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage.]
Dealing with Heteroscedasticity
# Breusch-Pagan test
lmtest::bptest(house_reg)
car::ncvTest(house_reg)
1. Try different predictors
2. Variable transformation
Box-Cox
3. Select a different regression model (as a last resort)
library(caret)
bc <- BoxCoxTrans(values) # values: a numeric vector to transform
predict(bc, values)       # apply the estimated transformation
No autocorrelation of residuals
Temporal Autocorrelation
The value at one point is not dependent on the previous value
Spatial Autocorrelation
Values at one location are not dependent on near values
Moran’s I
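For the temporal case, one common check is the Durbin-Watson test, sketched here with the lmtest package already used for bptest above (it assumes the observations have a meaningful ordering):

# H0: no first-order autocorrelation of the residuals
lmtest::dwtest(house_reg)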
The xn variables and residuals are uncorrelated
cor.test(house$sqft, house_reg$residuals)
##
## Pearson's product-moment correlation
##
## data: house$sqft and house_reg$residuals
## t = -4.7743e-16, df = 81, p-value = 1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2156893 0.2156893
## sample estimates:
## cor
## -5.304794e-17
The variability in xn values is positive
var(house$sqft)
## [1] 863896.6
This is much greater than 0.
We cannot use an xn variable that takes only a single value.
The number of observations must be greater than number of Xs
Unlikely to be an issue except in extreme cases.
Normality of residuals
hist(house_reg$residuals)
[Histogram of house_reg$residuals.]
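Beyond eyeballing the histogram, a formal check is available; a sketch (here H0 is that the residuals are normally distributed):

shapiro.test(house_reg$residuals) # p > 0.05: no evidence against normality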
No perfect multicollinearity
Occurs when we have 2 or more predictor variables.
Assessed using VIF
Variance inflation factors
Rule of thumb: when VIF > 4
You should revise your variable selection
car::vif()
Motor Trend Car Road Tests
datasets::mtcars
mpg: Miles/(US) gallon
cyl: Number of cylinders
disp: Displacement (cu.in.)
hp: Gross horsepower
drat: Rear axle ratio
wt: Weight (1000 lbs)
qsec: 1/4 mile time
vs: V/S
am: Transmission (0 = automatic, 1 = manual)
gear: Number of forward gears
carb: Number of carburetors
Linear Model for mpg
library(car)
# We will include all variables in the model
mpg_lm <- lm(mpg ~ ., data = mtcars)
vif(mpg_lm)

##       cyl      disp        hp      drat        wt      qsec
## 15.373833 21.620241  9.832037  3.374620 15.164887  7.527958
##        vs        am      gear      carb
##  4.965873  4.648487  5.357452  7.908747
Remove variables in mpg Model
mpg_lm_2 <- lm(mpg ~ cyl + gear + am, data = mtcars)
vif(mpg_lm_2)
## cyl gear am
## 1.407382 2.768828 2.884543
Issues caused by multicollinearity
Increases the variance of the coefficient estimates
Estimates may be very sensitive to minor changes in the model
Statistical power is reduced
Coefficient signs may switch
“A caution must be taken when more than two predictors in the model have even weak pairwise correlation coefficients (r=0.25) as they can result in a significant multicollinearity effect.” (P. Vatcheva and Lee 2016, pg. 13)
References
Hauke, Jan, and Tomasz Kossowski. 2011. “Comparison of Values of Pearson’s and Spearman’s Correlation Coefficients on the Same Sets of Data.” Quaestiones Geographicae 30 (2): 87–93. doi:10.2478/v10117-011-0021-1.
P. Vatcheva, Kristina, and MinJae Lee. 2016. “Multicollinearity in Regression Analyses Conducted in Epidemiologic Studies.” Epidemiology: Open Access 06 (02): 1–20. doi:10.4172/2161-1165.1000227.