
Lecture 5: Regression Part 1
Spatial Data Science II
Dr. Adams

Tidyverse
library(tidyverse)

Review
1. Examining data distributions
2. Data Visualization
3. Data Management
   - Tidy data

Big Bang

Correlation – Review
A measure of the dependence between two variables.

Pearson’s product-moment coefficient
- Measure of the strength of a linear relationship between two variables.
- Coefficient is represented with r
- Ranges from -1 to +1
- H0: r = 0
- H1: r ≠ 0
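For a quick check in R, cor.test() estimates r and tests H0: r = 0 (a minimal sketch using the built-in mtcars data, not part of these slides):

cor.test(mtcars$wt, mtcars$mpg, method = "pearson") # returns r, the t-statistic, and the p-value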

Examples
[Figure: four example scatter plots of y against x, with correlations of r = −0.5, r = 0, r = 0.33, and r = 0.85]

Pearson’s r Assumptions
- Normally distributed
- Linear relationship

[Figure: scatter plot of y against x with a clearly curved pattern; the linear assumption is not met]

Spearman’s rank correlation coefficient
- Spearman's ρ or rs
- Non-parametric (distribution-free) rank statistic
- Relationship does not need to be linear
- Accepts non-interval data

Spearman’s ρ
It is possible for r to be positive while ρ is negative.
“Make sure not to overinterpret Spearman’s rank correlation coefficient as a significant measure of the strength of the associations between two variables” (Hauke and Kossowski 2011)
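A minimal sketch with synthetic data (not from the slides) showing how a single extreme value can make Pearson's r positive while Spearman's ρ is negative:

x <- 1:10
y <- c(10:2, 1000)              # decreasing, except one large outlier at the end
cor(x, y, method = "pearson")   # positive: the outlier dominates the linear fit
cor(x, y, method = "spearman")  # negative: the ranks are mostly decreasing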

Spurious Correlations I

Spurious Correlations II

Spurious Correlations III

Correlation / Causation

Moving On
How much does a house price increase when we increase its square footage?
- Take a minute and think about how you could answer this question.

Linear Regression Model
Model a continuous variable as a linear function of one or more independent variables.
- This allows us to understand if and how an attribute contributes to an outcome.

Model Use In Data Science

Linear Regression
A statistical model that:
- Predicts a continuous variable: Y
- Using one or more independent variables: x1, ..., xn
- Calculates a set of multipliers: β1, ..., βn
  - Regression coefficients
- Includes an intercept: β0

The linear regression formula is:

Y = β0 + β1x1 + β2x2 + ... + βnxn + ε

where ε is our error or noise.

Linear Regression Visualized
[Figure: scatter plot of the dependent variable (MPG) against the independent variable (weight in 1,000 lbs) with the fitted line Y = β0 + β1x1 = 37.2 + (−5.344)x1]
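For example, reading the fitted line above: a car weighing 3 (i.e., 3,000 lbs) has a predicted fuel economy of 37.2 + (−5.344)(3) ≈ 21.2 MPG.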

Housing Example – Explanatory
- House Price ($)
- Number of bedrooms
- Square Footage
- Number of bathrooms

House Sales Data
library(tidyverse)
house = read_delim(
  "http://www.rossmanchance.com/iscam2/data/housing.txt",
  delim = "\t") # Tab delimited
## Parsed with column specification:
## cols(
## sqft = col_integer(),
## price = col_integer(),
## City = col_character(),
## bedrooms = col_integer(),
## baths = col_double()
## )

Take a look at our data
## # A tibble: 83 x 5
##     sqft  price City            bedrooms baths
##    <int>  <int> <chr>              <int> <dbl>
##  1  3392 339000 Dublin                 3  2.10
##  2  4100 899900 pleasanton             4  3.00
##  3  3200 448641 Clayton                5  4.00
##  4  1436 239999 Moraga                 4  3.00
##  5  1944 377500 Antioch                3  2.00
##  6  1500 299900 Danville               3  2.50
##  7  1700 265000 El Dorado Hills        4  3.00
##  8  2507 449000 Shingle Springs        4  3.00
##  9  1580 439950 McKinleyville          3  2.00
## 10  1500 699888 Marina                 4  2.00
## # … with 73 more rows

Housing Example – Explanatory
- House Price (price)
- Number of bedrooms (bedrooms)
- Square Footage (sqft)
- Number of bathrooms (baths)

Data Validation
Before we fit our model we need to ensure the data meets the linear regression model's assumptions:
1. Linearity between the independent variables (xn) and the dependent variable (Y)
2. No outliers in xn
3. Normally distributed xn & Y

Linearity: Price and Square Footage
ggplot(data = house, mapping = aes(y = price, x = sqft))+
  geom_point()+
  geom_smooth(method = "lm", se = F)

[Figure: scatter plot of price against sqft with a fitted linear trend line]

Linearity: Price and Baths
[Figure: scatter plot of price against baths]

Linearity: Price and Bedrooms
[Figure: scatter plot of price against bedrooms]

Outliers – Box Plot
- Data values outside of 1.5 * interquartile-range may be considered outliers.
- The IQR is the distance from the 25th percentile to the 75th percentile.
- We often visualize this with the box and whisker plot
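A minimal sketch of the 1.5 * IQR rule in R, using the house data loaded earlier:

q <- quantile(house$sqft, c(0.25, 0.75))  # 25th and 75th percentiles
iqr <- q[2] - q[1]                        # interquartile range
house %>%
  filter(sqft < q[1] - 1.5 * iqr | sqft > q[2] + 1.5 * iqr)  # potential outliers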

Box Plot
Box and whisker plots:
- Bottom and top of the box are the first and third quartiles
- Band inside the box is the second quartile (the median)
- geom_boxplot:
  - Whiskers extend to the largest or smallest value within 1.5 * IQR
  - Points are values outside of 1.5 * IQR

Box Plot Visualized

Check for outliers: Square Footage
ggplot(data = house)+
geom_boxplot( mapping = aes(y = sqft, x = “”))
[Figure: box plot of sqft]

Distribution: Price
ggplot(data = house)+
  geom_histogram(mapping = aes(x = price))

[Figure: histogram of price, right-skewed]

Distribution: log(Price)
ggplot(data = house)+
  geom_histogram(mapping = aes(x = price))+
  scale_x_log10()

[Figure: histogram of price on a log10 x-axis]

Distribution: Independent Variables
par(mfrow=c(1, 3))
hist(house$sqft); hist(house$bedrooms); hist(house$baths)

[Figure: three histograms side by side - house$sqft, house$bedrooms, and house$baths]

Transform Variables
house%>%
  mutate(price, log_price = log(price))%>%
  mutate(sqft, log_sqft = log(sqft))%>%
  mutate(baths, log_baths = log(baths)) -> house

house%>%
  select(log_baths)

## # A tibble: 83 x 1
##   log_baths
##       <dbl>
## 1     0.742
## 2     1.10
## 3     1.39
## 4     1.10
## 5     0.693
## 6     0.916
## 7     1.10
## 8     1.10

Transformed
par(mfrow=c(1, 2))
hist(house$baths); hist(house$log_baths)
[Figure: histograms of house$baths and house$log_baths side by side]

Review: Graphical Analysis Pre-Check
- Scatter plot: check for linear relationships between xn and Y
- Box plot: outlier check
- Histogram: check variables for normal distributions

Explore Relationships: Correlation Strength
- It is good if an xn is correlated with Y
- It is problematic when multiple xn are correlated with each other
  - Multicollinearity
  - We will address this later in the lecture

Correlation Matrix
cor_mat <- cor(house%>%
select(bedrooms, log_price, sqft, log_baths))
cor_mat
## bedrooms log_price sqft log_baths
## bedrooms 1.0000000 0.3188020 0.5869470 0.7657905
## log_price 0.3188020 1.0000000 0.6638824 0.3554938
## sqft 0.5869470 0.6638824 1.0000000 0.6068155
## log_baths 0.7657905 0.3554938 0.6068155 1.0000000

Visualize Correlation: Plots
corrplot::corrplot(cor_mat, cl.pos = "b", tl.pos = "d",
                   addCoef.col = "black") # addCoef.col (assumed from the figure) overlays the coefficients

[Figure: correlation plot of cor_mat with the coefficient values overlaid]

Tidyverse and modelling
- The tidyverse has yet to really address statistical modelling
- "You can see some of the pieces in the recipes and rsample packages but we do not yet have a cohesive system that solves a wide range of challenges. This work will largely replace the modelr package used in R4DS." - Tidyverse site
- broom is a package that may be helpful (see the sketch below)
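A minimal sketch (not from the slides) of how broom can tidy a fitted model, using the built-in mtcars data:

library(broom)
fit <- lm(mpg ~ wt, data = mtcars)
tidy(fit)    # coefficient table (estimate, std.error, statistic, p.value) as a tibble
glance(fit)  # one-row model summary (r.squared, F-statistic, p.value, ...)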

Fit the linear regression model
- The function we use in R is stats::lm()
- The formula is:
  - dep ~ indep_1 + indep_2 + ... + indep_n
  - which corresponds to Y = β0 + β1x1 + β2x2 + ... + βnxn + ε
?lm: lm is used to fit linear models. It can be used to carry out regression, single stratum analysis of variance and analysis of co-variance (although aov may provide a more convenient interface for these).

R Formulas
Formulas use the tilde (~, read "by") and + (plus) characters:
Y ~ var1 + var2 + var3 + ... + varN
Example:
lm(hwy~displ+year+cyl, data = mpg)

Linear Model
house_reg <- lm(log_price~sqft, data = house)
house_reg

##
## Call:
## lm(formula = log_price ~ sqft, data = house)
##
## Coefficients:
## (Intercept)         sqft
##   1.204e+01    4.274e-04

summary(lm())

summary(house_reg)

Accessing parts of the model
An object of class "lm" is a list containing at least the following components:
- coefficients: a named vector of coefficients
- residuals: the residuals, that is response minus fitted values.
- fitted.values: the fitted mean values.

house_reg$residuals[1:5]

Assessing the model
1. Are the coefficients statistically significant?
   - Check with each coefficient's p-value
   - Uses the t-value
2. Is the model statistically significant overall?
   - Check with the model's overall p-value
   - Uses the F-test

Coefficients and t-value
The statistical significance of each coefficient is tested with the t-value:

t = coefficient / std.error

- Should be greater than 1.96 for the p-value to be less than 0.05
- We reject the null hypothesis when p < 0.05
- When p > 0.05 we remove this variable from the model.
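A minimal sketch (not on the original slides) of pulling the coefficient table out of summary() and recomputing the t-values by hand:

coef_table <- summary(house_reg)$coefficients
coef_table                                             # columns: Estimate, Std. Error, t value, Pr(>|t|)
coef_table[, "Estimate"] / coef_table[, "Std. Error"]  # reproduces the "t value" column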

F-statistic
- The F-statistic assesses the overall model.
- Null hypothesis:
  - The model fits no better than a model with zero predictors.
- Alternative hypothesis:
  - This model performs better than an intercept-only model.
- If the p-value associated with the F-statistic is < 0.05 we reject H0

Refresher on p-values if you need it: https://www.youtube.com/watch?v=128yz0OCG-I

Sidenote: Alpha, p-value, Type I and II errors
Type I Error:
- Incorrect rejection of a true null hypothesis
- False positive
Type II Error:
- Incorrectly retaining a false null hypothesis
- False negative

Alpha
As we decrease our chance of a Type I error, we increase our risk of a Type II error.

R Squared & Adjusted R Squared
R2 tells us the proportion of variation in the dependent (response) variable that has been explained by this model.
The adjusted R2 accounts for the fact that R2 increases whenever you add more independent variables.
- Increases only if a new term improves the model more than expected by chance.
- Decreases when a predictor improves the model by less than expected by chance.

Assumptions of the model
Checking your model:
1. The mean of the residuals is zero
2. Homoscedasticity of residuals or equal variance
3. No multicollinearity
4. The xn variables and residuals are uncorrelated
5. The variability in X values is positive
6. The number of observations must be greater than the number of xn
7. Normality of residuals
8. No auto-correlation of residuals

Residuals Zero Mean
The mean of the residuals is zero. Residuals are the differences between the model estimates and the actual values.

mean(house_reg$residuals)

## [1] 2.70306e-17

Homoscedasticity: Equal variance across values
Homoscedasticity:
- Requires the variance of the residuals to be the same across the fitted values.
Heteroscedasticity:
- When the size of the error term differs across values of an independent variable.
- A violation of homoscedasticity
- Linear regression (Ordinary Least Squares) seeks to minimize residuals
- OLS equally weights all observations
- Cases with larger errors have more effect on the model estimation.

Checking with plot(lm(y~x, data = data))

par(mfrow=c(2,2)) # set 2 rows and 2 column plot layout
plot(house_reg)

[Figure: the four lm diagnostic plots for house_reg - Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance); observations 18, 23, 72, and 73 are labelled]

Dealing with Heteroscedasticity

# Breusch-Pagan test
lmtest::bptest(house_reg)
car::ncvTest(house_reg)

1. Try different predictors
2. Variable transformation
   - Box-Cox
3. Select a different regression model (as a last resort)

library(caret)
bc <- BoxCoxTrans('values')
predict(bc, 'values')

No autocorrelation of residuals
- Temporal autocorrelation
  - The value at one point is not dependent on the previous value
- Spatial autocorrelation
  - Values at one location are not dependent on nearby values
  - Moran's I

The xn variables and residuals are uncorrelated

cor.test(house$sqft, house_reg$residuals)

##
## Pearson's product-moment correlation
##
## data: house$sqft and house_reg$residuals
## t = -4.7743e-16, df = 81, p-value = 1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2156893 0.2156893
## sample estimates:
##           cor
## -5.304794e-17

The variability in xn values is positive

var(house$sqft)

## [1] 863896.6

This is much greater than 0. We cannot use an xn variable that takes only a single value.

The number of observations must be greater than the number of xn
Unlikely to be an issue except in extreme cases.
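A minimal sketch (an assumption, not the original slide code) of what the Box-Cox placeholder above might look like with the house data:

library(caret)
bc <- BoxCoxTrans(house$price)             # estimate the Box-Cox lambda from the skewed price variable
house$price_bc <- predict(bc, house$price) # apply the estimated transformation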
Normality of residuals

hist(house_reg$residuals)

[Figure: histogram of house_reg$residuals, roughly centred on zero between -1.5 and 1.5]

No perfect multicollinearity
Occurs when we have 2 or more predictor variables.
- Assessed using VIF
  - Variance inflation factors
- Rule of thumb: if VIF > 4
  - You should revise your variable selection

car::vif()

Motor Trend Car Road Tests
datasets::mtcars
- mpg: Miles/(US) gallon
- cyl: Number of cylinders
- disp: Displacement (cu.in.)
- hp: Gross horsepower
- drat: Rear axle ratio
- wt: Weight (1000 lbs)
- qsec: 1/4 mile time
- vs: Engine shape (0 = V-shaped, 1 = straight)
- am: Transmission (0 = automatic, 1 = manual)
- gear: Number of forward gears
- carb: Number of carburetors

Linear Model for mpg
library(car)
# We will include all variables in the model
mpg_lm <- lm(mpg ~ ., data=mtcars)
vif(mpg_lm)

##       cyl      disp        hp      drat        wt      q
## 15.373833 21.620241  9.832037  3.374620 15.164887  7.527
##        am      gear      carb
##  4.648487  5.357452  7.908747

Remove variables in mpg Model

mpg_lm_2 <- lm(mpg ~ cyl + gear + am, data=mtcars)
vif(mpg_lm_2)

##      cyl     gear       am
## 1.407382 2.768828 2.884543

Issues caused by multicollinearity
- Increases the variance of the coefficient estimates
- Estimates may be very sensitive to minor changes in the model
- Statistical power is reduced
- Coefficient sign switching

"A caution must be taken when more than two predictors in the model have even weak pairwise correlation coefficients (r=0.25) as they can result in a significant multicollinearity effect." (P. Vatcheva and Lee 2016, pg. 13)

References

Hauke, Jan, and Tomasz Kossowski. 2011. "Comparison of Values of Pearson's and Spearman's Correlation Coefficients on the Same Sets of Data." Quaestiones Geographicae 30 (2): 87-93. doi:10.2478/v10117-011-0021-1.

P. Vatcheva, Kristina, and MinJae Lee. 2016. "Multicollinearity in Regression Analyses Conducted in Epidemiologic Studies." Epidemiology: Open Access 06 (02): 1-20. doi:10.4172/2161-1165.1000227.