Logistic Regression
Bowei Chen, Deema Hafeth and Jingmin Huang
School of Computer Science
University of Lincoln
CMP3036M/CMP9063M
Data Science 2016 – 2017 Workshop
Today’s Objectives
• Study the slides in Part I, including:
– Implementation of logistic regression in R
– Interpretation of results of logistic regression in R
• Do the exercises in Part II
Part I:
Examples of Logistic Regression in R
Example (1/13)
This is an example of implementing logistic regression models in R.
We will use the Housing.csv dataset
> df <- read.csv("C:/Housing.csv")
> dim(df)
[1] 546 12
Use the dim() function to see the size of the data. There are 546
observations and 12 variables in the dataset.
Example (2/13)
> head(df)
price housesize bedrooms bathrms stories driveway recroom fullbase gashw airco garagepl prefarea
1 420 5850 3 1 2 1 0 1 0 0 1 0
2 385 4000 2 1 1 1 0 0 0 0 0 0
3 495 3060 3 1 1 1 0 0 0 0 0 0
4 605 6650 3 1 2 1 1 0 0 0 0 0
5 610 6360 2 1 1 1 0 0 0 0 0 0
6 660 4160 3 1 1 1 1 1 0 1 0 0
Use the head() function to look at the first few
observations of the data (6 by default).
Example (3/13)
> lapply(df, class)
$price
[1] "numeric"
$housesize
[1] "integer"
$bedrooms
[1] "integer"
$bathrms
[1] "integer"
$stories
[1] "integer"
$driveway
[1] "integer"
...
Use lapply() to see the data type of
each variable (displayed vertically).
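As a compact alternative (our addition, not in the original slides), sapply() simplifies the same result into a named vector, so the classes appear horizontally:
# Sketch: same information as lapply(df, class), shown in one row
sapply(df, class)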
Example (4/13)
> summary(df)
price housesize bedrooms bathrms stories driveway
Min. : 250.0 Min. : 1650 Min. :1.000 Min. :1.000 Min. :1.000 Min. :0.000
1st Qu.: 491.2 1st Qu.: 3600 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
Median : 620.0 Median : 4600 Median :3.000 Median :1.000 Median :2.000 Median :1.000
Mean : 681.2 Mean : 5150 Mean :2.965 Mean :1.286 Mean :1.808 Mean :0.859
3rd Qu.: 820.0 3rd Qu.: 6360 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:1.000
Max. :1900.0 Max. :16200 Max. :6.000 Max. :4.000 Max. :4.000 Max. :1.000
recroom fullbase gashw airco garagepl prefarea
Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
Median :0.0000 Median :0.0000 Median :0.00000 Median :0.0000 Median :0.0000 Median :0.0000
Mean :0.1777 Mean :0.3498 Mean :0.04579 Mean :0.3168 Mean :0.6923 Mean :0.2344
3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000
Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :3.0000 Max. :1.0000
Use summary() to produce summary
statistics for each variable.
Example (5/13)
> summary(df$price)
Min. 1st Qu. Median Mean 3rd Qu. Max.
250.0 491.2 620.0 681.2 820.0 1900.0
Use summary() to produce summary
statistics for one variable at a time.
Example (6/13)
Let's create a graph with two subplots, one per predictor. This can be very
helpful for understanding the effect of each predictor on the response variable.
> par(mfrow = c(1, 2))
> plot(df$price, df$fullbase, xlab = "Price",
+   ylab = "Fullbase", frame.plot = TRUE, cex = 1.5,
+   pch = 16, col = "green", cex.lab = 1.5, cex.axis = 1.5,
+   cex.sub = 1.5)
> plot(df$housesize, df$fullbase, xlab = "Housesize",
+   ylab = "Fullbase", frame.plot = TRUE, cex = 1.5,
+   pch = 16, col = "blue", cex.lab = 1.5, cex.axis = 1.5,
+   cex.sub = 1.5)
Example (7/13)
> model1 <- glm(fullbase ~ price, data = df, family = binomial)
> model1$coefficients
(Intercept) price
-1.622737e+00 1.447098e-03
> plot(df$price, df$fullbase, xlab = "Price",
+   ylab = "Fullbase", frame.plot = TRUE, cex = 1.5, pch = 16,
+   col = "blue", cex.lab = 1.5, cex.axis = 1.5, cex.sub = 1.5)
> xprice <- seq(min(df$price), max(df$price))
> yprice <- predict(model1, list(price = xprice), type = "response")
> lines(xprice, yprice)
Develop a logistic regression model
using R's built-in glm() function.
Note: the regression line may not be clearly visible
because of the wide range of the price variable.
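For interpretation, a quick sketch (assuming model1 is fitted as above): each one-unit increase in price multiplies the odds of fullbase by the exponential of its coefficient.
# Sketch: the price coefficient on the odds scale
exp(coef(model1)["price"])   # odds multiplier per one-unit increase in price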
Example (8/13)
# get better regression line plot
> range(df$price)
[1] 250 1900
>
> plot(df$price, df$fullbase, xlim = c(0, 2150), ylim = c(-1, 2),
+   xlab = "Price", ylab = "Fullbase", col = "blue",
+   frame.plot = TRUE, cex = 1.5, pch = 16, cex.lab = 1.5,
+   cex.axis = 1.5, cex.sub = 1.5)
> xprice <- seq(0, 2150)
> yprice <- predict(model1, list(price = xprice), type = "response")
> lines(xprice, yprice)
Develop a logistic regression model
using R's built-in glm() function.
Here we see:
• Whether the response variable and the predictor(s)
are positively or negatively correlated
• The z value and p-value, which belong to the
hypothesis test of whether the coefficient
is zero. The null hypothesis is
that the coefficient is zero. Since the p-
value is much less than 0.05, we reject
the null hypothesis that β = 0.
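To make this concrete, a small sketch reproducing the z value and p-value for price from the estimate and standard error shown in the summary below:
# Sketch: z value = Estimate / Std. Error; p-value from the standard normal
z <- 0.0014471 / 0.0003423              # approx. 4.228, as in the summary table
2 * pnorm(abs(z), lower.tail = FALSE)   # two-sided p-value, approx. 2.36e-05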
Example (9/13)
> summary(model1)
Call:
glm(formula = fullbase ~ price, family = binomial, data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6778 -0.8992 -0.8012 1.3529 1.7316
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.6227365 0.2567345 -6.321 2.60e-10 ***
price 0.0014471 0.0003423 4.228 2.36e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 706.89 on 545 degrees of freedom
Residual deviance: 688.28 on 544 degrees of freedom
AIC: 692.28
Number of Fisher Scoring iterations: 4
Note: You are not required to know
how these values are calculated
Example (10/13)
> summary(model1)
Call:
glm(formula = fullbase ~ price, family = binomial, data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6778 -0.8992 -0.8012 1.3529 1.7316
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.6227365 0.2567345 -6.321 2.60e-10 ***
price 0.0014471 0.0003423 4.228 2.36e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 706.89 on 545 degrees of freedom
Residual deviance: 688.28 on 544 degrees of freedom
AIC: 692.28
Number of Fisher Scoring iterations: 4
Deviance is a measure of the goodness of fit of a
regression model (higher numbers indicate worse
fit). The null deviance shows how well the
response variable is predicted by a model that
includes only the intercept.
R:
model1$null.deviance (the null deviance)
model1$deviance (the residual deviance)
For example, we have a null deviance of 706.89 on 545
degrees of freedom. Including the independent
variable (price) decreases the deviance to
688.28 on 544 degrees of freedom, so the
residual deviance is reduced by 18.61 at the cost
of one degree of freedom (see the sketch after the note below).
Note: You are not required to know
how these values are calculated
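For the curious (not required), a minimal sketch of how to retrieve these deviances from the fitted model and test whether the drop is statistically meaningful:
# Sketch: deviance drop and a chi-squared test for it
dev_drop <- model1$null.deviance - model1$deviance   # 706.89 - 688.28 = 18.61
df_drop <- model1$df.null - model1$df.residual       # 545 - 544 = 1
pchisq(dev_drop, df = df_drop, lower.tail = FALSE)   # small p-value: price improves the fit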
Example (11/13)
> summary(model1)
Call:
glm(formula = fullbase ~ price, family = binomial, data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6778 -0.8992 -0.8012 1.3529 1.7316
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.6227365 0.2567345 -6.321 2.60e-10 ***
price 0.0014471 0.0003423 4.228 2.36e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 706.89 on 545 degrees of freedom
Residual deviance: 688.28 on 544 degrees of freedom
AIC: 692.28
Number of Fisher Scoring iterations: 4
The Akaike Information Criterion (AIC) provides a
method for assessing the quality of your model
through comparison with related models: the model
with the smallest AIC is the best-fitting model
(a short comparison sketch follows the note below).
Fisher scoring is a variant of Newton's method for
solving maximum likelihood problems numerically.
Note: You are not required to know
how these values are calculated
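A small sketch of how an AIC-based comparison looks in practice (model2, fitted in the next example, adds housesize as a second predictor):
# Sketch: compare candidate models by AIC; smaller is better
AIC(model1)           # 692.28 for fullbase ~ price
# AIC(model1, model2) # once model2 is fitted, compare both side by side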
Example (12/13) Prediction
If a new house has a rental price of 385.00 pounds,
what is the probability that fullbase = 1 for this house?
> # Prediction ----------------------------------------
> model1 <- glm(fullbase ~ price, data = df, family = binomial)
> plot(df$price, df$fullbase, xlab = "Price", ylab = "Fullbase",
+   frame.plot = TRUE, cex = 1.5, pch = 16, col = "blue",
+   cex.lab = 1.5, cex.axis = 1.5, cex.sub = 1.5)
> xprice <- seq(min(df$price), max(df$price))
> yprice <- predict(model1, list(price = xprice), type = "response")
> lines(xprice, yprice)
> newdata <- data.frame(price = 385.00)
> y_hat_i <- predict(model1, newdata, type = "response")
> points(newdata, y_hat_i, col = 2, pch = 20)
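As a cross-check, the same probability can be computed by hand from the inverse-logit formula p = 1 / (1 + exp(-(b0 + b1 * price))); a minimal sketch using the fitted coefficients:
# Sketch: predicted probability at price = 385, computed from the coefficients
b <- coef(model1)
p_hat <- 1 / (1 + exp(-(b[1] + b[2] * 385)))
p_hat   # should match predict(model1, newdata, type = "response")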
Example (13/13)
> model2 <- glm(fullbase ~ price + housesize, data = df, family = binomial)
> model2$coefficients
(Intercept) price housesize
-1.466744e+00 1.766831e-03 -7.286285e-05
> summary(model2)
Call:
glm(formula = fullbase ~ price + housesize, family = binomial,
data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7777 -0.8973 -0.7971 1.3701 1.7224
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.467e+00 2.784e-01 -5.269 1.37e-07 ***
price 1.767e-03 4.120e-04 4.289 1.80e-05 ***
housesize -7.286e-05 5.108e-05 -1.427 0.154
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 706.89 on 545 degrees of freedom
Residual deviance: 686.19 on 543 degrees of freedom
AIC: 692.19
Number of Fisher Scoring iterations: 4
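Since housesize is not significant here (p = 0.154), a likelihood-ratio test of the two nested models is a natural follow-up; a minimal sketch (assuming model1 and model2 are fitted as above):
# Sketch: likelihood-ratio test of model1 (price) vs model2 (price + housesize)
anova(model1, model2, test = "Chisq")   # tests the deviance drop 688.28 - 686.19 on 1 df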
Part II:
Exercises
Exercise 1/4
a) Download the “Worcester Heart Attack Trial.csv” dataset
b) Import the dataset into R
c) Show the last 10 observations in the dataset
d) Take fstat as the response variable and lenfol as the
predictor to build a logistic regression model
e) Plot the regression line
f) Calculate the odds ratio and the logit (log odds)
directly [Hint: check the lecture slides]
Exercise 2/4
a) Use the logistic regression model developed in Exercise 1 to estimate the probability
of fstat when lenfol is 6 days by direct calculation.
b) Use the R function predict() to get the predicted fstat and compare the result with a)
Exercise 3/4
a) Generate a random vector p of size 1000 with values in [0, 1]
b) Create a graph that contains two subplots:
1) Subplot (a) is a line plot of x = p and y = logit(p)
2) Subplot (b) is a line plot of x = logit⁻¹(p) and y = p
What do you see from the two subplots?
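A hint for defining the two functions yourself (a sketch; the names logit and inv_logit are our own choice):
# Hint sketch: the logit and its inverse in R
logit <- function(p) log(p / (1 - p))
inv_logit <- function(x) 1 / (1 + exp(-x))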
Exercise 4/4
a) Build a multiple logistic regression model by taking fullbase as the response variable
and price and housesize as predictors [Hint: using ‘glm’ function]
b) What is the value of the deviance now?
– How much does it differ from the deviance of the previous model (Example 7)?
– Do more predictor variables always give better results?
c) Can we improve the logistic model by adding predictors to, or removing predictors
from, the model? If so, which variables do you think could be added or removed? [Hint: check the AIC]
Thank You
bchen@lincoln.ac.uk
dabdalhafeth@lincoln.ac.uk
jhua8590@gmail.com