Logistic Regression

Bowei Chen, Deema Hafeth and Jingmin Huang

School of Computer Science

University of Lincoln

CMP3036M/CMP9063M

Data Science 2016 – 2017 Workshop

Today’s Objectives

• Study the slides in Part I, including:

– Implementation of logistic regression in R

– Interpretation of results of logistic regression in R

• Do the exercises in Part II

Part I:

Examples of Logistic Regression in R

Example (1/13)

This is an example of implementing logistic regression models in R.

We will use the Housing.csv dataset

> df <- read.csv("C:/Housing.csv")
> dim(df)

[1] 546 12

Use the dim() function to see the size of the data. There are 546
observations and 12 variables in the dataset.
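If you need the two dimensions separately, nrow() and ncol() return the same information (a small sketch, not on the original slide):

> nrow(df)   # number of observations (rows)
[1] 546
> ncol(df)   # number of variables (columns)
[1] 12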

Example (2/13)

> head(df)

price housesize bedrooms bathrms stories driveway recroom fullbase gashw airco garagepl prefarea

1 420 5850 3 1 2 1 0 1 0 0 1 0

2 385 4000 2 1 1 1 0 0 0 0 0 0

3 495 3060 3 1 1 1 0 0 0 0 0 0

4 605 6650 3 1 2 1 1 0 0 0 0 0

5 610 6360 2 1 1 1 0 0 0 0 0 0

6 660 4160 3 1 1 1 1 1 0 1 0 0

Use the head() function to look at the first few
(by default 6) observations of the data.
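head() also accepts a second argument giving the number of rows to display, and tail() works the same way for the end of the data frame (a small sketch, not on the original slide; tail() is handy for Exercise 1c):

> head(df, 10)   # first 10 observations
> tail(df, 10)   # last 10 observations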

Example (3/13)

> lapply(df,class)

$price
[1] "numeric"

$housesize
[1] "integer"

$bedrooms
[1] "integer"

$bathrms
[1] "integer"

$stories
[1] "integer"

$driveway
[1] "integer"

…….

Use lapply() to see the data type of
each variable (displayed vertically).
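As a side note (not on the original slide), sapply() returns the same class information as a compact named vector, and str() gives a one-line overview of every column:

> sapply(df, class)   # classes as a named character vector
> str(df)             # structure: type and first few values of each column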

Example (4/13)

> summary(df)

price housesize bedrooms bathrms stories driveway

Min. : 250.0 Min. : 1650 Min. :1.000 Min. :1.000 Min. :1.000 Min. :0.000

1st Qu.: 491.2 1st Qu.: 3600 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000

Median : 620.0 Median : 4600 Median :3.000 Median :1.000 Median :2.000 Median :1.000

Mean : 681.2 Mean : 5150 Mean :2.965 Mean :1.286 Mean :1.808 Mean :0.859

3rd Qu.: 820.0 3rd Qu.: 6360 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:1.000

Max. :1900.0 Max. :16200 Max. :6.000 Max. :4.000 Max. :4.000 Max. :1.000

recroom fullbase gashw airco garagepl prefarea

Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000

1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000

Median :0.0000 Median :0.0000 Median :0.00000 Median :0.0000 Median :0.0000 Median :0.0000

Mean :0.1777 Mean :0.3498 Mean :0.04579 Mean :0.3168 Mean :0.6923 Mean :0.2344

3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000

Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :3.0000 Max. :1.0000

Use summary() to produce summary
statistics for each variable.

Example (5/13)

> summary(df$price)

Min. 1st Qu. Median Mean 3rd Qu. Max.

250.0 491.2 620.0 681.2 820.0 1900.0

Use summary() to produce summary
statistics for one variable at a time.

Example (6/13)

Let's create a graph with two subplots, one for each predictor. This can be very
helpful for understanding the effect of each predictor on the response variable.

> par(mfrow = c(1, 2))
> plot(df$price, df$fullbase, xlab = "Price",
+      ylab = "Fullbase", frame.plot = TRUE, cex = 1.5,
+      pch = 16, col = "green", cex.lab = 1.5, cex.axis = 1.5,
+      cex.sub = 1.5)
> plot(df$housesize, df$fullbase, xlab = "Housesize",
+      ylab = "Fullbase", frame.plot = TRUE, cex = 1.5,
+      pch = 16, col = "blue", cex.lab = 1.5, cex.axis = 1.5,
+      cex.sub = 1.5)

Example (7/13)

> model1 <- glm(fullbase ~ price, data = df, family = binomial)
> model1$coefficients

  (Intercept)         price
-1.622737e+00  1.447098e-03

> plot(df$price, df$fullbase, xlab = "Price",
+      ylab = "Fullbase", frame.plot = TRUE, cex = 1.5, pch = 16,
+      col = "blue", cex.lab = 1.5, cex.axis = 1.5, cex.sub = 1.5)
> xprice <- seq(min(df$price), max(df$price))
> yprice <- predict(model1, list(price = xprice), type = "response")
> lines(xprice, yprice)

Develop a logistic regression model using the R built-in glm() function.

Note: The regression line may not be clearly visible

because of the wide range of the price variable.
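The fitted coefficients of model1 can also be read on the odds scale. A minimal sketch (not on the original slide): exponentiating a coefficient gives the multiplicative change in the odds of fullbase = 1 per unit increase of the predictor, which is also relevant to Exercise 1f.

> exp(coef(model1))   # odds ratios for the intercept and price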

Example (8/13)

> # Get a better regression line plot
> range(df$price)
[1]  250 1900

> plot(df$price, df$fullbase, xlim = c(0, 2150), ylim = c(-1, 2),
+      xlab = "Price", ylab = "Fullbase", col = "blue",
+      frame.plot = TRUE, cex = 1.5, pch = 16, cex.lab = 1.5,
+      cex.axis = 1.5, cex.sub = 1.5)
> xprice <- seq(0, 2150)
> yprice <- predict(model1, list(price = xprice), type = "response")
> lines(xprice, yprice)

Develop a logistic regression model using the R built-in glm() function.

Here we see (from the model summary shown in the next example):

• whether the response variable and the predictor(s) are positively or negatively related (from the sign of the coefficient)

• the 𝑧 value and 𝑝-value, which belong to the hypothesis test of whether a coefficient is zero. The null hypothesis is that the coefficient is zero. As the 𝑝-value is much less than 0.05, we reject the null hypothesis that 𝛽 = 0.
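As a cross-check (not part of the original slide), the two-sided 𝑝-value can be recovered from the 𝑧 value using the standard normal distribution; here 4.228 is the 𝑧 value reported for price in summary(model1):

> z <- 4.228            # z value of the price coefficient from summary(model1)
> 2 * pnorm(-abs(z))    # two-sided p-value, approximately 2.36e-05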

Example (9/13)

> summary(model1)

Call:

glm(formula = fullbase ~ price, family = binomial, data = df)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.6778 -0.8992 -0.8012 1.3529 1.7316

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -1.6227365 0.2567345 -6.321 2.60e-10 ***

price 0.0014471 0.0003423 4.228 2.36e-05 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 706.89 on 545 degrees of freedom

Residual deviance: 688.28 on 544 degrees of freedom

AIC: 692.28

Number of Fisher Scoring iterations: 4

Note: You are not required to know

how these values are calculated

Example (10/13)
> summary(model1)

Call:

glm(formula = fullbase ~ price, family = binomial, data = df)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.6778 -0.8992 -0.8012 1.3529 1.7316

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -1.6227365 0.2567345 -6.321 2.60e-10 ***

price 0.0014471 0.0003423 4.228 2.36e-05 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 706.89 on 545 degrees of freedom

Residual deviance: 688.28 on 544 degrees of freedom

AIC: 692.28

Number of Fisher Scoring iterations: 4

Deviance is a measure of goodness of fit of a

regression model (higher values indicate a worse

fit). The 'Null deviance' shows how well the

response variable is predicted by a model that

includes only the intercept.

In R:

model1$null.deviance (the null deviance)
model1$deviance (the residual deviance)

For example, we have a null deviance of 706.89 on 545

degrees of freedom. Including the independent

variable (price) decreases the deviance to

688.28 on 544 degrees of freedom.

The residual deviance has been reduced by 18.61

at a cost of one degree of freedom.
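If you want to check this drop in deviance yourself, a minimal sketch (assuming model1 is the model fitted above) compares it against a chi-squared distribution with one degree of freedom; anova() reports the same test:

> dev_drop <- model1$null.deviance - model1$deviance   # about 18.61
> pchisq(dev_drop, df = 1, lower.tail = FALSE)         # p-value for adding price
> anova(model1, test = "Chisq")                        # equivalent deviance table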

Note: You are not required to know

how these values are calculated

Example (11/13)
> summary(model1)

Call:

glm(formula = fullbase ~ price, family = binomial, data = df)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.6778 -0.8992 -0.8012 1.3529 1.7316

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -1.6227365 0.2567345 -6.321 2.60e-10 ***

price 0.0014471 0.0003423 4.228 2.36e-05 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 706.89 on 545 degrees of freedom

Residual deviance: 688.28 on 544 degrees of freedom

AIC: 692.28

Number of Fisher Scoring iterations: 4

The Akaike Information Criterion (AIC) provides a

method for assessing the quality of your model

through comparison with related models (the model

with the smallest AIC is the best-fitting one).

Fisher scoring is a variant of

Newton's method for solving maximum

likelihood problems numerically.
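The AIC reported in the summary can also be extracted directly, which is convenient when comparing several candidate models (a small sketch, not on the original slide):

> AIC(model1)    # same value as reported by summary(model1): 692.28
> model1$aic     # also stored in the fitted model object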

Note: You are not required to know

how these values are calculated

Example (12/13) Prediction
If a new house has a rental price of 385.00 pounds,

what is the probability that this house has a full basement (fullbase = 1)?

> # Prediction ------------------------------------------
> model1 <- glm(fullbase ~ price, data = df, family = binomial)
> plot(df$price, df$fullbase, xlab = "Price", ylab = "Fullbase",
+      frame.plot = TRUE, cex = 1.5, pch = 16, col = "blue",
+      cex.lab = 1.5, cex.axis = 1.5, cex.sub = 1.5)
> xprice <- seq(min(df$price), max(df$price))
> yprice <- predict(model1, list(price = xprice), type = "response")
> lines(xprice, yprice)
> newdata <- data.frame(price = 385.00)
> y_hat_i <- predict(model1, newdata, type = "response")
> points(newdata, y_hat_i, col = 2, pch = 20)
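As a cross-check of predict() (not on the original slide), the same probability can be computed directly from the fitted coefficients; plogis() is the inverse logit 1 / (1 + exp(-x)) in base R, and this is essentially the direct calculation asked for in Exercise 2a:

> eta <- coef(model1)[1] + coef(model1)[2] * 385   # linear predictor at price = 385
> plogis(eta)                                      # should match y_hat_i above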

Example (13/13)

> model2 <- glm(fullbase ~ price + housesize, data = df, family = binomial)
> model2$coefficients

  (Intercept)         price     housesize
-1.466744e+00  1.766831e-03 -7.286285e-05

> summary(model2)

Call:

glm(formula = fullbase ~ price + housesize, family = binomial,

data = df)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.7777 -0.8973 -0.7971 1.3701 1.7224

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -1.467e+00 2.784e-01 -5.269 1.37e-07 ***

price 1.767e-03 4.120e-04 4.289 1.80e-05 ***

housesize -7.286e-05 5.108e-05 -1.427 0.154

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 706.89 on 545 degrees of freedom

Residual deviance: 686.19 on 543 degrees of freedom

AIC: 692.19

Number of Fisher Scoring iterations: 4
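To compare model1 and model2 formally, a sketch using the built-in deviance and AIC comparisons (not on the original slide):

> anova(model1, model2, test = "Chisq")   # does adding housesize significantly reduce the deviance?
> AIC(model1, model2)                     # the model with the smaller AIC is preferred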


Part II:

Exercises

Exercise 1/4

a) Download the “Worcester Heart Attack Trial.csv” dataset

b) Import the dataset into R

c) Show the last 10 observations in the dataset

d) Take fstat as the response variable and lenfol as the

predictor to build a logistic regression model

e) Plot the regression line

f) Calculate the odds ratio and the logit (log odds)

directly [Hint: check the lecture slides]

Exercise 2/4

a) Use the logistic regression model developed in Exercise 1 to estimate the probability

of fstat when lenfol is 6 days, by direct calculation.

b) Use the R function predict() to get the predicted fstat and compare the result with a)

Exercise 3/4

a) Generate a random vector 𝑝 of size 1000 with values in [0, 1]

b) Create a graph that contains two subplots:

1) Subplot (a) is a line plot of x = 𝑝 and y = logit(𝑝)

2) Subplot (b) is a line plot of x = logit⁻¹(𝑝) and y = 𝑝

What do you see from the two subplots?

Exercise 4/4

a) Build a multiple logistic regression model by taking fullbase as the response variable

and price and housesize as predictors [Hint: use the glm() function]

b) What is the value of the deviance now?

– How different is it from the value for the previous model (Example 7)?

– Do more predictor variables always give better results?

c) Can we improve the logistic model by adding predictors to, or removing predictors from, the

model? If so, which variables do you think could be added or removed? [Hint: check the AIC]

Thank You

bchen@lincoln.ac.uk

dabdalhafeth@lincoln.ac.uk

jhua8590@gmail.com
