Logistic Regression 1
Billie Anderson, Mark Newman
2020-09-14
Agenda
▪ Question from last class
• At your work, are there any 2-way plots that you could make, but would be so cluttered that they would distract from their presentation? How would you solve this?
▪ Logistic Regression 1
▪ In-Class Examples
▪ Question for next class
Logistic Regression Models
What is a logistic regression?
A logistic regression is appropriate when the response variable (Y) is binary and you have one or more explanatory variables. The explanatory variables can be a mix of categorical and continuous.
A logistic regression is one model in a family of models called generalized linear models.
Introduction
Up to this point in the class, we have focused on understanding relationships among categorical variables, mainly by performing different types of hypothesis tests for categorical variables (chi-square, Fisher's exact, CMH, etc.).
We also have looked at odds ratios to determine relationships between two binary variables.
The rest of the class will be concerned with model building. That is, I am trying to predict or estimate a binary response variable, Y.
Remember, the famous saying
All models are wrong, but some are useful
▪ George Box
In this chapter, we are only concerned with a response variable that is binary; that is, \(Y\) is either a ‘success’ or a ‘failure’. A ‘success’ is denoted by \(1\) and a ‘failure’ by \(0\). You are always interested in modeling the \(1’s\), the ‘successes’.
We will begin by studying simple logistic regression, meaning one binary response, \(Y\), and one continuous explanatory variable, \(X\).
The Logistic Regression Model
The logistic regression model describes a relationship between the binary response variable, \(Y\), and a set of explanatory variables, \(X’s\).
We are going to start by focusing on one \(Y\) binary response and one continuous explanatory, \(X\), variable.
For a binary response variable, \(Y\), and a continuous explanatory variable, \(X\), we are interested in modeling the probability of a successful response (outcome). You are interested in the probability of the \(1s\).
We are going to put the above statement in statistical notation.
For a binary response, \(Y\), and a continuous explanatory variable, \(X\), we may be interested in modeling the probability of a successful outcome, which is denoted statistically as,
\[\pi(x)=Pr(Y=1|X=x)\]
The Logistic Regression Model (cont.)
What does this mean?
For any value of \(X\), say age, you can imagine a binomial distribution of the \(Y’s\), the responses. So, among the \(n_x\) responses observed at any given value of \(X\), the count of \(1’s\) follows a \(Bin(n_x, \pi(x))\) distribution.
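To make this concrete, here is a minimal simulation sketch; the success probability 0.3 and the group size 10 are made-up illustration values, not numbers from the arthritis data.
set.seed(1)
# At one fixed value of X, imagine n_x = 10 subjects, each responding 1
# with probability pi(x) = 0.3 (both values hypothetical)
y <- rbinom(n = 10, size = 1, prob = 0.3)
y       # ten binary responses at this value of X
sum(y)  # the count of 1s follows a Bin(n_x = 10, pi(x) = 0.3) distribution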
What does this look like?
Let’s look at a previous example. In our module on contingency tables we examined data for the treatment of arthritis.
library(vcd)
data <- vcd::Arthritis
head(data)
## ID Treatment Sex Age Improved
## 1 57 Treated Male 27 Some
## 2 46 Treated Male 29 None
## 3 77 Treated Male 30 None
## 4 17 Treated Male 32 Marked
## 5 36 Treated Male 46 Marked
## 6 23 Treated Male 58 Marked
str(data)
## 'data.frame': 84 obs. of 5 variables:
## $ ID : int 57 46 77 17 36 23 75 39 33 55 ...
## $ Treatment: Factor w/ 2 levels "Placebo","Treated": 2 2 2 2 2 2 2 2 2 2 ...
## $ Sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
## $ Age : int 27 29 30 32 46 58 59 59 63 63 ...
## $ Improved : Ord.factor w/ 3 levels "None"<"Some"<..: 2 1 1 3 3 3 1 3 1 1 ...
The Logistic Regression Model (cont.)
data$Better <- as.numeric(data$Improved > "None")
head(data)
## ID Treatment Sex Age Improved Better
## 1 57 Treated Male 27 Some 1
## 2 46 Treated Male 29 None 0
## 3 77 Treated Male 30 None 0
## 4 17 Treated Male 32 Marked 1
## 5 36 Treated Male 46 Marked 1
## 6 23 Treated Male 58 Marked 1
The Logistic Regression Model (cont.)
So, for any value of Age (\(X\)), the Y could be a 0 or a 1.
library(ggplot2)
ggplot(data = data, aes(x = Age, y = Better)) + geom_point()
[Figure: scatterplot of Better (0/1) versus Age]
The Logistic Regression Model (cont.)
Now, how do we denote the model?
The model presumes that the probability of ‘success’, the probability of being a \(1\), varies linearly with the value of the explanatory variable, \(X\).
\[\pi(x)=Pr(Y=1|X=x)=E(Y|X=x)=\alpha+\beta x\] where \(E(Y|x)\) is the proportion of \(1s\) you expect for a particular value of \(X\).
\(\alpha\) is the intercept and \(\beta\) is the parameter estimate associated with the explanatory variable, \(X\).
Development of the Logistic Regression Model
Using our arthritis data, I would be interested in modeling the probability of a patient reporting they are better (Better=\(Y=1\)), that is, \(\pi(x)\).
If I did this with the linear model defined earlier, this is what would happen:
[Figure: linear regression fit of Better versus Age, with predicted values not constrained to \([0, 1]\)]
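A minimal sketch of how such a plot could be produced with ggplot2, assuming the data and Better variable created above:
library(ggplot2)
# Overlay an ordinary linear fit on the 0/1 responses; note that the fitted
# line is not constrained to stay between 0 and 1
ggplot(data = data, aes(x = Age, y = Better)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)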
Development of the Logistic Regression Model (cont.)
The problem with using the linear model to model the success probability, \(\pi(x)\), is that you could obtain predicted probabilities less than 0 and greater than 1. That makes no sense!
So, what do we do?
Instead of modeling the success probability, \(\pi(x)\), directly, we model a transformation of \(\pi(x)\). Now, instead of there being a linear relationship between \(\pi(x)\) and the explanatory variables, \(X’s\), there is a linear relationship between the transformation of \(\pi(x)\) and the \(X’s\).
What is this transformation of \(\pi(x)\)?
The transformation is known as the logit or log of the odds. Yes, odds like we studied in our module on contingency tables! So, you are somewhat familiar with this transformation!
The left-hand side of the logistic regression model is the logit or the log of the odds. So, in logistic regression, we model the log of the odds of success, not the probability of success directly.
Development of the Logistic Regression Model (cont.)
\[logit(\pi(x))=log\left(\frac{\pi(x)}{1-\pi(x)}\right)=\alpha+\beta x\]
[Figure: the S-shaped logistic curve relating \(x\) to \(\pi(x)\)]
Anytime you model a transformation of the original response (\(Y\), or \(\pi(x)\) in this case), the result is known as a generalized linear model. A logistic regression is one type of generalized linear model because we are modeling a function of \(\pi(x)\), the logit, and not \(\pi(x)\) directly.
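As a quick check of this transformation in R: the built-in qlogis() function is the logit, and it matches the log-odds formula (the probability 0.75 is an arbitrary example value).
p <- 0.75
log(p / (1 - p))  # log odds computed from the definition
qlogis(p)         # same value from R's built-in logit function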
Finding the Probability of Success from the Logistic Model
You are interested in finding the probability of a patient reporting they are better (Better=\(Y=1\)), that is, \(\pi(x)\).
How do you do this?
By solving for \(\pi(x)\) in the logistic regression model.
We start with the logistic regression model in which we are modeling the log of the odds.
\[log\left(\frac{\pi(x)}{1-\pi(x)}\right)=\alpha+\beta x\]
\[e^{log\left(\frac{\pi(x)}{1-\pi(x)}\right)}=e^{\alpha+\beta x}\]
\[\frac{\pi(x)}{1-\pi(x)}=e^{\alpha+\beta x}\]
\[\pi(x)=(1-\pi(x))e^{\alpha+\beta x}\]
\[\pi(x)=e^{\alpha+\beta x} - \pi(x)e^{\alpha+\beta x}\]
\[\pi(x)+ \pi(x)e^{\alpha+\beta x}=e^{\alpha+\beta x}\]
\[\pi(x)(1+e^{\alpha+\beta x})=e^{\alpha+\beta x}\]
Finding the Probability of Success from the Logistic Model (cont.)
\[\pi(x)(1+e^{\alpha+\beta x})=e^{\alpha+\beta x}\]
\[\pi(x)=\frac{e^{\alpha+\beta x}}{1+e^{\alpha+\beta x}}\]
\[\pi(x)=\frac{e^{\alpha+\beta x}\cdot e^{-(\alpha+\beta x)}}{(1+e^{\alpha+\beta x})\cdot e^{-(\alpha+\beta x)}}\]
\[\pi(x)=\frac{1}{e^{-(\alpha+\beta x)}+\left(e^{\alpha+\beta x}\cdot e^{-(\alpha+\beta x)}\right)}\]
\[\pi(x)=\frac{1}{e^{-(\alpha+\beta x)}+1}\]
\[\pi(x)=\frac{1}{1+e^{-(\alpha+\beta x)}}\]
The above \(\pi(x)\) is the function you use to predict the success probability of whatever you are modeling, Better=\(1\) in this case. That is, the probability of the patient reporting they are better after taking the arthritis treatment.
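As a sketch, plugging in the coefficient estimates obtained from the model fit later in these notes (\(\alpha \approx -2.642\), \(\beta \approx 0.049\)) for a hypothetical 50-year-old patient:
a <- -2.642071  # intercept estimate from the fitted model below
b <- 0.049249   # Age coefficient estimate from the fitted model below
x <- 50         # a hypothetical age
1 / (1 + exp(-(a + b * x)))  # pi(x): predicted probability that Better = 1
plogis(a + b * x)            # the same computation via R's inverse logit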
Odds of a Success
Remember, in logistic regression, you are modeling the log of the odds, the logit.
\[logit(\pi(x))=log\left(\frac{\pi(x)}{1-\pi(x)}\right)=\alpha+\beta x\] But, if you were to tell someone that the log of the odds of a patient reporting that they are better is 2.5, what does that mean? Log of the odds does not have a contextual meaning in terms of what you are trying to model.
However, the odds of a patient reporting they are better, that does have meaning! Think back to our module on contingency tables when we first defined odds.
For a binary variable, let \(\pi\) denote the probability of ‘success’, then the odds of ‘success’ is \(\frac{\pi}{1-\pi}\).
From the rule of complements, \(1-\pi\) is the probability of a ‘success’ not occurring, i.e., the probability of ‘failure’.
\(1\) is the benchmark number when evaluating the odds of ‘success’: odds of \(1\) mean that ‘success’ and ‘failure’ are equally likely.
Another way to think of odds: \(\frac{\pi}{1-\pi}=\frac{\text{probability of event of interest ("success") occurring}}{\text{probability of event of interest ("success") not occurring}}=\frac{\text{probability of "success"}}{\text{probability of "failure"}}\)
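For example, if the probability of ‘success’ were 0.8 (an arbitrary illustration value):
p <- 0.8
p / (1 - p)  # odds of success: 4, i.e., success is 4 times as likely as failure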
Odds of a Success (cont.)
So, can we take our logistic regression model, that models the logit, the log of the odds of a ‘success’, and find the odds of a ‘success’?
Yes, of course!
Remember what we are modeling in logistic regression.
\[logit(\pi(x))=log\left(\frac{\pi(x)}{1-\pi(x)}\right)=\alpha+\beta x\]
\[log\left(\frac{\pi(x)}{1-\pi(x)}\right)=\alpha+\beta x\]
\[e^{log\left(\frac{\pi(x)}{1-\pi(x)}\right)}=e^{\alpha+\beta x}\]
\[\frac{\pi(x)}{1-\pi(x)}=e^{\alpha+\beta x}\]
So, \[odds(Y=1)=\frac{\pi(x)}{1-\pi(x)}=e^{\alpha+\beta x}=e^\alpha (e^\beta)^x\]
So, we see that the odds of ‘success’ can be expressed as a multiplicative model for the odds.
Understanding this multiplicative relationship will assist us when we interpret the model.
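A quick numeric check of this multiplicative structure, using made-up coefficients \(\alpha = -2\) and \(\beta = 0.5\): increasing \(x\) by one unit always multiplies the odds by \(e^\beta\).
a <- -2   # hypothetical intercept
b <- 0.5  # hypothetical slope
odds <- function(x) exp(a + b * x)
odds(4) / odds(3)  # ratio of odds for a one-unit increase in x
exp(b)             # e^beta: the same constant multiplicative factor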
Fitting a Logistic Regression Model
Using our arthritis data, I would be interested in modeling the probability of a patient reporting they are better (Better=\(Y=1\)), that is, \(\pi(x)\).
You use the glm() function to build the logistic regression model.
The general form of the glm() function for a logistic regression is
newModel <- glm(outcome/response ~ predictor(s), data = dataFrame, family = name of a distribution)
library(zoo)
library(lmtest)
library(vcd)
data <- vcd::Arthritis
data$Better <- as.numeric(data$Improved > "None")
fit <- glm(Better ~ Age, data = data, family = binomial)
coeftest(fit)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.642071 1.073167 -2.4619 0.01382 *
## Age 0.049249 0.019357 2.5443 0.01095 *
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
Fitting a Logistic Regression Model (cont.)
A more statistical representation of this hypothesis test is:
\[logit(\pi(x))=log\left(\frac{\pi(x)}{1-\pi(x)}\right)=\alpha+\beta x\]
\(H_0:\) \(\beta=0\)
\(H_1:\) \(\beta \ne 0\)
Fitting a Logistic Regression Model (cont.)
You can now specify the model with the output above.
\[logit(\pi(x))=log\left(\frac{\pi(x)}{1-\pi(x)}\right)=\alpha+\beta x\]
\[log\left(\frac{\pi(x)}{1-\pi(x)}\right)=-2.64+0.05 x\]
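With the fitted model object, these quantities can also be obtained with predict(); a minimal sketch for a hypothetical 50-year-old patient:
predict(fit, newdata = data.frame(Age = 50))                     # predicted log odds
predict(fit, newdata = data.frame(Age = 50), type = "response")  # predicted probability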
Notice in the output there is a hypothesis test being conducted.
The hypothesis test is as follows:
\(H_0:\) age is not related to the response variable Better
\(H_1:\) age is related to the response variable Better
Based on sample data, we obtain an estimate of \(\beta\), sometimes denoted as \(b_1=0.05\).
The test statistic for this hypothesis test is a z statistic.
\[z=\frac{b_1}{\text{standard error of } b_1}=\frac{0.049249}{0.019357}=2.5443\] This \(z\) value means that the observed estimate of \(\beta\), \(b_1=0.05\), is about 2.5 standard errors above the hypothesized value of \(\beta=0\).
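As a sketch, the pieces of this computation can be pulled directly out of the fitted model object:
est <- summary(fit)$coefficients   # matrix: estimates, std. errors, z values, p-values
b1 <- est["Age", "Estimate"]
se <- est["Age", "Std. Error"]
b1 / se  # the z statistic, approximately 2.5443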
Fitting a Logistic Regression Model (cont.)
Is the estimated value of \(\beta\), \(b_1\), far enough away from what we believe the true value of \(\beta\) to be, which is \(0\), that we can reject \(H_0\), accept \(H_1\), and conclude that \(\beta \ne 0\), that is, that age is helpful in predicting the response, patient-reported improved symptoms?
The p-value will answer the above question for us.
From the R output, the p-value is 0.01095. How do we obtain this value?
2 * pnorm(2.5443, lower.tail = FALSE)
## [1] 0.0109497
Since the p-value is \(\le \alpha\ (\text{level of significance}=0.05)\), we can reject \(H_0\), accept \(H_1\), and conclude that age is a useful predictor of the response variable Better. Age helps me know something about whether a patient reported that their arthritis symptoms were improving or not.
Fitting a Logistic Regression Model (cont.)
Since this is a two-sided test, the p-value will be two-sided as well.
[Figure: standard normal curve with both tails shaded, illustrating the two-sided p-value]
Interpretation of the Model
Since you are modeling the log of the odds directly, a direct interpretation of \(b_1\) makes no sense.
Direct interpretation of \(b_1=0.05\): as age increases by one year, the log of the odds of the patient reporting their symptoms are better increases by 0.05.
What does the log of the odds of a patient reporting improved symptoms mean?
I do not know!
However, I do understand if you could tell me how much the odds of a better response increased for a one-unit increase in age.
How can I obtain the odds of the response?
Remember what we are modeling in logistic regression.
\[logit(\pi(x))=log\left(\frac{\pi(x)}{1-\pi(x)}\right)=\alpha+\beta x\]
Interpretation of the Model (cont.)
\[logit(\pi(x))=log\left(\frac{\pi(x)}{1-\pi(x)}\right)=\alpha+\beta x\]
Look at the last two terms.
\[log\left(\frac{\pi(x)}{1-\pi(x)}\right)=\alpha+\beta x\]
\[e^{log\left(\frac{\pi(x)}{1-\pi(x)}\right)}=e^{\alpha+\beta x}\]
\[\frac{\pi(x)}{1-\pi(x)}=e^{\alpha+\beta x}\]
So, \[odds(Y=1)=\frac{\pi(x)}{1-\pi(x)}=e^{\alpha+\beta x}=e^\alpha (e^\beta)^x\]
So, we see that the odds of ‘success’ can be expressed as a multiplicative model for the odds.
Interpretation of the Model (cont.)
And, this is how we interpret the odds of response.
The left-hand side of the above formula is the odds of the patient reporting their symptoms improved. We can relate the odds of improved symptoms directly to the term \(e^{b_1}=e^{0.049249}=1.05\).
\(e^\beta\) is known as the odds ratio and it is how we interpret the logistic regression model.
Interpretation: As age increases by one year, the odds of the patient reporting their symptoms improve are multiplied by 1.05 (i.e., they increase by 5%).
Remember, from our module on contingency tables, when we first started discussing odds, \(1\) was the ‘magic number’. You can use this logic to also interpret the coefficient.
Interpretation of the Model (cont.)
Three Scenarios:
1. Odds ratio > \(1\)
If the odds ratio is greater than \(1\), then the odds of response increase by a certain percentage.
\(\text{odds ratio} - 1\) = percentage by which the odds of response increase for a one-unit change in the predictor (explanatory variable).
For our example: \(1.05 – 1 = .05 = 5\%\). So, as age increases by one year, the odds of the patient reporting improved symptoms increases by \(5\%\).
2. Odds Ratio = \(1\)
This means the predictor (explanatory) variable is not affecting the odds of response in either an increasing or a decreasing way. That is, as the explanatory variable increases by \(1\) unit, the odds of response are neither increasing nor decreasing.
3. Odds ratio < \(1\)
If the odds ratio is less than \(1\), then the odds of response decrease by a certain percentage.
\(\text{odds ratio} - 1\) = percentage by which the odds of response decrease for a one-unit change in the predictor (explanatory variable).
For example, if the odds ratio was \(0.95\), then \(0.95 - 1 = -0.05 = -5\%\). So, as age increases by one year, the odds of the patient reporting improved symptoms decreases by \(5\%\).
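These percentage interpretations are easy to compute from the fitted model; a minimal sketch using the Age coefficient from our fit:
# percent change in the odds of Better = 1 per one-year increase in Age
(exp(coef(fit))["Age"] - 1) * 100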
Computing the Odds Ratio in R
You can use the coef() function, wrapped in exp(), to find the odds ratios for the logistic regression model.
The argument for coef() is the name of the model (R object) that you created when you built the logistic regression, fit in our example.
exp(coef(fit))
## (Intercept) Age
## 0.0712136 1.0504819
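Note that the Age entry, 1.0504819, matches the hand computation \(e^{0.049249}=1.05\) above. The exponentiated intercept is the odds of reporting improvement for a hypothetical patient of age zero, so it is not usually meaningful on its own.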
In-Class Examples
▪ Tying everything together
Question for next class
At your work, what is an example of something with a yes/no outcome? What predictors can help make that determination?