Regression Analysis: Logistic Regression
Logistic Regression
Ch.6 Multivariate Data Analysis. Joseph Hair et al. 2010. Pearson
Logistic Regression in Python. Dhiraj Kumar. 2019. Technics
Logistic Regression
a specialized form of regression that is designed to predict and explain a binary (two-group) categorical variable
0 for negative outcome (event did not occur)
1 for positive outcome (event did occur)
Assumptions
No specific distributional form for the independent variables
Homoscedasticity of the independent variables is not required
Linear relationships between the dependent and independent variables are not required
Logistic regression is a specialized form of regression that is formulated to predict and explain a
binary (two-group) categorical variable rather than a metric dependent measure.
The logistic model was originally developed for use in survival analysis, where the response y is typically measured as 0 or 1, depending on whether the experimental unit (e.g., a patient) "survives."
A typical binary coding uses 0 for a negative outcome (the event did not occur) and 1 for a positive outcome (the event did occur).
Logistic regression may be described as estimating the relationship between a single nonmetric
(binary) dependent variable and a set of metric or nonmetric independent variables
It does not require any specific distributional form of the independent variables, and issues such as heteroscedasticity do not come into play as they did in discriminant analysis. Moreover, logistic regression does not require linear relationships between the independent variables and the dependent variable.
Simple Logit Model
maximum likelihood method: -2LL (-2 log likelihood)
Linear model
values: (-∞ to +∞)
Exponential Transformation
values: (0 to +∞)
Logit Transformation
values: (0 to 1)
the lower the -2LL value, the better the fit of the model
the logit R2 value ranges from 0.0 to 1.0
Perfect Fit:
-2LL = 0
Logit R2 = 1
If p > .5, we predict the outcome 1.
If p < .5, we predict the outcome 0.
Recall that our standard linear model can have expected values ranging from minus infinity to plus infinity. We need to limit the range of our predictions to the interval between zero and one. We will use the exponential function and the logit transformation, as sketched below.
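As a sketch in standard notation (the symbols \eta and p are conventional labels, not taken from the slides), the chain of transformations is:

\eta = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k \in (-\infty, +\infty)
\qquad
e^{\eta} \in (0, +\infty)
\qquad
p = \frac{e^{\eta}}{1 + e^{\eta}} \in (0, 1)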
Multiple regression employs the method of least squares, which minimizes the sum of the squared differences between the actual and predicted values of the dependent variable. The nonlinear nature of the logistic transformation requires that another procedure, the maximum likelihood procedure, be used in an iterative manner to find the most likely estimates for the coefficients.
Logistic regression measures model estimation fit with the value of -2 times the log of the likelihood
value, referred to as -2LL or -2 log likelihood. Thus, the lower the -2LL value, the better the fit of
the model.
Just like its multiple regression counterpart, the logit R2 value ranges from 0.0 to 1.0. As model fit improves, the -2LL value decreases. A perfect fit has a -2LL value of 0.0 and a logit R2 of 1.0.
Instead of minimizing the squared deviations
(least squares), logistic regression maximizes the likelihood that an event will occur.
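In standard notation (a sketch; these are the usual definitions, consistent with the slides' description, with \hat{p}_i the predicted probability for observation i):

-2LL = -2 \sum_{i=1}^{n} \left[ y_i \ln \hat{p}_i + (1 - y_i) \ln(1 - \hat{p}_i) \right]
\qquad
R^2_{LOGIT} = \frac{-2LL_{null} - (-2LL_{model})}{-2LL_{null}}

so -2LL = 0 gives a logit R2 of 1.0, matching the perfect-fit case above.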
But how do we predict group membership from the logistic curve? For each observation, the logistic regression technique predicts a probability value between 0 and 1.
This predicted
probability is based on the value(s) of the independent variable(s) and the estimated coefficients.
If the predicted probability is greater than .50, then the prediction is that the outcome is 1
(the event happened); otherwise, the outcome is predicted to be 0 (the event did not happen).
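A minimal R sketch of this cutoff rule; the probabilities below are made-up values for illustration only:

p_hat <- c(0.12, 0.73, 0.49, 0.91)   # illustrative predicted probabilities (invented)
y_hat <- ifelse(p_hat > 0.5, 1, 0)   # 1 = event happened, 0 = event did not happen
y_hat                                # 0 1 0 1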
Logistic Regression: Machine Learning Angle
Classification of a set of observations into 2 or more discrete categories is a central task in Machine Learning
The classic supervised learning setting:
Data points are represented by a set of features, i.e., discrete or continuous explanatory variables
The “training” data also have a label indicating the class of the data-point
A model is fitted on the training data
The trained model is then used to predict the class of unseen data-points (where we know the values of the features, but we do not have the label)
Logistic regression – the same setting, except that the ML emphasis is placed on predicting the class of unseen data rather than on the significance of the effects of the features/independent variables
Logistic regression - a standard technique in ML, sometimes known as Maximum Entropy
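A self-contained R sketch of this supervised-learning workflow; the simulated data, the 150/50 split, and the object names are illustrative assumptions, not taken from the slides:

# Simulate labeled data: feature x, class label y
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-0.5 + 2 * x))
d <- data.frame(x = x, y = y)

# Fit the model on "training" data only
train <- d[1:150, ]
test  <- d[151:200, ]                  # "unseen" data-points
fit <- glm(y ~ x, family = binomial(link = 'logit'), data = train)

# Predict the class of the unseen data-points
p_hat <- predict(fit, newdata = test, type = "response")
class_hat <- ifelse(p_hat > 0.5, 1, 0)
mean(class_hat == test$y)              # held-out accuracy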
Generalized Linear Model Part 1: R
Logistic Regression is a part of GLM, where the expected value of the dependent variable is modeled via the "link" function
For the binomial families, the response can be specified in one of three ways (see the code sketch after this list):
As a factor
As a numerical vector with values between zero and one, interpreted as the proportion of successful cases (with the total number of cases given by the weights)
As a two-column integer matrix: The first column gives the number of successes and the second, the number of failures
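A minimal R sketch of the three equivalent specifications; the toy data (four groups of 10 trials each) are an illustrative assumption:

x    <- c(1, 2, 3, 4)
succ <- c(2, 5, 8, 9)                  # successes out of 10 trials per group
fail <- 10 - succ

# (1) as a factor, one row per case (the first level is taken as failure)
xi <- rep(x, each = 10)
yi <- factor(unlist(lapply(succ, function(s) c(rep(1, s), rep(0, 10 - s)))))
m1 <- glm(yi ~ xi, family = binomial(link = 'logit'))

# (2) as a proportion of successes, total cases supplied via weights
m2 <- glm(succ / 10 ~ x, family = binomial, weights = rep(10, 4))

# (3) as a two-column integer matrix: successes first, then failures
m3 <- glm(cbind(succ, fail) ~ x, family = binomial)

rbind(coef(m1), coef(m2), coef(m3))    # all three give identical coefficients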
Giuseppe Ciaburro. 2018. Regression Analysis in R. Packt
Logistic Regression is a part of GLM
GLMs are extensions of traditional regression models that allow the mean to depend on the explanatory variables through a link function, and the response variable to be any member of a set of distributions called the exponential family (such as Binomial, Gaussian, Poisson, and others).
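In standard notation (a sketch, not taken from the slides): the link function g connects the mean of the response, \mu = E[Y], to the linear predictor, and logistic regression is the special case of a binomial response with the logit link:

g(\mu) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k
\qquad
\text{logit link:}\quad g(p) = \ln\frac{p}{1-p}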
Workflow: Logistic Regression Analysis
BCData <- read.table(url("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"), sep = ",")
STEP 1: Explore
str(BCData)
How many observations: 699
How many variables: 11
Missing data: yes (coded as '?')
STEP 2: Remove Missing Data
names(BCData) <- c('Id', 'ClumpThickness', 'CellSize', 'CellShape', 'MarginalAdhesion', 'SECellSize', 'BareNuclei', 'BlandChromatin', 'NormalNucleoli', 'Mitoses', 'Class')
BCData<-BCData[!(BCData$BareNuclei=='?'),]
BCData$BareNuclei <- as.integer(as.character(BCData$BareNuclei))  # as.character() guards against the column having been read as a factor
Instead of the filename, we can also enter a complete URL of a file contained in a website repository. Now, we use the names() function to set the names of the dataset.
Let's start by checking the dataset using the str() function. This function provides a compact display of the internal structure of an object. There is a value equal to ?, indicative of the presence of missing attribute values. We need to remove these values.
We need to convert the BareNuclei variable into integer values, since the '?' entries caused the column to be read as text rather than numbers.
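As a quick sketch, we can count the '?' placeholders before the filtering step above; 16 agrees with the counts elsewhere in these slides (699 observations, 444 + 239 = 683 after removal):

sum(BCData$BareNuclei == '?')   # run before filtering; 699 - 683 = 16 rows affected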
Workflow: Logistic Regression Analysis
summary(BCData)
STEP 3: Preparation
table(BCData$Class)
STEP 4: Model
BCData$Class<-replace(BCData$Class,BCData$Class==2,0)
BCData$Class<-replace(BCData$Class,BCData$Class==4,1)
2 = benign, 4 = malignant
LogModel <- glm(Class ~.-Id, family=binomial(link='logit'),data=BCData)
Class ~ . - Id: Class is the outcome/dependent variable; all features/independent variables except Id are the predictors
Now, to obtain a brief summary of the dataset we can use the summary() function. The summary() function returns a set of statistics for each variable.
To get the number of cancer cases for each class, we can use the table() function. This function uses the cross-classifying factors to build a contingency table of the counts at each combination of factor levels:
Here, 2 means benign and 4 means malignant. So, 444 cases of the benign class and 239 cases of the malignant class were detected. For glm() we need to recode them as 0 and 1.
The Id variable contains the observation ID. It is clear that this variable can be omitted from the model. Logistic regression is called by imposing the family: family = binomial(link = 'logit'). The Class ~ . - Id code means that we want to create a model that explains the Class variable (benign or malignant breast cancer) depending on all the other variables contained in the dataset except Id.
Workflow: Logistic Regression Analysis
summary(LogModel)
STEP 5: Model Summary
LGModelPred <- round(predict(LogModel, type="response"))
Confusion matrix: the model made only 20 errors (8 false negatives and 12 false positives)
The summary output begins with the deviance residuals, which are a measure of model fit. This part of the output shows the distribution of the deviance residuals for individual cases used in the model.
We can see that not all coefficients are significant (p-value < 0.05). The logistic regression coefficients give the change in the log odds of the outcome for a one-unit increase in the predictor variable. ClumpThickness, CellShape, MarginalAdhesion, BlandChromatin, and Mitoses are statistically significant.
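Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to interpret. A one-line sketch using the model fitted above:

exp(coef(LogModel))   # odds ratio per one-unit increase in each predictor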
Following the table of coefficients are fit indices, including the null deviance, the residual deviance, and the AIC. The null deviance indicates how well the response is predicted by a model with nothing but an intercept; the residual deviance indicates how well it is predicted after adding the independent variables. In both cases, the lower the value, the better the model. Deviance is a measure of goodness of fit of a generalized linear model. Or rather, it is a measure of badness of fit: higher numbers indicate worse fit.
In logistic regression, AIC plays a role analogous to that of adjusted R-squared: it is a measure of fit that penalizes a model for the number of model coefficients. Therefore, we always prefer the model with the minimum AIC value.
Finally, the number of Fisher Scoring iterations is returned. Fisher's scoring algorithm is a derivative of Newton's method for solving maximum likelihood problems numerically. For our model, we see that Fisher's scoring algorithm needed six iterations to perform the fit.
We can use the model to make predictions. To do this, we will use the predict() function.
In this line of code, we applied the predict() function to the previously built logistic regression model (LogModel), passing the entire set of data at our disposal. The results were then rounded off using the round() function. Now, we count the cases:
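The counting line itself is not shown on the slide; one way to do it, as a sketch, is table() on the rounded predictions:

table(LGModelPred)   # how many observations were predicted 0 (benign) and 1 (malignant)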
To analyze the performance of the model in the classification, we can calculate the confusion matrix. In a confusion matrix, our classification results are compared to real data.
In this matrix, the diagonal cells show the number of cases that were correctly classified; all the other cells show the misclassified cases. The initial way to view the confusion matrix is obtained by the table() function, as follows:
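The call itself is not shown on the slide; a sketch of it, with argument names added here for readability:

table(Actual = BCData$Class, Predicted = LGModelPred)   # rows: true class, columns: rounded predictions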