Regression versus Classification
Variables can be characterized as either quantitative or qualitative.
Examples of quantitative variables include a person’s age, height, or income, the value of a house, and the price of a stock. Quantitative variables take on numerical values.
In contrast, qualitative variables take on values in one of K different classes, or categories.
Examples of qualitative variables include a person's gender (male or female), the brand of product purchased (brand A, B, or C), or whether a person defaults on a debt (yes or no).
When the response variable Y is quantitative, these machine learning problems are referred to as regression problems.
When the response variable Y is qualitative, these machine learning problems are referred to as classification problems.
Applications of Machine Learning Classification in Real Life
Classification problems occur often, perhaps even more so than regression problems. Some examples include:
A person arrives at a clinic with a set of symptoms and the doctor has to decide which of three medical conditions the patient is suffering from.
An online financial platform must decide whether or not a transaction being performed on the site is fraudulent based on past transaction history, and other information such as IP address.
A biologist has to figure out which DNA mutations are disease causing and which are not.
Loss function for the Classification Setting
We saw how the quadratic loss function was the natural loss function to consider in regression settings:
$$E_{\pi(x,y)}(Y - f(X))^2 = E\big(L(Y, f(X))\big),$$
where $L(a, \delta) = (a - \delta)^2$ is the loss function with unknown state $a$ and decision $\delta$.
Many of the concepts we discussed earlier, such as the bias-variance trade-off, carry over to the classification setting.
However, some modifications are needed because Y is no longer numerical.
In the case of classification we consider the 0-1 loss function:
$$L(a, \delta) = \begin{cases} 0 & \text{if } a = \delta, \\ 1 & \text{if } a \neq \delta, \end{cases}$$
i.e., $L(a, \delta) = I(a \neq \delta)$, where $I(A)$ is the indicator function of the set $A$.
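As a minimal sketch in R (the helper zero_one_loss is ours for illustration, not part of the lecture code), the 0-1 loss and its average over several predictions can be computed as:
#0-1 loss: 0 when the true label a equals the decision delta, 1 otherwise
zero_one_loss <- function(a, delta) as.numeric(a != delta)
#average 0-1 loss over three predictions; one of them is wrong, so the mean is 1/3
mean(zero_one_loss(c("Yes", "No", "Yes"), c("Yes", "Yes", "Yes")))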
Best classifier formulation and derivation
Suppose we seek to estimate the best classifier f (X ) based on the 0 − 1 loss function.
The formulation of the best classifier problem is similar to the case of identifying E(Y|X) as the best regressor previously.
Assume Y can take two labels a and b, and the joint probability distribution of (X,Y) is π(x,y).
The expected loss when $(X, Y) \sim \pi(x, y)$ is
$$E_{\pi(x,y)} L(Y, f(X)) = E_{\pi(x)} E_{\pi(y|x)} L(Y, f(X)) = E_{\pi(x)}\Big[ L(a, f(X))\,\pi(a|X) + L(b, f(X))\,\pi(b|X) \Big]. \qquad (*)$$
Best classifier formulation and derivation (cont.)
To minimize $E_{\pi(x,y)} L(Y, f(X))$ with respect to f, we minimize the expression (*) inside the square brackets for each X. Thus, if π(a|X) < π(b|X), it is best to take f(X) = b, since in that case
$$(*) = 1 \cdot \pi(a|X) + 0 \cdot \pi(b|X) = \pi(a|X),$$
the smaller of π(a|X) and π(b|X).
Conversely, if π(a|X) > π(b|X), it is best to take f(X) = a, since in that case
$$(*) = 0 \cdot \pi(a|X) + 1 \cdot \pi(b|X) = \pi(b|X),$$
which is again the smaller of π(a|X) and π(b|X).
Best classifier formulation and derivation (cont.)
Putting this all together, the best classifier f(X) is the one that predicts the label of y as follows:
$$f(X) = \begin{cases} a & \text{if } \pi(a|X) > \pi(b|X), \\ b & \text{if } \pi(a|X) < \pi(b|X), \\ a \text{ or } b & \text{if } \pi(a|X) = \pi(b|X). \end{cases}$$
Since π(a|X) + π(b|X) = 1, the three conditions above are equivalent to π(a|X) > 0.5, π(a|X) < 0.5 and π(a|X) = 0.5.
However, since π(x, y) is unknown, the conditional probabilities π(a|X) and π(b|X) are unknown and have to be estimated based on a pre-specified class C.
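As a minimal sketch in R (the function best_classifier and the labels "a" and "b" are ours for illustration), the rule above amounts to thresholding π(a|X) at 0.5:
#Best classifier under 0-1 loss, given the conditional probability pi(a|X)
best_classifier <- function(pi_a_given_x) {
  ifelse(pi_a_given_x > 0.5, "a",
         ifelse(pi_a_given_x < 0.5, "b", "a"))  #at a tie either label is optimal; pick "a"
}
best_classifier(c(0.7, 0.2, 0.5))  #returns "a" "b" "a"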
Learning a classifier from training samples
We seek to estimate $\hat{f}$ from the class C based on a training dataset $\{(x_i, y_i),\ i = 1, 2, \cdots, n\}$.
Find
$$\hat{f}(X) = \arg\min_{f \in C} \frac{1}{n} \sum_{i=1}^{n} I\big(y_i \neq f(x_i)\big).$$
This training error rate quantifies the proportion of mistakes (misclassifications) made if we use $f(x_i)$ to predict the labels $y_i$ of the training observations.
$\hat{f}$ is the classifier that minimizes the misclassification rate over the class C. The test (validation) misclassification rate is
$$\text{ErrorRate}_{Valid}(C) = \frac{1}{m} \sum_{j=1}^{m} I\big(y_{0,j} \neq \hat{f}(x_{0,j})\big),$$
where $\{(x_{0,j}, y_{0,j}),\ j = 1, 2, \cdots, m\}$ is a test (validation) dataset.
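As a minimal sketch (error_rate is our illustrative helper; true and predicted labels are assumed to be character vectors), both error rates above are just the proportion of mismatched labels:
#Misclassification rate: proportion of predicted labels differing from the true labels
error_rate <- function(y_true, y_pred) mean(y_true != y_pred)
error_rate(c("a", "b", "a", "b"), c("a", "a", "a", "b"))  #0.25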
Choice of C: Logistic Regression
The labels for Y are assumed to be 0 and 1 instead of a and b. Assume for now that there is one independent variable X. Logistic regression assumes that
$$\pi(1|x, \beta) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}},$$
where $\beta = (\beta_0, \beta_1)$, and
$$\pi(0|x, \beta) = 1 - \pi(1|x, \beta) = \frac{1}{1 + e^{\beta_0 + \beta_1 x}}.$$
Based on the training criterion, we seek
$$\hat{f}(x; \hat{\beta}) = \arg\min_{\beta} \frac{1}{n} \sum_{i=1}^{n} I\big(y_i \neq f(x_i; \beta)\big),$$
where f(x; β) = 1 if π(1|x, β) > π(0|x, β), and = 0 if π(1|x, β) < π(0|x, β).
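As a minimal sketch in R (pi1 is our illustrative name; the coefficient values are taken from the Default fit shown later), π(1|x, β) is simply the logistic function of the linear predictor:
#pi(1 | x, beta): probability of label 1 under the logistic model
pi1 <- function(x, beta0, beta1) {
  eta <- beta0 + beta1 * x    #linear predictor beta0 + beta1*x
  exp(eta) / (1 + exp(eta))   #logistic transform, always between 0 and 1
}
pi1(x = 2000, beta0 = -10.65, beta1 = 0.0055)  #approximately 0.59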
Difficulty in Minimization
The minimization
$$\hat{f}(x; \hat{\beta}) = \arg\min_{\beta} \frac{1}{n} \sum_{i=1}^{n} I\big(y_i \neq f(x_i; \beta)\big),$$
where f(x; β) = 1 if π(1|x, β) > π(0|x, β), and = 0 if π(1|x, β) < π(0|x, β),
is impossible to carry out directly since f(x; β) is not a continuous function of β. The rule f(x; β) obtained from π(1|x, β) above is called hard thresholding.
We need a loss function that is continuously differentiable in β. We therefore use soft thresholding: we assume that $y_i \sim \text{Ber}(\pi(1|x_i, \beta))$ independently for each $i = 1, 2, \cdots, n$.
The Logistic Loss (Log Loss) Function
Since $y_i \sim \text{Ber}(\pi(1|x_i, \beta))$ independently for each $i = 1, 2, \cdots, n$, the likelihood is given by
$$l(\beta_0, \beta_1; y) = \prod_{i=1}^{n} \pi(1|x_i, \beta)^{y_i}\, \pi(0|x_i, \beta)^{1 - y_i}.$$
The general relationship between a likelihood l and its corresponding loss function L is $L = -\log(l)$.
So, the loss function used to train the logistic regression model is
$$L_{Logistic}(\beta) = -\log l(\beta_0, \beta_1; y) = -\sum_{i=1}^{n} \big[ y_i \log \pi(1|x_i, \beta) + (1 - y_i) \log \pi(0|x_i, \beta) \big].$$
This is called the logistic loss function (log loss) and is used in place of the misclassification loss function to estimate β.
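As a minimal sketch in R (log_loss is our illustrative name; in practice glm() performs this minimization internally), the log loss for a single predictor can be written and passed to a general-purpose optimizer:
#Negative log-likelihood (log loss) for logistic regression with one predictor;
#beta = c(beta0, beta1), y is a 0/1 vector
log_loss <- function(beta, x, y) {
  p1 <- plogis(beta[1] + beta[2] * x)   #pi(1 | x, beta)
  -sum(y * log(p1) + (1 - y) * log(1 - p1))
}
#optim(c(0, 0), log_loss, x = x, y = y) would minimize it over beta,
#mimicking what glm(..., family = binomial) does via iteratively reweighted least squares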
Classification based on Log Loss: Steps Involved
Choose the class
$$C = \left\{ \pi(1|x, \beta) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} \right\}.$$
Minimize the log loss with respect to β over a training dataset:
$$(\hat{\beta}_0, \hat{\beta}_1) = \arg\min_{\beta}\ -\frac{1}{n} \sum_{i=1}^{n} \big[ y_i \log \pi(1|x_i, \beta) + (1 - y_i) \log \pi(0|x_i, \beta) \big].$$
Generate the classifier
$$\hat{f}(x) \equiv f(x; \hat{\beta}) = 1 \text{ if } \pi(1|x, \hat{\beta}) > \pi(0|x, \hat{\beta}), \text{ and } = 0 \text{ if } \pi(1|x, \hat{\beta}) < \pi(0|x, \hat{\beta}).$$
Obtain the outcomes of the classifier on a test (validation) dataset and compute
$$\text{ErrorRate}_{Valid}(C) = \frac{1}{m} \sum_{j=1}^{m} I\big(y_{0,j} \neq \hat{f}(x_{0,j})\big).$$
Example
Consider the Default data set where the response Y = default falls into one of two categories, Yes or No.
We will use logistic regression to model the probability that Y belongs to a particular category.
For the Default data, logistic regression models the probability of default given balance as
$$\pi(\text{default} = \text{Yes} \mid \text{balance}) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}},$$
where X = balance.
The values of π(default = Yes|balance) will range between 0 and 1.
Then for any given value of balance, a prediction can be made for default.
For example, one might predict default = Yes for any individual for whom π(default = Yes|balance) > 0.5.
Training, Testing and CV with R codes
Logistic regression can be easily fit using R, and so there is no need to go into the details of the maximum likelihood fitting procedure.
To fit logistic regression, we use the glm() function that fits a variety of generalized linear models in R including logistic regression.
The syntax of glm() is similar to that of lm(), but an additional argument family = binomial has to be given to run logistic regression rather than another type of generalized linear model.
Note that the structure of the R code for training, testing and CV is the same as before.
The only differences are to replace lm() by glm() in the training part, and
to replace the squared error loss function by the 0-1 loss function in the testing/validation part.
Example:
Do the following:
Fit a logistic regression model that is linear in x in the exponent, that is,
$$\pi(\text{default} = \text{Yes} \mid \text{balance}) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}.$$
Call this class $C_1$.
Next, similar to polynomial regression, fit a logistic regression model that is a polynomial of degree p in x in the exponent, that is,
$$\pi(\text{default} = \text{Yes} \mid \text{balance}) = \frac{e^{\beta_0 + \sum_{j=1}^{p} \beta_j x^j}}{1 + e^{\beta_0 + \sum_{j=1}^{p} \beta_j x^j}}.$$
Call this class $C_p$.
Run the CV procedure for 1 ≤ p ≤ 7 and choose the best p* that minimizes the CV error rate.
Here are the R codes
#Unit 2 of ML
#install.packages("ISLR")
#This R package contains all datasets
#used in the book
library(ISLR)
#Attach the Default dataset to R workspace
attach(Default)
str(Default) #Gives the structure of the DF
## 'data.frame': 10000 obs. of 4 variables:
##  $ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 ...
##  $ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 ...
##  $ balance: num  730 817 1074 529 786 ...
##  $ income : num  44362 12106 31767 35704 38463 ...
R codes (cont.)
logistic_fit <-
glm(default ~ balance, family = binomial,
data = Default)
summary(logistic_fit)
R codes (cont.)
##
## Call:
## glm(formula = default ~ balance, family = binomial, data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2697 -0.1465 -0.0589 -0.0221 3.7589
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.065e+01 3.612e-01 -29.49 <2e-16 ***
## balance 5.499e-03 2.204e-04 24.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
R codes (cont.)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1596.5 on 9998 degrees of freedom
## AIC: 1600.5
##
## Number of Fisher Scoring iterations: 8
#Get predicted values for balance values
#in the Default dataset
#The syntax is the same as before but with an optional
#argument type. The default type is on the scale of the linear
#predictors. type="response" gives probabilities on the
#scale of the response variable
pred_probabilities <- predict(logistic_fit,
newdata = Default, type="response")
predicted_classes <- ifelse(pred_probabilities > 0.5,
"Yes", "No")
R codes (cont.)
# Model Accuracy
with(Default,
mean(predicted_classes == default)
)
## [1] 0.9725
Note that to obtain the model misclassification error rate, either compute one minus 0.9725 or run:
#If you want to get Model Misclassification Error Rates
with(Default,
mean(predicted_classes != default)
)
## [1] 0.0275
R codes for CV
#Now do CV
#Now let's do full CV
library(dplyr)
K = 50;
P = 3;
MCE_train_mat <- rep(list(vector("numeric",K)),P)
MCE_valid_mat <- rep(list(vector("numeric",K)),P)
glm_deviance <- rep(list(vector("numeric",K)),P)
for (k in 1:K){
#Training dataset data.frame
train <- Default %>% sample_frac(0.7)
#Validation dataset data.frame
valid <- Default %>% setdiff(train)
R codes for CV
#Determine class of learners which are polynomials
#from degree 1 to 3
for (p in 1:P){
poly_train_fit <- glm(default ~ poly(balance,p),
data = train, family = binomial)
glm_deviance[[p]][k] <- poly_train_fit$deviance
poly_train_predict_probs <- predict(
poly_train_fit, train, type="response")
predicted_classes_train <- ifelse(
poly_train_predict_probs > 0.5, "Yes", "No")
poly_valid_predict_probs <- predict(
poly_train_fit, valid, type="response")
predicted_classes_valid <- ifelse(
poly_valid_predict_probs > 0.5, "Yes", "No")
R codes for CV
MCE_train_mat[[p]][k] <- with(train,
mean(default != predicted_classes_train))
MCE_valid_mat[[p]][k] <- with(valid,
mean(default != predicted_classes_valid))
}
}
MCE_train_p <- sapply(MCE_train_mat, mean)
MCE_valid_p <- sapply(MCE_valid_mat, mean)
plot(seq(1,P,1), MCE_train_p, col="red",
type="l", lwd=3, ylim = c(0.025, 0.03))
lines(seq(1,P,1), MCE_valid_p, type="l",
col="blue", lwd=3)
glm_deviance_p <- sapply(glm_deviance, mean)
plot(seq(1,P,1), glm_deviance_p, col="red", type="l", lwd=3)
Plot of Training and Validation Error Rates
[Figure: left panel plots the average training (MCE_train_p, red) and validation (MCE_valid_p, blue) misclassification error rates against the polynomial degree p = 1, 2, 3; right panel plots the average glm deviance (glm_deviance_p) against p.]
Logistic Regression for Multiple Explanatory Variables
Assume that there are p independent variables $X_1, X_2, \cdots, X_p$. Logistic regression assumes that
$$\pi(1|x, \beta) = \frac{e^{\beta_0 + \sum_{j=1}^{p} \beta_j x_j}}{1 + e^{\beta_0 + \sum_{j=1}^{p} \beta_j x_j}},$$
where $\beta = (\beta_0, \beta_1, \cdots, \beta_p)$, and
$$\pi(0|x, \beta) = 1 - \pi(1|x, \beta) = \frac{1}{1 + e^{\beta_0 + \sum_{j=1}^{p} \beta_j x_j}}.$$
Based on the training criterion, we seek
$$\hat{f}(x; \hat{\beta}) = \arg\min_{\beta} \frac{1}{n} \sum_{i=1}^{n} I\big(y_i \neq f(x_i; \beta)\big),$$
where f(x; β) = 1 if π(1|x, β) > π(0|x, β), and = 0 if π(1|x, β) < π(0|x, β).
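As a minimal sketch (the choice of predictors is ours for illustration; the Default data contain balance, income and student), the multiple-predictor model is fit in R with the same glm() call, just with a longer formula:
#Logistic regression of default on several explanatory variables
multi_fit <- glm(default ~ balance + income + student,
family = binomial, data = Default)
summary(multi_fit) #one coefficient per variable in the linear exponent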