QBUS2820 Predictive Analytics
Classification
Semester 2, 2021
Discipline of Business Analytics, The University of Sydney Business School
QBUS2820 content structure
• Statistical and Machine Learning foundations and applications.
• Advanced regression methods.
• Classification methods.
• Time series forecasting.
Content
1. Introduction
2. Classification
3. K-nearest neighbours classifier
4. Logistic regression
5. More discussion on binary classification
Recommended reading
• Sec 4.1-4.3, An Introduction to Statistical Learning with
Applications in R by James et al.: easy to read, though the
discussion is sometimes sloppy; comes with R/Python code for
practice.
• Sec 4.1 and 4.4, The Elements of Statistical Learning by
Hastie et al.: well-written, deep in theory, suitable for students
with a sound maths background.
Introduction
Introduction
Consider the following business decision-making scenarios.
• Should we offer the mortgage to a home loan applicant?
• Should we invest resources in acquiring and retaining a
customer?
• Should we invest more resources to train an employee?
• Should we investigate a transaction for possible fraud?
Introduction
All these scenarios involve a classification task.
• Is the applicant predicted to repay the mortgage in full?
• Is the customer predicted to be profitable?
• Is the employee predicted to stay with the company?
• Is the transaction predicted to be a fraud?
Another example: Spam or not spam?
• How is your email platform able to filter out spam emails?
• The spam email data set, created by researchers at the
Hewlett-Packard Labs, consists of 4601 messages, each already
classified as a proper email or spam, together with 57
attributes (covariates) that are relative frequencies of
commonly occurring words.
• The goal is to design a spam filter that can filter out spam
emails automatically.
Classification
Classification
• The classification task is to classify an individual into one (and
only one) of several categories or classes, based on a set of
measurements on that individual.
• The output variable Y takes values in a discrete set with C
categories.
• The input variables form a vector X of predictors X1, …, Xp.
• A classifier is a prediction rule, denoted by G, that, based on an
observation X = x of an individual, assigns the individual to
one of the C categories/classes.
• So, G(x) = c if the individual with observation x is classified
into class c.
Classification
• How can we construct a classifier?
• How do we measure the success of this classifier?
• What is the misclassification rate of this classifier?
• etc.
To answer these questions, we need decision theory.
Decision theory for classification
Decision theory starts with a loss function.
In classification, we represent the loss function by a C × C loss
matrix L. Each element of the loss matrix Lkℓ = L(k, ℓ) specifies
the loss of classifying in class ℓ when the actual class is k.
A commonly used loss function is the 0-1 loss, where

Lkℓ = 1 if k ≠ ℓ, and 0 if k = ℓ,

i.e., a unit loss is incurred in the case of misclassification. Equivalently,

L(Y, G(X)) = I(Y ≠ G(X)) = 1 if Y ≠ G(X), and 0 if Y = G(X).
Decision theory for classification
The prediction loss/error of classifier G is defined as
Err(G) = E[L(Y, G(X))] = Average{ L(yj, G(xj)) over all future (yj, xj) }
Our ultimate goal is to find a classifier G that minimises the
prediction error.
Bayes classifier
Let
pc(x) = P(Y = c|X = x), c = 1, …, C
be the conditional probability that Y = c given X = x.
Bayes classifier: classify individual x into class c if and only if
pc(x) ≥ pj(x) for all j = 1, …, C
The Bayes classifier is optimal under the 0-1 loss, i.e., it has a
smaller prediction error than any other classifier.
This is similar to the fact that E(Y |X = x) is the optimal
prediction of Y under the squared loss. Proof? (homework)
Bayes classifier
Consider two classes k and j. The set
{x : pk(x) = pj(x)}
is called the decision boundary between class k and j.
Bayes classifier: binary case
When there are only two categories, negative (0) and positive (1),
the Bayes classifier becomes
G(x) = 1 if P(Y = 1|X = x) ≥ 0.5
Why?
Empirical error
Given a training dataset {(yi, xi), i = 1, …, n}, the empirical error
or empirical misclassification rate of classifier G is
err(G) = (1/n) Σi I(yi ≠ G(xi))
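As a quick illustration, here is a minimal sketch (with made-up labels
and predictions) of computing err(G) in Python:

import numpy as np

# Made-up true labels y_i and classifier outputs G(x_i)
y = np.array([1, 0, 1, 1, 0, 1])
y_hat = np.array([1, 0, 0, 1, 0, 0])

# Empirical misclassification rate: fraction of i with y_i != G(x_i)
err = np.mean(y != y_hat)
print(err)  # 2 mistakes out of 6 -> 0.333...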
Bayes classifier
• So, if we know pc(x) = P(Y = c|X = x), c = 1, …, C, the
Bayes classifier is the optimal classification rule (under 0-1
loss).
• Recall the regression situation: f(x) = E(Y |X = x) is the
optimal prediction of Y (under squared loss) when X = x,
but in practice we need to estimate f(·).
Given a training data set {(yi, xi), i = 1, …, n}, how can we
estimate pc(·), c = 1, …, C?
These can be estimated using methods such as kNN, logistic
regression, multinomial regression, neural networks.
K-nearest neighbours classifier
K-nearest neighbours classifier
The K-nearest neighbours classifier estimates the conditional
probability for class c as
pc(x) = (1/K) Σ_{xi ∈ NK(x,D)} I(yi = c)

for a training sample D = {(yi, xi), i = 1, …, n}. Here, NK(x, D) is the
set containing the K input vectors in the training dataset that are
closest to x.
K-nearest neighbours classifier
[Figure: scatter plot of two-class training data illustrating the KNN classifier.]
K-nearest neighbours classifier
• In words, the KNN method finds the K training input points
which are closest to x, and computes the conditional
probability as the fraction of those points that belong to class
c.
• The KNN classifier is a nonparametric approximation to the
Bayes classifier.
• As always, choosing the optimal K is crucial. We often use
cross-validation to select K, as in the sketch below.
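A minimal sketch of this workflow in Python, assuming scikit-learn is
available; the data are simulated stand-ins, and K is selected by
5-fold cross-validated grid search:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Simulated two-class data standing in for a real training set
X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=1)

# Select K by 5-fold cross-validation over a grid of candidate values
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={"n_neighbors": [1, 5, 10, 25, 50, 100]},
                    cv=5)
grid.fit(X, y)
knn = grid.best_estimator_

print("Chosen K:", grid.best_params_["n_neighbors"])
# predict_proba returns the KNN estimate of pc(x) for each class c
print(knn.predict_proba(X[:3]))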
KNN classifier decision boundary
[Figure: KNN decision boundary with K = 10 on simulated two-class data; axes X1 and X2.]
KNN classifier decision boundary
[Figure: KNN decision boundaries with K = 1 (left panel) and K = 100 (right panel).]
K-nearest neighbours classifier
[Figure: training and test error rates plotted against 1/K.]
Logistic regression
Logistic regression
• Consider a binary classification problem with two categories: 1
(often called positive) and 0 (called negative)
• Then, we can use logistic regression for estimating
p1(x) = P(Y = 1|X = x) and p0(x) = P(Y = 0|X = x).
Logistic regression
• Response/output data yi are binary: 0 or 1, No or Yes
• Want to explain/predict yi based on a vector of
predictors/inputs xi = (xi1, …, xip)′.
• Linear regression
yi = β0 + β1xi1 + … + βpxip + εi
is not appropriate for this type of response data. Why?
• Instead, we assume yi|xi ∼ B(1, p1(xi)), i.e. the distribution
of yi is Bernoulli distribution with probability of success
p1(xi), where
p1(xi) = P(yi = 1|xi) = exp(β0 + β1xi1 + … + βpxip) / [1 + exp(β0 + β1xi1 + … + βpxip)]
Logistic regression
Note: if Y is a Bernoulli r.v. with probability π, then
p(y|π) = P(Y = y) = π if y = 1, and 1 − π if y = 0; i.e., p(y|π) = π^y (1 − π)^(1−y).
Function p(y|π) is called the probability density function (also
called probability mass function for discrete random variable).
• The probability density function of yi is therefore
p(yi|xi, β) = p1(xi)^yi (1 − p1(xi))^(1−yi)
So the likelihood function is
p(y|X, β) = ∏_{i=1}^{n} p1(xi)^yi (1 − p1(xi))^(1−yi)
Logistic regression
Maximising this likelihood gives us the Maximum Likelihood
Estimate (MLE) of β
β̂ = argmaxβ{p(y|X, β)}
• It’s more mathematically convenient to work with the
log-likelihood
ℓ(β) = log p(y|X, β) = Σi [ yi log p1(xi) + (1 − yi) log(1 − p1(xi)) ]
• The value of β that maximises p(y|X, β) is exactly the value
that maximises ℓ(β). So
β̂ = argmaxβ{p(y|X, β)} = argmaxβ{ℓ(β)}
• Often, one uses optimisation methods similar to the
Newton-Raphson method to find β̂.
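A minimal sketch of this fitting step in Python, assuming statsmodels
is installed; the tiny data frame here is made up for illustration:

import pandas as pd
import statsmodels.formula.api as smf

# Made-up churn data: binary response and one predictor
df = pd.DataFrame({
    "Churn":       [0, 1, 0, 1, 1, 0, 1, 0],
    "Avg_Ret_Exp": [10.0, 95.0, 60.0, 40.0, 120.0, 80.0, 30.0, 110.0],
})

# Logit maximises the log-likelihood l(beta) with a Newton-type optimiser
model = smf.logit("Churn ~ Avg_Ret_Exp", data=df).fit()
print(model.params)    # the MLE beta-hat
print(model.summary())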
Customer churn data
Response: whether the customer had churned by the end
of the observation period.
Predictors
1. Average number of dollars spent on marketing efforts
to try and retain the customer per month.
2. Total number of categories the customer has purchased from.
3. Number of purchase occasions.
4. Industry: 1 if the prospect is in the B2B industry,
0 otherwise.
5. Revenue: annual revenue of the prospect’s firm.
6. Employees: number of employees in the prospect’s firm.
Observations: 500.
Source: Kumar and Petersen (2012).
Example: customer churn data
Logit Regression Results
==============================================================================
Dep. Variable: Churn No. Observations: 350
Model: Logit Df Residuals: 348
Method: MLE Df Model: 1
Date: Pseudo R-squ.: 0.1319
Time: Log-Likelihood: -208.99
converged: True LL-Null: -240.75
LLR p-value: 1.599e-15
===============================================================================
coef std err z P>|z| [0.025 0.975]
-------------------------------------------------------------------------------
Intercept -1.2959 0.189 -6.866 0.000 -1.666 -0.926
Avg_Ret_Exp 0.0308 0.004 7.079 0.000 0.022 0.039
===============================================================================
Example: customer churn
Now we can predict the probability that a customer with average
retention expenses of 100 will churn:

p̂ = exp(β̂0 + β̂1 × 100) / [1 + exp(β̂0 + β̂1 × 100)]
  = exp(−1.296 + 0.031 × 100) / [1 + exp(−1.296 + 0.031 × 100)]
  = 0.856
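A quick sketch of the same calculation in Python, using the estimated
coefficients from the output above:

import numpy as np

b0, b1 = -1.2959, 0.0308   # intercept and Avg_Ret_Exp coefficient (MLE)
x = 100                    # average retention expenses

eta = b0 + b1 * x
p_hat = np.exp(eta) / (1 + np.exp(eta))
print(round(p_hat, 3))     # 0.856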
Example: German Credit data
Contains observations on 30 variables for 1000 credit customers.
Each was already rated as good credit (700 cases) or bad credit
(300 cases).
• credit history: categorical, 0: no credits taken, 1: all
credits paid back duly, …
• new car: binary, 0: No, 1: Yes
• credit amount: numerical
• employment: categorical, 0: unemployed, 1: < 1 year, ...
• ...
• response: 1 if “good credit”, 0 if “bad credit”
Let’s develop a credit scoring rule that can determine if a new
applicant is a good credit or a bad credit, based on these
covariates.
Example: German Credit data
Many covariates are categorical, so we need dummy variables to
represent them. So in total, there are 62 predictors.
The dataset provided is a simplified version, where I used only 24
predictors.
Let’s use the first 800 observations as the training data and the
rest as the test data. We use logistic regression to estimate the
probability of “good credit”
p1(x) = P(Y = 1|x) = exp(x′β) / [1 + exp(x′β)]
Example: German Credit data
Bayes classifier: a credit applicant with vector of predictor x is
classified as “good credit” if
p̂1(x) ≥ 0.5
                Predicted class
Actual class    Bad    Good
Bad             120    19
Good            26     35

Test error (misclassification rate) = (19 + 26)/200 = 22.5%
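A minimal sketch of this classify-and-evaluate step with scikit-learn;
the simulated data below merely stand in for the German Credit
predictors and response:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Simulated data standing in for the 24 predictors and binary response
X, y = make_classification(n_samples=1000, n_features=24, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=800, shuffle=False)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Classify as "good credit" (1) when the estimated p1(x) >= 0.5
y_pred = (clf.predict_proba(X_te)[:, 1] >= 0.5).astype(int)

print(confusion_matrix(y_te, y_pred))  # rows: actual, columns: predicted
print(np.mean(y_pred != y_te))         # test misclassification rate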
Multinomial logistic regression*
With C > 2 categories, we can use multinomial regression to
estimate pc(x), c = 1, …, C
• Multinomial regression is an extension of logistic regression
• Response/output data y has C levels
P(Y = c|X = x) = exp(β0c + β1cx1 + … + βpcxp) / [1 + Σ_{k=1}^{C−1} exp(β0k + β1kx1 + … + βpkxp)]

for c = 1, …, C − 1, and

P(Y = C|X = x) = 1 / [1 + Σ_{k=1}^{C−1} exp(β0k + β1kx1 + … + βpkxp)]
• Almost all statistical software has packages to estimate this
model; for instance:
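A minimal sketch with scikit-learn (one assumed software choice); for
C > 2 classes, LogisticRegression fits the multinomial (softmax) model
above by maximum likelihood:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Three-class example data (C = 3)
X, y = load_iris(return_X_y=True)

# With more than two classes, this fits the multinomial model
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Estimated pc(x) for each class c = 1, ..., C
print(clf.predict_proba(X[:3]))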
Recap…
• Want to classify a subject x into one and only one of C
classes {1, 2, …, C}
• Given a loss function L(·, ·), the optimal classifier G is the one
that minimises the prediction loss
Err(G) = E[L(Y, G(X))]
• Under the 0-1 loss function, the Bayes classifier is optimal:
classify x into class c if the probability
pc(x) = P (Y = c|X = x)
is the largest (among all other pj(x))
• We can estimate pc(x) by kNN or logistic regression (for
C = 2, or binary response), or multinomial regression (for
C > 2).
Review questions
• What is classification?
• What is the zero-one loss?
• How do we define the optimal classifier?
• What is the optimal classifier under 0-1 loss function?
• Explain the KNN classifier
• Explain the logistic regression classifier.
Next questions to answer
• Given a general loss function (not the 0-1 loss), how can we find
the corresponding optimal classifier?
• If a loss function isn't given, what is the optimal classifier
then?
• How can we compare classifiers?
• etc.
More discussion on binary
classification
Binary classification
• Consider a binary classification problem with two categories: 1
(called positive) and 0 (called negative)
• For binary classification, there are important concepts that we
are going to discuss in this section.
Confusion matrix (key concept)
A confusion matrix counts the number of true negatives, false
positives, false negatives, and true positives for the test data.
                        Classification (Prediction)
                Ŷ = 0                  Ŷ = 1                  Total
Actual  Y = 0   True negatives (TN)    False positives (FP)   N
        Y = 1   False negatives (FN)   True positives (TP)    P
        Total   Negative predictions   Positive predictions
Estimating prediction error
Estimating the prediction error is straightforward using the loss
function L and confusion matrix.
For example, for the 0-1 loss, what is the estimate of the prediction error?
Decision rule (key concept)
• The Bayes classifier
G(x) = 1 if P(Y = 1|X = x) ≥ 0.5; G(x) = 0 if P(Y = 1|X = x) < 0.5
is optimal under 0-1 loss
• In many business applications, other loss functions might be
more suitable to use.
Example: transaction fraud detection
In many business problems, there are distinct losses associated with
each classification outcome. Consider for example the case of
transaction fraud detection.
                          Classification
                   Legitimate    Fraud
Actual  Legitimate No loss       Investigation cost
        Fraud      Fraud loss    Fraud loss avoided
The cost of investigating a suspicious transaction is likely to be
much lower than the loss in case of fraud.
Example: credit scoring
In credit scoring, we want to classify a loan applicant as
creditworthy (Y = 1) or not (Y = 0) based on the probability that
the customer will not default.
                    Classification
                Ŷ = 0                     Ŷ = 1
Actual  Y = 0   Default loss avoided      Default loss
        Y = 1   Profit opportunity lost   Profit
A false positive is a more costly error than a false negative for this
business scenario. Our decision making should therefore take this
into account.
Decision rule (key concept)
General decision rule
Gτ(x) = 1 if P(Y = 1|X = x) ≥ τ; Gτ(x) = 0 if P(Y = 1|X = x) < τ,

where τ is a decision threshold parameter.
Note: the Bayes classifier, with τ = 1/2, is optimal under the 0-1 loss.
Classification outcomes (key concept)
How can we find optimal τ that takes into account a general loss
function L?
To proceed, let’s use the following terminology.
                    Classification
                Ŷ = 0                 Ŷ = 1
Actual  Y = 0   True negative (TN)    False positive (FP)
        Y = 1   False negative (FN)   True positive (TP)
Loss matrix (key concept)
Suppose that the loss function L is specified by the following loss
matrix or cost-benefit matrix:
                    Classification
                Ŷ = 0   Ŷ = 1
Actual  Y = 0   LTN     LFP
        Y = 1   LFN     LTP
Often, LTN, LTP < 0, i.e., you make a profit from a correct
classification, while LFN, LFP > 0, i.e., you incur a real loss from
an incorrect classification.
Loss-based optimal decision rule
The optimal threshold τ minimises the prediction loss
Err(τ) = E [L(Y, Gτ (X))]
• Given a test dataset, using the confusion matrix and loss
matrix, we can estimate Err(τ).
• We can also use cross-validation; see the sketch below.
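A minimal sketch of this estimation, with made-up probabilities,
labels, and loss matrix:

import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up test-set quantities
rng = np.random.default_rng(0)
p_hat = rng.uniform(size=200)                         # estimated P(Y=1|x)
y_test = (rng.uniform(size=200) < p_hat).astype(int)  # labels

# Illustrative loss matrix: rows actual, columns predicted
# [[L_TN, L_FP], [L_FN, L_TP]]; correct detection earns a benefit (-1)
L = np.array([[0.0, 1.0],
              [5.0, -1.0]])

def err(tau):
    """Estimated prediction loss of the threshold rule G_tau."""
    y_pred = (p_hat >= tau).astype(int)
    cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
    return np.sum(cm * L) / cm.sum()  # average loss over the test set

taus = np.linspace(0.01, 0.99, 99)
best = min(taus, key=err)
print("Optimal threshold:", round(best, 2))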
Other key concepts
• In some applications, a loss matrix isn't available. E.g., in the
spam email filter example, it's hard to define the loss of a
misclassification. How then do we decide on the threshold τ?
• How can we compare the kNN classifier with the logistic
regression classifier, or any other classifiers?
To answer these questions, we introduce some other key concepts.
Sensitivity and specificity (key concepts)
The sensitivity, or true positive rate, is

P(Ŷ = 1|Y = 1) = TP / (TP + FN) = True positives / Actual positives.

The specificity, or true negative rate, is

P(Ŷ = 0|Y = 0) = TN / (TN + FP) = True negatives / Actual negatives.
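A quick sketch computing both from a confusion matrix (the counts
below reuse the German Credit test matrix from earlier):

import numpy as np

# Confusion matrix: rows actual (0, 1), columns predicted (0, 1)
cm = np.array([[120, 19],   # TN, FP
               [26,  35]])  # FN, TP
TN, FP = cm[0]
FN, TP = cm[1]

sensitivity = TP / (TP + FN)  # true positive rate
specificity = TN / (TN + FP)  # true negative rate
print(round(sensitivity, 3), round(specificity, 3))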
False positive and false negative rates
The false positive rate (FPR) is

P(Ŷ = 1|Y = 0) = FP / (TN + FP) = False positives / Actual negatives = 1 − Specificity.

You can think of this as the probability of a Type I error in
hypothesis testing. E.g., if 1 means "fraud", then P(Ŷ = 1|Y = 0) is
the probability of classifying a genuine transaction as fraud.

The false negative rate (FNR) is

P(Ŷ = 0|Y = 1) = FN / (TP + FN) = False negatives / Actual positives = 1 − Sensitivity.

This can be thought of as a Type II error probability. Hence,
sensitivity is the power of the classifier.

In hypothesis testing, we want to select a test whose Type I error
probability is bounded and whose power is maximised.
Trade-off between sensitivity and specificity
• There is a trade-off between sensitivity and specificity:
• as τ decreases, sensitivity increases while specificity
decreases, and vice versa;
• we obtain maximum sensitivity (specificity) by setting τ = 0
(τ = 1), as the classifier then automatically returns positive
(negative).
• This is similar to the trade-off between Type I error probability
and Type II error probability in hypothesis testing.
Imbalanced classes
In some situations, we care more about sensitivity than specificity
and vice versa.
For example, many classification scenarios (such as fraud
detection) concern rare events, leading to a very large proportion
of negatives in the data. In this situation we say that the classes
are imbalanced.
The specificity is not very informative for these problems, as it will
tend to be high regardless of the quality of the classifier (nearly all
transactions are legitimate and classified as such). Hence, we care
more about sensitivity.
ROC curve
[Figure: ROC curve, plotting the true positive rate against the false positive rate.]
ROC curve (key concept)
A receiver operating characteristic (ROC) curve plots the
sensitivity (i.e., power) against the false positive rate (i.e., Type I
error probability) as the threshold τ varies from 0 to 1.
• The name originates from radar operator work in World War II.
• For each τ, you have a point (false positive rate, sensitivity) in
the R2-plane. Connecting these points as τ varies gives the ROC
curve.
• It tells us, for a false positive rate that we can accept, what the
sensitivity of the classifier is.
ROC curve (key concept)
Figure 1: https://en.wikipedia.org/wiki/Receiver_operating_characteristic
The quality of a classifier can be summarised by the area under
the curve (AUC). Higher AUC scores are better.
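A minimal sketch of computing an ROC curve and its AUC with
scikit-learn, on made-up labels and predicted probabilities:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Made-up test labels and estimated probabilities P(Y=1|x)
rng = np.random.default_rng(0)
p_hat = rng.uniform(size=200)
y_test = (rng.uniform(size=200) < p_hat).astype(int)

# One (FPR, TPR) point per threshold tau; connecting them gives the ROC curve
fpr, tpr, taus = roc_curve(y_test, p_hat)
print("AUC:", roc_auc_score(y_test, p_hat))

plt.plot(fpr, tpr)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate (sensitivity)")
plt.show()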
Review questions
• How do we formulate a decision rule for binary classification?
• What is a confusion matrix? Write down what the matrix looks
like.
• What are sensitivity and specificity?
• Why is there a trade-off between sensitivity and specificity?
• What is a ROC curve used for?
• How can you compare classifiers using ROC curves?