Predictive Analytics – Week 9: Classification I
Predictive Analytics
Week 9: Classification I
Semester 2, 2018
Discipline of Business Analytics, The University of Sydney Business School
QBUS2820 content structure
1. Statistical and Machine Learning foundations and applications.
2. Advanced regression methods.
3. Classification methods.
4. Time series forecasting.
2/60
Week 9: Classification I
1. Classification
2. Introduction to decision theory for classification
3. K-nearest neighbours classifier
4. Logistic regression
5. Model evaluation for binary classification
6. Decision theory for binary classification (optional)
Readings: Chapters 2.2.3, 4.1, 4.2 and 4.3 of ISL.
Exercise questions: Chapter 4.7 of ISL, Q1, Q4, Q6, Q8 and Q9. 3/60
Classification
Classification
Consider the following business decision making scenarios.
1. Should we invest resources in acquiring and retaining a
customer?
2. Should we offer a mortgage to a credit applicant?
3. Should we invest more resources to train an employee?
4. Should we place a bid to a sponsor an online search?
5. Should we investigate a transaction for possible fraud?
4/60
Classification
All these scenarios involve a classification task.
1. Do we predict that the customer will be profitable?
2. Do we predict that the applicant will repay the mortgage in
full?
3. Do we predict that the employee will stay in the company?
4. Do we predict that the user will click on the ad and make a
purchase?
5. Do we flag the transaction?
5/60
Classification
In classification, the response variable Y is qualitative or
categorical that takes values in a finite unordered set
Y = {1, . . . , C}, where C is the number of classes. Our task is to
predict which class a subject belongs to based on input variables.
A classifier Ŷ (X) is a mapping from the input vector x to
{1, . . . , C}. A classifier is a prediction rule that assigns the subject
to one of the classes, given the observed values of the predictors.
6/60
Classification
In the fraud detection example, our response variable may be
flag = {fraud, legitimate}. We can code this variable as
Y =
1 if fraud,0 if legitimate.
7/60
Introduction to decision theory for
classification
Loss functions for classification
In classification, we represent the loss function by a C × C loss
matrix L. Each element of the loss matrix Lk` = L(k, `) specifies
the loss of classifying in class ` when the actual class is k.
In this section, we focus on a simple framework by considering the
zero-one loss function.
8/60
Zero-one loss function (key concept)
The zero-one loss function is
L(y, ŷ) =
1 if y 6= ŷ0 if y = ŷ,
such that the loss is zero for a correct classification and one for a
misclassification.
9/60
Classification risk
As before, our objective is to minimise the risk or expected loss of
the classifier,
R(Ŷ (X)) = E
[
L(Y, Ŷ (X))
]
.
You can think of the risk as the average loss across all subjects in
the population (each subject has a pair of Y and X values).
By conditioning on the predictors, we can rewrite the risk as
R(Ŷ (X)) = EX
(
C∑
c=1
E
[
L(c, Ŷ (X))
]
P (Y = c|X)
)
10/60
Classification
Classification models lead to estimated conditional probabilities
P̂ (Y = c|X = x) for c = 1, . . . , C. We then classify a test case to
the class with highest estimated probability.
For binary classification, we therefore classify a subject as ŷ = 1 if
P̂ (Y = 1|X = x) ≥ 0.5.
11/60
Model evaluation
We compute the misclassification or error rate for the test data
of size n as
Errtest =
1
n
n∑
i=1
I(yi 6= ŷi).
12/60
K-nearest neighbours classifier
K-nearest neighbours classifier
The K-nearest neighbours classifier estimates the conditional
probability for class c as
P̂ (Y = c|X = x) =
1
K
∑
xi∈NK
I(yi = c)
for a training sample D = {(yi,xi)}Ni=1.
13/60
K-nearest neighbours classifier
o
o
o
o
o
oo
o
o
o
o
o o
o
o
o
o
oo
o
o
o
o
o
14/60
K-nearest neighbours classifier
• In words, the KNN method finds the K training input points
which are closest to x, and computes the conditional
probability as the fraction of those points that belongs to
class c.
• Similarly to the KNN regression method, the KNN classifier is
a direct nonparametric approximation to the Bayes classifier.
• The lower the K, the more flexible the decision boundary.
• As always, choosing the optimal level of flexibility is crucial.
We use cross validation to select K.
15/60
KNN classifier decision boundary
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o o
o
o
o o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
oo
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
oo
o
o
o
o
o
o
o
oo
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
X1
X
2
KNN: K=10
16/60
KNN classifier decision boundary
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o o
o
o
o o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
oo
o
o
o
o
o
o
o
oo
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o o
o
o
o o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
oo
o
o
o
o
o
o
o
oo
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
KNN: K=1 KNN: K=100
17/60
K-nearest neighbours classifier
0.01 0.02 0.05 0.10 0.20 0.50 1.00
0
.0
0
0
.0
5
0
.1
0
0
.1
5
0
.2
0
1/K
E
rr
o
r
R
a
te
Training Errors
Test Errors
18/60
Logistic regression
Regression models for classification
Suppose that we want the specify a discriminative model for binary
classification. The response Y follows the Bernoulli distribution,
Y =
1 with probability P (Y = 1|X = x)0 with probability 1− P (Y = 1|X = x)
How to model the conditional probability P (Y = 1|X = x) as a
function of the predictors?
19/60
Regression models for classification
Since P (Y = 1|X = x) = E(Y |X = x), one option is to specify a
linear regression model
Y = β0 +
p∑
j=1
βjxj + ε.
This is called the linear probability model. However, there are a
few reasons why we want to move beyond this framework.
20/60
Why not the linear probability model?
1. There is no guarantee that a linear probability model will
generate probabilities between zero and one, since the
regression function β0 +
∑p
j=1 βjxj is unconstrained. In other
words, the linearity assumption does not hold.
2. The Bernoulli distribution has variance p(x)(1− p(x)).
Hence, the linear probability model violates the classical
assumption of constant error variance.
3. The linear probability approach is not robust to outliers.
4. The linear probability approach does not generalise to
categorical responses with more than two classes.
21/60
Logistic regression (key concept)
The logistic regression model is
Y |X = x ∼ Ber(p(x)),
where
p(x) =
exp(β0 +
∑p
j=1 βjxj)
1 + exp(β0 +
∑p
j=1 βjxj)
.
The logistic function
exp(a)
1 + exp(a)
=
1
1 + exp(−a)
constrains the probability to be between zero and one.
22/60
Logistic function
23/60
Logistic Regression
Define the odds ratio as
p(x)
1− p(x)
.
We can show that
p(x)
1− p(x)
= exp
β0 + p∑
j=1
βjxj
.
The logistic regression model therefore specifies a linear model for
the log odds,
log
(
p(x)
1− p(x)
)
= β0 +
p∑
j=1
βjxj ,
where we call the left-hand side the logit transformation of the
probability.
24/60
Prediction with logistic regression
Suppose that we a test point xi, if (β̂0 +
∑p
j=1 β̂jxij) > 0, then
P̂ (Y = 1|X = xi) = p̂(xi) =
exp(β̂0 +
∑p
j=1 β̂jxij)
1 + exp(β̂0 +
∑p
j=1 β̂jxij)
> 0.5,
then prediction ŷi = 1. When to predict ŷi = 0?
If (β̂0 +
∑p
j=1 β̂jxij) = 0, we have the decision boundary of logistic
regression.
25/60
Maximum likelihood estimation
We estimate the logistic regression model by maximum likelihood.
Recall that a Bernoulli random variable Y has probability mass
function
p(y;π) = πy(1− π)1−y.
In the context of the logistic regression model, the probability mass
function for a training case i is therefore
p(yi|xi) = p(xi)yi(1− p(xi))1−yi .
26/60
Maximum likelihood estimation
The likelihood function is
`(β) = p(y1|x1) p(y2|x2) . . . p(yN |xN )
=
N∏
i=1
p(yi|xi)
=
N∏
i=1
p(xi)yi(1− p(xi))1−yi
27/60
Maximum likelihood estimation
The log-likelihood is
L(β) = log
(
N∏
i=1
p(xi)yi(1− p(xi))1−yi
)
=
N∑
i=1
(yi log(p(xi)) + (1− yi) log(1− p(xi))) ,
where p(xi) can be calculated as before using logistic function.
The negative log-likelihood −L(β) is known as the cross-entropy
loss function or log loss in machine learning. Think the intuition
of the above formula.
28/60
Maximum likelihood estimation
The MLE for the logistic regression model is
β̂ = argmax
β
L(β),
where
L(β) =
N∑
i=1
(yi log(p(xi)) + (1− yi) log(1− p(xi)))
Similar to Lecture 5, then it can be shown that:
∂L(β)
∂β
= XT (y − p(X))
29/60
Gradient Ascend in matrix form
Here we have:
y =
y1
y2
…
yN
p(X) =
p(x1)
p(x2)
…
p(xN )
30/60
Gradient Ascend in matrix form
Hence gradient ascent in matrix form for logistic regression is:
β := β + αXT (y − p(X))
• Note the size of each matrix and vector in the above formula.
• Try to implement this in Python.
31/60
Maximum likelihood estimation
Optimisation. Setting the partial derivatives of the log-likelihood
to zero leads to estimation equations that are nonlinear in the
coefficients. We therefore use numerical optimisation routines to
obtain the estimates.
Statistical inference. We can conduct statistical inference for the
logistic regression model using the large sample theory for ML
estimation or the Bootstrap method.
32/60
Example: customer churn data
33/60
Customer churn data
Response: whether the customer had churned by the end
of the observation period.
Predictors
1. Average number of dollars spent on marketing efforts
to try and retain the customer per month.
2. Total number of categories the customer has purchased from.
3. Number of purchase occasions.
4. Industry: 1 if the prospect is in the B2B industry,
0 otherwise.
5. Revenue: annual revenue of the prospect’s firm.
6. Employees: number of employees in the prospect’s firm.
Observations: 500.
Source: Kumar and Petersen (2012).
34/60
Example: customer churn data
Logit Regression Results
==============================================================================
Dep. Variable: Churn No. Observations: 350
Model: Logit Df Residuals: 348
Method: MLE Df Model: 1
Date: Pseudo R-squ.: 0.1319
Time: Log-Likelihood: -208.99
converged: True LL-Null: -240.75
LLR p-value: 1.599e-15
===============================================================================
coef std err z P>|z| [0.025 0.975]
——————————————————————————-
Intercept -1.2959 0.189 -6.866 0.000 -1.666 -0.926
Avg_Ret_Exp 0.0308 0.004 7.079 0.000 0.022 0.039
===============================================================================
35/60
Example: customer churn
No we can predict the probability that a customer with average
retention expenses of 100 will churn.
p̂ =
exp(β̂0 + β̂1 × 100)
1 + exp(β̂0 + β̂1 × 100)
=
exp(−1.296 + 0.031× 100)
1 + exp(−1.296 + 0.031× 100)
= 0.856
Decision boundary? If we have two predictors x1 and x2, how the
decision boundary will be looked like?
36/60
Regularised logistic regression
Regularised risk minimisation applies to logistic regression. With
an `1 penalty as in the lasso, we solve the minimisation problem
min
β
− L(β) + λ
p∑
j=1
|βj |,
where
L(β) =
N∑
i=1
(yi log(p(xi)) + (1− yi) log(1− p(xi)))
Subset selection and dimension reduction with principal
components also extend to logistic regression in a straightforward
way. 37/60
Model evaluation for binary
classification
Decision rule (key concept)
More generally, the decision to classify a subject as positive or
negative is based on a decision rule
δ(x) =
1 if P (Y = 1|X = x) > τ.0 if P (Y = 1|X = x) ≤ τ .
where τ is a decision threshold parameter.
38/60
Confusion matrix (key concept)
A confusion matrix counts the number of true negatives, false
positives, false negatives, and true positives for the test data.
Classification (Prediction)
Ŷ = 0 Ŷ = 1 Total
A
ct
ua
l
Y = 0 True negatives (TN) False positives (FP) N
Y = 1 False negatives (FN) True positives (TP) P
Total Negative predictions Positive predictions
39/60
Estimating the generalisation error
• Estimating the generalisation error is straightforward using the
loss and confusion matrices.
• As always, it is important to quantify the uncertainty in the
estimate by reporting the standard error or doing interval
estimation.
• For the rest of this section, we discuss important concepts for
assessing binary classification models.
40/60
Sensitivity and specificity (key concepts)
The sensitivity, recall, or true positive rate is
P (Ŷ = 1|Y = 1) =
TP
TP + FN
=
True positives
Actual positives
.
The specificity is
P (Ŷ = 0|Y = 0) =
TN
TN + FP
=
True negatives
Actual negatives
.
41/60
False positive and false negative rates
The false positive rate (FPR) is
P (Ŷ = 1|Y = 0) =
FP
TN + FP
=
False positives
Actual negatives
= 1−Specificity.
The false negative rate (FNR) is
P (Ŷ = 0|Y = 1) =
FN
TP + FN
=
False negatives
Actual positives
= 1− Sensitivity.
42/60
Trade-off between sensitivity and specificity (key concept)
• There is a trade-off between sensitivity and specificity, since a
classifier can always obtain maximum sensitivity (specificity)
by setting τ = 0 (τ = 1) and automatically returning positive
(negative).
• Equivalently, this is a trade-off between sensitivity and
achieving a lower false positive rate.
43/60
Example: credit scoring
• Decreasing the threshold makes the loan decisions more
lenient, leading to loans to customers with lower probability of
full repayment.
• Issuing additional loans will increase both the number of true
positives (higher sensitivity) and false positives/defaults (lower
specificity).
44/60
ROC curve (key concept)
A receiver operating characteristic or ROC curve plots the
sensitivity against specificity or the false positive rate for a range of
threshold values τ .
We can read the the ROC plot as telling us the false positive rate
that we need to accept to obtain a given level of sensitivity.
We often summarise the quality of ROC curve as a single number
using the area under the curve or AUC. Higher AUC scores are
better, with a maximum of one.
45/60
ROC curve
ROC Curve
False positive rate
T
ru
e
p
o
s
it
iv
e
r
a
te
0.0 0.2 0.4 0.6 0.8 1.0
0
.0
0
.2
0
.4
0
.6
0
.8
1
.0
46/60
Imbalanced classes
Many classification scenarios (such as fraud detection) concern rare
events, leading to a very large proportion of negatives in the data.
In this situation we say that the classes are highly imbalanced.
The specificity is not very informative for these problems, as it will
tend to be high regardless of the quality of the classifier (nearly all
transactions are legitimate and classified as such).
47/60
Precision (key concept)
In the imbalanced scenario, we are usually more interested in the
proportion of detections that are actually positive. We define the
precision as
P (Y = 1|Ŷ = 1) =
TP
TP + FP
=
True positives
Positive classifications
The false discovery rate (FDR) is one minus the precision.
P (Y = 0|Ŷ = 1) =
FP
TP + FP
=
False positives
Positive classifications
48/60
Precision recall curve
A precision recall curve plots the precision against the recall
(sensitivity) as we vary the threshold τ . The mean precision
(averaging over recall values) approximates the area under the
precision recall curve.
49/60
Example: transaction fraud detection
• In fraud detection, the bank is concerned that too many false
alarms (low precision) would lead to high costs of
investigating flagged transactions.
• The bank will weigh this cost against the savings from
catching fraudulent transactions.
• Therefore, the financial institution is primarily interested in
the precision recall curve.
• Increasing τ reduces the number of false alarms.
50/60
Review questions
• What is classification?
• What is a zero-one loss?
• What is the misclassification rate?
• Explain the KNN classifier.
• How do we formulate a decision rule for binary classification?
• What is a confusion matrix? Write down how the matrix looks
like.
• What are sensitivity, specificity, and precision?
• Why is there a trade-off between sensitivity and specificity?
51/60
Decision theory for binary
classification (optional)
Classification outcomes
In most business problems, there are distinct losses associated with
each classification outcome. Consider for example the case of
transaction fraud detection.
Classification
Legitimate Fraud
A
ct
ua
l
Legitimate No loss Investigation cost
Fraud Fraud loss Fraud loss avoided
The cost of investigating a suspicious transaction is likely to much
lower than the loss in case of fraud.
52/60
Classification outcomes
We use the following terminology.
Classification
Ŷ = 0 Ŷ = 1
A
ct
ua
l
Y = 0 True negative False positive
Y = 1 False negative True positive
53/60
Loss matrix (key concept)
The context will specify a loss matrix or cost-benefit matrix for
classification as follows.
Classification
Ŷ = 0 Ŷ = 1
A
ct
ua
l
Y = 0 LTN LFP
Y = 1 LFN LTP
54/60
Example: credit scoring
In credit scoring, we want to classify a loan applicant as
creditworthy (Y = 1) or not (Y = 0) based on the probability that
the customer will not default.
Classification
Ŷ = 0 Ŷ = 1
A
ct
ua
l
Y = 0 Default loss avoided Default loss
Y = 1 Profit opportunity lost Profit
A false positive is a more costly error than a false negative for this
business scenario. Our decision making should therefore take this
into account.
55/60
Optimal decision (key concept)
Our decision problem is therefore to select an optimal threshold τ .
τ∗ = argmin
0<τ<1
E [L(Y, δτ (x))|X = x] .
Let π = P (Y = 1|X = x) to simplify the notation. We compare
the expected loss from each decision,
E [L(Y, δτ (x))|X = x] =
πLTP + (1− π)LFP if δτ (x) = 1,πLFN + (1− π)LTN if δτ (x) = 0.
56/60
Optimal decision (key concept)
The optimal decision threshold corresponds to the probability value
π such that the loss from a positive or negative classification is
equal.
τ∗LTP + (1− τ∗)LFP = τ∗LFN + (1− τ∗)LTN
Therefore, the optimal threshold is
τ∗ =
LFP − LTN
LFP + LFN − LTP − LTN
57/60
Example: zero-one loss
With the zero-one loss, we have that LFP = LFN = 1 and
LTP = LTN = 0. Therefore,
τ∗ =
LFP − LTN
LFP + LFN − LTP − LTN
=
1
2
58/60
Example: credit scoring
In the credit scoring example, we have that LTP = −LFN (profit
equals missed profit) and LTN = −LFP (avoided default loss
equals default loss).
Therefore,
τ∗ =
LFP − LTN
LFP + LFN − LTP − LTN
=
LFP
LFN + LFP
.
59/60
Example: credit scoring
Optimal threshold for loan decision:
τ∗ =
LFP
LFN + LFP
.
We expect that the loss from default to be much higher than the
profit from a loan to a creditworthy customer (LFP >> LFN),
leading to a high threshold τ∗.
It is only worth it to lend to customers that have a high probability
of repayment.
60/60
Classification
Introduction to decision theory for classification
K-nearest neighbours classifier
Logistic regression
Model evaluation for binary classification
Decision theory for binary classification (optional)