
QBUS2820 Predictive Analytics
Classification

Semester 2, 2021

Discipline of Business Analytics, The University of Sydney Business School

QBUS2820 content structure

• Statistical and Machine Learning foundations and applications.

• Advanced regression methods.

• Classification methods.

• Time series forecasting.

2/55

Content

1. Introduction

2. Classification

3. K-nearest neighbours classifier

4. Logistic regression

5. More discussion on binary classification

3/55

Recommended reading

• Sec 4.1-4.3, An Introduction to Statistical Learning with
Applications in R by James et al.: easy to read, though the
discussion is occasionally loose; comes with R/Python code
for practice.

• Sec 4.1 and 4.4, The Elements of Statistical Learning by
Hastie et al.: well-written, deep in theory, suitable for students
with a sound maths background.

4/55

Introduction

Introduction

Consider the following business decision making scenarios.

• Should we offer a mortgage to a home loan applicant?

• Should we invest resources in acquiring and retaining a
customer?

• Should we invest more resources to train an employee?

• Should we investigate a transaction for possible fraud?

5/55

Introduction

All these scenarios involve a classification task.

• Is the applicant predicted to repay the mortgage in full?

• Is the customer predicted to be profitable?

• Is the employee predicted to stay with the company?

• Is the transaction predicted to be a fraud?

6/55

Another example: Spam or not spam?

• How is your email platform able to filter out spam emails?

• The spam email data set created by researchers at the
Hewlett-Packard Labs consists of 4601 messages, each already
classified as a proper email or spam, together with 57 attributes
(covariates), which are the relative frequencies of commonly
occurring words.

• The goal is to design a spam filter that can filter out spam
automatically.

7/55

Classification

Classification

• The classification task is to classify an individual into one (and
only one) of several categories or classes, based on a set of
measurements on that individual.

• The output variable Y takes values in a discrete set with C
categories.

• The input variables are a vector X of predictors X1, …, Xp.

• A classifier is a prediction rule, denoted by G, that, based on the
observation X = x of an individual, assigns the individual to
one of the C categories/classes.

• So, G(x) = c if the individual with observation x is classified
into class c.

8/55

Classification

• How can we construct a classifier?
• How do we measure the success of this classifier?
• What is the misclassification rate of this classifier?
• etc.

To answer these questions, we need decision theory.

9/55

Decision theory for classification

Decision theory starts with a loss function.

In classification, we represent the loss function by a C × C loss
matrix L. Each element of the loss matrix, Lkℓ = L(k, ℓ), specifies
the loss of classifying into class ℓ when the actual class is k.

A commonly used loss function is the 0-1 loss, where

Lkℓ = 1 if k ≠ ℓ, and Lkℓ = 0 if k = ℓ,

i.e., a unit loss is incurred in the case of misclassification. In
terms of Y and G(X),

L(Y, G(X)) = I(Y ≠ G(X)) = 1 if Y ≠ G(X), and 0 if Y = G(X).

10/55

Decision theory for classification

The prediction loss/error of classifier G is defined as

Err(G) = E[L(Y, G(X))] = Average{ L(yj, G(xj)) : all future (yj, xj) }

Our ultimate goal is to find a classifier G that minimises the
prediction error.

11/55

Bayes classifier

Let
pc(x) = P(Y = c|X = x), c = 1, …, C

be the conditional probability that Y = c given X = x.

Bayes classifier: classify individual x into class c if and only if

pc(x) ≥ pj(x) for all j = 1, …, C

The Bayes classifier is optimal under the 0-1 loss, i.e. it has a
smaller prediction error than any other classifier.

This is similar to the fact that E(Y |X = x) is the optimal
prediction of Y under the squared loss. Proof? (homework)

12/55

Bayes classifier

Consider two classes k and j. The set

{x : pk(x) = pj(x)}

is called the decision boundary between classes k and j.

13/55

Bayes classifier: binary case

When there are only two categories, negative (0) and positive (1),
the Bayes classifier becomes

G(x) = 1 if P(Y = 1|X = x) ≥ 0.5

Why?

14/55

Empirical error

Given a training dataset {(yi, xi), i = 1, …, n}, the empirical error
or empirical misclassification rate of classifier G is

err(G) = (1/n) ∑i I(yi ≠ G(xi))

15/55

Bayes classifier

• So, if we know pc(x) = P(Y = c|X = x), c = 1, …, C, the
Bayes classifier is the optimal classification rule (under 0-1
loss).

• Recall the regression situation: f(x) = E(Y |X = x) is the
optimal prediction of Y (under squared loss) when X = x,
but in practice we need to estimate f(·).

Given a training data set {(yi, xi), i = 1, …, n}, how can we
estimate pc(·), c = 1, …, C?

These can be estimated using methods such as kNN, logistic
regression, multinomial regression, and neural networks.

16/55

K-nearest neighbours classifier

K-nearest neighbours classifier

The K-nearest neighbours classifier estimates the conditional
probability for class c as

pc(x) = (1/K) ∑_{xi ∈ NK(x, D)} I(yi = c)

for a training sample D = {(yi, xi), i = 1, …, n}. Here, NK(x, D) is
the set containing the K input vectors in the training dataset that
are closest to x.

17/55

K-nearest neighbours classifier

[Figure: a two-class training sample of points in the plane,
illustrating the neighbourhood of a query point used by the KNN
classifier]

18/55

K-nearest neighbours classifier

• In words, the KNN method finds the K training input points
which are closest to x, and computes the conditional
probability as the fraction of those points that belong to class
c.

• The KNN classifier is a nonparametric approximation to the
Bayes classifier.

• As always, choosing the optimal K is crucial. We often use
cross validation to select K.

19/55
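For concreteness, here is a minimal Python/NumPy sketch of this estimate, assuming Euclidean distance; the function names are illustrative, not code from the unit materials.

import numpy as np

def knn_class_probs(x, X_train, y_train, K):
    """Estimate p_c(x) as the fraction of the K nearest neighbours in class c."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to each training point
    nearest = np.argsort(dists)[:K]               # indices of the K closest points
    return {c: np.mean(y_train[nearest] == c) for c in np.unique(y_train)}

def knn_classify(x, X_train, y_train, K):
    """Plug-in Bayes classifier: choose the class with the largest estimate."""
    probs = knn_class_probs(x, X_train, y_train, K)
    return max(probs, key=probs.get)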

KNN classifier decision boundary

[Figure: KNN decision boundary with K = 10 for a two-class
training sample; axes X1 and X2]

20/55

KNN classifier decision boundary

[Figure: KNN decision boundaries for K = 1 (left) and K = 100
(right) on the same two-class training sample]

21/55

K-nearest neighbours classifier

[Figure: training and test error rates of the KNN classifier
plotted against 1/K]

22/55

Logistic regression

Logistic regression

• Consider a binary classification problem with two categories: 1
(often called positive) and 0 (called negative)

• Then, we can use logistic regression for estimating
p1(x) = P(Y = 1|X = x) and p0(x) = P(Y = 0|X = x).

23/55

Logistic regression

• Response/output data yi are binary: 0 or 1, No or Yes.

• We want to explain/predict yi based on a vector of
predictors/inputs xi = (xi1, …, xip)′.

• Linear regression

yi = β0 + β1xi1 + … + βpxip + εi

is not appropriate for this type of response data. Why?

• Instead, we assume yi|xi ∼ B(1, p1(xi)), i.e. the distribution
of yi is a Bernoulli distribution with probability of success
p1(xi), where

p1(xi) = P(yi = 1|xi) = exp(β0 + β1xi1 + … + βpxip) / (1 + exp(β0 + β1xi1 + … + βpxip))

24/55

Logistic regression

Note: if Y is a Bernoulli r.v. with probability π, then

p(y|π) = P(Y = y) = π^y (1 − π)^(1−y),  i.e. π for y = 1 and 1 − π for y = 0.

The function p(y|π) is called the probability density function (also
called the probability mass function for a discrete random variable).

• The probability density function of yi is therefore

p(yi|xi, β) = p1(xi)^yi (1 − p1(xi))^(1−yi)

So the likelihood function is

p(y|X, β) = ∏_{i=1}^{n} p1(xi)^yi (1 − p1(xi))^(1−yi)

25/55

Logistic regression

Maximising this likelihood gives us the Maximum Likelihood
Estimate (MLE) of β:

β̂ = argmaxβ{p(y|X, β)}

• It’s mathematically more convenient to work with the
log-likelihood

ℓ(β) = log p(y|X, β) = ∑i (yi log p1(xi) + (1 − yi) log(1 − p1(xi)))

• The value of β that maximises p(y|X, β) is exactly the value
that maximises ℓ(β). So

β̂ = argmaxβ{p(y|X, β)} = argmaxβ{ℓ(β)}

• Often, one uses optimisation methods such as the
Newton-Raphson method to find β̂.

26/55
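To make the Newton-Raphson step concrete, here is a minimal NumPy sketch of the update β ← β − H(β)⁻¹ ∇ℓ(β), using the gradient X′(y − p) and Hessian −X′WX of the logistic log-likelihood; this is an illustrative implementation, not the exact routine used by statistical packages.

import numpy as np

def logistic_mle(X, y, n_iter=25, tol=1e-8):
    """Newton-Raphson for the logistic regression MLE.
    X: (n, p+1) design matrix including a column of ones; y: (n,) 0/1 labels."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # p1(x_i) for each observation
        grad = X.T @ (y - p)                  # gradient of the log-likelihood
        W = p * (1.0 - p)                     # diagonal weights p1(x_i)(1 - p1(x_i))
        hess = -(X * W[:, None]).T @ X        # Hessian of the log-likelihood
        step = np.linalg.solve(hess, grad)
        beta = beta - step                    # Newton update: beta - H^{-1} grad
        if np.max(np.abs(step)) < tol:        # stop once the update is negligible
            break
    return beta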

Customer churn data

Response: whether the customer had churned by the end of the
observation period.

Predictors:

1. Average number of dollars spent on marketing efforts to try
and retain the customer per month.
2. Total number of categories the customer has purchased from.
3. Number of purchase occasions.
4. Industry: 1 if the prospect is in the B2B industry, 0 otherwise.
5. Revenue: annual revenue of the prospect’s firm.
6. Employees: number of employees in the prospect’s firm.

Observations: 500.

Source: Kumar and Petersen (2012).

27/55

Example: customer churn data

Logit Regression Results
==============================================================================
Dep. Variable: Churn No. Observations: 350
Model: Logit Df Residuals: 348
Method: MLE Df Model: 1
Date: Pseudo R-squ.: 0.1319
Time: Log-Likelihood: -208.99
converged: True LL-Null: -240.75

LLR p-value: 1.599e-15
===============================================================================

coef std err z P>|z| [0.025 0.975]
-------------------------------------------------------------------------------
Intercept -1.2959 0.189 -6.866 0.000 -1.666 -0.926
Avg_Ret_Exp 0.0308 0.004 7.079 0.000 0.022 0.039
===============================================================================

28/55

Example: customer churn

Now we can predict the probability that a customer with average
retention expenses of 100 will churn:

p̂ = exp(β̂0 + β̂1 × 100) / (1 + exp(β̂0 + β̂1 × 100))
  = exp(−1.296 + 0.031 × 100) / (1 + exp(−1.296 + 0.031 × 100))
  = 0.856

29/55
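In Python, output like that on the previous slides can be produced with statsmodels; a sketch, assuming the churn data sit in a CSV file (the file name is hypothetical, while the column names Churn and Avg_Ret_Exp are taken from the output above).

import pandas as pd
import statsmodels.formula.api as smf

churn = pd.read_csv("churn.csv")                      # hypothetical file name
result = smf.logit("Churn ~ Avg_Ret_Exp", data=churn).fit()
print(result.summary())                               # table like the one on slide 28

# Predicted churn probability at average retention expenses of 100
p_hat = result.predict(pd.DataFrame({"Avg_Ret_Exp": [100]}))
print(p_hat)                                          # approx. 0.856 with the coefficients above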

Example: German Credit data

Contains observations on 30 variables for 1000 credit customers.
Each was already rated as good credit (700 cases) or bad credit
(300 cases).

• credit history: categorical, 0: no credits taken, 1: all
credits paid back duly, …

• new car: binary, 0: No, 1: Yes
• credit amount: numerical
• employment: categorical, 0: unemployed, 1: < 1 year, ...
• ...
• response: 1 if “good credit”, 0 if “bad credit”

Let’s develop a credit scoring rule that can determine if a new
applicant is a good credit or a bad credit, based on these covariates.

30/55

Example: German Credit data

Many covariates are categorical, so we need dummy variables to
represent them; in total, there are 62 predictors. The dataset
provided is a simplified version, where I used only 24 predictors.

Let’s use the first 800 observations as the training data and the
rest as the test data. We use logistic regression to estimate the
probability of “good credit”

p1(x) = P(Y = 1|x) = exp(x′β) / (1 + exp(x′β))

31/55

Example: German Credit data

Bayes classifier: a credit applicant with vector of predictors x is
classified as “good credit” if p̂1(x) ≥ 0.5

                  Predicted class
Actual class      Bad     Good
Bad               120     19
Good              26      35

Test error (misclassification rate) = (19 + 26)/200 = 22.5%

32/55
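A sketch of this analysis with pandas and scikit-learn; the file and column names are hypothetical, and the provided course dataset may already contain the dummy variables.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

credit = pd.read_csv("german_credit.csv")             # hypothetical file name
X = pd.get_dummies(credit.drop(columns="response"))   # dummies for the categorical covariates
y = credit["response"]                                # 1 = good credit, 0 = bad credit

X_train, y_train = X.iloc[:800], y.iloc[:800]         # first 800 observations for training
X_test, y_test = X.iloc[800:], y.iloc[800:]           # remaining 200 for testing

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
p_good = clf.predict_proba(X_test)[:, 1]              # estimated p1(x) for the test set
y_pred = (p_good >= 0.5).astype(int)                  # Bayes classifier with threshold 0.5

print(confusion_matrix(y_test, y_pred))
print("test error:", (y_pred != y_test).mean())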
Multinomial logistic regression*

With C > 2 categories, we can use multinomial regression to
estimate pc(x), c = 1, …, C

• Multinomial regression is an extension of logistic regression
• Response/output data y has C levels

P(Y = c|X = x) = exp(β0c + β1cx1 + … + βpcxp) / (1 + ∑_{k=1}^{C−1} exp(β0k + β1kx1 + … + βpkxp))

for c = 1, …, C − 1, and

P(Y = C|X = x) = 1 / (1 + ∑_{k=1}^{C−1} exp(β0k + β1kx1 + … + βpkxp))

• Almost all statistical software has packages to estimate this
model.

33/55
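For instance, scikit-learn's LogisticRegression fits the multinomial model when the response has more than two levels; a toy sketch with simulated data (in practice you would supply your own X and y).

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))          # 300 observations, 4 predictors
y = rng.integers(0, 3, size=300)       # C = 3 classes, labelled 0, 1, 2

clf = LogisticRegression(max_iter=1000).fit(X, y)
probs = clf.predict_proba(X[:5])       # estimated pc(x), one column per class
print(probs, probs.sum(axis=1))        # each row of probabilities sums to 1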

Recap…

• Want to classify a subject x into one and only one of C
classes {1, 2, …, C}

• Given a loss function L(·, ·), the optimal classifier G is the one
that minimises the prediction loss

Err(G) = E[L(Y, G(X))]
• Under the 0-1 loss function, the Bayes classifier is optimal:

classify x into class c if the probability

pc(x) = P (Y = c|X = x)

is the largest (among all pj(x), j = 1, …, C)
• We can estimate pc(x) by kNN or logistic regression (for

C = 2, or binary response), or multinomial regression (for
C > 2).

34/55

Review questions

• What is classification?
• What is the zero-one loss?
• How is the optimal classifier defined?
• What is the optimal classifier under 0-1 loss function?
• Explain the KNN classifier.
• Explain the logistic regression classifier.

35/55

Next questions to answer

• Given a general loss function (not the 0-1 loss), how can we find
the corresponding optimal classifier?

• If a loss function isn’t given, what is the optimal classifier
then?

• How can we compare different classifiers?
• etc.

36/55

More discussion on binary
classification

Binary classification

• Consider a binary classification problem with two categories: 1
(called positive) and 0 (called negative)

• For binary classification, there are important concepts that we
are going to discuss in this section.

37/55

Confusion matrix (key concept)

A confusion matrix counts the number of true negatives, false
positives, false negatives, and true positives for the test data.

                 Classification (Prediction)
Actual       Ŷ = 0                    Ŷ = 1                   Total
Y = 0        True negatives (TN)      False positives (FP)    N
Y = 1        False negatives (FN)     True positives (TP)     P
Total        Negative predictions     Positive predictions

38/55

Estimating prediction error

Estimating the prediction error is straightforward using the loss
function L and confusion matrix.

For example, under the 0-1 loss, what is the estimate of the prediction error?

39/55

Decision rule (key concept)

• The Bayes classifier

G(x) = 1 if P(Y = 1|X = x) ≥ 0.5, and G(x) = 0 if P(Y = 1|X = x) < 0.5

is optimal under 0-1 loss.

• In many business applications, other loss functions might be
more suitable to use.

40/55

Example: transaction fraud detection

In many business problems, there are distinct losses associated
with each classification outcome. Consider for example the case of
transaction fraud detection.

                 Classification
Actual           Legitimate       Fraud
Legitimate       No loss          Investigation cost
Fraud            Fraud loss       Fraud loss avoided

The cost of investigating a suspicious transaction is likely to be
much lower than the loss in case of fraud.

41/55

Example: credit scoring

In credit scoring, we want to classify a loan applicant as
creditworthy (Y = 1) or not (Y = 0) based on the probability that
the customer will not default.

                 Classification
Actual           Ŷ = 0                      Ŷ = 1
Y = 0            Default loss avoided       Default loss
Y = 1            Profit opportunity lost    Profit

A false positive is a more costly error than a false negative for
this business scenario. Our decision making should therefore take
this into account.

42/55

Decision rule (key concept)

General decision rule:

Gτ(x) = 1 if P(Y = 1|X = x) ≥ τ, and Gτ(x) = 0 if P(Y = 1|X = x) < τ,

where τ is a decision threshold parameter.

Note: the Bayes classifier with τ = 1/2 is optimal under 0-1 loss.

43/55

Classification outcomes (key concept)

How can we find the optimal τ that takes into account a general
loss function L? To proceed, let’s use the following terminology.

                 Classification
Actual           Ŷ = 0                     Ŷ = 1
Y = 0            True negative (TN)        False positive (FP)
Y = 1            False negative (FN)       True positive (TP)

44/55

Loss matrix (key concept)

Suppose that the loss function L is specified by the following loss
matrix or cost-benefit matrix:

                 Classification
Actual           Ŷ = 0      Ŷ = 1
Y = 0            LTN        LFP
Y = 1            LFN        LTP

Often, LTN, LTP < 0, i.e., you make a profit with a correct
classification, and LFN, LFP > 0, i.e., you incur a real loss with an
incorrect classification.

45/55

Loss-based optimal decision rule

The optimal threshold τ minimises the prediction loss

Err(τ) = E [L(Y, Gτ (X))]

• Given a test dataset, using the confusion matrix and loss
matrix, we can estimate Err(τ).

• We can also use cross-validation

46/55
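A minimal sketch of this estimation: sweep τ over a grid and pick the value minimising the estimated loss. The loss-matrix entries and simulated data below are illustrative placeholders, not values from the lecture.

import numpy as np

rng = np.random.default_rng(0)
p_hat = rng.uniform(size=1000)         # stand-in for predicted P(Y = 1|x) on a test set
y = rng.binomial(1, p_hat)             # simulated outcomes

# Illustrative loss matrix: a false negative (e.g. a missed fraud) costs the most.
L_TN, L_FP, L_FN, L_TP = 0.0, 1.0, 5.0, 0.0

def estimated_loss(tau):
    y_pred = (p_hat >= tau).astype(int)
    return (L_TN * np.mean((y == 0) & (y_pred == 0))
            + L_FP * np.mean((y == 0) & (y_pred == 1))
            + L_FN * np.mean((y == 1) & (y_pred == 0))
            + L_TP * np.mean((y == 1) & (y_pred == 1)))

taus = np.linspace(0.01, 0.99, 99)
tau_opt = taus[np.argmin([estimated_loss(t) for t in taus])]
print(tau_opt)    # well below 0.5 here, since false negatives are penalised heavily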

Other key concepts

• In some applications, a loss matrix isn’t available. E.g., in the
spam email filter example, it’s hard to define the loss of a
misclassification. How, then, do we decide on the threshold τ?

• How can we compare the kNN classifier with the logistic
regression classifier, or any other classifiers?

To answer these questions, we introduce some other key concepts.

47/55

Sensitivity and specificity (key concepts)

The sensitivity, or true positive rate, is

P(Ŷ = 1|Y = 1) = TP / (TP + FN) = True positives / Actual positives.

The specificity, or true negative rate, is

P(Ŷ = 0|Y = 0) = TN / (TN + FP) = True negatives / Actual negatives.

48/55

False positive and false negative rates

The false positive rate (FPR) is

P(Ŷ = 1|Y = 0) = FP / (TN + FP) = False positives / Actual negatives = 1 − Specificity.

You can think of this as the probability of a Type I error in
hypothesis testing. E.g., if 1 means “fraud”, then P(Ŷ = 1|Y = 0) is
the probability of classifying a genuine transaction as fraud.

The false negative rate (FNR) is

P(Ŷ = 0|Y = 1) = FN / (TP + FN) = False negatives / Actual positives = 1 − Sensitivity.

This can be thought of as a Type II error probability. Hence,
sensitivity is the power of the classifier.

In hypothesis testing, we want to select a test whose Type I error
probability is bounded and whose power is maximised.

49/55
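These four rates are simple functions of the confusion matrix counts; a small illustrative NumPy helper (not from the unit materials).

import numpy as np

def classification_rates(y, y_pred):
    """Return sensitivity, specificity, FPR and FNR from 0/1 labels and predictions."""
    tp = np.sum((y == 1) & (y_pred == 1))
    tn = np.sum((y == 0) & (y_pred == 0))
    fp = np.sum((y == 0) & (y_pred == 1))
    fn = np.sum((y == 1) & (y_pred == 0))
    sensitivity = tp / (tp + fn)       # true positive rate
    specificity = tn / (tn + fp)       # true negative rate
    return sensitivity, specificity, 1 - specificity, 1 - sensitivity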

Trade-off between sensitivity and specificity

• There is a trade-off between sensitivity and specificity:
• as τ varies, sensitivity increases while specificity decreases,
and vice versa;
• we obtain maximum sensitivity (specificity) by setting τ = 0
(τ = 1), since the classifier then always returns positive
(negative).

• This is similar to the trade-off between Type I error probability
and Type II error probability in hypothesis testing.

50/55

Imbalanced classes

In some situations, we care more about sensitivity than specificity
and vice versa.

For example, many classification scenarios (such as fraud
detection) concern rare events, leading to a very large proportion
of negatives in the data. In this situation we say that the classes
are imbalanced.

The specificity is not very informative for these problems, as it will
tend to be high regardless of the quality of the classifier (nearly all
transactions are legitimate and classified as such). Hence, we care
more about sensitivity.

51/55

ROC curve
[Figure: ROC curve, plotting the true positive rate against the
false positive rate, with both axes running from 0.0 to 1.0]

52/55

ROC curve (key concept)

A receiver operating characteristic (ROC) curve plots the
sensitivity (i.e., power) against the false positive rate (i.e., Type I
error probability) as the threshold τ varies from 0 to 1.

• The name came from military radar operation work in WW2.
• For each τ, you get a point (false positive rate, sensitivity) in
the R²-plane. Connecting these points as τ varies gives the ROC
curve.

• It tells us, for any false positive rate that we can accept,
what the sensitivity of the classifier is.

53/55
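A sketch of constructing the ROC points by sweeping τ exactly as described above; in practice, sklearn.metrics.roc_curve computes these points directly from y and the predicted probabilities.

import numpy as np

def roc_points(y, p_hat, n_grid=101):
    """(false positive rate, sensitivity) pairs of G_tau as tau varies from 0 to 1."""
    points = []
    for tau in np.linspace(0.0, 1.0, n_grid):
        y_pred = (p_hat >= tau).astype(int)
        tpr = np.sum((y == 1) & (y_pred == 1)) / max(np.sum(y == 1), 1)
        fpr = np.sum((y == 0) & (y_pred == 1)) / max(np.sum(y == 0), 1)
        points.append((fpr, tpr))
    return points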

ROC curve (key concept)

Figure 1:
https://en.wikipedia.org/wiki/Receiver_operating_characteristic

The quality of a classifier can be summarised by the area under
the curve (AUC). Higher AUC scores are better.

54/55

Review questions

• How do we formulate a decision rule for binary classification?
• What is a confusion matrix? Write down what the matrix looks
like.
• What are sensitivity and specificity?
• Why is there a trade-off between sensitivity and specificity?
• What is a ROC curve used for?
• How can you compare classifiers using ROC curves?

55/55
