
Predictive Analytics – Week 9: Classification I

Predictive Analytics
Week 9: Classification I

Semester 2, 2018

Discipline of Business Analytics, The University of Sydney Business School

QBUS2820 content structure

1. Statistical and Machine Learning foundations and applications.

2. Advanced regression methods.

3. Classification methods.

4. Time series forecasting.

2/60

Week 9: Classification I

1. Classification

2. Introduction to decision theory for classification

3. K-nearest neighbours classifier

4. Logistic regression

5. Model evaluation for binary classification

6. Decision theory for binary classification (optional)

Readings: Sections 2.2.3, 4.1, 4.2 and 4.3 of ISL.

Exercise questions: Chapter 4.7 of ISL, Q1, Q4, Q6, Q8 and Q9.

3/60

Classification

Classification

Consider the following business decision making scenarios.

1. Should we invest resources in acquiring and retaining a
customer?

2. Should we offer a mortgage to a credit applicant?

3. Should we invest more resources to train an employee?

4. Should we place a bid to sponsor an online search?

5. Should we investigate a transaction for possible fraud?

4/60

Classification

All these scenarios involve a classification task.

1. Do we predict that the customer will be profitable?

2. Do we predict that the applicant will repay the mortgage in
full?

3. Do we predict that the employee will stay in the company?

4. Do we predict that the user will click on the ad and make a
purchase?

5. Do we flag the transaction?

5/60

Classification

In classification, the response variable Y is qualitative or
categorical and takes values in a finite unordered set
Y = {1, . . . , C}, where C is the number of classes. Our task is to
predict which class a subject belongs to based on input variables.

A classifier Ŷ (X) is a mapping from the input vector x to
{1, . . . , C}. A classifier is a prediction rule that assigns the subject
to one of the classes, given the observed values of the predictors.

6/60

Classification

In the fraud detection example, our response variable may be
flag = {fraud, legitimate}. We can code this variable as

Y = 1 if fraud,
    0 if legitimate.

7/60

Introduction to decision theory for
classification

Loss functions for classification

In classification, we represent the loss function by a C × C loss
matrix L. Each element of the loss matrix L_kℓ = L(k, ℓ) specifies
the loss of classifying in class ℓ when the actual class is k.

In this section, we focus on a simple framework by considering the
zero-one loss function.

8/60

Zero-one loss function (key concept)

The zero-one loss function is

L(y, ŷ) = 1 if y ≠ ŷ,
          0 if y = ŷ,

such that the loss is zero for a correct classification and one for a
misclassification.

9/60

Classification risk

As before, our objective is to minimise the risk or expected loss of
the classifier,

R(Ŷ(X)) = E[ L(Y, Ŷ(X)) ].

You can think of the risk as the average loss across all subjects in
the population (each subject has a pair of Y and X values).

By conditioning on the predictors, we can rewrite the risk as

R(Ŷ(X)) = E_X [ ∑_{c=1}^{C} L(c, Ŷ(X)) P(Y = c | X) ].

10/60

Classification

Classification models lead to estimated conditional probabilities
P̂(Y = c | X = x) for c = 1, . . . , C. We then classify a test case to
the class with the highest estimated probability.

For binary classification, we therefore classify a subject as ŷ = 1 if
P̂ (Y = 1|X = x) ≥ 0.5.

11/60

Model evaluation

We compute the misclassification or error rate for the test data
of size n as

Err_test = (1/n) ∑_{i=1}^{n} I(yi ≠ ŷi).
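As a minimal illustration (not part of the lecture code), the test error rate can be computed in NumPy from arrays of actual and predicted labels; the labels below are made up:

import numpy as np

y_test = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # illustrative actual labels
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])   # illustrative predicted labels

err_test = np.mean(y_test != y_pred)          # average of I(y_i != yhat_i)
print(err_test)                               # 0.25 for these made-up labels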

12/60

K-nearest neighbours classifier

K-nearest neighbours classifier

The K-nearest neighbours classifier estimates the conditional
probability for class c as

P̂(Y = c | X = x) = (1/K) ∑_{xi ∈ N_K} I(yi = c)

for a training sample D = {(yi, xi)}_{i=1}^{N}, where N_K denotes the set of
the K training points closest to x.
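A minimal NumPy sketch of this estimator (an illustration, not the course code), assuming numeric predictors, class labels 0, ..., C − 1, and Euclidean distance:

import numpy as np

def knn_class_probs(x, X_train, y_train, K, n_classes):
    # Estimated P(Y = c | X = x): the fraction of the K nearest
    # training points (Euclidean distance) that belong to class c.
    dists = np.linalg.norm(X_train - x, axis=1)
    neighbours = np.argsort(dists)[:K]
    labels = y_train[neighbours]
    return np.array([np.mean(labels == c) for c in range(n_classes)])

def knn_classify(x, X_train, y_train, K, n_classes):
    # Classify to the class with the highest estimated probability.
    probs = knn_class_probs(x, X_train, y_train, K, n_classes)
    return np.argmax(probs)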

13/60

K-nearest neighbours classifier

[Figure: scatter plot of simulated training observations illustrating the KNN classifier.]

14/60

K-nearest neighbours classifier

• In words, the KNN method finds the K training input points
which are closest to x, and computes the conditional
probability as the fraction of those points that belong to
class c.

• Similarly to the KNN regression method, the KNN classifier is
a direct nonparametric approximation to the Bayes classifier.

• The lower the K, the more flexible the decision boundary.

• As always, choosing the optimal level of flexibility is crucial.
We use cross validation to select K.
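As a hedged illustration of selecting K by cross-validation, the scikit-learn sketch below uses made-up data; the candidate grid and the choice of 5 folds are arbitrary:

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))                          # illustrative inputs
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)    # illustrative labels

param_grid = {"n_neighbors": [1, 3, 5, 10, 25, 50, 100]}     # candidate values of K
cv = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
cv.fit(X_train, y_train)
print(cv.best_params_)                                       # selected K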

15/60

KNN classifier decision boundary

[Figure: KNN decision boundary with K = 10 on simulated data, with predictors X1 and X2 on the axes. Panel title: "KNN: K=10".]

16/60

KNN classifier decision boundary

[Figure: KNN decision boundaries with K = 1 (left) and K = 100 (right) on the same data, illustrating how K controls the flexibility of the boundary. Panel titles: "KNN: K=1", "KNN: K=100".]

17/60

K-nearest neighbours classifier

[Figure: training and test error rates plotted against 1/K.]

18/60

Logistic regression

Regression models for classification

Suppose that we want to specify a discriminative model for binary
classification. The response Y follows the Bernoulli distribution,

Y = 1 with probability P(Y = 1 | X = x),
    0 with probability 1 − P(Y = 1 | X = x).

How to model the conditional probability P (Y = 1|X = x) as a
function of the predictors?

19/60

Regression models for classification

Since P (Y = 1|X = x) = E(Y |X = x), one option is to specify a
linear regression model

Y = β0 + ∑_{j=1}^{p} βj xj + ε.

This is called the linear probability model. However, there are a
few reasons why we want to move beyond this framework.

20/60

Why not the linear probability model?

1. There is no guarantee that a linear probability model will
generate probabilities between zero and one, since the
regression function β0 + ∑_{j=1}^{p} βj xj is unconstrained. In other
words, the linearity assumption does not hold.

2. The Bernoulli distribution has variance p(x)(1− p(x)).
Hence, the linear probability model violates the classical
assumption of constant error variance.

3. The linear probability approach is not robust to outliers.

4. The linear probability approach does not generalise to
categorical responses with more than two classes.

21/60

Logistic regression (key concept)

The logistic regression model is

Y |X = x ∼ Ber(p(x)),

where
p(x) = exp(β0 + ∑_{j=1}^{p} βj xj) / (1 + exp(β0 + ∑_{j=1}^{p} βj xj)).

The logistic function

exp(a) / (1 + exp(a)) = 1 / (1 + exp(−a))

constrains the probability to be between zero and one.
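A minimal NumPy sketch of the logistic function; the name sigmoid is just an illustrative choice, not part of the lecture code:

import numpy as np

def sigmoid(a):
    # Logistic function: maps any real number a to a value in (0, 1).
    return 1.0 / (1.0 + np.exp(-a))

print(sigmoid(-4.0), sigmoid(0.0), sigmoid(4.0))   # ~0.018, 0.5, ~0.982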

22/60

Logistic function

23/60

Logistic regression

Define the odds as

p(x) / (1 − p(x)).

We can show that

p(x) / (1 − p(x)) = exp(β0 + ∑_{j=1}^{p} βj xj).

The logistic regression model therefore specifies a linear model for
the log odds,

log( p(x) / (1 − p(x)) ) = β0 + ∑_{j=1}^{p} βj xj,

where we call the left-hand side the logit transformation of the
probability.

24/60

Prediction with logistic regression

Suppose that we have a test point xi. If β̂0 + ∑_{j=1}^{p} β̂j xij > 0, then

P̂(Y = 1 | X = xi) = p̂(xi) = exp(β̂0 + ∑_{j=1}^{p} β̂j xij) / (1 + exp(β̂0 + ∑_{j=1}^{p} β̂j xij)) > 0.5,

so we predict ŷi = 1. When do we predict ŷi = 0?

The set of points where β̂0 + ∑_{j=1}^{p} β̂j xij = 0 is the decision boundary of
logistic regression.
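A minimal sketch of this prediction rule with made-up coefficients; x_i includes a leading 1 for the intercept:

import numpy as np

beta_hat = np.array([-1.0, 0.5, 2.0])       # illustrative (intercept, beta_1, beta_2)
x_i = np.array([1.0, 1.0, 0.5])             # leading 1 for the intercept

eta = x_i @ beta_hat                        # beta_0 + sum_j beta_j x_ij = 0.5 here
p_hat = np.exp(eta) / (1.0 + np.exp(eta))   # estimated P(Y = 1 | X = x_i), ~0.62
y_hat = int(p_hat > 0.5)                    # equivalently, int(eta > 0); gives 1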

25/60

Maximum likelihood estimation

We estimate the logistic regression model by maximum likelihood.
Recall that a Bernoulli random variable Y has probability mass
function

p(y; π) = π^y (1 − π)^(1−y).

In the context of the logistic regression model, the probability mass
function for a training case i is therefore

p(yi | xi) = p(xi)^yi (1 − p(xi))^(1−yi).

26/60

Maximum likelihood estimation

The likelihood function is

ℓ(β) = p(y1|x1) p(y2|x2) · · · p(yN|xN)

     = ∏_{i=1}^{N} p(yi|xi)

     = ∏_{i=1}^{N} p(xi)^yi (1 − p(xi))^(1−yi).

27/60

Maximum likelihood estimation

The log-likelihood is

L(β) = log( ∏_{i=1}^{N} p(xi)^yi (1 − p(xi))^(1−yi) )

     = ∑_{i=1}^{N} [ yi log(p(xi)) + (1 − yi) log(1 − p(xi)) ],

where p(xi) is computed as before using the logistic function.

The negative log-likelihood −L(β) is known as the cross-entropy
loss function or log loss in machine learning. Think about the
intuition behind this formula.
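A minimal NumPy sketch of this computation, with made-up labels and fitted probabilities:

import numpy as np

y = np.array([1, 0, 1, 1, 0])               # illustrative labels
p = np.array([0.9, 0.2, 0.6, 0.8, 0.1])     # illustrative fitted probabilities p(x_i)

log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))   # L(beta), ~ -1.17 here
cross_entropy = -log_lik                                     # log loss used in ML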

28/60

Maximum likelihood estimation

The MLE for the logistic regression model is

β̂ = argmax_β L(β),

where

L(β) = ∑_{i=1}^{N} [ yi log(p(xi)) + (1 − yi) log(1 − p(xi)) ].

Similarly to Lecture 5, it can be shown that

∂L(β)/∂β = X^T (y − p(X)).

29/60

Gradient ascent in matrix form

Here we have:

y = (y1, y2, . . . , yN)^T,    p(X) = (p(x1), p(x2), . . . , p(xN))^T.

30/60

Gradient ascent in matrix form

Hence gradient ascent in matrix form for logistic regression is:

β := β + α X^T (y − p(X))

• Note the size of each matrix and vector in the above formula.
• Try to implement this in Python.
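A minimal NumPy sketch of this update (one possible implementation, not the course solution); X is assumed to contain a leading column of ones for the intercept, and the learning rate and iteration count are arbitrary choices:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logistic_gradient_ascent(X, y, alpha=0.01, n_iter=5000):
    # Maximise the log-likelihood L(beta) by gradient ascent.
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)                  # p(X): vector of fitted probabilities
        beta = beta + alpha * X.T @ (y - p)    # beta := beta + alpha * X^T (y - p(X))
    return beta

Standardising the predictors before running this simple update usually helps it converge; that choice is an implementation detail rather than part of the lecture.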

31/60

Maximum likelihood estimation

Optimisation. Setting the partial derivatives of the log-likelihood
to zero leads to estimation equations that are nonlinear in the
coefficients. We therefore use numerical optimisation routines to
obtain the estimates.

Statistical inference. We can conduct statistical inference for the
logistic regression model using the large sample theory for ML
estimation or the Bootstrap method.

32/60

Example: customer churn data

33/60

Customer churn data

Response: whether the customer had churned by the end
of the observation period.

Predictors

1. Average number of dollars spent on marketing efforts to try and
   retain the customer per month.
2. Total number of categories the customer has purchased from.
3. Number of purchase occasions.
4. Industry: 1 if the prospect is in the B2B industry, 0 otherwise.
5. Revenue: annual revenue of the prospect's firm.
6. Employees: number of employees in the prospect's firm.

Observations: 500.

Source: Kumar and Petersen (2012).

34/60

Example: customer churn data

Logit Regression Results
==============================================================================
Dep. Variable: Churn No. Observations: 350
Model: Logit Df Residuals: 348
Method: MLE Df Model: 1
Date: Pseudo R-squ.: 0.1319
Time: Log-Likelihood: -208.99
converged: True LL-Null: -240.75

LLR p-value: 1.599e-15
===============================================================================

coef std err z P>|z| [0.025 0.975]
-------------------------------------------------------------------------------
Intercept -1.2959 0.189 -6.866 0.000 -1.666 -0.926
Avg_Ret_Exp 0.0308 0.004 7.079 0.000 0.022 0.039
===============================================================================
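A hedged sketch of how output in this format could be produced with statsmodels; the synthetic data below only stands in for the churn dataset, so the numbers will differ from the table above:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
train = pd.DataFrame({"Avg_Ret_Exp": rng.uniform(0, 200, size=350)})
p = 1 / (1 + np.exp(1.3 - 0.03 * train["Avg_Ret_Exp"]))   # illustrative churn probabilities
train["Churn"] = rng.binomial(1, p)

result = smf.logit("Churn ~ Avg_Ret_Exp", data=train).fit()
print(result.summary())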

35/60

Example: customer churn

Now we can predict the probability that a customer with average
retention expenses of 100 will churn:

p̂ = exp(β̂0 + β̂1 × 100) / (1 + exp(β̂0 + β̂1 × 100))
  = exp(−1.296 + 0.031 × 100) / (1 + exp(−1.296 + 0.031 × 100))
  = 0.856.
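Checking this calculation in Python with the full-precision coefficients from the output above:

import numpy as np

eta = -1.2959 + 0.0308 * 100             # beta_0_hat + beta_1_hat * 100
p_hat = np.exp(eta) / (1 + np.exp(eta))
print(round(p_hat, 3))                   # 0.856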

Decision boundary: if we have two predictors x1 and x2, what will the
decision boundary look like?

36/60

Regularised logistic regression

Regularised risk minimisation applies to logistic regression. With
an ℓ1 penalty as in the lasso, we solve the minimisation problem

min_β  −L(β) + λ ∑_{j=1}^{p} |βj|,

where

L(β) = ∑_{i=1}^{N} [ yi log(p(xi)) + (1 − yi) log(1 − p(xi)) ].
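As a hedged illustration, scikit-learn's LogisticRegression can fit an ℓ1-penalised model of this form; note that its parameter C plays the role of 1/λ, and the data below is made up:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                                        # illustrative predictors
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=200) > 0).astype(int)  # illustrative labels

lasso_logit = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
lasso_logit.fit(X, y)
print(lasso_logit.coef_)      # some coefficients may be shrunk exactly to zero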

Subset selection and dimension reduction with principal
components also extend to logistic regression in a straightforward
way.

37/60

Model evaluation for binary
classification

Decision rule (key concept)

More generally, the decision to classify a subject as positive or
negative is based on a decision rule

δ(x) = 1 if P(Y = 1 | X = x) > τ,
       0 if P(Y = 1 | X = x) ≤ τ,

where τ is a decision threshold parameter.
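A minimal sketch of applying this rule to a vector of estimated probabilities:

import numpy as np

p_hat = np.array([0.05, 0.40, 0.55, 0.90])   # illustrative estimated P(Y = 1 | X = x)
tau = 0.5                                     # decision threshold
y_hat = (p_hat > tau).astype(int)             # delta(x): 1 if P(Y = 1 | x) > tau, else 0
print(y_hat)                                  # [0 0 1 1]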

38/60

Confusion matrix (key concept)

A confusion matrix counts the number of true negatives, false
positives, false negatives, and true positives for the test data.

                            Classification (prediction)
                    Ŷ = 0                    Ŷ = 1                    Total
Actual   Y = 0      True negatives (TN)      False positives (FP)     N
         Y = 1      False negatives (FN)     True positives (TP)      P
         Total      Negative predictions     Positive predictions

39/60

Estimating the generalisation error

• Estimating the generalisation error is straightforward using the
loss and confusion matrices.

• As always, it is important to quantify the uncertainty in the
estimate by reporting the standard error or doing interval
estimation.

• For the rest of this section, we discuss important concepts for
assessing binary classification models.

40/60

Sensitivity and specificity (key concepts)

The sensitivity, recall, or true positive rate is

P(Ŷ = 1 | Y = 1) = TP / (TP + FN) = True positives / Actual positives.

The specificity is

P(Ŷ = 0 | Y = 0) = TN / (TN + FP) = True negatives / Actual negatives.
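A hedged scikit-learn sketch computing the confusion matrix and these two rates for made-up test labels and predictions:

import numpy as np
from sklearn.metrics import confusion_matrix

y_test = np.array([0, 0, 0, 1, 1, 1, 1, 0])   # illustrative actual classes
y_pred = np.array([0, 1, 0, 1, 0, 1, 1, 0])   # illustrative predicted classes

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
sensitivity = tp / (tp + fn)                   # true positive rate (recall)
specificity = tn / (tn + fp)                   # true negative rate
print(sensitivity, specificity)                # 0.75 0.75 for these labels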

41/60

False positive and false negative rates

The false positive rate (FPR) is

P(Ŷ = 1 | Y = 0) = FP / (TN + FP) = False positives / Actual negatives = 1 − Specificity.

The false negative rate (FNR) is

P(Ŷ = 0 | Y = 1) = FN / (TP + FN) = False negatives / Actual positives = 1 − Sensitivity.

42/60

Trade-off between sensitivity and specificity (key concept)

• There is a trade-off between sensitivity and specificity, since a
classifier can always obtain maximum sensitivity (specificity)
by setting τ = 0 (τ = 1) and automatically returning positive
(negative).

• Equivalently, this is a trade-off between sensitivity and
achieving a lower false positive rate.

43/60

Example: credit scoring

• Decreasing the threshold makes the loan decisions more
lenient, leading to loans to customers with lower probability of
full repayment.

• Issuing additional loans will increase both the number of true
positives (higher sensitivity) and false positives/defaults (lower
specificity).

44/60

ROC curve (key concept)

A receiver operating characteristic or ROC curve plots the
sensitivity against specificity or the false positive rate for a range of
threshold values τ .

We can read the ROC plot as telling us the false positive rate that
we need to accept to obtain a given level of sensitivity.

We often summarise the quality of the ROC curve as a single number
using the area under the curve or AUC. Higher AUC scores are
better, with a maximum of one.
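A hedged scikit-learn sketch of the ROC curve and AUC, with made-up labels and estimated probabilities for the positive class:

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_test = np.array([0, 0, 1, 1, 0, 1, 0, 1])                     # illustrative labels
p_hat = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7])     # illustrative probabilities

fpr, tpr, thresholds = roc_curve(y_test, p_hat)   # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_test, p_hat)                # area under the ROC curve
print(auc)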

45/60

ROC curve

[Figure: ROC curve, plotting the true positive rate against the false positive rate.]

46/60

Imbalanced classes

Many classification scenarios (such as fraud detection) concern rare
events, leading to a very large proportion of negatives in the data.

In this situation we say that the classes are highly imbalanced.

The specificity is not very informative for these problems, as it will
tend to be high regardless of the quality of the classifier (nearly all
transactions are legitimate and classified as such).

47/60

Precision (key concept)

In the imbalanced scenario, we are usually more interested in the
proportion of detections that are actually positive. We define the
precision as

P(Y = 1 | Ŷ = 1) = TP / (TP + FP) = True positives / Positive classifications.

The false discovery rate (FDR) is one minus the precision:

P(Y = 0 | Ŷ = 1) = FP / (TP + FP) = False positives / Positive classifications.

48/60

Precision recall curve

A precision recall curve plots the precision against the recall
(sensitivity) as we vary the threshold τ . The mean precision
(averaging over recall values) approximates the area under the
precision recall curve.
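A hedged scikit-learn sketch of the precision-recall curve and its mean-precision summary, again with made-up data:

import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

y_test = np.array([0, 0, 0, 0, 0, 0, 1, 1])                       # illustrative imbalanced labels
p_hat = np.array([0.10, 0.20, 0.15, 0.30, 0.40, 0.55, 0.60, 0.90])

precision, recall, thresholds = precision_recall_curve(y_test, p_hat)
avg_precision = average_precision_score(y_test, p_hat)            # mean precision summary
print(avg_precision)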

49/60

Example: transaction fraud detection

• In fraud detection, the bank is concerned that too many false
alarms (low precision) would lead to high costs of
investigating flagged transactions.

• The bank will weigh this cost against the savings from
catching fraudulent transactions.

• Therefore, the financial institution is primarily interested in
the precision recall curve.

• Increasing τ reduces the number of false alarms.

50/60

Review questions

• What is classification?

• What is a zero-one loss?

• What is the misclassification rate?

• Explain the KNN classifier.

• How do we formulate a decision rule for binary classification?

• What is a confusion matrix? Write down what the matrix looks
like.

• What are sensitivity, specificity, and precision?

• Why is there a trade-off between sensitivity and specificity?
51/60

Decision theory for binary
classification (optional)

Classification outcomes

In most business problems, there are distinct losses associated with
each classification outcome. Consider for example the case of
transaction fraud detection.

                          Classification
                      Legitimate      Fraud
Actual   Legitimate   No loss         Investigation cost
         Fraud        Fraud loss      Fraud loss avoided

The cost of investigating a suspicious transaction is likely to be much
lower than the loss in case of fraud.

52/60

Classification outcomes

We use the following terminology.

                     Classification
                 Ŷ = 0             Ŷ = 1
Actual   Y = 0   True negative     False positive
         Y = 1   False negative    True positive

53/60

Loss matrix (key concept)

The context will specify a loss matrix or cost-benefit matrix for
classification as follows.

                     Classification
                 Ŷ = 0     Ŷ = 1
Actual   Y = 0   L_TN      L_FP
         Y = 1   L_FN      L_TP

54/60

Example: credit scoring

In credit scoring, we want to classify a loan applicant as
creditworthy (Y = 1) or not (Y = 0) based on the probability that
the customer will not default.

                     Classification
                 Ŷ = 0                      Ŷ = 1
Actual   Y = 0   Default loss avoided       Default loss
         Y = 1   Profit opportunity lost    Profit

A false positive is a more costly error than a false negative for this
business scenario. Our decision making should therefore take this
into account.

55/60

Optimal decision (key concept)

Our decision problem is therefore to select an optimal threshold τ .

τ∗ = argmin_{0 < τ < 1} E[ L(Y, δτ(x)) | X = x ].

Let π = P(Y = 1 | X = x) to simplify the notation. We compare the
expected loss from each decision,

E[ L(Y, δτ(x)) | X = x ] = π L_TP + (1 − π) L_FP   if δτ(x) = 1,
                           π L_FN + (1 − π) L_TN   if δτ(x) = 0.

56/60

Optimal decision (key concept)

The optimal decision threshold corresponds to the probability value
π such that the losses from a positive and from a negative
classification are equal,

τ∗ L_TP + (1 − τ∗) L_FP = τ∗ L_FN + (1 − τ∗) L_TN.

Therefore, the optimal threshold is

τ∗ = (L_FP − L_TN) / (L_FP + L_FN − L_TP − L_TN).

57/60

Example: zero-one loss

With the zero-one loss, we have L_FP = L_FN = 1 and L_TP = L_TN = 0.
Therefore,

τ∗ = (L_FP − L_TN) / (L_FP + L_FN − L_TP − L_TN) = 1/2.

58/60

Example: credit scoring

In the credit scoring example, we have L_TP = −L_FN (profit equals
missed profit) and L_TN = −L_FP (avoided default loss equals default
loss). Therefore,

τ∗ = (L_FP − L_TN) / (L_FP + L_FN − L_TP − L_TN) = L_FP / (L_FN + L_FP).

59/60

Example: credit scoring

Optimal threshold for the loan decision:

τ∗ = L_FP / (L_FN + L_FP).

We expect the loss from default to be much higher than the profit
from a loan to a creditworthy customer (L_FP >> L_FN), leading to a
high threshold τ∗.

It is only worthwhile to lend to customers who have a high probability
of repayment.
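A minimal sketch of the optimal threshold formula, with an illustrative credit-scoring loss matrix in which a default costs five times the forgone profit:

L_FP, L_FN = 5.0, 1.0          # illustrative losses: default loss and missed profit
L_TP, L_TN = -L_FN, -L_FP      # profit = -missed profit, avoided loss = -default loss

tau_star = (L_FP - L_TN) / (L_FP + L_FN - L_TP - L_TN)
print(tau_star)                # 5/6 ~ 0.83, i.e. L_FP / (L_FN + L_FP)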

60/60

Classification
Introduction to decision theory for classification
K-nearest neighbours classifier
Logistic regression
Model evaluation for binary classification
Decision theory for binary classification (optional)