
QBUS2820 Predictive Analytics – Classification

QBUS2820 Predictive Analytics
Classification

Discipline of Business Analytics, The University of Sydney Business School

1. Introduction

2. Classification

3. K-nearest neighbours classifier

4. Logistic regression

Recommended reading

• Sec 4.1-4.3, An Introduction to Statistical Learning with
Applications in R by James et al.: easy to read, comes with
R/Python code for practice.

• Sec 4.1 and 4.4, The Elements of Statistical Learning by
Hastie et al.: well-written, deep in theory, suitable for students
with a sound maths background.

Introduction

Introduction

Consider the following business decision making scenarios.

• Should we offer the mortgage to a home loan applicant?

• Should we invest resources in acquiring and retaining a
customer?

• Should we invest more resources to train an employee?

• Should we investigate a transaction for possible fraud?

Introduction

All these scenarios involve a classification task.

• Is the applicant predicted to repay the mortgage in full?

• Is the customer predicted to be profitable?

• Is the employee predicted to stay with the company?

• Is the transaction predicted to be a fraud?

Another example: Spam or not spam?

• How is your email platform able to filter out spam emails?
• The spam email data set created by researchers at the
Hewlett-Packard Labs consists of 4601 messages, each already
classified as a proper email or spam, together with 57 attributes
(covariates) which are relative frequencies of commonly occurring
words.

• The goal is to design a spam filter that could filter out spam
emails automatically.

Classification

Classification

• The classification task is to classify an individual into one (and
only one) of several categories or classes, based on a set of
measurements on that individual.

• Output variable Y takes value in a discrete set with C
categories

• Input variables are a vector X of predictors X1, …, Xp
• A classifier is a prediction rule, denoted by G, that, based on the
observation X = x of an individual, assigns the individual into
one of the C categories/classes.

• So, G(x) = c if the individual with observation x is classified
into class c.

Classification

• How can we construct a classifier?
• How do we measure the success of this classifier?
• What is the misclassification rate of this classifier?

To answer these questions, we need decision theory.

Decision theory for classification

Decision theory starts with a loss function.

In classification, we represent the loss function by a C × C loss
matrix L. Each element of the loss matrix Lkℓ = L(k, ℓ) specifies
the loss of classifying in class ℓ when the actual class is k.

A commonly used loss function is the 0-1 loss, where

L(k, ℓ) = 1 if k ≠ ℓ, and 0 if k = ℓ,

i.e., a unit loss is incurred in the case of misclassification.
Equivalently,

L(Y, G(X)) = I(Y ≠ G(X)) = 1 if Y ≠ G(X), and 0 if Y = G(X).

Decision theory for classification

The prediction loss/error of classifier G is defined as

Err(G) = E[L(Y, G(X))] = average of L(yj, G(xj)) over all future observations (yj, xj).

Our ultimate goal is to find a classifier G that minimises the
prediction error.

Bayes classifier

Let

pc(x) = P(Y = c|X = x), c = 1, …, C,

be the conditional probability that Y = c given X = x.

Bayes classifier: classify individual x into class c if and only if

pc(x) ≥ pj(x) for all j = 1, …, C

The Bayes classifier is optimal under the 0-1 loss, i.e. it has a
smaller prediction error than any other classifier.
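To make the rule concrete, here is a minimal Python sketch, assuming the class probabilities have already been estimated and stored in an array (the values below are made up for illustration):

import numpy as np

# Hypothetical estimated probabilities p_hat[i, c] = P(Y = c | X = x_i)
# for two observations and C = 3 classes (each row sums to one).
p_hat = np.array([[0.2, 0.5, 0.3],
                  [0.7, 0.1, 0.2]])

# Bayes rule: assign each observation to the class with the largest probability.
G_hat = np.argmax(p_hat, axis=1)
print(G_hat)  # [1 0]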

Bayes classifier

Consider two classes k and j. The set

{x : pk(x) = pj(x)}

is called the decision boundary between classes k and j.

Bayes classifier: binary case

When there are only two categories, negative (0) and positive (1),
the Bayes classifier becomes

G(x) = 1 if P(Y = 1|X = x) ≥ 0.5, and G(x) = 0 otherwise.

Empirical error

Given a training dataset {(yi, xi), i = 1, …, n}, the empirical error
or empirical misclassification rate of classifier G is

(1/n) ∑_{i=1}^{n} I(yi ≠ G(xi)).
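As a small illustration, the empirical misclassification rate can be computed directly from the labels and the classifier's predictions; the Python sketch below assumes both are already available as arrays or lists.

import numpy as np

def empirical_error(y, y_pred):
    """Fraction of observations misclassified by the classifier."""
    return np.mean(np.asarray(y) != np.asarray(y_pred))

# Made-up labels and predictions, for illustration only.
print(empirical_error([1, 0, 1, 1], [1, 1, 1, 0]))  # 0.5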

Bayes classifier

• So, if we know pc(x) = P(Y = c|X = x), c = 1, …, C, the
Bayes classifier is the optimal classification rule (under 0-1 loss).

• Recall the regression situation: f(x) = E(Y |X = x) is the
optimal prediction of Y (under squared loss) when X = x,
but in practice we need to estimate f(·).

Given a training data set {(yi, xi), i = 1, …, n}, how can we
estimate pc(·), c = 1, …, C?

These can be estimated using methods such as kNN, logistic
regression, multinomial regression, neural networks.

K-nearest neighbours classifier

K-nearest neighbours classifier

The K-nearest neighbours classifier estimates the conditional
probability for class c as

p̂c(x) = (1/K) ∑_{xi ∈ NK(x,D)} I(yi = c)

for a training sample D = {(yi, xi), i = 1, …, n}. Here, NK(x, D) is the
set containing the K input vectors in the training dataset that are
closest to x.
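The estimate above can be coded directly. Below is a minimal Python sketch (function and variable names are illustrative, not from any package) that finds the K nearest training points under Euclidean distance and returns the fraction belonging to class c.

import numpy as np

def knn_prob(x, X_train, y_train, c, K):
    """Estimate P(Y = c | X = x) as the fraction of the K nearest
    training points (Euclidean distance) whose label equals c."""
    dist = np.linalg.norm(X_train - x, axis=1)  # distance from x to every training point
    nearest = np.argsort(dist)[:K]              # indices of the K closest points
    return np.mean(y_train[nearest] == c)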

K-nearest neighbours classifier

• In words, the KNN method finds the K training input points
which are closest to x, and computes the conditional
probability as the fraction of those points that belong to class c.

• The KNN classifier is a nonparametric approximation to the
Bayes classifier.

• As always, choosing the optimal K is crucial. We often use
cross validation to select K.
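In practice one would typically use an existing implementation. The sketch below uses scikit-learn with a small synthetic data set (purely illustrative) and selects K by 5-fold cross-validation.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data, just so the example runs end to end.
X, y = make_classification(n_samples=200, random_state=0)

# Try several values of K and keep the one with the best cross-validated accuracy.
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 5, 10, 25, 50, 100]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X, y)
print(grid.best_params_)  # chosen K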

KNN classifier decision boundary

[Figure: KNN classifier decision boundaries for K = 1 and K = 100]

K-nearest neighbours classifier

[Figure: training and test errors of the KNN classifier]

Logistic regression

Logistic regression

• Consider a binary classification problem with two categories: 1
(often called positive) and 0 (called negative)

• Then, we can use logistic regression for estimating
p1(x) = P(Y = 1|X = x) and p0(x) = P(Y = 0|X = x).

Logistic regression

• Response/output data yi are binary: 0 or 1, No or Yes
• Want to explain/predict yi based on a vector of

predictors/inputs xi = (xi1, …, xip)′.
• Linear regression

yi = β0 + β1xi1 + … + βpxip + εi

is not appropriate for this type of response data. Why? (Its
fitted values are not restricted to lie between 0 and 1, so they
cannot be interpreted as probabilities.)
• Instead, we assume yi|xi ∼ B(1, p1(xi)), i.e. the distribution
of yi is a Bernoulli distribution with probability of success
p1(xi), where

p1(xi) = P(yi = 1|xi) = exp(β0 + β1xi1 + … + βpxip) / (1 + exp(β0 + β1xi1 + … + βpxip))

Logistic regression

Note: if Y is a Bernoulli r.v. with success probability π, then

p(y|π) = P(Y = y) = π if y = 1, and 1 − π if y = 0, which can be
written compactly as π^y (1 − π)^(1−y).

The function p(y|π) is called the probability density function (also
called the probability mass function for a discrete random variable).

• The probability density function of yi is therefore

p(yi|xi, β) = p1(xi)^yi (1 − p1(xi))^(1−yi)

So the likelihood function is

p(y|X, β) = ∏_{i=1}^{n} p1(xi)^yi (1 − p1(xi))^(1−yi)

Logistic regression

Maximising this likelihood gives us the Maximum Likelihood
Estimate (MLE) of β

β̂ = argmaxβ{p(y|X, β)}

• It’s more mathematically convenient to work with the
log-likelihood

ℓ(β) = log p(y|X, β) = ∑_{i=1}^{n} (yi log p1(xi) + (1 − yi) log(1 − p1(xi)))

• The value of β that maximises p(y|X, β) is exactly the value
that maximises ℓ(β). So

β̂ = argmaxβ{p(y|X, β)} = argmaxβ{ℓ(β)}

• Often, one uses optimisation methods similar to the
Newton-Raphson method to find β̂.
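As an illustration of the last point, here is a minimal Newton-Raphson sketch for the logistic regression MLE, written directly from the log-likelihood above (it is not taken from any particular package, and X is assumed to already include a column of ones for the intercept).

import numpy as np

def fit_logistic_newton(X, y, n_iter=25, tol=1e-8):
    """Newton-Raphson iterations for the logistic regression MLE.
    X: n x p design matrix (assumed to include an intercept column of ones);
    y: vector of 0/1 responses."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p1 = 1.0 / (1.0 + np.exp(-(X @ beta)))  # p1(x_i) under the current beta
        W = p1 * (1.0 - p1)                     # Bernoulli variances (the IRLS weights)
        grad = X.T @ (y - p1)                   # gradient of the log-likelihood
        hess = X.T @ (X * W[:, None])           # minus the Hessian of the log-likelihood
        step = np.linalg.solve(hess, grad)      # Newton step
        beta = beta + step
        if np.max(np.abs(step)) < tol:          # stop once the update is negligible
            break
    return beta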

Customer churn data

Response: whether the customer had churned by the end
of the observation period.

Predictors

1. Average number of dollars spent on marketing efforts to try
   and retain the customer per month.
2. Total number of categories the customer has purchased from.
3. Number of purchase occasions.
4. Industry: 1 if the prospect is in the B2B industry, 0 otherwise.
5. Revenue: annual revenue of the prospect’s firm.
6. Employees: number of employees in the prospect’s firm.

Observations: 500.

Source: Kumar and Petersen (2012).

Example: customer churn data

Logit Regression Results
==============================================================================
Dep. Variable: Churn No. Observations: 350
Model: Logit Df Residuals: 348
Method: MLE Df Model: 1
Date: Pseudo R-squ.: 0.1319
Time: Log-Likelihood: -208.99
converged: True LL-Null: -240.75

LLR p-value: 1.599e-15
===============================================================================

coef std err z P>|z| [0.025 0.975]
——————————————————————————-
Intercept -1.2959 0.189 -6.866 0.000 -1.666 -0.926
Avg_Ret_Exp 0.0308 0.004 7.079 0.000 0.022 0.039
===============================================================================
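Output like the table above can be produced with statsmodels. A minimal sketch, assuming the training portion of the churn data is held in a pandas DataFrame df with a 0/1 column Churn and a numeric column Avg_Ret_Exp:

import statsmodels.formula.api as smf

# Fit the logistic (logit) regression of Churn on Avg_Ret_Exp by MLE.
result = smf.logit("Churn ~ Avg_Ret_Exp", data=df).fit()
print(result.summary())  # prints a table like the one above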

Example: customer churn

Now we can predict the probability that a customer with average
retention expenses of 100 will churn:

P(churn | Avg_Ret_Exp = 100) = exp(β̂0 + β̂1 × 100) / (1 + exp(β̂0 + β̂1 × 100))
= exp(−1.296 + 0.031 × 100) / (1 + exp(−1.296 + 0.031 × 100)) ≈ 0.86
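The same calculation in Python, plugging in the estimated coefficients from the output above:

import numpy as np

b0, b1 = -1.2959, 0.0308  # fitted intercept and Avg_Ret_Exp coefficient
x = 100                   # average retention expenses of 100
p_churn = np.exp(b0 + b1 * x) / (1 + np.exp(b0 + b1 * x))
print(round(p_churn, 3))  # approximately 0.86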

Example: credit data

Contains observations on 30 variables for 1000 credit customers.
Each was already rated as good credit (700 cases) or bad credit
(300 cases).

• credit history: categorical, 0: no credits taken, 1: all
credits paid back duly, …

• new car: binary, 0: No, 1: Yes
• credit amount: numerical
• employment: categorical, 0: unemployed, 1: < 1 year, ... • response: 1 if “good credit”, 0 if “bad credit” Let’s develop a credit scoring rule that can determine if a new applicant is a good credit or a bad credit, based on these covariates. Example: data Many covariates are categorical, so we need dummy variables to represent them. So in total, there are 62 predictors. The dataset provided is a simplified version, where I used only 24 predictors. Let’s use the first 800 observations as the training data and the rest as the test data. We use logistic regression to estimate the probability of “good credit” p1(x) = P(Y = 1|x) = 1 + exp(x′β) Example: data Bayes classifier: a credit applicant with vector of predictor x is classified as “good credit” if p̂1(x) ≥ 0.5 Actual Predicted class class Bad Good Bad 120 19 Good 26 35 Test error (misclassification rate) = (19+26)/200=22.5% Multinomial logistic regression* With C > 2 categories, we can use multinomial regression to
estimate pc(x), c = 1, …, C

• Multinomial regression is an extension of logistic regression
• Response/output data y has C levels

P(Y = c|X = x) = exp(β0c + β1cx1 + … + βpcxp) / (1 + ∑_{k=1}^{C−1} exp(β0k + β1kx1 + … + βpkxp))

for c = 1, …, C − 1 and

P(Y = C|X = x) = 1 / (1 + ∑_{k=1}^{C−1} exp(β0k + β1kx1 + … + βpkxp))

• Almost all statistical software has packages to estimate this
model (see the sketch below).
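For example, scikit-learn's LogisticRegression handles the multinomial case directly; the minimal sketch below fits it to synthetic three-class data (the data and names are illustrative only).

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data with C = 3 classes, purely for illustration.
X, y = make_classification(n_samples=300, n_informative=5, n_classes=3, random_state=0)

# Multinomial logistic regression; predict_proba returns estimates of
# P(Y = c | X = x), one column per class.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict_proba(X[:3]))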

Summary

• Want to classify a subject x into one and only one of C
classes {1, 2, …, C}

• Given a loss function L(·, ·), the optimal classifier G is the one
that minimises the prediction loss

Err(G) = E[L(Y, G(X))]
• Under the 0-1 loss function, the Bayes classifier is optimal:

classify x into class c if the probability

pc(x) = P (Y = c|X = x)

is the largest (among all other pj(x))
• We can estimate pc(x) by kNN or logistic regression (for
C = 2, or binary response), or multinomial regression (for C > 2).

Review questions

• What is classification?
• What is the zero-one loss?
• How is the optimal classifier defined?
• What is the optimal classifier under 0-1 loss function?
• Explain the KNN classifier
• Explain the logistic regression classifier.

Next questions to answer

• Given a general loss function (not 0-1 loss), how can we find
the corresponding optimal classifier?

• If a loss function isn’t given, what is the optimal classifier?

• How can we compare between classifiers?

