QBUS2820 Predictive Analytics – Classification
QBUS2820 Predictive Analytics
Classification
Discipline of Business Analytics, The University of Sydney Business School
1. Introduction
2. Classification
3. K-nearest neighbours classifier
4. Logistic regression
Recommended reading
• Sec 4.1-4.3, An Introduction to Statistical Learning with
Applications in R by James et al.: easy to read, comes with
R/Python code for practice.
• Sec 4.1 and 4.4, The Elements of Statistical Learning by
Hastie et al.: well-written, deep in theory, suitable for students
with a sound maths background.
Introduction
Introduction
Consider the following business decision making scenarios.
• Should we offer the mortgage to a home loan applicant?
• Should we invest resources in acquiring and retaining a customer?
• Should we invest more resources to train an employee?
• Should we investigate a transaction for possible fraud?
Introduction
All these scenarios involve a classification task.
• Is the applicant predicted to repay the mortgage in full?
• Is the customer predicted to be profitable?
• Is the employee predicted to stay with the company?
• Is the transaction predicted to be a fraud?
Another example: Spam or not spam?
• How is your email platform able to filter out spam emails?
• The spam email data set, created by researchers at the
Hewlett-Packard Labs, consists of 4601 messages, each of which
has already been classified as a genuine email or spam, together with
57 attributes (covariates) which are relative frequencies of
commonly occurring words.
• The goal is to design a spam filter that can filter out spam emails.
Classification
Classification
• The classification task is to classify an individual into one (and
only one) of several categories or classes, based on a set of
measurements on that individual.
• Output variable Y takes value in a discrete set with C
categories
• Input variables are a vector X of predictors X1, …, Xp
• A classifier is a prediction rule denoted by G that, based on
observation X = x of an individual, assigns the individual into
one of the C categories/classes.
• So, G(x) = c if the individual with observation x is classified
into class c.
Classification
• How can we construct a classifier?
• How do we measure the success of this classifier?
• What is the misclassification rate of this classifier?
To answer these questions, we need decision theory.
Decision theory for classification
Decision theory starts with a loss function.
In classification, we represent the loss function by a C × C loss
matrix L. Each element of the loss matrix Lkℓ = L(k, ℓ) specifies
the loss of classifying into class ℓ when the actual class is k.
A commonly used loss function is the 0-1 loss, where
L(k, ℓ) = 1 if k ≠ ℓ, and 0 if k = ℓ,
i.e., a unit loss is incurred in the case of misclassification. Equivalently,
L(Y, G(X)) = I(Y ≠ G(X)) = 1 if Y ≠ G(X), and 0 if Y = G(X).
Decision theory for classification
The prediction loss/error of classifier G is defined as
Err(G) = E[L(Y, G(X))] = average of L(yj, G(xj)) over all future observations (yj, xj).
Our ultimate goal is to find a classifier G that minimises the
prediction error.
Bayes classifier
Let pc(x) = P(Y = c|X = x), c = 1, …, C,
be the conditional probability that Y = c given X = x.
Bayes classifier: classify individual x into class c if and only if
pc(x) ≥ pj(x) for all j = 1, …, C
The Bayes classifier is optimal under the 0-1 loss, i.e. it has the
smallest prediction error among all classifiers.
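
A minimal sketch of the Bayes rule in Python, assuming the class probabilities pc(x) are already known (the numbers below are made up for illustration):

    import numpy as np

    # Hypothetical conditional class probabilities p_c(x) for one observation x,
    # e.g. C = 3 classes with P(Y=1|x) = 0.2, P(Y=2|x) = 0.5, P(Y=3|x) = 0.3.
    p = np.array([0.2, 0.5, 0.3])

    # Bayes classifier under 0-1 loss: assign x to the class with the largest probability.
    predicted_class = np.argmax(p) + 1   # +1 because classes are labelled 1, ..., C
    print(predicted_class)               # 2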
Bayes classifier
Consider two classes k and j. The set
{x : pk(x) = pj(x)}
is called the decision boundary between class k and j.
Bayes classifier: binary case
When there are only two categories, negative (0) and positive (1),
the Bayes classifier becomes
G(x) = 1 if P(Y = 1|X = x) ≥ 0.5, and G(x) = 0 otherwise.
Empirical error
Given a training dataset {(yi, xi), i = 1, …, n}, the empirical error
or empirical misclassification rate of classifier G is
(1/n) Σ_{i=1}^n I(yi ≠ G(xi)).
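
A quick sketch of computing the empirical error in Python, with hypothetical actual and predicted labels:

    import numpy as np

    y = np.array([1, 0, 1, 1, 0])        # actual classes y_i (hypothetical)
    y_hat = np.array([1, 1, 1, 0, 0])    # classifier output G(x_i) (hypothetical)

    # Empirical error: fraction of training points that are misclassified.
    empirical_error = np.mean(y != y_hat)
    print(empirical_error)               # 0.4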
Bayes classifier
• So, if we know pc(x) = P(Y = c|X = x), c = 1, …, C, the
Bayes classifier is the optimal classification rule (under 0-1 loss).
• Recall the regression situation: f(x) = E(Y |X = x) is the
optimal prediction of Y (under squared loss) when X = x,
but in practice we need to estimate f(·).
Given a training data set {(yi, xi), i = 1, …, n}, how can we
estimate pc(·), c = 1, …, C?
These can be estimated using methods such as kNN, logistic
regression, multinomial regression, neural networks.
K-nearest neighbours classifier
K-nearest neighbours classifier
The K-nearest neighbours classifier estimates the conditional
probability for class c as

p̂c(x) = (1/K) Σ_{xi ∈ NK(x,D)} I(yi = c),

for a training sample D = {(yi, xi), i = 1, …, n}. Here, NK(x, D) is the set
containing the K input vectors in the training dataset that are closest to x.
K-nearest neighbours classifier
K-nearest neighbours classifier
• In words, the KNN method finds the K training input points
which are closest to x, and computes the conditional
probability as the fraction of those points that belong to class c.
• The KNN classifier is a nonparametric approximation to the
Bayes classifier.
• As always, choosing the optimal K is crucial. We often use
cross validation to select K, as in the sketch below.
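
Below is a sketch of the KNN classifier in Python using scikit-learn, with K chosen by cross validation; the data are simulated and all settings are illustrative, not part of the lecture's datasets:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier

    # Simulated binary classification data standing in for a real training set.
    X, y = make_classification(n_samples=500, n_features=5, random_state=1)

    # Select K by 5-fold cross validation over a grid of candidate values.
    grid = GridSearchCV(
        KNeighborsClassifier(),
        param_grid={"n_neighbors": [1, 5, 15, 25, 51, 101]},
        cv=5,
        scoring="accuracy",
    )
    grid.fit(X, y)

    print(grid.best_params_)   # the selected K
    # Estimated p_c(x) for new points: the fraction of the K nearest
    # neighbours belonging to each class.
    print(grid.best_estimator_.predict_proba(X[:3]))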
KNN classifier decision boundary
[Figure: KNN decision boundaries for K = 1 (left panel) and K = 100 (right panel)]
K-nearest neighbours classifier
[Figure: training and test misclassification errors for the KNN classifier over a range of K values]
Logistic regression
Logistic regression
• Consider a binary classification problem with two categories: 1
(often called positive) and 0 (called negative)
• Then, we can use logistic regression for estimating
p1(x) = P(Y = 1|X = x) and p0(x) = P(Y = 0|X = x).
Logistic regression
• Response/output data yi are binary: 0 and 1, Yes or No
• Want to explain/predict yi based on a vector of
predictors/inputs xi = (xi1, …, xip)′.
• Linear regression
yi = β0 + β1xi1 + … + βpxip + εi
is not appropriate for this type of response data. Why?
• Instead, we assume yi|xi ∼ B(1, p1(xi)), i.e. the distribution
of yi is Bernoulli distribution with probability of success
p1(xi), where
p1(xi) = P(yi = 1|xi) = exp(β0 + β1xi1 + … + βpxip) / (1 + exp(β0 + β1xi1 + … + βpxip))
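
A small sketch of the logistic transformation above in Python; the coefficient values and the input vector are made up for illustration:

    import numpy as np

    def logistic_prob(x, beta0, beta):
        """P(y = 1 | x) under the logistic regression model."""
        eta = beta0 + x @ beta                     # linear predictor
        return np.exp(eta) / (1.0 + np.exp(eta))   # always between 0 and 1

    # Hypothetical coefficients and a single input with p = 2 predictors.
    print(logistic_prob(np.array([1.5, -0.3]), beta0=-1.0, beta=np.array([0.8, 0.5])))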
Logistic regression
Note: if Y is a Bernoulli r.v. with probability π, then
p(y|π) = P(Y = y) = π if y = 1, and 1 − π if y = 0; equivalently, p(y|π) = π^y (1 − π)^(1−y).
The function p(y|π) is called the probability density function (also
called the probability mass function for a discrete random variable).
• The probability density function of yi is therefore
p(yi|xi, β) = p1(xi)^yi (1 − p1(xi))^(1−yi)
So the likelihood function is
p(y|X, β) = ∏_{i=1}^n p1(xi)^yi (1 − p1(xi))^(1−yi)
Logistic regression
Maximising this likelihood gives us the Maximum Likelihood
Estimate (MLE) of β
β̂ = argmaxβ{p(y|X, β)}
• It’s more mathematically convenient to work with the
log-likelihood
ℓ(β) = log p(y|X, β) = Σ_{i=1}^n (yi log p1(xi) + (1 − yi) log(1 − p1(xi)))
• The value of β that maximises p(y|X, β) is exactly the value
that maximises ℓ(β). So
β̂ = argmaxβ{p(y|X, β)} = argmaxβ{ℓ(β)}
• Often, one uses optimisation methods similar to the
Newton-Raphson method to find β̂.
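
A minimal sketch of computing the MLE numerically in Python, using scipy's general-purpose optimiser rather than Newton-Raphson, on simulated data (everything below is illustrative):

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)

    # Simulated data: n = 200 observations, an intercept and one predictor,
    # true coefficients beta = (-1, 2).
    n = 200
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    true_beta = np.array([-1.0, 2.0])
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_beta)))

    def neg_log_likelihood(beta):
        eta = X @ beta
        # l(beta) = sum_i [ y_i * eta_i - log(1 + exp(eta_i)) ]
        return -np.sum(y * eta - np.log1p(np.exp(eta)))

    # Minimising the negative log-likelihood is the same as maximising l(beta).
    result = minimize(neg_log_likelihood, x0=np.zeros(2), method="BFGS")
    print(result.x)   # MLE beta_hat, close to the true values for moderate n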
Customer churn data
Response: whether the customer had churned by the end
of the observation period.
Predictors
1. Average number of dollars spent on marketing efforts
to try and retain the customer per month.
2. Total number of categories the customer has purchased from.
3. Number of purchase occasions.
4. Industry: 1 if the prospect is in the B2B industry,
0 otherwise.
5. Revenue: annual revenue of the prospect’s firm.
6. Employees: number of employees in the prospect’s firm.
Observations: 500.
Source: Kumar and Petersen (2012).
Example: customer churn data
Logit Regression Results
==============================================================================
Dep. Variable:                  Churn   No. Observations:                  350
Model:                          Logit   Df Residuals:                      348
Method:                           MLE   Df Model:                            1
Date:                                   Pseudo R-squ.:                  0.1319
Time:                                   Log-Likelihood:                -208.99
converged:                       True   LL-Null:                       -240.75
                                        LLR p-value:                 1.599e-15
===============================================================================
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept      -1.2959      0.189     -6.866      0.000      -1.666      -0.926
Avg_Ret_Exp     0.0308      0.004      7.079      0.000       0.022       0.039
===============================================================================
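
Output of this kind could be produced with statsmodels roughly as below; this is a sketch that assumes the churn data are available in a CSV file (the file name is hypothetical) with columns Churn and Avg_Ret_Exp:

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical file name; assumes a 0/1 column Churn and a numeric
    # column Avg_Ret_Exp, as in the summary table above.
    churn = pd.read_csv("churn.csv")

    model = smf.logit("Churn ~ Avg_Ret_Exp", data=churn).fit()
    print(model.summary())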
Example: customer churn
Now we can predict the probability that a customer with average
retention expenses of 100 will churn:

exp(β̂0 + β̂1 × 100) / (1 + exp(β̂0 + β̂1 × 100)) = exp(−1.296 + 0.031 × 100) / (1 + exp(−1.296 + 0.031 × 100)) ≈ 0.86
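
The same number can be checked directly from the rounded coefficients in the table:

    import numpy as np

    beta0, beta1 = -1.296, 0.031             # rounded coefficients from the table
    eta = beta0 + beta1 * 100                # linear predictor at Avg_Ret_Exp = 100
    print(np.exp(eta) / (1 + np.exp(eta)))   # approximately 0.86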
Example: data
Contains observations on 30 variables for 1000 credit customers.
Each was already rated as good credit (700 cases) or bad credit
(300 cases).
• credit history: categorical, 0: no credits taken, 1: all
credits paid back duly, …
• new car: binary, 0: No, 1: Yes
• credit amount: numerical
• employment: categorical, 0: unemployed, 1: < 1 year, ...
• response: 1 if “good credit”, 0 if “bad credit”
Let’s develop a credit scoring rule that can determine if a new
applicant is a good credit or a bad credit, based on these
covariates.
Example: data
Many covariates are categorical, so we need dummy variables to
represent them. So in total, there are 62 predictors.
The dataset provided is a simplified version, where I used only 24
predictors.
Let’s use the first 800 observations as the training data and the
rest as the test data. We use logistic regression to estimate the
probability of “good credit”
p1(x) = P(Y = 1|x) = exp(x′β) / (1 + exp(x′β))
Example: data
Bayes classifier: a credit applicant with predictor vector x is
classified as “good credit” if
p̂1(x) ≥ 0.5
                 Predicted class
Actual class     Bad    Good
Bad              120      19
Good              26      35
Test error (misclassification rate) = (19+26)/200=22.5%
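
A sketch of this workflow in Python; the file name and column names are illustrative and may differ from the actual dataset provided:

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix

    # Hypothetical file with the simplified credit data; 'response' is
    # 1 for good credit and 0 for bad credit.
    credit = pd.read_csv("credit.csv")
    X = credit.drop(columns="response")
    y = credit["response"]

    # First 800 observations for training, the remaining 200 for testing.
    X_train, y_train = X.iloc[:800], y.iloc[:800]
    X_test, y_test = X.iloc[800:], y.iloc[800:]

    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # Classify as "good credit" when the estimated p1(x) >= 0.5
    # (the default threshold used by predict).
    y_pred = clf.predict(X_test)
    print(confusion_matrix(y_test, y_pred))
    print("Test error:", (y_pred != y_test).mean())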
Multinomial logistic regression*
With C > 2 categories, we can use multinomial regression to
estimate pc(x), c = 1, …, C
• Multinomial regression is an extension of logistic regression
• Response/output data y has C levels
P(Y = c|X = x) = exp(β0c + β1cx1 + … + βpcxp) / (1 + Σ_{k=1}^{C−1} exp(β0k + β1kx1 + … + βpkxp))

for c = 1, …, C − 1, and

P(Y = C|X = x) = 1 / (1 + Σ_{k=1}^{C−1} exp(β0k + β1kx1 + … + βpkxp))
• Almost all statistical software has packages to estimate this model; see the sketch below.
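
For example, a sketch with scikit-learn on the three-class iris data (not one of the lecture's datasets):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)   # C = 3 classes

    # With the default lbfgs solver, scikit-learn fits a multinomial
    # (softmax) logistic regression when y has more than two classes.
    clf = LogisticRegression(max_iter=1000).fit(X, y)

    # Estimated probabilities p_c(x), c = 1, ..., C, and predicted classes.
    print(clf.predict_proba(X[:2]))
    print(clf.predict(X[:2]))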
Summary
• Want to classify a subject x into one and only one of C
classes {1, 2, …, C}
• Given a loss function L(·, ·), the optimal classifier G is the one
that minimises the prediction loss
Err(G) = E[L(Y, G(X))]
• Under the 0-1 loss function, the Bayes classifier is optimal:
classify x into class c if the probability
pc(x) = P (Y = c|X = x)
is the largest (among all other pj(x))
• We can estimate pc(x) by kNN or logistic regression (for
C = 2, i.e. a binary response), or multinomial regression (for C > 2).
Review questions
• What is classification?
• What is the zero-one loss?
• How is the optimal classifier defined?
• What is the optimal classifier under 0-1 loss function?
• Explain the KNN classifier
• Explain the logistic regression classifier.
Next questions to answer
• Given a general loss function (not the 0-1 loss), how can we find
the corresponding optimal classifier?
• If a loss function isn't given, what is the optimal classifier?
• How can we compare different classifiers?