STAT318/462 — Data Mining
Dr G ́abor Erd ́elyi
University of Canterbury, Christchurch,
Course developed by Dr B. Robertson. Some of the figures in this presentation are taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,1 / 36
Classification
Suppose we have a qualitative response Y (classification problem) that takes values from a finite (unordered) set
C = {y1,y2,…,yK},
and p different predictors,
X = (X1,X2,…,Xp).
Our goal is to build a classifier f (X ) that assigns a class label from C to an
unclassified observation X.
We are often interested in estimating the probability that X belongs to a
particular class (category).
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,2 / 36
Credit card data
0 500 1000 1500 2000 2500
Balance
The individuals who defaulted are shown in orange, and those that did not in blue.
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,3 / 36
We are interested in whether an individual will default (response variable) based on annual income and monthly credit card balance (predictor variables). Observations:
• The classes are not linearly separable (there is overlap between the two classes as expected).
• The predictor Balance ‘separates’ the training data better than Income.
• A simple subjective classifier could be
Default=Yes if Balance > 1300
f (x ) = Default=No otherwise, but we could probably do better.
Income
0 20000 40000 60000
Credit card data
No Yes
Default
G. Erd ́elyi, University of Canterbury 2021
No Yes
Default
STAT318/462 — Data Mining ,4 / 36
These box plots illustrate the basic properties of the distributions of defaulters and non-defaulters for the two predictor variables. Observations:
• For Income, the distributions of defaulters and non-defaulters are more or less the same.
• For Balance, there is good separation between the two classes.
Balance
0 500 1000 1500 2000 2500
Income
0 20000 40000 60000
Credit card data: linear regression
We can code the response variable Default as 0 ifNo
Y= 1 ifYes, and perform multiple linear regression of Y on X.
We could then classify an observation as Yes if the predicted response is greater than 0.5 and No otherwise.
This seems reasonable, right?
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,5 / 36
What about more than two classes?
A patient presents at an emergency room and we must classify them as Stroke, Drug Overdose or Epileptic Seizure (response values) based on their symptoms (predictors).
We can code the response variable as
0
Y = 1 2
if Stroke
if Drug Overdose
if Epileptic Seizure,
and perform multiple linear regression of Y on X. Does this seem reasonable?
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,6 / 36
This coding suggests an ordering of the response variable when there is no ordering at all. It also implies that the difference between Stroke and Epileptic Seizure is bigger than the difference between Stroke and Drug Overdose. The multiple linear regression model is not appropriate here.
Credit card data: simple linear regression
| | | | |||||||||| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||| || | |
|||||||| ||||| | | ||
0 500 1000 1500 2000 2500 Balance
E(Y|X =x)=Pr(Y =1|X =x)≈β0 +β1x.
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,7 / 36
Consider using a simple linear regression model to predict the probability of default using our coding
Y = 1 ifDefault=Yes This model might seem perfect because
0 if Default = No
E(Y|X =x) = 0Pr(Y =0|X =x)+1Pr(Y =1|X =x) = Pr(Y=1|X=x)
≈ β0+β1x.
That is, we model the Pr(Y = 1|X = x) as a linear function of X. However, the simple linear regression model will produce probability estimates greater than one and less than zero (amongst other drawbacks). We can do better.
Probability of Default 0.0 0.2 0.4 0.6 0.8 1.0
Logistic regression
Logistic regression uses the logistic function, given by
Pr(Y =1|X =x)=p(x)= exp(β0 +β1x)
1 + exp(β0 + β1x) Because exp(.) > 0, p(x) will take values between 0 and 1.
We can rearrange the logistic function to give
log p(x) =β0+β1x,
1−p(x)
called the log odds or logit transformation of p(x).
G. Erd ́elyi, University of Canterbury 2021
The log odds transform of p(x) is
p(x) =
p(x)(1 + exp(β0 + β1x)) = p(x) =
p(x)
log 1−p(x) =
Pr(Y = 1|X = x)
log Pr(Y=0|X=x) =
A non-linear transformation of p(x) is a probabilities on a non-linear scale).
STAT318/462 — Data Mining
exp(β0 +β1x)
1 + exp(β0 + β1x)
exp(β0 +β1x) (1−p(x))exp(β0 +β1x)
β0+β1x
,8 / 36
β0+β1x.
linear function of x (we are modelling
Credit card data: logistic regression
| | | | || | ||||| || |||||||||||||||||||||||||||||||||||||||||||||||||| |||| | | | | | |
|||||||| ||||| | | ||
0 500 1000 1500 2000 2500 Balance
log p(x) =β0+β1x. 1−p(x)
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,9 / 36
To define a classifier, we need to specify a classification threshold probability p(x) = α. For example, we could choose p(x) = 0.5 (like Bayes classifier) to minimize the error rate
Default=Yes if p(x) ≥ 0.5
Default=No otherwise.
Alternatively, we could use a lower threshold, p(x) = 0.3 for example, to reduce the risk of misclassifying defaulters (more about this later). We can also define the classifier using the decision boundary (where p(x) = α). For example, if p(x) = 0.5 our classifier is
f (x ) =
f (x ) =
Default=Yes if x ≥ −β0/β1
Default=No otherwise.
Probability of Default 0.0 0.2 0.4 0.6 0.8 1.0
Parameter estimation
In logistic regression, we use maximum likelihood estimation (MLE) to estimate the parameters. That is, the parameters are chosen to maximize the likelihood of observing the training data, Tr = {xi,yi}ni=1.
For binary classification problems (assuming Bernoulli trials), the likelihood
function is
n
l(β0, β1|Tr) = p(xi )yi (1 − p(xi ))1−yi .
i=1
We will be estimating parameters, β0 and β1, using R.
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,10 / 36
Let’s see where the likelihood function comes from. For a single trial we have
Pr(Y =1|X =x) = p(x) Pr(Y=0|X=x) = 1−p(x)
Pr(Y =y|X =x) = p(x)y(1−p(x))1−y,
where y ∈ (0, 1). Hence, the joint probability of the observed sample is
n
p(xi )yi (1 − p(xi ))1−yi , i=1
which is called the likelihood function l(β0,β1|Tr).
We then find the β0 and β1 that maximize l(β0,β1|Tr), where
p(xi)= exp(β0+β1xi) . 1+exp(β0 +β1xi)
This maximization is done using an optimization algorithm (which is all taken care of in R).
Making Predictions: credit card data
Logistic regression using Balance as the predictor
log p(x) =β0+β1Balance.
Std. Error 0.3612 0.0002
1−p(x) Coefficient
Z-statistic -29.5
Intercept -10.6513 Balance 0.0055
p-value <0.0001 24.9 <0.0001
Estimate the probability of Default for someone with a balance of (a) $1000.
(b) $2000.
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,11 / 36
The Z-statistic plays the same role as the t-statistic did for linear regression. The estimated slope parameter βˆ1 is significant, so Balance is useful for predicting the probability of an individual defaulting. The estimated probabilities are:
pˆ(Balance = 1000) = exp(−10.6513 + 0.0055(1000)) ≈ 0.006 1 + exp(−10.6513 + 0.0055(1000))
pˆ(Balance = 2000) = exp(−10.6513 + 0.0055(2000)) ≈ 0.586 1 + exp(−10.6513 + 0.0055(2000))
The model suggests that individuals with larger credit card balances have a greater chance of defaulting (which makes intuitive sense).
Making Predictions: credit card data
Logistic regression using Student as the predictor
log p(x) =β0+β1Student.
1−p(x) Coefficient
Intercept -3.5041 Student[Yes] 0.4049
Std. Error 0.0707 0.1150
Z-statistic p-value -49.55 <0.0001
Estimate the probability of Default for a (a) student.
(b) non-student.
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining
The predictor is coded using a dummy variable
,12 / 36
Student =
1 if Student = Yes
0 if Student = No.
The estimated slope parameter βˆ1 is significant, so knowing whether someone is a student is useful for predicting the probability of an individual defaulting. The estimated probabilities are:
pˆ(Student=Yes) = exp(−3.5041 + 0.4049(1)) ≈ 0.0431 1 + exp(−3.5041 + 0.4049(1))
pˆ(Student=No) = exp(−3.5041 + 0.4049(0)) ≈ 0.0292 1 + exp(−3.5041 + 0.4049(0))
The model suggests that students have a slightly greater chance of defaulting.
3.52
0.0004
Multiple Logistic regression
When we have a binary response using p predictors, the log odds is simply log p(x) =β0+β1x1+...+βpxp.
1−p(x)
We can also extend the logistic model to polynomial logistic regression. For
example, the quadratic logistic model for one predictor (binary response) is log p(x) =β0+β1x1+β2x12.
1−p(x)
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,13 / 36
Adding polynomial terms increases the flexibility of the logistic regression model (just like it did in linear regression). We will consider choosing the correct level of flexibility for these models in Section 5.
Multiple logistic regression: credit card data
Multiple logistic regression using Balance and Student as the predictors log p(x) =β0+β1Balance+β2Student.
1−p(x) Coefficient
Std. Error 0.3692 0.0002 -4.185
Z-statistic p-value -29.12 <0.0001 24.75 <0.0001 3.52 <0.0001
Intercept
Balance
Student[Yes]
-0.1075 0.0057 -0.7149
G. Erd ́elyi, University of Canterbury 2021
STAT318/462 — Data Mining ,14 / 36
The estimated slope parameters are both significant, so Balance and Student are both useful for predicting the probability of default. Once again, if p(x) > 0.5 we classify x to class 1. Otherwise x is class 0.
The decision boundary (using p(x) = 0.5) is linear and can be found as follows:
log
1 − p(x) Balance
= β0 + β1Balance + β2Student = βˆ0 + βˆ1Balance + βˆ2Student
βˆ 2 βˆ 0 = − βˆ Student − βˆ ,
11
p(x)ˆˆ ˆ
which is a linear boundary.
0
More than two classes
It is possible to perform logistic regression with more than two classes, although we will not discuss this here.
We will consider a popular technique called discriminant analysis when we have more than two classes (and others later in the course).
If the classes are well separated and the joint distribution of the predictors in each class is approximately normal, linear discriminant analysis is more stable than logistic regression.
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,15 / 36
Discriminant analysis
The basic approach is to model the distribution of X in each class separately, and then use these distributions and Bayes theorem to approximate Pr(Y|X).
We will assume X is normally distributed in each class, which leads to linear and quadratic discriminant analysis.
This approach is quite general and other distributions for X can be used.
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,16 / 36
Bayes theorem
Bayes theorem states
Pr(Y = k|X = x) = pk(x) = Kj=1 πjgj(x),
where
gk(x) = g(x|Y = k) is the PDF (density) for class k;
πk = Pr(Y = k) is the probability of being in class k.
If we could estimate gk(x) and πk, we could approximate pk(x) using Bayes theorem. If we classify an observation X = x to the class for which pk (x ) is greatest, we have an approximation to Bayes classifier!
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,17 / 36
The probability πk is also called the prior probability for class k or the marginal probability. These probabilities must sum to one: k πk = 1
πkgk(x)
Linear discriminant analysis when p = 1
The normal density for X in class k has the form
11 gk(x)=√2πσ exp −2σ2(x−μk)2 ,
kk
where μk and σk2 are the mean and variance in class k, respectively.
In linear discriminant analysis (LDA) we assume constant variance so that σk2 = σ2 for each class.
Bayes theorem gives us
pk(x)=K π√1 exp−1(x−μ)2.
π √1 exp− 1 (x−μ )2
k j=1 j 2πσ 2σ2 j
k 2πσ 2σ2
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,18 / 36
We want to find the most likely class for X = x. That is, the class for which pk(x) is greatest, which is equivalent to
1 2 argmaxk πkexp −2σ2(x−μk) ,
where argmaxk means ‘find the k (class) that maximizes this expression’. Taking logs (we can do this because log is a monotonically increasing function) we have
122 argmaxk log(πk) − 2σ2 (x − 2μkx + μk) ,
which is equivalent to
μ μ2 argmax k x− k +log(π)
k σ2 2σ2 k
because the x2 term does not involve k. We call the quantity δ (x) = μk x −
k σ2
μ2k + log(π ), the discriminant score. The value of k that maximizes δ (x) will
2σ2 k k also maximize pk(x).
Linear discriminant analysis when p = 1
To classify an observation X = x, we need to know which pk(x) is greatest.
This is equivalent to assigning x to the class with the greatest discriminant score, given by
δk(x)=μkx−μ2k +log(πk), σ2 2σ2
which is linear in x.
Exercise: Consider a two class classification problem with π1 = π2 = 0.5. Assume that each class is normally distribution with a common variance. Find the LDA decision boundary.
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,19 / 36
For class 1 we have X ∼ Normal(μ1,σ2) and for class 2 we have X ∼ Normal(μ2,σ2), and we assume that μ1 ̸= μ2 (if μ1 = μ2 the PDFs would be on top of each other).
The decision boundary is where δ1(x) = δ2(x): μ1x−μ21 =μ2x−μ2
σ2 2σ2
(μ1 − μ2)x =
σ2 2σ2 12(μ21 − μ2)
1(μ1 −μ2)(μ1 +μ2) 2 (μ1 −μ2)
x =
= 12(μ1+μ2).
If π1 > π2, the decision boundary would move away from μ1 etc. Because each class is normally distributed and with common variance, Bayes classier and LDA are equivalent (they have the same decision boundary).
Two-class example
−4 −2 0 2 4
In this classification problem, μ1 = −1.5, μ2 = 1.5, π1 = π2 = 0.5 and σ2 = 1. G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,20 / 36
The LDA classifier (using p(X) = 0.5) is
1 ifx<0
f (x ) =
which is equivalent to Bayes classifier in this example. The Bayes error rate
is shaded below.
2 otherwise,
π1P(X >0|Y =1)+π2P(X <0|Y =2)
−4 −2 0 2 4
Estimating the parameters
Marginal probabilities, πk: Use the proportion of training observations in class k: πˆ k = n k / n .
Parameters in gk(x;μk,σ2): Use the sample mean in class k and the overall
variance estimator:
σˆ 2 =
where μˆk and nk are the sample mean and the number of observations in class k.
The take home message here is that to fit an LDA model, only basic sample statistics are required: proportions, means and a sample variance.
Revision material: The sample variance for the kth class is σˆ k2 = 1 ( x i − μˆ k ) 2 ,
and E(σˆk2) = σk2 (it is an unbiased estimator of the population variance σk2). LDA assumes that each class has a common variance σ2, so each σˆk2 should be similar.
μˆk = n1 xi k i:yi=k
n−K k=1i:yi=k
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,21 / 36
1K
( x i − μˆ k ) 2 ,
nk −1i:yi=k
LDA classifier
The estimated linear discriminant score is δˆk(x)=μˆkx−μˆ2k +log(πˆk)
σˆ2 2σˆ2
and the probability of X = x being in class k can be estimated using
exp(δˆk (x )) Kj=1 exp(δˆj(x))
For a two-class classification problem, we classify x using
0 ifPr(Y=0|X=x)≥0.5 fˆ ( x ) =
1 otherwise.
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,22 / 36
An equivalent form: For a two-class classification problem, we classify x using
Pr(Y = k|X = x) =
0 ifδ(x)>δ(x) fˆ ( x ) = 0 1
1 otherwise.
Two-class classification example
−4 −2 0 2 4 −3 −2 −1 0 1 2 3 4
In this classification problem, μ1 = −1.5, μ2 = 1.5, π1 = π2 = 0.5 and σ2 = 1. G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,23 / 36
The histograms of the training observations in each class look ‘normal-ish’ with similar variances. The estimated decision boundary is the vertical line and dashed line is Bayes decision boundary (which is equivalent to the true LDA decision boundary because the true densities are normal with equal variance). Don’t get too caught up on whether the classes are ‘normally distributed with equal variance’. The model can still be useful (give good predictions) if these assumptions are wrong (more about this later).
012345
LDA for p > 1
The multivariate normal density for X in class k has the form
1 1 T−1
gk(x)=(2π)p/2|Σk|1/2exp −2(x−μk) Σk (x−μk)
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,24 / 36
We will not be going into details of multivariate distributions, but a basic un- derstanding of the variance covariance matrix Σ is useful. In 2-D with random variables X1 and X2, Σ has the form:
V(X1) Cov(X1,X2) E(X1 −E(X1))2 E(X1 −E(X1))(X2 −E(X2)) Cov(X1,X2) V(X2) = E(X1 −E(X1))(X2 −E(X2)) E(X2 −E(X2))2
where the Cov(X1,X2) is a measure of the joint variability of X1 and X2. In the left figure, X1 and X2 are uncorrelated (covariance is zero):
V(X1) 0 Σ= 0 V(X2)
In the right figure, X1 and X2 are positively correlated (an increase in X1 tends to correspond to an increase in X2). In this case, Cov(X1, X2) > 0.
If our random variables are independent, their pair-wise covariances are zero. Hence, if we assume independence, Σ is a diagonal matrix, which simplifies things a lot (we will make this assumption later).
x2
x2
x1
x1
LDA for p > 1
In LDA we assume that the covariance matrices are the same for all K classes,
Σk = Σ.
The discriminant score is
δk(x)=xTΣ−1μk −21μTk Σ−1μk +log(πk).
We need to estimate μk , πk and Σ using the training data. The formulas are
similar to what we’ve seen, but we’ll let R calculate them for us.
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,25 / 36
I realise that some students do not have a background in linear algebra and hence, the discriminant score equation above may not make sense. It is included here for completeness and for those that do have sufficient background. We will not be working with matrix equations in this course. I want you to understand what needs to be estimated to fit the model (one covariance matrix (actually, its inverse), k mean vectors and the prior probabilities).
Three-class classification example
−4 −2 0 2 4 −4 −2 0 2 4
X1
The dashed lines show the Bayes decision boundary.
X1
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,26 / 36
The ellipses are contours of the Normal PDFs for each class, where each ellipse contains ≈ 95% of the probability of its class. We can see that X1 and X2 are positively correlated and that each class has the same covariance matrix (an as- sumption for LDA) because the contours have the same shape, orientation and size.
To fit the model we needed to estimate one covariance matrix (3 things V(X1),V(X2) and Cov(X1,X2)), k mean vectors (6 individual means) and the prior probabilities.
Revision Material: The sample covariance of two random variables X and Y is:
1 n
Cov(X,Y)=
n−1
i=1
(xi −x ̄)(yi −y ̄)
X2
−4 −2 0 2 4
X2
−4 −2 0 2 4
Default data
Exercise: Predict whether an individual will default on the basis of student status and credit card balance.
Confusion matrix for LDA on the default data:
True default status
No Predicted No 9644 default status Yes 23
Yes
252
Total 9896 104 10000
to the majority class) has
81 Total 9667 333
The training error is (23 + 252)/10000 or 2.75%.
The null classifier (assign all unclassified observations
training error of 333/10000 or 3.33%.
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,27 / 36
The training error is low, maybe we are over-fitting? Probably not because there are 10,000 training observations and we only have to estimate a few parameters in LDA (a 2-D variance covariance matrix, two mean vectors and two proportions).
Class Imbalance Problem:
• For the true non-defaulters, we have an error rate of 23/9667 = 0.2%.
• For the true defaulters, we have an error rate of 252/333 = 75.5%.
The model is much better at predicting non-defaulters because there are many more non-defaulters in the training data. It’s usually easier to model the majority class because we have more training observations to learn from.
• Possible remedy: up-sample the minority class and/or down-sample the majority class (these sampling techniques are not covered in this course, but you should be aware of the class imbalance problem and its consequences).
Types of Errors
Positive Examples: examples from the main class of interest.
Negative Examples: examples that are not from the main class of interest.
False Positive Rate: the fraction of negative examples that are incorrectly classified as positive (Type I error, 1 − Specificity).
True Positive Rate: the fraction of positive examples that are correctly classified as positive (1-Type II error, Power, Sensitivity).
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,28 / 36
Often, the positive examples are the minority (rare) class in the training data. In our case, the positive examples are defaulters (we want our model to predict defaulters because they will affect the credit card company the most).
• The false positive rate is 23/9667 = 0.002 (very few non-defaulters in the training data are classified as defaulters).
• The true positive rate is 81/333 = 0.243 (about 25% of defaulters in the training data are correctly classified as defaulters, which is not that great).
• To improve the true positive rate, we could consider changing the classification threshold (more later).
Possible outcomes for a classifier
True Class
Predicted Class
Neg. Neg. True Neg. (TN) Pos. False Pos. (FP) Total N
Pos. Total False Neg. (FN) N∗ True Pos. (TP) P∗
P
Precision = TP/P∗
Specificity (true negative rate) = 1 – FP/N = TN/N Sensitivity (true positive rate) = TP/P
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining
,29 / 36
• TN: The number of negative examples correctly predicted by the model.
• TP: The number of positive examples correctly predicted by the model.
• FN: The number of positive examples wrongly predicted as negative by the model.
• FP: The number of negative examples wrongly predicted as positive by the model.
Since the training error (and testing error) treats every class as equally important, it may not be suitable for analysing imbalanced data sets. Precision, specificity and sensitivity (and others not covered here) can be useful for assessing the performance of a classifier in this case. In this class, we will only consider the measures discussed in these notes.
Varying the threshold
Bayes classifier works by assigning an observation to the most likely class. For the default data we classify an individual as a defaulter if
Pr(default = Yes|X = x) > 0.5.
If we have additional domain knowledge, the classification threshold can be altered. For example, if we want to increase the number of true positives in our LDA model for the default data, we could use
Pr(default = Yes|X = x) > 0.2.
However, we expect the overall error rate to increase because Bayes classifier uses 0.5.
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,30 / 36
Choosing the threshold equal to 0.5 aims to minimise the training and testing error rates, which may not be the best idea if you’re interested in the minority class of an imbalanced data set.
Varying the threshold
Pr(default = Yes|X =
True
No Predicted No 9644 default status Yes 23 Total 9667
Pr(default = Yes|X =
True
No Predicted No 9432 default status Yes 235 Total 9667
x) > 0.5.
default status
Yes Total 252 9896 81 104 333 10000
x) > 0.2.
default status
Yes Total 138 9570 195 430 333 10000
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,31 / 36
Decreasing the classification threshold increases the number of training observa- tions that will be classified as defaulters (as expected because defaulters are the positive class here).
• The sensitivity (true positive rate) increased from 0.24 to 0.59, but the false positive rate has increased from 0.002 to 0.024 (or if you prefer, the specificity (true negative rate) decreased from 0.998 to 0.976).
• The precision has also decreased from 0.78 to 0.45 (because we are predicting more defaulters than we should be).
• The overall training error rate was
(235 + 138)/10000 = 3.73%,
which is worse than the null classifier. However, the bank could be happy with this new classification model because it identifies more defaulters in the training data. If there were a cost differential between a low-cost false positive and a higher cost false negative, the model would be useful. That is, if classifying a non-defaulter as defaulter is relatively cheap, the model is useful. Otherwise, the model is not particularly useful.
Varying the threshold: default data
0.0 0.1 0.2
0.3 0.4 0.5
Threshold
The black curve is the overall error rate, orange is the false positive rate and blue is the false negative rate.
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,32 / 36
As we decrease the classification threshold, more negative examples are classified as positive (the false positive rate increases).
Error Rate
0.0 0.2 0.4 0.6
ROC Curve: default data
ROC Curve
0.0 0.2 0.4 0.6 0.8 1.0
False positive rate
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,33 / 36
The ROC curve plots the false positive rate (FPR, 1-specificity) and true positive rate (TPR, sensitivity) as the classification threshold is changed. There are three points along the curve that have well-known interpretations:
• TPR = 0, FPR = 0: The model predicts every example to be a negative class.
• TPR = 1, FPR = 1: The model predicts every example to be a positive class.
• TPR = 1, FPR = 0: The ideal model.
A good classifier should be as close as possible to the top left hand corner of the diagram (random guessing corresponds to the dashed line in the diagram).
The overall performance of the classifier is given by the area under the curve (AUC). The larger the AUC, the better the classifier. For the default data, LDA has an AUC = 0.95, which is close to the maximum value of one. We expect a classifier that performs no better than random guessing to have an AUC of 0.5.
We will not be plotting ROC curves in this class, but they are mentioned in the textbook so basic details are provided here so that you can interpret these plots.
True positive rate
0.0 0.2 0.4 0.6 0.8 1.0
Quadratic discriminant analysis
Quadratic discriminant analysis (QDA) is similar to LDA.
Like LDA, QDA assumes a normal distribution for X in each class.
Unlike LDA, QDA assumes that each class has its own covariance matrix: X ∼ Normal(μk , Σk ).
The discriminant score is
δk(x) = −1(x −μk)TΣ−1(x −μk)− 1 log|Σk|+log(πk).
2k2
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,34 / 36
Once again, I realise that some students do not have a background in linear algebra and hence, the discriminant score equation above may not make sense. In one dimension (I do expect you to understand this), the quadratic discriminant score is
kk
which is a quadratic equation (the x2 term on slide 18 cannot be cancelled here because σk depends on class k). I want you to understand what needs to be estimated to fit the QDA model (K covariance matrices, K mean vectors and the prior probabilities). Hence, we need to estimate more parameters in QDA than in LDA:
• QDA is more complex than LDA,
• expect higher variance for QDA than LDA, but lower bias.
Therefore, we require more training observations to fit the QDA model. If the training data set is relatively small, we run the risk of over-fitting if QDA is used (we can check for this using the method of cross-validation in section 5).
1 2 πk δk(x)=−2σ2(x−μk) +log σ ,
Quadratic discriminant analysis
−4 −2 0 2 4 −4 −2 0 2 4
X1
The purple dashed curve is Bayes decision boundary.
X1
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,35 / 36
If there are too many predictors (and not enough training data), QDA breaks down because we have to estimate several large variance covariance matrices Σk (and their inverses!). One way forward is to assume conditional independence for each class:
p
fk(x) = fjk(xj),
j=1
where each fjk is a normal density. This means Σk is a diagonal matrix (the independence assumption means the covariances are zero). If we make the in- dependence assumption, the method is called Naive Bayes classifier. There are several advantages to using this approach:
• Far fewer parameters to estimate.
• The matrix inverse is a trivial calculation.
• Can be very useful for large p.
• The independence assumption is usually wrong, but the classifier can work very well in practice.
X2
−4 −3 −2 −1 0 1 2
X2
−4 −3 −2 −1 0 1 2
Summary
Bayes classifier is the best you can do, but we don’t know the required distributions to compute it!
Logistic regression is very popular, especially for two-class classification problems.
If classes are well-separated and the normally distributed assumptions are reasonable, LDA (or QDA) is a good approximation to Bayes classifier.
LDA is also useful for multi-class problems and when the training data set is small.
G. Erd ́elyi, University of Canterbury 2021 STAT318/462 — Data Mining ,36 / 36