STAT318/462 — Data Mining
Dr Gábor Erdélyi
University of Canterbury, Christchurch
Course developed by Dr B. Robertson. Some of the figures in this presentation are taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.
Classification
Suppose we have a qualitative response Y (classification problem) that takes values from a finite (unordered) set
C = {y1,y2,…,yK},
and p different predictors,
X = (X1,X2,…,Xp).
Our goal is to build a classifier f (X ) that assigns a class label from C to an
unclassified observation X.
We are often interested in estimating the probability that X belongs to a
particular class (category).
Credit card data
[Figure: scatter plot of annual Income against monthly credit card Balance for the Default data.]
The individuals who defaulted are shown in orange, and those that did not in blue.
We are interested in whether an individual will default (response variable) based on annual income and monthly credit card balance (predictor variables). Observations:
• The classes are not linearly separable (there is overlap between the two classes as expected).
• The predictor Balance ‘separates’ the training data better than Income.
• A simple subjective classifier could be

f(x) = Default=Yes if Balance > 1300,
       Default=No otherwise,

but we could probably do better (a small R sketch of this rule follows).
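A minimal R version of this rule (a sketch, assuming the Default data frame from the ISLR package, which matches the credit card data used in these slides):

    library(ISLR)                                   # provides the Default data (default, student, balance, income)
    rule_pred <- ifelse(Default$balance > 1300, "Yes", "No")
    mean(rule_pred != Default$default)              # training error rate of this subjective rule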
Credit card data
These box plots illustrate the basic properties of the distributions of defaulters and non-defaulters for the two predictor variables. Observations:
• For Income, the distributions of defaulters and non-defaulters are more or less the same.
• For Balance, there is good separation between the two classes.
Credit card data: linear regression
We can code the response variable Default as

Y = 0 if No,
    1 if Yes,

and perform multiple linear regression of Y on X.
We could then classify an observation as Yes if the predicted response is greater than 0.5 and No otherwise.
This seems reasonable, right?
What about more than two classes?
A patient presents at an emergency room and we must classify them as Stroke, Drug Overdose or Epileptic Seizure (response values) based on their symptoms (predictors).
We can code the response variable as
Y = 1 if Stroke,
    2 if Drug Overdose,
    3 if Epileptic Seizure,
and perform multiple linear regression of Y on X. Does this seem reasonable?
This coding suggests an ordering of the response variable when there is no ordering at all. It also implies that the difference between Stroke and Epileptic Seizure is bigger than the difference between Stroke and Drug Overdose. The multiple linear regression model is not appropriate here.
Credit card data: simple linear regression
[Figure: estimated probability of Default as a function of Balance, fitted by linear regression.]
E(Y|X =x)=Pr(Y =1|X =x)≈β0 +β1x.
Consider using a simple linear regression model to predict the probability of default using our coding
Y = 1 if Default = Yes,
    0 if Default = No.

This model might seem perfect because

E(Y|X = x) = 0·Pr(Y = 0|X = x) + 1·Pr(Y = 1|X = x) = Pr(Y = 1|X = x).
That is, we model Pr(Y = 1|X = x) as a linear function of X. However, the simple linear regression model will produce probability estimates greater than one and less than zero (amongst other drawbacks). We can do better.
Logistic regression
Logistic regression uses the logistic function, given by
Pr(Y = 1|X = x) = p(x) = exp(β0 + β1x) / (1 + exp(β0 + β1x)).

Because exp(·) > 0, p(x) will take values between 0 and 1.
We can rearrange the logistic function to give
log( p(x) / (1 − p(x)) ) = β0 + β1x,
called the log odds or logit transformation of p(x).
The log odds transform of p(x) can be derived as follows:

p(x)(1 + exp(β0 + β1x)) = exp(β0 + β1x)
p(x) = (1 − p(x)) exp(β0 + β1x)
p(x) / (1 − p(x)) = exp(β0 + β1x)
log( p(x) / (1 − p(x)) ) = β0 + β1x
log( Pr(Y = 1|X = x) / Pr(Y = 0|X = x) ) = β0 + β1x.

A non-linear transformation of p(x) is a linear function of x (we are modelling probabilities on a non-linear scale).
Credit card data: logistic regression
[Figure: estimated probability of Default as a function of Balance, fitted by logistic regression.]
log( p(x) / (1 − p(x)) ) = β0 + β1x.
To define a classifier, we need to specify a classification threshold probability α. For example, we could choose α = 0.5 (like the Bayes classifier) to minimize the error rate:

f(x) = Default=Yes if p(x) ≥ 0.5,
       Default=No otherwise.

Alternatively, we could use a lower threshold, α = 0.3 for example, to reduce the risk of misclassifying defaulters (more about this later). We can also define the classifier using the decision boundary (where p(x) = α). For example, if α = 0.5 our classifier is

f(x) = Default=Yes if x ≥ −β0/β1,
       Default=No otherwise.
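To illustrate how the threshold changes the predicted labels, here is a small R sketch; the vector p_hat holds made-up probabilities standing in for fitted values from a logistic regression:

    p_hat <- c(0.10, 0.35, 0.55, 0.80)              # illustrative estimated probabilities of default
    ifelse(p_hat >= 0.5, "Yes", "No")               # threshold alpha = 0.5
    ifelse(p_hat >= 0.3, "Yes", "No")               # alpha = 0.3 flags more potential defaulters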
Parameter estimation
In logistic regression, we use maximum likelihood estimation (MLE) to estimate the parameters. That is, the parameters are chosen to maximize the likelihood of observing the training data, Tr = {(xi, yi) : i = 1, …, n}.
For binary classification problems (assuming Bernoulli trials), the likelihood
function is
l(β0, β1 | Tr) = ∏_{i=1}^{n} p(xi)^yi (1 − p(xi))^(1−yi).
We will be estimating parameters, β0 and β1, using R.
Let’s see where the likelihood function comes from. For a single trial we have
Pr(Y = 1|X = x) = p(x) and Pr(Y = 0|X = x) = 1 − p(x), so that

Pr(Y = y|X = x) = p(x)^y (1 − p(x))^(1−y),

where y ∈ {0, 1}. Hence, the joint probability of the observed sample is

∏_{i=1}^{n} p(xi)^yi (1 − p(xi))^(1−yi),
which is called the likelihood function l(β0,β1|Tr).
We then find the β0 and β1 that maximize l(β0,β1|Tr), where
p(xi) = exp(β0 + β1xi) / (1 + exp(β0 + β1xi)).
This maximization is done using an optimization algorithm (which is all taken care of in R).
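In R the fit is a single call to glm() with family = binomial; a minimal sketch, assuming the Default data from the ISLR package:

    library(ISLR)                                        # Default data
    fit <- glm(default ~ balance, data = Default, family = binomial)
    summary(fit)$coefficients                            # estimates, std. errors, z-statistics, p-values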
Making Predictions: credit card data
Logistic regression using Balance as the predictor
log( p(x) / (1 − p(x)) ) = β0 + β1 Balance.

             Coefficient   Std. Error   Z-statistic   p-value
Intercept      -10.6513    0.3612       -29.5         <0.0001
Balance          0.0055    0.0002        24.9         <0.0001
Estimate the probability of Default for someone with a balance of (a) $1000.
(b) $2000.
The Z-statistic plays the same role as the t-statistic did for linear regression. The estimated slope parameter βˆ1 is significant, so Balance is useful for predicting the probability of an individual defaulting. The estimated probabilities are:
p̂(Balance = 1000) = exp(−10.6513 + 0.0055(1000)) / (1 + exp(−10.6513 + 0.0055(1000))) ≈ 0.006

p̂(Balance = 2000) = exp(−10.6513 + 0.0055(2000)) / (1 + exp(−10.6513 + 0.0055(2000))) ≈ 0.586
The model suggests that individuals with larger credit card balances have a greater chance of defaulting (which makes intuitive sense).
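These estimates can be reproduced in R, either directly with the logistic function plogis() or with predict() on a fitted glm object (the commented line assumes the object fit from the earlier glm() sketch):

    plogis(-10.6513 + 0.0055 * c(1000, 2000))       # exp(z)/(1 + exp(z)) at Balance = 1000 and 2000
    # predict(fit, newdata = data.frame(balance = c(1000, 2000)), type = "response")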
Making Predictions: credit card data
Logistic regression using Student as the predictor
log( p(x) / (1 − p(x)) ) = β0 + β1 Student.

               Coefficient   Std. Error   Z-statistic   p-value
Intercept        -3.5041     0.0707       -49.55        <0.0001
Student[Yes]      0.4049     0.1150         3.52         0.0004
Estimate the probability of Default for a (a) student.
(b) non-student.
The predictor is coded using a dummy variable
x = 1 if Student = Yes,
    0 if Student = No.
The estimated slope parameter βˆ1 is significant, so knowing whether someone is a student is useful for predicting the probability of an individual defaulting. The estimated probabilities are:
p̂(Student = Yes) = exp(−3.5041 + 0.4049(1)) / (1 + exp(−3.5041 + 0.4049(1))) ≈ 0.0431

p̂(Student = No) = exp(−3.5041 + 0.4049(0)) / (1 + exp(−3.5041 + 0.4049(0))) ≈ 0.0292
The model suggests that students have a slightly greater chance of defaulting.
Multiple logistic regression
When we have a binary response and p predictors, the log odds is simply

log( p(x) / (1 − p(x)) ) = β0 + β1x1 + … + βpxp.
We can also extend the logistic model to polynomial logistic regression. For example, the quadratic logistic model for one predictor (binary response) is

log( p(x) / (1 − p(x)) ) = β0 + β1x1 + β2x1².
Adding polynomial terms increases the flexibility of the logistic regression model (just like it did in linear regression). We will consider choosing the correct level of flexibility for these models in Section 5.
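A sketch of a quadratic logistic fit in R, again assuming the Default data from the ISLR package; I(balance^2) adds the squared term to the model formula:

    library(ISLR)
    fit_quad <- glm(default ~ balance + I(balance^2), data = Default, family = binomial)
    summary(fit_quad)$coefficients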
Multiple logistic regression: credit card data
Multiple logistic regression using Balance and Student as the predictors:

log( p(x) / (1 − p(x)) ) = β0 + β1 Balance + β2 Student.

               Coefficient   Std. Error   Z-statistic   p-value
Intercept        -10.75      0.3692       -29.12        <0.0001
Balance            0.0057    0.0002        24.75        <0.0001
Student[Yes]      -0.7149    —             -4.185        <0.0001
The estimated slope parameters are both significant, so Balance and Student are both useful for predicting the probability of default. Once again, if p(x) > 0.5 we classify x to class 1. Otherwise x is class 0.
The decision boundary (using p(x) = 0.5) is linear and can be found as follows:
log( p(x) / (1 − p(x)) ) = β̂0 + β̂1 Balance + β̂2 Student.

Setting p(x) = 0.5 makes the left-hand side log(1) = 0, so on the boundary

0 = β̂0 + β̂1 Balance + β̂2 Student
Balance = −(β̂2/β̂1) Student − β̂0/β̂1,

which is a linear boundary.
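A sketch of this model and its decision boundary in R, assuming the Default data from the ISLR package; the two thresholds come from solving β̂0 + β̂1 Balance + β̂2 Student = 0 for Student = 0 and Student = 1:

    library(ISLR)
    fit2 <- glm(default ~ balance + student, data = Default, family = binomial)
    b <- coef(fit2)                                        # (Intercept), balance, studentYes
    -b["(Intercept)"] / b["balance"]                       # boundary Balance for non-students
    -(b["(Intercept)"] + b["studentYes"]) / b["balance"]   # boundary Balance for students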
More than two classes
It is possible to perform logistic regression with more than two classes, although we will not discuss this here.
We will consider a popular technique called discriminant analysis when we have more than two classes (and others later in the course).
If the classes are well separated and the joint distribution of the predictors in each class is approximately normal, linear discriminant analysis is more stable than logistic regression.
Discriminant analysis
The basic approach is to model the distribution of X in each class separately, and then use these distributions and Bayes theorem to approximate Pr(Y|X).
We will assume X is normally distributed in each class, which leads to linear and quadratic discriminant analysis.
This approach is quite general and other distributions for X can be used.
Bayes theorem
Bayes theorem states
Pr(Y = k|X = x) = pk(x) = πk gk(x) / Σ_{j=1}^{K} πj gj(x), where
gk(x) = g(x|Y = k) is the PDF (density) for class k;
πk = Pr(Y = k) is the probability of being in class k.
If we could estimate gk(x) and πk, we could approximate pk(x) using Bayes theorem. If we classify an observation X = x to the class for which pk (x ) is greatest, we have an approximation to Bayes classifier!
The probability πk is also called the prior probability for class k or the marginal probability. These probabilities must sum to one: Σ_{k=1}^{K} πk = 1.
Linear discriminant analysis when p = 1
The normal density for X in class k has the form
gk(x) = 1/(√(2π) σk) exp( −(1/(2σk²)) (x − μk)² ),

where μk and σk² are the mean and variance in class k, respectively.
In linear discriminant analysis (LDA) we assume constant variance so that σk2 = σ2 for each class.
Bayes theorem gives us
pk(x) = [ πk (1/(√(2π) σ)) exp( −(1/(2σ²)) (x − μk)² ) ] / [ Σ_{j=1}^{K} πj (1/(√(2π) σ)) exp( −(1/(2σ²)) (x − μj)² ) ].
We want to find the most likely class for X = x. That is, the class for which pk(x) is greatest, which is equivalent to
argmax_k { πk exp( −(1/(2σ²)) (x − μk)² ) },
where argmaxk means ‘find the k (class) that maximizes this expression’. Taking logs (we can do this because log is a monotonically increasing function) we have
argmax_k { log(πk) − (1/(2σ²)) (x² − 2μk x + μk²) },
which is equivalent to
argmax_k { (μk/σ²) x − μk²/(2σ²) + log(πk) },

because the x² term does not involve k. We call the quantity

δk(x) = (μk/σ²) x − μk²/(2σ²) + log(πk)

the discriminant score. The value of k that maximizes δk(x) will also maximize pk(x).
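A small R sketch of this discriminant score for the two-class example used on the following slides (μ1 = −1.5, μ2 = 1.5, σ² = 1, π1 = π2 = 0.5); the function name delta() is just illustrative:

    delta <- function(x, mu, sigma2, prior) {
      mu / sigma2 * x - mu^2 / (2 * sigma2) + log(prior)   # discriminant score delta_k(x)
    }
    x <- 0.4
    scores <- c(delta(x, -1.5, 1, 0.5), delta(x, 1.5, 1, 0.5))
    which.max(scores)                                       # class with the largest score (class 2 here)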
Linear discriminant analysis when p = 1
To classify an observation X = x, we need to know which pk(x) is greatest.
This is equivalent to assigning x to the class with the greatest discriminant score, given by
δk(x) = (μk/σ²) x − μk²/(2σ²) + log(πk),
which is linear in x.
Exercise: Consider a two-class classification problem with π1 = π2 = 0.5. Assume that each class is normally distributed with a common variance. Find the LDA decision boundary.
For class 1 we have X ∼ Normal(μ1, σ²) and for class 2 we have X ∼ Normal(μ2, σ²), and we assume that μ1 ≠ μ2 (if μ1 = μ2 the PDFs would be on top of each other).
The decision boundary is where δ1(x) = δ2(x):

(μ1/σ²) x − μ1²/(2σ²) + log(π1) = (μ2/σ²) x − μ2²/(2σ²) + log(π2).

Since π1 = π2, the log terms cancel and

(μ1 − μ2) x = (1/2)(μ1² − μ2²)
x = (1/2)(μ1 − μ2)(μ1 + μ2) / (μ1 − μ2) = (1/2)(μ1 + μ2).
If π1 > π2, the decision boundary would move away from μ1 (towards μ2), and so on. Because each class is normally distributed with a common variance, the Bayes classifier and LDA are equivalent (they have the same decision boundary).
Two-class example
[Figure: the two class-conditional normal densities.]
In this classification problem, μ1 = −1.5, μ2 = 1.5, π1 = π2 = 0.5 and σ² = 1.
The LDA classifier (using p(x) = 0.5) is

f̂(x) = 1 if x < 0,
        2 otherwise,

which is equivalent to the Bayes classifier in this example. The Bayes error rate (the shaded region in the figure) is

π1 P(X > 0|Y = 1) + π2 P(X < 0|Y = 2).
Estimating the parameters
Marginal probabilities, πk: Use the proportion of training observations in class k: π̂k = nk/n.

Parameters in gk(x; μk, σ²): Use the sample mean in class k and the pooled variance estimator:

μ̂k = (1/nk) Σ_{i: yi = k} xi   and   σ̂² = (1/(n − K)) Σ_{k=1}^{K} Σ_{i: yi = k} (xi − μ̂k)²,

where μ̂k and nk are the sample mean and the number of observations in class k.

The take home message here is that to fit an LDA model, only basic sample statistics are required: proportions, means and a sample variance.

Revision material: The sample variance for the kth class is

σ̂k² = (1/(nk − 1)) Σ_{i: yi = k} (xi − μ̂k)²,

and E(σ̂k²) = σk² (it is an unbiased estimator of the population variance σk²). LDA assumes that each class has a common variance σ², so each σ̂k² should be similar.
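A sketch of these estimates in R on simulated training data (two normal classes with a common variance; the data below are made up purely for illustration):

    set.seed(1)
    y <- rep(1:2, each = 50)                                # class labels
    x <- c(rnorm(50, mean = -1.5), rnorm(50, mean = 1.5))   # common variance sigma^2 = 1

    n <- length(y); K <- 2
    pi_hat <- table(y) / n                                     # prior proportions n_k / n
    mu_hat <- tapply(x, y, mean)                               # class sample means
    s2_hat <- sum((x - mu_hat[as.character(y)])^2) / (n - K)   # pooled variance estimate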
LDA classifier
The estimated linear discriminant score is

δ̂k(x) = (μ̂k/σ̂²) x − μ̂k²/(2σ̂²) + log(π̂k)

and the probability of X = x being in class k can be estimated using

Pr(Y = k|X = x) ≈ exp(δ̂k(x)) / Σ_{j=1}^{K} exp(δ̂j(x)).

For a two-class classification problem, we classify x using

f̂(x) = 0 if the estimated Pr(Y = 0|X = x) ≥ 0.5,
        1 otherwise.
An equivalent form: For a two-class classification problem, we classify x using

f̂(x) = 0 if δ̂0(x) > δ̂1(x),
        1 otherwise.
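A sketch, in R, of turning discriminant scores into estimated posterior probabilities and a class label for the two-class example (the parameter values are the ones assumed in that example, not estimates from data):

    score <- function(x, mu, sigma2, prior) mu / sigma2 * x - mu^2 / (2 * sigma2) + log(prior)
    x <- 0.4
    d <- c(score(x, -1.5, 1, 0.5), score(x, 1.5, 1, 0.5))   # discriminant scores for classes 1 and 2
    exp(d) / sum(exp(d))                                    # estimated Pr(Y = k | X = x)
    ifelse(d[1] > d[2], 1, 2)                               # equivalent classification rule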
Two-class classification example
[Figure: histograms of the training observations in each class, with the estimated (LDA) and Bayes decision boundaries.]
In this classification problem, μ1 = −1.5, μ2 = 1.5, π1 = π2 = 0.5 and σ² = 1.
The histograms of the training observations in each class look ‘normal-ish’ with similar variances. The estimated decision boundary is the vertical line and the dashed line is the Bayes decision boundary (which is equivalent to the true LDA decision boundary because the true densities are normal with equal variance). Don’t get too caught up on whether the classes are ‘normally distributed with equal variance’. The model can still be useful (give good predictions) if these assumptions are wrong (more about this later).
LDA for p > 1
The multivariate normal density for X in class k has the form
gk(x) = 1/( (2π)^(p/2) |Σk|^(1/2) ) exp( −(1/2) (x − μk)^T Σk^(−1) (x − μk) ).
We will not be going into details of multivariate distributions, but a basic understanding of the variance-covariance matrix Σ is useful. In 2-D with random variables X1 and X2, Σ has the form:

Σ = [ V(X1)         Cov(X1, X2) ]
    [ Cov(X1, X2)   V(X2)       ]

with V(Xj) = E[(Xj − E(Xj))²] and Cov(X1, X2) = E[(X1 − E(X1))(X2 − E(X2))],
where the Cov(X1,X2) is a measure of the joint variability of X1 and X2. In the left figure, X1 and X2 are uncorrelated (covariance is zero):
Σ = [ V(X1)   0     ]
    [ 0       V(X2) ]
In the right figure, X1 and X2 are positively correlated (an increase in X1 tends to correspond to an increase in X2). In this case, Cov(X1, X2) > 0.
If our random variables are independent, their pair-wise covariances are zero. Hence, if we assume independence, Σ is a diagonal matrix, which simplifies things a lot (we will make this assumption later).
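A short R sketch of what the two cases look like as sample covariance matrices (simulated data, purely illustrative):

    set.seed(1)
    x1     <- rnorm(1000)
    x2_ind <- rnorm(1000)                 # independent of x1: off-diagonal entries near 0
    x2_cor <- 0.7 * x1 + rnorm(1000)      # positively correlated with x1: off-diagonal entries > 0
    cov(cbind(x1, x2_ind))
    cov(cbind(x1, x2_cor))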
LDA for p > 1
In LDA we assume that the covariance matrices are the same for all K classes: Σk = Σ for each class k.
The discriminant score is
δk(x) = x^T Σ^(−1) μk − (1/2) μk^T Σ^(−1) μk + log(πk).
We need to estimate μk , πk and Σ using the training data. The formulas are
similar to what we’ve seen, but we’ll let R calculate them for us.
I realise that some students do not have a background in linear algebra and hence, the discriminant score equation above may not make sense. It is included here for completeness and for those that do have sufficient background. We will not be working with matrix equations in this course. I want you to understand what needs to be estimated to fit the model (one covariance matrix (actually, its inverse), K mean vectors and the prior probabilities).
Three-class classification example
The dashed lines show the Bayes decision boundary.
The ellipses are contours of the normal PDFs for each class, where each ellipse contains ≈ 95% of the probability of its class. We can see that X1 and X2 are positively correlated and that each class has the same covariance matrix (an assumption for LDA) because the contours have the same shape, orientation and size.
To fit the model we needed to estimate one covariance matrix (3 things: V(X1), V(X2) and Cov(X1, X2)), K = 3 mean vectors (6 individual means) and the prior probabilities.
Revision Material: The sample covariance of two random variables X and Y is

(1/(n − 1)) Σ_{i=1}^{n} (xi − x̄)(yi − ȳ).
Default data
Exercise: Predict whether an individual will default on the basis of student status and credit card balance.
Confusion matrix for LDA on the default data:
                          True default status
                          No        Yes       Total
Predicted     No          9644      252        9896
default       Yes           23       81         104
status        Total       9667      333       10000

The training error is (23 + 252)/10000 or 2.75%.

The null classifier (assign all unclassified observations to the majority class) has a training error of 333/10000 or 3.33%.
The training error is low, maybe we are over-fitting? Probably not beca
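A confusion matrix of this form can be produced in R with lda() from the MASS package; a sketch, assuming the Default data from the ISLR package (the exact counts depend on the fitted model, but the slide's table came from an LDA model of this form):

    library(ISLR)      # Default data
    library(MASS)      # lda()
    lda_fit  <- lda(default ~ balance + student, data = Default)
    lda_pred <- predict(lda_fit)$class
    table(Predicted = lda_pred, True = Default$default)   # confusion matrix
    mean(lda_pred != Default$default)                     # training error rate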