(Linear) Classification
Data mining
Prof. Dr. Matei Demetrescu
Predicting qualitative responses
Recall, qualitative variables take values in an unordered set C, such as: eye color ∈ {brown, blue, green}, email ∈ {spam, ham}.
The classification task:
given a feature vector X and a (typically unobserved) qualitative response Y taking values in the set C = {y1, …, yK},
build a function C(x) that takes as input the feature vector X and predicts the unknown Y, Ŷ = C(X) ∈ C.
To do this we need to estimate the conditional class probabilities.
We know that conditional class probabilities lead to optimal classifiers in the population;
And sometimes we need the probabilities P(Y = yk|X) themselves.
Today’s outline
1 Linear regression?
2 Logistic regression
3 Discriminant analysis
4 Up next
Linear regression?
Example: real estate
Say the target is a qualitative variable: you distinguish between affordable and luxury (Price > 50) housing.
[Figure: affordable (gray) and luxury (red) housing, shown against Age, Location, and distance to public transportation.]
Can we use linear regression?
Use the real estate data to fit a linear classifier, and code Y = 0 if Affordable and Y = 1 if Luxury.
Can we simply perform a linear regression of Y on X (say Location) and classify as Luxury if Ŷ > 0.5, say?
Since in the population E(Y |X = x) = P(Y = 1|X = x), regression seems appropriate.
In this case of a binary outcome, linear regression is equivalent under certain conditions to linear discriminant analysis, which we discuss later (and does a good job under the circumstances).
However, linear regression might produce probabilities below zero or above one, and sometimes none larger than 0.5.
Linear regression for real estate data
[Figure: linear regression fit of housing type (0/1) on distance to public transportation.]
Linear regression does not estimate P(Y = 1|X) well (and does not
predict well either). Logistic regression will be better.
Linear regression continued
Now suppose we have a response variable with three possible values: a patient presents at the emergency room, and we must classify them according to their symptoms:
Y = 1 if stroke; Y = 2 if drug overdose; Y = 3 if epileptic seizure.
This coding suggests an ordering!
Linear regression fails utterly here. Multiclass logistic regression or discriminant analysis are more appropriate. (And in case of genuinely ordered outcomes, use ordered logit.)
Logistic regression
The model
Let’s write p(x) = P(Y = 1|X = x). We may then set C(x) = I(p(x) > τ) for a suitable τ. (Usually τ = 1/2, but not everyone works with the 0/1 loss.)
Logistic regression models1 take
p(X) = e^(β0 + β1X) / (1 + e^(β0 + β1X)).
Clearly, p(X) ∈ (0, 1). Equivalently,
log( p(X) / (1 − p(X)) ) = β0 + β1X.
This is called the log odds or logit transformation of p(X).
1You may recognize a generalized linear model with the canonical link.
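As a minimal sketch of this model and the induced classifier in R (the coefficient values below are purely illustrative, close to the fit reported on a later slide):

    # Sketch: logistic probability and the induced classifier (beta values are illustrative)
    p_hat <- function(x, beta0, beta1) {
      eta <- beta0 + beta1 * x
      exp(eta) / (1 + exp(eta))                  # always in (0, 1)
    }
    C_hat <- function(x, beta0, beta1, tau = 0.5) {
      as.integer(p_hat(x, beta0, beta1) > tau)   # classify as 1 if p(x) > tau
    }
    p_hat(c(0.5, 2, 4), beta0 = 0.55, beta1 = -4.79)
    C_hat(c(0.5, 2, 4), beta0 = 0.55, beta1 = -4.79)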
More vocabulary
The quantity
p(X) / (1 − p(X)) = e^(β0 + β1X)
is called the odds and can take values between 0 and ∞.
Values of the odds close to 0 and ∞ indicate very low and very high probabilities of default.
On average 1 in 5 people with an odds of 1/4 will default, since p(X) = 0.2 implies an odds of 0.2/(1 − 0.2) = 1/4.
Likewise, on average nine out of every ten people with an odds of 9 will default, since p(X) = 0.9 implies an odds of 0.9/(1 − 0.9) = 9.
Estimation: ML
The likelihood function of the logit model is
L(β0, β1) = ∏_{i: yi = 1} p(xi) · ∏_{i: yi = 0} (1 − p(xi)).
(Of course, p also depends on β0 and β1.)
We pick β0 and β1 to maximize the (log-)likelihood.
Maximization is done numerically; use asymptotic standard errors (usually delivered by the software).
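A minimal sketch of this numerical maximization on simulated data (the simulation and variable names are for illustration only; in practice one would simply call glm):

    # Sketch: ML estimation of the logit model by direct numerical optimization
    set.seed(1)
    x <- rnorm(200)
    y <- rbinom(200, 1, plogis(0.5 - 2 * x))       # simulated data, true (beta0, beta1) = (0.5, -2)
    negloglik <- function(beta, x, y) {
      p <- plogis(beta[1] + beta[2] * x)
      -sum(y * log(p) + (1 - y) * log(1 - p))      # minus the log-likelihood
    }
    fit <- optim(c(0, 0), negloglik, x = x, y = y, hessian = TRUE)
    fit$par                                        # ML estimates of beta0, beta1
    sqrt(diag(solve(fit$hessian)))                 # asymptotic standard errors from the Hessian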
Example
Logit regression of Housing type on Location.2

             Coefficient   Std. Error   t-statistic   p-value
Intercept       0.5530       0.2887        1.916       0.0554
Location       -4.7911       0.8456       -5.666      < 10^-4
We see that β̂1 ≈ −5: an increase in distance to public transportation is associated with a decrease in the probability of luxury housing.
Estimate the conditional class probability and classify as 1 (Luxury) if it is larger than 1/2 (say).
2Fitted in R using glm.
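A sketch of how such a fit is obtained with glm; the data frame 'housing' and its column names are assumptions for illustration:

    # Sketch: simple logit fit; 'housing' with columns 'type' and 'Location' is assumed
    fit <- glm(I(type == "Luxury") ~ Location, family = binomial, data = housing)
    summary(fit)$coefficients                  # estimates, standard errors, z-statistics, p-values
    p_hat <- predict(fit, type = "response")   # fitted conditional class probabilities
    y_hat <- as.integer(p_hat > 0.5)           # classify as Luxury if larger than 1/2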
Fitted conditional class probabilities
[Figure: fitted conditional probabilities of luxury housing against distance to public transportation.]
A logit regression on Location, Age and Age² is somewhat better. But an extra characteristic (say housing quality) would help more.
Better classification behavior than linear regression.
Not very precise, though.
Logistic regression with several variables
Suppose now we have not one but p predictors X1, . . . , Xp to predict a binary response Y .
We can generalize the logistic regression with one predictor as follows:
log( p(X) / (1 − p(X)) ) = β0 + β1X1 + ··· + βpXp
and correspondingly
p(X) = e^(β0 + β1X1 + ··· + βpXp) / (1 + e^(β0 + β1X1 + ··· + βpXp)).
Just as before we can estimate the coefficients β0, β1, . . . , βp using the maximum likelihood method.
Real estate again
Multiple logit regression for the housing data
             Coefficient   Std. Error   t-statistic   p-value
Intercept      1.781693      0.388382      4.587      < 10^-4
Location      -3.810012      0.835366     -4.561      < 10^-4
Age           -0.172365      0.047255     -3.648       0.0003
Age²           0.002890      0.001174      2.462       0.0138
The coefficient of Location is smaller in magnitude than in the simple regression case:
due to correlation between the predictors, Location taken alone captures some of the predictive power of Age.
This is the same as in linear regression, btw.
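A sketch of how the fit above could be produced (same assumed data frame and column names as before):

    # Sketch: multiple logit regression with a quadratic Age term
    fit_multi <- glm(I(type == "Luxury") ~ Location + Age + I(Age^2),
                     family = binomial, data = housing)
    summary(fit_multi)$coefficients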
Logistic regression with more than two classes
So far we have discussed logistic regression with two classes. It is easily generalized to more than two classes. One version (used in the R package glmnet) has the symmetric form
P(Y = yk|X) = e^(β0k + β1kX1 + ··· + βpkXp) / Σ_{l=1}^{K} e^(β0l + β1lX1 + ··· + βplXp)
There is a linear function for each class.
(Some cancellation is possible, and only K − 1 linear functions are needed
as in 2-class logistic regression.)
Multiclass logistic regression is also referred to as multinomial regression. (We skip most details here and discuss discriminant analysis instead, which is often quite similar in terms of results.)
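Since glmnet is mentioned above, here is a hedged sketch of its symmetric multinomial fit; the feature matrix and label vector are placeholder names, and lambda = 0 switches the penalty off:

    # Sketch: multinomial logistic regression in the glmnet parametrization
    library(glmnet)
    x <- as.matrix(features)                                      # assumed numeric n x p feature matrix
    fit <- glmnet(x, labels, family = "multinomial", lambda = 0)  # 'labels': factor with K levels
    coef(fit)                                                     # one coefficient vector per class
    predict(fit, newx = x, type = "response")                     # fitted class probabilities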
Example: high school
Entering high school, students choose which program to follow.
The data set (simulated) contains variables on 200 students.
Target: prog, program type (general, academic or vocational). The predictor variables are ses (socioeconomic status, three-level ordinal) and math (score in mathematics, metric).
Program type not independent of features:
ses / prog   general   academic   vocation
low             16         19         12
middle          20         44         31
high             9         42          7

Mean and standard deviation of math by program:

            x̄          sd(x)
general     51.33       9.397775
academic    56.25714    7.943343
vocation    46.76000    9.318754
Estimates and predictions using math score
Estimated coefficients (one linear function per class, as in the parametrization above):

            β̂0k           β̂1k           β̂2k
general      0.5575534    -0.2517512     0.0000000
academic    -3.90541902    0.24871680    0.08052303
vocation     3.34786567    0.00000000   -0.06551137
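One way such estimates might be produced in R is multinom from package nnet (a sketch; the data frame name 'hsb' is a placeholder, and multinom uses a baseline-category parametrization, so its coefficients will not match the symmetric form above exactly):

    # Sketch: multinomial logit of program type on math score ('hsb' data frame assumed)
    library(nnet)
    fit <- multinom(prog ~ math, data = hsb)
    summary(fit)$coefficients              # one intercept and one math slope per non-reference class
    head(predict(fit, type = "probs"))     # fitted probabilities of general/academic/vocation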
[Figure: fitted class probabilities against points in mathematics (40 to 70).]
Discriminant analysis
Why discriminant analysis?
When the classes are well-separated,
the parameter estimates for the logistic regression model are surprisingly unstable.
Linear discriminant analysis does not suffer from this problem.
If n is small and the distribution of the predictors X is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model.
Linear discriminant analysis is popular when we have more than two response classes, because it also provides low-dimensional views of the data.
An old friend
The approach is
1 model the distribution of X in each of the classes separately, and
2 use Bayes’ theorem to obtain P(Y = yk|X).
Using normal (Gaussian) distributions for each class leads to linear or quadratic discriminant analysis.
This approach is quite general, and other distributions can be used as well. We will focus on normal distributions for starters.
Bayes’ theorem for classification
Recall:
P(Y = yk|X = x) = P(X = x|Y = yk) · P(Y = yk) / P(X = x)
One writes this slightly differently for discriminant analysis:
P(Y = yk|X = x) = πk fk(x) / Σ_{l=1}^{K} πl fl(x), where
fk(x) = P(X = x|Y = yk) is the density for X in class yk. Here we will use normal densities for these, separately in each class.
πk = P(Y = yk) is the marginal or prior probability for label yk.3
3Depending on the interpretation.
Classify to the highest density
[Figure: two class densities; left panel with priors π1 = .5, π2 = .5, right panel with π1 = .3, π2 = .7.]
We classify a new point according to which density is highest.
When the priors are different, we take them into account as well, and compare πk fk(x). On the right, we have a prior preference for the pink class, and the decision boundary has shifted to the left.
Linear Discriminant Analysis when p = 1
The Gaussian density has the form
fk(x) = (1 / (√(2π) σk)) e^(−(1/2) ((x − μk)/σk)²)
Here μk is the mean, and σk² the variance (in class yk). We will assume that all the σk = σ are the same.
Bayes delivers a rather ugly expression for pk(x) = P(Y = yk|X = x):
pk(x) = [ πk (1/(√(2π) σ)) e^(−(1/2)((x − μk)/σ)²) ] / [ Σ_{l=1}^{K} πl (1/(√(2π) σ)) e^(−(1/2)((x − μl)/σ)²) ]
Happily, there are simplifications and cancellations.
Discriminant functions
To classify at the value X = x, choose the yk for which pk(x) is largest. This is equivalent to assigning x to the class with the largest discriminant
score,
δk(x) = x · μk/σ² − μk²/(2σ²) + log(πk)
Note that δk(x) is a linear function of x. If there are K = 2 classes and π1 = π2 = 0.5, then one can see that the decision boundary is at
x = (μ1 + μ2)/2.
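A minimal sketch of this rule with the parameters treated as known, using the values of the example on the next slide:

    # Sketch: p = 1 discriminant scores and the implied classification (parameters assumed known)
    delta <- function(x, mu, sigma2, prior) x * mu / sigma2 - mu^2 / (2 * sigma2) + log(prior)
    mu <- c(-1.5, 1.5); sigma2 <- 1; prior <- c(0.5, 0.5)
    x0 <- 0.3
    scores <- delta(x0, mu, sigma2, prior)   # vectorized over the two classes
    which.max(scores)                        # assigns x0 to class 2; boundary at (mu[1] + mu[2])/2 = 0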
Example: μ1 = −1.5, μ2 = 1.5, π1 = π2 = 0.5, and σ² = 1
Left: the dashed vertical line represents the Bayes decision boundary.
Right: 20 observations were drawn from each of the two classes and are shown as histograms. The solid vertical line represents the LDA decision boundary estimated from the training data.
Estimating the parameters
Typically we don’t know these parameters; we just have the training data. In that case we simply estimate the parameters and plug them in:
π̂k = nk/n,    μ̂k = (1/nk) Σ_{i: Yi = yk} Xi,
σ̂² = (1/(n − K)) Σ_{k=1}^{K} Σ_{i: Yi = yk} (Xi − μ̂k)² = Σ_{k=1}^{K} ((nk − 1)/(n − K)) · σ̂k²,
where σ̂k² = (1/(nk − 1)) Σ_{i: Yi = yk} (Xi − μ̂k)² is the usual formula for the estimated variance in class yk.
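A sketch of these plug-in estimates; the vector x (numeric feature) and factor g (class labels) are assumed names:

    # Sketch: plug-in estimates for LDA with p = 1
    n <- length(x); K <- nlevels(g)
    pi_hat     <- table(g) / n                      # class proportions
    mu_hat     <- tapply(x, g, mean)                # class means
    ss_within  <- tapply(x, g, function(v) sum((v - mean(v))^2))
    sigma2_hat <- sum(ss_within) / (n - K)          # pooled within-class variance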
Linear Discriminant Analysis when p > 1
Density: f(x) = (1 / ((2π)^(p/2) |Σ|^(1/2))) e^(−(1/2)(x − μ)′ Σ⁻¹ (x − μ))
Discriminant function: δk(x) = x′ Σ⁻¹ μk − (1/2) μk′ Σ⁻¹ μk + log πk
Despite its complex form, the latter is linear:
δk(x) = ck0 + ck1 x1 + ck2 x2 + ··· + ckp xp
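As a small illustration of the formula (μk, Σ and πk are assumed given; the numbers are made up):

    # Sketch: the multivariate linear discriminant score delta_k(x)
    delta_k <- function(x, mu_k, Sigma, pi_k) {
      Sinv <- solve(Sigma)
      drop(x %*% Sinv %*% mu_k - 0.5 * t(mu_k) %*% Sinv %*% mu_k) + log(pi_k)
    }
    Sigma <- matrix(c(1, 0.3, 0.3, 1), 2, 2)   # common covariance matrix (illustrative)
    delta_k(c(0.5, -1), mu_k = c(1, 0), Sigma = Sigma, pi_k = 1/3)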
Illustration: p = 2 and K = 3 classes
[Figure: two panels in the (X1, X2) plane showing three classes.]
Here π1 = π2 = π3 = 1/3.
The dashed lines are the Bayes decision boundaries. Were they known, they would yield the fewest misclassification errors among all possible classifiers. Solid lines are the corresponding estimates.
From discriminant scores to probabilities
Once we have estimates δˆk(x), we can turn these into estimates for class probabilities:
P(Y = yk|X = x) = e^(δ̂k(x)) / Σ_{l=1}^{K} e^(δ̂l(x))
(Strictly speaking, this is only valid for the Gaussian case we are dealing with.)
So classifying to the largest δˆk(x) amounts to classifying to the class for which P(Y = yk|X = x) is largest (not surprisingly).
When K = 2, classify to class 2 if P(Y = 2|X = x) ≥ 0.5, else to class 1.
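A one-line sketch of this conversion (the δ̂-values below are illustrative):

    # Sketch: from discriminant scores to class probabilities (softmax)
    softmax <- function(d) exp(d - max(d)) / sum(exp(d - max(d)))   # subtract max for numerical stability
    softmax(c(-1.2, 0.4, 0.1))                                      # illustrative scores for K = 3 classes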
Example: Fisher’s Iris Data
4 variables
3 species
50 samples/class:
• Setosa
• Versicolor
• Virginica
LDA classifies all but 3 of the 150 training samples correctly.
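One way to reproduce this fit is lda from the MASS package with the iris data built into R:

    # LDA on Fisher's iris data
    library(MASS)
    fit <- lda(Species ~ ., data = iris)
    table(predicted = predict(fit)$class, true = iris$Species)   # confusion matrix: 3 training errors
    plot(fit)                                                    # the two discriminant variables (next slide)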
[Figure: scatterplot matrix of Sepal.Length, Sepal.Width, Petal.Length and Petal.Width by species.]
Fisher’s discriminant plot
[Figure: the data plotted against Discriminant Variable 1 (x-axis) and Discriminant Variable 2 (y-axis).]
When there are K classes, linear discriminant analysis can be viewed exactly in a K − 1 dimensional plot.
Why? Because it essentially classifies to the closest centroid, and the centroids span a K − 1 dimensional plane.
Even when K > 3, we can find the “best” 2-dimensional plane for visualizing the discriminant rule.
Other forms of discriminant analysis
P(Y = yk|X = x) = πk fk(x) / Σ_{l=1}^{K} πl fl(x)
When fk(x) are Gaussian densities, with the same covariance matrix Σ in each class, this leads to linear discriminant analysis. By altering the forms for fk(x), we get different classifiers.
With Gaussian fk, but different Σk in each class, we get quadratic discriminant analysis.
With fk(x) = ∏_{j=1}^{p} fjk(xj) (conditional independence model) in each class we get naïve Bayes. (For Gaussians this means the Σk are diagonal.)
Many other forms, by proposing specific density models for fk(x), including nonparametric approaches.
Quadratic Discriminant Analysis
[Figure: two panels in the (X1, X2) plane with quadratic decision boundaries.]
δk(x) = −(1/2)(x − μk)′ Σk⁻¹ (x − μk) + log πk
Because the Σk are different, the quadratic terms don’t cancel out.
This works nicely — almost every time.
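A hedged sketch using qda from MASS; the iris data are used purely for illustration, not because they appear in the figure above:

    # Sketch: QDA with class-specific covariance matrices
    library(MASS)
    qfit <- qda(Species ~ ., data = iris)
    table(predicted = predict(qfit)$class, true = iris$Species)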
Regularized Discriminant Analysis
Sometimes, a common Σ is obviously wrong,
… but there are too many classes or p is too large to estimate each Σk.
One idea would be to take a weighted average of the restricted and the unrestricted estimators,
Σ̂k(α) = α Σ̂k + (1 − α) Σ̂.
This induces bias if covariance matrices are heterogeneous,
… but reduces variance.
(We’ll meet regularization again later.)
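A minimal sketch of the shrunken covariance estimate, assuming the unpooled and pooled estimates have already been computed:

    # Sketch: regularized class covariance, a weighted average of unpooled and pooled estimates
    regularize_cov <- function(Sigma_k_hat, Sigma_hat, alpha) {
      stopifnot(alpha >= 0, alpha <= 1)
      alpha * Sigma_k_hat + (1 - alpha) * Sigma_hat   # alpha = 1: QDA-type, alpha = 0: LDA-type
    }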
In praise of naïve Bayes
When p is large, multivariate methods like QDA and even LDA break down.
Need to simplify:
Gaussian naïve Bayes assumes each Σk is diagonal:
δk(x) ∝ log( πk ∏_{j=1}^{p} fkj(xj) ) = −(1/2) Σ_{j=1}^{p} (xj − μkj)²/σkj² + log πk
Naïve Bayes can be used for mixed feature vectors (qualitative and quantitative). If Xj is qualitative, replace fkj(xj) with a probability mass function (histogram) over the discrete categories.
Despite strong assumptions, naïve Bayes often produces good classification results.
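A sketch with naiveBayes from package e1071 (one option among several; iris is again used only as an illustration):

    # Sketch: Gaussian naive Bayes, i.e. diagonal Sigma_k within each class
    library(e1071)
    nb <- naiveBayes(Species ~ ., data = iris)
    predict(nb, newdata = head(iris), type = "raw")   # posterior class probabilities
    predict(nb, newdata = head(iris))                 # predicted classes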
Logistic regression versus LDA
For a two-class problem, one can show that for LDA
log( p1(x) / (1 − p1(x)) ) = log( p1(x) / p2(x) ) = c0 + c1x1 + ··· + cpxp
So it has the same form as logistic regression. The difference is in how the parameters are estimated.
Logistic regression uses the conditional likelihood based on P(Y |X) (known as discriminative learning).
LDA uses the full likelihood based on P(X, Y ) (known as generative learning).
Despite these differences, in practice the results are often very similar.
Footnote: logistic regression can also fit quadratic boundaries like QDA, by explicitly including quadratic terms in the model.
Up next
Coming up
Error estimation and model selection