Lecture 3: Classification
Can Yang
Department of Mathematics
The Hong Kong University of Science and Technology
Most of the materials here are from Chapter 4 of An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.
March 4, 2020
1
1 4.1. Overview
2 4.2. Relation with regression
3 4.3. Logistic Regression.
4 4.4. Linear Discriminant Analysis
5 4.4. Quadratic Discriminant Analysis
6 4.5. Comparison of Classification Methods.
2
• A special form of supervised learning.
• Categorical or qualitative responses: Yes/No; High/Medium/Low; …
• Main task: predict the class of the subject based on the inputs.
• This type of task is called classification.
About Classification
3
• Widespread applications.
• This chapter will cover logistic regression, linear discriminant analysis, and KNN.
• More advanced methods will be introduced in later chapters, including generalized additive models (GAMs), trees, random forests and boosting, and SVMs.
4
• Email spam detector
• Diagnose a person with a set of symptoms as a virus carrier or non-carrier.
• Identify which gene, out of a million genes, is disease-causing or not.
• Judge if a trading activity is a fraud or not.
Examples
5
Examples: The default data
• Simulated data: 10000 individuals.
• Two inputs: income and balance (monthly).
• One output: default (Yes or No).
6
Figure: FIGURE 4.1. The Default data set. Left: The annual incomes and monthly credit card balances of a number of individuals. The individuals who defaulted on their credit card payments are shown in orange, and those who did not are shown in blue. Center: Boxplots of balance as a function of default status. Right: Boxplots of income as a function of default status.
7
Examples: The default data
• Strong relation between balance and default.
• Weaker relation between income and default.
8
Why not regression
• Coding a qualitative response as numerical values, such as 0, 1, 2, …, is generally inappropriate.
• Classes could be Asian/European/African, handwritten digits (MNIST data), or gender (male/female) identified from a face image.
• In some cases, response classes are ordered, such as severe/moderate/mild or tall/medium/short.
• Coding these as 0, 1, 2 and applying regression can still be inappropriate, because it implies that the differences between adjacent classes are equal.
9
The special case of binary response
• Consider a binary output: two classes.
• Coding the response as 0 and 1 and applying linear regression produces the same result as linear discriminant analysis.
• This is not the case for outputs with more than two classes.
10
Example: The default data
Figure: Left: linear regression; Right: logistic regression. Both panels plot the estimated probability of default against balance.
11
The setup for binary output
• The training data: (x_i, y_i), i = 1, …, n.
• y_i = 1 for class 1 and y_i = 0 for class 0.
• x_i = (1, x_{i1}, …, x_{ip})^T is a (p+1)-vector containing the p actual inputs.
• If we instead consider the linear regression model
  y_i = β^T x_i + ε_i,
  β can be estimated by least squares, and β̂^T x_i is the predictor of y_i.
• Key idea: we should focus on predicting the probabilities of the classes.
• Using P(y = 1|x) = β^T x is not appropriate, since β^T x is not constrained to lie in [0, 1].
12
The logistic regression model
• Assume
  P(y = 1|x) = 1 / (1 + exp(−β^T x)).
• As a result,
  P(y = 0|x) = 1 − P(y = 1|x) = 1 / (1 + exp(β^T x)),
  and
  log[ P(y = 1|x) / P(y = 0|x) ] = β^T x.
  This is called the log-odds or logit, and P(y = 1|x) / P(y = 0|x) is called the odds.
• Interpretation: a one-unit increase in variable x_j increases the log-odds of class 1 by β_j.
13
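As a quick numerical illustration of the model above, here is a minimal Python/NumPy sketch (not from the slides; the coefficient values are arbitrary placeholders) that evaluates the sigmoid, the two class probabilities, and checks that the log-odds equal β^T x:

```python
import numpy as np

def sigmoid(t):
    """Logistic sigmoid: P(y = 1 | x) when t = beta^T x."""
    return 1.0 / (1.0 + np.exp(-t))

# Hypothetical coefficients beta = (intercept, slope) and one input x = (1, x1).
beta = np.array([-1.0, 0.5])
x = np.array([1.0, 3.0])          # leading 1 is the constant term

t = beta @ x                       # linear predictor beta^T x
p1 = sigmoid(t)                    # P(y = 1 | x)
p0 = 1.0 - p1                      # P(y = 0 | x)

print(p1, p0)
print(np.log(p1 / p0), t)          # the log-odds reproduce beta^T x
```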
The maximum likelihood estimation
• Recall that the likelihood is the joint probability function or joint density function of the data.
• Here, we have independent observations (x_i, y_i), i = 1, …, n, each following the (conditional) distribution
  P(y_i = 1|x_i) = 1 / (1 + exp(−β^T x_i)) = 1 − P(y_i = 0|x_i).
• So, the joint probability function is
  ∏_{i: y_i = 1} P(y_i = 1|x_i) × ∏_{i: y_i = 0} P(y_i = 0|x_i),
  which can be conveniently written as
  ∏_{i=1}^{n} exp(y_i β^T x_i) / (1 + exp(β^T x_i)).
14
The likelihood and log-likelihood
• The likelihood function is the same as the joint probability function, but viewed as a function of β.
• The log-likelihood function is
  ℓ(β) = ∑_{i=1}^{n} [ y_i β^T x_i − log(1 + exp(β^T x_i)) ].
• The maximizer is denoted as β̂, which is the MLE of β under the logistic model.
• With p_i = P(y_i = 1|x_i), the gradient and Hessian are
  ∇ℓ = ∑_{i=1}^{n} (y_i − p_i) x_i = X^T (y − p),
  H = −∑_{i=1}^{n} p_i (1 − p_i) x_i x_i^T = −X^T W X,
  where X is the n × (p+1) design matrix with rows x_i^T and W = diag(p_1(1 − p_1), …, p_n(1 − p_n)).
15
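A minimal NumPy sketch of these quantities (a sketch only, assuming a design matrix X whose rows are the x_i above, with a leading column of ones):

```python
import numpy as np

def logistic_quantities(beta, X, y):
    """Log-likelihood, gradient and Hessian of the logistic regression model.

    X : (n, p+1) design matrix with a leading column of ones
    y : (n,) vector of 0/1 labels
    """
    eta = X @ beta                         # linear predictors beta^T x_i
    p = 1.0 / (1.0 + np.exp(-eta))         # p_i = P(y_i = 1 | x_i)
    loglik = np.sum(y * eta - np.log1p(np.exp(eta)))
    grad = X.T @ (y - p)                   # gradient: X^T (y - p)
    W = np.diag(p * (1.0 - p))             # W = diag(p_i (1 - p_i))
    hess = -X.T @ W @ X                    # Hessian: -X^T W X
    return loglik, grad, hess
```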
The Newton-Raphson Iterative Algorithm
• The Newton step is
  β_new = β_old − H^{-1} ∇ℓ
        = β_old + (X^T W X)^{-1} X^T (y − p)
        = (X^T W X)^{-1} [ X^T W X β_old + X^T (y − p) ]
        = (X^T W X)^{-1} X^T W z,
  where z = X β_old + W^{-1}(y − p).
• This is the method of iteratively reweighted least squares (IRLS): each Newton step solves a weighted least squares problem with working response z.
16
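Putting the pieces together, here is a minimal IRLS sketch in Python/NumPy; this is only an illustration of the update above, and the synthetic data at the end are hypothetical, used purely as a sanity check:

```python
import numpy as np

def fit_logistic_newton(X, y, n_iter=25, tol=1e-8):
    """Fit logistic regression by Newton-Raphson (equivalently IRLS).

    X : (n, p+1) design matrix with a leading column of ones
    y : (n,) vector of 0/1 labels
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))      # current fitted probabilities
        W = np.diag(p * (1.0 - p))                  # weights p_i (1 - p_i)
        z = X @ beta + (y - p) / (p * (1.0 - p))    # working response z
        beta_new = np.linalg.solve(X.T @ W @ X, X.T @ W @ z)
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    return beta

# Tiny synthetic check (hypothetical data, not the Default data):
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = (rng.uniform(size=200) < 1.0 / (1.0 + np.exp(-(-1.0 + 2.0 * X[:, 1])))).astype(float)
print(fit_logistic_newton(X, y))   # should be roughly (-1, 2)
```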
Example: The default data with single input balance.
TABLE 4.1. (from ISLR) For the Default data, estimated coefficients of the logistic regression model that predicts the probability of default using balance. A one-unit increase in balance is associated with an increase in the log odds of default by 0.0055 units.
              Coefficient   Std. error   Z-statistic   p-value
Intercept        −10.6513       0.3612         −29.5    < 0.0001
Balance            0.0055       0.0002          24.9    < 0.0001
17
Example: the default data with single input student
TABLE 4.2. For the Default data, estimated coefficients of the logistic regression model that predicts the probability of default using student status. Student status is encoded as a dummy variable, with a value of 1 for a student and a value of 0 for a non-student, and represented by the variable student[Yes] in the table.
              Coefficient   Std. error   Z-statistic   p-value
Intercept         −3.5041       0.0707        −49.55    < 0.0001
Student[Yes]       0.4049       0.1150          3.52      0.0004
18
The prediction
• For the model with balance as input, the predicted default probability for an individual with a balance of $1,000 is
  P̂(default = Yes | balance = 1000) = 1 / (1 + exp(−(−10.6513 + 0.0055 × 1000))) = 0.00576.
• For the model with student as input, the predicted default probability for a student is
  P̂(default = Yes | student = Yes) = 1 / (1 + exp(−(−3.5041 + 0.4049 × 1))) = 0.0431.
19
The prediction
• For the model with student as input, the predicted default probability for a non-student is
  P̂(default = Yes | student = No) = 1 / (1 + exp(−(−3.5041 + 0.4049 × 0))) = 0.0292.
• Students have higher default probabilities than non-students.
20
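The predicted probabilities above can be reproduced directly from the fitted coefficients in Tables 4.1 and 4.2; a small sketch:

```python
import numpy as np

def predict_prob(intercept, coef, x):
    """P(default = Yes | x) under a fitted single-input logistic regression model."""
    return 1.0 / (1.0 + np.exp(-(intercept + coef * x)))

print(predict_prob(-10.6513, 0.0055, 1000))   # balance = 1000 -> about 0.00576
print(predict_prob(-3.5041, 0.4049, 1))       # student = Yes  -> about 0.0431
print(predict_prob(-3.5041, 0.4049, 0))       # student = No   -> about 0.0292
```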
Example: the default data with all three inputs: balance, income and student
TABLE 4.3. For the Default data, estimated coefficients of the logistic regression model that predicts the probability of default using balance, income, and student status. Student status is encoded as a dummy variable student[Yes], with a value of 1 for a student and a value of 0 for a non-student. In fitting this model, income was measured in thousands of dollars.
              Coefficient   Std. error   Z-statistic   p-value
Intercept        −10.8690       0.4923        −22.08    < 0.0001
Balance            0.0057       0.0002         24.74    < 0.0001
Income             0.0030       0.0082          0.37      0.7115
Student[Yes]      −0.6468       0.2362         −2.74      0.0062
21
Paradox and explanation
• Table 4.2 (single input) indicates that students have a higher chance of default; but Table 4.3 (multiple inputs) indicates that student status is negatively related to the chance of default.
• Without considering other factors, students have higher chance of default.
• Considering the balance and income as additional factors, students have lower chance of default compared with non-students who hold the same balance and income.
• Generally, students hold higher balances (borrow more) and consequently have an overall higher chance of default. In other words, balance and student status are two positively correlated variables.
22
Explanation
Figure 4.3. Confounding in the Default data. Left: Default rates are shown for students (orange) and non-students (blue). The solid lines display default rate as a function of balance, while the horizontal broken lines display the overall default rates. Right: Boxplots of balance for students (orange) and non-students (blue) are shown.
23
A geometric explanation
24
Remarks
• The function 1/(1 + exp(−x)) is called the logistic sigmoid function.
• It is the cdf of the logistic distribution.
• One can use other cumulative distribution functions in place of the sigmoid function.
• The probit model uses Φ(x), where Φ is the cdf of standard normal distribution.
• For multiple classes (more than 2 classes), the logistic regression model can be adapted by using the softmax function.
25
The Bayes Theorem
• Suppose there are K classes, denoted 1, 2, ..., K, for the output Y.
• X is the p-dimensional input. Both Y and X are random variables.
• Let π_k = P(Y = k).
• Let f_k(x) = f(x|Y = k) be the conditional density function of X given Y = k.
• Then, Bayes theorem implies
  p_k(x) = P(Y = k|X = x) = π_k f_k(x) / ∑_{j=1}^{K} π_j f_j(x).
• General Bayes theorem:
  P(A_k|B) = P(B|A_k) P(A_k) / ∑_{j=1}^{K} P(B|A_j) P(A_j)
  for disjoint sets A_1, ..., A_K whose union has probability 1.
• We classify a subject with input x into class k if its p_k(x) is the largest, for k = 1, ..., K.
26
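A tiny numerical sketch of the posterior computation above, with made-up priors π_k and density values f_k(x) at a single point x (all numbers are placeholders):

```python
import numpy as np

# Hypothetical priors pi_k and density values f_k(x) at one input x (K = 3 classes).
pi = np.array([0.3, 0.5, 0.2])
f_x = np.array([0.10, 0.05, 0.40])            # f_k(x), k = 1, 2, 3

posterior = pi * f_x / np.sum(pi * f_x)       # p_k(x) = pi_k f_k(x) / sum_j pi_j f_j(x)
print(posterior)                              # posterior probabilities sum to 1
print(np.argmax(posterior) + 1)               # Bayes classifier: class with largest p_k(x)
```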
Model assumptions of LDA
X is p-dimensional. Y ∈ {1, ..., K}, K classes in total. Assume, for k = 1, ..., K,
X|Y = k ∼ N(μ_k, Σ),
where μ_k is a p-vector and Σ is a p × p covariance matrix, i.e.,
f_k(x) = (2π)^{−p/2} |Σ|^{−1/2} exp[ −(1/2)(x − μ_k)^T Σ^{-1} (x − μ_k) ].
Note that we assumed the same Σ for all classes k = 1, ..., K .
27
Computing p_k(x) for LDA

p_k(x) = π_k exp[ −(1/2)(x − μ_k)^T Σ^{-1}(x − μ_k) ] / ∑_{l=1}^{K} π_l exp[ −(1/2)(x − μ_l)^T Σ^{-1}(x − μ_l) ]

Comparing p_k(x) across k is the same as comparing the numerators. Note that the factor involving x^T Σ^{-1} x is common to all numerators of p_k(x).
Set
δ_k(x) = μ_k^T Σ^{-1} x − (1/2) μ_k^T Σ^{-1} μ_k + log π_k.
Then a subject with input x is classified into class k if δ_k(x) is the largest.
The classifier based on δ_k is the Bayes classifier, assuming μ_k, Σ and π_k are known.
A practical problem: Σ, μ_j and π_j, j = 1, ..., K, may be unknown.
28
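A short sketch evaluating δ_k(x) when μ_k, Σ and π_k are taken as known; the parameter values below are arbitrary placeholders, not estimates from any data set:

```python
import numpy as np

def lda_discriminants(x, mus, Sigma, pis):
    """delta_k(x) = mu_k^T Sigma^{-1} x - 0.5 mu_k^T Sigma^{-1} mu_k + log pi_k."""
    Sigma_inv = np.linalg.inv(Sigma)
    deltas = []
    for mu, pi in zip(mus, pis):
        deltas.append(mu @ Sigma_inv @ x - 0.5 * mu @ Sigma_inv @ mu + np.log(pi))
    return np.array(deltas)

# Hypothetical two-class example in p = 2 dimensions.
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
pis = [0.6, 0.4]

x = np.array([1.5, 0.5])
deltas = lda_discriminants(x, mus, Sigma, pis)
print(deltas, "-> class", np.argmax(deltas) + 1)   # classify into the largest delta_k
```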
Solution: use the sample analogue
The sample analogue of δk is
δ̂_k(x) = μ̂_k^T Σ̂^{-1} x − (1/2) μ̂_k^T Σ̂^{-1} μ̂_k + log π̂_k,
where μ̂_k, Σ̂ and π̂_k are the sample mean of class k, the pooled sample covariance, and the sample proportion of class k in the data. Specifically, based on data (x_i, y_i), i = 1, ..., n,
μ̂_k = (1/n_k) ∑_{i: y_i = k} x_i,
Σ̂ = 1/(n − K) ∑_{k=1}^{K} ∑_{i: y_i = k} (x_i − μ̂_k)(x_i − μ̂_k)^T,
where n_k is the number of subjects in class k in the data, and
π̂_k = n_k / n.
29
In summary, we classify a subject with input x into class k, if δˆk is the largest.
δ̂_k(x) is called the discriminant function.
Here δ̂_k(x) is a linear function of x, which is why the method is called LDA. The region of values of x classified into a given class has a linear boundary, since δ_k(x) = δ_l(x) defines a hyperplane in x.
30
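A minimal from-scratch LDA sketch using the sample analogues above (plug-in estimates μ̂_k, pooled Σ̂ and π̂_k); the synthetic data are hypothetical and this is not the implementation used for the results that follow:

```python
import numpy as np

def lda_fit(X, y):
    """Estimate mu_k, pooled Sigma and pi_k from training data (y holds class labels)."""
    classes = np.unique(y)
    n, p = X.shape
    mus, pis = {}, {}
    Sigma = np.zeros((p, p))
    for k in classes:
        Xk = X[y == k]
        mus[k] = Xk.mean(axis=0)
        pis[k] = Xk.shape[0] / n
        Sigma += (Xk - mus[k]).T @ (Xk - mus[k])
    Sigma /= (n - len(classes))                    # pooled covariance, divided by n - K
    return classes, mus, Sigma, pis

def lda_predict(X, classes, mus, Sigma, pis):
    """Assign each row of X to the class with the largest discriminant function."""
    Sigma_inv = np.linalg.inv(Sigma)
    deltas = np.column_stack([
        X @ Sigma_inv @ mus[k] - 0.5 * mus[k] @ Sigma_inv @ mus[k] + np.log(pis[k])
        for k in classes
    ])
    return classes[np.argmax(deltas, axis=1)]

# Hypothetical usage on synthetic two-class data:
rng = np.random.default_rng(1)
X0 = rng.multivariate_normal([0, 0], np.eye(2), size=100)
X1 = rng.multivariate_normal([2, 2], np.eye(2), size=100)
X = np.vstack([X0, X1]); y = np.array([0] * 100 + [1] * 100)
classes, mus, Sigma, pis = lda_fit(X, y)
print(np.mean(lda_predict(X, classes, mus, Sigma, pis) == y))   # training accuracy
```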
Known normal distributions
FIGURE 4.4. Left: Two one-dimensional normal density functions are shown. The dashed vertical line represents the Bayes decision boundary. Right: 20 observations were drawn from each of the two classes, and are shown as histograms. The Bayes decision boundary is again shown as a dashed vertical line. The solid vertical line represents the LDA decision boundary estimated from the training data.
31
32
Known normal distributions
FIGURE 4.5. Two multivariate Gaussian density functions are shown, with p = 2. Left: The two predictors are uncorrelated. Right: The two variables have a correlation of 0.7.
FIGURE 4.6. An example with three classes. The observations from each class are drawn from a multivariate Gaussian distribution with p = 2, with a class-specific mean vector and a common covariance matrix. Left: Ellipses that contain 95% of the probability for each of the three classes are shown. The dashed lines are the Bayes decision boundaries. Right: 20 observations were generated from each class, and the corresponding LDA decision boundaries are indicated using solid black lines. The Bayes decision boundaries are once again shown as dashed lines.
The graph of LDA
33
34
The graph of LDA
Example: the default data
Fitting the LDA model to the 10,000 training samples gives a training error rate of 2.75%.
That is, 275 samples are misclassified.
But only 3.33% of the training samples actually default, meaning that a naive classifier that classifies every sample as non-default has only a 3.33% error rate.
The detailed result is shown in the confusion matrix.
35
The Confusion matrix.
TABLE 4.4. A confusion matrix compares the LDA predictions to the true default statuses for the 10, 000 training observations in the Default data set. Elements on the diagonal of the matrix represent individuals whose default statuses were correctly predicted, while off-diagonal elements represent individuals that were misclassified. LDA made incorrect predictions for 23 individuals who did not default and for 252 individuals who did default.
                              True default status
Predicted default status        No      Yes    Total
No                            9644      252     9896
Yes                             23       81      104
Total                         9667      333    10000
36
Class-specific performance
• Overall error rate: (252 + 23)/10000 = 2.75%.
• Error rate among people who default: 252/333 = 75.7%.
• Sensitivity: 1 − 75.7% = 24.3%.
• Error rate among people who do not default: 23/9667 = 0.24%.
• Specificity: 1 − 0.24% = 99.8%.
37
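These rates follow directly from the confusion matrix; a small sketch reproducing them from the entries of Table 4.4:

```python
# Entries of Table 4.4 (LDA with threshold 0.5): rows = predicted, columns = true status.
TN, FN = 9644, 252     # predicted No : truly No, truly Yes
FP, TP = 23, 81        # predicted Yes: truly No, truly Yes

total = TN + FN + FP + TP
print("overall error rate:", (FN + FP) / total)   # (252 + 23) / 10000 = 2.75%
print("sensitivity:", TP / (TP + FN))             # 81 / 333   = 24.3%
print("specificity:", TN / (TN + FP))             # 9644 / 9667 = 99.8%
```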
• The Bayes classifier classifies a subject into class k, if the posterior probability pk(x) is the largest.
• Here there are two classes (Yes/No), i.e., K = 2. The Bayes classifier assigns a subject to the default class if
  Pr(default = Yes|X = x) > 0.5.
• A modification is to assign a subject to the default class if
  Pr(default = Yes|X = x) > 0.2.
A modification
38
The classification result of the modification
TABLE 4.5. A confusion matrix compares the LDA predictions to the true default statuses for the 10, 000 training observations in the Default data set, using a modified threshold value that predicts default for any individuals whose posterior default probability exceeds 20 %.
                              True default status
Predicted default status        No      Yes    Total
No                            9432      138     9570
Yes                            235      195      430
Total                         9667      333    10000
39
Class-specific performance
• Overall error rate: (235 + 138)/10000 = 3.73% (increased).
• Error rate among people who default: 138/333 = 41.4% (lower after the modification).
• Sensitivity: 1 − 41.4% = 58.6%.
• Error rate among people who do not default: 235/9667 = 2.43% (increased).
• Specificity: 1 − 2.43% = 97.57%.
• Identification of defaulters (sensitivity) is more important to a credit card company!
• This modification may therefore be helpful to the company: a trade-off of specificity for sensitivity.
40
The tradeoff
FIGURE 4.7. For the Default data set, error rates are shown as a function of the threshold value for the posterior probability that is used to perform the assignment. The black solid line displays the overall error rate. The blue dashed line represents the fraction of defaulting customers that are incorrectly classified, and the orange dotted line indicates the fraction of errors among the non-defaulting customers.
41
The ROC curve (receiver operating characteristic curve)
FIGURE 4.8. A ROC curve for the LDA classifier on the Default data. It traces out two types of error as we vary the threshold value for the posterior probability of default. The actual thresholds are not shown. The true positive rate is the sensitivity: the fraction of defaulters that are correctly identified, using a given threshold value. The false positive rate is 1 − specificity: the fraction of non-defaulters that we classify incorrectly as defaulters, using that same threshold value. The ideal ROC curve hugs the top left corner, indicating a high true positive rate and a low false positive rate. The dotted line represents the no information classifier; this is what we would expect if student status and credit card balance are not associated with probability of default.
42
43
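The ROC curve is traced out by sweeping the threshold on the posterior probability of default; a minimal sketch, assuming the posterior probabilities come from some fitted classifier (the probabilities and labels below are synthetic placeholders):

```python
import numpy as np

def roc_points(post_prob, y_true, thresholds):
    """True and false positive rates as the classification threshold varies."""
    tprs, fprs = [], []
    for t in thresholds:
        pred = (post_prob > t).astype(int)
        tp = np.sum((pred == 1) & (y_true == 1))
        fp = np.sum((pred == 1) & (y_true == 0))
        tprs.append(tp / np.sum(y_true == 1))      # sensitivity
        fprs.append(fp / np.sum(y_true == 0))      # 1 - specificity
    return np.array(fprs), np.array(tprs)

# Hypothetical posterior probabilities and labels, only for illustration.
rng = np.random.default_rng(2)
y_true = rng.integers(0, 2, size=1000)
post_prob = np.clip(0.5 * y_true + 0.7 * rng.uniform(size=1000), 0, 1)
fpr, tpr = roc_points(post_prob, y_true, np.linspace(0, 1, 101))
print(fpr[:5], tpr[:5])
```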
General confusion matrix

TABLE 4.6. Possible results when applying a classifier or diagnostic test to a population.

                              True class
Predicted class        − or Null           + or Non-null       Total
− or Null              True Neg. (TN)      False Neg. (FN)     N*
+ or Non-null          False Pos. (FP)     True Pos. (TP)      P*
Total                  N                   P
44
Clarifying the terminology

TABLE 4.7. Important measures for classification and diagnostic testing, derived from quantities in Table 4.6.

Name                Definition    Synonyms
False Pos. rate     FP/N          Type I error, 1 − specificity
True Pos. rate      TP/P          1 − Type II error, power, sensitivity, recall
Pos. Pred. value    TP/P*         Precision, 1 − false discovery rate
Neg. Pred. value    TN/N*
45
The model assumption of QDA
• Recall that LDA assumes class-specific normal distributions with class means μ_k and a common covariance matrix Σ.
• QDA assumes class-specific normal distributions with class means μ_k and class-specific covariance matrices Σ_k.
• The rest of the derivation follows an analogous line.
46
The discrimination function
• The discriminant function:
  δ_k(x) = −(1/2)(x − μ_k)^T Σ_k^{-1}(x − μ_k) − (1/2) log|Σ_k| + log π_k
         = −(1/2) x^T Σ_k^{-1} x + x^T Σ_k^{-1} μ_k − (1/2) μ_k^T Σ_k^{-1} μ_k − (1/2) log|Σ_k| + log π_k.
  This is a quadratic function of x.
• In actual implementation, we need to use the class-specific sample means and class-specific sample covariance matrices to estimate μ_k and Σ_k.
• QDA is more complex than LDA: there are more parameters to estimate.
• If the class covariance matrices are equal or close, LDA is better; otherwise, QDA is better. Is this true?
47
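A short sketch of the QDA discriminant function, with assumed (placeholder) class-specific parameters rather than estimates:

```python
import numpy as np

def qda_discriminant(x, mu, Sigma, pi):
    """delta_k(x) for QDA with class-specific covariance Sigma_k."""
    Sigma_inv = np.linalg.inv(Sigma)
    d = x - mu
    return (-0.5 * d @ Sigma_inv @ d
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(pi))

# Hypothetical two-class example with different covariance matrices.
x = np.array([1.0, 0.5])
params = [
    (np.array([0.0, 0.0]), np.array([[1.0, 0.5], [0.5, 1.0]]), 0.5),
    (np.array([2.0, 1.0]), np.array([[1.0, -0.5], [-0.5, 1.0]]), 0.5),
]
deltas = [qda_discriminant(x, mu, S, pi) for mu, S, pi in params]
print(deltas, "-> class", int(np.argmax(deltas)) + 1)
```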
48
Comparing LDA and QDA
FIGURE 4.9. Left: The Bayes (purple dashed), LDA (black dotted), and QDA (green solid) decision boundaries for a two-class problem with
Σ1 = Σ2. The shading indicates the QDA decision rule. Since the Bayes decision boundary is linear, it is more accurately approximated by LDA than by QDA. Right: Details are as given in the left-hand panel, except that Σ1 ̸= Σ2. Since the Bayes decision boundary is non-linear, it is more accurately approximated by QDA than by LDA.
The four classification methods.
• Consider two classes k = 1, 2 for simplicity.
• logistic regression: log(p2(x)/p1(x)) = linear function of x
• LDA: log(p2(x)/p1(x)) = linear function of x
• The models are the same, but the methods of estimating the linear function are different, and produce different results.
• QDA: assume the log-odds are a quadratic function of x.
• KNN.
• Comparing their performance using simulation.
49
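A hedged simulation sketch in the spirit of the comparisons that follow, assuming scikit-learn is available; the data-generating setup is a simple two-class Gaussian scenario chosen for illustration, not one of the six scenarios from the book:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)

def make_data(n_per_class):
    """Two Gaussian classes with a common covariance (an LDA-friendly scenario)."""
    X0 = rng.multivariate_normal([0, 0], np.eye(2), size=n_per_class)
    X1 = rng.multivariate_normal([1.5, 1.5], np.eye(2), size=n_per_class)
    X = np.vstack([X0, X1])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y

X_train, y_train = make_data(20)      # small training set, as in the scenarios
X_test, y_test = make_data(1000)      # large test set for the error rate

models = {
    "KNN-1": KNeighborsClassifier(n_neighbors=1),
    "LDA": LinearDiscriminantAnalysis(),
    "Logistic": LogisticRegression(),
    "QDA": QuadraticDiscriminantAnalysis(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    err = np.mean(model.predict(X_test) != y_test)   # test error rate
    print(f"{name}: {err:.3f}")
```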
Comparisons through simulation
FIGURE 4.10. Boxplots of the test error rates for each of the linear scenarios described in the main text.
Panels: SCENARIO 1, SCENARIO 2, SCENARIO 3. Methods compared in each panel: KNN-1, KNN-CV, LDA, Logistic, QDA.
50
Comparison through simulation
FIGURE 4.11. Boxplots of the test error rates for each of the non-linear scenarios described in the main text.
Panels: SCENARIO 4, SCENARIO 5, SCENARIO 6. Methods compared in each panel: KNN-1, KNN-CV, LDA, Logistic, QDA.
51
There were 20 training observations in each of two classes. The observations within each class were uncorrelated random normal variables with a different mean in each class. The left-hand panel of Figure 4.10 shows that LDA performed well in this setting, as one would expect since this is the model assumed by LDA. KNN performed poorly because it paid a price in terms of variance that was not offset by a reduction in bias. QDA also performed worse than LDA, since it fit a more flexible classifier than necessary. Since logistic regression assumes a linear decision boundary, its results were only slightly inferior to those of LDA.
Scenario 1
52
Details are as in Scenario 1, except that within each class, the two predictors had a correlation of 0.5. The center panel of Figure 4.10 indicates little change in the relative performances of the methods as compared to the previous scenario.
Scenario 2
53
Scenario 3
We generated X1 and X2 from the t-distribution, with 50 observations per class. The t-distribution has a similar shape to the normal distribution, but it has a tendency to yield more extreme points, that is, more points that are far from the mean. In this setting, the decision boundary was still linear, and so fit into the logistic regression framework. The set-up violated the assumptions of LDA, since the observations were not drawn from a normal distribution. The right-hand panel of Figure 4.10 shows that logistic regression outperformed LDA, though both methods were superior to the other approaches. In particular, the QDA results deteriorated considerably as a consequence of non-normality.
54
The data were generated from a normal distribution, with a correlation of 0.5 between the predictors in the first class, and correlation of −0.5 between the predictors in the second class. This setup corresponded to the QDA assumption, and resulted in quadratic decision boundaries. The left-hand panel of Figure 4.11 shows that QDA outperformed all of the other approaches.
Scenario 4
55
Within each class, the observations were generated from a normal distribution with uncorrelated predictors. However, the responses were sampled from the logistic function using X1², X2² and X1 × X2 as predictors. Consequently, there is a quadratic decision boundary. The center panel of Figure 4.11 indicates that QDA once again performed best, followed closely by KNN-CV. The linear methods had poor performance.
Scenario 5
56
Scenario 6
Details are as in the previous scenario, but the responses were sampled from a more complicated non-linear function. As a result, even the quadratic decision boundaries of QDA could not adequately model the data. The right-hand panel of Figure 4.11 shows that QDA gave slightly better results than the linear methods, while the much more flexible KNN-CV method gave the best results. But KNN with K = 1 gave the worst results out of all methods. This highlights the fact that even when the data exhibits a complex nonlinear relationship, a non-parametric method such as KNN can still give poor results if the level of smoothness is not chosen correctly.
57