
Classification Methods
Dr Lan Du and Dr Ming Liu
Faculty of Information Technology, Monash University, Australia (FIT5149, week 4)

Outline
1 Regression for Classification
2 Linear Discriminant Analysis
3 Quadratic Discriminant Analysis (QDA)
4 Summary

Learning Outcomes

After working your way through this module, you should be able to:
Apply linear models to solve different classification problems;
Assess the accuracy of coefficient estimates and the accuracy of the model;
Produce analysis of the model output.

These are aligned with the unit learning outcomes:
Analyse data sets with a range of statistical, graphical and machine-learning tools;
Evaluate the limitations, appropriateness and benefits of data analytics methods for given tasks;
Assess the results of an analysis.

Classification

Qualitative (categorical) variables take values in an unordered set C, for example:
- email: spam or ham
- breast tumor diagnosis: benign or malignant
- sentiment: positive, negative, or neutral

Classification: given a feature vector X and a categorical response Y ∈ C, the classification task is to build a function f(X) that takes the feature vector X as input and predicts a value for Y. For example:
- To determine whether or not a breast tumor is benign on the basis of features computed from a digitised image of a fine needle aspirate (FNA) of a breast mass.
- To predict the sentiment of a product review on the basis of the review content (i.e., the sequence of words).
- To predict whether or not an individual will default on his or her credit card payment, on the basis of gender, education, age, history of past payments, etc.

Methods: classify an observation based on the predicted probability of each of the categories of a qualitative variable.

Regression for Classification

Example: Credit Card Default Data

Predict whether a client is credible or not, based on how likely the customer is to default.

Possible predictors X:
- Annual income
- Monthly credit card balance

The response variable default (Y) is categorical: Yes or No.

Questions:
- How to check the relationship between Y and X?
- How to build a model to predict default (Y) for any given value of balance (X1) and income (X2)?
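As a minimal R sketch, assuming the ISLR package (which ships the Default data set used throughout this module), we can load the data and eyeball the relationship between the response and the two predictors:

```r
# A sketch, assuming the ISLR package is installed: install.packages("ISLR")
library(ISLR)

data(Default)
str(Default)                 # 10,000 clients: default, student, balance, income
summary(Default$default)     # class counts for the categorical response

# Quick look at the relationship between default and each numeric predictor
boxplot(balance ~ default, data = Default, xlab = "Default", ylab = "Balance")
boxplot(income ~ default, data = Default, xlab = "Default", ylab = "Income")
```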

Example: Credit Card Default Data

Figure: The Default dataset in ISL. Income versus balance for defaulters and non-defaulters, together with boxplots of balance and income by default status (No/Yes).

Can We Use Linear Regression?

Suppose that we use the dummy variable approach to code the response variable:

default Y = 0 if No, 1 if Yes

Can we fit a linear regression of Y on X and predict default if Ŷ > 0.5?

default = β0 + β1 × balance

- In this case of a binary outcome, linear regression does a good job as a classifier, and is equivalent to linear discriminant analysis, which we discuss later.
- Note that the linear fit estimates E(default | balance) = p(default = Yes | balance).

Why Not Linear Regression?

Problems:

Linear regression might produce probabilities less than zero or bigger than one.

Figure: Probability of default versus balance; a straight-line fit gives estimated "probabilities" outside [0, 1] at the extremes of balance.

For a response variable with three possible values, e.g.

sentiment Y = 1 if positive, 2 if negative, 3 if neutral

this coding implies an ordering on the outcomes.

Solution: Logistic Function

Logistic function:

f(x) = e^x / (1 + e^x) = 1 / (1 + e^{−x})

Other similar (sigmoid-shaped) functions:
- hyperbolic tangent function

Figure: The logistic function f(x) for x ∈ [−10, 10]; the output y lies in (0, 1).
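As a one-line sketch, the logistic function in R; base R's plogis() computes the same quantity:

```r
# The logistic (sigmoid) function, mapping the real line into (0, 1)
logistic <- function(x) 1 / (1 + exp(-x))

x <- seq(-10, 10, by = 0.1)
plot(x, logistic(x), type = "l", ylab = "y")   # the S-shaped curve above
all.equal(logistic(x), plogis(x))              # TRUE: plogis() is the logistic CDF
```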

Logistic Regression

Logistic regression on the Default data set uses the following form:

p(default = Yes | balance) = e^{β0 + β1 × balance} / (1 + e^{β0 + β1 × balance})

Logistic regression ensures that our estimate for p(default = Yes) is between 0 and 1.

Figure: Probability of default versus balance. Left: the linear regression fit, which escapes [0, 1]. Right: the logistic regression fit, which stays within [0, 1].

After a bit of manipulation, we have the odds:

p(default = Yes | balance) / (1 − p(default = Yes | balance)) = e^{β0 + β1 × balance}

The logit transformation gives the following logit link function, or log odds:

log( p(default = Yes | balance) / (1 − p(default = Yes | balance)) ) = β0 + β1 × balance
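As a sketch, the model above can be fitted in R with glm() and the binomial family; the fitted coefficients should be close to the values in the tables that follow:

```r
library(ISLR)   # for the Default data set

# Simple logistic regression of default on balance
fit <- glm(default ~ balance, data = Default, family = binomial)
coef(fit)                    # beta0 and beta1 (roughly -10.65 and 0.0055)
exp(coef(fit)["balance"])    # multiplicative change in the odds per extra dollar of balance
```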

Interpreting the Coefficients

In a simple linear regression model, β1 is the slope of the regression line. In a logistic regression model, what is β1?

Figure: The fitted logistic curve of p(default = Yes | balance) against balance; the slope of the curve changes with balance.

- β1 does not correspond to the change in p(default = Yes | balance). Why? The probability is a non-linear (S-shaped) function of balance, so the effect of a one-unit change in balance depends on the current value of balance; β1 is the change in the log odds per unit increase in balance.
- If β1 > 0, increasing balance will increase p(default = Yes | balance).
- If β1 < 0, increasing balance will decrease p(default = Yes | balance).

Estimating the Regression Coefficients

Use maximum likelihood to estimate the coefficients:

L(β0, β1) = ∏_{i: default_i = Yes} p(default_i | balance_i) × ∏_{i: default_i = No} (1 − p(default_i | balance_i))

This likelihood gives the probability of the observed zeros and ones in the data. We pick β0 and β1 to maximise the likelihood of the observed data:

argmax_{β0, β1} L(β0, β1)

That is, we seek β0 and β1 such that the predicted probability p̂(default_i | balance_i) for each balance_i corresponds as closely as possible to the observed default_i.

Assessing the Accuracy of the Coefficients

Similar to linear regression, we still want to perform a hypothesis test:

H0: β1 = 0    versus    Ha: β1 ≠ 0

Instead of the t-statistic, we use the z-statistic associated with β̂1: z = β̂1 / SE(β̂1).

The logistic regression model that predicts the probability of default using balance:

               Coefficient   Std. error   Z-statistic   P-value
Intercept      -10.6513      0.3612       -29.5         < 0.0001
balance          0.0055      0.0002        24.9         < 0.0001

The logistic regression model that predicts the probability of default using student:

               Coefficient   Std. error   Z-statistic   P-value
Intercept       -3.5041      0.0707       -49.55        < 0.0001
student[Yes]     0.4049      0.1150         3.52          0.0004

Make Predictions

What is the estimated probability of default for someone with a balance of $1,000?

p̂(default = Yes | balance = 1000) = e^{−10.6513 + 0.0055 × 1000} / (1 + e^{−10.6513 + 0.0055 × 1000}) = 0.00576

What is the estimated probability of default for a student?

p̂(default = Yes | student = Yes) = e^{−3.5041 + 0.4049 × 1} / (1 + e^{−3.5041 + 0.4049 × 1}) = 0.0431

Multiple Logistic Regression

To predict a binary response using multiple predictors, we can generalise the one-variable logistic regression as follows:

p(X) = e^{β0 + β1 X1 + ··· + βp Xp} / (1 + e^{β0 + β1 X1 + ··· + βp Xp})

and the logit function is now

log( p(X) / (1 − p(X)) ) = β0 + β1 X1 + ··· + βp Xp

For the Default data, the estimated coefficients of the multiple logistic regression model are:

               Coefficient   Std. error   Z-statistic   P-value
Intercept      -10.8690      0.4923       -22.08        < 0.0001
balance          0.0057      0.0002        24.74        < 0.0001
income           0.0030      0.0082         0.37          0.7115
student[Yes]    -0.6468      0.2362        -2.74          0.0062

Multiple Logistic Regression

The negative coefficient for student: a student is less likely to default than a non-student with the same balance and income.

Figure: Default rate as a function of credit card balance, and boxplots of balance by student status; students (orange), non-students (blue).

Students tend to have higher balances than non-students, so their marginal default rate is higher than for non-students. But for each level of balance, students default less than non-students.
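Continuing the glm() sketch above, the worked predictions and the multiple logistic regression can be reproduced along these lines; note that the income coefficient's scale depends on its units, and the per-$1,000 rescaling below is an assumption about how the table above was produced:

```r
# Estimated probability of default at balance = $1,000 (cf. the worked example)
predict(fit, newdata = data.frame(balance = 1000), type = "response")   # ~ 0.00576

# Multiple logistic regression with balance, income and student.
# income is recorded in dollars in ISLR's Default; divide by 1000 so the
# coefficient is on a per-$1,000 scale comparable to the table above.
fit_all <- glm(default ~ balance + I(income / 1000) + student,
               data = Default, family = binomial)
summary(fit_all)$coefficients   # estimates, std. errors, z-values, p-values
```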
Logistic regression with more than two classes

To classify a response variable that has more than two classes, one version (used in the R package glmnet) is

p(Y = k | X) = e^{β_{k,0} + β_{k,1} X1 + ··· + β_{k,p} Xp} / Σ_{l=1}^{K} e^{β_{l,0} + β_{l,1} X1 + ··· + β_{l,p} Xp}

where a linear function is associated with each class.

Multiclass logistic regression is also referred to as multinomial regression, or the Maximum Entropy model (MaxEnt).

Logistic regression with ordinal responses

When the response variable is categorical and ordered, we model the cumulative log odds

c_j(x) = ln( P(Y ≤ j | x) / P(Y > j | x) )

Assumption:

ln( p(Y ≤ j | x) / (1 − p(Y ≤ j | x)) ) = α_j + βx

- the proportional odds assumption (or the parallel regression assumption): the relationship between each pair of outcome groups is the same, i.e. the slope β does not depend on j.

Example: MaxEnt and ordinal logistic regression
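A minimal sketch of both variants in R, using nnet::multinom() for the multinomial model and MASS::polr() for the proportional-odds model; the three-class sentiment data here are simulated purely for illustration:

```r
library(nnet)   # multinom(): multinomial (MaxEnt-style) logistic regression
library(MASS)   # polr(): proportional-odds ordinal logistic regression

# Hypothetical toy data: two features and an ordered 3-level sentiment response
set.seed(1)
n <- 300
x1 <- rnorm(n); x2 <- rnorm(n)
latent <- 1.5 * x1 - x2 + rnorm(n)
sentiment <- cut(latent, breaks = c(-Inf, -1, 1, Inf),
                 labels = c("negative", "neutral", "positive"),
                 ordered_result = TRUE)
d <- data.frame(x1, x2, sentiment)

# Multinomial logistic regression: one linear function per class
multi_fit <- multinom(sentiment ~ x1 + x2, data = d)
head(predict(multi_fit, type = "probs"))

# Ordinal logistic regression: thresholds alpha_j with one shared slope beta
ord_fit <- polr(sentiment ~ x1 + x2, data = d, method = "logistic")
summary(ord_fit)
```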

Linear Discriminant Analysis

Discriminant Analysis

Discriminant analysis belongs to the branch of classification methods called generative modelling, where we try to estimate the within-class density of X given the class label. Combined with the prior probability (unconditional probability) of the classes, the posterior probability of Y can be obtained by Bayes' formula:

p(Y = k | X = x) = p(X = x, Y = k) / p(X = x)
                 = p(X = x | Y = k) p(Y = k) / Σ_{l=1}^{K} p(X = x | Y = l) p(Y = l)
                 = π_k f_k(x) / Σ_{l=1}^{K} π_l f_l(x)

where
- f_k(x) = p(X = x | Y = k) is the density for X in class k;
- π_k = p(Y = k) is the prior probability for class k.

Assume the density within each class is a Gaussian distribution:
- Linear Discriminant Analysis (LDA): the Gaussian distributions for the different classes share the same covariance structure.
- Quadratic Discriminant Analysis (QDA): no such constraint on the covariance structure.

Classify to the Highest Density

Figure: Two Gaussian class densities. Left: π1 = 0.5, π2 = 0.5. Right: π1 = 0.3, π2 = 0.7.

We classify a new point according to which density is highest.

Linear Discriminant Analysis for p = 1

In the one-dimensional setting, the Gaussian density has the form

f_k(x) = (1 / (√(2π) σ_k)) e^{−(1/2) ((x − μ_k) / σ_k)^2}

Here μ_k is the mean, and σ_k^2 the variance, in class k. We will assume that all the σ_k = σ are the same.

Plugging this into Bayes' formula, we get a rather complex expression for p_k(x) = p(Y = k | X = x):

p_k(x) = π_k (1 / (√(2π) σ)) e^{−(1/(2σ^2)) (x − μ_k)^2} / Σ_{l=1}^{K} π_l (1 / (√(2π) σ)) e^{−(1/(2σ^2)) (x − μ_l)^2}

To classify at the value X = x, we need to see which of the p_k(x) is largest. Taking logs, and discarding terms that do not depend on k, we see that this is equivalent to assigning x to the class with the largest discriminant score:

δ_k(x) = x · μ_k / σ^2 − μ_k^2 / (2σ^2) + log(π_k)

Note that δ_k(x) is a linear function of x.

If there are K = 2 classes and π1 = π2, then one can see that the decision boundary is at

x = (μ1 + μ2) / 2

(The mathier students will recognise that the boundary is given by setting δ1(x) = δ2(x).)

A Simple Example with p = 1

Figure: Example with μ1 = −1.25, μ2 = 1.25, π1 = π2 = 0.5, and σ^2 = 1. Left: the two true densities. Right: histograms of 20 observations drawn from each of the two classes.

Error rates:
- Bayes error rate: 10.6%
- LDA error rate: 11.1%

Typically we don't know these parameters; we just have the training data. In that case we simply estimate the parameters and plug them into the rule.
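A sketch of the plug-in rule for this one-dimensional example, on data simulated with the figure's parameters; the estimates and the resulting boundary are approximations of the true values:

```r
set.seed(1)
n <- 20
x1 <- rnorm(n, mean = -1.25, sd = 1)   # class 1 sample
x2 <- rnorm(n, mean =  1.25, sd = 1)   # class 2 sample

# Plug-in estimates: class means, shared (pooled) variance, priors
mu1 <- mean(x1); mu2 <- mean(x2)
pi1 <- pi2 <- 0.5
s2 <- (sum((x1 - mu1)^2) + sum((x2 - mu2)^2)) / (2 * n - 2)

# Linear discriminant score for a point x, class mean mu, class prior pi_k
delta <- function(x, mu, pi_k) x * mu / s2 - mu^2 / (2 * s2) + log(pi_k)

# Classify a new point, and compare the estimated boundary with the Bayes
# boundary (mu1 + mu2) / 2 = 0
x_new <- 0.4
which.max(c(delta(x_new, mu1, pi1), delta(x_new, mu2, pi2)))
(mu1 + mu2) / 2
```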

Linear Discriminant Analysis when p > 1

Figure: Examples of multivariate Gaussian densities over (x1, x2).

The multivariate Gaussian density:

f(x) = (1 / ((2π)^{p/2} |Σ|^{1/2})) e^{−(1/2) (x − μ)^T Σ^{−1} (x − μ)}

The discriminant function (a linear function of x):

δ_k(x) = x^T Σ^{−1} μ_k − (1/2) μ_k^T Σ^{−1} μ_k + log π_k

Illustration: p = 2 and K = 3

Figure: Three Gaussian classes in two dimensions; left panel shows the true densities, right panel a sample with fitted boundaries.

- 20 observations were generated from each class.
- Ellipses contain 95% of the probability for each of the three classes.
- The dashed lines are the Bayes decision boundaries.
- The solid lines are the LDA decision boundaries.

Example: LDA on Credit Data

Figure: Confusion matrix for LDA on the Default data. With classification threshold 0.5, we make (23 + 252)/10000 errors, a 2.75% misclassification rate!

Some caveats:
- This is training error, and we may be overfitting. (Not a big concern here, since n = 10000 and p = 4!)
- If we classified to the prior (always to class No in this case) we would make 333/10000 = 3.33% errors.
- Of the true No's, we make 23/9667 = 0.2% errors; of the true Yes's, we make 252/333 = 75.7% errors.

Example: LDA on Credit Data

Figure: Confusion matrix with classification threshold 0.2: we now make (235 + 138)/10000 errors, a 3.73% misclassification rate.

- LDA now mis-predicts only 138/333 = 41.4% of defaulters.
- There is a trade-off between the overall error rate and the sensitivity (the percentage of true defaulters identified).
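A sketch of this analysis with MASS::lda(), assuming (as in ISL) that the model uses balance and student as predictors; the confusion matrices should be close to the figures quoted above:

```r
library(ISLR)
library(MASS)

lda_fit <- lda(default ~ balance + student, data = Default)
post <- predict(lda_fit)$posterior[, "Yes"]   # posterior p(default = Yes | x)

# Confusion matrix at the default 0.5 threshold (training data)
table(pred = ifelse(post > 0.5, "Yes", "No"), truth = Default$default)

# Lowering the threshold to 0.2 catches more defaulters, at the cost of
# mis-classifying more non-defaulters
table(pred = ifelse(post > 0.2, "Yes", "No"), truth = Default$default)
```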

Varying the Classification Threshold

Figure: Error rates as the classification threshold varies from 0.0 to 0.5.
- Black solid: overall error rate
- Blue dashed: fraction of defaulters missed
- Orange dotted: fraction of non-defaulters incorrectly classified

Receiver Operating Characteristic (ROC) Curve

Figure: ROC curve for the LDA classifier on the Default data: true positive rate against false positive rate as the threshold varies.

False positive rate: the fraction of negative samples that are mis-classified.
True positive rate: the fraction of positive samples that are correctly classified.
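Continuing the LDA sketch above, the ROC curve can be traced by sweeping the threshold over the posterior probabilities; packages such as pROC automate this, but the computation is short enough to spell out:

```r
truth <- Default$default == "Yes"
thresholds <- sort(unique(post), decreasing = TRUE)

tpr <- sapply(thresholds, function(t) mean(post[truth] >= t))    # true positive rate
fpr <- sapply(thresholds, function(t) mean(post[!truth] >= t))   # false positive rate

plot(fpr, tpr, type = "l",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)   # a classifier with no information lies on the diagonal
```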

Why Discriminant Analysis?

When the classes are well separated, the parameter estimates for the logistic regression model are surprisingly unstable. Linear discriminant analysis does not suffer from this problem.

If n is small and the distribution of the predictors X is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model.

Linear discriminant analysis is popular when we have more than two response classes.

Possible Applications of LDA

Bankruptcy prediction: based on accounting ratios and other financial variables, linear discriminant analysis has been applied to systematically explain which firms entered bankruptcy and which survived.

Marketing: determine the factors that distinguish different types of customers and/or products, on the basis of surveys or other forms of collected data.

Biomedical studies: assess the severity state of a patient and the prognosis of disease outcome. For example:
- Use the results of clinical and laboratory analyses to build discriminant functions that classify the disease in a future patient as mild, moderate or severe.

Example: LDA on Fisher's Iris data

Quadratic Discriminant Analysis (QDA)

Quadratic Discriminant Analysis

The posterior probability:

p(Y = k | X = x) = π_k f_k(x) / Σ_{l=1}^{K} π_l f_l(x)

where the f_k(x) are Gaussian densities.

LDA: the same covariance matrix Σ in each class, giving the linear discriminant

δ_k(x) = x^T Σ^{−1} μ_k − (1/2) μ_k^T Σ^{−1} μ_k + log π_k

QDA: a different covariance matrix Σ_k in each class, giving the quadratic discriminant

δ_k(x) = −(1/2) (x − μ_k)^T Σ_k^{−1} (x − μ_k) − (1/2) log |Σ_k| + log π_k

Quadratic Discriminant Analysis

Figure: Two-class problems in two dimensions.
- Black dotted: LDA decision boundary
- Purple dashed: Bayes decision boundary
- Green solid: QDA decision boundary

Left: the variances of the classes are equal (LDA is the better fit).
Right: the variances of the classes are not equal (QDA is the better fit).
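A small simulation sketch of the right-hand scenario: two Gaussian classes with different covariance matrices, where QDA's quadratic boundary should beat LDA's linear one. The class means, covariances and sample size here are arbitrary choices for illustration:

```r
library(MASS)
set.seed(1)

n <- 200
Sigma_a <- matrix(c(1,  0.5,  0.5, 1), 2)   # class A covariance
Sigma_b <- matrix(c(1, -0.5, -0.5, 1), 2)   # class B covariance (different)
xa <- mvrnorm(n, mu = c(0, 0), Sigma = Sigma_a)
xb <- mvrnorm(n, mu = c(1, 1), Sigma = Sigma_b)
dat <- data.frame(rbind(xa, xb), y = factor(rep(c("A", "B"), each = n)))

lda_fit <- lda(y ~ X1 + X2, data = dat)
qda_fit <- qda(y ~ X1 + X2, data = dat)

# Training error rates: QDA should be the better fit when the class
# covariances differ (checking on fresh test data works the same way)
mean(predict(lda_fit)$class != dat$y)
mean(predict(qda_fit)$class != dat$y)
```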

Logistic Regression vs. LDA

For a two-class problem, one can show that for LDA

log( p1(x) / (1 − p1(x)) ) = log( p1(x) / p2(x) ) = c0 + c1 x1 + ··· + cp xp

Similarity: both logistic regression and LDA produce linear decision boundaries.

Difference: how the parameters are estimated.
- Logistic regression uses the conditional likelihood based on p(Y | X), known as discriminative learning.
- LDA uses the full likelihood based on p(X, Y), known as generative learning.

Note: LDA would do better than logistic regression if the assumption of normality holds; otherwise logistic regression can outperform LDA.

Summary

Logistic regression is very popular for classification, especially when K = 2.

LDA is useful when n is small, or when the classes are well separated and the Gaussian assumptions are reasonable; also when K > 2.

Hint: if the decision boundary is
- linear: LDA and logistic regression perform well;
- moderately non-linear: QDA performs better;
- more complicated: KNN is superior.

See Section 4.5 for some comparisons of logistic regression, LDA and KNN.

Reference

Reading material:
- "Classification", Chapter 4 of "Introduction to Statistical Learning", 6th edition.

Some figures in this presentation were taken from "An Introduction to Statistical Learning, with applications in R" (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.

Some of the slides are reproduced based on the slides from T. Hastie and R. Tibshirani.