CS699 Lecture 4 Classification 1
Supervised vs. Unsupervised Learning
Supervised learning (classification)
Supervision: The training data (observations, measurements, etc.) are accompanied by class labels indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
2
Prediction Problems: Classification vs. Numeric Prediction
Classification
Predicts categorical class labels (discrete or nominal)
Constructs a model based on the training dataset which has known class labels and uses it to classify new data (or determine the class labels of new data)
Numeric Prediction
Models continuous-valued functions, i.e., the class attribute is a numeric attribute
Can be also used to predict missing values
3
Prediction Problems: Classification vs. Numeric Prediction
Typical applications
Credit/loan approval: whether a loan application should be approved
Medical diagnosis: whether a tumor is cancerous or benign
Fraud detection: whether a transaction is fraudulent
Web page categorization: which category a page belongs to
4
Classification—A Two-Step Process
Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
The set of tuples used for model construction is the training set
The model is represented as classification rules, decision trees, mathematical formulae, etc.
5
Classification—A Two-Step Process
Model usage: classify future or unknown tuples
Estimate the accuracy of the model
The known label of a test sample is compared with the label predicted by the model
Accuracy rate is the percentage of test set samples that are correctly classified by the model
Test set is independent of training set
If the accuracy is acceptable, use the model to classify data tuples with unknown class labels. Otherwise, build another model and repeat.
6
Process (1): Model Construction
Training Data:
NAME   SALARY   YEARS   MANAGER
Mike   85K      3       no
Mary   105K     5       yes
Bill   75K      7       yes
Jim    90K      5       no
Dave   85K      6       yes
Anne   80K      3       no
The training data is fed to a classification algorithm, which produces the classifier (model):
IF salary > 100K OR years > 5 THEN manager = ‘yes’
7
Process (2): Using the Model in Prediction
Testing Data:
NAME      SALARY   YEARS   MANAGER
Tom       90K      2       no
Merlisa   75K      4       no
George    95K      6       yes
Joseph    120K     5       yes
The classifier is first evaluated on the testing data, then applied to unseen data.
Unseen data: (Jeff, 80K, 4) — Manager? The model predicts ‘No’.
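As a quick illustration (my own sketch, not part of the slides; it assumes salaries are expressed in thousands), the rule learned in Process (1) can be applied to the testing and unseen tuples:

```python
def is_manager(salary_k, years):
    # The classifier (model) from Process (1): IF salary > 100K OR years > 5 THEN manager = 'yes'
    return "yes" if salary_k > 100 or years > 5 else "no"

print(is_manager(80, 4))    # Jeff (80K, 4)    -> 'no'
print(is_manager(95, 6))    # George (95K, 6)  -> 'yes' (matches the testing data)
print(is_manager(120, 5))   # Joseph (120K, 5) -> 'yes' (matches the testing data)
```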
8
Example (training) Dataset
age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31..40   high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31..40   low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31..40   medium   no        excellent       yes
31..40   high     yes       fair            yes
>40      medium   no        excellent       no
9
1R Classifier
Simple and cheap
Idea: Make rules based on a single attribute
Compute the error rate for each attribute
Choose the one with the smallest error rate
10
1R Classifier
Algorithm:
For each attribute
  For each value of the attribute
    count the frequency of each class
    find the most frequent class
    make rule: assign that class to this attribute-value
  Compute the error rate of the rules (of this attribute)
Choose the rules with the smallest error rate
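A minimal Python sketch of this procedure (my own illustration; the function name one_r and the dict-per-tuple data layout are assumptions, not part of the slides):

```python
from collections import Counter, defaultdict

def one_r(rows, attributes, class_attr):
    """1R: for each attribute, build one rule per value and keep the attribute
    whose rule set makes the fewest errors on the training data."""
    best = None
    for attr in attributes:
        # Count class frequencies for each value of this attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[class_attr]] += 1
        # Rule for each value: predict the most frequent class for that value.
        rules = {value: cls.most_common(1)[0][0] for value, cls in counts.items()}
        # Errors = tuples whose class is not the majority class of their value.
        errors = sum(sum(cls.values()) - max(cls.values()) for cls in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best   # (chosen attribute, its rules, total error count)
```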
11
1R Classifier
Example dataset
outlook    temperature   humidity   windy   play
sunny      hot           high       F       N
sunny      hot           high       T       N
overcast   hot           high       F       Y
rainy      mild          high       F       Y
rainy      cool          normal     F       Y
rainy      cool          normal     T       N
overcast   cool          normal     T       Y
sunny      mild          high       F       N
sunny      cool          normal     F       Y
rainy      mild          normal     F       Y
sunny      mild          normal     T       Y
overcast   mild          high       T       Y
overcast   hot           normal     F       Y
rainy      mild          high       T       N
12
1R Classifier
Make rules for each attribute and calculate error rate
Attribute outlook
sunny: 2 yes’s and 3 no’s
rule: outlook = sunny → no, error rate = 2/5
overcast: 4 yes’s and 0 no’s
rule: outlook = overcast → yes, error rate = 0
rainy: 3 yes’s and 2 no’s
rule: outlook = rainy → yes, error rate = 2/5
total error = 4/14
13
1R Classifier
Make rules for each attribute and calculate error rate
Repeat the same for all other attributes.
Attribute temperature
temperature = hot → no (2/4), error rate = 2/4 (arbitrary tie breaking)
temperature = mild → yes (4/6), error rate = 2/6
temperature = cool → yes (3/4), error rate = 1/4
total error = 5/14
14
1R Classifier
Attribute humidity
humidity = high → no (4/7), error rate = 3/7
humidity = normal → yes (6/7), error rate = 1/7
total error = 4/14
Attribute windy
windy = false → yes (6/8), error rate = 2/8
windy = true → no (3/6), error rate = 3/6 (arbitrary tie breaking)
total error = 5/14
15
1R Classifier
outlook and humidity have the same total error rate of 4/14. If we (randomly) choose outlook, the rules are
If outlook = sunny → play = no
If outlook = overcast → play = yes
If outlook = rainy → play = yes
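For example, running the one_r sketch above on the 14-tuple weather dataset reproduces these rules (again my own illustration; the printed result assumes ties are broken in favor of the attribute and class encountered first):

```python
# The weather dataset from the slides, as (outlook, temperature, humidity, windy, play).
data = [
    ("sunny", "hot", "high", "F", "N"),       ("sunny", "hot", "high", "T", "N"),
    ("overcast", "hot", "high", "F", "Y"),    ("rainy", "mild", "high", "F", "Y"),
    ("rainy", "cool", "normal", "F", "Y"),    ("rainy", "cool", "normal", "T", "N"),
    ("overcast", "cool", "normal", "T", "Y"), ("sunny", "mild", "high", "F", "N"),
    ("sunny", "cool", "normal", "F", "Y"),    ("rainy", "mild", "normal", "F", "Y"),
    ("sunny", "mild", "normal", "T", "Y"),    ("overcast", "mild", "high", "T", "Y"),
    ("overcast", "hot", "normal", "F", "Y"),  ("rainy", "mild", "high", "T", "N"),
]
names = ["outlook", "temperature", "humidity", "windy", "play"]
rows = [dict(zip(names, r)) for r in data]

attr, rules, errors = one_r(rows, names[:-1], "play")
print(attr, rules, errors)
# outlook {'sunny': 'N', 'overcast': 'Y', 'rainy': 'Y'} 4
```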
16
Bayesian Classification
A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
Foundation: Based on Bayes’ Theorem.
Performance: A simple Bayesian classifier, naïve Bayesian classifier, has comparable performance with decision tree and selected neural network classifiers
Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct — prior knowledge can be combined with observed data
Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
17
Bayes’ Theorem
Based on Bayes’ rule (or Bayes’ theorem) of conditional probability:
P(H|X) = P(X|H) P(H) / P(X), where
P(H) (prior probability): the initial probability
E.g., X will buy a computer, regardless of age, income, …
P(X) (evidence): probability that a sample data is observed
P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
E.g., Given that X buys a computer, the prob. that X is 31..40, medium income, …
18
Bayes’ Theorem
A simple example of using Bayes’ theorem for inference: If a person has muscle pain, what is the probability that the person has flu?
19
Bayes’ Theorem
F: A patient has flu, M: A patient has muscle pain
We know from historical data:
P(M|F) = 0.75 (if a person catches flu, he/she has muscle pain 75% of the time)
P(F) = 0.00002 (the probability that a person has flu)
P(M) = 0.005 (the probability that a person has muscle pain)
The probability that a person has flu if that person has muscle pain:
P(F|M) = P(M|F) P(F) / P(M)
= 0.75 * 0.00002 / 0.005 = 0.003
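The same arithmetic as a tiny Python check (a sketch; the variable names are mine):

```python
p_m_given_f = 0.75      # P(M|F)
p_f = 0.00002           # P(F)
p_m = 0.005             # P(M)

p_f_given_m = p_m_given_f * p_f / p_m   # Bayes' theorem: P(F|M) = P(M|F) P(F) / P(M)
print(round(p_f_given_m, 3))            # 0.003
```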
20
Bayes’ Theorem
Given a data sample X, the posterior probability of a hypothesis H, P(H|X), follows Bayes’ theorem:
P(H|X) = P(X|H) P(H) / P(X)
Informally, this can be written as
posterior = (likelihood x prior) / evidence
21
Towards Naïve Bayesian Classifier
Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-dimensional attribute-value vector X = (x1, x2, …, xn), and xi is the value of attribute Ai.
Suppose there are m classes C1, C2, …, Cm.
Classification is to derive the class with the maximum posterior probability, i.e., the maximal P(Ci|X)
In other words, compute P(C1|X), P(C2|X), …, P(Cm|X) and predict X belongs to the class with the highest probability.
Practical difficulty: this requires initial knowledge of many probabilities and has significant computational cost
22
Towards Naïve Bayesian Classifier
P(Ci|X) can be computed using Bayes’ theorem:
P(Ci|X) = P(X|Ci) P(Ci) / P(X)
Buys_computer dataset example: Given that we know properties of a (new) customer (i.e., an evidence X), the probability that the customer will buy a computer (C1: buys_computer = yes) or the probability that the customer will not buy a computer (C2 : buys_computer = no)
23
Towards Naïve Bayesian Classifier
The class label, buys_computer, has two values, Y and N:
P(classY|X) = P(X|classY) P(classY) / P(X)
P(classN|X) = P(X|classN) P(classN) / P(X)
And, assign the class with higher probability to X (or we predict the class label of X will be the class with higher probability).
24
Towards Naïve Bayesian Classifier
In the equations, P(X) is the same for all classes. So we compute and compare only the numerators.
P(X|Ci) P(Ci)
For a class label with two values, Y and N, we compute and compare the following two:
P(X|classY) P(classY)
P(X|classN) P(classN)
25
Derivation of Naïve Bayes Classifier
We need to compute P(X|Ci) and P(Ci).
P(Ci) can be easily estimated from the training dataset
For the Buys_computer dataset, there are two classes, yes and no. Let yes be C1 and no be C2.
We can estimate P(Ci) as follows:
P(C1) = (# yes tuples) / (total # tuples)
P(C2) = (# no tuples) / (total # tuples)
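A one-line sketch of this estimate (my own illustration; the counts 9 and 5 come from the 14-tuple Buys_computer dataset):

```python
from collections import Counter

def class_priors(labels):
    # Estimate P(Ci) as the relative frequency of each class in the training data.
    counts = Counter(labels)
    return {c: n / len(labels) for c, n in counts.items()}

print(class_priors(["yes"] * 9 + ["no"] * 5))   # {'yes': 0.642857..., 'no': 0.357142...}
```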
26
Derivation of Naïve Bayes Classifier
Computation of P(X|Ci) is not easy.
It can be simplified with the class-conditional independence assumption (a naïve assumption)
class‐conditional independence assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):
This greatly reduces the computation cost
27
Derivation of Naïve Bayes Classifier
Based on this assumption, we compute P(xk|Ci) for each attribute value xk and multiply them all to obtain P(X|Ci) (this is possible because we assumed all attributes are independent of each other).
If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk for Ak divided by |Ci, D| (# of tuples of Ci in D)
P(X|Ci) = ∏_{k=1}^{n} P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
(Example follows in the next slides)
28
Naïve Bayesian Classifier: Training Dataset
Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’
Data sample:
X = (age = >40, income = high, student = no, credit_rating = excellent)
age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31..40   high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31..40   low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31..40   medium   no        excellent       yes
31..40   high     yes       fair            yes
>40      medium   no        excellent       no
29
Naïve Bayesian Classifier: An Example
Class prior probabilities are:
P(C1) = P(buys_computer = “yes”) = 9/14 = 0.643
P(C2) = P(buys_computer = “no”) = 5/14 = 0.357
Next, we compute P(X|Ci) for each class
P(age = “>40” | buys_computer = “yes”) = 3/9 = 0.333
P(income = “high” | buys_computer = “yes”) = 2/9 = 0.222
P(student = “no” | buys_computer = “yes”) = 3/9 = 0.333
P(credit_rating = “excellent” | buys_computer = “yes”) = 3/9 = 0.333
P(X|buys_computer = “yes”) = 0.333 x 0.222 x 0.333 x 0.333 = 0.008
P(age = “>40” | buys_computer = “no”) = 2/5 = 0.4
P(income = “high” | buys_computer = “no”) = 2/5 = 0.4
P(student = “no” | buys_computer = “no”) = 4/5 = 0.8
P(credit_rating = “excellent” | buys_computer = “no”) = 3/5 = 0.6
P(X|buys_computer = “no”) = 0.4 x 0.4 x 0.8 x 0.6 = 0.077
30
Naïve Bayesian Classifier: An Example
P(X|Ci) :
P(X|buys_computer = “yes”) = 0.008
P(X|buys_computer = “no”) = 0.077
P(X|Ci)*P(Ci) :
P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.008 * 0.643 = 0.005
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.077 * 0.357 = 0.027
Therefore, the model predicts that X belongs to class (“buys_computer = no”)
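A compact sketch that reproduces these numbers (my own illustration, not the slides’ code; the function names and dict-based data layout are assumptions, and no smoothing is applied, exactly as in the hand computation above):

```python
from collections import Counter, defaultdict

def nb_train(rows, attributes, class_attr):
    """Estimate P(Ci) and the per-attribute counts needed for P(xk|Ci) by counting."""
    class_counts = Counter(row[class_attr] for row in rows)
    cond = defaultdict(Counter)                      # (attribute, class) -> Counter of values
    for row in rows:
        for a in attributes:
            cond[(a, row[class_attr])][row[a]] += 1
    priors = {c: n / len(rows) for c, n in class_counts.items()}
    return priors, cond, class_counts

def nb_predict(x, priors, cond, class_counts):
    """Return the class maximizing P(X|Ci) * P(Ci) under class-conditional independence."""
    scores = {}
    for c, prior in priors.items():
        p = prior
        for a, v in x.items():
            p *= cond[(a, c)][v] / class_counts[c]   # P(xk|Ci) = count(xk, Ci) / |Ci,D|
        scores[c] = p
    return max(scores, key=scores.get), scores

cols = ["age", "income", "student", "credit_rating", "buys_computer"]
raw = [
    ("<=30", "high", "no", "fair", "no"),         ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),      (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),         (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),        (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),     (">40", "medium", "no", "excellent", "no"),
]
rows = [dict(zip(cols, r)) for r in raw]

x = {"age": ">40", "income": "high", "student": "no", "credit_rating": "excellent"}
label, scores = nb_predict(x, *nb_train(rows, cols[:-1], "buys_computer"))
print(label, scores)   # 'no', with scores of roughly {'no': 0.027, 'yes': 0.005}
```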
31
Naïve Bayesian Classifier: Comments
Advantages
Easy to implement
Good results obtained in most of the cases
Disadvantages
Assumption: class conditional independence, therefore loss of accuracy
Practically, dependencies exist among variables:
patient_profile: age, height, weight, etc.
Dependencies among these cannot be modeled by the Naïve Bayesian Classifier
How to deal with these dependencies? Bayesian Belief Networks (Chapter 9)
32
Decision Tree
A popular classifier
A model is built as a decision tree
Internal node represents a test on an attribute
Branch represents an outcome of the test
Leaf node has a class label
See example in the next slide
33
Decision Tree Induction Example
Training data set: Buys_computer
The data set follows an example of Quinlan’s ID3 (Playing Tennis)
age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31..40   high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31..40   low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31..40   medium   no        excellent       yes
31..40   high     yes       fair            yes
>40      medium   no        excellent       no
Resulting tree:
age?
  <=30   → student?
             no  → no
             yes → yes
  31..40 → yes
  >40    → credit rating?
             excellent → no
             fair      → yes
34
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down, recursive manner
Attributes are categorical (if continuous-valued, they are discretized first)
At start, all training samples are considered to be at the root.
35
Algorithm for Decision Tree Induction
Basic algorithm (continued)
A test attribute is selected on the basis of a heuristic or statistical measure (e.g., information gain)
Samples are partitioned based on the selected attribute – children nodes are created
The same process is repeated at each child node (recursively)
36
Algorithm for Decision Tree Induction
Conditions for stopping partitioning
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
There are no samples left
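A minimal recursive sketch of this basic algorithm (my own illustration; it assumes categorical attributes stored as dicts and takes the attribute-selection measure as a function argument, e.g., one based on the information gain defined on the following slides):

```python
from collections import Counter

def build_tree(rows, attributes, class_attr, select_attribute):
    """Top-down, recursive decision tree induction (the basic greedy algorithm)."""
    labels = [row[class_attr] for row in rows]
    # Stop: all samples at this node belong to the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # Stop: no remaining attributes -- label the leaf by majority voting.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Select a test attribute using the given heuristic measure.
    attr = select_attribute(rows, attributes, class_attr)
    remaining = [a for a in attributes if a != attr]
    tree = {attr: {}}
    # Partition the samples on the selected attribute and recurse on each child.
    # (Branches for attribute values with no samples left are simply not created here.)
    for value in set(row[attr] for row in rows):
        subset = [row for row in rows if row[attr] == value]
        tree[attr][value] = build_tree(subset, remaining, class_attr, select_attribute)
    return tree
```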
37
Algorithm for Decision Tree Induction
Initially all samples are in the same partition, associated with the root node.
38
Algorithm for Decision Tree Induction
Age is chosen as the test attribute, and samples are partitioned into three nodes based on the values of Age, i.e., “≤ 30”, “31..40”, and “>40.”
(Figure: Age is the test attribute at the root. The “≤ 30” and “> 40” child nodes need to be further partitioned; all 4 tuples in the “31..40” node are Yes, so that node is not partitioned further.)
39
Algorithm for Decision Tree Induction
The same process is repeated (recursively) on the left node and on the right node.
40
Attribute Selection Measure:
Consider the following dataset:
Temp   Wind   Class
high   east   Y
low    west   N
low    east   N
high   west   Y
high   west   Y
Split by Temp: {high: Y, Y, Y}, {low: N, N}
Split by Wind: {east: Y, N}, {west: N, Y, Y}
Which split is better?
41
Attribute Selection Measure:
Test attribute is selected based on “purity measure.”
Purity measures: information gain, gain ratio, Gini index
Information gain is based on info.
Info represents how pure a dataset is with regard to class labels.
Suppose a dataset has 10 tuples with two class values, Y and N.
If the dataset has all Y’s or all N’s, then the dataset is purest and the value of info is 0.
If the dataset has 5 Y’s and 5 N’s, this is the most impure case, and the value of info is 1.
42
Attribute Selection Measure: Info Examples
Notation: I(x1, x2, …, xm), where
m is the number of distinct class values,
xi is the number of tuples belonging to the i-th class
|D| = x1 + x2 + … + xm
Computation of info:
I(x1, x2, …, xm) = −Σ_{i=1}^{m} pi log2(pi), where pi = xi / |D|
Info is also referred to as entropy
43
Attribute Selection Measure: Info Examples
Computing log base-2:
log2 x = log10 x / log10 2 = log10 x / 0.301
Example: log2 0.6 = log10 0.6 / log10 2 = −0.2218 / 0.301 = −0.7369
44
Attribute Selection Measure: Info Examples
A dataset has 6 tuples with 5 Y’s and 1 N:
I(5,1) = −(5/6) log2(5/6) − (1/6) log2(1/6) = 0.650
A dataset has 10 tuples, all Y’s:
I(10,0) = −(10/10) log2(10/10) − (0/10) log2(0/10) = 0
A dataset has 10 tuples with 5 Y’s and 5 N’s:
I(5,5) = −(5/10) log2(5/10) − (5/10) log2(5/10) = 1
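These values can be checked with a few lines of Python (a sketch; the function name info is mine):

```python
from math import log2

def info(counts):
    """I(x1, ..., xm): entropy of a node with the given class counts (0 * log2 0 treated as 0)."""
    total = sum(counts)
    return sum(-(x / total) * log2(x / total) for x in counts if x > 0)

print(round(info([5, 1]), 3))    # 0.65
print(info([10, 0]))             # 0.0 (purest)
print(info([5, 5]))              # 1.0 (most impure)
```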
45
Attribute Selection Measure: Information Gain (ID3/C4.5)
Compute the information gain of each attribute
Select the attribute with the highest information gain
Informal description of information gain:
Info(D): The amount of information we need to classify a sample in D.
Example: we need 0.9
InfoA(D): After splitting D on an attribute A, the amount of information needed to classify a sample.
Example: after splitting on A, now we need only 0.3
Gain(A) = Info(D) – InfoA(D)
Example: the gain is 0.9 – 0.3 = 0.6
46
Attribute Selection Measure: Information Gain (ID3/C4.5)
Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci, D|/|D|
Expected information (entropy) needed to classify a tuple in D (m: # of classes):
Info(D) = −Σ_{i=1}^{m} pi log2(pi)
Information needed to classify D, after using attribute A to split D into v partitions (v: # of distinct values of A):
InfoA(D) = Σ_{j=1}^{v} (|Dj| / |D|) × Info(Dj)
Information gained by splitting by A:
Gain(A) = Info(D) − InfoA(D)
47
Attribute Selection Measure: Information Gain (ID3/C4.5)
Let D be the Buys_computer dataset (the 14-tuple training dataset shown earlier, with attributes age, income, student, credit_rating and class buys_computer).
There are 14 tuples in D: 9 Yes’s and 5 No’s.
Info(D) = I(9,5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940
48
Attribute Selection Measure: Information Gain (ID3/C4.5)
If we split D based on Age, three partitions are created.
Let’s call them Node‐1 (≤ 30), Node‐2 (31..40), and Node‐3(> 40).
49
Attribute Selection Measure: Information Gain (ID3/C4.5)
Info of Node-1 is: I(2,3) = −(2/5) log2(2/5) − (3/5) log2(3/5) = 0.971
Info of Node-2 is: I(4,0) = 0
Info of Node-3 is: I(3,2) = I(2,3) = 0.971
After splitting by Age, the amount of information needed to classify a tuple is computed as the weighted average of the above three:
Infoage(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
So, the information gain obtained by splitting by Age is:
Gain(age) = Info(D) − Infoage(D) = 0.246
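The same numbers can be reproduced with the info() helper sketched earlier (repeated here so the snippet stands alone):

```python
from math import log2

def info(counts):
    # Entropy I(x1, ..., xm) of a node with the given class counts (0 * log2 0 treated as 0).
    total = sum(counts)
    return sum(-(x / total) * log2(x / total) for x in counts if x > 0)

info_D = info([9, 5])
info_age = (5/14) * info([2, 3]) + (4/14) * info([4, 0]) + (5/14) * info([3, 2])
print(round(info_D, 3))               # 0.94
print(round(info_age, 3))             # 0.694
print(round(info_D - info_age, 3))    # 0.247 (the slides' 0.246 comes from subtracting the rounded values 0.940 - 0.694)
```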
50
Information Gain – Another Example
If we split by income
51
Information Gain – Another Example
high: 4 tuples; 2 yes’s and 2 no’s
medium: 6 tuples; 4 yes’s and 2 no’s
low: 4 tuples; 3 yes’s and 1 no
I(2,2) = −(2/4) log2(2/4) − (2/4) log2(2/4) = 1.0
I(4,2) = −(4/6) log2(4/6) − (2/6) log2(2/6) = 0.918
I(3,1) = −(3/4) log2(3/4) − (1/4) log2(1/4) = 0.811
Infoincome(D) = (4/14) I(2,2) + (6/14) I(4,2) + (4/14) I(3,1) = 0.911
Gain(income) = Info(D) − Infoincome(D) = 0.940 − 0.911 = 0.029
52
Attribute Selection: Information Gain
We can compute Gain(student) and Gain(credit_rating) in the same way.
Then, we have:
Gain(Age) = 0.246
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
We select Age as the test attribute because it has the highest information gain.
53
Gain Ratio for Attribute Selection (C4.5)
Information gain measure is biased towards attributes with a large number of values.
C4.5 (a successor of ID3) uses gain ratio to overcome the problem.
SplitInfoA(D) = −Σ_{j=1}^{v} (|Dj| / |D|) × log2(|Dj| / |D|)
Here, attribute A has v distinct values, and |Dj| is the number of tuples with the j-th value.
GainRatio(A) = Gain(A) / SplitInfoA(D)
The attribute with the maximum gain ratio is selected as the splitting attribute.
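As a worked illustration using the Age split from the earlier example (my own sketch; the SplitInfo value itself does not appear on the slides):

```python
from math import log2

def split_info(partition_sizes):
    # SplitInfoA(D) for a split of D into partitions of the given sizes.
    total = sum(partition_sizes)
    return sum(-(s / total) * log2(s / total) for s in partition_sizes if s > 0)

# Age splits the 14-tuple dataset into partitions of sizes 5, 4, and 5, and Gain(age) = 0.246.
si = split_info([5, 4, 5])
print(round(si, 3), round(0.246 / si, 3))   # 1.577 0.156
```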
54
Other Attribute Selection Measures
GINI index
CHAID
C-SEP
G-statistic
MDL (Minimal Description Length) principle
Multivariate splits (partition based on multiple variable combinations)
Example: CART
Which attribute selection measure is the best?
Most give good results; none is significantly superior to the others
55
Overfitting and Tree Pruning
Overfitting:
A model reflects every detail of the training dataset, even anomalies or noise.
A model becomes complex.
Accuracy on the training dataset is high.
Accuracy on the test dataset becomes low.
So, an overfitted model does not generalize well.
56
Overfitting and Tree Pruning
Overfitting of decision tree:
Too many branches, some may reflect anomalies due to noise or outliers
Poor accuracy for unseen samples
Two approaches to avoid overfitting
Prepruning: Halt tree construction early – do not split a node if this would result in the goodness measure falling below a threshold
Difficult to choose an appropriate threshold
Postpruning (more common): Remove branches from a “fully grown” tree – get a sequence of progressively pruned trees
Use a set of data different from the training data to decide which is the “best pruned tree”
57
References
• Han, J., Kamber, M., Pei, J., “Data Mining: Concepts and Techniques,” 3rd Ed., Morgan Kaufmann, 2012; http://www.cs.illinois.edu/~hanj/bk3/
• Witten, I. H., Frank, E., and Hall, M. A., “Data Mining: Practical Machine Learning Tools and Techniques,” 3rd Ed., Morgan Kaufmann, 2011
58