Data Mining (EECS 4412)
Decision Tree Learning
Parke Godfrey
EECS
Lassonde School of Engineering, York University
Thanks to
Professor Aijun An
for curation & use of these slides.
2
Outline
Overview of classification
Basic concepts in decision tree learning
  Data representation in decision tree learning
  What is a decision tree? (decision tree representation)
How to learn a decision tree from data
  Basic decision tree learning algorithm
  How to select the best attribute
  Pruning the decision tree
Other issues involved in decision tree learning
3
Classification: A Two-Step Process
Model construction (i.e., learning):
  Learn a model from a training set (a set of pre-classified training examples); this is supervised learning.
  The model can be represented as classification rules, decision trees, neural networks, mathematical formulae, etc.
Model usage (i.e., prediction or classification): classify future or unknown objects; this is the main purpose.
  Test the learned model on a test set (another set of pre-classified examples) to estimate the accuracy of the model:
    The known class label of each test example is compared with the classification produced by the model.
    The accuracy rate is the percentage of test examples that are correctly classified by the model.
    The test set must be independent of the training set; otherwise the accuracy estimate is not reliable.
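As a minimal illustration (in Python; the label values below are made up for this example), the accuracy rate is just the fraction of test examples whose predicted class matches the known class:

```python
def accuracy(true_labels, predicted_labels):
    """Accuracy rate = percentage of test examples classified correctly."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Example: 4 of 5 test examples classified correctly -> 0.8
print(accuracy(["yes", "no", "yes", "yes", "no"],
               ["yes", "no", "no",  "yes", "no"]))
```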
4
Classification Process (1): Model Construction
[Figure: Training Data -> Classification Learning Algorithm -> Classifier (Model)]
NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
5
Classification Process (2): Use the Model in Prediction
[Figure: the learned classifier (model) is applied to the testing data (below) and to unseen data]
NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes
Unseen data: (Jeff, Professor, 4) -> Tenured?
6
Classification Learning Techniques
Decision tree learning
Decision rule learning
Bayesian classification
Neural networks
K-nearest neighbor method
Support vector machines (SVM)
Genetic algorithms
etc.
7
Decision Tree Learning
Objective of decision tree learning
Learn a decision tree from a set of training data
The decision tree can be used to classify new examples
Decision tree learning algorithms:
  ID3 (Quinlan, 1986)
  C4.5 (Quinlan, 1993)
  CART (Breiman, Friedman, et al., 1984)
  CHAID (Kass, 1980)
  QUEST (Loh and Shih, 1997)
  etc.
8
Representation of Training Examples
Condition attributes: Outlook, Temperature, Humidity, Wind
Class/Target/Decision attribute: PlayTennis
Training examples or cases:

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rainy     Mild         High      Weak    Yes
D5   Rainy     Cool         Normal    Weak    Yes
D6   Rainy     Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rainy     Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rainy     Mild         High      Strong  No
9
Decision Tree Representation
A decision tree is a representation of classification knowledge:
  Each non-leaf (internal) node tests an attribute (e.g., Outlook, Humidity, Wind)
  Each branch corresponds to an attribute value
  Each leaf node assigns a classification

Classification: a new case is classified by testing the case against the nodes from the root down to a leaf node; the classification associated with that leaf is returned. For example,
  <Outlook = Sunny, Temperature = Mild, Humidity = High, Wind = Strong>  ->  No

Outlook
  Sunny    -> Humidity
                High   -> No
                Normal -> Yes
  Overcast -> Yes
  Rainy    -> Wind
                Strong -> No
                Weak   -> Yes

This tree classifies days according to whether or not they are suitable for playing tennis.
10
Another Example of Training Dataset
This follows an example from Quinlan’s ID3
age      income   student  credit_rating  buys_computer
<=30     high     no       fair           no
<=30     high     no       excellent      no
31...40  high     no       fair           yes
>40      medium   no       fair           yes
>40      low      yes      fair           yes
>40      low      yes      excellent      no
31...40  low      yes      excellent      yes
<=30     medium   no       fair           no
<=30     low      yes      fair           yes
>40      medium   yes      fair           yes
<=30     medium   yes      excellent      yes
31...40  medium   no       excellent      yes
31...40  high     yes      fair           yes
>40      medium   no       excellent      no
11
Output: A Decision Tree for “buys_computer”
age?
  <=30     -> student?
                no  -> no
                yes -> yes
  31...40  -> yes
  >40      -> credit rating?
                excellent -> no
                fair      -> yes
12
Outline
Overview of classification
Basic concepts in decision tree learning
  Data representation in decision tree learning
  What is a decision tree? (decision tree representation)
How to learn a decision tree from data
  Basic decision tree learning algorithm
  How to select the best attribute
  Pruning the decision tree
Other issues involved in decision tree learning
13
Top-Down Induction of a Decision Tree
Idea: divide and conquer.
Method: recursive partitioning of the training data according to the attribute selected for a tree node.

(Training data: the 14 PlayTennis examples D1-D14 shown earlier.)

(D1, D2, ..., D14)
Outlook
  Sunny    -> Humidity            (D1, D2, D8, D9, D11)
                High   -> No         (D1-, D2-, D8-)
                Normal -> Yes        (D9+, D11+)
  Overcast -> Yes                 (D3, D7, D12, D13)
  Rainy    -> Wind                (D4, D5, D6, D10, D14)
                Strong -> No         (D6-, D14-)
                Weak   -> Yes        (D4+, D5+, D10+)
14
Three Issues in Decision Tree Induction
How to select an attribute for a node
When to declare a node terminal
  A naïve, but not robust, method: stop growing when the node is pure.
How to assign a class to a leaf node
  Assign the most common class among the examples in the node, or
  output the probability of each class.

(Illustrated on the partially grown Outlook tree of the previous slide.)
15
How to Select Attribute
Which attribute is the best attribute given a set of attributes and a set of examples?
Split on A1:  S = (29+, 35-)   t -> (21+, 5-)    f -> (8+, 30-)
Split on A2:  S = (29+, 35-)   t -> (18+, 33-)   f -> (11+, 2-)
Many selection criteria, including:
Information gain (Quinlan, 1983; used in ID3)
Gain ratio (Quinlan, 1986; used in C4.5)
Gini index (Breiman, 1984; used in CART)
Chi-square statistic (Kass, 1980; used in CHAID; Mingers, 1989)
Binarization (Bratko & Kononenko, 86)
Normalized information gain (Lopez de Mantaras, 91)
16
Information Gain
Objective: select an attribute so that the data in each of the descendant subsets are the "purest".
An ideal split:  (29+, 35-)  ->  (29+, 0-)  and  (0+, 35-)
Based on the concept of entropy
Entropy is a measure, commonly used in information
theory, that characterizes the impurity (uncertainty, chaos) of an arbitrary collection of examples.
17
Entropy
Given a set S of examples and k classes (C1, ..., Ck), the entropy of S with respect to the k classes is defined as:

  Entropy(S) = - \sum_{i=1}^{k} P(C_i) \log_2 P(C_i)

where P(C_i) is the probability that an example in S belongs to C_i.

The bigger Entropy(S) is, the more impure S is.

Examples:
  If all examples in S belong to the same class (i.e., S is pure), Entropy(S) = 0.
  If half of the examples in S belong to class 1 and the other half belong to class 2, Entropy(S) = 1.
  If 9 examples are in class 1 and 5 examples are in class 2,
    Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940.
  If the examples are uniformly distributed over 3 classes,
    Entropy(S) = -((1/3) log2(1/3)) * 3 = log2(3) = 1.59.
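A minimal Python sketch of this entropy measure, reproducing the numbers above (the function name and the counts-based interface are our own choices, not from the slides):

```python
import math

def entropy(class_counts):
    """Entropy of a collection, given the number of examples in each class."""
    total = sum(class_counts)
    probs = [c / total for c in class_counts if c > 0]   # empty classes contribute 0 (0 * log 0 = 0)
    if len(probs) <= 1:                                   # pure set
        return 0.0
    return -sum(p * math.log2(p) for p in probs)

print(entropy([14, 0]))    # pure set                   -> 0.0
print(entropy([7, 7]))     # two equally likely classes -> 1.0
print(entropy([9, 5]))     # the slide's example        -> 0.940 (approximately)
print(entropy([1, 1, 1]))  # uniform over 3 classes     -> log2(3) = 1.585 (approximately)
```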
18
Information Gain (Cont'd)
An attribute-selection criterion used to choose an attribute on which to split a data set.
Assume that attribute A has m values; using A, data set S is split into S1, S2, ..., Sm.
Information gain: Gain(S, A) = expected reduction in entropy due to partitioning S on attribute A:

  Gain(S, A) = Entropy(S) - \sum_{i=1}^{m} (|S_i| / |S|) Entropy(S_i)

where |S| is the number of examples in set S, and |S_i| is the number of examples in S_i.
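A possible Python sketch of Gain(S, A); the dictionary-based example representation (attribute name -> value) is our assumption, not something prescribed by the slides:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Gain(S, A) = Entropy(S) - sum_i (|S_i|/|S|) * Entropy(S_i),
    where S is split into subsets S_i by the values of `attribute`.
    `examples` is a list of dicts mapping attribute names to values."""
    total = len(labels)
    subsets = {}
    for x, y in zip(examples, labels):
        subsets.setdefault(x[attribute], []).append(y)
    remainder = sum(len(sub) / total * entropy(sub) for sub in subsets.values())
    return entropy(labels) - remainder
```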
19
An illustrative example
Training examples: the 14 PlayTennis examples D1-D14, as listed in the earlier table.
20
Which attribute is the best for the root?

S: [9+, 5-], Entropy = 0.940

Outlook:      Sunny    -> SSunny:    [2+, 3-], Entropy = 0.97095
              Overcast -> SOvercast: [4+, 0-], Entropy = 0
              Rainy    -> SRain:     [3+, 2-], Entropy = 0.97095
  Gain(S, Outlook) = 0.940 - (5/14)(0.97095) - (4/14)(0) - (5/14)(0.97095) = 0.2467

Humidity:     High   -> SHigh:   [3+, 4-], Entropy = 0.985
              Normal -> SNormal: [6+, 1-], Entropy = 0.592
  Gain(S, Humidity) = 0.940 - (7/14)(0.985) - (7/14)(0.592) = 0.151

Wind:         Weak   -> SWeak:   [6+, 2-], Entropy = 0.811
              Strong -> SStrong: [3+, 3-], Entropy = 1
  Gain(S, Wind) = 0.940 - (8/14)(0.811) - (6/14)(1.0) = 0.048

Temperature:  Hot  -> SHot:  [2+, 2-], Entropy = 1
              Mild -> SMild: [4+, 2-], Entropy = 0.91826
              Cool -> SCool: [3+, 1-], Entropy = 0.811
  Gain(S, Temperature) = 0.940 - (4/14)(1) - (6/14)(0.91826) - (4/14)(0.811) = 0.029

Outlook has the highest information gain, so it is selected as the root attribute.
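These calculations can be reproduced with the entropy/gain sketches from the previous slides; the script below (our own illustration) encodes the 14 PlayTennis examples and prints the four gains:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, attr):
    """Information gain of splitting the labeled rows on `attr`."""
    n, subsets = len(labels), {}
    for row, y in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(y)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets.values())

# The 14 PlayTennis examples D1-D14 as (Outlook, Temperature, Humidity, Wind, PlayTennis).
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),          ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),      ("Rainy", "Mild", "High", "Weak", "Yes"),
    ("Rainy", "Cool", "Normal", "Weak", "Yes"),      ("Rainy", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),      ("Rainy", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),    ("Rainy", "Mild", "High", "Strong", "No"),
]
attrs = ["Outlook", "Temperature", "Humidity", "Wind"]
rows = [dict(zip(attrs, r[:4])) for r in data]
labels = [r[4] for r in data]

for a in attrs:
    print(a, round(gain(rows, labels, a), 4))
# Outlook 0.2467, Temperature 0.0292, Humidity 0.1518, Wind 0.0481 -> Outlook is selected
```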
21
An illustrative example (Cont’d.)
S: {D1, D2, ..., D14} [9+, 5-]

Outlook
  Sunny    -> SSunny:    {D1, D2, D8, D9, D11}   [2+, 3-], Entropy = 0.97095  -> ?
  Overcast -> SOvercast: {D3, D7, D12, D13}      [4+, 0-], Entropy = 0        -> Yes
  Rainy    -> SRain:     {D4, D5, D6, D10, D14}  [3+, 2-], Entropy = 0.97095  -> ?

Which attribute should be tested at the Sunny branch: Humidity, Temperature, or Wind?

Gain(SSunny, Humidity)    = 0.97095 - (3/5)(0.0) - (2/5)(0.0)              = 0.97095
Gain(SSunny, Temperature) = 0.97095 - (2/5)(0.0) - (2/5)(1.0) - (1/5)(0.0) = 0.57095
Gain(SSunny, Wind)        = 0.97095 - (2/5)(1.0) - (3/5)(0.918)            = 0.02015

Therefore, Humidity is chosen as the next test attribute for the Sunny (left) branch.
22
An illustrative example (Cont’d.)
S: {D1, D2, ..., D14} [9+, 5-]

Outlook
  Sunny    -> Humidity         SSunny: {D1, D2, D8, D9, D11}  [2+, 3-], Entropy = 0.97095
                High   -> No       {D1, D2, D8}   [0+, 3-]
                Normal -> Yes      {D9, D11}      [2+, 0-]
  Overcast -> Yes              SOvercast: {D3, D7, D12, D13}  [4+, 0-], Entropy = 0
  Rainy    -> Wind             SRain: {D4, D5, D6, D10, D14}  [3+, 2-], Entropy = 0.97095
                Strong -> No       {D6, D14}      [0+, 2-]
                Weak   -> Yes      {D4, D5, D10}  [3+, 0-]
23
Basic Decision Tree Learning Algorithm
1. Select the "best" attribute A for the root node.
2. Create new descendants of the node, one for each value of A.
3. Sort the training examples to the descendant nodes.
4. For each descendant node (this step decides when to terminate the recursive process):
   if the training examples associated with the node all belong to the same class, mark the node as a leaf and label it with that class;
   else if there are no remaining attributes on which the examples can be further partitioned, mark the node as a leaf and label it with the most common class among its training examples (to be used for classification);
   else if there are no examples for the node, mark the node as a leaf and label it with the majority class of its parent node;
   otherwise, recursively apply the process to the new node.
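A compact Python sketch of this recursive procedure, using information gain as the selection criterion. The dict-based tree representation, and handling the "no examples for a value" case by falling back to the node's stored majority class ("default") at classification time, are our own design choices, not part of the slides:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, attr):
    n, subsets = len(labels), {}
    for row, y in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(y)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets.values())

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, labels, attributes):
    """Return a leaf (class label) or an internal node
    {'attr': A, 'default': majority class, 'branches': {value: subtree}}."""
    if len(set(labels)) == 1:          # all examples in the same class -> leaf
        return labels[0]
    if not attributes:                 # no attributes left -> leaf with most common class
        return majority(labels)
    best = max(attributes, key=lambda a: gain(rows, labels, a))   # step 1: best attribute
    node = {"attr": best, "default": majority(labels), "branches": {}}
    for v in {row[best] for row in rows}:        # steps 2-3: one branch per observed value
        sub = [(r, y) for r, y in zip(rows, labels) if r[best] == v]
        sub_rows, sub_labels = [r for r, _ in sub], [y for _, y in sub]
        remaining = [a for a in attributes if a != best]
        node["branches"][v] = build_tree(sub_rows, sub_labels, remaining)   # step 4: recurse
    return node

def classify(tree, row):
    """Follow the branches; unseen values fall back to the node's majority class."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(row[tree["attr"]], tree["default"])
    return tree
```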
24
Outline
Overview of classification
Basic concepts in decision tree learning
  Data representation in decision tree learning
  What is a decision tree?
How to learn a decision tree from data
  Basic decision tree learning algorithm
  How to select the best attribute
    Information gain
    Gain ratio
    Gini index
  Pruning the decision tree
    Pre-pruning
    Post-pruning
Other issues involved in decision tree learning
25
Bias in the Information Gain Measure
Information gain unfairly favors attributes with large numbers of distinct values at the expense of those with few.
E.g., the attribute Day in the PlayTennis table: it is a poor predictor, but it has the highest gain because it alone perfectly predicts the target attribute over the training data (each of its 14 values identifies exactly one training example).
26
Gain Ratio
Proposed by Quinlan in 1986 (used in C4.5)
Idea:
penalizes attributes with many distinct values by dividing the information gain by the attribute information (the entropy of the data with respect to the values of the attribute):

  SplitInformation(S, A) = - \sum_{v \in Values(A)} (|S_v| / |S|) \log_2 (|S_v| / |S|) = - \sum_{v \in Values(A)} P(v) \log_2 P(v)

  GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
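A small Python sketch of these two formulas; gain_fn is assumed to be an information-gain function such as the one sketched earlier, and the dict-based example representation is again our own choice:

```python
import math
from collections import Counter

def split_information(rows, attr):
    """SplitInformation(S, A) = -sum_v P(v) log2 P(v) over the values v of attribute A."""
    n = len(rows)
    counts = Counter(row[attr] for row in rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain_ratio(rows, labels, attr, gain_fn):
    """GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)."""
    si = split_information(rows, attr)
    return gain_fn(rows, labels, attr) / si if si > 0 else 0.0
```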
27
Gini Index
Gini diversity index (used by CART): another measure of the impurity of a data set.
S is a set of training examples associated with a node
Suppose there are n classes: Ci (i = 1, …, n)
P(Ci) is the probability of examples in S that belong to Ci.
The Gini impurity of S with respect to the classes can be measured as:

  i(S) = \sum_{j \ne i} P(C_j) P(C_i) = 1 - \sum_{j=1}^{n} (P(C_j))^2

Similar to entropy:
  minimized (0) if all examples belong to the same class;
  maximized if the classes occur in equal proportions.

Attribute selection using the Gini index chooses the attribute A that most reduces the impurity due to partitioning S on A:

  \Delta i(S, A) = i(S) - \sum_{v \in Values(A)} (|S_v| / |S|) i(S_v)
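A minimal Python sketch of the Gini impurity and of the impurity reduction Δi(S, A), using the same dict-based example representation as the earlier sketches:

```python
from collections import Counter

def gini(labels):
    """i(S) = 1 - sum_j P(C_j)^2: 0 for a pure set, largest for equal class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_reduction(rows, labels, attr):
    """Delta i(S, A) = i(S) - sum_v (|S_v|/|S|) i(S_v)."""
    n, subsets = len(labels), {}
    for row, y in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(y)
    return gini(labels) - sum(len(s) / n * gini(s) for s in subsets.values())

print(gini(["yes"] * 9 + ["no"] * 5))   # 1 - (9/14)^2 - (5/14)^2 = 0.459 (approximately)
```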
28
Outline
Overview of classification
Basic concepts in decision tree learning
  Data representation in decision tree learning
  What is a decision tree? (decision tree representation)
How to learn a decision tree from data
  Basic decision tree learning algorithm
  How to select the best attribute
  Pruning the decision tree
Other issues involved in decision tree learning
29
Basic Decision Tree Learning Algorithm (Review)
1. Select the "best" attribute A for the root node.
2. Create new descendants of the node, one for each value of A.
3. Sort the training examples to the descendant nodes.
4. For each descendant node (this step decides when to terminate the recursive process):
   if the training examples associated with the node all belong to the same class, mark the node as a leaf and label it with that class;
   else if there are no remaining attributes on which the examples can be further partitioned, mark the node as a leaf and label it with the most common class among its training examples (to be used for classification);
   else if there are no examples for the node, mark the node as a leaf and label it with the majority class of its parent node;
   otherwise, recursively apply the process to the new node.
30
Overfitting Problem
When to declare a node terminal in the basic algorithm:
grow each branch of the tree just deeply enough to classify the training examples as perfectly as possible.
This strategy leads to producing deep branches that cover very few examples.
This kind of tree overfits the training data in the following two situations:
  there is noise in the data -> the tree fits the noise;
  the number of training examples is too small to produce a representative sample of the true target function -> the tree is too specific to classify future examples well.
31
Overfitting Problem (Cont'd)
A definition of overfitting:
  Consider the error of a model (e.g., a tree) h over
    the training data: error_train(h)
    the entire distribution D of the data: error_D(h)
  Model h overfits the training data if there is an alternative model h' such that
    error_train(h) < error_train(h')  and  error_D(h) > error_D(h')
32
Overfitting in Decision Tree Learning (An Example Diagram)
[Figure: accuracy (= 1 - error rate) plotted against the size of the tree (number of nodes), measured on the training data and on the test data.]
33
Preventing Overfitting
Pre-pruning: stop growing the tree when a data split is not statistically significant.
For example:
  set a threshold α > 0 and declare a node terminal if the percentage of examples in its most common class exceeds α, or
  set a threshold β > 0 and declare a node terminal if the highest information gain is below β.
Problems:
  It is hard to set the threshold values.
  Depending on the threshold, splitting is either stopped too soon or continued too far.
  Using more complicated stopping rules does not help.
34
Preventing Overfitting (Cont’d)
Post-pruning: grow full tree, allow it to overfit the data, and then remove some subtrees
[Figure: a fully grown tree (nodes A, B, C, D, E, ..., V); two of its subtrees are shown being pruned away.]
More successful in practice
Criterion is needed to determine what to prune
35
Post-pruning Functions
Post-pruning functions are needed to determine which part(s) of the tree should be pruned
Several post-pruning functions, including:
  Reduced-error (Quinlan, 87)
  Error-complexity (CART, 84)
  Pessimistic error (Quinlan, 86)
  Minimum description length (MDL) (SLIQ by Mehta et al., 1996)
  Minimum error (Niblett & Bratko, 86)
  Critical value (Mingers, 87)
There is no single best pruning algorithm
36
Reduced-Error Pruning
Procedure
  Split the training data into a growing set and a pruning set.
  Generate an overfitted decision tree using the growing set.
  Post-pruning: do until further pruning is harmful:
    Consider each internal non-root node in the tree as a candidate for pruning.
    Pruning a node means removing the subtree rooted at that node, making it a leaf node, and assigning it the most common classification of the training examples affiliated with the node.
    Evaluate the impact of pruning each candidate on the pruning set by computing the classification error rate of the pruned tree on the pruning set and comparing it with the error rate of the unpruned tree.
    Greedily remove the node whose removal most reduces the error on the pruning set.
Aim: produce the smallest version of the most accurate subtree (though the greedy search may be sub-optimal).
Drawback: what if data is limited? Setting aside a pruning set leaves fewer examples for growing the tree.
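A simplified Python sketch of reduced-error pruning over the dict-based trees produced by the earlier build_tree sketch; the candidate enumeration and tie-breaking details are our own simplifications, not part of the slides:

```python
import copy

# Assumes the tree format from the earlier build_tree sketch:
# a leaf is a class label; an internal node is {"attr": ..., "default": ..., "branches": {...}}.

def classify(tree, row):
    while isinstance(tree, dict):
        tree = tree["branches"].get(row[tree["attr"]], tree["default"])
    return tree

def error_rate(tree, rows, labels):
    wrong = sum(classify(tree, r) != y for r, y in zip(rows, labels))
    return wrong / len(labels)

def internal_nodes(tree, path=()):
    """Yield (path, node) for every internal node, root included."""
    if isinstance(tree, dict):
        yield path, tree
        for value, sub in tree["branches"].items():
            yield from internal_nodes(sub, path + (value,))

def prune_at(tree, path):
    """Return a copy of the tree with the node at `path` replaced by a leaf
    labeled with the most common class of its training examples (its stored 'default')."""
    pruned = copy.deepcopy(tree)
    if not path:
        return pruned["default"]
    node = pruned
    for value in path[:-1]:
        node = node["branches"][value]
    node["branches"][path[-1]] = node["branches"][path[-1]]["default"]
    return pruned

def reduced_error_prune(tree, prune_rows, prune_labels):
    """Greedily prune internal non-root nodes while pruning does not increase
    the error on the pruning set."""
    while isinstance(tree, dict):
        base = error_rate(tree, prune_rows, prune_labels)
        candidates = [prune_at(tree, p) for p, _ in internal_nodes(tree) if p]   # non-root only
        candidates = [c for c in candidates if error_rate(c, prune_rows, prune_labels) <= base]
        if not candidates:
            break
        tree = min(candidates, key=lambda c: error_rate(c, prune_rows, prune_labels))
    return tree
```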
37
Error-Complexity Pruning
Similar procedure to reduced-error pruning:
  Split the training data into growing and pruning sets.
  Generate the decision tree with the growing set.
  Post-pruning: do until further pruning is harmful:
    Evaluate the impact of pruning each subtree on the pruning set.
    Greedily remove the one whose removal minimizes the following expression on the pruning set:

      R_α(T) = E(T) + α L(T)

    where E(T) is the classification error of tree T, L(T) is the number of leaf nodes in T, and α is the complexity cost per leaf node. R_α(T) is thus a linear combination of the classification error rate of the tree and its complexity.
Like reduced-error pruning, it is not suitable if the training data set is small.
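A tiny illustration of the cost being minimized; the error rates, leaf counts, and α below are made-up numbers, not from the slides:

```python
def error_complexity_cost(error_rate, num_leaves, alpha):
    """R_alpha(T) = E(T) + alpha * L(T): classification error plus a complexity
    penalty of `alpha` per leaf node."""
    return error_rate + alpha * num_leaves

# Example: a 12-leaf tree with 8% error vs. a pruned 5-leaf tree with 10% error, alpha = 0.01.
print(error_complexity_cost(0.08, 12, 0.01))  # -> 0.20
print(error_complexity_cost(0.10, 5, 0.01))   # -> 0.15 (approximately): the pruned tree is preferred
```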
38
Pessimistic-Error Pruning
This method does not require a separate pruning set.
Suppose a subtree Ts contains L leaves, and J of the training examples associated with Ts are misclassified.
If we replace Ts with a leaf that misclassifies E of the associated training examples, the pruned tree is accepted if

  E + 1/2 < J + L/2
Advantage
fast: no need to separate growing and pruning sets. Tree building makes use of all the training data.
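A one-line check of this acceptance condition; the example numbers below are illustrative only:

```python
def pessimistic_prune_subtree(J, L, E):
    """Accept replacing a subtree (L leaves, J misclassified training examples)
    with a leaf that would misclassify E examples if E + 1/2 < J + L/2."""
    return E + 0.5 < J + L / 2.0

# Example: subtree with 4 leaves misclassifying 1 example; the replacement leaf
# would misclassify 2 examples -> 2.5 < 3.0, so the subtree is pruned.
print(pessimistic_prune_subtree(J=1, L=4, E=2))  # True
```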
39
Outline
Overview of classification
Basic concepts in decision tree learning
  Data representation in decision tree learning
  What is a decision tree? (decision tree representation)
How to learn a decision tree from data
  Basic decision tree learning algorithm
  How to select the best attribute
  Pruning the decision tree
Other issues involved in decision tree learning
40
Some Other Issues in Decision Tree Learning
Convert decision trees to a set of rules
How to deal with continuous attributes
How to scale up decision tree learning
Look-ahead approaches
  Search for the best sequence of individual tests.
Multiple attributes per test
These approaches tend to have greatly increased complexities (i.e., larger search spaces).
41
Convert Decision Trees to a Set of Rules
Each branch from the root to a leaf can be transformed into an if-then rule.

Outlook
  Sunny    -> Humidity
                High   -> No
                Normal -> Yes
  Overcast -> Yes
  Rainy    -> Wind
                Strong -> No
                Weak   -> Yes

If (Outlook is Sunny and Humidity is High), then class is No.
If (Outlook is Sunny and Humidity is Normal), then class is Yes.
If (Outlook is Overcast), then class is Yes.
......
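A small Python sketch of this conversion for the dict-based tree representation used in the earlier sketches; the hand-written tennis_tree below mirrors the tree shown above:

```python
def tree_to_rules(tree, conditions=()):
    """Each root-to-leaf path becomes an if-then rule: the attribute tests along
    the path form the condition, and the leaf's class label is the conclusion."""
    if not isinstance(tree, dict):                       # leaf: emit one rule
        return [(list(conditions), tree)]
    rules = []
    for value, subtree in tree["branches"].items():
        rules += tree_to_rules(subtree, conditions + ((tree["attr"], value),))
    return rules

tennis_tree = {"attr": "Outlook", "default": "Yes", "branches": {
    "Sunny": {"attr": "Humidity", "default": "No",
              "branches": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rainy": {"attr": "Wind", "default": "Yes",
              "branches": {"Strong": "No", "Weak": "Yes"}},
}}
for conds, label in tree_to_rules(tennis_tree):
    print("IF", " AND ".join(f"{a} is {v}" for a, v in conds), "THEN class is", label)
```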
42
How to deal with continuous attributes
Two ways:
Discretization before learning decision tree
Converts a continuous attribute into a discrete attribute by partitioning the range of the continuous attribute into intervals.
Interval labels can then be used to replace actual data values.
Dynamic discretization
Dynamically split the value range into two sub-ranges during the tree learning process.
43
Dynamic Discretization
Dynamically split the value range into two sub-ranges; each descendant node corresponds to a sub-range.
Choose to split at the middle value between two examples which are in different categories
Temperature   40   48   60   70   80   90
PlayTennis?   No   No   Yes  Yes  Yes  No

The possible split points are (48 + 60)/2 = 54 and (80 + 90)/2 = 85.
Evaluate the binary splitting points using the splitting criterion used for selecting attribute. For example, choose the splitting point that leads to the best information gain.
44
Dynamic Discretization Example

Temperature   40   48   60   70   80   90
PlayTennis?   No   No   Yes  Yes  Yes  No

The possible split points are (48 + 60)/2 = 54 and (80 + 90)/2 = 85.

Possible binary splits:
  Temperature <= 54  vs.  Temperature > 54
  Temperature <= 85  vs.  Temperature > 85

Choose the one that gives the better information gain to compete with the other attributes.
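A short Python sketch of this procedure: find the candidate split points (midpoints between consecutive values whose classes differ) and pick the one with the best information gain. The function names are our own:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def candidate_splits(values, labels):
    """Midpoints between consecutive (sorted) values whose class labels differ."""
    pairs = sorted(zip(values, labels))
    return [(pairs[i][0] + pairs[i + 1][0]) / 2
            for i in range(len(pairs) - 1)
            if pairs[i][1] != pairs[i + 1][1]]

def best_binary_split(values, labels):
    """Pick the threshold t (from the candidates) with the highest information gain
    for the binary test value <= t."""
    def gain(t):
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        n = len(labels)
        return entropy(labels) - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
    return max(candidate_splits(values, labels), key=gain)

temperature = [40, 48, 60, 70, 80, 90]
play        = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(candidate_splits(temperature, play))   # [54.0, 85.0]
print(best_binary_split(temperature, play))  # 54.0 (gain 0.459 vs. 0.191 for 85.0)
```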
45
Scalable Decision Tree Learning Methods
Scalability: deal with millions of examples and hundreds of attributes with reasonable speed
Most algorithms assume data can fit in memory.
Data mining research contributes to the scalability issue, especially for decision trees.
Successful examples:
  SLIQ (Mehta et al., 1996)
  SPRINT (Shafer et al., 1996)
  RainForest (Gehrke et al., 1998)
46
Some Other Issues in Decision Tree Learning
Convert decision trees to a set of rules
How to deal with continuous attributes
How to scale up decision tree learning
Look-ahead approaches
  Search for the best sequence of individual tests.
Multiple attributes per test/node
These approaches tend to have greatly increased complexities (i.e., larger search spaces).
47
Decision Trees: Strengths
Comprehensibility: Small trees are highly interpretable and intuitive for humans
Fast classification
Relatively fast induction
Mature technology
48
Decision Trees: Weaknesses
Trees can become incomprehensible as their size grows.
Rules converted from a tree are mutually exclusive and all share at least one attribute (the root); thus the tree can grow much larger than the logic needed if overlapping rules were allowed.
As an example, in a successful application of ID3 to a chess end game (Michie, 86), the tree representation could not be understood at all, even by the chess experts.
When only one attribute is used at each internal node, trees are limited to axis-parallel partitions of the instance space.
49
Convert Decision Trees to a Set of Rules (repeat of the earlier slide): each branch from the root to a leaf can be transformed into an if-then rule; see the PlayTennis tree and the resulting rules shown earlier.
50
Summary of Decision Tree Learning
Decision tree represents classification knowledge
Decision tree learning is a top-down recursive partitioning process. It is a two-phase process if post-pruning is used:
  Tree building
    An attribute-selection criterion (also called a splitting criterion) is used: information gain, gain ratio, Gini index, etc.
  Post-pruning the tree
    Reduced-error (Quinlan, 87)
    Error-complexity (CART, 84)
    Pessimistic error (Quinlan, 86)
Some other issues
51
Next Class: Data Preprocessing (Chapter 3)
52