Tree Learning
COMP9417 Machine Learning & Data Mining
Term 1, 2021
Adapted from slides by Dr Michael Bain
Aims
This lecture will enable you to describe decision tree learning, the use of entropy and the problem of overfitting. Following it you should be able to:
• define the decision tree representation
• list representation properties of data and models for which decision trees are appropriate
• reproduce the basic top-down algorithm for decision tree induction (TDIDT)
• define entropy in the context of learning a classifier from examples
• define overfitting of a training set by a hypothesis
• describe developments of the basic TDIDT algorithm: pruning, rule generation, numerical attributes, many-valued attributes, costs, missing values
• describe the inductive bias of the basic TDIDT algorithm
• describe regression trees
Classification: Decision trees
Example: Let’s go back to the fish example with two types of fish, “salmon” and “sea bass”, and assume we have two features, length and width, with which to classify the fish type.

[Figure: a decision tree over the two features x1 = length and x2 = width, with tests X1 > 80, X1 > 110, X2 > 30 and X2 > 10 and leaves labelled Salmon or Sea bass together with their class counts, shown alongside the corresponding axis-parallel partition of the (length, width) plane.]
Classification: Decision trees
[Figure: the same decision tree, with class counts at the leaves.]

For any new sample, we start from the top of the tree, answer the question at each node, follow the appropriate branch down to a leaf, and then decide on the label.
Why use decision trees?
• Decision trees are probably the single most popular data mining tool
o Easy to understand
o Easy to implement
o Easy to use
o Computationally cheap (efficient, even on big data)
• There are some drawbacks, though, e.g. high variance
• They do classification, i.e. predict a categorical output from categorical and/or real-valued inputs, and can also do regression
Decision Tree for PlayTennis
Day  Outlook   Humidity  Wind    Play
D1   Sunny     High      Weak    No
D2   Sunny     High      Strong  No
D3   Overcast  High      Weak    Yes
D4   Rain      High      Weak    Yes
D5   Rain      Normal    Weak    Yes
D6   Rain      Normal    Strong  No
D7   Overcast  Normal    Strong  Yes
D8   Sunny     High      Weak    No
D9   Sunny     Normal    Weak    Yes
D10  Rain      Normal    Weak    Yes
D11  Sunny     Normal    Strong  Yes
D12  Overcast  High      Strong  Yes
D13  Overcast  Normal    Weak    Yes
D14  Rain      High      Strong  No
Decision Tree for PlayTennis
A decision tree works in a divide-and-conquer fashion:
– Split the examples into subsets
– Are they pure?
o If yes: stop
o If not: repeat
– For a new example, find the subset it falls into and predict accordingly
Decision Tree for PlayTennis
[Figure: splitting the 14 training examples (9 Yes / 5 No) on Outlook. The Sunny subset {D1, D2, D8, D9, D11} has 2 Yes / 3 No and must be split further; the Overcast subset {D3, D7, D12, D13} has 4 Yes / 0 No and is a pure subset; the Rain subset {D4, D5, D6, D10, D14} has 3 Yes / 2 No and must be split further.]
Decision Tree for PlayTennis
[Figure: the Sunny subset is split further on Humidity: the High branch {D1, D2, D8} is all No and the Normal branch {D9, D11} is all Yes, so both are now pure. The Overcast subset remains a pure Yes subset, and the Rain subset {D4, D5, D6, D10, D14} (3 Yes / 2 No) still needs to be split further.]
Decision Tree for PlayTennis
What would be the decision for a new example? The final tree:

Outlook = Sunny:
|  Humidity = High: No
|  Humidity = Normal: Yes
Outlook = Overcast: Yes
Outlook = Rain:
|  Wind = Strong: No
|  Wind = Weak: Yes
A Tree to Predict C-Section Risk
Learned from medical records of 1000 women. Negative examples are C-sections.

[833+,167-] .83+ .17-
Fetal_Presentation = 1: [822+,116-] .88+ .12-
| Previous_Csection = 0: [767+,81-] .90+ .10-
| | Primiparous = 0: [399+,13-] .97+ .03-
| | Primiparous = 1: [368+,68-] .84+ .16-
| | | Fetal_Distress = 0: [334+,47-] .88+ .12-
| | | | Birth_Weight < 3349: [201+,10.6-] .95+ .05-
| | | | Birth_Weight >= 3349: [133+,36.4-] .78+ .22-
| | | Fetal_Distress = 1: [34+,21-] .62+ .38-
| Previous_Csection = 1: [55+,35-] .61+ .39-
Fetal_Presentation = 2: [3+,29-] .11+ .89-
Fetal_Presentation = 3: [8+,22-] .27+ .73-
Decision Tree for Credit Rating
[Figure: a decision tree learned for a credit rating dataset.]
Decision Tree for Fisher’s Iris data
[Figure: a decision tree learned from Fisher’s Iris data.]
Decision Trees
Decision tree representation:
– Each internal node tests an attribute
– Each branch corresponds to an attribute value
– Each leaf node assigns a classification
Decision Tree: Expressiveness
Using decision trees to represent Boolean functions like AND (⋀), OR (⋁), XOR (⊕)
X ⋀ Y
X = t:
|  Y = t: true
|  Y = f: false
X = f: false

X ⋁ Y
X = t: true
X = f:
|  Y = t: true
|  Y = f: false
Decision Tree: Expressiveness
X ⊕ Y
X = t:
|  Y = t: false
|  Y = f: true
X = f:
|  Y = t: true
|  Y = f: false
“In general, decision trees represent a disjunction of conjunctions of constraints on the attribute-values of instances. Each path from the tree root to a leaf corresponds to a conjunction of attribute tests, and the tree itself to a disjunction of these conjunctions” (Mitchell, 1997)
When to Consider Decision Trees?
• Instances described by a mix of numeric features and discrete attribute-value pairs
• Target function is discrete valued (otherwise use regression trees)
• Possibly noisy training data
• Interpretability is an advantage

Examples are extremely numerous, including:
o Equipment or medical diagnosis
o Credit risk analysis
o Modelling calendar scheduling preferences
o etc.
Top-Down Induction of Decision Trees (TDIDT)
Main loop:
– 𝐴 ← the “best” decision attribute for next node to split examples
– Assign 𝐴 as decision attribute for node
– For each value of 𝐴, create new descendant of node (child node)
– Split training examples to child nodes
– If training examples perfectly classified (pure subset), Then 𝑆𝑇𝑂𝑃, Else iterate over new child nodes
Developed independently in machine learning and in statistics:
o Ross Quinlan (ID3: 1986, C4.5: 1993)
o Breiman et al. (CART: 1984), from statistics
Top-Down Induction of Decision Trees (TDIDT)
• ID3 (Iterative Dichotomiser 3)
– Uses categorical attributes/features
– For categorical targets
– Is the precursor of C4.5 (which lifts the restriction to categorical attributes)
• CART (Classification & Regression Trees)
– Uses both categorical and numerical attributes/features
– Supports both categorical and numerical targets

Here, the focus is on ID3 for classification.
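The main loop above can be sketched recursively in a few lines of Python. This is only an illustrative sketch, not Quinlan’s ID3/C4.5 code: it assumes each example is a dict of categorical attribute values plus a class label, and it leaves the choice of the “best” attribute to a choose_attribute function (an information-gain version is sketched later in these notes).

from collections import Counter

def tdidt(examples, attributes, target, choose_attribute):
    """Grow a decision tree top-down. A leaf is a class label; an internal
    node is a (attribute, {value: subtree}) pair."""
    labels = [ex[target] for ex in examples]
    # Stop when the subset is pure or no attributes remain: return majority label.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    best = choose_attribute(examples, attributes, target)
    children = {}
    for value in set(ex[best] for ex in examples):
        subset = [ex for ex in examples if ex[best] == value]
        remaining = [a for a in attributes if a != best]
        children[value] = tdidt(subset, remaining, target, choose_attribute)
    return (best, children)

def classify(tree, example):
    """Follow the attribute tests from the root down to a leaf."""
    while isinstance(tree, tuple):
        attribute, children = tree
        tree = children[example[attribute]]
    return tree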
Which attribute is best?

[Figure: two candidate splits of a set containing [29+,35-] examples: attribute A1 splits it into [21+,5-] (t) and [8+,30-] (f); attribute A2 splits it into [18+,33-] (t) and [11+,2-] (f).]
Which attribute is best?
[Figure: two candidate splits of the PlayTennis data (9 Yes / 5 No). Splitting on Outlook gives Sunny: 2 Yes / 3 No, Overcast: 4 Yes / 0 No, Rain: 3 Yes / 2 No. Splitting on Wind gives Weak: 6 Yes / 2 No, Strong: 3 Yes / 3 No.]
Which attribute is best?
• We are looking for a split with higher “purity”
• We need to measure the purity of the split
– More certain about our classes after the split
o A set with all examples belonging to one class is 100% pure
o A set with 50% of examples in one class and 50% in the other is 100% uncertain and impure
Which attribute is best?
There are many different ways to measure the purity of a subset, but one good measure for it is Entropy.
Entropy:
– From statistical point of view: is a measure of uncertainty
– From information theory point of view: The amount of information (in the Shannon sense) needed to specify the full state of a system
Entropy
Entropy measures the “impurity” of S:

Entropy(S) = H(S) = −p⊕ log₂ p⊕ − p⊖ log₂ p⊖

– S is a subset of training examples
– p⊕, p⊖ are the proportions of positive and negative examples in S

Interpretation: if item x belongs to S, how many bits are needed to tell whether x is positive or negative?

A “pure” sample set is one in which all examples are of the same class.
Entropy
Where:
– S is a sample of training examples
– p⊕ is the proportion of positive examples in S
– p⊖ is the proportion of negative examples in S

[Figure: Entropy(S) plotted as a function of p⊕; it is 0 at p⊕ = 0 and p⊕ = 1 and reaches its maximum of 1.0 at p⊕ = 0.5.]
General Case
From information theory, the optimal number of bits to encode a symbol with probability p is −log₂ p.

Suppose X can have one of k values v₁, v₂, …, vₖ, with

P(X = v₁) = p₁, P(X = v₂) = p₂, …, P(X = vₖ) = pₖ

What is the smallest possible number of bits, on average per symbol, needed to transmit a stream of symbols drawn from X’s distribution? It is the entropy of X:

H(X) = −p₁ log₂ p₁ − p₂ log₂ p₂ − ⋯ − pₖ log₂ pₖ = −∑ᵢ pᵢ log₂ pᵢ
Entropy
Entropy(S) is the expected number of bits needed to encode the class (⊕ or ⊖) of a randomly drawn member of S (under the optimal, shortest-length code).

So the expected number of bits to encode ⊕ or ⊖ of a random member of S is

p⊕ (−log₂ p⊕) + p⊖ (−log₂ p⊖)

Entropy(S) = −p⊕ log₂ p⊕ − p⊖ log₂ p⊖

And we want to minimize this entropy.
Entropy
• “High entropy” / “impure set” means X is very uniform and boring
o Example: 3 samples from ⊕, 3 samples from ⊖
o H(S) = −(3/6) log₂(3/6) − (3/6) log₂(3/6) = 1 (can be interpreted as 1 bit)
• “Low entropy” / “pure set” means X is not uniform and is interesting
o Example: 6 samples from ⊕, 0 samples from ⊖
o H(S) = −(6/6) log₂(6/6) − (0/6) log₂(0/6) = 0 (can be interpreted as 0 bits)
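A direct transcription of this definition into Python (a small illustrative sketch, not taken from any particular library; 0 log₂ 0 is treated as 0):

import math

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy(['+', '+', '+', '-', '-', '-']))  # 1.0 (maximally impure)
print(entropy(['+', '+', '+', '+', '+', '+']))  # 0.0 (pure)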
Entropy
Entropy is a measure on just one subset. For the [29+,35-] set in the example above:

H(S) = −(29/64) log₂(29/64) − (35/64) log₂(35/64) = 0.9936

But we are interested in the expected drop in entropy after the split, to decide which attribute is best. How?
Information Gain
• Gain(S, A) is the expected reduction in entropy due to sorting on A:

Gain(S, A) = Entropy(S) − ∑_{v ∈ Values(A)} (|Sᵥ| / |S|) · Entropy(Sᵥ)

o v ranges over the possible values of attribute A
o S is the set of examples we want to split
o Sᵥ is the subset of examples for which attribute A has value v
• We want to find the attribute which maximizes the gain

This is also called the “mutual information” between attribute A and the class labels of S.
Information Gain
What is the information gain for attributes A1 and A2 in the earlier example, where A1 splits [29+,35-] into [21+,5-] and [8+,30-], and A2 splits it into [18+,33-] and [11+,2-]?
Information Gain
Gain(S, A1) = Entropy(S) − ((|Sₜ|/|S|) Entropy(Sₜ) + (|S_f|/|S|) Entropy(S_f))
= 0.9936 − ((26/64)(−(21/26) log₂(21/26) − (5/26) log₂(5/26)) + (38/64)(−(8/38) log₂(8/38) − (30/38) log₂(30/38)))
= 0.9936 − (0.2869 + 0.4408) = 0.2658

Gain(S, A2) = 0.9936 − (0.7464 + 0.1258) = 0.1214
Information Gain
So in this example, we choose 𝐴1, since it gives a larger expected reduction in entropy.
To select the best attribute at each branch:
– take every remaining attribute in your data
– compute the information gain for that attribute
– select the attribute that has the highest information gain
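A small, self-contained sketch of the gain computation and the selection step (illustrative only; the entropy helper is repeated so the snippet stands alone, best_attribute has the same signature as the choose_attribute argument of the earlier TDIDT sketch, and the example data simply reproduces the A1/A2 class counts above):

import math

def entropy(labels):
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(examples, attribute, target):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    labels = [ex[target] for ex in examples]
    gain = entropy(labels)
    for value in set(ex[attribute] for ex in examples):
        subset = [ex[target] for ex in examples if ex[attribute] == value]
        gain -= len(subset) / len(examples) * entropy(subset)
    return gain

def best_attribute(examples, attributes, target):
    return max(attributes, key=lambda a: information_gain(examples, a, target))

# A1 splits [29+,35-] into [21+,5-] (t) and [8+,30-] (f):
s_a1 = ([{'A1': 't', 'y': '+'}] * 21 + [{'A1': 't', 'y': '-'}] * 5 +
        [{'A1': 'f', 'y': '+'}] * 8 + [{'A1': 'f', 'y': '-'}] * 30)
print(round(information_gain(s_a1, 'A1', 'y'), 4))   # ~0.2658

# A2 splits the same class counts into [18+,33-] (t) and [11+,2-] (f):
s_a2 = ([{'A2': 't', 'y': '+'}] * 18 + [{'A2': 't', 'y': '-'}] * 33 +
        [{'A2': 'f', 'y': '+'}] * 11 + [{'A2': 'f', 'y': '-'}] * 2)
print(round(information_gain(s_a2, 'A2', 'y'), 4))   # ~0.1214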
Training Examples
Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No
Information gain once more
Which attribute is the best classifier?

S: [9+,5-], E = 0.940

Splitting on Humidity: High → [3+,4-], E = 0.985; Normal → [6+,1-], E = 0.592
Gain(S, Humidity) = .940 − (7/14).985 − (7/14).592 = .151

Splitting on Wind: Weak → [6+,2-], E = 0.811; Strong → [3+,3-], E = 1.00
Gain(S, Wind) = .940 − (8/14).811 − (6/14)1.0 = .048
Information gain once more
[Figure: the partially learned tree. Outlook is tested at the root: the Overcast branch {D3,D7,D12,D13} [4+,0-] is already a Yes leaf, while the Sunny branch {D1,D2,D8,D9,D11} [2+,3-] and the Rain branch {D4,D5,D6,D10,D14} [3+,2-] are still to be expanded.]

Which attribute should be tested here?

S_sunny = {D1,D2,D8,D9,D11}
Gain(S_sunny, Humidity) = .970 − (3/5) 0.0 − (2/5) 0.0 = .970
Gain(S_sunny, Temperature) = .970 − (2/5) 0.0 − (2/5) 1.0 − (1/5) 0.0 = .570
Gain(S_sunny, Wind) = .970 − (2/5) 1.0 − (3/5) .918 = .019
Information Gain
Will information gain always find the perfect solution?
Information Gain Limitations
• Information gain is biased towards attributes with a large number of values/categories
o Subsets are more likely to be pure if there is a large number of values
o This can result in overfitting (selection of an attribute that is non-optimal for prediction) which does not generalize well to unseen data
• Suggested solution: gain ratio
o A modification of information gain that reduces the bias
o Takes the number and size of branches into account
Gain ratio
SplitEntropy(S, A) = − ∑_{v ∈ Values(A)} (|Sᵥ|/|S|) log₂ (|Sᵥ|/|S|)

Where:
§ A: candidate attribute
§ v: possible values of A
§ S: set of examples {X} at the node
§ Sᵥ: subset where attribute A has value v

GainRatio(S, A) = Gain(S, A) / SplitEntropy(S, A)
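A corresponding sketch of the gain-ratio correction (illustrative only). Note that SplitEntropy(S, A) is just the entropy of the attribute-value column itself; gain ratio is undefined when the attribute takes a single value at the node (split entropy 0), so real implementations guard against that case.

import math

def entropy(labels):
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs if p > 0)

def split_entropy(examples, attribute):
    """Entropy of the partition induced by the attribute's values."""
    return entropy([ex[attribute] for ex in examples])

def gain_ratio(examples, attribute, target):
    labels = [ex[target] for ex in examples]
    gain = entropy(labels)
    for value in set(ex[attribute] for ex in examples):
        subset = [ex[target] for ex in examples if ex[attribute] == value]
        gain -= len(subset) / len(examples) * entropy(subset)
    return gain / split_entropy(examples, attribute)  # caller must avoid single-valued attributes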
Overfitting in Decision Trees
• Can always classify training examples perfectly
– If necessary, keep splitting until each node contains 1 example
– Singleton subset (leaf nodes with one example) which is by definition pure
• But this is not always ideal
– In leaf nodes with singleton subsets, you have no confidence in your decision
Overfitting in Decision Tree Learning
[Figure: accuracy as a function of tree size (number of nodes) as the tree is grown. Accuracy on the training data keeps increasing, while accuracy on the test data levels off and then declines, the signature of overfitting.]
Overfitting in General
Consider the error of hypothesis h over
o the training data: error_train(h)
o the entire distribution 𝒟 of data: error_𝒟(h)

Definition
Hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h′ ∈ H such that

error_train(h) < error_train(h′)

and

error_𝒟(h) > error_𝒟(h′)
Avoiding Overfitting
How can we avoid overfitting? Pruning.
o pre-pruning: stop growing when a data split is not statistically significant
o post-pruning: grow the full tree, then remove sub-trees which are overfitting (based on a validation set)

Post-pruning avoids the problem of “early stopping”.
Avoiding Overfitting
Pre-pruning
– Can be based on statistical significance test
– Stops growing the tree when there is no statistically significant association between any attribute and the class at a particular node
– For example, in ID3: chi-squared test plus information gain
o only statistically significant attributes were allowed to be selected by information gain procedure
Avoiding Overfitting
Pre-pruning
– Simplest approach: stop growing the tree when a node contains fewer than some lower bound on the number of examples
o In sklearn, this parameter is min_samples_leaf
– Stop when the change in entropy (impurity decrease) is smaller than some lower bound
o In sklearn, the parameter min_impurity_decrease enables stopping when the impurity decrease falls below a lower bound
Avoiding Overfitting
• Other pre-pruning approaches:
o Maximum number of leaf nodes
§ in sklearn, max_leaf_nodes
o Minimum number of samples to split
§ in sklearn, min_samples_split
o Maximum depth
§ in sklearn, max_depth
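In scikit-learn these pre-pruning controls are just constructor arguments; a minimal usage sketch (the parameter values below are arbitrary, chosen only for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruned tree: several of the stopping criteria described above.
clf = DecisionTreeClassifier(
    criterion='entropy',          # use information gain rather than Gini
    max_depth=3,                  # maximum depth
    min_samples_split=10,         # minimum number of samples to split a node
    min_samples_leaf=5,           # minimum number of samples at a leaf
    max_leaf_nodes=8,             # maximum number of leaf nodes
    min_impurity_decrease=0.01,   # minimum impurity decrease to accept a split
)
clf.fit(X_train, y_train)
print(clf.get_depth(), clf.get_n_leaves(), clf.score(X_test, y_test))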
Avoiding Overfitting
Early stopping
• Pre-pruning may suffer from early stopping: it may stop the growth of the tree prematurely
• Classic example: the XOR/parity problem
o No individual attribute exhibits a significant association with the class
o The target structure is only visible in the fully expanded tree
o Pre-pruning won’t expand the root node
• But: XOR-type problems are not common in practice
• And: pre-pruning is faster than post-pruning
Avoiding Overfitting
Post-pruning
• Builds the full tree first and prunes it afterwards
o Attribute interactions are visible in the fully-grown tree
• Problem: identification of subtrees and nodes that are due to chance effects
• Basic pruning operator: subtree replacement
• Possible strategies: error estimation, significance testing, MDL principle. We examine:
– reduced-error pruning
– minimum error
– smallest tree
Reduced-Error Pruning
Split data into a training and a validation set. Do until further pruning is harmful:
o Evaluate the impact on the validation set of pruning each node at the bottom of the tree (replacing it with a leaf node)
o Greedily remove the one that most improves validation set accuracy

• Good: it is simple and produces the smallest version of the most accurate subtree
• Not so good: it reduces the effective size of the training set
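One way to picture reduced-error pruning on the tuple-based trees from the earlier sketch. This is a simplification of the procedure above: a single bottom-up pass rather than a repeated greedy search, and it assumes the tree was grown from the same training set that is passed in (so every branch has at least one training example).

from collections import Counter

def classify(tree, example, default=None):
    while isinstance(tree, tuple):
        attribute, children = tree
        tree = children.get(example[attribute], default)
    return tree

def accuracy(tree, examples, target):
    if not examples:
        return 0.0
    return sum(classify(tree, ex) == ex[target] for ex in examples) / len(examples)

def reduced_error_prune(tree, train, validation, target):
    if not isinstance(tree, tuple):              # already a leaf
        return tree
    attribute, children = tree
    # Prune the subtrees bottom-up, passing down the matching examples.
    pruned_children = {
        value: reduced_error_prune(
            subtree,
            [ex for ex in train if ex[attribute] == value],
            [ex for ex in validation if ex[attribute] == value],
            target)
        for value, subtree in children.items()}
    pruned = (attribute, pruned_children)
    # Candidate replacement: a leaf with the majority training label at this node.
    leaf = Counter(ex[target] for ex in train).most_common(1)[0][0]
    # Replace if validation accuracy does not get worse (nodes that see no
    # validation data are therefore pruned to a leaf).
    if accuracy(leaf, validation, target) >= accuracy(pruned, validation, target):
        return leaf
    return pruned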
Effect of Reduced-Error Pruning

[Figure: the accuracy curves from before, plus a third curve showing accuracy on test data during pruning, which improves as the overgrown tree is pruned back.]
Pruning operator: Sub-tree replacement

Bottom-up: a tree is considered for replacement once all its sub-trees have been considered.

[Figure: a subtree being replaced by a single leaf.]
Minimum Error
• Track the cross-validation error while growing the tree
• Prune back to the point where the cross-validated error is a minimum
– In the following example, the first 36 nodes will be kept

[Figure: cross-validated error versus number of nodes; the error curve reaches its minimum at around 36 nodes.]
Smallest Tree
• Track the cross-validation error while growing the tree
• Prune back the tree to the smallest tree within 1 standard error of the minimum error
– In the following example, the first 20 nodes will be kept

[Figure: the same error curve; the smallest tree whose error is within one standard error of the minimum has about 20 nodes.]
Converting A Tree to Rules

[Figure: the final PlayTennis tree shown earlier, with Outlook at the root, Humidity under Sunny and Wind under Rain.]

IF (Outlook = Sunny) ∧ (Humidity = High) THEN PlayTennis = No
IF (Outlook = Sunny) ∧ (Humidity = Normal) THEN PlayTennis = Yes
…
Rules from Trees
Rules can be simpler than trees but just as accurate, e.g.:
– path from root to leaf (in unpruned tree) forms a rule
o i.e., tree forms a set of rules
– can simplify rules independently by deleting conditions (by pruning)
o i.e., rules can be generalized while maintaining accuracy
Continuous Valued Attributes
Decision trees originated for discrete attributes only. Now: continuous attributes.

Can create a discrete attribute to test a continuous value:
o Temperature = 82.5
o (Temperature > 72.3) ∈ {t, f}
o Usual method: continuous attributes get a binary split
o Note:
§ discrete attributes: one split exhausts all values
§ continuous attributes: can have many splits in a tree
Continuous Valued Attributes
• Splits are evaluated on all possible split points
• More computation: n − 1 possible splits for n values of an attribute in the training set

Fayyad (1991):
o sort examples on the continuous attribute
o find midway boundaries where the class changes, e.g. for Temperature at (48+60)/2 and (80+90)/2

Temperature: 40  48  60  72  80  90
PlayTennis:  No  No  Yes Yes Yes No

• Choose the best split point by information gain (or evaluation measure of choice)
• Note: C4.5 uses actual values in the data
Continuous Valued Attributes
• Example of n − 1 candidate splits for n = 9 sorted values:

10.2  11.1  11.8  13.0  13.6  14.4  15.5  16.5  18.0

(one candidate threshold between each adjacent pair of values)
Continuous Valued Attributes
• Example of n − 1 splits: fewer splits need to be evaluated if we know the class labels

values: 10.2  11.1  11.8  13.0  13.6  14.4  15.5  16.5  18.0
labels:  +     −     −     +     +     +     −     −     +

– You can use the midway point as your boundary (e.g. x > 12.4 between points 3 and 4)
– Or you can use a value present in the training data (e.g. x > 11.8 between points 3 and 4)
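A sketch of this procedure for a single numeric attribute, assuming midway boundaries at class changes are the candidate thresholds and information gain is the evaluation measure (illustrative code, not C4.5’s implementation):

import math

def entropy(labels):
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs if p > 0)

def best_numeric_split(values, labels):
    """Return (best_threshold, gain) for a binary test of the form x > threshold."""
    pairs = sorted(zip(values, labels))
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    base = entropy(ys)
    best = (None, 0.0)
    for i in range(len(pairs) - 1):
        if ys[i] == ys[i + 1]:          # only consider boundaries where the class changes
            continue
        threshold = (xs[i] + xs[i + 1]) / 2
        left, right = ys[:i + 1], ys[i + 1:]
        gain = (base
                - len(left) / len(ys) * entropy(left)
                - len(right) / len(ys) * entropy(right))
        if gain > best[1]:
            best = (threshold, gain)
    return best

temperature = [40, 48, 60, 72, 80, 90]
play = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No']
print(best_numeric_split(temperature, play))   # threshold 54.0, i.e. (48+60)/2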
Further Issues in Tree Learning
Axis-parallel Splitting

Fitting data that is not a good “match” to the possible splits in a tree.

[Figure from “Pattern Classification”, Duda, Hart, and Stork (2001).]
Further Issues in Tree Learning
Splitting on Linear Combinations of Features

Reduced tree size by allowing splits that are a better “match” to the data.

[Figure from “Pattern Classification”, Duda, Hart, and Stork (2001).]
Attributes with Costs
Consider
– medical diagnosis, 𝐵𝑙𝑜𝑜𝑑 𝑇𝑒𝑠𝑡 has cost $150
– robotics, computation cost 23 sec.
How to learn a consistent tree with low expected cost?
Attributes with Costs
One approach: evaluate information gain relative to cost, for example

Gain²(S, A) / Cost(A)

This gives a preference for decision trees using lower-cost attributes.
Misclassification with Costs
Also: class (misclassification) costs, instance costs, etc. We can give false positives a different cost to false negatives.

Similar to the expected loss in the “Classification 2” lecture, we can define a loss function λ(αᵢ | c), the loss associated with taking action αᵢ when the class is c, and define the expected loss as:

R(αᵢ | x) = ∑_{c ∈ C} λ(αᵢ | c) P(c | x)

– However, if we use this loss function to split the tree, it has been shown that minimizing R does not necessarily lead to the best long-term growth of the tree.
Misclassification with Costs
So what is usually done instead:
– Grow the tree to its full depth using information gain (or another impurity measure)
– Apply cost minimization when post-pruning (or minimize a cost-complexity measure which also takes the size of the tree into account)
Unknown Attribute Values
What if some examples are missing values of attribute A?

Use the training example anyway, and sort it through the tree. Here are 3 possible approaches:
– If node n tests A, assign the most common value of A among the other examples sorted to node n
– Assign the most common value of A among the other examples with the same target value
– Assign probability pᵢ to each possible value vᵢ of A
o assign fraction pᵢ of the example to each descendant in the tree

Note: new (unseen) examples need to be classified in the same fashion.
Windowing
In early implementations the training set was often too large for memory. How to handle the problem?

As a solution, ID3 implemented windowing:
1. select a subset of instances: the window
2. construct a decision tree from all instances in the window
3. use the tree to classify training instances not in the window
4. if all instances are correctly classified then halt, else
5. add selected misclassified instances to the window
6. go to step 2

Windowing was retained in C4.5 because it can lead to more accurate trees. It is related to ensemble learning.
Tree Learning More Generally
Hypothesis Space Search by ID3

[Figure: ID3’s search through the space of decision trees, starting from the empty tree and greedily growing it by adding one attribute test (A1, A2, A3, A4, …) at a time.]
Hypothesis Space Search by ID3
• Usual graph-search technique: greedy or beam search, starting with the node corresponding to the “empty tree” (single leaf node)
• Greedy choice: which one to select? The one that results in the greatest increase in posterior probability
• RESULT: set of trees with (reasonably) high posterior probabilities given 𝐷: we can now use these to answer questions like 𝑃 (𝑦′ = 𝐶$| . . . )? or even make a decision or a classification that 𝑦′ = 𝐶$, given input data 𝑥
Hypothesis Space Search by ID3
• ID3 searches the space of possible decision trees: doing hill-climbing on information gain.
• Hypothesis space is complete! (contains all finite discrete-valued functions w.r.t. attributes)
• Outputs a single hypothesis.
o It cannot tell us how many other viable ones there are.
• No back tracking
o Local minima…
• Uses all training examples at each step.
o Results are less sensitive to noise
• Inductive bias
Inductive Bias in ID3
Inductive bias: the set of assumptions needed, in addition to the training data, to deductively justify the learner’s classifications.

Restriction bias:
– The set of hypotheses that can be modelled by decision trees

Preference biases:
– Prefers trees with good splits near the top (splitting on features with the most information gain)
– Prefers shorter trees (this comes naturally from good splits at the top, and relates to minimum description length)
Decision Tree
Pros:
– Interpretability
– Easily handle irrelevant attributes (Gain = 0)
– Can handle both categorical and numerical data
– Can handle missing data
– Very compact (number of nodes << number of examples)
– Very fast at testing
Cons:
– Only axis-aligned splits of the data
– Tend to overfit
– Greedy (may not find the best tree)
o Exponentially many possible trees
Regression Tree
Example: A clinical trial has been done to evaluate the effect of a drug dosage.
[Figure: two scatter plots of drug effectiveness (%) against dosage (mg); in the plot on the right the relationship is clearly non-linear.]

How to model the data on the right?
Regression Tree (CART algorithm)

One option would be to use regression trees.

[Figure: a regression tree for the dosage data, splitting on Dosage < 20, Dosage >= 40 and Dosage >= 37, with leaves predicting 10%, 12%, 95% and 70% effectiveness, shown next to the scatter plot.]

Each leaf corresponds to the average drug effectiveness in a different cluster of examples; the tree does a better job than linear regression.
Regression Tree
Although in this example we could use other methods such as kNN or local regression, the advantage of regression trees becomes clearer when there are multiple variables, especially if some of them are categorical and some are real-valued.

How do we build the regression tree and find the thresholds?
Regression Tree
We will use mean squared error to find the best threshold. We successively set the threshold to each observed dosage value, from the first point to the last, and examine the mean squared error of the resulting split.

[Figure: left, the dosage data with one candidate threshold and the error of a single point marked; right, the mean squared error as a function of the dosage threshold used for the split.]
Regression Tree
• The mean squared error after setting the threshold, for a subset Yᵢ with m examples, is

MSE(Yᵢ) = (1/m) ∑ⱼ (yⱼ − ȳ)²

o The MSE here is equal to the variance of the examples in that subset
• Compute the weighted average of the variances:

weighted average variance = ∑ᵢ (|Yᵢ| / |Y|) · MSE(Yᵢ)

• We pick the threshold which minimizes the weighted average of the MSE/variance
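This threshold search can be written directly with NumPy; a minimal sketch (the dosage/effectiveness numbers below are made up purely to exercise the function):

import numpy as np

def best_regression_split(x, y):
    """Pick the threshold minimizing the weighted average variance (MSE)
    of the two resulting subsets."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_threshold, best_score = None, np.inf
    for i in range(len(x) - 1):
        if x[i] == x[i + 1]:            # skip duplicate values
            continue
        threshold = (x[i] + x[i + 1]) / 2
        left, right = y[x <= threshold], y[x > threshold]
        score = (len(left) * left.var() + len(right) * right.var()) / len(y)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

# Hypothetical dosage/effectiveness data, just to exercise the function.
dosage        = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]
effectiveness = [2,  5,  8, 65, 80, 85, 90, 15, 10,  5]
print(best_regression_split(dosage, effectiveness))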
Regression Tree
Multiple variables:
Imagine that in the drug experiment we also have other attributes like “age”, “sex”,…
To choose an attribute for each node:
– For each attribute separately find the best split using the minimum weighted average variance
– Pick the attribute with minimum weighted average variance
Regression Tree
Similar to decision trees, if we continue splitting all nodes that have a mean squared error bigger than zero, we will get a tree which fits the training data perfectly but does not generalize well to test data.

How to avoid overfitting:
– The simplest way, as for decision trees, is to only split a node when it contains more than some minimum number of examples (a common default for the minimum number of examples to split is 20)
– Or use a maximum depth: allow the tree to grow only to the specified depth
– Or a maximum number of leaves
– …
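In scikit-learn, DecisionTreeRegressor exposes the same controls; a minimal sketch with synthetic data shaped roughly like the dosage example (the data-generating rule is invented for illustration):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
dosage = rng.uniform(0, 60, size=200).reshape(-1, 1)   # synthetic dosages
effectiveness = (np.where((dosage > 20) & (dosage < 40), 80, 10).ravel()
                 + rng.normal(0, 5, size=200))          # high only in the middle range

reg = DecisionTreeRegressor(
    min_samples_split=20,   # do not split nodes with fewer than 20 examples
    max_depth=4,            # and cap the depth of the tree
)
reg.fit(dosage, effectiveness)
print(reg.predict([[10], [30], [50]]))   # roughly 10, 80, 10 for this synthetic data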
Learning Non-linear Regression Models with Trees
A Regression Tree and its Prediction Surface

[Figure: a regression tree and its piecewise-constant prediction surface F(X), from “Elements of Statistical Learning”, Hastie, Tibshirani & Friedman (2001).]
Learning Non-linear Regression Models with Trees
Regression Tree on sine dataset

[Figure: regression tree predictions on the sine dataset.]

Regression Tree on CPU dataset

[Figure: regression tree predictions on the CPU dataset.]
Learning Non-linear Regression Models with Trees
Learning a regression tree

Imagine you are a collector of vintage Hammond tonewheel organs. You have been monitoring an online auction site, from which you collected some data about interesting transactions:

#   Model  Condition  Leslie  Price
1.  B3     excellent  no      4513
2.  T202   fair       yes      625
3.  A100   good       no      1051
4.  T202   good       no       270
5.  M102   good       yes      870
6.  A100   excellent  no      1770
7.  T202   fair       no        99
8.  A100   good       yes     1900
9.  E112   fair       no        77
Learning a regression tree
From this data, you want to construct a regression tree that will help you determine a reasonable price for your next purchase.
There are three features, hence three possible splits:
𝑀𝑜𝑑𝑒𝑙 = [𝐴100,𝐵3,𝐸112,𝑀102,𝑇202]
[1051, 1770, 1900][4513][77][870][99, 270, 625]
𝐶𝑜𝑛𝑑𝑖𝑡𝑖𝑜𝑛 = [𝑒𝑥𝑐𝑒𝑙𝑙𝑒𝑛𝑡,𝑔𝑜𝑜𝑑,𝑓𝑎𝑖𝑟]
[1770, 4513][270, 870, 1051, 1900][77, 99, 625]
𝐿𝑒𝑠𝑙𝑖𝑒 = [𝑦𝑒𝑠, 𝑛𝑜]
[625,870,1900][77,99,270,1051,1770,4513]
Learning a regression tree
• The means of the first split are 1574, 4513, 77, 870 and 331, and the weighted average of mean squared errors is about 6.2×10⁴.
• The means of the second split are 3142, 1023 and 267, with weighted average of mean squared errors 5.9×10⁵.
• For the third split the means are 1132 and 1297, with weighted average of mean squared errors 1.7×10⁶.

We therefore branch on Model at the top level. This gives us three single-instance leaves, as well as three A100s and three T202s.
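These weighted averages are easy to check with a few lines of NumPy, using the prices from the table above (single-instance children simply have variance 0):

import numpy as np

prices = {'B3': [4513], 'T202': [625, 270, 99], 'A100': [1051, 1770, 1900],
          'M102': [870], 'E112': [77]}

def weighted_average_variance(partition):
    total = sum(len(part) for part in partition)
    return sum(len(part) * np.var(part) for part in partition) / total

model     = list(prices.values())
condition = [[1770, 4513], [270, 870, 1051, 1900], [625, 99, 77]]
leslie    = [[625, 870, 1900], [4513, 1051, 270, 1770, 99, 77]]

for name, split in [('Model', model), ('Condition', condition), ('Leslie', leslie)]:
    print(name, round(weighted_average_variance(split)))
# Model gives the lowest value (about 6.2e4, versus 5.9e5 and 1.7e6),
# so we split on Model at the top level.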
Learning a regression tree
For the A100s we obtain the following splits:
Condition = [excellent, good, fair]: [1770][1051, 1900][]
Leslie = [yes, no]: [1900][1051, 1770]

Without going through the calculations we can see that the second split results in less variance (to handle the empty child, it is customary to set its variance equal to that of the parent). For the T202s the splits are as follows:

Condition = [excellent, good, fair]: [][270][99, 625]
Leslie = [yes, no]: [625][99, 270]

Again we see that splitting on Leslie gives tighter clusters of values. The learned regression tree is depicted on the next slide.
Learning Non-linear Regression Models with Trees
A regression tree

Model = A100:
|  Leslie = yes: f̂(x) = 1900
|  Leslie = no:  f̂(x) = 1411
Model = B3:   f̂(x) = 4513
Model = E112: f̂(x) = 77
Model = M102: f̂(x) = 870
Model = T202:
|  Leslie = yes: f̂(x) = 625
|  Leslie = no:  f̂(x) = 185

A regression tree learned from the Hammond organ dataset.
Regression trees
• Differences to decision trees:
o Splitting criterion: minimizing intra-subset variation
o Pruning criterion: based on a numeric error measure
o A leaf node predicts the average target value of the training instances reaching that node
• Can approximate piecewise constant functions
• Easy to interpret
• More sophisticated version: model trees
Model trees
• Like regression trees but with linear regression functions (or any model of your choice) at each node
• Linear regression applied to instances that reach a node after full tree has been built
• Only a subset of the attributes is used for Linear Regression
– Attributes occurring in subtree (+maybe attributes occurring in path to the root)
• Fast: overhead for Linear Regression not large because usually only a small subset of attributes is used in tree
Model Tree Example
Suppose we want to approximate y = cos πx on the interval −1 ≤ x ≤ 1.

• A linear approximation is not much use here, since the best fit would be y = 0.
• However, if we split the x-axis into two intervals, −1 ≤ x < 0 and 0 ≤ x ≤ 1, we could find reasonable linear approximations on each interval. We can achieve this by using x both as a splitting feature and as a regression variable (next slide).
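A quick way to see this numerically (a sketch with NumPy: fit one least-squares line globally, then one per half-interval; the two per-leaf fits come out with slopes of opposite sign, in the spirit of the ŷ = ±2x + 1 tree on the next slide):

import numpy as np

x = np.linspace(-1, 1, 201)
y = np.cos(np.pi * x)

# A single global line is useless: the best fit is essentially y = 0.
print(np.polyfit(x, y, 1))              # slope ~0, intercept ~0

# Split at x = 0 and fit one line per leaf, as a model tree would.
for mask, name in [(x < 0, 'x < 0'), (x >= 0, 'x >= 0')]:
    slope, intercept = np.polyfit(x[mask], y[mask], 1)
    print(name, round(slope, 2), round(intercept, 2))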
Learning Non-linear Regression Models with Trees
A small model tree example

x < 0:  ŷ = 2x + 1
x ≥ 0:  ŷ = −2x + 1

(left) A model tree combining a one-split feature tree with linear regression models in the leaves. Notice how x is used as both a splitting feature and a regression variable. (right) The piecewise linear approximation of y = cos πx on −1 ≤ x ≤ 1 produced by this model tree.
Building the tree
• Splitting criterion: standard deviation reduction

SDR = sd(T) − ∑ᵢ (|Tᵢ| / |T|) × sd(Tᵢ)

where T₁, T₂, … are the sets from splits of the data at the node.

– Termination criteria (important when building trees for numeric prediction):
o Standard deviation becomes smaller than a certain fraction of the sd for the full training set (e.g. 5%)
o Too few instances remain (e.g. fewer than four)
Pruning the tree
Model trees suffer from the same deficit as decision trees: the tendency to overfit the training data when using complex models.
• Pruning is based on the estimated absolute error of the LR models
• Linear Regression (LR) models are pruned by greedily removing terms to minimize the estimated error
• Model trees allow for heavy pruning: often a single LR model can replace a whole subtree
• Pruning proceeds bottom-up: the error for the LR model at an internal node is compared to the error for the subtree
Discrete (nominal) attributes
• Nominal attributes are converted to binary attributes and treated as numeric
o Nominal values are sorted using the average class value for each one
o For k values, k binary attributes are generated
§ the i-th binary attribute is 0 if an instance’s value is one of the first i in the ordering, 1 otherwise
• The best binary split with the original attribute is provably equivalent to a split on one of the new attributes
This is basically “one-hot-encoding”
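A sketch of this conversion (illustrative only; nominal_to_binary is a made-up helper name). Values are ordered by their average target value and then turned into k threshold-style binary columns, demonstrated here on the Hammond organ models and prices from earlier:

import numpy as np

def nominal_to_binary(values, targets):
    """Order the k nominal values by mean target value, then create k binary
    columns: column j (j = 1..k) is 0 if the instance's value is among the
    first j values in the ordering, 1 otherwise."""
    values, targets = np.asarray(values), np.asarray(targets, dtype=float)
    means = {v: targets[values == v].mean() for v in np.unique(values)}
    ordering = sorted(means, key=means.get)
    rank = {v: i for i, v in enumerate(ordering)}
    ranks = np.array([rank[v] for v in values])
    k = len(ordering)
    return np.stack([(ranks > i).astype(int) for i in range(k)], axis=1)

models = ['B3', 'T202', 'A100', 'T202', 'M102', 'A100', 'T202', 'A100', 'E112']
prices = [4513, 625, 1051, 270, 870, 1770, 99, 1900, 77]
print(nominal_to_binary(models, prices))   # a 9 x 5 binary matrix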
Summary – decision trees
• Decision tree learning is a practical method for many classifier learning tasks – still a “Top 10” data mining algorithm
• TDIDT family descended from ID3 searches complete hypothesis space - the hypothesis is there, somewhere...
• Overfitting is inevitable with an expressive hypothesis space and noisy data, so pruning is important
• Decades of research into extensions and refinements of the general approach, e.g., for numerical prediction, logical trees
Summary – regression and model trees
• Often the “try-first” machine learning method in applications, illustrates many general issues
• Performance can be improved with use of “ensemble” methods
• Regression trees were introduced in CART – R’s implementation is close to CART, but see sklearn.tree.DecisionTreeRegressor for a basic version
Acknowledgements
• Material derived from slides for the book “Machine Learning” by T. Mitchell, McGraw-Hill (1997) http://www-2.cs.cmu.edu/~tom/mlbook.html
• Material derived from slides by Andrew W. Moore http://www.cs.cmu.edu/~awm/tutorials
• Material derived from slides by Eibe Frank http://www.cs.waikato.ac.nz/ml/weka
• Material derived from slides for the book “Machine Learning” by P. Flach, Cambridge University Press (2012) http://cs.bris.ac.uk/~flach/mlbook
• Material derived from slides by Josh Starmer (StatQuest channel)