
Tree Learning
COMP9417 Machine Learning & Data Mining
Term 1, 2022
Adapted from slides by Dr Michael


This lecture will enable you to describe decision tree learning, the use of entropy and the problem of overfitting. Following it you should be able to:
– Define the decision tree representation
– Reproduce the basic top-down algorithm for decision tree induction (TDIDT)
– Define entropy in the context of learning a classifier from examples
– Define overfitting in decision trees and how to control it
– Describe the inductive bias of the basic TDIDT algorithm
– Describe regression trees
COMP9417 T1, 2022 1

Classification: Decision trees
Example: Let’s go back to the fish example with two classes, “salmon” and “sea bass”, and assume we have two features, length and width, with which to classify the fish type.
[Figure: scatter plot of the fish samples over the two features (length and width), together with the corresponding decision tree; leaf annotations such as (1,0) and (0,4) give the class counts at each leaf]
COMP9417 T1, 2022

Classification: Decision trees
[Figure: the same decision tree, with its leaves labelled Salmon and Sea bass]
For any new sample, we start at the top of the tree, answer the question at each node, follow the corresponding branch down to a leaf, and then read off the label.
COMP9417 T1, 2022 3

Why use decision trees?
• Decision trees are probably the single most popular data mining tool
o Easy to understand
o Easy to implement
o Easy to use
o Computationally cheap (efficient, even on big data)
• There are some drawbacks, though — e.g., high variance
• They do classification, i.e., predict a categorical output from categorical and/or real-valued inputs; they can also do regression
COMP9417 T1, 2022 4

Decision Tree for PlayTennis
Day  Outlook   Humidity  Wind    Play
D1   Sunny     High      Weak    No
D2   Sunny     High      Strong  No
D3   Overcast  High      Weak    Yes
D4   Rain      High      Weak    Yes
D5   Rain      Normal    Weak    Yes
D6   Rain      Normal    Strong  No
D7   Overcast  Normal    Strong  Yes
D8   Sunny     High      Weak    No
D9   Sunny     Normal    Weak    Yes
D10  Rain      Normal    Weak    Yes
D11  Sunny     Normal    Strong  Yes
D12  Overcast  High      Strong  Yes
D13  Overcast  Normal    Weak    Yes
D14  Rain      High      Strong  No
COMP9417 T1, 2022 5

Decision Tree for PlayTennis
Decision tree learning works in a divide-and-conquer fashion:
– Split the examples into subsets
– Are they pure?
o If yes: stop
o If not: repeat
– For a new example, find the subset it falls into and read off the label
COMP9417 T1, 2022 6

Decision Tree for PlayTennis
Splitting the full set (9 Yes / 5 No) on Outlook:

Sunny:    {D1, D2, D8, D9, D11}  → 2 Yes / 3 No (split further)
Overcast: {D3, D7, D12, D13}     → 4 Yes / 0 No (pure subset)
Rain:     {D4, D5, D6, D10, D14} → 3 Yes / 2 No (split further)
COMP9417 T1, 2022 7

Decision Tree for PlayTennis
After splitting on Outlook (9 Yes / 5 No at the root):

Sunny (2 Yes / 3 No), split further on Humidity:
  High:   {D1, D2, D8} → 0 Yes / 3 No (pure subset)
  Normal: {D9, D11}    → 2 Yes / 0 No (pure subset)
Overcast: {D3, D7, D12, D13} → 4 Yes / 0 No (pure subset)
Rain:     {D4, D5, D6, D10, D14} → 3 Yes / 2 No (split further)
COMP9417 T1, 2022 8

Decision Tree for PlayTennis
What would the decision be for a new example?
Outlook = Sunny:
| Humidity = High: No
| Humidity = Normal: Yes
Outlook = Overcast: Yes
Outlook = Rain:
| Wind = Strong: No
| Wind = Weak: Yes
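As an extra illustration (not from the original slides), the tree above can be written as nested Python dictionaries, together with a small routine that classifies a new example by walking from the root to a leaf; the example passed to predict below is an arbitrary one.

play_tennis_tree = {
    "Outlook": {
        "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def predict(tree, example):
    # Walk from the root: at each internal node, follow the branch matching the
    # example's value for the tested attribute, until a leaf (a class label) is reached.
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        tree = tree[attribute][example[attribute]]
    return tree

# An arbitrary new example (hypothetical, not from the slides):
print(predict(play_tennis_tree, {"Outlook": "Sunny", "Humidity": "Normal", "Wind": "Weak"}))  # -> Yes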
COMP9417 T1, 2022

A Tree to Predict C-Section Risk
Learned from medical records of 1000 women. Negative examples are C-sections.
[833+,167-] .83+ .17-
Fetal_Presentation = 1: [822+,116-] .88+ .12-
| Previous_Csection = 0: [767+,81-] .90+ .10-
| | Primiparous = 0: [399+,13-] .97+ .03-
| | Primiparous = 1: [368+,68-] .84+ .16-
| | | Fetal_Distress = 0: [334+,47-] .88+ .12-
| | | | Birth_Weight < 3349: [201+,10.6-] .95+ .05-
| | | | Birth_Weight >= 3349: [133+,36.4-] .78+ .22-
| | | Fetal_Distress = 1: [34+,21-] .62+ .38-
| Previous_Csection = 1: [55+,35-] .61+ .39-
Fetal_Presentation = 2: [3+,29-] .11+ .89-
Fetal_Presentation = 3: [8+,22-] .27+ .73-
COMP9417 T1, 2022 10

Decision Tree for Credit Rating
COMP9417 T1, 2022 11

Decision Tree for Fisher’s Iris data
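The tree figure itself is not reproduced here. As a rough stand-in (a sketch assuming scikit-learn is available; the depth limit of 3 and the entropy criterion are arbitrary choices), a small tree can be fitted to the Iris data and printed in text form:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
# Fit a small tree using entropy as the split criterion, as in this lecture.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(iris.data, iris.target)
# Print the learned tree as nested attribute tests.
print(export_text(clf, feature_names=list(iris.feature_names)))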
COMP9417 T1, 2022 12

Decision Trees
Decision tree representation:
– Each internal node tests an attribute
– Each branch corresponds to an attribute value
– Each leaf node assigns a classification
COMP9417 T1, 2022 13

Decision Tree: Expressiveness
Using decision trees to represent Boolean functions like AND (⋀), OR (⋁), XOR (⊕)
X AND Y (X ⋀ Y):
X = t:
| Y = t: true
| Y = f: false
X = f: false

X OR Y (X ⋁ Y):
X = t: true
X = f:
| Y = t: true
| Y = f: false
COMP9417 T1, 2022 14

Decision Tree: Expressiveness
X XOR Y (X ⊕ Y):
X = t:
| Y = t: false
| Y = f: true
X = f:
| Y = t: true
| Y = f: false
“In general, decision trees represent a disjunction of conjunctions of constraints on the attribute-values of instances. Each path from the tree root to a leaf corresponds to a conjunction of attribute tests, and the tree itself to a disjunction of these conjunctions” (Mitchell, 1997)
COMP9417 T1, 2022 15

When to Consider Decision Trees?
• Instances described by a mix of numeric features and discrete attribute-value pairs
• Target function is discrete valued (otherwise use regression trees)
• Possibly noisy training data
• Interpretability is an advantage
Examples are extremely numerous, including:
o Equipment or medical diagnosis
o Credit risk analysis
o Modelling calendar scheduling preferences
o etc.
COMP9417 T1, 2022 16

Top-Down Induction of Decision Trees (TDIDT)
Main loop:
– 𝐴 ← the “best” decision attribute for next node to split examples
– Assign 𝐴 as decision attribute for node
– For each value of 𝐴, create new descendant of node (child node)
– Split training examples to child nodes
– If training examples perfectly classified (pure subset), Then 𝑆𝑇𝑂𝑃, Else iterate over new child nodes
Developed independently in two research communities:
o Quinlan (ID3: 1986; C4.5: 1993) in machine learning
o Breiman et al. (CART: 1984) in statistics
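A minimal Python sketch of this loop (illustrative only, not the exact ID3/C4.5/CART code): examples are assumed to be (feature-dict, label) pairs, and the attribute-scoring function, e.g. the information gain defined later, is passed in as an argument.

from collections import Counter

def tdidt(examples, attributes, score):
    # examples: list of (features_dict, label) pairs; score(examples, attr) ranks candidate splits.
    labels = [label for _, label in examples]
    # Stop if the subset is pure or no attributes remain; return the majority class.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Choose the "best" attribute according to the supplied scoring function.
    best = max(attributes, key=lambda a: score(examples, a))
    tree = {best: {}}
    # Create one child per observed value of the chosen attribute and recurse.
    for value in {features[best] for features, _ in examples}:
        subset = [(f, y) for f, y in examples if f[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = tdidt(subset, remaining, score)
    return tree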
COMP9417 T1, 2022 17

Top-Down Induction of Decision Trees (TDIDT)
• ID3 (Iterative Dichotomiser 3)
– Uses categorical attributes/features
– For categorical targets
– Is the precursor of C4.5 (which removes the restriction to categorical attributes)
• CART (Classification & Regression Trees)
– Uses both categorical and numerical attributes/features
– Supports both categorical & numerical targets
Here, the focus is on ID3 for classification.
COMP9417 T1, 2022 18

Which attribute is best?
Split of [29+,35-] on A1: t → [21+,5-], f → [8+,30-]
Split of [29+,35-] on A2: t → [18+,33-], f → [11+,2-]
COMP9417 T1, 2022 19

Which attribute is best?
9 Yes / 5 No at the root.

Split on Outlook:
  Sunny:    2 Yes / 3 No
  Overcast: 4 Yes / 0 No
  Rain:     3 Yes / 2 No

Split on Wind:
  Weak:   6 Yes / 2 No
  Strong: 3 Yes / 3 No
COMP9417 T1, 2022

Which attribute is best?
9 Yes / 5 No at the root.

Split on Outlook:
  Sunny:    2 Yes / 3 No
  Overcast: 4 Yes / 0 No
  Rain:     3 Yes / 2 No

Split on Wind:
  Weak:   6 Yes / 2 No
  Strong: 3 Yes / 3 No
• We are looking for a split with higher “purity”
• We need to measure the purity of the split
– More certain about our classes after split
o A set with all examples belonging to one class is 100% pure
o A set with 50% examples in one class and 50% in the other is 100% uncertain and impure
COMP9417 T1, 2022 21

Which attribute is best?
There are many different ways to measure the purity of a subset; one good measure is entropy.
– From a statistical point of view: it is a measure of uncertainty
– From an information theory point of view: it is the amount of information (in the Shannon sense) needed to specify the full state of a system
COMP9417 T1, 2022 22

Entropy measures the “impurity” of 𝑆
Entropy(S) = H(S) = − p⊕ log₂ p⊕ − p⊖ log₂ p⊖
– S is a subset of training examples
– p⊕ and p⊖ are the proportions of positive and negative examples in S
Interpretation: if item x belongs to S, how many bits are needed to tell whether x is positive or negative?
A “pure” sample set is one in which all examples are of the same class.
COMP9417 T1, 2022 23

Where:
– S is a sample of training examples
– p⊕ is the proportion of positive examples in S
– p⊖ is the proportion of negative examples in S

[Figure: Entropy(S) plotted against p⊕; it is 0 at p⊕ = 0, rises to a maximum of 1 at p⊕ = 0.5, and falls back to 0 at p⊕ = 1]
COMP9417 T1, 2022

General Case
From information theory, the optimal number of bits to encode a symbol with probability p is − log₂ p …
Suppose X can take one of k values v₁, v₂, …, v_k, with
P(X = v₁) = p₁, P(X = v₂) = p₂, …, P(X = v_k) = p_k
What is the smallest possible number of bits, on average, per symbol, needed to transmit a stream of symbols drawn from X’s distribution? It is
H(X) = − p₁ log₂ p₁ − p₂ log₂ p₂ − … − p_k log₂ p_k = − Σᵢ pᵢ log₂ pᵢ   (sum over i = 1, …, k)
which is the entropy of X.
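As a small sketch (not part of the slides), the same formula in Python, taking a list of class counts:

from math import log2

def entropy(counts):
    # Entropy in bits of a class distribution given as a list of counts, e.g. [9, 5].
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(entropy([9, 5]))   # ~0.940 (the 9 Yes / 5 No PlayTennis set from earlier)
print(entropy([3, 3]))   # 1.0  (maximally impure)
print(entropy([6, 0]))   # 0.0  (pure)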
COMP9417 T1, 2022 25

Entropy(S) is the expected number of bits needed to encode the class (⊕ or ⊖) of a randomly drawn member of S (under the optimal, shortest-length code).
So the expected number of bits to encode ⊕ or ⊖ for a random member of S is p⊕ (− log₂ p⊕) + p⊖ (− log₂ p⊖), i.e.
Entropy(S) = − p⊕ log₂ p⊕ − p⊖ log₂ p⊖
When splitting, we want the resulting subsets to have entropy as low as possible.
COMP9417 T1, 2022 26

• “High entropy” / “impure set” means the class distribution is close to uniform (uninformative)
o Example: 3 samples from ⊕, 3 samples from ⊖
o H(S) = −(3/6) log₂(3/6) − (3/6) log₂(3/6) = 1 (can be interpreted as 1 bit)
• “Low entropy” / “pure set” means the class distribution is far from uniform (informative)
o Example: 6 samples from ⊕, 0 samples from ⊖
o H(S) = −(6/6) log₂(6/6) − (0/6) log₂(0/6) = 0 (can be interpreted as 0 bits, taking 0·log₂ 0 = 0)
COMP9417 T1, 2022 27

Which attribute is best?
Entropy measures the impurity of a single subset.

Split of [29+,35-] on A1: t → [21+,5-], f → [8+,30-]
Split of [29+,35-] on A2: t → [18+,33-], f → [11+,2-]

H(S) = −(29/64) log₂(29/64) − (35/64) log₂(35/64) = 0.9936

But we are interested in the expected drop in entropy after the split, in order to decide which attribute is best. How?
COMP9417 T1, 2022 28

Information Gain
• Gain(S, A) is the expected reduction in entropy due to sorting (splitting) on A:
Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)
o v ranges over the possible values of attribute A
o S is the set of examples we want to split
o S_v is the subset of examples where attribute A takes value v
• We want to find the attribute which maximizes the gain
This is also called “mutual information” between attribute 𝐴 and class labels of 𝑆
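A matching Python sketch (again illustrative; S is represented as a list of (feature-dict, label) pairs and the subset entropies are computed from class counts). The usage line reproduces the A1 split shown earlier:

from collections import Counter
from math import log2

def information_gain(examples, attr):
    # Gain(S, A): entropy of S minus the size-weighted entropy of the subsets S_v.
    def H(subset):
        n = len(subset)
        counts = Counter(label for _, label in subset).values()
        return -sum((c / n) * log2(c / n) for c in counts)
    remainder = 0.0
    for value in {features[attr] for features, _ in examples}:
        S_v = [(f, y) for f, y in examples if f[attr] == value]
        remainder += len(S_v) / len(examples) * H(S_v)
    return H(examples) - remainder

# The A1 split from earlier: t -> [21+, 5-], f -> [8+, 30-].
S = ([({"A1": "t"}, "+")] * 21 + [({"A1": "t"}, "-")] * 5 +
     [({"A1": "f"}, "+")] * 8 + [({"A1": "f"}, "-")] * 30)
print(information_gain(S, "A1"))   # ~0.266, i.e. Gain(S, A1) as computed on the next slides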
COMP9417 T1, 2022 29

Information Gain
Which attribute is best?
What is the information gain for attribute 𝐴1 and 𝐴2?
Split of [29+,35-] on A1: t → [21+,5-], f → [8+,30-]
Split of [29+,35-] on A2: t → [18+,33-], f → [11+,2-]
COMP9417 T1, 2022

Information Gain
Gain(S, A1) = Entropy(S) − ( (|S_t| / |S|) · Entropy(S_t) + (|S_f| / |S|) · Entropy(S_f) )

Gain(S, A1) = 0.9936 − [ (26/64) · (−(21/26) log₂(21/26) − (5/26) log₂(5/26))
                       + (38/64) · (−(8/38) log₂(8/38) − (30/38) log₂(30/38)) ]
            = 0.9936 − (0.2869 + 0.4408) = 0.2658

Gain(S, A2) = 0.9936 − [ (51/64) · (−(18/51) log₂(18/51) − (33/51) log₂(33/51))
                       + (13/64) · (−(11/13) log₂(11/13) − (2/13) log₂(2/13)) ]
            = 0.9936 − (0.7464 + 0.1258) = 0.1214
COMP9417 T1, 2022 31

Information Gain
So in this example, we choose 𝐴1, since it gives a larger expected reduction in entropy.
To select the best attribute at each branch:
– take each remaining attribute in your data
– compute the information gain for that attribute
– select the attribute with the highest information gain
COMP9417 T1, 2022 32

Training Examples
Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No
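For use with the code sketches in this section, the same table can be transcribed as a Python list of (features, label) pairs:

play_tennis = [  # D1-D14, transcribed from the table above
    ({"Outlook": "Sunny",    "Temperature": "Hot",  "Humidity": "High",   "Wind": "Weak"},   "No"),
    ({"Outlook": "Sunny",    "Temperature": "Hot",  "Humidity": "High",   "Wind": "Strong"}, "No"),
    ({"Outlook": "Overcast", "Temperature": "Hot",  "Humidity": "High",   "Wind": "Weak"},   "Yes"),
    ({"Outlook": "Rain",     "Temperature": "Mild", "Humidity": "High",   "Wind": "Weak"},   "Yes"),
    ({"Outlook": "Rain",     "Temperature": "Cool", "Humidity": "Normal", "Wind": "Weak"},   "Yes"),
    ({"Outlook": "Rain",     "Temperature": "Cool", "Humidity": "Normal", "Wind": "Strong"}, "No"),
    ({"Outlook": "Overcast", "Temperature": "Cool", "Humidity": "Normal", "Wind": "Strong"}, "Yes"),
    ({"Outlook": "Sunny",    "Temperature": "Mild", "Humidity": "High",   "Wind": "Weak"},   "No"),
    ({"Outlook": "Sunny",    "Temperature": "Cool", "Humidity": "Normal", "Wind": "Weak"},   "Yes"),
    ({"Outlook": "Rain",     "Temperature": "Mild", "Humidity": "Normal", "Wind": "Weak"},   "Yes"),
    ({"Outlook": "Sunny",    "Temperature": "Mild", "Humidity": "Normal", "Wind": "Strong"}, "Yes"),
    ({"Outlook": "Overcast", "Temperature": "Mild", "Humidity": "High",   "Wind": "Strong"}, "Yes"),
    ({"Outlook": "Overcast", "Temperature": "Hot",  "Humidity": "Normal", "Wind": "Weak"},   "Yes"),
    ({"Outlook": "Rain",     "Temperature": "Mild", "Humidity": "High",   "Wind": "Strong"}, "No"),
]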
COMP9417 T1, 2022 33

Information gain once more
Which attribute is the best classifier?
S: [9+,5-], E = 0.940

Split on Humidity:
  High:   [3+,4-], E = 0.985
  Normal: [6+,1-], E = 0.592
Gain(S, Humidity) = .940 - (7/14).985 - (7/14).592 = .151

Split on Wind:
  Weak:   [6+,2-], E = 0.811
  Strong: [3+,3-], E = 1.00
Gain(S, Wind) = .940 - (8/14).811 - (6/14)1.0 = .048
COMP9417 T1, 2022 34

Information gain once more
{D1, D2, …, D14}: [9+,5-]
Split on Outlook:
  Sunny:    {D1, D2, D8, D9, D11}  [2+,3-]
  Overcast: {D3, D7, D12, D13}     [4+,0-]
  Rain:     {D4, D5, D6, D10, D14} [3+,2-]

Which attribute should be tested at the Sunny node? S_sunny = {D1, D2, D8, D9, D11}

Gain(S_sunny, Humidity)    = .970 - (3/5) 0.0 - (2/5) 0.0 = .970
Gain(S_sunny, Temperature) = .970 - (2/5) 0.0 - (2/5) 1.0 - (1/5) 0.0 = .570
Gain(S_sunny, Wind)        = .970 - (2/5) 1.0 - (3/5) .918 = .019
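These numbers can be reproduced with the information_gain() and play_tennis sketches from earlier; small differences in the last decimal place arise because the slides round the intermediate entropies.

# Root-level gains over the full 14-example training set.
for attr in ["Outlook", "Temperature", "Humidity", "Wind"]:
    print(attr, round(information_gain(play_tennis, attr), 3))
# -> Outlook 0.247, Temperature 0.029, Humidity 0.152, Wind 0.048

# Gains within the Outlook = Sunny subset {D1, D2, D8, D9, D11}.
sunny = [(f, y) for f, y in play_tennis if f["Outlook"] == "Sunny"]
for attr in ["Temperature", "Humidity", "Wind"]:
    print(attr, round(information_gain(sunny, attr), 3))
# -> Temperature 0.571, Humidity 0.971, Wind 0.02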
COMP9417 T1, 2022 35

Information Gain
Will information gain always find the perfect solution?
COMP9417 T1, 2022 36

Information Gain Limitations
• Information gain is biased towards attributes with a large number of values/categories
o Subsets are more likely to be pure if there is a large number of values
o This can result in overfitting (selection of an attribute which is non-optimal for prediction) and a tree that does not generalize well to unseen data
• Suggested solution: gain ratio
o A modification of information gain that reduces the bias
o Takes the number and size of branches into account
COMP9417 T1, 2022 37

Gain ratio
SplitEntropy(S, A) = − Σ_{v ∈ Values(A)} (|S_v| / |S|) log₂ (|S_v| / |S|)
§ A: the candidate attribute
§ v: ranges over the possible values of A
§ S: the set of examples {X} at the node
§ S_v: the subset of examples where attribute A takes value v

GainRatio(S, A) = Gain(S, A) / SplitEntropy(S, A)
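A corresponding sketch, reusing information_gain() and play_tennis from earlier (note that SplitEntropy is zero when an attribute takes a single value at a node, so a real implementation would guard against division by zero):

from collections import Counter
from math import log2

def split_entropy(examples, attr):
    # Entropy of the split proportions themselves: how the examples spread over values of attr.
    n = len(examples)
    sizes = Counter(features[attr] for features, _ in examples).values()
    return -sum((s / n) * log2(s / n) for s in sizes)

def gain_ratio(examples, attr):
    return information_gain(examples, attr) / split_entropy(examples, attr)

print(round(gain_ratio(play_tennis, "Outlook"), 3))  # 0.156 (gain 0.247 divided by split entropy 1.577)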
COMP9417 T1, 2022 38

Overfitting in Decision Trees
• A decision tree can always classify the training examples perfectly
– If necessary, keep splitting until each node contains exactly one example
– A singleton subset (a leaf node with one example) is by definition pure
• But this is not always ideal
– In leaf nodes with singleton subsets, you have no confidence in your decision
COMP9417 T1, 2022 39

Overfitting in Decision Tree Learning
[Figure: accuracy vs. size of tree (number of nodes); accuracy on the training data keeps increasing as the tree grows, while accuracy on the test data levels off and then declines]
COMP9417 T1, 2022 40

Overfitting in General
Consider the error of hypothesis h over:
o the training data: error_train(h)
o the entire distribution D of the data: error_D(h)
Definition
Hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h′ ∈ H such that
error_train(h) < error_train(h′) and error_D(h) > error_D(h′)
COMP9417 T1, 2022 41

Avoiding Overfitting
How can we avoid overfitting? Pruning:
o pre-pruning: stop growing when a data split is not statistically significant
o post-pruning: grow the full tree, then remove sub-trees which are overfitting (based on a validation set)
Post-pruning avoids the problem of “early stopping”.
COMP9417 T1, 2022 42

Avoiding Overfitting
Pre-pruning
– Can be based on statistical significance test
– Stops growing the tree when there is no statistically significant association between any attribute and the class at a particular node
– For example, in ID3: chi-squared test plus information gain
o only statistically significant attributes were allowed to be selected by the information gain procedure
COMP9417 T1, 2022 43

Avoiding Overfitting
Pre-pruning
– Simplest approach: stop growing the tree when a node contains fewer examples than some lower bound
o In sklearn, this parameter is min_samples_leaf
– Stop when the change in entropy is smaller than some lower bound
o In sklearn, the parameter min_impurity_decrease enables stopping when the impurity decrease falls below a lower bound
COMP9417 T1, 2022 44

Avoiding Overfitting
• Other pre-pruning approaches (a short sklearn sketch follows this list):
o Maximum number of leaf nodes
§ in sklearn, max_leaf_nodes
o Minimum number of samples to split
§ in sklearn, min_samples_split
o Maximum depth
§ in sklearn, max_depth
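A minimal sketch of how these pre-pruning knobs appear in scikit-learn; the dataset and the particular parameter values are arbitrary choices for illustration:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier(
    criterion="entropy",
    max_depth=4,                 # maximum depth of the tree
    min_samples_split=10,        # minimum number of samples required to split a node
    min_samples_leaf=5,          # lower bound on the number of examples at a leaf
    max_leaf_nodes=8,            # maximum number of leaf nodes
    min_impurity_decrease=0.01,  # do not split unless impurity decreases by at least this much
    random_state=0,
)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))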
COMP9417 T1, 2022 45

Avoiding Overfitting
Early stopping
• Pre-pruning may suffer from early stopping: it may stop the growth of the tree prematurely
• Classic example: the XOR/parity problem
o No individual attribute exhibits a significant association with the class
o The target structure is only visible in the fully expanded tree
o Pre-pruning won’t expand the root node
• But: XOR-type problems not common in practice
• And: pre-pruning faster than post-pruning
COMP9417 T1, 2022 46

Avoiding Overfitting
Post-pruning
• Builds full tree first and prunes it afterwards
o Attribute interactions are visible in fully-grown tree
• Problem: identification of subtrees and nodes that are due to chance effects
• Subtree replacement
• Possible strategies: error estimation, significance testing, the MDL principle. We examine:
– Reduced-error pruning
– Minimum error
– Smallest tree
COMP9417 T1, 2022 47

Reduced-Error Pruning
Split the data into a training and a validation set.
Do until further pruning is harmful:
o Evaluate the impact (on the validation set) of pruning each node at the bottom of the tree and replacing it with a leaf
o Greedily remove the one that most improves validation set accuracy
• Good: it is simple and produces the smallest version of the most accurate subtree
• Not so good: it reduces the effective size of the training set
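Reduced-error pruning itself is not built into scikit-learn; as a related illustration of post-pruning with a validation set, the sketch below uses sklearn’s cost-complexity pruning (ccp_alpha), which is a different pruning criterion from the one described above:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Grow the full tree and obtain the sequence of candidate pruning strengths (alphas).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Pick the alpha whose pruned tree does best on the held-out validation data.
best_alpha = max(
    path.ccp_alphas,
    key=lambda a: DecisionTreeClassifier(random_state=0, ccp_alpha=a)
                  .fit(X_train, y_train).score(X_val, y_val),
)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_train, y_train)
print("validation accuracy:", pruned.score(X_val, y_val), "leaves:", pruned.get_n_leaves())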
COMP9417 T1, 2022 48

Effect of Reduced-Error Pruning
[Figure: accuracy vs. size of tree (number of nodes), shown for the training data, the test data, and the test data during pruning]
COMP9417 T1, 2022 49

Pruning operator: Sub-tree replacement
Bottom-up:
Tree is considered for replacement once all its sub-trees have been considered
COMP9417 T1, 2022 50

Minimum Error
• Track the cross-validation error while growing the tree
• Prune back to the point where the cross-validated error is a minimum.
– In the following example, the first 36 nodes will be kept
[Figure: cross-validated error vs. number of nodes; the minimum error occurs at around 36 nodes]
COMP9417 T1, 2022 51

Smallest Tree
• Track the cross-validation error while growing the tree
• Prune back the tree to the smallest tree within 1 standard error of the minimum error.
– In the following example, the first 20 nodes will be kept
[Figure: cross-validated error vs. number of nodes; the smallest tree within one standard error of the minimum has about 20 nodes]
COMP9417 T1, 2022 52

Converting A Tree to Rules
[Figure: the PlayTennis decision tree from earlier]

IF (Outlook = Sunny) ∧ (Humidity = High)
THEN PlayTennis = No

IF (Outlook = Sunny) ∧ (Humidity = Normal)
THEN PlayTennis = Yes

…
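As an extra illustration (reusing the play_tennis_tree dictionary sketched earlier), a small routine that enumerates one rule per root-to-leaf path:

def tree_to_rules(tree, conditions=()):
    # Yield (conditions, label) pairs: one rule per root-to-leaf path.
    if not isinstance(tree, dict):          # leaf reached: emit the accumulated rule
        yield conditions, tree
        return
    attribute = next(iter(tree))
    for value, subtree in tree[attribute].items():
        yield from tree_to_rules(subtree, conditions + ((attribute, value),))

for conds, label in tree_to_rules(play_tennis_tree):
    print("IF " + " AND ".join(f"{a} = {v}" for a, v in conds) + f" THEN PlayTennis = {label}")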
