Lecture 18: Decision Trees
Introduction to Machine Learning Semester 1, 2022
Copyright @ University of Melbourne 2022. All rights reserved. No part of the publication may be reproduced in any form by print, photoprint, microfilm or any other means without written permission from the author.
So far … Classification and Evaluation
• KNN, Naive Bayes, Logistic Regression, Perceptron
• Probabilistic models
• Loss functions, and estimation
• Evaluation
Today… Decision Trees
• Definition and motivation
• Estimation (ID3 Algorithm)
• Discussion
From Decision Stumps to Decision Trees
We have seen decision stumps in action in the context of 1-R
Given the obvious myopia of decision stumps, how can we construct decision trees (of arbitrary depth) that can capture complex feature interactions?
[Figure: 1-R decision stumps for outlook (sunny / o'cast / rainy), temperature (hot / mild / cool) and windy (true / false), each annotated with its total errors on the weather data]
The Weather Dataset (again!)
     Outlook    Temperature  Humidity  Windy  Play
a:   sunny      hot          high      FALSE  no
b:   sunny      hot          high      TRUE   no
c:   overcast   hot          high      FALSE  yes
d:   rainy      mild         high      FALSE  yes
e:   rainy      cool         normal    FALSE  yes
f:   rainy      cool         normal    TRUE   no
g:   overcast   cool         normal    TRUE   yes
h:   sunny      mild         high      FALSE  no
i:   sunny      cool         normal    FALSE  yes
j:   rainy      mild         normal    FALSE  yes
k:   sunny      mild         normal    TRUE   yes
l:   overcast   mild         high      TRUE   yes
m:   overcast   hot          normal    FALSE  yes
n:   rainy      mild         high      TRUE   no
Rule-based classification
[Figure: decision tree for the weather data]
• Construct the tree
• Extract one rule per leaf node
1. if (outlook == o’cast) → yes
2. if (outlook == sunny & humidity == normal) → yes
3. if (outlook == rainy & windy == false) → yes
Total errors = 0/14
Disjunctive descriptions
[Figure: decision tree for the weather data]
Decision Trees can be read as a disjunction; for example, Yes:
(outlook = sunny ∧ humidity = normal) ∨ (outlook = overcast) ∨ (outlook = rainy ∧ windy = false)
Total errors = 0/14
Decision Trees: Classifying Novel Instances
At test time…
• Assume we have constructed a decision tree
• Now, classify a novel instance by traversing down the tree and predicting the class according to the label of the deepest reachable node (usually a leaf)
Complications
• unobserved attribute–value pairs
• missing values
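To make the traversal concrete, here is a minimal sketch (not from the lecture) of classifying by walking a tree stored as nested dicts; the representation and the classify/default names are illustrative assumptions.

# A minimal sketch (assumption: trees stored as nested dicts; leaf nodes are plain class labels).

def classify(tree, instance, default="yes"):
    """Traverse the tree until a leaf (a class label) is reached."""
    while isinstance(tree, dict):
        value = instance.get(tree["attribute"])
        if value not in tree["branches"]:   # unobserved or missing value
            return default                  # e.g. fall back to a majority class
        tree = tree["branches"][value]
    return tree

# The final weather tree from the slides, written out by hand:
weather_tree = {
    "attribute": "outlook",
    "branches": {
        "o'cast": "yes",
        "sunny": {"attribute": "humidity",
                  "branches": {"high": "no", "normal": "yes"}},
        "rainy": {"attribute": "windy",
                  "branches": {True: "no", False: "yes"}},
    },
}

print(classify(weather_tree, {"outlook": "sunny", "humidity": "normal", "windy": False}))  # -> yes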
Classification Example
Classify test instance:
[outlook=sunny, temp=hot, humdt=normal, wind=False]
[Figure: decision tree for the weather data; total errors = 0/14]
Traversal: outlook = sunny → humidity = normal → predict yes
Classification Example
Classify test instance:
[outlook=rainy, temp=hot, humdt=low, wind=False]
[Figure: decision tree for the weather data; total errors = 0/14]
Traversal: outlook = rainy → windy = false → predict yes (the unobserved value humdt=low is never consulted on this path)
Classification Example
Classify test instance:
[outlook=?, temp=cool, humdt=high, wind=True]
[Figure: decision tree for the weather data; total errors = 0/14]
Traversal: outlook is missing, so the tree cannot be traversed past the root (the ‘missing values’ complication)
Decision Trees: Issues
• How to build an optimal tree?
• What does ‘optimal’ mean?
• How to choose attributes for decision points?
• When to stop growing the tree?
[Figure: decision tree for the weather data; total errors = 0/14]
ID3 Algorithm
ID3: Overview
Constructing an optimal Decision Tree is NP-hard.
So we use heuristics:
• Choose an attribute to partition the data at the node such that each partition is as pure (homogeneous) as possible.
• In each partition most of the instances should belong to as few classes as possible
• Each partition should be as large as possible.
We can stop the growth of the tree if all the leaf nodes are (largely) dominated by a single class (that is the leaf nodes are nearly pure).
[Figure: partitioning the weather instances {a, ..., n}: splitting on outlook gives sunny {a,b,h,i,k}, o’cast {c,g,l,m} and rainy {d,e,f,j,n}; the sunny branch is further split on humidity into high {a,b,h} and normal {i,k}]
Constructing Decision Trees: ID3
Basic method: recursive divide-and-conquer
FUNCTION ID3(Root)
  IF all instances at Root have the same class** THEN
    stop (label Root with that class)
  ELSE
    1. Select a new attribute to use in partitioning the instances at Root
    2. Create a branch for each attribute value and partition the instances at Root according to each value
    3. Call ID3(LEAF_i) for each leaf node LEAF_i
**This is overly simplified, as we will discuss momentarily
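A minimal Python sketch of this recursion, under assumptions not in the slides: instances are dicts mapping attribute names to values, the tree is a nested dict, and attribute selection uses the information gain criterion introduced later in this lecture. It is illustrative, not the canonical ID3 implementation.

import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(instances, labels, attribute):
    """Entropy at the node minus the weighted entropy of its children."""
    mean_info = 0.0
    for v in set(inst[attribute] for inst in instances):
        subset = [lab for inst, lab in zip(instances, labels) if inst[attribute] == v]
        mean_info += (len(subset) / len(labels)) * entropy(subset)
    return entropy(labels) - mean_info

def id3(instances, labels, attributes):
    # Stop if all instances share one class (or no attributes remain): return a leaf label.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # 1. Select the attribute with the largest information gain.
    best = max(attributes, key=lambda a: information_gain(instances, labels, a))
    # 2.-3. Create one branch per value and recurse on each partition.
    tree = {"attribute": best, "branches": {}}
    for v in set(inst[best] for inst in instances):
        part = [(inst, lab) for inst, lab in zip(instances, labels) if inst[best] == v]
        sub_inst, sub_lab = zip(*part)
        tree["branches"][v] = id3(list(sub_inst), list(sub_lab),
                                  [a for a in attributes if a != best])
    return tree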
Criterion for Attribute Selection
How do we choose the attribute to partition the instances at a given node?
We want to get the smallest tree (Occam’s Razor; generalisability). Prefer the shortest hypothesis that fits the data.
• Fewer short hypotheses than long hypotheses
• a short hyp. that fits the data is unlikely to be a coincidence
• a long hyp. that fits the data might be a coincidence
• Many ways to define small sets of hypotheses
Entropy and Information Gain (Intuition)
Information Gain: ‘Reduction of entropy before and after the data is partitioned using the attribute A’.
Entropy: The expected (average) level of surprise or uncertainty.
Given a random variable (e.g., a coinflip), how surprised am I when seeing a
certain outcome?
• P(H)>>P(T)
• High predictability
• Low surprisal, low entropy
• P(H)≈P(T)
• Low predictability
• High surprisal, high entropy
• Low probability event: if it happens, it’s big news! High surprise! High information!
• High probability event: it was likely to happen anyway. Not very surprising. Low information!
Entropy (Definition)
• A measure of unpredictability
• Level of unpredictability (surprise) for a single event i: self-information
self-info(i) = log2(1 / P(i)) = − log2 P(i)
• Given a probability distribution, the information (in bits) required to predict an event is the distribution’s entropy or information value
• The entropy of a discrete random event x with possible outcomes x_1, ..., x_n is:
H(x) = Σ_{i=1..n} P(x_i) · self-info(x_i) = − Σ_{i=1..n} P(x_i) log2 P(x_i)
where 0 log2 0 := 0
Entropy (Definition)
Example 1: Coin flips.
• Biased coin. 55 flips: 50× head, 5× tail:
  H = −[(50/55) log2(50/55) + (5/55) log2(5/55)] ≈ 0.44 bits
• Fair coin. 55 flips: 30× head, 25× tail:
  H = −[(30/55) log2(30/55) + (25/55) log2(25/55)] ≈ 0.99 bits
The more uncertainty, the higher the entropy.
Entropy (Definition)
Example 2 In the context of Decision Trees, we are looking at the class distribution at a node:
• 50 Y instances, 5 N instances:
  H = −[(50/55) log2(50/55) + (5/55) log2(5/55)] ≈ 0.44 bits
• 30 Y instances, 25 N instances:
  H = −[(30/55) log2(30/55) + (25/55) log2(25/55)] ≈ 0.99 bits
We want to classify with high certainty (or: low surprise).
We want leaf nodes that result in a low entropy distribution over classes!
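A quick numerical check of the two entropy values above, using a small helper (the function name and count-list interface are assumptions for illustration):

import math

def entropy(counts):
    """Entropy (in bits) of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(round(entropy([50, 5]), 2))    # 0.44  (50 Y / 5 N, or the biased coin)
print(round(entropy([30, 25]), 2))   # 0.99  (30 Y / 25 N, or the near-fair coin)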
Entropy: Summary
Entropy is a measure of unpredictability
• If the probability of a single class is high
• Probability mass is centered
• Entropy is low
• The event is predictable
• If the probability is evenly divided between multiple classes
• Probability mass is spread out
• Entropy is high
• The event is unpredictable
[Figure: outlook decision stump (sunny / o’cast / rainy); total errors = 4/14]
From Entropy to Information Gain
• Decision tree with low entropy: class is more predictable.
• Information Gain (reduction of entropy): measures how much uncertainty was reduced.
• Select the attribute that has largest information gain: the most entropy (uncertainty) is reduced, class is most predictable.
[Figure: outlook decision stump (sunny / o’cast / rainy); total errors = 4/14]
Information Gain
The expected reduction in entropy caused by knowing the value of an attribute: the difference between
• the entropy before splitting the node using the attribute’s values, and
• the weighted average of the entropy over the children after the split (this is called the Mean Information)
If the entropy decreases, then we have a better tree (more predictable)
Mean Information Associated with a Decision Stump
• We calculate the mean information for a tree stump with m attribute values as:
Mean Info(x_1, ..., x_m) = Σ_{i=1..m} P(x_i) H(x_i)
where H(x_i) is the entropy of the class distribution for the instances at sub-node x_i,
and P(x_i) is the proportion of instances at sub-node x_i
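A small sketch of this calculation (the mean_info and entropy helpers and the class-count representation are illustrative assumptions), checked against the outlook split worked through on the next slide:

import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def mean_info(partitions):
    """Weighted average entropy over the children of a split.
    `partitions` holds one class-count list per attribute value."""
    n = sum(sum(p) for p in partitions)
    return sum((sum(p) / n) * entropy(p) for p in partitions)

# outlook: sunny [2 yes, 3 no], o'cast [4 yes, 0 no], rainy [3 yes, 2 no]
print(round(mean_info([[2, 3], [4, 0], [3, 2]]), 3))   # 0.694 (the slides' 0.693 rounds intermediate values)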
Mean Information (outlook)
H(x) = −Σ_i P(x_i) log2 P(x_i)
[Figure: outlook decision stump (sunny / o’cast / rainy) with the class distribution at each branch]

H(sunny)    = −[(2/5) log2(2/5) + (3/5) log2(3/5)]     = −(−0.5288 − 0.4422) = 0.971
H(overcast) = −[(4/4) log2(4/4) + (0/4) log2(0/4)]     = 0
H(rainy)    = −[(3/5) log2(3/5) + (2/5) log2(2/5)]     = 0.971
H(R)        = −[(9/14) log2(9/14) + (5/14) log2(5/14)] = −(−0.4098 − 0.5305) = 0.940

Mean info = P(rainy) H(rainy) + P(overcast) H(overcast) + P(sunny) H(sunny)
          = 5/14 × 0.971 + 0 + 5/14 × 0.971
          = 0.693
Mean Information (temperature)
[Figure: temperature decision stump (hot / mild / cool)]
mean info = .911
Mean Information (humidity)
mean info = .787
Mean Information (windy)
[Figure: windy decision stump (true / false)]
mean info = .892
Attribute Selection: Information Gain
• We determine which attribute R_A (with values x_1, ..., x_m) best partitions the instances at a given root node R according to information gain (IG):

IG(R_A | R) = H(R) − mean-info(R_A) = H(R) − Σ_{i=1..m} P(x_i) H(x_i)

H(R) = 0.94

Mean info(outlook)     = 0.693    IG(outlook | R)     = 0.247
Mean info(temperature) = 0.911    IG(temperature | R) = 0.029
Mean info(humidity)    = 0.787    IG(humidity | R)    = 0.152
Mean info(windy)       = 0.892    IG(windy | R)       = 0.048
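As a sanity check, the four information gains above can be recomputed directly from the weather dataset; the following sketch (helper names are illustrative) reproduces the values on this slide:

import math
from collections import Counter, defaultdict

# Weather dataset from the slides: (outlook, temperature, humidity, windy, play)
data = [
    ("sunny", "hot", "high", False, "no"),     ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"), ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"), ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"), ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"), ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),  ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"), ("rainy", "mild", "high", True, "no"),
]
attributes = ["outlook", "temperature", "humidity", "windy"]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(attr_index):
    labels = [row[-1] for row in data]
    groups = defaultdict(list)
    for row in data:
        groups[row[attr_index]].append(row[-1])
    mean_info = sum(len(g) / len(data) * entropy(g) for g in groups.values())
    return entropy(labels) - mean_info

for i, name in enumerate(attributes):
    print(f"IG({name}) = {info_gain(i):.3f}")
# IG(outlook) = 0.247, IG(temperature) = 0.029, IG(humidity) = 0.152, IG(windy) = 0.048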
Shortcomings of Information Gain
Information gain tends to prefer highly-branching attributes:
• A subset of instances is more likely to be homogeneous (pure) if there are only a few instances
• Attribute with many values will have fewer instances at each child node
This may result in overfitting / fragmentation
Mean Information (label)
Information gain tends to prefer highly-branching attributes:
[Figure: stump splitting on the instance label (a ... n); each leaf contains a single instance, so each leaf has entropy 0]
H(R) = .940; mean info = 0
Solution: Gain Ratio
• Gain ratio (GR) reduces the bias for information gain towards highly-branching attributes by normalising relative to the split information
• Split info (SI) is the entropy of a given split (evenness of the distribution of instances to attribute values)
GR(R_A | R) = IG(R_A | R) / SI(R_A | R)
            = [H(R) − Σ_{i=1..m} P(x_i) H(x_i)] / [− Σ_{i=1..m} P(x_i) log2 P(x_i)]
• SI is the entropy of the attribute itself
• Discourages the selection of attributes with many uniformly distributed values, which have high entropy.
Split Info
SI(outlook | R) = −[(5/14) log2(5/14) + (4/14) log2(4/14) + (5/14) log2(5/14)] = 1.577
[Figure: outlook decision stump (sunny / o’cast / rainy); mean info = .693]
NB: Entropy of distribution of instances to attribute values (disregarding classes, unlike Mean Info)
Gain Ratio: Example
Attribute      IG       SI       GR
outlook        0.247    1.577    0.156
temperature    0.029    1.557    0.019
humidity       0.152    1.000    0.152
windy          0.048    0.985    0.049
label          0.940    3.807    0.247   (IG = H(R) − 0; SI = log2 14)
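The outlook row can be verified with a short sketch (the split_info helper is an illustrative assumption): split info is the entropy of the 5/4/5 distribution of instances over outlook values, and the gain ratio divides IG by it.

import math

def split_info(value_counts):
    """Entropy of the split itself: how evenly instances spread over attribute values."""
    n = sum(value_counts)
    return -sum((c / n) * math.log2(c / n) for c in value_counts if c > 0)

ig_outlook = 0.247                        # from the previous slide
si_outlook = split_info([5, 4, 5])        # 5 sunny, 4 o'cast, 5 rainy
print(round(si_outlook, 3))               # 1.577
print(round(ig_outlook / si_outlook, 3))  # 0.157 (~0.156 with unrounded IG)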
Stopping criteria I
The definition of ID3 above suggests that:
• We recurse until the instances at a node are of the same class
• This is consistent with our usage of entropy: if all of the instances are of a single class, the entropy of the distribution is 0
• Considering other attributes cannot “improve” an entropy of 0 — the Info Gain is 0 by definition
This helps to ensure that the tree remains compact (Occam’s Razor)
Stopping criteria II
The definition of ID3 above suggests that:
• The Info Gain/Gain Ratio allows us to choose the (seemingly) best attribute at a given node
• However, it is also an approximate indication of how much absolute improvement we expect from partitioning the data according to the values of a given attribute
• An Info Gain of 0 means that there is no improvement; a very small improvement is often unjustifiable
• Typical modification of ID3: choose best attribute only if IG/GR is greater than some threshold τ
• Other similar approaches use pruning — post-process the tree to remove undesirable branches (with few instances, or small IG/GR improvements)
Stopping criteria III
The definition of ID3 above suggests that:
• We might observe improvement through every layer of the tree
• We then run out of attributes, even though one or more leaves could be improved further
• Fall back to majority class label for instances at a leaf with a mixed distribution — unclear what to do with ties
• Possibly can be taken as evidence that the given attributes are insufficient for solving the problem
Discussion
Hypothesis Space Search in ID3
ID3 (and DT learning in general) is an instance of combinatorial optimization
• ID3 can be characterized as searching a space of hypotheses for one that fits the training examples.
• The hypothesis space searched by ID3 is the set of possible decision trees.
• ID3 performs a greedy simple-to-complex search through this hypothesis space (with no backtracking),
• beginning with the empty tree
• considering progressively more elaborate hypotheses in search of a decision tree that correctly classifies the training data
Pros / Cons of Decision Trees
• Highly regarded among basic supervised learners
• Fast to train, even faster to classify
• Very transparent (probably the most interpretable of all classification algorithms!)
• Prone to Overfitting (why?)
• Loss of information for continuous variables
• Complex calculation if there are many classes
• No guarantee of returning the globally optimal decision tree
• Information gain: Bias for attributes with greater no. of values.
Variants of Decision Trees
ID3 is not the only (nor most popular) Decision Tree learner:
• Oblivious Decision Trees require the same attribute at every node in a layer
• Random Tree only uses a sample of the possible attributes at a given node
• Helps to account for irrelevant attributes
• Basis for a better Decision Tree variant: Random Forest. More on this in the next lecture!
Summary
• Describe the basic decision tree induction method used in ID3
• What is information gain, how is it calculated and what is its primary shortcoming?
• What is gain ratio, and how does it attempt to overcome the shortcoming of information gain?
• What are the theoretical and practical properties of ID3-style decision trees?
Mitchell, Tom (1997). Machine Learning. Chapter 3: Decision Tree Learning.
Tan et al. (2006). Introduction to Data Mining. Section 4.3, pp. 150-171.