Decision Trees
A classifier, built from the training data, maps input attributes X1 = x1, ..., XM = xM to a class prediction Y = y.
Decision Tree Example
• Three variables:
– Hair = {blond, dark}
– Height = {tall, short}
– Country = {Gromland, Polvia}
• Training data: (B,T,P) (B,T,P) (B,S,G) (D,S,G) (D,T,G) (B,S,G)
Example tree: the root node contains P:2 G:4. Splitting on Hair gives a Hair = B node (P:2 G:2) and a Hair = D node (P:0 G:2); splitting the Hair = B node on Height gives a Height = T node (P:2 G:0) and a Height = S node (P:0 G:2).
At each level of the tree, we split the data according to the value of one of the attributes.
After enough splits, only one class is represented in the node. Such a node is a terminal leaf of the tree, and we call that class the output class for the node. In the example tree, 'G' is the output class for the P:0 G:2 leaves.
The class of a new input is found by following the tree all the way down to a leaf and reporting the output of that leaf. For example:
(B,T) is classified as P
(D,S) is classified as G
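For concreteness, the example tree can be written as a chain of if/else tests. The sketch below is mine (not from the slides) and assumes the inputs are given as the single-letter codes used in the training data (B/D for Hair, T/S for Height).

def classify(hair, height):
    """Classify an input with the example tree; the class is the Country (P or G)."""
    if hair == 'B':          # Hair = B? node (P:2 G:2)
        if height == 'T':    # Height = T? node (P:2 G:0)
            return 'P'
        return 'G'           # Height = S node (P:0 G:2)
    return 'G'               # Hair = D node (P:0 G:2)

print(classify('B', 'T'))    # -> P
print(classify('D', 'S'))    # -> G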
General Case (Discrete Attributes)
• We have R observations from training data
– Each observation has M attributes X1,..,XM
– Each Xi can take N distinct discrete values
– Each observation has a class attribute Y with C distinct (discrete) values
– Problem: Construct a sequence of tests on the attributes such that, given a new input (x1,..,xM), the class attribute y is correctly predicted
Training Data
The training data consist of R observations (Data 1, Data 2, ..., Data R); each observation supplies values (x1,..,xM) for the attributes X1,..,XM and a class value y for Y.
X = attributes of training data (R x M)
Y = class of training data (R)
A new input supplies attribute values (x'1,..,x'M); its class value for Y is unknown (???) and must be predicted.
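In code, the training set is simply an R x M array of attribute values plus a length-R vector of class values. A minimal sketch using the Hair/Height example above (plain Python lists; the slides do not prescribe any particular representation):

# R = 6 observations, M = 2 attributes (Hair, Height); the class Y is the Country
X = [['B', 'T'], ['B', 'T'], ['B', 'S'], ['D', 'S'], ['D', 'T'], ['B', 'S']]  # R x M attributes
Y = ['P', 'P', 'G', 'G', 'G', 'G']                                            # R class values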
General Decision Tree (Discrete Attributes)
The root node tests X1, with one branch for each possible value (X1 = first possible value?, ..., X1 = nth possible value?). Each internal node tests some attribute Xj against one of its possible values (Xj = ith possible value of Xj?), and each leaf outputs a class (Output class Y = y1, ..., Output class Y = yc).
Decision Tree Example
Two continuous attributes, X1 and X2; the data are separated by the lines X1 = 0.5 and X2 = 0.5.
Example tree: the root node contains the counts :7 :4. The test X1 < 0.5? splits it into a pure node (:4 :0) and an impure node (:3 :4); the test X2 < 0.5? then splits the impure node into :3 :0 and :0 :4.
The class of a new input is found by following the tree all the way down to a leaf and reporting the output of that leaf; for example, the inputs (0.2,0.8) and (0.8,0.2) are each classified as the class of the leaf they reach.
General Case (Continuous Attributes)
• We have R observations from training data
– Each observation has M attributes X1,..,XM
– Each Xi can take continuous (real) values
– Each observation has a class attribute Y with C distinct (discrete) values
– Problem: Construct a sequence of tests of the form Xi < ti such that, given a new input (x1,..,xM), the class attribute y is correctly predicted
Example Entropy Calculation
Node (1): NA = 1 example of class A, NB = 6 examples of class B
pA = NA/(NA+NB) = 1/7 (frequency of occurrence of class A in node (1))
pB = NB/(NA+NB) = 6/7 (frequency of occurrence of class B in node (1))
H1 = -pA log2 pA - pB log2 pB = 0.59 (entropy of node (1))
Node (2): NA = 3, NB = 2
pA = NA/(NA+NB) = 3/5
pB = NB/(NA+NB) = 2/5
H2 = -pA log2 pA - pB log2 pB = 0.97 (entropy of node (2))
H1 < H2 => node (2) is less pure than node (1)
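The entropy of a node can be computed directly from its class counts; the sketch below (the function name is mine) reproduces the two values above.

import math

def entropy(counts):
    """H = -sum_c p_c log2 p_c for a node with the given class counts."""
    total = sum(counts)
    return -sum((n / total) * math.log2(n / total) for n in counts if n > 0)

print(round(entropy([1, 6]), 2))  # node (1): NA = 1, NB = 6 -> 0.59
print(round(entropy([3, 2]), 2))  # node (2): NA = 3, NB = 2 -> 0.97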
Conditional Entropy
Entropy before splitting: H
After splitting, a fraction PL of the data goes to the left node, which has entropy HL
After splitting, a fraction PR of the data goes to the right node, which has entropy HR
(PL is the probability that a random input is directed to the left node; HL is the entropy of the left node, and similarly for PR and HR.)
The average entropy after splitting, called the "conditional entropy", is:
HL x PL + HR x PR
Information Gain
We want nodes as pure as possible, i.e. we want to reduce the entropy as much as possible. Equivalently, we want to maximize the difference between the entropy H of the parent node and the expected entropy of the children, HL x PL + HR x PR.
Maximize:
IG = H - (HL x PL + HR x PR)
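A sketch of the conditional entropy and the information gain of a binary split, given the class counts in the two children (the helper names are mine, not from the slides):

from math import log2

def entropy(counts):
    """H = -sum_c p_c log2 p_c (same as in the earlier sketch)."""
    total = sum(counts)
    return -sum((n / total) * log2(n / total) for n in counts if n > 0)

def information_gain(left_counts, right_counts):
    """IG = H - (HL x PL + HR x PR): parent entropy minus conditional entropy."""
    n_l, n_r = sum(left_counts), sum(right_counts)
    p_l, p_r = n_l / (n_l + n_r), n_r / (n_l + n_r)
    parent = [l + r for l, r in zip(left_counts, right_counts)]
    return entropy(parent) - (p_l * entropy(left_counts) + p_r * entropy(right_counts))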
Notations
• Entropy: H(Y) = entropy of the distribution of classes at a node
• Conditional Entropy:
– Discrete: H(Y|Xj) = entropy after splitting with respect to variable Xj
– Continuous: H(Y|Xj,t) = entropy after splitting with respect to variable Xj with threshold t
• Information gain:
– Discrete: IG(Y|Xj) = H(Y) - H(Y|Xj) = reduction in entropy obtained by splitting with respect to variable Xj
– Continuous: IG(Y|Xj,t) = H(Y) - H(Y|Xj,t) = reduction in entropy obtained by splitting with respect to variable Xj with threshold t
Information Gain
Information Gain (IG) = amount by which the ambiguity (entropy) is decreased by splitting the node: IG = H - (HL x PL + HR x PR).
Example: two candidate splits of the same node of 11 data points (parent entropy H = 0.99) are compared.
– Split 1: 4 points go left and 7 go right, with HL = 0 and HR = 0.58, so IG = H - (HL x 4/11 + HR x 7/11) = 0.62
– Split 2: 5 points go left and 6 go right, with HL = 0.97 and HR = 0.92, so IG = H - (HL x 5/11 + HR x 6/11) = 0.052
Choose the first split, because the information gain is greater than with the other split.
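The two information gains above can be checked numerically from the quantities given on the slide (H, HL, HR and the left/right fractions); small rounding differences are expected since those values are themselves rounded.

H = 0.99                                 # parent entropy, 11 data points

ig1 = H - (0.0 * 4/11 + 0.58 * 7/11)     # split 1: HL = 0,    HR = 0.58
ig2 = H - (0.97 * 5/11 + 0.92 * 6/11)    # split 2: HL = 0.97, HR = 0.92

print(round(ig1, 2), round(ig2, 2))      # ~0.62 and ~0.05: pick split 1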
More Complete Example
• 20 training examples from class A and 20 training examples from class B
• Attributes = X1 and X2 coordinates
Plotting IG against the X1 split value, the best split value (max information gain) for the X1 attribute is 0.24, with IG = 0.138.
Plotting IG against the X2 split value, the best split value (max information gain) for the X2 attribute is 0.234, with IG = 0.202.
Best X1 split: 0.24, IG = 0.138; best X2 split: 0.234, IG = 0.202 => split on X2 with 0.234.
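One way to obtain the "IG vs. split value" curves and the best split value is to scan candidate thresholds for one attribute and keep the one with maximum information gain. The sketch below is a straightforward, unoptimized version under my own naming; x is one attribute column and y the class labels.

from collections import Counter
from math import log2

def entropy_of(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_threshold(x, y):
    """Return (t*, IG*) maximizing IG(Y|X,t) = H(Y) - H(Y|X,t) over candidate thresholds t."""
    h_parent = entropy_of(y)
    best_t, best_ig = None, -1.0
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):        # midpoints between consecutive values
        t = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi < t]
        right = [yi for xi, yi in zip(x, y) if xi >= t]
        h_cond = (len(left) / len(y)) * entropy_of(left) \
               + (len(right) / len(y)) * entropy_of(right)
        if h_parent - h_cond > best_ig:
            best_t, best_ig = t, h_parent - h_cond
    return best_t, best_ig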
After splitting on X2 with 0.234, one child node contains only data from a single class: there is no point in splitting it further, so it is returned as a leaf node with output 'A'. The other child node is not pure, so we need to split it further.
Plotting IG against the X1 split value for the impure node, the best split value (max information gain) for the X1 attribute is 0.22, with IG ~ 0.182.
Plotting IG against the X2 split value, the best split value (max information gain) for the X2 attribute is 0.75, with IG ~ 0.353.
Best X1 split: 0.22, IG = 0.182; best X2 split: 0.75, IG = 0.353 => split on X2 with 0.75.
After this split, one child node again contains only data from a single class: there is no point in splitting it further, so it is returned as a leaf node with output 'A'. The remaining impure node is then split on X1 (with split value 0.5), and the process continues until every node is pure.
Final decision tree
Each of the leaf nodes is pure: it contains data from only one class. The internal nodes test X1 and X2 against the chosen split values, and each leaf outputs class A or B.
Example: input (X1,X2) = (0.5,0.5)
Given an input (X1,X2), follow the final decision tree down to a leaf and return the corresponding output class for this leaf.
Basic Questions
• How to choose the attribute/value to split on at each level of the tree?
• When to stop splitting? When should a node be declared a leaf?
• If a leaf node is impure, how should the class label be assigned?
• If the tree is too large, how can it be pruned?
Pure and Impure Leaves and When to Stop Splitting
• Pure node: all the data in the node comes from a single class. We declare the node to be a leaf and stop splitting. This leaf will output the class of the data it contains.
• Impure node that cannot be split: several data points have exactly the same attributes even though they are from different classes. We cannot split any further. We still declare the node to be a leaf, but it will output the class that is the majority of the classes in the node (in this example, 'B').
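The leaf output rule for the impure case (majority class) is a one-liner; a sketch with my own helper name:

from collections import Counter

def leaf_output(classes):
    """Output class of a leaf: the majority of the class values it contains."""
    return Counter(classes).most_common(1)[0][0]

print(leaf_output(['B', 'B', 'A']))  # -> 'B'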
Decision Tree Algorithm (Continuous Attributes)
• LearnTree(X,Y)
– Input:
• Set X of R training vectors, each containing the values (x1,..,xM) of M attributes (X1,..,XM)
• A vector Y of R elements, where yj = class of the jth datapoint
– If all the datapoints in X have the same class value y
• Return a leaf node that predicts y as output
– If all the datapoints in X have the same attribute values (x1,..,xM)
• Return a leaf node that predicts the majority of the class values in Y as output
– Try all the possible attributes Xj and thresholds t, and choose the pair (j*, t*) for which IG(Y|Xj,t) is maximum
– XL, YL = set of datapoints for which xj* < t* and corresponding classes
– XH, YH = set of datapoints for which xj* >= t* and corresponding classes
– Left Child <- LearnTree(XL,YL)
– Right Child <- LearnTree(XH,YH)
(A Python sketch of this procedure is given below.)
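A Python sketch of LearnTree for continuous attributes, following the pseudocode above. It assumes X is a list of R attribute vectors and Y a list of R class labels, and it reuses the best_threshold helper from the earlier sketch; the nested-dict tree representation is my own choice, not something the slides specify.

from collections import Counter

def learn_tree(X, Y):
    """LearnTree(X, Y) for continuous attributes (sketch).
    Returns either a class label (leaf) or a dict with the chosen test and two subtrees."""
    # All datapoints have the same class value y: leaf predicting y.
    if len(set(Y)) == 1:
        return Y[0]
    # All datapoints have the same attribute values: leaf predicting the majority class.
    if all(row == X[0] for row in X):
        return Counter(Y).most_common(1)[0][0]
    # Try every attribute j (and its thresholds) and keep the pair with maximum IG.
    best_j, best_t, best_ig = None, None, -1.0
    for j in range(len(X[0])):
        t, ig = best_threshold([row[j] for row in X], Y)
        if t is not None and ig > best_ig:
            best_j, best_t, best_ig = j, t, ig
    # Split the data on xj* < t* and recurse on each side.
    left = [(row, c) for row, c in zip(X, Y) if row[best_j] < best_t]
    right = [(row, c) for row, c in zip(X, Y) if row[best_j] >= best_t]
    return {'attr': best_j, 'thresh': best_t,
            'left': learn_tree([r for r, _ in left], [c for _, c in left]),
            'right': learn_tree([r for r, _ in right], [c for _, c in right])}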
Decision Tree Algorithm (Discrete Attributes)
• LearnTree(X,Y)
– Input:
• Set X of R training vectors, each containing the values (x1,..,xM) of M attributes (X1,..,XM)
• A vector Y of R elements, where yj = class of the jth datapoint
– If all the datapoints in X have the same class value y
• Return a leaf node that predicts y as output
– If all the datapoints in X have the same attribute value (x1,..,xM)
• Return a leaf node that predicts the majority of the class values in Y as output
– Try all the possible attributes Xj and choose the one, j*, for which IG(Y|Xj) is maximum
– For every possible value v of Xj*:
• Xv, Yv = set of datapoints for which xj* = v, and corresponding classes
• Childv <- LearnTree(Xv,Yv)
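For a discrete attribute, H(Y|Xj) is the average entropy of the children, weighted by the fraction of datapoints taking each value v of Xj. A small sketch (helper names are mine), applied to the Hair attribute of the first example:

from collections import Counter
from math import log2

def entropy_of(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def conditional_entropy_discrete(x, y):
    """H(Y|Xj) = sum over values v of P(Xj = v) * H(Y in the child where Xj = v)."""
    h = 0.0
    for v in set(x):
        y_v = [yi for xi, yi in zip(x, y) if xi == v]
        h += (len(y_v) / len(y)) * entropy_of(y_v)
    return h

hair    = ['B', 'B', 'B', 'D', 'D', 'B']   # from (B,T,P)(B,T,P)(B,S,G)(D,S,G)(D,T,G)(B,S,G)
country = ['P', 'P', 'G', 'G', 'G', 'G']
print(entropy_of(country) - conditional_entropy_discrete(hair, country))  # IG(Y|Hair)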
Decision Trees So Far
• Given R observations from training data, each with M attributes X and a class attribute Y, construct a sequence of tests (a decision tree) to predict the class attribute Y from the attributes X
• Basic strategy for defining the tests ("when to split"): maximize the information gain on the training data set at each node of the tree
• Problems (next):
– Computational issues: how expensive is it to compute the IG?
– The tree will end up being much too big -> pruning
– Evaluating the tree on training data is dangerous -> overfitting