IT enabled Business Intelligence, CRM, Database Applications
Sep-18
Classification using Decision Trees
Prof. Vibs Abhishek
The Paul Merage School of Business
University of California, Irvine
BANA 273 Session 6
1
Agenda
Using Decision Tree for Classification
Building Decision Trees
Review Assignment 2
2
Reading Rules off the Decision Tree
For each leaf in the tree, read the rule from the root to that leaf.
You will arrive at a set of rules.
[Tree diagram: the root splits on Income; each Income branch splits on Balance; the Income=High, Balance=Low branch splits further on Age. Leaves are labeled Responder or Non-Responder.]
IF Income=Low AND Balance=Low THEN Non-Responder
IF Income=Low AND Balance=High THEN Responder
IF Income=High AND Balance=Low AND Age<45 THEN Non-Responder
IF Income=High AND Balance=Low AND Age>45 THEN Responder
IF Income=High AND Balance=High THEN Non-Responder
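As an aside, these rules translate directly into code. Below is a minimal illustrative sketch (the function name and the dict-based input format are assumptions, not part of the slides):

    # Minimal sketch: the five rules above as a plain Python classifier.
    def classify_responder(applicant):
        income = applicant["Income"]
        balance = applicant["Balance"]
        if income == "Low":
            return "Responder" if balance == "High" else "Non-Responder"
        if balance == "High":          # Income = High
            return "Non-Responder"
        # Income = High and Balance = Low: fall through to the Age test
        return "Responder" if applicant["Age"] > 45 else "Non-Responder"

    print(classify_responder({"Income": "High", "Balance": "Low", "Age": 52}))  # Responder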
3
Goal of Decision Tree Construction
Partition the training instances into purer sub-groups.
Pure: the instances in a sub-group mostly belong to the same class.
[Figure: the entire population is split by Age (<45 vs. >=45) and then by Balance (<50K vs. >=50K); the class variable Status takes the values Default / Not default.]
How to build a tree: how to split instances into purer sub-groups.
4
Purity Measures
Many purity measures are available:
Gini (population diversity)
Entropy (information gain)
Chi-square test
The most common one (from information theory) is Information Gain, which is based on entropy.
5
Why do we want to identify pure sub groups?
To classify a new instance, we can determine the leaf that the instance belongs to based on its attributes.
If the leaf is very pure (e.g. all have defaulted) we can determine with greater confidence that the new instance belongs to this class (i.e., the “Default” class.)
If the leaf is not very pure (e.g. a 50%/50% mixture of the two classes, Default and Not Default), our prediction for the new instance is more like a random guess.
6
Impurity
[Figure: three example groups, labeled "Very impure group", "Less impure", and "Minimum impurity".]
The figures above show the distribution of the class variable Status (Default / Not default).
7
Example Split
Consider the two following splits. Which one is more informative?
Split 1: whether Balance exceeds 50K (branches: over 50K / 50K or less).
Split 2: whether Income exceeds 100K (branches: over 100K / 100K or less).
[Figure: the class distribution (Status = Default / Not default) in each resulting sub-group.]
8
Decision Tree Construction
A tree is constructed by recursively partitioning the examples.
With each partition, the examples are split into increasingly purer sub-groups.
The key in building a tree: how to split.
9
Choosing a Split
Training data:
ApplicantID  City    Children  Income  Status
1            Philly  Few       Medium  PAYS
2            Philly  Few       High    PAYS
3            Philly  Many      Medium  DEFAULTS
4            Philly  Many      Low     DEFAULTS
Try a split on the Children attribute (branches: Many / Few).
Try a split on the Income attribute (branches: Low / Medium / High).
Notice how the split on the Children attribute gives purer partitions.
It is therefore chosen as the first split (and in this case the only split, because the two sub-groups are 100% pure).
10
Recursive Steps in Building a Tree
Example
STEP 1: Split Option A. Not good, as the sub-nodes are still very heterogeneous!
STEP 1: Split Option B. Better, as the purity of the sub-nodes is improving.
STEP 2: Choose Split Option B, as it is the better split.
STEP 3: Try out splits on each of the sub-nodes of Split Option B.
Eventually, we arrive at the final tree.
Notice how examples in a parent node are split between sub-nodes, i.e., how the training examples are partitioned into smaller and smaller subsets. Also notice that sub-nodes are purer than parent nodes.
11
Example 1: Riding Mower
12
Scatterplot of Lot Size versus Income
13
Splitting the Observations by Lot Size Value of 19
14
Tree Diagram: First Split
15
Second Split: Lot Size Value of 19K and then Income Value of 84.75K
16
Tree Diagram: First Two Splits
17
18
19
Final Partitioning
20
Full Tree
[Figure: the full classification tree for the riding-mower data, with leaves labeled by class (owner / non-owner).]
21
Calculate the probability of each branch
[Figure: the full tree with each branch annotated with its probability (fractions such as 12/24, 10/12, 7/9, ...); leaves labeled owner / non-owner.]
22
Given lot size = 20, what is the probability of owner?
[Figure: the same tree, with branch probabilities annotated.]
P(Owner | Lot size = 20) = P(Owner & Lot size = 20) / [P(Owner & Lot size = 20) + P(Non-Owner & Lot size = 20)]
23
Given Income = 60, what is the probability of owner?
[Figure: the same tree, with branch probabilities annotated; the calculation follows the same pattern as on the previous slide.]
24
Calculating Impurity
Impurity = Entropy = -Σ_i p_i log2(p_i)
p_i is the proportion of class i
For example: our initial population is composed of 16 cases of class "Default" and 14 cases of class "Not default"
Entropy(entire population of examples) = -(16/30)log2(16/30) - (14/30)log2(14/30) = 0.997
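A small illustrative sketch (not part of the slides) that computes this entropy from class counts and reproduces the 0.997 value:

    import math

    def entropy(counts):
        # Entropy = -sum_i p_i * log2(p_i), where p_i is the proportion of class i
        total = sum(counts)
        return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

    print(round(entropy([16, 14]), 3))  # 0.997 for 16 Default vs. 14 Not default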
25
Calculating the Information Gain of a Split
For each sub-group produced by the split, calculate the impurity/entropy of that subset.
Calculate the weighted entropy of the split by weighting each sub-group’s entropy by the proportion of training examples (out of the training examples in the parent node) that are in that subset.
Calculate the entropy of the parent node, and subtract the weighted entropy of the child nodes to obtain the information gain for the split.
26
Calculating Information Gain
Entire population (30 instances): Entropy = 0.997
Balance<50K: 17 instances, Entropy = -(13/17)log2(13/17) - (4/17)log2(4/17) = 0.787
Balance>=50K: 13 instances, Entropy = -(1/13)log2(1/13) - (12/13)log2(12/13) = 0.391
(Weighted) Average Entropy of Children = (17/30)(0.787) + (13/30)(0.391) = 0.615
Information Gain = Entropy (parent) - Entropy (children)
Information Gain = 0.997 - 0.615 = 0.382
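An illustrative sketch that reproduces this calculation; the class counts per child (13 vs. 4 and 1 vs. 12) are read off the entropies above:

    import math

    def entropy(counts):
        total = sum(counts)
        return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

    def information_gain(parent_counts, children_counts):
        # Gain = entropy(parent) - weighted average entropy of the children
        n = sum(parent_counts)
        weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
        return entropy(parent_counts) - weighted

    # Parent: 30 instances (16 vs. 14); children: 17 (13 vs. 4) and 13 (1 vs. 12)
    print(round(information_gain([16, 14], [[13, 4], [1, 12]]), 3))  # about 0.382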
27
Information Gain
Information Gain = Entropy (parent) - Entropy (children)
[Figure: the entire population (node A) is split on Balance into node B (Balance<50K) and node C (Balance>=50K); node B is then split on Age (Age<45 vs. Age>=45) into nodes D and E.]
Entropy(A) = 0.997
Entropy(B) = 0.787, Entropy(C) = 0.391
Entropy(B,C) = 0.615, so Gain = 0.997 - 0.615 = 0.382
Entropy(D) = -1*log2(1) - 0*log2(0) = 0 (a pure node; by convention 0*log2(0) = 0)
Entropy(E) = -(3/7)log2(3/7) - (4/7)log2(4/7) = 0.985
Entropy(D,E) = 0.406, so Gain = 0.787 - 0.406 = 0.381
28
Which attribute to split over?
At each node, examine splits over each of the attributes.
Select the attribute for which the maximum information gain is obtained.
For a continuous attribute, we also need to consider different ways of splitting (>50 or <=50; >60 or <=60) - see the threshold-search sketch below.
For a categorical attribute with many possible values, we sometimes also need to consider how to group those values (e.g., branch 1 corresponds to {A,B,E} and branch 2 to {C,D,F,G}).
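A small illustrative sketch (names and data format are assumptions) of the threshold search for a continuous attribute: sort the numeric values and score a cut between each pair of distinct neighbours by information gain.

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def best_threshold(values, labels):
        pairs = sorted(zip(values, labels))
        parent = entropy(labels)
        best_cut, best_gain = None, 0.0
        for i in range(1, len(pairs)):
            if pairs[i - 1][0] == pairs[i][0]:
                continue  # no cut point between identical values
            cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            left = [lab for _, lab in pairs[:i]]
            right = [lab for _, lab in pairs[i:]]
            weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
            if parent - weighted > best_gain:
                best_cut, best_gain = cut, parent - weighted
        return best_cut, best_gain

    # Example: the Weight column from the "Example 2" slide that follows
    weights = [250, 150, 90, 78, 20, 170, 160, 180, 200]
    classes = ["M", "F", "M", "F", "F", "M", "F", "M", "M"]
    print(best_threshold(weights, classes))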
29
Example 2
Person  Hair Length  Weight  Age  Class
Homer   0"           250     36   M
Marge   10"          150     34   F
Bart    2"           90      10   M
Lisa    6"           78      8    F
Maggie  4"           20      1    F
Abe     1"           170     70   M
Selma   8"           160     41   F
Otto    10"          180     38   M
Krusty  6"           200     45   M
Comic   8"           290     38   ?
30
Let us try splitting on Hair Length
Split: Hair Length <= 5? (yes / no)
Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911
Entropy(1F,3M) = -(1/4)log2(1/4) - (3/4)log2(3/4) = 0.8113
Entropy(3F,2M) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.9710
Gain = Entropy of parent – Weighted average of entropies of the children
Gain(Hair Length <= 5) = 0.9911 – (4/9 * 0.8113 + 5/9 * 0.9710) = 0.0911
31
Let us try splitting on Weight
Split: Weight <= 160? (yes / no)
Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911
Entropy(4F,1M) = -(4/5)log2(4/5) - (1/5)log2(1/5) = 0.7219
Entropy(0F,4M) = -(0/4)log2(0/4) - (4/4)log2(4/4) = 0
Gain(Weight <= 160) = 0.9911 – (5/9 * 0.7219 + 4/9 * 0) = 0.5900
32
Let us try splitting on Age
Split: Age <= 40? (yes / no)
Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911
Entropy(3F,3M) = -(3/6)log2(3/6) - (3/6)log2(3/6) = 1
Entropy(1F,2M) = -(1/3)log2(1/3) - (2/3)log2(2/3) = 0.9183
Gain(Age <= 40) = 0.9911 – (6/9 * 1 + 3/9 * 0.9183) = 0.0183
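An illustrative sketch that recomputes the three gains above from the nine labeled examples (Comic is excluded because its class is unknown):

    import math
    from collections import Counter

    # (hair_length, weight, age, class)
    people = [
        (0, 250, 36, "M"), (10, 150, 34, "F"), (2, 90, 10, "M"),
        (6, 78, 8, "F"), (4, 20, 1, "F"), (1, 170, 70, "M"),
        (8, 160, 41, "F"), (10, 180, 38, "M"), (6, 200, 45, "M"),
    ]
    labels = [p[3] for p in people]

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def gain(is_yes):
        # is_yes: list of booleans splitting the examples into a yes and a no group
        yes = [lab for s, lab in zip(is_yes, labels) if s]
        no = [lab for s, lab in zip(is_yes, labels) if not s]
        n = len(labels)
        return entropy(labels) - (len(yes) / n) * entropy(yes) - (len(no) / n) * entropy(no)

    print(round(gain([p[0] <= 5 for p in people]), 4))    # Hair Length <= 5: 0.0911
    print(round(gain([p[1] <= 160 for p in people]), 4))  # Weight <= 160:    0.59
    print(round(gain([p[2] <= 40 for p in people]), 4))   # Age <= 40:        0.0183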
33
[Figure: the resulting tree. First split: Weight <= 160? The "no" branch is a pure Male leaf; the "yes" branch is split again on Hair Length <= 2.]
Of the 3 features we had, Weight was the best. But while people who weigh over 160 are perfectly classified (as males), the under-160 people are not perfectly classified, so we simply continue splitting.
This time we find that we can split on Hair Length, and then we are done!
34
Example 3: Which attribute to split on?
Exercise - Decision Tree
Customer ID  Student  Credit Rating  Class: Buy PDA
1            No       Fair           No
2            No       Excellent      No
3            No       Fair           Yes
4            No       Fair           Yes
5            Yes      Fair           Yes
6            Yes      Excellent      No
7            Yes      Excellent      Yes
8            No       Excellent      No
Which attribute to split on first?
Hints: log2(2/3) = -0.585, log2(1/3) = -1.585, log2(1/2) = -1, log2(3/5) = -0.737, log2(2/5) = -1.322, log2(1/4) = -2, log2(3/4) = -0.415
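If you want to check your work, here is an illustrative sketch (not the official solution) that computes the information gain of splitting on Student and on Credit Rating for this data:

    import math
    from collections import Counter

    # (Student, Credit Rating, Buy PDA) for customers 1-8
    customers = [
        ("No", "Fair", "No"), ("No", "Excellent", "No"), ("No", "Fair", "Yes"),
        ("No", "Fair", "Yes"), ("Yes", "Fair", "Yes"), ("Yes", "Excellent", "No"),
        ("Yes", "Excellent", "Yes"), ("No", "Excellent", "No"),
    ]

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def gain(attribute_index):
        labels = [row[2] for row in customers]
        weighted = 0.0
        for value in set(row[attribute_index] for row in customers):
            subset = [row[2] for row in customers if row[attribute_index] == value]
            weighted += len(subset) / len(customers) * entropy(subset)
        return entropy(labels) - weighted

    print("Gain(Student)       =", round(gain(0), 3))
    print("Gain(Credit Rating) =", round(gain(1), 3))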
36
Building a Tree - Stopping Criteria
You can stop building the tree when:
The impurity of all nodes is zero. The problem is that this tends to lead to bushy, highly branching trees, often with one example at each leaf.
No split achieves a significant gain in purity (the information gain is not high enough).
The node size is too small: that is, there are fewer than a certain number of examples, or proportion of the training set, at a node.
A minimal sketch of a recursive build procedure using these criteria follows below.
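The sketch below is illustrative only (the dict-based tree format, min_size, and min_gain defaults are assumptions, not part of the slides); it stops on the three criteria listed above.

    import math
    from collections import Counter

    def entropy(rows):
        n = len(rows)
        counts = Counter(r["class"] for r in rows).values()
        return -sum((c / n) * math.log2(c / n) for c in counts)

    def partition(rows, attr):
        parts = {}
        for r in rows:
            parts.setdefault(r[attr], []).append(r)
        return parts

    def build_tree(rows, attributes, min_size=5, min_gain=0.01):
        majority = Counter(r["class"] for r in rows).most_common(1)[0][0]
        # Stopping criteria: pure node, node too small, or no attributes left
        if entropy(rows) == 0 or len(rows) < min_size or not attributes:
            return majority
        def gain(attr):
            weighted = sum(len(sub) / len(rows) * entropy(sub)
                           for sub in partition(rows, attr).values())
            return entropy(rows) - weighted
        best = max(attributes, key=gain)
        if gain(best) < min_gain:   # no split achieves a significant gain in purity
            return majority
        rest = [a for a in attributes if a != best]
        return {best: {value: build_tree(sub, rest, min_size, min_gain)
                       for value, sub in partition(rows, best).items()}}

    # Example using the applicant data from the "Choosing a Split" slide
    applicants = [
        {"Children": "Few", "Income": "Medium", "class": "PAYS"},
        {"Children": "Few", "Income": "High", "class": "PAYS"},
        {"Children": "Many", "Income": "Medium", "class": "DEFAULTS"},
        {"Children": "Many", "Income": "Low", "class": "DEFAULTS"},
    ]
    print(build_tree(applicants, ["Children", "Income"], min_size=2))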
37
Over-fitting
38
Overfitting & Underfitting
[Figure: error rate on training and testing data versus tree size, with regions labeled "underfitting" and "overfitting".]
Notice how the error rate on the testing data increases for overly large trees.
Overfitting: the model performs poorly on new examples (e.g., testing examples) because it is too closely fit to the specific training examples (it picks up noise along with the patterns).
Underfitting: the model performs poorly on new examples because it is too simplistic to distinguish between them (i.e., it has not picked up the important patterns from the training examples).
39
Pruning
A decision tree is typically more accurate on its training data than on its test data. Removing branches from a tree can often improve its accuracy on a test set.
Classification and Regression Trees (CART): use validation data to delete "weak" sub-trees.
Assess whether splitting a node improves purity by a statistically significant amount.
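One simple validation-based approach is reduced-error pruning (a generic illustration, not CART's cost-complexity procedure); the sketch below assumes the dict-based tree format from the earlier build_tree sketch, and all names are illustrative.

    from collections import Counter

    def classify(tree, row):
        while isinstance(tree, dict):
            attr, branches = next(iter(tree.items()))
            tree = branches.get(row.get(attr))
            if tree is None:
                return None  # unseen attribute value: no prediction
        return tree

    def accuracy(tree, rows):
        return sum(classify(tree, r) == r["class"] for r in rows) / len(rows)

    def prune(tree, train_rows, val_rows):
        # Bottom-up: prune the sub-trees first, then try replacing this node with
        # a majority-class leaf if validation accuracy does not get worse.
        if not isinstance(tree, dict) or not val_rows:
            return tree
        attr, branches = next(iter(tree.items()))
        pruned = {attr: {v: prune(sub,
                                  [r for r in train_rows if r[attr] == v],
                                  [r for r in val_rows if r[attr] == v])
                         for v, sub in branches.items()}}
        leaf = Counter(r["class"] for r in train_rows).most_common(1)[0][0]
        if accuracy(leaf, val_rows) >= accuracy(pruned, val_rows):
            return leaf   # the "weak" sub-tree is deleted
        return pruned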
40
There are many possible splitting rules that perfectly classify the training data but will not generalize to future datasets.
Training data:
Name  Hair   Glasses  Class
Mary  Long   No       Female
Mike  Short  No       Male
Bill  Short  No       Male
Jane  Long   No       Female
Ann   Short  Yes      Female
Tree 1 (100% accurate on training data): IF Glasses=Yes THEN Female; ELSE IF Hair=Short THEN Male; ELSE Female.
Tree 2: IF Hair=Short THEN Male; ELSE Female.
Testing data (predictions of each tree vs. the true class):
Hair   Glasses  Tree 1  Tree 2  True class
Short  Yes      Female  Male    Male
Short  No       Male    Male    Female
Long   No       Female  Female  Female
Short  Yes      Female  Male    Male
Error:          75%     25%
41
Decision Tree Classification in a Nutshell
A decision tree is a tree structure:
An internal node denotes a test on an attribute
A branch represents an outcome of the test
Leaf nodes represent class labels
Decision tree generation consists of two phases:
Tree construction: at start, all the training examples are at the root; partition the examples recursively based on selected attributes
Tree pruning: identify and remove branches that reflect noise or outliers, to avoid overfitting
Use of a decision tree - classifying an unknown sample: test the attribute values of the sample against the decision tree (a minimal sketch follows)
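The sketch below is illustrative (the nested-dict representation {attribute: {value: subtree-or-label}} is an assumption); the tree encodes the Income/Balance/Age rules from the earlier slide.

    tree = {
        "Income": {
            "Low": {"Balance": {"Low": "Non-Responder", "High": "Responder"}},
            "High": {"Balance": {
                "High": "Non-Responder",
                "Low": {"Age": {"<45": "Non-Responder", ">45": "Responder"}},
            }},
        }
    }

    def classify(node, sample):
        # Leaves are plain class labels; internal nodes test a single attribute
        while isinstance(node, dict):
            attribute, branches = next(iter(node.items()))
            node = branches[sample[attribute]]
        return node

    print(classify(tree, {"Income": "Low", "Balance": "High"}))  # Responder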
42
Strengths
In practice: one of the most popular methods
Very comprehensible – the tree structure specifies the entire decision structure
Easy for decision makers to understand the model's rationale
Maps nicely to a set of business rules
Relatively easy to implement
Very fast to run (to classify examples) with large data sets
Good at handling missing values: just treat "missing" as a value – it can become a good predictor
Weakness
Bad at handling continuous data; good at categorical input and output
43
Which attribute will you use as the root of the tree, given the following information:
gain(Outlook ) = 0.247 bits
gain(Temperature ) = 0.029 bits
gain(Humidity ) = 0.152 bits
gain(Windy ) = 0.048 bits
A: Outlook
B: Humidity
C: Windy
D: Temperature
E: None of the above
44
What is overfitting?
A: When the model fit is better on the top side
B: When the model fit is worse on the top side
C: When the model captures the correct trend and has best accuracy
D: When the model captures noise in the data, hurting accuracy
E: None of the above
45
Weka Example – Classification using Naïve Bayes
Download file from Canvas:
4bank-data-8.arff
Switch tab to “classify”
Select method: NaiveBayes
Verify class variable set to “pep”
Use 10 fold cross validation
Run classifier
Examine confusion matrix
Weka Exercise
Follow instructions on
http://facweb.cs.depaul.edu/mobasher/classes/ect584/WEKA/classify.html
Data files posted on Canvas
We will use J48 which is an implementation of the C4.5 algorithm
47
Next Session
Association Rules
48