2
Week 10: Classification II
1. Decision Tree Intuition
2. Classification Trees
3. Regression Trees
4. Random Forest
Readings: Chapters 8.1 and 8.2.2
Exercise questions: Chapter 8.4 of ISL, Q1, Q3 and Q4.
3
4
❑ Non-parametric (what other non-parametric methods have we learnt before?)
❑ A supervised learning method that can be used for both classification
and regression.
❑ By incorporating a set of if-then-else rules, a decision tree can be
used to predict the target variable given the data features
Decision trees intuition
5
• Try to discover the pattern under which the customer will purchase the
product
• Divide data set into subsets (branches of a tree)
• Check whether the stopping criterion is met
If yes
stop dividing
Else
keep dividing
• For a new customer, based on the features, we can see which subset
the customer will fall into
Decision trees intuition
6
[Diagram: an example tree with a root node at the top, a split at each decision node, and leaf nodes at the bottom.]
• A decision node has two or more branches.
• The topmost decision node in a tree, which corresponds to the best predictor, is called the root node.
• A leaf node represents a decision.
Decision Trees example
7
Decision trees used in machine learning are of two main types:
❑Classification tree analysis is when the predicted outcome is the class to which
the data belongs. The target variable is categorical.
❑Regression tree analysis is when the predicted outcome can be considered a
real number (e.g. the price of a house, or a patient’s length of stay in a
hospital). The target variable is continuous.
Classification And Regression Tree (CART), Breiman et al. (1984), is an umbrella
term used to refer to both of the above techniques.
Types of decision trees
8
9
❑ The task of growing a classification tree is quite similar to the task of
growing a regression tree
❑ Categorical response variable, e.g. yes/no, 1/0
❑ For a classification tree, we predict that each observation belongs to the
most commonly occurring class (mode) of training observations in the
region to which it belongs
❑ In interpreting the results of a classification tree, we are often interested
not only in the class prediction corresponding to a particular leaf node
region, but also in the class proportions among the training observations
that fall into that region
10
Customer Income Education Marital Status Purchase
1 Medium University Single Yes
2 High University Single No
3 High University Married No
4 Low University Single Yes
5 Low High school Single Yes
6 Low High school Married No
7 Medium High school Married Yes
8 High University Single No
9 High High school Single Yes
10 Low High school Single Yes
11 High High school Married Yes
12 Low University Married No
13 High University Single No
14 Medium University Married Yes
15 Medium High school Single Yes
We can have duplicated records.
Classification tree
11
❑ We need to build the tree from the root node with one feature and then
split examples into subsets
❑ How to select this feature?
❑ Idea: a good feature splits the examples into subsets that are (ideally) “all
positive” or “all negative”
❑ Purity
12
Let’s start the decision tree with feature income. Why?
[Diagram: the root node splits on Income into High, Medium and Low branches; the Medium and Low subsets are shown below.]
Customer Income Education Marital Status Purchase
1 Medium University Single Yes
7 Medium High school Married Yes
14 Medium University Married Yes
15 Medium High school Single Yes
PURE set (all Yes)
Customer Income Education Marital Status Purchase
4 Low University Single Yes
5 Low High school Single Yes
6 Low High school Married No
10 Low High school Single Yes
12 Low University Married No
13
[Diagram: tree rooted at Income.
Medium branch: leaf with 4 Yes (pure).
High branch: split on Education — University: 4 No; High school: 2 Yes.
Low branch: split on Marital status — Married: 2 No; Single: 3 Yes.]
Let’s start the decision tree with feature income. Why?
14
[Diagram: the same tree rooted at Income — Medium: 4 Yes; High: split on Education (University: 4 No, High school: 2 Yes); Low: split on Marital status (Married: 2 No, Single: 3 Yes).]
A new married customer with high income and a university degree follows the
High branch and then the University branch (4 No), so the tree predicts No.
15
[Diagram comparing two candidate splits of the full set (9 Yes / 6 No):
• Split on Income — High: 2 Yes / 4 No; Medium: 4 Yes / 0 No; Low: 3 Yes / 2 No
• Split on Marital status — Married: 3 Yes / 3 No; Single: 6 Yes / 3 No]
Best feature of splitting
Which split is better?
16
• Entropy is a concept originally from physics and measures the disorder in a data set
• In decision trees, we use the entropy 𝑯(𝑺) to measure the amount of uncertainty in the
data set 𝑺.
• The entropy will be a small value if the data set is pure.
• Smaller entropy: less disorder, higher PURITY (CERTAINTY)
• Larger entropy: more disorder, higher IMPURITY (UNCERTAINTY)
[Images: a glass of water and a glass with ice cubes, labelled 𝑯(𝑺) = 1 and 𝑯(𝑺) = 0.469 — which one is purer?]
Entropy intuition
17
Measure the PURITY of the split:
Aim to be more certain about Yes/No after the split
• Pure set: 4 yes / 0 no => 100% certain
• Impure set: 3 yes / 3 no => 50% certain and 50% uncertain
• Impure set: 1 yes / 3 no => 25% certain and 75% uncertain
• Impure set: 3 yes / 1 no => 75% certain and 25% uncertain
The subsets after the split should be as PURE as possible.
Best feature of splitting
18
Entropy 𝑯(𝐒) is a measure of the amount of uncertainty in the data set 𝐒. The entropy
will be a small value if the data set is pure.
➢ 𝐒: the current (data) set for which entropy is being calculated (changes at every iteration
of the ID3 algorithm)
➢ $p_k(\mathbf{S})$: the proportion of the number of elements in class $k$ to the number of
elements in set $\mathbf{S}$, with $K$ classes in total in $\mathbf{S}$
➢ $\sum_{k=1}^{K} p_k(\mathbf{S}) = 1$
➢ $p_k(\mathbf{S}) \log_2 p_k(\mathbf{S})$ is taken to equal zero when $p_k(\mathbf{S}) = 0$

$$H(\mathbf{S}) = \sum_{k=1}^{K} p_k(\mathbf{S}) \log_2 \frac{1}{p_k(\mathbf{S})} = -\sum_{k=1}^{K} p_k(\mathbf{S}) \log_2 p_k(\mathbf{S})$$
Entropy calculation
19
More specifically, for a training set with 𝒑 positive examples and 𝒏 negative
examples:

$$H(\mathbf{S}) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$

Equivalently, writing $p_+(\mathbf{S})$ and $p_-(\mathbf{S})$ for the proportions of positive and
negative examples in 𝐒:

$$H(\mathbf{S}) = -p_+(\mathbf{S}) \log_2 p_+(\mathbf{S}) - p_-(\mathbf{S}) \log_2 p_-(\mathbf{S})$$

Interpretation: given an item from 𝐒, how many bits of information are
required to tell whether it is positive or negative? The smaller the entropy, the higher
the certainty.
Entropy- two classes
20
When $H(\mathbf{S}) = 0$, the set 𝐒 is
perfectly classified, i.e. all elements
in 𝐒 are of the same class.
For a two-class problem:
$p_+(\mathbf{S}) = 0,\; p_-(\mathbf{S}) = 1 \Rightarrow H(\mathbf{S}) = 0$
$p_-(\mathbf{S}) = 0,\; p_+(\mathbf{S}) = 1 \Rightarrow H(\mathbf{S}) = 0$
$p_-(\mathbf{S}) = 0.5,\; p_+(\mathbf{S}) = 0.5 \Rightarrow H(\mathbf{S}) = 1$
The entropy is symmetric in $p_+$ and $p_-$.
Entropy- two classes
If there are more than two classes, $1, 2, \dots, K$:
21
$K$ classes in total in 𝐒, with $\sum_{k=1}^{K} p_k(\mathbf{S}) = 1$
(Demo only)

$$H(\mathbf{S}) = -p_1(\mathbf{S}) \log_2 p_1(\mathbf{S}) - p_2(\mathbf{S}) \log_2 p_2(\mathbf{S}) - \dots - p_K(\mathbf{S}) \log_2 p_K(\mathbf{S})$$
Entropy- multiple classes
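To make the definition concrete, here is a minimal sketch of a general K-class entropy calculator in Python; the function name entropy and the use of NumPy are illustrative choices, not taken from the unit's code.

import numpy as np

def entropy(proportions):
    # Entropy H(S) in bits, given the class proportions p_k(S)
    p = np.asarray(proportions, dtype=float)
    p = p[p > 0]                      # 0 * log2(0) is treated as 0
    return float(-np.sum(p * np.log2(p)))

print(entropy([0.5, 0.5]))   # 1.0: a 50/50 two-class set has maximum uncertainty
print(entropy([1.0, 0.0]))   # 0.0: a pure set has no uncertainty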
22
# entropy calculator for a two-class set, e.g. 3 positives out of 5 examples
import numpy as np
p = 3.0/5
H = -p*np.log2(p) - (1-p)*np.log2(1-p)   # = 0.971 bits
How do we combine these subset entropies after a split?

Entropy before the split:
$$H(\mathbf{S}) = -\frac{9}{15}\log_2\frac{9}{15} - \frac{6}{15}\log_2\frac{6}{15} = 0.971 \text{ bits}$$

Split on Income:
$$H(\mathbf{S}_1) = -\frac{2}{6}\log_2\frac{2}{6} - \frac{4}{6}\log_2\frac{4}{6} = 0.918 \text{ bits} \quad \text{(High: 2 Yes / 4 No)}$$
$$H(\mathbf{S}_2) = 0 \text{ bits} \quad \text{(Medium: 4 Yes / 0 No, pure)}$$
$$H(\mathbf{S}_3) = -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5} = 0.971 \text{ bits} \quad \text{(Low: 3 Yes / 2 No)}$$

Split on Marital status ($H(\mathbf{S}) = 0.971$ bits, as before):
$$H(\mathbf{S}_1) = -\frac{3}{6}\log_2\frac{3}{6} - \frac{3}{6}\log_2\frac{3}{6} = 1 \text{ bit} \quad \text{(Married: 3 Yes / 3 No)}$$
$$H(\mathbf{S}_2) = -\frac{6}{9}\log_2\frac{6}{9} - \frac{3}{9}\log_2\frac{3}{9} = 0.918 \text{ bits} \quad \text{(Single: 6 Yes / 3 No)}$$
Entropy example
23
• Entropy is not the only measure for selecting the best feature
to split on
• Other measures include:
• Gini index: $G(\mathbf{S}) = \sum_{k=1}^{K} p_k(\mathbf{S}) \, (1 - p_k(\mathbf{S}))$
• The Gini index and the entropy are similar numerically
• Misclassification rate: not sufficiently sensitive for tree-growing.
James et al., (2014).
Other Measurements
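As an illustrative sketch (the helper name gini is an assumption, not from the unit's code), the Gini index can be computed from the class proportions in the same way as the entropy:

import numpy as np

def gini(proportions):
    # Gini index G(S) = sum_k p_k(S) * (1 - p_k(S))
    p = np.asarray(proportions, dtype=float)
    return float(np.sum(p * (1 - p)))

print(gini([0.5, 0.5]))   # 0.5: maximum impurity for two classes
print(gini([1.0, 0.0]))   # 0.0: pure node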
❑ How much information do we gain if we disclose/split on the value of some
feature?
❑ Answer: uncertainty before minus uncertainty after
❑ Information Gain (IG), or reduction in entropy from the feature test
❑ Information Gain is a measure of the decrease in disorder/uncertainty
achieved by splitting the data set 𝑆
❑ Choose the feature split with the largest IG

$$\text{Information Gain} = \text{Entropy before} - \text{Entropy after}$$

Here “Entropy after” is a weighted sum of the subset entropies; we want this term
to be small, so that the Information Gain is large.
Information Gain
25
Information gain IG(S,A) is the difference in entropy from before to
after the data set 𝐒 is split on a feature 𝐴.
In other words, it measures how much uncertainty in 𝐒 was reduced by splitting set 𝐒 on
feature 𝐴.

$$IG(\mathbf{S}, A) = H(\mathbf{S}) - EH(A)$$

𝐻(𝐒) – entropy of set 𝐒
𝐸𝐻(𝐴) – expected entropy after splitting on feature 𝐴
Information Gain
26
A selected feature 𝐴 with 𝐽 distinct values (e.g. feature “income” has 𝐽 = 3 possible values
“high”, “medium” and “low”) partitions the training set 𝐒 into 𝐽 subsets/branches 𝐒1, 𝐒2,
… , 𝐒𝐽.

The expected entropy after splitting on feature 𝐴 is:

$$EH(A) = \sum_{j=1}^{J} \frac{|\mathbf{S}_j|}{|\mathbf{S}|} \, H(\mathbf{S}_j)$$

𝐒: the current (data) set for which entropy is being calculated
𝐒𝑗: subset 𝑗
$H(\mathbf{S}_j)$: the entropy of the subset, calculated according to the target categories
$|\mathbf{S}_j| / |\mathbf{S}|$: weights based on the size of each subset
Expected entropy is a measure of the impurity of the subsets.
Expected Entropy
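A small sketch of the expected-entropy calculation in Python; the helper names entropy_from_counts and expected_entropy are assumptions for illustration:

import numpy as np

def entropy_from_counts(counts):
    # Entropy in bits from class counts, e.g. [n_yes, n_no]
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def expected_entropy(subset_counts):
    # EH(A): size-weighted average entropy of the subsets produced by feature A
    sizes = np.array([sum(c) for c in subset_counts], dtype=float)
    weights = sizes / sizes.sum()
    return float(sum(w * entropy_from_counts(c) for w, c in zip(weights, subset_counts)))

# Split on Income: High (2 Yes / 4 No), Medium (4 / 0), Low (3 / 2)
print(expected_entropy([[2, 4], [4, 0], [3, 2]]))   # about 0.691 bits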
27
[Diagram: the full set (9 Yes / 6 No) split on Marital status into Married (3 Yes / 3 No)
and Single (6 Yes / 3 No).]

Entropy before the split (high impurity):
$$H(\mathbf{S}) = -\frac{9}{15}\log_2\frac{9}{15} - \frac{6}{15}\log_2\frac{6}{15} = 0.971 \text{ bits}$$

Entropy of each subset after the split:
$$H(\mathbf{S}_{\text{Married}}) = 1 \text{ bit}, \qquad H(\mathbf{S}_{\text{Single}}) = 0.918 \text{ bits}$$

Information gain, with each subset entropy weighted by the size of the subset relative to 𝐒:
$$IG(\mathbf{S}, A) = H(\mathbf{S}) - \frac{6}{15} H(\mathbf{S}_{\text{Married}}) - \frac{9}{15} H(\mathbf{S}_{\text{Single}}) = 0.971 - \frac{6}{15}(1) - \frac{9}{15}(0.918) \approx 0.020$$

If we split on “marital status”, we would GAIN about 0.02 bits of certainty;
in other words, we are about 0.02 bits more certain.
Information Gain example
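A quick numeric check of this example in Python (the helper H re-implements the assumed entropy calculator from earlier):

import numpy as np

def H(counts):
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

H_S       = H([9, 6])     # 0.971 bits before the split
H_married = H([3, 3])     # 1.000 bit
H_single  = H([6, 3])     # 0.918 bits
EH        = 6/15 * H_married + 9/15 * H_single   # about 0.951 bits
IG        = H_S - EH
print(round(IG, 3))       # about 0.020 bits gained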
28
❑ IG favours splits on a feature with many values (many branches/leaf nodes), causing a bias
❑ If one feature splits the data into many more subsets than another, it has an (unfair) advantage
if we use information gain
❑ The Gain Ratio is designed to compensate for this problem:

$$\text{GainRatio}(\mathbf{S}, A) = \frac{\text{Information Gain}(\mathbf{S}, A)}{\text{Split\_Entropy}(\mathbf{S}, A)}$$

$$\text{Split\_Entropy}(\mathbf{S}, A) = -\sum_{j=1}^{J} \frac{|\mathbf{S}_j|}{|\mathbf{S}|} \log_2 \frac{|\mathbf{S}_j|}{|\mathbf{S}|}$$

The denominator penalizes splits that produce too many small subsets.
Information Gain drawback
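A hedged sketch of the split entropy and gain ratio (function names are illustrative only), which also previews the example on the next slide:

import numpy as np

def split_entropy(subset_sizes):
    # Split_Entropy(S, A) = -sum_j |S_j|/|S| * log2(|S_j|/|S|)
    sizes = np.asarray(subset_sizes, dtype=float)
    w = sizes / sizes.sum()
    return float(-np.sum(w * np.log2(w)))

def gain_ratio(information_gain, subset_sizes):
    return information_gain / split_entropy(subset_sizes)

print(split_entropy([1] * 15))   # about 3.907: 15 singleton subsets are heavily penalised
print(split_entropy([6, 9]))     # about 0.971: a two-way split is penalised far less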
29
[Diagram: the full set (9 Yes / 6 No). Splitting on a feature with 15 distinct values
(one per customer) produces 15 singleton subsets, each containing a single Yes or No example.]

Split entropy of the 15-way split:
$$\text{Split Entropy} = -15 \times \frac{1}{15} \log_2 \frac{1}{15} = 3.907$$

Split entropy of a two-way split into subsets of size 6 and 9 (e.g. the Marital status split):
$$\text{Split Entropy} = -\frac{6}{15} \log_2 \frac{6}{15} - \frac{9}{15} \log_2 \frac{9}{15} = 0.971$$

The split entropy penalizes the split with too many
small subsets, even though the
IG for such a split is high.
Split Entropy Example
30
➢ What should we do if some of the features are numeric/continuous?
➢ We use splits of the form 𝑥 < 𝜃, where 𝜃 is called a splitting value or cut point.
➢ Problem: there is an infinite number of possible split values!
Split over numeric features
31
➢ Splits on numeric features use a
threshold
➢ How do we decide a threshold, such as
weight ≤ 35 in the diagram?
➢ Consider a feature 𝐴: sort the
values of 𝐴 in the data
➢ Evaluate split thresholds in the
intervals between instances of
different classes

[Diagram: data set D (20 Yes / 13 No) split by the test weight ≤ 35 into
Dyes (13 Yes / 5 No) and Dno (7 Yes / 8 No).]
Split over Numeric Features
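As an illustrative sketch (the weights, labels and function name are hypothetical), the candidate thresholds are the midpoints between consecutive sorted values whose class labels differ:

import numpy as np

def candidate_thresholds(values, labels):
    # Midpoints between consecutive sorted values whose class labels differ;
    # these are the only thresholds worth evaluating for a split x < theta.
    order = np.argsort(values)
    v = np.asarray(values)[order]
    y = np.asarray(labels)[order]
    thresholds = []
    for i in range(len(v) - 1):
        if y[i] != y[i + 1] and v[i] != v[i + 1]:
            thresholds.append(float((v[i] + v[i + 1]) / 2))
    return thresholds

# Hypothetical weights and purchase labels
print(candidate_thresholds([20, 25, 35, 40, 73], ['No', 'No', 'Yes', 'Yes', 'No']))
# [30.0, 56.5]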
32
The ID3 algorithm begins with the original set S as the root node.
In each iteration, the algorithm:
➢ loops through every unused feature of the set S and calculates the
information gain 𝑰𝑮(𝑺, 𝑨) of that feature;
➢ selects the feature with the largest information gain, i.e. the best feature
to split on;
➢ splits S on the selected feature, e.g. income, to produce subsets of
the data;
➢ continues recursively on each subset, excluding features used
before.
Ross Quinlan, 1986
ID3 algorithm summary
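The following is a compact, hedged sketch of the ID3 recursion in Python; examples are dicts of feature values, and the helper names and data layout are assumptions rather than the unit's reference implementation:

import numpy as np
from collections import Counter

def entropy_of(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def partition(rows, feature):
    # Group example indices by the value they take on `feature`
    groups = {}
    for i, row in enumerate(rows):
        groups.setdefault(row[feature], []).append(i)
    return groups

def information_gain(rows, labels, feature):
    n = len(labels)
    expected = sum(len(idx) / n * entropy_of([labels[i] for i in idx])
                   for idx in partition(rows, feature).values())
    return entropy_of(labels) - expected

def id3(rows, labels, features):
    # Stopping criteria: pure subset, or no unused features left
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]   # leaf = majority class
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    tree = {best: {}}
    for value, idx in partition(rows, best).items():
        tree[best][value] = id3([rows[i] for i in idx],
                                [labels[i] for i in idx],
                                [f for f in features if f != best])
    return tree

# rows: e.g. [{'Income': 'Medium', 'Education': 'University', 'Marital': 'Single'}, ...]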
33
❑ All elements in the subset belong to the same class (Yes or No, 1 or 0, +
or −): the node is turned into a leaf node and labelled with the class
of the examples
❑ No more features to select, while the examples still do not all belong
to the same class (some are 1 and some are 0): the node is turned
into a leaf node and labelled with the most common class of the
examples in the subset
❑ No examples in the subset, for example if there is no example with age >=
100: a leaf node is created and labelled with the most common
class of the examples in the parent set
Stopping Criteria
What to do if…
Some leaf nodes contain no examples:
➢ Choose Yes or No according to the numbers of Yes/No examples at the
parent node
Some examples have the same features but different labels (we have
errors/noise in the data):
➢ Stop and use a majority vote
In the applications in our unit, we focus on decision trees with binary
splits. Note that scikit-learn uses an optimised version of the CART algorithm,
which constructs binary trees.
34
35
❑ If we keep growing the tree until it classifies the training set perfectly, we
might over-fit
❑ For example, we can keep splitting until each leaf node contains a single example
❑ This fits the training data perfectly, but will NOT work well on new test data

[Plot: misclassification rate versus tree size, for the training and validation sets.]
Overfitting in decision trees
36
Prepruning:
Stop growing when a data split is not statistically significant. For example: stop
tree construction when the node size is smaller than a given limit, or when the impurity
of a node is below a given limit. (Faster.)
Postpruning:
Grow the whole tree, then prune subtrees that overfit, using the validation
set. (More accurate.)
Tree Pruning
❑ Prepruning: stop splitting when the split is not statistically significant:
➢ stop when the Info-Gain (Gain-Ratio) is smaller than a threshold
➢ stop when a leaf node contains p (e.g. 𝑝 = 5) or fewer examples
❑ Postpruning: grow the tree, then post-prune it based on a validation set
❑ Regularization: penalize complex trees by minimizing impurity plus a “complexity”
term, where complexity = number of leaf nodes and |𝑇| denotes the number of leaf nodes of the tree 𝑇:

$$\sum_{j \,\in\, \text{leaf nodes}} H(\mathbf{S}_j) + \lambda \, |T|$$

Note: if the tree grows, the complexity term grows, but the entropy term shrinks (uncertainty
decreases).
❑ Compute many trees on subsets of the data and test them: pick the best, or
combine their predictions by voting
❑ Random Forests are state-of-the-art classifiers!
37
How to Avoid Overfitting?
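Since the unit uses scikit-learn (mentioned on the earlier CART slide), here is a hedged sketch of how these pruning ideas map onto its parameters; the dataset and parameter values are arbitrary choices for illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Pre-pruning: stop early via node-size and impurity-decrease limits
pre = DecisionTreeClassifier(min_samples_leaf=5, min_impurity_decrease=0.01,
                             random_state=0).fit(X_train, y_train)

# Post-pruning: cost-complexity pruning; the lambda penalty is called ccp_alpha
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

print(pre.score(X_val, y_val), post.score(X_val, y_val))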
38
39
In the real implementation, we transform the categorical features into dummy
variables.
40
Use expected entropy as the impurity measure to select the best
feature for the first split (depth 1). The values below are the expected entropies of the candidate (dummy) features:
[0.82584103752696802,
0.97095059445466847,
0.72895548841643465,
0.78514543156506744,
0.78514543156506744,
0.95097750043269369,
0.95097750043269369]
Income-Medium is selected as the best feature to split the
root node.
Left node: [ 0. 4.] => [0 no, 4 yes]
Right node: [ 6. 5.] => [6 no, 5 yes]
[Diagram: root node (9 Yes / 6 No) split on Income-Medium — Yes branch: 4 Yes / 0 No;
No branch: 5 Yes / 6 No.]
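As a hedged end-to-end sketch (re-typing the 15-customer table from the earlier slide; the variable names are my own), a depth-1 tree grown with the entropy criterion in scikit-learn reproduces this first split:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

income    = ['Medium','High','High','Low','Low','Low','Medium','High','High','Low','High','Low','High','Medium','Medium']
education = ['University','University','University','University','High school','High school','High school','University','High school','High school','High school','University','University','University','High school']
marital   = ['Single','Single','Married','Single','Single','Married','Married','Single','Single','Single','Married','Married','Single','Married','Single']
purchase  = ['Yes','No','No','Yes','Yes','No','Yes','No','Yes','Yes','Yes','No','No','Yes','Yes']

# Dummy-encode the categorical features, as described on the previous slide
X = pd.get_dummies(pd.DataFrame({'Income': income, 'Education': education, 'Marital': marital}))
stump = DecisionTreeClassifier(criterion='entropy', max_depth=1).fit(X, purchase)
print(X.columns[stump.tree_.feature[0]])   # expected: Income_Medium chosen for the first split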
41
• A decision stump is a decision tree consisting of only one level:
• a decision tree with one root node that is immediately connected to the leaf nodes.
• We will use this concept to explain boosting in the next lecture.
[Diagram: data points in the (𝑥1, 𝑥2) plane and the corresponding stump — test 𝑥1 < 10;
Yes: predict positive, No: predict negative.]
Decision Stump
42
43
[Diagram: the (𝑥1-age, 𝑥2-experience) feature space partitioned into regions 𝑅1–𝑅5 by cuts
at 𝑥1 = 20, 𝑥1 = 35, 𝑥2 = 5 and 𝑥2 = 10.]
Regression tree intuition
44
[Diagram: the equivalent tree — the root tests 𝑥1 < 20; one branch tests 𝑥2 < 10 and ends in
leaves 𝑅1 and 𝑅2; the other branch tests 𝑥1 < 35, giving leaf 𝑅3 and a further test 𝑥2 < 5
with leaves 𝑅4 and 𝑅5.]
Regression tree Intuition
45
Two steps of building a regression tree:
1. Partition the feature space, i.e. the set of possible values of 𝐱1, 𝐱2, … , 𝐱𝑁, into 𝐽
distinct and non-overlapping regions 𝑅1, 𝑅2, …, 𝑅𝐽
2. For a new observation that falls into region 𝑅𝑗, we make the same prediction, which is
simply the mean of the response values of the training examples in 𝑅𝑗
Building Regression Tree
46
❑ A random forest (or random forests) is an ensemble classifier that consists of many decision
trees and outputs the class that is the mode of the classes output by the individual trees.
❑ The term came from “random decision forests”, first proposed by Tin Kam Ho of Bell Labs in 1995.
❑ The method combines Breiman's “bagging” idea and the random selection of features.
❑ Random forests provide an improvement over bagged trees by way of a small random tweak
that decorrelates the trees.
47
Random Forest introduction
48
• Random forests (RF) are a combination of tree predictors
• Each tree depends on the values of a random sample drawn independently
• The generalization error depends on the strength of the individual trees and the correlation
between them
[Diagram: Tree 1, …, Tree B combined into a forest.]
Random Forest introduction
❑ Random forests provide an improvement over bagged trees by way of a small random tweak
that decorrelates the trees
❑ In bagging, we build a number of decision trees on bootstrapped training samples
❑ Each time a split in a tree is considered, a random sample of 𝒑 features is chosen as split
candidates from the full set of 𝒅 features
❑ In RF, the number of features considered at each split is approximately equal to the square
root of the total number of features
❑ This avoids the situation in bagging where one quite strong feature leads to most or all of
the trees using this strong predictor in the top split, producing very similar trees
❑ Random forests overcome this problem by forcing each split to consider only a subset of
the features
49
Random Forest introduction
50
1. For tree 𝑏 = 1 to 𝐵:
(a) Choose a bootstrap sample of size 𝑁 from the training data
(b) Grow a random-forest tree 𝑇𝑏 on the bootstrapped data, by recursively repeating the
following steps for each leaf node of the tree, until the minimum node size is reached:
i. Select 𝑝 variables at random from the 𝑑 variables (𝑝 ≤ 𝑑). Why do this?
ii. Pick the best variable/split-point among the 𝑝.
iii. Split the node into two decision nodes.
2. Output the ensemble of trees $\{T_b\}_{b=1}^{B}$.
Friedman et al., (2001)
❑ Bootstrap sample: randomly select 𝑁 observations (with replacement) from the original data
set in order to produce a bootstrap data set
RF Algorithm
To make a prediction at a new point 𝐱𝟎:
For regression:
$$\hat{f}(\mathbf{x}_0) = \frac{1}{B} \sum_{b=1}^{B} T_b(\mathbf{x}_0)$$
For classification, suppose the class prediction of the 𝑏th random-forest tree is 𝐶𝑏(𝐱𝟎); then:
$$\hat{f}(\mathbf{x}_0) = \operatorname{Mode}\{C_b(\mathbf{x}_0)\}_{b=1}^{B}$$
51
RF Prediction
At each split in each tree, the improvement in the split criterion is the importance measure
attributed to the splitting variable, and it is accumulated over all the trees in the forest
separately for each variable.
52
Feature importance
Review questions
• What are the intuitions behind decision trees?
• How do decision trees work? How do we choose the feature to split on?
• What is a decision stump?
• What is CART?
• What are tree growing and tree pruning?
• How does the ID3 algorithm work?
• What are entropy and information gain?
• How do random forests work?
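To close, a hedged scikit-learn sketch of the random forest described above (the dataset and parameter values are illustrative assumptions):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# B = 100 trees; each split considers roughly sqrt(d) of the d features
rf = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=0)
rf.fit(X_train, y_train)

print(rf.score(X_test, y_test))      # accuracy of the majority-vote (mode) prediction
print(rf.feature_importances_[:5])   # accumulated split-criterion improvements per feature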