
2

Week 10: Classification II

1. Decision Tree Intuition

2. Classification Trees

3. Regression Trees

4. Random Forest

Readings: Chapters 8.1 and 8.2.2
Exercise questions: Chapter 8.4 of ISL, Q1, Q3 and Q4.

3

4

❑ Non-parametric (any other non-parametric method we learnt before?)

❑ Supervised learning method that can be used for both classification and regression.

❑ By incorporating a set of if-then-else rules, a decision tree can be employed to predict the target variable given the data features.

Decision trees intuition

5

• Try to discover the pattern under which the customer will purchase the product

• Divide the data set into subsets (branches of a tree)

• Check whether the stopping criterion is met

If yes
stop dividing
Else
keep dividing

• For a new customer, based on the features, we can see which subset the customer will fall into

Decision trees intuition

6

[Diagram: an example tree with a root node at the top, decision nodes and their splits below it, and leaf nodes at the bottom.]

• A decision node has two or more branches.

• The topmost decision node in a tree, which corresponds to the best predictor, is called the root node.

• A leaf node represents a decision.


Decision Trees example

7

Decision trees used in machine learning are of two main types:

❑Classification tree analysis is when the predicted outcome is the class to which
the data belongs. Target variable is categorical.

❑Regression tree analysis is when the predicted outcome can be considered a
real number (e.g. the price of a house, or a patient’s length of stay in a
hospital). Target variable is continuous.

Classification And Regression Trees (CART), Breiman et al. (1984), is an umbrella term used to refer to both of the above techniques.

Types of decision trees

8

9

❑ The task of growing a classification tree is quite similar to the task of
growing a regression tree

❑ Categorical response variable, e.g. yes/no, 1/0

❑ For a classification tree, we predict that each observation belongs to the
most commonly occurring class (mode) of training observations in the
region to which it belongs

❑ In interpreting the results of a classification tree, we are often interested
not only in the class prediction corresponding to a particular leaf node
region, but also in the class proportions among the training observations
that fall into that region
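As a small illustration of "predict the most common class and report the class proportions" for a single leaf, here is a minimal sketch; the leaf labels are made up for illustration.

# majority-class prediction and class proportions for one leaf node
from collections import Counter

leaf_labels = ["Yes", "Yes", "No", "Yes"]    # training labels that fell into this leaf (illustrative)
counts = Counter(leaf_labels)                # Counter({'Yes': 3, 'No': 1})
prediction = counts.most_common(1)[0][0]     # the mode of the leaf: 'Yes'
proportions = {c: n / len(leaf_labels) for c, n in counts.items()}   # {'Yes': 0.75, 'No': 0.25}
print(prediction, proportions)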

10

Customer Income Education Marital Status Purchase

1 Medium University Single Yes

2 High University Single No

3 High University Married No

4 Low University Single Yes

5 Low High school Single Yes

6 Low High school Married No

7 Medium High school Married Yes

8 High University Single No

9 High High school Single Yes

10 Low High school Single Yes

11 High High school Married Yes

12 Low University Married No

13 High University Single No

14 Medium University Married Yes

15 Medium High school Single Yes

We can have duplicated records.

Classification tree
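For the worked examples that follow, it can help to have the toy table above in code. A minimal sketch using pandas (the use of pandas and the column names are an assumption for illustration; the records match the table above):

# the 15-customer purchase data set from the slide, as a pandas DataFrame
import pandas as pd

data = pd.DataFrame({
    "Income":        ["Medium", "High", "High", "Low", "Low", "Low", "Medium", "High",
                      "High", "Low", "High", "Low", "High", "Medium", "Medium"],
    "Education":     ["University", "University", "University", "University", "High school",
                      "High school", "High school", "University", "High school", "High school",
                      "High school", "University", "University", "University", "High school"],
    "MaritalStatus": ["Single", "Single", "Married", "Single", "Single", "Married", "Married",
                      "Single", "Single", "Single", "Married", "Married", "Single", "Married", "Single"],
    "Purchase":      ["Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No",
                      "Yes", "Yes", "Yes", "No", "No", "Yes", "Yes"],
}, index=range(1, 16))               # customer IDs 1..15
print(data["Purchase"].value_counts())   # 9 Yes, 6 No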

11

❑ We need to build the tree from the root node with one feature and then split the examples into subsets

❑ How to select this feature?

❑ Idea: a good feature splits the examples into subsets that are (ideally) "all positive" or "all negative"

❑ Purity

12

Let’s start the decision tree with feature income. Why?

Income splits the data into three branches: High, Medium and Low.

Medium branch:
Customer Income Education Marital Status Purchase
1 Medium University Single Yes
7 Medium High school Married Yes
14 Medium University Married Yes
15 Medium High school Single Yes

Low branch:
Customer Income Education Marital Status Purchase
4 Low University Single Yes
5 Low High school Single Yes
6 Low High school Married No
10 Low High school Single Yes
12 Low University Married No

The Medium branch is a PURE set: every customer in it purchased.

13

[Tree diagram: Income is the root node with branches High, Medium and Low.
• Medium branch: a pure leaf with 4 Yes.
• High branch: split on Education — University gives 4 No, High school gives 2 Yes.
• Low branch: split on Marital status — Married gives 2 No, Single gives 3 Yes.]

Let's start the decision tree with feature income. Why?

14

[The same tree: Income at the root; Medium → 4 Yes; High → Education (University: 4 No, High school: 2 Yes); Low → Marital status (Married: 2 No, Single: 3 Yes).]

A new married customer with high income and a university degree follows the High branch and then the University branch (4 No), so we predict No.

15

Which split is better?

• Split on Income: the root set (9 Yes / 6 No) is divided into High (2 Yes / 4 No), Medium (4 Yes / 0 No) and Low (3 Yes / 2 No).
• Split on Marital status: the root set (9 Yes / 6 No) is divided into Married (3 Yes / 3 No) and Single (6 Yes / 3 No).

Best feature of splitting

16

• Entropy is a concept originally from physics that measures the disorder in a data set.
• In decision trees, we use the entropy 𝑯(𝑺) to measure the amount of uncertainty in the data set 𝑺.
• The entropy will be a small value if the data set is pure.
   • Smaller entropy: less disorder, higher PURITY (certainty).
   • Larger entropy: more disorder, higher IMPURITY (uncertainty).

[Images: two example sets with $H(\mathbf{S}) = 1$ and $H(\mathbf{S}) = 0.469$. Analogy: a glass of water and a glass of ice cubes — which one is purer?]

Entropy intuition

17

Measure the PURITY of the split:

Aim to be more certain about Yes/No after the split. The subsets should be as PURE as possible.

• Pure set: 4 yes / 0 no => 100% certain
• Impure set: 3 yes / 3 no => 50% certain and 50% uncertain
• Impure set: 1 yes / 3 no => 25% certain and 75% uncertain
• Impure set: 3 yes / 1 no => 75% certain and 25% uncertain

Best feature of splitting

18

Entropy $H(\mathbf{S})$ is a measure of the amount of uncertainty in the data set $\mathbf{S}$. The entropy will be a small value if the data set is pure.

➢ $\mathbf{S}$: the current (data) set for which entropy is being calculated (changes at every iteration of the ID3 algorithm)

➢ $p_k(\mathbf{S})$: the proportion of the number of elements in class $k$ to the number of elements in set $\mathbf{S}$, with $K$ classes in total in $\mathbf{S}$

➢ $\sum_{k=1}^{K} p_k(\mathbf{S}) = 1$

➢ $p_k(\mathbf{S}) \log_2 p_k(\mathbf{S})$ equals zero when $p_k(\mathbf{S}) = 0$

$H(\mathbf{S}) = \sum_{k=1}^{K} p_k(\mathbf{S}) \log_2 \frac{1}{p_k(\mathbf{S})} = -\sum_{k=1}^{K} p_k(\mathbf{S}) \log_2 p_k(\mathbf{S})$

Entropy calculation
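A minimal sketch of the K-class entropy formula above, written as a count-based helper with numpy (the function name and interface are illustrative):

# entropy of a set, given the count of examples in each class
import numpy as np

def entropy(counts):
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()              # class proportions; p*log2(p) is taken as 0 when p = 0
    return float(-(p * np.log2(p)).sum())

print(entropy([9, 6]))   # whole purchase data set, 9 Yes / 6 No -> about 0.971 bits
print(entropy([3, 3]))   # an evenly mixed set -> 1.0 bit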

19

More specifically, for a training set $\mathbf{S}$ with $p$ positive examples and $n$ negative examples:

$H(\mathbf{S}) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$

Equivalently, writing $p_+(\mathbf{S}) = \frac{p}{p+n}$ for the proportion of positive examples in $\mathbf{S}$, and $p_-(\mathbf{S})$ for the proportion of negative examples:

$H(\mathbf{S}) = -p_+(\mathbf{S})\log_2 p_+(\mathbf{S}) - p_-(\mathbf{S})\log_2 p_-(\mathbf{S})$

Interpretation: assuming an item belongs to $\mathbf{S}$, $H(\mathbf{S})$ is the number of bits of information required to tell whether that item is positive or negative. The smaller it is, the higher the certainty.

Entropy- two classes

20

A two-class problem:

$p_+(\mathbf{S}) = 0,\ p_-(\mathbf{S}) = 1 \Rightarrow H(\mathbf{S}) = 0$
$p_-(\mathbf{S}) = 0,\ p_+(\mathbf{S}) = 1 \Rightarrow H(\mathbf{S}) = 0$
$p_-(\mathbf{S}) = 0.5,\ p_+(\mathbf{S}) = 0.5 \Rightarrow H(\mathbf{S}) = 1$

When $H(\mathbf{S}) = 0$, the set $\mathbf{S}$ is perfectly classified, i.e. all elements in $\mathbf{S}$ are of the same class.

[Plot: $H(\mathbf{S})$ against $p_+(\mathbf{S})$; the curve is symmetric about $p_+(\mathbf{S}) = 0.5$, where it reaches its maximum of 1.]

Entropy- two classes

21

If there are more than two classes, $1, 2, \dots, K$ ($K$ classes in total in $\mathbf{S}$):

$H(\mathbf{S}) = -p_1(\mathbf{S})\log_2 p_1(\mathbf{S}) - p_2(\mathbf{S})\log_2 p_2(\mathbf{S}) - \dots - p_K(\mathbf{S})\log_2 p_K(\mathbf{S})$

with $\sum_{k=1}^{K} p_k(\mathbf{S}) = 1$. (Demo only.)

Entropy- multiple classes

22

# entropy calculator for a two-class set
import numpy as np

p = 3.0 / 5                                       # proportion of positive (Yes) examples, e.g. 9/15 = 3/5
H = -p * np.log2(p) - (1 - p) * np.log2(1 - p)    # ≈ 0.971 bits

How do we combine the subset entropies?

Entropy of the whole set (9 Yes / 6 No):
$H(\mathbf{S}) = -\frac{9}{15}\log_2\frac{9}{15} - \frac{6}{15}\log_2\frac{6}{15} = 0.971$ bits

Split on Income:
$H(\mathbf{S}_1) = -\frac{2}{6}\log_2\frac{2}{6} - \frac{4}{6}\log_2\frac{4}{6} = 0.918$ bits (High: 2 Yes / 4 No)
$H(\mathbf{S}_2) = 0$ bits (Medium: 4 Yes / 0 No, a pure set)
$H(\mathbf{S}_3) = -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5} = 0.971$ bits (Low: 3 Yes / 2 No)

Split on Marital status:
$H(\mathbf{S}_1) = -\frac{3}{6}\log_2\frac{3}{6} - \frac{3}{6}\log_2\frac{3}{6} = 1$ bit (Married: 3 Yes / 3 No)
$H(\mathbf{S}_2) = -\frac{6}{9}\log_2\frac{6}{9} - \frac{3}{9}\log_2\frac{3}{9} = 0.918$ bits (Single: 6 Yes / 3 No)

Entropy example

23

• Entropy is not the only measure for selecting the best feature to split on.
• Other measures include the Gini index:

$G(\mathbf{S}) = \sum_{k=1}^{K} p_k(\mathbf{S})\,(1 - p_k(\mathbf{S}))$

• The Gini index and the entropy are numerically quite similar.
• Misclassification rate: not sufficiently sensitive for tree-growing (James et al., 2014).

Other Measurements
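For comparison with the entropy helper sketched earlier, here is a minimal count-based Gini index (illustrative interface):

# Gini index of a set, given the count of examples in each class
import numpy as np

def gini(counts):
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return float((p * (1 - p)).sum())      # equivalently 1 - sum(p**2)

print(gini([9, 6]))   # whole purchase data set -> 0.48
print(gini([4, 0]))   # 0 for a pure set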

24

❑ How much information do we gain if we disclose/split on the value of some feature?

❑ Answer: the uncertainty before minus the uncertainty after.

❑ Information Gain (IG) is the reduction in entropy from the feature test.

❑ Information Gain is a measure of the decrease in disorder/uncertainty achieved by splitting the data set $\mathbf{S}$.

❑ Choose the feature split with the largest IG.

Information Gain = Entropy before − Entropy after

We want the Information Gain to be large; the "Entropy after" term is a weighted sum of the subset entropies, and we want it to be small.

Information Gain

25

Information gain $IG(\mathbf{S}, A)$ measures the difference in entropy from before to after the data set $\mathbf{S}$ is split on a feature $A$.

In other words, it measures how much uncertainty in $\mathbf{S}$ was reduced by splitting $\mathbf{S}$ on feature $A$.

$IG(\mathbf{S}, A) = H(\mathbf{S}) - EH(A)$

$H(\mathbf{S})$ — entropy of the set $\mathbf{S}$

$EH(A)$ — expected entropy after splitting on feature $A$

Information Gain

26

A selected feature $A$ with $J$ distinct values (e.g. feature "income" has $J = 3$ possible values, "high", "medium" and "low") partitions the training set $\mathbf{S}$ into $J$ subsets/branches $\mathbf{S}_1, \mathbf{S}_2, \dots, \mathbf{S}_J$.

The expected entropy after splitting on feature $A$ is:

$EH(A) = \sum_{j=1}^{J} \frac{|\mathbf{S}_j|}{|\mathbf{S}|}\, H(\mathbf{S}_j)$

➢ $\mathbf{S}$: the current (data) set for which entropy is being calculated; $\mathbf{S}_j$: subset $j$
➢ The weight $\frac{|\mathbf{S}_j|}{|\mathbf{S}|}$ is based on the size of the subset relative to $\mathbf{S}$
➢ $H(\mathbf{S}_j)$ is the entropy of the subset, calculated according to the target categories
➢ The expected entropy is a measure of the impurity of the subsets

Expected Entropy
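A minimal sketch of the expected entropy and information gain formulas above, again working from class counts (the function names and the nested-list format for the subsets are illustrative):

# expected entropy of a split and the resulting information gain
import numpy as np

def entropy(counts):
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-(p * np.log2(p)).sum())

def expected_entropy(subset_counts):
    # subset_counts: one [class counts] list per branch, e.g. [[2, 4], [4, 0], [3, 2]] for Income
    sizes = np.array([sum(c) for c in subset_counts], dtype=float)
    weights = sizes / sizes.sum()                       # |S_j| / |S|
    return float(sum(w * entropy(c) for w, c in zip(weights, subset_counts)))

def information_gain(parent_counts, subset_counts):
    return entropy(parent_counts) - expected_entropy(subset_counts)

print(information_gain([9, 6], [[2, 4], [4, 0], [3, 2]]))   # split on Income, about 0.28 bits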

27

Split on Marital status: the root set $\mathbf{S}$ (9 Yes / 6 NO) is divided into Married (3 Yes / 3 NO) and Single (6 Yes / 3 NO).

Entropy before the split (high impurity):
$H(\mathbf{S}) = -\frac{9}{15}\log_2\frac{9}{15} - \frac{6}{15}\log_2\frac{6}{15} = 0.971$ bits

Entropy of each subset after the split:
$H(\mathbf{S}_{\text{Married}}) = 1$ bit, $H(\mathbf{S}_{\text{Single}}) = 0.918$ bits

Information gain, with each subset weighted by its size relative to $\mathbf{S}$:
$IG(\mathbf{S}, A) = H(\mathbf{S}) - \frac{6}{15} H(\mathbf{S}_{\text{Married}}) - \frac{9}{15} H(\mathbf{S}_{\text{Single}}) = 0.971 - \frac{6}{15}(1) - \frac{9}{15}(0.918) \approx 0.02$

If we split on "marital status", we GAIN about 0.02 bits of certainty, i.e. we are about 0.02 bits more certain.

Information Gain example
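The marital-status calculation above can be checked numerically; a small sketch using the two-class entropy formula (the helper name H is illustrative):

# information gain of splitting the root set (9 Yes / 6 No) on Marital status
import numpy as np

def H(p):                                    # two-class entropy given the proportion of Yes examples
    return 0.0 if p in (0.0, 1.0) else float(-p * np.log2(p) - (1 - p) * np.log2(1 - p))

IG = H(9/15) - (6/15) * H(3/6) - (9/15) * H(6/9)
print(round(IG, 3))                          # about 0.02 bits gained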

28

❑ IG favours splits on a feature with many values (many leaf nodes), causing a bias.
❑ If one feature splits the data into many more classes than another, it has an (unfair) advantage if we use information gain.
❑ The Gain Ratio is designed to compensate for this problem: the split entropy penalizes splits with too many small subsets.

$\text{GainRatio}(\mathbf{S}, A) = \dfrac{\text{Information Gain}}{\text{Split Entropy}}$

$\text{Split\_Entropy}(\mathbf{S}, A) = -\sum_{j=1}^{J} \frac{|\mathbf{S}_j|}{|\mathbf{S}|} \log_2 \frac{|\mathbf{S}_j|}{|\mathbf{S}|}$

Information Gain drawback
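A minimal sketch of the split entropy and the gain ratio, using subset sizes (illustrative function names; the information gain is passed in rather than recomputed here):

# split entropy and gain ratio
import numpy as np

def split_entropy(subset_sizes):
    w = np.asarray(subset_sizes, dtype=float)
    w = w / w.sum()                          # |S_j| / |S|
    return float(-(w * np.log2(w)).sum())

def gain_ratio(information_gain, subset_sizes):
    return information_gain / split_entropy(subset_sizes)

print(split_entropy([6, 9]))        # marital-status split: about 0.971
print(split_entropy([1] * 15))      # 15 singleton subsets: about 3.907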

29

The root set has 9 Yes / 6 NO (15 examples).

A split that puts each of the 15 examples into its own subset (leaves of one example each):
$\text{Split Entropy} = -15 \times \frac{1}{15}\log_2\frac{1}{15} = 3.907$

The split on Marital status, with subsets of size 6 and 9:
$\text{Split Entropy} = -\frac{6}{15}\log_2\frac{6}{15} - \frac{9}{15}\log_2\frac{9}{15} = 0.971$

The split entropy therefore penalizes splits with too many small subsets, even though the information gain of such a split is high.

Split Entropy Example

30

➢ What should we do if some of the features are numeric/continuous?
➢ We use splits of the form $x < \theta$, where $\theta$ is called a splitting value or cutting point. There is an infinite number of possible split values!

Split over numeric features

31

➢ Splits on numeric features use a threshold.
➢ How do we decide on a threshold like 3535 in the diagram? (A small threshold-search sketch appears a few slides below.)
➢ Consider a feature $A$. Sort the values of $A$ in the data.
➢ Evaluate split thresholds in the intervals between instances of different classes.

[Diagram: the set D (20 Yes / 13 NO) split by "weight ≤ 3535"; the Yes branch D_yes contains 13 Yes / 5 NO and the No branch D_no contains 7 Yes / 8 NO. Candidate thresholds, such as the values 2573 and 3535 on the axis, lie between sorted values of the feature.]

Split over Numeric Features

32

The ID3 algorithm begins with the original set S as the root node. For each iteration of the algorithm:

➢ Loop through every unused feature of the set S and calculate the information gain IG(S, A) of that feature.
➢ Select the feature with the largest information gain: the best feature for splitting.
➢ S is then split by the selected feature, e.g. income, to produce subsets of the data.
➢ The algorithm continues to loop on each subset, excluding features used before.

Ross Quinlan, 1986

ID3 algorithm summary

33

❑ All elements in the subset belong to the same class (Yes or No, 1 or 0, + or −): the node is turned into a leaf node and labelled with the class of the examples.

❑ There are no more features to select, but the examples still do not belong to the same class (some are 1 and some are 0): the node is turned into a leaf node and labelled with the most common class of the examples in the subset.

❑ There are no examples in the subset, for example if there is no example with age >= 100: a leaf node is created and labelled with the most common class of the examples in the parent set.

Stopping Criteria
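Putting the pieces together, here is a compact, illustrative sketch of the ID3 loop and stopping criteria described above, for categorical features only. It follows the slides' recipe (pick the feature with the largest information gain, stop when a subset is pure or no features remain), but the data format (a list of dicts with a "Purchase" label) and the function names are assumptions, not the unit's reference implementation.

# ID3-style tree building on categorical features (illustrative sketch)
import numpy as np
from collections import Counter

def entropy(labels):
    p = np.array(list(Counter(labels).values()), dtype=float)
    p = p / p.sum()
    return float(-(p * np.log2(p)).sum())

def info_gain(rows, feature, target):
    gain = entropy([r[target] for r in rows])
    for value in set(r[feature] for r in rows):
        subset = [r[target] for r in rows if r[feature] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

def id3(rows, features, target="Purchase"):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:                      # all examples in the same class -> leaf
        return labels[0]
    if not features:                               # no features left -> majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: info_gain(rows, f, target))
    remaining = [f for f in features if f != best]
    # one branch per value observed in the current subset, so the empty-subset case does not arise here
    return {best: {v: id3([r for r in rows if r[best] == v], remaining, target)
                   for v in set(r[best] for r in rows)}}

Called on the toy purchase data (as a list of dicts) with features such as ["Income", "Education", "Marital Status"], this reproduces the kind of tree sketched on the earlier slides.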

What to do if…

In some leaf nodes there are no examples:

➢ Choose yes or no according to the number of yes/no examples at the parent.

Some examples have the same features but different labels: we have an error/noise.

➢ Stop and use a majority vote.

In the applications of our unit, we focus more on decision trees with binary splits. Also, scikit-learn uses an optimised version of the CART algorithm, which constructs binary trees.
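For the numeric features discussed a few slides back (binary splits of the form x < θ, as used by CART), the threshold can be chosen by evaluating candidate cut points between sorted values of different classes. A minimal sketch, with a made-up weight/label sample for illustration:

# choose a threshold for a numeric feature by maximising information gain
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_threshold(values, labels):
    order = np.argsort(values)
    v, y = np.asarray(values)[order], np.asarray(labels)[order]
    best_gain, best_theta = -1.0, None
    for i in range(1, len(v)):
        if y[i] == y[i - 1] or v[i] == v[i - 1]:
            continue                                # only boundaries between different classes
        theta = (v[i] + v[i - 1]) / 2               # midpoint between consecutive sorted values
        left, right = y[v < theta], y[v >= theta]
        gain = entropy(y) - (len(left) / len(y)) * entropy(left) - (len(right) / len(y)) * entropy(right)
        if gain > best_gain:
            best_gain, best_theta = gain, theta
    return best_theta, best_gain

weights = [2100, 2573, 3100, 3535, 3900, 4200]      # illustrative numeric feature values
labels  = ["Yes", "Yes", "Yes", "No", "No", "Yes"]
print(best_threshold(weights, labels))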

34

35

❑ If we keep growing the tree until it classifies the training set perfectly, we might over-fit.

❑ For example, we can keep splitting until each leaf node contains a single example.
❑ This will fit the training data perfectly, while NOT working on new test data.

[Plot: misclassification rate against tree size for the training set and a validation set; the training error keeps decreasing as the tree grows, while the validation error eventually increases.]

Overfitting in decision trees

36

Prepruning:
Stop growing when the data split is not statistically significant. For example: stop tree construction when the node size is smaller than a given limit, or when the impurity of a node is below a given limit. (Faster.)

Postpruning:
Grow the whole tree, then prune subtrees which overfit on the validation set. (More accurate.)

Tree Pruning

❑ Prepruning: stop splitting when the split is not statistically significant, for example:

➢ stop when the Info-Gain (Gain-Ratio) is smaller than a threshold;

➢ stop when there are only p examples (e.g. p = 5) in each leaf node.

❑ Postpruning: grow the full tree, then post-prune it based on a validation set.

❑ Regularization: penalize complex trees by minimizing a criterion with a "complexity" term, where complexity = the number of leaf nodes. $|T|$ denotes the number of leaf nodes of the tree $T$:

$\sum_{\text{all leaf nodes } j} H(\mathbf{S}_j) + \lambda\,|T|$

Note: as the tree grows, the complexity term grows but the entropy term shrinks (uncertainty decreases).

❑ Compute many trees on subsets of the data and test them: pick the best, or combine their predictions by voting.

❑ Random Forests are state-of-the-art classifiers!

37

How to Avoid Overfitting?
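In scikit-learn, the pre-pruning and post-pruning ideas above correspond to constructor parameters of DecisionTreeClassifier. A hedged sketch on synthetic data (the parameter values are illustrative, not recommendations):

# pre-pruning via size limits and post-pruning via cost-complexity (ccp_alpha)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# pre-pruning: stop early with a depth limit and a minimum leaf size
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5, random_state=0).fit(X_tr, y_tr)

# post-pruning: grow fully, then penalise complexity with ccp_alpha (chosen on validation data)
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_tr, y_tr)

print(pre.score(X_val, y_val), post.score(X_val, y_val))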

38

39

In the real implementation, we transform the categorical features into dummy
variables.
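A minimal sketch of that step: one-hot (dummy) encode the three categorical features with pandas and fit scikit-learn's CART-style classifier with the entropy criterion. The data reproduces the toy table; the dummy column names are whatever pandas generates.

# dummy-encode the categorical features and fit a depth-1 tree (the first split)
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

income    = ["Medium", "High", "High", "Low", "Low", "Low", "Medium", "High",
             "High", "Low", "High", "Low", "High", "Medium", "Medium"]
education = ["University", "University", "University", "University", "High school",
             "High school", "High school", "University", "High school", "High school",
             "High school", "University", "University", "University", "High school"]
marital   = ["Single", "Single", "Married", "Single", "Single", "Married", "Married",
             "Single", "Single", "Single", "Married", "Married", "Single", "Married", "Single"]
purchase  = ["Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No",
             "Yes", "Yes", "Yes", "No", "No", "Yes", "Yes"]

X = pd.get_dummies(pd.DataFrame({"Income": income, "Education": education, "Marital": marital}))
tree = DecisionTreeClassifier(criterion="entropy", max_depth=1, random_state=0).fit(X, purchase)
print(X.columns[tree.tree_.feature[0]])   # expected to pick Income_Medium for the root split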

40

We used the expected entropy as the impurity measure to select the best feature for the first split (depth 1). The expected entropies of the candidate (dummy) features are:

[0.82584103752696802,
0.97095059445466847,
0.72895548841643465,
0.78514543156506744,
0.78514543156506744,
0.95097750043269369,
0.95097750043269369]

Income-Medium is selected as the best feature to split the root node.

Left node: [ 0. 4.] => [0 no, 4 yes]
Right node: [ 6. 5.] => [6 no, 5 yes]

[Tree: the root (9 Yes / 6 NO) is split on Income-Medium; the Yes branch contains 4 Yes / 0 NO and the No branch contains 5 Yes / 6 NO.]

41

• A decision stump is a decision tree consisting of only one level.
• It is a decision tree with one root node which is immediately connected to the leaf nodes.
• We will use this concept to explain boosting in the next lecture.

[Diagram: examples plotted over features x1 and x2, and the corresponding one-level tree: if x1 < 10, predict positive (Yes branch); otherwise predict negative (No branch).]

Decision Stump

42

43

[Diagram: the feature space (x1 = age, x2 = experience) partitioned into five rectangular regions R1, …, R5 by cuts at x1 = 20, x1 = 35, x2 = 5 and x2 = 10.]

Regression tree intuition

44

[Tree diagram for the partition above: the root splits on x1 < 20; one subtree splits on x2 < 10 into regions R1 and R2; the other splits on x1 < 35, giving R3 on one branch and a further split on x2 < 5 into R4 and R5 on the other.]

Regression tree Intuition

45

Two steps of building a regression tree:

1. Partition the feature space — the set of possible values for x1, x2, …, xN — into J distinct and non-overlapping regions R1, R2, …, RJ.
2. For a new observation that falls into region Rj, we make the same prediction, which is simply the mean of the response values of the training examples in Rj.

Building Regression Tree

46

❑ A random forest (or random forests) is an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees.
❑ The term came from random decision forests, first proposed by Tin Kam Ho of Bell Labs in 1995.
❑ The method combines Breiman's "bagging" idea and the random selection of features.
❑ Random forests provide an improvement over bagged trees by way of a small random tweak that decorrelates the trees.

47

Random Forest introduction

48

• Random forests (RF) are a combination of tree predictors.
• Each tree depends on the values of a random sample drawn independently.
• The generalization error depends on the strength of the individual trees and the correlation between them.

[Diagram: Tree 1, …, Tree B, each built on a different random sample and then combined.]

Random Forest introduction

❑ Random forests provide an improvement over bagged trees by way of a small random tweak that decorrelates the trees.
❑ In bagging, we build a number of decision trees on bootstrapped training samples.
❑ Each time a split in a tree is considered, a random sample of p features is chosen as split candidates from the full set of d features.
❑ In RF, the number of features considered at each split is approximately equal to the square root of the total number of features.
❑ This avoids the situation in bagging where there is one quite strong feature, so that most or all of the trees use this strong predictor in the top split and produce very similar trees.
❑ Random forests overcome this problem by forcing each split to consider only a subset of the features.

49

Random Forest introduction

50

❑ Randomly select N observations (with replacement) from the original data set in order to produce a bootstrap data set.

1. For tree b = 1 to B:
   (a) Choose a bootstrap sample of size N from the training data.
   (b) Grow a random-forest tree T_b on the bootstrapped data by recursively repeating the following steps for each leaf node of the tree, until the minimum node size is reached:
      i. Select p variables at random from the d variables (p ≤ d). Why do this?
      ii. Pick the best variable/split-point among the p.
      iii. Split the node into two decision nodes.
2. Output the ensemble of trees $\{T_b\}_{1}^{B}$. Friedman et al., (2001)

RF Algorithm

(A scikit-learn sketch of this procedure appears after the review questions at the end.)

51

To make a prediction at a new point $\mathbf{x}_0$:

For regression:
$\hat{f}(\mathbf{x}_0) = \frac{1}{B}\sum_{b=1}^{B} T_b(\mathbf{x}_0)$

For classification, suppose the class prediction of the b-th random-forest tree is $C_b(\mathbf{x}_0)$:
$\hat{f}(\mathbf{x}_0) = \text{Mode}\{C_b(\mathbf{x}_0)\}_{b=1}^{B}$

RF Prediction

52

At each split in each tree, the improvement in the split criterion is the importance measure attributed to the splitting variable, and it is accumulated over all the trees in the forest, separately for each variable.

Feature importance

Review questions

• What are the intuitions behind decision trees?
• How do decision trees work? How do we choose the feature to split on?
• What is a decision stump?
• What is CART?
• What are tree growing and tree pruning?
• How does the ID3 algorithm work?
• What are entropy and information gain?
• How does random forest work?
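To close the loop on this last review question, here is a minimal random-forest sketch in scikit-learn; max_features="sqrt" corresponds to the "consider roughly √d features at each split" idea from the slides, and the data is synthetic, for illustration only.

# random forest: B trees on bootstrap samples, sqrt(d) features considered at each split
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=12, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200,        # B trees
                            max_features="sqrt",     # random subset of features at each split
                            bootstrap=True,          # each tree sees a bootstrap sample
                            random_state=0).fit(X_tr, y_tr)

print(rf.score(X_te, y_te))            # majority-vote accuracy on held-out data
print(rf.feature_importances_[:3])     # accumulated split-criterion improvements per feature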