ISOM3360 Data Mining for Business Analytics, Session 5
Decision Trees (II)
Instructor: Department of ISOM Spring 2022
Recap: Classification Tree Learning
A tree is constructed by recursively partitioning the examples.
With each partition, the examples are split into subgroups that are “increasingly pure”.
How to choose the right attribute to split over first?
– Choose the attribute that maximizes information gain!
Information Gain =
entropy (parent) – [weighted average] entropy (children)
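As a minimal sketch (Python with numpy assumed; the example labels at the bottom are made up), these two quantities can be computed as follows:

import numpy as np

def entropy(labels):
    # Entropy of a collection of class labels, in bits
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent_labels, children_labels):
    # entropy(parent) minus the weighted average entropy of the children
    n = len(parent_labels)
    weighted = sum(len(child) / n * entropy(child) for child in children_labels)
    return entropy(parent_labels) - weighted

# Made-up example: split 5 labels into two subgroups
parent = ['no', 'yes', 'yes', 'yes', 'yes']
children = [['no', 'yes'], ['yes', 'yes', 'yes']]
print(information_gain(parent, children))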
Decision Tree Algorithm (Full Tree)
❑ Step 1: Calculate the information gain from splitting over each attribute using the dataset
❑ Step 2: Split the set into subsets using the attribute for which the information gain is maximum
❑ Step 3: Make a decision tree node containing that attribute, divide the dataset by its branches and repeat the same process on every node.
❑ Step 4a: A node becomes a leaf if its entropy is 0, it cannot be partitioned further, or additional splits yield no information gain.
❑ Step 4b: Otherwise, the node needs further splitting.
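This greedy procedure is essentially what scikit-learn's DecisionTreeClassifier implements (with binary CART-style splits rather than multiway splits). A small sketch on made-up data, where criterion='entropy' makes splits be chosen by information gain:

from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up toy data: [balance_in_k, age] -> default (1) / no default (0)
X = [[30, 25], [80, 40], [20, 30], [90, 55], [40, 22], [120, 48]]
y = [1, 0, 1, 0, 1, 0]

tree = DecisionTreeClassifier(criterion='entropy')  # split by information gain
tree.fit(X, y)
print(export_text(tree, feature_names=['balance', 'age']))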
Information Gain Exercise
Without calculation, which split (left or right) gives a higher information gain?
Another Measure: Gini Impurity
Similar to entropy, the Gini Impurity also measures the impurity of the data.
Gini impurity = 1 - Σ_k (p_k)^2, where p_k is the proportion of class k in the data
Example: Gini impurity = 1 - (1/5)^2 - (4/5)^2 = 0.32
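A quick Python check of this example (minimal sketch):

def gini_impurity(class_counts):
    # Gini impurity from a list of class counts
    total = sum(class_counts)
    return 1 - sum((c / total) ** 2 for c in class_counts)

print(gini_impurity([1, 4]))  # 0.32, matching the example above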
Gini Impurity vs. Entropy
Entropy (scaled) = Entropy / 2, so both measures range from 0 to 0.5 for binary classification
Information Gain (based on Gini Impurity)
The information gain (based on Gini impurity) from splitting on an attribute is given by:
Information Gain (based on Gini impurity)
= Gini impurity (parent) – [weighted average] Gini impurity (children)
The attribute that maximizes this information gain is chosen as the splitting attribute.
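A minimal Python sketch of this Gini-based information gain (the class counts at the bottom are hypothetical, used only to show the call):

def gini_impurity(class_counts):
    total = sum(class_counts)
    return 1 - sum((c / total) ** 2 for c in class_counts)

def gini_information_gain(parent_counts, children_counts):
    # Gini impurity (parent) - [weighted average] Gini impurity (children)
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * gini_impurity(child) for child in children_counts)
    return gini_impurity(parent_counts) - weighted

# Hypothetical class counts: a 30-instance parent split into children of 13 and 17 instances
print(gini_information_gain([10, 20], [[8, 5], [2, 15]]))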
Information Gain (based on Gini Impurity)
Parent node: 30 instances, impurity = ?
Balance < 50k: 13 instances, impurity = ?
Balance >= 50k: 17 instances, impurity = ?
Information Gain = ?
What If We Split Over “Age” First?
Parent node: 30 instances, impurity = ?
Age split, left branch: 15 instances, impurity = ?
Age split, right branch: 15 instances, impurity = ?
Information Gain = ?
What if Outcome Variable Has >2 Categories?
If K (the number of classes) > 2, we proceed as before, first computing the multi-class entropy or Gini impurity.
What if Outcome Variable is Numeric?
Regression trees
❖ Predicted output = average of the target values of the training examples in the subset (leaf)
❖ Potential splits are measured on how much they reduce the mean squared error (MSE)
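A minimal Python sketch of how a candidate split could be scored (numpy assumed; the target values are made up):

import numpy as np

def mse(y):
    # Mean squared error around the leaf's prediction (the mean of y)
    y = np.asarray(y, dtype=float)
    return np.mean((y - y.mean()) ** 2)

def mse_reduction(y_parent, y_left, y_right):
    # How much the split reduces the weighted MSE relative to the parent node
    n = len(y_parent)
    weighted = len(y_left) / n * mse(y_left) + len(y_right) / n * mse(y_right)
    return mse(y_parent) - weighted

# Made-up target values before and after a candidate split
y_parent = [10, 12, 30, 32, 11, 31]
y_left, y_right = [10, 12, 11], [30, 32, 31]
print(mse_reduction(y_parent, y_left, y_right))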
An Example of Regression Tree
A Note on Handling Continuous Variables
Handle a continuous attribute by splitting it into two intervals (can be more) at each node.
How to find the best threshold to divide?
❖ Answer: try them all!
❖ Use information gain (or other measures) again.
❖ Sort all the values of a continuous attribute in increasing order: {v1, v2, …, vr}.
❖ Place one candidate threshold between each pair of adjacent values vi and vi+1; try all candidate thresholds and pick the one that maximizes the purity.
What if we have a lot of unique values (say 100,000)?
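A minimal Python sketch of this exhaustive threshold search (candidate thresholds are placed midway between adjacent sorted values; the data at the bottom are made up):

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_threshold(values, labels):
    # Sort the values, try every midpoint between adjacent distinct values,
    # and return the threshold with the highest information gain
    values, labels = np.asarray(values), np.asarray(labels)
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    parent_entropy = entropy(labels)
    best_gain, best_t = -1.0, None
    for i in range(len(values) - 1):
        if values[i] == values[i + 1]:
            continue  # no threshold between identical values
        t = (values[i] + values[i + 1]) / 2
        left, right = labels[values <= t], labels[values > t]
        gain = parent_entropy - (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

print(best_threshold([0.65, 0.70, 0.75, 0.87, 0.88, 0.90],
                     ['yes', 'yes', 'no', 'no', 'yes', 'no']))

With many unique values, implementations often reduce the cost by considering only a subset of candidate thresholds (e.g., quantile-based bins).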
An Example in a Continuous Space
Geometric Interpretation
A classification tree partitions the space of examples with axis-parallel decision boundaries.
[Figure: a two-dimensional example space (balance vs. age) with 15 Bad risk (Default) cases and 17 Good risk (Not default) cases; the tree first splits over balance, then over age >= 45, producing axis-parallel regions labeled Default and Not Default.]
Right Now, We Stop the Partitioning When
Maximum purity is obtained (i.e., all training examples reaching the node are of the same class)
There are no remaining attributes for further partitioning (think of two records with exactly the same values on the explanatory variables but opposite values on the target variable: maximum purity cannot be achieved).
Additional splits yield no information gain.
Do we really need to grow the tree fully?
A Fully-Grown Tree
3000 training examples
[Figure: a fully-grown classification tree with dozens of splits on Income, Experience, Family, CCAvg, Education, CDAccount, Mortgage, Online, and Age; each leaf shows the number of examples in each of the two classes.]
A fully-grown tree has a training error close to 0. However, such a tree does not learn general patterns; it simply memorizes the training data!
A Problem: Overfitting
A tree may overfit the training data
❖ Symptoms: good accuracy on the training data but poor accuracy on unseen (test) data.
❖ The tree is too deep and has too many branches, some of which may reflect anomalies due to noise or outliers.
Growing to Purity is Bad (Overfitting)
Decent decision tree rule:
Humidity > 0.70 -> No
otherwise -> Yes
(Target variable: Play Tennis)
Overfit decision tree rule:
Humidity > 0.89 -> No
Humidity > 0.87 AND Humidity <= 0.89 -> Yes
Humidity > 0.70 AND Humidity <= 0.87 -> No
Humidity <= 0.70 -> Yes
Symptoms of Overfitting
What Is a Good Model?
A good model must not only fit the training data well but also accurately classify records it has never seen.
In other words, a good model must have low training error and low generalization error.
❖ Training error: the misclassification error rate of the model on training records.
❖ Generalization error (can use testing error as a proxy): the expected error rate of the model on previously unseen records.
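A small scikit-learn sketch of measuring both (synthetic data and a deliberately unpruned tree, used only to make the gap visible):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data so that memorization does not generalize
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

full_tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
print('training error:', 1 - full_tree.score(X_train, y_train))  # close to 0
print('testing error :', 1 - full_tree.score(X_test, y_test))    # noticeably higher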
Overfitting
“Financial professionals must always be aware of the dangers of overfitting a model based on historical data. A common problem is using computer algorithms to search extensive databases of historical market data in order to find patterns. Given enough study, it is often possible to develop elaborate theorems which appear to predict things such as returns in the stock market with close accuracy. However, when applied to data outside of the sample, such theorems may likely prove to be merely an overfitting of a model to what were in reality just chance occurrences. In all cases, it is important to test a model against data which is outside of the sample used to develop it. ”
———Investopedia
Occam’s Razor
“Everything should be made as simple as possible, but no simpler.”
All other things being equal, simple models are preferable to complex ones.
❖ A simple model that fits the data is unlikely to be a coincidence.
❖ A complex hypothesis that fits the data might be a coincidence.
A Simpler Tree (Leaf Nodes are Not Pure)
3000 training examples
This simpler tree may have better generalization performance.
Leaf nodes that are not pure assign class labels probabilistically: each class is predicted with probability equal to its proportion among the training examples reaching the leaf.
How to Avoid Overfitting for Decision Trees?
Pruning: replace a sub-tree with a leaf and assign the most popular class to the leaf.
Approach 1: Pre-pruning -> early stopping.
Approach 2: Post-pruning -> grow full tree first, then prune the nodes of the decision tree afterwards.
Pre-Pruning: Early Stopping
Stop growing if
❖ the depth of the tree is higher than some user-specified threshold
❖ the number of instances is less than some user-specified threshold
Very simple, not data-driven, but can incorporate domain knowledge.
Depth of a Tree
Depth of tree
❖ The depth of a node is the number of edges from the node to the tree’s root node.
❖ A root node will have a depth of 0.
❖ The depth of a tree is the depth of its deepest node.
Q: What is the depth of this tree?
Limiting Minimal Number of Instances
Minimal number of instances: 30
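In scikit-learn, these pre-pruning rules map to hyperparameters of DecisionTreeClassifier; the particular thresholds below (max_depth=3, min_samples_leaf=30) are illustrative, not prescribed values:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# Early stopping: cap the depth and require at least 30 instances in each leaf
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=30, random_state=0)
tree.fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())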
Post-Pruning
Reduced error pruning
❖ Split training data into a training and a validation set
Validation data is just for parameter tuning, NOT for testing!
❖ Grow a full tree on training set
❖ Trim the nodes of the decision tree in a bottom-up manner: eliminate a subtree if the resulting pruned tree performs no worse than the original on the validation set.
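scikit-learn does not implement reduced error pruning directly; a common substitute is its cost-complexity post-pruning, with the pruning strength ccp_alpha tuned on a validation set. A hedged sketch:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Candidate pruning strengths from the cost-complexity pruning path of the full tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative values from floating point
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = pruned.score(X_val, y_val)  # validation accuracy, used only for tuning
    if score >= best_score:  # on ties, prefer the simpler (more pruned) tree
        best_score, best_alpha = score, alpha

print(best_alpha, best_score)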
Reduced Error Pruning
Training data
Validation data
Pros and Cons of Decision Tree Learning
Pros:
❖ Simple to understand and interpret
❖ Require little data preparation (no need for normalization of numerical variables, one-hot encoding of categorical variables, etc.)
Cons:
❖ Can overfit, thus need pruning steps
❖ Tree structure is not very stable*
*A learning algorithm is said to be unstable if it is sensitive to small changes in the training data.