ISOM3360 Data Mining for Business Analytics, Session 4
Decision Trees (I)
Instructor: Department of ISOM Spring 2022
Recap: Data Understanding
Preliminary investigation of the data to better understand its specific characteristics
❖ Help in selecting appropriate data mining algorithms
Things to look at
❖ Class imbalance
❖ Dispersion of data attribute values
❖ Skewness, outliers, missing values
❖ Correlation analysis
Visualization tools are important
❖ Histograms, box plots
❖ Scatter plot
Recap: Data Preprocessing
Data transformation might be needed
❖ Handling missing values
❖ Handling categorical variables
❖ Feature transformation
e.g., log transformation
Normalization (back to this when clustering is discussed)
❖ Feature discretization
Which of the following normalization method(s) may transform the original variable to a negative value: min-max, z-score, both, or neither?
Answer: min-max maps values into [0, 1], so it never produces negative values; z-score can produce negative values (any value below the mean).
Do you apply normalization on the training or testing set?
Answer: fit the normalization on the training set (compute the training-set mean and SD, or min and max), then use those same training-set statistics to normalize the testing set. In the example, the testing-set record (age 25, income 20,000) is normalized with the mean and SD of the training set.
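A minimal sketch of this rule, assuming scikit-learn is available (the feature values below are made up for illustration): the scaler is fitted on the training set only, and the fitted statistics are reused on the testing set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler  # z-score; use MinMaxScaler for min-max

# Hypothetical numeric features (columns: age, income)
X_train = np.array([[25, 20000], [40, 55000], [33, 41000]], dtype=float)
X_test  = np.array([[25, 20000], [50, 70000]], dtype=float)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learn mean/SD from the training data only
X_test_scaled  = scaler.transform(X_test)        # reuse the training mean/SD on the test data

print(X_train_scaled)
print(X_test_scaled)   # note: z-scores can be negative (values below the training mean)
```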
What Induction Algorithm Shall We Use?
Commonly Used Induction Algorithms
Post-mortem analysis of a popular data mining competition
Why Decision Trees?
Decision trees are one of the most popular data mining tools.
❖ Classification trees (used when target variable is categorical)
❖ Regression trees (used when target variable is numeric)
They are easy to understand, implement and use, and computationally cheap.
Model comprehensibility is important for communication with non-DM-savvy stakeholders.
Classification Tree
Objective: predicting borrowers who will default on loan payments.
Classification Tree: Upside-Down
[Tree diagram, drawn upside-down with the root at the top; leaf nodes are labeled Class = Not Default (three leaves) and Class = Default (one leaf)]
Classification Tree: Divide and Conquer
“Recursive Partitioning”
❖ Each node represents one attribute
❖ Tests on nominal attribute: number of splits (branches) is number of possible values or 2 (one value vs. rest)
❖ Continuous attributes are discretized
[Tree diagram: internal test nodes with Yes / No branches; leaves labeled Class = Not Default or Class = Default]
Each leaf node provides:
❖ A class assignment (e.g., default / not default)
❖ Also a distribution over all possible classes (e.g., default with probability 0.25, not default with prob. 0.75)
How a Tree is Used for Classification ?
To determine the class of a new example, e.g., Mark: age 40, retired, balance 38K.
❖ The example is routed down the tree according to the values of its attributes.
❖ At each node, a test is applied to one attribute.
❖ When a leaf is reached, the example is assigned to a class—or alternatively to a distribution over the possible classes (e.g., default with probability 0.25, not default with prob. 0.75).
Assigning Probability Estimates
[Tree diagram learned from the training set: the entire population is split on Age (>= 45 vs < 45), and a branch is further split on Balance (>= 50k vs < 50k)]
Q: How would you assign estimates of class probability (e.g., probability of default/not default) based on such trees?
Exercise: Assume this is the classification tree you learned from the past defaulting data. A new guy, who is 45 and has a 20K balance, is applying for a credit card issued by your company. Can you predict whether this new guy is going to default? How confident are you about your prediction? Another girl is also applying for the same credit card, but the only information we have about her is that she has a 70K balance. Can you predict if she will default? How sure are you about that?
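A small sketch of how such probability estimates can be read off a leaf (the helper function and the leaf contents are illustrative, not from the slides): the estimate is simply the class frequency among the training examples that fall into that leaf.

```python
from collections import Counter

def leaf_probabilities(leaf_labels):
    """Class-probability estimates at a leaf = class frequencies of the
    training examples that ended up in that leaf."""
    counts = Counter(leaf_labels)
    n = len(leaf_labels)
    return {cls: cnt / n for cls, cnt in counts.items()}

# Hypothetical leaf containing 1 defaulter and 3 non-defaulters:
leaf = ["default", "not default", "not default", "not default"]
print(leaf_probabilities(leaf))   # {'default': 0.25, 'not default': 0.75}
```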
Classification Tree Learning
Objective: based on customer attributes, partition the customers into subgroups that are less impure – with respect to the class (i.e., such that in each group most instances belong to the same class)
Classification Tree Learning
Partitioning into "purer" groups
[Figure: toy examples described by two attributes, Body Color (orange / purple) and Limb Color (blue / yellow), each labeled Yes or No]
Classification Tree Learning
Partitioning into "purer" groups recursively
[Figure: the examples are first split by Body Color into an orange-body group and a purple-body group; each group is then split again by Limb Color (blue vs. yellow), producing increasingly pure subgroups of Yes / No labels]
Summary: Classification Tree Learning
A tree is constructed by recursively partitioning the examples.
With each partition, the examples are split into subgroups that are “increasingly pure”.
The Idea of Partitioning
Let’s play a game. I have someone in my mind, and your job is to guess this person. You can only ask yes/no question.
This person is an entrepreneur.
Some important questions without being answered yet:
❖ How to automatically choose which attribute to be used to split the data?
❖ When to stop the splitting?
How to Choose Which Attribute to Split Over?
Objectives
❖ For each splitting node, choose the attribute that best partitions the population into less impure groups.
❖ All else being equal, fewer nodes is better (more comprehensible, easier to use, reduces overfitting).
Impurity measures: many are available, but the most common one (from information theory) is entropy.
Entropy = - Σ_i p_i log2(p_i), where p_i is the proportion of class i in the data
For example: our initial population is composed of 16 cases
of class “Default” and 14 cases of class “Not default”
Entropy (entire population of examples) = -(16/30) log2(16/30) - (14/30) log2(14/30) ≈ 0.997 (close to the maximum of 1, i.e., high impurity)
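A short sketch of this entropy computation in Python (the function name is just for illustration):

```python
import math

def entropy(class_counts):
    """Entropy = -sum_i p_i * log2(p_i), with 0*log2(0) taken as 0."""
    total = sum(class_counts)
    probs = [c / total for c in class_counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

# Initial population: 16 "Default" and 14 "Not default" cases
print(round(entropy([16, 14]), 3))   # 0.997
```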
Entropy Exercise
A dataset is composed of 10 cases of class "Positive" and 10 cases of class "Negative". Entropy = ?
A dataset is composed of 0 cases of class "Positive" and 20 cases of class "Negative". Entropy = ?
Tip: 0 log2 0 = 0
[Figure: two-class entropy function plotted against the % of one class]
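A brief sketch that reproduces such a curve, assuming numpy and matplotlib are available:

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 200)                 # % of one class
H = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))   # two-class entropy

plt.plot(p, H)
plt.xlabel("% of one class")
plt.ylabel("entropy")
plt.title("Two-class entropy function")
plt.show()
```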
Information Gain (based on Entropy)
The information gain is based on the reduction in entropy after a dataset is split on an attribute
Information Gain = entropy(parent) – [weighted average] entropy(children)
where weight of each child is given by the proportion of the examples in that child.
[Diagram: a parent node split on Balance into two child nodes, Balance < 50k and Balance >= 50k]
Information Gain Example
30 instances (16 Default, 14 Not Default), Entropy = 0.997
Split on Balance:
❖ Balance < 50k: 17 instances
❖ Balance >= 50k: 13 instances
[Figure: class distribution within each child node]
Note: after a split, the weighted average entropy of the children can only decrease or remain unchanged, so information gain is never negative.
(Weighted) Average Impurity of Children = 0.615
Information Gain = 0.997 – 0.615 = 0.382
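A sketch of the information gain computation (the per-child class counts below are reconstructed so that they reproduce the slide's 0.615 and 0.382 figures; the slide itself does not show them, so treat them as illustrative):

```python
import math

def entropy(class_counts):
    total = sum(class_counts)
    return -sum(c / total * math.log2(c / total) for c in class_counts if c > 0)

def information_gain(parent_counts, children_counts):
    """IG = entropy(parent) - weighted average entropy(children)."""
    n = sum(parent_counts)
    weighted_children = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted_children

# Parent: 16 Default, 14 Not default (entropy ~0.997).
# Illustrative child class counts consistent with the slide (17 and 13 instances):
parent = [16, 14]
children = [[4, 13], [12, 1]]   # e.g., Balance < 50k and Balance >= 50k
print(round(information_gain(parent, children), 3))
# ~0.38; the slide's rounded figures give 0.997 - 0.615 = 0.382
```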
What If We Split Over "Age" First?
30 instances (Entropy = 0.997)
Split on Age: 15 instances in each child node
Impurity of each child = ? Exercise!
[Handwritten working: the two child entropies are 0.722 and 0.836, so the weighted average impurity = (15/30)(0.722) + (15/30)(0.836) = 0.779]
Information Gain = 0.997 – 0.779 = 0.218
Recall: the gain from first splitting on "Balance" = 0.382, so "Balance" is the better attribute to split on first.
Our Original Question
Now we are ready to answer the question we have been waiting for:
How to choose which attribute to split over?
Answer: at each node, choose the attribute that obtains maximum information gain!
Decision Tree Algorithm (Full Tree)
❑ Step 1: Calculate the information gain from splitting over each attribute using the dataset
❑ Step 2: Split the set into subsets using the attribute for which the information gain is maximum
❑ Step 3: Make a decision tree node containing that attribute, divide the dataset by its branches, and repeat the same process on every node.
❑ Step 4a: A node with entropy of 0, or that cannot be partitioned further, or that obtains no information gain from an additional split, is a leaf.
❑ Step 4b: Otherwise, the node needs further splitting.
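A compact ID3-style sketch of these four steps for categorical attributes (the toy records and labels below are invented for illustration; in practice one would typically use a library such as scikit-learn's DecisionTreeClassifier with criterion="entropy"):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Information gain from splitting (rows, labels) on a categorical attribute."""
    n = len(labels)
    children = {}
    for row, y in zip(rows, labels):
        children.setdefault(row[attr], []).append(y)
    weighted = sum(len(ys) / n * entropy(ys) for ys in children.values())
    return entropy(labels) - weighted

def build_tree(rows, labels, attrs):
    # Step 4a: pure node or no attributes left -> leaf (majority class)
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Steps 1-2: pick the attribute with maximum information gain
    gains = {a: info_gain(rows, labels, a) for a in attrs}
    best = max(gains, key=gains.get)
    if gains[best] <= 0:                       # Step 4a: no gain from splitting -> leaf
        return Counter(labels).most_common(1)[0][0]
    # Step 3: create a node for that attribute and recurse on each branch (Step 4b)
    tree = {best: {}}
    for value in set(row[best] for row in rows):
        sub = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*sub)
        tree[best][value] = build_tree(list(sub_rows), list(sub_labels),
                                       [a for a in attrs if a != best])
    return tree

# Tiny illustrative dataset (attribute values are made up)
rows = [{"age": ">=45", "balance": "<50k"},
        {"age": ">=45", "balance": ">=50k"},
        {"age": "<45",  "balance": "<50k"},
        {"age": "<45",  "balance": ">=50k"}]
labels = ["default", "default", "not default", "default"]
print(build_tree(rows, labels, ["age", "balance"]))
```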