DECISION TREES
COMP2420/COMP6420 INTRODUCTION TO DATA MANAGEMENT, ANALYSIS AND SECURITY
WEEK 5 – LECTURE 1
Monday 21 March 2022 (recorded version 13/04/2022)
School of Computing
College of Engineering and Computer Science
Credit: (previous course convenor)
Learning Outcomes
Describe what decision trees are
Explain the concept of impurity measures
Describe the concept of entropy
Use the concept of entropy in the Decision tree algorithm
Explain advantages and disadvantages of using DT
MOTIVATION
Classification with a Tree
• We have covered linear classification and k-nearest neighbours
• Any other naive idea to do classification?
• Pick one feature and test how it does
• Conditioned on the last choice, repeat this exercise
• In the leaves, assign a class with majority vote
• Do it for all branches
A tree with decisions based on axis values
Prediction
Example discrete data
Decision: to wait or not for a table at a restaurant?
Source: https://cs.adelaide.edu.au/~dsuter/Harbin_course/DecisionTrees.pdf
An example tree for this data
DECISION TREES
• Internal nodes test attributes
• Branching is determined by attribute value
• Leaf nodes are class assignments
• Data is of the form (x, y), where x = (x_1, x_2, …, x_d) is a set of features (or attributes) and y is an outcome (class or value)
• Given a set 𝐷 of labeled training samples, learn (generate) a decision tree (a predictive model) that correctly (optimally) predicts the outcomes
• The tree should (if generated well) perform well on test data samples
• The most common strategy for learning a decision tree is a greedy algorithm:
1. Choose an attribute on which to descend at each level
2. Condition on earlier (higher) choices
3. Generally, restrict only one dimension at a time
4. Declare an output value when you get to the bottom
(In the orange/lemon example, we only split each dimension once, but it’s not a rule)
Classification and Regression
Each path from root to a leaf defines a region R_m of input space
Let {(x^(m1), t^(m1)), …, (x^(mk), t^(mk))} be the training examples that fall into R_m
Classification tree:
discrete output
leaf value y_m typically set to the most common value in {t^(m1), …, t^(mk)}
Classification and Regression (cont)
Regression tree:
continuous output
leaf value y_m typically set to the mean value in {t^(m1), …, t^(mk)}
Note: We will only talk about classification
Expressiveness
• Discrete-input, discrete-output case:
– Decision trees can express any function of the input attributes
• Continuous-input, continuous-output case:
– Can approximate any function arbitrarily closely
• Trivially, there is a consistent decision tree for any training set, but it probably won’t generalize to new examples
Need some kind of regularization to ensure a more compact decision tree
Problem with learning
• Learning the simplest (smallest) decision tree is an NP-complete problem
[if you are interested, check: Hyafil & Rivest '76]
• Resort to a greedy heuristic:
– Start from an empty decision tree
– Split on next best attribute
– Recurse
• What is best attribute?
Choosing a good attribute
Which attribute is better to split on, X1 or X2?
Idea: Use counts at leaves to define probability distributions, so we can measure uncertainty
Which attribute is better to split on, X1 or X2?
• Deterministic: good (all are true or false; just one class in the leaf)
• Uniform distribution: bad (all classes in leaf equally probable)
• What about distributions in between?
• If all samples at a node belong to the same class, we call it a pure node; else, it is an impure node
• Node impurity is
– zero when all examples at the node belong to one class
– maximum when all classes at the node are equally likely
• We want to minimize the impurity at each node
Approaches to choosing a good attribute
Let P(c_i) be the fraction of samples at node N that belong to class c_i
• Gini impurity = Σ_i P(c_i)(1 − P(c_i))
• Misclassification impurity = 1 − max_i P(c_i)
• Entropy impurity = − Σ_i P(c_i) log₂ P(c_i)
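A minimal Python sketch of the three measures (function names are illustrative, not from the course code). Each takes the list of class counts at a node; the [5, 3] check corresponds to the 8-sample node (5 of one class, 3 of the other) used in the worked examples that follow.

```python
import math

def gini(counts):
    """Gini impurity: sum_i P(c_i) * (1 - P(c_i))."""
    n = sum(counts)
    return sum((c / n) * (1 - c / n) for c in counts)

def misclassification(counts):
    """Misclassification impurity: 1 - max_i P(c_i)."""
    n = sum(counts)
    return 1 - max(c / n for c in counts)

def entropy(counts):
    """Entropy impurity: -sum_i P(c_i) * log2 P(c_i), treating 0*log(0) as 0."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

print(gini([5, 3]))              # 0.46875
print(misclassification([5, 3])) # 0.375
print(entropy([5, 3]))           # ~0.9544
```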
Approaches to choosing a good attribute
Gini impurity
Definition: The probability of incorrectly classifying a randomly chosen element in the dataset if it were randomly labelled according to the class distribution at the node
Let P(c_i) be the fraction of samples at node N that belong to class c_i
Gini impurity G = Σ_i P(c_i)(1 − P(c_i))
G (before split) = 5/8 × (1 − 5/8) + 3/8 × (1 − 3/8) = 0.46875
L1: G = (4/4)(1 − 4/4) + (0/4)(1 − 0/4) = 0
L2: G = (3/4)(1 − 3/4) + (1/4)(1 − 1/4) = 3/8
L3: G = (3/4)(1 − 3/4) + (1/4)(1 − 1/4) = 3/8
L4: G = (2/4)(1 − 2/4) + (2/4)(1 − 2/4) = 1/2
Please note the correction made in the Gini impurity value before the split (error in previous slides shared)
Gini Gain (split producing L3 and L4) = 0.4688 − [(0.5 × 3/8) + (0.5 × 1/2)] = 0.4688 − 0.4375 = 0.031
Gini Gain (split producing L1 and L2) = 0.4688 − [(0.5 × 0) + (0.5 × 3/8)] = 0.4688 − 0.1875 = 0.281
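A short sketch reproducing the Gini-gain arithmetic above. It assumes, as the slide's figures suggest, that the parent node holds 8 samples (5 vs 3) and that each candidate split sends 4 samples to each child, so both children carry weight 0.5.

```python
def gini(counts):
    n = sum(counts)
    return sum((c / n) * (1 - c / n) for c in counts)

parent = gini([5, 3])                                         # 0.46875

# split producing leaves L3 (3 vs 1) and L4 (2 vs 2)
gain_34 = parent - (0.5 * gini([3, 1]) + 0.5 * gini([2, 2]))  # ~0.031

# split producing leaves L1 (4 vs 0) and L2 (3 vs 1)
gain_12 = parent - (0.5 * gini([4, 0]) + 0.5 * gini([3, 1]))  # ~0.281

print(gain_34, gain_12)  # the split with the larger gain is preferred
```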
Approaches to choosing a good attribute
Misclassification impurity
Given by the classification error, i.e. 1 − accuracy, where accuracy is the number of samples we would classify correctly (by predicting the majority class) over the total number of samples at the node
Let P(c_i) be the fraction of samples at node N that belong to class c_i
Misclassification impurity (M) = 1 − max_i P(c_i)
L1: M = 1 − 4/4 = 0
L2: M = 1 − 3/4 = 1/4
L3: M = 1 − 3/4 = 1/4
L4: M = 1 − 2/4 = 1/2
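The same four leaves scored with misclassification impurity, as a quick check of the arithmetic above (again a sketch, not course code):

```python
def misclassification(counts):
    n = sum(counts)
    return 1 - max(c / n for c in counts)

for name, counts in [("L1", [4, 0]), ("L2", [3, 1]), ("L3", [3, 1]), ("L4", [2, 2])]:
    print(name, misclassification(counts))  # 0, 0.25, 0.25, 0.5
```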
Approaches to choosing a good attribute
Entropy impurity
Entropy: a measure of how much uncertainty (disorder) a set of data has.
Let P(c_i) be the fraction of samples at node N that belong to class c_i
Entropy impurity (E) = − Σ_i P(c_i) log₂ P(c_i)
L1: E = −(4/4) log₂(4/4) = 0
L2: E = −(3/4) log₂(3/4) − (1/4) log₂(1/4) = 0.811
L3: E = −(3/4) log₂(3/4) − (1/4) log₂(1/4) = 0.811
L4: E = −(2/4) log₂(2/4) − (2/4) log₂(2/4) = 1
E (before split) = −(5/8) log₂(5/8) − (3/8) log₂(3/8) = 0.9544
Please note the correction made in the Entropy impurity value before the split (error in previous slides shared)
Info Gain (split producing L3 and L4) = 0.9544 − [(0.5 × 0.811) + (0.5 × 1)] = 0.0489
Info Gain (split producing L1 and L2) = 0.9544 − [(0.5 × 0) + (0.5 × 0.811)] = 0.549
Attribution: Decision Tree – ML Wiki
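A short sketch reproducing the information-gain arithmetic with entropy impurity, under the same assumption as before (8 samples at the parent, 4 per child, weights 0.5):

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

parent = entropy([5, 3])                                          # ~0.9544

# split producing leaves L3 (3 vs 1) and L4 (2 vs 2)
ig_34 = parent - (0.5 * entropy([3, 1]) + 0.5 * entropy([2, 2]))  # ~0.0489

# split producing leaves L1 (4 vs 0) and L2 (3 vs 1)
ig_12 = parent - (0.5 * entropy([4, 0]) + 0.5 * entropy([3, 1]))  # ~0.549

print(ig_34, ig_12)
```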
Information theory
• We use information theory to guide us
• We won’t use information theory excessively, using only a simple probability formulae
• Let’s look at some information theory concepts required to build decision trees
• We will restrict ourselves to intuitive explanations and formulae to use
A motivating example of flipping two coins
(18 times) 0=Head; 1=Tail
Quantifying uncertainty
• How surprised are we by a new value in the sequence?
• How much information does it convey?
Entropy of joint distribution
Specific conditional entropy
Conditional entropy
Conditional entropy (contd..)
ENTROPY BASED ALGORITHM
Information Gain
• How much information about cloudiness do we get by discovering whether it is raining?
• IG(Y|X) = H(Y) − H(Y|X) ≈ 0.25
• Also called Information Gain in Y due to X
• How can we use this to construct our decision tree?
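The slide's cloudiness/raininess table is not reproduced in these notes, so the joint distribution below is an assumed illustration only; with it, IG(Y|X) = H(Y) − H(Y|X) comes out to roughly the 0.25 quoted above. The sketch shows how the gain is computed from a joint table.

```python
import math

def H(probs):
    """Entropy of a discrete distribution, ignoring zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# joint[x][y] = P(X = x, Y = y); X = is it raining?, Y = is it cloudy?
# (illustrative numbers, not taken from the lecture slide)
joint = {
    "raining":     {"cloudy": 0.24, "not cloudy": 0.01},
    "not raining": {"cloudy": 0.25, "not cloudy": 0.50},
}

# marginal P(Y)
p_y = {}
for row in joint.values():
    for y, p in row.items():
        p_y[y] = p_y.get(y, 0.0) + p
H_Y = H(p_y.values())

# H(Y|X) = sum_x P(X = x) * H(Y | X = x)
H_Y_given_X = 0.0
for row in joint.values():
    p_x = sum(row.values())
    H_Y_given_X += p_x * H([p / p_x for p in row.values()])

print(H_Y - H_Y_given_X)  # information gain, ~0.25
```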
Constructing decision trees
• Earlier we partitioned the fruit data just by eyeballing it.
• We can use the information gain to automate the process.
• At each level, one must choose:
• Which variable to split on
• Possibly where to split it
• Choose them based on how much information we would gain (highest gain) from the decision!
Greedy construction algorithm
Simple, greedy, recursive approach, builds up tree node-by-node
1. pick an attribute to split at a non-terminal node
2. split examples into groups based on attribute value
3. for each group:
I. if no examples – return majority from parent
II. else if all examples in same class – return class
III. else loop to step 1
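A minimal sketch of this greedy, recursive procedure for discrete attributes, using entropy and information gain to pick the split (names and data layout are illustrative, not from the course code). Examples are dicts mapping attribute names to values.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, labels, attr):
    """Information gain from splitting the node on attribute `attr`."""
    n = len(labels)
    groups = {}
    for x, y in zip(examples, labels):
        groups.setdefault(x[attr], []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def build_tree(examples, labels, attrs, parent_majority=None):
    if not labels:                          # step I: no examples -> majority from parent
        return parent_majority
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attrs:  # step II: pure node (or no attributes left)
        return majority
    best = max(attrs, key=lambda a: info_gain(examples, labels, a))   # step 1
    tree = {"attribute": best, "branches": {}}
    for value in {x[best] for x in examples}:                         # step 2
        sub_x = [x for x in examples if x[best] == value]
        sub_y = [y for x, y in zip(examples, labels) if x[best] == value]
        tree["branches"][value] = build_tree(                         # step III: recurse
            sub_x, sub_y, [a for a in attrs if a != best], majority)
    return tree
```

Called on a discrete table like the restaurant example (a list of attribute dicts plus a list of wait/leave labels), this returns a nested dict whose internal nodes record the attribute tested and whose leaves are class labels.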
Revisit the restaurant example
Attribute Selection
Which tree is better?
What makes a tree good?
• Not too small (if too small → underfitting): need to handle important but possibly subtle distinctions in data
• Not too big (if too big → overfitting):
– Computational efficiency (avoid redundant, spurious attributes)
– Avoid over-fitting training examples
• Occam’s Razor: find the simplest hypothesis (smallest tree) that fits the observations
• Leads to small trees with informative nodes near the root
Points to remember
• Problems:
– You have exponentially less data at lower levels
– Too big of a tree can overfit the data
– Greedy algorithms don’t necessarily yield the global optimum
• In practice, one often regularizes the construction process to try to get small but highly-informative trees (tree depth, number of nodes/leaves, number of samples per split, minimum gain, pruning, hypothesis testing, etc)
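In scikit-learn these regularizers are exposed as hyperparameters of DecisionTreeClassifier. A minimal sketch, assuming scikit-learn is installed and using a built-in dataset purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(
    criterion="entropy",         # impurity measure ("gini" is the default)
    max_depth=3,                 # limit on tree depth
    min_samples_split=10,        # minimum samples required to split a node
    min_impurity_decrease=0.01,  # minimum gain required for a split
    ccp_alpha=0.0,               # cost-complexity pruning strength
)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```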
Real world example: Xbox!
Used by Kinect to identify and classify different body parts and poses
• Can express any Boolean function, but most useful when the function depends critically on only a few attributes
• Bad on: parity and majority functions; also not well-suited to continuous attributes
Advantages
• Easy to implement and use
• Intuitive and relatively easy to interpret
• Computationally fast (using heuristics to solve)
• Not sensitive to outliers (split is based on proportion of samples)
Disadvantages
• A small change in data can cause large change in the structure of the decision tree (causing instability)
• Higher time to train the model (e.g. compared to KNN)
• Can be relatively inaccurate (due to overfitting) compared to other methods (for a single tree; this can be mitigated by combining multiple trees in a Random Forest)
Although a very old and simple algorithm, decision trees are used extensively in practical applications:
• Flight simulator: 20 state variables; 90K examples based on expert pilot’s actions; auto-pilot tree
• Yahoo Ranking Challenge (more details)
• Random Forests, a modern variant (ensemble learning technique): Microsoft Kinect Pose Estimation
Live coding
• Next: practice!
• Mindika will cover some examples in live coding to help with concepts
Reference list and further reading
• UC Berkeley, CS194-10 Introduction to Machine Learning, Fall 2011, Lecture 8
• MIT ML12 Introduction to Machine Learning, Fall 2012, Lecture 11
• Stanford CS229, Review of Probability Theory notes