Decision_Trees_from_Scratch
Decision Trees¶
Decision trees are mainly used to solve classification problems. This notebook covers how a decision tree is created and shows how to inspect the resulting tree and use it to classify new inputs.
This is based on sample code from Data Science from Scratch by Joel Grus, O'Reilly Media, 2015.
**Problem**
The VP provides you with the interviewee data, consisting of (per your specification) pairs $(input, label)$, where each input is a dict of candidate attributes, and each label is either $True$ (the candidate interviewed well) or $False$ (the candidate interviewed poorly). In particular, you are provided with each candidate’s level, their preferred language, whether they are active on Twitter, and whether they have a PhD.
**Solution**
Our tree will consist of decision nodes (which ask a question and direct us differently depending on the answer) and leaf nodes (which give us a prediction). We will build it using the relatively simple ID3 algorithm, which operates in the following manner. Let’s say we’re given some labeled data, and a list of attributes to consider branching on.
If the data all have the same label, then create a leaf node that predicts that label and then stop.
If the list of attributes is empty (i.e., there are no more possible questions to ask), then create a leaf node that predicts the most common label and then stop.
Otherwise, try partitioning the data by each of the attributes.
Choose the partition with the lowest partition entropy.
Recur on each partitioned subset using the remaining attributes.
This is what’s known as a “greedy” algorithm because, at each step, it chooses the most immediately best option. Given a data set, there may be a better tree with a worse-looking first move. If so, this algorithm won’t find it. Nonetheless, it is relatively easy to understand and implement, which makes it a good place to begin exploring decision trees.
**Table of Contents**
Necessary Functions
Find Partition with Least Entropy
Building a Tree
Making Predictions
Importing necessary libraries¶
from __future__ import division
import math, random
from collections import Counter, defaultdict
from functools import partial
Necessary Functions¶
Here we will define necessary functions that we will use below.
This function will compute and return the entropy, given a list of class probabilities
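For reference, given class probabilities $p_1, \dots, p_n$ the entropy is $H = -\sum_i p_i \log_2 p_i$, with the convention that terms where $p_i = 0$ contribute nothing (which is why the code below skips them).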
def entropy(class_probabilities):
    """given a list of class probabilities, compute the entropy"""
    return sum(-p * math.log(p, 2) for p in class_probabilities if p)
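As a quick sanity check (these particular calls and values are illustrative, not from the original notebook):

print(entropy([1.0]))         # 0.0  -- a pure set has no uncertainty
print(entropy([0.5, 0.5]))    # 1.0  -- a 50/50 split is maximally uncertain
print(entropy([0.25, 0.75]))  # ~0.811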
class_probabilities¶
def class_probabilities(labels):
    total_count = len(labels)
    return [count / total_count
            for count in Counter(labels).values()]
data_entropy¶
def data_entropy(labeled_data):
    labels = [label for _, label in labeled_data]
    probabilities = class_probabilities(labels)
    return entropy(probabilities)
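For example (a made-up toy list, not part of the original notebook), two $True$ labels and one $False$ give probabilities of roughly 2/3 and 1/3 and an entropy of about 0.918:

sample = [({'x': 1}, True), ({'x': 2}, True), ({'x': 3}, False)]
print(class_probabilities([label for _, label in sample]))  # [0.666..., 0.333...] (order follows Counter)
print(data_entropy(sample))                                 # ~0.918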
partition_entropy¶
Computes the entropy of a partition of the data into subsets: each subset's entropy, weighted by the fraction of the data that falls in that subset.
def partition_entropy(subsets):
    """find the entropy from this partition of data into subsets"""
    total_count = sum(len(subset) for subset in subsets)
    return sum(data_entropy(subset) * len(subset) / total_count
               for subset in subsets)
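In symbols, for a partition of $S$ into subsets $S_1, \dots, S_m$ this is $H = \sum_i \frac{|S_i|}{|S|} H(S_i)$. A small illustrative check (toy data, not from the original notebook): a partition whose subsets are each pure has zero entropy, while a single 50/50 subset has entropy 1.

pure  = [[({'a': 1}, True)],
         [({'a': 2}, False), ({'a': 3}, False)]]
mixed = [[({'a': 1}, True), ({'a': 2}, False)]]
print(partition_entropy(pure))   # 0.0
print(partition_entropy(mixed))  # 1.0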
def group_by(items, key_fn):
    """returns a defaultdict(list), where each input item
    is in the list whose key is key_fn(item)"""
    groups = defaultdict(list)
    for item in items:
        key = key_fn(item)
        groups[key].append(item)
    return groups
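For instance (an illustrative call, not from the original notebook), grouping integers by parity:

print(group_by([1, 2, 3, 4, 5], lambda x: x % 2))
# defaultdict(<class 'list'>, {1: [1, 3, 5], 0: [2, 4]})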
partition_by¶
def partition_by(inputs, attribute):
    """returns a dict of inputs partitioned by the attribute;
    each input is a pair (attribute_dict, label)"""
    return group_by(inputs, lambda x: x[0][attribute])
partition_entropy_by¶
Uses partition_by() to split the inputs by the given attribute, then computes and returns the entropy of that partition via partition_entropy().
def partition_entropy_by(inputs, attribute):
    """computes the entropy corresponding to the given partition"""
    partitions = partition_by(inputs, attribute)
    return partition_entropy(partitions.values())
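A toy illustration (made-up inputs, not from the original notebook): two Python candidates with different labels make 'lang' an uninformative split here, so the partition entropy stays fairly high.

toy = [({'lang': 'Python'}, True),
       ({'lang': 'Python'}, False),
       ({'lang': 'Java'},   False)]
print(dict(partition_by(toy, 'lang')))
# {'Python': [({'lang': 'Python'}, True), ({'lang': 'Python'}, False)],
#  'Java': [({'lang': 'Java'}, False)]}
print(partition_entropy_by(toy, 'lang'))  # ~0.667: the Python subset is still a 50/50 mix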
def classify(tree, input):
    """classify the input using the given decision tree"""
    # if this is a leaf node, return its value
    if tree in [True, False]:
        return tree

    # otherwise find the correct subtree
    attribute, subtree_dict = tree

    subtree_key = input.get(attribute)   # None if input is missing attribute

    if subtree_key not in subtree_dict:  # if no subtree for key,
        subtree_key = None               # we'll use the None subtree

    subtree = subtree_dict[subtree_key]  # choose the appropriate subtree
    return classify(subtree, input)      # and use it to classify the input
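To see the calling convention (a hand-written toy tree, not from the original notebook): one decision node that splits on 'tweets', with a None branch as the fallback for unseen or missing values.

tiny_tree = ('tweets', {'yes': True, 'no': False, None: False})
print(classify(tiny_tree, {'tweets': 'yes'}))  # True
print(classify(tiny_tree, {'tweets': 'no'}))   # False
print(classify(tiny_tree, {}))                 # False -- a missing attribute falls back to the None branch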
build_tree_id3¶
def build_tree_id3(inputs, split_candidates=None):

    # if this is our first pass,
    # all keys of the first input are split candidates
    if split_candidates is None:
        split_candidates = inputs[0][0].keys()

    # count Trues and Falses in the inputs
    num_inputs = len(inputs)
    num_trues = len([label for item, label in inputs if label])
    num_falses = num_inputs - num_trues

    if num_trues == 0:                  # if only Falses are left
        return False                    # return a "False" leaf

    if num_falses == 0:                 # if only Trues are left
        return True                     # return a "True" leaf

    if not split_candidates:            # if no split candidates left
        return num_trues >= num_falses  # return the majority leaf

    # otherwise, split on the best attribute
    best_attribute = min(split_candidates,
                         key=partial(partition_entropy_by, inputs))

    partitions = partition_by(inputs, best_attribute)
    new_candidates = [a for a in split_candidates
                      if a != best_attribute]

    # recursively build the subtrees
    subtrees = { attribute_value : build_tree_id3(subset, new_candidates)
                 for attribute_value, subset in partitions.items() }

    subtrees[None] = num_trues > num_falses  # default case

    return (best_attribute, subtrees)
forest_classify¶
def forest_classify(trees, input):
    votes = [classify(tree, input) for tree in trees]
    vote_counts = Counter(votes)
    return vote_counts.most_common(1)[0][0]
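forest_classify is defined here but never exercised later in the notebook. One way it could be used (a minimal sketch; build_forest and its bootstrap-sampling strategy are assumptions of this example, not part of the original text) is to train several trees on bootstrap samples of the data and let them vote:

def build_forest(inputs, num_trees=5):
    """train num_trees ID3 trees, each on a bootstrap sample of the inputs"""
    forest = []
    for _ in range(num_trees):
        sample = [random.choice(inputs) for _ in inputs]  # sample with replacement
        forest.append(build_tree_id3(sample))
    return forest

Once the data set below is defined, something like forest_classify(build_forest(data), candidate) would return the majority vote of the individual trees.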
Generate Data¶
data = [
    ({'level':'Senior', 'lang':'Java',   'tweets':'no',  'phd':'no'},  False),
    ({'level':'Senior', 'lang':'Java',   'tweets':'no',  'phd':'yes'}, False),
    ({'level':'Mid',    'lang':'Python', 'tweets':'no',  'phd':'no'},  True),
    ({'level':'Junior', 'lang':'Python', 'tweets':'no',  'phd':'no'},  True),
    ({'level':'Junior', 'lang':'R',      'tweets':'yes', 'phd':'no'},  True),
    ({'level':'Junior', 'lang':'R',      'tweets':'yes', 'phd':'yes'}, False),
    ({'level':'Mid',    'lang':'R',      'tweets':'yes', 'phd':'yes'}, True),
    ({'level':'Senior', 'lang':'Python', 'tweets':'no',  'phd':'no'},  False),
    ({'level':'Senior', 'lang':'R',      'tweets':'yes', 'phd':'no'},  True),
    ({'level':'Junior', 'lang':'Python', 'tweets':'yes', 'phd':'no'},  True),
    ({'level':'Senior', 'lang':'Python', 'tweets':'yes', 'phd':'yes'}, True),
    ({'level':'Mid',    'lang':'Python', 'tweets':'no',  'phd':'yes'}, True),
    ({'level':'Mid',    'lang':'Java',   'tweets':'yes', 'phd':'no'},  True),
    ({'level':'Junior', 'lang':'Python', 'tweets':'no',  'phd':'yes'}, False)
]
data
[({'level': 'Senior', 'lang': 'Java', 'tweets': 'no', 'phd': 'no'}, False),
 ({'level': 'Senior', 'lang': 'Java', 'tweets': 'no', 'phd': 'yes'}, False),
 ({'level': 'Mid', 'lang': 'Python', 'tweets': 'no', 'phd': 'no'}, True),
 ({'level': 'Junior', 'lang': 'Python', 'tweets': 'no', 'phd': 'no'}, True),
 ({'level': 'Junior', 'lang': 'R', 'tweets': 'yes', 'phd': 'no'}, True),
 ({'level': 'Junior', 'lang': 'R', 'tweets': 'yes', 'phd': 'yes'}, False),
 ({'level': 'Mid', 'lang': 'R', 'tweets': 'yes', 'phd': 'yes'}, True),
 ({'level': 'Senior', 'lang': 'Python', 'tweets': 'no', 'phd': 'no'}, False),
 ({'level': 'Senior', 'lang': 'R', 'tweets': 'yes', 'phd': 'no'}, True),
 ({'level': 'Junior', 'lang': 'Python', 'tweets': 'yes', 'phd': 'no'}, True),
 ({'level': 'Senior', 'lang': 'Python', 'tweets': 'yes', 'phd': 'yes'}, True),
 ({'level': 'Mid', 'lang': 'Python', 'tweets': 'no', 'phd': 'yes'}, True),
 ({'level': 'Mid', 'lang': 'Java', 'tweets': 'yes', 'phd': 'no'}, True),
 ({'level': 'Junior', 'lang': 'Python', 'tweets': 'no', 'phd': 'yes'}, False)]
Let’s manually go through these steps on the interviewee data set. The data set has both True and False labels, and we have four attributes we can split on: lang, level, phd, and tweets.
remaining_attrs = [key for key in data[0][0].keys()]
remaining_attrs
[‘level’, ‘lang’, ‘tweets’, ‘phd’]
Let’s also generate a dictionary containing all the unique elements for each attribute:
valsDict = {}
for key in data[0][0].keys():
    vals = []
    for d in data:
        vals.append(d[0][key])
    vals = list(set(vals))
    valsDict[key] = vals

for k, v in valsDict.items():
    print(k, v)
level [‘Junior’, ‘Senior’, ‘Mid’]
lang [‘Python’, ‘Java’, ‘R’]
tweets [‘yes’, ‘no’]
phd [‘yes’, ‘no’]
Find Partition with Least Entropy¶
So our first step will be to find the partition with the least entropy. Function partition_by() does the partitioning and function partition_entropy_by() uses the partitioning to compute entropy.
min_entropy = 1
split_attr = ''
for attribute in remaining_attrs:
    attr_entropy = partition_entropy_by(data, attribute)
    print(attribute, attr_entropy)
    if attr_entropy < min_entropy:
        split_attr = attribute
        min_entropy = attr_entropy

print('\nAttribute to split on:', split_attr)
level 0.6935361388961919
lang 0.8601317128547441
tweets 0.7884504573082896
phd 0.8921589282623617
Attribute to split on: level
We want to split on the attribute that minimizes entropy, in this case level. We now need to make a subtree for each possible level value.
First we will drop level as a potential attribute to split on:
remaining_attrs.remove(split_attr)
print("Remaining Attributes:", remaining_attrs)
Remaining Attributes: ['lang', 'tweets', 'phd']
Then we will compute the score distribution for each level type:
print("Distribution of {} attribute:\n".format(split_attr))
for val in valsDict[split_attr]:
for d in data:
if d[0][split_attr] == val:
print("{} (T:F): {}:{}".format(val, T_ctr, F_ctr))
Distribution of level attribute:
Junior (T:F): 3:2
Senior (T:F): 2:3
Mid (T:F): 4:0
Every $Mid$ level candidate is labelled $True$, which means that the $Mid$ subtree is simply a leaf node predicting $True$. For $Senior$ and $Junior$ candidates, we have a mix of $True$ and $False$, so we need another attribute to split on.
senior_inputs = [(d, label) for d, label in data if d["level"] == "Senior"]

min_entropy = 1
lvl_senior_split_attr = ''
print('level == Senior\n')
for attribute in remaining_attrs:
    attr_entropy = partition_entropy_by(senior_inputs, attribute)
    print(attribute, attr_entropy)
    if attr_entropy < min_entropy:
        lvl_senior_split_attr = attribute
        min_entropy = attr_entropy

print('\nAttribute to split on:', lvl_senior_split_attr)
level == Senior
lang 0.4
tweets 0.0
phd 0.9509775004326938
Attribute to split on: tweets
This shows us that for $Senior$ candidates our next split should be on tweets, which yields a zero-entropy partition.
print("Distribution of {} attribute with level = {}:\n".format(lvl_senior_split_attr, 'Senior'))
for val in valsDict[lvl_senior_split_attr]:
for d in data:
if d[0]['level'] == 'Senior':
if d[0][lvl_senior_split_attr] == val:
print("{} (T:F): {}:{}".format(val, T_ctr, F_ctr))
Distribution of tweets attribute with level = Senior:
yes (T:F): 2:0
no (T:F): 0:3
From here, we see that for $Senior$ level candidates, tweets = $yes$ always results in $True$ while tweets = $no$ always results in $False$.
Now we do similarly with the $Junior$ level:
junior_inputs = [(d, label) for d, label in data if d["level"] == "Junior"]

min_entropy = 1
lvl_junior_split_attr = ''
print('level == Junior\n')
for attribute in remaining_attrs:
    attr_entropy = partition_entropy_by(junior_inputs, attribute)
    print(attribute, attr_entropy)
    if attr_entropy < min_entropy:
        lvl_junior_split_attr = attribute
        min_entropy = attr_entropy

print('\nAttribute to split on:', lvl_junior_split_attr)
level == Junior
lang 0.9509775004326938
tweets 0.9509775004326938
phd 0.0
Attribute to split on: phd
This shows us that for $Junior$ candidates our next split should be on phd, which again yields a zero-entropy partition.
print("Distribution of {} attribute with level = {}:\n".format(lvl_junior_split_attr, 'Junior'))
for val in valsDict[lvl_junior_split_attr]:
for d in data:
if d[0]['level'] == 'Junior':
if d[0][lvl_junior_split_attr] == val:
print("{} (T:F): {}:{}".format(val, T_ctr, F_ctr))
Distribution of phd attribute with level = Junior:
yes (T:F): 0:2
no (T:F): 3:0
From here, we see that for $Junior$ level candidates, phd = $yes$ always results in $False$ while phd = $no$ always results in $True$.
This routine continues recursively until every branch ends in a leaf node.
Building a Tree¶
Now that we’ve seen how the algorithm works, we would like to implement it more generally. This means we need to decide how we want to represent trees. We’ll use pretty much the most lightweight representation possible. We define a tree to be one of the following:
True
False
a tuple (attribute, subtree_dict)
Here $True$ represents a leaf node that returns True for any input, $False$ represents a leaf node that returns False for any input, and a tuple represents a decision node that, for any input, finds its attribute value, and classifies the input using the corresponding subtree.
There’s still the question of what to do if we encounter an unexpected (or missing) attribute value. What should our hiring tree do if it encounters a candidate whose level is “Intern”? We’ll handle this case by adding a None key that just predicts the most common label.
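To make this concrete, here is the hiring tree we derived by hand above, written out as a literal in this representation (a sketch assembled from the manual walkthrough; the None entries are the most-common-label defaults just described):

hand_built_tree = ('level', {
    'Senior': ('tweets', {'yes': True, 'no': False, None: False}),
    'Mid':    True,
    'Junior': ('phd',    {'no': True, 'yes': False, None: True}),
    None:     True
})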
All that’s left is to build the tree representation from our training data. Our hiring tree would look like:
print ("---Building a Tree---\n")
tree = build_tree_id3(data)
lvl1, subtree1 = tree
print (lvl1,':')
for k,v in subtree1.items():
print(' {}'.format(k))
if type(v) != tuple:
print (' {}'.format(v))
lvl2, subtree2 = v
print(' {} :'.format(lvl2))
for k,v in subtree2.items():
print(' {}'.format(k))
if type(v) != tuple:
print (' {}'.format(v))
---Building a Tree---

level :
  Senior
    tweets :
      no
        False
      yes
        True
      None
        False
  Mid
    True
  Junior
    phd :
      no
        True
      yes
        False
      None
        True
  None
    True
In the tree we built, every leaf consisted entirely of $True$ inputs or entirely of $False$ inputs. This means that the tree predicts perfectly on the training data set. But we can also apply it to new data that wasn’t in the training set to classify a new input:
Making Predictions¶
We will now use our built tree to make predictions on the following examples.
Example 1:¶
level: Junior
lang: Java
tweets: Yes
phd: No
'level':'Junior',
'lang':'Java',
'tweets':'yes',
'phd':'no'
print(ex1)
label = classify(tree,ex1)
print ('Label:',label)
{'level': 'Junior', 'lang': 'Java', 'tweets': 'yes', 'phd': 'no'}
Label: True
Example 2:¶
level: Junior
lang: Java
tweets: Yes
phd: Yes
'level':'Junior',
'lang':'Java',
'tweets':'yes',
'phd':'yes'
print(ex2)
label = classify(tree,ex2)
print ('Label:',label)
{'level': 'Junior', 'lang': 'Java', 'tweets': 'yes', 'phd': 'yes'}
Label: False
Example 3: Unexpected/missing data¶
level: Intern
tweets: No
'level':'Intern',
'tweets':'no',
print(ex3)
label = classify(tree,ex3)
print ('Label:',label)
{'level': 'Intern', 'tweets': 'no'}
Label: True
Example 4: Unexpected/missing data¶
level: Senior
tweets: No
'level':'Senior',
'tweets':'no',
print(ex4)
label = classify(tree,ex4)
print ('Label:',label)
{'level': 'Senior', 'tweets': 'no'}
Label: False
Note: Since our goal was mainly to demonstrate how to build a tree, we built the tree using the entire data set. As always, if we were really trying to create a good model for something, we would have (collected more data and) split the data into train/validation/test subsets.
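For completeness, here is a minimal sketch of that kind of split (the split_data helper and the 70/30 proportion are illustrative assumptions, not part of the original notebook):

def split_data(pairs, train_frac=0.7):
    """shuffle a copy of the (input, label) pairs and cut it into train and test sets"""
    shuffled = pairs[:]                    # copy so we don't reorder the caller's list
    random.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# train, test = split_data(data)
# tree = build_tree_id3(train)
# accuracy = sum(classify(tree, d) == label for d, label in test) / len(test)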