decision_trees
Decision Trees for Classification¶
In this problem, you will implement decision trees for classification on the spam dataset to determine whether or not an email is spam. The data is with the assignment.
Copyright By PowCoder代写 加微信 powcoder
Setup: Imports¶
We start with importing the relevant packages
from collections import Counter
import numpy as np
from numpy import genfromtxt
import matplotlib.pyplot as plt
import scipy.io
from scipy import stats
from sklearn.model_selection import train_test_split
import random
random.seed(246810)
np.random.seed(246810)
Setup: Data Preprocessing¶
The code to preprocess the data and evaluate the model have been provided for you. While we have provided the code for you, it is important that you read through the code and understand it yourself.
def preprocess(data, fill_mode=True, min_freq=10, onehot_cols=[]):
# fill_mode = False
# Temporarily assign -1 to missing data
data[data == b”] = ‘-1′
# Hash the columns (used for handling strings)
onehot_encoding = []
onehot_features = []
for col in onehot_cols:
counter = Counter(data[:, col])
for term in counter.most_common():
if term[0] == b’-1′:
if term[-1] <= min_freq:
onehot_features.append(term[0])
onehot_encoding.append((data[:, col] == term[0]).astype(float))
data[:, col] = '0'
onehot_encoding = np.array(onehot_encoding).T
data = np.hstack(
[np.array(data, dtype=float),
np.array(onehot_encoding)])
# Replace missing data with the mode value. We use the mode instead of
# the mean or median because this makes more sense for categorical
# features such as gender or cabin type, which are not ordered.
if fill_mode:
for i in range(data.shape[-1]):
mode = stats.mode(data[((data[:, i] < -1 - eps) +
(data[:, i] > -1 + eps))][:, i]).mode[0]
data[(data[:, i] > -1 – eps) *
(data[:, i] < -1 + eps)][:, i] = mode
return data, onehot_features
def evaluate(dtree, X, y, folds=5):
print("Cross Validation:")
train_accuracies = []
val_accuracies = []
for i in range(folds):
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=i)
dtree.fit(X_train, y_train)
train_preds = dtree.predict(X_train)
assert(train_preds.shape == y_train.shape)
train_accuracy = np.sum(train_preds == y_train) / y_train.shape[0]
train_accuracies.append(train_accuracy)
val_preds = dtree.predict(X_val)
assert(val_preds.shape == y_val.shape)
val_accuracy = np.sum(val_preds == y_val) / y_val.shape[0]
val_accuracies.append(val_accuracy)
avg_train_accuracy = np.mean(train_accuracies)
avg_val_accuracy = np.mean(val_accuracies)
print('averaged train accuracy:', avg_train_accuracy)
print('averaged validation accuracy:', avg_val_accuracy)
return avg_train_accuracy, avg_val_accuracy
eps = 1e-5 # a small number
dataset = "spam"
if dataset == "spam":
features = [
"pain", "private", "bank", "money", "drug", "spam", "prescription",
"creative", "height", "featured", "differ", "width", "other",
"energy", "business", "message", "volumes", "revision", "path",
"meter", "memo", "planning", "pleased", "record", "out",
"semicolon", "dollar", "sharp", "exclamation", "parenthesis",
"square_bracket", "ampersand"
assert len(features) == 32
# Load spam data
path_train = './datasets/spam_data/spam_data.mat'
data = scipy.io.loadmat(path_train)
X = data['training_data']
y = np.squeeze(data['training_labels'])
Z = data['test_data']
class_names = ["Ham", "Spam"]
raise NotImplementedError("Dataset %s not handled" % dataset)
print("Features", features)
print("Train/test size", X.shape, Z.shape)
1) Decision Tree Implementation¶
Below, we have provided skeleton code for you to implement a basic decision tree. Feel free to use the skeleton code provided, or write the DecisionTree class from scratch if you find that easier. However, you are not allowed to use any off-the-shelf decision tree implementation. For instance, you cannot use scipy's entropy function or sklearn's DecisionTreeClassifier. You should only need numpy for your decision tree implementation.
The fit() function should follow the following logic:
1) Check to make sure that the max_depth of the tree is not reached
2) If max_depth is reached, then assign this leaf node self.pred and self.leaf_samples
3) If max_depth is not reached, then:
a) Find the best feature and corresponding threshold to split the data on
b) Split the data based on that best feature and threshold
c) Repeat this for the left and right subtree
eps = 1e-5 # a small number
class DecisionTree:
def __init__(self, max_depth=3, features=None, min_samples_split=2):
This decision tree data structure is a binary tree that is written recursively. In other words, it has a
left and right branch that themselves are decision trees with the same attributes and methods.
Attributes:
- self.max_depth (int): Maximum depth of the tree
- self.features (List of strings): features that are used to make splits in the decision tree (Basically only used in the __repr__ function)
- self.left (DecisionTree): Left subtree of this decision tree
- self.right (DecisionTree): Right subtree of this decision tree
- self.split_feature (int): The index that corresponds to the feature in self.features that was used to
split this Decision Tree into its left and right branches
- self.thresh (int): If the value of the split feature is less than self.thresh, the data point will
go to the left subtree. Otherwise, it will go to the right subtree
- self.leaf_samples (int): The number of samples that are classified at a leaf node
- self.pred (int): The prediction made at a leaf node. Only assigned at a leaf node.
- self.min_samples_split (int): The minimum number of samples in each child node. If a split results in a
child node which contains less than min_samples_split samples, then don’t perform
the split and instead make the current node a leaf node.
self.max_depth = max_depth
self.features = features
self.left, self.right = None, None # Attributes only for non-leaf nodes
self.split_feature, self.thresh = None, None # Attributes only for non-leaf nodes
self.leaf_samples, self.pred = None, None # Attributes only for leaf nodes
self.min_samples_split = min_samples_split
@staticmethod
def entropy(y):
Calculate the entropy of the tree at the current node. Remember to take care of the special case where we
must handle log(0), in which case the entropy should be 0!
- y: n x 1 vector of class labels for each data point (either 0 or 1)
- Entropy: scalar value between 0 and 1
# YOUR CODE HERE ###
@staticmethod
def information_gain(X_feat, y, thresh):
Calculate the information gain from splitting the data based on a specific feature at a specific threshold
- X_feat: n x 1 vector representing a column of the X data matrix (a single feature)
- y: n x 1 vector of class labels for each data point (either 0 or 1)
- thresh: The threshold scalar value to split the feature on
- Information Gain: Scalar value between 0 and 1
# YOUR CODE HERE ###
def feature_split(self, X, y, feature, thresh):
Split the data into two halves based on a specific feature at a specific threshold
- X: n x d matrix
- y: n x 1 vector of class labels for each data point (either 0 or 1)
- feature: An index in the range [0, d-1] that represents a specific feature in the data matrix
- thresh: The threshold scalar value to split the feature on
- X0: k x d matrix representing the k data points that are split into the left subtree
- y0: k x 1 vector representing the k class labels that are split into the left subtree
- X1: (n - k) x d matrix representing the n - k data points that are split into the right subtree
- y1: (n - k) x 1 vector representing the n - k class labels that are split into the right subtree
### YOUR CODE HERE ###
def find_feature_thresh(self, X, y, feature):
Given the data and the feature, find the best threshold for the feature split. Choose the threshold
from 10 evenly spaced values between the (min_value + eps) and (max_value - eps) for the feature.
We need to include the +/- eps for min and max values because without it, there is a chance that the
training algorithm will split the data on the min or max value, which is not a useful split
Hint: You may find np.linspace helpful
- X: n x d matrix
- y: n x 1 vector of class labels for each data point (either 0 or 1)
- feature: An index in the range [0, d-1] that represents a specific feature in the data matrix
- max_ig: The largest information gain that is attained
- best_thresh: The best threshold value that gives us max_ig
### YOUR CODE HERE ###
def find_best_feature_split(self, X, y):
Find the best feature and threshold to split on
- X: n x d matrix
- y: n x 1 vector of class labels for each data point (either 0 or 1)
- best_feature: An index in the range [0, d-1] that represents the best feature in the data matrix to split on
- best_thresh: The best threshold value for best_feature
### YOUR CODE HERE ###
def fit(self, X, y):
Fit the decision tree
- X: n x d matrix
- y: n x 1 vector of class labels for each data point (either 0 or 1)
### YOUR CODE HERE ###
def predict_split(self, X, idx, thresh):
idx0 = np.where(X[:, idx] < thresh)[0]
idx1 = np.where(X[:, idx] >= thresh)[0]
X0, X1 = X[idx0, :], X[idx1, :]
return X0, idx0, X1, idx1
def predict(self, X):
if self.max_depth == 0:
return self.pred * np.ones(X.shape[0])
X0, idx0, X1, idx1 = self.predict_split(
X, idx=self.split_feature, thresh=self.thresh)
yhat = np.zeros(X.shape[0])
yhat[idx0] = self.left.predict(X0)
yhat[idx1] = self.right.predict(X1)
return yhat
def __repr__(self):
if self.max_depth == 0:
return “[Leaf: %s (%s)]” % (self.pred, self.leaf_samples)
return “[%s < %s: %s | %s]" % (self.features[self.split_feature],
self.thresh, self.left.__repr__(),
self.right.__repr__())
Testing your Decision Tree¶
Now, we will evaluate your decision tree using cross validation with 5 folds. Your decision tree should have a train/validation accuracy that is roughly 76% or higher. We have also included some test cases to ensure that you correctly implemented entropy.
print('==================================================')
print("Entropy Test Cases")
test1 = np.array([1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0,
1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0,
0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0,
0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1,
1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
test2 = np.array([1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0,
0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1,
0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1,
1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1,
0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1])
test3 = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0,
1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0,
1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0,
0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0])
test4 = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
''' The first values in the assertions correspond to entropy computed using natural log.
The second values correspond to entropy computed using log base 2'''
assert(DecisionTree.entropy(test1) == 0.6913460990017393 or DecisionTree.entropy(test1) == 0.9974015885677396)
print("Test 1 Passed")
assert(DecisionTree.entropy(test2) == 0.6859298002523728 or DecisionTree.entropy(test2) == 0.9895875212220556)
print("Test 2 Passed")
assert(DecisionTree.entropy(test3) == 0.6881388137135884 or DecisionTree.entropy(test3) == 0.9927744539878083)
print("Test 3 Passed")
assert(DecisionTree.entropy(test4) == 0)
print("Test 4 Passed")
assert(DecisionTree.entropy(test4 + 1) == 0)
print("Test 5 Passed")
print("All Test Cases Passed")
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
# Basic decision tree
print('==================================================')
print("\n\nYour decision tree")
dt = DecisionTree(max_depth=3, features=features)
dt.fit(X_train, y_train)
print("Predictions", dt.predict(Z)[:100])
print("Tree structure", dt.__repr__())
print('==================================================')
print("\n\nCross Validation on your decision tree")
evaluate(dt, X, y)
2) Tree Depth vs. Performance¶
Now, we will examine how the depth of our decision tree affects the tree's performance. Train decision trees with depth ranging from 1 to 40. Plot depth on the x axis and train/validation accuracy on the y axis. Comment on your findings in your writeup. Make sure to attach this code to your homework so that we can see your graph!
train_accuracy_depth = []
validation_accuracy_depth = []
for depth in range(1, 40):
# ### YOUR CODE HERE ###
# Plot the results here
# Study depth and plot
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com