
Data Science Workshop: Week 7
This week we will learn about trees and forests for regression and classification. We will start with regression trees. Please download all files from Blackboard before starting the notebook. Also, execute each code cell in the correct order.

Please read over the whole notebook. It contains several exercises that you have to complete.

Regression Trees

We start with a rather simple regression task. We have to learn a continuous function with a 1-dimensional input.

We first load the training data into pandas data frames and plot the training points and the
ground-truth function. The ground-truth values are stored in the test data set in order to evaluate the quality
of our fit.

In [ ]:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# pd.DataFrame.from_csv is deprecated; read_csv with index_col=0 is the equivalent
data_train = pd.read_csv('regression_train.csv', index_col=0)
data_test = pd.read_csv('regression_test.csv', index_col=0)

In [ ]:

# .as_matrix() is deprecated; .values returns the underlying numpy arrays
x_train = data_train['x'].values
y_train = data_train['y'].values

x_test = data_test['x'].values
y_test = data_test['y'].values

x_train = x_train.reshape(-1, 1)

x_test = x_test.reshape(-1, 1)

x_train.shape, y_train.shape

In [ ]:

## plot the data
import matplotlib.pyplot as plt

plt.figure()
plt.clf()
plt.plot(x_train, y_train, 'bo')
plt.plot(x_test, y_test, 'g')
plt.legend(('training points', 'ground truth'))
plt.savefig('trainingdata.png')
plt.show()

Training a Regression Tree

We will use the sklearn package to train our regression trees. sklearn is a generic machine learning library
that offers a lot of learning algorithms. A regression tree can be generated by:

In [ ]:

from sklearn import tree

regTree = tree.DecisionTreeRegressor(min_samples_leaf=1, max_depth=None)

We can set the minimum number of samples per leaf and the maximum depth of the tree, as shown above. The tree can be trained by:

In [ ]:

regTree = regTree.fit(x_train, y_train)
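
After fitting, we can also take a quick look at how large the resulting tree is. This is an optional sanity check (not part of the exercises); get_depth() and get_n_leaves() are available in recent versions of sklearn.

In [ ]:

# optional: inspect the depth and the number of leaves of the fitted tree
regTree.get_depth(), regTree.get_n_leaves()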

We can use the trained tree for prediction by:

In [ ]:

y_predict = regTree.predict(x_test)

Exercise 1: Training the Tree
In this exercise you are supposed to train the tree for our regression task. Train a tree with
min_samples_leaf set to 1, 5 and 10, predict the output for x_test, and plot the predicted
function values. Do you see a difference in the functions? Which value of min_samples_leaf would you use?

In [ ]:

from sklearn import tree

# predict for min_samples_leaf = 1
# Put your code here
# y_predict1 = …

# predict for min_samples_leaf = 5
# Put your code here
# y_predict2 = …

# predict for min_samples_leaf = 10
# Put your code here
# y_predict3 = …

plt.figure()
plt.clf()
plt.plot(x_train, y_train, 'bo')
plt.plot(x_test, y_test, 'g')
plt.plot(x_test, y_predict1, 'r', linewidth=2.0)
plt.savefig('regressiontrees1.png')
plt.title('min_samples = 1')

plt.figure()
plt.clf()
plt.plot(x_train, y_train, 'bo')
plt.plot(x_test, y_test, 'g')
plt.plot(x_test, y_predict2, 'r', linewidth=2.0)
plt.savefig('regressiontrees5.png')
plt.title('min_samples = 5')

plt.figure()
plt.clf()
plt.plot(x_train, y_train, 'bo')
plt.plot(x_test, y_test, 'g')
plt.plot(x_test, y_predict3, 'r', linewidth=2.0)
plt.title('min_samples = 10')

plt.savefig('regressiontrees10.png')
plt.show()

Evaluating the trees
We can also use sklearn to evaluate the tree. There are different metrics that we can use. We will use the
mean squared error criterion to evaluate the trees. We can compute the MSE on the test data:

In [ ]:

import sklearn.metrics as metrics

mseTestTree1 = metrics.mean_squared_error(y_test, y_predict1)
mseTestTree2 = metrics.mean_squared_error(y_test, y_predict2)
mseTestTree3 = metrics.mean_squared_error(y_test, y_predict3)

mseTestTree1, mseTestTree2, mseTestTree3

We can see that trees with min_samples_leaf = 1 have zero error on the training set, as the training set is simply memorized.
However, the error on the test set is the one that really counts. Here, min_samples_leaf = 1 also performs best, but
maybe we can find better settings of min_samples_leaf?
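
To double-check the claim about the training error, we can refit a tree with min_samples_leaf = 1 and compute its MSE on the training data. This is a small optional sketch that only uses objects already imported above:

In [ ]:

# optional: a fully grown tree (min_samples_leaf=1) should reach (close to) zero training error
tree1 = tree.DecisionTreeRegressor(min_samples_leaf=1).fit(x_train, y_train)
metrics.mean_squared_error(y_train, tree1.predict(x_train))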

Exercise 2: Finding the best value for min_samples_leaf
We want to find the best value for min_samples_leaf. Evaluate the mean squared error (MSE) criterion for
min_samples_leaf = [1, 2, 3, 5, 7, 10] on the training set and on the test set. Plot the MSE for both sets as
a function of min_samples_leaf. Which min_samples_leaf would you pick?

In [ ]:

from sklearn.tree import DecisionTreeRegressor
import sklearn.metrics

minSamples = [1,2,3,5,7,10]
train_accuracy = np.zeros((len(minSamples),1))
test_accuracy = np.zeros((len(minSamples),1))

for i in range(0, len(minSamples)):

    min_samples_leaf = minSamples[i]
    # Put your code here for creating and training the tree with min_samples_leaf
    # clf = ...

    # train_accuracy[i] = ...  # put your code here
    # test_accuracy[i] = ...   # put your code here

plt.figure()
plt.plot(minSamples, train_accuracy, 'b')
plt.plot(minSamples, test_accuracy, 'g')
plt.show()

Regression Forests

We will again use the sklearn package. A regression forest can be generated by:

In [ ]:

from sklearn import ensemble

regForest = ensemble.RandomForestRegressor(n_estimators = 10, min_samples_leaf=2)

We can set the same properties as for a single tree (min_samples_leaf, max_depth). In addition, we can set the number of trees that are used (n_estimators).

In [ ]:

regForest = regForest.fit(x_train, y_train)

We can use the trained forest for prediction by:

In [ ]:

y_predict = regForest.predict(x_test)
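
For regression, the forest prediction is just the average of the individual trees' predictions. As a small illustration (assuming the fitted regForest and y_predict from the cells above), we can reproduce the forest output from its estimators_ attribute:

In [ ]:

# the forest prediction equals the mean over the predictions of its individual trees
y_mean_trees = np.mean([t.predict(x_test) for t in regForest.estimators_], axis=0)
np.allclose(y_mean_trees, y_predict)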

Exercise 3: Training a regression forest
In this exercise you are supposed to train a forest. Train a forest with
n_estimators set to 1, 10 and 50, predict the output for x_test, and plot the predicted
function values. Do you see a difference in the functions? Which value of n_estimators would you use? Use
the optimal value of min_samples_leaf that we estimated in exercise 2.

In [ ]:

from sklearn import ensemble

# predict for n_estimators = 1
# Put your code here
# y_predict1 = ...

# predict for n_estimators = 10
# Put your code here
# y_predict2 = ...

# predict for n_estimators = 50
# Put your code here
# y_predict3 = ...

plt.figure()
plt.clf()
plt.plot(x_train, y_train, 'bo')
plt.plot(x_test, y_test, 'g')
plt.plot(x_test, y_predict1, 'r', linewidth=2.0)
plt.savefig('regressionforest1.png')
plt.title('num_trees = 1')

plt.figure()
plt.clf()
plt.plot(x_train, y_train, 'bo')
plt.plot(x_test, y_test, 'g')
plt.plot(x_test, y_predict2, 'r', linewidth=2.0)
plt.savefig('regressionforest10.png')
plt.title('num_trees = 10')

plt.figure()
plt.clf()
plt.plot(x_train, y_train, 'bo')
plt.plot(x_test, y_test, 'g')
plt.plot(x_test, y_predict3, 'r', linewidth=2.0)
plt.title('num_trees = 50')

plt.savefig('regressionforest50.png')
plt.show()

Exercise 4: Evaluating the best number of trees
Now we want to know the best number of trees. Adapt the code from exercise 2 and evaluate the random forest with
n_estimators set to numTrees = [1, 3, 5, 7, 10, 15, 20]. What is the best number of trees? Repeat your experiment again. Do you come to the same conclusion? If not, why?

In [ ]:

from sklearn.ensemble import RandomForestRegressor
import sklearn.metrics

numTrees = [1, 3, 5, 7, 10, 15, 20]
train_accuracy = np.zeros((len(numTrees), 1))
test_accuracy = np.zeros((len(numTrees), 1))

for i in range(0, len(numTrees)):

    n_estimators = numTrees[i]
    # Put your code here for creating and training the forest with n_estimators set to numTrees[i]
    # clf = ...

    # train_accuracy[i] = ...  # put your code here
    # test_accuracy[i] = ...   # put your code here

plt.figure()
plt.plot(numTrees, train_accuracy, 'b')
plt.plot(numTrees, test_accuracy, 'g')
plt.show()

Exercise 5: Improving the evaluation
As the regression forests are randomized, the result varies at each execution. To get a more reliable result,
we can average the performance over multiple trials. Repeat the last experiment from exercise 4. However,
instead of evaluating just one regression forest, train 10 regression forests and compute the average
performance over all 10 training trials.

In [ ]:

from sklearn.ensemble import RandomForestRegressor

import sklearn.metrics

numTrees = [1,3,5,7,10, 15, 20]

numTrials = 10

train_accuracy = np.zeros((len(numTrees),1))
test_accuracy = np.zeros((len(numTrees),1))

train_accuracy_single = np.zeros((len(numTrees),numTrials))
test_accuracy_single = np.zeros((len(numTrees),numTrials))

for i in range(0, len(numTrees)):
    # Put your code here for creating the forest with n_estimators set to numTrees[i]
    # clf = ...

    train_accuracy[i] = 0
    test_accuracy[i] = 0
    for j in range(0, numTrials):

        # Put your code here for training the forest
        # clf...

        # train_accuracy_single[i, j] = ...  # put your code here
        # test_accuracy_single[i, j] = ...   # put your code here

    # compute mean over several trials
    train_accuracy[i] = np.mean(train_accuracy_single[i, :])
    test_accuracy[i] = np.mean(test_accuracy_single[i, :])

plt.figure()
plt.plot(numTrees, train_accuracy, 'b')
plt.plot(numTrees, test_accuracy, 'g')
plt.legend(('mse train', 'mse test'))

plt.show()

Compare the performance with the regression tree from exercise 2. Could we improve the performance by using multiple randomized trees?

Decision Trees
In contrast to regression trees, the output of a decision tree is a discrete class label, not
a continuous value. A decision tree can be used for any multi-class classification problem. We will use
decision trees on the iris data set, where we will only use the first two features in the training data as input
for illustration purposes.

In [ ]:

from sklearn import datasets
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np

iris = datasets.load_iris()

X = iris.data[:,:2]
Y = iris.target

x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5

plt.figure()
plt.clf()

# Plot the training points
plt.scatter(X[:, 0], X[:, 1], c=Y)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')

plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())

plt.show()

The iris data set contains 3 classes of iris flowers that should be classified according to their sepal width
and sepal length.
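
As a quick optional look at the data (not required for the exercises), we can print the class names and the number of samples per class; target_names is part of the iris bunch object and np.bincount counts the labels:

In [ ]:

# optional: the three iris classes and the number of samples in each class
iris.target_names, np.bincount(Y)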

We split the data set into 67% training data and 33% test data. I.e., we have 100 training and 50 test points.

In [ ]:

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=42)
X_train.shape, X_test.shape

Training a Decision Tree

We will again use the sklearn package. A decision tree can be generated by:

In [ ]:

from sklearn import tree

decTree = tree.DecisionTreeClassifier(min_samples_leaf=2, max_depth=None)

We can set the same properties as for a regression tree. Similarly, we can train the tree:

In [ ]:

decTree = decTree.fit(X_train, Y_train)

We can use the trained tree for prediction by:

In [ ]:

y_predict = decTree.predict(X_test)
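
Besides hard class labels, the trained classifier can also return class membership probabilities via predict_proba (for a tree these are the class fractions in the leaf a sample falls into). A small optional example:

In [ ]:

# optional: class probabilities for the first five test points
decTree.predict_proba(X_test[:5])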

Exercise 6: Plotting the decision boundary
In this exercise we want to train a decision tree for different values of min_samples_leaf. After training the tree,
we will plot the decision boundary of the learned tree with the existing Python code.

Plot the decision boundary for different values of min_samples_leaf. Can you observe a qualitative difference
between the learned classifiers?

In [ ]:

import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
import sklearn.metrics as metrics

n_classes = 3
plot_colors = "bry"
plot_step = 0.02

# Put your code here: Create the decision tree with min_samples_leaf = 5
# clf =

# Put your code here: train the tree
# clf

# create a grid for the two input dimensions
xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                     np.arange(y_min, y_max, plot_step))

gridData = np.c_[xx.ravel(), yy.ravel()]

# Put your code here: Predict using gridData as input
# Z = …

plt.figure()

Z = Z.reshape(xx.shape)
cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)

plt.scatter(X_train[:, 0], X_train[:, 1], c=Y_train, cmap=plt.cm.Paired)

plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.axis("tight")
plt.show()

Which value of min_samples_leaf would you choose?

Evaluating decision trees
For evaluating the decision tree, we can compute the fraction of correctly
classified samples on the test set. This metric is called accuracy_score in sklearn.

In [ ]:

from sklearn import metrics

train_accuracy = metrics.accuracy_score(Y_train, clf.predict(X_train))
test_accuracy = metrics.accuracy_score(Y_test, clf.predict(X_test))

train_accuracy, test_accuracy

Exercise 7: Finding the best min_samples_leaf value
For the values min_samples_leaf = [1, 2, 3, 5, 7, 10, 15, 20, 50], compute the accuracy_score on the training and on
the test set. Plot both accuracy scores as a function of min_samples_leaf.

In [ ]:

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

minSamples = [1, 2, 3, 5, 7, 10, 15, 20, 50]
train_accuracy = np.zeros((len(minSamples), 1))
test_accuracy = np.zeros((len(minSamples), 1))

for i in range(0, len(minSamples)):
    min_samples_leaf = minSamples[i]
    # Put your code here: Create the tree using min_samples_leaf
    # clf = ...

    # Put your code here: train the tree

    # train_accuracy[i] = ...  # put your code here
    # test_accuracy[i] = ...   # put your code here

plt.figure()
plt.plot(minSamples, train_accuracy, 'b')
plt.plot(minSamples, test_accuracy, 'g')
plt.show()

Training a Decision Forest

We will again use the sklearn package. A decision forest can be generated by:

In [ ]:

from sklearn import ensemble

decForest = ensemble.RandomForestClassifier(n_estimators=10, min_samples_leaf=2, max_depth=None)

We can set the same properties as for a decision tree (plus the number of trees via n_estimators). Similarly, we can train the forest:

In [ ]:

decForest = decForest.fit(X_train, Y_train)

We can use the trained forest for prediction by:

In [ ]:

y_predict = decForest.predict(X_test)

Exercise 8: Plotting the decision boundary for decision forests
In this exercise we want to train a decision forest for different values of n_estimators. After training the forest, we will plot the decision boundary of the learned forest with the existing Python code.

Plot the decision boundary for different values of n_estimators and use min_samples_leaf = 10. Can you observe a qualitative difference between the learned classifiers? Execute your code several times. Can you observe a difference between the executions? If yes, why?
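
Hint: the run-to-run differences come from the random bootstrap sampling and feature subsampling inside the forest. If you want reproducible results, sklearn estimators accept a random_state argument; a small optional sketch (clf_fixed is just an illustrative name, not needed for the exercise):

In [ ]:

# optional: fixing random_state makes repeated runs produce the same forest
from sklearn.ensemble import RandomForestClassifier
clf_fixed = RandomForestClassifier(n_estimators=10, min_samples_leaf=10, random_state=0)
clf_fixed = clf_fixed.fit(X_train, Y_train)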

In [ ]:

import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
import sklearn.metrics as metrics

n_classes = 3
plot_colors = "bry"
plot_step = 0.02

# Put your code here for creating the forest
# clf = ...

# Put your code here for training the forest
# …

# Generate grid data
xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                     np.arange(y_min, y_max, plot_step))

gridData = np.c_[xx.ravel(), yy.ravel()]
# Put your code here: Compute prediction using gridData as input
# Z = …

plt.figure()
Z = Z.reshape(xx.shape)

cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)

plt.scatter(X_train[:, 0], X_train[:, 1], c=Y_train, cmap=plt.cm.Paired)

plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.axis("tight")
plt.show()

Exercise 9: Finding the best n_estimators value
Test the algorithm for different numbers of trees, i.e., [1, 5, 10, 20, 40, 60, 100]. Repeat each experiment 10 times and average the performance values due to the randomness. Evaluate the average accuracy on the training and on the test set for the given numbers of trees (n_estimators). Use min_samples_leaf = 15.

In [ ]:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

numTrees = [1,5,10,20,40,60,100]

numTrials = 10
train_accuracy_single = np.zeros((len(numTrees),numTrials))
test_accuracy_single = np.zeros((len(numTrees),numTrials))

train_accuracy_mean = np.zeros((len(numTrees), 1))
test_accuracy_mean = np.zeros((len(numTrees), 1))

train_accuracy_std = np.zeros((len(numTrees), 1))
test_accuracy_std = np.zeros((len(numTrees), 1))

for i in range(0, len(numTrees)):

    # Put your code here for creating a forest with numTrees[i] trees
    # clf = ...

    for j in range(0, numTrials):
        # put your code here for training the forest...

        # train_accuracy_single[i, j] = ...  # put your code here
        # test_accuracy_single[i, j] = ...   # put your code here

    train_accuracy_mean[i] = np.mean(train_accuracy_single[i, :])
    train_accuracy_std[i] = np.std(train_accuracy_single[i, :])

    test_accuracy_mean[i] = np.mean(test_accuracy_single[i, :])
    test_accuracy_std[i] = np.std(test_accuracy_single[i, :])

plt.figure()
plt.plot(numTrees, train_accuracy_mean)
plt.plot(numTrees, test_accuracy_mean)

plt.show()

Compare the performance with the single decision tree from exercise 7. Could we improve the performance by using multiple randomized trees?