
COSC 2673/2793 | Machine Learning
Week 6 Lab Exercises: **Decision Trees**

Introduction
During the last couple of weeks we learned about the typical ML model development process. In this week's lab we will explore decision-tree-based models.
The lab assumes that you have completed the labs for weeks 2-5. If you haven't yet, please do so before attempting this lab.
The lab can be executed on either your own machine (with an Anaconda installation) or on the AWS Educate classroom setup for the course.
• Please refer to Canvas for instructions on installing Anaconda Python or setting up an AWS SageMaker notebook: Introduction to Amazon Web Services (AWS) Classrooms
Objective
• Continue to familiarise with Python and other ML packages.
• Learning classification decision trees from both categorical and continuous numerical data
• Comparing the performance of various trees after pruning.
• Learning regression decision trees and comparing these models to regression models from previous labs.
Dataset
The data is related to the direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed or not.
Input variables:
• Bank client data:
1. age (numeric)
2. job : type of job (categorical: “admin.”,”unknown”,”unemployed”,”management”,”housemaid”,”entrepreneur”,”student”, “blue-collar”,”self-employed”,”retired”,”technician”,”services”)
3. marital : marital status (categorical: “married”,”divorced”,”single”; note: “divorced” means divorced or widowed)
4. education (categorical: “unknown”,”secondary”,”primary”,”tertiary”)
5. default: has credit in default? (binary: “yes”,”no”)
6. balance: average yearly balance, in euros (numeric)
7. housing: has housing loan? (binary: “yes”,”no”)
8. loan: has personal loan? (binary: “yes”,”no”)
• Related with the last contact of the current campaign:
1. contact: contact communication type (categorical: “unknown”,”telephone”,”cellular”)
2. day: last contact day of the month (numeric)
3. month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
4. duration: last contact duration, in seconds (numeric)
• Other attributes:
1. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
2. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
3. previous: number of contacts performed before this campaign and for this client (numeric)
4. poutcome: outcome of the previous marketing campaign (categorical: “unknown”,”other”,”failure”,”success”)
Output variable (desired target):
• y – has the client subscribed a term deposit? (binary: “yes”,”no”)

This dataset is publicly available for research. The details are described in Moro et al., 2011.
Moro et al., 2011: S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference – ESM’2011, pp. 117-121, Guimarães, Portugal, October, 2011.
Let's read the data first.
In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

data = pd.read_csv('./bank-full.csv', delimiter=';')
data.head()

The dataset contains categorical and numerical attributes. Let's convert the categorical columns to the categorical data type in pandas.
In [ ]:
for col in data.columns:
    if data[col].dtype == object:
        data[col] = data[col].astype('category')

sklearn's classification decision tree learner doesn't work with categorical attributes. It only works with continuous numeric attributes. The target class, however, must be categorical. So the categorical attributes must be converted into a suitable continuous format. Helpfully, Pandas can do this.
First, split the data into the target class and attributes:
In [ ]:
dataY = data['y']
dataX = data.drop(columns='y')

Then use Pandas to generate “numerical” versions of the attributes:
In [ ]:
dataXExpand = pd.get_dummies(dataX)
dataXExpand.head()

As you can see, the categories are expanded into boolean (yes/no, that is, 1/0) values that can be treated as continuous numerical values. It’s not ideal, but it will allow a correct decision tree to be learned.
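If the effect of pd.get_dummies is not obvious from the output above, the following toy example (hypothetical data, not part of the bank dataset) shows how a single categorical column is expanded into one 0/1 column per category:
In [ ]:
# Hypothetical toy data, purely to illustrate pd.get_dummies
toy = pd.DataFrame({'colour': ['red', 'green', 'blue', 'green']})
print(pd.get_dummies(toy))  # produces colour_blue, colour_green and colour_red columns
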
❓ Why is it necessary to convert the attributes into boolean representations, rather than just convert them into integer values? What problem would be caused by converting the attributes into integers?

The target class also needs to be pre-processed. The target will be treated by sklearn as a category, but sklearn requires that these categories are represented as integers (not strings). To convert the strings into numbers, the preprocessing.LabelEncoder class from sklearn can be used, as shown below. The two print statements show how to convert in both directions (strings to integers, and vice versa).
In [ ]:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(dataY)
class_labels = le.inverse_transform([0,1])
dataY = le.transform(dataY)
print(dataY)
print(class_labels)

EDA
☞ Task: Since we have covered how to do EDA in the previous labs, this section is left as an exercise for you. Complete the EDA and use the information to justify the decisions made in the subsequent code blocks.
In [ ]:
# TODO
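
If you are not sure where to start, the sketch below is only a minimal starting point (using the column names from the dataset description); it checks the class balance of the target and summarises the numeric attributes. The class balance is particularly relevant for the next section.
In [ ]:
# Minimal EDA starting point (a sketch only - extend it with your own analysis)
print(data['y'].value_counts(normalize=True))  # class balance of the target variable
print(data.describe())                         # summary statistics of the numeric attributes
data['age'].hist(bins=30)                      # distribution of one numeric attribute
plt.show()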

Setting up the performance (evaluation) metric
There are many performance metrics that apply to this problem such as accuracy_score, f1_score, etc. More information on performance metrics available in sklearn can be found at: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
The insights gained in the EDA become vital in determining the performance metric. Try to identify the characteristics that are important in making this decision from the EDA results. Use your judgment to pick the best performance measure – discuss with the lab demonstrator to see if the performance measure you came up with is appropriate.
In this task, I want to give equal importance to all classes. Therefore I will select macro-averaged f1_score as my performance measure and I wish to achieve a target value of 75% f1_score.
F1-score is NOT the only performance measure that can be used for this problem.
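To see why plain accuracy can be misleading when the classes are imbalanced, here is a small made-up example (the labels below are hypothetical, not from the dataset) comparing accuracy with macro-averaged F1 for a classifier that always predicts the majority class:
In [ ]:
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 90 negatives, 10 positives; predictions are all negative
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100
print(accuracy_score(y_true, y_pred))             # 0.90 - looks deceptively good
print(f1_score(y_true, y_pred, average='macro'))  # about 0.47 - exposes the ignored minority class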

Setup the experiment – data splits
Next, what data should we use to evaluate the performance?
We can generate "simulated" unseen data using one of several methods:
1. Hold-Out validation
2. Cross-Validation
Let's use hold-out validation for this experiment.
☞ Task: Use the knowledge from the last couple of weeks to split the data appropriately.
In [ ]:
from sklearn.model_selection import train_test_split

with pd.option_context('mode.chained_assignment', None):
    train_data_X_, test_data_X, train_data_y_, test_data_y = train_test_split(dataXExpand, dataY, test_size=0.2,
                                                                               shuffle=True, random_state=0)

with pd.option_context('mode.chained_assignment', None):
    train_data_X, val_data_X, train_data_y, val_data_y = train_test_split(train_data_X_, train_data_y_, test_size=0.25,
                                                                          shuffle=True, random_state=0)

print(train_data_X.shape, val_data_X.shape, test_data_X.shape)
In [ ]:
train_X = train_data_X.to_numpy()
train_y = train_data_y

test_X = test_data_X.to_numpy()
test_y = test_data_y

val_X = val_data_X.to_numpy()
val_y = val_data_y

Let's set up a few functions to visualise the results.
(Ignore this section if on AWS.) It is likely that you won't have the graphviz package available, in which case you will need to install it. This can be done through the Anaconda Navigator interface (Environments tab):
1. Change the dropdown to "All"
2. Search for the package python-graphviz
3. Select the python-graphviz package and install it (press "Apply")
If you can't install graphviz, don't worry – you can still complete the lab. Graphviz is nice for visualising the trees that are learned. However, once the trees become complex, visualising them isn't practical.
In [ ]:
import graphviz

def get_tree_2_plot(clf):
    dot_data = tree.export_graphviz(clf, out_file=None,
                                    feature_names=dataXExpand.columns,
                                    class_names=class_labels,
                                    filled=True, rounded=True,
                                    special_characters=True)
    graph = graphviz.Source(dot_data)
    return graph
In [ ]:
from sklearn.metrics import f1_score

def get_acc_scores(clf, train_X, train_y, val_X, val_y):
    train_pred = clf.predict(train_X)
    val_pred = clf.predict(val_X)

    train_acc = f1_score(train_y, train_pred, average='macro')
    val_acc = f1_score(val_y, val_pred, average='macro')

    return train_acc, val_acc

Simple decision tree training
Let's train a simple decision tree and visualise it.
In [ ]:
from sklearn import tree

tree_max_depth = 2  # change this value and observe

clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=tree_max_depth, class_weight='balanced')
clf = clf.fit(train_X, train_y)
In [ ]:
Dtree = get_tree_2_plot(clf)
Dtree
In [ ]:
train_acc, val_acc = get_acc_scores(clf, train_X, train_y, val_X, val_y)
print("Train f1 score: {:.3f}".format(train_acc))
print("Validation f1 score: {:.3f}".format(val_acc))

❓ Did we achieve the desired target value? If not, what do you think the above results indicate: over-fitting or under-fitting?
❓ Based on the answer to the above question, what do you think is the best course of action?

Hyper-parameter tuning
❓ What are the hyper-parameters of the DecisionTreeClassifier?
You may decide to tune the important hyper-parameters of the decision tree classifier (identified in the above question) to get the best performance. As an example I have selected two hyper-parameters: max_depth and min_samples_split.
In this exercise I will be using grid search to tune the parameters. Sklearn has a function called GridSearchCV that does cross-validation to tune the hyper-parameters. Let's use this function.
This step may take several minutes depending on the performance of your computer.
In [ ]:
from sklearn.model_selection import GridSearchCV

parameters = {'max_depth': np.arange(2, 400, 50), 'min_samples_split': np.arange(2, 50, 5)}

dt_clf = tree.DecisionTreeClassifier(criterion='entropy', class_weight='balanced')
Gridclf = GridSearchCV(dt_clf, parameters, scoring='f1_macro')
Gridclf.fit(train_X, train_y)
In [ ]:
pd.DataFrame(Gridclf.cv_results_)
In [ ]:
print(Gridclf.best_score_)
print(Gridclf.best_params_)

clf = Gridclf.best_estimator_
In [ ]:
train_acc, val_acc = get_acc_scores(clf, train_X, train_y, val_X, val_y)
print("Train f1 score: {:.3f}".format(train_acc))
print("Validation f1 score: {:.3f}".format(val_acc))

❓ Did we achieve the desired target value? If not, what do you think the above results indicate: over-fitting or under-fitting?
❓ Based on the answer to the above question, what do you think is the best course of action?

Post-pruning decision trees with cost-complexity pruning
The DecisionTreeClassifier provides parameters such as min_samples_leaf and max_depth to prevent a tree from overfitting. Those parameters prevent the tree from growing to a large size and are examples of pre-pruning.
Minimal cost-complexity pruning is an algorithm used to prune a tree to avoid over-fitting. This algorithm finds the node with the "weakest link", characterised by an effective alpha. The nodes with the smallest effective alpha are pruned first. As the algorithm works after the tree is grown, this is a post-pruning technique.
In [ ]:
clf = tree.DecisionTreeClassifier(class_weight='balanced')
path = clf.cost_complexity_pruning_path(train_X, train_y)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

The following step may take several minutes depending on the performance of your computer.
In [ ]:
clfs = []
for ccp_alpha in ccp_alphas:
    clf = tree.DecisionTreeClassifier(random_state=0, ccp_alpha=ccp_alpha, class_weight='balanced')
    clf.fit(train_X, train_y)
    clfs.append(clf)
In [ ]:
train_scores = [f1_score(train_y, clf.predict(train_X), average='macro') for clf in clfs]
val_scores = [f1_score(val_y, clf.predict(val_X), average='macro') for clf in clfs]

fig, ax = plt.subplots(figsize=(10, 10))
ax.set_xlabel("alpha")
ax.set_ylabel("f1_score")
ax.set_title("F1 score vs alpha for the training and validation sets")
ax.plot(ccp_alphas, train_scores, marker='o', label="train",
        drawstyle="steps-post")
ax.plot(ccp_alphas, val_scores, marker='o', label="validation",
        drawstyle="steps-post")
ax.legend()
plt.show()

❓ What ccp_alpha value would you choose as the best for this task?
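One possible way to pick a value programmatically (a sketch only; the final choice should also weigh how simple the resulting tree is) is to take the alpha whose tree achieves the highest validation score:
In [ ]:
# Pick the ccp_alpha with the best validation f1 score from the lists computed above
best_idx = int(np.argmax(val_scores))
print("Best validation f1: {:.3f} at ccp_alpha = {:.5f}".format(val_scores[best_idx], ccp_alphas[best_idx]))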

Random forest
Let's make many trees using our dataset. If we run the DT algorithm multiple times on the same data, it will result in the same tree. To make different trees we can inject some randomness: randomly select the data points (and the features) used by the DT algorithm – randomly sampling the data points with replacement is called creating bootstrapped datasets.
This is done for us automatically in sklearn's RandomForestClassifier. Let's use it on our dataset.
In [ ]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(max_depth=8, n_estimators=500, class_weight='balanced_subsample', random_state=0)
clf.fit(train_X, train_y)
In [ ]:
train_acc, val_acc = get_acc_scores(clf, train_X, train_y, val_X, val_y)
print("Train f1 score: {:.3f}".format(train_acc))
print("Validation f1 score: {:.3f}".format(val_acc))

☞ Task: Now tune the hyper-parameters of the random forest.
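As a starting point you could reuse GridSearchCV as before. The grid below is just one possible (hypothetical) choice of parameters and ranges – adjust it based on your EDA and the results so far. It may take a while to run.
In [ ]:
# A hypothetical starting grid for the random forest - choose your own parameters and ranges
rf_parameters = {'max_depth': [4, 8, 16], 'n_estimators': [100, 300, 500]}

rf_clf = RandomForestClassifier(class_weight='balanced_subsample', random_state=0)
rf_grid = GridSearchCV(rf_clf, rf_parameters, scoring='f1_macro')
rf_grid.fit(train_X, train_y)

print(rf_grid.best_score_)
print(rf_grid.best_params_)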

❓ Is the final model that you get after hyper-parameter tuning better than the previous decision tree model? Why?
Let's visualise the feature importances of the RF classifier.
In [ ]:
tree_feature_importances = clf.feature_importances_
sorted_idx = tree_feature_importances.argsort()

plt.figure(figsize=(10,10))
plt.barh(dataXExpand.columns[sorted_idx], tree_feature_importances[sorted_idx])
plt.show()

❓ Based on the above figure, do you see any reason to be concerned about the model?
❓ If the model uses duration to predict the target, what could be an issue?

Exercise: Regression Decision Tree
A regression decision tree can also be trained. These are decision trees where the leaf node is a regression function. You will investigate learning regression trees using the Boston housing dataset from previous labs.
The below code snippet will help get you started. Note that it does not make sense to use entropy for generating splits, so the default method from sklearn will be used. Also note that the DecisionTreeRegressor class uses similar pre-pruning parameters.
In [ ]:
# import pandas as pd
# import matplotlib.pyplot as plt
# import numpy as np
# import sklearn

# from sklearn import tree
# from sklearn import preprocessing
# from sklearn import metrics
# from sklearn import model_selection

# Load data

# bostonDataTarget = bostonData['MEDV']
# bostonDataAttrs = bostonData.drop(columns='MEDV')
# trainY, testY, trainX, testX = model_selection.train_test_split(np.array(bostonDataTarget), np.array(bostonDataAttrs), test_size=0.2)
# clfBoston = sklearn.tree.DecisionTreeRegressor(max_depth=5, min_samples_split=5)
# clfBoston = clfBoston.fit(trainX, trainY)
# predictions = clfBoston.predict(testX)
# metrics.mean_squared_error(testY, predictions)

❓ How does the error of the regression decision tree compare to the best results you have found in previous labs?
❓ Find a good set of pre-pruning parameters that minimises the mean squared error.
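One way to approach the last question is to wrap DecisionTreeRegressor in a grid search over its pre-pruning parameters. The snippet below is a sketch only (it assumes the Boston data has been loaded and split as in the starter code above, and the parameter ranges are arbitrary):
In [ ]:
# param_grid = {'max_depth': np.arange(2, 20, 2), 'min_samples_split': np.arange(2, 20, 2)}
# regGrid = model_selection.GridSearchCV(tree.DecisionTreeRegressor(), param_grid,
#                                        scoring='neg_mean_squared_error')
# regGrid.fit(trainX, trainY)
# print(regGrid.best_params_, -regGrid.best_score_)  # best parameters and their cross-validated MSE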