
In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

%matplotlib inline

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("notebook")
#sns.set_context("poster")
In [3]:
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

from sklearn.metrics import accuracy_score

from sklearn import preprocessing

Ensembles
Ensembles develop around two main ideas.
The first is the idea that by combining weak learners we can obtain a strong learner. Around this idea there is a large corpus of theoretical work that has been implemented and refined over time.
The second is more prosaic and revolves around the need to overcome overfitting, particularly in trees. It leads to combining many instances of the same learner in order to reduce variance and avoid overfitting while improving performance.
These ideas crystallize in three different kinds of ensembles:
• Bagging. Building multiple models, typically of the same type, from different subsamples of the dataset (normally drawn with replacement) and combining them with an aggregate such as the mean.
• Boosting. Building the model incrementally, where each iteration tries to fix the errors of the previous one.
• Voting. Building multiple models, typically of different types, and combining their predictions with a simple procedure (normally a statistic such as the mean, or a majority vote).
In order to compare them with the models from previous notebooks, we will use the same dataset, the Pima Indians, with 10-fold cross-validation and accuracy as the performance metric.
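The combination step for classifiers can be illustrated with a toy example (hypothetical predictions, not the Pima data): three models each predict a 0/1 label for five instances, and the ensemble predicts the majority class per instance.

```python
import numpy as np

# Hypothetical 0/1 predictions of three classifiers for five instances.
preds = np.array([
    [1, 0, 1, 1, 0],   # classifier A
    [1, 1, 0, 1, 0],   # classifier B
    [0, 0, 1, 1, 1],   # classifier C
])

# Majority vote per column: predict 1 when at least 2 of the 3 agree.
majority = (preds.sum(axis=0) >= 2).astype(int)
print(majority)  # [1 0 1 1 0]
```

For regression the same idea applies with the mean of the sub-models' predictions instead of a vote.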


In this exercise we will use one of the traditional Machine Learning datasets, the Pima Indians diabetes dataset.
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.
Content: the dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.
▪ Pregnancies
▪ Glucose
▪ BloodPressure
▪ SkinThickness
▪ Insulin
▪ BMI
▪ DiabetesPedigreeFunction (scores the likelihood of diabetes based on family history)
▪ Age
▪ Outcome
In [4]:
# Load the Pima indians dataset and separate input and output components

from numpy import set_printoptions
set_printoptions(precision=3)

filename="pima-indians-diabetes.data.csv"
names=["pregnancies", "glucose", "pressure", "skin", "insulin", "bmi", "pedi", "age", "outcome"]
p_indians=pd.read_csv(filename, names=names)
p_indians.head()

# First we separate into input and output components
array=p_indians.values
X=array[:,0:8]
y=array[:,8]
np.set_printoptions(suppress=True)
X
pd.DataFrame(X).head()

# Create the DataFrames for plotting
resall=pd.DataFrame()
res_w1=pd.DataFrame()
Out[4]:
   pregnancies  glucose  pressure  skin  insulin   bmi   pedi  age  outcome
0            6      148        72    35        0  33.6  0.627   50        1
1            1       85        66    29        0  26.6  0.351   31        0
2            8      183        64     0        0  23.3  0.672   32        1
3            1       89        66    23       94  28.1  0.167   21        0
4            0      137        40    35      168  43.1  2.288   33        1
Out[4]:
array([[ 6. , 148. , 72. , …, 33.6 , 0.627, 50. ],
[ 1. , 85. , 66. , …, 26.6 , 0.351, 31. ],
[ 8. , 183. , 64. , …, 23.3 , 0.672, 32. ],
…,
[ 5. , 121. , 72. , …, 26.2 , 0.245, 30. ],
[ 1. , 126. , 60. , …, 30.1 , 0.349, 47. ],
[ 1. , 93. , 70. , …, 30.4 , 0.315, 23. ]])
Out[4]:
     0      1     2     3      4     5      6     7
0  6.0  148.0  72.0  35.0    0.0  33.6  0.627  50.0
1  1.0   85.0  66.0  29.0    0.0  26.6  0.351  31.0
2  8.0  183.0  64.0   0.0    0.0  23.3  0.672  32.0
3  1.0   89.0  66.0  23.0   94.0  28.1  0.167  21.0
4  0.0  137.0  40.0  35.0  168.0  43.1  2.288  33.0

Bagged Decision Trees¶
Bagging is a contraction of bootstrapping + aggregation. The idea behind bagging is to reduce the variance of a weak learner by randomly sampling with replacement and building a number of learners that are later aggregated: by voting for classification, or with a statistic such as the mean for regression.
In this case we will use the DecisionTreeClassifier (CART) with the BaggingClassifier class.
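The bootstrapping step itself can be sketched in a few lines (toy indices, not the Pima data): each learner in the bag is trained on n instances drawn with replacement, so individual instances can repeat while others are left out.

```python
import numpy as np

# A minimal sketch of one bootstrap sample over a toy dataset of 10 instances.
rng = np.random.default_rng(7)
indices = np.arange(10)

# Draw n indices with replacement; each bagged learner gets its own sample.
sample = rng.choice(indices, size=indices.size, replace=True)
print(sample.size, np.unique(sample).size)
```

On average only about 63.2% of the distinct instances appear in a given bootstrap sample, which is what decorrelates the learners.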
In [5]:
# Bagged Decision Trees

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

seed=7

kfold=KFold(n_splits=10, random_state=seed)

#learner=DecisionTreeClassifier(class_weight=”balanced”, random_state=seed)
learner=DecisionTreeClassifier(random_state=seed)

num_trees=100

model=BaggingClassifier(base_estimator=learner, n_estimators=num_trees, random_state=seed)

results=cross_val_score(model, X, y, cv=kfold)

print(f'Bagged Decision Trees – Accuracy {results.mean()*100:.3f}% std {results.std()*100:3f}')

res_w1["Res"]=results
res_w1["Type"]="Bagged DT"

resall=pd.concat([resall,res_w1], ignore_index=True)

Bagged Decision Trees – Accuracy 77.075% std 7.386790

Random Forest¶
Random Forest is an extension of Bagged Decision Trees, aiming at reducing the correlation between the individual classifiers.
The strategy consists in considering a randomly selected subset of the features at each split, instead of greedily searching for the best split over all of them.
For Random Forest you have to use the RandomForestClassifier class.
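The feature-subsampling trick can be sketched as follows (hypothetical split, not scikit-learn internals): at every split only max_features randomly chosen candidate features are evaluated.

```python
import numpy as np

# Sketch of the Random Forest split rule: with 8 input features (as in the
# Pima data) and max_features=3, each split only evaluates a random subset.
rng = np.random.default_rng(7)
n_features, max_features = 8, 3

candidates = rng.choice(n_features, size=max_features, replace=False)
print(sorted(candidates.tolist()))   # a fresh subset is drawn at every split
```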
In [6]:
# Random Forest

from sklearn.ensemble import RandomForestClassifier

seed=7

kfold=KFold(n_splits=10, random_state=seed)

num_trees=100
num_features=3

model=RandomForestClassifier(n_estimators=num_trees, max_features=num_features, random_state=seed)

results=cross_val_score(model, X, y, cv=kfold)

print(f'Random Forest – Accuracy {results.mean()*100:.3f}% std {results.std()*100:3f}')

res_w1["Res"]=results
res_w1["Type"]="Random Forest"

resall=pd.concat([resall,res_w1], ignore_index=True)

Random Forest – Accuracy 77.338% std 6.903630
In [10]:
# visualizing a single tree in a random forest

from sklearn.ensemble import RandomForestClassifier

from sklearn import tree
from graphviz import Source
from IPython.display import SVG, display
from ipywidgets import interactive

seed=7

num_trees=100
num_features=3

model=RandomForestClassifier(n_estimators=num_trees, max_features=num_features, random_state=seed)
model.fit(X,y)

estimator = model.estimators_[5]

graph=Source(tree.export_graphviz(estimator,
out_file=None,
feature_names=p_indians.columns[:-1],
class_names=['No Diabetes','Diabetes'],
filled=True,
rounded=True))
graph

#if you want to save it in a file
# the file will open in preview and you can save it
# just uncomment

#graph.format = 'png'
#graph.render('dtree_render',view=True)
Out[10]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features=3, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
n_jobs=None, oob_score=False, random_state=7, verbose=0,
warm_start=False)
Out[10]:
[Rendered graphviz tree (estimators_[5], one of the forest's 100 trees), too large to reproduce as text. Its root splits on bmi <= 29.95 (gini = 0.449, samples = 478, value = [507, 261]), with deeper splits on pregnancies, glucose, insulin, age, etc., ending in pure leaves labelled No Diabetes / Diabetes.]

Extra Trees¶
Extra Trees stands for Extremely Randomized Trees and is a variation of Random Forest.
While similar to ordinary random forests in that they are an ensemble of individual trees, there are two main differences: first, each tree is trained using the whole learning sample (rather than a bootstrap sample), and second, the top-down splitting in the tree learner is randomized. Instead of computing the locally optimal cut-point for each feature under consideration (based on, e.g., information gain or the Gini impurity), a random cut-point is selected.
For Extra Trees you must use the ExtraTreesClassifier class.
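The random cut-point idea can be sketched on a toy feature column (hypothetical data, not scikit-learn internals): rather than scanning every candidate threshold for the best one, a single threshold is drawn uniformly within the feature's observed range.

```python
import numpy as np

# Sketch of the Extra Trees split rule on one hypothetical feature column.
rng = np.random.default_rng(7)
feature = rng.uniform(0, 10, size=20)

# Draw one random threshold inside [min, max] instead of searching for it.
cut = rng.uniform(feature.min(), feature.max())
left = feature[feature <= cut]
right = feature[feature > cut]
print(left.size, right.size)
```

The split quality is then evaluated only for this random threshold, which makes each tree cheaper to build and further decorrelates the trees.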
In [230]:
# Extra Trees

from sklearn.ensemble import ExtraTreesClassifier

seed=7

kfold=KFold(n_splits=10, random_state=seed)

num_trees=300
num_features=5

model=ExtraTreesClassifier(n_estimators=num_trees, max_features=num_features, random_state=seed)

results=cross_val_score(model, X, y, cv=kfold)

print(f'Extra Trees – Accuracy {results.mean()*100:.3f}% std {results.std()*100:3f}')

res_w1["Res"]=results
res_w1["Type"]="Extra Trees"

resall=pd.concat([resall,res_w1], ignore_index=True)

Extra Trees – Accuracy 77.592% std 7.081081

AdaBoost¶
AdaBoost, short for Adaptive Boosting, was the first really successful boosting algorithm and in many ways opened the way to a new generation of boosting algorithms.
It works by weighting the instances of the dataset according to how difficult they are to classify, and using these weights to pay more or less attention to each instance when constructing the subsequent models.
You can use AdaBoost for classification with the AdaBoostClassifier class.
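One round of the reweighting step can be sketched with toy labels in {-1, +1} (a hypothetical weak learner, not the sklearn implementation): misclassified instances have their weights increased, so the next learner focuses on them.

```python
import numpy as np

# Toy sketch of AdaBoost's reweighting after one weak learner.
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, 1, 1])    # hypothetical weak learner's output

w = np.full(5, 1 / 5)                   # start with uniform instance weights
err = w[y_true != y_pred].sum()         # weighted error rate (here 0.4)
alpha = 0.5 * np.log((1 - err) / err)   # weight of this learner in the ensemble

w = w * np.exp(-alpha * y_true * y_pred)  # up-weight mistakes, down-weight hits
w /= w.sum()                              # renormalize to a distribution
print(w)
```

After the update the two misclassified instances each carry weight 0.25 while the three correct ones carry 1/6, so the misclassified half of the mass matches the correct half, the classic AdaBoost property.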
In [231]:
# AdaBoost

from sklearn.ensemble import AdaBoostClassifier

seed=7

kfold=KFold(n_splits=10, random_state=seed)

num_trees=30

model=AdaBoostClassifier(n_estimators=num_trees, random_state=seed)

results=cross_val_score(model, X, y, cv=kfold)

print(f'AdaBoost – Accuracy {results.mean()*100:.3f}% std {results.std()*100:3f}')

res_w1["Res"]=results
res_w1["Type"]="AdaBoost"

resall=pd.concat([resall,res_w1], ignore_index=True)

AdaBoost – Accuracy 76.046% std 5.443778

Stochastic Gradient Boosting¶
Stochastic Gradient Boosting (also called Gradient Boosting Machines) is one of the most sophisticated ensemble techniques and one of the best in terms of improving the performance of ensembles.
For Stochastic Gradient Boosting you have to use the GradientBoostingClassifier class.
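The core boosting loop can be sketched for regression with squared error (toy data and an idealized "perfect" weak learner, purely for illustration): each stage fits the residuals of the current ensemble, scaled by a learning rate.

```python
import numpy as np

# Toy sketch of gradient boosting for regression: for squared error the
# negative gradient is just the residual y - pred.
y = np.array([3.0, 5.0, 8.0, 10.0])
pred = np.full_like(y, y.mean())    # stage 0: constant model
learning_rate = 0.5

for _ in range(20):
    residual = y - pred             # what the next weak learner must fit
    pred = pred + learning_rate * residual   # idealized perfect weak learner

print(np.abs(y - pred).max())       # residuals shrink geometrically
```

In the real algorithm each residual-fitting step uses a shallow tree, and the "stochastic" variant fits each tree on a random subsample of the data.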
In [232]:
# Stochastic Gradient Boosting

from sklearn.ensemble import GradientBoostingClassifier

seed=7

kfold=KFold(n_splits=10, random_state=seed)

num_trees=30

model=GradientBoostingClassifier(n_estimators=num_trees, random_state=seed)

results=cross_val_score(model, X, y, cv=kfold)

print(f'Stochastic Gradient Boosting – Accuracy {results.mean()*100:.3f}% std {results.std()*100:3f}')

res_w1["Res"]=results
res_w1["Type"]="GradientBoosting"

resall=pd.concat([resall,res_w1], ignore_index=True)

Stochastic Gradient Boosting – Accuracy 77.203% std 6.500026

Voting Ensemble¶
Voting is the simplest way to aggregate the predictions of multiple classifiers.
The idea behind it is pretty straightforward. First you create all the models using your training dataset; when predicting, you average (for regression) or vote (for classification) over the predictions of the submodels.
More evolved variations can learn how best to weight the predictions of the sub-models, although these versions are not currently available in scikit-learn.
You can create a voting ensemble with the VotingClassifier class.
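Besides hard majority voting, classifiers that expose probabilities can be averaged instead ("soft" voting). A minimal sketch with hypothetical class-1 probabilities from three models:

```python
import numpy as np

# Hypothetical class-1 probabilities of three models for three instances.
probs = np.array([
    [0.9, 0.2, 0.6],
    [0.8, 0.4, 0.4],
    [0.7, 0.1, 0.7],
])

avg = probs.mean(axis=0)            # average the probabilities per instance
pred = (avg >= 0.5).astype(int)     # then threshold at 0.5
print(pred)  # [1 0 1]
```

VotingClassifier supports both modes through its voting parameter ("hard" or "soft").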
In [233]:
# Voting Ensemble

from sklearn.ensemble import VotingClassifier

from sklearn.ensemble import RandomForestClassifier

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

seed=7

kfold=KFold(n_splits=10, random_state=seed)

# create the models
estimators=[]
model1=LogisticRegression(solver="liblinear")
estimators.append(("logistic", model1))

model2=DecisionTreeClassifier(random_state=seed)
estimators.append(("cart", model2))

#model3=SVC(gamma="auto")
#estimators.append(("svm", model3))

num_trees=100
num_features=3

model4=RandomForestClassifier(n_estimators=num_trees, max_features=num_features, random_state=seed)
estimators.append(("rfc", model4))

model=VotingClassifier(estimators)

results=cross_val_score(model, X, y, cv=kfold)

print(f'Voting Ensemble (log,cart,rfc) – Accuracy {results.mean()*100:.3f}% std {results.std()*100:3f}')

res_w1["Res"]=results
res_w1["Type"]="Voting"

resall=pd.concat([resall,res_w1], ignore_index=True)

Voting Ensemble (log,cart,rfc) – Accuracy 77.208% std 5.699165

Feature Importance¶
In [261]:
# Random Forest

plt.figure(figsize=(15,9))

from sklearn.ensemble import RandomForestClassifier

seed=7

num_trees=100
num_features=3

model=RandomForestClassifier(n_estimators=num_trees, max_features=num_features, random_state=seed)
model.fit(X,y)

for name, importance in zip(p_indians.columns[:-1], model.feature_importances_):
    print(f'{name:15s} {importance:.4f}')

sns.barplot(x=p_indians.columns[:-1], y=model.feature_importances_)
Out[261]:


Out[261]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features=3, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
n_jobs=None, oob_score=False, random_state=7, verbose=0,
warm_start=False)

pregnancies 0.0778
glucose 0.2754
pressure 0.0873
skin 0.0617
insulin 0.0626
bmi 0.1721
pedi 0.1251
age 0.1379
Out[261]:
[bar plot of the feature importances per input variable]
In [ ]:

Algorithm Comparison¶
In [234]:
# Now let's compare them all

plt.figure(figsize=(15,9))

sns.boxplot(data=resall, x="Type", y="Res")

sns.swarmplot(data=resall, x="Type", y="Res", color="royalblue")
Out[234]:
[box plot with overlaid swarm plot of the 10-fold accuracy distributions for each ensemble type]
In [ ]:

In [ ]:

Mission 1
a) Do the same with the Titanic dataset.
