In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("notebook")
#sns.set_context("poster")
In [3]:
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn import preprocessing
Ensembles
Ensembles develop around two main ideas.
The first is the idea that by combining weak learners we can obtain a strong learner. A large corpus of theoretical work has grown around this idea, implemented and refined over time.
The second is more prosaic and revolves around the need to overcome overfitting, particularly in trees. It leads to combining many instances of the same learner in order to reduce variance and avoid overfitting while improving performance.
These ideas crystallize in three different models of ensembles:
• Bagging. Building multiple models, typically of the same type, from different subsamples of the dataset (normally with repetition) and combining them with an aggregate such as the mean.
• Boosting. The idea of boosting is to build the model incrementally, where each iteration tries to fix the errors of the previous one.
• Voting. In this case we have multiple models, typically of different types, and a procedure to combine their predictions (normally a simple statistic such as the mean, or a majority vote).
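Before moving to scikit-learn's implementations, the resample-then-aggregate mechanics of bagging can be sketched by hand. The snippet below is a minimal sketch on synthetic data (not the Pima dataset): each tree is trained on a bootstrap sample and the predictions are combined by majority vote.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Minimal bagging sketch: train each tree on a bootstrap sample
# (sampling with repetition) and aggregate by majority vote.
rng = np.random.RandomState(7)
X_demo, y_demo = make_classification(n_samples=200, random_state=7)

n_learners = 25
predictions = []
for _ in range(n_learners):
    idx = rng.randint(0, len(X_demo), size=len(X_demo))  # bootstrap sample
    tree = DecisionTreeClassifier(random_state=7).fit(X_demo[idx], y_demo[idx])
    predictions.append(tree.predict(X_demo))

# Majority vote across the ensemble
votes = np.mean(predictions, axis=0) >= 0.5
ensemble_acc = np.mean(votes == y_demo)
```

With real data you would of course evaluate on held-out samples; the point here is only the resample-then-vote mechanics.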
In order to compare these methods with the previous ones, we will use the same dataset, the Pima Indians, with 10-fold cross-validation and accuracy as the performance metric.

In this exercise we will use one of the traditional Machine Learning datasets, the Pima Indians diabetes dataset.
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.
The dataset consists of several medical predictor variables and one target variable, Outcome. The predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.
▪ Pregnancies
▪ Glucose
▪ BloodPressure
▪ SkinThickness
▪ Insulin
▪ BMI
▪ DiabetesPedigreeFunction (scores the likelihood of diabetes based on family history)
▪ Age
▪ Outcome
In [4]:
# Load the Pima indians dataset and separate input and output components
from numpy import set_printoptions
set_printoptions(precision=3)
filename="pima-indians-diabetes.data.csv"
names=["pregnancies", "glucose", "pressure", "skin", "insulin", "bmi", "pedi", "age", "outcome"]
p_indians=pd.read_csv(filename, names=names)
p_indians.head()
# First we separate into input and output components
array=p_indians.values
X=array[:,0:8]
y=array[:,8]
np.set_printoptions(suppress=True)
X
pd.DataFrame(X).head()
# Create the DataFrames for plotting
resall=pd.DataFrame()
res_w1=pd.DataFrame()
Out[4]:
   pregnancies  glucose  pressure  skin  insulin   bmi   pedi  age  outcome
0            6      148        72    35        0  33.6  0.627   50        1
1            1       85        66    29        0  26.6  0.351   31        0
2            8      183        64     0        0  23.3  0.672   32        1
3            1       89        66    23       94  28.1  0.167   21        0
4            0      137        40    35      168  43.1  2.288   33        1
Out[4]:
array([[  6.   , 148.   ,  72.   , ...,  33.6  ,   0.627,  50.   ],
       [  1.   ,  85.   ,  66.   , ...,  26.6  ,   0.351,  31.   ],
       [  8.   , 183.   ,  64.   , ...,  23.3  ,   0.672,  32.   ],
       ...,
       [  5.   , 121.   ,  72.   , ...,  26.2  ,   0.245,  30.   ],
       [  1.   , 126.   ,  60.   , ...,  30.1  ,   0.349,  47.   ],
       [  1.   ,  93.   ,  70.   , ...,  30.4  ,   0.315,  23.   ]])
Out[4]:
     0      1     2     3      4     5      6     7
0  6.0  148.0  72.0  35.0    0.0  33.6  0.627  50.0
1  1.0   85.0  66.0  29.0    0.0  26.6  0.351  31.0
2  8.0  183.0  64.0   0.0    0.0  23.3  0.672  32.0
3  1.0   89.0  66.0  23.0   94.0  28.1  0.167  21.0
4  0.0  137.0  40.0  35.0  168.0  43.1  2.288  33.0
Bagged Decision Trees¶
Bagging is a contraction of bootstrapping + aggregation. The idea behind bagging is to reduce the variance of a weak learner by randomly sampling with repetition and building a number of learners that are later aggregated, by voting for a classifier or with a statistic such as the mean for regression.
In this case we will use the DecisionTreeClassifier (CART) with the BaggingClassifier class.
In [5]:
# Bagged Decision Trees
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
seed=7
kfold=KFold(n_splits=10, random_state=seed)  # newer scikit-learn also needs shuffle=True to use random_state
#learner=DecisionTreeClassifier(class_weight="balanced", random_state=seed)
learner=DecisionTreeClassifier(random_state=seed)
num_trees=100
model=BaggingClassifier(base_estimator=learner, n_estimators=num_trees, random_state=seed)
results=cross_val_score(model, X, y, cv=kfold)
print(f'Bagged Decision Trees - Accuracy {results.mean()*100:.3f}% std {results.std()*100:.3f}')
res_w1["Res"]=results
res_w1["Type"]="Bagged DT"
resall=pd.concat([resall,res_w1], ignore_index=True)
Bagged Decision Trees - Accuracy 77.075% std 7.387
Random Forest¶
Random Forest is an extension of Bagged Decision Trees, aiming at reducing the correlation between the individual classifiers.
The chosen strategy consists in considering a randomly selected subset of the features at each split, instead of greedily searching all of them for the best one.
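As a hypothetical illustration of this effect (on synthetic data, so the numbers carry over only in spirit), you can vary max_features and watch the cross-validated score; max_features equal to the total number of features reduces to plain bagging.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# max_features controls how many randomly chosen features each split
# may consider; smaller subsets decorrelate the individual trees.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=7)
scores = {}
for m in (2, 3, 8):  # 8 = consider all features, i.e. plain bagging
    rf = RandomForestClassifier(n_estimators=50, max_features=m, random_state=7)
    scores[m] = cross_val_score(rf, X_demo, y_demo, cv=5).mean()
```

The best value is data-dependent, which is why max_features is usually worth tuning.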
For Random Forest you have to use the RandomForestClassifier class.
In [6]:
# Random Forest
from sklearn.ensemble import RandomForestClassifier
seed=7
kfold=KFold(n_splits=10, random_state=seed)
num_trees=100
num_features=3
model=RandomForestClassifier(n_estimators=num_trees, max_features=num_features, random_state=seed)
results=cross_val_score(model, X, y, cv=kfold)
print(f’Random Forest – Accuracy {results.mean()*100:.3f}% std {results.std()*100:3f}’)
res_w1[“Res”]=results
res_w1[“Type”]=”Random Forest”
resall=pd.concat([resall,res_w1], ignore_index=True)
Random Forest – Accuracy 77.338% std 6.903630
In [10]:
# visualizing a single tree in a random forest
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
from graphviz import Source
from IPython.display import SVG, display
from ipywidgets import interactive
seed=7
num_trees=100
num_features=3
model=RandomForestClassifier(n_estimators=num_trees, max_features=num_features, random_state=seed)
model.fit(X,y)
estimator = model.estimators_[5]
graph=Source(tree.export_graphviz(estimator,
                                  out_file=None,
                                  feature_names=p_indians.columns[:-1],
                                  class_names=['No Diabetes','Diabetes'],
                                  filled=True,
                                  rounded=True))
graph
# if you want to save it to a file, it will open in your viewer
# and you can save it from there; just uncomment:
#graph.format = 'png'
#graph.render('dtree_render', view=True)
Out[10]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion=’gini’,
max_depth=None, max_features=3, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
n_jobs=None, oob_score=False, random_state=7, verbose=0,
warm_start=False)
Out[10]:
[Graphviz rendering of tree number 5 of the forest. Each node shows its split
condition (e.g. "bmi <= 29.95" at the root), its gini impurity, the number of
samples it covers, the class counts ("value"), and the majority class
("No Diabetes" / "Diabetes").]
Extra Trees¶
Extra Trees stands for Extremely Randomized Trees and is a variation of Random Forest.
While similar to ordinary random forests in that they are an ensemble of individual trees, there are two main differences: first, each tree is trained using the whole learning sample (rather than a bootstrap sample), and second, the top-down splitting in the tree learner is randomized. Instead of computing the locally optimal cut-point for each feature under consideration (based on, e.g., information gain or the Gini impurity), a random cut-point is selected.
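The random cut-point selection can be sketched in a couple of lines (the feature values below are hypothetical, standing in for one feature within one node):

```python
import numpy as np

# Extra Trees split rule: rather than scanning every candidate cut-point
# for the best one, draw a single threshold uniformly at random between
# the feature's minimum and maximum within the node.
rng = np.random.RandomState(7)
feature_values = np.array([85.0, 89.0, 137.0, 148.0, 183.0])  # hypothetical glucose values

threshold = rng.uniform(feature_values.min(), feature_values.max())
goes_left = feature_values <= threshold
```

Among the features under consideration, the tree then keeps the random split that scores best, trading a little bias for a further reduction in variance.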
For Extra Trees you must use the ExtraTreesClassifier class.
In [230]:
# Extra Trees
from sklearn.ensemble import ExtraTreesClassifier
seed=7
kfold=KFold(n_splits=10, random_state=seed)
num_trees=300
num_features=5
model=ExtraTreesClassifier(n_estimators=num_trees, max_features=num_features, random_state=seed)
results=cross_val_score(model, X, y, cv=kfold)
print(f'Extra Trees - Accuracy {results.mean()*100:.3f}% std {results.std()*100:.3f}')
res_w1["Res"]=results
res_w1["Type"]="Extra Trees"
resall=pd.concat([resall,res_w1], ignore_index=True)
Extra Trees - Accuracy 77.592% std 7.081
AdaBoost¶
AdaBoost, short for Adaptive Boosting, was the first really successful boosting algorithm and in many ways opened the way to a new generation of boosting algorithms.
It works by weighting instances of the dataset according to how difficult they are to classify, and using these weights to pay more or less attention to each instance when constructing the subsequent models.
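A toy sketch of a single boosting round (with made-up error flags, for the binary case) shows the mechanics: a weak learner with weighted error err receives a vote weight alpha, and misclassified instances are up-weighted so that the next learner focuses on them.

```python
import numpy as np

# One AdaBoost round on 5 instances, starting from uniform weights.
weights = np.full(5, 1/5)
misclassified = np.array([False, True, False, False, True])  # made-up flags

err = weights[misclassified].sum()              # weighted error = 0.4
alpha = 0.5 * np.log((1 - err) / err)           # learner's vote weight
weights *= np.exp(np.where(misclassified, alpha, -alpha))
weights /= weights.sum()                        # renormalize to sum to 1
```

After the update the misclassified instances carry exactly half of the total weight, so the next weak learner is forced to take them seriously.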
You can use AdaBoost for classification with the AdaBoostClassifier class.
In [231]:
# AdaBoost
from sklearn.ensemble import AdaBoostClassifier
seed=7
kfold=KFold(n_splits=10, random_state=seed)
num_trees=30
model=AdaBoostClassifier(n_estimators=num_trees, random_state=seed)
results=cross_val_score(model, X, y, cv=kfold)
print(f'AdaBoost - Accuracy {results.mean()*100:.3f}% std {results.std()*100:.3f}')
res_w1["Res"]=results
res_w1["Type"]="AdaBoost"
resall=pd.concat([resall,res_w1], ignore_index=True)
AdaBoost - Accuracy 76.046% std 5.444
Stochastic Gradient Boosting¶
Stochastic Gradient Boosting (also called Gradient Boosting Machines) is one of the most sophisticated ensemble techniques and among the best performing.
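The core loop behind gradient boosting can be sketched on a toy regression problem: each stage fits a small tree to the current residuals, which for squared error are exactly the negative gradient of the loss.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy gradient boosting for regression: accumulate small trees, each
# fitted to the residuals of the running prediction.
rng = np.random.RandomState(7)
X_demo = rng.uniform(0, 10, size=(100, 1))
y_demo = np.sin(X_demo.ravel())

pred = np.zeros_like(y_demo)
learning_rate = 0.3
for _ in range(50):
    residuals = y_demo - pred                   # what is still unexplained
    stage = DecisionTreeRegressor(max_depth=2).fit(X_demo, residuals)
    pred += learning_rate * stage.predict(X_demo)

train_mse = np.mean((y_demo - pred) ** 2)
```

The "stochastic" part of the name refers to additionally fitting each stage on a random subsample of the training data (the subsample parameter in scikit-learn), which further decorrelates the stages.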
For Stochastic Gradient Boosting you have to use the GradientBoostingClassifier class.
In [232]:
# Stochastic Gradient Boosting
from sklearn.ensemble import GradientBoostingClassifier
seed=7
kfold=KFold(n_splits=10, random_state=seed)
num_trees=30
model=GradientBoostingClassifier(n_estimators=num_trees, random_state=seed)
results=cross_val_score(model, X, y, cv=kfold)
print(f'Stochastic Gradient Boosting - Accuracy {results.mean()*100:.3f}% std {results.std()*100:.3f}')
res_w1["Res"]=results
res_w1["Type"]="GradientBoosting"
resall=pd.concat([resall,res_w1], ignore_index=True)
Stochastic Gradient Boosting - Accuracy 77.203% std 6.500
Voting Ensemble¶
Voting is the simplest way to aggregate the predictions of multiple classifiers.
The idea behind it is pretty straightforward: first you create all the models using your training dataset, and when predicting you average (or, for classifiers, vote on) the predictions of the sub-models.
More sophisticated variations can learn how to best weight the predictions from the sub-models (stacking), although scikit-learn's VotingClassifier itself only supports fixed weights.
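Hard voting itself is trivial to compute by hand (the model labels in the comments are just illustrative):

```python
import numpy as np

# Hard voting by hand: each row holds one model's class predictions for
# four instances; the ensemble predicts the majority class per column.
preds = np.array([
    [1, 0, 1, 1],   # e.g. logistic regression
    [0, 0, 1, 1],   # e.g. decision tree
    [1, 0, 0, 1],   # e.g. random forest
])
majority = (preds.sum(axis=0) >= 2).astype(int)  # 2 of 3 votes wins
```

Soft voting instead averages the predicted class probabilities, which tends to work better when the sub-models are well calibrated.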
You can create a voting ensemble with the VotingClassifier class.
In [233]:
# Voting Ensemble
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
seed=7
kfold=KFold(n_splits=10, random_state=seed)
# create the models
estimators=[]
model1=LogisticRegression(solver="liblinear")
estimators.append(("logistic", model1))
model2=DecisionTreeClassifier(random_state=seed)
estimators.append(("cart", model2))
#model3=SVC(gamma="auto")
#estimators.append(("svm", model3))
num_trees=100
num_features=3
model4=RandomForestClassifier(n_estimators=num_trees, max_features=num_features, random_state=seed)
estimators.append(("rfc", model4))
model=VotingClassifier(estimators)
results=cross_val_score(model, X, y, cv=kfold)
print(f'Voting Ensemble (log,cart,rfc) - Accuracy {results.mean()*100:.3f}% std {results.std()*100:.3f}')
res_w1["Res"]=results
res_w1["Type"]="Voting"
resall=pd.concat([resall,res_w1], ignore_index=True)
Voting Ensemble (log,cart,rfc) - Accuracy 77.208% std 5.699
Feature Importance¶
In [261]:
# Random Forest
plt.figure(figsize=(15,9))
from sklearn.ensemble import RandomForestClassifier
seed=7
num_trees=100
num_features=3
model=RandomForestClassifier(n_estimators=num_trees, max_features=num_features, random_state=seed)
model.fit(X,y)
for name, importance in zip(p_indians.columns, model.feature_importances_):
    print(f'{name:15s} {importance:.4f}')
sns.barplot(x=p_indians.columns[:-1], y=model.feature_importances_)
Out[261]:
pregnancies 0.0778
glucose 0.2754
pressure 0.0873
skin 0.0617
insulin 0.0626
bmi 0.1721
pedi 0.1251
age 0.1379
Out[261]:
[Bar plot of the feature importances listed above]
In [ ]:
Algorithm Comparison¶
In [234]:
# Now let's compare them all
plt.figure(figsize=(15,9))
sns.boxplot(data=resall, x="Type", y="Res")
sns.swarmplot(data=resall, x="Type", y="Res", color="royalblue")
Out[234]:
[Box plot with overlaid swarm plot of the cross-validation accuracies per ensemble type]
In [ ]:
In [ ]:
Mission 1
a) Do the same with the Titanic dataset.
In [ ]:
In [ ]:
In [ ]: