Decision_Trees
Decision Tree Example¶
In this notebook, we will go over a decision tree classification example with a loan dataset.
The structure of the dataset is as follows:
Cardhldr = Dummy variable, 1 if application for credit card accepted, 0 if not
Default = 1 if defaulted, 0 if not (observed when Cardhldr = 1; 10,499 observations)
Age = Age in years plus twelfths of a year
Adepcnt = 1 + number of dependents
Acadmos = Months living at current address
Majordrg = Number of major derogatory reports
Minordrg = Number of minor derogatory reports
Ownrent = 1 if owns their home, 0 if rents
Income = Monthly income (divided by 10,000)
Selfempl = 1 if self-employed, 0 if not
Inc_per = Income divided by number of dependents
Exp_Inc = Ratio of monthly credit card expenditure to yearly income
Spending = Average monthly credit card expenditure (for Cardhldr = 1)
Logspend = Log of spending
Source: Expenditure and Default Data, 13,444 observations; Greene (1992).
Installing necessary libraries¶
#!pip install numpy
#!pip install pandas
#!pip install graphviz
#!pip install matplotlib
Importing necessary libraries¶
import pandas as pd
import operator # for sorting dictionaries
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from sklearn.metrics import accuracy_score, confusion_matrix
from graphviz import Source
from IPython.display import Image
%matplotlib inline
Import Data¶
df = pd.read_csv('credit_count.csv', sep=',')
df.head()
CARDHLDR DEFAULT AGE ACADMOS ADEPCNT MAJORDRG MINORDRG OWNRENT INCOME SELFEMPL INCPER EXP_INC SPENDING LOGSPEND
0 0 0 27.250000 4 0 0 0 0 1200.000000 0 18000.0 0.000667 NaN NaN
1 0 0 40.833332 111 3 0 0 1 4000.000000 0 13500.0 0.000222 NaN NaN
2 1 0 37.666668 54 3 0 0 1 3666.666667 0 11300.0 0.033270 121.9896773 4.8039364
3 1 0 42.500000 60 3 0 0 1 2000.000000 0 17250.0 0.048427 96.8536213 4.5732008
4 1 0 21.333334 8 0 0 0 0 2916.666667 0 35000.0 0.016523 48.1916700 3.8751862
All the data is already in numerical format, and there are no categorical features, so we can pass it as is into a decision tree model for training.
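We can sanity-check that claim before training; a minimal sketch, assuming df is the frame loaded above:
# Verify there are no object (string/categorical) columns left in the frame
print(df.dtypes)
assert df.select_dtypes(include='object').empty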
Machine Learning¶
Split dataframe into 2 sets¶
First we will drop two features to limit the size of our tree for visualization purposes.
df.drop(['SPENDING','LOGSPEND '], axis=1, inplace=True)
df.head()
CARDHLDR DEFAULT AGE ACADMOS ADEPCNT MAJORDRG MINORDRG OWNRENT INCOME SELFEMPL INCPER EXP_INC
0 0 0 27.250000 4 0 0 0 0 1200.000000 0 18000.0 0.000667
1 0 0 40.833332 111 3 0 0 1 4000.000000 0 13500.0 0.000222
2 1 0 37.666668 54 3 0 0 1 3666.666667 0 11300.0 0.033270
3 1 0 42.500000 60 3 0 0 1 2000.000000 0 17250.0 0.048427
4 1 0 21.333334 8 0 0 0 0 2916.666667 0 35000.0 0.016523
We will split the set based on whether or not each applicant is a cardholder.
hldrDF = df[df['CARDHLDR'] == 1].drop('CARDHLDR', axis=1)
non_hldrDF = df[df['CARDHLDR'] == 0].drop('CARDHLDR', axis=1)
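As a quick sanity check, the two pieces should add up to the 13,444 observations described above, 10,499 of which are cardholders; a minimal sketch:
# Cardholders and non-cardholders should partition the full dataset
print(hldrDF.shape, non_hldrDF.shape)   # expect 10499 and 2945 rows
assert len(hldrDF) + len(non_hldrDF) == len(df)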
Split data and label¶
Then we will split the data into features (X) and label (y).
X = hldrDF.drop('DEFAULT', axis=1)
y = hldrDF['DEFAULT']
X.head()
AGE ACADMOS ADEPCNT MAJORDRG MINORDRG OWNRENT INCOME SELFEMPL INCPER EXP_INC
2 37.666668 54 3 0 0 1 3666.666667 0 11300.0 0.033270
3 42.500000 60 3 0 0 1 2000.000000 0 17250.0 0.048427
4 21.333334 8 0 0 0 0 2916.666667 0 35000.0 0.016523
5 20.833334 78 1 0 0 0 1750.000000 0 11750.0 0.031323
6 62.666668 25 1 0 0 1 5250.000000 0 36500.0 0.039269
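Before training, it is worth checking the class balance of the label; a small sketch (the counts here come from the root node of the tree output further below: 9,503 non-defaulters vs. 996 defaulters):
# DEFAULT is heavily imbalanced: most cardholders do not default
print(y.value_counts())   # expect roughly 9503 zeros and 996 ones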
Train Decision Tree¶
Now we will train the decision tree on the entire cardholder set (no train/test split yet; validation comes later):
Build model:
tree = DecisionTreeClassifier(criterion='entropy', min_samples_leaf=500)
Train model:
tree.fit(X,y)
DecisionTreeClassifier(criterion='entropy', min_samples_leaf=500)
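To see what the criterion='entropy' setting is measuring, we can reproduce the root-node entropy shown in the tree visualization further below (0.453 for class counts [9503, 996]) by hand; a minimal sketch:
import numpy as np

# Shannon entropy of the root node's class distribution
counts = np.array([9503, 996])
p = counts / counts.sum()
print(round(-(p * np.log2(p)).sum(), 3))   # 0.453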
Feature Importance¶
columns = X.columns.values
importance = tree.feature_importances_
c_imp = {}
# Populate dictionary mapping each feature name to its importance
for col, imp in zip(columns, importance):
    c_imp[col] = imp
# Print in descending order of importance
for col, imp in sorted(c_imp.items(), key=operator.itemgetter(1), reverse=True):
    print(col, ':', imp)
INCOME : 0.509823131790469
INCPER : 0.17450904237752576
AGE : 0.09999638836820762
EXP_INC : 0.08613396328349476
ACADMOS : 0.07011783496278864
MINORDRG : 0.059419639217514184
SELFEMPL : 0.0
OWNRENT : 0.0
MAJORDRG : 0.0
ADEPCNT : 0.0
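One way to make these numbers easier to read is a horizontal bar chart; a minimal sketch reusing the c_imp dictionary built above (matplotlib was installed at the top of the notebook):
import matplotlib.pyplot as plt

# Plot features from least to most important, so the largest bar ends up on top
pairs = sorted(c_imp.items(), key=operator.itemgetter(1))
names, values = zip(*pairs)
plt.barh(names, values)
plt.xlabel('importance')
plt.title('Decision tree feature importances')
plt.show()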
Visualize Tree¶
Install Graphviz from http://graphviz.org/download/ and add it to your PATH, if necessary; on Windows this can be done from Python:
import os
os.environ["PATH"] += os.pathsep + 'C:\\Program Files (x86)\\Graphviz\\bin'
Visualize the tree:
graph = Source( export_graphviz(tree, out_file=None, feature_names=X.columns))
[Rendered decision tree: the root splits on INCOME <= 2100.417 (entropy = 0.453, samples = 10499, value = [9503, 996]), with further splits on AGE, MINORDRG, INCPER, EXP_INC, ACADMOS, and INCOME down to leaves of at least 500 samples.]
Save tree to file¶
graph.format = 'png'
graph.render('tree')  # filename assumed; writes tree.png to the working directory
View as image¶
png_byte = graph.pipe(format='png')
Image(png_byte)
Machine Learning w/ Validation¶
Split into training/testing set¶
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
Build tree¶
tree = DecisionTreeClassifier(criterion='entropy')
Train tree¶
tree.fit(X_train, y_train)
DecisionTreeClassifier(criterion='entropy')
y_pred = tree.predict(X_test)
Evaluate model performance¶
acc = accuracy_score(y_test, y_pred)*100
print('Model Accuracy: {}%'.format(round(acc, 2)))
Model Accuracy: 83.58%
pd.DataFrame(
    confusion_matrix(y_test, y_pred),
    columns=['Predicted Not Default', 'Predicted Default'],
    index=['True Not Default', 'True Default']
)
Predicted Not Default Predicted Default
True Not Default 2160 207
True Default 224 34
Visualize Tree¶
As you can see here, without setting any tree parameters, such as min_samples_leaf, the tree can grow quite large.
graph = Source(export_graphviz(tree, out_file=None, feature_names=X.columns))
png_byte = graph.pipe(format='png')
Image(png_byte)
dot: graph is too large for cairo-renderer bitmaps. Scaling by 0.958183 to fit
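Note that the 83.58% accuracy should be read against the class balance: the test set has 2,367 non-defaulters and only 258 defaulters, and the confusion matrix shows the model identifies few of the actual defaults. Restricting tree growth is one standard response to both the overfitting and the oversized diagram; a minimal sketch (the max_depth value is an assumption for illustration, not from the original notebook):
# Sketch: restrain tree growth, then re-evaluate on the held-out test set
# (max_depth=4 is an assumed value, not tuned)
pruned = DecisionTreeClassifier(criterion='entropy', max_depth=4, min_samples_leaf=500)
pruned.fit(X_train, y_train)
pruned_acc = accuracy_score(y_test, pruned.predict(X_test)) * 100
print('Pruned Model Accuracy: {}%'.format(round(pruned_acc, 2)))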