Philip Szymanski
EM 623 Midterm 2016
Phase I – Business Understanding
The purpose of this study is to analyze the dataset diabetic_data.csv and create a model to
predict under what conditions a patient is readmitted. This will be done using Rattle and Excel. A
decision tree will be created using the results of the following data analysis.
Phase II – Data Understanding
The dataset was first loaded into Rattle. Using the ‘Explore’ tab in Rattle, I was able to
determine that the dataset is made up of 41 variables with 20,497 observations. The variables include
things like race, age, lab results, and medications, all of which relate in some way to diabetes.
According to Rattle, 20,115 of the observations have missing values, and there are a total of
19,847 NAs. This means that the dataset will require a lot of cleaning.
At the moment the dataset contains too many variables for outlier or distribution analysis to be
feasible; it needs to be cleaned and reduced first.
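For reference, the same overview can be produced directly in R, which is what Rattle runs underneath. This is only a sketch; it assumes the file is in the working directory and that missing values are coded as NA or '?':

    # Load the dataset and get a quick overview (equivalent of Rattle's 'Explore' tab)
    df <- read.csv("diabetic_data.csv", na.strings = c("NA", "?"))
    dim(df)                   # 20,497 observations x 41 variables
    str(df)                   # variable names and types
    sum(is.na(df))            # total number of NAs
    sum(!complete.cases(df))  # observations with at least one missing value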
Phase III – Data Preparation
To begin cleaning, variables with more than 50% of their data missing and variables with a
constant value were removed from the dataset in Rattle. This was done by changing the variable role to
‘Ignore’ on the ‘Data’ tab and then deleting the ignored variables using the ‘Transform’ tab. Next, the
dataset was exported to Excel and the rows (observations) were checked for missing data using Excel’s
various ‘COUNT’ functions. Fortunately, none of the rows were missing more than 50% of their data.
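In R, this cleaning step could look something like the following sketch (the 50% thresholds are the ones used above):

    # Drop variables with more than 50% missing values
    miss_frac <- colMeans(is.na(df))
    df <- df[, miss_frac <= 0.5]

    # Drop constant variables (only one distinct non-missing value)
    is_constant <- sapply(df, function(x) length(unique(na.omit(x))) <= 1)
    df <- df[, !is_constant]

    # Check whether any row is missing more than 50% of its values
    sum(rowMeans(is.na(df)) > 0.5)  # 0 here, so no rows need to be dropped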
The variable ‘readmitted’ was altered in Excel, using ‘IF’ formulas, so that it contained only two
values: ‘yes’ or ‘no’. Rows containing the value ‘NA’ for this variable were deleted (only the last row
had this problem).
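The same recoding can be done in R. A sketch, assuming the raw file uses the codes "NO", "<30" and ">30" as in the original UCI release of this dataset:

    # Drop rows where 'readmitted' is missing (only the last row here)
    df <- df[!is.na(df$readmitted), ]
    # Collapse the three raw codes into a two-level yes/no factor
    df$readmitted <- factor(ifelse(df$readmitted == "NO", "no", "yes"))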
Variables like encounter_id and patient_nbr were deleted because they are identifiers with no
connection to the variable ‘readmitted’.
Next, variables with a large number of unique entries were analyzed for outliers and for
correlation with the variable ‘readmitted’. The variables diag_1 through diag_3 all had a large number
of unique entries; their distributions were as follows:
[Figure: distribution plots of diag_1, diag_2, and diag_3]
These distributions show the existence of many outliers. The variables were then all converted
to numerical values so that their relevance could be examined through correlation:
[Figure: correlations of diag_1 through diag_3 with ‘readmitted’]
As can be seen, there is practically no correlation between the diag_1 through diag_3 variables and
‘readmitted’, so these variables were deleted for having too many unique categorical entries.
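A sketch of this check in R. In the original dataset these fields hold ICD-9 codes, some of which (the 'V' and 'E' codes) are non-numeric and become NAs on conversion, hence the suppressWarnings:

    # Convert the diagnosis codes to numbers and code 'readmitted' as 1/2
    diag1_num <- suppressWarnings(as.numeric(as.character(df$diag_1)))
    readm_num <- as.numeric(df$readmitted)
    cor(diag1_num, readm_num, use = "complete.obs")  # close to zero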
Next, all the variables with more than 10 unique entries were analyzed and compared with
‘readmitted’. It was found that ‘number_inpatient’ was quite strongly correlated with ‘readmitted’,
and so this variable was deleted so as not to have adverse effects on the model. The variable
time_in_hospital was found to be highly correlated with num_lab_procedures and num_medications, which
makes sense: the longer a patient is in the hospital, the more medications and procedures they
undergo. num_lab_procedures and num_medications were therefore deleted.
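These pairwise checks are easy to reproduce in R; a sketch:

    # Correlation of candidate predictors with the (numeric-coded) target
    readm_num <- as.numeric(df$readmitted)
    cor(df$number_inpatient, readm_num, use = "complete.obs")

    # Correlation between length of stay and the procedure/medication counts
    cor(df$time_in_hospital, df$num_lab_procedures, use = "complete.obs")
    cor(df$time_in_hospital, df$num_medications, use = "complete.obs")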
Next, all of the medication and lab-result variables were analyzed. They had little to no correlation
with the variable ‘readmitted’. Over 80% of the data under the variables max_glu_serum and A1Cresult
was ‘None’, and so these variables were deleted. Similarly, over 90% of the data contained in the
variables from metformin through glyburide.metformin was “No”, so these variables were deleted as well.
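One way to find such near-constant columns in R (a sketch; the 80%/90% cutoffs are the ones used above, and the flagged list should be reviewed before deleting, since other yes/no variables could be caught by the same test):

    # Share of each value, e.g. over 80% 'None' for the two lab variables
    prop.table(table(df$max_glu_serum))
    prop.table(table(df$A1Cresult))

    # Flag columns that are more than 90% 'No'
    mostly_no <- sapply(df, function(x) mean(x == "No", na.rm = TRUE) > 0.9)
    names(df)[mostly_no]  # review this list before deleting
    df <- df[, !mostly_no]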
Now that the number of variables was manageable, they were all correlated with one another.
If any two variables were strongly correlated, one of them was removed so as to make the data
even more manageable. For example, admission_source_id and admission_type_id were correlated, so
one of them was deleted. The variable medical_specialty was also deleted because it had several
thousand missing values and was not strongly correlated with anything.
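The full correlation matrix can be produced in R in one step; a sketch using 0.7 as an illustrative cutoff for "strongly correlated":

    # Pairwise correlations across the remaining numeric variables
    num_vars <- df[sapply(df, is.numeric)]
    cor_mat  <- cor(num_vars, use = "pairwise.complete.obs")
    round(cor_mat, 2)

    # List the strongly correlated pairs (|r| > 0.7)
    which(abs(cor_mat) > 0.7 & upper.tri(cor_mat), arr.ind = TRUE)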
Phase IV – Modelling
The remaining variables were used to create a decision tree in Rattle. The partition split was set
at 70/30. The following decision tree was created:
[Figure: decision tree]
According to Rattle, if the number of diagnoses (number_diagnoses) is greater than or equal to 6.5,
the discharge_disposition_id is less than 6.5, and the admission_source_id is greater than or equal
to 6.5, the patient will be readmitted (2 = readmitted).
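Rattle builds this model with the rpart package underneath, so the tree can be reproduced in plain R; a sketch with an arbitrary random seed:

    library(rpart)

    # 70/30 train/test partition, as set in Rattle
    set.seed(42)
    train_idx <- sample(nrow(df), size = round(0.7 * nrow(df)))
    train <- df[train_idx, ]
    test  <- df[-train_idx, ]

    # Classification tree predicting 'readmitted' from all remaining variables
    tree <- rpart(readmitted ~ ., data = train, method = "class")
    print(tree)  # text version of the splits shown in the figure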
Phase V – Evaluation
The results of the decision tree are not great: its prediction of whether or not a patient is
readmitted is only a little more accurate than flipping a coin. Unfortunately, the inaccuracy of the
model is largely due to time constraints.
The error for this decision tree is about 0.43 according to the error matrices generated, and the
ROC curve is as follows:
[Figure: ROC curve]
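Both the error matrix and the ROC curve can be regenerated in R; a sketch using the ROCR package, which is the package Rattle itself uses for its ROC plots:

    # Error matrix on the 30% test partition
    pred <- predict(tree, newdata = test, type = "class")
    cm <- table(Predicted = pred, Actual = test$readmitted)
    cm
    1 - sum(diag(cm)) / sum(cm)  # overall error rate, about 0.43 here

    # ROC curve from the predicted class probabilities
    library(ROCR)
    prob <- predict(tree, newdata = test, type = "prob")[, "yes"]
    plot(performance(prediction(prob, test$readmitted), "tpr", "fpr"))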
Looking back, fewer of the variables should have been deleted. This could have produced a larger
and more accurate decision tree. In practice I would go back and prepare the data again in the hope
of creating a more accurate tree.
Phase VI – Deployment
The model can now be applied to new, unlabeled datasets containing the same variables to predict
whether a patient will be readmitted or not.
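Scoring new data would then be a one-line prediction in R; a sketch, where new_patients.csv is a hypothetical file with the same columns as the cleaned training data:

    # Predict readmission for a new batch of patients
    new_patients <- read.csv("new_patients.csv", na.strings = c("NA", "?"))
    new_patients$predicted <- predict(tree, newdata = new_patients, type = "class")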