PREDICTIVE MODELS FOR PEOPLE DISCHARGED FROM SUBSTANCE ABUSE TREATMENT CENTERS
(VULNERABILITY TO ARREST, RE-TREATMENT, AND PSYCHOLOGICAL PROBLEMS)

COURSE: EM 623

ASSIGNMENT: FINAL PROJECT

BY: SALEM ALYAMI

FOR: PROF. CARLO LIPIZZI

INTRODUCTION AND BUSINESS UNDERSTANDING

¡ Substance abuse treatment facilities usually report some information about admitted people to state
administrative data systems controlled by SAMHSA.

¡ We will use the SAMHSA data archive to support its functions and to provide insights to treatment
centers and authorities across the nation by deriving information that could add value and help others.

¡ We will build models that can predict the circumstances under which a person discharged from a
treatment facility could be checked into a treatment facility again or arrested by authorities.

¡ We would also like to predict when an admitted or discharged person would be vulnerable to
psychological problems.

DATA UNDERSTANDING

¡ The SAMHSA data provide information on the demographic and substance abuse characteristics of
substance abuse treatment discharges, and their corresponding admissions, for clients aged 12 and
older in treatment facilities.

¡ The dataset contains over 1,050,000 records and 64 variables that range from demographic and
abused-substance characteristics to administrative information such as insurance, payment, and
referrals.

DATA PREPARATION
DATA SIZE:

¡ Considering the large data size and the limitations of the computing methods used, we decided to
analyze a random sample of the data instead of the entire set of records.

¡ We created a new column containing randomly generated numbers, sorted the records by it, and took
the first 13% of the data to use in our study (132,500 records); this step is sketched in R below.
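
¡ A minimal sketch of this sampling step in R; the file name and the use of read.csv are assumptions for
illustration, since the project performed this step in Excel:

    set.seed(2013)                                    # reproducible sample
    teds <- read.csv("TEDS_D_2013.csv")               # hypothetical export of the full discharge file
    n_sample <- round(0.13 * nrow(teds))              # roughly 13% of the records
    teds_sample <- teds[sample(seq_len(nrow(teds)), n_sample), ]
    write.csv(teds_sample, "TEDS_D_2013_sample.csv", row.names = FALSE)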

DATA PREPARATION
MISSING VALUES:

¡ We set a threshold of 45% missing data: any attribute above it was excluded, while attributes below it
had their missing values replaced with appropriate random values for the same variable.

¡ Most of the variables are missing less than 4% of their data, except for EDUCATION, DAYWAIT,
DIAGNOSTIC, and PSYPROB, which were missing 12%, 31%, 37%, and 19% respectively.

¡ We generated random values that fit the distribution of each variable using an Excel formula of the
form =IF(I4="", INT(NORM.INV(RAND(), AVERAGE(I:I), STDEV(I:I))), I4); an equivalent R sketch follows.
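
¡ The same fill-in logic as a sketch in R, assuming missing entries are coded as NA (the project used the
Excel formula above; floor() mirrors Excel's INT):

    # Replace missing values in a column with rounded-down draws from a normal
    # distribution fitted to the observed values of that column.
    fill_missing <- function(x) {
      miss <- is.na(x)
      x[miss] <- floor(rnorm(sum(miss),
                             mean = mean(x, na.rm = TRUE),
                             sd   = sd(x, na.rm = TRUE)))
      x
    }
    teds_sample$EDUCATION <- fill_missing(teds_sample$EDUCATION)   # one of the affected columns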

DATA PREPARATION
VARIABLES VALUES:

¡ Some categorical variables have values that form a consecutive series such as 1, 2, 3, 4, 5, but others
have inconsistent series such as 1, 2, 3, 4, 19, 21, 23.

¡ We rewrote those variables, replacing the inconsistent values with a more coherent series using Excel
IF functions; a sketch of the same recoding in R follows. The affected variables are Race, ROUTE1,
ROUTE2, and DSMCRIT.
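
¡ A sketch of the same recoding in R; the mapping shown is illustrative only, not the real codebook
values for these variables:

    # Map the inconsistent code series onto a compact 1..7 series.
    recode_map <- c("1" = 1, "2" = 2, "3" = 3, "4" = 4, "19" = 5, "21" = 6, "23" = 7)
    teds_sample$DSMCRIT <- unname(recode_map[as.character(teds_sample$DSMCRIT)])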

DATA PREPARATION
OUTLIERS:

¡ We used KNIME to test the data for any
outliers

¡ Since most of our data are categorical or binary, looking for outliers is not as meaningful as it would be
for continuous data.

¡ We considered using the outlier removal node for the following variables: Race, Ethnic, LOS (length of
stay in days), and DSMCRIT (client's diagnosis), because they showed some values that lie near the
extreme limits of their data range; an equivalent check in R is sketched below.
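
¡ A quick check equivalent to the KNIME outlier node, sketched in R with the 1.5 * IQR rule on LOS (the
KNIME workflow remains the tool actually used in the project):

    q   <- quantile(teds_sample$LOS, c(0.25, 0.75), na.rm = TRUE)    # quartiles of length of stay
    iqr <- q[2] - q[1]
    out <- teds_sample$LOS < q[1] - 1.5 * iqr | teds_sample$LOS > q[2] + 1.5 * iqr
    sum(out, na.rm = TRUE)                                           # number of LOS values flagged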

DATA PREPARATION
EDITING AND CREATING VARIABLES:

¡ We are focusing on three target variables: ARRST (number of arrests: 0, 1, 2), NOPRIOR (number of
previous treatments: 0-5), and PSYPROB (whether the person has psychological problems: 0, 1).

¡ To have these variables ready for modeling, we need them as binary values, so we used Excel to
convert the arrest variable to 0 if the person was not arrested before and 1 if they were (regardless of
the number of arrests); the same recoding is sketched in R after the example table below.

¡ We did the same with the number of prior treatments: it becomes 0 or 1 depending on whether the
person had any prior treatment.

¡ PSYPROB is already in binary format.

NOPRIOR Binary_PRIOR
2 1
1 1
5 1
0 0
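
¡ A sketch of this binary recoding in R; the project did it in Excel, and the Binary_ARRST column name
here is only an illustration, matching the Binary_PRIOR example above:

    # Assumes ARRST and NOPRIOR hold the original counts; names follow the slide text.
    teds_sample$Binary_ARRST <- ifelse(teds_sample$ARRST   > 0, 1, 0)   # 1 = arrested at least once
    teds_sample$Binary_PRIOR <- ifelse(teds_sample$NOPRIOR > 0, 1, 0)   # 1 = at least one prior treatment
    # PSYPROB is already 0/1, so it is left unchanged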

DATA PREPARATION
EDITING AND CREATING VARIABLES:

¡ We have binary (0, 1) flag variables indicating whether each substance was reported as the primary,
secondary, or tertiary substance of abuse at the time of admission.

¡ Each record has 18 such flag variables with values of 0 or 1, so we decided to combine all the flags into
one variable that can replace them all. We call it NUMSUBS, and it identifies the number of substances
abused by that person; the roll-up is sketched below.

¡ We also converted it to a two-level variable: 1 for only one substance and 2 for more than one.
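
¡ A sketch of the flag roll-up in R, assuming the 18 flag column names end in "FLG"; the exact names in
the file may differ:

    flag_cols <- grep("FLG$", names(teds_sample), value = TRUE)         # the 18 substance flags
    teds_sample$NUMSUBS     <- rowSums(teds_sample[, flag_cols])        # count of reported substances
    teds_sample$NUMSUBS_BIN <- ifelse(teds_sample$NUMSUBS > 1, 2, 1)    # 1 = single substance, 2 = more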

DATA PREPARATION
CORRELATION MATRIX:

¡ We cannot read much from the full matrix, but we can clearly see some correlation between certain
variables.

¡ By looking more deeply into the nature of some variables we can understand the correlation
relationships, especially for the binary variables.

¡ For example, the number of substances correlates with the binary (1, 0) flags for the second and third
substances.

¡ We can also see a strong correlation between the number of substances and the ALCDRUG variable,
which classifies substance abuse type as alcohol only, other drugs only, or alcohol and other drugs; a
small sketch of this check in R follows.
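
¡ The correlation screen can be reproduced in R with a plain correlation matrix (Rattle offers the same
view from its Explore tab); NUMSUBS is the variable created in the earlier sketch:

    num_cols <- sapply(teds_sample, is.numeric)                          # keep numeric columns only
    cor_mat  <- cor(teds_sample[, num_cols], use = "pairwise.complete.obs")
    round(cor_mat["NUMSUBS", "ALCDRUG"], 2)                              # the pair discussed above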

DATA PREPARATION
ELIMINATING VARIABLES:

Originally we had 64 variables, but we reduced them by eliminating unnecessary variables such as:

¡ Record-identifying variables (CASE_ID)

¡ Year of collected data, since all our data are for 2013 (DISYR)

¡ Variables with more than 45% missing values, such as DETNLF, PREG, DETCRIM, details about SUB2 and SUB3,
IDU, HLTHINS, and PRIMPAY

¡ Geographic zone codes, since we cannot use them, such as STFIPS, CBSA, REGION, DIVISION, and
SERVSETD

We ended up with 31 variables and 124,495 records after data cleaning and preparation, with
well-defined values that are more useful for building our models.

MODELING

¡ To build our predictive models we use Rattle (an R front end) to construct decision tree models.
Decision trees are among the most common data mining models and produce easy-to-understand
results. In our case we use the traditional algorithm, which is implemented by the rpart package.

¡ For each variable we are working on, we must make sure it is marked as the target variable of the
model in the Data screen of Rattle.

¡ When building the tree there are some parameters that control the size and growth of the tree, such
as minsplit, minbucket, maxdepth, and the complexity parameter (cp).

¡ We tuned these parameters for each target variable until we reached the optimal setting for each one;
the equivalent rpart call is sketched below.
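
¡ A sketch of the rpart call that Rattle's traditional tree runs; the control values below are placeholders,
not the tuned settings, and the sketch assumes the count variables were already replaced by their
binary versions during data preparation:

    library(rpart)
    teds_sample$Binary_ARRST <- factor(teds_sample$Binary_ARRST)        # classification target
    arrest_tree <- rpart(
      Binary_ARRST ~ . - Binary_PRIOR - PSYPROB,                        # one target variable at a time
      data    = teds_sample,
      method  = "class",
      control = rpart.control(minsplit = 20, minbucket = 7, maxdepth = 5, cp = 0.001)
    )
    printcp(arrest_tree)                                                 # complexity table for pruning
    plot(arrest_tree); text(arrest_tree, use.n = TRUE, cex = 0.7)        # draw the tree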

MODELING
A) ARRESTING DISCHARGED PEOPLE MODEL AND EVALUATION

¡ This model should predict, based on previous records, the circumstances that could lead to a person
discharged from a treatment facility being arrested by authorities.

MODELING
A) ARRESTING DISCHARGED PEOPLE MODEL AND EVALUATION

Results:

¡ According to the data, only 10% of the people have been arrested, which makes it difficult for the
prediction model to perform well.

¡ People who are 35 or older are more vulnerable to being arrested.

¡ There is a higher chance of arrest if they left the treatment facility against the approval of doctors and
professionals, or if their treatment was terminated by the center.

¡ People who are not living independently, or who are homeless, are more likely to be arrested than
those who live independently.

¡ It also predicted a higher chance of arrest for people suffering from problems with more than two substances.

¡ People referred to the treatment centers by community centers, courts, criminal justice departments,
DUI, and DWI programs have a higher chance of being arrested, which makes sense!

MODELING
A) ARRESTING DISCHARGED PEOPLE MODEL AND EVALUATION

Evaluation

¡ The arrest model is not effective at predicting when a discharged person will be arrested, with an
error of 99% for that class.

¡ It predicted very well when a person would not be arrested, with an error of 0%.

¡ This may make sense: having been arrested in the past is not common for everyone in the data.

¡ We might also question the credibility of the arrest variable, since a person might prefer to hide such
information.

MODELING
A) ARRESTING DISCHARGED PEOPLE MODEL AND EVALUATION

The ROC curve:

¡ AUC value of 0.62

¡ Not perfect, but acceptable considering our context of analysis, since we are not doing a medical
diagnosis or clinical test.

¡ The overall error is 9.9%; the ROC/AUC computation is sketched below.
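
¡ A sketch of the ROC/AUC check using the ROCR package, assuming a 70/30 training/validation split
(Rattle produces the same curve from its Evaluate tab):

    library(rpart)
    library(ROCR)
    set.seed(42)
    teds_sample$Binary_ARRST <- factor(teds_sample$Binary_ARRST)        # ensure a classification target
    idx   <- sample(nrow(teds_sample), 0.7 * nrow(teds_sample))         # 70/30 split (assumption)
    train <- teds_sample[idx, ]
    valid <- teds_sample[-idx, ]
    fit   <- rpart(Binary_ARRST ~ . - Binary_PRIOR - PSYPROB, data = train, method = "class")
    prob  <- predict(fit, newdata = valid, type = "prob")[, 2]          # probability of class "1"
    pred  <- prediction(prob, valid$Binary_ARRST)
    performance(pred, "auc")@y.values[[1]]                              # area under the ROC curve
    plot(performance(pred, "tpr", "fpr"))                               # the ROC curve itself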

MODELING
B) PRIOR TREATMENT MODEL AND EVALUATION

¡ This model predicts the circumstances that could lead to a person checking in again at a treatment
center.

MODELING
B) PRIOR TREATMENT MODEL AND EVALUATION

Results:

¡ Most people in the data have a history of prior treatments.

¡ The model first splits on age, above or below 20 years old.

¡ People with an abuse problem involving a less dangerous substance are more likely to avoid further
treatment.

¡ People addicted to dangerous substances like heroin, cocaine, or meth are more likely to need further
treatment.

¡ Adults who abuse many substances and have been diagnosed with a substance disorder and
dependency are more likely to be checked in again at treatment facilities.

¡ The same also goes for veterans who consume substances more frequently, such as more than once
a week.

MODELING
B) PRIOR TREATMENT MODEL AND EVALUATION

Evaluation

¡ The tree does well at predicting cases with previous treatments, with an error of 4.6%.

¡ It predicts poorly when a person will not be checked in again at a treatment facility or center, with an
error of 83.9%.

¡ The AUC value of 0.63 is better than the arrest model's, but still not perfect.

¡ We can accept such a value, but we need to improve it in future analyses.

¡ The overall error of the tree model is 25.5%.

MODELING
C) PSYCHOLOGICAL PROBLEMS VULNERABILITY MODEL AND EVALUATION

¡ This prediction model helps us understand when we should anticipate that a person will experience
psychological problems.

MODELING
C) PSYCHOLOGICAL PROBLEMS VULNERABILITY MODEL AND EVALUATION

Results:

¡ This tree starts by splitting into two branches based on gender.

¡ For both men and women, as the number of previous treatments decreases, the chance of having
psychological problems increases, and the opposite is true as well. This makes sense!

¡ Extensive consumption of substances, more than 3 to 6 times per week, leads to problems.

¡ Divorced and separated women are less likely to have such problems than married ones.

¡ Source of income:

¡ Women who rely on pensions or public assistance are less likely to have such problems, perhaps due
to their trust in the source of income.

¡ Unlike men, who are not safe from these psychological problems unless they have a reliable and
stable source of income, a salary to be more precise.

MODELING
C) PSYCHOLOGICAL PROBLEMS VULNERABILITY MODEL AND EVALUATION

Evaluation

¡ This is a good predictive model for the chance of having psychological problems, with an error of 3.1%.

¡ Predicting not having problems has an error of 93.2%.

¡ Predicting that a person has psychological problems is easier, especially in a sample collected from
treatment centers.

¡ The AUC value of 0.6 is not perfect either, and is the worst of the three models we built.

¡ Good and acceptable AUC values in social science are often lower than in medical or clinical
analyses.

¡ Overall, the error of the tree model for predicting psychological problems is 30.5%.

DEPLOYMENT

¡ In future analyses, a better assessment for predicting whether a person will be checked in again at a
treatment center could be achieved with further details about treatment courses, results, and the
doctor's recommendations at discharge.

¡ We could also improve all of our prediction models with details about annual income, family members,
the motives that led to drug abuse, all drugs used by admitted and discharged people (not only those
that led to their admission), and more personal characteristics.

¡ An extended version of this analysis, with the proper data available, could be used by treatment
centers during admissions and discharges to help with treatment courses or discharge
recommendations.

¡ The psychological problems model could be improved and used by official authorities such as the FBI,
for example in gun control watch lists, to prevent selling guns to people with psychological or mental
problems and to avoid massacre incidents like those we witnessed this year in Las Vegas and at the
Texas church.

CONCLUSION

¡ Predictive modeling in social science is always a tough call.

¡ The three models we built may have their flaws, considering the nature of the data we have, but we
did our best to prepare the data for analysis and to derive the best outcomes we could.

¡ We really enjoyed doing this analysis, and we hope these models can add value or at least provide a
better understanding and an initial prototype of how more developed and complex models could work
and look.

The End.