EM623-Week5 DataMining ProjectSample
Carlo Lipizzi
clipizzi@stevens.edu
SSE
2016
Machine Learning and Data Mining
Data Mining Project Template
Contents
• Project Goals and Conditions
• CRISP
• Business Understanding
• Data Understanding
• Data Preparation
• Modeling
• Evaluation
• Deployment
• Practical Results – Conclusions
• Attachments
Project Goals and Conditions
• What are the project goals? What is the key question you are
required to answer?
• Are there any conditions limiting or otherwise defining the project,
such as limited access to data, outdated data, or time constraints?
• A brief description of the expected results may be added
CRISP-DM
Background Info & Definition*
• Cross-Industry Standard Process for Data Mining (CRISP-DM),
developed in 1996
• Developed to fit data mining into general business strategy
• The process is vendor- and tool-neutral
• Non-proprietary and freely available
• Data mining projects follow an iterative, adaptive life cycle
consisting of 6 phases
*: from D. Larose – Discovering Knowledge in Data
Business Understanding
• Definition*
– Define business requirements and objectives
– Translate objectives into data mining problem definition
– Prepare initial strategy to meet objectives
• You want to be sure to clearly describe the business
needs and the steps to address them
*: from D. Larose – Discovering Knowledge in Data
Data Understanding
• Definition*
– Collect data
– Assess data quality
– Perform exploratory data analysis (EDA)
• Overall data description: sources, organization, key
characteristics (sensor/human generated, reliable/unreliable
source, …)
• Here you run all the descriptive statistical tests that make sense for
the specific case, describing the different steps and their specific
meanings
*: from D. Larose – Discovering Knowledge in Data
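As an illustration of the data quality and descriptive steps above, here is a minimal sketch using only the Python standard library. The `readings` list and the `describe` helper are hypothetical, invented for this example; in a real project you would run the equivalent checks on your own variables with the tool of your choice.

```python
import statistics

# Hypothetical sample: sensor readings, with None marking a missing value.
readings = [21.5, 22.0, None, 21.8, 35.0, 22.1, None, 21.9]


def describe(values):
    """Return basic data-quality and descriptive statistics for one variable."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "stdev": statistics.stdev(present),
        "min": min(present),
        "max": max(present),
    }


summary = describe(readings)
for name, value in summary.items():
    print(f"{name}: {value}")
```

A large gap between mean and median, or a maximum far from the rest of the values, is the kind of finding worth describing on this slide, together with what it means for the business case.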
Data Preparation
• Definition*
– Cleanse, prepare, and transform data set
– Prepares for modeling in subsequent phases
– Select cases and variables appropriate for analysis
• First define the steps you are going to perform and justify them
(e.g., if you normalize, explain why)
• Here you perform all the data transformations applicable to the
case: missing/miscalculated/misplaced values, outliers,
normalization
• Describe the final dataset (format, new number of records, new
variables, …)
*: from D. Larose – Discovering Knowledge in Data
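The three transformations named above (missing values, outliers, normalization) can be sketched as follows. This is only one possible choice for each step, assumed for illustration: median imputation, the 1.5 × IQR rule for outliers, and min-max normalization. The `raw` data is hypothetical, and your project may call for different techniques.

```python
import statistics

# Hypothetical raw variable; None marks a missing value.
raw = [21.5, 22.0, None, 21.8, 35.0, 22.1, None, 21.9]

# Step 1: replace missing values with the median of the observed values.
observed = [v for v in raw if v is not None]
median = statistics.median(observed)
filled = [median if v is None else v for v in raw]

# Step 2: drop outliers that fall outside 1.5 * IQR from the quartiles.
q1, _, q3 = statistics.quantiles(filled, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
cleaned = [v for v in filled if lower <= v <= upper]

# Step 3: min-max normalization, rescaling the values to the range [0, 1].
lo, hi = min(cleaned), max(cleaned)
normalized = [(v - lo) / (hi - lo) for v in cleaned]

print(f"records before: {len(raw)}, after: {len(cleaned)}")
print(normalized)
```

Reporting the record count before and after each step, as the last lines do, gives you the numbers needed to describe the final dataset.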
Modeling
• Definition*
– Select and apply one or more modeling techniques
– Calibrate model settings to optimize results
– If necessary, additional data preparation may be required
• Explain why you selected one model over another
• Explain the parameter settings you chose (high-level description
only)
• Describe the first results and any adjustments you made to the model
• Describe any adjustments you made back to the data
• Describe final results
*: from D. Larose – Discovering Knowledge in Data
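To make "calibrate model settings" concrete, here is a small sketch that tries several values of one setting and keeps the one that scores best on held-out data. The model (a one-feature k-nearest-neighbors classifier), the data, and the candidate values of `k` are all assumptions chosen for illustration, not a recommended configuration.

```python
from collections import Counter

# Hypothetical labeled data: (feature value, class). The point (2.0, "high")
# plays the role of a noisy record.
train = [(1.0, "low"), (1.3, "low"), (1.6, "low"), (2.0, "high"),
         (2.9, "high"), (3.2, "high"), (3.6, "high")]
validation = [(1.1, "low"), (1.9, "low"), (2.8, "high"), (3.4, "high")]


def knn_predict(x, data, k):
    """Predict the majority class among the k training points closest to x."""
    nearest = sorted(data, key=lambda row: abs(row[0] - x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(data, reference, k):
    """Share of records in data that are classified correctly."""
    hits = sum(knn_predict(x, reference, k) == label for x, label in data)
    return hits / len(data)


# Try each candidate setting and keep the one with the best validation score.
scores = {k: accuracy(validation, train, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(f"k={k}: validation accuracy {score:.2f}")
print(f"selected k={best_k}")
```

A table of settings tried and scores obtained, like the one this prints, is usually enough for the high-level description asked for on this slide.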
Evaluation
• Definition*
– Evaluate one or more models for effectiveness
– Determine whether defined objectives achieved
– Make decision regarding data mining results before deploying to field
• Some models can be evaluated using part of the data you have
(supervised learning). In this case, describe the results, using reliable
metrics (e.g.: error/confusion matrix)
• If the model is unsupervised (no data for testing), evaluate your results
using a reliable key performance indicator (KPI) from outside the
perimeter of your data, possibly drawing on your knowledge of the domain
• Read the results with business sense and provide your comments
*: from D. Larose – Discovering Knowledge in Data
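For the supervised case, a confusion matrix and the metrics derived from it can be computed as in the sketch below. The `actual` and `predicted` labels are hypothetical, and "yes" is assumed to be the positive class.

```python
from collections import Counter

# Hypothetical test-set labels and the corresponding model predictions.
actual    = ["yes", "yes", "no", "no", "yes", "no",  "no", "yes"]
predicted = ["yes", "no",  "no", "no", "yes", "yes", "no", "yes"]

# Count each (actual, predicted) pair; these are the confusion matrix cells.
counts = Counter(zip(actual, predicted))
tp = counts[("yes", "yes")]  # true positives
fn = counts[("yes", "no")]   # false negatives
fp = counts[("no", "yes")]   # false positives
tn = counts[("no", "no")]    # true negatives

accuracy = (tp + tn) / len(actual)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print("             pred yes  pred no")
print(f"actual yes   {tp:8d}  {fn:7d}")
print(f"actual no    {fp:8d}  {tn:7d}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Which metric matters most depends on the business cost of each kind of error, so comment on the numbers rather than only reporting them.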
Deployment
• Definition*
– Make use of models created
– Simple deployment: generate report
– Complex deployment: implement additional data mining effort in
another department
– In business, customer often carries out deployment based on model
• If the output is a model to be used in real life, be sure it can be exported
in a format that can work in the target environment
• If the output is a model that is not going to run in real life (e.g., proof of
concept, demo), produce all the reports that may be necessary to fully
explain the model and its value in this specific case
*: from D. Larose – Discovering Knowledge in Data
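One way to check that a model can be exported in a format the target environment understands is to write it out and load it back. The sketch below assumes a very simple hypothetical model (a single threshold on one feature) and uses JSON because most environments can read it; the file name, the `model` fields, and the `predict` helper are all invented for this example.

```python
import json
import os
import tempfile

# Hypothetical trained model: classify a reading as "high" at or above a threshold.
model = {
    "type": "threshold_classifier",
    "feature": "reading",
    "threshold": 2.45,
    "labels": ["low", "high"],
}


def predict(model, value):
    """Apply the exported model to a single value."""
    low, high = model["labels"]
    return high if value >= model["threshold"] else low


# Export the model to a plain-text file that other tools can parse.
path = os.path.join(tempfile.gettempdir(), "model.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(model, f, indent=2)

# Reload it, as the target environment would, and confirm it behaves the same.
with open(path, encoding="utf-8") as f:
    restored = json.load(f)

print(predict(restored, 3.0), predict(restored, 1.0))
```

More complex models usually need the export format offered by the tool that built them; the round-trip check stays the same.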
Conclusions
• This is the final recap: you briefly describe the whole
process, from the business need, to the data
collected, to the model you built, to the results you
obtained
• Describe the advantages of using the model,
compared to no model or previous models
• Describe possible limitations of the model and possible
future developments
Attachments
• All the tables and graphs will go here
• Add only the outputs that can support the case you
described in previous slides
• Outputs have to be readable (no 1M-row table
on one page)