Discovering Knowledge in Data
MD-MIS 637 – Fall 2020
MIS 637
Data Analytics and Machine Learning
Data Science & Analytics Lifecycle: Six Phases
What is Data Analytics and ML?
“…the process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of data…” (Gartner Group)
“…the analysis of observational data sets to find unsuspected relationships and to summarize data in novel ways…” (Hand et al.)
“…is an interdisciplinary field bringing together techniques from machine learning, pattern recognition, statistics, databases, and visualization…” (Cabena et al.)
The Need for Human Direction of Data Analytics & ML
Some early data analytics/mining definitions described process as “automatic”
“…this has misled many people into believing data analytics/mining is a product that can be bought rather than a discipline that must be mastered.” (Berry, Linoff)
Automation no substitute for human input
Data Analytics/Mining is easy to do badly
Understanding statistical and mathematical model structures of underlying software required
Humans need to be actively involved in every phase of data mining process
Task of data analytics and machine learning should be integrated into human process of problem solving
Cross Industry Standard Process: CRISP-DM/DA
Cross-Industry Standard Process for Data Mining/Analytics (CRISP-DM) developed in 1996
Contributors include DaimlerChrysler, SPSS, and NCR
Developed to fit data mining into general business strategy
Process vendor and tool-neutral
Non-proprietary and freely available
Data mining projects follow iterative, adaptive life cycle consisting of 6 phases
Phase sequences are adaptive
Next, Figure 1.1 illustrates CRISP-DM lifecycle
Cross Industry Standard Process: CRISP-DM (cont’d)
Iterative CRISP-DM process shown in outer circle
Most significant dependencies between phases shown
Next phase depends on results from preceding phase
Returning to earlier phase possible before moving forward
[Figure 1.1: CRISP-DM lifecycle, cycling through Business/Research Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment phases]
Cross Industry Standard Process: CRISP-DM (cont’d)
(1) Business/Research Understanding Phase
Define business/research requirements and objectives
Translate objectives into data mining problem definition
Prepare initial strategy to meet objectives
(2) Data Understanding Phase
Collect the data
Assess data quality
Perform exploratory data analysis (EDA)
(3) Data Preparation Phase
Cleanse, prepare, and transform data set
Prepares for modeling in subsequent phases
Select “cases” and “variables” appropriate for analysis
Cross Industry Standard Process: CRISP-DM (cont’d)
(4) Modeling Phase
Select and apply one or more modeling techniques
Calibrate model settings to optimize results
If necessary, additional data preparation may be required
(5) Evaluation Phase
Evaluate one or more models for effectiveness
Determine whether defined objectives achieved
Make decision regarding data mining results before deploying to field
Cross Industry Standard Process: CRISP-DM (cont’d)
(6) Deployment Phase
Make use of models created
Simple deployment: generate report
Complex deployment: implement additional data mining effort in another department
In business, customer often carries out deployment based on model
See http://www.crisp-dm.org for more information
Case Study 1
Analyzing Automobile Warranty Claims
Business Understanding
Objectives include improving customer satisfaction and reducing costs associated with warranty claims
Manufacturing engineers consulted to formulate business problems
Data analytics techniques used to uncover possible issues:
Are there interdependencies among warranty claims?
Are past claims associated with future claims?
Is there an association between claim and repair facility?
Case Study 1 (cont’d)
Data Understanding
40GB QUIS database containing 7 million vehicle records used
Vehicle records include manufacturing location, warranty claims (5K possible causes), and additional 30 sales codes for each vehicle
Database unintelligible to non-domain experts
Costly effort to consult with domain experts from different departments
Data Preparation
QUIS discovered to have limited SQL access
Cases and variables manually extracted
Additional variables derived for modeling phase (number of days)
Case Study 1 (cont’d)
Proprietary data analytics software used
Data format requirements varied for different algorithms
Resulted in exhaustive pre-processing of data
Modeling Phase
Applied Bayesian networks and association rules to uncover dependencies between warranty claims
Discovered specific combination of construction specifications doubles probability of electrical cable claim
Investigated whether some garages had more claims than others
Remaining results confidential
Case Study 1 (cont’d)
Evaluation
Researchers disappointed in results
Association rules could not be generalized
Rules “not interesting” according to domain experts
Data models fell short of business objectives
Legacy databases not suited to data mining
Proposal suggested database redesign for future data mining efforts
Deployment
Foregoing effort identified as pilot project, models not deployed
Future data mining efforts planned to integrate more closely with database systems at DaimlerChrysler
Case Study 1 (cont’d)
Summary
Uncovering hidden nuggets very difficult
During each phase, researchers encountered roadblocks
Applying new data analytics/mining effort problematic
Data analytics/mining effort requires management support
Substantial human participation required at every stage
Installation, configuration, and data analytics modeling not magic
Wrong analysis leads to possibly expensive policy recommendations
No guarantee data analytics effort delivers actionable results
However, used properly, data analytics may provide profitable results
Fallacies of Data Mining & Analytics
Four Fallacies of Data Mining/Analytics (Louie, Nautilus Systems, Inc.)
Fallacy 1
Set of tools can be turned loose on data repositories
Finds answers to all business problems
Reality 1
No automatic data mining tools solve problems
Rather, data mining is process (CRISP-DM)
Integrates into overall business objectives
Fallacy 2
Data mining & Analytics process is autonomous
Requires little oversight
Fallacies of Data Mining/Analytics (cont’d)
Reality 2
Requires significant intervention during every phase
After model deployment, new models require updates
Continuous evaluative measures monitored by analysts
Fallacy 3
Data mining/analytics quickly pays for itself
Reality 3
Return rates vary
Depending on startup, personnel, data preparation costs, etc.
Fallacy 4
Data mining/analytics software easy to use
Fallacies of Data Mining/DA (cont’d)
Reality 4
Ease of use varies across projects
Analysts must combine subject matter knowledge with specific problem domain
Two Additional Fallacies (Larose)
Fallacy 5
Data analytics/mining identifies causes of business problems
Reality 5
DA & ML and the knowledge discovery process uncover patterns of behavior
Humans interpret results and identify causes
Fallacies of Data Mining/DA (cont’d)
Fallacy 6
Data analytics/mining automatically cleans data in databases
Reality 6
Data analytics/mining often uses data from legacy systems
Data possibly not examined or used in years
Organizations starting data analytics/mining efforts confronted with huge data preprocessing task
What Tasks Can DA & ML Accomplish?
Six common DA tasks
Description
Estimation
Prediction
Classification
Clustering
Association
(1) Description
Describes patterns or trends in data
For example, pollster may uncover patterns suggesting those laid-off less likely to support incumbent
Descriptions of patterns often suggest possible explanations
What Tasks Can Data Analytics Accomplish? (cont’d)
For example, those laid-off now less financially secure; therefore, prefer alternate candidate
Data analytics models should be transparent
That is, results should be interpretable by humans
Some data analytics methods more transparent than others
For example, Decision Trees (transparent) <-> Neural Networks (opaque)
High-quality description accomplished using Exploratory Data Analysis (EDA)
Graphical method of exploring patterns and trends in data
What Tasks Can Data Analytics Accomplish? (cont’d)
(2) Estimation
Similar to Classification task, except target variable numeric
Models built from complete data records
Records include values for each predictor field and numeric target variable in training set
For new observations, estimate of target variable made
For example, estimate a patient’s systolic blood pressure, based on patient’s age, gender, body-mass index, and sodium levels
Here, estimation model built from training set records
Model then estimates value for new case
What Tasks Can Data Analytics Accomplish? (cont’d)
Estimation Tasks in Business and Research:
Estimate amount of money family of four will spend on back-to-school shopping
Estimate percentage decrease in rotary movement sustained by NFL player with knee injury
Estimate number of points basketball player scores when double-teamed in playoffs
Estimate GPA of graduate student, based on student’s undergraduate GPA
What Tasks Can Data Analytics Accomplish? (cont’d)
Figure 1.2 shows scatter plot of graduate GPA against undergraduate GPA
Linear regression finds line (blue) best approximating relationship between two variables
Regression line estimates student’s graduate GPA based on their undergraduate GPA
What Tasks Can Data Analytics Accomplish? (cont’d)
Minitab statistical software produces regression equation ŷ = 1.24 + 0.67x
Therefore, estimated student’s graduate GPA = 1.24 plus 0.67 times their undergraduate GPA
For example, suppose student’s undergraduate GPA = 3.0
According to estimation model
Estimated student’s graduate GPA = 1.24 + 0.67(3.0) = 3.25
Point (x = 3.0, ŷ = 3.25) lies on regression line
Statistical Analysis uses several estimation methods: point estimation, confidence interval estimation, linear regression and correlation, and multiple regression
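The worked estimate above can be reproduced with a few lines of Python. This is a minimal sketch that applies the fitted line ŷ = 1.24 + 0.67x from the slide. The function name is hypothetical, and the coefficients are copied from the Minitab equation, not re-fitted from data.

```python
def estimate_graduate_gpa(undergrad_gpa: float) -> float:
    """Apply the regression line y-hat = 1.24 + 0.67 * x quoted on the slide."""
    intercept = 1.24  # estimated graduate GPA when undergraduate GPA is 0
    slope = 0.67      # change in estimated graduate GPA per undergraduate GPA point
    return intercept + slope * undergrad_gpa


# Student with undergraduate GPA of 3.0, as in the slide example
estimate = estimate_graduate_gpa(3.0)
print(round(estimate, 2))
```

The same function gives a point estimate for any undergraduate GPA, although estimates far outside the range of the original scatter plot should be treated with caution.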
What Tasks Can Data Analytics Accomplish? (cont’d)
(3) Prediction
Similar to classification and estimation, except results lie in the future
Prediction Tasks in Business and Research:
Predict price of stock 3 months into future, based on past performance
[Figure: stock price by quarter (Q1 to Q4), with future values unknown]
What Tasks Can Data Analytics Accomplish? (cont’d)
Predict percentage increase in traffic deaths next year, if speed limit increased
Predict whether molecule in newly discovered drug leads to profitable pharmaceutical drug
Methods used for classification and estimation also applicable to prediction
Includes point estimation, confidence interval estimation, linear regression and correlation, and multiple regression
What Tasks Can Data Analytics Accomplish? (cont’d)
(4) Classification
Classification requires categorical target variable such as Income Bracket
Three values include “High”, “Middle”, “Low”
Data model examines records containing input fields and target field
Table shows several records from data set
Subject  Age  Gender  Occupation            Income Bracket
001      47   F       Software Engineer     High
002      28   M       Marketing Consultant  Middle
003      35   M       Unemployed            Low
…        …    …       …                     …
What Tasks Can Data Analytics Accomplish? (cont’d)
Records of persons in data set used to “train” classification model
First, model built from data records, where value of categorical target variable (Income Bracket) already known
Algorithm “first learns about” which combinations of input fields are associated with Income Bracket values in training set
For example, algorithm may determine that older females associated with high income
Next, trained model examines new records
Information regarding Income Bracket not available
What Tasks Can Data Analytics Accomplish? (cont’d)
Based on classifications in training set, new records classified
For example, 63-year-old female professor might be classified in “High” income bracket
Classification Tasks in Business and Research:
Determine whether credit card transaction fraudulent
Assessing mortgage application to determine “good” or “bad” credit risk
Diagnosing whether particular disease present
What Tasks Can Data Analytics Accomplish? (cont’d)
Determine if will was written by deceased, or fraudulently by someone else
Identify whether certain financial behavior represents terrorist threat
Scatter plot shows Na/K ratio against Age for 200 patients
For example, classify drug type to prescribe based on patient’s age and sodium/potassium ratio
What Tasks Can Data Analytics Accomplish? (cont’d)
Actual drug type prescribed symbolized by shade (light, medium, dark) of points
Suppose prescription for new patient is to be based on this data set
Prescribe which drug for young patient with high Na/K ratio?
Young patients plotted on left
High Na/K plotted on upper-half
Quadrant of graph shows light points
Recommended drug = Y (corresponds to light points)
Prescribe which drug for older patient with low Na/K ratio?
Lower-right half of graph shows patients prescribed different drug types
What Tasks Can Data Analytics Accomplish? (cont’d)
Definitive classification cannot be made
More information required to make decision
Examples show graphs are helpful for understanding two-dimensional data
However, classification often requires many input attributes
More sophisticated methods of classification required
Commonly used algorithms for classification include k-Nearest Neighbor, Decision Trees, and Neural Networks
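To make the classification idea concrete, here is a minimal k-Nearest Neighbor sketch in plain Python. The training records are hypothetical (they are not the 200 patients from the scatter plot), the drug labels "A" and "B" are placeholders, and the features are left unscaled for brevity, which a real analysis would normally correct.

```python
import math
from collections import Counter

# Hypothetical training records: (age, Na/K ratio) -> drug label.
# These values are illustrative only and do not come from the slide data.
training = [
    ((23, 25.0), "Y"),
    ((30, 28.0), "Y"),
    ((27, 22.0), "Y"),
    ((60, 9.0), "A"),
    ((65, 11.0), "B"),
    ((58, 8.0), "A"),
]


def knn_classify(record, training, k=3):
    """Classify a record by majority vote among its k nearest training records."""
    neighbors = sorted(training, key=lambda item: math.dist(record, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Young patient with high Na/K ratio
print(knn_classify((25, 26.0), training))
# Older patient with low Na/K ratio
print(knn_classify((61, 9.5), training))
```

The approach extends to many input attributes without change, which is the situation where a two-dimensional scatter plot stops being useful.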
What Tasks Can Data Analytics Accomplish? (cont’d)
(5) Clustering
Refers to grouping records into classes of similar objects
Clustering algorithm seeks to segment data set into homogeneous subgroups
Where similarity of records in clusters maximized, and similarity to records outside clusters minimized
Target variable not specified
For example, Claritas, Inc. PRIZM software clusters demographic profiles for different geographic areas according to zip code
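The clustering idea described above (similar records grouped together, no target variable) can be sketched with a small k-means routine. This is an illustrative sketch, not the method used by PRIZM. The data points are hypothetical, and the initialization simply takes the first k points, which is an assumption made to keep the example deterministic.

```python
import math


def k_means(points, k, iterations=10):
    """Very small k-means: assign points to nearest centroid, then recompute centroids."""
    centroids = [points[i] for i in range(k)]  # naive initialization (assumption)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster; keep old one if cluster is empty
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical two-dimensional records forming two visible groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, k=2)
for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

Note that no labels are supplied: the groups emerge only from distances between records.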
What Tasks Can Data Analytics Accomplish? (cont’d)
Table shows 62 distinct “lifestyle” types used by PRIZM
01 Blue Blood Estates 02 Winner’s Circle 03 Executive Suites 04 Pools & Patios
05 Kids & Cul-de-Sacs 06 Urban Gold Coast 07 Money & Brains 08 Young Literati
09 American Dreams 10 Bohemian Mix 11 Second City Elite 12 Upward Bound
13 Gray Power 14 Country Squires 15 God’s Country 16 Big Fish, Small Pond
17 Greenbelt Families 18 Young Influentials 19 New Empty Nests 20 Boomers & Babies
21 Suburban Sprawl 22 Blue-Chip Blues 23 Upstarts & Seniors 24 New Beginnings
25 Mobility Blues 26 Gray Collars 27 Urban Achievers 28 Big City Blend
29 Old Yankee Rows 30 Mid-City Mix 31 Latino America 32 Middleburg Managers
33 Boomtown Singles 34 Starter Families 35 Sunset City Blues 36 Towns & Gowns
37 New Homesteaders 38 Middle America 39 Red, White & Blues 40 Military Quarters
41 Big Sky Families 42 New Eco-topia 43 River City, USA 44 Shotguns & Pickups
45 Single City Blues 46 Hispanic Mix 47 Inner Cities 48 Smalltown Downtown
49 Hometown Retired 50 Family Scramble 51 Southside City 52 Golden Ponds
53 Rural Industria 54 Norma Rae-Ville 55 Mines & Mills 56 Agri-Business
57 Grain Belt 58 Blue Highways 59 Rustic Elders 60 Back Country Folks
61 Scrub Pine Flats 62 Hard Scrabble
What Tasks Can Data Analytics Accomplish? (cont’d)
What do the clusters mean?
According to PRIZM, Clusters for Beverly Hills, CA 90210 include:
Cluster 01: Blue Blood Estates
Cluster 10: Bohemian Mix
Cluster 02: Winner’s Circle
Cluster 08: Young Literati
Description of Cluster 01, “…’old money’ heirs that live in America’s wealthiest suburbs…accustomed to privilege and live luxuriously…”
What Tasks Can Data Analytics Accomplish? (cont’d)
Clustering Tasks in Business and Research:
Target marketing niche product for small business that does not have large marketing budget
For accounting purposes, segment financial behavior into benign and suspicious categories
Use as dimensionality-reduction tool for data set having several hundred inputs
Determine gene expression clusters, where many genes exhibit similar behavior or characteristics
Clustering often used as preliminary step in data mining
Resulting clusters used as input to different technique downstream, such as neural networks
What Tasks Can Data Analytics Accomplish? (cont’d)
(6) Association
Find out which attributes “go together”
Market Basket Analysis commonly used in business applications
Quantify relationships in the form of Rules
IF antecedent THEN consequent
Rules measured using support and confidence
For example, discover which items in supermarket are purchased together
Thursday night 200 of 1,000 customers bought diapers, and of those buying diapers, 50 purchased beer
Association Rule: “IF buy diapers, THEN buy beer”
Support = 50/1,000 = 5% (customers buying both), and confidence = 50/200 = 25% (diaper buyers who also bought beer)
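The two rule measures can be checked with a short calculation. This sketch assumes the common definitions: support is the share of all transactions containing both antecedent and consequent, and confidence is the share of antecedent transactions that also contain the consequent. The function name is hypothetical.

```python
def support_and_confidence(n_transactions: int, n_antecedent: int, n_both: int):
    """Return (support, confidence) for the rule IF antecedent THEN consequent."""
    support = n_both / n_transactions   # transactions containing both items
    confidence = n_both / n_antecedent  # of antecedent transactions, share with consequent
    return support, confidence


# Diapers and beer: 1,000 customers, 200 bought diapers, 50 of those bought beer
support, confidence = support_and_confidence(1000, 200, 50)
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```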
What Tasks Can Data Analytics Accomplish? (cont’d)
Association Tasks in Business and Research:
Investigating proportion of subscribers to cell phone plan responding positively to service upgrade offer
Predicting degradation in telecommunication networks
Discovering which items in supermarket purchased together
Determining proportion of cases where administering new drug exhibits serious side effects
Two commonly-used algorithms for generating association rules
A Priori and Generalized Rule Induction (GRI)
Case Study 2
Predicting Abnormal Stock Market Returns
Business Understanding
Alan Safer reports stock market trades by insiders have abnormal returns
Profits by outsiders can be increased by using legal insider trading information
Safer attempts to predict abnormal stock price returns arising from legal insider trading
Data Preparation
Rank of company insiders not considered important
Also, company insiders omitted where not involved in company decisions
Case Study 2 (cont’d)
Data Understanding Phase
Data from 343 companies: Jan 93 – June 97
Source: SEC (S&P 600, 400, and 500: small, medium, and large companies)
Of the 946 stocks, 343 qualified with at least 2 purchases per year
Records include name and rank of insider, transaction date, number of shares, and whether sell or buy
Case Study 2 (cont’d)
Modeling
Data split training = 80% and validation = 20%
Neural Network model uncovered results:
(a) Several groups had most predictable abnormal returns:
Electronic equipment, excluding computer equipment
Chemical products
Transportation equipment
Business services
(b) Predictions looking farther into future increased ability to predict unusual variations
(c) Abnormal stock returns of small companies easier to predict
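The 80%/20% training and validation split mentioned at the start of the Modeling step can be sketched as follows. This is a generic random split, not necessarily the procedure Safer used. The record list, function name, and fixed seed are assumptions for illustration.

```python
import random


def split_records(records, train_fraction=0.8, seed=0):
    """Shuffle records reproducibly and split them into training and validation sets."""
    shuffled = list(records)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cutoff = int(len(shuffled) * train_fraction)
    return shuffled[:cutoff], shuffled[cutoff:]


# Hypothetical data set of 100 record identifiers
records = list(range(100))
train, validation = split_records(records)
print(len(train), len(validation))
```

The model is fitted on the training portion only, and the validation portion is held back to check how well the results carry over to unseen records.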
Case Study 2 (cont’d)
Evaluation
Multivariate Adaptive Regression Spline (MARS) also applied to data
Uncovered similar findings to Neural Network model
Confluence of results is powerful method of evaluating validity of model
This increases confidence in results
Deployment
Safer published findings in Intelligent Data Analysis
Case Study 3
Mining Association Rules from Legal Databases
Business Understanding
Ivkovic, Yearwood, and Stranieri mine association rules from large database of applicants for government-funded legal aid in Australia
Legal data highly unstructured
Goal is to improve delivery of legal services
Data Understanding
Data provided by Victoria Legal Aid (VLA)
Contains 380,000 applications with 300 attributes
Case Study 3 (cont’d)
Domain experts consulted in effort to reduce dimensionality
Researchers selected seven of the most important inputs for inclusion in data set
Data Preparation
VLA data set relatively clean
VLA database administration system responsible for high-quality data
Modeling: Association Rules Analytics
Rules restricted to having both single antecedent and consequent
Many interesting and uninteresting rules uncovered
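Rules with a single antecedent and a single consequent can be enumerated by counting attribute-value pairs. The sketch below is a brute-force illustration, not the researchers' software. The records, attribute names, and confidence threshold are all hypothetical.

```python
from collections import Counter
from itertools import permutations

# Hypothetical records; attribute names and values are illustrative only.
records = [
    {"age_group": "under_30", "law_type": "criminal"},
    {"age_group": "under_30", "law_type": "criminal"},
    {"age_group": "under_30", "law_type": "family"},
    {"age_group": "30_plus", "law_type": "family"},
]


def single_item_rules(records, min_confidence=0.6):
    """Find rules IF one attribute=value THEN another attribute=value."""
    item_counts = Counter()
    pair_counts = Counter()
    for record in records:
        items = list(record.items())
        item_counts.update(items)
        pair_counts.update(permutations(items, 2))  # ordered (antecedent, consequent) pairs
    rules = {}
    for (antecedent, consequent), n_both in pair_counts.items():
        confidence = n_both / item_counts[antecedent]
        if confidence >= min_confidence:
            rules[(antecedent, consequent)] = (n_both / len(records), confidence)
    return rules


rules = single_item_rules(records)
for (antecedent, consequent), (support, confidence) in rules.items():
    print(f"IF {antecedent[0]} = {antecedent[1]} THEN {consequent[0]} = {consequent[1]} "
          f"(support {support:.2f}, confidence {confidence:.2f})")
```

Even on four records the output mixes rules of differing interest, which is why human review of the rule list remains necessary.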
Case Study 3 (cont’d)
Researchers adopted premise that interesting rules spawn interesting hypotheses
For example, possible reasons for rule “If place of birth = Vietnam, then law type = criminal law” include:
Vietnamese applicants applied for criminal law assistance only
Vietnamese applicants committed more crimes than other groups
Perhaps high proportion of males applied, and males more closely associated with criminal activity
Vietnamese didn’t have access to VLA promotional material
Researchers concluded first hypothesis most likely
Note intense human activity required for data mining process
Case Study 3 (cont’d)
Evaluation
Three domain experts estimated confidence level of 144 rules
These estimates compared to confidence level reported by software
Deployment
Web-based application developed (WebAssociator)
Non-specialists able to access rule-building engine
Researchers suggest WebAssociator deployment to enhance judicial system
May identify unjust processes
Case Study 4
Predicting Corporate Bankruptcies using Decision Trees
Business Understanding
Recent economic crisis has spawned many corporate bankruptcies in East Asia
Sung, Chang, and Lee developing models predicting bankruptcies that maximize interpretability of results
Therefore, decision trees used for modeling
Data Understanding
Data included two groups of Korean companies
Those that went bankrupt 1991-1995 during stable period
Companies that went bankrupt during crisis years 1997-1998
Case Study 4 (cont’d)
29 firms identified, mostly manufacturing
Financial data collected from Korean Stock Exchange
Data Preparation
56 financial ratios identified through literature search
16 dropped due to duplication
Measures included growth, profitability, etc.
Modeling
Separate decision tree models were applied under “normal” and “crisis” conditions
Normal-conditions rules uncovered:
If productivity of capital > 19.65, then predict non-bankrupt with 86% confidence
Case Study 4 (cont’d)
If productivity of capital <= 19.65 and ratio of cash flow to total assets <= -5.65, then predict bankrupt with 84% confidence
Crisis-conditions rules uncovered:
If productivity of capital > 20.61, predict non-bankrupt with 95% confidence
If ratio of cash flow to liabilities > 2.54, predict non-bankrupt with 85% confidence
“Cash flow” and “productivity of capital” important predictors, regardless of economic conditions
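Decision tree rules are easy to read partly because they translate directly into if/then code. The sketch below encodes only the two normal-conditions rules quoted above. The function name and the choice to return None when neither rule applies are assumptions, since the slides do not show the rest of the tree.

```python
from typing import Optional, Tuple


def predict_normal_conditions(
    productivity_of_capital: float,
    cash_flow_to_total_assets: float,
) -> Optional[Tuple[str, float]]:
    """Apply the two normal-conditions rules; return (prediction, confidence) or None."""
    if productivity_of_capital > 19.65:
        return ("non-bankrupt", 0.86)
    # Reaching here means productivity of capital <= 19.65
    if cash_flow_to_total_assets <= -5.65:
        return ("bankrupt", 0.84)
    return None  # no quoted rule covers this case (assumption)


print(predict_normal_conditions(25.0, 0.0))
print(predict_normal_conditions(10.0, -6.0))
print(predict_normal_conditions(10.0, 1.0))
```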
Evaluation
Panel of domain experts concluded “productivity of capital” most important attribute for differentiating firms at risk
Domain experts verified results of decision tree
Case Study 4 (cont’d)
Group confirmed results would generalize to population of Korean manufacturing firms
Discriminant analysis determined many of the 40 financial ratios were important predictors
Deployment
No specific deployment took place
However, financial institutions in Korea became more aware of important predictors of bankruptcy
Case Study 5
Profiling the Tourism Market using k-Means Clustering Analysis
Business Understanding
Hudson and Ritchie were interested in studying intra-province tourism behavior in Alberta, Canada
Goal was development of marketing campaign for tourism in Alberta (sponsored by Travel Alberta)
Models created in effort to quantify factors for choosing vacations in Alberta
Data Understanding
Data collected from 13,445 Albertans using phone survey in 1999
Only 3,071/13,445 records included in modeling
MD-MIS 637 – Fall 2020
*
Case Study 5 (cont’d)
Data Preparation
One question asked respondents to indicate which of 13 factors most influenced their travel decisions
Factors included accommodations, weather conditions, etc.
Modeling
Between two and six clusters explored with k-Means
Five-cluster solution chosen with profile names:
Young Urban Outdoor Market
Indoor Leisure Traveler Market
Children-first Market
Fair-weather-friends Market
Older, Cost-conscious Traveler Market
Case Study 5 (cont’d)
Evaluation
Discriminant analysis verified “reality” of clusters
Classified 93% correctly
Deployment
Findings resulted in launch of new campaign, “Alberta, Made to Order”
More than 80 projects launched
Travel Alberta found increase of 20% in number of Albertans considering Alberta “top-of-the-mind” travel destination