Discovering Knowledge in Data
MIS 637
Data Analytics and Machine Learning
Data Science & Analytics Lifecycle: Mid-Semester Summary
Cross Industry Standard Process: CRISP-DM (cont’d)
Iterative CRISP-DM process shown in outer circle
Most significant dependencies between phases shown
Next phase depends on results from preceding phase
Returning to earlier phase possible before moving forward
The six phases in order: Business / Research Understanding → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment
What Tasks Can DA & ML Accomplish?
Six common DA tasks
Description
Estimation
Prediction
Classification
Clustering
Association
(1) Description
Describes patterns or trends in data
For example, a pollster may uncover patterns suggesting that workers who have been laid off are less likely to support the incumbent
Descriptions of patterns often suggest possible explanations
For Data, Only Two Moments Really Matter
[Figure: the path from the DATA CREATOR (the Moment of Creation) to the DATA CUSTOMER (the Moment of Use)]
The whole point of data quality management is to connect the two!
Note that they DO NOT occur in IT
MIS 637
Data Analytics and Machine Learning
Data Quality & Preprocessing
Cleaning, Transforming, and Exploring
Handling Missing Data (cont’d)
(2) Replace Missing Values with Mode or Mean
Mode of categorical field cylinders = 4
Missing values replaced with this value
Mean for non-missing values in numeric field cubicinches = 200.65
Missing values replaced with 200.65
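A minimal pandas sketch of this imputation; the file name is hypothetical, and the column names follow the slide's cars data set:

```python
import pandas as pd

cars = pd.read_csv("cars.csv")  # hypothetical file holding the cars data set

# Categorical field: replace missing values with the mode (e.g., 4)
cars["cylinders"] = cars["cylinders"].fillna(cars["cylinders"].mode()[0])

# Numeric field: replace missing values with the mean (e.g., 200.65)
cars["cubicinches"] = cars["cubicinches"].fillna(cars["cubicinches"].mean())
```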
Handling Missing Data (cont’d)
(3) Replace Missing Values with Random Values
Values randomly taken from underlying distribution
Value for cylinders, cubicinches, and hp randomly drawn proportionately from each field’s distribution
This method is superior to mean substitution
Measures of location and spread remain closer to those of the original distribution (see the sketch below)
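A minimal sketch of random-value imputation, drawing replacements proportionately from each field's observed distribution; `impute_random` is a hypothetical helper, and `cars` continues from the sketch above:

```python
import numpy as np
import pandas as pd

def impute_random(series: pd.Series, seed: int = 0) -> pd.Series:
    """Replace missing values with draws from the field's own distribution."""
    rng = np.random.default_rng(seed)
    observed = series.dropna().to_numpy()
    out = series.copy()
    mask = out.isna()
    out[mask] = rng.choice(observed, size=mask.sum())
    return out

for col in ["cylinders", "cubicinches", "hp"]:
    cars[col] = impute_random(cars[col])
```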
Graphical Methods for Identifying Outliers (cont’d)
A histogram examines the values of a numeric field
This histogram shows vehicle weights for the cars data set
The extreme left tail contains one outlier: a vehicle weighing only 192.5 pounds
Should we doubt the validity of this value? A weight of 192.5 pounds is far too light for a car, suggesting a data-entry error (see the sketch below)
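A quick matplotlib sketch of such a histogram, continuing with the `cars` frame from the earlier sketches; the `weightlbs` column name is an assumption:

```python
import matplotlib.pyplot as plt

plt.hist(cars["weightlbs"].dropna(), bins=30)
plt.xlabel("Vehicle weight (lbs)")
plt.ylabel("Count")
plt.title("Vehicle weights: one extreme left-tail outlier")
plt.show()
```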
Numerical Methods for Identifying Outliers (cont’d)
Using Interquartile Range (IQR) to Identify Outliers
Robust statistical method and less sensitive to presence of outliers
Data divided into four quartiles, each containing 25% of data
First quartile (Q1) 25th percentile
Second quartile (Q2) 50th percentile (median)
Third quartile (Q3) 75th percentile
Fourth quartile (Q4) 100th percentile
IQR = Q3 − Q1 is a measure of variability in the data
A value is flagged as an outlier if it lies below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR (see the sketch below)
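A minimal numpy sketch of the 1.5 × IQR rule, again assuming the `weightlbs` column from the earlier sketches:

```python
import numpy as np

def iqr_outliers(x: np.ndarray) -> np.ndarray:
    """Return values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]

weights = cars["weightlbs"].dropna().to_numpy()
print(iqr_outliers(weights))  # should surface the 192.5-lb record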
MIS 637
Data Analytics and Machine Learning
Data Preprocessing
Exploratory Data Analysis
Data Transformation (cont’d)
Two prevalent normalization techniques available
(1) Min-Max Normalization
Determines how much greater the field value is than the minimum value for the field
Scales this difference by the field's range
From Figure 2.7, Min = 8 and Max = 25 for the time-to-60 field (a sketch follows)
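A short pandas sketch of min-max normalization for this field; the `time-to-60` column name is an assumption:

```python
# X* = (X - min(X)) / (max(X) - min(X)); with Min = 8 and Max = 25,
# a value of 12 maps to (12 - 8) / (25 - 8) ≈ 0.235
t = cars["time-to-60"]
cars["time-to-60_minmax"] = (t - t.min()) / (t.max() - t.min())
```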
Dealing with Correlated Variables
Using correlated variables in a data model:
Should be avoided!
Incorrectly over-emphasizes one or more data inputs
Creates model instability and produces unreliable results
Matrix plot of Day Minutes, Day Calls, and Day Charge shown in Figure 3.2 (Minitab)
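A minimal sketch of the corresponding correlation check; the churn file and exact column names are assumptions based on Figure 3.2:

```python
import pandas as pd

churn = pd.read_csv("churn.csv")  # hypothetical file
print(churn[["Day Minutes", "Day Calls", "Day Charge"]].corr())
# If Day Charge is an (almost) exact linear function of Day Minutes,
# keep only one of the two as a model input.
```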
Deriving Rules From Data
Data Analytics & Machine Learning Algorithms
Recursive Partitioning: C4.5 and CART
Overview
Deriving Rules From Data
Decision Trees:
Classification Trees:
classification trees are used when the dependent variable is a categorical/qualitative "YES" or "NO" type of variable,
the goal is to categorize (classify) cases (observations) and derive rules,
cases are fed into the decision tree,
at each node of the tree, each case is classified into one of two or more branches (categories),
the size of the nodes (circles) decreases toward the right, indicating that the number of cases falling into each successive cluster is decreasing,
the end result is classification rules such as "IF Age is less than 35 AND Residence = New York, THEN widget buyer",
this tree is called a classification tree (a sketch follows this list).
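A minimal sklearn sketch of such a classification tree; sklearn implements a CART-style learner, and the widget-buyer data below is invented for illustration:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

X = pd.DataFrame({"Age":       [25, 30, 45, 50, 28, 60],
                  "IsNewYork": [1,  1,  0,  1,  1,  0]})
y = ["buyer", "buyer", "non-buyer", "non-buyer", "buyer", "non-buyer"]

clf = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(clf, feature_names=list(X.columns)))  # IF/THEN-style rules
```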
Deriving Rules From Data
Decision Trees:
Regression Trees:
regression trees are used when the dependent variable is continuous; the predictor variables may be categorical or continuous,
the goal is to predict or estimate the value of the dependent variable,
each branch of the tree partitions off a subset of the data,
the last (leaf) node of the tree shows the value of the dependent variable (income) for that subset of the data,
the end result is a prediction/estimation rule such as: "IF Residence = New York AND Age is less than 35, THEN average Income is $40K with a standard deviation of $5K",
this tree is called a regression tree – strictly binary, containing exactly two branches for each decision node (a sketch follows this list)
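A matching sklearn sketch of a regression tree; the incomes (in $K) are invented, and sklearn grows strictly binary CART-style trees:

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

X = pd.DataFrame({"Age":       [25, 30, 45, 50, 28, 60],
                  "IsNewYork": [1,  1,  0,  1,  1,  0]})
income = [40, 42, 70, 75, 38, 80]  # $K

reg = DecisionTreeRegressor(max_depth=2).fit(X, income)
print(export_text(reg, feature_names=list(X.columns)))  # leaf value = mean income
```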
Optimal Number of Splits
Overfitting:
overfitting means that the algorithm finds a model that fits the training data (the "sample") and performs well on it, but performs poorly on new real-world data it has not seen before (out-of-sample data),
the ML algorithm may pick up details in the data that are characteristics of the training sample, but not of the actual problem being modeled,
neural nets and other ML algorithms easily "overtrain": they perform well on the training data but badly on real-world data they have not seen before.
Cross-Validation:
to avoid overfitting, one approach adopted by statisticians is "cross-validation",
in cross-validation the training data is divided into several parts, say 10,
all parts but one are used to train the model, and the leftover part is used to test it,
this is repeated several times, each time with a different combination of training parts and leftover testing part,
in effect, splits are tested on several different combinations of training and testing data,
for instance, with 10 parts there are a total of 10 combinations of training and testing (9 parts for training and 1 part for testing) – see the sketch below.
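A minimal sketch of 10-fold cross-validation with sklearn; the data is synthetic, for illustration only:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# cv=10: train on 9 parts, test on the leftover part, 10 times
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)
print(scores.mean(), scores.std())  # average out-of-sample accuracy
```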
Optimal Number of Splits
Tree Size – Optimal number of splits:
the accuracy of the model on out-of-sample data depends on the tree size (number of splits),
the error rate of the model initially decreases as the size of the tree increases, but after some optimal point the error rate starts to increase (overfit),
accuracy is low for both very simple and very complex trees (see the sketch below)
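A sketch of this error-versus-size curve, controlling tree size via maximum depth on the same kind of synthetic data as above:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

for depth in range(1, 11):
    acc = cross_val_score(DecisionTreeClassifier(max_depth=depth),
                          X, y, cv=10).mean()
    print(depth, round(1 - acc, 3))  # error falls, bottoms out, then rises
```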
Optimal Number of Splits
Pruning: there are two approaches to getting "right-sized" trees: "Stopping" and "Pruning"
in the stopping approach the ML algorithm must decide when to stop while going forward constructing the tree – it turns out that this approach is very hard,
in the pruning approach the ML algorithm tests all possible splits and grows the complete (overfitted) tree, and then "prunes back" the useless branches – the ones that increase the error rate on out-of-sample data,
the tree is iteratively pruned, and with each prune the error of the pruned tree is computed, until the best overall tree is constructed,
research shows that the pruning approach results in much more robust trees (a sketch follows).
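A sketch of the prune-back idea using sklearn's cost-complexity pruning (one concrete pruning criterion, not necessarily the one used in class; synthetic data as above): grow the full tree, then keep the pruning strength with the best cross-validated accuracy.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Grow the complete (overfitted) tree, then enumerate prune levels
path = DecisionTreeClassifier().fit(X, y).cost_complexity_pruning_path(X, y)

best = max(path.ccp_alphas,
           key=lambda a: cross_val_score(
               DecisionTreeClassifier(ccp_alpha=a), X, y, cv=10).mean())
print("best ccp_alpha:", best)
```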
Min-Max Normalization formula:

$$X^{*} = \frac{X - \min(X)}{\operatorname{range}(X)} = \frac{X - \min(X)}{\max(X) - \min(X)}$$
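As a quick numeric check (the vehicle is hypothetical; Min = 8 and Max = 25 are the time-to-60 values from the Data Transformation slide), a car with a time-to-60 of 12 seconds normalizes to

$$X^{*} = \frac{12 - 8}{25 - 8} = \frac{4}{17} \approx 0.235$$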