Deriving Rules From Data
Data Analytics & Machine Learning Algorithms
Recursive Partitioning: C4.5 and CART Algorithms
Overview
Deriving Rules From Data
Machine Learning Algorithms (ML):
derive rules from the data,
create rules and rule-trees by searching through the data for statistical patterns and relationships,
uncover useful relationships among variables - relationships that you may not have known even existed,
use the information in the data to cluster recorded items into specific classes/categories,
build prediction and classification models and explicit rules from the data,
the derived models and rules help explain the process that generated the data.
Database Queries answer the question: “What are the data that match this pattern?”
ML Algorithms answer the question: “What is the pattern that matches these data?”
Neural Nets learn from the data and derive implicit relationships between variables,
ML Algorithms learn from the data and derive explicit, rule-like relationships between variables.
Deriving Rules From Data
Data Analytics and ML Algorithms
Steps:
specify the problem, a profound question,
specify relevant variables – independent and dependent variables,
access, prepare, and clean the data,
apply Machine Learning (ML) algorithms to derive rules, patterns, relationships,
models, decision trees.
ML Algorithm:
takes the selected data and finds the relationship between dependent and independent
variables,
the relationship can be described as a decision tree, a rule, a chart, or an equation,
the challenge is to focus the statistical analysis so that hidden relationships bubble up
to the surface.
ML Techniques:
Genetic Algorithms,
Bayesian Probabilities,
Recursive Partitioning Algorithms,
Neural Networks.
Tribute to Sir Ronald Aylmer Fisher
R. A. Fisher
Statistician
Evolutionary Biologist
Geneticist
The Iris flower data set, or Fisher’s Iris data set:
the data were collected to quantify the morphologic variation of Iris flowers of three related species: setosa, versicolor, and virginica.
Variables: Sepal Length, Sepal Width, Petal Length, Petal Width, Species
Fisher’s Iris Flower Data Set
used by Sir Ronald Fisher (1936) as an example of discriminant analysis
        Sepal Length  Sepal Width  Petal Length  Petal Width  Class
        X1            X2           X3            X4           X5
x1      6.2           3.6          4.4           1.5          Iris-Versicolor
x2      6.9           1.0          1.1           1.6          Iris-Versicolor
x3      3.3           1.6          1.4           1.5          Iris-Versicolor
x4      4.0           2.9          3.2           0.4          Iris-Setosa
x5      4.5           1.5          1.0           1.0          Iris-Versicolor
x6      4.4           2.9          5.4           0.3          Iris-Setosa
x7      4.7           1.7          5.0           0.4          Iris-Virginica
…       …             …            …             …            …
x148    5.7           3.4          1.2           1.1          Iris-Setosa
x149    3.0           3.7          1.1           1.7          Iris-Virginica
x150    6.3           3.2          5.6           1.3          Iris-Virginica
Walmart-Type Data!
Data Matrix
Data can often be represented as an n × d matrix, with n rows and d columns (see the illustration below):
Rows:
called samples, instances, cases, examples, records, transactions, objects, points, feature-vectors, etc.
Columns:
called variables, attributes, properties, features, dimensions, fields, etc.
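As a concrete illustration of the n × d layout, the sketch below loads Fisher's Iris data as bundled with scikit-learn (an assumption of this sketch; the course may use a different file or tool): n = 150 rows (cases) and d = 4 columns (features).

```python
# Minimal sketch: the Iris data as an n x d matrix (n = 150 cases, d = 4 features).
# scikit-learn's bundled copy of the data set is used purely for illustration.
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
print(X.shape)     # (150, 4): sepal length, sepal width, petal length, petal width
print(X[0], y[0])  # the first row (case) and its class label (0 = setosa)
```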
Deriving Rules From Data
Decision Trees
Recursive Partitioning Algorithms:
split the original data into finer and finer subsets (recursively) resulting in a decision tree,
a decision tree is a compact explanation of the data,
the tree will provide the profile of a “typical case” leading to a specific value of the
dependent variable,
example: most “widget” buyers live in New York and are less than 35 years old.
Deriving Rules From Data
Decision Trees:
Classification Trees:
classification trees are used when the dependent variable is a categorical/qualitative (“YES”/“NO”) type of variable,
the goal is to categorize (classify) cases (observations) and derive rules,
cases are fed into the decision tree,
at each node of the tree, each case is classified into one of two or more branches (categories),
the size of the nodes (circles) decreases towards the right, indicating that the number of cases falling into each successive cluster is decreasing,
the end result is classification rules such as “IF Age is less than 35 AND Residence = New York, THEN widget buyer”,
this tree is called a classification tree (a short illustrative sketch follows).
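As a rough illustration of a classification tree (a sketch only: scikit-learn grows CART-style binary trees rather than C4.5 trees, and the Iris data stand in for the widget data), the snippet below fits a small tree and prints its IF/THEN-style rules:

```python
# Illustrative classification tree on the Iris data; criterion="entropy"
# mimics the C4.5-style impurity measure discussed in these slides.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Print the derived IF/THEN-style rules, one path per leaf.
print(export_text(clf, feature_names=list(iris.feature_names)))
```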
Deriving Rules From Data
Decision Trees:
Regression Trees:
regression trees are used when the dependent variable is continuous (the CART algorithm handles both categorical and continuous dependent variables),
the goal is to predict or estimate the value of the dependent variable(s),
each branch of the tree partitions off a subset of the data,
the last node of the tree shows the value of the dependent variable (income) for that subset of the data,
the end result is a prediction/estimation rule such as: “IF Residence = New York AND Age is less than 35, THEN average Income is $40K with a standard deviation of $5K”,
this tree is called a regression tree - strictly binary, containing exactly two branches for each decision node (a short sketch follows).
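A minimal regression-tree sketch: the tiny age/residence/income table below is entirely hypothetical and only meant to mirror the rule quoted above, and scikit-learn's DecisionTreeRegressor is used as a stand-in CART implementation.

```python
# Hedged regression-tree sketch: the target is continuous, and each leaf
# reports the mean of the dependent variable for its subset of the data.
from sklearn.tree import DecisionTreeRegressor, export_text

# columns: [age, lives_in_NY (1/0)]; target: income in $K (hypothetical values)
X = [[25, 1], [30, 1], [33, 1], [45, 1], [52, 1], [28, 0], [40, 0], [60, 0]]
y = [42, 39, 41, 55, 60, 30, 35, 38]

reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
print(export_text(reg, feature_names=["age", "lives_in_NY"]))
# Each leaf's "value" is the predicted (mean) income for that partition.
```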
Deriving Rules From Data
Recursive Partitioning Algorithms:
The two best-known RP algorithms are C4.5 and CART:
both algorithms try to create homogeneous clusters,
the goal is to make the clusters at the nodes “purer” and “purer” by progressively reducing the “disorder” in the original data,
C4.5, developed by computer scientists (Ross Quinlan, C4.5: Programs for Machine Learning), reduces the information disorder using entropy,
CART, developed by statisticians (Sonquist and Morgan, 1964; Leo Breiman et al., 1984), reduces the statistical disorder using variance.
C4.5 Algorithm: Example
Problem: Widgitco Inc. wants to predict the types of customers who are more likely to buy widgets. In other words, the firm wants to find the pattern of customers who buy the product.
Data: the firm has historical demographic data on six independent variables X1 through X6 and a binary dependent variable “Yes” or “No”.
Deriving Rules From Data
C4.5 Algorithm: Example (continued)
Solution:
divide up the data along the independent variables so that cases form clusters where
each cluster contains predominantly “YES” or “NO”,
build the decision tree by first splitting the data with respect to the independent
variable X3 (residence) and then X1 (age),
this strategic splitting leads to a quick and simple clustering of the data useful for
marketing decisions:
Deriving Rules From Data
C4.5 Algorithm:
How does the algorithm determine which independent variable is the most important to split first?
How does the algorithm determine the boundaries for the splits?
How does the algorithm arrive at the most useful discrimination of the data?
The C4.5 algorithm uses Entropy to calculate the impurity of a cluster.
The higher the entropy (disorder) of a cluster, the more information you need to describe (code) the cluster, and the less certain you can be of what the cluster contains.
As the probability of a single category in a cluster gets closer and closer to 1, the entropy of the cluster (H) gets closer and closer to 0, the disorder of the cluster gets closer and closer to 0, and the cluster becomes most valuable in the sense that it contains a single category.
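A minimal sketch of this entropy behaviour, assuming the standard definition H = sum of p_i * log2(1/p_i) given later in these notes:

```python
# Entropy of a cluster from its category counts: high for mixed clusters,
# approaching 0 as one category's probability approaches 1.
from math import log2

def entropy(counts):
    """Entropy H (in bits) of a cluster, given counts per category."""
    total = sum(counts)
    return sum((c / total) * log2(total / c) for c in counts if c > 0)

print(entropy([10, 10]))  # 1.0 bit: a maximally mixed two-category cluster
print(entropy([19, 1]))   # ~0.286 bits: nearly pure, low disorder
print(entropy([20, 0]))   # 0.0 bits: a single category, no disorder
```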
Deriving Rules From Data
C4.5 Algorithm: Example
the first split is on variable X3 (residence),
it creates two much purer clusters: a cluster with 6 out of 7 buyers, and a cluster with 10 out of 13 non-buyers,
the first split has reduced the impurity from 0.993 to 0.713, a gain of 0.280 bits (checked in the short calculation below),
the second split is on X1 (age),
the second split further reduces the impurity from 0.713 to 0.580, a gain of 0.133 bits.
Patterns:
non-New Yorkers have a pattern of not buying widgets,
New Yorkers over the age of 35 have a pattern of not buying widgets,
younger New Yorkers have a pattern of buying widgets.
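The impurity figures above can be checked directly from the cluster counts given (9 buyers and 11 non-buyers overall; after the residence split, a 7-case cluster with 6 buyers and a 13-case cluster with 10 non-buyers). A short check, matching the reported values up to rounding:

```python
from math import log2

def entropy(counts):
    """Entropy (in bits) of a cluster given its category counts."""
    total = sum(counts)
    return sum((c / total) * log2(total / c) for c in counts if c > 0)

h_root = entropy([9, 11])                          # ~0.993 bits before any split
h_ny, h_not_ny = entropy([6, 1]), entropy([3, 10])
h_split = (7 / 20) * h_ny + (13 / 20) * h_not_ny   # weighted impurity after the split
print(round(h_root, 3), round(h_split, 3), round(h_root - h_split, 3))
# ~0.993, ~0.714, ~0.279 -- matching the slide's 0.993, 0.713, 0.280 up to rounding
```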
Deriving Rules From Data
C4.5 Algorithm
Splitting:
test all possible independent variables at all their possible split points,
compute the resulting gain in purity for each variable and each of its possible splitting points,
pick the split (the variable and its splitting point) that maximizes the gain and form the resulting clusters (a sketch of this search follows),
do this over and over again on each resulting cluster (recursively) until the decision tree is built.
Example:
for the widget sales data above, the independent variable X3 (residence) at the splitting point NY resulted in the highest gain in purity, and the clusters “X3 = NY” and “X3 ≠ NY” were formed,
once the data were split at X3, the new cluster formed by all the NY records could be further purified,
the algorithm tested all variables again, but only used the data in the new cluster “X3 = NY”,
the variable X1 (age) at the splitting point X1 <= 35 resulted in the highest gain in purity and became the most discriminating variable, at its value of 35.
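A minimal sketch of this exhaustive split search (illustrative only: the X and y names are assumptions, and real C4.5 adds refinements such as the gain ratio and special handling of categorical attributes):

```python
# For each candidate variable and threshold, compute the entropy gain of the
# resulting two-way split and keep the best (variable, threshold) pair.
from math import log2

def entropy(labels):
    total = len(labels)
    return sum((labels.count(c) / total) * log2(total / labels.count(c))
               for c in set(labels))

def best_split(X, y):
    """Return (column, threshold, gain) maximizing the reduction in entropy."""
    base, n = entropy(y), len(y)
    best = (None, None, 0.0)
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [y[i] for i in range(n) if X[i][j] <= t]
            right = [y[i] for i in range(n) if X[i][j] > t]
            if not left or not right:
                continue
            gain = base - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
            if gain > best[2]:
                best = (j, t, gain)
    return best
```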
Deriving Rules From Data
C4.5 Algorithm:
Overfitting:
how much splitting is appropriate?
the figure below shows two extremes: “one split” (underfit) vs. “seven splits” (overfit):
“one split” results in two sub-samples (clusters) that are still quite impure (underfit),
“seven splits” results in completely purified clusters, but the model has become very complicated,
very complicated models are hard to understand and run the danger of overfitting the data.
Deriving Rules From Data
C4.5 Algorithm:
Overfitting:
overfitting means that the algorithm finds a model that fits the “training data” (or the “sample”) and performs well on the training data, but performs poorly on real-world new data it has not seen before (out-of-“sample” data),
the ML algorithm may pick up details in the data that are characteristics of the training sample, but not of the actual problem being modeled,
neural nets and ML algorithms easily “overtrain”: they perform well on the training data but badly on real-world data they have not seen before.
Cross-Validation:
to avoid overfitting, one approach adopted by statisticians is “cross-validation”,
in cross-validation the training data is divided into several parts, say 10,
all parts but one are used to train the model, and the leftover part is used to test the model,
this is done several times, each time using different training parts and a different leftover testing part,
in effect, splits are tested on several different combinations of training and testing data,
for instance, with 10 parts there are a total of 10 combinations of training and testing (9 parts for training and 1 part for testing) - see the sketch below.
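A hedged 10-fold cross-validation sketch using scikit-learn (an assumption of this example; the Iris data stand in for the course's sample data):

```python
# 10-fold cross-validation: each fold trains on 9 parts and tests on the
# leftover part, giving 10 out-of-sample accuracy estimates.
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)

scores = cross_val_score(tree, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))
print(scores.mean(), scores.std())  # average out-of-sample accuracy and its spread
```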
Deriving Rules From Data
C4.5 Algorithm:
Tree Size - Optimal number of splits:
the accuracy of the model on “out of sample” data depends on the tree size (number of splits),
the error rate of the model initially decreases as the size of the tree increases, but after some optimal point the error rate starts to increase (overfit),
the accuracy is low for both very simple and very complex trees.
Deriving Rules From Data
C4.5 Algorithm:
Pruning: there are two approaches to getting “right-sized” trees: “Stopping” and “Pruning”:
in the stopping approach the ML algorithm must decide when to stop while going forward constructing the tree - it turns out that this approach is very hard,
in the pruning approach the ML algorithm tests all possible splits and grows the complete (overfitted) tree, and then “prunes back” the useless branches - the ones that increase the error rate on “out of sample” data,
the tree is iteratively pruned, and with each prune the error of the pruned tree is computed, until the best overall tree is constructed (see the sketch below),
research shows that the pruning approach results in much more robust trees.
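A sketch of the grow-then-prune idea using scikit-learn's cost-complexity pruning (a CART-style procedure; C4.5's own pruning uses pessimistic error estimates, but the workflow of growing a full tree and pruning it back against held-out error is the same):

```python
# Grow the full (potentially overfitted) tree, then keep the pruned version
# that performs best on data the tree has not seen during training.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate pruning levels (ccp_alpha values) for the fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_test, y_test),  # favor the prune with the best out-of-sample accuracy
)
print(best.get_n_leaves(), best.score(X_test, y_test))
```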
Deriving Rules From Data
Classification And Regression Trees (CART)
if the dependent variable is continuous, the CART algorithm is used to construct decision trees,
for continuous variables there is no notion of a “pure” partition - there are infinitely many possible values for the dependent variable, so classification and clustering is done through the notion of “dispersion”,
disorder is measured by variance (a measure of “dispersion” around the “center”),
homogeneous clusters are formed from items whose dependent variable values are all close to each other - i.e., their standard deviation would be quite small,
the criterion is to form clusters with minimum “within-cluster” variance and maximum separation between cluster means - impurity and gain are measured by the pair (mean, SD); a sketch of this criterion follows.
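A minimal sketch of the dispersion criterion: score a candidate split of a continuous dependent variable by the reduction in weighted within-cluster variance (the names and sample values here are illustrative, not from the slides):

```python
# CART-style "disorder" for a continuous target: within-cluster variance.
def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def variance_reduction(y_parent, y_left, y_right):
    """Drop in disorder (variance) achieved by splitting the parent cluster."""
    n = len(y_parent)
    weighted = (len(y_left) / n) * variance(y_left) + (len(y_right) / n) * variance(y_right)
    return variance(y_parent) - weighted

# Illustrative usage on hypothetical income values ($K).
y_parent = [42, 39, 41, 55, 60, 30, 35, 38]
print(variance_reduction(y_parent, y_parent[:5], y_parent[5:]))
```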
Deriving Rules From Data
CART: Example
Problem: a software company that sells packaged PC software wants to derive decision rules that determine the types of customers generating the highest revenue.
Data: the company has a database containing records of approximately 2,000 customers - each record consists of values of 31 variables.
Dependent Variable(s): “total annual sales” made to the customer (a continuous variable).
Independent Variables: there were 30 independent variables recorded for each customer.
Solution:
since the dependent variable (sales) is continuous, the CART algorithm was used,
clusters were formed by computing the mean and standard deviation of customers' total annual sales,
homogeneous clusters were identified based on minimum “within-cluster SD” and maximum separation between cluster means,
the partitioning algorithm found only three of the 30 independent variables to be relevant for classifying customers:
the number of users in the customer's organization,
the number of inquiries they made about the product they purchased,
and the number of product licenses they purchased,
the figure below shows the tree that was constructed using actual data and the CART algorithm.
Deriving Rules From Data
CART
it is hard to interpret the tree,
the top branch has not been expanded.
Deriving Rules From Data
CART:
The simplified tree is easier to interpret; it splits the data into four types that can be classified into a 2-by-2 matrix:
large customers who make a large number of inquiries,
large customers who make a small number of inquiries,
small customers who make a large number of inquiries,
small customers who make a small number of inquiries.
Probability, Information Theory and Entropy
Measuring Disorder (Impurity)
Disorder:
Suppose we have M things (cases). The M cases will have the highest impurity (maximum disorder) if they are all Distinct and therefore their classification (clustering) requires K = M Distinct Categories.
In a binary representation, if we have b bits (0s and 1s), we can describe (code):
2 × 2 × 2 × … × 2 = 2^b
Distinct Cases/characters/numbers. With b bits we can describe 2^b Distinct Cases.
The number of bits, b, required to describe (code) M Distinct Cases in K (= M) Distinct Categories must satisfy the equation 2^b = K, and thus we have:
b = log2(K) = log2(M)
If we want to describe K Distinct Cases, we need b = log2(K) bits.
Probability, Information Theory and Entropy
Measuring Disorder (Impurity)
Disorder:
Ex 1: for two distinct categories (K = 2), “YES” or “NO”, we have:
b = log2(2) = 1,
i.e., it takes 1 bit (of information) to describe (code) the two distinct categories: NO = 0, YES = 1. The measure of impurity (disorder) is 1.
Ex 2: for four categories (K = 4), A, B, C, and D, we have:
b = log2(4) = 2,
i.e., it takes 2 bits (of information) to describe (code) the four distinct categories:
00 → A
01 → B
10 → C
11 → D
The measure of impurity (disorder) is 2.
Probability, Information Theory and Entropy
Measuring Disorder (Impurity)
Disorder:
The M cases will have the lowest impurity (minimum disorder) if they are all the same and fall into a single category (K = 1). In this case we have:
b = log2(1) = 0,
i.e., no bits (no information) are needed to describe (code) this case.
The M cases will have the highest impurity (maximum disorder) if they are all distinct and thus fall into K = M categories, i.e., we need b = log2(M) bits to describe (code) them.
Thus:
log2(K) is a good “measure” of disorder (impurity), where 1 <= K <= M is the number of Distinct Categories into which the M Cases may be classified.
The unit of impurity is the bit: the more bits necessary for the description (coding), the more information required, and thus the higher the impurity (disorder).
Probability, Information Theory and Entropy
Measuring Disorder (Impurity)
Probability and Information:
Consider a stochastic experiment in which cases are assigned to N distinct categories (N distinct possible outcomes) with equal probabilities p = 1/N.
The larger the N (i.e., the lower the p), the higher the disorder (impurity).
The higher the disorder (impurity), the higher the number of bits (information) required for description (coding).
Impurity increases with the inverse of the probability, 1/p.
Probability, Information Theory and Entropy
Measuring Disorder (Impurity)
Probability and Information:
This is the reason that in information theory a measure of information is defined as:
log2(1/p) = -log2(p).
Above is the information content of a category with probability of occurrence equal to p.
For the equally likely experiment with N possible outcomes, the information (disorder) measure is:
log2(1/p) = log2(1/(1/N)) = log2(N):
= 0 bits for N = 1 (all cases fall into 1 category)
= 1 bit for N = 2 (cases fall into one of 2 distinct categories)
= 2 bits for N = 4 (cases fall into one of 4 distinct categories)
= 3 bits for N = 8 (cases fall into one of 8 distinct categories)
Probability, Information Theory and Entropy
Measuring Disorder (Impurity)
Entropy:
Entropy generalizes the information (disorder) of particular categories to the information (disorder) of a cluster with a number of possible categories.
If p_i is the probability of the occurrence of the ith category in a particular cluster, then the Entropy of that cluster is defined as:
H = Σ_i p_i log2(1/p_i) = - Σ_i p_i log2(p_i)   bits
H is the measure of disorder, or the information content, of a cluster.
The higher the entropy (disorder) of a cluster, the more information you need to describe (code) the cluster, and the less certain you can be of what the cluster contains.
As the probability of a single category in a cluster gets closer and closer to 1, the entropy of the cluster (H) gets closer and closer to 0, the disorder of the cluster gets closer and closer to 0, and the cluster becomes most valuable in the sense that it contains a single category.
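A short worked check of the formula, assuming nothing beyond the definition above: four equally likely categories give H = 4 × (1/4) × log2(4) = 2 bits, matching the earlier coding example (2 bits to code A, B, C, D).

```python
# Entropy of a cluster from its category probabilities, per the formula above.
from math import log2

def entropy(probs):
    return sum(p * log2(1 / p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits: four equally likely categories
print(entropy([0.5, 0.5]))                # 1.0 bit: two equally likely categories
print(entropy([1.0]))                     # 0.0 bits: a single category, no disorder
```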