Deriving Rules From Data

Data Analytics & Machine Learning Algorithms

Recursive Partitioning: C4.5 and CART Algorithms

Overview

Deriving Rules From Data

Machine Learning Algorithms (ML):

derive rules from the data,
create rules and rule-trees by searching through the data for statistical patterns and relationships,
uncover useful relationships among variables, relationships that you may not have known even existed,
use the information in the data and cluster recorded items into specific classes/categories,
build prediction and classification models and explicit rules from the data,
derived models and rules help explain the process that generated the data.

Database Queries answer the question: “What are the data that match this pattern?”

ML Algorithms answer the question: “What is the pattern that matches these data?”

Neural Nets learn from the data and derive implicit relationships between variables,

ML Algorithms learn from the data and derive explicit rule-like relationships between variables.


Deriving Rules From Data
Data Analytics and ML Algorithms

Steps:
specify the problem, a profound question,
specify relevant variables – independent and dependent variables,
access, prepare, and clean the data,
apply Machine Learning (ML) algorithms to derive rules, patterns, relationships, models, decision trees.
ML Algorithm:

takes the selected data and finds the relationship between dependent and independent variables,
the relationship can be described as a decision tree, a rule, a chart, or an equation,
the challenge is to focus the statistical analysis so that hidden relationships bubble up to the surface.

ML Techniques:

Genetic Algorithms,
Bayesian Probabilities,
Neural Networks,
Recursive Partitioning Algorithms.


Tribute to Sir Ronald Aylmer Fisher
R. A. Fisher

R. A. Fisher

Statistician
Evolutionary Biologist
Geneticist
The Iris flower data set or Fisher’s Iris data set
collected the data to quantify the morphologic variation of Iris flowers of three related species: setosa, versicolor, virginica.
Sepal Length, Sepal Width, Petal Length, Petal Width, Species


Fisher’s Iris Flower Data Set
Used by Sir Ronald Fisher (1936) as an example of discriminant analysis
      Sepal Length   Sepal Width   Petal Length   Petal Width   Class
      X1             X2            X3             X4            X5
x1    6.2            3.6           4.4            1.5           Iris-Versicolor
x2    6.9            1.0           1.1            1.6           Iris-Versicolor
x3    3.3            1.6           1.4            1.5           Iris-Versicolor
x4    4.0            2.9           3.2            0.4           Iris-Setosa
x5    4.5            1.5           1.0            1.0           Iris-Versicolor
x6    4.4            2.9           5.4            0.3           Iris-Setosa
x7    4.7            1.7           5.0            0.4           Iris-Virginica
…     …              …             …              …             …
x148  5.7            3.4           1.2            1.1           Iris-Setosa
x149  3.0            3.7           1.1            1.7           Iris-Virginica
x150  6.3            3.2           5.6            1.3           Iris-Virginica


Walmart-Type Data!

Data Matrix

Data can often be represented as an n x d matrix, with n rows and d columns (a code sketch follows the list below):

Rows:
Called: samples, instances, cases, examples, records, transactions, objects, points, feature-vectors, etc.

Columns:
Called: variables, attributes, properties, features, dimensions, fields, etc.
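
As an illustration (not part of the original slides), the n x d view can be written out directly; a minimal Python/NumPy sketch, reusing a few of the Iris-style rows from the earlier table:

    import numpy as np

    # A toy n x d data matrix: 3 records (rows) by 4 attributes (columns).
    # The rows reuse a few Iris-style values from the table above;
    # they are illustrative, not the full 150 x 4 data set.
    X = np.array([
        [6.2, 3.6, 4.4, 1.5],   # x1
        [6.9, 1.0, 1.1, 1.6],   # x2
        [4.0, 2.9, 3.2, 0.4],   # x4
    ])
    y = ["Iris-Versicolor", "Iris-Versicolor", "Iris-Setosa"]  # class column (X5)

    n, d = X.shape
    print(n, d)   # -> 3 4 : n rows (cases/records), d columns (variables/attributes)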

Deriving Rules From Data
Decision Trees
Recursive Partitioning Algorithms:

split the original data into finer and finer subsets (recursively), resulting in a decision tree,
a decision tree is a compact explanation of the data,
the tree will provide the profile of a “typical case” leading to a specific value of the dependent variable,
example: most “widget” buyers live in New York and are less than 35 years old.

Deriving Rules From Data
Decision Trees:

Classification Trees:

classification trees are used when the dependent variable is a categorical/qualitative “YES” or “NO” type of variable,
the goal is to categorize (classify) cases (observations) and derive rules,
cases are fed into the decision tree,
each case is classified at each node of the tree into branches (categories), two or more branches,
the size of the nodes (circles) decreases towards the right, indicating that the number of cases that fall into each successive cluster is decreasing,
the end result is classification rules such as “IF Age is less than 35 and Residence = New York, THEN widget buyer”,
this tree is called a classification tree (a code sketch of such a rule follows this list).
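
A minimal sketch (added here for illustration, not from the slides) of the classification rule quoted above, written as a plain Python function; the field names "age" and "residence" are hypothetical:

    # The classification rule quoted above, written out as a function.
    # The field names "age" and "residence" are hypothetical illustrations.
    def classify_widget_buyer(case):
        if case["residence"] == "New York" and case["age"] < 35:
            return "widget buyer"
        return "non-buyer"

    print(classify_widget_buyer({"age": 29, "residence": "New York"}))  # -> widget buyer
    print(classify_widget_buyer({"age": 52, "residence": "Chicago"}))   # -> non-buyer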

Deriving Rules From Data
Decision Trees:

Regression Trees:

regression trees are used for both categorical and continuous dependent variables,
the goal is to predict or estimate the value of the dependent variable(s),
each branch of the tree partitions off a subset of the data,
the last node of the tree shows the value of the dependent variable (income) for that subset of the data,
the end result is a prediction/estimation rule such as: “IF Residence = New York and Age is less than 35, THEN average Income is $40K with a standard deviation of $5K”,
this tree is called a regression tree – strictly binary, containing exactly two branches for each decision node (see the sketch after this list).
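
As an illustrative sketch (toy numbers, assumed leaf membership), here is how a regression-tree leaf summarizes its subset: the prediction is the subset mean, and the reported spread is its standard deviation:

    import statistics

    # Toy incomes (in $K) for the cases that fall into one leaf of the tree,
    # e.g. "Residence = New York and Age < 35"; the values are illustrative.
    leaf_incomes = [35, 38, 40, 42, 45]

    prediction = statistics.mean(leaf_incomes)   # the leaf's estimate: the subset mean
    spread = statistics.stdev(leaf_incomes)      # the reported uncertainty: the subset SD

    print(f"predicted income: ${prediction:.0f}K, SD: ${spread:.1f}K")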

Deriving Rules From Data
Recursive Partitioning Algorithms:

The two best known RP algorithms are: C4.5 and CART

both algorithms try to create homogeneous clusters,
the goal is to make clusters at the nodes “purer” and “purer” by progressively reducing the “disorder” in the original data,
C4.5, developed by computer scientists (Ross Quinlan, C4.5: Programs for Machine Learning), reduces the information disorder using entropy,
CART, developed by statisticians (Sonquist and Morgan, 1964; Leo Breiman et al., 1984), reduces the statistical disorder using variance.

C4.5 Algorithm: Example

Problem: Widgitco Inc. wants to predict types of customers who are more likely to buy Widget. In other words, the firm wants to find the pattern of customers who buy the product.

Data: the firm has historical demographic data on six independent variables X1 up to X6 and a binary dependent variable “Yes” or “No”.

Deriving Rules From Data
C4.5 Algorithm: Example (continued)

Solution:
divide up the data along the independent variables so that cases form clusters where each cluster contains predominantly “YES” or “NO”,
build the decision tree by first splitting the data with respect to the independent variable X3 (residence) and then X1 (age),
this strategic splitting leads to a quick and simple clustering of the data useful for marketing decisions:

Deriving Rules From Data
C4.5 Algorithm:

How does the algorithm determine which independent variable is the most important to split first?
How does the algorithm determine the boundaries for the splits?
How does the algorithm arrive at the most useful discrimination of the data?

The C4.5 algorithm uses Entropy to calculate the impurity of a cluster.

The higher the entropy (disorder) of a cluster, the more information you need to describe the cluster, and the less certain you can be of what the cluster contains.

As the probability of a single category in a cluster gets closer and closer to 1, the entropy of the cluster (H) gets closer and closer to 0, the disorder of the cluster gets closer and closer to 0, and the cluster becomes most valuable in the sense that it contains a single category.

Deriving Rules From Data
C4.5 Algorithm: Example

the first split is on variable X3 (residence),
it creates two much purer clusters: a cluster with 6 out of 7 buyers, and a cluster with 10 out of 13 non-buyers,
the first split has reduced the impurity from 0.993 to 0.713, a gain of 0.280 bits (the arithmetic is reproduced in the sketch after the patterns below),
the second split is on X1 (age),
the second split further reduces the impurity from 0.713 to 0.580, a gain of 0.133.

Patterns:

Non-New Yorkers have a pattern of not buying widgets,

New Yorkers over the age of 35 have a pattern of not buying widgets,

Younger New Yorkers have a pattern of buying widgets.
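
The sketch below (an illustration, not from the slides) reproduces the impurity arithmetic quoted above; the only inference is that the 20 cases contain 9 buyers and 11 non-buyers, which follows from the two clusters described:

    from math import log2

    def entropy(counts):
        """Entropy in bits of a cluster, given the number of cases per category."""
        total = sum(counts)
        return -sum((c / total) * log2(c / total) for c in counts if c > 0)

    # Before the split: 9 buyers and 11 non-buyers among the 20 cases
    # (inferred from the two clusters described on the slide).
    h_before = entropy([9, 11])                                          # ~0.993 bits

    # After splitting on X3 (residence):
    #   "NY" cluster:      6 buyers, 1 non-buyer   (6 out of 7 buyers)
    #   "not NY" cluster:  3 buyers, 10 non-buyers (10 out of 13 non-buyers)
    h_after = (7 / 20) * entropy([6, 1]) + (13 / 20) * entropy([3, 10])  # ~0.714 bits

    print(round(h_before, 3), round(h_after, 3), round(h_before - h_after, 3))
    # -> 0.993 0.714 0.279, matching the slide's 0.993, 0.713 and 0.280 up to rounding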

Deriving Rules From Data
C4.5 Algorithm

Splitting:

test all possible independent variables at all their possible split points,
compute the resulting gains in purity for each variable and each of its possible splitting points,
pick the split (the variable and its splitting point) that maximizes the gain and form the resulting clusters,
do this over and over again on each resulting cluster (recursively) until the decision tree is built (a search sketch follows this list).
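
A compact sketch (an illustration for numeric variables, not the actual C4.5 source) of the greedy split search just described: every variable is tried at every observed split point, the entropy gain is computed, and the best split is kept:

    from math import log2

    def entropy(labels):
        total = len(labels)
        return -sum((labels.count(c) / total) * log2(labels.count(c) / total)
                    for c in set(labels))

    def best_split(rows, labels):
        """Greedy search: try every variable at every observed split point and
        return the (variable index, threshold, gain) with the largest gain."""
        base, n = entropy(labels), len(rows)
        best = (None, None, 0.0)
        for j in range(len(rows[0])):                  # each independent variable
            for t in sorted({r[j] for r in rows}):     # each candidate split point
                left  = [y for r, y in zip(rows, labels) if r[j] <= t]
                right = [y for r, y in zip(rows, labels) if r[j] > t]
                if not left or not right:
                    continue
                gain = base - (len(left) / n) * entropy(left) \
                            - (len(right) / n) * entropy(right)
                if gain > best[2]:
                    best = (j, t, gain)
        return best

    # Toy data: column 0 = age; "yes"/"no" = widget buyer (values are illustrative).
    rows   = [(29,), (31,), (34,), (45,), (52,), (60,)]
    labels = ["yes", "yes", "yes", "no", "no", "no"]
    print(best_split(rows, labels))   # -> (0, 34, 1.0): splitting age at <= 34 separates perfectly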

Example:

for the widget sales data (Page 3), the independent variable X3 (residence) at the splitting point NY resulted in the highest gain in purity, and clusters “X3 = NY” and “X3 ≠ NY” were formed,
once the data were split at X3, the new cluster formed by all the NY records could be further purified,
the algorithm tested all variables again, but only used the data in the new cluster “X3 = NY”,
the variable X1 (age) at the splitting point X1 <= 35 resulted in the highest gain in purity and became the most discriminating variable at its value of 35.

Deriving Rules From Data
C4.5 Algorithm: Overfitting

how much splitting is appropriate?
the figure below shows two extremes: “One split” (underfit) vs. “Seven splits” (overfit):
“One split” results in two sub-samples (clusters) that are still quite impure (underfit),
“Seven splits” results in completely purified clusters, but the model has become very complicated,
very complicated models are hard to understand and have the danger of overfitting the data.

Deriving Rules From Data
C4.5 Algorithm: Overfitting

overfitting means that the algorithm finds a model that fits the “training data” (or the “sample”) and performs well on the trained data, but performs poorly on real-world new data it has not seen before (out of “sample” data),
the ML algorithm may pick up details in the data that are characteristics of the training sample, but not of the actual problem being modeled,
neural nets and ML algorithms easily “overtrain”: they perform well on the training data but badly on real-world data they have not seen before.

Cross-Validation:

to avoid overfitting, one approach adopted by statisticians is “cross-validation” (a 10-fold sketch follows the Pruning slide below),
in cross-validation the training data is divided into several parts, say 10,
all parts but one are used to train the model, and the leftover part is used to test the model,
this is done several times, each time using different training parts and a different leftover testing part,
in effect, splits are tested on several different combinations of training and testing data,
for instance, with 10 parts there is a total of 10 combinations of training and testing (9 parts for training and 1 part for testing).

Deriving Rules From Data
C4.5 Algorithm: Tree Size - Optimal Number of Splits

the accuracy of the model on “out of sample” data depends on the tree size (number of splits),
the error rate of the model initially decreases as the size of the tree increases, but after some optimal point the error rate starts to increase (overfit),
the accuracy is low for both very simple and very complex trees.

Deriving Rules From Data
C4.5 Algorithm: Pruning

there are two approaches to get to “right sized” trees: “Stopping” and “Pruning”,
in the stopping approach the ML algorithm must decide when to stop while going forward constructing the tree - it turns out that this approach is very hard,
in the pruning approach the ML algorithm tests all possible splits and grows the complete (overfitted) tree, and then “prunes back” the useless branches - the ones that increase the error rate on “out of sample” data,
the tree is iteratively pruned, and with each prune the error of the pruned tree is computed, until the best overall tree is constructed,
research shows that the pruning approach results in much more robust trees.
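
As referenced on the cross-validation slide above, a minimal sketch (not from the slides) of 10-fold cross-validation of a decision tree, using scikit-learn and Fisher's Iris data; the max_depth value is an arbitrary illustration:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Fisher's Iris data: 150 cases, 4 independent variables, 3 species.
    X, y = load_iris(return_X_y=True)

    # A small tree; max_depth = 3 is an arbitrary choice for the illustration.
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)

    # 10-fold cross-validation: train on 9 parts, test on the leftover part,
    # repeated 10 times so that every part serves once as the test set.
    scores = cross_val_score(tree, X, y, cv=10)
    print(round(scores.mean(), 3), round(scores.std(), 3))  # out-of-sample accuracy and its spread
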
Deriving Rules From Data
Classification And Regression Trees (CART)

if the dependent variable is continuous, the CART algorithm is used to construct decision trees,
for continuous variables there is no notion of a “pure” partition - there are infinitely many possible values for the dependent variable,
classification and clustering is done through the notion of “dispersion”,
disorder is measured by variance (a measure of “dispersion” around the “center”),
homogeneous clusters are formed from items whose dependent-variable values are all close to each other - i.e., their standard deviation is quite small,
the criterion is to form clusters with minimum “within cluster” variance and maximum separation between cluster means - impurity and gain are measured by the pair (mean, SD) (a variance-reduction sketch follows the Measuring Disorder slide below).

Deriving Rules From Data
CART: Example

Problem: a software company that sells packaged PC software wants to derive decision rules that determine the types of customers generating the highest revenue.

Data: the company has a database containing records of approximately 2000 customers - each record consists of values of 31 variables.

Dependent Variable(s): “total annual sales” made to the customer (a continuous variable).

Independent Variables: there were 30 independent variables recorded for each customer.

Solution:

since the dependent variable (sales) is continuous, a CART algorithm has been used,
clusters were formed by computing the mean and standard deviation of customers’ total annual sales,
homogeneous clusters were identified based on minimum “within cluster SD” and maximum separation between cluster means,
the partitioning algorithm found only three of the 30 independent variables to be relevant for classifying customers: the number of users in the customer’s organization, the number of inquiries they made about the product they purchased, and the number of product licenses they purchased,
the figure below shows the tree that was constructed using actual data and the CART algorithm.

Deriving Rules From Data
CART

it is hard to interpret the tree,
the top branch has not been expanded.

Deriving Rules From Data
CART

The simplified tree is easier to interpret; it splits the data into four types that can be classified into a 2 by 2 matrix:

large customers who make a large number of inquiries,
large customers who make a small number of inquiries,
small customers who make a large number of inquiries,
small customers who make a small number of inquiries.

Probability, Information Theory and Entropy
Measuring Disorder (Impurity)

Disorder:

Suppose we have M things (cases). The M cases will have the highest impurity (maximum disorder) if they are all Distinct and therefore their classification (clustering) requires K = M Distinct Categories.

In a binary representation, if we have b bits (0s and 1s), we can describe (code): 2 x 2 x 2 x … x 2 = 2^b Distinct Cases/characters/numbers.
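
As referenced on the CART slide above, a minimal sketch (toy numbers, hypothetical split condition) of scoring a split by variance reduction, the CART counterpart of the entropy gain used by C4.5:

    import statistics

    def variance_reduction(parent, left, right):
        """Drop in (population) variance achieved by splitting `parent` into
        `left` and `right`; larger is better (more homogeneous clusters)."""
        n = len(parent)
        weighted = (len(left) / n) * statistics.pvariance(left) + \
                   (len(right) / n) * statistics.pvariance(right)
        return statistics.pvariance(parent) - weighted

    # Toy annual-sales values ($K); splitting on a hypothetical
    # "number of users <= 10" condition. Numbers are illustrative only.
    sales          = [12, 15, 14, 90, 95, 88]
    small_accounts = [12, 15, 14]      # e.g. number of users <= 10
    large_accounts = [90, 95, 88]      # e.g. number of users > 10

    print(round(variance_reduction(sales, small_accounts, large_accounts), 1))
    # A large reduction: each cluster's values sit close to its own mean (small SD).
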
With “b” bits we can describe 2^b Distinct Cases.

The number of bits, b, required to describe (code) M Distinct Cases into K (= M) Distinct Categories must satisfy the equation 2^b = K, and thus we have: b = log2(K) = log2(M).

If we want to describe K Distinct Cases, we need b = log2(K) bits.

Probability, Information Theory and Entropy
Measuring Disorder (Impurity)

Disorder:

Ex 1: for two distinct categories (K = 2), “YES” or “NO”, we have: b = log2(2) = 1, i.e., it takes 1 bit (of information) to describe (code) two distinct categories: NO = 0, YES = 1. The measure of impurity (disorder) is 1.

Ex 2: for four categories (K = 4), A, B, C, and D, we have: b = log2(4) = 2, i.e., it takes 2 bits (of information) to describe (code) four distinct categories: 00 = A, 01 = B, 10 = C, 11 = D. The measure of impurity (disorder) is 2.

Probability, Information Theory and Entropy
Measuring Disorder (Impurity)

Disorder:

The M cases will have the lowest impurity (minimum disorder) if they are all the same and fall into a single category (K = 1). In this case we have: b = log2(1) = 0, i.e., no bits (no information) are needed to describe (code) this case.

The M cases will have the highest impurity (maximum disorder) if they are all distinct and thus fall into K = M categories, i.e., we need b = log2(M) bits to describe (code) them.

Thus: log2(K) is a good “measure” of disorder (impurity), where 1 <= K <= M is the number of Distinct Categories into which the M Cases may be classified.

The unit of impurity is the bit: the more bits necessary for description, the higher the amount of information to code, and thus the higher the impurity (disorder).

Probability, Information Theory and Entropy
Measuring Disorder (Impurity)

Probability and Information:

Consider a stochastic experiment in which cases are assigned to N distinct categories (N distinct possible outcomes) with equal probabilities p = 1/N.

The larger the N (i.e., the lower the p), the higher the disorder (impurity).

The higher the disorder (impurity), the higher the number of bits (information) required for description (coding).

Impurity is proportional to the inverse of probability.

Probability, Information Theory and Entropy
Measuring Disorder (Impurity)

Probability and Information:

This is the reason that in information theory a measure of information is defined as: I = log2(1/p) = - log2(p).

Above is the information content of a category with probability of occurrence equal to p.

For the equally likely experiment with N possible outcomes, the information (disorder) measure is: log2(1/p) = log2(1 / (1/N)) = log2(N):

= 0 bit for N = 1 (all cases fall into 1 category)
= 1 bit for N = 2 (cases fall into one of the 2 distinct categories)
= 2 bits for N = 4 (cases fall into one of the 4 distinct categories)
= 3 bits for N = 8 (cases fall into one of the 8 distinct categories)

Probability, Information Theory and Entropy
Measuring Disorder (Impurity)

Entropy:

Entropy generalizes the information (disorder) of particular categories to the information (disorder) of a cluster with a number of possible categories.

If pi is the probability of the occurrence of the ith category in a particular cluster, then the Entropy of that cluster is defined as:

H = Σ pi log2(1/pi) = - Σ pi log2(pi)  bits  (summing over all categories i in the cluster)

H is the measure of disorder or the information content of a cluster.

The higher the entropy (disorder) of a cluster, the more information you need to describe (code) the cluster, and the less certain you can be of what the cluster contains.

As the probability of a single category in a cluster gets closer and closer to 1, the entropy of the cluster (H) gets closer and closer to 0, the disorder of the cluster gets closer and closer to 0, and the cluster becomes most valuable in the sense that it contains a single category.
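
A minimal sketch (not from the slides) implementing the entropy formula above and checking the limiting behavior just described, that H approaches 0 as one category's probability approaches 1:

    from math import log2

    def H(probs):
        """Cluster entropy in bits: H = -sum_i p_i * log2(p_i)."""
        return -sum(p * log2(p) for p in probs if p > 0)

    # Disorder drops as the cluster becomes dominated by a single category.
    print(H([0.5, 0.5]))     # 1.0 bit   (maximum disorder for two categories)
    print(H([0.9, 0.1]))     # ~0.469 bits
    print(H([0.99, 0.01]))   # ~0.081 bits
    print(H([1.0, 0.0]))     # 0.0 bits  (a single category: no disorder)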