CS699
Lecture 8 Association Rule Mining
What Is Frequent Pattern Analysis?
Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
Motivation: Finding inherent regularities in data
What products were often purchased together? Beer and diapers?!
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
Applications
Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.
Why Is Freq. Pattern Mining Important?
Freq. pattern: An intrinsic and important property of datasets
Foundation for many essential data mining tasks
Association, correlation, and causality analysis
Sequential, structural (e.g., sub-graph) patterns
Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
Classification: discriminative, frequent pattern analysis
Cluster analysis: frequent pattern-based clustering
Data warehousing: iceberg cube and cube-gradient
Broad applications
Association Rule Mining
• Two-step process:
• First, mine all frequent patterns
A frequent pattern is also called a frequent itemset or a large itemset
• Second, mine strong rules from the frequent itemsets
Basic Concepts: Frequent Itemsets
• itemset: a set of one or more items
• k-itemset: a set of k items
• 1-itemset: {beer}, {nuts}, {diaper}, {coffee}, …
• 2-itemset: {beer, nuts}, {beer, coffee}, {eggs, milk}, …
• 3-itemset: {beer, nuts, diaper}, {nuts, coffee, eggs}, …

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Beer, Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk
Basic Concepts: Frequent Itemsets

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Beer, Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

(absolute) support, or support count, of X: the frequency, or number of transactions, that contain itemset X
(relative) support, s, is the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
When we say "support," it could mean either, so interpret it in context.
Support of {beer}: 4 (count), 0.8 (4 out of 5), or 80%
Support of {coffee, diaper}: 2 (count), 0.4 (2 out of 5), or 40%
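As a minimal sketch (my illustration, not from the textbook), both notions of support can be computed by a single scan over the transactions; names like support_count are my own:

```python
# Minimal sketch: absolute and relative support of an itemset,
# using the five-transaction example above.
transactions = [
    {"beer", "nuts", "diaper"},
    {"beer", "coffee", "diaper"},
    {"beer", "diaper", "eggs"},
    {"beer", "nuts", "eggs", "milk"},
    {"nuts", "coffee", "diaper", "eggs", "milk"},
]

def support_count(itemset, transactions):
    """Absolute support: number of transactions containing every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Relative support: fraction of transactions containing itemset."""
    return support_count(itemset, transactions) / len(transactions)

print(support_count({"beer"}, transactions))        # 4
print(support({"beer"}, transactions))              # 0.8
print(support({"coffee", "diaper"}, transactions))  # 0.4
```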
Basic Concepts: Frequent Itemsets

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Beer, Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

An itemset X is frequent if X's support is no less than a predefined minimum support threshold, minsup.
If minsup = 0.6 or 60%:
{beer} is frequent (it is a frequent itemset, a large itemset, or a frequent pattern)
{coffee, diaper} is not frequent (it is not a frequent itemset/pattern)
Basic Concepts: Association Rules
AruleR1=X→Y Support of R1:
s(R1) = support(X Y),
or probability that a transaction contains X Y
Confidence of R1:
c(R1) = support(X Y) / support(X),
or conditional probability that a transaction having X also contains Y
Basic Concepts: Association Rules

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Beer, Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

(The slide shows a Venn diagram of customers who buy diaper, customers who buy beer, and customers who buy both.)

R1 = {diaper} → {beer}
s(R1) = support({diaper, beer}) = 3 (count), 0.6 or 60%
3 transactions (or 60% of all transactions) contain both diaper and beer.
c(R1) = support({diaper, beer}) / support({diaper}) = 3 / 4 = 0.75 or 75%
Among those who purchased diaper, 75% of them also purchased beer.
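A small sketch of these two definitions (my addition, reusing the transactions list from the earlier sketch):

```python
def rule_support(antecedent, consequent, transactions):
    """s(X -> Y) = support(X union Y)."""
    union = antecedent | consequent
    return sum(1 for t in transactions if union <= t) / len(transactions)

def rule_confidence(antecedent, consequent, transactions):
    """c(X -> Y) = support(X union Y) / support(X)."""
    union = antecedent | consequent
    n_union = sum(1 for t in transactions if union <= t)
    n_ante = sum(1 for t in transactions if antecedent <= t)
    return n_union / n_ante

# R1 = {diaper} -> {beer}
print(rule_support({"diaper"}, {"beer"}, transactions))     # 0.6
print(rule_confidence({"diaper"}, {"beer"}, transactions))  # 0.75
```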
Basic Concepts: Association Rules

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Beer, Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

Find all rules X → Y with support ≥ minimum support and confidence ≥ minimum confidence. They are called strong rules.
Let minsup = 40%, minconf = 60%. Then:
Beer → Eggs (support = 40%, confidence = 50%) is not a strong rule.
Eggs → Beer (support = 40%, confidence = 67%) is a strong rule.
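Checking this pair of rules in code (a sketch reusing rule_support and rule_confidence from above; is_strong is my own name):

```python
def is_strong(antecedent, consequent, transactions, minsup=0.4, minconf=0.6):
    """A rule is strong if it meets both the support and confidence thresholds."""
    return (rule_support(antecedent, consequent, transactions) >= minsup and
            rule_confidence(antecedent, consequent, transactions) >= minconf)

print(is_strong({"beer"}, {"eggs"}, transactions))  # False (confidence 0.5)
print(is_strong({"eggs"}, {"beer"}, transactions))  # True  (confidence ~0.67)
```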
The Downward Closure Property and Scalable Mining Methods
Apriori property of frequent itemsets:
Any nonempty subset of a frequent itemset must be frequent.
If {beer, diaper, nuts} is frequent, so is {beer, diaper}, because every transaction containing {beer, diaper, nuts} also contains {beer, diaper}.
It can also be stated as: if an itemset contains a subset that is not frequent, then the itemset can never be frequent.
Consider an itemset X = {milk, cheese, egg}. If {milk, egg} is not frequent, then X can never be frequent.
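This property is exactly the check used for pruning. A self-contained sketch (my illustration; the function name is my own):

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_k_minus_1):
    """Apriori pruning test: a k-itemset candidate can be dropped if any of
    its (k-1)-subsets is not among the frequent (k-1)-itemsets."""
    k = len(candidate)
    return any(frozenset(sub) not in frequent_k_minus_1
               for sub in combinations(sorted(candidate), k - 1))

# If {milk, egg} is not frequent, {milk, cheese, egg} can be pruned:
frequent_2 = {frozenset({"milk", "cheese"}), frozenset({"cheese", "egg"})}
print(has_infrequent_subset({"milk", "cheese", "egg"}, frequent_2))  # True
```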
The Downward Closure Property and Scalable Mining Methods
Scalable mining methods: Three major approaches
Apriori (Agrawal & Srikant@VLDB’94)
Freq. pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD’00)
Vertical data format approach (Charm—Zaki & Hsiao @SDM’02)
Apriori: A Candidate Generation & Test Approach
Pruning using the Apriori property: if an itemset has a subset which is infrequent, then the itemset should not be tested!
("test" means: to determine whether an itemset is frequent or not)
Algorithm (simplified):
Initially, scan the DB once to get the frequent 1-itemsets
Generate length (k+1) candidate itemsets from length k frequent itemsets
Prune candidate itemsets using the Apriori property
Test the candidates against the DB
Terminate when no frequent or candidate set can be generated
For simplicity, we will not discuss how to prune candidate itemsets.
The Apriori Algorithm – An Example

Database TDB (Supmin = 2):
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

C1 (after 1st scan):
Itemset   sup
{A}       2
{B}       3
{C}       3
{D}       1
{E}       3

L1:
Itemset   sup
{A}       2
{B}       3
{C}       3
{E}       3

C2 (generated from L1):
{A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

C2 (after 2nd scan):
Itemset   sup
{A, B}    1
{A, C}    2
{A, E}    1
{B, C}    2
{B, E}    3
{C, E}    2

L2:
Itemset   sup
{A, C}    2
{B, C}    2
{B, E}    3
{C, E}    2

C3:
{B, C, E}

L3 (after 3rd scan):
Itemset     sup
{B, C, E}   2
The Apriori Algorithm (outline)
1. Scan DB and find candidate 1-itemsets → C1
2. Mine frequent 1-itemsets from C1 → L1
3. Generate candidate 2-itemsets from L1 → C2 (w/o count)
4. Scan DB and count → C2 (with count)
5. Mine frequent 2-itemsets from C2 → L2
6. k=3
7. Generate candidate k-itemsets from Lk-1 → Ck (w/o count)
8. Scan DB and count → Ck (with count)
9. Mine frequent k-itemsets from Ck → Lk
10. k = k + 1, and go to Step 7
***. Stop when Ck is empty or Lk is empty
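The outline maps directly onto code. Below is a compact, self-contained Python sketch of the level-wise loop (my illustration, not the textbook's implementation; function and variable names are my own). Unlike the simplified walkthrough, this sketch also applies Apriori pruning during candidate generation:

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Level-wise Apriori sketch. Returns {frozenset itemset: support count}."""
    transactions = [frozenset(t) for t in transactions]
    # Steps 1-2: count single items and keep the frequent 1-itemsets (L1).
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {s: c for s, c in counts.items() if c >= min_count}
    all_frequent = dict(L)
    k = 2
    while L:
        # Step 7: join L(k-1) with itself; two (k-1)-itemsets are joined
        # only if their first k-2 items are identical.
        prev = sorted(tuple(sorted(s)) for s in L)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                if prev[i][:k - 2] == prev[j][:k - 2]:
                    cand = frozenset(prev[i] + prev[j][k - 2:])
                    # Apriori pruning: every (k-1)-subset must be frequent.
                    if all(frozenset(sub) in L
                           for sub in combinations(sorted(cand), k - 1)):
                        candidates.add(cand)
        # Steps 8-9: scan the DB, count, and keep frequent candidates.
        L = {}
        for cand in candidates:
            c = sum(1 for t in transactions if cand <= t)
            if c >= min_count:
                L[cand] = c
        all_frequent.update(L)
        k += 1  # Step 10; the loop stops when Ck or Lk is empty.
    return all_frequent

# The four-transaction TDB from the example, with Supmin = 2:
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
freq = apriori(tdb, 2)
print(freq[frozenset({"B", "C", "E"})])  # 2, matching L3 in the example
```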
Implementation of Apriori
How to generate Ck+1 from Lk:
Join two k-itemsets to generate (k+1)-itemsets.
When joining two k-itemsets, we join only if the first k−1 items are identical.
Example: Generate C4 from L3.
L3 = {abc, abd, acd, ace, bcd}
When joining two 3-itemsets, we join only if the first 2 items are identical:
Generate abcd from abc and abd
Generate acde from acd and ace
After pruning, C4 = {abcd} (acde is pruned because, e.g., its subset cde is not in L3).
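A focused sketch of just this join step (my illustration, representing itemsets as sorted tuples; the function name is my own):

```python
def join_step(L_k, k):
    """Join two k-itemsets when their first k-1 items agree.
    Sketch of the join only; pruning is a separate step."""
    candidates = set()
    items = sorted(L_k)
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i][:k - 1] == items[j][:k - 1]:
                candidates.add(items[i] + items[j][k - 1:])
    return candidates

L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")]
print(join_step(L3, 3))
# {('a','b','c','d'), ('a','c','d','e')} -- before pruning; pruning then
# removes ('a','c','d','e') because ('c','d','e') is not in L3,
# leaving C4 = {('a','b','c','d')}
```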
Another Example
• Items are sorted.
• Pruning is not shown in the subsequent steps.

Dataset (min. support = 30%, or three transactions):
Customer   Items
C1         beer, bread, chip, egg
C2         beer, bread, chip, egg, popcorn
C3         bread, chip, egg
C4         beer, bread, chip, egg, milk, popcorn
C5         beer, bread, milk
C6         beer, bread, egg
C7         bread, chip, milk
C8         bread, butter, chip, egg, milk
C9         butter, chip, egg
C1 (candidate 1-itemsets):
Itemset     Support count
{beer}      5
{bread}     8
{butter}    2
{chip}      7
{egg}       7
{milk}      4
{popcorn}   2
L1 (frequent 1-itemsets):
Itemset   Support count
{beer}    5
{bread}   8
{chip}    7
{egg}     7
{milk}    4
C2 (candidate 2-itemsets):
{beer, bread}, {beer, chip}, {beer, egg}, {beer, milk}, {bread, chip}, {bread, egg}, {bread, milk}, {chip, egg}, {chip, milk}, {egg, milk}
C2 with counts (scan the database and count the supports):
Itemset         Support count
{beer, bread}   5
{beer, chip}    3
{beer, egg}     4
{beer, milk}    2
{bread, chip}   6
{bread, egg}    6
{bread, milk}   4
{chip, egg}     6
{chip, milk}    3
{egg, milk}     2
L2 (frequent 2-itemsets):
Itemset         Support count
{beer, bread}   5
{beer, chip}    3
{beer, egg}     4
{bread, chip}   6
{bread, egg}    6
{bread, milk}   4
{chip, egg}     6
{chip, milk}    3
C3 (candidate 3-itemsets):
{beer, bread, chip}
{beer, bread, egg}
{beer, chip, egg}
{bread, chip, egg}
{bread, chip, milk}
{bread, egg, milk}
{chip, egg, milk}

Two frequent 2-itemsets are joined only if their first items are identical.
For example, we join {beer, bread} and {beer, chip} to generate {beer, bread, chip}, but we do not join {beer, chip} and {chip, egg}.
Note: Pruning is done at this step before counting, but it is omitted here.
C3 with counts (scan the database and count supports):
Itemset               Support count
{beer, bread, chip}   3
{beer, bread, egg}    4
{beer, chip, egg}     3
{bread, chip, egg}    5
{bread, chip, milk}   3
{bread, egg, milk}    2
{chip, egg, milk}     2
L3 (frequent 3-itemsets):
Itemset               Support count
{beer, bread, chip}   3
{beer, bread, egg}    4
{beer, chip, egg}     3
{bread, chip, egg}    5
{bread, chip, milk}   3
C4 (candidate 4-itemsets):
{beer, bread, chip, egg}
{bread, chip, egg, milk}

Again, we join two frequent 3-itemsets only if the first two items are identical:
{beer, bread, chip} JOIN {beer, bread, egg} generates {beer, bread, chip, egg}
{bread, chip, egg} JOIN {bread, chip, milk} generates {bread, chip, egg, milk}
C4 after pruning (scan the database and count the supports):
Itemset                    Support count
{beer, bread, chip, egg}   3
({bread, chip, egg, milk} is pruned because its subset {bread, egg, milk} is not frequent.)
L4 (frequent 4-itemsets):
Itemset                    Support count
{beer, bread, chip, egg}   3
All Frequent Itemsets
Frequent itemsets: L = L1 ∪ L2 ∪ L3 ∪ L4
L = {{beer}, {bread}, {chip}, {egg}, {milk}, {beer, bread}, {beer, chip}, {beer, egg}, {bread, chip}, {bread, egg}, {bread, milk}, {chip, egg}, {chip, milk}, {beer, bread, chip}, {beer, bread, egg}, {beer, chip, egg}, {bread, chip, egg}, {bread, chip, milk}, {beer, bread, chip, egg}}
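As a cross-check (my addition), the apriori() sketch from earlier reproduces this result on the nine-customer dataset. Note the sketch prunes before counting, which the walkthrough deferred, so its C3 and C4 are smaller but its L is identical:

```python
# The nine-customer dataset with min_count = 3 (30% of 9, rounded up).
customers = [
    {"beer", "bread", "chip", "egg"},
    {"beer", "bread", "chip", "egg", "popcorn"},
    {"bread", "chip", "egg"},
    {"beer", "bread", "chip", "egg", "milk", "popcorn"},
    {"beer", "bread", "milk"},
    {"beer", "bread", "egg"},
    {"bread", "chip", "milk"},
    {"bread", "butter", "chip", "egg", "milk"},
    {"butter", "chip", "egg"},
]
L = apriori(customers, 3)
print(len(L))  # 19 itemsets, matching the list above
print(L[frozenset({"beer", "bread", "chip", "egg"})])  # 3
```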
Mining Strong Rules
• For each frequent itemset, identify all nonempty proper subsets.
• Example: from {beer, bread, egg}, all nonempty proper subsets are:
{beer}, {bread}, {egg}, {beer, bread}, {beer, egg}, {bread, egg}
• For each subset, we form a rule (see the sketch after this list):
R1: {beer} → {bread, egg}
R2: {bread} → {beer, egg}
R3: {egg} → {beer, bread}
R4: {beer, bread} → {egg}
R5: {beer, egg} → {bread}
R6: {bread, egg} → {beer}
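Enumerating these candidate rules is a straightforward subset enumeration. A minimal sketch (my own function name):

```python
from itertools import combinations

def candidate_rules(itemset):
    """Yield all rules X -> (itemset - X) for nonempty proper subsets X."""
    items = sorted(itemset)
    for r in range(1, len(items)):
        for antecedent in combinations(items, r):
            yield set(antecedent), set(items) - set(antecedent)

for lhs, rhs in candidate_rules({"beer", "bread", "egg"}):
    print(lhs, "->", rhs)  # six rules, R1 through R6
```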
Mining Strong Rules
• Compute the confidences:
confidence = sup(all items) / sup(antecedent)
conf(R1) = sup({beer, bread, egg}) / sup({beer}) = 4/5 = 80%
conf(R2) = sup({beer, bread, egg}) / sup({bread}) = 4/8 = 50%
conf(R3) = sup({beer, bread, egg}) / sup({egg}) = 4/7 = 57.1%
conf(R4) = sup({beer, bread, egg}) / sup({beer, bread}) = 4/5 = 80%
conf(R5) = sup({beer, bread, egg}) / sup({beer, egg}) = 4/4 = 100%
conf(R6) = sup({beer, bread, egg}) / sup({bread, egg}) = 4/6 = 66.7%
Mining Strong Rules
• Choose the rules whose confidences satisfy the minimum confidence.
• If min_conf = 80%, then R1, R4, and R5 are strong rules.
• If min_conf = 60%, then R1, R4, R5, and R6 are strong rules.
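Putting the two steps together in code (a sketch reusing candidate_rules and the customers list from the sketches above; strong_rules is my own name):

```python
def strong_rules(itemset, transactions, min_conf):
    """Yield (antecedent, consequent, confidence) for rules meeting min_conf."""
    n_full = sum(1 for t in transactions if itemset <= t)
    for lhs, rhs in candidate_rules(itemset):
        n_lhs = sum(1 for t in transactions if lhs <= t)
        conf = n_full / n_lhs
        if conf >= min_conf:
            yield lhs, rhs, conf

for lhs, rhs, conf in strong_rules({"beer", "bread", "egg"}, customers, 0.8):
    print(lhs, "->", rhs, round(conf, 2))  # R1, R4, and R5
```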
Exercise
• Mine all frequent itemsets from the following dataset. Assume that the minimum support is 30% (or 3 transactions).
• Then, mine all strong rules from the first frequent 3-itemset (when 3-itemsets are sorted by the items). Assume that the minimum confidence is 80%.

TID   Items
100   2, 4, 5, 6
200   1, 4, 5, 7
300   2, 4, 5
400   1, 2, 4, 5, 6, 7
500   1, 2, 6
600   1, 2, 5, 7
700   2, 4, 6
800   2, 3, 4, 5, 6
900   3, 4, 5, 6
References
• Han, J., Kamber, M., and Pei, J., Data Mining: Concepts and Techniques, 3rd ed., Morgan Kaufmann, 2012. http://www.cs.illinois.edu/~hanj/bk3/