COMP9318: Data Warehousing
and Data Mining
— L6: Association Rule Mining —
- Problem definition and preliminaries
What Is Association Mining?
- Association rule mining:
  - Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
  - Frequent pattern: a pattern (set of items, sequence, etc.) that occurs frequently in a database [AIS93]
- Motivation: finding regularities in data
  - What products were often purchased together? Beer and diapers?!
  - What are the subsequent purchases after buying a PC?
  - What kinds of DNA are sensitive to this new drug?
  - Can we automatically classify web documents?
Why Is Frequent Pattern or Association Mining an Essential Task in Data Mining?
- Foundation for many essential data mining tasks
  - Association, correlation, causality
  - Sequential patterns, temporal or cyclic association, partial periodicity, spatial and multimedia association
  - Associative classification, cluster analysis, iceberg cube, fascicles (semantic data compression)
- Broad applications
  - Basket data analysis, cross-marketing, catalog design, sales campaign analysis
  - Web log (click stream) analysis, DNA sequence analysis, etc.; cf. Google's spelling suggestion
Basic Concepts: Frequent Patterns and Association Rules
- Itemset X = {x1, …, xk}
  - Shorthand: x1 x2 … xk
- Find all the rules X → Y with min confidence and support
  - support, s: probability that a transaction contains X ∪ Y
  - confidence, c: conditional probability that a transaction having X also contains Y

Transaction-id | Items bought
10             | {A, B, C}
20             | {A, C}
30             | {A, D}
40             | {B, E, F}

Let min_support = 50%, min_conf = 70%:
    sup(AC) = 2           ← a frequent itemset
    A → C (50%, 66.7%)    ← an association rule
    C → A (50%, 100%)

[Venn-diagram figure: "Customer buys beer", "Customer buys diaper", "Customer buys both"]
Mining Association Rules: an Example

Min. support 50%, Min. confidence 50%

Transaction-id | Items bought
10             | A, B, C
20             | A, C
30             | A, D
40             | B, E, F

Frequent pattern | Support
{A}              | 75%
{B}              | 50%
{C}              | 50%
{A, C}           | 50%

For rule A → C:
    support = support({A} ∪ {C}) = 50%
    confidence = support({A} ∪ {C}) / support({A}) = 66.7%
(a small Python sketch of these computations follows below)

Major computation challenge: calculating the support of itemsets ← the frequent itemset mining problem
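These support and confidence computations are easy to spot-check in code. A minimal Python sketch (the function names and the tdb variable are illustrative, not part of the lecture material):

    def support(itemset, transactions):
        """Fraction of transactions that contain every item of `itemset`."""
        itemset = set(itemset)
        return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

    def confidence(A, C, transactions):
        """support(A ∪ C) / support(A)."""
        return support(set(A) | set(C), transactions) / support(A, transactions)

    tdb = [{'A', 'B', 'C'}, {'A', 'C'}, {'A', 'D'}, {'B', 'E', 'F'}]
    print(support({'A', 'C'}, tdb))        # 0.5
    print(confidence({'A'}, {'C'}, tdb))   # 0.666... for rule A -> C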
Association Rule Mining Algorithms
- Algorithms for scalable mining of (single-dimensional Boolean) association rules in transactional databases
- Naïve algorithm (a brute-force sketch follows this slide)
  - Enumerate all possible itemsets and check their support against min_sup
  - Generate all association rules and check their confidence against min_conf
- The Apriori property
- Apriori Algorithm
- FP-growth Algorithm
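For concreteness, here is a brute-force sketch of the enumeration part of the naïve algorithm (illustrative names; it is exponential in the number of items, which is exactly what Apriori and FP-growth avoid):

    from itertools import combinations

    def naive_frequent_itemsets(transactions, min_sup):
        """Enumerate every possible itemset and keep those with support >= min_sup."""
        items = sorted({i for t in transactions for i in t})
        frequent = {}
        for k in range(1, len(items) + 1):
            for cand in combinations(items, k):        # all k-itemsets
                cand = frozenset(cand)
                count = sum(1 for t in transactions if cand <= set(t))
                if count >= min_sup:
                    frequent[cand] = count
        return frequent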
Candidate Generation & Verification
[Itemset lattice showing all candidate itemsets for {A, B, C, D, E}: null at the top, then the 1-itemsets A … E, the 2-itemsets AB … DE, the 3-itemsets ABC … CDE, the 4-itemsets ABCD … BCDE, and ABCDE at the bottom]
Apriori Property
- A frequent (historically called "large") itemset is an itemset whose support is ≥ min_sup.
- Apriori property (downward closure): any subset of a frequent itemset is also a frequent itemset
  - Equivalently: "any superset of an infrequent itemset is also an infrequent itemset"
  - Aka the anti-monotone property of support

[Lattice over {A, B, C, D}: 1-itemsets A … D, 2-itemsets AB … CD, 3-itemsets ABC … BCD]
Illustrating the Apriori Principle

[Lattice over {A, B, C, D, E}, shown twice: once an itemset is found to be infrequent, all of its supersets are pruned from the search space]

Q: How to design an algorithm to improve the naïve algorithm?
Apriori: A Candidate Generation-and-Test Approach
- Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated/tested!
- Algorithm [Agrawal & Srikant 1994]
  1. Ck ← perform level-wise candidate generation (starting from singleton itemsets)
  2. Lk ← verify Ck against the database (keep the candidates whose support is ≥ min_sup)
  3. Ck+1 ← generated from Lk
  4. Go to 2 if Ck+1 is not empty
The Apriori Algorithm
- Pseudo-code (a Python sketch follows below):
    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do begin
            increment the count of all candidates in Ck+1 that are contained in t
        end
        Lk+1 = candidates in Ck+1 with support ≥ min_support
    end
    return ∪k Lk;
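The pseudo-code maps almost line-by-line onto the following Python sketch (a simplified illustration; the candidate-generation step here joins itemsets by set union rather than the ordered self-join described on the next slides):

    from itertools import combinations

    def apriori(transactions, min_sup):
        """Level-wise Apriori; returns {frequent itemset: support count}."""
        transactions = [frozenset(t) for t in transactions]
        freq, Lk = {}, set()
        for i in {i for t in transactions for i in t}:         # L1: frequent items
            cnt = sum(1 for t in transactions if i in t)
            if cnt >= min_sup:
                s = frozenset([i]); freq[s] = cnt; Lk.add(s)
        k = 1
        while Lk:
            # generate C(k+1) from Lk, then prune candidates with an infrequent k-subset
            Ck1 = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
            Ck1 = {c for c in Ck1
                   if all(frozenset(s) in Lk for s in combinations(c, k))}
            # one scan over the database to count supports
            counts = {c: 0 for c in Ck1}
            for t in transactions:
                for c in Ck1:
                    if c <= t:
                        counts[c] += 1
            Lk = {c for c, n in counts.items() if n >= min_sup}
            freq.update({c: counts[c] for c in Lk})
            k += 1
        return freq

    # e.g. apriori([{'A','C','D'}, {'B','C','E'}, {'A','B','C','E'}, {'B','E'}], 2)
    # yields {A}, {B}, {C}, {E}, {A,C}, {B,C}, {B,E}, {C,E}, {B,C,E} with their counts.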
The Apriori Algorithm: An Example (minsup = 50%, i.e., support count ≥ 2)

Database TDB:
Tid | Items bought
10  | A, C, D
20  | B, C, E
30  | A, B, C, E
40  | B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1 (candidates meeting minsup): {A}:2, {B}:3, {C}:3, {E}:3

C2 (generated from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (generated from L2): {B,C,E}
3rd scan → L3: {B,C,E}:2
Important Details of Apriori
1. How to generate candidates?
   - Step 1: self-joining Lk (what's the join condition? why?)
   - Step 2: pruning
2. How to count supports of candidates?

Example of Candidate Generation
- L3 = {abc, abd, acd, ace, bcd}
- Self-joining: L3 * L3
  - abcd from abc and abd
  - acde from acd and ace
- Pruning:
  - acde is removed because ade is not in L3
- C4 = {abcd}
Generating Candidates in SQL
- Suppose the items in Lk-1 are listed in an order
- Step 1: self-joining Lk-1
    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1 and … and p.itemk-2 = q.itemk-2
      and p.itemk-1 < q.itemk-1
- Step 2: pruning
    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck
(a Python version of both steps follows below)
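The same two steps in Python, with each itemset in Lk-1 stored as a sorted tuple (generate_candidates is an illustrative name, not from the lecture):

    from itertools import combinations

    def generate_candidates(L_prev, k):
        """Self-join L_{k-1} and prune using the Apriori property."""
        L_prev = set(L_prev)
        candidates = set()
        for p in L_prev:
            for q in L_prev:
                # join condition: first k-2 items equal, last item of p < last item of q
                if p[:k - 2] == q[:k - 2] and p[-1] < q[-1]:
                    candidates.add(p + (q[-1],))
        # pruning: every (k-1)-subset of a candidate must itself be frequent
        return {c for c in candidates
                if all(s in L_prev for s in combinations(c, k - 1))}

    # L3 = {('a','b','c'), ('a','b','d'), ('a','c','d'), ('a','c','e'), ('b','c','d')}
    # generate_candidates(L3, 4) -> {('a', 'b', 'c', 'd')}, matching the example above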
Derive Rules from Frequent Itemsets
- Frequent itemsets ≠ association rules
- One more step is required to find association rules (a Python sketch of this step follows below)
- For each frequent itemset X,
  for each proper nonempty subset A of X:
  - Let B = X - A
  - A → B is an association rule if confidence(A → B) ≥ min_conf,
    where support(A → B) = support(AB), and
    confidence(A → B) = support(AB) / support(A)
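A direct Python rendering of this step, assuming freq maps each frequent itemset (as a frozenset) to its support count, e.g. the output of an Apriori or FP-growth run (illustrative names):

    from itertools import combinations

    def derive_rules(freq, min_conf):
        """Generate rules A -> B from frequent itemsets with confidence >= min_conf."""
        rules = []
        for X, sup_X in freq.items():
            if len(X) < 2:
                continue
            for r in range(1, len(X)):                 # all proper nonempty subsets A of X
                for A in map(frozenset, combinations(X, r)):
                    B = X - A
                    # every subset of a frequent itemset is frequent (Apriori property),
                    # so freq[A] is guaranteed to exist
                    conf = sup_X / freq[A]             # support(AB) / support(A)
                    if conf >= min_conf:
                        rules.append((A, B, sup_X, conf))
        return rules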
Example: Deriving Rules from Frequent Itemsets
- Suppose 234 is frequent, with supp = 50%
  - Proper nonempty subsets: 23, 24, 34, 2, 3, 4, with supp = 50%, 50%, 75%, 75%, 75%, 75% respectively
- These generate the following association rules:
  - 23 => 4, confidence = 100%
  - 24 => 3, confidence = 100%
  - 34 => 2, confidence = 67%   [= (N * 50%) / (N * 75%)]
  - 2 => 34, confidence = 67%
  - 3 => 24, confidence = 67%
  - 4 => 23, confidence = 67%
- All rules have support = 50%

Q: is there any optimization (e.g., pruning) for this step?
Deriving Rules
- To recap: in order to obtain A → B, we need support(AB) and support(A)
- This step is not as time-consuming as frequent itemset generation
  - Why?
- It is also easy to speed up using techniques such as parallel processing.
  - How?
- Do we really need candidate generation for deriving association rules?
  - Frequent-Pattern Growth (FP-growth / FP-tree)
Bottleneck of Frequent-Pattern Mining
- Multiple database scans are costly
- Mining long patterns needs many passes of scanning and generates lots of candidates
  - To find the frequent itemset i1 i2 … i100:
    - # of scans: 100
    - # of candidates: (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 - 1
- Bottleneck: candidate generation and test
- Can we avoid candidate generation altogether?
  - FP-growth
No Pain, No Gain

[Table: which of the languages Java, Lisp, Scheme, Python, Ruby each of Alice, Bob, Charlie, and Dora knows; minsup = 1]

- Apriori:
  - L1 = {J, L, S, P, R}
  - C2 = all (5 choose 2) = 10 combinations
  - Most of C2 do not contribute to the result
  - There is no way to tell in advance, because Apriori does not keep track of which transactions support each itemset
No Pain, No Gain (cont.)

[Same table; minsup = 1. Search tree: root ∅ with child J, whose support set is {Alice, Charlie}]

Ideas:
- Keep the support set (the supporting transactions) for each frequent itemset
- DFS
- J → JL? J → ??? To extend J, we only need to look at the support set for J
No Pain, No Gain (cont.)

[Same table; minsup = 1. Search tree grown by DFS: ∅ → J (support set {Alice, Charlie}) → JP ({Charlie}), JR ({Charlie}) → JPR, …]

Ideas:
- Keep the support set for each frequent itemset
- DFS
Notations and Invariants
- Conditional DB:
  - DB|p = {t ∈ DB | t contains itemset p}
  - DB = DB|∅ (i.e., conditioned on nothing)
  - Shorthand: DB|px = DB|(p∪x)
- SupportSet(p∪x, DB) = SupportSet(x, DB|p)
  - Analogy: {x | x mod 6 = 0 ∧ x ∈ [100]} = {x | x mod 3 = 0 ∧ x ∈ even([100])}
- An FP-tree is equivalent to a DB|p
  - One can be converted to the other
- Next, we illustrate the algorithm using conditional DBs (a small code sketch of these definitions follows below)
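These definitions are tiny to express in code; a sketch (illustrative names and data) that makes the invariant concrete:

    def conditional_db(db, p):
        """DB|p: the transactions of db that contain every item of itemset p."""
        p = set(p)
        return [t for t in db if p <= set(t)]

    def support_set(x, db):
        """SupportSet(x, DB): the transactions of db containing itemset x."""
        return conditional_db(db, x)

    # Invariant: SupportSet(p ∪ {x}, DB) == SupportSet({x}, DB|p)
    db = [{'J', 'P', 'R'}, {'J', 'S'}, {'L', 'P'}]
    assert support_set({'J', 'P'}, db) == support_set({'P'}, conditional_db(db, {'J'}))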
FP-tree Essential Idea /1
- Recursive algorithm again!
- FreqItemsets(DB|p):
  - X = FindLocallyFrequentItems(DB|p)    // easy task, as only items (not itemsets) are needed
  - output { xp | x ∈ X }
  - Foreach x in X
    - DB*|px = GetConditionalDB+(DB*|p, x)
    - FreqItemsets(DB*|px)
- Why this is complete: all frequent itemsets in DB|p belong to one of the following categories:
  - patterns of the form xi p for some xi ∈ X (output directly), or
  - patterns of the form ★ p xi (a longer pattern containing p and xi), obtained via recursion on DB*|pxi
(a Python sketch of this recursion follows below)
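Since the slides illustrate the algorithm over conditional DBs rather than FP-trees, here is a runnable sketch of that recursion (illustrative names; real FP-growth stores each DB|p compactly as an FP-tree):

    from collections import Counter

    def freq_itemsets(db, min_sup, p=frozenset(), out=None):
        """FreqItemsets(DB|p): pattern growth by recursion over conditional DBs."""
        if out is None:
            out = {}
        counts = Counter(i for t in db for i in t)     # FindLocallyFrequentItems(DB|p)
        X = [i for i, c in counts.items() if c >= min_sup]
        X.sort(key=lambda i: counts[i])                # fix an order (here: least frequent first)
        for k, x in enumerate(X):
            out[p | {x}] = counts[x]                   # output xp for every locally frequent x
            # conditional DB on x: transactions containing x, keeping only items that
            # come after x in the order (avoids duplicates) and are locally frequent
            later = set(X[k + 1:])
            cond = [set(t) & later for t in db if x in t]
            freq_itemsets([t for t in cond if t], min_sup, p | {x}, out)
        return out

    # freq_itemsets([{'A','C','D'}, {'B','C','E'}, {'A','B','C','E'}, {'B','E'}], 2)
    # returns the same frequent itemsets as the Apriori example earlier.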
No Pain, No Gain (cont.)

[DB|J: the rows of the table containing Java, i.e., Alice's and Charlie's transactions; minsup = 1]

- FreqItemsets(DB|J):
  - {P, R} ← FindLocallyFrequentItems(DB|J)
  - Output {JP, JR}
  - Get DB*|JP; FreqItemsets(DB*|JP)
  - Get DB*|JR; FreqItemsets(DB*|JR)
  - // Guaranteed: no other frequent itemset in DB|J
FP-tree Essential Idea /2
- FreqItemsets(DB|p):
  - If boundary condition holds, then …
  - X = FindLocallyFrequentItems(DB|p)
  - output { xp | x ∈ X }                     // also output each item in X, appended with the conditional pattern p
  - [optional] DB*|p = PruneDB(DB|p, X)       // remove items not in X; potentially reduces the # of transactions (∅ or duplicates); improves efficiency
  - Foreach x in X
    - DB*|px = GetConditionalDB+(DB*|p, x)    // also gets rid of items already processed before x → avoids duplicates
    - [optional] if DB*|px is degenerate, then powerset(DB*|px)
    - FreqItemsets(DB*|px)
Lv 1 Recursion (minsup = 3)

DB (original transactions)      DB* (after PruneDB: infrequent items removed, items reordered)
F C A D G I M P                 F C A M P
A B C F L M O                   F C A B M
B F H J O W                     F B
B C K S P                       C B P
A F C E L P M N                 F C A M P

X = {F, C, A, B, M, P};  Output: F, C, A, B, M, P

Conditional DBs built for the recursive calls (grayed items on the original slide are for illustration purposes only):
- DB*|P                = { F C A M P,  C B P,  F C A M P }
- DB*|M (sans P)
- DB*|B (sans M, P)
- DB*|A (sans B, M, P) = { F C A,  F C A,  F C A }
- DB*|C (sans A, B, M, P)
- DB*|F (sans C, A, B, M, P)
Lv 2 Recursion on DB*|P (minsup = 3)

DB (= DB*|P)     DB* (pruned)
F C A M P        C
C B P            C
F C A M P        C

X = {C};  Output: CP

DB*|C = { C, C, C }, which is actually FullDB*|CP.
Lv 3 recursion on DB*|CP: its DB has only empty sets, or X = {} → immediately returns.
Lv 2 Recursion on DB*|A (sans …) (minsup = 3)

DB (= DB*|A)     DB* (pruned)
F C A            F C
F C A            F C
F C A            F C

X = {F, C};  Output: FA, CA

DB*|C = { F C,  F C,  F C }, which is actually FullDB*|CA → further recursion (output: FCA)
DB*|F = { F,  F,  F } → boundary case
Different Example: Lv 2 Recursion on DB*|P (minsup = 2)

DB (= DB*|P)     DB* (pruned)
F C A M P        F C A
F C B P          F C
F A P            F A

X = {F, C, A};  Output: FP, CP, AP

DB*|A = { F C,  F }, which is actually FullDB*|AP; its X = {F} → Output: FAP
DB*|C = { F,  F }
DB*|F = { F,  F }
I will give you back the FP-tree
- An FP-tree of DB consists of:
  - A fixed order among the items in DB
  - A prefix tree (threaded) of the sorted transactions in DB
  - A header table: (item, freq, ptr)
- When used in the algorithm, the input DB is always pruned (c.f., PruneDB()):
  - Remove infrequent items
  - Remove infrequent items from every transaction
(a construction sketch in Python follows below)
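A compact sketch of building such an FP-tree in Python (FPNode, build_fp_tree and the field names are illustrative, not from the lecture; ties in item frequency may be broken arbitrarily, whereas the example on the next slides uses the order f, c, a, b, m, p):

    from collections import defaultdict

    class FPNode:
        """FP-tree node: item, count, parent, children, and a node-link."""
        def __init__(self, item, parent=None):
            self.item, self.count, self.parent = item, 0, parent
            self.children = {}         # item -> FPNode
            self.link = None           # next node carrying the same item

    def build_fp_tree(transactions, min_sup):
        """Build an FP-tree and its header table from a list of item sets."""
        freq = defaultdict(int)                        # pass 1: global item frequencies
        for t in transactions:
            for i in t:
                freq[i] += 1
        freq = {i: c for i, c in freq.items() if c >= min_sup}
        order = sorted(freq, key=lambda i: -freq[i])   # frequency-descending item order

        root = FPNode(None)
        header = {i: [freq[i], None] for i in order}   # item -> [freq, head of node-links]
        for t in transactions:                         # pass 2: insert pruned, ordered paths
            node = root
            for i in [i for i in order if i in t]:
                if i not in node.children:
                    child = FPNode(i, node)
                    node.children[i] = child
                    child.link, header[i][1] = header[i][1], child   # thread node-link
                node = node.children[i]
                node.count += 1
        return root, header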
FP-tree Example (minsup = 3)

TID | Items bought             | (ordered) frequent items
100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o}    | {f, c, a, b, m}
300 | {b, f, h, j, o, w}       | {f, b}
400 | {b, c, k, s, p}          | {c, b, p}
500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}
Building the FP-tree by inserting the (ordered) transactions one at a time:

Insert t1:  {} → f:1 → c:1 → a:1 → m:1 → p:1

Insert t2:  {} → f:2 → c:2 → a:2, which now branches into m:1 → p:1 and b:1 → m:1

Insert all ti (final tree):
    {}
      f:4
        c:3
          a:3
            m:2
              p:2
            b:1
              m:1
        b:1
      c:1
        b:1
          p:1

Header table (item, freq, head of node-links):
    f 4, c 4, a 3, b 3, m 3, p 3
Mining the FP-tree (minsup = 3)

Output the frequent single items first: f, c, a, b, m, p. Then, for each item, use the header table's node-links to collect its conditional pattern base from the FP-tree, build the conditional FP-tree, and recurse (a sketch of the conditional-pattern-base step follows this walkthrough):

- p's conditional pattern base: f c a m : 2, c b : 1  (item counts: f 2, c 3, a 2, b 1, m 2)
  Only c reaches minsup → cleaned conditional pattern base: c:2, c:1 → conditional FP-tree {} → c:3.
  Output: pc. STOP.

- m's conditional pattern base: f c a : 2, f c a b : 1  (item counts: f 3, c 3, a 3, b 1)
  Conditional FP-tree: {} → f:3 → c:3 → a:3.
  Output: mf, mc, ma; gen_powerset → Output: mac, maf, mcf, macf.

- b's conditional pattern base: f c a : 1, f : 1, c : 1  (item counts: f 2, c 2, a 1)
  No item reaches minsup → STOP.

- a's conditional pattern base: f c : 3  (item counts: f 3, c 3)
  Conditional FP-tree: {} → f:3 → c:3.
  Output: af, ac; gen_powerset → Output: acf.

- c's conditional pattern base: f : 3  (item count: f 3)
  Conditional FP-tree: {} → f:3.
  Output: cf. STOP.

- f's conditional pattern base is empty → STOP.
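Given a tree and header table such as those produced by the build_fp_tree sketch earlier (an assumption: nodes expose item, count, parent and link fields), each item's conditional pattern base is obtained by following its node-links and walking up towards the root:

    def conditional_pattern_base(item, header):
        """Prefix paths ending at `item`, as (items, count) pairs, via node-links."""
        base = []
        node = header[item][1]                 # head of item's node-link chain
        while node is not None:
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)       # walk up towards the root
                parent = parent.parent
            if path:
                base.append((path[::-1], node.count))
            node = node.link
        return base

    # With the lecture's item order (f before c), p's conditional pattern base is
    # [(['f','c','a','m'], 2), (['c','b'], 1)], matching the walkthrough above.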
FP-Growth vs. Apriori: Scalability With the Support Threshold

[Chart: run time (sec.) vs. support threshold (%) on data set T25I20D10K, comparing D1 FP-growth runtime against D1 Apriori runtime]
Why Is FP-Growth the Winner?
- Divide-and-conquer:
  - decomposes both the mining task and the DB according to the frequent patterns obtained so far
  - leads to focused search of smaller databases
- Other factors
  - no candidate generation, no candidate test
  - compressed database: the FP-tree structure
  - no repeated scan of the entire database
  - basic ops are counting locally frequent items and building sub-FP-trees; no pattern search and matching