COMP9318: Data Warehousing
and Data Mining
— L6: Association Rule Mining —
- Problem definition and preliminaries
What Is Association Mining?
- Association rule mining:
  - Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
  - Frequent pattern: a pattern (set of items, sequence, etc.) that occurs frequently in a database [AIS93]
- Motivation: finding regularities in data
  - What products were often purchased together? Beer and diapers?!
  - What are the subsequent purchases after buying a PC?
  - What kinds of DNA are sensitive to this new drug?
  - Can we automatically classify web documents?
Why Is Frequent Pattern or Association Mining an Essential Task in Data Mining?
- Foundation for many essential data mining tasks
  - Association, correlation, causality
  - Sequential patterns, temporal or cyclic association, partial periodicity, spatial and multimedia association
  - Associative classification, cluster analysis, iceberg cube, fascicles (semantic data compression)
- Broad applications
  - Basket data analysis, cross-marketing, catalog design, sales campaign analysis
  - Web log (click stream) analysis, DNA sequence analysis, etc.; cf. Google's spelling suggestion
Basic Concepts: Frequent Patterns and Association Rules
- Itemset X = {x1, …, xk}
  - Shorthand: x1 x2 … xk
- Find all the rules X → Y with min confidence and support
  - support, s: probability that a transaction contains X ∪ Y
  - confidence, c: conditional probability that a transaction having X also contains Y

Transaction-id | Items bought
10             | {A, B, C}
20             | {A, C}
30             | {A, D}
40             | {B, E, F}

Let min_support = 50%, min_conf = 70%:
    sup(AC) = 2           ← a frequent itemset
    A → C (50%, 66.7%)    ← an association rule
    C → A (50%, 100%)

[Venn-diagram figure: "Customer buys beer", "Customer buys diaper", "Customer buys both"]
Mining Association Rules: an Example

Min. support 50%, Min. confidence 50%

Transaction-id | Items bought
10             | A, B, C
20             | A, C
30             | A, D
40             | B, E, F

Frequent pattern | Support
{A}              | 75%
{B}              | 50%
{C}              | 50%
{A, C}           | 50%

For rule A → C:
    support = support({A} ∪ {C}) = 50%
    confidence = support({A} ∪ {C}) / support({A}) = 66.7%
(a small Python sketch of these computations follows below)

Major computation challenge: calculating the support of itemsets ← the frequent itemset mining problem
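These support and confidence computations are easy to spot-check in code. A minimal Python sketch (the function names and the tdb variable are illustrative, not part of the lecture material):

    def support(itemset, transactions):
        """Fraction of transactions that contain every item of `itemset`."""
        itemset = set(itemset)
        return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

    def confidence(A, C, transactions):
        """support(A ∪ C) / support(A)."""
        return support(set(A) | set(C), transactions) / support(A, transactions)

    tdb = [{'A', 'B', 'C'}, {'A', 'C'}, {'A', 'D'}, {'B', 'E', 'F'}]
    print(support({'A', 'C'}, tdb))        # 0.5
    print(confidence({'A'}, {'C'}, tdb))   # 0.666... for rule A -> C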
Association Rule Mining Algorithms
- Algorithms for scalable mining of (single-dimensional Boolean) association rules in transactional databases
- Naïve algorithm (a brute-force sketch follows this slide)
  - Enumerate all possible itemsets and check their support against min_sup
  - Generate all association rules and check their confidence against min_conf
- The Apriori property
- Apriori Algorithm
- FP-growth Algorithm
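For concreteness, here is a brute-force sketch of the enumeration part of the naïve algorithm (illustrative names; it is exponential in the number of items, which is exactly what Apriori and FP-growth avoid):

    from itertools import combinations

    def naive_frequent_itemsets(transactions, min_sup):
        """Enumerate every possible itemset and keep those with support >= min_sup."""
        items = sorted({i for t in transactions for i in t})
        frequent = {}
        for k in range(1, len(items) + 1):
            for cand in combinations(items, k):        # all k-itemsets
                cand = frozenset(cand)
                count = sum(1 for t in transactions if cand <= set(t))
                if count >= min_sup:
                    frequent[cand] = count
        return frequent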
Candidate Generation & Verification
[Itemset lattice showing all candidate itemsets for {A, B, C, D, E}: null at the top, then the 1-itemsets A … E, the 2-itemsets AB … DE, the 3-itemsets ABC … CDE, the 4-itemsets ABCD … BCDE, and ABCDE at the bottom]
Apriori Property
- A frequent (historically called "large") itemset is an itemset whose support is ≥ min_sup.
- Apriori property (downward closure): any subset of a frequent itemset is also a frequent itemset
  - Equivalently: "any superset of an infrequent itemset is also an infrequent itemset"
  - Aka the anti-monotone property of support

[Lattice over {A, B, C, D}: 1-itemsets A … D, 2-itemsets AB … CD, 3-itemsets ABC … BCD]
Illustrating the Apriori Principle

[Lattice over {A, B, C, D, E}, shown twice: once an itemset is found to be infrequent, all of its supersets are pruned from the search space]

Q: How to design an algorithm to improve the naïve algorithm?
Apriori: A Candidate Generation-and-Test Approach
- Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated/tested!
- Algorithm [Agrawal & Srikant 1994]
  1. Ck ← perform level-wise candidate generation (starting from singleton itemsets)
  2. Lk ← verify Ck against the database (keep the candidates whose support is ≥ min_sup)
  3. Ck+1 ← generated from Lk
  4. Go to 2 if Ck+1 is not empty
The Apriori Algorithm
- Pseudo-code (a Python sketch follows below):
    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do begin
            increment the count of all candidates in Ck+1 that are contained in t
        end
        Lk+1 = candidates in Ck+1 with support ≥ min_support
    end
    return ∪k Lk;
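The pseudo-code maps almost line-by-line onto the following Python sketch (a simplified illustration; the candidate-generation step here joins itemsets by set union rather than the ordered self-join described on the next slides):

    from itertools import combinations

    def apriori(transactions, min_sup):
        """Level-wise Apriori; returns {frequent itemset: support count}."""
        transactions = [frozenset(t) for t in transactions]
        freq, Lk = {}, set()
        for i in {i for t in transactions for i in t}:         # L1: frequent items
            cnt = sum(1 for t in transactions if i in t)
            if cnt >= min_sup:
                s = frozenset([i]); freq[s] = cnt; Lk.add(s)
        k = 1
        while Lk:
            # generate C(k+1) from Lk, then prune candidates with an infrequent k-subset
            Ck1 = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
            Ck1 = {c for c in Ck1
                   if all(frozenset(s) in Lk for s in combinations(c, k))}
            # one scan over the database to count supports
            counts = {c: 0 for c in Ck1}
            for t in transactions:
                for c in Ck1:
                    if c <= t:
                        counts[c] += 1
            Lk = {c for c, n in counts.items() if n >= min_sup}
            freq.update({c: counts[c] for c in Lk})
            k += 1
        return freq

    # e.g. apriori([{'A','C','D'}, {'B','C','E'}, {'A','B','C','E'}, {'B','E'}], 2)
    # yields {A}, {B}, {C}, {E}, {A,C}, {B,C}, {B,E}, {C,E}, {B,C,E} with their counts.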
The Apriori Algorithm: An Example (minsup = 50%, i.e., support count ≥ 2)

Database TDB:
Tid | Items bought
10  | A, C, D
20  | B, C, E
30  | A, B, C, E
40  | B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1 (candidates meeting minsup): {A}:2, {B}:3, {C}:3, {E}:3

C2 (generated from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (generated from L2): {B,C,E}
3rd scan → L3: {B,C,E}:2
Important Details of Apriori
1. How to generate candidates?
   - Step 1: self-joining Lk (what's the join condition? why?)
   - Step 2: pruning
2. How to count supports of candidates?

Example of Candidate Generation
- L3 = {abc, abd, acd, ace, bcd}
- Self-joining: L3 * L3
  - abcd from abc and abd
  - acde from acd and ace
- Pruning:
  - acde is removed because ade is not in L3
- C4 = {abcd}
Generating Candidates in SQL
- Suppose the items in Lk-1 are listed in an order
- Step 1: self-joining Lk-1
    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1 and … and p.itemk-2 = q.itemk-2
      and p.itemk-1 < q.itemk-1
- Step 2: pruning
    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck
(a Python version of both steps follows below)
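The same two steps in Python, with each itemset in Lk-1 stored as a sorted tuple (generate_candidates is an illustrative name, not from the lecture):

    from itertools import combinations

    def generate_candidates(L_prev, k):
        """Self-join L_{k-1} and prune using the Apriori property."""
        L_prev = set(L_prev)
        candidates = set()
        for p in L_prev:
            for q in L_prev:
                # join condition: first k-2 items equal, last item of p < last item of q
                if p[:k - 2] == q[:k - 2] and p[-1] < q[-1]:
                    candidates.add(p + (q[-1],))
        # pruning: every (k-1)-subset of a candidate must itself be frequent
        return {c for c in candidates
                if all(s in L_prev for s in combinations(c, k - 1))}

    # L3 = {('a','b','c'), ('a','b','d'), ('a','c','d'), ('a','c','e'), ('b','c','d')}
    # generate_candidates(L3, 4) -> {('a', 'b', 'c', 'd')}, matching the example above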
Derive Rules from Frequent Itemsets
- Frequent itemsets ≠ association rules
- One more step is required to find association rules (a Python sketch of this step follows below)
- For each frequent itemset X,
  for each proper nonempty subset A of X:
  - Let B = X - A
  - A → B is an association rule if confidence(A → B) ≥ min_conf,
    where support(A → B) = support(AB), and
    confidence(A → B) = support(AB) / support(A)
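A direct Python rendering of this step, assuming freq maps each frequent itemset (as a frozenset) to its support count, e.g. the output of an Apriori or FP-growth run (illustrative names):

    from itertools import combinations

    def derive_rules(freq, min_conf):
        """Generate rules A -> B from frequent itemsets with confidence >= min_conf."""
        rules = []
        for X, sup_X in freq.items():
            if len(X) < 2:
                continue
            for r in range(1, len(X)):                 # all proper nonempty subsets A of X
                for A in map(frozenset, combinations(X, r)):
                    B = X - A
                    # every subset of a frequent itemset is frequent (Apriori property),
                    # so freq[A] is guaranteed to exist
                    conf = sup_X / freq[A]             # support(AB) / support(A)
                    if conf >= min_conf:
                        rules.append((A, B, sup_X, conf))
        return rules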
Example: Deriving Rules from Frequent Itemsets
- Suppose 234 is frequent, with supp = 50%
  - Proper nonempty subsets: 23, 24, 34, 2, 3, 4, with supp = 50%, 50%, 75%, 75%, 75%, 75% respectively
- These generate the following association rules:
  - 23 => 4, confidence = 100%
  - 24 => 3, confidence = 100%
  - 34 => 2, confidence = 67%   [= (N * 50%) / (N * 75%)]
  - 2 => 34, confidence = 67%
  - 3 => 24, confidence = 67%
  - 4 => 23, confidence = 67%
- All rules have support = 50%

Q: is there any optimization (e.g., pruning) for this step?
Deriving Rules
- To recap: in order to obtain A → B, we need support(AB) and support(A)
- This step is not as time-consuming as frequent itemset generation
  - Why?
- It is also easy to speed up using techniques such as parallel processing.
  - How?
- Do we really need candidate generation for deriving association rules?
  - Frequent-Pattern Growth (FP-growth / FP-tree)
Bottleneck of Frequent-Pattern Mining
- Multiple database scans are costly
- Mining long patterns needs many passes of scanning and generates lots of candidates
  - To find the frequent itemset i1 i2 … i100:
    - # of scans: 100
    - # of candidates: (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 - 1
- Bottleneck: candidate generation and test
- Can we avoid candidate generation altogether?
  - FP-growth
No Pain, No Gain

[Table: which of the languages Java, Lisp, Scheme, Python, Ruby each of Alice, Bob, Charlie, and Dora knows; minsup = 1]

- Apriori:
  - L1 = {J, L, S, P, R}
  - C2 = all (5 choose 2) = 10 combinations
  - Most of C2 do not contribute to the result
  - There is no way to tell in advance, because Apriori does not keep track of which transactions support each itemset
No Pain, No Gain (cont.)

[Same table; minsup = 1. Search tree: root ∅ with child J, whose support set is {Alice, Charlie}]

Ideas:
- Keep the support set (the supporting transactions) for each frequent itemset
- DFS
- J → JL? J → ??? To extend J, we only need to look at the support set for J
No Pain, No Gain (cont.)

[Same table; minsup = 1. Search tree grown by DFS: ∅ → J (support set {Alice, Charlie}) → JP ({Charlie}), JR ({Charlie}) → JPR, …]

Ideas:
- Keep the support set for each frequent itemset
- DFS
Notations and Invariants
- Conditional DB:
  - DB|p = {t ∈ DB | t contains itemset p}
  - DB = DB|∅ (i.e., conditioned on nothing)
  - Shorthand: DB|px = DB|(p∪x)
- SupportSet(p∪x, DB) = SupportSet(x, DB|p)
  - Analogy: {x | x mod 6 = 0 ∧ x ∈ [100]} = {x | x mod 3 = 0 ∧ x ∈ even([100])}
- An FP-tree is equivalent to a DB|p
  - One can be converted to the other
- Next, we illustrate the algorithm using conditional DBs (a small code sketch of these definitions follows below)
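These definitions are tiny to express in code; a sketch (illustrative names and data) that makes the invariant concrete:

    def conditional_db(db, p):
        """DB|p: the transactions of db that contain every item of itemset p."""
        p = set(p)
        return [t for t in db if p <= set(t)]

    def support_set(x, db):
        """SupportSet(x, DB): the transactions of db containing itemset x."""
        return conditional_db(db, x)

    # Invariant: SupportSet(p ∪ {x}, DB) == SupportSet({x}, DB|p)
    db = [{'J', 'P', 'R'}, {'J', 'S'}, {'L', 'P'}]
    assert support_set({'J', 'P'}, db) == support_set({'P'}, conditional_db(db, {'J'}))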
FP-tree Essential Idea /1
- Recursive algorithm again!
- FreqItemsets(DB|p):
  - X = FindLocallyFrequentItems(DB|p)    // easy task, as only items (not itemsets) are needed
  - output { xp | x ∈ X }
  - Foreach x in X
    - DB*|px = GetConditionalDB+(DB*|p, x)
    - FreqItemsets(DB*|px)
- Why this is complete: all frequent itemsets in DB|p belong to one of the following categories:
  - patterns of the form xi p for some xi ∈ X (output directly), or
  - patterns of the form ★ p xi (a longer pattern containing p and xi), obtained via recursion on DB*|pxi
(a Python sketch of this recursion follows below)
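Since the slides illustrate the algorithm over conditional DBs rather than FP-trees, here is a runnable sketch of that recursion (illustrative names; real FP-growth stores each DB|p compactly as an FP-tree):

    from collections import Counter

    def freq_itemsets(db, min_sup, p=frozenset(), out=None):
        """FreqItemsets(DB|p): pattern growth by recursion over conditional DBs."""
        if out is None:
            out = {}
        counts = Counter(i for t in db for i in t)     # FindLocallyFrequentItems(DB|p)
        X = [i for i, c in counts.items() if c >= min_sup]
        X.sort(key=lambda i: counts[i])                # fix an order (here: least frequent first)
        for k, x in enumerate(X):
            out[p | {x}] = counts[x]                   # output xp for every locally frequent x
            # conditional DB on x: transactions containing x, keeping only items that
            # come after x in the order (avoids duplicates) and are locally frequent
            later = set(X[k + 1:])
            cond = [set(t) & later for t in db if x in t]
            freq_itemsets([t for t in cond if t], min_sup, p | {x}, out)
        return out

    # freq_itemsets([{'A','C','D'}, {'B','C','E'}, {'A','B','C','E'}, {'B','E'}], 2)
    # returns the same frequent itemsets as the Apriori example earlier.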
No Pain, No Gain (cont.)

[DB|J: the rows of the table containing Java, i.e., Alice's and Charlie's transactions; minsup = 1]

- FreqItemsets(DB|J):
  - {P, R} ← FindLocallyFrequentItems(DB|J)
  - Output {JP, JR}
  - Get DB*|JP; FreqItemsets(DB*|JP)
  - Get DB*|JR; FreqItemsets(DB*|JR)
  - // Guaranteed: no other frequent itemset in DB|J
FP-tree Essential Idea /2
- FreqItemsets(DB|p):
  - If boundary condition holds, then …
  - X = FindLocallyFrequentItems(DB|p)
  - output { xp | x ∈ X }                     // also output each item in X, appended with the conditional pattern p
  - [optional] DB*|p = PruneDB(DB|p, X)       // remove items not in X; potentially reduces the # of transactions (∅ or duplicates); improves efficiency
  - Foreach x in X
    - DB*|px = GetConditionalDB+(DB*|p, x)    // also gets rid of items already processed before x → avoids duplicates
    - [optional] if DB*|px is degenerate, then powerset(DB*|px)
    - FreqItemsets(DB*|px)
Lv 1 Recursion (minsup = 3)

DB (original transactions)      DB* (after PruneDB: infrequent items removed, items reordered)
F C A D G I M P                 F C A M P
A B C F L M O                   F C A B M
B F H J O W                     F B
B C K S P                       C B P
A F C E L P M N                 F C A M P

X = {F, C, A, B, M, P};  Output: F, C, A, B, M, P

Conditional DBs built for the recursive calls (grayed items on the original slide are for illustration purposes only):
- DB*|P                = { F C A M P,  C B P,  F C A M P }
- DB*|M (sans P)
- DB*|B (sans M, P)
- DB*|A (sans B, M, P) = { F C A,  F C A,  F C A }
- DB*|C (sans A, B, M, P)
- DB*|F (sans C, A, B, M, P)
Lv 2 Recursion on DB*|P (minsup = 3)

DB (= DB*|P)     DB* (pruned)
F C A M P        C
C B P            C
F C A M P        C

X = {C};  Output: CP

DB*|C = { C, C, C }, which is actually FullDB*|CP.
Lv 3 recursion on DB*|CP: its DB has only empty sets, or X = {} → immediately returns.
Lv 2 Recursion on DB*|A (sans …) (minsup = 3)

DB (= DB*|A)     DB* (pruned)
F C A            F C
F C A            F C
F C A            F C

X = {F, C};  Output: FA, CA

DB*|C = { F C,  F C,  F C }, which is actually FullDB*|CA → further recursion (output: FCA)
DB*|F = { F,  F,  F } → boundary case
Different Example: Lv 2 Recursion on DB*|P (minsup = 2)

DB (= DB*|P)     DB* (pruned)
F C A M P        F C A
F C B P          F C
F A P            F A

X = {F, C, A};  Output: FP, CP, AP

DB*|A = { F C,  F }, which is actually FullDB*|AP; its X = {F} → Output: FAP
DB*|C = { F,  F }
DB*|F = { F,  F }
I will give you back the FP-tree
- An FP-tree of DB consists of:
  - A fixed order among the items in DB
  - A prefix tree (threaded) of the sorted transactions in DB
  - A header table: (item, freq, ptr)
- When used in the algorithm, the input DB is always pruned (c.f., PruneDB()):
  - Remove infrequent items
  - Remove infrequent items from every transaction
(a construction sketch in Python follows below)
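A compact sketch of building such an FP-tree in Python (FPNode, build_fp_tree and the field names are illustrative, not from the lecture; ties in item frequency may be broken arbitrarily, whereas the example on the next slides uses the order f, c, a, b, m, p):

    from collections import defaultdict

    class FPNode:
        """FP-tree node: item, count, parent, children, and a node-link."""
        def __init__(self, item, parent=None):
            self.item, self.count, self.parent = item, 0, parent
            self.children = {}         # item -> FPNode
            self.link = None           # next node carrying the same item

    def build_fp_tree(transactions, min_sup):
        """Build an FP-tree and its header table from a list of item sets."""
        freq = defaultdict(int)                        # pass 1: global item frequencies
        for t in transactions:
            for i in t:
                freq[i] += 1
        freq = {i: c for i, c in freq.items() if c >= min_sup}
        order = sorted(freq, key=lambda i: -freq[i])   # frequency-descending item order

        root = FPNode(None)
        header = {i: [freq[i], None] for i in order}   # item -> [freq, head of node-links]
        for t in transactions:                         # pass 2: insert pruned, ordered paths
            node = root
            for i in [i for i in order if i in t]:
                if i not in node.children:
                    child = FPNode(i, node)
                    node.children[i] = child
                    child.link, header[i][1] = header[i][1], child   # thread node-link
                node = node.children[i]
                node.count += 1
        return root, header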
FP-tree Example (minsup = 3)

TID | Items bought             | (ordered) frequent items
100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o}    | {f, c, a, b, m}
300 | {b, f, h, j, o, w}       | {f, b}
400 | {b, c, k, s, p}          | {c, b, p}
500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}
Building the FP-tree by inserting the (ordered) transactions one at a time:

Insert t1:  {} → f:1 → c:1 → a:1 → m:1 → p:1

Insert t2:  {} → f:2 → c:2 → a:2, which now branches into m:1 → p:1 and b:1 → m:1

Insert all ti (final tree):
    {}
      f:4
        c:3
          a:3
            m:2
              p:2
            b:1
              m:1
        b:1
      c:1
        b:1
          p:1

Header table (item, freq, head of node-links):
    f 4, c 4, a 3, b 3, m 3, p 3
Mining the FP-tree (minsup = 3)

Output the frequent single items first: f, c, a, b, m, p. Then, for each item, use the header table's node-links to collect its conditional pattern base from the FP-tree, build the conditional FP-tree, and recurse (a sketch of the conditional-pattern-base step follows this walkthrough):

- p's conditional pattern base: f c a m : 2, c b : 1  (item counts: f 2, c 3, a 2, b 1, m 2)
  Only c reaches minsup → cleaned conditional pattern base: c:2, c:1 → conditional FP-tree {} → c:3.
  Output: pc. STOP.

- m's conditional pattern base: f c a : 2, f c a b : 1  (item counts: f 3, c 3, a 3, b 1)
  Conditional FP-tree: {} → f:3 → c:3 → a:3.
  Output: mf, mc, ma; gen_powerset → Output: mac, maf, mcf, macf.

- b's conditional pattern base: f c a : 1, f : 1, c : 1  (item counts: f 2, c 2, a 1)
  No item reaches minsup → STOP.

- a's conditional pattern base: f c : 3  (item counts: f 3, c 3)
  Conditional FP-tree: {} → f:3 → c:3.
  Output: af, ac; gen_powerset → Output: acf.

- c's conditional pattern base: f : 3  (item count: f 3)
  Conditional FP-tree: {} → f:3.
  Output: cf. STOP.

- f's conditional pattern base is empty → STOP.
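Given a tree and header table such as those produced by the build_fp_tree sketch earlier (an assumption: nodes expose item, count, parent and link fields), each item's conditional pattern base is obtained by following its node-links and walking up towards the root:

    def conditional_pattern_base(item, header):
        """Prefix paths ending at `item`, as (items, count) pairs, via node-links."""
        base = []
        node = header[item][1]                 # head of item's node-link chain
        while node is not None:
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)       # walk up towards the root
                parent = parent.parent
            if path:
                base.append((path[::-1], node.count))
            node = node.link
        return base

    # With the lecture's item order (f before c), p's conditional pattern base is
    # [(['f','c','a','m'], 2), (['c','b'], 1)], matching the walkthrough above.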
FP-Growth vs. Apriori: Scalability With the Support Threshold

[Chart: run time (sec.) vs. support threshold (%) on data set T25I20D10K, comparing D1 FP-growth runtime against D1 Apriori runtime]
Why Is FP-Growth the Winner?
- Divide-and-conquer:
  - decomposes both the mining task and the DB according to the frequent patterns obtained so far
  - leads to focused search of smaller databases
- Other factors
  - no candidate generation, no candidate test
  - compressed database: the FP-tree structure
  - no repeated scan of the entire database
  - basic ops are counting locally frequent items and building sub-FP-trees; no pattern search and matching