
COMP9318: Data Warehousing and Data Mining
— L6: Association Rule Mining —

n Problem definition and preliminaries

What Is Association Mining?
n Association rule mining:
n Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
n Frequent pattern: pattern (set of items, sequence, etc.) that occurs frequently in a database [AIS93]
n Motivation: finding regularities in data
n What products were often purchased together? — Beer and diapers?!
n What are the subsequent purchases after buying a PC?
n What kinds of DNA are sensitive to this new drug?
n Can we automatically classify web documents?

Why Is Frequent Pattern or Association Mining an Essential Task in Data Mining?
n Foundation for many essential data mining tasks
n Association, correlation, causality
n Sequential patterns, temporal or cyclic association, partial periodicity, spatial and multimedia association
n Associative classification, cluster analysis, iceberg cube, fascicles (semantic data compression)
n Broad applications
n Basket data analysis, cross-marketing, catalog design, sale campaign analysis
n Web log (click stream) analysis, DNA sequence analysis, etc.
cf. Google’s spelling suggestion

Basic Concepts: Frequent Patterns and Association Rules
n Itemset X = {x1, …, xk}
n Shorthand: x1 x2 … xk
n Find all the rules X → Y with min confidence and support
n support, s, probability that a transaction contains X ∪ Y
n confidence, c, conditional probability that a transaction having X also contains Y.
Let min_support = 50%, min_conf = 70%:
sup(AC) = 2
A → C (50%, 66.7%)
C → A (50%, 100%)
Transaction-id   Items bought
10               {A, B, C}
20               {A, C}
30               {A, D}
40               {B, E, F}
(Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both; callouts mark the frequent itemset and the association rule.)
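To make the definitions above concrete, here is a small Python sketch (not from the original slides; all names are illustrative) that computes support and confidence over the toy transaction database just shown:

transactions = [
    {"A", "B", "C"},   # tid 10
    {"A", "C"},        # tid 20
    {"A", "D"},        # tid 30
    {"B", "E", "F"},   # tid 40
]

def support(itemset, db):
    """Fraction of transactions containing every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs, db):
    """Conditional probability: support(lhs ∪ rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs), db) / support(lhs, db)

print(support({"A", "C"}, transactions))        # 0.5    -> sup(AC) = 2 of 4
print(confidence({"A"}, {"C"}, transactions))   # 0.666… -> A → C (50%, 66.7%)
print(confidence({"C"}, {"A"}, transactions))   # 1.0    -> C → A (50%, 100%)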

Mining Association Rules—an Example
Min. support 50% Min. confidence 50%
Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F
Frequent pattern   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%
For rule A → C:
support = support({A} ∪ {C}) = 50%
confidence = support({A} ∪ {C}) / support({A}) = 66.7%
major computation challenge: calculate the support of itemsets ← the frequent itemset mining problem

n Algorithms for scalable mining of (single-dimensional Boolean) association rules in transactional databases

Association Rule Mining Algorithms
n Naïve algorithm (a brute-force sketch follows at the end of this slide)
n Enumerate all possible itemsets and check their support against min_sup
n Generate all association rules and check their confidence against min_conf
n The Apriori property
n Apriori Algorithm
n FP-growth Algorithm
Candidate Generation & Verification
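As a baseline for the naïve algorithm above, a brute-force Python sketch (illustrative only; names such as naive_mine are mine, and supports are fractions in [0, 1]):

from itertools import combinations

def naive_mine(db, min_sup, min_conf):
    """Enumerate every candidate itemset and every rule; exponential in |items|."""
    items = sorted(set().union(*db))
    n = len(db)
    # step 1: test all 2^|items| - 1 non-empty itemsets against min_sup
    freq = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            sup = sum(set(cand) <= t for t in db) / n
            if sup >= min_sup:
                freq[frozenset(cand)] = sup
    # step 2: test all rules A -> (X - A) from each frequent X against min_conf
    rules = []
    for X, sup_X in freq.items():
        for k in range(1, len(X)):
            for A in map(frozenset, combinations(X, k)):
                conf = sup_X / freq[A]          # A is frequent since X is
                if conf >= min_conf:
                    rules.append((set(A), set(X - A), sup_X, conf))
    return freq, rules

The exponential step 1 is exactly what the Apriori property and FP-growth address.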

All Candidate Itemsets for {A, B, C, D, E}
null
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE

Apriori Property
n A frequent (used to be called large) itemset is an itemset whose support is ≥ min_sup.
n Apriori property (downward closure): any subsets of a frequent itemset are also frequent itemsets
n Aka the anti-monotone property of support
ABC ABD ACD BCD
AB AC AD BC BD CD
A B C D
“any supersets of an infrequent itemset are also infrequent itemsets”
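This contrapositive is what the pruning step later relies on. A minimal illustrative helper (my own naming, not from the slides):

from itertools import combinations

def has_infrequent_subset(candidate, L_prev):
    """candidate: a size-k frozenset; L_prev: the set of frequent (k-1)-itemsets.
    True means the candidate cannot be frequent, by the anti-monotone property."""
    k = len(candidate)
    return any(frozenset(s) not in L_prev for s in combinations(candidate, k - 1))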

Illustrating Apriori Principle
Q: How to design an algorithm to improve the naïve algorithm?
(Figure: the itemset lattice over {A, B, C, D, E}; one node is marked "Found to be Infrequent" and all of its supersets are marked "Pruned supersets": once an itemset is known to be infrequent, none of its supersets needs to be generated or counted.)

Apriori: A Candidate Generation-and-test Approach
n Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested!
n Algorithm [Agrawal & Srikant 1994]
1. Ck ← perform level-wise candidate generation (from singleton itemsets)
2. Lk ← verify Ck (count supports in the DB; keep candidates meeting min_sup)
3. Ck+1 ← generated from Lk
4. Goto 2 if Ck+1 is not empty

The Apriori Algorithm
n Pseudo-code:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do begin
        increment the count of all candidates in Ck+1 that are contained in t
    end
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
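A compact Python rendering of this pseudo-code, as a sketch only: the names are mine, min_sup is taken as an absolute count, and candidate generation is inlined here (the join-and-prune details appear on the following slides).

from itertools import combinations

def apriori(db, min_sup):
    """db: list of item sets; min_sup: minimum support count.
    Returns {frozenset: support count} for every frequent itemset."""
    items = sorted(set().union(*db))
    counts = {frozenset([i]): sum(i in t for t in db) for i in items}
    Lk = {s for s, c in counts.items() if c >= min_sup}          # L1
    all_frequent = {s: counts[s] for s in Lk}
    k = 1
    while Lk:
        # generate C(k+1) from Lk: join, then prune by the Apriori property
        Ck1 = {p | q for p in Lk for q in Lk if len(p | q) == k + 1}
        Ck1 = {c for c in Ck1 if all(frozenset(s) in Lk for s in combinations(c, k))}
        # one scan of the database to count the surviving candidates
        counts = {c: sum(c <= t for t in db) for c in Ck1}
        Lk = {s for s, cnt in counts.items() if cnt >= min_sup}
        all_frequent.update({s: counts[s] for s in Lk})
        k += 1
    return all_frequent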

The Apriori Algorithm—An Example
minsup = 50%

Database TDB
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan → C1 (candidate 1-itemsets)
Itemset   sup
{A}       2
{B}       3
{C}       3
{D}       1
{E}       3

L1 (frequent 1-itemsets)
Itemset   sup
{A}       2
{B}       3
{C}       3
{E}       3

C2 (generated from L1); 2nd scan
Itemset   sup
{A, B}    1
{A, C}    2
{A, E}    1
{B, C}    2
{B, E}    3
{C, E}    2

L2
Itemset   sup
{A, C}    2
{B, C}    2
{B, E}    3
{C, E}    2

C3 (generated from L2); 3rd scan
Itemset: {B, C, E}

L3
Itemset     sup
{B, C, E}   2
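Running the Apriori sketch given after the pseudo-code slide on this TDB reproduces the trace above (min_sup = 2, i.e. 50% of 4 transactions):

TDB = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(TDB, min_sup=2))
# frequent itemsets and counts: {A}:2 {B}:3 {C}:3 {E}:3 {A,C}:2 {B,C}:2 {B,E}:3 {C,E}:2 {B,C,E}:2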

Important Details of Apriori
1. How to generate candidates?
n Step 1: self-joining Lk (what’s the join condition? why?)
n Step 2: pruning
2. How to count supports of candidates?
Example of Candidate-generation
n L3={abc, abd, acd, ace, bcd}
n Self-joining: L3*L3
n abcd from abc and abd
n acde from acd and ace
n Pruning:
n acde is removed because ade is not in L3
n C4={abcd}
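The same two steps as an illustrative Python function (my naming; running it on the L3 of this example reproduces C4 = {abcd}):

from itertools import combinations

def apriori_gen(Lk, k):
    """Join + prune. Lk: set of frozensets, each of size k; returns candidate (k+1)-itemsets."""
    Ck1 = set()
    # step 1 (join): combine two k-itemsets that agree on their first k-1 items
    for p in Lk:
        for q in Lk:
            ps, qs = sorted(p), sorted(q)
            if ps[:k - 1] == qs[:k - 1] and ps[k - 1] < qs[k - 1]:
                Ck1.add(p | q)
    # step 2 (prune): drop any candidate with an infrequent k-subset
    return {c for c in Ck1 if all(frozenset(s) in Lk for s in combinations(c, k))}

L3 = {frozenset(s) for s in ("abc", "abd", "acd", "ace", "bcd")}
print(apriori_gen(L3, 3))   # {frozenset({'a', 'b', 'c', 'd'})}; acde is pruned because ade is not in L3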

Generating Candidates in SQL
n Suppose the items in Lk-1 are listed in an order
n Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1 from Lk-1 p, Lk-1 q
where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1
n Step 2: pruning
forall itemsets c in Ck do
forall (k-1)-subsets s of c do
if (s is not in Lk-1) then delete c from Ck

Derive rules from frequent itemsets
n Frequent itemsets != association rules
n One more step is required to find association rules
n For each frequent itemset X,
For each proper nonempty subset A of X,
n Let B = X - A
n A → B is an association rule if Confidence(A → B) ≥ min_conf, where support(A → B) = support(AB), and confidence(A → B) = support(AB) / support(A)

Example – deriving rules from frequent itemsets
n Suppose 234 is frequent, with supp = 50%
n Proper nonempty subsets: 23, 24, 34, 2, 3, 4, with supp = 50%, 50%, 75%, 75%, 75%, 75% respectively
n These generate these association rules:
n 23 => 4, confidence = 100%
n 24 => 3, confidence = 100%
n 34 => 2, confidence = 67%
n 2 => 34, confidence = 67%
n 3 => 24, confidence = 67%
n 4 => 23, confidence = 67%
n All rules have support = 50%
(67% = (N × 50%) / (N × 75%))
Q: is there any optimization (e.g., pruning) for this step?
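A minimal sketch of this rule-derivation step in Python (it assumes a dict freq mapping each frequent itemset, as a frozenset, to its support, e.g. the output of an Apriori-style miner; the names are illustrative):

from itertools import combinations

def derive_rules(freq, min_conf):
    """Return (A, B, support, confidence) for every rule A -> B passing min_conf."""
    rules = []
    for X, sup_X in freq.items():
        for k in range(1, len(X)):
            for A in map(frozenset, combinations(X, k)):
                conf = sup_X / freq[A]          # support(AB) / support(A)
                if conf >= min_conf:
                    rules.append((set(A), set(X - A), sup_X, conf))
    return rules

One answer to the Q above: if A → X−A already fails the confidence test, then any rule with a smaller antecedent A' ⊆ A (for the same X) also fails, since support(A') ≥ support(A); so such antecedents can be pruned early.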

Deriving rules
n To recap, in order to obtain A → B, we need to have Support(AB) and Support(A)
n This step is not as time-consuming as frequent itemsets generation
n Why?
n It’s also easy to speed up using techniques such as parallel processing.
n How?
n Do we really need candidate generation for deriving association rules?
n Frequent-Pattern Growth (FP-Tree)

Bottleneck of Frequent-pattern Mining
n Multiple database scans are costly
n Mining long patterns needs many passes of scanning and generates lots of candidates
n To find frequent itemset i1i2…i100
n # of scans: 100
n # of candidates: $\binom{100}{1} + \binom{100}{2} + \cdots + \binom{100}{100} = 2^{100} - 1$
n Bottleneck: candidate-generation-and-test
Can we avoid candidate generation altogether?

n FP-growth

No Pain, No Gain
(Table: customers Alice, Bob, Charlie, Dora vs. items Java, Lisp, Scheme, Python, Ruby; Java is bought by 2 customers (Alice and Charlie), Lisp by 1, Scheme by 1, Python by 2, Ruby by 3.)
minsup = 1
n Apriori:
n L1 = {J, L, S, P, R}
n C2 = all the $\binom{5}{2} = 10$ combinations
n Most of C2 do not contribute to the result
n There is no way to tell, because …

No Pain, No Gain
minsup = 1
(Same customer × language table as on the previous slide.)
Ideas:
• Keep the support set for each frequent itemset
• DFS
J → JL? J → ??? Only need to look at the support set for J
(Search tree so far: ɸ → J, where J’s support set is {A, C})

No Pain, No Gain
minsup = 1
(Same customer × language table as above.)
Ideas:
• Keep the support set for each frequent itemset
• DFS
(Search tree: ɸ → J {A, C}; J → JP {C} → JPR {C}; J → JR {A, C})

Notations and Invariants
n Conditional DB:
n DB|p = {t ∈ DB | t contains itemset p}
n DB = DB|∅ (i.e., conditioned on nothing)
n Shorthand: DB|px = DB|(p∪x)
n SupportSet(p∪x, DB) = SupportSet(x, DB|p)
n Analogy: {x | x mod 6 = 0 ⋀ x ∈ [100]} = {x | x mod 3 = 0 ⋀ x ∈ even([100])}
n An FP-tree|p is equivalent to a DB|p
n One can be converted to the other
n Next, we illustrate the algorithm using conditional DBs
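A tiny check of the SupportSet invariant on hypothetical toy data (every name and transaction below is mine, purely for illustration):

def conditional_db(db, p):
    """DB|p: the transactions of db (a list of sets) that contain itemset p."""
    return [t for t in db if set(p) <= t]

def support_count(x, db):
    return sum(set(x) <= t for t in db)

db = [{"J", "R"}, {"L", "P"}, {"J", "P", "R"}, {"S", "R"}]    # hypothetical
p, x = {"J"}, {"R"}
# SupportSet(p ∪ x, DB) and SupportSet(x, DB|p) select the same transactions
assert support_count(p | x, db) == support_count(x, conditional_db(db, p))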

FP-tree Essential Idea /1
n Recursive algorithm again!
n FreqItemsets(DB|p):
n X = FindFrequentItems(DB|p)   ← easy task, as only items (not itemsets) are needed
n output { (x p) | x ∈ X }
n Foreach x in X
n DB*|px = GetConditionalDB+(DB*|p, x)
n FreqItemsets(DB*|px)   ← obtained via recursion
All frequent itemsets in DB|p belong to one of the following categories:
patterns ~ ★px1, patterns ~ ★px2, …, patterns ~ ★pxi, …, patterns ~ ★pxn

No Pain, No Gain
minsup = 1
DB|J (the transactions containing Java), i.e. the columns for Alice and Charlie:

          Alice   Charlie
Java      X       X
Lisp
Scheme
Python            X
Ruby      X       X
n FreqItemsets(DB|J):
n {P, R} ← FindFrequentItems(DB|J)
n Output {JP, JR}
n Get DB*|JP; FreqItemsets(DB*|JP)
n Get DB*|JR; FreqItemsets(DB*|JR)
n // Guaranteed no other frequent itemset in DB|J

FP-tree Essential Idea /2
n FreqItemsets(DB|p):
n If boundary condition, then …
n X = FindFrequentItems(DB|p)
n output { (x p) | x ∈ X }   ← also output each item in X (appended with the conditional pattern)
n [optional] DB*|p = PruneDB(DB|p, X)   ← remove items not in X; potentially reduce # of transactions (∅ or dup). Improves the efficiency.
n Foreach x in X
n DB*|px = GetConditionalDB+(DB*|p, x)   ← also gets rid of items already processed before x, to avoid duplicates
n [optional] if DB*|px is degenerated, then powerset(DB*|px)
n FreqItemsets(DB*|px)
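A minimal Python sketch of this recursion over conditional DBs (no FP-tree yet: PruneDB and the degenerate-DB shortcut are omitted, items are processed in lexicographic rather than frequency order, and all names are illustrative):

from collections import Counter

def freq_itemsets(db, min_sup, prefix=frozenset(), out=None):
    """db: list of item sets; min_sup: minimum support count.
    Returns {frozenset: count} for every frequent itemset."""
    if out is None:
        out = {}
    counts = Counter(item for t in db for item in t)
    X = [item for item, c in counts.items() if c >= min_sup]   # FindFrequentItems(DB|p)
    for x in sorted(X):
        pattern = prefix | {x}
        out[pattern] = counts[x]                                # output (x p)
        # GetConditionalDB+: keep transactions containing x; drop x and items processed before it
        cond_db = [{i for i in t if i > x} for t in db if x in t]
        freq_itemsets(cond_db, min_sup, pattern, out)
    return out

# e.g. on the TDB from the Apriori example slide, with min_sup = 2:
TDB = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(freq_itemsets(TDB, 2))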

Lv 1 Recursion
n minsup = 3
DB (original transactions)      DB* (infrequent items removed, items ordered)
FCADGIMP                        FCAMP
ABCFLMO                         FCABM
BFHJOW                          FB
BCKSP                           CBP
AFCELPMN                        FCAMP

X = {F, C, A, B, M, P}
Output: F, C, A, B, M, P

Conditional DBs built for the recursive calls:
DB*|P, DB*|M (sans P), DB*|B (sans MP), DB*|A (sans BMP), DB*|C (sans ABMP), DB*|F (sans CABMP)
e.g. DB*|P contains the transactions with P: FCAMP, CBP, FCAMP (grayed items in the slide are for illustration purposes only)

Lv 2 Recursion on DB*|P
n minsup = 3
DB (= DB*|P)      DB* (only items in X kept)
FCAMP             C
CBP               C
FCAMP             C

X = {C}
Output: CP
DB*|C (which is actually FullDB*|CP)
Lv 3 recursion on DB*|CP: its DB has only empty sets or X = {}, so it immediately returns

Lv 2 Recursion on DB*|A (sans …)
n minsup = 3
DB (= DB*|A, sans BMP)      DB* (only items in X kept)
FCA                          FC
FCA                          FC
FCA                          FC

X = {F, C}
Output: FA, CA
DB*|C = { F, F, F } (which is actually FullDB*|CA); further recursion outputs: FCA
DB*|F: boundary case

Different Example:
Lv 2 Recursion on DB*|P
n minsup = 2

DB (= DB*|P)      DB* (only items in X kept)
FCAMP             FCA
FCBP              FC
FAP               FA

X = {F, C, A}
Output: FP, CP, AP
Recursive calls on DB*|A, DB*|C, DB*|F;
e.g. DB*|A (which is actually FullDB*|AP) has X = {F} and Output: FAP

I will give you back the FP-tree
n An FP-tree tree of DB consists of:
n A fixed order among items in DB
n A prefix, threaded tree of sorted transactions in DB
n Header table: (item, freq, ptr)
n When used in the algorithm, the input DB is always pruned (cf. PruneDB())
n Remove infrequent items
n Remove infrequent items in every transaction
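An illustrative FP-tree node and construction routine matching this description (a prefix tree over pruned, frequency-ordered transactions plus a header table of node links); this is a sketch of the idea, not the lecture's reference implementation:

from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}                      # item -> FPNode

def build_fp_tree(db, min_sup):
    """db: list of item sets. Returns (root, header), where header maps
    each frequent item to the list of tree nodes labelled with it."""
    freq = defaultdict(int)
    for t in db:
        for i in t:
            freq[i] += 1
    freq = {i: c for i, c in freq.items() if c >= min_sup}      # prune infrequent items
    order = sorted(freq, key=lambda i: (-freq[i], i))           # fixed order among items
    root, header = FPNode(None, None), defaultdict(list)
    for t in db:
        node = root
        for i in [i for i in order if i in t]:                  # sorted, pruned transaction
            if i not in node.children:
                node.children[i] = FPNode(i, node)
                header[i].append(node.children[i])              # thread via the header table
            node = node.children[i]
            node.count += 1
    return root, header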

FP-tree Example
minsup = 3
TID   Items bought                (ordered) frequent items
100   {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
200   {a, b, c, f, l, m, o}       {f, c, a, b, m}
300   {b, f, h, j, o, w}          {f, b}
400   {b, c, k, s, p}             {c, b, p}
500   {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

(Figure: the FP-tree is built by inserting the ordered transactions one by one.
Insert t1 = {f, c, a, m, p}: {} - f:1 - c:1 - a:1 - m:1 - p:1
Insert t2 = {f, c, a, b, m}: the shared prefix becomes f:2 - c:2 - a:2, with a new branch b:1 - m:1 under a.)

Final FP-tree after inserting all ti:
{}
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1

Header table (item, freq, head of node-links): f:4, c:4, a:3, b:3, m:3, p:3
Output (frequent 1-itemsets): f, c, a, b, m, p

Mining p: p’s conditional pattern base (collected via p’s node-links): fcam:2, cb:1
Item counts within the base: f:2, c:3, a:2, b:1, m:2; only c is frequent (minsup = 3)
Cleaned conditional pattern base: c:2, c:1; conditional FP-tree: {} - c:3
Output: pc
STOP

Mining m: m’s conditional pattern base: fca:2, fcab:1
Item counts within the base: f:3, c:3, a:3, b:1; b is infrequent (minsup = 3)
Conditional FP-tree: {} - f:3 - c:3 - a:3 (a single path)
Output: mf, mc, ma
gen_powerset on the single path also outputs: mcf, maf, mac, macf

Mining b: b’s conditional pattern base: fca:1, f:1, c:1
Item counts within the base: f:2, c:2, a:1; none is frequent (minsup = 3) → STOP

Mining a: a’s conditional pattern base: fc:3
Item counts: f:3, c:3; conditional FP-tree: {} - f:3 - c:3 (a single path)
Output: af, ac
gen_powerset also outputs: acf

Mining c: c’s conditional pattern base: f:3
Conditional FP-tree: {} - f:3
Output: cf → STOP

Mining f: f’s conditional pattern base is empty (header table: f:4) → STOP

FP-Growth vs. Apriori: Scalability With the Support Threshold
(Figure: run time (sec.) vs. support threshold (%), 0 to 3%, on data set T25I20D10K; as the support threshold decreases, D1 Apriori's runtime grows sharply while D1 FP-growth's stays low.)

Why Is FP-Growth the Winner?
n Divide-and-conquer:
n decompose both the mining task and DB according to the frequent patterns obtained so far
n leads to focused search of smaller databases
n Other factors
n no candidate generation, no candidate test
n compressed database: FP-tree structure
n no repeated scan of entire database
n basic ops—counting local freq items and building sub FP-tree, no pattern search and matching