
EECS-4412: Data Mining Frequent Pattern & Association
Rule Mining
Parke Godfrey
(Thanks to Aijun An & Jiawei Han)

Outline
Basic concepts of association rule learning
Apriori algorithm
FP-Growth Algorithm
Finding interesting rules.
2

Why Mining Association Rules?
Objective:
Finding interesting co-occurring items (or objects, events) in a given data set.
Examples:
Given a database of transactions, each transaction is a list of items (purchased by a customer in a visit), you may find:
computer → financial_management_software [support=2%, confidence=60%]
From a student database, you may find
major(x, “CS”) ∧ gpa(x, “A”) → has_taken(x, “DB”) [1%, 75%]
3

Why Mining Association Rules? (Cont’d)
Popular application: basket data analysis. Place items frequently bought together close to each other to increase the sales of these items.
? → iPad (What should the store do to boost the sales of a particular product, i.e., the iPad?)
iPad → ? (What other products should the store stock up on?)
4

What Kind of Databases?
Transactional database TDB
Itemset: a set of items
A transaction is a tuple (tid, X), where tid is the transaction ID and X is an itemset.
A transactional database is a set of transactions
In many cases, a transaction database can be treated as a set of itemsets (ignore TIDs)
TID   Items
100   f, a, c, d, g, i, m, p
200   a, b, c, f, l, m, o
300   b, f, h, j, o
400   b, c, k, s, p
500   a, f, c, e, l, p, m, n

Association rule from a TDB (relates two itemsets):
{a, c, m} → {l} [support=40%, confidence=66.7%]
5

What Kind of Databases? (Cont'd)
Relational database (RDB)
An attribute-value pair, e.g., Outlook = Sunny
Record: a set of attribute-value pairs
Relational DB: a set of records
Day   Outlook    Temp   Humid    Wind     PlayTennis
D1    Sunny      Hot    High     Weak     No
D2    Sunny      Hot    High     Strong   No
D3    Overcast   Hot    High     Weak     Yes
D4    Rain       Mild   High     Weak     Yes
D5    Rain       Cool   Normal   Weak     Yes
D6    Rain       Cool   Normal   Strong   No
D7    Overcast   Cool   Normal   Strong   Yes
D8    Sunny      Mild   High     Weak     No
D9    Sunny      Cool   Normal   Weak     Yes
D10   Rain       Mild   Normal   Weak     Yes
D11   Sunny      Mild   Normal   Strong   Yes
D12   Overcast   Mild   High     Strong   Yes
D13   Overcast   Hot    Normal   Weak     Yes
D14   Rain       Mild   High     Strong   No
Association rule from an RDB (relates two sets of attribute-value pairs): (Outlook=Sunny) ∧ (Temp=Hot) → (PlayTennis=No)
[support=14%, confidence=100%]
6

Definition of Association Rule
An association rule is of the form:
X → Y [support, confidence]
where X is the antecedent and Y is the consequent,
X ⊂ I, Y ⊂ I, X ∩ Y = ∅, and I is a set of items (objects or attribute-value pairs).
support: the probability that a transaction (or a record) contains both X and Y, i.e.,
support(X → Y) = P(X ∪ Y)
confidence: the conditional probability that a transaction (or a record) containing X also contains Y, i.e.,
confidence(X → Y) = P(Y | X)
A rule associates one set of items (or attribute-value pairs) with another set of items (or attribute-value pairs).
7

Support and Confidence: Example
support(X → Y) = P(X ∪ Y)
confidence(X → Y) = P(Y | X)

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Relative frequency is used to estimate the probability.

{A} → {C}      (50%, 66.7%)
{C} → {A}      (50%, 100%)
{A, C} → {B}   (25%, 50%)
{A, B} → {E}   (0%, 0%)
8
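To make the two measures concrete, here is a minimal Python sketch (the helper names `support` and `confidence` are just illustrative) that reproduces the numbers above from the four-transaction table:

```python
# Toy transaction database from the slide (TID -> items bought)
transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}

def support(itemset):
    """Relative frequency of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions.values()) / len(transactions)

def confidence(lhs, rhs):
    """Estimated P(rhs | lhs) = support(lhs U rhs) / support(lhs)."""
    s_lhs = support(lhs)
    return support(set(lhs) | set(rhs)) / s_lhs if s_lhs else 0.0

print(support({"A", "C"}), confidence({"A"}, {"C"}))  # 0.5 0.666... -> {A} -> {C} (50%, 66.7%)
print(support({"A", "C"}), confidence({"C"}, {"A"}))  # 0.5 1.0      -> {C} -> {A} (50%, 100%)
```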

Mining Association Rules Problem statement
Given a minimum support (min_sup), also called support threshold, and a minimum confidence (min_conf), also called confidence threshold, find all association rules that satisfy both min_sup and min_conf from a data set D.
9

Basic Concepts
Strong rules:
An association rule is strong if it satisfies both min_sup and min_conf.
k-itemset: an itemset that contains k items.
{computer, financial_management_software} is a 2-itemset.
Support count of an itemset in data set D: number of transactions in D that contain the itemset.
Minimum support count = min_sup × total number of transactions in a data set.
Frequent itemset in a data set D: itemset whose support count in D is at least the minimum support count.
10

How to Mine Association Rules
A two-step process:
Find all frequent itemsets (the key step)
Generate strong association rules from the frequent itemsets.
Example: given min_sup=50% and min_conf=50%

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%

Generate strong rules:
{A} → {C} [support=50%, confidence=66.7%]
{C} → {A} [support=50%, confidence=100%]
11

Finding Frequent Itemsets
Objective: given a database, find all the itemsets (or sets of attribute-value pairs) that satisfy the minimum support count.
Algorithms: Apriori, FP-Growth, H-Mine, Eclat, Partition, CLOSET, CHARM, etc.
12

Search Space for Finding All Frequent Itemsets Search space for DB with 4 items:
∅
a    b    c    d
ab   ac   ad   bc   bd   cd
abc  abd  acd  bcd
abcd
13

Apriori
The Apriori property (an anti-monotone property), also called downward closure property:
Any nonempty subset of a frequent itemset must be frequent
i.e., if {A,B,C} is a frequent itemset, {A,B}, {A,C}, {B,C}, {A}, {B} and {C} must also be frequent.
This is because a transaction containing {A, B, C} also contains {A,B}, {B,C}, …. Thus, the support count of a subset is not less than that of the superset.
No superset of any infrequent itemset should be checked. Many item combinations can be pruned from the search space.
Apriori-based mining (level-wise iterations)
Generate length-(k+1) candidate itemsets from the frequent k-itemsets, then test the candidates against the DB.
14

Pruning Search Space Using Apriori Property
If c is not frequent, all the itemsets containing c (c, ac, bc, cd, abc, acd, bcd, abcd) are pruned:
∅
a    b    c    d
ab   ac   ad   bc   bd   cd
abc  abd  acd  bcd
abcd
15

The Apriori Algorithm
Based on the Apriori property, use an iterative level-wise approach with candidate generation-and-test.
Pseudo-code:
Ck: the set of candidate itemsets of size k
Lk: the set of frequent itemsets of size k

L1 = {frequent 1-itemsets};   (scan database to find all frequent 1-itemsets)
for (k = 1; Lk ≠ ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    if Ck+1 ≠ ∅
        for each transaction t in database do   (scan database to count the support of each itemset in Ck+1)
            increment the count of all candidates in Ck+1 that are contained in t;
        Lk+1 = candidates in Ck+1 satisfying min_support;
end
return ∪k Lk;

Level-wise generation process: Lk → Ck+1 → Lk+1
16
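The following Python sketch (an illustration, not the course's reference implementation) implements the level-wise loop above, including the join and prune steps described on the next slides, and reproduces the worked example that follows (min support count = 2):

```python
from itertools import combinations

def apriori(transactions, min_sup_count):
    """Level-wise Apriori sketch: returns {frozenset(itemset): support count}."""
    # L1: frequent 1-itemsets (first DB scan)
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_sup_count}
    all_frequent = dict(frequent)

    k = 1
    while frequent:
        # Join step: merge two frequent k-itemsets that share their first k-1 items
        current = sorted(frequent, key=sorted)
        candidates = set()
        for i in range(len(current)):
            for j in range(i + 1, len(current)):
                a, b = sorted(current[i]), sorted(current[j])
                if a[:k - 1] == b[:k - 1]:
                    candidates.add(current[i] | current[j])
        # Prune step: drop candidates having an infrequent k-subset (Apriori property)
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        # One DB scan to count the surviving candidates
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {c: n for c, n in counts.items() if n >= min_sup_count}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

# The example database used on the next slide, with min support count = 2
db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
for itemset, n in sorted(apriori(db, 2).items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print(sorted(itemset), n)   # ends with [2, 3, 5] 2, matching L3 on the next slide
```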

The Apriori Algorithm — Example
Assume: min_sup_count = 2

Database D:
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D for the candidate 1-itemsets C1:      L1 (frequent 1-itemsets):
itemset   sup.                               itemset   sup.
{1}       2                                  {1}       2
{2}       3                                  {2}       3
{3}       3                                  {3}       3
{4}       1                                  {5}       3
{5}       3

C2 (generated from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D to count C2:                          L2:
itemset   sup                                itemset   sup
{1 2}     1                                  {1 3}     2
{1 3}     2                                  {2 3}     2
{1 5}     1                                  {2 5}     3
{2 3}     2                                  {3 5}     2
{2 5}     3
{3 5}     2

C3 (generated from L2): {2 3 5}

Scan D to count C3, giving L3:
itemset   sup
{2 3 5}   2
17

Apriori Algorithm (Flow Chart)
Ck: set of candidate k-itemsets
Lk: set of frequent k-itemsets

1. L1 = set of frequent 1-itemsets (scan DB); k = 1
2. If Lk = ∅: output L1, ..., Lk-1 and stop.
3. Compute candidate set Ck+1:
   • Ck+1 = join Lk with Lk
   • prune Ck+1
4. If Ck+1 = ∅: output L1, ..., Lk and stop.
5. Scan DB to get Lk+1 from Ck+1; k = k+1; go to step 2.

Important steps:
Generating candidates
Counting supports of candidates by scanning DB
18

How to Generate Candidates? (i.e., How to Generate Ck+1 from Lk )
Given Lk = the set of frequent k-itemsets, generate Ck+1 in two steps:
List the items in each itemset of Lk in an order (e.g., lexicographic).

Join step: join Lk with Lk by joining two k-itemsets of Lk. Two k-itemsets are joinable if their first (k-1) items are the same and the last item of the first itemset is smaller than the last item of the second itemset.
Prune step: delete every candidate in Ck+1 that has an infrequent length-k subset, by checking all length-k subsets of each candidate.

Example: L3 = { {1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4} }
After the join step: C4 = { {1 2 3 4}, {1 3 4 5} }
After the prune step: C4 = { {1 2 3 4} }   ({1 3 4 5} is removed because {1 4 5} is not in L3)
19

Example of Candidate Generation
L4 = {abcd, abcg, abdg, abef, abeh, acdg, bcdg}
Self-joining: L4 * L4
abcdg from abcd and abcg
abefh from abef and abeh
Pruning:
abefh is removed because abfh (likewise aefh and befh) is not in L4
C5 = {abcdg}
20
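The join and prune steps of this example can be checked with a short sketch (the function name `apriori_gen` is just a label for the candidate-generation step):

```python
from itertools import combinations

def apriori_gen(Lk, k):
    """Generate C(k+1) from Lk (a set of frozensets of size k) by join + prune."""
    ordered = [tuple(sorted(s)) for s in Lk]
    joined = set()
    for a in ordered:
        for b in ordered:
            # joinable: same first k-1 items, and a's last item smaller than b's
            if a[:k - 1] == b[:k - 1] and a[k - 1] < b[k - 1]:
                joined.add(frozenset(a) | frozenset(b))
    # prune: every length-k subset of a candidate must itself be in Lk
    return {c for c in joined
            if all(frozenset(s) in Lk for s in combinations(sorted(c), k))}

L4 = {frozenset(s) for s in ["abcd", "abcg", "abdg", "abef", "abeh", "acdg", "bcdg"]}
print([''.join(sorted(c)) for c in apriori_gen(L4, 4)])   # ['abcdg']
```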

The Apriori Algorithm — Example
Assume: min_sup_count = 2

Database D:
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D for the candidate 1-itemsets C1:      L1 (frequent 1-itemsets):
itemset   sup.                               itemset   sup.
{1}       2                                  {1}       2
{2}       3                                  {2}       3
{3}       3                                  {3}       3
{4}       1                                  {5}       3
{5}       3

C2 (generated from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D to count C2:                          L2:
itemset   sup                                itemset   sup
{1 2}     1                                  {1 3}     2
{1 3}     2                                  {2 3}     2
{1 5}     1                                  {2 5}     3
{2 3}     2                                  {3 5}     2
{2 5}     3
{3 5}     2

C3 (generated from L2): {2 3 5}

Scan D to count C3, giving L3:
itemset   sup
{2 3 5}   2
21

Support Counting of Candidates in DB Scan
Objective of candidate support counting:
Find the frequent itemsets Lk from a set of candidates Ck.
Naïve method: match each candidate against each transaction. This is time-consuming when
the total number of candidates is very large
one transaction contains many candidates
Hash-tree method:
Store the candidate itemsets in Ck in a hash tree
A leaf node of the hash tree contains a list of candidates and their counts; an interior node contains a hash table
Use a subset function to find all the candidates contained in a transaction
22

Building a Hash Tree
Initially treat the root as a leaf node.
Insert the itemsets in Ck into the tree.
When the number of itemsets in a leaf node exceeds a specified threshold (such as 3 in the example below), convert the leaf node into an interior node by hashing the itemsets on the d-th item of each itemset, where d is the level of the node.

Example: C3 = {124, 125, 136, 145, 159, 234, 345, 356, 357, 367, 368, 457, 458, 567, 689}, with hash functions h1 = x1 mod 3, h2 = x2 mod 3, h3 = x3 mod 3 applied at levels d = 1, 2, 3 (xd is the d-th item of an itemset).
[Figure: the resulting hash tree, with interior nodes at levels 1 to 3 and the candidate itemsets stored in its leaves.]
23

Subset Function
Functionality:
Given Ck (stored in a hash tree) and a transaction t, find all the candidates in Ck contained in t and increase the counts of these candidates:
Subset(Ck, t): candidate itemsets contained in t
Method:
At the root, hash on every item in t (up to the k-th item from the end).
At an interior node reached by hashing on item i, hash on each item that comes after i in t (up to the (k-d+1)-th item from the end, where d is the level of the node in the tree), and recursively apply the procedure to the node in the corresponding bucket.
At a leaf, find the itemsets contained in t.
Benefit
Don’t have to match each candidate with each transaction. 24
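A minimal, self-contained sketch of the hash-tree idea follows (leaf-split threshold 3 and hash function x mod 3 as in the slides; the "k-th item from the end" pruning of the subset function is omitted for brevity, and the class and function names are illustrative):

```python
BRANCH = 3      # fan-out of the hash function h(x) = x mod 3, as in the slides
MAX_LEAF = 3    # split a leaf once it holds more than this many itemsets

class HashTreeNode:
    def __init__(self):
        self.children = {}   # interior node: bucket -> child node
        self.counts = {}     # leaf node: candidate itemset (sorted tuple) -> count

    def is_leaf(self):
        return not self.children

def insert(node, itemset, depth=1):
    """Insert a candidate k-itemset (a sorted tuple) into the hash tree."""
    if node.is_leaf():
        node.counts[itemset] = 0
        if len(node.counts) > MAX_LEAF and depth <= len(itemset):
            # overflow: convert the leaf into an interior node by hashing on the depth-th item
            stored, node.counts = list(node.counts), {}
            for s in stored:
                insert(node.children.setdefault(s[depth - 1] % BRANCH, HashTreeNode()), s, depth + 1)
    else:
        insert(node.children.setdefault(itemset[depth - 1] % BRANCH, HashTreeNode()), itemset, depth + 1)

def count_in_transaction(root, transaction):
    """Increase the count of every candidate contained in the (sorted) transaction."""
    visited = set()                        # a leaf can be reached along several hash paths
    def walk(node, start):
        if node.is_leaf():
            if id(node) not in visited:
                visited.add(id(node))
                tset = set(transaction)
                for cand in node.counts:
                    if set(cand) <= tset:
                        node.counts[cand] += 1
            return
        for i in range(start, len(transaction)):       # hash on each remaining item
            child = node.children.get(transaction[i] % BRANCH)
            if child is not None:
                walk(child, i + 1)
    walk(root, 0)

def all_counts(node, out):
    if node.is_leaf():
        out.update(node.counts)
    else:
        for child in node.children.values():
            all_counts(child, out)
    return out

# C3 from the "Building a Hash Tree" slide and the transaction {1, 3, 5, 6}
C3 = [(1,2,4),(1,2,5),(1,3,6),(1,4,5),(1,5,9),(2,3,4),(3,4,5),(3,5,6),
      (3,5,7),(3,6,7),(3,6,8),(4,5,7),(4,5,8),(5,6,7),(6,8,9)]
root = HashTreeNode()
for c in C3:
    insert(root, c)
count_in_transaction(root, (1, 3, 5, 6))
print({c: n for c, n in all_counts(root, {}).items() if n > 0})
# counts of (1, 3, 6) and (3, 5, 6) become 1; every other candidate stays at 0
```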

Example: finding the candidates contained in a transaction.
Given the transaction {1 3 5 6} and the hash tree above (hash function x mod 3 at each level):
At the root, hash on items 1, 3, and 5 (written 1+{3 5 6}, 3+{5 6}, 5+{6}); at the next level, continue with 1 3+{5 6}, 1 5+{6}, 3 5+{6}, and so on, descending only into the matching buckets.
At each leaf reached, check which stored candidates are contained in the transaction.
[Figure: the hash-tree traversal for transaction {1 3 5 6}.]
The traversal only visits the leaves that may contain subsets of the transaction.
25

Scan DB to Obtain the Counts of All Length-3 Candidates
Assume a transaction DB: {1,3,5,6}, {2,3,5}, {1,3,6,7,8}, {3,5,6}, ...
The counts of all the candidates in the hash tree are initialized to zero.
Apply the subset function to each transaction and increase the count of a candidate whenever the transaction contains it.
[Figure: the hash tree after these transactions; for example, candidates 136 and 356 have count 2, candidates 367 and 368 have count 1, and the remaining candidates shown have count 0.]
26

How to Mine Association Rules
A two-step process:
Find all frequent itemsets (done)
Generate strong association rules from the frequent itemsets.
Example: given min_sup=50% and min_conf=50%

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%

Generate strong rules:
{A} → {C} [support=50%, confidence=66.7%]
{C} → {A} [support=50%, confidence=100%]
27

Generate Association Rules from Frequent Itemsets
Naïve algorithm:
for each frequent itemset l whose length ≥ 2 do
    for each nonempty proper subset s of l do
        if (support(l) / support(s) ≥ min_conf)
            output the rule s → l − s, with
            support = support(l) and confidence = support(l) / support(s)
Note that we only need to check the confidence.
Do we need to scan the database?
No: the supports of l and of every subset s were already obtained while mining the frequent itemsets (every nonempty subset of a frequent itemset is itself frequent, so its support is available).
28
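A hedged sketch of this naïve rule-generation step, using the supports from the running example; note that no database scan is needed because every subset's support is already known:

```python
from itertools import combinations

def generate_rules(freq, min_conf):
    """`freq` maps frozenset -> support; returns (lhs, rhs, support, confidence) tuples."""
    rules = []
    for l, sup_l in freq.items():
        if len(l) < 2:
            continue
        for r in range(1, len(l)):                    # every nonempty proper subset s of l
            for s in map(frozenset, combinations(l, r)):
                conf = sup_l / freq[s]                # support(l) / support(s)
                if conf >= min_conf:
                    rules.append((set(s), set(l - s), sup_l, conf))
    return rules

# Frequent itemsets (with supports) from the running example; min_conf = 50%
freq = {frozenset({"A"}): 0.75, frozenset({"B"}): 0.50, frozenset({"C"}): 0.50,
        frozenset({"A", "C"}): 0.50}
for lhs, rhs, sup, conf in generate_rules(freq, 0.5):
    print(f"{sorted(lhs)} -> {sorted(rhs)}  support={sup:.0%}  confidence={conf:.1%}")
# ['A'] -> ['C']  support=50%  confidence=66.7%
# ['C'] -> ['A']  support=50%  confidence=100.0%
```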

Generate Association Rules from Frequent Itemsets
Example:
Given a frequent itemset l = {A, B, C},
the nonempty proper subsets of l are {A,B}, {A,C}, {B,C}, {A}, {B}, {C},
and the resulting candidate association rules are:
(The confidence values below are made-up numbers; this example does not continue the previous examples.)
{A,B} → {C}   confidence = support({A,B,C}) / support({A,B}) = 50%
{A,C} → {B}   confidence = support({A,B,C}) / support({A,C}) = 100%
{B,C} → {A}   confidence = support({A,B,C}) / support({B,C}) = 100%
{A} → {B,C}   confidence = support({A,B,C}) / support({A}) = 33%
{B} → {A,C}   confidence = support({A,B,C}) / support({B}) = 29%
{C} → {A,B}   confidence = support({A,B,C}) / support({C}) = 100%
If the minimum confidence threshold is 70%, only 3 rules are output: {A,C}→{B}, {B,C}→{A}, and {C}→{A,B}.
29

Is Apriori Fast Enough?
Performance Bottlenecks
The core of the Apriori algorithm:
Use frequent k-itemsets to generate candidate (k+1)-itemsets
Use a database scan and pattern matching to collect counts for the candidate itemsets, generating the frequent (k+1)-itemsets from the candidate (k+1)-itemsets
The bottleneck of Apriori: candidate generation and testing
Huge candidate sets:
10^4 frequent 1-itemsets will generate more than 10^7 candidate 2-itemsets
To discover a frequent pattern of size 100, e.g., {a1, a2, ..., a100}, one needs to generate at least 2^100 ≈ 10^30 candidates.
Multiple scans of the database:
Needs n or n+1 scans, where n is the length of the longest frequent pattern
30

Improving Apriori: General Ideas
Shrink the number of candidates: hash-based technique (the DHP, direct hashing and pruning, algorithm)
Reduce the number of database scans on disk: partitioning the data (the Partition algorithm)
Avoid candidate generation: FP-growth
31

Shrink the Number of Candidates (DHP)
A hash-based technique can be used to reduce the size of Ck, especially C2.
Build a hash table while scanning the DB to generate L1:
For all 2-itemsets in each transaction, hash them into the buckets and increase the bucket counts.
Hash function: h(x, y) = ((id of x) × 10 + (id of y)) mod 7
[Figure: the resulting hash table of bucket counts; only this array of counts is stored.]
32
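The bucket counting can be sketched as follows, using the hash function above and the nine-transaction data set shown on the next slide (with item Ii mapped to id i); this is only an illustration of the idea:

```python
from itertools import combinations

# Transaction DB from the next slide, with item I_i represented by the id i
db = [{1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3}, {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3}]

def h(x, y):
    # Hash function from the slide: h(x, y) = ((id of x) * 10 + (id of y)) mod 7
    return (x * 10 + y) % 7

buckets = [0] * 7
for t in db:                                  # done during the same scan that computes L1
    for x, y in combinations(sorted(t), 2):   # every 2-itemset in the transaction
        buckets[h(x, y)] += 1

print(buckets)   # [2, 2, 4, 2, 2, 4, 4]
# A 2-itemset whose bucket count is below the minimum support count can be removed
# from C2 without ever counting that itemset individually.
```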

Shrink the Number of Candidates (DHP) (Cont’d)
If the count of a bucket is less than the minimum support count, all the 2-itemsets hashed into that bucket are removed from C2.
The hash table on the previous slide was generated from this data set:

TID    Items
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3
33

Reduce the Number of Disk Scans (Partition)
Partition the DB so that each partition fits in main memory.
Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB (can be proved)
Scan 1: partition database and for each partition find local frequent patterns
Scan 2: consolidate global frequent patterns
A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association in large databases. In VLDB’95
34
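A toy sketch of the two-scan Partition idea (the local miner here is brute force only to keep the example self-contained; a real implementation would run Apriori or a similar algorithm on each in-memory partition):

```python
from itertools import combinations

def local_frequent(part, min_sup_frac):
    """Brute-force miner for one small in-memory partition (stand-in for Apriori)."""
    items = sorted({i for t in part for i in t})
    threshold = min_sup_frac * len(part)
    out = set()
    for r in range(1, len(items) + 1):
        level = {frozenset(c) for c in combinations(items, r)
                 if sum(set(c) <= t for t in part) >= threshold}
        if not level:
            break                    # anti-monotonicity: nothing larger can be frequent
        out |= level
    return out

def partition_mine(db, min_sup_frac, n_parts=2):
    # Scan 1: mine each partition; a globally frequent itemset must be locally
    # frequent in at least one partition, so the union is a global candidate set.
    parts = [db[i::n_parts] for i in range(n_parts)]
    candidates = set().union(*(local_frequent(p, min_sup_frac) for p in parts))
    # Scan 2: one pass over the full DB to keep only the globally frequent candidates.
    counts = {c: sum(c <= t for t in db) for c in candidates}
    return {c: n for c, n in counts.items() if n >= min_sup_frac * len(db)}

db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(partition_mine(db, 0.5))       # same frequent itemsets as the Apriori example
```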

Improving Apriori: General Ideas
Shrink the number of candidates: hash-based technique
Reduce the number of database scans: partitioning data
Avoid candidate generation: FP-growth (next)
35

FP-Growth
J. Han, J. Pei, and Y. Yin. Mining Frequent Patterns without Candidate Generation., Proc. 2000 ACM-SIGMOD Int. Conf. on Management of Data (SIGMOD’00), Dallas, TX, May 2000.
J. Han, J. Pei, Y. Yin and R. Mao, Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach, Data Mining and Knowledge Discovery, 8(1):53- 87, 2004. (http://www- faculty.cs.uiuc.edu/~hanj/pdf/dami04_fptree.pdf)
Chapter 6.2.4 (3rd edition) or Chapter 5.2.4 (2nd Edition)
36

Mining Frequent Patterns Without Candidate Generation (FP-growth)
Two major steps:
Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure
highly condensed, but complete for frequent pattern mining
avoid costly database scans
Mine frequent patterns (itemsets) from an FP-tree
A divide-and-conquer methodology: decompose mining tasks into smaller ones
Efficient: avoid candidate generation — generate frequent patterns from the tree directly.
37

Construct FP-tree from a Transaction DB
TID   Items bought                  (Ordered) frequent items
100   f, a, c, d, g, i, m, p        f, c, a, m, p
200   a, b, c, f, l, m, o           f, c, a, b, m
300   b, f, h, j, o                 f, b
400   b, c, k, s, p                 c, b, p
500   a, f, c, e, l, p, m, n        f, c, a, m, p

min_support = 0.5 (minimum support count = 3)

Steps:
1. Scan the DB once to find the frequent 1-itemsets (single-item patterns).
2. Order the frequent items in frequency-descending order: f, c, a, b, m, p.
3. Scan the DB again and construct the FP-tree.

Header table (item : count, each with a head of node-links): f:4, c:4, a:3, b:3, m:3, p:3

[Figure: the resulting FP-tree. Its paths from the root are f:4 → c:3 → a:3 → m:2 → p:2, f:4 → c:3 → a:3 → b:1 → m:1, f:4 → b:1, and c:1 → b:1 → p:1; node-links connect the nodes holding the same item to the header table.]
38
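A minimal sketch of the two-scan construction (the class name `FPNode`, the helper `build_fp_tree`, and the header-table layout are illustrative choices, not the authors' code); ties in frequency may be ordered differently from the slide's f-list without changing the mined patterns:

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}     # item -> child FPNode
        self.link = None       # node-link to the next node holding the same item

def build_fp_tree(transactions, min_count):
    """Returns (root, header) where header maps item -> [support count, first node-link]."""
    # Scan 1: find the frequent items and order them by descending frequency (the f-list)
    freq = Counter(item for t in transactions for item in t)
    freq = {i: c for i, c in freq.items() if c >= min_count}
    f_list = sorted(freq, key=lambda i: (-freq[i], i))
    header = {i: [freq[i], None] for i in f_list}

    # Scan 2: insert each transaction's frequent items, in f-list order, into the tree
    root = FPNode(None, None)
    for t in transactions:
        node = root
        for item in (i for i in f_list if i in t):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                # thread the new node onto the node-link chain for its item
                child.link, header[item][1] = header[item][1], child
            child.count += 1
            node = child
    return root, header

# The five transactions above, with minimum support count 3
db = [set("facdgimp"), set("abcflmo"), set("bfhjo"), set("bcksp"), set("afcelpmn")]
root, header = build_fp_tree(db, 3)
print([(i, header[i][0]) for i in header])
# [('c', 4), ('f', 4), ('a', 3), ('b', 3), ('m', 3), ('p', 3)] -- the c/f tie is broken
# alphabetically here; the slide puts f first, which yields an equivalent tree.
```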

Benefits of the FP-tree Structure
Completeness:
map each transaction into a path in the tree
preserves complete information for frequent pattern mining no need to scan the database any more
Compactness
reduce irrelevant information—infrequent items are gone
A path can store one or more transactions
Items in frequency descending order (f-list): more frequent items are more likely to be shared
never be larger than the original database (not counting node-links and the count fields)
39

How Effective Is FP-tree?
[Chart: size in KB (log scale, 1 to 100,000) of the transaction DB, the frequent-item-projected transaction DB (Freq. Tran. DB), the alphabetically ordered FP-tree, and the frequency-ordered FP-tree, as the support threshold varies from 0% to 100%. Dataset: Connect-4 (a dense dataset).]
40

Compressing Sparse Dataset
[Chart: size in KB (log scale, 0.01 to 100,000) of the same four representations as the support threshold varies from 0% to 8%. Dataset: T10I4D100k (a sparse dataset).]
41

Mining Frequent Patterns from FP-tree
(Frequent pattern = frequent itemset)
General idea (divide-and-conquer): Recursively
partition the set of frequent patterns
build conditional pattern base and conditional FP-tree for each partition
Partition the set of frequent patterns
Frequent patterns can be partitioned into subsets according to f-list: f-c-a-b-m-p (the list of freq. items in frequency- descending order)
Patterns containing p
Patterns having m but no p
...
Patterns having c but no a, b, m, or p
The pattern f
The partitioning is complete and without any overlap
42

Partitioning Frequent Patterns
(f-list: f-c-a-b-m-p)
[Diagram: all frequent patterns are partitioned by the f-list: first into patterns with p vs. without p; the patterns without p are then split on m, then on b, then on a, then on c, leaving only the pattern f.]
43

Find Frequent Patterns Having Item p
Only the transactions containing p are needed.
Form the p-conditional pattern base (p-projected database) TDB|p, which contains the transactions containing p:
Start at entry p of the header table, follow the node-link (side-link) of item p, and accumulate all the transformed prefix paths of p.
p-conditional pattern base TDB|p: fcam:2, cb:1
Local frequent item: c:3
Frequent patterns containing p: cp:3 (p itself is also frequent, with count 3)
(minimum support count = 3)
[Figure: the FP-tree with its header table; the prefix paths of p are collected via the node-links.]
44

Find Frequent Patterns Having Item m But No p
Form the m-conditional pattern base (m-projected database) TDB|m. Item p is excluded (by looking only at the prefix paths of m).
TDB|m contains fca:2, fcab:1.
Recursively apply FP-growth to find the frequent patterns from TDB|m.
Local frequent items: f, c, a.
After removing the locally infrequent item b: fca:2, fca:1.
Build the m-conditional FP-tree from TDB|m: a single path f:3 → c:3 → a:3 (header table: f, c, a).
Frequent patterns having m: fm, cm, am, fcm, fam, cam, fcam.
[Figure: the original FP-tree and header table, and the resulting m-conditional FP-tree.]
45

Find Frequent Patterns Having Item m But No p (a more complex situation)
Suppose the m-conditional pattern base is: fca:3, fb:3.
Local frequent items: f:6, c:3, a:3, b:3.
Build the m-conditional FP-tree: root → f:6, with branches c:3 → a:3 and b:3.
First generate the length-2 patterns: fm:6, cm:3, am:3, bm:3.
To learn longer patterns containing m, further partition the frequent patterns containing m (but no p) into:
patterns containing b
patterns containing a but no b
patterns containing c but no a or b
patterns containing only f (i.e., fm)
Compute the ym-conditional pattern bases:
ym    conditional pattern base
bm    f:3
am    fc:3
cm    f:3
46

Find Frequent Patterns Having Item m But No p (more complex situation, cont'd)
Having the ym-conditional pattern bases:
ym    conditional pattern base
bm    f:3
am    fc:3
cm    f:3
Build the ym-conditional FP-trees:
bm: root → f:3
am: root → f:3 → c:3
cm: root → f:3
Generate the frequent patterns with suffix ym: fbm, fam, cam, fcam, fcm.
47

Major Steps to Mine FP-tree
For each item in the FP-tree
1. Construct conditional pattern base
2. Construct conditional FP-tree from the conditional pattern-base
3. Generate frequent patterns from the conditional FP- tree
If the conditional FP-tree contains a single path, simply enumerate all the patterns
Otherwise, recursively mine the conditional FP-tree and grow frequent patterns obtained so far
48

Step 1: From FP-tree to Conditional Pattern Base
Start at the frequent-item header table of the FP-tree.
Traverse the FP-tree by following the node-link of each frequent item.
Accumulate all the prefix paths of that item to form its conditional pattern base.

Conditional pattern bases:
item   conditional pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1

[Figure: the FP-tree with its header table and node-links.]
49

Step 2: Construct the Conditional FP-tree
For each pattern base:
Accumulate the count of each item in the base.
Remove the locally infrequent items.
Construct the conditional FP-tree from the frequent items of the pattern base.

m-conditional pattern base: fca:2, fcab:1
After removing the infrequent item b: fca:2, fca:1
m-conditional FP-tree: root → f:3 → c:3 → a:3

p-conditional pattern base: fcam:2, cb:1
After removing the infrequent items: c:2, c:1
p-conditional FP-tree: root → c:3
50

Conditional Pattern-Bases and Conditional FP-trees
Item   Conditional pattern-base      Conditional FP-tree
p      {(fcam:2), (cb:1)}            {(c:3)}|p
m      {(fca:2), (fcab:1)}           {(f:3, c:3, a:3)}|m
b      {(fca:1), (f:1), (c:1)}       Empty
a      {(fc:3)}                      {(f:3, c:3)}|a
c      {(f:3)}                       {(f:3)}|c
51

Step 3: Generate Frequent Patterns from Conditional FP-tree
If an x-conditional FP-tree has a single path P, the complete set of frequent patterns with suffix x can be generated by enumerating all the combinations of the items in P.

Example: the m-conditional FP-tree is the single path f:3 → c:3 → a:3 (header table: f, c, a).
All frequent patterns with suffix m: fm:3, cm:3, am:3, fcm:3, fam:3, cam:3, fcam:3
52

Step 3: Generate Frequent Patterns from Conditional FP-tree (Contd.)
If an x-conditional FP-tree has more than one path:
For each item y that appears in the x-conditional FP-tree,
Generate pattern yx with support = the support of y in x-conditional FP-tree.
Construct yx-conditional pattern base and then yx-conditional FP-tree to generate frequent patterns with suffix yx (a recursive procedure).
53

Step 3: Generate Frequent Patterns from Conditional FP-tree (Contd.)
Suppose the m-conditional FP-tree is: root → f:6, with branches c:3 → a:3 and b:3 (item counts: f:6, c:3, a:3, b:3).

Generate the frequent 2-itemsets having m: fm:6, cm:3, am:3, bm:3.

Compute the ym-conditional pattern bases:
ym    conditional pattern base
bm    f:3
am    fc:3
cm    f:3

Build the ym-conditional FP-trees:
bm: root → f:3
am: root → f:3 → c:3
cm: root → f:3

Generate the frequent patterns with suffix ym: fbm:3, fam:3, cam:3, fcam:3, fcm:3
54

FP-Growth Algorithm
Input: an FP-tree built by scanning the DB
Output: the complete set of frequent patterns
Method: call FP-growth(FP-tree, null)

Procedure FP-growth(Tree, A)
  if Tree contains a single path P then
    for each combination B of the nodes in path P do
      generate pattern B ∪ A with support = the minimum support of the nodes in B
  else for each item ai in the header table of Tree do
    generate pattern B = {ai} ∪ A with support = the support of ai;
    construct B's conditional pattern base and then B's conditional FP-tree TreeB;
    if TreeB is not empty then
      call FP-growth(TreeB, B)
55
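Below is a compact, self-contained sketch in the same divide-and-conquer spirit. To stay short it represents each conditional database as a list of (prefix, count) pairs instead of an explicit FP-tree, so the single-path shortcut of the real algorithm is omitted; it is an illustration of the recursion, not the authors' implementation. On the five-transaction DB used throughout the slides (min support count 3) it returns 18 frequent patterns, including cp:3 and fcam:3 as derived above.

```python
from collections import Counter

def fp_growth(pattern_base, min_count, suffix=()):
    """Mine frequent patterns from a conditional pattern base.

    `pattern_base` is a list of (itemset, count) pairs; the top-level call passes the
    whole DB as [(transaction, 1), ...].  Returns {frozenset(pattern): support count}."""
    counts = Counter()
    for itemset, c in pattern_base:
        for item in itemset:
            counts[item] += c
    frequent = {i: n for i, n in counts.items() if n >= min_count}
    # f-list of this conditional database, in frequency-descending order
    f_list = sorted(frequent, key=lambda i: (-frequent[i], i))

    results = {}
    # Peel items from the bottom of the f-list up, exactly like the
    # "patterns with p" / "patterns with m but no p" / ... partitioning.
    for pos in range(len(f_list) - 1, -1, -1):
        item = f_list[pos]
        new_suffix = (item,) + suffix
        results[frozenset(new_suffix)] = frequent[item]
        allowed = set(f_list[:pos])            # only items that precede `item` in the f-list
        cond = []                              # `item`-conditional pattern base
        for itemset, c in pattern_base:
            if item in itemset:
                prefix = frozenset(itemset) & allowed
                if prefix:
                    cond.append((prefix, c))
        results.update(fp_growth(cond, min_count, new_suffix))
    return results

db = [set("facdgimp"), set("abcflmo"), set("bfhjo"), set("bcksp"), set("afcelpmn")]
patterns = fp_growth([(t, 1) for t in db], min_count=3)
print(len(patterns))                                        # 18
print(sorted((''.join(sorted(p)), n) for p, n in patterns.items()))
```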

Exercise
A transaction DB:
TID    Items
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

Support counts for the single items:
Item   Sup. count
{I1}   6
{I2}   7
{I3}   6
{I4}   2
{I5}   2

Find all frequent patterns with minimum support count = 2.
56

FP-tree
[Figure: the FP-tree constructed from the exercise DB on the previous slide.]
57

FP-growth vs. Apriori
[Chart: run time in seconds (0 to 100) of FP-growth vs. Apriori on dataset D1 = T25I20D10k (a sparse dataset), as the support threshold varies from 0% to 3%.]
58

Mining Very Dense Dataset
[Chart: runtime in seconds (0 to 400) of Apriori, TreeProjection, and FP-growth on dataset Connect-4 (a dense dataset), as the support threshold varies from 70% to 95%.]
59

Scalability of FP-growth
[Chart: runtime in seconds (0 to 700) of FP-growth as the number of transactions grows from 0 to 1,000K, for datasets T10I4D100k (support threshold 0.01%) and T25I20D100k (support threshold 0.30%).]
60

Why Is FP-growth Efficient?
Divide-and-conquer strategy:
Decompose both the mining task and the DB
Leads to a focused search of smaller databases
No candidate generation and no candidate test
Database compression using the FP-tree; no repeated scan of the entire database
Basic operations: counting local frequent items and building FP-trees; no pattern search and no pattern matching
61

Major Costs in FP-growth
Building FP-trees
A stack of conditional FP-trees: redundant information is stored across this stack of FP-trees.
Can we avoid the redundancy?
H-mine (another algorithm by Pei and Han)?
62

Compare FP-growth to Apriori
Search space for a DB with 4 items:
∅
a    b    c    d
ab   ac   ad   bc   bd   cd
abc  abd  acd  bcd
abcd
Apriori: breadth-first search.  FP-growth: depth-first search.
63

Outline
Basic concepts of association rule learning
Apriori algorithm
FP-Growth Algorithm
Finding interesting rules
64

Two Problems with Association Rule Mining
Quantity problem
Too many rules can be generated
Given a dataset, the number of rules generated depends on the support and confidence thresholds.
If the support threshold is high, a small number of rules are generated. But some interesting rules are missed.
If the support threshold is low, a huge number of rules are generated.
Quality problem
Not all the generated rules are interesting
65

Number of Generated Patterns versus Support Threshold (An Example)
Support threshold                    0.02   0.01   0.008   0.005   0.003   0.0028   0.0025   0.002       0.001
Num. of rules (conf. thres. = 0.5)      2     14      39      88     723    4,556   74,565   4,800,070   >10^9
Num. of rules (conf. thres. = 0.8)      1      7      17      38     591    4,172   65,615   3,584,339   >10^9

Number of sessions (transactions): 30,586
Number of objects (items): 38,679
66

Solutions to the Problems
Finding only maximal or closed frequent patterns
Other frequent patterns can be generated from them
Constraint-based data mining
Applying constraints in the mining process so the search can be more focused.
Using interestingness measures to remove or rank rules
Remove misleading associations and find correlation rules
Prune patterns using other interestingness measures
Using rule structures
Eliminate structurally and semantically redundant rules
Group or summarize related rules
67

Maximal Frequent Itemset
An itemset X is a maximal frequent itemset in a data set D if X is frequent and none of the proper supersets of X is frequent in D.
[Figure: the lattice of itemsets over {A, B, C, D, E}, with the border separating the frequent itemsets from the infrequent ones; the maximal frequent itemsets lie just inside the border.]

Maximal Frequent Patterns
Reducing the # of patterns returned to the user
Maximal frequent patterns are a lossy compression of frequent patterns
Given the set of all maximal frequent patterns and their supports in a data set D, we can generate all the frequent patterns, but not their supports.
Algorithm for mining maximal frequent itemsets: MaxMiner
R. Bayardo. Efficiently mining long patterns from databases.
SIGMOD’98
69

Closed Patterns
Problem with maximal frequent itemsets:
The supports of their subsets are not known; additional DB scans are needed to get these supports.
An itemset is closed if none of its proper supersets has the same support as the itemset.

Example data set:
TID   Items
1     {A, B}
2     {B, C, D}
3     {A, B, C, D}
4     {A, B, D}
5     {A, B, C, D}

Itemset supports:
{A}: 4    {B}: 5    {C}: 3    {D}: 4
{A,B}: 4  {A,C}: 2  {A,D}: 3  {B,C}: 3  {B,D}: 4  {C,D}: 3
{A,B,C}: 2  {A,B,D}: 3  {A,C,D}: 2  {B,C,D}: 3  {A,B,C,D}: 2

Closed Frequent Patterns
An itemset X is a closed frequent itemset in a data set D if X is both closed and frequent in D with respect to a support threshold.
Closed frequent itemsets are a lossless compression of the frequent patterns.
Reducing the # of patterns returned to the user:
Given the set of all closed frequent patterns and their supports in a data set D, the user can generate all the frequent patterns and their supports.
Algorithm for finding closed frequent patterns: CLOSET
J. Pei, J. Han & R. Mao. “CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets”, DMKD’00.
71

Maximal vs Closed Frequent Itemsets
Minimum support count = 2

TID   Items
1     ABC
2     ABCD
3     BCE
4     ACDE
5     DE

[Figure: the itemset lattice over {A, B, C, D, E}, annotated with the TIDs supporting each frequent itemset; the closed frequent itemsets are marked, and those that are also maximal are marked separately. An itemset can be closed but not maximal; every maximal frequent itemset is also closed.]

# Closed = 9
# Maximal = 4
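The counts above can be checked with a small brute-force sketch (fine for this tiny example only; the itemset enumeration is exponential in general):

```python
from itertools import combinations

db = [set("ABC"), set("ABCD"), set("BCE"), set("ACDE"), set("DE")]
min_count = 2

# All frequent itemsets with their support counts (brute force)
items = sorted(set().union(*db))
freq = {}
for r in range(1, len(items) + 1):
    for c in map(frozenset, combinations(items, r)):
        n = sum(c <= t for t in db)
        if n >= min_count:
            freq[c] = n

# Closed: no proper superset with the same support.  Maximal: no frequent proper superset.
closed  = {x for x in freq if all(freq[y] != freq[x] for y in freq if x < y)}
maximal = {x for x in freq if not any(x < y for y in freq)}
print(len(freq), len(closed), len(maximal))   # 14 9 4
```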

Closed Patterns and Max-Patterns
Exercise. DB = {<a1, ..., a100>, <a1, ..., a50>}, min_sup = 1.
What is the set of closed frequent itemsets?
<a1, ..., a100>: 1
<a1, ..., a50>: 2
What is the set of maximal frequent itemsets?
<a1, ..., a100>: 1
What is the set of all frequent itemsets?
All 2^100 − 1 nonempty subsets of {a1, ..., a100}: far too many to list!
73

Solutions to the Problems
Finding only maximal or closed frequent patterns
Other frequent patterns can be generated from them
Constraint-based data mining
Applying constraints in the mining process so the search can be more focused.
Using interestingness measures to remove or rank rules
Remove misleading associations and find correlation rules
Prune patterns using other interestingness measures
Using rule structures
Eliminate structurally and semantically redundant rules
Group or summarize related rules
74

Constraint-based Frequent Pattern Mining
Mining frequent patterns with constraint C
find all patterns satisfying not only min_sup, but also constraint C
Examples of Constraints
? → a particular product
a particular product → ?
small sales (price < $10) triggers big sales (sum > $200)
75

Constraint-based Frequent Pattern Mining (Cont'd)
A naïve solution:
Test the frequent patterns against C as a post-processing step.
Some constraints can be incorporated into the mining process to improve the efficiency
More efficient approaches
Analyze the properties of constraints comprehensively
Push the constraint as deeply as possible inside the frequent pattern mining
Example: find all frequent itemsets containing item “b”
76

Types of Constraints
Anti-monotonic constraints
If an itemset S satisfies the constraint, so does every subset of S (equivalently, if S violates the constraint, so does every superset of S).
Monotonic constraints
If an itemset S satisfies the constraint, so does every superset of S.
Examples
Sum of the prices of the items in S ≤ 100 is anti-monotone
Maximum price in S ≤ 15 is anti-monotone
Sum of the prices of the items in S ≥ 100 is monotone
Minimum price in S ≤ 15 is monotone
77

How to Use Antimonotonic or Monotonic Constraints in Mining
Antimonotonic constraints
In Apriori:
Use it to prune candidates in each iteration
In FP-growth
Use it to stop growing a pattern
Monotonic constraints
If an itemset satisfies a monotonic constraint, no need to check its supersets on the constraint
Only their support needs to be checked
78

Types of Constraints (Cont’d)
Convertible constraints
Some constraints are not anti-monotonic or monotonic
But can be converted to anti-monotonic or monotonic by properly ordering items
Example of convertible constraint: Average price of the items in S ≥ 25
Order items in price-descending order
If an itemset afb violates C, so does afbh and any extension afb*
It becomes anti-monotone!
79

Example of Convertible Constraints
Convertible constraint:
Average price of the items in S ≥ 25
Price-descending order of items: a, f, g, d
Search space:
∅
a    f    g    d
af   ag   ad   fg   fd   gd
afg  afd  agd  fgd
afgd
How would you build an FP-tree to do such a search?
80

Solutions to the Problems
Finding only maximal or closed frequent patterns
Other frequent patterns can be generated from them
Constraint-based data mining
Applying constraints in the mining process so the search can be more focused.
Using interestingness measures to remove or rank rules
Remove misleading associations and find correlation rules
Prune patterns using other interestingness measures
Using rule structures
Eliminate structurally and semantically redundant rules
Group or summarize related rules
81

Misleading Association Rules
                Basketball   Not basketball   Sum (row)
Cereal          2000         1750             3750
Not cereal      1000         250              1250
Sum (col.)      3000         2000             5000

play basketball ⇒ eat cereal [40%, 66.7%] is misleading:
the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although it has lower support and confidence.
Association ≠ Correlation!
82
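The correlation test on the next slide compares P(AB) with P(A)P(B); their ratio (commonly called lift) is below 1 exactly when A and B are negatively correlated. A quick check on the contingency table above:

```python
# Counts from the contingency table (5000 students)
n = 5000
basketball, cereal, both = 3000, 3750, 2000

support    = both / n                                     # P(basketball, cereal) = 40%
confidence = both / basketball                            # P(cereal | basketball) = 66.7%
lift       = support / ((basketball / n) * (cereal / n))  # P(AB) / (P(A) P(B))

print(f"support={support:.0%} confidence={confidence:.1%} lift={lift:.3f}")
# support=40% confidence=66.7% lift=0.889 -> lift < 1: playing basketball and eating
# cereal are negatively correlated, even though the confidence looks high on its own.
```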

Interestingness Measure: Correlation
Correlation:
If P(A|B) > P(A), A and B are positively correlated.
Note: P(A|B) > P(A) ⇔ P(B|A) > P(B) ⇔ P(AB) > P(A)P(B)
If P(A|B) < P(A), A and B are negatively correlated.