Finding Frequent Itemsets: Limited Pass Algorithms
Thanks to the following for source slides and material:
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets
http://www.mmds.org
Limited Pass Algorithms
• Algorithms so far compute the exact collection of frequent itemsets of size k in k passes
  – A-Priori, PCY, Multistage, Multihash
• In many applications it is not essential to discover every frequent itemset
  – Sufficient to discover most of them
• Next: algorithms that find all or most frequent itemsets using at most 2 passes over the data
  – Sampling
  – SON
  – Toivonen’s Algorithm
Random Sampling of Input Data
Random Sampling
• Take a random sample of the market baskets that fits in main memory
  – Leave enough space in memory for the counts
• Run A-Priori or one of its improvements in main memory
  – For sets of all sizes, not just pairs
  – Don’t pay for disk I/O each time we increase the size of itemsets
• Reduce the support threshold proportionally to match the sample size
[Figure: main memory holds a copy of the sample baskets plus space for the counts]
How to Pick the Sample
• Best way: read the entire data set
  – For each basket, select that basket for the sample with probability p
  – For input data with m baskets, the sample will contain close to pm baskets at the end
• If the file is part of a distributed file system, we can pick chunks at random for the sample
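A minimal sketch of this selection rule, assuming baskets arrive as an iterable stream (function and variable names here are illustrative, not from the source):

```python
import random

def sample_baskets(basket_stream, p, seed=None):
    """Select each basket independently with probability p.

    For m input baskets, the sample size is close to p*m in
    expectation, so a small p yields a sample that fits in memory.
    """
    rng = random.Random(seed)
    return [basket for basket in basket_stream if rng.random() < p]

# Usage sketch: a 1% sample, to be mined with a threshold scaled to s/100
# sample = sample_baskets(read_baskets("baskets.txt"), p=0.01)
```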
Support Threshold for Random Sampling
• Adjust the support threshold to a suitable, scaled-back number
  – To reflect the smaller number of baskets
• Example
  – If the sample size is 1%, or 1/100, of the baskets, use s/100 as the support threshold
  – An itemset is frequent in the sample if it appears in at least s/100 of the baskets in the sample
Random Sampling: Not an Exact Algorithm
• With a single pass, we cannot guarantee:
  – That the algorithm will produce all itemsets that are frequent in the whole dataset
    • False negative: an itemset that is frequent in the whole dataset but not in the sample
  – That it will produce only itemsets that are frequent in the whole dataset
    • False positive: an itemset that is frequent in the sample but not in the whole dataset
• If the sample is large enough, there are unlikely to be serious errors
Random Sampling: Avoiding Errors
• Eliminate false positives
  – Make a second pass through the full dataset
  – Count all itemsets that were identified as frequent in the sample
  – Verify that the candidate itemsets are truly frequent in the entire dataset
• But this doesn’t eliminate false negatives
  – Itemsets that are frequent in the whole dataset but not in the sample remain undiscovered
• Reduce false negatives
  – Before, we used threshold ps, where p is the sampling fraction
  – Reduce this threshold, e.g., to 0.9ps
  – More itemsets of each size then have to be counted, requiring more space if memory allows
  – The smaller threshold helps catch more truly frequent itemsets
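A sketch of the verification pass, assuming `candidates` holds the itemsets found frequent in the sample and `s` is the support threshold for the full dataset (names are illustrative):

```python
from collections import Counter

def verify_candidates(basket_stream, candidates, s):
    """Second pass over the full dataset: count each sample-frequent
    itemset exactly and keep only those with support >= s.
    Eliminates false positives; cannot recover false negatives."""
    counts = Counter()
    cands = [frozenset(c) for c in candidates]
    for basket in basket_stream:
        items = set(basket)
        for c in cands:
            if c <= items:       # candidate is contained in this basket
                counts[c] += 1
    return {c for c in cands if counts[c] >= s}
```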
Savasere, Omiecinski and Navathe (SON) Algorithm
SON Algorithm
• Avoids false negatives and false positives
• Requires two full passes over the data
SON Algorithm – (1)
• Repeatedly read small subsets of the baskets into main memory
• Run an in-memory algorithm (e.g., A-Priori or random sampling) to find all frequent itemsets
  – Note: we are not sampling, but processing the entire file in memory-sized chunks
• An itemset becomes a candidate if it is found to be frequent in any one or more subsets of the baskets
SON Algorithm – (2)
• On a second pass, count all the candidate itemsets and determine which are frequent in the entire set
• Key “monotonicity” idea: an itemset cannot be frequent in the entire set of baskets unless it is frequent in at least one subset
  – Each subset or chunk contains fraction p of the whole file, so there are 1/p chunks
  – If an itemset is not frequent in any chunk, then its support in each chunk is less than ps
  – Its support in the whole file is then less than (1/p)(ps) = s: not frequent
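A sketch of the two-pass SON loop, assuming an in-memory miner `apriori(baskets, threshold)` and that `chunks` covers the whole file (in practice each chunk would be streamed from disk; all names are illustrative):

```python
def son(chunks, s, apriori):
    """SON: two full passes, no false positives or false negatives."""
    p = 1.0 / len(chunks)
    # Pass 1: anything frequent in some chunk becomes a candidate.
    # By monotonicity, nothing frequent overall can miss every chunk.
    candidates = set()
    for chunk in chunks:
        candidates |= {frozenset(f) for f in apriori(chunk, p * s)}
    # Pass 2: exact counts of all candidates over the whole file
    counts = dict.fromkeys(candidates, 0)
    for chunk in chunks:
        for basket in chunk:
            items = set(basket)
            for c in candidates:
                if c <= items:
                    counts[c] += 1
    return {c for c, n in counts.items() if n >= s}
```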
SON – Distributed Version
• SON lends itself to distributed data mining
  – MapReduce
• Baskets are distributed among many nodes
  – Subsets of the data may correspond to one or more chunks in the distributed file system
• Compute frequent itemsets at each node
• Distribute candidates to all nodes
• Accumulate the counts of all candidates
SON: Map/Reduce
Phase 1: Find candidate itemsets
• Map
  – Input is a chunk/subset of all baskets: fraction p of the total input file
  – Find the itemsets frequent in that subset (e.g., using the random sampling algorithm)
  – Use support threshold ps
  – Output is a set of key-value pairs (F, 1), where F is an itemset found frequent in the chunk
• Reduce
  – Each reduce task is assigned a set of keys, which are itemsets
  – Produces those keys that appear one or more times
  – These itemsets, frequent in some subset, are the candidate itemsets
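A sketch of Phase 1 as plain map and reduce functions, assuming an in-memory miner `apriori(baskets, threshold)`; this is not a specific MapReduce API, just the shape of the computation:

```python
def phase1_map(chunk, p, s, apriori):
    """Emit (F, 1) for each itemset F frequent in this chunk,
    mined with the scaled threshold p*s."""
    for f in apriori(chunk, p * s):
        yield (frozenset(f), 1)

def phase1_reduce(itemset, ones):
    """Any key that appears at least once was frequent in some
    subset, so it is emitted as a candidate itemset."""
    yield itemset
```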
SON: Map/Reduce
Phase 2: Find the true frequent itemsets
• Map
  – Each Map task takes the output from the first Reduce task AND a chunk of the total input data file
  – All candidate itemsets go to every Map task
  – Count the occurrences of each candidate itemset among the baskets in the input chunk
  – Output is a set of key-value pairs (C, v), where C is a candidate itemset and v is its support among the baskets in the input chunk
• Reduce
  – Each reduce task is assigned a set of keys (itemsets)
  – Sums the associated values for each key: the total support for that itemset
  – If the support of an itemset is >= s, emit the itemset and its count
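The matching sketch for Phase 2, under the same illustrative conventions:

```python
def phase2_map(chunk, candidates):
    """Count each candidate's support within this chunk; every Map
    task receives the full candidate set plus one input chunk."""
    for c in candidates:
        v = sum(1 for basket in chunk if c <= set(basket))
        if v > 0:
            yield (c, v)

def phase2_reduce(itemset, chunk_supports, s):
    """Sum the per-chunk supports; emit the itemset only if its
    total support over the whole file reaches s."""
    total = sum(chunk_supports)
    if total >= s:
        yield (itemset, total)
```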
Toivonen’s Algorithm
Toivonen’s Algorithm
• Given sufficient main memory, uses one pass over a small sample and one full pass over the data
• Gives no false positives or false negatives
• BUT there is a small but finite probability that it will fail to produce an answer
  – In that case it does not identify the frequent itemsets
  – It must then be repeated with a different sample until it gives an answer
  – Only a small number of iterations is typically needed
Toivonen’s Algorithm (1)
First, find candidate frequent itemsets from the sample
• Start as in the random sampling algorithm, but lower the threshold slightly for the sample
  – Example: if the sample is 1% of the baskets, use s/125 as the support threshold rather than s/100
  – In general, for fraction p of baskets in the sample, use 0.8ps or 0.9ps as the support threshold
• The goal is to avoid missing any itemset that is frequent in the full set of baskets
• The smaller the threshold:
  – The more memory is needed to count all candidate itemsets
  – The less likely the algorithm is to fail to produce an answer
Toivonen’s Algorithm – (2)
After finding the frequent itemsets for the sample, construct the negative border
• Negative border: the collection of itemsets that are not frequent in the sample, but all of whose immediate subsets are frequent
  – An immediate subset is constructed by deleting exactly one item
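A sketch of the negative-border construction, assuming `frequent` is the set of itemsets (as frozensets) found frequent in the sample and `items` is the set of all items (names are illustrative):

```python
def negative_border(frequent, items):
    """Itemsets that are not frequent in the sample but all of whose
    immediate subsets (delete exactly one item) are frequent."""
    border = set()
    # Singletons: their only immediate subset is the empty set,
    # which is always frequent
    for i in items:
        if frozenset([i]) not in frequent:
            border.add(frozenset([i]))
    # Any larger border itemset is some frequent itemset plus one item
    for f in frequent:
        for i in items - f:
            cand = f | {i}
            if cand in frequent:
                continue
            if all(cand - {j} in frequent for j in cand):
                border.add(cand)
    return border
```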
Example: Negative Border
• {A,B,C,D} is in the negative border if and only if:
  1. It is not frequent in the sample, but
  2. All of {A,B,C}, {B,C,D}, {A,C,D}, and {A,B,D} are frequent
     – Immediate subsets: formed by deleting exactly one item
• {A} is in the negative border if and only if it is not frequent in the sample
  – Note: the empty set, the only immediate subset of a singleton, is always frequent
[Figure: Picture of the negative border – the border itemsets (singletons, doubletons, tripletons, …) sit just outside the frequent itemsets from the sample]
Toivonen’s Algorithm (1) – First pass:
(1) Find candidate frequent itemsets from the sample
  – Sample on the first pass!
  – Use a lower threshold: for fraction p of baskets in the sample, use 0.8ps or 0.9ps as the support threshold
  – This identifies the itemsets that are frequent for the sample
(2) Construct the negative border
  – Itemsets that are not frequent in the sample, but all of whose immediate subsets are frequent
Toivonen’s Algorithm – (3)
• In the second pass, process the whole file (no sampling!)
• Count:
  – all candidate frequent itemsets from the first pass
  – all itemsets on the negative border
• Case 1: no itemset from the negative border turns out to be frequent in the whole dataset
  – The correct set of frequent itemsets is exactly those itemsets from the sample that were also found frequent in the whole dataset
• Case 2: some member of the negative border is frequent in the whole dataset
  – Can give no answer at this time
  – Must repeat the algorithm with a new random sample
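A sketch of the second pass, assuming `sample_frequent` and `border` are sets of frozensets produced by the first pass (names are illustrative):

```python
def toivonen_second_pass(baskets, sample_frequent, border, s):
    """Count every sample-frequent itemset and every negative-border
    itemset over the whole file. Returns the true frequent itemsets
    (Case 1), or None if some border itemset is frequent (Case 2)."""
    counts = dict.fromkeys(sample_frequent | border, 0)
    for basket in baskets:
        items = set(basket)
        for c in counts:
            if c <= items:
                counts[c] += 1
    if any(counts[b] >= s for b in border):
        return None  # Case 2: must repeat with a new random sample
    return {c for c in sample_frequent if counts[c] >= s}
```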
Toivonen’s Algorithm – (4)
• Goal: save time by looking at only a sample on the first pass
  – But is the set of frequent itemsets for the sample the correct set for the whole input file?
• If some member of the negative border is frequent in the whole dataset, we can’t be sure that there are not some even larger itemsets that:
  – are neither in the negative border nor in the collection of frequent itemsets for the sample,
  – but are frequent in the whole dataset
• So start over with a new sample
• Try to choose the support threshold so that the probability of failure is low, while the number of itemsets counted on the second pass still fits in main memory
A few slides on Hashing
Based on: G. K. Gupta, Introduction to Data Mining with Case Studies, Prentice Hall India, 2006.
Hashing
In the PCY algorithm, when generating L1, the set of frequent itemsets of size 1, the algorithm also:
• generates all possible pairs for each basket
• hashes them to buckets
• keeps a count for each hash bucket
• identifies the frequent buckets (count >= s)
Recall: Main-Memory Picture of PCY
[Figure: in Pass 1, main memory holds the item counts and a hash table for pairs; in Pass 2, it holds the frequent items, the bitmap of frequent buckets, and the counts of candidate pairs]
Example
Consider the basket database in the first table below.
All itemsets of size 1 were determined to be frequent on the previous pass.
The second table shows all possible 2-itemsets for each basket.

Basket ID   Items
100         Bread, Cheese, Eggs, Juice
200         Bread, Cheese, Juice
300         Bread, Milk, Yogurt
400         Bread, Juice, Milk
500         Cheese, Juice, Milk

Basket ID   Possible 2-itemsets
100         (B,C) (B,E) (B,J) (C,E) (C,J) (E,J)
200         (B,C) (B,J) (C,J)
300         (B,M) (B,Y) (M,Y)
400         (B,J) (B,M) (J,M)
500         (C,J) (C,M) (J,M)
Example Hash Function
• For each pair, a numeric value is obtained by first representing B by 1, C by 2, E by 3, J by 4, M by 5, and Y by 6
• Each pair can then be represented by a two-digit number, e.g., (B,E) by 13 and (C,M) by 26
• Apply a hash function to these numbers: e.g., the number modulo 8
  – The hashed value is the bucket number
• Keep a count of the number of pairs hashed to each bucket
• Buckets whose count reaches the support threshold are frequent buckets
  – Set the corresponding bit in the bitmap to 1; otherwise the bit is 0
• All pairs that hash only to zero-bit buckets are removed as candidates
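A sketch of this Pass-1 bucket counting on the example data, using the stated item codes and the modulo-8 hash (function names are illustrative):

```python
from itertools import combinations

# Item codes from the example: B=1, C=2, E=3, J=4, M=5, Y=6
CODE = {'B': 1, 'C': 2, 'E': 3, 'J': 4, 'M': 5, 'Y': 6}
NUM_BUCKETS = 8

def bucket(pair):
    """Two-digit number from the item codes, modulo 8.
    E.g., (B,C) -> 12 -> bucket 4; (C,J) -> 24 -> bucket 0."""
    a, b = sorted(pair, key=CODE.get)
    return (10 * CODE[a] + CODE[b]) % NUM_BUCKETS

def pcy_pass1_bitmap(baskets, s):
    """Hash every pair of every basket to a bucket, count per bucket,
    and return the bitmap marking frequent buckets (count >= s)."""
    counts = [0] * NUM_BUCKETS
    for basket in baskets:
        for pair in combinations(basket, 2):
            counts[bucket(pair)] += 1
    return [1 if c >= s else 0 for c in counts]

baskets = [
    ['B', 'C', 'E', 'J'],  # 100
    ['B', 'C', 'J'],       # 200
    ['B', 'M', 'Y'],       # 300
    ['B', 'J', 'M'],       # 400
    ['C', 'J', 'M'],       # 500
]
print(pcy_pass1_bitmap(baskets, s=3))  # [1, 0, 0, 0, 0, 1, 1, 1]
```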
Hashing Example
Support threshold s = 3. The possible pairs per basket:

100   (B,C) (B,E) (B,J) (C,E) (C,J) (E,J)
200   (B,C) (B,J) (C,J)
300   (B,M) (B,Y) (M,Y)
400   (B,J) (B,M) (J,M)
500   (C,J) (C,M) (J,M)

Sample hash computations (item codes: B=1, C=2, E=3, J=4, M=5, Y=6):
(B,C) -> 12, 12 % 8 = 4;  (B,E) -> 13, 13 % 8 = 5;  (C,J) -> 24, 24 % 8 = 0

Bucket   Count   Pairs that hash to bucket   Bit (frequent bucket?)
0        5       (C,J) (B,Y) (M,Y)           1
1        1       (C,M)                       0
2        1       (E,J)                       0
3        0                                   0
4        2       (B,C)                       0
5        3       (B,E) (J,M)                 1
6        3       (B,J)                       1
7        3       (C,E) (B,M)                 1

Bucket 5 is frequent. Are any of the pairs that hash to that bucket frequent?
Does Pass 1 of PCY know which pairs contributed to the bucket?
No: at the end of Pass 1, we know only which buckets are frequent.
All pairs that hash to those buckets are candidates and will be counted in Pass 2.
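A sketch of how Pass 2 uses the bitmap, under the same illustrative item codes: pairs of frequent items that hash to a frequent bucket are the only pairs whose exact counts are kept.

```python
from itertools import combinations

# Same illustrative item codes and hash as in the Pass-1 sketch above
CODE = {'B': 1, 'C': 2, 'E': 3, 'J': 4, 'M': 5, 'Y': 6}

def bucket(pair):
    a, b = sorted(pair, key=CODE.get)
    return (10 * CODE[a] + CODE[b]) % 8

def pcy_candidate_pairs(baskets, bitmap, frequent_items):
    """Pass-2 candidates: pairs of frequent items whose bucket bit
    is 1. Only these pairs are counted exactly in Pass 2."""
    candidates = set()
    for basket in baskets:
        items = [i for i in basket if i in frequent_items]
        for pair in combinations(items, 2):
            if bitmap[bucket(pair)]:
                candidates.add(tuple(sorted(pair, key=CODE.get)))
    return candidates

# Example: with bitmap [1, 0, 0, 0, 0, 1, 1, 1] and all items frequent,
# the candidates are exactly the pairs in buckets 0, 5, 6, and 7.
```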
Reducing the Number of Candidate Pairs
• Goal: reduce the size of the candidate set C2
  – Only the candidate pairs have to be counted
  – These are the pairs that hash to a frequent bucket
• It is essential that the hash table be large enough so that collisions are few
• Collisions result in a loss of effectiveness of the hash table
  – In our example, three frequent buckets had collisions
  – All of those pairs must be counted to determine which are truly frequent