
Frequent itemset mining

Frequent patterns
• Frequent patterns: patterns (e.g., itemsets, subsequences, or substructures) that appear frequently in a data set


• Example: a set of items, such as milk and bread, that appear frequently together in a transaction data set is a frequent itemset
• Finding frequent patterns plays an essential role in mining associations, correlations, and many other interesting relationships among data
• Application examples: market basket analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.

• Let I = {i1, …, in} be a set of one or more data items
• Transaction T ⊆ I: a subset of the items in I
• Itemset X ⊆ I: a subset of items
• k-itemset X: an itemset with k items
• Absolute support of X: frequency of occurrence of X, i.e., the number of transactions that include X
• Relative support: fraction of transactions that include X (probability that X appears in a transaction)
• Frequent itemset: itemset with support above a threshold
Example transaction data set:

Items bought
Bread, Milk
Bread, Coffee, Milk
Bread, Cocoa, Milk
Cocoa, Coffee, Milk
Bread, Cocoa, Coffee, Milk

[Venn diagram: customers buying milk, customers buying bread, and customers buying both]
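To make the definitions concrete, here is a minimal Python sketch (the function and variable names are mine, not from the slides) that computes the absolute and relative support of an itemset over these transactions:

    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Coffee", "Milk"},
        {"Bread", "Cocoa", "Milk"},
        {"Cocoa", "Coffee", "Milk"},
        {"Bread", "Cocoa", "Coffee", "Milk"},
    ]

    def support(itemset, transactions):
        # absolute support: number of transactions that include the itemset
        absolute = sum(1 for t in transactions if set(itemset) <= t)
        # relative support: fraction of transactions (estimate of P(X))
        return absolute, absolute / len(transactions)

    print(support({"Bread", "Milk"}, transactions))  # (4, 0.8)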

Association rules
• Association rule A→B, where A and B are itemsets
• Support s:
  fraction of transactions including both A and B
  probability P(A ∪ B)
• Confidence c:
  fraction of transactions including A that also include B
  probability P(B|A)
• Example, over the transaction data set above: {Bread, Milk}→{Coffee}
  s = count({Bread, Coffee, Milk}) / |T| = 2/5 = 0.4
  c = count({Bread, Coffee, Milk}) / count({Bread, Milk}) = 2/4 = 0.5

Strong rules
• Goal: given a set T of transactions, find strong association rules that have
  • Support ≥ minsup threshold
  • Confidence ≥ minconf threshold
• All the rules below partition {Bread, Coffee, Milk} into antecedent and consequent
  • They all have the same support
  • They have different confidence

{Bread,Milk}→{Coffee}    s=0.4, c=0.5
{Bread,Coffee}→{Milk}    s=0.4, c=1.0
{Milk,Coffee}→{Bread}    s=0.4, c=0.67
{Bread}→{Coffee,Milk}    s=0.4, c=0.5
{Coffee}→{Bread,Milk}    s=0.4, c=0.67
{Milk}→{Bread,Coffee}    s=0.4, c=0.4
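These values can be reproduced with a short sketch that partitions the itemset into antecedent and consequent (reusing the transactions list and support function from the earlier sketch):

    from itertools import combinations

    itemset = {"Bread", "Coffee", "Milk"}
    sup_full = support(itemset, transactions)[0]
    for r in range(1, len(itemset)):
        for a in map(set, combinations(sorted(itemset), r)):
            b = itemset - a
            s = sup_full / len(transactions)            # identical for all six rules
            c = sup_full / support(a, transactions)[0]  # estimate of P(b | a)
            print(f"{a} -> {b}: s={s:.2f}, c={c:.2f}")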

Association rule mining
Association rule mining operates in two steps:
1. Find all frequent itemsets
  • Determines the performance of the approach
2. Generate strong association rules from the frequent itemsets
  • Much less costly
Note: a frequent itemset of length 100 includes C(100,1) = 100 frequent 1-itemsets, C(100,2) = 100·99/2 = 4950 frequent 2-itemsets, and so on; overall, it includes 2^100 − 1 ≈ 1.27×10^30 frequent sub-itemsets.
Too many!!
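A quick check of these counts with Python's standard library:

    import math

    print(math.comb(100, 2))  # 4950 frequent 2-itemsets
    print(2**100 - 1)         # 1267650600228229401496703205375 ≈ 1.27×10^30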

Itemset lattice
[Itemset lattice over items A–E: the empty set at the top; 1-itemsets A, B, C, D, E; 2-itemsets AB, AC, AD, AE, BC, BD, BE, CD, CE, DE; 3-itemsets ABC, ABD, ABE, ACD, ACE, ADE, BCD, BCE, BDE, CDE; 4-itemsets ABCD, ABCE, ABDE, ACDE, BCDE; 5-itemset ABCDE]

Closed and maximal itemsets
• X is a closed itemset if there exists no proper super-itemset Y with the same absolute support (i.e., frequency) as X
• X is a closed frequent itemset if it is a closed itemset with frequency above minfreq
• X is a maximal frequent itemset (max-itemset) if it is frequent and there exists no super-itemset Y such that X ⊂ Y and Y is frequent
  • i.e., X is frequent and is not included in any other frequent itemset
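As an illustration, a minimal sketch with hypothetical toy supports (in practice the supports table would come from a frequent itemset miner) that flags closed and maximal frequent itemsets:

    # absolute supports of some itemsets (hypothetical values)
    supports = {
        frozenset("A"): 4, frozenset("B"): 3,
        frozenset("AB"): 3, frozenset("ABC"): 2,
    }
    minfreq = 2

    def is_closed(x):
        # no proper super-itemset with the same support
        return not any(x < y and supports[y] == supports[x] for y in supports)

    def is_maximal(x):
        # frequent, with no frequent proper super-itemset
        return supports[x] >= minfreq and \
               not any(x < y and supports[y] >= minfreq for y in supports)

    for x in supports:
        print(sorted(x), supports[x], is_closed(x), is_maximal(x))
    # B is not closed (AB has the same support); only ABC is maximal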

Apriori property
• Apriori property: all non-empty subsets of a frequent itemset must also be frequent
• If I is not frequent, adding i to I produces an itemset I ∪ {i} that cannot appear more frequently than I
• Antimonotonicity: if a set does not pass a test, its supersets will also fail the same test
[Lattice figure: an itemset marked "not frequent" and all of its supersets pruned from the search space]

Apriori algorithm (1)
• Scan all transactions and count the occurrences of each item (the 1-itemsets form the candidate set C1)
• Select the 1-itemsets in C1 satisfying the minsup threshold, and include them in L1
• Compute the set of candidate frequent 2-itemsets C2 as L1 ⋈ L1
• Scan all transactions to compute the absolute support of each 2-itemset in C2
• Select the 2-itemsets in C2 satisfying the minsup threshold, and include them in L2
• Compute the set of candidate frequent 3-itemsets C3 as L2 ⋈ L2
• …
• Stop when Ci is empty

Apriori algorithm (2)
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent 1-itemsets}
for (k = 2; Lk-1 ≠ ∅; k++) do begin
    Ck = candidates generated from Lk-1
    for each transaction t in database do
        increment the count of all candidates in Ck that are contained in t
    Lk = candidates in Ck with support ≥ minsup
end
return ∪k Lk
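The same loop as a runnable Python sketch (function and variable names are mine; candidate generation and pruning follow the join condition on the next slide):

    from itertools import combinations

    def apriori(transactions, minsup):
        # transactions: list of sets; minsup: absolute support threshold
        transactions = [frozenset(t) for t in transactions]
        counts = {}
        for t in transactions:                    # C1: count every item
            for i in t:
                key = frozenset([i])
                counts[key] = counts.get(key, 0) + 1
        L = {x for x, c in counts.items() if c >= minsup}   # L1
        frequent = {x: counts[x] for x in L}
        k = 2
        while L:
            # self-join: merge pairs of (k-1)-itemsets sharing their first k-2 items
            prev = sorted(sorted(x) for x in L)
            C = set()
            for a, b in combinations(prev, 2):
                if a[:k-2] == b[:k-2]:
                    cand = frozenset(a) | frozenset(b)
                    # prune: every (k-1)-subset must be frequent (Apriori property)
                    if all(frozenset(s) in L for s in combinations(cand, k-1)):
                        C.add(cand)
            # one database scan counts all candidates of size k
            counts = {c: sum(1 for t in transactions if c <= t) for c in C}
            L = {x for x, n in counts.items() if n >= minsup}
            frequent.update({x: counts[x] for x in L})
            k += 1
        return frequent

    # on the five-transaction example above with minsup = 3, this returns:
    # {Bread}:4 {Milk}:5 {Cocoa}:3 {Coffee}:3 {Bread,Milk}:4 {Cocoa,Milk}:3 {Coffee,Milk}:3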

Generate candidates
• To generate candidates Ck, compute a self-join between Lk-1 and Lk-1
• Join condition: the first k-2 items must be in common
• Itemsets l1 and l2 in Lk-1 are joined if
  (l1[1]=l2[1]) ∧ (l1[2]=l2[2]) ∧ … ∧ (l1[k-2]=l2[k-2]) ∧ (l1[k-1] < l2[k-1])

Generating association rules
• confidence(A→B) = P(B|A) = support(A ∪ B) / support(A)
• Association rules can be generated as follows:
  1. For each frequent itemset l, generate all non-empty proper subsets of l
  2. For every non-empty subset s of l, generate the rule s→(l−s) if support(l) / support(s) ≥ minconf
Note: minimum support is guaranteed by the fact that l is a frequent itemset
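A minimal sketch of this procedure, assuming frequent maps each frequent itemset to its support count (as returned by the Apriori sketch above):

    from itertools import combinations

    def gen_rules(frequent, minconf):
        rules = []
        for l, sup_l in frequent.items():
            if len(l) < 2:
                continue
            for r in range(1, len(l)):                 # non-empty proper subsets of l
                for s in map(frozenset, combinations(l, r)):
                    # support(s) is always available: subsets of a frequent
                    # itemset are themselves frequent (Apriori property)
                    conf = sup_l / frequent[s]
                    if conf >= minconf:
                        rules.append((set(s), set(l - s), conf))
        return rules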

Generating association rules: example
• Consider the frequent itemset X = {I1, I2, I5}
• The relevant support counts are: {I1,I2,I5}=2, {I1,I2}=4, {I1,I5}=2, {I2,I5}=2, {I1}=6, {I2}=7, {I5}=2
• The non-empty proper subsets of X generate the rules
  • {I1,I2}→{I5} with confidence 2/4
  • {I1,I5}→{I2} with confidence 2/2
  • {I2,I5}→{I1} with confidence 2/2
  • {I1}→{I2,I5} with confidence 2/6
  • {I2}→{I1,I5} with confidence 2/7
  • {I5}→{I1,I2} with confidence 2/2
• With minconf = 0.7, only the rules with confidence ≥ 0.7 are strong

Apriori algorithm: exercise
• minsup = 0.75
• Transactions:
  A, C, D, E
  A, B, C, D, E
  A, B, C, D

Improving the Efficiency of Apriori
• Major computational challenges
• Multiple scans of transaction database
• Huge number of candidates
• Tedious workload of support counting for candidates
• Improving Apriori: general ideas
  • Reduce the number of transaction database scans
  • Shrink the number of candidates
  • Facilitate support counting of candidates

Partition: Scan Database Only Twice
• Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
• Scan 1: partition the database and find local frequent patterns
• Scan 2: consolidate global frequent patterns
• A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. VLDB'95
DB = DB1 ∪ DB2 ∪ … ∪ DBk
If sup1(i) < σ|DB1|, sup2(i) < σ|DB2|, …, supk(i) < σ|DBk|, then sup(i) < σ|DB|

DHP: Reduce the Number of Candidates
• A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
• Candidates: a, b, c, d, e
• Hash entries (buckets): {ab, ad, ae}, {bd, be, de}, …, {yz, qs, wt}
• Frequent 1-itemsets: a, b, d, e
• ab is not a candidate 2-itemset if the count of the bucket {ab, ad, ae} is below the support threshold
• J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. SIGMOD'95

Sampling for Frequent Patterns
• Select a sample of the original database, and mine frequent patterns within the sample using Apriori
• Scan the database once to verify the frequent itemsets found in the sample; only the borders of the closure of the frequent patterns are checked
• Example: check abcd instead of ab, ac, …, etc.
• Scan the database again to find missed frequent patterns
• H. Toivonen. Sampling large databases for association rules. VLDB'96

DIC: Reduce the Number of Scans
• Once both A and D are determined frequent, the counting of AD begins
• Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins
[Figure: itemset lattice (AB, AC, BC, AD, BD, CD, …) and a timeline of transactions showing when the counting of 1-itemsets and 2-itemsets starts]
• S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. SIGMOD'97

FP-tree (1)
1. Scan the DB (similarly to Apriori) to find the frequent 1-itemsets
2. Sort the frequent items in frequency descending order (f-list)
3. Scan the DB again and construct the FP-tree
f-list: f, c, a, m, b, p

Transactional database (items re-ordered by the f-list, infrequent items dropped):

List of item IDs          Ordered items
f, a, c, d, g, i, m, p    f, c, a, m, p
a, b, c, f, l, m, o       f, c, a, b, m
b, f, h, j, o, w          f, b
b, c, k, s, p             c, b, p
a, f, c, e, l, p, m, n    f, c, a, m, p

FP-tree (2)
• Root: empty node {}
• For each transaction, order its items according to the f-list
• Each transaction is a path in the FP-tree
• Build a path for each transaction, increasing by one the counters on common sub-paths
[FP-tree figure: root with branches f:4 (c:3 (a:3 (m:2 (p:2), b:1 (m:1))), b:1) and c:1 (b:1 (p:1)), plus a header table with the support count of each item]

FP-tree: example (1)–(6)
[Figures: the FP-tree after inserting each of the five ordered transactions, growing from the single path f:1, c:1, a:1, m:1, p:1 to the final tree above]
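A runnable sketch of this two-scan construction (class and variable names are my own; the slides specify only the procedure):

    from collections import Counter

    class Node:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count, self.children = 0, {}

    def build_fptree(transactions, minsup):
        # pass 1: rank frequent items by descending frequency (the f-list;
        # ties are broken arbitrarily here)
        freq = Counter(i for t in transactions for i in t)
        rank = {i: r for r, (i, c) in enumerate(freq.most_common()) if c >= minsup}
        root = Node(None, None)
        # pass 2: insert each transaction as a path, sharing common prefixes
        for t in transactions:
            node = root
            for i in sorted((i for i in t if i in rank), key=rank.get):
                child = node.children.get(i)
                if child is None:
                    child = node.children[i] = Node(i, node)
                child.count += 1
                node = child
        return root

    # on the table above with minsup = 3, the root gets two branches: f:4 and c:1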
FP-tree: pattern base
• The problem of mining frequent patterns in databases is transformed into that of mining the FP-tree
• Extract the conditional pattern base of each item by visiting the tree

Item   Conditional pattern base
m      fca:2, fcab:1
b      fca:1, f:1, c:1
p      fcam:2, cb:1

Conditional FP-tree
• For each pattern base
  • Accumulate the count for each item in the base
  • Construct the FP-tree for the frequent items of the pattern base
  • Generate the frequent patterns
• Example for m: its pattern base is fca:2, fcab:1; the accumulated counts f:3, c:3, a:3 are frequent, while b:1 has a support count below the threshold and is dropped; the resulting conditional FP-tree is the single path f:3, c:3, a:3, which generates the patterns fm, fcm, fcam

FP-tree: exercise
• minsup = 0.75
• Transactions:
  B, D, C, A
  A, D, B, E

FP-growth
• FP-growth transforms the problem of finding long frequent patterns into recursively searching for shorter ones and then concatenating the suffix
• It uses the least frequent items as suffixes, offering good selectivity
• It reduces the search cost
• If the tree does not fit into main memory, partition the database
• Efficient and scalable for mining both long and short frequent patterns

Which patterns are interesting?
• Minimum support and confidence thresholds help to exclude uninteresting rules
• When mining with low support thresholds or long patterns, we may still end up with uninteresting rules
• Is a rule interesting?
  • Subjective measures: only users can ultimately say whether a rule is interesting
  • Objective measures: may help filter out uninteresting rules based on the statistics behind the data

Misleading association rules
• Consider an electronics shop and, in particular, transactions including computer games and videos
• Out of 10,000 transactions
  • 6,000 include games
  • 7,500 include videos
  • 4,000 include both games and videos
• The rule {games}→{videos} has support = 0.4 and confidence = 0.66
• It is considered strong if the thresholds are fixed at 0.3
• Misleading!! Computer games and videos are negatively associated: the purchase of one decreases the likelihood of purchasing the other

Correlation analysis
• The confidence of a rule A→B can be deceiving
  • It is an estimate of the conditional probability of itemset B given itemset A
  • It does not measure the real strength of the correlation/implication between A and B
• We need to adopt correlation analysis
• There exist different correlation measures
  • Lift
  • Chi-square
  • …

Lift
• The occurrence of A is independent of the occurrence of B if P(A ∪ B) = P(A)P(B); otherwise A and B are dependent and correlated as events
• lift(A,B) = P(A ∪ B) / (P(A) P(B))
• lift(A,B) < 1: A is negatively correlated with B (occurrence of A likely leads to absence of B)
• lift(A,B) > 1: A is positively correlated with B
(occurrence of A likely leads to occurrence of B)
• lift(A,B)=1: A and B are independent

Lift: example
• Out of 10,000 transactions
• 6,000 include games
• 7,500 include videos
• 4,000 include both games and videos
• Rule {games}→{videos}
  • support = 0.4
  • confidence = 0.66
• lift = P({games, videos}) / (P({games}) P({videos})) = 0.4 / (0.6 × 0.75) = 0.89 < 1
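A quick numeric check of this example (plain arithmetic, no library assumptions):

    def lift(p_ab, p_a, p_b):
        # lift(A,B) = P(A ∪ B) / (P(A) P(B))
        return p_ab / (p_a * p_b)

    print(lift(4000/10000, 6000/10000, 7500/10000))  # 0.888… < 1: negatively correlated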