Quiz 2: Frequent Itemsets
1) Here is a collection of 6 baskets. Each baskect, {i, j, k} contains three of the six items 1 through 6. {1, 4, 3} {2, 3, 5 } {2, 4, 5 } {4, 5, 6 }{1, 3, 5 } {2, 4, 6 } The support threshold is 2. The hash function is (i+j) mod 6. Using the PCY Algorithm, you need to show 1. frequent buckets and 2. frequent pairs. (1 pt)
(Answer) Pass1
single item count
Hashing for pairs
item
Count
Bucket Num (i+j) mod 6
Pairs
Count
1
2
0
{2,4},{1,5},{2,4}
3
2
3
1
{3,4}, {2,5},{2,5}
3
3
3
2
{3,5},{3,5},{2,6}
3
4
4
3
{4,5},{4,5}
2
5
4
4
{1,3},{4,6},{1,3},{4,6}
4
6
2
5
{1,4},{2,3},{5,6}
3
All single items and all buckets 0,1,2,3,4,5 are frequent.
Pass2
Count candidate pairs (all pairs are candidates because all single items are frequent, and all buckets are frequent)
Frequent pairs: {2,4}, {2,5}, {3,5}, {4,5}, {1,3}, {4,6}
2) Consider the following basket data and a support threshold s = 2, answer the following questions.
Find all frequent itemsets with set size <= 3 (1pt) Write down two association rules and
their confidence and interest numbers. One of your association rule should be derived from a frequent pair (i.e., X->Y), and the other one should be derived from a frequent triplet (i.e., X, Y->Z) (1pt)
(Answer)
Frequent single: m, c, b, p, j
Frequent pair: (m,c), (m,b), (m,p), (m,j), (c,b), (c,j), (b,j) Frequent triplet: (m,b,c), (c,b,j)
Association rules, confidence and interest
m ->c
m, c -> b
1) confidence = 3/5 = 0.6
2) interest = 3/5 – 6/8 = -3/20 = -0.15
1) confidence = 3/3 = 1
2) interest = 1 – 6/8 = 1/4 = 0.25
c ->m
m, b -> j
1) confidence = 3/6 = 0.5
2) interest = 3/6 – 5/8 = -1/8 = -0.125
1) confidence = 1/4 = 0.25
2) interest = 1/4 – 4/8 = -1/4 = -0.25
b -> c
b, c -> j
1) confidence = 5/6 = 0.83
2) interest = 5/6 – 6/8 = 1/12 = 0.083
1) confidence = 2/5 = 0.4
2) interest = 2/5 – 4/8 = -1/10 = -0.1
c -> b
m, b -> c
1) confidence = 5/6 = 0.83
2) interest = 5/6 – 6/8 = 1/12 = 0.083
1) confidence = 3/4 = 0.75 2) interest = 3/4 – 6/8 = 0
m -> b
c, j -> b
1) confidence = 4/5 = 0.8
2) interest = 4/5 – 6/8 = 1/20 = 0.05
1) confidence = 2/3 = 0.67
2) interest = 2/3 – 6/8 = -1/12 = -0.083
b -> m
b, j -> c
1) confidence = 4/6 = 0.67
2) interest = 4/6 – 5/8 = 1/24 = 0.042
1) confidence = 2/2 = 1
2) interest = 2/2 – 6/8 = 1/4 = 0.25
c -> j
b, c -> m
1) confidence = 3/6 = 0.5 2) interest = 3/6 – 4/8 = 0
1) confidence = 3/5 = 0.6
2) interest = 3/5 – 5/8 = -1/40 = -0.025
m -> p
1) confidence = 2/5 = 0.4
2) interest = 2/5 – 2/8 = 3/20 = 0.15
3) In the SON algorithm, please write down the input/output for the Map/Reduce functions in both phases if we use a two-phase Map/Reduce (2pts total, 1pt for each phase).
(Answer)
Phase 1 Map Input: a chunk/subset of all baskets (fraction p of the total input file) Phase 1 Map Output: set of pairs (F, 1) // F: frequent itemset from sample
Phase 1 Reduce Input: (F, 1’s)
Phase 1 Reduce Output: candidate itemsets (F)
Phase 2 Map Input: output from first Reduce task (candidate itemsets) and a chunk of the total input data file
Phase 2 Map Output: set of key-value pairs (C, v)
// C: candidate frequent itemset, v: support of C in the input chunk
Phase 2 Reduce Input: (C, list of v)
Phase 2 Reduce Output: if sum(list of v) >= support threshold s, emit (C, sum(list of v))
4) Considering the Toivonen’s algorithm, give one example of a singleton and one example of a pair in the negative border. You need to explain why your examples are considered as itemsets in the negative border (1pt). Explain how and why the Toivonen’s algorithm uses the itemsets in the negative border (you can use your examples) (1pt).
(Answer)
Singleton in the negative border: {A} iff {A} is not frequent in the sample because {A}’s
immediate subset is empty set which is always frequent.
Pair in the negative border: {B, C} iff {B, C} is not frequent in the sample and all immediate subsets, {B}, {C} are frequent in the sample.
(how: 0.5)
Toivonene’s algorithm
Pass 1: find candidate frequent itemsets from sample and construct the negative border
Pass 2: count candidate frequent itemsets and itemsets on the negative border over the whole file. No false positives by pass 2.
Case1: If no itemset from the negative border is not frequent in pass 2, all frequent itemsets are found correctly. The result of pass 2 is the right answer. No false negative.
Case2: If an itemset from the negative border is frequent in pass 2, the algorithm must be repeated with new random sample.
(why: 0.5)
In case2, there could be other frequent itemsets in the whole file neither of which are not in the frequent itemsets from the sample and on the negative border.
5) (3pts) The figure below shows you the memory footprint of the A Priori algorithm for counting frequent pairs. Please draw the memory footprint for counting frequent pairs in (a) the PCY algorithm, (b) the Multistage algorithm, and (c) the Multihash algorithm. You need to clearly label each block in your answer like in the example. (1pt each)
(Answer)
(a) the PCY algorithm
(b) the Multistage algorithm
(c) the Multihash algorithm