Tools
Spark version: 2.2.1
Scala version: 2.11
Commands
I use the time command to record the execution time.
Small2.case1.txt:
time spark-submit --class FrequentItemsets Yang_Yueqin_SON.jar 1 Data/small2.csv 3
Small2.case2.txt:
time spark-submit --class FrequentItemsets Yang_Yueqin_SON.jar 2 Data/small2.csv 5
Beauty.case1-50.txt:
time spark-submit --class FrequentItemsets Yang_Yueqin_SON.jar 1 Data/beauty.csv 50
Beauty.case2-40.txt:
time spark-submit --class FrequentItemsets Yang_Yueqin_SON.jar 2 Data/beauty.csv 40
Books.case1-1200.txt:
time spark-submit --class FrequentItemsets Yang_Yueqin_SON.jar 1 Data/books.csv 1200
Books.case2-1500.txt:
time spark-submit --class FrequentItemsets Yang_Yueqin_SON.jar 2 Data/books.csv 1500
Run Time
File Name    Case Number    Support    Runtime (sec)
beauty.csv   1              50         484.52
beauty.csv   2              40         56.29
books.csv    1              1200       920.63
books.csv    2              1500       111.53
Approach
I use the SON algorithm as required, with the A-Priori algorithm to process each chunk. Within a chunk, I first use a HashMap to count each single item and filter out the frequent singletons. I then loop, building the frequent itemsets of size n + 1 from the frequent itemsets of size n. Because every frequent itemset of size n + 1 is the union of two frequent itemsets of size n, I generate the candidate set as follows:
for each pair (a, b) of frequent itemsets of length n:
    c = union of a and b
    if c has length n + 1:
        c is a candidate itemset
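The per-chunk pass above can be sketched in Scala as follows. This is a minimal illustration, not the submitted implementation: the names (apriori, baskets, support) and the choice of String item IDs are assumptions.

```scala
import scala.collection.mutable

// Sketch of one A-Priori pass over a chunk of baskets.
// baskets: each basket is the set of items bought together.
// support: minimum number of baskets an itemset must appear in.
def apriori(baskets: Seq[Set[String]], support: Int): Set[Set[String]] = {
  // Pass 1: count single items with a HashMap, keep the frequent singletons.
  val counts = mutable.HashMap.empty[String, Int].withDefaultValue(0)
  for (basket <- baskets; item <- basket) counts(item) += 1
  var frequent: Set[Set[String]] =
    counts.collect { case (item, c) if c >= support => Set(item) }.toSet

  var all = frequent
  var n = 1
  // Grow itemsets one level at a time until no new frequent sets appear.
  while (frequent.nonEmpty) {
    // Candidates: unions of two frequent n-sets that have exactly n + 1 items.
    val candidates = for {
      a <- frequent
      b <- frequent
      c = a union b
      if c.size == n + 1
    } yield c
    // Count each candidate's support and keep the frequent ones.
    frequent = candidates.filter(c => baskets.count(b => c.subsetOf(b)) >= support)
    all ++= frequent
    n += 1
  }
  all
}
```

In the full SON job, this pass runs on each chunk with a proportionally scaled support threshold, and the union of the per-chunk results forms the global candidate set verified in a second pass over the data.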