Yang_Yueqin_Description
Tools
Spark version: 2.2.1
Scala version: 2.11
Commands
I use the time command to record the execution time.
Run Time
Small2.case1.txt:
time spark-submit --class FrequentItemsets Yang_Yueqin_SON.jar 1 Data/small2.csv 3
Small2.case2.txt:
time spark-submit --class FrequentItemsets Yang_Yueqin_SON.jar 2 Data/small2.csv 5
Beauty.case1-50.txt:
time spark-submit --class FrequentItemsets Yang_Yueqin_SON.jar 1 Data/beauty.csv 50
Beauty.case2-40.txt:
time spark-submit --class FrequentItemsets Yang_Yueqin_SON.jar 2 Data/beauty.csv 40
Books.case1-1200.txt:
time spark-submit --class FrequentItemsets Yang_Yueqin_SON.jar 1 Data/books.csv 1200
Books.case2-1500.txt:
time spark-submit --class FrequentItemsets Yang_Yueqin_SON.jar 2 Data/books.csv 1500
File Name    Case Number    Support    Runtime (sec)
beauty.csv   1              50         484.52
beauty.csv   2              40         56.29
books.csv    1              1200       920.63
books.csv    2              1500       111.53
Approach
I use the SON algorithm as required, with the A-Priori algorithm to process each chunk. I first use a
HashMap to compute the count of each single item, and filter out the frequent singletons.
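This first pass can be sketched as follows in Scala. This is a minimal sketch of the idea, not the exact submitted code; the names `SingletonCount` and `frequentSingletons` are mine, and I assume each basket is a sequence of item strings.

```scala
import scala.collection.mutable

object SingletonCount {
  // Count every single item in a chunk of baskets with a mutable HashMap,
  // then keep only the items that meet the support threshold.
  def frequentSingletons(baskets: Seq[Seq[String]], support: Int): Set[String] = {
    val counts = mutable.HashMap.empty[String, Int]
    for (basket <- baskets; item <- basket)
      counts(item) = counts.getOrElse(item, 0) + 1
    counts.collect { case (item, c) if c >= support => item }.toSet
  }
}
```

For example, with baskets `Seq(Seq("a", "b"), Seq("a", "c"), Seq("a"))` and support 2, only `"a"` survives the filter.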
Then I loop to build the frequent (n+1)-itemsets from the frequent n-itemsets, because every frequent
(n+1)-itemset is the union of two frequent n-itemsets. Thus, I generate the candidate set as follows:
for each pair a, b of frequent itemsets with length n:
    c = union of a and b
    if c has length n + 1:
        c is a candidate itemset
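The candidate-generation step above can be sketched in Scala. This is an illustrative sketch rather than the submitted implementation; the name `CandidateGen` is mine, and I represent an itemset as a `Set[String]`.

```scala
object CandidateGen {
  // Union every pair of frequent n-itemsets and keep only the unions
  // that have exactly n + 1 items, as in the pseudocode above.
  def candidates(frequent: Set[Set[String]], n: Int): Set[Set[String]] =
    for {
      a <- frequent
      b <- frequent
      c = a union b
      if c.size == n + 1
    } yield c
}
```

For example, from the frequent pairs `{a,b}`, `{a,c}`, `{b,c}` (n = 2), the only candidate triple is `{a,b,c}`.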