## Tools
**Spark version**: 2.2.1
**Scala version**: 2.11
## Commands
I use the `time` command to record the execution time.
```bash
# Small2.case1.txt:
time spark-submit --class FrequentItemsets Yang_Yueqin_SON.jar 1 Data/small2.csv 3
# Small2.case2.txt:
time spark-submit --class FrequentItemsets Yang_Yueqin_SON.jar 2 Data/small2.csv 5
# Beauty.case1-50.txt:
time spark-submit --class FrequentItemsets Yang_Yueqin_SON.jar 1 Data/beauty.csv 50
# Beauty.case2-40.txt:
time spark-submit --class FrequentItemsets Yang_Yueqin_SON.jar 2 Data/beauty.csv 40
# Books.case1-1200.txt:
time spark-submit --class FrequentItemsets Yang_Yueqin_SON.jar 1 Data/books.csv 1200
# Books.case2-1500.txt:
time spark-submit --class FrequentItemsets Yang_Yueqin_SON.jar 2 Data/books.csv 1500
```
## Run Time
| File Name | Case Number | Support | Runtime (sec) |
| --------- | ----------- | ------- | ------------- |
| beauty.csv | 1 | 50 | 484.52 |
| beauty.csv | 2 | 40 | 56.29 |
| books.csv | 1 | 1200 | 920.63 |
| books.csv | 2 | 1500 | 111.53 |
## Approach
I use the SON algorithm as required, with the `A-Priori` algorithm to process each chunk (one Spark partition). In the first pass, each chunk finds its locally frequent itemsets using a proportionally lowered support threshold and emits them as candidates; in the second pass, every candidate is counted over the whole data set and kept only if it meets the global support threshold.
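A minimal sketch of this two-phase structure on Spark RDDs is shown below. The basket representation (`RDD[Set[String]]`), the threshold scaling, and the `aprioriOnChunk` parameter are assumptions for illustration, not the exact submission code.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object SONSketch {
  // Two-phase SON: local A-Priori per chunk, then a global candidate count.
  def sonFrequentItemsets(
      sc: SparkContext,
      baskets: RDD[Set[String]],
      support: Int,
      aprioriOnChunk: (List[Set[String]], Int) => List[Set[String]]): Array[Set[String]] = {

    val numChunks = baskets.getNumPartitions

    // Phase 1: each chunk runs A-Priori with a scaled-down threshold and
    // emits its locally frequent itemsets as global candidates.
    val candidates = baskets
      .mapPartitions { chunk =>
        val localThreshold = math.max(1, support / numChunks)
        aprioriOnChunk(chunk.toList, localThreshold).iterator
      }
      .distinct()
      .collect()

    // Phase 2: count every candidate over the full data set and keep those
    // that meet the real support threshold.
    val bc = sc.broadcast(candidates)
    baskets
      .flatMap(basket => bc.value.filter(_.subsetOf(basket)).map(c => (c, 1)))
      .reduceByKey(_ + _)
      .filter { case (_, count) => count >= support }
      .keys
      .collect()
  }
}
```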
Within each chunk, I first use a `HashMap` to count each single item and keep only the frequent singletons. Then I iterate, building the frequent itemsets of size $n+1$ from those of size $n$: every frequent $(n+1)$-itemset is the union of two frequent $n$-itemsets, since all of its subsets must also be frequent. Thus, I generate the candidate set as follows:
```
for each pair (a, b) of frequent itemsets of length n:
    c = union of a and b
    if c has length n + 1:
        c is a candidate itemset
```
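A minimal Scala sketch of this candidate-generation step, assuming itemsets are kept as `Set[String]`; the function name and the extra A-Priori subset-pruning check are illustrative additions, not the exact submission code:

```scala
// Build size-(n+1) candidates from the frequent size-n itemsets.
def genCandidates(frequent: Set[Set[String]], n: Int): Set[Set[String]] = {
  // Union every pair of frequent n-itemsets and keep unions of size n + 1.
  val unions = for {
    a <- frequent
    b <- frequent
    c = a ++ b
    if c.size == n + 1
  } yield c

  // A-Priori pruning: every n-element subset of a candidate must be frequent.
  unions.filter(c => c.subsets(n).forall(frequent.contains))
}
```

Each candidate is then counted against the chunk, and the ones that meet the support threshold become the frequent $(n+1)$-itemsets for the next round of the loop.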