Optimization
1. Compute the token frequencies and sort the tokens of each record by frequency.
I first compute the frequencies with a map-reduce pass, then collect the result as a map
and broadcast it as the frequency table. Finally, I use mapPartitions to sort each record's
tokens according to the broadcast frequency table.
2. For each record, take the first recordLength - (recordLength * simThreshold) tokens and
emit a (token, record) pair for each of them. Then group the records by token key.
3. For each pair of records in the same token group, compute their similarity and keep
only those above the similarity threshold.
4. Remove duplicate pairs (the same pair can appear in several token groups) and sort the
results by record id pair.
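
The four steps above can be sketched in plain Python, without Spark, to show the data flow
on a small in-memory input. This is only an illustration under several assumptions that the
text does not state: the similarity measure is taken to be Jaccard, tokens are ordered
rarest first, the prefix length uses the usual prefix-filter rounding
(length - ceil(length * threshold) + 1), and a pair is kept when its similarity is at or
above the threshold. The names similarity_join and jaccard are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two token collections (assumed measure)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def similarity_join(records, sim_threshold):
    """records maps record id -> list of tokens.

    Returns a sorted list of (id1, id2, similarity) tuples.
    """
    # Step 1: global token frequencies, then order each record's tokens.
    # Assumption: rarest tokens first, ties broken by the token itself.
    freq = Counter(tok for toks in records.values() for tok in set(toks))
    ordered = {
        rid: sorted(set(toks), key=lambda t: (freq[t], t))
        for rid, toks in records.items()
    }

    # Step 2: group record ids under each of their prefix tokens.
    groups = defaultdict(list)
    for rid, toks in ordered.items():
        # Assumed rounding: the standard prefix-filter length.
        prefix_len = len(toks) - math.ceil(len(toks) * sim_threshold) + 1
        for tok in toks[:prefix_len]:
            groups[tok].append(rid)

    # Step 3: verify every candidate pair inside a token group.
    # Step 4: the dict keeps one entry per pair, so duplicates collapse.
    results = {}
    for rids in groups.values():
        for a, b in combinations(sorted(rids), 2):
            if (a, b) not in results:
                sim = jaccard(ordered[a], ordered[b])
                if sim >= sim_threshold:
                    results[(a, b)] = sim

    # Sort by the record id pair.
    return [(a, b, sim) for (a, b), sim in sorted(results.items())]
```

In a Spark version, the Counter would correspond to the broadcast frequency table, the
groups dictionary to the grouping by token key, and the results dictionary to a distinct
step followed by a sort.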