Homework 4
Spark version: 2.1.0
Python version: 2.7.10
• Task 1
How to run the program: the first command-line parameter is the file path
of the source code (ChiWei_Liu_LSH.py), the second is the file path of
ratings.csv, and the third is the output path and file name
(ChiWei_Liu_SimiliarMovies.txt).
For example:
• Task 2
The program uses 12 hash functions in total, split into 6 bands of 2 rows
each, to find similar items.
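To illustrate the banding step, here is a minimal sketch of how 12-value signatures could be split into 6 bands of 2 rows to produce candidate pairs. It is not the code in ChiWei_Liu_LSH.py: the function name `candidate_pairs`, the dictionary layout, and the sample signatures are all hypothetical, and the sketch runs on plain Python rather than Spark.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(signatures, bands=6, rows=2):
    """Return pairs whose signatures agree on every row of at least one band.

    signatures: dict mapping item id -> list of bands * rows hash values.
    (Hypothetical layout, chosen only for this illustration.)
    """
    candidates = set()
    for band in range(bands):
        start = band * rows
        buckets = defaultdict(list)
        for item, sig in signatures.items():
            # Items with an identical slice for this band share a bucket.
            buckets[tuple(sig[start:start + rows])].append(item)
        for items in buckets.values():
            for a, b in combinations(sorted(items), 2):
                candidates.add((a, b))
    return candidates


# Made-up signatures for three movies (12 hash values each).
signatures = {
    "m1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    "m2": [1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # matches m1 in the first band only
    "m3": [9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9],  # matches nobody in any band
}
pairs = candidate_pairs(signatures)
print(pairs)
```

A pair only becomes a candidate here; its actual similarity would still need to be checked against the original rating data.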
* Precision = tp / (tp + fp) = 1.0
* Recall = tp / (tp + fn) = 0.81
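The two formulas above can be sketched as follows, assuming the predicted pairs and the ground-truth pairs are both available as Python sets of tuples. The function name and the sample pairs are hypothetical and are not the homework data.

```python
def precision_recall(predicted, truth):
    """Compute precision and recall for two sets of item pairs."""
    tp = len(predicted & truth)   # pairs found and truly similar
    fp = len(predicted - truth)   # pairs found but not truly similar
    fn = len(truth - predicted)   # truly similar pairs that were missed
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up example: every predicted pair is correct, but half are missed.
predicted = {(1, 2), (3, 4)}
truth = {(1, 2), (3, 4), (5, 6), (7, 8)}
precision, recall = precision_recall(predicted, truth)
print(precision, recall)
```

A precision of 1.0 with a recall below 1.0, as reported above, means every returned pair was truly similar while some similar pairs were not returned.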
Screenshot of the result:
• Task 3
r = 2, b = 6
s      1 - (1 - s^r)^b
0.2    0.217
0.3    0.432
0.4    0.648
0.5    0.822
0.6    0.931
0.7    0.982
0.8    0.9978
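The table can be recomputed with a short sketch like the one below, which evaluates 1 - (1 - s^r)^b for r = 2 and b = 6. The function name `candidate_probability` is an assumption made for this illustration.

```python
def candidate_probability(s, r=2, b=6):
    """Probability that two items with similarity s share at least one band."""
    return 1.0 - (1.0 - s ** r) ** b


# Evaluate the formula for s = 0.2, 0.3, ..., 0.8.
table = [(i / 10.0, candidate_probability(i / 10.0)) for i in range(2, 9)]
for s, p in table:
    print("%.1f  %.4f" % (s, p))
```

The printed values may differ from the table in the last digit, depending on whether they are rounded or truncated.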
• Task 4
When choosing the threshold for finding similar pairs, we can observe from
the test data that a larger r gives a higher threshold, while a larger b
gives a lower threshold.
* Larger r => larger threshold
* Larger b => lower threshold
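Both trends can be checked against the commonly used approximation for the LSH threshold, t ≈ (1/b)^(1/r). The sketch below relies on that approximation rather than on the test data mentioned above, and the function name `approx_threshold` is hypothetical.

```python
def approx_threshold(r, b):
    """Approximate similarity threshold for b bands of r rows each."""
    return (1.0 / b) ** (1.0 / r)


base = approx_threshold(2, 6)         # the setting used in this homework
more_rows = approx_threshold(3, 6)    # larger r, same b
more_bands = approx_threshold(2, 12)  # larger b, same r

print("r=2, b=6 : %.3f" % base)
print("r=3, b=6 : %.3f" % more_rows)
print("r=2, b=12: %.3f" % more_bands)
```

Under this approximation, increasing r raises the threshold and increasing b lowers it, which matches the two observations listed above.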