Instructions
1. You are required to complete your implementation in the
file provided along with this notebook.
2. You are not allowed to print unnecessary output. We will not consider any output
printed to the screen; all results should be returned in appropriate data
structures via the corresponding functions.
3. You are required to submit the following files:
• (i) submission.py (your code),
• (ii) report.pdf (illustrating your implementation details).
4. You are allowed to add other functions and/or import modules (you may have to for this project), but you are not allowed to define global variables. All the functions should be implemented in submission.py.
5. In this project, you may need to CREATE YOUR OWN TEST CASES in order to evaluate both the correctness and the efficiency of your
implementation. DO NOT COMPLETELY RELY ON THE TOY EXAMPLE IN THE SPEC!
• In order to create your own test cases, you are expected to use real datasets or randomly generated data, and to generate the hash functions yourself (one way to do this is sketched below).
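For example, a random test case could be generated along the following lines. This is only a sketch: the sizes and value ranges are arbitrary assumptions, and sc is a SparkContext such as the one created by createSC() in the execution example at the end of this spec.

import random

random.seed(42)
n, m = 10000, 20  # number of data points / hashes per point (assumed)
data = [[random.randint(0, 50) for _ in range(m)] for _ in range(n)]
query_hashes = [random.randint(0, 50) for _ in range(m)]

data_hashes = sc.parallelize([(i, h) for i, h in enumerate(data)])
# res = submission.c2lsh(data_hashes, query_hashes, alpha_m=10, beta_n=10)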
6. The testing environment is the same as that of Lab1. Note: importing other modules (not part of the Lab1 test environment) may lead to errors, which will result
in a ZERO score for the ENTIRE Project.
Task 1: C2LSH (90 points)
In this question, you will implement the C2LSH algorithm in PySpark. Specifically, you are required to write a method c2lsh() in the file submission.py that takes the following four arguments as input:
1. data_hashes: an RDD in which each element (i.e., key-value pair) corresponds to (id, data_hash), where id is an integer and data_hash is a Python list containing m integers (i.e., the hash values of the data point); see the illustration after this list.
2. query_hashes: a Python list containing m integers (i.e., the hash values of the query).
3. alpha_m: an integer indicating the minimum number of colliding hash values between a data point and the query (i.e., αm).
4. beta_n: an integer indicating the minimum number of candidates to be returned (i.e., βn).
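To make these formats concrete, here is a hypothetical illustration. The helper name count_collisions and the abs(d - q) <= offset collision rule are our assumptions for exposition; check them against the lecture notes.

# One element of data_hashes: (42, [3, 7, 1, 9])  -> (id, m hash values)
# query_hashes:               [2, 7, 0, 9]        -> m hash values

def count_collisions(data_hash, query_hashes, offset=0):
    # Number of positions where the two hash vectors collide, i.e.,
    # differ by at most `offset` (offset=0 means exact match).
    return sum(1 for d, q in zip(data_hash, query_hashes)
               if abs(d - q) <= offset)

# count_collisions([3, 7, 1, 9], [2, 7, 0, 9]) returns 2 (positions 1 and 3)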
Note:
1. You don’t need to implement hash functions or generate hashed data; we will provide the data hashes for you.
2. Please follow the description of the algorithm provided in the lecture notes, which is slightly different from the original C2LSH paper.
3. One of the main purposes of this project is to use Spark to solve the problems. It is therefore meaningless to circumvent PySpark and do it in other ways (e.g., collect the data and implement the algorithm without transformations). Any such attempt will be considered an invalid implementation and will hence be
assigned a ZERO score. For example, you are not allowed to use the following PySpark functions (a small illustration follows the list):
• aggregate, treeAggregate, aggregateByKey
• collect, collectAsMap
• countByKey, countByValue
• foreach, foreachPartition
• reduce, treeReduce, reduceByKeyLocally
• saveAs* (e.g. saveAsTextFile)
• take* (e.g. take, takeOrdered)
• top
• fold
• toLocalIterator
• lookup
• checkpoint, localCheckpoint
• getOrCreate
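As a small illustration (cand_rdd is a hypothetical RDD of candidate pairs): count() does not appear in the list above, so it can serve as the driver-side action for checking the number of candidates, whereas collect() cannot.

# Allowed: count() is an action, but it is not among the forbidden calls
n_candidates = cand_rdd.count()

# NOT allowed: materializing the RDD on the driver
# n_candidates = len(cand_rdd.collect())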
Return Format
The c2lsh() method returns an RDD containing a sequence of candidate ids. Notice: the order of the elements in the list does not matter (e.g., we will collect the
elements and evaluate them as a set).
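Putting this contract together, a minimal structural sketch of c2lsh() might look as follows. The abs(d - q) <= offset collision rule and the offset-widening loop are our assumptions (follow the lecture-note version of the algorithm), and recomputing all collision counts at every offset is deliberately left unoptimized.

def c2lsh(data_hashes, query_hashes, alpha_m, beta_n):
    offset = 0
    while True:
        # Keep the (id, data_hash) pairs that collide with the query in
        # at least alpha_m positions at the current offset.
        cand = data_hashes.filter(
            lambda kv: sum(1 for d, q in zip(kv[1], query_hashes)
                           if abs(d - q) <= offset) >= alpha_m)
        # count() is not on the forbidden list; stop once there are
        # at least beta_n candidates.
        if cand.count() >= beta_n:
            return cand.keys()  # an RDD of candidate ids
        offset += 1

Avoiding the full recomputation at every offset is one natural efficiency improvement to discuss in the report.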
Evaluation
Your implementation will be tested using 3 different test cases. We will be evaluating based on the following factors:
• the correctness of the implemented c2lsh(). The output will be compared with the result from a correct implementation; any difference will be considered incorrect.
• the efficiency of your implementation. We will measure the running time of c2lsh() on each test case (denoted as T).
For each test case (worth 30 points), the following marking criteria will be used:
• Case 1, 0 points: the returned rdd is incorrect, or T > T1.
• Case 2, 10 points: the returned rdd is correct, and T1 ≥ T > T2.
• Case 3, 20 points: the returned rdd is correct, and T2 ≥ T > T3.
• Case 4, 30 points: the returned rdd is correct, and T3 ≥ T.
Here T1 > T2 > T3 are thresholds that depend on the testing environment and the test cases.
Task 2: Report (10 points)
You are also required to submit your project report, named report.pdf. Specifically, in the report you are expected to answer at least the following questions:
1. Implementation details of your c2lsh(). Explain how your major transform function works.
2. Show the evaluation result of your implementation using your own test cases.
3. What did you do to improve the efficiency of your implementation?
How to execute your implementation (EXAMPLE)

from pyspark import SparkContext, SparkConf
from time import time
import pickle
import submission

def createSC():
    conf = SparkConf()
    conf.setMaster("local[*]")
    conf.setAppName("C2LSH")
    sc = SparkContext(conf=conf)
    return sc

with open("toy/toy_hashed_data", "rb") as file:
    data = pickle.load(file)
with open("toy/toy_hashed_query", "rb") as file:
    query_hashes = pickle.load(file)

alpha_m = 10
beta_n = 10

sc = createSC()
data_hashes = sc.parallelize([(index, x) for index, x in enumerate(data)])

start_time = time()
res = submission.c2lsh(data_hashes, query_hashes, alpha_m, beta_n).collect()
end_time = time()

sc.stop()

# print('running time:', end_time - start_time)
print('Number of candidate: ', len(res))
print('set of candidate: ', set(res))
Number of candidate: 10
set of candidate: {0, 70, 40, 10, 80, 50, 20, 90, 60, 30}
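For a quick sanity check against the toy example, the returned candidates can be compared with the expected set shown above:

expected = {0, 10, 20, 30, 40, 50, 60, 70, 80, 90}
assert set(res) == expected, 'toy example: unexpected candidate set'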