COMP9313 Project 3
Spatial and Textual Similarity Join (22 marks)
Background: Similarity join is an important task in data mining. In this project, you will do the spatial and textual similarity join. Each record R=(L, S) has two attributes: the spatial location in the 2D space (i.e., R.L is a coordinate (x, y)), and the textual description (i.e., R.S is a set of terms).
Copyright By PowCoder代写 加微信 powcoder
Given a set of records, the task is to find all pairs of records such that they are close to each other and have a high textual similarity. Formally, each result pair must satisfy the following two conditions:
• Their Euclidean distance is smaller than a given threshold d, that is:
(R1.L.x −R2.L.x)2+(R1.L.y−R2.L.y)2≤d(R1.L.x −R2.L.x)2+(R1.L.y−R2.L.y)2≤d
• Their Jaccard Similarity is larger than a given threshold s, that is:
∣R1.S∩R2.S∣∣R1.S∪R2.S∣≥s∣R1.S∪R2.S∣∣R1.S∩R2.S∣≥s
Given the following example (see the sample file), and set d = 2 and s=0.5,
we can get the result pairs below:
Output Format: The output file contains all the similar records together with their similarities. The output format is “(record1.id,record2.id):distance value, jaccard similarity value”. Each pair must have record1.id < record2.id. For the distance and similarity values, round the results to 6 decimal places (using the round() function). There should be no duplicates in the output results. The pairs are sorted in ascending order (by the first and then the second). Given the sample dataset above with the distance threshold 2 and similarity threshold 0.5, the output result should be:
(0,2):1.414214, 0.75
(2,3):1.0, 0.5
(2,4):1.0, 0.5
Code Format: The code template has been provided. Your code should take three parameters: the input file, the output folder, the distance threshold d and the similarity threshold s. You need to use the command below to run your code:
$ spark-submit project3.py input output d s
Submission
Deadline: Monday 18 November 11:59:59 PM
If you need an extension, please apply for a special consideration via “myUNSW” first. You can submit multiple times before the due date and we will only mark your final submission. To prove successful submission, please take a screenshot as the assignment submission instructions show and keep it to yourself. If you have any problems with submissions, please email
Late submission penalty
5% reduction of your marks for up to 5 days, submissions delayed for over 5 days will be rejected.
Some notes
• You need to design an exact approach to finding similar records (Please revisit Week 8 slides for more tips).
• Check the paper mentioned in slides if you want to know more details Efficient Parallel Set-Similarity Joins Using MapReduce. SIGMOD’10
• You cannot compute the pairwise similarities directly!!!
• Regular Python programming is not permitted in project3.
• When testing the correctness and efficiency of submissions, all the code will be run with two local threads using the default setting of Spark. Please be careful with your runtime and memory usage.
Marking Criteria
Your source code will be inspected and marked based on readability and ease of understanding. The efficiency and scalability of this project are very important and will be evaluated as well.
• Submission can be compiled and run on Spark => +6
• Accuracy (no unexpected pairs, no missing pairs, correct order, correct distance and similarity values, correct format) => +6
• Efficiency (rules are shown as follows) => +10
The rank of runtime on the largest test case using two local threads, and correct and incorrect submissions will be ranked separately:
— Correct results (e.g., top 10% => 10):
10−⌊rank percentage − 110⌋10−⌊10rank percentage − 1⌋
— Incorrect results:
0.4∗(10−⌊rank percentage − 110⌋)0.4∗(10−⌊10rank percentage − 1⌋)
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com