Chapter 1: Introduction
COMP9313: Big Data Management
Lecturer: Xin Cao
Course web site: http://www.cse.unsw.edu.au/~cs9313/
1.‹#›
1
Set Similarity Join on Hadoop
1.‹#›
3
Set-Similarity Join
Finding pairs of records with a similarity on their join attributes > t
1.‹#›
3
Application: Record linkage
Star
Keanu Reeves
Samuel Jackson
Schwarzenegger
…
Table R
Table S
Star
Keanu Reeves
Samuel L. Jackson
Schwarzenegger
…
1.‹#›
Two-step Solution
Step 2: Verification
Star
…
Table R
Table S
Star
…
Step 1:
Similarity Join
1.‹#›
Why Hadoop?
Large amounts of data
Data or processing does not fit in one machine
Assumptions:
Self join: R = S
Two similar sets share at least 1 token
Efficient Parallel Set-Similarity Joins Using Hadoop (SIGMOD’10)
1.‹#›
Map: <23, (a,b,c)> (a, 23), (b, 23), (c, 23)
7
A naïve solution
Too much data to transfer
Too many pairs to verify .
Reduce:(a,23),(a,29),(a,50), … Verify each pair (23, 29), (23, 50), (29, 50) … …
1.‹#›
7
Solving frequency skew: prefix filtering
prefix
r1
r2
8
Prefixes of similar sets should share tokens
Sort tokens by frequency (ascending)
Prefix of a set: least frequent tokens
Sorted by frequency
Chaudhuri, Ganti, Kaushik: A Primitive Operator for Similarity Joins in Data Cleaning. ICDE’06
1.‹#›
Prefix filtering: example
9
Each set has 5 tokens
“Similar”: they share at least 4 tokens
Prefix length: 2
Record 1
Record 2
1.‹#›
Stage 1: Order tokens by frequency
(Already done in the given example data)
Stage 2: Finding “similar” id pairs (verification)
Stage 3: remove duplicates
10
Hadoop Solution: Overview
1.‹#›
10
11
Stage 1: Sort tokens by frequency
Compute token frequencies
Sort them
MapReduce phase 1
MapReduce phase 2
1.‹#›
11
12
Stage 2: Find “similar” id pairs
Partition using prefixes
Verify similarity
1.‹#›
12
Compute the Length of Shared Tokens
Jaccard Similarity: sim(r, s) = |rs|/|rs|
If sim(r, s) >= τ, l = |rs| >= |rs| * τ >= max(|r|, |s|) * τ
Given a record r, you can compute the prefix length as p = |r| – l + 1
r and s is a candidate pair, they must share at least one token in the first (|r| – l + 1) tokens
Given a record r = (A, B, C, D) and p = 2, the mapper emits (A, r) and (B, r)
1.‹#›
Stage 3: Remove Duplicates
1.‹#›
More Optimization Strategies
It is your job!!!
The faster the better
1.‹#›
/docProps/thumbnail.jpeg