程序代写代做代考 hadoop Chapter 1: Introduction

Chapter 1: Introduction

COMP9313: Big Data Management

Lecturer: Xin Cao
Course web site: http://www.cse.unsw.edu.au/~cs9313/

1.‹#›

Set Similarity Join on Hadoop

1.‹#›

3
Set-Similarity Join
Finding pairs of records with a similarity on their join attributes > t

1.‹#›

Application: Record linkage
Star
Keanu Reeves
Samuel Jackson
Schwarzenegger
…

Table R
Table S
Star
Keanu Reeves
Samuel L. Jackson
Schwarzenegger
…

1.‹#›

Two-step Solution

Step 2: Verification
Star

…

Table R
Table S
Star

…

Step 1:
Similarity Join

1.‹#›

Why Hadoop?
Large amounts of data
Data or processing does not fit in one machine

Assumptions:
Self join: R = S
Two similar sets share at least 1 token

Efficient Parallel Set-Similarity Joins Using Hadoop (SIGMOD’10)

1.‹#›

Map: <23, (a,b,c)>  (a, 23), (b, 23), (c, 23)
7
A naïve solution

Too much data to transfer 
Too many pairs to verify .
Reduce:(a,23),(a,29),(a,50), … Verify each pair (23, 29), (23, 50), (29, 50) … …

1.‹#›

Solving frequency skew: prefix filtering
prefix
r1
r2
8

Prefixes of similar sets should share tokens
Sort tokens by frequency (ascending)
Prefix of a set: least frequent tokens

Sorted by frequency

Chaudhuri, Ganti, Kaushik: A Primitive Operator for Similarity Joins in Data Cleaning. ICDE’06

1.‹#›

Prefix filtering: example
9

Each set has 5 tokens
“Similar”: they share at least 4 tokens
Prefix length: 2

Record 1
Record 2

1.‹#›

Stage 1: Order tokens by frequency
(Already done in the given example data)

Stage 2: Finding “similar” id pairs (verification)

Stage 3: remove duplicates
10
Hadoop Solution: Overview

1.‹#›

11
Stage 1: Sort tokens by frequency
Compute token frequencies

Sort them
MapReduce phase 1
MapReduce phase 2

1.‹#›

12
Stage 2: Find “similar” id pairs
Partition using prefixes

Verify similarity

1.‹#›

Compute the Length of Shared Tokens
Jaccard Similarity: sim(r, s) = |rs|/|rs|

If sim(r, s) >= τ, l = |rs| >= |rs| * τ >= max(|r|, |s|) * τ

Given a record r, you can compute the prefix length as p = |r| – l + 1

r and s is a candidate pair, they must share at least one token in the first (|r| – l + 1) tokens

Given a record r = (A, B, C, D) and p = 2, the mapper emits (A, r) and (B, r)

1.‹#›

Stage 3: Remove Duplicates

1.‹#›

More Optimization Strategies
It is your job!!!

The faster the better

1.‹#›

/docProps/thumbnail.jpeg

Related Posts