计算机代考程序代写 database Example: Bing’s “Movies Vertical”

Example: Bing’s “Movies Vertical”
adding
entity actions to entity cards
Need to link movie records from Bing and Netflix
Easy problem only if there’s an attribute with unique value per record
• If there’s a “key” then use “database join”

Not so fast! The linkage challenge
No keys
Noisy values Missing attributes
Usually remove diacritics (why?) déjà → deja
“Unrelated works with same titles” http://www.imdb.com/list/ls000397020/

Concatenate “Title, Year”: surely a key?
” ,2006″
” ,1997″
“Unrelated works with same titles” http://www.imdb.com/list/ls000397020/

Concatenate “Title, Year”: surely a key?
http://www.imdb.com

Pre-processing
Prep Block Score Match Merge
Simple pre-processing to clean records
Title
Year
Directors
Cast
Runtime

come fly with me
2004
peter kagan
michael buble
63

michael jordan come fly with me
1989
michael jordan, jay thomas
42

Pairwise comparison
• Need to compare two records to determine if they are to be linked
• Calculate similarity between a pair of records If similar:
link pair
• High complexity – we will need to compare/”score” many pairs of records for similarity.
Prep Block Score Match Merge

Complexity – pairwise comparison
• Two databases A (m records) and B (n records) • How many comparisons? 𝑚 × 𝑛
• One MSFT problem had 1b records
• Complexity 𝑂(𝑛2)
• If𝑚==𝑛==1𝑏 109 records
• 109 × 109 number of comparisons;
• It will take a long time to compute 1018 pairs
• Naïve whole-database all-pairs comparison doesn’t scale. Can we do better?

Blocking – scalable record linkage
• Maps complex records to simple values (blocks).
• The simple values are blocking-keys
• Same blocking methods applied to both A and B.
Each record is allocated to
one or more of the blocks
b1
b2 B b3
a1 a2 a3
A
b1 b4
a1 a3
b1 b2 b3
a2 a5
b3
a2 a5
b4 a4
b5 a5
Prep Block
Score
Match
Merge
b1 b5
a4

Blocking – cont.
• Record-pairs assigned to the same blocks are potential matches.
• Withinblockpairwise comparisons only
• For example, b5 is only compared with a4, and no other records in A
2×1
2×2
b1 b5
a4
b1 – a4 b5 – a4
b1 – a1 b1 – a3 b4 – a1 b4 – a3
b1 – a2 b1 – a5 b2 – a2 b2 – a5 b3 – a2 b3 – a5
b1 b4
a1 a3
b1 b2 b3
a2 a5
3×2
b3
a2 a5
1×2
b3 – a2 b3 – a5

Scalable record linkage with blocking

Blocking example
come fly jordan michael
Prep Block Score Match Merge
Represent complex records as simple values (blocks); Only score records with simple value (block) in common.
Title
come fly with me
michael jordan come fly with me


Blocking example
A movie record: <“ghost busters”, 2016>
• On one function: release year, result in one block
• 2016
• On one function: concatenated values of release year and one title-
word: results in two blocks • 2016 ghost
• 2016 busters
• Movies with a difference in release year by one; three functions: • release year,
• release year + 1,
• release year – 1; results in three blocks:
• 2015, 2016, and 2017
• Redundant blocks, why?

Common blocking methods for strings
• Token/word (‘come fly with me’) {come, fly, with, me}
• Phrases/n-words (‘come fly with me’, n=2)
{’_ come’, ’come fly’, ’fly with’, ‘with me’, ‘me _’}
• N-grams (‘Jamie’, n=2)
• {_j, ja, am, mi, ie, e_}
• blocking-keys are the 2-grams; ‘jamie’ are in 6 blocks
• Prefix, suffix
• soundex
• Pauline: P450, Paulina: P450, • Chris: C620
• Kris: K620

Blocking methods– design
• Blocking methods should be efficient, e.g. hashing • What would be a bad blocking method?
• Trade-off between block sizes: true matches being missed vs computational efficiency
• Can filter out large blocks
• Blocks can overlap but avoid redundant blocks • Need to ensure that recall does not suffer

Measuring blocking method
Given the ground truth (matching record-pairs)
False positives (fp): non-matching pairs in the same block. False negatives(fn): matching pairs never in the same block. True positives (tp): matching pairs in the same block.
True negative (tn): non-matching pairs never in the same block. Total number of pairs with no blocking: 𝑚 × 𝑛 ( == 𝑓𝑝 + 𝑓𝑛 + 𝑡𝑝 + 𝑡𝑛)
Blocking measures:
• Pair-completeness (PC): 𝑡𝑝Τ(𝑡𝑝 + 𝑓𝑛)
• Reduction-ratio (RR): 1 − (𝑓𝑝 + 𝑡𝑝)Τ(𝑓𝑝 + 𝑓𝑛 + 𝑡𝑝 + 𝑡𝑛)