COMP20008
Elements of data processing
Semester 1 2020
Lecture 7: Data (Record) Linkage
Today
• What is data linkage, when and why is it needed and by whom?
• What are some challenges?
• How to define similarity
• Scalability
• How to merge/group records
• Thanks to
• Dr Ben Rubinstein for use of lecture materials on movies example
Data linkage: what is it?
• We collect information about entities such as people, products, images, songs, …
• Combining, grouping, or matching electronic records of the same real-world entities.
• Example: two hospitals H1 and H2 want to know if the same patients visited both hospitals.
Example from Data Matching book by Christen
Applications: security
Match data about people scheduled to fly to Australia by plane, with information across different databases, to identify high risk passengers before boarding. Databases with information such as
• Previous visits to Australia
• Previous visa applications/cancellations
• Crime databases …
Senator Ted Kennedy told the Senate Judiciary Committee in 2004 that he had been stopped and interrogated on at least five occasions as he attempted to board flights at several different airports. A Bush administration official explained to the Washington Post that Kennedy had been held up because the name “T. Kennedy” had become a popular pseudonym among terror suspects.
http://edition.cnn.com/2015/12/07/politics/no-fly-mistakes-cat-stevens-ted-kennedy-john-lewis/
Applications: business
• Two businesses collaborate with each other for a marketing campaign. Need a combined database of individuals to target
• Bob moves into a new home and wishes to be connected to electricity provider. For verification, provider matches the address Bob supplies against its “master” list of street addresses. Not always reliable!
• Online shopping comparison
• Is product X in Store A the same as product Y in Store B?
• www.shopbot.com.au
[Image: movie entity card combining Facebook likes and Rotten Tomatoes ratings]
Data linkage applications – cont.
Src: K. Raymond CC4.0
Centrelink: “Robo-debt” collection
Sydney Morning Herald 10/4/17
• Data matching using Centrelink data and Tax office data
• System checks for “discrepancies” in income
• Example data matching issue
• Welfare recipient reports to Centrelink working for a company under its trading name; Tax office records show a different registered company name.
• Failure to match the two names triggered the conclusion that some income was not being declared
• Automated notice …
Examples of problematic matching
• Centrelink income: Jane Doe
May ’16: Maccas $7,000
June ’16: Maccas $4,000
• ATO income: Jane Doe
2015–16: McDonald’s $11,000
Discrepancy detected – potential undeclared income ⇒ automated process triggered → letter to Jane Doe
Record linkage – terminology
Combine related/equivalent records across sources (sometimes within a single source)
Studied across communities – different terminology
• Statistics: Record linkage [Dunn’46]
• Databases: Entity resolution, data integration, deduplication
• Natural Language Processing: coreference resolution, named-entity recognition
…meaning and scope varies
Example: Bing’s “Movies Vertical”
Bing Movies adding entity actions to entity cards
Need to link movie records from Bing and Netflix
Easy problem only if there’s an attribute with a unique value per record
• If there’s a “key” then use a “database join”
Not so fast! The linkage challenge
No keys
Noisy values
Missing attributes
Usually remove diacritics (why?) déjà → deja
Concatenate “Title, Year”: surely a key?
“Deja Vu, 2006”
“Deja Vu, 1997”
Still not unique: “Unrelated works with same titles” http://www.imdb.com/list/ls000397020/
http://www.imdb.com
Pre-processing
Pipeline: Prep → Block → Score → Match → Merge
Simple pre-processing to clean records
Title                            | Year | Directors   | Cast                       | Runtime | …
come fly with me                 | 2004 | peter kagan | michael buble              | 63      | …
michael jordan come fly with me  | 1989 |             | michael jordan, jay thomas | 42      | …
Pairwise comparison
• Need to compare two records to determine if they are to be linked
• Calculate similarity between a pair of records; if sufficiently similar, link the pair
• High complexity – we will need to compare (“score”) many pairs of records for similarity.
Complexity – pairwise comparison
• Two databases A (m records) and B (n records)
• How many comparisons? m × n
• One MSFT problem had 1 billion records
• Complexity O(n²)
• If m == n == 10⁹ records
• 10⁹ × 10⁹ comparisons
• It will take a long time to compute 10¹⁸ pairs
• Naïve whole-database all-pairs comparison doesn’t scale. Can we do better?
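A minimal sketch of the naive all-pairs approach, using small hypothetical record lists, makes the quadratic blow-up concrete:

```python
from itertools import product

# Hypothetical toy databases A (m records) and B (n records).
A = ["a%d" % i for i in range(100)]   # m = 100
B = ["b%d" % i for i in range(200)]   # n = 200

# Naive linkage scores every cross-database pair: m * n comparisons.
pairs = list(product(A, B))
print(len(pairs))  # 20000

# At 10**9 records per side this grows to 10**18 pairs -- infeasible.
```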
Blocking – scalable record linkage
• Maps complex records to simple values (blocks).
• The simple values are blocking keys.
• The same blocking method is applied to both A and B.
Database B: b1, b2, b3, b4, b5; database A: a1, a2, a3, a4, a5
Each record is allocated to one or more of the blocks, e.g.:
Block 1: {b1, b5} and {a4}
Block 2: {b1, b4} and {a1, a3}
Block 3: {b1, b2, b3} and {a2, a5}
Block 4: {b3} and {a2, a5}
Blocking – cont.
• Record pairs assigned to the same blocks are potential matches.
• Within-block pairwise comparisons only
• For example, b5 is only compared with a4, and no other records in A
Block 1 (2×1): b1–a4, b5–a4
Block 2 (2×2): b1–a1, b1–a3, b4–a1, b4–a3
Block 3 (3×2): b1–a2, b1–a5, b2–a2, b2–a5, b3–a2, b3–a5
Block 4 (1×2): b3–a2, b3–a5
Scalable record linkage with blocking
Blocking example
Represent complex records as simple values (blocks); only score records with a simple value (block) in common.
Title: come fly with me
Title: michael jordan come fly with me
Blocks (title tokens): come, fly, jordan, michael, …
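Token blocking can be sketched as an inverted index from blocking keys to records; the record IDs and titles below are hypothetical:

```python
from collections import defaultdict
from itertools import product

# Hypothetical record IDs and titles for two sources A and B.
A = {"a1": "come fly with me", "a2": "ghost busters"}
B = {"b1": "michael jordan come fly with me", "b2": "deja vu"}

def token_blocks(records):
    """Inverted index: blocking key (title token) -> records containing it."""
    blocks = defaultdict(set)
    for rid, title in records.items():
        for token in title.split():
            blocks[token].add(rid)
    return blocks

blocks_a, blocks_b = token_blocks(A), token_blocks(B)

# Candidate pairs: only records sharing at least one block are scored.
candidates = set()
for key in blocks_a.keys() & blocks_b.keys():
    candidates.update(product(blocks_a[key], blocks_b[key]))

print(candidates)  # {('a1', 'b1')}: 1 candidate instead of |A| x |B| = 4 pairs
```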
Blocking example
A movie record: <“ghost busters”, 2016>
• Blocking on one function, release year: results in one block
• 2016
• Blocking on one function, the concatenation of release year and one title word: results in two blocks
• 2016 ghost
• 2016 busters
• To catch movies whose release years differ by one, use three functions: release year, release year + 1, release year − 1; results in three blocks:
• 2015, 2016, and 2017
• Redundant blocks, why?
Common blocking methods for strings
• Token/word (‘come fly with me’) {come, fly, with, me}
• Phrases/n-words (‘come fly with me’, n=2)
{’_ come’, ’come fly’, ’fly with’, ‘with me’, ‘me _’}
• N-grams (‘Jamie’, n=2)
• {_j, ja, am, mi, ie, e_}
• The blocking keys are the 2-grams; ‘jamie’ is in 6 blocks
• Prefix, suffix
• soundex
• Pauline: P450, Paulina: P450
• Chris: C620
• Kris: K620
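A simplified Soundex sketch (the full algorithm has further edge cases) that reproduces the codes above:

```python
def soundex(name):
    """Simplified Soundex: first letter + up to three consonant digit codes."""
    codes = {}
    for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")]:
        for ch in letters:
            codes[ch] = digit

    name = name.upper()
    digits = []
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:   # skip vowels and repeated codes
            digits.append(digit)
        if ch not in "HW":            # H and W do not separate duplicate codes
            prev = digit
    return (name[0] + "".join(digits) + "000")[:4]  # pad/truncate to 4 chars

print(soundex("Pauline"), soundex("Chris"), soundex("Kris"))
# Pauline -> P450, Chris -> C620, Kris -> K620 (matches the slide)
```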
Blocking methods– design
• Blocking methods should be efficient, e.g. hashing
• What would be a bad blocking method?
• Trade-off between block sizes: true matches being missed vs computational efficiency
• Can filter out large blocks
• Blocks can overlap but avoid redundant blocks
• Need to ensure that recall does not suffer
Measuring blocking method
Given the ground truth (matching record-pairs)
False positives (fp): non-matching pairs in the same block.
False negatives (fn): matching pairs never in the same block.
True positives (tp): matching pairs in the same block.
True negatives (tn): non-matching pairs never in the same block.
Total number of pairs with no blocking: m × n ( == fp + fn + tp + tn)
Blocking measures:
• Pair completeness (PC): tp / (tp + fn)
• Reduction ratio (RR): 1 − (tp + fp) / (tp + fp + fn + tn)
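Both measures follow directly from the four counts; note that reduction ratio is conventionally one minus the fraction of pairs that survive blocking. The counts below are a hypothetical example:

```python
def blocking_quality(tp, fp, fn, tn):
    """Pair completeness and reduction ratio from blocking outcomes."""
    pc = tp / (tp + fn)                       # fraction of true matches kept
    rr = 1 - (tp + fp) / (tp + fp + fn + tn)  # fraction of comparisons avoided
    return pc, rr

# Hypothetical counts: 10,000 total pairs, blocking keeps 120 candidate pairs.
pc, rr = blocking_quality(tp=90, fp=30, fn=10, tn=9870)
print(pc, rr)  # 0.9 and roughly 0.988
```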
Assess pairwise similarity
Multiple scoring functions on different parts of the records
Comparing two records: assess their similarity

Attribute | Record 1         | Record 2                        | Score
Title     | come fly with me | michael jordan come fly with me | 0.82
Year      | 2004             | 1989                            | 15 (difference)
Directors | peter kagan      |                                 | 0
Cast      | michael buble    | michael jordan, jay thomas      | 0
Runtime   | 63               | 42                              | 21 (difference)
Scoring similarities – text
What is the similarity score for a text attribute between a pair of records?
Revision on text processing!
Exercise: set-based string similarity
• Compute Dice coefficient similarity between x and y using 2-grams.
• x = james
• y = jamie
• Sx = {_j, ja, am, me, es, s_}
• Sy = {_j, ja, am, mi, ie, e_}
• |Sx ∩ Sy| = 3, |Sx| = 6, |Sy| = 6
• sim_dice(Sx, Sy) = 2 × |Sx ∩ Sy| / (|Sx| + |Sy|) = 2 × 3 / (6 + 6) = 0.5
• Why 2-grams? What about 1-gram or 10-grams?
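The exercise can be sketched directly, padding each string with ‘_’ as in the example:

```python
def ngrams(s, n=2):
    """Character n-grams with start/end padding, as in the exercise."""
    padded = "_" + s + "_"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def dice(x, y, n=2):
    """Dice coefficient over character n-gram sets."""
    sx, sy = ngrams(x, n), ngrams(y, n)
    return 2 * len(sx & sy) / (len(sx) + len(sy))

print(dice("james", "jamie"))  # 2*3 / (6+6) = 0.5
```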
Exercise: direct string similarity
• Edit-distance similarity
Minimum number of character insertions, deletions, and substitutions to go from s1 to s2
• “valuation” → “revolution”? d(“valuation”, “revolution”) = 4
• sim(x, y) = 1 − d(x, y) / max(|x|, |y|) = 1 − 4/10 = 0.6
• Jaro-Winkler similarity
• Based on edit distance
• Favours matches at the start of the strings.
– – v a l u a t i o n
r e v o l u – t i o n
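The edit distance above can be computed with the standard dynamic-programming recurrence:

```python
def edit_distance(s1, s2):
    """Levenshtein distance via dynamic programming, row by row."""
    prev = list(range(len(s2) + 1))          # distance from "" to prefixes of s2
    for i, c1 in enumerate(s1, 1):
        curr = [i]                           # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, 1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1]

def edit_similarity(x, y):
    return 1 - edit_distance(x, y) / max(len(x), len(y))

print(edit_distance("valuation", "revolution"))    # 4
print(edit_similarity("valuation", "revolution"))  # 1 - 4/10 = 0.6
```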
Scoring similarity
Combine the set of similarity scores → final score
Attribute scores (0.82, 15, 0, 0, 21) for “come fly with me” vs “michael jordan come fly with me”
f: Rᵈ → [0, 1], e.g. f(0.82, 15, 0, 0, 21) = 0.1
Need a good f
Score record pairs for similarity
More on score combination
Idea 1: sum up feature scores
Idea 2: + similarities, − dissimilarities
Idea 3: weighted sum
Idea 4: label examples of non-matching/matching record pairs, train a classifier using machine learning – will learn the weights
Example: f(0.82, 15, 0, 0, 21) = 0.1, with f: Rᵈ → [0, 1]
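A weighted-sum combiner (Idea 3) can be sketched as below; the weights and the normalisation ranges are illustrative assumptions, not values from the slides:

```python
# f: R^d -> [0, 1] as a weighted average of per-attribute similarities.
def combine(scores, weights, max_diffs):
    """Normalise each attribute score to [0, 1], then take a weighted average."""
    total = 0.0
    for s, w, m in zip(scores, weights, max_diffs):
        # Entries with a max_diff are differences; convert them to similarities.
        sim = s if m is None else max(0.0, 1 - s / m)
        total += w * sim
    return total / sum(weights)

# (title_sim, year_diff, director_sim, cast_sim, runtime_diff)
score = combine([0.82, 15, 0, 0, 21],
                weights=[3, 1, 1, 1, 1],            # assumed: title matters most
                max_diffs=[None, 100, None, None, 200])  # assumed scales
print(round(score, 2))
```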
Match ‘sufficiently’ similar records
From the pairs compared, keep the sufficiently similar pairs.
Match when final score > θ, e.g. with threshold θ = 0.5, match when final score > 0.5
Merge matched records
• Merge/group the matched pairs into combined records
• Also needs to resolve conflicting attributes
• False positives and false negatives still exist
Evaluation of record linkage results
False positives (fp): # non-matching pairs that are incorrectly classified as matches.
False negatives (fn): # true matching pairs that are incorrectly classified as non-matches.
True positives (tp): # true matching pairs that are classified as matches.
True negatives (tn): # non-matching pairs that are classified as non-matches.
• Precision: 𝑡𝑝⁄(𝑡𝑝 + 𝑓𝑝)
Proportion of pairs classified as matches that are true matches.
• Recall: 𝑡𝑝⁄(𝑡𝑝 + 𝑓𝑛)
Proportion of true matching pairs that are classified as matches.
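The two measures follow directly from the counts; the numbers below are a hypothetical outcome:

```python
def precision_recall(tp, fp, fn):
    """Precision and recall of the linkage decisions."""
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical outcome: 80 true matches found, 20 false alarms, 20 misses.
p, r = precision_recall(tp=80, fp=20, fn=20)
print(p, r)  # 0.8 0.8
```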
Summary
✓ Understanding what the record linkage problem is
✓ Ability to outline where record linkage is applied and why
✓ Appreciation of why record linkage can be tricky
✓ Can describe basic approaches to record linkage, such as the methodology of blocking
✓ Can evaluate blocking methods and record linkage results.
Acknowledgements
• Lecture slides are based on presentation materials created by Ben Rubinstein