Data (record) linkage – Introduction
School of Computing and Information Systems
@University of Melbourne 2022
Copyright By PowCoder代写 加微信 powcoder
Data linkage: what is it?
• We collect information about entities such as people, products, images, songs, …
• Combining, grouping, matching electronic records of the same real- word entities.
• Two hospitals H1 and H2 wants to know if same patients visited both
hospitals.
Example from Data Matching book by OMP20008 Elements of Data Processing
Applications: security
Match data about people scheduled to fly to Australia by plane, with information across different databases, to identify high risk passengers before boarding. Databases with information such as
• Previous visits to Australia
• Previous visa applications/cancellations • Crime databases …
A famous senator Sen. told the Senate Judiciary Committee in 2004 that he had been stopped and interrogated on at least five occasions as he
attempted to board flights at several different airports. A Bush
administration official explained to the Washington Post that Kennedy
had been held up because the name “T. Kennedy” had become a
popular pseudonym among terror suspects. http://edition.cnn.com/2015/12/07/politics/no-fly-mistakes-cat-stevens-ted-kennedy-john-lewis/
COMP20008 Elements of Data Processing
Applications: business
• Two businesses collaborate with each other for a marketing campaign. Need a combined database of individuals to target
• Bob moves into a new home and wishes to be connected to electricity provider. For verification, provider matches the address Bob supplies against its “master” list of street addresses. Not always reliable!
• Online shopping comparison
• Is product X in Store A the same as product Y in Store B? • www.shopbot.com.au
COMP20008 Elements of Data Processing
Data linkage applications – cont.
COMP20008 Elements of Data Processing
Src: K. C4.0
Facebook – Likes, Interests
Rotten Tomato
COMP20008 Elements of Data Processing
Matching a database against itself
• Business wishes to carry out an advertising campaign. • Has a large database of customers
• The customer database changes over time, people move address, change their names.
• Duplicate records about individuals – business wishes to know if the same person appears more than once
• E.g. All the following are the same entity
• Dr James Bailey, Department of Computing, Kings College London,
• Dr James Bailey Department of Computer Science, The University of Melbourne,
• Dr James Bailey, Department of Computer Science and Software Engineering, The University of Melbourne,
• Professor James Bailey, Department of Computing and Information Systems, The University of Melbourne,
COMP20008 Elements of Data Processing
Centrelink: “Robo-debt” collection
Herald 10/4/17
• Data matching using Centrelink data and Tax office data
• System checks for “discrepancies” in income
• Example data matching issue
• Welfare recipient reports to Centrelink working for a company with its trading
name. Tax office records show a different registered company name.
• Failure to match between the two names triggered conclusion that some income was not being declared
• Automated notice …
COMP20008 Elements of Data Processing
Examples of problematic matching
• Income:
May’16: Maccas $7,000 June’16: Maccas $4,000
ATO Income:
2015-16: McDonald’s $11,000
Discrepancy detected – potential undeclared income Þ Automated process triggered -> letter to
COMP20008 Elements of Data Processing
Record linkage – terminology
Combine related/equivalent records across sources (sometimes within a single source)
Studied across communities –different terminology • Statistics: Record linkage [Dunn’46]
• Databases: Entity resolution, data integration, deduplication
• Natural Language Processing: coreference resolution, named-entity
recognition
…meaning and scope varies
COMP20008 Elements of Data Processing
Building a Search Engine’s “Movies Vertical”
Why: adding entity actions to entity cards
What: Need to link movie records from Bing and Netflix
How: Easy problem only if there’s an attribute with unique value per record
• If there’s a “key” then use “database join”
COMP20008 Elements of Data Processing
Not So Fast! The Linkage Challenge
Noisy values Missing attributes
Usually remove diacritics (why?) déjà à deja
“Unrelated works with same titles” http://www.imdb.com/list/ls000397020/
COMP20008 Elements of Data Processing
Concatenate “Title,Year”: Surely a Key?
“Unrelated works with same titles” http://www.imdb.com/list/ls000397020/
COMP20008 Elements of Data Processing
Concatenate “Title,Year”: Surely a Key?
http://www.imdb.com
COMP20008 Elements of Data Processing
Learning objectives
• Understanding what the record linkage problem is • Ability to outline where record linkage is applied
• Appreciation of why record linkage can be tricky
for use of lecture materials on movies example
COMP20008 Elements of Data Processing
From the time before, 2019
COMP20008 Elements of Data Processing
Data (record) linkage – Pipeline
School of Computing and Information Systems
@University of Melbourne 2022
Building a Search Engine’s “Movies Vertical”
What :Need to link movie records from Bing and Netflix
How: Easy problem only if there’s an attribute with unique value per record
• If there’s a “key” then use “database join”
COMP20008 Elements of Data Processing
Typical Pipeline
Prep Block Score Match Merge
Come Fly With Me
: Come Fly With Me
COMP20008 Elements of Data Processing
Typical Pipeline
Prep Block Score Match Merge
Clean records
come fly with me
peter kagan
michael buble
michael jordan come fly with me
michael jordan, jay thomas
COMP20008 Elements of Data Processing
Typical Pipeline
Prep Block
We’ll need to compare/”score” many pairs of records for similarity
If we compare source 1’s ! records against all ” of source 2’s: ” · ! work
One MSFT problem had 1b records Score Match Merge
come fly with me
michael jordan come fly with me
Represent complex records as simple values (blocks); Only score records with simple value (block) in common.
COMP20008 Elements of Data Processing
Typical Pipeline
come fly with me
michael jordan come fly with me
come fly jordan michael
Prep Block Score Match Merge
Represent complex records as simple values (blocks); Only score records with simple value (block) in common.
COMP20008 Elements of Data Processing
Each record from B allocated to one or more of the blocks
Each record from A allocated to one or more of the blocks
COMP20008 Elements of Data Processing
Within each block, compare the records from A against those from B and find those that match
If two records are not assigned to the same block, it means we believe they are not a match
COMP20008 Elements of Data Processing
Typical Pipeline
Title Year
come fly 2004 with me
jordan come 1989
fly with me
peter kagan
michael buble
michael jordan, jay thomas
Runtime …
63 … 42 …
Comparing two records : Asses their similarity
COMP20008 Elements of Data Processing
Scoring Examples: String Similarity
Prep Block Score Match Merge Revision
COMP20008 Elements of Data Processing
Scoring similarity
Combine the set of similarity scoresàfinal score
Title … come fly …
michael jordan come …
fly with me
( 0.82 , 15 , 0 , 0 , 21 )
f: Rdà[0,1]
Need a good f
Score record pairs for similarity
COMP20008 Elements of Data Processing
More on score combination
Idea 1: sum up feature scores
Idea 2: +similarities, -dissimilarities Idea 3: weighted sum
Idea 4: label examples of non/matching record pairs, train a classifier using machine learning
Will learn the weight
( 0.82 , 15 , 0 , 0 , 21 )
f: Rdà[0,1]
Prep Block Score Match Merge
COMP20008 Elements of Data Processing
Match ‘sufficiently’ similar records
pairs compared
sufficiently similar pairs
Prep Block Score Match Merge
Minimum threshold θ Match when final score > θ e.g.
threshold θ = 0.5
final score > 0.5
COMP20008 Elements of Data Processing
Merge matched records
matched pairs
Prep Block Score Match Merge
COMP20008 Elements of Data Processing
Summary – pipeline
Preprocessing Block
Score and Match Merge
COMP20008 Elements of Data Processing
Evaluation of record linkage results
False positives (fp): non-matching pairs that are incorrectly classified as matches. False negatives(fn): true matching pairs that are incorrectly classified as non-matches True positives (tp): true matching pairs that are classified as matches
True negative (tn): non-matching pairs that are classified as non-matches
• Precision: ##$⁄(##$ + #’$)
Proportion of pairs classified as matches that are true matches.
• Recall: ##$⁄(##$ + #’))
Proportion of true matching pairs that are classified as matches.
Acknowledgements
• Lecture slides are based on presentation materials created by
COMP20008 Elements of Data Processing
From the time before, 2019
COMP20008 Elements of Data Processing
Data (record) linkage – blocking challenges
School of Computing and Information Systems
@University of Melbourne 2022
Prep Block
We’ll need to compare/”score” many pairs of records for similarity
If we compare source 1’s ! records against all ” of source 2’s: ” · ! work
One MSFT problem had 1b records Score Match Merge
come fly with me
michael jordan come fly with me
Represent complex records as simple values (blocks); Only score records with simple value (block) in common.
COMP20008 Elements of Data Processing
Scalable record linkage with blocking
COMP20008 Elements of Data Processing
come fly with me
michael jordan come fly with me
come fly jordan michael
Prep Block Score Match Merge
Represent complex records as simple values (blocks); Only score records with simple value (block) in common.
COMP20008 Elements of Data Processing
Blocking example <``Ghost busters’’, 2016>
Blocking Examples: <“Ghost busters”,2016> What blocks to use and which block(s) contain above
Q: Block on release year being same?
Q: Block on same release year, and title word in common?
Q: Block on release year being same or ±1?
Prep Block Score Match Merge
COMP20008 Elements of Data Processing
Blocking example <``Ghost busters’’, 2016>
Blocking Examples: <“Ghost busters”,2016> What blocks to use and which block(s) contain above record?
Q: Block on release year being same? 2010 2011 2012
COMP20008 Elements of Data Processing
Blocking Examples: <“Ghost busters”,2016> What blocks to use and which block(s) contain above record?
Q: Block on same release year, and title word in common?
2010 2010 2010 come fly go
2016 2016 2016
2010 2010 Ghost busters
2016 Ghost
2016 busters
come fly go
COMP20008 Elements of Data Processing
Blocking example <``Ghost busters’’, 2016>
Blocking Examples: <“Ghost busters”,2016> What blocks to use and which block(s) contain above record?
Q: Block on release year being same or ±1? 2010 2011
COMP20008 Elements of Data Processing
Blocking example <``Ghost busters’’, 2016>
Blocking Examples: <“Ghost busters”,2016>
What blocks to use and which block(s) contain above record? Q: Block on release year being same or ±1?
2015 2016 2017
Is it correct to put the “Ghost busters”, 2016 movie in only two blocks: 2016 and 2017?
COMP20008 Elements of Data Processing
Common blocking methods for strings
• Token/word (‘come fly with me’) {come, fly, with, me}
• Phrases/n-words (‘come fly with me’, n=2)
{’# come’, ’come fly’, ’fly with’, ‘with me’, ‘me #’}
• N-grams (‘Jamie’, n=2)
• {#j, ja, am, mi, ie, e#}
• blocking-keys are the 2-grams; ‘jamie’ are in 6 blocks
• Prefix, suffix
• Pauline: P450, Paulina: P450,
• Chris: C620 • Kris: K620
COMP20008 Elements of Data Processing
Blocking methods– design
• Blocking methods should be efficient and capture good heuristics for matching.
• Represent complex records as simple values (blocks) • Hashing?
• What would be a bad blocking method?
• Trade-off between block sizes: missing true matches vs huge blocks • Can post-process to filter out large blocks
• Blocks can overlap but avoid largely redundant blocks
• Need to ensure that recall does not suffer
COMP20008 Elements of Data Processing
Measuring blocking method
Given the ground truth (matching record-pairs)
False positives (fp): non-matching pairs in the same block.
False negatives(fn): matching pairs never in the same block. True positives (tp): matching pairs in the same block.
True negative (tn): non-matching pairs never in the same block. Total number of pairs blocking: “×! ( == #'( + #’! + #*( + #*!)
Blocking measures:
• Pair-completeness(PC): #*(⁄(#*( + #’!)
• Reduction-ratio(RR): 1 − (#'( + #*()⁄(#'( + #’! + #*( + #*!)
COMP20008 Elements of Data Processing
From the time before, 2019
COMP20008 Elements of Data Processing
Data linkage and Privacy – I
School of Computing and Information Systems
@University of Melbourne 2022
Data Linkage and Privacy
Confidentiality and privacy concerns can arise if
• Matched data is being passed to another organisation or being made public
• Data matching is being conducted across databases from different organisations
COMP20008 Elements of Data Processing
Example 1: Need for privacy in public health
• Research team investigating effects of car accidents on the public health system. Research questions
• Most common injuries for what types of car accident?
• When and where accidents occurred, the road and weather conditions at time
of accident and health of people involved in accident, as well as two years later?
• Data needed
• Hospital data on patients
• Private health insurance data • Police
• Road traffic authorities
• These organisations can’t share all their data with the research team. COMP20008 Elements of Data Processing
Example 2: Need for privacy – national security
• National crime investigation unit analysing crimes of national significance (significance to all of Australia)
• Wants to link its own database about suspicious individuals to different databases around Australia
• Law enforcement
• Financial institutions
• Only linked records should be available to the unit
• It should not get access from the bank to financial data about non-
suspicious individuals
• It should not get access to tax records about non-suspicious individuals
COMP20008 Elements of Data Processing
Privacy Preserving Data Linkage: Problem Statement
• How can we perform data linkage for two databases, each from a different organisation
• Without revealing any information about individuals who do not get linked across the databases (i.e., individuals who occur in one database and not in the other)
• We will need methods for computing similarity of records, without revealing the record values
• Hashing: an important tool
COMP20008 Elements of Data Processing
Number representation – in brief
• We can represent numbers using different bases
• Decimal (base 10), binary (base 2), octal (base 8), hexadecimal (base 16)
• Example: The number 5332 (decimal: base 10) 5332=5″10! +3″10″ +3″10# +2″10$
Can be written as:
• 1010011010100 (binary: base 2)
• 12324(octal:base8) ! ” # $ • 14d4(hexadecimal:base16)=1″16 +4″16 +13″16 +4″16
• We’re not concerned with the details of how the conversions are done, just that it’s possible to do it
COMP20008 Elements of Data Processing
A hash function H maps a data item of arbitrary size to a data item of fixed size
• A hash function *%&’! • !!”#$ 32 = 2
• !!”#$ 20 = 2
• !!”#$ 6 = 0
• !!”#$ 7 = 1
• A hash function *(# • !%& )*+,- = 10 • !%& .*/, = 11
• !%& 01+ = 20
• COMP20008 Elements of Data Processing
Non-invertible (one way) hash function
Given the output H(X), extremely hard to reconstruct X. Examples
• MD5 hash function (produces a 32 digit hexadecimal number, equal to a string of 128 bits)
• H(James)= d52e32f3a96a64786814ae9b5279fbe5
• H(I love data wrangling)= 614416fa9d994aa8225ebd7c50f22060
• H(12345678)= 25d55ad283aa400af464c76d713c07ad
• SHA-3-512 hash function (produces a 128 digit hexadecimal number, equal to a string of 512 bits)
• H(James)=02c56351888fa73ff825ffd65526b264ebefe7916fa5d8d5c58e766bfdd1de8e85b68bf12599b9d 21eca6683d4abfa8616acfa6834e7c478e394374a7b015898
• H(12345678)=8a56bac869374c669443a1626ff0967af258123f83faf6b55e31dd541e6bbd90308a3385713 294bf2e8861bc8cf8f8feda41f9c4db19d5811a6b5de85eac9870
COMP20008 Elements of Data Processing
Hash encoding for exact matching: 2 party protocol
Each organisation
• Applies a (one way) hash function to the attribute used to join the databases
• Shares its hashed values with the other organisation. Each checks which ones match. These are the linked records.
COMP20008 Elements of Data Processing
Small changes in input, large change in output
• Disadvantage 1: What about single character differences in the original value? E.g. MD5 hash function
• H(James)= d52e32f3a96a64786814ae9b5279fbe5
• H(Jamex)= c3bfa7fa6ad2b987619bb4c932e65b4a
• Single character difference results in a completely different output. This is generally true for one way hash functions such as MD5, SHA ….
• Advantage?
COMP20008 Elements of Data Processing
Dictionary attack
• Disadvantage 2: An organisation could mount a dictionary attack to “invert” the hash function. E.g. Organisation A generates a hash dictionary by computing hashes for all words of length 4
• H(aaaa)=… • H(aaab)= … • H(aaac)= … • H(aaad)= … • …..
• H(zzzz)= …
• Organisation A then scans the hashed values received from Organisation B. Checks if any match its hash
dictionary. If yes, privacy is lost for those items.
• Could also generate dictionary for all known words, pairs of words, …. [up to some limit of feasibility]
COMP20008 Elements of Data Processing
Hash encoding for exact matching: 3 party protocol
Possible solution (to dictionary attacks?)
• Involve a trusted 3rd party (Organisation C)
• Organisations A and B send their hashed values to Organisation C, who then checks for matches.
• What if Organisation C is malicious?
• Organisation C could mount a dictionary attack and guess the hashed
• Solution: A and B perform “dictionary attack resistant” hashing
COMP20008 Elements of Data Processing
3rd Party Protocol using salt
• Organisations A and B concatenate a secret word to every name field in their data before hashing (known as a salt). Organisation C does not know what this word is and thus can’t perform a dictionary attack to “reverse” the hashed values it receives.
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com