COMP3430 / COMP8430 Data wrangling
Lecture 17: Record pair classification (1) (Lecturer: )
Lecture outline
● Classifying record pairs
● Threshold-based classification ● Probabilistic classification
The record linkage process
Classifying record pairs (1)
● The comparison step generates one vector of similarities
(also known as weight vector) for each of compared record pair
● The elements of such vectors are the calculated similarities (exact or approximate)
● For example: (assuming
edit distance calculations)
Miller
23
Main
Street
P
Miller
4/23
Main
St
comparison:
1.0
0.0
1.0
0.0
1.0
0.0
0.0
Approximate comparison:
1.0
0.25
1.0
0.5
1.0
0.4
0.57
Classifying record pairs (2)
● Classifying record pairs can be based on (a) summing the calculated similarities into a single similarity values, or
(b) using the full vector of similarities
Miller
23
Main
Street
P
Miller
4/23
Main
St
:
Exact comparison:
1.0
0.0
1.0
0.0
1.0
0.0
0.0
3.0 / 7
Approximate comparison:
1.0
0.25
1.0
0.5
1.0
0.4
0.57
4.72 / 7
Example histogram of summed similarities
● Deduplicationofahealth
data set with different weights attached to different similarities, and where the true match status was determined using the commercial record linkage software AutoMatch.
(from Christen, 2012)
Threshold based classification (1)
● Is generally applied on summed similarities
● Can either use one or two similarity thresholds
– One threshold t: 0 ≤ t ≤ simmax, where simmax is equal to the
number of similarities in the vectors
(a) Record pairs with a similarity of at least t → Classified match
(b) Record pairs with a similarity below t → Classified non-match
– Two thresholds tl and tu: 0 ≤ tl < tu ≤ simmax
(a) Record pairs with a similarity of at least tu → Classified match
(b) Record pairs with a similarity below tl → Classified non-match (c) Record pairs with a similarity between tl and tu → Classified
potential match
Threshold based classification (2)
● If similarities are simply summed then each attribute
has the same importance (or same weight)
– Does having the same gender say as much about two records
being about the same person as having the same postcode?
● A weighted sum approach provides more weight to
attributes that contain more information
– Weights can be based on domain knowledge
– Or they can be calculated based on the number of unique
values in an attribute a:
wa = log(number of unique attribute values)
Threshold based classification (3)
● Total similarity is then a weighted sum:
sim(reci, recj ) = ∑a sim(reci[a], recj[a]) * wa,
where wa is the weight for attribute a
● To normalise this similarity into the 0..1 interval we can divide
sim(reci, recj ) by ∑a wa
● Further weight calculations take the frequencies of values
into account
– Two records with the common surname “Smith” are less likely to refer
to the same person compared to two records with the rare surname 'Dijkstra'
Probabilistic classification (1)
● Known as probabilistic record linkage
– Basic ideas were introduced by Newcombe and Kennedy in 1962 – Theoretical foundation by Fellegi and Sunter in 1969
● Basic idea:
– Compare common record attributes (or fields) using approximate
(string) comparison functions
– Calculate matching weights based on frequency ratios (global or value
specific ratios) and error estimates
– Sum of the matching weights is used to classify a pair of records as a
match, non-match, or potential match (using two thresholds)
● Problems: Estimating errors, find optimal thresholds, assumption
of independence, and manual clerical review
Probabilistic classification (2)
● A ratio R is calculated for each compared record pair r = (a,b) in the product space A × B:
R = P (γ ∈ Γ | r ∈ M ) / P (γ ∈ Γ | r ∈ U ),
where M and U are the sets of true matches and true non-matches, and γ (gamma) is an agreement pattern in the comparison space Γ (Gamma), with:
A×B={(a,b):a∈ A,b∈ B}forfiles(datasets)AandB M={(a,b):a=b,a∈ A,b∈ B} Truematches U={(a,b):a≠b,a∈ A,b∈ B} Truenon-matches
Probabilistic classification (3)
● Fellegi and Sunter proposed the following decision rule: R≥tu → r isclassifiedasamatch
tl < R < tu → r is classified as a potential match R ≤ tl → r is classified as a non-match
Probabilistic classification (4)
● Assuming conditional independence between attributes allows to calculate individual attribute-wise probabilities
mi =P([ai =bi ,a∈ A,b∈ B]|r∈ M)and
ui =P([ai =bi ,a∈ A,b∈ B]|r∈ U),
● where ai and bi are the values of attribute i being compared
● Based on these m- and u-probabilities, we calculate a matching weight wi for attribute i as:
wi = log2(mi / ui ) if ai = bi Agreement weight
wi = log2((1-mi ) / (1- ui )) if ai ≠ bi Dis-agreement weight
Weight calculation example
● Assume two data sets with a 3% error in attribute month of birth
● Probability that two matched records (representing the same person) have
the same month value is 97% (mi )
● Probability that two matched records do not have the same month value is
3% (1 - mi )
● Probability that two (randomly picked) un-matched records have the same
month value is 1/12 = 8.3% (ui )
● Probability that two un-matched records do not have the same month value is
11/12 = 91.7% (1 - ui )
● Agreement weight: log2(mi / ui ) = log2(0.97 / 0.083) = 3.54
● Disagreement weight log2((1-mi ) / (1- ui )) = log2(0.03 / 0.917) = -4.92