程序代写CS代考 AI COMP3430 / COMP8430 Data wrangling

COMP3430 / COMP8430 Data wrangling
Lecture 17: Record pair classification (1) (Lecturer: )

Lecture outline
● Classifying record pairs
● Threshold-based classification ● Probabilistic classification

The record linkage process

Classifying record pairs (1)
● The comparison step generates one vector of similarities
(also known as weight vector) for each of compared record pair
● The elements of such vectors are the calculated similarities (exact or approximate)
● For example: (assuming
edit distance calculations)

Miller
23
Main
Street

P
Miller
4/23
Main
St
comparison:
1.0
0.0
1.0
0.0
1.0
0.0
0.0
Approximate comparison:
1.0
0.25
1.0
0.5
1.0
0.4
0.57

Classifying record pairs (2)
● Classifying record pairs can be based on (a) summing the calculated similarities into a single similarity values, or
(b) using the full vector of similarities

Miller
23
Main
Street

P
Miller
4/23
Main
St
:
Exact comparison:
1.0
0.0
1.0
0.0
1.0
0.0
0.0
3.0 / 7
Approximate comparison:
1.0
0.25
1.0
0.5
1.0
0.4
0.57
4.72 / 7

Example histogram of summed similarities
● Deduplicationofahealth
data set with different weights attached to different similarities, and where the true match status was determined using the commercial record linkage software AutoMatch.
(from Christen, 2012)

Threshold based classification (1)
● Is generally applied on summed similarities
● Can either use one or two similarity thresholds
– One threshold t: 0 ≤ t ≤ simmax, where simmax is equal to the
number of similarities in the vectors
(a) Record pairs with a similarity of at least t → Classified match
(b) Record pairs with a similarity below t → Classified non-match
– Two thresholds tl and tu: 0 ≤ tl < tu ≤ simmax (a) Record pairs with a similarity of at least tu → Classified match (b) Record pairs with a similarity below tl → Classified non-match (c) Record pairs with a similarity between tl and tu → Classified potential match Threshold based classification (2) ● If similarities are simply summed then each attribute has the same importance (or same weight) – Does having the same gender say as much about two records being about the same person as having the same postcode? ● A weighted sum approach provides more weight to attributes that contain more information – Weights can be based on domain knowledge – Or they can be calculated based on the number of unique values in an attribute a: wa = log(number of unique attribute values) Threshold based classification (3) ● Total similarity is then a weighted sum: sim(reci, recj ) = ∑a sim(reci[a], recj[a]) * wa, where wa is the weight for attribute a ● To normalise this similarity into the 0..1 interval we can divide sim(reci, recj ) by ∑a wa ● Further weight calculations take the frequencies of values into account – Two records with the common surname “Smith” are less likely to refer to the same person compared to two records with the rare surname 'Dijkstra' Probabilistic classification (1) ● Known as probabilistic record linkage – Basic ideas were introduced by Newcombe and Kennedy in 1962 – Theoretical foundation by Fellegi and Sunter in 1969 ● Basic idea: – Compare common record attributes (or fields) using approximate (string) comparison functions – Calculate matching weights based on frequency ratios (global or value specific ratios) and error estimates – Sum of the matching weights is used to classify a pair of records as a match, non-match, or potential match (using two thresholds) ● Problems: Estimating errors, find optimal thresholds, assumption of independence, and manual clerical review Probabilistic classification (2) ● A ratio R is calculated for each compared record pair r = (a,b) in the product space A × B: R = P (γ ∈ Γ | r ∈ M ) / P (γ ∈ Γ | r ∈ U ), where M and U are the sets of true matches and true non-matches, and γ (gamma) is an agreement pattern in the comparison space Γ (Gamma), with: A×B={(a,b):a∈ A,b∈ B}forfiles(datasets)AandB M={(a,b):a=b,a∈ A,b∈ B} Truematches U={(a,b):a≠b,a∈ A,b∈ B} Truenon-matches Probabilistic classification (3) ● Fellegi and Sunter proposed the following decision rule: R≥tu → r isclassifiedasamatch tl < R < tu → r is classified as a potential match R ≤ tl → r is classified as a non-match Probabilistic classification (4) ● Assuming conditional independence between attributes allows to calculate individual attribute-wise probabilities mi =P([ai =bi ,a∈ A,b∈ B]|r∈ M)and ui =P([ai =bi ,a∈ A,b∈ B]|r∈ U), ● where ai and bi are the values of attribute i being compared ● Based on these m- and u-probabilities, we calculate a matching weight wi for attribute i as: wi = log2(mi / ui ) if ai = bi Agreement weight wi = log2((1-mi ) / (1- ui )) if ai ≠ bi Dis-agreement weight Weight calculation example ● Assume two data sets with a 3% error in attribute month of birth ● Probability that two matched records (representing the same person) have the same month value is 97% (mi ) ● Probability that two matched records do not have the same month value is 3% (1 - mi ) ● Probability that two (randomly picked) un-matched records have the same month value is 1/12 = 8.3% (ui ) ● Probability that two un-matched records do not have the same month value is 11/12 = 91.7% (1 - ui ) ● Agreement weight: log2(mi / ui ) = log2(0.97 / 0.083) = 3.54 ● Disagreement weight log2((1-mi ) / (1- ui )) = log2(0.03 / 0.917) = -4.92