程序代做CS代考 database COMP3430 / COMP8430 Data wrangling

COMP3430 / COMP8430 Data wrangling
Lecture 15: Record pair comparison (1) (Lecturer: )

Lecture outline
● Comparing records for record linkage
● Similarity and distance functions
● Basic comparison functions
● Numerical and other comparison functions

The record linkage process

Comparing record pairs (1)
● The blocking process generates groups / clusters of records
● From each block, record pairs are generated
– For a linkage of two databases, all records from database A are paired
with all records from database B from blocks that have the same BKV – For the deduplication of a single database, all unique record pairs are
formed from each block (a record is not compared with itself)
● Record pairs are compared based on their common available
attributes (fields)
– These are commonly names, addresses, dates, phone numbers, etc. – They contain variations and errors (even after cleaning), or can be
missing or out-of-date

Comparing record pairs (2)
● Exact comparison of attribute values will not provide good linkage
results
– Even true matching record pairs often contain different attribute values – For example:
[‘peter’, ‘paul’, ‘meier’, ‘2/21 main st’, ‘acton’, ‘act’, ‘ 2601’] [‘peter’, ‘p’, ‘meyer’, ’21 main street’, ‘acton’, ‘act’, ‘ 2602’]
● Approximate comparison functions are required
– To calculate similarities between attribute values, not only ‘is the same
or is different’
– They need to be appropriate for the content of a certain attribute
(text: names, addresses, dates, phone numbers; numerical: ages, salaries)

Similarities and distances (1)
● Approximate matching functions generally calculate a numerical
similarity value
– sim = 0: Two values are completely different (‘peter’ and ‘david’) – sim = 1: Two values are exactly the same (‘peter and peter’)
– 0 < sim < 1: Two values are somewhat similar ('peter' and 'pedro') ● For the same pair of values, different functions calculate different similarities ● Some functions calculate distances ● Distances can be converted into similarities – sim = 1/dist, with sim = 1 if dist = 0 – dist = 1-sim, if 0 ≤ dist ≤1 Similarities and distances (2) ● Distance functions (or distance metrics) have several properties: – dist(val1, val1) = 0 Distance from an object to itself is always 0 – dist(val1, val2) ≥ 0 Distances are always positive – dist(val1, val2) = dist(val2, val1) Distances are symmetric – dist(val1, val2) + dist(val1, val3) ≥ dist(val2, val3) Triangular inequality must hold ● Not all approximate matching functions are proper metric distances – Triangular inequality does not hold for certain functions – Some functions are not symmetric (for example, those that calculate if one attribute value is included in another, like 'pete' and 'peter') Basic comparison functions ● Exact comparison: – sexact(val1, val2) = 1 if val1 = val2, 0 otherwise – val1 and val2 can be strings, numbers, etc. Truncate string comparison (x characters the same at beginning) – For string values only – strunc(val1, val2) = 1 if val1[:x] = val2[:x], 0 otherwise – Similar for testing if the end is the same ● Phonetic encoded comparisons – For strings only – sencode(val1, val2) = 1 if encode(val1) = encode(val2), 0 otherwise – encode() is a phonetic encoding function as discussed previously ● Numerical comparison functions (1) ● For numerical values, we also want to have a comparison that calculates a similarity between 0 and 1 ● We set a maximum absolute difference allowed, or a maximum percentage difference allowed – If two values differ more their similarity will be 0 ● For absolute maximum difference of dmax and two values n1 and n2: –Ifabs(n1 –n2)≥dmax :simnum_abs =0 –Ifabs(n1 –n2)