COMP3430 / COMP8430 Data Wrangling
Interactive lecture week 6:
Record Linkage: Blocking and Comparison (Lecturer: )
Lecture outline
● Administrative matters
– Assignments, Labs, Semester break
● Questions on Wattle ● Q and A Session
Administrative matters
● Assignments
– Assignment 1: due on Friday 3rd September (today)
– Make sure you submit the assignment before it’s due – Assignments 2, 3, and 4 (COMP8430 only): open now
● Labs
– Lab 3: sample solutions will be released on Monday
● Next two weeks – Semester break
– From 6th Monday September to 17th Friday September
Summary of week 5 and 6
● Record linkage / data matching / entity resolution
– The process of identifying records in one or more databases that correspond
to the same real-world entity (person, business, product, etc.)
– The linkage of a single database is called deduplication (duplicate detection)
● Blocking / indexing
– Cheaply remove / filter-out record pairs that are obviously not matches by grouping the similar records together
● Comparison
– Compare the record pairs in more detail to calculate the similarities between them. Different comparison techniques are used
Questions from Wattle forum (1)
● [bag distance] We will need to count the remaining elements after subtracting one “bag” from another. But what if one of the bags isn’t strictly a subset of the other “bag”? e.g. How would you compute “peter”: [e, e, p, r, t] – “petra”: [a, e, p, r, t] = ?
Questions from Wattle forum (1)
● Bag of s1 = “peter” = b1 = [p, e, t, e, r]
● Bag of s2 = “petra” = b2 = [p, e, t, r, a]
● bag_dist(s1, s2) = max(|b1 – b2|, |b2 – b1|)
= max(|[e]|, |[a]|) = 1
● bag_dist_sim(s1, s2) = 1 – (bag_dist / max(|s1|, |s2|))
= 1 – (1 / 5) = 0.8
Questions from Wattle forum (2)
● [SLK-581] The algorithm instructs to use number 999 for compensating “if full name component missing”. Depending on how people interpret such criterion, the resulting linkage key can be different. Would you mind advice whether the following schemes make sense, and what are their correct forms?
– “_missing_ miller”, 13/04/1991, “f” → “999 ar 13041991 2” → “222 ar 13041991 2”
– “marie _missing_”, 13/04/1991, “f” → “ile 99 13041991 2” → “ile 22 13041991 2”
– “_missing_”, 13/04/1991, “f” → “999 13041991 2” → “999 99 13041991 2”
Questions from Wattle forum (2)
– “_missing_ miller”, 13/04/1991, “f” → ile 99 13041991 2 – “marie _missing_”, 13/04/1991, “f” → 999 ar 13041991 2 – “_missing_”, 13/04/1991, “f” → 999 99 13041991 2
Questions from Wattle forum (3)
● According to the recording, Substitution of a character, Transpositions of two adjacent characters can be considered as one or two operations. Could I know whether Substitution of a character, Transpositions of two adjacent characters should count as one operation or two operations when calculating the similarity?
Questions from Wattle forum (3)
● Levenshtein edit distance
– Deletion → one operation
– Insertion → one operation
– Substitution → one operation
● However, the cost for each operation can be different
Questions from Wattle forum (3)
● Jaro similarity
– d → half distance of the longer string
d = (max(|s1|, |s2|) / 2) – 1
– c → number of common characters that agree within the distance d
– t → number of transpositions (eg: ‘ab’ and ‘ba’) – jaro_sim(s1, s2) = 1/3 (c/|s1| + c/|s2| + (c – t )/c)
Q and A Session
● Socrative
– https://b.socrative.com/login/student/ – Room Name: COMP3430