The Australian National University School of Computing, CECS
COMP3430/COMP8430 – Data Wrangling – 2021
Lab 3: Blocking for Record Linkage – Week 6
Overview and Objectives
Today’s lab is the first in a series of five labs during which we will gradually build a complete record linkage system. We
will provide you with basic Python skeleton modules and over the next few labs you will be asked to complete the different
components of the modules.
In today’s lab we examine the blocking step, which is covered in lectures 13 and 14. The idea of blocking is to avoid having
to compare every possible pair of records across two data sets. By using blocking we exclude from the comparison step those
record pairs which are very unlikely to be matches.
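To get a feel for the saving, note that comparing two data sets of 1000 records each without blocking means 1000 × 1000 = 1,000,000 record pairs. The following minimal sketch counts candidate pairs with and without blocking on a toy example. The function names and the record layout (a list of dictionaries with a "surname" field) are assumptions made for illustration only and are not taken from the skeleton modules.

```python
from collections import Counter


def count_full_pairs(records_a, records_b):
    """Number of comparisons when every record in A is compared with every record in B."""
    return len(records_a) * len(records_b)


def count_blocked_pairs(records_a, records_b, key):
    """Number of comparisons when only records with the same blocking key are compared."""
    counts_a = Counter(key(rec) for rec in records_a)
    counts_b = Counter(key(rec) for rec in records_b)
    # A block contributes (records from A) x (records from B) candidate pairs.
    return sum(n_a * counts_b[k] for k, n_a in counts_a.items())


# Toy data (hypothetical layout, for illustration only).
records_a = [{"surname": "smith"}, {"surname": "smith"},
             {"surname": "jones"}, {"surname": "lee"}]
records_b = [{"surname": "smith"}, {"surname": "jones"},
             {"surname": "jones"}, {"surname": "wong"}]

full = count_full_pairs(records_a, records_b)
blocked = count_blocked_pairs(records_a, records_b, key=lambda rec: rec["surname"])
print(full, blocked)  # 16 4
```

Here blocking on surname keeps 4 of the 16 possible pairs: two pairs in the "smith" block and two in the "jones" block.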
You should start this lab by downloading from Wattle the comp3430 archive (in Week 6)
that contains the lab materials. In there you will find the skeleton modules and data sets which we will be using to complete
the record linkage system. Please create a folder for the program on your machine and extract the materials there.
Before you start, please also have a look through the Python skeleton modules to get a feel for how they are structured and what
the different parts do. Today we are only working with the blocking module, but it is good to have some understanding of how it
all fits together. First look at the module that runs the complete process.
Run this module as it is. It will use some of the provided data sets and the functions already implemented. This will
show you what the output for the different steps will look like. We recommend using the small data sets with no corruptions,
named clean-A-1000.csv and clean-B-1000.csv, as a way to test whether your program is working. Once your program
is working, apply it on the other, larger, data sets.
Lab Questions
Your tasks today are to experiment with the blocking process and to implement two blocking functions. We have already
provided one blocking function, simpleBlocking, where records are placed into different blocks based on the value of a
chosen blocking attribute (or attributes). For example, all the records with the same Surname will go in the same block.
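The idea of simple blocking can be sketched as follows. This is not the provided simpleBlocking function; it is a stand-alone illustration that assumes, hypothetically, that records are held in a dictionary mapping a record identifier to a list of attribute values, and that blocking attributes are given as column positions.

```python
from collections import defaultdict


def simple_blocking_sketch(records, attributes):
    """Group record identifiers by the values of the chosen attributes.

    records:    dict mapping record id -> list of attribute values (assumed layout)
    attributes: list of column positions used to build the blocking key
    """
    blocks = defaultdict(list)
    for rec_id, values in records.items():
        # Concatenate the chosen attribute values into one blocking key.
        key = "|".join(values[i].strip().lower() for i in attributes)
        blocks[key].append(rec_id)
    return dict(blocks)


# Toy records: [first name, surname, postcode] (hypothetical layout).
records = {
    "a1": ["john", "smith", "2600"],
    "a2": ["jon", "smith", "2600"],
    "a3": ["mary", "jones", "2611"],
}

blocks = simple_blocking_sketch(records, [1])
print(blocks)  # {'smith': ['a1', 'a2'], 'jones': ['a3']}
```

Passing several positions, for example [1, 2], builds a combined key from surname and postcode, which usually gives more and smaller blocks.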
We will begin the lab by reviewing the aim of blocking, and how simple blocking, Soundex, and the SLK-581 methods (see
below) work. The tasks for this lab are as follows:
1. Review and discuss the aim of blocking as part of the record linkage process. Discuss how simple blocking works.
Now, calculate the Soundex codes for Brian Schmidt (our VC) and Queen Elizabeth II. Then calculate the SLK-581 for
Brian Schmidt and Queen Elizabeth II (their required personal details are publicly available).
Then calculate the Soundex code for your first name and your last name, and then calculate your SLK-581 code.
2. Next, start looking at the blocking module and explore how the blocking functions work (inputs, return values, etc.). Then run
the blocking step on the two small data sets using both noBlocking and simpleBlocking (on attributes of your choice)
and investigate how blocking affects the output. You can comment or uncomment different lines in the module that runs
the complete process to call different combinations of functions. Examine how these blocking techniques affect the number
of record pairs that are to be compared. Write down the number of blocks generated, as well as their minimum, average, and
maximum sizes.
3. In the blocking module, implement the Soundex phonetic encoding, and use the Soundex value as the blocking
value. In other words, all values with the same Soundex code should be placed in the same block. For a full description
of Soundex, see lecture 14.
Note: There are different specifications and implementations of Soundex. We expect you to follow the description given
in lecture 14.
4. The Statistical Linkage Key SLK-581 is an identifier used to link records: records with the same SLK-581 are taken to
belong to the same person. As shown in the figure on the next page, SLK-581 is made up of four elements:
three letters from the family name (surname or last name), two letters from the given name (first name), the date of
birth, and gender.
[Figure: structure of SLK-581. Three letters of last name, two letters of first name, date of birth, gender.]
In the blocking module, implement SLK-581 as a blocking key where all records with the same SLK-581 identifier
should be added to the same block. We will cover SLK-581 in more detail in lecture 16, but you can also refer to
the web page at
581-guide.pdf.aspx for more details on SLK-581.
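As a starting point for tasks 3 and 4, the two encodings can be sketched as stand-alone functions. Both are hedged illustrations, not the expected solutions. The soundex function follows one common variant (keep the first letter, code the remaining consonants as digits, collapse adjacent equal codes, let h and w not separate equal codes, pad with zeros to four characters); the description in lecture 14 may differ in details, and that description takes precedence. The slk581 function assumes the commonly described construction: the 2nd, 3rd and 5th letters of the family name, the 2nd and 3rd letters of the given name, the date of birth as DDMMYYYY, and a one-character sex code, with '2' standing in for a missing letter and '999' or '99' for a missing name. Please check these assumptions against the SLK-581 guide and lecture 16. The function names and argument formats are hypothetical.

```python
# Letter groups that share a Soundex digit (common variant, see assumptions above).
SOUNDEX_CODES = {}
for group, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                     ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in group:
        SOUNDEX_CODES[letter] = digit


def soundex(name):
    """Return a four-character Soundex code, or '' if the name has no letters."""
    letters = [c for c in name.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "hw":  # h and w do not break a run of equal codes
            prev = digit
    return (code + "000")[:4]


def slk581(given_name, family_name, dob_ddmmyyyy, sex_code):
    """Build an SLK-581 style key under the assumptions stated in the text above."""
    def pick(name, positions, missing):
        letters = [c for c in name.upper() if c.isalpha()]
        if not letters:
            return missing
        # Positions are 1-based; '2' stands in for a letter the name does not have.
        return "".join(letters[p - 1] if p <= len(letters) else "2"
                       for p in positions)

    return (pick(family_name, (2, 3, 5), "999")
            + pick(given_name, (2, 3), "99")
            + dob_ddmmyyyy + sex_code)


print(soundex("Schmidt"), soundex("Smith"))    # S530 S530
print(slk581("Mary", "Smith", "01021990", "2"))  # MIHAR010219902
```

Note that the data sets may store dates and gender in a different format from the one assumed here, so a real blocking function would need to convert those values before building the key.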
After you have finished coding each blocking method, please repeat your experiments from task 2 on the other data sets, but
using the new blocking functions. Consider both the output, i.e. the number of records per block, number of blocks, etc.,
and the performance information such as time taken. You may also want to print out some of the blocking keys to see what
they look like, or build a frequency distribution of the blocking keys or block sizes. Think about the following questions:
• Which do you think are the best blocking functions and keys, and why?
• Can you use some of the attributes unsuitable for blocking in combination with others to improve blocking?
• Can you come up with a list of criteria for good blocking keys based on the experiments you conducted?
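The block statistics and frequency distributions mentioned above can be collected with a few lines of Python. The sketch below assumes that blocks are held in a dictionary mapping each blocking key to a list of record identifiers; the function names are hypothetical and not part of the skeleton modules.

```python
from collections import Counter


def block_stats(blocks):
    """Return the number of blocks and their minimum, average and maximum sizes."""
    sizes = [len(rec_ids) for rec_ids in blocks.values()]
    if not sizes:
        return {"num_blocks": 0, "min": 0, "avg": 0.0, "max": 0}
    return {
        "num_blocks": len(sizes),
        "min": min(sizes),
        "avg": sum(sizes) / len(sizes),
        "max": max(sizes),
    }


def block_size_distribution(blocks):
    """Count how many blocks have each size (a frequency distribution of block sizes)."""
    return Counter(len(rec_ids) for rec_ids in blocks.values())


# Toy blocks (assumed layout: blocking key -> list of record identifiers).
blocks = {"smith": ["a1", "a2", "a3"], "jones": ["a4"], "lee": ["a5", "a6"]}

stats = block_stats(blocks)
dist = block_size_distribution(blocks)
print(stats)  # {'num_blocks': 3, 'min': 1, 'avg': 2.0, 'max': 3}
```

A few very large blocks dominate the number of comparisons, so the maximum size and the upper end of the distribution are usually the most informative figures.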
If time permits, there are plenty of other blocking techniques that you can look at and implement in the blocking
module. Two that are commonly used in practice are canopy clustering and sorted neighbourhood blocking; please try
implementing either of these. Note that to implement sorted neighbourhood blocking, you will
need to modify the skeleton program somewhat in order to make it accept an index rather than a dictionary of blocks.
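For orientation, the basic sorted neighbourhood idea can be sketched as follows: records from both data sets are merged into one list, sorted by a sorting key, and a window of fixed size is moved over the sorted list, so that only records inside the same window become candidate pairs. This is a minimal sketch under assumed data structures (dictionaries mapping record identifiers to lists of attribute values); the function name, the window parameter, and the returned set of pairs are hypothetical and do not reflect how the skeleton program is organised.

```python
def sorted_neighbourhood_pairs(records_a, records_b, key, window=3):
    """Return candidate (id_a, id_b) pairs from a sliding window over sorted records."""
    # Tag each record with its source so that only cross-data-set pairs are kept.
    tagged = [(key(vals), "A", rec_id) for rec_id, vals in records_a.items()]
    tagged += [(key(vals), "B", rec_id) for rec_id, vals in records_b.items()]
    tagged.sort()

    pairs = set()
    for i in range(len(tagged)):
        # Compare record i with the next (window - 1) records in sorted order.
        for j in range(i + 1, min(i + window, len(tagged))):
            _, src_i, id_i = tagged[i]
            _, src_j, id_j = tagged[j]
            if src_i != src_j:
                pairs.add((id_i, id_j) if src_i == "A" else (id_j, id_i))
    return pairs


# Toy records with a single attribute (surname); layout is an assumption.
records_a = {"a1": ["smith"], "a2": ["jones"]}
records_b = {"b1": ["smyth"], "b2": ["johns"], "b3": ["zhang"]}

pairs = sorted_neighbourhood_pairs(records_a, records_b,
                                   key=lambda vals: vals[0], window=2)
print(sorted(pairs))  # [('a1', 'b1'), ('a2', 'b2')]
```

Unlike simple blocking, this pairs "smith" with "smyth" even though their key values differ, because they are neighbours in the sorted order. A larger window finds more such pairs at the cost of more comparisons.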