Record Linkage
Due date: 23:59 on Saturday of Week 7
PERSON.csv
Field Name | Description | Type
PID | Person ID | Text
FNAME | First Name | Text
LNAME | Last Name | Text
SEX | Gender (M/F) | Text
DOB_DAY | Day of birth | Numeric
DOB_MON | Month of birth | Numeric
DOB_YEAR | Year of birth | Numeric
PERSONS | Person number, a numeric label for each household member | Numeric
DWELLERS | The number of persons within each household | Text
HSE_NUM | House number, a numeric label for each house within a street | Text
STR_NAME | Street name of person’s household’s street | Text
POSTCODE | Postcode | Text

CENSUS.csv
Field Name | Description | Type
CENSUS_ID | Census ID | Text
FNAME | First Name | Text
LNAME | Last Name | Text
SEX | Gender (M/F) | Text
DOB_DAY | Day of birth | Numeric
DOB_MON | Month of birth | Numeric
DOB_YEAR | Year of birth | Numeric
PERSONS | Person number, a numeric label for each household member | Numeric
DWELLERS | The number of persons within each household | Text
HSE_NUM | House number, a numeric label for each house within a street | Text
STR_NAME | Street name of person’s household’s street | Text
POSTCODE | Postcode | Text
INFO2150
Introduction to Health Data Science
Individual Assessment: 10% (of the whole unit)

DESCRIPTION
For privacy reasons, in Australia we are not allowed to use a unique ID, e.g. a Medicare Number or Tax File Number, to connect records from different sources and recreate the information about an entity. Probabilistic linkage is the alternative: “useful” fields that appear in both sources are used to link records. In this assignment you are given this specification and a zipped dataset, and you have to write a program to link the two data files together and carry out some analysis. The zipped file contains 3 files. Two of them are mock-up data files: one represents each person’s basic information and the other is the census data. Please be aware that not everyone takes part in the census, and that some entries were recorded erroneously when the census was carried out. The third file is a sample of the output file that you must also submit.
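As a starting point for working with the two mock-up files, the sketch below reads a CSV file with every value kept as text and summarises one field. It is a minimal illustration in Python, not a required approach. The helper names load_records and describe are hypothetical, and the sketch assumes the files are comma-separated, UTF-8 encoded, and start with a header row containing the field names.

```python
import csv
from collections import Counter


def load_records(path):
    """Read a CSV file into a list of dicts, keeping every value as text."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def describe(records, field):
    """Return simple descriptive statistics for one field."""
    # Treat absent or whitespace-only values as missing (an assumption).
    values = [(row.get(field) or "").strip() for row in records]
    present = [value for value in values if value]
    counts = Counter(present)
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(counts),
        "most_common": counts.most_common(3),
    }
```

For example, describe(load_records("PERSON.csv"), "POSTCODE") would report how many postcodes are missing and which values occur most often. Values are read as text here so that identifiers and postcodes are not altered by numeric conversion.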

Here are the tasks that need to be completed:
1. Study the two files and come up with descriptive statistics for each of them.
2. To use the Fellegi-Sunter Method for record linkage in this assignment, you need to define the M-probability and U-probability for each of the fields. Use only the following 7 (seven) fields for the linkage:
• FNAME
• LNAME
• SEX
• DOB_DAY
• DOB_MON
• DOB_YEAR
• POSTCODE
Since the two files contain essentially the same fields, you only need to use one of them, say PERSON.csv, for the calculation. If a value is not derived from calculation, state the assumption.
3. Write your own program (in whatever language or tool) to link records from CENSUS to PERSON. Write each linkage to a file called LINKAGE.csv, with the first line CENSUS_ID, PID, followed by the actual IDs on subsequent lines. A sample file LINKAGE_sample.csv is included for your reference.
4. Calculate the score for each linked pair using the M-probability and U-probability you have defined in Step 2.
5. Plot these scores on a chart and determine where you will define a threshold for manual inspection. Give reasons. With this threshold, how many pairs would you have to investigate manually?
6. Write a report (we will only mark the first 8 pages, excluding the cover and contents pages) using the work from the above steps. The report should include, but is not limited to, the following sections:
i. Executive Summary
ii. Introduction
iii. Fellegi-Sunter Method – the derivation of U- and M-probability of each field
iv. Description of your implementation of record linkage, especially FNAME and LNAME
v. A section on the scores of the linked pairs, essentially Step 5 above.
vi. A discussion of the overall performance of your linkage and your experience in this linkage assignment.
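
Steps 2 to 4 can be sketched as follows. This is one possible illustration in Python, not a model answer. The function names, the 0.85 similarity cut-off for names, the use of difflib for approximate name comparison, and the rule that a blank value counts as a disagreement are all assumptions made for the example. The U-probability is estimated as the chance that two randomly chosen records agree on a field, and each field contributes log2(m/u) to the score when the values agree and log2((1-m)/(1-u)) when they do not.

```python
import math
from collections import Counter
from difflib import SequenceMatcher

FIELDS = ["FNAME", "LNAME", "SEX", "DOB_DAY", "DOB_MON", "DOB_YEAR", "POSTCODE"]


def u_probability(records, field):
    """Estimate the chance that two unrelated records agree on `field`.

    Computed as the sum of squared relative frequencies of the non-blank
    values observed in one file (for example PERSON.csv).
    """
    values = (row[field].strip().upper() for row in records)
    counts = Counter(value for value in values if value)
    total = sum(counts.values())
    if total == 0:
        raise ValueError(f"no non-blank values for {field}")
    return sum((n / total) ** 2 for n in counts.values())


def agrees(field, a, b, name_threshold=0.85):
    """Decide whether two field values agree.

    Names are compared approximately to tolerate typing errors; all other
    fields must match exactly. Blank values count as disagreement
    (an assumption of this sketch).
    """
    a, b = a.strip().upper(), b.strip().upper()
    if not a or not b:
        return False
    if field in ("FNAME", "LNAME"):
        return SequenceMatcher(None, a, b).ratio() >= name_threshold
    return a == b


def field_weight(m, u, agree, eps=1e-9):
    """Fellegi-Sunter weight of one field for agreement or disagreement."""
    # Clamp to avoid log(0) or division by zero at the extremes.
    m = min(max(m, eps), 1 - eps)
    u = min(max(u, eps), 1 - eps)
    if agree:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def pair_score(census_row, person_row, m_probs, u_probs):
    """Total score of a candidate pair: the sum of its field weights."""
    return sum(
        field_weight(
            m_probs[f], u_probs[f], agrees(f, census_row[f], person_row[f])
        )
        for f in FIELDS
    )
```

The M-probabilities are not calculated in this sketch and would have to be supplied, for instance as stated assumptions about how often each field is recorded correctly. Note also that the U-probability above is based on exact agreement, so it only approximates the chance agreement rate when names are compared approximately.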
Marking Criteria
You need to submit a report and a data file named LINKAGE.csv. Please check the marking rubric for this assignment on the INFO2150 Canvas site.
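
The data file could be produced along these lines. This is a minimal sketch in Python, assuming the linked pairs are already available as (CENSUS_ID, PID) tuples. The helper name write_linkage is hypothetical, and the exact header formatting should be checked against LINKAGE_sample.csv.

```python
import csv


def write_linkage(pairs, path="LINKAGE.csv"):
    """Write linked (CENSUS_ID, PID) pairs to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["CENSUS_ID", "PID"])  # first line required by Step 3
        writer.writerows(pairs)
```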
Submission Procedure
Students are expected to submit the report, and any supporting files, electronically:
● Due date: submit to the Canvas site by the due date specified in the above table
● Penalty: 5% of the maximum marks will be deducted per day (or part of a day) late. After ten days, you will be awarded a mark of zero.