COMP3430 / COMP8430 Data wrangling
Interactive lecture week 9: Assignments, labs, and summary of week 9
(Lecturer: )
Lecture outline
● Administrative matters
– Assignments, Labs, and Exam
● Questions on Wattle
● Summary of week 9
● Q and A Session
Labs
● We had lab 6 this week, a continuation of the record linkage project
● Different evaluation methods were discussed
● We have now completed the record linkage process
● Sample solutions for the lab will be released on Monday
● For Assignment 3 you can now run the entire record linkage program
Labs next week
● We will give you another list of data sets for conducting experiments
● Focus on experimenting with different attribute combinations, parameter settings, and data set pairs. Try to understand how these aspects affect the final record linkage quality
● Next week's lab is the final lab session. It is your chance to ask questions about Assignment 3
Assignment 1
● All Assignment 1 re-mark requests have been answered
● Your marks are now finalised and fixed
● We have identified two marking issues
– Different MAD (median absolute deviation) calculations in R and Python: R applies a scaling factor of 1.4826 by default when calculating MAD
– Cramér’s V calculation for correlation between categorical variables
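The two statistics named above can be sketched in plain Python. This is only an illustration: the function names and the sample data are hypothetical, and the slide does not say what the Cramér's V issue was, so the plain (not bias-corrected) definition is assumed. The MAD part shows why results can differ: R's `mad()` multiplies by 1.4826 by default, whereas some Python routines (for example SciPy's `median_abs_deviation`) apply no scaling unless asked to.

```python
import math
import statistics

# Default consistency constant of R's mad(); without it the MAD is "raw".
R_MAD_CONSTANT = 1.4826


def mad(values, scale=1.0):
    """Median absolute deviation around the median, multiplied by `scale`."""
    centre = statistics.median(values)
    deviations = [abs(v - centre) for v in values]
    return scale * statistics.median(deviations)


def cramers_v(table):
    """Plain Cramér's V for a contingency table given as a list of rows.

    Assumes every row and every column has a non-zero total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k))


data = [1, 2, 3, 4, 100]                 # hypothetical sample
raw_mad = mad(data)                      # no scaling
scaled_mad = mad(data, R_MAD_CONSTANT)   # scaled the way R does by default

v_perfect = cramers_v([[10, 0], [0, 10]])  # fully associated categories
v_none = cramers_v([[5, 5], [5, 5]])       # independent categories

print(raw_mad, scaled_mad, v_perfect, v_none)
```

When comparing answers across the two languages, it helps to state explicitly whether the scaling factor was applied.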
Assignments 2 and 3
● Assignment 2: deadline is today, 8th October 2021, 11:55 PM
● Make sure you submit your assignment before the deadline
● Assignment 3: due on 22nd October 2021, 11:55 PM
Assignment 4
● Assignment 4 is for all students enrolled in COMP8430
● Assignment 4: due on 29th October 2021, 11:55 PM
Final examination
● The final exam will be on Monday 15th November, starting at 5:40 PM AEDT
● For details, see the ANU exam timetable
● We recommend that you have Python installed, and VirtualBox or VMware Horizon set up, on your machine
● We will discuss the final exam in the last interactive lecture
Questions from Wattle forum
● Assume you need to deduplicate a database that contains 10,000 records. You apply a blocking technique and a total of 1,250,000 candidate record pairs are generated by your blocking technique. What is the reduction ratio of this blocking technique on this database?
Questions from Wattle forum
● Number of naive pairwise comparisons = (10000*9999)/2 = 49995000
● RR = 1 – (num_pairs_after_blocking / total_num_pairs) = 1 – (1250000/49995000) ≈ 0.975
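The calculation above can be checked with a few lines of Python. This is a minimal sketch; the function names are illustrative and not taken from the lab code.

```python
def naive_pairs(num_records):
    """Number of record pairs when deduplicating one database without blocking."""
    return num_records * (num_records - 1) // 2


def reduction_ratio(num_records, num_candidate_pairs):
    """RR = 1 - (pairs after blocking / pairs without blocking)."""
    return 1.0 - num_candidate_pairs / naive_pairs(num_records)


total = naive_pairs(10_000)
rr = reduction_ratio(10_000, 1_250_000)
print(total)         # 49995000
print(round(rr, 3))  # 0.975
```

Note that the n(n-1)/2 count applies to deduplication of a single database; linking two databases of sizes m and n would instead start from m*n pairs.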
Summary of week 9
● Data Fusion
– Difficulties with data fusion: missing values, contradicting attribute values, uncertainty, implementation into a database
– Different types of records need to be fused: identical records, subsumed records, complementing records, conflicting records
– Different conflict types, resolution strategies, resolution functions, and operations we can use in data fusion
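As a rough illustration of the record types and of a resolution function, the sketch below classifies a pair of records and fuses a group of records by majority vote. It assumes records are dictionaries over the same attributes with `None` marking a missing value. The function names, the sample records, and the choice of voting as the resolution function are all hypothetical.

```python
from collections import Counter


def classify_pair(rec_a, rec_b):
    """Classify two records over the same attributes (None = missing value)."""
    attrs = list(rec_a)
    shared = [a for a in attrs if rec_a[a] is not None and rec_b[a] is not None]
    if any(rec_a[a] != rec_b[a] for a in shared):
        return "conflicting"      # both have a value, and the values differ
    only_a = any(rec_a[a] is not None and rec_b[a] is None for a in attrs)
    only_b = any(rec_b[a] is not None and rec_a[a] is None for a in attrs)
    if only_a and only_b:
        return "complementing"    # each record fills a gap in the other
    if only_a or only_b:
        return "subsumed"         # one record adds nothing to the other
    return "identical"


def vote(values):
    """Resolution function: keep the most frequent non-missing value."""
    return Counter(values).most_common(1)[0][0]


def fuse(records, resolve):
    """Fuse records attribute by attribute with a resolution function."""
    fused = {}
    for attr in records[0]:
        values = [r[attr] for r in records if r[attr] is not None]
        fused[attr] = resolve(values) if values else None
    return fused


a = {"name": "Ann Lee", "city": "Canberra", "phone": None}
b = {"name": "Ann Lee", "city": None, "phone": "555-0101"}
c = {"name": "Ann Lee", "city": None, "phone": None}
d = {"name": "Ann Lee", "city": "Sydney", "phone": None}
e = dict(a)

labels = [classify_pair(a, other) for other in (b, c, d, e)]
fused = fuse([a, b, c, d, e], vote)
print(labels)
print(fused)
```

Other resolution functions (for example most recent value, longest value, or average) could be passed to `fuse` in place of `vote`.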
Summary of week 9
● Advanced record linkage techniques
– Group based record linkage
– Collective linkage techniques
– Active learning
– Geocode matching
– Linking temporal and dynamic data
Summary of week 9
● Privacy-preserving record linkage (PPRL) techniques
– Difference between privacy, confidentiality, and security
– Why privacy needs to be protected
– Need to preserve the privacy and confidentiality of entities represented by data during the data wrangling (DW) process
– PPRL techniques: Secure hash encoding, noise addition, generalisation, secure multi-party computation, perturbation
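One of the listed techniques, secure hash encoding, can be sketched as follows. The sketch assumes that both database owners share a secret key and encode normalised attribute values with a keyed one-way hash (HMAC with SHA-256 from the Python standard library), so that only the encodings are compared. The key, the normalisation step, and the sample values are hypothetical.

```python
import hashlib
import hmac


def normalise(value):
    """Lower-case the value and collapse whitespace before encoding."""
    return " ".join(value.lower().split())


def encode(value, secret_key):
    """Keyed one-way hash of a normalised attribute value."""
    message = normalise(value).encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()


# Hypothetical secret shared by the two database owners only.
key = b"shared-secret-known-only-to-the-database-owners"

db_a = ["Ann Lee", "Bob  Smith"]
db_b = ["ann lee", "Bob Smyth"]

codes_a = {encode(v, key) for v in db_a}
codes_b = {encode(v, key) for v in db_b}

# Only the encodings are compared, never the original values.
matches = codes_a & codes_b
print(len(matches))
```

A design limitation worth noting: a single character difference ("Smith" versus "Smyth") produces a completely different hash code, so this kind of encoding only supports exact matching.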
Q and A Session
● Socrative
– https://b.socrative.com/login/student/
– Room Name: COMP3430