
COMP3430 / COMP8430 Data wrangling
Lecture 23: Privacy aspects in data wrangling
and privacy-preserving record linkage
(Lecturer: )
Based on slides by (Data61 / CSIRO)

Lecture outline
● Privacy, confidentiality and security
● Privacy in data wrangling
● Privacy-preserving record linkage (PPRL)
● Taxonomy of PPRL
● PPRL techniques
  – Bloom filter-based techniques
  – Attacks on PPRL

Privacy, confidentiality, and security
Three important and related concepts of data protection:
● Privacy: the right of individual entities (e.g. customers or patients) to make decisions about how their personal data are shared and used
● Confidentiality: the obligation or responsibility of professionals and organisations who have access to data to hold those data in confidence
● Security: the means or tools used to protect the privacy of entities’ data and to support professionals / organisations in holding data in confidence

Privacy by design
● Personal data are valuable for various applications, but are at risk when they are used, stored, shared, exchanged, or revealed, leading to growing privacy concerns
● Important to have proper systems in place that provide data protection, but still allow applications and research studies to utilise the information available in data
● Standards and regulations are required:
  – Safe environments to handle sensitive data in
  – Proper handling and output procedures
  – Safe storage
● Privacy laws, such as the recent European Union GDPR (General Data Protection Regulation)

Privacy in data wrangling
● Preserve the privacy and confidentiality of the entities represented by data throughout the data wrangling pipeline
● Privacy and confidentiality concerns arise when data are shared or exchanged between different organisations
● It is mainly the task of data integration / record linkage in the pipeline that requires data to be integrated from multiple sources held by different organisations
● This requires disclosure limitation to protect the privacy and confidentiality of sensitive data (such as personal names, addresses, etc.)


Disclosure limitations
● Filter or mask (encode or encrypt) raw data to limit what is revealed or disclosed
● Disclosure-limited masking:
  – Uses masking (encoding) functions to transform data such that there exists a specific functional relationship between the masked and the original data
  – A budget-constrained problem: the goal of masking functions is to achieve the maximum utility under a fixed privacy budget
  – Examples include noise addition, generalisation, and probabilistic filters

Privacy-preserving record linkage (PPRL)
● The objective of PPRL is to perform linkage across organisations using masked (encoded) records
● Besides certain attributes of the matched records, no information about the sensitive original data can be learned by any party involved in the linkage, or by any external party

PPRL: Example applications
● Health outbreak systems
  – Early detection of infectious diseases before they spread widely
  – Requires data to be integrated across human health data, travel data, consumed drug data, and even animal health data
● National security applications
  – Integrate data from law enforcement agencies, Internet service providers, and financial institutions to identify crime and fraud, or terrorism suspects
● Business applications
  – Compile mailing lists or integrate customer data from different sources for marketing activities and/or recommendation systems
● In all these applications, none of the parties is willing, or allowed by law, to exchange or provide their data to other parties

PPRL taxonomy
[Figure: taxonomy of PPRL, organised along five dimensions:]
● Privacy aspects: number of parties, adversary model, privacy techniques
● Linkage techniques: blocking, matching, classification
● Theoretical analysis: scalability, linkage quality, privacy vulnerabilities
● Evaluation: scalability, linkage quality, privacy
● Practical aspects: implementation, application areas

Vatsalan et al., A taxonomy of privacy-preserving record linkage techniques, Information Systems, 38(6), pp. 946-969, 2013

PPRL protocols
[Figure: message flows, where the parties (1) agree on parameters, (2) send masked data (to the linkage unit in three-party protocols, or to each other in two-party protocols), and (3) receive the matches]
● Three-party protocols: use a linkage unit (LU) to conduct/facilitate the linkage (a minimal simulation follows below)
● Two-party protocols: only the two database owners participate in the linkage
● Multi-party protocols: link records from multiple databases (with or without a LU), with the additional challenges of scalability and privacy risk (collusion between parties)
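As an illustration of the three steps, here is a minimal Python simulation of a three-party protocol with exact matching, where the database owners mask their values with a keyed hash. The key, the example values, and the choice of HMAC masking are illustrative assumptions, not a specific protocol from the literature:

```python
import hashlib
import hmac

# Hypothetical secret key agreed between the two database owners in
# step 1 (parameters); the linkage unit (LU) must not know it.
SECRET_KEY = b'key-shared-by-the-database-owners'

def mask(values):
    """Mask each value with HMAC-SHA256 (step 2: masked data)."""
    return {hmac.new(SECRET_KEY, v.encode('utf-8'), hashlib.sha256).hexdigest()
            for v in values}

masked_a = mask(['peter', 'gail', 'smith'])  # sent to the LU by owner A
masked_b = mask(['pete', 'gail', 'jones'])   # sent to the LU by owner B

# The LU intersects the two sets of masked values (step 3: matches);
# it learns only which masked values co-occur, never the original values.
matches = masked_a & masked_b
print(len(matches))  # 1 -- only 'gail' appears in both databases
```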

PPRL adversary models
● Honest-but-curious (HBC) or semi-honest model
  – Parties follow the protocol, while being curious to learn about another party’s data
  – Most existing PPRL protocols assume the HBC model
● Malicious model
  – Parties may behave arbitrarily, not following the protocol
  – Evaluating privacy under the malicious model is difficult
● Advanced models
  – Accountable computing and the covert model allow identifying, with a certain probability, if a party has not followed the protocol
  – Lower complexity than the malicious model, and more secure than HBC

Attack models
● Collusion
  – A set of parties (in three-party and multi-party protocols) collude with the aim of learning about another party’s data
● Dictionary attack
  – Mask a list of known values using existing masking functions (such as SHA or MD5) until a matching masked value is identified (the attack is sketched below)
  – A keyed masking approach, like HMAC, can overcome this attack
● Frequency attack
  – The frequency distribution of masked values is matched against the distribution of known values
● Cryptanalysis attack
  – A special type of frequency attack that is applicable to Bloom filters
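A minimal sketch of a dictionary attack against unkeyed masking; the observed digest and the candidate list are made up for illustration:

```python
import hashlib

# A masked value an attacker has observed (the unkeyed SHA-256 digest
# of a first name); the value itself is made up for illustration.
observed = hashlib.sha256(b'peter').hexdigest()

# Dictionary attack: mask known values with the same unkeyed function
# until one reproduces the observed digest.
for guess in ['gail', 'john', 'peter', 'mary']:  # e.g. from a public name list
    if hashlib.sha256(guess.encode('utf-8')).hexdigest() == observed:
        print('re-identified:', guess)

# With keyed masking (e.g. hmac.new(secret_key, value, hashlib.sha256))
# the attacker cannot recompute the digests, and the attack fails.
```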

PPRL techniques
● Several techniques have been developed:
  – Generalisation, such as k-anonymity, noise addition, and differential privacy
  – Secure multi-party computation (SMC), such as homomorphic encryption and secure summation
  – Probabilistic filters, such as Bloom filters and variations
● First generation (mid 1990s): exact matching only
● Second generation (early 2000s): approximate matching, but not scalable
● Third generation (mid 2000s): take scalability into account

Secure hash encoding
● First generation of PPRL techniques
● Use a one-way hash-encoding function (like SHA) to encode values, and then compare the hash-encoded values to identify matching records
● Only exact matching is possible (illustrated in the sketch below)
  – A single character difference in two values results in a pair of completely different hash-encoded values
    (for example, “peter” → “1010…0101”, and “pete” → “0111…1010”)
● Having only access to hash-codes makes it nearly impossible to learn the original values
● Frequency attacks are still possible
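A two-line illustration of this behaviour, using SHA-256 as the one-way hash-encoding function:

```python
import hashlib

# A single-character difference ('peter' vs 'pete') yields completely
# unrelated digests, so hash-encoded values support exact matching only.
for value in ('peter', 'pete'):
    print(value, '->', hashlib.sha256(value.encode('utf-8')).hexdigest())
```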

Noise and differential privacy
● Add noise to overcome frequency attacks, at the cost of a loss in linkage quality
● Differential privacy is an alternative to random noise addition
  – The probability of any property holding on the perturbed database is approximately the same whether or not an individual value is present in the database
  – The magnitude of the noise depends on the privacy budget and the sensitivity of the data (see the sketch below)
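For illustration, a minimal sketch of the Laplace mechanism applied to a count query; the privacy budget values and the count are illustrative assumptions:

```python
import numpy as np

def laplace_release(true_count, epsilon, sensitivity=1.0, rng=None):
    """Release a count perturbed with Laplace noise of scale
    sensitivity / epsilon (smaller epsilon = more noise = more privacy)."""
    rng = rng if rng is not None else np.random.default_rng()
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# A count query has sensitivity 1: adding or removing one individual's
# record changes the count by at most 1.
rng = np.random.default_rng(42)
for epsilon in (0.1, 1.0, 10.0):
    print(epsilon, round(laplace_release(100, epsilon, rng=rng), 2))
```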

Generalisation techniques
● Generalise the records to overcome frequency attacks
● Example: k-anonymity
  – Ensure every combination of attribute values is shared by at least k records (a simple check is sketched below)
● Other techniques: value generalisation hierarchies and binning (as covered earlier in the course)
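A simple check of this property over a small, hypothetical set of generalised records:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Check that every combination of quasi-identifier values
    is shared by at least k records."""
    groups = Counter(tuple(rec[qi] for qi in quasi_identifiers)
                     for rec in records)
    return all(count >= k for count in groups.values())

# Hypothetical records with generalised (binned) attribute values
records = [
    {'postcode': '26**', 'age': '20-30'},
    {'postcode': '26**', 'age': '20-30'},
    {'postcode': '29**', 'age': '40-50'},
]
print(is_k_anonymous(records, ['postcode', 'age'], k=2))  # False: one group has size 1
```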


Encryption and SMC
● Commutative and homomorphic encryption schemes are used
● Computationally expensive
● Secure scalar product, secure set intersection, secure set union, and secure summation are the most commonly used SMC techniques
● Example: secure summation of the values x1 = 6 and x2 = 5 using a LU
  – The LU generates a random value r = 50 and sends it to party P1
  – P1 adds its value x1: 50 + 6 = 56, and sends 56 to party P2
  – P2 adds its value x2: 56 + 5 = 61, and sends 61 to the LU
  – The LU subtracts r to obtain the sum: 61 − 50 = 11
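A minimal simulation of this secure summation protocol; the range of the random value is an arbitrary choice:

```python
import random

def secure_summation(private_values):
    """Simulate the ring-based secure summation from the example above."""
    r = random.randint(0, 10**6)  # random value generated by the LU
    total = r
    # Each party adds its private value to the running total and passes
    # it on; no party sees another party's individual input.
    for x in private_values:
        total += x
    # The LU subtracts r from the final total to obtain the true sum.
    return total - r

print(secure_summation([6, 5]))  # 11, as in the example (with r = 50 there)
```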

Bloom filters (1)
● Probabilistic data structure
  – A bit vector of l bits (initially all set to 0)
  – k independent hash functions are used to hash-map each element of a set S into the Bloom filter, by setting the corresponding bits to 1
● Example: encoding the q-grams (with q=2) of the string ‘peter’ (pe, et, te, er) into a Bloom filter of length l=9 bits using k=2 hash functions gives the bit vector 1 1 0 1 1 0 0 1 1 (a sketch of this encoding follows below)
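A minimal sketch of this encoding; simulating the k independent hash functions by salting SHA-256 with the function index is a common simplification, so the exact bit positions will differ from the illustration above:

```python
import hashlib

def q_grams(s, q=2):
    """Split a string into its overlapping substrings of length q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def bloom_filter(s, l=9, k=2, q=2):
    """Hash-map each q-gram of s into a length-l bit vector using k hash functions."""
    bits = [0] * l
    for gram in q_grams(s, q):
        for i in range(k):
            # Salting SHA-256 with the function index i simulates
            # k independent hash functions.
            digest = hashlib.sha256(f'{i}:{gram}'.encode('utf-8')).hexdigest()
            bits[int(digest, 16) % l] = 1
    return bits

print(q_grams('peter'))       # ['pe', 'et', 'te', 'er']
print(bloom_filter('peter'))  # a 9-bit vector with up to 8 bits set to 1
```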


Bloom filters (2)
● The Dice coefficient similarity of p Bloom filters (BFs) b1, …, bp is calculated as:

  Dice_sim(b1, …, bp) = (p × z) / (x1 + … + xp)

  where z is the number of common 1-bits in the p BFs, and xi is the number of 1-bits in BF bi
● Example: ‘peter’ gives b1 = 1 1 0 1 1 0 0 1 1 (x1 = 6) and ‘pete’ gives b2 = 1 1 0 1 1 0 0 0 1 (x2 = 5); the conjunction b1 ∧ b2 = 1 1 0 1 1 0 0 0 1 has z = 5 common 1-bits, so:

  Dice_sim(b1, b2) = (2 × 5) / (6 + 5) = 10/11 ≈ 0.91

  (the sketch below reproduces this computation)

Schnell et al., Privacy-preserving record linkage using Bloom filters, BMC Medical Informatics and Decision Making, 9(1), 2009
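The following sketch computes the two-filter case of this formula, reproducing the numbers of the example:

```python
def dice_similarity(b1, b2):
    """Dice coefficient similarity of two Bloom filters of equal length."""
    z = sum(i & j for i, j in zip(b1, b2))  # number of common 1-bits
    return 2 * z / (sum(b1) + sum(b2))      # (p * z) / (x1 + x2) with p = 2

b1 = [1, 1, 0, 1, 1, 0, 0, 1, 1]  # 'peter', x1 = 6
b2 = [1, 1, 0, 1, 1, 0, 0, 0, 1]  # 'pete',  x2 = 5
print(dice_similarity(b1, b2))    # 10/11 ≈ 0.91
```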

Bloom filters (3)
● Bloom filter-based matching
  – The similarity of Bloom filters can be calculated using a token-based similarity function, such as Jaccard, Dice, or Hamming
  – Dice is mostly used, as it is insensitive to the many matching zeros (a threshold-based classification of the example pair is sketched below)
● The similarity of Bloom filters is ≥ the similarity of the input values (due to the false positive rate of Bloom filters)
● The false positive rate determines privacy and accuracy
  – The larger the false positive rate, the higher the privacy but the lower the accuracy
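For the subsequent classification step, a similarity threshold can be applied to each compared Bloom filter pair; in this sketch the threshold of 0.8 is an illustrative assumption:

```python
def dice_similarity(b1, b2):
    """Dice coefficient of two bit vectors (as defined above)."""
    z = sum(i & j for i, j in zip(b1, b2))
    return 2 * z / (sum(b1) + sum(b2))

def classify(bf1, bf2, threshold=0.8):
    """Classify a compared Bloom filter pair as a match or non-match."""
    return 'match' if dice_similarity(bf1, bf2) >= threshold else 'non-match'

# The 'peter' / 'pete' pair from the earlier example (similarity ≈ 0.91)
print(classify([1, 1, 0, 1, 1, 0, 0, 1, 1], [1, 1, 0, 1, 1, 0, 0, 0, 1]))
```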

Bloom filters (4)
● Attacks on Bloom filters
  – Susceptible to cryptanalysis attacks: mapping bit patterns to q-grams and values, based on frequency and co-occurrence information
  – Several attack methods on basic Bloom filters have been developed
● We have recently developed a new efficient attack method that allows the re-identification of frequent attribute values
  – Bloom filter bit positions that have co-occurring patterns are identified using pattern mining
  – The most successful attack method so far
● Advanced Bloom filter hardening techniques are required

Christen et al., Pattern-mining based cryptanalysis of Bloom filters for privacy-preserving record linkage, PAKDD, 2018

Bloom filters (5)
[Figure from Christen et al., Pattern-mining based cryptanalysis of Bloom filters for privacy-preserving record linkage, PAKDD, 2018]

Linking Sensitive Data – the book
Springer, November 2020, 490 pages
“The book describes how linkage methods work and how to evaluate their performance. It covers all the major concepts and methods and also discusses practical matters such as computational efficiency, which are critical if the methods are to be used in practice – and it does all this in a highly accessible way!”
Prof David Hand, OBE, Imperial College London