COMP20008
Elements of Data Processing
Semester 1 2020
Lecture 19: Public Data Release and Individual Anonymity
Plan today
• Public release of wrangled data
  – Anonymity issues and pitfalls
  – How can anonymity be maintained?
The problem
• The public is concerned by reports that computer scientists can identify individuals hidden in anonymized data with “astonishing ease”.
https://fpf.org/wp-content/uploads/The-Re-identification-of-Governor-Welds-Medical-Information-Daniel-Barth-Jones.pdf
Example 1: Massachusetts mid 1990s
• Mid 1990s: Massachusetts Group Insurance Commission releases records about history of hospital visits of State employees
– Governor of Massachusetts assured public that personal identifiers had been deleted
• name, address, social security number deleted
• Zip code (post code), birth date, sex retained
• 1997: Latanya Sweeney, a PhD student, went searching for the Governor’s records in this dataset
– Purchased voter rolls of city where he lived
• Name, address, postcode, birth date, sex in rolls
• Only 6 people had same birth date as Governor
• Only 3 were men
• Of these, only one lived in his zip code
[Image: William Weld, https://upload.wikimedia.org/wikipedia/commons/3/39/WilliamWeld.jpg]
Example 2: Census Data
• Sweeney continued her research in privacy:
  – https://www.youtube.com/watch?v=tivCK_fBBfo
• She did a study of records from the 1990 US census, concluding that
  – 87% of Americans were uniquely identified by zip code, birth date and sex
  – 58% of Americans were uniquely identified by city, birth date and sex
• This led to changes in privacy legislation in the USA
  – http://latanyasweeney.org/work/identifiability.html
• Australia
  – Privacy Act 1988, census data
  – http://www.abs.gov.au/websitedbs/censushome.nsf/home/privacy
Example 3: Netflix Dataset
• 2006: Netflix publicly releases 6 years of data about its customers’ viewing habits
– Cinematch is the bit of software embedded in the Netflix Web site that analyzes each customer’s movie-viewing habits and recommends other movies that the customer might enjoy.
– https://www.nytimes.com/2008/11/21/technology/21iht-23netflixt.18049332.html
– An anonymous id is created for each user
– Sampled 10% of their data
– Slight data perturbation
• Aim: Help build better collaborative filtering algorithms (a 10% improvement over Cinematch)
  – US$1 million prize for the winning model
Anonymous ID | Star Wars | Batman | Jurassic World | The Martian | The Revenant | Lego Movie | Selma | …
A1           | 3         | 2      | –              | –           | –            | 1          | –     | …
A2           | –         | 1      | –              | –           | 1            | –          | 2     | …
A3           | 3         | –      | 2              | –           | 1            | –          | –     | …
Linking Netflix data with IMDb public data
• Two researchers, Narayanan and Shmatikov:
  – https://arxiv.org/pdf/cs/0610105v2.pdf
• Given knowledge about a person’s “public” movie ratings on IMDb, they showed it was possible to uncover that person’s “private” viewing history in the Netflix dataset
  – With 8 movie ratings (of which up to 2 may be wrong) and rating dates known to within ±2 weeks:
    • 99% of raters could be re-identified
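To make the linkage idea concrete, below is a much-simplified Python sketch of how approximate matching on (movie, rating, date) can single out a record. It is not Narayanan and Shmatikov's actual scoring algorithm, and all names, tolerances and data in it are hypothetical.

```python
from datetime import date, timedelta

def matches(aux_entry, record_entry, rating_tol=1, date_tol=timedelta(days=14)):
    """An auxiliary observation matches a record entry if the movie is the
    same and the rating/date are within the allowed tolerances."""
    movie, rating, when = aux_entry
    r_movie, r_rating, r_when = record_entry
    return (movie == r_movie
            and abs(rating - r_rating) <= rating_tol
            and abs(when - r_when) <= date_tol)

def best_candidate(aux, anonymised_records):
    """Return the anonymous id whose record matches the most auxiliary
    observations, together with its match count."""
    scores = {}
    for anon_id, entries in anonymised_records.items():
        scores[anon_id] = sum(any(matches(a, e) for e in entries) for a in aux)
    return max(scores.items(), key=lambda kv: kv[1])

# Attacker's public knowledge scraped from IMDb: (movie, rating, date)
aux = [("The Martian", 4, date(2015, 10, 12)),
       ("Selma", 5, date(2015, 2, 1))]

# Anonymised release: anonymous id -> list of (movie, rating, date)
anonymised_records = {
    "A1": [("The Martian", 5, date(2015, 10, 20)),
           ("Selma", 5, date(2015, 2, 3))],
    "A2": [("Batman", 3, date(2015, 7, 7))],
}

print(best_candidate(aux, anonymised_records))  # ('A1', 2)
```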
Measures of anonymity for individuals
• Removing explicit identifiers from a dataset is not enough
• Solutions
  – k-anonymity
  – l-diversity
• Terminology
– Explicit identifier: Unique for an individual
• name, national ID, Tax file number, account numbers
– Quasi-identifier: a combination of non-sensitive attributes that can be linked with external data to identify an individual
  • e.g. the {Gender, Age, Zip code} combination from earlier
– Sensitive attribute(s)
• Information that people don’t wish to reveal (e.g. medical condition)
Problem: If the data gets into the wrong hands
• If I know the target is a 35-year-old American living in zip code 13068
  – I can infer they have cancer
• If I know the target is a 28-year-old Russian living in zip code 13053
  – I can infer they have heart disease
l-Diversity: Privacy Beyond k-Anonymity. Machanavajjhala, Gehrke, Kifer and Venkitasubramaniam, 2007
k-anonymity
• “Produce a release of the data with scientific guarantees that the individuals who are the subjects of the data cannot be re-identified while the data remain practically useful.”
• A table satisfies k-anonymity if every record in the table is indistinguishable from at least k − 1 other records with respect to every set of quasi-identifier attributes; such a table is called a k-anonymous table.
• Hence, for every combination of values of the quasi-identifiers in the k-anonymous table, there are at least k records that share those values.
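As a minimal sketch of this definition (assuming the released table is held in a pandas DataFrame; the function names are illustrative), the largest k for which a table is k-anonymous is simply the size of its smallest quasi-identifier group:

```python
import pandas as pd

def k_anonymity_level(df: pd.DataFrame, quasi_identifiers: list) -> int:
    """Largest k for which df is k-anonymous with respect to the given
    quasi-identifier columns: the size of the smallest group of records
    sharing the same quasi-identifier values."""
    return int(df.groupby(quasi_identifiers).size().min())

def is_k_anonymous(df: pd.DataFrame, quasi_identifiers: list, k: int) -> bool:
    """True if every quasi-identifier combination occurs at least k times."""
    return k_anonymity_level(df, quasi_identifiers) >= k
```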
k-anonymity example
l-Diversity: Privacy Beyond k-Anonymity. Machanavajjhala, Gehrke, Kifer and Venkitasubramaniam, 2007
What level of k-anonymity is satisfied here?
• Sensitive attribute: COMP20008 Grade
• Quasi identifier: {Gender, Age, Hair Colour}
Student Name | Gender | Age | Hair Colour | COMP20008 Grade
7930c        | Male   | 20  | Brown       | 78
1a985        | Male   | 20  | Brown       | 88
04ed9        | Female | 19  | Red         | 75
82260        | Female | 19  | Red         | 85
e461e        | Female | 19  | Red         | 80
1e609        | Female | 21  | Brown       | 80
k = 1, 2, 3 or 4? What is the maximal k for which this table satisfies k-anonymity?
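One way to check the answer, sketched with pandas (the DataFrame below just re-enters the table above; column names follow the table):

```python
import pandas as pd

# The grade table from this slide.
grades = pd.DataFrame({
    "Student Name": ["7930c", "1a985", "04ed9", "82260", "e461e", "1e609"],
    "Gender": ["Male", "Male", "Female", "Female", "Female", "Female"],
    "Age": [20, 20, 19, 19, 19, 21],
    "Hair Colour": ["Brown", "Brown", "Red", "Red", "Red", "Brown"],
    "COMP20008 Grade": [78, 88, 75, 85, 80, 80],
})

# Size of each quasi-identifier group; the smallest group size is the maximal k.
# (Female, 21, Brown) occurs only once, so the table is only 1-anonymous.
group_sizes = grades.groupby(["Gender", "Age", "Hair Colour"]).size()
print(group_sizes)
print("maximal k:", int(group_sizes.min()))  # 1
```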
Another example: What level of anonymity?
• Sensitive attribute: Problem
• Quasi identifier: {Race, Birth, Gender, ZIP}
https://epic.org/privacy/reidentification/Sweeney_Article.pdf
How to Achieve k-anonymity
• Generalization
– Make the quasi identifiers less specific
– Column level
– Example: race
http://www.springerlink.com/content/ht1571nl63563x16/fulltext.pdf
How to Achieve k-anonymity- continued
• Generalization
– Example: Zip code
– When generalizing 94138, which is the better strategy?
  • 9413*
  • *4138
  • Parkville 3078: 307* or *078?
http://www.springerlink.com/content/ht1571nl63563x16/fulltext.pdf
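A minimal sketch of column-level generalization for post/zip codes and ages (the helper names and the age-band width are illustrative). Keeping the leading digits (e.g. 9413*) is usually the better strategy, because the prefix encodes the broad geographic area, so the generalized values still describe meaningful regions.

```python
def generalise_zip(zip_code: str, digits_kept: int = 4) -> str:
    """Keep the leading digits (the broad area) and blank out the rest,
    e.g. 94138 -> 9413*."""
    return zip_code[:digits_kept] + "*" * (len(zip_code) - digits_kept)

def generalise_age(age: int, width: int = 10) -> str:
    """Replace an exact age with an age band, e.g. 23 -> '20-29'."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

print(generalise_zip("94138"))     # 9413*
print(generalise_zip("3078", 3))   # 307*
print(generalise_age(23))          # 20-29
```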
How to Achieve k-anonymity- continued
• Suppression
– Remove (suppress) the quasi-identifier values completely
– Used to moderate the generalization process: a limited number of outliers can be suppressed rather than generalizing every record further
– Row, column and cell level
– Example:
• Removing the last two lines
• Generalizing zip code to 941**
• Generalizing race to person
http://www.springerlink.com/content/ht1571nl63563x16/fulltext.pdf
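A rough pandas sketch of suppression (function names and the DataFrame layout are assumptions): rows whose generalized quasi-identifier combination is still too rare are dropped, and a whole column can be suppressed by replacing its values with *.

```python
import pandas as pd

def suppress_small_groups(df: pd.DataFrame, quasi_identifiers: list, k: int) -> pd.DataFrame:
    """Row-level suppression: drop the outlier records whose (generalized)
    quasi-identifier combination is shared by fewer than k records."""
    sizes = df.groupby(quasi_identifiers)[quasi_identifiers[0]].transform("size")
    return df[sizes >= k]

def suppress_column(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Column-level suppression: replace every value of the column with '*'."""
    out = df.copy()
    out[column] = "*"
    return out
```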
k-anonymity: recap
• In the worst case, if the data gets into the wrong hands, an attacker can only narrow a quasi-identifier match down to a group of at least k individuals
• Data publisher needs to
– Determine the quasi-identifier(s)
– Choose the parameter k
Attack on k-anonymity I: Homogeneity attack
• k-anonymity can create groups that leak information due to lack of diversity in the sensitive attribute.
– Alice knows that Bob is a 31-year-old American male who lives in the zip code 13053. Therefore, Alice knows that Bob’s record number is 9,10,11, or 12
• Alice can conclude that Bob has cancer if she sees the data
l-Diversity: Privacy Beyond k-Anonymity. Machanavajjhala, Gehrke, Kifer and Venkitasubramaniam, 2007
Attack on k-anonymity II: Background attack
• k-anonymity does not protect against attacks based on background knowledge.
– Alice knows that Umeko is a 21 year-old Japanese female who currently lives in zip code 13068.
– She knows that Umeko’s information is contained in record number 1, 2, 3, or 4.
– She concludes that Umeko has a viral infection, since Japanese people have a very low incidence of heart disease
l-Diversity: Privacy Beyond k-Anonymity. Machanavajjhala, Gehrke, Kifer and Venkitasubramaniam, 2007
Solution: l-diversity
• Make the sensitive attribute diverse within each group.
– l-diversity: for each k-anonymous group, there are at least l different values of the sensitive attribute
l-Diversity: Privacy Beyond k-Anonymity. Machanavajjhala, Gehrke, Kifer and Venkitasubramaniam, 2007
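A minimal pandas sketch of measuring l-diversity, analogous to the k-anonymity check earlier (the function name is an illustrative choice):

```python
import pandas as pd

def l_diversity_level(df: pd.DataFrame, quasi_identifiers: list, sensitive: str) -> int:
    """Largest l such that every quasi-identifier group contains at least
    l distinct values of the sensitive attribute."""
    return int(df.groupby(quasi_identifiers)[sensitive].nunique().min())

# For the grade table shown earlier, the (Female, 21, Brown) group contains a
# single grade value, so that table is only 1-diverse:
# l_diversity_level(grades, ["Gender", "Age", "Hair Colour"], "COMP20008 Grade")
```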
Limitations of l-diversity
• An imbalanced (skewed) distribution of the sensitive attribute within a group may still leak information and limit usefulness
• Can be unnecessary and difficult to achieve
• Similarity attacks: the sensitive values in a group may be distinct yet semantically similar (e.g. all grades within a narrow band, or all diagnoses for related conditions)
Location & Trajectory Privacy
• What about datasets that record information about an individual in time and space?
• Location data is being collected and stored throughout the day
– GPS-enabled smart phones, cars, and wearable devices
– Wi-Fi access points
– Cell towers
– Geo-tagged tweets, Facebook status, location check-ins …
Trajectory
• A function from time to geographical space

ID     | GPS-Latitude | GPS-Longitude | Time
111478 | 33.692771    | -111.993959   | 11:52
111478 | 33.692752    | -111.993895   | 11:54
111478 | 33.692723    | -111.993581   | 11:56
111478 | 33.692804    | -111.993464   | 11:58
111478 | 33.69314     | -111.993223   | 12:28
111478 | 33.69317     | -111.993192   | 12:30
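Viewing a trajectory as a function from time to space: a small pandas sketch that stores the sampled points above in a time-indexed frame and interpolates a location at an unsampled time (linear interpolation here is just an illustrative assumption).

```python
import pandas as pd

# Sampled trajectory points for user 111478 (from the table above),
# indexed by time of day.
points = pd.DataFrame(
    {"lat": [33.692771, 33.692752, 33.692723, 33.692804],
     "lon": [-111.993959, -111.993895, -111.993581, -111.993464]},
    index=pd.to_datetime(["11:52", "11:54", "11:56", "11:58"]),
)

def location_at(trajectory: pd.DataFrame, when: str) -> tuple:
    """Estimate (lat, lon) at an arbitrary time by interpolating between
    the surrounding samples."""
    t = pd.to_datetime(when)
    filled = trajectory.reindex(trajectory.index.union([t])).interpolate(method="time")
    row = filled.loc[t]
    return float(row["lat"]), float(row["lon"])

print(location_at(points, "11:55"))  # roughly midway between the 11:54 and 11:56 points
```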
Location & Trajectory Data: Privacy Concerns
• Status quo of current mobile systems
– Able to continuously monitor, communicate, and process information about a person’s location
– Have a high degree of spatial and temporal precision and accuracy
– Might be linked with other data
• Analyzing and sharing location datasets has significant privacy implications
– Personal safety, e.g., stalking, assault
– Location-based profiling, e.g., Facebook
– Intrusive inferences, e.g. individual’s political views, personal preferences, health conditions
Inference Attacks – Example
• A user’s Monday to Thursday trips
– Home/work location pair may lead to a small set of potential individuals -> only {Bob, Alice} travel from A to B
[Figure: space-time diagram (x, y, t axes) of a trip from A at 8 am to B at 2 pm, made up of walk, stop and car segments]
Inference Attacks – Example
• The same user’s Friday trips
– Regular visit to a heart hospital -> Alice is Japanese, so most probably the user is Bob
[Figure: the same space-time diagram for Friday, with an additional stop at a Hospital point of interest (POI) between A and B]
Inference Attacks – Example
• Bob’s Saturday trips
– We can learn about his habits, preferences, etc.
[Figure: space-time diagram of a Saturday trip starting from A between 11 am and 2 pm, with walk, train and stop segments and a visit to a Book Club point of interest (POI)]
Anonymity: Cloaking
• k-anonymity
– Individuals are k-anonymous if their location information cannot be distinguished from that of k−1 other individuals
– Don’t report the exact latitude/longitude; report the smallest region for which the individual is k-anonymous
• Spatial cloaking
– Gruteser & Grunwald use quadtrees
– Adapt the spatial precision of location information about a person according to the number of other people in the same quadrant
Spatial Cloaking (kmin = 4)
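A simplified sketch of the quadrant-subdivision idea (not Gruteser & Grunwald's exact algorithm): keep descending into the quadrant that contains the user only while that quadrant still holds at least kmin users, and report the last quadrant that did. The coordinates, the list of other users and the unit-square region are all illustrative assumptions.

```python
def cloak(user, others, region=(0.0, 0.0, 1.0, 1.0), k_min=4):
    """Return the smallest quadtree quadrant that contains `user` together
    with at least k_min - 1 other users. Coordinates are assumed to be
    normalised to the unit square; region is (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = region
    xm, ym = (x0 + x1) / 2, (y0 + y1) / 2

    # The child quadrant of `region` that contains the user.
    cx0 = x0 if user[0] < xm else xm
    cy0 = y0 if user[1] < ym else ym
    child = (cx0, cy0, cx0 + (x1 - x0) / 2, cy0 + (y1 - y0) / 2)

    def inside(p, r):
        return r[0] <= p[0] < r[2] and r[1] <= p[1] < r[3]

    # Only descend while the child quadrant still satisfies the k_min
    # requirement (counting the user plus the other users inside it).
    if 1 + sum(inside(p, child) for p in others) >= k_min:
        return cloak(user, [p for p in others if inside(p, child)], child, k_min)
    return region

# Example: the reported region is the quadrant (0, 0, 0.5, 0.5), which holds
# the user and three of the other users, i.e. at least k_min = 4 people.
others = [(0.10, 0.12), (0.15, 0.20), (0.40, 0.35), (0.80, 0.90)]
print(cloak((0.12, 0.18), others, k_min=4))
```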
Obfuscation
• Idea
– Mask an individual’s precise location
– Deliberately degrade the quality of information about an individual’s location (imperfect information)
– Report a region (bounding box) based on the individual’s privacy requirement (size of region); a larger region means a stricter privacy requirement
– Identity can be revealed (obfuscation protects the location, not the identity)
• Assumption
– Spatial imperfection ≈ privacy
– The greater the imperfect knowledge about a user’s location, the greater the user’s privacy
Actual location: (x, y) → Reported location: region
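An illustrative obfuscation sketch (not a specific published algorithm): instead of the exact point, report a bounding box whose size reflects the privacy requirement, shifted by a random offset so that the centre of the box does not give the true location away. The coordinates and radius below are made up.

```python
import random

def obfuscate(lat: float, lon: float, radius: float = 0.01) -> tuple:
    """Report a square region of side 2*radius that contains the true point
    at a uniformly random position. A larger radius means a stricter privacy
    requirement (and a less precise, less useful answer)."""
    centre_lat = lat + random.uniform(-radius, radius)
    centre_lon = lon + random.uniform(-radius, radius)
    return (centre_lat - radius, centre_lon - radius,
            centre_lat + radius, centre_lon + radius)

# Actual location (x, y) -> reported region (min_lat, min_lon, max_lat, max_lon)
print(obfuscate(-37.7857, 144.9522))
```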
Motivation for Obfuscation
• Finding the closest Sushi restaurant
[Figure: a map of Princess Park and nearby sushi restaurants (Ichiban, Yo! Sushi, Sushi Ten). The visitor asks the location-based service provider: “I am in Princess Park. What is the closest sushi restaurant?” and receives the answer “Sushi Ten”.]
Overview of Privacy Models
• Location privacy vs. trajectory privacy
[Figure: the same set of location points shown three ways: exact location points, 3-anonymized location points, obfuscated location points]
• Challenge: Is separately anonymising a user’s location at each time point enough? What if they are moving and we have their whole trajectory?
  – Anonymisation becomes much harder!
Summary
• To reduce risk of re-identification of individuals in released datasets
– Choose value of k
– Manipulate data to make it k-anonymous, either
• Replace categories by broader categories
• Suppress attribute values with a * (limited utility)
– Further manipulate the data to make it l-diverse
• Ensure there are at least l different values of the sensitive attribute in each group
• Privacy is difficult to maintain in high-dimensional datasets like trajectory datasets
– Cloaking provides spatial k-anonymity
– Obfuscation ensures location imprecision
Acknowledgements
This lecture was prepared using some material adapted from:
• Massachusetts story
– https://epic.org/privacy/reidentification/ohm_article.pdf
• From a social science perspective
– http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1450006
• l-diversity
– https://www.cs.cornell.edu/~vmuthu/research/ldiversity.pdf