COMP20008 Elements of Data Processing

Public Data Release and Individual Anonymity
School of Computing and Information Systems
©University of Melbourne 2022


Plan today
Public release of wrangled data – anonymity issues and pitfalls – How can it be maintained?

The problem
The public is concerned that computer scientists can purportedly identify individuals hidden in anonymized data with "astonishing ease."
https://fpf.org/wp-content/uploads/The-Re-identification-of-Governor-Welds-Medical-Information-Daniel-Barth-Jones.pdf

Example 1: Massachusetts, mid 1990s
• Mid 1990s: Massachusetts Group Insurance Commission releases records about the history of hospital visits of State employees
– Governor of Massachusetts assured the public that personal identifiers had been deleted
• name, address, social security number deleted
• Zip code (postcode), birth date, sex retained
• 1997: Latanya Sweeney, a PhD student, went searching for the Governor's records in this dataset
– Purchased voter rolls of city where he lived
• Name, address, postcode, birth date, sex in rolls
• Only 6 people had same birth date as Governor
• Only 3 were men
• Of these, only one lived in his zipcode …….

Example 2: Census Data
• Sweeney continued her research in privacy: https://www.youtube.com/watch?v=2vCK_fBBfo
• She did a study of records from the 1990 USA census, concluding that
• 87% of Americans are uniquely identified by zip code, birth date and sex
• 58% of Americans are uniquely identified by city, birth date and sex
• Led to changes in privacy legislation in the USA: http://latanyasweeney.org/work/identifiability.html
• Australia
• Privacy Act 1988, census data
• http://www.abs.gov.au/websitedbs/censushome.nsf/home/privacy

Example 3: Netflix
• 2006: Netflix publicly releases 6 years of data about its customers' viewing habits
• Cinematch is the bit of software embedded in the Netflix Web site that analyzes each customer's movie-viewing habits and recommends other movies that the customer might enjoy.
• https://www.nytimes.com/2008/11/21/technology/21iht-23netflixt.18049332.html
• An anonymous id is created for each user
• Sampled 10% of their data
• Slight data perturbation
• Aim: Help to build better collaborative filtering algorithms (10% improvement over Cinematch)
• US$1 million prize for a model

Linking Netflix data with IMDb public data
• Two researchers, Narayanan and Shmatikov:
• https://arxiv.org/pdf/cs/0610105v2.pdf
• Given knowledge about a person's "public" movie habits on IMDb, they showed it was possible to uncover their "private" movie habits in the Netflix dataset
• With 8 movie ratings (allowing ≤ 2 wrong ratings and dates known to ±2 weeks), 99% of raters were re-identified
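To make the linkage idea concrete, here is a toy sketch (my illustration under simplifying assumptions, not Narayanan and Shmatikov's actual algorithm): score each anonymised record by how many of the attacker's known (movie, rating, date) triples it approximately matches, then pick the best-scoring record. All data below is hypothetical.

```python
# Toy linkage attack: match a public auxiliary profile against "anonymised" records.
from datetime import date

def match_score(aux: dict, record: dict, rating_tol: int = 1, date_tol_days: int = 14) -> int:
    """Count auxiliary (movie -> (rating, date)) entries consistent with a record."""
    score = 0
    for movie, (rating, when) in aux.items():
        if movie in record:
            r, d = record[movie]
            if abs(r - rating) <= rating_tol and abs((d - when).days) <= date_tol_days:
                score += 1
    return score

records = {  # hypothetical "anonymised" dataset: id -> {movie: (rating, date)}
    "user_17": {"Movie A": (5, date(2005, 3, 1)), "Movie B": (2, date(2005, 4, 2))},
    "user_42": {"Movie A": (1, date(2005, 6, 9)), "Movie C": (4, date(2005, 7, 1))},
}
aux = {"Movie A": (5, date(2005, 3, 5)), "Movie B": (2, date(2005, 4, 10))}  # public IMDb-style info
print(max(records, key=lambda uid: match_score(aux, records[uid])))  # -> user_17
```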

Public Data Release and Individual Anonymity
School of Computing and Information Systems
©University of Melbourne 2022

Measures of anonymity for individuals
• Removing explicit identifiers from a dataset is not enough
• Solutions
• k-anonymity
• l-diversity
• Terminology
• Explicit identifier: Unique for an individual
• name, national ID, Tax file number, account numbers
• Quasi-identifier: A combination of non-sensitive attributes that can be linked with external data to identify an individual
• E.g. the {Gender, Age, Zip code} combination from earlier
• Sensitive attribute(s)
• Information that people don't wish to reveal (e.g. medical condition)

Problem: If the data gets into the wrong hands
• If I know the target is a 35-year-old American living in zip code 13068, I can infer they have cancer
• If I know the target is a 28-year-old Russian living in zip code 13053, I can infer they have heart disease
l-Diversity: Privacy Beyond k-Anonymity. Machanavajjhala, Gehrke, Kifer and Venkitasubramaniam, 2007

k-anonymity
• “Produce a release of the data with scientific guarantees that the individuals who are the subjects of the data cannot be re- identified while the data remain practically useful.’’
• A table satisfies k-anonymity if every record in the table is indistinguishable from at least k − 1 other records with respect to every set of quasi-identifier attributes; such a table is called a k- anonymous table.
• Hence, for every combination of values of the quasi-identifiers in the k-anonymous table, there are at least k records that share those values.

k-anonymity example
l-Diversity: Privacy Beyond k-Anonymity. Machanavajjhala, Gehrke, Kifer and Venkitasubramaniam, 2007

What level of k-anonymity is satisfied here?
• Sensitive attribute: COMP20008 Grade
• Quasi-identifier: {Gender, Age, Hair Colour}
[Table of student records omitted; its columns include Student Name, Hair Colour and COMP20008 Grade]
k = 1, 2, 3 or 4? What is the maximal k for which it satisfies k-anonymity?
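As a rough illustration of how this check could be automated (a sketch that is not part of the original slides; the records and column values below are hypothetical), the maximal k is simply the size of the smallest group of records sharing the same quasi-identifier values:

```python
# Compute the maximal k for which a table is k-anonymous w.r.t. a quasi-identifier.
import pandas as pd

def max_k_anonymity(df: pd.DataFrame, quasi_identifiers: list[str]) -> int:
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return int(group_sizes.min())

records = pd.DataFrame({
    "Gender":      ["F", "F", "M", "M", "M"],
    "Age":         [20, 20, 21, 21, 21],
    "Hair Colour": ["Brown", "Brown", "Black", "Black", "Black"],
    "Grade":       ["H1", "H2A", "H1", "P", "H2B"],
})

print(max_k_anonymity(records, ["Gender", "Age", "Hair Colour"]))  # -> 2
```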

How to achieve k-anonymity
• Generalization
– Make the quasi-identifiers less specific
– Column level
– Example: race
http://www.springerlink.com/content/ht1571nl63563x16/fulltext.pdf

How to achieve k-anonymity – continued
• Generalization
– Example: Zip code. When generalizing 94138, which one is a better strategy?
• Parkville 3078 → 307*
http://www.springerlink.com/content/ht1571nl63563x16/fulltext.pdf

How to achieve k-anonymity – continued
• Suppression
– Remove (suppress) the quasi-identifiers completely
– Moderate the generalization process
– Limited number of outliers
– Row, column and cell level
– Example:
• Removing the last two lines
• Generalizing zip code to 941**
• Generalizing race to person
http://www.springerlink.com/content/ht1571nl63563x16/fulltext.pdf
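A minimal sketch of how generalization and suppression might be applied with pandas (my illustration under assumptions; the records, zip codes and the choice of k below are hypothetical, and this is not the cited paper's algorithm):

```python
# Column-level generalization (truncate zip codes, replace race with "Person")
# followed by row-level suppression of quasi-identifier groups smaller than k.
import pandas as pd

def generalise_zip(zipcode: str, digits_kept: int) -> str:
    """Replace the trailing digits of a zip code with '*'."""
    return zipcode[:digits_kept] + "*" * (len(zipcode) - digits_kept)

def suppress_small_groups(df: pd.DataFrame, quasi_identifiers: list[str], k: int) -> pd.DataFrame:
    """Drop records whose quasi-identifier group has fewer than k members."""
    return df.groupby(quasi_identifiers).filter(lambda g: len(g) >= k)

df = pd.DataFrame({"Zip": ["94138", "94139", "94141", "3078"],
                   "Race": ["Asian", "White", "Black", "White"]})
df["Zip"] = df["Zip"].apply(generalise_zip, digits_kept=3)   # 94138 -> 941**
df["Race"] = "Person"                                        # generalise race to 'Person'
df = suppress_small_groups(df, ["Zip", "Race"], k=2)         # suppress remaining outliers
print(df)
```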

k-anonymity: recap
• In the worst case, if data gets into the wrong hands, a quasi-identifier can only be narrowed down to a group of k individuals
• Data publisher needs to
• Determine the quasi-identifier(s)
• Choose parameter k

Attack on k-anonymity I: Homogeneity attack
• k-anonymity can create groups that leak information due to lack of diversity in the sensitive attribute.
• Alice knows that Bob is a 31-year-old American who lives in zip code 13053. Therefore, Alice knows that Bob's record is number 9, 10, 11, or 12
• Alice can conclude that Bob has cancer if she sees the data
l-Diversity: Privacy Beyond k-Anonymity. Machanavajjhala, Gehrke, Kifer and Venkitasubramaniam, 2007

Attack on k-anonymity II: Background attack
k-anonymity does not protect against attacks based on background knowledge.
• Alice knows that Umeko is a 21-year-old Japanese who currently lives in zip code 13068.
• She knows that Umeko's information is contained in record number 1, 2, 3, or 4.
• She concludes that Umeko has a viral infection, since Japanese have a very low incidence of heart disease
l-Diversity: Privacy Beyond k-Anonymity. Machanavajjhala, Gehrke, Kifer and Venkitasubramaniam, 2007

Solution: l-diversity
• Make the sensitive attribute diverse within each group.
• l-diversity: For each k-anonymous group, there are at least l different sensitive attribute values
l-Diversity: Privacy Beyond k-Anonymity. Machanavajjhala, Gehrke, Kifer and Venkitasubramaniam, 2007
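A small sketch of how distinct l-diversity could be checked (my illustration, not the paper's algorithm; the table below is hypothetical and only loosely echoes the paper's example):

```python
# l is the smallest number of distinct sensitive values in any quasi-identifier group.
import pandas as pd

def min_l_diversity(df: pd.DataFrame, quasi_identifiers: list[str], sensitive: str) -> int:
    """Return the smallest number of distinct sensitive values over all QI groups."""
    return int(df.groupby(quasi_identifiers)[sensitive].nunique().min())

df = pd.DataFrame({"Zip": ["130**"] * 4 + ["148**"] * 4,
                   "Age": ["<30"] * 4 + ["3*"] * 4,
                   "Condition": ["Cancer", "Cancer", "Flu", "Flu",
                                 "Cancer", "Heart", "Flu", "Cancer"]})
print(min_l_diversity(df, ["Zip", "Age"], "Condition"))  # -> 2, so the table is 2-diverse
```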

Limitations
• Suppose only {Bob, Alice} travel from A to B

Inference Attacks – Example
• The same user's Friday trips
• Regular visits to a heart hospital → Alice is Japanese, so most probably the user is Bob

Inference Attacks – Example
• Bob's Saturday trips
• We can learn about his habits, preferences, etc.

Anonymity: Cloaking
• k-anonymity
• Individuals are k-anonymous if their location information cannot be distinguished from k−1 other individuals
• Don’t report exact latitutde/longitude. Report smallest region for which I am k-anonymous
• Spatial cloaking
• Gruteser & Grunwald use quadtrees
• Adapt the spatial precision of location information about a person according to the number of other people in the same quadrant
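A rough sketch in the spirit of quadtree-based spatial cloaking (an illustrative assumption, not Gruteser and Grunwald's exact algorithm): keep subdividing the space while the quadrant containing the user still holds at least k_min people, and report the last quadrant that did. The coordinates below are hypothetical.

```python
# Report the smallest quadrant that still contains the user and at least k_min people.

def cloak(user: tuple, others: list, k_min: int,
          region=(0.0, 0.0, 1.0, 1.0), max_depth: int = 10) -> tuple:
    """Return a quadrant (x0, y0, x1, y1) containing the user and >= k_min people."""
    x0, y0, x1, y1 = region
    for _ in range(max_depth):
        xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
        # The child quadrant that contains the user.
        cx0, cx1 = (x0, xm) if user[0] < xm else (xm, x1)
        cy0, cy1 = (y0, ym) if user[1] < ym else (ym, y1)
        inside = [p for p in others if cx0 <= p[0] < cx1 and cy0 <= p[1] < cy1]
        if len(inside) + 1 < k_min:        # +1 counts the user themselves
            break                          # subdividing further would break k-anonymity
        x0, y0, x1, y1 = cx0, cy0, cx1, cy1
    return (x0, y0, x1, y1)

others = [(0.1, 0.1), (0.15, 0.2), (0.4, 0.45), (0.8, 0.7), (0.9, 0.9)]
print(cloak(user=(0.2, 0.3), others=others, k_min=4))  # reports a quarter of the space
```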

Spatial Cloaking (kmin = 4)

Obfuscation
• Mask an individual's precise location
• Deliberately degrade the quality of information about an individual's location (imperfect information)
• Report a region (bounding box), based on the individual's privacy requirement (size of region). A larger region means a stricter privacy requirement
• Identity can be revealed
• Assumption
• Spatial imperfection ≈ privacy
• The greater the imperfect knowledge about a user's location, the greater the user's privacy
Actual Location: (x, y) Reported Location: Region
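A minimal sketch of location obfuscation (my illustration, not a specific published method): report a bounding box whose size reflects the individual's privacy requirement, randomly shifted so the true location is not always at the centre. The coordinates below are hypothetical.

```python
# Report a region (bounding box) instead of the exact location.
import random

def obfuscate(x: float, y: float, box_size: float) -> tuple:
    """Return a bounding box (x0, y0, x1, y1) of side box_size that contains (x, y)."""
    dx = random.uniform(0, box_size)   # random shift keeps the true point off-centre
    dy = random.uniform(0, box_size)
    return (x - dx, y - dy, x - dx + box_size, y - dy + box_size)

print(obfuscate(144.9631, -37.8136, box_size=0.02))  # a region, not the exact point
```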

Motivation for Obfuscation
• Finding the closest sushi restaurant via a location-based service provider
Q: I am in Princess Park. What is the closest sushi restaurant?
A: Sushi Ten

Overview of Privacy Models
• Location privacy vs. trajectory privacy
[Figure omitted: exact location points, 3-anonymized location points, obfuscated location points]
Challenge:
• Is separately anonymising a user's location at each time point enough? What if they are moving and we have their trajectory?
• Anonymising trajectories becomes harder!

Summary
• To reduce risk of re-identification of individuals in released datasets
• Choose value of k
• Manipulate data to make it k-anonymous
• Replace categories by broader categories
• Suppress attributes with a * (limited utility)
• Ensure there are at least l different values of the sensitive attribute in each group
• Further manipulate data to make it l-diverse
• Privacy is difficult to maintain in high-dimensional datasets like trajectory datasets
• Cloaking provides spatial k-anonymity
• Obfuscation ensures location imprecision

Acknowledgements
This lecture was prepared using some material adapted from:
• Massachusetts story
• https://epic.org/privacy/reidentification/ohm_article.pdf
• From a social science perspective
• http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1450006
• l-diversity
• https://www.cs.cornell.edu/~vmuthu/research/ldiversity.pdf

Differen’al Privacy – Local and Global
School of Computing and Information Systems
©niversity of Melbourne

Differential privacy
“The future of privacy is lying” – (April 10 2013, , )

Negative data surveys
Negative data survey – ask people to lie, and then make inferences based on the aggregate answers


Negative data surveys – local privacy
• Participants select a choice that does not fit their situation
• Providing more choices provides more privacy
• May be challenging to design appropriate questions
• Reliance on honesty of the respondents
• This is an example of a local type of privacy: each person is responsible for adding noise to their own data
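A sketch of how aggregate inference from a negative survey could work (my illustration, assuming each respondent picks uniformly at random among the categories that do not apply to them): with t categories and N respondents, the expected negative count for category j is (N - n_j)/(t - 1), so n_j can be estimated as N - (t - 1) * observed count. The survey categories and true counts below are hypothetical.

```python
# Simulate a negative survey and recover the true counts from the aggregate answers.
import random
from collections import Counter

def negative_survey(true_answers: list[str], categories: list[str]) -> Counter:
    """Each respondent reports a category that does NOT fit their situation."""
    return Counter(random.choice([c for c in categories if c != a]) for a in true_answers)

def estimate_true_counts(responses: Counter, categories: list[str], n: int) -> dict:
    """Invert the mechanism: n_j ~= N - (t - 1) * observed_count_j."""
    t = len(categories)
    return {c: n - (t - 1) * responses[c] for c in categories}

categories = ["never", "occasionally", "weekly", "daily"]
truth = ["never"] * 700 + ["occasionally"] * 200 + ["weekly"] * 80 + ["daily"] * 20
responses = negative_survey(truth, categories)
print(estimate_true_counts(responses, categories, n=len(truth)))  # roughly 700/200/80/20
```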

Differential privacy: Local and global
• Local:
• Each person is responsible for adding noise to their own data. Classic survey example: each person has to answer the question "Do you use drugs?"
• They flip a coin in secret and answer "Yes" if it comes up heads, but tell the truth otherwise.
• Plausible deniability about a "Yes" answer
• Global:
• We have a sensitive dataset, a trusted data owner Alice, and a researcher Bob. Alice does analysis on the raw data, adds noise to the answers, and reports the (noisy) answers to Bob
• More accurate, less noise needed.
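A minimal simulation of the coin-flip scheme above (local differential privacy via randomised response); the 10% true drug-use rate below is hypothetical.

```python
# Randomised response: answer "Yes" on heads, otherwise answer truthfully.
import random

def randomised_response(truth: bool) -> bool:
    """Answer 'Yes' on heads; otherwise tell the truth (the scheme on this slide)."""
    return True if random.random() < 0.5 else truth

n, true_rate = 100_000, 0.10
reports = [randomised_response(random.random() < true_rate) for _ in range(n)]
observed_yes = sum(reports) / n            # ~= 0.5 + 0.5 * true_rate
estimated_rate = 2 * observed_yes - 1      # unbias by inverting the mechanism
print(round(estimated_rate, 3))            # ~= 0.10, without trusting any single answer
```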

Global differential privacy: Where?
Since its introduction in 2006:
– US Census Bureau in 2012: On The Map project
• Where people are employed and where they live
– Apple in 2016: iOS 10
• User data collection, e.g. for emoji suggestions
– NSW Department of Transport open release of 2016 ticketing system data

[Diagram omitted: k-anonymity and l-diversity privatize the original data to produce an anonymous dataset for release; differential privacy instead keeps the original data and releases the results of a privatized analysis]

What is being protected?
Imagine a survey is asking you:
– Question: Are you a smoker?
– Result: Number of smokers will be reported
– Would you take part in this dataset/survey?

What is being protected?
• I would feel safe submitting the survey if:
• I know the chance that the privatized result would be R is nearly the same, whether or not I take part in the survey.
• Does this mean that an individual's answer has no impact on the released result?

Overview of the process: Global differential privacy
Original Data
Privatized Analysis
The privatized analysis comprises two steps:
– Query the data and obtain the real result, e.g., how many female students are in the survey?
– Add random noise to hide the presence/absence of any individual. Release noisy result to the user.

The released results will be different each time (a different amount of noise is added)
• Query: How many females in the dataset? (true result = 32)
• Generate some random values, according to a distribution with mean value 0: {1, 2, -2, -1, 0, -3, 1, 0}, add to true result and release
1. Released result=33 (32+1)
2. Released result=34 (32+2)
3. Released result=30 (32-2)
4. Released result=31 (32-1)
5. Released result=32 (32+0)
6. Released result=29 (32-3)
7. Released result=33 (32+1)
8. Released result=32 (32+0)
• On average, the released result will be 32, but observing a single released result doesn’t give the adversary exact knowledge
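A small sketch of releasing such noisy counts (the slide only requires zero-mean noise; Laplace noise is assumed here as one common concrete choice):

```python
# Release a true count plus zero-mean noise, so repeated queries give different answers.
import numpy as np

rng = np.random.default_rng()

def release_count(true_count: int, noise_scale: float = 1.0) -> int:
    """Return the true count plus zero-mean Laplace noise, rounded to an integer."""
    return int(round(true_count + rng.laplace(loc=0.0, scale=noise_scale)))

true_result = 32  # e.g. number of females in the dataset
print([release_count(true_result) for _ in range(8)])  # e.g. values scattered around 32
```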

Emoji scenario and use of differential privacy
• A developer wants to understand which emojis are popular, in order to make better recommendations. There is a database recording, for each user, the emoji they used today.
• Query from developer: How many times was a particular emoji used today?
• The system will release a noisy result to the developer, to protect customer privacy.

The promise of differential privacy
• The chance that the noisy released result will be R is nearly the same, whether or not an individual participates in the dataset.
A = Probability that result is R (possible world where I participate)
B = Probability that result is R (possible world where I do not participate)
• If we can guarantee A ≈ B (A is very close to B), then no one can guess which possible world resulted in R.

Differential Privacy – Local and Global
School of Computing and Information Systems
©University of Melbourne 2022

The promise of differential privacy
• Does this mean that the attacker cannot learn anything sensitive
about individuals from the released results?

Differential privacy: How?
• How much noise should we add to the result? This depends on:
– Privacy loss budget: How private we want the result to be (how hard for the attacker to guess the true result)
– Global sensitivity: How much difference the presence or absence of an individual could make to the result.

Privacy loss budget = k
• We want that the presence or absence of a user in the dataset does not have a considerable effect on the released result
A = Probability that result is R (possible world where I participate)
B = Probability that result is R (possible world where I do not participate)
Privacy loss budget = k (k ≥ 0)
Choose k to guarantee that A ≤ 2^k × B

Privacy loss budget = k
A = Probability that result is R (possible world where I participate)
B = Probability that result is R (possible world where I do not participate)
Privacy loss budget = k (k ≥ 0)
Choose k to guarantee that A ≤ 2^k × B
k = 0: No privacy loss (A = B), low utility
k high: Larger privacy loss, higher utility
k low: Low privacy loss, lower utility

Global sensitivity
• Global sensitivity of a query Q is the maximum difference in answers that adding or removing any individual from the dataset can cause (maximum effect of an individual)
• Intuitively, we want to consider the worst-case scenario
• If asking multiple queries, global sensitivity is equal to the sum of the differences

Global sensitivity – example
• QUERY: How many people in the dataset are female?
Global sensitivity = 1
Possible world where I participate: X+1 people are female
Possible world where I do not participate: X people are female

Global sensitivity – example
• QUERY: How many people in the dataset are smokers?
Global sensitivity = 1
Possible world where I participate: X+1 people are smokers
Possible world where I do not participate: X people are smokers

Global sensitivity – example
• QUERY: How many people in the dataset are female? And how many people are smokers?
Global sensitivity = 1 + 1 = 2
Possible world where I participate: X+1 people are smokers; M+1 males and F females OR M males and F+1 females
Possible world where I do not participate: X people are smokers; M males and F females
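The slides state the guarantee A ≤ 2^k × B but do not fix a mechanism; one standard choice (assumed here, not taken from the slides) is the Laplace mechanism, with the noise scale calibrated to the global sensitivity and the privacy loss budget:

```python
# Laplace mechanism sketch. With scale b = GS / epsilon the output-probability ratio
# between neighbouring datasets is bounded by e^epsilon; to match the slides'
# base-2 budget (A <= 2^k * B), use b = GS / (k * ln 2).
import math
import numpy as np

rng = np.random.default_rng()

def laplace_mechanism(true_answer: float, global_sensitivity: float, k: float) -> float:
    """Release true_answer plus Laplace noise calibrated to sensitivity and budget k (base 2)."""
    scale = global_sensitivity / (k * math.log(2))
    return true_answer + rng.laplace(loc=0.0, scale=scale)

# Single count query ("how many females?") with global sensitivity 1 and budget k = 1.
# If the two queries above were asked together, their combined sensitivity of 2 would be used.
print(laplace_mechanism(true_answer=32, global_sensitivity=1, k=1.0))
```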
