COMPSCI-3IS3 Information Security Homework 2 – Due: 02/28/2022
Fundamentals of Differential Privacy
TAs: Keivan and Wei
1. Compute the sensitivity of the following queries:
• Suppose we want to compute the average of a set of numbers known to lie in the interval $[0, R]$ on a data set of size $n$. That is, given the dataset $D = \{x_1, \dots, x_n\}$ with each entry $x_i \in [0, R]$, what is the sensitivity of $q_D = \frac{1}{n} \sum_{i=1}^{n} x_i$?
• Consider a histogram: given a data set $D = \{x_1, \dots, x_n\}$ and a partition of $D$ into $m$ disjoint sets $B_1, \dots, B_m$ (think of these as “bins” or “types” of items in $D$), we count how many records there are of each type. Hence, $q_D = (n_1, n_2, \dots, n_m)$ where $n_j = \#\{i : x_i \in B_j\}$. So, for example, if we wanted to compute the number of residents of each of the 50 US states from a census of the US population, we would be asking a histogram query.
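As a concrete illustration of the two query types in Problem 1, the following minimal Python sketch evaluates an average query and a histogram query on a toy dataset and on one neighboring dataset. The function names, the toy values, the two-bin partition, and the reading of "neighboring" as replacing a single record are all assumptions made for illustration. The sensitivity asked for above is the worst case over all neighboring pairs, which this single pair does not establish.

```python
from collections import Counter

def average_query(data):
    """Average of the records (the first query type in Problem 1)."""
    return sum(data) / len(data)

def histogram_query(data, bin_of, bins):
    """Count how many records fall into each bin, in the order given by bins."""
    counts = Counter(bin_of(x) for x in data)
    return tuple(counts[b] for b in bins)

# Toy setting (assumed): values in [0, R] with R = 10 and two bins.
R = 10.0
bins = ("low", "high")

def bin_of(x):
    return "low" if x < R / 2 else "high"

D = [2.0, 7.5, 10.0, 0.5]
D_neighbor = [2.0, 7.5, 10.0, 9.5]  # same as D with the last record replaced

# Change in each query's output between this one pair of neighbors.
avg_change = abs(average_query(D) - average_query(D_neighbor))
hist_D = histogram_query(D, bin_of, bins)
hist_neighbor = histogram_query(D_neighbor, bin_of, bins)
hist_change = sum(abs(a - b) for a, b in zip(hist_D, hist_neighbor))  # L1 distance

print("average:", average_query(D), "->", average_query(D_neighbor))
print("histogram:", hist_D, "->", hist_neighbor)
print("changes:", avg_change, hist_change)
```

Trying other replacement values, or a different notion of neighboring datasets such as adding or removing one record, shows how the observed change depends on those choices.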
2. This problem is about implementing differential privacy on real data in Python. You can use any Python environment you wish (e.g., Jupyter Notebook). If you do not have any experience with Python, try Colab. First go to https://colab.research.google.com, sign in to your Google account, and create a new notebook. You now have a Jupyter Notebook environment for your commands.
We need two libraries, NumPy and pandas, so we first import them:
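The import commands are not reproduced in this copy of the handout. A conventional form, assuming the usual `np` and `pd` aliases, would be:

```python
import numpy as np   # numerical routines, including random sampling
import pandas as pd  # tabular data handling (DataFrame)

# Printing the versions is a quick check that both libraries are available.
print(np.__version__, pd.__version__)
```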
We will be working with the Adult dataset. The dataset contains 32,561 entries from the 1994 US census, with a total of 15 columns representing different attributes of the people. Here is the list: Age (from 17 to 90), Work class (Private, Federal-Government, etc.), Final Weight (the number of people the census believes the entry represents), Education (the highest level of education obtained), Education Number (the number of years of education), Marital Status, Occupation, Relationship in family, Race, Sex, Capital Gain, Capital Loss, Hours (worked) per week, Native Country, Income (whether or not an individual makes more than $50k annually). To read this dataset, execute the following command:
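The read command and the location of the data file are not reproduced in this copy. As a minimal sketch, assuming the data is available as a CSV file, it could be loaded with `pd.read_csv`. The helper name `load_adult`, the file name in the comment, and the two-column in-memory sample are hypothetical and are used only so that the sketch runs without the real file.

```python
import io
import pandas as pd

def load_adult(source):
    """Read the data from a path, URL, or file-like object into a DataFrame."""
    return pd.read_csv(source)

# With the real file this would be something like:
#   adult = load_adult("adult.csv")   # hypothetical file name
# Here a tiny in-memory stand-in with two assumed columns is used instead.
sample_csv = io.StringIO("Age,Income\n39,<=50K\n52,>50K\n")
adult_sample = load_adult(sample_csv)

print(adult_sample.shape)
print(adult_sample.head())
```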
If you execute adult, you can see the whole dataset. We are interested in two queries: “How many individuals in the dataset are 40 years old or older?” and “Are there more individuals with income < $50k?”. The following commands produce $q_D^1$ and $q_D^2$:
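The query commands themselves are missing from this copy. One way the two queries could be written with pandas is sketched below on a small made-up DataFrame. The column names `Age` and `Income`, the label `<=50K` for the lower income group, and the function names are assumptions, so adjust them to match the columns of the file you actually load.

```python
import pandas as pd

# Made-up stand-in data; the real DataFrame comes from the Adult dataset.
adult_sample = pd.DataFrame({
    "Age": [25, 40, 61, 38, 47],
    "Income": ["<=50K", ">50K", "<=50K", "<=50K", ">50K"],
})

def count_age_at_least(df, threshold=40, age_col="Age"):
    """Counting query: number of individuals aged `threshold` or older."""
    return int((df[age_col] >= threshold).sum())

def majority_low_income(df, income_col="Income", low_label="<=50K"):
    """Yes/no query: 1 if the lower income group is the larger one, else 0."""
    n_low = int((df[income_col] == low_label).sum())
    n_high = len(df) - n_low
    return int(n_low > n_high)

q1 = count_age_at_least(adult_sample)   # ages 40, 61, 47 qualify
q2 = majority_low_income(adult_sample)  # 3 low-income rows vs 2 others
print(q1, q2)
```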
Which one of these two queries can be released via the Randomized Response mechanism? Use the implementation of a Bernoulli random variable and XOR in Python to produce differentially private answers to this query, with ε = 0.1, ε = 1, and ε = 5. For the other query, use an implementation of the Laplace mechanism to produce differentially private answers, with ε = 0.1, ε = 1, and ε = 5. Hint: A simple implementation of the Laplace mechanism is as follows:
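The hint's code listing is not reproduced in this copy. A common minimal form of the Laplace mechanism adds noise with scale sensitivity/ε to the true answer. The sketch below shows that form, together with one standard way to apply randomized response to a single bit using a Bernoulli draw and XOR. The function names, the use of `np.random.default_rng`, and the flip probability 1/(1 + e^ε) are assumptions for illustration and may differ from the handout's own version.

```python
import numpy as np

rng = np.random.default_rng()

def laplace_mech(value, sensitivity, epsilon, rng=rng):
    """Add Laplace noise with scale sensitivity / epsilon to a numeric answer."""
    return value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

def randomized_response(bit, epsilon, rng=rng):
    """Flip a single bit with probability 1 / (1 + e^epsilon).

    The flip is a Bernoulli draw combined with the true bit via XOR, so the
    true bit is reported with probability e^epsilon / (1 + e^epsilon).
    """
    flip_prob = 1.0 / (1.0 + np.exp(epsilon))
    noise = int(rng.binomial(1, flip_prob))  # Bernoulli(flip_prob)
    return int(bit) ^ noise

# Example calls on placeholder inputs (not answers from the dataset).
for eps in (0.1, 1.0, 5.0):
    print(eps, laplace_mech(100.0, 1.0, eps), randomized_response(1, eps))
```

Passing a seeded generator, for example `rng=np.random.default_rng(0)`, makes runs reproducible while debugging.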
For all these cases, run the mechanisms 1000 times and then: (1) calculate the percent error of each answer against the original (non-private) answer (i.e., $\frac{|M_D - q_D|}{q_D} \times 100$), and (2) plot the distribution of errors using a histogram. Use these histograms to describe, in words, the effect of ε on the error.
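The experiment loop can be organized as in the sketch below, which repeats a Laplace-noised release 1000 times per ε, computes percent errors, and bins them with `np.histogram`. The true answer of 1000 and the sensitivity of 1 are placeholder values, not results from the dataset, and the helper name `percent_errors` is hypothetical.

```python
import numpy as np

def percent_errors(true_value, noisy_values):
    """Percent error of each noisy answer relative to the true answer."""
    noisy_values = np.asarray(noisy_values, dtype=float)
    return np.abs(noisy_values - true_value) / abs(true_value) * 100.0

rng = np.random.default_rng(0)  # seeded so the run is reproducible
true_answer = 1000.0            # placeholder, not the real query answer
sensitivity = 1.0               # placeholder value for illustration
n_runs = 1000

median_error = {}
for epsilon in (0.1, 1.0, 5.0):
    # n_runs independent noisy releases of the same answer
    noisy = true_answer + rng.laplace(0.0, sensitivity / epsilon, size=n_runs)
    errors = percent_errors(true_answer, noisy)
    counts, edges = np.histogram(errors, bins=20)  # bin counts and bin edges
    median_error[epsilon] = float(np.median(errors))
    print(f"epsilon={epsilon}: median percent error={median_error[epsilon]:.4f}")
```

To draw the histograms rather than only compute bin counts, the `errors` array for each ε can be passed to a plotting call such as matplotlib's `plt.hist(errors, bins=20)`.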
If you are interested in the implementation of differential privacy in Python, the following short book is highly recommended: “Programming Differential Privacy” by Joseph P. Near and Chiké Abuah. It is accessible here: https://programming-dp.com/