Machine Learning
Lecture 13: Advanced Topics
Prof. Dr. Günnemann
Data Analytics and Machine Learning, Technical University of Munich
Winter term 2020/2021
Introduction – Beyond Accuracy
Differential Privacy
Algorithmic Fairness
• In the previous lectures we were focusing on models and algorithms and we were optimizing for simple metrics (e.g. misclassification rate, reconstruction error, etc.)
• As ML/AI is becoming more widespread and is used in critical applications (e.g. algorithmic decision-making involving humans) we have to consider societal impacts
• As ML models get deployed in the real-world they create feedback loops which can have potentially unintended consequences
• We should always ask: ”Are we optimizing for the right thing?”
Beyond accuracy we have to consider
• Privacy: how to avoid revealing (sensitive) information about the individuals in the training set (e.g. medical diagnosis)
• Security: what if an attacker can "fool" the system with malicious input, poison the training data, "steal" the model, etc.?
• Fairness: how can we ensure that the system does not disadvantage particular individuals or (marginalized) groups
• Explainability: people should be able to understand why and how a decision was made about them (GDPR)
• Accountability: an outside auditor should be able to verify that the system is functioning as intended
Impact of AI/ML more broadly
• How should self-driving cars trade off the safety of passengers, pedestrians, etc.? (Trolley problems)
• Face recognition and other surveillance-enabling technologies
• Autonomous weapons
• Risk of international AI arms races
• Long-term risks of superintelligent AI
• Unemployment due to automation
• Bad side effects of optimizing for click-through
Disclaimer
• These concepts sound "vague" and properly formalizing them is half the challenge 1
• Any "solution" can only be partly technical; tackling these issues must always involve social/legal/political aspects and must be an interdisciplinary effort
• Given the above we will focus on two topics, privacy and fairness, since they have well-established technical principles and techniques that address part of the problem
1Most of these topics in this lecture are active areas of research with serious effort starting only around 5 years ago.
Differential Privacy
Anonymization is hard
• US government releases a dataset of medical visits (Sweeney, '13)
• Identifying info (names, addresses and SSNs) was removed
• Data on zip code, birth date, and gender was left
• Around 87% of Americans are uniquely identifiable from this triplet
• Netflix Challenge: competition to improve movie recommendations
• Dataset of 100 million movie ratings with anonymized user ID
• 99% of users who rated at least 6 movies could be identified by cross-referencing with IMDB reviews (associated with real names) (Narayanan & Shmatikov, '08)
• Re-Identifying >40% of anonymous volunteers in DNA Study (Sweeney, ’13)
• "A Face is Exposed for AOL Searcher No. 4417749" (Barbaro, '06)
• … and many others
Anonymization is hard
It is not sufficient to prevent unique identification of individuals
Name   Age     Gender   Zip     Smoker   Diagnosis
*      60-70   Male     191**   Y        Heart disease
*      60-70   Female   191**   N        Arthritis
*      60-70   Male     191**   Y        Lung Cancer
*      60-70   Female   191**   N        Crohn's disease
*      60-70   Male     191**   Y        Lung Cancer
*      50-60   Female   191**   N        HIV
*      50-60   Male     191**   Y        Lyme disease
*      50-60   Male     191**   Y        Seasonal allergies
*      50-60   Female   191**   N        Ulcerative Colitis

Table: Kearns & Roth, The Ethical Algorithm

If we know Rebecca is 55 years old and is in this (fictional) hospital database, then we know she has 1 of 2 diseases.
Sensitive information in the model
• Even if you don’t release the raw data, the weights of a trained network might reveal sensitive information
• Model inversion attacks recover information about the training data from the trained model
Figure: Reconstructing faces given a classifier trained on private data and a generative model trained on public data, Zhang et al., 2019
• Email provider uses language models for email autocompletion, the model can remember (and spit out) sensitive info from past emails
How can we compute (statistical) queries and train ML models without leaking (too much) (sensitive) information about any individual?
Warmup: Randomized Response
• Goal: Conduct a survey on a potentially incriminating or sensitive question with a binary (yes/no) answer
• Examples:
• Have you ever committed tax fraud?
• Does anyone in your family suffer from HIV?
• We would like to motivate the participants to answer truthfully despite the sensitive nature of the question
• Idea: introduce randomization to provide plausible deniability
Warmup: Randomized Response
• Let each of the n participants follow the procedure (Warner, 1965):
• Flip a coin
• If it lands tails answer truthfully
• Else flip another coin. If it lands tails answer Yes, else answer No
• What is the fraction of participants that answer Yes truthfully?
• We can accurately estimate the population mean μ from the randomized responses: if μ̂ is the MLE, i.e. the observed fraction of Yes responses, then E[μ̂] = 3/4 · μ + 1/4 · (1 − μ), since
• P(response = Yes | truth = Yes) = 3/4
• P(response = Yes | truth = No) = 1/4
• Solving gives the estimator μ = 2μ̂ − 1/2, which is an unbiased estimator of the non-randomized mean; its variance decays as 1/n but is 4× larger because of the randomization (a simulation sketch follows below)
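A minimal simulation sketch of this procedure (assuming Python/NumPy; the true mean of 0.3 and all variable names are illustrative):

import numpy as np

rng = np.random.default_rng(0)

def randomized_response(truth):
    # First coin: tails (prob 1/2) -> answer truthfully
    answer_truthfully = rng.random(truth.shape) < 0.5
    # Second coin: tails (prob 1/2) -> answer Yes, else No
    random_answer = rng.random(truth.shape) < 0.5
    return np.where(answer_truthfully, truth, random_answer)

n = 100_000
truth = rng.random(n) < 0.3                  # true sensitive answers, mu = 0.3
responses = randomized_response(truth)

mu_hat = responses.mean()                    # observed fraction of Yes responses
mu_estimate = 2 * mu_hat - 0.5               # invert E[mu_hat] = 3/4 mu + 1/4 (1 - mu)
print(mu_estimate)                           # close to 0.3, yet no single answer is revealed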
Beyond Randomized Response
• With Randomized Response we could compute useful queries (e.g. the fraction) in aggregate without learning the truthful answer for any individual
• In general, randomness is a useful technique for preventing information leakage
• Q: How to answer more complex (general) queries (e.g. computing arbitrary functions over data) with mathematical privacy guarantees?
• A: Differential Privacy
Differential Privacy
• A (trusted) curator is given access to some input data X ∈ X
• The curator computes some function Y = f (X ) ∈ Y
• The curator wants to release the output Y to the public without leaking (too much) information about the data
• Let X = {0, 1}^n ∪ {0, 1}^(n+1) be the set of binary vectors, e.g. containing the answers to a survey question for n or (n + 1) users
• Let f be the mean, thus Y ∈ [0, 1]
• Let X , X ′ ∈ X be two ”neighboring” input vectors, s.t. X ′ is obtained by appending the answer for a new participant to X
• Or alternatively they differ in the answer of a single participant
• Informally, DP enforces that f(X) and f(X′) do not differ significantly, which prevents leaking the answer of the new participant
Differential Privacy
General Setup:
• Input space X with symmetric neighboring relation ≃
• Output space Y (with a σ-algebra of measurable events)
• Function of interest f, and privacy parameter ε ≥ 0
Definition
A randomized mechanism Mf : X → Y 2 is ε-differentially private if for all neighboring inputs X ≃ X ′ and for all sets of outputs Y ⊆ Y we have:3
P[Mf(X) ∈ Y] ≤ e^ε · P[Mf(X′) ∈ Y]

For any possible set of outputs we have

e^(−ε) ≤ P[Mf(X) ∈ Y] / P[Mf(X′) ∈ Y] ≤ e^ε
2Note, Mf includes the function f we want to compute. It is not useful to just output random numbers.
3There is a more general (ε,δ) definition which we won’t cover in the lecture
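As a small sanity check of this definition (not part of the original slide, but a standard fact): the randomized response mechanism from the warm-up outputs Yes with probability 3/4 if the truth is Yes and 1/4 if it is No, so for every output the probability ratio between the two neighboring inputs is at most 3, i.e. it is ε-DP with ε = ln 3 ≈ 1.1:

import numpy as np

p_given_yes = np.array([0.75, 0.25])   # P(output = Yes), P(output = No) if truth = Yes
p_given_no  = np.array([0.25, 0.75])   # P(output = Yes), P(output = No) if truth = No

worst_case_ratio = np.max(np.maximum(p_given_yes / p_given_no, p_given_no / p_given_yes))
print(np.log(worst_case_ratio), np.log(3))  # both ~1.0986, so epsilon = ln(3)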
Differential Privacy Intuition
The outcome of the statistical analysis should not change by much if we include/exclude or modify the information of a single instance
Figure: Difference between two neighboring datasets (Grosse, CSC 2515)
• ≃ captures what is protected, e.g. two different vectors X and X′ that differ in a single coordinate, or two different datasets D and D′ that differ in a single instance
• If the mechanism Mf behaves nearly identically for X and X′, an attacker can’t tell whether X or X ′ was used and thus can’t learn much about the individual
Laplace Mechanism (Output Perturbation)
• Define the global sensitivity of a function f : X → R^d as Δp = sup_{X ≃ X′} ||f(X) − f(X′)||p
• ∆p measures the magnitude by which a single instance can change the output of the function in the worst case
• Output perturbation with Laplace Noise:
• A curator holds data X = (x1,…,xn) ∈ X about n individuals
• The curator computes the function f (X )
• They sample i.i.d. Laplace noise Z ∼ Lap(0, Δ1/ε)^d
• They reveal the noisy value f(X) + Z

We can prove that: Mf,Lap is ε-DP 4
4 Adding Gaussian noise Z ∼ N(0, σ²) to the output of f with σ = C · Δ2 · √(log(1/δ)) / ε (for an appropriate constant C) yields an (ε, δ)-DP mechanism Mf,N, where δ accounts for "bad" events
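A minimal sketch of the Laplace mechanism above (assuming NumPy; the caller must supply the global l1-sensitivity Δ1 of f, which in general is the hard part):

import numpy as np

rng = np.random.default_rng(0)

def laplace_mechanism(f_value, sensitivity, epsilon):
    # Add i.i.d. Lap(0, Delta_1 / epsilon) noise to every coordinate of f(X)
    scale = sensitivity / epsilon
    return f_value + rng.laplace(loc=0.0, scale=scale, size=np.shape(f_value))

For example, laplace_mechanism(np.mean(x), 1 / len(x), 1.0) releases a 1-DP mean of binary data, as on the next slide.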
Example: Laplace Mechanism for the Mean
• Computing the mean μ = f(X) = f(x1, …, xn) = (1/n) Σ_{i=1}^n xi, where xi ∈ {0, 1} is binary
• The global l1-sensitivity of the mean is Δ1 = 1/n
• Changing/excluding the value of a single instance/individual can change the output by at most 1/n
• Sample noise Z ∼ Lap(0, 1/(nε))
• Reveal the noisy mean μ̃ = μ + Z
• In this case we can also say something about the utility of this mechanism
• |μ − μ̃| ∼ Exponential(εn), which has a mean of 1/(εn)
• The true mean is not going to differ by much from the randomized mean and this difference decreases with the size of the data
• In general, computing the sensitivity of a function is challenging, and showing something about the utility is even more challenging
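A short sketch checking the utility claim empirically (an illustration under the assumptions above, not a proof): the average absolute error of the noisy mean should be about 1/(εn).

import numpy as np

rng = np.random.default_rng(1)

def private_mean(x, epsilon):
    # Global l1-sensitivity of the mean of binary data is 1/n
    n = len(x)
    return np.mean(x) + rng.laplace(loc=0.0, scale=1.0 / (n * epsilon))

x = rng.integers(0, 2, size=10_000)          # binary survey answers
for eps in [0.1, 1.0, 10.0]:
    errors = [abs(private_mean(x, eps) - np.mean(x)) for _ in range(1_000)]
    print(eps, np.mean(errors), 1.0 / (eps * len(x)))   # empirical error vs. 1/(eps n)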
DP for Machine Learning
• The curator would like to learn an ML model on a dataset D, i.e. find the optimal weights θ* of some parametrized model
• And release the weights θ* in public, e.g. by publishing the trained classifier

Setup:
• Let D = {(x1,y1),…,(xn,yn)} be a dataset of instances, and
θ ∈ Θ be the parameters of a model (e.g. logistic regression weights)
• Let f(D) = argmin_{θ ∈ Θ} L(D, θ) be the trained weights
• e.g. L(D, θ) = (1/n) Σ_{i=1}^n l(xi, yi, θ), where l is e.g. cross-entropy
• Define two datasets D ≃ D′ to be neighboring if they differ in one instance (xj, yj)
• Compared to before where we had vectors X and X′, now we have entire datasets D and D′
Differential Privacy techniques for ML models
• Perturb input: perturb D directly and rely on the post-processing property (next slide)
• Perturb weights: Compute the optimal weights θ* = argmin_{θ ∈ Θ} L(D, θ) and perturb them with Laplace noise
• Need to calculate the global sensitivity of the optimization procedure which is difficult for complex models
• Perturb objective: Optimize L(D, θ) + θT Z where Z is some noise
• Perturb gradients: Perturb and release the gradient of L w.r.t. a mini-batch, useful in Federated Learning (next slides)
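A hedged NumPy sketch of the "perturb gradients" idea, in the spirit of DP-SGD (Abadi et al., 2016; not a specific library API): per-example gradients are clipped so that the sensitivity of the averaged gradient is bounded, then Gaussian noise is added before the update. clip_norm and noise_multiplier are illustrative hyperparameters.

import numpy as np

rng = np.random.default_rng(0)

def private_sgd_step(theta, per_example_grads, lr, clip_norm, noise_multiplier):
    # Clip each example's gradient to l2-norm <= clip_norm (bounds the sensitivity)
    clipped = [g * min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))
               for g in per_example_grads]
    avg_grad = np.mean(clipped, axis=0)
    # Add Gaussian noise scaled to the sensitivity of the averaged, clipped gradient
    noise = rng.normal(0.0, noise_multiplier * clip_norm / len(per_example_grads),
                       size=theta.shape)
    return theta - lr * (avg_grad + noise)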
Fundamental properties of DP
• Robustness to post-processing: if M is ε-DP then g ◦ M is ε-DP
• You can apply any function g on an output from a DP mechanism and the new output remains DP (as long as you don't touch the raw data again)
• Composition: if Mj, j = 1, …, k are εj-DP then X ↦ (M1(X), …, Mk(X)) is (Σ_{j=1}^k εj)-DP 5
• Group privacy: if M is ε-DP with respect to X ≃ X′ then M is (tε)-DP with respect to changes of t instances/individuals
• DP generalizes to arbitrary spaces beyond Euclidean space using the Exponential noise mechanism
5There is an advanced composition theorem with better guarantees
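A tiny numeric illustration of basic composition (just the slide's rule applied to concrete numbers): releasing k = 10 query answers, each 0.1-DP, is 1.0-DP overall.

epsilons = [0.1] * 10        # ten queries, each 0.1-DP
total_epsilon = sum(epsilons)
print(total_epsilon)         # 1.0 -> the combined release is 1.0-DP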
Federated Learning
• Differential Privacy assumes there’s a curator who we trust with access to all the raw data
• Compare this to the Randomized Response procedure where the user randomizes their answer before sending it to the curator
• Allows for better privacy guarantees but not always possible
• Federated learning: learning a model without any centralized entity having access to all the data
• Send the current weights of the model to the user
• Perform a few steps of SGD locally
• Send the updated weights to the central entity for aggregation
• Does not satisfy DP automatically but can be made private by randomizing the local updates
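A minimal federated-averaging sketch in NumPy (all names, the linear model, and the local noise scale are illustrative assumptions, not a particular framework): each client runs a few SGD steps on its own data and only (optionally randomized) weight updates leave the device.

import numpy as np

rng = np.random.default_rng(0)

def local_sgd(theta, X, y, lr=0.1, steps=5):
    # A few local SGD steps on a linear model with squared loss (illustrative)
    for _ in range(steps):
        grad = 2 * X.T @ (X @ theta - y) / len(y)
        theta = theta - lr * grad
    return theta

def federated_round(theta, clients, noise_scale=0.0):
    updates = []
    for X, y in clients:                         # raw data never leaves the client
        update = local_sgd(theta.copy(), X, y) - theta
        update += rng.normal(0.0, noise_scale, size=update.shape)  # randomize locally
        updates.append(update)
    return theta + np.mean(updates, axis=0)      # server only aggregates updates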
Differential Privacy Summary
• A lot of ML models are trained on datasets containing sensitive information about individuals, and database reconstruction attacks can be surprisingly effective
• Differential privacy gives a way of provably preventing (much) information about individuals from leaking
• Building blocks of differential privacy
• Laplace mechanism adds noise to the output
• Composition rules combine multiple private queries
• Sometimes differentially private algorithms can accurately answer queries for large populations
• The 2020 US Census used differential privacy