Lecture 21: Ethics in Machine Learning: Measuring and Mitigating Algorithmic Bias
Introduction to Machine Learning Semester 1, 2022
Copyright © University of Melbourne 2022. All rights reserved. No part of the publication may be reproduced in any form by print, photoprint, microfilm or any other means without written permission from the author.
So far… ML nuts and bolts
• Supervised learning / classification
• Unsupervised learning
• Feature selection
• Evaluation
Today… ML in the world
• What is Bias and where does it come from?
• Algorithmic Fairness
• How to make ML algorithms fairer
Introduction
Applications
• medical diagnoses
• language generation
• hate speech detection
• stock market prediction
• insurance policy suggestion
• search and recommendation
• spam / malware detection
• predictive policing
What can possibly go wrong?
A quick (and biased) press review
https://www.theguardian.com/technology/2019/dec/12/ai-end-uk-use-racially-biased-algorithms-noel-sharkey
https://www.theguardian.com/technology/2020/feb/05/welfare-surveillance-system-violates-human-rights-dutch-court-rules
https://www.theguardian.com/technology/2019/feb/14/elon-musk-backed-ai-writes-convincing-news-fiction
https://www.theguardian.com/commentisfree/2019/nov/16/can-planet-afford-exorbitant-power-demands-of-machine-learning
Sources and types of bias in ML
Humans are biased
https://commons.wikimedia.org/wiki/File:Cognitive_bias_codex_en.svg
Humans are biased
Out-group homogeneity bias (Stereotypes/Prejudice)
https://xkcd.com/385/
Humans tend to perceive out-group members as less nuanced than in-group members.
Humans are biased
Correlation Fallacy
https://xkcd.com/552/
Humans have a tendency to mistake correlation (two coincidentally co-occurring events) for causation.
Data is biased
Historical bias: A randomly sampled data set reflects the world as it was, including existing biases which should not be carried forward.
[Example: “professor”]
Data is biased
Representation bias / Reporting bias
• The data sets do not faithfully represent the whole population
• Minority groups are underrepresented
• Obvious facts are underrepresented. Anomalies are overemphasized.
From: Gordon and van Durme (2013)
Data is biased
Measurement bias
1. Noisy measurement → errors or missing data points which are not randomly distributed
• e.g., records of police arrests differ in level of detail across postcode areas
2. Mistaking a (noisy) proxy for a label of interest
• e.g., ‘hiring decision’ as a proxy for ‘applicant quality’. (why noisy?)
3. Oversimplification of the quantity of interest
• e.g., classifying political leaning into: ‘Democrat’ vs. ‘Republican’ (USA); binarizing gender into: ‘Male’ vs. ‘Female’
1. Know your domain
2. Know your task
3. Know your data
Models are Biased
• Weak models: high bias, low variance
• Unjustified model assumptions
Models are Biased
Biased Loss Function
• Blind to certain types of errors
• E.g., 0/1 loss will tend to tolerate errors in the minority class for highly imbalanced data
1. Carefully consider model assumptions
2. Carefully choose loss functions
3. Model groups separately (e.g., multi-task learning)
4. Represent groups fairly in the data
Bias in Evaluation or Deployment
Evaluation bias
• Test set not representative of target population
• Overfit to a test set. Widely used benchmark data sets can reinforce the problem.
• Evaluation metrics may not capture all quantities of interest (disregard minority groups or average effects). E.g.,
• Accuracy (remember assignment 1…?)
• Face recognition models largely trained/evaluated on images of ethnically white people.
Deployment bias
• Use of systems in ways they weren’t intended to be used; lack of education of end-users.
1. Carefully select your evaluation metrics
2. Use multiple evaluation metrics
3. Carefully select your test sets and benchmarks
4. Document your models to ensure they are used correctly
The Machine Learning Pipeline
[Pipeline diagram: World → (measurement) → Data → (learn) → Model → (action) → Individuals → (feedback) → back to Data/World]
Measurement
• Define your variables of interest
• Define your target variable
• Especially critical if the target variable is not measured explicitly, e.g., hiring decision → applicant quality, or income → creditworthiness
Learning
• Models are faithful to the data (by design!)
• Data contains “knowledge” (smoking causes cancer)
• Data contains “Stereotypes” (boys like blue, girls like pink)
• What’s the difference? Based on social norms, no clear line!
Action
• ML concept: regression, classification, information retrieval, …
• Resulting action: Class prediction (Spam, credit granted), search results, hotel recommendation, …
Feedback
• Approximated from user behavior
• Ex: click-rates
Demographic Disparity / Sample Size Disparity
Demographic groups will be differently represented in our samples
• Historical bias
• Minority groups
• …
What does this mean for model fit?
• Models will work better for majorities (ex: dialects, speech recognition)
• Models will generalize based on majorities (ex: fake names in Facebook)
• Think: Anomaly detection
Effects on society
• Minorities may adopt technology more slowly (increase the gap)
• The danger of feedback loops: (ex: predictive policing → more arrests → reinforce model signal …)
Demographic Disparity / Sample Size Disparity
Two questions
• Is any disparity justified?
• Is any disparity harmful?
Case study
• Amazon same-day delivery by zip-code (USA)
• Areas with a largely Black population are left out
• Amazon’s objective: min cost / max efficiency
• Is the system biased?
• Is discrimination happening?
• If so: is the discrimination justified? Is it harmful?
https://www.bloomberg.com/graphics/2016-amazon-same-day/
Sensitive Attributes
X: non-sensitive features
A: sensitive attributes with discrete labels (male/female, old/young, …)
Y: true labels
Ŷ: classifier score (predicted label, for our purposes)
Very often instances have a mix of useful, uncontroversial attributes, and sensitive attributes based on which we do not want to make classification decisions.
Different attributes lead to different demographic groups of the population
It is rarely clear which attributes are sensitive and which are not. The choice can have profound impacts.
Fairness through Unawareness
• Hide all sensitive features from the classifier. Only train on X and remove A.
P(Ŷ_n | X_n, A_n) ≈ P(Ŷ_n | X_n)
Another case study
• A bank which serves both humans and martians wants a classifier predicting whether an applicant should be granted credit or not. Assume access to features (credit history, education, …) for all applicants.
• What are X, A, Y and Ŷ?
• Apply “fairness through unawareness”. Would the model be fair?
• General features may be strongly correlated with sensitive features. Example…?
Consequently, this approach does not generally result in a fair model.
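A minimal sketch of what “fairness through unawareness” looks like in code, and why it can fail. The toy DataFrame, the column names (species, postcode, repaid) and all numbers are invented for illustration; the point is only that a remaining feature can act as a proxy for the dropped sensitive attribute A.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical loan-application data: 'species' is the sensitive attribute A,
# 'repaid' is the true label Y, and the remaining columns are the features X.
df = pd.DataFrame({
    "income":   [50, 20, 60, 25, 55, 22, 65, 30],
    "postcode": [1, 2, 1, 2, 1, 2, 1, 2],      # strongly correlated with species
    "species":  ["human", "martian"] * 4,
    "repaid":   [1, 0, 1, 0, 1, 1, 1, 0],
})

# "Fairness through unawareness": drop A before training.
X = df.drop(columns=["species", "repaid"])
y = df["repaid"]
clf = LogisticRegression().fit(X, y)

# The catch: 'postcode' still encodes species almost perfectly in this toy set,
# so the classifier can effectively reconstruct A from X and disparity persists.
print(df.groupby("species")["postcode"].mean())
```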
Formal Fairness Criteria
Quick recap of metrics
Confusion matrix cells: true positives (TP), false negatives (FN), false positives (FP), true negatives (TN)
Positive Predictive Value (PPV) (also: precision): PPV = TP / (TP + FP)
True Positive Rate (TPR) (also: recall): TPR = TP / (TP + FN)
False Negative Rate (FNR): FNR = FN / (TP + FN) = 1 − TPR
Accuracy: (TP + TN) / (TP + TN + FN + FP)
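A small helper, sketched here for reuse in the fairness criteria that follow, computing these quantities from binary predictions (labels assumed to be in {0, 1}; the example arrays are made up):

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return TP, FP, FN, TN for binary labels in {0, 1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return tp, fp, fn, tn

def ppv(tp, fp):  # precision
    return tp / (tp + fp)

def tpr(tp, fn):  # recall
    return tp / (tp + fn)

def fnr(tp, fn):  # = 1 - TPR
    return fn / (tp + fn)

tp, fp, fn, tn = confusion_counts([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(ppv(tp, fp), tpr(tp, fn), fnr(tp, fn))
```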
Example Problem: Credit scoring
• We trained a classifier to predict a binary credit score: should an applicant be granted credit or not?
• Assume a version of the Adult data set as our training data, which covers humans and martians. It includes actual credit scoring information.
• We consider species to be the protected attribute: our classifier should make fair decisions for both human and martian applicants.
How can we measure fairness?
Fairness Criteria I: Group Fairness (Demographic Parity)
The sensitive attribute shall be statistically independent of the prediction:
Ŷ ⊥ A
For our classifier this would imply that:
P(Ŷ = 1 | A = m) = P(Ŷ = 1 | A = h)
[Figure: predicted credit scores for Martians (protected/minority) and Humans (majority); legend: true positives (high credit score), true negatives (low credit score)]
Goal: same chance to get a positive credit score for all applicants, regardless of their species.
• Simple and intuitive
• This is independent of the ground truth label Y
• We can predict good instances for the majority class, but bad instances for the minority class
• Danger of further harming the reputation of the minority class
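A minimal check of demographic parity on toy data (the prediction and group arrays are invented for illustration): compare P(Ŷ = 1 | A = g) across groups.

```python
import numpy as np

def positive_rate(y_pred, a, group):
    """P(Y_hat = 1 | A = group) -- the quantity demographic parity equates."""
    y_pred, a = np.asarray(y_pred), np.asarray(a)
    return y_pred[a == group].mean()

y_hat = np.array([1, 0, 1, 1, 0, 1, 0, 0])          # hypothetical predictions
A     = np.array(["m", "m", "m", "m", "h", "h", "h", "h"])  # protected attribute

gap = abs(positive_rate(y_hat, A, "m") - positive_rate(y_hat, A, "h"))
print(f"demographic parity gap: {gap:.2f}")  # 0 would mean parity holds exactly
```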
Fairness Criteria II: Predictive Parity
All groups shall have the same PPV (precision), i.e., the probability of a predicted positive being truly positive.
For our classifier this would imply that:
P(Y = 1 | Ŷ = 1, A = m) = P(Y = 1 | Ŷ = 1, A = h)
[Figure: predicted credit scores for Martians (protected/minority) and Humans (majority); legend: true positives (high credit score), true negatives (low credit score)]
• The chance to correctly get a positive credit score should be the same for both human and martian applicants
• Now, we take the ground truth Y into account
• Limitation: Accept (and amplify) possible unfairness in the ground truth: If humans are more likely to have a good credit score in the data, then the classifier may predict good scores for humans with a higher probability than for martians in the first place.
Fairness Criteria III: Equal Opportunity
All groups shall have the same FNR (and TPR), i.e., the probability of a truly positive instance being predicted negative.
For our classifier this would imply that:
P(Ŷ = 0 | Y = 1, A = m) = P(Ŷ = 0 | Y = 1, A = h), or equivalently,
P(Ŷ = 1 | Y = 1, A = m) = P(Ŷ = 1 | Y = 1, A = h)
[Figure: predicted credit scores for Martians (protected/minority) and Humans (majority); legend: true positives (high credit score), true negatives (low credit score)]
• Our classifier should make similar predictions for humans and martians with truly good credit scores
• We take the ground truth Y into account
• Limitation: Similar as for “Predictive Parity”, we accept (and amplify) possible unfairness in the ground truth.
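The same style of check for Predictive Parity (per-group PPV) and Equal Opportunity (per-group TPR), again on invented toy data:

```python
import numpy as np

def group_rates(y_true, y_pred, a, group):
    """Per-group PPV (predictive parity) and TPR (equal opportunity)."""
    m = np.asarray(a) == group
    yt, yp = np.asarray(y_true)[m], np.asarray(y_pred)[m]
    # max(..., 1) only guards against division by zero on tiny toy groups
    ppv = np.sum((yp == 1) & (yt == 1)) / max(np.sum(yp == 1), 1)
    tpr = np.sum((yp == 1) & (yt == 1)) / max(np.sum(yt == 1), 1)
    return ppv, tpr

y     = [1, 1, 0, 0, 1, 1, 0, 1]                    # hypothetical ground truth
y_hat = [1, 0, 0, 1, 1, 1, 0, 0]                    # hypothetical predictions
A     = ["m", "m", "m", "m", "h", "h", "h", "h"]    # protected attribute

for g in ("m", "h"):
    ppv, tpr = group_rates(y, y_hat, A, g)
    print(f"group {g}: PPV={ppv:.2f}  TPR={tpr:.2f}")
```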
Fairness Criteria IV: Individual Fairness
Rather than balancing by group (human, martian), compare individual applicants directly.
P(Ŷ_i = 1 | A_i, X_i) ≈ P(Ŷ_j = 1 | A_j, X_j)   if sim(X_i, X_j) > θ
• Individuals which have similar features X (job, education, …) should receive similar classifier scores
• Need to define a similarity function sim(·, ·) (often non-trivial)
• Need to select a similarity threshold θ
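One possible (not canonical) way to operationalise the check: count pairs of similar individuals whose predicted scores differ by more than some tolerance. The cosine similarity, the threshold θ, the tolerance eps and the toy numbers below are illustrative choices only.

```python
import numpy as np

def cosine(u, v):
    # one possible (and debatable) choice of similarity function sim()
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def individual_fairness_violations(X, scores, sim, theta, eps=0.05):
    """Count pairs (i, j) with sim(X_i, X_j) > theta whose predicted
    probabilities differ by more than eps."""
    n, violations = len(X), 0
    for i in range(n):
        for j in range(i + 1, n):
            if sim(X[i], X[j]) > theta and abs(scores[i] - scores[j]) > eps:
                violations += 1
    return violations

X      = np.array([[1.0, 0.9], [1.0, 1.0], [0.1, 1.0]])  # hypothetical features
scores = np.array([0.8, 0.3, 0.4])                       # classifier P(Y_hat = 1)
print(individual_fairness_violations(X, scores, cosine, theta=0.95))
```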
No Fair Free Lunch!
• Many more criteria exist. Many cannot be simultaneously satisfied. And many limit the maximum performance that is achievable.
Source: Hardt et al. (2016)
• Long-term impacts: “Group fairness” enforces equal rates of credit loans to males and females even though females are statistically less likely to repay. This further disadvantages the already poor (as well as the bank).
• Fairness criteria as soft constraints, not hard rules
• Fairness criteria as diagnostic tools rather than constraints: analyzing classifiers through the lens of fairness criteria can highlight social impacts once the system is deployed
• All criteria we discussed are observational, i.e., correlations. They do not allow us to argue about causality.
Classifier Evaluation Revisited
Fairness Evaluation
GAP measures: measure the deviation of the performance φ_g of each group g from the global average performance φ
• Average GAP: GAP_avg = (1 / |G|) Σ_{g ∈ G} |φ_g − φ|
• Maximum GAP: GAP_max = max_{g ∈ G} |φ_g − φ|
[Plot: Accuracy GAP per group]
[Plot: True positive rate (TPR) GAP per group (Equal Opportunity)]
[Plot: Positive Predictive Value (PPV) GAP per group (Predictive Parity)]
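A short sketch of the GAP measures on hypothetical per-group accuracies (the numbers are illustrative, not real results); the same helper works for TPR or PPV as the underlying metric φ.

```python
import numpy as np

def gap_metrics(phi_per_group, phi_global):
    """GAP_avg and GAP_max of a per-group metric phi_g around the global phi."""
    devs = np.abs(np.asarray(phi_per_group) - phi_global)
    return devs.mean(), devs.max()

phi_g = [0.78, 0.76, 0.66]   # hypothetical per-group accuracies
phi   = 0.74                 # hypothetical global accuracy
gap_avg, gap_max = gap_metrics(phi_g, phi)
print(f"GAP_avg={gap_avg:.3f}  GAP_max={gap_max:.3f}")
```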
Creating Fairer Classifiers
Now we know
• Where bias can arise (data, model, …)
• How we can statistically define fairness in classification
• How we can diagnose (un)fairness in evaluation
• What can we do – practically – to achieve better fairness?
We can improve fairness in
1. Pre-processing
2. Training / Optimization
3. Post-processing
1. Pre-processing
Balancing the data set
• Up-sample the minority group (martians)
• Down-sample the majority group (humans)
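A pandas sketch of both options on an invented, imbalanced toy set (the column names, group sizes and random seed are arbitrary):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "species": ["human"] * 90 + ["martian"] * 10,   # hypothetical imbalance
    "income":  rng.normal(50, 10, 100),
})

minority = df[df.species == "martian"]
majority = df[df.species == "human"]

# Up-sample the minority group to the size of the majority group ...
upsampled = pd.concat([df, minority.sample(80, replace=True, random_state=0)])

# ... or down-sample the majority group to the size of the minority group.
downsampled = pd.concat([majority.sample(10, random_state=0), minority])

print(upsampled.species.value_counts(), downsampled.species.value_counts(), sep="\n")
```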
1. Pre-processing
Re-weighting data instances
Expected distribution (if A ⊥ Y):
P_exp(A=a, Y=1) = P(A=a) × P(Y=1) = (#(A=a) / |D|) × (#(Y=1) / |D|)
Observed distribution:
P_obs(A=a, Y=1) = #(A=a, Y=1) / |D|
Weigh each instance by:
W(X_i = {x_i, a_i, y_i}) = P_exp(A=a_i, Y=y_i) / P_obs(A=a_i, Y=y_i)
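A minimal implementation of this re-weighting on a toy data frame (the A/Y values are made up): each instance receives the weight P_exp / P_obs for its (a, y) combination.

```python
import numpy as np
import pandas as pd

# Hypothetical training set: sensitive attribute A (species) and label Y.
df = pd.DataFrame({
    "A": ["human"] * 6 + ["martian"] * 4,
    "Y": [1, 1, 1, 1, 0, 0, 1, 0, 0, 0],
})
n = len(df)

def weight(a, y):
    # expected joint probability if A and Y were independent
    p_exp = (np.sum(df.A == a) / n) * (np.sum(df.Y == y) / n)
    # observed joint probability in the data
    p_obs = np.sum((df.A == a) & (df.Y == y)) / n
    return p_exp / p_obs

df["w"] = [weight(a, y) for a, y in zip(df.A, df.Y)]
print(df)
```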
2. Model training / optimization
Add constraints to the optimization function:
minimize L(f(X, θ), Y)   (the overall loss)
subject to ∀g ∈ G: |φ_g − φ| < α   (fairness constraints, e.g., GAP)
Incorporate using a Lagrangian (cf. Lecture 5: constrained optimization!):
L_final(θ) = L(f(X, θ), Y) + Σ_{g=1}^{G} λ_g ψ_g
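One way to realise this in practice is to relax the hard constraint into a penalty term added to the loss. The PyTorch sketch below uses a demographic-parity-style penalty with weight λ as a soft, differentiable stand-in for the GAP constraint; the linear model, the random data and the hyperparameters are placeholders, not a prescribed recipe.

```python
import torch

def fair_loss(model, X, y, a, lam=1.0):
    """Cross-entropy plus a soft penalty |P(Y_hat=1 | A=0) - P(Y_hat=1 | A=1)|,
    a differentiable stand-in for a hard fairness constraint."""
    p = torch.sigmoid(model(X)).squeeze(-1)
    bce = torch.nn.functional.binary_cross_entropy(p, y)
    penalty = (p[a == 0].mean() - p[a == 1].mean()).abs()
    return bce + lam * penalty

# Minimal usage sketch with a linear model on random (hypothetical) data.
torch.manual_seed(0)
X = torch.randn(32, 4)
y = torch.randint(0, 2, (32,)).float()   # labels
a = torch.randint(0, 2, (32,))           # protected attribute
model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(100):
    opt.zero_grad()
    loss = fair_loss(model, X, y, a, lam=1.0)
    loss.backward()
    opt.step()
```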
2. Model training / optimization
Adversarial Training (taster)
• Learn a classifier that predicts credit scores while being agnostic to the species of the applicant.
[Architecture diagram: tabular demographic data → Encoder E → hidden representation h; h → Classifier C → predicted label (high vs. low score); h → Adversary A → predicted protected attribute (human vs. martian)]
• E maps input to latent representation h
• C uses h to predict target label: h should be good at predicting yˆ.
• A uses h to predict protected attribute: h should be bad at predicting gˆ.
L = L_C(ŷ_i, y_i) + L_A(ĝ_i, g_i)
(minimize the classifier loss L_C; maximize the adversary loss L_A)
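A compact PyTorch sketch of this setup using a gradient-reversal layer, which is one common way to implement the "minimize L_C, maximize L_A" objective with respect to the encoder. The network sizes and the random data are placeholders.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips (and scales) the gradient on the
    backward pass, so the encoder is pushed to make the adversary's job harder."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

encoder    = nn.Sequential(nn.Linear(8, 16), nn.ReLU())  # E: x -> h
classifier = nn.Linear(16, 1)                            # C: h -> y_hat
adversary  = nn.Linear(16, 1)                            # A: h -> g_hat

opt = torch.optim.Adam(list(encoder.parameters()) +
                       list(classifier.parameters()) +
                       list(adversary.parameters()), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

X = torch.randn(64, 8)                    # hypothetical applicant features
y = torch.randint(0, 2, (64, 1)).float()  # credit label
g = torch.randint(0, 2, (64, 1)).float()  # protected attribute (human/martian)

for _ in range(200):
    h = encoder(X)
    loss_c = bce(classifier(h), y)                          # C should predict y well
    loss_a = bce(adversary(GradReverse.apply(h, 1.0)), g)   # E should hide g from A
    (loss_c + loss_a).backward()
    opt.step()
    opt.zero_grad()
```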
2. Model training / optimization
Adversarial Training (taster)
• Learn a classifier that predicts the sentiment of a tweet (positive, negative) while being agnostic to the demographic group of the author.
[Architecture diagram: Encoder → hidden representation h; h → Classifier → sentiment (positive vs. negative); h → Adversary → predicted protected attribute (demographics: AAE vs. SAE)]
3. Post-processing
Modify the classifier predictions (scores s or labels yˆ)
• E.g., decide on individual thresholds per group, such that:
ŷ_i = 1 if s_i > θ_i   (where θ_i is the threshold chosen for the group of instance i)
• Come up with a special strategy for “difficult” instances, i.e., instances where P(yˆ) ≈ 0.5
• Model-independent
• Even works with proprietary / black-box models
• Needs access to protected attribute at test time
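A minimal sketch of per-group thresholding (the scores, groups and threshold values are invented; in practice the thresholds would be tuned on held-out data, e.g. to equalise TPR across groups):

```python
import numpy as np

def predict_with_group_thresholds(scores, a, thresholds):
    """y_hat_i = 1 if s_i > theta for that instance's group:
    a separate decision threshold per protected group."""
    scores, a = np.asarray(scores), np.asarray(a)
    return (scores > np.array([thresholds[g] for g in a])).astype(int)

scores = [0.62, 0.48, 0.55, 0.71, 0.45, 0.58]   # hypothetical classifier scores
A      = ["m", "m", "m", "h", "h", "h"]         # protected attribute per instance
thresholds = {"m": 0.45, "h": 0.55}             # hypothetical per-group thresholds
print(predict_with_group_thresholds(scores, A, thresholds))
```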
Some (optional, but excellent) talks
Predictive Policing: The danger of predictive algorithms in criminal justice https://www.youtube.com/watch?v=p-82YeUPQh0
Impacts of Machine Learning: Humans Need Not Apply https://www.youtube.com/watch?v=7Pq-S557XQU&feature=youtu.be
Tutorial on Fairness in ML (Far beyond the scope of this subject) https://fairmlbook.org/tutorial1.html
Netflix Documentary: Coded bias https://www.netflix.com/au/title/81328723
Today… Fair machine learning
• What is Bias and where does it come from?
• Algorithmic Fairness
• How to make ML algorithms fairer
• Guest lecture by Dr. (ML Ethicist)
“Social Media Sentiment Matters, data science, #tweets”
• Subject overview and exam info
References
Harini Suresh and John Guttag (2019). "A Framework for Understanding Unintended Consequences of Machine Learning." arXiv preprint arXiv:1901.10002.
Gordon, Jonathan, and Benjamin Van Durme. "Reporting bias and knowledge acquisition." Proceedings of the 2013 Workshop on Automated Knowledge Base Construction. 2013.
Verma, Sahil, and Julia Rubin. "Fairness definitions explained." 2018 IEEE/ACM International Workshop on Software Fairness (FairWare). IEEE, 2018.
Hardt, Moritz, Eric Price, and Nathan Srebro. "Equality of opportunity in supervised learning." arXiv preprint arXiv:1610.02413 (2016).
Solon Barocas, Moritz Hardt, and Arvind Narayanan (2019). "Fairness and Machine Learning." http://www.fairmlbook.org