Machine learning lecture slides
COMS 4771 Fall 2020
Classification III: Classification objectives
Outline
Scoring functions
Cost-sensitive classification
Conditional probability estimation
Reducing multi-class to binary
Fairness in classification
Scoring functions in general
Statistical model: (X, Y ) ∼ P for distribution P over X × {−1, +1}
Binary classifiers are generally of the form x → sign(h(x))
for some scoring function h: X → R
E.g. Bayes classifier uses scoring function h(x) = η(x) − 1/2
where η(x) = Pr(Y = +1 | X = x)
Use with loss functions like ℓ_{0/1}, ℓ_{logistic}, ℓ_{sq}, ℓ_{msq}, ℓ_{hinge}
R(h) = E[ℓ(Y h(X))]
Issues to consider:
Different types of mistakes have different costs
How to get Pr(Y = +1 | X = x) from h(x)?
More than two classes
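A minimal Python/NumPy sketch of a scoring function used with these losses (the linear h and the toy data are illustrative assumptions, not anything prescribed above):

```python
# A minimal sketch: a linear scoring function and a few standard surrogate
# losses evaluated at the margin z = y * h(x).
import numpy as np

def h(X, w, b):
    """Linear scoring function; the classifier predicts sign(h(x))."""
    return X @ w + b

def zero_one(z):  return (z <= 0).astype(float)     # l_{0/1}
def logistic(z):  return np.log1p(np.exp(-z))       # l_logistic
def squared(z):   return (1.0 - z) ** 2             # l_sq (uses y^2 = 1)
def hinge(z):     return np.maximum(0.0, 1.0 - z)   # l_hinge

# Empirical risk (1/n) sum_i l(y_i h(x_i)) on a toy dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X @ np.array([1.0, -1.0]) > 0, +1, -1)
w, b = np.array([1.0, -1.0]), 0.0
z = y * h(X, w, b)
for name, loss in [("0/1", zero_one), ("logistic", logistic),
                   ("squared", squared), ("hinge", hinge)]:
    print(name, loss(z).mean())
```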
Cost-sensitive classification
Cost matrix for different kinds of mistakes (for c ∈ [0, 1])
           ŷ = −1    ŷ = +1
y = −1        0         c
y = +1      1 − c       0

(Why can we restrict attention to c ∈ [0, 1]?)
Cost-sensitive ℓ-loss:
ℓ^(c)(y, ŷ) = (1{y=+1} · (1 − c) + 1{y=−1} · c) · ℓ(y ŷ).
If ℓ is convex in ŷ, then so is ℓ^(c)(y, ·)
Cost-sensitive (empirical) risk:
R^(c)(h) := E[ℓ^(c)(Y, h(X))]
R̂^(c)(h) := (1/n) Σ_{i=1}^n ℓ^(c)(yi, h(xi))
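A minimal NumPy sketch of the cost-sensitive loss and its empirical risk (the hinge base loss and the toy scores are illustrative assumptions):

```python
# A minimal sketch (labels in {-1, +1}): the cost-sensitive loss l^(c) as a
# weighted version of a base margin loss, and the empirical risk.
import numpy as np

def cost_sensitive_loss(y, score, c, base_loss):
    """l^(c)(y, yhat) = (1{y=+1}(1 - c) + 1{y=-1} c) * l(y * yhat)."""
    weight = np.where(y == +1, 1.0 - c, c)
    return weight * base_loss(y * score)

def empirical_cost_sensitive_risk(scores, y, c, base_loss):
    """R-hat^(c)(h) = (1/n) sum_i l^(c)(y_i, h(x_i))."""
    return cost_sensitive_loss(y, scores, c, base_loss).mean()

# Example with the hinge loss as the base loss l.
hinge = lambda z: np.maximum(0.0, 1.0 - z)
y = np.array([+1, +1, -1, -1])
scores = np.array([0.8, -0.2, -1.5, 0.4])   # h(x_i)
print(empirical_cost_sensitive_risk(scores, y, c=0.3, base_loss=hinge))
```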
Minimizing cost-sensitive risk
What is the analogue of Bayes classifier for cost-sensitive (zero-one loss) risk?
Let η(x) = Pr(Y = +1 | X = x)
Fix x; what is conditional cost-sensitive risk of predicting yˆ?
η(x) · (1 − c) · 1{ŷ = −1} + (1 − η(x)) · c · 1{ŷ = +1}.
Minimized when
ŷ = +1 if η(x) · (1 − c) > (1 − η(x)) · c (equivalently, if η(x) > c), and ŷ = −1 otherwise
So use scoring function h(x) = η(x) − c
Equivalently, use η as scoring function, but threshold at c
instead of 1/2
Where does c come from?
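A minimal sketch of thresholding an estimate of η at c (the η̂ values below are illustrative):

```python
# A minimal sketch: the cost-sensitive analogue of the Bayes classifier
# thresholds an estimate of eta(x) = Pr(Y = +1 | X = x) at c instead of 1/2,
# i.e., it uses the scoring function eta(x) - c.
import numpy as np

def predict_at_threshold(eta_hat, c):
    """Return +1 where eta_hat > c, else -1 (sign of eta_hat - c)."""
    return np.where(eta_hat > c, +1, -1)

print(predict_at_threshold(np.array([0.1, 0.4, 0.7]), c=0.3))  # [-1 +1 +1]
```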
Example: balanced error rate
Balanced error rate: BER := (1/2) · FNR + (1/2) · FPR
Which cost-sensitive risk to try to minimize?
2 · BER = Pr(h(X) ≤ 0 | Y = +1) + Pr(h(X) > 0 | Y = −1)
        = (1/π) · Pr(h(X) ≤ 0 ∧ Y = +1) + (1/(1 − π)) · Pr(h(X) > 0 ∧ Y = −1)
where π = Pr(Y = +1).
Therefore, we want to use the following cost matrix:
           ŷ = −1      ŷ = +1
y = −1        0       1/(1 − π)
y = +1      1/π           0
This corresponds to c = π.
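A minimal NumPy sketch checking this correspondence on a sample: with c = π, the cost-sensitive zero-one risk equals 2π(1 − π) · BER (the random labels and classifier are illustrative):

```python
# A minimal sketch: on an empirical sample with pi taken as the sample fraction
# of positives, the cost-sensitive zero-one risk with c = pi equals
# 2 * pi * (1 - pi) * BER exactly.
import numpy as np

def balanced_error_rate(y, yhat):
    fnr = np.mean(yhat[y == +1] == -1)   # Pr(h(X) <= 0 | Y = +1)
    fpr = np.mean(yhat[y == -1] == +1)   # Pr(h(X) > 0  | Y = -1)
    return 0.5 * fnr + 0.5 * fpr

def cost_sensitive_01_risk(y, yhat, c):
    costs = np.where(y == +1, (1 - c) * (yhat == -1), c * (yhat == +1))
    return costs.mean()

rng = np.random.default_rng(0)
y = np.where(rng.random(10000) < 0.3, +1, -1)     # about 30% positives
yhat = np.where(rng.random(10000) < 0.5, +1, -1)  # some arbitrary classifier
pi = np.mean(y == +1)
print(cost_sensitive_01_risk(y, yhat, c=pi))
print(2 * pi * (1 - pi) * balanced_error_rate(y, yhat))  # same value
```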
Importance-weighted risk
Perhaps the world tells you how important each example is
Statistical model: (X, Y, W) ∼ P
W is (non-negative) importance weight of example (X, Y)
Importance-weighted ℓ-risk of h:
E[W · ℓ(Y h(X))]
Estimate from data (x1, y1, w1), . . . , (xn, yn, wn):
(1/n) Σ_{i=1}^n wi · ℓ(yi h(xi))
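A minimal sketch of the importance-weighted empirical risk (the hinge loss and the toy values are illustrative assumptions):

```python
# A minimal sketch: the importance-weighted empirical l-risk
# (1/n) sum_i w_i * l(y_i * h(x_i)), for precomputed scores h(x_i).
import numpy as np

def importance_weighted_risk(scores, y, w, loss):
    return np.mean(w * loss(y * scores))

hinge = lambda z: np.maximum(0.0, 1.0 - z)
print(importance_weighted_risk(np.array([0.5, -1.0]), np.array([+1, -1]),
                               w=np.array([2.0, 0.5]), loss=hinge))
```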
Conditional probability estimation (1)
How to get an estimate of η(x) = Pr(Y = +1 | X = x)?
Useful if want to know expected cost of a prediction
E[ℓ^(c)_{0/1}(Y, h(X)) | X = x] = (1 − c) · η(x)   if h(x) ≤ 0,
                                = c · (1 − η(x))   if h(x) > 0
Squared loss risk is minimized by the scoring function
h(x) = 2η(x) − 1.
Therefore, given h, can estimate η using η̂(x) = (1 + h(x)) / 2
Recipe:
Find scoring function h that (approximately) minimizes
(empirical) squared loss risk
Construct conditional probability estimate ηˆ using above formula
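A minimal sketch of this recipe (assuming, for illustration, a linear h fit by ordinary least squares; any approximate squared-loss minimizer could be substituted):

```python
# A minimal sketch: fit a linear scoring function by least squares on labels
# in {-1, +1}, then convert scores to probability estimates via (1 + h) / 2.
import numpy as np

def fit_least_squares_scorer(X, y):
    """Fit h(x) = w . x + b by least squares against y in {-1, +1}."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(A, y.astype(float), rcond=None)
    w, b = coef[:-1], coef[-1]
    return lambda Xnew: Xnew @ w + b

def eta_hat(h, X):
    """Conditional probability estimate (1 + h(x)) / 2, clipped to [0, 1]."""
    return np.clip((1.0 + h(X)) / 2.0, 0.0, 1.0)
```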
Conditional probability estimation (2)
Similar strategy available for logistic loss
But not for hinge loss!
Hinge loss risk is minimized by h(x) = sign(2η(x) − 1)
Cannot recover η from h
Caveat: If using insufficiently expressive functions for h (e.g., linear functions), may be far from minimizing squared loss risk
Fix: use more flexible models (e.g., feature expansion)
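For the logistic loss, the analogous recovery applies the sigmoid to the score, since the population risk is minimized by the log-odds h(x) = ln(η(x)/(1 − η(x))); a minimal sketch:

```python
# A minimal sketch: recovering eta from a score that (approximately) minimizes
# the logistic-loss risk, whose population minimizer is the log-odds.
import numpy as np

def eta_hat_logistic(scores):
    return 1.0 / (1.0 + np.exp(-scores))
```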
Application: Reducing multi-class to binary
Multi-class: Conditional probability function is the vector-valued function
η(x) = (Pr(Y = 1 | X = x), …, Pr(Y = K | X = x))
Reduction: learn K scalar-valued functions, where the k-th function is supposed to approximate
ηk(x) = Pr(Y = k | X = x).
This can be done by creating K binary classification problems,
where in problem k, label is 1{y=k}.
Given the K learned conditional probability functions η̂1, …, η̂K, we form a final predictor f̂:
f̂(x) = argmax_{k=1,…,K} η̂k(x).
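A minimal sketch of the reduction (fit_prob_estimator is a placeholder for any routine that fits a conditional probability estimator to binary labels and returns a function X → estimates, e.g. the least-squares recipe above; classes are indexed 0, …, K − 1 here):

```python
# A minimal sketch of one-against-all: K binary problems with labels 1{y = k},
# then predict the class whose estimated probability is largest.
import numpy as np

def fit_one_against_all(X, y, num_classes, fit_prob_estimator):
    """Learn eta_hat_k for each class k using the binary labels 1{y = k}."""
    return [fit_prob_estimator(X, (y == k).astype(float))
            for k in range(num_classes)]

def predict_one_against_all(estimators, X):
    """f_hat(x) = argmax_k eta_hat_k(x)."""
    probs = np.column_stack([est(X) for est in estimators])
    return np.argmax(probs, axis=1)
```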
When does one-against-all work well?
If learned conditional probability functions ηˆk are accurate, then behavior of one-against-all classifier fˆ is similar to optimal classifier
f⋆(x) = argmax_{k=1,…,K} Pr(Y = k | X = x).
Claim:
err(f̂) ≤ err(f⋆) + 2 · E[ max_{k} |η̂k(X) − ηk(X)| ].
Fairness
Use of predictive models (e.g., in admissions, hiring, criminal justice) has raised concerns about whether they offer “fair treatment” to individuals and/or groups
We will focus on group-based fairness
Individual-based fairness also important, but not as well-studied
Disparate treatment
Often predictive models work better for some groups than for others
Example: face recognition (Buolamwini and Gebru, 2018; Lohr, 2018)
Possible causes of unfairness
People deliberately being unfair
Disparity in the amount of available training data for different groups
Disparity in usefulness of available features for different groups
Disparity in relevance of prediction problem for different groups
…
ProPublica study
ProPublica (investigative journalism group) studied a particular predictive model being used to determine “pre-trial detention”
Angwin et al., 2016
Judge needs to decide whether or not an arrested defendant
should be released while awaiting trial
Predictive model (“COMPAS”) provides an estimate of
Pr(Y = 1 | X = x), where
Y = 1{will commit (violent) crime if released} and X is the “features” of the
defendant.
Study argued that COMPAS treated black defendants unfairly
in a certain sense
What sense? How do they make this argument?
Fairness criteria
Setup:
X: features for individual
A: group membership attribute (e.g., race, sex, age, religion)
Y : outcome variable to predict (e.g., “will repay loan”, “will
re-offend”)
Yˆ : prediction of outcome variable (as function of (X, A))
For simplicity, assume A, Y , and Yˆ are {0, 1}-valued
Many fairness criteria are based on joint distribution of (A,Y,Yˆ)
Caveat: Often, we don’t have access to Y in training data
Classification parity
Fairness criterion: Classification parity
Pr(Ŷ = 1 | A = 0) ≈ Pr(Ŷ = 1 | A = 1)
Sounds reasonable, but easy to satisfy with perverse methods
Example: trying to predict Y = 1{will repay loan if given one}
Suppose conditional distributions of (Y, Yˆ ) given A are as
follows:
(A = 0)   Ŷ = 0   Ŷ = 1        (A = 1)   Ŷ = 0   Ŷ = 1
Y = 0      1/2      0           Y = 0      1/4     1/4
Y = 1       0      1/2          Y = 1      1/4     1/4
For A = 0 people, correctly give loans to people who will repay
For A = 1 people, give loans randomly (Bernoulli(1/2))
Satisfies criterion, but bad for A = 1 people
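A minimal sketch verifying the criterion in this example (the arrays are the conditional distributions above):

```python
# A minimal sketch: Pr(Yhat = 1 | A = a) is the column sum for Yhat = 1 in
# group a's table. Rows are Y = 0, 1; columns are Yhat = 0, 1.
import numpy as np

table_A0 = np.array([[0.50, 0.00],
                     [0.00, 0.50]])
table_A1 = np.array([[0.25, 0.25],
                     [0.25, 0.25]])

for name, t in [("A = 0", table_A0), ("A = 1", table_A1)]:
    print(name, "Pr(Yhat = 1) =", t[:, 1].sum())   # both equal 0.5
```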
Equalized odds (1)
Fairness criterion: Equalized odds
Pr(Ŷ = 1 | Y = y, A = 0) ≈ Pr(Ŷ = 1 | Y = y, A = 1)
for both y ∈ {0, 1}.
In particular, FPR and FNR must be (approximately) same
across groups.
Could also just ask for Equalized FPR, or Equalized FNR
Previous example fails to satisfy equalized odds:
(A = 0)   Ŷ = 0   Ŷ = 1        (A = 1)   Ŷ = 0   Ŷ = 1
Y = 0      1/2      0           Y = 0      1/4     1/4
Y = 1       0      1/2          Y = 1      1/4     1/4
E.g., A = 0 group has 0% FPR, while A = 1 has 50% FPR.
Criteria imply constraints on the classifier / scoring function
Can try to enforce constraint during training
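A minimal sketch of such a check from data (y, ŷ, and group labels assumed {0, 1}-valued, as above):

```python
# A minimal sketch: comparing FPR and FNR across groups from arrays of
# outcomes y, predictions yhat, and group labels a, as a check of
# (approximate) equalized odds.
import numpy as np

def group_error_rates(y, yhat, a, group):
    m = (a == group)
    fpr = np.mean(yhat[m & (y == 0)] == 1)   # Pr(Yhat = 1 | Y = 0, A = group)
    fnr = np.mean(yhat[m & (y == 1)] == 0)   # Pr(Yhat = 0 | Y = 1, A = group)
    return fpr, fnr
```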
Equalized odds (2)
ProPublica study:
Found that the FPR for the A = 0 group (black defendants; 45%) was
higher than the FPR for the A = 1 group (white defendants; 23%)
(A = 0)   Ŷ = 0   Ŷ = 1        (A = 1)   Ŷ = 0   Ŷ = 1
Y = 0      0.27    0.22         Y = 0      0.46    0.14
Y = 1      0.14    0.37         Y = 1      0.19    0.21
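A minimal sketch recomputing the per-group FPRs from these tables:

```python
# A minimal sketch: FPR = Pr(Yhat = 1 | Y = 0, A) from each group's table.
# Rows are Y = 0, 1; columns are Yhat = 0, 1.
import numpy as np

table_A0 = np.array([[0.27, 0.22],    # A = 0 (black defendants)
                     [0.14, 0.37]])
table_A1 = np.array([[0.46, 0.14],    # A = 1 (white defendants)
                     [0.19, 0.21]])

for name, t in [("A = 0", table_A0), ("A = 1", table_A1)]:
    fpr = t[0, 1] / t[0].sum()
    print(name, "FPR =", round(fpr, 2))   # 0.45 and 0.23
```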