Machine learning lecture slides
COMS 4771 Fall 2020
Classification III: Classification objectives
Outline
- Scoring functions
- Cost-sensitive classification
- Conditional probability estimation
- Reducing multi-class to binary
- Fairness in classification
Scoring functions in general
- Statistical model: (X, Y) ∼ P for a distribution P over X × {−1, +1}
- Binary classifiers are generally of the form

      x ↦ sign(h(x))

  for some scoring function h : X → ℝ (see the sketch below)
- E.g., the Bayes classifier uses the scoring function h(x) = η(x) − 1/2,
  where η(x) = Pr(Y = +1 | X = x)
- Use with loss functions like ℓ0/1, ℓlogistic, ℓsq, ℓmsq, ℓhinge:

      R(h) = E[ℓ(Y h(X))]

- Issues to consider:
  - Different types of mistakes have different costs
  - How to get Pr(Y = +1 | X = x) from h(x)?
  - More than two classes
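A minimal sketch of this setup, with a hypothetical linear scoring function (the weight vector below is made up purely for illustration):

```python
import numpy as np

# hypothetical linear scoring function h(x) = w·x; the weights are made up
w = np.array([0.8, -0.3])

def h(x):
    return x @ w

def classify(x):
    """Binary classifier of the form x ↦ sign(h(x))."""
    return 1 if h(x) > 0 else -1

print(classify(np.array([1.0, 0.5])))   # +1, since h(x) = 0.65 > 0
```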
Cost-sensitive classification
- Cost matrix for different kinds of mistakes (for c ∈ [0, 1]):

               ŷ = −1    ŷ = +1
      y = −1     0          c
      y = +1   1 − c        0

  (Why can we restrict attention to c ∈ [0, 1]?)
- Cost-sensitive ℓ-loss:

      ℓ^(c)(y, ŷ) = (1{y=+1} · (1 − c) + 1{y=−1} · c) · ℓ(yŷ)

- If ℓ is convex in ŷ, then so is ℓ^(c)(y, ·)
- Cost-sensitive (empirical) risk (see the sketch below):

      R^(c)(h) := E[ℓ^(c)(Y, h(X))]

      R̂^(c)(h) := (1/n) Σ_{i=1}^n ℓ^(c)(y_i, h(x_i))
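A minimal sketch of the empirical cost-sensitive risk, here using the logistic loss as the base loss ℓ (the function names are illustrative):

```python
import numpy as np

def logistic_loss(margins):
    """Logistic loss ℓ(z) = log(1 + exp(−z)), applied elementwise."""
    return np.log1p(np.exp(-margins))

def cost_sensitive_empirical_risk(scores, y, c, base_loss=logistic_loss):
    """Empirical cost-sensitive risk (1/n) Σ_i w_i · ℓ(y_i h(x_i)),
    where w_i = 1 − c if y_i = +1 and w_i = c if y_i = −1."""
    y = np.asarray(y)            # labels in {−1, +1}
    scores = np.asarray(scores)  # scores h(x_i)
    weights = np.where(y == +1, 1.0 - c, c)
    return np.mean(weights * base_loss(y * scores))
```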
Minimizing cost-sensitive risk
- What is the analogue of the Bayes classifier for cost-sensitive
  (zero-one loss) risk?
- Let η(x) = Pr(Y = +1 | X = x)
- Fix x; what is the conditional cost-sensitive risk of predicting ŷ?

      η(x) · (1 − c) · 1{ŷ=−1} + (1 − η(x)) · c · 1{ŷ=+1}

- Minimized when

      ŷ = +1  if η(x) · (1 − c) > (1 − η(x)) · c, i.e., if η(x) > c
      ŷ = −1  otherwise

- So use the scoring function h(x) = η(x) − c
- Equivalently, use η as the scoring function, but threshold at c
  instead of 1/2 (see the sketch below)
- Where does c come from?
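A minimal sketch of the resulting rule, thresholding conditional-probability estimates at c (names are illustrative):

```python
import numpy as np

def cost_sensitive_predict(eta_hat, c):
    """Predict +1 when the estimated conditional probability exceeds
    the cost threshold c, otherwise predict −1."""
    eta_hat = np.asarray(eta_hat)   # estimates of Pr(Y = +1 | X = x)
    return np.where(eta_hat > c, +1, -1)

# e.g., with c = 0.2, a point with η̂(x) = 0.3 is already predicted +1
print(cost_sensitive_predict([0.1, 0.3, 0.9], c=0.2))   # [-1  1  1]
```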
Example: balanced error rate
- Balanced error rate: BER := (1/2) FNR + (1/2) FPR
- Which cost-sensitive risk should we try to minimize?

      2 · BER = Pr(h(X) ≤ 0 | Y = +1) + Pr(h(X) > 0 | Y = −1)
              = (1/π) · Pr(h(X) ≤ 0 ∧ Y = +1) + (1/(1 − π)) · Pr(h(X) > 0 ∧ Y = −1)

  where π = Pr(Y = +1).
- Therefore, we want to use the following cost matrix:

               ŷ = −1      ŷ = +1
      y = −1     0         1/(1 − π)
      y = +1    1/π          0

- This corresponds to c = π (see the sketch below).
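A minimal sketch: estimate π from the training labels and use it as the threshold c (the function name is illustrative):

```python
import numpy as np

def ber_threshold(y_train):
    """Threshold c = π = Pr(Y = +1), estimated from training labels,
    for (approximately) minimizing the balanced error rate."""
    y_train = np.asarray(y_train)
    return np.mean(y_train == +1)   # empirical estimate of π

# e.g., with 30% positives we threshold η̂ at c ≈ 0.3 instead of 1/2
c = ber_threshold([+1, -1, -1, +1, -1, -1, -1, +1, -1, -1])
print(c)   # 0.3
```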
Importance-weighted risk
- Perhaps the world tells you how important each example is
- Statistical model: (X, Y, W) ∼ P
  - W is the (non-negative) importance weight of example (X, Y)
- Importance-weighted ℓ-risk of h:

      E[W · ℓ(Y h(X))]

- Estimate from data (x_1, y_1, w_1), …, (x_n, y_n, w_n) (see the sketch below):

      (1/n) Σ_{i=1}^n w_i · ℓ(y_i h(x_i))
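A minimal sketch, assuming scikit-learn is available; its sample_weight argument scales each example's contribution to the training loss, which matches the importance-weighted empirical risk above (the data here are made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# made-up data: features, labels in {−1, +1}, and non-negative importance weights
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([+1, -1, +1, -1])
w = np.array([2.0, 1.0, 0.5, 1.0])

# sample_weight scales each example's loss term, so the fit (approximately)
# minimizes a regularized importance-weighted empirical risk
clf = LogisticRegression().fit(X, y, sample_weight=w)

# the importance-weighted empirical logistic risk of an arbitrary scorer h:
def weighted_empirical_risk(scores, y, w):
    return np.mean(w * np.log1p(np.exp(-y * scores)))
```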
Conditional probability estimation (1)
- How to get an estimate of η(x) = Pr(Y = +1 | X = x)?
- Useful if we want to know the expected cost of a prediction:

      E[ℓ^(c)_{0/1}(Y, h(X)) | X = x] = (1 − c) · η(x)     if h(x) ≤ 0
                                        c · (1 − η(x))     if h(x) > 0

- Squared loss risk is minimized by the scoring function

      h(x) = 2η(x) − 1.

  Therefore, given h, we can estimate η using η̂(x) = (1 + h(x))/2
- Recipe (see the sketch below):
  - Find a scoring function h that (approximately) minimizes the
    (empirical) squared loss risk
  - Construct the conditional probability estimate η̂ using the above formula
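A minimal sketch of the recipe, assuming scikit-learn; Ridge is just one convenient regularized least-squares fitter, and the data are synthetic:

```python
import numpy as np
from sklearn.linear_model import Ridge

# synthetic data with ±1 labels
X = np.random.randn(200, 3)
y = np.where(X[:, 0] + 0.5 * np.random.randn(200) > 0, +1, -1)

h = Ridge(alpha=1.0).fit(X, y)             # h ≈ minimizer of empirical squared loss
scores = h.predict(X)                      # h(x) ≈ 2η(x) − 1
eta_hat = np.clip((1 + scores) / 2, 0, 1)  # η̂(x) = (1 + h(x))/2, clipped to [0, 1]
```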
Conditional probability estimation (2)
- A similar strategy is available for the logistic loss (see the sketch below)
- But not for the hinge loss!
  - Hinge loss risk is minimized by h(x) = sign(2η(x) − 1)
  - Cannot recover η from h
- Caveat: if using insufficiently expressive functions for h (e.g.,
  linear functions), we may be far from minimizing the squared loss risk
  - Fix: use more flexible models (e.g., feature expansion)
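For the logistic loss, the analogous conversion goes through the sigmoid; this sketch relies on the standard fact that the logistic-risk-minimizing scorer is the log-odds h(x) = log(η(x)/(1 − η(x))):

```python
import numpy as np

def eta_from_logistic_score(scores):
    """If h (approximately) minimizes logistic-loss risk, then
    h(x) ≈ log(η(x) / (1 − η(x))), so η̂(x) = 1 / (1 + exp(−h(x)))."""
    return 1.0 / (1.0 + np.exp(-np.asarray(scores)))
```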
Application: Reducing multi-class to binary
- Multi-class: the conditional probability function is a vector-valued function

      η(x) = ( Pr(Y = 1 | X = x), …, Pr(Y = K | X = x) )

- Reduction: learn K scalar-valued functions, where the k-th function
  is supposed to approximate

      η_k(x) = Pr(Y = k | X = x).

- This can be done by creating K binary classification problems,
  where in problem k, the label is 1{y=k}.
- Given the K learned conditional probability functions
  η̂_1, …, η̂_K, we form a final predictor f̂ (see the sketch below):

      f̂(x) = arg max_{k=1,…,K} η̂_k(x).
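A minimal sketch of the reduction, assuming scikit-learn; the squared-loss scorers (Ridge) and the (1 + h)/2 conversion follow the earlier recipe, but any conditional probability estimator could be plugged in:

```python
import numpy as np
from sklearn.linear_model import Ridge

def train_one_against_all(X, y, num_classes):
    """One-against-all: for each class k, fit a squared-loss scorer on
    ±1 labels 1{y = k} vs. the rest (classes are 0-indexed here)."""
    scorers = []
    for k in range(num_classes):
        yk = np.where(y == k, +1, -1)          # binary problem k: label 1{y = k}
        scorers.append(Ridge(alpha=1.0).fit(X, yk))
    return scorers

def predict_one_against_all(scorers, X):
    # η̂_k(x) = (1 + h_k(x)) / 2; final prediction is argmax_k η̂_k(x)
    eta_hat = np.column_stack([(1 + s.predict(X)) / 2 for s in scorers])
    return np.argmax(eta_hat, axis=1)
```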
When does one-against-all work well?
- If the learned conditional probability functions η̂_k are accurate,
  then the behavior of the one-against-all classifier f̂ is similar to that
  of the optimal classifier

      f*(x) = arg max_{k=1,…,K} Pr(Y = k | X = x).

- Claim (a proof sketch follows below):

      err(f̂) ≤ err(f*) + 2 · E[max_k |η̂_k(X) − η_k(X)|].
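A short argument for the claim (not spelled out on the slide, so treat it as a sketch); it uses err(f) = Pr(f(X) ≠ Y) = 1 − E[η_{f(X)}(X)]:

```latex
% Fix x and write k^* = f^*(x), \hat{k} = \hat{f}(x). Then
\eta_{k^*}(x) - \eta_{\hat{k}}(x)
  \le |\eta_{k^*}(x) - \hat{\eta}_{k^*}(x)|
    + \bigl(\hat{\eta}_{k^*}(x) - \hat{\eta}_{\hat{k}}(x)\bigr)
    + |\hat{\eta}_{\hat{k}}(x) - \eta_{\hat{k}}(x)|
  \le 2 \max_k |\hat{\eta}_k(x) - \eta_k(x)|,
% since \hat{\eta}_{k^*}(x) - \hat{\eta}_{\hat{k}}(x) \le 0 by definition of \hat{f}.
% Taking expectations over X:
\mathrm{err}(\hat{f}) - \mathrm{err}(f^*)
  = \mathbb{E}\bigl[\eta_{f^*(X)}(X) - \eta_{\hat{f}(X)}(X)\bigr]
  \le 2\,\mathbb{E}\bigl[\max_k |\hat{\eta}_k(X) - \eta_k(X)|\bigr].
```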
Fairness
- Use of predictive models (e.g., in admissions, hiring, criminal
  justice) has raised concerns about whether they offer "fair
  treatment" to individuals and/or groups
- We will focus on group-based fairness
  - Individual-based fairness is also important, but not as well studied
Disparate treatment
- Often, predictive models work better for some groups than for others
- Example: face recognition (Buolamwini and Gebru, 2018; Lohr, 2018)
http://proceedings.mlr.press/v81/buolamwini18a/buolamwini18a.pdf
Possible causes of unfairness
- People deliberately being unfair
- Disparity in the amount of available training data for different groups
- Disparity in the usefulness of available features for different groups
- Disparity in the relevance of the prediction problem for different groups
- …
ProPublica study
- ProPublica (an investigative journalism group) studied a particular
  predictive model being used to determine "pre-trial detention"
  - Angwin et al., 2016 (link below)
- A judge needs to decide whether or not an arrested defendant
  should be released while awaiting trial
- The predictive model ("COMPAS") provides an estimate of
  Pr(Y = 1 | X = x), where
  Y = 1{will commit a (violent) crime if released} and X is the "features" of the
  defendant.
- The study argued that COMPAS treated black defendants unfairly
  in a certain sense
  - In what sense? How do they make this argument?
https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
Fairness criteria
- Setup:
  - X: features for an individual
  - A: group membership attribute (e.g., race, sex, age, religion)
  - Y: outcome variable to predict (e.g., "will repay loan", "will re-offend")
  - Ŷ: prediction of the outcome variable (as a function of (X, A))
  - For simplicity, assume A, Y, and Ŷ are {0, 1}-valued
- Many fairness criteria are based on the joint distribution of (A, Y, Ŷ)
- Caveat: often, we don't have access to Y in training data
Classification parity
- Fairness criterion: Classification parity

      Pr(Ŷ = 1 | A = 0) ≈ Pr(Ŷ = 1 | A = 1)

- Sounds reasonable, but easy to satisfy with perverse methods
- Example: trying to predict Y = 1{will repay loan if given one}
- Suppose the conditional distributions of (Y, Ŷ) given A are as follows:

      (A = 0)    Ŷ = 0    Ŷ = 1          (A = 1)    Ŷ = 0    Ŷ = 1
      Y = 0       1/2       0             Y = 0       1/4      1/4
      Y = 1        0       1/2            Y = 1       1/4      1/4

- For A = 0 people, we correctly give loans to exactly the people who will repay
- For A = 1 people, we give loans randomly (Bernoulli(1/2))
- Satisfies the criterion, but bad for A = 1 people (see the check below)
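A quick numerical check of this example (the array layout mirrors the tables above):

```python
import numpy as np

# joint distribution of (Y, Ŷ) given A; rows index Y, columns index Ŷ
p_given_A0 = np.array([[0.5, 0.0],
                       [0.0, 0.5]])
p_given_A1 = np.array([[0.25, 0.25],
                       [0.25, 0.25]])

for name, p in [("A=0", p_given_A0), ("A=1", p_given_A1)]:
    accept_rate = p[:, 1].sum()    # Pr(Ŷ = 1 | A)
    accuracy = p[0, 0] + p[1, 1]   # Pr(Ŷ = Y | A)
    print(name, "Pr(Ŷ=1) =", accept_rate, " accuracy =", accuracy)
# both groups have Pr(Ŷ=1) = 0.5 (parity holds), but accuracy is 1.0 vs. 0.5
```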
Equalized odds (1)
- Fairness criterion: Equalized odds

      Pr(Ŷ = 1 | Y = y, A = 0) ≈ Pr(Ŷ = 1 | Y = y, A = 1)   for both y ∈ {0, 1}.

- In particular, FPR and FNR must be (approximately) the same across groups.
  - Could also just ask for equalized FPR, or equalized FNR
- The previous example fails to satisfy equalized odds:

      (A = 0)    Ŷ = 0    Ŷ = 1          (A = 1)    Ŷ = 0    Ŷ = 1
      Y = 0       1/2       0             Y = 0       1/4      1/4
      Y = 1        0       1/2            Y = 1       1/4      1/4

  E.g., the A = 0 group has 0% FPR, while the A = 1 group has 50% FPR
  (see the check below).
- Such criteria imply constraints on the classifier / scoring function
  - Can try to enforce the constraint during training
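A small check of the group-wise error rates from the same tables (helper names are illustrative):

```python
import numpy as np

def group_rates(joint):
    """joint[y, ŷ] = Pr(Y = y, Ŷ = ŷ | A); returns (FPR, FNR) for the group."""
    fpr = joint[0, 1] / joint[0, :].sum()   # Pr(Ŷ = 1 | Y = 0, A)
    fnr = joint[1, 0] / joint[1, :].sum()   # Pr(Ŷ = 0 | Y = 1, A)
    return fpr, fnr

print(group_rates(np.array([[0.5, 0.0], [0.0, 0.5]])))      # (0.0, 0.0) for A = 0
print(group_rates(np.array([[0.25, 0.25], [0.25, 0.25]])))  # (0.5, 0.5) for A = 1
```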
Equalized odds (2)
- ProPublica study:
  - Found that the FPR for the A = 0 group (black defendants; 45%) was
    higher than the FPR for the A = 1 group (white defendants; 23%)

      (A = 0)    Ŷ = 0    Ŷ = 1          (A = 1)    Ŷ = 0    Ŷ = 1
      Y = 0       0.27     0.22           Y = 0      0.46     0.14
      Y = 1       0.14     0.37           Y = 1      0.19     0.21

  (The arithmetic below recovers these rates from the tables.)
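Recovering the quoted false positive rates from the joint tables:

```latex
\mathrm{FPR}_{A=0} = \Pr(\hat{Y}=1 \mid Y=0, A=0) = \frac{0.22}{0.27 + 0.22} \approx 0.45,
\qquad
\mathrm{FPR}_{A=1} = \Pr(\hat{Y}=1 \mid Y=0, A=1) = \frac{0.14}{0.46 + 0.14} \approx 0.23.
```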