
Machine learning lecture slides
COMS 4771 Fall 2020

Classification III: Classification objectives

Outline
- Scoring functions
- Cost-sensitive classification
- Conditional probability estimation
- Reducing multi-class to binary
- Fairness in classification

Scoring functions in general
- Statistical model: (X, Y) ∼ P for a distribution P over X × {−1, +1}
- Binary classifiers are generally of the form x ↦ sign(h(x)) for some scoring function h: X → R
- E.g., the Bayes classifier uses the scoring function h(x) = η(x) − 1/2, where η(x) = Pr(Y = +1 | X = x)
- Use with loss functions like l_{0/1}, l_logistic, l_sq, l_msq, l_hinge, via the risk
    R(h) = E[l(Y h(X))]
  (a code sketch of these losses follows at the end of this slide)
- Issues to consider:
  - Different types of mistakes have different costs
  - How to get Pr(Y = +1 | X = x) from h(x)?
  - More than two classes
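To make these losses concrete, here is a minimal NumPy sketch of the margin losses listed above together with the plug-in empirical risk; the function names are our own (not from any library), and l_msq and other variants follow the same pattern. Each loss is written as a function of the margin z = y · h(x).

```python
import numpy as np

# Margin-based surrogate losses, each a function of the margin z = y * h(x).
def loss_zero_one(z):
    return np.where(z <= 0, 1.0, 0.0)        # l_0/1: counts sign errors

def loss_logistic(z):
    return np.log1p(np.exp(-z))              # l_logistic

def loss_squared(z):
    return (1.0 - z) ** 2                    # l_sq: equals (y - h(x))^2 for y in {-1,+1}

def loss_hinge(z):
    return np.maximum(0.0, 1.0 - z)          # l_hinge

def empirical_risk(loss, h, X, y):
    """Sample average of l(y_i * h(x_i)): a plug-in estimate of R(h) = E[l(Y h(X))]."""
    return np.mean(loss(y * h(X)))
```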

Cost-sensitive classification
- Cost matrix for different kinds of mistakes (for c ∈ [0, 1]):

                ŷ = −1    ŷ = +1
      y = −1      0          c
      y = +1    1 − c        0

  (Why can we restrict attention to c ∈ [0, 1]?)
- Cost-sensitive l-loss:
    l^(c)(y, ŷ) = (1{y=+1} · (1 − c) + 1{y=−1} · c) · l(y ŷ)
- If l is convex in ŷ, then so is l^(c)(y, ·)
- Cost-sensitive risk and empirical risk:
    R^(c)(h) := E[l^(c)(Y, h(X))]
    R̂^(c)(h) := (1/n) Σ_{i=1}^n l^(c)(y_i, h(x_i))
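The same plug-in idea applies to the cost-sensitive empirical risk. A minimal sketch (names are ours), where `loss` is any margin loss such as the zero-one loss from the earlier snippet:

```python
import numpy as np

def cost_sensitive_empirical_risk(loss, h, X, y, c):
    """Empirical cost-sensitive risk for labels y in {-1, +1}: each example is
    weighted by (1 - c) if y = +1 and by c if y = -1, matching the definition
    of the cost-sensitive loss l^(c) above."""
    weights = np.where(y == 1, 1.0 - c, c)
    return np.mean(weights * loss(y * h(X)))
```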

Minimizing cost-sensitive risk
- What is the analogue of the Bayes classifier for the cost-sensitive (zero-one loss) risk?
- Let η(x) = Pr(Y = +1 | X = x)
- Fix x; what is the conditional cost-sensitive risk of predicting ŷ?
    η(x) · (1 − c) · 1{ŷ = −1} + (1 − η(x)) · c · 1{ŷ = +1}
- Minimized when
    ŷ = +1 if η(x) · (1 − c) > (1 − η(x)) · c, and ŷ = −1 otherwise
- So use the scoring function h(x) = η(x) − c
- Equivalently, use η as the scoring function, but threshold at c instead of 1/2 (a plug-in sketch follows this slide)
- Where does c come from?
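As noted above, the minimizer is a plug-in rule that thresholds η at c. A minimal sketch, assuming some estimate eta_hat of η is already available (how to obtain one is the topic of the conditional probability estimation slides):

```python
import numpy as np

def cost_sensitive_plugin_classifier(eta_hat, c):
    """Return the plug-in rule that predicts +1 whenever the estimated
    conditional probability eta_hat(x) exceeds the threshold c, else -1."""
    def classify(X):
        return np.where(eta_hat(X) > c, 1, -1)
    return classify
```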

Example: balanced error rate
- Balanced error rate: BER := (1/2) FNR + (1/2) FPR
- Which cost-sensitive risk should we try to minimize?
    2 · BER = Pr(h(X) ≤ 0 | Y = +1) + Pr(h(X) > 0 | Y = −1)
            = (1/π) · Pr(h(X) ≤ 0 ∧ Y = +1) + (1/(1 − π)) · Pr(h(X) > 0 ∧ Y = −1),
  where π = Pr(Y = +1).
- Therefore, we want to use the following cost matrix:

                ŷ = −1      ŷ = +1
      y = −1      0        1/(1 − π)
      y = +1     1/π           0

- This corresponds to c = π (rescaling both costs by π(1 − π) gives the earlier form, with c = π and 1 − c = 1 − π).

Importance-weighted risk
- Perhaps the world tells you how important each example is
- Statistical model: (X, Y, W) ∼ P
  - W is the (non-negative) importance weight of the example (X, Y)
- Importance-weighted l-risk of h:
    E[W · l(Y h(X))]
- Estimate from data (x_1, y_1, w_1), …, (x_n, y_n, w_n):
    (1/n) Σ_{i=1}^n w_i · l(y_i h(x_i))
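A one-function sketch of the importance-weighted empirical risk (names are ours; `loss` is any margin loss, `w` holds the non-negative importance weights):

```python
import numpy as np

def importance_weighted_empirical_risk(loss, h, X, y, w):
    """Sample estimate of E[W * l(Y h(X))] from triples (x_i, y_i, w_i)."""
    return np.mean(w * loss(y * h(X)))
```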

Conditional probability estimation (1)
- How to get an estimate of η(x) = Pr(Y = +1 | X = x)?
- Useful if we want to know the expected cost of a prediction:
    E[l^(c)_{0/1}(Y, h(X)) | X = x] = (1 − c) · η(x)     if h(x) ≤ 0
                                      c · (1 − η(x))     if h(x) > 0
- Squared loss risk is minimized by the scoring function
    h(x) = 2η(x) − 1.
  Therefore, given h, we can estimate η using η̂(x) = (1 + h(x)) / 2
- Recipe (a sketch follows this slide):
  - Find a scoring function h that (approximately) minimizes the (empirical) squared loss risk
  - Construct the conditional probability estimate η̂ using the above formula
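A sketch of this recipe, assuming a linear scoring function fit by ordinary least squares on ±1 labels; clipping the estimate to [0, 1] is an extra safeguard not stated on the slide:

```python
import numpy as np

def fit_eta_hat_via_squared_loss(X, y):
    """Fit h(x) = w·x + b by minimizing empirical squared loss on labels
    y in {-1, +1}, then convert to a probability estimate via (1 + h(x)) / 2."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])     # append intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)     # least-squares fit of (w, b)
    def eta_hat(X_new):
        h = np.hstack([X_new, np.ones((X_new.shape[0], 1))]) @ coef
        return np.clip((1.0 + h) / 2.0, 0.0, 1.0)    # estimate of Pr(Y = +1 | X = x)
    return eta_hat
```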


Conditional probability estimation (2)
- Similar strategy available for the logistic loss
- But not for the hinge loss!
  - Hinge loss risk is minimized by h(x) = sign(2η(x) − 1)
  - Cannot recover η from h
- Caveat: if using insufficiently expressive functions for h (e.g., linear functions), we may be far from minimizing the squared loss risk
  - Fix: use more flexible models (e.g., feature expansion)

Application: Reducing multi-class to binary
- Multi-class: the conditional probability function is vector-valued:
    η(x) = ( Pr(Y = 1 | X = x), …, Pr(Y = K | X = x) )
- Reduction: learn K scalar-valued functions, where the k-th function is supposed to approximate
    η_k(x) = Pr(Y = k | X = x).
- This can be done by creating K binary classification problems, where in problem k the label is 1{y = k}.
- Given the K learned conditional probability functions η̂_1, …, η̂_K, we form the final predictor f̂ (a sketch follows this slide):
    f̂(x) = argmax_{k = 1, …, K} η̂_k(x)
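A sketch of the reduction, assuming classes are labeled 0, …, K − 1 and that `fit_eta` is any binary conditional-probability learner (for instance, the squared-loss recipe sketched earlier):

```python
import numpy as np

def one_against_all(fit_eta, X, y, K):
    """Train K binary conditional-probability estimators, one per class k with
    labels 1{y = k} mapped to {-1, +1}, and predict the class with the largest
    estimate eta_hat_k(x)."""
    eta_hats = [fit_eta(X, np.where(y == k, 1.0, -1.0)) for k in range(K)]
    def predict(X_new):
        scores = np.column_stack([eta_k(X_new) for eta_k in eta_hats])
        return np.argmax(scores, axis=1)             # f_hat(x) = argmax_k eta_hat_k(x)
    return predict
```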

When does one-against-all work well?
- If the learned conditional probability functions η̂_k are accurate, then the behavior of the one-against-all classifier f̂ is similar to that of the optimal classifier
    f*(x) = argmax_{k = 1, …, K} Pr(Y = k | X = x).
- Claim:
    err(f̂) ≤ err(f*) + 2 · E[max_k |η̂_k(X) − η_k(X)|]

Fairness
- Use of predictive models (e.g., in admissions, hiring, criminal justice) has raised concerns about whether they offer “fair treatment” to individuals and/or groups
- We will focus on group-based fairness
  - Individual-based fairness is also important, but not as well-studied

Disparate treatment
- Often predictive models work better for some groups than for others
- Example: face recognition (Buolamwini and Gebru, 2018; Lohr, 2018)

Possible causes of unfairness
- People deliberately being unfair
- Disparity in the amount of available training data for different groups
- Disparity in the usefulness of available features for different groups
- Disparity in the relevance of the prediction problem for different groups
- …

ProPublica study
- ProPublica (an investigative journalism group) studied a particular predictive model being used to determine “pre-trial detention”
  - Angwin et al., 2016
- The judge needs to decide whether or not an arrested defendant should be released while awaiting trial
- The predictive model (“COMPAS”) provides an estimate of Pr(Y = 1 | X = x), where Y = 1{will commit (violent) crime if released} and X is the “features” of the defendant.
- The study argued that COMPAS treated black defendants unfairly in a certain sense
  - What sense? How do they make this argument?

Fairness criteria
- Setup:
  - X: features for an individual
  - A: group membership attribute (e.g., race, sex, age, religion)
  - Y: outcome variable to predict (e.g., “will repay loan”, “will re-offend”)
  - Ŷ: prediction of the outcome variable (as a function of (X, A))
  - For simplicity, assume A, Y, and Ŷ are {0, 1}-valued
- Many fairness criteria are based on the joint distribution of (A, Y, Ŷ)
- Caveat: often, we don’t have access to Y in training data

Classification parity
- Fairness criterion: classification parity
    Pr(Ŷ = 1 | A = 0) ≈ Pr(Ŷ = 1 | A = 1)
- Sounds reasonable, but easy to satisfy with perverse methods
- Example: trying to predict Y = 1{will repay loan if given one}
- Suppose the conditional distributions of (Y, Ŷ) given A are as follows:

    (A = 0)   Ŷ = 0   Ŷ = 1        (A = 1)   Ŷ = 0   Ŷ = 1
    Y = 0      1/2      0           Y = 0      1/4     1/4
    Y = 1       0      1/2          Y = 1      1/4     1/4

- For A = 0 people, correctly give loans to exactly the people who will repay
- For A = 1 people, give loans randomly (Bernoulli(1/2))
- Satisfies the criterion, but bad for A = 1 people (see the check below)
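A quick NumPy check of the parity claim, using the two joint tables above (rows index Y, columns index Ŷ):

```python
import numpy as np

joint_A0 = np.array([[0.50, 0.00],
                     [0.00, 0.50]])
joint_A1 = np.array([[0.25, 0.25],
                     [0.25, 0.25]])

# Classification parity compares Pr(Y_hat = 1 | A = a) across groups.
for name, joint in [("A=0", joint_A0), ("A=1", joint_A1)]:
    print(name, "Pr(Y_hat = 1) =", joint[:, 1].sum())   # both print 0.5
```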

Equalized odds (1)
- Fairness criterion: equalized odds
    Pr(Ŷ = 1 | Y = y, A = 0) ≈ Pr(Ŷ = 1 | Y = y, A = 1)   for both y ∈ {0, 1}.
- In particular, FPR and FNR must be (approximately) the same across groups.
- Could also just ask for equalized FPR, or equalized FNR
- The previous example fails to satisfy equalized odds:

    (A = 0)   Ŷ = 0   Ŷ = 1        (A = 1)   Ŷ = 0   Ŷ = 1
    Y = 0      1/2      0           Y = 0      1/4     1/4
    Y = 1       0      1/2          Y = 1      1/4     1/4

  E.g., the A = 0 group has a 0% FPR, while the A = 1 group has a 50% FPR.
- These criteria imply constraints on the classifier / scoring function
  - Can try to enforce the constraint during training

Equalized odds (2)
- ProPublica study:
  - Found that the FPR for the A = 0 group (black defendants; 45%) was higher than the FPR for the A = 1 group (white defendants; 23%)

    (A = 0)   Ŷ = 0   Ŷ = 1        (A = 1)   Ŷ = 0   Ŷ = 1
    Y = 0      0.27    0.22         Y = 0      0.46    0.14
    Y = 1      0.14    0.37         Y = 1      0.19    0.21
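Reading the group-wise FPRs off these tables reproduces the quoted numbers; a small NumPy check (rows index Y, columns index Ŷ):

```python
import numpy as np

joint_A0 = np.array([[0.27, 0.22],
                     [0.14, 0.37]])   # A = 0 (black defendants)
joint_A1 = np.array([[0.46, 0.14],
                     [0.19, 0.21]])   # A = 1 (white defendants)

def false_positive_rate(joint):
    """FPR = Pr(Y_hat = 1 | Y = 0), read off the Y = 0 row of the joint table."""
    return joint[0, 1] / joint[0].sum()

print(false_positive_rate(joint_A0))   # about 0.45
print(false_positive_rate(joint_A1))   # about 0.23
```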