Machine learning lecture slides

COMS 4771 Fall 2020


Classification III: Classification objectives

Outline

I Scoring functions
I Cost-sensitive classification
I Conditional probability estimation
I Reducing multi-class to binary
I Fairness in classification


Scoring functions in general

I Statistical model: (X,Y ) ∼ P for distribution P over
X × {−1,+1}

I Binary classifiers are generally of the form

x ↦ sign(h(x))

for some scoring function h : X → R
I E.g. Bayes classifier uses scoring function h(x) = η(x)− 1/2

where η(x) = Pr(Y = +1 | X = x)
I Use with loss functions like ℓ_{0/1}, ℓ_{logistic}, ℓ_{sq}, ℓ_{msq}, ℓ_{hinge}

R(h) = E[ℓ(Y h(X))]

I Issues to consider:
I Different types of mistakes have different costs
I How to get Pr(Y = +1 | X = x) from h(x)?
I More than two classes


Cost-sensitive classification

I Cost matrix for different kinds of mistakes (for c ∈ [0, 1])

         ŷ = −1   ŷ = +1
y = −1   0        c
y = +1   1 − c    0

(Why can we restrict attention to c ∈ [0, 1]?)
I Cost-sensitive ℓ-loss:

ℓ^{(c)}(y, ŷ) = (1{y=+1} · (1 − c) + 1{y=−1} · c) · ℓ(y ŷ).

I If ℓ is convex in ŷ, then so is ℓ^{(c)}(y, ·)
I Cost-sensitive (empirical) risk:

R^{(c)}(h) := E[ℓ^{(c)}(Y, h(X))]

R̂^{(c)}(h) := (1/n) · Σ_{i=1}^n ℓ^{(c)}(y_i, h(x_i))
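A minimal numpy sketch of the cost-sensitive empirical risk, assuming labels in {−1, +1}, real-valued scores h(x_i), and the logistic loss as the base loss ℓ (function and array names here are illustrative):

    import numpy as np

    def logistic_loss(margins):
        # base loss ℓ(z) = log(1 + exp(−z)), applied to margins z = y · h(x)
        return np.log1p(np.exp(-margins))

    def cost_sensitive_empirical_risk(y, scores, c):
        # weight (1 − c) on positive examples, c on negative examples
        weights = np.where(y == +1, 1.0 - c, c)
        return np.mean(weights * logistic_loss(y * scores))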


Minimizing cost-sensitive risk

I What is the analogue of Bayes classifier for cost-sensitive
(zero-one loss) risk?

I Let η(x) = Pr(Y = 1 | X = x)
I Fix x; what is conditional cost-sensitive risk of predicting ŷ?

η(x) · (1− c) · 1{ŷ=−1} + (1− η(x)) · c · 1{ŷ=+1}.

I Minimized when

ŷ = +1 if η(x) · (1 − c) > (1 − η(x)) · c, and ŷ = −1 otherwise

I So use scoring function h(x) = η(x)− c
I Equivalently, use η as scoring function, but threshold at c

instead of 1/2
I Where does c come from?
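Once an estimate η̂ of the conditional probability is available, the cost-sensitive prediction rule is just a threshold at c; a small sketch (eta_hat is a hypothetical array of estimates):

    import numpy as np

    def predict_with_cost(eta_hat, c):
        # predict +1 exactly when η̂(x) · (1 − c) > (1 − η̂(x)) · c, i.e., η̂(x) > c
        return np.where(eta_hat > c, +1, -1)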


Example: balanced error rate

I Balanced error rate: BER := (1/2) · FNR + (1/2) · FPR

I Which cost-sensitive risk should we try to minimize?

2 · BER = Pr(h(X) ≤ 0 | Y = +1) + Pr(h(X) > 0 | Y = −1)
        = (1/π) · Pr(h(X) ≤ 0 ∧ Y = +1) + (1/(1 − π)) · Pr(h(X) > 0 ∧ Y = −1)

where π = Pr(Y = +1).
I Therefore, we want to use the following cost matrix:

         ŷ = −1     ŷ = +1
y = −1   0          1/(1 − π)
y = +1   1/π        0

I This corresponds to c = π.
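A sketch of estimating BER on held-out data, and of the threshold c = π estimated from training labels (array names are hypothetical):

    import numpy as np

    def balanced_error_rate(y, y_pred):
        # y, y_pred take values in {−1, +1}
        fnr = np.mean(y_pred[y == +1] == -1)   # false negative rate
        fpr = np.mean(y_pred[y == -1] == +1)   # false positive rate
        return 0.5 * fnr + 0.5 * fpr

    # pi_hat = np.mean(y_train == +1)   # estimate of π; threshold η̂ at pi_hat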


Importance-weighted risk

I Perhaps the world tells you how important each example is
I Statistical model: (X,Y,W ) ∼ P

I W is (non-negative) importance weight of example (X,Y )
I Importance-weighted ℓ-risk of h:

E[W · ℓ(Y h(X))]

I Estimate from data (x1, y1, w1), . . . , (xn, yn, wn):

(1/n) · Σ_{i=1}^n w_i · ℓ(y_i h(x_i))
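A sketch of this plug-in estimate, again taking the logistic loss for ℓ (arrays y, scores, w are hypothetical):

    import numpy as np

    def importance_weighted_risk(y, scores, w):
        # y in {−1, +1}, scores = h(x_i), w = non-negative importance weights
        losses = np.log1p(np.exp(-y * scores))   # ℓ(y_i h(x_i)) with logistic ℓ
        return np.mean(w * losses)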


Conditional probability estimation (1)

I How to get estimate of η(x) = Pr(Y = +1 | X = x)?
I Useful if we want to know the expected cost of a prediction

E[ℓ^{(c)}_{0/1}(Y h(X)) | X = x] = (1 − c) · η(x)    if h(x) ≤ 0
                                 = c · (1 − η(x))    if h(x) > 0

I Squared loss risk minimized by scoring function

h(x) = 2η(x)− 1.

Therefore, given h, can estimate η using η̂(x) = (1 + h(x))/2
I Recipe:

I Find scoring function h that (approximately) minimizes
(empirical) squared loss risk

I Construct conditional probability estimate η̂ using above formula
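A sketch of this recipe with a linear scoring function: ordinary least squares on ±1 labels minimizes the empirical squared loss, and the formula above converts scores into probability estimates (X_train, y_train are hypothetical arrays; any other squared-loss minimizer would do in place of LinearRegression):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def fit_probability_estimator(X_train, y_train):
        # fit h to (approximately) minimize empirical squared loss; y_train in {−1, +1}
        h = LinearRegression().fit(X_train, y_train)
        # η̂(x) = (1 + h(x)) / 2, clipped since a linear h need not lie in [−1, 1]
        return lambda X: np.clip((1.0 + h.predict(X)) / 2.0, 0.0, 1.0)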


Conditional probability estimation (2)

I Similar strategy available for logistic loss
I But not for hinge loss!

I Hinge loss risk is minimized by h(x) = sign(2η(x)− 1)
I Cannot recover η from h

I Caveat: If using insufficiently expressive functions for h (e.g.,
linear functions), may be far from minimizing squared loss risk
I Fix: use more flexible models (e.g., feature expansion)


Application: Reducing multi-class to binary

I Multi-class: Conditional probability function is vector-valued
function

η(x) = (Pr(Y = 1 | X = x), . . . , Pr(Y = K | X = x))

I Reduction: learn K scalar-valued functions, the k-th function
is supposed to approximate

ηk(x) = Pr(Y = k | X = x).

I This can be done by creating K binary classification problems,
where in problem k, the label is 1{y=k}.

I Given the K learned conditional probability functions
η̂1, . . . , η̂K, we form a final predictor f̂ by

f̂(x) = arg max_{k=1,…,K} η̂k(x).
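A sketch of the reduction using the squared-loss recipe from the previous slides for each of the K binary problems (LinearRegression is just one convenient squared-loss minimizer; y_train is assumed to hold labels 1, …, K):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def one_against_all_fit(X_train, y_train, K):
        # problem k uses binary labels 2 · 1{y = k} − 1 ∈ {−1, +1}
        return [LinearRegression().fit(X_train, 2.0 * (y_train == k) - 1.0)
                for k in range(1, K + 1)]

    def one_against_all_predict(models, X):
        # η̂_k(x) = (1 + h_k(x)) / 2; predict the class with the largest estimate
        eta_hat = np.column_stack([(1.0 + m.predict(X)) / 2.0 for m in models])
        return np.argmax(eta_hat, axis=1) + 1   # map back to labels 1, …, K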


When does one-against-all work well?

I If learned conditional probability functions η̂k are accurate,
then behavior of one-against-all classifier f̂ is similar to optimal
classifier

f⋆(x) = arg max_{k=1,…,K} Pr(Y = k | X = x).

I Claim:

err(f̂) ≤ err(f⋆) + 2 · E[max_k |η̂k(X) − ηk(X)|].


Fairness

I Use of predictive models (e.g., in admissions, hiring, criminal
justice) has raised concerns about whether they offer “fair
treatment” to individuals and/or groups
I We will focus on group-based fairness
I Individual-based fairness also important, but not as well-studied


Disparate treatment

I Often predictive models work better for some groups than for
others
I Example: face recognition (Buolamwini and Gebru, 2018; Lohr, 2018)


http://proceedings.mlr.press/v81/buolamwini18a/buolamwini18a.pdf

Possible causes of unfairness

I People deliberately being unfair
I Disparity in the amount of available training data for different groups
I Disparity in usefulness of available features for different groups
I Disparity in relevance of prediction problem for different groups
I . . .


ProPublica study

I ProPublica (investigative journalism group) studied a particular
predictive model being used to determine “pre-trial detention”
I Angwin et al., 2016
I Judge needs to decide whether or not an arrested defendant
should be released while awaiting trial
I Predictive model (“COMPAS”) provides an estimate of Pr(Y = 1 | X = x),
where Y = 1{will commit (violent) crime if released} and X is the “features” of the defendant.

I Study argued that COMPAS treated black defendants unfairly
in a certain sense
I What sense? How do they make this argument?


https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

Fairness criteria

I Setup:
I X: features for individual
I A: group membership attribute (e.g., race, sex, age, religion)
I Y : outcome variable to predict (e.g., “will repay loan”, “will re-offend”)
I Ŷ : prediction of outcome variable (as function of (X,A))
I For simplicity, assume A, Y , and Ŷ are {0, 1}-valued

I Many fairness criteria are based on joint distribution of (A, Y, Ŷ)

I Caveat: Often, we don’t have access to Y in training data


Classification parity

I Fairness criterion: Classification parity

Pr(Ŷ = 1 | A = 0) ≈ Pr(Ŷ = 1 | A = 1)

I Sounds reasonable, but easy to satisfy with perverse methods
I Example: trying to predict Y = 1{will repay loan if given one}
I Suppose conditional distributions of (Y, Ŷ) given A are as follows:

(A = 0)   Ŷ = 0   Ŷ = 1
Y = 0     1/2     0
Y = 1     0       1/2

(A = 1)   Ŷ = 0   Ŷ = 1
Y = 0     1/4     1/4
Y = 1     1/4     1/4

I For A = 0 people, correctly give loans to people who will repay
I For A = 1 people, give loans randomly (Bernoulli(1/2))
I Satisfies criterion, but bad for A = 1 people


Equalized odds (1)

I Fairness criterion: Equalized odds

Pr(Ŷ = 1 | Y = y,A = 0) ≈ Pr(Ŷ = 1 | Y = y,A = 1)

for both y ∈ {0, 1}.
I In particular, FPR and FNR must be (approximately) the same
across groups (see the sketch below).
I Could also just ask for Equalized FPR, or Equalized FNR

I Previous example fails to satisfy equalized odds:

(A = 0)   Ŷ = 0   Ŷ = 1
Y = 0     1/2     0
Y = 1     0       1/2

(A = 1)   Ŷ = 0   Ŷ = 1
Y = 0     1/4     1/4
Y = 1     1/4     1/4

E.g., A = 0 group has 0% FPR, while A = 1 has 50% FPR.
I Criteria imply constraints on the classifier / scoring function

I Can try to enforce constraint during training
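A sketch of the empirical check referenced above, computing the per-group positive-prediction rate (for classification parity) and FPR/FNR (for equalized odds) from hypothetical {0, 1}-valued arrays y, y_hat, a:

    import numpy as np

    def group_rates(y, y_hat, a, group):
        m = (a == group)
        pos_rate = np.mean(y_hat[m] == 1)          # Pr(Ŷ = 1 | A = group)
        fpr = np.mean(y_hat[m & (y == 0)] == 1)    # Pr(Ŷ = 1 | Y = 0, A = group)
        fnr = np.mean(y_hat[m & (y == 1)] == 0)    # Pr(Ŷ = 0 | Y = 1, A = group)
        return pos_rate, fpr, fnr

    # compare group_rates(y, y_hat, a, 0) against group_rates(y, y_hat, a, 1)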

Equalized odds (2)

I ProPublica study:
I Found that FPR for the A = 0 group (black defendants; 45%) was
higher than FPR for the A = 1 group (white defendants; 23%)

(A = 0)   Ŷ = 0   Ŷ = 1
Y = 0     0.27    0.22
Y = 1     0.14    0.37

(A = 1)   Ŷ = 0   Ŷ = 1
Y = 0     0.46    0.14
Y = 1     0.19    0.21
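These rates follow directly from the joint tables: for the A = 0 group, FPR = Pr(Ŷ = 1 | Y = 0, A = 0) = 0.22 / (0.27 + 0.22) ≈ 0.45, while for the A = 1 group, FPR = 0.14 / (0.46 + 0.14) ≈ 0.23.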

