50 Years of Test (Un)fairness: Lessons for Machine Learning
Ben Hutchinson and Margaret Mitchell

{benhutch,mmitchellai}@google.com

ABSTRACT
Quantitative definitions of what is unfair and what is fair have been
introduced in multiple disciplines for well over 50 years, including

in education, hiring, and machine learning. We trace how the no-

tion of fairness has been defined within the testing communities of

education and hiring over the past half century, exploring the cul-

tural and social context in which different fairness definitions have

emerged. In some cases, earlier definitions of fairness are similar

or identical to definitions of fairness in current machine learning

research, and foreshadow current formal work. In other cases, in-

sights into what fairness means and how to measure it have largely

gone overlooked. We compare past and current notions of fairness

along several dimensions, including the fairness criteria, the focus

of the criteria (e.g., a test, a model, or its use), the relationship of fair-

ness to individuals, groups, and subgroups, and the mathematical

method for measuring fairness (e.g., classification, regression). This

work points the way towards future research and measurement of

(un)fairness that builds from our modern understanding of fairness

while incorporating insights from the past.

ACM Reference Format:
Ben Hutchinson and Margaret Mitchell. 2019. 50 Years of Test (Un)fairness:

Lessons for Machine Learning. In Proceedings of FAT* ’19: Conference on
Fairness, Accountability, and Transparency (FAT* ’19). ACM, New York, NY,
USA, 11 pages. https://doi.org/10.1145/3287560.3287600

1 INTRODUCTION
The United States Civil Rights Act of 1964 effectively outlawed

discrimination on the basis of an individual’s race, color, religion,

sex, or national origin. The Act contained two important provisions

that would fundamentally shape the public’s understanding of what

it meant to be unfair, with lasting impact into modern day: Title VI,
which prevented government agencies that receive federal funds

(including universities) from discriminating on the basis of race,

color or national origin; and Title VII, which prevented employers

with 15 or more employees from discriminating on the basis of race,

color, religion, sex or national origin.

Assessment tests used in public and private industry immedi-

ately came under public scrutiny. The question posed by many

at the time was whether the tests used to assess ability and fit in

education and employment were discriminating on bases forbidden

by the new law [2]. This stimulated a wealth of research into how

to mathematically measure unfair bias and discrimination within

Permission to make digital or hard copies of all or part of this work for personal or

classroom use is granted without fee provided that copies are not made or distributed

for profit or commercial advantage and that copies bear this notice and the full citation

on the first page. Copyrights for components of this work owned by others than ACM

must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,

to post on servers or to redistribute to lists, requires prior specific permission and/or a

fee. Request permissions from .

FAT* ’19, January 29–31, 2019, Atlanta, GA, USA
© 2019 Association for Computing Machinery.

ACM ISBN 978-1-4503-6125-5/19/01. . . $15.00

https://doi.org/10.1145/3287560.3287600

the educational and employment testing communities, often with a

focus on race. The period of time from 1966 to 1976 in particular

gave rise to fairness research with striking parallels to ML fair-

ness research from 2011 until today, including formal notions of

fairness based on population subgroups, the realization that some

fairness criteria are incompatible with one another, and pushback

on quantitative definitions of fairness due to their limitations.

Into the 1970s, there was a shift in perspective, with researchers

moving from defining how a test may be unfair to how a test may
be fair. It is during this time that we see the introduction of mathe-
matical criteria for fairness identical to the mathematical criteria

of modern day. Unfortunately, this fairness movement largely dis-

appeared by the end of the 1970s, as the different and sometimes

competing notions of fairness left little room for clarity on when

one notion of fairness may be preferable to another. Following

the retrospective analysis of Nancy Cole [15], who introduced the

equivalent of Hardt et al.’s 2016 equality of opportunity [32] in 1973:

The spurt of research on fairness issues that began in the

late 1960s had results that were ultimately disappointing.

No generally agreed upon method to determine whether

or not a test is fair was developed. No statistic that could

unambiguously indicate whether or not an item is fair

was identified. There were no broad technical solutions

to the issues involved in fairness.

By learning from this past, we hope to avoid such a fate.

Before further diving in to the history of testing fairness, it is

useful to briefly consider the structural correspondences between

tests and ML models. Test items (questions) are analogous to model

features, and item responses analogous to specific activations of

those features. Scoring a test is typically a simple linearmodel which

produces a (possibly weighted) sum of the item scores. Sometimes

test scores are normalized or standardized so that scores fit a desired

range or distribution. Because of this correspondence, much of the

math is directly comparable; and many of the underlying ideas in

earlier fairness work trivially map on to modern day ML fairness.

“History doesn’t repeat itself, but it often rhymes”; and by hearing

this rhyme, we hope to gain insight into the future of ML fairness.
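To make this correspondence concrete, here is a minimal sketch (ours, with made-up item weights and responses, not taken from the paper) of test scoring as a linear model over item responses:

```python
import numpy as np

# Hypothetical 5-item test: item responses play the role of feature activations,
# and item weights play the role of linear-model coefficients.
item_weights = np.array([1.0, 1.0, 2.0, 1.0, 0.5])   # assumed weights
item_responses = np.array([1, 0, 1, 1, 1])            # one test-taker's item scores

raw_score = float(item_weights @ item_responses)      # (possibly weighted) sum of item scores

# Scores are often standardized to fit a desired range or distribution.
population_mean, population_std = 3.2, 1.1            # assumed population statistics
standardized_score = (raw_score - population_mean) / population_std
print(raw_score, standardized_score)
```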

Following terminology of the social sciences, applied statistics,

and the notation of [4], we use “demographic variable” to refer

to an attribute of individuals such as race, age or gender, denoted

by the symbol A. We use “subgroup” to denote a group of indi-
viduals defined by a shared value of a demographic variable, e.g.,

A = a. Y indicates the ground truth or target variable, R denotes a
score output by a model or a test, and D denotes a binary decision
made using that score. We occasionally make exceptions when

referencing original material.

2 HISTORY OF FAIRNESS IN TESTING
2.1 1960s: Bias and Unfair Discrimination
Concerned with the fairness of tests for black and white students,

T. Anne Cleary defined a quantitative measure of test bias for the

arXiv:1811.10104v2 [cs.AI] 3 Dec 2018


Figure 1: Petersen and Novick’s [52] original figures demonstrating fairness criteria. (a) Labels on regression lines indicate which subgroup they fit. (b) The regression line labeled πc fits both subgroups separately (and hence also their union). The marginal distributions of test scores and ground truth scores for subgroups π1 and π2 are shown by the axes.

first time, cast in terms of a formal model for predicting educational

outcomes from test scores [10, 11]:

A test is biased for members of a subgroup of the popula-
tion if, in the prediction of a criterion for which the test

was designed, consistent nonzero errors of prediction

are made for members of the subgroup. In other words,

the test is biased if the criterion score predicted from the

common regression line is consistently too high or too

low for members of the subgroup. With this definition

of bias, there may be a connotation of “unfair,” particu-

larly if the use of the test produces a prediction that is

too low. (Emphasis added.)

According to Cleary’s criterion, the situation depicted in Fig-

ure 1a is biased for members of subgroup π2 if the regression line
π1 is used to predict their ability, since it underpredicts their true
ability. For Cleary, the situation depicted in Figure 1b is not biased:

since data from each of the subgroups produce the same regression

line, that line can be used to make predictions for either group.
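Cleary’s criterion can be probed empirically by fitting a single regression line to the pooled data and inspecting per-subgroup prediction errors. The sketch below is our own illustration on simulated data, not Cleary’s original ANCOVA-based procedure; all variable names are hypothetical.

```python
import numpy as np

def mean_residual_by_group(test_scores, outcomes, groups):
    """Mean prediction error per subgroup from a common (pooled) regression line.

    A consistently positive mean residual means the common line under-predicts
    the criterion for that subgroup; consistently negative means it over-predicts."""
    slope, intercept = np.polyfit(test_scores, outcomes, deg=1)  # common regression line
    residuals = outcomes - (slope * test_scores + intercept)
    return {g: float(residuals[groups == g].mean()) for g in np.unique(groups)}

# Hypothetical data: R = test score, Y = criterion (e.g., later GPA), A = subgroup label.
rng = np.random.default_rng(0)
R = rng.normal(size=500)
A = rng.integers(0, 2, size=500)
Y = 0.8 * R + 0.3 * A + rng.normal(scale=0.5, size=500)  # subgroup 1 is under-predicted
print(mean_residual_by_group(R, Y, A))
```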

In addition to defining bias in terms of predictions by regression

models, Cleary also performed a study on real-world data from three

state-supported and state-subsidized schools, comparing college

GPA with SAT scores. Racial data was obtained from an admissions

office, from an NAACP list of black students, and from examining

class pictures. Cleary used Analysis of Covariance (ANCOVA) to

test the relationships of SAT and HSR scores with GPA grades.

Contrary to some expectations, Cleary found little evidence of the

SAT being a biased predictor of GPA. (Later, larger studies found

that the SAT overpredicted the GPA of black students [64]; it may

be that the SAT is biased but less so than the GPA.)

While Cleary’s focus was on education, her contemporary Robert

Guion was concerned with unfair discrimination in employment.

Arguing for the importance of quantitative analyses in 1966, he

wrote that: “Illegal discrimination is largely an ethical matter, but

the fulfillment of ethical responsibility begins with technical compe-

tence” [30], and defined unfair discrimination to be “when persons

with equal probabilities of success on the job have unequal proba-

bilities of being hired for the job.” However, Guion recognized the

challenges in using constructs such as the probability of success.

We can observe actual success and failure after selection, but the

probability of success is not itself observable, and a sophisticated

model is required to estimate it at the time of selection.

By the end of the 1960s, there was political and legal support

backing concerns with the unfairness of the educational system for

black children and the unfairness of tests purporting to measure

black intellectual competence. Responding to these concerns, the

Association of Black Psychologists, formed in 1969, immediately

published “A Petition of Concerns”, calling for a moratorium on

standardized tests “(which are used) to maintain and justify the

practice of systematically denying economic opportunities” [66].

The NAACP followed up on this in 1974 by adopting a resolution

that demanded “a moratorium on standardized testing wherever

such tests have not been corrected for cultural bias” (cited by [56]).

Meanwhile, advocates of testing worried that alternatives to testing such as interviews would introduce more subjective bias [26]. (For example, the origins of the college entrance essay are rooted in Ivy League universities’ covert attempts to suppress the numbers of Jewish students, whose performance on entrance exams had led them to become an increasing percentage of the student population [38].)

2.2 1970s: Fairness
As the 1960s turned to the 1970s, work began to arise that parallels

the recent evolution of work in ML fairness, marking a change

in framing from unfairness to fairness. Following Thorndike [62],
“The discussion of ‘fairness’ in what has gone before is clearly over-

simplified. In particular, it has been based upon the premise that

the available criterion score is a perfectly relevant, reliable and

unbiased measure…” Thorndike’s sentiment was shared by other

academics of the time, who, in examining the earlier work of Cleary,

objected that it failed to take into account the differing false positive

and false negative rates that occur when subgroups have different

base rates (i.e., A is not independent of Y ) [24, 62].
With the goal of moving beyond simplified models, Thorndike

[62] proposed one of the first quantitative criteria for measuring

test fairness. With this shift, Thorndike advocated for considering
the contextual use of a test:

A judgment on test-fairness must rest on the inferences

that are made from the test rather than on a comparison

of mean scores in the two populations. One must then

focus attention on fair use of the test scores, rather than

on the scores themselves.



Contrary to Cleary, Thorndike argued that sharing a common

regression line is not important, as one can achieve fair selection

goals by using different regression lines and different selection

thresholds for the two groups.

As an alternative to Cleary, Thorndike proposed that the ratio

of predicted positives to ground truth positives be equal for each

group. Using confusion matrix terminology, this is equivalent to

requiring that the ratio (TP + FP)/(TP + FN) be equal for each subgroup. According to Thorndike, the situation in Figure 1a is fair for test cutoff x∗. Figure 1b is unfair using any single threshold, but fair if threshold x∗1 is used for group π1 and threshold x∗2 is used for group π2.
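For a binary decision, Thorndike’s criterion can be checked directly from per-subgroup confusion matrices, as in the following sketch (our own illustration; function names are hypothetical):

```python
import numpy as np

def predicted_to_actual_positive_ratio(y_true, y_pred):
    """Thorndike-style ratio (TP + FP) / (TP + FN): predicted positives over ground-truth positives."""
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    return (tp + fp) / (tp + fn)

def constant_ratio_by_group(y_true, y_pred, groups):
    """Thorndike's criterion is (approximately) satisfied when these ratios match across subgroups."""
    return {g: predicted_to_actual_positive_ratio(y_true[groups == g], y_pred[groups == g])
            for g in np.unique(groups)}
```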
Similar to modern day ML fairness, e.g., Friedler et al. in 2016 [28],

Thorndike also pointed out the tension between individual notions

of fairness and group notions of fairness: “the two definitions of

fairness—one based on predicted criterion score for individuals and

the other on the distribution of criterion scores in the two groups—

will always be in conflict.” The conflict was also raised by others in

the period, including Sawyer et al. [57], in a foreshadowing of the

compas debate of 2016:

A conflict arises because the success maximization pro-

cedures based on individual parity do not produce equal

opportunity (equal selection for equal success) based on

group parity and the opportunity procedures do not pro-

duce success maximization (equal treatment for equal

prediction) based on individual parity.

Almost as an aside, Thorndike mentions the existence of another

regression line ignored by Cleary: the line that estimates the value

of the test score R given the target variable Y . This idea hints at the
notion of equal opportunity for those with a given value of Y , an
idea which soon was picked up by Darlington [19] and Cole [14].

At a glance, Cleary’s and Thorndike’s definitions are difficult to

compare directly because of the different ways in which they’re

defined. Darlington [19] helped to shed light on the relationship

between Cleary and Thorndike’s conceptions of fairness by express-

ing them in a common formalism. He defines four fairness criteria

in terms of the correlation ρAR between the demographic variable
and the test score. Following Darlington,

(1) Cleary’s criterion can be restated in terms of correlations of the “culture variable” with test scores. If Cleary’s criterion holds for every subgroup, then ρAR = ρAY / ρRY [63].[2]

(2) Similarly, Thorndike’s criterion is equivalent to requiring that ρAR = ρAY.

(3) The criterion ρAR = ρAY × ρRY is motivated by thinking about R as a dependent variable affected by independent variables A and Y. If A has no direct effect on R once Y is taken into account then we have a zero partial correlation, i.e. ρAR.Y = 0.[3]

(4) An alternative “starkly simple” criterion of ρAR = 0 (recognizable as modern day demographic parity [23]) is introduced but not dwelt on.
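Since all four of Darlington’s criteria are stated in terms of ordinary correlations, they are straightforward to evaluate on data. A minimal sketch (ours; it assumes A, R and Y are given as numeric arrays):

```python
import numpy as np

def darlington_targets(A, R, Y):
    """Observed rho_AR alongside the value each of Darlington's criteria would require."""
    corr = lambda u, v: float(np.corrcoef(u, v)[0, 1])
    rho_AR, rho_AY, rho_RY = corr(A, R), corr(A, Y), corr(R, Y)
    return {
        "observed rho_AR": rho_AR,
        "(1) rho_AY / rho_RY": rho_AY / rho_RY,
        "(2) rho_AY": rho_AY,
        "(3) rho_AY * rho_RY": rho_AY * rho_RY,
        "(4) zero": 0.0,
    }
```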

Darlington’s mapping of Cleary’s and Thorndike’s criteria lets him prove that they’re incompatible except in the special cases where the test perfectly predicts the target variable (ρRY = 1), or where the target variable is uncorrelated with the demographic variable (ρAY = 0).

[2] Although Darlington does not mention this additional constraint, we believe the criterion only holds if A, R and Y have a multivariate normal distribution.
[3] See footnote 2.

Figure 2: Darlington’s original graph of fair values of the correlation between culture and test score (rCX in Darlington’s notation), plotted against the correlation between test score and ground truth (rXY), according to his definitions (1–4). (The correlation between the demographic and target variables is assumed here to be fixed at 0.2.)

Figure 2, reproduced from Darlington’s 1971 work,
shows that, for any given non-zero correlation between the demo-

graphic and target variables, definitions (1), (2), and (3) converge

as the correlation between the test score and the target variable

approaches 1. When the test has only a poor correlation with the

target variable, there may be no fair solution using definition (1).

Figure 2 enables a range of further observations. According to

definition (1), for a given correlation between demographic and

target variables, the lower the correlation of the test with the target

variable, the higher it is allowed to correlate with the demographic

variable and still be considered fair. Definition (3), on the other hand,

is the opposite, in that the lower the correlation of the test with

the target variable, the lower too must be the test’s correlation

with the demographic variable. Darlington’s criterion (2) is the

geometric mean of criteria (1) and (3): “a compromise position

midway between [the] two… however, a compromise may end up

satisfying nobody; psychometricians are not in the habit of agreeing

on important definitions or theorems by compromise.” Darlington

shows that definition (3) is the only one of the four whose errors

are uncorrelated with the demographic variable, where by “errors”,

he means errors in the regression task of estimating Y from R.
In 1973, Cole [14] continued exploring ideas of equal outcomes

across subgroups, defining fairness as all subgroups having the same

True Positive Rate (TPR), recognizable as modern day equality of

opportunity [32]. That same year, Linn [43] introduced (but did not

advocate for) equal Positive Predictive Value (PPV) as a fairness

criterion, recognizable as modern day predictive parity [9]. (Although he cites [30] and [24], a seeming misattribution, as pointed out by [52].)
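For binary decisions, both criteria reduce to comparing familiar confusion-matrix quantities across subgroups. A hedged sketch of such a comparison (our own illustration, not code from the paper):

```python
import numpy as np

def tpr_and_ppv_by_group(y_true, y_pred, groups):
    """Per-subgroup TPR (Cole, 1973: equal TPRs = equality of opportunity) and
    PPV (Linn, 1973: equal PPVs = predictive parity)."""
    out = {}
    for g in np.unique(groups):
        t, p = y_true[groups == g], y_pred[groups == g]
        tp = int(np.sum((p == 1) & (t == 1)))
        fn = int(np.sum((p == 0) & (t == 1)))
        fp = int(np.sum((p == 1) & (t == 0)))
        out[g] = {"TPR": tp / (tp + fn), "PPV": tp / (tp + fp)}
    return out
```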

Under Cleary and Darlington’s conceptions, bias or (un)fairness

is a property of the test itself. This is contrary to Thorndike, Linn

and Cole, who take fairness to be a property of the use of a test.

The latter group tended to assume that a test is static, and focused

on optimizing its use; whereas Cleary’s concerns were with how

to improve the tests themselves. Cleary worked for Educational Testing Services, and one can imagine a test being designed to allow for a range of use cases, since it may not be knowable in advance either i) the precise populations on which it will be deployed, nor ii) the number of students to which an institution deploying the test is able to offer places.

Category | Description
individual | Fairness criterion defined purely in terms of individuals
non-comparative | Fairness criterion for each subgroup does not reference other subgroups
subgroup parity | Fairness criterion defined in terms of parity of some value across subgroups
correlation | Fairness criterion defined in terms of the correlation of the demographic variable with the model output
Table 1: Categories of Fairness Criteria

By March 1976, the interest in fairness in the educational testing

community was so strong that an entire issue of the Journal of

Education Measurement was devoted to the topic [47], including

a lengthy lead article by Peterson and Novick [52], in which they

consider for the first time the equality of True Negative Rates (TNR)

across subgroups, and equal TPR / equal TNR across subgroups

(modern day equalized odds [32]). Similarly, they consider the case

of equal PPV and equal NPV across subgroups. (They do not advocate for either combination, neither equal TPR and TNR nor equal PPV and NPV, on the grounds that either combination requires unusual circumstances. However, there is a flaw in their reasoning: for example, arguing against equal TPR and equal TNR, they claim that this requires equal base rates in the ground truth in addition to equal TPR.)

Work from the mid-1960s to mid-1970s can be summarized along

four distinct categories: individual, non-comparative, subgroup

parity, and correlation, defined in Table 1. It should be empha-

sized that researchers who defined a criterion did not always advocate for it. In particular, Darlington, Linn, Jones, and

Peterson and Novick all define criteria purely for the purposes of

exploring the space of concepts related to fairness. A summary of

fairness technical definitions during this time is listed in Table 2.

2.3 Mid-1970s: The Fairness Tide Turns
Immediately after the journal issue of 1976, research into quan-

titative definitions of test fairness seems to have come to a halt.

Considering why this happened offers a valuable lesson for modern day fairness research. The same Cole who in 1973

proposed equality of TPR, wrote in 2001 that [15]:

In short, research over the last 30 or so years has not

supplied any analyses to unequivocally indicate fairness

or unfairness, nor has it produced clear procedures to

avoid unfairness. To make matters worse, the views of

fairness of the measurement profession and the views of

the general public are often at odds.

Foreshadowing this outcome, statements from researchers in

the 1970s indicate an increasing concern with how fairness cri-

teria obscure “the fundamental problem, which is to find some

rational basis for providing compensatory treatment for the dis-

advantaged” [48]. Following Peterson and Novick, the concepts of culture-fairness and group parity are not viable in practice, leading to models that can sanction the discrimination they seek to rectify [52].

lem in maximizing expected utility [51], recognizing “high social

utility in equalizing opportunity and reducing disadvantage” [48].

A related thread of work highlights that different fairness cri-

teria encode different value systems [34], and that quantitative

techniques alone cannot answer the question of which to use. In

1971, Darlington [19] urges that the concept of “cultural fairness”

be replaced by “cultural optimality”, which takes into account a

policy-level question concerning the optimum balance between

accuracy and cultural factors. In 1974, Thorndike points out that

“one’s value system is deeply involved in one’s judgment as to what

is ‘fair use’ of a selection device” [48], and similarly, in 1976, Linn

[44] draws attention to the fact that “Values are implicit in the mod-

els. To adequately address issues of values they need to be dealt

with explicitly.” Hunter and Schmidt [34] begin to address this issue

by bringing ethical theory to the discussion, relating fairness to

theories of individualism and proportional representation. Current

work may learn from this point in history by explicitly connecting

fairness criteria to different cultural and social values.

2.4 1970s on: Differential Item Functioning
Concurrent with the development of criteria for the fair use of tests,

another line of research in the measurement community concerned

looking for bias in test questions (“items”). In 1968, Cleary and

Hilton [12] used an analysis of variance (ANOVA) design to test

the interaction between race, socioeconomic level and test item.

Ten years later, the related idea of Differential Item Functioning

(DIF) was introduced by Scheuneman in 1979 [58]: “an item is

considered unbiased if, for persons with the same ability in the

area being measured, the probability of a correct response on the

item is the same regardless of the population group membership

of the individual.” That is, if I = I (q) is the variable representing a
correct response on question q, then by this definition I is unbiased
if A ⊥ I |Y .
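A crude empirical reading of this definition is to stratify test-takers by (a proxy for) the measured ability and compare each group’s rate of correct responses within each stratum. The sketch below is a simplified illustration only; operational DIF procedures, such as Mantel-Haenszel, are more involved.

```python
import numpy as np

def item_correct_rates(item_correct, ability, groups, n_bins=5):
    """P(correct | ability stratum, group): large within-stratum gaps between groups suggest DIF."""
    edges = np.quantile(ability, np.linspace(0, 1, n_bins + 1))
    strata = np.clip(np.digitize(ability, edges[1:-1]), 0, n_bins - 1)
    rates = {}
    for s in range(n_bins):
        for g in np.unique(groups):
            mask = (strata == s) & (groups == g)
            if mask.any():
                rates[(s, g)] = float(item_correct[mask].mean())
    return rates
```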

In practice, the best measure of the ability that the item is testing

is often the test in which the item is a component [21]:

A major change from focusing primarily on fairness in a domain, where so many factors could spoil the validity effort, to a domain where analyses could be conducted in a relatively simple, less confounded way. … In a DIF analysis, the item is evaluated against something designed to measure a particular construct and something that the test producer controls, namely a test score.

Figure 3 illustrates DIF for a test item.

DIF became very influential in the education field, and to this day

DIF is in the toolbox of test designers. Items displaying DIF are

ideally examined further to identify the cause of bias, and possibly

removed from the test [50].

2.5 1980s and beyond
With the start of the 1980s came renewed public debate about the

existence of racial differences in general intelligence, and the impli-

cations for fair testing, following the publication of the controversial


Bias in Mental Testing [36]. Political opponents of group-based considerations in educational and employment practices framed them in terms of “preferential treatment” for minorities and “reverse discrimination” against whites. Despite, or perhaps because of, much public debate, neither Congress nor the courts gave unambiguous answers to the question of how to balance social justice considerations with the historical and legal importance placed on the individual in the United States [18].

Source | Criterion | Category | Proposition
Guion (1966) | “people with equal probabilities of success on the job have equal probabilities of being hired for the job” | individual | Is the use of the test fair?
Cleary (1966) | “a subgroup does not have consistent errors” | non-comparative | Is the test fair to subgroup a?
Einhorn and Bass (1971)† | Prob(Y > y∗ | R = ra∗, A = a) is constant for all subgroups a | subgroup parity | Is the use of the test fair with respect to A?
Thorndike (1971) | Prob(R ≥ ra∗ | A = a) / Prob(Y ≥ y∗ | A = a) is constant for all subgroups a | subgroup parity | Is the use of the test fair with respect to A?
Darlington (1971) (1) | ρAR = ρAY / ρRY (equivalent to ρAY.R = 0) | correlation | Is the test fair with respect to A?
Darlington (1971) (2) | ρAR = ρAY | correlation | Is the test fair with respect to A?
Darlington (1971) (3) | ρAR = ρAY × ρRY (equivalent to ρAR.Y = 0) | correlation | Is the test fair with respect to A?
Darlington (1971) (4) | ρAR = 0 | correlation | Is the test fair with respect to A?
Darlington (1971) culturally optimum† | ρR(Y−kA) is maximized, where k is the subjective value placed on subgroup attribute A = 1 | correlation | Does the test produce the optimal outcome w.r.t. A?
Cole (1973) | Prob(R ≥ ra∗ | Y ≥ y∗, A = a) is constant for all subgroups a | subgroup parity | Is the use of the test fair with respect to A?
Linn (1973) | Prob(Y ≥ y∗ | R ≥ ra∗, A = a) is constant for all subgroups a | subgroup parity | Is the use of the test fair with respect to A?
Jones (1973) mean fair† | E(Ŷ | a) = E(Y | a) | non-comparative | Is the test fair to subgroup a?
Jones (1973) general standard† | a subgroup a has equal representation in the top-n candidates ranked by model score as it has in the top-n candidates ranked by Y, for all n | non-comparative | Is the test fair to subgroup a?
Jones (1973) at position n† | a subgroup a has equal representation in the top-n candidates ranked by model score as it has in the top-n candidates ranked by Y | non-comparative | Is the use of the test fair to subgroup a?
Peterson & Novick (1976) conditional probability and its converse | Prob(R ≥ ra∗ | Y ≥ y∗, A = a) is constant for all subgroups a, and Prob(R < ra∗ | Y < y∗, A = a) is constant for all subgroups a | subgroup parity | Is the use of the test fair with respect to A?
Peterson & Novick (1976) equal probability and its converse | Prob(Y ≥ y∗ | R ≥ ra∗, A = a) is constant for all subgroups a, and Prob(Y < y∗ | R < ra∗, A = a) is constant for all subgroups a | subgroup parity | Is the use of the test fair with respect to A?
Table 2: Early technical definitions of fairness in educational and employment testing. Variables: R is the test score; Y is the target variable; A is the demographic variable. The Proposition column indicates whether fairness is considered a property of the way in which a test is used, or of the test itself. † indicates that the criterion is discussed in the appendix.
Into the 1980s, courts were asked to rule on many cases involving (un)fairness in educational testing. To give just one example, Zwick and Dorans [71] described the case of Debra P. v. Turlington 1984, in which a lawsuit was filed on behalf of “present and future twelfth grade students who had failed or would fail” a high school graduation test. The initial ruling found that the test perpetuated past discrimination and was in violation of the Civil Rights Act. More examples of court rulings on fairness are given by [53, 71].

By the early 1980s, ideas about fairness were having a widespread influence on U.S. employment practices. In 1981, with no public debate, the United States Employment Services implemented a score-adjustment strategy that was sometimes called “race-norming” [54]. Each individual is assigned a percentile ranking within their own ethnic group, rather than within the test-taking population as a whole. By the mid-1980s, race-norming was “a highly controversial issue sparking heated debate.” The debate was settled through legislation, with the 1991 Civil Rights Act banning the practice of race-norming [65].

Figure 3: Original graph from [22] illustrating DIF.

3 CONNECTIONS TO ML FAIRNESS

3.1 Equivalent Notions

Many of the fairness criteria we have overviewed are identical to modern-day fairness definitions. Here is a brief summary of these connections:

• Peterson and Novick’s “conditional probability and its converse” is equivalent to what in ML fairness is variously called separation [4], equalized odds [32], or conditional procedure accuracy [5], sometimes expressed as the conditional independence A ⊥ D | Y.

• Similarly, their “equal probability and its converse” is equivalent to what is called sufficiency [4] or conditional use accuracy equality [5], A ⊥ Y | D.

• Cole’s 1973 fairness definition is identical to equality of opportunity [32], A ⊥ D | Y = 1.

• Linn’s 1973 definition is equivalent to predictive parity [9], A ⊥ Y | D = 1.

• Darlington’s criterion (1) is equivalent to sufficiency in the special case where A, R and Y have a multivariate Gaussian distribution. This is because for this special case the partial correlation ρAY.R = 0 is equivalent to A ⊥ Y | R [3]. In general, though, we cannot assume even a one-way implication, since A ⊥ Y | R does not imply ρAY.R = 0 (see [63] for a counterexample).
• Similarly, Darlington’s criteria (2) and (3) are equivalent to independence and separation only in the special cases of multivariate Gaussian distributions.

• Darlington’s definition (4) is a relaxation of what is called independence [4] or demographic parity in ML fairness, i.e. A ⊥ R; it is equivalent when A and R have a bivariate Gaussian distribution.

• Guion’s definition “people with equal probabilities of success on the job have equal probabilities of being hired for the job” is a special case of Dwork’s [23] individual fairness, with the presupposition that “probability of success on the job” is a construct that can be meaningfully reasoned about.

The fairness literatures of both ML and testing have also been motivated by causal considerations [32, 41]. Darlington [19] motivates his definition (3) on the basis of a causal relationship between Y and R (since an ability being measured affects the performance on the test). However, [34] have pointed out that in testing scenarios we typically only have a proxy for ability, such as GPA four years later, and it is wrong to draw a causal connection from GPA to a college entrance exam.

Hardt et al. [32] describe the challenge in building causal models by considering two distinct models and their consequences, and concluding that “no test based only on the target labels, the protected attribute and the score would give different indications for the optimal score R∗ in the two scenarios.” This is remarkably reminiscent of Anastasi [1], writing in 1961 about test fairness:

No test can eliminate causality. Nor can a test score, however derived, reveal the origin of the behavior it reflects. If certain environmental factors influence behavior, they will also influence those samples of behavior covered by tests. When we use tests to compare different groups, the only question the tests can answer directly is: “How do these groups differ under existing cultural conditions?”

Both the testing fairness and ML fairness literatures have also paid great attention to impossibility results, such as the distinction between group fairness and individual fairness, and the impossibility of obtaining more than one of separation, sufficiency and independence except under special conditions [4, 9, 19, 40, 52, 62]. In addition, we see some striking parallels in the framing of fairness in terms of ethical theories, including explicit advocacy for utilitarian approaches.

• Petersen and Novick’s utility-based approaches relate to Corbett-Davies et al.’s framing of the cost of fairness [17].

• Hunter and Schmidt’s analysis of the value systems underlying fairness criteria is similar in spirit to Friedler et al.’s relation of fairness criteria and different worldviews [28].

3.2 Variable Independence

As briefly mentioned above, modern day ML fairness has categorized fairness definitions in terms of independence of variables, which includes sufficiency and separation [4]. Some historical notions of fairness neatly fit into this categorization, but others shed light on further dimensions of fairness criteria. Table 3 summarizes these connections, linking the historical criteria introduced in Section 2 to modern day categories. (Utility-based criteria are omitted, but will be discussed below.)

We find that non-comparative criteria (discussed by Cleary and Jones) do not map onto any of the independence conditions used in ML fairness. Similarly, Thorndike’s and Darlington’s have no counterparts that we know of.
There are conceptual similarities between Jones’ criteria and the constrained ranking problem described by [8], and also between Einhorn’s criterion and concerns about infra-marginality [60]. For a binary classifier, Thorndike’s 1971 group parity criterion is equivalent to requiring that the ratio of positive predictions to ground truth positives be equal for all subgroups. This ratio has no common name that we could find (unlike, e.g., precision, recall, etc.), although [52] refer to this as the “Constant Ratio Model”. It is closely related to coverage constraints [29], class mass normalization [70] and expectation regularization [45]. Similar arguments can be made for Darlington’s criterion (2) and Jones’ criteria “at position n” and “general criterion”. When viewed as a model of subgroup quotas [34], Thorndike’s criterion is reminiscent of fair division in economics.

Historical criterion | ML fairness criterion | Relationship
Guion (1966) | individual | relaxation
Cleary (1968) | sufficiency | when Cleary’s criterion holds for all subgroups then we have equivalence when R and Y have a bivariate Gaussian distribution
Einhorn and Bass (1971) | sufficiency | both involve the probability of Y conditioned on R, but Einhorn and Bass are only concerned with the conditional likelihood at the decision threshold
Thorndike (1971) | — | —
Darlington (1971) (1) | sufficiency | equivalent when variables have a multivariate Gaussian distribution
Darlington (1971) (2) | — | —
Darlington (1971) (3) | separation | equivalent when variables have a multivariate Gaussian distribution
Darlington (1971) (4) | independence | equivalent when variables have a bivariate Gaussian distribution
Cole (1973) | separation | relaxation (equivalent to equality of opportunity)
Linn (1973) | sufficiency | relaxation (equivalent to predictive parity)
Jones (1973) mean fair | — | —
Jones (1973) at position n | — | —
Jones (1973) general criterion | — | —
Peterson and Novick (1976) conditional probability and its converse | separation | equivalent
Peterson and Novick (1976) equal probability and its converse | sufficiency | equivalent
Table 3: Relationships between testing criteria and ML’s independence criteria

3.3 Regression and Correlation

In reviewing the history of fairness in testing, it becomes clear that regression models have played a much larger role than in the ML community. Similarly, the use of correlation as a fairness criterion is all but absent in the modern ML fairness literature.

Given that correlation of two variables is a weaker criterion than independence, it is reasonable to ask why one might want a fairness criterion defined in terms of correlations. One practical reason is that calculating correlations is a lot easier than estimating independence. Whereas correlation is a descriptive statistic, and so calculating it requires few assumptions, estimating independence requires the use of inferential statistics, which can in general be highly non-trivial [59].

Considering the analogy between model features and test items described in the Introduction, we also know of no ML analogs of Differential Item Functioning. Such analogs might test for bias in model features. Instead, one approach adopted in ML fairness has been the use of adversarial methods to mitigate the effects of features with undesirable correlations with subgroups, e.g., [6, 69].
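To make the practical contrast above concrete: a correlation-based criterion is a one-line descriptive statistic, whereas even a crude independence check already requires an inferential procedure. A minimal sketch (ours, with assumed variable names):

```python
import numpy as np

def correlation_criterion(A, R):
    """A correlation-based check is a one-line descriptive statistic."""
    return float(np.corrcoef(A, R)[0, 1])

def permutation_independence_pvalue(A, R, n_perm=2000, seed=0):
    """Crude permutation test of whether A and R are independent, based on |correlation|.
    Conditional independence (e.g., A independent of Y given R) is harder still [59]."""
    rng = np.random.default_rng(seed)
    observed = abs(np.corrcoef(A, R)[0, 1])
    null = np.array([abs(np.corrcoef(rng.permutation(A), R)[0, 1]) for _ in range(n_perm)])
    return float(np.mean(null >= observed))
```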
3.4 Model vs. Model Use

Section 2 described how the test literature had competing notions of whether fairness is a property of a test, or of the use of a test. A similar discussion of whether ML models can be judged as fair or unfair independent of a specific use (including a specific model threshold) has been largely implicit or missing in the ML fairness literature. Models are sometimes trained to be “fair” at their default decision threshold (e.g., 0.5), although the use of different thresholds can have a major impact on fairness [32]. The ML fairness notion of calibration, i.e., P(Y = 1 | A = a, R = r) = r for all a and r, can be interpreted to be a property of the model rather than of its use, since it does not depend on the choice of decision threshold.

3.5 Race and Gender

Some work on practically assessing fairness in ML has tackled the problem of using race as a construct. This echoes concerns in the testing literature that stem back to at least 1966: “one stumbles immediately over the scientific difficulty of establishing clear yardsticks by which people can be classified into convenient racial categories” [30]. Recent approaches have used Fitzpatrick skin type or unsupervised clustering to avoid racial categorizations [7, 55]. We note that the testing literature of the 1960s and 1970s frequently uses the phrase “cultural fairness” when referring to parity between blacks and whites. Other than Thomas [61], the test fairness literature of the 1960s and 1970s was typically concerned with race rather than gender (although gender received attention later, e.g., [67]). The role of culture in gender identity and gender presentation has seen less consideration in ML fairness, but gender labels raise ethical concerns [31, 33].

Comparable to modern sentiment about the difficulties of measuring fairness, earlier decisions in the courtroom highlighted the impossibility of properly accounting for all factors that influence inequalities. For example, in 1964, an Illinois Fair Employment Practices Commission (FEPC) examiner found that Motorola had discriminated against Leon Myart, a black American, in his application to work at Motorola as an “analyzer and phaser”. The examiner found that the 5-minute screening test that Myart took did not account for inequalities and environmental factors of culturally deprived groups. The case was appealed to the Illinois Supreme Court, which found that Myart actually passed the test, and so declined to rule on the fairness of the test [2].

4 FAIRNESS GAPS

4.1 Fairness and Unfairness

In mapping out earlier fairness approaches and their relationship to ML fairness, some conceptual gaps emerge. One noticeable gap relates to the difference in framing between fairness and unfairness. In earlier work on test fairness, there was a focus on defining measurements in terms of unfair discrimination and unfair bias, which brought with it the problem of uncovering sources of bias [12]. In the 1970s, this developed into framings in terms of fairness, and the introduction of fairness criteria similar or identical to ML fairness criteria known today. However, returning to the idea of unfairness suggests several new areas of inquiry, including quantifying different kinds of unfairness and bias (such as content bias, selection system bias, etc., cf. [35]), and a shift in focus from outcomes to inputs and processes [13].
Quantifying types of unfairness may not only add to the problems that machine learning can address, but also accords with the realities of sentencing and policing behind much of the fairness research today: individuals seeking justice do so when they believe that something has been unfair.

4.2 Differential Item Functioning

Another gap that becomes clear from the historical perspective is the lack of an analog to Differential Item Functioning (Section 2.4) in current ML fairness research. DIF was used by education professionals as a motivation for investigating causes of bias, and a modern-day analog might include unfairness interpretability in ML models. A direct analog in ML could be to compare P(Xi | R = r, A = a) for different input features Xi, model outputs R and subgroups A. For example, when predicting loan repayment, this might involve comparing how income levels differ across subgroups for a given predicted likelihood of repaying the loan.

4.3 Target Variable / Model Score Relationship

Another gap is the ways in which the model (test) score and the target variable are related to each other. In many cases in ML fairness and test fairness, there are correspondences between pairs of criteria which differ only in the roles played by the model (test) score R and the target variable Y. That is, one criterion can be transformed into another by swapping the symbols R and Y; for example, separation can be transformed into sufficiency: A ⊥ R | Y −→ A ⊥ Y | R. In this section we will refer to this type of correspondence as “converse”, i.e., separation is the converse of sufficiency. When viewed in this light, some asymmetries stand out:

• Converse Cleary criterion: Cleary’s criterion considers the case of a regression model that predicts a target variable Y given test score R. One could also consider the converse regression model (mentioned in passing by [62]), which predicts model score R from ground truth Y, as an instrument for detecting bias. (The Cleary regression model and its converse are distinct except in the special case where the magnitudes of the variables have been standardized.) The converse Cleary condition would say that a test has connotations of unfair for a subgroup if the converse regression line has positive errors, i.e., for each given level of ground truth ability, the test score is higher than the converse regression line predicts.

• Converse calibration: In a regression scenario, the calibration condition P(Y = 1 | R = r, A = a) = r can be rewritten as E(Y | R = r, A = a) = r, or E(Y − r | R = r, A = a) = 0. The converse calibration condition is therefore E(R − y | Y = y, A = a) = 0 for all subgroups A = a. In other words, for each subgroup and level of ground truth performance Y = y, the expected error in R’s prediction of the value y is zero.

We point out these overlooked concepts not to advocate for their use, but to map out the geography of concepts related to fairness more completely.

4.4 Compromises

Darlington [19] points out that Thorndike’s criterion is a compromise between one criterion related to sufficiency and one related to separation (see Section 2.2 and Tables 2 and 3). In general, a space of compromises is possible; in terms of correlations, this might be modeled using a parameter λ:

ρAR = ρAY · ρRY^λ    (1)

where λ values of −1, 0, and 1 imply Darlington’s definitions (1), (2) and (3), respectively.

This also suggests exploring interpolations between the contrasting sufficiency and separation criteria. For example, one way of parameterizing their interpolation is in terms of binary confusion matrix outcomes.

Definition 4.1. (λ1, λ2)-Thorndikian fairness: A binary classifier satisfies (λ1, λ2)-Thorndikian fairness with respect to demographic variable A if both a) (TP + λ1·FP) / (TP + λ2·FN) is constant for all values of A, and b) (TN + λ1·FN) / (TN + λ2·FP) is constant for all values of A.

Note that (1, 0)-Thorndikian fairness is equivalent to sufficiency, while (0, 1)-Thorndikian fairness is equivalent to separation. Petersen and Novick [52] showed that (1, 1)-Thorndikian fairness requires that either a) for each subgroup, the positive class is predicted in proportion to its ground truth rate; or b) every subgroup has the same ground truth rate of positives. We can also consider relaxations of (λ1, λ2)-Thorndikian fairness in which only one of the two conditions (a) or (b) is required to hold. For example, only requiring condition (a) gives us a way of parameterizing compromises between equality of opportunity and predictive parity.
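Definition 4.1 can be checked mechanically from per-subgroup confusion matrices. The sketch below is our own illustration of the definition, not code from the paper:

```python
import numpy as np

def thorndikian_quantities(y_true, y_pred, lam1, lam2):
    """The two quantities of Definition 4.1 for a single subgroup."""
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    cond_a = (tp + lam1 * fp) / (tp + lam2 * fn)
    cond_b = (tn + lam1 * fn) / (tn + lam2 * fp)
    return cond_a, cond_b

def thorndikian_by_group(y_true, y_pred, groups, lam1, lam2):
    """(lam1, lam2)-Thorndikian fairness holds when both quantities are constant across
    subgroups; (1, 0) corresponds to sufficiency-style ratios and (0, 1) to separation-style ratios."""
    return {g: thorndikian_quantities(y_true[groups == g], y_pred[groups == g], lam1, lam2)
            for g in np.unique(groups)}
```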
For example, one way of parameterizing their interpolation is in terms of binary confusion matrix outcomes. Definition 4.1. (λ1, λ2)-Thorndikian fairness: A binary classifier satisfies (λ1, λ2)-Thorndikian fairness with respect to demographic variable A if both a) TP + λ1FP TP + λ2FN is constant for all values of A , and b) TN + λ1FN TN + λ2FP is constant for all values of A. Note that (1, 0)-Thorndikian fairness is equivalent to sufficiency, while (0, 1)-Thorndikian fairness is equivalent to separation. Petersen and Novick [52] showed that (1, 1)-Thorndikian fairness requires that either a) for each subgroup, the positive class is pre- dicted in proportion to its ground truth rate; or b) every subgroup has the same ground truth rate of positives. We can also consider relaxations of (λ1, λ2)-Thorndikian fairness in which only one of the two conditions (a) or (b) is required to hold. For example, only re- quiring condition (a) gives us a way of parameterizing compromises between equality of opportunity and predictive parity. Our goal here is not to advocate for this particular model of compromise between separation and sufficiency. Rather, since sep- aration and sufficiency criteria can encode competing interests of different parties, our goal is to suggest that ML fairness consider how to encode notions of compromise, which in some scenarios might relate to the public’s notion of fairness. We propose that the economics literature on fair division might provide some useful ideas, as has also been suggested by [68]. However, we do heed Darlington’s [19] warning that “a compromise may end up satisfy- ing nobody; psychometricians are not in the habit of agreeing on important definitions or theorems by compromise.” This statement may be equally true of ML practitioners. 50 Years of Test (Un)fairness: Lessons for Machine Learning FAT* ’19, January 29–31, 2019, Atlanta, GA, USA 5 DISCUSSION This short review of historical connections in fairness suggest sev- eral concrete steps forward for future research in ML fairness: (1) Developing methods to explain and reduce model unfairness by focusing on the causes of unfairness. To paraphrase Dar- lington’s [19] question: “What can be said about models that discriminate among cultures at various levels?” yields more actionable insights than “What is a fair model?” This is re- lated to research on causality in ML Fairness (see Section 3.1), but including examination of full causal pathways, and processes that interact well before decision time. In other words: What causes the disparities? (2) Drawing from earlier insights of Guion [30], Thorndike [62], Cole [14], Linn [43], Jones [37], and Peterson & Novick [52] to expand fairness criteria to include model context and use. (3) Building from earlier insights of 1970s researchers [19, 34, 44] to incorporate quantitative factors for the balance between fairness goals and other goals, such as a value system or a system of ethics. This will likely include clearly articulating assumptions and choices, as recently proposed in [46]. (4) Diving more deeply into the question of how subgroups are defined, suggested as early as 1966 [30], including question- ing whether subgroups should be treated as discrete cate- gories at all, and how intersectionality can be modeled. This might include, for example, how to quantify fairness along one dimension (e.g., age) conditioned on another dimension (e.g., skin tone), as recent work has begun to address [27, 39]. 
6 CONCLUSIONS The spike in interest in test fairness in the 1960s arose during a time of social and political upheaval, with quantitative definitions catalyzed in part by U.S. federal anti-discrimination legislation in the domains of education and employment. The rise of interest in fairness today has corresponded with public interest in the use of machine learning in criminal sentencing and predictive polic- ing, including discussions around compas [16, 20, 42] and PredPol [25, 49]. Each era gave rise to its own notions of fairness and rele- vant subgroups, with overlapping ideas that are similar or identical. In the 1960s and 1970s, the fascination with determining fairness ul- timately died out as the work became less tied to the practical needs of society, politics and the law, and more tied to unambiguously identifying fairness. We conclude by reflecting on what further lessons the history of test fairness may have for the future of ML fairness. Careful attention should be paid to legal and public concerns about fair- ness. The experiences of the test fairness field suggest that in the coming years, courts may start ruling on the fairness of ML models. If technical definitions of fairness stray too far from the public’s perceptions of fairness, then the political will to use scientific con- tributions in advance of public policy may be difficult to obtain. Perhaps ML practitioners should cautiously take heed from Cole and Zieky’s [15] portrayal of developments in their field: Members of the public continue to see apparently inap- propriate interpretations of test scores and misuses of test results. They see this area as a primary fairness con- cern. However, the measurement profession has strug- gled to understand the nature of its responsibility in this area, and has generally not acted strongly against in- stances of misuse, nor has it acted in concert to attack misuses. We welcome broader debate on fairness that includes both tech- nical and cultural causes, how the context and use of ML models further influence potential unfairness, and the suitability of the vari- ables used in fairness research for capturing systemic unfairness. We agree with Linn’s [44] argument from 1976 that values encoded by technical definitions should be made explicit. By concretely re- lating fairness debates to ethical theories and value systems (as done by [34, 71]), we can make discussions more accessible to the general public and to researchers of other disciplines, as well as helping our own ML Fairness community to be more attuned to our own implicit cultural biases. 7 ACKNOWLEDGEMENTS Thank you to Moritz Hardt and Shira Mitchell for invaluable con- versations and insight. REFERENCES [1] Anne Anastasi. 1961. Psychological tests: Uses and abuses. Teachers College Record (1961). [2] Philip Ash. 1966. The implications of the Civil Rights Act of 1964 for psychological assessment in industry. American Psychologist 21, 8 (1966), 797. [3] Kunihiro Baba, Ritei Shibata, and Masaaki Sibuya. 2004. Partial correlation and conditional correlation as measures of conditional independence. Australian & New Zealand Journal of Statistics 46, 4 (2004), 657–664. [4] Solon Barocas, Moritz Hardt, and Arvind Naranayan. 2018. Fairness in Machine Learning. http://fairmlbook.org. (2018). [5] Richard Berk, Hoda Heidari, Shahin Jabbari, Michael Kearns, and Aaron Roth. 2017. Fairness in criminal justice risk assessments: the state of the art. arXiv preprint arXiv:1703.09207 (2017). 
[6] Alex Beutel, Jilin Chen, Zhe Zhao, and Ed H. Chi. 2017. Data Decisions and Theoretical Implications whenAdversarially Learning Fair Representations. CoRR abs/1707.00075 (2017). arXiv:1707.00075 http://arxiv.org/abs/1707.00075 [7] Joy Buolamwini and Timnit Gebru. 2018. Gender shades: Intersectional accu- racy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency. 77–91. [8] L Elisa Celis, Damian Straszak, and Nisheeth K Vishnoi. 2017. Ranking with fairness constraints. arXiv preprint arXiv:1704.06840 (2017). [9] Alexandra Chouldechova. 2017. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big data 5, 2 (2017), 153–163. [10] T. Anne Cleary. 1966. Test bias: Validity of the Scholastic Aptitude Test for Negro and white students in integrated colleges. ETS Research Bulletin Series 1966, 2 (1966), i–23. [11] T. Anne Cleary. 1968. Test bias: Prediction of grades of Negro and white students in integrated colleges. Journal of Educational Measurement 5, 2 (1968), 115–124. [12] T Anne Cleary and Thomas L Hilton. 1968. An investigation of item bias. Educa- tional and Psychological Measurement 28, 1 (1968), 61–75. [13] Irina Cojuharenco and David Patient. 2013. Workplace fairness versus unfairness: Examining the differential salience of facets of organizational justice. Journal of Occupational and Organizational Psychology 86, 3 (2013), 371–393. [14] Nancy S Cole. 1973. Bias in selection. Journal of educational measurement 10, 4 (1973), 237–255. [15] Nancy S Cole and Michael J Zieky. 2001. The new faces of fairness. Journal of Educational Measurement 38, 4 (2001), 369–382. [16] Sam Corbett-Davies, Emma Pierson, Avi Feller, and Sharad Goel. 2016. A computer program used for bail and sentencing deci- sions was labeled biased against blacks. Its actually not that clear. https://www.washingtonpost.com/news/monkey-cage/wp/2016/10/17/can- an-algorithm-be-racist-our-analysis-is-more-cautious-than-propublicas/. (2016). [17] Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. 2017. Algorithmic decision making and the cost of fairness. CoRR abs/1701.08230 (2017). arXiv:1701.08230 http://arxiv.org/abs/1701.08230 http://arxiv.org/abs/1707.00075 http://arxiv.org/abs/1707.00075 http://arxiv.org/abs/1701.08230 http://arxiv.org/abs/1701.08230 FAT* ’19, January 29–31, 2019, Atlanta, GA, USA Ben Hutchinson and Margaret Mitchell [18] National Research Council et al. 1989. Fairness in employment testing: Validity generalization, minority issues, and the General Aptitude Test Battery. National Academies Press. [19] Richard B Darlington. 1971. Another Look at Cultural Fairness. Journal of Educational Measurement 8, 2 (1971), 71–82. [20] William Dieterich, Christina Mendoza, and Tim Brennan. 2016. COMPAS risk scales: Demonstrating accuracy equity and predictive parity. http://go.volarisgroup.com/rs/430-MBX- 989/images/ProPublica_Commentary_Final_070616.pdf. (2016). [21] Neil J Dorans. 2017. Contributions to the Quantitative Assessment of Item, Test, and Score Fairness. In Advancing Human Assessment. Springer, 201–230. [22] Neil J Dorans and Paul W Holland. 1992. DIF Detection and Description: Mantel- Haenszel and Standardization. ETS Research Report Series 1992, 1 (1992), i–40. [23] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. 2012. Fairness Through Awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference (ITCS ’12). 
ACM, New York, NY, USA, 214–226. https://doi.org/10.1145/2090236.2090255 [24] Hillel J Einhorn and Alan R Bass. 1971. Methodological considerations relevant to discrimination in employment testing. Psychological Bulletin 75, 4 (1971), 261. [25] Danielle Ensign, Sorelle A Friedler, Scott Neville, Carlos Scheidegger, and Suresh Venkatasubramanian. 2017. Runaway feedback loops in predictive policing. arXiv preprint arXiv:1706.09847 (2017). [26] Ronald L Flaugher. 1974. Bias in Testing: A Review and Discussion. TM Report No. 36. Technical Report. Educational Testing Services. [27] James R. Foulds and Shimei Pan. 2018. An Intersectional Definition of Fairness. CoRR abs/1807.08362 (2018). [28] Sorelle A Friedler, Carlos Scheidegger, and Suresh Venkatasubramanian. 2016. On the (im) possibility of fairness. arXiv preprint arXiv:1609.07236 (2016). [29] Gabriel Goh, Andrew Cotter, Maya Gupta, and Michael P Friedlander. 2016. Satis- fying real-world goals with dataset constraints. In Advances in Neural Information Processing Systems. 2415–2423. [30] Robert M Guion. 1966. Employment tests and discriminatory hiring. Industrial Relations: A Journal of Economy and Society 5, 2 (1966), 20–37. [31] Foad Hamidi, Morgan Klaus Scheuerman, and Stacy M Branham. 2018. Gender Recognition or Gender Reductionism?: The Social Implications of Embedded Gender Recognition Systems. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM, 8. [32] Moritz Hardt, Eric Price, , and Nati Srebro. 2016. Equality of Opportu- nity in Supervised Learning. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Gar- nett (Eds.). Curran Associates, Inc., 3315–3323. http://papers.nips.cc/paper/ 6374-equality-of-opportunity-in-supervised-learning.pdf [33] Anna Lauren Hoffmann. 2017. Data, technology, and gender: Thinking about (and from) trans lives. In Spaces for the Future. Routledge, 15–25. [34] John E Hunter and Frank L Schmidt. 1976. Critical analysis of the statistical and ethical implications of various definitions of test bias. Psychological Bulletin 83, 6 (1976), 1053. [35] Christopher Jencks. 1998. Racial bias in testing. The Black-White test score gap 55 (1998), 84. [36] Arthur R Jensen. 1980. Bias in mental testing. (1980). [37] Marshall B Jones. 1973. Moderated regression and equal opportunity. Educational and Psychological Measurement 33, 3 (1973), 591–602. [38] Jerome Karabel. 2006. The chosen: The hidden history of admission and exclusion at Harvard, Yale, and Princeton. Houghton Mifflin Harcourt. [39] Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. 2018. Preventing Fairness Gerrymandering: Auditing and Learning for Subgroup Fairness. In ICML. [40] Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. 2016. Inherent trade-offs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807 (2016). [41] Matt J Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. 2017. Counterfac- tual fairness. In Advances in Neural Information Processing Systems. 4066–4076. [42] Jeff Larson, Surya Mau, Lauren Kirchner, and Julia Angwin. 2016. How We Analyzed the COMPAS Recidivism Algorithm. https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism- algorithm. (2016). [43] Robert L Linn. 1973. Fair test use in selection. Review of Educational Research 43, 2 (1973), 139–161. [44] Robert L Linn. 1976. In search of fair selection procedures. Journal of Educational Measurement 13, 1 (1976), 53–58. 
[45] Gideon S Mann and Andrew McCallum. 2007. Simple, robust, scalable semi-supervised learning via expectation regularization. In Proceedings of the 24th International Conference on Machine Learning. ACM, 593–600.
[46] Shira Mitchell, Eric Potash, and Solon Barocas. 2018. Prediction-Based Decisions and Fairness: A Catalogue of Choices, Assumptions, and Definitions. arXiv:1811.07867 (2018).
[47] National Council on Measurement in Education (NCME) (Ed.). 1976. Journal of Educational Measurement 13, 1 (1976).
[48] Melvin R Novick and Nancy S Petersen. 1976. Towards equalizing educational and employment opportunity. Journal of Educational Measurement 13, 1 (1976), 77–88.
[49] Cathy O'Neil. 2016. Weapons of math destruction: How big data increases inequality and threatens democracy. Broadway Books.
[50] Randall D Penfield. 2016. Fairness in Test Scoring. In Fairness in Educational Assessment and Measurement. Routledge, 71–92.
[51] Nancy S Petersen. 1976. An expected utility model for “optimal” selection. Journal of Educational Statistics 1, 4 (1976), 333–358.
[52] Nancy S Petersen and Melvin R Novick. 1976. An evaluation of some models for culture-fair selection. Journal of Educational Measurement 13, 1 (1976), 3–29.
[53] S E Phillips. 2016. Legal Aspects of Test Fairness. In Fairness in Educational Assessment and Measurement, Neil J Dorans and Linda L Cook (Eds.). Routledge, 239–268.
[54] Mitchell F Rice and Brad Baptiste. 1994. Race Norming, Validity Generalization, and Employment Testing. Handbook of Public Personnel Administration 58 (1994), 451.
[55] Hee Jung Ryu, Hartwig Adam, and Margaret Mitchell. 2018. InclusiveFaceNet: Improving Face Attribute Detection with Race and Gender Diversity. In Workshop on Fairness, Accountability and Transparency in Machine Learning.
[56] Ronald J Samuda. 1998. Psychological testing of American minorities: Issues and consequences. Vol. 10. Sage.
[57] Richard L Sawyer, Nancy S Cole, and James WL Cole. 1976. Utilities and the issue of fairness in a decision theoretic model for selection. Journal of Educational Measurement 13, 1 (1976), 59–76.
[58] Janice Scheuneman. 1979. A method of assessing bias in test items. Journal of Educational Measurement 16, 3 (1979), 143–152.
[59] Rajen D Shah and Jonas Peters. 2018. The Hardness of Conditional Independence Testing and the Generalised Covariance Measure. arXiv preprint arXiv:1804.07203 (2018).
[60] Camelia Simoiu, Sam Corbett-Davies, Sharad Goel, et al. 2017. The problem of infra-marginality in outcome tests for discrimination. The Annals of Applied Statistics 11, 3 (2017), 1193–1216.
[61] Charles L Thomas. 1973. The Overprediction Phenomenon among Black Collegians: Some Preliminary Considerations. (1973).
[62] Robert L Thorndike. 1971. Concepts of culture-fairness. Journal of Educational Measurement 8, 2 (1971), 63–70.
[63] András Vargha, Tamas Rudas, Harold D Delaney, and Scott E Maxwell. 1996. Dichotomization, partial correlation, and conditional independence. Journal of Educational and Behavioral Statistics 21, 3 (1996), 264–282.
[64] Frederick E Vars and William G Bowen. 1998. Scholastic aptitude test scores, race, and academic performance in selective colleges and universities. The Black-White test score gap (1998), 457–79.
[65] Kimberly West-Faulcon. 2011. Fairness Feuds: Competing Conceptions of Title VII Discriminatory Testing. Wake Forest L. Rev. 46 (2011), 1035.
[66] Robert L Williams, William Dotson, Patricia Don, and Willie S Williams. 1980. The war against testing: A current status report. The Journal of Negro Education 49, 3 (1980), 263–273.
[67] Warren W Willingham and Nancy S Cole. 2013. Gender and fair assessment. Routledge.
[68] Muhammad Bilal Zafar, Isabel Valera, Manuel Rodriguez, Krishna Gummadi, and Adrian Weller. 2017. From parity to preference-based notions of fairness in classification. In Advances in Neural Information Processing Systems. 229–239.
[69] Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. 2018. Mitigating Unwanted Biases with Adversarial Learning. (2018).
[70] Xiaojin Zhu, Zoubin Ghahramani, and John D Lafferty. 2003. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning (ICML-03). 912–919.
[71] Rebecca Zwick and Neil J Dorans. 2016. Philosophical Perspectives on Fairness in Educational Assessment. In Fairness in Educational Assessment and Measurement, Neil J Dorans and Linda L Cook (Eds.). Routledge, 267–281.

APPENDIX A: ADDITIONAL DEFINITIONS OF TEST FAIRNESS

This appendix provides details of fairness definitions included in Table 2 that were not introduced in the text of Section 2.

Einhorn and Bass
In 1971, Einhorn and Bass [24] noted that even if Cleary's criterion is satisfied, different rates of false positives and false negatives may be achieved for different subgroups, due to differences in the standard errors of estimate for the two subgroups. That is, differences in variability around the common regression line lead to different false positive and false negative rates. To address this, they propose a criterion based on achieving an equal false discovery rate, or as they put it, “designated risk”, at the decision boundary. That is, Prob(Y > y* | R = r*_a, A = a) is constant for all subgroups a, where r*_a is the cut score applied to subgroup a's predictor scores R and y* is the criterion level that defines success.
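A minimal sketch of how one might check this criterion empirically is given below, in Python with NumPy. Everything in it is our illustration rather than Einhorn and Bass's own procedure: the function name designated_risk, the toy data, and the tolerance band around the cut score (with finite data one cannot condition on R = r*_a exactly, so the sketch looks at candidates whose scores fall within a small band of each subgroup's cut score).

    import numpy as np

    def designated_risk(scores, outcomes, cut_score, success_threshold, band=1.0):
        # Estimate Prob(Y > y* | R near r*_a, A = a) for one subgroup: the
        # proportion of candidates scoring within `band` of the cut score
        # whose criterion value exceeds the success threshold.
        scores = np.asarray(scores, dtype=float)
        outcomes = np.asarray(outcomes, dtype=float)
        near_boundary = np.abs(scores - cut_score) <= band
        if not near_boundary.any():
            return float("nan")
        return float((outcomes[near_boundary] > success_threshold).mean())

    # Toy data: the criterion is (approximately) satisfied when the
    # estimated risks are roughly equal across subgroups.
    groups = {
        "a": {"scores": [48, 51, 50, 49, 53], "outcomes": [2.1, 3.0, 2.8, 2.4, 3.3], "cut": 50},
        "b": {"scores": [61, 59, 60, 62, 58], "outcomes": [2.0, 2.9, 3.1, 2.2, 2.6], "cut": 60},
    }
    risks = {g: designated_risk(d["scores"], d["outcomes"], d["cut"], success_threshold=2.5)
             for g, d in groups.items()}
    print(risks)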

Darlington's “culturally optimum”
Darlington [19] proposes that the subjective value one places on test validity (which is related to accuracy) and on diversity can be scenario-specific. He proposes a technique for eliciting these value judgements, leading to a variable k that measures the amount of trade-off in validity that is acceptable in order to increase diversity. He proposes that the “culturally optimum” test is the one that maximizes the correlation between the test score X and the composite criterion Y − kC, written ρ_{X(Y−kC)}.
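The Python sketch below is one way to operationalize this objective; the helper name darlington_objective, the synthetic data, and the particular value of k are our assumptions, not Darlington's. It computes the correlation between a candidate test's scores X and the composite Y − kC, then picks the candidate test with the largest value.

    import numpy as np

    def darlington_objective(x, y, c, k):
        # Correlation between test scores x and the composite y - k*c, where
        # y is the criterion (ground truth), c the group-membership
        # ("culture") variable, and k the elicited validity/diversity trade-off.
        composite = np.asarray(y, dtype=float) - k * np.asarray(c, dtype=float)
        return float(np.corrcoef(np.asarray(x, dtype=float), composite)[0, 1])

    # Two synthetic candidate tests; under this reading, the "culturally
    # optimum" test is the column of X with the largest objective value.
    rng = np.random.default_rng(0)
    y = rng.normal(size=200)                            # criterion
    c = rng.integers(0, 2, size=200).astype(float)      # group membership
    X = np.column_stack([
        y + rng.normal(scale=1.0, size=200),            # candidate test 1
        y - 0.5 * c + rng.normal(scale=1.0, size=200),  # candidate test 2
    ])
    k = 0.5  # trade-off value assumed to have been elicited from the decision maker
    objectives = [darlington_objective(X[:, j], y, c, k) for j in range(X.shape[1])]
    print(int(np.argmax(objectives)), objectives)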

Jones
In 1973, Jones [37] proposed a “general standard” of fairness that is related to Thorndike's (and hence also to quota-based definitions of fairness). In Jones' criterion, candidates are ranked in descending order both by test score and by ground truth. If an equal proportion of candidates from the subgroup are present in the top n% of both ranked lists, then the test is fair “at position n”. Jones' “general standard” of fairness requires that this hold for all values of n. Jones assumes a regression model relating test scores to ground truth, and also defines a weaker “mean-fair” criterion for a subgroup: that “the group's average predicted score equals its average performance score on the [ground truth].”
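A small Python sketch of the check described above follows; the function names, the tolerance argument, and the toy data are our own illustrative choices. For a given cut-off fraction, it compares the proportion of the subgroup appearing in the top of the test-score ranking against the proportion appearing in the top of the ground-truth ranking; the "general standard" would require agreement at every cut-off.

    import numpy as np

    def subgroup_proportion_in_top(values, group_mask, frac):
        # Proportion of subgroup members who fall in the top `frac` of
        # candidates when ranked by `values` in descending order.
        values = np.asarray(values, dtype=float)
        group_mask = np.asarray(group_mask, dtype=bool)
        n_top = max(1, int(round(frac * len(values))))
        in_top = np.zeros(len(values), dtype=bool)
        in_top[np.argsort(-values)[:n_top]] = True
        return float(in_top[group_mask].mean())

    def jones_fair_at(test_scores, ground_truth, group_mask, frac, tol=0.0):
        # Fair "at position n": the subgroup is represented in the top `frac`
        # of the test-score ranking to the same degree as in the top `frac`
        # of the ground-truth ranking. `tol` allows slack with finite
        # samples and is our addition, not part of Jones' definition.
        share_test = subgroup_proportion_in_top(test_scores, group_mask, frac)
        share_truth = subgroup_proportion_in_top(ground_truth, group_mask, frac)
        return abs(share_test - share_truth) <= tol

    # Toy example: a grid of cut-off fractions stands in for "all values of n".
    test_scores = [88, 75, 91, 60, 83, 70]
    ground_truth = [3.4, 2.8, 3.6, 2.1, 3.1, 2.5]
    group_mask = [True, False, True, False, False, True]
    print(all(jones_fair_at(test_scores, ground_truth, group_mask, f)
              for f in (0.2, 0.5, 0.8)))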
