Journal of Economic Perspectives—Volume 9, Number 2—Spring 1995—Pages 63–84
The Case for Randomized Field Trials in Economic and Policy Research
Gary Burtless
Social experimentation dates back more than a quarter of a century. Over that time, spending on randomized field trials for social policy has consumed well over a billion dollars (measured in 1994 dollars). In a recent catalogue of social experiments, Greenberg and Shroder (1991) identified more than 90 separate field trials involving a wide variety of distinctive research areas, including health insurance, prisoner rehabilitation, labor supply, worker training, and housing subsidies. Some of the major recent experiments are listed in Table 1. New randomized trials are launched by state and federal agencies each month. Classical experimentation in social policy has the appearance of a flourishing industry.
If social experimentation is an industry, its fortunes are less robust than this survey may suggest. Greater real resources were devoted to experimentation in the 1970s and early 1980s than have been invested since. The large-scale social experiments begun in the 1960s and 1970s were ambitious and costly attempts to estimate basic behavioral parameters—the income and price elasticities of labor supply and housing demand functions and the elasticity of demand for health care in response to alternative insurance arrangements. These lavish experiments generated hundreds of research reports and many articles in leading scholarly journals.¹ Although recent experiments have been much more numerous, they have also been narrower in focus, less ambitious, and less likely to yield major scholarly contributions.
Paradoxically, the findings of the newer and less ambitious experiments have had a larger impact on actual policy decisions.
• Gary Burtless is a Senior Fellow, Economic Studies Program, The Brookings Institution, Washington, D.C.

¹ For a critical survey of the early large-scale experiments, see Hausman and Wise (1985).
While findings from social experiments have sunk out of view of most academic economists, they loom larger than ever for policymakers in state capitals and the federal government. At the same time social experiments gained new influence in policymaking, some prominent economists grew disenchanted with this research tool and challenged the value of experiments in answering central questions about human behavior and policy effectiveness. Criticisms of such experiments by social scientists, if loud and persistent enough, can affect the willingness of policymakers to support this kind of study. Politicians are naturally suspicious of a research method involving experimentation with and possible harm to human subjects (also known as “voters”). It is important for them to understand the strengths as well as the limitations of this unique research tool.
This paper examines the rationale for field experimentation in economics and considers some of the main criticisms leveled at experiments in recent years. Academics have attacked experiments for a wide range of real and imagined sins. Recent experimental designers have taken past criticisms into account and sought to address some of the most serious ones through improved experimental design. In spite of recent criticisms, classical experimentation on a modest scale has become an accepted part of policy evaluation in the United States. The essential reason is that policymakers and many social scientists find experimental results easier to understand—and ultimately more convincing—than results from most other kinds of policy evaluation.
Definition of Social Experiments
In common parlance, an experiment is any major deviation from past policy or practice. Under this broad definition, the introduction of Social Security in 1935 and the sharp reduction in top marginal income tax rates in 1981 represent policy experiments. The scientific notion of experiments is considerably narrower. It emphasizes the researcher’s control of the variables under investigation and over the environment in which those variables are observed. In a typical scientific experiment, the investigator deliberately manipulates the environment or introduces change into the environment to measure the consequences of change.
The income tax reduction passed in 1981 does not represent an experiment under this definition, because policymakers exercised little control over most aspects of the environment that affected economic agents’ responses to policy change. For example, U.S. gross domestic product fell in three of the first four quarters after passage of the 1981 tax cuts. One possibility is that tax cuts caused the recession. More plausibly, the recession affected consumers’ and producers’ responses to the tax changes introduced in 1981. The pure effects of the tax cuts on consumer and producer behavior were never directly observed. Instead, they have been inferred by economists after disentangling the effects of other changes in the environment. Since it is unclear how analysts can reliably establish the effects of other environmental factors, the effect of the 1981 tax changes on economic behavior remains a subject of intense controversy.

Table 1
Selected Social Experiments
In many scientific experiments, the investigator simply introduces a change in a controlled environment and observes the effect of the change on the material or organism under study. Of course, reliable measurement of the effect requires some basis for comparison. Consumer Reports tests the strength of automobile bumpers by subjecting them to a uniform blow and then determining the cost of the necessary repairs. The implicit basis of comparison is the pristine state of the tested vehicle before the blow was delivered. However, a before-and-after comparison is not always appropriate or feasible. The toxic effect of Twinkies cannot be discovered simply by observing that 10 percent of laboratory mice die within one month of eating a Twinkie. No matter how well controlled the environment in which the experiment is conducted, some mice will die whether or not they consume a questionable dessert product. If the usual mortality rate of laboratory mice is 3 percent a month, the extra mortality from consuming a Twinkie is 7 percentage points. If mortality is usually 12 percent, the dessert is not toxic at all; it reduces mortality by 2 percentage points a month.
The problem, of course, is establishing a credible basis of comparison. In the previous example, an investigator might compare the mortality experience of Twinkie-eating rodents with that of a handpicked group of mice that is deprived of cream-filled desserts. Naturally, the researcher will want this comparison group to be as similar as possible to the group of mice offered Twinkies. As early as the 1920s, R. A. Fisher (1928, p. 230) argued that the only fully satisfactory method of achieving equivalence between the treatment and comparison groups was to assign subjects to the two groups “wholly at random.” This kind of study is commonly referred to as a randomized or controlled experiment.
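To make the logic concrete, here is a minimal sketch (in Python, with invented mortality rates) of how random assignment supplies the missing basis of comparison: the difference in group means estimates the extra mortality caused by the treatment.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000                               # hypothetical number of laboratory mice

# Assign exactly half of the sample to the treatment group, wholly at random.
treated = rng.permutation(n) < n // 2

# Invented monthly mortality rates: 3 percent baseline, 10 percent with the treatment.
death_prob = np.where(treated, 0.10, 0.03)
died = rng.random(n) < death_prob

# The treatment-control difference in means recovers the true effect (about 0.07),
# without needing to know the baseline mortality rate in advance.
effect = died[treated].mean() - died[~treated].mean()
print(f"estimated extra mortality per month: {effect:.3f}")
```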
Fisher’s case for randomization was widely influential in agricultural, biological, and medical research, but its impact on economic research has been smaller and much slower to develop. Many economists probably believe that the important questions at issue in economics cannot be answered with randomized trials. While this is certainly true of questions involving general equilibrium or movements in economy-wide aggregates, many propositions in economics treat behavioral response at the individual, family, or company level: saving responses to movements in real interest rates, labor supply responses to changes in after-tax wages or unearned income, labor demand responses to tax subsidies, consumption responses to changes in relative price, and so on. In principle, all of these issues can be studied using small- or medium-sized randomized trials. It is certainly feasible for researchers to assign economic agents randomly to different policy regimes in the same way that agricultural scientists assign dairy cattle to different feeding regimes or small plots of land to different varieties of fertilizer. Readers may be uncomfortable with the implied equivalence between human beings and farm livestock, but the statistical rationale for randomization is essentially the same in both cases. However, whether it is ethical, useful, or cost-effective to carry out such experiments on humans is a matter of debate.
The critical element that distinguishes controlled experiments from all other methods of research is the random assignment of meaningfully different treatments to the observational units of study. In the context of social science, a randomized field trial (or social experiment) is simply a controlled experiment that takes place outside a laboratory setting, in the usual environment where social and economic interactions occur. In the simplest kind of experiment, a single treatment is assigned to a randomly selected subsample (the treatment group) and withheld from the remainder of the enrolled sample (the control or null-treatment group). Many social experiments have tested a variety of different treatments rather than only one. Some have not enrolled a pure control group at all. Instead, the investigators have concentrated on measuring the differences in effect of a number of distinctive new treatments. The definition of an experiment can include tests of innovative new policies as well as studies that are intended to measure the effect of current policies relative to a null treatment.
Analysts distinguish between two kinds of social experiments, one of which aims to estimate the underlying parameters of a population response function and a second that attempts to measure the overall effects of one or more distinctive treatments. An example of the first type of experiment is the Seattle-Denver negative income tax (NIT) experiment, which tested a variety of combinations of income guarantees and tax rates in order to estimate labor supply functions in the low-income population (Munnell, 1987). In this kind of “structural” experiment, individual treatments can be defined as points within a continuous policy parameter space, and the experimental objective is to estimate a smooth response surface.
The second kind of experiment tests a sort of “black box,” in the sense that each treatment tested represents a unique intervention. Most experiments in employment training policy—indeed, most recent experiments in any policy area—have been black box experiments. In this kind of experiment, one or more specific combinations of government services are tested. Even those experiments testing multiple treatments can find no natural way to parameterize the experimental treatments as points along a policy continuum. Thus, the results of black box experiments cannot be easily extrapolated to infer the effects of similar but nonidentical treatments.
In the absence of information from social experiments, economists and other social scientists rely on four main alternatives to experiments to learn about crucial behavioral parameters or the effectiveness of particular programs. One source of information is data on the relationship between economy-wide aggregates, such as interest rates and consumption, either over time or across regions. However, aggregate statistics are inappropriate for analyzing many kinds of microeconomic behavior. A second source is management data collected in the administration of existing programs, but data from an existing program seldom provide any information about what the participants’ experiences would have been if they had been enrolled in a different program or in no program at all. A third source is new survey data, which are usually more costly to obtain than programmatic data but which provide information about the experiences of nonparticipants as well as participants in a program, and thus offer some evidence about likely behavior in the absence of treatment. A fourth source is data generated by special demonstration programs. Like experiments, demonstrations involve the special provision of a treatment, collection of information about outcomes, and analysis of treatment effects. Unlike experiments, demonstrations do not involve random assignment. What all experiments have in common, whether based on black box or structural designs, is that the tested treatments are randomly assigned to observational units—that is, to individuals, companies, government offices, or entire communities.
Advantages of Experimentation
The advantages of controlled experimentation over other methods of analysis are easy to describe. Because experimental subjects are randomly assigned to alternative treatments, the effects of the treatments on behavior can be measured with high reliability. The assignment procedure assures us of the direction of causality between treatment and outcome: differences in average outcomes among the several treatment groups are caused by differences in treatment, and differences in average outcome are not the cause of the observed differences in treatment. Causality is not so easy to determine in nonexperimental data. In measuring the response of health spending to alternative health insurance plans, for example, it is unclear in nonexperimental data whether generous insurance coverage causes high health spending or large anticipated health bills cause consumers to purchase generous health insurance policies. In an experiment where insurance plans are assigned to people at random, the direction of causality is certain.
Random assignment also removes any systematic correlation between treatment status and both observed and unobserved participant characteristics. Estimated treatment effects are therefore free from the selection bias that potentially taints all estimates based on nonexperimental sources of information. In a carefully designed and well-administered experiment, there is usually a persuasive case that the experimental data can produce an internally valid estimate of average treatment effect.²

² An “internally valid” estimate is an unbiased measure of the treatment effect in the sample actually enrolled in an experiment. An “externally valid” estimate is a treatment-effect estimate that can be validly extrapolated to the entire population represented by the sample enrolled in the experiment. Some experimental estimates may not be internally valid, perhaps because treatment is not assigned randomly or because attrition produces noncomparable treatment-group and control samples. In addition, some internally valid estimates may lack external validity. One reason is that a treatment offered to a small experimental sample may not correspond to any treatment that could actually be provided to a broad cross section of the population. Other reasons are described later.
Another advantage of experiments is that they permit analysts to measure—and policymakers to observe—the effects of economic stimuli or new kinds of treatment that have not previously been observed. In many cases the naturally occurring variation in relative prices or policy treatments is too small to allow economists to infer reliably the effects of potential price movements or promising new policies. Many politicians believe, for example, that private employers would hire a greater number of disadvantaged workers if the wages paid to these workers were generously subsidized. If a program of this type has never been tested, it would be difficult or impossible to forecast employer response to a particular wage subsidy level. Of course, new policies can be tested in demonstration programs, too. But in comparison with most sources of nonexperimental information, experiments permit economists to learn about the effects of a much wider range of prices and policies.
Finally, the simplicity of experiments offers notable advantages in making results convincing to other social scientists and understandable to policymakers. A carefully conducted experiment permits analysts to describe findings in extremely straightforward language: “Relative to employers in the control group, employers eligible for government-provided wage subsidies hired X percent more disadvantaged adults and Y percent fewer workers who were not economically disadvantaged.” This kind of simplicity in describing results is seldom possible in nonexperimental research, where analytical findings are necessarily subject to a variety of complicated qualifications.
In recent years the last advantage of experiments has turned out to be particularly important for experiments that test practical policy alternatives. Because policymakers can easily grasp the findings and significance of a simple experiment, they concentrate on the implications of the results for changing public policy. They do not become entangled in a protracted and often inconclusive scientific debate about whether the findings of a particular study are statistically valid. Politicians are more likely to act on results they find convincing.
Social experiments have contributed to important advances in basic knowledge, improved understanding of program effectiveness, and, in rarer cases, significant policy reform. The Health Insurance Experiment, for example, provided convincing evidence about the price sensitivity of the demand for medical care. Even more important, it gave medical practitioners and policymakers unprecedented information about the health consequences of variations in medical care that are induced because consumers face different prices for medical treatment as a result of differences in their insurance coverage (Brook et al., 1983; Manning et al., 1987). Results from the Job Training Partnership Experiment suggest that the government’s main training programs for disadvantaged adults yield significant gains in participants’ employment and earnings, although programs targeted on out-of-school youth are ineffective and possibly even harmful (Bloom et al., 1993). The administration and Congress scaled back Job Training Partnership Act (JTPA) funding for youth programs partly as a result of these findings. The Manpower Demonstration Research Corporation (MDRC) is responsible for experiments that produced the most notable and immediate effect on policy. A series of studies known as the Work-Welfare Experiments offered tangible evidence that work-oriented training and job search programs could boost employment and reduce welfare dependency in the AFDC population (Gueron and Pauly, 1991). The experimental findings were persuasive enough to lawmakers to have a significant impact on the design and implementation of the 1988 Family Support Act.
Of course, experimentation does not completely eliminate uncertainty about the correct answer to a well-posed question about economic behavior or policy, as discussed below. But it can dramatically reduce uncertainty. More important, the small number of qualifications to experimental findings can be explained in language that is accessible to people without formal training in statistics or economics. This is a crucial reason for the broad political acceptance of findings from recent labor market experiments.
The analytical advantage of experiments over nonexperimental research methods can be described in terms of a simple model of treatment effect.³ Suppose the true behavioral model, including treatment effects, is
Y_i = α + βT_i + γX_i + ε_i,
where Y is the behavioral outcome of interest; T_i is the treatment dosage received by the ith sample member; X_i is a measured or unmeasured characteristic of person i that influences Y; and ε is an error term that captures the effects of random error or measurement error in Y—it is uncorrelated with T and X. T might represent eligibility for a particular government service, say, employment training for the disadvantaged, while X could represent a person’s academic achievement as measured on a standardized test. In this example, Y would usually indicate individual earnings, an outcome that training is supposed to improve. Assuming that T and X actually vary and can be measured without error, least squares estimation using information from surveys of people who receive different doses of T can be used to obtain an unbiased estimate of the program treatment effect, β. Under these circumstances there is no need for an experiment.
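As an illustration, here is a short sketch of this setup with invented parameter values, using plain least squares from numpy rather than any particular econometrics package: when T and X are both measured, regressing Y on T and X recovers β even if people with high X are more likely to receive treatment.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
alpha, beta, gamma = 10_000.0, 1_500.0, 50.0          # invented "true" parameters

X = rng.normal(100, 15, n)                            # measured test score
T = (X + rng.normal(0, 15, n) > 100).astype(float)    # enrollment partly driven by X
eps = rng.normal(0, 2_000, n)                         # error uncorrelated with T and X
Y = alpha + beta * T + gamma * X + eps

# Least squares of Y on a constant, T, and X.
Z = np.column_stack([np.ones(n), T, X])
coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)
print(coef)   # approximately [10000, 1500, 50]: beta is recovered without an experiment
```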
Where X is not observed, however, the least squares estimates of α and β may be biased, depending on the relationship between X and T.⁴
³ This exposition borrows from Leamer (1983).
⁴ In particular, suppose X and T are related by E(X_i | T_i) = r_0 + r_1 T_i. In this case, the least squares estimates will be biased by:

E(Y | T) = α + βT + γE(X | T) = α + βT + γ(r_0 + r_1 T) = (α + α*) + (β + β*)T,

where α* and β* are measures of the least squares bias. The bias will increase as γ and the correlation between X and T increase in absolute value.
If X and T are correlated with each other, the β coefficient will be biased. This could easily occur, for example, if unmeasured academic ability, X, affects the willingness of a person to enroll in training, T. The estimation problem disappears in a classical experiment. By the definition of an experiment, the treatment doses, T_i, are randomly assigned within the estimation sample and, presumably, are accurately measured. T will therefore be uncorrelated with both X and ε, implying that least squares estimation of Y on T can produce an unbiased estimate of β.
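Continuing the same sketch (invented values again), the next snippet shows the two cases side by side: when X is unmeasured and correlated with T, the regression of Y on T alone overstates β by roughly γ·r_1, as in footnote 4; when T is assigned at random, the same naive regression recovers β.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
alpha, beta, gamma = 10_000.0, 1_500.0, 50.0     # invented "true" parameters

X = rng.normal(100, 15, n)                       # unmeasured ability
eps = rng.normal(0, 2_000, n)

def slope_on_T(T, Y):
    """Coefficient on T from a regression of Y on a constant and T only."""
    Z = np.column_stack([np.ones_like(T), T])
    return np.linalg.lstsq(Z, Y, rcond=None)[0][1]

# Case 1: self-selected treatment, more likely for high-X people; X omitted from the regression.
T_self = (X + rng.normal(0, 15, n) > 100).astype(float)
Y_self = alpha + beta * T_self + gamma * X + eps
r1 = np.cov(X, T_self)[0, 1] / np.var(T_self)    # slope of E(X | T), as in footnote 4
print(slope_on_T(T_self, Y_self), beta + gamma * r1)   # both well above 1500

# Case 2: randomized treatment; the same naive regression is now unbiased.
T_rand = (rng.random(n) < 0.5).astype(float)
Y_rand = alpha + beta * T_rand + gamma * X + eps
print(slope_on_T(T_rand, Y_rand))                # close to 1500
```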
A common source of bias in microeconomic statistical studies is sample selection. Nonexperimental studies of education and training, for example, usually rely on observations of naturally occurring variation in treatment doses in order to form estimates of the effects of training. Analysts typically compare measured outcomes in employment or earnings for participants in a training program and for a comparison group of similar people who did not participate in the program. The value of a college degree is often calculated by comparing the earnings of college graduates with the earnings of similar people who graduated from high school but did not attend college. Even if the analysis fully controls for the effects of all measurable characteristics of sample members, it is still possible that average outcomes are influenced by systematic differences in the unmeasured characteristics of individuals in the treatment and comparison groups. In the simplest kind of training program, for example, members of the estimation sample are exposed to just two doses of treatment: T = 1 for people enrolled in the training program, and T = 0 for people who never enrolled. Program participation represents the sample member’s decision to choose one treatment dose or none. Obviously, this decision may be affected by unobserved tastes or other characteristics, which also affect a person’s later employment or earnings. Since these factors are unknown and cannot be estimated, the amount of bias in the nonexperimental estimate of β is unknown.
Selection bias is a practical estimation problem in most nonexperimental policy studies. Naive analysts sometimes ignore the problem, implicitly assuming that unmeasured differences between training participants and nonparticipants either do not exist or do not matter. While one or both of these assumptions might be true, the case for believing either of them is usually quite weak. Critics of early training evaluations often pointed out, for example, that people who voluntarily enrolled in employment training for the disadvantaged might be more ambitious than typical disadvantaged workers, yet ambition is an unmeasured personal trait. If personal ambition is correlated with a person’s subsequent labor market success, it is unclear what percentage of the average earnings advantage among trainees is due to extra training and what percentage is due to the greater average ambition of trainees.
The selection bias could go in the opposite direction, too. Disadvantaged workers who are less optimistic about their labor market prospects may enroll in job training in disproportionate numbers. If their pessimism is based on a realistic—but unobserved—assessment of their opportunities, we should expect that their earnings in the absence of training would be lower than those of people with identical observable characteristics who do not enroll in training.
Our uncertainty about the presence, direction, and potential size of selection bias makes it difficult for social scientists to agree on the reliability of estimates drawn from nonexperimental studies. The estimates may be suggestive, and they may even be helpful when estimates from many competing studies all point in the same direction. But if statisticians obtain widely differing estimates or if the available estimates are the subject of strong methodological criticism, policymakers will be left uncertain about the effectiveness of the program. Ashenfelter and Card (1985) and Barnow (1987) have shown that this kind of uncertainty is more than just a theoretical possibility. Both studies found an uncomfortably wide range of estimated impacts of the Comprehensive Employment and Training Act (CETA) on earnings when estimates were based on nonexperimental data. The range of plausible estimates reported in these studies, especially for male trainees, was too wide to permit policymakers to decide whether CETA-sponsored training was cost-effective. It was not even clear for some groups whether the impact of CETA training was positive.
Is There an Econometric Fix?
Econometricians have proposed sophisticated methods to test for the presence of selection bias and to obtain estimates of treatment effects that are purged of selection bias (Heckman, 1976; Maddala and Lee, 1976; Barnow, Cain, and Goldberger, 1980; Heckman and Robb, 1985). The problem with these methods is that they rest on an ultimately untestable assumption about the distribution of the error term or the specification of the equation representing the decision to participate in a program. If critics of a nonexperimental estimator question the reliability of the key assumption, other social scientists (and policymakers) often have no reliable method to decide whether the maintained assumption is a good approximation of reality. In the case of a randomized trial, the key assumption for reliable estimation is that the experimenter has been successful in randomly assigning subjects to different treatments and in measuring the responses of subjects exposed to the treatments. Statisticians and lay observers ordinarily find it much easier to assess the validity of this assumption than to decide whether the key assumptions of nonexperimental studies are valid.
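To give a sense of what these estimators involve, the sketch below implements one textbook variant of the two-step control-function approach (an illustration with invented data, not a reconstruction of any of the cited studies): a probit model of participation supplies a “generalized residual,” which is then added to the outcome regression to soak up the correlation between participation and the error term. The correction delivers the right answer here only because the simulated errors really are jointly normal, which is precisely the kind of assumption that cannot be tested directly.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(3)
n = 50_000
alpha, beta = 10_000.0, 1_500.0            # invented "true" values

# Selection on unobservables: u drives participation and is correlated with the earnings error e.
u = rng.normal(0, 1, n)
e = 2_000.0 * (0.6 * u + 0.8 * rng.normal(0, 1, n))

z = rng.normal(0, 1, n)                    # a participation-cost shifter excluded from earnings
D = (0.8 * z + u > 0).astype(float)        # decision to enroll in training
Y = alpha + beta * D + e

print(sm.OLS(Y, sm.add_constant(D)).fit().params[1])   # naive estimate, badly biased upward

# Step 1: probit of participation; Step 2: add the generalized residual to the outcome regression.
W = sm.add_constant(z)
idx = W @ sm.Probit(D, W).fit(disp=0).params
h = D * norm.pdf(idx) / norm.cdf(idx) - (1 - D) * norm.pdf(idx) / (1 - norm.cdf(idx))
corrected = sm.OLS(Y, sm.add_constant(np.column_stack([D, h]))).fit()
print(corrected.params[1])                 # close to 1500, but only under the assumed joint normality
```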
LaLonde (1986) and Fraker and Maynard (1987) studied the reliability of a variety of nonexperimental estimators with an ingenious procedure. Using data from a true randomized trial—the National Supported Work Demonstration—the two sets of authors compared actual estimates obtained in the demonstration with nonexperimental estimates that would have been obtained if no information had been available from the Supported Work Demonstration’s control group. To derive their nonexperimental estimates, the authors selected a variety of nonrandom comparison groups drawn from general population surveys, such as the Current Population Survey and Panel Study of Income Dynamics, and used several methods to control for the problem of sample selection bias. Many of these methods had been used in the earlier evaluation literature on CETA. Neither study found nonexperimental estimators to be reliable. For some groups and most estimators, the nonexperimental estimates of effect differed substantially from the estimate based on a true, randomly selected control group. More disturbingly, LaLonde (1986, p. 617) found that “even when the econometric estimates pass conventional specification tests, they still fail to replicate the experimentally determined results.”
Heckman and Hotz (1989) have shown that some key assumptions of certain nonexperimental estimation methods can be systematically analyzed using available data. Some assumptions of incorrect models can potentially be rejected by formal specification tests. These tests may not have much statistical power to reject incorrect models, however. Then, the nonexperimental estimators that are not rejected by formal specification tests may yield varying estimates of the effectiveness of a particular program. Analysts will still be left with the difficult choice of deciding which model of sample selection is most plausible.
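One of the simplest checks in this spirit is a pre-program test: if the comparison strategy is sound, “treatment” status should have no apparent effect on outcomes measured before the program existed. Here is a hedged sketch with invented data, an illustration of the idea rather than the full battery of tests in Heckman and Hotz (1989).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 20_000

# Unobserved ability raises earnings in every period and the odds of enrolling later.
ability = rng.normal(0, 1, n)
enrolls_later = (ability + rng.normal(0, 1, n) > 0).astype(float)
pre_earnings = 9_000 + 1_000 * ability + rng.normal(0, 2_000, n)

# Regress pre-program earnings on the future treatment indicator. A significant
# coefficient means the comparison groups were not aligned before the program,
# so the evaluation model built on them should be rejected.
res = sm.OLS(pre_earnings, sm.add_constant(enrolls_later)).fit()
print(res.params[1], res.pvalues[1])   # a large, highly significant pre-program "effect"
```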
In the case analyzed by Heckman and Hotz (1989), specification tests led to the rejection of several incorrect models of selection. This ruled out some nonexperimental estimates of the effect of the Supported Work program that were clearly implausible in the light of estimates obtained using the classical experimental estimator. However, the nonexperimental estimators that were not rejected by the Heckman-Hotz specification tests had much more sampling variability than the classical experimental estimator. In other words, the classical experimental estimator still had a major advantage over the nonexperimental estimators for users who care about the statistical precision of the estimates they use. But the more important advantage is that the validity of the experimental estimator depends upon assumptions that are ordinarily much easier to evaluate—and to believe.
Problems with Experiments
Randomized field trials face numerous problems, of course. Many have been acknowledged since the inception of large-scale social experimentation (Campbell and Stanley, 1966; Rivlin, 1974). Others have come into prominence in recent years (Heckman, 1992; Garfinkel et al., 1992; Levitan, 1992). Several supposed problems turn out to be shortcomings of social research in general or survey research in particular, rather than of experimentation per se. In some cases, of course, difficulties with experimentation can be insuperable. Under those circumstances it makes no sense to conduct an experiment. This still leaves many areas of microeconomic and policy research where social experimentation represents a cost-effective way to improve basic knowledge. Problems unique to experimentation are often relatively minor in comparison with those that plague nonexperimental research.
Cost
Experiments have three kinds of cost that can make them more expensive than nonexperimental research on the same topic. They consume a great deal of real resources, especially in comparison with nonexperimental analysis of existing data sources. They are almost always costly in terms of time. Several years usually elapse between the time an experiment is conceived or designed and the release of its final report. If decisions about a particular public policy cannot be deferred, the usefulness of an experiment may be questionable.
In addition, experiments often involve significant political costs. It is more difficult to develop, implement, and administer a new treatment than it is simply to analyze information about past economic conditions or collect and analyze new information about economic behavior. Voters and policymakers are rightly concerned about possible ethical issues raised by experiments (discussed further below). As a result, it is usually easier to persuade officials to appropriate small amounts for pure research or medium-sized sums for a new survey than it is to convince them that some people should be systematically denied a potentially beneficial intervention in a costly new study.
These disadvantages of experiments are real, but should be placed in perspective. Some forms of nonexperimental research suffer from identical or similar disadvantages. A demonstration program, which lacks a randomly selected control group, can easily cost as much money as a social experiment that tests the same innovative treatment. The demonstration will certainly take as much time to complete as an experiment. If a new survey must be fielded to obtain the needed information, the extra time and money required for an experiment may seem relatively modest.
Ethical Issues of Experimentation with Human Beings
Many observers are troubled by the ethical issues raised by experimentation with human beings, especially when the experimental treatment (or the denial of treatment) has the capacity to inflict serious harm. If the tested treatment is perceived to be beneficial, program administrators may find it hard to deny the treatment to a randomly selected group of people enrolled in a study. Except among philosophers and research scientists, random assignment is often thought to be an unethical way to ration public resources. If, on the other hand, the tested treatment is viewed as potentially harmful, it will be difficult to persuade policymakers to undertake the experiment or to recruit program managers to run the project. It may not be ethical to mount such an experiment in any event. Readers should recall, however, that similar ethical issues arise in studies of new medicines and medical procedures, where the stakes for experimental participants are usually much greater than they are in a social experiment. Yet randomized field trials have been common in medicine for far longer than they have in social policy. In fact, such trials are often required to prove the efficacy of new medical treatments.
Good experimental design can reduce ethical objections to random assignment (Burtless and Orr, 1986, pp. 621–24; the essays in Rivlin and Timpane, 1975). At a minimum, participants in experimental studies should be fully informed of the risks of participation. Under some circumstances, people offered potentially injurious treatments or denied beneficial services should be compensated for the risks they face. The risks of participation are frequently unclear, however, because it is uncertain whether the tested treatment will be beneficial or harmful.
Of course, uncertainty about the direction or size of the treatment effect is the main reason that an experiment is worthwhile. If successful, an experiment will substantially reduce our uncertainty about the size and direction of the treatment effect. The ethical argument in favor of experimentation is that it is preferable to inflict possible harm on a small scale in an experimental study rather than unwittingly inflict harm on a much larger scale as a result of misguided public policy.
Limited Duration
Social experiments are limited in duration. For some kinds of treatments, this poses a problem for valid inference. As one example, participants may take time to understand the nature of the tested treatment and to react to it. A more serious issue is that participants may react differently to a treatment if they know in advance that it is of limited duration, compared with how they would react if the treatments were expected to last indefinitely. An experimental housing subsidy that will last just one year will presumably have a smaller effect on housing decisions than an equally generous subsidy that is expected to be permanent.
The limited duration of social experiments is not always an issue. The aim of many experiments is to test a short-duration intervention that is supposed to immediately benefit a target population, for example, by improving school achievement, employment rates, or subsequent earnings. The intervention itself lasts only a few weeks or months and is completed long before the end of the experiment. Even where limited duration is a critical issue, as in the housing allowance experiments, research planners can explicitly address the issue through sensible experimental design. The duration of the treatment can itself be experimentally varied to determine whether the length of a subsidy affects the size of response.
Attrition and Interview Nonresponse
Critics of social experiments see statistical problems with experiments in addition to the difficulties connected to cost and ethical propriety. Several experiments have been criticized, for example, because high attrition in either the treatment or control groups has meant that even though the samples originally enrolled in the two groups were randomly selected from an identical population, the members of the treatment and control groups ultimately used in the analysis were self-selected members of nonidentical populations as a result of different rates of attrition. The crucial advantage of random assignment has been lost. For example, the negative income tax (NIT) experiments suffered high attrition among families enrolled in the control group and in some of the less generous NIT plans. The final analysis samples were thus unrepresentative of the population that was originally enrolled in the experiments. Because attrition differed in the treatment and control groups, it is possible that the average outcome difference between the two groups was partly due to compositional differences between the groups as well as to the pure effect of the NIT treatment.
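A small simulation (with invented numbers) illustrates the mechanism: even when the treatment has no effect at all, heavier attrition among low earners in the control group makes the surviving control sample look artificially prosperous, producing a spurious negative estimate of the treatment effect.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 30_000

treated = rng.random(n) < 0.5
earnings = 9_000 + 3_000 * rng.normal(0, 1, n)       # the treatment truly has no effect

# Differential attrition: low-earning control families drop out of the follow-up
# surveys far more often than anyone in the treatment group.
drop_prob = np.where(~treated & (earnings < 9_000), 0.40, 0.05)
responded = rng.random(n) > drop_prob

estimate = (earnings[treated & responded].mean()
            - earnings[~treated & responded].mean())
print(estimate)    # substantially negative, even though the true effect is zero
```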
While this criticism is valid for some social experiments, it is hardly one that applies only or even mainly to experiments. It applies to all research studies, whether experimental or nonexperimental, that rely on longitudinal survey data in which attrition or interview nonresponse is a problem. In the United States, such surveys include the Panel Study of Income Dynamics, the National Longitudinal Surveys, the Retirement History Survey, and even the Current Population Survey, each of which has been used in a large number of nonexperimental studies. Ironically, more is known about the influence of attrition in experimental studies because experimental designers often take extraordinary steps to reduce its effects or to measure its impact.⁵

⁵ In an effort to minimize attrition, for example, the negative income tax (NIT) experiments paid both treatment-group and control-group members for their continued participation. In addition, three of the four experiments checked their interview-based estimates of the treatment effect against an estimate based on information not derived from interviews. Analysts collected earnings records maintained by the social insurance authorities and reestimated the effects of the NIT plans using this information.
In fact, many recent social experiments have abandoned longitudinal surveys as a method of gathering information about behavioral outcomes. People enrolled in an experimental sample are given a baseline interview upon enrollment and are never interviewed again. To measure behavioral outcomes, the experiments rely on transfer payment records maintained by public assistance authorities and employment and earnings records supplied by social insurance agencies. It is unlikely that either source of information is seriously affected by differential attrition or nonresponse bias. Thus, the problems of nonrandom attrition and interview nonresponse are not intrinsic to social experimentation.
Partial Equilibrium Results
Experimental findings are often criticized because they do not reflect the general equilibrium effect of a particular price or policy change. In small-scale training experiments, for example, the advantages of experimental training are conferred on only a small fraction of the people who would benefit under a full-blown national program. The benefits a worker derives from extra training are magnified when few other people in the local labor market receive additional training; employers can choose among only a handful of workers with improved qualifications. If the entire eligible population were offered training, the effects of the program on employment or average earnings would almost certainly be smaller. Similarly, the negative income tax experiments measured the supply-side effects of higher marginal tax rates and more generous income guarantees. But without knowing how employers would alter their wages if more generous income guarantees reduced labor supply across the board, it is impossible to forecast the full general-equilibrium effect of a more generous income transfer system.
While it is true that randomized field trials can measure, at most, partial equilibrium responses to a price or policy change, the partial equilibrium effect is often the response of critical interest.⁶ If a training program is intended to improve the job prospects of disadvantaged workers and is found in a small-scale experiment to have no detectable influence on employment or earnings, the general equilibrium effects of the program are probably small and certainly irrelevant. For an experimental program found to raise the employment rate of trainees, analysts are still left with the problem of determining the general equilibrium effect of a full-fledged program, but at least they will be in a better position to predict the general equilibrium effects of the program.

⁶ General equilibrium effects could be measured in field trials if entire communities were assigned at random to one policy regime or another. A large number of communities would have to participate in such an experiment, however, and the administrative and research costs of the experiment would therefore be large. The experimental housing allowance supply experiment was intended to measure general equilibrium effects through the provision of housing allowances to eligible families in two communities. The number of communities offering subsidies was too small to measure the full market response reliably, however.
Garfinkel et al. (1992) recently argued that experiments miss at least three other kinds of general equilibrium effects that determine the net impact of a program. First, a small-scale experiment cannot offer participants the information or helpful insights about a program that would be available in a community-wide or nationwide program. Other omitted effects include social interaction and norm formation processes that would be present when a large percentage of the population is affected by a program but which are absent when only a minuscule proportion of the eligible population is enrolled.
These effects are a problem for social experiments—but they often represent an equally serious challenge to nonexperimental research methods. All three require an unknown amount of time to affect observed behavior. This means the statistician cannot assume that the price or policy in effect at a given time determines individual behavior at that time; behavior is also determined by past prices or policies, with an unknown weight on the price or policy in effect in each past period. Some statisticians are confident they can estimate these weights with naturally occurring data, but the estimation problem is formidable, especially for price or policy variables that remain constant over long periods or vary little from one individual to the next.
A concrete example can illustrate the point. In 1961 federal law was changed to permit men aged 62–64 to draw early Social Security benefits. A randomized trial of alternative early retirement ages conducted in the late 1950s might have shown that the availability of Social Security benefits at progressively younger ages caused labor force withdrawal at younger ages. These results could be criticized. An experiment would have missed the effects of information diffusion, social interaction, and induced shifts in social norms, each of which would presumably magnify the response observed in a small-scale trial.
A nonexperimental statistical study conducted before 1961 would have faced an even more severe problem, however: early retirement benefits for men had never been offered before 1961. A nonexperimental study conducted in a year after 1961 has the advantage of greater variation in the policy variable of interest, but since 1961, there has been no further change in the availability of early pensions. Moreover, economists are uncertain when the general equilibrium effects of the 1961 reform became fully operational. Did retirement norms in 1965 fully reflect the effects of the 1961 reform? Or did norms continue to adjust through 1970? Through 1980? Garfinkel et al. (1992) have not identified a statistical technique available to nonexperimental researchers that would avoid the problem in analysis of nonexperimental data.
Program Entry Effects
A recent criticism of experiments is that the population enrolled in the treatment and control groups is not representative of the population that would be affected by the treatment if it were provided in an ongoing, national program. While occasionally valid, this criticism applies equally to a nonexperimental study when the object of the study is to predict the effects of a new public program or major reform of an old one.
Suppose, for example, that policymakers would like to know the effects of a new kind of training program when some or all public assistance recipients are required to participate in the program as a condition for obtaining welfare. An experiment could enroll a sample of welfare recipients representative of the population that is supposed to participate in the program. A random subsample of this group would be asked to enroll in training; the remainder of the sample would be assigned to a control group. The experiment could accurately measure the effect of the training requirement in a small-scale trial.
However, if the tested program was made permanent and extended to the entire welfare population, it could affect the entry or exit of people from the public assistance rolls (Moffitt, 1992). A training program that is regarded as burdensome by assistance recipients, for example, will deter some people from applying for welfare and persuade some assistance recipients to exit the rolls faster than they otherwise would.⁷ Neither of these effects will have occurred when the experiment begins. It follows that the sample enrolled in the experiment will not be representative of the population that would participate in the program if the treatment were made permanent and were universally available to the eligible population.

⁷ Conversely, a training program that is regarded as exceptionally beneficial might actually attract extra applicants to the public assistance rolls.
In this respect, however, an experiment suffers from no disadvantage in comparison with many kinds of nonexperimental analysis. If no similar training requirement had been imposed in the past, nonexperimental researchers are not in any better position to estimate program entry effects than experimental researchers. They are in a much worse position to estimate program exit effects, since at least the experiment offers evidence on people’s likelihood of leaving the program. However, if a similar training requirement had been imposed in the past, a nonexperimental analysis could provide better information about program entry than any small-scale experiment.
Experiments in Ongoing Programs
A new wave of criticisms has been levelled against a relatively new kind of social experiment—one that takes place in an existing program. A prominent experiment of this type is the recent evaluation of JTPA (Bloom et al., 1993). The JTPA is the major U.S. program that pays for employment training and job search assistance for economically disadvantaged workers. Sixteen local areas participated in the experiment. In each site up to a third of the people who applied for job training services were randomly selected and enrolled in a control group. They were not permitted to receive services funded by JTPA during the first 18 months after they initially applied for services. Analysts obtained an estimate of the JTPA treatment effect by measuring the difference in average outcomes between members of the control group and people offered the JTPA services.
Heckman (1992) has strongly criticized the JTPA experiment for at least three reasons. One is that the sites may have enrolled a different group of trainees than would have been enrolled in the absence of the experiment (this is “sample contamination”). Since about a third of applicants were enrolled in the control group rather than JTPA training, the sites enrolled participants who would not have applied for services or who would have been denied services without the experiment. Heckman’s second criticism is that normal JTPA services were disrupted as a result of the pressures created by the experimental protocol (this is “treatment contamination”). Finally, Heckman criticized the experiment because only a small handful of potential sites were enrolled in the study, and all of them were self-selected. The 16 sites that agreed to participate may not have been representative of JTPA training sites throughout the United States.
These criticisms have merit. But none of them are a necessary by-product of random assignment. Instead, they resulted from the Labor Department’s initial choices about experimental design. The potential problems of sample contamination, treatment contamination, and site self-selection could have been avoided if the department had enrolled a small number of control observations in many representative sites rather than a large number of observations in only a handful of sites. In its design of a new randomized trial to evaluate the Job Corps program, the Labor Department learned from its experience in the JTPA experiment and has decided to enroll small numbers of control observations in many sites.
Of course, it is more costly to enroll experimental participants in many sites rather than only a few. The higher costs of a better design should be weighed against the extra benefits the experiment would provide if its results were more reliable. In the case of the JTPA experiment, a design with many representative sites would have made the findings safely generalizable to the entire population served by JTPA. Users of the results could have been assured that the services evaluated in the study were identical to those typically offered by local agencies. The added costs would have been small, in my view, in relation to the benefits from a better design. The federal government spends about $1.8 billion each year on employment and training services for the disadvantaged under JTPA. The cost of the experiment was less than $50 million. Even if including more sites in the study had increased the experimental cost by a factor of five, the cost of the evaluation would still have represented less than 5 percent of the operating budget of the JTPA program over the three-year period covered by the experiment. Since the results of the study suggest that economically disadvantaged youngsters may be hurt by their participation in JTPA, lawmakers and taxpayers—as well as some JTPA participants—would be better off if we had greater confidence in the reliability of the experimental findings.
Other Criticisms
A simple experiment cannot answer some questions about the behavioral response to a treatment (Heckman, 1992). In particular, randomized trials often fail to provide an unbiased estimate of the effect of a program on those who actually participate in the program. If only 30 percent of a sample that is randomly assigned to a training program actually receives services under the program, then 70 percent of the enrolled sample has declined to participate. An experiment can provide a valid estimate of the average impact of the offer of training services. It cannot provide a reliable estimate of the average impact of services among participants who actually receive them, without additional assumptions about the determinants of participation. Moreover, while an experiment can yield a valid estimate of the mean difference between the treatment and control groups, without additional assumptions it cannot provide estimates of other critical parameters in the response function, like the median response.
These criticisms of experiments are valid, but rarely decisive. One reason is that nonexperimental data are subject to the same shortcomings (or worse ones). Another is that experiments can usually provide valid estimates of the most important behavioral response. Many policy issues hinge on the average response to a particular price or policy change. The median response, the variability of response, and the overall shape of the response function hold intellectual interest, but if policymakers could be confident that the average impact of the treatment offer is positive, most would serenely accept their ignorance of the higher moments of the response distribution. Furthermore, in traditional cost-benefit analysis, the mean effect of the intervention is the crucial determinant of its social usefulness. How the benefits are divided between participants and nonparticipants affects the distribution of social gains but not the net social gain from the treatment. If analysts believe it is important to estimate the distribution of gains across participants and nonparticipants, they can certainly use nonexperimental research methods to analyze evidence collected in an experiment. Random assignment does not prevent researchers from using nonexperimental methods to analyze the data. It does offer them a powerful source of identifying information to measure the average effect of the treatment offer.
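One standard way to go from the experimentally identified offer effect to an estimate for actual participants is to rescale by the take-up rate, under the additional assumption (which randomization alone does not guarantee) that the offer does nothing for people who decline it. A minimal sketch with invented numbers:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 40_000

# Randomized offer of training; only 30 percent of those offered actually take it up.
offered = rng.random(n) < 0.5
takes_up_if_offered = rng.random(n) < 0.30
participated = offered & takes_up_if_offered

true_gain = 2_000.0                                   # invented effect on actual participants
earnings = 9_000 + 3_000 * rng.normal(0, 1, n) + true_gain * participated

# The experiment identifies the average effect of the *offer* (intention to treat).
itt = earnings[offered].mean() - earnings[~offered].mean()

# Rescaling by take-up recovers the effect on participants only if nonparticipants are unaffected.
takeup = participated[offered].mean()
print(itt, itt / takeup)    # roughly 600 and 2000
```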
Conclusion
Are experiments preferable to nonexperimental evaluations? The answer will differ for different research questions and in different circumstances. The more relevant question in any particular case is this: what are the costs and benefits of an experiment compared with the available alternatives?
Even if it were true that reliance on an experiment involved risk of drawing incorrect conclusions about policy effectiveness, the alternatives to an experiment often produce conclusions that are much less reliable. As we have seen, nonexperimental studies are usually plagued by more serious statistical problems than those that occur in randomized trials. More fundamentally, the failure to conduct any evaluation research at all can lead to the perpetuation of programs that are less effective and more costly than policy alternatives. In some cases, our failure to evaluate reliably can lead to the perpetuation of policies that are actually harmful to intended beneficiaries.⁸

⁸ Evidence from a pair of experiments involving the Targeted Jobs Tax Credit suggests, for example, that this employment subsidy program significantly harms the job-finding opportunities of disadvantaged job seekers (Burtless, 1985; Masters et al., 1982). This finding is exactly the opposite of the one predicted by simple intuition, standard economic theory, and previous nonexperimental studies.
How can we assess whether the greater reliability of experimental results is worth the extra cost of obtaining them? When the direct benefits from improved knowledge are easy to predict and measure, analysts can calculate the financial gains from improved decision making and compare them with the additional costs associated with conducting an experiment. An experiment should be undertaken when the value of the improved decision exceeds the extra cost of the experiment (Stafford, 1979; Burtless and Orr, 1986). Potential benefits are clearest when the focus of study is narrow, as is the case when the government wishes to determine whether a particular current policy is effective or whether a small variation in policy would yield better results. Potential benefits from an experiment are much less obvious when the object of the study is to improve basic knowledge about behavioral parameters that may be important across a range of policies, but that have no clear implications for any single policy decision. The Health Insurance Experiment improved our knowledge about the price sensitivity of demand for medical services in a way that no nonexperimental study has been able to match. The potential value of this kind of knowledge is almost impossible to measure, however. Not surprisingly, narrow policy experiments are now much more common than social experiments aimed at improving our basic knowledge.
When an experiment looks unpromising, for whatever reason, a different research strategy should be adopted. But in comparison with the evaluation alternatives available, an experiment often represents the best combination of reliability, practicality, and cost-effectiveness. Many experiments have produced treatment-effect estimates that are widely believed to be reliable measures of the average outcome difference caused by a program. The confidence of policymakers in simple experimental estimates rests on a solid foundation: random assignment offers a more credible basis for inference than the assumptions needed in most nonexperimental analyses.
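A minimal simulation sketch illustrates why that is so; the data-generating process, parameter values, and variable names below are invented for illustration and are not drawn from this article or from any of the experiments it discusses. When people select into a program on characteristics that also affect their outcomes, a naive comparison of participants and nonparticipants is badly biased, while the experimental difference in means recovers the average effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Invented data-generating process: unobserved "ability" raises earnings and,
# in the nonexperimental world, lowers the chance of enrolling in the program.
ability = rng.normal(0.0, 1.0, n)
true_effect = 1.0                                    # average gain from the treatment offer
baseline = 10.0 + 2.0 * ability + rng.normal(0.0, 1.0, n)

# (1) Randomized trial: the offer is a coin flip, independent of ability.
offer = rng.integers(0, 2, n)                        # 0 = control, 1 = treatment offer
y_rct = baseline + true_effect * offer
ate_rct = y_rct[offer == 1].mean() - y_rct[offer == 0].mean()

# (2) Nonexperimental comparison: lower-ability people are likelier to enroll,
# so enrollees differ from non-enrollees even before any treatment occurs.
enroll = (ability + rng.normal(0.0, 1.0, n)) < 0.0
y_obs = baseline + true_effect * enroll
naive_gap = y_obs[enroll].mean() - y_obs[~enroll].mean()

print(f"true average effect:            {true_effect:6.2f}")
print(f"experimental estimate:          {ate_rct:6.2f}")    # close to 1.0
print(f"naive nonexperimental estimate: {naive_gap:6.2f}")   # biased; wrong sign here
```

The experimental contrast needs no model of who enrolls; the nonexperimental contrast is only as good as the untestable assumptions used to adjust for selection, which is the sense in which random assignment provides the more credible basis for inference.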
• I gratefully acknowledge the comments and suggestions of Alan Auerbach, David Greenberg, Judith Gueron, Lynn Karoly, Robert Moffitt, Philip Robins, Carl Shapiro, and Timothy Taylor. The views expressed are solely my own and should not be attributed to any of these people or to the Brookings Institution.
References
Aigner, Dennis J., “The Residential Electricity Time-of-use Pricing Experiments: What Have We Learned?” In Hausman, Jerry A., and David A. Wise, eds., Social Experimentation. Chicago: University of Chicago Press, 1985, pp. 11–48.
Ashenfelter, Orley, and David Card, “Using the Longitudinal Structure of Earnings to Estimate the Effect of Training Programs,” Review of Economics and Statistics, November 1985, 67:4, 648–60.
Barnow, Burt S., “The Impact of CETA Programs on Earnings: A Review of the Literature,” Journal of Human Resources, Spring 1987, 22:2, 157–93.
Barnow, Burt S., Glen C. Cain, and Arthur S. Goldberger, “Issues in the Analysis of Selectivity Bias.” In Stromsdorfer, Ernst W., and George Farkas, eds., Evaluation Studies Review Annual. Vol. 5. Beverly Hills, Calif.: Sage Publications, 1980, pp. 42–59.
Bloom, Howard S., Larry L. Orr, George Cave, Stephen H. Bell, and Fred Doolittle, The National JTPA Study: Title II-A Impacts on Earnings and Employment at 18 Months. Bethesda, Md.: Abt Associates, January 1993.
Bradbury, Katharine L., and Anthony Downs, eds., Do Housing Allowances Work? Washington, D.C.: Brookings Institution, 1981.
Brook, Robert H., John E. Ware, Jr., William H. Rogers, Emmett B. Keeler, and others, “Does Free Care Improve Adults’ Health?,” New England Journal of Medicine, December 8, 1983, 309:23, 1426–34.
Burtless, Gary, “Are Targeted Wage Subsidies Harmful? Evidence from a Wage Voucher Experiment,” Industrial and Labor Relations Review, October 1985, 39:1, 105–14.
Burtless, Gary, and Jerry A. Hausman, “The Effect of Taxation on Labor Supply: Evaluating the Gary NIT Experiment,” Journal of Political Economy, December 1978, 86:6, 1103–30.
Burtless, Gary, and Larry L. Orr, “Are Classical Experiments Needed for Manpower Policy?,” Journal of Human Resources, Fall 1986, 21:4, 606–39.
Campbell, Donald T., and Julian C. Stanley, Experimental and Quasi-Experimental Designs for Research. Chicago: Rand McNally, 1966.
Caves, Douglas W., and Lauritis R. Christensen, “Econometric Analysis of Residential Time-of-use Electricity Pricing Experiments,” Journal of Econometrics, December 1980, 14:3, 287–306.
Fisher, R. A., Statistical Methods for Research Workers. 2nd ed. London: Oliver and Boyd, 1928.
Fraker, Thomas, and Rebecca Maynard, “The Adequacy of Comparison Group Designs for Evaluations of Employment-Related Programs,” Journal of Human Resources, Spring 1987, 22:2, 194–227.
Garfinkel, Irwin, Charles F. Manski, and Charles Michalopoulos, “Micro Experiments and Macro Effects.” In Manski, Charles F., and Irwin Garfinkel, eds., Evaluating Welfare and Training Programs. Cambridge, Mass.: Harvard University Press, 1992, pp. 253–76.
Greenberg, David, and Mark Shroder, Digest of the Social Experiments. Madison, Wis.: Institute for Research on Poverty, University of Wisconsin, 1991.
Gueron, Judith M., and Edward Pauly, From Welfare to Work. New York: Russell Sage, 1991.
Hausman, Jerry A., and David A. Wise, eds., Social Experimentation. Chicago: University of Chicago Press, 1985.
Heckman, James J., “The Common Structure of Statistical Models of Truncation, Sample Selection, and Limited Dependent Variables and a Simple Estimator for Such Models,” Annals of Economic and Social Measurement, Fall 1976, 5:4, 475–92.
Heckman, James J., “Randomization and Social Policy Evaluation.” In Manski, Charles F., and Irwin Garfinkel, eds., Evaluating Welfare and Training Programs. Cambridge, Mass.: Harvard University Press, 1992, pp. 201–30.
Heckman, James J., and V. Joseph Hotz, “Choosing among Alternative Nonexperimental Methods for Estimating the Impact of Social Programs: The Case of Manpower Training,” Journal of the American Statistical Association, December 1989, 84:408, 862–80.
Heckman, James J., and Richard Robb, “Alternative Methods for Evaluating the Impact of Interventions.” In Heckman, James J., and Burton Singer, eds., Longitudinal Analysis of Labor Market Data. New York: Cambridge University Press, 1985, pp. 156–245.
Keeley, Michael C., Philip K. Robins, Robert G. Spiegelman, and Richard W. West, “The Estimation of Labor Supply Models Using Experimental Data,” American Economic Review, December 1978, 68:5, 873–87.
LaLonde, Robert, “Evaluating the Econometric Evaluations of Training Programs with Experimental Data,” American Economic Review, September 1986, 76:4, 604–20.
Leamer, Edward E., “Let’s Take the Con Out of Econometrics,” American Economic Review, March 1983, 73:1, 31–43.
Levitan, Sar A., Evaluation of Federal Social Programs: An Uncertain Impact. Washington, D.C.: Center for Social Policy Studies, George Washington University, June 1992.
Maddala, G. S., and Lung-fei Lee, “Recursive Models with Qualitative Endogenous Variables,” Annals of Economic and Social Measurement, Fall 1976, 5:4, 525–45.
Manning, Willard G., Joseph P. Newhouse, Naihua Duan, and others, “Health Insurance and the Demand for Medical Care: Evidence from a Randomized Experiment,” American Economic Review, June 1987, 77:3, 251–77.
Manpower Demonstration Research Corporation, Summary and Findings of the National Supported Work Demonstration. New York: Manpower Demonstration Research Corporation, 1980.
Masters, Stanley, et al., “Jobs Tax Credits: The Report of the Wage Bill Subsidy Research Project, Phase II,” mimeo, Madison, Wis.: Wisconsin Department of Health and Social Services and Institute for Research on Poverty, University of Wisconsin, 1982.
Moffitt, Robert, “Evaluation Methods for Program Entry Effects.” In Manski, Charles F., and Irwin Garfinkel, eds., Evaluating Welfare and Training Programs. Cambridge, Mass.: Harvard University Press, 1992, pp. 231–52.
Munnell, Alicia H., ed., Lessons from the Income Maintenance Experiments. Boston: Federal Reserve Bank of Boston, 1987.
Rivlin, Alice M., “How Can Experiments Be More Useful?,” American Economic Review, Papers and Proceedings, May 1974, 64:2, 346–54.
Rivlin, Alice M., and T. Michael Timpane, eds., Ethical and Legal Issues of Social Experimentation. Washington, D.C.: Brookings Institution, 1975.
Stafford, Frank P., “A Decision Theoretic Approach to the Evaluation of Training Programs.” In Block, Farrell E., ed., Research in Labor Economics: Evaluating Manpower Training Programs. Greenwich, Conn.: JAI Press, 1979, pp. 9–35.
Struyk, Raymond J., and Marc Bendick, Jr., eds., Housing Vouchers for the Poor: Lessons from a National Experiment. Washington, D.C.: Urban Institute, 1981.