
Reinforcement Learning:
An Introduction

Second edition, in progress

Richard S. Sutton and Andrew G. Barto
© 2012

A Bradford Book

The MIT Press
Cambridge, Massachusetts

London, England


In memory of A. Harry Klopf

Contents

Preface
Series Foreword
Summary of Notation

I  The Problem

1  Introduction
   1.1  Reinforcement Learning
   1.2  Examples
   1.3  Elements of Reinforcement Learning
   1.4  An Extended Example: Tic-Tac-Toe
   1.5  Summary
   1.6  History of Reinforcement Learning
   1.7  Bibliographical Remarks

2  Bandit Problems
   2.1  An n-Armed Bandit Problem
   2.2  Action-Value Methods
   2.3  Softmax Action Selection
   2.4  Incremental Implementation
   2.5  Tracking a Nonstationary Problem
   2.6  Optimistic Initial Values
   2.7  Associative Search (Contextual Bandits)
   2.8  Conclusions
   2.9  Bibliographical and Historical Remarks

3  The Reinforcement Learning Problem
   3.1  The Agent–Environment Interface
   3.2  Goals and Rewards
   3.3  Returns
   3.4  Unified Notation for Episodic and Continuing Tasks
  ∗3.5  The Markov Property
   3.6  Markov Decision Processes
   3.7  Value Functions
   3.8  Optimal Value Functions
   3.9  Optimality and Approximation
   3.10 Summary
   3.11 Bibliographical and Historical Remarks

II  Tabular Action-Value Methods

4  Dynamic Programming
   4.1  Policy Evaluation
   4.2  Policy Improvement
   4.3  Policy Iteration
   4.4  Value Iteration
   4.5  Asynchronous Dynamic Programming
   4.6  Generalized Policy Iteration
   4.7  Efficiency of Dynamic Programming
   4.8  Summary
   4.9  Bibliographical and Historical Remarks

5  Monte Carlo Methods
   5.1  Monte Carlo Policy Evaluation
   5.2  Monte Carlo Estimation of Action Values
   5.3  Monte Carlo Control
   5.4  On-Policy Monte Carlo Control
  ∗5.5  Evaluating One Policy While Following Another (Off-policy Policy Evaluation)
   5.6  Off-Policy Monte Carlo Control
   5.7  Incremental Implementation
   5.8  Summary
   5.9  Bibliographical and Historical Remarks

6  Temporal-Difference Learning
   6.1  TD Prediction
   6.2  Advantages of TD Prediction Methods
   6.3  Optimality of TD(0)
   6.4  Sarsa: On-Policy TD Control
   6.5  Q-Learning: Off-Policy TD Control
   6.6  Games, Afterstates, and Other Special Cases
   6.7  Summary
   6.8  Bibliographical and Historical Remarks

7  Eligibility Traces
   7.1  n-Step TD Prediction
   7.2  The Forward View of TD(λ)
   7.3  The Backward View of TD(λ)
   7.4  Equivalence of Forward and Backward Views
   7.5  Sarsa(λ)
   7.6  Q(λ)
   7.7  Replacing Traces
   7.8  Implementation Issues
  ∗7.9  Variable λ
   7.10 Conclusions
   7.11 Bibliographical and Historical Remarks

8  Planning and Learning with Tabular Methods
   8.1  Models and Planning
   8.2  Integrating Planning, Acting, and Learning
   8.3  When the Model Is Wrong
   8.4  Prioritized Sweeping
   8.5  Full vs. Sample Backups
   8.6  Trajectory Sampling
   8.7  Heuristic Search
   8.8  Summary
   8.9  Bibliographical and Historical Remarks

III  Approximate Solution Methods

9  On-policy Approximation of Action Values
   9.1  Value Prediction with Function Approximation
   9.2  Gradient-Descent Methods
   9.3  Linear Methods
   9.4  Control with Function Approximation
   9.5  Should We Bootstrap?
   9.6  Summary
   9.7  Bibliographical and Historical Remarks

10  Off-policy Approximation of Action Values

11  Policy Approximation
   11.1  Actor–Critic Methods
   11.2  R-Learning and the Average-Reward Setting

12  State Estimation

13  Temporal Abstraction

IV  Frontiers

14  Biological Reinforcement Learning

15  Applications and Case Studies
   15.1  TD-Gammon
   15.2  Samuel’s Checkers Player
   15.3  The Acrobot
   15.4  Elevator Dispatching
   15.5  Dynamic Channel Allocation
   15.6  Job-Shop Scheduling

16  Prospects
   16.1  The Unified View
   16.2  Other Frontier Dimensions

References

Index


Preface

We first came to focus on what is now known as reinforcement learning in late
1979. We were both at the University of Massachusetts, working on one of
the earliest projects to revive the idea that networks of neuronlike adaptive
elements might prove to be a promising approach to artificial adaptive intel-
ligence. The project explored the “heterostatic theory of adaptive systems”
developed by A. Harry Klopf. Harry’s work was a rich source of ideas, and
we were permitted to explore them critically and compare them with the long
history of prior work in adaptive systems. Our task became one of teasing
the ideas apart and understanding their relationships and relative importance.
This continues today, but in 1979 we came to realize that perhaps the simplest
of the ideas, which had long been taken for granted, had received surprisingly
little attention from a computational perspective. This was simply the idea of
a learning system that wants something, that adapts its behavior in order to
maximize a special signal from its environment. This was the idea of a “he-
donistic” learning system, or, as we would say now, the idea of reinforcement
learning.

Like others, we had a sense that reinforcement learning had been thor-
oughly explored in the early days of cybernetics and artificial intelligence. On
closer inspection, though, we found that it had been explored only slightly.
While reinforcement learning had clearly motivated some of the earliest com-
putational studies of learning, most of these researchers had gone on to other
things, such as pattern classification, supervised learning, and adaptive con-
trol, or they had abandoned the study of learning altogether. As a result, the
special issues involved in learning how to get something from the environment
received relatively little attention. In retrospect, focusing on this idea was
the critical step that set this branch of research in motion. Little progress
could be made in the computational study of reinforcement learning until it
was recognized that such a fundamental idea had not yet been thoroughly
explored.

The field has come a long way since then, evolving and maturing in sev-
eral directions. Reinforcement learning has gradually become one of the most
active research areas in machine learning, artificial intelligence, and neural net-
work research. The field has developed strong mathematical foundations and
impressive applications. The computational study of reinforcement learning is
now a large field, with hundreds of active researchers around the world in di-
verse disciplines such as psychology, control theory, artificial intelligence, and
neuroscience. Particularly important have been the contributions establishing
and developing the relationships to the theory of optimal control and dynamic
programming. The overall problem of learning from interaction to achieve


goals is still far from being solved, but our understanding of it has improved
significantly. We can now place component ideas, such as temporal-difference
learning, dynamic programming, and function approximation, within a coher-
ent perspective with respect to the overall problem.

Our goal in writing this book was to provide a clear and simple account
of the key ideas and algorithms of reinforcement learning. We wanted our
treatment to be accessible to readers in all of the related disciplines, but we
could not cover all of these perspectives in detail. Our treatment takes almost
exclusively the point of view of artificial intelligence and engineering, leaving
coverage of connections to psychology, neuroscience, and other fields to others
or to another time. We also chose not to produce a rigorous formal treatment
of reinforcement learning. We did not reach for the highest possible level of
mathematical abstraction and did not rely on a theorem–proof format. We
tried to choose a level of mathematical detail that points the mathematically
inclined in the right directions without distracting from the simplicity and
potential generality of the underlying ideas.

The book consists of three parts. Part I is introductory and problem ori-
ented. We focus on the simplest aspects of reinforcement learning and on its
main distinguishing features. One full chapter is devoted to introducing the
reinforcement learning problem whose solution we explore in the rest of the
book. Part II presents tabular versions (assuming a small finite state space)
of all the basic solution methods based on estimating action values. We intro-
duce dynamic programming, Monte Carlo methods, and temporal-difference
learning. There is a chapter on eligibility traces which unifies the latter two
methods, and a chapter that unifies planning methods (such as dynamic pro-
gramming and state-space search) and learning methods (such as Monte Carlo
and temporal-difference learning). Part III is concerned with extending the
tabular methods to include various forms of approximation including function
approximation, policy-gradient methods, and methods designed for solving
off-policy learning problems. Part IV surveys some of the frontiers of rein-
forcement learning in biology and applications.

This book was designed to be used as a text in a one-semester course, per-
haps supplemented by readings from the literature or by a more mathematical
text such as Bertsekas and Tsitsiklis (1996) or Szepesvári (2010). This book can also
be used as part of a broader course on machine learning, artificial intelligence,
or neural networks. In this case, it may be desirable to cover only a subset of
the material. We recommend covering Chapter 1 for a brief overview, Chapter
2 through Section 2.2, Chapter 3 except Sections 3.4, 3.5 and 3.9, and then
selecting sections from the remaining chapters according to time and interests.
The five chapters of Part II build on each other and are best covered in se-
quence; of these, Chapter 6 is the most important for the subject and for the


rest of the book. A course focusing on machine learning or neural networks
should cover Chapter 9, and a course focusing on artificial intelligence or plan-
ning should cover Chapter 8. Throughout the book, sections that are more
difficult and not essential to the rest of the book are marked with a ∗. These
can be omitted on first reading without creating problems later on. Some ex-
ercises are marked with a ∗ to indicate that they are more advanced and not
essential to understanding the basic material of the chapter.

The book is largely self-contained. The only mathematical background
assumed is familiarity with elementary concepts of probability, such as expec-
tations of random variables. Chapter 9 is substantially easier to digest if the
reader has some knowledge of artificial neural networks or some other kind of
supervised learning method, but it can be read without prior background. We
strongly recommend working the exercises provided throughout the book. So-
lution manuals are available to instructors. This and other related and timely
material is available via the Internet.

At the end of most chapters is a section entitled “Bibliographical and His-
torical Remarks,” wherein we credit the sources of the ideas presented in that
chapter, provide pointers to further reading and ongoing research, and describe
relevant historical background. Despite our attempts to make these sections
authoritative and complete, we have undoubtedly left out some important
prior work. For that we apologize, and welcome corrections and extensions for
incorporation into a subsequent edition.

In some sense we have been working toward this book for thirty years, and
we have lots of people to thank. First, we thank those who have personally
helped us develop the overall view presented in this book: Harry Klopf, for
helping us recognize that reinforcement learning needed to be revived; Chris
Watkins, Dimitri Bertsekas, John Tsitsiklis, and Paul Werbos, for helping us
see the value of the relationships to dynamic programming; John Moore and
Jim Kehoe, for insights and inspirations from animal learning theory; Oliver
Selfridge, for emphasizing the breadth and importance of adaptation; and,
more generally, our colleagues and students who have contributed in countless
ways: Ron Williams, Charles Anderson, Satinder Singh, Sridhar Mahadevan,
Steve Bradtke, Bob Crites, Peter Dayan, and Leemon Baird. Our view of re-
inforcement learning has been significantly enriched by discussions with Paul
Cohen, Paul Utgoff, Martha Steenstrup, Gerry Tesauro, Mike Jordan, Leslie
Kaelbling, Andrew Moore, Chris Atkeson, Tom Mitchell, Nils Nilsson, Stuart
Russell, Tom Dietterich, Tom Dean, and Bob Narendra. We thank Michael
Littman, Gerry Tesauro, Bob Crites, Satinder Singh, and Wei Zhang for pro-
viding specifics of Sections 4.7, 15.1, 15.4, 15.5, and 15.6 respectively. We
thank the Air Force Office of Scientific Research, the National Science
Foundation, and GTE Laboratories for their long and farsighted support.


We also wish to thank the many people who have read drafts of this book
and provided valuable comments, including Tom Kalt, John Tsitsiklis, Pawel
Cichosz, Olle Gällmo, Chuck Anderson, Stuart Russell, Ben Van Roy, Paul
Steenstrup, Paul Cohen, Sridhar Mahadevan, Jette Randlov, Brian Sheppard,
Thomas O’Connell, Richard Coggins, Cristina Versino, John H. Hiett, An-
dreas Badelt, Jay Ponte, Joe Beck, Justus Piater, Martha Steenstrup, Satin-
der Singh, Tommi Jaakkola, Dimitri Bertsekas, Torbjörn Ekman, Christina
Björkman, Jakob Carlström, and Olle Palmgren. Finally, we thank Gwyn
Mitchell for helping in many ways, and Harry Stanton and Bob Prior for be-
ing our champions at MIT Press.


Series Foreword


Summary of Notation

Capital letters are used for random variables and major algorithm variables.
Lower case letters are used for the values of random variables and for scalar
functions. Quantities that are required to be real-valued vectors are written
in bold and in lower case (even if random variables).

s state
a action
S set of all nonterminal states
S+ set of all states, including the terminal state
A(s) set of actions possible in state s

t discrete time step
T final time step of an episode
St state at t
At action at t
Rt reward at t, dependent, like St, on At−1 and St−1
Gt return (cumulative discounted reward) following t

G(n)t n-step return (Section 7.1)
Gλt λ-return (Section 7.2)

π policy, decision-making rule
π(s) action taken in state s under deterministic policy π
π(a|s) probability of taking action a in state s under stochastic policy π
p(s′|s, a) probability of transition from state s to state s′ under action a
r(s, a, s′) expected immediate reward on transition from s to s′ under action a

vπ(s) value of state s under policy π (expected return)
v∗(s) value of state s under the optimal policy
qπ(s, a) value of taking action a in state s under policy π
q∗(s, a) value of taking action a in state s under the optimal policy
Vt estimate (a random variable) of vπ or v∗
Qt estimate (a random variable) of qπ or q∗

v̂(s,w) approximate value of state s given a vector of weights w
q̂(s, a,w) approximate value of state–action pair s, a given weights w
w,wt vector of (possibly learned) weights underlying an approximate value function
x(s) vector of features visible when in state s
w⊤x inner product of vectors, w⊤x = ∑i wixi; e.g., v̂(s,w) = w⊤x(s)


δt temporal-difference error at t (a random variable, even though not upper case)
Zt(s) eligibility trace for state s at t
Zt(s, a) eligibility trace for a state–action pair
zt eligibility trace vector at t

γ discount-rate parameter
ε probability of random action in ε-greedy policy
α, β step-size parameters
λ decay-rate parameter for eligibility traces
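
To make the linear function-approximation notation above concrete, here is a minimal
sketch in Python (not code from the book) of v̂(s,w) = w⊤x(s): the approximate value
of a state is simply the inner product of the weight vector w and the state's feature
vector x(s). The feature and weight values here are made up for illustration.

import numpy as np

def v_hat(x_s, w):
    """Approximate state value under linear function approximation: w'x(s)."""
    return float(np.dot(w, x_s))

x_s = np.array([1.0, 0.0, 0.5, 0.25])   # example feature vector x(s)
w   = np.array([0.2, -0.1, 0.4, 0.0])   # example weight vector w
print(v_hat(x_s, w))                    # 0.2*1.0 + 0.4*0.5 = 0.4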

Part I

The Problem


Chapter 1

Introduction

The idea that we learn by interacting with our environment is probably the
first to occur to us when we think about the nature of learning. When an
infant plays, waves its arms, or looks about, it has no explicit teacher, but it
does have a direct sensorimotor connection to its environment. Exercising this
connection produces a wealth of information about cause and effect, about
the consequences of actions, and about what to do in order to achieve goals.
Throughout our lives, such interactions are undoubtedly a major source of
knowledge about our environment and ourselves. Whether we are learning to
drive a car or to hold a conversation, we are acutely aware of how our environ-
ment responds to what we do, and we seek to influence what happens through
our behavior. Learning from interaction is a foundational idea underlying
nearly all theories of learning and intelligence.

In this book we explore a computational approach to learning from inter-
action. Rather than directly theorizing about how people or animals learn, we
explore idealized learning situations and evaluate the effectiveness of various
learning methods. That is, we adopt the perspective of an artificial intelligence
researcher or engineer. We explore designs for machines that are effective in
solving learning problems of scientific or economic interest, evaluating the
designs through mathematical analysis or computational experiments. The
approach we explore, called reinforcement learning, is much more focused on
goal-directed learning from interaction than are other approaches to machine
learning.



1.1 Reinforcement Learning

Reinforcement learning is learning what to do—how to map situations to
actions—so as to maximize a numerical reward signal. The learner is not
told which actions to take, as in most forms of machine learning, but instead
must discover which actions yield the most reward by trying them. In the most
interesting and challenging cases, actions may affect not only the immediate
reward but also the next situation and, through that, all subsequent rewards.
These two characteristics—trial-and-error search and delayed reward—are the
two most important distinguishing features of reinforcement learning.

Reinforcement learning is defined not by characterizing learning methods,
but by characterizing a learning problem. Any method that is well suited to
solving that problem, we consider to be a reinforcement learning method. A
full specification of the reinforcement learning problem in terms of optimal
control of Markov decision processes must wait until Chapter 3, but the basic
idea is simply to capture the most important aspects of the real problem facing
a learning agent interacting with its environment to achieve a goal. Clearly,
such an agent must be able to sense the state of the environment to some extent
and must be able to take actions that affect the state. The agent also must
have a goal or goals relating to the state of the environment. The formulation
is intended to include just these three aspects—sensation, action, and goal—in
their simplest possible forms without trivializing any of them.
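
As an illustration only (not an interface defined in this book), the sketch below shows
the bare interaction loop that these three aspects imply; the toy environment, its
rewards, and the random agent are hypothetical placeholders, and Chapter 3 gives the
formal agent–environment interface.

import random

class CoinFlipEnv:
    """Toy environment: an episode lasts 10 steps; action 1 pays off more often than action 0."""
    def reset(self):
        self.t = 0
        return self.t                      # the sensed state is just the time step
    def step(self, action):
        self.t += 1
        p = 0.8 if action == 1 else 0.2
        reward = 1.0 if random.random() < p else 0.0
        done = self.t >= 10
        return self.t, reward, done        # next state, reward, end-of-episode flag

class RandomAgent:
    """Senses the state but ignores it, acting at random (no learning yet)."""
    def act(self, state):
        return random.choice([0, 1])

def run_episode(env, agent):
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = agent.act(state)               # action chosen from the sensed state
        state, reward, done = env.step(action)  # environment responds
        total_reward += reward                  # the goal: maximize cumulative reward
    return total_reward

print(run_episode(CoinFlipEnv(), RandomAgent()))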

Reinforcement learning is different from supervised learning, the kind of
learning studied in most current research in machine learning, statistical pat-
tern recognition, and artificial neural networks. Supervised learning is learn-
ing from examples provided by a knowledgable external supervisor. This is
an important kind of learning, but alone it is not adequate for learning from
interaction. In interactive problems it is often impractical to obtain examples
of desired behavior that are both correct and representative of all the situa-
tions in which the agent has to act. In uncharted territory—where one would
expect learning to be most beneficial—an agent must be able to learn from its
own experience.

One of the challenges that arise in reinforcement learning and not in other
kinds of learning is the trade-off between exploration and exploitation. To
obtain a lot of reward, a reinforcement learning agent must prefer actions
that it has tried in the past and found to be effective in producing reward.
But to discover such actions, it has to try actions that it has not selected
before. The agent has to exploit what it already knows in order to obtain
reward, but it also has to explore in order to make better action selections in
the future. The dilemma is that neither exploration nor exploitation can be
pursued exclusively without failing at the task. The agent must try a variety of


actions and progressively favor those that appear to be best. On a stochastic
task, each action must be tried many times to gain a reliable estimate of its
expected reward. The exploration–exploitation dilemma has been intensively
studied by mathematicians for many decades (see Chapter 2). For now, we
simply note that the entire issue of balancing exploration and exploitation
does not even arise in supervised learning as it is usually defined.
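
One standard way of striking this balance, developed in Chapter 2, is the ε-greedy rule:
exploit the action with the highest current value estimate most of the time, but with
small probability ε select an action at random. The sketch below is not code from the
book; the reward distributions and parameter values are illustrative assumptions.

import random

def epsilon_greedy_bandit(true_values, steps=1000, epsilon=0.1, alpha=0.1):
    """Run an epsilon-greedy agent on a bandit and return its action-value estimates."""
    n = len(true_values)
    q = [0.0] * n                                   # current estimate of each arm's value
    for _ in range(steps):
        if random.random() < epsilon:
            a = random.randrange(n)                 # explore: try an arm at random
        else:
            a = max(range(n), key=lambda i: q[i])   # exploit: the arm that looks best so far
        reward = random.gauss(true_values[a], 1.0)  # noisy reward from the chosen arm
        q[a] += alpha * (reward - q[a])             # incremental update toward the sample
    return q

print(epsilon_greedy_bandit([0.2, -0.8, 1.5, 0.0]))

Because ε stays above zero, the agent keeps sampling every arm occasionally, so its
estimates continue to improve while most of its choices exploit the arm that currently
appears best.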

Another key feature of reinforcement learning is that it explicitly considers
the whole problem of a goal-directed agent interacting with an uncertain envi-
ronment. This is in contrast with many approaches that consider subproblems
without addressing how they might fit into a larger picture. For example,
we have mentioned that much of machine learning research is concerned with
supervised learning without explicitly specifying how such an ability would
finally be useful. Other researchers have developed theories of planning with
general goals, but without considering planning’s role in real-time decision-
making, or the question of where the predictive models necessary for planning
would come from. Although these approaches have yielded many useful results,
their focus on isolated subproblems is a significant limitation.

Reinforcement learning takes the opposite tack, starting with a complete,
interactive, goal-seeking agent. All reinforcement learning agents have ex-
plicit goals, can sense aspects of their environments, and can choose actions
to influence their environments. Moreover, it is usually assumed from the
beginning that the agent has to operate despite significant uncertainty about
the environment it faces. When reinforcement learning involves planning, it
has to address the interplay between planning and real-time action selection,
as well as the question of how environmental models are acquired and im-
proved. When reinforcement learning involves supervised learning, it does so
for specific reasons that determine which capabilities are critical and which
are not. For learning research to make progress, important subproblems have
to be isolated and studied, but they should be subproblems that play clear
roles in complete, interactive, goal-seeking agents, even if all the details of the
complete agent cannot yet be filled in.

One of the larger trends of which reinforcement learning is a part is that
toward greater contact between artificial intelligence and other engineering
disciplines. Not all that long ago, artificial intelligence was viewed as almost
entirely separate from control theory and statistics. It had to do with logic
and symbols, not numbers. Artificial intelligence was large LISP programs, not
linear algebra, differential equations, or statistics. Over the last decades this
view has gradually eroded. Modern artificial intelligence researchers accept
statistical and control algorithms, for example, as relevant competing methods
or simply as tools of their trade. The previously ignored areas lying between
artificial intelligence and conventional engineering are now among the most


active, including new fields such as neural networks, intelligent control, and our
topic, reinforcement learning. In reinforcement learning we extend ideas from
optimal control theory and stochastic approximation to address the broader
and more ambitious goals of artificial intelligence.

1.2 Examples

A good way to understand reinforcement learning is to consider some of the
examples and possible applications that have guided its development.

• A master chess player makes a move. The choice is informed both by
planning—anticipating possible replies and counterreplies—and by im-
mediate, intuitive judgments of the desirability of particular positions
and moves.

• An adaptive controller adjusts parameters of a petroleum refinery’s op-
eration in real time. The controller optimizes the yield/cost/quality
trade-off on the basis of specified marginal costs without sticking strictly
to the set points originally suggested by engineers.

• A gazelle calf struggles to its feet minutes after being born. Half an hour
later it is running at 20 miles per hour.

• A mobile robot decides whether it should enter a new room in search of
more trash to collect or start trying to find its way back to its battery
recharging station. It makes its decision based on how quickly and easily
it has been able to find the recharger in the past.

• Phil prepares his breakfast. Closely examined, even this apparently mun-
dane activity reveals a complex web of conditional behavior and inter-
locking goal–subgoal relationships: walking to the cupboard, opening it,
selecting a cereal box, then reaching for, grasping, and retrieving the box.
Other complex, tuned, interactive sequences of behavior are required to
obtain a bowl, spoon, and milk jug. Each step involves a series of eye
movements to obtain information and to guide reaching and locomotion.
Rapid judgments are continually made about how to carry the objects
or whether it is better to ferry some of them to the dining table before
obtaining others. Each step is guided by goals, such as grasping a spoon
or getting to the refrigerator, and is in service of other goals, such as
having the spoon to eat with once the cereal is prepared and ultimately
obtaining nourishment.


These examples share features that are so basic that they are easy to over-
look. All involve interaction between an active decision-mak