Reinforcement Learning
Introduction
Subramanian Ramamoorthy
School of Informatics
17 January 2017
Admin
Lecturer: Subramanian (Ram) Ramamoorthy
IPAB, School of Informatics
s.ramamoorthy@ed (preferred method of contact)
Informatics Forum 1.41, x505119
Teaching Assistant: Svetlin Penkov, SV.Penkov@ed.ac.uk
Class representative?
Mailing list: are you on it? I will use it for announcements!
Admin
Lectures:
Tuesday and Friday 12:10 – 13:00 (Teviot Lecture Theatre)
Assessment: Homework/Exam 10+10% / 80%
– HW1: Out 3 Feb, Due 17 Feb
• Some pen+paper questions & setup for HW2
– HW2: Out 28 Feb, Due 28 Mar
• Implement an agent using ALE
Admin
Webpage: www.informatics.ed.ac.uk/teaching/courses/rl
• Lecture slides to be uploaded, typically, the day before
Main Textbook
– R. Sutton and A. Barto, Reinforcement Learning, 2nd ed.
Other Suggested Books
– See list on course web page: going into more detail on specific
topics (e.g., dynamic programming, POMDPs, active sensing) than
possible or necessary in lectures. These books are FYI and optional.
Other readings, e.g., papers, to be suggested for specific lectures.
Background: Mathematics, Matlab, exposure to machine learning?
Problem of Learning from Interaction
– with environment
– to achieve some goal
• Baby playing. No teacher. Sensorimotor connection to
environment.
– Cause <-> effect
– Action <-> consequences
– How to achieve goals
• Learning to drive a car, hold a conversation, etc.
• Environment’s response affects our subsequent actions
• We find out the effects of our ac;ons later
Rough History of RL Ideas
• Psychology – learning by trial and error
… actions followed by good or bad outcomes have their
tendency to be reselected altered accordingly
– Selectional: try alternatives and pick good ones
– Associative: associate alternatives with particular situations
• Computational studies (e.g., credit assignment problem)
– Minsky’s SNARC, 1950
– Michie’s MENACE, BOXES, etc. 1960s
• Temporal Difference learning (Minsky, Samuel, Shannon, …)
– Driven by differences between successive estimates over time
Rough History of RL, contd.
• In the 1970s-80s, many researchers, e.g., Klopf, Sutton & Barto, …,
looked seriously at issues of “getting results from the
environment” as opposed to supervised learning
– Although supervised learning methods such as backpropagation
were sometimes used, the emphasis was different
• Stochastic optimal control (mathematics, operations
research)
– Deep roots: Hamilton-Jacobi → Bellman/Howard
– By the 1980s, people began to realize the connection between
MDPs and the RL problem as above…
What is the Nature of the Problem?
• As you can tell from the history, there are many ways to understand the
problem – you will see this as we proceed through the course
• Unifying perspective:
– Sequential decision making
– Stochastic optimization over time
• Let us unpack this through a few application examples…
Example Domain: Robotics
[Slide figure: an agent's perception/action loop with the environment, subject to adversarial actions & other agents, and driven by high-level goals]
Problem: How to generate actions, to achieve high-level goals, using limited
perception and incomplete knowledge of the environment?
Example Domain:
Natural Language Dialogue
[Figure source: http://www.agilingua.com/images/DM_EN.jpg]
How to Model Decisions, Computationally?
• Who makes them?
– Individual
– ‘Group’
• What are the conditions?
– Certainty
– Risk
– Uncertainty
For most of this course, we’ll take the ‘individual’ viewpoint
and we’ll be somewhere in between risk and uncertainty
How to Model Decision Making under Certainty?
• Given a set of possible acts
• Choose one that maximizes some given index
If a is a generic act in a set of feasible acts A, and f(a) is the index
being maximized, then:
Problem: Find a* in A such that f(a*) ≥ f(a) for all a in A.
The index f plays a key role, e.g., think of buying a painting.
Essential problem: How should the subject select an index
function such that her choice reduces to finding maximizers?
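A minimal sketch of this maximization in Python (the acts, index values and the helper name best_act are invented here for illustration, not part of the course material):

# Decision under certainty: choose the act that maximizes a given index f.

def best_act(acts, f):
    """Return a* in acts with f(a*) >= f(a) for every a in acts."""
    return max(acts, key=f)

# Example: three candidate paintings; the index values are made up.
index = {"painting_A": 0.7, "painting_B": 0.9, "painting_C": 0.4}
print(best_act(index, lambda a: index[a]))  # -> painting_B

Note that max(index, ...) iterates over the dictionary keys, i.e., over the set of feasible acts.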
An Operational Way to Find an Index Function
• Observe the subject’s behaviour in restricted settings and predict
purchase behaviour from that:
• Instruct the subject as follows:
– Here are ten valuable reproductions
– We will present these to you in pairs
– You will tell us which one of the pair you prefer to own
– After you have evaluated all pairs, we will pick a pair at random and
present you with the choice you previously made (it is to your
advantage to remember your true tastes)
• The subject’s behaviour is as though there is a ranking over all
paintings, so each painting can be summarized by a number
Some Desiderata of this Ranking
• Transitivity: The previous argument only makes sense if the ranking is
transitive – if A is preferred in (A, B) and B is preferred in (B,
C), then A is preferred in (A, C); and this holds for all triples of
alternatives A, B and C
• Ordinal nature of index: One is tempted to immediately turn
the ranking into a latent measure of ‘satisfaction’ but that is
premature, as utilities need not be unique.
e.g., we could assign 3 utiles to A, 2 utiles to B and 1 utile to C
to explain the choice behaviour
Equally, 30, 20.24 and 3.14 would yield the same choice
While it is OK to compare indices, it isn’t (yet) OK to add/multiply
What Happens if we Relax Transitivity?
• Assume Pandora says (in the pairwise comparisons):
– Apple < Orange
– Orange < Fig
– Fig < Apple
• Why is this a problem for Pandora?
• Assume a merchant who transacts with her as follows:
– Pandora has an Apple at the start of the conversation
– He offers to exchange an Orange for the Apple, if she gives him a penny
– He then offers an exchange of a Fig for the Orange, at the price of a penny
– Then, he offers the Apple for the Fig, for a penny
– Now, what is Pandora’s net position?
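To see the answer concretely, here is a small sketch in Python (only the penny-per-trade prices and the starting Apple come from the story above; the code itself is illustrative):

# Money pump: with the cyclic preferences Apple < Orange < Fig < Apple,
# Pandora happily accepts each trade, pays a penny each time,
# and ends up holding exactly what she started with.

trades = [("Apple", "Orange"), ("Orange", "Fig"), ("Fig", "Apple")]

holding, pennies_paid = "Apple", 0
for give, receive in trades:
    assert holding == give      # she has the less-preferred item
    holding = receive           # swaps it for the one she prefers
    pennies_paid += 1           # and pays the merchant a penny

print(holding, pennies_paid)    # -> Apple 3: same fruit, three pennies poorer

The merchant can repeat the cycle indefinitely, which is why intransitive preferences are considered irrational.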
Decision Making under Risk
• Initially appeared as the analysis of fair gambles, which needed some
notion of utility
• Gamble has n outcomes, each worth a1, …, an
• The probability of each outcome is p1, …, pn
• How much is it worth to participate in this gamble?
b = a1 p1 + … + an pn
One may treat this monetary expected value as a fair price
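As a sketch, computing this fair price is a one-liner in Python (the outcomes and probabilities below are made up purely for illustration):

# Expected monetary value of a gamble: b = a1*p1 + ... + an*pn.

def expected_value(outcomes, probs):
    return sum(a * p for a, p in zip(outcomes, probs))

# Illustrative gamble: win 10 with prob. 0.2, nothing with 0.5, lose 5 with 0.3.
print(expected_value([10, 0, -5], [0.2, 0.5, 0.3]))  # -> 0.5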
Is this a sufficient description of choice behaviour under risk?
St. Petersburg Paradox of D. Bernoulli
• A fair coin is tossed until a head appears
• The gambler receives 2^n if the first head appears on trial n
• Probability of this event = probability of tails on the first (n-1) trials
and heads on trial n, i.e., (1/2)^n
Expected value = 2·(1/2) + 4·(1/2)^2 + 8·(1/2)^3 + … = ∞
• Are you willing to bet in this way? Is anyone?
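A quick numerical check of the divergence (the truncation points below are arbitrary; this only illustrates the sum above):

# St. Petersburg gamble: prize 2^n with probability (1/2)^n.
# Every term 2**n * 0.5**n equals 1, so the partial sums grow without bound.

def partial_expected_value(n_terms):
    return sum((2 ** n) * (0.5 ** n) for n in range(1, n_terms + 1))

for n in (10, 100, 1000):
    print(n, partial_expected_value(n))  # -> 10.0, 100.0, 1000.0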
Defining Utility
• Bernoulli went on to argue that people do not act in this way
• The thing to average is the ‘intrinsic worth’ of the monetary
values, not the absolute values
e.g., intrinsic worth of money may increase with money but at
a diminishing rate
• Let us say the utility of m is log10 m; then the expected utility is
log10 2 · (1/2) + log10 4 · (1/2)^2 + log10 8 · (1/2)^3 + … = b < ∞
The monetary fair price of the gamble is a, where log10 a = b.
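Repeating the computation with log10 utility shows the sum converging, and the implied fair price works out to 4 monetary units (a sketch; truncating at 200 terms is arbitrary but already accurate):

import math

# Bernoulli's proposal: average the utility log10(2^n) of each prize,
# not the prize itself. The series now converges.

b = sum(math.log10(2 ** n) * (0.5 ** n) for n in range(1, 201))
print(b)        # ~0.602 = 2*log10(2)
print(10 ** b)  # ~4.0: the monetary fair price a with log10(a) = b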
Some Critiques of Bernoulli's Formulation
von Neumann and Morgenstern (vNM), who initiated the
formal study of game theory, raised the following questions:
• The assignment of utility to money is arbitrary and ad hoc
– There is an infinity of functions that capture a ‘diminishing rate’;
how should we choose?
– The association may vary from person to person
• Why is the definition of the decision based upon the expected
value of the utility?
– Is this actually descriptive of a single gambler, in a one-shot choice?
– How to define/constrain utility?
von Neumann & Morgenstern Formulation
• If a person is able to express preferences between every
possible pair of gambles
where gambles are taken over some basic set of alternatives
• Then one can introduce utility associations to the basic
alternatives in such a manner that
• If the person is guided solely by the expected value of the utility, he
is acting in accord with his true tastes.
– provided his tastes are consistent in some way
Constructing Utility Functions
• Suppose we know the following preference order:
– a < b ~ c < d < e
• The following are utility functions that capture this:
        a      b      c      d      e
  U     0     1/2    1/2    3/4     1
  V    -1      1      1      2      3
  W    -8      0      0      1      8
– So, in situations like the St Petersburg paradox, the revealed preference of
any realistic player may differ from the case of infinite expected value
– Satisfaction at some large value, risk tolerance, time preference, etc.
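A small check that the three assignments are ordinally equivalent, i.e., that they rank the alternatives identically (the dictionaries just transcribe the table above):

# U, V and W all induce the ranking a < b ~ c < d < e, so as ordinal
# utilities they are interchangeable, even though their numbers differ.

U = {"a": 0, "b": 0.5, "c": 0.5, "d": 0.75, "e": 1}
V = {"a": -1, "b": 1, "c": 1, "d": 2, "e": 3}
W = {"a": -8, "b": 0, "c": 0, "d": 1, "e": 8}

def same_order(u, v):
    keys = list(u)
    return all((u[x] > u[y]) == (v[x] > v[y]) for x in keys for y in keys)

print(same_order(U, V), same_order(U, W))  # -> True True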
Certainty Equivalents and Indifference
• The previous statement applies equally well to certain events
and gambles or lotteries
• So, even attitudes regarding tradeoffs between the two ought
to be captured
• Basic issue – how to compare?
• Imagine the following choice (preferences A > B > C): (a) you get B
for certain, (b) you get A with probability p and C otherwise
• If p is near 1, option (b) is better; if p is near 0, then option (a):
there is a single point where we switch
• Indifference is described by something like
(2/3)·(1) + (1 – 2/3)·(0) = 2/3, i.e., with u(A) = 1 and u(C) = 0, the switch happens at p = 2/3, so u(B) = 2/3
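A tiny sketch of that calculation (the normalization u(A) = 1, u(C) = 0 and the switching point 2/3 are the ones used above):

# B-for-certain vs. a lottery giving A with probability p and C otherwise.
# With u(A) = 1 and u(C) = 0, indifference means u(B) = p*1 + (1-p)*0 = p,
# so the switching probability itself is B's utility.

def lottery_utility(p, u_A=1.0, u_C=0.0):
    return p * u_A + (1 - p) * u_C

p_switch = 2 / 3
print(lottery_utility(p_switch))  # -> 0.666... = u(B)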
Decision Making under Uncertainty
• A choice must be made from among a set of acts, A1, …, Am.
• The relative desirability of these acts depends on which state
of nature prevails, one of s1, …, sn.
• As decision maker we know that one of several things is true
and this influences our choice
• Key point about decision making: whether or not you have a
probabilistic characterization of alternatives has a big impact
on how to approach the problem
Example: Savage’s Omelet Problem
Your friend has broken 5 good eggs into a bowl when you come
in to volunteer and finish the omelet.
A sixth egg lies unbroken (you must use it or waste it
altogether).
Your three acts: break it into the bowl; break it into a saucer, inspect,
and pour into the bowl; throw it away uninspected
Decision in Savage’s Omelet Problem
• To each outcome, we could assign a utility and maximize it
• What do we know about the state of nature?
– We may act as though there is one true state; we just don't know it
– If we assume a probability over the si, this is decision under risk
– If we do not assume a probability over the si, what might one do?
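One classical answer (not the only one) is a worst-case rule such as maximin. The sketch below applies it to the omelet problem with invented utilities, purely for illustration:

# Without probabilities over the states {egg good, egg rotten}, the maximin
# rule picks the act whose worst-case utility is highest. The numbers are
# made up; only the three acts come from Savage's story.

utilities = {                      # (utility if good, utility if rotten)
    "break_into_bowl":   (4, 0),   # six-egg omelet vs. five good eggs ruined
    "break_into_saucer": (3, 2),   # safe either way, but extra washing-up
    "throw_away":        (1, 3),   # may waste a perfectly good egg
}

maximin_act = max(utilities, key=lambda act: min(utilities[act]))
print(maximin_act)  # -> break_into_saucer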