
Reinforcement Learning

Introduction

Subramanian Ramamoorthy
School of Informatics

17 January 2017

Admin

Lecturer: Subramanian (Ram) Ramamoorthy
IPAB, School of Informatics
s.ramamoorthy@ed (preferred method of contact)
Informatics Forum 1.41, x505119

Teaching Assistant: Svetlin Penkov, SV.Penkov@ed.ac.uk

Class representative?

Mailing list: are you on it? I will use it for announcements!


Admin

Lectures:
Tuesday and Friday 12:10 – 13:00 (Teviot Lecture Theatre)

Assessment: Homework (10% + 10%) / Exam (80%)

–  HW1: Out 3 Feb, Due 17 Feb
•  Some pen+paper questions & setup for HW2

–  HW2: Out 28 Feb, Due 28 Mar
•  Implement an agent using ALE


Admin

Webpage: www.informatics.ed.ac.uk/teaching/courses/rl
•  Lecture slides to be uploaded, typically, the day before

Main Textbook
–  R. Sutton and A. Barto, Reinforcement Learning, 2nd ed.

Other Suggested Books
–  See list on the course web page: these go into more detail on specific
topics (e.g., dynamic programming, POMDPs, active sensing) than is
possible or necessary in lectures. These books are FYI and optional.

Other readings, e.g., papers, to be suggested for specific lectures.

Background: Mathematics, Matlab, exposure to machine learning?


Problem of Learning from Interaction

–  with environment
–  to achieve some goal

•  Baby playing. No teacher. Sensorimotor connection to
environment.
– Cause <-> effect
– Action <-> consequences
– How to achieve goals

•  Learning to drive a car, hold a conversation, etc.

•  Environment’s response affects our subsequent actions
•  We find out the effects of our actions later


Rough History of RL Ideas

•  Psychology – learning by trial and error
… actions followed by good or bad outcomes have their
tendency to be reselected altered accordingly
-  Selectional: try alternatives and pick good ones
-  Associative: associate alternatives with particular situations

•  Computational studies (e.g., the credit assignment problem)
–  Minsky’s SNARC, 1950s
–  Michie’s MENACE, BOXES, etc., 1960s

•  Temporal Difference learning (Minsky, Samuel, Shannon, …)
–  Driven by differences between successive estimates over time


Rough History of RL, contd.

•  In the 1970s–80s, many researchers, e.g., Klopf, Sutton & Barto, …,
looked seriously at issues of “getting results from the
environment” as opposed to supervised learning
–  Although supervised learning methods such as backpropagation
were sometimes used, the emphasis was different

•  Stochastic optimal control (mathematics, operations
research)
–  Deep roots: Hamilton-Jacobi → Bellman/Howard
–  By the 1980s, people began to realize the connection between
MDPs and the RL problem as above…


What is the Nature of the Problem?

•  As you can tell from the history, there are many ways to understand the
problem – you will see this as we proceed through the course

•  Unifying perspective:
–  Sequential decision making
–  Stochastic optimization over time

•  Let us unpack this through a few application examples…


Example Domain: Robotics

[Diagram: the agent’s perception-action loop with its environment, subject to adversarial actions and other agents, and driven by high-level goals]

Problem: How to generate actions, to achieve high-level goals, using limited
perception and incomplete knowledge of the environment?


Example Domain:
Natural Language Dialogue

[Source: http://www.agilingua.com/images/DM_EN.jpg]

How to Model Decisions, Computationally?

•  Who makes them?
–  Individual
–  ‘Group’

•  What are the conditions?
–  Certainty
–  Risk
–  Uncertainty

For most of this course, we’ll take the ‘individual’ viewpoint
and we’ll be somewhere in between risk and uncertainty


How to Model Decision under Certainty?

•  Given a set of possible acts
•  Choose one that maximizes some given index

If a is a generic act in a set of feasible acts A, and f(a) is an index
being maximized, then:
Problem: Find a* in A such that f(a*) ≥ f(a) for all a in A.

The index f plays a key role, e.g., think of buying a painting.
Essential problem: How should the subject select an index
function such that her choice reduces to finding maximizers?
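As a minimal sketch of this maximization, assuming the feasible set A can be enumerated and the index f evaluated for each act, the choice reduces to an argmax; the act names and index values below are hypothetical, for illustration only.

% Decision under certainty: choose the act that maximizes a given index f.
% The acts and index values are made up for illustration.
acts = {'buy painting 1', 'buy painting 2', 'keep the money'};
f    = [0.4, 0.9, 0.6];              % index value f(a) for each act

[fbest, idx] = max(f);               % a* = argmax over the feasible set A
fprintf('Optimal act a* = %s, with f(a*) = %.2f\n', acts{idx}, fbest);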


An Operational Way to Find an Index Function

•  Observe the subject’s behaviour in restricted settings and predict
purchase behaviour from that:

•  Instruct the subject as follows:
–  Here are ten valuable reproductions
–  We will present these to you in pairs
–  You will tell us which one of the pair you prefer to own
–  After you have evaluated all pairs, we will pick a pair at random and
present you with the choice you previously made (it is to your
advantage to remember your true tastes)

•  The subject’s behaviour is as though there is a ranking over all
paintings, so each painting can be summarized by a number


Some Desiderata of this Ranking
•  Transitivity: The previous argument only makes sense if the rank is
transitive – if A is preferred in (A, B) and B is preferred in (B, C)
then A is preferred in (A, C); and this holds for all triples of
alternatives A, B and C

•  Ordinal nature of index: One is tempted to immediately turn
the ranking into a latent measure of ‘satisfaction’, but that is
premature, as utilities need not be unique.
e.g., we could assign 3 utiles to A, 2 utiles to B and 1 utile to C
to explain the choice behaviour
Equally, 30, 20.24 and 3.14 would yield the same choice

While it is OK to compare indices, it isn’t (yet) OK to add/multiply
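As a quick MATLAB sketch of this ordinal point, the two assignments above (3, 2, 1 and 30, 20.24, 3.14) induce exactly the same choice:

% Ordinal index: any strictly increasing relabelling yields the same choice.
u1 = [3, 2, 1];                      % utiles assigned to A, B, C (as above)
u2 = [30, 20.24, 3.14];              % alternative assignment preserving the order

[~, choice1] = max(u1);
[~, choice2] = max(u2);
assert(choice1 == choice2);          % both indices pick A
fprintf('Both indices select alternative %d\n', choice1);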


What Happens if we Relax Transitivity?

•  Assume Pandora says (in the pairwise comparisons):
–  Apple < Orange
–  Orange < Fig
–  Fig < Apple

•  Why is this a problem for Pandora?

•  Assume a merchant who transacts with her as follows:
–  Pandora has an Apple at the start of the conversation
–  He offers to exchange Orange for Apple, if she gives him a penny
–  He then offers an exchange of Fig for Orange, at the price of a penny
–  Then, offers Apple for the Fig, for a penny
–  Now, what is Pandora’s net position?

Decision Making under Risk

•  Initially appeared as the analysis of fair gambles, which needed some notions of utility
•  A gamble has n outcomes, each worth a1, …, an
•  The probability of each outcome is p1, …, pn
•  How much is it worth to participate in this gamble?

b = a1 p1 + … + an pn

One may treat this monetary expected value as a fair price.
Is this a sufficient description of choice behaviour under risk?

St. Petersburg Paradox of D. Bernoulli

•  A fair coin is tossed until a head appears
•  The gambler receives 2^n if the first head appears on trial n
•  Probability of this event = probability of a tail on the first (n-1) trials and a head on trial n, i.e., (1/2)^n

Expected value = 2·(1/2) + 4·(1/2)^2 + 8·(1/2)^3 + … = ∞

•  Are you willing to bet in this way? Is anyone?

Defining Utility

•  Bernoulli went on to argue that people do not act in this way
•  The thing to average is the ‘intrinsic worth’ of the monetary values, not the absolute values
e.g., the intrinsic worth of money may increase with money, but at a diminishing rate
•  Let us say the utility of m is log10 m; then the expected value is
log10 2·(1/2) + log10 4·(1/2)^2 + log10 8·(1/2)^3 + … = b < ∞
The monetary fair price of the gamble is a, where log10 a = b.
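A small MATLAB sketch makes the contrast concrete; the truncation level N below is an arbitrary choice for illustration. The truncated monetary expectation grows without bound with N, while Bernoulli’s log10 utility gives a finite value b and a modest fair price a = 10^b.

% St. Petersburg gamble: payoff 2^n with probability (1/2)^n, n = 1, 2, ...
N = 50;                                  % truncation level (arbitrary, for illustration)
n = 1:N;
p = (1/2).^n;                            % probability that the first head is on trial n
payoff = 2.^n;                           % monetary payoff

ev_money = sum(payoff .* p);             % equals N here, so it grows without bound
b = sum(log10(payoff) .* p);             % Bernoulli: average the 'intrinsic worth'
a = 10^b;                                % monetary fair price, where log10(a) = b

fprintf('Truncated expected value: %.0f\n', ev_money);
fprintf('Expected log-utility b = %.3f, fair price a = %.2f\n', b, a);

With the log10 utility the fair price comes out near 4, far from the unbounded monetary expectation.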
Some Critiques of Bernoulli’s Formulation

von Neumann and Morgenstern (vNM), who initiated the formal study of
game theory, raised the following questions:

•  The assignment of utility to money is arbitrary and ad hoc
–  There are an infinity of functions that capture ‘diminishing rate’; how should we choose?
–  The association may vary from person to person
•  Why is the definition of the decision based upon the expected value of the utility?
–  Is this actually descriptive of a single gambler, in a one-shot choice?
–  How to define/constrain utility?

von Neumann & Morgenstern Formulation

•  If a person is able to express preferences between every possible pair of
gambles, where gambles are taken over some basic set of alternatives,
•  Then one can introduce utility associations to the basic alternatives in
such a manner that
•  If the person is guided solely by the utility expected value, he is acting in
accord with his true tastes
–  provided his tastes are consistent in some way

Constructing Utility Functions

•  Suppose we know the following preference order:
–  a < b ~ c < d < e
•  The following are utility functions that capture this:

        a      b      c      d      e
  U     0     1/2    1/2    3/4     1
  V    -1      1      1      2      3
  W    -8      0      0      1      8

–  So, in situations like the St Petersburg paradox, the revealed preference of
any realistic player may differ from the case of infinite expected value
–  Satisfaction at some large value, risk tolerance, time preference, etc.

Certainty Equivalents and Indifference

•  The previous statement applies equally well to certain events and
gambles or lotteries
•  So, even attitudes regarding tradeoffs between the two ought to be captured
•  Basic issue – how to compare?
•  Imagine the following choice (with preference A > B > C): (a) you get B
for certain, (b) you get A with probability p and C otherwise
•  If p is near 1, option (b) is better; if p is near 0, then option (a):
there is a single point where we switch
•  Indifference is described as something like
(2/3)(1) + (1 – 2/3)(0) = 2/3
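As a minimal sketch of this calculation, normalising u(A) = 1 and u(C) = 0 (an assumption consistent with the 2/3 example above), the switch point p* solves u(B) = p*·u(A) + (1 − p*)·u(C):

% Certainty equivalent: indifference between sure B and the lottery (A w.p. p, else C).
uA = 1; uC = 0;                      % normalised utilities of the best and worst options
uB = 2/3;                            % utility of the sure option, as in the example above

p_star = (uB - uC) / (uA - uC);      % switch point where u(B) = p*u(A) + (1-p*)u(C)
fprintf('Prefer the lottery when p > %.2f, the sure option B when p < %.2f\n', ...
        p_star, p_star);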


Decision Making under Uncertainty
•  A choice must be made from among a set of acts, A1, …, Am.
•  The relative desirability of these acts depends on which state
of nature prevails, one of s1, …, sn.
•  As the decision maker, we know that one of several things is true,
and this influences our choice

•  Key point about decision making: whether or not you have a
probabilistic characterization of alternatives has a big impact
on how to approach the problem


Example: Savage’s Omelet Problem

Your friend has broken 5 good eggs into a bowl when you come
in and volunteer to finish the omelet.

A sixth egg lies unbroken (you must use it or waste it
altogether).

Your three acts: break it into the bowl; break it into a saucer, inspect it
and pour it into the bowl; or throw it away uninspected


Decision in Savage’s Omelet Problem

•  To each outcome, we could assign a utility and maximize it
•  What do we know about the state of nature?

–  We may act as though there is one true state; we just don’t know it
–  If we assume a probability over si, this is decision under risk
–  If we do not assume a probability over si, what might one do?
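As a sketch of how this looks as an act-by-state table in MATLAB: the utilities and the probability that the sixth egg is rotten are made-up numbers, used only to illustrate the ‘decision under risk’ branch; the no-probability case is left open, as in the last bullet.

% Savage's omelet: 3 acts x 2 states (the sixth egg is good / rotten).
% All numbers are hypothetical, chosen only to show the structure.
acts = {'break into bowl', 'break into saucer then pour', 'throw away uninspected'};
%      state:  good  rotten
U = [          6     -5;     % bowl: 6-egg omelet, or all 5 good eggs ruined
               5      3;     % saucer: an extra dish to wash, but nothing ruined
               4      4 ];   % throw away: 5-egg omelet, one good egg possibly wasted

p_rotten = 0.1;                          % assumed probability that the egg is rotten
p = [1 - p_rotten; p_rotten];            % distribution over the states of nature

eu = U * p;                              % decision under risk: expected utility per act
[~, best] = max(eu);
fprintf('Best act under this assumed risk model: %s\n', acts{best});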
