
Reinforcement Learning

Bandit Problems, Markov Chains and Markov
Decision Processes

Subramanian Ramamoorthy
School of Informatics

20 January 2017

What is Reinforcement Learning (RL)?

•  An approach to Artificial Intelligence
•  Learning from interaction
•  Learning about, from, and while interacting (trial and error) with an external environment
•  Goal-oriented learning; implying delayed rewards

•  Learning what to do (how to map situations to actions) so as to maximize a numerical reward signal

•  Can be thought of as stochastic optimization over time


Setup for RL
Agent (algorithm) is:
•  Temporally situated
•  Continual learning and planning
•  Objective is to affect the environment – actions and states
•  Environment is uncertain, stochastic

[Diagram: agent–environment interaction loop]

Multi-arm Bandits (MAB)

•  N possible actions
•  You can play for some period of time and you want to maximize reward (expected utility)

Which is the best arm/machine?

DEMO


Numerous Applications!


What is the Choice?


n-Armed Bandit Problem

•  Choose repeatedly from one of n actions; each choice is called a play

•  After each play a_t, you get a reward r_t, where E[ r_t | a_t ] = Q*(a_t)

   These are unknown action values.
   The distribution of r_t depends only on a_t.

Objective is to maximize the reward in the long term, e.g., over 1000 plays

To solve the n-armed bandit problem,
you must explore a variety of actions
and exploit the best of them


Exploration/Exploitation Dilemma

•  Suppose you form action-value estimates  Q_t(a) ≈ Q*(a)

•  The greedy action at time t is  a_t* = argmax_a Q_t(a)

      a_t = a_t*  ⇒  exploitation
      a_t ≠ a_t*  ⇒  exploration

•  You can’t exploit all the time; you can’t explore all the time
•  You can never stop exploring; but you could reduce exploring. Why?

Action-Value Methods

•  Methods that adapt action-value estimates and nothing else,
   e.g.: suppose by the t-th play, action a had been chosen k_a times,
   producing rewards r_1, r_2, …, r_{k_a}; then

      Q_t(a) = ( r_1 + r_2 + … + r_{k_a} ) / k_a        “sample average”

•  In the limit,  lim_{k_a → ∞} Q_t(a) = Q*(a)

What is the behaviour
with finite samples?

ε-Greedy Action Selection

•  Greedy action selection:

      a_t = a_t* = argmax_a Q_t(a)

•  ε-Greedy:

      a_t = a_t*               with probability 1 − ε
      a_t = a random action    with probability ε

. . . the simplest way to balance exploration and exploitation

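A minimal sketch of this selection rule in Python, not from the course materials; `Q` is a list of current action-value estimates and `eps` is ε:

```python
import random

def eps_greedy_action(Q, eps):
    """With probability eps pick a uniformly random action,
    otherwise pick a greedy action (ties broken at random)."""
    if random.random() < eps:
        return random.randrange(len(Q))
    best = max(Q)
    return random.choice([a for a, q in enumerate(Q) if q == best])
```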


Worked Example: 10-Armed Testbed

•  n = 10 possible actions

•  Each true value Q*(a) is chosen randomly from a normal distribution: Q*(a) ~ N(0, 1)

•  Each reward r_t is also normal: r_t ~ N(Q*(a_t), 1)

•  1000 plays; repeat the whole thing 2000 times and average the results
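A sketch of one testbed instance as described above (the function names are illustrative, not from the slides); each run draws fresh true values and then samples rewards around them:

```python
import random

def make_testbed(n=10):
    """True action values: Q*(a) ~ N(0, 1) for each of the n arms."""
    return [random.gauss(0.0, 1.0) for _ in range(n)]

def pull(q_star, a):
    """Reward for playing arm a: r_t ~ N(Q*(a), 1)."""
    return random.gauss(q_star[a], 1.0)

# A full experiment would repeat 2000 independent runs of 1000 plays and average the rewards.
```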

10-Armed Testbed Rewards

[Figure: the ten reward distributions, one per action, each centred on its true value q*(1), …, q*(10); vertical axis: reward distribution (roughly −4 to 4), horizontal axis: action (1 to 10).]

ε-Greedy Methods on the 10-Armed Testbed


Incremental Implementation

Sample-average estimation method: the average of the first k rewards is
(dropping the dependence on a):

      Q_k = ( r_1 + r_2 + … + r_k ) / k

How to do this incrementally (without storing all the rewards)?
We could keep a running sum and count, or, equivalently:

      Q_{k+1} = Q_k + (1 / (k + 1)) [ r_{k+1} − Q_k ]

This is of the general form:

      NewEstimate = OldEstimate + StepSize [ Target − OldEstimate ]

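Both the incremental sample-average rule above and the constant-step-size rule used for nonstationary problems on the next slide fit this NewEstimate pattern. A small sketch (illustrative names, not from the slides):

```python
def update_sample_average(q, n, reward):
    """Q_{k+1} = Q_k + (1/(k+1)) (r_{k+1} - Q_k); n is the number of rewards already averaged."""
    return q + (reward - q) / (n + 1)

def update_constant_alpha(q, reward, alpha):
    """Q_{k+1} = Q_k + alpha (r_{k+1} - Q_k): an exponential, recency-weighted average."""
    return q + alpha * (reward - q)
```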

Tracking a Non-stationary Problem

Choosing Q_k to be a sample average is appropriate in a stationary problem,
i.e., when none of the Q*(a) change over time,
but not in a nonstationary problem.

The better option in the nonstationary case is:

      Q_{k+1} = Q_k + α [ r_{k+1} − Q_k ]        for constant α, 0 < α ≤ 1
              = (1 − α)^k Q_0 + Σ_{i=1..k} α (1 − α)^{k−i} r_i

an exponential, recency-weighted average.

Optimistic Initial Values

•  All methods so far depend on Q_0(a), i.e., they are biased
•  Encourage exploration: initialize the action values optimistically,
   e.g., on the 10-armed testbed, use Q_0(a) = 5 for all a

Softmax Action Selection

•  Softmax action selection methods grade action probabilities by estimated values.
•  The most common softmax uses a Gibbs, or Boltzmann, distribution:
   choose action a on play t with probability

      e^{Q_t(a)/τ} / Σ_{b=1..n} e^{Q_t(b)/τ}

   where τ is a ‘computational temperature’.

Another Interpretation of MAB Problems

Related to ‘rewards’

MAB is a Special Case of Online Learning

How to Evaluate Online Alg.: Regret

•  After you have played for T rounds, you experience a regret:
   [reward sum of the optimal strategy] − [sum of actually collected rewards]

      ρ = T μ* − Σ_{t=1..T} r̂_t = T μ* − E[ Σ_{t=1..T} r_{i_t} ],    where μ* = max_k μ_k

   (the expectation is over the randomness in the draw of rewards and the player’s strategy)

•  If the average regret per round goes to zero with probability 1, asymptotically, we say
   the strategy has the no-regret property ~ guaranteed to converge to an optimal strategy
•  ε-greedy is sub-optimal (so has some regret). Why?

Interval Estimation

•  Attribute to each arm an “optimistic initial estimate” within a certain confidence interval
•  Greedily choose the arm with the highest optimistic mean (upper bound of the confidence interval)
•  An infrequently observed arm will have an over-valued reward mean, leading to exploration
•  Frequent usage pushes the optimistic estimate towards the true value

Interval Estimation Procedure

•  Associate to each arm the 100(1 − α)% upper band on the reward mean
•  Assume, e.g., rewards are normally distributed
•  The arm is observed n times to yield an empirical mean and standard deviation
•  α-upper bound:

      u_α = μ̂ + (σ̂ / √n) c^{−1}(1 − α),    where  c(t) = (1/√(2π)) ∫_{−∞}^{t} exp(−x²/2) dx

   is the cumulative distribution function of the standard normal
•  If α is carefully controlled, this could be made a zero-regret strategy
   –  In general (i.e., for other distributions), we don’t know

Reminder: Chernoff-Hoeffding Bound

Variant: UCB Strategy

•  Again, based on the notion of an upper confidence bound, but more generally applicable
•  Algorithm:
   –  Play each arm once
   –  At time t > K, play the arm i_t maximizing

         r̄_j(t) + √( 2 ln t / T_{j,t} )

      where T_{j,t} is the number of times arm j has been played so far
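A hedged sketch of this selection rule (it assumes every arm has already been played once, so all counts are positive; the names are illustrative):

```python
import math

def ucb_choice(means, counts, t):
    """Pick the arm maximizing  mean_j + sqrt(2 ln t / T_{j,t}).
    means[j]: empirical mean reward of arm j; counts[j]: times arm j has been played; t: current round."""
    scores = [m + math.sqrt(2.0 * math.log(t) / c) for m, c in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)
```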

UCB Strategy


UCB Strategy – Behaviour


We will not prove the following result, but I quote the
theorem to explain the benefit of UCB – regret is bounded.

K = number of arms


Empirical Behaviour: UCB

[Figure: average reward vs. steps on the 10-armed testbed, comparing ε-greedy (ε = 0.1) with UCB (c = 2).]

Variation on SoftMax: Exp3

•  It is possible to drive regret down by annealing τ in the softmax rule

      e^{Q_t(a)/τ} / Σ_{b=1..n} e^{Q_t(b)/τ}

•  Exp3: Exponential-weight algorithm for exploration and exploitation
•  Probability of choosing arm k at time t is

      P_k(t) = (1 − γ) w_k(t) / Σ_{j=1..K} w_j(t) + γ / K

   with weights updated as

      w_j(t+1) = w_j(t) exp( γ r_j(t) / (P_j(t) K) )    if arm j is pulled at t
      w_j(t+1) = w_j(t)                                 otherwise

•  γ is a user-defined open parameter
•  Regret is O( √(T K log K) )
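A sketch of one Exp3 round matching the update above, assuming rewards lie in [0, 1]; `pull_arm` is a hypothetical callback that returns the observed reward:

```python
import math
import random

def exp3_round(weights, gamma, pull_arm):
    """Sample an arm from the mixed distribution, observe its reward,
    and re-weight only the pulled arm with an importance-weighted estimate."""
    K = len(weights)
    total = sum(weights)
    probs = [(1.0 - gamma) * w / total + gamma / K for w in weights]
    j = random.choices(range(K), weights=probs)[0]   # draw arm j with probability probs[j]
    r = pull_arm(j)                                  # assumed reward in [0, 1]
    weights[j] *= math.exp(gamma * r / (probs[j] * K))
    return j, r
```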

The Gittins Index

•  Each arm delivers reward with a probability
•  This probability may change through time, but only when the arm is pulled
•  Goal is to maximize discounted rewards – the future is discounted by an exponential discount factor δ < 1
•  The structure of the problem is such that all you need to do is compute an “index” for each arm and play the one with the highest index
•  The index is of the form:

      ν_i = sup_{T>0}  E[ Σ_{t=0..T} δ^t R_i(t) ]  /  E[ Σ_{t=0..T} δ^t ]

Gittins Index – Intuition

•  We will not give a proof of its optimality now, and will return to that issue later in the course.

•  Analysis is based on stopping time: the point where you should ‘terminate’ a bandit arm

•  Nice property: the Gittins index for any given bandit is independent of the expected outcome of all other bandits
   –  Once you have a good arm, keep playing until there is a better one
   –  If you add/remove machines, the computation doesn’t really change

BUT:
   –  hard to compute, even when you know the distributions
   –  exploration issues; arms aren’t updated unless used (restless bandits?)

What About State Changes?

•  In MAB, we were in a single casino and the only decision is to pull from a set of n arms
   –  Some change, perhaps, if an adversary were introduced…

Next,
•  What if there is more than one state?
•  So, in this state space, what is the effect of the distribution of payout changing based on how you pull arms?
•  What happens if you only obtain a net reward corresponding to a long sequence of arm pulls (at the end)?

Decision Making Agent-Environment Interface

Agent and environment interact at discrete time steps: t = 0, 1, 2, …
   Agent observes state at step t:     s_t ∈ S
   produces action at step t:          a_t ∈ A(s_t)
   gets resulting reward:              r_{t+1} ∈ ℝ
   and resulting next state:           s_{t+1}

   . . . s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, . . .

Markov Decision Processes

•  A model of the agent-environment system
•  Markov property = history doesn’t matter, only the current state
•  If state and action sets are finite, it is a finite MDP.
•  To define a finite MDP, you need to give:
   –  state and action sets
   –  one-step “dynamics” defined by transition probabilities:

         P^a_{ss'} = Pr{ s_{t+1} = s' | s_t = s, a_t = a }             for all s, s' ∈ S, a ∈ A(s)

   –  reward probabilities:

         R^a_{ss'} = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s' }     for all s, s' ∈ S, a ∈ A(s)
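One way to hold these one-step dynamics and expected rewards in code, a sketch only; the two states, two actions, and all the numbers are made up for illustration, loosely in the spirit of the recycling robot that follows:

```python
# P[s][a][s2] = Pr{ s_{t+1} = s2 | s_t = s, a_t = a }
# R[s][a][s2] = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s2 }
P = {
    "high": {"search": {"high": 0.7, "low": 0.3}, "wait": {"high": 1.0, "low": 0.0}},
    "low":  {"search": {"high": 0.2, "low": 0.8}, "wait": {"high": 0.0, "low": 1.0}},
}
R = {
    "high": {"search": {"high": 1.0, "low": 1.0},  "wait": {"high": 0.5, "low": 0.5}},
    "low":  {"search": {"high": -3.0, "low": 1.0}, "wait": {"high": 0.5, "low": 0.5}},
}

def expected_reward(s, a):
    """E[ r_{t+1} | s_t = s, a_t = a ] = sum over s2 of P[s][a][s2] * R[s][a][s2]."""
    return sum(P[s][a][s2] * R[s][a][s2] for s2 in P[s][a])
```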

Recycling Robot

An Example Finite MDP

•  At each step, the robot has to decide whether it should (1) actively search for a can, (2) wait for someone to bring it a can, or (3) go to home base and recharge.

•  Searching is better but runs down the battery; if it runs out of power while searching then it has to be rescued (which is bad).

•  Decisions are made on the basis of the current energy level: high, low.

•  Reward = number of cans collected

Recycling Robot MDP

Rewards while searching/waiting:  r_search > r_wait

Enumerated In Tabular Form

If you were given this much, what can you say about the behaviour (over time) of the system?

Very Brief Primer on
Markov Chains


Stochastic Processes

•  A stochastic process is an indexed collection of random variables {X_t}.
   –  e.g., the collection of weekly demands for a product

•  One type: at a particular time t, labelled by integers, the system is found in exactly one of a finite number of mutually exclusive and exhaustive categories or states, labelled by integers too

•  The process could be embedded in that time points correspond to the occurrence of specific events (or time may be equi-spaced)

•  Random variables may depend on others, e.g.,

Markov Chains

•  The stochastic process is said to have the Markovian property if

      P{ X_{t+1} = j | X_0 = k_0, X_1 = k_1, …, X_{t−1} = k_{t−1}, X_t = i } = P{ X_{t+1} = j | X_t = i }

•  The Markovian property means that the conditional probability of a future event, given any past events and the current state, is independent of past states and depends only on the present

•  These conditional probabilities are the transition probabilities

•  They are stationary if time-invariant, denoted p_ij

Markov Chains

•  Looking forward in time: n-step transition probabilities, p_ij(n) = P{ X_{t+n} = j | X_t = i }

•  One can write a transition matrix P = [p_ij]

•  A stochastic process is a finite-state Markov chain if it has:
   –  a finite number of states
   –  the Markovian property
   –  stationary transition probabilities
   –  a set of initial probabilities P{X_0 = i} for all i
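A minimal sketch of such a chain in code: a finite state set, stationary transition probabilities, and initial probabilities (all values are illustrative):

```python
import random

states = [0, 1, 2]
p_init = [1.0, 0.0, 0.0]          # P{X_0 = i}
P = [
    [0.5, 0.3, 0.2],              # stationary transition probabilities p_ij (rows sum to 1)
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
]

def simulate(n_steps):
    """Sample a path X_0, X_1, ..., X_{n_steps} of the chain."""
    x = random.choices(states, weights=p_init)[0]
    path = [x]
    for _ in range(n_steps):
        x = random.choices(states, weights=P[x])[0]
        path.append(x)
    return path
```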

Markov Chains

•  n-step transition probabilities can be obtained from 1-step transition probabilities recursively (Chapman-Kolmogorov)

•  We can get this via the transition matrix too: the n-step matrix is the n-th power of the one-step matrix (see the sketch after this slide)

•  First Passage Time: the number of transitions to go from i to j for the first time
   –  If i = j, this is the recurrence time
   –  In general, this is itself a random variable

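The Chapman-Kolmogorov recursion collapses into a matrix power. A sketch with a made-up 3-state matrix (not the inventory example that follows):

```python
import numpy as np

P = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
])

def n_step(P, n):
    """p_ij(n) = [P^n]_ij: probability of being in state j after n steps, starting from i."""
    return np.linalg.matrix_power(P, n)

print(n_step(P, 8))   # for an irreducible ergodic chain the rows approach the limiting distribution
```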

Markov Chains

•  There is an n-step recursive relationship for the first passage time probabilities f_ij(n)

•  For fixed i and j, these f_ij(n) are nonnegative numbers such that

      Σ_{n=1..∞} f_ij(n) ≤ 1

•  If Σ_{n=1..∞} f_ii(n) = 1, the state is recurrent; if so for n = 1 (i.e., f_ii(1) = 1), then state i is absorbing

What does a sum < 1 signify?

Markov Chains: Long-Run Properties

•  Consider this transition matrix of an inventory process [matrix not reproduced here]:
•  This captures the evolution of inventory levels in a store
   –  What do the 0 values mean?
   –  Other properties of this matrix?

Markov Chains: Long-Run Properties

The corresponding 8-step transition matrix becomes [matrix not reproduced here]:

Interesting property: the probability of being in state j after 8 weeks appears independent of the initial level of inventory.

•  For an irreducible ergodic Markov chain, one has limiting probabilities; the reciprocal gives you the recurrence time

Markov Decision Model

•  Consider the following application: machine maintenance
•  A factory has a machine that deteriorates rapidly in quality and output and is inspected periodically, e.g., daily
•  Inspection declares the machine to be in four possible states:
   –  0: Good as new
   –  1: Operable, minor deterioration
   –  2: Operable, major deterioration
   –  3: Inoperable
•  Let X_t denote this observed state
   –  it evolves according to some “law of motion”, so it is a stochastic process
   –  furthermore, assume it is a finite-state Markov chain

Markov Decision Model

•  The transition matrix is based on the following:
•  Once the machine goes inoperable, it stays there until repairs
   –  If no repairs, eventually it reaches this state, which is absorbing!
•  Repair is an action – a very simple maintenance policy.
   –  e.g., repair takes the machine from state 3 to state 0

Markov Decision Model

•  There are costs as the system evolves:
   –  State 0: cost 0
   –  State 1: cost 1000
   –  State 2: cost 3000
•  Replacement cost, taking state 3 to 0, is 4000 (and lost production of 2000), so cost = 6000
•  The modified transition probabilities are [matrix not reproduced here]:

Markov Decision Model

•  Simple question (a behavioural property): what is the average cost of this maintenance policy?
•  Compute the steady-state probabilities. How?
•  (Long run) expected average cost per day

Markov Decision Model

•  Consider a slightly more elaborate policy:
   –  Replace when inoperable, but if only needing major repairs, overhaul
•  The transition matrix now changes a little bit
•  Permit one more thing: overhaul
   –  Go back to the minor-repairs state (1) for the next time step
   –  Not possible if truly inoperable, but can go from major to minor
•  Key point about the system behaviour: it evolves according to
   –  “laws of motion”
   –  the sequence of decisions made (actions from {1: none, 2: overhaul, 3: replace})
•  The stochastic process is now defined in terms of {X_t} and {Δ_t}
   –  A policy, R, is a rule for making decisions
   –  It could use all history, although the popular choice is (current) state-based

Markov Decision Model

•  There is a space of potential policies, e.g. [table not reproduced here]
•  Each policy defines a transition matrix, e.g., for R_b [matrix not reproduced here]

   Which policy is best? Need costs….

Markov Decision Model

•  C_ik = expected cost incurred during the next transition if the system is in state i and decision k is made
•  The long-run average expected cost for each policy may be computed using the steady-state probabilities (see the sketch after this slide)

   Costs C_ik (in thousands):

   State | k = 1 | k = 2 | k = 3
     0   |   0   |   4   |   6
     1   |   1   |   4   |   6
     2   |   3   |   4   |   6
     3   |   ∞   |   ∞   |   6

   R_b is best: work out the details at home.
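The long-run expected average cost of a stationary policy R is Σ_i π_i C_{i,R(i)}, where π is the steady-state distribution of the chain the policy induces. A computational sketch; the transition matrix below is an illustrative stand-in (the actual matrices are not reproduced in this transcript), while the per-state costs are the figures quoted above:

```python
import numpy as np

def stationary_distribution(P):
    """Solve pi P = pi with sum(pi) = 1 for an irreducible ergodic chain (least squares)."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.append(np.zeros(n), 1.0)
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

# Illustrative chain induced by a "replace only when inoperable" style policy, plus per-state costs.
P_policy = np.array([
    [0.0, 0.875, 0.0625, 0.0625],
    [0.0, 0.75,  0.125,  0.125 ],
    [0.0, 0.0,   0.5,    0.5   ],
    [1.0, 0.0,   0.0,    0.0   ],
])
costs = np.array([0.0, 1000.0, 3000.0, 6000.0])

pi = stationary_distribution(P_policy)
print("steady-state probabilities:", pi)
print("long-run average cost per day:", float(pi @ costs))
```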
So, What is a Policy?

•  A “program”
•  A map from states (or situations in the decision problem) to actions that could be taken
   –  e.g., if in the ‘level 2’ state, call the contractor for an overhaul
   –  if fewer than 3 DVDs of a film remain, place an order for 2 more
•  A probability distribution π(x, a)
   –  a joint probability distribution over states and actions
   –  if in a state x_1, then with probability defined by π, take action a_1

Markov Decision Processes

•  ‘Static’ view:
•  Example:

Notation: State ⇔ s/x

MDPs as Bayesian Networks

A Decision Criterion

•  The general approach that computationally implements the previous calculations, with simultaneous equations over probabilities, is linear programming
•  Another approach to dealing with MDPs is via ‘learning’
   –  often treating the discounted, episodic setting
•  What is the criterion for adaptation (i.e., learning)?

      R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … = Σ_{k=0..∞} γ^k r_{t+k+1},

   where γ, 0 ≤ γ ≤ 1, is the discount rate.  Effect of changing γ?

Episodic vs. Infinite: A Unified Notation

•  In (discrete) episodic tasks, we could number the time steps of each episode starting from zero.
•  We usually do not have to distinguish between episodes, so we write s_t instead of s_{t,j} for the state at step t of episode j.
•  Think of each episode as ending in an absorbing state that always produces a reward of zero.
•  We can cover all cases by writing

      R_t = Σ_{k=0..∞} γ^k r_{t+k+1},

   where γ can be 1 only if a zero-reward absorbing state is always reached.

Acknowledgements

•  The Markov Chains and MDP formulation slides are adapted from chapters in F.S. Hillier & G.J. Lieberman, Operations Research, 2nd ed.
•  Initial slides on MAB and some later slides on the reinforcement learning formulation are adapted from web resources associated with Sutton and Barto’s book.