Reinforcement Learning
Bandit Problems, Markov Chains and Markov Decision Processes
Subramanian Ramamoorthy
School of Informatics
20 January 2017
What is Reinforcement Learning (RL)?
• An approach to Artificial Intelligence
• Learning from interaction
• Learning about, from, and while interacting (trial and error) with an external environment
• Goal-oriented learning, implying delayed rewards
• Learning what to do (how to map situations to actions) so as to maximize a numerical reward signal
• Can be thought of as a stochastic optimization over time
Setup for RL
Agent (algorithm) is:
• Temporally situated
• Continually learning and planning
• Objective is to affect the environment (actions and states)
• Environment is uncertain, stochastic
[Figure: the agent-environment interaction loop.]
Multi-arm Bandits (MAB)
• N possible actions
• You can play for some period of time and you want to maximize reward (expected utility)
• Which is the best arm/machine?
DEMO
Numerous Applications!
What is the Choice?
n-Armed Bandit Problem
• Choose repeatedly from one of n actions; each choice is called a play
• After each play a_t, you get a reward r_t, where E[r_t | a_t] = Q*(a_t); these are the unknown action values
• The distribution of r_t depends only on a_t
• The objective is to maximize the reward in the long term, e.g., over 1000 plays
To solve the n-armed bandit problem, you must explore a variety of actions and exploit the best of them.
Exploration/Exploitation Dilemma
• Suppose you form action-value estimates Q_t(a) ≈ Q*(a)
• The greedy action at time t is a_t* = argmax_a Q_t(a)
• a_t = a_t* ⇒ exploitation;  a_t ≠ a_t* ⇒ exploration
• You can't exploit all the time; you can't explore all the time (why?)
• You can never stop exploring, but you can reduce the amount of exploring.
Action-Value Methods
• Methods that adapt action-value estimates and nothing else. For example, suppose that by the t-th play action a has been chosen k_a times, producing rewards r_1, r_2, …, r_{k_a}; the "sample average" estimate is
  Q_t(a) = (r_1 + r_2 + … + r_{k_a}) / k_a
• In the limit, lim_{k_a→∞} Q_t(a) = Q*(a)
• What is the behaviour with finite samples?
ε-Greedy Action Selection
• Greedy action selection: a_t = a_t* = argmax_a Q_t(a)
• ε-Greedy: a_t = a_t* with probability 1 − ε, a random action with probability ε
… the simplest way to balance exploration and exploitation
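A minimal sketch of ε-greedy selection in code (not from the slides; the array name q_estimates and the NumPy-based sampling are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_estimates, epsilon):
    """Pick argmax_a Q_t(a) with prob. 1 - epsilon, a uniformly random action otherwise."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_estimates)))   # explore
    return int(np.argmax(q_estimates))               # exploit (ties -> lowest index)

# e.g. epsilon_greedy(np.array([0.2, 1.1, -0.4]), epsilon=0.1)
```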
Worked Example: 10-Armed Testbed
• n = 10 possible actions
• Each true value Q*(a) is chosen randomly from a normal distribution N(0, 1)
• Each reward r_t is also normal: N(Q*(a_t), 1)
• 1000 plays; repeat the whole experiment 2000 times and average the results
10-Armed Testbed Rewards
[Figure: reward distribution for each of the 10 actions, centred on its true value q*(1), …, q*(10); x-axis: Action, y-axis: Reward distribution.]
ε-Greedy Methods on the 10-Armed Testbed
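A rough sketch of how the comparison on this slide could be reproduced; the function name, the use of NumPy, and returning per-step average rewards are illustrative choices, not the lecture's actual code:

```python
import numpy as np

def run_testbed(epsilon, n_arms=10, n_plays=1000, n_runs=2000, seed=0):
    """Average reward per play for epsilon-greedy on the 10-armed testbed."""
    rng = np.random.default_rng(seed)
    avg_reward = np.zeros(n_plays)
    for _ in range(n_runs):
        q_true = rng.normal(0.0, 1.0, n_arms)        # q*(a) ~ N(0, 1)
        q_est = np.zeros(n_arms)                     # sample-average estimates
        counts = np.zeros(n_arms)
        for t in range(n_plays):
            if rng.random() < epsilon:
                a = rng.integers(n_arms)             # explore
            else:
                a = int(np.argmax(q_est))            # exploit
            r = rng.normal(q_true[a], 1.0)           # r_t ~ N(q*(a_t), 1)
            counts[a] += 1
            q_est[a] += (r - q_est[a]) / counts[a]   # incremental sample average
            avg_reward[t] += r
    return avg_reward / n_runs

# e.g. curves = {eps: run_testbed(eps) for eps in (0.0, 0.01, 0.1)}
```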
Incremental Implementation
Sample-average estimation method: the average of the first k rewards is (dropping the dependence on a):
  Q_k = (r_1 + r_2 + … + r_k) / k
How can we do this incrementally, without storing all the rewards? We could keep a running sum and count, or, equivalently:
  Q_{k+1} = Q_k + (1 / (k+1)) [ r_{k+1} − Q_k ]
This has the common form: NewEstimate = OldEstimate + StepSize [Target − OldEstimate]
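The incremental form as a one-liner in code; a minimal sketch with illustrative names:

```python
def update_sample_average(q_old, k, reward):
    """Q_{k+1} = Q_k + (1 / (k + 1)) * (r_{k+1} - Q_k), where k rewards were already averaged."""
    return q_old + (reward - q_old) / (k + 1)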
Tracking a Non-stationary Problem
Choosing Q_k to be a sample average is appropriate in a stationary problem, i.e., when none of the Q*(a) change over time, but not in a nonstationary problem.
The better option in the nonstationary case is a constant step size α, 0 < α ≤ 1:
  Q_{k+1} = Q_k + α [ r_{k+1} − Q_k ]
          = (1 − α)^{k+1} Q_0 + Σ_{i=1}^{k+1} α (1 − α)^{k+1−i} r_i
an exponential, recency-weighted average.
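Unrolling the constant-α recursion makes the "exponential, recency-weighted average" explicit; a short derivation consistent with the update above:

```latex
\begin{aligned}
Q_{k+1} &= Q_k + \alpha\,[\,r_{k+1} - Q_k\,] = \alpha\, r_{k+1} + (1-\alpha)\,Q_k \\
        &= \alpha\, r_{k+1} + (1-\alpha)\,\alpha\, r_k + (1-\alpha)^2\, Q_{k-1} \\
        &\;\;\vdots \\
        &= (1-\alpha)^{k+1}\, Q_0 \;+\; \sum_{i=1}^{k+1} \alpha\,(1-\alpha)^{\,k+1-i}\, r_i
\end{aligned}
```

The weight on reward r_i decays geometrically with how long ago it was received, and the weights (including the one on Q_0) sum to 1.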
Optimistic Initial Values
• All methods so far depend on Q_0(a), i.e., they are biased by their initial estimates
• To encourage exploration, initialize the action values optimistically, e.g., on the 10-armed testbed, use Q_0(a) = 5 for all a
Softmax Action Selection
• Softmax action-selection methods grade action probabilities by estimated values.
• The most common softmax uses a Gibbs, or Boltzmann, distribution: choose action a on play t with probability
  e^{Q_t(a)/τ} / Σ_{b=1}^{n} e^{Q_t(b)/τ},
where τ is a 'computational temperature'.
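A minimal sketch of Boltzmann selection in code (the max-subtraction for numerical stability and the variable names are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_action(q_estimates, tau):
    """Sample action a with probability exp(Q_t(a)/tau) / sum_b exp(Q_t(b)/tau)."""
    prefs = np.asarray(q_estimates, dtype=float) / tau
    prefs -= prefs.max()                        # subtract the max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(probs), p=probs))
```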
Another Interpretation of MAB Problems
[Figure; annotation: related to 'rewards'.]
MAB is a Special Case of Online Learning
How to Evaluate an Online Algorithm: Regret
• After you have played for T rounds, you experience a regret
  ρ = T μ* − Σ_{t=1}^{T} \hat{r}_t,   where μ* = max_k μ_k,
  i.e., [reward sum of the optimal strategy] − [sum of the actually collected rewards]; here μ_k is the expected reward of arm k, \hat{r}_t is the reward collected in round t, and the randomness lies in the draw of rewards and in the player's strategy
• If the average regret per round goes to zero with probability 1, asymptotically, we say the strategy has the no-regret property; such a strategy is guaranteed to converge to an optimal strategy
• ε-greedy is sub-optimal (so has some regret). Why?
Interval Estimation
• Attribute to each arm an "optimistic initial estimate" within a certain confidence interval
• Greedily choose the arm with the highest optimistic mean (the upper bound of its confidence interval)
• An infrequently observed arm will have an over-valued reward mean, leading to exploration
• Frequent usage pushes the optimistic estimate towards the true value
Interval Estimation Procedure
• Associate to each arm the upper band of a 100(1 − α)% confidence interval on its reward mean
• Assume, e.g., that rewards are normally distributed
• The arm is observed n times, yielding an empirical mean \hat{μ} and standard deviation \hat{σ}
• α-upper bound:
  u^α = \hat{μ} + \hat{σ} · c^{−1}(1 − α),   where c(t) = (1/√(2π)) ∫_{−∞}^{t} exp(−x²/2) dx
  is the cumulative distribution function of the standard normal
• If α is carefully controlled, this could be made a zero-regret strategy
  – In general (i.e., for other distributions), we don't know
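A rough sketch of interval-estimation selection under the normal-reward assumption. The SciPy call gives the inverse CDF c^{-1}; dividing by √n (i.e., using the standard error of the mean) is one common variant and is an assumption here, as are the argument names:

```python
import numpy as np
from scipy.stats import norm

def interval_estimation_action(emp_means, emp_stds, counts, alpha=0.05):
    """Pick the arm with the largest 100*(1 - alpha)% upper bound on its mean reward."""
    z = norm.ppf(1.0 - alpha)                               # c^{-1}(1 - alpha), standard normal
    upper = emp_means + z * emp_stds / np.sqrt(np.maximum(counts, 1))
    return int(np.argmax(upper))
```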
Reminder: Chernoff-Hoeffding Bound
Variant: UCB Strategy
• Again based on the notion of an upper confidence bound, but more generally applicable
• Algorithm:
  – Play each arm once
  – At time t > K, play the arm j maximizing
    \bar{r}_j(t) + sqrt( 2 ln t / T_{j,t} ),
  where T_{j,t} is the number of times arm j has been played so far.
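A minimal sketch of the selection rule above (the initial play-each-arm-once phase is assumed to have happened already; names are illustrative):

```python
import numpy as np

def ucb1_action(mean_rewards, play_counts, t):
    """Arm maximising r_bar_j(t) + sqrt(2 ln t / T_{j,t}); assumes every arm was played once."""
    bonus = np.sqrt(2.0 * np.log(t) / np.asarray(play_counts, dtype=float))
    return int(np.argmax(np.asarray(mean_rewards) + bonus))
```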
UCB Strategy
UCB Strategy – Behaviour
We will not prove the following result, but I quote the theorem to explain the benefit of UCB: its regret is bounded.
(K = number of arms)
Empirical Behaviour: UCB
[Figure: average reward vs. steps, comparing ε-greedy (ε = 0.1) with UCB (c = 2).]
Variation on SoftMax: Exp3
• It is possible to drive regret down by annealing τ in the softmax rule e^{Q_t(a)/τ} / Σ_{b=1}^{n} e^{Q_t(b)/τ}
• Exp3: the Exponential-weight algorithm for exploration and exploitation
• The probability of choosing arm k at time t is
  P_k(t) = (1 − γ) w_k(t) / Σ_{j=1}^{K} w_j(t) + γ/K
• Weights are updated for the arm j pulled at time t as
  w_j(t+1) = w_j(t) · exp( γ r_j(t) / (P_j(t) K) ),   and w_j(t+1) = w_j(t) otherwise,
  where γ is a user-defined open parameter
• Regret ≈ O( sqrt(T K log K) )
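A rough sketch of Exp3 as described above, assuming rewards scaled to [0, 1]; `pull_arm`, the fixed γ, and the seed are illustrative:

```python
import numpy as np

def exp3(pull_arm, n_arms, n_rounds, gamma, seed=0):
    """Run Exp3; pull_arm(k) must return a reward in [0, 1] for arm k."""
    rng = np.random.default_rng(seed)
    weights = np.ones(n_arms)
    for _ in range(n_rounds):
        probs = (1.0 - gamma) * weights / weights.sum() + gamma / n_arms
        k = int(rng.choice(n_arms, p=probs))
        reward = pull_arm(k)
        estimate = reward / probs[k]                   # importance-weighted reward estimate
        weights[k] *= np.exp(gamma * estimate / n_arms)
    return weights

# e.g. exp3(lambda k: float(np.random.rand() < [0.3, 0.7][k]), n_arms=2, n_rounds=1000, gamma=0.1)
```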
The Gittins Index
• Each arm delivers reward with a probability
• This probability may change through time, but only when the arm is pulled
• The goal is to maximize discounted rewards; the future is discounted by an exponential discount factor δ < 1
• The structure of the problem is such that all you need to do is compute an "index" for each arm and play the one with the highest index
• The index is of the form:
  ν_i = sup_{T>0} ( Σ_{t=0}^{T} δ^t R_i(t) ) / ( Σ_{t=0}^{T} δ^t )
Gittins Index – Intuition
• We will not give a proof of its optimality now, and will return to that issue later in the course.
• The analysis is based on stopping time: the point where you should 'terminate' a bandit arm
• Nice property: the Gittins index for any given bandit is independent of the expected outcome of all other bandits
  – Once you have a good arm, keep playing until there is a better one
  – If you add/remove machines, the computation doesn't really change
BUT:
  – It is hard to compute, even when you know the distributions
  – Exploration issues; arms aren't updated unless used (restless bandits?)
What About State Changes?
• In MAB, we were in a single casino and the only decision is to pull from a set of n arms
  – Some change, perhaps, if an adversary were introduced…
Next,
• What if there is more than one state?
• So, in this state space, what is the effect of the distribution of payout changing based on how you pull arms?
• What happens if you only obtain a net reward corresponding to a long sequence of arm pulls (at the end)?
Decision Making: The Agent-Environment Interface
Agent and environment interact at discrete time steps t = 0, 1, 2, …
• Agent observes state at step t: s_t ∈ S
• produces action at step t: a_t ∈ A(s_t)
• gets resulting reward: r_{t+1} ∈ ℝ
• and resulting next state: s_{t+1}
This gives rise to the trajectory
  … s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, …
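The interface above can be written as a simple interaction loop; a minimal sketch where `env` and `agent` are illustrative stand-ins, not a specific library API:

```python
def run_episode(env, agent, max_steps=1000):
    """One episode of the s_t, a_t, r_{t+1}, s_{t+1}, ... interaction loop."""
    state = env.reset()                                 # s_0
    for _ in range(max_steps):
        action = agent.act(state)                       # a_t
        next_state, reward, done = env.step(action)     # s_{t+1}, r_{t+1}, episode end?
        agent.observe(state, action, reward, next_state)
        state = next_state
        if done:
            break
```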
Markov Decision Processes
• A model of the agent-environment system
• Markov property: the history doesn't matter, only the current state
• If the state and action sets are finite, it is a finite MDP.
• To define a finite MDP, you need to give:
  – state and action sets
  – one-step "dynamics" defined by transition probabilities:
    P^a_{ss'} = Pr{ s_{t+1} = s' | s_t = s, a_t = a }  for all s, s' ∈ S, a ∈ A(s)
  – expected rewards:
    R^a_{ss'} = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s' }  for all s, s' ∈ S, a ∈ A(s)
Recycling Robot: An Example Finite MDP
• At each step, the robot has to decide whether it should (1) actively search for a can, (2) wait for someone to bring it a can, or (3) go to home base and recharge.
• Searching is better but runs down the battery; if it runs out of power while searching it has to be rescued (which is bad).
• Decisions are made on the basis of the current energy level: high, low.
• Reward = number of cans collected
Recycling Robot MDP
[Figure: transition graph for the recycling robot.]
Rewards while searching/waiting: r_search > r_wait
Enumerated In Tabular Form
If you were given this much, what can you say about the behaviour (over time) of the system?
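One way the tabular one-step dynamics could be written down in code, as a sketch only: the structure follows the description above, but α, β and the reward values are illustrative placeholders, not the lecture's numbers.

```python
# P[state][action] -> list of (next_state, probability, reward)
alpha, beta = 0.9, 0.6              # placeholder: P(battery stays high / stays low while searching)
r_search, r_wait, r_rescue = 2.0, 1.0, -3.0   # placeholder rewards, with r_search > r_wait

P = {
    "high": {
        "search":   [("high", alpha, r_search), ("low", 1 - alpha, r_search)],
        "wait":     [("high", 1.0, r_wait)],
    },
    "low": {
        "search":   [("low", beta, r_search), ("high", 1 - beta, r_rescue)],  # rescued if depleted
        "wait":     [("low", 1.0, r_wait)],
        "recharge": [("high", 1.0, 0.0)],
    },
}
```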
Very Brief Primer on Markov Chains
Stochastic Processes
• A stochastic process is an indexed collection of random variables {X_t}
  – e.g., the collection of weekly demands for a product
• One type: at a particular time t, labelled by integers, the system is found in exactly one of a finite number of mutually exclusive and exhaustive categories or states, also labelled by integers
• The process could be embedded, in that time points correspond to the occurrence of specific events (or time may be equi-spaced)
• The random variables may depend on one another, e.g., X_{t+1} may depend on earlier values such as X_t
Markov Chains
• The stochastic process is said to have the Markovian property if
  P{X_{t+1} = j | X_0 = k_0, X_1 = k_1, …, X_{t−1} = k_{t−1}, X_t = i} = P{X_{t+1} = j | X_t = i}
• The Markovian property means that the conditional probability of a future event, given any past events and the current state, is independent of the past states and depends only on the present
• These conditional probabilities are the transition probabilities
• They are stationary if time invariant, denoted p_ij:
  p_ij = P{X_{t+1} = j | X_t = i} for all t
Markov Chains
• Looking forward in time, the n-step transition probabilities are
  p_ij^(n) = P{X_{t+n} = j | X_t = i}
• One can collect these into a transition matrix P = [p_ij]
• A stochastic process is a finite-state Markov chain if it has:
  – a finite number of states
  – the Markovian property
  – stationary transition probabilities
  – a set of initial probabilities P{X_0 = i} for all i
Markov Chains
• n-step transition probabilities can be obtained from 1-step transition probabilities recursively (Chapman-Kolmogorov):
  p_ij^(n) = Σ_k p_ik^(m) p_kj^(n−m)
• We can get this via the matrix too: the n-step transition matrix is the n-th power of the one-step matrix
• First passage time: the number of transitions to go from i to j for the first time
  – If i = j, this is the recurrence time
  – In general, this is itself a random variable
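The "via the matrix" remark in code: a minimal sketch computing n-step transition probabilities as the n-th matrix power (the 2-state matrix is illustrative):

```python
import numpy as np

P = np.array([[0.7, 0.3],
              [0.4, 0.6]])               # one-step transition matrix (rows sum to 1)

P8 = np.linalg.matrix_power(P, 8)        # 8-step transition probabilities p_ij^(8)
```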
Markov Chains
• There is an n-step recursive relationship for the first passage time
• For fixed i and j, the first-passage probabilities f_ij^(n) are nonnegative numbers such that
  Σ_{n=1}^{∞} f_ij^(n) ≤ 1
• If Σ_{n=1}^{∞} f_ii^(n) = 1, state i is recurrent; if this already holds for n = 1 (i.e., p_ii = 1), then state i is absorbing
• What does a sum strictly less than 1 signify?

Markov Chains: Long-Run Properties
• Consider the transition matrix of an inventory process [matrix not reproduced]
• This captures the evolution of inventory levels in a store
  – What do the 0 values mean?
  – What other properties does this matrix have?

Markov Chains: Long-Run Properties
• The corresponding 8-step transition matrix [not reproduced] has an interesting property: the probability of being in state j after 8 weeks appears independent of the initial level of inventory.
• For an irreducible ergodic Markov chain, one has a limiting probability for each state; its reciprocal gives the recurrence time

Markov Decision Model
• Consider the following application: machine maintenance
• A factory has a machine that deteriorates rapidly in quality and output and is inspected periodically, e.g., daily
• Inspection declares the machine to be in one of four possible states:
  – 0: good as new
  – 1: operable, minor deterioration
  – 2: operable, major deterioration
  – 3: inoperable
• Let Xt denote this observed state
  – It evolves according to some "law of motion", so it is a stochastic process
  – Furthermore, assume it is a finite-state Markov chain

Markov Decision Model
• The transition matrix is based on the description above [matrix not reproduced]
• Once the machine goes inoperable, it stays there until repaired
  – With no repairs it eventually reaches this state, which is absorbing!
• Repair is an action, giving a very simple maintenance policy
  – e.g., move the machine from state 3 back to state 0

Markov Decision Model
• There are costs as the system evolves:
  – state 0: cost 0
  – state 1: cost 1000
  – state 2: cost 3000
• The replacement cost, taking state 3 to 0, is 4000 (plus lost production of 2000), so cost = 6000
• The modified transition probabilities are: [matrix not reproduced]

Markov Decision Model
• Simple question (a behavioural property): what is the average cost of this maintenance policy?
• Compute the steady-state probabilities (How?)
• Then take the (long-run) expected average cost per day

Markov Decision Model
• Consider a slightly more elaborate policy:
  – Replace when inoperable, but if the machine only needs major repairs, overhaul it
• The transition matrix now changes a little bit
• Permit one more thing: an overhaul
  – Go back to the minor-deterioration state (1) for the next time step
  – Not possible if truly inoperable, but we can go from major to minor
• Key point about the system behaviour: it evolves according to
  – the "laws of motion"
  – the sequence of decisions made (actions from {1: none, 2: overhaul, 3: replace})
• The stochastic process is now defined in terms of {Xt} and {Δt}
  – A policy, R, is a rule for making decisions
  – It could use all history, although the popular choice is (current) state-based

Markov Decision Model
• There is a space of potential policies [table not reproduced]
• Each policy defines a transition matrix, e.g., for Rb
• Which policy is best? We need costs…

Markov Decision Model
• Cik = expected cost incurred during the next transition if the system is in state i and decision k is made (costs in thousands):

  State | Decision 1 (none) | Decision 2 (overhaul) | Decision 3 (replace)
  0     | 0                 | 4                     | 6
  1     | 1                 | 4                     | 6
  2     | 3                 | 4                     | 6
  3     | ∞                 | ∞                     | 6

• The long-run average expected cost of each policy may be computed using the steady-state probabilities, E(C) = Σ_i π_i C_{ik}, where k is the decision the policy prescribes in state i
• Rb is best: work out the details at home.
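A sketch of how the steady-state probabilities and the long-run average cost of a fixed policy could be computed numerically; the transition matrix and cost vector below are illustrative placeholders, not the lecture's numbers:

```python
import numpy as np

def stationary_distribution(P):
    """Solve pi P = pi, sum(pi) = 1 for an irreducible chain."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])   # stationarity rows + normalisation row
    b = np.concatenate([np.zeros(n), [1.0]])
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

P_policy = np.array([[0.0, 0.875, 0.0625, 0.0625],    # placeholder dynamics under the policy
                     [0.0, 0.75,  0.125,  0.125 ],
                     [0.0, 0.0,   0.5,    0.5   ],
                     [1.0, 0.0,   0.0,    0.0   ]])    # e.g. replace when inoperable
costs = np.array([0.0, 1.0, 3.0, 6.0])                 # per-state one-step cost (thousands)

pi = stationary_distribution(P_policy)
avg_cost = pi @ costs                                  # long-run expected average cost per day
```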
So, What is a Policy?
• A "program"
• A map from states (or situations in the decision problem) to actions that could be taken
  – e.g., if in the 'level 2' state, call the contractor for an overhaul
  – if fewer than 3 DVDs of a film are in stock, place an order for 2 more
• A probability distribution π(x, a)
  – a joint probability distribution over states and actions
  – if in a state x1, then with probability defined by π, take action a1

Markov Decision Processes
• 'Static' view: [figure not reproduced]
• Example: [figure not reproduced]
• Notation: state ↔ s / x

MDPs as Bayesian Networks
[Figure not reproduced: an MDP drawn as a Bayesian network.]

A Decision Criterion
• The general approach that computationally implements the previous calculations, with simultaneous equations over probabilities, is linear programming
• Another approach to dealing with MDPs is via 'learning'
  – often treating the discounted, episodic setting
• What is the criterion for adaptation (i.e., learning)? The discounted return
  R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … = Σ_{k=0}^{∞} γ^k r_{t+k+1},
  where γ, 0 ≤ γ ≤ 1, is the discount rate. What is the effect of changing γ?

Episodic vs. Infinite: A Unified Notation
• In (discrete) episodic tasks, we could number the time steps of each episode starting from zero.
• We usually do not have to distinguish between episodes, so we write s_t instead of s_{t,j} for the state at step t of episode j.
• Think of each episode as ending in an absorbing state that always produces a reward of zero.
• We can then cover all cases by writing
  R_t = Σ_{k=0}^{∞} γ^k r_{t+k+1},
  where γ can be 1 only if a zero-reward absorbing state is always reached.

Acknowledgements
• The Markov Chains and MDP formulation slides are adapted from chapters in F.S. Hillier & G.J. Lieberman, Operations Research, 2nd ed.
• The initial slides on MAB and some later slides on the reinforcement learning formulation are adapted from web resources associated with Sutton and Barto's book.