Reinforcement Learning
Dynamic Programming;
Monte Carlo Methods
Subramanian Ramamoorthy
School of Informatics
27 January 2017
Recap: Key Quantities Defining an MDP
• System dynamics are stochastic – represented by a probability distribution.
• The problem is defined as maximization of expected rewards
  – Recall that E(X) = Σ_i x_i p(x_i) for finite-state systems
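As a small illustration of this expectation (the numbers below are made up, not from the lecture):

```python
# Expected value of a discrete random variable: E(X) = sum_i x_i * p(x_i)
values = [0.0, 1.0, 5.0]          # hypothetical reward values x_i
probs  = [0.5, 0.3, 0.2]          # their probabilities p(x_i), summing to 1

expected_reward = sum(x * p for x, p in zip(values, probs))
print(expected_reward)            # 0.0*0.5 + 1.0*0.3 + 5.0*0.2 = 1.3
```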
Recap: Decision Criterion
What is the criterion for optimization (i.e., learning)?
R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + … = Σ_{k=0}^∞ γ^k r_{t+k+1},
where γ, 0 ≤ γ ≤ 1, is the discount rate.
Effect of changing γ?
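A quick sketch of the effect of γ, assuming a hypothetical constant reward stream (illustrative only):

```python
# Discounted return R_t = sum_{k} gamma^k * r_{t+k+1},
# truncated here to a finite horizon for illustration.
def discounted_return(rewards, gamma):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0] * 100                      # hypothetical: reward 1 at every step
for gamma in (0.0, 0.5, 0.9, 0.99):
    print(gamma, discounted_return(rewards, gamma))
# Small gamma -> myopic (only near-term rewards matter);
# gamma near 1 -> far-sighted (the infinite-horizon limit would be 1/(1-gamma)).
```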
Notation for Episodic vs. Infinite Tasks
• In (discrete) episodic tasks, we could number the time steps of each episode starting from zero.
• We usually do not have to distinguish between episodes, so we write s_t instead of s_{t,j} for the state at step t of episode j.
• Think of each episode as ending in an absorbing state that always produces a reward of zero.
• We can cover all cases by writing
R_t = Σ_{k=0}^∞ γ^k r_{t+k+1},
where γ can be 1 only if a zero-reward absorbing state is always reached.
Three Aspects of the RL Problem
• Planning
The MDP is known (states, actions, transitions, rewards).
Find an optimal policy, π*!
• Learning
The MDP is unknown. You are allowed to interact with it.
Find an optimal policy π*!
• Optimal learning
While interacting with the MDP, minimize the loss due to not using an optimal policy from the beginning.
Solving MDPs – Many Dimensions
• Which problem? (Planning, learning, optimal learning)
• Exact or approximate?
• Uses samples?
• Incremental?
• Uses value functions?
  – Yes: Value-function based methods
    • Planning: DP, Random Discretization Method, FVI, …
    • Learning: Q-learning, Actor-critic, …
  – No: Policy search methods
    • Planning: Monte-Carlo tree search, likelihood ratio methods (policy gradient), sample-path optimization (Pegasus), …
• Representation
  – Structured state:
    • Factored states, logical representation, …
  – Structured policy space:
    • Hierarchical methods
Value Functions
• Value functions are used to determine how good it is for the agent to be in a given state
  – Or, how good is it to perform an action from a given state?
• This is defined w.r.t. a specific policy, i.e., a distribution π(s,a)
• State value function:
  V^π(s) = E_π[ R_t | s_t = s ] = E_π[ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s ]
• Action (or state–action) value function:
  Q^π(s,a) = E_π[ R_t | s_t = s, a_t = a ] = E_π[ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s, a_t = a ]
Value Functions
Note that there are multiple sources of (probabilistic) uncertainty:
• In state s, one is allowed to select different actions a
• The system may transition to different states s′ from s
• Depending on the above, the return (defined in terms of reward) is a random variable – which we seek to maximize in expectation
Recursive Form of V – Bellman Equation
Expand one step forward and rewrite the expectation:
V^π(s) = Σ_a π(s,a) Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V^π(s′) ]
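A minimal sketch of this backup in code, assuming hypothetical tabular dictionaries P[s][a][s′] and R[s][a][s′] for the transition probabilities and expected rewards (not part of the lecture):

```python
# One-step Bellman expectation backup for V^pi at a single state s:
#   V(s) <- sum_a pi(s,a) * sum_s' P[s][a][s'] * (R[s][a][s'] + gamma * V[s'])
def bellman_backup(s, V, policy, P, R, gamma):
    """policy[s][a]: probability of a in s; P[s][a][s2]: transition prob;
    R[s][a][s2]: expected reward; V: dict mapping state -> current estimate."""
    return sum(
        policy[s][a] * sum(
            P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
            for s2 in P[s][a]
        )
        for a in policy[s]
    )
```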
Backup Diagrams
• If you go all the way ‘down’, you can just read off the reward value
• The backup process (i.e., the recursive equation above) allows you to compute the corresponding value at the current state
  – taking transition probabilities into account
Bellman Equation for Q (State-Action Value Function)
Q^π(s,a) = Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ Σ_{a′} π(s′,a′) Q^π(s′,a′) ]
Optimal Value Function
• For finite MDPs, policies can be partially ordered: π ≥ π′ iff V^π(s) ≥ V^{π′}(s) for all s
• Let us denote the optimal policy π*
• The corresponding optimal value functions are:
  V*(s) = max_π V^π(s),   Q*(s,a) = max_π Q^π(s,a)
• From this,
  Q*(s,a) = E[ r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a ]
• Will there always be a well-defined π*?
Theorem [Blackwell, 1962] – For every MDP with finite state/action space, there exists an optimal deterministic stationary plan.
Recursive Form of V*
V*(s) = max_a Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V*(s′) ]
Backup for Q*
Q*(s,a) = Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ max_{a′} Q*(s′,a′) ]
What is Dynamic Programming?
Given a known model of the environment as an MDP (transition dynamics and reward probabilities),
DP is a collection of algorithms for computing optimal policies (via optimal value functions)
Policy Evaluation
How to compute V(s) for an arbitrary policy π? (Prediction problem)
For a given MDP, this yields a system of simultaneous equations
  – as many unknowns as states
  – solve using linear-algebraic computation, or
Solve iteratively, with a sequence of value functions V_0, V_1, V_2, …:
V_{k+1}(s) = Σ_a π(s,a) Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V_k(s′) ]
Computationally…
We could achieve this in a number of different ways:
• Maintain two arrays, computing iterations over one and copying the results to the other
• In-place: overwrite as new backed-up values become available
• It can be shown that this algorithm will also converge to optimality (somewhat faster, even)
  – Backups sweep through the state space
  – Sweep order has a significant influence on convergence rates
Iterative Policy Evaluation
Grid-World Example
Iterative Policy Evaluation in Grid World
Note: The value function can be searched
greedily to find long-term optimal actions
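A minimal sketch of iterative policy evaluation on a grid world, assuming the standard 4×4 layout (two terminal corner cells, reward −1 per move, off-grid moves leave the state unchanged, equiprobable random policy, γ = 1); the lecture's own grid may differ:

```python
import numpy as np

# In-place iterative policy evaluation on an assumed 4x4 gridworld.
N = 4
TERMINAL = {(0, 0), (N - 1, N - 1)}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]     # up, down, left, right

def step(state, action):
    r, c = state
    nr, nc = r + action[0], c + action[1]
    if not (0 <= nr < N and 0 <= nc < N):        # bumping the wall: stay put
        nr, nc = r, c
    return (nr, nc), -1.0                        # reward -1 on every transition

def policy_evaluation(gamma=1.0, theta=1e-6):
    V = np.zeros((N, N))
    while True:
        delta = 0.0
        for r in range(N):
            for c in range(N):
                if (r, c) in TERMINAL:
                    continue
                v_old = V[r, c]
                # Equiprobable random policy: average over the four actions
                V[r, c] = sum(
                    0.25 * (reward + gamma * V[s2])
                    for s2, reward in (step((r, c), a) for a in ACTIONS)
                )
                delta = max(delta, abs(v_old - V[r, c]))
        if delta < theta:
            return V

print(np.round(policy_evaluation(), 1))
```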
Policy Improvement
Does it make sense to deviate from π(s) at any state (following the
policy everywhere else)?
– Policy Improvement Theorem [Howard/Blackwell]: if Q^π(s, π′(s)) ≥ V^π(s) for all s, then V^{π′}(s) ≥ V^π(s) for all s
Key Idea Behind Policy Improvement
Computing Better Policies
Starting with an arbitrary policy, we'd like to approach truly optimal policies. So, we compute new (greedy) policies using the following:
π′(s) = argmax_a Q^π(s,a) = argmax_a Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V^π(s′) ]
Are we restricted to deterministic policies? No.
With stochastic policies, the same idea applies: any way of distributing probability among the maximizing actions also yields an improved policy.
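A sketch of the greedy improvement step, reusing the hypothetical tabular P, R and the current estimate V from the earlier backup sketch:

```python
# Greedy policy improvement: pick, in each state, the action maximizing the
# one-step lookahead value under the current estimate V.
def greedy_policy(V, P, R, gamma):
    policy = {}
    for s in P:
        q = {
            a: sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2]) for s2 in P[s][a])
            for a in P[s]
        }
        policy[s] = max(q, key=q.get)   # deterministic greedy choice
    return policy
```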
Policy Iteration
We can combine policy evaluation and improvement to obtain a sequence of monotonically improving policies and value functions
  – Each policy is guaranteed to be a strict improvement over the previous one (unless it is already optimal) [Policy Improvement Theorem]
  – As a finite MDP admits only finitely many policies, this eventually converges to an optimal policy
Policy Iteration Algorithm
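The algorithm box from the slide is not reproduced here; below is a minimal tabular sketch, again assuming hypothetical P[s][a][s′] and R[s][a][s′] dictionaries:

```python
# Policy iteration: alternate full policy evaluation and greedy improvement
# until the policy is stable (tabular sketch).
def policy_iteration(P, R, gamma=0.9, theta=1e-8):
    states = list(P.keys())
    V = {s: 0.0 for s in states}
    policy = {s: next(iter(P[s])) for s in states}     # arbitrary initial policy

    while True:
        # 1. Policy evaluation (iterative, to tolerance theta)
        while True:
            delta = 0.0
            for s in states:
                a = policy[s]
                v_new = sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                            for s2 in P[s][a])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break

        # 2. Policy improvement (greedy one-step lookahead)
        stable = True
        for s in states:
            q = {a: sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                        for s2 in P[s][a])
                 for a in P[s]}
            best = max(q, key=q.get)
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:
            return policy, V
```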
Example: Jack’s Car Rental
• £10 for each car rented (must be available when request received)
• Two locations, maximum of 20 cars at each
• Cars returned and requested randomly
  – Poisson distribution: n returns/requests with probability (λ^n / n!) e^{−λ}
  – Location 1: average requests = 3, average returns = 3
  – Location 2: average requests = 4, average returns = 2
• Can move up to 5 cars between locations overnight (costs £2 each)
Problem setup:
• States, actions, rewards?
• Transition probabilities?
Solution: Jack’s Car Rental
(Figure: the sequence of policies found by policy iteration – numbers indicate the action, i.e., the number of cars to move – together with the resulting value function.)
Points to Ponder: Jack’s Car Rental
• Suppose the first car moved is free but all other transfers cost £2
  – From Location 1 to Location 2 only (not the other direction!)
  – Because an employee would be going in that direction anyway, by bus
• Suppose only 10 cars can be parked for free at each location
  – More than 10 incurs a fixed cost of £4 for using an extra parking lot
… typical examples of ‘real-world nonlinearities’
Value Iteration
Each step in Policy Iteration needs Policy Evaluation (up to convergence) – can we avoid this computational overhead?
… use the Bellman optimality equation as an update rule:
V_{k+1}(s) = max_a Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V_k(s′) ]
Value Iteration Algorithm
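Again the slide's algorithm box is omitted; a minimal tabular sketch under the same hypothetical P, R representation:

```python
# Value iteration: apply the Bellman optimality backup as an update rule,
# then read off the greedy policy.
def value_iteration(P, R, gamma=0.9, theta=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = max(
                sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2]) for s2 in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Greedy policy with respect to the converged value function
    policy = {
        s: max(P[s], key=lambda a: sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                                       for s2 in P[s][a]))
        for s in P
    }
    return policy, V
```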
Example: Gambler’s Problem
• The gambler can repeatedly bet on a coin flip
• Heads: he wins his stake; Tails: he loses his stake
• Initial capital ∈ {$1, $2, …, $99}
• The gambler has won if he reaches $100 and has lost if he goes bankrupt ($0)
• Unfair coin: p(H) = 0.4, p(T) = 0.6
Problem formulation:
• States, actions, rewards?
• State transitions?
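One possible formulation, sketched as value iteration (states = capital 1–99, actions = stakes, reward +1 only on reaching $100, undiscounted); details such as tie-breaking are a modelling choice, not from the lecture:

```python
import numpy as np

# Value iteration for the gambler's problem (sketch).
GOAL, P_HEADS = 100, 0.4

def gambler_value_iteration(theta=1e-10):
    V = np.zeros(GOAL + 1)
    V[GOAL] = 1.0                       # reaching the goal is worth +1
    while True:
        delta = 0.0
        for s in range(1, GOAL):
            # Actions: stake 1 .. min(s, GOAL - s); heads wins, tails loses
            action_values = [
                P_HEADS * V[s + stake] + (1 - P_HEADS) * V[s - stake]
                for stake in range(1, min(s, GOAL - s) + 1)
            ]
            v_new = max(action_values)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Greedy policy: smallest maximizing stake at each capital level
    policy = np.zeros(GOAL + 1, dtype=int)
    for s in range(1, GOAL):
        stakes = range(1, min(s, GOAL - s) + 1)
        q = [P_HEADS * V[s + k] + (1 - P_HEADS) * V[s - k] for k in stakes]
        policy[s] = list(stakes)[int(np.argmax(np.round(q, 8)))]
    return V, policy
```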
Solution to Gambler’s Problem
Based on successive sweeps of value iteration.
Why does it look so strange?
Generalized Policy Iteration
Caricature of the process:
Builds on the notion of interleaving evaluation and improvement – but allows the granularity to be flexible
Monte Carlo Methods
• Learn value functions
• Discover optimal policies
• Do not assume knowledge of a model as in DP, i.e., no P^a_{ss′}, R^a_{ss′}
• Learn from experience: sample sequences of states, actions and rewards (s, a, r)
  – In simulated or real (e.g., physical robotic) worlds
  – Clearly, a simulator is a model, but not a full one in the sense of a probability distribution
• Eventually attain optimal behaviour (same as with DP)
Learning in MDPs
• You are learning from a long stream of experience:
  s_0 a_0 r_0 s_1 a_1 r_1 … s_k a_k r_k …
  up to some terminal state
• Direct methods: approximate the value function (V/Q) straight away – without computing P^a_{ss′}, R^a_{ss′}
Pictorial: What does DP Do?
Pictorial: What does Simple MC Do?
Monte Carlo Policy Evaluation
• Goal: approximate a value function
• Given: some number of episodes under π which contain s
• Maintain the average of returns after visits to s
• First-visit vs. every-visit MC:
  – Consider the reward process after visiting s, and define the first-visit time as well as the set of all visit times within an episode
  – First-visit MC averages the returns following only the first visit to s, whereas every-visit MC averages the returns following every visit to s
What is the effect of π?
What if it is deterministic?
First-visit Monte Carlo Policy Evaluation
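The algorithm box is not reproduced here; a minimal sketch, assuming a hypothetical generate_episode(policy) helper that returns one episode as a list of (state, action, reward) triples:

```python
from collections import defaultdict

# First-visit Monte Carlo policy evaluation (sketch).
def first_visit_mc(policy, generate_episode, num_episodes, gamma=1.0):
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)

    for _ in range(num_episodes):
        episode = generate_episode(policy)
        # Compute the return G_t backwards from the end of the episode
        G = 0.0
        returns = [None] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            G = episode[t][2] + gamma * G
            returns[t] = G
        # Average only the return following the FIRST visit to each state
        seen = set()
        for t, (s, _, _) in enumerate(episode):
            if s in seen:
                continue
            seen.add(s)
            returns_sum[s] += returns[t]
            returns_count[s] += 1
            V[s] = returns_sum[s] / returns_count[s]
    return V
```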
Example: Blackjack
• Goal: achieve a card sum greater than the dealer’s without exceeding 21
• Player’s options: Hit (take another card) or Stick (pass)
  – If the player crosses 21, it is a loss
• The dealer follows a simple rule: stick if the sum is ≥ 17, else hit
• Result:
  Closest to 21 wins
  Equally close is a draw
Example: Blackjack
• Goal: achieve a card sum greater than the dealer’s without exceeding 21
• State space (200 states):
  – Current sum (12 – 21)
  – Dealer’s showing card (ace – 10)
  – Do I have a usable ace (one that can count as 11 without overshooting)?
• Reward: +1 for a win, 0 for a draw, −1 for a loss
• Action space: stick (no more cards), hit (receive another card)
• Policy: stick if the sum is 20 or 21, else hit
Note: this is an (arbitrary) policy π with which the algorithm works
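A small sketch of how the state and this fixed policy might be encoded (names are hypothetical):

```python
# State encoded as (player_sum, dealer_showing, usable_ace);
# the fixed policy being evaluated: stick on 20 or 21, otherwise hit.
def stick_on_20_policy(state):
    player_sum, dealer_showing, usable_ace = state
    return 'stick' if player_sum >= 20 else 'hit'

print(stick_on_20_policy((21, 10, False)))   # 'stick'
print(stick_on_20_policy((15, 4, True)))     # 'hit'
```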
Solution (V^π): Blackjack
Why is this
more choppy?
Remarks on Blackjack Example
• Why does the value function jump up for the last two rows in the rear?
  – When the sum is 20 or 21, the policy is to stick; this is a good choice in this region of state space
• Why does it drop off for the whole last row on the left?
  – The dealer is showing an ace, which gives him extra flexibility (two chances to get close to 21)
• Why are the foremost values higher in the upper plots than in the lower plots?
  – The player has a usable ace (more flexibility)
Backup in MC
• Does the concept of a backup diagram make sense for MC methods?
• As in the figure, MC does not sample all transitions
  – The root node is to be updated as before
  – Transitions are dictated by the policy
  – Samples are acquired along a sample path
  – There is a clear path from the eventual reward to the states along the way (credit assignment is easier)
• Estimates for different states are independent
  – Computational complexity is not a function of state dimensionality
Monte Carlo Estimation of Action Values
• The model is not available, so we do not know how states and actions interact
  – We want Q*
• We can try to approximate Q^π(s,a) using the Monte Carlo method
  – Asymptotic convergence if every state–action pair is visited
• Explore many different starting state–action pairs: equal chance of starting from any given state
  – Not entirely practical, but simple to understand
Monte Carlo Control
• Policy evaluation: Monte Carlo method
• Policy improvement: greedify with respect to the state-value or action-value function
Convergence of MC Control
• Policy improvement still works if evaluation is done with MC:
  Q^{π_k}(s, π_{k+1}(s)) = max_a Q^{π_k}(s,a) ≥ Q^{π_k}(s, π_k(s)) = V^{π_k}(s)
• π_{k+1} ≥ π_k by the policy improvement theorem
• Assumption: exploring starts and an infinite number of episodes for MC policy evaluation (i.e., the value function has stabilized)
• Things to do (as in DP):
  – update only to a given tolerance
  – interleave evaluation/improvement
Monte Carlo Exploring Starts
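The algorithm box is not reproduced here; a minimal sketch of Monte Carlo control with exploring starts, assuming a hypothetical generate_episode_es(policy_fn) helper that starts each episode from a randomly chosen state–action pair and then follows policy_fn:

```python
import random
from collections import defaultdict

# Monte Carlo control with exploring starts (first-visit, sketch).
def mc_exploring_starts(actions, generate_episode_es, num_episodes, gamma=1.0):
    Q = defaultdict(float)                 # (state, action) -> value estimate
    counts = defaultdict(int)
    policy = {}                            # state -> current greedy action

    for _ in range(num_episodes):
        def policy_fn(s):
            # Greedy action if known, otherwise a random action
            return policy.get(s, random.choice(actions))

        episode = generate_episode_es(policy_fn)   # list of (state, action, reward)
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = r + gamma * G
            # First-visit check: skip if (s, a) also occurs earlier in the episode
            if any((s2, a2) == (s, a) for s2, a2, _ in episode[:t]):
                continue
            counts[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]        # incremental mean
            policy[s] = max(actions, key=lambda a2: Q[(s, a2)])  # greedify at s
    return policy, Q
```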
Blackjack Example – Optimal Policy