Reinforcement Learning
On- and Off-Policy Learning
Subramanian Ramamoorthy
School of Informatics
3 February, 2017
Can We Avoid Thorny Assumptions?
• Two major MC assumptions (infinite sampling and exploring
all states) are unrealistic. How to circumvent the issue?
• Need to continually explore, using ε-soft policies:
– On-policy method: Explore in an ε-greedy manner
– Off-policy method: Use a behaviour policy that is good at
exploring, then infer optimal policy from that
03/02/2017 2 Reinforcement Learning
On-Policy Monte Carlo Control
• Overall idea is still that of Generalized Policy Iteration (move
towards greedy policy), but throw in continual exploration
• In order to always explore, we want to keep policy ε-soft:
• Moreover, one may really wish to adopt an ε-greedy policy:
• In this case, we have
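As a minimal sketch of the two notions above, using the standard textbook definitions: a policy is ε-soft if every action in every state has probability at least ε/|A(s)|, and an ε-greedy policy gives each non-greedy action exactly ε/|A(s)| and the greedy action the remaining 1 − ε + ε/|A(s)|. The function names and the example Q-values below are illustrative assumptions, not taken from the slides.

```python
import random

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy for a single state.

    q_values: list of action-value estimates Q(s, a), one entry per action.
    """
    n = len(q_values)
    greedy = max(range(n), key=lambda a: q_values[a])  # ties go to the lowest index
    probs = [epsilon / n] * n        # every action gets at least epsilon/|A(s)|
    probs[greedy] += 1.0 - epsilon   # the greedy action gets the remaining mass
    return probs

def sample_action(probs, rng=random):
    """Draw an action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical Q-values for a state with four actions.
probs = epsilon_greedy_probs([0.2, 1.5, -0.3, 0.0], epsilon=0.1)
action = sample_action(probs)
```

Because every entry of `probs` is at least ε/|A(s)|, an ε-greedy policy is itself ε-soft, so it never stops exploring.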
On-Policy MC Control
Evaluate as before
Improve towards
ε-greedy, not the max
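The evaluate/improve loop can be sketched as first-visit Monte Carlo control with an ε-greedy policy. This is a sketch under stated assumptions, not the algorithm box from the slide: the `step` callback interface, the episode length cap `max_steps`, and the tiny corridor environment are all hypothetical, chosen only to keep the example self-contained.

```python
import random
from collections import defaultdict

def mc_control_on_policy(step, start_state, n_actions, episodes=3000,
                         epsilon=0.1, gamma=1.0, max_steps=100, seed=0):
    """First-visit MC control; the policy is always epsilon-greedy w.r.t. Q."""
    rng = random.Random(seed)
    Q = defaultdict(lambda: [0.0] * n_actions)
    counts = defaultdict(lambda: [0] * n_actions)
    for _ in range(episodes):
        # Generate one episode by following the current epsilon-greedy policy.
        episode, s = [], start_state
        for _step in range(max_steps):
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)                      # explore
            else:
                a = max(range(n_actions), key=lambda x: Q[s][x])  # exploit
            s_next, r, done = step(s, a)
            episode.append((s, a, r))
            s = s_next
            if done:
                break
        # Evaluate: average the return following each first visit to (s, a).
        first_visit = {}
        for t, (s_t, a_t, _r) in enumerate(episode):
            first_visit.setdefault((s_t, a_t), t)
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s_t, a_t, r_t = episode[t]
            G = gamma * G + r_t
            if first_visit[(s_t, a_t)] == t:
                counts[s_t][a_t] += 1
                Q[s_t][a_t] += (G - Q[s_t][a_t]) / counts[s_t][a_t]
        # Improve: implicit, since the next episode acts epsilon-greedily on Q.
    return Q

def corridor_step(state, action):
    """Hypothetical 4-cell corridor: action 0 = left, 1 = right, cell 3 is terminal."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    return next_state, -1.0, next_state == 3

Q = mc_control_on_policy(corridor_step, start_state=0, n_actions=2)
```

The improvement step never makes the policy fully greedy; it only moves it towards the ε-greedy policy for the current Q, which is the point of the slide's annotation.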
The Policy Improvement Step
• Any ε-greedy policy w.r.t. Q^π is an improvement over any
ε-soft policy π (Policy Improvement Theorem)
ε-greedy policy
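A sketch of the standard argument behind this claim, in the notation of Sutton and Barto's textbook (assumed here to be the notation of the slides): let π be any ε-soft policy and π′ the ε-greedy policy with respect to Q^π. Then for every state s:

```latex
\begin{aligned}
Q^{\pi}(s, \pi'(s)) &= \sum_a \pi'(s,a)\, Q^{\pi}(s,a) \\
&= \frac{\epsilon}{|A(s)|} \sum_a Q^{\pi}(s,a) + (1-\epsilon) \max_a Q^{\pi}(s,a) \\
&\ge \frac{\epsilon}{|A(s)|} \sum_a Q^{\pi}(s,a)
   + (1-\epsilon) \sum_a \frac{\pi(s,a) - \frac{\epsilon}{|A(s)|}}{1-\epsilon}\, Q^{\pi}(s,a) \\
&= \sum_a \pi(s,a)\, Q^{\pi}(s,a) = V^{\pi}(s).
\end{aligned}
```

The inequality holds because the weights in the second sum are non-negative (π is ε-soft) and sum to one, so that sum is a weighted average and cannot exceed the maximum. By the policy improvement theorem, V^{π′}(s) ≥ V^π(s) for all s.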
Off-policy Method
• Evaluate one policy while following another one
– Behaviour policy takes you around the environment
– Estimation policy is what you are after
• Of course, this requires:
• Then, the off-policy procedure works as follows:
– Compute the weighted average of returns from behaviour
policy
– Weighting factors are the probability of the moves being in
estimation policy
– i.e., weight each return by relative probability of being
generated by π and πʹ
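The weighted-average procedure above can be sketched as follows, with π the estimation policy and π′ the behaviour policy. The sketch assumes the usual coverage condition (any action π might take must have non-zero probability under π′); the function names, the callable policy interface, and the sample episodes are hypothetical.

```python
def importance_weight(episode, target_prob, behaviour_prob):
    """Product over the episode of pi(s, a) / pi'(s, a)."""
    w = 1.0
    for s, a in episode:
        b = behaviour_prob(s, a)
        if b == 0.0:
            raise ValueError("behaviour policy assigns zero probability to a taken action")
        w *= target_prob(s, a) / b
    return w

def weighted_return_estimate(episodes, returns, target_prob, behaviour_prob):
    """Weighted average of returns observed under the behaviour policy."""
    weights = [importance_weight(ep, target_prob, behaviour_prob) for ep in episodes]
    total = sum(weights)
    if total == 0.0:
        return None  # no episode is consistent with the estimation policy
    return sum(w * g for w, g in zip(weights, returns)) / total

# Hypothetical setup: estimation policy always takes action 1,
# behaviour policy picks uniformly between actions 0 and 1.
target = lambda s, a: 1.0 if a == 1 else 0.0
behaviour = lambda s, a: 0.5

episodes = [[(0, 1), (1, 1)], [(0, 1), (1, 0)], [(0, 1)]]  # (state, action) pairs
returns = [10.0, -5.0, 4.0]
estimate = weighted_return_estimate(episodes, returns, target, behaviour)
```

Here the second episode gets weight zero because the estimation policy would never have taken action 0, so its return does not influence the estimate.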
Learning a Policy while Following Another
Using this to get data
Learning a Policy while Following Another
Comparing the two Probabilities
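A sketch of the comparison in the notation of Sutton and Barto's textbook (assumed to match the slides): let p_i(s) and p′_i(s) be the probabilities of the i-th observed episode suffix from state s under π and π′, let T_i(s) be its termination time, and let R_i(s) be the observed return.

```latex
p_i(s_t) = \prod_{k=t}^{T_i(s)-1} \pi(s_k, a_k)\, P^{a_k}_{s_k s_{k+1}},
\qquad
\frac{p_i(s_t)}{p'_i(s_t)} = \prod_{k=t}^{T_i(s)-1} \frac{\pi(s_k, a_k)}{\pi'(s_k, a_k)},
\qquad
V(s) = \frac{\sum_{i} \frac{p_i(s)}{p'_i(s)}\, R_i(s)}{\sum_{i} \frac{p_i(s)}{p'_i(s)}}
```

The transition probabilities appear in both numerator and denominator and cancel, so the ratio depends only on the two policies and can be computed without a model of the environment.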
Off-Policy MC Algorithm
Off-Policy MC Algorithm, cont.
Off-Policy MC Algorithm, cont.
Off-Policy MC Algorithm, cont.
The Off-Policy MC Control Algorithm
Incremental Implementation
• Better to implement MC incrementally (think memory…)
• To compute the weighted average of the returns:
We may also wish to assign relative
weights to different episodes…
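A minimal sketch of an incremental weighted average, assuming the standard update V ← V + (w / W)(G − V) with running weight total W ← W + w. The class and attribute names are illustrative; only the current estimate and the cumulative weight are stored, not the individual returns.

```python
class WeightedAverage:
    """Running weighted average of returns, updated one return at a time."""

    def __init__(self):
        self.value = 0.0         # current estimate V
        self.total_weight = 0.0  # cumulative weight W

    def update(self, g, w):
        """Fold in return g with non-negative weight w; returns the new estimate."""
        if w == 0.0:
            return self.value    # zero-weight returns leave the estimate unchanged
        self.total_weight += w
        self.value += (w / self.total_weight) * (g - self.value)
        return self.value

avg = WeightedAverage()
for g, w in [(10.0, 4.0), (-5.0, 0.0), (4.0, 2.0)]:  # hypothetical (return, weight) pairs
    avg.update(g, w)
```

With all weights equal to 1 this reduces to the ordinary incremental sample mean.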
Racetrack Example
• Go as fast as possible but do
not skid off the track
• Velocity = #grid cells (h/v)
per time step, bounded
• Noise added to actions
• State/action space?
• Reward
• Episode?
• On-policy/off-policy
learning?
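One possible answer to the state/action question, sketched under loudly stated assumptions: the state is (x, y, vx, vy), an action changes each velocity component by −1, 0 or +1 (nine actions), each component is clipped to the range 0 to `MAX_SPEED`, and noise means the chosen increment is occasionally replaced by zero. The bound of 4 and the noise probability are illustrative values, not taken from the slides.

```python
import random

# Nine actions: an increment of -1, 0 or +1 for each velocity component.
ACTIONS = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]
MAX_SPEED = 4  # assumed bound on each velocity component

def update_velocity(velocity, action, noise_prob=0.1, rng=random):
    """Apply an action to the velocity; with probability noise_prob the action is ignored."""
    vx, vy = velocity
    ax, ay = action
    if rng.random() < noise_prob:
        ax, ay = 0, 0  # noisy step: the increment is dropped
    new_vx = min(max(vx + ax, 0), MAX_SPEED)
    new_vy = min(max(vy + ay, 0), MAX_SPEED)
    return new_vx, new_vy
```

An episode would then run from a start-line cell until the car crosses the finish line, with a per-step cost and a penalty for leaving the track as the reward signal; those details are left open by the slide.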
Racetrack Example
[Figure panels: Track Layout; State Value Function]