
Reinforcement Learning

On- and Off-Policy Learning

Subramanian Ramamoorthy
School of Informatics

3 February, 2017

Can We Avoid Thorny Assumptions?

•  Two major MC assumptions (infinite sampling and exploring
all states) are unrealistic. How to circumvent the issue?

•  Need to continually explore, using ε-soft policies:

–  On-policy method: Explore in an ε-greedy manner

–  Off-policy method: Use a behaviour policy that is good at
exploring, then infer the optimal policy from that

03/02/2017 2 Reinforcement Learning

On-Policy Monte Carlo Control

•  Overall idea is still that of Generalized Policy Iteration (move
towards the greedy policy), but throw in continual exploration

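•  In order to always explore, we want to keep the policy ε-soft:
$\pi(s,a) \ge \frac{\epsilon}{|A(s)|}$ for all $s, a$

•  Moreover, one may really wish to adopt an ε-greedy policy:
take the greedy action with probability $1-\epsilon$, and a uniformly random action with probability $\epsilon$

•  In this case, we have
$\pi(s,a) = 1 - \epsilon + \frac{\epsilon}{|A(s)|}$ for the greedy action, and $\pi(s,a) = \frac{\epsilon}{|A(s)|}$ for every other action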
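
The ε-greedy probabilities can be made concrete with a minimal sketch. The function names `epsilon_greedy_probs` and `sample_action` are illustrative and not from the slides; ties between equally good actions are broken towards the lowest index, which is an assumption.

```python
import random


def epsilon_greedy_probs(q_values, epsilon):
    """Return the action probabilities of an epsilon-greedy policy in one state.

    q_values: list of action-value estimates Q(s, a), one per action.
    Every action gets epsilon / |A|; the greedy action gets the remaining
    1 - epsilon on top of that, so the policy is epsilon-soft.
    """
    n = len(q_values)
    greedy = max(range(n), key=lambda a: q_values[a])  # ties: lowest index
    probs = [epsilon / n] * n
    probs[greedy] += 1.0 - epsilon
    return probs


def sample_action(probs, rng=random):
    """Draw one action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

With four actions and ε = 0.2, each non-greedy action has probability 0.05 and the greedy one roughly 0.85, so no action is ever ruled out.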


On-Policy MC Control

Evaluate as before

Improve towards
ε-greedy, not the max
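
The evaluate/improve loop can be sketched as first-visit on-policy MC control. This is a minimal sketch under stated assumptions, not the slide's own pseudocode: the four-state corridor environment, the step cap `max_steps`, and all function names are hypothetical and chosen only to keep the example self-contained.

```python
import random
from collections import defaultdict


def run_episode(policy_probs, rng, n_states=4, max_steps=50):
    """Hypothetical corridor: states 0..n_states-1, actions 0=left, 1=right.

    Reward is -1 per step; the episode ends on reaching the rightmost state
    (or after max_steps, an assumption that keeps early episodes finite).
    """
    s, trajectory = 0, []
    for _ in range(max_steps):
        a = rng.choices([0, 1], weights=policy_probs(s))[0]
        trajectory.append((s, a, -1.0))
        s = max(0, s - 1) if a == 0 else s + 1
        if s == n_states - 1:
            break
    return trajectory


def on_policy_mc_control(n_episodes=2000, epsilon=0.1, gamma=1.0, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)      # action-value estimates Q[(s, a)]
    counts = defaultdict(int)   # number of first visits to (s, a)

    def policy_probs(s):
        # Improvement step: epsilon-greedy w.r.t. the current Q, not the max.
        greedy = max((0, 1), key=lambda a: Q[(s, a)])
        probs = [epsilon / 2, epsilon / 2]
        probs[greedy] += 1.0 - epsilon
        return probs

    for _ in range(n_episodes):
        episode = run_episode(policy_probs, rng)
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)
        # Evaluation step: average the returns following each first visit.
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            if first_visit[(s, a)] == t:
                counts[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]
    return Q
```

Because the policy is recomputed from `Q` every time an action is chosen, evaluation and improvement are interleaved episode by episode, in the spirit of Generalized Policy Iteration.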


The Policy Improvement Step

•  Any ε-greedy policy w.r.t. $Q^{\pi}$ is an improvement over any
ε-soft policy π (Policy Improvement Theorem)

ε-greedy policy
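
The standard argument behind this improvement claim can be written out as follows. The symbol $\tilde{\pi}$ for the ε-greedy policy with respect to $Q^{\pi}$ is notation introduced here, not taken from the slides.

```latex
\begin{align*}
Q^{\pi}\bigl(s, \tilde{\pi}(s)\bigr)
  &= \sum_a \tilde{\pi}(s,a)\, Q^{\pi}(s,a) \\
  &= \frac{\epsilon}{|A(s)|} \sum_a Q^{\pi}(s,a)
     + (1-\epsilon) \max_a Q^{\pi}(s,a) \\
  &\ge \frac{\epsilon}{|A(s)|} \sum_a Q^{\pi}(s,a)
     + (1-\epsilon) \sum_a \frac{\pi(s,a) - \frac{\epsilon}{|A(s)|}}{1-\epsilon}\, Q^{\pi}(s,a) \\
  &= \sum_a \pi(s,a)\, Q^{\pi}(s,a) \;=\; V^{\pi}(s)
\end{align*}
```

The inequality holds because π is ε-soft, so the weights $\bigl(\pi(s,a) - \epsilon/|A(s)|\bigr)/(1-\epsilon)$ are non-negative and sum to one, and a weighted average can never exceed the maximum.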


Off-policy Method

•  Evaluate one policy while following another one
–  Behaviour policy takes you around the environment
–  Estimation policy is what you are after

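•  Of course, this requires: $\pi(s,a) > 0 \Rightarrow \pi'(s,a) > 0$, i.e., every move the estimation policy π might make is also made with nonzero probability by the behaviour policy πʹ
•  Then, the off-policy procedure works as follows: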

–  Compute the weighted average of the returns obtained under the
behaviour policy

–  Weighting factors are the probabilities of the observed moves under
the estimation policy

–  i.e., weight each return by the relative probability of its being
generated by π and πʹ
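
The weighting can be illustrated with a short sketch. It assumes, as in the usual importance-sampling formulation, that the environment's transition probabilities cancel in the ratio, so only the action probabilities under the two policies remain. The function names and the callable signatures `target_prob(s, a)` and `behaviour_prob(s, a)` are hypothetical.

```python
def importance_weight(state_actions, target_prob, behaviour_prob):
    """Relative probability of a trajectory under pi versus pi'.

    state_actions: list of (state, action) pairs visited in the episode.
    target_prob(s, a): probability of a in s under the estimation policy pi.
    behaviour_prob(s, a): probability under the behaviour policy pi',
    assumed to be nonzero wherever target_prob is nonzero.
    """
    weight = 1.0
    for s, a in state_actions:
        weight *= target_prob(s, a) / behaviour_prob(s, a)
    return weight


def weighted_average_return(returns, weights):
    """Weighted average of returns; 0.0 if every weight is zero."""
    total = sum(weights)
    if total == 0.0:
        return 0.0
    return sum(w * g for w, g in zip(weights, returns)) / total
```

An episode containing any move that the estimation policy would never make receives weight zero and so contributes nothing to the estimate.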


Learning a Policy while Following Another

Using this to get data


Learning a Policy while Following Another


Comparing the two Probabilities


Off-Policy MC Algorithm


Off-Policy MC Algorithm, cont.


Off-Policy MC Algorithm, cont.


Off-Policy MC Algorithm, cont.


The Off-Policy MC Control Algorithm
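
One common formulation of off-policy MC control is sketched below. It is not necessarily the version shown on the original slide: it uses a uniform-random behaviour policy, a greedy estimation policy, and incrementally accumulated importance weights. The corridor environment, the pessimistic initial value of -100, and all names are assumptions made for illustration.

```python
import random

N_STATES, ACTIONS = 4, (0, 1)  # hypothetical corridor; 0 = left, 1 = right


def behaviour_episode(rng):
    """Follow a uniform-random behaviour policy until the rightmost state."""
    s, episode = 0, []
    while s != N_STATES - 1:
        a = rng.choice(ACTIONS)
        episode.append((s, a, -1.0))  # reward of -1 per step
        s = max(0, s - 1) if a == 0 else s + 1
    return episode


def off_policy_mc_control(n_episodes=500, gamma=1.0, seed=0):
    rng = random.Random(seed)
    b = 1.0 / len(ACTIONS)  # behaviour probability of every action
    # Pessimistic initial values (an assumption) so that untried actions
    # are not preferred by the greedy estimation policy.
    Q = {(s, a): -100.0 for s in range(N_STATES - 1) for a in ACTIONS}
    C = {key: 0.0 for key in Q}  # cumulative weights per (s, a)
    greedy = {s: 0 for s in range(N_STATES - 1)}
    for _ in range(n_episodes):
        G, W = 0.0, 1.0
        # Work backwards over the tail of the episode.
        for s, a, r in reversed(behaviour_episode(rng)):
            G = gamma * G + r
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            greedy[s] = max(ACTIONS, key=lambda act: Q[(s, act)])
            if a != greedy[s]:
                break   # earlier steps have zero weight under the greedy policy
            W /= b      # greedy policy takes a with probability 1
    return Q, greedy
```

Only the tail of each episode that agrees with the current greedy policy is used, which is why learning from a purely exploratory behaviour policy can be slow on longer tasks.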


Incremental Implementation

•  Better to implement MC incrementally (think memory…)

•  To compute the weighted average of each return:

We may also wish to assign relative
weights to different episodes…
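
A weighted average can be maintained without storing past returns. A minimal sketch, assuming the usual incremental rule $W_{n+1} = W_n + w_{n+1}$ and $V_{n+1} = V_n + \frac{w_{n+1}}{W_{n+1}}\,(R_{n+1} - V_n)$ with $V_0 = W_0 = 0$; the class name `WeightedAverage` is illustrative.

```python
class WeightedAverage:
    """Running weighted average that keeps only two numbers in memory."""

    def __init__(self):
        self.value = 0.0         # current estimate V_n
        self.total_weight = 0.0  # cumulative weight W_n

    def update(self, ret, weight):
        """Fold in one return with its weight and return the new estimate."""
        if weight == 0.0:
            return self.value    # a zero-weight return changes nothing
        self.total_weight += weight
        self.value += (weight / self.total_weight) * (ret - self.value)
        return self.value
```

For example, feeding in returns 10, -5, 2 with weights 4, 0, 1 yields the same value as the batch formula (4·10 + 0·(-5) + 1·2) / 5 = 8.4, up to floating-point rounding.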


Racetrack Example

•  Go as fast as possible but do
not skid off the track

•  Velocity = number of grid cells moved (horizontally/vertically)
per time step, bounded

•  Noise added to actions
•  State/action space?
•  Reward
•  Episode?

•  On-policy/off-policy
learning?
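
One possible way to answer the state/action/reward questions is sketched below. Everything specific here is an assumption for illustration rather than the formulation used in the lecture: the state is position plus velocity, actions are accelerations in {-1, 0, +1} per axis, noise means the chosen acceleration is occasionally ignored, the speed bound is 4, and the rewards are -1 per step and -5 for leaving the track.

```python
import random

# Nine hypothetical actions: change each velocity component by -1, 0 or +1.
ACTIONS = [(ax, ay) for ax in (-1, 0, 1) for ay in (-1, 0, 1)]
V_MAX = 4  # assumed bound on each velocity component


def step(state, action, track, rng, noise=0.1):
    """Advance the car by one time step.

    state: (x, y, vx, vy); track: set of (x, y) cells that are on the track.
    Returns (next_state, reward, off_track).
    """
    x, y, vx, vy = state
    # Noise: with probability `noise` the acceleration has no effect.
    ax, ay = (0, 0) if rng.random() < noise else action
    vx = max(-V_MAX, min(V_MAX, vx + ax))
    vy = max(-V_MAX, min(V_MAX, vy + ay))
    x, y = x + vx, y + vy
    off_track = (x, y) not in track
    reward = -5.0 if off_track else -1.0  # assumed reward scheme
    return (x, y, vx, vy), reward, off_track
```

Under this formulation an episode would run from a start cell until the car crosses a finish line, and the per-step penalty is what makes going fast worthwhile.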


Racetrack Example

Figure panels: Track Layout; State Value Function
