
Reinforcement Learning

Temporal-Difference (TD) Learning

Subramanian Ramamoorthy
School of Informatics

31 January, 2017

Learning in MDPs

•  You are learning from a long stream of experience:

   s_0 a_0 r_0 s_1 a_1 r_1 … s_k a_k r_k …

   … up to some terminal state

•  Direct methods: approximate the value function (V/Q) straight away –
   without computing the model quantities P^a_{ss'} and R^a_{ss'}

Should you wait until episodes end, or can you learn on-line?

Recap: Incremental Monte Carlo Algorithm

•  Incremental sample-average procedure:

   V(s) ← V(s) + (1/n(s)) [R − V(s)]

•  Where n(s) is the number of first visits to state s
   –  Note that we make one update, for each state, per episode

•  One could pose this as a generic constant step-size algorithm:

   V(s) ← V(s) + α [R − V(s)]

   –  Useful in tracking non-stationary problems (task + environment)
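
A minimal Python sketch of these two update rules (the names V, n, alpha and R are illustrative; R would be the return observed after a first visit to s in a completed episode):

from collections import defaultdict

V = defaultdict(float)   # value estimate for each state
n = defaultdict(int)     # number of first visits to each state
alpha = 0.1              # constant step size for the second variant

def mc_sample_average_update(s, R):
    # V(s) <- V(s) + (1/n(s)) [R - V(s)]
    n[s] += 1
    V[s] += (R - V[s]) / n[s]

def mc_constant_alpha_update(s, R):
    # V(s) <- V(s) + alpha [R - V(s)]   (useful when the task is non-stationary)
    V[s] += alpha * (R - V[s])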

Example: Driving Home

State                  Elapsed Time   Predicted    Predicted
                       (minutes)      Time to Go   Total Time
leaving office              0             30           30
reach car, raining          5             35           40
exit highway               20             15           35
behind truck               30             10           40
home street                40              3           43
arrive home                43              0           43

(Time elapsed on each leg: 5, 15, 10, 10, 3 minutes.)

Driving Home

Figure: changes recommended by Monte Carlo methods (α=1) vs. changes recommended by TD methods (α=1).
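
The figure is not reproduced, but its two panels can be recomputed from the table above with a short Python sketch (state names and predicted total times are taken directly from the table; α = 1 as in the slide):

# Predicted total travel time at each state, from the table above
states = ["leaving office", "reach car, raining", "exit highway",
          "behind truck", "home street"]
predictions = [30, 40, 35, 40, 43, 43]   # last entry is the prediction on arriving home
actual_outcome = predictions[-1]          # 43 minutes in total

# Monte Carlo (alpha = 1): each prediction moves all the way to the final outcome
mc_changes = [actual_outcome - p for p in predictions[:-1]]

# TD (alpha = 1): each prediction moves towards the next prediction
td_changes = [predictions[i + 1] - predictions[i] for i in range(len(predictions) - 1)]

for s, mc, td in zip(states, mc_changes, td_changes):
    print(f"{s:20s}  MC change: {mc:+3d}   TD change: {td:+3d}")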


What does DP Do?


What does Simple MC Do?


Idea behind Temporal Difference Procedure


Temporal Difference Prediction


Temporal Difference Learning

•  Does not require a model (i.e., transition and reward probabilities) –
   learn directly from experience

•  Update the estimate of V(s) soon after visiting the state s

•  The update target combines the actual 1-step reward, r_{t+1}, with a
   discounted estimate of future reward, γV(s_{t+1}); this is a better
   estimate than the initial estimate of future reward, V(s_t)

TD(0) Update

V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) − V(s_t)]

TD(0) Algorithm for Learning V π

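The algorithm box on this slide is a figure; below is a minimal Python sketch of tabular TD(0) prediction for a fixed policy. The environment interface (env.reset(), env.step(a) returning (next_state, reward, done)) and the policy function pi are assumptions made for illustration:

from collections import defaultdict

def td0_prediction(env, pi, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0): estimate V^pi by bootstrapping after every step."""
    V = defaultdict(float)  # arbitrary initial estimates (here 0 for all states)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = pi(s)                         # action from the policy being evaluated
            s_next, r, done = env.step(a)     # one step of real experience
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])   # TD(0) update
            s = s_next
    return V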

Why TD Learning?

Why?


Bootstrapping, Sampling


Random Walk Example

Values learned by TD(0) after
various numbers of episodes


TD and MC on the Random Walk

Data averaged over
100 sequences of episodes


Understanding TD vs. MC

S+B Example 6.4:
•  You observe 8 episodes of a process:
   A,0,B,0   B,1   B,1   B,1   B,1   B,1   B,1   B,0

•  Interpretation:
   –  First episode starts in state A, transitions to B getting a reward of 0,
      and terminates with a reward of 0
   –  Second episode starts in state B, terminates with a reward of 1, etc.

Question: What are good estimates for V(A) and V(B)?

S+B Example 6.4: Underlying Markov Process

V(A) = ?


TD and MC Estimates
•  Batch Monte Carlo (update after all episodes are done) gets V(A) = 0.
   –  This best matches the training data
   –  It minimises the mean-squared error on the training set

•  Consider sequentiality: A transitions to B, which transitions to the
   terminating state; this gives V(A) = 0.75.
   –  This is what batch TD(0) gets
   –  Expect that this will produce a better estimate of future data, even
      though MC gives the best estimate on the present data
   –  It is correct for the maximum-likelihood estimate of the model of the
      Markov process that generates the data, i.e. the best-fit Markov
      model based on the observed transitions
   –  Assume this model is correct and estimate the value function – the
      “certainty-equivalence estimate”

TD(0) tends to converge faster because it moves towards a better estimate.
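As a quick sanity check of the two numbers above, a short sketch that reproduces V(A) = 0 (batch MC) and V(A) = 0.75 (certainty equivalence) from the eight episodes:

# The 8 observed episodes: each is a list of (state, reward) steps
episodes = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]

# Batch Monte Carlo: V(s) is the average return following visits to s
returns = {"A": [], "B": []}
for ep in episodes:
    rewards = [r for _, r in ep]
    for i, (s, _) in enumerate(ep):
        returns[s].append(sum(rewards[i:]))   # undiscounted return from step i
V_mc = {s: sum(g) / len(g) for s, g in returns.items()}
print(V_mc)            # {'A': 0.0, 'B': 0.75}

# Certainty equivalence (what batch TD(0) converges to): in the data, A always
# moved to B with reward 0, so V(A) = 0 + V(B) = 0.75
V_ce_B = V_mc["B"]     # 6 of the 8 episodes from B ended with reward 1
V_ce_A = 0 + V_ce_B
print(V_ce_A)          # 0.75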

TD for Control: Learning Q-Values


TD for Control: Learning Q-Values


Algorithm: SARSA

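The SARSA pseudocode on this slide is a figure; a minimal Python sketch of the on-policy update loop, assuming the same env.reset()/env.step(a) interface as the TD(0) sketch above and an ε-greedy behaviour policy:

import random
from collections import defaultdict

def sarsa(env, actions, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
    """On-policy TD control: Q is updated with the action actually taken next."""
    Q = defaultdict(float)  # Q[(state, action)]

    def epsilon_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        s = env.reset()
        a = epsilon_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(s_next)            # choose the next action first ...
            target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])  # ... then update with (s, a, r, s', a')
            s, a = s_next, a_next
    return Q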

Windy Gridworld

undiscounted, episodic, reward = –1 until goal


Results of Sarsa on the Windy Gridworld


Q-Learning


Algorithm: Q-Learning

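Likewise, a sketch of the Q-learning loop under the same assumed interface; the only change from SARSA is that the target bootstraps from the greedy action max_a Q(s', a), independently of the action the behaviour policy takes next:

import random
from collections import defaultdict

def q_learning(env, actions, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Off-policy TD control: the target uses max over actions at the next state."""
    Q = defaultdict(float)  # Q[(state, action)]

    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # Behaviour policy: epsilon-greedy with respect to the current Q
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            if done:
                target = r
            else:
                target = r + gamma * max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q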

Backup Diagrams: SARSA and Q-Learning


Cliffwalking

ε-greedy, ε = 0.1


Q-Learning vs. SARSA


Summary

•  Idea of Temporal Difference Prediction
•  1-step tabular model-free TD method
•  Can extend to the GPI approach:
–  On-policy: SARSA
–  Off-policy: Q-learning

•  TD methods bootstrap and sample, combining benefits of DP
and MC methods
