Reinforcement Learning
Temporal-Difference (TD) Learning
Subramanian Ramamoorthy
School of Informatics
31 January, 2017
Learning in MDPs
• You are learning from a long stream of experience:
  $s_0 a_0 r_0 \, s_1 a_1 r_1 \ldots s_k a_k r_k \ldots$ up to some terminal state
• Direct methods: approximate the value function (V/Q) straight away –
  without computing the model ($P^a_{ss'}$, $R^a_{ss'}$)
• Should you wait until episodes end, or can you learn on-line?
Recap: Incremental Monte Carlo Algorithm
• Incremental sample-average procedure:
  $V(s) \leftarrow V(s) + \frac{1}{n(s)}\,[R - V(s)]$
• where n(s) is the number of first visits to state s
  – Note that we make one update, for each state, per episode
• One could pose this as a generic constant step-size algorithm:
  $V(s) \leftarrow V(s) + \alpha\,[R - V(s)]$
  – Useful for tracking non-stationary problems (task + environment)
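A minimal sketch of these two updates (sample-average and constant step-size forms), assuming the return R from each first visit has already been computed; the function name, episode format, and defaultdict bookkeeping are illustrative choices, not from the slides.

```python
from collections import defaultdict

def incremental_first_visit_mc(episodes, alpha=None):
    """Incremental first-visit Monte Carlo prediction.

    episodes: a list of episodes, each a list of (state, R) pairs where R is
              the return observed from the first visit to that state.
    alpha:    if None, use the 1/n(s) sample-average step size;
              otherwise use the constant step size alpha.
    """
    V = defaultdict(float)          # value estimates, initialised to 0
    n = defaultdict(int)            # number of first visits to each state

    for episode in episodes:
        seen = set()
        for state, R in episode:
            if state in seen:       # first-visit MC: one update per state per episode
                continue
            seen.add(state)
            n[state] += 1
            step = alpha if alpha is not None else 1.0 / n[state]
            V[state] += step * (R - V[state])   # V(s) <- V(s) + step [R - V(s)]
    return V
```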
Example: Driving Home

State                 Elapsed Time (minutes)   Predicted Time to Go   Predicted Total Time
leaving office                  0                       30                     30
reach car, raining              5                       35                     40
exit highway                   20                       15                     35
behind truck                   30                       10                     40
home street                    40                        3                     43
arrive home                    43                        0                     43

(Elapsed time between successive states: 5, 15, 10, 10, 3 minutes)
Driving Home
[Figure: changes recommended by Monte Carlo methods (α = 1) vs. changes recommended by TD methods (α = 1)]
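As a worked instance of a TD change with α = 1 and undiscounted returns (an assumption consistent with the example): on reaching the car, the prediction made when leaving the office can already be moved toward the new estimate of 5 + 35 = 40 minutes, without waiting for the actual outcome of 43 minutes that MC needs.

$$V(\text{leaving office}) \leftarrow 30 + 1 \cdot \big[\, 5 + 35 - 30 \,\big] = 40$$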
What does DP Do?
What does Simple MC Do?
Idea behind Temporal Difference Procedure
Temporal Difference Prediction
Temporal Difference Learning
• Does not require a model (i.e., transition and reward probabilities) –
  learns directly from experience
• Updates the estimate of V(s) soon after visiting the state s
• The target $r_{t+1} + \gamma V(s_{t+1})$ (the actual 1-step reward plus the
  discounted estimate of future reward) is a better estimate than $V(s_t)$,
  the initial estimate of future reward
TD(0) Update
$$V(s_t) \leftarrow V(s_t) + \alpha \, \big[\, r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \,\big]$$
TD(0) Algorithm for Learning V^π
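A minimal tabular TD(0) sketch for estimating V^π; the env interface (reset/step), the policy callable, and the default parameters are assumptions for illustration rather than the slide's pseudocode.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0) policy evaluation.

    env is assumed to expose reset() -> state and
    step(action) -> (next_state, reward, done); policy(state) -> action.
    """
    V = defaultdict(float)                      # V(s), initialised to 0
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])     # TD(0) update
            s = s_next
    return V
```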
Why TD Learning?
Why?
Bootstrapping, Sampling
Random Walk Example
Values learned by TD(0) after various numbers of episodes
TD and MC on the Random Walk
Data averaged over
100 sequences of episodes
Understanding TD vs. MC
S+B Example 6.4:
• You observe 8 episodes of a process:
A,0,B,0 B,1 B,1 B,1 B,1 B,1 B,1 B,0
• Interpretation:
  – The first episode starts in state A, transitions to B getting a reward of 0,
    and terminates with a reward of 0
  – The second episode starts in state B and terminates with a reward of 1, etc.
Question: What are good estimates for V(A) and V(B)?
S+B Example 6.4: Underlying Markov Process
V(A) = ?
TD and MC Estimates
• Batch Monte Carlo (update after all episodes are done) gets V(A) = 0.
  – This best matches the training data
  – It minimises the mean-square error on the training set
• Consider sequentiality: A transitions to B, then to the terminating state; V(A) = 0.75.
  – This is what TD(0) gets
  – Expect that this will produce a better estimate of future data, even
    though MC gives the best estimate on the present data
  – It is correct for the maximum-likelihood estimate of the model of the
    Markov process that generates the data, i.e. the best-fit Markov
    model based on the observed transitions
  – Assume this model is correct and estimate the value function – the
    "certainty-equivalence estimate"
TD(0) tends to converge faster because it moves towards a better estimate.
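A worked version of the certainty-equivalence estimate for Example 6.4, assuming undiscounted returns (γ = 1): B occurs in all 8 episodes and is followed by a terminal reward of 1 in 6 of them, while A is observed once and always transitions to B with reward 0.

$$V(B) = \tfrac{6}{8} = 0.75, \qquad V(A) = 1 \cdot \big( 0 + V(B) \big) = 0.75$$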
TD for Control: Learning Q-Values
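As a sketch of the idea (the same 1-step TD target, carried over to action values), the update on Q-values that Sarsa uses is:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \, \big[\, r_{t+1} + \gamma \, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \,\big]$$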
Algorithm: SARSA
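A minimal tabular Sarsa sketch; the env interface (reset/step), the ε-greedy helper, and the default parameters are assumptions for illustration, not the slide's pseudocode.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, eps):
    """Pick a random action with probability eps, otherwise a greedy one."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa(env, actions, num_episodes, alpha=0.1, gamma=1.0, eps=0.1):
    """On-policy TD control: learn Q for the ε-greedy policy being followed."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, eps)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, actions, eps)
            target = r if done else r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # Sarsa update
            s, a = s_next, a_next
    return Q
```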
Windy Gridworld
undiscounted, episodic, reward = –1 until the goal is reached
Results of Sarsa on the Windy Gridworld
Q-Learning
Algorithm: Q-Learning
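A minimal tabular Q-learning sketch, reusing the same assumed env interface and ε-greedy behaviour policy as the Sarsa sketch above; the off-policy element is that the bootstrapped target uses the max over next actions rather than the action actually taken next.

```python
import random
from collections import defaultdict

def q_learning(env, actions, num_episodes, alpha=0.1, gamma=1.0, eps=0.1):
    """Off-policy TD control: behave ε-greedily, but bootstrap on max_a Q."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # behaviour policy: ε-greedy with respect to the current Q
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            best_next = max(Q[(s_next, x)] for x in actions)
            target = r if done else r + gamma * best_next
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # Q-learning update
            s = s_next
    return Q
```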
Backup Diagrams: SARSA and Q-Learning
Cliffwalking
ε-greedy, ε = 0.1
Q-Learning vs. SARSA
Summary
• Idea of Temporal Difference Prediction
• 1-step tabular model-free TD method
• Can extend to the GPI approach:
– On-policy: SARSA
– Off-policy: Q-learning
• TD methods bootstrap and sample, combining benefits of DP
and MC methods