Temporal Difference Methods for Prediction
Rupam Mahmood, February 24, 2020
R&L AI
Prediction as estimating value functions
Predictions are building blocks for many control methods
The usefulness of predictions goes beyond control
Forming a predictive question: How many times will you get honked at today?
(Pseudo-) reward: +1 for each honk
Termination of episode: end of the day
Behavior: the way you drive (think of your average speed, frequency of changing lanes, etc.)
The question can be answered by estimating vπ(s) ≐ Eπ[Gt | St = s]
Much of prediction is about estimating expected values
E[X] = ∑_x x P(X = x)   (compute exactly if you know the model)
E[X] ≈ (1/n) ∑_{k=1}^{n} X_k   (estimate from samples)
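A minimal sketch of the two routes in Python (the distribution here is made up for illustration): compute E[X] exactly from a known model, or estimate it from samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical known model: X takes value x with probability P(X = x)
values = np.array([0.0, 1.0, 2.0])
probs  = np.array([0.5, 0.3, 0.2])

# Exact: E[X] = sum_x x P(X = x)  (requires knowing the model)
exact = np.dot(values, probs)

# Sample-based: E[X] ~= (1/n) sum_{k=1}^n X_k  (requires only samples)
samples  = rng.choice(values, size=10_000, p=probs)
estimate = samples.mean()

print(exact, estimate)  # the estimate approaches the exact value as n grows
```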
Much of prediction is about estimating expected values
vπ(s) ≐ Eπ[Gt | St = s]
Dynamic programming, e.g., iterative policy evaluation (a sketch follows below)
Sample-based, e.g., Monte Carlo (MC)
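As a rough sketch of the DP route (assuming a known model given as arrays P[s, a, s'] and R[s, a], which are not part of the slides), iterative policy evaluation repeatedly applies the Bellman expectation backup until the value estimates stop changing:

```python
import numpy as np

def iterative_policy_evaluation(P, R, pi, gamma=1.0, theta=1e-8):
    """Sweep the Bellman expectation backup until convergence.
    P[s, a, s2]: transition probabilities; R[s, a]: expected rewards;
    pi[s, a]: probability of taking action a in state s under the policy."""
    V = np.zeros(P.shape[0])
    while True:
        # q(s, a) = R(s, a) + gamma * sum_{s2} P(s2 | s, a) V(s2)
        q = R + gamma * np.einsum("sap,p->sa", P, V)
        # v_new(s) = sum_a pi(a | s) q(s, a)
        V_new = np.einsum("sa,sa->s", pi, q)
        if np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new
```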
From MC to TD(0)
Monte Carlo estimator for on-policy prediction: V(s) ≐ (∑_{t∈𝒯(s)} Gt) / |𝒯(s)|
Incremental Monte Carlo estimator: V(St) ← V(St) + (1/N(St)) [Gt − V(St)]
Constant-α MC: V(St) ← V(St) + α [Gt − V(St)]
TD(0): V(St) ← V(St) + α [Rt+1 + γV(St+1) − V(St)]
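As a small sketch (tabular values stored in a NumPy array or dict; names are illustrative, not from the slides), the progression from sample-average MC to TD(0) is just a change of step size and then a change of target:

```python
def incremental_mc_update(V, N, s, G):
    # Sample-average MC: step size 1/N(S_t) toward the observed return G_t
    N[s] += 1
    V[s] += (G - V[s]) / N[s]

def constant_alpha_mc_update(V, s, G, alpha):
    # Constant-alpha MC: fixed step size, same target G_t
    V[s] += alpha * (G - V[s])

def td0_update(V, s, r, s_next, alpha, gamma, done=False):
    # TD(0): bootstrap the target from the current estimate of V(S_{t+1})
    target = r if done else r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])
```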
The target for the Monte Carlo update is Gt, whereas the target for the TD update is Rt+1 + γV(St+1). This TD method is called TD(0), or one-step TD, because it is a special case of the TD(λ) and n-step TD methods developed in Chapter 12 and Chapter 7. The box below specifies TD(0) completely in procedural form.
Unlike Monte Carlo, TD(0) works online.
Tabular TD(0) for estimating vπ

Input: the policy π to be evaluated
Algorithm parameter: step size α ∈ (0, 1]
Initialize V(s), for all s ∈ S⁺, arbitrarily except that V(terminal) = 0

Loop for each episode:
    Initialize S
    Loop for each step of episode:
        A ← action given by π for S
        Take action A, observe R, S′
        V(S) ← V(S) + α[R + γV(S′) − V(S)]
        S ← S′
    until S is terminal
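A minimal Python sketch of the box above (the env/policy interface is assumed, not specified in the slides: env.reset() returns an integer state, env.step(a) returns (next_state, reward, done), and policy(s) returns an action):

```python
import numpy as np

def td0_prediction(env, policy, num_states, num_episodes, alpha=0.1, gamma=1.0):
    V = np.zeros(num_states)                  # V(terminal) is never updated, so it stays 0
    for _ in range(num_episodes):
        s = env.reset()                       # Initialize S
        done = False
        while not done:                       # Loop for each step of episode
            a = policy(s)                     # A <- action given by pi for S
            s_next, r, done = env.step(a)     # Take action A, observe R, S'
            target = r if done else r + gamma * V[s_next]
            V[s] += alpha * (target - V[s])   # V(S) <- V(S) + alpha [R + gamma V(S') - V(S)]
            s = s_next                        # S <- S'
    return V
```

Note that the update is applied after every step, which is what the "works online" remark above refers to.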
Because TD(0) bases its update in part on an existing estimate, we say that it is a bootstrapping method, like DP. We know from Chapter 3 that

vπ(s) ≐ Eπ[Gt | St = s]   (6.3)
      = Eπ[Rt+1 + γGt+1 | St = s]   (from (3.9))

Say an oracle gives us the return Gt from the future at each step; replace R + γV(S′) in the TD(0) update with Gt.
This is an online but acausal Monte Carlo method. Will it be first-visit or every-visit?
From Monte Carlo error to TD error
Constant-α MC: V(St) ← V(St) + α [Gt − V(St)], with MC error Δt ≐ Gt − V(St)
TD(0): V(St) ← V(St) + α [Rt+1 + γV(St+1) − V(St)], with TD error δt ≐ Rt+1 + γV(St+1) − V(St)

If V is held fixed during the episode, the MC error is a discounted sum of TD errors:
Δt = ∑_{k=t}^{T−1} γ^{k−t} δk
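One way to see the identity (a short derivation, using Gt = Rt+1 + γGt+1 and the fact that ΔT = GT − V(ST) = 0 at termination):

Δt = Gt − V(St)
   = Rt+1 + γGt+1 − V(St)
   = [Rt+1 + γV(St+1) − V(St)] + γ[Gt+1 − V(St+1)]
   = δt + γΔt+1
   = δt + γδt+1 + γ²δt+2 + ⋯ + γ^{T−1−t} δT−1 = ∑_{k=t}^{T−1} γ^{k−t} δk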
Based on this equivalence, can you think of an implementation of the Monte Carlo method that computes partial updates online and completes the full update at the end of the episode?
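One possible answer, sketched in Python (every-visit; the env/policy interface is assumed as before, and V is a NumPy array indexed by state): at each step compute the TD error, credit it to every state visited earlier in the episode with the appropriate discount, and apply the accumulated updates only when the episode terminates.

```python
import numpy as np

def constant_alpha_mc_episode(env, policy, V, alpha=0.1, gamma=1.0):
    """One episode of constant-alpha (every-visit) MC, computed online from
    TD errors and applied as a full update at the end of the episode."""
    pending = np.zeros_like(V)    # partial updates, held back until termination
    trace = np.zeros_like(V)      # discounted visit counts for states seen so far
    s = env.reset()
    done = False
    while not done:
        a = policy(s)
        s_next, r, done = env.step(a)
        v_next = 0.0 if done else V[s_next]
        delta = r + gamma * v_next - V[s]   # TD error delta_t (V is unchanged during the episode)
        trace[s] += 1.0                     # every-visit accumulation
        pending += alpha * delta * trace    # a visit at time t receives alpha * gamma^(k-t) * delta_k at step k
        trace *= gamma
        s = s_next
    V += pending                            # complete the full MC update: alpha * (G_t - V(S_t)) per visit
    return V
```

Because V is held fixed while the episode runs, the accumulated pending update for each visit equals α(Gt − V(St)) exactly, matching constant-α MC.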