Temporal Difference Methods for Control
Rupam Mahmood March 2, 2020
R&L AI
Difference between prediction and control in pseudocode
tabular TD(0) for qπ: Q(St, At) ← Q(St, At) + α [Rt+1 + γ Q(St+1, At+1) − Q(St, At)]
Sarsa (on-policy TD control) for estimating Q ≈ q*
Algorithm parameters: step size α ∈ (0, 1], small ε > 0
Initialize Q(s, a), for all s ∈ S+, a ∈ A(s), arbitrarily except that Q(terminal, ·) = 0
Loop for each episode:
  Initialize S
  Choose A from S using policy derived from Q (e.g., ε-greedy)
  Loop for each step of episode:
    Take action A, observe R, S'
    Choose A' from S' using policy derived from Q (e.g., ε-greedy)
    Q(S, A) ← Q(S, A) + α [R + γ Q(S', A') − Q(S, A)]
    S ← S'; A ← A'
  until S is terminal
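Below is a minimal Python sketch of the Sarsa loop above, assuming a small tabular problem and a gym-style environment whose step(a) returns (next_state, reward, done); the function and variable names are illustrative, not part of the pseudocode.

import numpy as np

def epsilon_greedy(Q, s, num_actions, eps, rng):
    # With probability eps pick a random action, otherwise a greedy one.
    if rng.random() < eps:
        return int(rng.integers(num_actions))
    return int(np.argmax(Q[s]))

def sarsa_episode(env, Q, alpha=0.1, gamma=1.0, eps=0.1, rng=None):
    # One episode of Sarsa; Q is a (num_states, num_actions) array updated in place.
    # env is assumed to expose reset() -> state and step(a) -> (next_state, reward, done).
    rng = rng if rng is not None else np.random.default_rng()
    num_actions = Q.shape[1]
    s = env.reset()
    a = epsilon_greedy(Q, s, num_actions, eps, rng)
    done = False
    while not done:
        s_next, r, done = env.step(a)
        a_next = epsilon_greedy(Q, s_next, num_actions, eps, rng)
        target = r + (0.0 if done else gamma * Q[s_next, a_next])   # Q(terminal, ·) = 0
        Q[s, a] += alpha * (target - Q[s, a])
        s, a = s_next, a_next
    return Q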
What would you modify in the above to get the pseudocode for TD(0)?
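One possible answer, sketched under the same assumptions as the Sarsa code above: keep the loop, but draw both A and A' from a fixed policy pi instead of a policy derived from Q, so the algorithm evaluates q_pi rather than improving the policy. Here pi is assumed to be a (num_states, num_actions) array of action probabilities; the names are illustrative.

import numpy as np

def td0_q_prediction_episode(env, Q, pi, alpha=0.1, gamma=1.0, rng=None):
    # Same structure as the Sarsa loop, but with no greedification: actions come
    # from the fixed policy pi, so Q estimates q_pi instead of moving toward q*.
    rng = rng if rng is not None else np.random.default_rng()
    num_actions = Q.shape[1]
    s = env.reset()
    a = int(rng.choice(num_actions, p=pi[s]))
    done = False
    while not done:
        s_next, r, done = env.step(a)
        a_next = int(rng.choice(num_actions, p=pi[s_next]))
        target = r + (0.0 if done else gamma * Q[s_next, a_next])
        Q[s, a] += alpha * (target - Q[s, a])
        s, a = s_next, a_next
    return Q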
Q-learning and on-policy vs. off-policy
SARSA: Q(St, At) ← Q(St, At) + α [Rt+1 + γ Q(St+1, At+1) − Q(St, At)]
Q-learning: Q(St, At) ← Q(St, At) + α [Rt+1 + γ max_b Q(St+1, b) − Q(St, At)]
Notice what the target of each update represents, and whether the policy underlying that target matches the behavior policy.
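As a concrete companion to this observation, here is a minimal sketch of the two one-step updates in Python (Q is a state-by-action array; all names are illustrative). Sarsa's target bootstraps on the action the behavior policy actually takes next, while Q-learning's target bootstraps on the greedy action regardless of what is taken next.

import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # Target uses the action A' actually selected by the behavior policy.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    # Target uses the greedy action at S', independent of the behavior policy.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])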
On-policy constant-α MC: VMC(St) ← VMC(St) + α [Gt − VMC(St)], where the return Gt is the target
Off-policy constant-α MC: VMC(St) ← VMC(St) + α [ρt:T−1Gt − VMC(St)]
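A sketch of both MC updates in Python, assuming the episode is stored as (S_t, A_t, R_{t+1}) triples and, for the off-policy case, that the target policy pi and behavior policy b are given as state-by-action probability arrays (illustrative names only).

import numpy as np

def constant_alpha_mc(V, episode, alpha, gamma=1.0, pi=None, b=None):
    # episode: list of (S_t, A_t, R_{t+1}) for t = 0, ..., T-1, processed backward
    # so that G accumulates the return and rho accumulates rho_{t:T-1}.
    G, rho = 0.0, 1.0
    for s, a, r in reversed(episode):
        G = r + gamma * G                      # G_t
        if pi is None:                         # on-policy: plain return as the target
            V[s] += alpha * (G - V[s])
        else:                                  # off-policy: importance-weighted return
            rho *= pi[s, a] / b[s, a]          # rho_{t:T-1}
            V[s] += alpha * (rho * G - V[s])
    return V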
Difference between Q-learning and SARSA in pseudocode
[Figure: a timeline of agent-environment interaction starting at t = 0 and running along "Time t →", repeating the steps "choose action" and "take action & observe" at every time step.]
Where would you put SARSA updates? Where would you put Q-learning updates?
Sarsa (on-policy TD control) for estimating Q ≈ q*
Algorithm parameters: step size α ∈ (0, 1], small ε > 0
Initialize Q(s, a), for all s ∈ S+, a ∈ A(s), arbitrarily except that Q(terminal, ·) = 0
Loop for each episode:
  Initialize S
  Choose A from S using policy derived from Q (e.g., ε-greedy)
  Loop for each step of episode:
    Take action A, observe R, S'
    Choose A' from S' using policy derived from Q (e.g., ε-greedy)
    Q(S, A) ← Q(S, A) + α [R + γ Q(S', A') − Q(S, A)]
    S ← S'; A ← A'
  until S is terminal

Under the assumption that all state-action pairs continue to be updated, and a variant of the usual stochastic approximation conditions on the sequence of step-size parameters, Q has been shown to converge with probability 1 to q*. The Q-learning algorithm is shown below in procedural form.

Q-learning (off-policy TD control) for estimating π ≈ π*
Algorithm parameters: step size α ∈ (0, 1], small ε > 0
Initialize Q(s, a), for all s ∈ S+, a ∈ A(s), arbitrarily except that Q(terminal, ·) = 0
Loop for each episode:
  Initialize S
  Loop for each step of episode:
    Choose A from S using policy derived from Q (e.g., ε-greedy)
    Take action A, observe R, S'
    Q(S, A) ← Q(S, A) + α [R + γ max_a Q(S', a) − Q(S, A)]
    S ← S'
  until S is terminal
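The same Q-learning loop as a Python sketch, under the same gym-style assumptions as the Sarsa sketch earlier (illustrative names only). Note that a single action variable suffices, because the target uses max_a Q(S', a) rather than the action actually taken next.

import numpy as np

def q_learning_episode(env, Q, alpha=0.1, gamma=1.0, eps=0.1, rng=None):
    # One episode of Q-learning; Q is a (num_states, num_actions) array updated in
    # place, and env.step(a) is assumed to return (next_state, reward, done).
    rng = rng if rng is not None else np.random.default_rng()
    num_actions = Q.shape[1]
    s = env.reset()
    done = False
    while not done:
        if rng.random() < eps:                 # epsilon-greedy behavior policy
            a = int(rng.integers(num_actions))
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, done = env.step(a)
        target = r + (0.0 if done else gamma * np.max(Q[s_next]))
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next
    return Q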
What happens if we switch the time of update for both?
Sarsa (on-policy TD control) for estimating Q ≈ q*
Algorithm parameters: step size α ∈ (0, 1], small ε > 0
Initialize Q(s, a), for all s ∈ S+, a ∈ A(s), arbitrarily except that Q(terminal, ·) = 0
Loop for each episode:
  Initialize S
  Choose A from S using policy derived from Q (e.g., ε-greedy)
  Loop for each step of episode:
    Take action A, observe R, S'
    Choose A' from S' using policy derived from Q (e.g., ε-greedy)
    Q(S, A) ← Q(S, A) + α [R + γ Q(S', A') − Q(S, A)]
    S ← S'; A ← A'
  until S is terminal
Q-learning (off-policy TD control) for estimating π ≈ π*
Algorithm parameters: step size α ∈ (0, 1], small ε > 0
Initialize Q(s, a), for all s ∈ S+, a ∈ A(s), arbitrarily except that Q(terminal, ·) = 0
Loop for each episode:
  Initialize S
  Loop for each step of episode:
    Choose A from S using policy derived from Q (e.g., ε-greedy)
    Take action A, observe R, S'
    Q(S, A) ← Q(S, A) + α [R + γ max_a Q(S', a) − Q(S, A)]
    S ← S'
  until S is terminal
Say we make the SARSA update only after taking the next action. Is it still the same algorithm, and is it still correct?
Say we make the Q-learning update just before taking the action. Is it still the same algorithm, and is it still correct?
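For the first of these two questions, here is a hedged sketch of a Sarsa variant in which each update is applied only after the next action has already been sent to the environment; choose_action is an assumed epsilon-greedy selection function, and env follows the same gym-style interface as before. Whether this variant is still the same algorithm, and still correct, is what the question asks you to decide.

def sarsa_update_after_acting(env, Q, choose_action, alpha=0.1, gamma=1.0):
    # Each transition is held in `pending` and its update is applied only on the
    # following step, i.e., after the next action A' has already been taken.
    s = env.reset()
    a = choose_action(Q, s)
    pending = None                    # (S, A, R, S', A') waiting for its update
    done = False
    while not done:
        s_next, r, done = env.step(a)                 # act first ...
        a_next = None if done else choose_action(Q, s_next)
        if pending is not None:                       # ... then apply the delayed update
            ps, pa, pr, pns, pna = pending
            Q[ps, pa] += alpha * (pr + gamma * Q[pns, pna] - Q[ps, pa])
        pending = (s, a, r, s_next, a_next)
        s, a = s_next, a_next
    ps, pa, pr, _, _ = pending                        # last update: terminal bootstrap is 0
    Q[ps, pa] += alpha * (pr - Q[ps, pa])
    return Q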
Modify the Q-learning Pseudocode minimally to get SARSA
Q-learning (off-policy TD control) for estimating π ≈ π*
Algorithm parameters: step size α ∈ (0, 1], small ε > 0
Initialize Q(s, a), for all s ∈ S+, a ∈ A(s), arbitrarily except that Q(terminal, ·) = 0
Loop for each episode:
  Initialize S
  Loop for each step of episode:
    Choose A from S using policy derived from Q (e.g., ε-greedy)
    Take action A, observe R, S'
    Q(S, A) ← Q(S, A) + α [R + γ max_a Q(S', a) − Q(S, A)]
    S ← S'
  until S is terminal
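One way to express the minimal modification in code, using the same gym-style assumptions as the earlier sketches (step(a) returning (next_state, reward, done); illustrative names). The lines marked CHANGED are the only differences from the Q-learning sketch.

import numpy as np

def sarsa_from_q_learning(env, Q, alpha=0.1, gamma=1.0, eps=0.1, rng=None):
    # Identical skeleton to the Q-learning sketch above.
    rng = rng if rng is not None else np.random.default_rng()
    num_actions = Q.shape[1]

    def eps_greedy(state):
        if rng.random() < eps:
            return int(rng.integers(num_actions))
        return int(np.argmax(Q[state]))

    s = env.reset()
    a = eps_greedy(s)                              # CHANGED: choose A before the loop
    done = False
    while not done:
        s_next, r, done = env.step(a)
        a_next = eps_greedy(s_next)                # CHANGED: choose A' from S' here
        # CHANGED: bootstrap on Q(S', A') instead of max_a Q(S', a)
        target = r + (0.0 if done else gamma * Q[s_next, a_next])
        Q[s, a] += alpha * (target - Q[s, a])
        s, a = s_next, a_next                      # CHANGED: carry A' forward as the next A
    return Q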
The equivalence of two SARSA algorithms breaks in the real world
Sarsa (on-policy TD control) for estimating Q ≈ q*
Algorithm parameters: step size α ∈ (0, 1], small ε > 0
Initialize Q(s, a), for all s ∈ S+, a ∈ A(s), arbitrarily except that Q(terminal, ·) = 0
Loop for each episode:
  Initialize S
  Choose A from S using policy derived from Q (e.g., ε-greedy)
  Loop for each step of episode:
    Take action A, observe R, S'
    Choose A' from S' using policy derived from Q (e.g., ε-greedy)
    Q(S, A) ← Q(S, A) + α [R + γ Q(S', A') − Q(S, A)]
    S ← S'; A ← A'
  until S is terminal
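One schematic way to read this claim (a sketch, with observe, choose, update, and act standing in for the corresponding pieces of the loop; all four are assumed callables, not a real interface): in a simulator the environment pauses while update runs, so the two orderings below produce identical behavior, but in the real world time keeps passing, so the first ordering delays every action by the update's computation time while the second does the same computation after the action has already gone out.

def standard_ordering_step(observe, choose, update, act):
    # observe -> choose A' -> update -> act: the update computation sits between
    # observing S' and emitting A', so it adds latency to the action.
    s_next, r = observe()
    a_next = choose(s_next)
    update(s_next, a_next)
    act(a_next)
    return s_next, a_next

def reordered_step(observe, choose, update, act):
    # observe -> choose A' -> act -> update: the same computation now runs while
    # the environment is already responding to A', off the action's critical path.
    s_next, r = observe()
    a_next = choose(s_next)
    act(a_next)
    update(s_next, a_next)
    return s_next, a_next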
[Textbook inset, Example 6.5 (Windy Gridworld): a standard gridworld with start and goal states and an upward crosswind through the middle of the grid that shifts the resultant next states upward.]
Question:
What will be an off-policy TD(0) update for qπ?
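A hedged sketch of two possible answers, both correcting the bootstrap for the fact that A_{t+1} is drawn from the behavior policy b rather than the target policy π: one uses a per-step importance sampling ratio, the other takes the expectation over π directly (the Expected Sarsa form). Both assume pi and b are state-by-action probability arrays; the names are illustrative.

import numpy as np

def off_policy_td0_is(Q, pi, b, s, a, r, s_next, a_next, alpha, gamma):
    # Importance sampling form: rho = pi(A'|S') / b(A'|S') reweights the bootstrap
    # so that its expectation under b equals the value of following pi from S'.
    rho = pi[s_next, a_next] / b[s_next, a_next]
    target = r + gamma * rho * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def off_policy_td0_expected(Q, pi, s, a, r, s_next, alpha, gamma):
    # Expected Sarsa form: bootstrap on the expected action value under pi at S',
    # which needs no ratio because A' is integrated out.
    target = r + gamma * float(np.dot(pi[s_next], Q[s_next]))
    Q[s, a] += alpha * (target - Q[s, a])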