CMPUT 397 Worksheet Monte Carlo February 9, 2021
1. (Exercise 5.4 S&B) The pseudocode for Monte Carlo ES is inefficient because, for each state-action pair, it maintains a list of all returns and repeatedly calculates their mean. How can we modify the algorithm to have incremental updates for each state-action pair?
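One way to think about the fix is the standard incremental mean update: keep a count of returns seen for each pair and shift the average toward each new return. A minimal Python sketch, assuming returns arrive one at a time; the names Q, N, and incremental_update are illustrative, not taken from the textbook's pseudocode.

# Hypothetical sketch of an incremental mean for a state-action pair.
# The textbook's pseudocode instead appends G to Returns(s, a) and re-averages.
from collections import defaultdict

Q = defaultdict(float)   # running average of returns for each (s, a)
N = defaultdict(int)     # number of returns observed for each (s, a)

def incremental_update(s, a, G):
    # Constant-time replacement for "append G and recompute the mean".
    N[(s, a)] += 1
    Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]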
2. (Exercise 5.5 S&B) Consider an MDP with a single nonterminal state s and a single action that transitions back to s with probability p and transitions to the terminal state with probability 1 − p. Let the rewards be +1 on all transitions, and let γ = 1. Suppose you observe one episode that lasts 10 steps, with a return of 10. What is the (every-visit) Monte Carlo estimator of the value of the nonterminal state s?
Every-Visit Monte Carlo prediction, for estimating V ≈ vπ

Input: a policy π to be evaluated
Initialize:
    V(s) ∈ R, arbitrarily, for all s ∈ S
    Returns(s) ← an empty list, for all s ∈ S

Loop forever (for each episode):
    Generate an episode following π: S0, A0, R1, S1, A1, R2, ..., S_{T-1}, A_{T-1}, R_T
    G ← 0
    Loop for each step of episode, t = T-1, T-2, ..., 0:
        G ← γG + R_{t+1}
        Append G to Returns(S_t)
        V(S_t) ← average(Returns(S_t))
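A minimal Python translation of the box above may help make the backward pass concrete. It assumes a helper generate_episode(pi) (not part of the worksheet) that returns the episode as a list of (state, action, reward) tuples.

# Sketch of Every-Visit Monte Carlo prediction, following the box above.
# generate_episode(pi) is assumed to return
# [(S_0, A_0, R_1), ..., (S_{T-1}, A_{T-1}, R_T)].
from collections import defaultdict

def every_visit_mc_prediction(pi, generate_episode, gamma, num_episodes):
    V = defaultdict(float)        # V(s), initialized arbitrarily (here to 0)
    returns = defaultdict(list)   # Returns(s), an empty list for all s

    for _ in range(num_episodes):
        episode = generate_episode(pi)
        G = 0.0
        # Work backwards through the episode: t = T-1, T-2, ..., 0
        for S_t, A_t, R_tp1 in reversed(episode):
            G = gamma * G + R_tp1
            returns[S_t].append(G)   # every visit: no first-visit check
            V[S_t] = sum(returns[S_t]) / len(returns[S_t])
    return V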
3. Off-policy Monte Carlo prediction allows us to use sample trajectories to estimate the value function for a policy that may be different from the one used to generate the data. Consider the MDP shown below, which has two states B and C, one action in state B and two actions in state C, and γ = 1.0. Assume the target policy π has π(A = 1|C) = 0.9 and π(A = 2|C) = 0.1, and that the behaviour policy b has b(A = 1|C) = 0.25 and b(A = 2|C) = 0.75.
(a) What are the true values vπ?
(b) Imagine you got to execute π in the environment for one episode, and observed the episode trajectory S0 = B,A0 = 1,R1 = 1,S1 = C,A1 = 1,R2 = 1. What is the return for B for this episode? Additionally, what are the value estimates Vπ, using this one episode with Monte Carlo updates?
(c) But, you do not actually get to execute π; the agent follows the behaviour policy b. Instead, you get one episode generated by following b, and observe the episode trajectory S0 = B, A0 = 1, R1 = 1, S1 = C, A1 = 2, R2 = 10. What is the return for B for this episode? Notice that this is a return for the behaviour policy, and using it with Monte Carlo updates (without importance sampling ratios) would give you value estimates for b.
(d) But, we do not actually want to estimate the values for the behaviour policy b; we want to estimate the values for π. So, we need to use importance sampling ratios for this return. What is the return for B using this episode, but now with importance sampling ratios? Additionally, what is the resulting value estimate for Vπ using this return? (A general sketch of this weighting is given after the diagram below.)
[MDP diagram: the single action in B yields reward +1 and transitions to C; in C, action 1 yields reward +1 and action 2 yields reward +10, both transitioning to the terminal state T.]
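For part (d), the general recipe is to weight the behaviour-policy return by the product of per-step ratios π(At|St)/b(At|St). Below is a minimal Python sketch of that weighting for a full-episode (ordinary importance sampling) return; the function name and the dict-of-dicts policy representation are assumptions for illustration rather than part of the worksheet.

# Sketch of an importance-sampled return for off-policy Monte Carlo prediction.
# pi and b map a state to a dict of action probabilities, e.g. pi["C"][1] = 0.9.
# episode is a list of (S_t, A_t, R_{t+1}) tuples; gamma is the discount factor.
def importance_sampled_return(episode, pi, b, gamma):
    G = 0.0    # return observed under the behaviour policy
    rho = 1.0  # product of per-step ratios pi(A_t | S_t) / b(A_t | S_t)
    for t, (S_t, A_t, R_tp1) in enumerate(episode):
        G += (gamma ** t) * R_tp1
        rho *= pi[S_t][A_t] / b[S_t][A_t]
    return rho * G   # weighted return usable as a Monte Carlo target for S_0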