CMPUT 397 Worksheet MDPs January 20, 2021
1. Suppose γ = 0.9 and the reward sequence is R_1 = 2, R_2 = −2, R_3 = 0, followed by an infinite sequence of 7s. What are G_1 and G_0?
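One way to check a hand calculation is the recursion G_t = R_{t+1} + γ G_{t+1}, starting from the closed-form value of the constant tail. The sketch below assumes the standard definition of the discounted return; the function name `returns_with_constant_tail` and its arguments are illustrative and not part of the worksheet.

```python
def returns_with_constant_tail(prefix_rewards, tail_reward, gamma):
    """Return [G_0, G_1, ..., G_n] for rewards R_1..R_n = prefix_rewards,
    followed by an infinite sequence of tail_reward.

    Assumes 0 <= gamma < 1 so the geometric tail converges.
    """
    # After the prefix every reward is tail_reward, so
    # G_n = tail_reward * (1 + gamma + gamma^2 + ...) = tail_reward / (1 - gamma).
    g = tail_reward / (1.0 - gamma)
    returns = [g]
    # Work backwards with G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(prefix_rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


gs = returns_with_constant_tail([2.0, -2.0, 0.0], tail_reward=7.0, gamma=0.9)
g0, g1 = gs[0], gs[1]
print(round(g0, 4), round(g1, 4))
```

Working backwards from the tail avoids truncating the infinite sum, so the result is exact up to floating-point rounding.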
2. Assume you have a bandit problem with 4 actions, where the agent can see rewards from the set R = {−3.0, −0.1, 0, 4.2}. Assume you have the probabilities for rewards for each action: p(r|a) for a ∈ {1, 2, 3, 4} and r ∈ {−3.0, −0.1, 0, 4.2}. How can you write this problem as an MDP? Remember that an MDP consists of (S, A, R, P, γ).
More abstractly, recall that a Bandit problem consists of a given action space A = {1,…,k} (the k arms) and the distribution over rewards p(r|a) for each action a ∈ A. Specify an MDP that corresponds to this Bandit problem.
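One possible construction, sketched below, uses a single state that every action returns to, so that the dynamics reduce to p(s', r | s, a) = p(r | a). The state name `"s"`, the helper `bandit_to_mdp`, and the uniform reward probabilities are assumptions made for illustration; the worksheet does not give numeric values for p(r|a). The choice γ = 0 is also an assumption: in this single-state continuing MDP the value of γ ∈ [0, 1) does not change which action is best.

```python
STATE = "s"  # hypothetical name for the single state


def bandit_to_mdp(reward_probs, gamma=0.0):
    """Build (S, A, R, P, gamma) from reward_probs[a][r] = p(r | a).

    P is a dict mapping (s, a) to a dict {(s_next, r): probability}.
    """
    states = [STATE]
    actions = sorted(reward_probs)
    rewards = sorted({r for dist in reward_probs.values() for r in dist})
    dynamics = {}
    for a, dist in reward_probs.items():
        total = sum(dist.values())
        if abs(total - 1.0) > 1e-9:
            raise ValueError(f"p(r | a={a}) sums to {total}, expected 1")
        # Every action leads back to the same state, so the reward
        # distribution alone determines the transition dynamics.
        dynamics[(STATE, a)] = {(STATE, r): p for r, p in dist.items()}
    return states, actions, rewards, dynamics, gamma


# Made-up uniform probabilities over the worksheet's reward set.
reward_set = [-3.0, -0.1, 0.0, 4.2]
example_probs = {a: {r: 0.25 for r in reward_set} for a in (1, 2, 3, 4)}

S, A, R, P, gamma = bandit_to_mdp(example_probs)
```

An alternative with the same effect is an episodic MDP in which every action moves from the start state to a terminal state.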
3. Prove that the discounted sum of rewards is always finite if the rewards are bounded: |R_{t+1}| ≤ R_max for all t, for some finite R_max > 0. That is, show that

$$\sum_{i=0}^{\infty} \gamma^{i} R_{t+1+i} < \infty \quad \text{for } \gamma \in [0, 1).$$
Hint: Recall the triangle inequality, |a + b| ≤ |a| + |b|.
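A sketch of one possible argument, bounding the partial sums first and then letting n grow. The use of partial sums and the finite geometric sum formula are choices made here, not steps prescribed by the worksheet.

```latex
\begin{align*}
\left| \sum_{i=0}^{n} \gamma^{i} R_{t+1+i} \right|
  &\le \sum_{i=0}^{n} \gamma^{i} \left| R_{t+1+i} \right|
     && \text{(triangle inequality, applied repeatedly)} \\
  &\le R_{\max} \sum_{i=0}^{n} \gamma^{i}
     && \text{(since } |R_{t+1+i}| \le R_{\max} \text{)} \\
  &= R_{\max} \, \frac{1 - \gamma^{n+1}}{1 - \gamma}
     && \text{(finite geometric sum, } \gamma \in [0, 1) \text{)} \\
  &\le \frac{R_{\max}}{1 - \gamma} < \infty .
\end{align*}
```

The bound does not depend on n. The partial sums of γ^i |R_{t+1+i}| are nondecreasing and bounded above, so the series converges absolutely, and the infinite sum therefore exists and satisfies the same bound R_max / (1 − γ).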