CMPUT 397 Worksheet MDPs January 20, 2021
1. Suppose γ = 0.9 and the reward sequence is R1 = 2, R2 = −2, R3 = 0 followed by an infinite sequence of 7s. What are G1 and G0?
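One way to check an answer to this question numerically is sketched below. It assumes the usual return definition G_t = R_{t+1} + γ G_{t+1} and uses the geometric series for the constant tail. The function name `returns_with_constant_tail` and its arguments are illustrative and do not come from the worksheet.

```python
def returns_with_constant_tail(rewards, tail_reward, gamma):
    """Return [G_0, ..., G_n] for rewards R_1..R_n followed forever by tail_reward.

    Assumes 0 <= gamma < 1 so that the infinite tail has a finite sum.
    """
    # Return at the point where only tail rewards remain:
    # tail_reward * (1 + gamma + gamma^2 + ...) = tail_reward / (1 - gamma).
    g = tail_reward / (1.0 - gamma)
    out = [g]
    # Backward recursion G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    out.reverse()
    return out


# R1 = 2, R2 = -2, R3 = 0, then 7 forever, with gamma = 0.9.
gs = returns_with_constant_tail([2.0, -2.0, 0.0], 7.0, 0.9)
print("G0 =", round(gs[0], 4))
print("G1 =", round(gs[1], 4))
```

Working backward from the tail avoids summing an infinite series term by term: only the first three rewards need explicit handling.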

2. Assume you have a bandit problem with 4 actions, where the agent can see rewards from the set R = {−3.0, −0.1, 0, 4.2}. Assume you have the probabilities for rewards for each action: p(r|a) for a ∈ {1, 2, 3, 4} and r ∈ {−3.0, −0.1, 0, 4.2}. How can you write this problem as an MDP? Remember that an MDP consists of (S, A, R, P, γ).
More abstractly, recall that a Bandit problem consists of a given action space A = {1,…,k} (the k arms) and the distribution over rewards p(r|a) for each action a ∈ A. Specify an MDP that corresponds to this Bandit problem.
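One possible construction, sketched here under assumptions, uses a single state that every action returns to, so the dynamics reduce to p(s', r | s, a) = p(r | a). The probability values, the state name `"s0"`, the helper `bandit_to_mdp`, and the choice γ = 0 are all hypothetical placeholders: the worksheet does not give numbers for p(r | a), and other choices of γ can also be argued for.

```python
ACTIONS = [1, 2, 3, 4]
REWARDS = [-3.0, -0.1, 0.0, 4.2]
STATE = "s0"  # hypothetical name for the single state

# Hypothetical p(r | a), one row per action, ordered like REWARDS.
# These numbers are placeholders, not taken from the worksheet.
P_R_GIVEN_A = {
    1: [0.1, 0.2, 0.3, 0.4],
    2: [0.25, 0.25, 0.25, 0.25],
    3: [0.0, 0.5, 0.5, 0.0],
    4: [0.7, 0.1, 0.1, 0.1],
}


def bandit_to_mdp(actions, rewards, p_r_given_a, gamma=0.0):
    """Build a one-state MDP (S, A, R, P, gamma) from bandit reward distributions."""
    dynamics = {}
    for a in actions:
        for r, prob in zip(rewards, p_r_given_a[a]):
            # Key is (s', r, s, a); the next state is always the single state.
            dynamics[(STATE, r, STATE, a)] = prob
    return {"S": [STATE], "A": list(actions), "R": list(rewards),
            "P": dynamics, "gamma": gamma}


mdp = bandit_to_mdp(ACTIONS, REWARDS, P_R_GIVEN_A)
print(len(mdp["P"]), "entries in p(s', r | s, a)")
```

Because there is only one state, an action never changes what the agent faces next, which is what distinguishes a bandit from a general MDP.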

3. Prove that the discounted sum of rewards is always finite if the rewards are bounded: |R_{t+1}| ≤ R_max for all t, for some finite R_max > 0. That is, show that
\[ \left| \sum_{i=0}^{\infty} \gamma^i R_{t+1+i} \right| < \infty \quad \text{for } \gamma \in [0, 1). \]
Hint: Recall that |a + b| ≤ |a| + |b|.
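One possible line of argument, offered as a sketch and not as the intended solution, bounds every finite partial sum by applying the triangle inequality |a + b| ≤ |a| + |b| repeatedly and then using the geometric series. The index n for the partial sum is introduced here for illustration.

```latex
\begin{align*}
\left| \sum_{i=0}^{n} \gamma^i R_{t+1+i} \right|
  &\le \sum_{i=0}^{n} \gamma^i \left| R_{t+1+i} \right|
     && \text{(triangle inequality, } \gamma^i \ge 0\text{)} \\
  &\le R_{\max} \sum_{i=0}^{n} \gamma^i
     && \text{(bounded rewards)} \\
  &= R_{\max} \, \frac{1 - \gamma^{n+1}}{1 - \gamma}
     && \text{(finite geometric series)} \\
  &\le \frac{R_{\max}}{1 - \gamma} < \infty
     && \text{(since } \gamma \in [0, 1)\text{)}
\end{align*}
```

The bound does not depend on n. The series of absolute values has nonnegative terms and bounded partial sums, so it converges, the original series converges absolutely, and its limit satisfies the same bound R_max / (1 − γ).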