
1. (a)
Suppose a Q-learning agent, with a fixed learning rate α and discount factor γ, was in state 34, took action 7, received reward 3, and ended up in state 65. Which value(s) get updated? Give an expression for the new value. (2 %)
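As a reminder, the sketch below is the generic one-step Q-learning update for a single transition (s, a, r, s_next), which part (a) asks you to instantiate for the given numbers. It is my own helper, not part of the question; Q is a dict keyed by (state, action) and `actions` is an assumed function returning the actions available in a state.

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    # Value of the best action in the next state (0 if s_next is terminal).
    best_next = max((Q[(s_next, a2)] for a2 in actions(s_next)), default=0.0)
    # Move Q(s, a) toward the one-step TD target r + gamma * max_a' Q(s_next, a').
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```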
(b) In Q-learning, we look at a window (s_t, a_t, r_t, s_{t+1}) to update our Q-values. One can think of using an update rule that uses a larger window to update these values. Give an update rule for Q(s_t, a_t) given the window (s_t, a_t, r_t, s_{t+1}, a_{t+1}, r_{t+1}, s_{t+2}). (3 %)

2. With no ghosts around, Pacman can eat as many dots as he wants. He is in the 5 × 1 grid shown below. The cells are numbered from left to right as 1, …, 5. In cells 1 through 4, the actions available to him are to move Right (R) or to Fly (F) out of the bonus level. The action Right deterministically lands Pacman in the cell to the right (and he eats the dot there), while the Fly action deterministically lands him in a terminal state and ends the game. From cell 5, Fly is the only action. Eating a dot gives a reward of 10, while flying out gives a reward of 20. Pacman starts in the leftmost cell (cell 1).
We write this as an MDP where the state is the cell that Pacman is in. The discount is γ. Consider the following 3 policies:
π0(s) = F for all s
π1(s) = R if s ≤ 3, F otherwise
π2(s) = R if s ≤ 4, F otherwise
Does there exist a value for γ such that π2 is strictly better than π0 and π1? If yes, give a value for γ (10 %).
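A quick way to test a candidate γ is to compute each policy's discounted return from cell 1 directly. The sketch below is my own helper (the names `policy_return`, DOT_REWARD, FLY_REWARD are not from the assignment); it encodes Right as a deterministic move with a dot reward of 10 and Fly as a terminal action with reward 20, as described above.

```python
DOT_REWARD = 10   # reward for eating a dot when moving Right
FLY_REWARD = 20   # reward for flying out of the bonus level

def policy_return(policy, gamma, start=1):
    """Discounted return of a deterministic policy from the start cell."""
    s, t, total = start, 0, 0.0
    while True:
        a = policy(s)
        if a == "F" or s == 5:                   # Fly ends the episode
            return total + (gamma ** t) * FLY_REWARD
        total += (gamma ** t) * DOT_REWARD       # Right: eat the dot in cell s+1
        s, t = s + 1, t + 1

pi0 = lambda s: "F"
pi1 = lambda s: "R" if s <= 3 else "F"
pi2 = lambda s: "R" if s <= 4 else "F"

for gamma in (0.5, 0.9, 1.0):                    # trial discount values
    print(gamma, [policy_return(p, gamma) for p in (pi0, pi1, pi2)])
```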

3. Consider a 5 × 5 gridworld. The agent can move up, right, down or left. A prize can appear at one of the corners. Suppose that the agent steps through the state space in the order given in the diagram below (i.e., going from s1 to s2 to s3 to s4 to s5), each time taking the "right" action. In this diagram, the right grid represents the states where there is a treasure in the top-right position, and the left grid represents the states where there is no treasure.
You can assume that this is the first time the robot has visited any of these states. All Q-values are initialized to zero. Assume that the discount factor γ = 0.9.
(a) Suppose the agent received a reward of -10 on entering state s3 and a reward of +10 on entering state s5, and no other rewards. Which Q-values are updated during Sarsa based on this sequence of experiences? What values do they get assigned? Assume α = 0.5. (5 %)
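If it helps to organize the arithmetic, the sketch below applies the standard Sarsa update Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)] along a fixed list of transitions. It is my own helper, not part of the question: the labels s1–s5 are treated as opaque state names (the treasure/no-treasure distinction from the diagram is ignored), and the next action taken in s5 is a placeholder, which is harmless here because all Q-values start at zero.

```python
from collections import defaultdict

def sarsa_trace(transitions, Q=None, alpha=0.5, gamma=0.9):
    """transitions: list of (s, a, r, s_next, a_next) tuples, in order."""
    Q = defaultdict(float) if Q is None else Q
    for s, a, r, s_next, a_next in transitions:
        td_target = r + gamma * Q[(s_next, a_next)]
        Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q

# Part (a): reward -10 on entering s3, +10 on entering s5, all other rewards 0.
episode = [("s1", "right",   0, "s2", "right"),
           ("s2", "right", -10, "s3", "right"),
           ("s3", "right",   0, "s4", "right"),
           ("s4", "right", +10, "s5", "right")]   # last next-action is a placeholder
Q = sarsa_trace(episode)
print({k: v for k, v in Q.items() if v != 0.0})   # only the changed entries
```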

(b) Suppose that, at some later time in the same run of Sarsa, the robot revisits the same states, s1 to s2 to s3 to s4 to s5, and hasn't visited any of these states in between (i.e., this is the second time it visits any of these states). Suppose this time the agent receives a reward of +10 on entering state s5, and every other reward is zero. Assume α = 0.5. Which Q-values have their values changed? What are their new values? (5 %)

4. Consider the following mini-grid (rewards shown on the left, state names shown on the right). A is the start state and the double-rectangle states are exit states. From an exit state, the only action available is Exit, which results in the listed reward and ends the game (by moving into a terminal state X, not shown). From non-exit states, the agent can choose either the Left or the Right action, which moves the agent in the corresponding direction. There are no living rewards; the only non-zero rewards come from exiting the grid.
Assume γ = 1. We observe the following transition sequence (recall that state X is the end-of-game absorbing state):
(a) After this sequence of transitions, if we use a learning rate of α = 0.5, what would temporal difference learning learn for the value of A? Remember that V(s) is initialized with 0 for all s. (5 %)
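For the TD calculations in this question, here is a minimal sketch of the tabular TD(0) update V(s) ← V(s) + α [r + γ V(s′) − V(s)] applied over an observed list of transitions. The observed episode itself is specified in the figure and is not reproduced here; `td0_trace` is my own name, not part of the question.

```python
from collections import defaultdict

def td0_trace(transitions, V=None, alpha=0.5, gamma=1.0):
    """transitions: list of (s, r, s_next); V is updated in place and returned."""
    V = defaultdict(float) if V is None else V
    for s, r, s_next in transitions:
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return V
```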

(b) If these transitions repeated many times and learning rates were appropriately small for convergence, what would temporal difference learning converge to for the value of A? (5 %)
(c) After this sequence of transitions, if we use a learning rate of α = 0.5, what would Q-learning learn for the Q-value of (A,Right)? Remember that Q(s,a) is initialized with 0 for all (s,a). (5 %)

(d) If these transitions repeated many times and learning rates were appropriately small for convergence, what would Q-learning converge to for the Q-value of (A, Right)? (5 %)

5. Consider an MDP modeling a hurdle race track, shown below. There is a single hurdle in square D. The terminal state is G. The agent can run left or right. If the agent is in square C, it cannot run right. Instead, it can jump, which either results in a fall to the hurdle square D, or a successful hurdle jump to square E. Rewards are shown below. Assume a discount of γ = 1.
Actions:
• right: Deterministically move to the right.
• left: Deterministically move to the left.
• jump: Stochastically jump to the right. This action is available from square C only.
  P(C, jump, E) = 0.5 (jump succeeds)
  P(C, jump, D) = 0.5 (jump fails)
Note: The agent cannot use right from square C.
(a) Perform two iterations of value iteration. The values of all states are initialized to zero. Fill in the results in the table below (10 %). A generic code sketch of the sweeps follows the table.
      |  A  |  B  |  C  |  D  |  E  |  F  |  G
  V1  |     |     |     |     |     |     |
  V2  |     |     |     |     |     |     |
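As a rough aid for part (a), here is a generic synchronous value-iteration sweep in Python. The transition table is my reconstruction of the track from the problem text, and `reward` is a placeholder (the actual rewards are given in the figure), so treat this as a sketch rather than the solution.

```python
GAMMA = 1.0
TERMINAL = "G"

# transitions[s][a] = list of (probability, next_state); reconstructed from the text.
transitions = {
    "A": {"right": [(1.0, "B")]},
    "B": {"left": [(1.0, "A")], "right": [(1.0, "C")]},
    "C": {"left": [(1.0, "B")], "jump": [(0.5, "E"), (0.5, "D")]},  # no 'right' in C
    "D": {"left": [(1.0, "C")], "right": [(1.0, "E")]},
    "E": {"left": [(1.0, "D")], "right": [(1.0, "F")]},
    "F": {"left": [(1.0, "E")], "right": [(1.0, TERMINAL)]},
    TERMINAL: {},            # terminal state: no actions, value stays 0
}

def reward(s, a, s2):
    """Placeholder: substitute the rewards shown in the figure."""
    return 0.0

V = {s: 0.0 for s in transitions}
for k in (1, 2):             # two sweeps, matching part (a)
    # Synchronous backup: the comprehension reads the previous sweep's V.
    V = {s: max((sum(p * (reward(s, a, s2) + GAMMA * V[s2]) for p, s2 in outs)
                 for a, outs in acts.items()), default=0.0)
         for s, acts in transitions.items()}
    print(f"V{k}:", V)
```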

(b) Fill in the blank cells of the table below with the Q-values that result from applying the Q-learning update for the 4 transitions specified by the episode below. You may leave Q-values that are unaffected by the current update blank. Use a learning rate α of 0.5. Assume all Q-values are initialized to 0. Show the calculation details for the Q-values that change after each update (10 %).

6. Consider an infinite-horizon, discounted MDP (S, A, T, R, γ). Suppose that the transition probabilities and the reward function have the following form:
T(s, a, s′) = P(s′ | f(s, a)),
R(s, a, s′) = R(s, a).
Here, f is some deterministic function mapping S × A → Y, where Y is a set of states called post-decision states. We will use the letter y to denote an element of Y, i.e., a post-decision state. In words, the state transitions consist of two steps: a deterministic step that depends on the action, and a stochastic step that does not depend on the action. The sequence of states (s_t), actions (a_t), post-decision states (y_t), and rewards (r_t) is illustrated below.
(s_0, a_0) --f--> y_0 --P--> (s_1, a_1) --f--> y_1 --P--> (s_2, a_2) --f--> …
(with rewards r_0, r_1, r_2, … received along the way)
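To make the two-step structure concrete, here is a small sketch of how a single environment transition would decompose in code. `f`, `sample_P`, and `R` stand for the problem's f(s, a), P(s′ | y), and R(s, a), and are assumed to be provided; this is my illustration, not part of the question.

```python
def step(s, a, f, sample_P, R):
    """One transition of a post-decision-state MDP."""
    y = f(s, a)            # deterministic, action-dependent: y_t = f(s_t, a_t)
    s_next = sample_P(y)   # stochastic, action-independent: s_{t+1} ~ P(. | y_t)
    r = R(s, a)            # reward depends only on (s_t, a_t)
    return y, s_next, r
```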
You have learned about Vπ(s), which is the expected discounted sum of rewards, starting from state s, when acting according to policy π:

Vπ(s_0) = E[R(s_0, a_0) + γ R(s_1, a_1) + γ² R(s_2, a_2) + …],

given a_t = π(s_t) for t = 0, 1, 2, …. V∗(s) is the value function of the optimal policy, V∗(s) = max_π Vπ(s).
This question will explore the concept of computing value functions on the post-decision states y.
Wπ(y_0) = E[R(s_1, a_1) + γ R(s_2, a_2) + γ² R(s_3, a_3) + …], where we define W∗(y) = max_π Wπ(y).
(1) Write W∗ in terms of V∗. (7 %) (Hint: consider expressing Wπ in terms of Vπ first).

(2) Write V∗ in terms of W∗. (7 %)
(3) Recall that the optimal value function V∗ satisfies the Bellman equation:

V∗(s) = max_a Σ_{s′} T(s, a, s′) [ R(s, a) + γ V∗(s′) ],

which can also be used as an update equation to compute V∗. Please provide the equivalent of the Bellman equation for W∗. (6 %)

7. Read Section 10.3 of the textbook, on function approximation in the average-reward setting, then solve the following two problems.
a. (Exercise 10.4) Give pseudo-code for a differential version of semi-gradient Q- learning. (8 %)
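For reference, below is a rough Python sketch of differential semi-gradient Sarsa with a linear action-value approximation, in the spirit of Section 10.3; the exercise asks you to adapt the bootstrap term to the Q-learning (max over next actions) case. The environment interface (`reset()` returning a state, `step(a)` returning a (next state, reward) pair) and the feature function are assumptions of this sketch, not the textbook's notation.

```python
import numpy as np

def differential_semi_gradient_sarsa(env, features, n_actions,
                                     alpha=0.01, beta=0.01, eps=0.1,
                                     n_steps=100_000,
                                     rng=np.random.default_rng(0)):
    dim = len(features(env.reset(), 0))
    w = np.zeros(dim)          # weights of the linear q-hat
    r_bar = 0.0                # estimate of the average reward

    def q(s, a):
        return features(s, a) @ w

    def policy(s):             # epsilon-greedy with respect to the current q-hat
        if rng.random() < eps:
            return int(rng.integers(n_actions))
        return int(np.argmax([q(s, a) for a in range(n_actions)]))

    s = env.reset()
    a = policy(s)
    for _ in range(n_steps):
        s_next, r = env.step(a)
        a_next = policy(s_next)
        delta = r - r_bar + q(s_next, a_next) - q(s, a)   # differential TD error
        r_bar += beta * delta                             # update average-reward estimate
        w += alpha * delta * features(s, a)               # semi-gradient weight update
        s, a = s_next, a_next
    return w, r_bar
```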

b. (Exercise 10.5) What equations are needed (beyond 10.10, i.e., the differential form of the TD error using v̂) to specify the differential version of TD(0)? (7 %)