CMPUT 397 Worksheet 4 January 25, 2021
1. (Exercise 3.12 in 2nd ed.) Recall that the value v_π(s) for state s when following policy π is the expected total reward (or discounted reward) the agent would receive when starting from state s and executing policy π. How can we write v_π(s) in terms of the action values q_π(s,a)?
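One way to start is to recall how both value functions are defined as expectations of the return G_t; writing the two definitions side by side makes the relationship easier to see:

    v_π(s)   = E_π[ G_t | S_t = s ]
    q_π(s,a) = E_π[ G_t | S_t = s, A_t = a ]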
2. In this question, you will take a word specification of an MDP, write down the formal terms, and determine the optimal policy. Suppose you have a problem with two actions. The agent always starts in the same state, s0. From this state, if it takes action 1 it transitions to a new state s1 and receives reward 10; if it takes action 2 it transitions to a new state s2 and receives reward 5. From s1, if it takes action 1 it receives a reward of 5 and terminates; if it takes action 2 it receives a reward of 10 and terminates. From s2, if it takes action 1 it receives a reward of 10 and terminates; if it takes action 2 it receives a reward of 5 and terminates. Assume the agent cares equally about long-term and immediate reward.
(a) Draw the MDP for this problem. Is it an episodic or continuing problem? What is γ?
(b) Assume the policy is π(a = 1|s_i) = 0.3 for all s_i ∈ {s0, s1, s2}. What is π(a = 2|s_i)? And what is the value function for this policy? In other words, find v_π(s) for all three states (a numerical check is sketched after part (c)).
(c) What is the optimal policy in this environment?
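The following is a minimal Python sketch for checking the hand calculation in part (b); the dictionary encoding of the transitions and the variable names are illustrative choices, not part of the problem statement. It evaluates the fixed 30/70 policy by working backward from the terminal states.

    # Evaluate pi(a=1|s) = 0.3 on the MDP described above.
    # gamma = 1 since the agent weighs long-term and immediate reward equally
    # and the problem is episodic.
    gamma = 1.0
    pi_a1 = 0.3            # probability of taking action 1 in every state
    pi_a2 = 1.0 - pi_a1    # probability of taking action 2

    # From the problem statement: (state, action) -> (reward, next state),
    # where 'T' marks termination (value 0).
    transitions = {
        ('s0', 1): (10, 's1'),
        ('s0', 2): (5,  's2'),
        ('s1', 1): (5,  'T'),
        ('s1', 2): (10, 'T'),
        ('s2', 1): (10, 'T'),
        ('s2', 2): (5,  'T'),
    }

    v = {'T': 0.0}
    # s1 and s2 lead only to termination, so evaluate them first, then s0.
    for s in ['s1', 's2', 's0']:
        r1, ns1 = transitions[(s, 1)]
        r2, ns2 = transitions[(s, 2)]
        v[s] = pi_a1 * (r1 + gamma * v[ns1]) + pi_a2 * (r2 + gamma * v[ns2])

    print(v)  # v_pi for s1, s2, s0 under the 30/70 policy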
3. Consider the gridworld and value function in the figure below. Using your knowledge of the transition dynamics and the values (the numbers in each grid cell), write down the policy corresponding to taking the greedy action with respect to the values in each state. Create a grid with the same dimensions as the figure and draw an arrow in each square denoting the greedy action.
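Since the figure is not reproduced here, the sketch below uses a hypothetical value grid and assumes the usual deterministic gridworld dynamics (four moves; bumping into the boundary leaves the agent in place). It only illustrates the mechanics of reading a greedy policy off a table of state values by comparing the values of the possible next states.

    import numpy as np

    # Hypothetical state values, NOT the figure from the worksheet.
    V = np.array([
        [0.0, 1.0, 2.0, 3.0],
        [1.0, 2.0, 3.0, 4.0],
        [2.0, 3.0, 4.0, 5.0],
    ])

    actions = {'U': (-1, 0), 'D': (1, 0), 'L': (0, -1), 'R': (0, 1)}

    def greedy_policy(V):
        """For each cell, pick the action whose successor cell has the highest value."""
        rows, cols = V.shape
        policy = np.empty(V.shape, dtype='<U1')
        for r in range(rows):
            for c in range(cols):
                best_a, best_v = None, -np.inf
                for a, (dr, dc) in actions.items():
                    nr, nc = r + dr, c + dc
                    # Moves off the grid leave the agent where it is.
                    if not (0 <= nr < rows and 0 <= nc < cols):
                        nr, nc = r, c
                    if V[nr, nc] > best_v:
                        best_a, best_v = a, V[nr, nc]
                policy[r, c] = best_a
        return policy

    print(greedy_policy(V))

(A full greedy policy would compare r + γ·v(s') for each action; when every step earns the same reward, comparing v(s') alone gives the same arrows.)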
4. (Exercise 3.22 in 2nd ed.) Consider the continuing MDP shown below. The only decision to be made is in the top state, where two actions are available, left and right. The numbers show the rewards that are received deterministically after each action.
(a) List and describe all the possible policies in this MDP.
(b) Is the following policy valid for this MDP (i.e., does it fit our definition of a policy): choose left for five steps, then right for five steps, then left for five steps, and so on? Explain your answer.
(c) What policy is optimal if γ = 0? If γ = 0.9? If γ = 0.5?
(d) For each possible policy, what is the value of state s? Write down the numeric value to two decimal places. Hint: write down the return under each policy starting in state s (don't forget γ). Simplify the infinite sum, using the fact that many rewards are zero. Then plug in the rewards and γ and compute the number.
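To make the hint in part (d) concrete: the actual reward numbers come from the figure, but supposing a policy produces a deterministic reward sequence that repeats every two steps, say r1, r2, r1, r2, ..., the infinite sum starting from s simplifies as a geometric series:

    G = r1 + γ·r2 + γ^2·r1 + γ^3·r2 + ...
      = (r1 + γ·r2)·(1 + γ^2 + γ^4 + ...)
      = (r1 + γ·r2) / (1 - γ^2).

Plugging in the two rewards along each policy's cycle and the given γ then yields v(s) as a single number.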