Problem 1 (12 marks)
Re-implement (e.g. in Matlab or Python) the results presented in Figure 2.2 of the Sutton & Barto book comparing a greedy method with two 𝜀-greedy methods (𝜀 = 0.01 and 𝜀 = 0.1) on the 10-armed testbed, and present your code and results. Include a discussion of the exploration–exploitation dilemma in relation to your findings.
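For orientation, here is a minimal Python sketch of the kind of testbed simulation this problem asks for, assuming stationary arms with Gaussian rewards as in the book; the function name, run counts and seed are illustrative choices, and only the average-reward curve is tracked (the %-optimal-action curve of Figure 2.2 would be handled analogously). The greedy method corresponds to 𝜀 = 0.

```python
import numpy as np

def run_bandit(epsilon, n_arms=10, n_steps=1000, n_runs=2000, rng=None):
    """Average reward per step of an epsilon-greedy agent on the 10-armed testbed."""
    rng = rng or np.random.default_rng(0)
    avg_reward = np.zeros(n_steps)
    for _ in range(n_runs):
        q_true = rng.normal(0.0, 1.0, n_arms)       # true action values q*(a)
        q_est = np.zeros(n_arms)                    # sample-average estimates Q(a)
        counts = np.zeros(n_arms)
        for t in range(n_steps):
            if rng.random() < epsilon:
                a = rng.integers(n_arms)            # explore: random arm
            else:
                a = int(np.argmax(q_est))           # exploit: greedy arm
            r = rng.normal(q_true[a], 1.0)          # reward ~ N(q*(a), 1)
            counts[a] += 1
            q_est[a] += (r - q_est[a]) / counts[a]  # incremental sample average
            avg_reward[t] += r
    return avg_reward / n_runs

# e.g. curves = {eps: run_bandit(eps) for eps in (0.0, 0.01, 0.1)}
```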
Problem 2 (8 marks)
Consider a Markov Decision Process (MDP) with states 𝑆 = {4,3,2,1,0}, where 4 is the starting state. In states 𝑘 ≥ 1 you can walk (W) and 𝑇(𝑘, 𝑊, 𝑘 − 1) = 1. In states 𝑘 ≥ 2 you can also jump (J) and
𝑇(𝑘, 𝐽, 𝑘 − 2) = 3/4 and 𝑇(𝑘, 𝐽, 𝑘) = 1/4. State 0 is a terminal state. The reward 𝑅(𝑠, 𝑎, 𝑠′) = (𝑠 − 𝑠′)² for all (𝑠, 𝑎, 𝑠′). Use a discount of 𝛾 = 1⁄2. Compute both 𝑉∗(2) and 𝑄∗(3, 𝐽). Clearly show how you computed these values.
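If you want to cross-check your hand computation, a small value-iteration sketch over this MDP could look as follows; the helper names, number of sweeps, and the way the dynamics are encoded are my own choices, and the hand derivation is still expected.

```python
# Value-iteration cross-check for the 5-state MDP above.
GAMMA = 0.5

def actions(s):
    acts = {}
    if s >= 1:
        acts['W'] = [(s - 1, 1.0)]                 # walk: deterministic step down
    if s >= 2:
        acts['J'] = [(s - 2, 0.75), (s, 0.25)]     # jump: stochastic
    return acts

def q_value(s, outcomes, V):
    # Q(s,a) = sum_s' T(s,a,s') [ R(s,a,s') + gamma * V(s') ],  R(s,a,s') = (s - s')^2
    return sum(p * ((s - s2) ** 2 + GAMMA * V[s2]) for s2, p in outcomes)

V = {s: 0.0 for s in range(5)}                     # state 0 is terminal, V(0) = 0
for _ in range(100):                               # enough sweeps to converge
    V = {s: (max(q_value(s, outs, V) for outs in actions(s).values())
             if s >= 1 else 0.0)
         for s in range(5)}

print("V*(2)    =", V[2])
print("Q*(3, J) =", q_value(3, actions(3)['J'], V))
```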
Problem 3 (5 marks)
a) What does the Q-learning update rule look like in the case of a stateless or 1-state problem? Explain your answer. (A reference form of the standard update is given after part b.) (2 marks)
b) Discuss the main challenges that arise when moving from single- to multi-agent learning, in terms of the learning target and convergence. (3 marks)
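For reference when answering both parts, the standard single-agent tabular Q-learning update (as given in Sutton & Barto) is reproduced below; the stateless specialization asked for in part a) is left for you to derive.

```latex
% Standard tabular Q-learning update (Sutton & Barto); part a) asks what this
% reduces to when there is only a single state.
Q(S_t, A_t) \leftarrow Q(S_t, A_t)
  + \alpha \bigl[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \bigr]
```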
Problem 4 (15 marks)
Re-implement (e.g. in Matlab or Python) the results presented in Figure 6.4 of the Sutton & Barto book comparing SARSA and Q-learning on the cliff-walking task. Investigate, for both methods, the effect of choosing different values of the exploration parameter 𝜀. Present your code and results. In your discussion, clearly describe the main difference between SARSA and Q-learning in relation to your findings.
Note: For this problem, use 𝛼 = 0.1 and 𝛾 = 1 for both algorithms. The “smoothing” that is mentioned in the caption of Figure 6.4 is a result of 1) averaging over 10 runs, and 2) plotting a moving average over the last 10 episodes.
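A compact sketch of how such a comparison could be set up is shown below, assuming the standard 4×12 cliff-walking grid from the book; the grid encoding, episode count, seed and helper names are illustrative choices, and the averaging/smoothing described in the note above (as well as plotting) is left out.

```python
import numpy as np

ROWS, COLS = 4, 12
START, GOAL = (3, 0), (3, 11)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
ALPHA, GAMMA = 0.1, 1.0

def step(state, a):
    """Apply action a; return (next_state, reward). Falling into the cliff
    gives -100 and sends the agent back to START; every other move gives -1."""
    r, c = state
    dr, dc = ACTIONS[a]
    r2 = min(max(r + dr, 0), ROWS - 1)
    c2 = min(max(c + dc, 0), COLS - 1)
    if r2 == 3 and 1 <= c2 <= 10:              # stepped into the cliff
        return START, -100.0
    return (r2, c2), -1.0

def eps_greedy(Q, s, eps, rng):
    return rng.integers(4) if rng.random() < eps else int(np.argmax(Q[s]))

def run(method, eps=0.1, episodes=500, rng=None):
    """Return the sum of rewards per episode for 'sarsa' or 'qlearning'."""
    rng = rng or np.random.default_rng(0)
    Q = np.zeros((ROWS, COLS, 4))
    returns = []
    for _ in range(episodes):
        s, total = START, 0.0
        a = eps_greedy(Q, s, eps, rng)
        while s != GOAL:
            s2, r = step(s, a)
            a2 = eps_greedy(Q, s2, eps, rng)
            if method == "sarsa":              # on-policy target: next action actually taken
                target = r + GAMMA * Q[s2][a2]
            else:                              # Q-learning: off-policy max target
                target = r + GAMMA * np.max(Q[s2])
            Q[s][a] += ALPHA * (target - Q[s][a])
            s, a, total = s2, a2, total + r
        returns.append(total)
    return returns

# e.g. compare run("sarsa", eps=0.1) against run("qlearning", eps=0.1)
```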