Problem 1 (12 marks)
Re-implement (e.g. in Matlab or Python) the results presented in Figure 2.2 of the Sutton & Barto book comparing a greedy method with two 𝜀-greedy methods (𝜀 = 0.01 and 𝜀 = 0.1) on the 10-armed testbed, and present your code and results. Include a discussion of the exploration-exploitation dilemma in relation to your findings.
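For orientation, below is a minimal Python sketch of one way this experiment could be set up, assuming sample-average value estimates and the Gaussian reward structure described for the testbed in the book; the function name run_testbed, the parameter defaults, and the reduced number of runs in the example call are illustrative choices, not part of the assignment (the sketch also only tracks average reward, not % optimal action).

import numpy as np

def run_testbed(epsilon, n_arms=10, n_steps=1000, n_runs=2000, seed=0):
    # Average reward at each step over many independently sampled bandit problems.
    rng = np.random.default_rng(seed)
    avg_reward = np.zeros(n_steps)
    for _ in range(n_runs):
        q_true = rng.normal(0.0, 1.0, n_arms)    # true action values q*(a) ~ N(0, 1)
        q_est = np.zeros(n_arms)                 # sample-average estimates Q(a)
        counts = np.zeros(n_arms)
        for t in range(n_steps):
            if rng.random() < epsilon:
                a = rng.integers(n_arms)         # explore: pick a random arm
            else:
                a = int(np.argmax(q_est))        # exploit (ties broken by lowest index)
            r = rng.normal(q_true[a], 1.0)       # reward ~ N(q*(a), 1)
            counts[a] += 1
            q_est[a] += (r - q_est[a]) / counts[a]   # incremental sample-average update
            avg_reward[t] += r
    return avg_reward / n_runs

# Compare the greedy method (epsilon = 0) with the two epsilon-greedy settings;
# fewer runs than in the book are used here only to keep the example fast.
for eps in (0.0, 0.01, 0.1):
    print(eps, run_testbed(eps, n_runs=200)[-1])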
Problem 2 (8 marks)
Consider a Markov Decision Process (MDP) with states 𝑆 = {4,3,2,1,0}, where 4 is the starting state. In states 𝑘 ≥ 1 you can walk (W) and 𝑇(𝑘, 𝑊, 𝑘 − 1) = 1. In states 𝑘 ≥ 2 you can also jump (J) and
𝑇(𝑘, 𝐽, 𝑘 − 2) = 3/4 and 𝑇(𝑘, 𝐽, 𝑘) = 1/4. State 0 is a terminal state. The reward is 𝑅(𝑠, 𝑎, 𝑠′) = (𝑠 − 𝑠′)² for all (𝑠, 𝑎, 𝑠′). Use a discount of 𝛾 = 1⁄2. Compute both 𝑉∗(2) and 𝑄∗(3, 𝐽). Clearly show how you computed these values.
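As a sanity check on a hand computation, these quantities can also be obtained numerically with a few lines of value iteration. The sketch below is one possible encoding of the MDP above, assuming the convention V(0) = 0 for the terminal state; the helper name q_value and the fixed number of sweeps are my own choices.

import numpy as np

gamma = 0.5
V = np.zeros(5)                      # V[0] stays 0: state 0 is terminal

def q_value(s, a, V):
    # Expected immediate reward R(s, a, s') = (s - s')**2 plus discounted successor value.
    if a == 'W':                     # walk: deterministically to s - 1
        return (s - (s - 1)) ** 2 + gamma * V[s - 1]
    # jump: to s - 2 with prob 3/4 (reward 4), stay in s with prob 1/4 (reward 0)
    return 0.75 * ((s - (s - 2)) ** 2 + gamma * V[s - 2]) + 0.25 * (0.0 + gamma * V[s])

for _ in range(100):                 # value iteration; 100 sweeps is ample here
    for s in range(1, 5):
        actions = ['W'] if s == 1 else ['W', 'J']
        V[s] = max(q_value(s, a, V) for a in actions)

print('V*(2) =', V[2], '  Q*(3, J) =', q_value(3, 'J', V))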
Problem 3 (5 marks)
a) What does the Q-learning update rule look like in the case of a stateless or 1-state problem? Explain your answer; for reference, the standard one-step update is sketched after part b). (2 marks)
b) Discuss the main challenges that arise when moving from single- to multi-agent learning, in terms of the learning target and convergence. (3 marks)
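For part a), the sketch below shows the standard one-step tabular Q-learning update that the question asks you to specialise to a single state; the helper name q_learning_update, the step-size and discount values, and the toy transition at the end are illustrative only.

from collections import defaultdict

alpha, gamma = 0.1, 0.9                        # illustrative values, not prescribed by the problem
Q = defaultdict(lambda: defaultdict(float))    # tabular Q[s][a], initialised to 0

def q_learning_update(s, a, r, s_next, actions_next):
    # Standard one-step tabular Q-learning: move Q(s,a) towards r + gamma * max_a' Q(s',a').
    best_next = max(Q[s_next][a2] for a2 in actions_next) if actions_next else 0.0
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])

# Example transition (s, a, r, s') in a toy two-state problem.
q_learning_update(0, 'left', 1.0, 1, ['left', 'right'])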
Problem 4 (15 marks)
Re-implement (e.g. in Matlab or Python) the results presented in Figure 6.4 of the Sutton & Barto book comparing SARSA and Q-learning in the cliff-walking task. Investigate the effect of choosing different values for the exploration parameter 𝜀 for both methods. Present your code and results. In your discussion, clearly describe the main difference between SARSA and Q-learning in relation to your findings.
Note: For this problem, use 𝛼 = 0.1 and 𝛾 = 1 for both algorithms. The “smoothing” that is mentioned in the caption of Figure 6.4 is a result of 1) averaging over 10 runs, and 2) plotting a moving average over the last 10 episodes.
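Below is a minimal Python sketch of the cliff-walking environment and of both update rules, assuming the standard 4x12 grid with the cliff along the bottom row between start and goal; the helper names step, eps_greedy and run, and the episode count, are illustrative rather than prescribed. The only difference between the two methods in this sketch is the bootstrap target: SARSA uses the action actually selected next (on-policy), Q-learning uses the greedy maximum (off-policy).

import numpy as np

ROWS, COLS = 4, 12
START, GOAL = (3, 0), (3, 11)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]       # up, down, left, right

def step(state, a):
    # One environment transition; stepping into the cliff gives -100 and resets to START.
    r, c = state
    dr, dc = ACTIONS[a]
    r, c = min(max(r + dr, 0), ROWS - 1), min(max(c + dc, 0), COLS - 1)
    if r == 3 and 0 < c < 11:
        return START, -100
    return (r, c), -1                               # every other step costs -1

def eps_greedy(Q, s, eps, rng):
    if rng.random() < eps:
        return int(rng.integers(4))
    return int(np.argmax(Q[s]))

def run(algo, eps=0.1, alpha=0.1, gamma=1.0, episodes=500, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((ROWS, COLS, 4))
    returns = []
    for _ in range(episodes):
        s, total = START, 0
        a = eps_greedy(Q, s, eps, rng)
        while s != GOAL:
            s2, r = step(s, a)
            total += r
            a2 = eps_greedy(Q, s2, eps, rng)
            if algo == 'sarsa':
                target = r + gamma * Q[s2][a2]      # on-policy: action actually taken next
            else:
                target = r + gamma * np.max(Q[s2])  # Q-learning: greedy (off-policy) target
            Q[s][a] += alpha * (target - Q[s][a])
            s, a = s2, a2
        returns.append(total)
    return returns

# Example comparison for eps = 0.1; for plots, apply the averaging and the
# 10-episode moving average described in the note above.
for algo in ('sarsa', 'q-learning'):
    print(algo, np.mean(run(algo)[-50:]))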