
COMP532 Assignment 1 – Reinforcement Learning

You need to solve each of the following problems. Problem 1 concerns an example/exercise from the book by Sutton and Barto. You must also include a brief report describing and discussing your solutions to the problems. This work can be carried out in pairs.

o This assignment is worth 10% of the total mark for COMP532
o 80% of the marks will be awarded for correctness of results.
o 20% of the marks will be awarded for the quality of the accompanying report.

Submission Instructions

o Send all solutions as one PDF document containing your answers, results, and discussion of results. Attach the source code for the programming problems as separate files.

o Submit your solution by email to shan.luo@liverpool.ac.uk, clearly stating in the subject line: “COMP532 Task 1 Solution”

o The deadline for this assignment is 09/03/2018, 5:00pm

o Penalties for late submission apply in accordance with departmental policy as set out in the student handbook, which can be found at

http://intranet.csc.liv.ac.uk/student/msc-handbook.pdf

and the University Code of Practice on Assessment, found at

https://www.liverpool.ac.uk/media/livacuk/tqsd/code-of-practice-on-assessment/code_of_practice_on_assessment.pdf

Problem 1 (12 marks)

Re-implement (e.g. in Matlab) the results presented in Figure 2.2 of the Sutton & Barto book comparing a greedy method with two 𝜀-greedy methods (𝜀 = 0.01 and 𝜀 = 0.1) on the 10-armed testbed, and present your code and results. Include a discussion of the exploration-exploitation dilemma in relation to your findings.
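
A minimal Python sketch of the kind of simulation this asks for is shown below (the assignment suggests Matlab, but the structure is the same in any language). It assumes the standard testbed setup from the book: true action values drawn from a unit normal distribution, unit-variance Gaussian rewards, sample-average value estimates, and results averaged over 2000 randomly generated 10-armed bandit problems over 1000 time steps. The function and variable names are illustrative only.

    import numpy as np

    def run_bandit(eps, n_runs=2000, n_steps=1000, k=10):
        # Average reward per time step for one value of epsilon,
        # averaged over n_runs randomly generated 10-armed bandit problems.
        avg_reward = np.zeros(n_steps)
        for _ in range(n_runs):
            q_true = np.random.randn(k)      # true action values q*(a) ~ N(0, 1)
            q_est = np.zeros(k)              # sample-average estimates Q(a)
            counts = np.zeros(k)             # how often each action has been taken
            for t in range(n_steps):
                if np.random.rand() < eps:   # explore with probability epsilon
                    a = np.random.randint(k)
                else:                        # exploit: greedy action, ties broken randomly
                    a = np.random.choice(np.flatnonzero(q_est == q_est.max()))
                r = q_true[a] + np.random.randn()        # reward ~ N(q*(a), 1)
                counts[a] += 1
                q_est[a] += (r - q_est[a]) / counts[a]   # incremental sample-average update
                avg_reward[t] += r
        return avg_reward / n_runs

    # Greedy (eps = 0) versus the two epsilon-greedy methods; plotting these three
    # curves against the time step gives the average-reward comparison of Figure 2.2.
    curves = {eps: run_bandit(eps) for eps in (0.0, 0.01, 0.1)}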

Problem 2 (8 marks)

Consider an MDP with states 𝑆 = {4, 3, 2, 1, 0}, where 4 is the starting state. In states 𝑘 ≥ 1 you can walk (W), with 𝑇(𝑘, 𝑊, 𝑘 − 1) = 1. In states 𝑘 ≥ 2 you can also jump (J), with 𝑇(𝑘, 𝐽, 𝑘 − 2) = 3/4 and 𝑇(𝑘, 𝐽, 𝑘) = 1/4. State 0 is a terminal state. The reward is 𝑅(𝑠, 𝑎, 𝑠′) = (𝑠 − 𝑠′)² for all (𝑠, 𝑎, 𝑠′). Use a discount factor of 𝛾 = 1/2. Compute both 𝑉∗(2) and 𝑄∗(3, 𝐽). Clearly show how you computed these values.
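
The optimal values can be derived by hand from the Bellman optimality equations. As a sanity check only, a short value-iteration sketch in Python (using the transition and reward model exactly as stated above; the state encoding and names are illustrative) would look like this:

    import numpy as np

    gamma = 0.5
    states = [0, 1, 2, 3, 4]                 # state 0 is terminal

    def actions(s):
        if s >= 2:
            return ['W', 'J']                # walk or jump
        return ['W'] if s == 1 else []       # only walking in state 1, nothing in state 0

    def transitions(s, a):
        # List of (probability, next state) pairs for taking action a in state s.
        if a == 'W':
            return [(1.0, s - 1)]
        return [(0.75, s - 2), (0.25, s)]    # jump

    def reward(s, s_next):
        return (s - s_next) ** 2

    V = np.zeros(len(states))                # V(0) = 0 stays fixed (terminal state)
    for _ in range(100):                     # value-iteration sweeps, plenty to converge here
        for s in states[1:]:
            V[s] = max(sum(p * (reward(s, s2) + gamma * V[s2])
                           for p, s2 in transitions(s, a))
                       for a in actions(s))

    Q_3_J = sum(p * (reward(3, s2) + gamma * V[s2]) for p, s2 in transitions(3, 'J'))
    print('V*(2) =', V[2], '  Q*(3, J) =', Q_3_J)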

Problem 3 (5 marks)

a) What does the Q-learning update rule look like in the case of a stateless or 1-state problem? Clarify your answer. (2 marks)

b) Discuss the main challenges that arise when moving from single- to multi-agent learning, in terms of the learning target and convergence. (3 marks)
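
For reference, the general tabular Q-learning update rule from Sutton & Barto, which part a) asks you to specialise to the single-state case, can be written (in LaTeX notation) as:

    Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]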

Problem 4 (15 marks)

Re-implement (e.g. in Matlab) the results presented in Figure 6.4 of the Sutton & Barto book comparing SARSA and Q-learning in the cliff-walking task. Investigate the effect of choosing different values for the exploration parameter 𝜀 for both methods. Present your code and results. In your discussion, clearly describe the main difference between SARSA and Q-learning in relation to your findings.

Note: the book is not completely clear on this example. Use 𝛼 = 0.1 and 𝛾 = 1 for both algorithms. The “smoothing” that is mentioned in the caption of Figure 6.4 is a result of 1) averaging over 10 runs, and 2) plotting a moving average over the last 10 episodes.
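
A compact Python outline of the comparison is sketched below (again, Matlab is equally suitable). It assumes the standard 4×12 cliff-walking grid from the book, the parameter values from the note above (𝛼 = 0.1, 𝛾 = 1), an 𝜀-greedy behaviour policy for both methods, and 500 episodes per run; the choice of 𝜀 values, the grid encoding, and the variable names are illustrative, and the averaging/smoothing and plotting are left to the reader.

    import numpy as np

    ROWS, COLS = 4, 12
    START, GOAL = (3, 0), (3, 11)
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]     # up, down, left, right

    def step(state, a):
        # One move on the cliff-walking grid: returns (next state, reward).
        r, c = state
        dr, dc = ACTIONS[a]
        r = min(max(r + dr, 0), ROWS - 1)
        c = min(max(c + dc, 0), COLS - 1)
        if r == 3 and 0 < c < COLS - 1:              # stepped into the cliff
            return START, -100.0
        return (r, c), -1.0

    def eps_greedy(Q, s, eps):
        if np.random.rand() < eps:
            return np.random.randint(len(ACTIONS))
        q = Q[s]
        return np.random.choice(np.flatnonzero(q == q.max()))

    def run(method, eps, episodes=500, alpha=0.1, gamma=1.0):
        # Sum of rewards per episode for either 'sarsa' or 'qlearning'.
        Q = np.zeros((ROWS, COLS, len(ACTIONS)))
        returns = np.zeros(episodes)
        for ep in range(episodes):
            s, a = START, eps_greedy(Q, START, eps)
            while s != GOAL:
                s2, rew = step(s, a)
                returns[ep] += rew
                a2 = eps_greedy(Q, s2, eps)
                done = (s2 == GOAL)
                if method == 'sarsa':                # on-policy: bootstrap on the action actually taken next
                    target = rew + gamma * Q[s2][a2] * (not done)
                else:                                # Q-learning: bootstrap on the greedy action
                    target = rew + gamma * Q[s2].max() * (not done)
                Q[s][a] += alpha * (target - Q[s][a])
                s, a = s2, a2
        return returns

    # Average each curve over 10 independent runs for a few exploration rates
    # (the moving-average smoothing and the plotting are omitted here).
    for eps in (0.05, 0.1, 0.2):
        sarsa = np.mean([run('sarsa', eps) for _ in range(10)], axis=0)
        qlearn = np.mean([run('qlearning', eps) for _ in range(10)], axis=0)
        print(f"eps={eps}: SARSA {sarsa[-100:].mean():.1f}, Q-learning {qlearn[-100:].mean():.1f}")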