
COMP532 Assignment 1 – Reinforcement Learning

You need to solve each of the following problems. Problem 1 concerns an example/exercise from the book by Sutton and Barto. You must also include a brief report describing and discussing your solutions to the problems. This work can be carried out in pairs.

o This assignment is worth 10% of the total mark for COMP532

o 80% of the marks will be awarded for correctness of results.

o 20% of the marks will be awarded for the quality of the accompanying report.

Submission Instructions

o Send all solutions as a single PDF document containing your answers, results, and discussion of results. Attach the source code for the programming problems as separate files.

o Submit your solution by email to shan.luo@liverpool.ac.uk, clearly stating in the subject line:
“COMP532 Task 1 Solution”

o The deadline for this assignment is 09/03/2018, 5:00pm

o Penalties for late submission apply in accordance with departmental policy as set out in the
student handbook, which can be found at

http://intranet.csc.liv.ac.uk/student/msc-handbook.pdf

and the University Code of Practice on Assessment, found at

https://www.liverpool.ac.uk/media/livacuk/tqsd/code-of-practice-on-assessment/code_of_practice_on_assessment.pdf

Problem 1 (12 marks)

Re-implement (e.g. in Matlab) the results presented in Figure 2.2 of the Sutton & Barto book, comparing a greedy method with two 𝜀-greedy methods (𝜀 = 0.01 and 𝜀 = 0.1) on the 10-armed testbed, and present your code and results. Include a discussion of the exploration–exploitation dilemma in relation to your findings.
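The following is a minimal Matlab sketch of one possible setup for this experiment, assuming a sample-average 10-armed testbed as in the book; the run and step counts and all variable names (nArms, nSteps, nRuns, epsilons) are illustrative choices, not requirements.

% Sketch: 10-armed testbed with greedy vs epsilon-greedy action selection.
% Run/step counts and variable names are illustrative choices.
nArms = 10; nSteps = 1000; nRuns = 2000;
epsilons = [0, 0.01, 0.1];                    % greedy plus the two epsilon-greedy settings
avgReward = zeros(numel(epsilons), nSteps);

for e = 1:numel(epsilons)
    for run = 1:nRuns
        qTrue = randn(1, nArms);              % true action values q*(a) ~ N(0,1)
        Q = zeros(1, nArms);                  % sample-average estimates
        N = zeros(1, nArms);                  % action counts
        for t = 1:nSteps
            if rand < epsilons(e)
                a = randi(nArms);             % explore: random arm
            else
                [~, a] = max(Q);              % exploit (ties broken by the first maximum)
            end
            r = qTrue(a) + randn;             % reward drawn from N(q*(a), 1)
            N(a) = N(a) + 1;
            Q(a) = Q(a) + (r - Q(a)) / N(a);  % incremental sample-average update
            avgReward(e, t) = avgReward(e, t) + r;
        end
    end
end
avgReward = avgReward / nRuns;

plot(avgReward'); xlabel('Steps'); ylabel('Average reward');
legend('\epsilon = 0 (greedy)', '\epsilon = 0.01', '\epsilon = 0.1');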

Problem 2 (8 marks)

Consider an MDP with states 𝑆 = {4, 3, 2, 1, 0}, where 4 is the starting state. In states 𝑘 ≥ 1 you can walk (W), with 𝑇(𝑘, 𝑊, 𝑘 − 1) = 1. In states 𝑘 ≥ 2 you can also jump (J), with 𝑇(𝑘, 𝐽, 𝑘 − 2) = 3/4 and 𝑇(𝑘, 𝐽, 𝑘) = 1/4. State 0 is a terminal state. The reward is 𝑅(𝑠, 𝑎, 𝑠′) = (𝑠 − 𝑠′)² for all (𝑠, 𝑎, 𝑠′). Use a discount factor of 𝛾 = 1/2. Compute both 𝑉∗(2) and 𝑄∗(3, 𝐽), and clearly show how you computed these values.
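For reference (background only, not part of the required answer), the requested quantities satisfy the Bellman optimality equations, written here using the problem's own 𝑇, 𝑅 and 𝛾, with the terminal state taking value zero:

\[
Q^{*}(s,a) = \sum_{s'} T(s,a,s')\,\bigl[R(s,a,s') + \gamma \max_{a'} Q^{*}(s',a')\bigr],
\qquad
V^{*}(s) = \max_{a} Q^{*}(s,a), \qquad V^{*}(0) = 0.
\]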

Problem 3 (5 marks)

a) What does the Q-learning update rule look like in the case of a stateless or one-state problem? Explain your answer (the standard multi-state update rule is recalled below, after part b), for reference). (2 marks)

b) Discuss the main challenges that arise when moving from single- to multi-agent learning, in terms of
the learning target and convergence. (3 marks)
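For reference in part a), the standard multi-state Q-learning update rule from Sutton & Barto, with step size 𝛼 and discount 𝛾, is:

\[
Q(s,a) \leftarrow Q(s,a) + \alpha\bigl[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\bigr].
\]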

Problem 4 (15 marks)

Re-implement (e.g. in Matlab) the results presented in Figure 6.4 of the Sutton & Barto book, comparing SARSA and Q-learning on the cliff-walking task. Investigate the effect of choosing different values of the exploration parameter 𝜀 for both methods. Present your code and results. In your discussion, clearly describe the main difference between SARSA and Q-learning in relation to your findings.

Note: the book is not completely clear on this example. Use 𝛼 = 0.1 and 𝛾 = 1 for both algorithms. The “smoothing” mentioned in the caption of Figure 6.4 is the result of 1) averaging over 10 runs, and 2) plotting a moving average over the last 10 episodes.
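Below is a minimal, self-contained Matlab sketch of how such a comparison could be organised; the grid layout, episode count, and helper names (epsGreedy, cliffStep) are illustrative assumptions, and the averaging over runs and moving-average smoothing described in the note are omitted for brevity.

% Sketch: SARSA vs Q-learning on a 4x12 cliff-walking grid (save as one script;
% local functions at the end of a script require Matlab R2016b or later).
% Layout, parameters and helper names are illustrative assumptions.
rows = 4; cols = 12; nActions = 4;       % actions: 1=up, 2=down, 3=left, 4=right
alpha = 0.1; gamma = 1; epsilon = 0.1;
nEpisodes = 500;
startS = sub2ind([rows cols], 4, 1);     % bottom-left corner
goalS  = sub2ind([rows cols], 4, 12);    % bottom-right corner
rewards = zeros(2, nEpisodes);           % sum of rewards per episode, per method

for method = 1:2                         % 1 = SARSA, 2 = Q-learning
    Q = zeros(rows*cols, nActions);
    for ep = 1:nEpisodes
        s = startS;
        a = epsGreedy(Q, s, epsilon, nActions);
        while s ~= goalS
            [s2, r] = cliffStep(s, a, rows, cols, startS);
            a2 = epsGreedy(Q, s2, epsilon, nActions);
            if method == 1               % SARSA target: the action actually taken next
                target = r + gamma * Q(s2, a2);
            else                         % Q-learning target: the greedy action
                target = r + gamma * max(Q(s2, :));
            end
            Q(s, a) = Q(s, a) + alpha * (target - Q(s, a));
            s = s2; a = a2;
            rewards(method, ep) = rewards(method, ep) + r;
        end
    end
end

plot(rewards'); xlabel('Episode'); ylabel('Sum of rewards during episode');
legend('SARSA', 'Q-learning');

function a = epsGreedy(Q, s, epsilon, nActions)
    % epsilon-greedy action selection (ties broken by the first maximum)
    if rand < epsilon
        a = randi(nActions);
    else
        [~, a] = max(Q(s, :));
    end
end

function [s2, r] = cliffStep(s, a, rows, cols, startS)
    % One step of the cliff-walking grid: -1 per move, -100 and reset to the
    % start when stepping into the cliff cells along the bottom row.
    [i, j] = ind2sub([rows cols], s);
    moves = [-1 0; 1 0; 0 -1; 0 1];      % up, down, left, right
    i = min(max(i + moves(a, 1), 1), rows);
    j = min(max(j + moves(a, 2), 1), cols);
    if i == rows && j > 1 && j < cols    % the cliff
        r = -100; s2 = startS;
    else
        r = -1; s2 = sub2ind([rows cols], i, j);
    end
end

The only difference between the two methods in this sketch is the bootstrap target: SARSA uses the value of the action actually selected next, while Q-learning uses the greedy maximum over actions in the next state.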