COMP 3620/6320

Assignment 4: Reinforcement Learning

Semester 1, 2015

Main Details

In this assignment you will implement and experiment with a number of temporal-difference (TD) learning methods in examples based on the Gridworld. This assignment requires detailed knowledge of the reinforcement learning problem and of TD methods. Before starting, if you have not already done so, we strongly recommend that you work your way through Chapters 1, 2, 3, and 6 of [1]. A link to this book is available on Wattle.

Submit the following files in a tar.gz archive:

  • answers.pdf containing answers to all the questions. This should include both the graphs and the reasoning;
  • agents.py containing your implementation of the Sarsa and Q-learning agents, and the main experiment loop; and
  • experiments.py which should contain code to generate all of the graphs used in answers.pdf.

Please also list at the top of your answers.pdf file the estimated amount of time it took you to do this assignment, excluding time used for learning background and course material.

Marking Scheme

The marking scheme emphasises understanding and reasoning over implementation, which in this case should not take you very long. You are provided with an implementation in Python of all the environments and the templates for the agents.

Note that if you have an incorrect implementation of the algorithms, the intended phenomena may not occur.
Total Marks: 50

  • 10 marks for correct implementation of the required algorithms. Marks will be deducted for errors, unreadable or uncommented code, or incorrect algorithm implementation.
  • Q1. 10 marks for a correct proof and K expressed in simplest form.
  • Q2. 10 marks for producing sensible graphs and correct reasoning about what you observe with the learning rates.
  • Q3. 10 marks for producing sensible graphs and correct reasoning about the behaviour of Sarsa as δ increases.
  • Q4. 10 marks for producing sensible graphs and correct reasoning on all points raised.

  • Q5. 5 (bonus) marks for finding the correct value and explaining why it is that value.

As usual, bonus marks cannot take your mark above 50.

Q1. Additive Reward Shift (10 marks)

In the Gridworld example (see Example 3.8 in [1] for details), rewards are positive for goals, negative for running into the edge of the world, and zero the rest of the time. Are the signs of these rewards important, or only the intervals between them? Treat the task as non-episodic (i.e. continuing). Prove that adding a constant C to all the rewards adds a constant K to the values of all states, and thus does not affect the relative values of any states under any policy. What is K in terms of C and the discount factor γ?
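
For reference, the only definition needed here is the discounted state-value function from [1]; it is quoted below as a reminder, in standard notation, and is not part of the required answer:

    V^\pi(s) \;=\; \mathbb{E}_\pi\!\left[\, \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1} \;\middle|\; s_t = s \right], \qquad 0 \le \gamma < 1.

Writing down the same expression with each reward shifted by C is a natural starting point for the proof.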

Q2. Sarsa and Learning Rates (10 marks)

Figure 1: Windy Gridworld with standard moves

Figure 1 shows a windy Gridworld, with start state (S) and goal state (G). There is a crosswind blowing upward through the middle of the grid. Four actions are allowed (standard moves: up, down, left, right). The strength of the wind is given below each column, as the number of cells shifted upward. For example, if you are one cell to the right of the goal, then the action left takes you to the cell just above the goal. Let us treat this as a discounted episodic task with discount factor γ = 0.99, in which the reward is -1 at every step, except that reaching the goal state yields a reward of +1. Initialise all Q-values to zero.

  • Implement ε-Sarsa in agents.py. Refer to [1] for the detailed description of the algorithm (Figure 6.9); a sketch of the core update is given after this list.
  • Fix ε = 0.01, but show with graphs how different choices of the step size α change the behaviour of the algorithm. These graphs should plot the episode number against the time steps at which the episodes finished (so you get a graph like Figure 6.11 in [1]). Code for generating the graphs should be included in experiments.py. Present your graphs and explain your results in detail in answers.pdf.
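
As a rough guide only, the heart of the tabular Sarsa agent is the single backup shown below. This is a sketch under assumed names (Q stored as a dictionary keyed by (state, action) pairs; epsilon_greedy, sarsa_update, num_actions and so on are illustrative and are not the interface of the provided agents.py template):

    import random
    from collections import defaultdict

    # Q-values default to zero, as the question requires.
    Q = defaultdict(float)

    def epsilon_greedy(Q, state, num_actions, epsilon):
        """Pick a random action with probability epsilon, otherwise a greedy one."""
        if random.random() < epsilon:
            return random.randrange(num_actions)
        values = [Q[(state, a)] for a in range(num_actions)]
        return values.index(max(values))

    def sarsa_update(Q, state, action, reward, next_state, next_action, alpha, gamma):
        """One tabular Sarsa backup:
        Q(s,a) <- Q(s,a) + alpha * (r + gamma * Q(s',a') - Q(s,a))."""
        td_target = reward + gamma * Q[(next_state, next_action)]
        Q[(state, action)] += alpha * (td_target - Q[(state, action)])

Note that Sarsa is on-policy: the next action a' used in the backup is the one actually selected by the ε-greedy behaviour policy.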

Q3. Stochastic Winds (10 marks)

Assume that we are still using the standard moves (up, right, down, left). Suppose now that in the Gridworld the wind blows randomly from any direction, i.e. with probability δ the wind blows you in one of the four directions.

Figure 2: Windy water Gridworld

Further, there is a bridge (B) crossing a bay of water (W) in the middle of the Gridworld, as shown in Figure 2. You still receive a reward of -1 when you step on the bridge, but falling into the water is very inconvenient, resulting in a “reward” of -10 for each time you are in one of the water tiles. Crossing the bridge gets you to the goal in the minimum number of steps, but you risk falling into the water; this risk can be avoided by taking a detour. Use ε-Sarsa with ε = 0.005 and step size α = 0.1 to learn this task with various wind probabilities 0 ≤ δ ≤ 1.

• Plot the final (after-“convergence”) episode length as a function of δ and explain your findings in answers.pdf. To find the after-convergence episode length, turn off ε for the last 500 episodes and take the average of the lengths of these episodes; a sketch of this measurement is given after this item. Make sure you use sufficiently many episodes that the algorithm converges. Code for generating the graphs should be included in experiments.py.
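
One possible shape for this measurement is sketched below; run_episode, agent.epsilon and the other names are assumptions made for illustration, not the interface of the provided code:

    def average_final_episode_length(run_episode, agent, n_episodes, n_eval=500):
        """Train for n_episodes and return the mean length of the last n_eval
        episodes, run greedily (epsilon turned off). run_episode is assumed to
        be a callable that executes one episode and returns its length."""
        lengths = []
        for episode in range(n_episodes):
            if episode >= n_episodes - n_eval:
                agent.epsilon = 0.0  # evaluate without exploration for the tail
            lengths.append(run_episode(agent))
        return sum(lengths[-n_eval:]) / n_eval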

Q4. Q-learning versus Sarsa (10 marks)

In this experiment you will compare Sarsa and Q-learning on the environment from Q3. Use the above environment with δ = 0 (i.e. no wind), and change the “reward” for falling into the water to -100. Compare Sarsa and Q-learning while varying ε over 0.1, …, 0.7.

• Implement Q-learning in agents.py. Refer to [1] for the detailed description of the algorithm (Figure 6.12); a sketch of the update is given after this list.

• Address the following in answers.pdf:

  –  Plot the after-convergence average episode length and average reward, evaluated both with ε and without it (i.e. keeping ε on, or turning it off, for the last 500 episodes).
  –  What is the optimal path from the start state to the goal in this environment? Why do Sarsa and Q-learning choose the paths that they do in each of the evaluations above?
  –  Why do the algorithms still find the goal when ε = 0, despite the need for exploration?
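
For comparison with the Sarsa sketch in Q2, a sketch of the corresponding tabular Q-learning backup is shown below; it differs only in using the greedy maximum over next actions rather than the action actually taken (again, the names are assumptions, not the provided template's interface):

    def q_learning_update(Q, state, action, reward, next_state, num_actions, alpha, gamma):
        """One tabular Q-learning backup:
        Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
        Q is assumed to be dict-like with a default of zero (e.g. defaultdict(float))."""
        best_next = max(Q[(next_state, a)] for a in range(num_actions))
        td_target = reward + gamma * best_next
        Q[(state, action)] += alpha * (td_target - Q[(state, action)])

Because the backup uses the greedy maximum regardless of the action the ε-greedy behaviour policy actually takes next, Q-learning is off-policy, whereas Sarsa is on-policy.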

Q5. Q-value initialisation (5 bonus marks)

Now, in the same environment as Q4, choose ε = 0 and δ = 0. Find the lowest (possibly negative) initial Q-value such that Sarsa still finds the goal. Explain why this particular value is the threshold.

References

[1] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998.
