COMP9444 Neural Networks and Deep Learning Term 3, 2020
Exercises 7: Reinforcement Learning
Consider an environment with two states S = {S1, S2} and two actions A = {a1, a2}, where the (deterministic) transitions δ and reward R for each state and action are as follows:
δ(S1, a1) = S1,  R(S1, a1) = +1
δ(S1, a2) = S2,  R(S1, a2) = -2
δ(S2, a1) = S1,  R(S2, a1) = +7
δ(S2, a2) = S2,  R(S2, a2) = +3
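If you want to experiment on a computer, these tables can be encoded directly. Here is a minimal Python sketch (the variable names are just illustrative) that the later sketches on this page reuse:

    # Deterministic environment from the exercise: two states, two actions.
    # delta[(s, a)] is the next state, reward[(s, a)] the immediate reward.
    S1, S2 = "S1", "S2"
    A1, A2 = "a1", "a2"

    delta = {(S1, A1): S1, (S1, A2): S2,
             (S2, A1): S1, (S2, A2): S2}
    reward = {(S1, A1): +1, (S1, A2): -2,
              (S2, A1): +7, (S2, A2): +3}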
1. Draw a picture of this environment, using circles for the states and arrows for the transitions.
2. Assuming a discount factor of γ = 0.7, determine:
a. the optimal policy π* : S → A
b. the value function V* : S → R
c. the "Q" function Q* : S × A → R
Write the Q values in a matrix like this:

    Q     a1    a2
    S1
    S2
Trace through the first few steps of the Q-learning algorithm, with a learning rate of 1 and with all Q values initially set to zero. Explain why it is necessary to force exploration through probabilistic choice of actions, in order to ensure convergence to the true Q values.
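For reference while tracing: with a learning rate of 1 the update reduces to Q(s, a) ← R(s, a) + γ max_a' Q(δ(s, a), a'). The sketch below is an illustration of that loop, not the sample solution; it reuses the delta and reward tables from the sketch above, and the random (ε-greedy) choices are the "forced exploration" the question refers to.

    import random

    GAMMA = 0.7     # discount factor from question 2
    EPSILON = 0.2   # exploration probability (an arbitrary choice for this sketch)

    Q = {(s, a): 0.0 for s in (S1, S2) for a in (A1, A2)}  # all Q values start at zero

    s = S1
    for step in range(20):
        # epsilon-greedy: sometimes act at random so every (state, action) pair
        # keeps being tried; a purely greedy agent can lock onto one action forever
        if random.random() < EPSILON:
            a = random.choice((A1, A2))
        else:
            a = max((A1, A2), key=lambda act: Q[(s, act)])
        r, s_next = reward[(s, a)], delta[(s, a)]
        # learning rate 1: the old estimate is simply overwritten by the new target
        Q[(s, a)] = r + GAMMA * max(Q[(s_next, A1)], Q[(s_next, A2)])
        s = s_next

    print(Q)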
3. Now let's consider how the Value function changes as the discount factor γ varies between 0 and 1.
There are four deterministic policies for this environment, which can be written as π11, π12, π21 and π22, where πij(S1) = ai, πij(S2) = aj.
a. Calculate the value function Vπ(γ) : S → R for each of these four policies (keeping γ as a variable)
b. Determine for which range of values of γ each of the policies π11, π12, π21, π22 is optimal (a numerical sketch for checking both parts appears after this list)
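For part (a) the values come from solving the pair of Bellman equations Vπ(s) = R(s, π(s)) + γ Vπ(δ(s, π(s))) for s = S1, S2, keeping γ as a variable. As a numerical check of that algebra (again reusing the tables from the first sketch, and not a substitute for the symbolic answer), the following iterates those equations for chosen values of γ; sweeping γ across (0, 1) also helps sanity-check part (b).

    def evaluate_policy(policy, gamma, sweeps=1000):
        # Repeatedly apply V(s) <- R(s, pi(s)) + gamma * V(delta(s, pi(s)));
        # with gamma < 1 this converges to the value function of the policy.
        V = {S1: 0.0, S2: 0.0}
        for _ in range(sweeps):
            V = {s: reward[(s, policy[s])] + gamma * V[delta[(s, policy[s])]]
                 for s in (S1, S2)}
        return V

    # The four deterministic policies pi_ij(S1) = a_i, pi_ij(S2) = a_j
    policies = {"pi11": {S1: A1, S2: A1}, "pi12": {S1: A1, S2: A2},
                "pi21": {S1: A2, S2: A1}, "pi22": {S1: A2, S2: A2}}

    for gamma in (0.1, 0.5, 0.7, 0.9):
        print(gamma, {name: evaluate_policy(pi, gamma) for name, pi in policies.items()})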
Make sure you attempt the questions yourself, before looking at the Sample Solutions.