COMP9444 Neural Networks and Deep Learning Term 2, 2020
Solutions to Exercise 7: Reinforcement Learning
This page was last updated: 07/20/2020 10:25:31
Consider an environment with two states S = {S1, S2} and two actions A = {a1, a2}, where the (deterministic) transitions δ and reward R for each state and action are as follows:

δ(S1, a1) = S1, R(S1, a1) = +1
δ(S1, a2) = S2, R(S1, a2) = -2
δ(S2, a1) = S1, R(S2, a1) = +7
δ(S2, a2) = S2, R(S2, a2) = +3
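For concreteness, the environment can also be written down in code. Here is a minimal sketch in Python (the dictionary names delta and reward are just labels chosen here, not part of the exercise):

```python
# Deterministic two-state environment from the exercise.
# delta[s][a] gives the next state, reward[s][a] the immediate reward.
delta = {
    "S1": {"a1": "S1", "a2": "S2"},
    "S2": {"a1": "S1", "a2": "S2"},
}
reward = {
    "S1": {"a1": 1, "a2": -2},
    "S2": {"a1": 7, "a2": 3},
}
```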
1. Draw a picture of this environment, using circles for the states and arrows for the transitions.
2. Assuming a discount factor of γ = 0.7, determine:
a. the optimal policy π* : S → A
π*(S1) = a2, π*(S2) = a1
b. the value function V : S → R
V(S1) = -2 + γV(S2)
V(S2) = +7 + γV(S1)
So V(S1) = -2 + 7γ + γ²V(S1)
i.e. V(S1) = (-2 + 7γ)/(1 - γ²) = (-2 + 7 × 0.7)/(1 - 0.49) = 5.69
Then V(S2) = 7 + 0.7 × 5.69 = 10.98
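As a sanity check, these two simultaneous equations can be solved numerically; a short Python sketch, with gamma set as in the question:

```python
gamma = 0.7

# V(S1) = -2 + gamma*V(S2) and V(S2) = 7 + gamma*V(S1);
# substituting the second into the first gives the closed form below.
V_S1 = (-2 + 7 * gamma) / (1 - gamma ** 2)
V_S2 = 7 + gamma * V_S1
print(round(V_S1, 2), round(V_S2, 2))   # 5.69 10.98
```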
c. the "Q" function Q : S × A → R
Q(S1, a1) = 1 + γV(S1) = 4.98
Q(S1, a2) = -2 + γV(S2) = V(S1) = 5.69
Q(S2, a1) = 7 + γV(S1) = V(S2) = 10.98
Q(S2, a2) = 3 + γV(S2) = 10.69
Writing the Q values in a matrix, we have:

Q        a1       a2
S1      4.98     5.69
S2     10.98    10.69
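These entries can be reproduced from the relation Q(S, a) = R(S, a) + γV(δ(S, a)); a brief sketch, reusing the delta, reward, gamma, V_S1 and V_S2 defined in the sketches above:

```python
V = {"S1": V_S1, "S2": V_S2}

# Q(s, a) = R(s, a) + gamma * V(delta(s, a))
Q = {s: {a: reward[s][a] + gamma * V[delta[s][a]] for a in ("a1", "a2")}
     for s in ("S1", "S2")}
for s in ("S1", "S2"):
    print(s, {a: round(q, 2) for a, q in Q[s].items()})
# S1 {'a1': 4.98, 'a2': 5.69}
# S2 {'a1': 10.98, 'a2': 10.69}
```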
Trace through the first few steps of the Q-learning algorithm, with a learning rate of 1 and with all Q values initially set to zero. Explain why it is necessary to force exploration through probabilistic choice of actions, in order to ensure convergence to the true Q values.
With a deterministic environment and a learning rate of 1, the Q-learning update rule is

Q(S, a) ← r(S, a) + γ max_b Q(δ(S, a), b)

Let's assume the agent starts in state S1. Since the initial Q values are all zero, the first action must be chosen randomly. If action a1 is chosen, the agent will get a reward of +1 and update

Q(S1, a1) ← 1 + γ × 0 = 1
If we do not force exploration, the agent will always prefer action a1 in state S1, and will never explore action a2. This means that Q(S1, a2) will remain zero forever, instead of converging to the true value of 5.69. If we do force exploration, the next steps may look like this:

current state    chosen action    new Q value
S1               a2               -2 + γ × 0 = -2
S2               a2               +3 + γ × 0 = +3

At this point, the table looks like this:

Q       a1      a2
S1       1      -2
S2       0       3

Again, we need to force exploration, in order to get the agent to choose a1 from S2, and to again choose a2 from S1:

current state    chosen action    new Q value
S2               a1               +7 + γ × 1 = 7.7
S1               a2               -2 + γ × 7.7 = 3.39
The table then becomes:

Q       a1      a2
S1       1      3.39
S2     7.7      3
Further steps will refine the Q value estimates, and, in the limit, they will converge to their true values.
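This convergence can be simulated with a small tabular Q-learning loop in which every action is chosen at random, i.e. exploration is fully forced. A minimal sketch, reusing delta, reward and gamma from the sketches above (the number of steps, 1000, is arbitrary):

```python
import random

random.seed(0)
Q_learn = {s: {a: 0.0 for a in ("a1", "a2")} for s in ("S1", "S2")}

state = "S1"
for _ in range(1000):
    action = random.choice(("a1", "a2"))            # forced exploration
    next_state = delta[state][action]
    # learning rate 1 and a deterministic environment: overwrite the entry
    Q_learn[state][action] = (reward[state][action]
                              + gamma * max(Q_learn[next_state].values()))
    state = next_state

for s in ("S1", "S2"):
    print(s, {a: round(q, 2) for a, q in Q_learn[s].items()})
# approaches the true values: S1 -> 4.98, 5.69   S2 -> 10.98, 10.69
```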
3. Now let's consider how the value function changes as the discount factor γ varies between 0 and 1.
There are four deterministic policies for this environment, which can be written as π11, π12, π21 and π22, where πij(S1) = ai, πij(S2) = aj.
a. Calculate the value function Vπ(γ) : S → R for each of these four policies (keeping γ as a variable)
Vπ11(S1) = +1 + γ Vπ11(S1), so Vπ11(S1) = 1/(1 - γ)
Vπ11(S2) = +7 + γ Vπ11(S1) = 7 + γ/(1 - γ)
Vπ12(S1) = Vπ11(S1) = 1/(1 - γ)
Vπ12(S2) = 3/(1 - γ)
Vπ21(S1) = -2 + 7γ + γ²Vπ21(S1), so Vπ21(S1) = (-2 + 7γ)/(1 - γ²)
Vπ21(S2) = +7 - 2γ + γ²Vπ21(S2), so Vπ21(S2) = (7 - 2γ)/(1 - γ²)
Vπ22(S1) = -2 + 3γ/(1 - γ)
Vπ22(S2) = 3/(1 - γ)
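These closed-form expressions can be cross-checked against question 2: the optimal policy found there was π21 (π*(S1) = a2, π*(S2) = a1), so evaluating the π21 formulas at γ = 0.7 should reproduce V(S1) = 5.69 and V(S2) = 10.98. A sketch in Python (the function names V11 ... V22 are just labels chosen here):

```python
# Value of each policy as a pair (V(S1), V(S2)), as a function of gamma g.
def V11(g): return (1 / (1 - g), 7 + g / (1 - g))
def V12(g): return (1 / (1 - g), 3 / (1 - g))
def V21(g): return ((-2 + 7 * g) / (1 - g ** 2), (7 - 2 * g) / (1 - g ** 2))
def V22(g): return (-2 + 3 * g / (1 - g), 3 / (1 - g))

print(tuple(round(v, 2) for v in V21(0.7)))   # (5.69, 10.98)
```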
b. Determine for which range of values of γ each of the policies π11, π12, π21, π22 is optimal
π11 is optimal when
0 < Vπ11(S1) - Vπ21(S1) = ((1 + γ) - (-2 + 7γ))/(1 - γ²) = (3 - 6γ)/(1 - γ²), i.e. 0 ≤ γ ≤ 0.5
π22 is optimal when
0 < Vπ22(S2) - Vπ21(S2) = (3(1 + γ) - (7 - 2γ))/(1 - γ²) = (-4 + 5γ)/(1 - γ²), i.e. 0.8 ≤ γ < 1.0
π21 is optimal for 0.5 ≤ γ ≤ 0.8
π12 is never optimal, because it is dominated by π11 when γ < 2/3 and by π22 when γ > 0.6; since these two ranges overlap, π12 is dominated for every γ in [0, 1).
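These ranges can also be checked numerically by sweeping γ and keeping every policy that is at least as good as all the others in both states; a brief sketch, reusing the V11 ... V22 functions above (the sample values of γ are arbitrary):

```python
policies = {"pi11": V11, "pi12": V12, "pi21": V21, "pi22": V22}

for g in (0.3, 0.5, 0.65, 0.8, 0.95):
    # keep the policies whose value is maximal in both states (up to rounding)
    best = [name for name, V in policies.items()
            if all(V(g)[i] >= max(W(g)[i] for W in policies.values()) - 1e-9
                   for i in (0, 1))]
    print(g, best)
# expected: pi11 up to 0.5, pi21 between 0.5 and 0.8, pi22 from 0.8 upward
# (with ties at exactly 0.5 and 0.8); pi12 never appears
```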