COSC1127/1125 Artificial Intelligence
School of Computing Technologies Semester 2, 2021
Prof. Sebastian Sardiña
Tutorial No. 9 Reinforcement Learning
You can check this video on max/min vs. arg max/arg min, and this video on the formulas for Temporal Difference in the AIMA book.
PART 1: Passive agents ...........................................................................................
In the passive Adaptive Dynamic Programming (ADP) paradigm, an agent estimates the transition model T using tables of frequencies (model-based learning). Consider the 3 × 2 gridworld below (3 rows, 2 columns), with the policy depicted as arrows and the terminal states marked with X's.
          column 1   column 2
row 1        X          ←
row 2        ↑          ←
row 3        X          ↑
Cells are indexed by (row, column).
The passive ADP agent sees the following state transition observations:
Trial 1: (3,2) → (2,2) → (2,1) → (1,1)
Trial 2: (3,2) → (2,2) → (1,2) → (1,1)
Trial 3: (3,2) → (2,2) → (2,1) → (3,1)
Trial 4: (3,2) → (2,2) → (2,1) → (2,2) → (2,1) → (1,1)
With perceived rewards as follows:
State     Reward
(3, 2)      −1
(2, 2)      −2
(2, 1)      −2
(1, 2)      −1
(3, 1)     −10
(1, 1)     +16
(a) Compute the transition model by computing:
i. the table of state-action frequencies/counts, N(s, a);
Solution: (Note: this is a partial solution)
state (s)   action (a)   Frequency
(3, 2)          N             4
(2, 2)          W            ….
(2, 1)          N            ….
(1, 2)          W             1
ii. the table of state-action-state frequencies/counts, N(s, a, s′);
Solution: (Note: this is a partial solution)
state (s)   action (a)   next state (s′)   Frequency
(3, 2)          N             (2, 2)            4
(2, 2)          W             (2, 1)           ….
(2, 2)          W             (1, 2)           ….
(2, 1)          N             (1, 1)           ….
(2, 1)          N             (3, 1)           ….
(2, 1)          N             (2, 2)           ….
(1, 2)          W             (1, 1)            1
iii. an estimate of the transition model T, using the above tables.
Solution: (Note: this is a partial solution)
state (s)   action (a)   next state (s′)   T(s, a, s′)
(3, 2)          N             (2, 2)         4/4 = 1.0
(2, 2)          W             (2, 1)            ….
(2, 2)          W             (1, 2)            ….
(2, 1)          N             (1, 1)            ….
(2, 1)          N             (3, 1)            ….
(2, 1)          N             (2, 2)            ….
(1, 2)          W             (1, 1)         1/1 = 1.0
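These counts and the resulting maximum-likelihood estimate T(s, a, s′) = N(s, a, s′)/N(s, a) can also be computed programmatically. The following Python snippet is a minimal sketch (not part of the original worksheet); it simply encodes the fixed policy and the four observed trials from the question and tallies the tables above.

    from collections import defaultdict

    # Fixed policy from the gridworld: the action attempted in each non-terminal state.
    policy = {(3, 2): 'N', (2, 2): 'W', (2, 1): 'N', (1, 2): 'W'}

    # Observed state sequences (Trials 1-4).
    trials = [
        [(3, 2), (2, 2), (2, 1), (1, 1)],
        [(3, 2), (2, 2), (1, 2), (1, 1)],
        [(3, 2), (2, 2), (2, 1), (3, 1)],
        [(3, 2), (2, 2), (2, 1), (2, 2), (2, 1), (1, 1)],
    ]

    N_sa = defaultdict(int)    # N(s, a)
    N_sas = defaultdict(int)   # N(s, a, s')

    for trial in trials:
        for s, s_next in zip(trial, trial[1:]):
            a = policy[s]                    # the policy dictates the attempted action
            N_sa[(s, a)] += 1
            N_sas[(s, a, s_next)] += 1

    # Maximum-likelihood transition model: T(s, a, s') = N(s, a, s') / N(s, a).
    T = {(s, a, sp): n / N_sa[(s, a)] for (s, a, sp), n in N_sas.items()}

    for key, prob in sorted(T.items()):
        print(key, prob)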
(b) Using the same environment model and observations, now consider how a passive Temporal Difference agent will estimate the utility of the states. The temporal difference learning rule is:
U(s) ← U(s) + α[R(s) + γU(s′) − U(s)]
Take α = 0.5 as a constant (recall that α is the learning rate) and γ = 1. Compute:
i. the utilities after the first trial;
Solution: (Note: this is a partial solution)
Initially, we set U[s] = 0 for all states s ∈ S.
• The starting state is (3, 2), with R[(3, 2)] = −1 and U[(3, 2)] = 0.
• First observation is (3, 2) → (2, 2). Update U[(3, 2)]:
  U[(3, 2)] = U[(3, 2)] + α(R[(3, 2)] + γU[(2, 2)] − U[(3, 2)]) = 0 + 0.5(−1 + 0 − 0) = −0.5
• Second observation is (2, 2) → (2, 1). Update U[(2, 2)]:
  U[(2, 2)] = U[(2, 2)] + α(R[(2, 2)] + γU[(2, 1)] − U[(2, 2)]) = ….
• Third observation is (2, 1) → (1, 1). Update U[(2, 1)]:
  U[(2, 1)] = ….
            = ….
You can check THIS VIDEO on how to solve this problem, in which we initially set U[s] = R[s] for all states s ∈ S (following the book by Russell & Norvig, “Artificial Intelligence: A Modern Approach”, 3rd edition).
ii. the utilities after the first and second trials.
Solution: (Note: this is a partial solution)
• First observation is (3, 2) → (2, 2). Update U[(3, 2)]:
  U[(3, 2)] = U[(3, 2)] + α(R[(3, 2)] + γU[(2, 2)] − U[(3, 2)])
            = −0.5 + 0.5(−1 − 1 + 0.5) = −1.25
• Second observation is (2, 2) → (1, 2). Update U[(2, 2)]:
U[(2, 2)] = U[(2, 2)] + α(R[(2, 2)] + γU[(1, 2)] − U[(2, 2)]) = ….
• Third observation is (1, 2) → (1, 1). Update U[(1, 2)]:
  U[(1, 2)] = …. = ….
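The arithmetic above is easy to double-check in code. The snippet below is a minimal illustrative sketch, not part of the worksheet: it encodes the rewards from Part 1, defines the TD rule as a helper, and reproduces the first update of Trial 1; the remaining observations are replayed the same way (see the linked video for how terminal states are treated).

    # Perceived rewards from Part 1.
    R = {(3, 2): -1, (2, 2): -2, (2, 1): -2, (1, 2): -1, (3, 1): -10, (1, 1): 16}

    ALPHA, GAMMA = 0.5, 1.0

    def td_update(U, s, s_next):
        """One temporal-difference update for the observed transition s -> s_next."""
        U[s] = U[s] + ALPHA * (R[s] + GAMMA * U[s_next] - U[s])

    # Start with U = 0 for every state, as in the partial solution above.
    U = {s: 0.0 for s in R}

    # Trial 1, first observation: (3, 2) -> (2, 2).
    td_update(U, (3, 2), (2, 2))
    print(U[(3, 2)])   # -0.5, matching the worked update above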
(c) (optional) Consider the “occasionally-random” and exploration function methods to strike a balance between exploitation and exploration. Recall that in the “occasionally-random” approach, 1/t of the time the agent selects a random action, and otherwise it follows the greedy policy. What would a high t value mean? What about a low value of t?
Contrast this with the exploration function concept. As an example, consider this exploration function:
  f(u, n) = R+   if n < Ne
          = u    otherwise
What do high/low settings for R+ and Ne result in?
Solution:
For the occasionally-random scheme, a high t value means we rarely select a random action, resulting in less exploration: the agent is more likely to (greedily) select what it considers the best action under its current policy, i.e., more exploitation of its current estimate of the environment/world. Conversely, a low t value means more focus on exploration over exploitation.
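As a concrete, purely illustrative sketch of the occasionally-random scheme, the helper below selects a random action with probability 1/t and otherwise exploits; the parameters actions and greedy_action are hypothetical stand-ins for whatever the agent already maintains.

    import random

    def choose_action_occasionally_random(state, t, actions, greedy_action):
        """Explore with probability 1/t, otherwise follow the current greedy policy."""
        if random.random() < 1.0 / t:
            return random.choice(actions)   # explore: uniformly random action
        return greedy_action(state)         # exploit: current best-known action

A large t pushes 1/t towards 0 (mostly exploitation), while a small t pushes it towards 1 (mostly exploration), as discussed above.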
Using the exploration function given in the textbook and lectorials, the first case/condition encourages exploration, while the second encourages exploitation. Hence a high value of R+ means the agent has an optimistic view of unknown or rarely visited states: the higher it is, the more optimistic. Unvisited states have their utilities/Q-values set to this high value. However, if this view is far from reality, the update rules will bring it back towards the true value, although this will take longer. Similarly, a small value of R+ means the agent has a pessimistic view of unknown or rarely visited states. Ne determines for how long we stay in exploration mode: the higher Ne is, the more willing the agent is to keep exploring a state and its associated actions.
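The exploration function itself is a one-liner. A minimal sketch is shown below, using the R+ = +10 and Ne = 1 settings that Part 2 adopts (the constant names are illustrative only).

    R_PLUS = 10   # optimistic value assumed for rarely-tried state-action pairs
    N_E = 1       # try each state-action pair at least this many times

    def exploration_f(u, n):
        """f(u, n): return the optimistic R+ while (s, a) has been tried fewer than Ne times."""
        return R_PLUS if n < N_E else u

Raising R_PLUS makes the optimism stronger, while raising N_E keeps the agent in exploration mode for longer.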
PART 2: Q-learning .................................................................................
(a) Consider the grid world from Part 1, but now no policy is specified. With all Q-values initialised at 0, apply the Q-learning algorithm taught in class and in the textbook to learn the Q-values of:
i. all state-action pairs for the first trial/episode;
Solution:
• Start at state (3, 2), R[(3, 2)] = −1.
No previous state, so we don't need to update any Q-values. Instead, given that all actions and next states have the same Q-value and we haven't visited any of them yet, we randomly select an action, say west (W), and arrive at (3, 1).
• We move to (3, 1), so the current state is (3, 1) and the previous state is (3, 2). Recall that for terminal states we update the Q-values of all actions to the terminal state's reward, but only after updating the Q-value of the previous state. Update Q[(3, 2), W]:
Q[(3, 2), W] = Q[(3, 2), W] + α(R[(3, 2)] + γ max_a′ Q[(3, 1), a′] − Q[(3, 2), W])
             = 0 + 0.5(−1 + 0 − 0) = −0.5
We update Q[(3,1),N] = Q[(3,1),S] = Q[(3,1),E] = Q[(3,1),W] = −10.
ii. all state-action pairs for the second trial/episode, given the first.
Solution:
• Start at state (3, 2), R[(3, 2)] = −1.
No previous state, so we don’t need to update any Q-values.
Instead, given that the Q-value for west is −0.5 and we haven't gone north from (3, 2) (so our exploration function will return a value of +10), we go north (N) and arrive at (2, 2).
• We move to (2, 2), so the current state is (2, 2) and the previous state is (3, 2). This is a non-terminal state. Update Q[(3, 2), N]:
Q[(3, 2), N] = Q[(3, 2), N] + α(R[(3, 2)] + γ max_a′ Q[(2, 2), a′] − Q[(3, 2), N])
             = 0 + 0.5(−1 + 0 − 0) = −0.5
From (2, 2), all actions Q[(2, 2), ∗] are equally good according to the exploration function, hence we randomly select one. Say we select north (N) and go to (1, 2).
• We move to (1, 2), so the current state is (1, 2) and the previous state is (2, 2). This is a non-terminal state. Update Q[(2, 2), N]:
Q[(2, 2), N] = Q[(2, 2), N] + α(R[(2, 2)] + γ max_a′ Q[(1, 2), a′] − Q[(2, 2), N])
             = 0 + 0.5(−2 + 0 − 0) = −1
From (1, 2), all actions Q[(1, 2), ∗] are equally good according to the exploration function, hence we randomly select one. Say we select west (W) and go to (1, 1).
• We move to (1, 1), so the current state is (1, 1) and the previous state is (1, 2). (1, 1) is a terminal state, so we will update the Q-values of all its actions to its reward, but only after updating the Q-value of the previous state. Update Q[(1, 2), W]:
  Q[(1, 2), W] = Q[(1, 2), W] + α(R[(1, 2)] + γ max_a′ Q[(1, 1), a′] − Q[(1, 2), W])
               = 0 + 0.5(−1 + 0 − 0) = −0.5
Afterwards, we also update Q[(1,1),N] = Q[(1,1),S] = Q[(1,1),E] = Q[(1,1),W] = +16.
You can check THIS VIDEO on how to solve this problem.
Recall, the learning rule is:
Q(s, a) ← Q(s, a) + α[R(s) + γ max_a′ Q(s′, a′) − Q(s, a)]
The algorithm in the textbook (which we follow in this course) does not update the Q-values of terminal states. Instead, when the next state s′ is a terminal state, rather than just setting all previous states, actions and rewards to null, we additionally set all Q-values of that terminal state to its reward. E.g., for state (1,1) the reward is +16, so the first time we visit (1,1) we set Q[(1,1),N] = Q[(1,1),S] = Q[(1,1),E] = Q[(1,1),W] = +16.
To simplify calculations, for this question assume actions are deterministic, i.e., if the agent goes west, it will go west.
Similar to the TD learning question, use α = 0.5 and γ = 1. For the exploration function, use the one described in Question 1(c), with R+ = +10 and Ne = 1. In reality the starting state can vary, but for this question assume we always start at (3, 2). Note that several answers are possible, depending on the action selected when breaking ties.
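The per-step rule and the terminal-state convention above can be sketched in a few lines of Python. This is an illustrative sketch only (the helper names and the random tie-breaking are not prescribed by the worksheet); it uses the same α, γ, R+ and Ne as the question.

    import random
    from collections import defaultdict

    ALPHA, GAMMA = 0.5, 1.0
    R_PLUS, N_E = 10, 1
    ACTIONS = ['N', 'S', 'E', 'W']

    Q = defaultdict(float)   # Q[(s, a)], all zeros initially
    N = defaultdict(int)     # visit counts N[(s, a)] for the exploration function
                             # (incremented each time action a is actually executed in s)

    def f(u, n):
        """Exploration function from Question 1(c)."""
        return R_PLUS if n < N_E else u

    def q_update(s, a, r, s_next, terminal_reward=None):
        """One Q-learning update for the transition (s, a) -> s_next, with R(s) = r."""
        best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        if terminal_reward is not None:
            # Textbook convention used in this worksheet: on reaching a terminal
            # state, set all of its Q-values to its reward.
            for a2 in ACTIONS:
                Q[(s_next, a2)] = terminal_reward

    def choose_action(s):
        """Pick the action maximising f(Q[(s, a)], N[(s, a)]), breaking ties at random."""
        scores = {a: f(Q[(s, a)], N[(s, a)]) for a in ACTIONS}
        best = max(scores.values())
        return random.choice([a for a in ACTIONS if scores[a] == best])

For instance, the first update of Trial 1 above corresponds to q_update((3, 2), 'W', -1, (3, 1), terminal_reward=-10), which leaves Q[((3, 2), 'W')] at −0.5 and sets all Q-values of (3, 1) to −10.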
(b) For the following scenarios, briefly describe how we can model them as reinforcement learning problems (states, actions, types of rewards, etc.).
i. Learning to play chess.
ii. Mary is about to graduate, and she decides to plan her finances for the next 40 years (until retirement). Consider what a reinforcement learning model might look like for financial planning.
(c) Consider chess. If we wanted to approximate the utility of the states using a sum of weighted features, discuss the type of features that might be useful for playing chess.
Solution (part i):
States could be all the possible piece configurations. Actions are the moves of each individual piece. Non-terminal rewards could be the value of a piece taken/lost, and terminal states (win or lose) can have a large positive or negative reward.
Solution (part ii):
This is an interesting problem, because time is involved. Investing at 20 years old will likely yield smaller returns (rewards) than investing at 40 years old, under the assumption that a person has more money to invest the older they get. To reflect this, the states can be Mary at different ages, the actions are different types of investment, and the rewards are tied to a state (age) and an action (investment). There is a variant of Q-learning that associates rewards with a state-action pair, and it would be appropriate to use such a variant.
Solution (part c):
Remember that the utility is an indication of the “usefulness” of a state; in the case of chess, it evaluates whether a state can lead to a winning or losing strategy. In this context, some (non-exhaustive) features could be:
• Number of pieces the agent has;
• Number of pieces the opponent has;
• Total value of the pieces (Queen = 9, Rook = 5, etc.) the agent has (similarly for the opponent);
• Number of squares the agent's King can move to (if not many, the King might not have much opportunity to break a check);
• Number of moves for the pawn closest to being promoted;
• Many more!
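To make the weighted-feature idea concrete, here is a minimal sketch of a linear evaluation U(s) ≈ w1·f1(s) + w2·f2(s) + …; the feature names and weights are hypothetical placeholders for items like those listed above, not an implementation from the worksheet.

    def approx_utility(state, features, weights):
        """Linear function approximation: U_hat(s) = sum_i w_i * f_i(s)."""
        return sum(w * f(state) for f, w in zip(features, weights))

    # Hypothetical chess features; each would be a small function of the position, e.g.
    #   material_balance(state)     -> agent's total piece value minus the opponent's
    #   king_mobility(state)        -> number of squares the agent's King can move to
    #   pawn_promotion_steps(state) -> moves needed by the most advanced pawn to promote
    # utility = approx_utility(position,
    #                          [material_balance, king_mobility, pawn_promotion_steps],
    #                          [1.0, 0.1, -0.2])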
End of tutorial worksheet