COMP3702 Artificial Intelligence
Semester 2, 2021
Tutorial 7
Before you begin, please note:
• Tutorial exercises are provided to help you understand the materials discussed in class, and to improve
your skills in solving AI problems.
• Tutorial exercises will not be graded. However, you are highly encouraged to do them for your own
learning. Moreover, we hope you get the satisfaction of solving these problems.
• The skills you acquire in completing the tutorial exercises will help you complete the assignments.
• You’ll get the best learning outcome when you try to solve these exercises on your own first (before
your tutorial session), and use your tutorial session to ask about the difficulties you face when trying to
solve this set of exercises.
Exercises
Exercise 7.1. In general, an MDP’s objective is:
    E[ ∑_{t=0}^∞ γ^t R(s_t, a_t) ].
The value function of an MDP, V^π(s), is the expected discounted future reward of following an (arbitrary) policy, π, starting from state s, given by:
    V^π(s) = ∑_{s′∈S} P(s′ | π(s), s) [R(s, π(s), s′) + γ V^π(s′)],
where the policy π(s) determines the action taken in state s. Here we have dropped the time index, as it is redundant, but note that a_t = π(s_t). Also note that R(s, a) = ∑_{s′} P(s′ | s, a) R(s, a, s′).
Question: Derive V π from the MDP objective function.
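As a hint only (one possible route, sketched and not fully justified; you should verify the step where the expectation is exchanged with the sum over s′):

    V^π(s) = E[ ∑_{t=0}^∞ γ^t R(s_t, a_t) | s_0 = s, a_t = π(s_t) ]
           = R(s, π(s)) + γ ∑_{s′∈S} P(s′ | π(s), s) E[ ∑_{t=0}^∞ γ^t R(s_t, a_t) | s_0 = s′ ]
           = R(s, π(s)) + γ ∑_{s′∈S} P(s′ | π(s), s) V^π(s′)
           = ∑_{s′∈S} P(s′ | π(s), s) [R(s, π(s), s′) + γ V^π(s′)],

where the second line splits off the t = 0 term, conditions on the first transition s_1 = s′, and re-indexes time, and the last line uses R(s, a) = ∑_{s′} P(s′ | s, a) R(s, a, s′).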
Exercise 7.2. Consider the gridworld below:
An agent is currently on grid cell s2, as indicated by the star, and would like to collect the rewards that lie
on both sides of it. If the agent is on a numbered square (0, 5 or 10), the instance terminates and the agent
receives a reward equal to the number on the square. On any other (non-numbered) square, its available
actions are to move Left and Right. Note that Up and Down are never available actions. If the agent is in a
square with an adjacent square below it, it does not always move successfully: when the agent is in one of
these squares and takes a move action, it will only succeed with probability p. With probability 1 − p, the
move action will fail and the agent will instead fall downwards into a trap. If the agent is not in a square with an adjacent square below it, it will always move successfully.
a) Consider the policy π_R, which is to always move right when possible. For each state s ∈ {s1, s2, s3, s4} in the diagram above, give the value function V^{π_R}(s) in terms of γ ∈ [0, 1] and p ∈ [0, 1].
b) Consider the policy π_L, which is to always move left when possible. For each state s ∈ {s1, s2, s3, s4} in the diagram above, give the value function V^{π_L}(s) in terms of γ and p.
Example Gridworld domain
Consider the gridworld below:
[Gridworld figure: a 4 × 3 grid with terminal rewards 1 at [3,2] and -100 at [3,1], and an obstacle at [1,1].]
States in this environment are the positions on the tiles. The world is bounded by a boundary wall, and
there is one obstacle, at [1,1] (using python or zero indexing starting from the bottom left corner). In addition,
there are two terminal states, indicated by the coloured squares.
Terminal states: In practice, we would say that the two coloured tiles are terminal states of the MDP. For
mathematical convenience, we often prefer to deal with an infinite horizon MDP (see more below). To convert
this model to an infinite horizon MDP, we will add an extra pseudo-state to the list of MDP states called
exited, which is absorbing (the agent cannot leave it irrespective of its action, i.e. T (exited, a, exited) = 1
for all a).
Actions and Transitions: In this world, an agent can generally choose to move in four directions: up, down, left and right (which we might sometimes refer to as ^, v, < and >, respectively). However, the agent moves in its chosen direction successfully only with probability p = 0.8, and slips perpendicular to its chosen direction with probability 0.1 in each perpendicular direction. If it hits a wall or the obstacle, the agent stays where it is. In addition, once the agent arrives on a coloured square with a value, it has only one special action available to it; that is, to exit the environment.
Rewards: The values stated on the coloured squares are the reward for exiting the square and the
environment, so the reward is not repeatedly earned; that is, R([3, 2], exit) = 1, and R([3, 1], exit) = −100.
All other states have 0 reward.
Discount factor: γ = 0.9.
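For the exercises that follow, one possible plain-Python encoding of this domain is sketched below. All names (GRID_WIDTH, ACTIONS, OBSTACLES, REWARDS, EXIT_STATE, STATES, and so on) are illustrative choices of ours and are not taken from the provided starter code.

# A minimal encoding of the example gridworld (illustrative names only).
GRID_WIDTH, GRID_HEIGHT = 4, 3             # positions are (x, y), zero-indexed from the bottom left
ACTIONS = ['^', 'v', '<', '>']             # up, down, left, right
OBSTACLES = [(1, 1)]                       # blocked cell
REWARDS = {(3, 2): 1.0, (3, 1): -100.0}    # reward for exiting these squares
EXIT_STATE = 'exited'                      # absorbing pseudo-state
GAMMA = 0.9                                # discount factor
P_SUCCESS = 0.8                            # probability of moving as intended
P_SLIP = 0.1                               # probability of each perpendicular slip

# All valid positions, plus the absorbing exited pseudo-state.
STATES = [(x, y) for x in range(GRID_WIDTH) for y in range(GRID_HEIGHT)
          if (x, y) not in OBSTACLES] + [EXIT_STATE]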
Exercises on Gridworld
You can utilise the sample starter code GridWorld starter.py to help solve the following exercises.
Exercise 7.3. Define and code a function that applies a move to the gridworld environment; call it apply_move().
apply_move() should do two things: (i) modify the map, according to the state transition function given,
and (ii) return a reward. A hedged sketch is given below.
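As one possible reading of this exercise, the sketch below samples a single stochastic transition and returns the next state and reward, rather than mutating a map array; it reuses the illustrative constants defined in the previous sketch.

import random

def apply_move(state, action):
    """Sample one transition and return (next_state, reward).

    Assumes the illustrative constants (REWARDS, EXIT_STATE, OBSTACLES,
    GRID_WIDTH, GRID_HEIGHT, P_SUCCESS, P_SLIP) from the sketch above.
    """
    # On a coloured square the only action is to exit and collect the reward.
    if state in REWARDS:
        return EXIT_STATE, REWARDS[state]
    if state == EXIT_STATE:
        return EXIT_STATE, 0.0

    # Sample the realised direction: intended with prob 0.8, else a perpendicular slip.
    perpendicular = {'^': ['<', '>'], 'v': ['<', '>'], '<': ['^', 'v'], '>': ['^', 'v']}
    u = random.random()
    if u < P_SUCCESS:
        direction = action
    elif u < P_SUCCESS + P_SLIP:
        direction = perpendicular[action][0]
    else:
        direction = perpendicular[action][1]

    # Attempt the move; collisions with the boundary or the obstacle leave the agent in place.
    moves = {'^': (0, 1), 'v': (0, -1), '<': (-1, 0), '>': (1, 0)}
    dx, dy = moves[direction]
    nx, ny = state[0] + dx, state[1] + dy
    if not (0 <= nx < GRID_WIDTH and 0 <= ny < GRID_HEIGHT) or (nx, ny) in OBSTACLES:
        return state, 0.0
    return (nx, ny), 0.0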
a) What is the one-step probability of arriving in state [1,0] from each state s after taking action a, i.e.
what is P([1, 0] | a, s) ∀ (a, s)?
b) What is the one-step probability of arriving in each state s′ when starting from [0,0], for each a, i.e.
what is P(s′ | a, [0, 0]) ∀ (a, s′)?
c) Write a function that can compute the probabilities of arriving in any state s’ from an initial state s
after taking action a, i.e. P (s′|a, s). Rather than hard-coding the probabilities, check if each action
moves to a neighbour state or if it results in any collisions with the boundary or the obstacle. Then
use this information to compute the transition probabilities. Hint: You may find some of the tricks for
8-puzzle from Tutorial 2 useful, e.g. using linear indexing and logical tests for collisions. You may also
wish to parameterise this function so that p can be varied too.
d) Write a new function that processes the gridworld and stores all of the transition probabilities, using the
transition-probability function you have developed above. The aim is to have functions
that can be used on an arbitrary gridworld (i.e. do not hard-code your function just for this problem
instance!). A hedged sketch of c) and d) follows below.
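As flagged in part d), one possible sketch of parts c) and d) is given below, again reusing the illustrative constants from the earlier sketches and parameterising the success probability p.

def get_transition_probs(state, action, p=0.8):
    """Return {next_state: probability} for taking `action` in `state`.

    The slip probability (1 - p) is assumed to be split evenly between the two
    perpendicular directions. Uses the illustrative constants from the earlier sketches.
    """
    # Coloured squares are treated as having only the exit action available,
    # and the exited pseudo-state is absorbing.
    if state in REWARDS or state == EXIT_STATE:
        return {EXIT_STATE: 1.0}

    perpendicular = {'^': ['<', '>'], 'v': ['<', '>'], '<': ['^', 'v'], '>': ['^', 'v']}
    moves = {'^': (0, 1), 'v': (0, -1), '<': (-1, 0), '>': (1, 0)}

    probs = {}
    outcomes = [(action, p)] + [(d, (1 - p) / 2) for d in perpendicular[action]]
    for direction, prob in outcomes:
        dx, dy = moves[direction]
        nx, ny = state[0] + dx, state[1] + dy
        # Collisions with the boundary or the obstacle leave the agent in place.
        if not (0 <= nx < GRID_WIDTH and 0 <= ny < GRID_HEIGHT) or (nx, ny) in OBSTACLES:
            next_state = state
        else:
            next_state = (nx, ny)
        probs[next_state] = probs.get(next_state, 0.0) + prob
    return probs


def build_transition_table(p=0.8):
    """Pre-compute T[s][a] = {s': P(s' | s, a)} for every state and action."""
    return {s: {a: get_transition_probs(s, a, p) for a in ACTIONS} for s in STATES}


# Example usage (relates to part b)): distribution over s' from [0, 0] for each action.
for a in ACTIONS:
    print(a, get_transition_probs((0, 0), a))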
Exercise 7.4. Implement VI for this problem, using the update:
    V[s] ← max_a ∑_{s′} P(s′ | s, a) (R(s, a, s′) + γ V[s′]).
Note that R(s, a, s′) = R(s) here, i.e. the reward is attached to the square s being exited (consistent with the reward definition above), and is non-zero only for the red and blue squares at [3,1] and [3,2], respectively.
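For reference, a minimal value-iteration sketch is given below. It assumes a transition table T built with build_transition_table() from the sketch above, the illustrative STATES, REWARDS and GAMMA constants, and the exit-reward convention R(s, a, s′) = R(s) just described; these names are our own and may differ from the starter code.

def value_iteration(states, T, rewards, gamma=0.9, iterations=1000):
    """Synchronous VI; returns (V, policy) after `iterations` sweeps.

    T[s][a] is a dict {s': P(s' | s, a)}, e.g. from build_transition_table();
    rewards[s] is the exit reward R(s) (0 for ordinary squares).
    """
    V = {s: 0.0 for s in states}
    policy = {s: None for s in states}
    for _ in range(iterations):
        new_V = {}
        for s in states:
            best_value, best_action = float('-inf'), None
            for a, outcomes in T[s].items():
                # R(s, a, s') = R(s): the reward is attached to the square being exited.
                q = sum(prob * (rewards.get(s, 0.0) + gamma * V[s2])
                        for s2, prob in outcomes.items())
                if q > best_value:
                    best_value, best_action = q, a
            new_V[s] = best_value
            policy[s] = best_action
        V = new_V
    return V, policy


# Example usage (assumes the earlier sketches):
# T = build_transition_table(p=0.8)
# V4, pi4 = value_iteration(STATES, T, REWARDS, gamma=GAMMA, iterations=4)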
a) What is the value function estimate after 4 iterations? What is the policy according to the value function
estimate after 4 iterations?
b) What is the value function estimate after 10 iterations? What is the policy according to the value
function estimate after 10 iterations?
c) What is the value function estimate after 1000 iterations? What is the policy according to the value
function estimate after 1000 iterations?
d) What is the largest difference in the value function estimates for any state between iterations 999 and
1000? Based on this, can you say that the algorithm has converged?
e) How long does 1000 iterations of VI take?
Exercise 7.5.
a) Let P^π ∈ R^{|S|×|S|} be a matrix containing the probabilities for each transition under the policy π, where:
    P^π_{ij} = P(s_{t+1} = j | s_t = i, a_t = π(s_t)).
What is the size of P^π in this gridworld, when the special pseudo-state exited is included?
b) Set the policy to move right everywhere, π(s) = > ∀ s ∈ S. Calculate the row of P^> corresponding to
an initial state at the bottom left corner, [0,0], of the gridworld.
c) Write a function that calculates the row of P^π corresponding to an initial state at the bottom left
corner, [0,0], of the gridworld for any action, i.e. for any deterministic choice of π([0, 0]).
d) Turn this into a function that computes P^π for any deterministic π. Note that P([3, 1], a, exited) =
P([3, 2], a, exited) = P(exited, a, exited) = 1 for all a, so the rows of P^π for these states s are all
zeros except for a 1 corresponding to the transition to s′ = exited.
e) Compute
    V^> = (I − γ P^>)^{-1} r.
To do this, define r as the vector of exit rewards, r_s = R(s), which is non-zero only for the red and blue
squares at [3,1] and [3,2], respectively. (Hint: Rather than explicitly computing the
matrix inverse, you may want to use numpy.linalg.solve() for large |S|. This uses an LU
decomposition to solve for V^π with forward and backward substitution, so it can result in a big speed-up.)
f) Using the policy evaluation above, implement PI for this problem, following the procedure below (a Python sketch covering parts d) to f) is given after part g)):
• Set π_0(s) = > ∀ s ∈ S, and let i = 0
• Repeat:
    1. Solve for V^{π_i}(s) (or Q^{π_i}(s, a)):
        V^{π_i}(s) = ∑_{s′∈S} P(s′ | π_i(s), s) [R(s, π_i(s), s′) + γ V^{π_i}(s′)]   ∀ s ∈ S
    2. Update the policy:
        π_{i+1}(s) ← argmax_a ∑_{s′∈S} P(s′ | a, s) [R(s, a, s′) + γ V^{π_i}(s′)]
    3. i ← i + 1
• until π_i(s) = π_{i−1}(s) for all s ∈ S
g) How many iterations does PI take to converge? How long does each iteration take?
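Finally, as referenced in part f), one way to put parts d) to f) together is sketched below, reusing the illustrative STATES, REWARDS, GAMMA and build_transition_table() from the earlier sketches and numpy for the linear solve; this is a sketch under those assumptions, not the required implementation.

import numpy as np

def policy_iteration(states, T, rewards, gamma=0.9, initial_action='>'):
    """PI with exact policy evaluation via a linear solve; returns (V, policy, iterations)."""
    idx = {s: k for k, s in enumerate(states)}             # state -> row/column index
    n = len(states)
    r = np.array([rewards.get(s, 0.0) for s in states])    # exit reward vector, r_s = R(s)
    policy = {s: initial_action for s in states}
    i = 0

    while True:
        # 1. Policy evaluation: build P_pi and solve (I - gamma * P_pi) V = r.
        P_pi = np.zeros((n, n))
        for s in states:
            for s2, prob in T[s][policy[s]].items():
                P_pi[idx[s], idx[s2]] = prob
        V = np.linalg.solve(np.eye(n) - gamma * P_pi, r)

        # 2. Policy improvement: greedy one-step lookahead with respect to V.
        new_policy = {}
        for s in states:
            q_values = {a: sum(prob * (rewards.get(s, 0.0) + gamma * V[idx[s2]])
                               for s2, prob in T[s][a].items())
                        for a in T[s]}
            new_policy[s] = max(q_values, key=q_values.get)

        # 3. Stop once the policy no longer changes.
        i += 1
        if new_policy == policy:
            return V, policy, i
        policy = new_policy


# Example usage (assumes the earlier sketches):
# T = build_transition_table(p=0.8)
# V, pi, n_iters = policy_iteration(STATES, T, REWARDS, gamma=GAMMA)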