CSE 473: Introduction to Artificial Intelligence
Assignment 6: AEMS

Heaven or hell
I can’t tell.
In this assignment you will implement an online POMDP solver, named AEMS2.
Introduction
As we have seen in class, in a POMDP the agent cannot directly observe the world-states. Instead, the agent maintains a posterior probability distribution over all states, representing how likely it believes it is to be in each state. This posterior is called a belief state and is defined as follows:
$$b_t(s) = P(s \mid s_0, a_0, z_1, a_1, \ldots, a_{t-1}, z_t)$$
where $b_t(s)$ indicates how likely the agent thinks it is to be in state $s$.
Now, we can think of the discrete world-states in the POMDP as being ordered by some ordering. We can then "name" each state with a number that represents its position in that list, with the first discrete state being 0 and the last being $|S| - 1$. This allows us to write the belief state instead as a vector (a list of numbers), where the $i$-th entry of that vector has the value $b_t(s_i)$ (the probability of being in that state). This vector fully represents the agent's belief state.
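For example, a belief over three states can be stored as a plain NumPy array (a minimal illustration; the variable names are ours, not part of the assignment code):

```python
import numpy as np

# Belief over 3 states: 60% state 0, 30% state 1, 10% state 2.
b = np.array([0.6, 0.3, 0.1])

# A valid belief vector is non-negative and sums to 1.
assert np.all(b >= 0) and np.isclose(b.sum(), 1.0)

# b[i] is the probability the agent assigns to being in state s_i.
print(b[0])  # 0.6
```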
The belief state represents the agent’s partial understanding of which state of the world it is actually in. Therefore, a POMDP may actually be seen as an MDP, whose state space consists of the belief state vectors we defined earlier. This means that in theory we could apply the same techniques used in solving MDPs to find the optimal policy of a POMDP model.
However, the belief state space is a continuous, real-valued probability distribution over states, meaning there are infinitely many possible belief state vectors. As a result, standard MDP solvers such as VI, PI, and even RTDP are too computationally expensive. In fact, this is true even when we know the world dynamics encoded by the transition, reward, and observation functions. While we may use the same fundamental principles of search such as greedy action selection and admissible heuristics, we need to develop more complex algorithms with better convergence rates to operate on POMDP models if we want to have any chance of solving them.
Implementation
Important: For this part of the assignment, you will need to have NumPy installed.
If you don’t already have NumPy installed, you may install it through your distribution package manager if you are using Linux.
Alternatively, you can install NumPy using pip: pip install numpy (you may need to use pip3), or conda: conda install numpy.
Please use this link to download the zip file for the assignment code.
We will provide you the model parser, an environment simulator (not as fancy as Pacman though), and vectors for a tight lower bound. You will be asked to implement an upper bound, a loose lower bound, and most importantly AEMS2 for an infinite horizon POMDP.
Files you’ll edit:
mdpSolver.py
Both of the heuristics you implement.
aems.py
Your implementation of AEMS2
Files you will not edit:
pomdp.py
Stores the model in a POMDP class (learn how T, O, and R are represented)
offlineSolver.py
General offline solver for POMDPs (look at the abstract methods)
onlineSolver.py
General online solver for POMDPs (look at the abstract methods)
environment.py
Changes the real state of the environment according to the agent's action; simulates and returns an observation along with the immediate reward gained
policyReader.py
Gives the value function or the best action based on a provided policy (used for PBVI as the lower bound).
evaluate.py
Evaluate a POMDP solver by running it multiple times.
test.py
Test and grade! You can check the tests in the test folder.
The above builds on a general POMDP solver that works with any model expressed in the common POMDP file format developed by Anthony Cassandra (see http://www.pomdp.org/code/pomdp-file-spec.html).
A few environments are provided in /examples/env:
Tiger
Two closed doors, opening one gives you a large reward. Behind the other one is a tiger! (Full description: Kaelbling et al., 1998)
Hallway
A navigation environment with 57 states. (Full description: QMDP paper by Littman et al., 1995)
Hallway2
A navigation environment with 57 states. (Full description: QMDP paper by Littman et al., 1995)
TagAvoid
Laser tag game for robots! (Full description: PBVI paper by Pineau et al., 2003)
RockSample_4_4
Scientific exploration by a rover (Full description: HSVI paper by Smith & Simmons, 2004)
You do not need to worry too much about the format. Just take a look at the QMDP class in mdpSolver.py, which documents some of the properties of the POMDP object. Remember that we only work with indices in the code, so actions, states, and observations are represented by their index numbers, not their names.
Important: The parameter named “precision” determines the computing precision of your methods (including deciding convergence). You can use it to avoid extra computation in value iteration or heuristic updates.
Question 1 (5 points) – QMDP – An upper bound attempt at a POMDP solution
QMDP is one of the very first algorithms suggested for solving POMDPs. It can be computed quickly and offline (without the need for simulation). However, it is not a very good approximation. Therefore, QMDP is often used to provide a bound for other online solvers, including the AEMS solver that we will implement at the end.
To describe it simply, QMDP assumes that after one action, the agent will learn with certainty which state it is in (this is an optimistic assumption, and thus it provides an upper bound). As a result, after just one action the problem is reduced to a standard MDP. Since we know all the parameters of the MDP underlying the POMDP, we can solve this MDP using standard tools like value iteration:
$$Q_{\mathrm{MDP}}(s,a) = \sum_{s'} T(s,a,s')\,\bigl[R(s,a,s') + \gamma\, V_{\mathrm{MDP}}(s')\bigr] = E[R(s,a)] + \gamma \sum_{s'} T(s,a,s')\, V_{\mathrm{MDP}}(s')$$
Thus QMDP boils down to solving the underlying MDP to produce the $Q_{\mathrm{MDP}}$ values above, and then constructing the POMDP solution as follows:
$$Q_{\mathrm{QMDP}}(b_t, a) = \sum_{s} b_t(s)\, Q_{\mathrm{MDP}}(s,a)$$
$$V_{\mathrm{QMDP}}(b_t) = \max_{a} Q_{\mathrm{QMDP}}(b_t, a)$$
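Once the MDP Q-values are known, this belief-weighted step is just a matrix-vector product. A minimal sketch, assuming (our assumption, not the assignment's) that `Q_mdp` is a |A| x |S| NumPy array and `b` is a length-|S| belief vector:

```python
import numpy as np

def qmdp_values(Q_mdp, b):
    """Return Q_QMDP(b, a) for every action a.

    Q_mdp: assumed |A| x |S| array of Q_MDP(s, a); b: length-|S| belief vector.
    """
    return Q_mdp @ b            # for each action a: sum_s b(s) * Q_MDP(s, a)

def qmdp_value_and_action(Q_mdp, b):
    q = qmdp_values(Q_mdp, b)
    return q.max(), int(q.argmax())   # V_QMDP(b) and the greedy action index
```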
Implement the QMDP class in mdpSolver.py. Although it is offline, you should try to make it as fast as possible. Consider using vector operations in numpy.
Hint: In a POMDP, rewards are given as $R(s,a,s',o)$, which depends on the state reached after the action and on the observation. How would you construct $E[R(s,a)]$ using the POMDP rewards, observation probabilities, and transition probabilities?
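One possible way to do this (a sketch only, under assumed array layouts `T[a, s, s']`, `O[a, s', o]`, and `R[a, s, s', o]`; check how pomdp.py actually stores these before borrowing anything):

```python
import numpy as np

def expected_reward(T, O, R):
    """E[R(s,a)] = sum over s' and o of T(s,a,s') * O(s',a,o) * R(s,a,s',o).

    Assumed (hypothetical) layouts: T[a, s, s'], O[a, s', o], R[a, s, s', o].
    Returns an |A| x |S| array.
    """
    return np.einsum('asz,azo,aszo->as', T, O, R)

def qmdp_q_values(T, O, R, gamma, precision=0.01):
    """Value-iterate the underlying MDP and return Q_MDP as an |A| x |S| array."""
    ER = expected_reward(T, O, R)
    V = np.zeros(ER.shape[1])                          # one value per state
    while True:
        Q = ER + gamma * np.einsum('asz,z->as', T, V)  # Bellman backup for every (a, s)
        V_new = Q.max(axis=0)                          # greedy over actions
        if np.max(np.abs(V_new - V)) < precision:      # converged to the given precision
            return Q
        V = V_new
```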
You can evaluate the performance of QMDP with the following command:
python3 evaluate.py QMDP [Environment] [Number of Runs] [precision]
For example, the following command gives you the average total reward of QMDP over 1000 runs on the Hallway problem (computing precision = 0.01):
python3 evaluate.py QMDP Hallway 1000 0.01
You can test your QMDP implementation with the following command:
python3 test.py q1
Question 2 (3 points) – MinMDP – A naive lower bound of a POMDP solution
Now that we have an upper bound, it would be nice to find a lower bound on the POMDP solution. Here we won't spend too much effort: we will use MinMDP, a "worst possible future reward" bound, to lower-bound the POMDP.
Here's how MinMDP works: after receiving the (expected) immediate reward of the next action, the agent is assumed to receive the worst possible reward (the minimum of $R(s,a,s')$ over all $a$, $s$, $s'$, no matter what the belief is) at every future step. This is trivially a lower bound. The concrete formulation is as follows:
$$R_{\min} = \min_{a,s,s'} R(s,a,s')$$
$$V_{\mathrm{worst}}(b_t) = \max_{a} \sum_{s} b_t(s)\, R(s,a) + (\gamma + \gamma^2 + \ldots)\, R_{\min}$$
Hint 1: How does one calculate the $V$ function given the $Q$ function? Consider the definition of $V_{\mathrm{worst}}(b_t)$ above. Can you derive the corresponding $Q$ function?
Hint 2: $(\gamma + \gamma^2 + \ldots)$ is the infinite sum of a geometric sequence. What value does it take when $\gamma < 1$?
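Putting the two hints together, here is a minimal sketch of the bound (assumed inputs, not the required interface: `ER` is the |A| x |S| NumPy array of expected immediate rewards $E[R(s,a)]$, `R_min` the minimum reward, `b` the belief vector):

```python
def minmdp_value(b, ER, R_min, gamma):
    """V_worst(b) = max_a sum_s b(s) E[R(s,a)] + (gamma / (1 - gamma)) * R_min."""
    tail = gamma / (1.0 - gamma) * R_min   # closed form of gamma + gamma^2 + ... for gamma < 1
    return float((ER @ b).max() + tail)    # best expected immediate reward, then worst-case future
```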
Implement this bound in the MinMDP class. You can evaluate it with the following command:
python3 evaluate.py MinMDP [Environment] [Number of Runs] [precision]
You can test your implementation by running the following command:
python3 test.py q2
Question 3 (5 points) - Understanding AEMS2
Here's a link to the paper introducing AEMS. Feel free to refer to it for the details of the algorithm, as we will only describe the high-level concepts below.
As mentioned in the introduction, solving POMDPs offline is hard due to the infinite number of belief states, so much so that an offline solver may never finish computing the complete optimal policy even for a given initial state. We instead turn to an online POMDP solver, which interleaves planning with executing actions (recall the similarities with reinforcement learning). By considering a number of actions in an anytime fashion, the agent is able to choose a good action (a partial solution) when it must act in the real world, while improving its solution during its off-time.
Recap
First, let's go over how belief states are updated in POMDPs. The belief state is updated in two steps after each action: first because of performing the action ($a_t$), and then based on the new observation ($z_{t+1}$) that the agent receives as a result of this action:
(1) Action update
$$b'_t(s) = P(s \mid s_0, a_0, z_1, a_1, \ldots, a_{t-1}, z_t, a_t) = \sum_{s'} P(s' \mid s_0, \ldots, a_{t-1}, z_t)\, P(s \mid a_t, s') = \sum_{s'} b_t(s')\, P(s \mid a_t, s')$$
(2) Observation update
$$b_{t+1}(s) = P(s \mid s_0, \ldots, z_t, a_t, z_{t+1}) \propto b'_t(s)\, P(z_{t+1} \mid s, a_t) = b'_t(s)\, O(s, a_t, z_{t+1})$$
(The proportionality comes from Bayes' rule; the result is normalized so that the new belief sums to 1.)
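As a concrete (unofficial) sketch of these two updates in NumPy, under the same assumed layouts as before (`T[a, s, s']`, `O[a, s', o]`):

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """One full belief update: action update followed by observation update.

    b: length-|S| belief vector; a: action index; o: observation index.
    Assumed (hypothetical) layouts: T[a, s, s'], O[a, s', o].
    """
    b_prime = b @ T[a]                     # action update: b'(s') = sum_s b(s) T(s, a, s')
    unnormalized = b_prime * O[a, :, o]    # observation update: b'(s') * O(s', a, o)
    return unnormalized / unnormalized.sum()   # normalize into a probability distribution
```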
Representing Search in a POMDP: And-Or Tree
To represent the search tree of an online POMDP solver, we can encode the belief states encountered during the search ($b$ and $b'$) as nodes in a tree. Since the belief state is updated in two stages, by action and by observation, there are two different types of belief nodes: OR nodes, representing the parts of the search where an action is picked ($b_t \to b'_t$), and AND nodes, representing the parts where the observations are considered ($b'_t \to b_{t+1}$).
Note that since the agent can pick its actions, only one branch of an OR node will be optimal. In contrast, at an AND node any observation may occur, so all branches need to be considered.

And-Or Tree: Triangles represent OR nodes, where the agent picks one action from those available; an OR node contains the belief state. Circles represent AND nodes, where the agent observes. The probability of each observation depends on the observation function and the posterior probability of the states in the current node ($b'$). This And-Or tree has two heuristics, one upper and one lower bound (the numbers in brackets).
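One possible way to represent the two node types in code (purely illustrative; the actual representation used in aems.py is up to you):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class OrNode:
    """OR node: holds a belief state; the agent will pick exactly one action (child)."""
    belief: List[float]
    lower: float = 0.0                 # L(b), initialized from the lower-bound heuristic
    upper: float = 0.0                 # U(b), initialized from the upper-bound heuristic
    children: Dict[int, "AndNode"] = field(default_factory=dict)   # action index -> AND node
    parent: Optional["AndNode"] = None

@dataclass
class AndNode:
    """AND node: the action is fixed; every observation branch must be kept."""
    action: int
    lower: float = 0.0
    upper: float = 0.0
    children: Dict[int, "OrNode"] = field(default_factory=dict)    # observation index -> OR node
    parent: Optional["OrNode"] = None
```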
Searching the And-Or Tree
A solution of a POMDP (and in fact of an MDP) can be thought of as finding the value $V^*$ at some belief state (or state, in the case of an MDP). This is similar to the adversarial search setting, where we tried to find the achievable utility.
Recall alpha-beta pruning in adversarial search, where we used a branch-and-bound mechanism to efficiently prune the search tree. We can use the same principle here: select two heuristics (one upper and one lower bound) and prune any node whose upper bound is lower than the lower bound of some other node. In fact, there is an online POMDP algorithm based on pruning (Ross et al., 2008). However, without sufficiently tight bounds, pruning turns out to not be very effective in practice.
Instead, we take a different approach: we try to minimize the difference between the upper and lower bounds. This is referred to as Error Minimization. We can think of this difference as the agent's uncertainty about the value of an action, or alternatively, as the error of its estimated value.
Anytime Error Minimization Search (AEMS) is based on this idea. At each step it expands and updates the node in the tree with the maximum (expected) error, thus reducing that error. The original paper contains all the details of the algorithm, but to make things easier, we'll go over the high-level ideas.
AEMS, at last
First, let's define a lower bound on the POMDP's $V^*$-value as $L(b)$ and an upper bound as $U(b)$, where $b$ is the belief state. We define the "error" as $\epsilon(b) = U(b) - L(b)$. This error will eventually be used to guide which node to expand; however, on its own it isn't sufficient yet.
Next, we need to define a way to pick "optimal" actions. One way of doing this is to select an action probabilistically, according to the likelihood of that action being "good" in the long run: let $P(a \mid b)$ be the probability that action $a$ is the best action at belief state $b$. There are many sensible ways to model this distribution. However, to keep things simple, AEMS2 takes the optimistic view and imagines that each action can eventually achieve a value equal to its upper bound. This means the best action to pick is the one with the highest upper bound. Specifically, $P(a \mid b)$ equals 1 if $a$ maximizes the upper bound at $b$, and 0 otherwise. To understand why this is a reasonable assumption, see page 4 of the paper.
With this done, we can model the probability of reaching belief state $b$ after $d$ steps down the tree (where actions are picked according to the method described above):
$$P(b_d) = \prod_{i=0}^{d-1} P(o_i \mid b_i, a_i)\, P(a_i \mid b_i)$$
This is derived by chaining probabilities (remember we have Markovian assumptions, so the factors are independent). The first term can be computed from the observation and transition probabilities, whereas the second term is the action-selection probability we defined in the previous paragraph.
This means we can write the expected error $E = \mathbb{E}[\epsilon]$ of a belief state $b$ that is $d$ steps away (i.e., that deep down the tree) as:
$$E(b_d) = \gamma^{d}\, P(b_d)\, \epsilon(b_d)$$
Each time we expand a leaf node of the search tree, we obtain a better estimate of that node's upper and lower bounds and thus reduce its error. To make the expansion worthwhile, we want to improve the estimate with the worst expected error. Thus we pick the leaf node with the maximum $E(b_d)$ value as our candidate to expand.
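In code, selecting the node to expand is just an argmax of this expected error over the fringe. A sketch, assuming (hypothetically) that each fringe node stores its depth, the reach probability along its path, and its two bounds:

```python
def expected_error(node, gamma):
    """E(b_d) = gamma^d * P(b_d) * (U(b_d) - L(b_d)) for a fringe node.

    Assumes (hypothetical attributes) node.depth, node.reach_prob (the product of the
    observation and action-selection probabilities on its path), node.upper, node.lower.
    """
    return (gamma ** node.depth) * node.reach_prob * (node.upper - node.lower)

def best_fringe_node(fringe, gamma):
    """Return the fringe node with the largest expected error, or None if the fringe is empty."""
    return max(fringe, key=lambda n: expected_error(n, gamma), default=None)
```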
With this we can derive a high level algorithm for AEMS:
While we still have planning time remaining (don't need to make a decision yet), do:
1. Find the node in the fringe (the leaves) of the And-Or tree that has the maximal $E(b_d)$ value as defined above.
2. Expand this node into a subtree, using QMDP and MinMDP as the bound heuristics when creating new leaves.
3. Update the upper and lower bounds of this node based on those of its children.
4. Propagate the bound changes all the way back up to the root.
If we're out of time, use the current tree to determine the best action at the root. Return that action.
To check that you understand AEMS conceptually, please answer the following questions in the writeup:
1. (1 point) What does "anytime" mean in the context of AEMS? Explain why the "anytime" property is desirable for an online POMDP solver.
2. (4 points) Write down the error term that AEMS2 would calculate for the fringe nodes in the example And-Or tree above ($b_2$ to $b_8$). Use a discount factor $\gamma = 0.95$; the heuristic upper and lower bounds are indicated by the numbers in square brackets. Then indicate which node will be expanded next.
Question 4 (8 points) - Implementing AEMS2
Finally, it's time to implement AEMS2. In the assignment files, you will find aems.py. We've populated much of the framework code already, so you just need to write the relevant methods to build the And-Or tree and update the bounds.
Note: Before you start implementing, it's a good idea to go over some examples of building And-Or trees by hand including the calculation of bound updates.
You can evaluate your code with the following command:
python3 evaluate.py AEMS2 [LowerBound] [UpperBound] [time limit in sec] [Environment] [Number of Runs] [precision]
For example, the following returns the average reward of AEMS2 using MinMDP and QMDP on the Hallway environment over 200 runs, where for each action the solver has at most 100 ms (0.1 s) and the computing precision is 0.01:
python3 evaluate.py AEMS2 MinMDP QMDP 0.1 Hallway 200 0.01
You can test your implementation by running the following command:
python3 test.py q4
Important: AEMS involves querying $U(b)$ and $L(b)$ every time a node is expanded, and since the solver is timed, the faster you can expand nodes, the better AEMS will perform. AEMS performance can be improved through vectorization, which parallelizes the calculation of sums and products. However, we understand that this may be a difficult concept.
If you worry that your implementation of QMDP or MinMDP is slow (or unreliable), we have provided pre-solved bounds that can be queried instead of your own implementations for QMDP and MinMDP. To use our reference bounds, replace the following:
python3 evaluate.py AEMS2 MinMDP QMDP 0.1 Hallway 200 0.01
With:
python3 evaluate.py AEMS2 ref-MinMDP ref-QMDP 0.1 Hallway 200 0.01
Note: reference bounds only exist for the Hallway and RockSample_4_4 environments, so the above command will fail for any other environment!
Also note that the autograder for AEMS2 uses the reward achieved over simulated runs to evaluate correctness. This means that if your bound calculations are exceptionally slow, you may not be able to expand enough nodes to pass the tests. If you suspect this is the case, change USE_REF_BOUNDS in test.py to True; the autograder will then use the reference bounds instead of your QMDP and MinMDP implementations.
We will attempt to grade your AEMS2 with your own QMDP and MinMDP implementations first. If any AEMS tests fail or any single test takes more than 10 minutes without output (at which point we consider it failed), we will switch to running your AEMS2 with the reference precomputed bounds instead. Your autograded score will be max(score using own bounds, score with ref bounds - 1).
Question 5 (4 points) - Playing with the time limit and heuristics
You may have noticed that our lower bound, MinMDP, is actually quite loose. This is undesirable because it results in large errors at the leaf nodes. To remedy this, we have provided the policy of a point-based method (PBVI) for RockSample_4_4 and TagAvoid. Evaluate the performance of AEMS2 with PBVI as the lower bound on these environments for various time limits: 5 ms, 10 ms, 20 ms, 50 ms, 100 ms, 200 ms, and 500 ms. Then repeat the same procedure with MinMDP as the lower bound so you can compare the two heuristics.
Each evaluation should use at least 500 runs for time limits of 100 ms or less; the minimum numbers of runs for 200 ms and 500 ms are 200 and 100, respectively. Report the results as two plots (one per environment): the Y axis is the average reward over the runs and the X axis is the time limit. Each plot contains two curves of 7 data points each (connect the points); show the PBVI results in blue and the MinMDP results in red. Feel free to use any tool to generate the plots. Briefly explain the results (compare the heuristics as well as the time limits). A precision of 0.1 is sufficient for this report.
Example commands you need to run for this question:
python3 evaluate.py AEMS2 PBVI QMDP 0.25 RockSample_4_4 500 0.1
python3 evaluate.py AEMS2 MinMDP QMDP 0.01 TagAvoid 500 0.1
For longer time limits, you can evaluate AEMS2 with a smaller number of runs per command and then average the results yourself, e.g. instead of 500 runs with one command, use 50 runs 10 times and calculate the average by hand.
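If you want a starting point for the plots, here is a rough matplotlib sketch (the reward lists below are placeholders that you must replace with your own measurements):

```python
import matplotlib.pyplot as plt

time_limits_ms = [5, 10, 20, 50, 100, 200, 500]

# Placeholder values: substitute the average rewards you actually measured.
pbvi_rewards   = [0.0] * len(time_limits_ms)
minmdp_rewards = [0.0] * len(time_limits_ms)

plt.plot(time_limits_ms, pbvi_rewards, 'b-o', label='PBVI lower bound')
plt.plot(time_limits_ms, minmdp_rewards, 'r-o', label='MinMDP lower bound')
plt.xlabel('Time limit per action (ms)')
plt.ylabel('Average reward')
plt.title('AEMS2 on RockSample_4_4')   # make a second figure for TagAvoid
plt.legend()
plt.savefig('rocksample_plot.png')
```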
Note: You may choose to use ref-QMDP and ref-MinMDP in lieu of your own implementations when completing this part. However, note that the reference pre-computed values don't exist for TagAvoid.
Submission
In order to submit your assignment, please (1) create a zip archive of the following files: mdpSolver.py and aems.py, and (2) export your writeup as writeup.pdf. Upload both to Canvas as two separate files (using the "Add another file" button).
