Reinforcement Learning
Partial Observability and the POMDP Model
(Source: S. Thrun et al., Probabilistic Robotics, MIT Press)
Subramanian Ramamoorthy
School of Informatics
7 March, 2017
A Trace through an MDP
What happens if the agent does not get, "You are now in state…"?
Instead, all the agent gets is, "You now see these observations…"
Partially Observed Markov Decision Processes
• In POMDPs we apply the very same idea as in MDPs.
• Since the state x is not observable, the agent has to make its decisions based on the belief state, which is a posterior distribution over states.
• Let b be the belief (a probability estimate) of the agent about the state x under consideration.
• POMDPs compute a value function V(b) over belief space.
Partially Observed MDP Problem
The belief is a sufficient statistic for the given history:
b_t = \Pr\{x_t \mid b_0, u_0, o_1, \ldots, o_{t-1}, u_{t-1}, o_t\}
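For a finite state space, this belief can be maintained recursively with a Bayes filter: predict through the transition model, then correct with the measurement likelihood. A minimal sketch (the function and argument names are illustrative, not from the slides):

```python
import numpy as np

def belief_update(b, u, z, p_trans, p_obs):
    """One recursive belief update: prediction with action u, correction with observation z.

    b       : current belief over states, shape (n,)
    p_trans : dict of transition matrices, p_trans[u][i, j] = p(x_j | x_i, u)
    p_obs   : measurement model, p_obs[i, k] = p(z_k | x_i)
    """
    b_pred = b @ p_trans[u]          # prediction: sum_i p(x_j | x_i, u) b(x_i)
    b_new = p_obs[:, z] * b_pred     # correction: weight by p(z | x_j)
    return b_new / b_new.sum()       # normalise by p(z)
```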
Some Problems to Consider
• Each belief is a probability distribution; thus, each value in a POMDP is a function of an entire probability distribution.
• This is problematic, since probability distributions are continuous.
• Additionally, we have to deal with the huge complexity of belief spaces.
• For finite worlds with finite state, action, and measurement spaces and finite horizons, however, we can effectively represent the value functions by piecewise linear functions.
An Illustrative Example
[Figure: the two-state example. Terminal actions u1 and u2 yield payoffs −100 and +100 in state x1, and +100 and −50 in state x2. The sensing action u3 switches the state with probability 0.8 and leaves it unchanged with probability 0.2. Measurements z1 and z2 are observed with probabilities 0.7 and 0.3 in state x1, and 0.3 and 0.7 in state x2.]
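As a reference for the steps that follow, here is one way to encode these parameters as arrays; the assignment of the u2 payoffs and of the 0.8/0.2 transition probabilities is read off the figure and should be treated as an assumption of this sketch.

```python
import numpy as np

# Payoffs r(x, u) for the terminal actions (rows: x1, x2; columns: u1, u2).
# The u1 column is stated on the slides; the u2 column is read off the figure.
R = np.array([[-100.0, 100.0],
              [ 100.0, -50.0]])

# Sensing action u3: T_u3[i, j] = p(x_j | x_i, u3); the state flips with prob. 0.8.
T_u3 = np.array([[0.2, 0.8],
                 [0.8, 0.2]])

# Measurement model: Z[i, k] = p(z_k | x_i).
Z = np.array([[0.7, 0.3],
              [0.3, 0.7]])
```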
Our Plan
We will work out the value function updates for this example. The key steps will be:
1. Express payoff in terms of beliefs over states
2. Use this to write down an initial expression for π and V
3. Propagate forward an expected value of V, given one observation from the world
4. Predict the state transition upon taking an action in response to this observation and the resulting belief
5. Iterate (simplifying along the way) …
The Parameters of the Example
• The actions u1 and u2 are terminal actions.
• The action u3 is a sensing action that potentially
leads to a state transition.
• The horizon is finite and γ=1.
Payoff in POMDPs
• In MDPs, the payoff (or return) depended on the state of the system.
• In POMDPs, however, the true state is not exactly known.
• Therefore, we compute the expected payoff (i.e., the reward at the next step) by integrating over all states:
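Written out, the expected payoff of executing u under belief b is

r(b, u) = E_x[\,r(x, u)\,] = \int r(x, u)\, b(x)\, dx

which for a finite state space reduces to the sum \sum_i r(x_i, u)\, b(x_i).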
Payoffs in Our Example (1)
• If we are totally certain that we are in state x1 and execute action u1, we receive a reward of −100.
• If, on the other hand, we definitely know that we are in x2 and execute u1, the reward is +100.
• In between, the payoff is the linear combination of the extreme values, weighted by the probabilities.
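With p_1 = b(x_1), and using the u_2 payoffs read off the example figure (+100 in x_1, −50 in x_2, an assumption of this sketch), the expected payoffs are

r(b, u_1) = -100\, p_1 + 100\,(1 - p_1), \qquad r(b, u_2) = 100\, p_1 - 50\,(1 - p_1)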
Payoffs in Our Example (2)
[Figure: the expected payoffs for the example, plotted as linear functions of p1 = b(x1).]
The ResulJng Policy for T=1
• Given we have a finite POMDP with T=1, we would use V1(b)
to determine the opJmal policy.
• In our example, the opJmal policy for T=1 is
• This is the upper thick graph in the diagram.
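Under these payoffs (again assuming the u_2 values from the figure), the two lines intersect where -100\,p_1 + 100\,(1 - p_1) = 100\,p_1 - 50\,(1 - p_1), i.e. at p_1 = 3/7, so

\pi_1(b) = u_1 \ \text{if } p_1 \le 3/7, \qquad u_2 \ \text{otherwise}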
Piecewise Linearity, Convexity
• The resulting value function V1(b) is the maximum of the three functions at each point.
• It is piecewise linear and convex.
Pruning
• If we carefully consider V1(b), we see that only the first two components contribute.
• The third component can therefore safely be pruned away from V1(b).
Increasing the Time Horizon
• Assume the robot can make an observation before
deciding on an action.
[Figure: V1(b) plotted over p1.]
Increasing the Time Horizon
• Assume the robot can make an observation before deciding on an action.
• Suppose the robot perceives z1, for which p(z1 | x1) = 0.7 and p(z1 | x2) = 0.3.
• Given the observation z1, we update the belief using Bayes' rule.
p'_1 = \frac{0.7\, p_1}{p(z_1)}, \qquad p'_2 = \frac{0.3\,(1 - p_1)}{p(z_1)}, \qquad p(z_1) = 0.7\, p_1 + 0.3\,(1 - p_1) = 0.4\, p_1 + 0.3
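For example, if p_1 = 0.6, then p(z_1) = 0.4 \cdot 0.6 + 0.3 = 0.54 and p'_1 = 0.7 \cdot 0.6 / 0.54 \approx 0.78, so observing z_1 shifts the belief towards x_1.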
Value Function
Update relation:
[Figure: V1(b) over the belief (x1, x2) and V1(b | z1) over the updated belief b'(b | z1).]
Increasing the Time Horizon
• As before, suppose the robot perceives z1, for which p(z1 | x1) = 0.7 and p(z1 | x2) = 0.3, and updates its belief using Bayes' rule.
• Thus V1(b | z1) is given by the expression sketched below.
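Concretely, evaluating the T=1 value function at the updated belief gives (a sketch that again assumes the u_2 payoffs read off the example figure):

V_1(b \mid z_1) = \max\Big\{ -100\, p'_1 + 100\,(1 - p'_1),\ \ 100\, p'_1 - 50\,(1 - p'_1) \Big\}, \quad \text{with } p'_1 = \frac{0.7\, p_1}{p(z_1)}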
Expected Value after Measuring
• Since we do not know in advance what the next
measurement will be, we have to compute the expected
belief
\bar{V}_1(b) = E_z[\,V_1(b \mid z)\,]
= \sum_{i=1}^{2} p(z_i)\, V_1(b \mid z_i)
= \sum_{i=1}^{2} p(z_i)\, V_1\!\left(\frac{p(z_i \mid x_1)\, p_1}{p(z_i)}\right)
= \sum_{i=1}^{2} V_1\big(p(z_i \mid x_1)\, p_1\big)
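The last step works because V1 is a maximum of functions that are linear in the belief, so the normaliser p(z_i) cancels. A small sketch of this expectation for the two-state example, representing V1 by its linear components (the u2 payoffs are the assumption noted earlier):

```python
import numpy as np

# V1 as a max over linear components: V1(b) = max_a a[0]*b1 + a[1]*b2
V1_alphas = [np.array([-100.0, 100.0]),   # component from terminal action u1
             np.array([ 100.0, -50.0])]   # component from terminal action u2 (assumed payoffs)

Z = np.array([[0.7, 0.3],                 # p(z1|x1), p(z2|x1)
              [0.3, 0.7]])                # p(z1|x2), p(z2|x2)

def V1(b):
    return max(float(a @ b) for a in V1_alphas)

def expected_V1_after_measuring(b):
    """E_z[V1(b|z)] = sum_z p(z) * V1(b|z), with b|z the Bayes-updated belief."""
    total = 0.0
    for z in range(Z.shape[1]):
        pz = float(Z[:, z] @ b)          # p(z) = sum_i p(z|x_i) b(x_i)
        b_post = (Z[:, z] * b) / pz      # Bayes update of the belief
        total += pz * V1(b_post)
    return total
```

For instance, expected_V1_after_measuring(np.array([0.6, 0.4])) evaluates the expectation at the belief p1 = 0.6.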
Resulting Value Function
• The four possible combinations yield the following function, which can then be simplified and pruned.
Value Function
[Figure: p(z1) V1(b | z1) and p(z2) V1(b | z2) plotted over p1, together with the belief update b'(b | z1).]
State Transitions (Prediction)
• When the agent selects u3, its state potentially changes.
• When computing the value function, we have to take these potential state changes into account.
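The prediction step propagates the belief through the transition model before the value is evaluated:

\bar{b}(x_j) = \sum_i p(x_j \mid x_i, u_3)\, b(x_i)

In the example, using the 0.8/0.2 transition probabilities read off the figure, this gives \bar{p}_1 = 0.2\, p_1 + 0.8\,(1 - p_1).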
State Transitions (Prediction)
[Figure: the effect of the u3 state transition, plotted over p1.]
Resulting Value Function after executing u3
• Taking the state transitions into account, we finally obtain the value function shown below.
Value Function after executing u3
[Figure: the value function after executing u3, plotted over p1 and the predicted belief p'1.]
Value Function for T=2
• Taking into account that the agent can either directly perform u1 or u2, or first u3 and then u1 or u2, we obtain (after pruning)
Graphical Representation of V2(b)
[Figure: V2(b) plotted over p1, with a region where u1 is optimal, a region where u2 is optimal, and an unclear region in between where the outcome of the measurement is important.]
Deep Horizons and Pruning
• We have now completed a full backup in belief space.
• This process can be applied recursively.
• The value functions for T=10 and T=20 are:
[Figure: the value functions for T=10 and T=20, plotted over p1.]
Why Pruning is Essential
• Each update introduces additional linear components to V.
• Each measurement squares the number of linear components.
• Thus, an un-pruned value function for T=20 includes more than 10^{547,864} linear functions.
• At T=30 we have 10^{561,012,337} linear functions.
• The pruned value function at T=20, in comparison, contains only 12 linear components (a simple pruning sketch follows below).
• The combinatorial explosion of linear components in the value function is the major reason why this simple formulation of POMDPs is impractical for most applications.
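A simple approximate pruning pass for the two-state case is sketched below; exact pruning solves a small linear program per component, but testing dominance on a grid over p1 is enough to convey the idea.

```python
import numpy as np

def prune(alphas, n_grid=101):
    """Keep only linear components that attain the maximum at some belief.

    alphas : list of length-2 arrays; each defines the line a[0]*p1 + a[1]*(1 - p1).
    Exact pruning uses a linear program per component; here we only test
    dominance on a grid of beliefs over p1 in [0, 1].
    """
    A = np.array(alphas)                     # shape (m, 2)
    p1 = np.linspace(0.0, 1.0, n_grid)
    B = np.stack([p1, 1.0 - p1], axis=1)     # grid of beliefs, shape (n_grid, 2)
    values = B @ A.T                         # values[k, j] = component j at belief k
    winners = np.unique(np.argmax(values, axis=1))
    return [alphas[j] for j in winners]
```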
Nature of the POMDP Value Function
• After n consecutive iterations of this optimization, the value function consists of a set of α-vectors.
• Each α-vector is an |S|-dimensional hyperplane (a line in our example).
• So, the value function is of the form:
V_n = \{\alpha_0, \alpha_1, \ldots, \alpha_m\}

V_n(b) = \max_{\alpha \in V_n} \sum_{s \in S} \alpha(s)\, b(s)
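Evaluating such a value function is just a maximum over dot products; a minimal sketch:

```python
import numpy as np

def value(b, alphas):
    """V_n(b) = max over alpha-vectors of sum_s alpha(s) * b(s)."""
    return max(float(np.dot(alpha, b)) for alpha in alphas)

def maximising_alpha(b, alphas):
    """The hyperplane that attains the maximum at b (it also identifies the greedy
    action when each alpha-vector is stored with the action that generated it)."""
    return max(alphas, key=lambda alpha: float(np.dot(alpha, b)))
```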
POMDP Approximation: Point-based Value Iteration
• Maintain a smaller set of example belief states B = {b1, b2, …}
• Propagate the value function forward as before, but use this approximate representation
• Pruning: only consider constraints that maximize the value function for at least one of the example beliefs (see the sketch below)
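A sketch of the point-based pruning rule from the last bullet: keep only those alpha-vectors that are maximal at at least one belief in the example set B (names are illustrative):

```python
import numpy as np

def pbvi_prune(alphas, B):
    """Keep only alpha-vectors that maximise the value at some belief in the set B.

    alphas : list of arrays of shape (n_states,)
    B      : list of example beliefs, each of shape (n_states,)
    """
    keep = set()
    for b in B:
        values = [float(np.dot(alpha, b)) for alpha in alphas]
        keep.add(int(np.argmax(values)))
    return [alphas[j] for j in sorted(keep)]
```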
PBVI Schematic
[Source: J. Pineau et al., Point-based Value Iteration: An anytime algorithm for POMDPs, In Proc. IJCAI 2003]
Quality of Point-based Value Iteration
Value functions for T=30, after pruning:
  Exact value function: 120 constraints
  PBVI: 11 constraints
Example (Real) Application
Example Application
POMDP Summary
• POMDPs compute the optimal action in partially
observable, stochastic domains.
• For finite horizon problems, the resulting value
functions are piecewise linear and convex.
• In each iteration the number of linear
constraints grows exponentially.
• In this form, POMDPs have only been applied
successfully to small state spaces with small
numbers of possible observations and actions.
– need to formulate problems carefully…