
Reinforcement Learning

Partial Observability and the POMDP Model
(Source: S. Thrun et al., Probabilistic Robotics, MIT Press)

Subramanian Ramamoorthy
School of Informatics

7 March, 2017

A Trace through an MDP

What happens if the agent does not get, "You are now in state…"?
Instead, all the agent gets is, "You now see these observations…"


Partially Observed Markov Decision Processes

•  In POMDPs we apply the very same idea as in MDPs.
•  Since the state (x) is not observable, the agent has to make its decisions based on the belief state, which is a posterior distribution over states.
•  Let b be the belief (a probability estimate) of the agent about the state (x) under consideration.
•  POMDPs compute a value function over belief space:
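A sketch of this backup in the standard belief-MDP form (the belief b plays the role that the state plays in an ordinary MDP):

$$V_T(b) = \max_u \left[ r(b, u) + \gamma \int V_{T-1}(b')\, p(b' \mid u, b)\, db' \right]$$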


Partially Observed MDP Problem

The belief is a sufficient statistic for the history:

$$b_t = \Pr\{x_t \mid b_0, u_0, o_1, \ldots, o_{t-1}, u_{t-1}, o_t\}$$
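The belief itself can be maintained recursively with a Bayes filter, which is why the whole history can be summarized this way; with η a normalizing constant:

$$b_t(x') = \eta\, p(o_t \mid x') \sum_{x} p(x' \mid u_{t-1}, x)\, b_{t-1}(x)$$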


Some Problems to Consider

•  Each belief is a probability distribution; thus, each value in a POMDP is a function of an entire probability distribution.
•  This is problematic, since probability distributions are continuous.
•  Additionally, we have to deal with the huge complexity of belief spaces.
•  For finite worlds with finite state, action, and measurement spaces and finite horizons, however, we can effectively represent the value functions by piecewise linear functions.


An Illustrative Example

[Figure: a two-state example with states x1 and x2. The terminal actions u1 and u2 yield payoffs of −100 (u1) and +100 (u2) in state x1, and +100 (u1) and −50 (u2) in state x2. The sensing action u3 switches the state with probability 0.8 and leaves it unchanged with probability 0.2. Measurement z1 is observed with probability 0.7 in x1 and 0.3 in x2; z2 with the complementary probabilities.]

Our Plan

We will work out the value function updates for this example. The key steps will be:

1.  Express payoff in terms of beliefs over states
2.  Use this to write down an initial expression for π and V
3.  Propagate forward an expected value of V, given one observation from the world
4.  Predict the state transition upon taking an action in response to this observation and the resulting belief
5.  Iterate (simplifying along the way) …


The Parameters of the Example

•  The actions u1 and u2 are terminal actions.
•  The action u3 is a sensing action that potentially leads to a state transition.
•  The horizon is finite and γ = 1.
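For reference in the calculations that follow, here is a minimal Python sketch of the example's parameters. The u1 payoffs and the z1 probabilities are stated on the coming slides; the u2 payoffs and the 0.8/0.2 behaviour of u3 are assumptions read off the example figure.

```python
# Two-state POMDP example (states x1, x2); values not stated explicitly
# in the slides are assumptions taken from the example figure.

# Payoffs r(x, u) for the terminal actions u1 and u2.
payoff = {
    ("x1", "u1"): -100.0, ("x2", "u1"): +100.0,
    ("x1", "u2"): +100.0, ("x2", "u2"):  -50.0,   # u2 values assumed from the figure
}

# Sensing action u3: p(x' | x, u3) -- assumed to switch the state w.p. 0.8.
transition_u3 = {
    "x1": {"x1": 0.2, "x2": 0.8},
    "x2": {"x1": 0.8, "x2": 0.2},
}

# Measurement model p(z | x).
observation = {
    "x1": {"z1": 0.7, "z2": 0.3},
    "x2": {"z1": 0.3, "z2": 0.7},
}

def expected_payoff(p1: float, u: str) -> float:
    """Expected immediate payoff r(b, u) for the belief b = (p1, 1 - p1)."""
    return p1 * payoff[("x1", u)] + (1.0 - p1) * payoff[("x2", u)]
```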


Payoff in POMDPs

•  In MDPs, the payoff (or return) depended on the state of the system.
•  In POMDPs, however, the true state is not exactly known.
•  Therefore, we compute the expected payoff (i.e., the reward at the next step) by integrating over all states:
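Specializing to our two-state example, with p1 = b(x1):

$$r(b, u) = E_x[r(x, u)] = \int r(x, u)\, b(x)\, dx = r(x_1, u)\, p_1 + r(x_2, u)\,(1 - p_1)$$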


Payoffs in Our Example (1)
•  If we are totally certain that we are in state x1 and execute action u1, we receive a reward of −100.
•  If, on the other hand, we definitely know that we are in x2 and execute u1, the reward is +100.
•  In between, the payoff is the linear combination of the extreme values, weighted by the probabilities.
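Concretely, writing p1 = b(x1) and taking the u2 payoffs (+100 in x1, −50 in x2) as read from the example figure:

$$r(b, u_1) = -100\, p_1 + 100\,(1 - p_1) = 100 - 200\,p_1, \qquad r(b, u_2) = 100\, p_1 - 50\,(1 - p_1) = 150\,p_1 - 50$$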


Payoffs in Our Example (2)

[Figure: the expected payoffs plotted as linear functions of p1 = b(x1).]

The Resulting Policy for T=1

•  Given a finite POMDP with T=1, we would use V1(b) to determine the optimal policy.
•  In our example, the optimal policy for T=1 is given below.
•  This is the upper thick graph in the diagram.
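The two payoff lines intersect where 100 − 200 p1 = 150 p1 − 50, i.e. at p1 = 3/7, so, under the payoffs assumed above, a sketch of the optimal policy is:

$$\pi_1(b) = \begin{cases} u_1 & \text{if } p_1 \le 3/7,\\ u_2 & \text{if } p_1 > 3/7. \end{cases}$$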


Piecewise Linearity, Convexity

•  The resulting value function V1(b) is the maximum of the three functions at each point.
•  It is piecewise linear and convex.
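Written out, with the −1 payoff for the sensing action u3 taken as an assumption from the textbook example:

$$V_1(b) = \max_u r(b, u) = \max\{\, 100 - 200\,p_1,\; 150\,p_1 - 50,\; -1 \,\}$$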


Pruning

•  If we carefully consider V1(b), we see that only the first two components contribute.
•  The third component can therefore safely be pruned away from V1(b).


Increasing the Time Horizon
•  Assume the robot can make an observation before deciding on an action.

[Figure: V1(b) plotted over p1.]

Increasing the Time Horizon

•  Assume the robot can make an observation before deciding on an action.
•  Suppose the robot perceives z1, for which p(z1 | x1) = 0.7 and p(z1 | x2) = 0.3.
•  Given the observation z1, we update the belief using Bayes rule.

$$p_1' = \frac{p(z_1 \mid x_1)\, p_1}{p(z_1)} = \frac{0.7\, p_1}{p(z_1)}, \qquad
p_2' = \frac{p(z_1 \mid x_2)\,(1 - p_1)}{p(z_1)} = \frac{0.3\,(1 - p_1)}{p(z_1)}$$

$$p(z_1) = 0.7\, p_1 + 0.3\,(1 - p_1) = 0.4\, p_1 + 0.3$$
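A quick numerical sketch of this update, using the 0.7/0.3 measurement probabilities given above:

```python
def belief_after_z1(p1: float) -> float:
    """Posterior p1' = p(x1 | z1) for a prior belief p1 = b(x1)."""
    pz1 = 0.7 * p1 + 0.3 * (1.0 - p1)   # p(z1) = 0.4 * p1 + 0.3
    return 0.7 * p1 / pz1

print(belief_after_z1(0.5))   # 0.7 -- observing z1 makes x1 more likely
```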


Value Function

[Figure: the belief update b'(b | z1) maps the value function V1(b) into V1(b | z1), shown over the belief space between x1 and x2.]

Increasing the Time Horizon

•  Given the observation z1, we update the belief using Bayes rule, as above.
•  Thus V1(b | z1) is given by the expression sketched below.
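A sketch of that expression, obtained by evaluating the two (unpruned) payoff lines at the updated belief:

$$V_1(b \mid z_1) = \frac{1}{p(z_1)} \max\{\, -70\,p_1 + 30\,(1 - p_1),\; 70\,p_1 - 15\,(1 - p_1) \,\}$$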


Expected Value after Measuring

•  Since we do not know in advance what the next measurement will be, we have to compute the expectation over both possible observations:

$$\bar{V}_1(b) = E_z\big[V_1(b \mid z)\big] = \sum_{i=1}^{2} p(z_i)\, V_1(b \mid z_i) = \sum_{i=1}^{2} p(z_i)\, V_1\!\left(\frac{p(z_i \mid x_1)\, p_1}{p(z_i)}\right) = \sum_{i=1}^{2} V_1\big(p(z_i \mid x_1)\, p_1\big)$$



Resulting Value Function

•  The four possible combinations yield the following function, which can then be simplified and pruned.
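A sketch of those combinations, under the payoff and measurement values assumed earlier (for each of z1 and z2 we may respond with u1 or u2):

$$\bar{V}_1(b) = \max\{\, -100\,p_1 + 100\,(1 - p_1),\; -40\,p_1 - 5\,(1 - p_1),\; 40\,p_1 + 55\,(1 - p_1),\; 100\,p_1 - 50\,(1 - p_1) \,\}$$

The four terms correspond to the response pairs (u1, u1), (u1, u2), (u2, u1) and (u2, u2); the (u1, u2) term is dominated everywhere and is the one pruned away.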


Value Function

[Figure: the belief update b'(b | z1) and the weighted value functions p(z1) V1(b | z1) and p(z2) V1(b | z2), each plotted over p1.]

State Transitions (Prediction)

•  When the agent selects u3, its state potentially changes.
•  When computing the value function, we have to take these potential state changes into account.
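Concretely, if u3 switches the state with probability 0.8 and leaves it unchanged with probability 0.2 (the transition model assumed from the example figure), the predicted belief is

$$p_1' = p(x_1 \mid x_1, u_3)\, p_1 + p(x_1 \mid x_2, u_3)\,(1 - p_1) = 0.2\, p_1 + 0.8\,(1 - p_1)$$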



Resulting Value Function after executing u3

•  Taking the state transitions into account, we finally obtain:
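Mechanically, each linear piece a p1 + b (1 − p1) of the (pruned) expected value function is evaluated at the predicted belief, i.e. under the assumed transition model:

$$a\,p_1 + b\,(1 - p_1) \;\longrightarrow\; a\,\big(0.2\,p_1 + 0.8\,(1 - p_1)\big) + b\,\big(0.8\,p_1 + 0.2\,(1 - p_1)\big)$$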


Value Function after executing u3

[Figure: the value function components after executing u3, plotted over p1 and the predicted belief p'1.]

Value Function for T=2

•  Taking into account that the agent can either directly perform u1 or u2, or first u3 and then u1 or u2, we obtain (after pruning):
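In outline, and assuming (as in the textbook example) a payoff of −1 for the sensing action:

$$V_2(b) = \max\big\{\, r(b, u_1),\; r(b, u_2),\; -1 + \bar{V}_1(b \mid u_3) \,\big\}$$

where V̄1(b | u3) is the expected value function after the u3 state transition from the previous slides.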


Graphical Representation of V2(b)

[Figure: V2(b) plotted over p1, with a region where u1 is optimal, a region where u2 is optimal, and an unclear region in between where the outcome of the measurement is important.]

Deep Horizons and Pruning

•  We have now completed a full backup in belief space.
•  This process can be applied recursively.
•  The value functions for T=10 and T=20 are:

[Figure: the pruned value functions for T=10 and T=20, plotted over p1.]


Why Pruning is Essential

•  Each update introduces additional linear components to V.
•  Each measurement squares the number of linear components.
•  Thus, an un-pruned value function for T=20 includes more than 10^547,864 linear functions.
•  At T=30 we have 10^561,012,337 linear functions.
•  The pruned value function at T=20, in comparison, contains only 12 linear components.
•  The combinatorial explosion of linear components in the value function is the major reason why this simple formulation of POMDPs is impractical for most applications.


Nature of the POMDP Value Function

•  After n consecutive iterations of this optimization, the value function consists of a set of α-vectors.
•  Each α-vector is an |S|-dimensional hyperplane (a line in our example).
•  So, the value function is of the form:


$$V_n = \{\alpha_0, \alpha_1, \ldots, \alpha_m\}, \qquad V_n(b) = \max_{\alpha \in V_n} \sum_{s \in S} \alpha(s)\, b(s)$$


POMDP Approximation:
Point-based Value Iteration

•  Maintain a smaller set of example belief states B = {b1, b2, …}
•  Propagate the value function forward as before, but use this approximate representation
•  Pruning: only consider constraints that maximize the value function for at least one of the example beliefs
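A minimal Python sketch of this point-based pruning idea (the function names and the choice of belief points are illustrative, not the exact algorithm of Pineau et al.):

```python
import numpy as np

def value(alpha_vectors, b):
    """V(b) = max over alpha-vectors of alpha . b, the form from the previous slide."""
    return max(float(np.dot(alpha, b)) for alpha in alpha_vectors)

def point_based_prune(alpha_vectors, belief_points):
    """Keep only the alpha-vectors that are maximal at at least one example belief."""
    kept = []
    for b in belief_points:
        best = max(alpha_vectors, key=lambda a: float(np.dot(a, b)))
        if not any(np.array_equal(best, a) for a in kept):
            kept.append(best)
    return kept

# Beliefs for the two-state example are (p1, 1 - p1).
B = [np.array([p1, 1.0 - p1]) for p1 in (0.1, 0.5, 0.9)]
alphas = [np.array([-100.0, 100.0]),      # "always u1" payoff line
          np.array([100.0, -50.0]),       # "always u2" payoff line (assumed -50 in x2)
          np.array([-1.0, -1.0])]         # "always u3" line (assumed -1 payoff)

print(value(alphas, B[1]))                # 25.0
print(len(point_based_prune(alphas, B)))  # 2 -- the u3 line is never maximal here
```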


PBVI Schematic


[Source: J. Pineau et al., Point-based Value Iteration: An anytime algorithm for POMDPs, In Proc. IJCAI 2003]


Quality of Point-based Value Iteration

[Figure: value functions for T=30. The exact value function has 120 constraints after pruning; the PBVI approximation has 11 constraints.]


Example (Real) Application


Example Application


POMDP Summary

•  POMDPs compute the optimal action in partially
observable, stochastic domains.

•  For finite horizon problems, the resulting value
functions are piecewise linear and convex.

•  In each iteration the number of linear
constraints grows exponentially.

•  In this form, POMDPs have only been applied
successfully to small state spaces with small
numbers of possible observations and actions.
–  need to formulate problems carefully…
