Bayesian Reinforcement Learning

Reinforcement Learning

Exploration and Controlled Sensing

Subramanian Ramamoorthy
School of Informatics

21 March, 2017

Example Application: Sampling Spatiotemporal Fields


Questions for Ocean Sampling

•  How do we represent the objective that the goal of motion planning is to acquire information, which is then used in model learning?

•  Concretely, how do we decide where and when to sample on this basis?


Example Problem: Preference Elicitation

Shopping for a car: Luggage capacity? Two doors? Cost? Engine size? Color? Options?


Preference Elicitation Problem

… the process of determining a user’s preferences/utilities to the extent necessary to make a decision on her behalf

•  Issues:
–  preferences vary widely
–  large (multi-attribute) outcome spaces
–  quantitative utilities (the “numbers”) are difficult to assess

•  Preference elicitation can be posed as a POMDP
–  Let us try to formulate the state-action-observation space…


Plan for This Lecture

1.  A look (recap) at what Bayesian updating of model parameters achieves

2.  Information acquisition problems and the value of information (VoI)

3.  Policies based on information gain, e.g., for robots sampling in a navigation setting


Bayesian Updating


Recap of Background

•  Learning problem: a probabilistic statement of what we believe about the parameters that characterise system behaviours

•  Focus is on uncertainty about performance:
–  Choice: e.g., of person, technology
–  Design: e.g., policies for running business operations
–  Policy: e.g., when to sell an asset, maintenance decisions

•  Beliefs are influenced by the observations we make
•  Two key ways of thinking about learning problems: frequentist and Bayesian
•  Bayesian: start with initial beliefs regarding the parameters and combine the prior with measurements to compute posteriors


Key Ideas in Bayesian Models

•  Begin with a prior distribution over the unknown parameter µ
•  Any number whose value is unknown is a random variable
•  The distribution of this random variable ~ our belief about how likely µ is to take on certain values

•  The Bayesian perspective is well suited to information collection
•  We always start with some sort of prior knowledge or history
•  More important is the conceptual framework that there exists some truth that we are trying to discover
•  Optimal learning: learn µ as efficiently as possible


Prior belief: $\mu \sim N(\theta^0, \sigma_0^2)$

Updates for Independent Beliefs

•  Consider a random variable, e.g., an observation W, normally distributed. We can write its variance and precision as $\sigma_W^2$ and $\beta_W = 1/\sigma_W^2$.

•  Having seen n observations, we believe the mean of µ is $\theta^n$ and its variance is $1/\beta^n$.

•  After observing the next measurement, we update to

$$\theta^{n+1} = \frac{\beta^n \theta^n + \beta_W W^{n+1}}{\beta^n + \beta_W},
\qquad
\beta^{n+1} = \beta^n + \beta_W$$
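To make the update rule concrete, here is a minimal Python sketch of the precision-weighted update (the function and variable names are my own, not from the lecture):

```python
import numpy as np

def update_belief(theta_n, beta_n, w, beta_W):
    """One Bayesian update of a normal belief about mu with known noise precision beta_W.

    theta_n, beta_n : current posterior mean and precision of mu
    w               : new observation W^{n+1}
    beta_W          : measurement precision (1 / sigma_W^2)
    """
    beta_next = beta_n + beta_W
    theta_next = (beta_n * theta_n + beta_W * w) / beta_next
    return theta_next, beta_next

# Example: prior N(0, 1) (precision 1), noisy observations of a true mu = 1.5
rng = np.random.default_rng(0)
theta, beta = 0.0, 1.0
beta_W = 1.0 / 0.5**2             # sigma_W = 0.5
for _ in range(20):
    w = 1.5 + rng.normal(scale=0.5)
    theta, beta = update_belief(theta, beta, w, beta_W)
print(theta, 1.0 / beta)          # posterior mean approaches 1.5, variance shrinks
```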

Updates for Independent Beliefs

•  We could combine these into the more compact form

$$\theta^{n+1} = (\beta^{n+1})^{-1}\left(\beta^n \theta^n + \beta_W W^{n+1}\right)$$

•  Now consider a variance of the form

$$\mathrm{Var}^n[\,\cdot\,] = \mathrm{Var}[\,\cdot \mid W^1, W^2, \ldots, W^n],
\qquad
\tilde{\sigma}^{2,n} = \mathrm{Var}^n\!\left[\theta^{n+1} - \theta^n\right]$$

•  This is the variance given that we have already collected n measurements, so the only random variable at this point is $W^{n+1}$. We can also think of $\tilde{\sigma}^{2,n}$ as the change in the variance of $\theta^n$.
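A short derivation (not on the slide, but following directly from the update rule above and the predictive variance of $W^{n+1}$) makes the "change in variance" interpretation explicit:

$$
\theta^{n+1} - \theta^{n} = \frac{\beta_W}{\beta^{n} + \beta_W}\bigl(W^{n+1} - \theta^{n}\bigr),
\qquad
\mathrm{Var}^{n}\bigl[W^{n+1}\bigr] = \frac{1}{\beta^{n}} + \frac{1}{\beta_W},
$$

$$
\tilde{\sigma}^{2,n}
= \left(\frac{\beta_W}{\beta^{n} + \beta_W}\right)^{2}\left(\frac{1}{\beta^{n}} + \frac{1}{\beta_W}\right)
= \frac{\beta_W}{\beta^{n}\,(\beta^{n} + \beta_W)}
= \frac{1}{\beta^{n}} - \frac{1}{\beta^{n+1}},
$$

i.e., $\tilde{\sigma}^{2,n}$ is exactly the reduction in the variance of our belief caused by the (n+1)-th measurement.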

Updates for Independent Beliefs

•  We could also write $\theta^{n+1}$ in a different way by defining the variable

$$Z = \frac{\theta^{n+1} - \theta^n}{\tilde{\sigma}^n}$$

•  This is a random variable only because we have not yet observed $W^{n+1}$.

•  This gives the update

$$\theta^{n+1} = \theta^n + \tilde{\sigma}^n Z$$

What Happens to Variance after a Measurement

$$
\begin{aligned}
\mathrm{Var}(\mu) &= E[\mu^2] - (E[\mu])^2 \\
&= E[\mu^2] - E[(E[\mu \mid W])^2] + E[(E[\mu \mid W])^2] - (E[\mu])^2 \\
&= E\!\left[E[\mu^2 \mid W] - (E[\mu \mid W])^2\right] + E[(E[\mu \mid W])^2] - (E[E[\mu \mid W]])^2 \\
&= E[\mathrm{Var}(\mu \mid W)] + \mathrm{Var}[E(\mu \mid W)]
\end{aligned}
$$

(the third line uses the tower property $E[\mu] = E[E[\mu \mid W]]$ and $E[\mu^2] = E[E[\mu^2 \mid W]]$). Rearranging,

$$E[\mathrm{Var}(\mu \mid W)] = \mathrm{Var}(\mu) - \mathrm{Var}(E[\mu \mid W])$$

i.e., the variance after a measurement will, on average, always be smaller than (or equal to) the original variance. The subtracted term, $\mathrm{Var}(E[\mu \mid W])$, could be zero (if W is irrelevant to µ), but with a sensible signal it is the benefit of the measurement.
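As a quick illustration (my own sanity check, not from the slides), the decomposition can be verified by Monte Carlo for the conjugate normal model used in this lecture:

```python
import numpy as np

# Sanity check of Var(mu) = E[Var(mu|W)] + Var(E[mu|W]) for a conjugate normal model.
rng = np.random.default_rng(1)
theta0, sigma0, sigmaW = 0.0, 2.0, 1.0           # prior N(theta0, sigma0^2), noise N(0, sigmaW^2)
beta0, betaW = 1 / sigma0**2, 1 / sigmaW**2

mu = rng.normal(theta0, sigma0, size=200_000)    # draw mu from the prior
W = mu + rng.normal(0.0, sigmaW, size=mu.size)   # one measurement per draw

post_var = 1.0 / (beta0 + betaW)                              # Var(mu | W), same for every W
post_mean = (beta0 * theta0 + betaW * W) / (beta0 + betaW)    # E[mu | W]

lhs = mu.var()
rhs = post_var + post_mean.var()
print(lhs, rhs)   # both close to sigma0^2 = 4
```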

Information Acquisition and VoI


Information Acquisition

•  We want to understand the “economics of information”

•  The cost of information is highly problem dependent

•  The benefits of information can often be captured using models that combine uncertainty with simple decision problems

•  We will look at a simple problem to illustrate the key ideas regarding these benefits


Example: Simple Game as a Decision Tree

•  We need to decide whether to first acquire a signal that provides information about the probability of winning

•  Illustrated in the decision tree (figure on slide)
•  The game has two outcomes:
–  If we win (“W”), we receive reward R
–  If we lose (“L”), we lose 1
–  Lack of information is the information state “N”


Expected Value

•  Without any information signal (“N”), the probability of winning is known to be p

•  The expected value is

$$E[V \mid N] = \max\{0,\; pR - (1 - p)\}$$

–  where we assume we will not play if the expected value is negative

Remark on notation:
Unlike in our previous discussions, where V represented value in the sense of an expectation of discounted return, here value stands for the reward at the end of the game (following the convention in the literature on this topic).
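A quick worked example with illustrative numbers (not from the slides): take $p = 0.5$; then

$$
R = 2:\;\; E[V \mid N] = \max\{0,\; 2(0.5) - 0.5\} = 0.5,
\qquad
R = 0.8:\;\; E[V \mid N] = \max\{0,\; 0.4 - 0.5\} = 0,
$$

so in the second case we would simply decline to play.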

Informative Signal

•  Before we play the game, we have the option of acquiring an information signal S (e.g., purchasing a report or checking information on the internet)

•  The signal may be good (“g”) or bad (“b”)
•  We assume that this signal will correctly predict the outcome of the game with probability q, i.e.,

$$P[S = g \mid W] = P[S = b \mid L] = q$$

We would like to understand:
•  the value of purchasing the signal (an elementary information acquisition problem)
•  the value of the quality of the signal, represented by the probability q

Conditional Value

•  We first need to understand how the signal changes the expected payoff from the game.

•  The conditional value of the game given a good signal is

$$E[V \mid S = g] = \max\{0,\; R \cdot P[W \mid S = g] - P[L \mid S = g]\}$$

•  This equation captures our ability to observe the signal, and then decide whether or not we want to play the game.

•  If the signal is bad, the expected winnings are

$$E[V \mid S = b] = \max\{0,\; R \cdot P[W \mid S = b] - P[L \mid S = b]\}$$

Decision to Acquire

•  We next need to find the value of the game given that we have decided to acquire the signal, but before we know its realisation. This is given by

$$E[V \mid S] = E[V \mid S = g]\,P[S = g] + E[V \mid S = b]\,P[S = b]$$

•  For this, we need the unconditional probabilities:

$$
\begin{aligned}
P[S = g] &= P[S = g \mid W]\,P[W] + P[S = g \mid L]\,P[L] = qp + (1 - q)(1 - p) \\
P[S = b] &= P[S = b \mid W]\,P[W] + P[S = b \mid L]\,P[L] = (1 - q)p + q(1 - p)
\end{aligned}
$$

Conditional Probability of Win/Loss Given the Outcome of the Signal

•  Using Bayes’ theorem,

$$P[W \mid S = g] = \frac{P[W]\,P[S = g \mid W]}{P[S = g]} = \frac{pq}{qp + (1 - q)(1 - p)}$$

•  Correspondingly, for the bad signal,

$$P[W \mid S = b] = \frac{P[W]\,P[S = b \mid W]}{P[S = b]} = \frac{p(1 - q)}{(1 - q)p + q(1 - p)}$$

$$P[L \mid S = g] = 1 - P[W \mid S = g], \text{ etc.}$$
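With the illustrative values that appear in the figure later in the lecture ($p = 0.5$, $q = 0.7$), these formulas give

$$
P[S = g] = P[S = b] = 0.7(0.5) + 0.3(0.5) = 0.5,
\qquad
P[W \mid S = g] = \frac{0.5 \times 0.7}{0.5} = 0.7,
\qquad
P[W \mid S = b] = \frac{0.5 \times 0.3}{0.5} = 0.3,
$$

so a good signal raises our assessed win probability from 0.5 to 0.7, and a bad signal lowers it to 0.3.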

Value of the Signal
•  Let S represent the decision to acquire the signal before we know its outcome.
•  The expected value of the game, given that we have chosen to acquire the signal, is

$$
\begin{aligned}
E[V \mid S] &= E[V \mid S = g]\,P[S = g] + E[V \mid S = b]\,P[S = b] \\
&= \max\{0,\; R\,P[W \mid S = g] - P[L \mid S = g]\}\bigl(qp + (1 - q)(1 - p)\bigr) \\
&\quad + \max\{0,\; R\,P[W \mid S = b] - P[L \mid S = b]\}\bigl((1 - q)p + q(1 - p)\bigr) \\
&= \max\!\left\{0,\; R\,\frac{pq}{qp + (1 - q)(1 - p)} - \frac{(1 - q)(1 - p)}{qp + (1 - q)(1 - p)}\right\}\bigl(qp + (1 - q)(1 - p)\bigr) \\
&\quad + \max\!\left\{0,\; R\,\frac{p(1 - q)}{(1 - q)p + q(1 - p)} - \frac{q(1 - p)}{(1 - q)p + q(1 - p)}\right\}\bigl((1 - q)p + q(1 - p)\bigr)
\end{aligned}
$$

Value of the Signal
•  The value of the signal, which depends on the game reward R, the probability of winning p, and the quality of the signal q, is

$$V^S(R, p, q) = E[V \mid S] - E[V \mid N]$$

(Figure on slide: example curves of $V^S$ for p = 0.5, q = 0.7.)
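A minimal Python sketch of this computation (function and variable names are my own); it simply strings together the quantities derived on the preceding slides:

```python
def signal_value(R, p, q):
    """Value of acquiring the signal, V^s(R, p, q) = E[V|S] - E[V|N].

    R : reward if we win (losing costs 1)
    p : prior probability of winning
    q : probability the signal correctly predicts the outcome
    """
    # Value without the signal
    EV_N = max(0.0, p * R - (1 - p))

    # Unconditional signal probabilities
    P_g = q * p + (1 - q) * (1 - p)
    P_b = (1 - q) * p + q * (1 - p)

    # Posterior win probabilities given each signal (Bayes' theorem)
    pW_g = p * q / P_g
    pW_b = p * (1 - q) / P_b

    # Value with the signal: observe it, then decide whether to play
    EV_S = (max(0.0, R * pW_g - (1 - pW_g)) * P_g +
            max(0.0, R * pW_b - (1 - pW_b)) * P_b)

    return EV_S - EV_N

print(signal_value(R=2.0, p=0.5, q=0.7))   # 0.05 > 0: the signal is worth something here
```

With R = 2, p = 0.5, q = 0.7 this returns 0.05: the only thing the signal buys us is the option to skip playing after a bad report, and that option is worth 0.05 here.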

Summary of the Simple Example

•  We have computed the “value” of a discrete piece of information in a stylized setting.
–  Note that the use of value here, while consistent with our earlier usage, is notationally slightly simpler: the return for a single piece of information does not need a discounted sum

•  Next, we turn to a variant where we are allowed to take multiple measurements to increase the precision of the information gained


Towards Marginal Value of Information

•  Imagine that we have a choice between doing nothing (with reward 0) and choosing a random reward with mean µ.

•  Assume that our prior belief about µ is normally distributed with mean and precision $\left(\theta^0,\; \beta^0 = \frac{1}{\sigma_0^2}\right)$.

•  Before playing the game, we are allowed to collect a series of measurements $W^1, W^2, \ldots, W^n$ (we will ignore their cost for now).

•  We assume that W has the unknown mean µ and a known precision $\beta_W$.

Estimating Reward after n Measurements

•  If we choose to make n measurements, the precision of our estimate of the reward would be

$$\beta^n = \beta^0 + n\,\beta_W$$

•  The updated estimate of our reward (using a Bayesian model) would be

$$\theta^n = \frac{\beta^0 \theta^0 + n\,\beta_W \bar{W}^n}{\beta^0 + n\,\beta_W},
\qquad
\bar{W}^n = \frac{1}{n}\sum_{k=1}^{n} W^k$$
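A small Python sketch (illustrative names, my own) confirming that this batch formula matches applying the one-step update from earlier n times:

```python
import numpy as np

def batch_posterior(theta0, beta0, W, beta_W):
    """theta^n and beta^n computed directly from the batch mean W-bar^n."""
    n = len(W)
    beta_n = beta0 + n * beta_W
    theta_n = (beta0 * theta0 + n * beta_W * np.mean(W)) / beta_n
    return theta_n, beta_n

# The batch formula agrees with applying the one-step update n times.
rng = np.random.default_rng(0)
W = 1.5 + rng.normal(scale=0.5, size=10)
theta, beta, beta_W = 0.0, 1.0, 1.0 / 0.5**2
for w in W:                                   # sequential updates
    theta = (beta * theta + beta_W * w) / (beta + beta_W)
    beta = beta + beta_W
print(batch_posterior(0.0, 1.0, W, beta_W))   # same (theta^n, beta^n) as the loop
print(theta, beta)
```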

•  Create a random variable capturing our belief about the reward
•  Use this to make a decision about whether to play the game
•  Start with a known identity,

$$\mathrm{Var}(\mu) = E[\mathrm{Var}(\mu \mid W^1, \ldots, W^n)] + \mathrm{Var}[E[\mu \mid W^1, \ldots, W^n]]$$

where

$$\mathrm{Var}(\mu \mid W^1, \ldots, W^n) = \frac{1}{\beta^n} = (\beta^0 + n\,\beta_W)^{-1},
\qquad
E[\mu \mid W^1, \ldots, W^n] = \theta^n$$

•  We can then write the change in variance (the variance of $\theta^n$, given what we knew before we took the n measurements) as

$$\tilde{\sigma}^2(n) = \mathrm{Var}(\theta^n) = \mathrm{Var}(\mu) - E\!\left[\frac{1}{\beta^n}\right]
= \frac{1}{\beta^0} - \frac{1}{\beta^0 + n\,\beta_W}$$

Value of Information

•  With Z denoting a standard (zero-mean, unit-variance) normal random variable, we can write

$$\theta^n = \theta^0 + \tilde{\sigma}(n)\,Z, \qquad \tilde{\sigma}(n) = \sqrt{\tilde{\sigma}^2(n)}$$

•  After our n measurements, we will choose to play the game if we believe the value of the game is positive.

•  That value is

$$V^n = E[\max\{0, \theta^n\}]$$

•  For each distribution family of interest, one could write down such an expression and expand it to get an analytical formulation of the VoI.
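Under the normal-belief assumptions above, $E[\max\{0, \theta^n\}]$ has the closed form $\theta^0\,\Phi(\theta^0/\tilde{\sigma}(n)) + \tilde{\sigma}(n)\,\phi(\theta^0/\tilde{\sigma}(n))$. The following Python sketch (my own, with hypothetical parameter values) computes the VoI curve and its slope, the marginal VoI:

```python
import numpy as np
from scipy.stats import norm

def voi_curve(theta0, sigma0, sigma_W, n_max):
    """V^n = E[max(0, theta^n)] after n measurements, for a normal prior
    N(theta0, sigma0^2) and i.i.d. measurement noise with variance sigma_W^2."""
    beta0, beta_W = 1.0 / sigma0**2, 1.0 / sigma_W**2
    n = np.arange(1, n_max + 1)
    # Predictive std. dev. of theta^n seen from time 0 (sigma-tilde(n) in the slides)
    s = np.sqrt(1.0 / beta0 - 1.0 / (beta0 + n * beta_W))
    # Closed form for E[max(0, X)] with X ~ N(theta0, s^2)
    V = theta0 * norm.cdf(theta0 / s) + s * norm.pdf(theta0 / s)
    return n, V

n, V = voi_curve(theta0=-0.5, sigma0=1.0, sigma_W=2.0, n_max=30)
marginal = np.diff(V)   # the slope of the VoI curve: marginal value of one more measurement
print(V[:5].round(4), marginal[:5].round(4))
```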

Example VoI Curves


The slope of these curves provides the marginal VoI

Exploration with a Mobile Robot


Exploration Problems

•  Exploration: control a mobile robot so as to maximize knowledge about the external world

•  Example: the robot needs to acquire a map of a static environment. If we represent the map as an “occupancy grid”, exploration means maximising the cumulative information we have about each grid cell

•  POMDPs already subsume this function, but we need to define an appropriate payoff function

•  One good choice is information gain: the reduction in the entropy of the robot’s belief as a function of its actions


Exploration Heuristics

•  While POMDPs are conceptually useful here, we may not want to use them directly: the state/observation space is huge

•  We will instead try to derive a greedy heuristic based on the notion of information gain.

•  Limit the lookahead to just one exploration action
–  The exploration action could itself involve a sequence of control actions (but logically it will serve as one exploration action)
–  For instance, select a location to explore anywhere in the map, then go there


Information and Entropy

•  The key to exploration is information.
•  Entropy (the expected information) of a distribution p:

$$H_p(x) = -\int p(x)\,\log p(x)\,dx \quad\text{or}\quad -\sum_x p(x)\,\log p(x)$$

•  Entropy is at its maximum for a uniform distribution p
•  Conditional entropy is the entropy of a conditional distribution
•  In exploration, we seek to minimize the expected entropy of the belief after executing an action
•  So, we condition on the measurement z and the control u that define the belief state transition

Conditional Entropy after Action/Observation

•  Let B(b, z, u) denote the belief after executing control u and observing z under belief b.

•  The conditional entropy of the state x' after executing action u and measuring z is given by

$$H_b(x' \mid z, u) = -\int B(b, z, u)(x')\,\log B(b, z, u)(x')\,dx'$$

•  The conditional entropy given the control alone is

$$H_b(x' \mid u) = E_z[H_b(x' \mid z, u)]
= \iiint H_b(x' \mid z, u)\,p(z \mid x')\,p(x' \mid u, x)\,b(x)\,dz\,dx'\,dx$$

Greedy Techniques

•  Expected information gain lets us phrase exploration as a decision-theoretic problem.

•  The information gain of control u is

$$I_b(u) = H_p(x) - H_b(x' \mid u) = H_p(x) - E_z[H_b(x' \mid z, u)]$$

Greedy Techniques

•  If r(x, u) is the cost of applying control u in state x (treating costs as negative numbers), then optimal greedy exploration for the belief b maximizes the difference between the (weighted) expected information gain and the expected cost:

$$\pi(b) = \operatorname*{argmax}_{u}\;
\alpha\bigl(H_p(x) - E_z[H_b(x' \mid z, u)]\bigr) + \int r(x, u)\,b(x)\,dx$$

The first term is the expected information gain (original entropy minus conditional entropy), weighted by α; the second term is the expected cost.
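To make the greedy rule concrete, here is a minimal Python sketch for a small discrete belief; the entropy function implements the definition from the earlier slide, and the model arrays and numbers are made up purely for illustration:

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of a discrete distribution p (natural log)."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p + eps))

def greedy_exploration_action(b, controls, T, O, r, alpha=1.0):
    """Greedy info-gain policy pi(b) for a discrete POMDP-style model.

    b        : belief over states, shape (S,)
    controls : list of control indices
    T        : transition model, T[u][x, x'] = p(x' | u, x)
    O        : observation model, O[z, x'] = p(z | x')
    r        : cost model, r[x, u] (costs are negative numbers)
    alpha    : weight on information gain relative to cost
    Returns the control maximizing alpha * I_b(u) + expected cost.
    """
    H_prior = entropy(b)
    best_u, best_score = None, -np.inf
    for u in controls:
        b_pred = b @ T[u]                       # predicted belief over x' after control u
        p_z = O @ b_pred                        # p(z | b, u), shape (Z,)
        # Expected posterior entropy E_z[ H_b(x' | z, u) ]
        H_cond = 0.0
        for z, pz in enumerate(p_z):
            if pz < 1e-12:
                continue
            post = O[z] * b_pred / pz           # B(b, z, u): Bayes update of the predicted belief
            H_cond += pz * entropy(post)
        info_gain = H_prior - H_cond
        expected_cost = float(b @ r[:, u])      # integral of r(x, u) b(x) dx, discretized
        score = alpha * info_gain + expected_cost
        if score > best_score:
            best_u, best_score = u, score
    return best_u

# Tiny illustrative example: 3 states, 2 controls, 2 observations (all numbers made up)
b = np.array([0.5, 0.3, 0.2])
T = [np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]),   # u = 0: stay put
     np.array([[0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.8, 0.1, 0.1]])]   # u = 1: move
O = np.array([[0.9, 0.2, 0.5],    # p(z = 0 | x')
              [0.1, 0.8, 0.5]])   # p(z = 1 | x')
r = np.array([[0.0, -0.1]] * 3)   # moving costs a little, staying is free
print(greedy_exploration_action(b, [0, 1], T, O, r))
```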

Example: Combining Exploration and Mapping

•  By reasoning about control, the mapping process can be made much more effective

•  Question: where to move next in a map?


Exploration Problem: Visually (figure on slide)


Map Entropy

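As a closing illustration of the map-entropy idea (my own sketch, assuming the standard occupancy-grid model with independent cells): the entropy of the map is the sum of per-cell Bernoulli entropies, which shrinks as cells become confidently free or occupied.

```python
import numpy as np

def map_entropy(occ_probs):
    """Total entropy of an occupancy-grid map: sum of per-cell Bernoulli entropies.

    occ_probs : array of P(cell occupied), one entry per grid cell.
    Cells near 0.5 (unknown) contribute most; well-explored cells contribute ~0.
    """
    p = np.clip(np.asarray(occ_probs, dtype=float), 1e-12, 1 - 1e-12)
    return float(np.sum(-p * np.log(p) - (1 - p) * np.log(1 - p)))

unknown_map = np.full((10, 10), 0.5)          # nothing observed yet: maximum entropy
partly_mapped = unknown_map.copy()
partly_mapped[:5] = 0.05                      # half the cells observed to be (probably) free
print(map_entropy(unknown_map), map_entropy(partly_mapped))
```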