
Reinforcement Learning

Generalization and Function Approximation

Subramanian Ramamoorthy
School of Informatics

28 February, 2017

Core RL Task: Estimate Value Function

•  We are interested in determining Vπ from experience
generated using a policy, π.

•  So far, we have considered situations wherein Vπ is
represented in a table.

•  We would now like to represent Vπ using a parameterized
functional form.

Need for Generalization

•  Large state/action spaces
•  Continuous-valued states and actions
•  Most states may not have been experienced exactly before
•  Many considerations:

–  Memory
–  Time
–  Data

How can experience with a small part of the state space be used to
produce good behaviour over a large part of the state space?

Function Approximation

•  Use a weight vector θ to parameterize the functional form.
•  Vπ is approximated by another function V(s, θ)
•  This function could be a linear function in features of
the state s
–  θ would then be the feature weights

•  The function could also be computed by a neural network
–  θ would be the vector of connection weights in all layers
–  More expressive and could capture many functional forms

•  Another example is to compute V with a decision tree
–  θ would be numbers defining split points and leaf values

Feature Vectors

This gives you a summary of the state,
e.g., state ↔ weighted linear combination of features

Issues with Function Approximation

•  The number of weights (the number of components of θ) is much less
than the number of states (n ≪ |S|)

•  Changing any component of the weight vector will have an
effect on more than one state at a time
–  In contrast, with a tabular representation, backups were
computed state by state, independently

•  This is generalization
–  Potentially more powerful
–  May need to be managed with care

Supervised Learning Approach

[Diagram: a parameterized function maps inputs to outputs; training
information is used to adapt the parameters.]

Training info = desired (target) outputs
Error = (target output – actual output)
Training example = {input, target output}

Value Prediction – Backups

•  All of the value prediction methods we have looked at can be
understood as ‘backups’

•  Updates to an existing value function that shift its value at
particular states towards a ‘backed-up value’

•  A function mapping Vπ(s) to a goal value towards which Vπ(s) is
then shifted.

•  In the case of Monte Carlo prediction, the goal value is the return Rt
•  The goal value in the case of TD(0) is rt+1 + γVπ(st+1)

Using Backups as Training Examples

e.g., the TD(0) backup:

V(st) ← V(st) + α [rt+1 + γV(st+1) − V(st)]

As a training example:

{ description of st,  rt+1 + γV(st+1) }

where the input is the description of st (e.g., a feature vector) and the
target output is rt+1 + γV(st+1).
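To make this concrete, here is a minimal Python sketch (function and variable names are my own, not from the slides) of turning a TD(0) backup into a training example for a linear approximator and applying the corresponding update:

```python
import numpy as np

def td0_training_example(phi_s, reward, phi_s_next, theta, gamma=0.99):
    """Turn one transition into an (input, target) pair, as on the slide:
    input  = feature description of s_t
    target = r_{t+1} + gamma * V(s_{t+1}), computed with the current weights."""
    target = reward + gamma * float(theta @ phi_s_next)
    return phi_s, target

def semi_gradient_update(theta, phi_s, target, alpha=0.1):
    """theta <- theta + alpha * [target - V(s_t)] * grad V(s_t);
    for a linear V, grad V(s_t) is just the feature vector phi_s."""
    td_error = target - float(theta @ phi_s)
    return theta + alpha * td_error * phi_s
```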

What Kind of Function Approximation?

•  Neural networks, decision trees, multivariate regression …
•  Use of statistical clustering for state aggregation
•  Can swap in/out your favourite function approximation
method as long as it can deal with:
–  learning while interacting, online (not just batch methods)
–  non-stationarity induced by policy changes

•  As learning proceeds, the target function may change (e.g.,
in the case of TD(0))

•  So, combining gradient descent methods with RL requires
care (especially regarding convergence)

Prediction Objective

•  In the tabular case, updates were decoupled. Eventually, we
would arrive at the correct Vπ for all states.

•  When approximating the value function, we need a measure of the
quality of prediction
–  It may not be possible to get exactly correct predictions in all
states
–  Because we have more states than weights

•  What else could we do?
–  Decide which states we care about the most
–  Weight the error with a distribution P(s)

Performance Measures

•  Many are applicable but…
•  a common and simple one is the mean-squared error (MSE)
over a distribution P:

MSE(θt) = Σs∈S P(s) [Vπ(s) − Vt(s)]²

•  Why minimize MSE?
The real objective is of course a better policy, but it is unclear how
else to get at that other than value prediction.

•  Why P?
Consider, e.g., the case where P is always the distribution of
states at which backups are done.

[Figure: the approximated value ‘surface’ vs. the values obtained, e.g.,
by backup updates.]
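As a rough illustration (with hypothetical array inputs), the weighted MSE above can be computed directly for a small, finite state set:

```python
import numpy as np

def weighted_mse(P, v_pi, v_hat):
    """MSE(theta_t) = sum_s P(s) * [V^pi(s) - V_t(s)]^2 over a finite state set."""
    P, v_pi, v_hat = map(np.asarray, (P, v_pi, v_hat))
    return float(np.sum(P * (v_pi - v_hat) ** 2))

# e.g. three states, weighting errors by how often each state is visited
print(weighted_mse([0.5, 0.3, 0.2], [1.0, 0.0, -1.0], [0.8, 0.1, -0.7]))
```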

Choosing P(s)

•  One natural definition: the fraction of time spent in s while
following the target policy π.

•  In continuing tasks, this is the stationary distribution
under π

•  In episodic tasks, this depends on how the initial states of
episodes are drawn

•  The on-policy distribution: the distribution created while
following the policy being evaluated. Stronger results are
available for this distribution.

Linear Methods

Represent states as feature vectors; for each s ∈ S:

φs = (φ1, φ2, …, φn)sᵀ

Vt(s) = θtᵀ φs = Σ i=1..n (θi)t (φi)s

∇θ Vt(s) = ?
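A minimal sketch of the linear form, assuming NumPy feature vectors: the value is the inner product θᵀφs, and the gradient asked for above is simply the feature vector φs.

```python
import numpy as np

def linear_value(theta, phi_s):
    """V_t(s) = theta^T phi_s = sum_i theta_i * phi_i(s)."""
    return float(np.dot(theta, phi_s))

def grad_linear_value(theta, phi_s):
    """For a linear approximator, grad_theta V_t(s) = phi_s (independent of theta)."""
    return phi_s
```

This is what makes linear methods attractive: the gradient needed by the updates on the following slides is available essentially for free.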

What are we learning? From what?

Update the weight vector:

θt = (θ1, θ2, …, θn)tᵀ

Assume Vt is a (sufficiently smooth) differentiable function
of θt, for all s ∈ S.

Assume, for now, training examples of this form:

{ description of st,  Vπ(st) }

Concept: Gradient Descent

Let f be any function of the parameter space.
Its gradient at any point θt in this space is:

∇θ f(θt) = ( ∂f(θt)/∂θ1, ∂f(θt)/∂θ2, …, ∂f(θt)/∂θn )ᵀ

[Figure: contour plot over the (θ1, θ2) plane with the parameter vector
θt = (θ1, θ2)tᵀ.]

Iteratively move down the gradient:

θt+1 = θt − α ∇θ f(θt)
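As a generic illustration (f and its gradient below are placeholders, not anything from the slides), the iteration looks like this in code:

```python
import numpy as np

def gradient_descent(grad_f, theta0, alpha=0.1, n_steps=100):
    """theta_{t+1} = theta_t - alpha * grad f(theta_t)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        theta = theta - alpha * grad_f(theta)
    return theta

# e.g. minimizing f(theta) = ||theta||^2, whose gradient is 2*theta
print(gradient_descent(lambda th: 2 * th, theta0=[3.0, -2.0]))
```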

Gradient Descent for Weights
– Basic Setup

For the MSE given earlier and using the chain rule:

θt+1 = θt − ½ α ∇θ MSE(θt)
     = θt − ½ α ∇θ Σs∈S P(s) [Vπ(s) − Vt(s)]²
     = θt + α Σs∈S P(s) [Vπ(s) − Vt(s)] ∇θ Vt(s)

Gradient Computation

In practice, we could just use the sample gradient
(as we are acting based on the same distribution, P(s)):

θt+1 = θt − ½ α ∇θ [Vπ(st) − Vt(st)]²
     = θt + α [Vπ(st) − Vt(st)] ∇θ Vt(st)

Since each sample gradient is an unbiased estimate of
the true gradient,

E{ [Vπ(st) − Vt(st)] ∇θ Vt(st) } = Σs∈S P(s) [Vπ(s) − Vt(s)] ∇θ Vt(s),

this converges to a local minimum of the MSE if α decreases
appropriately with t.

But We Don’t Have These Targets

Suppose we just have targets vt instead:

θt+1 = θt + α [vt − Vt(st)] ∇θ Vt(st)

If each vt is an unbiased estimate of Vπ(st),
i.e., E{vt} = Vπ(st), then gradient descent converges
to a local minimum (provided α decreases appropriately).

e.g., the Monte Carlo target vt = Rt:

θt+1 = θt + α [Rt − Vt(st)] ∇θ Vt(st)
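A sketch of gradient Monte Carlo prediction with a linear approximator; `generate_episode` is an assumed helper (not from the slides) that yields (feature vector of st, return Rt) pairs for the policy being evaluated:

```python
import numpy as np

def gradient_mc_prediction(generate_episode, n_features, alpha=0.01, n_episodes=1000):
    """After each episode, move theta towards the Monte Carlo return R_t:
    theta <- theta + alpha * [R_t - V_t(s_t)] * grad V_t(s_t), with grad = phi(s_t)."""
    theta = np.zeros(n_features)
    for _ in range(n_episodes):
        for phi_s, G in generate_episode():   # (features of s_t, return R_t) pairs
            theta += alpha * (G - theta @ phi_s) * phi_s
    return theta
```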

State aggregation is the simplest kind of
Value Function Approximation

•  States are partitioned into disjoint subsets (groups)
•  One component of θ is allocated to each group

Recall:

Vt(s) = θgroup(s)

∇θ Vt(s) = [0, 0, …, 1, 0, 0, …, 0]   (a 1 in the component for the group containing s)

θ ← θ + α [Targett − Vt(st)] ∇θ Vt(st)
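A minimal sketch of state aggregation, assuming a user-supplied `group_of(s)` mapping; each group shares one weight, so the gradient is an indicator vector and only one component changes per update:

```python
import numpy as np

def make_aggregated_vf(n_groups, group_of):
    """group_of(s) -> group index; one weight per group."""
    theta = np.zeros(n_groups)

    def value(s):
        return theta[group_of(s)]

    def update(s, target, alpha):
        # grad V(s) is 1 for s's group and 0 elsewhere, so only one weight moves
        theta[group_of(s)] += alpha * (target - theta[group_of(s)])

    return value, update, theta
```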

1000-state random walk example

•  States are numbered 1 to 1000

•  Walks start near the middle, at state 500

•  At each step, jump to one of the 100 states to the right,
or to one of the 100 states to the left

•  If the jump goes beyond 1 or 1000, the episode terminates with a
reward of −1 or +1, respectively
(otherwise rt = 0)
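A possible sketch of this environment (names and the transition-sampling details are my own reading of the bullets above):

```python
import random

def random_walk_episode(n_states=1000, jump=100, start=500):
    """Yield (state, reward, done) steps of the 1000-state random walk."""
    s = start
    while True:
        step = random.choice([-1, 1]) * random.randint(1, jump)
        s_next = s + step
        if s_next < 1:          # fell off the left end
            yield s, -1, True
            return
        if s_next > n_states:   # fell off the right end
            yield s, +1, True
            return
        yield s, 0, False
        s = s_next
```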

State aggregation into 10 groups of 100

The whole value function over 1000 states will be approximated with 10 numbers!

VF Computed with Gradient MC

•  10 groups of 100 states

•  after 100,000 episodes

•  α = 2 × 10⁻⁵

•  the state distribution affects
accuracy

On Basis Functions: Coarse Coding

Many ways to achieve “coding”:

Shaping Generalization in Coarse Coding

Binary features defined by overlap between receptive fields.

Learning and Coarse Coding


Tile Coding

•  Binary feature for each tile
•  Number of features present at any one time is constant
•  Binary features mean the weighted sum is easy to compute
•  Easy to compute the indices of the features present
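A rough sketch of grid-style tile coding for a 2-D state, as a simplified stand-in for standard tile-coding software: each of several uniformly offset tilings contributes exactly one active tile, so the number of active features is constant and only their indices need to be computed.

```python
def active_tiles(x, y, n_tilings=8, tiles_per_dim=10, low=0.0, high=1.0):
    """Return one active feature index per tiling for a point (x, y) in [low, high]^2."""
    indices = []
    width = (high - low) / tiles_per_dim
    tiles_per_tiling = (tiles_per_dim + 1) ** 2   # offset tilings need one extra tile per dim
    for t in range(n_tilings):
        offset = t * width / n_tilings            # each tiling is uniformly offset
        ix = min(int((x - low + offset) // width), tiles_per_dim)
        iy = min(int((y - low + offset) // width), tiles_per_dim)
        indices.append(t * tiles_per_tiling + ix * (tiles_per_dim + 1) + iy)
    return indices

def tile_coded_value(theta, indices):
    """With binary features, the weighted sum is just a sum of the active weights;
    theta needs n_tilings * (tiles_per_dim + 1)**2 components."""
    return sum(theta[i] for i in indices)
```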

Encoding 2D Space with Many Tiles


Generalizing with Uniformly Offset Tiles


Generalizing with Asymmetrically Offset
Tiles


Tile Coding, Contd.

Irregular tilings

Hashing

Radial Basis Functions (RBFs)

e.g., Gaussians:

(φi)s = exp( − ‖s − ci‖² / (2σi²) )
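A short sketch of Gaussian RBF features; the centers and widths below are illustrative choices, not values from the slides:

```python
import numpy as np

def rbf_features(s, centers, sigmas):
    """phi_i(s) = exp(-||s - c_i||^2 / (2 * sigma_i^2))."""
    s = np.atleast_1d(s).astype(float)
    centers = np.atleast_2d(centers).astype(float)
    sigmas = np.asarray(sigmas, dtype=float)
    dists_sq = np.sum((centers - s) ** 2, axis=1)
    return np.exp(-dists_sq / (2.0 * sigmas ** 2))

# e.g. 5 centers spread over [0, 1] for a scalar state
centers = np.linspace(0, 1, 5).reshape(-1, 1)
print(rbf_features(0.3, centers, sigmas=np.full(5, 0.2)))
```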

Beating the “Curse of Dimensionality”

•  Can you keep the number of features from going up
exponentially with the dimension?

•  Function complexity, not dimensionality, is the problem.
•  Kanerva coding:
–  Select a set of binary prototypes
–  Use Hamming distance as the distance measure
–  Dimensionality is no longer a problem, only complexity

•  “Lazy learning” schemes:
–  Remember all the data
–  To get a new value, find nearest neighbors & interpolate
–  e.g., (nonparametric) locally-weighted regression
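A minimal sketch of the Kanerva-coding idea from the list above (prototype count and activation threshold are arbitrary illustrative choices): binary state descriptions are compared to a fixed set of random binary prototypes, and a feature is active when the Hamming distance to its prototype is small.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_prototypes(n_prototypes, n_bits):
    """Random binary prototypes; dimensionality enters only through n_bits."""
    return rng.integers(0, 2, size=(n_prototypes, n_bits))

def kanerva_features(state_bits, prototypes, threshold):
    """Feature i is 1 if the Hamming distance to prototype i is <= threshold."""
    hamming = np.sum(prototypes != np.asarray(state_bits), axis=1)
    return (hamming <= threshold).astype(float)

protos = make_prototypes(n_prototypes=50, n_bits=16)
print(kanerva_features(rng.integers(0, 2, size=16), protos, threshold=4))
```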

Going from Value Prediction to GPI

•  So far, we’ve only discussed policy evaluation where the value
function is represented as an approximated function

•  In order to extend this to a GPI-like setting,
1.  Firstly, we need to use the action-value functions
2.  Combine that with the policy improvement and action
selection steps

Gradient Descent Update for
Action-Value Function Prediction
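The equation on this slide did not survive extraction. As a hedged reconstruction, the update has the same form as before with Q in place of V; a sketch with the SARSA target and a linear approximator over state–action features (phi_sa denotes an assumed state–action feature vector) might look like this:

```python
import numpy as np

def sarsa_semi_gradient_update(theta, phi_sa, reward, phi_sa_next, alpha=0.1, gamma=1.0):
    """theta <- theta + alpha * [r + gamma*Q(s',a') - Q(s,a)] * grad Q(s,a),
    with a linear Q(s,a) = theta^T phi(s,a), so grad Q(s,a) = phi(s,a)."""
    target = reward + gamma * float(theta @ phi_sa_next)
    td_error = target - float(theta @ phi_sa)
    return theta + alpha * td_error * phi_sa
```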

How to Plug in Policy Improvement
or Action Selection?

•  If the spaces are very large, or continuous, this is an active
research topic

•  For manageable discrete spaces:
–  For each action, a, available at a state, st, compute Qt(st, a)
and find the greedy action according to it

–  Then, one could use this as part of an ε-greedy action
selection (sketched below) or as the estimation policy in off-policy methods
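A small sketch of the greedy/ε-greedy selection described above (`q_value` and `actions` are assumed inputs, not names from the slides):

```python
import random

def epsilon_greedy_action(q_value, state, actions, epsilon=0.1):
    """With probability epsilon pick a random action; otherwise the greedy one under Q_t."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_value(state, a))
```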

Example: Mountain-car Task

•  Drive an underpowered car up
a steep mountain road

•  Gravity is stronger than the engine
(as in the cart-pole example)

•  Example of a continuous
control task where the system
must first move away from the goal,
then converge to the goal

•  Reward of −1 until the car
‘escapes’

•  Actions: +τ, −τ, 0
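For reference, a sketch of the task dynamics in the standard formulation used by Sutton and Barto; the constants below come from that formulation, not from the slide:

```python
import math

def mountain_car_step(position, velocity, action):
    """action in {-1, 0, +1} (reverse, zero, forward throttle); reward is -1 per step."""
    velocity += 0.001 * action - 0.0025 * math.cos(3 * position)
    velocity = max(-0.07, min(0.07, velocity))
    position += velocity
    if position < -1.2:                 # left wall: stop the car
        position, velocity = -1.2, 0.0
    done = position >= 0.5              # reached the goal at the top of the hill
    return position, velocity, -1.0, done
```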

Mountain-car Example:
Cost-to-go Function (SARSA solution)

Mountain Car Solution with RBFs

[Computed by M. Kretchmar]