Reinforcement Learning
Generalization and Function Approximation
Subramanian Ramamoorthy
School of Informatics
28 February, 2017
Core RL Task: Estimate Value Function
• We are interested in determining Vπ from experience generated using a policy, π.
• So far, we have considered situations wherein Vπ is represented in a table.
• We would now like to represent Vπ using a parameterized functional form.
Need for Generalization
• Large state/action spaces
• Continuous valued states and actions
• Most states may not be experienced exactly before
• Many considerations:
– Memory
– Time
– Data
How can experience with a small part of the state space be used to
produce good behaviour over a large part of the state space?
Function Approximation
• Use a weight vector θ to parameterize the functional form.
• Vπ is approximated by another function V(s, θ).
• This function could be a linear function in features of the state s
– θ would then be the feature weights
• The function could also be computed by a neural network
– θ would be the vector of connection weights in all layers
– More expressive and could capture many functional forms
• Another example is to compute V with a decision tree
– θ would be the numbers defining split points and leaf values
Feature Vectors
This gives you a summary of the state,
e.g., state ↔ weighted linear combination of features
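As a small illustration of the "weighted linear combination" idea (the feature values and weights below are invented for the example, not taken from the slides):

import numpy as np

# Hypothetical 3-dimensional description of a state s
phi_s = np.array([1.0, 0.5, -2.0])   # feature vector phi(s)
theta = np.array([0.2, 1.0, 0.1])    # learned feature weights

# Approximate value: a weighted linear combination of the features
v_s = float(np.dot(theta, phi_s))    # 0.2*1.0 + 1.0*0.5 + 0.1*(-2.0) = 0.5
print(v_s)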
Issues with Function Approximation
• The number of weights (the number of components of θ) is much less than the number of states: n ≪ |S|
• Changing any component of the weight vector will have an effect on more than one state at a time
– In contrast, with a tabular representation, backups were computed state by state, independently
• This is generalization
– Potentially more powerful
– May need to be managed with care
Supervised Learning Approach
[Diagram: training examples {input, target output} feed a parameterized function; its parameters are adapted to reduce the error (target output − actual output).]
Value Prediction – Backups
• All of the value prediction methods we have looked at can be understood as 'backups'
• Updates to an existing value function that shift its value at particular states towards a 'backed-up value'
• A function mapping Vπ(s) to a goal value towards which Vπ(s) is then shifted.
• In the case of Monte Carlo prediction, the goal value is the return R_t
• The goal value in the case of TD(0) is r_{t+1} + γVπ(s_{t+1})
Using Backups as Training Examples
e.g., the TD(0) backup:
V(s_t) ← V(s_t) + α[ r_{t+1} + γV(s_{t+1}) − V(s_t) ]
As a training example:
{ description of s_t , r_{t+1} + γV(s_{t+1}) }
where the description of s_t (e.g., a feature vector) is the input and r_{t+1} + γV(s_{t+1}) is the target output.
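A minimal sketch of how such a training example drives a parameter update, assuming a linear approximator V(s) = θ·φ(s); the function and variable names are illustrative, not part of the slides:

import numpy as np

def td0_update(theta, phi_s, r, phi_s_next, alpha=0.1, gamma=0.99):
    # Current estimate V(s_t) for a linear approximator V(s) = theta . phi(s)
    v_s = np.dot(theta, phi_s)
    # Backed-up target r_{t+1} + gamma * V(s_{t+1})
    target = r + gamma * np.dot(theta, phi_s_next)
    # Shift theta so that V(s_t) moves towards the target
    # (for a linear V, the gradient with respect to theta is phi_s)
    return theta + alpha * (target - v_s) * phi_s

theta = td0_update(np.zeros(3), np.array([1.0, 0.0, 1.0]), 0.5, np.array([0.0, 1.0, 0.0]))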
What Kind of Function Approximation?
• Neural networks, decision trees, multivariate regression …
• Use of statistical clustering for state aggregation
• Can swap in/out your favourite function approximation method as long as it can deal with:
– learning while interacting, online (not just batch methods)
– non-stationarity induced by policy changes
• As learning proceeds, the target function may change (e.g., in the case of TD(0))
• So, combining gradient descent methods with RL requires care (especially regarding convergence)
Prediction Objective
• In the tabular case, updates were decoupled. Eventually, we would arrive at the correct Vπ for all states.
• When approximating the value function, we need a measure of the quality of prediction
– May not be possible to get an exactly correct prediction in all states
– Because we have more states than weights
• What else could we do?
– Decide which states we care about the most
– Weight the error with a distribution P(s)
Performance Measures
• Many are applicable but…
• a common and simple one is the mean-squared error (MSE) over a distribution P:
MSE(θ_t) = Σ_{s∈S} P(s) [ Vπ(s) − V_t(s) ]²
• Why minimize MSE?
Real objective is of course a better policy, but it is unclear how else to get at that other than value prediction.
• Why P?
Consider, e.g., the case where P is always the distribution of states at which backups are done.
[Figure: the approximated value 'surface' vs. the values obtained, e.g., by backup updates]
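A direct transcription of the weighted MSE above into code, with made-up numbers for P, Vπ and V_t:

import numpy as np

P       = np.array([0.5, 0.3, 0.2])    # state distribution P(s)
v_pi    = np.array([1.0, 2.0, 3.0])    # true values V^pi(s)
v_theta = np.array([1.1, 1.8, 3.4])    # approximated values V_t(s)

mse = np.sum(P * (v_pi - v_theta) ** 2)   # = 0.5*0.01 + 0.3*0.04 + 0.2*0.16 = 0.049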
Choosing P(s)
• One natural definition: the fraction of time spent in s while following the target policy π.
• In continuing tasks, this is the stationary distribution under π
• In episodic tasks, this depends on how the initial states of episodes are drawn
• The on-policy distribution: the distribution created while following the policy being evaluated. Stronger results are available for this distribution.
Linear Methods
Represent states as feature vectors; for each s ∈ S:
φ_s = ( φ_1, φ_2, …, φ_n )_s^T
V_t(s) = θ_t^T φ_s = Σ_{i=1}^{n} (θ_i)_t (φ_i)_s
∇_θ V_t(s) = ?
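For a linear approximator the gradient in the last line is simply the feature vector, ∇_θ V_t(s) = φ_s. A short sketch of this special case (names are illustrative):

import numpy as np

def v_linear(theta, phi_s):
    # V_t(s) = theta^T phi_s
    return np.dot(theta, phi_s)

def grad_v_linear(theta, phi_s):
    # The gradient of a linear V_t(s) with respect to theta is just phi_s
    return phi_s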
What are we learning? From what?
Update the weight vector:
θ_t = ( θ_1, θ_2, …, θ_n )_t^T
Assume V_t is a (sufficiently smooth) differentiable function of θ_t, for all s ∈ S.
Assume, for now, training examples of this form:
{ description of s_t , Vπ(s_t) }
Concept: Gradient Descent
Let f be any function of the parameter space. Its gradient at any point θ_t in this space is:
∇_θ f(θ_t) = ( ∂f(θ_t)/∂θ_1 , ∂f(θ_t)/∂θ_2 , … , ∂f(θ_t)/∂θ_n )^T
[Figure: contours of f over the (θ_1, θ_2) plane, with the current point θ_t = (θ_1, θ_2)_t^T]
Iteratively move down the gradient:
θ_{t+1} = θ_t − α ∇_θ f(θ_t)
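A tiny numerical illustration of the update rule above, using an arbitrary quadratic f chosen for the example:

import numpy as np

def f(theta):
    return np.sum((theta - 1.0) ** 2)        # minimised at theta = (1, 1)

def grad_f(theta):
    return 2.0 * (theta - 1.0)

theta = np.array([3.0, -2.0])
alpha = 0.1
for _ in range(100):
    theta = theta - alpha * grad_f(theta)    # theta_{t+1} = theta_t - alpha * grad f(theta_t)
# theta is now close to (1, 1)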
Gradient Descent for Weights – Basic Setup
For the MSE given earlier and using the chain rule:
θ_{t+1} = θ_t − ½ α ∇_θ MSE(θ_t)
        = θ_t − ½ α ∇_θ Σ_{s∈S} P(s) [ Vπ(s) − V_t(s) ]²
        = θ_t + α Σ_{s∈S} P(s) [ Vπ(s) − V_t(s) ] ∇_θ V_t(s)
Gradient Computation
In practice, we could just use the sample gradient (as we are acting based on the same distribution, P(s)):
θ_{t+1} = θ_t − ½ α ∇_θ [ Vπ(s_t) − V_t(s_t) ]²
        = θ_t + α [ Vπ(s_t) − V_t(s_t) ] ∇_θ V_t(s_t)
Since each sample gradient is an unbiased estimate of the true gradient,
E{ [ Vπ(s_t) − V_t(s_t) ] ∇_θ V_t(s_t) } = Σ_{s∈S} P(s) [ Vπ(s) − V_t(s) ] ∇_θ V_t(s),
this converges to a local minimum of the MSE if α decreases appropriately with t.
But We Don’t have these Targets
Suppose we just have targets v_t instead:
θ_{t+1} = θ_t + α [ v_t − V_t(s_t) ] ∇_θ V_t(s_t)
If each v_t is an unbiased estimate of Vπ(s_t), i.e., E{v_t} = Vπ(s_t), then gradient descent converges to a local minimum (provided α decreases appropriately).
e.g., the Monte Carlo target v_t = R_t:
θ_{t+1} = θ_t + α [ R_t − V_t(s_t) ] ∇_θ V_t(s_t)
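A sketch of gradient Monte Carlo prediction with a linear approximator, following the last equation; the episode-generation function is an assumed stand-in, not something defined in the slides:

import numpy as np

def gradient_mc_prediction(generate_episode, num_features, num_episodes,
                           alpha=0.01, gamma=1.0):
    # generate_episode() is assumed to return a list of (phi_s, reward) pairs,
    # where reward is the reward received after leaving that state.
    theta = np.zeros(num_features)
    for _ in range(num_episodes):
        episode = generate_episode()
        G = 0.0
        # Work backwards so G is the return R_t from each visited state onwards
        for phi_s, r in reversed(episode):
            G = r + gamma * G
            theta += alpha * (G - np.dot(theta, phi_s)) * phi_s
    return theta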
State aggregation is the simplest kind of
Value Function Approximation
• States are partitioned into disjoint subsets (groups)
• One component of θ is allocated to each group
Recall:
V_t(s) = θ_{group(s)}
∇_θ V_t(s) = ( 0, 0, …, 1, 0, 0, …, 0 )^T   (the 1 is in the component for s's group)
θ ← θ + α [ Target_t − V_t(s_t) ] ∇_θ V_t(s_t)
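With state aggregation the general update collapses to adjusting a single component of θ; a minimal sketch, where the group function is whatever partition of the states you choose:

import numpy as np

def aggregated_value(theta, group, s):
    # V_t(s) is just the weight of s's group
    return theta[group(s)]

def aggregated_update(theta, group, s, target, alpha=0.1):
    # theta <- theta + alpha * [target - V_t(s)] * gradient,
    # where the gradient is one-hot at group(s)
    g = group(s)
    theta[g] += alpha * (target - theta[g])
    return theta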
1000-state random walk example
• States are numbered 1 to 1000
• Walks start near the middle, at state 500
• At each step, jump to one of the 100 states to the right, or to one of the 100 states to the left
• If the jump goes beyond 1 or 1000, the walk terminates with a reward of −1 or +1 respectively
(otherwise r_t = 0)
State aggregation into 10 groups of 100
The whole value function over 1000 states will be approximated with 10 numbers!
VF Computed with Gradient MC
• 10 groups of 100 states
• after 100,000 episodes
• α = 2 × 10⁻⁵
• state distribution affects accuracy
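A sketch of the whole experiment (random walk plus gradient MC over 10 groups). It assumes jumps are uniform over the 100 neighbours on each side, which is the usual form of this example; the episode count and α are the values from the slide, so the pure-Python loop is slow:

import numpy as np

def random_walk_episode(rng, n_states=1000, start=500, jump=100):
    # Return a list of (state, reward) pairs for one 1000-state random walk
    episode, s = [], start
    while True:
        step = rng.integers(1, jump + 1) * rng.choice([-1, 1])
        s_next = s + step
        if s_next < 1:
            episode.append((s, -1.0)); return episode
        if s_next > n_states:
            episode.append((s, +1.0)); return episode
        episode.append((s, 0.0))
        s = s_next

def group(s, group_size=100):
    return (s - 1) // group_size           # states 1-100 -> group 0, ..., 901-1000 -> group 9

rng = np.random.default_rng(0)
theta = np.zeros(10)
alpha = 2e-5
for _ in range(100_000):
    G = 0.0
    for s, r in reversed(random_walk_episode(rng)):
        G = r + G                          # gamma = 1 in this episodic task
        theta[group(s)] += alpha * (G - theta[group(s)])
# theta now holds the 10 numbers approximating V over the 1000 states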
On Basis Functions: Coarse Coding
Many ways to achieve “coding”:
Shaping Generalization in Coarse Coding
Binary features defined by overlap between receptive fields.
Learning and Coarse Coding
Tile Coding
• Binary feature for each tile
• Number of features present at any one time is constant
• Binary features mean the weighted sum is easy to compute
• Easy to compute the indices of the features present
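A minimal sketch of grid-style tile coding for a 2D state, written from scratch rather than using any particular tile-coding library; the number of tilings, resolution and offsets are illustrative choices:

import numpy as np

def tile_indices(x, y, num_tilings=8, tiles_per_dim=10,
                 low=(0.0, 0.0), high=(1.0, 1.0)):
    # Each tiling is a uniform grid over the state space, offset by a fraction
    # of a tile width; one binary feature (tile) is active per tiling.
    indices = []
    tiles_per_tiling = (tiles_per_dim + 1) ** 2
    for t in range(num_tilings):
        offset = t / num_tilings                         # fractional offset of this tiling
        ix = int((x - low[0]) / (high[0] - low[0]) * tiles_per_dim + offset)
        iy = int((y - low[1]) / (high[1] - low[1]) * tiles_per_dim + offset)
        ix = min(max(ix, 0), tiles_per_dim)              # clip to keep indices in range
        iy = min(max(iy, 0), tiles_per_dim)
        indices.append(t * tiles_per_tiling + ix * (tiles_per_dim + 1) + iy)
    return indices

# The value estimate is then a sum of a few weights:
# V(s) = sum(theta[i] for i in tile_indices(x, y))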
Encoding 2D Space with Many Tiles
Generalizing with Uniformly Offset Tiles
Generalizing with Asymmetrically Offset Tiles
Tile Coding, Contd.
Irregular tilings
Hashing
Radial Basis Functions (RBFs)
e.g., Gaussians:
(φ_i)_s = exp( − ||s − c_i||² / (2σ_i²) )
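The same formula in code, for a 1D state and a handful of evenly spaced Gaussian centres; the centres and width are arbitrary choices for illustration:

import numpy as np

def rbf_features(s, centers=np.linspace(0.0, 1.0, 10), sigma=0.1):
    # phi_i(s) = exp(-|s - c_i|^2 / (2 sigma^2)) for each centre c_i
    return np.exp(-((s - centers) ** 2) / (2.0 * sigma ** 2))

phi = rbf_features(0.37)        # 10 smoothly varying features

Unlike tile coding, these features are graded rather than binary, so the resulting approximation varies smoothly with the state.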
Beating the "Curse of Dimensionality"
• Can you keep the number of features from going up exponentially with the dimension?
• Function complexity, not dimensionality, is the problem.
• Kanerva coding (see the sketch after this list):
– Select a set of binary prototypes
– Use Hamming distance as the distance measure
– Dimensionality is no longer a problem, only complexity
• "Lazy learning" schemes:
– Remember all the data
– To get a new value, find the nearest neighbours & interpolate
– e.g., (nonparametric) locally-weighted regression
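A sketch of the Kanerva-coding idea: binary features that fire when the (binary) state is within a Hamming-distance threshold of a prototype. The prototypes, state width and threshold here are random placeholders:

import numpy as np

rng = np.random.default_rng(0)
prototypes = rng.integers(0, 2, size=(50, 32))   # 50 random 32-bit binary prototypes

def kanerva_features(state_bits, threshold=12):
    # Feature i is active if the Hamming distance to prototype i is <= threshold
    hamming = np.sum(prototypes != state_bits, axis=1)
    return (hamming <= threshold).astype(float)

phi = kanerva_features(rng.integers(0, 2, size=32))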
Going from Value Prediction to GPI
• So far, we've only discussed policy evaluation, where the value function is represented as an approximated function
• In order to extend this to a GPI-like setting:
1. First, we need to use action-value functions
2. Combine that with the policy improvement and action selection steps
Gradient Descent Update for Action-Value Function Prediction
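A sketch of the update, by direct analogy with the state-value case above (the Sarsa(0)-style target in the last line is one illustrative choice, not taken from the slide):

θ_{t+1} = θ_t + α [ v_t − Q_t(s_t, a_t) ] ∇_θ Q_t(s_t, a_t)

e.g., with the one-step target v_t = r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}).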
How to Plug in Policy Improvement or Action Selection?
• If spaces are very large, or continuous, this is an active research topic
• For manageable discrete spaces:
– For each action, a, available at a state, s_t, compute Q_t(s_t, a) and find the greedy action according to it
– Then, one could use this as part of an ε-greedy action selection or as the estimation policy in off-policy methods, as sketched below
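A sketch of the second bullet for a linear action-value approximator; all function and variable names are illustrative, and phi(s, a) is assumed to return the feature vector of a state-action pair:

import numpy as np

def epsilon_greedy_action(theta, phi, s, actions, epsilon=0.1, rng=None):
    # Greedy with respect to Q_t(s, a) = theta . phi(s, a),
    # but pick a random action with probability epsilon
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsilon:
        return actions[rng.integers(len(actions))]
    q_values = [np.dot(theta, phi(s, a)) for a in actions]
    return actions[int(np.argmax(q_values))]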
Example: Mountain-car Task
• Drive an underpowered car up a steep mountain road
• Gravity is stronger than the engine (like in the cart-pole example)
• Example of a continuous control task where the system must move away from the goal first, then converge to the goal
• Reward of −1 until the car 'escapes'
• Actions: +τ, −τ, 0
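One common formulation of the dynamics (the constants below follow the standard textbook version of the task and should be treated as assumptions, not as part of the slide):

import math

def mountain_car_step(position, velocity, action):
    # action is -1, 0 or +1 (reverse throttle, coast, forward throttle)
    velocity += 0.001 * action - 0.0025 * math.cos(3 * position)
    velocity = min(max(velocity, -0.07), 0.07)
    position += velocity
    if position < -1.2:                 # hit the left wall: stop the car
        position, velocity = -1.2, 0.0
    done = position >= 0.5              # reached the goal at the top of the hill
    reward = -1.0                       # -1 per step until the car escapes
    return position, velocity, reward, done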
Mountain-car Example:
Cost-to-go Function (SARSA solution)
Mountain Car Solution with RBFs
[Computed by M. Kretchmar]