Reinforcement Learning
Policy Optimization and Planning
(Material not examinable)
Subramanian Ramamoorthy
School of Informatics
31 March, 2017
Plan for Lecture: Policies and Plans
• Policy Optimization
  – Policies can be optimized directly, without learning value functions
  – Policy-gradient methods
  – Special case: how could we learn with real-valued (continuous) actions
• Planning
  – Uses of “environment models”
  – Integration of planning, learning, and execution
  – “Model-based reinforcement learning”
Policy-gradient methods
(Note: slightly different notation in this section, following the 2nd ed. of S&B)
Approaches to control
1. Previous approach: Action-value methods:
   – learn the value of each (state-)action;
   – pick the max, usually
2. New approach: Policy-gradient methods:
   – learn the parameters of a stochastic policy
   – update by gradient ascent in performance
   – includes actor-critic methods, which learn both value and policy parameters
Actor-critic architecture
[Figure: actor-critic architecture, with the actor (policy) and critic (value function) interacting with the world]
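Since the figure itself did not survive extraction, here is a minimal sketch of the one-step actor-critic update the architecture implements, in the notation of S&B 2nd ed. (the step sizes α_w and α_θ are notation assumed here, not taken from the slide):

δ_t = R_{t+1} + γ v̂(S_{t+1}, w) − v̂(S_t, w)      (critic: TD error)
w ← w + α_w δ_t ∇_w v̂(S_t, w)                     (critic: value-weight update)
θ ← θ + α_θ δ_t ∇_θ ln π(A_t | S_t, θ)            (actor: policy-weight update)

The critic evaluates what the actor does; the sign and size of δ_t tell the actor whether the action just taken turned out better or worse than expected.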
Why Approximate Policies rather than
Values?
• In many problems, the policy is simpler to approximate than the value function
• In many problems, the optimal policy is stochastic
  – e.g., bluffing, POMDPs
• To enable smoother change in policies
• To avoid a search on every step (the max)
• To better relate to biology
Policy Approximation
• Policy = a function from state to action
  – How does the agent select actions?
  – In such a way that it can be affected by learning?
  – In such a way as to assure exploration?
• Approximation: there are too many states and/or actions to represent all policies
  – To handle large/continuous action spaces
Gradient Bandit Algorithm
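The algorithm box did not survive extraction; below is a minimal sketch of the gradient bandit algorithm as described in S&B 2nd ed. The function name and the bandit(a) reward interface are assumptions of the sketch; the preference update with a running average-reward baseline is the standard one.

```python
import numpy as np

def gradient_bandit(bandit, n_arms, alpha=0.1, steps=1000, rng=None):
    """Gradient bandit: softmax over learned action preferences H.

    `bandit(a)` is assumed to return a sampled reward for arm `a`.
    """
    rng = rng or np.random.default_rng()
    H = np.zeros(n_arms)              # action preferences
    baseline = 0.0                    # running average reward (baseline)
    for t in range(1, steps + 1):
        pi = np.exp(H - H.max())
        pi /= pi.sum()                # softmax action probabilities
        a = rng.choice(n_arms, p=pi)
        r = bandit(a)
        baseline += (r - baseline) / t
        # stochastic gradient ascent on expected reward: raise the preference
        # of the chosen arm when r beats the baseline, lower the others
        H += alpha * (r - baseline) * (np.eye(n_arms)[a] - pi)
    return H, pi
```

In the terminology of the next slides, this is already a policy-gradient method: the preferences parameterize a softmax policy and the update is stochastic gradient ascent on expected reward.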
Core Principle: Policy Gradient Methods
• Parameterized policy selects actions without consulting a value function
• A value function can still be used to learn the policy weights
  – But it is not needed for action selection
• Gradient ascent on a performance measure η(θ) with respect to the policy weights
θ_{t+1} = θ_t + α ∇η(θ_t)
(in practice the update uses a stochastic estimate whose expectation approximates the gradient, hence “policy gradient”)
Linear-exponential policies (discrete actions)
(Slide annotation: a factor that modulates the TD update, going beyond TD(0) to TD(λ))
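The formulas for this slide were lost in extraction; the standard linear-exponential (softmax) parameterization, with x(s, a) a feature vector for the state-action pair and θ the policy weights, is

π(a | s, θ) = exp(θ⊤x(s, a)) / Σ_b exp(θ⊤x(s, b))

and its eligibility (score) vector, which appears in all of the gradient updates below, is

∇_θ ln π(a | s, θ) = x(s, a) − Σ_b π(b | s, θ) x(s, b).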
e.g., linear-Gaussian policies (continuous actions)
[Figure: Gaussian action-probability density over a continuous action, with 𝜇 and 𝜎 linear in the state]
e.g., linear-Gaussian policies (continuous actions)
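The equations on this slide were also lost in extraction. A common parameterization (the one in S&B 2nd ed.) draws the action from a Gaussian whose mean and standard deviation depend on state features x(s):

π(a | s, θ) = (1 / (σ(s, θ)√(2π))) exp(−(a − μ(s, θ))² / (2σ(s, θ)²)),   with μ(s, θ) = θ_μ⊤x(s)

The slide states that σ is linear in the state; S&B use σ(s, θ) = exp(θ_σ⊤x(s)) so that it stays positive, and the eligibility functions on the next slide assume that exponential form.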
Gaussian eligibility functions
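A sketch of the standard expressions, assuming the exponential parameterization of σ noted above:

∇_{θ_μ} ln π(a | s, θ) = ((a − μ(s, θ)) / σ(s, θ)²) x(s)
∇_{θ_σ} ln π(a | s, θ) = ((a − μ(s, θ))² / σ(s, θ)² − 1) x(s)

These play the same role for continuous actions as the softmax eligibility vector does for discrete actions.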
Policy Gradient Setup
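The setup equations did not survive extraction. In outline (episodic case, S&B 2nd ed.): the performance measure is the value of the start state, η(θ) = v_{π_θ}(s₀), and the policy gradient theorem gives

∇η(θ) ∝ Σ_s d_π(s) Σ_a q_π(s, a) ∇_θ π(a | s, θ)

where d_π is the on-policy state distribution. The crucial point is that the gradient involves no derivative of the state distribution, so it can be estimated from sampled experience.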
REINFORCE: Monte-Carlo Policy Gradient,
from Policy Gradient Theorem
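The algorithm box is missing from the extracted slide; below is a minimal sketch of episodic REINFORCE using the linear-exponential policy from a few slides back. The env.reset()/env.step() interface and the features(s, a) function are assumptions of the sketch, not part of the original slides.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce(env, features, n_features, n_actions, alpha=0.01,
              gamma=1.0, episodes=1000, rng=None):
    """Monte-Carlo policy gradient (REINFORCE) with a linear-softmax policy.

    Assumed interfaces: env.reset() -> state, env.step(a) -> (state, reward, done),
    features(s, a) -> feature vector x(s, a) of length n_features.
    """
    rng = rng or np.random.default_rng()
    theta = np.zeros(n_features)
    for _ in range(episodes):
        # --- generate one episode under the current policy ---
        s, done, traj = env.reset(), False, []
        while not done:
            prefs = np.array([theta @ features(s, a) for a in range(n_actions)])
            a = rng.choice(n_actions, p=softmax(prefs))
            s_next, r, done = env.step(a)
            traj.append((s, a, r))
            s = s_next
        # --- Monte-Carlo returns and gradient-ascent updates ---
        G = 0.0
        for t in reversed(range(len(traj))):
            s, a, r = traj[t]
            G = r + gamma * G                      # return from time t
            x = np.array([features(s, b) for b in range(n_actions)])
            pi = softmax(x @ theta)                # probabilities under current theta
            grad_log_pi = x[a] - pi @ x            # softmax eligibility vector
            theta += alpha * (gamma ** t) * G * grad_log_pi
    return theta
```

Each visited time step gets the update θ ← θ + α γᵗ G_t ∇_θ ln π(A_t | S_t, θ); subtracting a learned baseline (the critic of the actor-critic slide) from G_t reduces variance without adding bias.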
The generality of the
policy-gradient strategy
• Can be applied whenever we can compute the effect of parameter changes on the action probabilities,
  e.g., it has been applied to spiking neuron models
• There are many possibilities other than linear-exponential and linear-Gaussian
  – e.g., a mixture of random, argmax, and fixed-width Gaussian; learn the mixing weights; drift/diffusion models
Planning
Paths to a Policy
Schematic
Models
• Model: anything the agent can use to predict how the
environment will respond to its actions
• Distribution model: description of all possibilities and their probabilities
  – e.g., P^a_{ss′} and R^a_{ss′} for all s, s′, and a ∈ A(s)
• Sample model: produces sample experiences
  – e.g., a simulation model
• Both types of models can be used to produce simulated experience
• Often sample models are much easier to come by
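To make the distinction concrete, a toy sketch (the MDP and its numbers are invented purely for illustration):

```python
import random

# Distribution model: full transition probabilities and expected rewards
# for a hypothetical 2-state, 2-action MDP.
P = {  # P[s][a] = list of (next_state, probability)
    0: {0: [(0, 0.9), (1, 0.1)], 1: [(1, 1.0)]},
    1: {0: [(0, 1.0)],           1: [(0, 0.5), (1, 0.5)]},
}
R = {  # R[s][a] = expected reward
    0: {0: 0.0, 1: 1.0},
    1: {0: 0.0, 1: 2.0},
}

# Sample model: only needs to produce one plausible outcome per query,
# as a simulator would; often much easier to build than P and R above.
def sample_model(s, a):
    next_states, probs = zip(*P[s][a])
    s_next = random.choices(next_states, weights=probs, k=1)[0]
    return R[s][a], s_next
```

A distribution model supports exact (DP-style) backups; a sample model supports only sample backups, which is all that Q-planning and Dyna below require.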
Planning
[Diagrams: model → (planning) → policy; expanded as model → simulated experience → backups → values → policy]
• Planning: any computational process that uses a model to
create or improve a policy
• Planning in AI:
– state-space planning
– plan-space planning (e.g., partial-order planner)
• We take the following (unusual) view:
– all state-space planning methods involve computing
value functions, either explicitly or implicitly
– they all apply backups to simulated experience
Planning Cont.
Random-Sample One-Step Tabular Q-Planning
• Classical DP methods are state-space planning methods
• Heuristic search methods are state-space planning methods
• A planning method based on Q-learning:
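The algorithm box for random-sample one-step tabular Q-planning is missing from the extraction; a minimal sketch, assuming a sample_model(s, a) that returns a sampled (reward, next_state) as above:

```python
import random
from collections import defaultdict

def random_sample_q_planning(sample_model, states, actions,
                             alpha=0.1, gamma=0.95, n_updates=10_000):
    """Random-sample one-step tabular Q-planning (sketch)."""
    Q = defaultdict(float)
    for _ in range(n_updates):
        s = random.choice(states)          # 1. pick a state-action pair at random
        a = random.choice(actions)
        r, s_next = sample_model(s, a)     # 2. query the sample model
        # 3. one-step tabular Q-learning backup on the simulated transition
        target = r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q
```

No real experience is used anywhere: the whole loop runs on simulated experience, which is exactly what makes it a planning method in the sense defined above.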
Paths to a Policy: Dyna
Learning, Planning, and Acting
[Diagram: experience → (direct RL) → value/policy; experience → (model learning) → model → (planning) → value/policy; value/policy → (acting) → experience]
• Two uses of real experience:
– model learning: to
improve the model
– direct RL: to directly
improve the value
function and policy
• Improving value function
and/or policy via a model is
sometimes called indirect
RL or model-based RL.
Here, we call it planning.
Direct vs. Indirect RL
• Indirect methods:
– make fuller use of
experience: get better
policy with fewer
environment interactions
• Direct methods
– simpler
– not affected by bad
models
But they are very closely related and can be usefully combined:
planning, acting, model learning, and direct RL can occur simultaneously
and in parallel
The Dyna Architecture (Sutton 1990)
[Diagram: the Environment provides real experience, which drives direct RL updates of the policy/value functions and model learning; the Model produces simulated experience for planning updates, with search control selecting which simulated experience to back up]
The Dyna-Q Algorithm
[Algorithm box, with its steps annotated as: direct RL, model learning, and planning]
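The pseudocode box is not in the extracted text; a minimal tabular Dyna-Q sketch, assuming an env.reset()/env.step(a) interface returning (state, reward, done) and a deterministic environment (both are assumptions of the sketch):

```python
import random
from collections import defaultdict

def dyna_q(env, actions, alpha=0.1, gamma=0.95, epsilon=0.1,
           n_planning=50, episodes=100):
    """Tabular Dyna-Q: direct RL + model learning + planning on each step."""
    Q = defaultdict(float)
    model = {}                                    # (s, a) -> (r, s')

    def epsilon_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = epsilon_greedy(s)
            s_next, r, done = env.step(a)
            # (a) direct RL: one-step Q-learning on the real transition
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in actions)
                                  - Q[(s, a)])
            # (b) model learning: remember the observed outcome
            #     (a deterministic model, as in the maze examples below)
            model[(s, a)] = (r, s_next)
            # (c) planning: n extra Q-learning backups on simulated transitions
            for _ in range(n_planning):
                (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
                Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps_next, b)] for b in actions)
                                        - Q[(ps, pa)])
            s = s_next
    return Q
```

With n_planning = 0 this reduces to plain one-step Q-learning, which is the N = 0 case in the maze snapshots below.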
Dyna-Q on a Simple Maze
Rewards are 0 on all steps until the goal is reached, when the reward is 1
Dyna-Q Snapshots: Midway in 2nd Episode
[Figure: policies midway through the second episode, without planning (N = 0) and with planning (N = 50); S marks the start state and G the goal]
When the Model is Wrong: Blocking Maze
[Figure: cumulative reward vs. time steps (0–3000) for Dyna-Q+, Dyna-Q, and Dyna-AC on the blocking maze; S marks the start and G the goal, before and after the wall changes]
The changed environment is harder
Shortcut Maze
[Figure: cumulative reward vs. time steps (0–6000) for Dyna-Q+, Dyna-Q, and Dyna-AC on the shortcut maze; S marks the start and G the goal, before and after the shortcut opens]
The changed environment is easier
What is Dyna-Q+?
• Uses an “exploration bonus”:
  – Keeps track of the time since each state-action pair was last tried for real
  – An extra reward is added to simulated transitions, increasing with how long ago the state-action pair was last tried: the longer unvisited, the larger the bonus for visiting
  – The agent therefore actually “plans” how to visit long-unvisited states
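Concretely (following S&B 2nd ed.), if a simulated transition from (s, a) yields reward r and the pair was last tried in the real environment τ time steps ago, the planning backup uses the bonus-augmented reward

r + κ √τ

for some small κ > 0. This is what lets Dyna-Q+ discover the new shortcut in the experiment above, at the cost of occasionally re-testing transitions it already knows.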