
Reinforcement Learning

Policy Optimization and Planning
(Material not examinable)

Subramanian Ramamoorthy
School of Informatics

31 March, 2017

Plan for Lecture: Policies and Plans
•  Policy Optimization

–  Policies can be optimized directly, without learning value functions

–  Policy-gradient methods

–  Special case: how could we learn with real-valued (continuous) actions

•  Planning
–  Uses of “environment models”
–  Integration of planning, learning, and execution
–  “Model-based reinforcement learning”
–  “Model-based reinforcement learning”


Policy-gradient methods
(Note: slightly different notation in this section,

following 2nd ed. of S+B)

Approaches to control

1.  Previous approach: Action-value methods:

–  learn the value of each state-action pair;

–  pick the max, usually

2.  New approach: Policy-gradient methods:

–  learn the parameters of a stochastic policy

–  update by gradient ascent in performance

–  includes actor-critic methods, which learn both value and policy parameters


Actor-critic architecture

[Figure: actor-critic architecture, with an actor (policy) and a critic (value function) interacting with the world.]

Why Approximate Policies rather than
Values?

•  In many problems, the policy is simpler to approximate than the value function

•  In many problems, the optimal policy is stochastic

–  e.g., bluffing, POMDPs

•  To enable smoother change in policies

•  To avoid a search on every step (the max)

•  To better relate to biology


Policy Approximation

•  Policy = a function from state to action

–  How does the agent select actions?

–  In such a way that it can be affected by learning?

–  In such a way as to assure exploration?

•  Approximation: there are too many states and/or actions to represent all policies

–  To handle large/continuous action spaces


Gradient Bandit Algorithm
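As a reminder of the gradient-bandit idea (following S+B's bandit chapter): each action $a$ has a preference $H_t(a)$, actions are chosen with softmax probabilities $\pi_t(a) = e^{H_t(a)} / \sum_b e^{H_t(b)}$, and after taking action $A_t$ and receiving reward $R_t$ the preferences are updated by stochastic gradient ascent on expected reward:

$$H_{t+1}(a) = H_t(a) + \alpha\,(R_t - \bar{R}_t)\,\big(\mathbb{1}[a = A_t] - \pi_t(a)\big),$$

where $\bar{R}_t$ is the average of the rewards received so far, serving as a baseline.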


Core Principle: Policy Gradient Methods

•  Parameterized policy selects actions without consulting a value function

•  VF can still be used to learn the policy weights
–  But not needed for action selection

•  Gradient ascent on a performance measure η(θ) with respect to policy weights


$$\theta_{t+1} = \theta_t + \alpha\,\widehat{\nabla\eta(\theta_t)}$$

where $\widehat{\nabla\eta(\theta_t)}$ is a stochastic estimate whose expectation approximates the gradient of the performance measure (hence “policy gradient”).

Linear-exponential policies (discrete actions)


(Annotation: a factor to modulate the TD update, going beyond TD(0) to TD(λ).)
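A standard form for such a linear-exponential (softmax) policy, as in S+B (the action-feature vector $\phi(s,a)$ is notation assumed for this sketch):

$$\pi(a \mid s, \theta) = \frac{\exp\big(\theta^{\top}\phi(s,a)\big)}{\sum_{b}\exp\big(\theta^{\top}\phi(s,b)\big)},
\qquad
\nabla_{\theta}\ln\pi(a \mid s, \theta) = \phi(s,a) - \sum_{b}\pi(b \mid s, \theta)\,\phi(s,b).$$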

e.g., linear-gaussian policies (continuous actions)

[Figure: Gaussian probability density over the continuous action (action on the horizontal axis, action probability density on the vertical axis), with 𝜇 and 𝜎 linear in the state.]


e.g., linear-gaussian policies (continuous actions)
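A sketch of the usual parameterization (following S+B; the state-feature vectors $x_\mu(s)$ and $x_\sigma(s)$ are assumptions of this sketch, and taking $\sigma$ as the exponential of a linear function is one common way to keep it positive):

$$\pi(a \mid s, \theta) = \frac{1}{\sigma(s,\theta)\sqrt{2\pi}}\exp\!\left(-\frac{\big(a-\mu(s,\theta)\big)^{2}}{2\,\sigma(s,\theta)^{2}}\right),
\qquad
\mu(s,\theta) = \theta_{\mu}^{\top}x_{\mu}(s), \quad
\sigma(s,\theta) = \exp\!\big(\theta_{\sigma}^{\top}x_{\sigma}(s)\big).$$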


Gaussian eligibility functions
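Under the Gaussian parameterization sketched above, the eligibility (score) vectors take the standard form:

$$\nabla_{\theta_{\mu}}\ln\pi(a \mid s, \theta) = \frac{a-\mu(s,\theta)}{\sigma(s,\theta)^{2}}\,x_{\mu}(s),
\qquad
\nabla_{\theta_{\sigma}}\ln\pi(a \mid s, \theta) = \left(\frac{\big(a-\mu(s,\theta)\big)^{2}}{\sigma(s,\theta)^{2}} - 1\right)x_{\sigma}(s).$$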


Policy Gradient Setup
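For reference, following S+B's episodic setup, the performance measure can be taken to be the value of the start state, $\eta(\theta) = v_{\pi_\theta}(s_0)$; the policy gradient theorem then states

$$\nabla\eta(\theta) \propto \sum_{s} d_{\pi}(s)\sum_{a} q_{\pi}(s,a)\,\nabla_{\theta}\pi(a \mid s, \theta),$$

where $d_\pi$ is the on-policy state distribution and $q_\pi$ the action-value function under the current policy.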


REINFORCE: Monte-Carlo Policy Gradient,
from Policy Gradient Theorem
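The REINFORCE update derived from the policy gradient theorem is $\theta_{t+1} = \theta_t + \alpha\,\gamma^{t} G_t\,\nabla_{\theta}\ln\pi(A_t \mid S_t, \theta_t)$, with $G_t$ the return from time $t$. Below is a minimal Python sketch for a linear-softmax policy; the environment interface (env.reset/env.step returning (next_state, reward, done)) and the feature function are illustrative assumptions, not part of the lecture.

```python
import numpy as np

# Sketch of episodic REINFORCE (Monte-Carlo policy gradient) with a
# linear-softmax policy. The environment interface and feature map are
# illustrative assumptions.

def softmax_probs(theta, features):
    """features: array of shape (n_actions, n_features)."""
    prefs = features @ theta                      # linear action preferences
    prefs -= prefs.max()                          # numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def reinforce_episode(env, theta, feature_fn, alpha=0.01, gamma=1.0):
    """Run one episode, then apply the REINFORCE updates
    theta <- theta + alpha * gamma^t * G_t * grad log pi(A_t | S_t, theta)."""
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:
        feats = feature_fn(s)                     # shape (n_actions, n_features)
        probs = softmax_probs(theta, feats)
        a = np.random.choice(len(probs), p=probs)
        s_next, r, done = env.step(a)
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next

    # compute returns G_t backwards through the episode
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()

    for t, (s, a, G_t) in enumerate(zip(states, actions, returns)):
        feats = feature_fn(s)
        probs = softmax_probs(theta, feats)
        grad_log_pi = feats[a] - probs @ feats    # score function for softmax
        theta += alpha * (gamma ** t) * G_t * grad_log_pi
    return theta
```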


The generality of the
policy-gradient strategy

•  Can be applied whenever we can compute the effect of parameter changes on the action probabilities,

e.g., has been applied to spiking neuron models

•  There are many possibilities other than linear-exponential and linear-gaussian

–  e.g., mixture of random, argmax, and fixed-width Gaussian; learn the mixing weights; drift/diffusion models


Planning

Paths to a Policy


Schematic


Models

•  Model: anything the agent can use to predict how the
environment will respond to its actions

•  Distribution model: description of all possibilities and their probabilities

–  e.g., $P^{a}_{ss'}$ and $R^{a}_{ss'}$ for all $s$, $s'$, and $a \in \mathcal{A}(s)$

•  Sample model: produces sample experiences

–  e.g., a simulation model

•  Both types of models can be used to produce simulated experience

•  Often sample models are much easier to come by
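To illustrate the distinction (a hypothetical sketch, not from the slides): a distribution model can be stored as explicit transition probabilities, and a sample model can be obtained from it (or from a simulator) by drawing samples.

```python
import random

# Hypothetical sketch: a tabular distribution model and a sample model derived from it.
# distribution_model[(state, action)] = list of (next_state, reward, probability) triples.
distribution_model = {
    ("s0", "a0"): [("s1", 0.0, 0.8), ("s0", 0.0, 0.2)],
    ("s1", "a0"): [("goal", 1.0, 1.0)],
}

def sample_model(state, action):
    """Sample model: draw one (next_state, reward) according to the stored probabilities."""
    outcomes = distribution_model[(state, action)]
    next_states_rewards = [(ns, r) for ns, r, _ in outcomes]
    probs = [p for _, _, p in outcomes]
    return random.choices(next_states_rewards, weights=probs, k=1)[0]
```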


Planning

[Diagram: model → simulated experience → backups → values → policy; in short, planning takes a model and produces a policy.]

•  Planning: any computational process that uses a model to create or improve a policy

•  Planning in AI:
–  state-space planning
–  plan-space planning (e.g., partial-order planner)

•  We take the following (unusual) view:
–  all state-space planning methods involve computing value functions, either explicitly or implicitly
–  they all apply backups to simulated experience


Planning Cont.

Random-Sample One-Step Tabular Q-Planning

•  Classical DP methods are state-space planning methods
•  Heuristic search methods are state-space planning methods
•  A planning method based on Q-learning:
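A Python sketch of this procedure (after the description in S+B; the sample_model function, returning a sampled (next_state, reward) pair, and the finite state and action lists are assumptions of the sketch):

```python
import random
from collections import defaultdict

# Sketch of random-sample one-step tabular Q-planning.
# sample_model(s, a) is assumed to return a sampled (next_state, reward) pair.

def q_planning(sample_model, states, actions, alpha=0.1, gamma=0.95, n_updates=10000):
    Q = defaultdict(float)                              # tabular Q[(s, a)], initialized to 0
    for _ in range(n_updates):
        s = random.choice(states)                       # 1. pick a state and action at random
        a = random.choice(actions)
        s_next, r = sample_model(s, a)                  # 2. query the sample model
        best_next = max(Q[(s_next, b)] for b in actions)
        # 3. one-step tabular Q-learning backup on the simulated transition
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q
```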


Paths to a Policy: Dyna


Learning, Planning, and Acting

[Diagram: acting with the current value function/policy produces real experience; experience drives direct RL updates to the value function/policy and feeds model learning; planning uses the learned model to further update the value function/policy.]

•  Two uses of real experience:
–  model learning: to improve the model
–  direct RL: to directly improve the value function and policy

•  Improving the value function and/or policy via a model is sometimes called indirect RL or model-based RL. Here, we call it planning.


Direct vs. Indirect RL

•  Indirect methods:
–  make fuller use of experience: get better policy with fewer environment interactions

•  Direct methods:
–  simpler
–  not affected by bad models

But they are very closely related and can be usefully combined: planning, acting, model learning, and direct RL can occur simultaneously and in parallel


The Dyna Architecture (Sutton 1990)

[Figure: the Dyna architecture. Real experience from the environment drives direct RL updates to the policy/value functions and feeds model learning; the model, queried under search control, generates simulated experience that drives planning updates to the same policy/value functions.]


The Dyna-Q Algorithm

The algorithm interleaves three kinds of update on each step: direct RL, model learning, and planning (see the sketch below).
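A minimal tabular Dyna-Q sketch follows (a hedged reconstruction in the spirit of S+B's algorithm box; the environment interface, the epsilon-greedy helper, and the deterministic-model assumption are illustrative choices of this sketch):

```python
import random
from collections import defaultdict

# Minimal tabular Dyna-Q sketch. Assumes a deterministic environment, so the
# model stores the single observed (reward, next_state) for each visited
# (state, action). env.reset()/env.step() are assumed interfaces.

def dyna_q(env, actions, episodes=50, n_planning=50,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = defaultdict(float)                 # action values Q[(s, a)]
    model = {}                             # model[(s, a)] = (reward, next_state)

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = eps_greedy(s)
            s_next, r, done = env.step(a)

            # (a) direct RL: one-step Q-learning on the real transition
            best_next = max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

            # (b) model learning: remember what this transition did
            model[(s, a)] = (r, s_next)

            # (c) planning: n simulated one-step updates from the model
            for _ in range(n_planning):
                (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
                best_sim = max(Q[(ps_next, b)] for b in actions)
                Q[(ps, pa)] += alpha * (pr + gamma * best_sim - Q[(ps, pa)])

            s = s_next
    return Q
```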


Dyna-Q on a Simple Maze

rewards are 0 until the goal is reached, when the reward is 1


Dyna-Q Snapshots: Midway in 2nd Episode

[Figure: the policies found midway through the second episode, without planning (N=0) and with planning (N=50); S marks the start state and G the goal.]


When the Model is Wrong: Blocking Maze

[Figure: the blocking maze, in which the environment changes so that the original path to the goal is blocked; the changed environment is harder. Plot of cumulative reward over 3000 time steps for Dyna-Q+, Dyna-Q, and Dyna-AC; S and G mark start and goal.]


Shortcut Maze

[Figure: the shortcut maze, in which the environment changes so that a shorter path to the goal opens up; the changed environment is easier. Plot of cumulative reward over 6000 time steps for Dyna-Q+, Dyna-Q, and Dyna-AC; S and G mark start and goal.]


What is Dyna-Q+?

•  Uses an “exploration bonus”:
–  Keeps track of the time since each state-action pair was tried for real
–  An extra reward is added for transitions, related to how long ago the state-action pair was tried: the longer unvisited, the more reward for visiting
–  The agent actually “plans” how to visit long-unvisited states
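For concreteness, S+B describe the bonus as follows: if the modelled reward for a transition is $r$ and the corresponding state-action pair has not been tried in $\tau$ real time steps, planning backups use the reward $r + \kappa\sqrt{\tau}$ for some small constant $\kappa$.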
