
Reinforcement Learning

Abstraction and Hierarchy in RL

Subramanian Ramamoorthy
School of Informatics

3 March, 2017


On the Organization of Behaviour

Consider day-to-day tasks such as opening a bottle or picking up
your coat from the stand
–  How would you tell someone how to perform this?
–  How do you actually get the job done? Using what?


An old idea:
sequential behavior cannot be understood as a chain
of stimulus–response associations… behavior displays
hierarchical structure, comprising nested subroutines.

– Karl Lashley (1951)

On the Complexity of RL


•  RL agents must learn by exploring the environment,
trying out different actions in different states
•  Complexity grows quickly; how do we get around this?

•  Balance exploration/exploitation
•  Abstractions

Example: Instrumental Hierarchy


[Source: M. Botvinick, Hierarchical models of behavior and prefrontal function, Trends in Cognitive Sciences 12(5), 2008]

In Computational Terms, Why Hierarchy?

•  Knowledge transfer/injection
–  Concise encoding of temporally extended actions enables
transfer of that knowledge to a new task needing the same

•  Biases exploration
–  Rather than search only at the finest resolution, the algorithm
can compute the value of alternatives at a larger scale

•  Faster solutions (even if the model is known)
–  Following well-understood principles of computer science,
hierarchy enables modularity of policies and models


An Early Idea: Reward Shaping

•  The robots’ objective is to
collectively find pucks and
bring them home.

•  Represent the 12-dim
environment by state
variables (features?):
–  have-puck?
–  at-home?
–  near-intruder?

•  What should the immediate
reward function be?


[Source: M. Mataric, Reward Functions for Accelerated Learning, ICML 1994]

Reward Shaping

•  If a reward is given only when a robot drops a puck at home, learning
will be extremely difficult.
–  The delay between the action and the reward is large.

•  Solution: Reward shaping (intermediate rewards).
–  Add rewards/penalties for achieving sub-goals/making errors:

•  subgoal: grasped-puck
•  subgoal: dropped-puck-at-home
•  error: dropped-puck-away-from-home

•  Add progress estimators:
–  Intruder-avoiding progress function
–  Homing progress function

•  Adding intermediate rewards will potentially allow RL to handle more
complex problems (see the sketch below).
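As a concrete illustration (not from Mataric's paper), the sketch below adds hypothetical subgoal bonuses, error penalties, and progress-estimator terms to a sparse base reward; all feature names and weights are assumptions made for the example.

```python
def shaped_reward(prev_state, state, base_reward):
    """Sparse task reward plus hypothetical shaping terms for the puck-collection task."""
    r = base_reward                                   # e.g. given only for dropping a puck at home
    # Subgoal bonuses and error penalties (weights are made up)
    if state["have_puck"] and not prev_state["have_puck"]:
        r += 1.0                                      # subgoal: grasped-puck
    if prev_state["have_puck"] and not state["have_puck"]:
        r += 3.0 if state["at_home"] else -1.0        # dropped-at-home vs. dropped-away-from-home
    # Progress estimators: reward the *change* in a progress measure
    r += 0.5 * (prev_state["dist_to_home"] - state["dist_to_home"])          # homing progress
    r += 0.5 * (state["dist_to_intruder"] - prev_state["dist_to_intruder"])  # intruder-avoiding progress
    return r
```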


Reward Shaping Results

Percentage of policy learnt after 15 minutes:


Another Early Idea: Hierarchical Algorithms
– Gating Mechanisms

Hierarchical Learning (especially popular with neural network models)
• Learn the gating function
• Learn the individual behaviours
• Learn both simultaneously

(Figure: a gating architecture in which a gate g selects among behaviours bᵢ; this can be a multi-level hierarchy.)
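A minimal sketch of the gating idea, assuming a discrete set of behaviours and a soft (softmax) gate over state features; the parameterisation is illustrative rather than taken from any particular model.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

class GatedController:
    """A gate g chooses among behaviours b_i; here g is a linear-softmax over state features."""
    def __init__(self, behaviours, n_features, seed=0):
        self.behaviours = behaviours                      # list of callables: state -> action
        rng = np.random.default_rng(seed)
        self.W = 0.01 * rng.standard_normal((len(behaviours), n_features))  # gate parameters (to be learned)

    def act(self, features, state):
        probs = softmax(self.W @ features)                # gating distribution over behaviours
        i = np.random.choice(len(self.behaviours), p=probs)
        return self.behaviours[i](state)                  # run the selected behaviour for this step
```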


Temporal Abstraction

•  What’s the issue?
–  Want “macro” actions (multiple time steps)
–  Advantages:

•  Avoid dealing with (exploring/computing values for)
less desirable states

•  Reuse experience across problems/regions

•  What’s not obvious
–  Dealing with the Markov assumption
–  Getting the calculations right (e.g., stability and
convergence)


Semi-Markov Decision Processes


Semi-Markov Decision Processes

•  A generalization of MDPs:
the amount of time between one decision and the next is a
random variable (either real or integer valued)

•  Treat the system as remaining in each state for a random
waiting time
–  after which, the transition to the next state is instantaneous

•  Real-valued case: continuous time, discrete events
•  Discrete case: decisions are made only at integer multiples of an
underlying time step


Semi-Markov Decision Processes

•  An SMDP is defined in terms of
P(s′,τ | s,a): transition probability (τ is the waiting time)
R(s,a), or just r: reward, the amount expected to accumulate
during the waiting time τ for a particular state and action

•  The Bellman equation can then be written down as, for all s:


$$V^*(s) = \max_{a \in A_s}\Big[\, r + \sum_{s',\tau} \gamma^{\tau}\, P(s',\tau \mid s,a)\, V^*(s') \Big]$$

Note the need to sum over the waiting time as well.

Semi-Markov Decision Processes

•  Likewise, we can write down the Bellman equation for the
state–action value function as,

$$Q^*(s,a) = r + \sum_{s',\tau} \gamma^{\tau}\, P(s',\tau \mid s,a)\, \max_{a' \in A_{s'}} Q^*(s',a'), \qquad \forall s \in S,\ a \in A_s$$

•  So, Dynamic Programming algorithms can be naturally
extended to the case of SMDPs as well (a minimal value-iteration
sketch follows below)
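To make these Bellman backups concrete, here is a minimal value-iteration sketch for a discrete-time SMDP; the data-structure layout (P, R, actions) is an assumption made for illustration, not a fixed interface.

```python
import numpy as np

def smdp_value_iteration(P, R, gamma, n_states, actions, tol=1e-6):
    """Value iteration for a discrete-time SMDP (a sketch; containers are hypothetical).
    P[(s, a)] maps (s_next, tau) -> probability; R[(s, a)] is the expected reward
    accumulated over the waiting time; actions[s] lists the actions available in s."""
    V = np.zeros(n_states)
    while True:
        V_new = np.array([
            max(R[(s, a)] + sum(gamma ** tau * p * V[s2]
                                for (s2, tau), p in P[(s, a)].items())
                for a in actions[s])
            for s in range(n_states)])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```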

Q-Learning with SMDPs

•  Can we also modify sampling-based algorithms accordingly?
•  Consider the standard Q-learning algorithm, rewritten slightly
in the following form,

$$Q_{k+1}(s,a) = (1-\alpha_k)\,Q_k(s,a) + \alpha_k\big[\, r + \gamma \max_{a' \in A_{s'}} Q_k(s',a') \big]$$

•  If we write out the reward sum, in brackets, over the entire
waiting-time duration, then we have

$$Q_{k+1}(s,a) = (1-\alpha_k)\,Q_k(s,a) + \alpha_k\big[\, r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{\tau-1} r_{t+\tau} + \gamma^{\tau} \max_{a' \in A_{s'}} Q_k(s',a') \big]$$
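A minimal tabular sketch of this sampled update, assuming the agent has recorded the individual rewards r_{t+1}, …, r_{t+τ} while the chosen action was running; Q is assumed to be a dict of dicts.

```python
def smdp_q_update(Q, s, a, rewards, s_next, alpha, gamma):
    """One tabular SMDP Q-learning backup.  `rewards` is the observed list
    [r_{t+1}, ..., r_{t+tau}] collected while action a was executing."""
    tau = len(rewards)
    G = sum(gamma ** k * r for k, r in enumerate(rewards))   # discounted reward over the waiting time
    target = G + gamma ** tau * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])
```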

Case Study: Elevator Dispatching

[Crites and Barto, 1996]


Semi-Markov Q-Learning

This is a continuous-time problem, but decisions are made in discrete jumps.
For this SMDP, the expression for returns can be written as

$$R_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \qquad \text{or} \qquad R_t = \int_0^{\infty} e^{-\beta\tau}\, r_{t+\tau}\, d\tau$$

Note that the meaning of the quantity r differs in the two expressions:

– reward at a discrete time step in the discrete case
– reward “rate” in the continuous case

The negative exponential plays a role similar to the discount
factor as we have been using it so far.


Semi-Markov Q-Learning

Suppose the system takes action a from state s at time t₁,
and the next decision is needed at time t₂ in state s′:

$$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\Big[ \int_{t_1}^{t_2} e^{-\beta(\tau - t_1)}\, r_\tau\, d\tau \;+\; e^{-\beta(t_2 - t_1)} \max_{a'} Q(s',a') \Big]$$
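As a rough illustration (not the Crites and Barto code), the integral in this update can be approximated from reward-rate samples on a simulation grid of step dt; all names and the simulator interface are assumptions.

```python
import math

def continuous_smdp_q_update(Q, s, a, rate_samples, dt, t1, t2, s_next, alpha, beta):
    """Approximate the continuous-time backup above: the integral of
    e^{-beta (tau - t1)} r_tau dtau over [t1, t2) is discretised with step dt,
    using reward-rate samples taken from the simulator."""
    discounted_reward = sum(math.exp(-beta * k * dt) * r * dt
                            for k, r in enumerate(rate_samples))
    target = discounted_reward + math.exp(-beta * (t2 - t1)) * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])
```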


Problem Setup:
Passenger Arrival Patterns

Up-peak and Down-peak traffic

•  Not equivalent: down-peak handling capacity is much greater than
up-peak handling capacity, so up-peak capacity is the limiting factor.

•  Up-peak is easiest to analyse: once everyone is on board at the lobby, the rest
of the trip is determined. The only decision is when to open and close the
doors at the lobby. The optimal policy for the pure case is: close the doors when
a threshold number are on board; the threshold depends on traffic intensity.

•  More policies need to be considered for two-way and down-peak traffic.

•  We focus on the down-peak traffic pattern.

Various Extant Control Strategies

• Zoning: divide building into zones; park in zone when idle. Robust in
heavy traffic.

• Search-based methods: greedy or non-greedy. Receding Horizon
control.

• Rule-based methods: expert systems/fuzzy logic; from human
“experts”

• Other heuristic methods: Longest Queue First (LQF), Highest
Unanswered Floor First (HUFF), Dynamic Load Balancing (DLB)

• Adaptive/Learning methods: NNs for prediction, parameter space
search using simulation, DP on simplified model, non-sequential RL


The Elevator Model
(Lewis, 1991)

Parameters:
• Floor Time (time to move one floor at max speed): 1.45 secs.
• Stop Time (time to decelerate, open and close doors, and accelerate again): 7.19 secs.
• Turn Time (time needed by a stopped car to change directions): 1 sec.
• Load Time (the time for one passenger to enter or exit a car): a random variable with range from 0.6 to 6.0 secs, mean of 1 sec.
• Car Capacity: 20 passengers

Discrete Event System: continuous time,
asynchronous elevator operation

Traffic Profile:
• Poisson arrivals with rates changing every 5 minutes; down-peak

State Space
• 18 hall call buttons: 2^18 combinations
• Positions and directions of cars: 18^4 (rounding to the nearest floor)
• Motion states of cars (accelerating, moving, decelerating,
stopped, loading, turning): 6
• 40 car buttons: 2^40

• The set of passengers waiting at each floor, each passenger’s arrival
time and destination: unobservable. However, 18 real numbers
are available giving the elapsed time since each hall button was pushed;
we discretize these.

• The set of passengers riding each car and their destinations:
observable only through the car buttons

Conservatively, about 10^22 states


Actions

• When moving (halfway between floors):
– stop at next floor
– continue past next floor

• When stopped at a floor:
– go up
– go down

• Asynchronous


Constraints
• A car cannot pass a floor if a passenger wants to get off there
• A car cannot change direction until it has serviced all
onboard passengers traveling in the current direction
• Don’t stop at a floor if another car is already stopping, or
is stopped, there
• Don’t stop at a floor unless someone wants to get off there
• Given a choice, always move up

(Slide annotation: the constraints are marked as either standard or special heuristics; the remaining decisions reduce to “stop” and “continue”.)


Performance Criteria

Minimize:
• Average wait time
• Average system time (wait + travel time)
• % waiting > T seconds (e.g., T = 60)
• Average squared wait time (to encourage fast and fair service)

Average Squared Wait Time

Instantaneous cost, p individuals:

$$r_\tau = \sum_{p} \big(\text{wait}_p(\tau)\big)^2$$

Define the return as an integral rather than a sum (Bradtke and Duff, 1994):

$$\int_0^{\infty} e^{-\beta\tau}\, r_\tau\, d\tau$$
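A small sketch of these two quantities, assuming wait times and costs are sampled on a discrete simulation grid of step dt; the function names are hypothetical.

```python
import math

def squared_wait_cost(wait_times):
    """Instantaneous cost r_tau: sum of squared waiting times over the currently waiting passengers."""
    return sum(w ** 2 for w in wait_times)

def discounted_cost(costs, dt, beta):
    """Approximate the integral of e^{-beta tau} r_tau dtau on a time grid of step dt."""
    return sum(math.exp(-beta * k * dt) * r * dt for k, r in enumerate(costs))
```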


Computing Rewards

Must calculate

$$\int_0^{\infty} e^{-\beta(\tau - t_s)}\, r_\tau\, d\tau$$

• “Omniscient Rewards”: the simulator knows how long each
passenger has been waiting.

• “On-Line Rewards”: assumes only the arrival time of the first passenger in
each queue is known (elapsed hall button time); estimate the other arrival
times


Neural Networks

47 inputs, 20 sigmoid hidden units, 1 or 2 output units

Inputs:
• 9 binary: state of each hall down button
• 9 real: elapsed time of hall down button if pushed
• 16 binary: one on at a time: position and direction of car making decision
• 10 real: location/direction of other cars: “footprint”
• 1 binary: at highest floor with waiting passenger?
• 1 binary: at floor with longest waiting passenger?
• 1 bias unit ≡ 1
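For concreteness, a forward pass for a network of this shape might look as follows; the initialisation, class name, and output interpretation are assumptions for illustration, not the original Crites and Barto implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ElevatorQNet:
    """47 inputs -> 20 sigmoid hidden units -> 1 output, matching the sizes listed above."""
    def __init__(self, n_in=47, n_hidden=20, n_out=1, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = 0.1 * rng.standard_normal((n_hidden, n_in))
        self.b1 = np.zeros(n_hidden)
        self.W2 = 0.1 * rng.standard_normal((n_out, n_hidden))
        self.b2 = np.zeros(n_out)

    def forward(self, x):
        """x is the 47-dimensional feature vector (including the bias unit)."""
        h = sigmoid(self.W1 @ x + self.b1)
        return self.W2 @ h + self.b2        # 1-dimensional output: estimated cost-to-go
```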


Elevator Results


Options Framework


Options example:
Move until end of hallway

Start: any state in the hallway.

Execute: policy π as shown.

Terminate: when state s is the
end of the hallway.


[Reference: R.S. Sutton, D. Precup, S. Singh, Between MDPs and semi-MDPs: A framework for temporal
abstraction in reinforcement learning, Artificial Intelligence 112:181-211, 1999.]

Options [Sutton, Precup, Singh ’99]

•  An option is a behaviour defined in terms of:
o = { I_o, π_o, β_o }

•  I_o : the set of states in which o can be initiated.
•  π_o(s) : the policy (mapping S to A)§ followed while o is executing.
•  β_o(s) : the probability that o terminates in s.

§Can be a policy
over lower-level
options.
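A minimal sketch of the option triple as a data structure, together with an execution loop; the environment interface (env.step returning a next state and reward) is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Callable, Set
import random

@dataclass(eq=False)   # identity-based hashing, so options can be used as dict keys
class Option:
    """o = {I_o, pi_o, beta_o}: initiation set, internal policy, termination condition."""
    initiation_set: Set[Any]                # I_o: states where the option may be started
    policy: Callable[[Any], Any]            # pi_o: state -> action (or lower-level option)
    termination: Callable[[Any], float]     # beta_o: state -> probability of terminating

def run_option(env, state, option):
    """Execute the option until beta_o says to stop; returns the rewards collected
    and the state in which the option terminated (env.step is a hypothetical interface)."""
    assert state in option.initiation_set
    rewards = []
    while True:
        state, r = env.step(option.policy(state))
        rewards.append(r)
        if random.random() < option.termination(state):
            return rewards, state
```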


Rooms Example


Options Define a Semi-MDP


MDP + Options = SMDP


Why is this Useful?

•  We can now define a policy over options as well:

$$\mu : S \times O \rightarrow [0,1]$$

•  And redefine all value functions appropriately:

$$V^{\mu}(s), \quad Q^{\mu}(s,o), \quad V^{*}_{O}(s), \quad Q^{*}_{O}(s,o)$$

•  All policy learning methods discussed so far, e.g., Value and
Policy Iteration, can be defined over S and O

•  This gives a coherent theory of learning and planning, with courses of
action at variable time scales, yet at the same level

Value Functions Over Options

We can write the expression for the optimal value as,


$$V^{*}_{O}(s) = \max_{\mu \in \Pi(O)} V^{\mu}(s)$$

$$V^{*}_{O}(s) = \max_{o \in O_s} E\{\, r_{t+1} + \dots + \gamma^{k-1} r_{t+k} + \gamma^{k}\, V^{*}_{O}(s_{t+k}) \mid \mathcal{E}(o,s,t) \,\}$$

$$V^{*}_{O}(s) = \max_{o \in O_s} E\big[\, r + \gamma^{k}\, V^{*}_{O}(s') \mid \mathcal{E}(o,s) \,\big]$$

Here k is the duration of o when taken in s; the conditioning is on the event
that the option is initiated in that state at that time.
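Putting the pieces together, here is a sketch of tabular SMDP Q-learning over options, reusing run_option from the earlier sketch; the environment interface, ε-greedy choice, and Q layout are assumptions for illustration.

```python
import random

def q_learning_over_options(env, options, Q, alpha, gamma, n_episodes, epsilon=0.1):
    """SMDP Q-learning where the choices at each state are the options that can be
    initiated there.  Q[s][o] is tabular; env.reset/env.step/env.done are assumed."""
    for _ in range(n_episodes):
        s = env.reset()
        while not env.done():
            available = [o for o in options if s in o.initiation_set]
            if random.random() < epsilon:
                o = random.choice(available)                    # explore
            else:
                o = max(available, key=lambda opt: Q[s][opt])   # exploit
            rewards, s_next = run_option(env, s, o)             # execute until termination
            k = len(rewards)
            G = sum(gamma ** i * r for i, r in enumerate(rewards))
            target = G + gamma ** k * max(Q[s_next].values())
            Q[s][o] += alpha * (target - Q[s][o])
            s = s_next
    return Q
```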

Motivations for the Options Framework

•  Add temporally extended activities to the choices available to the RL
agent, without precluding planning and learning at the finer-
grained MDP level

•  Optimal policies over primitives are not compromised by the
addition of options

•  However, if an option is useful, learning will quickly find this
out – preventing prolonged and useless ‘flailing about’

PS: If all options are 1-step, you recover the core MDP


Rooms Example –
Policy within One Room


Time Course of Use of Action/Option


[Source: M. Botvinick, Hierarchical models of behavior and prefrontal function, Trends in Cognitive Sciences 12(5), 2008]

Performance Improvement with Options


[Source: M. Botvinick, Hierarchical models of behavior and prefrontal function, Trends in Cognitive Sciences 12(5), 2008]

Learning Options


Goal/State Abstraction

•  Why are these together?
–  Abstract goals typically imply abstract states

•  Makes sense for classical planning
–  Classical planning uses state sets
–  Implicit in the use of state variables

•  Does this make sense for RL? Things to think about:
–  No explicit goals
–  Markov property issues


Automatic Hierarchy Discovery

•  Hard in contexts such as classical planning

•  Within a single problem:
–  The battle is lost if all states are considered (polynomial speedup
at best)

–  If fewer states are considered, when do we stop?

•  Across problems
–  Considering all states may be OK for a few problems?
–  Generalize to other problems in the class (transfer learning)

•  How best to measure progress? This remains an open problem…


Bottlenecks: Principle for Option Learning

•  Subgoals are created based on commonalities across multiple
paths to a solution

•  Cast the task of finding these commonalities as a multiple-
instance learning problem

•  The system attempts to identify a target concept on the basis of
“bags” of instances: positive bags have at least one positive
instance, while negative bags consist of all negative instances.

•  A successful trajectory corresponds to a positive bag, where the
instances are the agent’s observations along that trajectory.

•  A negative bag consists of observations made over an
unsuccessful trajectory.


[Reference: A. McGovern, A.G. Barto, Accelerating reinforcement learning through discovery of useful subgoals,
Proc. i-SAIRAS 2001.]

Motivation for Bottlenecks

•  The agent searches for bottleneck regions in observation space.
•  The idea of looking for bottleneck regions was motivated by
studying room-to-room navigation tasks, where the agent
should quickly discover the utility of doorways as subgoals.

•  IF the agent can recognize that a doorway is a kind of
bottleneck by detecting that the sensation of being in the
doorway always occurred somewhere on successful
trajectories
–  BUT not always on unsuccessful ones,

•  THEN it can create an option to reach the doorway.


Bottlenecks – Computationally

•  If the agent uses some form of randomness to select exploratory
primitive actions, it is likely to remain within the more strongly
connected regions of the state space.

•  An option for achieving a bottleneck region, on the other hand, will
tend to connect separate strongly connected areas.

•  For example, in a room-to-room navigation task, navigation using
primitive movement commands produces relatively strongly
connected dynamics within each room but not between rooms.

•  A doorway links two strongly connected regions. By adding an
option to reach a doorway sub-goal, the rooms become more
closely connected.

•  This allows the agent to explore its environment more uniformly.


Example Data in a Simulated Environment


First-visit histogram:

Concept of Diverse Density

•  Due to Maron and Lozano-Pérez
•  Compute the probability Pr(t) that the t-th concept from a set
of concepts {c_i} is the correct one
•  If B±_i denote the positive and negative bags of experienced
traces,

$$DD(t) = \Pr(t \mid B^{+}_{1}, \dots, B^{+}_{n}, B^{-}_{1}, \dots, B^{-}_{m})$$

$$DD(t) = \prod_{1 \le i \le n} \Pr(t \mid B^{+}_{i}) \prod_{1 \le j \le m} \Pr(t \mid B^{-}_{j})$$
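A sketch of how DD(t) could be evaluated for a candidate concept; the per-instance likelihood and the noisy-or combination per bag are common choices in the multiple-instance learning literature, not details fixed by the slide.

```python
import math

def instance_prob(instance, concept, scale=1.0):
    """Hypothetical likelihood that an observed instance matches the candidate concept
    (a Gaussian-like kernel on feature distance)."""
    d2 = sum((x - c) ** 2 for x, c in zip(instance, concept))
    return math.exp(-scale * d2)

def diverse_density(concept, positive_bags, negative_bags):
    """DD(t) = prod_i Pr(t | B+_i) * prod_j Pr(t | B-_j), using a noisy-or model per bag:
    a positive bag supports t if at least one of its instances matches; a negative bag
    supports t if none of its instances do."""
    dd = 1.0
    for bag in positive_bags:
        dd *= 1.0 - math.prod(1.0 - instance_prob(x, concept) for x in bag)
    for bag in negative_bags:
        dd *= math.prod(1.0 - instance_prob(x, concept) for x in bag)
    return dd
```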

Options Discovery

•  For the simulated environment being considered, when DD is
calculated over all traces:

•  From this, options can be constructed by learning local
policies to the subgoals
