Reinforcement Learning
Abstraction and Hierarchy in RL
Subramanian Ramamoorthy
School of Informatics
3 March, 2017
On the Organization of Behaviour
Consider day-to-day tasks such as opening a bottle or picking up your coat from the stand
– How would you tell someone how to perform this?
– How do you actually get the job done? Using what?
An old idea:
sequential behavior cannot be understood as a chain of stimulus–response associations… behavior displays hierarchical structure, comprising nested subroutines.
– Karl Lashley (1951)
On the Complexity of RL
• RL agents must learn from exploring the environment,
trying out different actions in different states
• Complexity grows quickly; how can we get around this?
• Balance exploration/exploitation
• Abstractions
Example: Instrumental Hierarchy
[Source: M. Botvinick, Hierarchical models of behavior and prefrontal function, Trends in Cognitive Science 12(5), 2008]
In Computational Terms, Why Hierarchy?
• Knowledge transfer/injection
– Concise encoding of temporally extended actions enables transfer of that knowledge to a new task that needs the same behaviour
• Biases exploration
– Rather than searching only at the finest resolution, the algorithm can compute the value of alternatives at a larger scale
• Faster solutions (even if the model is known)
– Following well-understood principles of computer science, hierarchy enables modularity of policies and models
An Early Idea: Reward Shaping
• The robots’ objective is to collectively find pucks and bring them home.
• Represent the 12-dim environment by state variables (features?):
– have-puck?
– at-home?
– near-intruder?
• What should the immediate reward function be?
[Source: M. Mataric, Reward Functions for Accelerated Learning, ICML 1994]
Reward Shaping
• If a reward is given only when a robot drops a puck at home, learning will be extremely difficult.
– The delay between the action and the reward is large.
• Solution: reward shaping (intermediate rewards).
– Add rewards for achieving subgoals and penalties for errors:
• subgoal: grasped-puck
• subgoal: dropped-puck-at-home
• error: dropped-puck-away-from-home
• Add progress estimators:
– Intruder-avoiding progress function
– Homing progress function
• Adding intermediate rewards will potentially allow RL to handle more complex problems.
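To make the shaping idea concrete, here is a minimal sketch of how such a reward function might be assembled, assuming boolean event flags and progress-estimator values like those listed above; the function name and numeric weights are illustrative, not Mataric's original constants.

```python
# Hypothetical shaped reward for the puck-collecting robot: sparse subgoal
# rewards plus dense progress-estimator terms. All constants are illustrative.

def shaped_reward(grasped_puck, dropped_at_home, dropped_away_from_home,
                  homing_progress, intruder_avoid_progress):
    r = 0.0
    if grasped_puck:
        r += 1.0                      # subgoal: grasped-puck
    if dropped_at_home:
        r += 3.0                      # subgoal: dropped-puck-at-home
    if dropped_away_from_home:
        r -= 1.0                      # error: dropped-puck-away-from-home
    # Progress estimators: small dense rewards for moving toward home and
    # away from the intruder since the previous time step.
    r += 0.1 * homing_progress
    r += 0.1 * intruder_avoid_progress
    return r
```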
Reward Shaping Results
Percentage of policy learnt after 15 minutes:
Another Early Idea: Hierarchical Algorithms – Gating Mechanisms
Hierarchical Learning (especially popular with neural network models)
• Learn the gating function
• Learn the individual behaviours
• Learn both simultaneously
[Figure: a gating network g selects among behaviours b_i; this can be a multi-level hierarchy.]
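A minimal sketch of such a gating architecture, assuming a linear-softmax gate g over a fixed set of behaviour policies b_i; the names, shapes, and the softmax choice are illustrative assumptions, not a specific published model.

```python
import numpy as np

def gated_action(state_features, gate_weights, behaviours, rng=np.random):
    """The gate g scores each behaviour; the sampled behaviour b_i picks the action."""
    logits = gate_weights @ state_features           # one logit per behaviour
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                             # softmax over behaviours
    i = rng.choice(len(behaviours), p=probs)         # gate selects behaviour b_i
    return behaviours[i](state_features)             # b_i maps features to an action
```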
Temporal Abstraction
• What’s the issue?
– Want “macro” actions (multiple time steps)
– Advantages:
• Avoid dealing with (exploring/computing values for) less desirable states
• Reuse experience across problems/regions
• What’s not obvious
– Dealing with the Markov assumption
– Getting the calculations right (e.g., stability and convergence)
Semi-Markov Decision Processes
• A generalization of MDPs:
The amount of time between one decision and the next is a random variable (either real- or integer-valued)
• Treat the system as remaining in each state for a random waiting time
– after which, the transition to the next state is instantaneous
• Real-valued case: continuous time, discrete events
• Discrete case: decisions are made only at integer multiples of an underlying time step
Semi-Markov Decision Processes
• An SMDP is defined in terms of
P(s′, τ | s, a): transition probability (τ is the waiting time)
R(s, a), or just r: reward, the amount expected to accumulate during the waiting time τ for a particular state and action
• The Bellman equation can then be written down as, for all s:
$$V^{*}(s) = \max_{a \in A_s} \Big[\, r + \sum_{s',\tau} \gamma^{\tau} P(s',\tau \mid s,a)\, V^{*}(s') \Big]$$
Note the need to sum over the waiting time as well.
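A minimal value-iteration sketch built directly on this Bellman equation, assuming a small tabular SMDP where P[s][a] maps (s′, τ) pairs to probabilities and R[s][a] is the expected reward accumulated over the waiting time; all names are illustrative.

```python
import numpy as np

def smdp_value_iteration(P, R, gamma, n_states, n_iters=1000, tol=1e-8):
    """Backups of V*(s) = max_a [ R(s,a) + sum_{s',tau} gamma^tau P(s',tau|s,a) V*(s') ]."""
    V = np.zeros(n_states)
    for _ in range(n_iters):
        V_new = np.empty_like(V)
        for s in range(n_states):
            q_values = []
            for a in P[s]:
                # Discount the bootstrap term by gamma**tau for the waiting time tau.
                q = R[s][a] + sum(prob * gamma**tau * V[s_next]
                                  for (s_next, tau), prob in P[s][a].items())
                q_values.append(q)
            V_new[s] = max(q_values)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V_new
```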
Semi-Markov Decision Processes
• Likewise, we can write down the Bellman equation for the state-action value function as,

$$Q^{*}(s,a) = r + \sum_{s',\tau} \gamma^{\tau} P(s',\tau \mid s,a) \max_{a' \in A_{s'}} Q^{*}(s',a') \qquad \forall s \in S,\ a \in A_s$$

• So, Dynamic Programming algorithms can be naturally extended to the case of SMDPs as well
Q-Learning with SMDPs
• Can we also modify sampling-based algorithms accordingly?
• Consider the standard Q-learning algorithm, rewritten slightly in the following form,

$$Q_{k+1}(s,a) = (1-\alpha_k)\, Q_k(s,a) + \alpha_k \Big[ r + \gamma \max_{a' \in A_{s'}} Q_k(s',a') \Big]$$

• If we write down the reward sum in brackets over the entire waiting-time duration, then we have

$$Q_{k+1}(s,a) = (1-\alpha_k)\, Q_k(s,a) + \alpha_k \Big[ r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{\tau-1} r_{t+\tau} + \gamma^{\tau} \max_{a' \in A_{s'}} Q_k(s',a') \Big]$$
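A minimal sketch of this SMDP Q-learning update, assuming a tabular Q stored as nested dicts and a recorded transition (s, a) → s′ whose per-step rewards over the waiting time τ are available; the names are illustrative.

```python
def smdp_q_update(Q, s, a, rewards, s_next, gamma, alpha):
    """Apply the bracketed update above for one transition lasting tau = len(rewards) steps."""
    tau = len(rewards)
    # Discounted reward accumulated over the waiting time: r_{t+1} + gamma*r_{t+2} + ...
    R = sum(gamma**k * r for k, r in enumerate(rewards))
    target = R + gamma**tau * max(Q[s_next].values())
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target
```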
Case Study: Elevator Dispatching
[Crites and Barto, 1996]
Semi-Markov Q-Learning
Continuous-time problem, but decisions in discrete jumps. For this SMDP, the expression for returns can be written as,

$$R_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \qquad \text{or} \qquad R_t = \int_{0}^{\infty} e^{-\beta \tau}\, r_{t+\tau}\, d\tau$$

Note that the meaning of the quantity r differs in the two expressions:
– reward at a discrete time step in the discrete case
– reward “rate” in the continuous case
The negative exponential plays a role similar to the discount factor we have used so far.
Semi-Markov Q-Learning
Suppose the system takes action a from state s at time t1, and the next decision is needed at time t2 in state s′:

$$Q(s,a) \leftarrow (1-\alpha)\, Q(s,a) + \alpha \left[ \int_{t_1}^{t_2} e^{-\beta(\tau - t_1)}\, r_\tau\, d\tau + e^{-\beta(t_2 - t_1)} \max_{a'} Q(s',a') \right]$$
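A minimal sketch of this continuous-time update, assuming the reward rate r_τ can be sampled between the two decision times and approximating the integral numerically; the sampling resolution and data structures are illustrative.

```python
import numpy as np

def continuous_smdp_q_update(Q, s, a, s_next, t1, t2, reward_rate, beta, alpha, n=1000):
    """One semi-Markov Q-learning update between decision times t1 and t2."""
    ts = np.linspace(t1, t2, n)
    dt = (t2 - t1) / (n - 1)
    rates = np.array([reward_rate(t) for t in ts])
    # Approximate the integral of e^{-beta(tau - t1)} r_tau dtau over [t1, t2].
    R = np.sum(np.exp(-beta * (ts - t1)) * rates) * dt
    # Bootstrap term, discounted by e^{-beta(t2 - t1)}.
    bootstrap = np.exp(-beta * (t2 - t1)) * max(Q[s_next].values())
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * (R + bootstrap)
```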
Problem Setup: Passenger Arrival Patterns
Up-peak and Down-peak traffic
• Not equivalent: down-peak handling capacity is much greater than up-peak handling capacity, so up-peak capacity is the limiting factor.
• Up-peak is easiest to analyse: once everyone is on board at the lobby, the rest of the trip is determined. The only decision is when to open and close the doors at the lobby. The optimal policy for the pure case: close the doors when a threshold number of passengers are on board; the threshold depends on traffic intensity.
• More policies to consider for two-way and down-peak traffic.
• We focus on down-peak traffic pattern.
Various Extant Control Strategies
• Zoning: divide building into zones; park in zone when idle. Robust in
heavy traffic.
• Search-based methods: greedy or non-greedy. Receding Horizon
control.
• Rule-based methods: expert systems/fuzzy logic; from human
“experts”
• Other heuristic methods: Longest Queue First (LQF), Highest
Unanswered Floor First (HUFF), Dynamic Load Balancing (DLB)
• Adaptive/Learning methods: NNs for prediction, parameter space
search using simulation, DP on simplified model, non-sequential RL
The Elevator Model
(Lewis, 1991)
Parameters:
• Floor Time (time to move one floor at max speed): 1.45 secs.
• Stop Time (time to decelerate, open and close doors, and accelerate
again): 7.19 secs.
• Turn Time (time needed by a stopped car to change directions): 1 sec.
• Load Time (the time for one passenger to enter or exit a car): a random
variable with range from 0.6 to 6.0 secs, mean of 1 sec.
• Car Capacity: 20 passengers
Discrete Event System: continuous time,
asynchronous elevator operation
Traffic Profile:
• Poisson arrivals with rates changing every 5 minutes; down-peak
State Space
• 18 hall call buttons: 2^18 combinations
• positions and directions of cars: 18^4 (rounding to nearest floor)
• motion states of cars (accelerating, moving, decelerating, stopped, loading, turning): 6
• 40 car buttons: 2^40
• Set of passengers waiting at each floor, each passenger’s arrival time and destination: unobservable. However, 18 real numbers are available giving elapsed time since hall buttons pushed; we discretize these.
• Set of passengers riding each car and their destinations: observable only through the car buttons

Conservatively about 10^22 states
Actions
• When moving (halfway between floors):
– stop at next floor
– continue past next floor
• When stopped at a floor:
– go up
– go down
• Asynchronous
Constraints
• A car cannot pass a floor if a passenger wants to get off
there
• A car cannot change direction until it has serviced all
onboard passengers traveling in the current direction
• Don’t stop at a floor if another car is already stopping, or
is stopped, there
• Don’t stop at a floor unless someone wants to get off
there
• Given a choice, always move up
[The first two constraints are standard; the last three are special heuristic constraints on the stop and continue choices.]
Performance Criteria
Minimize:
• Average wait time
• Average system time (wait + travel time)
• % waiting > T seconds (e.g., T = 60)
• Average squared wait time (to encourage fast and fair service)
Average Squared Wait Time
Instantaneous cost, summed over waiting individuals p:

$$r_\tau = \sum_{p} \big(\text{wait}_p(\tau)\big)^2$$

Define the return as an integral rather than a sum (Bradtke and Duff, 1994):

$$\int_{0}^{\infty} e^{-\beta \tau}\, r_\tau\, d\tau$$
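A small worked sketch of this cost and return, assuming the squared waits are sampled on a regular grid of simulated time; the discretisation is an illustrative approximation of the integral.

```python
import numpy as np

def squared_wait_cost(wait_times):
    # r_tau = sum over waiting passengers p of wait_p(tau)^2.
    return sum(w**2 for w in wait_times)

def discounted_return(costs, dt, beta):
    # Approximate the integral of e^{-beta*tau} r_tau dtau with samples costs[i] at tau = i*dt.
    taus = np.arange(len(costs)) * dt
    return float(np.sum(np.exp(-beta * taus) * np.asarray(costs) * dt))
```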
Computing Rewards
Must calculate

$$\int_{0}^{\infty} e^{-\beta(\tau - t_s)}\, r_\tau\, d\tau$$
• “Omniscient Rewards”: the simulator knows how long each
passenger has been waiting.
• “On-Line Rewards”: Assumes only arrival time of first passenger in
each queue is known (elapsed hall button time); estimate arrival
times
Neural Networks
Inputs:
• 9 binary: state of each hall down button
• 9 real: elapsed time of hall down button if pushed
• 16 binary: one on at a time: position and direction of car making decision
• 10 real: location/direction of other cars: “footprint”
• 1 binary: at highest floor with waiting passenger?
• 1 binary: at floor with longest waiting passenger?
• 1 bias unit ≡ 1

47 inputs, 20 sigmoid hidden units, 1 or 2 output units
Elevator Results
Options Framework
Options example: Move until end of hallway
Start: any state in the hallway.
Execute: policy π as shown.
Terminate: when state s is the end of the hallway.
[Reference: R.S. Sutton, D. Precup, S. Singh, Between MDPs and Semi-MDPs: A framework for temporal abstraction in reinforcement learning, Artificial Intelligence Journal 112:181-211, 1999.]
Options [Sutton, Precup, Singh ’99]
• An option is a behaviour defined in terms of:
o = { I_o, π_o, β_o }
• I_o : Set of states in which o can be initiated.
• π_o(s) : Policy (mapping S to A)§ followed while o is executing.
• β_o(s) : Probability that o terminates in s.
§ Can be a policy over lower-level options.
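A minimal sketch of the option triple as a data structure, assuming hashable states and a discrete action type; the field names are illustrative, not a standard API.

```python
from dataclasses import dataclass
from typing import Any, Callable, Set

@dataclass
class Option:
    initiation_set: Set[Any]                    # I_o: states where o can be initiated
    policy: Callable[[Any], Any]                # pi_o: maps a state to an action (or sub-option)
    termination_prob: Callable[[Any], float]    # beta_o(s): probability o terminates in s

    def can_start(self, s) -> bool:
        return s in self.initiation_set
```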
Rooms Example
Options Define a Semi-MDP
MDP + Options = SMDP
Why is this Useful?
• We can now define a policy over options as well:

$$\mu : S \times O \rightarrow [0,1]$$

• And redefine all value functions appropriately:

$$V^{\mu}(s), \quad Q^{\mu}(s,o), \quad V^{*}_{O}(s), \quad Q^{*}_{O}(s,o)$$

• All policy learning methods discussed so far, e.g., Value and Policy Iteration, can be defined over S and O
• Coherent theory of learning and planning, with courses of action at variable time scales, yet at the same level
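A minimal sketch of SMDP Q-learning over options, reusing the Option structure sketched earlier: run an option to termination, accumulate the discounted reward, then make one SMDP backup. The environment interface step(action) → (next_state, reward, done) and the tabular Q indexed by state and option index are assumptions for illustration.

```python
import numpy as np

def run_option_and_update(env, s, oi, options, Q, gamma, alpha, rng=np.random):
    """Execute options[oi] from state s, then update Q[s][oi] with an SMDP backup."""
    option = options[oi]
    R, k, state = 0.0, 0, s
    while True:
        a = option.policy(state)                    # follow the option's internal policy
        state, r, done = env.step(a)
        R += gamma**k * r                           # accumulate discounted reward
        k += 1
        if done or rng.random() < option.termination_prob(state):
            break                                   # beta_o says: terminate here
    # SMDP backup: bootstrap from the best option available in the landing state,
    # discounted by gamma**k for the k steps the option lasted.
    available = [j for j, o in enumerate(options) if o.can_start(state)]
    best_next = max(Q[state][j] for j in available) if available else 0.0
    Q[s][oi] = (1 - alpha) * Q[s][oi] + alpha * (R + gamma**k * best_next)
    return state
```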
Value Functions Over Options
We can write the expression for the optimal value as,

$$V^{*}_{O}(s) = \max_{\mu \in \Pi(O)} V^{\mu}(s)$$

$$V^{*}_{O}(s) = \max_{o \in O_s} E\{\, r_{t+1} + \dots + \gamma^{k-1} r_{t+k} + \gamma^{k} V^{*}_{O}(s_{t+k}) \mid \mathcal{E}(o,s,t) \,\}$$

$$V^{*}_{O}(s) = \max_{o \in O_s} E[\, r + \gamma^{k} V^{*}_{O}(s') \mid \mathcal{E}(o,s) \,]$$

where k is the duration of o when taken in s; the conditioning is on the event that the option is initiated in that state at that time.
Motivations for the Options Framework
• Add temporally extended activities to the choices available to the RL agent, without precluding planning and learning at the finer-grained MDP level
• Optimal policies over primitives are not compromised by the addition of options
• However, if an option is useful, learning will quickly find this out – preventing prolonged and useless ‘flailing about’
PS: If all options are 1-step, you recover the core MDP
Rooms Example – Policy within One Room
Time Course of Use of Action/Option
[Source: M. Botvinick, Hierarchical models of behavior and prefrontal function, Trends in Cognitive Science 12(5), 2008]
Performance Improvement with Options
[Source: M. Botvinick, Hierarchical models of behavior and prefrontal function, Trends in Cognitive Science 12(5), 2008]
Learning Options
Goal/State Abstraction
• Why are these together?
– Abstract goals typically imply abstract states
• Makes sense for classical planning
– Classical planning uses state sets
– Implicit in use of state variables
• Does this make sense for RL? Things to think about:
– No goals
– Markov property issues
Automatic Hierarchy Discovery
• Hard in contexts such as classical planning
• Within a single problem:
– Battle is lost if all states are considered (polynomial speedup at best)
– If fewer states considered, when to stop?
• Across problems
– Considering all states OK for few problems?
– Generalize to other problems in class (transfer learning)
• How best to measure progress? Remains an open problem…
Bottlenecks: Principle for Option Learning
• Subgoals are created based on commonalities across multiple paths to a solution
• Cast the task of finding these commonalities as a multiple-instance learning problem
• System attempts to identify a target concept on the basis of “bags” of instances: positive bags have at least one positive instance, while negative bags consist of all negative instances.
• A successful trajectory corresponds to a positive bag, where the instances are the agent’s observations along that trajectory.
• A negative bag consists of observations made over an unsuccessful trajectory.
[Reference: A. McGovern, A.G. Barto, Accelerating reinforcement learning through discovery of useful subgoals, Proc. i-SAIRAS 2001.]
Motivation for Bottlenecks
• Agent searches for bottleneck regions in observation space.
• The idea of looking for bottleneck regions was motivated by studying room-to-room navigation tasks, where the agent should quickly discover the utility of doorways as subgoals.
• IF the agent can recognize that a doorway is a kind of bottleneck by detecting that the sensation of being in the doorway always occurred somewhere on successful trajectories
– BUT not always on unsuccessful ones,
• THEN it can create an option to reach the doorway.
Bottlenecks – Computationally
• If the agent uses some form of randomness to select exploratory primitive actions, it is likely to remain within the more strongly connected regions of the state space.
• An option for achieving a bottleneck region, on the other hand, will tend to connect separate strongly connected areas.
• For example, in a room-to-room navigation task, navigation using primitive movement commands produces relatively strongly connected dynamics within each room but not between rooms.
• A doorway links two strongly connected regions. By adding an option to reach a doorway sub-goal, the rooms become more closely connected.
• This allows the agent to more uniformly explore its environment.
Example Data in a Simulated Environment
First-visit histogram:
Concept of Diverse Density
• Due to Maron and Lozano-Pérez
• Compute the probability Pr(t) that the t-th concept from a set of concepts {c_i} is the correct one
• If B_i^± denote the positive and negative bags of experienced traces,

$$DD(t) = \Pr(t \mid B^{+}_{1}, \dots, B^{+}_{n}, B^{-}_{1}, \dots, B^{-}_{m})$$

$$DD(t) = \prod_{1 \le i \le n} \Pr(t \mid B^{+}_{i}) \prod_{1 \le j \le m} \Pr(t \mid B^{-}_{j})$$
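A minimal sketch of scoring a candidate concept t by diverse density, assuming each bag is a list of observation feature vectors and Pr(t | instance) is modelled with a Gaussian-like kernel in the noisy-or style of Maron and Lozano-Pérez; the kernel and its scale are illustrative choices.

```python
import numpy as np

def instance_prob(t, x, scale=1.0):
    # How strongly instance x supports candidate concept t (Gaussian-like kernel).
    return float(np.exp(-scale * np.sum((np.asarray(x) - np.asarray(t)) ** 2)))

def diverse_density(t, positive_bags, negative_bags):
    dd = 1.0
    for bag in positive_bags:
        # Pr(t | B+): at least one instance in a positive bag should support t.
        dd *= 1.0 - np.prod([1.0 - instance_prob(t, x) for x in bag])
    for bag in negative_bags:
        # Pr(t | B-): no instance in a negative bag should support t.
        dd *= np.prod([1.0 - instance_prob(t, x) for x in bag])
    return dd
```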
Options Discovery
• For the simulated environment being considered, when DD is
calculated over all traces:
• From this, options can be constructed by learning local policies to the subgoals