
CMPUT 366 F20: Reinforcement Learning VI
Vadim Bulitko & James Wright
October 13, 2020
CMPUT 366 F20: Reinforcement Learning VI 1

Lecture Outline
Reinforcement Learning (RL)
SB 6.0-6.2, 6.4-6.5
Evolutionary Reinforcement Learning (ERL)
CMPUT 366 F20: Reinforcement Learning VI 2

Temporal Difference Learning
Want an on-line learning method that learns from experience, bootstraps, and does not rely on a world model
and we want a pony too!
Updates to the value function in policy evaluation:
Dynamic Programming: V(s) ← Σ_{a, s′, r} π(a|s) p(s′, r | s, a) [r + γV(s′)]
Monte Carlo: V(s) ← V(s) + α[G − V(s)]
α is the learning rate; G is the target to move towards
need to wait until the end of the episode to get the return G
Temporal difference: V(s) ← V(s) + α[r + γV(snext) − V(s)]
using r + γV(snext) as the target in place of G
do not need to wait until the end of the episode; bootstrapping
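As a concrete illustration, here is a minimal tabular TD(0) policy-evaluation sketch in Python. It assumes a hypothetical environment object with reset() → state and step(action) → (next_state, reward, done), and a policy function pi(state); these names are illustrative, not part of the slides.

    # Tabular TD(0): after every step, move V(s) toward the bootstrapped target r + gamma * V(s').
    from collections import defaultdict

    def td0_evaluate(env, pi, episodes=1000, alpha=0.1, gamma=1.0):
        V = defaultdict(float)                       # state-value estimates, default 0
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                a = pi(s)                            # follow the policy being evaluated
                s_next, r, done = env.step(a)
                target = r if done else r + gamma * V[s_next]
                V[s] += alpha * (target - V[s])      # no need to wait for the full return G
                s = s_next
        return V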
CMPUT 366 F20: Reinforcement Learning VI 3

TD(0) for Policy Evaluation
The TD update rule for V(s) uses sample updates: a single observed outcome
DP uses expected updates: a weighted sum over all possible outcomes
CMPUT 366 F20: Reinforcement Learning VI 4

TD(0) for Control
We can plug TD prediction into the generalized policy iteration framework
Monte Carlo control loop:
Generate an episode using π
Update estimates of Q and π
On-policy TD control loop:
Take an action according to π
Update estimates of Q and π
CMPUT 366 F20: Reinforcement Learning VI 5

On-policy TD(0) Control
The update rule uses St, At, Rt+1, St+1, At+1 – hence SARSA
Converges to the optimal policy if we use an ε-greedy policy and cool off ε over time (e.g., ε = 1/t)
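A minimal Sarsa sketch under the same assumed environment interface; the ε-greedy helper is illustrative and keeps ε fixed (cooling it off, e.g. ε = 1/t, is left out for brevity).

    import random
    from collections import defaultdict

    def eps_greedy(Q, s, actions, eps):
        # With probability eps pick a random action, otherwise the greedy one.
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    def sarsa(env, actions, episodes=500, alpha=0.5, gamma=1.0, eps=0.1):
        Q = defaultdict(float)
        for _ in range(episodes):
            s = env.reset()
            a = eps_greedy(Q, s, actions, eps)
            done = False
            while not done:
                s2, r, done = env.step(a)
                a2 = eps_greedy(Q, s2, actions, eps)        # next action from the same eps-greedy policy
                target = r if done else r + gamma * Q[(s2, a2)]
                Q[(s, a)] += alpha * (target - Q[(s, a)])   # uses (S, A, R, S', A'): hence Sarsa
                s, a = s2, a2
        return Q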
CMPUT 366 F20: Reinforcement Learning VI 6

Off-policy TD(0) Control
Learns values for the optimal policy while following a behaviour policy
Converges to the optimal policy if we use an ε-greedy policy and cool off ε over time (e.g., ε = 1/t)
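For contrast, a Q-learning sketch (reusing eps_greedy and defaultdict from the Sarsa sketch above): the only substantive change is the target, which bootstraps from max_a Q(S′, a) rather than from the action the behaviour policy actually takes next.

    def q_learning(env, actions, episodes=500, alpha=0.5, gamma=1.0, eps=0.1):
        Q = defaultdict(float)
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                a = eps_greedy(Q, s, actions, eps)          # behaviour policy: eps-greedy
                s2, r, done = env.step(a)
                best_next = max(Q[(s2, a2)] for a2 in actions)
                target = r if done else r + gamma * best_next
                Q[(s, a)] += alpha * (target - Q[(s, a)])   # target policy: greedy (off-policy)
                s = s2
        return Q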
CMPUT 366 F20: Reinforcement Learning VI 7

Sarsa versus Q-learning
Agent gets −1 reward per step until it reaches the goal state → episode ends
Stepping into the cliff region gives reward −100 → episode ends
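A toy cliff-walking environment in the spirit of this example, usable with the Sarsa and Q-learning sketches above. The 4×12 grid with the start in the bottom-left and the goal in the bottom-right follows Sutton & Barto; the exact class layout here is an illustrative assumption.

    # Toy cliff-walking grid: -1 per step, -100 for stepping into the cliff (which ends the episode).
    class CliffEnv:
        ROWS, COLS = 4, 12
        ACTIONS = ["up", "down", "left", "right"]

        def reset(self):
            self.pos = (3, 0)                               # start: bottom-left corner
            return self.pos

        def step(self, a):
            r, c = self.pos
            dr, dc = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}[a]
            r = min(max(r + dr, 0), self.ROWS - 1)          # stay inside the grid
            c = min(max(c + dc, 0), self.COLS - 1)
            self.pos = (r, c)
            if r == 3 and 1 <= c <= 10:                     # cliff cells along the bottom row
                return self.pos, -100, True
            if (r, c) == (3, 11):                           # goal: bottom-right corner
                return self.pos, -1, True
            return self.pos, -1, False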
CMPUT 366 F20: Reinforcement Learning VI 8

Sarsa versus Q-learning
ε = 0.1, constant
Why is Sarsa doing better?
What happens if we gradually reduce ε → 0?
CMPUT 366 F20: Reinforcement Learning VI 9

Sarsa | ε = 0.1
CMPUT 366 F20: Reinforcement Learning VI 10

Q-learning | ε = 0.1
CMPUT 366 F20: Reinforcement Learning VI 11

Sarsa | ε = 1/e
CMPUT 366 F20: Reinforcement Learning VI 12

Q-learning | ε = 1/e
CMPUT 366 F20: Reinforcement Learning VI 13

Q-learning versus LRTA*
Q-learning update rule:
Q(s, a) ← Q(s, a) + α[r + γ max_a Q(s′, a) − Q(s, a)]
Q-learning action-selection rule (with ε-greedy policy):
a ← argmax_a Q(s, a)  (1 − ε fraction of the time); a random action otherwise
LRTA* update rule:
h(s) ← min_s′ [cost(s, s′) + h(s′)]
LRTA* action-selection rule:
snext ← argmin_s′ [cost(s, s′) + h(s′)]
If α = 1, γ = 1, ε = 0 then we get
Q(s, a) ← r + max_a Q(s′, a)   and   a ← argmax_a Q(s, a)
Why is there no explicit exploration (e.g., no ε-greedy) in LRTA* and yet it converges to an optimal solution?
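A minimal LRTA* step sketch alongside the rules above, assuming a deterministic successor function succ(s) that returns (s′, cost) pairs; the function and dictionary names are illustrative.

    def lrta_star_step(s, h, succ):
        # One LRTA* step: score every successor by f(s') = cost(s, s') + h(s').
        best_next, best_f = None, float("inf")
        for s_next, cost in succ(s):
            f = cost + h.get(s_next, 0.0)
            if f < best_f:
                best_next, best_f = s_next, f
        h[s] = best_f          # update rule: h(s) <- min_{s'} [cost(s, s') + h(s')]
        return best_next       # action-selection rule: move to the minimizing successor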
CMPUT 366 F20: Reinforcement Learning VI 14

Summary
Temporal Difference Learning bootstraps and learns from experience
Dynamic programming bootstraps, but does not learn from experience
Monte Carlo learns from experience, but does not bootstrap
Policy evaluation: TD(0) algorithm
Sarsa estimates action-values of the actual ε-greedy policy
Q-Learning estimates action-values of the optimal policy while executing an ε-greedy policy
Connections to LRTA*
CMPUT 366 F20: Reinforcement Learning VI 15

How to Specify Desired Behaviour of an Agent
Desired action for each state → supervised learning
Desired goal state(s) → search
Good/bad feedback per action → reinforcement learning
Where do rewards come from?
Survival/death → evolution
CMPUT 366 F20: Reinforcement Learning VI 16

Interactions Between Learning and Evolution
Ackley and Littman (1991)
Rewards come from an evolved reward module
https://www.cs.unm.edu/~ackley/
CMPUT 366 F20: Reinforcement Learning VI 17

Reinforcement Learning versus Evolution
Reinforcement Learning:
Reward: good/bad feedback
Possibly delayed
But the agent can still learn from it
Evolution:
No feedback during agent’s lifetime
Natural selection over generations
Can they complement each other?
https://en.wikipedia.org/wiki/On_the_Origin_of_Species#/media/File: Origin_of_Species_title_page.jpg
CMPUT 366 F20: Reinforcement Learning VI 18

Agent Structure
Agent has:
an initial behaviour policy
a reward function
Encoded genetically and evolved over generations
RL modifies the initial policy using the reward signal
Ackley and Littman (1991)
CMPUT 366 F20: Reinforcement Learning VI 19

Agent’s Policy
Maps agent’s senses to probabilities for an action
Partially observable states: the agent can only see locally
Uses function approximation instead of a tabular representation
a single-layer artificial neural network
weights initially given by the agent’s genes
modifiable during lifetime using RL
modifications are not inheritable (no Lamarckism)
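A rough sketch of such a policy: a single linear layer from the sense vector to action preferences, with a softmax readout to get probabilities. The softmax readout and the names W and senses are illustrative assumptions, not the paper’s exact formulation.

    import numpy as np

    def policy_probs(senses, W):
        # W: action x feature weight matrix, initialized from the agent's genes,
        # then adjusted by RL during the agent's lifetime (changes are not inherited).
        prefs = W @ np.asarray(senses)
        exp = np.exp(prefs - prefs.max())                   # numerically stable softmax
        return exp / exp.sum()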
Ackley and Littman (1991)
CMPUT 366 F20: Reinforcement Learning VI 20

Agent’s Reward Function
Maps agent’s senses to a scalar evaluation of its state
Partially observable states: the agent can only see locally
Function approximation instead of a tabular representation
a single-layer artificial neural network
weights initially given by the agent’s genes
NOT modifiable during lifetime
Discuss: why not allow the agent to modify it?
The difference between two successive state evaluations is the reward signal
A positive difference modifies the policy to increase the probability of the previous action
A negative difference modifies the policy to decrease the probability of the previous action
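A rough sketch of the reward-and-update scheme described above. This is an illustrative reading of the idea, not Ackley and Littman’s actual code; evaluate and nudge_policy are hypothetical helpers (the fixed evaluation network and the policy-adjustment rule, respectively).

    def erl_reward_step(evaluate, nudge_policy, s_prev, s_now, a_prev):
        # Reward = change in the (evolved, fixed) evaluation of the agent's situation.
        r = evaluate(s_now) - evaluate(s_prev)
        if r > 0:
            nudge_policy(s_prev, a_prev, +1)   # make the previous action more likely
        elif r < 0:
            nudge_policy(s_prev, a_prev, -1)   # make the previous action less likely
        return r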
Ackley and Littman (1991)
CMPUT 366 F20: Reinforcement Learning VI 21

Evolutionary Reinforcement Learning (ERL)
Ackley and Littman (1991)
CMPUT 366 F20: Reinforcement Learning VI 22

Differences from Traditional Evolutionary Computation
No explicit fitness function needed
No discrete generations
Population size is not a control parameter
Simulation at a finer time scale
Artificial Life (A-life)
https://en.wikipedia.org/wiki/NetLogo
CMPUT 366 F20: Reinforcement Learning VI 23

A-life Environment
Ackley and Littman (1991)
CMPUT 366 F20: Reinforcement Learning VI 24

A-life Environment
Ackley and Littman (1991)
CMPUT 366 F20: Reinforcement Learning VI 25

Comparison of Different Learning Mechanisms
Ackley and Littman (1991)
CMPUT 366 F20: Reinforcement Learning VI 26

Summary
Learning at multiple time scales
Reinforcement learning works within the agent’s lifetime
allows for flexibility: agent can adapt to its environment quickly
requires an appropriate reward signal
learning takes time during which the agent may be vulnerable
Evolution works over multiple generations
can provide a good initial behaviour
can provide a reward signal for RL
slower in adapting to the environment (Calvin 2002)
CMPUT 366 F20: Reinforcement Learning VI 27

Bibliography
Ackley, David and Michael Littman (1991). “Interactions between learning and evolution”. In: Artificial life II 10, pp. 487–509.
Calvin, William H (2002). A brain for all seasons: Human evolution and abrupt climate change. University of Chicago Press.
CMPUT 366 F20: Reinforcement Learning VI 28