COMP9444
Neural Networks and Deep Learning
7b. Deep Reinforcement Learning

Outline
◮ Policy Learning
  ◮ Evolution Strategies
  ◮ Policy Gradients
◮ Actor-Critic
◮ History of Reinforcement Learning
◮ Deep Q-Learning for Atari Games
◮ Asynchronous Advantage Actor Critic (A3C)
Hill Climbing (Evolution Strategy)
◮ Initialize “champ” policy θchamp = 0
◮ for each trial, generate a “mutant” policy
    θmutant = θchamp + Gaussian noise (fixed σ)
◮ champ and mutant are evaluated on the same task(s)
◮ if the mutant does “better” than the champ,
    θchamp ← (1 − α) θchamp + α θmutant
◮ in some cases, the size of the update is scaled according to the difference in fitness (and may be negative)
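A minimal sketch of this hill-climbing loop, assuming a hypothetical evaluate(theta) fitness function that runs the policy parameterised by theta on the task(s) and returns a scalar score:

```python
import numpy as np

def hill_climb(evaluate, dim, sigma=0.1, alpha=0.2, trials=1000):
    """Hill climbing over policy parameters (a simple Evolution Strategy)."""
    theta_champ = np.zeros(dim)                 # initialise "champ" policy to zero
    for _ in range(trials):
        # mutant = champ + Gaussian noise with fixed standard deviation sigma
        theta_mutant = theta_champ + sigma * np.random.randn(dim)
        # champ and mutant are evaluated on the same task(s)
        if evaluate(theta_mutant) > evaluate(theta_champ):
            # move the champ a fraction alpha towards the successful mutant
            theta_champ = (1 - alpha) * theta_champ + alpha * theta_mutant
    return theta_champ
```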
Case Study – Simulated Hockey
Shock Sensors

Shock Inputs
◮ 6 Braitenberg-style sensors equally spaced around the vehicle
◮ each sensor has an angular range of 90°, with an overlap of 30° between neighbouring sensors
◮ 3 additional inputs specify the current velocity of the vehicle
◮ total of 3×6+3 = 21 inputs
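A small sketch of how the 21-dimensional input vector could be assembled, assuming each sensor reports three activation values (one per stimulus type, described on the Shock Agent slide that follows) and the velocity is given as three components; all names here are illustrative:

```python
import numpy as np

def shock_inputs(sensor_readings, velocity):
    """Build the 21-dimensional input: 6 sensors x 3 stimuli + 3 velocity components."""
    assert len(sensor_readings) == 6 and all(len(s) == 3 for s in sensor_readings)
    x = np.concatenate([np.asarray(sensor_readings, dtype=float).ravel(),
                        np.asarray(velocity, dtype=float)])
    assert x.shape == (21,)
    return x
```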
Shock Agent
(figure: the agent’s network – 21 inputs, comprising sensors 0–5, each detecting puck, enemy goal and friendly goal, plus the velocity inputs, feed an output vector z controlling longitudinal (left skate), longitudinal (right skate) and lateral motion)

each of the 6 sensors responds to three different stimuli
◮ ball / puck
◮ own goal
◮ opponent goal

Policy Gradients
Policy Gradients are an alternative to Evolution Strategies, using gradient ascent rather than random updates.

Let’s first consider episodic games. The agent takes a sequence of actions a1, a2, …, at, …, am. At the end it receives a reward rtotal. We don’t know which actions contributed the most, so we just reward all of them equally. If rtotal is high (low), we change the parameters θ to make the agent more (less) likely to take the same actions in the same situations. In other words, we want to increase (decrease)

log ∏_{t=1}^{m} πθ(at|st) = ∑_{t=1}^{m} log πθ(at|st)
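As a concrete illustration, a sketch of this quantity for a linear softmax policy; the function and variable names are illustrative, not from the lecture:

```python
import numpy as np

def policy_probs(theta, s):
    """pi_theta(.|s) for a linear softmax policy; theta has shape (n_actions, n_features)."""
    z = theta @ s
    z = z - z.max()                       # numerical stability
    e = np.exp(z)
    return e / e.sum()

def episode_log_prob(theta, states, actions):
    """sum_t log pi_theta(a_t|s_t): pushed up (down) when r_total is high (low)."""
    return sum(np.log(policy_probs(theta, s)[a]) for s, a in zip(states, actions))
```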
Policy Gradients
If rtotal = +1 for a win and −1 for a loss, we can simply multiply the log probability by rtotal. Differentials can be calculated using the gradient

∇θ [ rtotal ∑_{t=1}^{m} log πθ(at|st) ] = rtotal ∑_{t=1}^{m} ∇θ log πθ(at|st)

The gradient of the log probability can be calculated nicely using Softmax.

If rtotal takes some other range of values, we can replace it with (rtotal − b), where b is a fixed value called the baseline.

REINFORCE Algorithm
We then get the following REINFORCE algorithm:

for each trial
    run trial and collect states st, actions at, and reward rtotal
    for t = 1 to length(trial)
        θ ← θ + η (rtotal − b) ∇θ log πθ(at|st)
    end
end
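A minimal REINFORCE sketch for a linear softmax policy, assuming a hypothetical run_trial(theta) helper that plays one episode and returns the visited states, the actions taken, and rtotal; the gradient of the log-probability uses the softmax identity ∇z log softmax(z)[a] = onehot(a) − softmax(z):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce(run_trial, n_features, n_actions,
              eta=0.01, baseline=0.0, trials=1000):
    """REINFORCE with a linear softmax policy pi_theta(a|s) = softmax(theta @ s)."""
    theta = np.zeros((n_actions, n_features))
    for _ in range(trials):
        # run trial and collect states s_t, actions a_t, and reward r_total
        states, actions, r_total = run_trial(theta)
        for s, a in zip(states, actions):
            probs = softmax(theta @ s)
            # grad of log pi(a|s) w.r.t. the logits: onehot(a) - probs
            grad_logits = -probs
            grad_logits[a] += 1.0
            # theta <- theta + eta (r_total - b) grad log pi(a_t|s_t)
            theta += eta * (r_total - baseline) * np.outer(grad_logits, s)
    return theta
```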
The REINFORCE algorithm has successfully been applied, for example, to learn to play the game of Pong from raw image pixels.

Policy Gradients
We wish to extend the framework of Policy Gradients to non-episodic domains, where rewards are received incrementally throughout the game (e.g. PacMan, Space Invaders).

Every policy πθ determines a distribution ρπθ(s) on S:

ρπθ(s) = ∑_{t≥0} γ^t probπθ,t(s)

where probπθ,t(s) is the probability that, after starting in state s0 and performing t actions, the agent will be in state s. We can then define the fitness of policy πθ as

fitness(πθ) = ∑_s ρπθ(s) ∑_a Qπθ(s,a) πθ(a|s)

Note: in the case of episodic games, we can take γ = 1, in which case Qπθ(s,a) is simply the expected reward at the end of the game. However, the above equation holds in the non-episodic case as well.

The gradients of ρπθ(s) and Qπθ(s,a) are extremely hard to determine, so we ignore them and instead compute the gradient only for the last term πθ(a|s):

∇θ fitness(πθ) = ∑_s ρπθ(s) ∑_a Qπθ(s,a) ∇θ πθ(a|s)
The Log Trick

∑_a Qπθ(s,a) ∇θ πθ(a|s) = ∑_a Qπθ(s,a) πθ(a|s) [∇θ πθ(a|s) / πθ(a|s)]
                        = ∑_a Qπθ(s,a) πθ(a|s) ∇θ log πθ(a|s)

So

∇θ fitness(πθ) = ∑_s ρπθ(s) ∑_a Qπθ(s,a) πθ(a|s) ∇θ log πθ(a|s)
               = Eπθ[ Qπθ(s,a) ∇θ log πθ(a|s) ]

The reason for the last equality is this: ρπθ(s) is the number of times (discounted by γ^t) that we expect to visit state s when using policy πθ. Whenever state s is visited, action a will be chosen with probability πθ(a|s).

Actor-Critic
Recall:

∇θ fitness(πθ) = Eπθ[ Qπθ(s,a) ∇θ log πθ(a|s) ]

For non-episodic games, we cannot easily find a good estimate for Qπθ(s,a). One approach is to consider a family of Q-Functions Qw determined by parameters w (different from θ), and learn w so that Qw approximates Qπθ at the same time that the policy πθ itself is also being learned.

This is known as an Actor-Critic approach, because the policy determines the action, while the Q-Function estimates how good the current policy is, and thereby plays the role of a critic.
Actor Critic Algorithm

for each trial
    sample a0 from π(a|s0)
    for each timestep t do
        sample reward rt from R(r|st, at)
        sample next state st+1 from δ(s|st, at)
        sample action at+1 from π(a|st+1)
        dE/dQ = −[ rt + γ Qw(st+1, at+1) − Qw(st, at) ]
        θ ← θ + ηθ Qw(st, at) ∇θ log πθ(at|st)
        w ← w − ηw (dE/dQ) ∇w Qw(st, at)
    end
end
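A sketch of this actor-critic loop with linear function approximation and a softmax policy, assuming a hypothetical Gym-style env object whose reset() returns a feature vector and whose step(a) returns (next_state, reward, done):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def actor_critic(env, n_features, n_actions, gamma=0.99,
                 eta_theta=0.01, eta_w=0.01, trials=500):
    """Q actor-critic: softmax policy (actor) and linear Q-function (critic)."""
    theta = np.zeros((n_actions, n_features))   # policy parameters
    w = np.zeros((n_actions, n_features))       # Q-function parameters
    for _ in range(trials):
        s = env.reset()
        a = np.random.choice(n_actions, p=softmax(theta @ s))   # sample a0 from pi(a|s0)
        done = False
        while not done:
            s_next, r, done = env.step(a)                        # sample reward and next state
            a_next = np.random.choice(n_actions, p=softmax(theta @ s_next))
            # TD error: dE/dQ = -[ r + gamma Q_w(s',a') - Q_w(s,a) ]
            q_sa = w[a] @ s
            q_next = 0.0 if done else w[a_next] @ s_next
            dE_dQ = -(r + gamma * q_next - q_sa)
            # actor update: theta <- theta + eta_theta Q_w(s,a) grad log pi(a|s)
            probs = softmax(theta @ s)
            grad_logits = -probs
            grad_logits[a] += 1.0
            theta += eta_theta * q_sa * np.outer(grad_logits, s)
            # critic update: w <- w - eta_w (dE/dQ) grad_w Q_w(s,a)
            w[a] -= eta_w * dE_dQ * s
            s, a = s_next, a_next
    return theta, w
```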
Reinforcement Learning Timeline

model-free methods
◮ 1961 MENACE tic-tac-toe (Donald Michie)
◮ 1986 TD(λ) (Rich Sutton)
◮ 1989 TD-Gammon (Gerald Tesauro)
◮ 2015 Deep Q-Learning for Atari Games
◮ 2016 A3C (Mnih et al.)
◮ 2017 OpenAI Evolution Strategies (Salimans et al.)

methods relying on a world model
◮ 1959 Checkers (Arthur Samuel)
◮ 1997 TD-leaf (Baxter et al.)
◮ 2009 TreeStrap (Veness et al.)
◮ 2016 AlphaGo (Silver et al.)

MENACE
Machine Educable Noughts And Crosses Engine (Donald Michie, 1961)
Game Tree (2-player, deterministic)
(figure: tic-tac-toe game tree – MAX (X) and MIN (O) move alternately, expanding the board positions down to terminal states with utility −1, 0 or +1)

Martin Gardner and HALO
Hexapawn Boxes
Reinforcement Learning with BOXES
◮ this BOXES algorithm was later adapted to learn more general tasks such as Pole Balancing, and helped lay the foundation for the modern field of Reinforcement Learning
◮ for various reasons, interest in Reinforcement Learning faded in the late 1970s and early 1980s, but was revived in the late 1980s, largely through the work of Richard Sutton
◮ Gerald Tesauro applied Sutton’s TD-Learning algorithm to the game of Backgammon in 1989

Deep Q-Learning for Atari Games

Deep Q-Network
◮ end-to-end learning of values Q(s,a) from pixels s
◮ input state s is a stack of raw pixels from the last 4 frames
  ◮ 8-bit RGB images, 210×160 pixels
◮ output is Q(s,a) for 18 joystick/button positions
◮ reward is the change in score for that timestep
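A sketch of a DQN-style network in PyTorch; it assumes the preprocessing used by Mnih et al. (frames downsampled to 84×84 greyscale and stacked 4 deep) rather than the raw 210×160 RGB images above, and the layer sizes follow the 2015 Nature DQN, so they are indicative only:

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Maps a stack of 4 preprocessed 84x84 frames to Q(s,a) for each action."""
    def __init__(self, n_actions=18):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),          # one Q-value per joystick/button action
        )

    def forward(self, x):                       # x: (batch, 4, 84, 84), values in [0,1]
        return self.head(self.conv(x))

# q_values = DQN()(torch.zeros(1, 4, 84, 84))   # shape (1, 18)
```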
Q-Learning

Q(st,at) ← Q(st,at) + η [ rt + γ max_b Q(st+1,b) − Q(st,at) ]

◮ with a lookup table, Q-Learning is guaranteed to eventually converge
◮ for serious tasks, there are too many states for a lookup table
◮ instead, Qw(s,a) is parametrized by weights w, which get updated so as to minimize

    [ rt + γ max_b Qw(st+1,b) − Qw(st,at) ]²

  ◮ note: the gradient is applied only to Qw(st,at), not to Qw(st+1,b)
◮ this works well for some tasks, but is challenging for Atari games, partly due to temporal correlations between samples (i.e. a large number of similar situations occurring one after another)

Deep Q-Learning with Experience Replay
◮ choose actions using the current Q-function (ε-greedy)
◮ build a database of experiences (st, at, rt, st+1)
◮ sample asynchronously from the database and apply the update, so as to minimize

    [ rt + γ max_b Qw(st+1,b) − Qw(st,at) ]²

◮ this removes temporal correlations by sampling from a variety of game situations in random order
◮ it also makes it easier to parallelize the algorithm on multiple GPUs
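A sketch of the replay-buffer update in PyTorch, assuming a q_net module such as the DQN class above; the target uses max_b Qw(st+1,b) under torch.no_grad(), so the gradient is applied only to Qw(st,at):

```python
import random
from collections import deque
import torch
import torch.nn.functional as F

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)   # random order breaks temporal correlations
        s, a, r, s_next, done = zip(*batch)
        return (torch.stack(s), torch.tensor(a), torch.tensor(r, dtype=torch.float32),
                torch.stack(s_next), torch.tensor(done, dtype=torch.float32))

def dqn_update(q_net, optimizer, buffer, batch_size=32, gamma=0.99):
    if len(buffer.buffer) < batch_size:
        return
    s, a, r, s_next, done = buffer.sample(batch_size)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)      # Q_w(s_t, a_t)
    with torch.no_grad():                                     # no gradient through the target
        target = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values
    loss = F.mse_loss(q_sa, target)                           # [r + gamma max_b Q_w(s',b) - Q_w(s,a)]^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```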
DQN Results for Atari Games

DQN Improvements
◮ Prioritised Replay
  ◮ weight experience according to surprise
◮ Double Q-Learning
  ◮ current Q-network w is used to select actions
  ◮ older Q-network w̄ is used to evaluate actions
◮ Advantage Function
  ◮ action-independent value function Vu(s)
  ◮ action-dependent advantage function Aw(s,a)
    Q(s,a) = Vu(s) + Aw(s,a)

Prioritised Replay
◮ instead of sampling experiences uniformly, store them in a priority queue according to the DQN error

    | rt + γ max_b Qw(st+1,b) − Qw(st,at) |

◮ this ensures the system will concentrate more effort on situations where the Q value was “surprising” (in the sense of being far away from what was predicted)

Double Q-Learning
◮ if the same weights w are used to select actions and to evaluate actions, this can lead to a kind of confirmation bias
◮ we could maintain two sets of weights w and w̄, with one used for selection and the other for evaluation (then swap their roles)
◮ in the context of Deep Q-Learning, a simpler approach is to use the current “online” version w for selection, and an older “target” version w̄ for evaluation; we therefore minimize

    [ rt + γ Qw̄(st+1, argmax_b Qw(st+1,b)) − Qw(st,at) ]²

◮ a new version of w̄ is periodically calculated from the distributed values of w, and this w̄ is broadcast to all processors
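A sketch of the Double Q-Learning target in PyTorch, assuming online_net and target_net are two copies of a Q-network such as the DQN class above; the online network selects the action, the older target network evaluates it:

```python
import copy
import torch

def double_dqn_target(online_net, target_net, r, s_next, done, gamma=0.99):
    """Target r + gamma Q_target(s', argmax_b Q_online(s',b)) for non-terminal s'."""
    with torch.no_grad():
        best_action = online_net(s_next).argmax(dim=1, keepdim=True)   # selection: online w
        q_eval = target_net(s_next).gather(1, best_action).squeeze(1)  # evaluation: older target w
        return r + gamma * (1 - done) * q_eval

# periodically refresh the target network from the online weights, e.g.
# target_net = copy.deepcopy(online_net)
# or: target_net.load_state_dict(online_net.state_dict())
```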
Advantage Function
The Q-Function Qπ(s,a) can be written as a sum of the value function Vπ(s) plus an advantage function Aπ(s,a) = Qπ(s,a) − Vπ(s).

Aπ(s,a) represents the advantage (or disadvantage) of taking action a in state s, compared to taking the action preferred by the current policy π. We can learn approximations for these two components separately:

Q(s,a) = Vu(s) + Aw(s,a)

Note that actions can be selected just using Aw(s,a), because

argmax_b Q(st+1,b) = argmax_b Aw(st+1,b)

Advantage Actor Critic
Recall that in the REINFORCE algorithm, a baseline b could be subtracted from rtotal:

θ ← θ + η (rtotal − b) ∇θ log πθ(at|st)

In the Actor-Critic framework, rtotal is replaced by Q(st,at):

θ ← θ + ηθ Q(st,at) ∇θ log πθ(at|st)

We can also subtract a baseline from Q(st,at). This baseline must be independent of the action at, but it could be dependent on the state st. A good choice of baseline is the value function Vu(s), in which case the Q function is replaced by the advantage function

Aw(s,a) = Q(s,a) − Vu(s)
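One way (not spelled out in the slides) to realise the decomposition Q(s,a) = Vu(s) + Aw(s,a) is with two output heads on a shared network, as in dueling-style architectures; the sketch below follows the slide's simple additive form, with illustrative names and sizes:

```python
import torch
import torch.nn as nn

class ValueAdvantageNet(nn.Module):
    """Q(s,a) = V_u(s) + A_w(s,a), with separate value and advantage heads."""
    def __init__(self, n_features, n_actions, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.value_head = nn.Linear(hidden, 1)               # action-independent V_u(s)
        self.advantage_head = nn.Linear(hidden, n_actions)   # action-dependent A_w(s,a)

    def forward(self, s):
        h = self.trunk(s)
        v = self.value_head(h)               # shape (batch, 1)
        a = self.advantage_head(h)           # shape (batch, n_actions)
        return v + a                         # broadcasts to (batch, n_actions)

# actions can be selected from the advantage head alone, since
# argmax_b Q(s,b) = argmax_b A_w(s,b)
```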
Asynchronous Advantage Actor Critic
◮ use a policy network to choose actions
◮ learn a parameterized Value function Vu(s) by TD-Learning
◮ estimate the Q-value by an n-step sample

    Q(st,at) = rt+1 + γ rt+2 + … + γ^{n−1} rt+n + γ^n Vu(st+n)

◮ update the policy by

    θ ← θ + ηθ [Q(st,at) − Vu(st)] ∇θ log πθ(at|st)

◮ update the Value function by minimizing

    [Q(st,at) − Vu(st)]²
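A single-worker sketch of the n-step advantage update (the asynchronous, multi-worker part of A3C is omitted), assuming policy_net and value_net are small PyTorch modules, the optimizer covers both networks' parameters, and the n-step batch of states, actions and rewards has already been collected:

```python
import torch
import torch.nn.functional as F

def a3c_style_update(policy_net, value_net, optimizer,
                     states, actions, rewards, state_n, gamma=0.99):
    """One n-step advantage actor-critic update for a single worker.

    states:  tensor (n, n_features)          -- s_t ... s_{t+n-1}
    actions: long tensor (n,) of action ids  -- a_t ... a_{t+n-1}
    rewards: list of n floats                -- r_{t+1} ... r_{t+n}
    state_n: tensor (n_features,)            -- s_{t+n}, used to bootstrap
    """
    with torch.no_grad():
        ret = value_net(state_n)                  # gamma^n V_u(s_{t+n}) bootstrap
    returns = []
    for r in reversed(rewards):                   # Q(s_t,a_t) = r_{t+1} + gamma r_{t+2} + ...
        ret = r + gamma * ret
        returns.append(ret)
    returns = torch.stack(list(reversed(returns))).squeeze(-1)   # shape (n,)

    values = value_net(states).squeeze(-1)        # V_u(s_t), shape (n,)
    advantage = returns - values                  # Q(s_t,a_t) - V_u(s_t)

    log_probs = F.log_softmax(policy_net(states), dim=1)
    log_pi = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

    policy_loss = -(advantage.detach() * log_pi).mean()   # ascend advantage-weighted log pi
    value_loss = advantage.pow(2).mean()                  # minimise [Q - V_u]^2

    optimizer.zero_grad()
    (policy_loss + value_loss).backward()
    optimizer.step()
```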
Latest Research in Deep RL
◮ augment A3C with unsupervised auxiliary tasks
  ◮ encourage actions for which the rewards are less predictable
  ◮ concentrate on state features from which the preceding action is more predictable
◮ encourage exploration, increased entropy
◮ transfer learning (between tasks)
◮ inverse reinforcement learning (infer rewards from policy)
◮ hierarchical RL
◮ multi-agent RL
References

David Silver, Deep Reinforcement Learning Tutorial, http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf

A Brief Survey of Deep Reinforcement Learning, https://arxiv.org/abs/1708.05866

Asynchronous Methods for Deep Reinforcement Learning, https://arxiv.org/abs/1602.01783

Evolution Strategies as a Scalable Alternative to Reinforcement Learning, https://arxiv.org/abs/1703.03864

Eric Jang, Beginner's Guide to Variational Methods, http://blog.evjang.com/2016/08/variational-bayes.html