8b: Policy Learning and Deep RL
Policy Learning
Policy learning algorithms do not use a value function but instead operate directly on the policy $\pi_\theta : S \mapsto A$, chosen from a family of policies determined by parameters $\theta$.
Typically, $\pi_\theta$ is a neural network with weights $\theta$ which takes a state $s$ as input and produces an action $a$ as output, which may be either continuous or discrete. If there is a discrete choice of actions, the network has one output for each possible action and uses Softmax to determine the conditional probability $\pi_\theta(a \mid s)$ of performing action $a$ in state $s$.
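As a concrete illustration, here is a minimal sketch of such a policy network in PyTorch (the framework is an assumption; the state dimension, hidden size and number of actions are hypothetical):

# Minimal sketch of a discrete-action policy network pi_theta (PyTorch assumed;
# state_dim, hidden size and n_actions are hypothetical values).
import torch
import torch.nn as nn

state_dim, n_actions = 4, 3

policy = nn.Sequential(
    nn.Linear(state_dim, 32),
    nn.Tanh(),
    nn.Linear(32, n_actions),        # one output per possible action
)

s = torch.randn(state_dim)                          # an example state
probs = torch.softmax(policy(s), dim=-1)            # Softmax gives pi_theta(a | s)
a = torch.multinomial(probs, num_samples=1).item()  # sample an action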
For episodic domains like Backgammon, we do not need a discount factor, and the "fitness" of policy $\pi_\theta$ can be taken as the value function of the initial state $s_0$ under this policy, which is the expected (or average) total reward received in each game by an agent using policy $\pi_\theta$.

$$\text{fitness}(\pi_\theta) = V^{\pi_\theta}(s_0) = E_{\pi_\theta}(r_{\text{total}})$$
Policy Learning algorithms include Policy Gradients, which use gradient ascent to modify the parameters $\theta$, and Evolution Strategies, which make random changes to $\theta$ and keep only those updates that are seen to increase the reward.
Evolution Strategies
We will not go into the details of Evolution Strategies in this course, but this video explains the basic idea using a simple example. Further details can be found in (Blair & Sklar, 1999) and (Salimans et al., 2017).
Policy Gradients
Let us first consider episodic games where the reward is received only at the end of the episode. The agent takes a sequence of actions

$$a_1\, a_2 \ldots a_t \ldots a_m$$

At the end it receives a reward $r_{\text{total}}$. We don't know which actions contributed the most, so we just reward all of them equally. If $r_{\text{total}}$ is high (low), we change the parameters $\theta$ to make the agent more (less) likely to take the same actions in the same situations. In other words, we want to increase (decrease)

$$\log \prod_{t=1}^{m} \pi_\theta(a_t \mid s_t) = \sum_{t=1}^{m} \log \pi_\theta(a_t \mid s_t)$$
If $r_{\text{total}} = +1$ for a win and $-1$ for a loss, we can simply multiply the log probability by $r_{\text{total}}$. Differentials can be calculated using the gradient

$$\nabla_\theta\, r_{\text{total}} \sum_{t=1}^{m} \log \pi_\theta(a_t \mid s_t) = r_{\text{total}} \sum_{t=1}^{m} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
The gradient of the log probability can be calculated nicely using Softmax.
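To see why, write $z_b$ for the network output (logit) for action $b$ in state $s$; with Softmax,

$$\log \pi_\theta(a \mid s) = z_a - \log \sum_b e^{z_b}, \qquad \frac{\partial}{\partial z_b} \log \pi_\theta(a \mid s) = \delta_{ab} - \pi_\theta(b \mid s)$$

so the gradient with respect to the logits is simply the one-hot encoding of the chosen action minus the predicted probabilities, which backpropagates through the rest of the network in the usual way.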
If $r_{\text{total}}$ takes some other range of values, we can replace it with $(r_{\text{total}} - b)$, where $b$ is a fixed value called the baseline.
The REINFORCE Algorithm
We then get the following REINFORCE algorithm (Williams, 1992):

for each trial:
    run trial and collect states s_t, actions a_t, and reward r_total
    for t = 1 to length(trial):
        θ ← θ + η (r_total − b) ∇_θ log π_θ(a_t | s_t)
    end
end

This algorithm has successfully been applied, for example, to learn to play the game of Pong from raw image pixels.
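As a rough sketch of how this loop might look in PyTorch (an assumption, since the notes do not prescribe a framework), using the classic CartPole-v0 task and the pre-0.26 Gym API; the network size, learning rate and baseline are illustrative:

# Minimal REINFORCE sketch (PyTorch and the pre-0.26 OpenAI Gym API assumed).
import gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v0")                 # 4 state variables, 2 actions
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)   # step size eta
b = 0.0                                                     # fixed baseline

for trial in range(500):
    # run trial and collect log pi_theta(a_t | s_t) and r_total
    s, done, r_total, log_probs = env.reset(), False, 0.0, []
    while not done:
        dist = torch.distributions.Categorical(
            logits=policy(torch.as_tensor(s, dtype=torch.float32)))
        a = dist.sample()
        log_probs.append(dist.log_prob(a))
        s, r, done, _ = env.step(a.item())
        r_total += r
    # theta <- theta + eta (r_total - b) sum_t grad_theta log pi_theta(a_t | s_t)
    loss = -(r_total - b) * torch.stack(log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Plain SGD is used here so that each optimiser step is exactly the update rule above; in practice an optimiser such as Adam is often substituted.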
Policy Gradients with Incremental Rewards
We wish to extend the framework of Policy Gradients to non-episodic domains where rewards are received incrementally throughout the game (e.g. PacMan, Space Invaders). Every policy $\pi_\theta$ determines a distribution $\rho^{\pi_\theta}(s)$ on $S$

$$\rho^{\pi_\theta}(s) = \sum_{t \ge 0} \gamma^{t}\, \text{prob}_{\pi_\theta, t}(s)$$

where $\text{prob}_{\pi_\theta, t}(s)$ is the probability that, after starting in state $s_0$ and performing $t$ actions, the agent will be in state $s$. We can then define the fitness of policy $\pi_\theta$ as
$$\text{fitness}(\pi_\theta) = \sum_{s} \rho^{\pi_\theta}(s) \sum_{a} Q^{\pi_\theta}(s, a)\, \pi_\theta(a \mid s)$$
Note: In the case of episodic games, we can take $\gamma = 1$, in which case $Q^{\pi_\theta}(s, a)$ is simply the expected reward at the end of the game. However, the above equation holds in the non-episodic case as well. The gradients of $\rho^{\pi_\theta}(s)$ and $Q^{\pi_\theta}(s, a)$ are extremely hard to determine, so we ignore them and instead compute the gradient only for the last term $\pi_\theta(a \mid s)$
$$\nabla_\theta\, \text{fitness}(\pi_\theta) = \sum_{s} \rho^{\pi_\theta}(s) \sum_{a} Q^{\pi_\theta}(s, a)\, \nabla_\theta\, \pi_\theta(a \mid s)$$
The Log Trick
$$\sum_{a} Q^{\pi_\theta}(s, a)\, \nabla_\theta\, \pi_\theta(a \mid s) = \sum_{a} Q^{\pi_\theta}(s, a)\, \pi_\theta(a \mid s)\, \frac{\nabla_\theta\, \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)} = \sum_{a} Q^{\pi_\theta}(s, a)\, \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)$$
So
$$\nabla_\theta\, \text{fitness}(\pi_\theta) = \sum_{s} \rho^{\pi_\theta}(s) \sum_{a} Q^{\pi_\theta}(s, a)\, \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s) = E_{\pi_\theta}\big[\, Q^{\pi_\theta}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s)\, \big]$$
The reason for the last equality is this: $\rho^{\pi_\theta}(s)$ is the number of times (discounted by $\gamma^{t}$) that we expect to visit state $s$ when using policy $\pi_\theta$. Whenever state $s$ is visited, action $a$ will be chosen with probability $\pi_\theta(a \mid s)$.
Actor-Critic
Recall:

$$\nabla_\theta\, \text{fitness}(\pi_\theta) = E_{\pi_\theta}\big[\, Q^{\pi_\theta}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s)\, \big]$$
For non-episodic games, we cannot easily find a good estimate for $Q^{\pi_\theta}(s, a)$. One approach is to consider a family of Q-functions $Q_w$ determined by parameters $w$ (different from $\theta$) and learn $w$ so that $Q_w$ approximates $Q^{\pi_\theta}$, at the same time that the policy $\pi_\theta$ itself is also being learned.
This is known as an Actor-Critic approach because the policy determines the action, while the Q-function estimates how good the current policy is, and thereby plays the role of a critic.
Actor-Critic Algorithm
for each trial:
    sample a_0 from π_θ(a | s_0)
    for each timestep t do
        sample reward r_t from R(r | s_t, a_t)
        sample next state s_{t+1} from δ(s | s_t, a_t)
        sample action a_{t+1} from π_θ(a | s_{t+1})
        dE/dQ = −[ r_t + γ Q_w(s_{t+1}, a_{t+1}) − Q_w(s_t, a_t) ]
        θ ← θ + η_θ Q_w(s_t, a_t) ∇_θ log π_θ(a_t | s_t)
        w ← w − η_w (dE/dQ) ∇_w Q_w(s_t, a_t)
    end
end
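A rough PyTorch rendering of one timestep of these two updates is sketched below (illustrative only; the actor and critic modules, optimisers and the discrete-action assumption are hypothetical). The critic loss is chosen so that a gradient-descent step reproduces w ← w − η_w (dE/dQ) ∇_w Q_w(s_t, a_t).

# One timestep of the Actor-Critic updates above (illustrative sketch).
# actor(s) returns action logits; critic(s) returns one Q value per action.
import torch

def actor_critic_step(actor, critic, opt_theta, opt_w,
                      s_t, a_t, r_t, s_next, a_next, gamma):
    q_sa = critic(s_t)[a_t]                                      # Q_w(s_t, a_t)
    with torch.no_grad():
        td_error = r_t + gamma * critic(s_next)[a_next] - q_sa   # equals -dE/dQ
    # theta <- theta + eta_theta Q_w(s_t, a_t) grad_theta log pi_theta(a_t | s_t)
    log_prob = torch.log_softmax(actor(s_t), dim=-1)[a_t]
    actor_loss = -q_sa.detach() * log_prob
    opt_theta.zero_grad()
    actor_loss.backward()
    opt_theta.step()
    # w <- w - eta_w (dE/dQ) grad_w Q_w(s_t, a_t)
    critic_loss = -td_error * q_sa
    opt_w.zero_grad()
    critic_loss.backward()
    opt_w.step()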
References
Blair, A.D., & Sklar, E., 1999. Exploring evolutionary learning in a simulated hockey environment, Congress on Evolutionary Computation (CEC), 197-203.
Salimans, T., Ho, J., Chen, X., Sidor, S., & Sutskever, I., 2017. Evolution strategies as a scalable
alternative to reinforcement learning. arXiv:1703.03864.
Williams, R.J., 1992. Simple statistical gradient-following algorithms for connectionist reinforcement
learning. Machine Learning, 8(3-4), 229-256.
Deep Reinforcement Learning
Deep Q-Learning for Atari Games
Mnih et al. (2015) demonstrated how Q-Learning could be combined with deep CNNs to learn to play Atari games from raw pixels.
The input state $s$ is a stack of raw pixels from four consecutive frames. The images are converted from 8-bit RGB images at resolution 210 × 160 pixels to 84 × 84 greyscale images. The 18 outputs are the Q-values $Q(s, a)$ for the 18 different combinations of joystick/button positions. The reward is the change in score during the timestep.
Recall the Q-Learning update rule:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \eta\, \Big[\, r_t + \gamma \max_{b} Q(s_{t+1}, b) - Q(s_t, a_t)\, \Big]$$
If a lookup table is used to store the Q values for every state and action separately, the algorithm is guaranteed to eventually converge to an optimal policy. But if the number of states is exponentially large, we must instead use a neural network $Q_w$ and adjust the weights $w$ according to the Q-Learning update rule, which is equivalent to minimizing:

$$\Big[\, r_t + \gamma \max_{b} Q_w(s_{t+1}, b) - Q_w(s_t, a_t)\, \Big]^2$$

The gradient is applied only to $Q_w(s_t, a_t)$, not to $Q_w(s_{t+1}, b)$.
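In an automatic-differentiation framework, applying the gradient only to $Q_w(s_t, a_t)$ is usually done by detaching the target term. A minimal PyTorch sketch (q_net and the batched tensors are hypothetical; terminal-state handling is omitted for brevity):

# DQN loss sketch: gradient flows only through Q_w(s_t, a_t), not through the target.
import torch
import torch.nn.functional as F

def dqn_loss(q_net, s_t, a_t, r_t, s_next, gamma):
    # Q_w(s_t, a_t) for the actions actually taken in the batch
    q_sa = q_net(s_t).gather(1, a_t.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                       # no gradient through the target term
        target = r_t + gamma * q_net(s_next).max(dim=1).values
    # [ r_t + gamma max_b Q_w(s_{t+1}, b) - Q_w(s_t, a_t) ]^2, averaged over the batch
    return F.mse_loss(q_sa, target)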
Experience Replay
Training of deep neural networks for classification tasks generally requires that each mini-batch should include a variety of different inputs and target outputs. For Atari games, many similar states may occur in succession, often with the same action being selected. We can remove this temporal correlation between samples by storing experiences in a Replay Buffer.
In this scenario, one thread repeatedly plays the game, selecting its actions using an $\varepsilon$-greedy strategy based on the current Q-values, and builds a database of experiences $(s_t, a_t, r_t, s_{t+1})$. Another thread samples asynchronously from this database and applies the Q-Learning rule to minimise

$$\Big[\, r_t + \gamma \max_{b} Q_w(s_{t+1}, b) - Q_w(s_t, a_t)\, \Big]^2$$

This removes temporal correlations by sampling from a variety of game situations in random order, and also makes it easier to parallelise the algorithm on multiple GPUs.
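The buffer itself is simple to sketch in Python (an illustrative implementation, not the one used in the original work): experiences are appended as they occur, old ones are discarded, and mini-batches are drawn uniformly at random.

# Illustrative replay buffer: store (s_t, a_t, r_t, s_{t+1}) tuples, sample at random.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)     # oldest experiences are discarded

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # uniform random sampling breaks the temporal correlation between samples
        return random.sample(self.buffer, batch_size)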
Prioritised Replay
Instead of sampling experiences uniformly, it may be more efficient to store and retrieve them in a priority queue, with priority based on the DQN error (Schaul et al., 2015).

$$\Big|\, r_t + \gamma \max_{b} Q_w(s_{t+1}, b) - Q_w(s_t, a_t)\, \Big|$$

This ensures the system will concentrate more effort on situations where the Q value was "surprising" (in the sense of being far away from what was predicted).
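A very simplified way to sample in proportion to this error is sketched below (the published method additionally uses a sum-tree for efficiency and importance-sampling corrections, which are omitted here):

# Simplified proportional prioritised sampling based on |DQN error|.
import random

def sample_prioritised(experiences, td_errors, batch_size, eps=1e-3):
    priorities = [abs(e) + eps for e in td_errors]   # keep priorities non-zero
    return random.choices(experiences, weights=priorities, k=batch_size)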
Double Q-Learning
If the same weights $w$ are used to select actions and to evaluate actions (as well as states), the network may learn a suboptimal strategy due to a kind of "confirmation bias". One way to avoid this is to maintain two sets of weights $w$ and $\bar{w}$, with one used for action selection and the other for evaluation (then swap their roles).

In the context of Deep Q-Learning, a simpler approach is to use the current "online" version $w$ for selection, and an older "target" version $\bar{w}$ for evaluation (Van Hasselt et al., 2016). We therefore minimize

$$\Big[\, r_t + \gamma\, Q_{\bar{w}}\big(s_{t+1}, \arg\max_{b} Q_w(s_{t+1}, b)\big) - Q_w(s_t, a_t)\, \Big]^2$$

A new version of $\bar{w}$ is periodically calculated from the distributed values of $w$, and this $\bar{w}$ is broadcast to all processors.
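In code, the only change from the standard DQN target is that the online network chooses the arg max while the target network evaluates it. A sketch (with hypothetical q_net and q_target modules holding $w$ and $\bar{w}$):

# Double DQN target: online weights w select the action, target weights w_bar evaluate it.
import torch

def double_dqn_target(q_net, q_target, r_t, s_next, gamma):
    with torch.no_grad():
        best_a = q_net(s_next).argmax(dim=1, keepdim=True)       # arg max_b Q_w(s_{t+1}, b)
        return r_t + gamma * q_target(s_next).gather(1, best_a).squeeze(1)

The target weights $\bar{w}$ can then be refreshed periodically, for example with q_target.load_state_dict(q_net.state_dict()).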
Advantage Function
The Q-function $Q^{\pi}(s, a)$ can be written as a sum of the value function $V^{\pi}(s)$ plus an advantage function

$$A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$$

$A^{\pi}(s, a)$ represents the advantage (or disadvantage) of taking action $a$ in state $s$, compared to taking the action preferred by the current policy $\pi$. We can learn approximations for these two components separately:

$$Q(s, a) = V_u(s) + A_w(s, a)$$

Note that actions can be selected just using $A_w(s, a)$, because

$$\arg\max_{b} Q(s_{t+1}, b) = \arg\max_{b} A_w(s_{t+1}, b)$$
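A network with separate value and advantage heads, combined exactly as in this decomposition, could be sketched as follows (illustrative; the "dueling" architecture used in practice also subtracts the mean advantage, which is not covered in these notes):

# Sketch of a network computing Q(s, a) = V_u(s) + A_w(s, a) with two output heads.
import torch
import torch.nn as nn

class ValuePlusAdvantage(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # V_u(s)
        self.advantage = nn.Linear(hidden, n_actions)  # A_w(s, a)

    def forward(self, s):
        h = self.shared(s)
        return self.value(h) + self.advantage(h)       # Q(s, a), broadcast over actions

    def select_action(self, s):
        # actions can be chosen from the advantage head alone
        return self.advantage(self.shared(s)).argmax(dim=-1)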
Advantage Actor-Critic
Recall that in the REINFORCE algorithm, a baseline $b$ could be subtracted from $r_{\text{total}}$ for the purpose of variance reduction.

$$\theta \leftarrow \theta + \eta\, (r_{\text{total}} - b)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

In the actor-critic framework, $r_{\text{total}}$ is replaced by $Q(s_t, a_t)$

$$\theta \leftarrow \theta + \eta\, Q(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

We can also subtract a baseline from $Q(s_t, a_t)$. This baseline must be independent of the action $a_t$, but it could be dependent on the state $s_t$. A good choice of baseline is the value function $V_u(s)$, in which case the Q-function is replaced by the advantage function

$$A_w(s, a) = Q(s, a) - V_u(s)$$
Asynchronous Advantage Actor-Critic
The Asynchronous Advantage Actor-Critic or A3C algorithm (Mnih et al., 2016) combines a policy network $\pi_\theta$, a value function network $V_u$ and an (estimated) Q-function (the updates are sketched in code after this list):

use the policy network $\pi_\theta$ to choose actions

learn a parameterised value function $V_u(s)$ by TD-learning

estimate the Q-value by an n-step sample

$$Q(s_t, a_t) = r_{t+1} + \gamma r_{t+2} + \ldots + \gamma^{n-1} r_{t+n} + \gamma^{n} V_u(s_{t+n})$$

update the policy $\pi_\theta$ by

$$\theta \leftarrow \theta + \eta_\theta\, \big[\, Q(s_t, a_t) - V_u(s_t)\, \big]\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

update the value function $V_u$ by minimising

$$\big[\, Q(s_t, a_t) - V_u(s_t)\, \big]^2$$
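Here is a rough, single-worker sketch of those updates in PyTorch (much simplified relative to the full asynchronous algorithm; the policy and value_fn modules, optimisers and the discrete-action assumption are hypothetical):

# Simplified single-worker sketch of the A3C-style updates: an n-step return is
# used as the Q estimate, the policy is updated with the advantage, and the value
# function is regressed toward the same n-step return.
import torch

def n_step_update(policy, value_fn, opt_theta, opt_u, states, actions, rewards, gamma):
    # states = [s_t, ..., s_{t+n}], actions = [a_t, ...], rewards = [r_{t+1}, ..., r_{t+n}]
    # value_fn(s) is assumed to return a scalar tensor V_u(s); policy(s) returns action logits.
    with torch.no_grad():
        q = value_fn(states[-1])                 # bootstrap from V_u(s_{t+n})
        for r in reversed(rewards):
            q = r + gamma * q                    # n-step return, used as Q(s_t, a_t)
        advantage = q - value_fn(states[0])      # Q(s_t, a_t) - V_u(s_t)
    # policy update: theta <- theta + eta_theta [Q - V_u] grad_theta log pi_theta(a_t | s_t)
    log_prob = torch.log_softmax(policy(states[0]), dim=-1)[actions[0]]
    policy_loss = -advantage * log_prob
    opt_theta.zero_grad()
    policy_loss.backward()
    opt_theta.step()
    # value update: minimise [Q(s_t, a_t) - V_u(s_t)]^2
    value_loss = (q - value_fn(states[0])) ** 2
    opt_u.zero_grad()
    value_loss.backward()
    opt_u.step()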
References
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., et al. 2015.
Human-level control through deep reinforcement learning, Nature 518(7540) 529-533.
Schaul, T., Quan, J., Antonoglou, I., & Silver, D., 2015. Prioritized experience replay, arXiv:1511.05952.
Van Hasselt, H., Guez, A., & Silver, D., 2016. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 30, No. 1).
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Harley, T., Lillicrap, T., Silver, D., & Kavukcuoglu, K., 2016.
Asynchronous methods for deep reinforcement learning. In International Conference on Machine
Learning (pp. 1928-1937).
Examples of Deep Reinforcement Learning
Dr Alan Blair discusses a variety of examples of Deep Reinforcement Learning in practice.
Mr Alex Long discusses frameworks for understanding Deep Reinforcement Learning with
demonstrations.
Quiz 7: Reinforcement Learning
Question 1
Describe the elements (sets and functions) that are needed to give a formal description of a reinforcement learning environment. What is the difference between a deterministic environment and a stochastic environment?

Question 2
Name three different models of optimality in reinforcement learning, and give a formula for calculating each one.

Question 3
What is the definition of:
the optimal policy
the value function
the Q-function?

Question 4
Assuming a stochastic environment, discount factor $\gamma$ and learning rate $\eta$, write the equation for:
Temporal Difference learning TD(0)
Q-Learning
Remember to define any symbols you use.

Question 5
Write out the steps in the REINFORCE algorithm, making sure to define any symbols you use.

Question 6
In the context of Deep Q-Learning, explain the following:
Experience Replay
Double Q-Learning
Coding: Deep Q-Learning
In this exercise, you will practise using PyTorch to train a Deep Q-Learning (DQN) agent on the CartPole-v0 task from OpenAI Gym.
Please download the notebook and implement/run it locally.
Installation instructions: https://gym.openai.com/docs/
This exercise is modified from Adam Paszke's reinforcement learning (DQN) tutorial, with some necessary simplifications.
Week 8 Thursday video