Deep Reinforcement Learning
Dr Chang Xu
School of Computer Science
Deep RL Breakthroughs
From the talk Introduction to Deep Reinforcement Learning From Theory to Applications by Siyi Li (HKUST)
At human-level or above
Reinforcement Learning (RL) in a nutshell
– RL is a general-purpose framework for decision making
– RL is for an agent to interact with an environment
– Each action influences the agent’s future state
– Feedback is given by a scalar reward signal
– Goal: select actions to maximize future reward
From the talk Introduction to Deep Reinforcement Learning From Theory to Applications by Siyi Li (HKUST)
Markov Decision Process (MDP)
– set of states S, set of actions A, initial state S0
– transition model P(s, a, s’)
– P( frame(t), right, frame(t’)) = 0.8
– reward function r(s)
– r(frame(t)) = +1
– goal: maximize cumulative reward in the long run
– policy: mapping from S to A
– a = π(s) or π(s, a) (deterministic vs. stochastic)
– reinforcement learning
– transitions and rewards usually not available
– how to change the policy based on experience
– how to explore the environment
[Diagram: agent-environment loop. The agent sends an action to the environment and receives a reward and a new state.]
Markov Decision Processes (MDPs)
From the Tutorial: Deep Reinforcement Learning by David Silver, Google DeepMind
Computing return from rewards
– episodic (vs. continuing) tasks
– “game over” after N steps
– optimal policy depends on N; harder to analyze
– additive rewards
– V(s0, s1, …) = r(s0) + r(s1) + r(s2) + …
– infinite value for continuing tasks
– discounted rewards
– V(s0, s1, …) = r(s0) + γ·r(s1) + γ²·r(s2) + …
– value bounded if rewards bounded
The goal of RL is to find the policy which maximizes the expected return.
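As a concrete sketch (not from the slides), the discounted return can be computed by folding the reward sequence backwards; the reward list and γ below are made up for illustration:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute r(s0) + gamma*r(s1) + gamma^2*r(s2) + ... for one episode."""
    g = 0.0
    # Work backwards through the episode: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: rewards 1, 0, 1 with gamma = 0.9 give 1 + 0.9*0 + 0.81*1 = 1.81
print(discounted_return([1, 0, 1], gamma=0.9))
```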
Value Function
– A value function is a prediction of the future return
– Two definitions exist for the value function
– State value function
– “How much reward will I get from state s?”
– expected return when starting in s and following π
– State-action value function
– “How much reward will I get from action a in state s?”
– expected return when starting in s, performing a, and following π
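In symbols, with R_t denoting the discounted return from time t (as on the previous slide), the two definitions read:

    V_π(s)    = E_π[ R_t | s_t = s ]
    Q_π(s, a) = E_π[ R_t | s_t = s, a_t = a ]
    where R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + …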
Bellman Equation and Optimality
– Value functions decompose into Bellman equations: the value function equals the immediate reward plus the discounted value of the successor state.
– An optimal value function is the maximum achievable value.
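Written out for a fixed policy π, the standard Bellman expectation equations are:

    V_π(s)    = E_π[ r_{t+1} + γ V_π(s_{t+1}) | s_t = s ]
    Q_π(s, a) = E_π[ r_{t+1} + γ Q_π(s_{t+1}, a_{t+1}) | s_t = s, a_t = a ]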
Bellman Optimality Equation
– Optimality for value functions is governed by the Bellman
optimality equations.
– Two equations:
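In their standard form, the two Bellman optimality equations are:

    V*(s)    = max_a E[ r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a ]
    Q*(s, a) = E[ r_{t+1} + γ max_{a'} Q*(s_{t+1}, a') | s_t = s, a_t = a ]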
Q-Learning
Initializing the Q(s, a) function
Q-table: rows are the actions (Noop, Fire, Right, Left), columns are the states 1–20; every entry Q(s, a) is initialized to 0.
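A minimal sketch of this table in NumPy, matching the 4 actions and 20 states shown above:

```python
import numpy as np

ACTIONS = ["Noop", "Fire", "Right", "Left"]
N_STATES = 20

# One row per action, one column per state, every Q(s, a) initialized to 0
Q = np.zeros((len(ACTIONS), N_STATES))
```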
Q-Learning
Initialize Q(s, a) arbitrarily.
Start with s.
Before taking action a, we calculate the current expected return as
Q(s, a)
After taking action a, and observing r and s', we calculate the target expected return as
r(s, a, s') + γ max_a' Q(s', a')
The difference between the two gives the update
ΔQ(s, a) = r(s, a, s') + γ max_a' Q(s', a') − Q(s, a)
Q(s, a) ← Q(s, a) + α ΔQ(s, a)
Q-Learning
Initialize Q(s, a) arbitrarily.
Repeat (for each episode)
Initialize s
Repeat (for each step of the episode)
Choose a from s using a policy
Take action a, observe r, s’
Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') − Q(s, a)]
s ← s'
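A minimal tabular implementation of this loop, assuming a hypothetical environment object `env` with integer states and actions, a `reset()` method, and a `step(a)` method returning `(s_next, r, done)`; these names are illustrative, not a specific library API:

```python
import numpy as np

def q_learning(env, n_states, n_actions, n_episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy behaviour policy (exploration vs. exploitation)
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Q-learning update: move Q(s, a) towards r + gamma * max_a' Q(s', a')
            target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```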
Exploration and Exploitation
Initialize Q(s, a) arbitrarily.
Repeat (for each episode)
Initialize s
Repeat (for each step of the episode)
Choose a from s using a policy
Take action a, observe r, s’
Update Q and s….
Random policy: pure exploration
Greedy policy: pure exploitation, π(s) = argmax_a Q(s, a)
ε-greedy policy: with probability ε select a random action, otherwise select the greedy action
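The ε-greedy rule as a small helper (a sketch; `Q` is assumed to be an array indexed by `[state, action]`):

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon=0.1):
    """With probability epsilon explore (random action), otherwise exploit (greedy action)."""
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])   # explore: any action uniformly
    return int(np.argmax(Q[s]))                # exploit: current best action
```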
Deep RL: Deep Learning + RL
– Traditional RL
• low-dimensional state spaces
• handcrafted features
– DL’s representation power + RL’s generalization ability
– RL defines the objective
– Deep Learning learns the representation
Deep Q-Learning
Deep Q-Learning
A frame of 210×160 pixels with 256 possible values per pixel gives #states = 256^(210×160), far too many to store in a table.
Value Function Approximation: Q(s, a) = f(s, a, w)
Q-Networks
– Represent the state-action value function (for discrete actions) by a Q-network with weights w
From the Tutorial: Deep Reinforcement Learning by David Silver, Google DeepMind
Deep Q-Learning
– End-to-end learning of state-action values from raw pixels
– Input state is stack of raw pixels from last 4 frames
– Outputs are state-action values for all possible actions (see the sketch below)
From the Tutorial: Deep Reinforcement Learning by David Silver, Google DeepMind
Actions: 'NOOP', 'FIRE', 'RIGHT', 'LEFT'
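A sketch of such a Q-network in PyTorch, following the description above (a stack of 4 preprocessed frames in, one Q-value per action out); the layer sizes are illustrative choices, not necessarily those of the original DQN:

```python
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, n_actions=4):                # NOOP, FIRE, RIGHT, LEFT
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),  # 4 stacked frames as input channels
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(512), nn.ReLU(),          # infers the flattened size at first call
            nn.Linear(512, n_actions),              # one state-action value per action
        )

    def forward(self, x):                           # x: (batch, 4, H, W) pixel tensor
        return self.head(self.conv(x))
```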
Deep Q-Networks (DQN)
– Optimal Q-values obey the Bellman equation
– Treat the right-hand side as a target and minimize the MSE loss by SGD (written out below)
– Divergence issues using neural networks due to
– Correlations between samples
– Non-stationary targets
From the Tutorial: Deep Reinforcement Learning by David Silver, Google DeepMind
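Spelled out, the loss that is minimized is the squared distance to that target (the target term is treated as a constant when taking gradients):

    L(w) = E[ ( r + γ max_{a'} Q(s', a'; w) − Q(s, a; w) )² ]
           where the first part, r + γ max_{a'} Q(s', a'; w), is the target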
Experience replay
– Build a data set from the agent's own experience
– Sample experiences uniformly from the data set to remove correlations
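A minimal replay-buffer sketch; the transition layout (s, a, r, s', done) is an assumption for illustration:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)        # oldest experiences are dropped automatically

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # uniform sampling breaks the temporal correlation between consecutive transitions
        return random.sample(self.buffer, batch_size)
```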
Algorithm
From the Tutorial: Deep Reinforcement Learning by David Silver, Google DeepMind
ε-greedy
Improvements: Target Network
– To deal with non-stationary targets, the target network parameters w⁻ are held fixed and only updated periodically from the online network
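A sketch of how the fixed target parameters are used and refreshed, reusing the hypothetical `QNetwork` from the earlier sketch (`done` is assumed to be a float 0/1 tensor):

```python
import torch

def dqn_target(target_net, r, s_next, done, gamma=0.99):
    """Compute r + gamma * max_a' Q(s', a'; w-) with the frozen target network."""
    with torch.no_grad():                           # no gradients flow through w-
        return r + gamma * target_net(s_next).max(dim=1).values * (1 - done)

def sync_target(online_net, target_net):
    """Every C steps, copy the online weights into the target network; in between they stay fixed."""
    target_net.load_state_dict(online_net.state_dict())
```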
Improvements: Double DQN
– Q-learning is known to overestimate state-action values
– The max operator uses the same values to select and evaluate an action
– The upward bias can be removed by decoupling the selection
from the evaluation
– Current Q-network is used to select actions
– Older Q-network is used to evaluate actions
From the talk Introduction to Deep Reinforcement Learning From Theory to Applications by Siyi Li (HKUST)
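The decoupling in code (a sketch reusing the two hypothetical networks above): the current/online network selects the argmax action, while the older/target network evaluates it:

```python
import torch

def double_dqn_target(online_net, target_net, r, s_next, done, gamma=0.99):
    with torch.no_grad():
        best_a = online_net(s_next).argmax(dim=1, keepdim=True)   # selection: current network
        q_eval = target_net(s_next).gather(1, best_a).squeeze(1)  # evaluation: older network
        return r + gamma * q_eval * (1 - done)                    # done: float 0/1 tensor
```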
Improvements: Prioritized Replay
– Uniform experience replay samples transitions regardless of
their significance
– Can weight experiences according to their significance
– Prioritized replay stores experiences in a priority queue
according to the TD error
– Use stochastic sampling to increase sample diversity
From the talk Introduction to Deep Reinforcement Learning From Theory to Applications by Siyi Li (HKUST)
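A simplified sketch of proportional prioritization (the probability of sampling a transition grows with its TD error); a real implementation also uses a sum-tree and importance-sampling weights, which are omitted here:

```python
import numpy as np

def sample_prioritized(td_errors, batch_size=32, alpha=0.6, eps=1e-6):
    """Sample transition indices with probability proportional to |TD error|^alpha."""
    priorities = (np.abs(td_errors) + eps) ** alpha   # eps keeps every transition sampleable
    probs = priorities / priorities.sum()
    return np.random.choice(len(td_errors), size=batch_size, p=probs)
```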
Imitation Learning
Imitation Learning
Imitation learning aims to let the agent mimic the behavior of the expert,
without any reward signal.
[Diagram: Reinforcement Learning vs. Imitation Learning]
Generative Adversarial Imitation Learning
[Diagram: the expert provides state-action pairs (s, a) ~ π_E and the agent provides (s, a) ~ π_θ; a discriminator D_w tries to tell them apart, and its output is used as the reward signal for the agent.]
The objective is the minimax game
min_{π_θ} max_{D_w}  E_{(s,a)~π_E}[ log D_w(s, a) ] + E_{(s,a)~π_θ}[ log(1 − D_w(s, a)) ]