

Deep Reinforcement Learning

Dr Chang Xu

School of Computer Science, The University of Sydney


Deep RL Breakthroughs

From the talk Introduction to Deep Reinforcement Learning From Theory to Applications by Siyi Li (HKUST)

At human-level or above


Reinforcement Learning (RL) in a nutshell

– RL is a general-purpose framework for decision making
– RL is for an agent interacting with an environment
– Each action influences the agent’s future state
– Feedback is given by a scalar reward signal
– Goal: select actions to maximize future reward

From the talk Introduction to Deep Reinforcement Learning From Theory to Applications by Siyi Li (HKUST)


Markov Decision Process (MDP)

– set of states S, set of actions A, initial state S0
– transition model P(s, a, s’)

– P( frame(t), right, frame(t’)) = 0.8
– reward function r(s)

– r(frame(t)) = +1
– goal: maximize cumulative reward in the long run

– policy: mapping from S to A
– a = π(s) or π(s, a) (deterministic vs. stochastic)

– reinforcement learning
– transitions and rewards usually not available
– how to change the policy based on experience
– how to explore the environment

[Figure: agent-environment interaction loop. The agent sends an action to the environment; the environment returns a reward and a new state.]


Markov Decision Processes (MDPs)

From the Tutorial: Deep Reinforcement Learning by David Silver, Google DeepMind


Computing return from rewards

– episodic (vs. continuing) tasks
– “game over” after N steps
– optimal policy depends on N; harder to analyze

– additive rewards
– V(s0, s1, …) = r(s0) + r(s1) + r(s2) + …
– infinite value for continuing tasks

– discounted rewards
– V(s0, s1, …) = r(s0) + γ*r(s1) + γ²*r(s2) + …
– value bounded if rewards bounded

The goal of RL is to find the policy which maximizes the expected
return.
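As a quick sanity check, the two return definitions above can be written in a few lines of Python (the reward list here is purely illustrative):

```python
def additive_return(rewards):
    # V = r0 + r1 + r2 + ...  (only sensible for episodic tasks)
    return sum(rewards)

def discounted_return(rewards, gamma=0.99):
    # V = r0 + gamma*r1 + gamma^2*r2 + ...  (bounded if the rewards are bounded)
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 1.0], gamma=0.9))  # 1 + 0 + 0.81 = 1.81
```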


Value Function

– A value function is the prediction of the future return
– Two definitions exist for the value function

– State value function
– “How much reward will I get from state s?”
– expected return when starting in s and following π

– State-action value function
– “How much reward will I get from action a in state s?”
– expected return when starting in s, performing a, and following π
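In the usual notation, with R_t denoting the (discounted) sum of future rewards from time t, the two definitions read:

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ R_t \mid s_t = s \right],
\qquad
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[ R_t \mid s_t = s,\, a_t = a \right]
```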


Bellman Equation and Optimality

– Value functions decompose into Bellman equations: the immediate reward plus the discounted value of the successor state (written out below).

– An optimal value function is the maximum achievable value.
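Written out for the state value function, the decomposition above is

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ r_{t+1} + \gamma\, V^{\pi}(s_{t+1}) \mid s_t = s \right]
```

and, analogously, Q^π(s, a) decomposes into the immediate reward plus the discounted value of the successor state-action pair.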


Bellman Optimality Equation

– Optimality for value functions is governed by the Bellman
optimality equations.

– Two equations, one for V* and one for Q*:
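In standard form:

```latex
V^{*}(s) = \max_{a}\, \mathbb{E}\!\left[ r_{t+1} + \gamma\, V^{*}(s_{t+1}) \mid s_t = s,\, a_t = a \right]

Q^{*}(s, a) = \mathbb{E}\!\left[ r_{t+1} + \gamma \max_{a'} Q^{*}(s_{t+1}, a') \mid s_t = s,\, a_t = a \right]
```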


Q-Learning


Initializing the Q(s, a) function

The Q-table has one row per action and one column per state (here 20 states); every entry is initialized to 0:

           s1  s2  s3  ...  s20
  Noop      0   0   0  ...    0
  Fire      0   0   0  ...    0
  Right     0   0   0  ...    0
  Left      0   0   0  ...    0
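One way to build this table, assuming 20 discrete states and the 4 Atari actions above:

```python
import numpy as np

n_states, n_actions = 20, 4           # states 1-20; NOOP, FIRE, RIGHT, LEFT
Q = np.zeros((n_states, n_actions))   # every entry starts at 0 (the table above is the transpose)
```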


Q-Learning

Initialize Q(s, a) arbitrarily.
Start with s.
Before taking action a, the current estimate of the expected return is

    Q(s, a)

After taking action a and observing r and s', the target estimate of the expected return is

    r + γ max_{a'} Q(s', a')

The difference between the two is the temporal-difference error,

    ΔQ(s, a) = r + γ max_{a'} Q(s', a') − Q(s, a)

and the table entry is moved towards the target:

    Q(s, a) ← Q(s, a) + α ΔQ(s, a)


Q-Learning

Initialize Q(s, a) arbitrarily.
Repeat (for each episode)

Initialize s
Repeat (for each step of the episode)

Choose a from s using a policy
Take action a, observe r, s’

Q(s, a) ← Q(s, a) + α[r + γ max_{a'} Q(s', a') − Q(s, a)]

s ← s'
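A minimal tabular sketch of this loop, assuming a Gym-style environment `env` with discrete observations and actions, and a behaviour policy `select_action(Q, s)` (e.g. the ε-greedy policy discussed next); the names are illustrative:

```python
import numpy as np

def q_learning(env, n_episodes=500, alpha=0.1, gamma=0.99):
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(n_episodes):
        s, _ = env.reset()
        done = False
        while not done:
            a = select_action(Q, s)                            # behaviour policy
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # Target uses the greedy value of the next state (zero if terminal).
            target = r + gamma * np.max(Q[s_next]) * (not terminated)
            Q[s, a] += alpha * (target - Q[s, a])              # move Q(s, a) towards the target
            s = s_next
    return Q
```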


Exploration and Exploitation
Initialize Q(s, a) arbitrarily.

Repeat (for each episode)
Initialize s
Repeat (for each step of the episode)

Choose a from s using a policy
Take action a, observe r, s’
Update Q and s….

– Random policy: pure exploration
– Greedy policy: pure exploitation

    π(s) = argmax_{a} Q(s, a)

– ε-greedy policy: with probability ε select a random action, otherwise select the greedy action (a sketch follows below)
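A matching sketch of the ε-greedy behaviour policy over a tabular Q (the function name `select_action` is the placeholder used in the loop above):

```python
import numpy as np

def select_action(Q, s, epsilon=0.1):
    # With probability epsilon explore with a random action, otherwise exploit.
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])   # random action index
    return int(np.argmax(Q[s]))                # greedy action: argmax_a Q(s, a)
```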


Deep RL: Deep Learning + RL

– Traditional RL
• low-dimensional state spaces
• handcrafted features

– DL’s representation power + RL’s generalization ability

– RL defines the objective
– Deep Learning learns the representation


Deep Q-Learning


Deep Q-Learning

For raw Atari frames the state space is enormous: a single 210×160 grayscale frame with 256 intensity levels gives

    #states = 256^(210×160)

far too many for a Q-table.

Value Function Approximation: approximate the state-action value with a parameterized function,

    Q(s, a) = f(s, a, w)


Q-Networks

– Represent the state-action value function (for discrete actions) by a Q-network with weights w

From the Tutorial: Deep Reinforcement Learning by David Silver, Google DeepMind


Deep Q-Learning

– End-to-end learning of state-action values from raw pixels
– Input state is stack of raw pixels from last 4 frames
– Outputs are the state-action values for all possible actions

From the Tutorial: Deep Reinforcement Learning by David Silver, Google DeepMind

Actions: 'NOOP', 'FIRE', 'RIGHT', 'LEFT'
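A sketch of such a Q-network in PyTorch, following the convolutional architecture commonly used for Atari DQN (input: a stack of 4 preprocessed 84×84 frames; output: one Q-value per action; the exact layer sizes here are assumptions):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, n_actions: int = 4):            # NOOP, FIRE, RIGHT, LEFT
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),                  # one state-action value per action
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 4, 84, 84), pixel values scaled to [0, 1]
        return self.head(self.features(frames))
```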


Deep Q-Networks (DQN)

– Optimal Q-values obey Bellman equation

– Treat the right-hand side as a target and minimize the MSE loss by SGD (the loss is written out below)

– Divergence issues using neural networks due to
– Correlations between samples
– Non-stationary targets

From the Tutorial: Deep Reinforcement Learning by David Silver, Google DeepMind
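One way to write the Bellman relation and the resulting regression loss (w are the network weights):

```latex
Q^{*}(s, a) = \mathbb{E}_{s'}\!\left[ r + \gamma \max_{a'} Q^{*}(s', a') \,\middle|\, s, a \right]

L(w) = \mathbb{E}\!\left[ \Big( \underbrace{r + \gamma \max_{a'} Q(s', a';\, w)}_{\text{target}} - Q(s, a;\, w) \Big)^{2} \right]
```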



Experience replay

– Build a data set from the agent's own experience
– Sample experiences uniformly from the data set to remove correlations (a sketch of such a buffer follows below)
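A minimal sketch of such a buffer (capacity and field names are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)     # oldest experiences are dropped first

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the correlation between consecutive transitions.
        return random.sample(self.buffer, batch_size)
```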


Algorithm

From the Tutorial: Deep Reinforcement Learning by David Silver, Google DeepMind

ε-greedy


Improvements: Target Network

– To deal with non-stationarity, the target parameters w⁻ are held fixed and only refreshed periodically from the online network (see the sketch below)
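A minimal sketch of the periodic synchronisation, assuming two PyTorch networks `q_net` (online) and `target_net` (target) with identical architecture and a hypothetical step counter:

```python
def sync_target(q_net, target_net):
    # Copy the online weights w into the frozen target weights w-.
    target_net.load_state_dict(q_net.state_dict())

# inside the training loop:
# if step % sync_every == 0:
#     sync_target(q_net, target_net)
```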


Improvements: Double DQN

– Q-learning is known to overestimate state-action values
– The max operator uses the same values to select and evaluate an action

– The upward bias can be removed by decoupling action selection from action evaluation (see the target below)
– The current Q-network is used to select actions
– An older Q-network is used to evaluate actions

From the talk Introduction to Deep Reinforcement Learning From Theory to Applications by Siyi Li (HKUST)
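In symbols, the decoupled target takes the form (w: current/online weights used for selection, w⁻: older/target weights used for evaluation):

```latex
y = r + \gamma\, Q\!\left(s',\, \arg\max_{a'} Q(s', a';\, w);\; w^{-}\right)
```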


Improvements: Prioritized Replay

– Uniform experience replay samples transitions regardless of their significance
– Experiences can instead be weighted according to their significance
– Prioritized replay stores experiences in a priority queue ordered by TD error (one common scheme is given below)
– Stochastic sampling is used to increase sample diversity

From the talk Introduction to Deep Reinforcement Learning From Theory to Applications by Siyi Li (HKUST)
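One common proportional-prioritization scheme: each transition i gets a priority p_i from its TD error δ_i and is sampled with probability

```latex
p_i = \lvert \delta_i \rvert + \epsilon,
\qquad
P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}
```

where ε is a small constant that keeps every priority positive and α controls how strongly prioritization is applied (α = 0 recovers uniform sampling).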


Imitation Learning


Imitation Learning

Imitation learning aims to let the agent mimic the behavior of the expert,
without any reward signal.

[Figure: Reinforcement Learning (learning from environment rewards) vs. Imitation Learning (learning from expert demonstrations)]


Generative Adversarial Imitation Learning

[Figure: GAIL setup. Expert demonstrations provide pairs (s, a) ~ π_E; the agent's policy generates pairs (s, a) ~ π_θ; a discriminator D_w(s, a) tries to tell the two apart, and its output is used as the reward signal for the agent.]

The objective is a minimax game between the policy π and the discriminator D_w:

    min_π max_w  E_{(s,a)~π_E}[ log D_w(s, a) ] + E_{(s,a)~π_θ}[ log(1 − D_w(s, a)) ]
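A minimal sketch of one discriminator update for this objective, assuming a PyTorch model `D` that maps concatenated (s, a) pairs to a probability of being expert data (all names are illustrative):

```python
import torch

def discriminator_step(D, optimizer, expert_sa, agent_sa):
    # expert_sa: batch of (s, a) pairs from the expert; agent_sa: batch from the current policy
    expert_prob = D(expert_sa)                       # pushed towards 1
    agent_prob = D(agent_sa)                         # pushed towards 0
    # Gradient ascent on the objective above = gradient descent on its negation.
    loss = -(torch.log(expert_prob + 1e-8).mean()
             + torch.log(1.0 - agent_prob + 1e-8).mean())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The agent is then trained with ordinary RL, using the discriminator as the reward,
# e.g. reward = -log(1 - D(s, a)), so that expert-like behaviour is rewarded.
```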