
Reinforcement Learning II

Recall: MDP notation
• S – set of States
• A – set of Actions
• R: S → ℝ (reward function)
• Psa – transition probabilities (p(s, a, s′) ∈ ℝ)
• γ – discount factor
MDP = (S, A, R, Psa, γ)
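As a purely illustrative representation of this tuple, here is a minimal Python sketch; the class and field names are my own, not from the lecture:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class MDP:
    """Finite MDP (S, A, R, Psa, gamma); names are illustrative."""
    states: List[str]                                      # S
    actions: List[str]                                     # A
    reward: Dict[str, float]                               # R: S -> R
    transitions: Dict[Tuple[str, str], Dict[str, float]]   # Psa: p(s' | s, a)
    gamma: float                                           # discount factor

# Tiny two-state example; each transitions[(s, a)] sums to 1 over s'.
mdp = MDP(
    states=["s0", "s1"],
    actions=["stay", "go"],
    reward={"s0": 0.0, "s1": 1.0},
    transitions={
        ("s0", "stay"): {"s0": 1.0},
        ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
        ("s1", "stay"): {"s1": 1.0},
        ("s1", "go"):   {"s0": 0.5, "s1": 0.5},
    },
    gamma=0.9,
)
```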

Q-learning algorithm
The agent interacts with the environment and updates Q recursively:
Q(s, a) ← Q(s, a) + α [ r + γ max_a′ Q(s′, a′) − Q(s, a) ]
where Q(s, a) is the current value, α the learning rate, r the reward, γ the discount factor, and max_a′ Q(s′, a′) the largest Q-value over all possible actions in the new state s′.
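A minimal tabular sketch of this update loop, assuming a small environment with a Gym-like interface (`env.reset()` returning a state, `env.step(a)` returning next state, reward, done) and an ε-greedy behaviour policy; the hyperparameter values are illustrative:

```python
import random
from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.99, 0.1      # learning rate, discount, exploration rate
Q = defaultdict(float)                      # Q[(s, a)], implicitly zero-initialised

def q_learning_episode(env, actions):
    """Run one episode, updating Q in place with the rule above."""
    s = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda a_: Q[(s, a_)])
        s_next, r, done = env.step(a)
        # Q(s,a) <- Q(s,a) + alpha * [ r + gamma * max_a' Q(s',a') - Q(s,a) ]
        best_next = 0.0 if done else max(Q[(s_next, a_)] for a_ in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next
```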

Continuous state
Reinforcement Learning

Continuous state – Pong


MDP for Pong
In this case, what are these?
• S – set of States
• A – set of Actions
• R: S → ℝ (Reward)
• Psa – transition probabilities (p(s, a, s′) ∈ ℝ)
Can we learn Q-value?
• Can discretize the state space, but it may be too large
• Can simplify the state by adding domain knowledge (e.g. paddle, ball), but it may not be available
• Instead, use a neural net to learn good features of the state! (A frame-preprocessing sketch follows below.)
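One common way to build such a pixel-level state for Pong (roughly following the preprocessing in Karpathy's blog post listed in the references; the crop offsets and colour values are taken from there and should be treated as assumptions) is to crop, downsample, and binarise each frame, then feed the network the difference of consecutive frames so that motion is visible:

```python
import numpy as np

def preprocess(frame):
    """Crop, downsample, and binarise a 210x160x3 Pong frame into an 80x80 array."""
    frame = frame[35:195]        # crop away the score area
    frame = frame[::2, ::2, 0]   # downsample by a factor of 2, keep one colour channel
    frame[frame == 144] = 0      # erase background (colour value 1)
    frame[frame == 109] = 0      # erase background (colour value 2)
    frame[frame != 0] = 1        # paddles and ball become 1
    return frame.astype(np.float32)

def state_from_frames(prev_frame, cur_frame):
    """Difference of consecutive frames, so the network can see motion."""
    return preprocess(cur_frame) - preprocess(prev_frame)
```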

Deep RL
Reinforcement Learning

Deep RL playing DOTA


Deep RL
• V, Q, or π can be approximated with a deep network
• Deep Q-Learning (covered today)
  • Input: state, action
  • Output: Q-value
• Alternative: learn a Policy Network
  • Input: state
  • Output: distribution over actions
(Sketches of both network types follow below.)
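To make the two parameterisations concrete, here is a rough PyTorch sketch (layer sizes are arbitrary choices, not from the lecture). Note the Q-network here takes only the state and outputs one Q-value per action, which is the variant DQN uses in practice; a network taking (state, action) and outputting a single value is the other common choice.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """State in, one Q-value per action out."""
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, num_actions),
        )

    def forward(self, state):               # state: (batch, state_dim)
        return self.net(state)              # (batch, num_actions) Q-values

class PolicyNetwork(nn.Module):
    """State in, probability distribution over actions out."""
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, num_actions),
        )

    def forward(self, state):
        return torch.softmax(self.net(state), dim=-1)   # action probabilities
```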

Q-value network
David Silver, Deep RL Tutorial, ICML


Deep Q-network (DQN)
David Silver, Deep RL Tutorial, ICML
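The network itself, as described in the Mnih et al. DQN papers referenced at the end, stacks recent frames, passes them through convolutional layers, and outputs one Q-value per action; training then adds experience replay and a periodically updated target network. A rough PyTorch sketch of the network alone (layer sizes follow the Nature paper, but treat the details as an approximation):

```python
import torch.nn as nn

class DQN(nn.Module):
    """Conv Q-network: a stack of 4 preprocessed 84x84 frames in, one Q-value per action out."""
    def __init__(self, num_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),
        )

    def forward(self, frames):              # frames: (batch, 4, 84, 84)
        return self.head(self.features(frames))
```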

DQN – Playing Atari



Human level

DQN for Atari
DQN paper: www.nature.com/articles/nature14236
DQN demo: https://www.youtube.com/watch?v=iqXKQf2BOSE
DQN source code: www.sites.google.com/a/deepmind.com/dqn/

Downsides of RL
• RL is less sample-efficient than supervised learning because it relies on bootstrapping: an estimate of the Q-value is used to update the Q-value predictor itself
• Rewards are usually sparse, and learning requires reaching the goal by chance
• Therefore, RL might not find a solution at all if the state space is large or the task is difficult

References
Andrew Ng’s Reinforcement Learning course, lecture 16

Andrej Karpathy’s blog post on policy gradient
http://karpathy.github.io/2016/05/31/rl/
Mnih et al., Playing Atari with Deep Reinforcement Learning (DeepMind)
https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
Intuitive explanation of deep Q-learning
https://www.nervanasys.com/demystifying-deep-reinforcement-learning/