
Announcements
• Reminder: pset5 self-grading form and pset6 are out Thursday, due 11/19 (1 week)
• No lab this week!

Reinforcement Learning II

Recall: MDP notation
• S – set of States
• A – set of Actions
• 𝑅: 𝑆 → ℝ (reward function)
• 𝑃𝑠𝑎 – transition probabilities (𝑝(𝑠, 𝑎, 𝑠′) ∈ [0, 1])
• 𝛾 – discount factor
MDP = (S, A, R, Psa, 𝛾)
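As a minimal illustration (not from the slides), one way this tuple could be held in Python; the field names and type choices are my own.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Tuple[int, int]      # e.g. a grid location (row, col)
Action = str                 # e.g. "up", "down", "left", "right"

@dataclass
class MDP:
    states: List[State]                                           # S
    actions: List[Action]                                         # A
    reward: Callable[[State], float]                              # R: S -> real number
    transitions: Dict[Tuple[State, Action], Dict[State, float]]   # P_sa(s')
    gamma: float                                                  # discount factor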

MDP (Simple example)

MDP (Simple example)
• States S = locations
• Actions A = { ↑, ↓, ←, → }

[Figure: a 3×4 gridworld with rows 1-3 and columns 1-4; each cell is a state]

MDP (Simple example)
• States S = locations
• Actions A = { ↑, ↓, ←, → }
• Reward 𝑅: 𝑆 → ℝ
• Transition probabilities 𝑃𝑠𝑎

Rewards (rows 1-3, columns 1-4; cell (2,2) is a wall):
        1      2      3      4
  1   -.02   -.02   -.02    +1
  2   -.02   wall   -.02    -1
  3   -.02   -.02   -.02   -.02

Transitions are noisy; e.g. taking action ↑ in state (3,3):
𝑃(3,3),↑((2,3)) = 0.8,  𝑃(3,3),↑((3,4)) = 0.1,  𝑃(3,3),↑((3,2)) = 0.1,  𝑃(3,3),↑((1,3)) = 0
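Not from the original slides: a hedged Python sketch of this noisy transition model, with 0.8 probability for the intended direction and 0.1 for each perpendicular direction; moves that would leave the grid or enter the wall cell (assumed at (2,2)) leave the agent in place, and the helper names are mine.

```python
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
PERP = {"up": ("left", "right"), "down": ("left", "right"),
        "left": ("up", "down"), "right": ("up", "down")}

def transition_probs(state, action, n_rows=3, n_cols=4, walls=((2, 2),)):
    """Return {next_state: probability} for the noisy gridworld dynamics:
    0.8 for the intended move, 0.1 for each perpendicular move; blocked
    moves (off the grid or into a wall) keep the agent where it is."""
    probs = {}
    for a, p in [(action, 0.8), (PERP[action][0], 0.1), (PERP[action][1], 0.1)]:
        dr, dc = MOVES[a]
        nxt = (state[0] + dr, state[1] + dc)
        if not (1 <= nxt[0] <= n_rows and 1 <= nxt[1] <= n_cols) or nxt in walls:
            nxt = state
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs

print(transition_probs((3, 3), "up"))  # {(2, 3): 0.8, (3, 2): 0.1, (3, 4): 0.1}
```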


MDP – Dynamics
• Start from state 𝑠₀
• Choose action 𝑎₀
• Transition to 𝑠₁ ∼ 𝑃𝑠₀𝑎₀
• Continue…
• Total payoff: 𝑅(𝑠₀) + 𝛾𝑅(𝑠₁) + 𝛾²𝑅(𝑠₂) + ⋯ (computed in the sketch below)

[Figure: the same 3×4 reward grid as above]
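A small sketch (mine, not the slides') of the total payoff for one sampled trajectory, assuming the per-step rewards have been collected in a list; the 𝛾 value is illustrative.

```python
def discounted_return(rewards, gamma=0.99):
    """Total payoff R(s0) + gamma*R(s1) + gamma^2*R(s2) + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# e.g. three -.02 steps and then the +1 terminal state
print(discounted_return([-0.02, -0.02, -0.02, 1.0], gamma=0.99))
```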

Q-Learning (discrete)
Reinforcement Learning

Q-value function
𝑄𝜋(𝑠, 𝑎): the expected total discounted reward obtained by taking action 𝑎 in state 𝑠 and following policy 𝜋 afterwards.

Optimal Q-value function
𝑄∗(𝑠, 𝑎) = max over 𝜋 of 𝑄𝜋(𝑠, 𝑎): the best value achievable from (𝑠, 𝑎). Acting greedily with respect to 𝑄∗ gives an optimal policy.
David Silver, Deep RL Tutorial, ICML

Q-learning algorithm
The agent interacts with the environment and updates Q recursively:

𝑄(𝑠, 𝑎) ← 𝑄(𝑠, 𝑎) + 𝛼 [ 𝑟 + 𝛾 max𝑎′ 𝑄(𝑠′, 𝑎′) − 𝑄(𝑠, 𝑎) ]

where 𝑄(𝑠, 𝑎) is the current value, 𝛼 is the learning rate, 𝑟 is the reward, 𝛾 is the discount factor, and max𝑎′ 𝑄(𝑠′, 𝑎′) is the largest value over all possible actions in the new state.
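A minimal sketch of one such tabular update (my illustration; the table layout and hyperparameter values are assumptions).

```python
from collections import defaultdict

Q = defaultdict(float)   # Q[(state, action)], initialized to 0

def q_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```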

Q-learning example
Goal: get from bottom left to top right


Exploration vs exploitation
• How does the agent select actions during learning? Should it trust the learned values Q(s, a) and act on them, or try other actions in the hope of finding a better reward?
• This is known as the exploration vs. exploitation dilemma
• Simple ε-greedy approach: at each step, with small probability ε the agent picks a random action (explore); with probability (1 − ε) it selects the action with the highest current Q-value estimate (exploit); see the sketch below
• The ε value can be decreased over time as the agent becomes more confident in its estimate of the Q-values
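A sketch of this ε-greedy rule (mine), reusing the tabular Q table from the previous sketch; the ε value is illustrative.

```python
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """Explore with probability eps, otherwise exploit the current Q estimates."""
    if random.random() < eps:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(s, a)])      # exploit
```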

Continuous state
Reinforcement Learning

Continuous state – Pong


MDP for Pong
In this case, what are these?
• S – set of States
• A – set of Actions
• 𝑅: 𝑆 → ℝ (Reward)
• 𝑃𝑠𝑎 – transition probabilities (𝑝(𝑠, 𝑎, 𝑠′) ∈ [0, 1])
Can we learn the Q-value?
• We can discretize the state space, but it may be too large
• We can simplify the state using domain knowledge (e.g. paddle and ball positions), but such knowledge may not be available
• Instead, use a neural net to learn good features of the state!

Deep RL
Reinforcement Learning

Deep RL playing DOTA


Deep RL
• V, Q or 𝜋 can be approximated with a deep network
• Deep Q-Learning (a sketch follows below)
  • Input: state, action
  • Output: Q-value
• Alternative: learn a Policy Network
  • Input: state
  • Output: distribution over actions
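A hedged PyTorch sketch of a Q-network, here in the common form that takes the state and outputs one Q-value per action (rather than taking the action as an extra input); layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)   # shape: (batch, n_actions)

q_net = QNetwork(state_dim=4, n_actions=2)
print(q_net(torch.randn(1, 4)))  # Q-values for both actions
```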

Q-value network
Represent the Q-value function by a network with weights 𝑤: 𝑄(𝑠, 𝑎, 𝑤) ≈ 𝑄∗(𝑠, 𝑎)
David Silver, Deep RL Tutorial, ICML

Q-value network
David Silver, Deep RL Tutorial, ICML

Deep Q-network (DQN)
DQN trains the Q-network with Q-learning, sampling minibatches from an experience replay buffer and computing targets with a separate, periodically updated target network (loss sketched below).
David Silver, Deep RL Tutorial, ICML
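A hedged sketch of the loss such a DQN minimizes on a replay minibatch, assuming a target network; the batch layout, variable names, and the use of a plain squared error are my assumptions.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Squared TD error on a minibatch (s, a, r, s_next, done) of tensors."""
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a) for the taken actions
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values      # max_a' Q_target(s', a')
        target = r + gamma * (1 - done) * q_next           # no bootstrap at episode end
    return F.mse_loss(q_sa, target)
```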

DQN – Playing Atari

DQN – Playing Atari
• End-to-end learning of values Q(s, a) from pixels s
• Input state s is a stack of raw pixels from the last 4 frames
• Output is Q(s, a) for 18 joystick/button positions
• Reward is the change in score for that step
• Network architecture and hyperparameters fixed across all games
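A sketch of the convolutional architecture reported for Atari DQN (four stacked 84×84 grayscale frames in, one Q-value per action out); treat the exact layer sizes as a reconstruction from the published description, not the original code.

```python
import torch.nn as nn

class AtariDQN(nn.Module):
    """Stack of 4 preprocessed 84x84 frames in, one Q-value per action out."""
    def __init__(self, n_actions=18):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, frames):      # frames: (batch, 4, 84, 84)
        return self.net(frames)
```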

DQN – Playing Atari

[Figure: DQN performance on Atari games compared to human-level play]

DQN for Atari
DQN paper: www.nature.com/articles/nature14236
DQN demo: https://www.youtube.com/watch?v=iqXKQf2BOSE
DQN source code: www.sites.google.com/a/deepmind.com/dqn/

Deep RL
• V, Q or 𝜋 can be approximated with a deep network
• Deep Q-Learning
  • Input: state, action
  • Output: Q-value
• Alternative: learn a Policy Network
  • Input: state
  • Output: distribution over actions

Policy network for pong
• Define a policy network that implements the player
• It takes the state of the game and decides what to do (move UP or DOWN)
• A 2-layer neural network that takes the raw image pixels* (100,800 = 210×160×3) and outputs the probability of going UP (see the sketch below)
*Feed at least 2 frames to the policy network so that it can detect motion.
http://karpathy.github.io/2016/05/31/rl/
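A hedged sketch of such a 2-layer policy network: a flattened frame in, probability of moving UP out. The hidden size is my assumption, and in practice the input is usually preprocessed (e.g. a cropped, downsampled difference frame) rather than the full 210×160×3 image.

```python
import torch
import torch.nn as nn

class PongPolicy(nn.Module):
    """Flattened frame in, probability of the UP action out."""
    def __init__(self, input_dim=100_800, hidden=200):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden)
        self.fc2 = nn.Linear(hidden, 1)

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        return torch.sigmoid(self.fc2(h))   # p(UP); p(DOWN) = 1 - p(UP)
```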

Policy gradient
• Suppose the network predicts p(UP) = 30%, p(DOWN) = 70%
• We can sample an action from this distribution and execute it
• We could immediately use a gradient of 1.0 for DOWN and backprop to find the gradient vector that would encourage the network to predict DOWN
Problem: we do not yet know whether going DOWN is good!
Solution: simply wait until the end of the game, take the reward we get (+1 if we won, −1 if we lost), and use that as the gradient for the actions we took
http://karpathy.github.io/2016/05/31/rl/

Policy gradient
http://karpathy.github.io/2016/05/31/rl/

Problems with this?
• What if we made a good action in frame 50 (bouncing the ball back correctly), but then missed the ball in frame 150?
• If every single action is now labeled as bad (because we lost), wouldn’t that discourage the correct bounce on frame 50?
• Yes, but after thousands/millions of games the network will learn a good policy
http://karpathy.github.io/2016/05/31/rl/

Policy gradient
Want to maximize the expected reward

𝐸𝑥∼𝑝(𝑥|𝜃)[ 𝑓(𝑥) ]

where 𝑓(𝑥) is the reward function and 𝑝(𝑥|𝜃) is the policy network with parameters 𝜃 (i.e. change the network’s parameters so that action samples get higher rewards).

The gradient comes from the score-function identity: ∇𝜃 𝐸𝑥[𝑓(𝑥)] = 𝐸𝑥[ 𝑓(𝑥) ∇𝜃 log 𝑝(𝑥|𝜃) ]
http://karpathy.github.io/2016/05/31/rl/
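A minimal sketch of the resulting update: scale the gradient of the log-probability of each taken action by the episode's reward. It reuses the hypothetical PongPolicy sketched earlier; the learning rate and tensor layout are assumptions.

```python
import torch

policy = PongPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

def reinforce_step(frames, actions, episode_reward):
    """frames: (T, input_dim) states from one episode;
    actions: (T,) tensor of sampled actions, 1 = UP, 0 = DOWN;
    episode_reward: +1 if we won the game, -1 if we lost."""
    p_up = policy(frames).squeeze(1)                           # (T,)
    log_prob = torch.where(actions == 1, p_up, 1 - p_up).log()
    loss = -(episode_reward * log_prob).sum()                  # encourage or discourage taken actions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```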

Downsides of RL
• RL is less sample-efficient than supervised learning, in part because it relies on bootstrapping: an estimate of the Q-value is used to update the Q-value predictor
• Rewards are usually sparse, so early learning depends on reaching the goal by chance
• Therefore, RL might not find a solution at all if the state space is large or the task is difficult

References
Andrew Ng’s Reinforcement Learning course, lecture 16

Andrej Karpathy’s blog post on policy gradient
http://karpathy.github.io/2016/05/31/rl/
Mnih et al., Playing Atari with Deep Reinforcement Learning (DeepMind)
https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
Intuitive explanation of deep Q-learning
https://www.nervanasys.com/demystifying-deep-reinforcement-learning/

Next Class
Unsupervised Learning III: Anomaly Detection
Anomaly detection methods: density estimation, reconstruction-based methods, One-Class SVM; evaluating anomaly detection