Announcements
Reminder: pset5 out, due midnight today
• pset5 self-grading form out Monday, due 11/16 (1 week)
• pset 6 out next week 11/12
Reinforcement Learning
DeepMind’s bot playing Atari Breakout
Reinforcement Learning
• Plays Atari video games
• Beats human champions at Poker and Go
• Robot learns to pick up, stack blocks
• Simulated quadruped learns to run
What is reinforcement learning?
Reinforcement Learning
Types of learning
Supervised
Unsupervised
Reinforcement
Supervised learning
• model f receives input x
• also gets correct output y
• predictions do not change future inputs
Supervised learning: (in arbitrary order of examples)
[Diagram: each input x₀ … x_T is fed independently to the model f, and each prediction f(x_t) is compared against its label y_t]
This is not how humans learn!
Reinforcement learning
• agent receives input x, chooses an action
• gets R (reward) after T time steps
• actions affect the next input (state)
Reinforcement learning:
[Diagram: the model f sees a sequence of states x₀ … x_T over time and chooses an action at each step; the reward R_t arrives only after a delay]
Input is the “world’s” state
• Current game board layout
• Picture of table with blocks
• Quadruped position and orientation
Output is an action
• Which game piece to move where (discrete)
• Orientation and position of robot arm (continuous)
• Joint angles of quadruped legs (continuous)
Actions affect state!
action→reward
Only some actions lead to rewards
Some rewards are negative
Reward examples
• Winning the game (positive)
• Successfully picking up a block (positive)
• Falling (negative)
Goal of reinforcement learning
• Learn to predict actions that maximize future rewards
• Need a new mathematical framework
Reinforcement learning:
[Diagram: same sequential setup as above, but the reward R_T arrives only at the end, after the final time step T]
Markov Decision Process
Reinforcement Learning
Markov Decision Process (MDP)
Definition: a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker.
https://en.wikipedia.org/wiki/Markov_decision_process
MDP notation
• S – set of States
• A – set of Actions
• 𝑅: 𝑆 →R (Reward)
• Psa – transition probabilities (𝑝(𝑠, 𝑎, 𝑠′) ∈ [0, 1])
• 𝛾 – discount factor
MDP = (S, A, R, Psa, 𝛾)
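As a concrete illustration of this tuple (my own sketch, not the course's code), one way to hold a small finite MDP in Python, assuming dictionary-based rewards and transition probabilities:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = Tuple[int, int]    # e.g. a grid cell (row, col)
Action = str               # e.g. "up", "down", "left", "right"

@dataclass
class MDP:
    states: List[State]                                          # S
    actions: List[Action]                                         # A
    reward: Dict[State, float]                                    # R: S -> R
    transitions: Dict[Tuple[State, Action], Dict[State, float]]   # P_sa(s')
    gamma: float                                                  # discount factor
```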
Discount factor 𝛾
• discount factor prevents the total reward from going to infinity (0 ≤ 𝛾 ≤ 1; see the bound after this list)
• makes the agent prefer immediate rewards to rewards that are potentially received far away in the future
• E.g., two paths to the goal state: 1) a longer path that gives a higher reward, 2) a shorter path with a smaller reward; the 𝛾 value controls which path the agent should prefer
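As a side note (not on the original slide), a one-line bound showing why 𝛾 < 1 keeps the total reward finite whenever every per-step reward satisfies |𝑅(𝑠)| ≤ R_max:

$$\Big|\sum_{t=0}^{\infty} \gamma^t R(s_t)\Big| \;\le\; R_{\max}\sum_{t=0}^{\infty}\gamma^t \;=\; \frac{R_{\max}}{1-\gamma}$$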
MDP (Simple example)
MDP (Simple example)
[Figure: a 3×4 grid world (rows 1–3, columns 1–4) that the agent moves around in]
MDP (Simple example)
• States S = locations
• Actions A = {↑, ↓, ←, →}
MDP (Simple example)
• States S = locations
• Actions A = {↑, ↓, ←, →}
• Reward 𝑅: 𝑆 → R
[Reward table over the 3×4 grid: +1 at cell (1,4), −1 at cell (2,4), 0 in every other cell]
MDP (Simple example)
• States S = locations
• Actions A = {↑, ↓, ←, →}
• Reward 𝑅: 𝑆 → R
[Reward table over the 3×4 grid: +1 at (1,4), −1 at (2,4), −.02 in every other cell]
MDP (Simple example)
• States S = locations
• Actions A = {↑, ↓, ←, →}
• Reward 𝑅: 𝑆 → R
• Transition Psa
[Reward table over the 3×4 grid as above: +1 at (1,4), −1 at (2,4), −.02 in every other cell]
𝑃_(3,3),↑((2,3)) = 0.8
𝑃_(3,3),↑((3,4)) = 0.1
𝑃_(3,3),↑((3,2)) = 0.1
𝑃_(3,3),↑((1,3)) = 0
⋮
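A minimal sketch of this kind of noisy transition model (my own illustration, not from the slides), assuming the intended move succeeds with probability 0.8, the agent slips to each perpendicular direction with probability 0.1, and a blocked cell sits at (2,2) as the example grid suggests:

```python
# Hypothetical grid-world transition model: intended move with prob. 0.8,
# each perpendicular "slip" with prob. 0.1; hitting a wall or the grid edge
# leaves the agent where it is.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
PERP = {"up": ("left", "right"), "down": ("left", "right"),
        "left": ("up", "down"), "right": ("up", "down")}

def step_to(state, move, n_rows=3, n_cols=4, walls=frozenset({(2, 2)})):
    r, c = state
    dr, dc = MOVES[move]
    nxt = (r + dr, c + dc)
    if not (1 <= nxt[0] <= n_rows and 1 <= nxt[1] <= n_cols) or nxt in walls:
        return state            # bump: stay in place
    return nxt

def transition_probs(state, action):
    """Return {next_state: p(state, action, next_state)}."""
    probs = {}
    for move, p in [(action, 0.8), (PERP[action][0], 0.1), (PERP[action][1], 0.1)]:
        nxt = step_to(state, move)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs

# transition_probs((3, 3), "up") -> {(2, 3): 0.8, (3, 2): 0.1, (3, 4): 0.1}
```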
MDP – Dynamics
• Start from state 𝑠₀
• Choose action 𝑎₀
• Transition to 𝑠₁ ~ 𝑃_{𝑠₀𝑎₀}
• Continue…
• Total payoff: 𝑅(𝑠₀) + 𝛾𝑅(𝑠₁) + 𝛾²𝑅(𝑠₂) + ⋯
[Figure: the 3×4 grid world with rewards as before (+1 at (1,4), −1 at (2,4), −.02 elsewhere)]
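To make these dynamics concrete, a small rollout sketch (my own, reusing the hypothetical transition_probs helper above) that simulates one episode and accumulates the discounted total payoff:

```python
import random

def rollout(start, policy, reward, gamma=0.99, horizon=100, terminals=frozenset()):
    """Simulate one episode and return R(s0) + gamma*R(s1) + gamma^2*R(s2) + ...

    `policy` maps a state to an action; `reward` maps a state to a float.
    Relies on the transition_probs() sketch defined earlier.
    """
    state, total, discount = start, 0.0, 1.0
    for _ in range(horizon):
        total += discount * reward.get(state, 0.0)
        if state in terminals:
            break
        probs = transition_probs(state, policy(state))
        state = random.choices(list(probs), weights=list(probs.values()))[0]
        discount *= gamma
    return total

# Example: -.02 per step, +1 / -1 at the terminal cells, always move "up".
reward = {(r, c): -0.02 for r in range(1, 4) for c in range(1, 5)}
reward[(1, 4)], reward[(2, 4)] = 1.0, -1.0
print(rollout((3, 1), policy=lambda s: "up", reward=reward,
              terminals={(1, 4), (2, 4)}))
```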
How do we choose good actions?
Choosing actions in MDP
[Figure: the 3×4 grid world with rewards −.02 per cell, +1 at (1,4), −1 at (2,4)]
States S = locations; Actions A = {↑, ↓, ←, →}; Reward 𝑅: 𝑆 → R; Transition Psa
• Goal – Choose actions so as to maximize expected total payoff: E[𝑅(𝑠₀) + 𝛾𝑅(𝑠₁) + 𝛾²𝑅(𝑠₂) + ⋯]
• In our example:
R – reach the charging station, avoid the stairs
𝛾 – discourages long paths; controls how much the reward can be delayed
MDP – Policy 𝜋
• Goal – Choose actions so as to maximize expected total payoff: E[𝑅(𝑠₀) + 𝛾𝑅(𝑠₁) + 𝛾²𝑅(𝑠₂) + ⋯]
• A policy is a function 𝜋: 𝑆 → 𝐴
[Figure: the 3×4 grid world with an action drawn in each state (a policy); +1 at (1,4), −1 at (2,4)]
States S = locations; Actions A = {↑, ↓, ←, →}; Reward 𝑅: 𝑆 → R; Transition Psa
Policy Value and Q functions
Reinforcement Learning
MDP – Policy value function
• Value function for policy 𝜋 – 𝑉^𝜋: 𝑆 → R
[Figure: the 3×4 grid world; +1 at (1,4), −1 at (2,4)]
States S = locations; Actions A = {↑, ↓, ←, →}; Reward 𝑅: 𝑆 → R; Transition Psa
Policy 𝜋: 𝑆 → 𝐴
Expected sum of discounted rewards when starting in state 𝑠 and following policy 𝜋:
𝑉^𝜋(𝑠) = E[𝑅(𝑠₀) + 𝛾𝑅(𝑠₁) + 𝛾²𝑅(𝑠₂) + ⋯ | 𝑠₀ = 𝑠, 𝜋]
MDP – Policy value function
⇒ 𝑉^𝜋(𝑠) = E[𝑅(𝑠₀)] + E[𝛾𝑅(𝑠₁) + 𝛾²𝑅(𝑠₂) + ⋯]

Bellman’s equation (this is a recursion!):

𝑉^𝜋(𝑠) = 𝑅(𝑠) + 𝛾 E_{𝑠′~𝑃_{𝑠,𝜋(𝑠)}}[𝑉^𝜋(𝑠′)]

(the second term is an expectation over values of the next state)
MDP – Policy value function
[Figure: the 3×4 grid world with its 11 reachable states numbered 1–11 (row 1: states 1–4, row 2: states 5–7, row 3: states 8–11); +1 at state 4, −1 at state 7]

For a fixed policy 𝜋, Bellman’s equation gives one linear equation per state:

𝑉₁ = 𝑅(1) + 𝛾(𝑃_{1,↓}(2) 𝑉₂ + ⋯)
𝑉₂ = ⋯
⋮
𝑉₁₀ = 𝑅(10) + 𝛾(𝑃_{10,↑}(6) 𝑉₆ + 𝑃_{10,↑}(9) 𝑉₉ + 𝑃_{10,↑}(11) 𝑉₁₁)

⇒ solve the resulting |𝑆| equations for 𝑉₁, …, 𝑉₁₁.
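One way to solve these |S| linear equations (a sketch of my own, not the course's code) with NumPy, assuming we have assembled the matrix 𝑃^𝜋 with entries 𝑃^𝜋[s, s′] = 𝑃_{s,𝜋(s)}(s′) for the fixed policy:

```python
import numpy as np

def evaluate_policy(P_pi, R, gamma):
    """Exact policy evaluation: solve V = R + gamma * P_pi @ V for V.

    P_pi : (n, n) array, P_pi[s, s2] = probability of moving s -> s2 under pi
    R    : (n,) array of rewards R(s)
    """
    n = len(R)
    # (I - gamma * P_pi) V = R
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R)
```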
Optimal value function
If the agent uses a given policy 𝜋 to select actions, the corresponding value function is given by:
𝑉^𝜋(𝑠) = E[𝑅(𝑠₀) + 𝛾𝑅(𝑠₁) + 𝛾²𝑅(𝑠₂) + ⋯ | 𝑠₀ = 𝑠, 𝜋]
There exists an optimal value function whose value is at least as high as that of every other value function in all states:
𝑉∗(𝑠) = max_𝜋 𝑉^𝜋(𝑠)
The optimal policy 𝜋∗ is the policy that corresponds to the optimal value function.
Value function vs. Q-function
For convenience, RL algorithms introduce the Q-function, which takes a state-action pair and returns a real value
𝑄: 𝑆 × 𝐴 → R
The optimal Q-function 𝑄∗(𝑠, 𝑎) is the highest expected total reward an agent can receive when starting in 𝑠, taking action 𝑎, and acting optimally afterwards
𝑄∗(𝑠, 𝑎) indicates how good it is for an agent to pick action 𝑎 while being in state 𝑠
Optimal Q-function
If we know the optimal Q-function 𝑄∗(𝑠, 𝑎), the optimal policy 𝜋∗ can be easily extracted by choosing the action 𝑎 that gives maximum 𝑄∗(𝑠, 𝑎) for state 𝑠:
𝜋∗(𝑠) = argmax_𝑎 𝑄∗(𝑠, 𝑎)
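In code form, extracting the greedy policy from a Q-table is a one-liner; a small sketch (the dict-of-dicts table layout is my own assumption):

```python
def greedy_policy(Q, actions):
    """pi*(s) = argmax_a Q*(s, a), for a Q-table stored as Q[s][a]."""
    return {s: max(actions, key=lambda a: Q[s][a]) for s in Q}
```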
RL Approaches
Reinforcement Learning
Reinforcement learning approaches
Value-based RL
• Estimate the optimal value function
• i.e., the maximum value achievable under any policy
• Guaranteed to converge to optimum
Policy-based RL
• Search directly for the optimal policy
• re-define the policy at each step and compute the value according to this new policy until the policy converges
• Guaranteed to converge, often faster than value-based RL
Q-learning
• Search for the optimal Q-function
• No prior knowledge of MDP required
Value Iteration Algorithm
Given 𝑃_{𝑠,𝑎}(𝑠′) = 𝑝(𝑠′|𝑠, 𝑎), iteratively compute the Q and value functions using Bellman’s equation.
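The algorithm itself appears as an image on the slide; a rough value-iteration sketch in Python, assuming the dictionary-style MDP from the earlier sketch (my own layout, not the course's code):

```python
def value_iteration(states, actions, reward, transitions, gamma,
                    n_iters=1000, tol=1e-8):
    """Iterate V(s) <- R(s) + gamma * max_a sum_s' p(s,a,s') V(s')."""
    V = {s: 0.0 for s in states}
    for _ in range(n_iters):
        new_V = {s: reward[s] + gamma * max(
                     sum(p * V[s2] for s2, p in transitions[(s, a)].items())
                     for a in actions)
                 for s in states}
        if max(abs(new_V[s] - V[s]) for s in states) < tol:
            return new_V
        V = new_V
    return V
```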
Policy Iteration Algorithm
Given 𝑃_{𝑠,𝑎}(𝑠′) = 𝑝(𝑠′|𝑠, 𝑎) and 𝜋, iteratively compute the policy’s value function and improve the policy to maximize it.
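Likewise, a rough policy-iteration sketch under the same assumptions (iterative evaluation is used here just to keep the sketch self-contained):

```python
def policy_iteration(states, actions, reward, transitions, gamma, n_iters=100):
    """Alternate policy evaluation and greedy policy improvement until stable."""
    pi = {s: actions[0] for s in states}              # arbitrary initial policy
    V = {s: 0.0 for s in states}
    for _ in range(n_iters):
        # Policy evaluation (fixed number of Bellman backups, for simplicity)
        for _ in range(500):
            V = {s: reward[s] + gamma * sum(p * V[s2] for s2, p in
                                            transitions[(s, pi[s])].items())
                 for s in states}
        # Greedy policy improvement w.r.t. the current value estimate
        new_pi = {s: max(actions, key=lambda a: sum(
                      p * V[s2] for s2, p in transitions[(s, a)].items()))
                  for s in states}
        if new_pi == pi:
            break
        pi = new_pi
    return pi, V
```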
Reinforcement learning approaches
Optimal value function – Bellman equation: needs Psa
Optimal policy – Bellman equation: needs 𝜋, Psa
Optimal state-action value function Q – easier!
Define 𝑄: 𝑆 × 𝐴 → R
𝑄∗(𝑠, 𝑎) = 𝑅(𝑠) + 𝛾 E_{𝑠′~𝑃_{𝑠,𝑎}}[max_{𝑎′∈𝐴} 𝑄∗(𝑠′, 𝑎′)]
Q-Learning (discrete)
Reinforcement Learning
Q-value function
Optimal Q-value function
David Silver, Deep RL Tutorial, ICML
Q-learning algorithm
The agent interacts with the environment and updates Q recursively
𝑄(𝑠, 𝑎) ← 𝑄(𝑠, 𝑎) + 𝛼 [𝑟 + 𝛾 max_{𝑎′} 𝑄(𝑠′, 𝑎′) − 𝑄(𝑠, 𝑎)]
where 𝑟 is the reward, 𝑄(𝑠, 𝑎) the current value, 𝛾 the discount, 𝛼 the learning rate, and max_{𝑎′} 𝑄(𝑠′, 𝑎′) the largest value over all possible actions in the new state
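A minimal tabular version of this update loop (my own sketch; `env` is a hypothetical environment whose reset() returns a state and whose step(a) returns (next_state, reward, done)):

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    Q = defaultdict(float)                      # Q[(state, action)], default 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy: explore with prob. epsilon, otherwise exploit
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)       # assumed interface
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            target = r if done else r + gamma * max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```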
Q-learning example
Goal: get from bottom left to top right
Exploration vs exploitation
• How does the agent select actions during learning? Should it trust the learned values of Q(s, a) and select actions based on them, or try other actions hoping they may lead to a better reward?
• This is known as the exploration vs exploitation dilemma
• Simple 𝜀-greedy approach: at each step, with small probability 𝜀 the agent picks a random action (explore); with probability (1 − 𝜀) it selects an action according to the current estimate of the Q-values (exploit)
• The 𝜀 value can be decreased over time as the agent becomes more confident in its estimate of the Q-values
Continuous state
Reinforcement Learning
Continuous state – Pong
MDP for Pong
In this case, what are these?
• S – set of States
• A – set of Actions
• 𝑅: 𝑆 →R (Reward)
• Psa – transition probabilities (𝑝(𝑠, 𝑎, 𝑠′) ∈ R)
Can we learn Q-value?
• Can discretize state space, but it may be too large
• Can simplify state by adding domain knowledge (e.g. paddle, ball), but it may not be available
• Instead, use a neural net to learn good features of the state!
Deep RL
Reinforcement Learning
Deep RL playing DOTA
Deep RL
• V, Q or 𝜋 can be approximated with a deep network
• Deep Q-Learning (covered today)
  • Input: state, action
  • Output: Q-value
• Alternative: learn a Policy Network
  • Input: state
  • Output: distribution over actions
Q-value network
David Silver, Deep RL Tutorial, ICML
Deep Q-network (DQN)
David Silver, Deep RL Tutorial, ICML
DQN – Playing Atari
DQN in Atari
• End-to-end learning of values Q(s, a) from pixels s
• Input state s is stack of raw pixels from last 4 frames
• Output is Q(s, a) for 18 joystick/button positions
• Reward is change in score for that step
• Network architecture and hyperparameters fixed across all games
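For reference, a PyTorch sketch of a Q-network of this kind (my own approximation; the layer sizes follow the commonly cited DQN architecture, but treat the details as illustrative rather than as the exact published network):

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Maps a stack of 4 grayscale 84x84 frames to one Q-value per action."""
    def __init__(self, n_actions=18):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),           # Q(s, a) for each action a
        )

    def forward(self, x):                        # x: (batch, 4, 84, 84)
        return self.net(x)

# q_values = DQN()(torch.zeros(1, 4, 84, 84))    # -> shape (1, 18)
```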
DQN – Playing Atari
Human level
DQN for Atari
DQN paper: www.nature.com/articles/nature14236
DQN demo: https://www.youtube.com/watch?v=iqXKQf2BOSE
DQN source code: www.sites.google.com/a/deepmind.com/dqn/
Downsides of RL
• RL is less sample-efficient than supervised learning because it involves bootstrapping, which uses an estimate of the Q-value to update the Q-value predictor
• Rewards are usually sparse, and learning requires reaching the goal by chance at least once
• Therefore, RL might not find a solution at all if the state space is large or if the task is difficult
Summary
• The goal of Reinforcement learning:
• learn to predict actions that maximize future rewards
• Markov Decision Process:
• Formalizes the RL framework as MDP = (S, A, R, Psa, 𝛾)
• Approaches to reinforcement learning:
• Learn value function (offline)
• Learn optimal policy (offline)
• Learn Q-function (online)
References
Andrew Ng’s Reinforcement Learning course, lecture 16
Andrej Karpathy’s blog post on policy gradient
http://karpathy.github.io/2016/05/31/rl/
Mnih et. al, Playing Atari with Deep Reinforcement Learning (DeepMind)
https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
Intuitive explanation of deep Q-learning
https://www.nervanasys.com/demystifying-deep-reinforcement-learning/
Next Class
Reinforcement Learning II
Q-learning cont’d; deep Q-learning (DQN)