Announcements
Reminder: pset5 out, due midnight today
• pset5 self-grading form out Monday, due 11/16 (1 week)
• pset 6 out next week 11/12
Reinforcement Learning
DeepMind’s bot playing Atari Breakout
Reinforcement Learning
• Plays Atari video games
• Beats human champions at Poker and Go
• Robot learns to pick up, stack blocks
• Simulated quadruped learns to run
What is reinforcement learning?
Reinforcement Learning
Types of learning
Supervised
Unsupervised
Reinforcement
Supervised learning
• model f receives input x
• also gets correct output y
• predictions do not change future inputs
Supervised learning: (in arbitrary order of examples)
[Diagram: each input x₀ … x_T is fed independently to the model f, and each prediction f(x_t) is compared against its label y_t]
This is not how humans learn!
Reinforcement learning
• agent receives input x, chooses an action
• gets R (reward) after T time steps
• actions affect the next input (state)
Reinforcement learning:
[Diagram: the model f sees a sequence of states x₀ … x_T over time and chooses an action at each step; the reward R_t arrives only after a delay]
Input is the “world’s” state
• Current game board layout
• Picture of table with blocks
• Quadruped position and orientation
Output is an action
• Which game piece to move where (discrete)
• Orientation and position of robot arm (continuous)
• Joint angles of quadruped legs (continuous)
Actions affect state!
action→reward
Only some actions lead to rewards
Some rewards are negative
Reward examples
• Winning the game (positive)
• Successfully picking up a block (positive)
• Falling (negative)
Goal of reinforcement learning
• Learn to predict actions that maximize future rewards
• Need a new mathematical framework
Reinforcement learning:
[Diagram: same sequential setup as above, but the reward R_T arrives only at the end, after the final time step T]
Markov Decision Process
Reinforcement Learning
Markov Decision Process (MDP)
Definition: a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker.
https://en.wikipedia.org/wiki/Markov_decision_process
MDP notation
• S – set of States
• A – set of Actions
• 𝑅: 𝑆 →R (Reward)
• Psa – transition probabilities (𝑝(𝑠, 𝑎, 𝑠′) ∈ [0, 1])
• 𝛾 – discount factor
MDP = (S, A, R, Psa, 𝛾)
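As a concrete illustration of this tuple (my own sketch, not the course's code), one way to hold a small finite MDP in Python, assuming dictionary-based rewards and transition probabilities:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = Tuple[int, int]    # e.g. a grid cell (row, col)
Action = str               # e.g. "up", "down", "left", "right"

@dataclass
class MDP:
    states: List[State]                                          # S
    actions: List[Action]                                         # A
    reward: Dict[State, float]                                    # R: S -> R
    transitions: Dict[Tuple[State, Action], Dict[State, float]]   # P_sa(s')
    gamma: float                                                  # discount factor
```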
Discount factor 𝛾
• discount factor prevents the total reward from going to infinity (0 ≤ 𝛾 ≤ 1; see the bound after this list)
• makes the agent prefer immediate rewards to rewards that are potentially received far away in the future
• E.g., two paths to the goal state: 1) a longer path that gives a higher reward, 2) a shorter path with a smaller reward; the 𝛾 value controls which path the agent should prefer
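As a side note (not on the original slide), a one-line bound showing why 𝛾 < 1 keeps the total reward finite whenever every per-step reward satisfies |𝑅(𝑠)| ≤ R_max:

$$\Big|\sum_{t=0}^{\infty} \gamma^t R(s_t)\Big| \;\le\; R_{\max}\sum_{t=0}^{\infty}\gamma^t \;=\; \frac{R_{\max}}{1-\gamma}$$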
MDP (Simple example)
MDP (Simple example)
[Figure: a 3×4 grid world (rows 1–3, columns 1–4) that the agent moves around in]
MDP (Simple example)
• States S = locations
• Actions A = {↑, ↓, ←, →}
MDP (Simple example)
• States S = locations
• Actions A = {↑, ↓, ←, →}
• Reward 𝑅: 𝑆 → R
[Reward table over the 3×4 grid: +1 at cell (1,4), −1 at cell (2,4), 0 in every other cell]
MDP (Simple example)
• States S = locations
• Actions A = {↑, ↓, ←, →}
• Reward 𝑅: 𝑆 → R
[Reward table over the 3×4 grid: +1 at (1,4), −1 at (2,4), −.02 in every other cell]
MDP (Simple example)
• States S = locations
• Actions A = {↑, ↓, ←, →}
• Reward 𝑅: 𝑆 → R
• Transition Psa
[Reward table over the 3×4 grid as above: +1 at (1,4), −1 at (2,4), −.02 in every other cell]
𝑃_(3,3),↑((2,3)) = 0.8
𝑃_(3,3),↑((3,4)) = 0.1
𝑃_(3,3),↑((3,2)) = 0.1
𝑃_(3,3),↑((1,3)) = 0
⋮
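A minimal sketch of this kind of noisy transition model (my own illustration, not from the slides), assuming the intended move succeeds with probability 0.8, the agent slips to each perpendicular direction with probability 0.1, and a blocked cell sits at (2,2) as the example grid suggests:

```python
# Hypothetical grid-world transition model: intended move with prob. 0.8,
# each perpendicular "slip" with prob. 0.1; hitting a wall or the grid edge
# leaves the agent where it is.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
PERP = {"up": ("left", "right"), "down": ("left", "right"),
        "left": ("up", "down"), "right": ("up", "down")}

def step_to(state, move, n_rows=3, n_cols=4, walls=frozenset({(2, 2)})):
    r, c = state
    dr, dc = MOVES[move]
    nxt = (r + dr, c + dc)
    if not (1 <= nxt[0] <= n_rows and 1 <= nxt[1] <= n_cols) or nxt in walls:
        return state            # bump: stay in place
    return nxt

def transition_probs(state, action):
    """Return {next_state: p(state, action, next_state)}."""
    probs = {}
    for move, p in [(action, 0.8), (PERP[action][0], 0.1), (PERP[action][1], 0.1)]:
        nxt = step_to(state, move)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs

# transition_probs((3, 3), "up") -> {(2, 3): 0.8, (3, 2): 0.1, (3, 4): 0.1}
```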
MDP – Dynamics
• Start from state 𝑠₀
• Choose action 𝑎₀
• Transition to 𝑠₁ ~ 𝑃_{𝑠₀𝑎₀}
• Continue…
• Total payoff: 𝑅(𝑠₀) + 𝛾𝑅(𝑠₁) + 𝛾²𝑅(𝑠₂) + ⋯
[Figure: the 3×4 grid world with rewards as before (+1 at (1,4), −1 at (2,4), −.02 elsewhere)]
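To make these dynamics concrete, a small rollout sketch (my own, reusing the hypothetical transition_probs helper above) that simulates one episode and accumulates the discounted total payoff:

```python
import random

def rollout(start, policy, reward, gamma=0.99, horizon=100, terminals=frozenset()):
    """Simulate one episode and return R(s0) + gamma*R(s1) + gamma^2*R(s2) + ...

    `policy` maps a state to an action; `reward` maps a state to a float.
    Relies on the transition_probs() sketch defined earlier.
    """
    state, total, discount = start, 0.0, 1.0
    for _ in range(horizon):
        total += discount * reward.get(state, 0.0)
        if state in terminals:
            break
        probs = transition_probs(state, policy(state))
        state = random.choices(list(probs), weights=list(probs.values()))[0]
        discount *= gamma
    return total

# Example: -.02 per step, +1 / -1 at the terminal cells, always move "up".
reward = {(r, c): -0.02 for r in range(1, 4) for c in range(1, 5)}
reward[(1, 4)], reward[(2, 4)] = 1.0, -1.0
print(rollout((3, 1), policy=lambda s: "up", reward=reward,
              terminals={(1, 4), (2, 4)}))
```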
How do we choose good actions?
Choosing actions in MDP
[Figure: the 3×4 grid world with rewards −.02 per cell, +1 at (1,4), −1 at (2,4)]
States S = locations; Actions A = {↑, ↓, ←, →}; Reward 𝑅: 𝑆 → R; Transition Psa
• Goal – Choose actions so as to maximize expected total payoff: E[𝑅(𝑠₀) + 𝛾𝑅(𝑠₁) + 𝛾²𝑅(𝑠₂) + ⋯]
• In our example:
R – reach the charging station, avoid the stairs
𝛾 – discourages long paths; controls how much the reward can be delayed
MDP – Policy 𝜋
• Goal – Choose actions so as to maximize expected total payoff: E[𝑅(𝑠₀) + 𝛾𝑅(𝑠₁) + 𝛾²𝑅(𝑠₂) + ⋯]
• A policy is a function 𝜋: 𝑆 → 𝐴
[Figure: the 3×4 grid world with an action drawn in each state (a policy); +1 at (1,4), −1 at (2,4)]
States S = locations; Actions A = {↑, ↓, ←, →}; Reward 𝑅: 𝑆 → R; Transition Psa
Policy Value and Q functions
Reinforcement Learning
MDP – Policy value function
• Value function for policy 𝜋 – 𝑉^𝜋: 𝑆 → R
[Figure: the 3×4 grid world; +1 at (1,4), −1 at (2,4)]
States S = locations; Actions A = {↑, ↓, ←, →}; Reward 𝑅: 𝑆 → R; Transition Psa
Policy 𝜋: 𝑆 → 𝐴
Expected sum of discounted rewards when starting in state 𝑠 and following policy 𝜋:
𝑉^𝜋(𝑠) = E[𝑅(𝑠₀) + 𝛾𝑅(𝑠₁) + 𝛾²𝑅(𝑠₂) + ⋯ | 𝑠₀ = 𝑠, 𝜋]
MDP – Policy value function
⇒ 𝑉^𝜋(𝑠) = E[𝑅(𝑠₀)] + E[𝛾𝑅(𝑠₁) + 𝛾²𝑅(𝑠₂) + ⋯]

Bellman’s equation (this is a recursion!):

𝑉^𝜋(𝑠) = 𝑅(𝑠) + 𝛾 E_{𝑠′~𝑃_{𝑠,𝜋(𝑠)}}[𝑉^𝜋(𝑠′)]

(the second term is an expectation over values of the next state)
MDP – Policy value function
[Figure: the 3×4 grid world with its 11 reachable states numbered 1–11 (row 1: states 1–4, row 2: states 5–7, row 3: states 8–11); +1 at state 4, −1 at state 7]

For a fixed policy 𝜋, Bellman’s equation gives one linear equation per state:

𝑉₁ = 𝑅(1) + 𝛾(𝑃_{1,↓}(2) 𝑉₂ + ⋯)
𝑉₂ = ⋯
⋮
𝑉₁₀ = 𝑅(10) + 𝛾(𝑃_{10,↑}(6) 𝑉₆ + 𝑃_{10,↑}(9) 𝑉₉ + 𝑃_{10,↑}(11) 𝑉₁₁)

⇒ solve the resulting |𝑆| equations for 𝑉₁, …, 𝑉₁₁.
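One way to solve these |S| linear equations (a sketch of my own, not the course's code) with NumPy, assuming we have assembled the matrix 𝑃^𝜋 with entries 𝑃^𝜋[s, s′] = 𝑃_{s,𝜋(s)}(s′) for the fixed policy:

```python
import numpy as np

def evaluate_policy(P_pi, R, gamma):
    """Exact policy evaluation: solve V = R + gamma * P_pi @ V for V.

    P_pi : (n, n) array, P_pi[s, s2] = probability of moving s -> s2 under pi
    R    : (n,) array of rewards R(s)
    """
    n = len(R)
    # (I - gamma * P_pi) V = R
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R)
```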
Optimal value function
If the agent uses a given policy 𝜋 to select actions, the corresponding value function is given by:
𝑉^𝜋(𝑠) = E[𝑅(𝑠₀) + 𝛾𝑅(𝑠₁) + 𝛾²𝑅(𝑠₂) + ⋯ | 𝑠₀ = 𝑠, 𝜋]
There exists an optimal value function whose value is at least as high as that of every other value function in all states:
𝑉∗(𝑠) = max_𝜋 𝑉^𝜋(𝑠)
The optimal policy 𝜋∗ is the policy that corresponds to the optimal value function.
Value function vs. Q-function
For convenience, RL algorithms introduce the Q-function, which takes a state-action pair and returns a real value
𝑄: 𝑆 × 𝐴 → R
The optimal Q-function 𝑄∗(𝑠, 𝑎) is the highest expected total reward an agent can receive when starting in 𝑠, taking action 𝑎, and acting optimally afterwards
𝑄∗(𝑠, 𝑎) indicates how good it is for an agent to pick action 𝑎 while being in state 𝑠
Optimal Q-function
If we know the optimal Q-function 𝑄∗(𝑠, 𝑎), the optimal policy 𝜋∗ can be easily extracted by choosing the action 𝑎 that gives maximum 𝑄∗(𝑠, 𝑎) for state 𝑠:
𝜋∗(𝑠) = argmax_𝑎 𝑄∗(𝑠, 𝑎)
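In code form, extracting the greedy policy from a Q-table is a one-liner; a small sketch (the dict-of-dicts table layout is my own assumption):

```python
def greedy_policy(Q, actions):
    """pi*(s) = argmax_a Q*(s, a), for a Q-table stored as Q[s][a]."""
    return {s: max(actions, key=lambda a: Q[s][a]) for s in Q}
```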
RL Approaches
Reinforcement Learning
Reinforcement learning approaches
Value-based RL
• Estimate the optimal value function
• i.e., the maximum value achievable under any policy
• Guaranteed to converge to optimum
Policy-based RL
• Search directly for the optimal policy
• re-define the policy at each step and compute the value according to this new policy until the policy converges
• Guaranteed to converge, often faster than value-based RL
Q-learning
• Search for the optimal Q-function
• No prior knowledge of MDP required
Value Iteration Algorithm
Given 𝑃_{𝑠,𝑎}(𝑠′) = 𝑝(𝑠′|𝑠, 𝑎), iteratively compute the Q and value functions using Bellman’s equation.
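The algorithm itself appears as an image on the slide; a rough value-iteration sketch in Python, assuming the dictionary-style MDP from the earlier sketch (my own layout, not the course's code):

```python
def value_iteration(states, actions, reward, transitions, gamma,
                    n_iters=1000, tol=1e-8):
    """Iterate V(s) <- R(s) + gamma * max_a sum_s' p(s,a,s') V(s')."""
    V = {s: 0.0 for s in states}
    for _ in range(n_iters):
        new_V = {s: reward[s] + gamma * max(
                     sum(p * V[s2] for s2, p in transitions[(s, a)].items())
                     for a in actions)
                 for s in states}
        if max(abs(new_V[s] - V[s]) for s in states) < tol:
            return new_V
        V = new_V
    return V
```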
Policy Iteration Algorithm
Given 𝑃_{𝑠,𝑎}(𝑠′) = 𝑝(𝑠′|𝑠, 𝑎) and 𝜋, iteratively compute the policy’s value function and improve the policy to maximize it.
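Likewise, a rough policy-iteration sketch under the same assumptions (iterative evaluation is used here just to keep the sketch self-contained):

```python
def policy_iteration(states, actions, reward, transitions, gamma, n_iters=100):
    """Alternate policy evaluation and greedy policy improvement until stable."""
    pi = {s: actions[0] for s in states}              # arbitrary initial policy
    V = {s: 0.0 for s in states}
    for _ in range(n_iters):
        # Policy evaluation (fixed number of Bellman backups, for simplicity)
        for _ in range(500):
            V = {s: reward[s] + gamma * sum(p * V[s2] for s2, p in
                                            transitions[(s, pi[s])].items())
                 for s in states}
        # Greedy policy improvement w.r.t. the current value estimate
        new_pi = {s: max(actions, key=lambda a: sum(
                      p * V[s2] for s2, p in transitions[(s, a)].items()))
                  for s in states}
        if new_pi == pi:
            break
        pi = new_pi
    return pi, V
```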
Reinforcement learning approaches
Optimal value function – Bellman equation: needs Psa
Optimal policy – Bellman equation: needs 𝜋, Psa
Optimal state-action value function Q – easier!
Define 𝑄: 𝑆 × 𝐴 → R
𝑄∗(𝑠, 𝑎) = 𝑅(𝑠) + 𝛾 E_{𝑠′~𝑃_{𝑠,𝑎}}[max_{𝑎′∈𝐴} 𝑄∗(𝑠′, 𝑎′)]
Q-Learning (discrete)
Reinforcement Learning
Q-value function
Optimal Q-value function
David Silver, Deep RL Tutorial, ICML
Q-learning algorithm
The agent interacts with the environment and updates Q recursively
𝑄(𝑠, 𝑎) ← 𝑄(𝑠, 𝑎) + 𝛼 [𝑟 + 𝛾 max_{𝑎′} 𝑄(𝑠′, 𝑎′) − 𝑄(𝑠, 𝑎)]
where 𝑟 is the reward, 𝑄(𝑠, 𝑎) the current value, 𝛾 the discount, 𝛼 the learning rate, and max_{𝑎′} 𝑄(𝑠′, 𝑎′) the largest value over all possible actions in the new state
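A minimal tabular version of this update loop (my own sketch; `env` is a hypothetical environment whose reset() returns a state and whose step(a) returns (next_state, reward, done)):

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    Q = defaultdict(float)                      # Q[(state, action)], default 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy: explore with prob. epsilon, otherwise exploit
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)       # assumed interface
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            target = r if done else r + gamma * max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```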
Q-learning example
Goal: get from bottom left to top right
Exploration vs exploitation
• How does the agent select actions during learning? Should it trust the learned values of Q(s, a) and select actions based on them, or try other actions hoping they may lead to a better reward?
• This is known as the exploration vs exploitation dilemma
• Simple 𝜀-greedy approach: at each step, with small probability 𝜀 the agent picks a random action (explore); with probability (1 − 𝜀) it selects an action according to the current estimate of the Q-values (exploit)
• The 𝜀 value can be decreased over time as the agent becomes more confident in its estimate of the Q-values
Continuous state
Reinforcement Learning
Continuous state – Pong
MDP for Pong
In this case, what are these?
• S – set of States
• A – set of Actions
• 𝑅: 𝑆 →R (Reward)
• Psa – transition probabilities (𝑝(𝑠, 𝑎, 𝑠′) ∈ R)
Can we learn Q-value?
• Can discretize state space, but it may be too large
• Can simplify state by adding domain knowledge (e.g. paddle, ball), but it may not be available
• Instead, use a neural net to learn good features of the state!
Deep RL
Reinforcement Learning
Deep RL playing DOTA
Deep RL
• V, Q or 𝜋 can be approximated with a deep network
• Deep Q-Learning (covered today)
  • Input: state, action
  • Output: Q-value
• Alternative: learn a Policy Network
  • Input: state
  • Output: distribution over actions
Q-value network
David Silver, Deep RL Tutorial, ICML
Deep Q-network (DQN)
David Silver, Deep RL Tutorial, ICML
DQN – Playing Atari
DQN in Atari
• End-to-end learning of values Q(s, a) from pixels s
• Input state s is stack of raw pixels from last 4 frames
• Output is Q(s, a) for 18 joystick/button positions
• Reward is change in score for that step
• Network architecture and hyperparameters fixed across all games
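For reference, a PyTorch sketch of a Q-network of this kind (my own approximation; the layer sizes follow the commonly cited DQN architecture, but treat the details as illustrative rather than as the exact published network):

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Maps a stack of 4 grayscale 84x84 frames to one Q-value per action."""
    def __init__(self, n_actions=18):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),           # Q(s, a) for each action a
        )

    def forward(self, x):                        # x: (batch, 4, 84, 84)
        return self.net(x)

# q_values = DQN()(torch.zeros(1, 4, 84, 84))    # -> shape (1, 18)
```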
DQN – Playing Atari
Human level
DQN for Atari
DQN paper: www.nature.com/articles/nature14236
DQN demo: https://www.youtube.com/watch?v=iqXKQf2BOSE
DQN source code: www.sites.google.com/a/deepmind.com/dqn/
Downsides of RL
• RL is less sample-efficient than supervised learning because it involves bootstrapping, which uses an estimate of the Q-value to update the Q-value predictor
• Rewards are usually sparse, and learning requires reaching the goal by chance at least once
• Therefore, RL might not find a solution at all if the state space is large or if the task is difficult
Summary
• The goal of Reinforcement learning:
• learn to predict actions that maximize future rewards
• Markov Decision Process:
• Formalizes the RL framework as MDP = (S, A, R, Psa, 𝛾)
• Approaches to reinforcement learning:
• Learn value function (offline)
• Learn optimal policy (offline)
• Learn Q-function (online)
References
Andrew Ng’s Reinforcement Learning course, lecture 16
Andrej Karpathy’s blog post on policy gradient
http://karpathy.github.io/2016/05/31/rl/
Mnih et. al, Playing Atari with Deep Reinforcement Learning (DeepMind)
https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
Intuitive explanation of deep Q-learning
https://www.nervanasys.com/demystifying-deep-reinforcement-learning/
Next Class
Reinforcement Learning II
Q-learning cont’d; deep Q-learning (DQN)