Reinforcement Learning
DeepMind's bot playing Atari Breakout
Reinforcement Learning
• Plays Atari video games
• Beats human champions at Poker and Go
• Robot learns to pick up and stack blocks
• Simulated quadruped learns to run
What is reinforcement learning?
Reinforcement Learning
Types of learning
• Supervised
• Unsupervised
• Reinforcement
Supervised learning
• model f receives input x
• also gets correct output y
• predictions do not change future inputs
Supervised learning (examples arrive in arbitrary order):
y0    y1    y2    …   yT-1    yT        (labels)
f(x0) f(x1) f(x2) …   f(xT-1) f(xT)     (independent predictions)
x0    x1    x2    …   xT-1    xT        (inputs)
This is not how humans learn!
Reinforcement learning
• agent receives input x, chooses an action
• gets reward R after T time steps
• actions affect the next input (state)
Reinforcement learning:
f(x0) f(x1) f(x2) …   f(xT-1) f(xT)     (actions)  →  Rt (reward)
x0    x1    x2    …   xT-1    xT        (states, time →)
Input is the “world’s” state
• Current game board layout
• Picture of table with blocks
• Quadruped position and orientation
Output is an action
• Which game piece to move where (discrete)
• Orientation and position of robot arm (continuous)
• Joint angles of quadruped legs (continuous)
Actions affect state!
Action → reward
Only some actions lead to rewards
Some rewards are negative
Reward examples
• Winning the game (positive)
• Successfully picking up a block (positive)
• Falling (negative)
Goal of reinforcement learning
• Learn to predict actions that maximize future rewards
• Need a new mathematical framework
Reinforcement learning:
f(x0) f(x1) f(x2) …   f(xT-1) f(xT)     (actions)  →  Rt (reward), RT
x0    x1    x2    …   xT-1    xT        (states, time →)
Markov Decision Process
Reinforcement Learning
Markov Decision Process (MDP)
Definition: a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker.
https://en.wikipedia.org/wiki/Markov_decision_process
MDP notation
• S – set of States
• A – set of Actions
• R : S → ℝ (Reward)
• Psa – transition probabilities (p(s, a, s′) ∈ ℝ)
• γ – discount factor
MDP = (S, A, R, Psa, γ)
Discount factor γ
• the discount factor (0 ≤ γ ≤ 1) prevents the total reward from going to infinity
• it makes the agent prefer immediate rewards over rewards that are potentially received far away in the future
• E.g., two paths to the goal state: 1) a longer path that gives a higher reward, 2) a shorter path with a smaller reward; the value of γ controls which path the agent should prefer (see the sketch below)
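A purely illustrative numerical check (the path lengths and rewards below are made up, not from the slides): path A reaches a reward of 10 after 10 steps, path B a reward of 5 after 2 steps.

# Illustrative only: compare discounted returns of two hypothetical paths.
for gamma in (0.5, 0.99):
    return_a = gamma**9 * 10   # path A: reward 10 arrives at time step 9 (0-indexed)
    return_b = gamma**1 * 5    # path B: reward 5 arrives at time step 1
    print(f"gamma={gamma}: path A -> {return_a:.3f}, path B -> {return_b:.3f}")
# A small gamma (0.5) prefers the short path B; a large gamma (0.99) prefers the long path A.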
MDP (Simple example)
MDP (Simple example)
[Figure: a 3×4 grid world – rows 1–3, columns 1–4]
MDP (Simple example)
• States S = grid locations
• Actions A = {↑, ↓, ←, →}
MDP (Simple example)
• States S = grid locations
• Actions A = {↑, ↓, ←, →}
• Reward R : S → ℝ
[Grid figure: R = 0 in every free cell, +1 at the goal cell (1,4), −1 at the penalty cell (2,4)]
MDP (Simple example)
• States S = grid locations
• Actions A = {↑, ↓, ←, →}
• Reward R : S → ℝ
[Grid figure: R = −0.02 in every free cell (a small step cost), +1 at (1,4), −1 at (2,4)]
MDP (Simple example)
• States S = grid locations
• Actions A = {↑, ↓, ←, →}
• Reward R : S → ℝ
• Transition Psa
P_(3,3),↑((2,3)) = 0.8
P_(3,3),↑((3,4)) = 0.1
P_(3,3),↑((3,2)) = 0.1
P_(3,3),↑((1,3)) = 0
⋮
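A minimal Python sketch of this grid world, assuming the slip dynamics implied by the probabilities above (0.8 in the intended direction, 0.1 to each perpendicular side) and the −0.02 step cost from the previous slide; the names (GAMMA, ACTIONS, transition, …), the value γ = 0.9, and the treatment of all 12 cells as free are illustrative choices, not from the slides.

# Hypothetical encoding of the 3x4 grid-world MDP described above.
GAMMA = 0.9                                   # example discount factor; not given on the slide
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
STATES = [(r, c) for r in range(1, 4) for c in range(1, 5)]   # (row, col), rows 1-3, cols 1-4

def reward(state):
    """R(s): +1 at the goal (1,4), -1 at the penalty cell (2,4), -0.02 per step elsewhere."""
    return {(1, 4): 1.0, (2, 4): -1.0}.get(state, -0.02)

def transition(state, action):
    """P_sa(s'): 0.8 in the intended direction, 0.1 to each perpendicular side;
    a move that would leave the grid keeps the agent in place."""
    def move(s, a):
        dr, dc = ACTIONS[a]
        nxt = (s[0] + dr, s[1] + dc)
        return nxt if nxt in STATES else s
    sides = {"up": ("left", "right"), "down": ("left", "right"),
             "left": ("up", "down"), "right": ("up", "down")}[action]
    probs = {}
    for a, p in [(action, 0.8), (sides[0], 0.1), (sides[1], 0.1)]:
        s2 = move(state, a)
        probs[s2] = probs.get(s2, 0.0) + p
    return probs                               # dict: next state -> probability

# Matches the slide: transition((3, 3), "up") -> {(2, 3): 0.8, (3, 2): 0.1, (3, 4): 0.1}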
MDP – Dynamics
[Grid figure: the −0.02 grid world with a sample trajectory; each step collects −0.02]
• Start from state S0
• Choose action A0
• Transition to S1 ~ P_(S0,A0)
• Continue…
• Total payoff: R(s0) + γ R(s1) + γ² R(s2) + ⋯
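A short continuation of the hypothetical sketch above (it reuses GAMMA, ACTIONS, reward, and transition from that sketch): roll out one trajectory under a random placeholder policy and accumulate the discounted total payoff.

import random

def rollout(start=(3, 1), horizon=20):
    """Simulate S0, A0, S1, A1, ... and return the discounted total payoff."""
    state, payoff = start, 0.0
    for t in range(horizon):
        payoff += (GAMMA ** t) * reward(state)        # R(s_t) discounted by gamma^t
        if state in ((1, 4), (2, 4)):                 # stop at the +1 / -1 cells
            break
        action = random.choice(list(ACTIONS))         # placeholder policy: act at random
        probs = transition(state, action)
        state = random.choices(list(probs), weights=probs.values())[0]
    return payoff

print(rollout())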
Escape the room: two player game
6 rooms
Player 1: environment
• Chooses 5 internal connecting doors and 1 external door (the reward)
Player 2: agent
• Carry out actions (try to open N, S, E, W, or door)
• Observe next state, reward (if any)
[Figure: what Player 2 (agent) sees vs. what Player 1 (environment) sees]
How do we choose good actions?
Choosing actions in MDP
States S = grid locations, Actions A = {↑, ↓, ←, →}, Reward R : S → ℝ, Transition Psa
• Goal – choose actions so as to maximize the expected total payoff: E[ R(s0) + γ R(s1) + γ² R(s2) + ⋯ ]
• In our example:
R – get to the charging station, avoid the stairs
γ – discourages long paths; controls how much to delay reward
MDP – Policy 𝜋𝜋
• Goal – choose actions so as to maximize the expected total payoff: E[ R(s0) + γ R(s1) + γ² R(s2) + ⋯ ]
• A policy is a function π : S → A
States S = grid locations, Actions A = {↑, ↓, ←, →}, Reward R : S → ℝ, Transition Psa
[Grid figure: the grid world with +1 at (1,4) and −1 at (2,4); a policy assigns an action to each cell]
Policy Value and Q functions
Reinforcement Learning
MDP – Policy value function
States S = grid locations, Actions A = {↑, ↓, ←, →}, Reward R : S → ℝ, Transition Psa
Policy π : S → A
• Value function for policy π – V^π : S → ℝ
  = expected sum of discounted rewards
MDP – Policy value function
V^π(s) = E[ R(s0) + γ R(s1) + γ² R(s2) + ⋯ | s0 = s, π ]

This is a recursion! Bellman's equation:

V^π(s) = R(s) + γ E_{s′ ~ P_{s, π(s)}} [ V^π(s′) ]
(the expectation is over the value of the next state)
MDP – Policy value function
[Grid figure: the free cells numbered as states 1–11 (row 1: 1–4, row 2: 5–7, row 3: 8–11)]
Policy π
Solve |S| equations:
V(1) = R(1) + γ ( P_1,↓(2) V(2) + … )
V(2) = …
⋮
V(10) = R(10) + γ ( P_10,↑(6) V(6) + P_10,↑(9) V(9) + P_10,↑(11) V(11) )
⋮
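Because V^π appears on both sides of Bellman's equation, these |S| equations are linear in the unknowns V(1), …, V(11) and can be solved directly. A hedged numpy sketch with generic placeholders: R is a length-|S| reward vector and P_pi is the |S|×|S| matrix whose (s, s′) entry is P_{s, π(s)}(s′); the 3-state numbers at the end are made up.

import numpy as np

def evaluate_policy(R, P_pi, gamma):
    """Solve V = R + gamma * P_pi @ V, i.e. (I - gamma * P_pi) V = R."""
    n = len(R)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R)

# Tiny made-up example: 3 states, transition matrix of a fixed policy.
R = np.array([0.0, 0.0, 1.0])
P_pi = np.array([[0.9, 0.1, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.0, 0.0, 1.0]])
print(evaluate_policy(R, P_pi, gamma=0.9))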
Optimal value function
If the agent uses a given policy π to select actions, the corresponding value function is given by V^π(s) = E[ R(s0) + γ R(s1) + γ² R(s2) + ⋯ | s0 = s, π ] (as above).
There exists an optimal value function that has a higher value than all other value functions for all states: V*(s) = max_π V^π(s).
The optimal policy π* is the policy that corresponds to the optimal value function.
Value function vs. Q-function
For convenience, RL algorithms introduce the Q-function, which takes a state-action pair and returns a real value
Q : S × A → ℝ
The optimal Q-function Q*(s, a) is the highest expected total reward an agent can receive when starting in state s, choosing action a, and acting to maximize value thereafter.
Q*(s, a) indicates how good it is for an agent to pick action a while being in state s.
Optimal Q-function
If we know the optimal Q-function Q*(s, a), the optimal policy π* can be easily extracted by choosing the action a that gives maximum Q*(s, a) for state s: π*(s) = argmax_{a∈A} Q*(s, a)
RL Approaches
Reinforcement Learning
Reinforcement learning approaches
Value-based RL
• Estimate the optimal value function
• i.e., the maximum value achievable under any policy
• Guaranteed to converge to the optimum
Policy-based RL
• Search directly for the optimal policy
• re-define the policy at each step and compute the value according to this new policy until the policy converges
• Guaranteed to converge, often faster than value-based methods
Q-learning
• Search for the optimal Q-function
• No prior knowledge of MDP required
Value Iteration Algorithm
Given P_{s,a}(s′) = p(s′ | s, a), iteratively compute the Q and value functions using Bellman's equation.
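The algorithm itself appears as a figure on the slide; below is a hedged sketch of the standard synchronous value-iteration backup V(s) ← R(s) + γ max_a Σ_{s′} P_sa(s′) V(s′), assuming a transition model in the dict-returning style of the hypothetical grid-world sketch above (terminal states, if any, would need to be held fixed).

def value_iteration(states, actions, reward, transition, gamma, iters=100):
    """Repeatedly apply the Bellman optimality backup to every state."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: reward(s) + gamma * max(
                 sum(p * V[s2] for s2, p in transition(s, a).items())
                 for a in actions)
             for s in states}
    return V

def greedy_policy(states, actions, transition, V):
    """Extract pi(s) = argmax_a sum_{s'} P_sa(s') V(s')."""
    return {s: max(actions,
                   key=lambda a: sum(p * V[s2] for s2, p in transition(s, a).items()))
            for s in states}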
Policy Iteration Algorithm
Given P_{s,a}(s′) = p(s′ | s, a) and a policy π, iteratively compute the policy's value function and improve the policy to maximize it.
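Again the slide's figure carries the algorithm; a minimal sketch of the standard policy-iteration loop (evaluate the current policy, then improve it greedily), reusing greedy_policy from the value-iteration sketch above.

def policy_iteration(states, actions, reward, transition, gamma, eval_iters=50):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    pi = {s: next(iter(actions)) for s in states}       # arbitrary initial policy
    while True:
        # Policy evaluation: iterate V(s) <- R(s) + gamma * E_{s'~P_{s,pi(s)}}[V(s')]
        V = {s: 0.0 for s in states}
        for _ in range(eval_iters):
            V = {s: reward(s) + gamma *
                     sum(p * V[s2] for s2, p in transition(s, pi[s]).items())
                 for s in states}
        # Policy improvement: act greedily with respect to V
        new_pi = greedy_policy(states, actions, transition, V)
        if new_pi == pi:                                 # stop when the policy no longer changes
            return pi, V
        pi = new_pi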
Reinforcement learning approaches
• Optimal value function – Bellman: V*(s) = R(s) + γ max_{a∈A} E_{s′~P_sa}[ V*(s′) ] – needs Psa
• Optimal policy – needs π, Psa
• Optimal state-action value function Q – define Q : S × A → ℝ – easier!
  Bellman: Q*(s, a) = R(s) + γ max_{a′∈A} Q*(s′, a′), where s′ is the next state
Q-Learning (discrete)
Reinforcement Learning
Q-value function
Optimal Q-value function
David Silver, Deep RL Tutorial, ICML
Q-learning algorithm
The agent interacts with the environment and updates Q recursively:

Q(s, a) ← Q(s, a) + α [ r + γ max_{a′} Q(s′, a′) − Q(s, a) ]

where Q(s, a) is the current value, α is the learning rate, r the reward, γ the discount, and max_{a′} Q(s′, a′) the largest value over all possible actions in the new state s′.
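A hedged tabular sketch of this update; the environment interface (env.reset(), env.step(a) returning next state, reward, done) is an assumption in the style of OpenAI Gym, not something given on the slides.

import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular Q-learning: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    Q = defaultdict(float)                               # Q[(state, action)], defaults to 0
    for _ in range(episodes):
        s, done = env.reset(), False                     # assumed environment interface
        while not done:
            # epsilon-greedy action selection (see the next slide)
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s2, r, done = env.step(a)
            boot = 0.0 if done else gamma * max(Q[(s2, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (r + boot - Q[(s, a)])  # move Q(s,a) toward the target
            s = s2
    return Q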
Q-learning example
Goal: get from bottom left to top right
Exploration vs exploitation
• How does the agent select actions during learning? Should it trust the learned values of Q(s, a) when selecting actions, or try other actions in the hope that they may give a better reward?
• Simple ε-greedy approach: at each step, with small probability ε the agent picks a random action (explore), and with probability (1 − ε) it selects the action according to the current estimate of the Q-values (exploit)
• This is known as the exploration vs. exploitation dilemma
• The ε value can be decreased over time as the agent becomes more confident in its estimate of the Q-values
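A minimal sketch of the ε-greedy rule together with a decaying ε; the linear decay schedule and its parameters are illustrative, not from the slides.

import random

def epsilon_greedy(Q, state, actions, eps):
    """With probability eps explore (random action); otherwise exploit argmax_a Q(s,a).
    Q is a table like the one built in the Q-learning sketch above."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def decayed_eps(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Linearly anneal eps as the agent becomes more confident in its Q estimates."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)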
Continuous state
Reinforcement Learning
Continuous state – Pong
MDP for Pong
In this case, what are these?
• S – set of States
• A – set of Actions
• R : S → ℝ (Reward)
• Psa – transition probabilities (p(s, a, s′) ∈ ℝ)
Can we learn Q-value?
• Can discretize state space, but it may be too large
• Can simplify state by adding domain knowledge (e.g. paddle, ball), but it may not be available
• Instead, use a neural net to learn good features of the state!
Deep RL
Reinforcement Learning
Deep RL playing DOTA
Deep RL
• V, Q, or π can be approximated with a deep network
• Deep Q-Learning (covered today)
  • Input: state, action
  • Output: Q-value
• Alternative: learn a Policy Network
  • Input: state
  • Output: distribution over actions
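A minimal PyTorch sketch of such a Q-network. It follows the common DQN convention of taking only the state as input and producing one Q-value per action (rather than the state–action input listed above); the layer sizes and dimensions are illustrative.

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per discrete action."""
    def __init__(self, state_dim, num_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state):
        return self.net(state)                 # shape: (batch, num_actions)

# Greedy action selection from the network's Q-value estimates
q_net = QNetwork(state_dim=4, num_actions=2)
state = torch.randn(1, 4)                      # a dummy state, for illustration
action = q_net(state).argmax(dim=1).item()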
Q-value network
David Silver, Deep RL Tutorial, ICML
Deep Q-network (DQN)
David Silver, Deep RL Tutorial, ICML
DQN – Playing Atari
DQN – Playing Atari
Human level
DQN for Atari
DQN paper: www.nature.com/articles/nature14236
DQN demo: https://www.youtube.com/watch?v=iqXKQf2BOSE
DQN source code: www.sites.google.com/a/deepmind.com/dqn/
Downsides of RL
• RL is less sample-efficient than supervised learning because it involves bootstrapping, which uses an estimate of the Q-value to update the Q-value predictor
• Rewards are usually sparse, and learning requires reaching the goal by chance
• Therefore, RL might not find a solution at all if the state space is large or the task is difficult
Summary
• The goal of reinforcement learning:
  • learn to predict actions that maximize future rewards
• Markov Decision Process:
  • formalizes the RL framework as MDP = (S, A, R, Psa, γ)
• Approaches to reinforcement learning:
  • learn the value function (offline)
  • learn the optimal policy (offline)
  • learn the Q-function (online)
References
Andrew Ng’s Reinforcement Learning course, lecture 16
Andrej Karpathy’s blog post on policy gradient
http://karpathy.github.io/2016/05/31/rl/
Mnih et al., Playing Atari with Deep Reinforcement Learning (DeepMind)
https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
Intuitive explanation of deep Q-learning
https://www.nervanasys.com/demystifying-deep-reinforcement-learning/