SGN-13000/SGN-13006 Introduction to
Pattern Recognition and Machine Learning
(5 cr)
Learning in Robotics – Reinforcement Learning
Joni-Kristian Kämäräinen
October 2021
Department of Signal Processing
Tampere University of Technology
1
Material
• Lecturer’s slides and blackboard notes
• R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. 2nd ed. The MIT Press, 2018: Chapters 3–6
2
Contents
Dynamic Programming
Monte Carlo Methods
Temporal-Difference Learning
3
Dynamic Programming
Dynamic Programming
• Brute-force approach: calculate the value function vπi for a number of policies πi, i = 1, . . . , R, and in each state select the action from the policy that gives the largest value in that state
4
Iterative policy evaluation, for estimating V ≈ vπ
Require: π, the policy to be evaluated
Require: θ, a small threshold determining the accuracy of
estimation
1: Initialize V(s) arbitrarily for all s, except V(terminal) = 0
2: repeat
3: ∆← 0
4: for all s ∈ S do
5: v ← V (s)
6: V(s) ← ∑_a π(a|s) ∑_{s′,r} p(s′, r | s, a) [r + γV(s′)]
7: ∆ ← max(∆, |v − V(s)|)
8: end for
9: until ∆ < θ
(A Python sketch of this update is given at the end of these notes.)
5
Iterative policy evaluation, for estimating V ≈ vπ (determ.)
Require: π, the policy to be evaluated
Require: θ, a small threshold determining the accuracy of estimation
1: Initialize V(s) arbitrarily for all s, except V(terminal) = 0
2: repeat
3: ∆ ← 0
4: for all s ∈ S do
5: v ← V(s)
6: (r, s′) ← Environment(π(s))
7: V(s) ← r + γV(s′)
8: ∆ ← max(∆, |v − V(s)|)
9: end for
10: until ∆ < θ
6
Example 3.6 (book): vπ and qπ in golf
• Reward represents how many hits are needed (terminal: 0)
7
Problems
• Modeling the probabilities π(a|s) and p(s′, r | s, a) can be difficult for real-world problems (e.g., the world may be only partially observable)
8
Monte Carlo Methods
Monte Carlo methods
• A broad class of algorithms that rely on repeated random sampling to obtain numerical results
9
First-visit MC prediction, for estimating V ≈ vπ
Require: π, the policy to be evaluated
1: Initialize V(s) arbitrarily for all s; Returns(s) ← an empty list for all s
2: repeat
3: Generate an episode following π: S0, A0, R1, S1, A1, R2, . . . , ST−1, AT−1, RT
4: G ← 0
5: for all steps of the episode, t = T−1, T−2, . . . , 0 do
6: G ← γG + Rt+1
7: if St does not appear in S0, S1, . . . , St−1 then
8: Append G to Returns(St)
9: V(St) ← average(Returns(St))
10: end if
11: end for
12: until forever
(See the Python sketch at the end of these notes.)
10
Temporal-Difference Learning
Temporal-Difference Learning
• The most central and novel idea in reinforcement learning
• Combines the idea of learning in parts from Dynamic Programming with the idea of sampling the world from Monte Carlo methods
11
Q-learning for estimating π ≈ π∗
Require: Learning step α ∈ (0, 1]
1: Initialize Q(s, a) arbitrarily for all s and a, except Q(terminal, ·) = 0
2: for all episodes do
3: for all steps of episode do
4: Choose A from S using a policy derived from Q (e.g., greedy or random)
5: Take action A, observe R and S′
6: Q(S, A) ← Q(S, A) + α(R + γ max_a Q(S′, a) − Q(S, A))
7: S ← S′
8: If S terminal, break
9: end for
10: end for
(A Python sketch of tabular Q-learning is given at the end of these notes.)
12
Reinforcement Learning Problem
[Figure: the agent–environment interaction loop. At each step the agent observes state s_t, takes action a_t, and receives reward r_t from the environment. Goal: learn to choose actions that maximize r0 + γr1 + γ²r2 + . . . , where 0 ≤ γ < 1.]
13
Robot Example
[Figure: a grid world with goal state G. Panels show (a) r(s, a) (immediate reward) values, (b) Q(s, a) values, (c) V∗(s) values, and (d) one optimal policy.]
14
Summary
• Robots learn by exploration
• Q-learning
• Instead of programming a robot (or agent) for a specific task, program a learning procedure for the task
• Reinforcement learning is often combined with imitation learning
15
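To make the expected update of iterative policy evaluation concrete, here is a minimal Python sketch on a tiny hand-made MDP. The states, actions, rewards, transition model P and policy PI below are illustrative assumptions, not part of the lecture material.

```python
# Minimal sketch of iterative policy evaluation (expected update).
# The tiny MDP and the policy below are made up for illustration only.

GAMMA = 0.9   # discount factor
THETA = 1e-6  # accuracy threshold

STATES = ["s0", "s1", "terminal"]

# P[(s, a)] = list of (probability, next_state, reward), i.e. p(s', r | s, a)
P = {
    ("s0", "left"):  [(1.0, "s0", 0.0)],
    ("s0", "right"): [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s1", "left"):  [(1.0, "s0", 0.0)],
    ("s1", "right"): [(1.0, "terminal", 10.0)],
}

# PI[s] = {action: probability}; a uniformly random policy here
PI = {
    "s0": {"left": 0.5, "right": 0.5},
    "s1": {"left": 0.5, "right": 0.5},
}

def policy_evaluation(p, pi, gamma=GAMMA, theta=THETA):
    V = {s: 0.0 for s in STATES}  # V(terminal) stays 0
    while True:
        delta = 0.0
        for s in pi:              # sweep over non-terminal states
            v = V[s]
            # V(s) <- sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma V(s')]
            V[s] = sum(
                prob_a * sum(prob * (r + gamma * V[s2])
                             for prob, s2, r in p[(s, a)])
                for a, prob_a in pi[s].items()
            )
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            return V

if __name__ == "__main__":
    print(policy_evaluation(P, PI))
```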
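Likewise, a minimal sketch of first-visit MC prediction. The sample_episode helper stands in for whatever environment interaction is available and is an assumption, as are the toy two-state dynamics inside it.

```python
import random
from collections import defaultdict

GAMMA = 0.9

def sample_episode():
    """Assumed helper: roll out one episode under some fixed policy and
    return [(S0, R1), (S1, R2), ...]; the dynamics are illustrative only."""
    episode, s = [], "s0"
    while s != "terminal":
        if s == "s0":
            s_next, r = ("s1", 1.0) if random.random() < 0.5 else ("s0", 0.0)
        else:  # s == "s1"
            s_next, r = ("terminal", 10.0) if random.random() < 0.5 else ("s0", 0.0)
        episode.append((s, r))
        s = s_next
    return episode

def first_visit_mc(num_episodes=10_000, gamma=GAMMA):
    returns = defaultdict(list)
    V = defaultdict(float)
    for _ in range(num_episodes):
        episode = sample_episode()              # pairs (S_t, R_{t+1})
        states = [s for s, _ in episode]
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):   # t = T-1, ..., 0
            s, r = episode[t]
            G = gamma * G + r                       # G <- gamma*G + R_{t+1}
            if s not in states[:t]:                 # first visit of s only
                returns[s].append(G)
                V[s] = sum(returns[s]) / len(returns[s])
    return dict(V)

if __name__ == "__main__":
    print(first_visit_mc())
```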
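Finally, a minimal sketch of tabular Q-learning with ε-greedy action selection. The step function is an assumed stand-in for a real environment, and the two-state chain and its rewards are illustrative only.

```python
import random
from collections import defaultdict

GAMMA = 0.9
ALPHA = 0.1      # learning step alpha in (0, 1]
EPSILON = 0.1    # exploration rate
ACTIONS = ["left", "right"]

def step(s, a):
    """Assumed stand-in for the environment: returns (reward, next_state)."""
    if s == "s0":
        return (1.0, "s1") if a == "right" else (0.0, "s0")
    else:  # s == "s1"
        return (10.0, "terminal") if a == "right" else (0.0, "s0")

def q_learning(num_episodes=5_000):
    Q = defaultdict(float)   # Q[(s, a)]; Q(terminal, .) stays 0
    for _ in range(num_episodes):
        s = "s0"
        while s != "terminal":
            # epsilon-greedy action derived from Q
            if random.random() < EPSILON:
                a = random.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: Q[(s, act)])
            r, s_next = step(s, a)
            # Q(S,A) <- Q(S,A) + alpha (R + gamma max_a Q(S',a) - Q(S,A))
            target = r + GAMMA * max(Q[(s_next, act)] for act in ACTIONS)
            Q[(s, a)] += ALPHA * (target - Q[(s, a)])
            s = s_next
    return dict(Q)

if __name__ == "__main__":
    Q = q_learning()
    greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in ["s0", "s1"]}
    print(greedy)   # greedy policy read off the learned Q-values
```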