
SGN-13000/SGN-13006 Introduction to Pattern Recognition and Machine Learning (5 cr)

Learning in Robotics – Reinforcement Learning

Joni-Kristian Kämäräinen

October 2021

Department of Signal Processing

Tampere University of Technology


Material

• Lecturer’s slides and blackboard notes

• R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. 2nd ed. The MIT Press, 2018: Chapters 3–6


Contents

Dynamic Programming

Monte Carlo Methods

Temporal-Difference Learning


Dynamic Programming

Dynamic Programming

• Brute-force approach – calculate the value functions vπi for a number of policies πi, i = 1, . . . , R, and for each state select the action from the policy that provides the largest value in that state (see the sketch below)

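The following slides formalize how a single policy is evaluated. As a minimal illustration (added to these notes, not part of the original slides), the iterative update V(s) ← ∑_a π(a|s) ∑_{s′,r} p(s′, r|s, a)(r + γV(s′)) can be sketched in Python, assuming a hypothetical tabular model P[s][a] given as (prob, next_state, reward) triples and a policy table policy[s][a] = π(a|s):

# Minimal iterative policy evaluation sketch (illustration only, not from the
# slides). Assumed, hypothetical data layout:
#   P[s][a] = list of (prob, next_state, reward) triples, i.e. p(s', r | s, a)
#   policy[s][a] = pi(a | s)
# Terminal states are simply absent from P, so their value stays 0.
def policy_evaluation(P, policy, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in P}                   # V(s), arbitrary init (here 0)
    while True:
        delta = 0.0
        for s in P:
            v = V[s]
            # V(s) <- sum_a pi(a|s) sum_{s',r} p(s',r|s,a) * (r + gamma * V(s'))
            V[s] = sum(
                policy[s][a] * sum(p * (r + gamma * V.get(s2, 0.0))
                                   for p, s2, r in P[s][a])
                for a in P[s])
            delta = max(delta, abs(v - V[s]))
        if delta < theta:                     # stop once a full sweep changes V by less than theta
            return V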

Iterative policy evaluation, for estimating V ≈ vπ

Require: π, the policy to be evaluated
Require: θ, a small threshold determining the accuracy of estimation

1: Initialize V(s) arbitrarily, for all s, except V(terminal) = 0
2: repeat
3: ∆ ← 0
4: for all s ∈ S do
5: v ← V(s)
6: V(s) ← ∑_a π(a|s) ∑_{s′,r} p(s′, r | s, a)(r + γV(s′))
7: ∆ ← max(∆, |v − V(s)|)
8: end for
9: until ∆ < θ

Iterative policy evaluation, for estimating V ≈ vπ (deterministic)

Require: π, the policy to be evaluated
Require: θ, a small threshold determining the accuracy of estimation

1: Initialize V(s) arbitrarily, for all s, except V(terminal) = 0
2: repeat
3: ∆ ← 0
4: for all s ∈ S do
5: v ← V(s)
6: (r, s′) ← Environment(π(s))
7: V(s) ← r + γV(s′)
8: ∆ ← max(∆, |v − V(s)|)
9: end for
10: until ∆ < θ

Example 3.6 (book): vπ and qπ in golf

• Reward represents how many hits are needed (terminal: 0)

Problems

• Modeling the probabilities π(a|s) and p(s′, r | s, a) can be difficult for real-world problems (e.g., the environment may be only partially observable)

Monte Carlo Methods

Monte Carlo methods

• A broad class of algorithms that rely on repeated random sampling to obtain numerical results

First-visit MC prediction, for estimating V ≈ vπ

Require: π, the policy to be evaluated

1: Initialize V(s) arbitrarily for all s; Returns(s) ← an empty list, for all s
2: repeat
3: Generate an episode following π: S0, A0, R1, S1, A1, R2, . . . , ST−1, AT−1, RT
4: G ← 0
5: for all steps of the episode, t = T − 1, T − 2, . . . , 0 do
6: G ← γG + Rt+1
7: if St does not appear in S0, S1, . . . , St−1 then
8: Append G to Returns(St)
9: V(St) ← average(Returns(St))
10: end if
11: end for
12: until forever

Temporal-Difference Learning

Temporal-Difference Learning

• The most central and novel idea in reinforcement learning
• Based on the ideas of learning in parts (as in Dynamic Programming) and of sampling the world (as in Monte Carlo methods)

Q-learning for estimating π ≈ π∗

Require: Learning step α ∈ (0, 1]

1: Initialize Q(s, a) arbitrarily for all s and a, except Q(terminal, ·) = 0
2: for all episodes do
3: for all steps of the episode do
4: Choose A from S using a policy derived from Q (e.g., ε-greedy: greedy with occasional random exploration)
5: Take action A, observe R and S′
6: Q(S, A) ← Q(S, A) + α(R + γ max_a Q(S′, a) − Q(S, A))
7: S ← S′
8: if S is terminal, break
9: end for
10: end for

Reinforcement Learning Problem

[Figure: agent–environment interaction loop – the agent observes the state and reward from the environment and sends back an action, producing the sequence s0, a0, r0, s1, a1, r1, s2, a2, r2, . . .]

Goal: Learn to choose actions that maximize r0 + γr1 + γ²r2 + . . . , where 0 ≤ γ < 1

Robot Example

[Figure: grid world with goal state G – (a) r(s, a) (immediate reward) values, (b) Q(s, a) values, (c) V∗(s) values, (d) one optimal policy]

Summary

• Robots learn by exploration
• Q-learning
• Instead of programming a robot (or agent) for a specific task, program a learning procedure for the task
• Reinforcement learning is often combined with imitation learning
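To make the First-visit MC prediction pseudocode above concrete, here is a minimal Python sketch (an illustration added to these notes, not part of the original slides). It assumes a hypothetical episodic environment with env.reset() returning a state and env.step(action) returning (next_state, reward, done), and a policy given as a function from states to actions:

from collections import defaultdict

# Minimal first-visit Monte Carlo prediction sketch (illustration only).
# Assumed interfaces: env.reset() -> state, env.step(a) -> (next_state, reward, done)
def first_visit_mc_prediction(env, policy, num_episodes=1000, gamma=0.9):
    returns = defaultdict(list)        # Returns(s): list of observed returns per state
    V = defaultdict(float)             # V(s), initialized to 0
    for _ in range(num_episodes):
        # Generate an episode following pi: (S_t, A_t, R_{t+1}) per step
        episode = []
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        # Work backwards through the episode: G <- gamma*G + R_{t+1}
        G = 0.0
        for t in reversed(range(len(episode))):
            s_t, _, r_tp1 = episode[t]
            G = gamma * G + r_tp1
            # First-visit check: update only if S_t does not appear earlier in the episode
            if s_t not in (step[0] for step in episode[:t]):
                returns[s_t].append(G)
                V[s_t] = sum(returns[s_t]) / len(returns[s_t])
    return V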
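The Q-learning update above can be sketched in the same way. The following tabular version is likewise an illustration under assumed interfaces (env.reset(), env.step(action) returning (next_state, reward, done), and a finite action set env.actions), not the slides' own code:

import random
from collections import defaultdict

# Minimal tabular Q-learning sketch (illustration only).
# Assumed interfaces: env.reset(), env.step(a) -> (next_state, reward, done), env.actions
def q_learning(env, num_episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                       # Q(s, a), initialized to 0
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            # Choose A from S using an epsilon-greedy policy derived from Q
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            # Take action A, observe R and S'
            next_state, reward, done = env.step(action)
            # Q(S,A) <- Q(S,A) + alpha * (R + gamma * max_a Q(S',a) - Q(S,A)),
            # with Q(terminal, .) = 0 handled by the 'done' branch
            target = reward if done else reward + gamma * max(
                Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q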