CMPUT 366 F20: Reinforcement Learning III
Vadim Bulitko & James Wright
October 1, 2020

Lecture Outline
Reinforcement Learning (RL)
SB 3.0-3.7, 4.0-4.4

How do we compute π∗?
Naive approach: enumerate all possible policies π, compute Vπ for each, and select the best π
Problems?
From π∗ to V∗:
∀s: V∗(s) = Vπ∗(s)
From V∗ to π∗:
∀s: π∗(s) = argmax_a Σ_{s′,r} p(s′, r | s, a) [r + γV∗(s′)]
So if someone gave us V∗, we could act greedily with respect to it and obtain π∗
How do we compute V∗?
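As a concrete illustration of acting greedily with respect to a value function, here is a minimal Python sketch. The tabular layout is an assumption, not anything from the slides: states and actions are iterables, V is a dict from state to value, and p[(s, a)] is a list of (probability, next state, reward) triples.

```python
def greedy_policy(V, states, actions, p, gamma):
    """Extract the policy that is greedy with respect to V:
    pi(s) = argmax_a sum_{s',r} p(s',r|s,a) * (r + gamma * V[s'])."""
    pi = {}
    for s in states:
        # One-step lookahead value of each action in s
        q = {a: sum(prob * (r + gamma * V[s_next])
                    for prob, s_next, r in p[(s, a)])
             for a in actions}
        pi[s] = max(q, key=q.get)  # argmax over actions
    return pi
```

If V = V∗, the resulting deterministic policy is an optimal policy π∗.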

How do we compute V∗?
By solving Bellman optimality equations:
 ∀s V∗(s) = max 􏰆 p(s′, r|s, a) 􏰎r + γV∗(s′)􏰏
 as′,r 

Solving (Acting Optimally in) the Gridworld (Example 3.8 in SB)

V∗ and Heuristics
Consider pathfinding on a grid:
four cardinal moves
episode ends when the goal tile is reached
γ = 1 (no discounting)
p(s′, r | s, a) is deterministic and gives a reward r = −1 for all moves
Optimal value function:
∀s: V∗(s) = max_a Σ_{s′,r} p(s′, r | s, a) [r + γV∗(s′)]
What is V∗ for this grid?
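Since every move costs −1 and γ = 1, V∗(s) is the negative of the minimum number of moves from s to the goal, i.e., the negative of the perfect heuristic. A minimal sketch that computes it by breadth-first search from the goal; the (row, col) cell representation and the free_cells set are assumptions for illustration:

```python
from collections import deque

def v_star_grid(free_cells, goal, moves=((1, 0), (-1, 0), (0, 1), (0, -1))):
    """V* for deterministic grid pathfinding with r = -1 per move and
    gamma = 1: V*(s) = -(shortest number of moves from s to the goal).

    free_cells: set of (row, col) tuples the agent may occupy
    goal: (row, col) of the goal tile
    """
    V = {goal: 0}                      # V*(goal) = 0: the episode is over
    frontier = deque([goal])
    while frontier:                    # breadth-first search outward from the goal
        s = frontier.popleft()
        for dr, dc in moves:
            s_prev = (s[0] + dr, s[1] + dc)
            if s_prev in free_cells and s_prev not in V:
                V[s_prev] = V[s] - 1   # one more -1 step than its successor
                frontier.append(s_prev)
    return V
```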

How do we compute V∗?
By solving the Bellman optimality equations:
∀s: V∗(s) = max_a Σ_{s′,r} p(s′, r | s, a) [r + γV∗(s′)]
Problems:
the environment needs to be Markovian
need to know p(s′, r | s, a) for all states and actions
computational resources
In practice we (incrementally) compute an approximation to V∗ and/or π∗ using:
policy evaluation
policy improvement

Policy Evaluation: π → Vπ
Need to solve a system of Bellman equations
 ∀s Vπ(s) = 􏰆 π(a|s) 􏰆 p(s′, r|s, a) 􏰎r + γVπ(s′)􏰏
 a s′,r  Can do so iteratively (k = à, , . . . ) starting with an arbitrary Và:
 ∀s Vk+(s) = 􏰆 π(a|s) 􏰆 p(s′, r|s, a) 􏰎r + γVk(s′)􏰏
 a s′,r  Vk converges to Vπ as k → ∞
the updates are expected since 􏰈 p(s′, r|s, a) [r + γVk(s′)] runs over all states s′ ,r
s′ and uses the environment model p(s′, r|s, a) to compute the expectation CMPUT 366 F20: Reinforcement Learning III 8
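A minimal sketch of these sweeps, under the same assumed tabular layout as the earlier snippet, with π given as π[s][a] action probabilities; it assumes terminal states (if any) are modeled as self-loops with reward 0:

```python
def evaluate_policy(pi, states, actions, p, gamma, theta=1e-8):
    """Iterative policy evaluation with two arrays (synchronous sweeps).

    pi: pi[s][a] = probability of taking action a in state s
    p:  p[(s, a)] is a list of (prob, s_next, r) triples
    Applies V_{k+1}(s) = sum_a pi(a|s) sum_{s',r} p(s',r|s,a)(r + gamma V_k(s'))
    until the largest change across states falls below theta.
    """
    V = {s: 0.0 for s in states}       # arbitrary V_0
    while True:
        V_new = {}                     # separate array for V_{k+1}
        for s in states:
            V_new[s] = sum(pi[s][a] * sum(prob * (r + gamma * V[s_next])
                                          for prob, s_next, r in p[(s, a)])
                           for a in actions)
        if max(abs(V_new[s] - V[s]) for s in states) < theta:
            return V_new
        V = V_new
```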

Policy Evaluation: π → Vπ
Normally we would use two arrays: one for V_k and one for V_{k+1}
Can also do the update from V_k to V_{k+1} in place:
Order matters for convergence speed
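For comparison, a sketch of the in-place variant (same assumed layout): a single array is overwritten during the sweep, so updates later in the sweep already see the newer values, which is why the sweep order affects convergence speed.

```python
def evaluate_policy_in_place(pi, states, actions, p, gamma, theta=1e-8):
    """In-place iterative policy evaluation: one array V is updated
    as we sweep, typically converging in fewer sweeps."""
    V = {s: 0.0 for s in states}
    delta = theta + 1.0
    while delta >= theta:
        delta = 0.0
        for s in states:               # sweep order matters here
            v_old = V[s]
            V[s] = sum(pi[s][a] * sum(prob * (r + gamma * V[s_next])
                                      for prob, s_next, r in p[(s, a)])
                       for a in actions)
            delta = max(delta, abs(V[s] - v_old))
    return V
```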

Iterative Policy Evaluation for the Gridworld

Policy Improvement: π → π′
Why evaluate π?
So that we can improve it
The following is for deterministic policies π: in each state the agent takes action a = π(s)
if we follow π from state s onward, we get an expected return of Vπ(s)
can we do better by taking a (different) action a in s (and then following π)?
mathematically: is Σ_{s′,r} p(s′, r | s, a) [r + γVπ(s′)] > Vπ(s)?
Policy improvement theorem: if π and π′ are deterministic policies and
∀s: Σ_{s′,r} p(s′, r | s, π′(s)) [r + γVπ(s′)] ≥ Vπ(s)
then π′ is better than or equal to π in all states: ∀s: Vπ′(s) ≥ Vπ(s)
In the case above, π′ = π in all states except the one state s where π′(s) = a ≠ π(s). So if Σ_{s′,r} p(s′, r | s, a) [r + γVπ(s′)] > Vπ(s), then π′ is better than or equal to π in all states (and strictly better in s)
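A sketch of one full greedy improvement step built on this theorem (same assumed layout; the deterministic π is stored as a dict from state to action):

```python
def improve_policy(pi, V, states, actions, p, gamma):
    """One policy improvement step: in every state, switch to the action
    with the best one-step lookahead value under V = V_pi.

    Returns (pi_new, stable), where stable is True if no action changed.
    """
    pi_new, stable = {}, True
    for s in states:
        q = {a: sum(prob * (r + gamma * V[s_next])
                    for prob, s_next, r in p[(s, a)])
             for a in actions}
        pi_new[s] = max(q, key=q.get)
        if pi_new[s] != pi[s]:
            stable = False             # found a strictly better action in s
    return pi_new, stable
```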

Policy Iteration
Start with an arbitrary policy π_0
Evaluate a given policy πk (i.e., compute its Vπk )
Greedily improve it: ∀s: π_{k+1}(s) = argmax_a Σ_{s′,r} p(s′, r | s, a) [r + γVπ_k(s′)]
Repeat until convergence
Converges to π∗
Can be generalized to stochastic policies (SB 4.2)
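Putting the earlier sketches together, the whole loop might look like this (evaluate_policy and improve_policy are the hypothetical helpers defined above; the deterministic π is converted to action probabilities for evaluation):

```python
def policy_iteration(pi, states, actions, p, gamma):
    """Alternate full policy evaluation and greedy improvement; stop when
    the policy is stable, at which point it is an optimal policy pi*."""
    while True:
        # Represent the deterministic pi as action probabilities
        probs = {s: {a: 1.0 if a == pi[s] else 0.0 for a in actions}
                 for s in states}
        V = evaluate_policy(probs, states, actions, p, gamma)  # V_{pi_k}
        pi, stable = improve_policy(pi, V, states, actions, p, gamma)
        if stable:
            return pi, V
```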

Policy Iteration Example (SB Example 4.1)
π_0 is random
π_1 is greedy with respect to Vπ_0
π_1 is guaranteed to be an improvement over π_0; in fact, π_1 is optimal

Policy Iteration

Problem with Policy Iteration
Policy evaluation can take a long time
Do we need to know the exact value of Vπ in each state to improve on π?

Value Iteration

Value Iteration and Bellman Optimality Equation
 ∀s V∗(s) = max 􏰆 p(s′, r|s, a) 􏰎r + γV∗(s′)􏰏
 as′,r  CMPUT 366 F20: Reinforcement Learning III
20
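Value iteration turns this fixed-point equation into an update: each sweep replaces V(s) with max_a Σ_{s′,r} p(s′, r | s, a) [r + γV(s′)], i.e., policy evaluation truncated to a single sweep with a max over actions folded in. A minimal sketch under the same assumed layout:

```python
def value_iteration(states, actions, p, gamma, theta=1e-8):
    """Value iteration: in-place sweeps of the Bellman optimality update
    V(s) <- max_a sum_{s',r} p(s',r|s,a) * (r + gamma * V(s'))
    until the largest change across states falls below theta."""
    V = {s: 0.0 for s in states}
    delta = theta + 1.0
    while delta >= theta:
        delta = 0.0
        for s in states:
            v_old = V[s]
            V[s] = max(sum(prob * (r + gamma * V[s_next])
                           for prob, s_next, r in p[(s, a)])
                       for a in actions)
            delta = max(delta, abs(V[s] - v_old))
    return V
```

Once V has (approximately) converged to V∗, acting greedily with respect to it (e.g., with greedy_policy above) yields an optimal policy.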

Next time: Online Reinforcement Learning
So far the RL methods have been offline:
dynamic programming over the entire state space
akin to non-real-time heuristic search (e.g., A*)
Problems:
need to know the whole world: p(s′, r | s, a)
computationally very expensive (time and memory): sweeping over the whole space
aiming for an optimal solution
Solution:
settle for suboptimal solutions
find a good (suboptimal) policy for the states that matter: overfit to life
agent-centered: learn as we go, akin to real-time heuristic search
the agent may not know p(s′, r | s, a), but it knows its own trajectory: [s → s′ via a, got r]_t