CMPUT 366 F20: Reinforcement Learning III
Vadim Bulitko & James Wright
October 1, 2020
Lecture Outline
Reinforcement Learning (RL)
SB 3.0-3.7, 4.0-4.4
How do we compute π∗?
Consider all possible policies π, compute Vπ for each, and select the best π
Problems?
From π∗ to V∗:
∀s: V∗(s) = Vπ∗(s)
From V∗ to π∗:
∀s: π∗(s) = argmax_a Σ_{s′,r} p(s′, r|s, a) [r + γV∗(s′)]
So if someone gave us V∗ we could act greedily with respect to it and get π∗ (a small sketch of this step is below)
How do we compute V∗?
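A minimal sketch of that greedy extraction step, assuming the model is stored as a dictionary p[s][a] giving a list of (s_next, r, prob) triples and V∗ as a dictionary over states; all names (greedy_policy_from_v, V_star, gamma) are illustrative, not from the slides:

```python
# Minimal sketch: extract a greedy (optimal) policy from V*, assuming a
# tabular model p[s][a] -> list of (s_next, r, prob). Names are illustrative.
def greedy_policy_from_v(V_star, p, gamma, states, actions):
    """Return pi*(s) = argmax_a sum_{s',r} p(s',r|s,a) [r + gamma * V*(s')]."""
    pi = {}
    for s in states:
        # One-step lookahead through the model for every action
        q = {a: sum(prob * (r + gamma * V_star[s_next])
                    for (s_next, r, prob) in p[s][a])
             for a in actions}
        pi[s] = max(q, key=q.get)   # pick the action with the largest backup
    return pi
```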
How do we compute V∗?
By solving Bellman optimality equations:
∀s: V∗(s) = max_a Σ_{s′,r} p(s′, r|s, a) [r + γV∗(s′)]
Solving (Acting Optimally in) the Gridworld (Example 3.8 in SB)
V∗ and Heuristics
Consider pathfinding on a grid:
four cardinal moves
episode ends when the goal tile is reached
γ = 1 (no discounting)
p(s′, r|s, a) is deterministic and gives a reward of r = −1 for every move
Optimal value function:
V∗(s) = max_a Σ_{s′,r} p(s′, r|s, a) [r + γV∗(s′)]
What is V∗ for this grid?
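Under these assumptions (deterministic moves, r = −1 per step, γ = 1), V∗(s) is simply the negative of the number of moves on a shortest path from s to the goal, so a breadth-first search outward from the goal computes it exactly. A minimal sketch; the 4x4 grid and all names are illustrative:

```python
# Minimal sketch: with deterministic moves, r = -1 per step and gamma = 1,
# V*(s) = -(shortest number of moves from s to the goal), so a breadth-first
# search from the goal computes V* exactly. The grid here is illustrative.
from collections import deque

def v_star_grid(width, height, goal, blocked=frozenset()):
    moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]   # four cardinal moves
    V = {goal: 0.0}
    frontier = deque([goal])
    while frontier:
        x, y = frontier.popleft()
        for dx, dy in moves:
            s = (x + dx, y + dy)
            if (0 <= s[0] < width and 0 <= s[1] < height
                    and s not in blocked and s not in V):
                V[s] = V[(x, y)] - 1.0           # one more move from the goal
                frontier.append(s)
    return V

# Example: on a 4x4 grid with the goal in a corner, V* ranges from 0 to -6
print(v_star_grid(4, 4, goal=(3, 3)))
```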
How do we compute V∗?
By solving Bellman optimality equations:
∀s: V∗(s) = max_a Σ_{s′,r} p(s′, r|s, a) [r + γV∗(s′)]
Problems:
the environment needs to be Markovian
need to know p(s′, r|s, a) for all states and actions
computational resources
In practice we (incrementally) compute an approximation to V∗ and/or π∗:
policy evaluation
policy improvement
Policy Evaluation: π → Vπ
Need to solve a system of Bellman equations
∀s: Vπ(s) = Σ_a π(a|s) Σ_{s′,r} p(s′, r|s, a) [r + γVπ(s′)]
Can do so iteratively (k = 0, 1, . . . ) starting with an arbitrary V0:
∀s: Vk+1(s) = Σ_a π(a|s) Σ_{s′,r} p(s′, r|s, a) [r + γVk(s′)]
Vk converges to Vπ as k → ∞
the updates are expected updates: Σ_{s′,r} p(s′, r|s, a) [r + γVk(s′)] sums over all successor states s′ and uses the environment model p(s′, r|s, a) to compute the expectation
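A minimal sketch of this iterative scheme with two arrays (one for Vk, one for Vk+1), assuming the same illustrative tabular model p[s][a] giving (s_next, r, prob) triples, a stochastic policy pi[s][a] giving probabilities, and terminal states modeled as absorbing with reward 0; all names are illustrative:

```python
# Minimal sketch of iterative policy evaluation with two arrays (V_k, V_{k+1}).
def policy_evaluation(pi, p, gamma, states, actions, theta=1e-8):
    V = {s: 0.0 for s in states}          # arbitrary V_0
    while True:
        V_new = {}
        for s in states:
            # Expected update: weight each action by pi(a|s) and each
            # outcome by p(s', r | s, a)
            V_new[s] = sum(pi[s][a] * sum(prob * (r + gamma * V[s_next])
                                          for (s_next, r, prob) in p[s][a])
                           for a in actions)
        delta = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new                         # the full sweep gives V_{k+1}
        if delta < theta:                 # stop once a sweep changes very little
            return V
```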
Policy Evaluation: π → Vπ
Normally would use two arrays: one for Vk and one for Vk+1
Can also do the update from Vk to Vk+1 in place:
Order matters for convergence speed
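A minimal sketch of the in-place variant under the same illustrative model: a single array is updated during the sweep, so later updates within the same sweep already see fresh values, and the sweep order affects how quickly the values converge:

```python
# Minimal sketch of in-place iterative policy evaluation (single array).
def policy_evaluation_in_place(pi, p, gamma, states, actions, theta=1e-8):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:                  # sweep order affects convergence speed
            v_old = V[s]
            V[s] = sum(pi[s][a] * sum(prob * (r + gamma * V[s_next])
                                      for (s_next, r, prob) in p[s][a])
                       for a in actions)
            delta = max(delta, abs(V[s] - v_old))
        if delta < theta:
            return V
```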
Iterative Policy Evaluation for the Gridworld
Policy Improvement: π → π′
Why evaluate π?
So that we can improve it
The following is for deterministic policies π: in each state the agent takes action a = π(s)
if we follow π from state s onward we get an expected return of Vπ(s)
can we do better by taking (another) action a in s (and then following π)?
mathematically: is Σ_{s′,r} p(s′, r|s, a) [r + γVπ(s′)] > Vπ(s)?
Policy improvement theorem: if π and π′ are deterministic policies and
∀s: Σ_{s′,r} p(s′, r|s, π′(s)) [r + γVπ(s′)] ≥ Vπ(s)
then π′ is better than or equal to π in all states: ∀s: Vπ′(s) ≥ Vπ(s)
In the case above π′ = π in all states except the one state s where π′(s) = a ≠ π(s). So if Σ_{s′,r} p(s′, r|s, a) [r + γVπ(s′)] > Vπ(s) then π′ is better than or equal to π in all states (and strictly better in s)
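As a minimal sketch, the condition in the theorem can be checked with a one-step lookahead through the model; the helper below and its names are illustrative, using the same assumed p[s][a] representation as earlier:

```python
# Minimal sketch of the improvement test: does switching to action a in state s
# (and then following pi) beat V_pi(s)?
def improves(s, a, V_pi, p, gamma):
    lookahead = sum(prob * (r + gamma * V_pi[s_next])
                    for (s_next, r, prob) in p[s][a])
    return lookahead > V_pi[s]            # strict improvement in state s
```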
Policy Iteration
Start with an arbitrary policy π0
Evaluate a given policy πk (i.e., compute its Vπk )
Greedily improve it: ∀s: πk+1(s) = argmax_a Σ_{s′,r} p(s′, r|s, a) [r + γVπk(s′)]
Repeat until convergence
Converges to π∗
Can be generalized to stochastic policies (SB 4.2)
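Putting the two steps together, a minimal sketch of the full loop, reusing the illustrative policy_evaluation sketch above and storing deterministic policies as action-probability tables; all names are illustrative:

```python
# Minimal sketch of policy iteration: evaluate pi_k, then greedily improve it,
# until the policy stops changing.
def policy_iteration(p, gamma, states, actions):
    # pi_0: an arbitrary deterministic policy, written as action probabilities
    pi = {s: {a: (1.0 if a == actions[0] else 0.0) for a in actions}
          for s in states}
    while True:
        V = policy_evaluation(pi, p, gamma, states, actions)   # evaluate pi_k
        stable = True
        for s in states:
            q = {a: sum(prob * (r + gamma * V[s_next])
                        for (s_next, r, prob) in p[s][a])
                 for a in actions}
            best = max(q, key=q.get)                            # greedy action
            if pi[s][best] != 1.0:                              # policy changed?
                stable = False
            pi[s] = {a: (1.0 if a == best else 0.0) for a in actions}
        if stable:                        # greedy w.r.t. its own value: pi = pi*
            return pi, V
```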
Policy Iteration Example (SB Example 4.1)
π0 is random
π1 is greedy with respect to Vπ0
π1 is guaranteed to be an improvement over π0; in fact, π1 is optimal
Policy Iteration
Problem with Policy Iteration
Policy evaluation can take a long time
Do we need to know the exact value of Vπ in each state to improve on π?
Value Iteration
Value Iteration and Bellman Optimality Equation
∀s: V∗(s) = max_a Σ_{s′,r} p(s′, r|s, a) [r + γV∗(s′)]
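A minimal sketch of value iteration under the same illustrative tabular model: the Bellman optimality equation is turned into an update rule and applied in sweeps until the values stop changing; names are illustrative:

```python
# Minimal sketch of value iteration: the max over actions replaces the
# expectation over pi(a|s) used in policy evaluation.
def value_iteration(p, gamma, states, actions, theta=1e-8):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_old = V[s]
            V[s] = max(sum(prob * (r + gamma * V[s_next])
                           for (s_next, r, prob) in p[s][a])
                       for a in actions)   # Bellman optimality backup
            delta = max(delta, abs(V[s] - v_old))
        if delta < theta:
            return V                       # approximately V*; act greedily for pi*
```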
Next time: Online Reinforcement Learning
So far RL methods were offline
dynamic programming over the entire state space
akin to non-real-time heuristic search (e.g., A*)
Problems:
need to know the whole world: p(s′, r|s, a)
computationally very expensive (time and memory)
sweeping over the whole state space
aiming for an optimal solution
Solution:
settle for suboptimal solutions
find a good (suboptimal) policy for the states that matter: overfit to life
agent-centered: learn as we go, akin to real-time heuristic search
agent may not know p(s′, r|s, a) but it knows its own trajectory: [s → s′ via a, got r]t