CMPUT 366 F20: Reinforcement Learning IV
James Wright & Vadim Bulitko
October 6, 2020
Lecture Outline
Reinforcement Learning (RL)
SB 4.4, 5.0-5.5
Value Iteration
Value Iteration and Bellman Optimality Equation
∀s: V∗(s) = max_a Σ_{s′,r} p(s′, r|s, a) [r + γV∗(s′)]
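A minimal sketch of one way to turn this update into a value-iteration sweep, assuming the dynamics p(s′, r|s, a) are known; the container format (dynamics[s][a] as a list of (probability, next state, reward) triples), the tolerance theta, and the other names are illustrative assumptions rather than anything prescribed in the lecture:

# Value-iteration sketch (assumes the dynamics p(s', r | s, a) are known).
# dynamics[s][a] is assumed to be a list of (prob, next_state, reward) triples.
def value_iteration(states, actions, dynamics, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}             # initialize V(s) arbitrarily (here: 0)
    while True:
        delta = 0.0
        for s in states:
            # Bellman optimality backup: max over actions of the expected one-step return
            best = max(
                sum(prob * (r + gamma * V[s_next])
                    for prob, s_next, r in dynamics[s][a])
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:                     # stop once a full sweep barely changes V
            return V

Acting greedily with respect to the converged values (and the same dynamics) then gives an optimal policy.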
Experience-based Reinforcement Learning
So far, the RL methods we have seen were offline:
dynamic programming over the entire state space, akin to non-real-time heuristic search (e.g., A*)
Problems:
need to know the world dynamics: p(s′, r|s, a)
computationally very expensive (time and memory): sweeping over the whole space, aiming for an optimal solution
Solution ideas:
learn from the agent’s experience, not from p(s′, r|s, a)
the agent may not know p(s′, r|s, a), but it knows its own trajectory: [s → s′ via a, got r]t (see the sketch below)
settle for suboptimal solutions
find a good (suboptimal) policy for the states that matter: overfit to life
agent-centered: learn as we go, akin to real-time heuristic search
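As a small illustration of the trajectory point above, the agent’s experience can be recorded as plain data without ever touching p(s′, r|s, a); the Transition container and the episode list below are illustrative assumptions:

from collections import namedtuple

# One recorded step of experience: [s -> s' via a, got r]
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])

episode = []   # one episode = the ordered list of transitions the agent experienced
# inside the agent's acting loop (illustrative):
#     episode.append(Transition(s, a, r, s_next))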
Monte-Carlo Methods
Need to compute the expected return Vπ(s)
so that we can act greedily with respect to it and obtain an improved policy
Instead of using the world model p(s′, r|s, a):
Vπ(s) = Eπ[Gt | St = s]
      = Σ_a π(a|s) Σ_{snext,r} p(snext, r|s, a) [r + γVπ(snext)]
Average the returns collected by the agent from state s until the end of the episode, over K episodes:
Vπ(s) = Eπ[Gt | St = s]
      ≈ (1/K) Σ_{k=1}^{K} Gᵏt,  where Gᵏt is the return Gt observed in episode k with St = s
Monte-Carlo Methods
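A minimal first-visit Monte-Carlo policy-evaluation sketch of the averaging idea above; the episode format (a list of (state, action, reward) steps generated while following π) and every name below are illustrative assumptions, not the lecture’s own code:

# First-visit Monte-Carlo policy evaluation sketch.
# Each episode is assumed to be a list of (state, action, reward) steps.
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma=0.9):
    returns = defaultdict(list)            # first-visit returns observed per state
    for episode in episodes:
        first_visit = {}                   # state -> earliest time step at which it occurs
        for t, (s, _, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, _, r = episode[t]
            G = r + gamma * G              # G_t = r_{t+1} + gamma * G_{t+1}
            if first_visit[s] == t:        # only the first visit to s contributes
                returns[s].append(G)
    # average the collected returns: V_pi(s) is estimated by (1/K) * sum_k G_t^k
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

Note that each state’s estimate here depends only on returns observed from that state, which is the independence property contrasted with dynamic programming below.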
Monte-Carlo versus Dynamic Programming
Iterative policy evaluation uses the estimates of the next state’s value to update the value of this state
Only needs to compute a single transition to update a state’s estimate
The Monte Carlo estimate of each state’s value is independent of the estimates of other states’ values
Needs the entire episode to compute an update
Can focus on evaluating a subset of states if desired
Estimating State-action Values
When we know the dynamics p(s′, r|s, a), an estimate of state values is sufficient to determine a good policy:
greedily choose the action that gives the best expectation of the immediate reward and the next-state value:
argmax_a Σ_{s′,r} p(s′, r|s, a) [r + γVπ(s′)]
If we do not know the dynamics, state values are not enough
to improve on π, we will need an explicit estimate of state-action values:
Qπ (s, a) is the expected return of taking action a in state s and following policy π thereafter
we can modify the previous MC algorithm to estimate Qπ(s, a) instead of Vπ(s), and then act greedily on those estimates (see the sketch below)
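A minimal sketch of why Qπ(s, a) is enough once the dynamics are unknown: the greedy improvement step becomes a pure lookup over the estimated action values, with no model required; Q as a dict keyed by (state, action) and the helper name greedy_action are illustrative assumptions:

# Greedy policy improvement from estimated state-action values.
# Q is assumed to map (state, action) -> estimated return; unlike the
# V-based lookahead above, no dynamics model p(s', r | s, a) is needed.
def greedy_action(Q, s, actions):
    return max(actions, key=lambda a: Q[(s, a)])   # argmax_a Q(s, a)

The same Monte-Carlo averaging as before fills in Q by keying the collected returns on (state, action) pairs rather than on states alone.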
Policy Iteration with MC
Policy Iteration with MC
What if the policy π never takes a in s?
Discussion:
What should the estimate Qπ (s, a) be then?
Why is this a problem for policy improvement?
Why was this not a concern for estimating Vπ(s) using MC?
Policy Iteration with MC: Need Exploration
What is unrealistic about this?