
CMPUT 366 F20: Reinforcement Learning V
James Wright & Vadim Bulitko
October 8, 2020

Lecture Outline
Reinforcement Learning (RL)
SB 5.3-5.7, 6.0-6.2, 6.4-6.5

Monte-Carlo versus Dynamic Programming
Iterative policy evaluation uses the estimates of the next state’s value to update the value of the current state
Only needs to look one transition ahead to update a state’s estimate
Monte Carlo’s estimate of each state’s value is independent of the estimates of other states’ values
Needs the entire episode to compute an update
Can focus on evaluating a subset of states if desired
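To make this concrete, here is a minimal sketch of first-visit Monte Carlo prediction; the episode format (a list of (state, reward) pairs) and the dictionary value table are illustrative assumptions, not from the slides:

    from collections import defaultdict

    def first_visit_mc_V(episodes, gamma):
        # Each episode is a list of (S_t, R_{t+1}) pairs (assumed interface).
        returns = defaultdict(list)
        for episode in episodes:
            # Compute each return by sweeping backwards: G_t = R_{t+1} + gamma * G_{t+1}
            G = 0.0
            G_at = [0.0] * len(episode)
            for t in reversed(range(len(episode))):
                G = episode[t][1] + gamma * G
                G_at[t] = G
            # First-visit MC: only the first occurrence of each state counts
            seen = set()
            for t, (s, _) in enumerate(episode):
                if s not in seen:
                    seen.add(s)
                    returns[s].append(G_at[t])
        # Each V(s) is the average of that state's sampled returns,
        # independent of the estimates for other states
        return {s: sum(gs) / len(gs) for s, gs in returns.items()}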

Estimating State-action Values
When we know the dynamics p(s′, r|s, a), an estimate of state values is sufficient to determine a good policy:
greedily choose the action that gives the best expectation of the immediate reward plus the next-state value: argmax_a ∑_{s′,r} p(s′, r|s, a) [r + γVπ(s′)]
If we do not know the dynamics, state values are not enough
to improve on π we will need an explicit estimate of state-action values:
Qπ (s, a) is the expected return of taking action a in state s and following policy π thereafter
can modify the previous MC algorithm to estimate Qπ(s,a) instead of Vπ(s)
then can act greedily on them: π′(s) = argmax_a Qπ(s, a)
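A minimal sketch of acting greedily on estimated state-action values; the Q[(s, a)] dictionary keying is an assumption for illustration:

    def greedy_policy_from_Q(Q, actions):
        # π′(s) = argmax_a Q(s, a); max() breaks ties by first occurrence
        def pi(s):
            return max(actions, key=lambda a: Q[(s, a)])
        return pi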

Policy Iteration with MC

Policy Iteration with MC
What if the policy π never takes a in s?
Policy improvement, π′(s) = argmax_a Qπ(s, a), will not work since we will not have Qπ estimates for some actions
We need π to keep visiting all (s, a) pairs so that our return averages converge to Qπ(s, a) – continued exploration
Exploring starts:
on each episode, start the agent in some state s and force it to take some action a, such that every (s, a) pair has a non-zero probability of being selected
ε-soft policies:
make sure π has a non-zero chance of selecting every action in every state (see the sketch below)
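For instance, sampling an action from an ε-greedy (hence ε-soft) policy might look like the following sketch (the Q-table interface is assumed):

    import random

    def epsilon_greedy_action(Q, s, actions, epsilon):
        # With probability ε explore uniformly at random;
        # otherwise exploit the current Q estimates greedily
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])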

Policy Iteration with MC: Need Exploration
What is unrealistic about this?

Removing Exploring Starts
The exploring-starts setup ensures that we see every (s, a) pair with a non-zero probability
but it is impractical (e.g., in a game of chess we cannot start episodes from arbitrary position-action pairs)
Alternative: force the π being evaluated to sometimes take a in s
Policy π is ε-soft iff ∀s, a [π(a|s) ≥ ε/|A(s)|], for some 0 < ε ≤ 1
An ε-greedy policy is an example of an ε-soft policy: it takes a random action with probability ε and acts greedily otherwise
equivalently:
π(a|s) = ε/|A|  if a ≠ argmax_{a′} Q(s, a′),
π(a|s) = 1 − ε + ε/|A|  otherwise.
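The case analysis above translates directly into a probability vector; this sketch again assumes a dictionary-based Q table:

    def epsilon_greedy_probs(Q, s, actions, epsilon):
        # π(a|s) = ε/|A| for non-greedy actions, 1 − ε + ε/|A| for the greedy one;
        # the probabilities sum to 1
        n = len(actions)
        greedy = max(actions, key=lambda a: Q[(s, a)])
        return {a: 1 - epsilon + epsilon / n if a == greedy else epsilon / n
                for a in actions}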

Policy Iteration with MC: No Exploring Starts

Exploration versus Exploitation
Will π converge to optimal π∗?
No: an ε-soft policy always keeps taking random actions with some probability
A general problem: how to trade exploration and exploitation
Here we want to learn optimal policy from trajectories generated by another, continually exploring and thus non-optimal policy

Generalization: Off-policy Learning
Off-policy learning:
learn about a target policy π
while executing a behaviour policy b
For now, consider the prediction problem: both policies π and b are fixed
want to evaluate π (i.e., compute Vπ or Qπ) but have experience generated by b
cannot simply average the returns generated by b to get Vπ, because b ≠ π
Assume there is coverage: ∀a, s [π(a|s) > 0 → b(a|s) > 0]
Discuss: can b be deterministic in states where it differs from π?

Importance Sampling
Have sample returns for b; need to estimate returns for π
The probability of trajectory At, St+1, At+1, . . . , ST under policy b is
b(At|St) p(St+1|St, At) b(At+1|St+1) · · · p(ST|ST−1, AT−1)
The probability of the same trajectory under policy π is
π(At|St) p(St+1|St, At) π(At+1|St+1) · · · p(ST|ST−1, AT−1)
The relative probability of the trajectory under π and b, the importance sampling ratio, is
ρ_{t:T−1} = [∏_{k=t}^{T−1} π(Ak|Sk) p(Sk+1|Sk, Ak)] / [∏_{k=t}^{T−1} b(Ak|Sk) p(Sk+1|Sk, Ak)] = ∏_{k=t}^{T−1} π(Ak|Sk) / b(Ak|Sk)
The dynamics terms cancel, so the ratio depends only on the two policies
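Because the dynamics cancel, the ratio can be computed from the two policies alone; a sketch, assuming pi and b are callables returning action probabilities:

    def importance_ratio(trajectory, pi, b):
        # trajectory: list of (S_k, A_k) pairs from time t to T−1
        # ρ_{t:T−1} = ∏_k π(A_k|S_k) / b(A_k|S_k)
        rho = 1.0
        for s, a in trajectory:
            rho *= pi(a, s) / b(a, s)
        return rho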

Importance Sampling
Consider the episodic MDP in the figure below
Policy b is random: b(A1|S1) = b(A2|S1) = 1/2
Policy π is not random: π(A1|S1) = 1, π(A2|S1) = 0
Discuss:
calculate ρ for trajectory S1, A1, S2
calculate ρ for trajectory S1, A2, S3
average returns produced by b to estimate Vb(S1)
average returns produced by π to estimate Vπ(S1)
average returns produced by b but multiplied by ρ for the corresponding trajectory
[Figure: an episodic MDP with start state S1 and two actions: A1 leads to terminal state S2 and A2 leads to terminal state S3; one transition is labelled R = 100 and the other R = 10]
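Worked answers for the first two items: for trajectory S1, A1, S2, ρ = π(A1|S1)/b(A1|S1) = 1/(1/2) = 2; for trajectory S1, A2, S3, ρ = π(A2|S1)/b(A2|S1) = 0/(1/2) = 0. Returns from trajectories that π would never generate are weighted to zero, while the remaining returns are up-weighted to compensate.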

Importance Sampling
Assume time steps are numbered continuously across episodes (e.g., episode 1 ends at t = 100, episode 2 begins at t = 101)
T(s) is the set of time steps at which s was visited for the first time within an episode
T(t) is the termination time of the episode containing step t
(Gt)_{t∈T(s)} is the sequence of returns for state s
(ρ_{t:T(t)−1})_{t∈T(s)} are the corresponding importance-sampling ratios
Estimate Vπ(s) as (∑_{t∈T(s)} ρ_{t:T(t)−1} Gt) / |T(s)| : ordinary importance sampling
Estimate Vπ(s) as (∑_{t∈T(s)} ρ_{t:T(t)−1} Gt) / (∑_{t∈T(s)} ρ_{t:T(t)−1}) : weighted importance sampling
Can be used for both prediction and control: SB 5.6 – 5.7
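A sketch of both estimators, assuming the samples for a given s are supplied as (Gt, ρ) pairs:

    def is_estimates(samples):
        # samples: list of (G_t, rho_{t:T(t)-1}) pairs for first visits to s
        weighted_returns = sum(rho * g for g, rho in samples)
        ordinary = weighted_returns / len(samples)      # divide by |T(s)|
        rho_sum = sum(rho for _, rho in samples)
        weighted = weighted_returns / rho_sum if rho_sum > 0 else 0.0
        return ordinary, weighted

Ordinary importance sampling is unbiased but can have very high variance; weighted importance sampling is biased but typically has much lower variance (SB 5.5).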

Monte Carlo Summary
Monte Carlo estimation: Estimate expected returns to a state or action by averaging actual returns over sampled trajectories
Estimating action values requires either exploring starts or a soft policy (e.g., ε-greedy)
Off-policy learning is the estimation of value functions for a target policy based on episodes generated by a different behaviour policy
Off-policy control is learning the optimal policy (target policy) using episodes from a behaviour policy
No need for a world model p(s′, r|s, a)
What is the downside of MC?

Monte Carlo versus Dynamic Programming
Dynamic Programming bootstraps: uses estimates to learn other estimates without waiting for a final outcome
can propagate knowledge through the state space
requires a world model (off-line planning)
Monte Carlo uses experience only
cannot propagate knowledge through the state space
does not require a world model (on-line learning)
Can we combine pros of each?

Temporal Difference Learning
Want an on-line learning method that learns from experience, bootstraps and does not rely on the world model
and we want a pony too!
Updates to the value function in policy evaluation:
Dynamic Programming: V(s) ← ∑_{a,s′,r} π(a|s) p(s′, r|s, a) [r + γV(s′)]
Monte Carlo: V(s) ← V(s) + α[G − V(s)]
α is the learning rate; G is the target to move towards
need to wait until the end of the episode to get the return G
Temporal difference: V(s) ← V(s) + α[r + γV(s_next) − V(s)]
using r + γV(s_next) as the target in place of G
do not need to wait until the end of the episode; this is bootstrapping
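A sketch of one episode of TD(0) prediction; the env.reset()/env.step() interface and the dictionary value table are assumptions for illustration:

    def td0_episode(env, policy, V, alpha, gamma):
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            # TD target: r + γV(s_next), or just r at episode end
            target = r if done else r + gamma * V[s_next]
            V[s] += alpha * (target - V[s])   # move V(s) toward the target
            s = s_next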

TD(0) for Policy Evaluation
The update rule for V(S) uses sample updates: a single observed outcome
DP uses expected updates: a weighted sum over all possible outcomes

TD(0) for Control
We can plug TD prediction into the generalized policy iteration framework
Monte Carlo control loop:
Generate an episode using π
Update estimates of Q and π
On-policy TD control loop:
Take an action according to π
Update estimates of Q and π

On-policy TD(0) Control
The update rule uses St, At, Rt+1, St+1, At+1 – hence Sarsa
Converges to the optimal policy if we use an ε-greedy policy and cool ε off over time (e.g., ε = 1/t)
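A sketch of the Sarsa update, under the same assumed Q-table convention as the earlier sketches:

    def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma, done):
        # The target uses a_next, the action the ε-greedy policy actually chose
        target = r if done else r + gamma * Q[(s_next, a_next)]
        Q[(s, a)] += alpha * (target - Q[(s, a)])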

Off-policy TD(0) Control
Learns values for an optimal policy while running a behaviour policy
Converges to the optimal policy if we use an ε-greedy policy and cool ε off over time (e.g., ε = 1/t)
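A sketch of the Q-learning update; note the max over next actions in place of the action actually taken:

    def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma, done):
        # Off-policy target: greedy value of the next state,
        # regardless of which action the behaviour policy takes there
        target = r if done else r + gamma * max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])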

Sarsa versus Q-learning
The agent gets −1 reward on each step until it reaches the goal state
Stepping into the Cliff region gives reward −100 and sends the agent back to the start
How will Q-learning estimate the value of the star state?
How will Sarsa estimate the value of the star state?

Sarsa versus Q-learning
ε = 0.1, constant
Why is Sarsa doing better?
What happens if we gradually reduce ε → 0?

Q-learning versus RTHS
Q-learning update rule:
Q(s, a) ← Q(s, a) + α[r + γ max_{a′} Q(s′, a′) − Q(s, a)]
Q-learning action-selection rule (with an ε-greedy policy):
a ← argmax_a Q(s, a) a fraction 1 − ε of the time; a random action otherwise
If α = 1, γ = 1, ε = 0 then we get
Q(s, a) ← r + max_{a′} Q(s′, a′) and a ← argmax_a Q(s, a)
Compare with real-time heuristic search, where rewards are negative costs, so maximizing action value corresponds to minimizing cost-to-go:
RTHS (LRTA*) update rule: h(s) ← min_{s′}[cost(s, s′) + h(s′)]
RTHS (LRTA*) action-selection rule: s_next ← argmin_{s′}[cost(s, s′) + h(s′)]
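For comparison, a sketch of a single LRTA* step; representing h as a dict and neighbors/cost as callables is an illustrative assumption:

    def lrta_star_step(h, s, neighbors, cost):
        # Choose the neighbour minimizing one-step cost plus heuristic,
        # then raise h(s) to that minimum (the LRTA* update)
        s_next = min(neighbors(s), key=lambda s2: cost(s, s2) + h[s2])
        h[s] = cost(s, s_next) + h[s_next]
        return s_next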

Summary
Temporal Difference Learning bootstraps and learns from experience
Dynamic Programming bootstraps, but does not learn from experience (requires full dynamics)
Monte Carlo learns from experience, but does not bootstrap
Policy evaluation: the TD(0) algorithm
Sarsa estimates the action-values of the actual ε-greedy policy being followed
Q-learning estimates the action-values of the optimal policy while executing an ε-greedy policy
Connections to real-time heuristic search