
CMPUT 366 F20: Reinforcement Learning II
James Wright & Vadim Bulitko
September 29, 2020

Lecture Outline
Reinforcement Learning (RL)
SB 3.0-3.6, 4.0-4.4

How to Specify Desired Behaviour of an Agent
Desired action for each state → supervised learning
Desired goal state(s) → search
Good/bad feedback per action → reinforcement learning
Survival/death → evolution

Reinforcement Learning
In a reinforcement learning task, an agent learns how to act based on feedback from the environment
The agent’s actions may change the environment
Neither the desired action per state nor desired goal state(s) are known
The task may be episodic or continuing
The agent makes decisions online: determines how to act while interacting with the environment

Interacting with the Environment
At each time step t = 1, 2, 3, ...
the agent receives input denoting the current state S_t
the agent chooses an action A_t
On the next time step, the agent receives a reward R_{t+1} and the new state S_{t+1}, chosen according to p(s′, r | s, a)
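A minimal sketch of this interaction loop in Python; the env and agent objects and their reset/step/act/observe methods are hypothetical conveniences, not interfaces defined on the slides:

# A minimal sketch of the agent-environment loop, assuming hypothetical
# env and agent objects (their interfaces are not from the slides).
def run_episode(env, agent):
    state = env.reset()                  # agent receives the initial state S_0
    total_return = 0.0
    done = False
    while not done:
        action = agent.act(state)        # agent chooses A_t based on S_t
        next_state, reward, done = env.step(action)   # environment returns R_{t+1}, S_{t+1}
        agent.observe(state, action, reward, next_state)  # feedback the agent learns from
        total_return += reward
        state = next_state
    return total_return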

Markov Decision Processes
A Markov decision process (MDP) is a tuple (S, A, R, p):
S is a set of states
A is a set of actions
R ⊂ ℝ is a set of possible rewards
p(s′, r | s, a) fully defines the dynamics of the environment
the dynamics do not depend on the history of earlier states and actions: Markov property
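A minimal sketch of how such an MDP could be represented in Python; the tiny two-state MDP below is invented purely for illustration, and the later code sketches reuse this (states, actions, p) representation:

# A minimal sketch (the two-state MDP below is made up for illustration,
# not taken from the slides). The dynamics p(s', r | s, a) are stored as a
# dict mapping (s, a) to a list of (next_state, reward, probability) triples.
states = ["s0", "s1"]
actions = ["stay", "go"]
p = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s0", "go"):   [("s1", 1.0, 0.9), ("s0", 0.0, 0.1)],
    ("s1", "stay"): [("s1", 0.0, 1.0)],
    ("s1", "go"):   [("s0", -1.0, 1.0)],
}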

Returns for Episodic Tasks
What does it mean to maximize the expected value of the cumulative sum of rewards?
A task is episodic if it ends after some finite number T of time steps in a special terminal state S_T
The return G_t after time t is the sum of rewards received after time t: G_t = R_{t+1} + R_{t+2} + R_{t+3} + ··· + R_T
The return G_t is a random variable. In an episodic task, we want to maximize its expected value E[G_t]

Returns for Continuing Tasks
What does it mean to maximize the expected value of the cumulative sum of rewards?
A task is continuing if it does not end
Thus we cannot just maximize the plain sum of rewards (it may be infinite)
Instead, we maximize the discounted return: G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... where γ ∈ [0, 1) is the discount factor
Returns are related to each other: G_t = R_{t+1} + γG_{t+1}
The return G_t is a random variable. In a continuing task, we still want to maximize its expected value E[G_t]
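A minimal sketch of computing a discounted return from a finite list of sampled rewards, using the recursion G_t = R_{t+1} + γG_{t+1}; the reward values in the example are made up:

# A minimal sketch: compute the discounted return from a finite list of
# sampled rewards via the recursion G_t = R_{t+1} + gamma * G_{t+1}.
def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):   # work backwards from the last reward
        g = r + gamma * g
    return g

# Example (reward values invented for illustration):
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.9**2 * 1.0 = 0.81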

Maximizing Returns
We want to maximize the expected return (i.e., cumulative (discounted) reward)
Markov property: the state incorporates all of the necessary information about the history up until this point. Probabilities of future rewards and transitions are the same from state S_t regardless of how you got there
Thus the agent can choose its actions based only on its current state S_t
Need a policy: a mapping from states to actions:
deterministic: π : S → A
stochastic: π : S × A → [0, 1]
can be written as π(a|s) ∈ [0, 1], the probability of taking action a given that the current state is s
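A minimal sketch of a stochastic policy stored as action-probability tables, with action sampling; the probabilities are invented for illustration and the states reuse the toy MDP above:

import random

# A minimal sketch: a stochastic policy pi(a|s) stored as a dict of
# action-probability dicts, plus sampling an action from it.
pi = {
    "s0": {"stay": 0.25, "go": 0.75},
    "s1": {"stay": 0.50, "go": 0.50},
}

def sample_action(pi, state):
    acts, probs = zip(*pi[state].items())
    return random.choices(acts, weights=probs, k=1)[0]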

How Does an RL Agent Come Up with a Policy?
Need a way to evaluate a policy
Need a way to improve it based on the evaluation

State Value Function
How does an agent know what state to go to next?
Need a way to tell that one state is better than another
What is better? Higher return
A (state) value function V^π (old notation) or v_π (new notation) estimates the return from state s under policy π:
V^π(s) = E_π[G_t | S_t = s]
       = E_π[ Σ_{k=0}^∞ γ^k R_{t+k+1} | S_t = s ]

Computing Value Functions
Monte Carlo methods:
average actual returns observed from state s while following policy π
Local methods: update V^π(s_t) from
neighboring states: V^π(s_neighbor)
later states on the same trajectory: V^π(s_{t+k})
Function approximation:
represent V^π_θ in a parameterized form (e.g., a deep ANN)
tune θ from experience
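A minimal sketch of the Monte Carlo idea: estimate V^π(s) by averaging sampled returns. The run_episode_from helper is hypothetical (it would follow π from s and return one observed return G_t); it is not defined on the slides:

# A minimal sketch of Monte Carlo value estimation: V^pi(s) is estimated by
# averaging sampled returns from s. run_episode_from is a hypothetical helper
# (not from the slides) that follows pi from s and returns the observed return.
def mc_state_value(s, pi, run_episode_from, num_episodes=1000):
    returns = [run_episode_from(s, pi) for _ in range(num_episodes)]
    return sum(returns) / len(returns)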

Local Consistency of Value Function
Just like returns, the values of neighboring states are related to each other (the Bellman equation):
V^π(s) = E_π[G_t | S_t = s]
       = E_π[R_{t+1} + γG_{t+1} | S_t = s]
       = Σ_a π(a|s) Σ_{s_next, r} p(s_next, r | s, a) [r + γE_π[G_{t+1} | S_{t+1} = s_next]]
       = Σ_a π(a|s) Σ_{s_next, r} p(s_next, r | s, a) [r + γV^π(s_next)]
V^π is the unique solution to the Bellman equation
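Because the Bellman equations are linear in V^π, they can be solved directly. A minimal sketch with numpy, using the dict-based MDP and policy from the earlier sketches (both invented for illustration):

import numpy as np

# A minimal sketch: solve the Bellman equations for V^pi exactly as the
# linear system (I - gamma * P_pi) V = r_pi.
def evaluate_policy_exactly(states, p, pi, gamma):
    n = len(states)
    idx = {s: i for i, s in enumerate(states)}
    P = np.zeros((n, n))   # P[s, s'] = transition probability under pi
    r = np.zeros(n)        # r[s] = expected one-step reward under pi
    for s in states:
        for a, prob_a in pi[s].items():
            for s_next, reward, prob in p[(s, a)]:
                P[idx[s], idx[s_next]] += prob_a * prob
                r[idx[s]] += prob_a * prob * reward
    v = np.linalg.solve(np.eye(n) - gamma * P, r)
    return {s: v[idx[s]] for s in states}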

Backup Diagrams
Show the flow of information:
V^π(s) = E_π[G_t | S_t = s]
       = Σ_a π(a|s) Σ_{s′,r} p(s′, r | s, a) [r + γV^π(s′)]

A Gridworld Example (Example 3.5 in SB)
A deterministic grid
At each cell, the agent can go north, south, east, or west
Trying to leave the grid: reward of −1, location unchanged
Leaving state A: takes the agent to state A′, reward of +10
Leaving state B: takes the agent to state B′, reward of +5
All other rewards are 0

Computing Value of a Random Policy
Suppose γ = 0.9 and π is the random policy (i.e., π(a|s) = 1/4)
By solving the system of Bellman equations
∀s: V^π(s) = Σ_a π(a|s) Σ_{s′,r} p(s′, r | s, a) [r + γV^π(s′)]
we get the value of each state

Better than Random
How do we do better than random?
Some hand-coded common-sense adjustments to the policy:
Hacky. Need to do better.

Optimality (for finite MDPs)
Policy π1 is better than or equal to policy π2 iff
∀s: V^{π1}(s) ≥ V^{π2}(s)
An optimal policy π∗ is a policy that is better than or equal to every other policy: ∀π ∀s: V^{π∗}(s) ≥ V^π(s)
All optimal policies have the same (optimal) value function:
∀s: V∗(s) = V^{π∗}(s) = max_π V^π(s)

How do we compute π∗?
Naive approach: consider all possible policies π, compute V^π for each, and select the π that is best
Problems?
From π∗ to V∗: ∀s: V∗(s) = V^{π∗}(s)
From V∗ to π∗: ∀s: π∗(s) = argmax_a Σ_{s′,r} p(s′, r | s, a) [r + γV∗(s′)]
So if someone gave us V∗, we could act greedily with respect to it and get π∗
How do we compute V∗?
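A minimal sketch of acting greedily with respect to a given value function V (which would be V∗ if someone gave it to us), using the dict-based MDP from the earlier sketches:

# A minimal sketch: act greedily with respect to a given value function V,
# i.e. pi(s) = argmax_a sum_{s',r} p(s',r|s,a) [r + gamma * V(s')].
def greedy_policy(states, actions, p, V, gamma):
    policy = {}
    for s in states:
        def q(a):
            return sum(prob * (reward + gamma * V[s_next])
                       for s_next, reward, prob in p[(s, a)])
        policy[s] = max(actions, key=q)
    return policy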

How do we compute V∗?
By solving the Bellman optimality equations:
∀s: V∗(s) = max_a Σ_{s′,r} p(s′, r | s, a) [r + γV∗(s′)]

Solving (Acting Optimally in) the Gridworld (Example 3.8 in SB)
In teams discuss why the optimal policy is random in states A and B

How do we compute V∗?
By solving the Bellman optimality equations:
∀s: V∗(s) = max_a Σ_{s′,r} p(s′, r | s, a) [r + γV∗(s′)]
Problems:
the environment needs to be Markovian
need to know p(s′, r | s, a) for all states and actions
computational resources
In practice we (incrementally) compute an approximation to V∗ and/or π∗
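A minimal sketch of computing such an approximation by repeatedly applying the Bellman optimality update (the value iteration scheme of SB 4.4), again using the dict-based MDP from the earlier sketches; it assumes the full model p(s′, r | s, a) is available:

# A minimal sketch: approximate V* by repeatedly applying the Bellman
# optimality update V(s) <- max_a sum_{s',r} p(s',r|s,a) [r + gamma * V(s')].
def value_iteration(states, actions, p, gamma, theta=1e-8):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = max(sum(prob * (reward + gamma * V[s_next])
                            for s_next, reward, prob in p[(s, a)])
                        for a in actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:   # stop once values change by less than theta
            return V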

Policy Evaluation: π → V^π
Need to solve a system of Bellman equations:
∀s: V^π(s) = Σ_a π(a|s) Σ_{s′,r} p(s′, r | s, a) [r + γV^π(s′)]
Can do so iteratively (k = 0, 1, ...), starting from an arbitrary V_0:
∀s: V_{k+1}(s) = Σ_a π(a|s) Σ_{s′,r} p(s′, r | s, a) [r + γV_k(s′)]
V_k converges to V^π as k → ∞
The updates are expected updates, since Σ_{s′,r} p(s′, r | s, a) [r + γV_k(s′)] runs over all states s′ and uses the environment model p(s′, r | s, a) to compute the expectation
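A minimal sketch of this iterative scheme with two arrays (one for V_k, one for V_{k+1}), using the dict-based MDP and policy from the earlier sketches; the fixed num_sweeps stopping rule is a simplifying assumption (in practice one stops when the largest change falls below a threshold):

# A minimal sketch of iterative policy evaluation with two arrays.
def iterative_policy_evaluation(states, p, pi, gamma, num_sweeps=100):
    V = {s: 0.0 for s in states}          # arbitrary V_0
    for _ in range(num_sweeps):
        V_new = {}
        for s in states:
            V_new[s] = sum(prob_a * prob * (reward + gamma * V[s_next])
                           for a, prob_a in pi[s].items()
                           for s_next, reward, prob in p[(s, a)])
        V = V_new                          # V_{k+1} replaces V_k
    return V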

Policy Evaluation: π → V^π
Normally we would use two arrays: one for V_k and one for V_{k+1}
Can also do the update from V_k to V_{k+1} in place, overwriting each state's value as soon as its new value is computed (this also converges to V^π)
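A minimal sketch of the in-place variant, differing from the two-array sketch above only in that the single array V is overwritten during the sweep:

# A minimal sketch of in-place policy evaluation: later updates within a
# sweep already use some freshly computed values.
def in_place_policy_evaluation(states, p, pi, gamma, num_sweeps=100):
    V = {s: 0.0 for s in states}
    for _ in range(num_sweeps):
        for s in states:
            V[s] = sum(prob_a * prob * (reward + gamma * V[s_next])
                       for a, prob_a in pi[s].items()
                       for s_next, reward, prob in p[(s, a)])
    return V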

Iterative Policy Evaluation for the Gridworld
(a sequence of figure slides showing successive sweeps of iterative policy evaluation on the gridworld)