CMPUT 366 F20: Reinforcement Learning
James Wright & Vadim Bulitko
September 24, 2020

Lecture Outline
Reinforcement Learning (RL)
SB 3.0-3.4

Real-time Heuristic Search
Pros
search is centred on an agent
agent interleaves planning and plan execution (i.e., acting)
agent updates its knowledge of the world as it moves about
agent can access the world around its current state
agent has bounded memory (less than the world)
Cons
state revisits (scrubbing)
stationary world
deterministic world

CanBot
Setup:
CanBot’s job is to find and recycle empty beverage cans
At any given time, its battery charge is either high or low
It can do three actions: search for cans, wait, or recharge
Goal: find cans efficiently without running out of battery charge
How would you represent this as a search problem?
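As a warm-up, here is a minimal sketch of one attempted search formulation in Python. The state encoding and the deterministic successor function are assumptions made up for illustration; note how awkward the goal test is, which foreshadows the next slide.

# Hypothetical sketch: CanBot as a classical search problem.
# States are battery levels; actions are search / wait / recharge.
STATES = ["high", "low"]
ACTIONS = ["search", "wait", "recharge"]

# Deterministic successor function (an assumption: real can-finding
# is stochastic, which classical search does not model).
def successor(state, action):
    if action == "recharge":
        return "high"
    if action == "search":
        return "low"     # searching drains the battery
    return state         # waiting leaves the charge unchanged

def is_goal(state):
    # What is the goal? "Find cans efficiently forever" has no
    # terminal state, so the formulation does not fit cleanly.
    raise NotImplementedError("no natural goal state for this task")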

Real-time Heuristic Search
Pros
search is centred on an agent
agent interleaves planning and plan execution (i.e., acting)
agent updates its knowledge of the world as it moves about
agent can access the world around its current state
agent has bounded memory (less than the world)
Cons
state revisits (scrubbing)
stationary world
deterministic world
need a goal state(s)

How to Specify Desired Behaviour of an Agent
Desired action for each state → supervised learning
Desired goal state(s) → search
Good/bad feedback per action → reinforcement learning
Survival/death → evolution

Reinforcement Learning
In a reinforcement learning task, an agent learns how to act based on feedback from the environment
The agent’s actions may change the environment
Neither the desired action per state nor desired goal state(s) are known
The task may be episodic or continuing
The agent makes decisions online: determines how to act while interacting with the environment

Interacting with the Environment
At each time step t = 1, 2, 3, …:
Agent receives input denoting the current state St
Agent chooses action At
On the next time step, the agent receives reward Rt+1 and new state St+1, chosen according to p(s′, r | s, a)
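A minimal sketch of this loop in Python, assuming a toy environment with a reset/step interface; the interface and the made-up dynamics below are illustrative assumptions, not anything specified on the slide.

import random

class ToyEnv:
    """Two-state toy environment standing in for p(s', r | s, a).
    The numbers are made up, and the toy dynamics ignore the action."""
    def reset(self):
        self.state = "A"               # S_1
        return self.state
    def step(self, action):
        # Next state and reward are drawn jointly, as in p(s', r | s, a).
        next_state = random.choice(["A", "B"])
        reward = 1.0 if next_state == "B" else 0.0
        done = random.random() < 0.1   # 10% chance the episode ends
        self.state = next_state
        return next_state, reward, done

def random_policy(state):
    return random.choice(["left", "right"])

def run_episode(env, policy, max_steps=100):
    state = env.reset()                         # receive S_1
    total_reward = 0.0
    for t in range(1, max_steps + 1):
        action = policy(state)                  # choose A_t from S_t
        state, reward, done = env.step(action)  # receive R_{t+1}, S_{t+1}
        total_reward += reward
        if done:
            break
    return total_reward

print(run_episode(ToyEnv(), random_policy))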

Markov Decision Processes
A Markov decision process (MDP) is a tuple (S, A, R, p):
S is a set of states
A is a set of actions
R ⊂ ℝ is a set of possible rewards
p(s′, r|s, a) fully defines the dynamics of the environment
do not need history: Markov property
How can we represent grid pathfinding as an MDP?
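One possible answer, sketched in Python: states are grid cells, actions are the four moves, the reward is −1 per step (so shorter paths yield higher return), and the dynamics are deterministic. The grid size, goal cell, and reward values are assumptions for illustration.

# Hypothetical sketch: 4x4 grid pathfinding as an MDP.
SIZE = 4
GOAL = (3, 3)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def dynamics(state, action):
    """Return {(s', r): probability} -- the four-argument p(s', r | s, a)."""
    if state == GOAL:                  # the goal is absorbing
        return {(state, 0.0): 1.0}
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    if not (0 <= r < SIZE and 0 <= c < SIZE):
        r, c = state                   # moves off the grid leave the agent in place
    return {((r, c), -1.0): 1.0}       # deterministic: probability 1

print(dynamics((0, 0), "right"))       # {((0, 1), -1.0): 1.0}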

MDP Dynamics
The four-argument dynamics function returns the probability of every state transition:
p(s′, r | s, a) = Pr(St+1 = s′, Rt+1 = r | St = s, At = a)
Convenient shorthands:
probability of getting from state s to state s′ via action a:
p(s′ | s, a) = Σ_{r∈R} p(s′, r | s, a)
expected reward from getting from state s to state s′ via action a:
r(s, a, s′) = Σ_{r∈R} r · p(s′, r | s, a) / p(s′ | s, a)
expected reward from taking action a in state s:
r(s, a) = Σ_{r∈R} r Σ_{s′∈S} p(s′, r | s, a)
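All three shorthands are simple marginalizations of the four-argument function. Below is a sketch of how they could be computed when the dynamics are stored as a table; the dictionary layout and toy numbers are assumptions.

# Assume p maps (s, a) -> {(s_next, r): probability}, i.e., a tabular
# version of the four-argument dynamics function.

def p_next(p, s_next, s, a):
    """p(s'|s, a): marginalize the joint over all rewards r."""
    return sum(prob for (sn, r), prob in p[(s, a)].items() if sn == s_next)

def r_sas(p, s, a, s_next):
    """r(s, a, s'): expected reward given the transition s --a--> s'."""
    joint = sum(r * prob for (sn, r), prob in p[(s, a)].items() if sn == s_next)
    return joint / p_next(p, s_next, s, a)

def r_sa(p, s, a):
    """r(s, a): expected reward for taking action a in state s."""
    return sum(r * prob for (sn, r), prob in p[(s, a)].items())

# Toy check with made-up numbers:
p = {("s0", "go"): {("s1", 1.0): 0.5, ("s1", 0.0): 0.25, ("s0", 0.0): 0.25}}
print(p_next(p, "s1", "s0", "go"))     # 0.75
print(r_sas(p, "s0", "go", "s1"))      # 0.5 / 0.75 = 0.666...
print(r_sa(p, "s0", "go"))             # 0.5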

CanBot as an RL Agent
Need to define:
S is a set of states
A is a set of actions
R ⊂ ℝ is a set of possible rewards
p(s′, r|s, a) fully defines the dynamics of the environment
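This is essentially the recycling-robot example from Sutton & Barto, and one possible specification is sketched below; the transition probabilities and reward values are placeholders, not numbers given in lecture.

# Hypothetical specification of CanBot as an MDP.
ALPHA, BETA = 0.9, 0.4        # P(charge stays high / stays low) while searching
R_SEARCH, R_WAIT = 1.0, 0.1   # expected cans found per action (made up)
R_RESCUE = -3.0               # penalty for running flat and needing rescue

S = ["high", "low"]
A = {"high": ["search", "wait"], "low": ["search", "wait", "recharge"]}

# p maps (s, a) -> {(s', r): probability}
p = {
    ("high", "search"): {("high", R_SEARCH): ALPHA, ("low", R_SEARCH): 1 - ALPHA},
    ("high", "wait"):   {("high", R_WAIT): 1.0},
    ("low", "search"):  {("low", R_SEARCH): BETA, ("high", R_RESCUE): 1 - BETA},
    ("low", "wait"):    {("low", R_WAIT): 1.0},
    ("low", "recharge"): {("high", 0.0): 1.0},
}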

The Reward Hypothesis
That all of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward).
http://incompleteideas.net/rlai.cs.ualberta.ca/RLAI/rewardhypothesis.html
A possible refutation?
October 24, 2003

Returns for Episodic Tasks
What does it mean to maximize the expected value of the cumulative sum of rewards?
A task is episodic if it ends after some finite number T of time steps in a special terminal state ST
The return Gt after time t is the sum of rewards received after time t: Gt = Rt+1 + Rt+2 + Rt+3 + ··· + RT
The return Gt is a random variable. In an episodic task, we want to maximize its expected value E[Gt]
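Concretely, given the rewards observed after time t, the episodic return is just their sum; the toy rewards below are made up.

def episodic_return(rewards_after_t):
    """G_t = R_{t+1} + R_{t+2} + ... + R_T (undiscounted, episodic)."""
    return sum(rewards_after_t)

print(episodic_return([0.0, 0.0, 1.0, 5.0]))   # 6.0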

Returns for Continuing Tasks
What does it mean to maximize the expected value of the cumulative sum of rewards?
A task is continuing if it does not end
Thus we cannot just maximize the plain sum of rewards
Instead, we maximize the discounted return: Gt = Rt+1 + γRt+2 + γ²Rt+3 + ···, where γ ∈ [0, 1) is the discount factor
Returns are related to each other: Gt = Rt+1 + γGt+1
The return Gt is again a random variable, and we want to maximize its expected value E[Gt]
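A small sketch of the discounted return, including a check of the recursion Gt = Rt+1 + γGt+1; the reward sequence and discount factor are made-up values.

def discounted_return(rewards, gamma=0.9):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [1.0, 0.0, 2.0, 1.0]
g0 = discounted_return(rewards)        # G_t
g1 = discounted_return(rewards[1:])    # G_{t+1}
assert abs(g0 - (rewards[0] + 0.9 * g1)) < 1e-12   # G_t = R_{t+1} + gamma*G_{t+1}
print(g0)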

Maximizing Returns
We want to maximize the expected return (i.e., cumulative (discounted) reward)
Markov property: the state incorporates all of the necessary information about the history up until this point. Probabilities of future rewards and transitions are the same from state St regardless of how you got there
Thus the agent can choose its actions based only on its current state St
Need a policy: a mapping from states to actions:
deterministic: π : S → A
stochastic: π : S × A → [0, 1]
can be written as π(a|s) ∈ [0, 1] — the probability of taking action a given that the current state is s
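A sketch of both policy types as Python data structures; the states, actions, and probabilities are made up for illustration.

import random

# Deterministic policy pi: S -> A, as a lookup table.
pi_det = {"high": "search", "low": "recharge"}

# Stochastic policy pi(a|s): S x A -> [0, 1], with each row summing to 1.
pi_stoch = {
    "high": {"search": 0.8, "wait": 0.2},
    "low":  {"search": 0.1, "wait": 0.3, "recharge": 0.6},
}

def sample_action(pi, state):
    """Draw an action with probability pi(a|s)."""
    actions, probs = zip(*pi[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(pi_det["high"], sample_action(pi_stoch, "low"))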