
Reinforcement Learning

Dynamic Programming;
Monte Carlo Methods

Subramanian Ramamoorthy
School of Informatics

27 January 2017

Recap: Key Quantities defining an MDP

•  System dynamics are stochastic – represented by a probability distribution.

•  Problem is defined as maximization of expected rewards

–  Recall that $E(X) = \sum_i x_i \, p(x_i)$ for finite-state systems


Recap: Decision Criterion

What is the criterion for optimization (i.e., learning)?


$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$,

where $\gamma$, $0 \le \gamma \le 1$, is the discount rate.

Effect of changing γ?
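As an illustration (a worked example of my own, assuming a constant reward of 1 at every step): the return becomes a geometric series,

$R_t = \sum_{k=0}^{\infty} \gamma^k = \frac{1}{1-\gamma}$,

so $\gamma = 0.9$ corresponds to an effective horizon of roughly 10 steps, $\gamma = 0.99$ to roughly 100 steps, and $\gamma \to 0$ makes the agent myopic (only the immediate reward $r_{t+1}$ matters).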


Notation for Episodic vs. Infinite Tasks

•  In (discrete) episodic tasks, we could number the time steps of each episode starting from zero.

•  We usually do not have to distinguish between episodes, so we write $s_t$ instead of $s_{t,j}$ for the state at step t of episode j.

•  Think of each episode as ending in an absorbing state that always produces a reward of zero:

•  We can cover all cases by writing

$R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$,

where $\gamma$ can be 1 only if a zero-reward absorbing state is always reached.


Three Aspects of the RL Problem

•  Planning
The MDP is known (states, actions, transitions, rewards).
Find an optimal policy, π*!

•  Learning
The MDP is unknown. You are allowed to interact with it.
Find an optimal policy, π*!

•  Optimal learning
While interacting with the MDP, minimize the loss due to not using an optimal policy from the beginning.


Solving MDPs – Many Dimensions

•  Which problem? (Planning, learning, optimal learning)
•  Exact or approximate?
•  Uses samples?
•  Incremental?
•  Uses value functions?

–  Yes: Value-function based methods
•  Planning: DP, Random Discretization Method, FVI, …
•  Learning: Q-learning, Actor-critic, …

–  No: Policy search methods
•  Planning: Monte-Carlo tree search, Likelihood ratio methods (policy gradient), Sample-path optimization (Pegasus), …

•  Representation
–  Structured state:
•  Factored states, logical representation, …
–  Structured policy space:
•  Hierarchical methods


Value Functions

•  Value functions are used to determine how good it is for the agent to be in a given state
–  Or, how good is it to perform an action from a given state?

•  This is defined w.r.t. a specific policy, i.e., distribution π(s,a)
•  State value function:

•  Action (or State-Action) value function:
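For reference, the standard definitions (in the same notation as the return $R_t$ above) are:

$V^{\pi}(s) = E_{\pi}\left[ R_t \mid s_t = s \right] = E_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s \right]$

$Q^{\pi}(s,a) = E_{\pi}\left[ R_t \mid s_t = s,\, a_t = a \right]$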


Value Functions

Note that there are multiple sources of (probabilistic) uncertainty:
•  In state s, one is allowed to select different actions a
•  The system may transition to different states s’ from s
•  Depending on the above, return (defined in terms of reward) is a random variable – which we seek to maximize in expectation


Recursive Form of V – Bellman Equation


Expand 1-step forward
& rewrite expectation
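For reference, the resulting Bellman equation for $V^{\pi}$, written with the transition probabilities $P^{a}_{ss'}$ and expected rewards $R^{a}_{ss'}$ used later in the deck (standard form), is:

$V^{\pi}(s) = \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^{\pi}(s') \right]$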

Backup Diagrams

•  If you go all the way ‘down’, you can just read off the reward value
•  The backup process (i.e., the recursive equation above) allows you to compute the corresponding value at the current state
–  taking transition probabilities into account


Bellman Equa-on for Q
(State-Action Value Function)
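For reference, the corresponding Bellman equation for $Q^{\pi}$ (standard form) is:

$Q^{\pi}(s,a) = \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma \sum_{a'} \pi(s',a')\, Q^{\pi}(s',a') \right]$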


Optimal Value Function

•  For finite MDPs,
•  Let us denote the optimal policy π*
•  The corresponding optimal value functions are:

•  From this,

•  Will there always be a well-defined π*?
Theorem [Blackwell, 1962] – For every MDP with a finite state/action space, there exists an optimal deterministic stationary policy
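For reference, the standard statements behind the bullets above are:

$V^{*}(s) = \max_{\pi} V^{\pi}(s), \qquad Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a)$,

and from these a deterministic optimal policy can be read off greedily: $\pi^{*}(s) = \arg\max_{a} Q^{*}(s,a)$.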


Recursive Form of V*
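For reference, the Bellman optimality equation for $V^{*}$ (standard form, in the same notation) is:

$V^{*}(s) = \max_{a} \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^{*}(s') \right]$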


Backup for Q*
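For reference, the corresponding backup for $Q^{*}$ (standard form) is:

$Q^{*}(s,a) = \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma \max_{a'} Q^{*}(s',a') \right]$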


What is Dynamic Programming?

Given a known model of the environment as an MDP (transition dynamics and reward probabilities),

DP is a collection of algorithms for computing optimal policies (via Optimal Value Functions)


Policy Evaluation

How to compute V(s) for an arbitrary policy π? (Prediction problem)

For a given MDP, this yields a system of simultaneous equations

–  as many unknowns as states
–  Solve using linear algebraic computation

Solve iteratively, with a sequence of value functions,


Computationally…

We could achieve this in a number of different ways:
•  Maintain two arrays, computing iterations over one and copying results to the other
•  In-place: Overwrite as new backed-up values become available
•  It can be shown that this algorithm will also converge to optimality (somewhat faster, even)
–  Backups sweep through the state space
–  Sweep order has a significant influence on convergence rates


Iterative Policy Evaluation
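A minimal Python sketch of iterative policy evaluation for a small tabular MDP; the data layout `P[s][a] = [(prob, next_state, reward), ...]`, the stochastic policy `pi[s][a]`, and all names here are illustrative assumptions rather than anything prescribed by the slides:

```python
def policy_evaluation(P, pi, states, gamma=0.9, theta=1e-8):
    """Iteratively approximate V^pi for a known tabular MDP.

    P[s][a]  : list of (prob, next_state, reward) transitions
    pi[s][a] : probability of taking action a in state s
    """
    V = {s: 0.0 for s in states}              # arbitrary initial values
    while True:
        delta = 0.0
        for s in states:                      # in-place sweep over all states
            v_new = sum(pi_sa * prob * (r + gamma * V[s_next])
                        for a, pi_sa in pi[s].items()
                        for prob, s_next, r in P[s][a])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:                     # stop once a sweep barely changes V
            return V
```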


Grid-World Example


Iterative Policy Evaluation in Grid World


Note: The value function can be searched
greedily to find long-term optimal actions

Policy Improvement

Does it make sense to deviate from π(s) at any state (following the
policy everywhere else)?


-  Policy Improvement Theorem [Howard/Blackwell]

Key Idea Behind Policy Improvement


Computing Better Policies

Starting with an arbitrary policy, we’d like to approach truly optimal policies. So, we compute new policies using the following,

Are we restricted to deterministic policies? No.
With stochastic policies,
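For reference, the greedy improvement step in the deck’s notation (standard form) is

$\pi'(s) = \arg\max_{a} Q^{\pi}(s,a) = \arg\max_{a} \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^{\pi}(s') \right]$,

and a stochastic improved policy may spread its probability mass over all maximizing actions.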


Policy Iteration

We can combine policy evaluation and improvement to obtain a sequence of monotonically improving policies and value functions

–  Each policy is guaranteed to be a strict improvement over the previous one (unless it is already optimal)
[Policy Improvement Theorem]

–  As a finite MDP admits finitely many policies, this eventually converges to an optimal policy


Policy Iteration Algorithm
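A minimal Python sketch of the evaluate–improve loop, reusing the `policy_evaluation` sketch above (again, the MDP layout and all names are illustrative assumptions):

```python
def policy_iteration(P, states, actions, gamma=0.9, theta=1e-8):
    """Alternate full policy evaluation with greedy policy improvement."""
    policy = {s: actions[0] for s in states}      # arbitrary deterministic start
    while True:
        # 1. policy evaluation (deterministic policy written as a distribution)
        pi = {s: {policy[s]: 1.0} for s in states}
        V = policy_evaluation(P, pi, states, gamma, theta)
        # 2. greedy policy improvement
        stable = True
        for s in states:
            q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in actions}
            best = max(q, key=q.get)
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:                                # no state changed its greedy action
            return policy, V
```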


Example: Jack’s Car Rental

•  £10 for each car rented (must be available when the request is received)
•  Two locations, maximum of 20 cars at each
•  Cars returned and requested randomly

–  Poisson distribution: n returns/requests with probability $\frac{\lambda^n}{n!} e^{-\lambda}$
–  Location 1: Average requests = 3, Average returns = 3
–  Location 2: Average requests = 4, Average returns = 2

•  Can move up to 5 cars between locations overnight (costs £2 each)

Problem setup:
•  States, actions, rewards?
•  Transition probabilities?


Solution: Jack’s Car Rental


[Figure: policy (numbers indicate the action: # cars to move) and value function]

Points to Ponder: Jack’s Car Rental

•  Suppose the first car moved is free but all other transfers cost £2
–  From Location 1 to Location 2 (not the other direction!)
–  Because an employee would go in that direction anyway, by bus

•  Suppose only 10 cars can be parked for free at each location
–  More than 10 incurs a fixed cost of £4 for using an extra parking lot

… typical examples of ‘real-world nonlinearities’


Value Iteration


Each step in Policy Iteration needs Policy Evaluation (up to convergence)
– can we avoid this computational overhead?

… use the Bellman optimality equation as an update rule

Value Iteration Algorithm
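A minimal Python sketch of value iteration under the same illustrative MDP layout, folding the improvement step into each sweep by backing up with a max over actions:

```python
def value_iteration(P, states, actions, gamma=0.9, theta=1e-8):
    """Use the Bellman optimality backup as an update rule."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                       for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # extract a greedy (deterministic) policy from the converged values
    policy = {s: max(actions,
                     key=lambda a: sum(p * (r + gamma * V[s2])
                                       for p, s2, r in P[s][a]))
              for s in states}
    return policy, V
```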


Example: Gambler’s Problem

•  The gambler can repeatedly bet on a coin flip
•  Heads: he wins his stake; Tails: he loses his stake
•  Initial capital ∈ {$1, $2, … , $99}
•  The gambler has won if he reaches $100 and has lost if he goes bankrupt ($0)
•  Unfair coin: p(H) = 0.4, p(T) = 0.6

Problem formulation:
•  States, Actions, Rewards?
•  State transitions?


Solution to Gambler’s Problem


Based on successive sweeps of value iteration

Why does it look so strange?

Generalized Policy Iteration

Caricature of the process:

Builds on the notion of interleaving evaluation and improvement – but allows the granularity to be flexible


Monte Carlo Methods

•  Learn value functions
•  Discover optimal policies
•  Do not assume knowledge of a model as in DP, i.e., of $P^{a}_{ss'}$, $R^{a}_{ss'}$

•  Learn from experience: sample sequences of states, actions and rewards (s, a, r)
–  In simulated or real (e.g., physical robotic) worlds
–  Clearly, a simulator is a model, but not a full one in the form of a probability distribution

•  Eventually attain optimal behaviour (same as with DP)



Learning in MDPs

•  You are learning from a long stream of experience:
$s_0 a_0 r_0 \, s_1 a_1 r_1 \dots s_k a_k r_k \dots$
… up to some terminal state

•  Direct methods:
Approximate the value function (V/Q) straight away – without computing $P^{a}_{ss'}$, $R^{a}_{ss'}$


Pictorial: What does DP Do?


Pictorial: What does Simple MC Do?


Monte Carlo Policy Evaluation

•  Goal: Approximate a value function
•  Given: Some number of episodes under π which contain s
•  Maintain the average of the returns observed after visits to s

•  First-visit vs. Every-visit MC:
–  First-visit MC averages only the returns that follow the first visit to s in each episode,
whereas every-visit MC averages the returns that follow every visit to s


What is the effect of π?
What if it is deterministic?

First-visit Monte Carlo Policy Evaluation
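A minimal Python sketch of first-visit MC policy evaluation, assuming a hypothetical `generate_episode()` helper that follows π and returns a list of (state, reward) pairs; these names are illustrative, not from the slides:

```python
from collections import defaultdict

def first_visit_mc_evaluation(generate_episode, num_episodes, gamma=1.0):
    """Estimate V^pi by averaging returns that follow the FIRST visit to each state.

    generate_episode() -> [(s_0, r_1), (s_1, r_2), ..., (s_{T-1}, r_T)]
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)
    for _ in range(num_episodes):
        episode = generate_episode()
        # backwards pass: return following each time step
        G = 0.0
        returns = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = gamma * G + r
            returns[t] = G
        # average only the return that follows the first visit to each state
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                returns_sum[s] += returns[t]
                returns_count[s] += 1
                V[s] = returns_sum[s] / returns_count[s]
    return V
```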


Example: Blackjack

•  Goal: Achieve a card sum greater than the dealer’s without exceeding 21

•  Player’s options: Hit (take another card) or Stick (pass)
–  If the player exceeds 21 – loss

•  Dealer follows a simple rule:
Stick if sum ≥ 17, else Hit

•  Result:
Closest to 21 wins
Equally close is a draw


Example: Blackjack

•  Goal: Achieve a card sum greater than the dealer’s without exceeding 21

•  State space: (200 states)
–  Current sum (12 – 21)
–  Dealer’s showing card (ace – 10)
–  Do I have a usable ace (one that can count as 11 without overshooting)?

•  Reward: +1 for a win, 0 for a draw, -1 for a loss
•  Action space: stick (no more cards), hit (receive another card)
•  Policy: stick if the sum is 20 or 21, else hit
Note: This is an (arbitrary) policy π with which the algorithm works


Solution ($V^\pi$): Blackjack


Why is this
more choppy?

Remarks on Blackjack Example

•  Why does the value function jump up for the last two rows in the rear?
–  When the sum is 20 or 21, the policy is to stick; this is a good choice in this region of state space

•  Why does it drop off for the whole last row on the left?
–  The dealer is showing an ace, which gives him extra flexibility (two chances to get close to 21)

•  Why are the foremost values higher on the upper plots than on the lower plots?
–  The player has a usable ace (more flexibility)


Backup in MC

•  Does the concept of a backup diagram make sense for MC methods?

•  As in the figure, MC does not sample all transitions
–  Root node to be updated as before
–  Transitions are dictated by the policy
–  Acquire samples along a sample path
–  Clear path from the eventual reward to the states along the way (credit assignment is easier)

•  Estimates for different states are independent
–  Computational complexity is not a function of state dimensionality


Monte Carlo Estimation of Action Values

•  The model is not available, so we do not know how states and actions interact
– We want Q*

•  We can try to approximate Qπ(s,a) using the Monte Carlo method
–  Asymptotic convergence if every state-action pair is visited

•  Explore many different starting state-action pairs: equal chance of starting from any given pair
–  Not entirely practical, but simple to understand


Monte Carlo Control

•  Policy Evaluation:
Monte Carlo method

•  Policy Improvement:
Greedify with respect to the state-value or action-value function


Convergence of MC Control

•  Policy improvement still works if evaluation is done with MC:

•  πk+1 ≥ πk by the policy improvement theorem
•  Assumption: exploring starts and an infinite number of episodes for MC policy evaluation (i.e., the value function has stabilized)
•  Things to do (as in DP):
–  update only to a given tolerance
–  interleave evaluation/improvement


Monte Carlo Exploring Starts
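A minimal Python sketch of Monte Carlo control with exploring starts, assuming a hypothetical `generate_episode_from(s0, a0, policy)` helper that starts from the chosen state–action pair and then follows `policy`; the names and the every-visit-style incremental averaging are illustrative choices:

```python
import random
from collections import defaultdict

def mc_exploring_starts(states, actions, generate_episode_from,
                        num_episodes, gamma=1.0):
    """Monte Carlo control with exploring starts.

    states, actions : lists of states and actions
    generate_episode_from(s0, a0, policy) ->
        [(s_0, a_0, r_1), (s_1, a_1, r_2), ...], following `policy` after (s0, a0)
    """
    Q = defaultdict(float)
    counts = defaultdict(int)
    policy = {s: random.choice(actions) for s in states}
    for _ in range(num_episodes):
        s0, a0 = random.choice(states), random.choice(actions)   # exploring start
        episode = generate_episode_from(s0, a0, policy)
        G = 0.0
        for s, a, r in reversed(episode):
            G = gamma * G + r
            # incremental average of returns for (s, a)
            counts[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]
            # greedify the policy at s with respect to the current Q
            policy[s] = max(actions, key=lambda a_: Q[(s, a_)])
        # note: a strict first-visit variant would update only first visits
    return policy, Q
```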


Blackjack Example – Optimal Policy
