Reinforcement Learning
In-class tutorial: Worked examples
[DP, MC, basics of TD]
Subramanian Ramamoorthy
School of Informatics
17 January 2017
Plan for the Session
• Problems chosen to illustrate concepts covered in earlier lectures
• We will work out problems on the board and take questions to clarify concepts
• These slides provide the outline sketch of the questions to be covered
0. Interpretation of V and Q
Using the task of selecting a club to play the game of golf, discuss the meaning of V and Q.
What are:
- States
- Actions
- Rewards
What do you understand by the shape and numbers in this figure?
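As a reminder of the quantities being discussed (standard definitions, not specific to the golf figure):

V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k \ge 0} \gamma^k R_{t+k+1} \mid S_t = s\right], \qquad Q^\pi(s,a) = \mathbb{E}_\pi\!\left[\sum_{k \ge 0} \gamma^k R_{t+k+1} \mid S_t = s, A_t = a\right].

In the golf setting, one natural reading (as in Sutton and Barto's golf example, which may differ slightly from the figure used here) is: a state is the ball's location, an action is the choice of club, and a reward of -1 per stroke makes V^\pi(s) the negative of the expected number of strokes to hole out under club-selection policy \pi, while Q^\pi(s,a) additionally conditions on the club chosen for the next stroke.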
I. Interpretation of Vπ and π
• Cells = States
• NSEW actions resulting in movement by 1 cell
• Actions taking agent off grid have no effect but incur reward of -1
• All other actions result in a reward of 0
  – except those that move the agent out of the special states A and B.
Inspect and interpret Vπ
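One way to produce the Vπ values being interpreted is a minimal iterative policy evaluation sketch in Python. The special-state rewards and teleport destinations below (A: +10, B: +5) and γ = 0.9 are assumptions taken from the standard Sutton and Barto 5×5 gridworld; they may differ from the figure used in lecture.

import numpy as np

# Equiprobable random policy on the 5x5 gridworld (assumed parameters:
# A = (0,1) -> (4,1) with reward +10, B = (0,3) -> (2,3) with reward +5,
# off-grid moves give -1 and leave the state unchanged, gamma = 0.9).
N, gamma = 5, 0.9
A, A_dest, B, B_dest = (0, 1), (4, 1), (0, 3), (2, 3)
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # N, S, W, E

def step(state, action):
    if state == A:
        return A_dest, 10.0
    if state == B:
        return B_dest, 5.0
    r, c = state[0] + action[0], state[1] + action[1]
    if 0 <= r < N and 0 <= c < N:
        return (r, c), 0.0
    return state, -1.0              # off-grid: stay put, reward -1

V = np.zeros((N, N))
for _ in range(1000):               # sweep until (approximately) converged
    V_new = np.zeros_like(V)
    for r in range(N):
        for c in range(N):
            for a in actions:       # each action has probability 1/4
                (r2, c2), rew = step((r, c), a)
                V_new[r, c] += 0.25 * (rew + gamma * V[r2, c2])
    V = V_new
print(np.round(V, 1))               # centre state should come out near +0.7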
I. Interpretation of Vπ
I. Interpretation of V* and π*
Calculate and show that Bellman’s equation holds for the centre state, to understand the nature of V*.
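A sketch of how that check can go, assuming the figure shows the optimal values of the standard textbook gridworld (γ = 0.9, V* of the centre state ≈ 17.8, V* of the state directly above it ≈ 19.8); substitute the lecture figure's numbers if they differ:

V*(centre) = max_a E[ R + γ V*(S') ] = 0 + 0.9 × 19.8 ≈ 17.8,

so the maximising action (moving north) reproduces the centre state's own value, which is exactly what the Bellman optimality equation demands.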
Interpreting V: Cost-to-go
Finding the shortest path in a graph using optimal substructure; a straight line indicates a single edge, a wavy line indicates a shortest path between the two vertices it connects (other nodes on these paths are not shown), and the bold line is the overall shortest path from start to goal. [From Wikipedia]
Understanding the recursion: if the shortest path from LA to NY must pass through Chicago, then the shortest path from LA to Chicago can be computed separately from the final leg (Chicago to NY).
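This cost-to-go recursion is easy to see in code. The sketch below uses a made-up set of cities and distances (purely illustrative, not taken from the figure): the memoised cost_to_go plays the role of V, and the Chicago sub-problem is solved once and reused by every route that passes through it.

from functools import lru_cache

# Toy illustration of optimal substructure; edge weights are invented.
graph = {
    "LA":      {"Denver": 1000, "Phoenix": 400},
    "Phoenix": {"Denver": 800, "Chicago": 1700},
    "Denver":  {"Chicago": 1000},
    "Chicago": {"NY": 800},
    "NY":      {},
}

@lru_cache(maxsize=None)
def cost_to_go(city):
    """Shortest remaining distance from `city` to NY (the 'cost-to-go')."""
    if city == "NY":
        return 0
    return min(w + cost_to_go(nxt) for nxt, w in graph[city].items())

print(cost_to_go("LA"))        # best total distance from start
print(cost_to_go("Chicago"))   # sub-problem solved once, reused by all callers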
II. Value / Policy Iteration using Grid World
• Calculate initial steps of Policy Evaluation using a grid world example seen in our earlier lectures
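A minimal sketch of those initial sweeps, assuming the 4×4 grid world from the lectures (terminal states in two opposite corners, reward -1 per move, undiscounted, equiprobable random policy); if the lecture example differs, only the constants below change.

import numpy as np

# 4x4 gridworld (assumed): terminal corners (0,0) and (3,3), reward -1 per
# move, gamma = 1, equiprobable random policy. Moves off the grid leave the
# state unchanged.
N = 4
terminals = {(0, 0), (N - 1, N - 1)}
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def sweep(V):
    V_new = np.zeros_like(V)
    for r in range(N):
        for c in range(N):
            if (r, c) in terminals:
                continue
            for dr, dc in actions:
                r2, c2 = r + dr, c + dc
                if not (0 <= r2 < N and 0 <= c2 < N):
                    r2, c2 = r, c
                V_new[r, c] += 0.25 * (-1 + V[r2, c2])
    return V_new

V = np.zeros((N, N))
for k in range(1, 3):          # k = 1, 2 : the initial steps asked for
    V = sweep(V)
    print("k =", k)
    print(np.round(V, 1))      # k = 2 should match the figure on the next slide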
Vπ and Greedy π at k = 2
III. MC Value Evaluation
• Work out some steps of the MC value evaluation process for the 5-state Markov Chain example (for a random walker who goes one step to the left or right with equal probability)
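A first-visit MC sketch for this walk, assuming the usual conventions (start in the centre state, terminate off either end, reward +1 for exiting to the right and 0 otherwise, γ = 1); these conventions are an assumption, not stated on the slide.

import random

# Assumed setup: states 0..4, episodes start in the centre state 2, moves go
# left or right with probability 1/2, terminate off either end, reward +1
# for stepping off the right end, 0 otherwise, gamma = 1.
N_STATES, START = 5, 2

def episode():
    s, visited, reward = START, [], 0
    while 0 <= s < N_STATES:
        visited.append(s)
        s += random.choice([-1, 1])
    if s == N_STATES:               # walked off the right end
        reward = 1
    return visited, reward

returns = [[] for _ in range(N_STATES)]
for _ in range(10000):
    visited, G = episode()          # undiscounted, so every visit has return G
    for s in set(visited):          # first-visit MC: one sample per episode
        returns[s].append(G)

V = [sum(r) / len(r) if r else 0.0 for r in returns]
print([round(v, 2) for v in V])     # should approach 1/6, 2/6, ..., 5/6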
IV. Understanding MC through modified random walk
• The transition probabilities for state C are as shown. For all other states, the transitions are based on a fair coin flip. The square is an absorbing terminal state with reward as shown.
Perform some initial steps of calculation of Vπ using first-visit MC.
Discuss MC with Exploring Starts, etc.
Exploring starts: every state-action pair has a non-zero probability of being the starting pair.
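The overall shape of MC control with exploring starts, independent of the particular walk: states, actions(s), and step(s, a) below are hypothetical stand-ins for whichever episodic MDP is being studied.

import random
from collections import defaultdict

# Skeleton of Monte Carlo control with exploring starts (MC ES).
# `states` is a list of (non-terminal) states, `actions(s)` lists the legal
# actions, and `step(s, a) -> (next_state, reward, done)` simulates the MDP.
def mc_es(states, actions, step, episodes=10000, gamma=1.0):
    Q = defaultdict(float)
    counts = defaultdict(int)
    policy = {s: random.choice(actions(s)) for s in states}
    for _ in range(episodes):
        # Exploring start: every state-action pair can begin an episode.
        s = random.choice(states)
        a = random.choice(actions(s))
        trajectory, done = [], False
        while not done:
            s_next, r, done = step(s, a)
            trajectory.append((s, a, r))
            s = s_next
            if not done:
                a = policy[s]
        # First-visit returns: iterating backwards and overwriting means the
        # value finally kept for each (s, a) is the return from its earliest visit.
        G, first_return = 0.0, {}
        for s, a, r in reversed(trajectory):
            G = r + gamma * G
            first_return[(s, a)] = G
        for (s, a), G in first_return.items():
            counts[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]
            policy[s] = max(actions(s), key=lambda b: Q[(s, b)])
    return Q, policy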
V. Cliff Walking: TD
Discuss SARSA and Q-learning procedures with respect to this example.
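The two updates being contrasted can be written compactly; here Q is any mapping from (state, action) pairs to values, alpha is the step size, and both agents are assumed to behave ε-greedily.

# SARSA (on-policy): bootstrap from the action actually taken next.
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

# Q-learning (off-policy): bootstrap from the greedy action in s_next,
# regardless of which action the behaviour policy takes there.
def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    best_next = max((Q[s_next, b] for b in actions), default=0.0)
    Q[s, a] += alpha * (r + gamma * best_next - Q[s, a])

On the cliff-walking task this difference produces the familiar outcome: SARSA's on-policy target accounts for the occasional exploratory step off the cliff and so settles on the safer path away from the edge, whereas Q-learning learns the values of the greedy, edge-hugging policy and pays for it during ε-greedy behaviour.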