Reinforcement Learning Introduction
AIMA 22.1-22.2
CMPSC 442
Week 12, Meeting 36, Three Segments
Outline
● Overview of Reinforcement Learning
● Direct Estimation vs. Adaptive Dynamic Programming
● Temporal Difference Learning
Reinforcement Learning Introduction
AIMA 22.1-22.2
CMPSC 442
Week 12, Meeting 36, Segment 1 of 3: Overview of
Reinforcement Learning (RL)
Reinforcement Learning in a Nutshell
Imagine playing a new game whose rules you
don’t know; after a hundred or so moves, your
opponent announces, “You lose”.
AIMA, Introduction to Chapter 22
Problem Setup
● Agent interacts with the world and receives rewards
○ Can act
○ Can sense the rewards
● Transition model, reward function, and utilities are unknown
● Agent must learn the MDP through the rewards it receives
○ Positive experiences are reinforced
○ Negative experiences are avoided
Learning by Doing Instead of from Examples
● In supervised learning, each labeled example specifies the correct decision for a single situation
● Reinforcement learning cannot use labeled examples
○ Each action has uncertain outcomes
○ The more time steps involved in reaching the goal, the more possible
paths there are
○ Consider the number of chess board configurations in a database of all
grandmaster games (10^8) versus all possible chess boards (10^40)
Learning in Real versus Simulated Environments
● Learning through experience
○ Agent can learn on-line in its environment
○ Agent can learn off-line through a simulation of its environment
● Robotics
○ Ability to walk can be learned through simulation, by modeling
characteristics of the environment such as incline, smooth versus rough
surface
○ Learned behavior can be tested in the real world
○ Adequacy of the learning depends on capturing the physics of the real
world in the simulator
Model-Based Reinforcement Learning
● Agent uses a transition model of the environment to interpret the
reward signals and to estimate expected utilities of states
○ Model can be known
○ Model can be learned
● In partially observable environments, the transition model is also used
for state estimation
Model-Free RL: Action-Utility Learning
● No transition model is involved
● Action-utility learning, e.g., Q-learning
○ Agent learns a Q-function (quality function) Q(s,a) denoting the expected sum of
rewards from state s onward if action a is taken first (cf. Mtg 21, AIMA ch. 17)
○ Given a Q-function, the agent can choose the action with the highest Q-value
(a minimal code sketch follows)
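To make the idea concrete, here is a minimal sketch of tabular Q-learning with greedy action selection. It is an illustration, not AIMA's pseudocode; the environment interface (env.reset, env.step, env.actions) and the epsilon-greedy exploration scheme are assumptions added here.

```python
import random
from collections import defaultdict

def q_learning_episode(env, Q, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Run one episode of tabular Q-learning.

    Q maps (state, action) to an estimated sum of discounted rewards.
    The env interface (reset, step, actions) is assumed for illustration.
    """
    s = env.reset()
    done = False
    while not done:
        # Epsilon-greedy: usually exploit the highest Q-value, occasionally explore
        if random.random() < epsilon:
            a = random.choice(env.actions(s))
        else:
            a = max(env.actions(s), key=lambda act: Q[(s, act)])
        s_next, reward, done = env.step(a)
        # Move Q(s,a) toward the observed reward plus the best estimated future value
        best_next = 0.0 if done else max(Q[(s_next, act)] for act in env.actions(s_next))
        Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
        s = s_next

# Usage sketch: Q = defaultdict(float); run q_learning_episode(env, Q) repeatedly.
```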
Policy Search
● Agent learns a parameterized policy π_θ(s) that maximizes its reward (see the sketch after this list)
● No value function
● Often used in very high dimensional or continuous spaces
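As one possible illustration of a parameterized policy (the linear features and softmax form are assumptions, not a method prescribed by the text), the policy itself can be a small function of θ:

```python
import math
import random

def softmax_policy(theta, features, state, actions):
    """Parameterized policy pi_theta(s): sample an action with probability
    proportional to exp(theta . phi(state, action)).

    theta is a parameter vector and features(state, action) returns a
    feature vector of the same length (both illustrative assumptions).
    """
    scores = [sum(t * f for t, f in zip(theta, features(state, a))) for a in actions]
    m = max(scores)  # subtract the max before exponentiating, for numerical stability
    weights = [math.exp(sc - m) for sc in scores]
    return random.choices(actions, weights=weights, k=1)[0]
```

Policy search then adjusts θ directly, e.g. by estimating the gradient of the expected reward, without ever maintaining a value function.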
Reinforcement Learning Introduction
AIMA 22.1-22.2
CMPSC 442
Week 12, Meeting 36, Segment 2 of 3: Direct Estimation vs.
ADP
Passive Reinforcement Learning
● The agent has a fixed policy π_i
○ No transition model
○ No reward function
● Simplified learning task
○ Input: a fixed policy π_i(s)
○ Goal: learn the state values (expected utilities over time, written out after
this list), without learning the transition model or reward function
● Not applied in practice
○ Mainly serves as a vehicle for explaining active reinforcement learning
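For reference, the state value being learned is the expected discounted sum of rewards obtained by following the fixed policy, in the notation of AIMA chapter 17 (added here for clarity; the slide itself shows no equation):

$U^{\pi}(s) \;=\; E\!\left[\;\sum_{t=0}^{\infty} \gamma^{t}\, R\big(S_t,\ \pi(S_t),\ S_{t+1}\big) \;\Big|\; S_0 = s \right]$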
Direct Estimation in Passive RL
● Learning the state values without learning the transition model
○ Trial-and-error allows estimation of the expected utility of a state
○ Transition model is not needed: the observed outcomes occur in rough
proportion to their true probabilities, so the most probable outcomes
dominate the estimates
○ Given sufficient time, the state values provably converge to the true
utilities, but convergence is very slow
● Similar to policy evaluation
Direct Estimation in Passive RL
● General approach (a code sketch follows this list)
○ Repeatedly execute the policy, taking actions until a terminal state is
reached (i.e., success or failure at the sequential decision problem)
○ Observe the reward received in each state
○ After each run, calculate the utility of each state in the sequence as the
sum of the rewards received from that state to the end of the run (the
reward-to-go)
○ Update the running average utility for each of the observed states
● Problems with this approach
○ Treats the utility of each state as independent, ignoring the Markovian
connections between states
○ Converges very slowly
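A minimal code sketch of this procedure, assuming each trial is recorded as a list of (state, reward) pairs in visiting order (the data format is an assumption for illustration):

```python
from collections import defaultdict

def direct_utility_estimation(trials):
    """Direct estimation of state utilities from completed trials.

    Each trial is a list of (state, reward) pairs in the order visited,
    ending at a terminal state. Returns the running-average reward-to-go
    of every observed state; no transition model is ever estimated.
    """
    totals = defaultdict(float)   # sum of observed rewards-to-go per state
    counts = defaultdict(int)     # number of observations per state
    for trial in trials:
        reward_to_go = 0.0
        # Walk the trial backwards so the reward-to-go accumulates naturally
        for state, reward in reversed(trial):
            reward_to_go += reward
            totals[state] += reward_to_go
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}
```

Each visit to a state just contributes one more sample to that state's average, which is exactly why the estimates ignore the connections between neighboring states.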
Example Grid World from Chapter 17 (Mtg 31)
● Agent engages in three trials using the fixed policy
○ Starts in (1, 1)
○ Observes a living reward of -0.04 at each time step
○ Reaches (4, 3) two times, (4, 2) one time
○ Actions with unintended outcomes are circled in red
Direct Utility Estimation: Expected Reward-to-Go
● First run:
○ One reward-to-go for (1, 1): 0.75
○ Two rewards-to-go for (1, 2): 0.80; 0.88 (giving a running average of 0.84)
○ Two rewards-to-go for (1, 3): 0.84; 0.92 (giving a running average of 0.88)
. . .
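To make the running averages explicit, they are just the means of the per-state samples listed above:

$\bar{U}(1,2) = \tfrac{0.80 + 0.88}{2} = 0.84, \qquad \bar{U}(1,3) = \tfrac{0.84 + 0.92}{2} = 0.88$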
Direct Estimation is Simple but Unintelligent
● Reduces the estimation to a supervised learning problem
○ Each trial run provides one training example for every state visited
○ The label of each example is the observed reward-to-go
● Ignores the connections between states
○ Assume direct estimation has already given a high value to (3, 3)
○ Assume (3, 2) has never been seen
○ There is no way to infer that (3, 3) is a likely outcome of following the
policy from (3, 2) and that therefore (3, 2) can be expected to have high
value
Adaptive Dynamic Programming (ADP)
● Given a fixed policy π_i, the Bellman equations are linear (no use of max)
and can be solved using a linear algebra package
● The transition model and reward function can be learned by
experiencing the world, e.g., from the grid-world trial runs shown earlier
● Alternatively, modified policy iteration can be used to update the utility
estimates with the fixed-policy Bellman update shown below, which
converges quickly
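The linear system in question is the fixed-policy Bellman equation, one equation per state (standard AIMA notation, reproduced here because the slide's equation image is not in the text):

$U^{\pi_i}(s) \;=\; \sum_{s'} P\big(s' \mid s, \pi_i(s)\big)\,\Big[\,R\big(s, \pi_i(s), s'\big) + \gamma\, U^{\pi_i}(s')\,\Big]$

With n states this is n linear equations in n unknown utilities, because fixing the policy removes the max over actions.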
ADP Algorithm
● Intractable for large state spaces
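Since the algorithm figure is present here only by its title, the following is a hedged sketch of what a passive ADP learner does; it is not AIMA's PASSIVE-ADP-LEARNER pseudocode, and the class interface and data layouts are assumptions.

```python
from collections import defaultdict

class PassiveADPLearner:
    """Learn a transition model and rewards from experience, then evaluate
    the fixed policy by repeatedly applying the fixed-policy Bellman update."""

    def __init__(self, policy, gamma=1.0):
        self.policy = policy                    # fixed policy: state -> action
        self.gamma = gamma
        self.n_sa = defaultdict(int)            # N(s, a): times action a tried in s
        self.n_sas = defaultdict(int)           # N(s, a, s'): observed outcomes
        self.reward = {}                        # observed R(s, a, s')
        self.U = defaultdict(float)             # current utility estimates

    def observe(self, s, a, r, s_next):
        """Record one observed transition and its reward."""
        self.n_sa[(s, a)] += 1
        self.n_sas[(s, a, s_next)] += 1
        self.reward[(s, a, s_next)] = r

    def transition_prob(self, s, a, s_next):
        """Maximum-likelihood estimate of P(s' | s, a) from the counts."""
        return self.n_sas[(s, a, s_next)] / self.n_sa[(s, a)]

    def evaluate_policy(self, states, terminals, sweeps=50):
        """Simplified policy evaluation over the learned model; solving the
        same linear equations with a linear algebra package would also work."""
        for _ in range(sweeps):
            for s in states:
                a = self.policy.get(s)
                if s in terminals or a is None or self.n_sa[(s, a)] == 0:
                    continue                    # nothing learned about s yet
                self.U[s] = sum(
                    self.transition_prob(s, a, s2)
                    * (self.reward[(s, a, s2)] + self.gamma * self.U[s2])
                    for (ss, aa, s2) in list(self.n_sas)
                    if ss == s and aa == a
                )
```

The per-sweep cost grows with the number of states and observed outcomes, which is why the slide notes that this becomes intractable for large state spaces.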
ADP Applied to the 4 x 3 Grid World Problem
● ADP learner for the transition model
and reward function, given the
optimal policy as π_i
○ Uses γ = 1
● The figure shows the convergence of
the utility estimates
● Note that it takes multiple trials to
discover that the rarely visited states
(2, 1) and (3, 2) have non-zero utilities
Reinforcement Learning Introduction
AIMA 22.1-22.2
CMPSC 442
Week 12, Meeting 36, Segment 3 of 3: Temporal Difference
Learning (TD)
Temporal Difference Learning
● Given a current set of estimates of the utilities of states, instead of
waiting until the end of a full episode to update the utilities, do so after
each time step, using the difference between the current utility estimate
of s and the observed reward plus the discounted utility estimate of the
successor s’ as the error signal (the temporal difference; written out below)
○ Fixed learning rate α
○ At each time step, update the utility of s after a transition to s’ considering
only the current utilities of s and s’ (ignoring the transition model)
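In symbols (the standard AIMA form of the TD update, added here because the slide's equation image is not in the text):

$U^{\pi}(s) \;\leftarrow\; U^{\pi}(s) + \alpha\,\Big[\,R\big(s, \pi(s), s'\big) + \gamma\, U^{\pi}(s') - U^{\pi}(s)\,\Big]$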
Temporal Difference Learning
● Temporal-difference methods adjust utility estimates toward the ideal
equilibrium that holds when the utility estimates are correct
● With passive learning (fixed policy), the equilibrium is given by the
Bellman equation for the fixed policy (shown in the ADP segment above)
Temporal Difference Learning
● The Bellman equation for a fixed policy
● . . . and the temporal difference update rule that considers only the
observed successor state
● . . . lead to the same equilibrium in the limit, but have different
convergence properties
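A minimal passive TD learner in code (a sketch; the fixed learning rate follows the slide, and the (state, reward, next_state) trial format is an assumption):

```python
def td_update(U, s, reward, s_next, alpha=0.1, gamma=1.0):
    """One temporal-difference update after observing s --reward--> s_next.

    U is a dict of utility estimates (missing states default to 0.0).
    Only the current estimates of s and s_next are used; no transition model.
    """
    u_s = U.get(s, 0.0)
    u_next = U.get(s_next, 0.0)
    # Move U(s) a fraction alpha toward the one-step sample reward + gamma * U(s')
    U[s] = u_s + alpha * (reward + gamma * u_next - u_s)

def run_passive_td(transitions, U, alpha=0.1, gamma=1.0):
    """Apply TD updates along one trial, given as (state, reward, next_state) triples."""
    for s, r, s_next in transitions:
        td_update(U, s, r, s_next, alpha, gamma)
```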
Behavior of TD Algorithm on 4 x 3 Grid World
● Convergence behavior compared to
ADP
○ Takes longer to converge
○ Convergence is less smooth
● Much simpler
● Requires less computation per
observation
● Does not need a transition model
Comparison of ADP and TD
● TD
○ Adjusts a state to agree with its observed successor
○ Makes a single adjustment per observed transition
● ADP
○ Adjusts a state to agree with all its successors weighted by their
probabilities
○ Relies on the current estimates for the transition model (essentially
experience in simulation rather than actual experience)
○ Makes as many adjustments as needed to restore consistency between
the utility estimates U and the transition model P
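The contrast can be summarized in one update each (a paraphrase in standard notation, where the hats denote the model estimates ADP has learned from experience):

TD, one sampled successor per observed transition:
$U(s) \;\leftarrow\; U(s) + \alpha\,\big[\,R + \gamma\,U(s') - U(s)\,\big]$

ADP, all successors weighted by the learned model:
$U(s) \;\leftarrow\; \sum_{s'} \hat{P}\big(s' \mid s, \pi(s)\big)\,\big[\,\hat{R}\big(s,\pi(s),s'\big) + \gamma\,U(s')\,\big]$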
Summary
● Reinforcement learning relies on trial and error to learn about an MDP
whose parameters (transition model and reward function) are unknown
● Passive reinforcement learning, which assumes a fixed policy π, is a
step towards active reinforcement learning
● Direct estimation learns expected utilities of states by averaging over
many trial runs; it does not learn P or R and converges very slowly
● Adaptive Dynamic Programming (ADP) continually updates its estimates
of the transition model and reward function from observed episodes and
computes utilities by policy evaluation
● Temporal Difference Learning, like direct estimation, only learns the
utilities and not the MDP