
Reinforcement Learning Introduction
AIMA 22.1-22.2

CMPSC 442
Week 12, Meeting 36, Three Segments

Outline

● Overview of Reinforcement Learning
● Direct Estimation vs. Adaptive Dynamic Programming
● Temporal Difference Learning


Segment 1 of 3: Overview of Reinforcement Learning (RL)

Reinforcement Learning in a Nutshell

Imagine playing a new game whose rules you
don’t know; after a hundred or so moves, your
opponent announces, “You lose”.

AIMA, Introduction to Chapter 22


Problem Setup

● Agent interacts with the world and receives rewards
○ Can act
○ Can sense the rewards

● Transition model, reward function, and utilities are unknown
● Agent must learn the MDP through the rewards it receives (a minimal
interaction loop is sketched after this list)

○ Positive experiences are reinforced
○ Negative experiences are avoided
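To make this setup concrete, here is a minimal sketch of the interaction loop; the environment object and its reset/step interface are illustrative assumptions, not part of the AIMA material.

```python
# Minimal sketch of the RL interaction loop: the agent can only act and
# sense rewards; the transition model and reward function stay hidden
# inside the (assumed) environment object.

def run_episode(env, choose_action, max_steps=1000):
    """Run one episode and return the observed (state, action, reward) triples."""
    trajectory = []
    state = env.reset()                              # observe a start state
    for _ in range(max_steps):
        action = choose_action(state)                # the agent can act ...
        next_state, reward, done = env.step(action)  # ... and sense the reward
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory
```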


Learning by Doing Instead of from Examples

● In supervised learning each labeled example is a single decision
● Reinforcement learning cannot use labeled examples

○ Each action has uncertain outcomes
○ The more time steps involved in reaching the goal, the more possible
paths there are
○ Consider the number of chess board configurations in a database of all
grandmaster games (10^8) versus all possible chess boards (10^40)


Learning in Real versus Simulated Environments

● Learning through experiencing
○ Agent can learn on-line in its environment
○ Agent can learn off-line through a simulation of its environment

● Robotics
○ Ability to walk can be learned through simulation, by modeling
characteristics of the environment such as incline and surface roughness
○ Learned behavior can be tested in the real world
○ Adequacy of the learning depends on capturing the physics of the real
world in the simulator


Model-Based Reinforcement Learning

● Agent uses a transition model of the environment to interpret the
reward signals and to estimate expected utilities of states, which then
drive action selection by lookahead (shown below)
○ Model can be known
○ Model can be learned

● In partially observable environments, the transition model is also used
for state estimation
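Concretely, with utility estimates U and a known or learned model, the agent can act by one-step expected-utility lookahead, the standard MDP rule from the earlier meetings rather than anything new to this one:

π(s) = argmax_a Σ_s' P(s' | s, a) [ R(s, a, s') + γ U(s') ]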


Model-Free RL: Action-Utility Learning

● No transition model is involved
● Action-utility learning, e.g., Q-learning

○ Agent learns a Q-function (quality function) Q(s,a) denoting the expected
sum of rewards from state s onward if action a is taken (cf. Mtg 21, AIMA ch. 17)

○ Given a Q-function, the agent can choose the action with the highest
Q-value, as in the sketch below
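A minimal sketch of acting greedily from a tabular Q-function, with the standard Q-learning update included for reference; the dictionary-based Q-table, the action list, and the α/γ values are illustrative assumptions.

```python
# Sketch: tabular Q-function as a dict mapping (state, action) -> value.
# The update shown is the standard Q-learning step; alpha is a learning
# rate and gamma a discount factor (both illustrative choices).

def best_action(Q, state, actions):
    """Pick the action with the highest current Q-value (ties broken arbitrarily)."""
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

def q_update(Q, state, action, reward, next_state, actions, alpha=0.1, gamma=1.0):
    """One Q-learning step: move Q(s,a) toward r + gamma * max_a' Q(s',a')."""
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
```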


Policy Search

● Agent learns a parameterized policy πθ(s) that maximizes its expected reward
● No value function is learned
● Often used in very high-dimensional or continuous spaces (one common
parameterization is sketched below)
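As an illustration only (the parameterization is an assumption, not something specified in the text), one common form of πθ(s) for a discrete action set is a softmax over learned preferences θ; how θ is updated (e.g., by a policy-gradient method) is left out of the sketch.

```python
import math
import random

# Sketch of a softmax (Boltzmann) policy pi_theta(s): theta holds one
# preference per (state, action) pair; higher preference -> higher probability.

def softmax_policy(theta, state, actions):
    """Sample an action from the softmax distribution defined by theta."""
    prefs = [theta.get((state, a), 0.0) for a in actions]
    m = max(prefs)                                  # subtract max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(actions, weights=probs, k=1)[0]
```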


Segment 2 of 3: Direct Estimation vs. Adaptive Dynamic Programming (ADP)

Passive Reinforcement Learning

● The agent has a fixed policy πi
○ No transition model
○ No reward function

● Simplified learning task
○ Input: a fixed policy πi(s)
○ Goal: learn the state values (expected utilities over time), without
learning the transition model or reward function

● Not applied in practice
○ Mainly serves as a vehicle for explaining active reinforcement learning


Direct Estimation in Passive RL

● Learning the state values without learning the transition model
○ Trial-and-error allows estimation of the expected utility of a state
○ Transition model is not needed: the observed trials already reflect the
most probable outcomes
○ Given sufficient time, the state values (provably) converge to the true
utilities, but convergence is very slow

● Similar to policy evaluation


Direct Estimation in Passive RL

● General approach (a code sketch follows this list)
○ Repeatedly execute the policy, taking actions until termination (i.e.,
success or failure at the sequential decision-making problem)
○ Observe the reward received in each state
○ After each run, calculate the utility for each state in the sequence as the
sum of rewards received from that state onward (the reward-to-go)
○ Update the running average utility for each of the observed states

● Problems with this approach
○ Treats the utilities of states as independent, ignoring the Markovian
connections between them
○ Converges very slowly
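A minimal sketch of direct utility estimation under the stated approach, assuming each trial is recorded as a list of (state, reward) pairs and γ = 1 as in the grid-world example; here a state's reward-to-go includes the reward observed at that state, which is one of several possible conventions.

```python
from collections import defaultdict

# Sketch of direct (Monte Carlo) utility estimation for a fixed policy.
# Each trial is a list of (state, reward) pairs observed while following
# the policy until termination; gamma = 1 matches the 4 x 3 grid example.
# Convention: a state's reward-to-go includes the reward observed at that
# state itself; other conventions count only the subsequent rewards.

def direct_utility_estimation(trials, gamma=1.0):
    totals = defaultdict(float)   # summed rewards-to-go per state
    counts = defaultdict(int)     # number of samples per state
    for trial in trials:
        reward_to_go = 0.0
        # Walk the trial backwards so each state's reward-to-go is the
        # (discounted) sum of rewards from that state to the end.
        for state, reward in reversed(trial):
            reward_to_go = reward + gamma * reward_to_go
            totals[state] += reward_to_go
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}
```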


Example Grid World from Chapter 17 (Mtg 31)

● Agent engages in three trials using a fixed policy
○ Starts in (1, 1)
○ Observes a living reward of -0.04 at each time step
○ Reaches (4, 3) two times, (4, 2) one time
○ Actions with unintended outcomes are circled in red


Direct Utility Estimation: Expected Reward-to-Go

● First run:
○ One reward-to-go for (1, 1): 0.76
○ Two rewards-to-go for (1, 2): 0.80; 0.88 (giving a running average of 0.84)
○ Two rewards-to-go for (1, 3): 0.84; 0.92 (giving a running average of 0.88)

. . .
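For reference, the running average can be maintained incrementally: if G is the N-th observed reward-to-go for state s, then

U(s) ← U(s) + (1/N) (G − U(s)),

which gives the same result as re-averaging all N samples.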


Direct Estimation is Simple but Unintelligent

● Reduces the estimation to a supervised learning problem
○ The trial runs are sequences of examples for each state in the run
○ The labels are the rewards-to-go

● Ignores the connections between states
○ Assume direct estimation has already given a high value to (3, 3)
○ Assume (3, 2) has never been seen
○ There is no way to infer that (3, 3) is a likely outcome of following the
policy from (3, 2) and that therefore (3, 2) can be expected to have high
value


Adaptive Dynamic Programming (ADP)

● Given a fixed policy πi, the Bellman equations are linear (no use of max)
and can be solved using a linear algebra package (see the sketch after this list)

● The transition model and reward function can be learned by experiencing
the world, e.g., from the grid-world trials shown earlier

● Alternatively, modified policy iteration can be used to update the utility
estimates with the simplified Bellman update
U(s) ← Σ_s' P(s' | s, πi(s)) [ R(s, πi(s), s') + γ U(s') ],
which converges quickly because the learned model changes only slightly
with each observation
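A minimal sketch of solving the linear fixed-policy Bellman system exactly, assuming the transition probabilities under πi and the expected one-step rewards are available as numpy arrays; the array layout is an illustrative assumption.

```python
import numpy as np

# Sketch: exact policy evaluation for a fixed policy pi.
# T[i, j] = P(state j | state i, pi(state i)), known or learned from experience.
# r[i]    = expected one-step reward in state i under the policy.
# Solves U = r + gamma * T @ U, i.e., (I - gamma * T) @ U = r.
# With gamma = 1, terminal states must be absorbing with all-zero rows in T
# so that the system has a unique solution.

def evaluate_policy(T, r, gamma=1.0):
    n = T.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * T, r)
```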


ADP Algorithm


● Intractable for large state spaces
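The pseudocode figure for this slide is not reproduced in the text above; the following is a hedged Python sketch in the spirit of AIMA's passive ADP learner, with counting-based estimates of the transition model and rewards and a few sweeps of the simplified Bellman update after each observation. The data structures and the number of sweeps are illustrative choices.

```python
from collections import defaultdict

# Hedged sketch of a passive ADP learner for a fixed policy pi.
# It estimates P(s' | s, pi(s)) and the one-step rewards from observed
# transitions by counting, then re-evaluates the utilities with a few
# sweeps of the simplified Bellman update.

class PassiveADPLearner:
    def __init__(self, pi, gamma=1.0):
        self.pi = pi                                          # fixed policy: state -> action
        self.gamma = gamma
        self.counts = defaultdict(lambda: defaultdict(int))   # counts[s][s'] under pi(s)
        self.reward_sum = defaultdict(float)                  # summed observed rewards per (s, s')
        self.U = defaultdict(float)                           # current utility estimates

    def observe(self, s, reward, s_next):
        """Record one transition s --pi(s)--> s_next with the observed reward."""
        self.counts[s][s_next] += 1
        self.reward_sum[(s, s_next)] += reward
        self._update_utilities()

    def transition_prob(self, s, s_next):
        total = sum(self.counts[s].values())
        return self.counts[s][s_next] / total

    def _update_utilities(self, sweeps=20):
        """A few sweeps of the simplified Bellman update (modified policy iteration)."""
        for _ in range(sweeps):
            for s in list(self.counts):
                self.U[s] = sum(
                    self.transition_prob(s, s2)
                    * (self.reward_sum[(s, s2)] / self.counts[s][s2]
                       + self.gamma * self.U[s2])
                    for s2 in self.counts[s]
                )
```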

ADP Applied to the 4 x 3 Grid World Problem

● ADP learner for the transition model and reward function, given the
optimal policy as πi
○ Uses γ = 1

● Figure shows convergence of the utility estimates

● Note that it takes multiple trials to discover that the rarely visited
states (2, 1) and (3, 2) have non-zero utilities


Segment 3 of 3: Temporal Difference Learning

Temporal Difference Learning

● Given a current set of estimates of the utilities of states, instead of
waiting until the end of a full episode to update the utilities, do so after
each time step, using the difference between the current estimate U(s) and
the observed reward plus the (discounted) estimate U(s') as the error signal
○ Fixed learning rate α
○ At each time step, update the utility of s after a transition to s',
considering only the current utility estimates of s and s' (ignoring the
transition model); the update rule is sketched below
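The standard TD(0) update for a fixed policy, with r the reward observed on the transition from s to s' and α the learning rate:

U(s) ← U(s) + α [ r + γ U(s') − U(s) ]

A minimal sketch of the same update, with a dictionary-based utility table as an illustrative assumption:

```python
# Sketch of the TD(0) utility update for passive reinforcement learning.
# U maps states to current utility estimates; unseen states default to 0.

def td_update(U, s, reward, s_next, alpha=0.1, gamma=1.0):
    td_error = reward + gamma * U.get(s_next, 0.0) - U.get(s, 0.0)
    U[s] = U.get(s, 0.0) + alpha * td_error
```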


Temporal Difference Learning

● Temporal-difference methods adjust utility estimates toward the ideal
equilibrium that holds when the utility estimates are correct

● With passive learning (fixed policy), the equilibrium is given by the
fixed-policy Bellman equation (restated on the next slide)


Temporal Difference Learning

● The Bellman equation for a fixed policy π

U^π(s) = Σ_s' P(s' | s, π(s)) [ R(s, π(s), s') + γ U^π(s') ]

● . . . and the temporal difference update rule that considers only the
observed successor state s' (with observed reward r)

U^π(s) ← U^π(s) + α [ r + γ U^π(s') − U^π(s) ]

● . . . lead to the same equilibrium in the limit, but have different
convergence properties


Behavior of TD Algorithm on 4 x 3 Grid World

● Convergence behavior compared to ADP
○ Takes longer to converge
○ Convergence is less smooth

● Much simpler
● Requires less computation per observation
● Does not need a transition model


Comparison of ADP and TD

● TD
○ Adjusts a state to agree with its observed successor
○ Makes a single adjustment per observed transition

● ADP
○ Adjusts a state to agree with all its successors weighted by their
probabilities
○ Relies on the current estimates for the transition model (essentially
experience in simulation rather than actual experience)
○ Makes as many adjustments as needed to restore consistency between
the utility estimates U and the transition model P


Summary

● Reinforcement learning relies on trial-and-error to learn the parameters
of an MDP

● Passive reinforcement learning, which assumes a fixed policy π, is a
step towards active reinforcement learning

● Direct estimation learns expected utilities of states by averaging over
many trial runs; it does not learn P or R and converges very slowly

● Adaptive Dynamic Programming (ADP) continually updates its estimate of
the MDP from observed experience and recomputes the utilities with policy
evaluation

● Temporal Difference Learning, like direct estimation, only learns the
utilities and not the MDP
