
Slides adapted from U.C. Berkeley CS 188, Dan Klein and Peter Abbeel

Markov Decision Processes (MDPs)
AIMA 17

CMPSC 442
Week 11, Meeting 31, Four Segments

Outline

● Introducing Markov Decision Processes
● Utilities of State Sequences
● Optimal Policies

● Intro to Solving MDPs


Markov Decision Processes (MDPs)
AIMA 17

CMPSC 442
Week 11, Meeting 31, Segment 1 of 4: Markov Decision Processes

Slides adapted from U.C. Berkeley CS 188, Dan Klein and Peter Abbeel

Example: Grid World
▪ A maze-like problem: states are fully observable
▪ Noisy movement: actions are unpredictable
■ 80% of the time, an action (Up, Down, Right, Left) succeeds (if there is no wall in the way)
■ 10% of the time, the agent moves at a right angle to the intended direction, to one side
■ 10% of the time, it moves at a right angle to the other side
▪ If there is a wall in the way, the agent stays put with 90% probability and moves orthogonally with 10% probability

▪ The agent receives rewards at each time step
■ Small, negative “living” reward each step
■ Big rewards come at the end (positive or negative)

▪ Goal: maximize sum of rewards over time
▪ Methods usually involve dynamic programming


Grid World Actions


Markov Decision Processes: Decisions over Time
● A set of states s ∈ S
● A set of actions a ∈ A
● A transition function T(s, a, s’)
○ The probability that taking action a in state s leads to s’, i.e., T(s, a, s’) = P(s’ | s, a)
○ Also called the model or the dynamics

● A reward function R(s, a, s’)
○ The reward received at each time step
○ Rewards figure into the agent’s utility function (over time)

● A start state s0
● Possibly a terminal state
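To make these pieces concrete, here is a minimal sketch of one way to encode an MDP directly as Python dictionaries. The states and actions follow the racing-car example mentioned later in these slides (“overheated”), and the probabilities and rewards are illustrative placeholders, not values from the text or any course codebase.

```python
import random

# One possible encoding of an MDP (a sketch; numbers are illustrative).
# "overheated" is a terminal (absorbing) state.
STATES = ["cool", "warm", "overheated"]
ACTIONS = ["slow", "fast"]

# Transition model: T[s][a] maps each successor s' to P(s' | s, a).
T = {
    "cool": {"slow": {"cool": 1.0},
             "fast": {"cool": 0.5, "warm": 0.5}},
    "warm": {"slow": {"cool": 0.5, "warm": 0.5},
             "fast": {"overheated": 1.0}},
}

# Reward function: R[(s, a, s')] is the reward for that transition.
R = {("cool", "slow", "cool"): 1.0,
     ("cool", "fast", "cool"): 2.0,
     ("cool", "fast", "warm"): 2.0,
     ("warm", "slow", "cool"): 1.0,
     ("warm", "slow", "warm"): 1.0,
     ("warm", "fast", "overheated"): -10.0}

def sample_transition(s, a):
    """Sample s' from P(s' | s, a); the outcome depends only on the current state and action."""
    successors = list(T[s][a])
    probs = [T[s][a][sp] for sp in successors]
    s_next = random.choices(successors, weights=probs, k=1)[0]
    return s_next, R[(s, a, s_next)]
```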

What is Markov about MDPs?

● For Markov decision processes, action outcomes depend only on the current state (the Markov property):

  P(St+1 = s’ | St = st, At = at, St−1 = st−1, At−1 = at−1, … , S0 = s0) = P(St+1 = s’ | St = st, At = at)

● This is like search:
○ Search: the successor function uses the current state to return successors
○ MDP: the successor state depends on the current state and the chosen action, through the transition model, and each transition also yields a reward


Search Strategies versus Decision Policies

● In deterministic single-agent search problems, find an optimal sequence of actions, from start to a goal

● For MDPs, find an optimal policy π*: S → A
○ A policy π gives an action for each state
○ An optimal policy is one that maximizes expected utility over time
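The contrast between a search solution and a policy is easy to see in code: a plan is a fixed action sequence, while a policy is a lookup from state to action. A tiny sketch, reusing the illustrative racing-car state and action names from the earlier encoding:

```python
# A deterministic search problem yields a fixed plan: a sequence of actions.
plan = ["fast", "slow", "slow"]   # breaks down as soon as an outcome differs from the one planned for

# An MDP policy maps every non-terminal state to an action, so the agent
# knows what to do wherever the noisy dynamics take it.
policy = {"cool": "fast",
          "warm": "slow"}

def act(policy, s):
    """Choose the policy's action for the current state."""
    return policy[s]
```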


Markov Decision Processes (MDPs)
AIMA 17

CMPSC 442
Week 11, Meeting 31, Segment 2 of 4: Utilities of State Sequences

Slides adapted from U.C. Berkeley CS 188, Dan Klein and Peter Abbeel

Utilities of States Over Time

● U(s): the expected cumulative reward over time, starting from state s
● Finite versus infinite horizons
○ Finite: agent is restricted to a limited number of actions
○ Infinite: agent can take any number of actions until it reaches a goal

● Additive versus discounted rewards
○ Additive: sum the rewards over time
○ Discounted: apply a discount to prefer immediate rewards over future rewards


Utility of a State Sequence

● U(s): the expected cumulative reward over time, starting from state s
○ Additive rewards for the utility function on a history:

  U([s0, s1, s2, …]) = R(s0) + R(s1) + R(s2) + ⋯

○ Discounted rewards for the utility function on a history, for discount γ ∈ [0,1]:

  U([s0, s1, s2, …]) = R(s0) + γR(s1) + γ²R(s2) + ⋯

● If γ = 1, the discounted sum of rewards is the same as the additive sum of rewards
● The best policy is the policy that produces the state sequence with the highest expected reward over time
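A tiny worked comparison of the two utility definitions, as plain arithmetic. The reward sequence below is illustrative, loosely echoing Grid World’s −0.04 living reward followed by a +1 terminal reward.

```python
def additive_utility(rewards):
    """U([s0, s1, ...]) = R(s0) + R(s1) + ... (additive rewards)."""
    return sum(rewards)

def discounted_utility(rewards, gamma):
    """U([s0, s1, ...]) = R(s0) + gamma*R(s1) + gamma^2*R(s2) + ... (discounted rewards)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [-0.04, -0.04, -0.04, 1.0]            # three living penalties, then a big terminal reward
print(additive_utility(rewards))                 # ~0.88
print(discounted_utility(rewards, gamma=1.0))    # ~0.88: gamma = 1 recovers the additive sum
print(discounted_utility(rewards, gamma=0.5))    # -0.04 - 0.02 - 0.01 + 0.125, i.e. ~0.055
```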

Discounted Rewards with Infinite Horizon MDP

● The optimal policy π* is independent of the start state
● The utility of a state s under a policy π is the expected sum of discounted rewards over time (t = 0 … ∞):

  Uπ(s) = E[ Σt=0…∞ γ^t R(St) ], with S0 = s and the expectation taken over the state sequences generated by π

● The optimal policy is the policy that gives the maximum utility:

  π* = argmaxπ Uπ(s)
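Since Uπ(s) is an expectation over state sequences, it can be estimated by simulation: roll out many episodes under a fixed policy and average the discounted returns. A minimal sketch, assuming the illustrative dictionary encoding of T and R from the earlier sketch (function and parameter names are not from any assigned codebase):

```python
import random

def estimate_utility(T, R, policy, s0, gamma=0.9, episodes=10000, horizon=200):
    """Monte Carlo estimate of U_pi(s0) = E[ sum_t gamma^t * R_t ] under a fixed policy.

    T[s][a] maps successor states to probabilities, R[(s, a, s')] is the one-step
    reward, and any state missing from T is treated as terminal.
    """
    total = 0.0
    for _ in range(episodes):
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(horizon):              # truncate: for gamma < 1, late terms are negligible
            if s not in T:                    # reached a terminal (absorbing) state
                break
            a = policy[s]
            successors = list(T[s][a])
            weights = [T[s][a][sp] for sp in successors]
            s_next = random.choices(successors, weights=weights, k=1)[0]
            ret += discount * R[(s, a, s_next)]
            discount *= gamma
            s = s_next
        total += ret
    return total / episodes
```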


Choosing the Optimal Action in State s

● The optimal policy should choose the action leading to successor states with the maximum expected utility:

  π*(s) = argmax_a Σs’ P(s’ | s, a) U(s’)

● The utility of a given state U(s) can then be defined in terms of the expected utilities of all state sequences from s, giving the Bellman equation:

  U(s) = R(s) + γ max_a Σs’ P(s’ | s, a) U(s’)

  (with the per-transition reward R(s, a, s’) used in the MDP definition, the equivalent form is U(s) = max_a Σs’ P(s’ | s, a) [ R(s, a, s’) + γ U(s’) ])
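In code, the Bellman equation becomes a one-step lookahead over actions and successor states. A minimal sketch using the per-transition reward R(s, a, s’) and the illustrative dictionary encoding from the earlier sketch:

```python
def bellman_backup(T, R, U, s, gamma=0.9):
    """One Bellman update for state s:
    U(s) <- max_a sum_{s'} P(s' | s, a) * (R(s, a, s') + gamma * U(s'))
    Returns the backed-up value together with the maximizing (greedy) action.
    """
    best_value, best_action = float("-inf"), None
    for a, outcomes in T[s].items():
        q = sum(p * (R[(s, a, s_next)] + gamma * U.get(s_next, 0.0))
                for s_next, p in outcomes.items())
        if q > best_value:
            best_value, best_action = q, a
    return best_value, best_action
```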


Example Utilities for Grid World

(Figure: the utility U(s) of each Grid World cell, computed with γ = 1 and R(s) = −0.04 for nonterminal states)



Markov Decision Processes (MDPs)
AIMA 17

CMPSC 442
Week 11, Meeting 31, Segment 3 of 4: Optimal Policies

Slides adapted from U.C. Berkeley CS 188, Dan Klein and Peter Abbeel


Utilities of Sequences

● What preferences should an agent have over reward sequences?

● More or less?

● Now or later?


Discounting

● It’s reasonable to maximize the sum of rewards
● It’s also reasonable to prefer rewards now to rewards later
● One solution: values of rewards decay exponentially (a reward worth R now is worth γR one step later, and γ²R two steps later)


Discounting

● How to discount?
○ Each time we descend a level, we multiply in the discount once

● Why discount?
○ Sooner rewards probably do have higher utility than later rewards
○ Also helps our algorithms converge

● Example: discount of 0.5
○ U([1,2,3]) = 1*1 + 0.5*2 + 0.25*3 = 2.75
○ U([1,2,3]) < U([3,2,1]), since U([3,2,1]) = 3*1 + 0.5*2 + 0.25*1 = 4.25

Stationary Preferences

● Theorem: if we assume stationary preferences, i.e., the preference between two reward sequences does not change when the same reward is prepended to both:

  [a1, a2, a3, …] ≻ [b1, b2, b3, …] ⟺ [r, a1, a2, a3, …] ≻ [r, b1, b2, b3, …]

● Then: there are only two ways to define utilities
○ Additive utility: U([s0, s1, s2, …]) = R(s0) + R(s1) + R(s2) + ⋯
○ Discounted utility: U([s0, s1, s2, …]) = R(s0) + γR(s1) + γ²R(s2) + ⋯

Infinite Utilities?!

● Problem: What if the game lasts forever? Do we get infinite rewards?
● Solutions:
○ Finite horizon (similar to depth-limited search)
■ Terminate episodes after a fixed T steps (e.g., a lifetime)
■ Gives nonstationary policies (π depends on the time left)
○ Discounting: use 0 < γ < 1
■ Smaller γ means a smaller “horizon”, i.e., a shorter-term focus
○ Absorbing state: guarantees that for every policy, a terminal state will eventually be reached (like “overheated” for racing)

Recap: Defining MDPs

● Markov decision processes:
○ Set of states S
○ Start state s0
○ Set of actions A
○ Transitions P(s’|s,a) (or T(s,a,s’))
○ Rewards R(s,a,s’) (and discount γ)
● MDP quantities so far:
○ Policy = choice of action for each state
○ Utility = sum of (discounted) rewards

Markov Decision Processes (MDPs)
AIMA 17

CMPSC 442
Week 11, Meeting 31, Segment 4 of 4: Solving MDPs

Slides adapted from U.C. Berkeley CS 188, Dan Klein and Peter Abbeel

Optimal Quantities

▪ The value (utility) of a state s: V*(s) = the expected utility starting in s and acting optimally
▪ The value of a q-state (s, a): Q*(s, a) = the expected utility of having taken action a from state s and thereafter acting optimally
▪ The optimal policy: π*(s) = the optimal action from state s

(Figures: snapshots of the Gridworld demo showing V values and Q values under several parameter settings)

Search Tree

● We’re doing way too much work!
● Problem: states are repeated
○ Idea: only compute needed quantities once
● Problem: the tree goes on forever
○ Idea: do a depth-limited computation, but with increasing depths until the change is small
○ Note: deep parts of the tree eventually don’t matter if γ < 1

Summary

● MDPs
○ Search for the best actions to take in an environment where actions carry negative or positive rewards
○ Simplest case: an infinite horizon with discounted rewards gives a stationary policy
○ The utility of a state is its expected cumulative reward
○ The best policy chooses, in each state, the action that transitions to successor states with the maximum expected utility
● Next time: Solving MDPs using Value Iteration
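As a preview of next time: the Search Tree slide’s idea of a depth-limited computation with increasing depths until the change is small is, in essence, value iteration, i.e., repeatedly applying the Bellman backup to every state. A minimal sketch, again assuming the illustrative dictionary encoding of T and R used in the earlier sketches:

```python
def value_iteration(T, R, states, gamma=0.9, tolerance=1e-6):
    """Back up every state repeatedly until utilities stop changing (preview of next lecture).

    T[s][a] maps successor states to probabilities, R[(s, a, s')] is the one-step
    reward, and states missing from T are treated as terminal with utility 0.
    """
    U = {s: 0.0 for s in states}
    while True:
        new_U = {}
        for s in states:
            if s not in T:                      # terminal / absorbing state
                new_U[s] = 0.0
                continue
            new_U[s] = max(
                sum(p * (R[(s, a, s_next)] + gamma * U.get(s_next, 0.0))
                    for s_next, p in outcomes.items())
                for a, outcomes in T[s].items()
            )
        delta = max(abs(new_U[s] - U[s]) for s in states)
        U = new_U
        if delta < tolerance:
            return U
```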
○ U([1,2,3]) < U([3,2,1]) 21Optimal Policies Stationary Preferences ● Theorem: if we assume stationary preferences: ● Then: there are only two ways to define utilities ○ Additive utility: ○ Discounted utility: 22Optimal Policies Infinite Utilities?! ● Problem: What if the game lasts forever? Do we get infinite rewards? ● Solutions: ○ Finite horizon: (similar to depth-limited search) ■ Terminate episodes after a fixed T steps (e.g. life) ■ Gives nonstationary policies (π depends on time left) ○ Discounting: use 0 < γ < 1 ■ Smaller γ means smaller “horizon” = shorter term focus ○ Absorbing state: guarantees that for every policy, a terminal state will eventually be reached (like “overheated” for racing) 23Optimal Policies Recap: Defining MDPs ● Markov decision processes: ○ Set of states S ○ Start state s 0 ○ Set of actions A ○ Transitions P(s’|s,a) (or T(s,a,s’)) ○ Rewards R(s,a,s’) (and discount γ) ● MDP quantities so far: ○ Policy = Choice of action for each state ○ Utility = sum of (discounted) rewards 24Optimal Policies Markov Decision Processes (MDPs) AIMA 17 CMPSC 442 Week 11, Meeting 31, Segment 4 of 4: Solving MDPs Slides adapted from U.C. Berkeley CS 188, Dan Klein and Peter Abbeel [ Optimal Quantities ▪ ▪ ▪ π 26Solving MDPs Snapshot of Demo – Gridworld V Values 27Solving MDPs Snapshot of Demo – Gridworld Q Values 28Solving MDPs Snapshot of Demo – Gridworld V Values 29Solving MDPs Snapshot of Demo – Gridworld Q Values 30Solving MDPs Snapshot of Demo – Gridworld V Values 31 Snapshot of Demo – Gridworld Q Values 32Solving MDPs Snapshot of Demo – Gridworld V Values 33Solving MDPs Snapshot of Demo – Gridworld Q Values 34Solving MDPs Search Tree ● We’re doing way too much work! ● Problem: States are repeated ○ Idea: Only compute needed quantities once ● Problem: Tree goes on forever ○ Idea: Do a depth-limited computation, but with increasing depths until change is small ○ Note: deep parts of the tree eventually don’t matter if γ < 1 35Solving MDPs Summary ● MDPs ○ Search for sequence of best actions given an environment (actions have negative or positive rewards) ○ Simplest case: infinite horizon with discounted rewards gives stationary policy ○ Utility of a state is the expected cumulative rewards ○ Best policy identifies the next action from state transitioning to state with the maximum utility ● Next time: Solving MDPs using Value Iteration 36Summary, Wk 11, Mtg 31