Slides adapted from U.C. Berkeley CS 188, Dan Klein and Peter Abbeel
Markov Decision Processes (MDPs)
AIMA 17
CMPSC 442
Week 11, Meeting 31, Four Segments
Outline
● Introducing Markov Decision Processes
● Utilities of State Sequences
● Optimal Policies
● Intro to Solving MDPs
Markov Decision Processes (MDPs)
AIMA 17
CMPSC 442
Week 11, Meeting 31, Segment 1 of 4: Markov Decision Processes
Example: Grid World
▪ A maze-like problem: states are fully observable
▪ Noisy movement: actions are unpredictable (see the transition sketch after this list)
■ 80% of the time, an action (Up, Down, Right, Left) succeeds (if there is no wall in the way)
■ 10% of the time, the agent moves at a right angle to the intended direction, to one side
■ 10% of the time, the agent moves at a right angle to the intended direction, to the other side
▪ If there is a wall in the way, the agent stays put with 90% probability, and moves orthogonally with 10% probability
▪ The agent receives rewards at each time step
▪ Small, negative “living” reward each step
▪ Big rewards come at the end (positive or negative)
▪ Goal: maximize sum of rewards over time
▪ Methods usually involve dynamic programming
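A minimal sketch of the 80/10/10 movement noise described above, assuming states are (x, y) grid coordinates and walls are given as a set; the names and layout are illustrative assumptions, not the actual course gridworld code:

```python
import random

# Illustrative sketch of the noisy movement model (assumed names, not course code).
MOVES = {"Up": (0, 1), "Down": (0, -1), "Right": (1, 0), "Left": (-1, 0)}
# The two perpendicular alternatives for each intended action.
SIDEWAYS = {"Up": ("Left", "Right"), "Down": ("Left", "Right"),
            "Right": ("Up", "Down"), "Left": ("Up", "Down")}

def sample_next_state(state, action, walls, rng=random):
    """80%: intended direction; 10% each: one of the two perpendicular directions."""
    roll = rng.random()
    if roll < 0.8:
        chosen = action
    elif roll < 0.9:
        chosen = SIDEWAYS[action][0]
    else:
        chosen = SIDEWAYS[action][1]
    dx, dy = MOVES[chosen]
    candidate = (state[0] + dx, state[1] + dy)
    # If the chosen move would run into a wall, the agent stays put.
    return state if candidate in walls else candidate
```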
Grid World Actions
Markov Decision Processes: Decisions over Time
● A set of states s ∈ S
● A set of actions a ∈ A
● A transition function T(s, a, s’)
○ Probability that a from s leads to s’, i.e., P(s’ | s, a)
○ Also called the model or the dynamics
● A reward function R(s, a, s’)
○ The reward received at each time step
○ Rewards figure into the agent’s utility function (over time)
● A start state s₀
● Possibly a terminal state
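The components listed above can be collected into a small container. This is a hedged sketch of one possible representation; the field names and types are assumptions, not a required interface:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

State = str   # could equally be an (x, y) tuple for the grid world
Action = str

@dataclass
class MDP:
    states: List[State]
    actions: Callable[[State], List[Action]]                    # actions available in s
    transition: Callable[[State, Action], Dict[State, float]]   # T(s, a, s') as {s': P(s'|s,a)}
    reward: Callable[[State, Action, State], float]             # R(s, a, s')
    start: State                                                # s0
    terminals: List[State] = field(default_factory=list)        # possible terminal states
    gamma: float = 1.0                                          # discount factor (introduced later)
```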
What is Markov about MDPs?
● For Markov decision processes, action outcomes depend only on the current state:
P(S_{t+1} = s’ | S_t = s_t, A_t = a_t, S_{t−1} = s_{t−1}, A_{t−1}, …, S_0 = s_0) = P(S_{t+1} = s’ | S_t = s_t, A_t = a_t)
● This is like search:
○ Search: the successor function uses the current state to return its successors
○ MDP: the successor state depends on the current state and the chosen action, via the transition model, and each transition also yields a reward
Search Strategies versus Decision Policies
● In deterministic single-agent search problems, find an optimal sequence of actions, from start to a goal
● For MDPs, find an optimal policy π*: S → A
○ A policy π gives an action for each state
○ An optimal policy is one that maximizes expected utility over time
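For concreteness, a policy can be represented as a simple table from states to actions; the states and actions below are made up for illustration, not the grid world from the slides:

```python
# A policy maps every state to an action (illustrative states and actions).
policy = {
    (0, 0): "Right",
    (1, 0): "Right",
    (2, 0): "Up",
    (2, 1): "Up",
}

def act(state):
    """An agent following a fixed policy just looks up its action from the current state."""
    return policy[state]
```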
Markov Decision Processes (MDPs)
AIMA 17
CMPSC 442
Week 11, Meeting 31, Segment 2 of 4: Utilities of State Sequences
Utilities of States Over Time
● U(s): the expected cumulative reward over time, starting from state s
● Finite versus infinite horizons
○ Finite: agent is restricted to a limited number of actions
○ Infinite: agent can take any number of actions until it reaches a goal
● Additive versus discounted rewards
○ Additive: sum the rewards over time
○ Discounted: apply a discount to prefer immediate rewards over future rewards
Utility of a State Sequence
● U(s): the expected cumulative reward over time, starting from state s
○ Additive rewards for the utility of a state history:
U_h([s_0, s_1, s_2, …]) = R(s_0) + R(s_1) + R(s_2) + ⋯
○ Discounted rewards for the utility of a state history, for discount γ ∈ [0,1]:
U_h([s_0, s_1, s_2, …]) = R(s_0) + γ R(s_1) + γ² R(s_2) + ⋯
● If γ = 1, the discounted sum of rewards is the same as the additive sum of rewards
● The best policy is the policy that produces the state sequence with the highest expected reward over time
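A minimal sketch of the two definitions above, applied to a finite reward sequence; the reward values are made up, with −0.04 echoing the grid world’s living reward:

```python
def additive_return(rewards):
    """U_h = R(s0) + R(s1) + R(s2) + ..."""
    return sum(rewards)

def discounted_return(rewards, gamma):
    """U_h = R(s0) + gamma*R(s1) + gamma^2*R(s2) + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [-0.04, -0.04, -0.04, 1.0]        # a short run ending in a +1 terminal reward
print(additive_return(rewards))              # ≈ 0.88
print(discounted_return(rewards, 1.0))       # ≈ 0.88: gamma = 1 matches the additive sum
print(discounted_return(rewards, 0.9))       # smaller: later rewards count for less
```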
Discounted Rewards with Infinite Horizon MDP
● Optimal policy π* is independent of the start state
● The utility of a state s under a policy π is the expected sum of discounted rewards over time (t = 0 … ∞):
U^π(s) = E[ Σ_{t=0}^{∞} γ^t R(S_t) ], with S_0 = s and actions chosen by π
● The optimal policy is the policy that gives the maximum utility: π* = argmax_π U^π(s)
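One way to read “expected sum of discounted rewards” operationally: average discounted returns over many sampled rollouts. A minimal sketch on a made-up two-state MDP (not the grid world), just to make the expectation concrete:

```python
import random

TRANSITION = {("s", "go"): {"s": 0.5, "goal": 0.5},   # P(s' | s, a) for a toy MDP
              ("goal", "go"): {"goal": 1.0}}
REWARD = {("s", "go", "s"): -0.04, ("s", "go", "goal"): 1.0, ("goal", "go", "goal"): 0.0}
policy = {"s": "go", "goal": "go"}

def estimate_utility(state, gamma=0.9, episodes=10_000, horizon=50):
    """Monte Carlo estimate of U^pi(state) = E[sum_t gamma^t * R_t]."""
    total = 0.0
    for _ in range(episodes):
        s, ret, discount = state, 0.0, 1.0
        for _ in range(horizon):                      # truncate the infinite horizon
            a = policy[s]
            dist = TRANSITION[(s, a)]
            s_next = random.choices(list(dist), weights=list(dist.values()))[0]
            ret += discount * REWARD[(s, a, s_next)]
            discount *= gamma
            s = s_next
        total += ret
    return total / episodes

print(estimate_utility("s"))   # approximates U^pi("s") for this toy MDP
```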
Choosing the Optimal Action in State s
● The optimal policy should choose the action whose successor states have the maximum expected utility:
π*(s) = argmax_a Σ_{s’} P(s’ | s, a) [R(s, a, s’) + γ U(s’)]
● The utility of a state U(s) can then be defined in terms of the expected utilities of its successor states (the Bellman equation):
U(s) = max_a Σ_{s’} P(s’ | s, a) [R(s, a, s’) + γ U(s’)]
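The two formulas above, written as code. This is a hedged sketch assuming transition(s, a) returns {s’: P(s’|s, a)} and reward(s, a, s’) returns R(s, a, s’), as in the MDP container sketched earlier:

```python
def q_value(s, a, U, transition, reward, gamma):
    """Expected utility of taking a in s and then collecting U(s') thereafter."""
    return sum(p * (reward(s, a, s2) + gamma * U[s2])
               for s2, p in transition(s, a).items())

def bellman_backup(s, actions, U, transition, reward, gamma):
    """U(s) = max_a sum_{s'} P(s'|s,a) [R(s,a,s') + gamma U(s')]"""
    return max(q_value(s, a, U, transition, reward, gamma) for a in actions(s))

def greedy_action(s, actions, U, transition, reward, gamma):
    """pi*(s) = argmax_a sum_{s'} P(s'|s,a) [R(s,a,s') + gamma U(s')]"""
    return max(actions(s), key=lambda a: q_value(s, a, U, transition, reward, gamma))
```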
Example Utilities for Grid World
[Figure: utilities U(s) of the grid world states, computed with γ = 1 and R(s) = −0.04 for nonterminal states]
Example Utilities for Grid World
Markov Decision Processes (MDPs)
AIMA 17
CMPSC 442
Week 11, Meeting 31, Segment 3 of 4: Optimal Policies
Utilities of Sequences
● What preferences should an agent have over reward sequences?
● More or less?
● Now or later?
Discounting
● It’s reasonable to maximize the sum of rewards
● It’s also reasonable to prefer rewards now to rewards later
● One solution: values of rewards decay exponentially
Discounting
● How to discount?
○ Each time we descend a level, we multiply in the discount once
● Why discount?
○ Sooner rewards probably do have higher utility than later rewards
○ Also helps our algorithms converge
● Example: discount of 0.5
○ U([1,2,3]) = 1*1 + 0.5*2 + 0.25*3
○ U([1,2,3]) < U([3,2,1])
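Checking the example above with a quick computation (discount 0.5):

```python
def discounted_utility(rewards, gamma=0.5):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_utility([1, 2, 3]))   # 1*1 + 0.5*2 + 0.25*3 = 2.75
print(discounted_utility([3, 2, 1]))   # 3*1 + 0.5*2 + 0.25*1 = 4.25, so U([1,2,3]) < U([3,2,1])
```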
Stationary Preferences
● Theorem: if we assume stationary preferences:
[a₁, a₂, …] ≻ [b₁, b₂, …]  ⇔  [r, a₁, a₂, …] ≻ [r, b₁, b₂, …]
● Then: there are only two ways to define utilities
○ Additive utility: U([r₀, r₁, r₂, …]) = r₀ + r₁ + r₂ + ⋯
○ Discounted utility: U([r₀, r₁, r₂, …]) = r₀ + γr₁ + γ²r₂ + ⋯
Infinite Utilities?!
● Problem: What if the game lasts forever? Do we get infinite rewards?
● Solutions:
○ Finite horizon: (similar to depth-limited search)
■ Terminate episodes after a fixed number of steps T (e.g., life)
■ Gives nonstationary policies (π depends on time left)
○ Discounting: use 0 < γ < 1 (a bound is sketched after this list)
■ Smaller γ means a smaller “horizon”, i.e., a shorter-term focus
○ Absorbing state: guarantees that for every policy, a terminal state will
eventually be reached (like “overheated” for racing)
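Why discounting keeps the sum finite: with 0 ≤ γ < 1 and per-step rewards bounded by some R_max (a standard geometric-series bound, not spelled out on the slide),

```latex
U([s_0, s_1, s_2, \ldots]) \;=\; \sum_{t=0}^{\infty} \gamma^{t} R(s_t)
\;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max}
\;=\; \frac{R_{\max}}{1-\gamma}
```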
Recap: Defining MDPs
● Markov decision processes:
○ Set of states S
○ Start state s₀
○ Set of actions A
○ Transitions P(s’|s,a) (or T(s,a,s’))
○ Rewards R(s,a,s’) (and discount γ)
● MDP quantities so far:
○ Policy = Choice of action for each state
○ Utility = sum of (discounted) rewards
Markov Decision Processes (MDPs)
AIMA 17
CMPSC 442
Week 11, Meeting 31, Segment 4 of 4: Solving MDPs
Optimal Quantities
▪ The value (utility) of a state s: V*(s) = the expected utility starting in s and acting optimally
▪ The value (utility) of a q-state (s, a): Q*(s, a) = the expected utility after taking action a from state s and thereafter acting optimally
▪ The optimal policy: π*(s) = the optimal action from state s
Snapshots of Demo – Gridworld V Values and Q Values
[Figure-only slides: screenshots of the gridworld demo showing V values and Q values]
Search Tree
● We’re doing way too much work!
● Problem: States are repeated
○ Idea: Only compute needed quantities once
● Problem: Tree goes on forever
○ Idea: Do a depth-limited computation, but with increasing depths until change is small
○ Note: deep parts of the tree eventually don’t matter if γ < 1
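A minimal sketch of the two ideas above: compute each state’s value once per depth, and increase the depth until the change is small. This is essentially the value iteration algorithm previewed for next time; the transition/reward interfaces follow the earlier (assumed) sketches:

```python
def depth_limited_values(states, actions, transition, reward, gamma, tol=1e-6, max_depth=1000):
    """Bottom-up, depth-limited computation: each state once per depth, deeper until stable."""
    U = {s: 0.0 for s in states}                     # depth-0 values
    for _ in range(max_depth):                       # increase the depth
        U_next = {}
        for s in states:
            acts = actions(s)
            if not acts:                             # terminal state: nothing more to collect
                U_next[s] = 0.0
                continue
            U_next[s] = max(sum(p * (reward(s, a, s2) + gamma * U[s2])
                                for s2, p in transition(s, a).items())
                            for a in acts)
        if max(abs(U_next[s] - U[s]) for s in states) < tol:   # change is small: stop
            return U_next
        U = U_next
    return U
```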
Summary
● MDPs
○ Search for the sequence of best actions given an environment (actions have negative or positive rewards)
○ Simplest case: infinite horizon with discounted rewards gives a stationary policy
○ The utility of a state is the expected cumulative (discounted) reward from that state
○ The best policy chooses, in each state, the action whose successor states have the maximum expected utility
● Next time: Solving MDPs using Value Iteration