Markov Decision Processes (MDPs)
AIMA 17
CMPSC 442
Week 11, Meeting 32, Three Segments
Slides adapted from U.C. Berkeley CS 188, Dan Klein and Peter Abbeel
[All CS188 materials are available at http://ai.berkeley.edu.]
Outline
● Bellman Equations
● Policy Evaluation
● Policy Extraction
Week 11, Meeting 32, Segment 1 of 3: Bellman Equations
Recap: MDPs
● Markov decision processes:
○ States S
○ Actions A
○ Transitions P(s’|s,a) (or T(s,a,s’))
○ Rewards R(s,a,s’) (and discount γ)
○ Start state s0
● Quantities:
○ Policy = map of states to actions
○ Utility = sum of discounted rewards
○ Q-Value = expected future utility from a q-state
[Figure: one-step lookahead tree from state s through action a, q-state (s, a), and transition (s, a, s′) to next state s′]
Recap: Optimal Quantities
▪ The utility of a state s: U*(s) = expected utility starting in s and acting optimally
▪ The value (utility) of a q-state (s,a): Q*(s,a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally
▪ The optimal policy: π*(s) = optimal action from state s
[Figure: lookahead tree over states and q-states: s is a state, (s, a) is a q-state, (s, a, s′) is a transition to next state s′]
The Bellman Equations
How to be optimal:
Step 1: Take correct first action
Step 2: Keep being optimal
The Bellman Equations
● The definition of “optimal utility” gives a simple one-step lookahead relationship amongst optimal utility values:
U*(s) = max_a Q*(s,a)
Q*(s,a) = Σ_s′ T(s, a, s′) [ R(s, a, s′) + γ U*(s′) ]
● Taken together: U*(s) = max_a Σ_s′ T(s, a, s′) [ R(s, a, s′) + γ U*(s′) ]
Value Iteration
● The Bellman equations characterize the optimal values (written V*(s) here, interchangeably with U*(s) above):
V*(s) = max_a Σ_s′ T(s, a, s′) [ R(s, a, s′) + γ V*(s′) ]
● Value iteration computes them by turning this equation into a repeated update:
V_k+1(s) ← max_a Σ_s′ T(s, a, s′) [ R(s, a, s′) + γ V_k(s′) ], starting from arbitrary initial values V_0 (e.g., all zeros)
Value Iteration
● Solving the Bellman equations:
○ For n states, there are n Bellman equations in n unknown utilities
○ Systems of linear equations can be solved easily using linear algebra, but the max operator is not a linear operator
● Value iteration is one solution method
○ Initialize the utility of every state with an arbitrary value
○ Calculate the right-hand side of the equation, then use the new value as an update for the left-hand side, applied simultaneously to all states (see the Python sketch below)
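A minimal Python sketch of this procedure (an illustrative sketch, not the course's code), assuming a toy dictionary encoding of the MDP: actions[s] lists the actions available in state s, transitions[s, a] lists (next_state, probability) pairs for T(s, a, s′), and reward[s, a, s2] gives R(s, a, s′). All of these names are made up for the example.

def value_iteration(states, actions, transitions, reward, gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in states}              # arbitrary initialization (here: all zeros)
    while True:
        new_V = {}
        for s in states:
            if not actions[s]:                # terminal state: nothing to back up
                new_V[s] = 0.0
                continue
            # Bellman update: best expected one-step reward plus discounted future value
            new_V[s] = max(
                sum(p * (reward[s, a, s2] + gamma * V[s2])
                    for s2, p in transitions[s, a])
                for a in actions[s]
            )
        if max(abs(new_V[s] - V[s]) for s in states) < tol:
            return new_V                      # values have (numerically) converged
        V = new_V                             # replace all states at once with the new values

# Tiny made-up example: state A can 'stay' (nothing happens) or 'go' (reach terminal B
# with probability 0.8 and reward 10, otherwise remain in A with reward 0).
states = ["A", "B"]
actions = {"A": ["stay", "go"], "B": []}
transitions = {("A", "stay"): [("A", 1.0)],
               ("A", "go"):   [("B", 0.8), ("A", 0.2)]}
reward = {("A", "stay", "A"): 0.0,
          ("A", "go", "B"): 10.0,
          ("A", "go", "A"): 0.0}
print(value_iteration(states, actions, transitions, reward))   # V(A) ≈ 9.76, V(B) = 0

The batch replacement V = new_V is what “applied simultaneously to all states” means: every state is updated from the previous iteration's values, not from a mix of old and new ones.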
Convergence*
● How do we know the V_k vectors are going to converge?
● Case 1: If the tree has maximum depth M, then V_M holds the actual untruncated values
● Case 2: If the discount γ is less than 1
○ For any state, V_k and V_k+1 can be viewed as the results of depth-(k+1) expectimax computations over nearly identical search trees
○ The difference is that on the bottom layer, V_k+1 has actual rewards while V_k has zeros
○ That last layer is at best all R_max and at worst all R_min
○ But everything on that layer is discounted by γ^k
○ So V_k and V_k+1 differ by at most γ^k · max|R|
○ As k increases, the values therefore converge (a compact statement of this bound follows below)
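As a rough standard bound (assuming V_0 = 0 and that every reward has magnitude at most max|R|):

max_s |V_k+1(s) − V_k(s)| ≤ γ^k · max|R|

Summing the geometric tail, V_k is within γ^k · max|R| / (1 − γ) of the true optimal values, so the error shrinks to zero as k → ∞ whenever γ < 1.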
Week 11, Meeting 32, Segment 2 of 3: Policy Evaluation
Expectimax for “Environment” Adversary
● Uncertain outcomes of agent’s actions
Recall Minimax
[Figure: minimax tree, a max node over two min nodes, with leaf values 10, 10, 9, 100]
● Max is the agent
● Min is the opponent
● Look ahead to get future utilities
● Back them up to the root decision node
○ Assume worst case for Max (Min plays optimally)
Expectimax: Average Case Instead of Worst Case
[Figure: the same tree, but with chance nodes in place of the min nodes, over leaf values 10, 10, 9, 100]
Idea: Uncertain outcomes controlled by chance, not an agent adversary!
Expectimax Search
● Why wouldn’t we know what the result of an action will be?
○ Explicit randomness: rolling dice
○ Actions can fail: when moving a robot, wheels might slip
● Values should reflect average-case (expectimax) outcomes, not worst-case (minimax) outcomes
● Expectimax search: use average score under optimal play
○ Max nodes as in minimax search
○ Chance nodes are like min nodes, but represent uncertain outcomes
○ Calculate their expected utilities, i.e. take a weighted average (expectation) of the children (see the sketch below)
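A tiny recursive sketch of this computation (illustrative only; the tuple-based tree encoding and the uniform chance probabilities below are assumptions, not given on the slide):

def expectimax(node):
    kind = node[0]
    if kind == "leaf":                        # ("leaf", value)
        return node[1]
    if kind == "max":                         # ("max", [child, ...]) -> best child
        return max(expectimax(c) for c in node[1])
    if kind == "chance":                      # ("chance", [(prob, child), ...]) -> expectation
        return sum(p * expectimax(c) for p, c in node[1])
    raise ValueError(f"unknown node kind: {kind}")

# The leaf values from the figures above, with a uniform chance over each pair:
tree = ("max", [
    ("chance", [(0.5, ("leaf", 10)), (0.5, ("leaf", 10))]),   # expectation 10
    ("chance", [(0.5, ("leaf", 9)),  (0.5, ("leaf", 100))]),  # expectation 54.5
])
print(expectimax(tree))   # 54.5: expectimax prefers the right branch

Minimax on the same leaves would back up min(10, 10) = 10 and min(9, 100) = 9 and pick the left branch; averaging instead of minimizing is the only difference.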
Fixed Policies
● Expectimax trees max over all actions to compute the optimal values
● If we fix a policy π(s), the tree becomes simpler, with only one action per state
○ Though the tree’s value then depends on which policy we fixed
[Figure: two lookahead trees. Left (“Do the optimal action”): state s, all actions a, q-states (s, a), transitions (s, a, s′). Right (“Do what π says to do”): state s, the single action π(s), q-state (s, π(s)), transitions (s, π(s), s′).]
Utilities for a Fixed Policy
● Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy
● Define the utility of a state s under a fixed policy π:
Uπ(s) = expected total discounted rewards starting in s and following π
● Recursive relation (one-step look-ahead / Bellman equation) for a fixed policy:
Uπ(s) = Σ_s′ T(s, π(s), s′) [ R(s, π(s), s′) + γ Uπ(s′) ]
Policy Evaluation
● How do we calculate the U’s for a fixed policy π?
● Idea 1: Turn the recursive Bellman equation into an update rule (like value iteration):
Uπ_k+1(s) ← Σ_s′ T(s, π(s), s′) [ R(s, π(s), s′) + γ Uπ_k(s′) ]
○ Efficiency: O(S²) per iteration
● Idea 2: Without the maxes, the Bellman equations are just a linear system
○ Solve with your favorite linear system solver (a Python sketch of both ideas follows below)
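A short Python sketch of both ideas (illustrative only), reusing the toy dictionary MDP encoding assumed in the value-iteration sketch earlier; pi is assumed to map each non-terminal state to the action the fixed policy chooses there.

import numpy as np

def policy_evaluation_iterative(states, pi, transitions, reward, gamma=0.9, tol=1e-8):
    """Idea 1: run Bellman updates without the max until the values settle."""
    U = {s: 0.0 for s in states}
    while True:
        new_U = {}
        for s in states:
            a = pi.get(s)                      # the single action the policy fixes
            if a is None:                      # terminal state
                new_U[s] = 0.0
                continue
            new_U[s] = sum(p * (reward[s, a, s2] + gamma * U[s2])
                           for s2, p in transitions[s, a])
        if max(abs(new_U[s] - U[s]) for s in states) < tol:
            return new_U
        U = new_U

def policy_evaluation_linear(states, pi, transitions, reward, gamma=0.9):
    """Idea 2: with no max, U solves the linear system (I - gamma * P_pi) U = r_pi."""
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    P = np.zeros((n, n))                       # P[i, j] = T(s_i, pi(s_i), s_j)
    r = np.zeros(n)                            # r[i] = expected one-step reward under pi
    for s in states:
        a = pi.get(s)
        if a is None:
            continue                           # terminal: row of zeros, reward 0
        for s2, p in transitions[s, a]:
            P[idx[s], idx[s2]] += p
            r[idx[s]] += p * reward[s, a, s2]
    U = np.linalg.solve(np.eye(n) - gamma * P, r)
    return {s: U[idx[s]] for s in states}

The linear-system version solves the equations exactly in one shot, which is attractive for small state spaces; the iterative version avoids building matrices and is the form reused in the policy-iteration loop later.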
Week 11, Meeting 32, Segment 3 of 3: Policy Extraction
Computing Actions from Values (aka Utilities)
● Let’s imagine we have the optimal values V*(s)
● How should we act? It’s not obvious!
● We need to do a mini-expectimax (one step of lookahead):
π*(s) = argmax_a Σ_s′ T(s, a, s′) [ R(s, a, s′) + γ V*(s′) ]
● This is called policy extraction, since it gets the policy implied by the values (see the sketch below)
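A one-step lookahead sketch in Python (illustrative only, same toy dictionary encoding as the earlier sketches):

def extract_policy(states, actions, transitions, reward, V, gamma=0.9):
    pi = {}
    for s in states:
        if not actions[s]:
            continue                          # no action to choose in a terminal state
        # pick the action whose expected one-step backup is largest
        pi[s] = max(
            actions[s],
            key=lambda a: sum(p * (reward[s, a, s2] + gamma * V[s2])
                              for s2, p in transitions[s, a])
        )
    return pi

Note that this needs the transition model T and the rewards R; that is exactly the contrast drawn on the next slide.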
Computing Actions from Q-Values
● Let’s imagine we have the optimal Q-values Q*(s,a)
● How should we act?
○ Completely trivial to decide: π*(s) = argmax_a Q*(s,a)
● Important lesson: actions are easier to select from Q-values than from values, since no model or one-step lookahead is needed (see the snippet below)!
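For contrast, a sketch of action selection from a Q-table (illustrative only; Q is assumed to be a dict keyed by (state, action) pairs):

def policy_from_q(Q):
    best = {}                                  # state -> (best action, best Q-value)
    for (s, a), q in Q.items():
        if s not in best or q > best[s][1]:
            best[s] = (a, q)
    return {s: a for s, (a, _) in best.items()}

No transition or reward model appears anywhere: the argmax over the stored Q-values is the whole computation.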
Problems with Value Iteration
● Value iteration repeats the Bellman updates:
V_k+1(s) ← max_a Σ_s′ T(s, a, s′) [ R(s, a, s′) + γ V_k(s′) ]
● Problem 1: It’s slow: O(S²A) per iteration
● Problem 2: The “max” at each state rarely changes
● Problem 3: The policy often converges long before the values
[Figures: value iteration demo on Gridworld (Noise = 0.2, Discount = 0.9, Living reward = 0), showing the computed values V_k for k = 0, 1, 2, …, 12 and k = 100]
Policy Iteration
● Alternative approach for optimal values:
○ Step 1: Policy evaluation: calculate utilities for some fixed policy (not optimal
utilities!) until convergence
○ Step 2: Policy improvement: update policy using one-step look-ahead with
resulting converged (but not optimal!) utilities as future values
○ Repeat steps until policy converges
● This is policy iteration
○ It’s still optimal!
○ Can converge (much) faster under some conditions
Policy Iteration
● Evaluation: for the fixed current policy π, find its values with policy evaluation:
○ Iterate until the values converge:
Uπ_k+1(s) ← Σ_s′ T(s, π(s), s′) [ R(s, π(s), s′) + γ Uπ_k(s′) ]
● Improvement: for those fixed values, get a better policy using policy extraction
○ One-step look-ahead (a Python sketch of the full loop follows below):
π′(s) = argmax_a Σ_s′ T(s, a, s′) [ R(s, a, s′) + γ Uπ(s′) ]
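A compact Python sketch of the full loop (illustrative only; it assumes the toy dictionary MDP encoding and the policy_evaluation_iterative and extract_policy helpers sketched earlier in these notes):

def policy_iteration(states, actions, transitions, reward, gamma=0.9):
    # start from an arbitrary policy: the first available action in each state
    pi = {s: actions[s][0] for s in states if actions[s]}
    while True:
        # Evaluation: utilities of the *current* policy (not optimal utilities)
        U = policy_evaluation_iterative(states, pi, transitions, reward, gamma)
        # Improvement: greedy one-step lookahead with respect to those utilities
        new_pi = extract_policy(states, actions, transitions, reward, U, gamma)
        if new_pi == pi:            # policy unchanged: it is greedy w.r.t. its own values
            return pi, U
        pi = new_pi

Termination is detected on the policy, not the values: as soon as an improvement step leaves the policy unchanged, it is greedy with respect to its own utilities and therefore optimal.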
Comparison: Two Dynamic Programming Approaches
● Value iteration and policy iteration compute the same thing (all optimal values)
● In value iteration:
○ Every iteration updates both the values and (implicitly) the policy
○ We don’t track the policy, but taking the max over actions implicitly recomputes it
● In policy iteration:
○ Several passes that update utilities with a fixed policy (each pass is fast because only one action per state is considered)
○ After the policy is evaluated, a new policy is chosen (this step is slow, like a value iteration pass)
○ The new policy will be better (or we’re done)
Summary
● If you want to . . .
○ Compute optimal values: use value iteration or policy iteration
○ Compute values for a particular policy: use policy evaluation
○ Turn your values into a policy: use policy extraction (one-step lookahead)
● These all look the same!
○ They are all variations of Bellman updates
○ They all use one-step lookahead expectimax fragments
○ They differ only in whether we plug in a fixed policy, or max over actions