
Markov Decision Processes (MDPs)
AIMA 17

CMPSC 442
Week 11, Meeting 32, Three Segments

Slides adapted from U.C. Berkeley CS 188, Dan Klein and Peter Abbeel
[All CS188 materials are available at http://ai.berkeley.edu.]

Outline

● Bellman Equations
● Policy Evaluation
● Policy Extraction


Markov Decision Processes (MDPs)
AIMA 17

CMPSC 442
Week 11, Meeting 32, Segment 1 of 3: Bellman Equations

Slides adapted from U.C. Berkeley CS 188, Dan Klein and Peter Abbeel
[All CS188 materials are available at http://ai.berkeley.edu.]

Recap: MDPs

● Markov decision processes:
○ States S
○ Actions A
○ Transitions P(s’|s,a) (or T(s,a,s’))
○ Rewards R(s,a,s’) (and discount γ)
○ Start state s0

● Quantities:
○ Policy = map of states to actions
○ Utility = sum of discounted rewards
○ Q-Value = expected future utility from a q-state
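As a concrete anchor for these pieces, here is a minimal Python sketch of one way such an MDP might be represented; the two-state example, the names T and R, and all numbers are illustrative assumptions, not taken from the slides:

```python
# Illustrative two-state MDP (assumed example, not from the slides).
# T[s][a] lists (next_state, probability); R maps (s, a, s') to a reward.
states = ["cool", "warm"]
actions = ["slow", "fast"]
gamma = 0.9  # discount

T = {
    "cool": {"slow": [("cool", 1.0)],
             "fast": [("cool", 0.5), ("warm", 0.5)]},
    "warm": {"slow": [("cool", 0.5), ("warm", 0.5)],
             "fast": [("warm", 1.0)]},
}

R = {  # R(s, a, s')
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 2.0,
    ("cool", "fast", "warm"): 2.0,
    ("warm", "slow", "cool"): 1.0,
    ("warm", "slow", "warm"): 1.0,
    ("warm", "fast", "warm"): -10.0,
}
```

A policy would then simply be a dictionary mapping each state to an action, and a utility table a dictionary mapping each state to a number.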


Recap: Optimal Quantities

▪ The utility of a state s:
U*(s) = expected utility starting in s and acting optimally

▪ The value (utility) of a q-state (s,a):
Q*(s,a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally

▪ The optimal policy:
π*(s) = optimal action from state s

[Diagram: one-step expectimax tree, where s is a state, (s, a) is a q-state, and (s, a, s') is a transition leading to the next state s']

The Bellman Equations

How to be optimal:

Step 1: Take correct first action

Step 2: Keep being optimal


The Bellman Equations

● Definition of “optimal utility” gives a simple one-step lookahead relationship amongst optimal utility values
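Written out in the notation of the recap (a standard reconstruction; the slide's own equation image is not in the text), the relationships are:

U^*(s) = \max_a Q^*(s,a)

Q^*(s,a) = \sum_{s'} P(s' \mid s,a)\,\bigl[R(s,a,s') + \gamma\,U^*(s')\bigr]

so that

U^*(s) = \max_a \sum_{s'} P(s' \mid s,a)\,\bigl[R(s,a,s') + \gamma\,U^*(s')\bigr]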


Value Iteration

● Bellman equations characterize the optimal values (first equation below)

● Value iteration computes them:
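Both referenced equations are reconstructed here in standard form (using V for the iterated values, with V_0(s) = 0 as the usual initialization):

Bellman optimality equation:
V^*(s) = \max_a \sum_{s'} P(s' \mid s,a)\,\bigl[R(s,a,s') + \gamma\,V^*(s')\bigr]

Value iteration update:
V_{k+1}(s) \leftarrow \max_a \sum_{s'} P(s' \mid s,a)\,\bigl[R(s,a,s') + \gamma\,V_k(s')\bigr]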


Value Iteration

● Solving the Bellman equations:
○ For n states, there are n Bellman equations with unknown utilities
○ Systems of linear equations can be solved easily using linear algebra, but the max operator is not a linear operator
● Value iteration is one solution
○ Initialize with arbitrary values for the utilities of every state
○ Calculate the right-hand side of the equation, then use the new value as an update for the left-hand side, applied simultaneously to all states
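A minimal sketch of this procedure in Python, assuming the dictionary-style MDP representation sketched earlier (the names states, actions, T, R, and gamma are illustrative, not from the slides):

```python
def value_iteration(states, actions, T, R, gamma, iterations=100):
    """Repeatedly apply the Bellman update to every state simultaneously."""
    V = {s: 0.0 for s in states}  # arbitrary initial utilities (here all zero)
    for _ in range(iterations):
        # Compute the right-hand side for every state from the *old* values...
        new_V = {
            s: max(
                sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                    for s2, p in T[s][a])
                for a in actions
            )
            for s in states
        }
        V = new_V  # ...then apply the update to all states at once
    return V

# Example (using the illustrative MDP above):
# V = value_iteration(states, actions, T, R, gamma)
```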


Convergence*

● How do we know the Vk vectors are going to converge?
● Case 1: If the tree has maximum depth M, then VM holds the actual untruncated values
● Case 2: If the discount γ is less than 1
○ For any state, Vk and Vk+1 can be viewed as depth-(k+1) expectimax results computed in nearly identical search trees
○ The difference is that on the bottom layer, Vk+1 has actual rewards while Vk has zeros
○ That last layer is at best all RMAX and at worst all RMIN
○ But everything that far out is discounted by γ^k
○ So Vk and Vk+1 differ by at most γ^k RMAX
○ As k increases, the values therefore converge
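In symbols, this bound can be written (assuming V0 is initialized to all zeros and rewards are bounded in magnitude by RMAX):

\max_s \lvert V_{k+1}(s) - V_k(s) \rvert \;\le\; \gamma^{k}\, R_{\mathrm{MAX}} \;\to\; 0 \quad \text{as } k \to \infty \;(\gamma < 1)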


Markov Decision Processes (MDPs)
AIMA 17

CMPSC 442
Week 11, Meeting 32, Segment 2 of 3: Policy Evaluation

Slides adapted from U.C. Berkeley CS 188, Dan Klein and Peter Abbeel
[All CS188 materials are available at http://ai.berkeley.edu.]

Expectimax for an "Environment" Adversary
● Uncertain outcomes of agent’s actions


Recall Minimax

[Figure: minimax game tree with a max node at the root, min nodes below, and leaf utilities 10, 10, 9, 100]


● Max is the agent
● Min is the opponent
● Look ahead to get future utilities
● Back them up to the root decision node
○ Assume worst case for Max (Min plays optimally)

Expectimax: Average Case Instead of Worst Case


Idea: Uncertain outcomes controlled by chance, not an agent adversary!


Expectimax Search
● Why wouldn’t we know what the result of an action will be?

○ Explicit randomness: rolling dice
○ Actions can fail: when moving a robot, wheels might slip

● Values should reflect average-case (expectimax) outcomes, not worst-case (minimax) outcomes

● Expectimax search: use the average score under optimal play
○ Max nodes as in minimax search
○ Chance nodes are like min nodes, but represent uncertain outcomes
○ Calculate their expected utilities, i.e. take the weighted average (expectation) of the children
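Written out (a standard formulation, assuming each chance node's children c occur with known probabilities P(c)):

V(\text{max node}) = \max_{\text{children } c} V(c)

V(\text{chance node}) = \sum_{\text{children } c} P(c)\,V(c)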

[Figure: expectimax tree with a max node at the root, chance nodes below, and leaf utilities 10, 10, 9, 100]

Fixed Policies

● Expectimax trees max over all actions to compute the optimal values

● If we fix a policy π(s), the tree becomes simpler, with one action per state
○ Though the tree's value would depend on which policy we fixed

[Diagram: two one-step lookahead trees. Left ("do the optimal action"): from state s, branch on every action a through q-states (s, a) and transitions (s, a, s'). Right ("do what π says to do"): from each state s, only the single action π(s) is taken, through (s, π(s)) and (s, π(s), s')]


Utilities for a Fixed Policy

● Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy

● Define the utility of a state s under a fixed policy π:
Uπ(s) = expected total discounted rewards starting in s and following π

● Recursive relation (one-step look-ahead / Bellman equation) using a fixed policy:

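In the notation used above (a reconstruction of the slide's formula), the relation is:

U^{\pi}(s) = \sum_{s'} P\bigl(s' \mid s, \pi(s)\bigr)\,\bigl[R(s, \pi(s), s') + \gamma\,U^{\pi}(s')\bigr]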

Policy Evaluation

● How do we calculate the U's for a fixed policy π?
● Idea 1: Turn the recursive Bellman equation into updates (like value iteration); a sketch of both ideas follows below
● Efficiency: O(S²) per iteration
● Idea 2: Without the maxes, the Bellman equations are just a linear system
○ Solve with your favorite linear system solver
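A minimal sketch of both ideas in Python, assuming the same illustrative dictionary-based MDP as before; the linear-system version uses NumPy, which is an added assumption, not something the slides mention:

```python
import numpy as np

def policy_evaluation_iterative(states, T, R, gamma, policy, iterations=100):
    """Idea 1: turn the fixed-policy Bellman equation into repeated updates."""
    U = {s: 0.0 for s in states}
    for _ in range(iterations):
        U = {s: sum(p * (R.get((s, policy[s], s2), 0.0) + gamma * U[s2])
                    for s2, p in T[s][policy[s]])
             for s in states}
    return U

def policy_evaluation_linear(states, T, R, gamma, policy):
    """Idea 2: with no max, solve the linear system (I - gamma * P) u = r."""
    n = len(states)
    idx = {s: i for i, s in enumerate(states)}
    P = np.zeros((n, n))  # P[i, j] = P(s_j | s_i, policy(s_i))
    r = np.zeros(n)       # expected one-step reward under the policy
    for s in states:
        for s2, p in T[s][policy[s]]:
            P[idx[s], idx[s2]] += p
            r[idx[s]] += p * R.get((s, policy[s], s2), 0.0)
    u = np.linalg.solve(np.eye(n) - gamma * P, r)
    return {s: u[idx[s]] for s in states}
```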


Markov Decision Processes (MDPs)
AIMA 17

CMPSC 442
Week 11, Meeting 32, Segment 3 of 3: Policy Extraction

Slides adapted from U.C. Berkeley CS 188, Dan Klein and Peter Abbeel
[All CS188 materials are available at http://ai.berkeley.edu.]

Computing Actions from Values (aka Utilities)

● Let’s imagine we have the optimal values V*(s)
● How should we act? It’s not obvious!
● We need to do a mini-expectimax (one step)

● This is called policy extraction: it gets the policy implied by the values
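The one-step mini-expectimax can be written (standard reconstruction) as:

\pi^*(s) = \arg\max_a \sum_{s'} P(s' \mid s,a)\,\bigl[R(s,a,s') + \gamma\,V^*(s')\bigr]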


Computing Actions from Q-Values

● Let's imagine we have the optimal Q-values Q*(s,a)

● How should we act?
○ Completely trivial to decide!

● Important lesson: actions are easier to select from Q-values than values!
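In symbols (standard notation), the choice is simply:

\pi^*(s) = \arg\max_a Q^*(s,a)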


Problems with Value Iteration

● Value iteration repeats the Bellman updates (the Vk+1 update shown earlier)

● Problem 1: It's slow (O(S²A) per iteration)

● Problem 2: The “max” at each state rarely changes

● Problem 3: The policy often converges long before the values


[Figures: value iteration on Gridworld, showing the computed values at k = 0 through k = 12 and at k = 100, with Noise = 0.2, Discount = 0.9, Living reward = 0]

Policy Iteration

● Alternative approach for optimal values:
○ Step 1: Policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence
○ Step 2: Policy improvement: update policy using one-step look-ahead with resulting converged (but not optimal!) utilities as future values
○ Repeat steps until policy converges

● This is policy iteration
○ It's still optimal!
○ Can converge (much) faster under some conditions
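A minimal Python sketch of this loop, reusing the illustrative dictionary-based MDP representation from earlier (all names here are assumptions for illustration):

```python
def policy_iteration(states, actions, T, R, gamma, eval_iters=100):
    """Alternate policy evaluation and one-step-lookahead policy improvement."""
    policy = {s: actions[0] for s in states}  # start from an arbitrary policy
    while True:
        # Step 1: policy evaluation (fixed policy, no max over actions)
        U = {s: 0.0 for s in states}
        for _ in range(eval_iters):
            U = {s: sum(p * (R.get((s, policy[s], s2), 0.0) + gamma * U[s2])
                        for s2, p in T[s][policy[s]])
                 for s in states}
        # Step 2: policy improvement via one-step look-ahead on the evaluated U
        new_policy = {
            s: max(actions,
                   key=lambda a: sum(p * (R.get((s, a, s2), 0.0) + gamma * U[s2])
                                     for s2, p in T[s][a]))
            for s in states
        }
        if new_policy == policy:  # no change: the policy has converged
            return policy, U
        policy = new_policy
```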


Policy Iteration

● Evaluation: For the fixed current policy π, find values with policy evaluation:
○ Iterate until the values converge (first update below)

● Improvement: For fixed values, get a better policy using policy extraction:
○ One-step look-ahead (second update below)
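Both updates are reconstructed here in standard notation, with π_i the current policy:

Evaluation (iterate until the values converge):
U^{\pi_i}_{k+1}(s) \leftarrow \sum_{s'} P\bigl(s' \mid s, \pi_i(s)\bigr)\,\bigl[R(s, \pi_i(s), s') + \gamma\,U^{\pi_i}_k(s')\bigr]

Improvement (one-step look-ahead):
\pi_{i+1}(s) = \arg\max_a \sum_{s'} P(s' \mid s,a)\,\bigl[R(s,a,s') + \gamma\,U^{\pi_i}(s')\bigr]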


Comparison: Two Dynamic Programming Approaches

● Value iteration and policy iteration compute the same thing (all optimal values)
● In value iteration:

○ Every iteration updates both the values and (implicitly) the policy
○ We don’t track the policy, but taking the max over actions implicitly recomputes it

● In policy iteration:
○ Several passes that update utilities with a fixed policy (each pass is fast because one action is considered)
○ After the policy is evaluated, a new policy is chosen (slow like a value iteration pass)
○ The new policy will be better (or we're done)


Summary

● If you want to . . .
○ Compute optimal values: use value iteration or policy iteration
○ Compute values for a particular policy: use policy evaluation
○ Turn your values into a policy: use policy extraction (one-step lookahead)

● These all look the same!
○ They are all variations of Bellman updates
○ They all use one-step lookahead expectimax fragments
○ They differ only in whether we plug in a fixed policy, or max over actions
