Decision Making Over Time
A robot planning its moves through a grid
[3 × 4 grid world: top row squares 1 2 3 4; middle row 5, (wall), 6, 7; bottom row 8 9 10 11]
45
Chance Nodes
The robot’s position, S
• Domain = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}
• The possible squares the robot can be in
• Changes each time the robot moves
• A series of positions over time
• S_0: The robot’s initial position
• S_1: The robot’s position after 1 move
• S_2: The robot’s position after 2 moves
• …
46
Dynamic Decision Network
[Dynamic decision network fragment: Move_{t−1} and S_{t−1} are parents of S_t]
Robot chooses an action at each point in time
• Up, down, left, or right
• Move succeeds with 80% chance
• 10% chance each of moving 90° or 270° from the intended direction
• If move would go through a wall, robot stays put
47
Dynamic Decision Network
P(S_t | S_{t−1}, Move_{t−1}):

S_{t−1}   Move_{t−1}   S_t=1   S_t=2   S_t=3   S_t=4   S_t=5   …
   1          U         0.9     0.1     0.0     0.0     0.0    …
   1          D         0.1     0.1     0.0     0.0     0.8    …
   1          L         0.9     0.0     0.0     0.0     0.1    …
   1          R         0.1     0.8     0.0     0.0     0.1    …
   2          U         0.1     0.8     0.1     0.0     0.0    …
   …
48
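The whole conditional probability table can be generated mechanically from the grid layout and the 80/10/10 movement rule. Below is a minimal Python sketch of that transition model under the slides' assumptions (squares numbered as above, one wall cell, and moves that would leave the grid or hit the wall keeping the robot in place); the function and variable names are illustrative choices, not anything from the slides.

# Squares as numbered on the slides; None marks the wall.
GRID = [[1, 2, 3, 4],
        [5, None, 6, 7],
        [8, 9, 10, 11]]

# (row, column) coordinates of each square.
COORD = {sq: (r, c) for r, row in enumerate(GRID)
         for c, sq in enumerate(row) if sq is not None}

STEP = {'U': (-1, 0), 'D': (1, 0), 'L': (0, -1), 'R': (0, 1)}   # unit moves
SLIPS = {'U': 'LR', 'D': 'LR', 'L': 'UD', 'R': 'UD'}            # 90-degree slips

def move(square, direction):
    """Square reached by one deterministic step; stay put at walls and edges."""
    r, c = COORD[square]
    dr, dc = STEP[direction]
    r, c = r + dr, c + dc
    if 0 <= r < len(GRID) and 0 <= c < len(GRID[0]) and GRID[r][c] is not None:
        return GRID[r][c]
    return square

def transition(square, action):
    """P(S_t = s' | S_{t-1} = square, Move_{t-1} = action), as a dict."""
    probs = {}
    for direction, p in [(action, 0.8)] + [(slip, 0.1) for slip in SLIPS[action]]:
        s_next = move(square, direction)
        probs[s_next] = probs.get(s_next, 0.0) + p
    return probs

print(transition(1, 'U'))   # {1: 0.9, 2: 0.1}, the first row of the table above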
Terminal States
Robot’s journey ends if it reaches either one
[Grid world: the two terminal squares highlighted]
49
Terminal States
Robot’s journey ends if it reaches either one
[Grid world: square 4 marked +1, square 7 marked −1]
50
Dynamic Influence Diagram
[Dynamic influence diagram: Move_{t−1} and S_{t−1} are parents of S_t, which feeds into a Utility node]

Utility(S_t):
  S_t = 4:  +1
  S_t = 7:  −1
  Else:     −0.04
Stationary Preferences
• Utility is the same over time
51
Markov Decision Process
An MDP consists of:
• S: states (e.g., the robot’s position, or Battery)
• A: actions (e.g., Move)
• P(s′ | s, a): transition model (e.g., P(S_t = s′ | S_{t−1} = s, Move_{t−1} = a))
• R(s) or R(s, a): reward (e.g., Utility(S_t = s))
Problem: Finding an optimal policy, π*
• Not enough to pick just a single best move
• Must pick a policy, π(s):
• What action should the robot choose in each state?
• The optimal policy maximizes Utility(s_0)
• What’s the policy that gives us the highest utility from our start?
52
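Read concretely, an MDP is just these pieces bundled together, plus (optionally) a discount factor. A minimal sketch of that bundle in Python, with the grid world's reward plugged in; the class name, field names, and the stub transition function are illustrative choices rather than notation from the slides, and a real transition function like the sketch after the earlier table would replace the stub.

from dataclasses import dataclass
from typing import Callable, Dict, List

State = int
Action = str

@dataclass(frozen=True)
class MDP:
    states: List[State]                                          # S
    actions: List[Action]                                        # A
    transition: Callable[[State, Action], Dict[State, float]]    # P(s' | s, a)
    reward: Callable[[State], float]                             # R(s)
    terminals: frozenset                                         # episode ends here
    gamma: float = 1.0                                           # discount factor

def grid_reward(s: State) -> float:
    """Reward from the influence diagram: +1 at square 4, -1 at 7, -0.04 elsewhere."""
    return {4: 1.0, 7: -1.0}.get(s, -0.04)

def stub_transition(s: State, a: Action) -> Dict[State, float]:
    return {s: 1.0}   # placeholder so this fragment stands alone

grid_mdp = MDP(states=list(range(1, 12)), actions=['U', 'D', 'L', 'R'],
               transition=stub_transition, reward=grid_reward,
               terminals=frozenset({4, 7}), gamma=1.0)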
Policy
[Grid world: each square labelled with the action the policy chooses there]
53
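A policy for this grid is nothing more than a table from squares to moves. For example, written as code (this particular assignment is an illustrative choice, not necessarily the policy drawn on the slide):

# One possible policy: an action for every non-terminal square (4 and 7 are terminal).
policy = {1: 'R', 2: 'R', 3: 'R',
          5: 'U', 6: 'U',
          8: 'U', 9: 'L', 10: 'L', 11: 'L'}

def pi(s):
    """pi(s): the action this policy chooses in square s."""
    return policy[s]

print(pi(3))   # 'R'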
Utility of a Policy: Finite Horizon
Robot will operate for N moves
Utility_N(s_0) = E[R(s_0) + R(s_1) + … + R(s_N) | S_0 = s_0]
Utility_N(s_0) = R(s_0) + E[R(s_1) + … + R(s_N) | S_0 = s_0]
Utility_N(s_0) = R(s_0) + ∑_{s_1} P(s_1 | s_0, π(s_0)) · (R(s_1) + E[R(s_2) + … + R(s_N) | S_1 = s_1])
Utility_N(s_0) = R(s_0) + ∑_{s_1} P(s_1 | s_0, π(s_0)) · Utility_{N−1}(s_1)
• Utility and π are stationary
Utility_0(s_0) = R(s_0)
54
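The recursion above translates directly into code: evaluate Utility_N by recursing on the number of remaining moves. The sketch below does this for a fixed policy on a tiny two-state chain whose states, transition probabilities, and rewards are invented purely for illustration; the grid world works the same way with its own P and R.

# Utility_N(s) = R(s) + sum over s1 of P(s1 | s, pi(s)) * Utility_{N-1}(s1)
# Utility_0(s) = R(s)

P = {('a', 'go'): {'a': 0.1, 'b': 0.9},     # hypothetical transition model
     ('b', 'go'): {'b': 1.0}}
R = {'a': -0.1, 'b': 1.0}                   # hypothetical rewards
policy = {'a': 'go', 'b': 'go'}             # the (only) policy in this toy chain

def utility(s, n):
    """Expected sum of rewards over n moves, starting in s and following policy."""
    if n == 0:
        return R[s]
    return R[s] + sum(p * utility(s1, n - 1)
                      for s1, p in P[(s, policy[s])].items())

print(utility('a', 3))   # about 2.78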
Utility of a Policy: Infinite Horizon
Robot keeps on operating
Utility(s_0) = E[R(s_0) + R(s_1) + … | S_0 = s_0]
Utility(s_0) = R(s_0) + E[R(s_1) + … | S_0 = s_0]
Utility(s_0) = R(s_0) + ∑_{s_1} P(s_1 | s_0, π(s_0)) · (R(s_1) + E[R(s_2) + … | S_1 = s_1])
Utility(s_0) = R(s_0) + ∑_{s_1} P(s_1 | s_0, π(s_0)) · Utility(s_1)
55
Utility of a Policy: Infinite Horizon
Utility(s_0) = E[R(s_0) + R(s_1) + …]
Utility(s_0) = R(s_0) + ∑_{s_1} P(s_1 | s_0, π(s_0)) · Utility(s_1)
• Potentially unbounded!
• Use a discount factor, γ ≤ 1 (γ < 1 keeps the sum bounded)
Utility(s_0) = E[R(s_0) + γ·R(s_1) + γ²·R(s_2) + …]
Utility(s_0) = R(s_0) + γ·∑_{s_1} P(s_1 | s_0, π(s_0)) · Utility(s_1)
• Reward now is better than reward later
• Utility and π are stationary
56
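Why a discount factor keeps the sum finite, stated for completeness as the standard bound (this assumes every reward satisfies |R(s)| ≤ R_max and γ strictly less than 1):

\[
\Bigl|\, \sum_{t=0}^{\infty} \gamma^{t} R(s_t) \Bigr|
  \;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max}
  \;=\; \frac{R_{\max}}{1-\gamma}
\]

So with γ < 1 the utility of any policy is bounded; the γ = 1 choice used on the following slides still works out because the terminal squares end the robot's journey.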
Computing Expected Utility
Utility(4) = +1, Utility(7) = −1
[Grid world: squares 4 and 7 highlighted]
57
Computing Expected Utility
π(3) = Right, γ = 1
[Grid world: square 3 highlighted]
58
Computing Expected Utility
Utility(3) = R(3) + γ·∑_{s_1} P(s_1 | 3, Right) · Utility(s_1)
[Grid world: square 3 highlighted]
59
Computing Expected Utility
Utility(3) = R(3) + γ·P(4 | 3, Right)·Utility(4) + γ·P(6 | 3, Right)·Utility(6) + γ·P(3 | 3, Right)·Utility(3)
[Grid world: square 3 highlighted, with possible successors 4, 6, and 3]
60
Computing Expected Utility
Utility(3) = −0.04 + 1·0.8·1 + 1·0.1·Utility(6) + 1·0.1·Utility(3)
[Grid world: square 3 highlighted]
61
Computing Expected Utility
π(6) = Up
[Grid world: square 6 highlighted]
62
Computing Expected Utility
Utility(6) = R(6) + 0.8·Utility(3) + 0.1·Utility(7) + 0.1·Utility(6)
[Grid world: square 6 highlighted]
63
Computing Expected Utility
π(6) = Left
[Grid world: square 6 highlighted]
64
Computing Expected Utility
Utility(6) = R(6) + 0.1·Utility(3) + 0.8·Utility(6) + 0.1·Utility(10)
[Grid world: square 6 highlighted]
65
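Evaluating a fixed policy means solving one such equation per non-terminal square. For the two squares worked above, with π(3) = Right, π(6) = Up, γ = 1, and the terminal utilities Utility(4) = +1 and Utility(7) = −1 substituted, the equations mention only Utility(3) and Utility(6), so they can be solved on their own; the sketch below simply iterates them to a fixed point. The resulting values, roughly 0.918 and 0.660, line up with the Final Values slide later in the deck.

# Utility(3) = -0.04 + 0.8*(+1) + 0.1*Utility(6) + 0.1*Utility(3)    (pi(3) = Right)
# Utility(6) = -0.04 + 0.8*Utility(3) + 0.1*(-1) + 0.1*Utility(6)    (pi(6) = Up)
u3, u6 = 0.0, 0.0
for _ in range(100):                       # iterate the pair to a fixed point
    u3 = -0.04 + 0.8 * 1 + 0.1 * u6 + 0.1 * u3
    u6 = -0.04 + 0.8 * u3 + 0.1 * (-1) + 0.1 * u6
print(round(u3, 3), round(u6, 3))          # 0.918 0.66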
Value Iteration
Bellman equation
In each state, choose the action with highest utility
Utility(s) = R(s) + γ·max_a ∑_{s_1} P(s_1 | s, a) · Utility(s_1)
Repeatedly applying this update converges to the optimal utilities, and from them the optimal policy
• Utility_0(s) = R(s)
• Utility_i(s) = R(s) + γ·max_a ∑_{s_1} P(s_1 | s, a) · Utility_{i−1}(s_1)
• Bellman update
• As i → ∞, Utility_i converges to the true utilities
• See AIMA 17.2.3 for details
• π*(s) = argmax_a ∑_{s_1} P(s_1 | s, a) · Utility(s_1)
66
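Below is a compact, self-contained Python sketch of this procedure for the grid world on these slides. The layout, the 80/10/10 movement noise, the rewards, and γ = 1 come from the slides; the names, the fixed iteration count, and the choice to hold terminal squares at their reward are illustrative implementation decisions. The values it prints should come out close to those on the Final Values slide, and the greedy policy extracted at the end should resemble the Final Policy slide.

GRID = [[1, 2, 3, 4], [5, None, 6, 7], [8, 9, 10, 11]]
COORD = {sq: (r, c) for r, row in enumerate(GRID) for c, sq in enumerate(row) if sq}
STEP = {'U': (-1, 0), 'D': (1, 0), 'L': (0, -1), 'R': (0, 1)}
SLIPS = {'U': 'LR', 'D': 'LR', 'L': 'UD', 'R': 'UD'}
TERMINALS = {4: 1.0, 7: -1.0}
GAMMA = 1.0

def R(s):
    return TERMINALS.get(s, -0.04)

def move(s, d):
    r, c = COORD[s][0] + STEP[d][0], COORD[s][1] + STEP[d][1]
    return GRID[r][c] if 0 <= r < 3 and 0 <= c < 4 and GRID[r][c] else s

def transition(s, a):                 # 0.8 intended direction, 0.1 per 90-degree slip
    probs = {}
    for d, p in [(a, 0.8), (SLIPS[a][0], 0.1), (SLIPS[a][1], 0.1)]:
        probs[move(s, d)] = probs.get(move(s, d), 0.0) + p
    return probs

def q(s, a, U):                       # expected next-state utility of doing a in s
    return sum(p * U[s1] for s1, p in transition(s, a).items())

U = {s: R(s) for s in COORD}                          # Utility_0(s) = R(s)
for _ in range(100):                                  # repeated Bellman updates
    U = {s: R(s) if s in TERMINALS
            else R(s) + GAMMA * max(q(s, a, U) for a in 'UDLR')
         for s in COORD}

policy = {s: max('UDLR', key=lambda a: q(s, a, U)) for s in COORD if s not in TERMINALS}
print({s: round(u, 3) for s, u in sorted(U.items())})
print(policy)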
Value Iteration
Utility_0(s) = R(s):
  −0.04   −0.04   −0.04    +1
  −0.04   (wall)  −0.04    −1
  −0.04   −0.04   −0.04   −0.04
67
Value Iteration
[Grid as before; square 1 highlighted with its four action values: U: −0.08, D: −0.08, L: −0.08, R: −0.08]
69
Value Iteration
[Grid: square 1 updated to −0.08; π*(1) = …]
70
Value Iteration
[Grid: square 3 highlighted; action value U: 0.024]
71
Value Iteration
[Grid: square 3; action value D: 0.024]
72
Value Iteration
[Grid: square 3; action value L: −0.08]
73
Value Iteration
[Grid: square 3; action value R: 0.752]
74
Value Iteration
[Grid: square 3 with all four action values: U: 0.024, D: 0.024, L: −0.08, R: 0.752]
75
Value Iteration
[Grid: square 3 updated to 0.752; π*(3) = R]
76
Value Iteration
[Grid: square 11 highlighted; action value U: −0.848]
77
Value Iteration
[Grid: square 11; action value D: −0.08]
78
Value Iteration
[Grid: square 11; action value L: −0.176]
79
Value Iteration
[Grid: square 11; action value R: −0.176]
80
Value Iteration
[Grid: square 11 with all four action values: U: −0.848, D: −0.08, L: −0.176, R: −0.176]
81
Value Iteration
[Grid: square 11 updated to −0.08; π*(11) = D]
82
Final Values
  0.812   0.868   0.918    +1
  0.762   (wall)  0.660    −1
  0.705   0.655   0.611   0.388
83
Final Policy
  π*(1) = R (0.812)   π*(2) = R (0.868)   π*(3) = R (0.918)    4: +1
  π*(5) = U (0.762)   (wall)              π*(6) = U (0.660)    7: −1
  π*(8) = U (0.705)   π*(9) = L (0.655)   π*(10) = L (0.611)   π*(11) = L (0.388)
84
Partial Observability
[Dynamic influence diagram as before: Move_{t−1}, S_{t−1}, S_t, and Utility nodes]
85
POMDPs
Partially Observable MDPs
• Observations instead of direct knowledge of state
• States → Belief states
• Update b = P(S | observations) each turn
• Policies are defined over belief states, π(b)
Value iteration still works
• Computing Utility(b) for belief states
• Similar to a conditional plan
Expressive but expensive
• Space of beliefs is the space of distributions over states
• Active area of research in making this computation efficient
86
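Concretely, a belief state is a probability distribution over squares, and the per-turn update above is a Bayes filter: push the current belief through the transition model, then condition on the new observation. A minimal sketch of that one step; the two-square demo models at the bottom are entirely hypothetical, since the slides do not specify a sensor.

# One step of belief updating: b'(s') is proportional to
#   P(o | s') * sum over s of P(s' | s, a) * b(s)
def belief_update(belief, action, observation, transition, obs_model):
    """belief: {state: prob}; transition(s, a) -> {s': prob}; obs_model(o, s) = P(o | s)."""
    predicted = {}
    for s, b in belief.items():                        # prediction through the dynamics
        for s1, p in transition(s, action).items():
            predicted[s1] = predicted.get(s1, 0.0) + p * b
    unnorm = {s1: obs_model(observation, s1) * p for s1, p in predicted.items()}
    z = sum(unnorm.values())                           # normalising constant
    return {s1: p / z for s1, p in unnorm.items()}

# Hypothetical two-square demo: noisy dynamics and a noisy "am I in square 1?" sensor.
toy_transition = lambda s, a: {s: 0.2, (s % 2) + 1: 0.8}
toy_sensor = lambda o, s: 0.9 if o == (s == 1) else 0.1
print(belief_update({1: 0.5, 2: 0.5}, 'R', True, toy_transition, toy_sensor))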
Multiple Agents
[Grid world, squares 1–11]
87
Game Theory
Other agents introduce additional uncertainty
• But we can assume that they are also optimizing
• Similar to the assumption underlying the minimax algorithm
Single-move games
• All agents reveal their moves simultaneously
• Rock-Paper-Scissors
• Paper beats Rock, Rock beats Scissors, Scissors beats Paper
• What is the optimal strategy (policy)?
88
Pure Strategies
Deterministic Policy
• e.g., Bart follows “Always play Rock”
• Each agent chooses a best response to others’ strategies
• Lisa should follow “Always play Paper”
• Each agent continues updating
• Bart should now follow “Always play Scissors”
• etc.
• If we reach a point where no one updates: Nash equilibrium
• No pure-strategy equilibrium in Rock-Paper-Scissors
89
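The Bart/Lisa cycle described above is easy to see in code: start from any pure strategy and repeatedly best-respond, and the process never settles, which is exactly the claim that Rock-Paper-Scissors has no pure-strategy Nash equilibrium. A small sketch (the starting strategies are arbitrary):

# Best response to a known pure strategy is simply "play whatever beats it".
BEATEN_BY = {'Rock': 'Paper', 'Paper': 'Scissors', 'Scissors': 'Rock'}

bart, lisa = 'Rock', 'Paper'              # Lisa already best-responds to Bart
history = [(bart, lisa)]
for _ in range(5):
    bart = BEATEN_BY[lisa]                # Bart switches to beat Lisa
    lisa = BEATEN_BY[bart]                # Lisa switches to beat Bart
    history.append((bart, lisa))
print(history)                            # keeps cycling, never reaches a fixed point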
Mixed Strategy
Choose actions with some probability
• “Play Rock 1⁄3, Paper 1⁄3, Scissors 1⁄3”
• Best-response reasoning works exactly the same way
• If either Bart or Lisa changes strategies unilaterally, s/he will do no better
• So no incentive to change unilaterally: Nash equilibrium
• Local optimum
• Not necessarily observed in practice
90
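The equilibrium claim can also be checked directly: against an opponent who plays each move with probability 1/3, every pure strategy (and therefore every mixture of them) earns expected payoff 0, so no unilateral deviation gains anything. A small sketch, scoring a win as +1, a loss as −1, and a tie as 0:

MOVES = ['Rock', 'Paper', 'Scissors']
BEATS = {'Rock': 'Scissors', 'Paper': 'Rock', 'Scissors': 'Paper'}

def payoff(mine, theirs):                 # +1 win, -1 loss, 0 tie
    if mine == theirs:
        return 0
    return 1 if BEATS[mine] == theirs else -1

uniform_mix = {m: 1 / 3 for m in MOVES}   # the mixed strategy from the slide
for my_move in MOVES:
    value = sum(p * payoff(my_move, their_move) for their_move, p in uniform_mix.items())
    print(my_move, value)                 # every pure strategy gets expected payoff 0.0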
Game Theory
Repeated Games
• Multiple rounds of RPS
Sequential Games
• Backgammon, Chess
• Introduce randomness as a “third player”
Imperfect Information
• Partial observability
• Use belief states (also called information sets)
91
Next Time
AIMA Chapter 18: Learning from Examples
• Computing an optimal policy relies on an accurate model
• What if I don’t know how accurate my movement is?
• Learning provides a mechanism for acquiring a model
• It can also be used to acquire a policy directly
92