
Decision Making Over Time
A robot planning its moves through a grid
[Grid world figure:
   1   2   3   4
   5   #   6   7
   8   9  10  11
 (# marks a wall square)]
45

Chance Nodes
The robot’s position
• Domain={1,2,3,4,5,6,7,8,9,10,11}
• The possible squares the robot can be in
• Changes each time the robot moves
• A series of positions over time
• X_0: The robot's initial position
• X_1: The robot's position after 1 move
• X_2: The robot's position after 2 moves
• …
46

Dynamic Decision Network
[DDN fragment: Move_{t-1} and X_{t-1} are parents of X_t]
Robot chooses an action at each point in time
• Up, down, left, or right
• Move succeeds with 80% chance
• 10% chance of moving 90° or 270° from intended direction
• If move would go through a wall, robot stays put
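Below is a minimal Python sketch of this transition model. The grid layout and square numbering come from the figure above, the wall square is inferred from the probabilities on the next slide, and all names are my own.

```python
import random

# Grid layout from the figure: 0 marks the wall square; squares are numbered 1-11.
GRID = [[1, 2, 3, 4],
        [5, 0, 6, 7],
        [8, 9, 10, 11]]

# (row, col) coordinates of each square.
POS = {sq: (r, c) for r, row in enumerate(GRID) for c, sq in enumerate(row) if sq != 0}

# Intended movement offsets, plus the two 90-degree "slip" directions for each action.
OFFSET = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}
SLIPS = {"U": ("L", "R"), "D": ("L", "R"), "L": ("U", "D"), "R": ("U", "D")}

def step(square, direction):
    """Where the robot ends up if it physically moves `direction`; walls block."""
    r, c = POS[square]
    dr, dc = OFFSET[direction]
    nr, nc = r + dr, c + dc
    if 0 <= nr < len(GRID) and 0 <= nc < len(GRID[0]) and GRID[nr][nc] != 0:
        return GRID[nr][nc]
    return square  # move would go through a wall: robot stays put

def transition(square, action):
    """P(X_t | X_{t-1} = square, Move_{t-1} = action) as a dict of probabilities."""
    slip1, slip2 = SLIPS[action]
    probs = {}
    for direction, p in ((action, 0.8), (slip1, 0.1), (slip2, 0.1)):
        nxt = step(square, direction)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs

def sample_move(square, action):
    """Sample the robot's next square under the model above."""
    outcomes = transition(square, action)
    return random.choices(list(outcomes), weights=list(outcomes.values()))[0]
```

For example, transition(1, "U") comes out to {1: 0.9, 2: 0.1}, which matches the first row of the table on the next slide.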
47

Dynamic Decision Network
[DDN fragment: Move_{t-1} and X_{t-1} are parents of X_t]

X_{t-1}   Move_{t-1}   P(X_t=1)   P(X_t=2)   P(X_t=3)   P(X_t=4)   P(X_t=5)
   1          U           0.9        0.1        0.0        0.0        0.0
   1          D           0.1        0.1        0.0        0.0        0.8
   1          L           0.9        0.0        0.0        0.0        0.1
   1          R           0.1        0.8        0.0        0.0        0.1
   2          U           0.1        0.8        0.1        0.0        0.0
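The same five rows written out as a literal Python table, so the numbers can be checked mechanically; the nested-dict encoding is just one possible choice.

```python
# P(X_t = next | X_{t-1}, Move_{t-1}) for the five rows shown above.
# Squares not listed have probability 0.
CPT = {
    (1, "U"): {1: 0.9, 2: 0.1},
    (1, "D"): {1: 0.1, 2: 0.1, 5: 0.8},
    (1, "L"): {1: 0.9, 5: 0.1},
    (1, "R"): {1: 0.1, 2: 0.8, 5: 0.1},
    (2, "U"): {1: 0.1, 2: 0.8, 3: 0.1},
}

# Each conditional distribution must sum to 1.
for (state, action), row in CPT.items():
    assert abs(sum(row.values()) - 1.0) < 1e-9, (state, action)
```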

48

Terminal States
Robot’s journey ends if it reaches either one
[Grid world figure (squares 1-11); the two terminal squares are marked]
49

Terminal States
Robot’s journey ends if it reaches either one
[Grid world figure: square 4 is marked +1, square 7 is marked −1]
50

Dynamic Influence Diagram
[Influence diagram fragment: Move_{t-1} and X_{t-1} are parents of X_t, and X_t feeds a Utility node]

X_t     Utility
 4        +1
 7        −1
Else     −0.04
Stationary Preferences
• Utility is the same over time
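The utility table above, as a small Python sketch (the function name is mine):

```python
def reward(square):
    """R(X_t = square): +1 and -1 at the two terminal squares, -0.04 everywhere else."""
    return {4: +1.0, 7: -1.0}.get(square, -0.04)
```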
51

Markov Decision Process
An MDP consists of:
• S: states (e.g., X or Battery)
• A: actions (e.g., Move)
• P(s'|s,a): transition model (e.g., P(X_t = s' | X_{t-1} = s, Move_{t-1} = a))
• R(s) or R(s,a): reward (e.g., Utility(X_t = s))
Problem: Finding an optimal policy, π*
• Not enough to pick just a single best move
• Must pick a policy, π(s): S → A
• What action should the robot choose in each state?
• The optimal policy maximizes Utility^π(s_0)
• What's the policy that gives us the highest utility from our start?
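One plausible way to carry these ingredients around in code, following the definitions on this slide (a sketch; all names are illustrative):

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = int     # e.g., the robot's square, 1-11
Action = str    # e.g., "U", "D", "L", "R"

@dataclass
class MDP:
    states: List[State]                                        # S
    actions: List[Action]                                      # A
    transition: Callable[[State, Action], Dict[State, float]]  # P(s'|s,a)
    reward: Callable[[State], float]                           # R(s)

# A policy pi(s): S -> A is just a mapping from states to actions.
Policy = Dict[State, Action]
```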
52

Policy
[Grid world figure: squares 1-11 (a policy assigns an action to each square)]
53

Utility of a Policy: Finite Horizon
Robot will operate for N moves
Utility^π_N(s_0) = E[R(S_0) + R(S_1) + … + R(S_N) | S_0 = s_0]
Utility^π_N(s_0) = R(s_0) + E[R(S_1) + … + R(S_N) | S_0 = s_0]
Utility^π_N(s_0) = R(s_0) + Σ_{s_1} P(s_1 | s_0, a) · (R(s_1) + E[R(S_2) + … + R(S_N) | S_1 = s_1])
Utility^π_N(s_0) = R(s_0) + Σ_{s_1} P(s_1 | s_0, π_N(s_0)) · (R(s_1) + E[R(S_2) + … + R(S_N) | S_1 = s_1])
Utility^π_N(s_0) = R(s_0) + Σ_{s_1} P(s_1 | s_0, π_N(s_0)) · Utility^π_{N-1}(s_1)
• Utility and π are not stationary
Utility^π_0(s_0) = R(s_0)
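The recursion on this slide, written directly as code. This is a sketch: it assumes transition(s, a) and reward(s) helpers like the ones sketched earlier, represents the nonstationary policy as a list of per-step maps (my own convention), and ignores terminal-state handling.

```python
def finite_horizon_utility(s0, policies, transition, reward):
    """Utility_pi_N(s0) = R(s0) + sum_s1 P(s1 | s0, pi_N(s0)) * Utility_pi_{N-1}(s1).

    policies[k] is the policy to use when k+1 moves remain, so N = len(policies).
    Exponential without memoization; fine for illustrating the recursion.
    """
    if not policies:                  # base case: Utility_pi_0(s0) = R(s0)
        return reward(s0)
    action = policies[-1][s0]         # pi_N, the policy for the current step
    return reward(s0) + sum(
        p * finite_horizon_utility(s1, policies[:-1], transition, reward)
        for s1, p in transition(s0, action).items()
    )
```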
54

Utility of a Policy: Infinite Horizon
Robot keeps on operating
Utility^π(s_0) = E[R(S_0) + R(S_1) + … | S_0 = s_0]
Utility^π(s_0) = R(s_0) + E[R(S_1) + … | S_0 = s_0]
Utility^π(s_0) = R(s_0) + Σ_{s_1} P(s_1 | s_0, a) · (R(s_1) + E[R(S_2) + … | S_1 = s_1])
Utility^π(s_0) = R(s_0) + Σ_{s_1} P(s_1 | s_0, π(s_0)) · (R(s_1) + E[R(S_2) + … | S_1 = s_1])
Utility^π(s_0) = R(s_0) + Σ_{s_1} P(s_1 | s_0, π(s_0)) · Utility^π(s_1)
55

Utility of a Policy: Infinite Horizon
Utility^π(s_0) = E[R(S_0) + R(S_1) + …]
Utility^π(s_0) = R(s_0) + Σ_{s_1} P(s_1 | s_0, π(s_0)) · Utility^π(s_1)
• Potentially unbounded!
• Use a discount factor, γ < 1
• With γ < 1 and bounded rewards, the discounted sum is at most R_max / (1 − γ), so it stays finite
Utility^π(s_0) = E[R(S_0) + γ·R(S_1) + γ²·R(S_2) + …]
Utility^π(s_0) = R(s_0) + γ·Σ_{s_1} P(s_1 | s_0, π(s_0)) · Utility^π(s_1)
• Reward now is better than reward later
• Utility and π are stationary
56

Computing Expected Utility
Utility^π(4) = +1,  Utility^π(7) = −1
[Grid world figure]
57

Computing Expected Utility
π(3) = Right,  γ = 1
[Grid world figure]
58

Computing Expected Utility
U^π(3) = R(3) + γ·Σ_{s_1} P(s_1 | 3, Right) · U^π(s_1)
[Grid world figure]
59

Computing Expected Utility
U^π(3) = R(3) + γ·P(4 | 3, Right)·U^π(4) + γ·P(6 | 3, Right)·U^π(6) + γ·P(3 | 3, Right)·U^π(3)
[Grid world figure]
60

Computing Expected Utility
U^π(3) = −0.04 + 1·0.8·1 + 1·0.1·U^π(6) + 1·0.1·U^π(3)
[Grid world figure]
61

Computing Expected Utility
π(6) = Up
[Grid world figure]
62

Computing Expected Utility
U^π(6) = R(6) + γ·0.8·U^π(3) + γ·0.1·U^π(7) + γ·0.1·U^π(6)
[Grid world figure]
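For the choice π(3) = Right and π(6) = Up, with γ = 1 and the terminal utilities fixed at U^π(4) = +1 and U^π(7) = −1, the two equations above close on themselves (from 3 and 6 the robot can only reach 3, 4, 6, or 7), so they can be solved as a 2×2 linear system. A small sketch:

```python
# U3 = -0.04 + 0.8*(+1) + 0.1*U6 + 0.1*U3      (pi(3) = Right)
# U6 = -0.04 + 0.8*U3 + 0.1*(-1) + 0.1*U6      (pi(6) = Up)
# Rearranged as A x = b with x = (U3, U6):
#    0.9*U3 - 0.1*U6 =  0.76
#   -0.8*U3 + 0.9*U6 = -0.14
a11, a12, b1 = 0.9, -0.1, 0.76
a21, a22, b2 = -0.8, 0.9, -0.14
det = a11 * a22 - a12 * a21          # Cramer's rule for the 2x2 system
U3 = (b1 * a22 - a12 * b2) / det
U6 = (a11 * b2 - a21 * b1) / det
print(round(U3, 3), round(U6, 3))    # about 0.918 and 0.660
```

These are the same numbers that appear for squares 3 and 6 on the Final Values slide later on, which is consistent with Right at 3 and Up at 6 turning out to be the optimal choices.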
63

Computing Expected Utility
π(6) = Left
[Grid world figure]
64

Computing Expected Utility
U^π(6) = R(6) + γ·0.1·U^π(3) + γ·0.8·U^π(6) + γ·0.1·U^π(10)
[Grid world figure]
65

Value Iteration
Bellman equation
• In each state, choose the action with the highest expected utility
• U(s) = R(s) + γ·max_a Σ_{s_1} P(s_1 | s, a) · U(s_1)
Repeating this update converges to the optimal utilities, and hence the optimal policy
• U_0(s) = R(s)
• U_i(s) = R(s) + γ·max_a Σ_{s_1} P(s_1 | s, a) · U_{i-1}(s_1)
• Bellman update
• As i → ∞, U_i → U
• See AIMA 17.2.3 for details
• π*(s) = arg max_a Σ_{s_1} P(s_1 | s, a) · U(s_1)
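A compact value-iteration sketch for this grid world. It assumes transition(s, a) and reward(s) helpers like the ones sketched earlier; the fixed number of sweeps and all names are my own (see AIMA 17.2.3 for the proper convergence treatment).

```python
STATES = list(range(1, 12))
ACTIONS = ["U", "D", "L", "R"]
TERMINALS = {4, 7}   # the journey ends here, so their values stay at R(s)

def value_iteration(transition, reward, gamma=1.0, sweeps=100):
    """Repeated Bellman updates: U_i(s) = R(s) + gamma * max_a sum_s1 P(s1|s,a) U_{i-1}(s1)."""
    U = {s: reward(s) for s in STATES}            # U_0(s) = R(s)
    for _ in range(sweeps):
        U_prev = dict(U)
        for s in STATES:
            if s in TERMINALS:
                continue
            U[s] = reward(s) + gamma * max(
                sum(p * U_prev[s1] for s1, p in transition(s, a).items())
                for a in ACTIONS
            )
    return U

def greedy_policy(U, transition):
    """pi*(s) = argmax_a sum_s1 P(s1|s,a) U(s1)."""
    return {
        s: max(ACTIONS, key=lambda a: sum(p * U[s1] for s1, p in transition(s, a).items()))
        for s in STATES if s not in TERMINALS
    }
```

Run with γ = 1 on this grid, it should reproduce the Final Values and Final Policy slides that follow (for example U[3] ≈ 0.918 and π*(3) = R).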
66

Value Iteration
[Grid figure, initial values U_0(s) = R(s): non-terminal squares −.04; square 4 = +1; square 7 = −1]
67

Value Iteration
[Grid figure: same initial values as the previous slide]
68

Value Iteration
[Grid figure: square 1 annotated with action values U: −.08, D: −.08, L: −.08, R: −.08; other squares at their initial values]
69

Value Iteration
[Grid figure: square 1 updated to −0.08 with π*(1) marked (all four actions tie at −0.08); other squares at their initial values]
70

Value Iteration
[Grid figure: square 3 annotated U: 0.024; other squares at their initial values]
71

Value Iteration
[Grid figure: square 3 annotated D: 0.024; other squares at their initial values]
72

Value Iteration
[Grid figure: square 3 annotated L: −0.08; other squares at their initial values]
73

Value Iteration
[Grid figure: square 3 annotated R: 0.752; other squares at their initial values]
74

Value Iteration
[Grid figure: square 3 annotated U: 0.024, D: 0.024, L: −0.08, R: 0.752; other squares at their initial values]
75

Value Iteration
[Grid figure: square 3 updated to 0.752, π*(3) = R; other squares at their initial values]
76

Value Iteration
[Grid figure: square 11 annotated U: −.848; other squares at their initial values]
77

Value Iteration
[Grid figure: square 11 annotated D: −.08; other squares at their initial values]
78

Value Iteration
[Grid figure: square 11 annotated L: −.176; other squares at their initial values]
79

Value Iteration
[Grid figure: square 11 annotated R: −.176; other squares at their initial values]
80

Value Iteration
[Grid figure: square 11 annotated U: −.848, D: −.08, L: −.176, R: −.176; other squares at their initial values]
81

Value Iteration
[Grid figure: square 11 updated to −0.08, π*(11) = D; other squares at their initial values]
82

Final Values
[Grid figure, converged utilities:
   1: .812    2: .868    3: .918    4: +1
   5: .762    (wall)     6: .660    7: −1
   8: .705    9: .655   10: .611   11: .388]
83

Final Policy
[Grid figure, converged utilities and policy:
   1: .812, π*(1)=R    2: .868, π*(2)=R    3: .918, π*(3)=R     4: +1
   5: .762, π*(5)=U    (wall)              6: .660, π*(6)=U     7: −1
   8: .705, π*(8)=U    9: .655, π*(9)=L   10: .611, π*(10)=L   11: .388, π*(11)=L]
84

Partial Observability
[DDN fragment: Move_{t-1} and X_{t-1} are parents of X_t; X_{t-1} emits observation GPS_{t-1}, X_t emits GPS_t and feeds the Utility node]
85

POMDPs
Partially Observable MDPs
• Observations instead of direct knowledge of state
• States → Belief states
• Update b = P(S | observations) each turn
• Policies are defined over belief states, π(b)
Value iteration still works
• Computing U(b) for belief states
• Similar to a conditional plan
Expressive but expensive
• Space of beliefs is the space of distributions over states
• Active area of research in making this computation efficient
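The per-turn belief update mentioned above is a Bayes filter. A small sketch; the observation model P(o | s) is an assumption here (the slides only say the robot gets GPS-style readings), and the names are mine:

```python
def update_belief(belief, action, observation, transition, obs_model):
    """b'(s1) is proportional to P(o | s1) * sum_s P(s1 | s, a) * b(s).

    belief:     dict state -> probability (the current b)
    transition: transition(s, a) -> dict of P(s1 | s, a)
    obs_model:  obs_model(o, s) -> P(o | s)   (assumed, not specified on the slides)
    """
    predicted = {}
    for s, b_s in belief.items():                 # prediction: push b through P(s1 | s, a)
        for s1, p in transition(s, action).items():
            predicted[s1] = predicted.get(s1, 0.0) + p * b_s
    unnormalized = {s1: obs_model(observation, s1) * p for s1, p in predicted.items()}
    z = sum(unnormalized.values())                # normalizer, P(o | b, a)
    return {s1: p / z for s1, p in unnormalized.items()}
```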
86

Multiple Agents
[Grid world figure (squares 1-11), now with more than one agent]
87

Game Theory
Other agents introduce additional uncertainty
• But we can assume that they are also optimizing
• Similar to assumption underlying minimax algorithm
Single-move games
• All agents reveal their moves simultaneously
• Rock-Paper-Scissors
• Paper beats Rock, Rock beats Scissors, Scissors beats Paper
• What is the optimal strategy (policy)?
88

Pure Strategies
Deterministic Policy
• e.g., Bart follows “Always play Rock”
• Each agent chooses a best response to others' strategies
• Lisa should follow “Always play Paper”
• Each agent continues updating
• Bart should now follow “Always play Scissors”
• etc.
• If we reach a point where no one updates: Nash equilibrium
• No pure-strategy equilibrium in Rock-Paper-Scissors
89

Mixed Strategy
Choose actions with some probability
• “Play Rock 1/3, Paper 1/3, Scissors 1/3”
• Best response strategy is exactly the same
• If either Bart or Lisa changes strategies, they will do no better
• So no incentive to change unilaterally: Nash equilibrium
• Local optimum
• Not necessarily observed in practice
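A quick check of this claim for Rock-Paper-Scissors (a sketch; the +1 / 0 / −1 payoffs are the usual convention and an assumption, since the slides don't give a payoff matrix): against an opponent who plays each move with probability 1/3, every pure strategy earns the same expected payoff, so no unilateral switch helps.

```python
MOVES = ["Rock", "Paper", "Scissors"]
BEATS = {"Paper": "Rock", "Rock": "Scissors", "Scissors": "Paper"}   # winner -> loser

def payoff(mine, theirs):
    """+1 for a win, -1 for a loss, 0 for a tie (assumed standard payoffs)."""
    if mine == theirs:
        return 0
    return 1 if BEATS[mine] == theirs else -1

uniform = {move: 1 / 3 for move in MOVES}   # "Play Rock 1/3, Paper 1/3, Scissors 1/3"

for mine in MOVES:
    expected = sum(p * payoff(mine, theirs) for theirs, p in uniform.items())
    print(mine, round(expected, 10))        # all three come out to 0: no profitable deviation
```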
90

Game Theory
Repeated Games
• Multiple rounds of RPS
Sequential Games
• Backgammon, Chess
• Introduce randomness as a “third player”
Imperfect Information
• Partial observability
• Use belief states (also called information sets)
91

Next Time
AIMA Chapter 18: Learning from Examples
• Computing an optimal policy relies on an accurate model
• What if I don't know how accurate my movement is?
• Learning provides a mechanism for acquiring a model
• It can also be used to acquire a policy directly
92