Optimal policies
If we have the correct utility values, the agent has a simple decision process. It just picks the action $a$ that maximises the expected utility of the next state:
$$\pi^*(s) = \arg\max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U^{\pi^*}(s')$$
Only have to consider the next step.
The big question is how to compute $U^{\pi^*}(s)$.
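As a concrete illustration, here is a minimal Python sketch of this decision rule. It assumes we already have a utility table U, an available-actions function A(s), and a transition model P(s, a) returning a dictionary mapping each s' to its probability; these names are illustrative, not part of the formal definitions above.

```python
def best_action(s, U, A, P):
    """Pick the action that maximises the expected utility of the next state."""
    # For each available action, compute sum_{s'} P(s'|s,a) U(s') and take the argmax.
    return max(A(s), key=lambda a: sum(p * U[s2] for s2, p in P(s, a).items()))
```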
Optimal policies
Note that this is specific to the value of the reward $R(s)$ for non-terminal states: different rewards will give different values and policies.
Bellman equation
How do we find the best policy (for a given set of rewards)?
Turns out that there is a neat way to do this, by first computing the utility of each state.
We compute this using the Bellman equation:
$$U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')$$
$\gamma$ is a discount factor.
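A single application of the right-hand side (one Bellman backup for one state) might look like the following sketch, using the same assumed interfaces as above.

```python
def bellman_backup(s, U, A, P, R, gamma):
    """U(s) = R(s) + gamma * max_a sum_{s'} P(s'|s,a) U(s')."""
    return R(s) + gamma * max(
        sum(p * U[s2] for s2, p in P(s, a).items())  # expected utility under action a
        for a in A(s))
```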
Not this Bellman
“Just the place for a Snark!” the Bellman cried, As he landed his crew with care;
Supporting each man on the top of the tide
By a finger entwined in his hair.
“Just the place for a Snark! I have said it twice: That alone should encourage the crew.
Just the place for a Snark! I have said it thrice: What I tell you three times is true.”
(Illustration from “The Hunting of the Snark”.)
Bellman equation
Apply:
$$U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')$$
and we get:
Bellman equation
$$U(1,1) = -0.04 + \gamma \max \left\{ \begin{array}{ll}
0.8\,U(1,2) + 0.1\,U(2,1) + 0.1\,U(1,1), & (\textit{Up}) \\
0.9\,U(1,1) + 0.1\,U(1,2), & (\textit{Left}) \\
0.9\,U(1,1) + 0.1\,U(2,1), & (\textit{Down}) \\
0.8\,U(2,1) + 0.1\,U(1,2) + 0.1\,U(1,1) & (\textit{Right})
\end{array} \right\}$$
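Purely to show how the max over the four actions is evaluated, here is the same expression in Python with made-up placeholder utilities (these are not the converged values, and $\gamma = 1$ is assumed).

```python
U = {(1, 1): 0.0, (1, 2): 0.4, (2, 1): 0.2}   # hypothetical current utility estimates
gamma = 1.0
candidates = {
    "Up":    0.8 * U[(1, 2)] + 0.1 * U[(2, 1)] + 0.1 * U[(1, 1)],   # 0.34
    "Left":  0.9 * U[(1, 1)] + 0.1 * U[(1, 2)],                     # 0.04
    "Down":  0.9 * U[(1, 1)] + 0.1 * U[(2, 1)],                     # 0.02
    "Right": 0.8 * U[(2, 1)] + 0.1 * U[(1, 2)] + 0.1 * U[(1, 1)],   # 0.20
}
U_new = -0.04 + gamma * max(candidates.values())   # Up wins: -0.04 + 0.34 = 0.3
```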
Value iteration
In an MDP with $n$ states, we will have $n$ Bellman equations.
Hard to solve these simultaneously because of the max operation:
• Makes them non-linear.
Value iteration
Luckily an iterative approach works. Start with arbitrary values for states. Then apply the Bellman update:
$$U_{i+1}(s) \leftarrow R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U_i(s')$$
simultaneously to all the states.
Continue until the values of states do not change.
The values are guaranteed to converge on the optimal values (but might take some time).
Value iteration
procedure VALUE ITERATION
    for s in S do
        U(s) ← 0
    end for
    repeat
        Ucopy ← U
        for s in S do
            U(s) ← R(s) + γ max_{a ∈ A(s)} Σ_{s'} P(s'|s,a) Ucopy(s')
        end for
    until U == Ucopy
end procedure
The states $S$, the reward $R(s)$, the set of actions $A(s)$, and the transition model $P(s' \mid s, a)$ are exactly as defined earlier.
γ is the discount factor.
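A minimal Python sketch of the procedure above follows. The argument names and the convergence tolerance eps are illustrative; the procedure stops when the values do not change at all, which with floating-point arithmetic is better approximated by a small threshold. Terminal states are assumed to have no actions, so their utility stays pinned to their reward.

```python
def value_iteration(S, A, P, R, gamma, eps=1e-6):
    """S: states; A(s): actions in s; P(s, a): dict s' -> prob; R(s): reward."""
    U = {s: 0.0 for s in S}                       # U(s) <- 0
    while True:
        U_copy = dict(U)                          # Ucopy <- U
        for s in S:
            if not A(s):                          # terminal: utility pinned to its reward
                U[s] = R(s)
            else:                                 # Bellman update using the old values
                U[s] = R(s) + gamma * max(
                    sum(p * U_copy[s2] for s2, p in P(s, a).items())
                    for a in A(s))
        if max(abs(U[s] - U_copy[s]) for s in S) < eps:
            return U
```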
Value iteration
How the values of states change as updates occur.
Value iteration
$U(4,3)$ is pinned to 1.
$U(3,3)$ quickly settles to a value close to 1.
$U(1,1)$ becomes negative, and then grows as positive utility from the goal feeds back to it.
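This behaviour can be reproduced with a sketch of the 4x3 grid world, recording the utilities after each sweep of Bellman updates. The layout is assumed to be the usual one for this example: a wall at (2,2), terminal states at (4,3) with reward +1 and (4,2) with reward -1, step reward -0.04, moves that succeed with probability 0.8 and slip sideways with probability 0.1 each, and γ = 1.

```python
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != (2, 2)]
TERMINALS = {(4, 3): 1.0, (4, 2): -1.0}
MOVES = {"Up": (0, 1), "Down": (0, -1), "Left": (-1, 0), "Right": (1, 0)}
SIDEWAYS = {"Up": ("Left", "Right"), "Down": ("Left", "Right"),
            "Left": ("Up", "Down"), "Right": ("Up", "Down")}

def R(s):
    return TERMINALS.get(s, -0.04)

def A(s):
    return [] if s in TERMINALS else list(MOVES)   # no actions in terminal states

def step(s, d):
    """Deterministic move; bumping into the wall or the edge leaves s unchanged."""
    s2 = (s[0] + MOVES[d][0], s[1] + MOVES[d][1])
    return s2 if s2 in STATES else s

def P(s, a):
    """Transition model P(s'|s,a) as a dict s' -> probability."""
    dist = {}
    for prob, d in [(0.8, a), (0.1, SIDEWAYS[a][0]), (0.1, SIDEWAYS[a][1])]:
        dist[step(s, d)] = dist.get(step(s, d), 0.0) + prob
    return dist

# Run the Bellman updates, recording the utilities after each sweep.
U, history = {s: 0.0 for s in STATES}, []
for _ in range(30):                                # enough sweeps to settle
    U_copy = dict(U)
    for s in STATES:
        U[s] = R(s) if not A(s) else R(s) + max(
            sum(p * U_copy[s2] for s2, p in P(s, a).items()) for a in A(s))
    history.append(dict(U))

for s in [(4, 3), (3, 3), (1, 1)]:
    print(s, [round(h[s], 2) for h in history[:8]])
```

Running this shows $U(4,3)$ fixed at 1 from the first sweep, $U(3,3)$ approaching 1 within a few sweeps, and $U(1,1)$ drifting negative before recovering, which mirrors the behaviour described above.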
Rewards
The example so far has a negative reward $R(s)$ for each state: encouragement for an agent not to stick around.
Can also think of $R(s)$ as being the cost of moving to the next state (where we obtain the utility):
$$R(s) = -c(s, a)$$
where $a$ is the action used. Bellman becomes:
$$U_{i+1}(s) \leftarrow \gamma \max_{a \in A(s)} \Big( \sum_{s'} P(s' \mid s, a)\, U_i(s') \Big) - c(s, a)$$
Note that the action can be dependent on the state.
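Under the assumption that the cost genuinely depends on the chosen action, a sketch of this variant folds the cost into the max; the function name and interfaces are illustrative, as before.

```python
def bellman_update_with_cost(s, U_prev, A, P, c, gamma):
    """U_{i+1}(s) = max_a ( gamma * sum_{s'} P(s'|s,a) U_i(s') - c(s, a) )."""
    return max(
        gamma * sum(p * U_prev[s2] for s2, p in P(s, a).items()) - c(s, a)
        for a in A(s))
```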