Agent-based Systems
Paolo Turrini
www.dcs.warwick.ac.uk/~pturrini · p.turrini@warwick.ac.uk
Markov Decision Processes
Policies are strategies
Plan for Today
We now go back to a "typical" AI framework: Markov Decision Processes
Plans and policies
Optimal policies
These are "one player" games with perfect information. Except they are not played on trees.
This (and more) in RN’s chapters 17-18.
Stuart Russell and Peter Norvig
Artificial Intelligence: A Modern Approach
2014 (3rd edition)
The world
Start at the starting square
Move to adjacent squares
Bumping into a wall results in no movement
The game ends when we reach either goal state +1 or −1
The agent
The agent chooses between {Up, Down, Left, Right} and goes:
to the intended direction with probability: e.g., 0.8
to the left of the intended direction with probability: e.g., 0.1
to the right of the intended direction with probability: e.g., 0.1
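As a rough illustration, this noisy action model can be coded as a small lookup. A minimal Python sketch (the 0.8/0.1/0.1 probabilities come from the slide; the helper names are made up for illustration):

```python
import random

# Left/right neighbours of each intended direction.
LEFT_OF  = {"Up": "Left", "Left": "Down", "Down": "Right", "Right": "Up"}
RIGHT_OF = {"Up": "Right", "Right": "Down", "Down": "Left", "Left": "Up"}

def action_distribution(intended, p_ok=0.8, p_slip=0.1):
    """Distribution over the direction actually taken, given the intended one."""
    return {intended: p_ok, LEFT_OF[intended]: p_slip, RIGHT_OF[intended]: p_slip}

def sample_direction(intended):
    """Sample the direction the agent actually moves in."""
    dist = action_distribution(intended)
    directions, probs = zip(*dist.items())
    return random.choices(directions, weights=probs)[0]

print(action_distribution("Up"))  # {'Up': 0.8, 'Left': 0.1, 'Right': 0.1}
```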
Let’s start walking
Walking is a repetition of throws:
The probability that I walk right the first time: 0.8
The probability that I walk right the second time: 0.8
The probability that I walk right both times… is a product! 0.8^2 = 0.64
The environment is Markovian: the probability of reaching a state only depends on the state the agent is in and the action they perform.
It is also fully observable, like an extensive game of perfect information.
Plans
{Up, Down, Left, Right} denote the intended directions.
A plan is a finite sequence of intended moves, from the start.
So [Up, Down, Up, Right] is going to be the plan that, from the starting square, selects the intended moves in the specified order.
Making plans
Goal: get to +1
Consider the plan [Up, Up, Right, Right, Right].
With deterministic agents, it gets us to +1 with probability 1. But what happens to our stochastic agent instead?
What’s the probability that [Up, Up, Right, Right, Right] gets us to +1?
Making plans
It's not 0.8^5! This is the probability that we get to +1 when the plan works!
The probability that the plan does not work but still reaches +1 is
0.1^4 × 0.8 = 0.00008
The correct answer is 0.8^5 + 0.1^4 × 0.8
Notice 0.8^5 + 0.1^4 × 0.8 < 1/3, not great.
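A quick sanity check of this arithmetic in plain Python (the decomposition of the "lucky" path follows the formula on the slide):

```python
# The plan works: all five moves go in the intended direction.
p_plan_works = 0.8 ** 5

# The plan "fails" but still reaches +1: four slips of 0.1 each that happen to
# take the agent the other way around, followed by one intended move of 0.8.
p_lucky_detour = 0.1 ** 4 * 0.8

print(p_plan_works)                   # 0.32768
print(p_lucky_detour)                 # ≈ 8e-05
print(p_plan_works + p_lucky_detour)  # ≈ 0.3278, indeed less than 1/3
```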
Policies
S+ is the set of possible sequences of states (just like the histories of an extensive game!) and A is the set of available actions.
Then a policy is a function:
π : S+ → A
In words a policy is a protocol that at each possible decision point prescribes an action.
This is a strategy.
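As a toy illustration, a policy really is just a function from a history of states to an action. A minimal sketch (the coordinate encoding of states and the decision rules are made up for illustration):

```python
from typing import Sequence, Tuple

State = Tuple[int, int]   # e.g. grid coordinates (column, row)
Action = str              # one of "Up", "Down", "Left", "Right"

def history_policy(history: Sequence[State]) -> Action:
    """A policy π : S+ → A may look at the whole sequence of states visited."""
    return "Up" if len(history) % 2 == 1 else "Right"   # toy rule: alternate

def state_based_policy(history: Sequence[State]) -> Action:
    """A state-based policy only looks at the last state of the history."""
    column, _row = history[-1]
    return "Right" if column < 4 else "Up"   # hypothetical rule for a 4-column grid

print(history_policy([(1, 1)]))              # 'Up'
print(state_based_policy([(1, 1), (2, 1)]))  # 'Right'
```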
A policy
This is a state-based policy: the action it recommends depends only on the current state
(so if two sequences end up with the same state, this policy is going to recommend the same action)
Now let’s complicate things a little bit...
Rewards
A reward function is a (utility) function of the form r : S → ℝ
All states, not just the terminal ones, get a reward!
Obviously, if you only care about terminal states, you may want to give zero to every other state. This is a more general model.
Rewards
For instance, each non-terminal state:
has 0 reward, i.e., only the terminal states matter;
has negative reward, e.g., each move costs 0.04 of battery;
has positive reward, e.g., I like wasting battery
Rewards are usually small, negative and uniform at non-terminal states. But the reward function allows for more generality.
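For concreteness, a reward function for the grid world above might look as follows. A minimal sketch (the coordinates of the ±1 squares follow the usual Russell–Norvig 4 × 3 layout and are an assumption here):

```python
State = tuple  # grid coordinates (column, row)

GOAL, TRAP = (4, 3), (4, 2)   # assumed terminal squares

def reward(s: State) -> float:
    """r : S → R: ±1 at the terminal states, a small penalty everywhere else."""
    if s == GOAL:
        return 1.0
    if s == TRAP:
        return -1.0
    return -0.04

print(reward((1, 1)), reward(GOAL), reward(TRAP))  # -0.04 1.0 -1.0
```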
Rewards
Consider now the following. The reward is:
+1 at state +1, -1 at state -1, -0.04 in all other states.
What's the expected utility of [Up, Up, Right, Right, Right]?
IT DEPENDS on how we are going to put rewards together!
Comparing states
Many ways of comparing states:
summing all the rewards
giving priority to the immediate rewards ...
Utility of state sequences
There is only one general and 'reasonable' way to combine rewards over time.
Discounted utility function:

u([s_0, s_1, s_2, ...]) = r(s_0) + γ r(s_1) + γ^2 r(s_2) + ···

where γ ∈ [0, 1] is the discounting factor
Notice: the additive utility function u([s_0, s_1, s_2, ...]) = r(s_0) + r(s_1) + r(s_2) + ··· is just a discounted utility function where γ = 1.
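Computing a discounted utility is a one-liner. A minimal sketch (the example rewards and the value of γ are arbitrary):

```python
def discounted_utility(rewards, gamma):
    """u([s_0, s_1, ...]) = r(s_0) + γ r(s_1) + γ^2 r(s_2) + ..."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

rewards = [-0.04, -0.04, -0.04, 1.0]       # three steps at -0.04, then the +1 goal
print(discounted_utility(rewards, 1.0))    # additive case: 0.88
print(discounted_utility(rewards, 0.9))    # ≈ 0.62
```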
Discounting factor
γ is a measure of the agent's patience: how much more they value a gain of five today than a gain of five tomorrow, the day after, etc.
Used everywhere in AI, game theory, cognitive psychology
A lot of experimental research on it
Variants: discounting the discounting! I care more about the difference between today and tomorrow than the difference between some distant moment and the day after that!
Discounting
γ = 1: today is just another day
γ = 0: today is all that matters
Basically γ is my attitude to risk towards the future!
Notice that stochastic actions introduce further gambling into the picture
A problem
Here is a 3 × 101 world. The agent starts at s; the top row reads 50, -1, -1, ..., -1 and the bottom row reads -50, 1, 1, ..., 1:

   50 | -1 | -1 | -1 | ... | -1 | -1
    s
  -50 |  1 |  1 |  1 | ... |  1 |  1

start at s.
two deterministic actions at s: either Up or Down; beyond s you can only go Right.
the numbers are the rewards you are going to get.
Compute the expected utility of each action as a function of γ
Solution
The utility of Up is

50γ − Σ_{t=2}^{101} γ^t = 50γ − γ^2 (1 − γ^100) / (1 − γ)

The utility of Down is

−50γ + Σ_{t=2}^{101} γ^t = −50γ + γ^2 (1 − γ^100) / (1 − γ)

The indifference point is

50γ − γ^2 (1 − γ^100) / (1 − γ) = −50γ + γ^2 (1 − γ^100) / (1 − γ)

Solving numerically, we have γ ≈ 0.9844.
If γ is strictly larger than this, then Down is better than Up;
If γ is strictly smaller than this, then Up is better than Down;
Else, it does not matter.
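A quick numerical check of the indifference point (plain Python; the bisection bracket and number of iterations are arbitrary choices):

```python
def utility_up(gamma):
    # 50γ − Σ_{t=2}^{101} γ^t
    return 50 * gamma - sum(gamma ** t for t in range(2, 102))

def utility_down(gamma):
    # −50γ + Σ_{t=2}^{101} γ^t
    return -50 * gamma + sum(gamma ** t for t in range(2, 102))

# Bisection on utility_up − utility_down over (0, 1):
# for small γ Up is better, for γ close to 1 Down is better.
lo, hi = 0.5, 0.9999
for _ in range(60):
    mid = (lo + hi) / 2
    if utility_up(mid) > utility_down(mid):
        lo = mid          # indifference point lies above mid
    else:
        hi = mid

print(round((lo + hi) / 2, 4))  # 0.9844
```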
Markov Decision Process
A Markov Decision Process is a sequential decision problem for a:
fully observable environment
with stochastic actions
with a Markovian transition model
and with discounted (possibly additive) rewards
MDPs formally
Definition
Let s be a state and a an action.
Model: P(s′ | s, a) = probability that doing a in s leads to s′
Reward function: r(s) (or r(s, a), r(s, a, s′)) =
  −0.04 (small penalty) for nonterminal states
  ±1 for terminal states
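Putting the ingredients together, one convenient (but by no means canonical) way to represent an MDP in Python is a small record; the field names below are my own:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet, Hashable, List

State = Hashable
Action = Hashable

@dataclass
class MDP:
    states: List[State]
    actions: Callable[[State], List[Action]]                    # A(s)
    transition: Callable[[State, Action], Dict[State, float]]   # P(s' | s, a) as a dict
    reward: Callable[[State], float]                             # r(s)
    gamma: float                                                 # discount factor γ
    terminals: FrozenSet[State] = field(default_factory=frozenset)
```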
Expected utility of a policy
The expected utility (or value) of policy π, from state s, is:

v_π(s) = E[ Σ_{t=0}^{∞} γ^t r(S_t) ]
E is the expectation over the state sequences induced by:
the policy π (the actions we are actually going to take)
the initial state s (where we start)
the transition model (where we can get to)
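To make the expectation concrete, v_π(s) can be estimated by simulating episodes. A minimal Monte Carlo sketch reusing the MDP record above (the episode cutoff and sample count are arbitrary):

```python
import random

def sample_next(mdp, s, a):
    """Sample s' from the distribution P(· | s, a), stored as a dict."""
    dist = mdp.transition(s, a)
    states, probs = zip(*dist.items())
    return random.choices(states, weights=probs)[0]

def estimate_value(mdp, policy, s0, episodes=10_000, horizon=200):
    """Monte Carlo estimate of v_π(s0) = E[Σ_t γ^t r(S_t)]."""
    total = 0.0
    for _ in range(episodes):
        s, discount, ret = s0, 1.0, 0.0
        for _ in range(horizon):          # truncate once γ^t is negligible
            ret += discount * mdp.reward(s)
            if s in mdp.terminals:
                break
            s = sample_next(mdp, s, policy(s))
            discount *= mdp.gamma
        total += ret
    return total / episodes
```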
Loops
In principle we can go on forever!
We are going to assume we need to keep going until we hit a terminal state
(infinite horizon assumption)
Discounting
With discounting, the utility of an infinite sequence is in fact finite. If γ < 1 and rewards are bounded above by r, we have:

u([s_0, s_1, ...]) = Σ_{t=0}^{∞} γ^t r(s_t) ≤ Σ_{t=0}^{∞} γ^t r = r / (1 − γ)
Expected utility of a policy
An optimal policy (from a state) is the policy with the highest expected utility, starting from that state.
π*_s = argmax_π v_π(s)
We want to find the optimal policy.
A remarkable fact
Theorem
With discounted rewards and infinite horizon,

π*_s = π*_{s′}, for each s, s′ ∈ S

This means that the optimal policy does not depend on the sequences of states, but on the states only.
In other words, the optimal policy is a state-based policy.
Idea: Take π*_a and π*_b. If they both reach a state c, because they are both optimal, there is no reason why they should disagree (modulo indifference!). So π*_c is identical for both (modulo indifference!). But then they behave the same at all states!
Value of states
The value of a state is the value of the optimal policy from that state.
But then (VERY IMPORTANT): Given the values of the states, choosing the best action is just maximisation of expected utility!
maximise the expected utility of the immediate successors
Value of states
Figure: The values with γ = 1 and r(s) = −0.04
Value of states
Figure: The optimal policy

π*(s) = argmax_{a ∈ A(s)} Σ_{s′} P(s′ | s, a) v(s′)

Maximise the expected utility of the subsequent state
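Given a value function v (say, a dict from states to numbers), extracting this greedy policy is one maximisation per state. A minimal sketch reusing the MDP record above:

```python
def expected_value(mdp, s, a, v):
    """Σ_{s'} P(s' | s, a) v(s')."""
    return sum(p * v[s2] for s2, p in mdp.transition(s, a).items())

def greedy_policy(mdp, v):
    """π*(s) = argmax_{a ∈ A(s)} Σ_{s'} P(s' | s, a) v(s')."""
    return {s: max(mdp.actions(s), key=lambda a: expected_value(mdp, s, a, v))
            for s in mdp.states if s not in mdp.terminals}
```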
The Bellman equation
The definition of values of states, i.e., the expected utility of the optimal policy from there, leads to a simple relationship among values of neighbouring states:
expected sum of rewards =
current reward + γ × expected sum of rewards after taking best action
The Bellman equation
Bellman equation (1957):

v(s) = r(s) + γ max_a Σ_{s′} P(s′ | s, a) v(s′)

We can use it to compute the optimal policy!
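The right-hand side of the Bellman equation is a single "backup" of one state. A minimal sketch, again on top of the MDP record above:

```python
def bellman_backup(mdp, s, v):
    """r(s) + γ max_a Σ_{s'} P(s' | s, a) v(s')  (just r(s) at a terminal state)."""
    if s in mdp.terminals:
        return mdp.reward(s)
    return mdp.reward(s) + mdp.gamma * max(
        sum(p * v[s2] for s2, p in mdp.transition(s, a).items())
        for a in mdp.actions(s)
    )
```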
Value Iteration Algorithm
1 Start with arbitrary values
2 Repeat for every s simultaneously until "no change":

v(s) ← r(s) + γ max_a Σ_{s′} P(s′ | s, a) v(s′)
Value Iteration Algorithm
Input: S, A, γ, r, and P(s′ | s, a) for each s, s′ ∈ S
Input: ε > 0, the error you want to allow
Output: v, the value of each state
Value Iteration Algorithm
1 Initialise δ_s := ε(1 − γ)/γ for all s (typically uniform across states), and v := 0, storing the information to be updated
2 While δ_{s′} ≥ ε(1 − γ)/γ, for some s′, do
    v′(s) := r(s) + γ max_a Σ_{s′} P(s′ | s, a) v(s′)
    δ_s := |v′(s) − v(s)|
    v(s) := v′(s)
3 Return v
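A runnable sketch of the whole loop, reusing the MDP record and the bellman_backup helper above (the stopping threshold follows the ε(1 − γ)/γ rule on the slide; the fallback for γ = 1 is an assumption of mine, not part of the slide):

```python
def value_iteration(mdp, epsilon=1e-3):
    """Returns a dict mapping each state to (an ε-close approximation of) its value."""
    v = {s: 0.0 for s in mdp.states}
    # Stop once no state changes by more than ε(1 − γ)/γ; for γ = 1 fall back to ε.
    threshold = epsilon * (1 - mdp.gamma) / mdp.gamma if mdp.gamma < 1 else epsilon
    while True:
        v_new = {s: bellman_backup(mdp, s, v) for s in mdp.states}  # simultaneous update
        delta = max(abs(v_new[s] - v[s]) for s in mdp.states)
        v = v_new
        if delta < threshold:
            return v
```

Feeding the returned values into the greedy_policy sketch above then yields the corresponding policy.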
A fundamental fact
Theorem
VIA: returns the optimal policy (for the input values)
terminates
VIA in action
Initialise the values, for γ = 1, r = −0.04.
Simultaneously apply the Bellman update to all states:

v(s) = r(s) + γ max_a Σ_{s′} P(s′ | s, a) v(s′)

[Figures: the value iteration run, the resulting state values, and the optimal policy.]
The same procedure is then run with γ = 1 and r = −0.4, with r = −4, and with r = 0.
Stocktaking
Stochastic actions can lead to unpredictable outcomes
But we can still find optimal "strategies", exploiting what happens in case we deviate from the original plan
If we know what game we are playing and we play long enough…
What next? Learning in MDPs