
Agent-based Systems
Paolo Turrini
www.dcs.warwick.ac.uk/~pturrini | p.turrini@warwick.ac.uk

Markov Decision Processes
Policies are strategies

Plan for Today
We now go back to a "typical" AI framework: Markov Decision Processes
Plans and policies
Optimal policies
These are "one player" games with perfect information. Except they are not played on trees.
This (and more) in RN’s chapters 17-18.
Stuart Russell and Peter Norvig
Artificial Intelligence: A Modern Approach
2014 (3rd edition)

The world
Start at the starting square
Move to adjacent squares
Collision results in no movement
The game ends when we reach either goal state +1 or −1

The agent
The agent chooses between {Up, Down, Left, Right} and goes:
in the intended direction, with probability e.g. 0.8
to the left of the intended direction, with probability e.g. 0.1
to the right of the intended direction, with probability e.g. 0.1
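
To make the action model concrete, here is a small Python sketch that samples the direction the agent actually moves in. The 0.8/0.1/0.1 values are the example probabilities above; the helper names are mine, not from the slides.

```python
import random

# Perpendicular directions for each intended move, e.g. slipping from Up
# goes Left or Right; slipping from Left goes Up or Down.
PERPENDICULAR = {'Up': ('Left', 'Right'), 'Down': ('Left', 'Right'),
                 'Left': ('Up', 'Down'), 'Right': ('Up', 'Down')}

def actual_direction(intended, p_intended=0.8, p_slip=0.1):
    """Sample the direction the agent actually moves in."""
    side_a, side_b = PERPENDICULAR[intended]
    return random.choices([intended, side_a, side_b],
                          weights=[p_intended, p_slip, p_slip])[0]

# Trying to go Up many times: roughly 80% of the samples come out as 'Up'.
samples = [actual_direction('Up') for _ in range(10_000)]
print(samples.count('Up') / len(samples))
```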

Let’s start walking
Walking is a repetition of throws:
The probability that I walk right the first time: 0.8
The probability that I walk right the second time: 0.8
The probability that I walk right both times… is a product! 0.8^2
The environment is Markovian: the probability of reaching a state only depends on the state the agent is in and the action they perform.
It is also fully observable, like an extensive game of perfect information.
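
A quick Monte Carlo check of the product rule, under the independence of successive moves just stated; the trial count is an arbitrary choice of mine.

```python
import random

# Two independent moves, each going the intended way with probability 0.8,
# succeed together with probability 0.8**2 = 0.64.
trials = 100_000
both = sum((random.random() < 0.8) and (random.random() < 0.8)
           for _ in range(trials))
print(both / trials, 0.8 ** 2)
```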

Plans
{Up, Down, Left, Right} denote the intended directions.
A plan is a finite sequence of intended moves, from the start.
So [Up, Down, Up, Right] is going to be the plan that, from the starting square, selects the intended moves in the specified order.

Making plans
Goal: get to +1
Consider the plan [Up, Up, Right, Right, Right].
With deterministic agents, it gets us to +1 with probability 1. But what happens to our stochastic agent instead?
What’s the probability that [Up, Up, Right, Right, Right] gets us to +1?

Making plans
It's not 0.8^5! That is only the probability that we get to +1 when the plan works as intended.
The probability that the plan does not work but still reaches +1 is
0.1^4 × 0.8 = 0.00008
The correct answer is 0.8^5 + 0.1^4 × 0.8 = 0.32776.
Notice that 0.8^5 + 0.1^4 × 0.8 < 1/3: not great.
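
To check this number, here is a Python sketch that propagates the state distribution exactly under the plan. It assumes the standard 4 × 3 grid world of Russell and Norvig (obstacle at (2,2), +1 at (4,3), −1 at (4,2), start at (1,1)); the slides do not spell the layout out, so treat the coordinates as an assumption.

```python
# Assumed layout: the standard 4x3 grid world from Russell & Norvig, ch. 17.
# Columns x = 1..4, rows y = 1..3, obstacle at (2,2), +1 at (4,3), -1 at (4,2).
GOOD, BAD, WALL, START = (4, 3), (4, 2), (2, 2), (1, 1)
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != WALL]
MOVES = {'Up': (0, 1), 'Down': (0, -1), 'Left': (-1, 0), 'Right': (1, 0)}
SLIPS = {'Up': ('Left', 'Right'), 'Down': ('Left', 'Right'),
         'Left': ('Up', 'Down'), 'Right': ('Up', 'Down')}

def step(s, d):
    """Deterministic move; collisions with the border or obstacle mean no movement."""
    x, y = s[0] + MOVES[d][0], s[1] + MOVES[d][1]
    return (x, y) if (x, y) in STATES else s

def plan_outcome(plan, start=START):
    """Exact distribution over states after executing the plan from start."""
    dist = {start: 1.0}
    for intended in plan:
        new = {}
        for s, p in dist.items():
            if s in (GOOD, BAD):            # the game has already ended here
                new[s] = new.get(s, 0.0) + p
                continue
            for prob, d in [(0.8, intended),
                            (0.1, SLIPS[intended][0]), (0.1, SLIPS[intended][1])]:
                s2 = step(s, d)
                new[s2] = new.get(s2, 0.0) + p * prob
        dist = new
    return dist

dist = plan_outcome(['Up', 'Up', 'Right', 'Right', 'Right'])
print(dist.get(GOOD, 0.0))          # probability of ending in +1
print(0.8 ** 5 + 0.1 ** 4 * 0.8)    # the slides' expression, for comparison
```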
Policies
S+ is the set of possible sequences of states (just like the histories of an extensive game!) and A is the set of available actions. Then a policy is a function:
π : S+ → A
In words, a policy is a protocol that prescribes an action at each possible decision point. This is a strategy.

A policy
This is a state-based policy: it recommends the same action at each state (so if two sequences end up in the same state, this policy recommends the same action).
Now let's complicate things a little bit...

Rewards
A reward function is a (utility) function of the form
r : S → ℝ
All states, not just the terminal ones, get a reward! Obviously, if you only care about terminal states, you may want to give zero reward to every other state. This is a more general model.

Rewards
For instance, each non-terminal state:
has 0 reward, i.e., only the terminal states matter;
has negative reward, e.g., each move consumes -0.04 of battery;
has positive reward, e.g., I like wasting battery.
Rewards are usually small, negative and uniform at non-terminal states. But the reward function allows for more generality.

Rewards
Consider now the following. The reward is: +1 at state +1, -1 at state -1, -0.04 in all other states.
What's the expected utility of [Up, Up, Right, Right, Right]?
IT DEPENDS on how we are going to put the rewards together!

Comparing states
Many ways of comparing states:
summing all the rewards
giving priority to the immediate rewards
...

Utility of state sequences
There is only one general and 'reasonable' way to combine rewards over time.
Discounted utility function:
u([s_0, s_1, s_2, ...]) = r(s_0) + γ r(s_1) + γ^2 r(s_2) + ···
where γ ∈ [0, 1] is the discounting factor.
Notice: the additive utility function
u([s_0, s_1, s_2, ...]) = r(s_0) + r(s_1) + r(s_2) + ···
is just a discounted utility function with γ = 1.

Discounting factor
γ is a measure of the agent's patience: how much more they value a gain of five today than a gain of five tomorrow, the day after, and so on.
Used everywhere in AI, game theory, cognitive psychology.
A lot of experimental research on it.
Variants: discounting the discounting! I care more about the difference between today and tomorrow than about the difference between some distant moment and the day after that!

Discounting
γ = 1: today is just another day.
γ = 0: today is all that matters.
Basically γ is my attitude to risk towards the future!
Notice that stochastic actions introduce further gambling into the picture.
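
A minimal Python sketch of the discounted combination of rewards; the function name and the example reward sequence are illustrative choices of mine, not from the slides.

```python
def discounted_utility(rewards, gamma):
    """u([s0, s1, ...]) = r(s0) + gamma*r(s1) + gamma^2*r(s2) + ..."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Example: five steps at -0.04 each, ending in the +1 state.
rewards = [-0.04, -0.04, -0.04, -0.04, -0.04, 1.0]
for gamma in (1.0, 0.9, 0.5, 0.0):
    # gamma = 1 is the additive case; gamma = 0 keeps only r(s0).
    print(gamma, round(discounted_utility(rewards, gamma), 4))
```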
A problem
Here is a 3 × 101 world:

 50  -1  -1  -1  ...  -1  -1  -1  -1
  s
-50   1   1   1  ...   1   1   1   1

Start at s.
There are two deterministic actions at s: either Up or Down.
Beyond s you can only go Right.
The numbers are the rewards you are going to get.
Compute the expected utility of each action as a function of γ.

Solution
The utility of Up is
50γ − Σ_{t=2}^{101} γ^t = 50γ − γ^2 (1 − γ^100)/(1 − γ)
The utility of Down is
−50γ + Σ_{t=2}^{101} γ^t = −50γ + γ^2 (1 − γ^100)/(1 − γ)
The indifference point is
50γ − γ^2 (1 − γ^100)/(1 − γ) = −50γ + γ^2 (1 − γ^100)/(1 − γ)
Solving numerically, we have γ ≈ 0.9844.
If γ is strictly larger than this, then Down is better than Up;
if γ is strictly smaller than this, then Up is better than Down;
else, it does not matter.

Markov Decision Process
A Markov Decision Process is a sequential decision problem for a:
fully observable environment
with stochastic actions
with a Markovian transition model
and with discounted (possibly additive) rewards.

MDPs formally
Definition. Let s be a state and a an action.
Model: P(s′ | s, a) = the probability that doing a in s leads to s′.
Reward function: r(s) (or r(s, a), r(s, a, s′)), e.g.,
r(s) = −0.04 (small penalty) for nonterminal states,
r(s) = ±1 for terminal states.

Expected utility of a policy
The expected utility (or value) of policy π, from state s, is:
v_π(s) = E[ Σ_{t=0}^{∞} γ^t r(S_t) ]
where the expectation is over the state sequences induced by:
the policy π (the actions we are actually going to take)
the initial state s (where we start)
the transition model (where we can get to)
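
To make v_π concrete, here is a small sketch of iterative policy evaluation, repeatedly applying v(s) ← r(s) + γ Σ_{s′} P(s′ | s, π(s)) v(s′) for a fixed state-based policy; the two-state MDP at the bottom is a made-up example of mine, not the grid world from the slides.

```python
def evaluate_policy(states, policy, P, r, gamma, sweeps=1000):
    """Iteratively approximate v_pi(s) = E[sum_t gamma^t r(S_t)] by applying
    v(s) <- r(s) + gamma * sum_{s'} P(s'|s, pi(s)) * v(s') many times."""
    v = {s: 0.0 for s in states}
    for _ in range(sweeps):
        v = {s: r[s] + gamma * sum(p * v[s2]
                                   for s2, p in P[(s, policy[s])].items())
             for s in states}
    return v

# Hypothetical two-state MDP and a fixed state-based policy.
states = ['sunny', 'rainy']
P = {('sunny', 'walk'): {'sunny': 0.8, 'rainy': 0.2},
     ('sunny', 'stay'): {'sunny': 1.0},
     ('rainy', 'walk'): {'sunny': 0.5, 'rainy': 0.5},
     ('rainy', 'stay'): {'rainy': 1.0}}
r = {'sunny': 1.0, 'rainy': 0.0}
policy = {'sunny': 'walk', 'rainy': 'walk'}

print(evaluate_policy(states, policy, P, r, gamma=0.9))
```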
Loops
In principle we can go on forever! We are going to assume we keep going until we hit a terminal state (the infinite horizon assumption).

Discounting
With discounting, the utility of an infinite sequence is in fact finite. If γ < 1 and rewards are bounded above by r, we have:
u([s_0, s_1, ...]) = Σ_{t=0}^{∞} γ^t r(s_t) ≤ Σ_{t=0}^{∞} γ^t r = r/(1 − γ)

Expected utility of a policy
An optimal policy (from a state) is the policy with the highest expected utility, starting from that state:
π*_s = argmax_π v_π(s)
We want to find the optimal policy.

A remarkable fact
Theorem. With discounted rewards and infinite horizon,
π*_s = π*_{s′}, for all s, s′ ∈ S.
This means that the optimal policy does not depend on the sequences of states, but only on the states. In other words, the optimal policy is a state-based policy.
Idea: take π*_a and π*_b. If they both reach a state c, because they are both optimal there is no reason why they should disagree (modulo indifference!). So π*_c is identical for both (modulo indifference!). But then they behave the same at all states!

Value of states
The value of a state is the value of the optimal policy from that state. But then (VERY IMPORTANT): given the values of the states, choosing the best action is just maximisation of expected utility: maximise the expected utility of the immediate successors.

Value of states
Figure: The values with γ = 1 and r(s) = −0.04

Value of states
Figure: The optimal policy
π*(s) = argmax_{a ∈ A(s)} Σ_{s′} P(s′ | s, a) v(s′)
Maximise the expected utility of the subsequent state.

The Bellman equation
The definition of the value of a state, i.e., the expected utility of the optimal policy from there, leads to a simple relationship among the values of neighbouring states:
expected sum of rewards = current reward + γ × expected sum of rewards after taking the best action
Bellman equation (1957):
v(s) = r(s) + γ max_a Σ_{s′} P(s′ | s, a) v(s′)
We can use it to compute the optimal policy!

Value Iteration Algorithm
1 Start with arbitrary values.
2 Repeat for every s simultaneously until "no change":
v(s) ← r(s) + γ max_a Σ_{s′} P(s′ | s, a) v(s′)

Value Iteration Algorithm
Input S, A, γ, r, and P(s′ | s, a) for each s, s′ ∈ S.
Input ε > 0, the error you want to allow
Output v, the value of each state

Value Iteration Algorithm
1 Initialise δ_s := ε(1 − γ)/γ for all s (typically this is uniform across states), and v := 0, storing the information to be updated.
2 While δ_{s′} ≥ ε(1 − γ)/γ for some s′, do:
v′(s) := r(s) + γ max_a Σ_{s′} P(s′ | s, a) v(s′)
δ_s := |v′(s) − v(s)|
v(s) := v′(s)
3 Return v.
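
A compact Python rendering of the pseudocode above, with the same ε(1 − γ)/γ stopping rule; the dictionary encoding of the MDP and the tiny two-state example are assumptions of mine, not the lecture's grid world.

```python
def value_iteration(states, actions, P, r, gamma, eps=1e-4):
    """Value iteration with the eps*(1-gamma)/gamma stopping rule above."""
    v = {s: 0.0 for s in states}
    threshold = eps * (1 - gamma) / gamma
    while True:
        delta, v_new = 0.0, {}
        for s in states:
            # Bellman update: v'(s) = r(s) + gamma * max_a sum_s' P(s'|s,a) v(s')
            best = max(sum(p * v[s2] for s2, p in P[(s, a)].items())
                       for a in actions[s])
            v_new[s] = r[s] + gamma * best
            delta = max(delta, abs(v_new[s] - v[s]))
        v = v_new
        if delta < threshold:
            return v

def greedy_policy(states, actions, P, v):
    """Read off pi*(s) = argmax_a sum_s' P(s'|s,a) v(s') from the values."""
    return {s: max(actions[s],
                   key=lambda a: sum(p * v[s2] for s2, p in P[(s, a)].items()))
            for s in states}

# A tiny illustrative two-state MDP (not the lecture's grid world).
states = ['a', 'b']
actions = {'a': ['stay', 'go'], 'b': ['stay', 'go']}
P = {('a', 'stay'): {'a': 1.0}, ('a', 'go'): {'b': 0.9, 'a': 0.1},
     ('b', 'stay'): {'b': 1.0}, ('b', 'go'): {'a': 0.9, 'b': 0.1}}
r = {'a': 0.0, 'b': 1.0}

v = value_iteration(states, actions, P, r, gamma=0.9)
print(v)
print(greedy_policy(states, actions, P, v))
```

The greedy_policy helper reads the policy off the returned values, following the argmax rule from the "Value of states" slides.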

A fundamental fact
Theorem
VIA: returns the optimal policy (for the input values)
terminates


VIA in action
Initialise the values, for γ = 1, r = −0.04

VIA in action
Simultaneously apply the Bellman update to all states:
v(s) ← r(s) + γ max_a Σ_{s′} P(s′ | s, a) v(s′)
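
As an aside, one synchronous sweep of this update can be written in a few lines of numpy, assuming the transition model is stored as an array P[a, s, s′] (an encoding of my choosing, not from the slides):

```python
import numpy as np

def bellman_sweep(P, r, v, gamma):
    """One synchronous Bellman update over all states.
    P has shape (A, S, S) with P[a, s, s2] = P(s2 | s, a);
    r and v have shape (S,)."""
    q = P @ v                        # shape (A, S): expected next value per action
    return r + gamma * q.max(axis=0)

# Tiny random model, just to show the shapes involved.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(2, 3))   # 2 actions, 3 states
r = np.array([0.0, -0.04, 1.0])
v = np.zeros(3)
print(bellman_sweep(P, r, v, gamma=0.9))
```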

Value Iteration Algorithm

The state values

The optimal policy


VIA in action
Initialise the values, for γ = 1, r = −0.4


VIA in action
Initialise the values, for γ = 1, r = −4


VIA in action
Initialise the values, for γ = 1, r = 0
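
To see how the optimal policy shifts with the step reward, here is a sketch that reruns value iteration on the assumed 4 × 3 Russell-Norvig grid for each of r = −0.04, −0.4, −4 and 0 and prints the greedy policy. The grid layout, and the use of a fixed number of sweeps in place of the ε(1 − γ)/γ stopping rule (which degenerates at γ = 1), are my assumptions.

```python
# Assumed layout: 4x3 Russell & Norvig grid, obstacle at (2,2),
# terminals +1 at (4,3) and -1 at (4,2).
GOOD, BAD, WALL = (4, 3), (4, 2), (2, 2)
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != WALL]
MOVES = {'Up': (0, 1), 'Down': (0, -1), 'Left': (-1, 0), 'Right': (1, 0)}
SLIPS = {'Up': ('Left', 'Right'), 'Down': ('Left', 'Right'),
         'Left': ('Up', 'Down'), 'Right': ('Up', 'Down')}
ARROWS = {'Up': '^', 'Down': 'v', 'Left': '<', 'Right': '>'}

def step(s, d):
    x, y = s[0] + MOVES[d][0], s[1] + MOVES[d][1]
    return (x, y) if (x, y) in STATES else s      # collisions: no movement

def transitions(s, a):
    p = {}
    for prob, d in [(0.8, a), (0.1, SLIPS[a][0]), (0.1, SLIPS[a][1])]:
        s2 = step(s, d)
        p[s2] = p.get(s2, 0.0) + prob
    return p

def values(r_step, gamma=1.0, sweeps=1000):
    reward = {s: 1.0 if s == GOOD else -1.0 if s == BAD else r_step
              for s in STATES}
    v = {s: 0.0 for s in STATES}
    # Fixed number of synchronous sweeps instead of the epsilon stopping
    # rule, because with gamma = 1 the threshold eps*(1-gamma)/gamma is 0.
    for _ in range(sweeps):
        v = {s: reward[s] if s in (GOOD, BAD)
                else reward[s] + gamma * max(
                    sum(p * v[s2] for s2, p in transitions(s, a).items())
                    for a in MOVES)
             for s in STATES}
    return v

for r_step in (-0.04, -0.4, -4.0, 0.0):
    v = values(r_step)
    policy = {s: max(MOVES, key=lambda a: sum(
                  p * v[s2] for s2, p in transitions(s, a).items()))
              for s in STATES if s not in (GOOD, BAD)}
    print(f"r = {r_step}")
    for y in (3, 2, 1):                           # print the greedy policy
        print(' '.join(ARROWS[policy[(x, y)]] if (x, y) in policy
                       else '+' if (x, y) == GOOD
                       else '-' if (x, y) == BAD
                       else '#'                   # the obstacle
                       for x in range(1, 5)))
```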



Stocktaking
Stochastic actions can lead to unpredictable outcomes
But we can still find optimal "strategies", exploiting what happens in case we deviate from the original plan.
If we know what game we are playing and we play long enough…
What next? Learning in MDPs.