Other notions of “rational”
There are criteria for decision-making other than maximising expected utility. One approach is to look at the option which has the least-bad worst outcome.
This maximin criterion can be formalised in the same framework as MEU (Maximum Expected Utility), making the rational (in this sense) action:

$$a^{*} = \arg\max_{a \in A} \left\{ \min_{s' \in s_a} u(s') \right\}$$

where $s_a$ is the set of outcome states that action $a$ can lead to.
Its effect is to ignore the probabilities of outcomes and concentrate on optimising the worst-case outcome.
Example (utilities are in red):
[Figure: a one-step decision tree in which actions a1 and a2 each lead to a set of possible outcome states s1 to s6, with a utility attached to each state.]
Here we would pick a1.
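As a concrete sketch (in Python), here is the maximin rule applied to a small hand-made example; the action-to-outcome mapping and the utility values are hypothetical, chosen only so that the result matches the figure above.

```python
# A minimal sketch of the maximin rule. The outcome sets and utilities below
# are hypothetical, picked so the answer agrees with the slide's example.
outcomes = {"a1": ["s1", "s2", "s3"], "a2": ["s4", "s5", "s6"]}
utility = {"s1": 5, "s2": 4, "s3": 4, "s4": 3, "s5": 5, "s6": 7}

def maximin(outcomes, utility):
    # Judge each action by its worst possible outcome, then take the best.
    return max(outcomes, key=lambda a: min(utility[s] for s in outcomes[a]))

print(maximin(outcomes, utility))  # -> a1 (worst case 4 beats a2's worst case 3)
```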

Other notions of “rational”
The opposite attitude, that of the optimistic risk-seeker, is captured by the maximax criterion:

$$a^{*} = \arg\max_{a \in A} \left\{ \max_{s' \in s_a} u(s') \right\}$$
This will ignore possible bad outcomes and just focus on the best outcome of each action.
Example:
[Figure: the same decision tree as before, with actions a1 and a2 leading to outcome states s1 to s6 and their utilities in red.]
Here we would pick a2.
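The maximax rule differs only in the inner aggregation; with the same hypothetical setup as the maximin sketch, swapping min for max flips the choice to a2.

```python
# Same hypothetical outcome sets and utilities as in the maximin sketch.
outcomes = {"a1": ["s1", "s2", "s3"], "a2": ["s4", "s5", "s6"]}
utility = {"s1": 5, "s2": 4, "s3": 4, "s4": 3, "s5": 5, "s6": 7}

def maximax(outcomes, utility):
    # Judge each action by its best possible outcome, then take the best.
    return max(outcomes, key=lambda a: max(utility[s] for s in outcomes[a]))

print(maximax(outcomes, utility))  # -> a2 (best case 7 beats a1's best case 5)
```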

Sequential decision problems
These approaches give us a battery of techniques to apply to individual decisions by agents.
However, they aren’t really sufficient.
Agents aren’t usually in the business of taking single decisions.
• Life is a series of decisions.
The best overall result is not necessarily obtained by a greedy approach to a series of decisions.
The current best option isn’t necessarily the best thing in the long run.

Sequential decision problems
Otherwise I’d only ever eat cherry pie
(pillsbury.com)
(Damn fine pie.)

An example
The agent has to pick a sequence of actions.
A(s) = {Up, Down, Left, Right} for all states s.

An example
The world is fully observable. End states have values +1 or −1.

An example
If the world were deterministic, the choice of actions would be easy here: Up, Up, Right, Right, Right.
But actions are stochastic.

An example
80% of the time the agent moves as intended.
20% of the time the agent moves perpendicular to the intended direction. Half the time to the left, half the time to the right.
The agent doesn’t move if it hits a wall.
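A minimal sketch of this motion model in Python, assuming the four compass actions named earlier; wall collisions (which leave the agent where it is) would be handled by the environment rather than by this function.

```python
import random

# Perpendicular slips go to the left or right of the intended direction,
# each with probability 0.1; the intended move happens with probability 0.8.
LEFT_OF = {"Up": "Left", "Left": "Down", "Down": "Right", "Right": "Up"}
RIGHT_OF = {"Up": "Right", "Right": "Down", "Down": "Left", "Left": "Up"}

def actual_direction(intended):
    r = random.random()
    if r < 0.8:
        return intended            # 80%: move as intended
    elif r < 0.9:
        return LEFT_OF[intended]   # 10%: slip to the left
    else:
        return RIGHT_OF[intended]  # 10%: slip to the right
```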

An example
So Up, Up, Right, Right, Right succeeds with probability 0.8^5 = 0.32768.

An example
Also a small chance of going around the obstacle the other way.
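Under the motion model above (and assuming the standard layout of this example), that other route requires four perpendicular slips followed by one intended move, so it succeeds with probability 0.1^4 × 0.8 = 0.00008, for a total success probability of 0.32768 + 0.00008 = 0.32776.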

An example
We can write a transition model to describe these actions. Since the actions are stochastic, the model looks like:
P(s′ | s, a)
where a is the action that takes the agent from s to s′. Transitions are assumed to be first-order Markovian. That is, the probability of reaching s′ depends only on the current state s and the action a, not on the history of earlier states.
So, we could write a large set of probability tables that would describe all the possible actions executed in all the possible states.
This would completely specify the actions.
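One possible encoding of such tables is a nested dictionary keyed by state, action and next state; the grid coordinates and probabilities below are hypothetical illustrations, not the actual tables for this world.

```python
# Transition model P(s' | s, a) stored as probability tables.
P = {
    (1, 1): {
        "Up": {(1, 2): 0.8, (2, 1): 0.1, (1, 1): 0.1},     # slip right, or bump the wall and stay put
        "Right": {(2, 1): 0.8, (1, 2): 0.1, (1, 1): 0.1},
        # ... one entry per action available in state (1, 1)
    },
    # ... one entry per non-terminal state
}

def transition_prob(s, a, s_next):
    # P(s' | s, a); next states not listed have probability 0.
    return P[s][a].get(s_next, 0.0)
```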

An example
The full description of the problem also has to include the utility function.
This is defined over sequences of states — runs in the terminology of the first lecture.
We will assume that in each state s the agent receives a reward R(s). This may be positive or negative.

An example
The reward for non-terminal states is −0.04.
We will assume that the utility of a run is the sum of the rewards of the states it passes through, so the −0.04 is an incentive to take fewer steps to get to the terminal state.
(You can also think of it as the cost of an action).
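A sketch of this additive-utility assumption, with a hypothetical stand-in reward function (−0.04 per non-terminal state, ±1 at the ends):

```python
# The utility of a run is the sum of the rewards of the states visited.
def reward(s):
    terminal_rewards = {"goal": 1.0, "trap": -1.0}   # hypothetical terminal states
    return terminal_rewards.get(s, -0.04)            # -0.04 everywhere else

def run_utility(run):
    return sum(reward(s) for s in run)

print(run_utility(["s0", "s1", "s2", "s3", "s4", "goal"]))  # 5 * -0.04 + 1.0 = 0.8
```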

How do we tackle this?

Markov decision process
The overall problem the agent faces here is a Markov decision process (MDP). Mathematically we have:
• a set of states s ∈ S with an initial state s0;
• a set of actions A(s) in each state;
• a transition model P(s′ | s, a); and
• a reward function R(s).
This captures any fully observable, non-deterministic environment with a Markovian transition model and additive rewards.
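As a rough sketch, the same definition can be written down as a container type; the field names below are my own and any equivalent representation would do.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class MDP:
    states: List[str]                                   # s in S
    initial_state: str                                  # s0
    actions: Callable[[str], List[str]]                 # A(s)
    transition: Callable[[str, str], Dict[str, float]]  # the distribution P(. | s, a)
    reward: Callable[[str], float]                      # R(s)
```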

Markov decision process
What does a solution to an MDP look like?

Markov decision process
A solution is a policy, which we write as π. This is a choice of action for every state.
• That way, if we get off track, we still know what to do.
In any state s, π(s) identifies what action to take.
Example policy π: π(s0) = Left, π(s1) = Left, π(s2) = Right, . . .
Another example policy π′: for all states s, π′(s) = Left.
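A policy can be represented as simply as a dictionary with one chosen action per state; the state names below are hypothetical.

```python
# One chosen action per state; because the policy covers every state, the
# agent still knows what to do after an unlucky slip off the expected path.
pi = {"s0": "Left", "s1": "Left", "s2": "Right"}
pi_prime = {s: "Left" for s in ["s0", "s1", "s2"]}  # the "always Left" policy

def act(policy, s):
    return policy[s]
```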

Markov decision process
Naturally we’d prefer not just any policy but the optimum policy.
• But how to find it?
We need to compare policies by the reward they generate.
Since actions are stochastic, policies won’t give the same reward every time.
• So compare the expected value.
The optimum policy π* is the policy with the highest expected value. At every stage the agent should perform π*(s).
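One way to compare policies on their expected value is by simulation, assuming we can sample episodes from the MDP; sample_next_state and is_terminal below are hypothetical helpers standing in for the transition model and the end states.

```python
# Estimate a policy's expected return by averaging over sampled episodes.
def estimate_value(policy, start, sample_next_state, reward, is_terminal,
                   episodes=10_000, max_steps=100):
    total = 0.0
    for _ in range(episodes):
        s, ret = start, 0.0
        for _ in range(max_steps):
            ret += reward(s)
            if is_terminal(s):
                break
            s = sample_next_state(s, policy[s])  # stochastic outcome of the policy's action
        total += ret
    return total / episodes  # the policy with the higher estimate is preferred
```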

Markov decision process
(40 Acres and a Mule Filmworks/Universal Pictures)
π*(s) is the right thing.

An example
(a) An optimal policy for the stochastic environment with R(s) = −0.04. (b) Optimal policies for different values of R(s).

An example
R(s) ≤ −1.6284: life is painful, so the agent heads straight for the exit, even if it is a bad state.
−0.4278 ≤ R(s) ≤ −0.0850: life is unpleasant, so the agent heads for the +1 state and is prepared to risk falling into the −1 state.
−0.0221 < R(s) < 0: life isn’t so bad, and the optimal policy doesn’t take any risks.
R(s) > 0: the agent doesn’t want to leave.