How utilities are calculated
So far we have assumed that utilities are summed along a run.
• Not the only way.
In general we need to compute $U_r([s_0, s_1, \ldots, s_n])$ for a general $U_r(\cdot)$, that is, the utility of a run.
Before, $U_r(\cdot)$ was just the sum of the rewards of the states in the run. We can consider finite and infinite horizons.
• Is it “game over” at some point?
It turns out that infinite horizons are mostly easier to deal with.
• That is what we will use.
How utilities are calculated
We also have to consider whether utilities are stationary or non-stationary.
• Think of it as: does the same state always have the same value?
• E.g., in Pacman, when you pick up a fruit there is a large reward for that tile. That changes after you have picked up the fruit.
Example:
• Normally we prefer one state to another.
• E.g., we prefer passing the AI module to failing it.
• In this case it is irrelevant when the exam is, today or next week.
We assume utilities are stationary.
But are they?
It is not clear that utilities are always stationary.
In truth, I do not always want cherry pie the most. Despite this, we will assume that utilities are stationary.
How utilities are calculated
With stationary utilities, there are two ways to establish $U_r([s_0, s_1, \ldots, s_n])$ from $R(s)$.
Additive rewards:
$$U_r([s_0, s_1, \ldots, s_n]) = R(s_0) + R(s_1) + \ldots + R(s_n)$$
as above.
Discounted rewards:
$$U_r([s_0, s_1, \ldots, s_n]) = R(s_0) + \gamma R(s_1) + \ldots + \gamma^n R(s_n)$$
where the discount factor $\gamma$ is a number between 0 and 1.
The discount factor models the preference of the agent for current over future rewards.
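As a concrete illustration, here is a minimal Python sketch (not from the slides) computing both versions of $U_r$ from a list of per-state rewards; the reward sequence and the value $\gamma = 0.9$ are made-up examples.

```python
# A sketch (not from the slides): computing the utility of a run from its
# per-state rewards, additively and with a discount factor gamma.

def additive_utility(rewards):
    """U_r([s0, ..., sn]) = R(s0) + R(s1) + ... + R(sn)."""
    return sum(rewards)

def discounted_utility(rewards, gamma):
    """U_r([s0, ..., sn]) = R(s0) + gamma*R(s1) + ... + gamma**n * R(sn)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Made-up example run: three steps of small negative reward, then a reward of +1.
rewards = [-0.04, -0.04, -0.04, 1.0]
print(additive_utility(rewards))           # about 0.88
print(discounted_utility(rewards, 0.9))    # about 0.62
```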
How utilities are calculated
There is an issue with infinite sequences with additive, undiscounted rewards.
• What will the utility of a policy be?
How utilities are calculated
There is an issue with infinite sequences with additive, undiscounted rewards.
• What will the utility of a policy be?
Unbounded: $\infty$ or $-\infty$.
This is problematic if we want to compare policies.
How utilities are calculated
Some solutions are (definitions follow):
• Proper policies
• Average reward
• Discounted rewards
How utilities are calculated
Proper policies always end up in a terminal state eventually. Thus they have a finite expected utility.
How utilities are calculated
We can compute the average reward per time step. Even for an infinite policy this will (usually) be finite.
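For reference, a common way to write this (the formula is not stated on the slide itself) is as the limiting mean reward over the first $n$ steps:
$$\lim_{n \to \infty} \frac{1}{n} \sum_{t=0}^{n-1} R(s_t)$$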
How utilities are calculated
Assume $0 \le \gamma < 1$ and that rewards are bounded by $R_{\max}$.
With discounted rewards, the utility of an infinite sequence is finite:
$$U_r([s_0, s_1, \ldots, s_n]) = \sum_{t=0}^{n} \gamma^t R(s_t) \le \sum_{t=0}^{\infty} \gamma^t R(s_t) \le \sum_{t=0}^{\infty} \gamma^t R_{\max} \le \frac{R_{\max}}{1-\gamma}$$
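A quick numerical sanity check of this bound (a sketch, not from the slides; the values of $\gamma$ and $R_{\max}$ are arbitrary):

```python
# Numerical check: with gamma < 1 and rewards bounded by R_max, the discounted
# sum of rewards stays below R_max / (1 - gamma). Values here are arbitrary.
gamma, R_max = 0.9, 1.0
bound = R_max / (1 - gamma)                        # 10.0

# Worst case: receive R_max at every step of a long (near-infinite) run.
utility = sum((gamma ** t) * R_max for t in range(1000))
print(utility, "<=", bound)                        # just under 10.0, never above
```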
Optimal policies
With discounted rewards we compare policies by computing their expected values. The expected utility of executing π starting in s is given by:
$$U^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t R(S_t)\right]$$
where $S_t$ is the state the agent gets to at time $t$.
$S_t$ is a random variable, and we compute the probability of all its values by looking at all the runs which end up there after $t$ steps.
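One way to make this concrete is to estimate $U^\pi(s)$ by sampling runs and averaging their discounted utilities. The sketch below does this for a made-up two-state MDP and fixed policy (none of the states, probabilities, or rewards come from the slides):

```python
import random

# A sketch (toy example, not from the slides) of estimating U^pi(s) by sampling
# runs and averaging their discounted utilities. The two states, the transition
# probabilities, the rewards, and the policy below are made-up illustrations.
GAMMA = 0.9
R = {"A": -0.04, "B": 1.0}                  # R(s): reward received in state s
P = {("A", "go"): [("A", 0.2), ("B", 0.8)]} # P[(s, a)] -> [(next state, prob), ...]
policy = {"A": "go"}                        # "B" is terminal: the policy says nothing

def sampled_run_utility(start, max_steps=100):
    """Discounted utility of one run sampled by following the policy from `start`."""
    s, utility, discount = start, 0.0, 1.0
    for _ in range(max_steps):
        utility += discount * R[s]
        if s not in policy:                 # terminal state reached: run ends
            break
        successors, probs = zip(*P[(s, policy[s])])
        s = random.choices(successors, weights=probs)[0]
        discount *= GAMMA
    return utility

# U^pi("A") is approximated by the average utility over many sampled runs.
runs = [sampled_run_utility("A") for _ in range(10_000)]
print(sum(runs) / len(runs))                # roughly 0.83 for these numbers
```

Averaging over many sampled runs approximates the expectation over the random variables $S_t$.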
Optimal policies
The optimal policy is then:
$$\pi^* = \arg\max_{\pi} U^\pi(s)$$
It turns out that this is independent of the state the agent starts in.
Optimal policies
Here we have the values of the states, $U^{\pi^*}(s)$, if the agent executes an optimal policy.
Optimal policies
Here we have the values of the states, $U^{\pi^*}(s)$, if the agent executes an optimal policy.
What should the agent do if it is in (3, 1)?
Example
The answer is Left.
The best action is the one that maximises the expected utility.
(You have to calculate the expected utility of all the actions to see why Left is the best choice.)
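A sketch of that calculation, using the transition model of the 4×3 gridworld (0.8 chance of moving as intended, 0.1 for each perpendicular direction) and the commonly quoted optimal-policy utilities for $R(s) = -0.04$; these numbers are assumptions and may differ slightly from the figure on the slide:

```python
# A sketch with assumed numbers: one-step expected utilities of each action
# from (3,1) in the 4x3 gridworld. U holds the commonly quoted optimal-policy
# utilities for R(s) = -0.04; they may differ slightly from the slide's figure.
U = {(1, 1): 0.705, (2, 1): 0.655, (3, 1): 0.611, (4, 1): 0.388,
     (1, 2): 0.762,                (3, 2): 0.660, (4, 2): -1.0,
     (1, 3): 0.812, (2, 3): 0.868, (3, 3): 0.918, (4, 3): 1.0}

MOVES = {"Up": (0, 1), "Down": (0, -1), "Left": (-1, 0), "Right": (1, 0)}
PERPENDICULAR = {"Up": ("Left", "Right"), "Down": ("Left", "Right"),
                 "Left": ("Up", "Down"), "Right": ("Up", "Down")}

def result(state, move):
    """Intended successor; bumping into a wall or the (2,2) obstacle means staying put."""
    nxt = (state[0] + MOVES[move][0], state[1] + MOVES[move][1])
    return nxt if nxt in U else state

def expected_utility(state, action):
    """Expected utility of the next state: 0.8 intended direction, 0.1 each side."""
    side1, side2 = PERPENDICULAR[action]
    return (0.8 * U[result(state, action)]
            + 0.1 * U[result(state, side1)]
            + 0.1 * U[result(state, side2)])

for a in MOVES:
    print(a, round(expected_utility((3, 1), a), 3))
# Left scores highest (about 0.651, vs 0.632 for Up), matching the slide's answer.
```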