
Policy iteration
Rather than compute optimal utility values, policy iteration searches through the space of possible policies.
Starting from some initial policy π_0 we do:
• Policy evaluation
Given a policy π_i, calculate U_i(s).
• Policy improvement
Given U_i(s), compute π_{i+1}.
We will look at each of these steps in turn, though not in order.

Policy improvement
Easy
Calculate a new policy π_{i+1} by applying:

π_{i+1}(s) = argmax_{a ∈ A(s)} Σ_{s'} P(s'|s, a) U_i(s')

For each state we do a one-step MEU lookahead. A simple decision.
Use the values established by policy evaluation.

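A minimal sketch of this one-step lookahead in Python (NumPy), assuming purely for illustration that the MDP transitions are stored in an array P with P[s, a, s'] = P(s'|s, a), and that U holds the utilities U_i produced by policy evaluation (these names are hypothetical, not notation from the slides):

import numpy as np

def improve_policy(P, U):
    # One-step MEU lookahead: expected[s, a] = sum_s' P(s'|s, a) * U(s')
    expected = P @ U
    # New policy: for each state, the action index with the highest expected utility
    return np.argmax(expected, axis=1)

Note that R(s) and γ can be dropped here: they are the same for every action in a given state, so they do not affect the argmax.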

Policy evaluation
How do we calculate the utility of each state given the policy π_i? Turns out not to be so hard.
Given a policy, the choice of action in a given state is fixed (that is what a policy tells us), so:

U_i(s) = R(s) + γ Σ_{s'} P(s'|s, π_i(s)) U_i(s')

Again there are lots of simultaneous equations, but now they are linear (no max), and so standard linear algebra solutions will work.
(This is because we use the same U_i on the lhs and the rhs.)
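
For example, the fixed-policy equations can be handed to a standard linear solver. A minimal sketch in Python (NumPy), again with a hypothetical representation: a transition array P[s, a, s'], a reward vector R indexed by state, and a policy given as an array of action indices:

import numpy as np

def evaluate_policy(P, R, policy, gamma):
    # Solve U = R + gamma * P_pi U exactly, i.e. (I - gamma * P_pi) U = R
    n = len(R)
    # P_pi[s, s'] = P(s' | s, pi(s)): transitions under the fixed policy
    P_pi = P[np.arange(n), policy, :]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R)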

Policy iteration
Put these together to get:
Starting from some initial policy π_0 we do:
1. Policy evaluation
Compute

U_i(s) = R(s) + γ Σ_{s'} P(s'|s, π_i(s)) U_i(s')

for every state s.
2. Policy improvement
Calculate a new policy π_{i+1} by applying

π_{i+1}(s) = argmax_{a ∈ A(s)} Σ_{s'} P(s'|s, a) U_i(s')

for every state s.
Until convergence.
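
A sketch of the whole loop, under the same hypothetical P, R representation as in the earlier snippets; the loop stops when improvement no longer changes the policy:

import numpy as np

def policy_iteration(P, R, gamma):
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)        # arbitrary initial policy pi_0
    while True:
        # 1. Policy evaluation: solve the linear system for U_i
        P_pi = P[np.arange(n_states), policy, :]
        U = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
        # 2. Policy improvement: one-step MEU lookahead
        new_policy = np.argmax(P @ U, axis=1)
        if np.array_equal(new_policy, policy):    # no change, so we have converged
            return policy, U
        policy = new_policy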

Policy iteration
The iteration will terminate when there is no improvement in utility from one iteration to the next.
At this point the utility Ui is a fixed point of the Bellman update and so πi must be optimal.

Policy evaluation
There is a problem with the policy evaluation stage of the policy iteration approach.
If we have n states, we have n linear equations with n unknowns in the evaluation stage.
Solution in O(n^3) (there is also an impractical solution in O(n^2.376)). For large n, this can be a problem.
So, an approximate solution.

Approximate?
(Cartoon omitted; credit: Cartoon Network)

Approximate policy evaluation
Run a simplified value iteration.
Policy is fixed, so we know what action to do in each state. Repeat:

U_{i+1}(s) ← R(s) + γ Σ_{s'} P(s'|s, π_i(s)) U_i(s')

a fixed number of times.
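
A sketch of this fixed-policy sweep in Python, under the same hypothetical P[s, a, s'] and R representation as before; k, the fixed number of sweeps, is a free parameter:

import numpy as np

def approx_evaluate_policy(P, R, policy, gamma, U, k=10):
    # k sweeps of the simplified Bellman update with the policy held fixed
    n = len(R)
    P_pi = P[np.arange(n), policy, :]             # P(s'|s, pi_i(s))
    for _ in range(k):
        U = R + gamma * (P_pi @ U)                # U_{i+1}(s) <- R(s) + gamma * sum_s' ...
    return U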

Modified policy iteration
Starting from some initial policy π_0 we do:
1. Approximate policy evaluation
Repeat

U_{i+1}(s) ← R(s) + γ Σ_{s'} P(s'|s, π_i(s)) U_i(s')

a fixed number of times. The difference to policy iteration is the arrow: we update the utilities rather than solving for them exactly.
2. Policy improvement
Calculate a new policy π_{i+1} by applying

π_{i+1}(s) = argmax_{a ∈ A(s)} Σ_{s'} P(s'|s, a) U_i(s')

for every state s.
Until convergence.
Often more efficient than policy iteration or value iteration.
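
For completeness, a sketch of the whole modified scheme under the same hypothetical representation; it differs from the policy_iteration sketch above only in replacing the linear solve with k update sweeps:

import numpy as np

def modified_policy_iteration(P, R, gamma, k=10):
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)        # arbitrary initial policy pi_0
    U = np.zeros(n_states)
    while True:
        # 1. Approximate policy evaluation: k sweeps of the fixed-policy update
        P_pi = P[np.arange(n_states), policy, :]
        for _ in range(k):
            U = R + gamma * (P_pi @ U)
        # 2. Policy improvement: one-step MEU lookahead
        new_policy = np.argmax(P @ U, axis=1)
        if np.array_equal(new_policy, policy):
            return policy, U
        policy = new_policy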

Solving MDPs
Have covered three methods for solving MDPs
• Value iteration (Exact)
• Policy iteration (Exact)
• Modified policy iteration (Approximate)
Which to use is somewhat problem specific.

Bellman redux
The Bellman equation(s)/update are widely used.
D. Romer, It’s Fourth Down and What Does the Bellman Equation Say? A Dynamic Programming Analysis of Football Strategy, NBER Working Paper No. 9024, June 2002

Bellman redux
This paper uses play-by-play accounts of virtually all regular season National Football League games for 1998-2000 to analyze teams’ choices on fourth down between trying for a first down and kicking. Dynamic programming is used to estimate the values of possessing the ball at different points on the field. These estimates are combined with data on the results of kicks and conventional plays to estimate the average payoffs to kicking and going for it under different circumstances. Examination of teams’ actual decisions shows systematic, overwhelmingly statistically significant, and quantitatively large departures from the decisions the dynamic-programming analysis implies are preferable.