COMP 424 – Artificial Intelligence
Markov Decision Processes
Instructor: Jackie CK Cheung
Readings: R&N Ch 17
• Markov decision processes
• Policies and value functions
• Computing optimal value functions for MDPs
• Policy iteration algorithm
• Policy evaluation
• Policy improvement
Sequential decision-making
• Utility theory provides a foundation for one-shot decisions.
• If more than one decision has to be made, reasoning about all of them jointly is in general very expensive.
• Agents need to be able to make decisions in a repeated interaction with the environment over time, where the effects of one decision affect the next one.
• Markov Decision Processes (MDPs) provide a framework for modeling sequential decision-making.
Applications of MDPs
• AI / Computer Science:
• Robotic control
• Air campaign planning
• Elevator control
• Computation scheduling
• Control and automation
• Spoken dialogue management
• Cellular channel allocation
• Football play selection
More applications of MDPs
• Economics / Operations Research
• Inventory management
• Fleet maintenance
• Road maintenance
• Packet retransmission
• Nuclear plant management
• Agriculture
• Herd management
• Fish stock management
Sequential decision-making
• At each time step t, the agent is in some state st.
• It chooses an action at, and as a result, it receives a
numerical reward Rt+1 and it can observe the new state st+1.
Markov Decision Processes (MDPs)
• Set of states S
• Set of actions A
• Transition model (dynamics): T: S x A x S → [0, 1]
• T(s,a,s’) = P(st+1=s’ | st=s, at=a) is the probability of going from s to
s’ under action a. (Same form as the transition model in an HMM.)
• Reward function R: S x A → R
• R(s,a) is the short-term utility of the action.
• Discount factor, γ
• γ is between 0 and 1, usually close to 1.
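To make these components concrete, here is a minimal sketch of how an MDP could be stored in Python. The two states, two actions, and all the numbers are made up for illustration; they are not from the slides.

```python
# Hypothetical 2-state, 2-action MDP, stored as plain dictionaries.

S = ["s0", "s1"]        # set of states
A = ["stay", "go"]      # set of actions

# Transition model: T[s][a][s2] = P(s_{t+1} = s2 | s_t = s, a_t = a).
T = {
    "s0": {"stay": {"s0": 1.0}, "go": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s0": 0.9, "s1": 0.1}},
}

# Reward function: R[s][a] = short-term utility of taking a in s.
R = {
    "s0": {"stay": 0.0, "go": -1.0},
    "s1": {"stay": 1.0, "go": -1.0},
}

gamma = 0.9  # discount factor, between 0 and 1, usually close to 1
```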
MDP Assumptions
• Markov property:
• Outcome of transition depends on current state and action
• Reward function depends on current state and action
• Distributions are stationary (transition and reward functions do not change over time)
Discount factor, γ
• Future rewards are worth less than current rewards. Why?
• Two interpretations:
• Inflation rate: an amount of money received in a year is worth less than the same amount received today.
• At each time step, there is a 1- γ chance that the agent dies, and does not receive rewards afterwards.
Planning in MDPs
• The goal of an agent in an MDP is to be rational: maximize its expected utility (i.e., the MEU principle again).
• Maximizing the immediate utility (reward) is not sufficient.
• e.g., the agent might pick an action that gives instant gratification,
even if it later makes it “die”.
• The goal is to maximize the long-term utility (also called the return).
• The return is an additive function of all rewards received by the agent.
Long-term utilities
The utility Ut of a trajectory starting from step t is defined differently depending on the type of task.
Episodic / finite-horizon tasks:
(e.g., games, trips through a maze, etc.)
Ut = Rt + Rt+1 + Rt+2 + … + RT
where T is the time when a terminal state is reached.
Continuing / infinite-horizon tasks:
(e.g., tasks which may go on forever)
Ut = Rt + γRt+1 + γ²Rt+2 + γ³Rt+3 + … = ∑k=0:∞ γᵏRt+k
The discount factor γ < 1 ensures that the return is finite if rewards are bounded.
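As a quick sanity check of the discounted return, the sketch below sums γᵏRt+k over a finite, made-up reward sequence:

```python
def discounted_return(rewards, gamma):
    """U_t = sum over k of gamma^k * R_{t+k}, for a finite reward list."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Three steps of reward 1 with gamma = 0.9:
print(discounted_return([1, 1, 1], 0.9))  # 1 + 0.9 + 0.81 = 2.71
```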
Example: Mountain-Car
• States: position and velocity
• Actions: accelerate forward, accelerate backward, coast
• Goal: get the car to the top of the hill as quickly as possible.
• Reward: -1 for every time step, until the car reaches the top (then 0)
(Alternately: reward = 1 at the top, 0 otherwise, γ<1)
Policies
• Defines how the agent should act in a state
• Two types of policies:
1. Deterministic policy: in each state the agent chooses a unique action.
π: S → A, π(s) = a
2. Stochastic policy: in the same state, the agent can “roll a die” and
choose different actions.
π: S x A → [0, 1], π(s,a) = P(at=a | st=s)
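In code, the two kinds of policy could be represented as follows (a sketch reusing the hypothetical dictionary MDP from earlier; the probabilities are made up):

```python
import random

# 1. Deterministic policy pi: S -> A, one action per state.
pi_det = {"s0": "go", "s1": "stay"}

# 2. Stochastic policy pi: S x A -> [0, 1], a distribution over actions.
pi_stoch = {
    "s0": {"stay": 0.3, "go": 0.7},
    "s1": {"stay": 1.0, "go": 0.0},
}

def act(pi, s):
    """'Roll a die': sample an action from a stochastic policy in state s."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs)[0]
```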
Finding a good policy
• Our agent needs to learn a good policy
• E.g., how should this robot behave in this map?
[Figure: transition model: the agent moves in the intended direction with probability 0.7, and slips to either side with probability 0.1 each]
• One approach:
• Figure out how good each state is (i.e., its expected utility)
• Choose policy that takes actions with high expected utility, using the states the agent might end up in after an action
State-value function
• The state-value function (or simply value function) of a policy π is a function: Vπ : S → R
• The value of state s under policy π is the expected return if the agent starts from state s and picks actions according to policy π:
Vπ(s) = Eπ [ Ut | st = s ]
• For a finite state space, we can represent this as an array, with one
entry for every state.
Policy iteration
Policy evaluation
• Recall our definition of the return:
Ut = Rt + γRt+1 + γ²Rt+2 + γ³Rt+3 + …
   = Rt + γ ( Rt+1 + γRt+2 + γ²Rt+3 + … )
   = Rt + γUt+1
• Based on this observation, Vπ(s) becomes:
Vπ(s) = Eπ [ Ut | st = s ] = Eπ [ Rt + γUt+1 | st=s ]
• By writing the expectation explicitly, we get:
• Deterministic policy: Vπ(s) = R(s, π(s)) + γ ∑s’∈S T(s, π(s), s’)Vπ(s’)
• Stochastic policy: Vπ(s) = ∑a∈A π(s,a) ( R(s,a) + γ ∑s’∈S T(s,a,s’)Vπ(s’) )
This is a system of linear equations (one per state) with a unique solution Vπ.
Bellman’s equation
• Consider the general, stochastic version:
Vπ(s) = ∑a∈A π(s,a) ( R(s,a) + γ ∑s’∈S T(s,a,s’)Vπ(s’) )
• This equation is recursive:
• It rewrites the value of a state in terms of the values of other states.
• We will use it to derive algorithms for policy evaluation and policy improvement.
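As one possible rendering in code, a single application of this recursive equation at one state, for a stochastic policy, could look like the sketch below (T, R, and the policy are the dictionary structures assumed earlier, and V maps states to current value estimates):

```python
def bellman_backup(s, pi, V, T, R, gamma):
    """Right-hand side of the Bellman equation for V^pi at state s."""
    total = 0.0
    for a, p_a in pi[s].items():  # sum over actions, weighted by pi(s, a)
        # Expected value of the next state under T(s, a, .)
        exp_next = sum(p * V[s2] for s2, p in T[s][a].items())
        total += p_a * (R[s][a] + gamma * exp_next)
    return total
```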
Policy evaluation in matrix form
• Bellman’s equation in matrix form: Vπ = Rπ + γ Tπ Vπ
• What are Vπ, Rπ, and Tπ?
• Vπ is a vector containing the value of each state under policy π.
• Rπ is a vector containing the immediate reward at each state: R(s, π (s)).
• Tπ is a matrix containing the transition probability at each state: T(s, π (s), s’).
• In some cases, we can solve this exactly:
Vπ = ( I − γ Tπ )⁻¹ Rπ
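A direct implementation of this solve, assuming numpy and that Rπ and Tπ have already been built as arrays indexed by state, might look like:

```python
import numpy as np

def evaluate_policy_exact(T_pi, R_pi, gamma):
    """Solve V^pi = R^pi + gamma * T^pi V^pi for V^pi.

    T_pi: (n, n) array with T_pi[s, s2] = T(s, pi(s), s2)
    R_pi: (n,) array with R_pi[s] = R(s, pi(s))
    """
    n = len(R_pi)
    # Solve the linear system (I - gamma * T_pi) V = R_pi;
    # this is more stable than forming the inverse explicitly.
    return np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)
```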
• Can we do this iteratively? (Necessary for large state spaces)
Iterative policy evaluation
Main idea: turn Bellman equations into update rules.
1. Start with some initial guess V0.
2. During every iteration k, update the value function for all states:
Vk+1(s) ← ( R(s, π(s)) + γ ∑s’∈S T(s, π(s), s’)Vk(s’) )
3. Stop when the maximum change between two iterations is smaller than a desired threshold (i.e., the values stop changing).
This is a bootstrapping idea: the value of one state is updated based on the current estimates of the values of successor states.
This is a dynamic programming algorithm which is guaranteed to converge!
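The update rule translates almost line-for-line into code. A sketch for a deterministic policy, using the same array representation as the exact solve above:

```python
import numpy as np

def evaluate_policy_iterative(T_pi, R_pi, gamma, theta=1e-8):
    """Sweep Bellman updates until the maximum change between two
    iterations falls below the threshold theta."""
    V = np.zeros(len(R_pi))               # initial guess V_0
    while True:
        V_new = R_pi + gamma * T_pi @ V   # update all states at once
        if np.max(np.abs(V_new - V)) < theta:
            return V_new                  # values have stopped changing
        V = V_new
```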
Convergence of iterative policy evaluation
• Consider the absolute error in our estimate Vk+1(s):
|Vk+1(s) − Vπ(s)| = | R(s, π(s)) + γ ∑s’∈S T(s, π(s), s’)Vk(s’) − R(s, π(s)) − γ ∑s’∈S T(s, π(s), s’)Vπ(s’) |
= γ | ∑s’∈S T(s, π(s), s’) ( Vk(s’) − Vπ(s’) ) |
≤ γ maxs’∈S |Vk(s’) − Vπ(s’)|
(the reward terms cancel, since both updates use the same R(s, π(s))).
Convergence of iterative policy evaluation
• Let εk be the worst error at iteration k: εk = maxs∈S |Vk(s) − Vπ(s)|
• From the previous calculation, we have: εk+1 ≤ γεk
• Because γ < 1, this means that εk → 0 as k → ∞.
• We say that the error contracts by a contraction factor of γ.
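The contraction is easy to check numerically: the ratio of successive worst-case errors never exceeds γ. A sketch with a made-up 2-state chain:

```python
import numpy as np

gamma = 0.9
T_pi = np.array([[0.5, 0.5],    # made-up transition matrix under pi
                 [0.1, 0.9]])
R_pi = np.array([1.0, 0.0])     # made-up rewards under pi

# Exact solution, used to measure the error of each iterate.
V_true = np.linalg.solve(np.eye(2) - gamma * T_pi, R_pi)

V = np.zeros(2)
for k in range(5):
    err = np.max(np.abs(V - V_true))       # eps_k
    V = R_pi + gamma * T_pi @ V            # one evaluation sweep
    err_new = np.max(np.abs(V - V_true))   # eps_{k+1}
    print(f"eps ratio at step {k}: {err_new / err:.3f}")  # <= gamma
```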
Policy iteration
Searching for a good policy
• We say that π ≥ π’ if Vπ(s) ≥ Vπ’(s), ∀s∈S.
• This gives a partial ordering of policies.
• If one policy is better at one state but worse at another state, the two policies are not comparable.
• Since we know how to compute values for policies, we can search through the space of policies.
• Local search seems like a good fit.
Policy improvement
• Recall Bellman’s eqn:
Vπ(s) = ∑a∈A π(s,a) ( R(s,a) + γ ∑s’∈S T(s,a,s’)Vπ(s’) )
• Suppose that there is some action a*, such that: ( R(s,a*) + γ ∑s’∈S T(s,a*,s’)Vπ(s’) ) > Vπ(s)
• Then if we set π(s,a*)←1, the value of state s will increase.
• Because we replaced each element in the sum in Vπ(s) with a bigger value.
• The values of states that can transition to s increase as well.
• The values of all other states stay the same.
• So the new policy using a* is better than the initial policy π.
Policy iteration
• More generally, we can change the policy π to a new policy π’ which is greedy with respect to the computed values Vπ
π’(s) = argmaxa∈A ( R(s,a) + γ ∑s’∈S T(s,a,s’)Vπ(s’) )
• This gives us a local search through the space of policies.
• We stop when the values of two successive policies are identical.
• Because we only look for deterministic policies, and there is a finite number of them, the search is guaranteed to terminate.
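The greedy step can be written directly from the argmax above (a sketch, using the dictionary MDP assumed earlier; ties in the argmax are broken arbitrarily):

```python
def improve_policy(V, S, A, T, R, gamma):
    """Return the deterministic policy that is greedy with respect to V."""
    pi = {}
    for s in S:
        # Q(s, a) = R(s, a) + gamma * sum over s2 of T(s, a, s2) * V(s2)
        q = {a: R[s][a] + gamma * sum(p * V[s2]
                                      for s2, p in T[s][a].items())
             for a in A}
        pi[s] = max(q, key=q.get)  # argmax over actions
    return pi
```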
Policy iteration algorithm
• Start with an initial policy π0 (e.g., random)
• Repeat:
1. Compute Vπ, using policy evaluation.
2. Compute a new policy π’ that is greedy with respect to Vπ.
• Terminate when Vπ = Vπ’
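Putting the two steps together gives the full loop. This sketch uses the dictionary MDP assumed earlier and the `improve_policy` sketch above; as is common in implementations, it terminates when the greedy policy stops changing, which coincides with the values no longer changing:

```python
def evaluate_policy(pi, S, T, R, gamma, theta=1e-8):
    """Iterative policy evaluation for a deterministic dict-based policy."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            a = pi[s]
            v = R[s][a] + gamma * sum(p * V[s2]
                                      for s2, p in T[s][a].items())
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

def policy_iteration(S, A, T, R, gamma):
    """Alternate policy evaluation and greedy improvement."""
    pi = {s: A[0] for s in S}  # arbitrary initial policy
    while True:
        V = evaluate_policy(pi, S, T, R, gamma)
        pi_new = improve_policy(V, S, A, T, R, gamma)
        if pi_new == pi:       # policy (and hence values) unchanged
            return pi, V
        pi = pi_new
```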
A 4×3 gridworld example
• Transitions are stochastic, as shown below.
[Figure: transition model: the agent moves in the intended direction with probability 0.7, and slips to either side with probability 0.1 each]
Policy Iteration (1)
[Figures: initial policy; state values after policy evaluation]
Policy Iteration (2)
[Figures: policy after improvement; state values after another policy evaluation]
Policy Iteration (3)
[Figures: policy after another improvement; state values (converged!)]
A 4×3 gridworld example
• New version:
– Transitions are still stochastic.
– Change the reward of the pit from -10 to -500.
Agent actively tries to avoid the goal, for fear of falling into the pit!
Generalized Policy Iteration
• Any combination of policy evaluation and policy improvement steps, e.g., only update the value of one state, and immediately improve the policy at that state.
Optimal policies and optimal value functions
• The optimal value function, V*, is defined as the best value that can be achieved at any state:
V*(s) = maxπ Vπ(s)
• In a finite MDP, there exists a unique optimal value
function (shown by Bellman, 1957).
• Any policy that achieves the optimal value function is called an optimal policy (denoted π*). The optimal policy is not necessarily unique.
Optimal policies in the gridworld example
• Optimal state values give information about the shortest path to the goal.
• One of the deterministic optimal policies is shown below.
• There can be an infinite number of optimal policies (think
stochastic policies).
Complexity of policy iteration
Repeat 2 basic steps:
1. Compute Vπ, using policy evaluation.
2. Compute a new policy π’ that is greedy with respect to Vπ.
Repeat for how many iterations?
Complexity of policy iteration
Repeat 2 basic steps:
1. Compute Vπ, using policy evaluation.
Per iteration: O(|S|³)
2. Compute a new policy π’ that is greedy with respect to Vπ.
Per iteration: O(|S|²|A|)
Repeat for how many iterations? At most |A|^|S| (the number of deterministic policies).
Can get very expensive when there are many states!
What you should know
• Definition of MDP framework.
• Differences/similarities between MDPs and other AI approaches (e.g. general search, game playing, STRIPS planning).
• Basic MDP algorithms and their properties:
• Policy iteration
• Policy evaluation
• Policy improvement