COMP 424 – Artificial Intelligence
Markov Decision Processes
Instructor: Jackie CK Cheung
Readings: R&N Ch 17
• Markov decision processes
• Policies and value functions
• Computing optimal value functions for MDPs
• Policy iteration algorithm
• Policy evaluation
• Policy improvement
Sequential decision-making
• Utility theory provides a foundation for one-shot decisions.
• If more than one decision has to be made, reasoning about all of them jointly is in general very expensive.
• Agents need to be able to make decisions in a repeated interaction with the environment over time, where the effects of one decision affect the next one.
• Markov Decision Processes (MDPs) provide a framework for modeling sequential decision-making.
Applications of MDPs
• AI / Computer Science:
• Robotic control
• Air campaign planning
• Elevator control
• Computation scheduling
• Control and automation
• Spoken dialogue management
• Cellular channel allocation
• Football play selection
More applications of MDPs
• Economics / Operations Research
• Inventory management
• Fleet maintenance
• Road maintenance
• Packet retransmission
• Nuclear plant management
• Agriculture
• Herd management
• Fish stock management
Sequential decision-making
• At each time step t, the agent is in some state st.
• It chooses an action at, and as a result, it receives a
numerical reward Rt+1 and it can observe the new state st+1.
Markov Decision Processes (MDPs)
• Set of states S
• Set of actions A
• Transition model (dynamics): T: S x A x S → [0, 1]
• T(s,a,s’) = P(st+1=s’ | st=s, at=a) is the probability of going from s to
s’ under action a. (Same form as the transition model in an HMM.)
• Reward function R: S x A → R
• R(s,a) is the short-term utility of the action.
• Discount factor, γ
• γ is between 0 and 1, usually close to 1.
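To make these components concrete, here is a minimal sketch of how an MDP could be stored in Python. The two states, two actions, and all the numbers are made up for illustration; they are not from the slides.

```python
# Hypothetical 2-state, 2-action MDP, stored as plain dictionaries.

S = ["s0", "s1"]        # set of states
A = ["stay", "go"]      # set of actions

# Transition model: T[s][a][s2] = P(s_{t+1} = s2 | s_t = s, a_t = a).
T = {
    "s0": {"stay": {"s0": 1.0}, "go": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s0": 0.9, "s1": 0.1}},
}

# Reward function: R[s][a] = short-term utility of taking a in s.
R = {
    "s0": {"stay": 0.0, "go": -1.0},
    "s1": {"stay": 1.0, "go": -1.0},
}

gamma = 0.9  # discount factor, between 0 and 1, usually close to 1
```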
MDP Assumptions
• Markov property:
• Outcome of transition depends on current state and action
• Reward function depends on current state and action
• Distributions are stationary (transition and reward functions do not change over time)
Discount factor, γ
• Future rewards are worth less than current rewards. Why?
• Two interpretations:
• Inflation rate: an amount of money received in a year is worth less than the same amount received today.
• At each time step, there is a 1- γ chance that the agent dies, and does not receive rewards afterwards.
Planning in MDPs
• The goal of an agent in an MDP is to be rational: maximize its expected utility (i.e., the MEU principle again).
• Maximizing the immediate utility (reward) is not sufficient.
• e.g., the agent might pick an action that gives instant gratification,
even if it later makes it “die”.
• The goal is to maximize the long-term utility (also called the return).
• The return is an additive function of all rewards received by the agent.
Long-term utilities
The utility Ut of a trajectory starting from step t is defined differently depending on the type of task.
Episodic / finite-horizon tasks:
(e.g., games, trips through a maze, etc.)
Ut = Rt + Rt+1 + Rt+2 + … + RT
where T is the time when a terminal state is reached.
Continuing / infinite-horizon tasks:
(e.g., tasks which may go on forever)
Ut = Rt + γRt+1 + γ²Rt+2 + γ³Rt+3 + … = ∑k=0:∞ γᵏRt+k
The discount factor γ < 1 ensures that the return is finite if rewards are bounded.
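As a quick sanity check of the discounted return, the sketch below sums γᵏRt+k over a finite, made-up reward sequence:

```python
def discounted_return(rewards, gamma):
    """U_t = sum over k of gamma^k * R_{t+k}, for a finite reward list."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Three steps of reward 1 with gamma = 0.9:
print(discounted_return([1, 1, 1], 0.9))  # 1 + 0.9 + 0.81 = 2.71
```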
Example: Mountain-Car
• States: position and velocity
• Actions: accelerate forward, accelerate backward, coast
• Goal: get the car to the top of the hill as quickly as possible.
• Reward: -1 for every time step, until the car reaches the top (then 0)
(Alternately: reward = 1 at the top, 0 otherwise, γ<1)
Policies
• Defines how the agent should act in a state
• Two types of policies:
1. Deterministic policy: in each state the agent chooses a unique action.
π: S → A, π(s) = a
2. Stochastic policy: in the same state, the agent can “roll a die” and
choose different actions.
π: S x A → [0, 1], π(s,a) = P(at=a | st=s)
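In code, the two kinds of policy could be represented as follows (a sketch reusing the hypothetical dictionary MDP from earlier; the probabilities are made up):

```python
import random

# 1. Deterministic policy pi: S -> A, one action per state.
pi_det = {"s0": "go", "s1": "stay"}

# 2. Stochastic policy pi: S x A -> [0, 1], a distribution over actions.
pi_stoch = {
    "s0": {"stay": 0.3, "go": 0.7},
    "s1": {"stay": 1.0, "go": 0.0},
}

def act(pi, s):
    """'Roll a die': sample an action from a stochastic policy in state s."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs)[0]
```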
Finding a good policy
• Our agent needs to learn a good policy
• E.g., how should this robot behave in this map?
[Figure: transition model: the agent moves in the intended direction with probability 0.7, and slips to either side with probability 0.1 each]
• One approach:
• Figure out how good each state is (i.e., its expected utility)
• Choose policy that takes actions with high expected utility, using the states the agent might end up in after an action
State-value function
• The state-value function (or simply value function) of a policy π is a function: Vπ : S → R
• The value of state s under policy π is the expected return if the agent starts from state s and picks actions according to policy π:
Vπ(s) = Eπ [ Ut | st = s ]
• For a finite state space, we can represent this as an array, with one
entry for every state.
Policy iteration
Policy evaluation
• Recall our definition of the return:
Ut = Rt + γRt+1 + γ²Rt+2 + γ³Rt+3 + …
   = Rt + γ ( Rt+1 + γRt+2 + γ²Rt+3 + … )
   = Rt + γUt+1
• Based on this observation, Vπ(s) becomes:
Vπ(s) = Eπ [ Ut | st = s ] = Eπ [ Rt + γUt+1 | st=s ]
• By writing the expectation explicitly, we get:
• Deterministic policy: Vπ(s) = R(s, π(s)) + γ ∑s’∈S T(s, π(s), s’)Vπ(s’)
• Stochastic policy: Vπ(s) = ∑a∈A π(s,a) ( R(s,a) + γ ∑s’∈S T(s,a,s’)Vπ(s’) )
This is a system of linear equations (one per state) with a unique solution Vπ.
Bellman’s equation
• Consider the general, stochastic version:
Vπ(s) = ∑a∈A π(s,a) ( R(s,a) + γ ∑s’∈S T(s,a,s’)Vπ(s’) )
• This equation is recursive:
• It rewrites the value of a state in terms of the values of other states.
• We will use it to derive algorithms for policy evaluation and policy improvement.
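As one possible rendering in code, a single application of this recursive equation at one state, for a stochastic policy, could look like the sketch below (T, R, and the policy are the dictionary structures assumed earlier, and V maps states to current value estimates):

```python
def bellman_backup(s, pi, V, T, R, gamma):
    """Right-hand side of the Bellman equation for V^pi at state s."""
    total = 0.0
    for a, p_a in pi[s].items():  # sum over actions, weighted by pi(s, a)
        # Expected value of the next state under T(s, a, .)
        exp_next = sum(p * V[s2] for s2, p in T[s][a].items())
        total += p_a * (R[s][a] + gamma * exp_next)
    return total
```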
Policy evaluation in matrix form
• Bellman’s equation in matrix form: Vπ = Rπ + γ Tπ Vπ
• What are Vπ, Rπ, and Tπ?
• Vπ is a vector containing the value of each state under policy π.
• Rπ is a vector containing the immediate reward at each state: R(s, π (s)).
• Tπ is a matrix containing the transition probability at each state: T(s, π (s), s’).
• In some cases, we can solve this exactly:
Vπ = ( I − γ Tπ )⁻¹ Rπ
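A direct implementation of this solve, assuming numpy and that Rπ and Tπ have already been built as arrays indexed by state, might look like:

```python
import numpy as np

def evaluate_policy_exact(T_pi, R_pi, gamma):
    """Solve V^pi = R^pi + gamma * T^pi V^pi for V^pi.

    T_pi: (n, n) array with T_pi[s, s2] = T(s, pi(s), s2)
    R_pi: (n,) array with R_pi[s] = R(s, pi(s))
    """
    n = len(R_pi)
    # Solve the linear system (I - gamma * T_pi) V = R_pi;
    # this is more stable than forming the inverse explicitly.
    return np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)
```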
• Can we do this iteratively? (Necessary for large state spaces)
Iterative policy evaluation
Main idea: turn Bellman equations into update rules.
1. Start with some initial guess V0.
2. During every iteration k, update the value function for all states:
Vk+1(s) ← ( R(s, π(s)) + γ ∑s’∈S T(s, π(s), s’)Vk(s’) )
3. Stop when the maximum change between two iterations is smaller than a desired threshold (i.e., the values stop changing).
This is a bootstrapping idea: the value of one state is updated based on the current estimates of the values of successor states.
This is a dynamic programming algorithm which is guaranteed to converge!
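The update rule translates almost line-for-line into code. A sketch for a deterministic policy, using the same array representation as the exact solve above:

```python
import numpy as np

def evaluate_policy_iterative(T_pi, R_pi, gamma, theta=1e-8):
    """Sweep Bellman updates until the maximum change between two
    iterations falls below the threshold theta."""
    V = np.zeros(len(R_pi))               # initial guess V_0
    while True:
        V_new = R_pi + gamma * T_pi @ V   # update all states at once
        if np.max(np.abs(V_new - V)) < theta:
            return V_new                  # values have stopped changing
        V = V_new
```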
Convergence of iterative policy evaluation
• Consider the absolute error in our estimate Vk+1(s):
|Vk+1(s) − Vπ(s)| = | R(s, π(s)) + γ ∑s’∈S T(s, π(s), s’)Vk(s’) − R(s, π(s)) − γ ∑s’∈S T(s, π(s), s’)Vπ(s’) |
= γ | ∑s’∈S T(s, π(s), s’) ( Vk(s’) − Vπ(s’) ) |
≤ γ maxs’∈S |Vk(s’) − Vπ(s’)|
(the reward terms cancel, since both updates use the same R(s, π(s))).
Convergence of iterative policy evaluation
• Let εk be the worst error at iteration k: εk = maxs∈S |Vk(s) − Vπ(s)|
• From the previous calculation, we have: εk+1 ≤ γεk
• Because γ < 1, this means that εk → 0 as k → ∞.
• We say that the error contracts by a contraction factor of γ.
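The contraction is easy to check numerically: the ratio of successive worst-case errors never exceeds γ. A sketch with a made-up 2-state chain:

```python
import numpy as np

gamma = 0.9
T_pi = np.array([[0.5, 0.5],    # made-up transition matrix under pi
                 [0.1, 0.9]])
R_pi = np.array([1.0, 0.0])     # made-up rewards under pi

# Exact solution, used to measure the error of each iterate.
V_true = np.linalg.solve(np.eye(2) - gamma * T_pi, R_pi)

V = np.zeros(2)
for k in range(5):
    err = np.max(np.abs(V - V_true))       # eps_k
    V = R_pi + gamma * T_pi @ V            # one evaluation sweep
    err_new = np.max(np.abs(V - V_true))   # eps_{k+1}
    print(f"eps ratio at step {k}: {err_new / err:.3f}")  # <= gamma
```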
Policy iteration
Searching for a good policy
• We say that π ≥ π’ if Vπ(s) ≥ Vπ’(s), ∀s∈S.
• This gives a partial ordering of policies.
• If one policy is better at one state but worse at another state, the two policies are not comparable.
• Since we know how to compute values for policies, we can search through the space of policies.
• Local search seems like a good fit.
Policy improvement
• Recall Bellman’s eqn:
Vπ(s) = ∑a∈A π(s,a) ( R(s,a) + γ ∑s’∈S T(s,a,s’)Vπ(s’) )
• Suppose that there is some action a*, such that: ( R(s,a*) + γ ∑s’∈S T(s,a*,s’)Vπ(s’) ) > Vπ(s)
• Then if we set π(s,a*)←1, the value of state s will increase.
• Because we replaced each element in the sum in Vπ(s) with a bigger value.
• The values of states that can transition to s increase as well.
• The values of all other states stay the same.
• So the new policy using a* is better than the initial policy π.
Policy iteration
• More generally, we can change the policy π to a new policy π’ which is greedy with respect to the computed values Vπ
π’(s) = argmaxa∈A ( R(s,a) + γ ∑s’∈S T(s,a,s’)Vπ(s’) )
• This gives us a local search through the space of policies.
• We stop when the values of two successive policies are identical.
• Because we only look for deterministic policies, and there is a finite number of them, the search is guaranteed to terminate.
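The greedy step can be written directly from the argmax above (a sketch, using the dictionary MDP assumed earlier; ties in the argmax are broken arbitrarily):

```python
def improve_policy(V, S, A, T, R, gamma):
    """Return the deterministic policy that is greedy with respect to V."""
    pi = {}
    for s in S:
        # Q(s, a) = R(s, a) + gamma * sum over s2 of T(s, a, s2) * V(s2)
        q = {a: R[s][a] + gamma * sum(p * V[s2]
                                      for s2, p in T[s][a].items())
             for a in A}
        pi[s] = max(q, key=q.get)  # argmax over actions
    return pi
```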
Policy iteration algorithm
• Start with an initial policy π0 (e.g., random)
• Repeat:
1. Compute Vπ, using policy evaluation.
2. Compute a new policy π’ that is greedy with respect to Vπ.
• Terminate when Vπ = Vπ’
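Putting the two steps together gives the full loop. This sketch uses the dictionary MDP assumed earlier and the `improve_policy` sketch above; as is common in implementations, it terminates when the greedy policy stops changing, which coincides with the values no longer changing:

```python
def evaluate_policy(pi, S, T, R, gamma, theta=1e-8):
    """Iterative policy evaluation for a deterministic dict-based policy."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            a = pi[s]
            v = R[s][a] + gamma * sum(p * V[s2]
                                      for s2, p in T[s][a].items())
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

def policy_iteration(S, A, T, R, gamma):
    """Alternate policy evaluation and greedy improvement."""
    pi = {s: A[0] for s in S}  # arbitrary initial policy
    while True:
        V = evaluate_policy(pi, S, T, R, gamma)
        pi_new = improve_policy(V, S, A, T, R, gamma)
        if pi_new == pi:       # policy (and hence values) unchanged
            return pi, V
        pi = pi_new
```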
A 4×3 gridworld example
• Transitions are stochastic, as shown below.
[Figure: transition model: the agent moves in the intended direction with probability 0.7, and slips to either side with probability 0.1 each]
Policy Iteration (1)
[Figures: initial policy; state values after policy evaluation]
Policy Iteration (2)
[Figures: policy after improvement; state values after another policy evaluation]
Policy Iteration (3)
[Figures: policy after another improvement; state values (converged!)]
A 4×3 gridworld example
• New version:
– Transitions are still stochastic.
– Change the reward of the pit from -10 to -500.
Agent actively tries to avoid the goal, for fear of falling into the pit!
Generalized Policy Iteration
• Any combination of policy evaluation and policy improvement steps, e.g., only update the value of one state, and immediately improve the policy at that state.
Optimal policies and optimal value functions
• The optimal value function, V*, is defined as the best value that can be achieved at any state:
V*(s) = maxπ Vπ(s)
• In a finite MDP, there exists a unique optimal value
function (shown by Bellman, 1957).
• Any policy that achieves the optimal value function is called an optimal policy (denoted π*). The optimal policy is not necessarily unique.
Optimal policies in the gridworld example
• Optimal state values give information about the shortest path to the goal.
• One of the deterministic optimal policies is shown below.
• There can be an infinite number of optimal policies (think
stochastic policies).
Complexity of policy iteration
Repeat 2 basic steps:
1. Compute Vπ, using policy evaluation.
2. Compute a new policy π’ that is greedy with respect to Vπ.
Repeat for how many iterations?
Complexity of policy iteration
Repeat 2 basic steps:
1. Compute Vπ, using policy evaluation.
Per iteration: O(|S|³)
2. Compute a new policy π’ that is greedy with respect to Vπ.
Per iteration: O(|S|²|A|)
Repeat for how many iterations? At most |A|^|S| (the number of deterministic policies).
Can get very expensive when there are many states!
What you should know
• Definition of MDP framework.
• Differences/similarities between MDPs and other AI approaches (e.g. general search, game playing, STRIPS planning).
• Basic MDP algorithms and their properties:
• Policy iteration
• Policy evaluation
• Policy improvement