
Markov Decision Processes
CSci 5512: Artificial Intelligence II
Instructor:
February 22, 2022


Announcements
HW2 due this Thursday (Feb 24)
Exam in 1 week (posted Mar 1, due Mar 3)
Will cover all topics through the Feb 24 lecture
Chapters 12-14, 16-18

Sequential Decision Problems

Markov Decision Process (MDP)
States s ∈ S, actions a ∈ A
Model T(s,a,s′) ≡ P(s′|s,a)
Reward function R(s, a, s′) (or R(s), R(s, a))
R(s, a, s′) = −0.04 (small penalty) for non-terminal states, ±1 for terminal states
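To make the model concrete, here is a minimal Python sketch of this grid-world MDP. The -0.04 step penalty and the ±1 terminal rewards come from the slide; the 4x3 layout with one blocked square, the terminal locations, and the 0.8/0.1/0.1 "intended move vs. sideways slip" transition model are assumptions based on the usual version of this example (the slides' figure is not reproduced here).

# A minimal sketch of the 4x3 grid-world MDP, assuming the standard layout:
# wall at (1, 1), terminals at (3, 2) [+1] and (3, 1) [-1], 0-indexed (column, row).
# The 0.8/0.1/0.1 slip model is an assumption, not stated in the slide text.

WALL = (1, 1)
TERMINALS = {(3, 2): +1.0, (3, 1): -1.0}
STATES = [(x, y) for x in range(4) for y in range(3) if (x, y) != WALL]
ACTIONS = ['N', 'S', 'E', 'W']
MOVES = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0)}
LEFT_OF = {'N': 'W', 'W': 'S', 'S': 'E', 'E': 'N'}
RIGHT_OF = {'N': 'E', 'E': 'S', 'S': 'W', 'W': 'N'}

def step(s, a):
    # Deterministic move; bumping into the wall or the grid edge stays put.
    dx, dy = MOVES[a]
    nxt = (s[0] + dx, s[1] + dy)
    if nxt == WALL or not (0 <= nxt[0] < 4 and 0 <= nxt[1] < 3):
        return s
    return nxt

def T(s, a, s_prime):
    # Transition model P(s'|s,a): intended move w.p. 0.8, slip left/right w.p. 0.1 each.
    if s in TERMINALS:                      # treat terminal states as absorbing
        return 1.0 if s_prime == s else 0.0
    p = 0.0
    for prob, direction in [(0.8, a), (0.1, LEFT_OF[a]), (0.1, RIGHT_OF[a])]:
        if step(s, direction) == s_prime:
            p += prob
    return p

def R(s, a, s_prime):
    # One reading of the slide's reward spec: ±1 when entering a terminal state,
    # -0.04 otherwise, and no further reward once terminated.
    if s in TERMINALS:
        return 0.0
    return TERMINALS.get(s_prime, -0.04)

# Sanity check: outgoing probabilities from a state sum to 1.
assert abs(sum(T((0, 0), 'N', sp) for sp in STATES) - 1.0) < 1e-9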

Solving MDPs
In search problems, aim is to find an optimal sequence
In MDPs, aim is to find an optimal policy π(s)
Best action for every possible state s
Cannot predict where one will end up
Optimal policy maximizes expected sum of rewards
Optimal policy when state penalty R(s, a, s′) is -0.04

Reward and Optimal Policy

Utility of State Sequences
Need to understand preferences between sequences of states
Typically consider stationary preferences on reward sequences
[r,r0,r1,r2,…] ≻ [r,r0′,r1′,r2′,…] ⇔ [r0,r1,r2,…] ≻ [r0′,r1′,r2′,…]
Theorem: Only two ways to combine rewards over time:
Additive utility function:
U([s0, s1, s2, . . . ]) = R(s0, a0, s1)+R(s1, a1, s2)+R(s2, a2, s3)+. . .
Discounted utility function: For discount factor γ
U([s0, s1, s2, . . . ]) = R(s0, a0, s1)+γR(s1, a1, s2)+γ2R(s2, a2, s3)+. . .
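As a quick illustration of the two utility functions, the sketch below evaluates a short reward sequence under both the additive (γ = 1) and discounted cases; the particular rewards and discount factor are made-up assumptions chosen only for the example.

# A tiny sketch of additive vs. discounted utility on a made-up reward sequence.
def discounted_utility(rewards, gamma):
    # U = r0 + gamma*r1 + gamma^2*r2 + ...  (additive utility is the gamma = 1 case)
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [-0.04, -0.04, -0.04, 1.0]            # three step penalties, then a +1 terminal reward
print(discounted_utility(rewards, gamma=1.0))   # additive:   0.88
print(discounted_utility(rewards, gamma=0.9))   # discounted: ~0.62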

Utility of States
Utility U(s) of a state (i.e., its value)
Expected (discounted) sum of rewards (until termination)
Assume optimal actions
Choosing the best action is just MEU
Maximize the expected utility of the immediate successors
Utilities of the states are given
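A minimal sketch of this MEU action choice, assuming the utilities of the successor states are already known; the toy transition model, rewards, and γ below are placeholders for illustration, not taken from the slides.

# MEU action selection via one-step lookahead over given utilities U(s').
def best_action(s, actions, P, R, U, gamma):
    # argmax over a of sum_{s'} P(s'|s,a) * [R(s,a,s') + gamma*U(s')]
    def expected_utility(a):
        return sum(p * (R(s, a, s2) + gamma * U[s2]) for s2, p in P(s, a).items())
    return max(actions, key=expected_utility)

# Hypothetical example: 'safe' surely reaches B; 'risky' reaches high-utility C 30% of the time.
P = lambda s, a: {'safe': {'B': 1.0}, 'risky': {'C': 0.3, 'B': 0.7}}[a]
R = lambda s, a, s2: 0.0
U = {'B': 1.0, 'C': 5.0}
print(best_action('A', ['safe', 'risky'], P, R, U, gamma=0.9))   # -> 'risky' (EU 1.98 vs 0.90)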

Utilities (cont.)
Problem: Infinite lifetimes =⇒ Additive utilities are infinite
Finite horizon:
Termination at a fixed time T
Non-stationary policy as π(s) depends on time left
Absorbing state(s):
With probability 1, agent eventually “dies” for any π
Expected utility of every state is finite
Discounting:
Assuming γ < 1 and R(s, a, s′) ≤ Rmax:
U([s0, a0, s1, . . . ]) = Σ_{t=0}^∞ γ^t R(st, at, st+1) ≤ Rmax/(1 − γ)
Average reward per time step:
Theorem: π∗(s) has constant gain after an initial transient

Utilities (cont.)
Utility U(st) of state st is the expected discounted sum of rewards
U(st) = E[R(st, at, st+1) + γ R(st+1, at+1, st+2) + γ^2 R(st+2, at+2, st+3) + . . . ]
      = E[R(st, at, st+1) + γ (R(st+1, at+1, st+2) + γ R(st+2, at+2, st+3) + . . . )]
      = E[R(st, at, st+1) + γ U(st+1)]
Note the recursive nature of the utility
Utility depends on the immediate reward plus all future rewards

The Optimal Policy
Given a policy π, the overall (discounted) utility is
Uπ(s0) = E[ Σ_{t=0}^∞ γ^t R(st, π(st), st+1) | π ]
If the utilities U(s) are known, the optimal policy is MEU:
π∗(s) = argmax_{a∈A} Σ_{s′} P(s′|s, a) [R(s, a, s′) + γ U(s′)]

Dynamic Programming: The Bellman Equation
Simple relationship among utilities of neighboring states
Expected sum of rewards = current reward + γ × expected sum of rewards after taking best action
Bellman equation (1957):
U(s) = max_{a∈A} Σ_{s′} P(s′|s, a) [R(s, a, s′) + γ U(s′)]
One equation per state s
n non-linear equations in n unknowns

Value Iteration Algorithm
Start with arbitrary utility values
Update to make them locally consistent
Everywhere locally consistent ⇒ Global optimality
For iteration i and every s simultaneously:
Ui+1(s) ← max_{a∈A} Σ_{s′} P(s′|s, a) [R(s, a, s′) + γ Ui(s′)]   ∀s
Repeat the above update until there is “no change”
Convergence?

Policy Iteration
Search for the optimal policy and utility values simultaneously
Algorithm:
π ← an arbitrary initial policy
Repeat until no change in π:
Policy Evaluation: Compute utilities given π
U(s) = Σ_{s′} P(s′|s, π(s)) [R(s, π(s), s′) + γ U(s′)]   ∀s
Policy Improvement: Update π as if the utilities were correct (i.e., local MEU)
π(s) ← argmax_{a∈A} Σ_{s′} P(s′|s, a) [R(s, a, s′) + γ U(s′)]   ∀s

Modified Policy Iteration
Policy iteration often converges in few iterations
But each iteration is expensive
Main Idea: An approximate value determination step
Use a few steps of value iteration (with π fixed)
Start from the value function produced in the last iteration
Often converges much faster than pure VI or PI
Leads to much more general algorithms
Value/Policy updates can be performed locally in any order
Reinforcement learning algorithms use such updates
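To tie the updates above together, here is a self-contained Python sketch of value iteration, MEU policy extraction, and policy iteration. The two-state MDP at the end is a made-up toy example; its states, actions, transition probabilities, and rewards are assumptions chosen only for illustration.

# Value iteration, greedy policy extraction, and policy iteration on a toy MDP.
def value_iteration(states, actions, P, R, gamma, eps=1e-6):
    # Repeat U_{i+1}(s) <- max_a sum_{s'} P(s'|s,a)[R(s,a,s') + gamma*U_i(s')] until convergence.
    U = {s: 0.0 for s in states}
    while True:
        U_next = {s: max(sum(p * (R(s, a, s2) + gamma * U[s2])
                             for s2, p in P(s, a).items())
                         for a in actions)
                  for s in states}
        delta = max(abs(U_next[s] - U[s]) for s in states)
        U = U_next
        if delta < eps * (1 - gamma) / gamma:   # standard stopping criterion
            return U

def greedy_policy(states, actions, P, R, gamma, U):
    # MEU policy: pi(s) = argmax_a sum_{s'} P(s'|s,a)[R(s,a,s') + gamma*U(s')]
    return {s: max(actions,
                   key=lambda a: sum(p * (R(s, a, s2) + gamma * U[s2])
                                     for s2, p in P(s, a).items()))
            for s in states}

def policy_iteration(states, actions, P, R, gamma, eval_sweeps=50):
    # Alternate (approximate) policy evaluation with greedy policy improvement.
    pi = {s: actions[0] for s in states}        # arbitrary initial policy
    U = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: a few sweeps of the fixed-policy Bellman update
        # (exact evaluation would instead solve the linear system).
        for _ in range(eval_sweeps):
            U = {s: sum(p * (R(s, pi[s], s2) + gamma * U[s2])
                        for s2, p in P(s, pi[s]).items())
                 for s in states}
        new_pi = greedy_policy(states, actions, P, R, gamma, U)
        if new_pi == pi:
            return pi, U
        pi = new_pi

# Toy MDP (assumed for illustration): 'work' in 'low' reaches 'high' 60% of the time;
# acting from 'high' pays +1 but falls back to 'low' 20% of the time.
states = ['low', 'high']
actions = ['wait', 'work']
def P(s, a):
    table = {('low', 'wait'):  {'low': 1.0},
             ('low', 'work'):  {'high': 0.6, 'low': 0.4},
             ('high', 'wait'): {'high': 0.8, 'low': 0.2},
             ('high', 'work'): {'high': 0.8, 'low': 0.2}}
    return table[(s, a)]
def R(s, a, s2):
    return 1.0 if s == 'high' else 0.0

U = value_iteration(states, actions, P, R, gamma=0.9)
print(U)                                                        # ≈ {'low': 6.59, 'high': 7.80}
print(greedy_policy(states, actions, P, R, gamma=0.9, U=U))     # {'low': 'work', 'high': 'wait'}
print(policy_iteration(states, actions, P, R, gamma=0.9)[0])    # same policy as above

Because the evaluation step runs only a fixed number of sweeps rather than solving the linear system exactly, this sketch is closer to the modified policy iteration described above than to pure policy iteration.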