Markov Decision Processes
CSci 5512: Artificial Intelligence II
Instructor:
February 22, 2022
Announcements
HW2 due this Thursday (Feb 24)
Exam in 1 week (posted Mar 1, due Mar 3)
Will cover all topics through the Feb 24 lecture (Chapters 12-14, 16-18)
Sequential Decision Problems
Markov Decision Process (MDP)
States s ∈ S, actions a ∈ A
Model T(s,a,s′) ≡ P(s′|s,a)
Reward function R(s, a, s′) (or R(s, a), R(s))
R(s, a, s′) = −0.04 (small penalty) for non-terminal states, ±1 for terminal states
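As a concrete illustration, here is a minimal Python sketch of a 4×3-style grid world MDP. The 0.8/0.1/0.1 transition noise, the wall position, and the discount factor are illustrative assumptions; only the rewards (−0.04 for non-terminal, ±1 for terminal) come from the slide.

```python
# Minimal sketch of a 4x3 grid-world MDP. Transition noise, wall position,
# and gamma are assumed for illustration; rewards follow the slide.

GAMMA = 0.9
ACTIONS = ['N', 'S', 'E', 'W']
MOVES = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0)}
TERMINALS = {(3, 2): +1.0, (3, 1): -1.0}        # goal and pit
WALL = (1, 1)
STATES = [(x, y) for x in range(4) for y in range(3) if (x, y) != WALL]

def reward(s, a, s2):
    """R(s, a, s'): -0.04 living penalty; +/-1 for entering a terminal."""
    if s in TERMINALS:
        return 0.0                               # no further reward after a terminal
    return TERMINALS.get(s2, -0.04)

def transition(s, a):
    """T(s, a, .) returned as a dict {s': P(s' | s, a)}."""
    if s in TERMINALS:
        return {s: 1.0}                          # treat terminals as absorbing
    perpendicular = {'N': 'EW', 'S': 'EW', 'E': 'NS', 'W': 'NS'}
    dist = {}
    for act, p in [(a, 0.8)] + [(b, 0.1) for b in perpendicular[a]]:
        dx, dy = MOVES[act]
        s2 = (s[0] + dx, s[1] + dy)
        if s2 not in STATES:                     # bump into wall or edge: stay put
            s2 = s
        dist[s2] = dist.get(s2, 0.0) + p
    return dist
```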
Solving MDPs
In search problems, aim is to find an optimal sequence
In MDPs, aim is to find an optimal policy π(s)
Best action for every possible state s
Cannot predict where one will end up
Optimal policy maximizes expected sum of rewards
Optimal policy when state penalty R(s, a, s′) is −0.04
Reward and Optimal Policy
Utility of State Sequences
Need to understand preferences between sequences of states
Typically consider stationary preferences on reward sequences
[r, r0, r1, r2, . . . ] ≻ [r, r0′, r1′, r2′, . . . ] ⇔ [r0, r1, r2, . . . ] ≻ [r0′, r1′, r2′, . . . ]
Theorem: Only two ways to combine rewards over time:
Additive utility function:
U([s0, s1, s2, . . . ]) = R(s0, a0, s1)+R(s1, a1, s2)+R(s2, a2, s3)+. . .
Discounted utility function: For discount factor γ
U([s0, s1, s2, . . . ]) = R(s0, a0, s1)+γR(s1, a1, s2)+γ2R(s2, a2, s3)+. . .
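A tiny numeric example of the two utility functions; the reward sequence and γ = 0.9 below are made-up values for illustration, not from the slides.

```python
# Additive vs. discounted utility of a finite reward sequence
# (made-up rewards: three -0.04 steps, then a +1 terminal reward).

def additive_utility(rewards):
    return sum(rewards)

def discounted_utility(rewards, gamma):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

rewards = [-0.04, -0.04, -0.04, 1.0]
print(additive_utility(rewards))            # approx. 0.88
print(discounted_utility(rewards, 0.9))     # -0.04 - 0.036 - 0.0324 + 0.729, approx. 0.62
```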
Utility of States
Utility U(s) of a state (i.e., its value)
Expected (discounted) sum of rewards (until termination)
Assume optimal actions
Choosing the best action is just MEU
Maximize the expected utility of the immediate successors
Utilities of the states are given
Utilities (cont.)
Problem: Infinite lifetimes =⇒ Additive utilities are infinite
Finite horizon:
Termination at a fixed time T
Non-stationary policy as π(s) depends on time left
Absorbing state(s):
With probability 1, agent eventually “dies” for any π
Expected utility of every state is finite
Discounting:
Assuming γ < 1, R(s, a, s′) ≤ Rmax
U([s0, a0, s1, . . . ]) = Σ_{t=0}^∞ γ^t R(st, at, st+1) ≤ Rmax/(1 − γ)  (see the derivation below)
Average reward per time step
Theorem: π∗(s) has constant gain after initial transient
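The Rmax/(1 − γ) bound in the discounting case is just a geometric series; spelled out for completeness:

\[
\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t, s_{t+1})
\;\le\; R_{\max} \sum_{t=0}^{\infty} \gamma^{t}
\;=\; \frac{R_{\max}}{1-\gamma},
\qquad 0 \le \gamma < 1 .
\]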
Utilities (cont.)
Utility U(st) of state st is expected discounted sum of rewards
U(st) = E[R(st, at, st+1) + γR(st+1, at+1, st+2) + γ²R(st+2, at+2, st+3) + . . . ]
      = E[R(st, at, st+1) + γ(R(st+1, at+1, st+2) + γR(st+2, at+2, st+3) + . . . )]
      = E[R(st, at, st+1) + γU(st+1)]
Note the recursive nature of the utility
Utility depends on the immediate reward plus all future reward
The Optimal Policy
Given a policy π, the overall (discounted) utility
Uπ(s0) = E[ Σ_{t=0}^∞ γ^t R(st, π(st), st+1) | π ]
If utilities U(s) are known, optimal policy is MEU
π∗(s) = argmax_{a∈A} Σ_{s′} P(s′|s, a) [R(s, a, s′) + γU(s′)]
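A short sketch of this MEU policy extraction in Python, assuming helper names like those in the grid-world sketch earlier (`transition(s, a)` returning `{s': P(s'|s,a)}`, `reward(s, a, s2)`); these names are illustrative, not from the slides.

```python
def expected_utility(s, a, U, transition, reward, gamma):
    """sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * U(s'))"""
    return sum(p * (reward(s, a, s2) + gamma * U[s2])
               for s2, p in transition(s, a).items())

def best_policy(U, states, actions, transition, reward, gamma):
    """pi*(s) = argmax_a sum_{s'} P(s'|s,a) [R(s,a,s') + gamma * U(s')]"""
    return {s: max(actions,
                   key=lambda a: expected_utility(s, a, U, transition, reward, gamma))
            for s in states}
```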
Dynamic Programming: The Bellman Equation
Simple relationship among utilities of neighboring states
Expected sum of rewards = current reward + γ× Expected sum of rewards after taking best action
Bellman equation (1957):
U(s) = max_{a∈A} Σ_{s′} P(s′|s, a) [R(s, a, s′) + γU(s′)]
One equation per state s
n non-linear equations in n unknowns
Value Iteration Algorithm
Start with arbitrary utility values
Update to make them locally consistent
Everywhere locally consistent ⇒ Global optimality
For iteration i and every s simultaneously:
Ui+1(s) ← max_{a∈A} Σ_{s′} P(s′|s, a) [R(s, a, s′) + γUi(s′)]   ∀s
Repeat the above update until there is “no change”
Convergence?
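A compact value-iteration sketch implementing the update above. The stopping rule δ < ε(1 − γ)/γ and the helper signatures (`transition(s, a)` → `{s': P}`, `reward(s, a, s')`) are assumptions for illustration.

```python
def value_iteration(states, actions, transition, reward, gamma, eps=1e-6):
    """U_{i+1}(s) <- max_a sum_{s'} P(s'|s,a) [R(s,a,s') + gamma * U_i(s')]"""
    U = {s: 0.0 for s in states}                  # arbitrary initial utilities
    while True:
        delta = 0.0
        U_next = {}
        for s in states:
            U_next[s] = max(
                sum(p * (reward(s, a, s2) + gamma * U[s2])
                    for s2, p in transition(s, a).items())
                for a in actions)
            delta = max(delta, abs(U_next[s] - U[s]))
        U = U_next
        if delta < eps * (1 - gamma) / gamma:     # standard convergence test
            return U
```

With the grid-world sketch from earlier, `value_iteration(STATES, ACTIONS, transition, reward, GAMMA)` would return utilities from which the MEU policy can be read off.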
Value Iteration Algorithm
Ui+1(s) ← max_{a∈A} Σ_{s′} P(s′|s, a) [R(s, a, s′) + γUi(s′)]   ∀s
Policy Iteration
Search for optimal policy and utility values simultaneously
Algorithm:
π ← an arbitrary initial policy
Repeat until no change in π
Policy Evaluation: Compute utilities given π
U(s) = Σ_{s′} P(s′|s, π(s)) [R(s, π(s), s′) + γU(s′)]   ∀s
Policy Improvement: Update π as if utilities were correct (i.e., local MEU)
π(s) ← argmax_{a∈A} Σ_{s′} P(s′|s, a) [R(s, a, s′) + γU(s′)]
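A sketch of policy iteration under the same assumed helpers; policy evaluation here is done exactly, by solving the n linear equations with NumPy (using NumPy is a choice made for this sketch, not something specified on the slide).

```python
import numpy as np

def policy_iteration(states, actions, transition, reward, gamma):
    """Alternate exact policy evaluation (linear solve) and greedy improvement."""
    idx = {s: i for i, s in enumerate(states)}
    pi = {s: actions[0] for s in states}           # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) U = R_pi
        n = len(states)
        A, b = np.eye(n), np.zeros(n)
        for s in states:
            for s2, p in transition(s, pi[s]).items():
                A[idx[s], idx[s2]] -= gamma * p
                b[idx[s]] += p * reward(s, pi[s], s2)
        U = dict(zip(states, np.linalg.solve(A, b)))
        # Policy improvement: local MEU with respect to the current utilities
        def q(s, a):
            return sum(p * (reward(s, a, s2) + gamma * U[s2])
                       for s2, p in transition(s, a).items())
        new_pi = {s: max(actions, key=lambda a, s=s: q(s, a)) for s in states}
        if new_pi == pi:
            return pi, U                           # no change: policy is optimal
        pi = new_pi
```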
Policy Iteration
Modified Policy Iteration
Policy iteration often converges in few iterations
But each iteration is expensive
Main Idea:
An approximate value determination step
Use a few steps of value iteration (with π fixed); see the sketch after this list
Start from the value function produced in the last iteration
Often converges much faster than pure VI or PI
Leads to much more general algorithms
Value/Policy updates can be performed locally in any order
Reinforcement learning algorithms use such updates
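A sketch of the modified scheme under the same assumed helpers: the exact evaluation step is replaced by k simplified Bellman backups with π held fixed, warm-started from the previous utilities (k = 10 is an arbitrary illustrative choice).

```python
def modified_policy_iteration(states, actions, transition, reward, gamma, k=10):
    """Policy iteration with approximate evaluation: k backups with pi fixed."""
    pi = {s: actions[0] for s in states}
    U = {s: 0.0 for s in states}                  # reused across outer iterations
    while True:
        # Approximate policy evaluation: k simplified backups (no max over actions)
        for _ in range(k):
            U = {s: sum(p * (reward(s, pi[s], s2) + gamma * U[s2])
                        for s2, p in transition(s, pi[s]).items())
                 for s in states}
        # Greedy policy improvement, as in ordinary policy iteration
        def q(s, a):
            return sum(p * (reward(s, a, s2) + gamma * U[s2])
                       for s2, p in transition(s, a).items())
        new_pi = {s: max(actions, key=lambda a, s=s: q(s, a)) for s in states}
        if new_pi == pi:
            return pi, U
        pi = new_pi
```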