COMP3702 Artificial Intelligence – Module 3: Reasoning and planning under uncertainty — Part 2 MDPs (Offline methods)
Dr Alina Bialkowski
Semester 2, 2021
The University of Queensland
School of Information Technology and Electrical Engineering
Week 7: Logistics
• RiPPLE round 2 – due 4pm this Friday, 10 Sep
• Note: simple copy-paste from lectures or repeated resources will not be approved
• Assignment 1 marking is in progress; we will release a video walkthrough/solution
• Assignment 2 released! – due Friday, 24 Sep
1
Last Week
Began Module 3: Reasoning and planning under uncertainty
• Probability Review
• AND-OR trees
• Introduction to Decision Theory
• Framework for choosing the “best” decision in non-deterministic environments
• Preferences (≺, ∼), Lotteries e.g. L = [p1 : o1, p2 : o2]
• 6 Axioms of Utility Theory / Rational Preferences
• Expected Utility
• Utility of Money
2
Week 7 — Overview
Continue Module 3: Reasoning and planning under uncertainty
• Markov decision processes 1: Representation and exact algorithms
3
Table of contents
1. Recap – utility and rewards over time
2. Decision-theoretic planning
3. Markov decision processes
4. Value Iteration
5. Policy Iteration
4
Recap – utility and rewards over time
Utility and time
• How would you compare the following sequences of rewards (per week):
A: $1000000, $0, $0, $0, $0, $0, ...
B: $1000, $1000, $1000, $1000, $1000, ...
C: $1000, $0, $0, $0, $0, ...
D: $1, $1, $1, $1, $1, ...
E: $1, $2, $3, $4, $5, ...
5
Rewards and Values
Suppose the agent receives a sequence of rewards r_1, r_2, r_3, r_4, ... in time. What utility should be assigned? This is the “return” or “value”:
• total reward V = ∑_{i=1}^∞ r_i
• average reward V = lim_{n→∞} (r_1 + ⋯ + r_n)/n
6
Rewards and Values
Suppose the agent receives a sequence of rewards r1, r2, r3, r4, . . . in time.
• discounted return V = r_1 + γ r_2 + γ^2 r_3 + γ^3 r_4 + ⋯
  γ is the discount factor, 0 ≤ γ ≤ 1.
7
Properties of the Discounted Rewards
• The discounted return for rewards r_1, r_2, r_3, r_4, ... is
  V = r_1 + γ r_2 + γ^2 r_3 + γ^3 r_4 + ⋯
    = r_1 + γ(r_2 + γ(r_3 + γ(r_4 + ...)))
• If V_t is the value obtained from time step t:
  V_t = r_t + γ V_{t+1}
• How is the infinite future valued compared to immediate rewards?
  1 + γ + γ^2 + γ^3 + ⋯ = 1/(1 − γ)
  Therefore
  (minimum reward)/(1 − γ) ≤ V_t ≤ (maximum reward)/(1 − γ)
• We can approximate V with the first k terms, with error:
  V − (r_1 + γ r_2 + ⋯ + γ^{k−1} r_k) = γ^k V_{k+1}
8
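As a small numerical check of the identities above, here is a minimal Python sketch (not part of the slides; the reward sequence and γ are made-up values):

```python
# Minimal sketch: truncated discounted return and its error (illustrative values only).
gamma = 0.9
rewards = [5.0, 1.0, 3.0, 2.0, 4.0, 1.0, 2.0, 3.0]   # hypothetical r_1, r_2, ..., r_n

def discounted_return(rs, gamma, start=0):
    # V = rs[start] + gamma * rs[start+1] + gamma^2 * rs[start+2] + ...
    return sum(gamma**i * r for i, r in enumerate(rs[start:]))

V = discounted_return(rewards, gamma)                            # full discounted return
k = 3
approx = sum(gamma**i * r for i, r in enumerate(rewards[:k]))    # first k terms only
V_k_plus_1 = discounted_return(rewards, gamma, k)                # value from time step k+1 onwards

# The approximation error equals gamma^k * V_{k+1}, as stated above.
print(V - approx, gamma**k * V_k_plus_1)                         # both print the same number
```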
Allais Paradox (1953)
What would you prefer:
A: $1m — one million dollars
B: lottery [0.10 : $2.5m, 0.89 : $1m, 0.01 : $0]
What would you prefer:
C: lottery [0.11 : $1m, 0.89 : $0]
D: lottery [0.10 : $2.5m, 0.9 : $0]
It is inconsistent with the axioms of preferences to have A ≻ B and D ≻ C.
A,C: lottery [0.11 : $1m, 0.89 : X ]
B,D: lottery [0.10 : $2.5m, 0.01 : $0, 0.89 : X ]
9
Framing Effects [Tversky and Kahneman]
• A disease is expected to kill 600 people. Two alternative programs have been proposed:
Program A: 200 people will be saved
Program B: probability 1/3: 600 people will be saved
probability 2/3: no one will be saved
Which program would you favor?
• A disease is expected to kill 600 people. Two alternative programs have been proposed:
Program C: 400 people will die
Program D: probability 1/3: no one will die
probability 2/3: 600 will die
Which program would you favor?
Tversky and Kahneman: 72% chose A over B.
22% chose C over D.
10
Decision-theoretic planning
Agents as Processes
Agents carry out actions:
• forever: infinite horizon
• until some stopping criterion is met: indefinite horizon
• for a finite and fixed number of steps: finite horizon
11
Decision-theoretic Planning
What should an agent do when
• it gets rewards (and penalties) and tries to maximize its rewards received
• actions can be stochastic; the outcome of an action can’t be fully predicted
• there is a model that specifies the (probabilistic) outcome of actions and the rewards
• the world is fully observable?
12
Initial Assumptions for decision-theoretic planning
• flat or modular or hierarchical
• explicit states or features or individuals and relations
• static or finite stage or indefinite stage or infinite stage
• fully observable or partially observable
• deterministic or stochastic dynamics
• goals or complex preferences
• single agent or multiple agents
• knowledge is given or knowledge is learned
• perfect rationality or bounded rationality
13
Markov decision processes
Markov decision processes (MDPs)
• A framework to find the best sequence of actions to perform when the outcome of each
action is non-deterministic
• The basis of Reinforcement Learning
• Examples:
• Games: Tic-Tac-Toe, Chess, Go, etc.
• Control systems: Traffic intersection
• How many patients to admit to hospital
• Navigation:
14
World State
• The world state is the information such that, if the agent knew the world state, no information about the past would be relevant to the future. This is the Markov assumption.
• If S_k is the state at time k and A_k is the action at time k, then:
  P(S_{t+1} | S_0, A_0, ..., S_t, A_t) = P(S_{t+1} | S_t, A_t)
  P(s′ | s, a) is the probability that the agent will be in state s′ immediately after doing action a in state s.
• The state captures all relevant information from the history
• The state is a sufficient statistic of the future
• The dynamics are stationary if the distribution is the same for each time point.
15
• A Markov decision process augments a Markov chain with actions and values:
16
Markov Decision Processes
An MDP consists of:
• a set S of states
• a set A of actions
• P(S_{t+1} | S_t, A_t), which specifies the dynamics or transition function
• R(S_t, A_t, S_{t+1}), which specifies the reward at time t.
  The reward is sometimes a random variable; R(s, a, s′) is then the expected reward received when the agent is in state s, does action a and ends up in state s′.
  Sometimes we use R(s, a) = ∑_{s′} P(s′ | s, a) R(s, a, s′).
• γ, the discount factor.
• An MDP's objective is: E[ ∑_{t=0}^T γ^t R(s_t, a_t) ].
17
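To make these components concrete, here is one possible (illustrative, not prescribed) way to represent a finite MDP in Python; the class and field names are assumptions used by the later sketches:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class MDP:
    """A finite MDP: states S, actions A, transition function P, reward R, discount gamma."""
    states: List[str]                                  # S
    actions: List[str]                                 # A
    P: Dict[Tuple[str, str], Dict[str, float]]         # P[(s, a)][s'] = P(s' | s, a)
    R: Callable[[str, str, str], float]                # R(s, a, s')
    gamma: float                                       # discount factor

    def expected_reward(self, s: str, a: str) -> float:
        # R(s, a) = sum_{s'} P(s' | s, a) * R(s, a, s')
        return sum(p * self.R(s, a, s2) for s2, p in self.P[(s, a)].items())
```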
Example MDP Definition: to exercise or not?
Each week Alina has to decide whether to exercise or not:
• States: {fit, unfit}
• Actions: {exercise, relax}
• Dynamics:
State Action P(fit | State,Action)
fit exercise 0.99
fit relax 0.7
unfit exercise 0.2
unfit relax 0.0
• Reward (does not depend on resulting state):
State Action Reward
fit exercise 8
fit relax 10
unfit exercise 0
unfit relax 5
18
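Encoded with the sketch above, the exercise MDP looks as follows (the transition and reward numbers come from the tables; γ = 0.9 is an assumed value, not given on the slide):

```python
# The "to exercise or not?" MDP, using the MDP sketch from the previous example.
P = {
    ("fit",   "exercise"): {"fit": 0.99, "unfit": 0.01},
    ("fit",   "relax"):    {"fit": 0.7,  "unfit": 0.3},
    ("unfit", "exercise"): {"fit": 0.2,  "unfit": 0.8},
    ("unfit", "relax"):    {"fit": 0.0,  "unfit": 1.0},
}
# Reward depends only on the state and action, not on the resulting state.
reward_table = {("fit", "exercise"): 8, ("fit", "relax"): 10,
                ("unfit", "exercise"): 0, ("unfit", "relax"): 5}

exercise_mdp = MDP(states=["fit", "unfit"],
                   actions=["exercise", "relax"],
                   P=P,
                   R=lambda s, a, s2: reward_table[(s, a)],
                   gamma=0.9)  # assumed discount factor
```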
Example MDP Definition: Simple Grid World
19
Grid World Model
• States: 100 states corresponding to the positions of the robot.
• Actions: up, down, left, right.
• Transitions: Robot goes in the commanded direction with probability 0.7, and in each of the other directions with probability 0.1.
• Rewards: If it crashes into an outside wall, it remains in its current position and has a
reward of −1.
• Four special rewarding states; the agent gets the reward when leaving.
20
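A sketch of how these transition dynamics might be coded, assuming a 10×10 grid with (row, column) states; the function name and layout are illustrative:

```python
# Illustrative transition model for the 10x10 grid world described above.
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
N = 10  # 100 states laid out as a 10 x 10 grid of (row, col) positions

def grid_transition(s, a):
    """Return {s': P(s' | s, a)}: commanded direction w.p. 0.7, each other direction w.p. 0.1."""
    probs = {}
    for direction, p in [(a, 0.7)] + [(d, 0.1) for d in ACTIONS if d != a]:
        dr, dc = ACTIONS[direction]
        r, c = s[0] + dr, s[1] + dc
        s2 = (r, c) if 0 <= r < N and 0 <= c < N else s  # crashing into a wall leaves the position unchanged
        probs[s2] = probs.get(s2, 0.0) + p
    return probs
```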
Planning Horizons
The planning horizon is how far ahead the planner looks to make a decision.
• The robot gets flung to one of the corners at random after leaving a positive (+10 or +3)
reward state.
• the process never halts
• infinite horizon
• The robot gets +10 or +3 in the state, then it stays there getting no reward. Or it is left
with only a special action exit, and the episode ends. These are absorbing states.
• the robot will eventually reach an absorbing state. WHY?
• indefinite horizon
21
Information Availability
What information is available when the agent decides what to do?
• fully-observable MDP: the agent gets to observe S_t when deciding on action A_t.
• partially-observable MDP (POMDP): the agent has some noisy sensor of the state. It is a mix of a hidden Markov model and an MDP. It needs to remember (some function of) its sensing and acting history.
[This lecture only considers FOMDPs]
22
Policy
• A policy is a sequence of actions taken to move from each state to the next state over the whole time horizon.
• A stationary policy is a function or a map:
  π : S → A
  Given a state s, π(s) specifies what action the agent who is following π will do.
• An optimal policy, usually denoted π*, is one with maximum expected discounted reward:
  max_π E[ ∑_{t=0} γ^t R(s_t, π(s_t)) ]
  where π(s_t) is the action taken at time t.
• For a fully-observable MDP with stationary dynamics and rewards with infinite or
indefinite horizon, there is always an optimal stationary policy.
23
Example: to exercise or not?
Each week Alina has to decide whether to exercise or not:
• States: {fit, unfit}
• Actions: {exercise, relax}
How many stationary policies are there?
What are they?
For the grid world with 100 states and 4 actions
How many stationary policies are there?
24
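For reference, a worked count (not on the slide): a stationary policy picks one of |A| actions for each of |S| states, so there are |A|^|S| of them.

```latex
\text{Exercise MDP: } |A|^{|S|} = 2^2 = 4
\qquad \text{Grid world: } |A|^{|S|} = 4^{100}
```

For the exercise MDP these four policies are: always exercise, always relax, exercise only when fit, and exercise only when unfit.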
What is the solution of an MDP problem?
• For the simple navigation “grid world”:
• A policy, or mapping from states to actions π : S → A, could be given as below
[Figure: example grid-world policy, with +1 and −1 terminal states]
• A policy, π(s), tells us what action the agent should perform in each state
• The optimal policy, π*(s), tells us the best action for the agent to perform in each state
25
Which is the “best” policy?
Suppose the agent receives the sequence of rewards r_1, r_2, r_3, r_4, .... What value should be assigned?
• total reward V = ∑_{i=1}^∞ r_i
• average reward V = lim_{n→∞} (r_1 + ⋯ + r_n)/n
• discounted reward V = r_1 + γ r_2 + γ^2 r_3 + γ^3 r_4 + ⋯
  γ is the discount factor, 0 ≤ γ ≤ 1.
26
Discounted Rewards
• V = r_1 + γ r_2 + γ^2 r_3 + ⋯ + γ^{k−1} r_k + ⋯
• V = ∑_{k=1}^∞ γ^{k−1} r_k
• The discount factor γ ∈ [0, 1) gives the present value of future rewards
• The value of receiving reward r after k + 1 time-steps is γ^k r
• This values immediate reward above delayed reward:
  • γ close to 0 leads to “myopic” evaluation
  • γ close to 1 leads to “far-sighted” evaluation
• Recall the earlier slides
27
Discounted Rewards
Different criteria for defining “best”. Maximise one of the following:
• Myopic:
  E[ R(s_t, π(s_t)) ]
• Finite horizon:
  E[ ∑_{t=0}^k R(s_t, π(s_t)) ]
  Policy changes over time. Optimal policy: non-stationary.
• Infinite horizon:
  max_π E[ ∑_{t=0} γ^t R(s_t, π(s_t)) ]
  Optimal policy: stationary.
28
Value of a Policy
We first approach MDPs using a recursive reformulation of the objective called a value function.
• The value function of an MDP, V^π(s), is the expected future reward of following an (arbitrary) policy π, starting from state s, given by:
  V^π(s) = ∑_{s′∈S} P(s′ | π(s), s) [ R(s, π(s), s′) + γ V^π(s′) ]
  where the policy π(s) determines the action taken in state s.
• Here we have dropped the time index, as it is redundant, but note that a_t = π(s_t).
29
Value of a Policy
Given a policy π:
• The Q-function represents the value of choosing an action and then following policy π in every subsequent state.
• Q^π(s, a), where a is an action and s is a state, is the expected value of doing a in state s, then following policy π.
• Q^π and V^π can be defined mutually recursively:
  Q^π(s, a) = ∑_{s′} P(s′ | a, s) ( R(s, a, s′) + γ V^π(s′) )
  V^π(s) = Q^π(s, π(s))
30
Computing the Value of a Policy
Let v^π ∈ R^{|S|} be a vector of values for each state, and r ∈ R^{|S|} be a vector of rewards for each state.
Let P^π ∈ R^{|S|×|S|} be a matrix containing the probabilities for each transition under policy π, where:
  P^π_{ij} = P(s_{t+1} = j | s_t = i, a_t = π(s_t))
Then the value function can be written in vector form as:
  v^π = r + γ P^π v^π
We can solve this using linear algebra:
  ⇒ (I − γ P^π) v^π = r
  ⇒ v^π = (I − γ P^π)^{−1} r
i.e., computing the value of a policy requires solving a linear system.
31
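A minimal NumPy sketch of this linear solve (illustrative; it assumes the transition matrix and reward vector for a fixed policy have already been built):

```python
import numpy as np

def evaluate_policy(P_pi, r, gamma):
    """Solve v = r + gamma * P_pi @ v for the value of a fixed policy.

    P_pi : (|S|, |S|) transition matrix under the policy (rows sum to 1)
    r    : (|S|,) expected immediate reward in each state under the policy
    """
    n = P_pi.shape[0]
    # Solve (I - gamma * P_pi) v = r directly rather than forming the matrix inverse.
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r)

# Hypothetical example: the fit/unfit MDP under the "always relax" policy.
P_pi = np.array([[0.7, 0.3],
                 [0.0, 1.0]])
r = np.array([10.0, 5.0])
print(evaluate_policy(P_pi, r, gamma=0.9))
```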
Value of the Optimal Policy
• An optimal policy, π*, expressed in terms of the value function, is one that satisfies Bellman's optimality condition (1957):
  V*(s) = max_a ∑_{s′} P(s′ | a, s) [ R(s, a, s′) + γ V*(s′) ]
• V*(s) is the expected value of following the optimal policy in state s.
• Similarly, Q*(s, a), where a is an action and s is a state, is the expected value of doing a in state s, then following the optimal policy.
• Q* and V* can be defined mutually recursively:
  Q*(s, a) = ∑_{s′} P(s′ | a, s) ( R(s, a, s′) + γ V*(s′) )
  V*(s) = max_a Q*(s, a)
  π*(s) = arg max_a Q*(s, a)
32
Value Iteration
Value Iteration
• Let V_k be the k-step lookahead value function.
• Idea: given an estimate of the k-step lookahead value function, determine the (k + 1)-step lookahead value function.
1. Set V_0 arbitrarily, e.g.:
   V̂(s) ← 0
2. Compute V_{i+1} from V_i. Loop over all s:
   V_{i+1}(s) = max_a ∑_{s′} P(s′ | a, s) { R(s, a, s′) + γ V_i(s′) }
Once the values converge, recover the best policy from the current value function estimate:
   arg max_a ∑_{s′} P(s′ | a, s) { R(s, a, s′) + γ V̂(s′) }
33
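A compact Python sketch of this procedure (illustrative; it reuses the dictionary-style transition model from the earlier MDP sketch, and the convergence threshold eps is an assumption):

```python
def value_iteration(states, actions, P, R, gamma, eps=1e-6):
    """Iterate V(s) <- max_a sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V(s')) until convergence."""
    V = {s: 0.0 for s in states}                        # step 1: arbitrary initialisation
    while True:
        V_new = {s: max(sum(p * (R(s, a, s2) + gamma * V[s2])
                            for s2, p in P[(s, a)].items())
                        for a in actions)
                 for s in states}                       # step 2: Bellman backup for every state
        delta = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if delta < eps:
            break
    # Recover the greedy policy from the converged value estimate.
    policy = {s: max(actions, key=lambda a: sum(p * (R(s, a, s2) + gamma * V[s2])
                                                for s2, p in P[(s, a)].items()))
              for s in states}
    return V, policy

# Example usage with the exercise MDP encoded earlier:
# V, pi = value_iteration(exercise_mdp.states, exercise_mdp.actions, exercise_mdp.P,
#                         exercise_mdp.R, exercise_mdp.gamma)
```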
Value Iteration
• Theorem: there is a unique function V* satisfying these equations.
• If we know V*, the optimal policy can be generated easily.
• Value iteration is guaranteed to converge to V*.
• There is no guarantee that we reach the optimum in finite time, but value iteration converges exponentially fast (in k) to the optimal value function.
• The error reduces proportionally to γ^k / (1 − γ).
34
Grid world
[Figure: grid world with reward states of value 1 and −100; γ = 0.9]
The agent can move up, down, left or right.
Moves succeed with p = 0.8, or go perpendicular to the commanded direction with p = 0.1 for each perpendicular direction.
If the agent hits a wall, it stays where it is.
35
VI — initialisation
[Figure: all state values initialised to 0]
36
VI — 1 iteration
[Figure: state values after 1 iteration: 0 everywhere except cells with values 1 and −100]
37
VI — 2 iterations
[Figure: state values after 2 iterations: nonzero values 0.72, 1 and −100; all other cells 0]
38
VI — 3 iterations
[Figure: state values after 3 iterations: nonzero values 0.5184, 0.0648, 0.7848, 1 and −100; all other cells 0]
39
VI — 4 iterations
[Figure: state values after 4 iterations: nonzero values 0.3732, 0.6584, 0.0467, 0.1173, 0.7965, 1 and −100; all other cells 0]
40
VI — termination
[Figure: converged state values: 0.4800, 0.4215, 0.3717, 0.1760; 0.5540, 0.3860; 0.6310, 0.7282, 0.8294, 1; −100]
41
Asynchronous Value Iteration
• The agent doesn't need to sweep through all the states; it can update the value function for each state individually.
• Update assignments can be applied to the states in any order, even at random.
• This converges to the optimal value function if each state and action is visited infinitely often in the limit.
• It can store either V[s] or Q[s, a].
42
Asynchronous VI
Storing V[s]
• Repeat forever:
  1. Select a state s
  2. V[s] ← max_a ∑_{s′} P(s′ | s, a) ( R(s, a, s′) + γ V[s′] )
Storing Q[s, a]
• Repeat forever:
  1. Select a state s and an action a
  2. Q[s, a] ← ∑_{s′} P(s′ | s, a) ( R(s, a, s′) + γ max_{a′} Q[s′, a′] )
43
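A sketch of the Q-storing variant in Python (illustrative; uniformly random state/action selection and a fixed number of updates are assumptions, not part of the slide):

```python
import random

def asynchronous_value_iteration(states, actions, P, R, gamma, num_updates=100000):
    """Asynchronous VI storing Q[s, a]; one randomly chosen (s, a) pair is updated per step."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(num_updates):
        s, a = random.choice(states), random.choice(actions)
        Q[(s, a)] = sum(p * (R(s, a, s2) + gamma * max(Q[(s2, a2)] for a2 in actions))
                        for s2, p in P[(s, a)].items())
    # Greedy policy with respect to the current Q estimate.
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
```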
Policy Iteration
Policy Iteration
• Set π_0 arbitrarily and let i = 0
• Repeat:
  1. Solve for V^{π_i}(s) (or Q^{π_i}(s, a)):
     V^{π_i}(s) = ∑_{s′∈S} P(s′ | π_i(s), s) [ R(s, π_i(s), s′) + γ V^{π_i}(s′) ]   ∀s ∈ S
  2. Update the policy:
     π_{i+1}(s) ← arg max_a ∑_{s′∈S} P(s′ | a, s) [ R(s, a, s′) + γ V^{π_i}(s′) ]
  3. i = i + 1
• until π_i(s) = π_{i−1}(s)
Solving for V^{π_i}(s) means finding a solution to a set of |S| linear equations with |S| unknowns (or |S| × |A| when working with Q^{π_i}), as shown earlier.
44
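A NumPy sketch of these steps for a finite MDP stored as arrays (illustrative; the array layout, with P[a, s, s'] and R[s, a], is an assumption):

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """P[a, s, s2] = P(s2 | s, a); R[s, a] = expected immediate reward. Returns (policy, values)."""
    n_actions, n_states, _ = P.shape
    pi = np.zeros(n_states, dtype=int)                 # arbitrary initial policy
    while True:
        # 1. Evaluate pi by solving the linear system (I - gamma * P_pi) v = r_pi, as shown earlier.
        P_pi = P[pi, np.arange(n_states)]              # (|S|, |S|) transition matrix under pi
        r_pi = R[np.arange(n_states), pi]              # (|S|,) rewards under pi
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # 2. Improve: act greedily with respect to the one-step lookahead on v.
        q = R.T + gamma * (P @ v)                      # q[a, s] = R(s, a) + gamma * sum_s' P(s'|s,a) v(s')
        new_pi = q.argmax(axis=0)
        if np.array_equal(new_pi, pi):                 # 3. stop when the policy no longer changes
            return pi, v
        pi = new_pi
```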
Policy iteration on the grid world (slides 45–49):
Initial policy: ↑ ↑ ↑ ↑ ; ↑ ↑ ; ↑ ↑ ↑ ; reward cells 1 and −100
Value of this policy: −0.437, −4.833, −14.60, −80.56; 0.053, −9.60; 0.061, 0.135, 0.364, 1; −100
Policy after one improvement step: ← ← ← ← ; ↑ ← ; → → ↑ ; reward cells 1 and −100
Policy after the next improvement step (unchanged on the following iteration, so policy iteration terminates): ↑ ← ← ↓ ; ↑ ← ; → → → ; reward cells 1 and −100
Attributions and References
Thanks to Dr Archie Chapman and Dr Hanna Kurniawati for their materials.
Many of the slides are adapted from David Poole and Alan Mackworth, Artificial Intelligence: Foundations of Computational Agents, 2E, CUP, 2017, https://artint.info/. These materials are copyright © Poole and Mackworth, 2017, licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Other materials derived from Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern Approach, 3E,
Prentice Hall, 2009.