COMP3702
Artificial Intelligence
Module 3: Reasoning and planning under uncertainty — Part 3
MDPs (Approximations)
Dr Alina Bialkowski
Semester 2, 2021
The University of Queensland
School of Information Technology and Electrical Engineering
Week 8: Logistics
• Assignment 1 marking complete
• Assignment 1 video walkthrough/solution released
• Assignment 2 – due Friday, 24 Sep
• RiPPLE round 3 is open – due Friday, 8 Oct
1
Week 8: Overview
Last week:
• Introduced Markov Decision Processes (MDPs)
• Sequential decision making under uncertainty
• Two exact offline algorithms: Value Iteration and Policy Iteration
This week:
• Approximation methods
2
Table of contents
1. Review of MDPs from last week
2. Real-Time Dynamic Programming (RTDP)
3. Monte Carlo Tree Search (MCTS)
4. Value function approximation (VFA)
3
Review of MDPs from last week
Markov Decision Processes
An MDP consists of:
• set S of states.
• set A of actions.
• P(St+1 | St ,At), or T (St ,At ,St+1), specifies the dynamics or transition function.
• R(St ,At ,St+1) specifies the reward at time t.
The reward is sometimes a random variable; R(s, a, s′) denotes the expected reward received when the
agent is in state s, does action a and ends up in state s′.
Sometimes we use R(s, a) = Σ_{s′} P(s′ | s, a) R(s, a, s′).
• γ is the discount factor.
An MDP’s objective is to maximise: E[ Σ_{t=0}^{T} γ^t R(s_t, a_t) ]
The goal is to find the optimal policy (which action to perform in each state)
4
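To make these components concrete, here is a minimal Python sketch (not from the slides) of a small, made-up two-state MDP, with the transition and reward functions stored as dictionaries. Later sketches in these notes assume this same representation.

# Illustrative two-state MDP (made-up example): P[(s, a)] maps each next state
# s' to P(s' | s, a), and R[(s, a, s')] is the expected reward for that transition.
S = ["s0", "s1"]
A = ["stay", "go"]
gamma = 0.9

P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 0.9, "s1": 0.1},
}

R = {
    ("s0", "stay", "s0"): 0.0,
    ("s0", "go", "s0"):   0.0,
    ("s0", "go", "s1"):   10.0,
    ("s1", "stay", "s1"): 1.0,
    ("s1", "go", "s0"):   0.0,
    ("s1", "go", "s1"):   1.0,
}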
Policies
• A policy is a sequence of actions, taken to move from each state to the next state over the
whole time horizon.
• A stationary policy is a function or a map:
π : S → A
Given a state s, π(s) specifies what action the agent who is following π will do.
• An optimal policy is one with maximum expected discounted reward.
max_π E[ Σ_{t=0}^{∞} γ^t R(s_t, π(s_t)) ]
where π(s_t) is the action taken at time t.
• For a fully-observable MDP with stationary dynamics and rewards with infinite or
indefinite horizon, there is always an optimal stationary policy.
5
Value of a Policy
We first approach MDPs using a recursive reformulation of the objective called a value function
• The value function of an MDP, V π(s), is the expected future reward of following an
(arbitrary) policy, π, starting from state, s, given by:
V^π(s) = Σ_{s′∈S} P(s′ | π(s), s) [R(s, π(s), s′) + γ V^π(s′)]
where the policy π(s) determines the action taken in state s.
• Here we have dropped the time index, as it is redundant, but note that at = π(st).
6
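As an illustration (not part of the slides), this recursion can be solved by sweeping the update until it stops changing. A minimal sketch, assuming the dictionary-based MDP representation (S, P, R, gamma) from the earlier example and a policy pi given as a dict from states to actions; the tolerance tol is an arbitrary illustrative choice.

def evaluate_policy(S, P, R, gamma, pi, tol=1e-6):
    """Iterative policy evaluation: repeatedly apply
    V(s) <- sum_{s'} P(s' | pi(s), s) [R(s, pi(s), s') + gamma V(s')]
    until the largest change in a sweep falls below tol."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            a = pi[s]
            new_v = sum(p * (R[(s, a, s1)] + gamma * V[s1])
                        for s1, p in P[(s, a)].items())
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V

Solving the |S| linear equations exactly (e.g. with a linear solver) is an alternative to this iterative sweep.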
Value of the Optimal Policy
• Optimal value function is given by the Bellman Equation:
V*(s) = max_a Σ_{s′} P(s′ | a, s) [R(s, a, s′) + γ V*(s′)]
• Using Q-value (action-value function) notation:
Q*(s, a) = Σ_{s′} P(s′ | a, s) (R(s, a, s′) + γ V*(s′))
V*(s) = max_a Q*(s, a)
• Theorem: there is a unique function V* satisfying these equations
• If we know V*, the optimal policy can be easily generated:
π*(s) = arg max_a Q*(s, a)
7
Value Iteration
Iteratively calculate the optimal value of each state until convergence
1. Set V_0 arbitrarily, e.g.:
V̂(s) ← 0
2. Compute V_{i+1} from V_i. Loop for all s:
V̂_{i+1}(s) = max_a Σ_{s′} P(s′ | a, s) {R(s, a, s′) + γ V̂_i(s′)}
Once the values converge, recover the best policy from the current value function estimate:
π(s) = arg max_a Σ_{s′} P(s′ | a, s) {R(s, a, s′) + γ V̂(s′)}
8
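A minimal Python sketch of value iteration (not from the slides), assuming the same dictionary-based MDP representation (S, A, P, R, gamma) as the earlier example; the convergence tolerance tol is an illustrative choice.

def value_iteration(S, A, P, R, gamma, tol=1e-6):
    """Apply Bellman optimality updates until values converge, then extract
    the greedy policy from the final value estimate."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            best = max(sum(p * (R[(s, a, s1)] + gamma * V[s1])
                           for s1, p in P[(s, a)].items())
                       for a in A)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # one-off policy extraction after convergence
    pi = {s: max(A, key=lambda a: sum(p * (R[(s, a, s1)] + gamma * V[s1])
                                      for s1, p in P[(s, a)].items()))
          for s in S}
    return V, pi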
Policy Iteration
• Set π0 arbitrarily, let i = 0
• Repeat:
1. Evaluate the value of the policy, V^{π_i}(s) (or Q^{π_i}(s, a)):
V^{π_i}(s) = Σ_{s′∈S} P(s′ | π_i(s), s) [R(s, π_i(s), s′) + γ V^{π_i}(s′)]   ∀s ∈ S
2. Update the policy using a one-step look-ahead:
π_{i+1}(s) ← arg max_a Σ_{s′∈S} P(s′ | a, s) [R(s, a, s′) + γ V^{π_i}(s′)]
3. i = i + 1
• until π_i(s) = π_{i−1}(s)
Solving V πi (s) means finding a solution to a set of |S | linear equations with |S | unknowns, as
shown earlier.
9
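A minimal Python sketch of policy iteration (not from the slides), again assuming the dictionary-based MDP representation; here policy evaluation is done iteratively as a stand-in for solving the |S| linear equations exactly, and tol is an illustrative choice.

def policy_iteration(S, A, P, R, gamma, tol=1e-6):
    """Alternate (approximate) policy evaluation with greedy policy
    improvement until the policy stops changing."""
    pi = {s: A[0] for s in S}
    while True:
        # policy evaluation for the current policy pi
        V = {s: 0.0 for s in S}
        while True:
            delta = 0.0
            for s in S:
                a = pi[s]
                new_v = sum(p * (R[(s, a, s1)] + gamma * V[s1])
                            for s1, p in P[(s, a)].items())
                delta = max(delta, abs(new_v - V[s]))
                V[s] = new_v
            if delta < tol:
                break
        # policy improvement: one-step look-ahead on the evaluated values
        new_pi = {s: max(A, key=lambda a: sum(p * (R[(s, a, s1)] + gamma * V[s1])
                                              for s1, p in P[(s, a)].items()))
                  for s in S}
        if new_pi == pi:
            return V, pi
        pi = new_pi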
Comparing value and policy iteration
• Value iteration:
• Includes finding optimal value function, then one policy extraction after convergence.
• Every iteration updates both the values and (implicitly) the policy since taking the max over
actions implicitly recomputes it
• Simpler to implement
• Policy iteration:
• Includes policy evaluation + policy update/improvement, repeated iteratively until policy
converges
• Policy evaluation: perform several passes that update utilities with a fixed policy (each pass is
fast because we consider only one action, not all of them)
• Policy update: is slow like a value iteration pass
• More complicated
• In practice, converges faster than VI → Why?
Note: both are Dynamic Programming methods
10
Modified Policy Iteration
• Set π[s] arbitrarily
Set Q[s, a] arbitrarily
• Repeat forever:
• Repeat for a while:
• Select state s, action a
• Q[s, a] ← Σ_{s′} P(s′ | s, a) (R(s, a, s′) + γ Q[s′, π[s′]])
• π[s] ← arg max_a Q[s, a]
• until π no longer changes
11
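A rough Python sketch of this scheme (not from the slides), assuming the dictionary-based MDP representation; the random (s, a) selection and the sweep counts n_outer and n_inner are arbitrary illustrative stand-ins for "repeat for a while".

import random

def modified_policy_iteration(S, A, P, R, gamma, n_outer=100, n_inner=50):
    """Asynchronous Q-backups on randomly chosen (s, a) pairs, interleaved
    with greedy policy updates."""
    Q = {(s, a): 0.0 for s in S for a in A}
    pi = {s: A[0] for s in S}
    for _ in range(n_outer):
        for _ in range(n_inner):
            s, a = random.choice(S), random.choice(A)
            Q[(s, a)] = sum(p * (R[(s, a, s1)] + gamma * Q[(s1, pi[s1])])
                            for s1, p in P[(s, a)].items())
        for s in S:
            pi[s] = max(A, key=lambda a: Q[(s, a)])
    return Q, pi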
Special case: Finite Horizon MDPs
For finite horizon MDPs, can use dynamic programming or “backwards induction” to compute
the optimal value function:
• Start from the goal state and propagate values backwards through the transition function.
• At each decision node, set the value function to
V*(s) = max_a Σ_{s′} P(s′ | a, s) [R(s, a, s′) + V*(s′)]
using the values V*(s′) already computed for the following stage.
• For more detail, see P&M Ch 3.8.3:
https://artint.info/2e/html/ArtInt2e.Ch3.S8.SS3.html
12
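A small Python sketch of backwards induction (not from the slides), assuming the dictionary-based MDP representation and a fixed horizon; discounting is optional here and gamma defaults to 1.

def backwards_induction(S, A, P, R, horizon, gamma=1.0):
    """Finite-horizon dynamic programming: start from V = 0 after the final
    decision and work backwards, storing the best action at each stage."""
    V = {s: 0.0 for s in S}          # value at the stage after the last decision
    policy = []                      # policy[t][s] = best action at stage t
    for t in range(horizon - 1, -1, -1):
        newV, pi_t = {}, {}
        for s in S:
            q = {a: sum(p * (R[(s, a, s1)] + gamma * V[s1])
                        for s1, p in P[(s, a)].items())
                 for a in A}
            pi_t[s] = max(q, key=q.get)
            newV[s] = q[pi_t[s]]
        V = newV
        policy.insert(0, pi_t)
    return V, policy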
Summary: Q, V , π, R
Q*(s, a) = Σ_{s′} P(s′ | a, s) (R(s, a, s′) + γ V*(s′))
V*(s) = max_a Q*(s, a)
π*(s) = arg max_a Q*(s, a)
Let
R(s, a) = Σ_{s′} P(s′ | a, s) R(s, a, s′)
Then:
Q*(s, a) = R(s, a) + γ Σ_{s′} P(s′ | a, s) V*(s′)
13
Problem: Large state space
• So far, we assumed we can store the values of all states, and update the value of each
state sufficiently often
• Infeasible in large state spaces, especially continuous state spaces
• A polynomial-time algorithm in the size of the state space is still not fast enough
• Not all states are relevant
• Focus Bellman Update on relevant states only
• More compact policy representation
14
Policy representation
• If the state space is extremely large, we usually run out of memory
• Rather than mapping from every single state, we can use an AND-OR tree for the policy
representation
• Root = initial state
• Don’t care about states that are unreachable from the initial state
• Can define a parametric function to represent the value function
[Figure: policy represented as an AND-OR tree rooted at the initial state, mapping each reachable state s to its action π(s)]
15
Computing the policy online
• Problems with offline policy
• Large state & action spaces
• Even a single backup operation can be expensive
• Computationally expensive in time and memory
• Solutions?
• Interleave policy computation with using the policy
• Focus on finding the optimal policy for the present state only
• In general, relax the optimality requirement to approximately optimal
16
Real-Time Dynamic
Programming (RTDP)
Real-Time Dynamic Programming (RTDP)
Repeat until planning time runs out:
• Simulate greedy policy from starting state until a goal state is reached
• Perform Bellman backup on visited states
RTDP(Initial state s0, Goal set G ):
• Repeat until time runs out:
• s = s0
• While (s is not in G)
• agreedy = argmaxa∈AQ(s, a)
• V (s) = Q(s, agreedy)
• s′ = sampleFrom(P(s′ | agreedy, s)) and set s = s′
17
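A minimal Python sketch of the RTDP loop above (not from the slides), assuming the dictionary-based MDP representation, a set of goal states that is reachable under the greedy policy, and a simple trial budget n_trials in place of a real-time cut-off.

import random

def rtdp(S, A, P, R, gamma, s0, goals, n_trials=1000):
    """Trial-based RTDP: repeatedly simulate the greedy policy from s0,
    performing a Bellman backup on each visited state along the way."""
    V = {s: 0.0 for s in S}

    def q(s, a):
        return sum(p * (R[(s, a, s1)] + gamma * V[s1])
                   for s1, p in P[(s, a)].items())

    for _ in range(n_trials):
        s = s0
        while s not in goals:
            a_greedy = max(A, key=lambda a: q(s, a))
            V[s] = q(s, a_greedy)                       # backup on the visited state
            succ, probs = zip(*P[(s, a_greedy)].items())
            s = random.choices(succ, weights=probs)[0]  # sample the next state
    return V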
Real-Time Dynamic Programming (RTDP)
• Converges to optimality as the number of iterations goes to infinity
• Drawbacks:
• No real termination condition
• May perform backup at state s repeatedly even when the value of s has converged to the
optimal value
18
Labelled RTDP
Check if the value of a state can no longer improve (i.e. the isSolved function in the following
algorithm)
• If the value of a state s & all its descendants (that have been visited) change by less than a
small value ε, label s as solved
Labelled-RTDP(Initial state s0, Goal set G):
• Repeat until s0 is solved:
• s = s0
• While (s is not solved)
• agreedy = argmaxa∈AQ(s, a)
• V (s) = Q(s, agreedy)
• s′ = sampleFrom(P(s′ | agreedy, s)), add s′ to list L, and set s = s′
• For all states s in L, if (isSolved(s)), Label s as solved
19
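A much-simplified Python sketch of the labelling idea (not from the slides), with the same assumptions as the RTDP sketch; here isSolved is approximated by checking that a state's Bellman residual is below eps and that all successors under its greedy action are already labelled solved, which is weaker than the full check-solved procedure of Labelled RTDP.

import random

def labelled_rtdp(S, A, P, R, gamma, s0, goals, eps=1e-3):
    """Greedy trials from s0 that stop at solved states; after each trial,
    visited states are labelled solved once their values can no longer improve."""
    V = {s: 0.0 for s in S}
    solved = set(goals)                       # goal states are solved by definition

    def q(s, a):
        return sum(p * (R[(s, a, s1)] + gamma * V[s1])
                   for s1, p in P[(s, a)].items())

    def greedy(s):
        return max(A, key=lambda a: q(s, a))

    while s0 not in solved:
        s, visited = s0, []
        while s not in solved:
            visited.append(s)
            a = greedy(s)
            V[s] = q(s, a)
            succ, probs = zip(*P[(s, a)].items())
            s = random.choices(succ, weights=probs)[0]
        for s in reversed(visited):           # label from the solved frontier back
            a = greedy(s)
            if abs(q(s, a) - V[s]) < eps and all(s1 in solved for s1 in P[(s, a)]):
                solved.add(s)
            else:
                break
    return V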
Monte Carlo Tree Search
(MCTS)
Monte Carlo Tree Search (MCTS)
Alternative approach: given a starting state, learn the (local) value model well enough to choose
an action in the current state, take that action, then estimate the value of the next node.
V*(s) = max_a Σ_{s′} P(s′ | a, s) [R(s, a, s′) + γ V*(s′)]
20
Monte Carlo Tree Search (MCTS)
Combines tree search and Monte Carlo
• Monte Carlo sampling is a well known method for searching through large state space.
• In MDPs, Monte Carlo tree search does this by making use of a generative model, or
simulator, of the system under control.
• Exploiting MC in sequential decision making was first successfully demonstrated by Kocsis
& Szepesvari, 2006 — a relatively recent development.
MCTS can be used as a planning method (offline), or a learning method (online).
E.g. in the online case, we can use simulations to learn the local model, then take an action in
the real world and find out what state we actually end up in next.
21
Monte Carlo methods
Monte Carlo Simulation: a technique that can be used to solve a mathematical or statistical
problem using repeated sampling to determine the properties of some phenomenon (or
behaviour)1
Monte-Carlo Planning: compute a good policy for an MDP by interacting with an MDP
simulator
1e.g. approximating π https://www.youtube.com/watch?v=ELetCV_wX_c
22
Monte Carlo Tree Search (MCTS)
MCTS is used for sequential decision making, over different states:
• gradually grow the search tree
• two types of tree nodes (forming an AND-OR tree):
1. decision nodes (action selection) — the algorithm selects the action
2. chance nodes (outcome selection) — the world selects the outcome (in the case of MDPs, these
are based on known probabilities)
• returned solution: the path (action from the root) visited most often
23
Monte Carlo Tree Search (MCTS)
Gradually grow the search tree:
• Iterate over a tree-walk:
1. Bandit phase: select an action from the existing tree
2. Add a node: grow a leaf on the fringe of the search tree
3. Random phase / roll-out: select the next action to expand from the fringe
4. Evaluate: compute the instant reward
5. Back-propagate: update information in the visited nodes (as is done in dynamic programming
for finite-horizon MDPs)
• Returned solution: Path visited most often.
There are lots of details, and we will cover an example in the tutorial next week.
24
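As a rough preview of those details, the sketch below (not from the slides) implements a stripped-down UCT-style MCTS for an MDP. It assumes hashable states, a known action list, and a generative model simulate(s, a) returning a sampled (next state, reward) pair; the exploration constant c, rollout depth and iteration budget are illustrative choices, and real implementations add many refinements.

import math
import random

def uct_search(root, actions, simulate, n_iters=2000, gamma=0.95,
               c=1.4, max_depth=20):
    """Stripped-down UCT: grow a tree of visited states, select actions in the
    tree with UCB1, evaluate new leaves with a random rollout, and
    back-propagate sampled returns as running means."""
    N = {}            # N[(s, a)]: visit count
    Q = {}            # Q[(s, a)]: mean sampled return
    tree = set()      # states already added to the search tree

    def select_action(s):                     # bandit phase
        total = sum(N[(s, a)] for a in actions)
        def ucb(a):
            if N[(s, a)] == 0:
                return float("inf")           # try unvisited actions first
            return Q[(s, a)] + c * math.sqrt(math.log(total + 1) / N[(s, a)])
        return max(actions, key=ucb)

    def rollout(s, depth):                    # random phase / roll-out
        ret, disc = 0.0, 1.0
        for _ in range(depth):
            s, r = simulate(s, random.choice(actions))
            ret += disc * r
            disc *= gamma
        return ret

    def tree_walk(s, depth):
        if depth == 0:
            return 0.0
        if s not in tree:                     # add a node on the fringe
            tree.add(s)
            for a in actions:
                N[(s, a)], Q[(s, a)] = 0, 0.0
            return rollout(s, depth)
        a = select_action(s)
        s1, r = simulate(s, a)
        ret = r + gamma * tree_walk(s1, depth - 1)
        N[(s, a)] += 1                        # back-propagate
        Q[(s, a)] += (ret - Q[(s, a)]) / N[(s, a)]
        return ret

    for _ in range(n_iters):
        tree_walk(root, max_depth)
    return max(actions, key=lambda a: N.get((root, a), 0))   # most-visited action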
Model-based Monte Carlo
• Goal is to estimate the MDP transition function P(s ′ | a, s) and reward R(s, a, s ′)
• Transitions:
P̂(s′ | a, s) = (#times (s, a, s′) occurs) / (#times (s, a) occurs)
• Rewards:
R̂(s, a, s′) = average of the rewards r observed in transitions (s, a, r, s′)
25
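A small Python sketch of these counting estimates (not from the slides), assuming experience is available as a list of (s, a, r, s′) tuples; the reward estimate averages the observed rewards for each (s, a, s′) rather than taking a single sample.

from collections import defaultdict

def estimate_model(transitions):
    """Estimate P(s' | a, s) and R(s, a, s') from observed (s, a, r, s') tuples."""
    sa_count = defaultdict(int)       # #times (s, a) occurs
    sas_count = defaultdict(int)      # #times (s, a, s') occurs
    reward_sum = defaultdict(float)   # total reward observed for (s, a, s')
    for s, a, r, s1 in transitions:
        sa_count[(s, a)] += 1
        sas_count[(s, a, s1)] += 1
        reward_sum[(s, a, s1)] += r
    P_hat = {(s, a, s1): n / sa_count[(s, a)]
             for (s, a, s1), n in sas_count.items()}
    R_hat = {(s, a, s1): reward_sum[(s, a, s1)] / n
             for (s, a, s1), n in sas_count.items()}
    return P_hat, R_hat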
Value function approximation
(VFA)
Large-Scale Problems
• We want to use MDPs and reinforcement learning to solve large problems, e.g.
• Backgammon: 10^20 states
• Computer Go: 10^170 states
• Quad-copter, biped robot, ...: enormous continuous state spaces
• Tabular methods clearly cannot handle this... why?
26
Problem: large state spaces
If the state space is large, several problems arise.
• The table of Q-value estimates can get extremely large.
• Q-value updates can be slow to propagate.
• High-reward states can be hard to find.
• State space grows exponentially with feature dimension.
27
MDPs and Reinforcement Learning with Features
• Usually we don’t want to reason in terms of states, but in terms of features.
• In state-based methods, information about one state cannot be used by similar states.
• If there are too many parameters to learn, it takes too long.
• Idea: Express the value function as a function of the features.
• Most common is a linear function of the features.
28
Linear Value Function Approximation
Key idea: learn a reward/value function as a linear combination of features.
• We can think of feature extraction as a change of basis.
• For each state encountered, determine its representation in terms of features.
• Perform a Q-learning update on each feature.
• Value estimate is a sum over the state’s features.
A linear function of variables x_1, ..., x_n is of the form
f_w(x_1, ..., x_n) = w_0 + w_1 x_1 + ... + w_n x_n
where w = (w_0, w_1, ..., w_n) are weights (and by convention x_0 = 1).
29
Example: Q-learning with linear value function approximation
Given γ: discount factor; η: step size
Assign weights w = (w_0, ..., w_n) arbitrarily
observe current state s
repeat for each episode, until convergence:
select and carry out action a
observe reward r and state s′
select action a′ (using a policy based on Q_w, the Q-function defined by the feature weights)
let δ = r + γ Q_w(s′, a′) − Q_w(s, a)
for i = 0 to n:
w_i ← w_i + η δ F_i(s, a)
s ← s′
Intuition: this is performing gradient descent across features, effectively adjusting the feature
weights to reduce the difference between the sampled value and the estimated expected value.
30
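A hedged Python sketch of this update (not from the slides). The environment interface env.reset() / env.step(a) -> (s′, r, done), the feature function features(s, a) returning a fixed-length list, and the epsilon-greedy action selection are all assumptions made for illustration; since the next action a′ is the one actually selected, this is a SARSA-style version of the update.

import random

def linear_q_learning(env, actions, features, n_features,
                      gamma=0.9, eta=0.01, epsilon=0.1, n_episodes=500):
    """TD updates with a linear value function approximation
    Q_w(s, a) = w . F(s, a); each step moves the weights along
    the TD error delta times the active features."""
    w = [0.0] * n_features

    def q(s, a):
        return sum(wi * fi for wi, fi in zip(w, features(s, a)))

    def choose(s):                            # epsilon-greedy policy based on Q_w
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: q(s, a))

    for _ in range(n_episodes):
        s = env.reset()
        a = choose(s)
        done = False
        while not done:
            s1, r, done = env.step(a)
            a1 = choose(s1)
            delta = r + (0.0 if done else gamma * q(s1, a1)) - q(s, a)
            F = features(s, a)
            for i in range(n_features):       # gradient step on each weight
                w[i] += eta * delta * F[i]
            s, a = s1, a1
    return w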
Advantages and disadvantages of VFAs
Pros:
• Dramatically reduces the size of the Q-table.
• States will share many features.
• Allows generalisation to unvisited states.
• Makes behaviour more robust: making similar decisions in similar states.
• Handles continuous state spaces!
Cons:
• Requires feature selection (often must be done by hand).
• Restricts the accuracy of the learned rewards.
• The true reward function may not be linear in the features.
31
This looks like regression, so can we do better?
Why restrict ourselves to linear VFAs?
What else could we use?
Enter your answers here:
https://padletuq.padlet.org/alina_bialkowski/uwv3zgjnnzjxhq2q
31
General VFAs
In general, VFA can replace the tables with a general parameterized form...
32
Next time
• Some of these online methods weren’t just planning but learning! → Reinforcement learning
• There was still an MDP underneath, but we used learning to estimate the value function, rather
than solving it with computation alone
• In reinforcement learning:
• Exploration: you have to try unknown actions to get information
• Exploitation: eventually, you have to use what you know
• Regret: even if you learn intelligently, you make mistakes
• Sampling: because of chance, you have to try things repeatedly
• Difficulty: learning can be much harder than solving a known MDP
33
Attributions and References
Thanks to Dr Archie Chapman and Dr Hanna Kurniawati for their materials.
Many of the slides are adapted from David Poole and Alan Mackworth, Artificial Intelligence: Foundations of
Computational Agents, 2E, CUP, 2017, https://artint.info/. These materials are copyright © Poole
and Mackworth, 2017, licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0
International License.
Other materials derived from Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern Approach, 3E,
Prentice Hall, 2009.