COMP3702 Artificial Intelligence
Semester 2, 2021
Tutorial 8
Monte Carlo tree search
This material is adapted from: C. B. Browne et al., “A Survey of Monte Carlo Tree Search Methods,” in
IEEE Transactions on Computational Intelligence and AI in Games, vol. 4, no. 1, pp. 1-43, March 2012,
http://www.diego-perez.net/papers/MCTSSurvey.pdf. See the paper for more detail, and note that
we will cover bandit sampling, UCB and UCT in detail in next week’s lecture.
MCTS is a family of algorithms for fast planning, typically in online settings. The basic MCTS algorithm
involves iteratively building a search tree until some predefined computational budget – typically a time,
memory or iteration constraint, or sometimes all three – is reached, at which point the search is halted and
the best-performing root action returned. MCTS is particularly useful because it is anytime – the algorithm
can be interrupted and it will still return a solution, and the quality of the solution tends to increase with
more iterations.
As the name suggests, MCTS involves search on a tree. Like in the search problems we addressed earlier
in the course, each node in the search tree represents a state of the problem, and directed links to child nodes
represent actions leading to subsequent states. However, unlike blind or heuristic search, MCTS interleaves
random sampling of new states and backpropagation of values in order to direct the search process.
Four steps are applied per search iteration:
1. Selection: Starting at the root node, a child selection policy (i.e. sampling) is recursively applied to
descend through the tree until the most urgent expandable node is reached. A node is expandable if it
represents a nonterminal state and has unvisited (i.e. unexpanded) children. The measure of urgency
is typically given by a function of the estimated mean and variance of the value of the state, which is
incorporated into a bandit sampling policy.
2. Expansion: One (or more) child nodes are added to expand the tree, according to the available actions.
3. Simulation: A simulation is run from the new node(s) according to the default policy (e.g. choosing
actions uniformly at random) to produce an outcome.
4. Backpropagation: The simulation result is “backed up” (i.e. backpropagated) through the selected
nodes to update their statistics. Backpropagation involves passing estimated mean and variance values
back up the tree while selecting the highest-value child (i.e. max_s′ V(s′)) at each node to add to
the instantaneous reward of the current node s, as is used in dynamic programming for finite horizon
MDPs. Note that the backpropagation step does not use a policy itself (like VI), but updates node
value estimates that inform future tree policy decisions.
These steps are illustrated in the figure below:
You can find pseudocode for the UCT algorithm, a particular variant of MCTS, on page 9 of the paper above.
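As a rough sketch of how the four steps fit together, the Python outline below implements a generic MCTS loop. The environment interface (actions, step, is_terminal), the class and function names, and the UCB1 selection rule are illustrative assumptions rather than the paper's exact pseudocode; for simplicity, the backup here keeps a running mean of rollout returns, whereas a fuller implementation would also fold in the instantaneous rewards and discounting along the tree path, as described in step 4.

import math
import random
import time


class Node:
    """One node in the search tree; each node stores the usual MCTS statistics."""
    def __init__(self, state, parent=None, action=None):
        self.state = state
        self.parent = parent
        self.action = action          # action taken from the parent to reach this node
        self.children = []
        self.untried_actions = None   # filled lazily from env.actions(state)
        self.visits = 0
        self.value = 0.0              # running mean of simulation returns


def mcts(env, root_state, time_limit=1.0, c=math.sqrt(2), gamma=0.9):
    """Run MCTS from root_state until the time budget expires, then return
    the best-performing root action."""
    root = Node(root_state)
    deadline = time.time() + time_limit
    while time.time() < deadline:                  # computational budget
        node = select(env, root, c)                # 1. Selection
        node = expand(env, node)                   # 2. Expansion
        ret = simulate(env, node.state, gamma)     # 3. Simulation (default policy)
        backup(node, ret)                          # 4. Backpropagation
    best = max(root.children, key=lambda ch: ch.value)
    return best.action


def select(env, node, c):
    """Descend through the tree (here via UCB1) until an expandable or
    terminal node is reached."""
    while not env.is_terminal(node.state):
        if node.untried_actions is None:
            node.untried_actions = list(env.actions(node.state))
        if node.untried_actions:
            return node                            # expandable node found
        node = max(node.children, key=lambda ch:
                   ch.value + c * math.sqrt(math.log(node.visits) / ch.visits))
    return node


def expand(env, node):
    """Add one child node for an untried action, if the node is nonterminal."""
    if env.is_terminal(node.state) or not node.untried_actions:
        return node
    action = node.untried_actions.pop()
    next_state, _ = env.step(node.state, action)
    child = Node(next_state, parent=node, action=action)
    node.children.append(child)
    return child


def simulate(env, state, gamma, max_depth=50):
    """Default policy: a uniformly random rollout returning a discounted return."""
    ret, discount = 0.0, 1.0
    for _ in range(max_depth):
        if env.is_terminal(state):
            break
        action = random.choice(list(env.actions(state)))
        state, reward = env.step(state, action)
        ret += discount * reward
        discount *= gamma
    return ret


def backup(node, ret):
    """Propagate the simulation result back to the root, updating running means."""
    while node is not None:
        node.visits += 1
        node.value += (ret - node.value) / node.visits
        node = node.parent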
Exercise 8.1. Consider the simple Grid World problem from last week, copied below. Please discuss how
you might implement an MCTS algorithm for this problem.
• Start by defining the four components: Selection, expansion, simulation and backpropagation.
• Note that selection is driven by a sampling policy, and ε-greedy is a simple-to-implement bandit sampling policy that can work effectively (see the sketch after this list).
• Also note that backpropagation is effectively an asynchronous value iteration algorithm running in
reverse order through the search tree’s state nodes, from the leaf nodes back to the root node.
• https://jeffbradberry.com/posts/2015/09/intro-to-monte-carlo-tree-search/ is another useful resource on MCTS; however, it discusses MCTS applied to complex games like Go and chess, which are typically harder problems to solve than MDPs.
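As a starting point for the discussion, here is a minimal sketch of an ε-greedy tree policy and a value-iteration-style backup. The node fields (children, parent, value, visits, reward) and the choice ε = 0.1 are assumptions for illustration, where node.reward is taken to be the instantaneous reward associated with reaching that node.

import random


def epsilon_greedy_select(node, epsilon=0.1):
    """With probability epsilon explore a uniformly random child; otherwise
    exploit the child with the highest current value estimate."""
    if random.random() < epsilon:
        return random.choice(node.children)
    return max(node.children, key=lambda ch: ch.value)


def bellman_backup(leaf, rollout_return, gamma=0.9):
    """Backpropagation as asynchronous value iteration, running from the leaf
    back to the root: V(s) <- R(s) + gamma * max_s' V(s')."""
    leaf.visits += 1
    leaf.value = rollout_return        # seed the leaf with the simulation result
    node = leaf.parent
    while node is not None:
        node.visits += 1
        node.value = node.reward + gamma * max(ch.value for ch in node.children)
        node = node.parent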
Grid World
Consider the gridworld below:
[Gridworld figure: a grid of tiles with two coloured terminal squares marked with rewards 1 and −100, and one obstacle tile]
States in this environment are the positions on the tiles. The world is bounded by a boundary wall, and
there is one obstacle, at [1,1] (using Python-style zero indexing, starting from the bottom-left corner). In addition,
there are two terminal states, indicated by the coloured squares.
Terminal states: In practice, we would say that the two coloured tiles are terminal states of the MDP. For
mathematical convenience, we often prefer to deal with an infinite horizon MDP (see more below). To convert
this model to an infinite horizon MDP, we will add an extra pseudo-state to the list of MDP states called
exited, which is absorbing (the agent cannot leave it irrespective of its action, i.e. T(exited, a, exited) = 1
for all a).
Actions and Transitions: In this world, an agent can generally choose to move in four directions — up,
down, left and right (which we might sometimes refer to as ^, v, < and >, respectively). However, the agent
moves successfully with only p = 0.8, and moves perpendicular to its chosen direction with p = 0.1 in each
perpendicular direction. If it hits a wall or obstacle, the agent stays where it is. In addition, once the agent
arrives on a coloured square with a value, it has only one special action available to it; that is, to exit the
environment.
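A minimal sketch of this transition model is given below, assuming a 4-wide by 3-high grid (consistent with the coordinates used in this handout) and leaving out the special exit action on the coloured squares; the helper names are illustrative only.

import random

OBSTACLES = {(1, 1)}
MOVES = {'up': (0, 1), 'down': (0, -1), 'left': (-1, 0), 'right': (1, 0)}
PERPENDICULAR = {'up': ('left', 'right'), 'down': ('left', 'right'),
                 'left': ('up', 'down'), 'right': ('up', 'down')}


def apply_move(state, direction, width=4, height=3):
    """Move one tile in the given direction; stay put when blocked by the
    boundary wall or the obstacle."""
    dx, dy = MOVES[direction]
    x, y = state[0] + dx, state[1] + dy
    if 0 <= x < width and 0 <= y < height and (x, y) not in OBSTACLES:
        return (x, y)
    return state


def sample_transition(state, action):
    """Move in the chosen direction with p = 0.8, or perpendicular to it with
    p = 0.1 in each perpendicular direction."""
    r = random.random()
    if r < 0.8:
        direction = action
    elif r < 0.9:
        direction = PERPENDICULAR[action][0]
    else:
        direction = PERPENDICULAR[action][1]
    return apply_move(state, direction)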
Rewards: The values stated on the coloured squares are the reward for exiting the square and the
environment, so the reward is not repeatedly earned; that is, R([3, 2], exit) = 1, and R([3, 1], exit) = −100.
All other states have 0 reward.
Discount factor: γ = 0.9.
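To make the reward and discount specification concrete, a minimal sketch (with illustrative names, and states represented as (x, y) tuples) might look like:

GAMMA = 0.9
EXIT_REWARDS = {(3, 2): 1.0, (3, 1): -100.0}


def reward(state, action):
    """Reward is earned only by the special exit action on a coloured square;
    every other state-action pair gives zero reward."""
    if action == 'exit' and tuple(state) in EXIT_REWARDS:
        return EXIT_REWARDS[tuple(state)]
    return 0.0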