CRICOS code 00025B
MCTS: Example of Steps
Monte Carlo Tree Search (MCTS)
Artificial Intelligence COMP3702
A node: state & value
• Build the search tree based on the outcomes of the simulated plays
• Iterate over 4 main components:
⎯ Selection: Choose the best path from the root down the current tree
⎯ Expansion: When a leaf node of the tree is reached, add a child node
⎯ Simulation: Simulate from the newly added node n to estimate its value
⎯ Backpropagation: Update the values of the nodes visited in this iteration
• Many variants exist for each component
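The four components above can be arranged into a single loop. The sketch below is a generic skeleton, not a specific implementation: each phase is passed in as a callable, and the names and signatures are illustrative assumptions.

```python
def mcts(root, iterations, select, expand, simulate, backpropagate):
    """Generic MCTS loop: each of the 4 phases is supplied as a callable.

    select(root)         -> (path, leaf): descend along the best path
    expand(leaf)         -> child: add a child node to the tree
    simulate(child)      -> value: rollout estimate of the child's value
    backpropagate(path, value): update values of the visited nodes
    """
    for _ in range(iterations):
        path, leaf = select(root)             # Selection
        child = expand(leaf)                  # Expansion
        value = simulate(child)               # Simulation
        backpropagate(path + [child], value)  # Backpropagation
    return root
```

Concrete choices for each callable (UCB selection, heuristic rollouts, running-average backups) give the commonly used variants.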
MCTS for MDP
Artificial Intelligence COMP3702
• The value Q(s,a) is the average total discounted reward over the simulations that start from s and perform action a as their first action
• Does not need the exact transition and reward model
⎯ Only needs a simulator
⎯ Computes a good policy by interacting with the simulator
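To illustrate that only a simulator is needed, a plain Monte Carlo estimate of Q(s,a) can be sketched as below. The `simulate(s, a) -> (next_state, reward)` interface and the random continuation policy are assumptions of this sketch, not part of the slides.

```python
import random

def estimate_q(simulate, state, action, actions, gamma=0.95,
               n_sims=100, max_depth=30):
    """Monte Carlo estimate of Q(s,a): the average total discounted
    reward over simulations starting from `state` with `action` first.

    simulate(s, a) -> (next_state, reward) is the only model access
    needed; actions(s) lists the actions available in state s.
    """
    total = 0.0
    for _ in range(n_sims):
        s, a = state, action
        ret, discount = 0.0, 1.0
        for _ in range(max_depth):
            s, r = simulate(s, a)
            ret += discount * r
            discount *= gamma
            a = random.choice(actions(s))  # random continuation policy
        total += ret
    return total / n_sims
```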
Commonly used MCTS for MDP
Node Selection
• Multi-armed bandit to select which action to use
⎯ In general, use a method called Upper Confidence Bound (UCB)
⎯ Choose an action a to perform at s as:

$$\pi_{UCB}(s) = \operatorname*{argmax}_{a \in A}\; \underbrace{Q(s,a)}_{\text{exploitation}} + \underbrace{c\sqrt{\frac{\ln n(s)}{n(s,a)}}}_{\text{exploration}}$$

⎯ c: a constant balancing exploration & exploitation; typically set by trial & error
⎯ n(s): number of times node s has been visited
⎯ n(s,a): number of times the out-edge of s with label a has been visited
• MCTS + UCB is often called Upper Confidence Bound for Trees (UCT)
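The UCB rule can be sketched directly in Python. The node representation (visit-count and value dictionaries keyed by action) is an assumption of this sketch; the convention of trying unvisited actions first is a common practical choice.

```python
import math

def ucb_select(Q, n_s, n_sa, actions, c=1.4):
    """Pick the action maximising Q(s,a) + c*sqrt(ln n(s) / n(s,a)).

    Q:    dict mapping action -> current value estimate Q(s,a)
    n_s:  visit count n(s) of the node
    n_sa: dict mapping action -> visit count n(s,a)
    c:    exploration constant, tuned by trial & error
    """
    def score(a):
        if n_sa.get(a, 0) == 0:
            return float("inf")  # try unvisited actions first
        return Q.get(a, 0.0) + c * math.sqrt(math.log(n_s) / n_sa[a])
    return max(actions, key=score)
```

With equal visit counts the rule reduces to exploitation (highest Q); with equal Q values it prefers the less-visited action.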
Simulation
• Often called rollout
• Essentially a way to estimate the optimal value of the newly added state
• In practice:
⎯ Use a heuristic, e.g., a greedy policy or the solution of the deterministic version of the problem
⎯ Important for performance
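A minimal rollout can be sketched as below, assuming the same generative simulator interface as before; the random default policy is the simplest choice, and swapping in a heuristic policy is where the performance gains come from.

```python
import random

def rollout(simulate, state, is_terminal, actions, gamma=0.95, max_depth=50):
    """Estimate the value of `state` with a rollout under a random policy.

    simulate(s, a) -> (next_state, reward) is a generative model;
    replace `random.choice` with a heuristic policy in practice.
    """
    total, discount = 0.0, 1.0
    for _ in range(max_depth):
        if is_terminal(state):
            break
        a = random.choice(actions(state))
        state, reward = simulate(state, a)
        total += discount * reward
        discount *= gamma
    return total
```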
Backpropagation
• Essentially, updating the Q values:

$$Q(s,a) = \frac{Q(s,a) \times N(s,a) + R}{N(s,a) + 1}$$
$$N(s) = N(s) + 1$$
$$N(s,a) = N(s,a) + 1$$

where R is the total discounted reward obtained in this iteration's simulation.
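The running-average update above can be sketched as follows. The dictionary-based statistics are an assumption of this sketch, and applying the same return R at every node along the path follows the simple form on the slide; in practice the return is usually discounted to each node's depth.

```python
def backup(Q, N_s, N_sa, path, R):
    """Update node statistics along the visited path with return R.

    path is a list of (state, action) pairs visited this iteration;
    Q[(s, a)] keeps a running average of the returns through (s, a).
    """
    for (s, a) in path:
        q, n = Q.get((s, a), 0.0), N_sa.get((s, a), 0)
        Q[(s, a)] = (q * n + R) / (n + 1)   # Q(s,a) <- (Q*N(s,a) + R) / (N(s,a)+1)
        N_s[s] = N_s.get(s, 0) + 1          # N(s) <- N(s) + 1
        N_sa[(s, a)] = n + 1                # N(s,a) <- N(s,a) + 1
    return Q
```

After two backups with returns 2 and 4 through the same (s, a), Q(s, a) is their average, 3.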