CRICOS code 00025B

MCTS Example of Steps

Monte Carlo Tree Search (MCTS)

Artificial Intelligence COMP3702

• A node: state & value

• Build the search tree based on the outcomes of the simulated plays

• Iterate over 4 main components
⎯ Selection: Choose the best path
⎯ Expansion: When a leaf node of the search tree is reached, add a child node
⎯ Simulation: Simulate from the newly added node n, to estimate its value
⎯ Backpropagation: Update the value of the nodes visited in this iteration

• Many variants for each component
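As a rough sketch, the four components can be combined into a single loop. This is an illustrative Python implementation on a toy problem, not the exact algorithm from the slides; all names (`Node`, `mcts`, the helper arguments) are made up here:

```python
import math
import random

class Node:
    """A search-tree node: stores a state plus visit count and value estimate."""
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}   # action -> Node
        self.n = 0           # number of visits
        self.value = 0.0     # running mean of simulation returns

def mcts(root_state, actions, step, is_terminal, rollout, iterations=300, c=1.4):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1) Selection: walk down fully expanded nodes, taking the UCB-best child
        while node.children and len(node.children) == len(actions) \
                and not is_terminal(node.state):
            node = max(node.children.values(),
                       key=lambda ch: ch.value
                       + c * math.sqrt(math.log(node.n) / ch.n))
        # 2) Expansion: add one untried child, unless the node is terminal
        if not is_terminal(node.state):
            a = random.choice([a for a in actions if a not in node.children])
            node.children[a] = Node(step(node.state, a), parent=node)
            node = node.children[a]
        # 3) Simulation: estimate the new node's value with a rollout
        ret = rollout(node.state)
        # 4) Backpropagation: update counts and running means up the path
        while node is not None:
            node.n += 1
            node.value += (ret - node.value) / node.n
            node = node.parent
    # Recommend the action whose child has the highest estimated value
    return max(root.children, key=lambda a: root.children[a].value)
```

On a toy random walk (start at 0, terminal at ±3, reward 1 only at +3), the recommended first action converges to +1, since walks starting one step to the right reach +3 more often.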

MCTS for MDP

• The value Q(s,a) is the average total discounted reward over the simulations that start from s and perform action a as its first action

• Does not need the exact transition and reward model
⎯ Only need a simulator
⎯ Computes a good policy by interacting with the simulator
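"Only need a simulator" means MCTS works with a generative model: a function that, given (s, a), samples a next state and reward, without exposing transition probabilities. A minimal sketch on a made-up chain MDP (the dynamics and numbers are invented for illustration):

```python
import random

def simulator(state, action):
    """Generative model for a toy chain MDP (invented for illustration):
    'right' advances one cell with probability 0.8, otherwise the agent
    stays put; entering cell 3 pays reward 1."""
    if action == "right" and random.random() < 0.8:
        next_state = min(state + 1, 3)
    else:
        next_state = state
    reward = 1.0 if next_state == 3 and state != 3 else 0.0
    return next_state, reward

def estimate_q(state, action, policy, gamma=0.95, horizon=30, n_sims=2000):
    """Q(s,a) as on the slide: the average total discounted reward over
    simulations that start at `state`, take `action` first, then follow
    `policy` -- computed purely by calling the simulator."""
    total = 0.0
    for _ in range(n_sims):
        s, a, ret, disc = state, action, 0.0, 1.0
        for _ in range(horizon):
            s, r = simulator(s, a)
            ret += disc * r
            disc *= gamma
            a = policy(s)
        total += ret
    return total / n_sims
```

For example, `estimate_q(2, "right", lambda s: "right")` averages discounted returns of sampled trajectories; no transition matrix is ever written down.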

Commonly used MCTS for MDP

Node Selection

• Multi-armed bandit to select which action to use
⎯ In general, use a method called Upper Confidence Bound (UCB)
⎯ Choose an action a to perform at s as:

𝜋_UCB(s) = argmax_{a∈A} [ Q(s,a) + c √( ln n(s) / n(s,a) ) ]

where Q(s,a) is the exploitation term and c √( ln n(s) / n(s,a) ) is the exploration term

⎯ c: a constant indicating how to balance exploration & exploitation, needs to be decided by trial & error
⎯ n(s): #times node s has been visited
⎯ n(s,a): #times the out-edge of s with label a has been visited

• MCTS + UCB is often called Upper Confidence Bound for Trees (UCT)
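The selection rule can be written directly from the formula. A small sketch (the function and variable names are mine, not from the slides); untried actions are taken first, since n(s,a) = 0 makes the exploration bonus unbounded:

```python
import math

def ucb_action(Q, n_s, n_sa, actions, c=1.4):
    """Pick argmax over a of Q(s,a) + c * sqrt(ln n(s) / n(s,a)).
    Q and n_sa are dicts keyed by action; n_s is the visit count of s.
    Actions with n(s,a) == 0 are chosen first (infinite bonus)."""
    untried = [a for a in actions if n_sa.get(a, 0) == 0]
    if untried:
        return untried[0]
    return max(actions,
               key=lambda a: Q[a] + c * math.sqrt(math.log(n_s) / n_sa[a]))
```

With a large enough c, a rarely tried action wins even when its Q is lower (exploration); with c = 0 the rule is purely greedy (exploitation).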

Simulation

• Often called rollout
• Essentially a way to estimate the optimal value of the newly added state
• In practice,
⎯ Use a heuristic, e.g., greedy, or the solution of the deterministic case
⎯ Important for performance
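A rollout can be sketched as a discounted simulation down to a terminal state (or a depth cutoff). The `policy` argument is the rollout heuristic the slide mentions; all names and the action set here are illustrative, not from the slides:

```python
import random

def rollout(state, step, is_terminal, reward, policy=None,
            gamma=0.95, max_depth=50):
    """Estimate the value of `state` by simulating until a terminal
    state or a depth cutoff. `policy` is the rollout heuristic (e.g. a
    greedy rule); with none given, actions are uniformly random."""
    actions = [-1, 1]  # assumed action set for this toy sketch
    total, discount = 0.0, 1.0
    for _ in range(max_depth):
        if is_terminal(state):
            break
        a = policy(state) if policy else random.choice(actions)
        state = step(state, a)
        total += discount * reward(state)
        discount *= gamma
    return total
```

With a greedy heuristic the rollout is deterministic and cheap, which is why the choice of rollout policy matters so much for performance.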

Backpropagation

• Essentially, updating the Q values
⎯ Q(s,a) = ( N(s,a) × Q(s,a) + R ) / ( N(s,a) + 1 ), where R is the total discounted reward of this simulation
⎯ N(s) = N(s) + 1
⎯ N(s,a) = N(s,a) + 1
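The update above, applied to every node visited in the iteration, can be sketched as follows (the stats-dictionary layout is my own, not from the slides):

```python
def backpropagate(path, ret):
    """Apply Q(s,a) <- (N(s,a)*Q(s,a) + R) / (N(s,a) + 1) and bump the
    visit counters for every (s, a) record visited in this iteration.
    `path` is a list of per-edge stats dicts; `ret` is the simulation's
    total discounted reward R."""
    for stats in path:
        stats["Q"] = (stats["N_sa"] * stats["Q"] + ret) / (stats["N_sa"] + 1)
        stats["N_s"] += 1
        stats["N_sa"] += 1
```

Note the Q update is just a running mean: the old mean weighted by its sample count, plus the new return, renormalised.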