程序代写代做代考 algorithm game go AI MSBA7003 Quantitative Analysis Methods

MSBA7003 Quantitative Analysis Methods
ZHANG, Wei Assistant Professor HKU Business School
04 Monte Carlo Tree Search
1

• AlphaGo and Game AI
• Monte Carlo Tree Search • Concepts
• Search Strategy
• Simulation Strategy • Implementation
• Carpark Revenue Management
Agenda
2

AlphaGo and Game AI
• AlphaGo won Lee Sedol and Ke Jie, the top two go players.
• Monte Carlo Tree Search (MCTS) is at the heart of AlphaGo’s algorithm.
3

AlphaGo and Game AI
• Normally, a game can be solved by DP.
• DP fails if the game tree is too large or the random events follow complicated or unknown probability distributions.
4

Monte Carlo Tree Search
• The basic MCTS algorithm is to build a search tree, node by node, according to the outcomes of simulated playouts.
• The process can be broken down into four steps. Repeat
5
Selection
Expansion
Simulation
Backpropagation
1/4 1/4
1/2 0/0 0/2 1/2 0/0 0/2
1/4 2/5
1/2 0/0 0/2 2/3 0/0 0/2
0/1 1/1 0/1 0/1 0/1 1/1 0/0 0/1 0/1 0/1 1/1 0/0 0/1 0/1 0/1 1/1 1/1 0/1 0/1
0/0 0/0 0/0 0/0 0/0 1/1
Win

Monte Carlo Tree Search
• Initialization:Createtherootnode(i.e.,thelevel1decisionnode)and the child nodes (i.e., the state-of-nature nodes) following each option.
• Selection:Searchforwardbyselectingastate-of-naturenode(oran option) according to a given strategy. After that, select or create a child node (or a state of nature) through simulation. Repeat until no more child nodes are available.
• Expansion:Ifthelastnodeisnotaterminalnode,createallthestate-of- nature nodes (i.e., all the options).
• Simulation: Repeat the previous two steps (with random selection & expansion) until a terminal node is reached.
• Backpropagation: Update the summary statistics of each node along the path according to the result.
6

• Random Search
• Focusesonexploration(i.e.,lookinareasthathavenotbeenwell sampled yet) but ignores exploitation (i.e., look in areas which appear to be promising)
• The“optimal”optiongivenbythetreeintheendmaynotbetruly optimal
• Upper Confidence Bound (UCB1) Strategy
• Balancesexploitationofthecurrentlymostpromisingoptionwith
exploration of alternatives which may later turn out to be better • Converges to optimal decisions given sufficient time
Selection Strategy
7

UCB1 Strategy
• The upper confidence bound for node 𝑖 (an option) is given by
• 𝑉 𝑖
• 𝑁 𝑖
ln𝑁 𝑖
𝑛𝑖
is the average total reward/value achieved by paths after node 𝑖
(going forward)
is the number of times the parent node of 𝑖 has been visited is the number of times node 𝑖 has been visited
𝐵=𝑉+2 𝑖𝑖
• 𝑛𝑖
• Atadecisionnode,selectthechildnodethatmaximizestheUCB.
8

UCB1 Strategy
• IfmorethanonechildnodehasthesamemaximumUCBvalue,thetie is usually broken randomly.
• The first term drives exploitation and the second term drives exploration. If a child node is selected, the value of its second term will decrease but the value will increase for other child nodes.
• 𝑛𝑖 = 0 yields a UCB value of infinity, so previously unvisited child nodes are assigned the largest value in order to ensure that all child nodes are considered at least once before any child node is further expanded.
• Theformulacanbemodifiedto𝐵 =𝑉 +2𝐶 ln𝑁𝑖,where𝐶canbe 𝑖 𝑖 𝑛𝑖
adjusted to lower or increase the amount of exploration performed. If the reward is beyond the range of [0,1], the value of 𝐶 can be carefully chosen to achieve the balance.
9

Simulation Strategy
• If the state of nature is independent of previous decisions, the realization of the state can be simulated by historical data or according to certain distributions.
• If the state of nature depends on previous decisions, a model can be built to capture the dependence of the state on previous decisions.
• For games, the simulation of the opponent’s move can be done by other algorithms.
10

Implementation: Building Search Tree
• The MCTS algorithm with UCB1 selection strategy can be implemented with a table.
11
Node Index
Parent Index
Child Index
Node Type
Meaning
n
V
UCB
1
0
{2,3,4}
Decision
The root node
2
1/2

2
1
{…}
State
Option 1-1
1
1/1
2.6651
3
1
{}
State
Option 1-2
0
0

4
1
{…}
State
Option 1-3
1
0
1.6651

K +1
3
{}
Decision
State 3-1
0
0

Implementation: Building Search Tree
12
Node Index
Parent Index
Child Index
Node Type
Meaning
n
V
UCB
1
0
{2,3,4}
Decision
The root node
2
1/2

2
1
{…}
State
Option 1(a)
1
1/1
2.6651
3
1
{K+1}
State
Option 1(b)
0
0

4
1
{…}
State
Option 1(c)
1
0
1.6651

K +1
3
{}
Decision
State 3(a)
0
0

K +2
K +1
{}
State
Option (K+1)(a)
0
0

K +3
K +1
{}
State
Option (K+1)(b)
0
0

Implementation: Building Search Tree
13
Node Index
Parent Index
Child Index
Node Type
Meaning
n
V
UCB
1
0
{2,3,4}
Decision
The root node
2
1/2

2
1
{…}
State
Option 1(a)
1
1/1
2.6651
3
1
{K+1}
State
Option 1(b)
0
0

4
1
{…}
State
Option 1(c)
1
0
1.6651

K +1
3
{K+2,K+3}
Decision
State 3(a)
0
0

K +2
K +1
{}
State
Option (K+1)(a)
0
0

K +3
K +1
{}
State
Option (K+1)(b)
0
0


K’

{}
Terminal
Outcome
0
1

Implementation: Building Search Tree
14
Node Index
Parent Index
Child Index
Node Type
Meaning
n
V
UCB
1
0
{2,3,4}
Decision
The root node
3
2/3

2
1
{…}
State
Option 1(a)
1
1/1
3.0963
3
1
{K+1}
State
Option 1(b)
1
1/1
3.0963
4
1
{…}
State
Option 1(c)
1
0
2.0963

K +1
3
{K+2,K+3}
Decision
State 3(a)
1
1
1
K +2
K +1
{…}
State
Option (K+1)(a)
1
1
1
K +3
K +1
{}
State
Option (K+1)(b)
0
0


K’

{}
Terminal
Outcome
1
1
1

Implementation: Optimization
• When doing optimization:
• Selecttheoptionthatmaximizesthetotalreward/valueV
• Thestateofnaturerealizesaccordingtotherealsituation
• Ifthestatedoesnotexistinthecurrenttree,performtheMCTS algorithm in real time and then do the optimization
• Notes:
• Therootnodecanbeastate-of-naturenodeifthefirstdecision depends on the initial state
15

Carpark Revenue Management
• Company Background
• SPSisoneofthelargestcarparkoperatorsinHongKong. • Threetypesofcustomers
• Hourly parking
• Floating monthly parking
• Reserved monthly parking
• Problem to Solve
• 50 spaces, 20 floating monthly users, 10 reserved monthly users
• Maximizerevenuefromhourlyparkingwithoutaffectingmonthly customers
• Whentoshowthe“FULL”sign?
16

Carpark Revenue Management
• Challenges:
• Arrivalsandlengthsofstayforhourlycustomerswerenotrecorded
when the “FULL” sign is on
• Demandcanbedifferentfordifferentdaysanddifferenthoursofa day
• Decisionsmadeacrossadayarenotindependent.Becausethe length of stay is random, accepting too many hourly users may affect monthly users several hours later
• Solution:
• Re-generatethehourlyparkingarrivaltimeintervalsandthe
lengths of stay with the uncensored data
• UseMonteCarloTreeSearchtodecidewhethertoshowthe“FULL” sign for every 15-min time interval
17

Carpark Revenue Management
Hourly_ReGen.csv MonthlyFloating_ReGen.csv
18

Carpark Revenue Management
19

20
Carpark Revenue Management
……
A pointer to find the appropriate day in the data.
We ‘die’ if a monthly user cannot find a parking lot.
Simulation terminates at midnight.
The first element in the list.
Total number of hourly arrivals during the day.

……
Carpark Revenue Management
21
Check whether the current state has been realized before. If yes, go to that node; otherwise, build a new decision node.

Carpark Revenue Management
22

The tree after 1000 days of simulation:
Carpark Revenue Management
23
We need a really large amount of simulation to explore deeper in the tree!

• Consider this tree built by the MCTS algorithm for the carpark problem. Now is 02:30:00 pm and there are 5 monthly users and 19 hourly users in the carpark. If for the next 15 minutes, there will be 2 hourly users arriving and 0 leaving. In addition, there will be 1 monthly floating users returning and 0 leaving. What sign should be shown after 15 minutes?
• When doing optimization, if we visit a state that has never been simulated before, we should:
• A: Randomly pick an option
• B: Run the MCTS algorithm in real time

In-Class Exercise
24
(5,19)
(6,20) (6,21)
(5,19)
State: (nM=5, nH=19)
Time: 02:30:00 pm
Time: 02:45:00 pm
Available, UCB=4.5
Full, UCB=5.0
(5,18) (6,19)
AFAFAFAFAFAF 4.5 5.2 4.2 5.4 4.0 5.4 4.5 5.2 5.0 4.8 4.9 4.7