On Beyond Minimax and Games
AIMA 5.3 – 5.5
CMPSC 442
Week 5, Meeting 13, Three Segments
Outline
● Heuristic Alpha-Beta, and Other Hybrid Search
● Monte Carlo Tree Search
● Stochastic Games: Expectiminimax
● Other Variants, and Limitations
Outline, Wk 5, Mtg 13 2
On Beyond Minimax and Games
AIMA 5.3 – 5.5
CMPSC 442
Week 5, Meeting 13, Segment 1 of 3: Heuristic Alpha-Beta
and other Hybrid Search
Minimax Won’t Work for Chess or Go
● Chess and Go have far too many states to search exhaustively. Claude Shannon’s 1950
paper, Programming a Computer for Playing Chess (see AIMA sec. 5.2), suggested two
strategies:
○ Type A strategy (wide but shallow): consider all possible moves to a certain depth d,
then apply a heuristic evaluation function to the states at depth d (e.g., chess)
○ Type B strategy (deep but narrow): ignore moves that look bad and pursue
promising paths as far as possible (e.g., Go)
Heuristic Alpha-Beta 4
https://vision.unipv.it/IA1/aa2009-2010/ProgrammingaComputerforPlayingChess.pdf
Heuristic Alpha-Beta
● The H-Minimax function treats non-terminal nodes as if they were terminal (see the
recursion below)
○ Replace the terminal test with a cutoff test (e.g., use iterative deepening)
○ Replace the utility function with a heuristic evaluation function
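Following AIMA’s formulation, with d the current search depth, the recursion is:

H-Minimax(s, d) =
  Eval(s, MAX)                                              if Is-Cutoff(s, d)
  max over a in Actions(s): H-Minimax(Result(s, a), d+1)    if To-Move(s) = MAX
  min over a in Actions(s): H-Minimax(Result(s, a), d+1)    if To-Move(s) = MIN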
Heuristic Alpha-Beta 5
Evaluation Functions for H-Minimax
● Estimation of the expected utility of state s to player p
○ If Is-terminal(s), Eval(s, p) = Utility(s, p)
○ Else Utility(loss, p) ≤ Eval(s, p) ≤ Utility(win, p)
● Evaluation functions should be
○ Fast (Heuristic Alpha-Beta intended to improve Alpha-Beta performance)
○ Informed by expert knowledge
○ Often based on features that form equivalence classes of game positions
■ For a given class, experience may indicate the proportion of times games end
in a win (utility = 1.0), loss (utility = 0), or draw (utility = 0.5)
■ Then use the expected value: e.g., if feature A leads to a win 82% of the time, a loss
2%, and a draw 16%, expected value = 0.82 × 1 + 0.02 × 0 + 0.16 × 0.5 = 0.90
Heuristic Alpha-Beta 6
Using Chess as an Example
● A feature-based approach with expected values estimated from observed frequencies
requires empirical data that may be hard to collect
● Expert knowledge in chess assigns material value to individual pieces
○ Pawn, 1
○ Bishop/knight, 3
○ Rook, 5
○ Queen, 9
● Weighted linear function of a priori feature values fi (e.g., number of white
bishops), with weights wi set by expert knowledge or learned by machine learning
(see the formula below)
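The standard form is a weighted sum of the features:

Eval(s) = w1·f1(s) + w2·f2(s) + … + wn·fn(s) = Σi wi·fi(s)

e.g., fi(s) might be (number of white bishops − number of black bishops), with wi ≈ 3 as its
material value.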
Heuristic Alpha-Beta 7
What if the Features are not Independent?
● Reinforcement learning
● Deep learning
Centuries of knowledge about chess, e.g., that a bishop is worth three
pawns, can be replicated through machine learning
Heuristic Alpha-Beta 8
Issues with Chess Lead to Hybrid Approaches
● Apply Eval() only to quiescent positions, where no pending move (e.g., capturing
the queen) would swing the evaluation wildly (sketched below)
● ProbCut, a probabilistic cut algorithm (Buro, 1995)
○ Uses forward pruning with Alpha-Beta
○ Estimates the probability that a node can be safely pruned based on
statistical knowledge of game states
● Table lookup for openings and endgames, which have fewer variations
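A minimal Python sketch of the quiescence idea from the first bullet; the eval_fn,
capture_moves, and result arguments stand in for a hypothetical engine interface and are
illustrative assumptions, not any particular program’s API:

# Quiescence search: at the depth cutoff, keep expanding "noisy" moves (here,
# captures) so the static evaluation is only applied to quiet positions.
# eval_fn(state) is assumed to score the position for the side to move (negamax style).
def quiescence(state, alpha, beta, eval_fn, capture_moves, result):
    stand_pat = eval_fn(state)            # static evaluation of the current position
    if stand_pat >= beta:                 # already good enough to cause a cutoff
        return beta
    alpha = max(alpha, stand_pat)
    for move in capture_moves(state):     # only moves that could swing the value
        score = -quiescence(result(state, move), -beta, -alpha,
                            eval_fn, capture_moves, result)
        if score >= beta:
            return beta
        alpha = max(alpha, score)
    return alpha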
Heuristic Alpha-Beta 9
On Beyond Minimax and Games
AIMA 5.3 – 5.5
CMPSC 442
Week 5, Meeting 13, Segment 2 of 3: Monte Carlo Tree
Search
Why Monte Carlo Tree Search (MCTS)?
● Monte Carlo methods, named after the famous gambling locale Casino Monte-Carlo
(opened in the mid-nineteenth century to save the royal family of Monaco from
bankruptcy), estimate an expectation from repeated random simulations, i.e., from
chance
● Monte Carlo Tree Search can be used instead of minimax for games
○ More efficient for games with a high branching factor
○ No need for heuristics to inform the game state evaluation function
● Formalizes the tradeoff between exploration and exploitation
○ Similar to the motivation for changing temperature in simulated annealing
○ A key concept in reinforcement learning, which we will touch on later in the course
Monte Carlo Tree Search 11
General Framework
● From a given position sᵢ, simulate m move sequences to the end of the game
● Each simulation is called a rollout or playout
● The value of sᵢ is given by the average utility over the m simulations
● Purely random playouts work only for simple games, so we need
○ A playout policy that biases moves towards good ones, rather than picking
moves at random
○ A way to decide which positions to start playouts from, and how many to do
■ Pure Monte Carlo: do N simulations from the current state, pick the next move
with the highest win percentage (sketched below)
■ Selection policy: balances exploration and exploitation
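A minimal Python sketch of the pure Monte Carlo variant; legal_moves, result, and
random_playout stand in for a hypothetical game interface and are illustrative assumptions:

# Pure Monte Carlo move choice: run n random playouts after each legal move and
# pick the move with the highest average utility (e.g., win rate).
def pure_monte_carlo_move(state, legal_moves, result, random_playout, n=100):
    best_move, best_value = None, float("-inf")
    for move in legal_moves(state):
        total = 0.0
        for _ in range(n):
            # utility of one simulated game, from the moving player's perspective
            total += random_playout(result(state, move))
        value = total / n                  # average utility for this move
        if value > best_value:
            best_move, best_value = move, value
    return best_move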
Monte Carlo Tree Search 12
Monte Carlo Tree Search Balances Exploration/Exploitation
● Selection:
○ Apply a metric (the selection policy) to rank next moves
○ Starting at the root, repeatedly follow the highest-ranked move along a (known)
path in the MC tree until reaching a leaf
● Expansion: Add one or more new children below the leaf
● Simulation:
○ Perform a playout simulation from the new node(s) to a game end
○ Note: The simulation is not part of the tree
● Back-Propagation:
○ Incorporate the game result of the simulation into the tree by updating the
statistics of all nodes on the path back to the root
Monte Carlo Tree Search 13
Example Search Steps in MCTS
Node labels: U(n)/N(n) = wins from this node / playouts from or through this node
Monte Carlo Tree Search 14
Upper Confidence Bounds for Trees Selection Policy
● U(n) is the total utility of all playouts through node n
● N(n) is the number of playouts through node n
● N(PARENT(n)) is the number of rollouts at the parent node of n
● U(n)/N(n) is the average utility of n, i.e., the exploitation term
● The square root term is the exploration term; because N(n) is in the denominator
and log N(PARENT(n)) is in the numerator, the exploration term starts high and
goes to zero as the playout counts increase
● C is a constant that balances exploitation and exploration; values around √2 ≈ 1.4
are the usual theoretical choice, and in practice C is tuned per game (see the
formula below)
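Combining the two terms gives the UCB1 selection formula:

UCB1(n) = U(n)/N(n) + C × √( ln N(PARENT(n)) / N(n) )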
Monte Carlo Tree Search 15
Example UCB1 scores
● If C = 1.4
○ Node labeled 60/79, UCB1(n) = 1.09751
○ Node labeled 1/10, UCB1(n) = 1.05006
○ Node labeled 2/11, UCB1(n) = 1.087665
● If C = 1.5
○ Node labeled 60/79, UCB1(n) = 1.121654
○ Node labeled 1/10, UCB1(n) = 1.117921
○ Node labeled 2/11, UCB1(n) = 1.152368
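A quick check of these scores in Python; the parent playout count is assumed to be 100
(the value consistent with the scores above):

from math import log, sqrt

def ucb1(u, n, n_parent, c):
    # exploitation term + C times exploration term
    return u / n + c * sqrt(log(n_parent) / n)

for c in (1.4, 1.5):
    for u, n in ((60, 79), (1, 10), (2, 11)):
        print(f"C={c}: node {u}/{n} has UCB1 = {ucb1(u, n, 100, c):.6f}")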
Monte Carlo Tree Search 16
MCTS Pseudo Code
E.g., using UCB1 as the selection policy
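A minimal Python sketch of the four MCTS phases with UCB1 as the selection policy; the
Node class and the game interface (actions, result, simulate_playout) are illustrative
assumptions, and utilities are back-propagated without the per-player bookkeeping a real
two-player implementation would need:

from math import log, sqrt

class Node:
    def __init__(self, state, parent=None, move=None):
        self.state, self.parent, self.move = state, parent, move
        self.children = []      # expanded children
        self.untried = None     # moves not yet expanded
        self.U = 0.0            # total utility of playouts through this node
        self.N = 0              # number of playouts through this node

def ucb1(node, c=1.4):
    return node.U / node.N + c * sqrt(log(node.parent.N) / node.N)

def mcts(root_state, game, num_playouts=1000):
    root = Node(root_state)
    root.untried = list(game.actions(root_state))
    for _ in range(num_playouts):
        # 1. Selection: descend by the selection policy until reaching a node
        #    that still has untried moves (or has no children at all).
        node = root
        while not node.untried and node.children:
            node = max(node.children, key=ucb1)
        # 2. Expansion: add one new child for an untried move.
        if node.untried:
            move = node.untried.pop()
            node.children.append(Node(game.result(node.state, move), node, move))
            node = node.children[-1]
            node.untried = list(game.actions(node.state))
        # 3. Simulation: play out from the new node to a game end
        #    (the playout itself is not stored in the tree).
        utility = game.simulate_playout(node.state)
        # 4. Back-propagation: update counts on the path back to the root.
        while node is not None:
            node.N += 1
            node.U += utility
            node = node.parent
    # Return the move leading to the most-explored child (a common, robust choice).
    return max(root.children, key=lambda child: child.N).move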
Monte Carlo Tree Search 17
Performance Considerations
Consider a game with a branching factor of 32, where the average game lasts 100 ply, and
we have the computing resources to consider 10⁹ states
● Minimax can search about 6 ply deep (32⁶ ≈ 10⁹)
● Alpha-beta with good move ordering can search about 12 ply deep (effective
branching factor ≈ √32)
● MCTS can do about 10⁷ playouts (10⁹ states / ~100 states per playout)
Which is best depends on the accuracy of the evaluation function for minimax or
alpha-beta versus the quality of the selection policy for MCTS; if a single move can
change the course of the game, MCTS’s averaging over playouts might not find it.
● Algorithms can combine minimax and MCTS
Monte Carlo Tree Search 18
Connection to Reinforcement Learning
● MCTS uses a fixed metric as a “selection policy” to choose the next
move
● Reinforcement learning iterates through courses of action to train a
flexible decision policy that maximizes a long-term reward
● Similarities
○ Future moves are simulated
○ Exploration involves acquiring knowledge about unknowns, sometimes
through failure
○ Exploitation involves re-using what has been learned through trial-and-error
Monte Carlo Tree Search 19
On Beyond Minimax and Games
AIMA 5.3 – 5.5
CMPSC 442
Week 5, Meeting 13, Segment 3 of 3: Stochastic Games,
and Other Variants
Stochastic Games
● Examples of stochastic games
○ Backgammon: includes rolls of the dice
● To extend minimax to handle chance:
○ The search tree must include a new ply for
chance nodes (green circles) after every MAX
or MIN node
○ Minimax(n) has to include an expectation of
the value of a position, taking into account the
probabilities of the chance events from green
nodes
Stochastic Games 21
Expectiminimax
● If the next node s is a chance node, then:
○ Sum the values of all possible chance outcomes r at s, each weighted by its
probability P(r) (e.g., the probability of each dice roll); the full recursion is below
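Expectiminimax adds a fourth case to the minimax recursion (AIMA’s formulation):

Expectiminimax(s) =
  Utility(s, MAX)                                  if Is-Terminal(s)
  max over a: Expectiminimax(Result(s, a))         if To-Move(s) = MAX
  min over a: Expectiminimax(Result(s, a))         if To-Move(s) = MIN
  Σr P(r) × Expectiminimax(Result(s, r))           if To-Move(s) = CHANCE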
Stochastic Games 22
Evaluation Functions of Stochastic Games
● How to combine probabilities of each chance action and utility of each
leaf?
● Notice that an order-preserving transformation of the leaf payoffs (from [1, 2, 3, 4]
to [1, 20, 30, 400]) leads to different expectiminimax choices: a1 for the left tree,
a2 for the right tree (worked out below)
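A worked version of this effect, assuming (as in AIMA Fig. 5.12) that each chance node has
two outcomes with probabilities 0.9 and 0.1, and that MIN’s best replies leave the leaf
values shown; this grouping is an illustrative reconstruction of the figure:

Left tree:  value(a1) = 0.9 × 2 + 0.1 × 3 = 2.1  >  value(a2) = 0.9 × 1 + 0.1 × 4 = 1.3  → choose a1
Right tree: value(a1) = 0.9 × 20 + 0.1 × 30 = 21  <  value(a2) = 0.9 × 1 + 0.1 × 400 = 40.9  → choose a2

The ranking of the leaves is unchanged, but the chance-node expectations reverse which
move looks best.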
Stochastic Games 23
Constraint on Evaluation Functions
● For games whose payoffs are not just win/lose ([0, 1]), the evaluation function must
be a positive linear transformation of the expected utility of the position; merely
preserving the order of values is not enough at chance nodes
● Discussed in more detail after probabilistic reasoning is introduced
(e.g., Chapter 16, Making Simple Decisions (in an uncertain world))
Stochastic Games 24
Alpha-Beta Pruning Applied to Stochastic Games
● Pruning at MAX and MIN nodes works
the same as in deterministic games
● For a chance node, e.g., C:
○ It can be pruned once we can bound
its value
○ Suppose the utility function itself is
bounded, e.g., within [-1, 2]
○ Then there are upper and lower
bounds on the expectation at C even
before all of its children are examined
(see the bound below)
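Concretely, suppose C’s outcomes have probabilities p1, …, pk and the first j children have
been evaluated with values v1, …, vj; with every utility in [-1, 2]:

  Σi≤j pi·vi + (1 − Σi≤j pi) × (−1)   ≤   value(C)   ≤   Σi≤j pi·vi + (1 − Σi≤j pi) × 2

If the upper bound already falls below alpha (or the lower bound rises above beta), C can
be pruned without evaluating its remaining children.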
Stochastic Games 25
Partially Observable Games
● Deterministic actions
○ In Stratego, piece locations are known but not their identities (ranks)
○ In Battleship, opponent cannot see where player’s ships are
● Many card games have stochastic partial observability, e.g., poker
○ Computational cost of poker agent is high: In 2017, the poker program Libratus
(developed at CMU) beat several expert poker players hands down, using 25M CPU
hours on the Bridges supercomputer at U Pitt
● Video games like StarCraft are multi-agent, partially observable,
non-deterministic, dynamic and unknown; agents sometimes beat people
● People still beat agents for certain games of imperfect information (e.g., Bridge)
Stochastic Games 26
https://www.cardplayer.com/poker-news/21285-poker-bot-beats-humans-in-historic-match
Limitations of Game Search
● Optimal search for complex 2-person zero-sum games is intractable
due to large branching factor b, and average depth of game d
● Tradeoff between different kinds of algorithms
○ MCTS is best if b is high and/or evaluation function is hard to construct
○ Alpha-beta can be more precise, but heuristic alpha-beta is very sensitive
to the evaluation function, whose errors can cause bad choices
● The biggest limitation is the focus on individual moves in a game; people
reason about games at a higher level of abstraction that breaks the
overall goal of winning down into component sub-goals, such as trapping
the opponent’s queen in chess (see Chapter 11)
Stochastic Games 27
Summary
● Games are formulated as search problems by defining an initial state, legal
actions, the result of each action, a terminal test, and a utility function that
assigns a point value to terminal states
● Monte Carlo tree search evaluates states by averaging over multiple playouts
from the state to the end of the game
● Stochastic games can be handled by Expectiminimax(), where the value at a
chance node is the average of the utilities of its children weighted by the
probability of each child
● Agents have bested humans at many games including poker, and rely on
machine learning for complex games like Go.
Summary, Wk 5, Mtg 13 28