On Beyond Minimax and Games
AIMA 5.3 – 5.5

CMPSC 442
Week 5, Meeting 13, Three Segments

Outline

● Heuristic Alpha-Beta, and Other Hybrid Search
● Monte Carlo Tree Search
● Stochastic Games: Expectiminimax
● Other Variants, and Limitations

Outline, Wk 5, Mtg 13 2

On Beyond Minimax and Games
AIMA 5.3 – 5.5

CMPSC 442
Week 5, Meeting 13, Segment 1 of 3: Heuristic Alpha-Beta
and other Hybrid Search

Minimax Won’t Work for Chess or Go

● When there are too many states, Claude Shannon’s 1950 paper, Programming a
Computer for Playing Chess (see AIMA sec. 5.2), suggested:
○ Type A strategy (wide but shallow): consider all possible moves to a certain
depth d, then apply a heuristic evaluation function to the states at depth d
(e.g., chess)
○ Type B strategy (deep but narrow): ignore moves that look bad and pursue
promising paths as far as possible (e.g., Go)

Heuristic Alpha-Beta 4

https://vision.unipv.it/IA1/aa2009-2010/ProgrammingaComputerforPlayingChess.pdf

Heuristic Alpha-Beta

● The H-Minimax function treats non-terminal nodes as if they were
terminal (see the formula sketch below)
○ Replace the terminal test with a cutoff test (e.g., use iterative deepening)
○ Replace the utility function with a heuristic evaluation function
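A sketch of the recurrence in AIMA’s notation, where d is the current search depth:

\[
\text{H-Minimax}(s, d) =
\begin{cases}
\text{Eval}(s, \text{MAX}) & \text{if Is-Cutoff}(s, d) \\
\max_{a \in \text{Actions}(s)} \text{H-Minimax}(\text{Result}(s, a), d+1) & \text{if To-Move}(s) = \text{MAX} \\
\min_{a \in \text{Actions}(s)} \text{H-Minimax}(\text{Result}(s, a), d+1) & \text{if To-Move}(s) = \text{MIN}
\end{cases}
\]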

Heuristic Alpha-Beta 5

Evaluation Functions for H-Minimax

● Estimation of the expected utility of state s to player p
○ If Is-terminal(s), Eval(s, p) = Utility(s, p)
○ Else Utility(loss, p) ≤ Eval(s, p) ≤ Utility(win, p)

● Evaluation functions should be
○ Fast (Heuristic Alpha-Beta intended to improve Alpha-Beta performance)
○ Informed by expert knowledge
○ Often based on features that form equivalence classes of game positions

■ For a given class, experience may indicate the proportion of times games end
in a win (utility = 1.0), loss (utility = 0), or draw (utility = 0.5)
■ Then use the expected value: e.g., if feature A leads to a win 82% of the time,
a loss 2%, and a draw 16%, the expected value = 0.82 × 1 + 0.02 × 0 + 0.16 × 0.5 = 0.90

Heuristic Alpha-Beta 6

Using Chess as an Example

● A feature based approach with expected value based on frequency requires
empirical data that may be hard to collect

● Expert knowledge in chess assigns material value to individual pieces
○ Pawn, 1
○ Bishop/knight, 3
○ Rook, 5
○ Queen, 9

● Eval is often a weighted linear function of a priori feature values f (e.g., the number
of white bishops), with the features f and weights w set by expert knowledge or
learned by machine learning (see the formula below)
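The weighted linear form described above, written out for n features f_i with weights w_i:

\[
\text{Eval}(s) = w_1 f_1(s) + w_2 f_2(s) + \cdots + w_n f_n(s) = \sum_{i=1}^{n} w_i f_i(s)
\]

For chess material, for example, f_i could be the difference in the number of pieces of
one kind between the two players, and w_i that piece’s material value.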

Heuristic Alpha-Beta 7

What if the Features are not Independent?

● Reinforcement learning
● Deep learning

Centuries of knowledge about chess, e.g., that a bishop is worth three
pawns, can be replicated through machine learning

Heuristic Alpha-Beta 8

Issues with Chess Lead to Hybrid Approaches

● Apply Eval() only to quiescent positions, where there is no pending
move that shifts the game wildly (e.g., capturing the queen)

● ProbCut, a probabilistic cut algorithm (Buro, 1995)
○ Uses forward pruning with Alpha-Beta
○ Estimates the probability that a node can be safely pruned based on
statistical knowledge of game states

● Table lookup for openings and endgames, which have fewer variations

Heuristic Alpha-Beta 9

On Beyond Minimax and Games
AIMA 5.3 – 5.5

CMPSC 442
Week 5, Meeting 13, Segment 2 of 3: Monte Carlo Tree
Search

Why Monte Carlo Tree Search (MCTS)?

● Monte Carlo methods, named after the famous Casino de Monte-Carlo (opened in the
mid-nineteenth century to save the royal family of Monaco from bankruptcy), estimate
an expectation from repeated random simulations, i.e., from chance

● Monte Carlo Tree Search can be used instead of minimax for games
○ More efficient for games with a high branching factor
○ No need for heuristics to inform the game state evaluation function

● Formalizes the tradeoff between exploration versus exploitation
○ Similar to the motivation for changing temperature in simulated annealing
○ A key concept in reinforcement learning, which we will touch on later in the course

Monte Carlo Tree Search 11

General Framework

● From a given position s, simulate m move sequences to the end of the game
● Each simulation is called a rollout or playout
● The value of s is given by the average utility over the m simulations
● Random playouts work only for simple games, so we need
○ A playout policy that biases moves toward good ones, rather than picking
moves at random
○ To decide what start positions to use for playouts, and how many to do
■ Pure Monte Carlo: do N simulations, pick the next move with the highest win %
■ Selection policy: balances exploration and exploitation

Monte Carlo Tree Search 12

Monte Carlo Tree Search Balances Explo(r it)ation

● Selection:
○ Apply a metric (selection policy) to rank next moves
○ Take each next move in a (known) playout (a path in the MC tree) down to a leaf
● Expansion: Add one or more new children below the leaf
● Simulation:
○ Perform a playout simulation from the new node(s) to a game end
○ Note: the simulation is not part of the tree
● Back-Propagation:
○ Incorporate the game result of the simulation into the tree by updating all
the nodes on the path back to the root

Monte Carlo Tree Search 13

Example Search Steps in MCTS

[Figure: the selection, expansion, simulation, and back-propagation steps of MCTS]
Node labels: U(n)/N(n) = wins from this node / playouts from or through this node

Monte Carlo Tree Search 14


Upper Confidence Bounds for Trees Selection Policy

● U(n) is the total utility of all playouts through node n
● N(n) is the number of playouts through node n
● N(PARENT(n)) is the number of rollouts at the parent node of n

● U(n)/N(n) is the average utility of n, i.e., the exploitation term
● The square root term is the exploration term; because N(n) is in the denominator
and log N(PARENT(n)) is in the numerator, the exploration term starts high and
goes to zero as the counts increase
● C is a constant that balances exploitation and exploration; values around
√2 ≈ 1.4 are a common, theoretically motivated choice (see the formula below)
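Written out, the UCB1 selection formula these bullets describe, in AIMA’s notation:

\[
\text{UCB1}(n) = \frac{U(n)}{N(n)} + C \sqrt{\frac{\log N(\text{PARENT}(n))}{N(n)}}
\]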

Monte Carlo Tree Search 15


Example UCB1 scores

● If C = 1.4
○ Node labeled 60/79, UCB1(n) = 1.09751
○ Node labeled 1/10, UCB1(n) = 1.05006
○ Node labeled 2/11, UCB1(n) = 1.087665

● If C = 1.5
○ Node labeled 60/79, UCB1(n) = 1.121654
○ Node labeled 1/10, UCB1(n) = 1.117921
○ Node labeled 2/11, UCB1(n) = 1.152368
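To sanity-check these scores, here is a minimal Python sketch of the UCB1 calculation.
It assumes the parent node has 100 playouts, which is consistent with the values above
but is not stated on this slide.

import math

def ucb1(u, n, parent_n, c):
    # Average utility (exploitation) plus the exploration bonus.
    return u / n + c * math.sqrt(math.log(parent_n) / n)

# Reproduce the scores above; the parent playout count of 100 is an assumption.
for c in (1.4, 1.5):
    for u, n in ((60, 79), (1, 10), (2, 11)):
        print(f"C={c}: node {u}/{n}, UCB1 = {ucb1(u, n, 100, c):.6f}")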

Monte Carlo Tree Search 16

MCTS Pseudo Code

E.g., use UCB1 as the selection policy (a sketch follows below)
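The pseudocode figure is not reproduced here; below is a minimal Python sketch of the
four MCTS phases with UCB1 as the selection policy. It is illustrative rather than AIMA’s
pseudocode, and the game interface it uses (actions, result, is_terminal, utility, to_move)
is an assumed, hypothetical API.

import math
import random

class Node:
    """One node in the Monte Carlo search tree."""
    def __init__(self, state, parent=None, action=None):
        self.state = state
        self.parent = parent
        self.action = action        # move that led from the parent to this node
        self.children = []
        self.untried = None         # actions not yet expanded into children
        self.U = 0.0                # total utility of playouts through this node
        self.N = 0                  # number of playouts through this node

def ucb1(node, c=1.4):
    """Selection policy: average utility plus exploration bonus."""
    if node.N == 0:
        return float("inf")         # always try an unvisited child first
    return node.U / node.N + c * math.sqrt(math.log(node.parent.N) / node.N)

def mcts(root_state, game, n_playouts=1000, c=1.4):
    root = Node(root_state)
    root.untried = list(game.actions(root_state))
    player = game.to_move(root_state)
    for _ in range(n_playouts):
        # 1. Selection: descend by UCB1 until reaching a node with untried actions
        node = root
        while not node.untried and node.children:
            node = max(node.children, key=lambda ch: ucb1(ch, c))
        # 2. Expansion: add one new child for an untried action
        if node.untried and not game.is_terminal(node.state):
            action = node.untried.pop()
            child = Node(game.result(node.state, action), parent=node, action=action)
            child.untried = list(game.actions(child.state))
            node.children.append(child)
            node = child
        # 3. Simulation: random playout (not stored in the tree) to the end of the game
        state = node.state
        while not game.is_terminal(state):
            state = game.result(state, random.choice(list(game.actions(state))))
        # Utility from the root player's point of view; a full two-player version
        # would credit each node from the perspective of the player who moved into it.
        reward = game.utility(state, player)
        # 4. Back-propagation: update counts on the path back to the root
        while node is not None:
            node.N += 1
            node.U += reward
            node = node.parent
    # Return the move leading to the most-played child
    return max(root.children, key=lambda ch: ch.N).action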

Monte Carlo Tree Search 17

Performance Considerations

Consider a game with a branching factor of 32, where the average game lasts 100 ply, and
we have the computing resources to consider 10⁹ states per move (see the arithmetic below)

● Minimax can search about 6 ply deep
● Alpha-beta with perfect move ordering can search about 12 ply deep
● MCTS can do about 10⁷ playouts
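A quick check of these figures, assuming alpha-beta achieves perfect move ordering
(which roughly doubles the reachable depth) and that each playout visits about 100 states:

\[
32^{6} \approx 1.07 \times 10^{9} \text{ states}, \qquad
\sqrt{32}^{\,12} = 32^{6} \approx 10^{9} \text{ states}, \qquad
\frac{10^{9} \text{ states}}{100 \text{ states/playout}} = 10^{7} \text{ playouts}
\]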

Which is best depends on the accuracy of the evaluation function (for minimax or
alpha-beta) versus the quality of the selection policy (for MCTS); if a single move can
change the course of the game, MCTS might not find it.

● Algorithms can combine minimax and MCTS

Monte Carlo Tree Search 18

Connection to Reinforcement Learning

● MCTS uses a fixed metric as a “selection policy” to choose the next move

● Reinforcement learning iterates through courses of action to train a flexible
decision policy that maximizes a long-term reward

● Similarities
○ Future moves are simulated
○ Exploration involves acquiring knowledge about unknowns, sometimes through failure
○ Exploitation involves re-using what has been learned through trial and error

Monte Carlo Tree Search 19

On Beyond Minimax and Games
AIMA 5.3 – 5.5

CMPSC 442
Week 5, Meeting 13, Segment 3 of 3: Stochastic Games,
and Other Variants

Stochastic Games

● Examples of stochastic games
○ Backgammon: includes rolls of the dice

● To extend minimax to handle chance:
○ The search tree must include a new ply of chance nodes (the green circles in
the figure) after every MAX or MIN node
○ Minimax(n) has to include an expectation over the values of positions, taking
into account the probabilities of the chance events at the chance nodes

Stochastic Games 21

Expectiminimax

● If the next node s is a chance node, then:
○ Sum over all possible chance outcomes r at s, weighting the value of each
resulting position by the probability P(r) of that outcome (e.g., a dice roll),
as in the recurrence below
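A sketch of the full Expectiminimax recurrence in AIMA’s notation:

\[
\text{Expectiminimax}(s) =
\begin{cases}
\text{Utility}(s, \text{MAX}) & \text{if Is-Terminal}(s) \\
\max_{a} \text{Expectiminimax}(\text{Result}(s, a)) & \text{if To-Move}(s) = \text{MAX} \\
\min_{a} \text{Expectiminimax}(\text{Result}(s, a)) & \text{if To-Move}(s) = \text{MIN} \\
\sum_{r} P(r)\, \text{Expectiminimax}(\text{Result}(s, r)) & \text{if To-Move}(s) = \text{CHANCE}
\end{cases}
\]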

Stochastic Games 22

Evaluation Functions of Stochastic Games

● How to combine probabilities of each chance action and utility of each
leaf?

● Notice that order-preserving payoff values in the two trees below ([1, 2, 3, 4] vs.
[1, 20, 30, 400]) lead to different expectiminimax choices: a1 for the left tree, a2
for the right tree

Stochastic Games 23

Constraint on Evaluation Functions

● For games whose payoffs are not just win/lose (0 or 1), the evaluation function must
be a positive linear transformation of the expected utility of a position, so that the
expectiminimax values at chance nodes are meaningful

● Discussed in more detail after probabilistic reasoning is introduced
(e.g., Chapter 16, Making Simple Decisions (in an uncertain world))

Stochastic Games 24

Alpha-Beta Pruning Applied to Stochastic Games

● Alpha-beta at MAX and MIN nodes
works the same as in deterministic games

● For chance nodes, e.g., C:
○ C can be pruned if we can find an upper bound on its value
○ Consider the bounds on the utility function, e.g., [-1, 2]
○ Then, even before all of C’s children are examined, there are upper and lower
bounds on the expectation at C (see the example below)
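A hypothetical numeric example of these bounds (the values are chosen for illustration,
not taken from the slide): suppose utilities lie in [-1, 2], and chance node C has two
equally likely children, the first of which has already been evaluated to 0.5. Then

\[
0.5(0.5) + 0.5(-1) \;\le\; \text{value}(C) \;\le\; 0.5(0.5) + 0.5(2),
\qquad \text{i.e., } -0.25 \le \text{value}(C) \le 1.25
\]

so C can be pruned without evaluating its second child whenever this interval lies
entirely outside the current (α, β) window.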

Stochastic Games 25

Partially Observable Games

● Deterministic actions
○ In Stratego, piece locations are known but not their identities (ranks)
○ In Battleship, the opponent cannot see where the player’s ships are

● Many card games have stochastic partial observability, e.g., poker
○ The computational cost of a poker agent is high: in 2017, the poker program Libratus
(developed at CMU) beat several expert poker players hands down, using 25M CPU
hours on the Bridges supercomputer at U Pitt

● Video games like StarCraft are multi-agent, partially observable,
non-deterministic, dynamic, and unknown; agents sometimes beat people

● People still beat agents at certain games of imperfect information (e.g., Bridge)

Stochastic Games 26

https://www.cardplayer.com/poker-news/21285-poker-bot-beats-humans-in-historic-match

Limitations of Game Search

● Optimal search for complex 2-person zero-sum games is intractable
due to large branching factor b, and average depth of game d

● Tradeoff between different kinds of algorithms
○ MCTS is best if b is high and/or an evaluation function is hard to construct
○ Alpha-beta can be more precise, but heuristic alpha-beta is very sensitive
to the evaluation function, whose average error could cause bad choices

● The biggest limitation is the focus on individual moves in a game; people
reason about games at a higher level of abstraction that breaks the overall
goal of winning down into component sub-goals, such as trapping the
opponent’s queen in chess (see Chapter 11)

Stochastic Games 27

Summary

● Games are formulated as search problems by defining an initial state, legal
actions, the result of each action, a terminal test, and a utility function that
assigns a point value to terminal states

● Monte Carlo tree search evaluates states by averaging over multiple playouts
from the state to the end of the game

● Stochastic games can be handled by Expectiminimax(), where the value at a
chance node is the average of the utilities of its children weighted by the
probability of each child

● Agents have bested humans at many games including poker, and rely on
machine learning for complex games like Go.

Summary, Wk 5, Mtg 13 28