Machine Learning in Game Search
Many design decisions need to be fine-tuned in game playing agents
Can we automatically tune these decisions?
This week will give you a basic introduction to:
• Supervised learning using gradient descent search
• Temporal difference learning in games
Challenges in game agent design
♢ Handling each stage of the game separately
→ compile a "book" of opening games or end games
♢ Adjusting search parameters
→ search control learning
♢ Adjusting weights in evaluation functions
♢ Finding good features for evaluation functions
Book learning
Aim: learn sequence of moves for important positions
Example 1: In chess, there are books of opening moves
e.g. for every position seen in an opening game,
remember the move taken and the final outcome
Example 2: Learn from mistakes
e.g. identify moves that lead to a loss,
and whether there was a better alternative
Question: How do we recognise which moves were important?
Search control learning
Aim: learn how to make search more efficient
Example 1: Order of move generation affects α–β pruning
e.g. learn a preferred order for generating possible moves
to maximise pruning of subtrees
Example 2: Can we vary the cut-off depth?
e.g. learn a classifier to predict what depth we should search to
based on current states
Learning evaluation function weights
Aim: adjust weights in evaluation function based on experience of their ability
to predict the true final utility
Typically use a linear weighted sum of features
Eval(s) = w1f1(s) + w2f2(s) + … + wnfn(s)
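To make this concrete, here is a minimal sketch of such an evaluation function in Python; the function name and the feature names in the usage comment are hypothetical placeholders, not part of the lecture material.

```python
# Minimal sketch of a linear weighted evaluation function.
# The feature functions and weights are illustrative placeholders only.

def evaluate(state, weights, features):
    """Eval(s) = w1*f1(s) + w2*f2(s) + ... + wn*fn(s)."""
    return sum(w * f(state) for w, f in zip(weights, features))

# Hypothetical usage for a chess-like game:
#   features = [material_balance, mobility, king_safety]   # functions of a state
#   weights  = [1.0, 0.1, 0.5]
#   score = evaluate(current_state, weights, features)
```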
Gradient descent learning
In supervised learning, we are given a training set of examples
D = {d1, d2, …}, where each training example d comprises a vector of inputs
along with the desired output t, i.e.
d = ⟨x1, …, xn, t⟩
In the case of learning the weights of an evaluation function, the training example corresponds to the set of features for a state s, with the true minimax utility value of the state as desired output, i.e.
d = ⟨f1(s), …, fn(s), U(s)⟩
The aim is to learn a set of weights w = {w1, …, wn} so that the actual
output z = Eval(s;w) closely approximates the desired (true) output
t = U(s) on the training examples, and hopefully on new states as well.
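For instance, such a training example could be assembled from a game state as follows; this is a minimal sketch, where make_example and the feature functions are hypothetical names and the utility value would come from the true minimax value or game outcome.

```python
# A training example d = <f1(s), ..., fn(s), U(s)> stored as a (features, target) pair.
# The feature functions and the utility value are illustrative placeholders.

def make_example(state, features, true_utility):
    """Build (x, t): the feature vector of a state plus its true utility U(s)."""
    return ([f(state) for f in features], true_utility)

# Hypothetical usage:
#   d = make_example(some_state, [material_balance, mobility], true_utility=+1.0)
```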
Gradient descent learning
How do we quantify how well the actual output z approximates the desired output t?
We define an error function E to be (half) the sum over all training examples of the square of the difference between the actual output z and the desired output t:
E = ½ Σd (td − zd)²
If we think of E as height, it defines an error landscape on the weight
space. The aim is to find a set of weight values for which E is very low
on the training examples. This is done by repeatedly moving in the steepest downhill direction, i.e. in the direction of the negative gradient −∂E/∂wi:
wi ← wi − ∂E/∂wi
The gradient descent approach is the basis of the back-propagation algorithm in neural networks for supervised learning
Example error landscape
An example error landscape for gradient descent search in weight space
Weight update rule
To derive the weight update rule, we need to remember the chain rule from
differential calculus:
if y = y(u) and u = u(x)
then ∂y/∂x = (∂y/∂u)(∂u/∂x)
So, if z = Eval(s;w) and t = true utility of state s
∂E/∂wi = ∂/∂z [½ (t − z)²] · ∂z/∂wi
= (z − t) ∂z/∂wi
= (z − t) fi(s)
Normally we include a learning rate parameter η (e.g. η = 0.1) to control
the update of the weights in a stable manner
wi ← wi − η (z − t)fi(s)
We repeatedly update the weights based on each example
until the weights converge
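Putting the pieces together, here is a minimal sketch of this training loop in Python; the example data, learning rate and epoch count are made up for illustration.

```python
# Sketch of the gradient descent weight update for a linear evaluation function.
# Each training example is (features, target) where features = [f1(s), ..., fn(s)]
# and target = U(s), the true utility. All numbers below are illustrative.

def train_weights(examples, n_features, eta=0.1, epochs=100):
    weights = [0.0] * n_features
    for _ in range(epochs):
        for features, target in examples:
            # actual output z = Eval(s; w)
            z = sum(w * f for w, f in zip(weights, features))
            # update rule: wi <- wi - eta * (z - t) * fi(s)
            weights = [w - eta * (z - target) * f
                       for w, f in zip(weights, features)]
    return weights

# Illustrative data whose targets happen to follow 2*f1 - 1*f2:
examples = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
print(train_weights(examples, n_features=2))   # approaches [2.0, -1.0]
```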
Learning from game outcomes
Delayed reinforcement: reward resulting from an action may not be received until several time steps later, which also slows down learning
Credit assignment: need to know which action(s) was responsible for the outcome
Temporal difference learning
Supervised learning is for single-step prediction
e.g. predict Saturday’s weather based on Friday’s weather
Temporal Difference (TD) learning is for multi-step prediction
e.g. predict Saturday’s weather based on Monday’s weather,
then update prediction based on Tue’s, Wed’s, …
e.g. predict outcome of game based on first move,
then update prediction as more moves are made
• Correctness of prediction not known until several steps later
• Intermediate steps provide information about correctness of prediction
• TD learning is a form of reinforcement learning
• Tesauro (1992) applied TD learning to Backgammon
i.e. given state of game, predict probability of winning
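To make the distinction concrete, here is a minimal sketch (not from the lecture) that updates a table of win predictions either towards the next prediction, TD style, or directly towards the final outcome; the states, initial values and learning rate are made up.

```python
# Minimal sketch contrasting TD-style updates (move each prediction towards the
# next prediction) with supervised updates (move every prediction towards the
# final outcome). States, initial values and the learning rate are illustrative.

def td_update(values, visited, final_reward, eta=0.1):
    for i in range(len(visited) - 1):
        s, s_next = visited[i], visited[i + 1]
        values[s] += eta * (values[s_next] - values[s])
    # the last prediction is corrected against the actual outcome
    values[visited[-1]] += eta * (final_reward - values[visited[-1]])

def supervised_update(values, visited, final_reward, eta=0.1):
    for s in visited:
        values[s] += eta * (final_reward - values[s])

# One game visiting three states and ending in a win (+1):
values = {"s1": 0.0, "s2": 0.5, "s3": 0.0}
td_update(values, ["s1", "s2", "s3"], final_reward=1.0)
print(values)
```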
TDLeaf(λ) algorithm
Proposed in [Baxter, Tridgell and Weaver 1998]
Combines temporal difference learning with minimax search
Basic idea: update the weights in the evaluation function to reduce the differences in rewards predicted at successive positions in the search
→ a good evaluation function should be stable from one move to the next
TDLeaf(λ) algorithm – notation
eval(s, w): evaluation function for state s with parameters w = [w1, …, wk]
s1, …, sN : the N states that occurred during a game
r(sN): reward based on outcome of game {+1, 0,−1}
s^l_i: best leaf found at the maximum cut-off depth using minimax search starting at state si
We can convert an evaluation score into a reward using
r(s^l_i, w) = tanh(eval(s^l_i, w))
where tanh squashes the evaluation score into the range [−1,+1]
TDLeaf(λ) algorithm
For i = 1, …, N − 1:
compute the temporal difference between successive states
di = r(s^l_{i+1}, w) − r(s^l_i, w)
Update each weight parameter wj as follows:
wj ← wj + η Σ_{i=1}^{N−1} [ ∂r(s^l_i, w)/∂wj · Σ_{m=i}^{N−1} λ^{m−i} dm ]
where η is the learning rate
if λ = 0: weights adjusted to move r(s^l_i, w) towards r(s^l_{i+1}, w)
i.e. the predicted reward at the next state
if λ = 1: weights adjusted to move r(s^l_i, w) towards r(sN)
i.e. the final true reward (better if eval() is unrealistic)
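A minimal sketch of this update in Python, under the assumption of a linear evaluation function squashed by tanh (so ∂r/∂wj = (1 − r²) fj(s^l_i)); the function names, data and parameter values are hypothetical, not taken from the original paper's implementation.

```python
import math

# Sketch of one TDLeaf(lambda) update after a game, assuming a linear evaluation
# function squashed by tanh. 'leaf_features[i]' is the feature vector of the best
# leaf s^l_i found by minimax from position s_i; 'final_reward' is r(s_N) in
# {+1, 0, -1}. Names, eta and lambda values are illustrative.

def leaf_reward(features, weights):
    """r(s^l_i, w) = tanh(eval(s^l_i, w))."""
    return math.tanh(sum(w * f for w, f in zip(weights, features)))

def tdleaf_update(weights, leaf_features, final_reward, eta=0.01, lam=0.7):
    n = len(leaf_features)                          # positions s_1, ..., s_N
    r = [leaf_reward(f, weights) for f in leaf_features]
    r[-1] = final_reward                            # true outcome at the end of the game
    d = [r[i + 1] - r[i] for i in range(n - 1)]     # temporal differences d_i
    new_weights = list(weights)
    for j in range(len(weights)):
        total = 0.0
        for i in range(n - 1):
            # d/dwj of tanh(eval) = (1 - r^2) * fj(s^l_i) for a linear eval
            grad = (1.0 - r[i] ** 2) * leaf_features[i][j]
            # lambda-discounted sum of future temporal differences
            td_sum = sum((lam ** (m - i)) * d[m] for m in range(i, n - 1))
            total += grad * td_sum
        new_weights[j] = weights[j] + eta * total
    return new_weights

# Illustrative usage: three positions, two features, game ends in a win (+1)
print(tdleaf_update([0.5, -0.2],
                    [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]],
                    final_reward=1.0))
```

With λ = 0 the inner sum keeps only di, so each prediction moves towards the next one; with λ = 1 the sum telescopes to the final reward minus the current prediction, matching the two cases described above.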
Environment for learning
♢ Learning from labelled examples
♢ Learning by playing against a skilled opponent
♢ Learning by playing against random moves
♢ Learning by playing against yourself
Summary
♢ Introduction to the role of learning in games
♢ Supervised learning using gradient descent search
♢ Temporal difference learning in games using TDLeaf(λ) algorithm
Examples of skills expected:
♢ Discuss opportunities for learning in game playing
♢ Explain differences between supervised and temporal difference learning
♢ I do not expect you to derive or memorise the TDLeaf(λ) weight update rule, but if given this rule I may ask you to explain what the main terms mean