
Machine Learning in Game Search



Many design decisions need to be fine-tuned in game playing agents

Can we automatically tune these decisions?

This week will give you a basic introduction to:

• Supervised learning using gradient descent search

• Temporal difference learning in games


Challenges in game agent design

♢ Handling each stage of the game separately
→ compile a "book" of opening games or endgames

♢ Adjusting search parameters
→ search control learning

♢ Adjusting weights in evaluation functions

♢ Finding good features for evaluation functions


Book learning

Aim: learn sequence of moves for important positions

Example 1: In chess, there are books of opening moves
e.g. for every position seen in an opening game,
remember the move taken and the final outcome

Example 2: Learn from mistakes
e.g. identify moves that lead to a loss,
and whether there was a better alternative

Question: How do we recognise which moves were important?


Search control learning

Aim: learn how to make search more efficient

Example 1: Order of move generation affects α–β pruning
e.g. learn a preferred order for generating possible moves
to maximise pruning of subtrees (see the sketch below)

Example 2: Can we vary the cut-off depth?
e.g. learn a classifier to predict what depth we should search to
based on current states
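To illustrate the move-ordering idea in Example 1, here is a minimal Python sketch (not from the slides): candidate moves are sorted by a learned scoring heuristic before α–β search expands them, so stronger moves are examined first and more subtrees can be pruned. The helpers generate_moves and predicted_value are hypothetical placeholders.

# Minimal sketch: order moves by a learned heuristic before alpha-beta search.
# generate_moves(state) and predicted_value(state, move) are hypothetical
# placeholders for a real move generator and a learned move-ordering model.

def ordered_moves(state, generate_moves, predicted_value):
    """Return legal moves sorted so the most promising are searched first."""
    moves = generate_moves(state)
    return sorted(moves, key=lambda m: predicted_value(state, m), reverse=True)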


Learning evaluation function weights

Aim: adjust weights in evaluation function based on experience of their ability
to predict the true final utility

Typically use a linear weighted sum of features

Eval(s) = w1f1(s) + w2f2(s) + … + wnfn(s)
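A minimal Python sketch of such a weighted sum (illustrative only); the feature values f1(s), …, fn(s) are assumed to have been computed already.

# Minimal sketch of Eval(s) = w1*f1(s) + w2*f2(s) + ... + wn*fn(s).
# `feature_values` is assumed to be the list [f1(s), ..., fn(s)] for a state s.

def eval_state(feature_values, weights):
    """Linear weighted sum of a state's feature values."""
    return sum(w * f for w, f in zip(weights, feature_values))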


Gradient descent learning

In supervised learning, we are given a training set of examples
D = {d1, d2, …}, where each training example d comprises a vector of inputs
along with the desired output t, i.e.

d = ⟨x1, …, xn, t⟩

In the case of learning the weights of an evaluation function, the training
example corresponds to the set of features for a state s, with the true minimax
utility value of the state as desired output, i.e.

d = ⟨f1(s), …, fn(s), U(s)⟩

The aim is to learn a set of weights w = {w1, …, wn} so that the actual
output z = Eval(s;w) closely approximates the desired (true) output
t = U(s) on the training examples, and hopefully on new states as well.


Gradient descent learning

How do we quantify how well the actual output z approximates the desired output t?

We define an error function E to be (half) the sum over all training
examples of the square of the difference between the actual output z and the
desired output t, i.e.

E = ½ Σd∈D (t − z)²

If we think of E as height, it defines an error landscape on the weight
space. The aim is to find a set of weight values for which E is very low
on the training examples. This is done by moving in the steepest downhill
direction, i.e. along the negative gradient −∂E/∂wi

wi ← wi − ∂E/∂wi

The gradient descent approach is the basis of the back-propagation
algorithm in neural networks for supervised learning


Example error landscape

An example error landscape for gradient descent search in weight space


Weight update rule

To derive the weight update rule, we need to remember the chain rule from
differential calculus:

if y = y(u) and u = u(x)

then ∂y/∂x = (∂y/∂u)(∂u/∂x)

So, if z = Eval(s;w) and t = true utility of state s, then for each example

∂E/∂wi = ∂/∂z [½ (t − z)²] · ∂z/∂wi
       = (z − t) ∂z/∂wi
       = (z − t) fi(s)

since ∂z/∂wi = fi(s) for the linear evaluation function.

Normally we include a learning rate parameter η (e.g. η = 0.1) to control
the update of the weights in a stable manner

wi ← wi − η (z − t)fi(s)

We repeatedly update the weights based on each example
until the weights converge
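A minimal Python sketch of this training loop, under the assumption that each training example is a pair (feature vector of s, true utility U(s)); the learning rate and number of passes are illustrative choices, not values from the slides.

# Minimal sketch of gradient descent on evaluation-function weights.
# Each example is assumed to be ([f1(s), ..., fn(s)], U(s)).

def train_weights(examples, weights, eta=0.1, epochs=100):
    """Repeatedly apply wi <- wi - eta * (z - t) * fi(s) over the examples."""
    for _ in range(epochs):
        for features, t in examples:
            # actual output z = Eval(s; w)
            z = sum(w * f for w, f in zip(weights, features))
            # update each weight using the gradient of the squared error
            for i, fi in enumerate(features):
                weights[i] -= eta * (z - t) * fi
    return weights

In practice training would stop when the weights (or the error E) stop changing, rather than after a fixed number of passes.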


Delayed reinforcement: reward resulting from an action may not be
received until several time steps later, which also slows down learning

Credit assignment: need to know which action(s) were responsible
for the outcome


Temporal difference learning

Supervised learning is for single-step prediction
e.g. predict Saturday’s weather based on Friday’s weather

Temporal Difference (TD) learning is for multi-step prediction
e.g. predict Saturday’s weather based on Monday’s weather,
then update prediction based on Tue’s, Wed’s, …
e.g. predict outcome of game based on first move,
then update prediction as more moves are made

• Correctness of prediction not known until several steps later

• Intermediate steps provide information about correctness of prediction

• TD learning is a form of reinforcement learning

• Tesauro (1992) applied TD learning to Backgammon
i.e. given state of game, predict probability of winning
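As a small illustration of how intermediate steps carry information, the sketch below (not Tesauro's system) computes the differences between successive predictions of the final outcome; these temporal differences are exactly what drives the TDLeaf(λ) updates on the following slides. The predict function is a hypothetical model.

# Minimal sketch: temporal differences between successive predictions.
# predict(state) is a hypothetical model returning a predicted final outcome.

def temporal_differences(predict, states):
    """Return d_i = predict(s_{i+1}) - predict(s_i) for each successive pair."""
    predictions = [predict(s) for s in states]
    return [predictions[i + 1] - predictions[i]
            for i in range(len(predictions) - 1)]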


TDLeaf(λ) algorithm

Proposed in [Baxter, Tridgell and Weaver 1998]

Combines temporal difference learning with minimax search

Basic idea: update weight in evaluation function to reduce differences in
rewards predicted at different levels in search tree
→ good functions should be stable from one move to next


TDLeaf(λ) algorithm – notation

eval(s, w): evaluation function for state s with parameters w = [w1, …, wk]

s1, …, sN : the N states that occurred during a game

r(sN): reward based on outcome of game {+1, 0, −1}

sli: best leaf found at max cut-off depth using minimax search starting at si

We can convert an evaluation score into a reward using

r(sli, w) = tanh(eval(sli, w))

where tanh squashes the evaluation score into the range [−1,+1]
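A one-line illustration of this squashing (assuming the evaluation score is an ordinary real number):

import math

def leaf_reward(eval_score):
    """Squash an evaluation score into the range [-1, +1]."""
    return math.tanh(eval_score)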


TDLeaf(λ) algorithm

For i = 1, …, N − 1
Compute temporal difference between successive states

di = r(sli+1, w) − r(sli, w)

Update each weight parameter wj as follows

wj ← wj + η Σi=1…N−1 [ ∂r(sli, w)/∂wj  Σm=i…N−1 λ^(m−i) dm ]

where η is the learning rate

if λ = 0: weights adjusted to move r(sli, w) towards r(sli+1, w)
i.e. the predicted reward at the next state

if λ = 1: weights adjusted to move r(sli, w) towards r(sN)
i.e. the final true reward (better if eval() is unreliable)
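A minimal Python sketch of one such update after a single game, assuming the leaf rewards r(sli, w) and the gradients ∂r(sli, w)/∂wj have already been computed (here they are passed in as plain lists); the learning rate and λ values are illustrative, and this is not Baxter et al.'s actual implementation.

# Minimal sketch of a TDLeaf(lambda)-style weight update after one game.
# leaf_rewards[i]        ~ r(sl_i, w) for the N positions of the game
# reward_gradients[i][j] ~ d r(sl_i, w) / d w_j (assumed precomputed)

def tdleaf_update(weights, leaf_rewards, reward_gradients, eta=0.01, lam=0.7):
    """Apply wj <- wj + eta * sum_i dr(sl_i,w)/dwj * sum_{m>=i} lam^(m-i) * d_m."""
    n = len(leaf_rewards)
    # temporal differences d_i = r(sl_{i+1}, w) - r(sl_i, w)
    d = [leaf_rewards[i + 1] - leaf_rewards[i] for i in range(n - 1)]
    for j in range(len(weights)):
        total = 0.0
        for i in range(n - 1):
            # lambda-weighted sum of current and future temporal differences
            future = sum((lam ** (m - i)) * d[m] for m in range(i, n - 1))
            total += reward_gradients[i][j] * future
        weights[j] += eta * total
    return weights

Setting lam = 0 reduces the inner sum to the single next-step difference d_i, matching the λ = 0 case above.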


Environment for learning

♢ Learning from labelled examples

♢ Learning by playing against a skilled opponent

♢ Learning by playing against random moves

♢ Learning by playing against yourself


Summary

♢ Introduction to the role of learning in games

♢ Supervised learning using gradient descent search

♢ Temporal difference learning in games using TDLeaf(λ) algorithm

Examples of skills expected:

♢ Discuss opportunities for learning in game playing

♢ Explain differences between supervised and temporal difference learning

♢ I do not expect you to derive or memorise the TDLeaf(λ) weight update
rule, but if given this rule I may ask you to explain what the main terms mean

