COMP30024 Artificial Intelligence
Week 6 Problem Sheet
Weekly Topic Review
Last week you investigated pruning strategies, which exclude from search those regions of the state space that provably cannot contain the optimal solution, but which may still need to search an unacceptably large region of the search tree. Heuristic methods trade off optimality for speed: recognizing that it is infeasible to exhaustively search an exponentially large state space, they instead exclude regions of the state space that probably do not contain the optimal solution.
The idea is to pretend non-terminal states are leaf nodes, and to use a heuristic evaluation function Eval to evaluate the 'utility' of these non-terminal states. The recursive definition of the heuristic minimax value of a state σ is given below. Search proceeds as normal from the current state until some cutoff criterion is satisfied, based on the search depth, computation time, or other properties of the state, and the values are propagated back up through the truncated tree as in regular minimax.
H-MM(σ) =
    Eval(σ),                      if Cutoff-Test(σ),
    max_{a ∈ A(σ)} H-MM(σ′),      if σ is a MAX node, where σ′ = a(σ),
    min_{a ∈ A(σ)} H-MM(σ′),      if σ is a MIN node, where σ′ = a(σ).
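As a concrete illustration, the following is a minimal Python sketch of this truncated search. It assumes a hypothetical game object exposing actions(state), result(state, action) and is_max_turn(state), with the cutoff test and evaluation function supplied by the caller; none of these names come from the sheet.

    def h_minimax(game, state, depth, cutoff_test, evaluate):
        """Return the heuristic minimax value H-MM of `state`."""
        # Treat the state as a leaf once the cutoff criterion is met
        # (here a depth limit, but time or state properties also work).
        if cutoff_test(state, depth):
            return evaluate(state)

        # Recursively evaluate each successor state σ′ = a(σ).
        values = [h_minimax(game, game.result(state, a), depth + 1,
                            cutoff_test, evaluate)
                  for a in game.actions(state)]

        # MAX maximises over successor values, MIN minimises.
        return max(values) if game.is_max_turn(state) else min(values)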
The heuristic evaluation function Eval returns an estimate of the expected utility of state σ for a given player, i.e. of the minimax value that would be propagated back up to σ. It is desirable for Eval to preserve the ordering of terminal states under the true utility, and to be strongly correlated with the chance of winning for non-terminal states.
Evaluation functions typically make use of a feature vector x = (x_1, …, x_D) ∈ R^D, where each dimension encapsulates certain properties of the state in a real number. The evaluation function may take the form of a linear combination of features, composed with a nonlinear function g:
Eval(σ; θ) = g(⟨w, x⟩ + b) = g(Σ_{k=1}^{D} w_k x_k + b),    θ = {w, b}.
Here w ∈ R^D, b ∈ R are the parameters of Eval (subsumed in θ), which may be learnt adaptively.
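For concreteness, here is a minimal Python sketch of such an evaluation function, assuming a hypothetical extract_features function mapping a state to its feature vector x ∈ R^D, and taking g = tanh so that outputs lie in (−1, 1) to match utilities in {−1, 1}; both choices are illustrative rather than prescribed by the sheet.

    import numpy as np

    def evaluate(state, w, b, extract_features):
        """Linear evaluation Eval(σ; θ) = g(⟨w, x⟩ + b) with g = tanh."""
        x = extract_features(state)       # feature vector x = (x_1, ..., x_D)
        return np.tanh(np.dot(w, x) + b)  # squash into (-1, 1)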
One common approach is to specify an objective to be minimized under desirable behaviour of the evaluation function. It is desirable for Eval(σ; θ) to be close to the backed-up utility for σ; hence, for a terminal state, a natural objective is the square of the difference between the true utility of that state, U(σ) ∈ {−1, 1} for a loss/win respectively, and the output of Eval for that state. The values of the parameters can be adaptively learnt from experience using gradient descent with learning rate η:
l(θ; σ) = (Eval(σ; θ) − U(σ))²,    θ ← θ − η ∇_θ l(θ; σ).
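Continuing the sketch above, a single gradient-descent step on this objective for the tanh-linear Eval might look as follows; here x is the feature vector of a terminal state and u ∈ {−1, 1} its true utility, and the tanh choice is again an assumption for illustration.

    import numpy as np

    def sgd_step(w, b, x, u, eta):
        """One update θ ← θ − η ∇_θ l(θ; σ) for l = (Eval − U)²."""
        pred = np.tanh(np.dot(w, x) + b)          # Eval(σ; θ)
        # Chain rule: d/dz (tanh(z) − u)² = 2 (tanh(z) − u)(1 − tanh(z)²)
        dz = 2.0 * (pred - u) * (1.0 - pred ** 2)
        w = w - eta * dz * x                      # gradient w.r.t. w is dz · x
        b = b - eta * dz                          # gradient w.r.t. b is dz
        return w, b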
One problem is that only states σ for which U(σ) can be readily obtained (such as end-game states) can be used to derive exact parameter updates. This greatly limits the number of states that can be used for training, but temporal difference learning may be used to mitigate this. Problem 2 on this sheet sketches the motivation for the TD(λ) scheme.
1. MENACE [T] One of the pioneers of machine learning in AI was a British researcher called Donald Michie, who worked with Alan Turing as a code-breaker during World War 2. He developed a game-playing agent called the Matchbox Educable Noughts and Crosses Engine1 (MENACE), which could learn to play a highly effective game of Noughts and Crosses from experience alone – an early example of reinforcement learning2.
For each possible state of the game, MENACE maintains a set of counters {c_1, . . . , c_n}. The counters c_i define a categorical distribution over possible moves. In other words, when deciding which move to play in a given state, it picks move i with probability
p_i = c_i / Σ_{j=1}^{9} c_j.
If a square i is occupied in a particular state, such that move i in that state is illegal, then the counter c_i = 0 for that state.
During a game, MENACE records the sequence of states and moves taken in the game. At the end of the game, for each state in the game where it played a move, MENACE updates the counter c_i corresponding to the move i made in that state as follows (a short sketch of this bookkeeping follows the list below):
• If MENACE won the game, increase the counter in that state, c_i ← c_i + ∆, ∆ > 0.
• If MENACE lost the game, decrease the counter in that state, c_i ← c_i − ∆′, ∆′ > 0.
• If the game was a draw, leave the counter in that state unchanged.
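A minimal Python sketch of the bookkeeping described above, assuming states are hashable (e.g. a 9-tuple of cell contents, with None for an empty square) and moves are square indices 0–8; the initial counter value and the magnitudes of ∆ and ∆′ are illustrative choices, not taken from the sheet.

    import random
    from collections import defaultdict

    INITIAL, DELTA, DELTA_PRIME = 4, 1, 1            # illustrative values only

    counters = defaultdict(lambda: [INITIAL] * 9)    # {c_1, ..., c_9} per state

    def choose_move(state):
        """Sample move i with probability c_i / Σ_j c_j over legal moves.
        Restricting the sample to empty squares is equivalent to setting
        c_i = 0 for occupied squares, as in the problem statement."""
        cs = counters[state]
        moves = [i for i in range(9) if state[i] is None]
        return random.choices(moves, weights=[cs[i] for i in moves], k=1)[0]

    def update(history, outcome):
        """history: list of (state, move) pairs; outcome: 'win', 'loss' or 'draw'."""
        for state, move in history:
            if outcome == 'win':
                counters[state][move] += DELTA
            elif outcome == 'loss':
                # Keep counters non-negative so the distribution stays valid.
                counters[state][move] = max(0, counters[state][move] - DELTA_PRIME)
            # Draws leave the counters unchanged.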
(a) Derive a nontrivial upper bound on the size of the state space for Noughts and Crosses. Use this to derive an upper bound on the memory requirements of MENACE for Noughts and Crosses.
1 https://people.csail.mit.edu/brooks/idocs/matchbox.pdf
2As Michie did not have a computer available, the agent was literally implemented using matchboxes and beads.
(b) After a large number of games, what effect do you expect these updates to have?
(c) Should you always update counters in the visited states by the same amount? Explain your answer.
(d) An upper bound on the size of the state space for Chess is ≈ 10^46, with an estimated average branching factor of 35. Estimate the memory requirements of applying MENACE to Chess. Is a chess-playing MENACE agent feasible?
(e) For games with large state spaces, memory requirements may be infeasibly large to execute a MENACE-type algorithm. Discuss how you would use a suitably chosen heuristic function to emulate the behaviour of MENACE using a reasonable amount of memory.
(f) What are the strengths or weaknesses of applying the MENACE learning technique to the game we are using in the project?
2. TD(λ) [T] In this problem we walk through a derivation of the TD(λ) update rule. For a general state s, and parametric evaluation function Eval(s; θ) we would like Eval(s; θ) ≈ U(s), where U(s) is the backed-up utility of the terminal state that arises from playing the game out from state s, using minimax up to depth l with the evaluation function Eval for both players.
We have to play s out to completion in order to make a single weight update based on s. But playing out s to completion generates multiple intermediate states {st, st+1, . . . , sN } that we may use to refine our weight updates.
(a) Assume the objective function is l(θ) = ½ (U(s) − Eval(s; θ))². Show that the gradient descent weight update from playing a state s until completion is:
θ ← θ + η (U(s) − Eval(s; θ)) ∇_θ Eval(s; θ)
(b) By using minimax with depth l to look ahead, we only calculate rewards from states of depth l in the future. If s^l is the state whose reward ultimately propagates back to s via minimax, the weight update rule is:
θ ← θ + η (U(s) − Eval(s^l; θ)) ∇_θ Eval(s^l; θ)
To be clear, here Eval(s^l; θ) is the evaluation obtained for state s by applying Eval to the leaf nodes of a depth-l minimax search from s. The weight update rule tries to make Eval(s^l; θ) a good approximation to the true terminal utility. If {s_t, s_{t+1}, . . . , s_N} are the states encountered in a training game starting at time step t, we use the evaluation of the terminal state, Eval(s_N^l; θ), in place of the true utility of s_t. The contribution to the parameter update by passing through state s_t is
θ ← θ + η (Eval(s_N^l; θ) − Eval(s_t^l; θ)) ∇_θ Eval(s_t^l; θ)
Show that the total update to the parameter vector for all N − 1 moves made from state s_1 is:
θ ← θ + η Σ_{i=1}^{N−1} (Eval(s_N^l; θ) − Eval(s_i^l; θ)) ∇_θ Eval(s_i^l; θ)
(c) Finally, we note that Eval(s_N^l; θ) − Eval(s_i^l; θ) is a telescoping series:
Eval(s_N^l; θ) − Eval(s_i^l; θ) = Σ_{m=i}^{N−1} (Eval(s_{m+1}^l; θ) − Eval(s_m^l; θ))
Define the temporal differences d_m = Eval(s_{m+1}^l; θ) − Eval(s_m^l; θ). Show that the update rule reduces to:
θ ← θ + η Σ_{i=1}^{N−1} ∇_θ Eval(s_i^l; θ) Σ_{m=i}^{N−1} d_m
(d) We can insert a decay term λ ∈ [0, 1] which controls the effect of this telescoping:
θ ← θ + η Σ_{i=1}^{N−1} ∇_θ Eval(s_i^l; θ) Σ_{m=i}^{N−1} λ^{m−i} d_m
This is the final update rule for TDLeaf(λ). Identify the update rules for the λ = 0 and λ = 1 cases. How does the behaviour of the updates qualitatively change as we increase λ?
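For reference, here is a minimal Python sketch of this TDLeaf(λ) update, assuming the leaf evaluations Eval(s_i^l; θ) and their gradients have already been recorded as evals[i] and grads[i] for the N states of a training game (0-indexed here; all names are illustrative, not prescribed by the sheet).

    import numpy as np

    def tdleaf_update(theta, evals, grads, eta, lam):
        """Apply θ ← θ + η Σ_i ∇_θ Eval(s_i^l; θ) Σ_{m≥i} λ^{m−i} d_m."""
        N = len(evals)
        # Temporal differences d_m = Eval(s_{m+1}^l; θ) − Eval(s_m^l; θ)
        d = [evals[m + 1] - evals[m] for m in range(N - 1)]
        for i in range(N - 1):
            decayed = sum(lam ** (m - i) * d[m] for m in range(i, N - 1))
            theta = theta + eta * grads[i] * decayed
        return theta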