
COMP30024 Artificial Intelligence
Week 6 Problem Sheet Solutions
• Tutors: These are not unique, complete (and definitely not optimally pedagogical) solutions. There are other ways of deriving these results, and the way you prefer is to some extent an aesthetic judgement.
• Students: If you want to attain any benefit from the problems you should make a genuine attempt at them before you see the solutions.


Comments/corrections are as always greatly appreciated by the teaching team.
1. MENACE [T] One of the pioneers of machine learning in AI was a British researcher called Donald Michie, who worked with Alan Turing as a code-breaker during World War 2. He developed a game-playing agent called the Matchbox Educable Noughts and Crosses Engine [1] (MENACE) which could learn to play a highly effective game of Noughts and Crosses from experience alone – an early example of reinforcement learning [2].
For each possible state of the game, MENACE maintains a set of counters $\{c_1, \ldots, c_n\}$. The counters $c_i$ define a categorical distribution over possible moves. In other words, when deciding which move to play in a given state, it picks move $i$ with probability
$$p_i = \frac{c_i}{\sum_{j=1}^{9} c_j}.$$
If a square $i$ is occupied in a particular state, such that move $i$ in that state is illegal, the counter $c_i = 0$ for that state.
During a game, MENACE records the sequence of states and moves taken in the game. At the end of the game, for each state in which it played a move, MENACE updates the counter $c_i$ corresponding to the move $i$ made in that state as follows (a minimal code sketch of this select-and-update cycle appears after the list):
• If MENACE won the game, increase the counter in that state: $c_i \leftarrow c_i + \Delta$, $\Delta > 0$.
• If MENACE lost the game, decrease the counter in that state: $c_i \leftarrow c_i - \Delta'$, $\Delta' > 0$.
• If the game was a draw, leave the counter in that state unchanged.

[1] https://people.csail.mit.edu/brooks/idocs/matchbox.pdf
[2] As Michie did not have a computer available, the agent was literally implemented using matchboxes and beads.
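A minimal Python sketch of the select-and-update cycle described above. The state key, the initial counter value, and the particular values of $\Delta$ and $\Delta'$ are illustrative assumptions rather than part of MENACE itself:

```python
import random

INITIAL_COUNT = 4        # assumed starting number of "beads" per legal move
WIN_DELTA = 3            # assumed value of Delta
LOSS_DELTA = 1           # assumed value of Delta'

counters = {}            # state key -> list of 9 counters, one per square


def get_counters(state_key, legal_moves):
    """Lazily create counters for a state; illegal moves get counter 0."""
    if state_key not in counters:
        counters[state_key] = [INITIAL_COUNT if i in legal_moves else 0
                               for i in range(9)]
    return counters[state_key]


def select_move(state_key, legal_moves):
    """Sample move i with probability c_i / sum_j c_j."""
    c = get_counters(state_key, legal_moves)
    if sum(c) == 0:                      # all beads removed: fall back to uniform
        return random.choice(legal_moves)
    return random.choices(range(9), weights=c, k=1)[0]


def update_counters(history, outcome):
    """history: list of (state_key, move) pairs visited in one game."""
    for state_key, move in history:
        c = counters[state_key]
        if outcome == "win":
            c[move] += WIN_DELTA
        elif outcome == "loss":
            c[move] = max(0, c[move] - LOSS_DELTA)   # keep counters non-negative
        # draw: leave counters unchanged
```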
(a) Derive a nontrivial upper bound on the size of the state space for Noughts and Crosses. Use this to derive an upper bound on the memory requirements of MENACE for Noughts and Crosses.
• Naïvely $3^9 = 19683$ states (9 squares, 3 possible states for each square).
– Note the number of possible games > the number of unique states, as a game considers state ordering.
• How to do better? Note the game finishes after at most 9 moves (5 by X, 4 by O).
– How many ways to place the first X on the board? $\binom{9}{1} = 9$.
– How many ways to place the first O? $\binom{9}{1} \times \binom{8}{1} = 72$.
– How many ways to place the second X? $\binom{9}{2} \times \binom{7}{1}$.
– How many ways to place the second O? $\binom{9}{2} \times \binom{7}{2}$.
– How many ways to place the third X? $\binom{9}{3} \times \binom{6}{2}$.
– How many ways to place the third O? $\binom{9}{3} \times \binom{6}{3}$, etc.
– Summing all of these (plus the empty board) gives 6046.
• This can be reduced a bit more via symmetry.
There are at most 9 possible moves in each state, so MENACE needs to store at most 9 counters per state – $O(B)$ counters in total (and $O(B)$ bits, up to the constant factor of the counter width), where $B$ is the upper bound on the number of states.
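For reference, the 6046 figure can be reproduced with a short computation. This is a sketch only: it counts one entry for the empty board and ignores both symmetry and positions that are unreachable because the game ended early.

```python
from math import comb

total = 1                      # the empty board
xs = os = 0
for ply in range(9):           # X moves on even plies, O on odd plies
    if ply % 2 == 0:
        xs += 1
    else:
        os += 1
    total += comb(9, xs) * comb(9 - xs, os)   # choose X squares, then O squares

print(total)                   # 6046
```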
(b) After a large number of games, what effect do you expect these updates to have?
After a large number of games, the probability of taking a move that leads to a loss should go to zero, and the probability of taking a move that leads to a win should approach 1, so that the agent will favor the optimal action. This assumes that every move in every possible state has been visited sufficiently many times so that we can obtain a viable estimate of its usefulness.
(c) Should you always update counters in the visited states by the same amount? Explain your answer.
The magnitude of ∆ can be interpreted as a 'learning rate': initially set ∆ large enough to move quickly away from the uniform probabilities (which also helps escape local optima), then anneal the learning rate to smaller values over training to prevent the probabilities from oscillating after a large number of games, e.g. $\Delta = C/t$ for a constant $C$ and $t$ the number of games played.
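As a small illustrative sketch of such a schedule (the constant C and the way it is plugged into the counter update are assumptions, not prescribed values):

```python
def annealed_delta(t, C=9.0):
    """Assumed schedule Delta = C / t after t completed games (t >= 1)."""
    return C / max(t, 1)

# e.g. inside the win-update from question 1:
#   c[move] += annealed_delta(games_played)
```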
(d) An upper bound on the size of the state space for Chess is $\approx 10^{46}$, with an estimated average branching factor of 35. Estimate the memory requirements of applying MENACE to Chess. Even disregarding memory requirements, do you think a MENACE-based chess engine is feasible?
About $7.5 \times 10^{45} \times \{\text{size of the data structure needed to represent one state}\}$ bits, i.e. $O(10^{33})$ TB. Many hard drives.
Even ignoring the issue of memory, the MENACE engine must have visited a significant proportion of the state space sufficiently many times to obtain 'good statistics' for each state, telling it whether a particular move in that state will lead to a favourable outcome. The state space for Chess is so large that, assuming it takes O(1) FLOPs to visit and process a single state, a 1 YottaFLOP processor would need $\approx 10^{20}$ seconds to visit 1% of the estimated chess state space once.
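As a rough check of that figure (taking 1 YottaFLOP as $10^{24}$ floating-point operations per second and 1% of the $10^{46}$ states):

$$\frac{0.01 \times 10^{46}\ \text{states}}{10^{24}\ \text{states/s}} = 10^{20}\ \text{s} \approx 3 \times 10^{12}\ \text{years}.$$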
(e) For games with large state spaces, memory requirements may be infeasibly large to execute a MENACE-type algorithm. Discuss how you would use a suitably chosen heuristic function to emulate the behaviour of MENACE using a reasonable amount of memory.
Assuming the heuristic function is an acceptable approximation to the utility function, one possible idea is to replace the counters $c_i$ in state $\sigma_i$ with the heuristic minimax values $h_{MM}$ of the child nodes of $\sigma_i$. One may, e.g., apply alpha-beta search until some cutoff condition (e.g. execution time, depth limit), then back up heuristic values from the leaf nodes to the children of $\sigma_i$ to obtain $h_{MM}$. We can then select child $j$ with probability (for suitably chosen $\beta > 0$):
$$p_j = \frac{\exp\!\big(-\beta\, h_{MM}(\sigma_j)\big)}{\sum_{k \in \mathrm{children}(\sigma_i)} \exp\!\big(-\beta\, h_{MM}(\sigma_k)\big)}$$
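A minimal sketch of this softmax selection, assuming the backed-up values $h_{MM}$ for the children are already available (function and variable names are illustrative):

```python
import numpy as np

def select_child(h_values, beta=1.0, rng=None):
    """Sample child index j with probability proportional to exp(-beta * h_MM(sigma_j))."""
    rng = rng or np.random.default_rng()
    logits = -beta * np.asarray(h_values, dtype=float)
    logits -= logits.max()            # shift for numerical stability
    p = np.exp(logits)
    p /= p.sum()
    return rng.choice(len(h_values), p=p)
```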
Other possibilities are to combine MCTS with heuristic evaluation, and instead use an exploration vs. exploitation based approach to successor selection. The key in all realistic game-playing agents is to avoid the exponential forking of states in the game tree by only looking a handful of plies ahead.
(f) What are the strengths or weaknesses of applying the MENACE learning technique to the game we are using in the project?
D-I-S-C-U-S-S-I-O-N.
2. TD(λ) [T] In this problem we walk through the TD(λ) update rule, based on a derivation by a former COMP30024 tutor.
Recall for a general state s, and parametric evaluation function Eval(s;θ) we would like Eval(s; θ) ≈ U(s), where U(s) is the backed-up utility of the terminal state that arises from playing the game out from state s, using minimax up to depth l with the evaluation function Eval for both players.
We have to play s out to completion in order to make a single weight update based on s. But playing out s to completion generates multiple intermediate states {st, st+1, . . . , sN } that we may use to refine our weight updates.
(a) Assume the objective function is $\ell(\theta; s) = \frac{1}{2}\big(U(s) - \mathrm{Eval}(s; \theta)\big)^2$. Show that the gradient descent weight update from playing a state $s$ until completion is:
$$\theta \leftarrow \theta + \eta\, \big(U(s) - \mathrm{Eval}(s; \theta)\big)\, \nabla_\theta \mathrm{Eval}(s; \theta)$$
This follows from the chain rule:
$$\nabla_\theta \ell(\theta; s) = -\big(U(s) - \mathrm{Eval}(s; \theta)\big)\, \nabla_\theta \mathrm{Eval}(s; \theta)$$
Substituting into the update rule $\theta \leftarrow \theta - \eta \nabla_\theta \ell(\theta; s)$ gives the result.
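For concreteness, a sketch of this update for a linear evaluation function $\mathrm{Eval}(s; \theta) = \theta \cdot \phi(s)$, where $\phi(s)$ is an assumed feature vector for the state (the linear form is an illustrative choice, not required by the question):

```python
import numpy as np

def gd_update(theta, phi_s, utility, eta=0.01):
    """One step of theta <- theta + eta * (U(s) - Eval(s; theta)) * grad Eval.

    For a linear Eval, grad_theta Eval(s; theta) = phi(s).
    """
    eval_s = float(theta @ phi_s)
    return theta + eta * (utility - eval_s) * phi_s
```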

(b) By using minimax with depth $l$ to look ahead, we only calculate rewards from states of depth $l$ in the future. If $s^l$ is the state whose reward ultimately propagates back to $s$ via minimax, the weight update rule is:
$$\theta \leftarrow \theta + \eta\, \big(U(s) - \mathrm{Eval}(s^l; \theta)\big)\, \nabla_\theta \mathrm{Eval}(s^l; \theta)$$
To be clear, here $\mathrm{Eval}(s^l; \theta)$ is the evaluation obtained for state $s$ by applying Eval to the leaf nodes of a depth-$l$ minimax search from $s$. The weight update rule tries to make $\mathrm{Eval}(s^l; \theta)$ a good approximation to the true terminal utility. If $\{s_t, s_{t+1}, \ldots, s_N\}$ are the states encountered in a training game starting at time step $t$, we define the evaluation function at the terminal state, $\mathrm{Eval}(s_N^l; \theta)$, to be the utility of that terminal state. The contribution to the parameter update from passing through state $s_t$ is
$$\theta \leftarrow \theta + \eta\, \big(\mathrm{Eval}(s_N^l; \theta) - \mathrm{Eval}(s_t^l; \theta)\big)\, \nabla_\theta \mathrm{Eval}(s_t^l; \theta)$$
Show that the total update to the parameter vector over all $N - 1$ moves made from state $s_1$ is:
$$\theta \leftarrow \theta + \eta \sum_{i=1}^{N-1} \big(\mathrm{Eval}(s_N^l; \theta) - \mathrm{Eval}(s_i^l; \theta)\big)\, \nabla_\theta \mathrm{Eval}(s_i^l; \theta)$$
The parameter update over all $N - 1$ moves is the sum of the incremental updates $\Delta\theta_t$ due to passing through each state $s_t$:
$$\theta \leftarrow \theta + \sum_{t=1}^{N-1} \Delta\theta_t = \theta + \eta \sum_{i=1}^{N-1} \big(\mathrm{Eval}(s_N^l; \theta) - \mathrm{Eval}(s_i^l; \theta)\big)\, \nabla_\theta \mathrm{Eval}(s_i^l; \theta)$$
(c) Finally, we note that $\mathrm{Eval}(s_N^l; \theta) - \mathrm{Eval}(s_i^l; \theta)$ is a telescoping series:
$$\mathrm{Eval}(s_N^l; \theta) - \mathrm{Eval}(s_i^l; \theta) = \sum_{m=i}^{N-1} \big(\mathrm{Eval}(s_{m+1}^l; \theta) - \mathrm{Eval}(s_m^l; \theta)\big)$$
Define the temporal differences $d_m = \mathrm{Eval}(s_{m+1}^l; \theta) - \mathrm{Eval}(s_m^l; \theta)$. Show that the update rule reduces to:
$$\theta \leftarrow \theta + \eta \sum_{i=1}^{N-1} \nabla_\theta \mathrm{Eval}(s_i^l; \theta) \left(\sum_{m=i}^{N-1} d_m\right)$$
This is straightforward from the definitions.
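Spelling out the step labelled straightforward: substituting the telescoping identity into the result of part (b) gives
$$\theta \leftarrow \theta + \eta \sum_{i=1}^{N-1} \nabla_\theta \mathrm{Eval}(s_i^l; \theta)\, \big(\mathrm{Eval}(s_N^l; \theta) - \mathrm{Eval}(s_i^l; \theta)\big) = \theta + \eta \sum_{i=1}^{N-1} \nabla_\theta \mathrm{Eval}(s_i^l; \theta) \sum_{m=i}^{N-1} d_m.$$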

(d) We can insert a decay term λ ∈ [0, 1] which controls the effect of this telescoping:
$$\theta \leftarrow \theta + \eta \sum_{i=1}^{N-1} \nabla_\theta \mathrm{Eval}(s_i^l; \theta) \left(\sum_{m=i}^{N-1} \lambda^{m-i}\, d_m\right)$$
This is the final update rule for TDLeaf(λ). Identify the update rules for the λ = 0 and λ = 1 cases. How does the behaviour of the updates qualitatively change as we increase λ?
For λ = 0 (noting that $\lim_{x \to 0^+} x^0 = 1$), only the $m = i$ term in the inner sum is non-zero, and the update looks like:
$$\theta \leftarrow \theta + \eta \sum_{t=1}^{N-1} \nabla_\theta \mathrm{Eval}(s_t^l; \theta)\, \big[\mathrm{Eval}(s_{t+1}^l; \theta) - \mathrm{Eval}(s_t^l; \theta)\big]$$
This is known as the one-step temporal difference update: training is guided to minimize the change in reward between consecutive states. For λ = 1, the telescope collapses to yield:
$$\theta \leftarrow \theta + \eta \sum_{t=1}^{N-1} \nabla_\theta \mathrm{Eval}(s_t^l; \theta)\, \big[U(s) - \mathrm{Eval}(s_t^l; \theta)\big]$$
Here training is guided to minimize the difference between each state $s_t$'s predicted reward and the final state's utility at time step $N$. This is equivalent to gradient descent using only terminal state utilities as targets.
Setting 0 < λ < 1 controls how far into the future the algorithm should look ahead in determining its update. Increasing λ interpolates between the short-sighted one-step λ = 0 behaviour and the long-sighted λ = 1 behaviour.
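A minimal sketch of the full TDLeaf(λ) update, assuming the leaf evaluations $\mathrm{Eval}(s_i^l; \theta)$ and their gradients have already been computed by the depth-$l$ search (all names, and the default values of η and λ, are illustrative):

```python
import numpy as np

def tdleaf_update(theta, leaf_evals, leaf_grads, eta=0.01, lam=0.7):
    """Apply the TDLeaf(lambda) update for one training game.

    leaf_evals[i] = Eval(s_{i+1}^l; theta)            (length-N sequence of floats)
    leaf_grads[i] = grad_theta Eval(s_{i+1}^l; theta) (one gradient vector per state)
    """
    N = len(leaf_evals)
    d = np.diff(leaf_evals)                     # d_m = Eval(s_{m+1}^l) - Eval(s_m^l)
    for i in range(N - 1):
        decay = lam ** np.arange(N - 1 - i)     # lambda^{m-i} for m = i, ..., N-2
        theta = theta + eta * leaf_grads[i] * np.dot(decay, d[i:])
    return theta
```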