Adaptive Intelligence
Lecturer: Professor Eleni Vasilaki
Assignment
April 4, 2019
1. Reinforcement Learning
A 4×4 chessboard is automatically generated and three pieces, 1x King (1), 1x Queen (2) and 1x Opponent’s King (3), are placed in random locations on the board. In the initial positions none of the pieces is under threat. Assuming the role of Player 1 (white), the task is to checkmate the Opponent’s King (Player 2, black). The game progresses as follows:
1. Player 1 performs an action and moves the King or the Queen, aiming for a checkmate.
2. Player 2 performs a move by randomly moving the Opponent’s King to a position so that it is not threatened by Player 1’s King or Queen.
3. These two steps are repeated until:
• It is the turn of Player 2, the Opponent’s King is currently threatened, and no action can make it escape the threat (checkmate).
• It is the turn of Player 2, the Opponent’s King is not currently threatened, but any action would put it under threat (draw).
Your task is to implement a deep neural network capable of learning to perform checkmate. The features representing the states are denoted by a vector s(t), whose dimensionality is 3N² + 2, where N × N is the size of the board and 3 is the number of pieces involved in this task. The position of each piece is encoded by a binary matrix C_ij (the chessboard), where c_ij = 1 if that particular piece is in position ij and c_ij = 0 otherwise. The +2 terms in the dimension of s(t) contain the degrees of freedom of the opponent king, that is, the number of available squares to which the opponent king is allowed to move, and a binary variable that is one if the opponent king is checked.
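To make the encoding concrete, the following minimal sketch (Python/NumPy) shows how such a vector could be assembled. The provided template already constructs s(t) for you; every name below (build_state, dof, check, the piece positions) is purely illustrative.

```python
import numpy as np

def build_state(N, king_pos, queen_pos, opp_king_pos, dof, check):
    """Assemble a state vector s(t) of length 3*N*N + 2 (illustrative only).

    king_pos, queen_pos, opp_king_pos: (row, col) of the three pieces.
    dof:   number of squares the opponent king is allowed to move to.
    check: 1 if the opponent king is currently checked, 0 otherwise.
    """
    planes = np.zeros((3, N, N))                  # one binary N x N plane per piece
    for k, (i, j) in enumerate([king_pos, queen_pos, opp_king_pos]):
        planes[k, i, j] = 1                       # c_ij = 1 where that piece sits
    return np.concatenate([planes.flatten(), [dof, check]])

# Example: 4x4 board, arbitrary non-threatening positions, 3 free squares, not checked
s = build_state(4, (0, 0), (1, 2), (3, 3), dof=3, check=0)
assert s.shape == (3 * 4 * 4 + 2,)                # dimensionality 3N^2 + 2
```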
We have provided documented code implementing the game (Python files inside chess student python.zip or MATLAB files inside chess student matlab.zip). Within these files, you only need to implement the update rules for the backprop and TD algorithms (for more information on the algorithms please see Appendix A; for more information about the chess game please see Appendix B). You are allowed to use (modified where appropriate) code from earlier Labs that you have developed on your own or from the Lab solutions we have provided to you.
Your specific tasks for this assignment are:
1. Describe the Q-learning and SARSA algorithms. Explain the differences between the two methods and their possible advantages/disadvantages. No coding is needed to answer this question.
2. A template code of the chess game is provided. Fill in the gaps in the program files chess and Q values that are provided; detailed instructions are available inside the files (chess is the main script, start from there). You do not need to edit the rest of the files, and information about their output(s) is provided inside the code. Use the suggested parameter values. In a plot, show the reward per game and the number of moves per game. These plots will be noisy; to make sense of them, use a moving average to smooth them (with parameter α close to 0, i.e. 1/10000; see the slides of Lecture 8 on the Exponential Moving Average and the smoothing sketch after this list).
3. Change the discount factor γ and the speed β of the decaying trend of ε (for a definition of β please see the code). Try at least one value different from the suggested one for each of γ and β, and plot the number of moves per game vs. the number of games and the reward per game vs. the number of games (as before, with a moving average). Give an explanation of your network’s performance, even if the learning is not successful.
4. Implement the SARSA version of the algorithm, compare the new results with the Q-learning results and explain them. To compare the two algorithms you need to plot the learning curves of your network for both cases in one plot. [Note: it is likely that any performance difference is due to noise, as you do not vary the initial conditions.]
5. The values of the weights could explode during learning; this issue is called exploding gradients. Explain one possible way to fix this issue (for example, search and study RMSprop) and implement it (see also the sketch after this list). Select an appropriate way to show in a plot how your solution fixed this issue.
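For the smoothing mentioned in tasks 2 and 3, a minimal sketch of an exponential moving average is given below; the function name and the default α are illustrative, and the definition used in the course is the one given in the Lecture 8 slides.

```python
import numpy as np

def exponential_moving_average(values, alpha=1e-4):
    """Smooth a noisy per-game curve: m[t] = (1 - alpha) * m[t-1] + alpha * x[t]."""
    smoothed = np.empty(len(values))
    m = values[0]                          # initialise with the first sample
    for t, x in enumerate(values):
        m = (1.0 - alpha) * m + alpha * x
        smoothed[t] = m
    return smoothed

# Usage: plot the smoothed reward-per-game curve instead of the raw one, e.g.
# plt.plot(exponential_moving_average(reward_per_game, alpha=1/10000))
```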
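For task 5, one common remedy for exploding gradients is to rescale each gradient by a running average of its squared magnitude (RMSprop); gradient clipping is another option. The sketch below only illustrates the RMSprop idea, with illustrative variable names, and is not a prescribed implementation.

```python
import numpy as np

def rmsprop_update(w, grad, cache, eta=1e-3, rho=0.9, eps=1e-8):
    """One RMSprop step: the running average `cache` of squared gradients
    divides the step size, so unusually large gradients are damped."""
    cache = rho * cache + (1.0 - rho) * grad ** 2
    w = w - eta * grad / (np.sqrt(cache) + eps)
    return w, cache

# Usage inside the training loop (cache starts as zeros with the shape of W):
# cache_W = np.zeros_like(W)
# W, cache_W = rmsprop_update(W, dE_dW, cache_W)
```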
2. Report
This is an individually written report of scientific standards, i.e. in a journal-paper-like format and style. It is recommended that you search the web for relevant journal papers and mimic their structure. Results should be clearly presented and justified. Figures should have captions and legends. Your report should NOT exceed 4 pages (excluding Appendix and Cover Page). Additional pages will be ignored. Two-column format is recommended. The minimum font size is 11pt and margins should be reasonable. Kindly note that the readability and clarity of your document play a major role in marking the assessment.
Your report should include:
1. A response to all points requested by the assignment (including graphs and explanations).
2. An Appendix with snippets of your code referring to the algorithm implementations, with comments/explanations.
3. A description of how your results can be reproduced; see also the “Important Note”.
Important Note: Please make sure that your results are reproducible – you can for instance initialise a random seed to make sure you always get the same results. Together
with the assignment, please upload your code, well commented and with sufficient information (e.g. a Readme file), so that we can easily test how you produced the results. If your results are not reproducible by your code (or you have not provided sufficient information on how to run your code in order to reproduce the figures), the assessment cannot receive full points. If no code is uploaded, the submission is considered incomplete and is not marked.
3. Suggested language
The recommended programming language is Python/Matlab. Notebooks are also acceptable, but not suggested for this assignment. On your own responsibility, you may submit the project in any language you wish, but you should agree this beforehand with the Lecturer.
4. Marking
Assignments will be marked on a 0–20 scale. Results and explanations of questions 1–4 contribute up to 12 points (3 points each), question 5 up to 4 points, and scientific presentation and code documentation up to 4 points.
To maximise your mark, make sure you explain your model and that results are supported by appropriate evidence (e.g. plots with captions, scientific arguments). Figures should be combined wherever appropriate to show a clear message. Any interesting additions you may wish to employ should be highlighted and well explained.
5. Submission
The deadline for uploading the assignment to MOLE is Monday 13 May 2019, 23:59 (in PDF format). Please also upload the corresponding code (zip file), with an appropriate amount of detail for reproducing your results.
6. Plagiarism and Collusion
You are permitted to use Python/Matlab code developed for the lab as a basis for the code you produce for this assignment, with appropriate acknowledgement. This will not affect your mark. You may discuss this assignment with other students, but the work you submit must be your own. Do not share any code or text you write with other students. Credit will not be given to material that is copied (either unchanged or minimally modified) from published sources, including web sites. Any sources of information that you use should be cited fully. Please refer to the guidance on “Collaborative work, plagiarism and collusion” in the Undergraduate or MSc Handbook.
A. Deep Reinforcement Learning and chess
The task is a simplified chess problem where the agent has to learn to give checkmate with a queen and a king against the lone opponent king. For reasons of computational time the chessboard is a reduced 4×4 grid. While the moves of the agent have to be learnt, the moves of the opponent king are chosen randomly among the allowed ones.
The features representing the state are denoted by a vector s(t), whose dimensionality is 3N² + 2, where N × N is the size of the board and 3 is the number of pieces involved in this task. The position of each piece is encoded by a binary matrix C_ij (the chessboard), where c_ij = 1 if that particular piece is in position ij and c_ij = 0 otherwise. The +2 terms in the dimension of s(t) contain the degrees of freedom of the opponent king, that is, the number of available squares to which the opponent king is allowed to move, and a binary variable that is one if the opponent king is checked.
The network that is used for the task is a two-layer neural network where the input corresponds to s(t) and the output corresponds to the Q-values of the possible actions. The total number of possible actions in this scenario is 32 (for N = 4), that is, 8(N − 1) for the queen and 8 for the king. The activation function is a rectified linear unit and the algorithm is Q-learning with backpropagation. The following equations describe how to derive the backpropagation updating rule for a network of rectified linear units.
The cost function is

E = \frac{1}{2} \sum_{\mu, i} \left( t_i(\mu) - x_i^{out}(\mu) \right)^2 ,   (1)

and we want to update each weight w_{ij}^{(k)} in the k-th layer through

\Delta w_{ij}^{(k)} = -\eta \, \frac{\partial E}{\partial w_{ij}^{(k)}} .   (2)

The neurons of the output layer are computed by

x_i^{out}(\mu) = x_i^{(n)}(\mu) = f(h_i^{(n)}) = f\!\left( \sum_j w_{ij}^{(n)} x_j^{(n-1)} + b_i^{(n)} \right) ,   (3)

and then

\frac{\partial E}{\partial w_{ij}^{(n)}} = - \sum_\mu \left( t_i(\mu) - x_i^{out}(\mu) \right) \frac{\partial f\!\left( \sum_j w_{ij}^{(n)} x_j^{(n-1)}(\mu) + b_i^{(n)} \right)}{\partial w_{ij}^{(n)}} ,   (4)

where f is the activation function. Here f is a rectified linear unit and f'(x) = H(x), where H is the Heaviside function defined by

H(x) = 1 if x > 0, and H(x) = 0 otherwise.

The derivative of the activation is therefore

\frac{\partial f\!\left( \sum_j w_{ij}^{(n)} x_j^{(n-1)}(\mu) + b_i^{(n)} \right)}{\partial w_{ij}^{(n)}} = f'\big|_{h_i^{(n)}(\mu)} \, \frac{\partial \sum_j w_{ij}^{(n)} x_j^{(n-1)}(\mu)}{\partial w_{ij}^{(n)}} = H(h_i^{(n)}(\mu)) \, x_j^{(n-1)}(\mu) .   (5)

Combining the two we arrive at

\Delta w_{ij}^{(n)} = \eta \sum_\mu \left( t_i(\mu) - x_i^{out}(\mu) \right) H(h_i^{(n)}(\mu)) \, x_j^{(n-1)}(\mu) ,   (6)

and in a similar way it is possible to derive the update for the bias b_i^{(n)}:

\Delta b_i^{(n)} = \eta \sum_\mu \left( t_i(\mu) - x_i^{out}(\mu) \right) H(h_i^{(n)}(\mu)) .   (7)

This can be rewritten in a more compact way by introducing

\delta_i^{(n)} = \left( t_i(\mu) - x_i^{out}(\mu) \right) H(h_i^{(n)}(\mu)) ,   (8)

so that

\Delta w_{ij}^{(n)} = \eta \sum_\mu \delta_i^{(n)} x_j^{(n-1)}(\mu) ,   (9)

\Delta b_i^{(n)} = \eta \sum_\mu \delta_i^{(n)} .   (10)
It is now possible to rederive the general learning rule for the hidden layers by following the exact same steps shown in the reading materials of Lecture 2 and remembering that the cost function has changed:

\delta_j^{(k-1)} = \sum_i \delta_i^{(k)} w_{ij}^{(k)} H(h_j^{(k-1)}(\mu)) ,   (11)

\Delta w_{jl}^{(k-1)} = \eta \sum_\mu \delta_j^{(k-1)} x_l^{(k-2)}(\mu) ,   (12)

\Delta b_j^{(k-1)} = \eta \sum_\mu \delta_j^{(k-1)} .   (13)
Equations (8)–(13) constitute the backpropagation updating rules for the weights and the biases of the output layer and the hidden layers.
Furthermore, since the exploited algorithm is Q-learning, the cost function of eq. (1) takes the form

E = \frac{1}{2} \left( R_{t+1} + \gamma \, Q(a, t+1) - Q(a, t) \right)^2 ,   (14)

where Q is given by

Q(i, t) = \mathrm{relu}\!\left( \sum_{j=1}^{N} w_{ij}^{(n)} x_j^{(n-1)}(t) \right) ,   (15)

where i is the index of the action.
Because the exploited algorithm, and Reinforcement Learning in general¹, is online, we have dropped the sum over μ (see also Figure 1).

Having said that, remember that in Reinforcement Learning the updating rule for the output neurons considers only the index corresponding to the action that was taken. Stated otherwise, the only parameters that are changed in the output layer are those of the output neuron corresponding to the chosen action; practically, if the agent chose action a you need to change only w_{ij}^{(n)} and b_i^{(n)} with i = a, which correspond to the a-th row of the matrix W^{(n)} and the a-th entry of the vector b^{(n)}. The updating rules for the parameters in the hidden layers are the same as in the supervised case. The following scheme is a summary of the whole algorithm.

¹ However, versions of Reinforcement Learning with a batch updating rule are possible; see for example the experience replay method.
Figure 1: Scheme of the network input and the learning rule.
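To connect eq. (14) with tasks 1 and 4, the following sketch contrasts the Q-learning and SARSA temporal-difference errors that play the role of (R_{t+1} + γ Q(·, t+1) − Q(a, t)); the names are illustrative and the ε-greedy action selection is assumed to happen elsewhere in the provided game loop.

```python
import numpy as np

def q_learning_delta(R_next, Q_next, Q, a, gamma):
    """Q-learning TD error: bootstrap with the greedy (maximal) next Q-value."""
    return R_next + gamma * np.max(Q_next) - Q[a]

def sarsa_delta(R_next, Q_next, a_next, Q, a, gamma):
    """SARSA TD error: bootstrap with the Q-value of the action actually taken next."""
    return R_next + gamma * Q_next[a_next] - Q[a]

# At a terminal state (checkmate or draw) the bootstrapped term is dropped,
# i.e. delta = R_next - Q[a].
```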
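As a concrete illustration of the updating rules above, here is a minimal sketch of one online step for a two-layer ReLU network in which only the output row of the chosen action a is modified. The variable names (W1, b1, W2, b2, eta) are illustrative and do not correspond to those used in the provided template.

```python
import numpy as np

def online_update(s, a, td_error, W1, b1, W2, b2, eta):
    """One backprop step for a single (state, action) pair.

    s:        input feature vector s(t), length 3*N^2 + 2
    a:        index of the chosen action
    td_error: R_{t+1} + gamma * Q(., t+1) - Q(a, t), playing the role of (t_i - x_i^out)
    """
    # Forward pass (ReLU hidden layer, ReLU output, cf. eq. (15))
    h1 = W1 @ s + b1
    x1 = np.maximum(h1, 0.0)
    h2 = W2 @ x1 + b2

    # Output-layer delta: only the neuron of the chosen action contributes (eq. (8))
    delta2 = td_error * (h2[a] > 0)            # H(h_a) from the Heaviside derivative

    # Hidden-layer deltas, backpropagated through the a-th output row (eq. (11))
    delta1 = delta2 * W2[a, :] * (h1 > 0)

    # Parameter updates (eqs. (9)-(10) and (12)-(13)); only row a of W2 and entry a of b2 change
    W2[a, :] += eta * delta2 * x1
    b2[a]    += eta * delta2
    W1       += eta * np.outer(delta1, s)
    b1       += eta * delta1
    return W1, b1, W2, b2
```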
Figure 2: Sigmoid function and rectified linear unit.
B. Chess
Chess is a two-player strategy game played on a 64-square board (chessboard) arranged in an 8×8 grid. Each player is given 16 pieces in total, falling under 6 types, and each type has a unique set of moves. The objective of the game is to checkmate the opponent’s king by placing it in a position where it is threatened by a piece and it cannot perform any action to escape the threat. Another way a chess game can end is in a draw, which happens if the remaining pieces on the chessboard are unable to perform checkmate, or if the king (a) is not threatened, (b) is the only piece that can perform an action, and (c) any action would put it under threat. More information on chess and how each piece moves can be found online (e.g. here); for your assignment you will use a simplified version of chess with only 3 pieces in total on the board.
Figure 3: A game of chess and the name of each piece.
Learning to checkmate only with King and Queen
For your assignment you will be given a reduced chessboard with a 4×4 grid and three pieces which are generated in random locations: 1x King (1), 1x Queen (2) and 1x Opponent’s King (3). Your task is to use reinforcement learning to teach your pieces (White: King and Queen) to checkmate the Opponent’s King (Black). For illustration purposes, these examples use an 8×8 grid.
General movement rules
• The King can move 1 block at a time.
• The Queen can move in any one straight direction: forward, backward, right, left, or diagonally, as far as possible, as long as she does not move through any of the other pieces on the board.
• None of the pieces is allowed to exit the board. Refer also to figure 5.
Figure 4: Simplified chess.
Hard-coded movement rules
To simplify the game even more and make the learning process faster some constraints have been added:
• The King will never move to a location that will cause it to be threatened by the Opponent’s King.
• The Queen will never move to a location that will cause it to be threatened by the Opponent’s King, except if the King is nearby to protect her.
• The Opponent’s King will always pick a random action that will not cause it to be threatened. If no such action is available, then:
– If it is already threatened, then it is a checkmate.
– If it is not already threatened, then it is a draw.
Refer also to figure 6.
End of game conditions and example of a full game
The game can end either in a checkmate, where the Opponent’s King cannot move and it is threatened (refer to figures 7(a) and 8), or in a draw, where the Opponent’s King cannot move but it is not threatened (refer to figures 7(b) and 9).
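The opponent’s policy and the end-of-game test described above can be summarised in a few lines; the sketch below is only illustrative (the provided game code already implements this logic) and the helper names safe_moves and is_threatened are hypothetical.

```python
import random

def opponent_king_turn(safe_moves, is_threatened):
    """Decide the opponent king's move, or the end of the game.

    safe_moves:    squares the opponent king could move to without being threatened
    is_threatened: True if the opponent king is currently in check
    """
    if not safe_moves:
        # No safe square available: checkmate if already in check, draw otherwise
        return "checkmate" if is_threatened else "draw"
    return random.choice(safe_moves)   # a random legal, non-threatened move
```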
Figure 5: Movement pattern of each piece. (a) King, (b) Opponent’s King, (c) Queen, (d) Queen blocked by King.
Figure 6: Constrained movement pattern of each piece. (a) The King or the Opponent’s King will not select to move to the blocks marked in red, because on the next turn one of them would be captured. (b) The Queen or the Opponent’s King will not select to move to the blocks marked in red, because on the next turn one of them would be captured. (c) The Queen or the Opponent’s King will not select to move to the blocks marked in red, because on the next turn one of them would be captured.
Figure 7: Ending of a game. (a) It is the Opponent’s King’s turn; it is threatened by the Queen and it cannot perform any action to escape the threat (it also cannot capture the Queen, because it would then be captured by the King). It is a checkmate. (b) It is the Opponent’s King’s turn; it is not threatened, but it cannot move anywhere since all the available positions would cause it to be threatened. It is a draw.
Figure 8: Game ends with a checkmate. The pieces are randomly placed in the chessboard without causing any threats. (1) Player 1 (white) plays first and moves the Queen. (2) Player 2 (black) moves the Opponent’s King to a random position given that it will not be threatened. (3) Player 1 (white) moves the King. (4) Player 2 (black) moves the Opponent’s King. (5) Player 1 (white) moves the Queen. The game ends in a checkmate because the Opponent’s King is threatened by the Queen and it cannot perform any action to escape the threat.
Figure 9: Game ends with a draw. The pieces are randomly placed in the chessboard without causing any threats. (1) Player 1 (white) plays first and moves the King. (2) Player 2 (black) moves the Opponent’s King to a random position given that it will not be threatened. (3) Player 1 (white) moves the Queen. (4) Player 2 (black) moves the Opponent’s King. (5) Player 1 (white) moves the King. The game ends in a draw because the Opponent’s King is not threatened and it cannot perform any action since every available action will cause it to be threatened.