程序代写代做代考 chain algorithm 1. You MUST answer this question.

1. You MUST answer this question.

(a) For what types of decision problems would the application of reinforcement
learning methods be preferable over other methods? [3 marks ]

(b) Explain the difference between value and reward signal. What properties are
desirable for the choice of a reward signal when setting-up a reinforcement
learning algorithm? [3 marks ]

(c) Why is the ergodicity of a Markov chain an important property in the con-
text of reinforcement learning? [4 marks ]

(d) Explain the concept of regret in the context of performance evaluation of
on-line algorithms. Does the ε-greedy method achieve zero regret? If so,
why? If not, why not? [3 marks ]

(e) How can one achieve sufficient exploration of states and actions in a rein-
forcement learning problem? [3 marks ]

(f) Using eligibility traces can speed-up learning. Give an example and briefly
explain your answer, also mentioning potential risks. [3 marks ]

(g) Explain how the stationary solution of a well-behaved Markovian decision
problem can be calculated when the transition probabilities and the starting
state are known. How can this solution be used to compare a number of
applicable control policies with respect to a given cost function? [3 marks ]

(h) Consider a one-dimensional track of discrete states which are numbered from
1 to N . Rewards are always zero, except for state 1 where a reward of r = 1
is given. The discount factor is γ. Assume the agent is currently in state k
(1 < k < N) and can move either to k + 1 or to k − 1. How do the values of the two reachable states differ? [3 marks ] Page 1 of 3 2. (a) What is a state in a reinforcement learning problem and what is the reason the states are sometimes not fully observable? What is a belief state in the context of Partially Observable Markov Decision Processes (POMDP)? [3 marks ] (b) Classical reinforcement algorithms aim at maximising the mean discounted reward over an unlimited future. In POMDPs, however, often a finite time horizon is considered. Why is this so? [2 marks ] (c) What do you understand by the term particle filter? Explain — with equa- tions — how this can be used to define an approximate solution method for POMDPs. [3 marks ] (d) In principle, partial observability considerably increases the complexity of the learning algorithms. Explain two ways (different from question (c)) to reduce the complexity of POMDPs. [4 marks ] (e) How do on- and off-policy methods differ at the algorithmic level? What is the key requirement for off-policy methods in terms of policy architecture? [2 marks ] (f) How does a semi-Markov Decision Process (SMDP) differ from an MDP? Explain, using expressions for reward, state transition law and cost criterion for the example of a robot navigating in a large building, why SMDPs can be beneficial. If the robot receives information about its position, but does not have access to a map of the building, would it still be able to realise these benefits? [4 marks ] (g) Describe an application of inverse reinforcement learning in robotics. Pro- vide details of an algorithm for the problem by a graphical representation or by relevant equations. [4 marks ] (h) Give an example where a stochastic policy is better than a deterministic policy even after perfect exploration? [3 marks ] Page 2 of 3 3. Connect Four (according to wikipedia) is a game in which two players first choose a colour and then take turns dropping a disc of their respective colour from the top into a seven-column, six-row grid. The pieces fall straight down, occupying the next available space within the column. The objective of the game is to connect four of one’s own discs next to each other vertically, horizontally, or diagonally before your opponent. Consider the problem of training an artificial player using reinforcement learning. (a) How would you design a reward signal for this problem? How would you apply the algorithm and organise the training? [2 marks ] (b) While the number of actions is quite small, the state space is rather large: Each of the 42 grid positions can be empty or occupied by one of the colours, which leads to an upper bound of 342 states, but not all of these states actually occur. Would it still make sense to use a look-up table for the representation of the value function? [2 marks ] (c) Assuming that the answer for (b) is yes, what reinforcement learning algo- rithm would you choose? Describe the main steps of the algorithm. How would you initialise and control its parameters? [6 marks ] (d) Assuming that the answer of (b) is no, explain two ways how the algo- rithms can be adapted to be applicable to the problem? Which way would you choose? How would you initialise and control the parameters of the algorithm? [7 marks ] (e) If only a few thousands of games have been played for training, which im- plementation, (c) or (d), can be expected to perform better against a skilled human player? [2 marks ] (f) How does the two-player problem differ from the standard reinforcement learning problem? Extending upon this, how does a multi-agent reinforce- ment learning problem differ from its single-agent counterpart? Explain what aspects of the problem formulation change owing to the introduction of additional agents? Outline a learning procedure that works in the pres- ence of multiple agents. [6 marks ] Page 3 of 3