
1. You MUST answer this question.

(a) Consider a bandit problem with two arms. It is known that one of the arms
leads to rewards that are uniformly distributed in the interval [0, 1],
while for the other arm rewards are uniformly distributed in [0, 2]. How
many exploratory actions will you need to take on average in order to
identify which of the two arms has the higher average reward? In the above
example, give a value for the optimistic initialisation of the reward
estimate which is sufficient to identify the optimal solution in at least
80% of all cases. [6 marks]
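The setting in (a) can be probed with a short simulation (illustrative only;
the number of runs, seeds and pull counts below are assumptions, not part of
the question):

```python
import random

def identify_best(n_pulls, seed=None):
    """Pull each arm n_pulls times and compare sample averages.

    Arm A: uniform on [0, 1] (true mean 0.5).
    Arm B: uniform on [0, 2] (true mean 1.0).
    Returns True if arm B is (correctly) ranked higher.
    """
    rng = random.Random(seed)
    est_a = sum(rng.uniform(0, 1) for _ in range(n_pulls)) / n_pulls
    est_b = sum(rng.uniform(0, 2) for _ in range(n_pulls)) / n_pulls
    return est_b > est_a

# Fraction of runs in which the better arm is identified
# after n exploratory pulls per arm.
for n in (1, 2, 5):
    hits = sum(identify_best(n, seed=s) for s in range(10_000))
    print(n, hits / 10_000)
```

Even a single pull per arm ranks the arms correctly in roughly three quarters
of the runs, and the success rate climbs quickly with more pulls, which gives a
feel for the answer to part (a).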

(b) The concept of value is important in reinforcement learning. Explain the
difference between value and reward signal. What properties are desirable
for the choice of a reward signal when setting up a reinforcement learning
algorithm? [3 marks ]

(c) Explain the difference between on-policy learning and off-policy learning for
the example of an agent that moves in a grid world that contains one or
more “pits” (positions with very large negative reward). [3 marks ]

(d) Describe (using symbols and pseudocode) the SARSA and Q-learning algo-
rithms. What is the essential difference between them? Explain using an
example. In some applications, it is empirically observed, although not the-
oretically justified, that SARSA converges faster than Q-learning. Describe
possible reasons for this effect. [6 marks ]
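The two update rules asked for in (d) can be sketched in tabular form as
follows (a minimal sketch; the state/action encoding, the step size alpha and
the discount gamma are assumed, and the behaviour policy is taken to be
epsilon-greedy):

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, eps, rng):
    # Behaviour policy shared by both algorithms.
    if rng.random() < eps:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa_update(Q, s, a, r, s2, a2, alpha, gamma):
    # On-policy: bootstrap from the action a2 actually chosen in s2.
    Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])

def q_learning_update(Q, s, a, r, s2, actions, alpha, gamma):
    # Off-policy: bootstrap from the greedy action in s2,
    # regardless of what the behaviour policy will actually do there.
    best = max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])
```

The single line that differs, `Q[(s2, a2)]` versus `max(...)`, is the
essential difference between the two algorithms.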

(e) Exploration is essential in most reinforcement learning algorithms. Identify
three different approaches to exploration. [3 marks ]
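Three commonly used exploration schemes can be sketched as action-selection
rules over a list of value estimates `q` (the parameter values eps, tau and c
are illustrative assumptions):

```python
import math
import random

rng = random.Random(0)

def eps_greedy(q, eps):
    # 1. Epsilon-greedy: a uniformly random action with probability eps,
    #    otherwise the greedy action.
    if rng.random() < eps:
        return rng.randrange(len(q))
    return max(range(len(q)), key=q.__getitem__)

def softmax(q, tau):
    # 2. Boltzmann (softmax) exploration: sample an action with
    #    probability proportional to exp(q / tau).
    weights = [math.exp(v / tau) for v in q]
    return rng.choices(range(len(q)), weights=weights)[0]

def ucb(q, counts, t, c=2.0):
    # 3. Upper-confidence-bound: add an exploration bonus that shrinks
    #    as an action is tried more often (counts[a] pulls so far).
    return max(range(len(q)),
               key=lambda a: q[a] + c * math.sqrt(math.log(t) / (counts[a] + 1e-9)))
```

Optimistic initialisation of the value estimates, as in 1(a), is a fourth
approach that needs no change to the selection rule at all.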

(f) Consider a one-dimensional track of discrete states which are numbered from
1 to N . Rewards are always zero, except for state 1 where a reward of r = 1
is given. The discount factor is γ. Assume the agent is currently in state k
(1 < k < N) and can move either to k + 1 or to k − 1. How do the values of
the two reachable states differ? [4 marks]

2. Markov Decision Processes

(a) Why is the Markov property important for most reinforcement learning
algorithms? Explain the concept of ergodicity in the context of discrete
Markov chains. Why is it important for reinforcement learning? [4 marks]

(b) What are the advantages and disadvantages of TD(0), Monte Carlo and
TD(λ) learning? Discuss your answer also for an inventory control problem.
[3 marks]

(c) Explain the role of the discount parameter γ in reinforcement learning.
Under what condition is γ = 1 possible? Under what conditions would γ = 0
make sense? [3 marks]

(d) How does a semi-Markov Decision Process (SMDP) differ from an MDP?
Explain, using expressions for reward, state transition law and cost
criterion for the example of a robot navigating in a large building, why
SMDPs can be beneficial. If the robot receives information about its
position, but does not have access to a map of the building, would it still
be able to realise these benefits? [4 marks]

(e) What is a belief state in the context of Partially Observable Markov
Decision Processes (POMDP)? How does one modify the Bellman equation for
the optimal value function to accommodate this concept? [4 marks]

(f) A mobile robot is travelling through a maze and receives more reward
the faster it reaches a certain goal location after departing from a given
start state. At low speed the robot obtains precise state information
through its camera, but at higher speeds the camera image shows movement
artefacts, so that it cannot provide unambiguous state information.
Describe the details of the task and its solution in terms of a POMDP, and
describe what behaviours the robot might learn within this framework.
[7 marks]

3.
Application of Reinforcement Learning to real-world problems

(a) In the past century, many algorithms for reinforcement learning were
designed and analysed for grid-world problems. In what respects were these
algorithms developed further in order to become competitive in real-world
problems? Explain your answer and choose an example to illustrate your
argument. [5 marks]

(b) Suppose you are using function approximation in reinforcement learning.
What would be the advantage of using a linear combination of basis
functions to represent the value function of a state rather than a
non-linear method such as a neural network with hidden layers? How would
you design a set of basis functions for the example that you have discussed
in Question 3(a)? [3 marks]

(c) How can the learning speed of reinforcement learning algorithms be
improved? Answer this question by discussing whether this is possible

i. without using detailed domain knowledge,
ii. based on pre-training in a simulation, and
iii. using trajectories obtained from human demonstration.

Explain your answer in the context of specific reinforcement learning
algorithms and discuss also potential drawbacks. [7 marks]

(d) What generalisation issues may arise when applying reinforcement
learning to a real-world problem? How can their effect be reduced?
[3 marks]

(e) Design a reinforcement learning algorithm for the control of an online
advertising system that places ads on webpages based on (incomplete)
information about the user. Discuss states, rewards, actions, type and
structure of the algorithm and reasonable values for its parameters. In
order to convince an advertising agency to adopt your algorithm you will
also need to provide estimates of training times and expected performance.
Given the uncertainty of the problem, these estimates will not be very
reliable.
What general arguments could you give in addition to the estimates in order
to convince the agency of the reinforcement learning approach? [7 marks]
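One possible skeleton for the agent asked for in 3(e), cast as a contextual
epsilon-greedy bandit over coarse user-context keys; the class name, context
encoding and parameter values (epsilon = 0.1, alpha = 0.05) are illustrative
assumptions, not part of the question:

```python
import random
from collections import defaultdict

class AdAgent:
    """Epsilon-greedy ad selection per user context.

    State: a coarse user-context key; action: which ad to show;
    reward: 1 for a click, 0 otherwise.
    """

    def __init__(self, ads, epsilon=0.1, alpha=0.05, seed=0):
        self.ads = list(ads)
        self.eps = epsilon
        self.alpha = alpha
        self.rng = random.Random(seed)
        self.q = defaultdict(float)  # (context, ad) -> estimated click rate

    def choose(self, context):
        # Explore with probability eps, otherwise show the ad with the
        # highest estimated click rate for this context.
        if self.rng.random() < self.eps:
            return self.rng.choice(self.ads)
        return max(self.ads, key=lambda ad: self.q[(context, ad)])

    def update(self, context, ad, clicked):
        # Incremental update of the click-rate estimate toward the
        # observed reward.
        r = 1.0 if clicked else 0.0
        self.q[(context, ad)] += self.alpha * (r - self.q[(context, ad)])
```

Such a sketch also suggests the non-numerical arguments the question asks for:
the agent improves continuously from its own logged feedback and needs no
hand-built model of user behaviour.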