1. You MUST answer this question.
(a) Consider a bandit problem with two arms. It is known that one of the arms
leads to rewards that are homogeneously distributed in the interval [0, 1],
while for the other one rewards in [0, 2] are possible. How many exploratory
actions will you need take on average in order to identify which of the two
arms which has a higher reward average? In the above example, give a value
for the optimistic initialisation of the reward estimate which is sufficient to
identify the optimal solution in at least 80% of all cases. [6 marks ]
(b) The concept of value is important in reinforcement learning. Explain the
difference between value and reward signal. What properties are desirable
for the choice of a reward signal when setting-up a reinforcement learning
algorithm? [3 marks ]
(c) Explain the difference between on-policy learning and off-policy learning for
the example of an agent that moves in a grid world that contains one or
more “pits” (positions with very large negative reward). [3 marks ]
(d) Describe (using symbols and pseudocode) the SARSA and Q-learning algo-
rithms. What is the essential difference between them? Explain using an
example. In some applications, it is empirically observed, although not the-
oretically justified, that SARSA converges faster than Q-learning. Describe
possible reasons for this effect. [6 marks ]
(e) Exploration is essential in most reinforcement learning algorithms. Identify
three different approaches to exploration. [3 marks ]
(f) Consider a one-dimensional track of discrete states which are numbered from
1 to N . Rewards are always zero, except for state 1 where a reward of r = 1
is given. The discount factor is γ. Assume the agent is currently in state k
(1 < k < N) and can move either to k + 1 or to k − 1. How do the values
of the two reachable states differ? [4 marks ]
Page 1 of 3
2. Markov Decision Processes
(a) Why is the Markov property important for most reinforcement learning algo-
rithms? Explain the concept of ergodicity in the context of discrete Markov
chains. Why is it important for reinforcement learning? [4 marks ]
(b) What are the advantages and disadvantages of TD(0), Monte Carlo and
TD(λ) learning? Discuss your answer also for an inventory control problem. [3 marks ]
(c) Explain the role of the discount parameter γ in reinforcement learning. Un-
der what condition is γ = 1 possible? Under what conditions would γ = 0
make sense? [3 marks ]
(d) How does a semi-Markov Decision Process (SMDP) differ from an MDP?
Explain, using expressions for reward, state transition law and cost criterion
for the example of a robot navigating in a large building, why SMDPs can
be beneficial. If the robot receives information about its position, but does
not have access to a map of the building, would it still be able to realise
these benefits? [4 marks ]
(e) What is a belief state in the context of Partially Observable Markov Decision
Processes (POMDP)? How does one modify the Bellman equation for the
optimal value function to accommodate this concept? [4 marks ]
(f) A mobile robot is travelling through a maze and receives the more reward the
faster it reaches at a certain goal location after departing from a given start
state. At low speed the robot obtains precise state information through its
camera, but at higher speeds the camera image shows movement artefacts
such that the state information cannot provide unambiguous state informa-
tion. Describe the details of the task and its solution in terms of a POMDP,
and describe what behaviours the robot might learn within this framework.
[7 marks ]
Page 2 of 3
3. Application of Reinforcement Learning to real-world problems
(a) In the past century, many algorithms for reinforcement learning were de-
signed and analysed for grid-world problems. In what respects were these
algorithms developed in order to become competitive in real-world prob-
lems? Explain your answer and choose an example to illustrate your argu-
ment. [5 marks ]
(b) Suppose you are using function approximation in reinforcement learning.
What would be the advantage of using a linear combination of basis functions
in order to represent the value function of a state rather than a non-linear
method such as a neural network with hidden layers? How would you design
a set of basis functions for the example that you have discussed in Question
3(a)? [3 marks ]
(c) How can the learning speed of reinforcement learning algorithms be im-
proved? Answer this question by discussing whether this is possible
i. without using detailed domain knowledge,
ii. based on pre-training in a simulation, and
iii. using trajectories obtained from human demonstration.
Explain your answer in the context of specific reinforcement learning algo-
rithms and discuss also potential drawbacks. [7 marks ]
(d) What generalisation issues may arise when applying reinforcement learning
to a real-world problem? How can their effect be reduced? [3 marks ]
(e) Design a reinforcement learning algorithm for the control of an online adver-
tising system that places ads on webpages based on (incomplete) information
about the user. Discuss states, rewards, actions, type and structure of the
algorithm and reasonable values for its parameters. In order to convince a
advertising agency to adopt your algorithm you will also need to provide es-
timates of training times and expected performance. Given the uncertainty
of the problem, these estimates will not be very reliable. What general ar-
guments could you give in addition to the estimates in order to convince the
agency of the reinforcement learning approach? [7 marks ]
Page 3 of 3