8a: Reinforcement Learning
Week 8: Overview
The topic for this week is Reinforcement Learning (RL). We will see how reinforcement learning tasks
can be formally defined, compare different models of optimality, and discuss the need for
exploration and the problem of delayed rewards. We will present a number of RL algorithms,
including Temporal Difference learning (TD-learning), Q-learning, policy gradients and actor-critic. We will
see how RL can be applied to games like backgammon, and how deep Q-learning or A3C can learn to
play Atari games from raw pixels.
Weekly learning outcomes
By the end of this module, you will be able to:
explain the difference between supervised learning, reinforcement learning, and unsupervised
learning
describe the formal definition of a reinforcement learning task
identify different models of optimality
explain the need for exploration in reinforcement learning
describe reinforcement learning algorithms, including TD-learning, Q-learning, policy gradients,
and actor-critic
apply Q-learning to simple tasks.
Reinforcement Learning
We have previously discussed supervised learning, where pairs of inputs and target outputs are
provided and the system must learn to predict the correct output for each input.
There are many situations where we instead want to train a system to perform certain actions in an
environment, in order to maximise a reward function. These situations include, for example, playing
a video game, allocating mobile phone channels or other resources dynamically, driving a car or
flying a helicopter.
Supervised learning can sometimes be used for this purpose, if we construct a training set of
situation-action pairs (for example, a database of game positions and the move chosen by a human
expert, or sensor readings from a motor car and the steering direction chosen by a human driver).
This process is sometimes called Behavioral Cloning.
However, it is often better if the system can learn purely by self-play, without the need for training
data from a human expert.
Reinforcement Learning Framework
Reinforcement Learning (RL) can be formalised in terms of an Agent interacting with its Environment.
The Environment includes a set of states $S$ and a set of actions $A$. At each time step $t$, the agent is in
some state $s_t$. It must choose an action $a_t$, whereupon it goes into state $s_{t+1} = \delta(s_t, a_t)$ and
receives reward $r_t = R(s_t, a_t)$.
The Agent chooses its actions according to some policy $\pi : S \to A$.
The aim of Reinforcement Learning is to find an optimal policy $\pi^*$ which maximises the cumulative
reward.
In some cases, the Environment may be probabilistic or stochastic, meaning that the transitions
and/or rewards are not fully determined, but instead occur randomly according to some probability
distribution:

$$\delta(s_{t+1} = s \mid s_t, a_t), \qquad R(r_t = r \mid s_t, a_t)$$
The Agent can also employ a probabilistic policy, with its actions chosen according to a probability
distribution:

$$\pi(a_t = a \mid s_t)$$
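To make the framework concrete, here is a minimal Python sketch (not part of the original notes) of the agent-environment loop. The state names, the transition function `delta`, the reward function `R` and the policy `pi` below are made-up placeholders, chosen only to show where each piece of the formalism fits.

```python
import random

# A minimal sketch of the agent-environment loop, assuming a hypothetical
# environment with states S, actions A, transition function delta and
# reward function R, plus a (possibly probabilistic) policy pi.

S = ["s0", "s1"]          # hypothetical states
A = ["left", "right"]     # hypothetical actions

def delta(s, a):
    # hypothetical deterministic transition function delta: S x A -> S
    return "s1" if a == "right" else "s0"

def R(s, a):
    # hypothetical reward function R: S x A -> float
    return 1.0 if (s, a) == ("s1", "right") else 0.0

def pi(s):
    # hypothetical probabilistic policy pi(a | s): here, uniform over actions
    return random.choice(A)

s = "s0"
total_reward = 0.0
for t in range(10):          # interact for a few timesteps
    a = pi(s)                # choose action a_t according to the policy
    r = R(s, a)              # receive reward r_t = R(s_t, a_t)
    s = delta(s, a)          # move to the next state s_{t+1} = delta(s_t, a_t)
    total_reward += r
print(total_reward)
```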
Probabilistic Policies
The main benefit of a probabilistic policy is that it forces the agent to explore its environment more
comprehensively while it is learning.
However, probabilistic policies may have additional benefits in other contexts, such as partially
observable or multi-player environments.
Consider, for example, a situation in which the Agent is able to observe that it is in one of the grey
squares, but is not able to determine which grey square it is in. In this case, the policy of moving left
or right with equal probability from the grey square will perform better than any deterministic policy.
In two-player games like rock-paper-scissors, a random strategy is also required in order to make
agent choices unpredictable to the opponent.
Models of Optimality
What exactly do we mean when we say that an RL Agent should choose a policy which maximises its
future rewards?
The old saying “a fast nickel is worth a slow dime” reminds us that people often prefer to receive a
smaller reward sooner instead of a larger reward later. Economists use surveys to gauge people’s
preferences in this regard: “Would you prefer ten dollars today, or 15 dollars next week? How about
50 dollars today compared to 100 dollars next year?” Responses to these surveys can be used to
estimate a number $\gamma$ between 0 and 1, chosen such that one dollar in the current timestep is
considered equally desirable with $\gamma$ dollars in the next timestep. In Reinforcement Learning, this
number $\gamma$ is called the discount factor and is used to define the infinite discounted reward, which
can be compared with average reward and finite horizon reward.
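As a hypothetical worked example (not taken from the notes): if a respondent is indifferent between receiving \$10 in the current timestep and \$15 in the next timestep, the implied discount factor is

$$\gamma \approx \frac{10}{15} \approx 0.67,$$

since \$15 received one timestep later is then worth about $15\gamma \approx \$10$ today.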
Finite horizon reward: $\displaystyle\sum_{i=0}^{h-1} r_{t+i}$

Infinite discounted reward: $\displaystyle\sum_{i=0}^{\infty} \gamma^i\, r_{t+i}, \quad 0 \le \gamma < 1$

Average reward: $\displaystyle\lim_{h \to \infty} \frac{1}{h} \sum_{i=0}^{h-1} r_{t+i}$
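To see how the three models can rank the same reward stream differently, here is a small Python sketch (not from the notes) with a made-up reward sequence; the infinite discounted sum and the average-reward limit are both truncated to the listed rewards, which is only an approximation.

```python
# Compare the three models of optimality on a made-up reward sequence.
rewards = [0, 0, 0, 10, 10, 10, 10, 10]   # hypothetical r_{t+i} for i = 0, 1, ...
gamma = 0.9
h = 4

finite_horizon = sum(rewards[:h])                              # sum of the first h rewards
discounted = sum(gamma**i * r for i, r in enumerate(rewards))  # truncated discounted sum
average = sum(rewards) / len(rewards)                          # finite approximation of the limit

print(finite_horizon, round(discounted, 2), round(average, 2))
```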
We normally try to maximise the infinite discounted reward, because it is easier to analyse
theoretically and also works well in practice. The finite horizon reward is easy to compute, but may
lead to bad decisions by failing to take into account rewards which are just over the horizon. Average
reward is hard to deal with, because we cannot sensibly choose between a small reward soon and a
large reward very far in the future (for example, 100 dollars today compared with a million dollars,
50 years from now).

Comparing Models of Optimality

This environment illustrates how the choice of action may depend on the model of optimality. An
agent trying to maximise the finite horizon reward with $h=4$ would prefer action $a_1$, because it is
the only action which will gain any reward within the first four time steps. An agent maximising
average reward will prefer action $a_3$ because, after the first few time steps, it will be getting a
reward of +11 every timestep, which is larger than +10 for an agent who chose action $a_2$. An
agent maximising infinite discounted reward with $\gamma = 0.9$ will prefer action $a_2$. One way to
see this is as follows: consider Agent $A_2$ choosing action $a_2$, compared to Agent $A_3$ choosing
action $a_3$. The reward of 11 received by $A_3$ in timestep 7 would be equivalent to 9.9 in timestep 6,
and therefore slightly less than the \$10 received by $A_2$ in timestep 6. By the same logic, the \$10
received by $A_2$ in timestep $n$ is always slightly more valuable than the \$11 received by $A_3$ in
timestep $n+1$.

Value Function

Every policy $\pi$ determines a Value Function $V^\pi : S \to \mathbb{R}$, where $V^\pi(s)$ is the average
discounted reward received by an agent who begins in state $s$ and chooses its actions according to
policy $\pi$.

For small environments, we can compute the value function using simultaneous equations. For larger
environments, we need a way of learning it through experience.

If $\gamma = 0.9$, the value of the last node in the middle row of the above diagram is
$\frac{10}{1-\gamma} = 100$. The value of the last node in the lowest row is $\frac{11}{1-\gamma} = 110$.
The value of the second last node in the lower row is 99, which can be obtained from the value of the
last node either by subtracting 11 or multiplying by $\gamma$.

Value Function Example

This image shows the value function $V^\pi$ in a grid world environment, where $\pi$ is the policy of
choosing between available actions uniformly randomly.

Exploration and Delayed Reinforcement

I was born to try ...
But you've got to make choices
Be wrong or right
Sometimes you've got to sacrifice the things you like.
(Delta Goodrem)

Multi-Armed Bandit Problem

The special case of a stochastic reinforcement learning environment with only one state is called the
Multi-Armed Bandit Problem, because it is like being in a room with several (friendly) slot machines
(also called "one-armed bandits") for a limited time, and trying to collect as much money as possible.
We assume that each slot machine provides rewards chosen from its own (stationary) distribution,
and that in each time step we are able to pull the lever on only one machine.

Exploration / Exploitation Tradeoff

After pulling the levers a few times and observing the rewards received, it makes sense that most of
the time we should choose the lever (action) which we think will give the highest expected reward,
based on previous observations. However, in order to ensure convergence to the optimal strategy,
we must occasionally choose something different from our preferred action.

Perhaps the simplest way to achieve this is known as an epsilon-greedy strategy, where with
probability $1-\varepsilon$ we choose what we think is the best action, and with probability
$\varepsilon$ (typically, 5%) we choose a random action. More sophisticated strategies also exist,
such as Thompson Sampling, Upper Confidence Bound (UCB) algorithms, or choosing from a Softmax
(Boltzmann) distribution based on the average reward $R(a)$ so far observed for each action:

$$P(a) = \frac{e^{R(a)/T}}{\sum_{b \in A} e^{R(b)/T}}$$
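As a rough illustration, here is a Python sketch (not from the notes) of both epsilon-greedy and Softmax action selection, assuming a hypothetical dictionary `R_est` of average rewards observed so far for each action.

```python
import math
import random

# Sketch of two exploration strategies over estimated average rewards.
# R_est is a hypothetical dictionary of estimates, not from the notes.
R_est = {"a1": 1.2, "a2": 0.8, "a3": 1.1}   # average reward observed so far

def epsilon_greedy(R_est, epsilon=0.05):
    # with probability epsilon choose a random action, otherwise the best one
    if random.random() < epsilon:
        return random.choice(list(R_est))
    return max(R_est, key=R_est.get)

def softmax_choice(R_est, T=1.0):
    # P(a) = exp(R(a)/T) / sum_b exp(R(b)/T)  (Boltzmann distribution)
    weights = [math.exp(r / T) for r in R_est.values()]
    return random.choices(list(R_est), weights=weights, k=1)[0]

print(epsilon_greedy(R_est), softmax_choice(R_est))
```

Lowering the temperature `T` makes the Softmax choice more greedy, while raising it makes the choice more uniform.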
You may have noticed that we make the same kind of judgements about exploration versus
exploitation in everyday situations. Should you eat at the old restaurant you have visited many times,
or should you try the newly opened restaurant across the street, which might be worse but might be
much better?

Delayed Reinforcement

The need for exploration also applies in the general case where there are multiple states, and where
we may need to take a whole sequence of actions in order to get to the state from which we can
obtain a high reward.

Recall that every policy $\pi$ determines a Value Function $V^\pi : S \to \mathbb{R}$, where $V^\pi(s)$
is the expected discounted reward received by an agent who begins in state $s$ and chooses its
actions according to policy $\pi$.

If $\pi = \pi^*$ is optimal, then $V^*(s) = V^{\pi^*}(s)$ is the maximum (expected) discounted reward
obtainable from state $s$. Knowing this optimal value function can help to determine the optimal
policy.

Computing the Value Function

Let $\gamma = 0.9$ and consider the policy $\pi : S_1 \mapsto a_2,\; S_2 \mapsto a_1,\; S_3 \mapsto a_2$.
We have

$$V^\pi(S_3) = \frac{2}{1-\gamma} = 20$$
$$V^\pi(S_2) = -1 + \gamma V^\pi(S_3) = -1 + 18 = 17$$
$$V^\pi(S_1) = -1 + \gamma V^\pi(S_2) = -1 + 15.3 = 14.3$$

For this example, the policy $\pi$ shown above must be the optimal policy, because the value function
for states $S_1$ and $S_2$ under any other policy would not be more than 10. Therefore, the optimal
value function is

$$V^* : S_1 \mapsto 14.3, \quad S_2 \mapsto 17, \quad S_3 \mapsto 20$$

If we were given this value function $V^*$, we could use it to determine the optimal policy $\pi^*$,
provided we also know the reward function $R : S \times A \to \mathbb{R}$ and the transition function
$\delta : S \times A \to S$. This information is sometimes called the "World Model".

Q-Function

The Q-function is a more sophisticated version of the value function, which enables an agent to act
optimally without needing to know the World Model.

For any policy $\pi$, the Q-function $Q^\pi(s, a)$ is the expected discounted reward received by an
agent who begins in state $s$, first performs action $a$ and then follows policy $\pi$ for all subsequent
timesteps.

If $\pi = \pi^*$ is optimal, then $Q^*(s, a) = Q^{\pi^*}(s, a)$ is the maximum (expected) discounted
reward obtainable from $s$, if the agent is forced to take action $a$ in the first timestep but can act
optimally thereafter.

If the optimal Q-function $Q^*$ is known, then the optimal policy is given by

$$\pi^*(s) = \arg\max_a Q^*(s, a)$$

The Q-function for the environment shown above can be computed as follows:

$$Q(S_1, a_1) = 1 + \gamma V(S_1) = 1 + 0.9 \times 14.3 = 13.87$$
$$Q(S_1, a_2) = V(S_1) = 14.3$$
$$Q(S_2, a_1) = V(S_2) = 17$$
$$Q(S_2, a_2) = 1 + \gamma V(S_1) = 13.87$$
$$Q(S_3, a_1) = 1 + \gamma V(S_2) = 1 + 0.9 \times 17 = 16.3$$
$$Q(S_3, a_2) = V(S_3) = 20$$
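The role of the World Model can be illustrated with a small Python sketch (this is not the environment in the diagram above; the states, rewards and $V^*$ values below are made up for illustration). Given $R$, $\delta$ and $V^*$, it recovers $Q^*(s,a) = R(s,a) + \gamma V^*(\delta(s,a))$ and then $\pi^*(s) = \arg\max_a Q^*(s,a)$.

```python
# Sketch: recovering Q* and pi* from V* and a known World Model (R, delta),
# using Q*(s,a) = R(s,a) + gamma * V*(delta(s,a)).  The environment and the
# V* values below are hypothetical, not the example from the notes.
gamma = 0.9
states = ["X", "Y"]
actions = ["a1", "a2"]
delta = {("X", "a1"): "X", ("X", "a2"): "Y",
         ("Y", "a1"): "X", ("Y", "a2"): "Y"}
R = {("X", "a1"): 0.0, ("X", "a2"): 1.0,
     ("Y", "a1"): 0.0, ("Y", "a2"): 2.0}
V_star = {"X": 19.0, "Y": 20.0}             # assumed to be given

Q_star = {(s, a): R[(s, a)] + gamma * V_star[delta[(s, a)]]
          for s in states for a in actions}
pi_star = {s: max(actions, key=lambda a: Q_star[(s, a)]) for s in states}
print(Q_star)
print(pi_star)
```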
TD-Learning and Q-Learning

Reinforcement Learning algorithms can generally be grouped into three classes:

1. Value function learning, including TD-Learning and Q-Learning
2. Policy learning, including policy gradients and evolution strategies
3. Actor-Critic, which is a combination of value function and policy learning

Value Function Learning

Recall that if $\pi = \pi^*$ is optimal, then $V^*(s) = V^{\pi^*}(s)$ is the maximum (expected) discounted
reward obtainable from state $s$, and $Q^*(s, a) = Q^{\pi^*}(s, a)$ is the maximum (expected)
discounted reward obtainable from $s$, if the agent is forced to take action $a$ in the first timestep
but can act optimally thereafter.

The idea behind value function learning is that the agent retains its own estimate $V()$ or $Q()$ of the
"true" value function $V^*()$ or $Q^*()$. This estimate might be quite inaccurate to start with, but it
gets iteratively improved over time so that it more closely approximates the true value. This process
is sometimes called "Bootstrapping".

Temporal Difference Learning

Let's first assume that $R$ and $\delta$ are deterministic. Then the (true) value $V^*(s)$ of the current
state $s$ should be equal to the immediate reward plus the discounted value of the next state:

$$V^*(s) = R(s, a) + \gamma V^*(\delta(s, a))$$

We can turn this into an update rule for the estimated value:

$$V(s_t) \leftarrow r_t + \gamma V(s_{t+1})$$

If $R$ and $\delta$ are stochastic (multi-valued), it is not safe to simply replace $V(s)$ with the
expression on the right hand side. Instead, we move its value fractionally in this direction,
proportional to a learning rate $\eta$:

$$V(s_t) \leftarrow V(s_t) + \eta\,[\,r_t + \gamma V(s_{t+1}) - V(s_t)\,]$$

Q-Learning

Q-learning is similar to TD-learning, except that the Q-function $Q^* : S \times A \to \mathbb{R}$
depends on a state-action pair instead of just a state.

For a deterministic environment, $\pi^*$, $Q^*$ and $V^*$ are related by

$$\pi^*(s) = \arg\max_a Q^*(s, a)$$
$$Q^*(s, a) = R(s, a) + \gamma V^*(\delta(s, a))$$
$$V^*(s) = \max_b Q^*(s, b)$$

So

$$Q^*(s, a) = R(s, a) + \gamma \max_b Q^*(\delta(s, a), b)$$

The last equation suggests that we can iteratively update our estimate $Q$ by

$$Q(s_t, a_t) \leftarrow r_t + \gamma \max_b Q(s_{t+1}, b)$$

If the environment is stochastic, we instead write

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \eta\,[\,r_t + \gamma \max_b Q(s_{t+1}, b) - Q(s_t, a_t)\,]$$
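A minimal tabular sketch of both update rules in Python (not from the notes). The environment interface is left abstract: it is assumed that some function, imagined here as `step(s, a)`, would return the next state and reward.

```python
import random
from collections import defaultdict

# Tabular TD-learning and Q-learning updates (sketch).  The environment
# interface step(s, a) -> (s_next, r) is hypothetical and not defined here.
gamma, eta, epsilon = 0.9, 0.1, 0.05
actions = ["a1", "a2"]
V = defaultdict(float)                       # estimated value function V(s)
Q = defaultdict(float)                       # estimated Q-function, keyed by (s, a)

def td_update(s, r, s_next):
    # V(s_t) <- V(s_t) + eta * [r_t + gamma * V(s_{t+1}) - V(s_t)]
    V[s] += eta * (r + gamma * V[s_next] - V[s])

def q_update(s, a, r, s_next):
    # Q(s_t, a_t) <- Q(s_t, a_t) + eta * [r_t + gamma * max_b Q(s_{t+1}, b) - Q(s_t, a_t)]
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += eta * (r + gamma * best_next - Q[(s, a)])

def choose_action(s):
    # epsilon-greedy exploration over the current Q estimates
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

# Typical usage inside an interaction loop (step is hypothetical):
#   a = choose_action(s); s_next, r = step(s, a)
#   td_update(s, r, s_next); q_update(s, a, r, s_next); s = s_next
```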
Theoretical Results and Generalization

It can be proved that TD-learning and Q-learning will eventually converge to the optimal policy, for
any deterministic Markov decision process, assuming an appropriately randomised strategy (Sutton,
1988; Watkins & Dayan, 1992; Dayan & Sejnowski, 1994). However, these results rely on the
assumption that every state will be visited many times. For small environments with a limited number
of states and actions, this assumption is reasonable. For more challenging environments, the number
of states can be exponentially large and it may be impractical to assume that every state will be
visited even once. In these situations, we need to rely on a model such as a neural network that is
able to generalise, estimating the value of new states based on their similarity with states that have
previously been encountered.

Computer Game Playing

Suppose we want to write a computer program to play a game like backgammon, chess, checkers or
Go. This can be done using a tree search algorithm (expectimax, MCTS, or minimax with alpha-beta
pruning). But we need: (a) a scheme for encoding any board position as a set of numbers, and (b) a
way to train a neural network or other learning system to compute a board evaluation, based on
those numbers.

Backgammon

One of the first practical applications of TD-learning (Tesauro, 1992) was to play the game of
backgammon, which involves rolling dice and moving pieces around a board with 24 "points".

In the above diagram, the white player is moving their pieces clockwise toward the bottom left
quadrant, while the black player is moving their pieces anti-clockwise toward the upper left quadrant.
When a player gets all of their pieces into the last quadrant, they can use the dice rolls to move pieces
off the board. The first player to get all of their pieces off the board is the winner.

If a player has only one piece on a point (such as point 10 or 24 above) and the other player moves a
piece to that point, the original piece is sent back to the "bar" and must start again from the
beginning. If a player has two or more pieces on a point, the other player is not allowed to move a
piece to that point. We see in the above position that the black player has multiple pieces on points
1, 4, 6, 7, 8, 9, so that the white player will not be able to move their pieces on points 3 or 5 unless
they roll either 2, 5 or 6.

TD-Gammon Neural Network

TD-Gammon used a 2-layer neural network with 196 inputs, 20 hidden nodes and 1 output. The inputs
are arranged as follows:

4 units × 2 players × 24 points
2 units for pieces on the bar
2 units for pieces off the board

Four inputs are assigned to each colour and each point, with a 1-hot encoding used to specify one,
two or three pieces, and the fourth input scaled in proportion to the number of pieces if there are
more than three. Additional inputs specify the number of pieces of each colour which are on the bar,
or off the board.

For any encoded board state, the network produces an output $V$ between 0 and 1, which is
interpreted as its estimate of the probability of winning from that state.

At each timestep, the dice are rolled, the network considers all possible legal moves and computes
the "next board position" that would result from each move. These are converted to the appropriate
input format, fed to the network, and the one which produces the largest output is chosen.
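A sketch of this greedy afterstate selection in Python (not TD-Gammon's actual code). The helpers `legal_moves`, `encode` and `network` are hypothetical stand-ins: `legal_moves(board, dice)` is assumed to yield (move, resulting board) pairs, and `network(encode(board))` to return the estimated probability of winning.

```python
def choose_move(board, dice, legal_moves, encode, network):
    # Evaluate every "next board position" reachable with this dice roll
    # and return the move whose afterstate the network rates most highly.
    best_move, best_value = None, float("-inf")
    for move, next_board in legal_moves(board, dice):
        value = network(encode(next_board))   # estimated probability of winning
        if value > best_value:
            best_move, best_value = move, value
    return best_move
```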
The network can be trained by backpropagation using this formula:

$$w_i \leftarrow w_i + \eta\,(T - V)\,\frac{\partial V}{\partial w_i}$$

where $w_i$ are the weights in the network, $\eta$ is the learning rate, $V$ is the actual output of the
network, and $T$ is the target value.

The question is: how do we determine the target value $T$? In other words, how do we know what
the value of the current position "should" have been? Alternatively, how do we find a better estimate
for the value of the current position?

How to choose the Target value

One approach is to have a human expert play many games and build a database of positions, dice
rolls and chosen moves, similar to what was done for the ALVINN autonomous driving system we
discussed in Week 1. The network could then be trained to increase the evaluation of each move
chosen by the human expert and decrease that of other moves which were available but did not get
chosen.

Another approach is for the network to learn from self-play by TD-learning, effectively using the
evaluation of subsequent positions in the game to refine the evaluation of earlier positions. Other
methods such as TD-Root, TD-Leaf, MCTS and TreeStrap (Veness et al., 2009) combine learning with
tree search. These are important for deterministic games like chess or Go, but less important for a
stochastic game like backgammon, because the randomness of the dice rolls limits the benefit of
deep lookahead.

For backgammon, the agent receives just a single reward at the end of each game, which we can
consider as the final value $V_{m+1}$ (typically, +1 for a win or -1 for a loss). We then have a sequence
of game positions, each with its own (estimated) value:

$$V_t \;(\text{current estimate}) \to V_{t+1} \to \ldots \to V_m \to V_{m+1} \;(\text{final result})$$

In this context, TD-learning simplifies and becomes equivalent to using the value $V_{t+1}$ of the next
state as the training value for the current state. A fancier version, called TD($\lambda$), uses a
weighted average over future estimates as the training value $T_t$ for $V_t$, where

$$T_t = (1 - \lambda) \sum_{k=t+1}^{m} \lambda^{k-1-t}\, V_k \;+\; \lambda^{m-t}\, V_{m+1}$$

The parameter $\lambda$ between 0 and 1 serves as a kind of discount factor, but is different to the
usual discount factor $\gamma$ (which can be equal to 1 in the case of TD-Gammon, because we do
not care when we get the reward, so long as we get it).

TD-Gammon

Tesauro trained two networks: one using Expert Preferences (EP), the other by TD-learning (TD). The
TD-network outperformed the EP-network, and with modifications such as 3-step lookahead
(expectimax) and additional hand-crafted input features, became the best backgammon player in the
world in 1995.

References

Sutton, R.S., 1988. Learning to predict by the methods of temporal differences. Machine Learning, 3(1), 9-44.
Watkins, C.J., 1989. Learning from Delayed Rewards. PhD Thesis, University of Cambridge.
Watkins, C.J. & Dayan, P., 1992. Q-learning. Machine Learning, 8(3-4), 279-292.
Dayan, P. & Sejnowski, T.J., 1994. TD(λ) converges with probability 1. Machine Learning, 14(3), 295-301.
Tesauro, G., 1992. Practical issues in temporal difference learning. Machine Learning, 8(3), 257-277.
Veness, J., Silver, D., Uther, W. & Blair, A., 2009. Bootstrapping from game tree search. Advances in Neural Information Processing Systems (NIPS 22), 1937-1945.

Exercise: Reinforcement Learning

Question 1

Question 2

Question 3

Question 4

Consider an environment with two states $S = \{S_1, S_2\}$ and two actions $A = \{a_1, a_2\}$, where
the (deterministic) transitions $\delta$ and reward $R$ for each state and action are as follows:

$$\delta(S_1, a_1) = S_1, \quad R(S_1, a_1) = +1$$
$$\delta(S_1, a_2) = S_2, \quad R(S_1, a_2) = -2$$
$$\delta(S_2, a_1) = S_1, \quad R(S_2, a_1) = +7$$
$$\delta(S_2, a_2) = S_2, \quad R(S_2, a_2) = +3$$

Assuming a discount factor of $\gamma = 0.7$, determine the optimal policy $\pi^* : S \to A$.

Still assuming $\gamma = 0.7$, determine the value function $V : S \to \mathbb{R}$.

Still assuming $\gamma = 0.7$, determine the values of the Q-function $Q : S \times A \to \mathbb{R}$.

Still assuming $\gamma = 0.7$, trace through the first few steps of the Q-learning algorithm, assuming a
learning rate of 1 and with all Q values initially set to zero. Explain why it is necessary to force
exploration through probabilistic choice of actions, in order to ensure convergence to the true Q
values.

Here are some hints to get you started: since the learning rate is 1 (and the environment
deterministic), we can use this Q-Learning update rule:

$$Q(S, a) \leftarrow r(S, a) + \gamma \max_b Q(\delta(S, a), b)$$

Let's assume the agent starts in state $S_1$. Because the initial Q values are all zero, the first action
must be chosen randomly. If action $a_1$ is chosen, the agent will get a reward of +1 and the update
will be

$$Q(S_1, a_1) \leftarrow 1 + \gamma \times 0 = 1$$
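If you would like to check your hand trace, here is a small Python sketch of the exercise environment with the learning-rate-1 update rule from the hint above; the exploration policy here is simply uniform random choice, which is one easy way to force exploration.

```python
import random

# The two-state exercise environment, with the learning-rate-1 update rule
# Q(S,a) <- r(S,a) + gamma * max_b Q(delta(S,a), b).
gamma = 0.7
delta = {("S1", "a1"): "S1", ("S1", "a2"): "S2",
         ("S2", "a1"): "S1", ("S2", "a2"): "S2"}
R = {("S1", "a1"): 1, ("S1", "a2"): -2,
     ("S2", "a1"): 7, ("S2", "a2"): 3}
Q = {key: 0.0 for key in delta}

s = "S1"
for step in range(5):                        # trace only the first few updates
    a = random.choice(["a1", "a2"])          # forced (uniform random) exploration
    s_next = delta[(s, a)]
    Q[(s, a)] = R[(s, a)] + gamma * max(Q[(s_next, b)] for b in ["a1", "a2"])
    print(step, s, a, Q)
    s = s_next
```

Running only a handful of steps, as above, leaves the full trace and the convergence argument for you to work through.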
Now let's consider how the value function changes as the discount factor $\gamma$ varies between 0
and 1. There are four deterministic policies for this environment, which can be written as $\pi_{11}$,
$\pi_{12}$, $\pi_{21}$ and $\pi_{22}$, where $\pi_{ij}(S_1) = a_i$, $\pi_{ij}(S_2) = a_j$.

Question 5

Calculate the value function $V^\pi_{(\gamma)} : S \to \mathbb{R}$ for each of these four policies
(keeping $\gamma$ as a variable).

Question 6

Determine for which range of values of $\gamma$ each of the policies $\pi_{11}$, $\pi_{12}$,
$\pi_{21}$, $\pi_{22}$ is optimal.

Week 8 Wednesday video