COMP9444
Neural Networks and Deep Learning
7a. Reinforcement Learning

Outline

Reinforcement Learning vs. Supervised Learning
Models of Optimality
Exploration vs. Exploitation
Value Function Learning
◮ TD-Learning
◮ Q-Learning
TD-Gammon

Recall: Supervised Learning

We have a training set and a test set, each consisting of a set of examples. For each example, a number of input attributes and a target attribute are specified.

The aim is to predict the target attribute, based on the input attributes.

Various learning paradigms are available:
◮ Decision Trees
◮ Neural Networks
◮ SVM
◮ .. others ..

Learning of Actions

Supervised Learning can also be used to learn Actions, if we construct a training set of situation-action pairs (called Behavioral Cloning).

However, there are many applications for which it is difficult, inappropriate, or even impossible to provide a "training set":

optimal control
◮ mobile robots, pole balancing, flying a helicopter
resource allocation
◮ job shop scheduling, mobile phone channel allocation
mix of allocation and control
◮ elevator control, backgammon
Reinforcement Learning Framework

An agent interacts with its environment.

There is a set S of states and a set A of actions.

At each time step t, the agent is in some state s_t. It must choose an action a_t, whereupon it goes into state s_{t+1} = δ(s_t, a_t) and receives reward r_t = R(s_t, a_t).

Agent has a policy π : S → A. We aim to find an optimal policy π* which maximizes the cumulative reward.

In general, δ, R and π can be multi-valued, with a random element, in which case we write them as probability distributions:

    δ(s_{t+1} = s | s_t, a_t),   R(r_t = r | s_t, a_t),   π(a_t = a | s_t)

Probabilistic Policies

There are some environments in which any deterministic agent will perform very poorly, and the optimal (reactive) policy must be stochastic (i.e. randomized).

In 2-player games like Rock-Paper-Scissors, a random strategy is also required in order to make agent choices unpredictable to the opponent.
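As an illustration of the framework above, here is a minimal Python sketch of the agent-environment loop. The Environment/Agent interface (reset, step, choose_action, observe) is a hypothetical placeholder, not part of the lecture material; it simply mirrors the roles of δ, R and π described above.

    # Minimal sketch of the agent-environment interaction loop.
    # env.step plays the role of delta (next state) and R (reward);
    # agent.choose_action plays the role of the policy pi.

    def run_episode(env, agent, max_steps=100):
        s = env.reset()                      # initial state s_0
        total_reward = 0.0
        for t in range(max_steps):
            a = agent.choose_action(s)       # a_t chosen according to pi
            s_next, r, done = env.step(a)    # s_{t+1} = delta(s_t, a_t), r_t = R(s_t, a_t)
            agent.observe(s, a, r, s_next)   # learning update (e.g. TD- or Q-Learning)
            total_reward += r
            s = s_next
            if done:
                break
        return total_reward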
Models of Optimality

Is a fast nickel worth a slow dime?

Finite horizon reward:       ∑_{i=0}^{h−1} r_{t+i}

Infinite discounted reward:  ∑_{i=0}^{∞} γ^i r_{t+i},   0 ≤ γ < 1

Average reward:              lim_{h→∞} (1/h) ∑_{i=0}^{h−1} r_{t+i}

Finite horizon reward is simple computationally.

Infinite discounted reward is easier for proving theorems.

Average reward is hard to deal with, because we can't sensibly choose between a small reward soon and a large reward very far in the future.

Comparing Models of Optimality

[Figure: a single start state with three available actions a1, a2, a3, leading to rewards of +2, +11 and +10 respectively, at different points in the future.]

Finite horizon, h = 4       → a1 is preferred
Infinite horizon, γ = 0.9   → a2 is preferred
Average reward              → a3 is preferred
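To make the three criteria concrete, here is a small Python sketch that evaluates each of them on a list of rewards. The example reward sequence is an illustrative assumption, not the one from the figure.

    # The three models of optimality applied to a finite reward sequence.

    def finite_horizon(rewards, h):
        # sum of the first h rewards
        return sum(rewards[:h])

    def infinite_discounted(rewards, gamma):
        # sum of gamma^i * r_i (truncated where the list ends)
        return sum((gamma ** i) * r for i, r in enumerate(rewards))

    def average_reward(rewards):
        # average reward over the (finite) sequence
        return sum(rewards) / len(rewards)

    rewards = [0, 0, 2, 0, 0, 0, 0, 0]   # e.g. a single delayed reward of +2
    print(finite_horizon(rewards, 4), infinite_discounted(rewards, 0.9), average_reward(rewards))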
RL Approaches

Value Function Learning
◮ TD-Learning
◮ Q-Learning

Policy Learning
◮ Evolution Strategies
◮ Policy Gradients

Actor-Critic
◮ Combination of Value and Policy learning

Value Function Learning

Every policy π determines a Value Function V^π : S → R, where V^π(s) is the average discounted reward received by an agent who begins in state s and chooses its actions according to policy π.

If π = π* is optimal, then V*(s) = V^{π*}(s) is the maximum (expected) discounted reward obtainable from state s. Learning this optimal value function can help to determine the optimal strategy.

The agent retains its own estimate V() of the "true" value function V*(). The aim of Value Function Learning is generally to start with a random V and then iteratively improve it so that it more closely approximates V*. This process is sometimes called "Bootstrapping".
Value Function

[Figure: a grid-world environment with a START state, terminal rewards +1 and −1, and stochastic transitions (probabilities 1.0, .5, .33), together with the resulting state values, e.g. 0.2152, 0.0886, −0.0380, −0.1646, −0.2911, −0.4430, −0.5443, −0.7722 (panels (a), (b), (c)).]

This is the Value Function V^π where π is the policy of choosing between available actions uniformly randomly.

K-Armed Bandit Problem

The special case of an active, stochastic environment with only one state is called the K-armed Bandit Problem, because it is like being in a room with several (friendly) slot machines, for a limited time, and trying to collect as much money as possible.

Each action (slot machine) provides a different average reward.
Exploration / Exploitation Tradeoff

Most of the time we should choose what we think is the best action.

However, in order to ensure convergence to the optimal strategy, we must occasionally choose something different from our preferred action, e.g.

◮ choose a random action 5% of the time, or
◮ use Softmax (Boltzmann distribution) to choose the next action:

    P(a) = e^{R(a)/T} / ∑_{b∈A} e^{R(b)/T}

    "I was born to try...
     But you've got to make choices
     Be wrong or right
     Sometimes you've got to sacrifice the things you like."
                                                - Delta Goodrem
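A minimal Python sketch of the two exploration strategies above (ε-greedy and Softmax), assuming the agent keeps a list est_reward of estimated rewards, one entry per action (the variable names are illustrative, not from the lecture):

    import math
    import random

    def epsilon_greedy(est_reward, epsilon=0.05):
        # with probability epsilon pick a random action, otherwise the greedy one
        if random.random() < epsilon:
            return random.randrange(len(est_reward))
        return max(range(len(est_reward)), key=lambda a: est_reward[a])

    def softmax_action(est_reward, temperature=1.0):
        # Boltzmann exploration: P(a) proportional to exp(R(a)/T)
        prefs = [math.exp(r / temperature) for r in est_reward]
        total = sum(prefs)
        probs = [p / total for p in prefs]
        return random.choices(range(len(est_reward)), weights=probs)[0]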
Delayed Reinforcement

[Figure: a small environment with states s1, s2, s3, actions a1 and a2, and rewards +1, −1 and +2.]

We may need to take several actions before we can get the good reward.

Temporal Difference Learning

Let's first assume that R and δ are deterministic. Then the (true) value V*(s) of the current state s should be equal to the immediate reward plus the discounted value of the next state:

    V*(s) = R(s, a) + γ V*(δ(s, a))

We can turn this into an update rule for the estimated value, i.e.

    V(s_t) ← r_t + γ V(s_{t+1})

If R and δ are stochastic (multi-valued), it is not safe to simply replace V(s) with the expression on the right hand side. Instead, we move its value fractionally in this direction, proportional to a learning rate η:

    V(s_t) ← V(s_t) + η [ r_t + γ V(s_{t+1}) − V(s_t) ]
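A minimal Python sketch of this update for a tabular value estimate (the dictionary-based table is an illustrative choice, not from the lecture):

    from collections import defaultdict

    V = defaultdict(float)   # estimated value of each state, initially 0

    def td_update(s, r, s_next, eta=0.1, gamma=0.9):
        # move V(s) fractionally towards the TD target  r + gamma * V(s_next)
        V[s] += eta * (r + gamma * V[s_next] - V[s])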
Q-Learning

Q-Learning is similar to TD-Learning, except that we use a function Q^π : S × A → R which depends on a state-action pair instead of just a state.

For any policy π, the Q-Function Q^π(s, a) is the average discounted reward received by an agent who begins in state s, first performs action a and then follows policy π for all subsequent timesteps.

If π = π* is optimal, then Q*(s, a) = Q^{π*}(s, a) is the maximum (expected) discounted reward obtainable from s, if the agent is forced to take action a in the first timestep but can act optimally thereafter.

The agent retains its own (initially random) estimate Q() and iteratively improves this estimate to more closely approximate the "true" function Q*().

For a deterministic environment, π*, Q* and V* are related by

    π*(s) = argmax_a Q*(s, a)
    Q*(s, a) = R(s, a) + γ V*(δ(s, a))
    V*(s) = max_b Q*(s, b)

So

    Q*(s, a) = R(s, a) + γ max_b Q*(δ(s, a), b)

This allows us to iteratively approximate Q by

    Q(s_t, a_t) ← r_t + γ max_b Q(s_{t+1}, b)

If the environment is stochastic, we instead write

    Q(s_t, a_t) ← Q(s_t, a_t) + η [ r_t + γ max_b Q(s_{t+1}, b) − Q(s_t, a_t) ]
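A minimal Python sketch of the tabular Q-Learning update in the stochastic case (the dictionary-based table and the actions list are illustrative choices, not from the lecture):

    from collections import defaultdict

    Q = defaultdict(float)   # Q[(state, action)] estimate, initially 0

    def q_update(s, a, r, s_next, actions, eta=0.1, gamma=0.9):
        # move Q(s,a) fractionally towards  r + gamma * max_b Q(s_next, b)
        best_next = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += eta * (r + gamma * best_next - Q[(s, a)])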
Q-Learning Example

[Figure: an environment with states s1, s2, s3, actions a1 and a2, and rewards +1, −1 and +2.]

Exercise:

1. compute V^π(s3) if π(s3) = a2 and γ = 0.9
2. compute π*, V* and Q* for this environment (if γ = 0.9)

Theoretical Results

Theorem: Q-learning will eventually converge to the optimal policy, for any deterministic Markov decision process, assuming an appropriately randomized strategy. (Watkins & Dayan 1992)

Theorem: TD-learning will also converge, with probability 1. (Sutton 1988, Dayan 1992, Dayan & Sejnowski 1994)
Limitations of Theoretical Results

Delayed reinforcement
◮ reward resulting from an action may not be received until several time steps later, which also slows down the learning

Search space must be finite
◮ convergence is slow if the search space is large
◮ relies on visiting every state infinitely often

For "real world" problems, we can't rely on a lookup table
◮ need to have some kind of generalisation (e.g. TD-Gammon)

Computer Game Playing

Suppose we want to write a computer program to play a game like Backgammon, Chess, Checkers or Go. This can be done using a tree search algorithm (expectimax, MCTS, or minimax with alpha-beta pruning). But we need:

(a) an appropriate way of encoding any board position as a set of numbers, and
(b) a way to train a neural network or other learning system to compute a board evaluation, based on those numbers.
Backgammon

[Figure: a backgammon board with points numbered 0-25.]

Backgammon Neural Network

Board encoding:
◮ 4 units × 2 players × 24 points
◮ 2 units for the bar
◮ 2 units for off the board

Two layer neural network:
◮ 196 input units
◮ 20 hidden units
◮ 1 output unit

The input s is the encoded board position (state); the output V(s) is the value of this position (probability of winning).

At each move, roll the dice, find all possible "next board positions", convert them to the appropriate input format, feed them to the network, and choose the one which produces the largest output.
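A minimal PyTorch-style sketch of a network with this shape, together with the greedy move selection described above. The encode and legal_next_positions helpers are hypothetical placeholders for the board encoding and dice/move generation, and the choice of sigmoid activations is an assumption.

    import torch
    import torch.nn as nn

    # 196 inputs -> 20 hidden units -> 1 output in (0,1), interpreted as P(win)
    value_net = nn.Sequential(
        nn.Linear(196, 20),
        nn.Sigmoid(),
        nn.Linear(20, 1),
        nn.Sigmoid(),
    )

    def choose_move(positions):
        # pick the candidate next position whose network output V(s) is largest
        with torch.no_grad():
            values = [value_net(encode(p)) for p in positions]   # encode: board -> 196-dim tensor
        best = max(range(len(positions)), key=lambda i: values[i].item())
        return positions[best]

    # usage (hypothetical): move = choose_move(legal_next_positions(board, dice))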
Backpropagation

    w ← w + η (T − V) ∂V/∂w

where
    η = learning rate
    V = actual output
    T = target value
    w = weight

Q: How do we choose the target value T?

In other words, how do we know what the value of the current position "should have been"? Or, how do we find a better estimate for the value of the current position?

How to Choose the Target Value

Behavioral Cloning (Supervised Learning)
◮ learn moves from human games (Expert Preferences)

Temporal Difference Learning
◮ use subsequent positions to refine evaluation of current position
◮ general method, does not rely on knowing the "world model" (rules of the game)

Methods which combine learning with tree search (must know the "world model")
◮ TD-Root, TD-Leaf, MCTS, TreeStrap
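The weight update on the Backpropagation slide is one step of gradient descent on the squared error ½(T − V)². A minimal sketch, reusing the value network from the earlier example and assuming a target value T has been obtained by one of the methods above (encode remains a hypothetical board encoder):

    import torch

    optimizer = torch.optim.SGD(value_net.parameters(), lr=0.01)   # lr plays the role of eta

    def train_step(position, target):
        # one step of  w <- w + eta * (T - V) * dV/dw,  via squared-error gradient descent
        V = value_net(encode(position))        # encode: hypothetical board encoder
        loss = 0.5 * (target - V) ** 2
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()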
TD-Learning for Episodic Games

Backgammon is an example of an episodic task, in the sense that the agent receives just a single reward at the end of the game, which we can consider as the final value V_{m+1} (typically, +1 for a win or −1 for a loss). We then have a sequence of game positions, each with its own (estimated) value:

    (current estimate) V_t → V_{t+1} → ... → V_m → V_{m+1} (final result)

In this context, TD-Learning simplifies and becomes equivalent to using the value of the next state (V_{t+1}) as the training value for the current state (V_t).

A fancier version, called TD(λ), uses T_t as the training value for V_t, where

    T_t = (1 − λ) ∑_{k=t+1}^{m} λ^{k−1−t} V_k + λ^{m−t} V_{m+1}

T_t is a weighted average of future estimates, with discount factor λ (0 ≤ λ < 1).

TD-Gammon

Tesauro trained two networks:
◮ EP-network was trained on Expert Preferences (Supervised)
◮ TD-network was trained by self play (TD-Learning)

TD-network outperformed the EP-network.

With modifications such as 3-step lookahead (expectimax) and additional hand-crafted input features, TD-Gammon became the best Backgammon player in the world (Tesauro, 1995).
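A minimal Python sketch of this TD(λ) target, assuming the values for one completed game are stored in a list values = [V_0, ..., V_m, V_{m+1}] with the final result as the last element (this layout is an assumption for illustration):

    def td_lambda_target(values, t, lam):
        # T_t = (1 - lam) * sum_{k=t+1}^{m} lam^(k-1-t) * V_k  +  lam^(m-t) * V_{m+1}
        m = len(values) - 2        # index of the last estimated position V_m
        target = sum((1 - lam) * lam ** (k - 1 - t) * values[k] for k in range(t + 1, m + 1))
        target += lam ** (m - t) * values[m + 1]
        return target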