COMP9414: Artificial Intelligence
Lecture 9b: Reinforcement Learning
Wayne Wobcke
UNSW ©W. Wobcke et al. 2019–2021

This Lecture
• Reinforcement Learning vs Supervised Learning
• Models of Optimality
• Exploration vs Exploitation
• Temporal Difference Learning
• Q-Learning
Types of Learning
• Supervised Learning
  ◮ Agent is presented with examples of inputs and their target outputs, and must learn a function from inputs to outputs that agrees with the training examples and generalizes to new examples
• Reinforcement Learning
  ◮ Agent is not presented with target outputs for each input, but is periodically given a reward, and must learn to maximize (expected) rewards over time
• Unsupervised Learning
  ◮ Agent is only presented with a series of inputs, and must find useful patterns in these inputs

Supervised Learning
• Given a training set and a test set, each consisting of a set of items; for each item in the training set, a set of features and a target output
• Learner must learn a model that can predict the target output for any given item (characterized by its set of features)
• Learner is given the input features and target output for each item in the training set
  ◮ Items may be presented all at once (batch) or in sequence (online)
  ◮ Items may be presented at random or in time order (stream)
  ◮ Learner cannot use the test set at all in defining the model
• Model is evaluated by its performance on predicting the output for each item in the test set
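To make the training/test protocol above concrete, here is a minimal, self-contained sketch (not from the lecture): a toy 1-nearest-neighbour learner fitted to a handful of made-up training items and evaluated on a held-out test set. The data and the nearest-neighbour rule are purely illustrative assumptions.

    # Minimal illustration of supervised learning: fit on a training set,
    # evaluate on a held-out test set (toy data, 1-nearest-neighbour rule).

    def predict_1nn(train, x):
        """Predict the target of x as the target of its nearest training item."""
        nearest = min(train, key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], x)))
        return nearest[1]

    # Each item: (features, target output)
    training_set = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((4.0, 4.2), "B"), ((3.8, 4.0), "B")]
    test_set     = [((0.9, 1.1), "A"), ((4.1, 3.9), "B"), ((3.0, 3.0), "B")]

    # The model (here, just the stored training set) never sees the test set while learning.
    correct = sum(predict_1nn(training_set, x) == y for x, y in test_set)
    print(f"Test accuracy: {correct}/{len(test_set)}")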
Learning Actions
Supervised learning can be used to learn actions from a training set of situation-action pairs (called Behavioural Cloning)
However, there are many applications for which it is difficult, inappropriate, or even impossible to provide a "training set"
• Optimal control
  ◮ Mobile robots, pole balancing, flying a helicopter
• Resource allocation
  ◮ Job shop scheduling, mobile phone channel allocation
• Mix of allocation and control
  ◮ Elevator control, Backgammon

Reinforcement Learning Framework
• Agent interacts with its environment
• There is a set S of states and a set A of actions
• At each time step t, the agent is in some state s_t and must choose an action a_t, whereupon it goes into state s_{t+1} = δ(s_t, a_t) and receives reward r(s_t, a_t)
• In general, r() and δ() can be multi-valued, with a random element
• The aim is to find an optimal policy π : S → A which maximizes the cumulative reward
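As a minimal sketch of this framework, the fragment below encodes a small deterministic environment as tables for δ and r and runs a fixed policy, accumulating discounted reward. The action names ("stay", "move") and the tabular encoding are illustrative assumptions; the reward numbers are those of the two-state delayed-rewards example used later in the lecture.

    # Sketch of the reinforcement learning framework: states S, actions A,
    # a transition function delta, a reward function r, and a policy pi.
    S = ["s1", "s2"]
    A = ["stay", "move"]

    delta = {("s1", "stay"): "s1", ("s1", "move"): "s2",
             ("s2", "stay"): "s2", ("s2", "move"): "s1"}
    r     = {("s1", "stay"): 5,    ("s1", "move"): 2,
             ("s2", "stay"): 10,   ("s2", "move"): 15}

    def run_episode(pi, s, gamma=0.9, steps=50):
        """Follow policy pi from state s, returning the discounted cumulative reward."""
        total, discount = 0.0, 1.0
        for _ in range(steps):
            a = pi(s)                      # agent chooses an action
            total += discount * r[(s, a)]  # ... receives a reward
            s = delta[(s, a)]              # ... and moves to the next state
            discount *= gamma
        return total

    # A fixed policy: move from s1 to s2, then stay in s2.
    pi = lambda s: "move" if s == "s1" else "stay"
    print(run_episode(pi, "s1"))   # about 91.5; approaches 92 as steps grows (see the calculation later)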
Models of Optimality
Is a fast nickel worth a slow dime?

    Finite horizon reward:        ∑_{i=0}^{h} r_{t+i}
    Average reward:               lim_{h→∞} (1/h) ∑_{i=0}^{h−1} r_{t+i}
    Infinite discounted reward:   ∑_{i=0}^{∞} γ^i r_{t+i},  where 0 ≤ γ < 1

• Finite horizon reward is simple computationally
• Infinite discounted reward is easier for proving theorems
• Average reward is hard to deal with, because one can't sensibly choose between a small reward soon and a large reward very far in the future

Comparing Models of Optimality
[Figure comparing the models of optimality; not recoverable from the extracted text]

Environment Types
Environments can be
• Passive and stochastic
• Active and deterministic (Chess)
• Active and stochastic (Backgammon)

Value Function
For each state s ∈ S, let V*(s) be the maximum discounted reward obtainable from s
[Figure: a 4 × 3 grid world; (a) the states, with a START square and terminal squares of reward −1 and +1; (b) the transition probabilities (.33, .5, 1.0); (c) the estimated values of the non-terminal states, ranging from −0.7722 to 0.2152]
The optimal value function determines the optimal policy

Example: Delayed Rewards
[Figure not recoverable from the extracted text: a small example with two states s1 and s2, to which the calculation on the next slide refers]

Calculation
Theorem: In a deterministic environment, for an optimal policy, the value function V* satisfies the Bellman equations:
    V*(s) = r(s, a) + γ V*(δ(s, a))
where a = π*(s) is the optimal action at state s.
Let δ*(s) be the transition function for π*(s) and suppose γ = 0.9
1. Suppose δ*(s1) = s1. Then V*(s1) = 5 + 0.9 V*(s1), so V*(s1) = 50
   Suppose δ*(s2) = s2. Then V*(s2) = 10 + 0.9 V*(s2), so V*(s2) = 100
2. Suppose δ*(s1) = s2. Then V*(s1) = 2 + 0.9 V*(s2), so V*(s1) = 92
   Suppose δ*(s2) = s2. Then V*(s2) = 10 + 0.9 V*(s2), so V*(s2) = 100
3. Suppose δ*(s1) = s2. Then V*(s1) = 2 + 0.9 V*(s2), so V*(s1) = 81.6
   Suppose δ*(s2) = s1. Then V*(s2) = 15 + 0.9 V*(s1), so V*(s2) = 88.4
So policy 2 is the optimal policy

Exploration/Exploitation Tradeoff
Most of the time, the agent should choose the "best" action
However, in order to ensure the optimal strategy can be learned, the agent must occasionally choose a different action, e.g.
• Choose a random action 5% of the time, or
• Use a Boltzmann distribution to choose the next action:
    P(a) = e^{V̂(a)/T} / ∑_{b∈A} e^{V̂(b)/T}

K-Armed Bandit Problem
The special case of an active stochastic environment with only one state is called a K-Armed Bandit Problem, because it is like being in a room with several (friendly) slot machines, for a limited time, and trying to collect as much money as possible
Each action (slot machine) provides a different average reward

Temporal Difference Learning
TD(0) [also called AHC, or the Widrow-Hoff Rule]:
    V̂(s) ← V̂(s) + η [ r(s, a) + γ V̂(δ(s, a)) − V̂(s) ]        (η = learning rate)
The (discounted) value of the next state, plus the immediate reward, is used as the target value for the current state
A more sophisticated version, called TD(λ), uses a weighted average of future states

Q-Learning
For each s ∈ S, let V*(s) be the maximum discounted reward obtainable from s, and let Q(s, a) be the discounted reward available by first doing action a and then acting optimally
Then the optimal policy is
    π*(s) = argmax_a Q(s, a),  where Q(s, a) = r(s, a) + γ V*(δ(s, a))
Then V*(s) = max_a Q(s, a), so
    Q(s, a) = r(s, a) + γ max_b Q(δ(s, a), b)
This allows iterative approximation of Q by
    Q̂(s, a) ← r(s, a) + γ max_b Q̂(δ(s, a), b)
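As a sketch only, the Q-learning update above can be run on the two-state example from the Calculation slide (γ = 0.9), choosing a random action 5% of the time as suggested on the Exploration/Exploitation slide. The action names and tabular encoding are assumptions; the learned values should match the hand calculation, with max_a Q̂(s1, a) ≈ 92 and max_a Q̂(s2, a) ≈ 100.

    import random

    # Tabular Q-learning on the deterministic two-state example from the
    # Calculation slide (gamma = 0.9). Action names are invented labels.
    S = ["s1", "s2"]
    A = ["stay", "move"]
    delta = {("s1", "stay"): "s1", ("s1", "move"): "s2",
             ("s2", "stay"): "s2", ("s2", "move"): "s1"}
    r     = {("s1", "stay"): 5,    ("s1", "move"): 2,
             ("s2", "stay"): 10,   ("s2", "move"): 15}

    gamma = 0.9
    Q = {(s, a): 0.0 for s in S for a in A}   # Q-hat, initially zero

    s = "s1"
    for step in range(20000):
        # Exploration: random action 5% of the time, otherwise the greedy action
        if random.random() < 0.05:
            a = random.choice(A)
        else:
            a = max(A, key=lambda a: Q[(s, a)])
        s_next = delta[(s, a)]
        # Deterministic Q-learning update: Q(s,a) <- r(s,a) + gamma * max_b Q(s',b)
        Q[(s, a)] = r[(s, a)] + gamma * max(Q[(s_next, b)] for b in A)
        s = s_next

    for s in S:
        best = max(A, key=lambda a: Q[(s, a)])
        print(s, best, round(max(Q[(s, a)] for a in A), 1))
    # Expected output (up to small rounding): s1 move 92.0, then s2 stay 100.0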
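The TD(0) update from the Temporal Difference Learning slide can be checked in the same way. The sketch below (same assumed encoding of the two-state example) evaluates the optimal policy (move from s1, then stay in s2); V̂ should approach the hand-calculated values 92 and 100.

    # TD(0) evaluation of a fixed policy on the two-state example.
    # V-hat(s) <- V-hat(s) + eta * [ r(s,a) + gamma * V-hat(delta(s,a)) - V-hat(s) ]
    delta = {("s1", "stay"): "s1", ("s1", "move"): "s2",
             ("s2", "stay"): "s2", ("s2", "move"): "s1"}
    r     = {("s1", "stay"): 5,    ("s1", "move"): 2,
             ("s2", "stay"): 10,   ("s2", "move"): 15}
    pi    = {"s1": "move", "s2": "stay"}      # the optimal policy from the Calculation slide

    gamma, eta = 0.9, 0.1
    V = {"s1": 0.0, "s2": 0.0}                # V-hat, initially zero

    for _ in range(2000):
        s = "s1"                              # start each episode in s1 so both states keep being visited
        for _ in range(30):
            a = pi[s]
            s_next = delta[(s, a)]
            target = r[(s, a)] + gamma * V[s_next]   # immediate reward plus discounted next value
            V[s] += eta * (target - V[s])            # move V-hat(s) towards the target
            s = s_next

    print({s: round(v, 1) for s, v in V.items()})    # roughly {'s1': 92.0, 's2': 100.0}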
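Finally, the Boltzmann distribution from the Exploration/Exploitation slide can be implemented directly. The value estimates below are made-up numbers; T is the temperature (high T gives nearly random choice, low T gives nearly greedy choice).

    import math, random

    def boltzmann_choice(V_hat, T):
        """Choose action a with probability exp(V_hat(a)/T) / sum_b exp(V_hat(b)/T)."""
        actions = list(V_hat)
        weights = [math.exp(V_hat[a] / T) for a in actions]
        return random.choices(actions, weights=weights, k=1)[0]

    # Made-up value estimates for three actions
    V_hat = {"left": 1.0, "right": 2.0, "forward": 1.5}
    print(boltzmann_choice(V_hat, T=1.0))   # "right" is most likely, but not certain
    print(boltzmann_choice(V_hat, T=0.1))   # low temperature: almost always "right"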
Theoretical Results
Theorem: Q-learning will eventually converge to the optimal policy, for any deterministic Markov Decision Process, assuming an appropriately randomized strategy. (Watkins & Dayan 1992)
Theorem: TD-learning will also converge, with probability 1. (Sutton 1988, Dayan 1992, Dayan & Sejnowski 1994)

Limitations of Theoretical Results
• Delayed reinforcement
  ◮ Reward resulting from an action may not be received until several time steps later, which also slows down the learning
• Search space must be finite
  ◮ Convergence is slow if the search space is large
  ◮ Relies on visiting every state infinitely often
• For "real world" problems, can't rely on a lookup table
  ◮ Need to have some kind of generalization (e.g. TD-Gammon)

Summary
• Reinforcement Learning is an active area of research
• Mathematical results (more than in other areas of AI)
• Need to have an appropriate representation
• Future algorithms which choose their own representations?
• Many practical applications

UNSW ©W. Wobcke et al. 2019–2021