
COMP9414 Reinforcement Learning 1

This Lecture

• Reinforcement Learning vs Supervised Learning

• Models of Optimality

• Exploration vs Exploitation

• Temporal Difference Learning

• Q-Learning


COMP9414: Artificial Intelligence

Lecture 9b: Reinforcement Learning

Wayne Wobcke

e-mail:w. .au


COMP9414 Reinforcement Learning 3

Supervised Learning

• Given a training set and a test set, each consisting of a set of items; for each item in the training set, a set of features and a target output

• Learner must learn a model that can predict the target output for any given item (characterized by its set of features)

• Learner is given the input features and target output for each item in the training set

◮ Items may be presented all at once (batch) or in sequence (online)

◮ Items may be presented at random or in time order (stream)

◮ Learner cannot use the test set at all in defining the model

• Model is evaluated by its performance on predicting the output for each item in the test set (a minimal sketch of this protocol follows below)
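
To make the protocol above concrete, here is a minimal sketch in Python. The toy data and the nearest-mean model are illustrative assumptions, not from the lecture; the point is only that the model is fit from training items alone and then judged on the held-out test items.

```python
# Minimal sketch of the supervised learning protocol: fit on the training
# set only, then evaluate on the held-out test set. The data and the
# nearest-mean classifier are illustrative assumptions, not from the lecture.

def fit_nearest_mean(train):
    """train: list of (features, label); returns per-class mean feature vectors."""
    sums, counts = {}, {}
    for x, y in train:
        s = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            s[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def predict(model, x):
    """Predict the label whose mean feature vector is closest to x."""
    dist = lambda m: sum((a - b) ** 2 for a, b in zip(x, m))
    return min(model, key=lambda y: dist(model[y]))

train = [([1.0, 1.2], "A"), ([0.9, 1.1], "A"), ([3.0, 2.9], "B"), ([3.2, 3.1], "B")]
test = [([1.1, 1.0], "A"), ([2.9, 3.0], "B")]

model = fit_nearest_mean(train)          # the test set is never used when fitting
accuracy = sum(predict(model, x) == y for x, y in test) / len(test)
print("test accuracy:", accuracy)
```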


COMP9414 Reinforcement Learning 2

Types of Learning

• Supervised Learning

◮ Agent is presented with examples of inputs and their target outputs, and must learn a function from inputs to outputs that agrees with the training examples and generalizes to new examples

• Reinforcement Learning

◮ Agent is not presented with target outputs for each input, but is periodically given a reward, and must learn to maximize (expected) rewards over time

• Unsupervised Learning

◮ Agent is only presented with a series of inputs, and must find useful patterns in these inputs


COMP9414 Reinforcement Learning 5

Reinforcement Learning Framework

• Agent interacts with its environment

• There is a set S of states and a set A of actions

• At each time step t, the agent is in some state s_t and must choose an action a_t, whereupon it goes into state s_{t+1} = δ(s_t, a_t) and receives reward r(s_t, a_t)

• In general, r(·) and δ(·) can be multi-valued, with a random element

• The aim is to find an optimal policy π : S → A which maximizes the cumulative reward (a minimal sketch of this interaction loop follows below)
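
A minimal sketch of this interaction loop in Python. The two-state environment's δ and r tables reuse the reward numbers that appear later on the Calculation slide (5, 2, 10, 15); the action names and the random placeholder policy are illustrative assumptions.

```python
import random

# Minimal sketch of the agent-environment loop: at each step the agent picks
# an action a_t in state s_t, receives reward r(s_t, a_t) and moves to
# s_{t+1} = delta(s_t, a_t). The tiny deterministic environment below reuses
# the reward numbers from the Calculation slide; action names are assumptions.

S = ["s1", "s2"]                       # states
A = ["stay", "move"]                   # actions
delta = {("s1", "stay"): "s1", ("s1", "move"): "s2",
         ("s2", "stay"): "s2", ("s2", "move"): "s1"}
r = {("s1", "stay"): 5, ("s1", "move"): 2,
     ("s2", "stay"): 10, ("s2", "move"): 15}

def policy(state):
    return random.choice(A)            # placeholder policy; learning will improve on this

state, total_reward = "s1", 0
for t in range(10):
    action = policy(state)
    total_reward += r[(state, action)]
    state = delta[(state, action)]
print("cumulative reward over 10 steps:", total_reward)
```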


COMP9414 Reinforcement Learning 4

Learning Actions

Supervised learning can be used to learn actions from a training set of situation-action pairs (called Behavioural Cloning; a minimal sketch follows below)

However, there are many applications for which it is difficult, inappropriate, or even impossible to provide a “training set”

• Optimal control

◮ Mobile robots, pole balancing, flying a helicopter

• Resource allocation

◮ Job shop scheduling, mobile phone channel allocation

• Mix of allocation and control

◮ Elevator control, Backgammon
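
As a minimal sketch of behavioural cloning, the recorded situation-action pairs are treated as an ordinary supervised data set and the fitted model is then used directly as a policy. The demonstration data and the 1-nearest-neighbour learner are illustrative assumptions, not from the lecture.

```python
# Minimal sketch of behavioural cloning: treat recorded (situation, action)
# pairs as a supervised learning problem and reuse the learned model as a
# policy. The demonstrations and the 1-NN "learner" are illustrative
# assumptions, not from the lecture.

demonstrations = [                      # (situation features, expert action)
    ([0.1, 0.0], "left"),
    ([0.9, 0.1], "right"),
    ([0.2, 0.8], "left"),
    ([0.8, 0.9], "right"),
]

def cloned_policy(situation):
    """Pick the action of the nearest demonstrated situation (1-NN)."""
    dist = lambda x: sum((a - b) ** 2 for a, b in zip(situation, x))
    _, action = min(demonstrations, key=lambda pair: dist(pair[0]))
    return action

print(cloned_policy([0.85, 0.2]))       # expected: "right"
```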


COMP9414 Reinforcement Learning 7

Comparing Models of Optimality


COMP9414 Reinforcement Learning 6

Models of Optimality

Is a fast nickel worth a slow dime?

Finite horizon reward: ∑_{i=0}^{h} r_{t+i}

Average reward: lim_{h→∞} (1/h) ∑_{i=0}^{h−1} r_{t+i}

Infinite discounted reward: ∑_{i=0}^{∞} γ^i r_{t+i}, where 0 ≤ γ < 1
• Finite horizon reward is simple computationally

• Infinite discounted reward is easier for proving theorems

• Average reward is hard to deal with, because one cannot sensibly choose between a small reward soon and a large reward very far in the future

COMP9414 Reinforcement Learning 9

Value Function

For each state s ∈ S, let V*(s) be the maximum discounted reward obtainable from s

[Figure: (a) a 4×3 grid world with a START state and terminal states −1 and +1; (b) the transition probabilities (0.33, 0.5, 1.0) of a policy in that grid; (c) the resulting state values, ranging from −0.7722 to 0.2152]

The optimal value function determines the optimal policy

COMP9414 Reinforcement Learning 8

Environment Types

Environments can be

• Passive and stochastic

• Active and deterministic (Chess)

• Active and stochastic (Backgammon)

COMP9414 Reinforcement Learning 11

Calculation

Theorem: In a deterministic environment, for an optimal policy, the value function V* satisfies the Bellman equations:

V*(s) = r(s, a) + γ V*(δ(s, a))

where a = π*(s) is the optimal action at state s.

Let δ*(s) be the transition function for π*(s) and suppose γ = 0.9

1. Suppose δ*(s1) = s1. Then V*(s1) = 5 + 0.9 V*(s1), so V*(s1) = 50
   Suppose δ*(s2) = s2. Then V*(s2) = 10 + 0.9 V*(s2), so V*(s2) = 100

2. Suppose δ*(s1) = s2. Then V*(s1) = 2 + 0.9 V*(s2), so V*(s1) = 92
   Suppose δ*(s2) = s2. Then V*(s2) = 10 + 0.9 V*(s2), so V*(s2) = 100

3. Suppose δ*(s1) = s2. Then V*(s1) = 2 + 0.9 V*(s2), so V*(s1) = 81.6
   Suppose δ*(s2) = s1. Then V*(s2) = 15 + 0.9 V*(s1), so V*(s2) = 88.4

So 2 is the optimal policy

COMP9414 Reinforcement Learning 10

Example: Delayed Rewards

COMP9414 Reinforcement Learning 13

K-Armed Bandit Problem

The special case of an active stochastic environment with only one state is called a K-Armed Bandit Problem, because it is like being in a room with several (friendly) slot machines, for a limited time, and trying to collect as much money as possible

Each action (slot machine) provides a different average reward

COMP9414 Reinforcement Learning 12

Exploration/Exploitation Tradeoff

Most of the time, the agent should choose the “best” action

However, in order to ensure the optimal strategy can be learned, the agent must occasionally choose a different action, e.g.

• Choose a random action 5% of the time, or

• Use a Boltzmann distribution to choose the next action

P(a) = e^{V̂(a)/T} / ∑_{b∈A} e^{V̂(b)/T}

COMP9414 Reinforcement Learning 15

Q-Learning

For each s ∈ S, let V*(s) be the maximum discounted reward obtainable from s, and let Q(s, a) be the discounted reward available by first doing action a and then acting optimally

Then the optimal policy is

π*(s) = argmax_a Q(s, a), where Q(s, a) = r(s, a) + γ V*(δ(s, a))

Then V*(s) = max_a Q(s, a), so

Q(s, a) = r(s, a) + γ max_b Q(δ(s, a), b)

This allows iterative approximation of Q by

Q̂(s, a) ← r(s, a) + γ max_b Q̂(δ(s, a), b)

COMP9414 Reinforcement Learning 14

Temporal Difference Learning

TD(0) [also called AHC, or Widrow-Hoff Rule]

V̂(s) ← V̂(s) + η [r(s, a) + γ V̂(δ(s, a)) − V̂(s)]   (η = learning rate)

The (discounted) value of the next state, plus the immediate reward, is used as the target value for the current state

A more sophisticated version, called TD(λ), uses a weighted average of future states

Short code sketches of the reward models above, the two-state Bellman calculation, Boltzmann exploration, and the Q-learning and TD(0) updates follow below.
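
First, a small sketch of the three models of optimality: it evaluates the finite horizon, average and (truncated) infinite discounted reward for a made-up reward sequence. The sequence, the horizon h and γ are illustrative assumptions.

```python
# Sketch of the three models of optimality for a made-up reward sequence
# r_t, r_{t+1}, ... (the numbers, horizon h and gamma are assumptions).

rewards = [2, 0, 0, 10, 1, 1, 1, 1]     # r_{t+i} for i = 0, 1, ...
h = 4                                    # finite horizon
gamma = 0.9                              # discount factor, 0 <= gamma < 1

finite_horizon = sum(rewards[:h + 1])                 # sum_{i=0}^{h} r_{t+i}
average = sum(rewards[:h]) / h                        # (1/h) sum_{i=0}^{h-1} r_{t+i}, h finite here
discounted = sum(gamma ** i * r for i, r in enumerate(rewards))   # truncated infinite sum

print(finite_horizon, average, discounted)
```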
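Next, the two-state calculation from the Calculation slide can be checked mechanically by iterating the Bellman equation V*(s) = r(s, a) + γ V*(δ(s, a)) for each candidate policy. The reward and transition tables below restate the slide's numbers; the action names are made up.

```python
# Check of the two-state Bellman calculation (gamma = 0.9) from the
# Calculation slide: for each candidate deterministic policy, solve
# V(s) = r(s, pi(s)) + gamma * V(delta(s, pi(s))) by simple iteration.

gamma = 0.9
delta = {("s1", "stay"): "s1", ("s1", "move"): "s2",
         ("s2", "stay"): "s2", ("s2", "move"): "s1"}
r = {("s1", "stay"): 5, ("s1", "move"): 2,
     ("s2", "stay"): 10, ("s2", "move"): 15}

policies = {
    "1": {"s1": "stay", "s2": "stay"},
    "2": {"s1": "move", "s2": "stay"},
    "3": {"s1": "move", "s2": "move"},
}

for name, pi in policies.items():
    V = {"s1": 0.0, "s2": 0.0}
    for _ in range(1000):              # iterate the Bellman equations to a fixed point
        V = {s: r[(s, pi[s])] + gamma * V[delta[(s, pi[s])]] for s in V}
    print(name, round(V["s1"], 1), round(V["s2"], 1))
# Policy 2 gives the largest values (92 and 100), matching the slide.
```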
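A sketch of the two exploration rules from the Exploration/Exploitation Tradeoff slide, in the one-state (K-armed bandit) setting. The value estimates V̂(a), ε and the temperature T are illustrative assumptions.

```python
import math
import random

# Sketch of the two exploration rules from the Exploration/Exploitation slide,
# for a one-state (K-armed bandit) setting. V_hat holds estimated action
# values; the numbers, epsilon and temperature T are illustrative assumptions.

V_hat = {"arm1": 1.0, "arm2": 2.5, "arm3": 0.5}

def epsilon_greedy(V_hat, epsilon=0.05):
    """Choose a random action 5% of the time, otherwise the current best."""
    if random.random() < epsilon:
        return random.choice(list(V_hat))
    return max(V_hat, key=V_hat.get)

def boltzmann(V_hat, T=1.0):
    """Sample a with probability P(a) = e^{V_hat(a)/T} / sum_b e^{V_hat(b)/T}."""
    exps = {a: math.exp(v / T) for a, v in V_hat.items()}
    total = sum(exps.values())
    probs = [exps[a] / total for a in exps]      # explicit normalisation, as on the slide
    return random.choices(list(exps), weights=probs)[0]

print(epsilon_greedy(V_hat), boltzmann(V_hat))
```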
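A sketch of the tabular Q-learning update Q̂(s, a) ← r(s, a) + γ max_b Q̂(δ(s, a), b) on the same deterministic two-state environment; the step count and purely random exploration are illustrative choices.

```python
import random

# Sketch of deterministic tabular Q-learning on the two-state environment
# used above: Q_hat(s,a) <- r(s,a) + gamma * max_b Q_hat(delta(s,a), b).
# Action names, step count and random exploration are assumptions.

gamma = 0.9
S = ["s1", "s2"]
A = ["stay", "move"]
delta = {("s1", "stay"): "s1", ("s1", "move"): "s2",
         ("s2", "stay"): "s2", ("s2", "move"): "s1"}
r = {("s1", "stay"): 5, ("s1", "move"): 2,
     ("s2", "stay"): 10, ("s2", "move"): 15}

Q = {(s, a): 0.0 for s in S for a in A}

state = "s1"
for step in range(2000):
    action = random.choice(A)          # visit every (s, a) pair again and again
    next_state = delta[(state, action)]
    Q[(state, action)] = r[(state, action)] + gamma * max(Q[(next_state, b)] for b in A)
    state = next_state

policy = {s: max(A, key=lambda a: Q[(s, a)]) for s in S}
print(policy)      # converges to s1 -> move, s2 -> stay, i.e. policy 2 above
```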
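Finally, a sketch of the TD(0) update, used here to evaluate the fixed policy found optimal above on the same environment. The learning rate η, episode length and episode count are illustrative assumptions.

```python
# Sketch of TD(0): V_hat(s) <- V_hat(s) + eta * [r(s,a) + gamma * V_hat(s') - V_hat(s)],
# used to evaluate the fixed policy "s1 -> move, s2 -> stay" (policy 2 above).
# The learning rate eta, episode length and episode count are assumptions.

gamma, eta = 0.9, 0.1
delta = {("s1", "move"): "s2", ("s2", "stay"): "s2"}
r = {("s1", "move"): 2, ("s2", "stay"): 10}
pi = {"s1": "move", "s2": "stay"}

V_hat = {"s1": 0.0, "s2": 0.0}
for episode in range(500):
    state = "s1"                       # restart so both states keep being visited
    for step in range(20):
        action = pi[state]
        next_state = delta[(state, action)]
        target = r[(state, action)] + gamma * V_hat[next_state]   # reward + discounted next value
        V_hat[state] += eta * (target - V_hat[state])             # move the estimate toward the target
        state = next_state

print({s: round(v, 1) for s, v in V_hat.items()})   # approaches V*(s1) = 92, V*(s2) = 100
```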
COMP9414 Reinforcement Learning 17

Limitations of Theoretical Results

• Delayed reinforcement

◮ Reward resulting from an action may not be received until several time steps later, which also slows down the learning

• Search space must be finite

◮ Convergence is slow if the search space is large

◮ Relies on visiting every state infinitely often

• For “real world” problems, can’t rely on a lookup table

◮ Need to have some kind of generalization (e.g. TD-Gammon)

COMP9414 Reinforcement Learning 16

Theoretical Results

Theorem: Q-learning will eventually converge to the optimal policy, for any deterministic Markov Decision Process, assuming an appropriately randomized strategy. (Watkins & Dayan 1992)

Theorem: TD-learning will also converge, with probability 1. (Sutton 1988, Dayan 1992, Dayan & Sejnowski 1994)

COMP9414 Reinforcement Learning 18

Summary

• Reinforcement Learning is an active area of research

• Mathematical results (more than in other areas of AI)

• Need to have an appropriate representation

• Future algorithms which choose their own representations?

• Many practical applications

UNSW ©W. Wobcke et al. 2019–2021