
COMP9444
Neural Networks and Deep Learning

7a. Reinforcement Learning

Outline

- Reinforcement Learning vs. Supervised Learning
- Models of Optimality
- Exploration vs. Exploitation
- Value Function Learning
  - TD-Learning
  - Q-Learning
- TD-Gammon
Recall: Supervised Learning

- We have a training set and a test set, each consisting of a set of examples. For each example, a number of input attributes and a target attribute are specified.
- The aim is to predict the target attribute, based on the input attributes.
- Various learning paradigms are available:
  - Decision Trees
  - Neural Networks
  - SVM
  - .. others ..

Learning of Actions

Supervised Learning can also be used to learn Actions, if we construct a training set of situation-action pairs (called Behavioral Cloning). However, there are many applications for which it is difficult, inappropriate, or even impossible to provide a "training set", for example:

- optimal control
  - mobile robots, pole balancing, flying a helicopter
- resource allocation
  - job shop scheduling, mobile phone channel allocation
- mix of allocation and control
  - elevator control, backgammon
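As a concrete illustration of the Behavioral Cloning idea above, a policy can be obtained by ordinary supervised learning on recorded situation-action pairs. The sketch below is a toy example: the feature vectors, action labels and the choice of scikit-learn's decision tree are all assumptions made for illustration, not part of the lecture.

```python
# Behavioral Cloning as supervised learning on situation-action pairs.
# The data below is a hypothetical toy example: each "situation" is a small
# feature vector, and each label is the action a human demonstrator took.
from sklearn.tree import DecisionTreeClassifier

situations = [
    [0.0, 1.0],   # e.g. pole leaning right
    [0.2, 0.8],
    [1.0, 0.1],   # e.g. pole leaning left
    [0.9, 0.0],
]
actions = ["push_right", "push_right", "push_left", "push_left"]

# Train any standard supervised learner (here a decision tree) on the pairs
clf = DecisionTreeClassifier().fit(situations, actions)

# The learned policy: map a new situation to an action
print(clf.predict([[0.1, 0.9]]))   # expected: ["push_right"]
```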

Reinforcement Learning Framework

- An agent interacts with its environment.
- There is a set S of states and a set A of actions.
- At each timestep t, the agent is in some state s_t. It must choose an action a_t, whereupon it goes into state s_{t+1} = δ(s_t, a_t) and receives reward r_t = R(s_t, a_t).
- The agent has a policy π : S → A. We aim to find an optimal policy π* which maximizes the cumulative reward.
- In general, δ, R and π can be multi-valued, with a random element, in which case we write them as probability distributions:

      δ(s_{t+1} = s | s_t, a_t),    R(r_t = r | s_t, a_t),    π(a_t = a | s_t)

Probabilistic Policies

There are some environments in which any deterministic agent will perform very poorly, and the optimal (reactive) policy must be stochastic (i.e. randomized). In 2-player games like Rock-Paper-Scissors, a random strategy is also required in order to make agent choices unpredictable to the opponent.
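To make the framework concrete, here is a minimal sketch of the agent-environment loop. The two-state environment, its δ and R functions, and the uniformly random policy are hypothetical examples invented for illustration, not taken from the lecture.

```python
import random

# Toy deterministic environment (hypothetical): states S = {0, 1}, actions A = {0, 1}
def delta(s, a):            # transition function:  s_{t+1} = delta(s_t, a_t)
    return (s + a) % 2

def reward(s, a):           # reward function:      r_t = R(s_t, a_t)
    return 1.0 if (s == 1 and a == 1) else 0.0

def policy(s):              # a stochastic policy pi(a | s): here uniformly random
    return random.choice([0, 1])

# Agent-environment interaction for a few timesteps
s = 0
total_reward = 0.0
for t in range(10):
    a = policy(s)
    r = reward(s, a)
    s = delta(s, a)
    total_reward += r
print("cumulative reward:", total_reward)
```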
Models of Optimality

Is a fast nickel worth a slow dime?

- Finite horizon reward:       ∑_{i=0}^{h−1} r_{t+i}
- Infinite discounted reward:  ∑_{i=0}^{∞} γ^i r_{t+i},   0 ≤ γ < 1
- Average reward:              lim_{h→∞} (1/h) ∑_{i=0}^{h−1} r_{t+i}

- Finite horizon reward is simple computationally.
- Infinite discounted reward is easier for proving theorems.
- Average reward is hard to deal with, because we can't sensibly choose between a small reward soon and a large reward very far in the future.

Comparing Models of Optimality

[Figure: a start state with three available actions a1, a2, a3, whose reward sequences include payoffs of +2, +11 and +10 respectively, received at different points in time.]

- Finite horizon, k = 4       →  a1 is preferred
- Infinite horizon, γ = 0.9   →  a2 is preferred
- Average reward              →  a3 is preferred
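To see how the three criteria can rank the same behaviours differently, here is a small sketch with made-up reward streams, chosen so that (as in the a1/a2/a3 example above) the finite-horizon criterion prefers a1, the discounted criterion prefers a2, and the average-reward criterion prefers a3. The streams and the 50-step truncation are assumptions for illustration only.

```python
# Sketch: comparing the three models of optimality on hypothetical reward streams.

def finite_horizon(rewards, h):
    return sum(rewards[:h])

def infinite_discounted(rewards, gamma):
    # computed over the truncated stream, as an approximation of the infinite sum
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

def average_reward(rewards):
    return sum(rewards) / len(rewards)

# hypothetical reward streams, truncated to 50 steps
a1 = [2] + [0] * 49                 # +2 immediately, nothing afterwards
a2 = [0] * 5 + [11] + [0] * 44      # +11 after a short delay
a3 = [0] * 30 + [10] * 20           # +10 repeatedly, but starting much later

for name, stream in [("a1", a1), ("a2", a2), ("a3", a3)]:
    print(name,
          "finite(h=4):", finite_horizon(stream, 4),
          "discounted(gamma=0.9):", round(infinite_discounted(stream, 0.9), 2),
          "average:", round(average_reward(stream), 2))
```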
RL Approaches

- Value Function Learning
  - TD-Learning
  - Q-Learning
- Policy Learning
  - Evolution Strategies
  - Policy Gradients
- Actor-Critic
  - Combination of Value and Policy learning

Value Function Learning

Every policy π determines a Value Function V^π : S → R, where V^π(s) is the average discounted reward received by an agent who begins in state s and chooses its actions according to policy π.

If π = π* is optimal, then V*(s) = V^{π*}(s) is the maximum (expected) discounted reward obtainable from state s. Learning this optimal value function can help to determine the optimal strategy.

The agent retains its own estimate V() of the "true" value function V*(). The aim of Value Function Learning is generally to start with a random V and then iteratively improve it so that it more closely approximates V*. This process is sometimes called "Bootstrapping".

Value Function

[Figure: a small grid world with a START state, terminal rewards +1 and −1, transition probabilities (.5, .33, 1.0), and the estimated value of each non-terminal state, e.g. 0.2152, 0.0886, −0.0380, −0.1646, −0.4430, −0.2911, −0.5443, −0.7722.]

This is the Value Function V^π where π is the policy of choosing between available actions uniformly randomly.

K-Armed Bandit Problem

The special case of an active, stochastic environment with only one state is called the K-Armed Bandit Problem, because it is like being in a room with several (friendly) slot machines, for a limited time, and trying to collect as much money as possible. Each action (slot machine) provides a different average reward.

Exploration / Exploitation Tradeoff

Most of the time we should choose what we think is the best action. However, in order to ensure convergence to the optimal strategy, we must occasionally choose something different from our preferred action, e.g.

- choose a random action 5% of the time, or
- use Softmax (Boltzmann distribution) to choose the next action:

      P(a) = e^{R(a)/T} / ∑_{b∈A} e^{R(b)/T}

    I was born to try...
    But you've got to make choices
    Be wrong or right
    Sometimes you've got to sacrifice the things you like.
        - Delta Goodrem
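A small sketch of the two exploration strategies just described, assuming R(a) denotes our current estimate of the reward for action a. The action names, estimates and parameter settings are made up for illustration.

```python
import math
import random

# Hypothetical estimated rewards for each action (e.g. the average payout of each bandit arm)
R_est = {"a1": 1.0, "a2": 1.5, "a3": 0.2}

def epsilon_greedy(r, epsilon=0.05):
    """Choose a random action 5% of the time, otherwise the current best action."""
    if random.random() < epsilon:
        return random.choice(list(r))
    return max(r, key=r.get)

def softmax(r, T=1.0):
    """Boltzmann exploration: P(a) = exp(R(a)/T) / sum_b exp(R(b)/T)."""
    actions = list(r)
    weights = [math.exp(r[a] / T) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]

print(epsilon_greedy(R_est), softmax(R_est, T=0.5))
```

Lower temperatures T make softmax behave more greedily, while higher temperatures make the choice closer to uniform.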
Delayed Reinforcement

[Figure: a small environment with states s1, s2, s3, actions a1, a2, and rewards of +1, −1 and +2 on its transitions.]

We may need to take several actions before we can get the good reward.

Temporal Difference Learning

Let's first assume that R and δ are deterministic. Then the (true) value V*(s) of the current state s should be equal to the immediate reward plus the discounted value of the next state:

      V*(s) = R(s,a) + γ V*(δ(s,a))

We can turn this into an update rule for the estimated value, i.e.

      V(s_t) ← r_t + γ V(s_{t+1})

If R and δ are stochastic (multi-valued), it is not safe to simply replace V(s_t) with the expression on the right hand side. Instead, we move its value fractionally in this direction, proportional to a learning rate η:

      V(s_t) ← V(s_t) + η [ r_t + γ V(s_{t+1}) − V(s_t) ]

Q-Learning

Q-Learning is similar to TD-Learning except that we use a function Q^π : S × A → R which depends on a state-action pair instead of just a state.

For any policy π, the Q-Function Q^π(s,a) is the average discounted reward received by an agent who begins in state s, first performs action a and then follows policy π for all subsequent timesteps.

If π = π* is optimal, then Q*(s,a) = Q^{π*}(s,a) is the maximum (expected) discounted reward obtainable from s, if the agent is forced to take action a in the first timestep but can act optimally thereafter.

For a deterministic environment, π*, Q* and V* are related by

      π*(s)   = argmax_a Q*(s,a)
      V*(s)   = max_b Q*(s,b)
      Q*(s,a) = R(s,a) + γ V*(δ(s,a))

So

      Q*(s,a) = R(s,a) + γ max_b Q*(δ(s,a), b)

This allows us to iteratively approximate Q by

      Q(s_t,a_t) ← r_t + γ max_b Q(s_{t+1}, b)

The agent retains its own (initially random) estimate Q() and iteratively improves this estimate to more closely approximate the "true" function Q*(). If the environment is stochastic, we instead write

      Q(s_t,a_t) ← Q(s_t,a_t) + η [ r_t + γ max_b Q(s_{t+1}, b) − Q(s_t,a_t) ]
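The following sketch shows the stochastic Q-Learning update above in tabular form, with ε-greedy exploration. The environment interface env_step(s, a) -> (next_state, reward, done) and the hyperparameter values are assumptions for illustration, not part of the lecture.

```python
import random
from collections import defaultdict

def q_learning(env_step, actions, start_state, episodes=1000,
               eta=0.1, gamma=0.9, epsilon=0.05, max_steps=100):
    """Tabular Q-Learning with epsilon-greedy exploration.

    env_step(s, a) is assumed to return (next_state, reward, done);
    this interface is hypothetical.
    """
    Q = defaultdict(float)                      # Q[(s, a)], initially 0

    for _ in range(episodes):
        s = start_state
        for _ in range(max_steps):
            # epsilon-greedy action choice
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])

            s_next, r, done = env_step(s, a)

            # Q(s_t,a_t) <- Q(s_t,a_t) + eta [ r_t + gamma max_b Q(s_{t+1},b) - Q(s_t,a_t) ]
            target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += eta * (target - Q[(s, a)])

            if done:
                break
            s = s_next
    return Q
```

Replacing max_b Q(s_{t+1}, b) in the target with an estimate V(s_{t+1}) gives the corresponding tabular TD update for the state-value function.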
Q-Learning Example

[Figure: a three-state deterministic environment with states s1, s2, s3, actions a1, a2, and rewards of +1, −1 and +2 on its transitions.]

Exercise:

1. compute V^π(s3) if π(s3) = a2 and γ = 0.9
2. compute π*, V* and Q* for this environment (if γ = 0.9)

Theoretical Results

Theorem: Q-Learning will eventually converge to the optimal policy, for any deterministic Markov decision process, assuming an appropriately randomized strategy. (Watkins & Dayan 1992)

Theorem: TD-Learning will also converge, with probability 1. (Sutton 1988, Dayan 1992, Dayan & Sejnowski 1994)

Limitations of Theoretical Results

- Delayed reinforcement
  - reward resulting from an action may not be received until several time steps later, which also slows down the learning
- Search space must be finite
  - convergence is slow if the search space is large
  - relies on visiting every state infinitely often
- For "real world" problems, we can't rely on a lookup table
  - need to have some kind of generalisation (e.g. TD-Gammon)

Computer Game Playing

Suppose we want to write a computer program to play a game like Backgammon, Chess, Checkers or Go. This can be done using a tree search algorithm (expectimax, MCTS, or minimax with alpha-beta pruning). But we need:

(a) an appropriate way of encoding any board position as a set of numbers, and
(b) a way to train a neural network or other learning system to compute a board evaluation, based on those numbers.

Backgammon

[Figure: a backgammon board with its points numbered 0 to 25, showing the layout used for the board encoding.]

Board encoding:

- 4 units × 2 players × 24 points
- 2 units for the bar
- 2 units for off the board

Backgammon Neural Network

Two-layer neural network:

- 196 input units
- 20 hidden units
- 1 output unit

The input s is the encoded board position (state); the output V(s) is the value of this position (probability of winning). At each move, roll the dice, find all possible "next board positions", convert them to the appropriate input format, feed them to the network, and choose the one which produces the largest output.

Backpropagation

      w ← w + η (T − V) ∂V/∂w

where w = weight, η = learning rate, V = actual output, T = target value.

How to Choose the Target Value

Q: How do we choose the target value T? In other words, how do we know what the value of the current position "should have been"? or, how do we find a better estimate for the value of the current position?

- Behavioral Cloning (Supervised Learning)
  - learn moves from human games (Expert Preferences)
  - general method, does not rely on knowing the "world model" (rules of the game)
- Temporal Difference Learning
  - use subsequent positions to refine evaluation of current position
- methods which combine learning with tree search (must know the "world model")
  - TD-Root, TD-Leaf, MCTS, TreeStrap

TD-Learning for Episodic Games

Backgammon is an example of an episodic task, in the sense that the agent receives just a single reward at the end of the game, which we can consider as the final value V_{m+1} (typically, +1 for a win or −1 for a loss). We then have a sequence of game positions, each with its own (estimated) value:

      (current estimate)  V_t → V_{t+1} → ... → V_m → V_{m+1}  (final result)

In this context, TD-Learning simplifies and becomes equivalent to using the value of the next state (V_{t+1}) as the training value for the current state (V_t).

A fancier version, called TD(λ), uses T_t as the training value for V_t, where

      T_t = (1 − λ) ∑_{k=t+1}^{m} λ^{k−1−t} V_k + λ^{m−t} V_{m+1}

T_t is a weighted average of future estimates, where λ = discount factor (0 ≤ λ < 1).

TD-Gammon

- Tesauro trained two networks:
  - EP-network was trained on Expert Preferences (Supervised)
  - TD-network was trained by self play (TD-Learning)
- The TD-network outperformed the EP-network.
- With modifications such as 3-step lookahead (expectimax) and additional hand-crafted input features, TD-Gammon became the best Backgammon player in the world (Tesauro, 1995).
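To make the TD(λ) target concrete, here is a small sketch that computes T_t from a sequence of estimated position values and the final game result, following the formula in the "TD-Learning for Episodic Games" section above. The example value sequence and parameter values are invented for illustration.

```python
def td_lambda_target(values, final_value, t, lam):
    """Compute T_t = (1 - lam) * sum_{k=t+1}^{m} lam^(k-1-t) * V_k  +  lam^(m-t) * V_{m+1}.

    values:      estimated values V_1 ... V_m of the positions in one game
                 (Python-indexed from 0, so values[k-1] = V_k)
    final_value: V_{m+1}, the final game result (e.g. +1 for a win, -1 for a loss)
    t:           the position whose training target is being computed (1 <= t <= m)
    """
    m = len(values)
    target = (1 - lam) * sum(lam ** (k - 1 - t) * values[k - 1]
                             for k in range(t + 1, m + 1))
    target += lam ** (m - t) * final_value
    return target

# Hypothetical sequence of value estimates for one game that ends in a win (+1)
V = [0.1, 0.2, 0.4, 0.7]            # V_1 .. V_4
print(td_lambda_target(V, final_value=+1.0, t=1, lam=0.7))
```

Since the weights (1 − λ)λ^{k−1−t} and λ^{m−t} sum to 1, the target is indeed a weighted average of the later estimates and the final result; with λ = 0 it reduces to using V_{t+1} alone, as in the simplified episodic rule above.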