Agent-based Systems
Paolo Turrini
www.dcs.warwick.ac.uk/~pturrini · p.turrini@warwick.ac.uk
Today
We have seen MDPs and how to calculate the optimal policy (Value Iteration Algorithm, VIA). However:
The state space may be too big to do it
Even if we do know the states, we might not know how they are related (the transition probabilities)
Today we are going to see how to handle these cases, using Reinforcement Learning.
Incomplete knowledge
What if we don’t know what game we are playing?
Play anyway and see what happens! And play as much as possible!
We can’t possibly calculate everything
Figure: The complexity of Go
Neural networks + tree search
Understanding the value of game positions using:
Neural networks: pattern recognition from a database of previously played games;
Tree search: self-play (a lot of it!) to estimate the value of moves.
David Silver et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.
David Silver et al. Mastering the game of Go without human knowledge. Nature, 2017.
I’m going to focus only on how to infer value without using pre-processed information.
Monte Carlo Tree Search
To evaluate intermediate game positions we play a huge number of games from then on.
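As a rough sketch (not from the lecture), plain Monte Carlo evaluation of a position by random rollouts could look as follows; the game-state interface (is_terminal, legal_moves, apply, winner) is hypothetical.

```python
import random

def rollout_value(state, player, n_rollouts=1000):
    """Estimate the value of `state` for `player` by playing many
    random games to the end and averaging the outcomes."""
    wins = 0
    for _ in range(n_rollouts):
        s = state
        while not s.is_terminal():                 # play a random game to the end
            s = s.apply(random.choice(s.legal_moves()))
        wins += 1 if s.winner() == player else 0
    return wins / n_rollouts                       # estimated win rate of the position
```

Full MCTS additionally grows a search tree and biases which moves get rolled out, but the core idea is the same: value a position by sampling complete games from it.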
The world
Begin at the start state
The game ends when we reach either goal state, +1 or −1
Rewards: +1 and −1 for the terminal states respectively, −0.04 for all the others
Full observability
Stochastic actions (possibly different at each state!), four directions.
Partial observability
Stochastic actions (possibly different at each state!), four directions.
Assumptions
This is what is known (by the agent) about the environment
Partially observable (we know where we are, not where we will end up being)
Markovian (the past doesn’t matter)
Stochastic actions (we are not in full control of our choices)
Discounted rewards (we might be more or less patient)
Passive and Active RL
Passive reinforcement learning: I have a policy
I don’t know the probabilities
I don’t know the values of states
I don’t know the value of actions
Active reinforcement learning: I don’t even have a policy
Passive Reinforcement Learning
I don’t know the values or the rewards
I don’t know the probabilities
I’m gonna play anyway
The plan: execute a series of trials, each running until a terminal state, just like Monte Carlo Tree Search!
Passive Reinforcement Learning
Remember: the expected utility is the expected sum of discounted rewards under the policy
Assume: γ = 1, just to make things simple
Passive Reinforcement Learning
Suppose I get these trials:
Trial 1: (1,1)−0.04 (2,1)−0.04 (3,1)−0.04 (2,1)−0.04 (3,1)−0.04 (3,2)−0.04 (3,3)−0.04 (3,4)+1
Trial 2: (1,1)−0.04 (2,1)−0.04 (3,1)−0.04 (3,2)−0.04 (3,3)−0.04 (2,3)−0.04 (3,3)−0.04 (3,4)+1
Trial 3: (1,1)−0.04 (1,2)−0.04 (1,3)−0.04 (2,3)−0.04 (2,4)−1
Idea: Frequency is the key!
Each trial provides a sample of the reward-to-go for every state it visits.
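A minimal sketch of this idea in Python (names are mine, not from the lecture): treat the reward-to-go observed from each visited state as one sample of its utility and average over trials. For example, with γ = 1 the first trial above gives state (1,1) the sample 7 × (−0.04) + 1 = 0.72.

```python
from collections import defaultdict

def utility_estimates(trials, gamma=1.0):
    """Estimate v_pi(s) as the average observed reward-to-go from s.
    Each trial is a list of (state, reward) pairs, e.g.
    [((1, 1), -0.04), ((2, 1), -0.04), ..., ((3, 4), 1.0)]."""
    totals, counts = defaultdict(float), defaultdict(int)
    for trial in trials:
        rewards = [r for _, r in trial]
        for i, (state, _) in enumerate(trial):
            # discounted reward-to-go from position i to the end of the trial
            g = sum(gamma ** k * r for k, r in enumerate(rewards[i:]))
            totals[state] += g
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}
```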
Temporal difference learning
When a transition occurs from state s to state s′, we apply the following update:
vπ(s) ← vπ(s) + α(r(s) + γ vπ(s′) − vπ(s))
where α ∈ [0, 1] is a confidence parameter: how much we value the new information.
α can be the inverse of the number of times we visited a state: the more we visited, the less we want to learn.
Notice: rare transitions? Well, they are rare, so they rarely affect the updates.
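A minimal sketch of one TD update in Python (dictionary-based; the names are mine); α = 1/(n + 1) follows one reading of the visit-count rule above.

```python
def td_update(v, visits, s, r, s_next, gamma=1.0):
    """Apply v(s) <- v(s) + alpha * (r(s) + gamma * v(s') - v(s))
    after observing the transition s -> s_next with reward r."""
    visits[s] = visits.get(s, 0) + 1
    alpha = 1.0 / (visits[s] + 1)          # the more often we saw s, the less we learn
    v_s = v.get(s, 0.0)
    v[s] = v_s + alpha * (r + gamma * v.get(s_next, 0.0) - v_s)
    return v[s]
```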
TDL in action
Initialise the values, for:
γ=1
deterministic agent
α = 1/(n + 1), where n is the number of times we visited the state
r = 0 everywhere but the terminal states
TDL in action
Suppose we walk (Up, Up, Right, Right, Right). Apply the update to the states as you walk along:
vπ(s) ← vπ(s) + α(r(s) + γ vπ(s′) − vπ(s))
TDL in action
I keep walking the same way…
vπ(s) ← vπ(s) + α(r(s) + γ vπ(s′) − vπ(s))
TDL in action
Again (Up, Up, Right, Right, Right)….
vπ(s) ← vπ(s) + α(r(s) + γ vπ(s′) − vπ(s))
Passive Reinforcement Learning
We have a policy which we follow;
We back-propagate the value with a Bellman-like adjustment;
We can use a learning rate, depending on our confidence.
Active Reinforcement Learning
Now we start without a fixed policy…
What the agent needs to learn is the values of the optimal policy
v(s) = r(s) + γ max_a Σ_{s′} P(s′ | (s, a)) v(s′)
Important: we can’t stick to our (locally optimal) habits, we need to try new stuff!
Exploration vs Exploitation
Q-learning
Q(s, a) is the value of performing action a in state s:
Q(s, a) = r(s) + γ Σ_{s′} P(s′ | (s, a)) max_{a′} Q(s′, a′)
The value of a state is the value of its best action:
v(s) = max_a Q(s, a)
Q-learning
Q(s, a) ← Q(s, a) + α(r(s) + γ max_{a′} Q(s′, a′) − Q(s, a))
It’s temporal difference learning, without a fixed policy!
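A minimal sketch of this update in Python (a dictionary-based Q-table; the names are mine):

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """Apply Q(s,a) <- Q(s,a) + alpha * (r(s) + gamma * max_a' Q(s',a') - Q(s,a))."""
    # for a terminal successor all Q(s', .) entries stay 0, so the target reduces to r
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    q_sa = Q.get((s, a), 0.0)
    Q[(s, a)] = q_sa + alpha * (r + gamma * best_next - q_sa)
    return Q[(s, a)]
```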
The Maze
A 2×3 grid world
The grid world contains a pit, an exit and some walls; the agent knows they exist, but not where they are
Arrive at the exit: win; fall into the pit: die; hit a wall: suffer
Goal: Get out of this maze (i.e. safely arrive at the exit) as quickly as possible
[Figure: the 2×3 maze grid, rows 0–1, columns 0–2]
RL components in this problem
State: the agent’s current location
Action: LEFT, RIGHT, UP, DOWN
Environment dynamics:
Collision results in no movement
otherwise, move one square in the intended direction
Rewards:
normal move: −1
hit a wall: −10
die: −100
exit: +100
Our Goal: find the best route to exit
Applying Q-Learning to The Maze Problem
α = 0.5, γ = 0.9
All Q-values are initialised as 0
Applying Q-Learning to The Maze Problem
α = 0.5, γ = 0.9
All Q-values are initialised as 0
Choose UP, and receive −100 (the agent falls into the pit and dies)
Update Q-value: Q([0, 2], UP) = (1 − 0.5) × 0 + 0.5 × (−100 + 0.9 × 0) = −50
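Using the q_update sketch from the Q-learning slide (same assumed names, with states written as (row, column) tuples), this first step comes out as:

```python
Q = {}                                        # all Q-values start at 0
actions = ["LEFT", "RIGHT", "UP", "DOWN"]
# UP from (0, 2) falls into the pit: reward -100, episode ends (no successor value)
q_update(Q, (0, 2), "UP", -100, "terminal", actions, alpha=0.5, gamma=0.9)
print(Q[((0, 2), "UP")])                      # -50.0
```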
Applying Q-Learning to The Maze Problem
α = 0.5, γ = 0.9
Choose LEFT, and receive -1
Update Q-value: Q([0, 2], LEFT) = (1 − 0.5) × 0 + 0.5 × (−1 + 0.9 × 0) = −0.5
Applying Q-Learning to The Maze Problem
α = 0.5, γ = 0.9
Choose UP, and receive -10
Update Q-value: Q([0, 1], UP) = (1 − 0.5) × 0 + 0.5 × (−10 + 0.9 × 0) = −5
Applying Q-Learning to The Maze Problem
α = 0.5, γ = 0.9
Choose LEFT, and receive -1
Update Q-value: Q([0, 1], LEFT) = (1 − 0.5) × 0 + 0.5 × (−1 + 0.9 × 0) = −0.5
Applying Q-Learning to The Maze Problem
α = 0.5, γ = 0.9
Choose UP, and receive -1
Update Q-value: Q([0, 0], UP) = (1 − 0.5) × 0 + 0.5 × (−1 + 0.9 × 0) = −0.5
Applying Q-Learning to The Maze Problem
α = 0.5, γ = 0.9
Choose RIGHT, and receive +100 (the agent reaches the exit and wins)
Update Q-value: Q([1, 0], RIGHT) = (1 − 0.5) × 0 + 0.5 × (100 + 0.9 × 0) = 50
Applying Q-Learning to The Maze Problem
α = 0.5, γ = 0.9
The next time the agent visits [0, 0] and performs UP:
Q([0, 0], UP) = (1 − 0.5) × (−0.5) + 0.5 × (−1 + 0.9 × 50) = 21.75
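With the same q_update sketch, this revisit is reproduced below; placing the Q-value of 50 in the cell above [0, 0] (taken here to be [1, 0], as reconstructed from the episode) is what makes the 0.9 × 50 term appear.

```python
Q = {((0, 0), "UP"): -0.5, ((1, 0), "RIGHT"): 50.0}
actions = ["LEFT", "RIGHT", "UP", "DOWN"]
# UP from (0, 0) costs -1 and lands in (1, 0), whose best action is worth 50
q_update(Q, (0, 0), "UP", -1, (1, 0), actions, alpha=0.5, gamma=0.9)
print(Q[((0, 0), "UP")])                      # 21.75
```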
Properties of Q-Learning
Quick learning speed.
Model-free: no need to explicitly compute probabilities or record trajectories.
Guaranteed to converge (under suitable conditions).
The learning parameters in Q-Learning
α: the learning step
balance between existing experience (weight: 1 − α) and the new observation (weight: α)
γ: the future discount
balance between the current reward (weight: 1) and the reward N steps ahead (weight: γ^N)
ε: indicating how 'bold' the agent is
balance between exploitation (take the greedy action, with probability 1 − ε) and exploration (take a random action, with probability ε)
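A minimal sketch of ε-greedy action selection under these parameters (names are mine):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """Explore with probability epsilon (random action),
    otherwise exploit (pick an action with the highest Q(s, a))."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```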
What we have seen so far
Decision making in sequential environments typical of AI practice
Optimisation techniques (VIA)
… under incomplete information (Active/Passive Learning by Bellman updates)
What next? Population dynamics and multi-agent reinforcement learning.