Agent-based Systems
Paolo Turrini
www.dcs.warwick.ac.uk/~pturrini · p.turrini@warwick.ac.uk
Multi-agent Learning
In this lecture
We are going to look at Markov Decision Processes with many agents:
The idea is that agents interact repeatedly and learn each other's strategies
Connection with game theory
Revisiting learning as evolutionary process
A great survey:
D. Bloembergen, K. Tuyls, D. Hennes & M. Kaisers
Evolutionary Dynamics of Multi-Agent Learning: A Survey
Journal of Artificial Intelligence Research, 2015.
MAS learning
We have seen how reinforcement learning informs decision-making via self-play. However, we have also seen that it works:
with one agent only;
with a static environment;
Having multiple interacting agents makes the problem considerably harder.
Now what?
MA-RL
We look at a general model of MDPs played by many agents. This means:
transitions take into account everyone’s choices
we look at how to generalise the one-agent models
we explore the connection with repeated games
Like in a game, we start with choices, one per agent.
A = A_1 × A_2 × · · · × A_n
Rather than actions, like in an MDP, now we have profiles of actions, like in a game.
Idea: agents play a game, look at what everyone did, and then play again.
General MDPs
Both the transitions and the rewards depend on what everyone does.
r : S × A × S → R associates a reward to each transition.
P : S × A × S → [0, 1] associates a probability to each transition.
This is very general and encodes games played on graphs. Transitions are defined as triples made of the initial state, the choice profile, and the final state.
Notice: Repeated games are just a special case, where we need only one state.
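To fix intuitions, here is a minimal Python sketch of such a model (a stochastic game); the class and field names are illustrative conventions, not part of the formal definition above.

```python
import itertools
import random

class MultiAgentMDP:
    """States S, one action set per agent, joint-action transitions and rewards."""

    def __init__(self, states, action_sets, P, r):
        self.states = states            # finite set S
        self.action_sets = action_sets  # [A_1, ..., A_n]
        self.P = P                      # P[(s, a)] = {s_next: probability}
        self.r = r                      # r[(s, a, s_next)] = (reward_1, ..., reward_n)

    def joint_actions(self):
        # A = A_1 x A_2 x ... x A_n: profiles of actions, one per agent
        return list(itertools.product(*self.action_sets))

    def step(self, s, a):
        # Both the next state and the rewards depend on the whole profile a.
        successors, probs = zip(*self.P[(s, a)].items())
        s_next = random.choices(successors, weights=probs)[0]
        return s_next, self.r[(s, a, s_next)]

# A repeated game is the special case with a single state that always loops to itself.
```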
Learning in Repeated Games (aka simple Multi-Agent MDPs)
Imagine a population playing a normal form game repeatedly.
1 Each individual is associated with one strategy for the infinitely repeated game (e.g., always cooperate in a PD).
2 At each time step individuals are paired with each other, randomly.
3 They all play the strategy they were associated with at the beginning.
4 Their payoff determines their reproductive success at the next round.
5 The population changes accordingly, with possibly some mutation.
6 The game is played again.
Mutants
Suppose a population playing strategy s is invaded by mutants, playing s′. Let ε ∈ [0, 1] be the proportion of such mutants.
The expected payoff of a non-mutant is
εu(s, s′) + (1 − ε)u(s, s)
This is the payoff I get against each type, weighted by the probability of meeting that type!
And for a mutant? It is the same expression with the roles swapped:
εu(s′, s′) + (1 − ε)u(s′, s)
Evolutionary Stable Strategies
Definition
A strategy s is evolutionarily stable if for all s′ ≠ s there exists a δ ∈ (0, 1] such that for all ε ∈ (0, 1] with ε < δ we have that:

εu(s, s′) + (1 − ε)u(s, s) > εu(s′, s′) + (1 − ε)u(s′, s)

What does this definition say?
Notice the "limit" idea…
ESS: the general definition
Let f(x, y) be the expected fitness (= payoff) of strategy x against strategy y. Then x is evolutionarily stable iff, for any mutant strategy y ≠ x, the following hold:

f(x, x) ≥ f(y, x)
if f(x, x) = f(y, x) then f(x, y) > f(y, y)

What do these conditions say?
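As a worked illustration, a minimal Python sketch that checks these two conditions for pure strategies in a symmetric game with payoff matrix A (so f(x, y) = A[x, y]); the function name and the example matrix are illustrative.

```python
import numpy as np

def is_ess(A, x):
    """Is pure strategy x evolutionarily stable in the symmetric game with payoff matrix A?"""
    n = A.shape[0]
    for y in range(n):
        if y == x:
            continue
        if A[x, x] < A[y, x]:                          # violates f(x, x) >= f(y, x)
            return False
        if A[x, x] == A[y, x] and A[x, y] <= A[y, y]:  # violates the second condition
            return False
    return True

# 'Defect' (index 1) in a normalised Prisoner's Dilemma is an ESS, 'cooperate' is not.
pd = np.array([[0.3, 0.0],
               [0.5, 0.1]])
print(is_ess(pd, 1), is_ess(pd, 0))   # True False
```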
ESS and NE
Observe:
Proposition
Each ESS is a NE.
‘always defect’ in a Prisoner’s Dilemma is evolutionarily stable.
With no invasion, ‘always cooperate’ is also stable, and everyone is much better off. But small invasions are problematic…
Replicator Dynamics
Now we analyse the dynamics in a systematic fashion. We introduce two steps:
Selection: each strategy reproduces, depending on the payoff obtained.
Mutation: there is a percentage of invaders at each round.
The replicator dynamics highlight the role of selection: they describe how systems consisting of different strategies change over time. They are usually formalised as a system of differential equations which describe
the payoffs for each interaction
the state of the population as the probability distribution of all different types.
Replicator Dynamics
We start with a population x, which represents the probability distribution (= the proportion) of different player types (=strategies), and a fitness function that is normalised, i.e., all payoffs are between 0 and 1.
We are interested in the population change (= the evolution steps). This is written as

ẋ_i = x_i [f_i(x) − f̄(x)]

where:
f_i is the normalised fitness (= payoff) of type i, and f̄ is the average fitness of the population as a whole.
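A small Euler-step sketch of this dynamic, under the assumption that the population plays a symmetric game against itself with payoff matrix A, so that f_i(x) = (Ax)_i:

```python
import numpy as np

def replicator_step(x, A, dt=0.01):
    """One Euler step of x_i_dot = x_i [f_i(x) - f_bar(x)], with f_i(x) = (Ax)_i."""
    f = A @ x                 # fitness of each type against the current population
    f_bar = x @ f             # average fitness of the population
    return x + dt * x * (f - f_bar)

# Normalised Prisoner's Dilemma: the share of defectors grows towards 1.
A = np.array([[0.3, 0.0],
              [0.5, 0.1]])
x = np.array([0.6, 0.4])      # 60% cooperators, 40% defectors
for _ in range(2000):
    x = replicator_step(x, A)
print(x)
```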
Replicator Dynamics
In a two-player game, each player is described by its own evolving population, and at every iteration one individual from each population is selected to play.
This means that the fitness of each type depends on the population distribution of the co-player, i.e., the two populations are co-evolving.
If populations x and y have payoff matrices A and B, we can write the expected fitness of player i of population x as
f_i(x) = Σ_j a_ij y_j = (Ay)_i

similarly, the average fitness would be

f̄(x) = Σ_i Σ_j x_i a_ij y_j = x⊤Ay
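For concreteness, a quick numeric sketch of these two quantities, with an illustrative payoff matrix A and population states x and y:

```python
import numpy as np

A = np.array([[0.3, 0.0],      # illustrative payoff matrix for population x
              [0.5, 0.1]])
x = np.array([0.6, 0.4])       # distribution over the types of population x
y = np.array([0.6, 0.4])       # distribution over the types of population y

f = A @ y                      # f_i(x) = (Ay)_i, fitness of each type of x
f_bar = x @ A @ y              # average fitness x^T A y
print(f, f_bar)                # roughly [0.18 0.34] and 0.244
```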
Games and fitness
Two populations are playing one of the following normalised symmetric games. Assume a distribution (0.6, 0.4) for the only two strategies: T and B (resp. L and R).
        L          R                L          R                L          R
T   (.3, .3)   (0, .5)         T   (.4, .4)   (.1, .3)      T   (0, .1)    (.1, 0)
B   (.5, 0)    (.1, .1)        B   (.3, .1)   (.3, .3)      B   (.1, 0)    (0, .1)

Question: What is the fitness of each?
Replicator Dynamics
The general form of population change is given by:

ẋ_i = x_i [(Ay)_i − x⊤Ay]        ẏ_i = y_i [(x⊤B)_i − x⊤By]

So each strategy's share changes in proportion to the difference between its fitness and the average fitness.
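A minimal Euler-integration sketch of these coupled dynamics; the payoff matrices below are illustrative (B = A⊤ corresponds to both populations playing the same symmetric game).

```python
import numpy as np

def two_population_step(x, y, A, B, dt=0.01):
    """One Euler step of the two-population replicator dynamics."""
    x_dot = x * (A @ y - x @ A @ y)    # x_i_dot = x_i [(Ay)_i - x^T A y]
    y_dot = y * (x @ B - x @ B @ y)    # y_i_dot = y_i [(x^T B)_i - x^T B y]
    return x + dt * x_dot, y + dt * y_dot

A = np.array([[0.3, 0.0],
              [0.5, 0.1]])
B = A.T
x, y = np.array([0.6, 0.4]), np.array([0.6, 0.4])
for _ in range(1000):
    x, y = two_population_step(x, y, A, B)
print(x, y)    # both populations drift towards the second strategy
```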
Games and fitness
Two populations are playing one of the following normalised symmetric games. Assume a distribution (0.6, 0.4) for the only two strategies: T and B (resp. L and R).
        L          R                L          R                L          R
T   (.3, .3)   (0, .5)         T   (.4, .4)   (.1, .3)      T   (0, .1)    (.1, 0)
B   (.5, 0)    (.1, .1)        B   (.3, .1)   (.3, .3)      B   (.1, 0)    (0, .1)

Question: What is the fitness of each after the first evolution step?
Replicator Dynamics
The replicator dynamics, plotted in the unit simplex, for the prisoner’s dilemma (left), the stag hunt (center), and matching pennies (right)
Connecting Game Theory and Reinforcement Learning
Multi-agent learning and evolutionary game theory share important connections, as they both deal with the strategic adaptation of boundedly rational agents in uncertain environments.
The intuitive connection was made formal with the proof that the continuous-time limit of Cross learning converges to the replicator dynamics (Börgers & Sarin, 1997).
T. Börgers and R. Sarin
Learning through reinforcement and replicator dynamics.
Journal of Economic Theory, 1997.
Cross Learning
Cross learning is one of the most basic forms of stateless reinforcement learning, which updates policy π, based on the reward r received after taking action j, as follows:
π(i) ← π(i) + r − π(i)r    if i = j
π(i) ← π(i) − π(i)r        otherwise

The update rule ensures a valid policy as long as the rewards are normalised, i.e., r ∈ [0, 1].

J. G. Cross
A stochastic learning model of economic behavior.
The Quarterly Journal of Economics, 1973.
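A minimal sketch of this update rule; the policy is a probability vector over actions and r is assumed to be normalised to [0, 1].

```python
def cross_learning_update(policy, j, r):
    """Update the policy after taking action j and receiving normalised reward r in [0, 1]."""
    return [p + r - p * r if i == j else p - p * r
            for i, p in enumerate(policy)]

# One illustrative step: action 0 was played and paid off r = 0.5.
policy = cross_learning_update([0.6, 0.4], 0, 0.5)
print(policy)   # approximately [0.8, 0.2] -- still a valid probability distribution
```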
Games and fitness
Two players are playing one of the following normalised symmetric games. Assume a mixed strategy (0.6, 0.4) for T and B (resp. L and R).
        L          R                L          R                L          R
T   (.3, .3)   (0, .5)         T   (.4, .4)   (.1, .3)      T   (0, .1)    (.1, 0)
B   (.5, 0)    (.1, .1)        B   (.3, .1)   (.3, .3)      B   (.1, 0)    (0, .1)

Question: What is the Cross learning policy of each player after the first step?
Cross Learning and policy change
We can estimate the expected change of policy E[Δπ(i)] (Börgers & Sarin, 1997). The probability π(i) of action i is affected both if i is selected and if another action j is selected.
Let E_i[r] be the expected reward after taking action i. Then…

E[Δπ(i)] = π(i)[E_i[r] − π(i)E_i[r]] + Σ_{j≠i} π(j)[−E_j[r]π(i)]
         = π(i)[E_i[r] − Σ_j π(j)E_j[r]]
Cross Learning and replicator dynamics
Assuming the learner takes infinitesimally small update steps, the continuous time limit of the previous equation can be rewritten as
π_{t+δ}(i) = π_t(i) + δ Δπ_t(i)

With δ → 0 this yields a continuous-time system, which can be expressed as the differential equation

π̇(i) = π(i)[E_i[r] − Σ_j π(j)E_j[r]]
Cross Learning and replicator dynamics
In a two-person normal form game we can simply write the policies as mixed strategies. Given the payoff matrices A and B and policies x and y for the two players respectively, this yields:

ẋ_i = x_i [(Ay)_i − x⊤Ay]        ẏ_i = y_i [(x⊤B)_i − x⊤By]
which are exactly the multi-population replicator dynamics.
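A small self-play sketch, assuming both players use Cross learning and draw their rewards directly from an illustrative normalised bimatrix game (A, B); in expectation the policy traces follow the replicator flow.

```python
import numpy as np

rng = np.random.default_rng(0)

A = np.array([[0.3, 0.0],          # illustrative normalised game, B = A^T
              [0.5, 0.1]])
B = A.T

x = np.array([0.6, 0.4])           # policy of player 1
y = np.array([0.6, 0.4])           # policy of player 2
for t in range(5000):
    i = rng.choice(2, p=x)                                         # sampled actions
    j = rng.choice(2, p=y)
    rx, ry = A[i, j], B[i, j]                                      # normalised rewards
    x = np.where(np.arange(2) == i, x + rx - x * rx, x - x * rx)   # Cross update, player 1
    y = np.where(np.arange(2) == j, y + ry - y * ry, y - y * ry)   # Cross update, player 2
print(x, y)   # in this game both policies drift towards the second action
```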
Learning and replicator dynamics
Policy traces of Cross learning, plotted on the unit simplex and overlaid on the replicator dynamics, for the prisoner’s dilemma (left), stag hunt (centre) and matching pennies (right).
Many RL algorithms
Cross learning is a simple method. A number of other RL algorithms have been developed for learning in normal form games:
Q-learning (Tuyls et al. 2003)
FAQ-learning (Kaisers and Tuyls 2010)
Regret Minimisation (Klos et al. 2010)
Lenient FAQ-learning (Panait et al. 2008)
Gradient ascent (Kaisers et al. 2012)
Will they find the "right" strategies? If so, how fast?
Learning in Normal Form Games
Repeated normal form games can be used as a testbed for multi-agent learning. They are stateless and agents choose from a finite set of actions at each time step. This greatly simplifies the analysis.
We focus on two-player two-action games, which simplifies the analysis even further. Here the learning dynamics can be fully represented by the pair (ẋ, ẏ), where x and y denote the probability with which each learner chooses their first action.
Cross Learning in 2×2 games
Let h = (1, −1), x = (x, 1 − x), y = (y, 1 − y). For Cross learning, (ẋ, ẏ) are updated as follows.

ẋ = x[(Ay)_1 − x⊤Ay]
  = x(1 − x)[y(a_11 − a_12 − a_21 + a_22) + a_12 − a_22]
  = x(1 − x)[y hAh⊤ + a_12 − a_22]

where a_12, a_22 are elements of the payoff matrix A.
To simplify the notation, we write δ = (Ay)_1 − (Ay)_2 = y hAh⊤ + a_12 − a_22 to denote the gradient, so that the Cross learning dynamics can be rewritten as
ẋ = x(1 − x)δ
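A direct transcription of this 2×2 form as Python functions; the matrices and the evaluation point are illustrative.

```python
import numpy as np

def gradient_delta(y, A):
    """delta = (Ay)_1 - (Ay)_2 = y hAh^T + a_12 - a_22, with y the opponent's first-action probability."""
    h = np.array([1.0, -1.0])
    return y * (h @ A @ h) + A[0, 1] - A[1, 1]

def cross_learning_2x2(x, y, A, B):
    """Return (x_dot, y_dot) of the Cross learning dynamics in a 2x2 game."""
    h = np.array([1.0, -1.0])
    delta_x = gradient_delta(y, A)
    delta_y = x * (h @ B @ h) + B[1, 0] - B[1, 1]   # (x^T B)_1 - (x^T B)_2
    return x * (1 - x) * delta_x, y * (1 - y) * delta_y

A = np.array([[0.3, 0.0], [0.5, 0.1]])
print(cross_learning_2x2(0.6, 0.6, A, A.T))   # both gradients are negative here
```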
FAQ dynamics
Frequency-adjusted Q-learning (FAQ) mimics simultaneous action updates by modulating the Q-learning update rule inversely proportional to xi .
In 2×2 games this simplifies to:

ẋ = α x(1 − x)[δ/τ − log(x/(1 − x))]
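The same equation, transcribed directly as a gradient function; α (learning rate) and τ (temperature) are the FAQ parameters, and the default values are illustrative.

```python
import numpy as np

def faq_gradient(x, delta, alpha=0.01, tau=0.1):
    """FAQ dynamics in a 2x2 game: x_dot = alpha x(1-x)[delta/tau - log(x/(1-x))]."""
    return alpha * x * (1 - x) * (delta / tau - np.log(x / (1 - x)))
```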
RM
The dynamics of RM are slightly more complex, as the denominator depends on which action gives the highest reward. This can be derived from the gradient: the first action will be maximal if δ < 0.
So with RM this simplifies to

ẋ = α x(1 − x)δ × (1 + αxδ)^(−1)           if δ < 0
ẋ = α x(1 − x)δ × (1 − α(1 − x)δ)^(−1)     otherwise
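And a transcription of the RM dynamics exactly as stated above, with the two branches selected by the sign of δ:

```python
def rm_gradient(x, delta, alpha=0.01):
    """Regret-minimisation dynamics in a 2x2 game, as given above."""
    base = alpha * x * (1 - x) * delta
    if delta < 0:
        return base / (1 + alpha * x * delta)
    return base / (1 - alpha * (1 - x) * delta)
```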
FAQ
Policy learning plotted on the unit simplex for PD, SH, MP
Whereas the dynamics of these different algorithms are similar in their convergence behaviour when only one equilibrium is present, as is the case in the prisoner’s dilemma and matching pennies, in the stag hunt differences can be observed.
LFAQ
Policy learning plotted on the unit simplex for PD, SH, MP
The notion of leniency, introduced to overcome convergence to suboptimal equilibria, drives the learning process towards the optimal outcome of the game. (L)FAQ does spiral inwards towards the single Nash equilibrium at (1/2, 1/2) in MP, which is not evolutionarily stable in the classical replicator dynamics model.
FAQ vs LFAQ
Policy learning plotted on the unit simplex for PD, SH, MP, with the two dynamics overlaid. This represents two players having different learning models.
Conclusion
Learning algorithms converge to the equilibrium when it is unique, but exhibit a number of differences when there are multiple equilibria.
We have seen the connection with repeated games and replicator dynamics.
What next? Cooperation.