Reinforcement Learning
Multi-agent Reinforcement Learning
Subramanian Ramamoorthy
School of Informatics
28 March, 2017
Agents often face Strategic Adversaries
Key issue we seek to model: Misaligned/conflicting interests
On Self-Interest
What does it mean to say that agents are self-interested?
• It does not necessarily mean that they want to cause harm to
each other, or even that they care only about themselves.
• Instead, it means that each agent has his own description of
which states of the world he likes—which can include good
things happening to other agents
—and that he acts in an attempt to bring about these states
of the world (better term: inter-dependent decision making)
A Simple Model of a Game
• Two decision makers
– Robot (has an action space: a)
– Adversary (has an action space: θ)
• Cost or payoff (to use the term common in game theory)
depends on actions of both decision makers:
R(a, θ) – denote as a matrix corresponding to the product space
This is the normal form – simultaneous choice over moves
Representing Payoffs
In a general, bi-matrix, normal form game:
The combined actions (a1, a2, …, an) form an
action profile a ∈ A
The game is specified by the action sets of the players and a payoff
function, a.k.a. utility, e.g. u2(a), for each player
Example: Rock-Paper-Scissors
• Famous children’s game
• Two players; each player simultaneously picks an action, which
is evaluated as follows:
– Rock beats Scissors
– Scissors beats Paper
– Paper beats Rock
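To make the representation concrete, here is a minimal sketch (in Python with numpy; the code and names are illustrative, not part of the original slides) of Rock-Paper-Scissors written as a bi-matrix game. Because the game is zero-sum, the column player's payoff matrix is just the negation of the row player's.

import numpy as np

# Payoff to the row player; rows and columns are indexed as Rock, Paper, Scissors.
# +1 = win, -1 = loss, 0 = draw.
actions = ["Rock", "Paper", "Scissors"]
u1 = np.array([[ 0, -1,  1],   # Rock:     draws Rock, loses to Paper, beats Scissors
               [ 1,  0, -1],   # Paper:    beats Rock, draws Paper, loses to Scissors
               [-1,  1,  0]])  # Scissors: loses to Rock, beats Paper, draws Scissors
u2 = -u1                       # zero-sum: u1(a) + u2(a) = 0 for every action profile

print(u1[1, 0], u2[1, 0])      # payoffs for the profile (Paper, Rock): 1 -1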
TCP Game
• Imagine there are only two internet users: you and me
• Internet traffic is governed by TCP protocol, one feature of
which is the backoff mechanism: when network is congested
then backoff and reduce transmission rates for a while
• Imagine that there are two implementations: C (correct, does
what is intended) and D (defective)
• If you both adopt C, packet delay is 1 ms; if you both adopt D,
packet delay is 3 ms
• If one adopts C and the other adopts D, then the D user gets no delay
and the C user suffers a 4 ms delay
TCP Game in Normal Form
Note that this is another way of writing a bi-matrix game:
the first number is the payoff of the row player and the second
number is the payoff of the column player
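As a concrete sketch (Python with numpy; payoff numbers are the negative delays from the description above, and the code is illustrative rather than part of the slides), here is the TCP game in bi-matrix form, with a quick check that D strictly dominates C for the row player; by symmetry the same holds for the column player, which is what makes it a Prisoner's Dilemma.

import numpy as np

# Payoffs are negative packet delays (in ms). Action 0 = C (correct), action 1 = D (defective).
u1 = np.array([[-1, -4],
               [ 0, -3]])   # row player's payoffs
u2 = u1.T                   # the game is symmetric

# D strictly dominates C for the row player: D is better whatever the column player does.
print(bool(np.all(u1[1, :] > u1[0, :])))   # True (0 > -1 and -3 > -4)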
Some Famous Matrix Examples
– What are they Capturing?
• Prisoner’s Dilemma: Cooperate or Defect (same as TCP game)
• Bach or Stravinsky (also known as the Battle of the Sexes)
• Matching Pennies: one player wants the two choices of Heads/Tails to match, the other wants them to differ
Different Categorization: Common Payoff
A common-payoff game is a game in which for all action
profiles a ∈ A1 ×· · ·× An and any pair of agents i, j, it is the
case that ui(a) = uj(a)
Pure coordination:
e.g., both driving on the same side of the road
Different Categorization: Constant Sum
A two-player normal-form game is constant-sum if there
exists a constant c such that for each strategy profile a ∈ A1 ×
A2 it is the case that u1(a) + u2(a) = c
Pure competition:
One player wants to coordinate
Other player does not!
Defining the "action space"
Strategies
Expected utility
Solution Concepts
Many ways of describing what one ought to do:
– Dominance
– Minimax
– Pareto Efficiency
– Nash Equilibria
– Correlated Equilibria
Remember that in the end game theory aspires to predict
behaviour given a specification of the game.
Normatively, a solution concept is a rationale for behaviour
Concept: Dominance
Concept: Iterated Dominance
Concept: Minimax
Minimax
Computing Minimax: Linear Programming
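As an illustrative sketch (Python with scipy; not from the slides), the row player's maximin strategy can be computed by maximizing a value v subject to the mixed strategy guaranteeing at least v against every column, which is a linear program. Applied to Rock-Paper-Scissors, it should return the uniform mixture with value 0.

import numpy as np
from scipy.optimize import linprog

# Row player's payoff matrix for Rock-Paper-Scissors (zero-sum).
R = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]])
m, n = R.shape

# Variables: x_1..x_m (mixed strategy) and v (guaranteed value).
# Maximize v  <=>  minimize -v, subject to (R^T x)_j >= v for all j, sum(x) = 1, x >= 0.
c = np.concatenate([np.zeros(m), [-1.0]])
A_ub = np.hstack([-R.T, np.ones((n, 1))])     # v - (R^T x)_j <= 0 for every column j
b_ub = np.zeros(n)
A_eq = np.concatenate([np.ones(m), [0.0]]).reshape(1, -1)
b_eq = [1.0]
bounds = [(0, None)] * m + [(None, None)]     # probabilities >= 0, v unrestricted

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print(res.x[:m], res.x[m])                    # approximately [1/3, 1/3, 1/3] and value 0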
Pick-a-Hand
• There are two players: chooser (player I) & hider (player II)
• The hider has two gold coins in his back pocket. At the
beginning of a turn, he puts his hands behind his back and
either takes out one coin and holds it in his left hand, or takes
out both and holds them in his right hand.
• The chooser picks a hand and wins any coins the hider has
hidden there.
• She may get nothing (if the hand is empty), or she might win
one coin, or two.
Pick-a-Hand, Normal Form:
• Hider could minimize losses
by placing 1 coin in his left
hand; the most he can lose is 1
• If chooser can figure out
hider’s plan, he will surely
lose that 1
• If hider thinks chooser
might strategise, he has
incentive to play R2, …
• All hider can guarantee is
max loss of 1 coin
• Similarly, chooser might try
to maximise gain, picking R
• However, if hider
strategizes, chooser ends up
with zero
• So, chooser can’t actually
guarantee winning anything
Pick-a-Hand, with Mixed Strategies
• Suppose that chooser
decides to choose R with
probability p and L with
probability 1 − p
• If hider were to play pure
strategy R2 his expected
loss would be 2p
• If he were to play L1,
expected loss is 1 − p
• Chooser maximizes her
gains by choosing p so as to
maximize min{2p, 1 − p}
• Thus, by choosing R with
probability 1/3 and L with
probability 2/3, chooser
assures expected payoff of
2/3, regardless of whether
hider knows her strategy
[Figure: chooser's expected payoff as a function of p]
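The optimal p is the indifference (crossing) point of the two expected payoffs, which is where the 1/3 quoted above comes from:

\max_{p \in [0,1]} \min\{\, 2p,\; 1-p \,\} \text{ is attained where } 2p = 1-p \;\Rightarrow\; p = \tfrac{1}{3}, \text{ giving value } \tfrac{2}{3}.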
Mixed Strategy for the Hider
• Hider will play R2 with some
probability q and L1 with
probability 1−q
• The payoff for chooser is 2q
if she picks R, and 1 − q if
she picks L
• If she knows q, she will
choose the strategy
corresponding to the
maximum of the two
values.
• If hider knows chooser’s
plan, he will choose q = 1/3
to minimize this maximum,
guaranteeing that his
expected payout is 2/3
(because 2/3 = 2q = 1 − q)
• Chooser can assure
expected gain of 2/3, hider
can assure an expected loss
of no more than 2/3,
regardless of what either
knows of the other’s
strategy.
Safety Value as Incentive
• Clearly, without some extra incentive, it is not in hider's
interest to play Pick-a-hand because he can only lose by
playing.
• Thus, we can imagine that chooser pays hider to entice him
into joining the game.
• 2/3 is the maximum amount that chooser should pay him in
order to gain his participation.
Equilibrium as a Saddle Point
Concept: Nash Equilibrium
Nash Equilibrium
Nash Equilibrium – Example
Nash Equilibrium – Example
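As a small illustration (Python; hypothetical code using the TCP/Prisoner's-Dilemma payoffs from earlier), a pure action profile is a Nash equilibrium if neither player can gain by deviating unilaterally; the check below confirms that (D, D) is the unique pure-strategy equilibrium of that game.

import numpy as np

u1 = np.array([[-1, -4], [0, -3]])   # row player's payoffs (action 0 = C, action 1 = D)
u2 = u1.T                            # symmetric game

def is_pure_nash(i, j):
    # No profitable unilateral deviation for either player.
    return u1[i, j] >= u1[:, j].max() and u2[i, j] >= u2[i, :].max()

print([(i, j) for i in range(2) for j in range(2) if is_pure_nash(i, j)])   # [(1, 1)]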
Many well-known techniques from reinforcement learning, e.g., value/policy
iteration, can still be applied to solving these games
Stochastic Games (SG)
Defined by the tuple
(n, S, A1,…,n, T, R1,…,n)
– n: number of agents
– S: set of states
– Ai: set of actions available to agent i; joint action space A = A1 × A2 × … × An
– T: S × A × S → [0, 1], the transition dynamics
– Ri: S × A → ℝ, the reward function of the i-th agent; R = R1 × R2 × … × Rn

We wish to learn a stationary, possibly stochastic, policy:
ρi : S → Pr(Ai)
Objective continues to be maximization of expected future reward
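A minimal sketch (Python; the class and field names are my own, not the slides') of how such a tuple can be stored for a small game: T maps a (state, joint action) pair to a distribution over next states, and R[i] maps the same pair to agent i's reward. The toy instance is a single-state repeated matching-pennies game.

from dataclasses import dataclass

@dataclass
class StochasticGame:
    n: int        # number of agents
    states: list  # S
    actions: list # actions[i] = action set A_i of agent i
    T: dict       # T[(s, a)] = {s_next: probability}, with a a joint-action tuple
    R: list       # R[i][(s, a)] = reward to agent i

mp = StochasticGame(
    n=2,
    states=["s0"],
    actions=[["H", "T"], ["H", "T"]],
    T={("s0", (a1, a2)): {"s0": 1.0} for a1 in "HT" for a2 in "HT"},
    R=[{("s0", (a1, a2)): (1 if a1 == a2 else -1) for a1 in "HT" for a2 in "HT"},
       {("s0", (a1, a2)): (-1 if a1 == a2 else 1) for a1 in "HT" for a2 in "HT"}],
)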
A First Algorithm for SG Solution [Shapley]
This classic algorithm (from 1953) is akin to Value Iteration for MDPs.
- Max operator has been replaced by “Value”, which refers to equilibrium.
- i.e., the matrix game is being solved at each state (step 2b)
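A compact sketch of the idea (my own Python for a two-player zero-sum SG given as arrays; it is not the slide's exact pseudocode): the backup is ordinary value iteration except that the max over actions is replaced by the minimax value of the stage-game matrix G(s) = R(s) + γ Σ_s' T(s, ·, s') V(s').

import numpy as np
from scipy.optimize import linprog

def matrix_game_value(G):
    # Minimax value of a zero-sum matrix game (row player maximizes).
    m, n = G.shape
    c = np.concatenate([np.zeros(m), [-1.0]])
    res = linprog(c,
                  A_ub=np.hstack([-G.T, np.ones((n, 1))]), b_ub=np.zeros(n),
                  A_eq=np.concatenate([np.ones(m), [0.0]]).reshape(1, -1), b_eq=[1.0],
                  bounds=[(0, None)] * m + [(None, None)])
    return res.x[m]

def shapley_value_iteration(R, T, gamma=0.9, iters=100):
    # R[s]: (m x n) reward matrix to player 1 at state s (zero-sum game).
    # T[s]: (m x n x |S|) transition probabilities for each joint action.
    num_states = len(R)
    V = np.zeros(num_states)
    for _ in range(iters):
        for s in range(num_states):
            G = R[s] + gamma * T[s] @ V     # stage game at state s
            V[s] = matrix_game_value(G)     # "Value" operator replaces max
    return V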
The Policy Iteration Algorithm for SGs
• This algorithm is akin to Policy Iteration for MDPs.
• Each player selects equilibrium policy according to current value
function (using the same G matrix as in Shapley’s algorithm)
• Value function is then updated based on rewards as per equil. policy
Q-Learning for SGs
• Q-learning version of Shapley’s algorithm (maintaining value over joint
actions)
• Algorithm converges to the stochastic game's equilibrium, even if the other
player does not play its equilibrium, provided everyone executes all actions infinitely often.
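A sketch of one update of such a learner (my own Python; this is essentially Littman-style minimax-Q, the standard instance of the idea): Q-values are kept over joint actions, and the bootstrap target uses the minimax value of the Q-matrix at the next state instead of a max. Exploration (e.g., ε-greedy over the learner's own actions) is assumed so that all actions are executed infinitely often.

import numpy as np
from scipy.optimize import linprog

def matrix_game_value(G):
    # Minimax value of a zero-sum matrix game (row player maximizes).
    m, n = G.shape
    c = np.concatenate([np.zeros(m), [-1.0]])
    res = linprog(c,
                  A_ub=np.hstack([-G.T, np.ones((n, 1))]), b_ub=np.zeros(n),
                  A_eq=np.concatenate([np.ones(m), [0.0]]).reshape(1, -1), b_eq=[1.0],
                  bounds=[(0, None)] * m + [(None, None)])
    return res.x[m]

def minimax_q_update(Q, s, a, o, r, s_next, alpha=0.1, gamma=0.9):
    # Q[s] is an (m x n) table over joint actions: our action a, opponent action o.
    target = r + gamma * matrix_game_value(Q[s_next])
    Q[s][a, o] += alpha * (target - Q[s][a, o])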
What do we do if we have no Model?
Fictitious Play [Robinson ‘51]
• Assumes opponents play stationary strategies
• Maintains information about average value of each action
• Finds equilibria in zero-sum and some general sum games
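A sketch of the core loop for a two-player matrix game (my own Python, not from the slides): each player keeps empirical counts of the opponent's past actions and best-responds to that empirical mixture, which is exactly the stationarity assumption above. In matching pennies the empirical frequencies approach the (1/2, 1/2) equilibrium.

import numpy as np

def fictitious_play(u1, u2, iters=5000):
    # u1, u2: (m x n) payoff matrices for the row and column player.
    m, n = u1.shape
    counts1, counts2 = np.ones(m), np.ones(n)   # counts of past plays (smoothed start)
    for _ in range(iters):
        # Best respond to the opponent's empirical action frequencies.
        a1 = np.argmax(u1 @ (counts2 / counts2.sum()))
        a2 = np.argmax((counts1 / counts1.sum()) @ u2)
        counts1[a1] += 1
        counts2[a2] += 1
    return counts1 / counts1.sum(), counts2 / counts2.sum()

u1 = np.array([[1, -1], [-1, 1]])               # matching pennies (zero-sum)
print(fictitious_play(u1, -u1))                 # approximately ([0.5, 0.5], [0.5, 0.5])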
Summary: General Tactic for SGs
Matrix Game Solver  +  Temporal Differencing  →  Stochastic Game Solver
Summary: Many Approaches
Optional Reference/Acknowledgements
Learning algorithms for stochastic games are from the paper:
M. Bowling, M. Veloso, An analysis of stochastic game theory for
multiagent reinforcement learning, CMU-CS-00-165, 2000.
Several slides are adapted from the following sources:
• Tutorial at IJCAI 2003 by Prof Peter Stone, University of Texas
• Y. Peres, Game Theory, Alive (Lecture Notes)