Reinforcement Learning

Multi-agent Reinforcement Learning

Subramanian Ramamoorthy
School of Informatics

28 March, 2017

Agents often face Strategic Adversaries


Key issue we seek to model: misaligned/conflicting interests

On Self-Interest

What does it mean to say that agents are self-interested?

•  It does not necessarily mean that they want to cause harm to
each other, or even that they care only about themselves.

•  Instead, it means that each agent has his own description of
which states of the world he likes—which can include good
things happening to other agents

—and that he acts in an attempt to bring about these states
of the world (better term: inter-dependent decision making)


A Simple Model of a Game

•  Two decision makers
–  Robot (has an action space: a)
–  Adversary (has an action space: θ)

•  Cost or payoff (to use the term common in game theory)
depends on the actions of both decision makers:
R(a, θ) – denoted as a matrix over the product space of actions


This is the normal form – simultaneous choice over moves


Representing Payoffs

In a general, bi-matrix, normal-form game, each player i has an
action set Ai and a payoff function, a.k.a. utility, ui(a).

The combined actions (a1, a2, …, an) form an
action profile a ∈ A = A1 × · · · × An
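As a concrete illustration (not from the slides), a normal-form game can be stored simply as the players' action sets plus a map from action profiles to payoff tuples. A minimal sketch; the action names and payoff numbers below are made up.

```python
from itertools import product

# A minimal sketch (illustrative names and payoffs): a two-player
# normal-form game as action sets plus a payoff map over profiles.
action_sets = [("a", "b"), ("x", "y", "z")]        # A1, A2

payoffs = {                                        # u(a) = (u1(a), u2(a))
    ("a", "x"): (2, 0), ("a", "y"): (0, 1), ("a", "z"): (1, 1),
    ("b", "x"): (1, 2), ("b", "y"): (3, 0), ("b", "z"): (0, 3),
}

for profile in product(*action_sets):              # every a in A1 x A2
    print(profile, "->", payoffs[profile])
```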

Example: Rock-Paper-Scissors

•  Famous children’s game
•  Two players; each player simultaneously picks an action, which is
evaluated as follows:
–  Rock beats Scissors
–  Scissors beats Paper
–  Paper beats Rock
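A hedged sketch of how these rules translate into a zero-sum payoff bi-matrix; the +1/−1/0 payoffs are the usual convention rather than something stated on the slide.

```python
import numpy as np

# Rock-Paper-Scissors as a zero-sum bi-matrix game (usual +1/-1/0 payoffs).
actions = ["Rock", "Paper", "Scissors"]

# Row player's payoff; the column player's payoff is the negative.
U1 = np.array([[ 0, -1,  1],    # Rock     vs R, P, S
               [ 1,  0, -1],    # Paper    vs R, P, S
               [-1,  1,  0]])   # Scissors vs R, P, S
U2 = -U1

print(U1[actions.index("Rock"), actions.index("Scissors")])  # Rock beats Scissors -> 1
```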


TCP Game

•  Imagine there are only two internet users: you and me
•  Internet traffic is governed by the TCP protocol, one feature of
which is the backoff mechanism: when the network is congested,
back off and reduce transmission rates for a while

•  Imagine that there are two implementations: C (correct, does
what is intended) and D (defective)

•  If you both adopt C, packet delay is 1 ms; if you both adopt D,
packet delay is 3 ms

•  If one adopts C but the other adopts D, then the D user gets no
delay and the C user suffers a 4 ms delay


TCP Game in Normal Form


Note that this is another way of writing a bi-matrix game: the first
number is the payoff of the row player and the second is the payoff
of the column player
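The slide's own matrix is not reproduced here; the following is a reconstruction from the delays above, under the (assumed) convention that each payoff is the negative of the packet delay.

```python
# TCP game payoff bi-matrix, reconstructed from the delays described above
# (assumed sign convention: payoff = minus the packet delay in ms).
tcp_game = {
    ("C", "C"): (-1, -1),   # both correct: 1 ms delay each
    ("C", "D"): (-4,  0),   # row plays C, column defects: 4 ms vs 0 ms
    ("D", "C"): ( 0, -4),
    ("D", "D"): (-3, -3),   # both defective: 3 ms delay each
}
```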

Some Famous Matrix Examples
– What are they Capturing?

•  Prisoner’s Dilemma: Cooperate or Defect (same as TCP game)

•  Bach or Stravinsky (von Neumann called it Battle of the Sexes)

•  Matching Pennies: Try to get the same outcome, Heads/Tails


Different Categorization: Common Payoff

A common-payoff game is a game in which, for all action
profiles a ∈ A1 × · · · × An and any pair of agents i, j, it is the
case that ui(a) = uj(a)


Pure coordination:
e.g., driving on a side of the road

Different Categorization: Constant Sum

A two-player normal-form game is constant-sum if there
exists a constant c such that for each strategy profile a ∈ A1 ×
A2 it is the case that u1(a) + u2(a) = c


Pure competition:
one player wants to coordinate,
the other player does not!
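Matching Pennies fits this definition with c = 0. A small check, assuming the usual ±1 payoffs (the matcher wins when the coins agree).

```python
import numpy as np

# Matching Pennies as a constant-sum (here zero-sum, c = 0) game.
U1 = np.array([[ 1, -1],
               [-1,  1]])        # matcher's payoff over (Heads, Tails)
U2 = -U1                         # the other player wants to mismatch

assert np.all(U1 + U2 == 0)      # u1(a) + u2(a) = c for every profile
```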

Defining the “action space”


Strategies


Expected utility
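The slide's formulas are not reproduced here. As a sketch: the expected utility of a mixed-strategy profile weights each action profile's payoff by its probability, which for two players is a bilinear form in the two mixed strategies. The matrices and strategies below are illustrative.

```python
import numpy as np

# Expected utility of a mixed-strategy profile in a two-player game:
# u1(s1, s2) = sum over (a1, a2) of s1(a1) * s2(a2) * u1(a1, a2).
U1 = np.array([[ 0, -1,  1],
               [ 1,  0, -1],
               [-1,  1,  0]])            # Rock-Paper-Scissors, row player

s1 = np.array([0.5, 0.25, 0.25])         # a mixed strategy for player 1
s2 = np.array([1/3, 1/3, 1/3])           # uniform play by player 2

expected_u1 = s1 @ U1 @ s2               # bilinear form over all profiles
print(expected_u1)                       # 0.0 against a uniform opponent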

Solution Concepts

Many ways of describing what one ought to do:
–  Dominance
–  Minimax
–  Pareto Efficiency
–  Nash Equilibria
–  Correlated Equilibria

Remember that, in the end, game theory aspires to predict
behaviour given a specification of the game.

Normatively, a solution concept is a rationale for behaviour


Concept: Dominance


Concept: Iterated Dominance


Concept: Minimax


Minimax


Computing Minimax: Linear Programming

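A minimal sketch of the standard LP formulation for the row player's maximin strategy in a zero-sum matrix game, assuming SciPy is available; the function name solve_zero_sum is illustrative, not from the slides.

```python
import numpy as np
from scipy.optimize import linprog

def solve_zero_sum(U):
    """Maximin mixed strategy and value for the row player of a zero-sum
    game with row-player payoff matrix U (standard LP formulation)."""
    m, n = U.shape
    # Variables: x_1..x_m (strategy) and v (value). Maximise v, i.e.
    # minimise -v, subject to  U^T x >= v,  sum(x) = 1,  x >= 0.
    c = np.r_[np.zeros(m), -1.0]
    A_ub = np.hstack([-U.T, np.ones((n, 1))])      # v - (U^T x)_j <= 0
    b_ub = np.zeros(n)
    A_eq = np.r_[np.ones(m), 0.0].reshape(1, -1)   # probabilities sum to 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[m]

# Rock-Paper-Scissors: the maximin strategy is uniform and the value is 0.
rps = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)
strategy, value = solve_zero_sum(rps)
print(strategy, value)                             # ~[1/3, 1/3, 1/3], ~0
```

The column player's minimax strategy comes from the same LP applied to the negated transpose, -U.T.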

Pick-a-Hand
•  There are two players: chooser (player I) & hider (player II)

•  The hider has two gold coins in his back pocket. At the
beginning of a turn, he puts his hands behind his back and
either takes out one coin and holds it in his left hand, or takes
out both and holds them in his right hand.

•  The chooser picks a hand and wins any coins the hider has
hidden there.

•  She may get nothing (if the hand is empty), or she might win
one coin, or two.


Pick-a-Hand, Normal Form:
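The payoff matrix itself is not shown on this slide; the following is a reconstruction from the description, with rows as the chooser's actions and entries as coins won by the chooser (and lost by the hider).

```python
import numpy as np

# Pick-a-Hand in normal form, reconstructed from the description above
# (layout is an assumption): rows are chooser's actions (L, R), columns
# are hider's actions (L1 = one coin left, R2 = two coins right).
payoff_to_chooser = np.array([[1, 0],    # chooser picks Left
                              [0, 2]])   # chooser picks Right
```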

•  Hider could minimize his losses
by placing 1 coin in his left
hand; the most he can lose is 1

•  If chooser can figure out
hider’s plan, he will surely
lose that 1

•  If hider thinks chooser
might strategise, he has an
incentive to play R2, …

•  All hider can guarantee is
max loss of 1 coin


•  Similarly, chooser might try
to maximise gain, picking R

•  However, if hider
strategizes, chooser ends up
with zero

•  So, chooser can’t actually
guarantee winning anything

Pick-a-Hand, with Mixed Strategies

•  Suppose that chooser
decides to choose R with
probability p and L with
probability 1 − p

•  If hider were to play pure
strategy R2 his expected
loss would be 2p

•  If he were to play L1,
expected loss is 1 − p

•  Chooser maximizes her
gains by choosing p so as to
maximize min{2p, 1 − p}

•  Thus, by choosing R with
probability 1/3 and L with
probability 2/3, chooser
assures expected payoff of
2/3, regardless of whether
hider knows her strategy


[Figure: chooser's guaranteed expected payoff min{2p, 1 − p} as a function of p]
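A quick numerical check of this argument (a sketch, not from the slides): the chooser's guaranteed payoff min{2p, 1 − p} peaks at p = 1/3 with value 2/3.

```python
import numpy as np

# Brute-force the chooser's guaranteed expected payoff over p in [0, 1].
p = np.linspace(0.0, 1.0, 10001)
guaranteed = np.minimum(2 * p, 1 - p)
best = np.argmax(guaranteed)
print(p[best], guaranteed[best])     # ~0.333, ~0.667
```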

Mixed Strategy for the Hider

•  Hider will play R2 with some
probability q and L1 with
probability 1−q

•  The payoff for chooser is 2q
if she picks R, and 1 − q if
she picks L

•  If she knows q, she will
choose the strategy
corresponding to the
maximum of the two
values.

•  If hider knows chooser’s
plan, he will choose q = 1/3
to minimize this maximum,
guaranteeing that his
expected payout is 2/3
(because 2/3 = 2q = 1 − q)

•  Chooser can assure
expected gain of 2/3, hider
can assure an expected loss
of no more than 2/3,
regardless of what either
knows of the other’s
strategy.
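The mirror-image check for the hider: his worst-case expected loss max{2q, 1 − q} is minimised at q = 1/3, again giving the value 2/3.

```python
import numpy as np

# Brute-force the hider's worst-case expected loss over q in [0, 1].
q = np.linspace(0.0, 1.0, 10001)
worst_case_loss = np.maximum(2 * q, 1 - q)
best = np.argmin(worst_case_loss)
print(q[best], worst_case_loss[best])   # ~0.333, ~0.667
```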


Safety Value as Incentive

•  Clearly, without some extra incentive, it is not in hider's
interest to play Pick-a-Hand because he can only lose by
playing.

•  Thus, we can imagine that chooser pays hider to entice him
into joining the game.

•  2/3 is the maximum amount that chooser should pay him in
order to gain his participation.


Equilibrium as a Saddle Point

The pair of mixed strategies (p, q) = (1/3, 1/3) is a saddle point: the
chooser's maximin value and the hider's minimax loss coincide at 2/3, so
neither player can do better by deviating unilaterally.


Concept: Nash Equilibrium


Nash Equilibrium


Nash Equilibrium – Example


Nash Equilibrium – Example

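As a worked illustration of the defining test (an action profile is a pure-strategy Nash equilibrium iff no player gains by a unilateral deviation), here is a best-response check on the TCP game from earlier; the helper name is made up for the sketch.

```python
from itertools import product

# Pure-strategy Nash equilibrium check on the TCP game (negative delays).
A = ["C", "D"]
u = {("C", "C"): (-1, -1), ("C", "D"): (-4, 0),
     ("D", "C"): (0, -4), ("D", "D"): (-3, -3)}

def is_pure_nash(a1, a2):
    best1 = max(u[(d, a2)][0] for d in A)    # player 1's best response to a2
    best2 = max(u[(a1, d)][1] for d in A)    # player 2's best response to a1
    return u[(a1, a2)][0] >= best1 and u[(a1, a2)][1] >= best2

print([a for a in product(A, A) if is_pure_nash(*a)])   # [('D', 'D')]
```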


Many well-known techniques from reinforcement learning, e.g., value/policy
iteration, can still be applied to solving these games

Stochastic Games (SG)

Defined by the tuple (n, S, A1,…,An, T, R1,…,Rn), where
–  n: number of agents
–  S: set of states
–  Ai: set of actions available to agent i; the joint action space is
A = A1 × A2 × … × An
–  T : S × A × S → [0, 1]: the transition dynamics
–  Ri : S × A → ℝ: the reward function of the ith agent, with
R = R1 × R2 × … × Rn

We wish to learn a stationary, possibly stochastic, policy ρi : S → Pr(Ai)

Objective continues to be maximization of expected future reward

A First Algorithm for SG Solution [Shapley]


This classic algorithm (from 1953) is akin to Value Iteration for MDPs.
-  Max operator has been replaced by “Value”, which refers to equilibrium.
-  i.e., the matrix game is being solved at each state (step 2b)
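To make the backup concrete, here is a compact sketch of the update described in these notes: build the matrix game at each state from immediate rewards plus discounted continuation values, then replace value iteration's max by the game's equilibrium value. It assumes the solve_zero_sum helper sketched on the minimax LP slide is in scope; the array shapes and names are illustrative.

```python
import numpy as np

def shapley_value_iteration(R, T, gamma=0.9, iters=200):
    """Value iteration for a two-player zero-sum stochastic game (sketch).

    R[s] : |A1| x |A2| reward matrix for the row player in state s
    T[s] : |A1| x |A2| x |S| array with Pr(s' | s, a1, a2)
    Assumes solve_zero_sum(G) -> (strategy, value) from the earlier LP sketch.
    """
    V = np.zeros(len(R))
    for _ in range(iters):
        for s in range(len(R)):
            G = R[s] + gamma * T[s] @ V      # matrix game at state s (step 2a)
            _, V[s] = solve_zero_sum(G)      # "Value" = equilibrium value (step 2b)
    return V
```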

The Policy Iteration Algorithm for SGs


•  This algorithm is akin to Policy Iteration for MDPs.
•  Each player selects an equilibrium policy according to the current value
function (using the same G matrix as in Shapley’s algorithm)
•  The value function is then updated based on the rewards obtained under
the equilibrium policy

Q-Learning for SGs


•  Q-learning version of Shapley’s algorithm (maintaining a value over joint
actions)

•  The algorithm converges to the stochastic game’s equilibrium even if the
other player does not play its equilibrium strategy, provided everyone
executes all actions infinitely often.
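A sketch of the corresponding one-step update (a minimax-Q style rule): the TD target uses the equilibrium value of the matrix game Q(s′, ·, ·) in place of a max. Again, solve_zero_sum is the assumed helper from the minimax LP sketch, and all names and shapes are illustrative.

```python
def minimax_q_update(Q, s, a, o, r, s_next, alpha=0.1, gamma=0.9):
    """One update of a minimax-Q style learner (sketch).

    Q : array of shape (n_states, |A1|, |A2|), a value over joint actions.
    (s, a, o, r, s_next) : own action a, opponent action o, reward r.
    """
    _, v_next = solve_zero_sum(Q[s_next])        # equilibrium value at s'
    Q[s, a, o] += alpha * (r + gamma * v_next - Q[s, a, o])
    return Q
```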

What do we do if we have no Model?
Fictitious Play [Robinson ‘51]


•  Assumes opponents play stationary strategies
•  Maintains information about average value of each action
•  Finds equilibria in zero-sum and some general sum games
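A minimal sketch of fictitious play in a two-player matrix game, using Rock-Paper-Scissors as the test case; the initial pseudo-counts of one are an implementation choice, not something the slide specifies.

```python
import numpy as np

def fictitious_play(U1, U2, rounds=20000):
    """Each player best-responds to the empirical frequency of the
    opponent's past actions; returns the empirical action frequencies."""
    m, n = U1.shape
    counts1 = np.ones(m)                 # how often player 1 played each action
    counts2 = np.ones(n)                 # how often player 2 played each action
    for _ in range(rounds):
        a1 = np.argmax(U1 @ (counts2 / counts2.sum()))   # best response to P2's history
        a2 = np.argmax((counts1 / counts1.sum()) @ U2)   # best response to P1's history
        counts1[a1] += 1
        counts2[a2] += 1
    return counts1 / counts1.sum(), counts2 / counts2.sum()

# In zero-sum games the empirical frequencies converge to equilibrium:
rps = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)
print(fictitious_play(rps, -rps))        # both approach (1/3, 1/3, 1/3)
```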

Summary: General Tactic for SGs


Matrix Game Solver + Temporal Differencing → Stochastic Game Solver

Summary: Many Approaches


Optional Reference/Acknowledgements

Learning algorithms for stochastic games are from the paper:
M. Bowling and M. Veloso, An analysis of stochastic game theory for
multiagent reinforcement learning, Technical Report CMU-CS-00-165,
Carnegie Mellon University, 2000.

Several slides are adapted from the following sources:
•  Tutorial at IJCAI 2003 by Prof. Peter Stone, University of Texas
•  Y. Peres, Game Theory, Alive (lecture notes)
