Agent-based Systems
Paolo Turrini
 www.dcs.warwick.ac.uk/~pturrini p.turrini@warwick.ac.uk
Learning in Games
(Artificial) Poker Stars
Plan for Today
We have seen extensive games of imperfect information, where players are typically uncertain about the current state of the game being played.
We are going to look at the relevance of this to Artificial Intelligence:
Learning through self-play: the key to many game-playing engines (AlphaGo);
Regret: why it’s important to minimise it.
Poker and Artificial Intelligence
“Robots are unlikely to be welcome in casinos any time soon, especially now that a poker-playing computer has learned to play a virtually perfect game — including bluffing.”
(Philip Ball: Game theorists crack poker, Nature News, 2015.)
Poker and Artificial Intelligence
A difficult game:
Chance, counting odds;
Bluffing, aggressive play.
Still… a game:
An extensive game with imperfect information;
Rational and irrational strategies.
What is the right solution concept?
T.W. Sandholm.
Solving Imperfect-Information Games.
Science, 347(6218):122–123, 2015.
An emotional game
Everyone who has played poker knows how critical “emotions” are in the game;
Well… it turns out that there is one emotion that computers use better than anyone else;
This emotion is regret: how badly I’ve played with respect to how I could have played.
What is regret really?
Regret in games
The Rock-Paper-Scissors payoff matrix (row player’s payoff, column player’s payoff):

            Rock       Paper      Scissors
Rock       (0, 0)     (−1, 1)    (1, −1)
Paper      (1, −1)    (0, 0)     (−1, 1)
Scissors   (−1, 1)    (1, −1)    (0, 0)
If you play paper and I play paper, my regret for not playing scissors is 1 (the payoff difference!).
If you play paper and I play rock, my regret for not playing scissors is 2!
Regret for not playing … in (…, …) = 2;
regret for not playing … in (…, …) = 0;
regret for not playing … in (…, …) = −1.
Let ⟨N, A, u⟩ be a normal-form game.
At action profile $a$, the regret of player $i$ for not playing $a_i'$ is: $u_i(a_i', a_{-i}) - u_i(a)$.
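To make the definition concrete, here is a minimal Python sketch (the encoding and function names are my own, not from the slides) that reproduces the two regret values computed above for Rock-Paper-Scissors.

```python
# Minimal sketch (my own encoding, not from the slides): regret in Rock-Paper-Scissors.

BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def payoff(mine: str, yours: str) -> int:
    """Row player's payoff in the zero-sum matrix above."""
    if mine == yours:
        return 0
    return 1 if BEATS[mine] == yours else -1

def regret(played: str, opponent: str, alternative: str) -> int:
    """Regret for not having played `alternative` at profile (played, opponent):
    u_i(alternative, a_-i) - u_i(a)."""
    return payoff(alternative, opponent) - payoff(played, opponent)

# The slide's examples: both play paper -> regret for scissors is 1;
# I play rock against paper -> regret for scissors is 2.
print(regret("paper", "paper", "scissors"))  # 1
print(regret("rock", "paper", "scissors"))   # 2
```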
Avoiding feeling bad (without saying it)
How can we use regret to inform future play?
The idea is that I want to take actions that I wish I had played in the past;
Obviously, if my opponent knew exactly what I was doing, it would not be good;
My strategy needs to be good and not exploitable.
How to minimise regret without being predictable?
Regret in games
We do this by regret matching: choosing actions at random, with a distribution proportional to the positive regrets.
Here the regrets measure the relative losses one has experienced for not having selected actions in the past.
Games started with (…, …) followed by (…, …):
regret for not playing … in (…, …) = 2; regret for not playing … in (…, …) = 1.
In the next hand, we choose … with probability 2/3, … with probability 1/3, and … with probability 0.
Notice: positive regrets divided by their sum.
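A minimal sketch of the regret-matching step just described (the function name is my own, and the action labels are placeholders, since the slide’s icons are not recoverable): normalise the positive cumulative regrets to obtain the next mixed strategy.

```python
# Sketch of regret matching (placeholder action labels, not from the slides).

def regret_matching(cumulative_regrets: dict[str, float]) -> dict[str, float]:
    """Next mixed strategy: positive regrets divided by their sum."""
    positive = {a: max(r, 0.0) for a, r in cumulative_regrets.items()}
    total = sum(positive.values())
    if total == 0:
        # No positive regret yet: fall back to the uniform strategy.
        return {a: 1.0 / len(positive) for a in positive}
    return {a: r / total for a, r in positive.items()}

# Cumulative regrets 2, 1 and 0 from the example give probabilities 2/3, 1/3 and 0.
print(regret_matching({"a1": 2.0, "a2": 1.0, "a3": 0.0}))
```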
Cumulating regrets
Suppose in the next hand I do play (2/3; 1/3);
… and it turns out to be …;
Suppose my opponent plays ….
My regret for this hand is 1 for not playing …, 2 for not playing …, and obviously 0 for not playing ….
We add the new regrets to the old ones, and play accordingly.
We have 2 total regrets for …, 2 total regrets for …, and 2 total regrets for ….
Our next strategy is going to be (2/6; 2/6; 2/6).
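A small continuation of the same sketch (again with placeholder action labels; the assignment of this hand’s regrets to actions is my assumption, chosen to match the stated totals of 2 each): add the new regrets to the running totals and re-normalise.

```python
# Cumulating regrets (placeholder action labels; the per-action assignment of this
# hand's regrets is an assumption chosen to match the stated totals of 2 each).

cumulative = {"a1": 2.0, "a2": 1.0, "a3": 0.0}   # totals after the first two hands
this_hand = {"a1": 0.0, "a2": 1.0, "a3": 2.0}    # regrets observed in the new hand

# Add the new regrets to the old ones...
for action, r in this_hand.items():
    cumulative[action] += r

# ...and play proportionally to the positive totals: (2/6; 2/6; 2/6).
total = sum(max(r, 0.0) for r in cumulative.values())
print({a: max(r, 0.0) / total for a, r in cumulative.items()})  # 1/3 each
```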
Cumulating regrets
Cumulative regrets are good, but not good enough.
If our opponent knows that we are using cumulative regrets, then they are always in a position to best respond.
Can we do better than this? Yes, we can.
The key idea, as is often the case in modern AI, is to play against ourselves;
We exploit these “hypothetical games” to simulate our opponents and strengthen our strategies.
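As an illustration of this self-play idea, here is a sketch, entirely my own construction, in which two regret-matching players repeatedly play Rock-Paper-Scissors against each other; their average strategies drift towards the uniform equilibrium.

```python
# Self-play sketch (my own construction): two regret-matching players in Rock-Paper-Scissors.
import random

ACTIONS = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def payoff(mine, yours):
    return 0 if mine == yours else (1 if BEATS[mine] == yours else -1)

def regret_matching(regrets):
    pos = [max(r, 0.0) for r in regrets]
    s = sum(pos)
    return [p / s for p in pos] if s > 0 else [1 / len(pos)] * len(pos)

ROUNDS = 100_000
cum = [[0.0] * 3, [0.0] * 3]   # cumulative regrets per player
avg = [[0.0] * 3, [0.0] * 3]   # summed strategies, used for the average strategy

for _ in range(ROUNDS):
    sigma = [regret_matching(cum[p]) for p in (0, 1)]
    moves = [random.choices(range(3), weights=sigma[p])[0] for p in (0, 1)]
    for p in (0, 1):
        mine, yours = ACTIONS[moves[p]], ACTIONS[moves[1 - p]]
        for i, alt in enumerate(ACTIONS):
            cum[p][i] += payoff(alt, yours) - payoff(mine, yours)  # this round's regrets
            avg[p][i] += sigma[p][i]

print([round(x / ROUNDS, 3) for x in avg[0]])  # roughly [0.333, 0.333, 0.333]
```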
Learning in Games
Context: you keep playing the same game against the same opponents. Objective: you want to learn their strategies.
A good hypothesis might be that the frequency with which player $i$ plays action $a_i$ is approximately her probability of playing $a_i$.
Now suppose you always best-respond to those hypothesised strategies. And suppose everyone else does the same. What will happen?
We are going to see that for zero-sum games this process converges to a NE.
This yields a method for computing a NE for the (non-repeated) game: just imagine players engaging in such “fictitious play”.
Empirical Mixed Strategies
Given a history of actions $H_i^l = a_i^0, a_i^1, \ldots, a_i^{l-1}$ played by player $i$ in $l$ prior plays of game $\langle N, A, u \rangle$, fix her empirical mixed strategy $s_i^l \in S_i$:
$$s_i^l(a_i) = \frac{1}{l} \cdot \#\{k < l \mid a_i^k = a_i\}$$
In fictitious play, at round $l$ each player best-responds to her opponents’ empirical mixed strategies:
$$a_i^l \in \operatorname{argmax}_{a_i \in A_i} u_i(a_i, s_{-i}^l), \quad \text{where} \quad s_{i'}^l(a_{i'}) = \frac{1}{l} \cdot \#\{k < l \mid a_{i'}^k = a_{i'}\} \text{ for each } i' \neq i.$$
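A sketch of fictitious play for Rock-Paper-Scissors, my own construction based on the formulas above: each player best-responds to the opponent’s empirical action frequencies, and the empirical frequencies approach the uniform Nash equilibrium.

```python
# Fictitious-play sketch (my own construction) for Rock-Paper-Scissors.

ACTIONS = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def payoff(mine, yours):
    return 0 if mine == yours else (1 if BEATS[mine] == yours else -1)

def best_response(opponent_counts):
    """Best response against the empirical mixed strategy induced by the counts."""
    total = sum(opponent_counts.values())
    def expected(a):
        return sum(c / total * payoff(a, b) for b, c in opponent_counts.items())
    return max(ACTIONS, key=expected)

# counts[i][a]: how often player i has played a (seeded with one arbitrary play each).
counts = [{"rock": 1, "paper": 0, "scissors": 0},
          {"rock": 0, "paper": 1, "scissors": 0}]

for _ in range(10_000):
    moves = [best_response(counts[1]), best_response(counts[0])]
    for i, a in enumerate(moves):
        counts[i][a] += 1

total = sum(counts[0].values())
print({a: round(c / total, 3) for a, c in counts[0].items()})  # roughly 1/3 each
```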