
Partially Observable MDPs, Game Theory
CSci 5512: Artificial Intelligence II
Instructor:
February 24, 2022



Announcements
HW2 deadline extended 24 hrs (due Feb 25)
Exam 1 posted on Mar 1 (next Tue)
Will cover all topics through Feb 24 lecture
Chapters 12-14, 16-18
HW3 posted on Mar 15 (after Spring Break)

Sequential Decision Problems

Markov Decision Process (MDP)
States s ∈ S, actions a ∈ A
Model T(s,a,s′) ≡ P(s′|s,a)
Reward function R(s) (or R(s, a), R(s, a, s′))
R(s) = −0.04 (small penalty) for non-terminal states
R(s) = ±1 for terminal states
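To make these components concrete, here is a minimal Python sketch of how such a reward function might be encoded; the 4x3 grid-world coordinates and names are illustrative assumptions, not part of the slides.

# Minimal sketch: reward function R(s) for a small grid world.
# Terminal cells carry +/-1; every other state pays the living penalty.
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}  # assumed terminal coordinates

def R(s):
    """R(s): -0.04 for non-terminal states, +/-1 for terminal states."""
    return TERMINALS.get(s, -0.04)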

Partially Observable MDPs
Agent does not know which state it is in
Makes no sense to talk about policy π(s)
POMDPs have an observation (sensor) model P(e|s) = P(obtain evidence e when in state s)
Maintain beliefs b: Distribution over states
New evidence e comes in, update beliefs b

Partially Observable MDPs
Filtering equation:
P(X_{t+1} | e_{1:t+1}) = α P(e_{t+1} | X_{t+1}) ∑_{x_t} P(X_{t+1} | x_t) P(x_t | e_{1:t})
(the three factors are the sensor model, the transition model, and the recursion, respectively)
Filtering belief states
b′(s′) = α P(e | s′) ∑_s P(s′ | s, a) b(s)
The new belief state b′ depends on current belief state b, action a, and evidence e:
b′ = α Forward(b, a, e)
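A minimal sketch of this belief update in Python, using dictionaries as distributions; the function and argument names are assumptions for illustration, not an API from the course.

def forward(b, a, e, T, O):
    """Belief update: b'(s') = alpha * P(e|s') * sum_s P(s'|s,a) * b(s).

    b: dict state -> probability; T: dict (s, a, s') -> P(s'|s,a);
    O: dict (e, s') -> P(e|s').  All encodings are illustrative.
    """
    b_new = {}
    for s2 in b:  # assumes the belief dict enumerates every state
        pred = sum(T.get((s, a, s2), 0.0) * b[s] for s in b)  # prediction step
        b_new[s2] = O.get((e, s2), 0.0) * pred                # sensor weighting
    alpha = 1.0 / sum(b_new.values())                         # normalization
    return {s: p * alpha for s, p in b_new.items()}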

Optimal action depends only on agent’s current belief state
Optimal policy is mapping π∗(b) from belief states to actions
Decision cycle has 3 steps:
1 Given current belief b execute action a = π∗(b)
2 Receive evidence e
3 Infer next belief state b′ = Forward(b, a, e) and repeat
Theorem (Åström, 1965): The optimal policy in a POMDP is a function π(b) where b is the belief state (probability distribution over states)
Can convert POMDP into an MDP in belief-state space
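The three-step decision cycle above can be sketched as a loop; pi_star, env.do, and env.get_evidence are placeholder names, not an API from the slides, and forward is the sketch shown earlier.

def pomdp_agent_loop(b, pi_star, T, O, env):
    """Run the POMDP decision cycle (illustrative sketch)."""
    while True:
        a = pi_star(b)               # 1. execute action a = pi*(b)
        env.do(a)
        e = env.get_evidence()       # 2. receive evidence e
        b = forward(b, a, e, T, O)   # 3. update belief state and repeat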

POMDPs as MDPs over beliefs
What is probability agent in belief state b reaches b′ after taking action a?
If we know the action a and the evidence e, then we can use b′ = α Forward(b, a, e)
But we don’t know the evidence yet (since we haven’t taken the action)
Compute probability of evidence given action taken in belief state b by summing over all actual states s′ we might reach:
P(e | a, b) = ∑_{s′} P(e | a, s′, b) P(s′ | a, b) = ∑_{s′} P(e | s′) P(s′ | a, b)
            = ∑_{s′} P(e | s′) ∑_s P(s′ | s, a) b(s)
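A sketch of this computation, reusing the dictionary conventions of the forward sketch above (the names remain assumptions):

def prob_evidence(e, a, b, T, O):
    """P(e | a, b) = sum_{s'} P(e|s') * sum_s P(s'|s,a) * b(s)."""
    total = 0.0
    for s2 in b:
        pred = sum(T.get((s, a, s2), 0.0) * b[s] for s in b)  # P(s'|a,b)
        total += O.get((e, s2), 0.0) * pred
    return total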

POMDPs as MDPs over beliefs
Need transition model P(b′ | b, a)
Transition model between belief states
P(b′ | b, a) = ∑_e P(b′ | e, a, b) P(e | a, b)
             = ∑_e P(b′ | e, a, b) ∑_{s′} P(e | s′) ∑_s P(s′ | s, a) b(s)
where P(b′ | e, a, b) = 1 if b′ = Forward(b, a, e) and 0 otherwise
Can view as transition model for belief-state space
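Since b′ is determined by (b, a, e), the belief-state transition model can be enumerated over evidence values; a sketch under the same conventions, where evidences is an assumed list of all possible observations:

def belief_successors(b, a, evidences, T, O):
    """Enumerate (P(e|a,b), b') pairs; all of the mass P(e|a,b) lands on
    the single successor belief b' = Forward(b, a, e)."""
    out = []
    for e in evidences:
        p_e = prob_evidence(e, a, b, T, O)
        if p_e > 0:
            out.append((p_e, forward(b, a, e, T, O)))
    return out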

POMDPs as MDPs over beliefs
Can also define reward function for belief-state transitions
Reward for a belief state
ρ(b, a) = ∑_s b(s) ∑_{s′} P(s′ | s, a) R(s, a, s′)  (see the code sketch below)
P(b′|b, a) and ρ(b, a) define observable MDP on space of belief states
Optimal policy π∗(b) is optimal policy for original POMDP
Solving POMDP on state space can be reduced to solving MDP on belief-state space
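The belief reward ρ(b, a) is a direct double sum; a sketch under the same dictionary conventions, where R keyed by (s, a, s′) is an assumed encoding:

def rho(b, a, T, R):
    """rho(b, a) = sum_s b(s) * sum_{s'} P(s'|s,a) * R(s,a,s')."""
    return sum(
        b[s] * sum(T.get((s, a, s2), 0.0) * R.get((s, a, s2), 0.0) for s2 in b)
        for s in b
    )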

Bellman equation in terms of beliefs
If there are n states, b is an n-dimensional real-valued vector
Solving POMDPs is very hard (PSPACE-hard, in fact)
Value iteration and policy iteration algorithms
Approximate solutions
Solutions include information-gathering behavior
The real world is a POMDP (with initially unknown T and O)

Game Theory and Mechanism Design
Payoffs and Strategies
Dominant Strategy Equilibrium
Maximin Strategies

Components of a Game:
Players: Agents who will be making decisions
May be 2-player or n-player games
Actions: Choices available to the players
Each player may have a different set of actions
Payoff Matrix: Utility of each player for action combinations
A function over actions of all players

                Alice: Testify   Alice: Refuse
Bob: Testify    A=-5, B=-5       A=-10, B=0
Bob: Refuse     A=0, B=-10       A=-1, B=-1
Table: Prisoner’s Dilemma
                Acme: Blu-ray   Acme: DVD
Best: Blu-ray   A=+9, B=+9      A=-4, B=-1
Best: DVD       A=-3, B=-1      A=+5, B=+5
Table: Video Game

Strategies
Strategy is a policy: which action to choose?
Two types of strategies
Pure strategy: Deterministic policy with a fixed action
Mixed strategy: A probability distribution over actions
Strategy profile
A strategy assignment for each player
Every strategy profile gives an outcome for each player

Dominant Strategy Equilibrium
Dominant strategy
Better than all other choices of actions
Better over all strategies of other players
May or may not exist in a game
Alice has Testify as a dominant strategy (so does Bob)
                Alice: Testify   Alice: Refuse
Bob: Testify    A=-5, B=-5       A=-10, B=0
Bob: Refuse     A=0, B=-10       A=-1, B=-1
Dominant strategy equilibrium
All players choose their dominant strategy
Example: Alice and Bob both choose Testify
Equilibrium strategy profile
Assume other player sticks with the same strategy
Player has no incentive to change
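A tiny sketch that finds a dominant strategy in a two-player payoff table; the dictionary encoding of the Prisoner’s Dilemma is an assumption for illustration:

# (Alice's action, Bob's action) -> (Alice's utility, Bob's utility)
PD = {
    ("testify", "testify"): (-5, -5), ("testify", "refuse"): (0, -10),
    ("refuse", "testify"): (-10, 0),  ("refuse", "refuse"): (-1, -1),
}
ACTIONS = ["testify", "refuse"]

def dominant_strategy_for_alice(payoffs, actions):
    """Return an action that is best against every Bob action, else None."""
    for a in actions:
        if all(payoffs[(a, b)][0] >= payoffs[(a2, b)][0]
               for b in actions for a2 in actions):
            return a
    return None

print(dominant_strategy_for_alice(PD, ACTIONS))  # -> 'testify'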

Nash Equilibrium
Consider an n-player game
Let si be the strategy of the i-th player
Let s = (s_i)_{i=1}^n be the strategy profile
Let u_i(s) be the payoff for the i-th player
A strategy profile s* is a Nash equilibrium (NE) if, for all players i and all alternative strategies s_i,
u_i(s*) ≥ u_i(s_i, s*_{−i})
Theorem (Nash, 1950): Every finite n-player game admits at least one NE (possibly in mixed strategies)
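A pure-strategy NE check for a two-player table, reusing the PD encoding from the earlier sketch (illustrative, not a general algorithm):

def is_pure_nash(payoffs, actions, profile):
    """True if neither player gains from a unilateral deviation."""
    a, b = profile
    row_ok = all(payoffs[(a, b)][0] >= payoffs[(a2, b)][0] for a2 in actions)
    col_ok = all(payoffs[(a, b)][1] >= payoffs[(a, b2)][1] for b2 in actions)
    return row_ok and col_ok

print(is_pure_nash(PD, ACTIONS, ("testify", "testify")))  # -> True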

Expected Loss in Games
Rock Paper Scissors
The loss matrix M for the row player
Pure strategies: Rock, Paper, Scissors
Randomized play
Row player chooses distribution P
Column player chooses distribution Q
The expected loss for the row player
M(P, Q) = P⊤ M Q = ∑_{i,j} P(i) Q(j) M(i, j)
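A numpy sketch of this expected loss for Rock-Paper-Scissors; the 0/+1/−1 entries of M follow the usual convention and are an assumption, since the slides do not show the matrix:

import numpy as np

# Row player's loss matrix (rows/cols: Rock, Paper, Scissors).
# Assumed convention: +1 = row player loses, -1 = row player wins, 0 = draw.
M = np.array([[ 0,  1, -1],
              [-1,  0,  1],
              [ 1, -1,  0]])

P = np.array([1/3, 1/3, 1/3])    # row player's mixed strategy
Q = np.array([0.5, 0.25, 0.25])  # column player's mixed strategy

print(P @ M @ Q)  # M(P, Q) = P^T M Q; 0.0 for the uniform row strategy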

Maximin Strategies
Row player chooses a (randomized) strategy P
Given P, column player chooses a strategy Q such that the loss of the row player is maximized, i.e., max_Q M(P, Q)
Row player chooses P upfront to minimize the maximum loss
min_P max_Q M(P, Q)
If column player chooses first, then the loss of the row player
max_Q min_P M(P, Q)
Von Neumann’s minimax theorem says the following:
max_Q min_P M(P, Q) = min_P max_Q M(P, Q)
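The maximin strategy min_P max_Q M(P, Q) can be computed by linear programming; a sketch using scipy.optimize.linprog (the formulation with a game value v is standard, but the code itself is illustrative):

import numpy as np
from scipy.optimize import linprog

def maximin_strategy(M):
    """Solve min_P max_Q P^T M Q: minimize v s.t. M^T P <= v, sum(P) = 1, P >= 0."""
    n, m = M.shape
    c = np.zeros(n + 1); c[-1] = 1.0                    # minimize the value v
    A_ub = np.hstack([M.T, -np.ones((m, 1))])           # (M^T P) - v <= 0
    A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])
    bounds = [(0, None)] * n + [(None, None)]           # P >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(m),
                  A_eq=A_eq, b_eq=np.array([1.0]), bounds=bounds)
    return res.x[:n], res.x[-1]                         # optimal P and game value

P_star, v = maximin_strategy(np.array([[0, 1, -1], [-1, 0, 1], [1, -1, 0]]))
print(P_star, v)  # -> roughly [1/3, 1/3, 1/3] with value 0 for Rock-Paper-Scissors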

Issues with NE
Inconsistent with the Bayesian viewpoint
NE are not the limit of a (Bayesian) learning process
The full payoff matrix needs to be known upfront
In practice, the full payoff matrix is rarely known
Game has to be played repeatedly
Computation of the NE is tricky
Posed as a linear program with many constraints
Approximation algorithms can be used
Players can choose different (marginal) NE
The resulting joint profile need not be a NE

Mechanism Design
Designing games for rational agents
Inverse game theory
Why we need mechanism design
Tragedy of the commons
Neglect well-being of society for personal gain
The (crowded) bar problem
Decide whether to go to the bar (not fun if full)
Driving on the road
Goal: Rational agents collectively achieve/optimize global utility
Externalities need to be made explicit
Decision making for agents is easy

Auctions
Each agent has a utility value v_i and a bid b_i
English Auction:
Auctioneer increments price until one bidder is left
Bidder with highest value gets the goods at price b_o + d
b_o: highest bid among the other agents
d: auctioneer’s increment
Bidding strategy: Keep bidding until v_i ≤ b_o + d
Cons: Not truth revealing; we don’t get to know v_i for the winner
Cons: Can discourage competition
Cons: Too much communication

Auctions (cont.)
Sealed-Bid Auction:
Each bidder privately communicates b_i to the auctioneer
Highest bid wins
No simple dominant strategy
Estimate the other players’ best bid b_o; bid b_o + ε if b_o + ε < v_i
Cons: Player with highest value need not get the goods
Cons: Players must estimate other players’ bids

Auctions (cont.)
Vickrey Auction ≡ sealed-bid second-price auction
Winner pays the price of the second-highest bid
The (dominant) strategy is to bid your actual value v_i
Fair auction, widely used in practice
Utility of the agent, given the best bid b_o of the other agents:
u_i = v_i − b_o if b_i > b_o, and u_i = 0 otherwise
Bidding b_i = v_i is optimal
It is the only bid with this optimality property
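A minimal sketch of the Vickrey outcome rule; the function name and the arbitrary tie-breaking are assumptions:

def vickrey_winner(bids):
    """Given {agent: bid}, return (winner, price): top bidder pays the 2nd price."""
    ranked = sorted(bids, key=bids.get, reverse=True)  # ties broken arbitrarily
    return ranked[0], bids[ranked[1]]

print(vickrey_winner({"a": 7, "b": 5, "c": 9}))  # -> ('c', 7)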
