
COMP3702 Artificial Intelligence – Module 3: Reasoning and planning under uncertainty — Part 2 MDPs (Offline methods)

COMP3702

Artificial Intelligence

Module 3: Reasoning and planning under uncertainty — Part 2

MDPs (Offline methods)

Dr Alina Bialkowski

Semester 2, 2021

The University of Queensland

School of Information Technology and Electrical Engineering

Week 7: Logistics

• RiPPLE round 2 – due 4pm this Friday, 10 Sep
• Note: simple copy-paste from lectures or repeated resources won’t be approved

• Assignment 1 marking is in progress; we will be releasing a video walkthrough/solution

• Assignment 2 released! – due Friday, 24 Sep

1

Last Week

Began Module 3: Reasoning and planning under uncertainty

• Probability Review

• AND-OR trees

• Introduction to Decision Theory
• Framework for choosing the “best” decision in non-deterministic environments
• Preferences (≺, ∼), Lotteries e.g. L = [p1 : o1, p2 : o2]
• 6 Axioms of Utility Theory / Rational Preferences
• Expected Utility
• Utility of Money

2

Week 7 — Overview

Continue Module 3: Reasoning and planning under uncertainty

• Markov decision processes 1: Representation and exact algorithms

3

Table of contents

1. Recap – utility and rewards over time

2. Decision-theoretic planning

3. Markov decision processes

4. Value Iteration

5. Policy Iteration

4

Recap – utility and rewards over time

Utility and time

• How would you compare the following sequences of rewards (per week):
A: $1000000, $0, $0, $0, $0, $0,. . .

B: $1000, $1000, $1000, $1000, $1000,. . .

C: $1000, $0, $0, $0, $0,. . .

D: $1, $1, $1, $1, $1,. . .

E: $1, $2, $3, $4, $5,. . .

5

Rewards and Values

Suppose the agent receives a sequence of rewards r1, r2, r3, r4, . . . in time. What utility should

be assigned? “Return” or “value”

• total reward: V = ∑_{i=1}^∞ ri

• average reward: V = lim_{n→∞} (r1 + · · · + rn)/n

6

Rewards and Values

Suppose the agent receives a sequence of rewards r1, r2, r3, r4, . . . in time.

• discounted return: V = r1 + γr2 + γ²r3 + γ³r4 + · · ·
where γ is the discount factor, 0 ≤ γ ≤ 1.

7

Properties of the Discounted Rewards

• The discounted return for rewards r1, r2, r3, r4, . . . is

V = r1 + γr2 + γ²r3 + γ³r4 + · · ·
  = r1 + γ(r2 + γ(r3 + γ(r4 + · · · )))

• If Vt is the value obtained from time step t onwards, then

Vt = rt + γVt+1

• How is the infinite future valued compared to immediate rewards?
1 + γ + γ² + γ³ + · · · = 1/(1 − γ)
Therefore

(minimum reward)/(1 − γ) ≤ Vt ≤ (maximum reward)/(1 − γ)

• We can approximate V with the first k terms; the truncation error is

V − (r1 + γr2 + · · · + γ^{k−1}rk) = γ^k Vk+1

(see the sketch below)
8
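To make the truncation error bound concrete, here is a small Python sketch (the constant reward sequence, γ = 0.9 and the 200-step horizon are illustrative assumptions, not part of the slides):

```python
# Illustrative sketch: discounted return and the error from truncating after k terms.
# The reward sequence and discount factor below are made-up example values.

def discounted_return(rewards, gamma):
    """V = r1 + gamma*r2 + gamma^2*r3 + ... for a finite list of rewards."""
    return sum(gamma**i * r for i, r in enumerate(rewards))

gamma = 0.9
rewards = [1.0] * 200          # constant reward of 1 per step; long enough to stand in for infinity

V = discounted_return(rewards, gamma)              # close to 1 / (1 - gamma) = 10
for k in (5, 10, 20):
    V_k = discounted_return(rewards[:k], gamma)    # first k terms only
    bound = gamma**k * 1.0 / (1 - gamma)           # gamma^k * (max reward) / (1 - gamma)
    print(f"k={k:2d}  V_k={V_k:.4f}  error={V - V_k:.4f}  bound={bound:.4f}")
```

The printed error is γ^k Vk+1, which stays below the γ^k (max reward)/(1 − γ) bound from the slide.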

Allais Paradox (1953)

What would you prefer:

A: $1m — one million dollars

B: lottery [0.10 : $2.5m, 0.89 : $1m, 0.01 : $0]

What would you prefer:

C: lottery [0.11 : $1m, 0.89 : $0]

D: lottery [0.10 : $2.5m, 0.9 : $0]

It is inconsistent with the axioms of preferences to have A ≻ B and D ≻ C.

A,C: lottery [0.11 : $1m, 0.89 : X ]

B,D: lottery [0.10 : $2.5m, 0.01 : $0, 0.89 : X ]

9
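For reference, a small Python sketch that computes the expected monetary value (in $m) of the four lotteries; note this is expected value of money only, not utility, and it ignores risk attitude:

```python
# Expected monetary value (in $m) of the Allais lotteries.
# A lottery is a list of (probability, outcome) pairs.
def expected_value(lottery):
    return sum(p * x for p, x in lottery)

A = [(1.0, 1.0)]
B = [(0.10, 2.5), (0.89, 1.0), (0.01, 0.0)]
C = [(0.11, 1.0), (0.89, 0.0)]
D = [(0.10, 2.5), (0.90, 0.0)]

for name, lot in [("A", A), ("B", B), ("C", C), ("D", D)]:
    # A = 1.00, B = 1.14, C = 0.11, D = 0.25: expected value favours B over A and D over C
    print(name, expected_value(lot))
```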

Framing Effects [Tversky and Kahneman]

• A disease is expected to kill 600 people. Two alternative programs have been proposed:
Program A: 200 people will be saved

Program B: probability 1/3: 600 people will be saved

probability 2/3: no one will be saved

Which program would you favor?
• A disease is expected to kill 600 people. Two alternative programs have been proposed:

Program C: 400 people will die

Program D: probability 1/3: no one will die

probability 2/3: 600 will die

Which program would you favor?

Tversky and Kahneman: 72% chose A over B.

22% chose C over D.

10

Decision-theoretic planning

Agents as Processes

Agents carry out actions:

• forever: infinite horizon
• until some stopping criterion is met: indefinite horizon
• for a finite and fixed number of steps: finite horizon

11

Decision-theoretic Planning

What should an agent do when

• it gets rewards (and penalties) and tries to maximize its rewards received
• actions can be stochastic; the outcome of an action can’t be fully predicted
• there is a model that specifies the (probabilistic) outcome of actions and the rewards
• the world is fully observable?

12

Initial Assumptions for decision-theoretic planning

• flat or modular or hierarchical
• explicit states or features or individuals and relations
• static or finite stage or indefinite stage or infinite stage
• fully observable or partially observable
• deterministic or stochastic dynamics
• goals or complex preferences
• single agent or multiple agents
• knowledge is given or knowledge is learned
• perfect rationality or bounded rationality

13

Markov decision processes

Markov decision processes (MDPs)

• A framework to find the best sequence of actions to perform when the outcome of each
action is non-deterministic

• The basis of Reinforcement Learning
• Examples:

• Games: Tic-Tac-Toe, Chess, Go, etc.
• Control systems: Traffic intersection
• How many patients to admit to hospital
• Navigation
• Races: bicycle races, car races, etc.

14

World State

• The world state is the information such that if the agent knew the world state, no
information about the past is relevant to the future. Markovian assumption.

• St is the state at time t, and At is the action at time t:

P(St+1 | S0, A0, . . . , St, At) = P(St+1 | St, At)

P(s′ | s, a) is the probability that the agent will be in state s′ immediately after doing
action a in state s.

• The state captures all relevant information from the history
• The state is a sufficient statistic of the future

• The dynamics is stationary if the distribution is the same for each time point.

15

• A Markov decision process augments a Markov chain with actions and values:

16

Markov Decision Processes

An MDP consists of:

• set S of states.
• set A of actions.
• P(St+1 | St, At) specifies the dynamics or transition function.
• R(St, At, St+1) specifies the reward at time t.

The reward may be a random variable; R(s, a, s′) is the expected reward received when the
agent is in state s, does action a and ends up in state s′.

Sometimes we use R(s, a) = ∑_{s′} P(s′ | s, a) R(s, a, s′).

• γ is the discount factor.
• An MDP’s objective is to maximise E[∑_{t=0}^{T} γ^t R(st, at)].

17

Example MDP Definition: to exercise or not?

Each week Alina has to decide whether to exercise or not:

• States: {fit, unfit}
• Actions: {exercise, relax}
• Dynamics:

State Action P(fit | State,Action)
fit exercise 0.99

fit relax 0.7

unfit exercise 0.2

unfit relax 0.0

• Reward (does not depend on resulting state):
State Action Reward

fit exercise 8

fit relax 10

unfit exercise 0

unfit relax 5
18
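As a sketch, the tables above translate directly into a plain-Python encoding of this MDP (the dictionary layout is an illustrative choice, not required course scaffolding; the "unfit" probabilities are the complements of the P(fit | State, Action) column). The later sketches reuse these names:

```python
# A minimal Python encoding of the "to exercise or not?" MDP from the tables above.
STATES = ["fit", "unfit"]
ACTIONS = ["exercise", "relax"]

# P[s][a][s2] = probability of ending in s2 after doing a in s.
P = {
    "fit":   {"exercise": {"fit": 0.99, "unfit": 0.01},
              "relax":    {"fit": 0.7,  "unfit": 0.3}},
    "unfit": {"exercise": {"fit": 0.2,  "unfit": 0.8},
              "relax":    {"fit": 0.0,  "unfit": 1.0}},
}

# R[s][a] = reward for doing a in s (independent of the resulting state here).
R = {
    "fit":   {"exercise": 8, "relax": 10},
    "unfit": {"exercise": 0, "relax": 5},
}
```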

Example MDP Definition: Simple Grid World

19

Grid World Model

• States: 100 states corresponding to the positions of the robot.
• Actions: up, down, left, right.
• Transitions: The robot moves in the commanded direction with probability 0.7, and in each of the other three directions with probability 0.1.

• Rewards: If it crashes into an outside wall, it remains in its current position and has a
reward of −1.

• Four special rewarding states; the agent gets the reward when leaving.

20
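A minimal sketch of this transition rule in Python (the coordinate convention, grid-size constants and helper names are illustrative assumptions, not a prescribed implementation):

```python
import random

MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
GRID_W, GRID_H = 10, 10   # 10 x 10 grid -> 100 states, as in the slide

def transition_distribution(action):
    """Commanded direction with probability 0.7; each other direction with probability 0.1."""
    return {a: (0.7 if a == action else 0.1) for a in MOVES}

def sample_next_state(state, action):
    """Sample the robot's next position; bumping into the outside wall keeps it in place."""
    x, y = state
    dist = transition_distribution(action)
    actual = random.choices(list(dist), weights=list(dist.values()))[0]
    dx, dy = MOVES[actual]
    nx, ny = x + dx, y + dy
    if 0 <= nx < GRID_W and 0 <= ny < GRID_H:
        return (nx, ny)
    return (x, y)   # crashed into an outside wall: stays put (and would receive reward -1)
```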

Planning Horizons

The planning horizon is how far ahead the planner looks to make a decision.

• The robot gets flung to one of the corners at random after leaving a positive (+10 or +3)
reward state.

• the process never halts
• infinite horizon

• The robot gets +10 or +3 in the state, then it stays there getting no reward. Or it is left
with only a special action exit, and the episode ends. These are absorbing states.

• the robot will eventually reach an absorbing state. WHY?
• indefinite horizon

21

Information Availability

What information is available when the agent decides what to do?

• fully-observable MDP: the agent gets to observe St when deciding on action At.
• partially-observable MDP (POMDP): the agent has some noisy sensor of the state. It is a
mix of a hidden Markov model and an MDP. It needs to remember (some function of) its
sensing and acting history.

[This lecture only considers FOMDPs]

22

Policy

• A policy is a sequence of actions, taken to move from each state to the next state over the
whole time horizon.

• A stationary policy is a function or a map:

π : S → A

Given a state s, π(s) specifies what action the agent who is following π will do.

• An optimal policy, usually denoted π∗, is one with maximum expected discounted reward:

max_π E[∑_{t=0}^∞ γ^t R(st, π(st))]

where π(st) is the action taken at time t.

• For a fully-observable MDP with stationary dynamics and rewards with infinite or
indefinite horizon, there is always an optimal stationary policy.

23

Example: to exercise or not?

Each week Alina has to decide whether to exercise or not:

• States: {fit, unfit}
• Actions: {exercise, relax}

How many stationary policies are there?

What are they?

For the grid world with 100 states and 4 actions

How many stationary policies are there?

24

What is the solution of an MDP problem?

• For the simple navigation “grid world”:
• A policy, or mapping from states to actions, π : S → A, can be depicted as a map of the grid with one action per cell


• A policy, π(s), tells us what action the agent should perform for each state
• The optimal policy, π∗(s), tells us the best action for the agent to perform in each state

25

Which is the “best” policy?

Suppose the agent receives the sequence of rewards r1, r2, r3, r4, . . .. What value should be

assigned?

• total reward: V = ∑_{i=1}^∞ ri

• average reward: V = lim_{n→∞} (r1 + · · · + rn)/n

• discounted reward: V = r1 + γr2 + γ²r3 + γ³r4 + · · ·
where γ is the discount factor, 0 ≤ γ ≤ 1.

26

Discounted Rewards

• V = r1 + γr2 + γ²r3 + · · · + γ^{k−1}rk + · · ·
• V = ∑_{k=1}^∞ γ^{k−1} rk

• The discount factor, γ ∈ [0, 1), gives the present value of future rewards
• The value of receiving reward r after k + 1 time-steps is γ^k r
• This values immediate reward above delayed reward:

• γ close to 0 leads to “myopic” evaluation
• γ close to 1 leads to “far-sighted” evaluation

• Recall earlier slides

27

Discounted Rewards

Different criteria for defining “best”. Maximise one of the following:

• Myopic:

E[R(st, π(st))]

• Finite horizon:

E[∑_{t=0}^{k} R(st, π(st))]

Policy changes over time
Optimal policy: non-stationary

• Infinite horizon:

max_π E[∑_{t=0}^∞ γ^t R(st, π(st))]

Optimal policy: stationary

28

Value of a Policy

We first approach MDPs using a recursive reformulation of the objective called a value function

• The value function of an MDP, V^π(s), is the expected future reward of following an
(arbitrary) policy, π, starting from state s, given by:

V^π(s) = ∑_{s′∈S} P(s′ | π(s), s) [R(s, π(s), s′) + γ V^π(s′)],

where the policy π determines the action taken in state s, namely π(s).

• Here we have dropped the time index, as it is redundant, but note that at = π(st).

29

Value of a Policy

Given a policy π:

• The Q-function represents the value of choosing an action and then following policy π in
every subsequent state.

• Qπ(s, a), where a is an action and s is a state, is the expected value of doing a in state s,
then following policy π.

• Q^π and V^π can be defined mutually recursively:

Q^π(s, a) = ∑_{s′} P(s′ | a, s) (R(s, a, s′) + γ V^π(s′))

V^π(s) = Q^π(s, π(s))

30

Computing the Value of a Policy

Let v^π ∈ R^{|S|} be a vector of values for each state, and r ∈ R^{|S|} be a vector of rewards for each
state.

Let P^π ∈ R^{|S|×|S|} be a matrix containing probabilities for each transition under policy π, where:

P^π_{ij} = P(st+1 = j | st = i, at = π(st))

Then the value function can be written in vector form as:

v^π = r + γ P^π v^π

We can solve this using linear algebra:

⇒ (I − γP^π) v^π = r
⇒ v^π = (I − γP^π)^{−1} r

i.e., computing the value of a policy requires solving a linear system.

31
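As a concrete illustration, here is a NumPy sketch of that linear solve for the exercise MDP encoded earlier, evaluating the arbitrary policy "always relax" (the policy choice and γ = 0.9 are illustrative assumptions):

```python
import numpy as np

# Evaluate the policy pi = "always relax" on the "to exercise or not?" MDP.
states = ["fit", "unfit"]
gamma = 0.9

# P_pi[i, j] = P(s_{t+1} = j | s_t = i, a_t = pi(i));  r[i] = R(i, pi(i)).
P_pi = np.array([[0.7, 0.3],    # fit,   relax
                 [0.0, 1.0]])   # unfit, relax
r = np.array([10.0, 5.0])

# v_pi = (I - gamma * P_pi)^(-1) r, computed via a linear solve rather than an explicit inverse.
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r)
print(dict(zip(states, np.round(v_pi, 2))))
```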

Value of the Optimal Policy

• An optimal policy, π∗, expressed in terms of the value function, is one that satisfies
Bellman’s optimality condition (1957):

V∗(s) = max_a ∑_{s′} P(s′ | a, s) [R(s, a, s′) + γ V∗(s′)]

• V∗(s) is the expected value of following the optimal policy in state s, i.e. V∗(s) = max_π V^π(s).
• Similarly, Q∗(s, a), where a is an action and s is a state, is the expected value of doing a
in state s, then following the optimal policy.

• Q∗ and V∗ can be defined mutually recursively:

Q∗(s, a) = ∑_{s′} P(s′ | a, s) (R(s, a, s′) + γ V∗(s′))

V∗(s) = max_a Q∗(s, a)

π∗(s) = argmax_a Q∗(s, a)

32

Value Iteration

Value Iteration

• Let Vk be the k-step lookahead value function.
• Idea: Given an estimate of the k-step lookahead value function, determine the (k + 1)-step
lookahead value function.

1. Set V0 arbitrarily, e.g.:

V̂(s) ← 0

2. Compute Vi+1 from Vi. Loop over all s:

Vi+1(s) = max_a ∑_{s′} P(s′ | a, s) {R(s, a, s′) + γ Vi(s′)}

Once the values converge, recover the best policy from the current value function estimate:

π(s) = argmax_a ∑_{s′} P(s′ | a, s) {R(s, a, s′) + γ V̂(s′)}

(A Python sketch of this loop follows below.)

33
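A minimal Python sketch of the loop above, run on the exercise MDP encoded in the earlier sketch (γ = 0.9 and the convergence threshold are illustrative assumptions):

```python
# Value iteration sketch for the "to exercise or not?" MDP (STATES, ACTIONS, P, R as defined earlier).
gamma, eps = 0.9, 1e-6

V = {s: 0.0 for s in STATES}                       # 1. initialise V0 arbitrarily (here, all zeros)
while True:
    # 2. Bellman backup for every state
    V_new = {s: max(sum(P[s][a][s2] * (R[s][a] + gamma * V[s2]) for s2 in STATES)
                    for a in ACTIONS)
             for s in STATES}
    diff = max(abs(V_new[s] - V[s]) for s in STATES)
    V = V_new
    if diff < eps:                                 # stop once successive estimates barely change
        break

# Recover a greedy policy from the converged value estimate.
policy = {s: max(ACTIONS,
                 key=lambda a: sum(P[s][a][s2] * (R[s][a] + gamma * V[s2]) for s2 in STATES))
          for s in STATES}
print(V, policy)
```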

Value Iteration

• Theorem: there is a unique function V∗ satisfying these equations.
• If we know V∗, the optimal policy can be generated easily.
• Value iteration is guaranteed to converge to V∗.
• There is no guarantee we’ll reach the optimum in finite time, but convergence is exponentially fast (in k) to the optimal value function.
• The error reduces proportionally to γ^k / (1 − γ).

34

Grid world

[Grid figure: a grid world containing a +1 reward cell and a −100 penalty cell.]

γ = 0.9

Agent can move up, down, left or right.

It moves in the commanded direction with p = 0.8, or perpendicular to that direction with p = 0.1 for each side.

Hit a wall and the agent stays where it is.

35

VI — initialisation

[Grid figure: all state values initialised to 0.]

36

VI — 1 iteration

[Grid figure: value estimates after 1 iteration; the +1 and −100 reward cells are marked and all other values are still 0.]

37

VI — 2 iterations

[Grid figure: value estimates after 2 iterations; the cell next to the +1 cell now has value 0.72, all other values remain 0; the +1 and −100 cells are marked.]

38

VI — 3 iterations

[Grid figure: value estimates after 3 iterations; non-zero values 0.5184, 0.0648 and 0.7848 appear near the +1 cell, all other values remain 0; the +1 and −100 cells are marked.]

39

VI — 4 iterations

[Grid figure: value estimates after 4 iterations; non-zero values 0.3732, 0.6584, 0.0467, 0.1173 and 0.7965 spread outward from the +1 cell; the +1 and −100 cells are marked.]

40

VI — termination

[Grid figure: converged value estimates at termination: 0.4800, 0.4215, 0.3717, 0.1760; 0.5540, 0.3860; 0.6310, 0.7282, 0.8294, 1; the −100 cell is marked.]

41

Asynchronous Value Iteration

• The agent doesn’t need to sweep through all the states, but can update the value
functions for each state individually.

• We can update the states in any order we want, even at random
• This converges to the optimal value functions, if each state and action is visited infinitely

often in the limit.

• It can either store V [s] or Q[s, a].

42

Asynchronous VI

Storing V[s]

• Repeat forever:
1. Select a state s
2. V[s] ← max_a ∑_{s′} P(s′ | s, a) (R(s, a, s′) + γ V[s′])

Storing Q[s, a]

• Repeat forever:
1. Select a state s and action a
2. Q[s, a] ← ∑_{s′} P(s′ | s, a) (R(s, a, s′) + γ max_{a′} Q[s′, a′])

43
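A sketch of the Q-storing variant on the exercise MDP from the earlier sketch (selecting state–action pairs uniformly at random and truncating the "repeat forever" loop are illustrative choices):

```python
import random

# Asynchronous VI storing Q[s, a] (STATES, ACTIONS, P, R as defined earlier).
gamma = 0.9
Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}

for _ in range(20000):                       # "repeat forever", truncated for illustration
    s = random.choice(STATES)                # 1. select a state and action (here, uniformly at random)
    a = random.choice(ACTIONS)
    # 2. single-pair Bellman update
    Q[(s, a)] = sum(P[s][a][s2] * (R[s][a] + gamma * max(Q[(s2, a2)] for a2 in ACTIONS))
                    for s2 in STATES)

# A value estimate can be read off the Q table at any time.
V = {s: max(Q[(s, a)] for a in ACTIONS) for s in STATES}
```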

Policy Iteration

Policy Iteration

• Set π0 arbitrarily, let i = 0
• Repeat:

1. Solve for V^{πi}(s) (or Q^{πi}(s, a)):

V^{πi}(s) = ∑_{s′∈S} P(s′ | πi(s), s) [R(s, πi(s), s′) + γ V^{πi}(s′)]   ∀s ∈ S

2. Update the policy:

πi+1(s) ← argmax_a ∑_{s′∈S} P(s′ | a, s) [R(s, a, s′) + γ V^{πi}(s′)]

3. i = i + 1

• until πi(s) = πi−1(s) for all s

Solving for V^{πi} means solving a set of |S| linear equations in |S| unknowns (or |S| × |A| equations for Q^{πi}), as shown earlier.

44
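A compact Python sketch of this loop on the exercise MDP from the earlier sketch, using NumPy for the exact policy-evaluation solve (the initial policy is an arbitrary illustrative choice):

```python
import numpy as np

# Policy iteration sketch (STATES, ACTIONS, P, R as defined earlier).
gamma = 0.9
idx = {s: i for i, s in enumerate(STATES)}
pi = {s: "relax" for s in STATES}                   # arbitrary initial policy

while True:
    # 1. Policy evaluation: solve (I - gamma * P_pi) v = r_pi exactly.
    P_pi = np.array([[P[s][pi[s]][s2] for s2 in STATES] for s in STATES])
    r_pi = np.array([float(R[s][pi[s]]) for s in STATES])
    v = np.linalg.solve(np.eye(len(STATES)) - gamma * P_pi, r_pi)

    # 2. Policy improvement: act greedily with respect to v.
    new_pi = {s: max(ACTIONS,
                     key=lambda a: sum(P[s][a][s2] * (R[s][a] + gamma * v[idx[s2]])
                                       for s2 in STATES))
              for s in STATES}
    if new_pi == pi:                                # 3. stop when the policy is unchanged
        break
    pi = new_pi

print(pi, dict(zip(STATES, np.round(v, 2))))
```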

[Grid figure: an initial policy with ↑ in every state; the +1 and −100 cells are marked.]

45

[Grid figure: the value of the initial policy, with values −0.437, −4.833, −14.60, −80.56; 0.053, −9.60; 0.061, 0.135, 0.364, 1; the −100 cell is marked.]

46

[Grid figure: the policy after one improvement step: ← ← ← ←; ↑ ←; → → ↑; the +1 and −100 cells are marked.]

47

[Grid figure: the policy after the next improvement step: ↑ ← ← ↓; ↑ ←; → → →; the +1 and −100 cells are marked.]

48

[Grid figure: the policy after a further iteration is unchanged (↑ ← ← ↓; ↑ ←; → → →), so policy iteration terminates.]

49

Attributions and References

Thanks to Dr Archie Chapman and Dr Hanna Kurniawati for their materials.

Many of the slides are adapted from David Poole and Alan Mackworth, Artificial Intelligence: foundations of

computational agents, 2E, CUP, 2017, https://artint.info/. These materials are copyright © Poole
and Mackworth, 2017, licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0

International License.

Other materials derived from Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern Approach, 3E,

Prentice Hall, 2009.

