
COMP3702 Artificial Intelligence – Module 3: Reasoning and planning under uncertainty — Part 2 MDPs (Offline methods)

COMP3702

Artificial Intelligence

Module 3: Reasoning and planning under uncertainty — Part 2

MDPs (Offline methods)

Dr Alina Bialkowski

Semester 2, 2021

The University of Queensland

School of Information Technology and Electrical Engineering

Week 7: Logistics

• RiPPLE round 2 – due 4pm this Friday, 10 Sep
• Note: simple copy-paste from lectures or repeated resources won’t be approved

• Assignment 1 marking is in progress; we will be releasing a video walkthrough/solution

• Assignment 2 released! – due Friday, 24 Sep

1

Last Week

Began Module 3: Reasoning and planning under uncertainty

• Probability Review

• AND-OR trees

• Introduction to Decision Theory
• Framework for choosing the “best” decision in non-deterministic environments
• Preferences (≺, ∼), Lotteries e.g. L = [p1 : o1, p2 : o2]
• 6 Axioms of Utility Theory / Rational Preferences
• Expected Utility
• Utility of Money

2

Week 7 — Overview

Continue Module 3: Reasoning and planning under uncertainty

• Markov decision processes 1: Representation and exact algorithms

3

Table of contents

1. Recap – utility and rewards over time

2. Decision-theoretic planning

3. Markov decision processes

4. Value Iteration

5. Policy Iteration

4

Recap – utility and rewards over time

Utility and time

• How would you compare the following sequences of rewards (per week):
A: $1000000, $0, $0, $0, $0, $0,. . .

B: $1000, $1000, $1000, $1000, $1000,. . .

C: $1000, $0, $0, $0, $0,. . .

D: $1, $1, $1, $1, $1,. . .

E: $1, $2, $3, $4, $5,. . .

5

Rewards and Values

Suppose the agent receives a sequence of rewards r1, r2, r3, r4, . . . in time. What utility should

be assigned? “Return” or “value”

• total reward: V = ∑_{i=1}^∞ ri

• average reward: V = lim_{n→∞} (r1 + · · · + rn)/n

6

Rewards and Values

Suppose the agent receives a sequence of rewards r1, r2, r3, r4, . . . in time.

• discounted return: V = r1 + γr2 + γ²r3 + γ³r4 + · · ·
where γ is the discount factor, 0 ≤ γ ≤ 1.

7

Properties of the Discounted Rewards

• The discounted return for rewards r1, r2, r3, r4, . . . is

V = r1 + γr2 + γ²r3 + γ³r4 + · · ·
  = r1 + γ(r2 + γ(r3 + γ(r4 + · · · )))

• If Vt is the value obtained from time step t onwards, then

Vt = rt + γVt+1

• How is the infinite future valued compared to immediate rewards?
1 + γ + γ² + γ³ + · · · = 1/(1 − γ)
Therefore

(minimum reward)/(1 − γ) ≤ Vt ≤ (maximum reward)/(1 − γ)

• We can approximate V with the first k terms; the truncation error is

V − (r1 + γr2 + · · · + γ^{k−1}rk) = γ^k Vk+1

(see the sketch below)
8
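To make the truncation error bound concrete, here is a small Python sketch (the constant reward sequence, γ = 0.9 and the 200-step horizon are illustrative assumptions, not part of the slides):

```python
# Illustrative sketch: discounted return and the error from truncating after k terms.
# The reward sequence and discount factor below are made-up example values.

def discounted_return(rewards, gamma):
    """V = r1 + gamma*r2 + gamma^2*r3 + ... for a finite list of rewards."""
    return sum(gamma**i * r for i, r in enumerate(rewards))

gamma = 0.9
rewards = [1.0] * 200          # constant reward of 1 per step; long enough to stand in for infinity

V = discounted_return(rewards, gamma)              # close to 1 / (1 - gamma) = 10
for k in (5, 10, 20):
    V_k = discounted_return(rewards[:k], gamma)    # first k terms only
    bound = gamma**k * 1.0 / (1 - gamma)           # gamma^k * (max reward) / (1 - gamma)
    print(f"k={k:2d}  V_k={V_k:.4f}  error={V - V_k:.4f}  bound={bound:.4f}")
```

The printed error is γ^k Vk+1, which stays below the γ^k (max reward)/(1 − γ) bound from the slide.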

Allais Paradox (1953)

What would you prefer:

A: $1m — one million dollars

B: lottery [0.10 : $2.5m, 0.89 : $1m, 0.01 : $0]

What would you prefer:

C: lottery [0.11 : $1m, 0.89 : $0]

D: lottery [0.10 : $2.5m, 0.9 : $0]

It is inconsistent with the axioms of preferences to have A ≻ B and D ≻ C.

A,C: lottery [0.11 : $1m, 0.89 : X ]

B,D: lottery [0.10 : $2.5m, 0.01 : $0, 0.89 : X ]

9
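For reference, a small Python sketch that computes the expected monetary value (in $m) of the four lotteries; note this is expected value of money only, not utility, and it ignores risk attitude:

```python
# Expected monetary value (in $m) of the Allais lotteries.
# A lottery is a list of (probability, outcome) pairs.
def expected_value(lottery):
    return sum(p * x for p, x in lottery)

A = [(1.0, 1.0)]
B = [(0.10, 2.5), (0.89, 1.0), (0.01, 0.0)]
C = [(0.11, 1.0), (0.89, 0.0)]
D = [(0.10, 2.5), (0.90, 0.0)]

for name, lot in [("A", A), ("B", B), ("C", C), ("D", D)]:
    # A = 1.00, B = 1.14, C = 0.11, D = 0.25: expected value favours B over A and D over C
    print(name, expected_value(lot))
```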

Framing Effects [Tversky and Kahneman]

• A disease is expected to kill 600 people. Two alternative programs have been proposed:
Program A: 200 people will be saved

Program B: probability 1/3: 600 people will be saved

probability 2/3: no one will be saved

Which program would you favor?
• A disease is expected to kill 600 people. Two alternative programs have been proposed:

Program C: 400 people will die

Program D: probability 1/3: no one will die

probability 2/3: 600 will die

Which program would you favor?

Tversky and Kahneman: 72% chose A over B.

22% chose C over D.

10

Decision-theoretic planning

Agents as Processes

Agents carry out actions:

• forever: infinite horizon
• until some stopping criterion is met: indefinite horizon
• for a finite and fixed number of steps: finite horizon

11

Decision-theoretic Planning

What should an agent do when

• it gets rewards (and penalties) and tries to maximize its rewards received
• actions can be stochastic; the outcome of an action can’t be fully predicted
• there is a model that specifies the (probabilistic) outcome of actions and the rewards
• the world is fully observable?

12

Initial Assumptions for decision-theoretic planning

• flat or modular or hierarchical
• explicit states or features or individuals and relations
• static or finite stage or indefinite stage or infinite stage
• fully observable or partially observable
• deterministic or stochastic dynamics
• goals or complex preferences
• single agent or multiple agents
• knowledge is given or knowledge is learned
• perfect rationality or bounded rationality

13

Markov decision processes

Markov decision processes (MDPs)

• A framework to find the best sequence of actions to perform when the outcome of each
action is non-deterministic

• The basis of Reinforcement Learning
• Examples:

• Games: Tic-Tac-Toe, Chess, Go, etc.
• Control systems: Traffic intersection
• How many patients to admit to hospital
• Navigation
• Races: bicycle races, car races, etc.

14

World State

• The world state is the information such that if the agent knew the world state, no
information about the past is relevant to the future. Markovian assumption.

• St is the state at time t, and At is the action at time t:

P(St+1 | S0, A0, . . . , St, At) = P(St+1 | St, At)

P(s′ | s, a) is the probability that the agent will be in state s′ immediately after doing
action a in state s.

• The state captures all relevant information from the history
• The state is a sufficient statistic of the future

• The dynamics is stationary if the distribution is the same for each time point.

15

• A Markov decision process augments a Markov chain with actions and values:

16

Markov Decision Processes

An MDP consists of:

• set S of states.
• set A of actions.
• P(St+1 | St, At) specifies the dynamics or transition function.
• R(St, At, St+1) specifies the reward at time t.

The reward may be a random variable; R(s, a, s′) is the expected reward received when the
agent is in state s, does action a and ends up in state s′.

Sometimes we use R(s, a) = ∑_{s′} P(s′ | s, a) R(s, a, s′).

• γ is the discount factor.
• An MDP’s objective is to maximise E[∑_{t=0}^{T} γ^t R(st, at)].

17

Example MDP Definition: to exercise or not?

Each week Alina has to decide whether to exercise or not:

• States: {fit, unfit}
• Actions: {exercise, relax}
• Dynamics:

State Action P(fit | State,Action)
fit exercise 0.99

fit relax 0.7

unfit exercise 0.2

unfit relax 0.0

• Reward (does not depend on resulting state):
State Action Reward

fit exercise 8

fit relax 10

unfit exercise 0

unfit relax 5
18
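As a sketch, the tables above translate directly into a plain-Python encoding of this MDP (the dictionary layout is an illustrative choice, not required course scaffolding; the "unfit" probabilities are the complements of the P(fit | State, Action) column). The later sketches reuse these names:

```python
# A minimal Python encoding of the "to exercise or not?" MDP from the tables above.
STATES = ["fit", "unfit"]
ACTIONS = ["exercise", "relax"]

# P[s][a][s2] = probability of ending in s2 after doing a in s.
P = {
    "fit":   {"exercise": {"fit": 0.99, "unfit": 0.01},
              "relax":    {"fit": 0.7,  "unfit": 0.3}},
    "unfit": {"exercise": {"fit": 0.2,  "unfit": 0.8},
              "relax":    {"fit": 0.0,  "unfit": 1.0}},
}

# R[s][a] = reward for doing a in s (independent of the resulting state here).
R = {
    "fit":   {"exercise": 8, "relax": 10},
    "unfit": {"exercise": 0, "relax": 5},
}
```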

Example MDP Definition: Simple Grid World

19

Grid World Model

• States: 100 states corresponding to the positions of the robot.
• Actions: up, down, left, right.
• Transitions: The robot moves in the commanded direction with probability 0.7, and in each of the other three directions with probability 0.1.

• Rewards: If it crashes into an outside wall, it remains in its current position and has a
reward of −1.

• Four special rewarding states; the agent gets the reward when leaving.

20
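A minimal sketch of this transition rule in Python (the coordinate convention, grid-size constants and helper names are illustrative assumptions, not a prescribed implementation):

```python
import random

MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
GRID_W, GRID_H = 10, 10   # 10 x 10 grid -> 100 states, as in the slide

def transition_distribution(action):
    """Commanded direction with probability 0.7; each other direction with probability 0.1."""
    return {a: (0.7 if a == action else 0.1) for a in MOVES}

def sample_next_state(state, action):
    """Sample the robot's next position; bumping into the outside wall keeps it in place."""
    x, y = state
    dist = transition_distribution(action)
    actual = random.choices(list(dist), weights=list(dist.values()))[0]
    dx, dy = MOVES[actual]
    nx, ny = x + dx, y + dy
    if 0 <= nx < GRID_W and 0 <= ny < GRID_H:
        return (nx, ny)
    return (x, y)   # crashed into an outside wall: stays put (and would receive reward -1)
```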

Planning Horizons

The planning horizon is how far ahead the planner looks to make a decision.

• The robot gets flung to one of the corners at random after leaving a positive (+10 or +3)
reward state.

• the process never halts
• infinite horizon

• The robot gets +10 or +3 in the state, then it stays there getting no reward. Or it is left
with only a special action exit, and the episode ends. These are absorbing states.

• the robot will eventually reach an absorbing state. WHY?
• indefinite horizon

21

Information Availability

What information is available when the agent decides what to do?

• fully-observable MDP: the agent gets to observe St when deciding on action At.
• partially-observable MDP (POMDP): the agent has some noisy sensor of the state. It is a
mix of a hidden Markov model and an MDP. It needs to remember (some function of) its
sensing and acting history.

[This lecture only considers FOMDPs]

22

Policy

• A policy is a sequence of actions, taken to move from each state to the next state over the
whole time horizon.

• A stationary policy is a function or a map:

π : S → A

Given a state s, π(s) specifies what action the agent who is following π will do.

• An optimal policy, usually denoted π∗, is one with maximum expected discounted reward:

max_π E[∑_{t=0}^∞ γ^t R(st, π(st))]

where π(st) is the action taken at time t.

• For a fully-observable MDP with stationary dynamics and rewards with infinite or
indefinite horizon, there is always an optimal stationary policy.

23

Example: to exercise or not?

Each week Alina has to decide whether to exercise or not:

• States: {fit, unfit}
• Actions: {exercise, relax}

How many stationary policies are there?

What are they?

For the grid world with 100 states and 4 actions

How many stationary policies are there?

24

What is the solution of an MDP problem?

• For the simple navigation “grid world”:
• A policy, or mapping from states to actions, π : S → A, can be depicted as a map of the grid with one action per cell


• A policy, π(s), tells us what action the agent should perform for each state
• The optimal policy, π∗(s), tells us the best action for the agent to perform in each state

25

Which is the “best” policy?

Suppose the agent receives the sequence of rewards r1, r2, r3, r4, . . .. What value should be

assigned?

• total reward: V = ∑_{i=1}^∞ ri

• average reward: V = lim_{n→∞} (r1 + · · · + rn)/n

• discounted reward: V = r1 + γr2 + γ²r3 + γ³r4 + · · ·
where γ is the discount factor, 0 ≤ γ ≤ 1.

26

Discounted Rewards

• V = r1 + γr2 + γ²r3 + · · · + γ^{k−1}rk + · · ·
• V = ∑_{k=1}^∞ γ^{k−1} rk

• The discount factor, γ ∈ [0, 1), gives the present value of future rewards
• The value of receiving reward r after k + 1 time-steps is γ^k r
• This values immediate reward above delayed reward:

• γ close to 0 leads to “myopic” evaluation
• γ close to 1 leads to “far-sighted” evaluation

• Recall earlier slides

27

Discounted Rewards

Different criteria for defining “best”. Maximise one of the following:

• Myopic:

E[R(st, π(st))]

• Finite horizon:

E[∑_{t=0}^{k} R(st, π(st))]

Policy changes over time
Optimal policy: non-stationary

• Infinite horizon:

max_π E[∑_{t=0}^∞ γ^t R(st, π(st))]

Optimal policy: stationary

28

Value of a Policy

We first approach MDPs using a recursive reformulation of the objective called a value function

• The value function of an MDP, V^π(s), is the expected future reward of following an
(arbitrary) policy, π, starting from state s, given by:

V^π(s) = ∑_{s′∈S} P(s′ | π(s), s) [R(s, π(s), s′) + γ V^π(s′)],

where the policy π determines the action taken in state s, namely π(s).

• Here we have dropped the time index, as it is redundant, but note that at = π(st).

29

Value of a Policy

Given a policy π:

• The Q-function represents the value of choosing an action and then following policy π in
every subsequent state.

• Qπ(s, a), where a is an action and s is a state, is the expected value of doing a in state s,
then following policy π.

• Q^π and V^π can be defined mutually recursively:

Q^π(s, a) = ∑_{s′} P(s′ | a, s) (R(s, a, s′) + γ V^π(s′))

V^π(s) = Q^π(s, π(s))

30

Computing the Value of a Policy

Let v^π ∈ R^{|S|} be a vector of values for each state, and r ∈ R^{|S|} be a vector of rewards for each
state.

Let P^π ∈ R^{|S|×|S|} be a matrix containing probabilities for each transition under policy π, where:

P^π_{ij} = P(st+1 = j | st = i, at = π(st))

Then the value function can be written in vector form as:

v^π = r + γ P^π v^π

We can solve this using linear algebra:

⇒ (I − γP^π) v^π = r
⇒ v^π = (I − γP^π)^{−1} r

i.e., computing the value of a policy requires solving a linear system.

31
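As a concrete illustration, here is a NumPy sketch of that linear solve for the exercise MDP encoded earlier, evaluating the arbitrary policy "always relax" (the policy choice and γ = 0.9 are illustrative assumptions):

```python
import numpy as np

# Evaluate the policy pi = "always relax" on the "to exercise or not?" MDP.
states = ["fit", "unfit"]
gamma = 0.9

# P_pi[i, j] = P(s_{t+1} = j | s_t = i, a_t = pi(i));  r[i] = R(i, pi(i)).
P_pi = np.array([[0.7, 0.3],    # fit,   relax
                 [0.0, 1.0]])   # unfit, relax
r = np.array([10.0, 5.0])

# v_pi = (I - gamma * P_pi)^(-1) r, computed via a linear solve rather than an explicit inverse.
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r)
print(dict(zip(states, np.round(v_pi, 2))))
```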

Value of the Optimal Policy

• An optimal policy, π∗, expressed in terms of the value function, is one that satisfies
Bellman’s optimality condition (1957):

V∗(s) = max_a ∑_{s′} P(s′ | a, s) [R(s, a, s′) + γ V∗(s′)]

• V∗(s) is the expected value of following the optimal policy in state s, i.e. V∗(s) = max_π V^π(s).
• Similarly, Q∗(s, a), where a is an action and s is a state, is the expected value of doing a
in state s, then following the optimal policy.

• Q∗ and V∗ can be defined mutually recursively:

Q∗(s, a) = ∑_{s′} P(s′ | a, s) (R(s, a, s′) + γ V∗(s′))

V∗(s) = max_a Q∗(s, a)

π∗(s) = argmax_a Q∗(s, a)

32

Value Iteration

Value Iteration

• Let Vk be the k-step lookahead value function.
• Idea: Given an estimate of the k-step lookahead value function, determine the (k + 1)-step
lookahead value function.

1. Set V0 arbitrarily, e.g.:

V̂(s) ← 0

2. Compute Vi+1 from Vi. Loop over all s:

Vi+1(s) = max_a ∑_{s′} P(s′ | a, s) {R(s, a, s′) + γ Vi(s′)}

Once the values converge, recover the best policy from the current value function estimate:

π(s) = argmax_a ∑_{s′} P(s′ | a, s) {R(s, a, s′) + γ V̂(s′)}

(A Python sketch of this loop follows below.)

33
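A minimal Python sketch of the loop above, run on the exercise MDP encoded in the earlier sketch (γ = 0.9 and the convergence threshold are illustrative assumptions):

```python
# Value iteration sketch for the "to exercise or not?" MDP (STATES, ACTIONS, P, R as defined earlier).
gamma, eps = 0.9, 1e-6

V = {s: 0.0 for s in STATES}                       # 1. initialise V0 arbitrarily (here, all zeros)
while True:
    # 2. Bellman backup for every state
    V_new = {s: max(sum(P[s][a][s2] * (R[s][a] + gamma * V[s2]) for s2 in STATES)
                    for a in ACTIONS)
             for s in STATES}
    diff = max(abs(V_new[s] - V[s]) for s in STATES)
    V = V_new
    if diff < eps:                                 # stop once successive estimates barely change
        break

# Recover a greedy policy from the converged value estimate.
policy = {s: max(ACTIONS,
                 key=lambda a: sum(P[s][a][s2] * (R[s][a] + gamma * V[s2]) for s2 in STATES))
          for s in STATES}
print(V, policy)
```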

Value Iteration

• Theorem: there is a unique function V∗ satisfying these equations.
• If we know V∗, the optimal policy can be generated easily.
• Value iteration is guaranteed to converge to V∗.
• There is no guarantee we’ll reach the optimum in finite time, but convergence is exponentially fast (in k) to the optimal value function.
• The error reduces proportionally to γ^k / (1 − γ).

34

Grid world

[Grid figure: a grid world containing a +1 reward cell and a −100 penalty cell.]

γ = 0.9

Agent can move up, down, left or right.

It moves in the commanded direction with p = 0.8, or perpendicular to that direction with p = 0.1 for each side.

Hit a wall and the agent stays where it is.

35

VI — initialisation

[Grid figure: all state values initialised to 0.]

36

VI — 1 iteration

[Grid figure: value estimates after 1 iteration; the +1 and −100 reward cells are marked and all other values are still 0.]

37

VI — 2 iterations

[Grid figure: value estimates after 2 iterations; the cell next to the +1 cell now has value 0.72, all other values remain 0; the +1 and −100 cells are marked.]

38

VI — 3 iterations

[Grid figure: value estimates after 3 iterations; non-zero values 0.5184, 0.0648 and 0.7848 appear near the +1 cell, all other values remain 0; the +1 and −100 cells are marked.]

39

VI — 4 iterations

[Grid figure: value estimates after 4 iterations; non-zero values 0.3732, 0.6584, 0.0467, 0.1173 and 0.7965 spread outward from the +1 cell; the +1 and −100 cells are marked.]

40

VI — termination

[Grid figure: converged value estimates at termination: 0.4800, 0.4215, 0.3717, 0.1760; 0.5540, 0.3860; 0.6310, 0.7282, 0.8294, 1; the −100 cell is marked.]

41

Asynchronous Value Iteration

• The agent doesn’t need to sweep through all the states, but can update the value
functions for each state individually.

• We can update the states in any order we want, even at random
• This converges to the optimal value functions, if each state and action is visited infinitely

often in the limit.

• It can either store V [s] or Q[s, a].

42

Asynchronous VI

Storing V[s]

• Repeat forever:
1. Select a state s
2. V[s] ← max_a ∑_{s′} P(s′ | s, a) (R(s, a, s′) + γ V[s′])

Storing Q[s, a]

• Repeat forever:
1. Select a state s and action a
2. Q[s, a] ← ∑_{s′} P(s′ | s, a) (R(s, a, s′) + γ max_{a′} Q[s′, a′])

43
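A sketch of the Q-storing variant on the exercise MDP from the earlier sketch (selecting state–action pairs uniformly at random and truncating the "repeat forever" loop are illustrative choices):

```python
import random

# Asynchronous VI storing Q[s, a] (STATES, ACTIONS, P, R as defined earlier).
gamma = 0.9
Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}

for _ in range(20000):                       # "repeat forever", truncated for illustration
    s = random.choice(STATES)                # 1. select a state and action (here, uniformly at random)
    a = random.choice(ACTIONS)
    # 2. single-pair Bellman update
    Q[(s, a)] = sum(P[s][a][s2] * (R[s][a] + gamma * max(Q[(s2, a2)] for a2 in ACTIONS))
                    for s2 in STATES)

# A value estimate can be read off the Q table at any time.
V = {s: max(Q[(s, a)] for a in ACTIONS) for s in STATES}
```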

Policy Iteration

Policy Iteration

• Set π0 arbitrarily, let i = 0
• Repeat:

1. Solve for V^{πi}(s) (or Q^{πi}(s, a)):

V^{πi}(s) = ∑_{s′∈S} P(s′ | πi(s), s) [R(s, πi(s), s′) + γ V^{πi}(s′)]   ∀s ∈ S

2. Update the policy:

πi+1(s) ← argmax_a ∑_{s′∈S} P(s′ | a, s) [R(s, a, s′) + γ V^{πi}(s′)]

3. i = i + 1

• until πi(s) = πi−1(s) for all s

Solving for V^{πi} means solving a set of |S| linear equations in |S| unknowns (or |S| × |A| equations for Q^{πi}), as shown earlier.

44
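A compact Python sketch of this loop on the exercise MDP from the earlier sketch, using NumPy for the exact policy-evaluation solve (the initial policy is an arbitrary illustrative choice):

```python
import numpy as np

# Policy iteration sketch (STATES, ACTIONS, P, R as defined earlier).
gamma = 0.9
idx = {s: i for i, s in enumerate(STATES)}
pi = {s: "relax" for s in STATES}                   # arbitrary initial policy

while True:
    # 1. Policy evaluation: solve (I - gamma * P_pi) v = r_pi exactly.
    P_pi = np.array([[P[s][pi[s]][s2] for s2 in STATES] for s in STATES])
    r_pi = np.array([float(R[s][pi[s]]) for s in STATES])
    v = np.linalg.solve(np.eye(len(STATES)) - gamma * P_pi, r_pi)

    # 2. Policy improvement: act greedily with respect to v.
    new_pi = {s: max(ACTIONS,
                     key=lambda a: sum(P[s][a][s2] * (R[s][a] + gamma * v[idx[s2]])
                                       for s2 in STATES))
              for s in STATES}
    if new_pi == pi:                                # 3. stop when the policy is unchanged
        break
    pi = new_pi

print(pi, dict(zip(STATES, np.round(v, 2))))
```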

[Grid figure: an initial policy with ↑ in every state; the +1 and −100 cells are marked.]

45

[Grid figure: the value of the initial policy, with values −0.437, −4.833, −14.60, −80.56; 0.053, −9.60; 0.061, 0.135, 0.364, 1; the −100 cell is marked.]

46

[Grid figure: the policy after one improvement step: ← ← ← ←; ↑ ←; → → ↑; the +1 and −100 cells are marked.]

47

[Grid figure: the policy after the next improvement step: ↑ ← ← ↓; ↑ ←; → → →; the +1 and −100 cells are marked.]

48

[Grid figure: the policy after a further iteration is unchanged (↑ ← ← ↓; ↑ ←; → → →), so policy iteration terminates.]

49

Attributions and References

Thanks to Dr Archie Chapman and Dr Hanna Kurniawati for their materials.

Many of the slides are adapted from David Poole and Alan Mackworth, Artificial Intelligence: foundations of

computational agents, 2E, CUP, 2017, https://artint.info/. These materials are copyright © Poole
and Mackworth, 2017, licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0

International License.

Other materials derived from Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern Approach, 3E,

Prentice Hall, 2009.

