Markov Decision Processes
[AIMA 4G] Chapter 15.1-15.3, 16.1-16.2
Artificial Intelligence
COSC1127/1125
Semester 2, 2021
Prof. Sebastian Sardina
* Many slides are based on those of Dan Klein and Pieter Abbeel, and Jeffrey Chan. Thanks!
Today…
Some news…
Project 2 marking is done!
1st contest feedback this week!
Submit as per instructions.
Check #170
Course THE happening next week
Monday 9am – 9pm!
Mock Test available today
PDF (spec) + Google form (answer)
Get familiar with form+process.
Deadline: Thursday 12pm
First feedback contest tomorrow!
Tag your version with “testing” to participate!
Feedback from Project 2
Week 9 THE
20% of course marks; individual assessment.
Monday Sept 20th @ 9am-9pm (12hrs!)
2-3 hour long test; many targeted questions.
Link will be posted in EdStem.
Be ready: NO REPLACEMENT!
Submit before 6pm, when the silence policy starts (don't risk it!)
Two parts:
PDF document with question specs.
Online Google Form to enter answers. Make 100% sure you are logged in to your RMIT account.
Try the Mock THE this week: be ready, avoid surprises.
You can have your own pen & paper notes.
Content: Weeks 1 – 8 (included).
Theoretical & conceptual questions: ~tutorials.
No negative marking.
Managing content in difficult times…
Simplified tute 6 a lot!
Removed content on axioms.
Removed redundant tasks.
Added a practical question.
Simplified Week 8 (MDPs):
Fewer videos (just 4 + UCB's one).
Reduced policy iteration to its concept only; focus is on value iteration.
Week 11: a review of content
Artificial Intelligence
8 Markov Decision Processes
9 Reinforcement Learning
10 Bayesian Reasoning
11 Review MDP, RL, BN
12 IA + AI Philosophy + Course recap
Questions?
IMPORTANT DISCLAIMER
This pack of slides is NOT core study material. The core & main material is the book or any specific resources given for the week.
The slides in this presentation were prepared only to support the instructor during the lectorial, and only some of them have been used in the session as needed.
This pack of slides does not represent the complete material of the week, but only a few bits here and there around what was discussed in the lectorials. Most of the week's content will NOT show up in this pack of slides.
Please refer to the book (and any resources specifically given) for a comprehensive coverage of the week material.
This week…
Sequential decision problems.
Markov decision processes:
Definition
Policy
Utilities of Sequences
Value Iteration:
Utilities of States
Bellman equations
Iterative algorithm
Policy Evaluation & Policy Iteration
Week 6: Classical Planning
PLANNER
Plan
(sequence of actions)
State Model
Language
Algorithm
Recap Previous Weeks
Automated planning:
Goals, planning, logic representation, search, …
E.g.: to be on the flight on time.
Properties/Assumptions:
World is fully observable.
Deterministic actions.
Static environment.
Goal is “all or nothing”.
But:
Knowledge is incomplete.
Actions are fallible & non-deterministic.
Environment is dynamic.
Agent uses preferences.
Traffic? Accidents? Security checks? Flight delays?
Probability
Probability theory is very well established.
Use it to assign degrees of beliefs and likelihoods to (uncertain) events.
E.g. Pr(coin = head) = ½
We saw this two weeks ago; let's use it!
Making decisions? Utility Theory
A deterministic logical agent has a goal and executes (any) plan that achieves the goal.
In an uncertain world, this is no longer the case.
Flight A90: 95% chance of making it to the airport on time.
Flight A120: 98% chance of making it on time.
Flight A1440: 100% chance of making it on time.
Relax goals: consider agent preferences for different outcomes.
Utility theory = associate degree of usefulness (utility) to each outcome, with outcomes of higher utility preferred.
Simple vs Complex Sequential Decisions
Simple decisions: utility of plan based on one action/decision:
Whether to take a taxi or the train to the airport?
Where to go for a date?
Complex sequential decisions: the utility of outcomes depends on a sequence of actions/decisions:
Chess.
Financial planning.
Path navigation.
Deployment of mining trucks in an area.
Overview of next 2 weeks
Sequential Complex Decision Making: MDPs
Environment known, but uncertainty in outcomes of actions/decisions.
Utility of plan dependent on sequence of actions.
Reinforcement Learning
Environment unknown (transitions, rewards).
But feedback available!
MDP: Markov Decision Processes
Image from: A Primer on Reinforcement Learning in the Brain: Psychological, Computational, and Neural Perspectives
Outline of MDP Content
GridWorld Running Example.
Definition of Markov Decision Processes.
Policies.
Utilities of Sequences.
Value Iteration.
Policy Evaluation & Policy Iteration.
A maze-like world
Agent lives in a grid
Walls block agent’s path
Movement actions are noisy – they do not always go as planned.
When going north: 80% of the time the agent goes north, but 10% of the time it goes west and 10% east.
Agent receives rewards each time step
Small “living” reward each step (can be negative)
Big rewards come at the end (good or bad)
Goal: maximise sum of rewards
Illustrative Example: GridWorld
Question: How to find the best sequence of actions or plan an agent should take in GridWorld?
GridWorld Actions
Deterministic Grid World
Stochastic Grid World
Characteristics of GridWorld
Environment is fully observable.
Agent knows where it is at any time.
But actions are uncertain (movement).
Each grid location is a state that the agent can be in.
Has terminating grid locations/states.
Sequential decision problem.
Each grid location/state has a reward.
Final utility depends on the sequence of states the agent traversed.
Question: How to find the best sequence of actions an agent should take in GridWorld?
An MDP is a tuple M = (S, A, T, R, s0, F):
A set of states s ∈ S:
locations, clean, fuel, broken, ..
A set of actions a ∈ A:
move to, suck, pick, drop, …
A transition function T ("the model"):
T(s, a, s'): probability that action a in state s leads to s', i.e., P(s' | s, a)
pick action succeeds 90% of the time
A reward function R(s): value of state s.
clean living room gives +10
A start state s0 ∈ S.
robot initial location, etc..
Possibly terminal states F ⊆ S
Markov Decision Processes (MDP)
Rewards on states or transitions?
R(s) : rewards on states (book R&N & here).
R(s, a, s’): rewards on transitions (“standard”)
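As an illustration only (not the course's project code), the tuple above can be written down directly in Python; every state name, reward, and probability below is a made-up example.

# A minimal MDP sketch in plain Python (illustrative values only, not GridWorld's).
# T maps (s, a) to a list of (s', probability) pairs, i.e., P(s' | s, a).
S = ["dirty", "clean", "broken"]                      # states
A = ["suck", "move"]                                  # actions
s0 = "dirty"                                          # start state
F = {"broken"}                                        # terminal states
R = {"dirty": -1.0, "clean": 10.0, "broken": -20.0}   # rewards on states, R(s)
T = {
    ("dirty", "suck"): [("clean", 0.9), ("dirty", 0.1)],   # suck succeeds 90% of the time
    ("dirty", "move"): [("dirty", 0.8), ("broken", 0.2)],
    ("clean", "suck"): [("clean", 1.0)],
    ("clean", "move"): [("clean", 1.0)],
}

def transition_prob(s, a, s_next):
    # Return T(s, a, s') = P(s' | s, a), defaulting to 0 for unlisted outcomes.
    return dict(T.get((s, a), [])).get(s_next, 0.0)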
Transition System / Model
Note: this is only a portion of the whole transition model for GridWorld. States only show the location; a state may include more features.
What is Markovian in MDP?
“Markov” generally means that given the present state, the future and the past are independent.
For MDPs, “Markov” means action outcomes depend only on the current state.
E.g., in GridWorld, the outcome of the agent's movement actions (going north may end up north, west, or east) depends only on its current grid cell.
Outline of MDP Content
GridWorld Running Example
Definition of Markov Decision Processes
Policies
Utilities of Sequences
Value Iteration
Policy Evaluation & Policy Iteration
MDP Solution: A “Good” Policy
In deterministic single agent problems, an (optimal) sequence of actions from start to goal is enough.
In MDP, we want an optimal policy π*: S → A
A policy specifies what an agent should do (action) for each state.
An optimal policy is one that maximises “expected utility” (more on this soon)
Optimal policy when
R(s) = -0.03 for all non-terminal states s
Four different policies
Outline of MDP Content
GridWorld Running Example
Definition of Markov Decision Processes
Policies
Utilities of Sequences
Value Iteration
Policy Evaluation & Policy Iteration
Utilities of Sequences
Recall that rewards represent the agent's "preferences".
How to measure preferences/rewards/utilities over a sequence of states?
Less now, but better after:
[-10, 0, 1000] or [10, 1, 1]
Now or Later?
[10, 0, 0] or [0, 0, 10]
Discounting
It’s reasonable to maximise the sum of rewards.
It’s also reasonable to prefer rewards now to later.
One solution: values of rewards decay exponentially.
Worth now: reward × 1.
Worth next step: reward × 𝛾.
Worth in two steps: reward × 𝛾².
How does discounting work?
Given a sequence of states [s1, s2, s3, …, sn] and discount factor 𝛾 ∈ (0,1], the utility of starting at state s1 can be defined as:
U(s1) = R(s1) + 𝛾R(s2) + 𝛾²R(s3) + … + 𝛾^(n-1)R(sn)
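A small Python sketch of this sum (the reward sequences and 𝛾 are arbitrary examples, echoing the "now or later" question above):

def discounted_utility(rewards, gamma):
    # U(s1) = R(s1) + gamma*R(s2) + gamma^2*R(s3) + ... for a finite sequence.
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_utility([10, 0, 0], 0.9))   # 10.0 : reward received now
print(discounted_utility([0, 0, 10], 0.9))   # 8.1  : same reward, two steps later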
Stationary Preferences Assumption
Theorem: if we assume stationary preferences (if [r, r1, r2, …] is preferred to [r, r1', r2', …], then [r1, r2, …] is preferred to [r1', r2', …]):
Then: there are only two ways to define utilities:
Additive utility: U([s1, s2, s3, …]) = R(s1) + R(s2) + R(s3) + …
Discounted utility: U([s1, s2, s3, …]) = R(s1) + 𝛾R(s2) + 𝛾²R(s3) + …
Infinite Horizons
Problem: what if the game lasts forever (no terminal state, or the agent never reaches one)? Do additive utilities become infinite?
Solutions:
Proper policies: policies that guarantee the agent always reaches a terminal state. **Not always possible**.
e.g., classical planning or control.
Finite horizon: terminate game after a fixed number of steps.
But this gives non-stationary policies (they depend on time)
Discounted rewards: utility is finite when 0 < 𝛾 < 1 (bounded by Rmax/(1 - 𝛾); see the check below)
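To see why discounting keeps utilities finite: if every reward is at most Rmax, the discounted sum is at most Rmax + 𝛾Rmax + 𝛾²Rmax + … = Rmax/(1 - 𝛾), a geometric series. A tiny numerical check (Rmax and 𝛾 chosen arbitrarily):

gamma, r_max = 0.9, 10.0
closed_form = r_max / (1 - gamma)                           # geometric-series bound: 100.0
truncated = sum((gamma ** t) * r_max for t in range(1000))  # long but finite sum
print(closed_form, truncated)   # both ~100.0: utilities stay bounded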
What are you worried about and what are you looking forward to?
9836 8086 @ menti.com
Recap: MDPs
Markov decision processes (MDP):
Set of states S
Start state s0
Set of actions A
Transitions T(s,a,s’)
Rewards R(s)
MDP concepts so far:
Policy = choice of action for each state
Utility = sum of discounted rewards
Outline of MDP Content
GridWorld Running Example
Definition of Markov Decision Processes
Policies
Utilities of Sequences
Value Iteration
Policy Evaluation & Policy Iteration
Utility of states & Value Iteration
Given an MDP: how do we find the optimal policy for it?
The optimal policy indicates, for each state, what is the best action to perform.
To determine the best action, we need to calculate the utility of each state. How do we calculate this? Recursively?
Value Iteration
Utility of states - Deterministic Actions
Assume deterministic actions, that is:
For all s, a, s’: T(s, a, s’) = 1 or T(s, a, s’) = 0.
Then a policy from s1 gives a sequence s1,s2,s3,....
What is the utility of state s1 when following policy π?
Uπ(s1) = R(s1) + 𝛾R(s2) + 𝛾²R(s3) + … = Σ_{t≥1} 𝛾^(t-1) R(st)
si+1 = the state after doing action π(si) in state si,
that is, the state such that T(si, π(si), si+1) = 1
Utility of states – Deterministic Actions
Uπ(s1) = R(s1) + 𝛾R(s2) + 𝛾²R(s3) + 𝛾³R(s4) + …
Uπ(s1) = R(s1) + 𝛾[R(s2) + 𝛾R(s3) + 𝛾²R(s4) + …]
Uπ(s1) = R(s1) + 𝛾Uπ(s2)
This is based on using a particular policy π.
But π is what we are trying to find!
The utility of a state, U(s), is obtained when we choose the optimal action:
U(s) = R(s) + 𝛾 max_a {T(s, a, s')U(s')}, where s' is the state resulting from a in s.
That is, we take the maximum value across all possible actions.
Utilities of States
Now let’s go back to MDPs, where actions are uncertain.
T(s, a, s’) = P(s’ | s, a)
U(s) = R(s) + 𝛾 max_a {T(s, a, s')U(s')}            (deterministic domain)
U(s) = R(s) + 𝛾 max_a {Σ_{s'∊S} T(s, a, s')U(s')}   (Bellman equation)
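A hedged sketch of a single Bellman backup in Python, using the same dictionary-style T and R as in the earlier MDP sketch (all names and numbers here are invented for illustration):

def bellman_backup(s, U, A, T, R, gamma):
    # U(s) = R(s) + gamma * max_a sum_{s'} T(s, a, s') * U(s'),
    # where T maps (s, a) to a list of (s', probability) pairs.
    best = max(sum(p * U[s2] for s2, p in T.get((s, a), [])) for a in A)
    return R[s] + gamma * best

# Toy usage with made-up numbers: a "risky" action that usually reaches a better state.
A = ["risky", "safe"]
T = {("s1", "risky"): [("s2", 0.8), ("s1", 0.2)], ("s1", "safe"): [("s1", 1.0)]}
R = {"s1": 0.0, "s2": 1.0}
U = {"s1": 0.0, "s2": 1.0}
print(bellman_backup("s1", U, A, T, R, gamma=0.9))   # 0 + 0.9 * 0.8 = 0.72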
Gridworld Utilities
Utilities of states for 𝛾 = 1 and R(s) = -0.04 for non-terminal states.
Value Iteration Gridworld App
Run VI step by step using this tool I refactored as a stand-alone Java application:
Maximum Expected Utility Principle
Suppose that we have the optimal utility U(s) at every state. What is the optimal policy?
The Maximum Expected Utility (MEU) Principle says that, to find the optimal action, we choose the action that maximises the expected utility of the subsequent states:
π*(s) = argmax_a Σ_{s'∊S} T(s, a, s') U(s')
Value Iteration
For each state s we have its Bellman equation (so we have |S| equations):
U(s) = R(s) + 𝛾 max_a {Σ_{s'∊S} T(s, a, s')U(s')}
We cannot solve these as simultaneous equations due to the max operator, but we can solve them iteratively:
U0(s) = 0, for all states s ∊ S
Ui+1(s) = R(s) + 𝛾 max_a {Σ_{s'∊S} T(s, a, s')Ui(s')}
Guaranteed to converge!
Value Iteration
Function valueIteration(MDP, 𝜖, 𝛾)
  U(s) = 0, for all s ∊ S
  Repeat
    U' = U                        // save existing U into U'
    d = 0                         // largest difference between U and U'
    For each state s ∊ S do:      // compute new U from U'
      U(s) = R(s) + 𝛾 max_a {Σ_{s'∊S} T(s, a, s')U'(s')}
      d = max{d, |U(s) - U'(s)|}
  Until d < 𝜖(1 - 𝛾)/𝛾            // i.e., until d is very small
  Return U
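A minimal Python rendering of the pseudocode above, assuming the dictionary-style MDP used in the earlier sketches and 0 < 𝛾 < 1 (a sketch, not the course's code):

def value_iteration(S, A, T, R, gamma, epsilon=1e-4):
    # T maps (s, a) to a list of (s', probability) pairs; R maps each state to R(s).
    U = {s: 0.0 for s in S}
    while True:
        U_prev = dict(U)                                    # U' <- U
        d = 0.0                                             # largest change this sweep
        for s in S:
            q = [sum(p * U_prev[s2] for s2, p in T.get((s, a), [])) for a in A]
            U[s] = R[s] + gamma * max(q)                    # Bellman backup
            d = max(d, abs(U[s] - U_prev[s]))
        if d < epsilon * (1 - gamma) / gamma:               # stop when changes are tiny
            return U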
Extracting Optimal Policy
After value iteration, we know the utility of each state for an optimal policy. We have U(s) function, for all s.
To extract the optimal action for each state (i.e., the policy), we apply the Maximum Expected Utility (MEU) Principle: in each state, choose the action that maximises the expected utility of the subsequent states:
π*(s) = argmax_a Σ_{s'∊S} T(s, a, s') U(s')
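And a matching sketch of MEU policy extraction from the utilities U returned by value iteration (same hypothetical MDP representation as above):

def extract_policy(S, A, T, U):
    # pi*(s) = argmax_a sum_{s'} T(s, a, s') * U(s')
    return {
        s: max(A, key=lambda a: sum(p * U[s2] for s2, p in T.get((s, a), [])))
        for s in S
    }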
Gridworld using Value Iteration
States and their evolution of utilities.
Number of iterations required to guarantee an error of at most c·Rmax, as the discount factor 𝛾 is varied.
Outline of MDP Content
GridWorld Running Example
Definition of Markov Decision Processes
Policies
Utilities of Sequences
Value Iteration
Policy Evaluation & Policy Iteration
(just know it conceptually; no exercises)
Policy Evaluation
In the case where the policy π is fixed (e.g., always move right if possible), policy evaluation is to calculate the utility of each state under that fixed policy.
Given policy π:
Uπ(s) = R(s) + 𝛾 Σ_{s'∊S} T(s, π(s), s') Uπ(s')
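A hedged sketch of iterative policy evaluation (the simplified Bellman update with no max, applied repeatedly), under the same hypothetical MDP representation as the earlier sketches:

def policy_evaluation(policy, S, T, R, gamma, sweeps=100):
    # U_pi(s) = R(s) + gamma * sum_{s'} T(s, pi(s), s') * U_pi(s'),
    # approximated by repeated synchronous sweeps (it could also be solved exactly
    # as a system of |S| linear equations, since there is no max operator).
    U = {s: 0.0 for s in S}
    for _ in range(sweeps):
        U = {s: R[s] + gamma * sum(p * U[s2] for s2, p in T.get((s, policy[s]), []))
             for s in S}
    return U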
Policy Iteration
Value Iteration can be slow to converge and finding best actions can be costly.
Alternative: Policy Iteration:
Initialise with some policy
Given the current policy, evaluate it (calculate Uπ)
Calculate a new policy using MEU and Uπ
Repeat until there are no changes in the policy (see the sketch below).
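Putting the steps together, a minimal policy-iteration loop under the same assumptions (evaluation is inlined so the sketch stays self-contained):

def policy_iteration(S, A, T, R, gamma, eval_sweeps=50):
    policy = {s: A[0] for s in S}                 # start from an arbitrary policy
    while True:
        # 1. Policy evaluation: approximate U_pi with repeated simplified Bellman sweeps.
        U = {s: 0.0 for s in S}
        for _ in range(eval_sweeps):
            U = {s: R[s] + gamma * sum(p * U[s2] for s2, p in T.get((s, policy[s]), []))
                 for s in S}
        # 2. Policy improvement: one MEU step per state.
        new_policy = {
            s: max(A, key=lambda a: sum(p * U[s2] for s2, p in T.get((s, a), [])))
            for s in S
        }
        if new_policy == policy:                  # stable policy -> done
            return policy, U
        policy = new_policy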
Value Iteration vs Policy Iteration
Both value iteration and policy iteration compute the optimal values
Value iteration
Every iteration updates the utility values and, implicitly, the policy
We don't track the policy explicitly, but taking the max over all actions implicitly computes it
In policy iteration
Utilities are updated over a fixed policy (fast)
After the utilities are calculated, a new policy is found using MEU (similar to value iteration)
Complexity Comparison
Policy iteration
Generally fewer iterations than value iteration:
Policy evaluation can be solved in O(|S|^3)
MEU can be costly O(|A||S|^2) per run
Value iteration:
O(|A||S|^2) per iteration
Can have many iterations to convergence
Conclusion
Sequential decision problems.
Markov decision processes:
Definition
Policy: map S -> A
Utilities of Sequences
Value Iteration:
Utilities of States
Bellman equations
Iterative algorithm
Policy Evaluation & Policy Iteration
Next lecture: Reinforcement learning!
Whiteboard used…