COSC1127/1125 Artificial Intelligence
School of Computing Technologies Semester 2, 2021
Prof. Sebastian Sardiña
Tutorial No. 8
Markov Decision Processes (MDPs)
You can check this video on max/min vs arg max/arg min.
PART 1: MDPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
(a) Recall a Markov Decision Process (MDP) represents a scenario where actions are non-deterministic. Consider the following small gridworld example:

        1      2      3
  1    20             5
  2
  3   −20
Cells are indexed by (row, column). Cells (1, 1), (3, 1) and (1, 3) are terminal cells, and have rewards of 20, −20 and 5, respectively. For all other cells, there is a reward of −1. The agent starts at cell (3, 3). The agent can move in the north, east, south and west directions, or stay where it is. The actions are stochastic and have the following outcomes:
• If go north, 70% of the time the agent will go north, 30% of the time it will go west.
• If go east, 60% of the time the agent will go east, 40% of the time it will go south.
• If go south, 80% of the time the agent will go south, 20% of the time it will go east.
• If go west, 70% of the time the agent will go west, 30% of the time it will go north.
• If stay in the current cell, 100% of the time the agent will stay.
If an action causes the agent to travel into a wall, the agent will stay in its current cell.
Construct an MDP for this instance of gridworld. Recall an MDP consists of S (set of states), s0 (starting state), possible terminal states, A (set of actions), T (transition model/function) and R (reward function), so all of these have to be specified. To help, consider constructing the MDP in steps:
i. What are the states in this gridworld?
ii. What are the starting and terminal states?
Solution:
Each coordinate in the grid corresponds to a state.
Solution: (Note: this is a partial solution)
Starting state is (3,3), terminal states are (1,1), (3,1) and ………….
iii. What is the set of actions?
iv. With cell (3, 3) as the current state s, construct the transition model/function for all actions and subsequent states, i.e., T(s, a, s′), where a ranges over all possible actions for (3, 3) and s′ over all possible successor states from (3, 3).
Recall the definition of the transition function T:
T : S × A × S → R,   (s, a, s′) ↦ P(s′ | a ∧ s)
Solution:
For each state, actions are {north, east, south, west, stay}
Solution: (Note: this is a partial solution)
• T((3, 3), north, (2, 3)) = 0.7   (70%, go north)
• T((3, 3), north, (3, 2)) = 0.3   (30%, go west)
• T((3, 3), east, (3, 3)) = 1   (cannot go east or south from (3, 3))
• T((3, 3), south, (3, 3)) = 1   (cannot go east or south from (3, 3))
• T((3, 3), west, (3, 2)) = …………   (…………, go west)
• T((3, 3), west, (2, 3)) = 0.3   (30%, go north)
• T((3, 3), stay, (3, 3)) = 1.0   (100%, stay)
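Below is a minimal sketch (not part of the official solution) of how the transition entries above could be encoded in Python, as a dictionary keyed by (state, action) that maps each successor state to its probability; the name T and the string action names are just assumed conventions.

```python
# Transition entries for state (3, 3), encoded as
# (state, action) -> {successor: probability}.
# Moving east or south from (3, 3) hits a wall in every outcome,
# so the agent stays where it is with probability 1.
T = {
    ((3, 3), "north"): {(2, 3): 0.7, (3, 2): 0.3},
    ((3, 3), "east"):  {(3, 3): 1.0},
    ((3, 3), "south"): {(3, 3): 1.0},
    ((3, 3), "west"):  {(3, 2): 0.7, (2, 3): 0.3},
    ((3, 3), "stay"):  {(3, 3): 1.0},
}

# Sanity check: outgoing probabilities sum to 1 for every (state, action).
for (s, a), successors in T.items():
    assert abs(sum(successors.values()) - 1.0) < 1e-9, (s, a)
```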
v. What is the reward function?
(b) Using the MDP built for question 1, consider the following sequences of states that our agent progressed through. For each sequence, compute the additive reward and discounted reward (with a decay factor γ of 0.5). Take note of the differences between the two approaches.
(a) (3,3),(2,3),(2,2),(2,1),(1,1)
(b) (3,3),(3,2),(2,2),(2,3),…(repeat)
Here, we assume the cumulative reward for a sequence of states s0, s1, . . . , st is Σ_{i=0}^{t} R(si) (with discounting if you like), as it is given in the textbook. However, it is worth noting that this is contrary to some RL conventions, which only count rewards from s1 onwards. Another convention, which you may have noticed if you looked at the UCB slides, is to use a reward function defined as R(s, a, s′), i.e., the reward given as you move from state s to state s′ taking action a. In that scenario, s0 cannot receive a reward, as there is no previous state it has come from.
PART 2: Value iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
(a) Consider the MDP from question 1, with a decay factor γ of 1.
Solution:
• R((1, 1)) = +20
• R((3, 1)) = −20
• R((1, 3)) = +5
• R(other states) = −1
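As a quick aside, this reward function is a one-line lookup in code; a possible sketch (the function name R is just an assumed convention):

```python
# Reward function for the gridworld: terminal cells have fixed rewards,
# every other cell costs -1 per step.
def R(s):
    return {(1, 1): 20, (3, 1): -20, (1, 3): 5}.get(s, -1)
```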
Solution: (Note: this is a partial solution)
(a) Additive: (−1) + (−1) + (−1) + (−1) + (+20) = 16
Discounted: (−1) + 0.5(−1) + 0.5^2(−1) + 0.5^3(−1) + 0.5^4(+20) = −0.625 (= −10/16)
(b) Additive: (−1) + (−1) + (−1) + (−1) + . . . = −∞
Discounted: ………… = Σ_{t=0}^{∞} 0.5^t × (−1) = −1 / (1 − 0.5) = −2
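A small sketch that reproduces the numbers above (the helper names are assumptions, not from the worksheet):

```python
# Additive and discounted returns of a reward sequence.
def additive_return(rewards):
    return sum(rewards)

def discounted_return(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards_a = [-1, -1, -1, -1, 20]            # sequence (a): (3,3),(2,3),(2,2),(2,1),(1,1)
print(additive_return(rewards_a))           # 16
print(discounted_return(rewards_a, 0.5))    # -0.625

# Sequence (b) repeats forever with reward -1, so the additive return
# diverges to -inf, while the discounted return is the geometric series
# sum_{t>=0} 0.5^t * (-1) = -2; a long truncation already gets very close.
print(discounted_return([-1] * 50, 0.5))    # approximately -2.0
```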
i. Using value iteration, compute the utilities for each state after two iterations. For this question, use the value iteration equation:
U_{i+1}(s) = R(s) + γ max_a Σ_{s′∈S} T(s, a, s′) U_i(s′)
Initially, we set U0(s) = 0 for all states s ∈ S. Remember terminal states do not get more rewards
once reached.
Solution: (Note: this is a partial solution)
       1      2      3
 1    20    12.7     5
 2   12.7    −2     2.2
 3   −20     −2     −2
Here is the detailed working for cell (1, 2). Note that, in this case, it is clear that “moving west” is the best action (i.e., it maximizes the second term in the formula above), but in general one would need to calculate the term for every action and then keep the max one (that is the best action to do!). Assume U0(s) = 0, ∀s ∈ S.
Compute the utility value U for state (1, 2) – row 1, column 2 – using the value iteration update:
U_{i+1}(s) = R(s) + γ max_a Σ_{s′∈S} T(s, a, s′) U_i(s′)
U1((1, 2)) = −1 + γ[T ((1, 2), west, (1, 1)) × U0(1, 1) + T ((1, 2), north, (1, 2)) × U0(1, 2)]
U1((1, 2)) = −1 + 1 × [0.7 × 0 + 0.3 × 0] = −1
(Ordinarily we would compute all U1 values for every state, but to compute (1, 2) we can see the only other one we need is (1, 1).)
U1((1, 1)) = 20 + γ[T((1, 1), stay, (1, 1)) × U0(1, 1)]   (terminal state so must “stay”)
U1((1, 1)) = ………… = 20
U2((1, 2)) = −1 + γ[T ((1, 2), west, (1, 1)) × U1(1, 1) + T ((1, 2), north, (1, 2)) × U1(1, 2)]
U2((1,2)) = ………… = 12.7
U2((1, 1)) = 20 (terminal state so cannot take further action)
ii. Repeat and determine the values after three iterations.
Solution: (Note: this is a partial solution)
        1       2       3
 1     20     16.81     5
 2   16.81    11.7     1.9
 3    −20     −3      −0.059
Continuing on from the previous question:
U3((1,2)) = …………
U3((1, 2)) = −1 + 1 × [0.7 × 20 + 0.3 × 12.7] = 16.81
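To double-check the tables above, here is a runnable sketch of value iteration on this gridworld. The encoding (dictionaries and helper names such as move and value_iteration_step) is an assumption, not the official solution code, but the movement rules and rewards are taken from Part 1(a).

```python
# Value iteration on the 3x3 gridworld of Part 1, with gamma = 1.
GRID = [(r, c) for r in (1, 2, 3) for c in (1, 2, 3)]
TERMINAL = {(1, 1), (3, 1), (1, 3)}
REWARD = {(1, 1): 20, (3, 1): -20, (1, 3): 5}        # all other cells: -1
DELTA = {"north": (-1, 0), "south": (1, 0), "east": (0, 1),
         "west": (0, -1), "stay": (0, 0)}
# Stochastic outcomes of each intended action, as (actual direction, probability).
OUTCOMES = {
    "north": [("north", 0.7), ("west", 0.3)],
    "east":  [("east", 0.6), ("south", 0.4)],
    "south": [("south", 0.8), ("east", 0.2)],
    "west":  [("west", 0.7), ("north", 0.3)],
    "stay":  [("stay", 1.0)],
}

def move(s, direction):
    """Cell reached when actually moving in `direction`; walls keep the agent put."""
    nxt = (s[0] + DELTA[direction][0], s[1] + DELTA[direction][1])
    return nxt if nxt in GRID else s

def R(s):
    return REWARD.get(s, -1)

def value_iteration_step(U, gamma=1.0):
    """One application of U_{i+1}(s) = R(s) + gamma * max_a sum_s' T(s,a,s') U_i(s')."""
    new_U = {}
    for s in GRID:
        if s in TERMINAL:
            new_U[s] = R(s)           # terminal states collect their reward and stop
            continue
        best = max(sum(p * U[move(s, d)] for d, p in OUTCOMES[a]) for a in OUTCOMES)
        new_U[s] = R(s) + gamma * best
    return new_U

U = {s: 0.0 for s in GRID}            # U_0(s) = 0 for all states
for i in range(3):
    U = value_iteration_step(U)
    print(f"U_{i + 1}((1, 2)) = {U[(1, 2)]}")   # approximately -1, 12.7, 16.81
```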
(b) Consider the utilities after two iterations of value iteration from question (a) (we typically only perform this once values have converged, but for this question, do it after 2 iterations). Using the Maximum Expected Utility principle, find the best action (i.e., policy) for each state.
Recall the equation to extract the optimal policy for a state s:
π*(s) = argmax_a Σ_{s′∈S} T(s, a, s′) U(s′)
(c) Consider the following policy for the MDP of question 1, with a decay factor γ of 1:
• π((2, 1)) = west.
• π((2, 2)) = west.
• π((2, 3)) = north.
• π((3, 2)) = north.
• π((3, 3)) = east.
• π((1, 2)) = south.
Evaluate this policy, i.e., compute the utility of each state, for one and two iterations (typically we would solve this as a system of simultaneous equations, but for this question we use an iterative approach).
Solution:
For the cell (1, 2) we can derive that the best action to take there after 2 iterations is west, that is, π((1, 2)) = west.
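For cell (1, 2) the argmax can be spelled out explicitly; a small sketch (the dictionary names are assumptions) using the two-iteration utilities from the table above:

```python
# Utilities after two iterations of value iteration (from the table above).
U2 = {(1, 1): 20, (1, 2): 12.7, (1, 3): 5,
      (2, 1): 12.7, (2, 2): -2, (2, 3): 2.2,
      (3, 1): -20, (3, 2): -2, (3, 3): -2}

# Expected utility of each action from (1, 2).  Moving north from row 1 hits
# the wall, so that outcome leaves the agent in (1, 2).
expected = {
    "north": 0.7 * U2[(1, 2)] + 0.3 * U2[(1, 1)],   # 70% bump into wall, 30% drift west
    "east":  0.6 * U2[(1, 3)] + 0.4 * U2[(2, 2)],
    "south": 0.8 * U2[(2, 2)] + 0.2 * U2[(1, 3)],
    "west":  0.7 * U2[(1, 1)] + 0.3 * U2[(1, 2)],   # 30% drift north bumps into wall
    "stay":  1.0 * U2[(1, 2)],
}
print(max(expected, key=expected.get))   # west, matching pi*((1, 2)) = west
```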
Solution:
Here we are given the policy (it is fixed), so the task is to calculate its value at every state, that is, V^π(s) or U^π(s). We can do that as follows:
U^π_{i+1}(s) = R(s) + γ Σ_{s′∈S} T(s, π(s), s′) U^π_i(s′).
Observe we do not take the max over actions because we know, at every state s, what we should do, namely π(s)!
       1      2      3
 1    20    −0.8     5
 2    4.3    −2     2.2
 3   −20    −7.7    −2

You can check THIS VIDEO on the solving of this problem.
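The same kind of check works here; below is a runnable sketch (again an assumed encoding, reusing the gridworld rules from Part 1) that applies the fixed-policy update above twice and reproduces the table.

```python
# Iterative evaluation of the fixed policy pi above, with gamma = 1:
#   U_{i+1}(s) = R(s) + gamma * sum_s' T(s, pi(s), s') * U_i(s').
GRID = [(r, c) for r in (1, 2, 3) for c in (1, 2, 3)]
TERMINAL = {(1, 1), (3, 1), (1, 3)}
REWARD = {(1, 1): 20, (3, 1): -20, (1, 3): 5}        # all other cells: -1
DELTA = {"north": (-1, 0), "south": (1, 0), "east": (0, 1), "west": (0, -1)}
OUTCOMES = {"north": [("north", 0.7), ("west", 0.3)],
            "east":  [("east", 0.6), ("south", 0.4)],
            "south": [("south", 0.8), ("east", 0.2)],
            "west":  [("west", 0.7), ("north", 0.3)]}
POLICY = {(2, 1): "west", (2, 2): "west", (2, 3): "north",
          (3, 2): "north", (3, 3): "east", (1, 2): "south"}

def move(s, d):
    nxt = (s[0] + DELTA[d][0], s[1] + DELTA[d][1])
    return nxt if nxt in GRID else s              # moves into a wall stay put

def policy_evaluation_step(U, gamma=1.0):
    new_U = {}
    for s in GRID:
        if s in TERMINAL:
            new_U[s] = REWARD[s]                  # terminal states keep their reward
        else:
            a = POLICY[s]                         # no max: the action is fixed by pi
            new_U[s] = -1 + gamma * sum(p * U[move(s, d)] for d, p in OUTCOMES[a])
    return new_U

U = {s: 0.0 for s in GRID}
for _ in range(2):
    U = policy_evaluation_step(U)
print(U[(2, 1)], U[(3, 2)])                       # approximately 4.3 and -7.7
```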
(d) For any possible gridworld MDP, is the stay action ever part of an optimal policy? If yes, design a gridworld that demonstrates it; if no, prove that it is impossible. Hint: consider the reward function for non-terminal cells.
Solution: If the reward function is positive and large in magnitude for non-terminal states, and γ is small, then it is possible for an agent to choose to stay in a state.
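For intuition, here is a tiny hypothetical check of that idea (my own toy example, not from the worksheet): a 1×2 grid where the non-terminal cell has reward +10, the terminal cell has reward +1, and γ = 0.5.

```python
# Toy 1x2 gridworld: (1, 1) is terminal with reward +1, (1, 2) is
# non-terminal with reward +10, and "west" from (1, 2) reaches the
# terminal deterministically.  With gamma = 0.5, staying forever is worth
# 10 / (1 - 0.5) = 20, while moving west is only worth 10 + 0.5 * 1 = 10.5,
# so the optimal policy chooses "stay".
gamma = 0.5
U = {(1, 1): 0.0, (1, 2): 0.0}
for _ in range(100):
    U = {(1, 1): 1.0,                                   # terminal reward
         (1, 2): 10.0 + gamma * max(U[(1, 1)],          # action west
                                    U[(1, 2)])}         # action stay
print(U[(1, 2)])   # converges to 20, achieved by always staying
```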
PART 3: More on MDP modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
When defining an MDP, one can choose to represent the rewards per state, via a function R(s) (i.e., “the immediate reward received at state s”), or assign rewards to each transition via a function R(s, a, s′) (i.e., “the immediate reward obtained when executing action a in state s and the system evolving to the next state s′”).
The R&N book uses R(s), but the more standard literature on Machine Learning uses R(s, a, s′). However, one can convert one to the other, possibly by adding additional states.
(a) Consider the following MDP representing the life of a student from David Silver’s class (check his slides):
Can you obtain an alternative MDP that achieves the same modeling but with rewards per state, that is, with a reward function R(s)?
Solution: (Note: this is a partial solution)
For each state s′ in the original MDP:
• If all the transitions arriving at s′ have the same reward, then just keep s′ as is in the new MDP and assign that same reward to state s′.
• Otherwise, we will create a version of s′ for each transition arriving at it. Concretely, for every transition T(s, a, s′) with reward R(s, a, s′), make a duplicate of state s′, which we will call s′_{s,a}, and assign R(s′_{s,a}) = R(s, a, s′). Since s′_{s,a} is a version of the original s′, all outgoing transitions from s′ need to be implemented in s′_{s,a} as well.
Now apply that process to the MDP above to obtain another MDP that has rewards associated to states. State s1 will not need to be “copied”, but the others may.
You should process one state at a time. For example, s0 will have two versions: s0_{s1,quit} and s0_{s3,pub}, with rewards 0 and 1, respectively. Both new states should have transitions, with probability 1, to state s1 under action Facebook. Now you do the rest… 🙂
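One possible way to mechanize this construction is sketched below, under assumed dictionary representations; the function and state names are mine, and the tiny example MDP at the end is hypothetical rather than the student MDP from the slides.

```python
def split_rewards_to_states(states, T, R):
    """Convert an MDP with rewards on transitions, R(s, a, s'), into one with
    rewards on states, R(s), by duplicating states when needed.
       states: iterable of state names
       T: {(s, a): {s2: prob}}      -- transition model
       R: {(s, a, s2): reward}      -- reward on transitions
    Returns (new_states, new_T, new_R)."""
    # 1. For every state, collect the set of rewards with which it can be entered.
    incoming = {s: set() for s in states}
    for (s, a), succ in T.items():
        for s2 in succ:
            incoming[s2].add(R[(s, a, s2)])

    # 2. One copy per distinct incoming reward; a state entered with a single
    #    reward (or never entered, e.g. the start state) keeps its own name.
    def copy_name(s, r):
        return s if len(incoming[s]) <= 1 else f"{s}|r={r}"

    new_states, new_R = [], {}
    for s in states:
        for r in (incoming[s] or {0}):       # never-entered states get reward 0
            new_states.append(copy_name(s, r))
            new_R[copy_name(s, r)] = r

    # 3. Every copy of s inherits all of s's outgoing transitions, with each
    #    successor redirected to the copy matching that transition's reward.
    new_T = {}
    for (s, a), succ in T.items():
        for r_in in (incoming[s] or {0}):
            new_T[(copy_name(s, r_in), a)] = {
                copy_name(s2, R[(s, a, s2)]): p for s2, p in succ.items()}
    return new_states, new_T, new_R


# Hypothetical two-state example: "s1" is entered with reward +1 (via "go")
# and with reward 0 (via its own "stay" loop), so it gets two copies.
states = ["s0", "s1"]
T = {("s0", "go"):   {"s1": 1.0},
     ("s0", "loop"): {"s0": 1.0},
     ("s1", "stay"): {"s1": 1.0}}
R = {("s0", "go", "s1"): 1, ("s0", "loop", "s0"): 0, ("s1", "stay", "s1"): 0}
print(split_rewards_to_states(states, T, R))
```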
End of tutorial worksheet