
6CCS3AIN, Tutorial 04 Answers (Version 1.0)
1. (a)
We have a decision between staying in the office and going out. Both options will lead to one or more states.
Staying in the office means that I will either work, get distracted, or talk with a colleague. These states have the following utilities:
U(work) = 8, U(distracted) = 1, U(colleague) = 5
and the probabilities of these happening, given I stay in the office are:
P(work|office) = 0.5, P(distracted|office) = 0.3, P(colleague|office) = 0.2
Thus the expected utility of staying in the office is:
EU(office) = 0.5 · 8 + 0.3 · 1 + 0.2 · 5 = 5.3
Going out can result in the states coffee and spill, with utilities:
U(coffee) = 10, U(spill) = −20
and the relevant probabilities are:
P(coffee|out) = 0.95, P(spill|out) = 0.05
Thus EU(out) = 0.95 · 10 + 0.05 · (−20) = 8.5.
(b) The principle of maximum expected utility says to pick the option with the greatest expected utility, in
this case that is to go out.
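To make the arithmetic easy to check, here is a minimal Python sketch of the two expected-utility calculations and the maximum-expected-utility choice (the dictionary and function names are just illustrative):

```python
# Expected utility of each option, using the utilities and probabilities above.
utilities = {"work": 8, "distracted": 1, "colleague": 5, "coffee": 10, "spill": -20}

outcomes = {
    "office": {"work": 0.5, "distracted": 0.3, "colleague": 0.2},
    "out":    {"coffee": 0.95, "spill": 0.05},
}

def expected_utility(option):
    return sum(p * utilities[s] for s, p in outcomes[option].items())

for option in outcomes:
    print(option, round(expected_utility(option), 2))   # office 5.3, out 8.5

# Maximum expected utility: pick the option with the largest EU.
print(max(outcomes, key=expected_utility))              # out
```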
(c) The maximin principle says to evaluate each option by the utility of its worst outcome, and to pick the option whose worst outcome has the largest utility.
Formally we have:

a = arg max_{s ∈ {office, out}} { min(U(s)) }
  = arg max { min_office(8, 1, 5), min_out(10, −20) }
  = arg max { 1_office, −20_out }
  = office

so the maximin principle would say I should stay in the office.

The maximax principle says to pick the option based on the best outcome of each action, and to pick the one with the largest utility for that best outcome, in other words:

a = arg max_{s ∈ {office, out}} { max(U(s)) }
  = arg max { max_office(8, 1, 5), max_out(10, −20) }
  = arg max { 8_office, 10_out }
  = out

which will pick out.
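The same bookkeeping gives the maximin and maximax choices directly; a small sketch (again with illustrative names) that ignores the outcome probabilities, as both rules do:

```python
# Worst-case (maximin) and best-case (maximax) choice, ignoring probabilities.
utilities = {"work": 8, "distracted": 1, "colleague": 5, "coffee": 10, "spill": -20}
outcomes = {"office": ["work", "distracted", "colleague"], "out": ["coffee", "spill"]}

maximin = max(outcomes, key=lambda o: min(utilities[s] for s in outcomes[o]))
maximax = max(outcomes, key=lambda o: max(utilities[s] for s in outcomes[o]))

print(maximin)  # office (worst cases: 1 vs -20)
print(maximax)  # out    (best cases: 8 vs 10)
```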
2. The grid world is:

   3 |       |       |       |  +1
   2 |       |  ###  |       |  −1
   1 |       |       |       |
     |   1   |   2   |   3   |   4

(rows are y = 3, 2, 1 from top to bottom, columns are x = 1 to 4, and ### marks the obstacle at (2, 2))
(a) The formal description of this as an MDP is the following:
• States: (1, 1), (1, 2), (1, 3), (2, 1), (2, 3), …, (4, 3)
Note that the location (2, 2), where there is an obstacle, does not correspond to a state.
• Initial state: (1, 1).
• Actions: Up, Down, Left and Right for all states.
• Reward:
  R(s) =  1      for s = (4, 3)
         −1      for s = (4, 2)
         −0.04   otherwise
• Transition model, P(s′|s, a), for example:
  P((1, 2)|(1, 1), Up) = 0.8,  P((1, 1)|(1, 1), Up) = 0.1,  P((2, 1)|(1, 1), Up) = 0.1
  P((1, 1)|(1, 1), Down) = 0.9
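For concreteness, the description above can also be written down as data. The sketch below (the helper names are mine) encodes the states, the reward function and the 0.8/0.1/0.1 transition model, and reproduces the example probabilities for (1, 1):

```python
# A direct encoding of the 4x3 grid MDP described above (helper names are illustrative).
WALLS = {(2, 2)}
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) not in WALLS]

def reward(s):
    return {(4, 3): 1, (4, 2): -1}.get(s, -0.04)

MOVES = {"Up": (0, 1), "Down": (0, -1), "Left": (-1, 0), "Right": (1, 0)}
PERP = {"Up": ("Left", "Right"), "Down": ("Left", "Right"),
        "Left": ("Up", "Down"), "Right": ("Up", "Down")}

def step(s, a):
    """Where the agent ends up if the move in direction a succeeds (blocked moves stay put)."""
    nxt = (s[0] + MOVES[a][0], s[1] + MOVES[a][1])
    return nxt if nxt in STATES else s

def transition(s, a):
    """P(s'|s, a): 0.8 intended direction, 0.1 each perpendicular direction."""
    dist = {}
    for direction, p in [(a, 0.8), (PERP[a][0], 0.1), (PERP[a][1], 0.1)]:
        s2 = step(s, direction)
        dist[s2] = dist.get(s2, 0.0) + p
    return dist

print(transition((1, 1), "Up"))    # {(1, 2): 0.8, (1, 1): 0.1, (2, 1): 0.1}
print(transition((1, 1), "Down"))  # {(1, 1): 0.9, (2, 1): 0.1}
```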
(b) For a deterministic model, we no longer have to worry about the expected utility of an action, because we know it succeeds, and we have:
U(s) = R(s) + γ max_{a∈A(s)} U(s′), where s′ is the state that results from action a.
(c) We will assume that γ = 1, and set U(s) to 0 initially. Then, after the first round of value iteration using the above formula, we get:
   3 | −0.04 | −0.04 |  0.96 |  +1
   2 | −0.04 |  ###  | −0.04 |  −1
   1 | −0.04 | −0.04 | −0.04 | −0.04
     |   1   |   2   |   3   |   4
After a second round we get:
   3 | −0.08 |  0.92 |  0.96 |  +1
   2 | −0.08 |  ###  |  0.92 |  −1
   1 | −0.08 | −0.08 | −0.08 | −0.08
     |   1   |   2   |   3   |   4

so after 3 more rounds we will end up with:
   3 |  0.88 |  0.92 |  0.96 |  +1
   2 |  0.84 |  ###  |  0.92 |  −1
   1 |  0.80 |  0.84 |  0.88 |  0.84
     |   1   |   2   |   3   |   4
at which point the values will change no more.
See the spreadsheet (on KEATS) for detail on the calculations.
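The round-by-round figures above can also be reproduced with a few lines of code. This is a sketch of the deterministic update U(s) = R(s) + γ max_a U(s′) with γ = 1, holding the terminal states at +1 and −1 as in the figures (names are illustrative):

```python
# Deterministic value iteration on the 4x3 grid, gamma = 1, non-terminal U initialised to 0.
WALLS = {(2, 2)}
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) not in WALLS]
TERMINALS = {(4, 3): 1.0, (4, 2): -1.0}      # held fixed at their rewards, as in the figures
MOVES = [(0, 1), (0, -1), (-1, 0), (1, 0)]   # Up, Down, Left, Right

def step(s, move):
    """Deterministic successor; a blocked move leaves the agent where it is."""
    nxt = (s[0] + move[0], s[1] + move[1])
    return nxt if nxt in STATES else s

U = {s: 0.0 for s in STATES}
U.update(TERMINALS)
for _ in range(5):                           # five rounds are enough to converge here
    U = {s: U[s] if s in TERMINALS
            else -0.04 + max(U[step(s, m)] for m in MOVES)
         for s in STATES}
    # printing U inside the loop reproduces the -0.04 / 0.96 grid after round 1, and so on

for y in (3, 2, 1):                          # top row first; None marks the obstacle
    print([round(U[(x, y)], 2) if (x, y) in U else None for x in range(1, 5)])
```

Replacing the max with an arg max over the four moves at each state then gives the policy discussed in part (d).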
(d) At each state, the optimum policy is to pick the action with the highest expected utility. Since actions are deterministic in this case, that equates to deciding to move to the state with the highest utility (make sure you understand why this is the case). As a result, we have:
[Figure: the optimal policy, with an arrow in each non-terminal state pointing towards the neighbouring state of highest utility.]
Note that we assume the agent does not move once it has reached (4, 3) or (4, 2) as in the lecture.
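A sketch of that policy extraction, using the converged utilities from the grid above (the coordinates-to-utility table is simply typed in by hand here):

```python
# Greedy policy for the deterministic case: in each state, move towards the
# neighbouring state with the highest utility (converged values from above).
U = {(1, 1): 0.80, (2, 1): 0.84, (3, 1): 0.88, (4, 1): 0.84,
     (1, 2): 0.84,               (3, 2): 0.92, (4, 2): -1.0,
     (1, 3): 0.88, (2, 3): 0.92, (3, 3): 0.96, (4, 3): 1.0}
MOVES = {"Up": (0, 1), "Down": (0, -1), "Left": (-1, 0), "Right": (1, 0)}
TERMINALS = {(4, 3), (4, 2)}

def successor(s, move):
    nxt = (s[0] + move[0], s[1] + move[1])
    return nxt if nxt in U else s            # blocked moves stay put

policy = {s: max(MOVES, key=lambda a: U[successor(s, MOVES[a])])
          for s in U if s not in TERMINALS}
print(policy[(1, 1)], policy[(3, 2)], policy[(4, 1)])
# Up Up Left  (at (1, 1), Up and Right tie: both neighbours have utility 0.84)
```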
3. (a)
My spreadsheet tells me that when the values have converged, we have (rounding to two decimal places):
   3 |  0.87 |  0.91 |  0.96 |  +1
   2 |  0.82 |  ###  |  0.91 |  −1
   1 |  0.78 |  0.82 |  0.87 |  0.82
     |   1   |   2   |   3   |   4

These values are for γ = 1.
The calculation for the first iteration, for (3, 3) is:
U((3, 3)) = −0.04 + γ max( 1 · 0,                 Up
                           0.9 · 0 + 0.1 · 0,     Down
                           0.9 · 0 + 0.1 · 0,     Left
                           0.9 · 1 + 0.1 · 0 )    Right
and so on for the rest of the states.
See the spreadsheet (on KEATS) for the rest of intermediate calculations.
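The converged table can also be reproduced in code. The (3, 3) calculation above suggests the motion model used in this question is: the intended move succeeds with probability 0.9 and the agent stays put with probability 0.1 (a blocked move also leaves the agent in place); that reading is an assumption on my part. Under it, a sketch of value iteration with γ = 1:

```python
# Value iteration on the 4x3 grid with the motion model assumed above:
# the intended move succeeds with probability 0.9, otherwise the agent stays put.
WALLS = {(2, 2)}
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) not in WALLS]
TERMINALS = {(4, 3): 1.0, (4, 2): -1.0}
MOVES = [(0, 1), (0, -1), (-1, 0), (1, 0)]   # Up, Down, Left, Right

def step(s, move):
    nxt = (s[0] + move[0], s[1] + move[1])
    return nxt if nxt in STATES else s       # blocked moves stay put

def q_value(U, s, move):
    return 0.9 * U[step(s, move)] + 0.1 * U[s]

U = {s: 0.0 for s in STATES}
U.update(TERMINALS)
for _ in range(100):                         # plenty of sweeps for 2-decimal convergence
    U = {s: U[s] if s in TERMINALS
            else -0.04 + max(q_value(U, s, m) for m in MOVES)
         for s in STATES}

for y in (3, 2, 1):                          # reproduces the 0.87 / 0.91 / 0.96 table above
    print([round(U[(x, y)], 2) if (x, y) in U else None for x in range(1, 5)])
```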
(b) The computation of the best action in each state parallels the calculation above. For (3, 3) we want to establish the maximum of the expected values for the 4 possible actions:
1 · 0.96                    Up
0.9 · 0.91 + 0.1 · 0.96     Down
0.9 · 0.91 + 0.1 · 0.96     Left
0.9 · 1 + 0.1 · 0.96        Right
so Right will be the best action.
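In code, the comparison for (3, 3) is just the following (using the converged utilities from the table and the same assumed motion model):

```python
# Expected value of each action at (3, 3), using the converged utilities above
# and the assumed motion model (0.9 intended move, 0.1 stay put).
U = {"(3,3)": 0.96, "(2,3)": 0.91, "(3,2)": 0.91, "(4,3)": 1.0}

q = {
    "Up":    1.0 * U["(3,3)"],                        # blocked, so the agent stays put
    "Down":  0.9 * U["(3,2)"] + 0.1 * U["(3,3)"],
    "Left":  0.9 * U["(2,3)"] + 0.1 * U["(3,3)"],
    "Right": 0.9 * U["(4,3)"] + 0.1 * U["(3,3)"],
}
print(max(q, key=q.get))   # Right
```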

Repeating this for all the states, we quickly find that the optimum policy is the same as in the deterministic case (though the expected utility of the policy will be lower, since the agent will spend more time in lower-utility states, and will take more steps to get to the goal).
What that suggests is that for a motion model like the one we have here, the deterministic version of value iteration is a pretty good approximation of the non-deterministic version (though the difference in the utilities is obvious from the figures).
4. (a)
U((3, 1)) = −0.04 + γ max( 0.8 · 0.660 + 0.1 · 0.655 + 0.1 · 0.388,     Up
                           0.8 · 0.611 + 0.1 · 0.655 + 0.1 · 0.388,     Down
                           0.8 · 0.655 + 0.1 · 0.660 + 0.1 · 0.611,     Left
                           0.8 · 0.388 + 0.1 · 0.660 + 0.1 · 0.611 )    Right
which reduces to:
U((3, 1)) = −0.04 + γ max( 0.6323     Up
                           0.5931     Down
                           0.6511     Left
                           0.4375 )   Right
which comes to 0.611.
This is the same as the utility of the state, which is exactly what we would expect from the Bellman equation once the state utilities are those under the optimum policy (which is what happens when value iteration terminates).
(b)
The numbers in the equation above are the expected utilities of the actions. So the action with the maximum expected utility is Left.
Note that this is not the same as moving towards the state with the maximum utility.
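The four expected utilities above are easy to reproduce; a sketch using the converged utilities quoted in the equation (0.660, 0.655, 0.611 and 0.388) and the 0.8/0.1/0.1 motion model:

```python
# Expected utility of each action at (3, 1), with the 0.8/0.1/0.1 motion model.
# Utilities of the states involved (the converged values used in the question).
U = {"(3,2)": 0.660, "(2,1)": 0.655, "(3,1)": 0.611, "(4,1)": 0.388}

q = {
    "Up":    0.8 * U["(3,2)"] + 0.1 * U["(2,1)"] + 0.1 * U["(4,1)"],
    "Down":  0.8 * U["(3,1)"] + 0.1 * U["(2,1)"] + 0.1 * U["(4,1)"],  # Down is blocked: stay at (3, 1)
    "Left":  0.8 * U["(2,1)"] + 0.1 * U["(3,2)"] + 0.1 * U["(3,1)"],
    "Right": 0.8 * U["(4,1)"] + 0.1 * U["(3,2)"] + 0.1 * U["(3,1)"],
}
for a, v in q.items():
    print(a, round(v, 4))                    # 0.6323, 0.5931, 0.6511, 0.4375

best = max(q, key=q.get)
print(best, round(-0.04 + q[best], 3))       # Left 0.611
```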
(c) i.
For the maximin policy, we look at the worst outcome of each action, and pick the action which maximises this worst outcome. We get:
[Figure: the maximin policy.]
The policy makes sure that in every state the agent has no chance of entering (4, 2) while making sure that it will eventually get to (4, 3).
ii. For the maximax policy, we look at the best outcome of each action and pick the action which
maximises this. Because every action has three possible outcomes, there are always three actions which can lead to the maximum utility. Since maximax ignores the probability of outcomes, it treats all of these actions as equally good:
[Figure: the maximax policy, showing the three equally-rated actions in each state.]
The only action that is not selected by maximax is the one that points away from the state with the maximum utility.
Of course, an agent can’t pick three actions, so any agent that wanted to employ maximax in this environment would have to pick one of the three actions.
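To see both rules operating at a single state, here is a sketch for (3, 1), using the same converged utilities and taking the three possible outcome states of each action (intended direction plus the two perpendicular slips); the layout of the outcome sets is mine:

```python
# Maximin and maximax action choice at (3, 1): each action has three possible
# outcome states (intended direction plus the two perpendicular slips).
U = {"(3,2)": 0.660, "(2,1)": 0.655, "(3,1)": 0.611, "(4,1)": 0.388}

outcomes = {                                 # possible resulting states for each action at (3, 1)
    "Up":    ["(3,2)", "(2,1)", "(4,1)"],
    "Down":  ["(3,1)", "(2,1)", "(4,1)"],    # Down is blocked, so the agent may stay put
    "Left":  ["(2,1)", "(3,2)", "(3,1)"],
    "Right": ["(4,1)", "(3,2)", "(3,1)"],
}

maximin = max(outcomes, key=lambda a: min(U[s] for s in outcomes[a]))
maximax_best = max(max(U[s] for s in outcomes[a]) for a in outcomes)
maximax_set = [a for a in outcomes if max(U[s] for s in outcomes[a]) == maximax_best]

print(maximin)       # Left  (its worst outcome, 0.611, is the best worst case)
print(maximax_set)   # ['Up', 'Left', 'Right'] -- three actions can reach (3, 2)
```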
5. There is no question 5.