CIS 471/571: Introduction to Artificial Intelligence, Fall 2020
Written Assignment 3: Solution Deadline: Nov 10th, 2020
Instruction: You may discuss these problems with classmates, but please complete the write-ups individually. (This applies to BOTH undergraduates and graduate students.) Remember the collaboration guidelines set forth in class: you may meet to discuss problems with classmates, but you may not take any written notes (or electronic notes, or photos, etc.) away from the meeting. Your answers must be typewritten, except for figures or diagrams, which may be hand-drawn.
Q1. MDPs – Value Iteration (25 points)
An agent lives in grid world G consisting of grid cells s ∈ S, and is not allowed to move into the cells colored black. In this grid world, the agent can take actions to move to neighboring squares when it is not on a numbered square. When the agent is on a numbered square, it is forced to exit to a terminal state (where it remains), collecting a reward equal to the number written on the square in the process. You decide to run value iteration for grid world G.
Figure 1: Grid world G
The value function at iteration k is Vk(s). The initial value for all grid cells is 0 (that is, V0(s) = 0 for all s). When answering questions about iteration k for Vk(s), either answer with a finite integer or ∞. For all questions, the discount factor is γ = 1.
Q1.1. (18 points) Consider running value iteration in grid world G. Assume all legal movement actions will always succeed (and so the state transition function is deterministic).
1. What is the smallest iteration the value of Vk(A)?
Answer. k = 3, Vk(A) = 10 2. What is the smallest iteration
the value of Vk(B)? Answer. k = 3, Vk(B) = 1
3. What is the smallest iteration
Answer. k = 3, V∗(A) = 10 4. What is the smallest iteration
Answer. k=6,V∗(B)=10
k for which Vk(A) > 0? For this smallest iteration k, what is
k for which Vk(B) > 0? For this smallest iteration k, what is
k for which Vk(A) = V ∗(A)? What is the value of V ∗(A)?
k for which Vk(B) = V ∗(B)? What is the value of V ∗(B)?
Q1.2. (7 points) Now assume all legal movement actions succeed with probability 0.8; with probability 0.2, the action fails and the agent remains in the same state. Consider running value iteration in grid world G. What is the smallest iteration k for which Vk(A) = V∗(A)? What is the value of V∗(A)?
Answer. k = ∞, V∗(A) = 10. Because γ = 1 and the only rewards are in the exit states, the optimal policy will move to the exit state with the highest reward. This is guaranteed to ultimately succeed, so the optimal value of state A is 10. However, because the transitions are non-deterministic, it is not guaranteed that this reward can be collected in 3 steps. It could take any number of steps from 3 up to infinity, so the values only converge after infinitely many iterations.
Note. In practice, we can use a stopping condition: for example, if |Vk+1(A) − Vk(A)| < ε for some given ε, then we stop the iteration process. If your answer specifies an ε and finds the smallest k according to that ε, it is accepted as well.
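For reference, here is a minimal Python sketch of value iteration with such an ε-based stopping condition. The states, actions, T, and R structures are hypothetical placeholders, not the actual grid world G:

def value_iteration(states, actions, T, R, gamma=1.0, eps=1e-3, max_iters=10000):
    # T[(s, a)] is a list of (next_state, prob) pairs and R[(s, a, s2)] is the
    # reward -- hypothetical data structures standing in for the real MDP.
    V = {s: 0.0 for s in states}                      # V_0(s) = 0 for all s
    for _ in range(max_iters):
        V_new = {}
        for s in states:
            acts = actions(s)
            if not acts:                              # terminal state: value stays 0
                V_new[s] = 0.0
                continue
            V_new[s] = max(
                sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])
                for a in acts
            )
        if max(abs(V_new[s] - V[s]) for s in states) < eps:
            return V_new                              # converged to within eps
        V = V_new
    return V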
Q2. MDPs - Policy Iteration (25 points)
In a car race, a car robot has three states {cool, warm, overheated}. In each state, the robot can either move fast or slow. In state overheated, the robot has to exit the race and receives a reward of 0. The transition function and reward function are illustrated in the following figure.
Consider running policy iteration with the following initial policy: π0(cool) = fast, π0(warm) = fast and π0(overheated) = ∅. The discount factor is γ = 0.1.
[Figure: transition and reward diagram for the car robot MDP — states cool, warm, overheated; actions slow, fast; rewards +2, +1, and −1.]
Q2.1. (12.5 points) What are the values of the states under the fixed policy π0: V^π0(cool), V^π0(warm), V^π0(overheated)? What is the policy πi at iteration i = 1?
Answer. There are two ways to compute the utilities of the states: a value iteration-based approach and an equation system approach.
Value iteration-based approach. We have V^π0_0(cool) = V^π0_0(warm) = 0. Note that for the value iteration-based approach, you can set a stopping condition, for example: if |V^π0_{k+1}(cool) − V^π0_k(cool)| < ε with ε = 0.001, then stop the iteration.
• It can easily be verified that V^π0_k(warm) = 1.0 × [−1 + 0.1 × 0] = −1 for all k ≥ 1.
• V^π0_1(cool) = 0.5 × [2 + 0.1 × V^π0_0(cool)] + 0.5 × [2 + 0.1 × V^π0_0(warm)] = 2
• V^π0_2(cool) = 0.5 × [2 + 0.1 × 2] + 0.5 × [2 + 0.1 × (−1)] = 2.05
• V^π0_3(cool) = 0.5 × [2 + 0.1 × 2.05] + 0.5 × [2 + 0.1 × (−1)] = 2.0525
• V^π0_4(cool) = 0.5 × [2 + 0.1 × 2.0525] + 0.5 × [2 + 0.1 × (−1)] = 2.052625
• Given the stopping condition, V^π0(cool) ≈ 2.052625 since 2.052625 − 2.0525 < 0.001.
Equation system approach. We have the following equation system:
V^π0(warm) = −1.0
V^π0(cool) = 0.5 × [2 + 0.1 × V^π0(cool)] + 0.5 × [2 + 0.1 × V^π0(warm)]
=⇒ 0.95 × V^π0(cool) = 1.95 =⇒ V^π0(cool) = 1.95/0.95 ≈ 2.0526.
In both approaches, V^π0(overheated) = 0, since the exit action gives reward 0. Policy improvement with respect to V^π0 then gives π1(cool) = fast and π1(warm) = slow: from warm, fast leads straight to overheated (Q^π0(warm, fast) = −1) while slow has a positive Q-value, and from cool, fast remains the better action.
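The 2×2 linear system above can also be solved numerically; a minimal sketch with NumPy:

import numpy as np

# Fixed-policy (pi_0) Bellman equations rearranged into A x = b form,
# with unknowns ordered as [V(warm), V(cool)]:
#    1.00 * V(warm)                  = -1
#   -0.05 * V(warm) + 0.95 * V(cool) =  2
A = np.array([[1.0, 0.0],
              [-0.05, 0.95]])
b = np.array([-1.0, 2.0])

V_warm, V_cool = np.linalg.solve(A, b)
print(V_warm, V_cool)   # -> -1.0  2.0526...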
Q2.2. (12.5 points) What is the optimal policy π∗ for the robot car? What are the values of V ∗(cool),V ∗(warm), and V ∗(overheated)?
Answer. The optimal policy is π∗(cool) = fast, π∗(warm) = slow. The values of the states are V∗(cool) = 13/6 ≈ 2.1667, V∗(warm) = 7/6 ≈ 1.1667, and V∗(overheated) = 0.
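As a sanity check, the sketch below runs value iteration on this MDP with γ = 0.1. The (probability, next state, reward) entries are my reading of the figure (in particular, slow from cool is assumed to stay in cool with reward +1), so treat the model as an assumption rather than the official one:

# Sketch: value iteration on the car robot MDP with gamma = 0.1.
T = {
    ('cool', 'slow'): [(1.0, 'cool', 1)],           # assumed from the figure
    ('cool', 'fast'): [(0.5, 'cool', 2), (0.5, 'warm', 2)],
    ('warm', 'slow'): [(0.5, 'cool', 1), (0.5, 'warm', 1)],
    ('warm', 'fast'): [(1.0, 'overheated', -1)],
}
gamma = 0.1
V = {'cool': 0.0, 'warm': 0.0, 'overheated': 0.0}
for _ in range(100):
    new_V = {'overheated': 0.0}                     # exiting gives reward 0
    for s in ('cool', 'warm'):
        new_V[s] = max(sum(p * (r + gamma * V[s2]) for p, s2, r in T[(s, a)])
                       for a in ('slow', 'fast'))
    V = new_V
print(V)   # cool -> 13/6 ~ 2.1667, warm -> 7/6 ~ 1.1667, overheated -> 0.0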
Q3. Model-based RL (10 points)
An agent lives in a grid world as shown in the figure below. The agent tries out a policy π which is indicated by the arrows in the figure. After four trials, the agent observes four episodes.
What model would be learned from the above observed episodes (transition and reward functions)?
Answer. T(A, south, C) = 1, T(C, south, E) = 0.75, T(C, south, D) = 0.25, T(B, east, C) = 1. In addition, R(A, south, C) = −1, R(C, south, E) = −1, R(B, east, C) = −1, and R(C, south, D) = −1.
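For concreteness, a short sketch of how such a model is tallied from experience. The transitions list below is a hypothetical reconstruction consistent with the counts in the answer (the actual episodes appear only in the figure):

from collections import Counter

# Hypothetical (s, a, s', r) tuples consistent with the counts above.
transitions = [
    ('A', 'south', 'C', -1), ('C', 'south', 'E', -1),
    ('B', 'east',  'C', -1), ('C', 'south', 'E', -1),
    ('B', 'east',  'C', -1), ('C', 'south', 'D', -1),
    ('C', 'south', 'E', -1),
]

counts = Counter((s, a, s2) for s, a, s2, _ in transitions)   # outcome counts
totals = Counter((s, a) for s, a, _, _ in transitions)        # (s, a) visit counts
rewards = {(s, a, s2): r for s, a, s2, r in transitions}      # observed rewards

# Maximum-likelihood transition estimates: count(s, a, s') / count(s, a).
T_hat = {(s, a, s2): n / totals[(s, a)] for (s, a, s2), n in counts.items()}
print(T_hat[('C', 'south', 'E')])    # 0.75
print(rewards[('A', 'south', 'C')])  # -1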
Q4. RL - Direct Evaluation (10 points)
Consider the same problem as in Q3. What are the estimates for Vˆπ(A), Vˆπ(B), Vˆπ(C), Vˆπ(D), and Vˆπ(E) as obtained by direct evaluation? Assume the discount factor γ = 0.8.
Answer.
Vˆπ(A) = −1 + 0.8 × (−1) + (0.8)^2 × 10 = 4.6
Vˆπ(B) = (1/2) × (−1 + 0.8 × (−1) + 0.64 × (−10)) + (1/2) × (−1 + 0.8 × (−1) + 0.64 × 10) = −1.8
Vˆπ(C) = (3/4) × (−1 + 0.8 × 10) + (1/4) × (−1 + 0.8 × (−10)) = 3
Vˆπ(D) = −10
Vˆπ(E) = 10
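These estimates are simply per-state averages of the observed discounted returns. A generic sketch of direct evaluation; the episode format (a list of (state, reward-on-leaving) pairs) is a hypothetical convention, not the one used in the figure:

from collections import defaultdict

def direct_evaluation(episodes, gamma=0.8):
    # episodes: list of episodes; each episode is a list of (state, reward)
    # pairs, where the reward is the one received on leaving that state.
    returns = defaultdict(list)
    for ep in episodes:
        G = 0.0
        for state, reward in reversed(ep):    # accumulate the discounted return
            G = reward + gamma * G
            returns[state].append(G)
    # Direct evaluation: average the observed returns for every visited state.
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}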
Q5. RL - Temporal Difference Learning (6 points)
Consider the grid world shown below. The left panel shows the name of each state A through E. The middle panel shows the current estimate of the value function Vπ for each state. A transition is observed that takes the agent from state B, through taking action east, into state C, and the agent receives a reward of −2. Assuming γ = 0.8 and α = 0.75, what are the value estimates of Vˆπ(A), Vˆπ(B), Vˆπ(C), Vˆπ(D), and Vˆπ(E) after the TD learning update?
Answer.
• Vˆπ(A) = 1
• Vˆπ(B) = 3.8
• Vˆπ(C) = 8
• Vˆπ(D) = 10
• Vˆπ(E) = 10
The only value that gets updated is Vˆπ(B), because the only transition observed starts in state B:
Vˆπ(B) = 0.25 × 2 + 0.75 × (−2 + 0.8 × 8) = 3.8.
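A minimal sketch of this single TD update, using the pre-update estimates read off the figure (Vˆπ(B) = 2 and the other four values as listed above):

def td_update(V, s, r, s_next, alpha=0.75, gamma=0.8):
    # One temporal-difference update after observing (s, action, r, s_next).
    V[s] = (1 - alpha) * V[s] + alpha * (r + gamma * V[s_next])

# Pre-update estimates from the figure; only B changes after the update.
V = {'A': 1, 'B': 2, 'C': 8, 'D': 10, 'E': 10}
td_update(V, 'B', -2, 'C')        # observed transition B --east--> C, reward -2
print(V['B'])                     # 0.25 * 2 + 0.75 * (-2 + 0.8 * 8) = 3.8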
Q6. Model-free Reinforcement Learning (12 points)
Consider an MDP with 3 states, A, B, and C, and 2 actions, Clockwise and Counterclockwise. We do not know the transition function or the reward function for the MDP; instead, we are given samples of what an agent actually experiences when it interacts with the environment (although we do know that the agent does not remain in the same state after taking an action). In this problem, instead of first estimating the transition and reward functions, we will directly estimate the Q function using Q-learning. Assume the discount factor γ is 0.8 and the step size for Q-learning, α, is 0.5.
Our current Q function, Q(s, a), is shown in the left figure. The agent encounters the samples shown in the right figure:
Provide the Q-values for all pairs of (state, action) after both samples have been accounted for.
Answer.
• Q(A, clockwise) = 1.501
• Q(A, counterclockwise) = 6.6685
• Q(B, clockwise) = −0.451
• Q(B, counterclockwise) = −6.055
• Q(C, clockwise) = 2.73
• Q(C, counterclockwise) = 3.7339
For each (s, a, s′, r) transition sample, you update the Q-value function as follows:
Q(s, a) = (1 − α)Q(s, a) + α(R(s, a, s′) + γ max_{a′} Q(s′, a′))
First, we update Q(A, counterclockwise) = 0.5 × 3.153 + 0.5 × (8 + 0.8 × 2.73) = 6.6685.
Then we update Q(C, counterclockwise) = 0.5 × 2.133 + 0.5 × (0 + 0.8 × 6.6685) = 3.7339.
Because there are only two samples, the other four values stay the same.
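The same two updates as a short script. The initial Q-table is taken from the left figure, and the two samples (reward 8 for A → C, reward 0 for C → A, both counterclockwise) are inferred from the arithmetic above:

def q_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.8):
    # Standard Q-learning update for one (s, a, r, s_next) sample.
    target = r + gamma * max(Q[(s_next, b)] for b in ('clockwise', 'counterclockwise'))
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target

# Initial Q-table (the four untouched values are the same ones listed above).
Q = {('A', 'clockwise'): 1.501, ('A', 'counterclockwise'): 3.153,
     ('B', 'clockwise'): -0.451, ('B', 'counterclockwise'): -6.055,
     ('C', 'clockwise'): 2.73,  ('C', 'counterclockwise'): 2.133}

q_update(Q, 'A', 'counterclockwise', 8, 'C')   # -> 6.6685
q_update(Q, 'C', 'counterclockwise', 0, 'A')   # -> 3.7339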
Q7. RL - Feature-based Representation (12 points)
Consider the following feature-based representation of the Q-function: Q(s, a) = w1 f1(s, a) + w2 f2(s, a), with:
• f1(s, a) = 1 / (Manhattan distance to the nearest dot after having executed action a in state s)
• f2(s, a) = (Manhattan distance to the nearest ghost after having executed action a in state s)
Q7.1. Assume w1 = 3 and w2 = 8. Assume that the red and blue ghosts are both sitting on top of a dot. Provide the values of Q(s, west) and Q(s, south).
Based on this approximate Q-function, which action would be chosen?
Answer.
• Q(s, west) = 3 × 1 + 8 × 3 = 27
• Q(s, south) = 3 × 1 + 8 × 1 = 11
• The chosen action is west since 27 > 11.
Q7.2. Assume Pac-Man moves West. This results in the state s′ shown below. Pac-Man receives reward 9 (10 for eating a dot and -1 living penalty).
Provide the values of Q(s′, west) and Q(s′, east). What is the sample value (assuming γ = 0.8)?
Answer.
• Q(s′, west) = 3 × 1 + 8 × 1 = 11
• Q(s′, east) = 3 × 1 + 8 × 1 = 11
• sample = [r + γ × max_{a′} Q(s′, a′)] = 9 + 0.8 × 11 = 17.8
Q7.3. Now provide the update to the weights. Let α = 0.5.
Answer. w1 = −1.6 and w2 = −5.8.
Explanation. difference = [r + γ max_{a′} Q(s′, a′)] − Q(s, a) = 17.8 − 27 = −9.2. Therefore,
w1 = w1 + α × (difference) × f1(s, a) = 3 + 0.5 × (−9.2) × 1 = −1.6
w2 = w2 + α × (difference) × f2(s, a) = 8 + 0.5 × (−9.2) × 3 = −5.8
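A small sketch of the Q7.2–Q7.3 computation, using the feature values from above (f(s, west) = (1, 3) and max_{a′} Q(s′, a′) = 11):

# Approximate Q-learning weight update for the observed Pac-Man transition.
w = [3.0, 8.0]            # current weights (w1, w2)
f_s_west = [1.0, 3.0]     # feature values f1, f2 for the action taken (west)

alpha, gamma = 0.5, 0.8
q_s_west = sum(wi * fi for wi, fi in zip(w, f_s_west))   # 27
sample = 9 + gamma * 11                                  # 17.8
difference = sample - q_s_west                           # -9.2

w = [wi + alpha * difference * fi for wi, fi in zip(w, f_s_west)]
print(w)   # -> approximately [-1.6, -5.8]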
Q8. Strange MDPs (Graduates Only) (20 points)
In this MDP, the available actions at states A, B, C are LEFT, RIGHT, UP, and DOWN unless there is a wall in that direction. The only action at state D is the EXIT ACTION, which gives the agent a reward of x. The reward for non-exit actions is always 1.
Q8.1 (10 points) Let all actions be deterministic. Assume γ = 0.5. Express the values of states V ∗(D), V ∗(A), V ∗(C), V ∗(B) in terms of x.
Answer. We have:
V ∗(D) = x
V ∗(A) = max{1 + 0.5x, 2}
V ∗(C) = max{1 + 0.5x, 2}
V ∗(B) = max{1 + 0.5(1 + 0.5x), 2}
The 2 comes from the utility of collecting the non-exit reward of 1 forever, which is an infinite geometric sum of discounted rewards: 1/(1 − 1/2) = 2. The idea behind the above equations is to exploit the symmetry between A and C. First, D has only one action to take, which is to exit the game. For B, by symmetry, Q∗(B, LEFT) must be the same as Q∗(B, DOWN); as a result, either LEFT or DOWN can be considered an optimal action to take from B. For A and C, if the optimal action to take from A is DOWN, then the optimal action to take from C is LEFT; similarly, if the optimal action to take from A is RIGHT, then the optimal action to take from C is UP. Based on this analysis, there are only two policies we should consider:
• Policy 1: π(D) = exit, π(A) = DOWN, π(C) = LEFT, π(B) = LEFT
• Policy 2: π(D) = exit, π(A) = RIGHT, π(C) = UP, π(B) = LEFT
As a result, you only have to compute the values of states according to these policies and take the maximum between the two.
Q8.2 (10 points) Let any non-exit action succeed with probability 0.5. Otherwise, the action fails and the agent stays in the same state with reward 0. The EXIT ACTION from state D is still deterministic and will always succeed. Assume that γ = 0.5. For which value of x does Q∗(A, DOWN) = Q∗(A, RIGHT)?
Answer. Q∗(A, DOWN) = Q∗(A, RIGHT) implies V∗(A) = Q∗(A, DOWN) = Q∗(A, RIGHT). Therefore,
V∗(A) = Q∗(A, DOWN) = (1/2)(0 + (1/2)V∗(A)) + (1/2)(1 + (1/2)x) = 1/2 + (1/4)V∗(A) + (1/4)x   (1)
=⇒ V∗(A) = 2/3 + (1/3)x   (2)
V∗(A) = Q∗(A, RIGHT) = (1/2)(0 + (1/2)V∗(A)) + (1/2)(1 + (1/2)V∗(B)) = 1/2 + (1/4)V∗(A) + (1/4)V∗(B)   (3)
=⇒ V∗(A) = 2/3 + (1/3)V∗(B)   (4)
Because Q∗(B, LEFT) and Q∗(B, DOWN) are symmetric decisions, V∗(B) = Q∗(B, LEFT). Therefore,
V∗(B) = (1/2)(0 + (1/2)V∗(B)) + (1/2)(1 + (1/2)V∗(A)) = 1/2 + (1/4)V∗(B) + (1/4)V∗(A)   (5)
=⇒ V∗(B) = 2/3 + (1/3)V∗(A)   (6)
From (4) and (6) we obtain V∗(A) = V∗(B) = 1. From (2), we obtain x = 1.
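A quick numerical check of equations (2), (4), and (6), treating V∗(A), V∗(B), and x as the unknowns:

import numpy as np

# Unknowns ordered as [V*(A), V*(B), x]; rows are equations (2), (4), (6)
# rewritten as V(A) - x/3 = 2/3, V(A) - V(B)/3 = 2/3, V(B) - V(A)/3 = 2/3.
A = np.array([[1.0,  0.0, -1/3],
              [1.0, -1/3,  0.0],
              [-1/3, 1.0,  0.0]])
b = np.array([2/3, 2/3, 2/3])
print(np.linalg.solve(A, b))   # -> [1. 1. 1.], i.e. V*(A) = V*(B) = x = 1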