
Spring ‘22 AI Homework 8: solutions
Gridworld (4×4 – skipped O as too similar to 0)
Similar to the maze from class, in Gridworld you can make 4 possible moves: up, down, left, right. However, if you are at an edge or corner, moves into the boundary are not lost; they simply transition back to the same state, and multiple such "self" moves are merged into one.
E.g., the actions from M are {I, N, M} and from P are {K, N, P, Q}. You may either:


● Use a spreadsheet (Note: search for "iterative calculation" for information on setting up Excel or Sheets to allow recursive formulas.)
● Use your Lab 3 solver
● Use a closed form algebraic computation
Please indicate which you did in case of small numerical discrepancies.
Print answers to 3 decimal places, using 150 iterations and a tolerance of 0.001.
Answers are accepted as correct if within 0.01.
Remember: α is the "transition failure probability" for a decision node, so α = 0.15 means the intended transition succeeds with probability 0.85.
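As a concrete (and entirely optional) illustration, the short Python sketch below builds the merged successor sets for this grid and shows how the α = 0.15 failure mass gets split. The grid encoding, function names, and the even split of α over the non-chosen successors are my own choices, but they match the equations used in the solutions further down.

==== optional Python sketch: successor sets and α ====
# Illustrative helper, not part of the assignment: build the Gridworld
# successor sets and show how alpha = 0.15 splits probability at a decision node.

GRID = ["ABCD",
        "EFGH",
        "IJKL",
        "MNPQ"]  # 4x4 layout; the letter O is skipped

def successors(state):
    """Successors for up/down/left/right; moves off the grid transition to
    self, and duplicate self-moves are merged (a set does the merging)."""
    r = next(i for i, row in enumerate(GRID) if state in row)
    c = GRID[r].index(state)
    succ = set()
    for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        nr, nc = r + dr, c + dc
        succ.add(GRID[nr][nc] if 0 <= nr < 4 and 0 <= nc < 4 else state)
    return succ

print(successors("M"))   # the set {I, N, M} (print order may vary)
print(successors("P"))   # the set {K, N, P, Q} (print order may vary)

def decision_probs(state, intended, alpha=0.15):
    """The chosen successor is reached with probability 1 - alpha = 0.85;
    the remaining alpha is split evenly over the other successors."""
    others = sorted(successors(state) - {intended})
    return {intended: 1 - alpha, **{t: alpha / len(others) for t in others}}

print(decision_probs("P", "Q"))  # Q: 0.85; K, N, P: 0.05 each (modulo float rounding)
==== end of sketch ====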
You will use the Bellman equation: the utility of a state is equal to the immediate reward for that state, plus the discounted utility of future state(s).
v(s) = R(s) + 𝛾 * P * v(s)
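Expanded per state (the form the question-1 and question-2 equations below take), the chance-node and decision-node updates are:

v(s) = R(s) + 𝛾 * Σ_{s'} P(s'|s) * v(s')            (chance node: fixed transition probabilities)
v(s) = R(s) + 𝛾 * max_a Σ_{s'} P(s'|s,a) * v(s')     (decision node: take the best action)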
Questions
1. [5 pts] Assuming that:
○ A and Q are terminal states with a reward of 2
○ all other states give a reward of -3 and are chance nodes with uniform random probabilities (.25 for each of the 4 transitions, roughly .333 for corners, etc.)

Solve the value function as a Markov reward process. Print the 14 non-terminal values.
2. [5 pts] Assuming that:
○ A and Q are terminal states with a reward of 15 and -15 respectively
○ states J and G give a reward of 3 and are chance nodes with uniform random probabilities
○ all other states give a reward of -1 and are decision nodes
Using a discount factor 𝛾 of 0.9 and an α of 0.15 (the decision-node probability of transition failure), solve the MDP using value iteration and greedy policy computation.
Print out the learned policy, e.g. {F -> E, K -> J, …} and also the 14 non-terminal values under that policy.
1. You did not need to write the equations, but they are:
A = 2
B = -3 + A/4 + B/4 + C/4 + F/4
C = -3 + B/4 + C/4 + D/4 + G/4
D = -3 + C/3 + D/3 + H/3
E = -3 + A/4 + E/4 + F/4 + I/4
F = -3 + B/4 + E/4 + G/4 + J/4
G = -3 + C/4 + F/4 + H/4 + K/4
H = -3 + D/4 + G/4 + H/4 + L/4
I = -3 + E/4 + I/4 + J/4 + M/4
J = -3 + F/4 + I/4 + K/4 + N/4
K = -3 + G/4 + J/4 + L/4 + P/4
L = -3 + H/4 + K/4 + L/4 + Q/4
M = -3 + I/3 + M/3 + N/3
N = -3 + M/4 + N/4 + J/4 + P/4
P = -3 + N/4 + K/4 + P/4 + Q/4
Q = 2
I used my MDP solver with “-iter 150 -tol 0.001“, and for the 3-edge corners I set .3333 for the outbound edges and .3334 for the self edge. [Changing this around leads to changes in the 0.001 range.]
B=-38.489 C=-55.359 D=-59.858 E=-38.489 F=-50.111 G=-55.734 H=-55.359 I=-55.359 J=-55.734 K=-50.111 L=-38.489 M=-59.857 N=-55.359 P=-38.489
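For anyone not using a spreadsheet or the Lab 3 solver, here is a minimal Python sketch of the same fixed-point evaluation of the equations above. The variable names are illustrative, and it uses exact 1/3 probabilities at the corners instead of .3333/.3334, so the numbers can differ from mine in the last decimal place.

==== optional Python sketch for question 1 ====
# Evaluate the question-1 Markov reward process by repeatedly applying the
# equations above (in-place sweeps, same iteration cap and tolerance as my solver).

NEIGHBORS = {            # merged successor sets; A and Q are terminal
    "B": "ABCF", "C": "BCDG", "D": "CDH",  "E": "AEFI",
    "F": "BEGJ", "G": "CFHK", "H": "DGHL", "I": "EIJM",
    "J": "FIKN", "K": "GJLP", "L": "HKLQ", "M": "IMN",
    "N": "JMNP", "P": "KNPQ",
}
v = {s: 0.0 for s in NEIGHBORS}
v.update({"A": 2.0, "Q": 2.0})        # terminal values equal their reward

for _ in range(150):
    delta = 0.0
    for s, succ in NEIGHBORS.items():
        new = -3.0 + sum(v[t] for t in succ) / len(succ)   # uniform chance node
        delta = max(delta, abs(new - v[s]))
        v[s] = new
    if delta < 0.001:
        break

print({s: round(val, 3) for s, val in v.items() if s not in "AQ"})
==== end of sketch ====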

2. Again, the equations with max (noting that in practice you handle the max by alternating policy computation and policy evaluation, and that the 𝛾 = 0.9 discount multiplies the expected successor value) are:
A = 15
B = -1 + .9 * (.8 * max(A,B,C,F) + .05 * (A + B + C + F))
C = -1 + .9 * (.8 * max(B,C,D,G) + .05 * (B + C + D + G))
D = -1 + .9 * (.775 * max(C,D,H) + .075 * (C + D + H))
E = -1 + .9 * (.8 * max(A,E,F,I) + .05 * (A + E + F + I))
F = -1 + .9 * (.8 * max(B,E,G,J) + .05 * (B + E + G + J))
G = 3 + .9 * (C/4 + F/4 + H/4 + K/4)
H = -1 + .9 * (.8 * max(D,G,H,L) + .05 * (D + G + H + L))
I = -1 + .9 * (.8 * max(E,I,J,M) + .05 * (E + I + J + M))
J = 3 + .9 * (F/4 + I/4 + K/4 + N/4)
K = -1 + .9 * (.8 * max(G,J,L,P) + .05 * (G + J + L + P))
L = -1 + .9 * (.8 * max(H,K,L,Q) + .05 * (H + K + L + Q))
M = -1 + .9 * (.775 * max(I,M,N) + .075 * (I + M + N))
N = -1 + .9 * (.8 * max(M,N,J,P) + .05 * (M + N + J + P))
P = -1 + .9 * (.8 * max(K,N,P,Q) + .05 * (K + N + P + Q))
Q = -15
I used my MDP solver with “-iter 150 -tol 0.001 -df .9“. It took about 4 top-level iterations to reach the optimal policy.
{B -> A, C -> B, D -> C, E -> A, F -> E, H -> G, I -> E, K -> G, L -> K, M -> I, N -> J, P -> K}
A=15.000 B=11.860 C=9.311 D=7.173 E=11.860 F=9.600 G=11.037 H=8.385 I=9.311 J=11.037 K=8.424 L=5.389 M=7.173 N=8.385 P=5.389 Q=-15.000
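For completeness, here is a minimal Python sketch of the same computation: value iteration followed by a greedy policy read-out. The state/action encoding and function names are my own, and it assumes the α = 0.15 failure model defined above (0.85 to the chosen successor, 0.15 split evenly over the others), so it should land on the same values and policy up to tie-breaking (e.g. F can aim at B or E with equal value) and the stated tolerance.

==== optional Python sketch for question 2 ====
# Value iteration with gamma = 0.9 and alpha = 0.15, then a greedy policy.

NEIGHBORS = {
    "B": "ABCF", "C": "BCDG", "D": "CDH",  "E": "AEFI",
    "F": "BEGJ", "G": "CFHK", "H": "DGHL", "I": "EIJM",
    "J": "FIKN", "K": "GJLP", "L": "HKLQ", "M": "IMN",
    "N": "JMNP", "P": "KNPQ",
}
CHANCE = {"G", "J"}                       # uniform random chance nodes, reward 3
GAMMA, ALPHA = 0.9, 0.15

v = {s: 0.0 for s in NEIGHBORS}
v.update({"A": 15.0, "Q": -15.0})         # terminal values

def q_value(s, a):
    """Discounted expected successor value of aiming at a from state s."""
    others = [t for t in NEIGHBORS[s] if t != a]
    expected = (1 - ALPHA) * v[a] + sum(ALPHA / len(others) * v[t] for t in others)
    return GAMMA * expected

for _ in range(150):
    delta = 0.0
    for s, succ in NEIGHBORS.items():
        if s in CHANCE:
            new = 3.0 + GAMMA * sum(v[t] for t in succ) / len(succ)
        else:
            new = -1.0 + max(q_value(s, a) for a in succ)   # best of the available moves
        delta = max(delta, abs(new - v[s]))
        v[s] = new
    if delta < 0.001:
        break

policy = {s: max(NEIGHBORS[s], key=lambda a: q_value(s, a))
          for s in NEIGHBORS if s not in CHANCE}
print(policy)
print({s: round(val, 3) for s, val in v.items() if s not in "AQ"})
==== end of sketch ====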

===== solver file for 1 ====
# HW 8
B : [B, A, C, F]
B % .25 .25 .25 .25
B=-3
C : [C, B, D, G]
C % .25 .25 .25 .25
C=-3
D : [C, D, H]
D % .3333 .3334 .3333
D=-3
E : [A, E, F, I]
E % .25 .25 .25 .25
E=-3
F : [B, G, J, E]
F % .25 .25 .25 .25
F=-3
G : [C, F, H, K]
G % .25 .25 .25 .25
G=-3
H : [H, D, G, L]
H % .25 .25 .25 .25
H=-3
I : [I, E, J, M]
I % .25 .25 .25 .25
I=-3
J : [F, K, N, I]
J % .25 .25 .25 .25
J=-3
K : [G, J, L, P]
K % .25 .25 .25 .25
K=-3
L : [H, L, K, Q]
L % .25 .25 .25 .25
L=-3
M : [I, M, N]
M % .3334 .3333 .3333
M=-3
N : [M, N, J, P]
N % .25 .25 .25 .25
N=-3
P : [N, K, P, Q]
P % .25 .25 .25 .25
P=-3
==== solver file for 2 ====
# HW 8.2
B : [B, A, C, F]
C : [C, B, D, G]
D : [C, D, H]
E : [A, E, F, I]
F : [B, G, J, E]
G : [C, F, H, K]
G % .25 .25 .25 .25
G=3
H : [H, D, G, L]
I : [I, E, J, M]
J : [F, K, N, I]
J % .25 .25 .25 .25
J=3
K : [G, J, L, P]
L : [H, L, K, Q]
M : [I, M, N]
N : [M, N, J, P]
N % .85
P : [N, K, P, Q]
P % .85
