Useful Formulas
MDPs and RL
• Q-learning update: Q_{k+1}(s,a) = Q_k(s,a) + α(R_{t+1} + γ max_{a'} Q_k(s',a') − Q_k(s,a)).
• Sarsa update: Q_{k+1}(s,a) = Q_k(s,a) + α(R_{t+1} + γ Q_k(s',a') − Q_k(s,a)).
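A minimal tabular sketch of the two updates above (the dictionary-based representation and function names are illustrative, not from the original):

```python
# Q is a dict mapping (state, action) -> value; names are illustrative.

def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Off-policy TD update: bootstrap on the greedy next action."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy TD update: bootstrap on the action actually taken next."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```

The only difference between the two is the bootstrap target: Q-learning maximizes over the next actions, while Sarsa uses the next action chosen by the behavior policy.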
1. What is an MDP? What are the elements that define an MDP?
A Markov Decision Process is a controllable stochastic process in which the next state and reward depend solely on the current state and action. It is a tuple ⟨S, A, T, r⟩ where S is a set of states, A is a set of actions, T is the transition function, and r is the reward function.
2. What makes a transition system Markovian?
The distribution of the next state and the reward depend only on the current state, that is:
p(s_{t+1}=s', r_{t+1}=r | s_0, a_0, s_1, a_1, …, s_t, a_t) = p(s_{t+1}=s', r_{t+1}=r | s_t, a_t)
3. What does it mean that an RL method bootstraps? Provide an example of an RL algorithm
that bootstraps and one that does not.
An RL method bootstraps when it updates a prediction based on another prediction, that is, on a current value estimate rather than only on observed returns. TD methods bootstrap, for instance Q-learning and Sarsa, while Monte Carlo methods do not: they update only from complete returns.
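To make the distinction concrete, here is a minimal sketch (illustrative names, not from the original) contrasting a bootstrapped TD(0) target with a Monte Carlo target:

```python
# TD(0) target bootstraps: it uses the current value *estimate* V[s_next].
def td_target(r, V, s_next, gamma):
    return r + gamma * V[s_next]

# Monte Carlo target does not bootstrap: it uses the full observed return
# of a complete episode, discounting rewards from the end backwards.
def mc_target(rewards, gamma):
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G
```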
4. An agent has to find the coin in the MDP below, and pick it up. The actions available to the agent are move up, down, left, right, toggle switch and pick up. The action toggle switch turns on and off the light in the room, and succeeds only if executed in the square with the switch, while it does not do anything anywhere else. The action pick up picks up the coin if executed in the square with the coin and if the light is on, while does nothing anywhere else, or with the light off. How would you model this domain so that the representation is Markovian?
A state, in order for the representation to be Markovian, must contain at least the following components: position, LightOn, HoldingCoin. The position can be represented in different ways, as long as there is one distinct value per square. LightOn and HoldingCoin are two boolean variables that are true when the light is on and when the agent is holding the coin, respectively.
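A minimal sketch of such a state representation, with hypothetical field and function names:

```python
from collections import namedtuple

# A Markovian state for the coin domain: grid position plus the two
# boolean features the transition and reward functions depend on.
State = namedtuple("State", ["position", "light_on", "holding_coin"])

def pick_up(state, coin_position):
    """'pick up' succeeds only at the coin's square with the light on."""
    if state.position == coin_position and state.light_on:
        return state._replace(holding_coin=True)
    return state  # no effect anywhere else, or with the light off
```

Without LightOn and HoldingCoin in the state, the effect of pick up would depend on history, and the representation would not be Markovian.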
Note on notation: in the following MDPs, each state is labeled with an id. Each transition is labeled with the name of the corresponding action, the probability of landing in the next state, and the reward for that transition. If a state has no outgoing edges, it is an absorbing state.
5. Calculate the action-value function that Sarsa and Q-learning would compute on the following MDP, while acting with an ε-greedy policy with and ε = 0.1 and γ = 0.5.
The difference between the two algorithms is that Q-learning is off-policy, while Sarsa is on-policy. Therefore, Sarsa converges to:

q_π(s,a) = Σ_{s',r} p(s',r | s,a) ( r + γ Σ_{a'} π(a'|s') q_π(s',a') ),

while Q-learning converges to:

q*(s,a) = Σ_{s',r} p(s',r | s,a) ( r + γ max_{a'} q*(s',a') ),

where the expectation over the policy has been replaced by the max operator.
States where there is no choice are shared by the two algorithms. Applying the two equations above, we obtain:
for both Sarsa and Q-learning:
q(5,a) = 100,
q(4,a) = 50,
q(3,a) = −5 + γ q(4,a) = −5 + 0.5⋅50 = 20,
q(3,b) = −1 + γ q(5,a) = −1 + 0.5⋅100 = 49,
q(2,a) = −10 + γ q(5,a) = −10 + 0.5⋅100 = 40.
For Sarsa:
q(2,b) = −1 + γ((1−ε) q(3,b) + ε q(3,a)) = −1 + 0.5(0.9⋅49 + 0.1⋅20) = 22.05,
q(1,a) = −1 + γ((1−ε) q(2,a) + ε q(2,b)) = −1 + 0.5(0.9⋅40 + 0.1⋅22.05) = 18.1025.
For Q-learning:
q(2,b) = −1 + γ q(3,b) = −1 + 0.5⋅49 = 23.5,
q(1,a) = −1 + γ max_{a'} q(2,a') = −1 + 0.5⋅40 = 19.
Note how Q-learning does not take the exploration policy into account: the values it computes are higher because they are the values under the optimal (greedy) policy.
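The fixed points above can be checked numerically. A sketch follows, with the transition structure taken from the worked solution (the MDP figure itself is not reproduced here):

```python
gamma, eps = 0.5, 0.1

# Values shared by Sarsa and Q-learning (states with no choice).
q5a, q4a = 100.0, 50.0
q3a = -5 + gamma * q4a          # 20.0
q3b = -1 + gamma * q5a          # 49.0
q2a = -10 + gamma * q5a         # 40.0

# Sarsa: expectation under the eps-greedy policy (greedy action gets 1 - eps).
sarsa_q2b = -1 + gamma * ((1 - eps) * q3b + eps * q3a)        # 22.05
sarsa_q1a = -1 + gamma * ((1 - eps) * q2a + eps * sarsa_q2b)  # 18.1025

# Q-learning: max over next actions.
ql_q2b = -1 + gamma * q3b               # 23.5
ql_q1a = -1 + gamma * max(q2a, ql_q2b)  # 19.0
```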
6. Calculate the action-value function that Q-learning and Sarsa would compute on the following MDP, with γ = 0.5 and ε=0.1.
For both Q-learning and Sarsa:
q(5,a) = 100,
q(4,a) = 50,
q(3,a) = 0.8(−5 + γ q(4,a)) + 0.2⋅200 = 0.8(−5 + 0.5⋅50) + 0.2⋅200 = 56,
q(3,b) = −1 + γ q(5,a) = −1 + 0.5⋅100 = 49.
For Sarsa:
q(2,a) = 0.4(−10 + γ q(5,a)) + 0.6(−1 + γ((1−ε) q(3,a) + ε q(3,b)))
= 0.4(−10 + 0.5⋅100) + 0.6(−1 + 0.5(0.9⋅56 + 0.1⋅49)) = 31.99,
q(1,a) = −1 + γ q(2,a) = −1 + 0.5⋅31.99 = 14.995.
For Q-learning:
q(2,a) = 0.4(−10 + γ q(5,a)) + 0.6(−1 + γ q(3,a)) = 0.4(−10 + 0.5⋅100) + 0.6(−1 + 0.5⋅56) = 32.2,
q(1,a) = −1 + γ q(2,a) = −1 + 0.5⋅32.2 = 15.1.
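As with the previous exercise, these values can be checked numerically; the transition structure below is taken from the worked solution, since the MDP figure is not reproduced here:

```python
gamma, eps = 0.5, 0.1

# Values shared by Sarsa and Q-learning.
q5a, q4a = 100.0, 50.0
q3a = 0.8 * (-5 + gamma * q4a) + 0.2 * 200  # 56.0
q3b = -1 + gamma * q5a                      # 49.0

# Sarsa: at state 3 the greedy action is a (56 > 49),
# so it gets probability 1 - eps under the eps-greedy policy.
sarsa_q2a = (0.4 * (-10 + gamma * q5a)
             + 0.6 * (-1 + gamma * ((1 - eps) * q3a + eps * q3b)))  # 31.99
sarsa_q1a = -1 + gamma * sarsa_q2a                                  # 14.995

# Q-learning: max over next actions at state 3.
ql_q2a = (0.4 * (-10 + gamma * q5a)
          + 0.6 * (-1 + gamma * max(q3a, q3b)))  # 32.2
ql_q1a = -1 + gamma * ql_q2a                     # 15.1
```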