
CMPUT397 Winter 2021
Midterm2
Tutor: Baihong Qi

Outline
Policy Iteration & Value Iteration
Probabilities & Expectations
Sample Midterm
Previous Exam

Value Iteration vs Policy Iteration

Key points:
1. Policy iteration consists of policy evaluation + policy improvement, and the two are repeated iteratively until the policy converges.
2. Value iteration consists of finding the optimal value function + one policy extraction. The two are not repeated, because once the value function is optimal, the policy extracted from it is also optimal (i.e. converged).
3. Finding the optimal value function can also be seen as a combination of policy improvement (due to the max) and truncated policy evaluation (v(s) is reassigned after just one sweep of all states, regardless of convergence).
4. The algorithms for policy evaluation and for finding the optimal value function are highly similar except for a max operation (as highlighted).
5. Similarly, the key steps of policy improvement and policy extraction are identical, except that the former also involves a stability check (see the sketch after this list).
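For concreteness, here is a minimal Python sketch (not the course's code) contrasting the two algorithms. It assumes the MDP dynamics are given as a nested dict P[s][a] = list of (probability, next_state, reward) triples, where every next_state also appears as a key of P (terminal states can self-loop with reward 0):

def policy_evaluation(P, policy, V, gamma=0.9, theta=1e-6):
    """Iterative policy evaluation: compute v_pi for a fixed policy (in-place sweeps)."""
    while True:
        delta = 0.0
        for s in P:
            v_new = sum(prob * (r + gamma * V[s2])
                        for prob, s2, r in P[s][policy[s]])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

def greedy_policy(P, V, gamma=0.9):
    """Policy improvement / policy extraction: act greedily with respect to V."""
    return {s: max(P[s], key=lambda a: sum(prob * (r + gamma * V[s2])
                                           for prob, s2, r in P[s][a]))
            for s in P}

def policy_iteration(P, gamma=0.9):
    V = {s: 0.0 for s in P}
    policy = {s: next(iter(P[s])) for s in P}          # arbitrary initial policy
    while True:
        V = policy_evaluation(P, policy, V, gamma)     # E: evaluate to convergence
        new_policy = greedy_policy(P, V, gamma)        # I: improve
        if new_policy == policy:                       # stability check
            return policy, V
        policy = new_policy

def value_iteration(P, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in P}
    while True:                                        # each sweep = truncated evaluation + max
        delta = 0.0
        for s in P:
            v_new = max(sum(prob * (r + gamma * V[s2])
                            for prob, s2, r in P[s][a])
                        for a in P[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return greedy_policy(P, V, gamma), V       # one policy extraction at the end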

Sample Midterm

Probabilities and Expectations
• Mutually exclusive:
• The events cannot happen at the same time.
• Independent:
• The events can happen at the same time, but they do not affect each other.
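As an illustrative example (not from the slides): roll one fair die and let A = "the roll is even", B = "the roll is at most 2", C = "the roll is odd".
• P(A ∩ C) = 0, so A and C are mutually exclusive: they can never happen together.
• P(A ∩ B) = P({2}) = 1/6 = (1/2)·(1/3) = P(A)·P(B), so A and B are independent: both can happen together, but neither changes the other's probability.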

MDP
7. Bellman Optimality Equation
• For Optimal Action Value (the action value of the optimal policy)
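For reference, the standard form (as in Sutton & Barto) is:
q*(s,a) = Σ_{s',r} p(s',r | s,a) [r + γ·max_{a'} q*(s',a')]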

MDP
4. Return (G)
• Episodic Task
• Continuing Task
• If R is always equal to 1:
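With every reward equal to 1, the returns work out to:
Episodic (undiscounted): Gt = R_{t+1} + R_{t+2} + … + R_T = T − t
Continuing (discounted by γ): Gt = Σ_{k=0}^{∞} γ^k·1 = 1/(1 − γ)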

MDP
6. Bellman Equation (expressing the expected value as a sum)
• For State Value
• For Action Value?
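For reference, the standard sum forms (as in Sutton & Barto) are:
vπ(s) = Σ_a π(a|s) Σ_{s',r} p(s',r | s,a) [r + γ·vπ(s')]
qπ(s,a) = Σ_{s',r} p(s',r | s,a) [r + γ·Σ_{a'} π(a'|s')·qπ(s',a')]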

Midterm Practice Walkthrough

MDP Q3

Answer:
(a) π' >= π
(b) Both are optimal.
(c) 1. One is optimal and one is not optimal; or
2. they can both be optimal. When breaking a tie, π' takes the first tied action because argmax returns the first item by default, while π can be any of the tied actions, so π and π' can be different.

MDP
6. Bellman Equation (expressing the expected value as a sum)
• For State Value
• Expressing the state value in terms of the action value?
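For reference, the relation asked for here is presumably the standard one:
vπ(s) = Σ_a π(a|s)·qπ(s,a)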

Midterm Practice (2)

k-armed Bandit
1. K-armed: one machine with K levers; each lever is initially assigned a random value. On each step you may choose only one of the K levers, and after the choice that lever's value estimate is updated.
2. Time steps: 1000 selections (you select 1000 times)
3. Goal: maximize the expected total value
4. Exploiting: keep choosing the greedy action
5. Exploring: select a non-greedy action
6. Exploiting & Exploring Trade-off

k-armed Bandit
7. ε-greedy
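A minimal Python sketch of ε-greedy action selection with incremental sample-average updates; the names (epsilon_greedy_bandit, bandit_reward, q_est, etc.) and the Gaussian test bandit are illustrative assumptions, not course code:

import random

def epsilon_greedy_bandit(bandit_reward, k=10, eps=0.1, steps=1000):
    """Run an epsilon-greedy agent on a k-armed bandit.
    bandit_reward(a) should return a sampled reward for arm a (0-indexed)."""
    q_est = [0.0] * k          # action-value estimates Q(a)
    counts = [0] * k           # how many times each arm has been pulled
    total_reward = 0.0
    for _ in range(steps):
        if random.random() < eps:
            a = random.randrange(k)                        # explore: pick any arm at random
        else:
            a = max(range(k), key=lambda i: q_est[i])      # exploit: pick the greedy arm
        r = bandit_reward(a)
        counts[a] += 1
        q_est[a] += (r - q_est[a]) / counts[a]             # incremental sample-average update
        total_reward += r
    return q_est, total_reward

# Hypothetical usage: 10 arms with Gaussian rewards around random true means.
true_means = [random.gauss(0.0, 1.0) for _ in range(10)]
estimates, total = epsilon_greedy_bandit(lambda a: random.gauss(true_means[a], 1.0))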

MDP
6. Bellman Equation (expressing the expected value as a sum)
• For State Value
• For Action Value?

MDP
7. Bellman Optimality Equation
• For Optimal State Value (the state value of the optimal policy)
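For reference, the standard form (as in Sutton & Barto) is:
v*(s) = max_a Σ_{s',r} p(s',r | s,a) [r + γ·v*(s')]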

3. Policy Iteration DP
• E (Policy Evaluation): for a given policy π, compute the state-value function vπ
• I (Policy Improvement): update the policy based on the feedback from vπ
• GPI
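Concretely, the improvement step makes the policy greedy with respect to vπ; in the standard sum form:
π'(s) = argmax_a Σ_{s',r} p(s',r | s,a) [r + γ·vπ(s')]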

MDP Q1
Answer:
(a) X, left, 0, X, left, 0, …
(b) X, right, +1, X, right, +1, X, right, -1, Y, right, +3, T

MDP Q2
2. Return (Gt): the cumulative result of the rewards received at each subsequent step.
• Question type 1: given a terminal state, compute the return from the rewards (2018F – midterm)

Answer:
Method: when computing returns, work backwards from the end, because the return at the terminal state is 0, which simplifies the calculation.
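As a hypothetical illustration of the backward method (made-up rewards, not the exam's numbers), take γ = 0.5 and rewards R1 = 2, R2 = 4, R3 = 6 before the terminal state:
G3 = 0 (terminal)
G2 = R3 + γ·G3 = 6 + 0.5·0 = 6
G1 = R2 + γ·G2 = 4 + 0.5·6 = 7
G0 = R1 + γ·G1 = 2 + 0.5·7 = 5.5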
MDP Q2

MDP Q2
2. Return (Gt)
• Question type 2: given an infinite (continuing) episode, compute the return from the rewards (2018F – midterm sample)

Answer: Rewards = 1, 13, 13, 13, … By the formula:
Method: compute the returns by working backwards, because for the infinite constant-reward tail Gt = R/(1 − γ), which simplifies the calculation.
MDP Q2
G2 = R3 + γR4 + … = 13/(1 − γ) = 13/(1 − 0.5) = 26
G1 = R2 + γG2 = 13 + 0.5·26 = 26
G0 = R1 + γG1 = 1 + 0.5·26 = 14

MDP Q2
2. Return (Gt)
• Question type 3: the difference between return and value (2018F – midterm)
• Given two actions, we should always pick the one with the larger ___
• Answer: Value (not return)
• In RL we judge performance by a prediction rather than by the actual realized outcome, so we should take an expectation; according to the formula, we should always pick the action with the larger value, not the larger return.

MDP Q3
Formal definition of a policy: a mapping from states to the probability of selecting each possible action (denoted π(a | s)).
3. Optimal Policy
• Question type 1: true/false questions on understanding the definition (2018F – midterm)

MDP Q3
Answer:
According to the formula for the optimal policy, combined with the formula for the optimal value function,
we can see that the optimal policy is greedy with respect to vπ; therefore (b) is True.
We can also see that the optimal policy is defined with respect to v through an argmax, not as a random policy; therefore (a) is False.
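The formula for the optimal policy referred to here is presumably the standard greedy form:
π*(s) = argmax_a q*(s,a) = argmax_a Σ_{s',r} p(s',r | s,a) [r + γ·v*(s')]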

MDP Q3
3. Optimal Policy
• Question type 2: the relationship between an optimal policy and a general policy (2018F – midterm sample)

MDP Q3

MDP Q3
Answer:
(a) π' is greedy with respect to vπ (because of the argmax); by the policy improvement theorem, π' >= π.
(b) Both are optimal.
(c) They can both be optimal. When breaking a tie, π' takes the first tied action because argmax returns the first item by default, while π can be any of the tied actions, so π and π' can be different.

MDP Q3
3. Optimal Policy
• Question type 3: given the MDP diagram, find the optimal policy (2018F – midterm sample)

MDP Q3

MDP Q3
Answer:
If the start state is s1, the optimal policy is to switch.
If the start state is s2, the optimal policy is to stay.

MDP Q4
4. The difference between episodic and continuing tasks
• An episodic task is a case with a terminal state.
• A continuing task is a case without a terminal state; it runs forever.
• Question type 1: fill in the blank (2018F – midterm)
• An episodic task begins and ends. A ___task goes on and on.
• Answer: continuing (not continuous).

MDP Q5
5. Computing the value function
Value function: a formula that maps each state to a numerical value.
• Question type 1: compute the value function from the given diagram (2018F – midterm)

MDP Q5
Method: for each state, compute the value-function contribution of each action. If the computation involves the value function of another state, first compute that related state's value function and then substitute it back into the calculation.
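As a hypothetical illustration of the substitution idea (made-up MDP, not the exam's figure): suppose under the policy being evaluated, from s1 we move to s2 with reward 1, and from s2 we stay in s2 with reward 2, with γ = 0.5. Since v(s1) depends on v(s2), compute v(s2) first and substitute back:
v(s2) = 2 + γ·v(s2)  ⇒  v(s2) = 2/(1 − 0.5) = 4
v(s1) = 1 + γ·v(s2) = 1 + 0.5·4 = 3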

MDP Q6
6. Applying estimates
• Question type 1: complete the estimate update according to the description (2018F – midterm)

MDP Q6
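For reference, the general update rule used throughout the course material (Sutton & Barto) has the form:
NewEstimate ← OldEstimate + StepSize·(Target − OldEstimate)
e.g., for an action-value estimate: Q(A) ← Q(A) + α·(R − Q(A))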

DP Q1
1. GPI (generalized policy iteration)
• Question type 1: short answer (2017F – midterm)
• What is generalized policy iteration? Refer to all three words of the phrase in your explanation.
• Answer: any iteration (alternation) of policy evaluation & policy improvement, independent of their granularity.