
CMPUT397 Winter 2021
Midterm Review (1): MC + Key Concepts
Instructor: Baihong Qi

Schedule
• Tuesday
• MC
• Review of midterm exam topics
• Wednesday
• Walkthrough of the sample midterm
• Past midterm questions
• Q&A

Outline
• Monte Carlo for state values
• Monte Carlo for action values
• Exploration for Monte Carlo
• Review of midterm exam topics

Monte Carlo Methods
• Recap: Bellman Equation for state values
• Requires knowing the environment's transition probabilities
• Computing the transition probabilities is hard!
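For reference, the standard Bellman equation for state values; the sum over the dynamics p(s', r | s, a) is exactly what requires a model of the environment:

v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_\pi(s')\bigr]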

Predict the outcome of 12 rolls of dice
• Monte Carlo does not sweep through all possible outcomes; it averages sampled outcomes instead
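A minimal sketch (names are illustrative) that estimates the expected total of 12 dice rolls by averaging random samples rather than enumerating all 6^12 outcomes:

```python
import random

def rolls_total(n_dice=12):
    """Total of one simulated roll of n_dice fair six-sided dice."""
    return sum(random.randint(1, 6) for _ in range(n_dice))

def mc_estimate(num_samples=10_000):
    """Average many sampled totals; converges to the true expectation (12 * 3.5 = 42)."""
    return sum(rolls_total() for _ in range(num_samples)) / num_samples

print(mc_estimate())  # approximately 42
```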

Monte Carlo in RL
• Does not assume a probability model p(s', r | s, a)
• Estimates value functions (V(s), Q(s, a)) from experience
• Collects sequences of states, actions, & rewards, starting from some state and ending at a terminal state
• Focuses on episodic tasks
• Updates the value-function estimate at the end of each episode

MC Policy Evaluation

Formula: policy → state value
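For reference, the Monte Carlo prediction estimate this refers to: the value of a state is the average of the sample returns observed after visiting it, which can be written incrementally as

v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] \approx \frac{1}{N(s)} \sum_{i=1}^{N(s)} G_i(s)

V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\bigl(G_t - V(S_t)\bigr)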

Blackjack
• A card game
• The player plays against the dealer; each starts with two cards, and the player can see one of the dealer's cards
• Face cards are worth 10; an ace counts as 1 or 11
• The player goes first and can choose to stick (take no more cards) or hit (take another card)
• The goal is to get a sum higher than the dealer's without exceeding 21; going over 21 is a bust and the game ends
• If the player's two initial cards total 21 and the dealer does not also have 21, the player wins automatically
• Dealer's turn: the dealer sticks if the sum is >= 17 and hits otherwise (the dealer can also bust)

Blackjack as an MDP: States
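A minimal sketch of the state representation typically used for this example (player's current sum, the dealer's showing card, and whether the player holds a usable ace); the class name and encoding here are illustrative assumptions:

```python
from typing import NamedTuple

class BlackjackState(NamedTuple):
    player_sum: int    # current sum of the player's cards (12-21 are the interesting cases)
    dealer_card: int   # the dealer's one visible card (1 = ace, 2-10)
    usable_ace: bool   # True if the player holds an ace counted as 11 without busting

# Example state: player holds 13 with a usable ace, dealer shows a 10.
s = BlackjackState(player_sum=13, dealer_card=10, usable_ace=True)
```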

Action and reward

Result

Bootstrap
• 定义: “using one or more estimated values in the update step for the same kind of estimated value”.
• MC methods do not bootstrap:
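Concretely, the difference shows up in the update target: DP bootstraps from the current estimate V(s') of the successor state, while MC uses the complete sample return G_t:

\text{DP (bootstraps): } V(s) \leftarrow \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma V(s')\bigr]

\text{MC (no bootstrapping): } V(S_t) \leftarrow V(S_t) + \alpha\,\bigl[G_t - V(S_t)\bigr], \quad G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T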

Use Monte Carlo to learn action values
• MC does not need the transition probabilities; as with state values, we can use samples to estimate action values
• Learning action values requires maintaining exploration
• Control with action values works by taking the argmax over the action values, so we must make sure every action gets explored
• Exploring starts: every state-action pair has nonzero prob of being the start state and first action of each episode

Use MC for generalized Policy Iteration
• Use Monte Carlo to estimate action values

Formula: Policy Iteration using MC ES
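A rough sketch of Monte Carlo ES (exploring starts) as generalized policy iteration. The environment interface (env.states(), env.actions(s)) and generate_episode are assumed helpers for illustration, not the course's code:

```python
import random
from collections import defaultdict

def mc_exploring_starts(env, num_episodes, gamma=1.0):
    """Estimate Q and improve a greedy policy from sampled episodes (first-visit MC ES)."""
    Q = defaultdict(float)            # Q[(s, a)] estimates
    returns_count = defaultdict(int)  # number of returns averaged into each Q[(s, a)]
    policy = {s: random.choice(env.actions(s)) for s in env.states()}

    for _ in range(num_episodes):
        # `generate_episode` (hypothetical) returns [(state, action, reward), ...]
        # and uses an exploring start: a random initial state-action pair.
        episode = generate_episode(env, policy, exploring_start=True)

        # Index of the first occurrence of each (state, action) pair (first-visit MC).
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)

        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r                     # return following time t
            if first_visit[(s, a)] == t:          # first-visit update only
                returns_count[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / returns_count[(s, a)]
                # Policy improvement: act greedily w.r.t. the current Q at this state.
                policy[s] = max(env.actions(s), key=lambda act: Q[(s, act)])
    return Q, policy
```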

Blackjack optimal policy

Exam Topics
● Introduction
● K-armed Bandit
● MDP
● DP
● Monte Carlo

Introduction to Reinforcement Learning
1. RL characteristic: uses training information that evaluates the actions taken (rather than telling you the correct action)
2. Three elements: Action (A), Reward (R), State (S)
3. Example: a robot in a maze
4. Action: forward, backward, left, right
5. Reward : +1 -1 ?0 +1
6. State: coordinates (entrance: start state; exit: terminal state)

Introduction to Reinforcement Learning
7. Value: the expected reward for a given state/action (expected reward)
• V: state value (based on the state)
• Q: action value (based on the state & action)
• q: true value; Q: estimated value (Q ≈ q)
• Q[forward, backward, left, right]
• Q = [11, 14, -2, 189]
• Q[forward] = 11

Introduction to Reinforcement Learning
8. Value function: a formula that expresses V/Q in terms of rewards
9. RL goal: produce a value function, i.e., a mapping from states to real numbers

k-armed Bandit
1. K-armed: a machine with K levers; each lever is initially assigned a random value. On each step you choose one of the K levers, and that lever's value estimate is updated after the choice
2. Time steps: 1000 selections
3. Goal: maximize the expected total reward
4. Exploiting: keep choosing the greedy action
5. Exploring: select a non-greedy action
6. Trade-off between exploiting and exploring

k-armed Bandit
7. ε-greedy

k-armed Bandit
8. Greedy algorithm (standard learning rule)
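In symbols, the standard (sample-average) learning rule, written incrementally:

Q_{n+1} = Q_n + \frac{1}{n}\bigl(R_n - Q_n\bigr)

which follows the general pattern NewEstimate ← OldEstimate + StepSize · (Target − OldEstimate).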

k-armed Bandit
9. ε-greedy algorithm
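A minimal sketch of ε-greedy action selection with incremental sample-average updates on a simulated stationary k-armed bandit; the Gaussian reward model and all names are illustrative assumptions:

```python
import random

def epsilon_greedy_bandit(k=10, steps=1000, epsilon=0.1):
    """Run one epsilon-greedy agent on a stationary k-armed Gaussian bandit."""
    true_values = [random.gauss(0, 1) for _ in range(k)]  # q*(a), unknown to the agent
    Q = [0.0] * k   # action-value estimates
    N = [0] * k     # selection counts

    total_reward = 0.0
    for _ in range(steps):
        if random.random() < epsilon:
            a = random.randrange(k)                  # explore: random action
        else:
            a = max(range(k), key=lambda i: Q[i])    # exploit: greedy action
        reward = random.gauss(true_values[a], 1)     # sample a noisy reward
        N[a] += 1
        Q[a] += (reward - Q[a]) / N[a]               # incremental sample-average update
        total_reward += reward
    return Q, total_reward

Q, total = epsilon_greedy_bandit()
```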

k-armed Bandit
10. ε-greedy algorithm vs. greedy algorithm
• Question: is ε = 0.1 better than ε = 0.01?

k-armed Bandit
11. Stationary vs. non-stationary bandits
• Stationary bandit: the reward probabilities stay fixed
• Non-stationary bandit: the reward probabilities change over time
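For the non-stationary case, the usual fix is a constant step size α, which forms an exponential recency-weighted average instead of the 1/n sample average:

Q_{n+1} = Q_n + \alpha\,\bigl(R_n - Q_n\bigr), \qquad 0 < \alpha \le 1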

k-armed Bandit
12. Optimistic initial values
• Question: why does optimistic greedy perform worse than ε-greedy in the early steps?
• Question: why does optimistic greedy rise sharply at the start?

MDP
1. Interface
• Trajectory
S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, …

MDP
2. MDP concept
• Definition: a framework for learning from interaction to achieve a goal
• Finite state and action sets: finite MDP
• Infinite state or action sets: infinite MDP

MDP
3. Policy
• Policy: a mapping from states to action probabilities
• Deterministic policy
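In symbols, a stochastic policy gives action probabilities, while a deterministic policy picks one action per state:

\pi(a \mid s) = \Pr(A_t = a \mid S_t = s), \qquad \sum_a \pi(a \mid s) = 1

\text{Deterministic: } \pi(s) = a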

MDP
4. Return (G)
• Definition: the rewards added together in some way
• Categories:

MDP
4. Return (G)
• Episodic task
• Continuing task
• If the reward is always 1:
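The return definitions behind these bullets:

\text{Episodic: } G_t = R_{t+1} + R_{t+2} + \cdots + R_T

\text{Continuing (discounted): } G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad 0 \le \gamma < 1

\text{If } R_k = 1 \text{ for all } k: \quad G_t = \sum_{k=0}^{\infty} \gamma^k = \frac{1}{1-\gamma}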

MDP
5. Value (V/Q) Function
• Relationship between value & return: values are expected returns

MDP
5. Value (V/Q) Function
• State Value Function
• Action Value Function
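In symbols ("values are expected returns"):

v_\pi(s) = \mathbb{E}_\pi[\,G_t \mid S_t = s\,]

q_\pi(s, a) = \mathbb{E}_\pi[\,G_t \mid S_t = s, A_t = a\,]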

MDP
6. Bellman Equation (expressing the expected value as a sum)
• For state values
• For action values?

MDP
6. Bellman Equation (expressing the expected value as a sum)
• For state values
• Express the state value in terms of action values?

MDP
6. Bellman Equation (expressing the expected value as a sum)
• For action values
• Express the action value in terms of state values?
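For reference, the standard forms these review questions point to:

v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_\pi(s')\bigr]

q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\Bigl[r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')\Bigr]

v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a)

q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_\pi(s')\bigr]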

MDP
7. Bellman Optimality Equation
• For the optimal state value (the state value of the optimal policy)

MDP
7. Bellman Optimality Equation
• For the optimal action value (the action value of the optimal policy)

MDP
7. Bellman Optimality Equation
• Express the optimal state value in terms of optimal action values?
• Express the optimal action value in terms of optimal state values?
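For reference, the standard forms, including the two conversions asked about:

v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_*(s')\bigr]

q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma \max_{a'} q_*(s', a')\bigr]

v_*(s) = \max_a q_*(s, a)

q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_*(s')\bigr]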

DP
1. DP concept
• Definition: a collection of algorithms to dynamically compute value functions and optimal policies
• Prerequisite: a perfect model of the environment is given as an MDP
• Core idea: use value functions to organize and structure the search for good policies
• Implementation: Policy Iteration / Value Iteration

MDP
8. Policy Optimality Equation
• For the optimal action value
• For the optimal state value?
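Assuming this slide refers to reading the optimal (greedy) policy off the optimal value functions, the standard relation is:

\pi_*(s) = \arg\max_a q_*(s, a) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_*(s')\bigr]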

DP
1. Policy Evaluation
• Formula

DP
1. Policy Evaluation
• Algorithm
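A compact sketch of iterative policy evaluation for a finite MDP. The dictionary-based dynamics representation p[(s, a)] = [(next_state, reward, prob), ...] is an illustrative assumption, not the course's code:

```python
def policy_evaluation(states, actions, p, policy, gamma=0.9, theta=1e-6):
    """Iteratively apply the Bellman expectation backup until the values stop changing.

    p[(s, a)] is a list of (s_next, reward, prob) triples describing the dynamics.
    policy[s][a] is the probability of taking action a in state s.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = 0.0
            for a in actions:
                for s_next, reward, prob in p[(s, a)]:
                    v_new += policy[s][a] * prob * (reward + gamma * V[s_next])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```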

DP
2. Policy Improvement
• Formula

DP
2. Policy Improvement
• Algorithm
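The greedy improvement step in symbols:

\pi'(s) = \arg\max_a q_\pi(s, a) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_\pi(s')\bigr]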

DP
3. Policy Iteration
• E(Policy Evaluation) : for a given policy π, compute the state-value function vπ
• I (Policy Improvement): update the policy based on feedback from vπ
• GPI (Generalized Policy Iteration)
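The alternating sequence this describes:

\pi_0 \xrightarrow{E} v_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} v_{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi_* \xrightarrow{E} v_*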

DP
3. Policy Iteration (built on Policy Evaluation)

DP
4. Value Iteration
• Recall: the Policy Evaluation formula
• Value Iteration formula
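The value iteration update (a max over actions replaces the expectation under π in the policy-evaluation backup):

v_{k+1}(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_k(s')\bigr]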

DP
4. Value Iteration
• Algorithm
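A compact sketch of value iteration, reusing the same illustrative p[(s, a)] dynamics format as the policy-evaluation sketch above:

```python
def value_iteration(states, actions, p, gamma=0.9, theta=1e-6):
    """Turn the Bellman optimality equation into an update; return V* and a greedy policy."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Back up each action's value and keep the best one.
            action_values = {
                a: sum(prob * (reward + gamma * V[s_next])
                       for s_next, reward, prob in p[(s, a)])
                for a in actions
            }
            best = max(action_values.values())
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # Extract the greedy (optimal) policy from the converged values.
    policy = {
        s: max(actions, key=lambda a: sum(prob * (reward + gamma * V[s_next])
                                          for s_next, reward, prob in p[(s, a)]))
        for s in states
    }
    return V, policy
```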