CMPUT397 Winter 2021
Final Exam Review
Tutor: Baihong Qi
Key Topics
Course Outline
RL Pipeline Overview
Reinforcement Learning Basics
1. RL characteristic: uses training information that evaluates the actions taken, rather than telling you the correct action
2. Three elements: Action (A), Reward (R), State (S)
3. Example: a robot navigating a maze
4. Action: forward, backward, left, right
5. Reward: +1, -1, 0, +1
6. State: coordinates (entrance: start state; exit: terminal state)
Reinforcement Learning Basics
7. Value: the expected reward (expected return) for a given state/action
• V: state value (based on the state)
• Q: action value (based on the state and action)
• q: true value; Q: estimated value (Q ≈ q)
• Q[forward, backward, left, right]
• Q[11, 14, -2, 189]
• Q[forward] = 11
Reinforcement Learning Basics
8. Value function: a formula that expresses V/Q in terms of rewards
9. RL goal: produce a value function, a mapping from states to real numbers
k-armed Bandit
1. K-armed: one machine with K arms, each initialized with a random value. At each time step the agent selects one of the K arms, and that arm's value estimate is updated after the selection
2. Time steps: 1000 selections
3. Goal: maximize the expected total reward
4. Exploiting: keep choosing the greedy action
5. Exploring: select a non-greedy action
6. Exploitation vs. exploration trade-off
k-armed Bandit
7. ε-greedy
k-armed Bandit
8. Greedy algorithm (standard learning rule)
9. ε-greedy algorithm (a minimal sketch follows below)
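A minimal sketch of ε-greedy action selection with the incremental sample-average update. The reward model, array sizes, and ε value here are illustrative assumptions, not values from the course:

import numpy as np

def epsilon_greedy_bandit(k=10, steps=1000, epsilon=0.1, seed=0):
    """Run one epsilon-greedy agent on a k-armed bandit (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    q_true = rng.normal(0, 1, k)        # hidden true action values (assumed Gaussian)
    Q = np.zeros(k)                     # estimated action values
    N = np.zeros(k)                     # selection counts
    rewards = []
    for _ in range(steps):
        if rng.random() < epsilon:      # explore: pick a random arm
            a = int(rng.integers(k))
        else:                           # exploit: pick the greedy arm
            a = int(np.argmax(Q))
        r = rng.normal(q_true[a], 1.0)  # sample a reward for arm a
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]       # incremental sample-average update
        rewards.append(r)
    return Q, np.mean(rewards)

Running this with ε = 0.1 versus ε = 0.01 is one way to see the comparison discussed on the next slide.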
k-armed Bandit
k-armed Bandit
10. ε-greedy algorithm vs. greedy algorithm
Question: why is ε = 0.1 better than ε = 0.01?
k-armed Bandit
11. Stationary vs. non-stationary bandits
• Stationary bandit: the reward probabilities do not change
• Non-stationary bandit: the reward probabilities change over time
12. optimistic initial value
k-armed Bandit
Question: why does optimistic greedy lag behind ε-greedy early on? Why does optimistic greedy rise sharply at the start?
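One way to explore these questions empirically is a hedged sketch of an optimistic greedy learner; the initial value of 5.0 and the constant step size of 0.1 are illustrative choices, not values from the course:

import numpy as np

def optimistic_greedy_bandit(k=10, steps=1000, q_init=5.0, alpha=0.1, seed=0):
    """Greedy agent whose optimism (high initial Q) drives early exploration."""
    rng = np.random.default_rng(seed)
    q_true = rng.normal(0, 1, k)
    Q = np.full(k, q_init)              # optimistic initial estimates
    for _ in range(steps):
        a = int(np.argmax(Q))           # always act greedily
        r = rng.normal(q_true[a], 1.0)
        Q[a] += alpha * (r - Q[a])      # constant-step-size update
    return Q

Because every arm initially looks worse than its optimistic estimate, the greedy learner is pushed to try all of the arms early on, which is one way to think about the questions above.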
MDP
1. Interface
• Trajectory
S0 A0 R1 S1 A1 R2 S2 A2 R3 …
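A minimal sketch of the agent-environment interface that produces such a trajectory. The env and agent objects and their method names (reset, step, select_action) are assumed placeholders, not a specific course API:

def run_episode(env, agent):
    """Generate one trajectory S0, A0, R1, S1, A1, R2, ... and return it."""
    trajectory = []
    state = env.reset()                              # S0
    done = False
    while not done:
        action = agent.select_action(state)          # A_t
        next_state, reward, done = env.step(action)  # R_{t+1}, S_{t+1}
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory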
MDP
2. The MDP concept
• Definition: a framework for learning from interaction to achieve a goal
• Finite state/action sets: finite MDP
• Infinite state/action sets: infinite MDP
MDP
3. Policy
• Policy: a mapping from states to action probabilities
• Deterministic policy: maps each state to a single action
MDP
4. Return (G)
• Definition: the sum obtained by combining the rewards in some way
• Categories:
• Episodic task
• Continuing task
MDP
If R is always equal to 1 (see the worked return below):
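For reference, the discounted return for a continuing task and the special case mentioned above (assuming γ < 1):

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

$$\text{If } R_{t+k+1} = 1 \text{ for all } k:\quad G_t = \sum_{k=0}^{\infty} \gamma^k = \frac{1}{1-\gamma}$$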
MDP
• Relation between value and return: values are expected returns
5. Value (V/Q) Function
• State Value Function
MDP
• Action Value Function
MDP
6. Bellman Equation (expressing the expected value as a sum)
• For the state value
• For the action value?
MDP
6. Bellman Equation (expressing the expected value as a sum)
• For the state value
• Express the state value in terms of action values?
MDP
6. Bellman Equation (expressing the expected value as a sum)
• For the action value
• Express the action value in terms of state values? (the standard forms are collected below for reference)
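For reference, the standard Bellman equations from Sutton & Barto (the question marks above are left as exercises; these are the textbook forms):

$$v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma v_\pi(s')\bigr]$$

$$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma v_\pi(s')\bigr] = \sum_{s', r} p(s', r \mid s, a)\,\Bigl[r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')\Bigr]$$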
MDP
• For Optimal State Value (State value of optimal policy)
7. Bellman Optimality Equation
MDP
• For Optimal Action Value (Action value of optimal policy)
7. Bellman Optimality Equation
• Express the optimal state value in terms of optimal action values?
• Express the optimal action value in terms of optimal state values? (see the reference forms below)
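For reference, the standard optimality forms:

$$v_*(s) = \max_a q_*(s, a) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma v_*(s')\bigr]$$

$$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma v_*(s')\bigr] = \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma \max_{a'} q_*(s', a')\bigr]$$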
MDP
DP
1. The DP concept
• Definition: a collection of algorithms to dynamically compute value functions and optimal policies
• Premise: given a perfect model of the environment as an MDP
• Core idea: use value functions to organize and structure the search for good policies
• Implementations: Policy Iteration / Value Iteration
MDP
8. Policy Optimality Equation
• For the optimal action value
• For the optimal state value?
DP
1. Policy Evaluation
• Formula
DP
1. Policy Evaluation
• Algorithm
DP
2. Policy Improvement
• Formula
2. Policy Improvement
• Algorithm
DP
3. Policy Iteration
DP
• E(Policy Evaluation) : for a given policy π, compute the state-value function vπ
• I (Policy Improvement): update the policy based on feedback from vπ
• GPI (Generalized Policy Iteration)
DP
3. Policy Iteration (built on Policy Evaluation)
DP
• Recall: the Policy Evaluation formula
4. Value Iteration
• The Value Iteration formula
DP
5. Value Iteration
• Algorithm (a sketch of value iteration follows below)
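A minimal value iteration sketch, assuming the dynamics are given as a dictionary p[(s, a)] of (prob, next_state, reward) triples; that representation, and the stopping threshold, are illustrative assumptions rather than the course's notation:

def value_iteration(states, actions, p, gamma=0.9, theta=1e-6):
    """Sweep V(s) <- max_a sum_{s',r} p(s',r|s,a)[r + gamma V(s')] until stable."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            q_values = [
                sum(prob * (r + gamma * V[s2]) for prob, s2, r in p[(s, a)])
                for a in actions
            ]
            best = max(q_values)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # Extract a greedy policy from the converged values
    policy = {
        s: max(actions,
               key=lambda a: sum(prob * (r + gamma * V[s2]) for prob, s2, r in p[(s, a)]))
        for s in states
    }
    return V, policy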
TD
1. TD concepts
• Key idea: uses previous experience to make predictions
• TD: temporal difference
2. The TD(0) formula (see the update below)
• At S_{t+1}, look back one step
• The current state is S_{t+1}
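For reference, the TD(0) update these bullets describe (standard form); the update happens at time t+1, once R_{t+1} and S_{t+1} have been observed, which is what "look back from S_{t+1}" refers to:

$$V(S_t) \leftarrow V(S_t) + \alpha\,\bigl[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\bigr]$$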
TD
3. Comparison of the prediction targets of MC, TD, and DP
• MC target: the actual return G_t
• TD target: the sampled reward plus the discounted estimated value of the next state
• DP target: the expected (reward + discounted value of the next state)
TD
4. Comparison of bootstrapping in MC, TD, and DP
• Bootstrapping: update the estimated value of a state based on the estimates of successor states
• TD: bootstraps
• MC: does not bootstrap
• DP: bootstraps
TD
5. Comparison of sampling in MC, TD, and DP
• Sampling: the value function is estimated from actual samples rather than computed expectations
• TD: samples
• MC: samples
• DP: does not sample
[Backup diagrams: MC, DP, TD]
6. TD(0) algorithm (off-policy)
TD
7. TD(0) worked example: estimating the time to get home
TD
7. TD(0) worked example
TD
8. Advantages of TD
TD
• TD converges faster and can update after a single step
1. The Sarsa formula
• At S_{t+1}, look back one step
• The current state is S_{t+1}
2. The relationship between Sarsa and TD(0)
Sarsa
• Implements TD with Q-values instead of V-values
• Sarsa is on-policy, TD is off-policy
• The TD(0) formula
Sarsa
3. Sarsa algorithm (on-policy)
4. The Expected Sarsa formula (off-policy)
Sarsa
• At S_{t+1}, look back one step
• The current state is S_{t+1}
5. The relationship between Sarsa and Expected Sarsa
• Sarsa is on-policy; Expected Sarsa is off-policy in most cases
Q-learning
1. The Q-learning formula
• At S_{t+1}, look back one step
• The current state is S_{t+1}
2. Differences between Q-learning and Sarsa
• Also implements TD with Q-values, but Q-learning bootstraps off the maximum Q-value at the next state
• Sarsa is on-policy, Q-learning is off-policy
• The Sarsa formula
3. Q-learning algorithm (off-policy); a sketch of the update follows below
Q-learning
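A minimal sketch of the tabular Q-learning update inside an episode loop. The env interface (reset/step), the actions list, and the parameter values are illustrative assumptions:

import random

def q_learning_episode(env, Q, actions, alpha=0.1, gamma=0.99, epsilon=0.1):
    """One episode of tabular Q-learning; Q maps (state, action) -> value (e.g. a defaultdict(float))."""
    state = env.reset()
    done = False
    while not done:
        # epsilon-greedy behaviour policy
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: Q[(state, a)])
        next_state, reward, done = env.step(action)   # assumed env interface
        # Off-policy target: bootstrap off the greedy (max) action value at the next state
        target = reward if done else reward + gamma * max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (target - Q[(state, action)])
        state = next_state
    return Q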
Planning & Learning
• Model: what the agent uses to predict how the environment will respond to its actions
1. Model-based RL
• Types of models (distribution model vs. sample model)
Planning & Learning
1. Model-based RL
• Comparison of the models used by MC, TD, and DP
• TD (sample model) sampling
• MC (sample model) sampling
• DP (distribution model) not sampling
[Diagrams: MC, DP, TD]
2. Planning
Planning & Learning
• Planning: use simulated experience (generated by the model) to update the value function and policy
• Learning: use real experience to update the value function and policy
2. Planning
Planning & Learning
• Example algorithm: model-based Q-planning
2. Planning
Planning & Learning
• Example algorithm: model-free Q-learning
2. Planning
Planning & Learning
• Comparison of model-free and model-based methods
3. Dyna Q
Planning & Learning
Planning & Learning
Note: the model may be wrong
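Before moving to Dyna-Q+, here is a compact sketch of one basic Dyna-Q iteration: act, direct RL update, model update, then n planning updates from the learned sample model. The env_step interface, n_planning = 10, and the other parameters are illustrative assumptions (terminal-state handling is omitted for brevity):

import random

def dyna_q_step(env_step, state, Q, model, actions,
                alpha=0.1, gamma=0.95, epsilon=0.1, n_planning=10):
    """Q: (state, action) -> value (e.g. defaultdict(float)); model: (state, action) -> (next_state, reward)."""
    # (a) act epsilon-greedily in the real environment
    if random.random() < epsilon:
        action = random.choice(actions)
    else:
        action = max(actions, key=lambda a: Q[(state, a)])
    next_state, reward = env_step(state, action)

    # (b) direct RL: Q-learning update from real experience
    target = reward + gamma * max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (target - Q[(state, action)])

    # (c) model learning: remember the last observed transition (deterministic model)
    model[(state, action)] = (next_state, reward)

    # (d) planning: replay n simulated transitions drawn from the model
    for _ in range(n_planning):
        (s, a), (s2, r) = random.choice(list(model.items()))
        t = r + gamma * max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (t - Q[(s, a)])
    return next_state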
4. Dyna Q +
• Addresses: a wrong model can cause the agent to settle on a suboptimal policy
Planning & Learning
• Exploration & exploitation
4. Dyna Q +
Planning & Learning
4. Dyna Q +
• Exploration bonus (see the formula below)
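For reference, the Dyna-Q+ bonus used during planning, where τ is the number of time steps since that state-action pair was last tried in the real environment and κ is a small constant:

$$r + \kappa \sqrt{\tau}$$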
4. Dyna Q+
Planning & Learning
4. Dyna Q +
• Performance plot
Planning & Learning
Function Approximation
Function Approximation
How are things the same? How are things different?
• Before, we were able to represent this function exactly, with a table (tabular)
• V is a vector (or array) with one entry per state, with entry V(s) estimating vπ(s)
• More generally, we can only approximate these functions
• e.g., v̂(s, w) ≈ vπ(s) for each s, for a parameterized function v̂(·, w)
• Tabular is a special case: V(s) = v̂(s, w) under tabular features (x(s) a one-hot encoding)
• Can also see it as a generalization, rather than a difference
Uses supervised learning
Performance Measures
• Formula
• For RL: the mean-squared value error
Generalization and Discrimination
• Tabular representations provide good discrimination but no generalization
• Generalization is important for faster learning
• Ideally we keep both generalization and discrimination: learning that is both fast and accurate
Converting a continuous, infinite state space into a finite representation:
Coarse Coding
• The receptive-field shapes can be asymmetric
Multiple tilings help, as shown on the 1000-state random walk.
[Figure 9.10, Sutton & Barto (Ch. 9, On-policy Prediction with Approximation): "Why we use coarse coding." Learning curves (RMSVE averaged over 30 runs, 5000 episodes) on the 1000-state random walk for gradient MC with state aggregation (one tiling) vs. tile coding (50 tilings).]
Gradient Descent
• Definition of the derivative
• Positive: the function changes in the same direction as the input
• Negative: the function changes in the opposite direction
• The larger the magnitude, the steeper the slope and the faster the change
Gradient Descent
• Our update is proportional to the negative of the gradient of our loss function with respect to the parameter vector wt
• Move in the direction opposite to the gradient
• Differentiate the objective function with respect to w
Stochastic Gradient descent
• Randomly sample one gradient estimate, then update w
• Unlike standard gradient descent, the update uses a single sample
• It is not a batch update
Gradient Monte Carlo Algorithm
• The true value of V(St) is unknown; it can be estimated in several ways
• Use Monte Carlo to obtain the return G as the estimate of V(St) (see the update below)
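For reference, the Gradient Monte Carlo weight update, which uses the sampled return G_t as the target:

$$\mathbf{w} \leftarrow \mathbf{w} + \alpha\,\bigl[G_t - \hat{v}(S_t, \mathbf{w})\bigr]\,\nabla_{\mathbf{w}}\hat{v}(S_t, \mathbf{w})$$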
[Slide shows the MC and TD update rules side by side]
Course Roadmap
Bandits (C2) MDPs, returns, value functions (C3) Monte Carlo learning (C5) TD learning (C6)
Dynamic programming (C4) n-step TD learning (C7)
Policy gradients (C13) Bio. RL (C14 + 15)
Planning (C8) Eligibility traces (C12)
Options (C17)
Function approximation (C9 + 10) Off-policy with approximation (C11)
Monte Carlo Tree Search (Dr. Mueller + C8)
Bandits
• Simple decision making problem with 1 state
Course Roadmap
Bandits (C2) MDPs, returns, value functions (C3) Monte Carlo learning (C5) TD learning (C6)
Dynamic programming (C4) n-step TD learning (C7)
Policy gradients (C13) Bio. RL (C14 + 15)
Planning (C8) Eligibility traces (C12)
Options (C17)
Function approximation (C9 + 10) Off-policy with approximation (C11)
Monte Carlo Tree Search (Dr. Mueller + C8)
MDPs, returns, and value functions
• Decision making problems with many states
MDPs, returns, and value functions
• Sequential decision making: must make many correct actions in a row
• Agent is concerned with returns (estimating and/or maximizing):
• Specifically, the expected return, which depends on the agent’s policy and the environment dynamics
MDPs, returns, and value functions
MDPs, returns, and value functions
• Bellman Equations: write the value of a decision point in terms of the value of a later time step's decision point
• i.e. for all states:
$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma v_\pi(s')\bigr]$$
MDPs, returns, and value functions
• Policy improvement:
• If you derive a greedy policy with respect to the action-values of another policy, the new policy will be at least as "good" as the other
• If the new policy did not change from the previous policy, the policy is greedy with respect to its own value function, and is an optimal policy $\pi_*$
MDPs, returns, and value functions
• Know the definitions of the return and value functions
• Know the Bellman equations!
• $v_\pi(s)$, $q_\pi(s, a)$, $v_*(s)$, $q_*(s, a)$
Course Roadmap
Bandits (C2) MDPs, returns, value functions (C3) Monte Carlo learning (C5) TD learning (C6)
Dynamic programming (C4)
n-step TD learning (C7)
Policy gradients (C13) Bio. RL (C14 + 15)
Planning (C8) Eligibility traces (C12)
Options (C17)
Function approximation (C9 + 10) Off-policy with approximation (C11)
Monte Carlo Tree Search (Dr. Mueller + C8)
Dynamic programming
• Computes an approximate value function $V \approx v_\pi$
• Sweeps across all states and actions, and evaluates the Bellman equation using the current estimates in the value function:
For all s:
$$V_{k+1}(s) \leftarrow \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma V_k(s')\bigr]$$
Dynamic programming
• Introduces the idea of bootstrapping: basing the update to a state's value on its current value estimates of successor states
• Requires knowledge of the environment dynamics $p(s', r \mid s, a)$
Dynamic programming
• Given an MDP and initial value function (i.e. all zeros), you should be able to write out what the values would be after a sweep of a dynamic programming algorithm (i.e. value iteration)
Course Roadmap
Bandits (C2) MDPs, returns, value functions (C3) Monte Carlo learning (C5) TD learning (C6)
Dynamic programming (C4) n-step TD learning (C7)
Policy gradients (C13) Bio. RL (C14 + 15)
Planning (C8) Eligibility traces (C12)
Options (C17)
Function approximation (C9 + 10) Off-policy with approximation (C11)
Monte Carlo Tree Search (Dr. Mueller + C8)
Monte Carlo learning
• Computes an approximate value function
• Samples returns from states by behaving in the world under a particular policy, then averages them out for each state
Monte Carlo learning
• Doesn’t need a model of the environment
• Can only be used in episodic problems, and learning only occurs after each episode
Monte Carlo learning
• Given a trajectory of experience (states, actions, rewards), some initial value function (i.e. all zeros), and parameters (step size, discount rate), you should be able to write out the value updates that (first-visit or every-visit) Monte Carlo would have performed.
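A short sketch of an every-visit Monte Carlo update over one episode, assuming the episode is given as a list of (state, reward) pairs pairing S_t with R_{t+1}; that representation and the constant step size are illustrative assumptions (the averaging version would keep visit counts and use 1/N(s) instead of α):

def every_visit_mc_update(episode, V, alpha=0.1, gamma=1.0):
    """episode: [(S0, R1), (S1, R2), ...]; V: dict mapping state -> value estimate."""
    G = 0.0
    # Work backwards so the return G accumulates from the end of the episode
    for state, reward in reversed(episode):
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])   # constant-step-size every-visit update
    return V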
Course Roadmap
Bandits (C2) MDPs, returns, value functions (C3) Monte Carlo learning (C5) TD learning (C6)
Dynamic programming (C4) n-step TD learning (C7)
Policy gradients (C13) Bio. RL (C14 + 15)
Planning (C8) Eligibility traces (C12)
Options (C17)
Function approximation (C9 + 10) Off-policy with approximation (C11)
Monte Carlo Tree Search (Dr. Mueller + C8)
TD learning
• Computes an approximate value function
• Combines ideas from Monte Carlo and Dynamic Programming: uses a mix of sampled information and bootstrapping off of current estimates
• Can learn online, without having to wait for the final outcome
TD learning
One-step TD (or TD(0)):
$$V(S_t) \leftarrow V(S_t) + \alpha\,\bigl[G_t - V(S_t)\bigr], \qquad G_t = R_{t+1} + \gamma V(S_{t+1})$$
One-step Sarsa (or Sarsa(0)):
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,\bigl[G_t - Q(S_t, A_t)\bigr], \qquad G_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1})$$
TD learning
• Know the difference between online and offline updating
• Given a trajectory of experience (states, actions, rewards), some initial value function (i.e. all zeros), and parameters (step size, discount rate), you should be able to write out the value updates that TD(0) would have performed at each step.
Course Roadmap
Bandits (C2) MDPs, returns, value functions (C3) Monte Carlo learning (C5) TD learning (C6)
Dynamic programming (C4)
n-step TD learning (C7)
Policy gradients (C13) Bio. RL (C14 + 15)
Planning (C8) Eligibility traces (C12)
Options (C17)
Function approximation (C9 + 10) Off-policy with approximation (C11)
Monte Carlo Tree Search (Dr. Mueller + C8)
Back-up Diagrams
• White circle represents a state, black dot represents an action
• Shows what information is “backed up” into the estimate of the return for the top node
Back-up Diagrams
• Know the back-up diagrams of well-known algorithms (Dynamic programming, Monte Carlo, TD/Sarsa, Q-learning, etc.)
• Know how a back-up diagram relates to an algorithm’s estimate of the return/update rule
• Be able to write out an algorithm’s update rule given an arbitrary back-up diagram
Course Roadmap
Bandits (C2) MDPs, returns, value functions (C3) Monte Carlo learning (C5) TD learning (C6)
Dynamic programming (C4) n-step TD learning (C7)
Policy gradients (C13) Bio. RL (C14 + 15)
Planning (C8)
Eligibility traces (C12) Options (C17)
Function approximation (C9 + 10) Off-policy with approximation (C11)
Monte Carlo Tree Search (Dr. Mueller + C8)
Planning
• A process which takes a model as input and produces or improves a policy
• i.e. Dyna uses a model to simulate experience and improve its value estimates, where greedifying with respect to these value estimates produces an improved policy
Planning
• Know what planning refers to in the context of this course
• Know at a high level how algorithms like Dyna-Q work
Course Roadmap
Bandits (C2) MDPs, returns, value functions (C3) Monte Carlo learning (C5) TD learning (C6)
Dynamic programming (C4) n-step TD learning (C7)
Policy gradients (C13) Bio. RL (C14 + 15)
Planning (C8) Eligibility traces (C12)
Options (C17)
Function approximation (C9 + 10)
Off-policy with approximation (C11) Monte Carlo Tree Search (Dr. Mueller + C8)
Function approximation
• Instead of learning a value function as a table of values with an entry for each state, an agent learns weights of a parametrized function of state features
• Want the best set of weights such that given a state’s features, the parametrized function is as close as possible to the true value of that state
• i.e. linear function approximation:
$$\hat{v}(S_t, \mathbf{w}) \doteq \mathbf{w}^\top \mathbf{x}(S_t) \approx v_\pi(S_t)$$
Function approximation
• Uses stochastic gradient-descent in the mean-squared value error to find the best set of weights
• "Semi-gradient" TD methods ignore that the estimate of the return depends on the current weights:
$$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + \alpha\,\bigl[R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)\bigr]\,\nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w}_t)$$
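A minimal sketch of that semi-gradient TD(0) weight update with linear function approximation, where x(s) is assumed to return the feature vector for state s; the feature function and parameter values are illustrative assumptions:

import numpy as np

def semi_gradient_td0_update(w, x, s, r, s_next, done, alpha=0.1, gamma=0.99):
    """One semi-gradient TD(0) step for v_hat(s, w) = w . x(s)."""
    v_s = np.dot(w, x(s))
    v_next = 0.0 if done else np.dot(w, x(s_next))
    td_error = r + gamma * v_next - v_s
    # For a linear v_hat, the gradient with respect to w is just x(s)
    w += alpha * td_error * x(s)
    return w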
Function approximation
• Tile coding is a way of producing a binary feature vector for a point in an n-dimensional space
• Nearby points will have similar feature vectors, and distant points will have different feature vectors, which allows for generalization between nearby points (many features in common → many shared weights)
Function Approximation
• See https://github.com/MeepMoop/tilecoding
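A hand-rolled 1-D tile coder, just to illustrate how multiple offset tilings produce a binary feature vector with one active feature per tiling; this is not the linked library, and the bounds, tile width, and offsets are illustrative assumptions:

import numpy as np

def tile_code_1d(x, low=0.0, high=1.0, n_tiles=8, n_tilings=4):
    """Return a binary feature vector for scalar x with one active tile per tiling."""
    features = np.zeros(n_tiles * n_tilings)
    tile_width = (high - low) / n_tiles
    for t in range(n_tilings):
        offset = t * tile_width / n_tilings          # each tiling is shifted slightly
        idx = int((x - low + offset) / tile_width)
        idx = min(max(idx, 0), n_tiles - 1)          # clamp to the valid range
        features[t * n_tiles + idx] = 1.0            # one active feature per tiling
    return features

Nearby inputs land in the same tile for most tilings, so their feature vectors overlap, which is the generalization property described above; the number of active features equals the number of tilings.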
Function Approximation
• Know the implications of a tile coder’s parameters (i.e. shape of the tiles, number of tilings, etc.) on things like generalization or the number of active features
• Given the feature vectors of each state, and an experienced transition (state, action, reward, next state), you should be able to write out the weight updates for a semi-gradient TD method with linear function approximation