
10-601 Introduction to Machine Learning
Machine Learning Department School of Computer Science Carnegie Mellon University
Reinforcement Learning:
Value Iteration + Q-Learning
Matt Gormley Lecture 16 Mar. 16, 2021
1

Reminders
• Homework 5: Neural Networks – Out: Thu, Mar. 18
– Due: Mon, Mar. 29 at 11:59pm
• Homework 6: Deep RL
– Out: Mon, Mar. 29
– Due: Wed, Apr. 07 at 11:59pm
3

VALUE ITERATION
5

Definitions for Value Iteration
Whiteboard
– State trajectory
– Value function
– Bellman equations (see the equations after this list)
– Optimal policy
– Optimal value function
– Computing the optimal policy
– Ex: Path Planning
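For reference, the Bellman optimality equations covered on the whiteboard can be written as follows, using a reward function R(s,a), transition probabilities p(s'|s,a), and discount factor γ (this notation is assumed here, not copied from the whiteboard):

$$V^*(s) = \max_a \Big[\, R(s,a) + \gamma \sum_{s'} p(s' \mid s,a)\, V^*(s') \,\Big], \qquad \pi^*(s) = \operatorname*{argmax}_a \Big[\, R(s,a) + \gamma \sum_{s'} p(s' \mid s,a)\, V^*(s') \,\Big].$$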
6

RL Terminology
Question: Match each term (on the left) to the corresponding statement or definition (on the right)
Terms:
A. a reward function
B. a transition probability
C. a policy
D. state/action/reward triples
E. a value function
F. transition function
G. an optimal policy
H. Matt’s favorite statement
Statements:
1. gives the expected future discounted reward of a state
2. maps from states to actions
3. quantifies immediate success of agent
4. is a deterministic map from state/action pairs to states
5. quantifies the likelihood of landing in a new state, given a state/action pair
6. is the desired output of an RL algorithm
7. can be influenced by trading off between exploitation/exploration
7

Example: Path Planning
8

Example: Robot Localization
[Figure: grid world panels showing immediate rewards r(s,a) and state values V*(s)]
Figure from Tom Mitchell
9

Whiteboard
Value Iteration
– Value Iteration Algorithm
– Synchronous vs. Asynchronous Updates
10

Value Iteration
Variant 2: without Q(s,a) table
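A minimal Python sketch of this variant, assuming a tabular MDP given as a reward array R[s, a], a transition array P[s, a, s'], and a discount factor gamma (these names are illustrative, not taken from the slides):

```python
import numpy as np

def value_iteration(R, P, gamma, tol=1e-6, max_iters=10000):
    """Value iteration without an explicit Q(s,a) table: V is updated directly."""
    num_states, num_actions = R.shape
    V = np.zeros(num_states)
    for _ in range(max_iters):
        # One-step lookahead backup: V(s) <- max_a [ R(s,a) + γ Σ_s' P(s'|s,a) V(s') ]
        V_new = np.max(R + gamma * P @ V, axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    # Greedy policy extracted from the converged value function.
    policy = np.argmax(R + gamma * P @ V, axis=1)
    return V, policy
```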
11

Value Iteration
Variant 1: with Q(s,a) table
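A sketch of the same loop when an explicit Q(s,a) table is maintained, reusing the assumed R, P, gamma names from the previous sketch:

```python
import numpy as np

def value_iteration_q(R, P, gamma, num_iters=1000):
    """Value iteration with an explicit Q(s,a) table."""
    num_states, num_actions = R.shape
    V = np.zeros(num_states)
    Q = np.zeros((num_states, num_actions))
    for _ in range(num_iters):
        Q = R + gamma * P @ V   # Q(s,a) = R(s,a) + γ Σ_s' P(s'|s,a) V(s')
        V = Q.max(axis=1)       # V(s)   = max_a Q(s,a)
    policy = Q.argmax(axis=1)   # π(s)   = argmax_a Q(s,a)
    return Q, V, policy
```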
12

Synchronous vs. Asynchronous Value Iteration
asynchronous updates: compute and update V(s) for each state one at a time
synchronous updates: compute fresh values of V(s) for all states from the stale values, then replace the stale values with the fresh ones all at once (see the sketch below)
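A sketch of how the two kinds of sweeps differ in code, reusing the assumed R, P, gamma arrays from the earlier sketches:

```python
import numpy as np

def synchronous_sweep(V, R, P, gamma):
    # Every fresh V(s) is computed from the same stale copy of V,
    # and the whole vector is replaced at once.
    return np.max(R + gamma * P @ V, axis=1)

def asynchronous_sweep(V, R, P, gamma):
    # Each V(s) is overwritten immediately, so states later in the sweep
    # already see the fresh values of states updated earlier.
    for s in range(V.shape[0]):
        V[s] = np.max(R[s] + gamma * P[s] @ V)
    return V
```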
13

Value Iteration Convergence
very abridged
Holds for both asynchronous and synchronous updates
Provides reasonable stopping criterion for value iteration
Often greedy policy converges well before the value function
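One standard way the abridged result above is stated, assuming a discount factor $\gamma < 1$ and measuring distance in the max-norm (this particular phrasing is an assumption, not a quote from the slide): the Bellman backup $B$ is a $\gamma$-contraction,

$$\|BV - BV'\|_\infty \le \gamma\, \|V - V'\|_\infty ,$$

so the iterates converge to $V^*$ from any initialization. Moreover, if $\|V_{k+1} - V_k\|_\infty < \epsilon$, then $\|V_{k+1} - V^*\|_\infty < \epsilon\gamma/(1-\gamma)$, which is what makes the change between sweeps a reasonable stopping criterion.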
14

Value Iteration Variants
Question:
True or False: The value iteration algorithm shown below is an example of synchronous updates
15

POLICY ITERATION
16

Policy Iteration
18

Policy Iteration
Computing the value function for a fixed policy is easy
A system of |S| equations in |S| variables
Greedy policy might remain the same for a particular state if there is no better action
Greedy policy w.r.t. current value function
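A minimal sketch of both steps, assuming a deterministic policy stored as an integer array pi[s] and the same illustrative R, P, gamma arrays as before:

```python
import numpy as np

def evaluate_policy(pi, R, P, gamma):
    """Policy evaluation: solve the |S| x |S| linear system V = R_pi + γ P_pi V exactly."""
    num_states = R.shape[0]
    idx = np.arange(num_states)
    R_pi = R[idx, pi]              # R_pi[s]    = R(s, pi(s))
    P_pi = P[idx, pi]              # P_pi[s, t] = P(t | s, pi(s))
    return np.linalg.solve(np.eye(num_states) - gamma * P_pi, R_pi)

def greedy_policy(V, R, P, gamma):
    """Policy improvement: act greedily w.r.t. the current value function."""
    return np.argmax(R + gamma * P @ V, axis=1)

def policy_iteration(R, P, gamma, max_iters=100):
    num_states, num_actions = R.shape
    pi = np.zeros(num_states, dtype=int)
    for _ in range(max_iters):
        V = evaluate_policy(pi, R, P, gamma)
        new_pi = greedy_policy(V, R, P, gamma)
        if np.array_equal(new_pi, pi):   # greedy policy unchanged -> converged
            break
        pi = new_pi
    return pi, V
```

Solving the linear system exactly is where the |S|^3 term in the per-iteration cost (compared on a later slide) comes from.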
19

Policy Iteration Convergence
In-Class Exercise:
How many policies are there for a finite-sized state and action space?
In-Class Exercise:
Suppose policy iteration is shown to improve the policy at every iteration. Can you bound the number of iterations it will take to converge? If yes, what is the bound? If no, why not?
20

Value Iteration vs. Policy Iteration
• Value iteration requires O(|A| |S|^2) computation per iteration
• Policy iteration requires O(|A| |S|^2 + |S|^3) computation per iteration
• In practice, policy iteration converges in fewer iterations
21

Learning Objectives
Reinforcement Learning: Value and Policy Iteration
You should be able to…
1. Compare the reinforcement learning paradigm to other learning paradigms
2. Cast a real-world problem as a Markov Decision Process
3. Depict the exploration vs. exploitation tradeoff via MDP examples
4. Explain how to solve a system of equations using fixed point iteration
5. Define the Bellman Equations
6. Show how to compute the optimal policy in terms of the optimal value function
7. Explain the relationship between a value function mapping states to expected rewards and a value function mapping state-action pairs to expected rewards
8. Implement value iteration
9. Implement policy iteration
10. Contrast the computational complexity and empirical convergence of value iteration vs. policy iteration
11. Identify the conditions under which the value iteration algorithm will converge to the true value function
12. Describe properties of the policy iteration algorithm
22

Today’s lecture is brought to you by the letter…
Q
23

Q-LEARNING
24

Whiteboard
Q-Learning
– Motivation: What if we have zero knowledge of the environment?
– Q-Function: Expected Discounted Reward (defined below)
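For reference, a standard definition consistent with the whiteboard discussion (notation assumed here):

$$Q^*(s,a) = R(s,a) + \gamma \sum_{s'} p(s' \mid s,a)\, \max_{a'} Q^*(s',a') ,$$

with $V^*(s) = \max_a Q^*(s,a)$ and $\pi^*(s) = \operatorname*{argmax}_a Q^*(s,a)$, so the optimal policy can be read directly off Q* without knowing the reward or transition functions.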
25

Example: Robot Localization
[Figure: grid world panels showing immediate rewards r(s,a), state values V*(s), and state-action values Q*(s,a)]
Bellman equation: consider first the case where P(s'|s,a) is deterministic.
Figure from Tom Mitchell
26
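For the deterministic case referenced above, writing the (deterministic) transition function as $\delta(s,a)$, the Bellman equation for Q takes the standard form (the exact equation on the slide is not reproduced here):

$$Q^*(s,a) = r(s,a) + \gamma \max_{a'} Q^*\big(\delta(s,a),\, a'\big).$$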

Whiteboard
Q-Learning
– Q-Learning Algorithm
• Case 1: Deterministic Environment
• Case 2: Nondeterministic Environment
– Convergence Properties
– Exploration Insensitivity
– Ex: Re-ordering Experiences
– ε-greedy Strategy (see the sketch after this list)
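A minimal sketch of tabular Q-learning with an ε-greedy exploration strategy, assuming a simple environment object whose reset() returns a start state and whose step(a) returns (next_state, reward, done); all names here are illustrative, not taken from the lecture:

```python
import numpy as np

def q_learning(env, num_states, num_actions, gamma=0.9, alpha=0.1,
               epsilon=0.1, num_episodes=5000):
    """Tabular Q-learning (Case 2: nondeterministic environment)."""
    Q = np.zeros((num_states, num_actions))
    rng = np.random.default_rng()
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # ε-greedy: explore with probability ε, otherwise act greedily.
            if rng.random() < epsilon:
                a = int(rng.integers(num_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Temporal-difference update toward the one-step target.
            target = r + gamma * (0.0 if done else np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```

For a deterministic environment (Case 1), setting alpha = 1 recovers the simpler update Q(s,a) ← r + γ max_a' Q(s',a').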
27