10-601 Introduction to Machine Learning
Machine Learning Department School of Computer Science Carnegie Mellon University
Reinforcement Learning:
Value Iteration + Q-Learning
Matt Gormley Lecture 16 Mar. 16, 2021
1
Reminders
• Homework 5: Neural Networks
– Out: Thu, Mar. 18
– Due: Mon, Mar. 29 at 11:59pm
• Homework 6: Deep RL
– Out: Mon, Mar. 29
– Due: Wed, Apr. 07 at 11:59pm
3
VALUE ITERATION
5
Definitions for Value Iteration
Whiteboard
– State trajectory
– Value function
– Bellman equations
– Optimal policy
– Optimal value function
– Computing the optimal policy
– Ex: Path Planning
6
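As a reference for the whiteboard items above, here is a sketch of the Bellman optimality equations and the greedy policy they induce. The notation (reward R(s,a), transition probability p(s'|s,a), discount factor γ) is assumed here and may differ from what was written on the whiteboard.

```latex
% Bellman optimality equation for the optimal value function V^*:
V^*(s) = \max_{a} \Big[ R(s,a) + \gamma \sum_{s'} p(s' \mid s,a)\, V^*(s') \Big]

% Computing the optimal policy: act greedily with respect to V^*:
\pi^*(s) = \operatorname*{argmax}_{a} \Big[ R(s,a) + \gamma \sum_{s'} p(s' \mid s,a)\, V^*(s') \Big]
```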
RL Terminology
Question: Match each term (on the left) to the corresponding statement or definition (on the right)
Terms:
A. a reward function
B. a transition probability
C. a policy
D. state/action/reward triples
E. a value function
F. transition function
G. an optimal policy
H. Matt’s favorite statement
Statements:
1. gives the expected future discounted reward of a state
2. maps from states to actions
3. quantifies immediate success of agent
4. is a deterministic map from state/action pairs to states
5. quantifies the likelihood of landing in a new state, given a state/action pair
6. is the desired output of an RL algorithm
7. can be influenced by trading off between exploitation/exploration
7
Example: Path Planning
8
Example: Robot Localization
[Figure: grid world showing immediate rewards r(s,a) and the corresponding state values V*(s)]
Figure from Tom Mitchell
9
Whiteboard
Value Iteration
– Value Iteration Algorithm
– Synchronous vs. Asynchronous Updates
10
Value Iteration
Variant 2: without Q(s,a) table
11
Value Iteration
Variant 1: with Q(s,a) table
12
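The algorithm boxes on the two slides above are not reproduced in this transcript. Below is a minimal sketch of Variant 1 (value iteration that maintains a Q(s,a) table), assuming the MDP is given as a reward array R[s,a], a transition array P[s,a,s'], and a discount factor gamma; all names are illustrative rather than taken from the slides.

```python
import numpy as np

def value_iteration_with_q(R, P, gamma, num_iters=100):
    """Sketch of value iteration, variant with a Q(s,a) table.

    R: (S, A) array of immediate rewards R[s, a] (assumed representation)
    P: (S, A, S) array of transition probabilities P[s, a, s']
    gamma: discount factor in [0, 1)
    """
    S, A = R.shape
    V = np.zeros(S)
    for _ in range(num_iters):
        Q = R + gamma * P @ V      # Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
        V = Q.max(axis=1)          # V(s) = max_a Q(s,a)
    Q = R + gamma * P @ V          # final Q table consistent with V
    policy = Q.argmax(axis=1)      # greedy policy w.r.t. the final value function
    return V, Q, policy
```

Variant 2 (without the Q table) simply folds the max over actions directly into the V update and recovers the greedy policy at the end.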
Synchronous vs. Asynchronous Value Iteration
asynchronous updates: compute and update V(s) for each state one at a time
synchronous updates: compute all the fresh values of V(s) from all the stale values of V(s), then update V(s) with fresh values
13
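A compact way to see the distinction above (a sketch, reusing the illustrative R, P, gamma arrays from the previous snippet): a synchronous sweep computes every fresh V(s) from the stale table before overwriting it, while an asynchronous sweep overwrites V(s) in place, so later states in the same sweep see earlier fresh values.

```python
def synchronous_sweep(V, R, P, gamma):
    # All fresh values are computed from the stale V, then returned together.
    return (R + gamma * P @ V).max(axis=1)

def asynchronous_sweep(V, R, P, gamma):
    # V(s) is updated one state at a time, in place.
    for s in range(V.shape[0]):
        V[s] = (R[s] + gamma * P[s] @ V).max()
    return V
```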
Value Iteration Convergence
very abridged
Holds for both asynchronous and synchronous updates
Provides a reasonable stopping criterion for value iteration
Often the greedy policy converges well before the value function
14
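The slide states the result in abridged form; a common way to make it precise (assuming discount factor γ < 1) is the contraction property below, which also justifies stopping once successive iterates are close. This is a standard bound, not copied from the slide.

```latex
% Each value-iteration update is a gamma-contraction in the max norm:
\| V_{t+1} - V^* \|_\infty \le \gamma\, \| V_t - V^* \|_\infty

% Stopping criterion: a small change between iterates bounds the remaining error.
\| V_{t+1} - V_t \|_\infty < \epsilon
\quad \Longrightarrow \quad
\| V_{t+1} - V^* \|_\infty < \frac{\epsilon\, \gamma}{1 - \gamma}
```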
Value Iteration Variants
Question:
True or False: The value iteration algorithm shown below is an example of synchronous updates
15
POLICY ITERATION
16
Policy Iteration
18
Policy Iteration
Computing the value function for a fixed policy is easy
A system of |S| equations in |S| variables
Greedy policy might remain the same for a particular state if there is no better action
Greedy policy w.r.t. current value function
19
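Because evaluating a fixed policy is just that linear system of |S| equations in |S| variables, policy iteration can be sketched as follows (illustrative names; R, P, gamma as in the earlier value-iteration sketch):

```python
import numpy as np

def policy_iteration(R, P, gamma, max_iters=1000):
    """Sketch of policy iteration with exact policy evaluation."""
    S, A = R.shape
    policy = np.zeros(S, dtype=int)
    V = np.zeros(S)
    for _ in range(max_iters):
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi,
        # a system of |S| linear equations in |S| variables.
        P_pi = P[np.arange(S), policy]            # (S, S) transitions under the policy
        R_pi = R[np.arange(S), policy]            # (S,)  rewards under the policy
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        # Policy improvement: greedy policy w.r.t. the current value function.
        new_policy = (R + gamma * P @ V).argmax(axis=1)
        if np.array_equal(new_policy, policy):    # no state changes its action: done
            break
        policy = new_policy
    return V, policy
```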
Policy Iteration Convergence
In-Class Exercise:
How many policies are there for a finite sized state and action space?
In-Class Exercise:
Suppose policy iteration is shown to improve the policy at every iteration. Can you bound the number of iterations it will take to converge? If yes, what is the bound? If no, why not?
20
Value Iteration vs. Policy Iteration
• Value iteration requires O(|A| |S|^2) computation per iteration
• Policy iteration requires O(|A| |S|^2 + |S|^3) computation per iteration
• In practice, policy iteration converges in fewer iterations
21
Learning Objectives
Reinforcement Learning: Value and Policy Iteration
You should be able to…
1. Compare the reinforcement learning paradigm to other learning paradigms
2. Cast a real-world problem as a Markov Decision Process
3. Depict the exploration vs. exploitation tradeoff via MDP examples
4. Explain how to solve a system of equations using fixed point iteration
5. Define the Bellman Equations
6. Show how to compute the optimal policy in terms of the optimal value function
7. Explain the relationship between a value function mapping states to expected rewards and a value function mapping state-action pairs to expected rewards
8. Implement value iteration
9. Implement policy iteration
10. Contrast the computational complexity and empirical convergence of value iteration vs. policy iteration
11. Identify the conditions under which the value iteration algorithm will converge to the true value function
12. Describe properties of the policy iteration algorithm
22
Today’s lecture is brought to you by the letter….
Q
23
Q-LEARNING
24
Whiteboard
Q-Learning
– Motivation: What if we have zero knowledge of the environment?
– Q-Function: Expected Discounted Reward
25
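A standard way to write the Q-function named above, using the same assumed notation as the earlier Bellman equation sketch:

```latex
% Optimal state-action value: expected discounted reward for taking a in s,
% then acting optimally thereafter.
Q^*(s,a) = R(s,a) + \gamma \sum_{s'} p(s' \mid s,a)\, \max_{a'} Q^*(s',a')

% Relationship to the optimal value function and optimal policy:
V^*(s) = \max_{a} Q^*(s,a), \qquad \pi^*(s) = \operatorname*{argmax}_{a} Q^*(s,a)
```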
Example: Robot Localization
[Figure: grid world showing immediate rewards r(s,a), state values V*(s), and state-action values Q*(s,a)]
Bellman equation: consider first the case where P(s'|s,a) is deterministic
Figure from Tom Mitchell
26
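For the deterministic case considered on the slide above, writing the (assumed) successor-state function as δ(s,a), the expectation over next states collapses to a single term:

```latex
% Deterministic transitions: the next state is s' = \delta(s,a) with probability 1.
Q^*(s,a) = r(s,a) + \gamma\, V^*\!\big(\delta(s,a)\big)
         = r(s,a) + \gamma \max_{a'} Q^*\big(\delta(s,a),\, a'\big)
```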
Whiteboard
Q-Learning
– Q-Learning Algorithm
• Case 1: Deterministic Environment
• Case 2: Nondeterministic Environment
– Convergence Properties
– Exploration Insensitivity
– Ex: Re-ordering Experiences
– ε-greedy Strategy
27
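The whiteboard derivations above are not reproduced in this transcript. Below is a minimal sketch of tabular Q-learning with an ε-greedy behavior policy; the environment interface (reset()/step()) and all parameter names are hypothetical, not taken from the lecture.

```python
import numpy as np

def q_learning(env, S, A, gamma=0.9, alpha=0.1, epsilon=0.1,
               num_episodes=1000, max_steps=200):
    """Sketch of tabular Q-learning with an epsilon-greedy behavior policy."""
    Q = np.zeros((S, A))
    for _ in range(num_episodes):
        s = env.reset()                          # hypothetical environment API
        for _ in range(max_steps):
            # Epsilon-greedy: explore with probability epsilon, otherwise exploit.
            if np.random.rand() < epsilon:
                a = np.random.randint(A)
            else:
                a = int(Q[s].argmax())
            s_next, r, done = env.step(a)        # hypothetical environment API
            # TD update toward r + gamma * max_a' Q(s', a').
            target = r + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
            if done:
                break
    return Q
```

For a deterministic environment (Case 1 above) a learning rate of α = 1 suffices; for a nondeterministic environment (Case 2) the usual convergence guarantees require a suitably decaying learning rate.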