Policy Gradient Methods
Rupam Mahmood CMPUT 397, Winter 2021
Policy gradient methods are emerging as general-purpose mechanisms for controlling robots
Let's study the Roomba-Pool task
● Each ball has a barcode
● Each pocket is associated with a particular barcode
● A camera is attached to the Roomba
● First goal is to recognize barcodes as soon as possible
Let's define the RL task
● The agent is not physical
● The task is composed of independent trials, or episodes
● An episode ends at a scan or after 10 seconds
● The agent executes the observe-act cycle repeatedly with a fixed cycle time of 40 ms
● Rewards are -1 at every cycle/step. The objective is the accumulated episodic reward, aka the return
● So, for a timed-out episode, the agent receives a return of … ?
Let's define the RL task
● The action is a pair of wheel velocities in [-150 mm/s, 150 mm/s]²
● An action is exerted once every 40 ms
● The robot controller executes the action until the next one arrives
● The robot controller streams sensory packets once every 15 ms; images arrive once every ~30 ms
● Observations are sampled by the agent once every 40 ms using:
Robot stream:
  ● Distance signals (6)
  ● Bumping signals (2)
  ● Last velocities (2)
Camera image
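As a rough illustration (the function and variable names are assumptions based on the counts above, not the course's code), the observation sampled every 40 ms could be assembled as:

import numpy as np

def assemble_observation(distance, bump, last_velocities, image):
    """Hypothetical sketch: stack the robot-stream signals into one vector
    and keep the camera image separate for later convolutional processing."""
    robot_stream = np.concatenate([
        np.asarray(distance, dtype=np.float32),         # 6 distance signals
        np.asarray(bump, dtype=np.float32),             # 2 bumping signals
        np.asarray(last_velocities, dtype=np.float32),  # 2 last wheel velocities
    ])
    return robot_stream, image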
Policies for continuous actions are often Gaussian
Action elements are clipped to [-1, +1]
Then scaled to [-150, 150] mm/s before being sent to the robot controller
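A minimal sketch of sampling, clipping, and scaling such a Gaussian action (names and defaults here are illustrative, not from the slides):

import numpy as np

def sample_wheel_velocities(mean, std, rng=np.random.default_rng()):
    """Sample a 2-D Gaussian action, clip each element to [-1, +1],
    then scale to wheel velocities in [-150, 150] mm/s."""
    action = rng.normal(mean, std)        # one value per wheel
    action = np.clip(action, -1.0, 1.0)   # clip to the policy's range
    return 150.0 * action                 # mm/s sent to the robot controller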
Policy-distribution parameters are obtained from a DNN
[Figure: the camera image passes through a CONV network; its features, together with the robot stream, feed an MLP that outputs the policy-distribution parameters]
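One way such a network could look in PyTorch (a sketch only; the layer sizes, names, and the choice of a state-independent log-std are assumptions, not the course's actual architecture):

import torch
import torch.nn as nn

class GaussianPolicyNet(nn.Module):
    """CONV features from the camera image, concatenated with the robot
    stream, feed an MLP that outputs the Gaussian mean; the log-std is a
    learned parameter shared across states (an assumption)."""
    def __init__(self, action_dim=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.mlp = nn.Sequential(
            nn.LazyLinear(128), nn.ReLU(),
            nn.Linear(128, action_dim),       # Gaussian mean for each wheel
        )
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, image, robot_stream):
        features = torch.cat([self.conv(image), robot_stream], dim=1)
        mean = self.mlp(features)             # Gaussian mean per action element
        return mean, self.log_std.exp()       # std shared across states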
Notations
Policy gradient methods make SGD updates by approximating the true gradient of the objective
This is often known as the likelihood ratio (LR) gradient. Many policy gradient methods, such as Reinforce, actor-critic, PPO, and ACER, are based on this gradient estimator.
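In standard notation (a sketch of the usual textbook form, not necessarily the slides' exact symbols), the LR gradient of the episodic objective J(θ) = E_{π_θ}[G_0] is

\[
\nabla_\theta J(\theta) \;=\; \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{T-1} G_t \,\nabla_\theta \log \pi_\theta(A_t \mid S_t)\right],
\qquad G_t = \sum_{k=t+1}^{T} R_k .
\]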
Notations
First policy gradient method: Reinforce
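As a reference sketch (the standard undiscounted textbook form, not necessarily the exact variant on the slide), after an episode ends, Reinforce applies the stochastic update

\[
\theta \leftarrow \theta + \alpha \, G_t \, \nabla_\theta \log \pi_\theta(A_t \mid S_t),
\qquad t = 0, 1, \ldots, T-1 .
\]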
Batch Reinforce
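A minimal PyTorch-style sketch of one batch Reinforce update (assuming a policy whose action log-probabilities were recorded while acting; all names here are illustrative):

import torch

def batch_reinforce_update(optimizer, episodes):
    """One SGD step using a batch of complete episodes.
    Each episode is a list of (log_prob, reward) pairs, where log_prob is the
    tensor returned by the policy's action distribution while acting."""
    losses = []
    for episode in episodes:
        g = 0.0
        returns = []
        for _, r in reversed(episode):       # undiscounted returns G_t
            g += r
            returns.append(g)
        returns.reverse()
        for (log_prob, _), g_t in zip(episode, returns):
            losses.append(-log_prob * g_t)   # gradient ascent on expected return
    optimizer.zero_grad()
    torch.stack(losses).mean().backward()
    optimizer.step()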
Limitations of Reinforce
● Reinforce is known for producing high-variance gradient estimates
● That results in inefficient use of samples
● Learning NN features with a single update over a batch is also inefficient
Actor and critic networks share a CNN for visuomotor learning
[Figure: the camera image passes through a shared CONV network; its features, together with the robot stream, feed separate Actor MLP and Critic MLP heads]
The Batch Actor-Critic method
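A minimal sketch of one batch actor-critic update, assuming actor and critic heads as in the figure above and precomputed returns for a batch of transitions (names and the loss weighting are assumptions):

import torch
import torch.nn.functional as F

def batch_actor_critic_update(optimizer, log_probs, values, returns):
    """One update over a batch of transitions.
    log_probs, values: tensors produced by the actor and critic heads.
    returns: observed (or bootstrapped) returns for the same states."""
    advantages = returns - values.detach()          # critic acts as a baseline
    actor_loss = -(log_probs * advantages).mean()   # lower-variance policy gradient
    critic_loss = F.mse_loss(values, returns)       # regress values toward returns
    loss = actor_loss + 0.5 * critic_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()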
Proximal Policy Optimization
For each epoch:
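A minimal sketch of what one such epoch's update can look like, using PPO's clipped surrogate objective (the helper policy.log_prob and the constant clip_eps are assumptions, not necessarily the slide's):

import torch

def ppo_epoch_update(policy, optimizer, states, actions, old_log_probs,
                     advantages, clip_eps=0.2):
    """One epoch of PPO's clipped-surrogate update over data collected
    by the old policy."""
    new_log_probs = policy.log_prob(states, actions)   # assumed helper
    ratio = torch.exp(new_log_probs - old_log_probs)   # likelihood ratio
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()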