A framework for experience replay

Experience plays a key role in reinforcement learning (RL), and how best to use this data is one of the central problems of the field. As RL agents have advanced in recent years, taking on bigger and more complex problems (Atari, Go, StarCraft, Dota), the data they generate has grown in both size and complexity. To cope with this complexity, many RL systems split the learning problem into two distinct parts, experience producers (actors) and experience consumers (learners), allowing the two to run in parallel. A data storage system often sits at the intersection of these components, and how to store and transport the data efficiently is itself a challenging engineering problem.

To address this challenge we are releasing Reverb, an efficient, extensible, and easy-to-use system for data transport and storage. One of Reverb’s strengths is its flexibility. It can be used to implement experience replay (prioritized or not), a crucial component of off-policy algorithms including Deep Q-Networks, Deep Deterministic Policy Gradients, and Soft Actor-Critic. However, Reverb can also act as a FIFO queue, enabling on-policy methods like Proximal Policy Optimization and IMPALA, and it additionally supports LIFO stacks and heap-based structures for further algorithms.
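
As a concrete illustration, the sketch below configures a single Reverb server hosting two tables: one for prioritized replay and one acting as a FIFO queue. It is based on the open-source dm-reverb Python API; the table names and parameter values are illustrative, and exact signatures may differ between releases.

```python
import reverb

# One server can host several tables, each with its own sampling,
# eviction, and rate-limiting behaviour.
server = reverb.Server(
    tables=[
        # Prioritized experience replay for off-policy learners (e.g. DQN).
        reverb.Table(
            name='prioritized_replay',
            sampler=reverb.selectors.Prioritized(priority_exponent=0.8),
            remover=reverb.selectors.Fifo(),  # evict oldest items first
            max_size=1_000_000,
            # Allow sampling as soon as one item is present; use a larger
            # minimum in real experiments.
            rate_limiter=reverb.rate_limiters.MinSize(1),
        ),
        # A FIFO queue for on-policy methods such as PPO or IMPALA:
        # each item is sampled exactly once, in insertion order.
        reverb.Table.queue(name='policy_queue', max_size=10_000),
    ],
    port=8000,
)
```

Swapping the sampler for reverb.selectors.Lifo(), reverb.selectors.MinHeap(), or reverb.selectors.MaxHeap() yields the stack- and heap-based behaviours mentioned above.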

Another strength of Reverb is its efficiency: it can be used in large-scale RL agents with many experience producers and consumers running in parallel, while adding minimal overhead. Researchers have used Reverb to manage experience storage and transport for thousands of concurrent actors and learners. This scalability (coupled with Reverb’s flexibility) frees researchers from having to worry about changing infrastructure components as they apply algorithms to problems that require different scales.
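
To sketch how producers and consumers interact, the snippet below shows an actor process writing a short trajectory while a learner process samples from the same server; any number of such clients can connect concurrently. It assumes a server like the one above is listening on localhost:8000 and again uses the dm-reverb Python client, so treat names and signatures as indicative rather than definitive.

```python
import reverb

SERVER_ADDRESS = 'localhost:8000'  # assumed address of a running Reverb server

# --- Actor process: produce experience --------------------------------------
actor_client = reverb.Client(SERVER_ADDRESS)
with actor_client.trajectory_writer(num_keep_alive_refs=2) as writer:
    for step in range(2):
        writer.append({'observation': float(step), 'reward': 1.0})
    # Package the last two steps into a single replay item.
    writer.create_item(
        table='prioritized_replay',
        priority=1.0,
        trajectory={
            'observation': writer.history['observation'][-2:],
            'reward': writer.history['reward'][-2:],
        })
    writer.flush()  # block until the server has acknowledged the item

# --- Learner process: consume experience ------------------------------------
learner_client = reverb.Client(SERVER_ADDRESS)
for sample in learner_client.sample(
        'prioritized_replay', num_samples=1, emit_timesteps=False):
    print(sample)  # keys, priorities, and the stored trajectory
```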

Additionally, Reverb provides an easy-to-use mechanism for controlling the ratio of sampled to inserted data elements. While this form of control is easy to accomplish in simple, synchronous settings, it is much harder to enforce when many experience producers and consumers run in parallel. With Reverb, users can explicitly limit or throttle the relative rate of data collection to training in their RL experiments, something that has been difficult to do until now.
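
For example, the sketch below configures a table whose rate limiter enforces a target of roughly four samples per inserted item, blocking whichever side runs too far ahead. It again assumes the dm-reverb Python API; the class and argument names reflect the open-source release, and the numbers are illustrative.

```python
import reverb

# Target roughly 4 samples per insert once the table holds at least 1,000
# items; the error buffer gives some slack before producers or consumers
# are blocked.
ratio_limiter = reverb.rate_limiters.SampleToInsertRatio(
    samples_per_insert=4.0,
    min_size_to_sample=1_000,
    error_buffer=200.0,
)

server = reverb.Server(
    tables=[
        reverb.Table(
            name='throttled_replay',
            sampler=reverb.selectors.Uniform(),
            remover=reverb.selectors.Fifo(),
            max_size=1_000_000,
            rate_limiter=ratio_limiter,
        ),
    ],
    port=8001,
)
```

When the learner gets too far ahead of data collection, its sample calls block until actors have inserted more data, and vice versa, which keeps the effective replay ratio close to the configured target.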
