Reinforcement Learning (RL) provides an elegant formalization for the problem of intelligence.
In combination with advances in deep learning and increases in computation, this formalization has resulted in powerful solutions to longstanding artificial intelligence challenges — e.g. playing Go at a championship level. We believe it also offers an avenue for solving some of our greatest challenges: from drug design to industrial and space robotics, or improving energy efficiency in a variety of applications.
However, in this pursuit, the scale and complexity of RL programs has grown dramatically over time. This has made it increasingly difficult for researchers to rapidly prototype ideas, and has caused serious reproducibility issues. To address this, we are launching Acme — a tool to increase reproducibility in RL and simplify the ability of researchers to develop novel and creative algorithms.
Acme is a framework for building readable, efficient, research-oriented RL algorithms. At its core Acme is designed to enable simple descriptions of RL agents that can be run at various scales of execution — including distributed agents. By releasing Acme, our aim is to make the results of various RL algorithms developed in academia and industrial labs easier to reproduce and extend for the machine learning community at large. See the GitHub repository here.
Overall, the high-level goals of Acme are as follows:
In order to enable these goals, the design of Acme also bridges the gap between large-, medium-, and small-scale experiments. We have done so by carefully thinking about the design of agents at many different scales.
At the highest level, we can think of Acme as a classical RL interface (found in any introductory RL text) which connects an actor (i.e. an action-selecting agent) to an environment. This actor is a simple interface which has methods for selecting actions, making observations, and updating itself. Internally, learning agents further split the problem up into an “acting” and a “learning from data” component. Superficially, this allows us to re-use the acting portions across many different agents. However, more importantly this provides a crucial boundary upon which to split and parallelize the learning process. We can even scale down from here and seamlessly attack the batch RL setting where there exists no environment and only a fixed dataset. Illustrations of these different levels of complexity are shown below:
This design allows us to easily create, test, and debug novel agents in small-scale scenarios before scaling them up — all while using the same acting and learning code. Acme also provides a number of useful utilities from checkpointing, to snapshotting, to low-level computational helpers. These tools are often the unsung heroes of any RL algorithm, and in Acme we strive to keep them as simple and understandable as possible.
To enable this design Acme also makes use of Reverb: a novel, efficient data storage system purpose built for machine learning (and reinforcement learning) data. Reverb is primarily used as a system for experience replay in distributed reinforcement learning algorithms, but it also supports other data structure representations such as FIFO and priority queues. This allows us to use it seamlessly for on- and off-policy algorithms. Acme and Reverb were designed from the beginning to play nicely with one another, but Reverb is also fully usable on its own, so go check it out!
Along with our infrastructure, we are also releasing single-process instantiations of a number of agents we have built using Acme. These run the gamut from continuous control (D4PG, MPO, etc.), discrete Q-learning (DQN and R2D2), and more. With a minimal number of changes — by splitting across the acting/learning boundary — we can run these same agents in a distributed manner. Our first release focuses on single-process agents as these are the ones mostly used by students and research practitioners.
While additional results are readily available in our [arxiv paper], we show a few plots comparing the performance of a single agent (D4PG) when measured against both actor steps and wall clock time for a continuous control task. Due to the way in which we limit the rate at which data is inserted into replay — refer to the paper for a more in-depth discussion — we can see roughly the same performance when comparing the rewards an agent receives versus the number of interactions it has taken with the environment (actor steps). However, as the agent is further parallelized we see gains in terms of how fast the agent is able to learn. On relatively small domains, where the observations are constrained to small feature spaces, even a modest increase in this parallelization (4 actors) results in an agent that takes under half the time to learn an optimal policy:
But for even more complex domains where the observations are images that are comparatively costly to generate we see much more extensive gains:
And the gains can be even bigger still for domains such as Atari games where the data is more expensive to collect and the learning processes generally take longer. However, it is important to note that these results share the same acting and learning code between both the distributed and non-distributed setting. So it is perfectly feasible to experiment with these agents and results at a smaller scale — in fact this is something we do all the time when developing novel agents!
For a more detailed description of this design, along with further results for our baseline agents, see our paper. Or better yet, take a look at our GitHub repository to see how you can start using Acme to simplify your own agents!