Launchpad: a programming model for distributed machine learning research

Modern numerical frameworks such as TensorFlow, PyTorch, and JAX have contributed significantly to recent advances in machine learning; however, building complicated distributed systems with these tools can prove difficult. This is largely because distributed communication is often implicit, which can obscure program flow and make such algorithms hard to reason about. To address this we have built Launchpad, a programming model that simplifies the process of defining and launching instances of distributed computation.

The fundamental concept in Launchpad is the program, which represents computation as a directed graph of service nodes, with edges denoting communication between nodes. By making this graph representation explicit, Launchpad makes it easy both to design a program's topology and to modify it later in a single script, something that would not be possible were the system defined in a more decentralised or implicit fashion. Further, by clearly separating the program definition, given by its graph data structure, from the launching mechanism, Launchpad can launch the same distributed system on different platforms (e.g. on a single machine, on a cloud provider, or on a self-hosted cluster).

Interacting with Launchpad is a simple matter of creating an empty program and incrementally adding nodes to it. As each node is added, the system returns a handle that can be used as a reference to that service. Passing a handle to another node at its creation time forms an edge in the communication graph. Launching the program is then as simple as passing it to a given launching mechanism, e.g. by calling lp.launch(program). By implementing different launchers we can also target different backends. The version we are releasing now runs the computation as separate processes on a single machine; however, by modifying or building upon launch we are able to (and do!) use the same mechanism to launch on many machines at once.
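To make the pattern concrete, here is a toy sketch of it in plain, self-contained Python. The `Program`, `Node`, `Handle`, and `launch` classes below are hypothetical stand-ins written for illustration; the real library is imported as `launchpad as lp` and its API differs in the details:

```python
# Toy sketch of the Launchpad pattern: a program is a graph of nodes,
# and the handles returned by add_node form the communication edges.
# All names here are illustrative stand-ins, not the real Launchpad API.

class Handle:
    """A reference to a node; passing it to another node forms an edge."""
    def __init__(self, label):
        self.label = label

class Node:
    """Wraps a constructor plus the arguments to build a service with."""
    def __init__(self, constructor, *args):
        self.constructor = constructor
        self.args = args

class Program:
    """A named, initially empty graph of service nodes."""
    def __init__(self, name):
        self.name = name
        self.nodes = {}

    def add_node(self, node, label):
        self.nodes[label] = node
        return Handle(label)  # handle usable as a reference to this service

def launch(program):
    """A single-process 'launcher': instantiate every node locally."""
    return {label: node.constructor(*node.args)
            for label, node in program.nodes.items()}

# Usage: a server node, and a client node holding a handle to the server.
class Server:
    def ping(self):
        return 'pong'

class Client:
    def __init__(self, server_handle):
        self.server_handle = server_handle

program = Program('hello_launchpad')
server_handle = program.add_node(Node(Server), label='server')
program.add_node(Node(Client, server_handle), label='client')
services = launch(program)
```

Swapping `launch` for a different implementation, while leaving the program definition untouched, is exactly what lets the same graph run on other backends.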

Edges in a Launchpad program denote that communication occurs between two nodes, but not necessarily how it happens; that is ultimately up to the nodes themselves. Launchpad does, however, have a little more to say about this communication mechanism: the communication edges are exposed at runtime as calls to a gRPC service. To make this process easier, Launchpad also provides the CourierNode (named for our internal gRPC wrapper). This particular node implementation, given the constructor of a Python class, creates an instance of that class and exposes its public methods to clients as a gRPC service. This node type bridges the gap between local and distributed computation, since from the perspective of the launched class the two look one and the same.
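The idea can be sketched with an in-process proxy standing in for the gRPC service: an ordinary class is instantiated from its constructor and only its public methods are exposed to callers. `CourierClient` below is a hypothetical name for illustration, not the library's actual wrapper:

```python
# Toy sketch of the CourierNode idea: given a class constructor, build an
# instance and expose its public methods to clients. A plain in-process
# proxy stands in for the real gRPC service; names are illustrative.

class Counter:
    """An ordinary Python class; its public methods become the service API."""
    def __init__(self):
        self._value = 0

    def increment(self, amount=1):
        self._value += amount
        return self._value

class CourierClient:
    """Proxy exposing only the public methods of a wrapped instance."""
    def __init__(self, constructor, *args):
        instance = constructor(*args)
        for name in dir(instance):
            # Skip private attributes; forward only public callables.
            if not name.startswith('_') and callable(getattr(instance, name)):
                setattr(self, name, getattr(instance, name))

# From the caller's perspective, the call looks like an ordinary local one:
client = CourierClient(Counter)
client.increment()       # -> 1
client.increment(5)      # -> 6
```

In the real system the calls on the right-hand side would travel over gRPC, but the launched class and its callers are written exactly as if everything were local.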

We have only just scratched the surface! For more information about Launchpad, take a look at our GitHub repository for all of the code as well as helpful examples and documentation. Finally, as mentioned above, the reason for Launchpad's existence is to make complicated distributed systems simple to write and read. To see some examples of this, take a look at Acme, a library purpose-built to use Launchpad for distributed reinforcement learning agents.

Work by the DeepMind machine learning and distributed computing teams, and Google Brain.

Open Source

22 Apr 2021