Synthesizing Optimal Parallelism Placement and Reduction Strategies on Hierarchical Systems for Deep Learning

We present a novel characterization of the mapping of multiple forms of parallelism onto hierarchical device systems that greatly reduces the space of software-to-hardware mappings while taking the structure of the hierarchical system into account. We show that one can expect performance differences of more than 448x between the best and the worst placements for all-reduce operations. We further present a novel syntax-guided program synthesis framework that decomposes reductions over one or more parallelism axes into sequences of collective operations in a hierarchy-aware way. We show that our approach can -- for many interesting parallelism placements and user-requested reductions -- synthesize programs that outperform the default all-reduce implementation when evaluated on two different hierarchical GPU systems. Finally, we introduce a simulation framework that achieves more than 90% top-10 accuracy (validated across hundreds of programs and parallelism placement strategies) and complements our enumerative synthesis tool.
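To make the decomposition idea concrete, here is a minimal, illustrative sketch (pure Python over nested lists, not the paper's synthesizer and not a real collective library): a global all-reduce rewritten as an intra-node reduce-scatter, an inter-node all-reduce among same-rank peers, and an intra-node all-gather. This is one candidate program of the kind a hierarchy-aware synthesizer could enumerate; all function and variable names are our own.

```python
def hierarchical_all_reduce(data):
    """Hierarchy-aware all-reduce sketch.

    data[n][d] is the vector held by local device d of node n; each
    vector has length k (one chunk per local device). Returns the same
    layout with every device holding the elementwise global sum.
    """
    num_nodes = len(data)
    k = len(data[0])  # devices per node == number of chunks

    # 1) Intra-node reduce-scatter: within each node, local rank j
    #    becomes the owner of the node-local partial sum of chunk j.
    owned = [[sum(data[n][d][j] for d in range(k)) for j in range(k)]
             for n in range(num_nodes)]

    # 2) Inter-node all-reduce among same-rank peers: chunk j is summed
    #    across all nodes, so each owner now holds its global chunk.
    global_chunks = [sum(owned[n][j] for n in range(num_nodes))
                     for j in range(k)]

    # 3) Intra-node all-gather: every device collects all k chunks,
    #    reconstructing the full globally reduced vector.
    return [[list(global_chunks) for _ in range(k)]
            for _ in range(num_nodes)]
```

On real hardware, step 1 and step 3 use fast intra-node links (e.g. NVLink) while only step 2 crosses the slower inter-node network, which is why such decompositions can beat a flat all-reduce.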

Authors' notes