Unified Scaling Laws for Routed Language Models

A language model's performance scales as a power law in its parameter count, dataset size, and allotted compute. Here we seek to understand the behaviors that govern \textit{Routing Networks}: an architecture class which removes the strict link between a model's size and its computational requirements. We show that three different techniques for training Routing Networks obey a shared set of laws governing their performance, dependent only on the total number of parameters and the floating point operations required per inference. These laws imply an axis along which Routing Networks and equivalently-powerful dense models follow a unified power law, and are used to quantitatively compare the three routing techniques described. This behavior is established through an extensive evaluation of Routing Networks -- across five orders of magnitude of model size -- including models with hundreds of experts and hundreds of billions of parameters.
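The power-law form described above can be sketched numerically. The snippet below is a minimal illustration, not the paper's actual fitting procedure: it assumes a hypothetical loss of the form $L(N, F) = c \, N^{-\alpha} F^{-\beta}$, where $N$ is the total parameter count and $F$ the FLOPs per inference, and shows that such a law is linear in log space and therefore recoverable by ordinary least squares. All coefficient values are invented for illustration.

```python
import numpy as np

# Hypothetical bilinear power law: L(N, F) = c * N^-alpha * F^-beta.
# The exponents and constant here are illustrative only.
rng = np.random.default_rng(0)
true_alpha, true_beta, true_c = 0.07, 0.05, 30.0

# Synthetic "models" spanning several orders of magnitude in size.
N = 10 ** rng.uniform(7, 11, size=200)   # total parameters
F = 10 ** rng.uniform(8, 12, size=200)   # FLOPs per inference
loss = true_c * N ** -true_alpha * F ** -true_beta

# The law is linear in log space: log L = log c - alpha*log N - beta*log F,
# so a least-squares fit on [1, log N, log F] recovers the exponents.
X = np.column_stack([np.ones_like(N), np.log(N), np.log(F)])
coef, *_ = np.linalg.lstsq(X, np.log(loss), rcond=None)
print(f"alpha = {-coef[1]:.3f}, beta = {-coef[2]:.3f}")
```

Because the synthetic data here is noiseless, the fit recovers the generating exponents exactly; on real evaluations one would fit the same log-linear model to measured losses.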

Authors' notes