Real-world data is high-dimensional: a book, image, or musical performance can easily contain hundreds of thousands of elements even after compression. However, the most commonly used autoregressive models, Transformers, are prohibitively expensive to scale to the number of inputs and layers needed to capture this long-range structure. We develop Perceiver AR, an autoregressive, modality-agnostic architecture which uses cross-attention to map long-range inputs to a small number of latents while also maintaining end-to-end causal masking. Perceiver AR can directly attend to over a hundred thousand tokens, enabling practical long-context density estimation without the need for hand-crafted sparsity patterns or memory mechanisms. When trained on images or music, Perceiver AR generates outputs with clear long-term coherence and structure. Our architecture also obtains state-of-the-art likelihood on long-sequence benchmarks, including 64 x 64 ImageNet images and PG-19 books.
Over the last few years, autoregressive Transformers have brought a steady stream of breakthroughs in generative modeling. These models generate each element of a sample – the pixels of an image, the characters of a text (typically in “token” chunks), the samples of an audio waveform, and so on – by predicting one element after the other. When predicting the next element, the model can look back at those that were created earlier.
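To make that loop concrete, here is a minimal sketch of autoregressive sampling. The `model_logits` function is a hypothetical stand-in for any trained next-element predictor, not Perceiver AR's actual interface:

```python
import jax
import jax.numpy as jnp

def sample_autoregressively(model_logits, prompt, num_steps, rng):
    """Generate `num_steps` new elements, one after the other.

    `model_logits` is a hypothetical function that maps the sequence
    generated so far to logits over the next element.
    """
    sequence = list(prompt)
    for _ in range(num_steps):
        # The model may look back at every element produced so far.
        logits = model_logits(jnp.asarray(sequence))
        rng, key = jax.random.split(rng)
        sequence.append(int(jax.random.categorical(key, logits)))
    return sequence
```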
However, each of a Transformer’s layers grows more expensive as the input gets longer (for self-attention, the cost grows quadratically with the number of elements), and practitioners can only afford to train deep Transformers on sequences of at most about 2,048 elements. And so, most Transformer-based models ignore all elements beyond the most recent past (around 1,500 words or 1/6 of a small image) when making a prediction.
In contrast, our recently developed Perceiver models give excellent results on a variety of real-world tasks with inputs of up to around 100,000 elements. Perceivers use cross-attention to encode the full input into a small latent space, decoupling the compute needed to process the input from the model’s depth: nearly every layer spends a fixed cost, regardless of input size.
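As a rough sketch of that cross-attention step (a single-head illustration with made-up names such as `cross_attend`, not the released Perceiver code), a small latent array queries a much longer input, and everything downstream only ever sees `num_latents` vectors:

```python
import jax
import jax.numpy as jnp

def cross_attend(latents, inputs, w_q, w_k, w_v):
    """Single-head cross-attention: latents (queries) read from inputs (keys/values).

    latents: [num_latents, d_model]  -- small, fixed-size learned array
    inputs:  [input_len, d_model]    -- may contain ~100,000 elements
    Output:  [num_latents, d_head]   -- size is independent of input_len
    """
    q = latents @ w_q                          # [num_latents, d_head]
    k = inputs @ w_k                           # [input_len, d_head]
    v = inputs @ w_v                           # [input_len, d_head]
    scores = q @ k.T / jnp.sqrt(q.shape[-1])   # [num_latents, input_len]
    weights = jax.nn.softmax(scores, axis=-1)
    return weights @ v                         # [num_latents, d_head]
```

Because the deep stack of latent self-attention layers operates only on these `num_latents` vectors, its cost does not change as the input grows.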
While latent-space encoding handles all elements in a single pass, autoregressive generation assumes processing happens one element at a time. To address this problem, Perceiver AR proposes a simple solution: align the latents one by one with the final elements of the input, and carefully mask the input so latents see only earlier elements.
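One way to picture this (a hedged sketch under our own conventions, not the released implementation): assign each latent to one of the last `num_latents` positions of the input, and mask the cross-attention scores so that a latent aligned with position t can only attend to input positions up to t.

```python
import jax
import jax.numpy as jnp

def causal_cross_attend(latents, inputs, w_q, w_k, w_v):
    """Cross-attention with each latent aligned to one of the final input
    positions, attending only to positions at or before its own.

    latents: [num_latents, d_model], inputs: [input_len, d_model],
    with num_latents <= input_len.
    """
    num_latents, input_len = latents.shape[0], inputs.shape[0]
    q = latents @ w_q
    k = inputs @ w_k
    v = inputs @ w_v
    scores = q @ k.T / jnp.sqrt(q.shape[-1])        # [num_latents, input_len]

    # Latent i is aligned with input position input_len - num_latents + i.
    latent_pos = jnp.arange(num_latents) + (input_len - num_latents)
    input_pos = jnp.arange(input_len)
    causal = input_pos[None, :] <= latent_pos[:, None]

    scores = jnp.where(causal, scores, -1e30)       # block later positions
    weights = jax.nn.softmax(scores, axis=-1)
    return weights @ v
```

Whether a latent also sees its own aligned position or only strictly earlier ones depends on how the prediction targets are shifted; the key point is that no latent can read elements it will later be asked to predict, so the model stays causally masked end to end.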
The result is an architecture (shown above) that can attend to inputs as much as 50x longer than standard Transformers can handle, while deploying as widely (and essentially as easily) as standard decoder-only Transformers.
In real wall-clock terms, Perceiver AR scales considerably better with model size than both standard Transformers and Transformer-XL across a range of sequence lengths. This property allows us to build very effective long-context models. For example, we find that a 60-layer Perceiver AR with a context length of 8,192 outperforms a 42-layer Transformer-XL on a book-length generation task, while running faster.
On standard, long-context image (ImageNet 64x64), language (PG-19), and music (MAESTRO) generation benchmarks, Perceiver AR produces state-of-the-art results. Increasing input context by decoupling input size from compute budget leads to several intriguing results:
Using a dataset of piano music, we trained Perceiver AR to generate new pieces of music from scratch. Because each new note is predicted based on the full sequence of notes that came before, Perceiver AR is able to produce pieces with a high level of melodic, harmonic, and rhythmic coherence.
Learn more about using Perceiver AR:
See the Google Magenta blog post with more music!