Random Feature Attention

Transformers are state-of-the-art models for a variety of sequence modeling tasks. The core of a transformer is an attention function which models interactions between every token pair. While attention is powerful, it does not scale efficiently to long sequences due to its quadratic (in the number of tokens) time and space complexity. Our work builds upon recent advances on approximating attention with a kernel function. We propose RFF-Attn, a linear time attention that uses random Fourier features to approximate a Gaussian kernel. We show how to use it to model causal attention, cross attention, and self attention in a transformer. We also present an extension that incorporates recency bias into RFF-Attn with a gating mechanism. Experiments on language modeling and machine translation demonstrate that our models achieve comparable or better performance compared to a strong transformer baseline. In the machine translation experiment, RFF-Attn decodes twice as fast as a vanilla transformer. Our analysis indicates that the speed up and memory footprint reduction obtained from using RFF-Attn increases with the length of the sequence. We consider RFF-Attn as a drop in replacement to the standard softmax attention for tasks that require working with long sequences, fast decoding speed, or low memory footprints.

Authors' notes