Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers

Vision-text transformers have led to large improvements in language-based visual search, notably thanks to powerful cross-attention mechanisms. In practice, however, these methods are not amenable to efficient large-scale retrieval, as they require exhaustive and expensive query-image comparisons. Dual encoder (or two-stream) approaches, on the other hand, are less accurate, but they can scale to billion-scale image search using approximate nearest neighbor search techniques. In this work, we propose a generic framework to get the best of both worlds. First, we propose a distillation method to transfer the knowledge from a Slow transformer-based model into a Fast dual encoder. Second, we combine this approach with re-ranking a few retrieved examples from the distilled Fast model using the Slow model, which outperforms the Slow model alone while keeping search fast and scalable. Our approach also allows us to explore novel, finer-grained cross-attention architectures while maintaining a fast retrieval system. We validate our approach on the COCO, Conceptual Captions and Flickr datasets and show that we can reduce the inference time on these datasets by 100$\times$ while also improving retrieval performance. We also demonstrate that the same technique can be used for state-of-the-art text-to-video retrieval on the VATEX dataset.
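The retrieve-then-re-rank idea can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the toy embeddings, and the lookup table standing in for the Slow cross-attention scorer are all hypothetical. The first stage uses an exhaustive dot product where a real system would use approximate nearest neighbor search.

```python
from typing import Callable, List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot-product similarity between two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve_then_rerank(
    query_vec: Sequence[float],
    image_vecs: List[Sequence[float]],
    slow_score: Callable[[int], float],
    k: int,
) -> List[int]:
    """Return the indices of the top-k images, ordered by the Slow score.

    Stage 1 (Fast): score every image with a cheap dot product against
    precomputed embeddings and keep the k best candidates.
    Stage 2 (Slow): apply the expensive scorer only to those k candidates.
    """
    fast_scores = [dot(query_vec, v) for v in image_vecs]
    shortlist = sorted(
        range(len(image_vecs)), key=lambda i: fast_scores[i], reverse=True
    )[:k]
    return sorted(shortlist, key=slow_score, reverse=True)


# Toy data (assumption): 2-D embeddings for four images and one text query.
image_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2]]
query_vec = [1.0, 0.0]

# Hypothetical stand-in for the Slow model: a fixed score per image index.
slow_table = {0: 0.2, 1: 0.9, 2: 1.0, 3: 0.5}

ranked = retrieve_then_rerank(query_vec, image_vecs, slow_table.__getitem__, k=2)
print(ranked)
```

In this toy example the Slow scorer is called only for the two shortlisted images, so its cost scales with k rather than with the size of the collection. Image 2 is never scored by the Slow stand-in because the Fast stage does not shortlist it, which shows why the quality of the distilled Fast model matters.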