Multimodal Few-Shot Learning with Frozen Language Models

When trained at sufficient scale, auto-regressive language models exhibit the notable ability to learn a new language task after being prompted with just a few examples of that task. Here, we present a simple yet effective approach for transferring this few-shot learning ability to a multimodal setting (vision and language). Using a comparatively small amount of aligned image and caption data, we train a vision encoder to represent each image as a sequence of continuous token embeddings, such that a pre-trained, frozen language model prompted with this prefix generates the appropriate caption. The resulting system is a multimodal few-shot learner, with the surprising ability to learn a variety of new tasks when conditioned on an arbitrary sequence of interleaved image and text embeddings. By measuring a single model on a variety of established benchmarks from the multimodal machine learning community, we demonstrate that it can rapidly learn words for new objects and novel visual categories, answer questions about them, and make use of outside knowledge.
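To make the prefix idea concrete, the following is a minimal sketch of how an interleaved image and text prompt might be assembled as one sequence of embeddings. It is an illustration under stated assumptions, not the implementation used in this work. The vision encoder is stood in for by an untrained random linear map, the language model's embedding table is random, and all names and sizes (`D_MODEL`, `N_PREFIX`, `FEATURE_DIM`, `VOCAB_SIZE`, `image_to_prefix`, `build_prompt`) are hypothetical. No training or language model forward pass is shown.

```python
import random

# All sizes below are hypothetical, chosen only to keep the example small.
D_MODEL = 8       # embedding width of the (frozen) language model
N_PREFIX = 2      # number of prefix embeddings produced per image
FEATURE_DIM = 4   # size of the toy image feature vector
VOCAB_SIZE = 10   # size of the toy text vocabulary


def make_linear(in_dim, out_dim, rng):
    """Return a random (out_dim x in_dim) weight matrix as nested lists."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_linear(weights, vec):
    """Matrix-vector product using plain Python lists."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def image_to_prefix(image_features, vision_weights):
    """Stand-in for the vision encoder: map image features to N_PREFIX
    vectors, each with the same width as a text token embedding."""
    flat = apply_linear(vision_weights, image_features)
    return [flat[i * D_MODEL:(i + 1) * D_MODEL] for i in range(N_PREFIX)]


def embed_text(token_ids, embedding_table):
    """Look up text token embeddings (these stay fixed in a frozen model)."""
    return [embedding_table[t] for t in token_ids]


def build_prompt(segments, vision_weights, embedding_table):
    """Concatenate interleaved image and text segments into one sequence."""
    sequence = []
    for kind, payload in segments:
        if kind == "image":
            sequence.extend(image_to_prefix(payload, vision_weights))
        elif kind == "text":
            sequence.extend(embed_text(payload, embedding_table))
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return sequence


rng = random.Random(0)
vision_weights = make_linear(FEATURE_DIM, N_PREFIX * D_MODEL, rng)
embedding_table = [[rng.uniform(-1.0, 1.0) for _ in range(D_MODEL)]
                   for _ in range(VOCAB_SIZE)]

# A toy few-shot prompt: image, caption tokens, second image, start of a query.
segments = [
    ("image", [0.1, 0.2, 0.3, 0.4]),
    ("text", [1, 2, 3]),
    ("image", [0.5, 0.1, 0.0, 0.9]),
    ("text", [4, 5]),
]
prompt = build_prompt(segments, vision_weights, embedding_table)
```

Because image prefixes and text embeddings share the same width, a language model that consumes embeddings can process the combined sequence without architectural changes. In the approach described above, only the vision encoder's parameters would be updated during training, while the language model's weights remain fixed.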