Where Should I Spend My FLOPS? Efficiency Evaluations of Visual Pre-training Methods

The past few years have witnessed an explosion of strong self-supervised methods, which have achieved remarkable success in reaching parity with supervised pre-training. Much prior work has added large numbers of contrastive views, trained for long schedules, and scaled to larger models in order to drive up absolute accuracy on downstream tasks. In this work, we examine a related but largely orthogonal question: given a fixed FLOP budget, what are the best datasets, models, and (self-)supervised training methods for achieving high downstream performance on representative visual tasks? This setting is often more realistic for both academic and industry labs. We examine five large-scale datasets (JFT-300M, ALIGN, ImageNet-1K, ImageNet-21K, and COCO) and six pre-training methods (CLIP, DINO, SimCLR, BYOL, Masked Autoencoding, and Supervised), and characterize their FLOP and CO2 footprints relative to their absolute performance on a common image segmentation task. From this analysis, we advocate paying closer attention to (1) dataset quality and curation and (2) accuracy gains in the context of FLOP usage, and we question the commonly held hypothesis that current self-supervised methods are inherently scalable.