Self-supervised video pretraining yields strong image representations

Videos contain far more information than still images. Yet pretraining on image datasets has remained the dominant paradigm for learning representations that capture spatial information, and previous attempts at video pretraining have fallen short on image understanding benchmarks. In this work we revisit self-supervised learning of image representations from the dynamic evolution of video frames. We find that a suitably but minimally curated video dataset, coupled with a contrastive objective that encourages learning combinations of spatial and temporal invariances, is sufficient to produce frame-based models that perform surprisingly well on a variety of downstream image-based scene understanding tasks. Additionally, we find that video pretraining scales considerably better with model capacity than image pretraining, closing the gap between the two on semantic segmentation (PASCAL and ADE20K) and object detection (COCO). Together, these results present video pretraining as a general solution for learning visual representations.