Broaden Your Views for Self-Supervised Video Learning

Most successful self-supervised learning methods are trained to align the representations of two independent views of the data. State-of-the-art methods in video are inspired by image techniques, where these two views are similarly extracted by cropping and augmenting the resulting crop. However, those methods miss a crucial element of the video domain: time. In this paper, we introduce BraVe, a novel framework for self-supervised learning in which one of the views has access to a narrow temporal window of the video while the other view has broad access to the video content. Our models learn to generalise from the narrow view to the general content of the video. Furthermore, because the two views are processed by different backbones, the broad view can use alternative augmentations or modalities, such as optical flow, randomly convolved RGB frames, or even audio. We demonstrate that BraVe achieves state-of-the-art results in representation learning on video alone and on audio and video together, on standard video and audio classification benchmarks including UCF101, HMDB51, ESC-50 and AudioSet.
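The narrow/broad view idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `sample_views` and `regression_loss`, the clip lengths, the independent choice of start frames, and the normalised squared-error objective are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def sample_views(video, narrow_len, broad_len, rng):
    """Cut a short (narrow) and a long (broad) clip from one video.

    `video` has shape (T, H, W, C). Start frames are drawn independently
    for the two clips; this is an assumption, not a detail from the paper.
    """
    num_frames = video.shape[0]
    if not (0 < narrow_len <= broad_len <= num_frames):
        raise ValueError("need 0 < narrow_len <= broad_len <= number of frames")
    narrow_start = int(rng.integers(0, num_frames - narrow_len + 1))
    broad_start = int(rng.integers(0, num_frames - broad_len + 1))
    narrow = video[narrow_start:narrow_start + narrow_len]
    broad = video[broad_start:broad_start + broad_len]
    return narrow, broad


def regression_loss(prediction, target):
    """Squared distance between L2-normalised embeddings.

    Hypothetical stand-in for the alignment objective: the embedding
    predicted from the narrow view is regressed onto the broad-view one.
    """
    p = prediction / np.linalg.norm(prediction, axis=-1, keepdims=True)
    t = target / np.linalg.norm(target, axis=-1, keepdims=True)
    return np.sum((p - t) ** 2, axis=-1)


rng = np.random.default_rng(0)
video = np.zeros((64, 8, 8, 3), dtype=np.float32)  # dummy 64-frame video
narrow, broad = sample_views(video, narrow_len=16, broad_len=48, rng=rng)
```

In a full pipeline, each clip would be passed through its own backbone, and the broad clip could first be converted to another modality such as optical flow or audio before being embedded.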
