Bootstrap Your Own Latent (BYOL) is a self-supervised learning approach for image representation. From an augmented view of an image, BYOL trains an online network to predict a target network's representation of the same image under a different augmented view. Unlike contrastive methods, BYOL does not explicitly use a repulsion term built from negative pairs in its training objective, yet it avoids collapse to a trivial, constant representation. Recently, it has been hypothesized that batch normalization (BN) is critical to preventing collapse in BYOL. Indeed, BN flows gradients across batch elements and could leak information about the other views in the batch, which would act as an implicit negative (contrastive) term.
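The prediction objective described above can be sketched as a normalized mean squared error between the online network's prediction and the (stop-gradient) target projection; this is a minimal NumPy illustration of that loss, not the authors' implementation, and the function names are placeholders:

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    # Normalize each row vector to unit L2 norm.
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

def byol_loss(online_pred, target_proj):
    # Normalized MSE between the online prediction and the target
    # projection (treated as a constant, i.e. stop-gradient).
    # Equals 2 - 2 * cosine_similarity, so it lies in [0, 4].
    p = l2_normalize(online_pred)
    z = l2_normalize(target_proj)
    return np.mean(np.sum((p - z) ** 2, axis=-1))
```

Note that the loss only pulls the prediction toward the target; there is no explicit term pushing representations of different images apart, which is what makes the absence of collapse surprising.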
We experimentally show that replacing BN with a batch-independent normalization scheme (namely, a combination of group normalization and weight standardization) achieves performance comparable to vanilla BYOL (73.2% vs. 74.3% top-1 accuracy with ResNet-50 features, under the linear evaluation protocol on ImageNet). This finding disproves the hypothesis that BN is an indispensable ingredient for BYOL to learn useful representations. However, we further confirm that proper layer regularization is crucial for BYOL.
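The key property of the replacement scheme is that its statistics are computed per sample, never across the batch. The following NumPy sketch (an illustration under assumed conventions, not the paper's code) shows group normalization and weight standardization for convolutional tensors:

```python
import numpy as np

def weight_standardize(w, eps=1e-5):
    # Standardize conv filters over all axes except the output-channel
    # axis; w has shape (out_channels, in_channels, kh, kw).
    mean = w.mean(axis=(1, 2, 3), keepdims=True)
    std = w.std(axis=(1, 2, 3), keepdims=True)
    return (w - mean) / (std + eps)

def group_norm(x, num_groups, eps=1e-5):
    # x has shape (batch, channels, h, w). Statistics are computed per
    # sample and per group of channels, so no information flows between
    # batch elements -- unlike batch normalization.
    n, c, h, w = x.shape
    xg = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = xg.mean(axis=(2, 3, 4), keepdims=True)
    var = xg.var(axis=(2, 3, 4), keepdims=True)
    return ((xg - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)
```

Because `group_norm` normalizes each sample independently, the output for one image is unchanged when the rest of the batch changes, ruling out the implicit-negative-term explanation that applies to BN.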