Less can be more in contrastive learning

Pretraining representations on abundantly available unlabeled data can reduce the reliance of deep neural networks on costly labeled data. Contrastive methods have recently emerged as the most successful strategy for unsupervised representation learning. They shape representations by contrasting a datapoint against many other datapoints, called negatives; usually, the negatives are the whole batch excluding the datapoint whose representation is being learned. Empirically, large batch sizes have been observed to be necessary for good performance. In this work, we aim to better understand the role of negatives in contrastive learning. We disentangle the number of negatives from the batch size. Surprisingly, with a fixed batch size, we observe empirically that performance increases as the number of negatives decreases. We examine three potential causes for this behaviour and conclude that the most likely explanation is the improved gradient dynamics that arise with fewer negatives.
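To make concrete what decoupling the number of negatives from the batch size can look like, the following is a minimal sketch and not the authors' implementation. It assumes an InfoNCE-style objective over two views of each datapoint and assumes negatives are drawn uniformly without replacement from the rest of the batch; the abstract specifies neither the loss nor the sampling scheme. The function name `info_nce_subsampled` and the parameters `num_negatives` and `temperature` are hypothetical.

```python
import numpy as np


def info_nce_subsampled(z_a, z_b, num_negatives, temperature=0.1, rng=None):
    """InfoNCE-style loss in which each anchor is contrasted against only
    `num_negatives` negatives sampled from the other items in the batch.

    z_a, z_b: arrays of shape (batch_size, dim); row i of z_b is the
    positive for row i of z_a (assumed two-view setup).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    n = z_a.shape[0]
    if not 1 <= num_negatives <= n - 1:
        raise ValueError("num_negatives must be between 1 and batch_size - 1")

    # L2-normalise so that dot products are cosine similarities.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)

    # logits[i, j] compares anchor i with candidate j; the diagonal holds positives.
    logits = z_a @ z_b.T / temperature

    losses = np.empty(n)
    for i in range(n):
        # Candidate negatives: every batch item except the anchor's own positive.
        others = np.delete(np.arange(n), i)
        neg_idx = rng.choice(others, size=num_negatives, replace=False)
        # Positive logit first, followed by the sampled negative logits.
        row = np.concatenate(([logits[i, i]], logits[i, neg_idx]))
        row = row - row.max()  # shift for numerical stability
        # Cross-entropy with the positive as the target class.
        losses[i] = np.log(np.exp(row).sum()) - row[0]
    return losses.mean()


# Usage: the batch size stays at 8 while the number of negatives varies.
demo_rng = np.random.default_rng(42)
view_a = demo_rng.normal(size=(8, 16))
view_b = view_a + 0.1 * demo_rng.normal(size=(8, 16))
loss_all_negatives = info_nce_subsampled(view_a, view_b, num_negatives=7)
loss_two_negatives = info_nce_subsampled(view_a, view_b, num_negatives=2)
```

Setting `num_negatives` to `batch_size - 1` recovers the usual setup in which the whole batch except the anchor serves as negatives. Note that the loss values for different numbers of negatives are not directly comparable, since fewer negatives shrink the softmax denominator; downstream performance is the quantity the abstract refers to.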

Authors' notes