Abstract:
This paper presents SimCLR, a simple framework for contrastive representation learning. SimCLR simplifies recently proposed contrastive self-supervised approaches and requires neither specialized architectures nor a memory bank. To understand what enables the contrastive prediction task to learn useful representations, we systematically study the major components of the framework. We empirically show that (1) the composition of data augmentations plays a critical role in defining the predictive tasks that enable effective representation learning, (2) introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and (3) contrastive learning benefits from larger batch sizes and more training steps than its supervised counterpart. By combining these findings, we improve considerably over previous methods for self-supervised and semi-supervised learning on ImageNet. A linear classifier trained on the representations of our best model achieves 76.5% top-1 accuracy, a 7% relative improvement over the previous state of the art. When fine-tuned on only 1% of the labels, our model achieves 85.8% top-5 accuracy, outperforming AlexNet with 100× fewer labels.
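For concreteness, the contrastive objective summarized above can be sketched as follows. This is an illustrative NumPy sketch, not the paper's reference implementation: the names projection_head and nt_xent_loss are placeholders, and it assumes encoder outputs h for 2N augmented views (two views per image), to which a learned nonlinear projection is applied before a normalized temperature-scaled cross-entropy (contrastive) loss.

# Illustrative sketch only (assumed names and shapes, not the paper's code).
import numpy as np

def projection_head(h, W1, W2):
    # Learned nonlinear transformation g(.) between the representation h
    # and the space where the contrastive loss is applied (one ReLU layer).
    return np.maximum(h @ W1, 0.0) @ W2

def nt_xent_loss(z, temperature=0.5):
    # z has shape (2N, d); rows 2k and 2k+1 are projections of two
    # augmented views of the same image (a positive pair).
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # L2-normalize
    sim = z @ z.T / temperature                        # scaled cosine similarities
    n2 = z.shape[0]
    np.fill_diagonal(sim, -np.inf)                     # exclude self-similarity
    pos_index = np.arange(n2) ^ 1                      # partner of each view: (0,1), (2,3), ...
    # Cross-entropy over the remaining 2N-1 examples, positives as targets.
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(n2), pos_index].mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, rep_dim, proj_dim = 8, 32, 16
    h = rng.normal(size=(2 * N, rep_dim))              # stand-in encoder outputs
    W1 = rng.normal(size=(rep_dim, rep_dim)) * 0.1
    W2 = rng.normal(size=(rep_dim, proj_dim)) * 0.1
    print("NT-Xent loss:", nt_xent_loss(projection_head(h, W1, W2)))

In practice the loss is computed over a large batch of augmented pairs, which is consistent with finding (3) above that contrastive learning benefits from larger batch sizes.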