author: specfazhou
score: 8 / 10
- What is the core idea?
This paper proposes a simple framework for contrastive learning of visual representations that not only achieves state-of-the-art accuracy but also has a simpler structure than previous works, which require specialized architectures or a memory bank. The results of SimCLR are shown in the following figure.
- How is it realized (technically)?
The architecture of SimCLR is shown in the following figure.
As the figure shows, for each data sample \(x\), the model first generates two views of the sample by sequentially applying three data augmentation methods: random cropping, random color distortion, and random Gaussian blur. The model then applies an encoder \(f(\cdot)\); various encoders are possible, but in this paper the authors use ResNet. The projection head \(g(\cdot)\) is an MLP with one hidden layer and a ReLU activation. The projection head is needed because it maps the representations into the space where the contrastive loss is applied, so that the two augmented views of the same image are pulled close together (their contrastive loss is small). In addition, the authors train with the LARS optimizer, and training usually takes several hours on TPUs.
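To make this concrete, here is a minimal sketch (not the authors' code) of the augmentation pipeline, the ResNet encoder \(f(\cdot)\), and the MLP projection head \(g(\cdot)\), assuming PyTorch and torchvision are available; the class name `SimCLRModel`, the projection dimension, and the exact augmentation parameters are illustrative assumptions rather than the paper's settings.

```python
import torch.nn as nn
import torchvision.transforms as T
from torchvision.models import resnet50

# Two random "views" of each image: random crop, color distortion, Gaussian blur.
# Parameter values here are placeholders, not the paper's exact configuration.
simclr_augment = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.RandomApply([T.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),
    T.RandomGrayscale(p=0.2),
    T.RandomApply([T.GaussianBlur(kernel_size=23)], p=0.5),
    T.ToTensor(),
])

class SimCLRModel(nn.Module):
    """ResNet encoder f(.) followed by a one-hidden-layer MLP projection head g(.)."""
    def __init__(self, proj_dim=128):
        super().__init__()
        backbone = resnet50()                     # randomly initialized ResNet-50
        feat_dim = backbone.fc.in_features        # 2048 for ResNet-50
        backbone.fc = nn.Identity()               # keep h = f(x), drop the classifier
        self.encoder = backbone
        self.projection = nn.Sequential(          # g(.): one hidden layer + ReLU
            nn.Linear(feat_dim, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, proj_dim),
        )

    def forward(self, x):
        h = self.encoder(x)                       # representation used downstream
        z = self.projection(h)                    # embedding the contrastive loss sees
        return h, z
```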
The training procedure is illustrated in the following figure.
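As a rough illustration of a single training step, the sketch below implements an NT-Xent-style contrastive loss over a batch of paired views and one gradient update; the names `nt_xent_loss` and `train_step` are illustrative, and an ordinary PyTorch optimizer stands in for the LARS optimizer used in the paper.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """Normalized temperature-scaled cross-entropy over a batch of paired views.
    z1, z2: [N, d] projections of the two augmented views of the same N images."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)       # 2N x d, unit length
    sim = z @ z.t() / temperature                            # scaled cosine similarities
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float('-inf'))               # exclude self-pairs
    # For row i, the positive is the other augmented view of the same image.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

def train_step(model, optimizer, images):
    """One step: augment each image twice, encode, project, minimize NT-Xent.
    `model` and `simclr_augment` reuse the previous sketch; `images` is a list of PIL images."""
    x1 = torch.stack([simclr_augment(img) for img in images])
    x2 = torch.stack([simclr_augment(img) for img in images])
    _, z1 = model(x1)
    _, z2 = model(x2)
    loss = nt_xent_loss(z1, z2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```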
- How well does the paper perform?
- A larger batch size has a clear advantage over a smaller one when the number of training epochs is small, but this advantage disappears as the number of epochs increases, as shown in the following figure. (SimCLR also benefits more from larger batch sizes than supervised learning does.)
- Data augmentation methods are crucial for SimCLR.
- Contrastive learning needs stronger data augmentation than supervised learning.
- Deeper and wider models improve the performance of contrastive learning.
- The choice of projection head is very important.
- SimCLR achieves state-of-the-art results on ImageNet classification.
- SimCLR achieves competitive results on 12 natural image classification datasets under both linear evaluation and fine-tuning.