chen20_simpl_framew_contr_learn_visual_repres: A simple framework for contrastive learning of visual representations

SimCLR is a simple framework for Contrastive Methods of visual representations.

A simple framework for contrastive learning of visual representations

We do not train the model with a memory bank

Rather than train with a memory bank, they use a large batch size, and the LARS Optimizer to stabilize training.

Key Contributions

Composition of data augmentation to form positive pairs
introduce a learnable non-linear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations
Contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning

Data Augmentation

A stochastic data augmentation module is introduced to produce two correlated views of the same example, denoted \(\tilde{x}_i\) and \(\tilde{x}_j\), which is considered a positive pair. Some of these augmentations include:

random cropping
random color distortions
random Gaussian blur

A neural network encoder \(f(\cdot)\) extracts representation vectors from augmented data examples.

A small network projection head \(g(\cdot)\) maps representations to the space where contrastive loss is applied.

The loss function (normalized temperature-scaled cross entropy loss) is applied on the output of \(g(\cdot)\).

A minibatch of N examples is sampled, resulting in \(2N\) data-points. The other 2(N-1) augmented examples within the minibatch is used as negative examples.

\begin{equation} \ell_{i, j}=-\log \frac{\exp \left(\operatorname{sim}\left(\boldsymbol{z}_{i}, \boldsymbol{z}_{j}\right) / \tau\right)}{\sum_{k=1}^{2 N} \mathbb{1}_{[k \neq i]} \exp \left(\operatorname{sim}\left(\boldsymbol{z}_{i}, \boldsymbol{z}_{k}\right) / \tau\right)} \end{equation}

The Importance of the Projection Head

It is conjectured that the projection head \(g(\cdot)\) is important due to loss of information induced by the contrastive loss. \(z = g(h)\) is trained to be invariant to the data transformation. Thus \(g\) can remove information that may be useful for the downstream task, such as color or orientation of objects.

<biblio.bib>