Contrastive Self-Supervised Learning Techniques
This article surveys the contrastive self-supervised learning techniques published over the last couple of years. It discusses three things: 1) the pretext tasks commonly used in a contrastive learning setup, 2) the different architectures that have been proposed, and 3) a performance comparison across downstream tasks such as image classification, object detection, and action recognition. Finally, the article discusses some of the limitations of contrastive learning and future directions.
Introduction
Why do we need self-supervised learning?
Deep learning techniques have touched almost every industry due to their ability to learn rich patterns from data. Industries such as medical, manufacturing, automobile, and fintech have successfully adopted deep learning and are continuously exploring newer horizons. Today, there are algorithms available to perform various tasks in computer vision (CV), such as image classification, object recognition, image segmentation, and image generation, as well as in natural language processing (NLP), such as sentence classification, emotion recognition, sentence completion, and language translation. However, these successes are due to supervised learning approaches, which demand the availability of labeled data. In supervised learning approaches, the data features and their labels or annotations are required to train deep neural network (DNN) models. Supervised learning approaches have almost reached their saturation due to the high cost and labor required to generate and maintain annotated datasets. Although there is plenty of data available in the real world, it is tedious to annotate the data at the same scale to make it useful for supervised learning.
This compels the exploration of alternative learning techniques that do not require all of the available data to be annotated. This is where self-supervised learning comes in. In short, self-supervised learning converts an unsupervised learning task into a supervised learning task by deriving the labels from the data itself. For instance, suppose we have images of animals without their labels. To learn from this dataset, we could rotate each image by a random angle and then train the model to predict the rotation applied to each image. This turns the problem into a supervised learning task, and the model starts to learn the patterns in the images.
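The rotation example can be sketched in a few lines. This is a minimal, dependency-free illustration, not a reference implementation: images are assumed to be small nested lists, rotations are restricted to multiples of 90 degrees, and the names `rotate90` and `make_rotation_dataset` are hypothetical.

```python
import random

def rotate90(image, k):
    """Rotate a 2D grid (list of rows) clockwise by k * 90 degrees."""
    for _ in range(k % 4):
        # Reversing the rows and transposing gives one clockwise quarter turn.
        image = [list(row) for row in zip(*image[::-1])]
    return image

def make_rotation_dataset(images, rng):
    """Turn unlabeled images into (rotated_image, rotation_label) pairs."""
    dataset = []
    for image in images:
        k = rng.randrange(4)  # pseudo label: 0, 1, 2, 3 quarter turns
        dataset.append((rotate90(image, k), k))
    return dataset

# Two tiny "images" without any human-provided label.
unlabeled = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
pairs = make_rotation_dataset(unlabeled, random.Random(0))
```

Each pair can then be fed to an ordinary classifier with four output classes; the label costs nothing because it is produced by the transformation itself.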
Self-Supervised Contrastive Learning
Contrastive learning is a discriminative approach that aims to group similar images together and keep dissimilar images apart. In this approach, each image is first randomly augmented, and the model is then trained to place the original image and its augmented version close together while placing the original image far away from the rest of the images. Figure 1 shows the basic intuition behind the contrastive learning paradigm.
In order to achieve contrastive learning, a similarity metric is used to measure how close the representations or embeddings of two data items are. As a first step, a sample from the dataset is retrieved and an augmented version of the sample is generated using some standard augmentation techniques. During the training process, the embedding of the augmented data is considered the positive sample, and the embeddings of the rest of the data items in the batch are considered negative samples. Using a contrastive loss, the positive samples are brought closer together and the negative samples are pushed farther away. By doing this, the model learns to generate quality representations of the data items that can be used later for downstream tasks. This is also referred to as knowledge transfer. Figure 2 shows the pipeline used for contrastive learning.
In recent years, several methods such as SwAV, MoCo, and SimCLR have demonstrated results comparable to state-of-the-art supervised methods on the ImageNet dataset (Figure 3).
Pretext Tasks
In self-supervised learning, pretext tasks are tasks used to learn representations of the data using pseudo labels. These pseudo labels are usually generated from attributes of the data. Once the pretext task is done, the resulting model can be further used for downstream tasks such as classification, segmentation, detection, etc. In the context of pretext tasks in contrastive learning, the original image is called the anchor image, and its transformed (or augmented) version is called a positive sample. The rest of the images and their augmented versions are called negative samples. The four major categories of pretext tasks are color transformation, geometric transformation, context-based tasks, and cross-model-based tasks.
Color Transformation
Color transformation involves basic adjustments of the color levels in an image, such as blurring, color distortion, converting to grayscale, etc.
Geometric Transformation
Geometric transformation involves altering the basic geometry of the image without changing the actual pixel values. The transformations include scaling, random cropping, flipping, etc.
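As a rough illustration of these two categories, the sketch below implements one color transformation (grayscale conversion) and two geometric transformations (horizontal flip and random crop). It assumes images are nested lists of (r, g, b) tuples; the function names are hypothetical, and real pipelines would normally rely on an image library instead.

```python
import random

def to_grayscale(image):
    """Color transformation: replace each (r, g, b) pixel by its mean intensity."""
    return [[sum(pixel) // 3 for pixel in row] for row in image]

def horizontal_flip(image):
    """Geometric transformation: mirror every row; pixel values are unchanged."""
    return [row[::-1] for row in image]

def random_crop(image, size, rng):
    """Geometric transformation: cut a size x size window at a random position."""
    top = rng.randrange(len(image) - size + 1)
    left = rng.randrange(len(image[0]) - size + 1)
    return [row[left:left + size] for row in image[top:top + size]]

# A 3 x 3 RGB image stored as nested lists of (r, g, b) tuples.
image = [[(r * 30, c * 30, 90) for c in range(3)] for r in range(3)]
gray = to_grayscale(image)
flipped = horizontal_flip(image)
cropped = random_crop(image, 2, random.Random(0))
```

Any of these outputs could serve as the positive sample for the anchor `image`.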
Context-Based
Jigsaw puzzle
In the jigsaw puzzle method, the image is cut into patches, and the position of the patches is changed. With respect to contrastive learning, the original image and the scrambled image act as positive pairs, and the rest of the images act as negative samples.
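A minimal sketch of how such a scrambled positive sample might be produced, assuming a square image stored as a list of rows whose side length is divisible by the grid size. The helper names are illustrative only.

```python
import random

def split_into_patches(image, grid):
    """Cut a square image (list of rows) into grid x grid equally sized patches."""
    step = len(image) // grid
    return [
        [row[c * step:(c + 1) * step] for row in image[r * step:(r + 1) * step]]
        for r in range(grid)
        for c in range(grid)
    ]

def assemble(patches, grid):
    """Stitch grid x grid patches (row-major order) back into one image."""
    image = []
    for r in range(grid):
        row_patches = patches[r * grid:(r + 1) * grid]
        for i in range(len(row_patches[0])):
            image.append([value for patch in row_patches for value in patch[i]])
    return image

def jigsaw_positive(image, grid, rng):
    """Return a scrambled copy of the image and the permutation that was applied."""
    patches = split_into_patches(image, grid)
    order = list(range(len(patches)))
    rng.shuffle(order)
    return assemble([patches[i] for i in order], grid), order

anchor = [[r * 4 + c for c in range(4)] for r in range(4)]
scrambled, order = jigsaw_positive(anchor, 2, random.Random(0))
```

Here `anchor` and `scrambled` would form the positive pair, while other images in the batch would supply the negatives.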
Frame order based
This approach applies to sequential data such as time-series data or a video with a series of image frames. The basic idea is to take the original data item and jumble up the sequence of frames to generate the positive sample. The rest of the data items and their transformations act as negative samples.
Future prediction
This approach is again applicable to data that extends through time such as sensory data, audio, and video data. The high-level idea is to predict the future value of the data given the past values. The past values and the future values act as positive samples.
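One simple way to build such past and future pairs is a sliding window over the sequence. The sketch below assumes a one-dimensional series stored as a Python list; `past_future_pairs` and its parameters are hypothetical names chosen for illustration.

```python
def past_future_pairs(series, past_len, future_len):
    """Slide a window over the series and split it into (past, future) parts."""
    pairs = []
    for start in range(len(series) - past_len - future_len + 1):
        split = start + past_len
        pairs.append((series[start:split], series[split:split + future_len]))
    return pairs

signal = [0.1, 0.4, 0.9, 1.6, 2.5, 3.6]
pairs = past_future_pairs(signal, past_len=3, future_len=1)
# pairs[0] is ([0.1, 0.4, 0.9], [1.6])
```

Each past segment and the future segment that follows it would be treated as a positive pair.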
Cross Model-Based
This approach is also known as view prediction. It is applicable to data that have multiple views of the same scene. For a given instant of time, the images of the scene taken from different angles act as positive samples, and the images taken at different time steps act as negative samples.
Pretext Tasks in NLP
Self-supervised learning research has been active in the NLP space as well. Various approaches have been proposed to convert the text data into representations using a large corpus of unlabeled texts. This section discusses the different pretext tasks for NLP.
Center and neighbor word prediction
In the center word prediction pretext task, a sentence with a fixed word length and a missing center word is given as input to the model. The model is trained to predict the missing word. In the neighbor word prediction approach, a single word is given as input and the model is trained to predict the neighboring words. Both of these approaches can be used to convert text into embeddings that can be further used for downstream tasks.
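The training examples for both tasks can be generated directly from raw text. The sketch below is an illustration under simple assumptions (whitespace tokenization, a symmetric context window); these two setups are commonly known as CBOW and skip-gram in word2vec, and the function names are hypothetical.

```python
def center_word_examples(tokens, window):
    """Center word prediction: context words on both sides -> missing center word."""
    examples = []
    for i in range(window, len(tokens) - window):
        context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
        examples.append((context, tokens[i]))
    return examples

def neighbor_word_examples(tokens, window):
    """Neighbor word prediction: a single word -> each word within the window."""
    examples = []
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        examples.extend((word, tokens[j]) for j in range(lo, hi) if j != i)
    return examples

tokens = "the cat sat on the mat".split()
center = center_word_examples(tokens, window=1)
neighbor = neighbor_word_examples(tokens, window=1)
# center[0] is (["the", "sat"], "cat")
```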
Next sentence prediction
Similar to the previous approach, in the next sentence prediction pretext task the model is trained to predict whether two sentences are consecutive or not. The most famous model that uses this technique is BERT. BERT has demonstrated that this approach significantly improves performance on downstream tasks that require an understanding of sentence relationships.
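A hedged sketch of how labeled sentence pairs might be derived from an ordered document. The 50/50 split between consecutive and non-consecutive pairs and the choice to draw the non-consecutive sentence from the same document are assumptions made for this illustration, not a description of any particular model's data pipeline.

```python
import random

def next_sentence_examples(sentences, rng):
    """Build (sentence_a, sentence_b, is_next) examples from an ordered document."""
    if len(sentences) < 3:
        raise ValueError("need at least three sentences to sample a negative")
    examples = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive example: the true next sentence.
            examples.append((sentences[i], sentences[i + 1], 1))
        else:
            # Negative example: any sentence other than the current or the next one.
            candidates = [j for j in range(len(sentences)) if j not in (i, i + 1)]
            examples.append((sentences[i], sentences[rng.choice(candidates)], 0))
    return examples

document = ["It rained.", "The street flooded.", "Cars stalled.", "Help arrived."]
examples = next_sentence_examples(document, random.Random(0))
```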
Auto-regressive language model
This method involves predicting the next word given the previous words. A sequence of words is provided as input to the model, and the model predicts the next word. This technique is used to train the GPT model.
Sentence permutation
This technique is similar to the image scrambling technique. A series of sentences is taken and their order is randomly changed. The model is trained to predict the correct order of the sentences. BART, a recent model, used this task.
Architectures
The main aspect of a contrastive learning method is the way in which negative samples are collected for a given positive sample. The learned representation of the images improves as the number of negative samples increases. Collecting the negative samples can be viewed as a dictionary look-up task. Broadly, the architectures can be categorized into four groups, each of which is discussed in the following sections.
End-to-End Learning
An end-to-end learning system is a complex system that employs gradient-based learning and in which all components are differentiable. This architecture works well only when a large number of negative samples is available. In a batch of images, only an image and its augmented versions form positive pairs, and the rest of the images in the batch are considered negative samples. The original image is passed through an encoder called the query encoder, and the augmented image is passed through another encoder called the key encoder. These two encoders are trained by bringing the representations of positive pairs closer to each other and pushing the representations of negative samples farther away. The most common similarity metric is cosine similarity, which is simply the dot product of two vectors normalized by their lengths. The weights of both encoders are updated during training by backpropagating the gradients. The most popular model proposed in this category is SimCLR, which uses a batch size of 4096. Since a large batch size is needed, this category is limited by GPU memory and hence is not scalable.
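The role of the batch can be illustrated with a small similarity matrix. In this sketch the embeddings are made-up two-dimensional vectors standing in for the outputs of the query and key encoders; entry [i][j] compares query i with key j, so the diagonal holds the positive pairs and every off-diagonal entry is a negative.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(queries, keys):
    """Entry [i][j] compares query i with key j; the diagonal holds the positives."""
    return [[cosine_similarity(q, k) for k in keys] for q in queries]

# Hypothetical encoder outputs for a batch of three images.
queries = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # original images
keys = [[0.9, 0.1], [0.1, 0.9], [-0.9, -0.1]]     # their augmented versions
sims = similarity_matrix(queries, keys)
```

With a batch of N images, each query sees only N - 1 negatives, which is why this family of methods tends to depend on very large batches.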
Using Memory Bank
The need for a large batch size to maintain a large number of negative samples can be eliminated by using a memory bank. The main aim of the memory bank is to maintain the negative sample representations. It is a store that holds the feature representation of every image in the dataset. Each representation is a moving average and is updated every time the data item is seen in an epoch. By doing this, in every epoch the entire memory bank, accumulated over the previous epochs, is available as a source of negative samples. PIRL is one of the recent methods that use memory banks to store the image representations. However, it can be complicated to maintain the memory bank and update the representations, as the representations can become outdated within a few epochs.
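A minimal sketch of such a store, assuming zero-initialized entries and an exponential moving average with a hypothetical momentum of 0.5. The class and method names are illustrative and do not mirror any specific published implementation.

```python
class MemoryBank:
    """Stores one representation per dataset item, updated as a moving average."""

    def __init__(self, num_items, dim, momentum=0.5):
        self.momentum = momentum
        # Assumption: every entry starts at zero until the item is first seen.
        self.features = [[0.0] * dim for _ in range(num_items)]

    def update(self, index, embedding):
        """Blend the stored representation with the newly computed embedding."""
        m = self.momentum
        old = self.features[index]
        self.features[index] = [m * o + (1 - m) * e for o, e in zip(old, embedding)]

    def negatives(self, index):
        """All stored representations except the one for the current item."""
        return [f for i, f in enumerate(self.features) if i != index]

bank = MemoryBank(num_items=3, dim=2, momentum=0.5)
bank.update(0, [1.0, 0.0])   # first epoch
bank.update(0, [1.0, 2.0])   # second epoch
# bank.features[0] is now [0.75, 1.0]
```

Because an entry is refreshed only when its image is visited, entries for images seen long ago were computed by an older version of the encoder, which is the staleness problem mentioned above.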
Using a Momentum Encoder
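In this architecture, the memory bank is replaced by a separate module called the momentum encoder. The momentum encoder produces a dictionary organized as a queue: the encoded keys of the current mini-batch are enqueued and the keys of the oldest mini-batch are dequeued. The momentum encoder has the same architecture as the query (Q) encoder, but it is not updated by backpropagation; instead, its parameters are updated as a moving average of the Q encoder's parameters, as in MoCo. The advantage of this architecture is that only one encoder needs to be trained by backpropagation and a computationally expensive memory bank is not required.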
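The two ingredients used in MoCo-style methods, a queue of keys and a slowly moving key encoder, can be sketched as follows. Encoder parameters are represented as a flat list of floats for simplicity, the default coefficient `m=0.999` follows the value reported for MoCo, and the class and function names are hypothetical.

```python
from collections import deque

def momentum_update(key_params, query_params, m=0.999):
    """Move the key encoder's parameters slowly toward the query encoder's."""
    return [m * k + (1 - m) * q for k, q in zip(key_params, query_params)]

class KeyQueue:
    """Fixed-size dictionary of keys: newest mini-batch in, oldest mini-batch out."""

    def __init__(self, max_batches):
        self.batches = deque(maxlen=max_batches)

    def enqueue(self, key_batch):
        # A full deque with maxlen silently drops its oldest batch on append.
        self.batches.append(key_batch)

    def negatives(self):
        """Flatten the stored mini-batches into one list of negative keys."""
        return [key for batch in self.batches for key in batch]

queue = KeyQueue(max_batches=2)
for batch in (["k0", "k1"], ["k2", "k3"], ["k4", "k5"]):
    queue.enqueue(batch)
# queue.negatives() is ["k2", "k3", "k4", "k5"]

key_params = momentum_update([0.0, 1.0], [1.0, 1.0], m=0.9)
```

Only the query encoder receives gradients; the key encoder simply trails it, which keeps the keys in the queue reasonably consistent with one another.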
Clustering Feature Representations
The architectures discussed so far use instance-based contrastive learning to generate useful representations. In these approaches, each image is treated as a distinct class and its representation is pushed away from the representations of all other images. This may not always be desirable. For example, when there are multiple cat images and car images in the dataset, the representations of the cat images should be closer to each other than to those of the car images. To accomplish this, the clustering approach follows an end-to-end design with two encoders that share parameters, but instead of the instance-based contrastive objective it uses a clustering algorithm to group similar features together. SwAV is one of the recent methods that employ this architecture. Figure 5 shows the difference between instance-based contrastive learning and cluster-based contrastive learning.
Encoders
In the context of self-supervised learning, an encoder is a network that takes the original data sample as input and outputs the representation in a latent space. These representations can be further used for downstream tasks. Depending on the problem, the output of the encoder is either upsampled or downsampled. Most of the time, ResNet-50 is used as an encoder in the literature.
Training
To train the encoder, a pretext task uses a contrastive loss function and backpropagates the gradients to update the encoder parameters. The contrastive loss behaves in such a way that the positive samples get closer in the latent space and the negative samples move farther apart. Cosine similarity is commonly used as a measure of closeness between the representations. Cosine similarity is defined as the cosine of the angle between two vectors.
The contrastive loss function, which compares the embeddings, is shown in Figure 8. This is called Noise Contrastive Estimation (NCE).
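Since the figure is not reproduced here, the sketch below shows one commonly used form of this loss (often called InfoNCE): the negative log of the softmax probability that the query assigns to its positive key among all keys, with cosine similarity scaled by a temperature. The temperature value of 0.1 and the function names are assumptions for illustration, and the exact form in Figure 8 may differ.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def contrastive_loss(query, positive, negatives, temperature=0.1):
    """Negative log of the softmax probability assigned to the positive key."""
    logits = [cosine_similarity(query, positive) / temperature]
    logits += [cosine_similarity(query, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_total - logits[0]

query = [1.0, 0.0]
# Positive close to the query, negatives far away: small loss.
loss_good = contrastive_loss(query, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
# Positive far from the query, a negative close to it: large loss.
loss_bad = contrastive_loss(query, [-1.0, 0.0], [[0.9, 0.1], [0.0, 1.0]])
```

Minimizing this quantity raises the similarity of the positive pair relative to the negatives, which matches the behavior described above.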
Downstream Tasks
Once the pretext task is performed in a self-supervised learning pipeline, the learning is transferred to a downstream task. Downstream tasks are application-specific tasks such as classification, segmentation, detection, etc.
The process of transferring the learnings from pretext tasks to a downstream task is called knowledge transfer and this is the central idea behind self-supervised learning.
Shortcomings of self-supervised learning
Although recent research has greatly closed the gap in performance between self-supervised learning algorithms and supervised learning algorithms, there are several things that need to be addressed before adopting self-supervised learning into mainstream deep learning tasks.
Lack of Theoretical Foundation
The performance of the self-supervised learning algorithm is highly dependent upon the pretext task used during the training process and the sampling scheme used. Furthermore, there is not much theoretical analysis on the different modules of self-supervised learning algorithms.
Proper Negative Samples During Training
Self-supervised contrastive learning suffers when the distance between the positive samples and negative samples is small, resulting in a lack of contribution to the loss function. This ultimately makes it hard for the training process to converge. Techniques such as using a larger batch size or a larger memory bank are currently used to deal with this problem. However, such tricks increase the computational cost and complexity of the training process.
Dataset Bias
In self-supervised learning, the data itself provides supervision. Hence the learned representations are heavily influenced by the underlying training data. This makes the model less general as compared to supervised learning counterparts.
Conclusion
This article covers the entire self-supervised learning pipeline and the different architectures proposed in the literature. It also covers the different pretext task categories employed to train the encoder. Finally, it discusses some of the shortcomings and future directions of self-supervised learning techniques.