What Can We Learn from LLMs? Self-Supervised Learning

Self-supervised learning (SSL) has been the key to the success of LLMs. It benefits the most from the sea of Internet data because it uses data as it is, with no need for annotations or labels. Particularly when the training data is scaled up, the power of transformers is unleashed, as the scaling laws predict. This is terrain where only self-supervised learning can operate, connecting the two beasts of data and weights. (A supervised transformer sounds like a fallacy to me.)

ssl-data-network

Self-supervised learning is so far the best way of combining massive data with huge networks.

Representation Learning

The major outcome of self-supervised learning is powerful and meaningful latent representations for raw tokens, learned purely from the distribution of massive data. Knowing the distribution of a dataset is equivalent to knowing the correlations (joint probabilities) among its elements. This is why the attention mechanism does such a good job in this realm (attention is correlation, IMO). And when the data collection is large enough, it contains sufficient combinations of tokens for the hungry transformer network to learn from.
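To make the "attention is correlation" point concrete, here is a minimal sketch of scaled dot-product self-attention in plain NumPy. The attention matrix it returns is just a softmax-normalized map of pairwise token affinities; all names and shapes are illustrative, not taken from any particular model.

```python
# Minimal sketch: scaled dot-product self-attention as a token-to-token
# "correlation" map. Shapes and names are illustrative only.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    """tokens: (n_tokens, d_model); Wq/Wk/Wv: (d_model, d_head)."""
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    # Each entry of `scores` measures how strongly two tokens co-vary
    # in the learned latent space -- the "correlation" discussed above.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    attn = softmax(scores, axis=-1)          # each row sums to 1
    return attn @ V, attn                    # mixed values + the correlation map

# Toy usage: 4 tokens with 8-dim embeddings and an 8-dim head
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, attn = self_attention(x, Wq, Wk, Wv)
print(attn.round(2))  # 4x4 matrix of pairwise token affinities
```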

Learning a good latent space is so important that we often call an encoder/auto-encoder a foundation model if it is trained well enough to enable downstream tasks such as decoding, diffusion, or multi-modality alignment. Two approaches have proven successful: contrastive learning and generative learning. Contrastive learning requires constructing similar and dissimilar pairs from data samples. Although prone to collapse and instability issues, it is commonly used for multimodal alignment (such as CLIP) and translation tasks, where the data naturally form pairs.
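As a concrete illustration of the pairing idea, below is a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss, assuming we already have L2-normalized image and text embeddings for a batch of matched pairs. The names (clip_style_loss, image_emb, text_emb, temperature) are my own placeholders, not CLIP's actual code.

```python
# Minimal sketch of a CLIP-style contrastive objective over a batch of
# matched image/text pairs. Names and values are illustrative.
import numpy as np

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (batch, dim); row i of each is a positive pair."""
    logits = image_emb @ text_emb.T / temperature   # (batch, batch) similarities
    targets = np.arange(len(logits))                # positives lie on the diagonal
    # Symmetric cross-entropy: pull matched pairs together and push all
    # other (dissimilar) pairs in the batch apart.
    loss_i = -log_softmax(logits, axis=1)[targets, targets].mean()
    loss_t = -log_softmax(logits, axis=0)[targets, targets].mean()
    return (loss_i + loss_t) / 2

# Toy usage with random unit vectors standing in for encoder outputs
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 16)); img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.normal(size=(4, 16)); txt /= np.linalg.norm(txt, axis=1, keepdims=True)
print(clip_style_loss(img, txt))
```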

Generative or Contrastive

Personally, generative learning is more attractive to me than contrastive learning, as it works in a way much like how people think. If a network is able to generate meaningful things within, or even without, context as we do, I would acknowledge that it has intelligence like ours. The learning scheme is usually realized through pretext tasks, among which I like inpainting the most.

Inpainting Task

Predicting a masked word in a sentence, as defined by BERT, is the same as the cloze tests I did in high-school English exams. Visually, imagining the missing parts of a scene is a task our brains excel at. The pretraining job of completing an image from a masked input has been applied to single-view images (MAE), multi-view images (CroCo), and videos (VideoMAE), enabling monocular vision, 3D vision, and temporal vision tasks, respectively. It still surprises me that these networks can fill in meaningful visual content even when more than 80% of the image is missing (e.g., MAE and CroCo).

ssl-inpainting

Inpainting capability is where intelligence emerges.
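To show how simple the inpainting pretext task is to set up, here is a rough MAE-style sketch: split the image into patches, hide a large fraction of them, and compute a reconstruction loss only on the hidden patches. The encoder_decoder argument is a stand-in for a real transformer; everything here is an assumption-laden toy, not the official MAE implementation.

```python
# Rough MAE-style masking sketch: patchify, drop ~80% of patches, and score
# the reconstruction only on the masked patches. Toy code, not real MAE.
import numpy as np

def patchify(img, patch=16):
    """img: (H, W, C) -> (num_patches, patch*patch*C)."""
    H, W, C = img.shape
    p = img.reshape(H // patch, patch, W // patch, patch, C)
    return p.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)

def random_mask(num_patches, mask_ratio=0.8, rng=None):
    rng = rng or np.random.default_rng()
    n_keep = int(num_patches * (1 - mask_ratio))
    perm = rng.permutation(num_patches)
    return perm[:n_keep], perm[n_keep:]          # visible ids, masked ids

def masked_reconstruction_loss(img, encoder_decoder, mask_ratio=0.8):
    patches = patchify(img)
    visible, masked = random_mask(len(patches), mask_ratio)
    # The network only sees the visible patches and must predict the rest.
    pred = encoder_decoder(patches[visible], visible, masked)
    # The loss is computed on the masked patches only.
    return np.mean((pred - patches[masked]) ** 2)

# Toy usage with a dummy "network" that predicts the mean visible patch
dummy_net = lambda vis, vis_ids, masked_ids: np.tile(vis.mean(0), (len(masked_ids), 1))
img = np.random.default_rng(0).normal(size=(64, 64, 3))
print(masked_reconstruction_loss(img, dummy_net))
```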

Conclusion

In conclusion, the scale of data will continue to grow, and so will the number of network parameters. Self-supervised learning with pretext tasks such as inpainting is the only way to make both parties work together. This lesson has already been learned by image and video diffusion model pretraining. But it's just the beginning. Computer vision encompasses many more tasks beyond diffusion, and the same practice should be transferred to all the other tasks to achieve the same success.