Jethro's Braindump

Self-supervised Learning

Prediction is the Essence of Intelligence

Self-supervised Learning vs Other Learning Paradigms

In self-supervised learning, the machine predicts any part of its input for any observed part. For example, it may predict future frames in videos. This results in a lot of feedback.

In Reinforcement Learning ⭐, the machine predicts a scalar reward given weak feedback once in a while. Since there is very little feedback, it seems impossible to learn any complex representations in a short amount of time.

See LeCun’s Cake Analogy.


Exemplar-CNN creates surrogate training dataset with unlabeled image patches. Surrogate classes are created by applying image transformations, such as rotation.

Another class of tasks involve extracting multiple patches from a single image, and asking the model to predict the relationship between the patches.

The model can also be tasked to perform colorization: coloring a grayscale input image.


Any visual representation learned for the same object across close frames should be close in the latent feature space. For example, applying a triplet loss on patches of motion (compared to a random patch).

Another common task is to validate the order of frames. Training frames can be sampled from high-motion windows in order (a,b,c,d,e). Positive tuples such as (b,c,d) can be formed, and negative tuples like (b,a,d), (b,e,d) can be created.