Learning Representations by Maximizing Mutual Information Across Views
About
We propose an approach to self-supervised representation learning based on maximizing mutual information between features extracted from multiple views of a shared context. For example, one could produce multiple views of a local spatio-temporal context by observing it from different locations (e.g., camera positions within a scene), and via different modalities (e.g., tactile, auditory, or visual). Or, an ImageNet image could provide a context from which one produces multiple views by repeatedly applying data augmentation. Maximizing mutual information between features extracted from these views requires capturing information about high-level factors whose influence spans multiple views -- e.g., presence of certain objects or occurrence of certain events. Following our proposed approach, we develop a model which learns image representations that significantly outperform prior methods on the tasks we consider. Most notably, using self-supervised learning, our model learns representations which achieve 68.1% accuracy on ImageNet using standard linear evaluation. This beats prior results by over 12% and concurrent results by 7%. When we extend our model to use mixture-based representations, segmentation behaviour emerges as a natural side-effect. Our code is available online: https://github.com/Philip-Bachman/amdim-public.
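As a rough illustration of what "maximizing mutual information between features extracted from multiple views" can look like in practice, here is a minimal NumPy sketch of a contrastive (InfoNCE-style) objective. The abstract does not say which estimator is used, so the choice of InfoNCE, the cosine-similarity scoring, the `temperature` value, and the function names `info_nce_loss` and `mi_lower_bound` are all assumptions made for illustration. This is not the code from the linked repository.

```python
import numpy as np


def info_nce_loss(feats_a, feats_b, temperature=0.1):
    """InfoNCE-style loss between two batches of view features (illustrative).

    feats_a[i] and feats_b[i] are assumed to come from two views of the same
    context (e.g., two augmentations of one image); every other row in the
    batch serves as a negative sample.
    """
    # L2-normalize so that the dot product is a cosine similarity.
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)

    # Pairwise similarity scores, shape (n, n); the diagonal holds positives.
    logits = a @ b.T / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Cross-entropy with the matching view as the correct "class".
    return -np.mean(np.diag(log_probs))


def mi_lower_bound(loss, batch_size):
    """Lower bound on mutual information implied by InfoNCE: log(n) - loss."""
    return np.log(batch_size) - loss


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    context = rng.normal(size=(8, 16))
    # Two noisy "views" that share the same underlying context.
    view_a = context + 0.01 * rng.normal(size=context.shape)
    view_b = context + 0.01 * rng.normal(size=context.shape)
    loss = info_nce_loss(view_a, view_b)
    print("loss:", loss, "MI lower bound (nats):", mi_lower_bound(loss, 8))
```

Minimizing this loss pushes features of views from the same context together and features from different contexts apart. The bound can never exceed log(n) for a batch of n pairs, which is one reason contrastive methods tend to benefit from many negative samples.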
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | ImageNet-1k (val) | Top-1 Accuracy | 68.1 | 1453 |
| Image Classification | ImageNet (val) | Top-1 Acc | 68.1 | 1206 |
| Image Classification | CIFAR-10 (test) | Accuracy | 93.1 | 906 |
| Image Classification | ImageNet 1k (test) | Top-1 Accuracy | 68.1 | 798 |
| Image Classification | CIFAR-10 | Accuracy | 91.2 | 471 |
| Image Classification | STL-10 (test) | Accuracy | 93.8 | 357 |
| Image Classification | ImageNet (val) | Top-1 Accuracy | 67.4 | 354 |
| Image Classification | ImageNet (test) | -- | -- | 235 |
| Image Classification | ImageNet 1% labeled | Top-5 Accuracy | 67.4 | 118 |
| Image Classification | ImageNet (10% labels) | -- | -- | 98 |