DINO as a von Mises-Fisher mixture model
About
Self-distillation methods using Siamese networks are popular for self-supervised pre-training. DINO is one such method, based on a cross-entropy loss between $K$-dimensional probability vectors obtained by applying a softmax to the dot product between representations and learned prototypes. Since the learned representations are $L^2$-normalized, we show that DINO and its derivatives, such as iBOT, can be interpreted as a mixture model of von Mises-Fisher components. Under this interpretation, DINO assumes equal precision for all components when the prototypes are also $L^2$-normalized. Using this insight, we propose DINO-vMF, which adds appropriate normalization constants when computing the cluster assignment probabilities. Unlike DINO, DINO-vMF remains stable even for the larger ViT-Base model with unnormalized prototypes. We show that the added flexibility of the mixture model yields better image representations: the DINO-vMF pre-trained model consistently outperforms DINO on a range of downstream tasks. We obtain similar improvements for iBOT-vMF over iBOT, showing that our proposed modification is also relevant for other methods derived from DINO.
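The core modification can be sketched in a few lines. The following is a minimal, illustrative NumPy implementation, not the authors' code: it computes cluster assignment probabilities as a softmax over prototype dot products (as in DINO), and optionally adds the log-normalization constant of a von Mises-Fisher component per prototype (the vMF variant). The choice of precision $\kappa_k$ as the scaled prototype norm and the temperature value are assumptions for illustration.

```python
import numpy as np
from scipy.special import ive  # exponentially scaled modified Bessel function I_v


def log_vmf_normalizer(kappa, d):
    """log C_d(kappa) for a von Mises-Fisher distribution on the (d-1)-sphere.

    C_d(kappa) = kappa^(d/2 - 1) / ((2*pi)^(d/2) * I_{d/2-1}(kappa)).
    Uses ive(v, x) = I_v(x) * exp(-x) for numerical stability at large kappa.
    """
    v = d / 2.0 - 1.0
    return (v * np.log(kappa)
            - (d / 2.0) * np.log(2.0 * np.pi)
            - (np.log(ive(v, kappa)) + kappa))


def assignment_probs(z, W, temp=0.1, vmf=False):
    """Cluster assignment probabilities for L2-normalized features.

    z: (n, d) unit-norm representations; W: (K, d) prototype vectors.
    With vmf=False this is the plain DINO-style softmax over dot products.
    With vmf=True, each logit is shifted by the log-normalizer of a vMF
    component whose precision is the (temperature-scaled) prototype norm --
    an illustrative choice standing in for the paper's normalization.
    """
    logits = z @ W.T / temp                      # (n, K) similarity scores
    if vmf:
        d = z.shape[1]
        kappa = np.linalg.norm(W, axis=1) / temp  # per-component precision
        logits = logits + log_vmf_normalizer(kappa, d)
    logits -= logits.max(axis=1, keepdims=True)   # stabilize the softmax
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)
```

With normalized prototypes all $\kappa_k$ are equal, so the added constants cancel in the softmax and the vMF variant reduces to plain DINO; the extra flexibility only matters when prototype norms differ.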
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | ImageNet-1k (val) | -- | -- | 1453 |
| Video Object Segmentation | DAVIS 2017 (val) | J mean | 61.9 | 1130 |
| Image Classification | ImageNet-1k (val) | Top-1 Accuracy | 59.1 | 840 |
| Image Classification | CIFAR-100 | -- | -- | 622 |
| Image Classification | ImageNet-1k (val) | -- | -- | 512 |
| Image Classification | ImageNet (val) | Accuracy | 74.14 | 300 |
| Image Classification | ImageNet-1K | Accuracy | 84.1 | 190 |
| Image Retrieval | Revisited Oxford (ROxf) (Medium) | mAP | 38.1 | 124 |
| Image Retrieval | Revisited Paris (RPar) (Hard) | mAP | 39.5 | 115 |
| Image Classification | ImageNet 1K (train val) | Top-1 Accuracy | 51.6 | 107 |