Hyperbolic Image-Text Representations
About
Visual and linguistic concepts naturally organize themselves in a hierarchy, where a textual concept "dog" entails all images that contain dogs. Although this structure is intuitive, current large-scale vision and language models such as CLIP do not explicitly capture it. We propose MERU, a contrastive model that yields hyperbolic representations of images and text. Hyperbolic spaces have geometric properties well suited to embedding tree-like data, so MERU can better capture the underlying hierarchy in image-text datasets. Our results show that MERU learns a highly interpretable and structured representation space while remaining competitive with CLIP on standard multi-modal tasks such as image classification and image-text retrieval. Our code and models are available at https://www.github.com/facebookresearch/meru
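To make the idea of a hyperbolic representation concrete, here is a minimal sketch of how a Euclidean feature vector could be lifted into hyperbolic space and compared by geodesic distance. It assumes the Lorentz (hyperboloid) model with curvature `-curv`; the abstract does not state which model of hyperbolic geometry is used, and the function names (`exp_map_origin`, `geodesic_distance`) are hypothetical, not taken from the released code.

```python
import math

def exp_map_origin(v, curv=1.0):
    """Lift a Euclidean vector v (treated as a tangent vector at the
    hyperboloid origin) onto the hyperboloid. Returns only the space
    components; the time component is recovered by time_component()."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [0.0] * len(v)
    scale = math.sinh(math.sqrt(curv) * norm) / (math.sqrt(curv) * norm)
    return [scale * x for x in v]

def time_component(x_space, curv=1.0):
    """Time component implied by the constraint <x, x>_L = -1/curv."""
    return math.sqrt(1.0 / curv + sum(x * x for x in x_space))

def lorentz_inner(x_space, y_space, curv=1.0):
    """Lorentzian inner product: space dot product minus time product."""
    space = sum(a * b for a, b in zip(x_space, y_space))
    return space - time_component(x_space, curv) * time_component(y_space, curv)

def geodesic_distance(x_space, y_space, curv=1.0):
    """Geodesic distance between two points on the hyperboloid."""
    arg = -curv * lorentz_inner(x_space, y_space, curv)
    # Clamp to the domain of acosh to absorb floating-point round-off.
    return math.acosh(max(arg, 1.0)) / math.sqrt(curv)

# Usage: lift two toy feature vectors and compare them.
image_feat = exp_map_origin([0.3, 0.4])
text_feat = exp_map_origin([0.1, 0.0])
origin = [0.0, 0.0]
dist_to_origin = geodesic_distance(origin, image_feat)
dist_pair = geodesic_distance(image_feat, text_feat)
```

Under these assumptions the exponential map preserves radial distance, so `dist_to_origin` equals the Euclidean norm of the input vector (0.5 here) up to floating-point error. A contrastive objective in this setting would use the negative geodesic distance as the similarity score instead of cosine similarity.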
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Classification | CIFAR-10 | Accuracy: 78.6 | 875 |
| Image Classification | CIFAR-100 | -- | 691 |
| Image Classification | DTD | Accuracy: 22.1 | 599 |
| Image Classification | Food-101 | Accuracy: 48.5 | 570 |
| Image Classification | EuroSAT | Accuracy: 39.1 | 569 |
| Text-to-Image Retrieval | Flickr30k (test) | -- | 525 |
| Classification | Cars | Accuracy: 5.3 | 492 |
| Image Classification | DTD | Accuracy: 22.18 | 487 |
| Image Classification | RESISC45 | Accuracy: 42.6 | 472 |
| Image-to-Text Retrieval | Flickr30k (test) | -- | 472 |