
Hyperbolic Image-Text Representations

About

Visual and linguistic concepts naturally organize themselves in a hierarchy, where a textual concept "dog" entails all images that contain dogs. Despite being intuitive, current large-scale vision and language models such as CLIP do not explicitly capture such hierarchy. We propose MERU, a contrastive model that yields hyperbolic representations of images and text. Hyperbolic spaces have suitable geometric properties to embed tree-like data, so MERU can better capture the underlying hierarchy in image-text datasets. Our results show that MERU learns a highly interpretable and structured representation space while being competitive with CLIP's performance on standard multi-modal tasks like image classification and image-text retrieval. Our code and models are available at https://www.github.com/facebookresearch/meru

Karan Desai, Maximilian Nickel, Tanmay Rajpurohit, Justin Johnson, Ramakrishna Vedantam • 2023

Related benchmarks

Task                     Dataset           Result          Rank
Image Classification     CIFAR-100         --              691
Image Classification     EuroSAT           Accuracy 39.1   569
Image Classification     Food-101          Accuracy 48.5   542
Image Classification     DTD               Accuracy 22.1   542
Image Classification     CIFAR-10          Accuracy 78.6   508
Image Classification     DTD               Accuracy 22.18  485
Text-to-Image Retrieval  Flickr30k (test)  --              445
Image Classification     SUN397            Accuracy 49.59  441
Classification           Cars              Accuracy 5.3    395
Image-to-Text Retrieval  Flickr30k (test)  --              392
Showing 10 of 47 rows
