Hyperbolic Image-Text Representations

About

Visual and linguistic concepts naturally organize themselves in a hierarchy, where a textual concept "dog" entails all images that contain dogs. Despite being intuitive, current large-scale vision and language models such as CLIP do not explicitly capture such hierarchy. We propose MERU, a contrastive model that yields hyperbolic representations of images and text. Hyperbolic spaces have suitable geometric properties to embed tree-like data, so MERU can better capture the underlying hierarchy in image-text datasets. Our results show that MERU learns a highly interpretable and structured representation space while being competitive with CLIP's performance on standard multi-modal tasks like image classification and image-text retrieval. Our code and models are available at https://www.github.com/facebookresearch/meru
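To make the geometric claim concrete, here is a minimal sketch of how distances behave in a hyperbolic space. The abstract does not state which model of hyperbolic geometry is used, so this sketch assumes the Lorentz (hyperboloid) model with curvature -c. The function names (`exp_map_origin`, `time_component`, `lorentz_distance`) are hypothetical and are not taken from the MERU codebase. It is an illustration of the geometry, not the authors' implementation.

```python
import math


def exp_map_origin(v, curv=1.0):
    """Lift a Euclidean vector (a tangent vector at the hyperboloid origin)
    onto the hyperboloid. Returns only the space components; the time
    component is recovered by `time_component`. (Hypothetical helper.)"""
    norm = math.sqrt(sum(a * a for a in v))
    if norm == 0.0:
        return list(v)
    rc = math.sqrt(curv)
    scale = math.sinh(rc * norm) / (rc * norm)
    return [scale * a for a in v]


def time_component(x_space, curv=1.0):
    """Time component implied by the constraint <x, x>_L = -1/curv."""
    return math.sqrt(1.0 / curv + sum(a * a for a in x_space))


def lorentz_distance(x_space, y_space, curv=1.0):
    """Geodesic distance between two points on the hyperboloid,
    assuming the Lorentz model with curvature -curv."""
    x_time = time_component(x_space, curv)
    y_time = time_component(y_space, curv)
    # Lorentzian inner product: space dot product minus product of time parts.
    inner = sum(a * b for a, b in zip(x_space, y_space)) - x_time * y_time
    # Clamp to the domain of acosh to guard against floating-point error.
    arg = max(1.0, -curv * inner)
    return math.acosh(arg) / math.sqrt(curv)


origin = [0.0, 0.0]
point = exp_map_origin([3.0, 4.0])
dist_from_origin = lorentz_distance(origin, point)
```

Under these assumptions, `dist_from_origin` is approximately 5, the Euclidean norm of the tangent vector `[3.0, 4.0]`. Because the space components grow like `sinh` of that norm, volume grows exponentially with distance from the origin, and that property is why hyperbolic spaces suit tree-like data.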

Karan Desai, Maximilian Nickel, Tanmay Rajpurohit, Justin Johnson, Ramakrishna Vedantam • 2023

Related benchmarks

Task	Dataset	Result
Image Classification	CIFAR-10	Accuracy 78.6	875
Image Classification	CIFAR-100	--	691
Image Classification	DTD	Accuracy 22.1	599
Image Classification	Food-101	Accuracy 48.5	570
Image Classification	EuroSAT	Accuracy 39.1	569
Text-to-Image Retrieval	Flickr30k (test)	--	525
Classification	Cars	Accuracy 5.3	492
Image Classification	DTD	Accuracy 22.18	487
Image Classification	RESISC45	Accuracy 42.6	472
Image-to-Text Retrieval	Flickr30k (test)	--	472

Showing 10 of 56 rows
