ImageBind: One Embedding Space To Bind Them All

About

We present ImageBind, an approach to learn a joint embedding across six different modalities: images, text, audio, depth, thermal, and IMU data. We show that all combinations of paired data are not necessary to train such a joint embedding, and only image-paired data is sufficient to bind the modalities together. ImageBind can leverage recent large-scale vision-language models, and extends their zero-shot capabilities to new modalities just by using their natural pairing with images. It enables novel emergent applications 'out-of-the-box', including cross-modal retrieval, composing modalities with arithmetic, and cross-modal detection and generation. The emergent capabilities improve with the strength of the image encoder, and we set a new state-of-the-art on emergent zero-shot recognition tasks across modalities, outperforming specialist supervised models. Finally, we show strong few-shot recognition results outperforming prior work, and that ImageBind serves as a new way to evaluate vision models for visual and non-visual tasks.
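To make the ideas of cross-modal retrieval and composing modalities with arithmetic concrete, here is a minimal sketch in Python. It does not call the ImageBind model: the 4-dimensional vectors are made-up stand-ins for encoder outputs, and the helper names `l2_normalize`, `retrieve`, and `compose` are hypothetical, chosen only for illustration. The sketch assumes that embeddings from different modalities already live in one shared space, so cosine similarity between them is meaningful.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def retrieve(query, gallery, k=1):
    """Return indices of the k gallery rows with highest cosine similarity."""
    scores = l2_normalize(gallery) @ l2_normalize(query)
    return np.argsort(-scores)[:k]


def compose(a, b):
    """Combine two embeddings by adding their unit vectors and renormalizing."""
    return l2_normalize(l2_normalize(a) + l2_normalize(b))


# Toy "image" gallery in a shared space (illustrative values, not model output).
image_names = ["beach", "dog", "dog on beach"]
image_gallery = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
])

# Toy "audio" embedding of a barking sound, assumed to lie near the dog image.
audio_barking = np.array([0.1, 0.9, 0.0, 0.0])

# Cross-modal retrieval: audio query against the image gallery.
audio_hit = retrieve(audio_barking, image_gallery)

# Embedding arithmetic: beach image + barking audio as a single query.
composed_query = compose(image_gallery[0], audio_barking)
composed_hit = retrieve(composed_query, image_gallery)

print(image_names[audio_hit[0]])     # nearest image to the audio query
print(image_names[composed_hit[0]])  # nearest image to the composed query
```

With these toy values the audio query lands on the "dog" image, and the composed query lands on "dog on beach". With a real model, the vectors would come from the per-modality encoders instead of being written by hand.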

Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra • 2023

Related benchmarks

Task	Dataset	Metric	Value	
Image Classification	ImageNet-1K	Top-1 Acc	77.7	1239
Language Understanding	MMLU	Accuracy	43.6	844
Text-to-Image Retrieval	Flickr30k (test)	Recall@1	74.9	525
Text-to-Video Retrieval	DiDeMo	R@1	0.36	465
Audio Classification	ESC-50	Accuracy	66.9	441
Action Recognition	UCF101	--	--	433
Text-to-Video Retrieval	MSR-VTT	Recall@1	36.8	406
Text-to-Video Retrieval	MSVD	R@1	47.9	290
Text-to-Video Retrieval	MSR-VTT (test)	R@1	36.8	265
Video Anomaly Detection	UCF-Crime	AUC	55.78	263

Showing 10 of 323 rows
