ImageBind: One Embedding Space To Bind Them All
About
We present ImageBind, an approach to learn a joint embedding across six different modalities: images, text, audio, depth, thermal, and IMU data. We show that all combinations of paired data are not necessary to train such a joint embedding, and only image-paired data is sufficient to bind the modalities together. ImageBind can leverage recent large-scale vision-language models, and extends their zero-shot capabilities to new modalities just by using their natural pairing with images. It enables novel emergent applications 'out-of-the-box', including cross-modal retrieval, composing modalities with arithmetic, and cross-modal detection and generation. The emergent capabilities improve with the strength of the image encoder, and we set a new state-of-the-art on emergent zero-shot recognition tasks across modalities, outperforming specialist supervised models. Finally, we show strong few-shot recognition results that outperform prior work, and that ImageBind serves as a new way to evaluate vision models for visual and non-visual tasks.
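
Cross-modal retrieval and composing modalities with arithmetic can be illustrated with a minimal sketch. It does not use the ImageBind model or its API: the 3-dimensional vectors, the gallery names, and the helpers `normalize`, `cosine`, and `retrieve` are all hypothetical stand-ins. It assumes only that embeddings from different modalities live in one shared space where cosine similarity is meaningful.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity of two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))

def retrieve(query, gallery):
    """Return the gallery key whose embedding is most similar to the query."""
    return max(gallery, key=lambda name: cosine(query, gallery[name]))

# Toy, hand-made embeddings (hypothetical; real ones would come from trained encoders).
image_gallery = {
    "dog_on_beach": [0.9, 0.1, 0.8],
    "dog_indoors": [0.9, 0.1, 0.0],
    "empty_beach": [0.0, 0.1, 0.9],
}
audio_barking = [1.0, 0.0, 0.0]  # stands in for an audio embedding
image_beach = [0.0, 0.0, 1.0]    # stands in for an image embedding

# Cross-modal retrieval: an audio query searched against an image gallery.
audio_only_match = retrieve(audio_barking, image_gallery)

# Composition: sum the normalized embeddings of two modalities, then retrieve.
composed = [a + b for a, b in zip(normalize(audio_barking), normalize(image_beach))]
composed_match = retrieve(composed, image_gallery)

print(audio_only_match)  # dog_indoors
print(composed_match)    # dog_on_beach
```

In this toy setup the audio query alone retrieves the indoor dog image, while adding the beach embedding shifts the best match to the dog on the beach.
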
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Classification | ImageNet-1K | Top-1 Acc 77.7 | 1239 |
| Language Understanding | MMLU | Accuracy 43.6 | 825 |
| Text-to-Video Retrieval | DiDeMo | R@1 0.36 | 459 |
| Text-to-Image Retrieval | Flickr30k (test) | Recall@1 74.9 | 445 |
| Action Recognition | UCF101 | -- | 431 |
| Audio Classification | ESC-50 | Accuracy 66.9 | 374 |
| Text-to-Video Retrieval | MSR-VTT | Recall@1 36.8 | 369 |
| Text-to-Video Retrieval | MSVD | R@1 47.9 | 264 |
| Text-to-Video Retrieval | MSR-VTT (test) | R@1 36.8 | 255 |
| Video Anomaly Detection | UCF-Crime | AUC 55.78 | 218 |
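
Several of the retrieval results above are reported as Recall@1. As a rough illustration of that metric, the sketch below computes Recall@K from a query-by-gallery similarity matrix. It assumes exactly one correct gallery item per query, located at the same index as the query; the benchmarks listed here may use different protocols. The function name `recall_at_k` and the toy scores are hypothetical.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose correct item (index i for query i) is in the top k."""
    hits = 0
    for i, row in enumerate(similarity):
        # Gallery indices sorted by descending similarity to query i.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)

# Toy similarity scores (hypothetical): rows are queries, columns are gallery items.
toy_similarity = [
    [0.9, 0.1, 0.3],  # query 0: correct item ranked first
    [0.2, 0.4, 0.8],  # query 1: correct item ranked second
    [0.5, 0.6, 0.1],  # query 2: correct item ranked third
]

r1 = recall_at_k(toy_similarity, 1)
r2 = recall_at_k(toy_similarity, 2)
r3 = recall_at_k(toy_similarity, 3)
```

Here one of three queries ranks its correct item first, so `r1` is one third, and the value rises toward 1.0 as `k` grows.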