Omnivore: A Single Model for Many Visual Modalities

About

Prior work has studied different visual modalities in isolation and developed separate architectures for recognition of images, videos, and 3D data. Instead, in this paper, we propose a single model which excels at classifying images, videos, and single-view 3D data using exactly the same model parameters. Our 'Omnivore' model leverages the flexibility of transformer-based architectures and is trained jointly on classification tasks from different modalities. Omnivore is simple to train, uses off-the-shelf standard datasets, and performs on par with or better than modality-specific models of the same size. A single Omnivore model obtains 86.0% on ImageNet, 84.1% on Kinetics, and 67.1% on SUN RGB-D. After finetuning, our models outperform prior work on a variety of vision tasks and generalize across modalities. Omnivore's shared visual representation naturally enables cross-modal recognition without access to correspondences between modalities. We hope our results motivate researchers to model visual modalities together.
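One way to picture how a single set of parameters could serve images, videos, and single-view 3D data is to put every input into one common layout before it is split into patches. The sketch below is only an illustration of that idea under stated assumptions: the `Clip` type, the `to_clip` and `num_tokens` helpers, and the patch size are hypothetical names and values chosen here, not taken from the abstract or from the authors' code.

```python
from dataclasses import dataclass

# Hypothetical patch size (frames, height, width); an assumption for illustration.
PATCH = (2, 4, 4)


@dataclass(frozen=True)
class Clip:
    """Common (channels, frames, height, width) layout shared by all modalities."""
    channels: int
    frames: int
    height: int
    width: int


def to_clip(shape, modality):
    """Map a modality-specific input shape onto the common layout."""
    if modality in ("image", "rgbd"):
        # (C, H, W): a still image is treated as a one-frame clip.
        # For "rgbd" we assume C == 4 (RGB plus one depth channel).
        c, h, w = shape
        return Clip(c, 1, h, w)
    if modality == "video":
        # (C, T, H, W): already in the common layout.
        c, t, h, w = shape
        return Clip(c, t, h, w)
    raise ValueError(f"unknown modality: {modality!r}")


def num_tokens(clip, patch=PATCH):
    """Count patch tokens, padding the frame axis up to a multiple of the patch depth."""
    pt, ph, pw = patch
    t_patches = -(-clip.frames // pt)  # ceiling division
    return t_patches * (clip.height // ph) * (clip.width // pw)


if __name__ == "__main__":
    for name, shape in [("image", (3, 224, 224)),
                        ("video", (3, 32, 224, 224)),
                        ("rgbd", (4, 224, 224))]:
        clip = to_clip(shape, name)
        print(name, clip, num_tokens(clip))
```

Under these assumptions every modality ends up as a sequence of patch tokens that differs only in length, which is the kind of input a transformer-based model can consume without modality-specific branches.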

Rohit Girdhar, Mannat Singh, Nikhila Ravi, Laurens van der Maaten, Armand Joulin, Ishan Misra • 2022

Related benchmarks

Task	Dataset	Metric	Result
Image Classification	ImageNet-1K	Top-1 Acc	86	1239
Image Classification	ImageNet-1K	Top-1 Acc	86	600
Action Recognition	Something-Something v2 (val)	Top-1 Accuracy	71.4	545
Action Recognition	Kinetics-400	Top-1 Acc	84.1	498
Action Recognition	Something-Something v2	Top-1 Accuracy	71.4	363
Image Classification	iNaturalist 2018	Top-1 Accuracy	84.1	291
Semantic segmentation	NYU v2 (test)	mIoU	56.8	282
Action Recognition	Something-Something v2 (test val)	Top-1 Accuracy	71.4	187
Semantic segmentation	NYUD v2 (test)	mIoU	56.8	187
Semantic segmentation	NYU Depth V2 (test)	mIoU	56.8	183

Showing 10 of 42 rows
