
Omnivore: A Single Model for Many Visual Modalities

About

Prior work has studied different visual modalities in isolation and developed separate architectures for recognition of images, videos, and 3D data. Instead, in this paper, we propose a single model which excels at classifying images, videos, and single-view 3D data using exactly the same model parameters. Our 'Omnivore' model leverages the flexibility of transformer-based architectures and is trained jointly on classification tasks from different modalities. Omnivore is simple to train, uses off-the-shelf standard datasets, and performs at-par or better than modality-specific models of the same size. A single Omnivore model obtains 86.0% on ImageNet, 84.1% on Kinetics, and 67.1% on SUN RGB-D. After finetuning, our models outperform prior work on a variety of vision tasks and generalize across modalities. Omnivore's shared visual representation naturally enables cross-modal recognition without access to correspondences between modalities. We hope our results motivate researchers to model visual modalities together.

Rohit Girdhar, Mannat Singh, Nikhila Ravi, Laurens van der Maaten, Armand Joulin, Ishan Misra • 2022

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Image Classification | ImageNet-1K | Top-1 Acc | 86 | 836 |
| Action Recognition | Something-Something v2 (val) | Top-1 Accuracy | 71.4 | 535 |
| Image Classification | ImageNet-1K | Top-1 Acc | 86 | 524 |
| Action Recognition | Kinetics-400 | Top-1 Acc | 84.1 | 413 |
| Action Recognition | Something-Something v2 | Top-1 Accuracy | 71.4 | 341 |
| Image Classification | iNaturalist 2018 | Top-1 Accuracy | 84.1 | 287 |
| Semantic segmentation | NYU v2 (test) | mIoU | 56.8 | 248 |
| Action Recognition | Something-Something v2 (test val) | Top-1 Accuracy | 71.4 | 187 |
| Semantic segmentation | NYUD v2 (test) | mIoU | 56.8 | 187 |
| Semantic segmentation | NYU Depth V2 (test) | mIoU | 56.8 | 172 |
Showing 10 of 41 rows
