OmniVec: Learning robust representations with cross modal sharing

About

The majority of research in learning-based methods has focused on designing and training networks for specific tasks. However, many learning-based tasks, across modalities, share commonalities and could potentially be tackled in a joint framework. We present an approach in this direction: learning multiple tasks, in multiple modalities, with a unified architecture. The proposed network is composed of task-specific encoders, a common trunk in the middle, followed by task-specific prediction heads. We first pre-train it with self-supervised masked training, followed by sequential training for the different tasks. We train the network on all major modalities, e.g. visual, audio, text and 3D, and report results on 22 diverse and challenging public benchmarks. We demonstrate empirically that using a joint network to train across modalities leads to meaningful information sharing, and this allows us to achieve state-of-the-art results on most of the benchmarks. We also show generalization of the trained network on cross-modal tasks as well as unseen datasets and tasks.
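The encoder, trunk and head layout described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class `SharedTrunkModel`, the helpers `make_linear` and `mask_features`, the random linear layers, the layer sizes, and the choice to key encoders by modality are all assumptions made for illustration. It only shows how inputs of different shapes might be routed through one shared component, and what masking an input could look like.

```python
import random

SHARED_DIM = 4  # hypothetical width of the shared trunk


def make_linear(in_dim, out_dim, rng):
    """Return a random in_dim -> out_dim linear map as a closure."""
    weights = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]

    def apply(vec):
        return [sum(w * x for w, x in zip(row, vec)) for row in weights]

    return apply


class SharedTrunkModel:
    """Per-modality encoders -> one shared trunk -> per-task heads (toy version)."""

    def __init__(self, input_dims, task_classes, seed=0):
        rng = random.Random(seed)
        # One encoder per modality, each projecting to the shared width.
        self.encoders = {m: make_linear(d, SHARED_DIM, rng) for m, d in input_dims.items()}
        # A single trunk reused by every modality and task.
        self.trunk = make_linear(SHARED_DIM, SHARED_DIM, rng)
        # One prediction head per task.
        self.heads = {t: make_linear(SHARED_DIM, c, rng) for t, c in task_classes.items()}

    def forward(self, modality, task, features):
        embedded = self.encoders[modality](features)
        shared = [max(0.0, v) for v in self.trunk(embedded)]  # ReLU after the trunk
        return self.heads[task](shared)


def mask_features(features, mask_ratio, rng):
    """Zero out a random subset of inputs, loosely mimicking masked pre-training."""
    n_masked = int(len(features) * mask_ratio)
    masked_idx = set(rng.sample(range(len(features)), n_masked))
    return [0.0 if i in masked_idx else v for i, v in enumerate(features)]


model = SharedTrunkModel(
    input_dims={"image": 8, "audio": 6},
    task_classes={"image_cls": 3, "audio_cls": 5},
)
image_logits = model.forward("image", "image_cls", [0.1] * 8)
audio_logits = model.forward("audio", "audio_cls", [0.2] * 6)
masked = mask_features([1.0] * 8, mask_ratio=0.5, rng=random.Random(1))
```

In this sketch both calls to `forward` pass through the same `trunk`, which is the part where cross-modal sharing would occur. A real model would use learned deep networks and a training loop in place of the fixed random maps.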

Siddharth Srivastava, Gaurav Sharma • 2023

Related benchmarks

Task	Dataset	Metric	Result
Action Recognition	Kinetics-400	Top-1 Acc	91.1	498
Audio Classification	ESC-50	Accuracy	98.4	441
Image Classification	iNaturalist 2018	Top-1 Accuracy	93.8	291
Action Recognition	HMDB51	3-Fold Accuracy	91.6	191
Semantic Segmentation	NYUD v2 (test)	mIoU	60.8	187
Video Action Classification	Something-Something v2	Top-1 Acc	85.4	145
Text-to-Video Retrieval	YouCook2	Recall@10	70.8	117
3D Point Cloud Classification	ScanObjectNN	Accuracy	96.1	76
Semantic Segmentation	NYU V2	mIoU	60.8	74
Video Recognition	Kinetics-400	Top-1 Acc	91.1	54

Showing 10 of 19 rows
