OmniVec2 -- A Novel Transformer based Network for Large Scale Multimodal and Multitask Learning

About

We present a novel multimodal multitask network and associated training algorithm. The method is capable of ingesting data from approximately 12 different modalities, namely image, video, audio, text, depth, point cloud, time series, tabular, graph, X-ray, infrared, IMU, and hyperspectral. The proposed approach utilizes modality-specialized tokenizers, a shared transformer architecture, and cross-attention mechanisms to project the data from different modalities into a unified embedding space. It addresses multimodal and multitask scenarios by incorporating modality-specific task heads for different tasks in the respective modalities. We propose a novel pretraining strategy with iterative modality switching to initialize the network, and a training algorithm that trades off fully joint training over all modalities against training on pairs of modalities at a time. We provide a comprehensive evaluation across 25 datasets from 12 modalities and show state-of-the-art performance, demonstrating the effectiveness of the proposed architecture, pretraining strategy, and adapted multitask training.
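
The idea of training on pairs of modalities at a time can be pictured as a schedule that walks through modality pairs. The sketch below is only an illustration under stated assumptions: the round-robin ordering, the function name pairwise_schedule, and the step count are hypothetical, and the abstract does not say how the authors actually select or order pairs. Only the modality names come from the abstract.

```python
from itertools import combinations, cycle, islice

# Modality names as listed in the abstract; everything else is hypothetical.
MODALITIES = [
    "image", "video", "audio", "text", "depth", "point cloud", "time series",
    "tabular", "graph", "X-ray", "infrared", "IMU", "hyperspectral",
]


def pairwise_schedule(modalities, num_steps):
    """Return [(step, (modality_a, modality_b)), ...] for num_steps steps.

    Assumption: pairs are visited round-robin over all unordered pairs.
    The paper may use a different selection rule.
    """
    pairs = cycle(combinations(modalities, 2))
    return list(enumerate(islice(pairs, num_steps)))


# Small demo on the first four modalities (6 unordered pairs, so step 6 wraps around).
schedule = pairwise_schedule(MODALITIES[:4], num_steps=8)
for step, (a, b) in schedule:
    print(f"step {step}: train on batches from {a} + {b}")
```

A fully joint step would instead draw batches from every modality at once; a schedule like this one trades that cost for cheaper two-modality steps.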

Siddharth Srivastava, Gaurav Sharma • 2025

Related benchmarks

Task	Dataset	Result
Semantic Segmentation	ADE20K	mIoU 58.5	559
Action Recognition	Kinetics-400	Top-1 Acc 93.6	498
Audio Classification	ESC-50	Accuracy 99.1	441
Text-to-Video Retrieval	MSR-VTT	--	406
Image Classification	iNaturalist 2018	Top-1 Accuracy 94.6	291
Action Recognition	HMDB51	3-Fold Accuracy 92.1	191
Video Action Classification	Something-Something v2	Top-1 Acc 86.1	145
Text-to-Video Retrieval	YouCook2	Recall@10 69.9	117
Natural Language Understanding	GLUE (test dev)	MRPC Accuracy 85.8	90
3D Point Cloud Classification	ScanObjectNN	Accuracy 97.2	76

Showing 10 of 24 rows
