Explore the Limits of Omni-modal Pretraining at Scale

About

We propose to build omni-modal intelligence that can understand any modality and learn universal representations. Specifically, we propose a scalable pretraining paradigm, named Multimodal Context (MiCo), which scales up the number of modalities and the amount of data together with the model parameters during pretraining. With MiCo, the pretrained models show significant emergent abilities in multimodal learning, evaluated on the following tasks: i) single-modality perception benchmarks across 10 different modalities; ii) 25 cross-modality understanding tasks covering retrieval, question answering, and captioning; and iii) 18 multimodal large language model benchmarks. Our models establish 37 new state-of-the-art records. We hope this research contributes to the development of omni-modal intelligence. Code and models are available at https://github.com/invictus717/MiCo
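To make the idea of universal representations across modalities concrete, here is a minimal, generic sketch of one common setup: per-modality encoders project inputs into a shared embedding space, and a contrastive objective aligns paired samples across modalities. The encoder design, the modality set, the input dimensions, and the loss below are illustrative assumptions for the sketch, not MiCo's actual architecture; the real implementation is in the linked repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """Toy encoder: projects flattened inputs into a shared embedding space."""
    def __init__(self, input_dim: int, embed_dim: int = 256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(input_dim, 512),
            nn.GELU(),
            nn.Linear(512, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so cosine similarity reduces to a dot product.
        return F.normalize(self.proj(x), dim=-1)

def contrastive_loss(a: torch.Tensor, b: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired embeddings from two modalities."""
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0))  # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Hypothetical modalities and (flattened) input dimensions for illustration.
encoders = nn.ModuleDict({
    "image": ModalityEncoder(input_dim=3 * 32 * 32),
    "audio": ModalityEncoder(input_dim=1024),
    "text":  ModalityEncoder(input_dim=768),
})

# One toy pretraining step on random paired data.
batch = {
    "image": torch.randn(8, 3 * 32 * 32),
    "audio": torch.randn(8, 1024),
    "text":  torch.randn(8, 768),
}
embeds = {name: enc(batch[name]) for name, enc in encoders.items()}
loss = (contrastive_loss(embeds["image"], embeds["text"]) +
        contrastive_loss(embeds["image"], embeds["audio"]))
loss.backward()
print(f"toy pretraining loss: {loss.item():.4f}")
```

Scaling this pattern is what the abstract refers to: adding a modality means adding one encoder into the shared space, and data, modalities, and parameters can grow together without changing the objective.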

Yiyuan Zhang, Handong Li, Jing Liu, Xiangyu Yue • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Visual Question Answering | VQA v2 | Accuracy | 80.5 | 1165 |
| Visual Question Answering | TextVQA | Accuracy | 53.4 | 1117 |
| Visual Question Answering | VizWiz | Accuracy | 49.1 | 1043 |
| Visual Question Answering | GQA | Accuracy | 61.5 | 963 |
| Language Understanding | MMLU | Accuracy | 68.9 | 756 |
| Multimodal Understanding | MM-Vet | MM-Vet Score | 31.4 | 418 |
| Video Question Answering | MSRVTT-QA (test) | Accuracy | 60.1 | 371 |
| Multimodal Understanding | MMBench | Accuracy | 65.2 | 367 |
| Text-to-Video Retrieval | MSR-VTT | Recall@1 | 64.3 | 313 |
| Visual Question Answering | OKVQA | Top-1 Accuracy | 56.6 | 283 |

Showing 10 of 53 rows.
