Explore the Limits of Omni-modal Pretraining at Scale

About

We propose to build omni-modal intelligence that can understand any modality and learn universal representations. Specifically, we propose a scalable pretraining paradigm, named Multimodal Context (MiCo), which scales up the number of modalities and the amount of data together with the model parameters during pretraining. With MiCo, the pretrained models show significant emergent abilities in multimodal learning, evaluated on the following tasks: i) single-modality perception benchmarks across 10 different modalities; ii) 25 cross-modality understanding tasks covering retrieval, question answering, and captioning; and iii) 18 multimodal large language model benchmarks. Our models establish 37 new state-of-the-art records. We hope this research contributes to the development of omni-modal intelligence. Code and models are available at https://github.com/invictus717/MiCo
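To make the idea of universal representations across modalities concrete, here is a minimal, generic sketch of one common setup: per-modality encoders project inputs into a shared embedding space, and a contrastive objective aligns paired samples across modalities. The encoder design, the modality set, the input dimensions, and the loss below are illustrative assumptions for the sketch, not MiCo's actual architecture; the real implementation is in the linked repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """Toy encoder: projects flattened inputs into a shared embedding space."""
    def __init__(self, input_dim: int, embed_dim: int = 256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(input_dim, 512),
            nn.GELU(),
            nn.Linear(512, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so cosine similarity reduces to a dot product.
        return F.normalize(self.proj(x), dim=-1)

def contrastive_loss(a: torch.Tensor, b: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired embeddings from two modalities."""
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0))  # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Hypothetical modalities and (flattened) input dimensions for illustration.
encoders = nn.ModuleDict({
    "image": ModalityEncoder(input_dim=3 * 32 * 32),
    "audio": ModalityEncoder(input_dim=1024),
    "text":  ModalityEncoder(input_dim=768),
})

# One toy pretraining step on random paired data.
batch = {
    "image": torch.randn(8, 3 * 32 * 32),
    "audio": torch.randn(8, 1024),
    "text":  torch.randn(8, 768),
}
embeds = {name: enc(batch[name]) for name, enc in encoders.items()}
loss = (contrastive_loss(embeds["image"], embeds["text"]) +
        contrastive_loss(embeds["image"], embeds["audio"]))
loss.backward()
print(f"toy pretraining loss: {loss.item():.4f}")
```

Scaling this pattern is what the abstract refers to: adding a modality means adding one encoder into the shared space, and data, modalities, and parameters can grow together without changing the objective.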

Yiyuan Zhang, Handong Li, Jing Liu, Xiangyu Yue • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Visual Question Answering | VQA v2 | Accuracy | 80.5 | 1165 |
| Visual Question Answering | TextVQA | Accuracy | 53.4 | 1117 |
| Visual Question Answering | VizWiz | Accuracy | 49.1 | 1043 |
| Visual Question Answering | GQA | Accuracy | 61.5 | 963 |
| Language Understanding | MMLU | Accuracy | 68.9 | 756 |
| Multimodal Understanding | MM-Vet | MM-Vet Score | 31.4 | 418 |
| Video Question Answering | MSRVTT-QA (test) | Accuracy | 60.1 | 371 |
| Multimodal Understanding | MMBench | Accuracy | 65.2 | 367 |
| Text-to-Video Retrieval | MSR-VTT | Recall@1 | 64.3 | 313 |
| Visual Question Answering | OKVQA | Top-1 Accuracy | 56.6 | 283 |

Showing 10 of 53 rows.
