Explore the Limits of Omni-modal Pretraining at Scale

About

We propose to build omni-modal intelligence, which is capable of understanding any modality and learning universal representations. Specifically, we propose a scalable pretraining paradigm, named Multimodal Context (MiCo), which can scale up the number of modalities and the amount of data, together with the model parameters, during pretraining. With MiCo, the pretrained models show significant emergent abilities in multimodal learning, which we evaluate on the following tasks: i) single-modality perception benchmarks across 10 different modalities, ii) 25 cross-modality understanding tasks covering retrieval, question answering, and captioning, and iii) 18 multimodal large language model benchmarks. Our models establish 37 new state-of-the-art records. We hope that this research contributes to the development of omni-modal intelligence. Code and models are available at https://github.com/invictus717/MiCo

Yiyuan Zhang, Handong Li, Jing Liu, Xiangyu Yue • 2024

Related benchmarks

Task                       Dataset           Result             Rank
Visual Question Answering  VizWiz            Accuracy 49.1      1820
Visual Question Answering  TextVQA           Accuracy 53.4      1453
Visual Question Answering  VQA v2            Accuracy 80.5      1429
Visual Question Answering  GQA               Accuracy 61.5      1425
Multimodal Understanding   MMBench           Accuracy 65.2      847
Language Understanding     MMLU              Accuracy 68.9      844
Multimodal Understanding   MM-Vet            MM-Vet Score 31.4  631
Visual Question Answering  ScienceQA         Accuracy 71.3      446
Text-to-Video Retrieval    MSR-VTT           Recall@1 64.3      406
Video Question Answering   MSRVTT-QA (test)  Accuracy 60.1      376
Showing 10 of 53 rows
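
For readers unfamiliar with the Recall@1 metric reported for text-to-video retrieval above, the following is a minimal sketch of how Recall@K is commonly computed. It is not taken from the MiCo codebase: the function name `recall_at_k`, the toy score matrix, and the assumption that the correct video for text query i sits at index i are all illustrative.

```python
def recall_at_k(similarity, k=1):
    """Percentage of queries whose ground-truth item is among the top-k results.

    similarity[i][j] is the score between text query i and video j.
    Assumption (illustrative): the correct video for query i has index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank candidate videos for this query from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy 3x3 score matrix (made-up numbers, not results from the paper).
scores = [
    [0.9, 0.1, 0.3],  # query 0: best match is video 0 (correct)
    [0.2, 0.4, 0.8],  # query 1: best match is video 2 (incorrect)
    [0.1, 0.2, 0.7],  # query 2: best match is video 2 (correct)
]

r1 = recall_at_k(scores, k=1)
r2 = recall_at_k(scores, k=2)
print(f"Recall@1 = {r1:.1f}, Recall@2 = {r2:.1f}")
```

In this toy case two of the three queries rank their correct video first, so Recall@1 is about 66.7; widening to the top two results recovers the remaining query.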
