Explore the Limits of Omni-modal Pretraining at Scale

About

We propose to build omni-modal intelligence capable of understanding any modality and learning universal representations. Specifically, we propose a scalable pretraining paradigm, named Multimodal Context (MiCo), which can scale up the number of modalities and the amount of data, together with the model parameters, during pretraining. With MiCo, the pretrained models show significant emergent abilities in multimodal learning, evaluated on the following tasks: i) single-modality perception benchmarks across 10 different modalities, ii) 25 cross-modality understanding tasks covering retrieval, question answering, and captioning, and iii) 18 multimodal large language model benchmarks. Our models establish 37 new state-of-the-art records. We hope this research contributes to the development of omni-modal intelligence. Code and models are available at https://github.com/invictus717/MiCo
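The abstract does not spell out the pretraining objective, but a common way to align many modalities into universal representations is to project each modality into one shared embedding space and train with a symmetric contrastive (InfoNCE-style) loss on paired samples. The sketch below is a minimal, hypothetical illustration of that general idea, not the actual MiCo implementation: the encoder projections, dimensions, and temperature are all placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each modality has its own encoder that projects raw
# features into one shared embedding space, where paired samples from
# different modalities are pulled together contrastively.
DIMS = {"image": 512, "audio": 128, "text": 256}  # assumed feature sizes
SHARED_DIM = 64

# Random linear projections stand in for real modality-specific encoders.
proj = {m: rng.standard_normal((d, SHARED_DIM)) / np.sqrt(d)
        for m, d in DIMS.items()}

def encode(modality: str, x: np.ndarray) -> np.ndarray:
    """Project raw features into the shared space and L2-normalize."""
    z = x @ proj[modality]
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def contrastive_loss(za: np.ndarray, zb: np.ndarray,
                     temperature: float = 0.07) -> float:
    """Symmetric InfoNCE between two batches of paired embeddings:
    the i-th row of za is the positive pair of the i-th row of zb."""
    logits = za @ zb.T / temperature            # (B, B) similarity matrix
    labels = np.arange(len(za))
    lsm_ab = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    lsm_ba = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    loss_ab = -lsm_ab[labels, labels].mean()    # a -> b direction
    loss_ba = -lsm_ba[labels, labels].mean()    # b -> a direction
    return float((loss_ab + loss_ba) / 2)

batch = 8
z_img = encode("image", rng.standard_normal((batch, DIMS["image"])))
z_txt = encode("text", rng.standard_normal((batch, DIMS["text"])))
print(contrastive_loss(z_img, z_txt))
```

In a real system the projections would be deep networks and the pairs would come from naturally co-occurring data (e.g. video frames with their audio and captions); scaling the number of modalities then amounts to adding encoders that all target the same shared space.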

Yiyuan Zhang, Handong Li, Jing Liu, Xiangyu Yue• 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VizWiz | Accuracy | 49.1 | 1525 |
| Visual Question Answering | VQA v2 | Accuracy | 80.5 | 1362 |
| Visual Question Answering | TextVQA | Accuracy | 53.4 | 1285 |
| Visual Question Answering | GQA | Accuracy | 61.5 | 1249 |
| Language Understanding | MMLU | Accuracy | 68.9 | 825 |
| Multimodal Understanding | MMBench | Accuracy | 65.2 | 637 |
| Multimodal Understanding | MM-Vet | MM-Vet Score | 31.4 | 531 |
| Video Question Answering | MSRVTT-QA (test) | Accuracy | 60.1 | 376 |
| Visual Question Answering | ScienceQA | Accuracy | 71.3 | 370 |
| Text-to-Video Retrieval | MSR-VTT | Recall@1 | 64.3 | 369 |
Showing 10 of 53 rows
