
Emerging Properties in Unified Multimodal Pretraining

About

Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. In this work, we introduce BAGEL, an open-source foundational model that natively supports multimodal understanding and generation. BAGEL is a unified, decoder-only model pretrained on trillions of tokens curated from large-scale interleaved text, image, video, and web data. When scaled with such diverse multimodal interleaved data, BAGEL exhibits emerging capabilities in complex multimodal reasoning. As a result, it significantly outperforms open-source unified models in both multimodal generation and understanding across standard benchmarks, while exhibiting advanced multimodal reasoning abilities such as free-form image manipulation, future frame prediction, 3D manipulation, and world navigation. In the hope of facilitating further opportunities for multimodal research, we share the key findings, pretraining details, and data creation protocol, and release our code and checkpoints to the community. The project page is at https://bagel-ai.org/

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, Haoqi Fan • 2025

Related benchmarks

Task | Dataset | Result | Rank
--- | --- | --- | ---
Text-based Visual Question Answering | TextVQA | Accuracy: 80 | 962
Multimodal Understanding | MMBench | Accuracy: 85 | 847
Text-to-Image Generation | GenEval | Overall Score: 88 | 704
Multimodal Understanding | MM-Vet | MM-Vet Score: 67.2 | 631
Visual Question Answering | ChartQA | -- | 519
Text-to-Image Generation | GenEval | Overall Score: 88 | 517
Multimodal Reasoning | MM-Vet | MM-Vet Score: 67.2 | 517
Multimodal Understanding | SEED-Bench | -- | 516
Mathematical Reasoning | MathVista | Score: 73.1 | 474
Text-to-Image Generation | DPG-Bench | Overall Score: 85.07 | 451
