BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

About

Unifying image understanding and generation has gained growing attention in recent research on multimodal models. Although design choices for image understanding have been extensively studied, the optimal model architecture and training recipe for a unified framework with image generation remain underexplored. Motivated by the strong potential of autoregressive and diffusion models for high-quality generation and scalability, we conduct a comprehensive study of their use in unified multimodal settings, with emphasis on image representations, modeling objectives, and training strategies. Grounded in these investigations, we introduce a novel approach that employs a diffusion transformer to generate semantically rich CLIP image features, in contrast to conventional VAE-based representations. This design yields both higher training efficiency and improved generative quality. Furthermore, we demonstrate that a sequential pretraining strategy for unified models (first training on image understanding and subsequently on image generation) offers practical advantages by preserving image understanding capability while developing strong image generation ability. Finally, we carefully curate a high-quality instruction-tuning dataset, BLIP3o-60k, for image generation by prompting GPT-4o with a diverse set of captions covering various scenes, objects, human gestures, and more. Building on our innovative model design, training recipe, and datasets, we develop BLIP3-o, a suite of state-of-the-art unified multimodal models. BLIP3-o achieves superior performance across most of the popular benchmarks spanning both image understanding and generation tasks. To facilitate future research, we fully open-source our models, including code, model weights, training scripts, and pretraining and instruction-tuning datasets.
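To make the idea of diffusing in a continuous feature space more concrete, here is a minimal sketch of one training-step target. It rests on several assumptions not stated in the abstract: it uses a standard DDPM-style noise-prediction objective with a linear beta schedule as a stand-in (the abstract does not name the objective), random vectors stand in for CLIP image features, the feature dimension of 16 is arbitrary, and all function names are hypothetical. It is not the authors' implementation.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention schedule for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(features, t, alpha_bar, rng):
    """Forward-diffuse clean feature vectors to per-sample timesteps t.

    features: (batch, dim) array of clean feature vectors.
    t:        (batch,) integer timesteps indexing alpha_bar.
    Returns the noisy features and the Gaussian noise that was mixed in.
    """
    noise = rng.standard_normal(features.shape)
    a = alpha_bar[t][:, None]  # shape (batch, 1), broadcasts over dim
    noisy = np.sqrt(a) * features + np.sqrt(1.0 - a) * noise
    return noisy, noise

def denoising_loss(predicted_noise, true_noise):
    """Mean squared error between predicted and true noise."""
    return float(np.mean((predicted_noise - true_noise) ** 2))

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()

# Placeholder for CLIP image features: 4 samples, 16 dimensions (arbitrary).
clip_features = rng.standard_normal((4, 16))
t = rng.integers(0, len(alpha_bar), size=4)

noisy, noise = add_noise(clip_features, t, alpha_bar, rng)

# A real model (e.g. a diffusion transformer conditioned on the prompt)
# would predict the noise from `noisy` and `t`; an all-zero prediction
# is used here only to show how the loss is evaluated.
loss = denoising_loss(np.zeros_like(noise), noise)
```

In such a setup the denoiser operates on feature vectors rather than VAE latents or pixels, so a separate decoder would be needed to turn generated features into an image; that step is outside this sketch.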

Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, Le Xue, Caiming Xiong, Ran Xu • 2025

Related benchmarks

Task	Dataset	Metric	Result
Text-to-Image Generation	GenEval	Overall Score	84	914
Multimodal Understanding	MMBench	Accuracy	83.5	887
Multimodal Understanding	MM-Vet	MM-Vet Score	66.6	664
Text-to-Image Generation	GenEval	Overall Score	84	581
Multimodal Understanding	SEED-Bench	Accuracy	77.5	571
Text-to-Image Generation	DPG-Bench	Overall Score	82.27	510
Text-to-Image Generation	GenEval	GenEval Score	84	459
Multi-discipline Multimodal Understanding	MMMU	Accuracy	50.6	422
Text-to-Image Generation	GenEval	Overall Score	0.84	318
Text-to-Image Generation	DPG	Overall Score	81.6	270

Showing 10 of 128 rows
