BLIP3o-NEXT: Next Frontier of Native Image Generation

About

We present BLIP3o-NEXT, a fully open-source foundation model in the BLIP3 series that advances the next frontier of native image generation. BLIP3o-NEXT unifies text-to-image generation and image editing within a single architecture and demonstrates strong capabilities in both. In developing a state-of-the-art native image generation model, we identify four key insights: (1) Most architectural choices yield comparable performance; an architecture can be deemed effective provided it scales efficiently and supports fast inference; (2) The successful application of reinforcement learning can further push the frontier of native image generation; (3) Image editing remains a challenging task, yet instruction following and the consistency between generated and reference images can be significantly enhanced through post-training and a data engine; (4) Data quality and scale continue to be decisive factors that determine the upper bound of model performance. Building upon these insights, BLIP3o-NEXT leverages an Autoregressive + Diffusion architecture in which an autoregressive model first generates discrete image tokens conditioned on multimodal inputs, and the hidden states of those tokens are then used as conditioning signals for a diffusion model that generates high-fidelity images. This architecture integrates the reasoning strength and instruction following of autoregressive models with the fine-detail rendering ability of diffusion models, achieving a new level of coherence and realism. Extensive evaluations on various text-to-image and image-editing benchmarks show that BLIP3o-NEXT achieves superior performance over existing models.
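The two-stage data flow described above (discrete image tokens from an autoregressive model, then hidden states passed as conditioning to a diffusion model) can be illustrated with a toy sketch. This is not the BLIP3o-NEXT implementation: every name, size, and function below is hypothetical, the "transformer" is a seeded random stand-in, and the "diffusion" stage is a plain iterative refinement loop used only to show where the conditioning signal enters.

```python
import math
import random

# All sizes below are hypothetical and chosen only to keep the toy small.
VOCAB_SIZE = 16        # size of the discrete image-token codebook
HIDDEN_DIM = 4         # width of each hidden state
NUM_IMAGE_TOKENS = 8   # image tokens generated per image


def toy_hidden_state(context, seed=0):
    """Stand-in for a transformer forward pass over the token context."""
    rng = random.Random(seed + sum((i + 1) * tok for i, tok in enumerate(context)))
    return [rng.uniform(-1.0, 1.0) for _ in range(HIDDEN_DIM)]


def toy_logits(hidden, seed=0):
    """Stand-in for the output head: a fixed random projection to codebook logits."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(HIDDEN_DIM)]
               for _ in range(VOCAB_SIZE)]
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]


def autoregressive_stage(prompt_tokens, num_image_tokens=NUM_IMAGE_TOKENS):
    """Stage 1: emit discrete image tokens one by one, keeping each hidden state."""
    context = list(prompt_tokens)
    image_tokens, hidden_states = [], []
    for _ in range(num_image_tokens):
        hidden = toy_hidden_state(context)
        logits = toy_logits(hidden)
        next_token = max(range(VOCAB_SIZE), key=lambda t: logits[t])  # greedy decoding
        image_tokens.append(next_token)
        hidden_states.append(hidden)
        context.append(next_token)  # the new token conditions the next step
    return image_tokens, hidden_states


def diffusion_stage(hidden_states, num_steps=10, seed=0):
    """Stage 2 (placeholder): refine noise toward a target derived from the conditioning."""
    rng = random.Random(seed)
    # One toy "pixel" per hidden-state entry; a real model would learn this mapping.
    target = [math.tanh(h) for hidden in hidden_states for h in hidden]
    image = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    for step in range(num_steps):
        alpha = 1.0 / (num_steps - step)  # close a growing share of the remaining gap
        image = [x + alpha * (t - x) for x, t in zip(image, target)]
    return image


if __name__ == "__main__":
    tokens, hiddens = autoregressive_stage([1, 2, 3])
    pixels = diffusion_stage(hiddens)
    print(len(tokens), len(hiddens), len(pixels))
```

The point of the sketch is the interface between the stages: the second stage consumes the per-token hidden states rather than the sampled token ids, which is how the abstract describes the conditioning.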

Jiuhai Chen, Le Xue, Zhiyang Xu, Xichen Pan, Shusheng Yang, Can Qin, An Yan, Honglu Zhou, Zeyuan Chen, Lifu Huang, Tianyi Zhou, Junnan Li, Silvio Savarese, Caiming Xiong, Ran Xu • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Text-based Visual Question Answering | TextVQA | Accuracy: 78 | 807 |
| Text-to-Image Generation | GenEval | Overall Score: 81 | 506 |
| Multimodal Understanding | SEED-Bench | -- | 343 |
| Multi-discipline Multimodal Understanding | MMMU | -- | 317 |
| Text-to-Image Generation | DPG-Bench | Overall Score: 79.4 | 265 |
| Text-to-Image Generation | ImageReward | ImageReward Score: 0.926 | 56 |
| Multi-modal Understanding | MMBench EN | Overall Score: 78.6 | 55 |
| Image Editing | ImgEdit GPT-4.1 (test) | Add Score: 4 | 19 |
| Table-to-Image Generation | TableVisBench v1 (test) | DA: 0.4 | 19 |
| Visual World Modelling | Action Genome | GPT-4o Score: 3.04 | 18 |
Showing 10 of 18 rows
