Lumina-Image 2.0: A Unified and Efficient Image Generative Framework
About
We introduce Lumina-Image 2.0, an advanced text-to-image (T2I) generation framework that makes significant progress over its predecessor, Lumina-Next. Lumina-Image 2.0 is built on two key principles. (1) Unification: it adopts a unified architecture, Unified Next-DiT, that treats text and image tokens as a joint sequence, enabling natural cross-modal interactions and allowing seamless task expansion. In addition, because high-quality captioners provide semantically well-aligned text-image training pairs, we introduce a unified captioning system, Unified Captioner (UniCap), designed specifically for T2I generation tasks. UniCap generates comprehensive and accurate captions, which accelerates convergence and improves prompt adherence. (2) Efficiency: to improve the efficiency of the proposed model, we develop multi-stage progressive training strategies and introduce inference acceleration techniques without compromising image quality. Extensive evaluations on academic benchmarks and public text-to-image arenas show that Lumina-Image 2.0 delivers strong performance with only 2.6B parameters, highlighting its scalability and design efficiency. We have released our training details, code, and models at https://github.com/Alpha-VLLM/Lumina-Image-2.0.
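The central architectural idea above, treating text and image tokens as one joint sequence, can be illustrated with a minimal sketch. The code below is a hypothetical, heavily simplified single-head self-attention in plain Python, not Lumina-Image 2.0's actual Unified Next-DiT block: the function names are illustrative, and the query/key/value projections are assumed to be identity maps. It shows only one property: once both modalities share a single sequence, one attention matrix covers text-to-image, image-to-text, and within-modality interactions, in contrast to designs that inject text through a separate cross-attention layer.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def joint_self_attention(
    text_tokens: List[Vector], image_tokens: List[Vector]
) -> Tuple[List[Vector], List[List[float]]]:
    """Single-head scaled dot-product self-attention over one joint sequence.

    Assumption for illustration: Q, K and V are the raw token vectors
    (identity projections). A real DiT block uses learned projections,
    multiple heads, normalization, and positional encodings.
    """
    # Text and image tokens are concatenated into a single sequence, so
    # every token can attend to every other token regardless of modality.
    seq = text_tokens + image_tokens
    d = len(seq[0])
    scale = 1.0 / math.sqrt(d)
    outputs: List[Vector] = []
    weights: List[List[float]] = []
    for q in seq:
        w = softmax([dot(q, k) * scale for k in seq])
        weights.append(w)
        # Each output is the attention-weighted sum of the value vectors.
        outputs.append([sum(wi * v[j] for wi, v in zip(w, seq)) for j in range(d)])
    return outputs, weights


if __name__ == "__main__":
    text = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
    image = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
    out, attn = joint_self_attention(text, image)
    # One 5x5 attention matrix spans both modalities.
    print(len(attn), len(attn[0]))  # 5 5
```

With 2 text tokens and 3 image tokens, the sketch produces a single 5x5 attention matrix in which, for example, row 2 (the first image token) assigns nonzero weight to the text tokens in columns 0 and 1.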
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Image Generation | GenEval | Overall Score | 73 | 391 |
| Text-to-Image Generation | GenEval | GenEval Score | 73 | 360 |
| Text-to-Image Generation | DPG-Bench | Overall Score | 87.2 | 265 |
| Text-to-Image Generation | GenEval (test) | Two Obj. Acc | 87 | 221 |
| Text-to-Image Generation | T2I-CompBench | Shape Fidelity | 60.28 | 185 |
| Text-to-Image Generation | DPGBench | Attribute Score | 90.2 | 44 |
| Text-to-Image Generation | OneIG-ZH | Alignment | 73.1 | 34 |
| Text-to-Image Generation | R2I-Bench | Causal Accuracy | 40 | 28 |
| Spatial Reasoning Generation | OneIG-EN (test) | Alignment Score | 81.9 | 26 |
| Text-to-Image Generation | OneIG-Bench EN | Alignment Score | 81.9 | 24 |