Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding
About
We introduce Lumina-DiMOO, an open-source foundational model for seamless multi-modal generation and understanding. Lumina-DiMOO sets itself apart from prior unified models by using fully discrete diffusion modeling to handle inputs and outputs across modalities. This approach gives Lumina-DiMOO higher sampling efficiency than previous autoregressive (AR) or hybrid AR-diffusion paradigms and lets it support a broad spectrum of multi-modal tasks, including text-to-image generation, image-to-image generation (e.g., image editing, subject-driven generation, and image inpainting), and image understanding. Lumina-DiMOO achieves state-of-the-art performance on multiple benchmarks, surpassing existing open-source unified multi-modal models. To foster further advances in multi-modal and discrete diffusion model research, we release our code and checkpoints to the community. Project Page: https://synbol.github.io/Lumina-DiMOO.
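The sampling-efficiency claim can be illustrated with a toy sketch of mask-based discrete diffusion decoding: the sequence starts fully masked, and each denoising step commits the most confident predictions in parallel, so the number of model calls equals the number of steps rather than the sequence length (as in AR decoding). This is a minimal, hypothetical sketch; `toy_model` and `diffusion_decode` are illustrative names and do not reflect Lumina-DiMOO's actual API or architecture.

```python
import math

MASK = -1  # placeholder id for a still-masked token position


def toy_model(tokens):
    # Stand-in for the network: returns a (token, confidence) pair per
    # position. A real model would be a transformer over discrete tokens
    # of all modalities; here the "prediction" is just the position index
    # with a deterministic pseudo-confidence.
    return [(i, (i * 37 % 16) / 16.0) for i, _ in enumerate(tokens)]


def diffusion_decode(length, num_steps):
    """Mask-based discrete diffusion sampling (toy version): start from an
    all-MASK sequence and, at each step, unmask the most confident
    predictions in parallel. Uses `num_steps` model calls, versus `length`
    calls for token-by-token autoregressive decoding."""
    tokens = [MASK] * length
    per_step = math.ceil(length / num_steps)  # positions revealed per step
    calls = 0
    for _ in range(num_steps):
        preds = toy_model(tokens)
        calls += 1
        # rank still-masked positions by model confidence
        masked = [(conf, pos, tok) for pos, (tok, conf) in enumerate(preds)
                  if tokens[pos] == MASK]
        masked.sort(reverse=True)
        for _conf, pos, tok in masked[:per_step]:
            tokens[pos] = tok  # commit the top-k predictions this step
        if MASK not in tokens:
            break
    return tokens, calls


tokens, calls = diffusion_decode(length=16, num_steps=4)
print(calls, MASK in tokens)  # → 4 False
```

With 16 token positions decoded in 4 steps, the sketch makes 4 model calls where an AR decoder would make 16, which is the intuition behind the efficiency advantage of parallel discrete-diffusion decoding.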
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Text-to-Image Generation | GenEval | Overall Score | 88 | 467 |
| Text-to-Image Generation | GenEval | GenEval Score | 88 | 277 |
| Text-to-Image Generation | DPG | Overall Score | 86.04 | 131 |
| Text-to-Image Generation | GenEval | Two Objects | 94 | 87 |
| Multimodal Understanding | MMMU | MMMU Score | 41.4 | 78 |
| Text-to-Image Generation | DPGBench | DPGBench Score | 86.04 | 31 |
| Multimodal Understanding | MMB | Score | 58.7 | 30 |
| Multimodal Understanding | SEED | SEED Score | 71.4 | 27 |
| Text-to-Image Generation | UniGenBench | UniGenBench Score | 71.12 | 17 |
| Reasoning-based Image Editing | UniREditBench (test) | Real World Score | 51.4 | 10 |