Ovis-U1 Technical Report
About
In this report, we introduce Ovis-U1, a 3-billion-parameter unified model that integrates multimodal understanding, text-to-image generation, and image editing capabilities. Building on the foundation of the Ovis series, Ovis-U1 incorporates a diffusion-based visual decoder paired with a bidirectional token refiner, enabling image generation capabilities comparable to those of leading models like GPT-4o. Unlike some previous models that use a frozen MLLM for generation tasks, Ovis-U1 uses a new unified training approach that starts from a language model. Compared to training solely on understanding or generation tasks, unified training yields better performance, demonstrating the benefit of integrating these two tasks. Ovis-U1 achieves a score of 69.6 on the OpenCompass Multi-modal Academic Benchmark, surpassing recent state-of-the-art models such as Ristretto-3B and SAIL-VL-1.5-2B. In text-to-image generation, it scores 83.72 on DPG-Bench and 0.89 on GenEval. For image editing, it achieves 4.00 on ImgEdit-Bench and 6.42 on GEdit-Bench-EN. As the initial version of the Ovis unified model series, Ovis-U1 pushes the boundaries of multimodal understanding, generation, and editing.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Image Generation | GenEval | Overall Score | 89 | 467 |
| Mathematical Reasoning | MathVista | Score | 69.4 | 322 |
| Text-to-Image Generation | GenEval | GenEval Score | 89 | 277 |
| Text-to-Image Generation | DPG-Bench | Overall Score | 83.72 | 173 |
| Image Editing | ImgEdit-Bench | Overall Score | 4 | 132 |
| Text-to-Image Generation | DPG-Bench | DPG Score | 83.72 | 89 |
| Multimodal Understanding | MMMU | MMMU Score | 51.1 | 78 |
| Image Editing | GEdit-Bench English | G_O (Overall Quality) | 6.42 | 73 |
| Optical Character Recognition Evaluation | OCRBench | Score | 88.3 | 46 |
| Multi-modal Understanding | MMBench EN | Overall Score | 77.8 | 39 |