Ovis-Image Technical Report
We introduce $\textbf{Ovis-Image}$, a 7B text-to-image model optimized for high-quality text rendering and designed to operate efficiently under stringent computational constraints. Built on our previous Ovis-U1 framework, Ovis-Image integrates a diffusion-based visual decoder with the stronger Ovis 2.5 multimodal backbone and uses a text-centric training pipeline that combines large-scale pre-training with carefully tailored post-training refinements. Despite its compact architecture, Ovis-Image achieves text-rendering performance on par with significantly larger open models such as Qwen-Image and approaches closed-source systems such as Seedream and GPT-4o. Crucially, the model remains deployable on a single high-end GPU with moderate memory, narrowing the gap between frontier-level text rendering and practical deployment. Our results indicate that pairing a strong multimodal backbone with a carefully designed, text-focused training recipe is sufficient for reliable bilingual text rendering without resorting to oversized or proprietary models.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Image Generation | DPG | Overall Score | 86.59 | 131 |
| Text-to-Image Generation | GenEval | Overall Score | 84 | 68 |
| Text Rendering | CVTG-2K | NED | 96.95 | 28 |
| Spatial Reasoning Generation | OneIG-EN (test) | Alignment Score | 85.8 | 26 |
| Text-to-Image Generation | OneIG-ZH | Alignment | 80.5 | 24 |
| Text Rendering | LongText-Bench Chinese | Score | 0.964 | 13 |
| Text Rendering | LongText-Bench English | Score | 0.922 | 13 |
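
The CVTG-2K row above scores text rendering with NED, a normalized-edit-distance similarity between the text OCR'd from the generated image and the target text. As a rough illustration only (the benchmark's exact normalization may differ), a common formulation is `1 - levenshtein(pred, target) / max(len(pred), len(target))`:

```python
def normalized_edit_similarity(pred: str, target: str) -> float:
    """Similarity in [0, 1] based on Levenshtein distance.

    Illustrative sketch of an NED-style metric; not the official
    CVTG-2K implementation.
    """
    m, n = len(pred), len(target)
    # Single-row dynamic-programming table for edit distance.
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i  # prev holds d(i-1, j-1)
        for j in range(1, n + 1):
            cur = dp[j]  # d(i-1, j), needed as next prev
            dp[j] = min(
                dp[j] + 1,                              # deletion
                dp[j - 1] + 1,                          # insertion
                prev + (pred[i - 1] != target[j - 1]),  # substitution
            )
            prev = cur
    # max(..., 1) guards against dividing by zero on two empty strings.
    return 1.0 - dp[n] / max(m, n, 1)
```

Under this definition, a perfect render scores 1.0 and an entirely wrong string scores 0.0, so a table value like 96.95 would correspond to near-exact text reproduction.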