UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
About
Although existing unified models achieve strong performance in vision-language understanding and text-to-image generation, they remain limited in image perception and manipulation, capabilities increasingly demanded in practical applications. Recently, OpenAI introduced the powerful GPT-4o-Image model, which showcases advanced capabilities in comprehensive image perception and manipulation, sparking widespread interest. Through carefully designed experiments, we observe that GPT-4o-Image likely relies on semantic encoders rather than VAEs for feature extraction, even though VAEs are commonly regarded as crucial for image manipulation tasks. Inspired by this insight, we propose UniWorld-V1, a unified generative framework built upon semantic features extracted from powerful multimodal large language models and contrastive semantic encoders. Using only 2.7M training samples, UniWorld-V1 achieves impressive performance across diverse tasks, including image understanding, generation, manipulation, and perception. We fully open-source the UniWorld-V1 framework, including model weights, training and evaluation scripts, and datasets, to promote reproducibility and further research.
Related benchmarks
| Task | Benchmark | Metric | Score |
|---|---|---|---|
| Text-to-Image Generation | GenEval | Overall Score | 84 |
| Text-to-Image Generation | DPG-Bench | Overall Score | 81.38 |
| Image Editing | ImgEdit-Bench | Overall Score | 3.26 |
| Multimodal Understanding | MM-Vet | MM-Vet Score | 67.1 |
| Multimodal Understanding | MMBench | Accuracy | 83.5 |