
OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation

About

In this report, we present OpenUni, a simple, lightweight, and fully open-source baseline for unifying multimodal understanding and generation. Inspired by prevailing practices in unified model learning, we adopt an efficient training strategy that minimizes training complexity and overhead by bridging off-the-shelf multimodal large language models (LLMs) and diffusion models through a set of learnable queries and a lightweight transformer-based connector. With a minimalist choice of architecture, we demonstrate that OpenUni can: 1) generate high-quality and instruction-aligned images, and 2) achieve exceptional performance on standard benchmarks such as GenEval, DPG-Bench, and WISE, with only 1.1B and 3.1B activated parameters. To support open research and community advancement, we release all model weights, training code, and our curated training datasets (including 23M image-text pairs) at https://github.com/wusize/OpenUni.
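
The bridging idea in the abstract (learnable queries plus a lightweight transformer-based connector between a multimodal LLM and a diffusion model) can be illustrated with a toy NumPy sketch. This is not the authors' implementation. The abstract does not say how the queries enter the LLM, so appending them to the prompt embeddings is an assumption here. All sizes, weights, and the names `frozen`-style stand-in `LLM_W`, `CONN_W`, `CONN_OUT`, and `build_condition` are hypothetical, and single attention layers stand in for the real models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
NUM_QUERIES = 4   # number of learnable queries
LLM_DIM = 8       # hidden width of the stand-in multimodal LLM
COND_DIM = 6      # conditioning width expected by a stand-in diffusion model


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a sequence x of shape (seq, dim)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


# Fixed weights for the LLM stand-in (treated as frozen in this sketch).
LLM_W = [rng.standard_normal((LLM_DIM, LLM_DIM)) * 0.1 for _ in range(3)]

# The pieces that would be trained in this sketch: queries and connector.
queries = rng.standard_normal((NUM_QUERIES, LLM_DIM)) * 0.02
CONN_W = [rng.standard_normal((LLM_DIM, LLM_DIM)) * 0.1 for _ in range(3)]
CONN_OUT = rng.standard_normal((LLM_DIM, COND_DIM)) * 0.1


def build_condition(prompt_embeddings):
    """Map prompt embeddings (seq, LLM_DIM) to conditioning (NUM_QUERIES, COND_DIM)."""
    # 1) Assumption: append the learnable queries to the prompt embeddings.
    seq = np.concatenate([prompt_embeddings, queries], axis=0)
    # 2) Run the LLM stand-in and keep only the hidden states at query positions.
    hidden = seq + self_attention(seq, *LLM_W)
    query_states = hidden[-NUM_QUERIES:]
    # 3) Connector: one residual attention layer followed by a linear projection.
    mixed = query_states + self_attention(query_states, *CONN_W)
    return mixed @ CONN_OUT


prompt = rng.standard_normal((5, LLM_DIM))  # five fake prompt token embeddings
cond = build_condition(prompt)
print(cond.shape)  # (4, 6)
```

In this toy setup the output has one row per query, so the amount of conditioning passed to the generator is fixed by `NUM_QUERIES` regardless of prompt length.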

Size Wu, Zhonghua Wu, Zerui Gong, Qingyi Tao, Sheng Jin, Qinyue Li, Wei Li, Chen Change Loy • 2025

Related benchmarks

Task | Dataset | Result | Rank
Multimodal Understanding | MMBench | -- | 847
Text-to-Image Generation | GenEval | Overall Score 84 | 704
Text-to-Image Generation | DPG-Bench | Overall Score 83.08 | 451
Text-to-Image Generation | GenEval | GenEval Score 85 | 442
Multimodal Understanding | MMMU | MMMU Score 48.6 | 232
Text-to-Image Generation | GenEval | Overall Score (GenEval) 0.51 | 153
Vision Understanding | MMBench | Accuracy 81.19 | 141
Visual Perception | MMVP | -- | 118
Text-to-Image Generation | GenEval | Overall Score 86 | 96
Reasoning-based text-to-image generation | WISE | Overall Score 52 | 70
Showing 10 of 19 rows
