Semantic Generative Tuning for Unified Multimodal Models
About
Unified multimodal models (UMMs) strive to consolidate visual understanding and visual generation within a single architecture. However, prevailing training paradigms optimize understanding through sparse text signals and generation through dense pixel objectives, each independently of the other. This decoupled strategy yields misaligned representation spaces, isolating visual understanding from generation and hindering their mutual reinforcement. This work presents the first systematic investigation into generative post-training, in which we formulate hierarchical visual tasks as generative proxies to bridge this isolation in UMMs. Our empirical investigation reveals that high-level semantic tasks, particularly image segmentation, serve as optimal proxies. Unlike low-level tasks, which distract models with texture details, segmentation provides structural semantics that significantly enhance both vision-centric perception and generative layout fidelity. Building on these insights, we introduce Semantic Generative Tuning (SGT), a novel paradigm that leverages segmentation as a generative proxy to align and synergize multimodal capabilities. Mechanistic analyses further demonstrate that SGT improves the linear separability of features and optimizes visual-textual attention allocation patterns. Extensive evaluations show that SGT consistently improves both multimodal comprehension and generative fidelity across mainstream benchmarks. Our code is available at https://song2yu.github.io/SGT/.
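The abstract refers to feature linear separability without saying how it is measured. One common way to quantify it is a linear probe: freeze the features, fit a linear classifier on them, and read held-out accuracy as the separability score. The sketch below illustrates that idea only; it is not the authors' analysis. The features are synthetic Gaussian clusters, and every name in it (`fit_linear_probe`, `probe_accuracy`, `make_features`, the `separation` parameter) is a hypothetical chosen for illustration.

```python
import math
import random


def fit_linear_probe(xs, ys, steps=200, lr=0.1):
    """Fit a logistic-regression probe with batch gradient descent.

    xs: list of feature vectors (treated as frozen), ys: list of 0/1 labels.
    """
    n, d = len(xs), len(xs[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(xs, ys):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            err = 1.0 / (1.0 + math.exp(-z)) - y
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, xs, ys):
    """Fraction of samples the linear probe classifies correctly."""
    hits = 0
    for x, y in zip(xs, ys):
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        hits += int(pred == y)
    return hits / len(xs)


def make_features(labels, separation, dim, rng):
    """Synthetic stand-in for model features (an assumption, not real data):
    two unit-variance Gaussian clusters centred at +/- separation."""
    return [
        [(separation if y == 1 else -separation) + rng.gauss(0.0, 1.0)
         for _ in range(dim)]
        for y in labels
    ]


rng = random.Random(0)
labels = [rng.randint(0, 1) for _ in range(400)]
train_y, test_y = labels[:300], labels[300:]

scores = {}
for name, separation in [("entangled", 0.1), ("separable", 1.5)]:
    feats = make_features(labels, separation, dim=8, rng=rng)
    w, b = fit_linear_probe(feats[:300], train_y)
    scores[name] = probe_accuracy(w, b, feats[300:], test_y)

acc_entangled = scores["entangled"]
acc_separable = scores["separable"]
print(f"entangled: {acc_entangled:.2f}  separable: {acc_separable:.2f}")
```

In a real analysis the synthetic vectors would be replaced by features extracted from the model before and after tuning, with the probe trained identically in both cases so that any accuracy gap reflects the features rather than the classifier.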
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Image Generation | GenEval | Overall Score | 0.9 | 153 |
| Visual Perception | MMVP | -- | -- | 118 |
| Real-world Question Answering | RWQA | Accuracy | 72.42 | 62 |
| Visual Spatial Reasoning | VSR | Accuracy | 81.54 | 59 |
| Multimodal Understanding | MMStar | Score | 68.33 | 26 |
| Image Editing | GEdit-Bench EN | Score | 6.94 | 20 |
| Hallucination Evaluation | Hallu | Score | 70.24 | 13 |
| Visual Mathematical Reasoning | MathV | Score | 73.9 | 9 |