STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning
About
Multimodal large language models (MLLMs) play a pivotal role in advancing the quest for general artificial intelligence. However, achieving a unified objective for multimodal understanding and generation remains challenging due to optimization conflicts and performance trade-offs. To enhance generative performance while preserving existing comprehension capabilities, we introduce STAR, a STacked AutoRegressive scheme for task-progressive unified multimodal learning. This approach decomposes multimodal learning into successive stages: understanding, generation, and editing. By freezing the parameters of the fundamental autoregressive (AR) model and progressively stacking isomorphic AR modules, STAR avoids cross-task interference while expanding the model's capabilities. Concurrently, we introduce a high-capacity vector quantizer (VQ) to enhance the granularity of image representations and employ an implicit reasoning mechanism to improve generation quality under complex conditions. Experiments demonstrate that STAR achieves state-of-the-art performance on GenEval (0.91), DPG-Bench (87.44), and ImgEdit (4.34), validating its efficacy for unified multimodal learning.
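The task-progressive stacking idea above can be sketched in a few lines. This is a hypothetical simplification, not the released implementation: the class names (`ARModule`, `StackedAR`), the stage names, and the freezing mechanics are placeholders standing in for the actual frozen AR backbone and the isomorphic modules stacked per stage.

```python
# Minimal sketch of task-progressive stacking, assuming one
# "module" per stage. In the real model each module would be an
# autoregressive transformer; here a flag stands in for whether
# its parameters receive gradient updates.

class ARModule:
    """Stand-in for one autoregressive module (hypothetical)."""
    def __init__(self, name: str):
        self.name = name
        self.trainable = True  # parameters are being optimized

    def freeze(self) -> None:
        self.trainable = False  # exclude from further optimization


class StackedAR:
    def __init__(self):
        # Stage 1 (understanding): the fundamental AR model.
        self.stack = [ARModule("understanding")]

    def add_stage(self, name: str) -> None:
        # Freeze everything learned so far, then stack a new
        # isomorphic module for the next task. Frozen modules keep
        # their behavior, avoiding cross-task interference.
        for module in self.stack:
            module.freeze()
        self.stack.append(ARModule(name))

    def trainable_modules(self) -> list[str]:
        return [m.name for m in self.stack if m.trainable]


model = StackedAR()
model.add_stage("generation")  # understanding is frozen; generation trains
model.add_stage("editing")     # generation is frozen; editing trains
```

At each stage, only the newest module is trainable, which mirrors how stacking expands capabilities without disturbing what earlier stages already learned.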
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 86.6 | 935 |
| Text-to-Image Generation | GenEval | Overall Score | 0.91 | 467 |
| Mathematical Reasoning | MathVista | Score | 68.1 | 322 |
| Multimodal Understanding | SEED-Bench | -- | -- | 203 |
| Multimodal Understanding | MMStar | -- | -- | 197 |
| Text-to-Image Generation | DPG-Bench | Overall Score | 87.44 | 173 |
| Image Editing | ImgEdit-Bench | Overall Score | 4.34 | 132 |
| Multimodal Understanding | MMMU | MMMU Score | 58.6 | 78 |
| Optical Character Recognition Evaluation | OCRBench | Score | 86.4 | 46 |
| Knowledge-grounded Reasoning | WISE | Overall Score | 66 | 45 |