STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning
About
Multimodal large language models (MLLMs) play a pivotal role in advancing the quest for general artificial intelligence. However, achieving a unified objective for multimodal understanding and generation remains challenging due to optimization conflicts and performance trade-offs. To enhance generative performance while preserving existing comprehension capabilities, we introduce STAR, a STacked AutoRegressive scheme for task-progressive unified multimodal learning. This approach decomposes multimodal learning into successive stages: understanding, generation, and editing. By freezing the parameters of the fundamental autoregressive (AR) model and progressively stacking isomorphic AR modules, STAR avoids cross-task interference while expanding the model's capabilities. Concurrently, we introduce a high-capacity vector quantizer (VQ) to enhance the granularity of image representations and employ an implicit reasoning mechanism to improve generation quality under complex conditions. Experiments demonstrate that STAR achieves state-of-the-art performance on GenEval (0.91), DPG-Bench (87.44), and ImgEdit (4.34), validating its efficacy for unified multimodal learning.
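The task-progressive stacking idea above can be sketched in a few lines. This is a hypothetical simplification, not the released implementation: the class names (`ARModule`, `StackedAR`), the stage names, and the freezing mechanics are placeholders standing in for the actual frozen AR backbone and the isomorphic modules stacked per stage.

```python
# Minimal sketch of task-progressive stacking, assuming one
# "module" per stage. In the real model each module would be an
# autoregressive transformer; here a flag stands in for whether
# its parameters receive gradient updates.

class ARModule:
    """Stand-in for one autoregressive module (hypothetical)."""
    def __init__(self, name: str):
        self.name = name
        self.trainable = True  # parameters are being optimized

    def freeze(self) -> None:
        self.trainable = False  # exclude from further optimization


class StackedAR:
    def __init__(self):
        # Stage 1 (understanding): the fundamental AR model.
        self.stack = [ARModule("understanding")]

    def add_stage(self, name: str) -> None:
        # Freeze everything learned so far, then stack a new
        # isomorphic module for the next task. Frozen modules keep
        # their behavior, avoiding cross-task interference.
        for module in self.stack:
            module.freeze()
        self.stack.append(ARModule(name))

    def trainable_modules(self) -> list[str]:
        return [m.name for m in self.stack if m.trainable]


model = StackedAR()
model.add_stage("generation")  # understanding is frozen; generation trains
model.add_stage("editing")     # generation is frozen; editing trains
```

At each stage, only the newest module is trainable, which mirrors how stacking expands capabilities without disturbing what earlier stages already learned.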
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 86.6 | 935 |
| Text-to-Image Generation | GenEval | Overall Score | 0.91 | 467 |
| Mathematical Reasoning | MathVista | Score | 68.1 | 322 |
| Multimodal Understanding | SEED-Bench | -- | -- | 203 |
| Multimodal Understanding | MMStar | -- | -- | 197 |
| Text-to-Image Generation | DPG-Bench | Overall Score | 87.44 | 173 |
| Image Editing | ImgEdit-Bench | Overall Score | 4.34 | 132 |
| Multimodal Understanding | MMMU | MMMU Score | 58.6 | 78 |
| Optical Character Recognition Evaluation | OCRBench | Score | 86.4 | 46 |
| Knowledge-grounded Reasoning | WISE | Overall Score | 66 | 45 |