Transfer between Modalities with MetaQueries

About

Unified multimodal models aim to integrate understanding (text output) and generation (pixel output), but aligning these different modalities within a single architecture often demands complex training recipes and careful data balancing. We introduce MetaQueries, a set of learnable queries that act as an efficient interface between autoregressive multimodal LLMs (MLLMs) and diffusion models. MetaQueries connects the MLLM's latents to the diffusion decoder, enabling knowledge-augmented image generation by leveraging the MLLM's deep understanding and reasoning capabilities. Our method simplifies training, requiring only paired image-caption data and standard diffusion objectives. Notably, this transfer is effective even when the MLLM backbone remains frozen, thereby preserving its state-of-the-art multimodal understanding capabilities while achieving strong generative performance. Additionally, our method is flexible and can be easily instruction-tuned for advanced applications such as image editing and subject-driven generation.
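The interface described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the backbone here is a toy single-layer self-attention with fixed random weights standing in for a frozen MLLM, and every name and size (`FrozenBackbone`, `encode_condition`, `connector`, `D_MLLM`, `D_COND`, `N_QUERIES`) is a hypothetical choice made for illustration. It shows only the data flow: learnable queries are appended to the prompt tokens, the backbone's outputs at the query positions are read off, and a connector projects them to the width a diffusion decoder would take as conditioning.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are hypothetical, chosen only to keep the example small.
D_MLLM = 64      # hidden size of the stand-in MLLM
D_COND = 32      # conditioning width of the stand-in diffusion decoder
N_QUERIES = 8    # number of learnable queries


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class FrozenBackbone:
    """Toy single-layer self-attention standing in for a frozen MLLM.

    Its weights are created once and never updated.
    """

    def __init__(self, dim):
        scale = dim ** -0.5
        self.wq = rng.normal(size=(dim, dim)) * scale
        self.wk = rng.normal(size=(dim, dim)) * scale
        self.wv = rng.normal(size=(dim, dim)) * scale

    def __call__(self, tokens):
        q, k, v = tokens @ self.wq, tokens @ self.wk, tokens @ self.wv
        # Full (non-causal) attention, a simplification made for this sketch.
        attn = softmax(q @ k.T / np.sqrt(tokens.shape[-1]))
        return tokens + attn @ v  # residual connection


def encode_condition(backbone, prompt_embeds, queries, connector):
    """Return conditioning vectors read from the query positions."""
    # Append the learnable queries after the prompt tokens.
    seq = np.concatenate([prompt_embeds, queries], axis=0)
    hidden = backbone(seq)
    # Keep only the outputs at the query positions.
    query_latents = hidden[-queries.shape[0]:]
    # Project to the width the diffusion decoder would consume.
    return query_latents @ connector


backbone = FrozenBackbone(D_MLLM)

# In this sketch, these two arrays are the only parameters that training would update.
queries = rng.normal(size=(N_QUERIES, D_MLLM)) * 0.02
connector = rng.normal(size=(D_MLLM, D_COND)) * D_MLLM ** -0.5

prompt = rng.normal(size=(5, D_MLLM))  # stand-in for embedded caption tokens
cond = encode_condition(backbone, prompt, queries, connector)
print(cond.shape)  # (8, 32)
```

In a training setup along the lines the abstract describes, `cond` would be passed to a diffusion decoder and a standard denoising loss on paired image-caption data would supply gradients for `queries` and `connector`, while the backbone weights stay fixed. The decoder and the loss are omitted here.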

Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, Ji Hou, Saining Xie • 2025

Related benchmarks

Task	Dataset	Result
Multimodal Understanding	MMBench	--	847
Text-to-Image Generation	GenEval	Overall Score 80	704
Multimodal Understanding	MM-Vet	MM-Vet Score 66.6	631
Multimodal Reasoning	MM-Vet	MM-Vet Score 66.6	517
Text-to-Image Generation	GenEval	Overall Score 80	517
Multimodal Understanding	SEED-Bench	--	516
Text-to-Image Generation	DPG-Bench	Overall Score 82.05	451
Text-to-Image Generation	GenEval	GenEval Score 80	442
Multi-discipline Multimodal Understanding	MMMU	--	363
Text-to-Image Generation	GenEval	Overall Score 0.8	277

Showing 10 of 63 rows
