
Transfer between Modalities with MetaQueries

About

Unified multimodal models aim to integrate understanding (text output) and generation (pixel output), but aligning these different modalities within a single architecture often demands complex training recipes and careful data balancing. We introduce MetaQueries, a set of learnable queries that act as an efficient interface between autoregressive multimodal LLMs (MLLMs) and diffusion models. MetaQueries connects the MLLM's latents to the diffusion decoder, enabling knowledge-augmented image generation by leveraging the MLLM's deep understanding and reasoning capabilities. Our method simplifies training, requiring only paired image-caption data and standard diffusion objectives. Notably, this transfer is effective even when the MLLM backbone remains frozen, thereby preserving its state-of-the-art multimodal understanding capabilities while achieving strong generative performance. Additionally, our method is flexible and can be easily instruction-tuned for advanced applications such as image editing and subject-driven generation.
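The interface described above can be sketched in a few lines: a set of learnable query embeddings is appended to the frozen MLLM's input sequence, and the latents the MLLM produces at those query positions are mapped by a small trainable connector into the conditioning space of a diffusion decoder. The class name, dimensions, and connector shape below are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class MetaQueriesConnector(nn.Module):
    """Illustrative sketch of a MetaQueries-style interface.
    All sizes and the MLP connector design are assumptions."""

    def __init__(self, num_queries=64, mllm_dim=4096, diff_cond_dim=1024):
        super().__init__()
        # Learnable queries appended to the (frozen) MLLM input sequence.
        self.queries = nn.Parameter(torch.randn(num_queries, mllm_dim) * 0.02)
        # Small trainable connector from MLLM latents to the diffusion
        # decoder's conditioning space; only this and the queries are trained.
        self.connector = nn.Sequential(
            nn.Linear(mllm_dim, diff_cond_dim),
            nn.GELU(),
            nn.Linear(diff_cond_dim, diff_cond_dim),
        )

    def forward(self, mllm, prompt_embeds):
        # prompt_embeds: (B, T, mllm_dim) embedded caption tokens.
        b = prompt_embeds.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Run the frozen backbone over [caption tokens; meta queries].
        hidden = mllm(torch.cat([prompt_embeds, q], dim=1))
        # Keep only the query positions; these condition the diffusion decoder
        # (e.g. via cross-attention) under a standard diffusion loss.
        return self.connector(hidden[:, -self.queries.size(0):])
```

Because gradients only flow into the queries and the connector, the MLLM backbone can stay frozen, which is how the method preserves its understanding performance while adding generation.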

Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, Ji Hou, Saining Xie • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Text-to-Image Generation | GenEval | Overall Score: 80 | 467 |
| Multimodal Understanding | MM-Vet | MM-Vet Score: 66.6 | 418 |
| Multimodal Understanding | MMBench | -- | 367 |
| Multimodal Reasoning | MM-Vet | MM-Vet Score: 66.6 | 281 |
| Text-to-Image Generation | GenEval | GenEval Score: 80 | 277 |
| Multi-discipline Multimodal Understanding | MMMU | -- | 266 |
| Multimodal Understanding | SEED-Bench | -- | 203 |
| Text-to-Image Generation | DPG-Bench | Overall Score: 82.05 | 173 |
| Text-to-Image Generation | DPG | Overall Score: 82.05 | 131 |
| Vision Understanding | MMBench | Accuracy: 83.5 | 104 |

Showing 10 of 41 rows

Other info

Code
