Transfer between Modalities with MetaQueries
About
Unified multimodal models aim to integrate understanding (text output) and generation (pixel output), but aligning these different modalities within a single architecture often demands complex training recipes and careful data balancing. We introduce MetaQueries, a set of learnable queries that act as an efficient interface between autoregressive multimodal LLMs (MLLMs) and diffusion models. MetaQueries connects the MLLM's latents to the diffusion decoder, enabling knowledge-augmented image generation by leveraging the MLLM's deep understanding and reasoning capabilities. Our method simplifies training, requiring only paired image-caption data and standard diffusion objectives. Notably, this transfer is effective even when the MLLM backbone remains frozen, thereby preserving its state-of-the-art multimodal understanding capabilities while achieving strong generative performance. Additionally, our method is flexible and can be easily instruction-tuned for advanced applications such as image editing and subject-driven generation.
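The interface described above can be illustrated with a toy sketch in plain Python. It is not the authors' implementation: `frozen_mllm` is a parameter-free attention step standing in for a real MLLM, the `connector` projection and all dimensions are assumptions made for illustration, and training with the diffusion objective is omitted. The sketch shows only the data flow: learnable queries are appended to the prompt, the frozen model processes the full sequence, and the hidden states at the query positions become the conditioning passed to a diffusion decoder.

```python
import math
import random


def frozen_mllm(tokens):
    """Toy stand-in for a frozen MLLM: one softmax-attention mixing step, no trainable weights."""
    dim = len(tokens[0])
    mixed = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in tokens]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        mixed.append([
            sum(w / total * key[j] for w, key in zip(weights, tokens))
            for j in range(dim)
        ])
    return mixed


def linear(vector, weight):
    """Apply a weight matrix of shape (out_dim, in_dim) to one vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def conditioning_from_prompt(prompt_embeddings, meta_queries, connector):
    """Append the queries to the prompt, run the frozen model, project the query outputs."""
    sequence = prompt_embeddings + meta_queries
    hidden = frozen_mllm(sequence)
    query_hidden = hidden[-len(meta_queries):]  # keep only the query positions
    return [linear(h, connector) for h in query_hidden]


rng = random.Random(0)
HIDDEN_DIM, COND_DIM, NUM_QUERIES = 8, 4, 3  # hypothetical sizes

# In this sketch, these two would be the only trainable pieces; the MLLM stays frozen.
meta_queries = random_matrix(NUM_QUERIES, HIDDEN_DIM, rng)
connector = random_matrix(COND_DIM, HIDDEN_DIM, rng)

prompt_a = random_matrix(5, HIDDEN_DIM, rng)  # stand-in for an embedded 5-token caption
prompt_b = random_matrix(7, HIDDEN_DIM, rng)  # stand-in for an embedded 7-token caption

cond_a = conditioning_from_prompt(prompt_a, meta_queries, connector)
cond_b = conditioning_from_prompt(prompt_b, meta_queries, connector)
print(len(cond_a), len(cond_a[0]))
```

Whatever the prompt length, the conditioning has one vector per query, so a diffusion decoder would receive a fixed-size input whose content still depends on the prompt.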
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Multimodal Understanding | MMBench | -- | 637 |
| Multimodal Understanding | MM-Vet | MM-Vet Score: 66.6 | 531 |
| Text-to-Image Generation | GenEval | Overall Score: 80 | 506 |
| Multimodal Reasoning | MM-Vet | MM-Vet Score: 66.6 | 431 |
| Text-to-Image Generation | GenEval | Overall Score: 80 | 391 |
| Text-to-Image Generation | GenEval | GenEval Score: 80 | 360 |
| Multimodal Understanding | SEED-Bench | -- | 343 |
| Multi-discipline Multimodal Understanding | MMMU | -- | 317 |
| Text-to-Image Generation | DPG-Bench | Overall Score: 82.05 | 265 |
| Text-to-Image Generation | GenEval (test) | -- | 221 |