Mixture of States: Routing Token-Level Dynamics for Multimodal Generation

About

We introduce MoS (Mixture of States), a novel fusion paradigm for multimodal diffusion models that merges modalities using flexible, state-based interactions. The core of MoS is a learnable, token-wise router that creates denoising timestep- and input-dependent interactions between modalities' hidden states, precisely aligning token-level features with the diffusion trajectory. This router sparsely selects the top-$k$ hidden states and is trained with an $\epsilon$-greedy strategy, efficiently selecting contextual features with minimal learnable parameters and negligible computational overhead. We validate our design with text-to-image generation (MoS-Image) and editing (MoS-Editing), which achieve state-of-the-art results. With only 3B to 5B parameters, our models match or surpass counterparts up to $4\times$ larger. These findings establish MoS as a flexible and compute-efficient paradigm for scaling multimodal diffusion models.
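The routing step described above can be illustrated with a minimal sketch for a single token. This is not the authors' implementation: the function name `route_token`, the softmax weighting over the selected scores, and the use of uniform random sampling for the exploration branch are all assumptions made for illustration. The abstract only states that the router picks the top-$k$ hidden states and is trained with an $\epsilon$-greedy strategy. In MoS the router scores also depend on the denoising timestep and the input, whereas here they are simply passed in.

```python
import math
import random


def route_token(logits, states, k, epsilon=0.0, rng=random):
    """Mix k candidate hidden states for one token (illustrative sketch).

    logits: one router score per candidate state (assumed router output)
    states: candidate hidden-state vectors, one per logit
    epsilon: probability of exploring random candidates during training
    """
    n = len(logits)
    if rng.random() < epsilon:
        # Exploration branch: pick k distinct candidates uniformly at random.
        chosen = rng.sample(range(n), k)
    else:
        # Greedy branch: keep the k highest-scoring candidates.
        chosen = sorted(range(n), key=lambda i: logits[i], reverse=True)[:k]

    # Softmax over the selected scores only, so the weights sum to 1.
    peak = max(logits[i] for i in chosen)
    exps = [math.exp(logits[i] - peak) for i in chosen]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the selected hidden states, one dimension at a time.
    dim = len(states[0])
    mixed = [
        sum(w * states[i][d] for w, i in zip(weights, chosen))
        for d in range(dim)
    ]
    return chosen, weights, mixed


# Toy example: four candidate states of dimension 2, keep the top 2.
logits = [0.1, 2.0, -1.0, 2.0]
states = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [2.0, 3.0]]
chosen, weights, mixed = route_token(logits, states, k=2)
```

With `epsilon=0.0` the selection is purely greedy, which is what one would presumably use at inference time. Only the $k$ selected states enter the weighted sum, which is why the per-token cost of such a router stays small relative to the backbone.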

Haozhe Liu, Ding Liu, Mingchen Zhuge, Zijian Zhou, Tian Xie, Sen He, Yukang Yang, Shuming Liu, Yuren Cong, Jiadong Guo, Hongyu Xu, Ke Xu, Kam-Woh Ng, Juan C. Pérez, Juan-Manuel Pérez-Rúa, Tao Xiang, Wei Liu, Shikun Liu, Jürgen Schmidhuber • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Image Generation | GenEval | GenEval Score | 90 | 360 |
| Text-to-Image Generation | DPG-Bench | Overall Score | 87.01 | 265 |
| Text-to-Image Generation | DPG-Bench | DPG Score | 87.01 | 131 |
| Image Editing | GEdit-Bench | Semantic Consistency | 8.54 | 92 |
| Text-to-Image Generation | WISE | WISE Score | 0.54 | 48 |
| Text-to-Image Generation | WISE | Cultural Score | 47 | 48 |
| Image Editing | ImgEdit | ImgEdit | 4.33 | 31 |
| Image Generation | oneIG | Alignment | 85 | 22 |
| Image Editing | ImgEdit Benchmark 2025 (full) | Add Score | 4.63 | 15 |
| Text to Image | oneIG | oneIG Score | 52 | 6 |
Showing 10 of 11 rows
