Mixture of States: Routing Token-Level Dynamics for Multimodal Generation
About
We introduce MoS (Mixture of States), a novel fusion paradigm for multimodal diffusion models that merges modalities using flexible, state-based interactions. The core of MoS is a learnable, token-wise router that creates denoising timestep- and input-dependent interactions between modalities' hidden states, precisely aligning token-level features with the diffusion trajectory. This router sparsely selects the top-$k$ hidden states and is trained with an $\epsilon$-greedy strategy, efficiently selecting contextual features with minimal learnable parameters and negligible computational overhead. We validate our design with text-to-image generation (MoS-Image) and editing (MoS-Editing), which achieve state-of-the-art results. With only 3B to 5B parameters, our models match or surpass counterparts up to $4\times$ larger. These findings establish MoS as a flexible and compute-efficient paradigm for scaling multimodal diffusion models.
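To make the routing idea concrete, here is a minimal numpy sketch of a token-wise top-$k$ router with $\epsilon$-greedy selection. This is an illustrative assumption, not the paper's implementation: the function name `mos_router`, the dot-product scoring, and the softmax fusion are placeholders, and the denoising-timestep and input conditioning described above is omitted for brevity.

```python
import numpy as np

def mos_router(candidate_states, query_token, k=2, epsilon=0.0, rng=None):
    """Hypothetical token-wise top-k router (illustrative sketch only).

    Scores every hidden state from the other modality against one query
    token, then sparsely selects the top-k states. With probability
    `epsilon`, a random subset is chosen instead (epsilon-greedy
    exploration during training).
    """
    rng = rng or np.random.default_rng(0)
    # Affinity score between the query token and each candidate state.
    scores = candidate_states @ query_token  # shape: (num_states,)
    if epsilon > 0 and rng.random() < epsilon:
        # Explore: pick k states uniformly at random.
        idx = rng.choice(len(scores), size=k, replace=False)
    else:
        # Exploit: pick the k highest-scoring states.
        idx = np.argsort(scores)[-k:]
    # Fuse the selected states, weighted by a softmax over their scores.
    w = np.exp(scores[idx] - scores[idx].max())
    w /= w.sum()
    fused = (w[:, None] * candidate_states[idx]).sum(axis=0)
    return fused, idx
```

Because only $k$ states are gathered per token, the router adds a negligible number of parameters and little compute relative to the backbone, consistent with the efficiency claim above.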
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Image Generation | GenEval | GenEval Score | 90 | 360 |
| Text-to-Image Generation | DPG-Bench | Overall Score | 87.01 | 265 |
| Text-to-Image Generation | DPG-Bench | DPG Score | 87.01 | 131 |
| Image Editing | GEdit-Bench | Semantic Consistency | 8.54 | 92 |
| Text-to-Image Generation | WISE | WISE Score | 0.54 | 48 |
| Text-to-Image Generation | WISE | Cultural Score | 47 | 48 |
| Image Editing | ImgEdit | ImgEdit | 4.33 | 31 |
| Image Generation | oneIG | Alignment | 85 | 22 |
| Image Editing | ImgEdit Benchmark 2025 (full) | Add Score | 4.63 | 15 |
| Text-to-Image Generation | oneIG | oneIG Score | 52 | 6 |