Mixture of States: Routing Token-Level Dynamics for Multimodal Generation
About
We introduce MoS (Mixture of States), a novel fusion paradigm for multimodal diffusion models that merges modalities using flexible, state-based interactions. The core of MoS is a learnable, token-wise router that creates denoising timestep- and input-dependent interactions between modalities' hidden states, precisely aligning token-level features with the diffusion trajectory. This router sparsely selects the top-$k$ hidden states and is trained with an $\epsilon$-greedy strategy, efficiently selecting contextual features with minimal learnable parameters and negligible computational overhead. We validate our design with text-to-image generation (MoS-Image) and editing (MoS-Editing), which achieve state-of-the-art results. With only 3B to 5B parameters, our models match or surpass counterparts up to $4\times$ larger. These findings establish MoS as a flexible and compute-efficient paradigm for scaling multimodal diffusion models.
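To make the routing idea concrete, here is a minimal numpy sketch of a token-wise top-$k$ router with $\epsilon$-greedy selection. This is an illustrative assumption, not the paper's implementation: the function name `mos_router`, the dot-product scoring, and the softmax fusion are placeholders, and the denoising-timestep and input conditioning described above is omitted for brevity.

```python
import numpy as np

def mos_router(candidate_states, query_token, k=2, epsilon=0.0, rng=None):
    """Hypothetical token-wise top-k router (illustrative sketch only).

    Scores every hidden state from the other modality against one query
    token, then sparsely selects the top-k states. With probability
    `epsilon`, a random subset is chosen instead (epsilon-greedy
    exploration during training).
    """
    rng = rng or np.random.default_rng(0)
    # Affinity score between the query token and each candidate state.
    scores = candidate_states @ query_token  # shape: (num_states,)
    if epsilon > 0 and rng.random() < epsilon:
        # Explore: pick k states uniformly at random.
        idx = rng.choice(len(scores), size=k, replace=False)
    else:
        # Exploit: pick the k highest-scoring states.
        idx = np.argsort(scores)[-k:]
    # Fuse the selected states, weighted by a softmax over their scores.
    w = np.exp(scores[idx] - scores[idx].max())
    w /= w.sum()
    fused = (w[:, None] * candidate_states[idx]).sum(axis=0)
    return fused, idx
```

Because only $k$ states are gathered per token, the router adds a negligible number of parameters and little compute relative to the backbone, consistent with the efficiency claim above.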
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Image Generation | GenEval | GenEval Score | 90 | 360 |
| Text-to-Image Generation | DPG-Bench | Overall Score | 87.01 | 265 |
| Text-to-Image Generation | DPG-Bench | DPG Score | 87.01 | 131 |
| Image Editing | GEdit-Bench | Semantic Consistency | 8.54 | 92 |
| Text-to-Image Generation | WISE | WISE Score | 0.54 | 48 |
| Text-to-Image Generation | WISE | Cultural Score | 47 | 48 |
| Image Editing | ImgEdit | ImgEdit | 4.33 | 31 |
| Image Generation | oneIG | Alignment | 85 | 22 |
| Image Editing | ImgEdit Benchmark 2025 (full) | Add Score | 4.63 | 15 |
| Text-to-Image Generation | oneIG | oneIG Score | 52 | 6 |