Wan-Weaver: Interleaved Multi-modal Generation via Decoupled Training
About
Recent unified models have made substantial progress in both understanding and generation. However, while most of them accept multi-modal inputs, they typically produce only single-modality outputs. Producing interleaved content remains difficult mainly because training data is scarce and long-range cross-modal context is hard to model. To address this issue, we decompose interleaved generation into textual planning and visual consistency modeling, and introduce a framework consisting of a planner and a visualizer. The planner produces dense textual descriptions for visual content, while the visualizer synthesizes images accordingly. Following this decomposition, we construct large-scale textual-proxy interleaved data (where visual content is represented in text) to train the planner, and curate reference-guided image data to train the visualizer. These designs give rise to Wan-Weaver, which exhibits emergent interleaved generation ability with long-range textual coherence and visual consistency. In addition, integrating diverse understanding and generation data into planner training enables Wan-Weaver to achieve robust task reasoning and generation proficiency. To assess the model's capability in interleaved generation, we further construct a benchmark that spans a wide range of use cases across multiple dimensions. Extensive experiments demonstrate that, even without access to any real interleaved data, Wan-Weaver outperforms existing methods.
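As a rough illustration of the planner/visualizer decomposition described above, the following Python sketch shows one way the two stages could be chained at inference time. It is not the authors' implementation: every name here (`TextSegment`, `ImagePlan`, `Image`, `generate_interleaved`, `toy_planner`, `toy_visualizer`) is hypothetical, the toy functions stand in for the actual models, and passing earlier images to the visualizer as references is an assumption about how reference-guided consistency might be wired up.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class TextSegment:
    """A span of ordinary text in the interleaved output."""
    text: str


@dataclass
class ImagePlan:
    """A dense textual description standing in for an image (textual proxy)."""
    description: str


@dataclass
class Image:
    """Placeholder for a synthesized image (hypothetical; no pixels here)."""
    description: str
    references: List["Image"]  # earlier images the visualizer was given


PlanItem = Union[TextSegment, ImagePlan]
Planner = Callable[[str], List[PlanItem]]
Visualizer = Callable[[str, List[Image]], Image]


def generate_interleaved(
    prompt: str, planner: Planner, visualizer: Visualizer
) -> List[Union[TextSegment, Image]]:
    """Run the planner once, then render each planned image in order."""
    output: List[Union[TextSegment, Image]] = []
    references: List[Image] = []
    for item in planner(prompt):
        if isinstance(item, ImagePlan):
            # Assumption: previously generated images serve as references.
            image = visualizer(item.description, list(references))
            references.append(image)
            output.append(image)
        else:
            output.append(item)
    return output


def toy_planner(prompt: str) -> List[PlanItem]:
    """Stand-in planner returning a fixed text/image plan."""
    return [
        TextSegment(f"Step 1 of: {prompt}"),
        ImagePlan("a red kettle on a stove"),
        TextSegment("Step 2"),
        ImagePlan("the same red kettle pouring water into a cup"),
    ]


def toy_visualizer(description: str, references: List[Image]) -> Image:
    """Stand-in visualizer that only records its inputs."""
    return Image(description=description, references=references)


result = generate_interleaved("make tea", toy_planner, toy_visualizer)
```

In this sketch the planner only ever emits text, which mirrors the idea of training it on textual-proxy data, while the visualizer sees the description plus earlier images so that later images can stay consistent with earlier ones.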
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Generation | GenEval | Overall Score | 89 | 57 |
| Image Generation | DPG | DPG Score | 87.21 | 47 |
| Image Editing | ImgEdit | ImgEdit | 4.31 | 31 |
| Image Editing | GEdit-EN | GEdit-EN Score | 7.39 | 27 |
| Understanding | MMMU | MMMU Score | 74.9 | 20 |
| Interleaved Image-Text Generation | WeaverBench | FDT | 91.84 | 15 |
| Interleaved Image-Text Generation | OpenING | FDT | 62.94 | 15 |
| Understanding | MathVista | MathVista Score | 84.3 | 12 |