SIGMA: Selective-Interleaved Generation with Multi-Attribute Tokens

About

Recent unified models such as Bagel demonstrate that paired image-edit data can effectively align multiple visual tasks within a single diffusion transformer. However, these models remain limited to single-condition inputs and lack the flexibility needed to synthesize results from multiple heterogeneous sources. We present SIGMA (Selective-Interleaved Generation with Multi-Attribute Tokens), a unified post-training framework that enables interleaved multi-condition generation within diffusion transformers. SIGMA introduces selective multi-attribute tokens, including style, content, subject, and identity tokens, which allow the model to interpret and compose multiple visual conditions in an interleaved text-image sequence. Through post-training on the Bagel unified backbone with 700K interleaved examples, SIGMA supports compositional editing, selective attribute transfer, and fine-grained multimodal alignment. Extensive experiments show that SIGMA improves controllability, cross-condition consistency, and visual quality across diverse editing and generation tasks, with substantial gains over Bagel on compositional tasks.
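To make the idea of an interleaved text-image sequence with attribute tokens more concrete, here is a minimal sketch of how such an input might be assembled. It is not SIGMA's actual implementation: the marker strings (`<style>`, `<image:...>`), the class names, and the `build_interleaved_sequence` helper are all hypothetical, chosen only to illustrate tagging each image condition with one of the four attribute types named in the abstract.

```python
from dataclasses import dataclass
from typing import List, Union

# Hypothetical marker strings for the four attribute types named in the
# abstract; the real token format used by SIGMA is not given in the source.
ATTRIBUTE_TOKENS = {
    "style": "<style>",
    "content": "<content>",
    "subject": "<subject>",
    "identity": "<identity>",
}


@dataclass
class TextSegment:
    text: str


@dataclass
class ImageCondition:
    image_id: str   # placeholder reference to an image, not real pixel data
    attribute: str  # which attribute of the image the model should use


Segment = Union[TextSegment, ImageCondition]


def build_interleaved_sequence(segments: List[Segment]) -> List[str]:
    """Flatten text and image conditions into one ordered sequence.

    Each image is preceded by its attribute marker, so a downstream model
    could tell which aspect of that image to attend to.
    """
    sequence: List[str] = []
    for seg in segments:
        if isinstance(seg, TextSegment):
            sequence.append(seg.text)
        else:
            if seg.attribute not in ATTRIBUTE_TOKENS:
                raise ValueError(f"unknown attribute: {seg.attribute}")
            sequence.append(ATTRIBUTE_TOKENS[seg.attribute])
            sequence.append(f"<image:{seg.image_id}>")
    return sequence


example = build_interleaved_sequence([
    TextSegment("A portrait of"),
    ImageCondition("face.png", "identity"),
    TextSegment("painted in the manner of"),
    ImageCondition("reference.png", "style"),
])
```

In this sketch the same image could appear twice with different attribute markers, which is one plausible way to express selective attribute transfer.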

Xiaoyan Zhang, Zechen Bai, Haofan Wang, Yiren Song • 2026

Related benchmarks

Task	Dataset	Metric	Result
Compositional generation	XVerse Bench	CLIP Score	31.96	6
Compositional generation	Our Bench	CLIP Score	30.29	6
Layout-based generation	Our Bench Layout only	F1 Score	44	5
Layout-based generation	Our Bench Layout + Reference	F1 Score	35	4
