
SIGMA: Selective-Interleaved Generation with Multi-Attribute Tokens

About

Recent unified models such as Bagel demonstrate that paired image-edit data can effectively align multiple visual tasks within a single diffusion transformer. However, these models remain limited to single-condition inputs and lack the flexibility needed to synthesize results from multiple heterogeneous sources. We present SIGMA (Selective-Interleaved Generation with Multi-Attribute Tokens), a unified post-training framework that enables interleaved multi-condition generation within diffusion transformers. SIGMA introduces selective multi-attribute tokens, including style, content, subject, and identity tokens, which allow the model to interpret and compose multiple visual conditions in an interleaved text-image sequence. Through post-training on the Bagel unified backbone with 700K interleaved examples, SIGMA supports compositional editing, selective attribute transfer, and fine-grained multimodal alignment. Extensive experiments show that SIGMA improves controllability, cross-condition consistency, and visual quality across diverse editing and generation tasks, with substantial gains over Bagel on compositional tasks.
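As a rough illustration of what an interleaved text-image sequence with attribute tokens could look like, the sketch below flattens text and image conditions into a single ordered list of symbolic tokens. It is not SIGMA's implementation: the four attribute names come from the abstract, but the marker strings, the `Text` and `Image` classes, the `interleave` function, and the file names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Union

# Attribute names are taken from the abstract; the marker strings are hypothetical.
ATTRIBUTES = ("style", "content", "subject", "identity")


@dataclass
class Text:
    value: str


@dataclass
class Image:
    path: str        # placeholder reference to a condition image (hypothetical)
    attribute: str   # which attribute this image should contribute


Segment = Union[Text, Image]


def interleave(segments: List[Segment]) -> List[str]:
    """Flatten text and image conditions into one ordered list of symbolic tokens."""
    sequence: List[str] = []
    for seg in segments:
        if isinstance(seg, Text):
            sequence.append(f"<text>{seg.value}</text>")
        else:
            if seg.attribute not in ATTRIBUTES:
                raise ValueError(f"unknown attribute: {seg.attribute}")
            # The attribute marker precedes the image placeholder it qualifies.
            sequence.append(f"<{seg.attribute}>")
            sequence.append(f"<image:{seg.path}>")
    return sequence


example = interleave([
    Text("Render the person from"),
    Image("face.png", "identity"),
    Text("in the style of"),
    Image("painting.png", "style"),
])
```

In a real model the markers would presumably be learned embeddings rather than strings; the point here is only that each image condition carries an explicit attribute tag and keeps its position relative to the surrounding text.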

Xiaoyan Zhang, Zechen Bai, Haofan Wang, Yiren Song • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Compositional generation | XVerse Bench | CLIP Score | 31.96 | 6 |
| Compositional generation | Our Bench | CLIP Score | 30.29 | 6 |
| Layout-based generation | Our Bench Layout only | F1 Score | 44 | 5 |
| Layout-based generation | Our Bench Layout + Reference | F1 Score | 35 | 4 |
