Less-to-More Generalization: Unlocking More Controllability by In-Context Generation

About

Although subject-driven generation has been extensively explored in image generation due to its wide applications, it still faces challenges in data scalability and subject expansibility. For the first challenge, moving from curating single-subject datasets to multi-subject ones and scaling them is particularly difficult. For the second, most recent methods center on single-subject generation, which makes them hard to apply to multi-subject scenarios. In this study, we propose a highly consistent data synthesis pipeline to tackle this challenge. The pipeline harnesses the intrinsic in-context generation capabilities of diffusion transformers to generate high-consistency multi-subject paired data. Additionally, we introduce UNO, a multi-image conditioned subject-to-image model iteratively trained from a text-to-image model, which consists of progressive cross-modal alignment and universal rotary position embedding. Extensive experiments show that our method achieves high consistency while ensuring controllability in both single-subject and multi-subject driven generation.

Shaojin Wu, Mengqi Huang, Wenxu Wu, Yufeng Cheng, Fei Ding, Qian He • 2025

Related benchmarks

Task	Dataset	Result
Subject-driven image generation	DreamBench	DINO Score 76	113
Instructive image editing	MagicBrush (test)	CLIP Image 0.9236	53
Subject-driven generation	DreamBench	DINO Score 0.76	30
Cinematic Story Generation	ViStoryBench	CSD (Cross) 0.391	24
Multi-image Reasoning	OmniContext	Single Scene Char Score 7.15	20
Personalized Text-to-Image Generation	DreamBench++ Single-subject	CP 0.721	18
Virtual Try-On and Animation	ViViD Dataset	L1 0.2125	18
Virtual Try-On and Animation	Internet Dataset	L1 Loss 0.1774	18
Multi-image context generation	MICON-Bench	Object Score 62.3	18
Identity-preserving Image Generation	MultiID-Bench 1-people	Sim(GT) 0.304	18

Showing 10 of 67 rows
