DreamO: A Unified Framework for Image Customization

About

Recent research on image customization (e.g., identity, subject, style, and background) has demonstrated strong customization capabilities in large-scale generative models. However, most approaches are designed for specific tasks, which limits their ability to combine different types of conditions. Developing a unified framework for image customization remains an open challenge. In this paper, we present DreamO, an image customization framework designed to support a wide range of tasks while facilitating seamless integration of multiple conditions. Specifically, DreamO uses a diffusion transformer (DiT) framework to process inputs of different types uniformly. We construct a large-scale training dataset that covers various customization tasks, and during training we introduce a feature routing constraint to facilitate precise querying of relevant information from reference images. Additionally, we design a placeholder strategy that associates specific placeholders with conditions at particular positions, enabling control over where conditions are placed in the generated results. Moreover, we employ a progressive training strategy consisting of three stages: an initial stage focused on simple tasks with limited data to establish baseline consistency, a full-scale training stage to comprehensively enhance the customization capabilities, and a final quality alignment stage to correct quality biases introduced by low-quality data. Extensive experiments demonstrate that DreamO can perform various image customization tasks with high quality and flexibly integrate different types of control conditions.

Chong Mou, Yanze Wu, Wenxu Wu, Zinan Guo, Pengze Zhang, Yufeng Cheng, Yiming Luo, Fei Ding, Shiwen Zhang, Xinghui Li, Mengtian Li, Mingcong Liu, Yi Zhang, Shaojin Wu, Songtao Zhao, Jian Zhang, Qian He, Xinglong Wu • 2025

Related benchmarks

Task	Dataset	Result
Subject-driven image generation	DreamBench	DINO Score 75.37	113
Single-image editing	GEdit EN (full)	BG Change 3.06	42
Identity-preserving Image Generation	MultiID-Bench 1-people	Sim(GT) 0.454	18
Reference-based multi-human generation	MultiHuman TestBench	Count 61.2	14
Outfit Generation	VITON-HD	LPIPS 0.711	13
Identity-Preserving Multi-subject Image Generation	LAMICBench++ Fewer Subjects	ITC 90.14	12
Identity-Preserving Multi-subject Image Generation	LAMICBench++ More Subjects	ITC 78.49	12
Outfit Generation	Fashion130K	LPIPS 0.657	12
model-free try-on	Omni-TryOn (test)	DINO-I 40.92	11
Multi-Human Image Generation	DiverseHumans TestPrompts (2-7 People)	Count Accuracy 70.5	11

Showing 10 of 44 rows
