DreamO: A Unified Framework for Image Customization
About
Recent research on image customization (e.g., identity, subject, style, and background) has demonstrated strong customization capabilities in large-scale generative models. However, most approaches are designed for specific tasks, which limits their ability to combine different types of conditions. Developing a unified framework for image customization remains an open challenge. In this paper, we present DreamO, an image customization framework designed to support a wide range of tasks while facilitating seamless integration of multiple conditions. Specifically, DreamO uses a diffusion transformer (DiT) framework to process inputs of different types uniformly. For training, we construct a large-scale dataset that covers various customization tasks, and we introduce a feature routing constraint to facilitate the precise querying of relevant information from reference images. Additionally, we design a placeholder strategy that associates specific placeholders with conditions at particular positions, enabling control over the placement of conditions in the generated results. Moreover, we employ a progressive training strategy consisting of three stages: an initial stage focused on simple tasks with limited data to establish baseline consistency, a full-scale training stage to comprehensively enhance the customization capabilities, and a final quality alignment stage to correct quality biases introduced by low-quality data. Extensive experiments demonstrate that DreamO can perform various image customization tasks with high quality and flexibly integrate different types of control conditions.
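The abstract names a feature routing constraint but does not give its form. As a rough illustration only, the sketch below assumes one plausible reading: the attention that reference (condition) tokens place on generated-image tokens is averaged into a per-image-token response, which is then pushed toward a target mask marking where the reference should appear. The function name `routing_loss`, the softmax attention, the peak normalization, and the mean-squared-error penalty are all assumptions made for this example, not details confirmed by the text above.

```python
import math

def routing_loss(cond_tokens, image_tokens, target_mask):
    """Hypothetical routing penalty: MSE between an attention response and a mask.

    cond_tokens:  list of feature vectors from one reference image
    image_tokens: list of feature vectors from the generated image
    target_mask:  one value in [0, 1] per image token (1 = reference belongs here)
    """
    d = len(cond_tokens[0])
    n_img = len(image_tokens)
    response = [0.0] * n_img
    for q in cond_tokens:
        # Scaled dot-product similarity of this condition token to every image token.
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in image_tokens]
        # Numerically stable softmax over the image tokens.
        m = max(logits)
        weights = [math.exp(x - m) for x in logits]
        total = sum(weights)
        # Accumulate the average attention over all condition tokens.
        for j, w in enumerate(weights):
            response[j] += w / total / len(cond_tokens)
    # Rescale so the strongest response is 1, making it comparable to a binary mask.
    peak = max(response)
    response = [r / peak for r in response]
    return sum((r - t) ** 2 for r, t in zip(response, target_mask)) / n_img

# Toy example: the reference features resemble image tokens 0 and 1 only.
cond = [[4.0, 0.0], [4.0, 0.0]]
image = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [0.0, 4.0]]
loss_match = routing_loss(cond, image, [1, 1, 0, 0])
loss_mismatch = routing_loss(cond, image, [0, 0, 1, 1])
print(loss_match < loss_mismatch)
```

In this toy setup the penalty is small when the mask covers the image tokens the reference actually attends to and large when it covers the others, which is the behavior a routing-style constraint would reward during training.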
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Single-image editing | GEdit EN (full) | BG Change | 3.06 | 15 |
| Reference-based multi-human generation | MultiHuman TestBench | Count | 61.2 | 14 |
| Identity-Preserving Multi-subject Image Generation | LAMICBench++ Fewer Subjects | ITC | 90.14 | 12 |
| Identity-Preserving Multi-subject Image Generation | LAMICBench++ More Subjects | ITC | 78.49 | 12 |
| Style Transfer | Style-Content Pairs 50 style x 40 content references (test) | CSD Score | 0.402 | 8 |
| Identity-Preserving Text-to-Image Generation | IBench 41 prompts 100 IDs | Aesthetic Score | 67.8 | 7 |
| Identity Customization | IBench ChineseID editable long prompts | Aesthetic Score | 0.678 | 6 |
| Personalized Text-to-Image Generation | IBench ChineseID | Aesthetic Score | 0.678 | 6 |
| In-context image generation | OmniContext (test) | Prompt Following | 6.1 | 5 |
| Multi-human generation | MultiID-2M (test) | Multi-ID (Ref) | 0.396 | 5 |