OminiControl: Minimal and Universal Control for Diffusion Transformer

About

We present OminiControl, a novel approach that rethinks how image conditions are integrated into Diffusion Transformer (DiT) architectures. Current image conditioning methods either introduce substantial parameter overhead or handle only specific control tasks effectively, limiting their practical versatility. OminiControl addresses these limitations through three key innovations: (1) a minimal architectural design that leverages the DiT's own VAE encoder and transformer blocks, requiring just 0.1% additional parameters; (2) a unified sequence processing strategy that combines condition tokens with image tokens for flexible token interactions; and (3) a dynamic position encoding mechanism that adapts to both spatially-aligned and non-aligned control tasks. Our extensive experiments show that this streamlined approach not only matches but surpasses the performance of specialized methods across multiple conditioning tasks. To overcome data limitations in subject-driven generation, we also introduce Subjects200K, a large-scale dataset of identity-consistent image pairs synthesized using DiT models themselves. This work demonstrates that effective image control can be achieved without architectural complexity, opening new possibilities for efficient and versatile image generation systems.
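The second and third ideas in the abstract (one joint token sequence, plus position indices that depend on whether the condition is spatially aligned) can be illustrated with a small sketch. This is not the authors' implementation: the function names are hypothetical, tokens are stand-in strings rather than VAE latents, and the "shift by one grid width" rule for non-aligned conditions is an assumption chosen only to show how overlapping and non-overlapping indices differ.

```python
from typing import List, Sequence, Tuple

Position = Tuple[int, int]  # (row, col) index of a token on the latent grid


def grid_positions(height: int, width: int) -> List[Position]:
    """Row-major (row, col) indices for a height x width token grid."""
    return [(r, c) for r in range(height) for c in range(width)]


def condition_positions(height: int, width: int, spatially_aligned: bool) -> List[Position]:
    """Position indices for the condition tokens (hypothetical helper).

    Aligned tasks (e.g. edge or depth maps) reuse the image token indices, so
    each condition token sits "on top of" the image token it controls.
    Non-aligned tasks (e.g. subject-driven generation) get indices shifted by
    one grid width so that they do not overlap the image indices. The shift
    rule is an assumption made for this sketch.
    """
    base = grid_positions(height, width)
    if spatially_aligned:
        return base
    return [(r, c + width) for r, c in base]


def build_sequence(
    image_tokens: Sequence[str],
    condition_tokens: Sequence[str],
    height: int,
    width: int,
    spatially_aligned: bool,
) -> Tuple[List[str], List[Position]]:
    """Concatenate image and condition tokens into one sequence with positions."""
    expected = height * width
    if len(image_tokens) != expected or len(condition_tokens) != expected:
        raise ValueError("both token lists must contain height * width tokens")
    tokens = list(image_tokens) + list(condition_tokens)
    positions = grid_positions(height, width) + condition_positions(
        height, width, spatially_aligned
    )
    return tokens, positions
```

With a 2 x 2 grid, `build_sequence(img, cond, 2, 2, True)` gives the condition tokens the same four indices as the image tokens, while passing `False` moves them to columns 2 and 3. In either case a transformer block would attend over the single concatenated sequence, which is the "unified sequence processing" the abstract refers to.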

Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, Xinchao Wang • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Subject-driven image generation | DreamBench | DINO Score | 68.4 | 100 |
| Personalized Image Generation | DreamBooth | CLIP-I Score | 72.7 | 34 |
| Subject-driven generation | DreamBench | DINO Score | 0.684 | 28 |
| Personalized Text-to-Image Generation | DreamBench++ Single-subject | CP | 0.596 | 18 |
| Image Personalization | User Study Personalization Tasks | Concept Preservation (CP) | 72.1 | 17 |
| Subject-driven Contextual Image Editing | DreamBench++ Multiple-Object | DINO-I Score | 0.501 | 10 |
| Neural-guided Image Editing | LoongX (test) | L1 Loss | 0.2632 | 7 |
| Identity-preserving Image Generation | 3D Assets (test) | GPT-eval Texture | 5.631 | 6 |
| 3D-conditioned Image Generation | User Study | Faithfulness | 3.909 | 6 |
| Controllable Image Generation | 5,000-image Reconstruction (evaluation) | FID | 17.38 | 6 |
Showing 10 of 19 rows
