EasyControl: Adding Efficient and Flexible Control for Diffusion Transformer

About

Recent advancements in UNet-based diffusion models, such as ControlNet and IP-Adapter, have introduced effective spatial and subject control mechanisms. However, the DiT (Diffusion Transformer) architecture still struggles with efficient and flexible control. To tackle this issue, we propose EasyControl, a novel framework designed to unify condition-guided diffusion transformers with high efficiency and flexibility. Our framework is built on three key innovations. First, we introduce a lightweight Condition Injection LoRA Module. This module processes conditional signals in isolation, acting as a plug-and-play solution. It avoids modifying the base model weights, ensuring compatibility with customized models and enabling the flexible injection of diverse conditions. Notably, this module also supports harmonious and robust zero-shot multi-condition generalization, even when trained only on single-condition data. Second, we propose a Position-Aware Training Paradigm. This approach standardizes input conditions to fixed resolutions, allowing the generation of images with arbitrary aspect ratios and flexible resolutions. At the same time, it optimizes computational efficiency, making the framework more practical for real-world applications. Third, we develop a Causal Attention Mechanism combined with the KV Cache technique, adapted for conditional generation tasks. This innovation significantly reduces the latency of image synthesis, improving the overall efficiency of the framework. Through extensive experiments, we demonstrate that EasyControl achieves exceptional performance across various application scenarios. These innovations collectively make our framework highly efficient, flexible, and suitable for a wide range of tasks.
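To make the third idea more concrete, here is a minimal sketch of why a causal mask makes caching the condition tokens possible. It is not the authors' implementation: it assumes a single attention head, identity query/key/value projections, and a token order of condition tokens followed by image tokens, and every name in it (`build_mask`, `joint_step`, `CondKVCache`, `cached_step`) is hypothetical. The point it illustrates is that if condition tokens may attend only to other condition tokens, their keys and values never depend on the noisy image tokens, so they can be computed once and reused at every denoising step.

```python
import math


def softmax(scores):
    # Numerically stable softmax; masked entries (-inf) get weight 0.0.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention(q, k, v, mask):
    # Scaled dot-product attention; mask[i][j] is True when query i may see key j.
    d = len(k[0])
    out = []
    for i, qi in enumerate(q):
        scores = [
            sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d)
            if mask[i][j] else float("-inf")
            for j, kj in enumerate(k)
        ]
        w = softmax(scores)
        out.append([sum(w[j] * v[j][c] for j in range(len(v)))
                    for c in range(len(v[0]))])
    return out


def build_mask(n_cond, n_img):
    # Assumed layout: [condition tokens..., image tokens...].
    # Condition rows see only condition columns; image rows see everything.
    n = n_cond + n_img
    return [[(j < n_cond) or (i >= n_cond) for j in range(n)] for i in range(n)]


def joint_step(cond, img):
    # Reference path: recompute attention over all tokens at every step.
    tokens = cond + img
    return masked_attention(tokens, tokens, tokens,
                            build_mask(len(cond), len(img)))


class CondKVCache:
    # Stores condition keys/values once. With the identity projections assumed
    # here, they are simply copies of the condition tokens.
    def __init__(self, cond):
        self.k = [list(t) for t in cond]
        self.v = [list(t) for t in cond]


def cached_step(cache, img):
    # Cached path: only image queries are evaluated; condition K/V are reused.
    keys = cache.k + img
    values = cache.v + img
    mask = [[True] * len(keys) for _ in img]
    return masked_attention(img, keys, values, mask)
```

Under these assumptions, `cached_step` returns the same image-token outputs as the image rows of `joint_step`, while skipping the condition-token queries entirely on every step after the cache is built.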

Yuxuan Zhang, Yirui Yuan, Yiren Song, Haofan Wang, Jiaming Liu • 2025

Related benchmarks

Task	Dataset	Metric	Result	
Subject-driven image generation	DreamBench	DINO Score	65.2	113
Image Inpainting	High-resolution Image Editing 50% Edit Ratio	Latency (s)	29	18
Image Inpainting	High-resolution Image Editing 75% Edit Ratio	Latency (s)	29.2	18
Image Inpainting	High-resolution Image Editing 25% Edit Ratio	Latency (s)	27.5	18
Multi-condition Image Generation (Multi-Spatial)	Multi-Spatial Evaluation Set	FID	62.38	6
Layout-based generation	Our Bench Layout only	F1 Score	16	5
Subject-driven Text-to-Image Generation	DreamBench	Subject Fidelity	15	5
Text-guided inpainting	1K x 1K resolution dataset	FID	108.6	5
Multi-condition Image Generation (Subject-Canny)	Subject-Canny (Evaluation Set)	FID	57.53	4
Multi-condition Image Generation (Subject-Depth)	Subject-Depth Evaluation Set	FID	68.36	4

Showing 10 of 12 rows
