Equilibrated Diffusion: Frequency-aware Textual Embedding for Equilibrated Image Customization

About

Image customization learns target subjects from reference concept images and generates new images conditioned on text prompts, mainly by modifying styles or backgrounds. Prevailing methods fine-tune a model to pack diverse concept attributes into a single latent embedding, but the entangled attributes make it hard to remove irrelevant interference from style and background. To address this issue, we propose Equilibrated Diffusion, a frequency-driven approach that disentangles concept features for balanced customization and consistent text-visual matching. Unlike conventional methods, which learn the full concept with a shared embedding and unified tuning, our work exploits the inherent link between image frequency components and semantics: low frequencies represent subject content, and high frequencies correspond to style. We decompose concepts in frequency space and optimize each embedding independently. This separate optimization enables the denoiser to capture style detached from subject identity and to generalize better to unseen stylistic prompts. Merging the multi-frequency embeddings preserves the model's original spatial customization ability. We further deploy mask-guided diffusion to restrict irrelevant background changes and improve text alignment. Residual Reference Attention (RRA) is inserted into spatial attention to retain subject structure and identity consistency. Experiments show that Equilibrated Diffusion exceeds mainstream baselines on subject fidelity and text adherence.
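
As a rough illustration of the frequency split that the abstract relies on, the sketch below separates a grayscale image into a low-frequency part and a high-frequency residual using a radial mask in Fourier space. It is not the authors' implementation: the function name `split_frequencies`, the `cutoff` parameter, and the choice of a hard circular mask are all assumptions made for this example, and the abstract does not say which decomposition the method actually uses.

```python
import numpy as np


def split_frequencies(image: np.ndarray, cutoff: float = 0.1):
    """Split a 2-D grayscale image into low- and high-frequency parts.

    `cutoff` is a hypothetical parameter: the mask radius in cycles per
    pixel (0.5 is the Nyquist limit along one axis).
    """
    h, w = image.shape
    # Per-axis frequency coordinates, in the same layout fft2 uses.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx**2 + fy**2)

    # Hard circular low-pass mask (an assumption; a Gaussian would also work).
    low_mask = radius <= cutoff

    spectrum = np.fft.fft2(image)
    low = np.fft.ifft2(spectrum * low_mask).real
    # Define the high band as the residual so the two parts sum to the input.
    high = image - low
    return low, high


# Example: the two bands always reconstruct the original image.
rng = np.random.default_rng(0)
img = rng.random((32, 32))
low, high = split_frequencies(img)
```

Under the abstract's premise, `low` would carry mostly subject content and `high` mostly style and fine detail; each band could then supervise its own embedding.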

Liyuan Ma, Xueji Fang, Guo-Jun Qi • 2026

Related benchmarks

Task	Dataset	Result	Rank
Subject-driven Text-to-Image Generation	DreamBooth unstylized prompts v1.4	CLIP-T: 0.782	6
Subject-driven Text-to-Image Generation	DreamBooth stylized prompts v1.4	CLIP-T Score: 79	6

