
A training-free framework for high-fidelity appearance transfer via diffusion transformers

About

Diffusion Transformers (DiTs) excel at generation, but their global self-attention makes controllable, reference-image-based editing a distinct challenge. Unlike with U-Nets, naively injecting local appearance into a DiT can disrupt its holistic scene structure. We address this by proposing the first training-free framework specifically designed to tame DiTs for high-fidelity appearance transfer. At its core is a synergistic system that disentangles structure from appearance. We leverage high-fidelity inversion to establish a rich content prior for the source image, capturing its lighting and micro-textures. A novel attention-sharing mechanism then dynamically fuses purified appearance features from a reference image, guided by geometric priors. Our unified approach operates at 1024px resolution and outperforms specialized methods on tasks ranging from semantic attribute transfer to fine-grained material application. Extensive experiments confirm state-of-the-art performance in both structural preservation and appearance fidelity.
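The attention-sharing idea described above can be illustrated with a minimal sketch: source queries attend over a union of source and reference keys/values, so appearance features from the reference flow into the source tokens. This is not the paper's implementation; the `ref_weight` scalar below is a hypothetical stand-in for the geometric prior that the paper uses to guide the fusion.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def shared_attention(q_src, k_src, v_src, k_ref, v_ref, ref_weight=1.0):
    """Attention sharing, sketched: source queries attend over the
    concatenation of source and reference keys/values, letting the
    reference's appearance features enter the source token stream.
    `ref_weight` is an assumed scalar bias standing in for the
    geometric prior described in the abstract."""
    d = q_src.shape[-1]
    k = np.concatenate([k_src, k_ref], axis=0)   # (N_src + N_ref, d)
    v = np.concatenate([v_src, v_ref], axis=0)
    logits = q_src @ k.T / np.sqrt(d)            # (N_src, N_src + N_ref)
    # Bias the reference logits by the prior weight (assumption, not
    # the paper's exact guidance rule).
    logits[:, k_src.shape[0]:] += np.log(ref_weight + 1e-8)
    return softmax(logits) @ v                   # (N_src, d)

rng = np.random.default_rng(0)
q = rng.standard_normal((16, 64))
k = rng.standard_normal((16, 64))
v = rng.standard_normal((16, 64))
k_ref = rng.standard_normal((16, 64))
v_ref = rng.standard_normal((16, 64))
out = shared_attention(q, k, v, k_ref, v_ref, ref_weight=0.5)
print(out.shape)  # (16, 64)
```

In a real DiT, the same fusion would run inside each self-attention block during denoising, with the reference keys/values taken from an inverted reference trajectory rather than sampled at random.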

Shengrong Gu, Ye Wang, Song Wu, Rui Ma, Qian Wang, Lanjun Wang, Zili Yi • 2026

Related benchmarks

Task | Dataset | Result | Rank
Material Transfer | Curated dataset (100 image pairs) | CLIP-T Score: 0.2927 | 6
Overall Appearance Transfer Quality | Curated dataset (100 image pairs) | DeQA: 4.1728 | 6
Semantic-Aware Appearance Transfer | Curated dataset (100 image pairs) | CLIP-I: 87.49 | 6
