
A training-free framework for high-fidelity appearance transfer via diffusion transformers

About

Diffusion Transformers (DiTs) excel at generation, but their global self-attention makes controllable, reference-image-based editing a distinct challenge. Unlike with U-Nets, naively injecting local appearance into a DiT can disrupt its holistic scene structure. We address this by proposing the first training-free framework specifically designed to tame DiTs for high-fidelity appearance transfer. At its core is a synergistic system that disentangles structure from appearance. We leverage high-fidelity inversion to establish a rich content prior for the source image, capturing its lighting and micro-textures. A novel attention-sharing mechanism then dynamically fuses purified appearance features from a reference image, guided by geometric priors. Our unified approach operates at 1024px resolution and outperforms specialized methods on tasks ranging from semantic attribute transfer to fine-grained material application. Extensive experiments confirm state-of-the-art performance in both structural preservation and appearance fidelity.
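The attention-sharing idea described above can be illustrated with a minimal sketch: source queries attend over a union of source and reference keys/values, so appearance features from the reference flow into the source tokens. This is not the paper's implementation; the `ref_weight` scalar below is a hypothetical stand-in for the geometric prior that the paper uses to guide the fusion.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def shared_attention(q_src, k_src, v_src, k_ref, v_ref, ref_weight=1.0):
    """Attention sharing, sketched: source queries attend over the
    concatenation of source and reference keys/values, letting the
    reference's appearance features enter the source token stream.
    `ref_weight` is an assumed scalar bias standing in for the
    geometric prior described in the abstract."""
    d = q_src.shape[-1]
    k = np.concatenate([k_src, k_ref], axis=0)   # (N_src + N_ref, d)
    v = np.concatenate([v_src, v_ref], axis=0)
    logits = q_src @ k.T / np.sqrt(d)            # (N_src, N_src + N_ref)
    # Bias the reference logits by the prior weight (assumption, not
    # the paper's exact guidance rule).
    logits[:, k_src.shape[0]:] += np.log(ref_weight + 1e-8)
    return softmax(logits) @ v                   # (N_src, d)

rng = np.random.default_rng(0)
q = rng.standard_normal((16, 64))
k = rng.standard_normal((16, 64))
v = rng.standard_normal((16, 64))
k_ref = rng.standard_normal((16, 64))
v_ref = rng.standard_normal((16, 64))
out = shared_attention(q, k, v, k_ref, v_ref, ref_weight=0.5)
print(out.shape)  # (16, 64)
```

In a real DiT, the same fusion would run inside each self-attention block during denoising, with the reference keys/values taken from an inverted reference trajectory rather than sampled at random.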

Shengrong Gu, Ye Wang, Song Wu, Rui Ma, Qian Wang, Lanjun Wang, Zili Yi • 2026

Related benchmarks

Task | Dataset | Result | Rank
Material Transfer | Curated dataset (100 image pairs) | CLIP-T Score: 0.2927 | 6
Overall Appearance Transfer Quality | Curated dataset (100 image pairs) | DeQA: 4.1728 | 6
Semantic-Aware Appearance Transfer | Curated dataset (100 image pairs) | CLIP-I: 87.49 | 6
