Cross-Image Attention for Zero-Shot Appearance Transfer

About

Recent advancements in text-to-image generative models have demonstrated a remarkable ability to capture a deep semantic understanding of images. In this work, we leverage this semantic knowledge to transfer the visual appearance between objects that share similar semantics but may differ significantly in shape. To achieve this, we build upon the self-attention layers of these generative models and introduce a cross-image attention mechanism that implicitly establishes semantic correspondences across images. Specifically, given a pair of images -- one depicting the target structure and the other specifying the desired appearance -- our cross-image attention combines the queries corresponding to the structure image with the keys and values of the appearance image. This operation, when applied during the denoising process, leverages the established semantic correspondences to generate an image combining the desired structure and appearance. In addition, to improve the output image quality, we harness three mechanisms that manipulate either the noisy latent codes or the model's internal representations throughout the denoising process. Importantly, our approach is zero-shot, requiring no optimization or training. Experiments show that our method is effective across a wide range of object categories and is robust to variations in shape, size, and viewpoint between the two input images.
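
The query/key/value swap described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a single attention head, omits the learned projections and the three additional quality mechanisms, and uses hypothetical names (`attention`, `cross_image_attention`, `q_struct`, `k_app`, `v_app`) chosen only for illustration. It shows that each structure-image query collects a weighted mix of appearance-image values, with weights given by query-key similarity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of feature vectors."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


def cross_image_attention(q_struct, k_app, v_app):
    """Hypothetical helper: queries from the structure image,
    keys and values from the appearance image."""
    return attention(q_struct, k_app, v_app)


# Toy example (made-up numbers): two structure tokens, two appearance tokens.
q_struct = [[10.0, 0.0], [0.0, 10.0]]       # queries from the structure image
k_app = [[10.0, 0.0], [0.0, 10.0]]          # keys from the appearance image
v_app = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # values from the appearance image

out = cross_image_attention(q_struct, k_app, v_app)
```

In this toy setup each structure query aligns strongly with one appearance key, so each output row is almost exactly the matching appearance value. The output has one row per structure token, which is why the spatial layout follows the structure image while the content comes from the appearance image.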

Yuval Alaluf, Daniel Garibi, Or Patashnik, Hadar Averbuch-Elor, Daniel Cohen-Or • 2023

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Virtual Try-On | VITON-HD (test) | SSIM 76 | 48 |
| Virtual Try-On | StreetTryOn Shop-to-Street | FID 69.444 | 13 |
| Virtual Try-On | DressCode Upper (unpaired and paired) | FIDu 46.309 | 13 |
| Virtual Try-On | DressCode Lower (unpaired and paired) | FID (Unpaired) 42.674 | 13 |
| Virtual Try-On | DressCode Dresses (unpaired and paired) | FIDu 76.94 | 13 |
| Inference Efficiency | Inference Efficiency Evaluation | Inference Latency (s) 24.47 | 12 |
| Virtual Try-On | StreetTryOn Model-to-Street | FID 66.755 | 11 |
| Virtual Try-On | StreetTryOn Street-to-Street | FID 57.753 | 11 |
| Virtual Try-On | StreetTryOn Model-to-Model | FID 52.31 | 11 |
| Structure and appearance control | Natural image | Self-similarity 0.145 | 7 |
Showing 10 of 13 rows
