
Multi-Level Conditioning by Pairing Localized Text and Sketch for Fashion Image Generation

About

Sketches offer designers a concise yet expressive medium for early-stage fashion ideation by specifying structure, silhouette, and spatial relationships, while textual descriptions complement sketches to convey material, color, and stylistic details. Effectively combining textual and visual modalities requires adhering to the sketch's visual structure while leveraging guidance from localized text attributes. We present LOcalized Text and Sketch with multi-level guidance (LOTS), a framework that enhances fashion image generation by combining global sketch guidance with multiple localized sketch-text pairs. LOTS employs a Multi-level Conditioning Stage to independently encode local features within a shared latent space while maintaining global structural coordination. Then, the Diffusion Pair Guidance stage integrates both local and global conditioning via attention-based guidance within the diffusion model's multi-step denoising process. To validate our method, we develop Sketchy, the first fashion dataset where multiple text-sketch pairs are provided per image. Sketchy provides high-quality, clean sketches with a professional look and consistent structure. To assess robustness beyond this setting, we also include an "in the wild" split with non-expert sketches, featuring higher variability and imperfections. Experiments demonstrate that our method strengthens global structural adherence while leveraging richer localized semantic guidance, improving over the state of the art. The dataset, platform, and code are publicly available.
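As a rough illustration of the idea of attention-based guidance over one global sketch token and several localized sketch-text pairs, here is a minimal toy sketch in plain Python. It is not the LOTS implementation: the function names (`fuse_pair`, `build_conditioning`, `attend`), the averaging used to fuse a pair, the 4-dimensional feature vectors, and the garment labels in the comments are all assumptions made for illustration only.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_pair(sketch_feat, text_feat):
    # Hypothetical pair encoder: average the local sketch and text features
    # so each pair becomes one token in a shared space (an assumption).
    return [(s + t) / 2.0 for s, t in zip(sketch_feat, text_feat)]

def build_conditioning(global_sketch, pairs):
    # One global token followed by one token per localized sketch-text pair.
    return [global_sketch] + [fuse_pair(s, t) for s, t in pairs]

def attend(query, tokens):
    # Scaled dot-product attention of a single query over the tokens;
    # keys and values are the tokens themselves in this toy version.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, tok)) / math.sqrt(d)
              for tok in tokens]
    weights = softmax(scores)
    out = [sum(w * tok[i] for w, tok in zip(weights, tokens))
           for i in range(d)]
    return out, weights

# Toy features (made up for the example).
global_sketch = [1.0, 0.0, 0.0, 0.0]
pairs = [
    ([0.0, 1.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]),  # e.g. an upper-garment region
    ([0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]),  # e.g. a lower-garment region
]
tokens = build_conditioning(global_sketch, pairs)

# A query standing in for one image-latent position at one denoising step.
query = [0.0, 2.0, 0.0, 0.0]
out, weights = attend(query, tokens)
```

In this toy setup the query aligns most with the first pair token, so that token receives the largest attention weight; in a real diffusion model the queries would come from the denoiser's latent features and the attention would be applied at every denoising step.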

Ziyue Liu, Davide Talon, Federico Girella, Zanxi Ruan, Mattia Mondo, Loris Bazzani, Yiming Wang, Marco Cristani • 2026

Related benchmarks

Task                         Dataset                                     Result         Rank
Sketch-to-image              Sketchy (test)                              FID 0.74       17
Sketch-to-Image Generation   Sketchy In the Wild                         FID 1.23       17
Attribute Localization       Sketchy Human Sketches (Human Evaluation)   Precision 87   8
Structural alignment         Sketchy Human Sketches (Human Evaluation)   --             7