TryOnDiffusion: A Tale of Two UNets
About
Given two images depicting a person and a garment worn by another person, our goal is to generate a visualization of how the garment might look on the input person. A key challenge is to synthesize a photorealistic, detail-preserving visualization of the garment while warping the garment to accommodate a significant body pose and shape change across the subjects. Previous methods either focus on garment detail preservation without effective pose and shape variation, or allow try-on with the desired shape and pose but lack garment details. In this paper, we propose a diffusion-based architecture that unifies two UNets (referred to as Parallel-UNet), which allows us to preserve garment details and warp the garment for significant pose and body change in a single network. The key ideas behind Parallel-UNet are: 1) the garment is warped implicitly via a cross-attention mechanism, and 2) garment warping and person blending happen as part of a unified process rather than as a sequence of two separate tasks. Experimental results indicate that TryOnDiffusion achieves state-of-the-art performance both qualitatively and quantitatively.
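To make the idea of implicit warping via cross attention more concrete, the following is a minimal single-head sketch in NumPy. It is not the authors' implementation: the function name `cross_attention`, the projection matrices `w_q`, `w_k`, `w_v`, the feature sizes, and the random inputs are all hypothetical and chosen only for illustration. The assumption is that person features act as queries and garment features as keys and values, so each person location pulls in a weighted mix of garment features instead of relying on an explicit flow field.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(person_feats, garment_feats, w_q, w_k, w_v):
    """Single-head cross attention (illustrative sketch only).

    person_feats:  (N_p, C) flattened person feature map, used as queries.
    garment_feats: (N_g, C) flattened garment feature map, used as keys/values.
    Returns the attended garment features (N_p, D) and the weights (N_p, N_g).
    """
    q = person_feats @ w_q   # (N_p, D)
    k = garment_feats @ w_k  # (N_g, D)
    v = garment_feats @ w_v  # (N_g, D)
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product similarities
    weights = softmax(scores, axis=-1)       # each row sums to 1 over garment locations
    return weights @ v, weights


# Hypothetical sizes: an 8x8 feature map flattened to 64 locations, 16 channels.
rng = np.random.default_rng(0)
C, D = 16, 8
person = rng.standard_normal((64, C))
garment = rng.standard_normal((64, C))
w_q, w_k, w_v = (rng.standard_normal((C, D)) for _ in range(3))

warped, weights = cross_attention(person, garment, w_q, w_k, w_v)
```

Each row of `weights` describes where a given person location "looks" in the garment feature map, which is one way to read the phrase "warped implicitly": the correspondence is a soft, learned lookup rather than a separately estimated warp. In a trained model the projections would be learned and the operation would sit inside the UNet blocks; none of that is modeled here.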
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Virtual Try-On | VITON-HD unpaired 1.0 (test) | FID | 23.352 | 14 |
| Virtual Try-On | DressCode triplets (test) | FID | 15.944 | 6 |
| Video Virtual Try-on | Our Dataset Internet Videos v1 (test) | FID | 95 | 5 |
| Video Virtual Try-on | UBC (test) | FID | 94 | 5 |
| Virtual Try-On | 6K unpaired 1.0 (test) | FID | 13.447 | 4 |
| Virtual Try-On | unpaired 6K (Random test) | User Preference Rate | 92.72 | 4 |
| Virtual Try-On | unpaired 6K (Challenging test) | User Preference Rate | 9.58e+3 | 4 |
| Video Virtual Try-on | Our Dataset (test) | Video Smoothness | 0.03 | 4 |
| Virtual Try-On | 8,300 triplets (test) | FID | 19.459 | 2 |
| Virtual Try-On | 1,000 paired (test) | SSIM | 0.883 | 2 |