UPGPT: Universal Diffusion Model for Person Image Generation, Editing and Pose Transfer
About
Text-to-image (T2I) models such as StableDiffusion have been used to generate high-quality images of people. However, due to the random nature of the generation process, the generated person has a different appearance (e.g. pose, face, and clothing) on every run, even with the same text prompt. This appearance inconsistency makes T2I unsuitable for pose transfer. We address this by proposing a multimodal diffusion model that accepts text, pose, and visual prompting. Our model is the first unified method to perform all person image tasks: generation, pose transfer, and mask-less editing. We also pioneer the direct use of low-dimensional 3D body model parameters, demonstrating a new capability: simultaneous pose and camera view interpolation while maintaining the person's appearance.
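To make the interpolation capability concrete, below is a minimal sketch of how low-dimensional body model parameters can be interpolated between two keyframes while the text and visual prompts stay fixed. It assumes SMPL-style parameter vectors (72 pose dims plus 3 assumed camera dims) and a hypothetical `upgpt_sample` call standing in for the model's actual sampling API; these names and shapes are illustrative, not the UPGPT implementation.

```python
# Sketch: pose/camera interpolation over body-model parameters.
# Assumptions: SMPL-style 72-dim pose vector + 3-dim camera vector,
# and a hypothetical `upgpt_sample(text, visual, pose_params)` API.
import numpy as np

def interpolate_params(theta_a: np.ndarray, theta_b: np.ndarray, steps: int):
    """Linearly interpolate between two body-model parameter vectors.

    Note: linear interpolation of axis-angle rotations is a simplification;
    rotations are often interpolated with slerp on quaternions instead.
    """
    for t in np.linspace(0.0, 1.0, steps):
        yield (1.0 - t) * theta_a + t * theta_b

# Two keyframes: flattened pose + camera parameters (values are dummies).
theta_a = np.zeros(72 + 3)
theta_b = np.random.randn(72 + 3) * 0.1

frames = []
for theta in interpolate_params(theta_a, theta_b, steps=8):
    # Because the text and visual prompts are held fixed and only the
    # pose/camera parameters vary, the person's appearance stays
    # consistent across the interpolated frames.
    # frames.append(upgpt_sample(text_prompt, visual_prompt, pose_params=theta))
    pass
```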
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Reposing | WPose (Out-of-Domain) | FID | 75.653 | 10 |
| Reposing | DeepFashion (In-Domain) | FID | 9.611 | 10 |
| Pose Transfer | DeepFashion reduced (test) | FID | 7.876 | 7 |
| Multi-view pose transfer | DeepFashion Multimodal | SSIM | 0.7085 | 5 |
| Text-and-pose guided image generation | DeepFashion Multimodal Text2Human | FID | 23.46 | 3 |
| Text-based Human Image Manipulation | WVTON (test) | FID | 138.2 | 3 |
| Text Manipulation | WVTON Full Edit (test) | Pose Accuracy | 13.2 | 2 |