UPGPT: Universal Diffusion Model for Person Image Generation, Editing and Pose Transfer
About
Text-to-image (T2I) models such as StableDiffusion have been used to generate high-quality images of people. However, due to the random nature of the generation process, the generated person has a different appearance (e.g. pose, face, and clothing) on every run, even with the same text prompt. This appearance inconsistency makes T2I unsuitable for pose transfer. We address this by proposing a multimodal diffusion model that accepts text, pose, and visual prompting. Our model is the first unified method to perform all person image tasks: generation, pose transfer, and mask-less editing. We also pioneer the direct use of low-dimensional 3D body model parameters, demonstrating a new capability: simultaneous pose and camera view interpolation while maintaining the person's appearance.
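To make the interpolation capability concrete, below is a minimal sketch of how low-dimensional body model parameters can be interpolated between two keyframes while the text and visual prompts stay fixed. It assumes SMPL-style parameter vectors (72 pose dims plus 3 assumed camera dims) and a hypothetical `upgpt_sample` call standing in for the model's actual sampling API; these names and shapes are illustrative, not the UPGPT implementation.

```python
# Sketch: pose/camera interpolation over body-model parameters.
# Assumptions: SMPL-style 72-dim pose vector + 3-dim camera vector,
# and a hypothetical `upgpt_sample(text, visual, pose_params)` API.
import numpy as np

def interpolate_params(theta_a: np.ndarray, theta_b: np.ndarray, steps: int):
    """Linearly interpolate between two body-model parameter vectors.

    Note: linear interpolation of axis-angle rotations is a simplification;
    rotations are often interpolated with slerp on quaternions instead.
    """
    for t in np.linspace(0.0, 1.0, steps):
        yield (1.0 - t) * theta_a + t * theta_b

# Two keyframes: flattened pose + camera parameters (values are dummies).
theta_a = np.zeros(72 + 3)
theta_b = np.random.randn(72 + 3) * 0.1

frames = []
for theta in interpolate_params(theta_a, theta_b, steps=8):
    # Because the text and visual prompts are held fixed and only the
    # pose/camera parameters vary, the person's appearance stays
    # consistent across the interpolated frames.
    # frames.append(upgpt_sample(text_prompt, visual_prompt, pose_params=theta))
    pass
```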
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Reposing | WPose (Out-of-Domain) | FID | 75.653 | 10 |
| Reposing | DeepFashion (In-Domain) | FID | 9.611 | 10 |
| Pose Transfer | DeepFashion reduced (test) | FID | 7.876 | 7 |
| Multi-view pose transfer | DeepFashion Multimodal | SSIM | 0.7085 | 5 |
| Text-and-pose guided image generation | DeepFashion Multimodal Text2Human | FID | 23.46 | 3 |
| Text-based Human Image Manipulation | WVTON (test) | FID | 138.2 | 3 |
| Text Manipulation | WVTON Full Edit (test) | Pose Accuracy | 13.2 | 2 |