Exploring Disentangled and Controllable Human Image Synthesis: From End-to-End to Stage-by-Stage

About

Achieving fine-grained controllability in human image synthesis is a long-standing challenge in computer vision. Existing methods primarily focus on either facial synthesis or near-frontal body generation, with limited ability to simultaneously control key factors such as viewpoint, pose, clothing, and identity in a disentangled manner. In this paper, we introduce a new disentangled and controllable human synthesis task, which explicitly separates and manipulates these four factors within a unified framework. We first develop an end-to-end generative model trained on MVHumanNet for factor disentanglement. However, the domain gap between MVHumanNet and in-the-wild data produces unsatisfactory results, motivating the exploration of a virtual try-on (VTON) dataset as a potential solution. Through experiments, we observe that simply incorporating the VTON dataset as additional training data for the end-to-end model degrades performance, primarily due to the inconsistency in data forms between the two datasets, which disrupts the disentanglement process. To better leverage both datasets, we propose a stage-by-stage framework that decomposes human image generation into three sequential steps: clothed A-pose generation, back-view synthesis, and pose and view control. This structured pipeline enables better dataset utilization at different stages, significantly improving controllability and generalization, especially for in-the-wild scenarios. Extensive experiments demonstrate that our stage-by-stage approach outperforms end-to-end models in both visual fidelity and disentanglement quality, offering a scalable solution for real-world tasks. Additional demos are available on the project page: https://taited.github.io/discohuman-project/.

Zhengwentai Sun, Chenghong Li, Hongjie Liao, Xihe Yang, Keru Zheng, Heyuan Li, Yihao Zhi, Shuliang Ning, Shuguang Cui, Xiaoguang Han • 2025
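
The three sequential steps named in the abstract can be pictured as a simple composition of stages. The sketch below is purely structural and is not the authors' implementation: the names `Factors`, `clothed_a_pose`, `back_view`, `pose_and_view`, and `synthesize` are hypothetical, each stage is a string-building placeholder standing in for a trained model, and the assignment of factors to stages (identity and clothing in the first step, pose and viewpoint in the last) is an assumption read off the step names rather than something the abstract states.

```python
from dataclasses import dataclass


@dataclass
class Factors:
    """The four controllable factors listed in the abstract."""
    identity: str
    clothing: str
    pose: str
    viewpoint: str


def clothed_a_pose(f: Factors) -> dict:
    # Stage 1 (placeholder): a front-facing clothed A-pose image.
    # Assumption: this stage consumes identity and clothing only.
    return {"front": f"a_pose({f.identity}, {f.clothing})"}


def back_view(state: dict) -> dict:
    # Stage 2 (placeholder): synthesize the back view from the front view.
    return {**state, "back": f"back_of[{state['front']}]"}


def pose_and_view(state: dict, f: Factors) -> str:
    # Stage 3 (placeholder): apply the target pose and viewpoint.
    # Assumption: this stage sees both views plus pose and viewpoint.
    return (
        f"render({state['front']}, {state['back']}, "
        f"pose={f.pose}, view={f.viewpoint})"
    )


def synthesize(f: Factors) -> str:
    # Run the three stages in the order given in the abstract.
    state = clothed_a_pose(f)
    state = back_view(state)
    return pose_and_view(state, f)
```

Because each stage has its own input and output, each could in principle be trained on whichever dataset suits it, which is the motivation the abstract gives for the stage-by-stage design.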

Related benchmarks

Task	Dataset	Result
Human Image Synthesis	MVHumanNet	PSNR 17.037	8
Human Image Synthesis	THuman 4.0	FID 58.5134	5
Human Image Synthesis	AvatarReX	FID 53.0746	5
Human Image Synthesis	In-the-wild data	Percentage of Votes 61.2	4
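
For readers unfamiliar with the PSNR figure reported for MVHumanNet, the following is a minimal sketch of the standard peak signal-to-noise ratio formula, 10 · log10(MAX² / MSE), in decibels. The function name `psnr`, the flat-list image representation, and the peak value of 255 are assumptions made for illustration; this page does not say how the reported number was computed (pixel range, cropping, or averaging across images).

```python
import math


def psnr(reference, generated, max_value=255.0):
    """PSNR in dB between two equally sized flat sequences of pixel values.

    Assumption: pixels are given as flat lists and max_value is the peak
    intensity (255 for 8-bit images).
    """
    if len(reference) != len(generated):
        raise ValueError("images must have the same number of pixels")
    if not reference:
        raise ValueError("images must not be empty")
    # Mean squared error over all pixels.
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        # Identical images: PSNR is unbounded.
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher PSNR means the generated image is closer to the reference pixel by pixel, whereas for FID, used in the THuman 4.0 and AvatarReX rows, lower is better.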
