DisCo: Disentangled Control for Realistic Human Dance Generation

About

Generative AI has made significant strides in computer vision, particularly in text-driven image/video synthesis (T2I/T2V). Despite these notable advancements, human-centric content synthesis, such as realistic dance generation, remains challenging. Current methodologies, primarily tailored for human motion transfer, encounter difficulties when confronted with real-world dance scenarios (e.g., social media dance), which require generalizing across a wide spectrum of poses and intricate human details. In this paper, we depart from the traditional paradigm of human motion transfer and emphasize two additional critical attributes for the synthesis of human dance content in social media contexts: (i) Generalizability: the model should be able to generalize beyond generic human viewpoints as well as to unseen human subjects, backgrounds, and poses; (ii) Compositionality: it should allow for the seamless composition of seen/unseen subjects, backgrounds, and poses from different sources. To address these challenges, we introduce DisCo, which includes a novel model architecture with disentangled control to improve the compositionality of dance synthesis, and an effective human attribute pre-training for better generalizability to unseen humans. Extensive qualitative and quantitative results demonstrate that DisCo can generate high-quality human dance images and videos with diverse appearances and flexible motions. Code is available at https://disco-dance.github.io/.

Tan Wang, Linjie Li, Kevin Lin, Yuanhao Zhai, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, Lijuan Wang • 2023

Related benchmarks

Task	Dataset	Result
Human Dance Generation	TikTok (test)	SSIM 0.674	17
Human Image Animation	TikTok	FVD 292.8	15
Character Image Animation	Follow-Your-Pose V2	LPIPS 0.239	15
Reposing	WPose (Out-of-Domain)	FID 50.948	10
2D Human Video Generation	Human Video Generation Dataset (test)	FID 60.95	10
Reposing	DeepFashion In-Domain	FID 9.818	10
Human Image Animation	Unseen100	L1 Loss 3.74e+4	9
Full-body selfie generation	Collected selfie-to-full-body dataset 17 captures 1.0 (test)	LPIPS 0.287	8
Novel Pose Synthesis	Thuman	FID 34.63	7
Novel View Synthesis	Thuman	FID 28.71	7

Showing 10 of 19 rows
