UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing

About

Human pose plays a crucial role in the digital age. While recent works have achieved impressive progress in understanding and generating human poses, they often support only a single modality of control signals and operate in isolation, limiting their application in real-world scenarios. This paper presents UniPose, a framework employing Large Language Models (LLMs) to comprehend, generate, and edit human poses across various modalities, including images, text, and 3D SMPL poses. Specifically, we apply a pose tokenizer to convert 3D poses into discrete pose tokens, enabling seamless integration into the LLM within a unified vocabulary. To further enhance fine-grained pose perception, we equip UniPose with a mixture of visual encoders, among them a pose-specific visual encoder. Benefiting from a unified learning strategy, UniPose effectively transfers knowledge across different pose-relevant tasks, adapts to unseen tasks, and exhibits extended capabilities. This work serves as the first attempt at building a general-purpose framework for pose comprehension, generation, and editing. Extensive experiments highlight UniPose's competitive and even superior performance across various pose-relevant tasks.
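
The idea of turning poses into discrete tokens that share a vocabulary with text can be illustrated with a toy sketch. This is not the paper's tokenizer: the nearest-neighbour codebook lookup, the 2-D codebook entries, the `TEXT_VOCAB_SIZE` value, and every function name below are assumptions made only for illustration.

```python
import math

# Hypothetical size of the LLM's text vocabulary (assumption, not from the paper).
TEXT_VOCAB_SIZE = 32000

# Hypothetical codebook: each entry is a prototype feature for one pose token.
# Real pose features would be far higher-dimensional than these 2-D toys.
CODEBOOK = [
    (0.0, 0.0),
    (1.0, 0.0),
    (0.0, 1.0),
    (1.0, 1.0),
]


def quantize(feature):
    """Return the index of the codebook entry nearest to `feature`."""
    return min(range(len(CODEBOOK)), key=lambda i: math.dist(feature, CODEBOOK[i]))


def pose_to_token_ids(features):
    """Map pose features to ids placed after the text vocabulary."""
    return [TEXT_VOCAB_SIZE + quantize(f) for f in features]


def token_ids_to_features(token_ids):
    """Map pose token ids back to their codebook prototypes."""
    return [CODEBOOK[t - TEXT_VOCAB_SIZE] for t in token_ids]


ids = pose_to_token_ids([(0.1, 0.2), (0.9, 0.8), (0.2, 0.9)])
print(ids)
print(token_ids_to_features(ids))
```

Offsetting the pose ids by the text vocabulary size is one simple way to keep both kinds of token in a single id space, so a language model could read and emit them in the same sequence.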

Yiheng Li, Ruibing Hou, Hong Chang, Shiguang Shan, Xilin Chen • 2024

Related benchmarks

Task	Dataset	Result
3D Human Pose Estimation	Human3.6M (test)	--	570
3D Human Pose Estimation	3DPW (test)	PA-MPJPE 59.1	514
Human Motion Prediction	3DPW	--	27
Vision-to-Motion	Human3.6M (test)	MPJPE 81.8	9
Image-to-text	ImageScript (test)	Top-1 R-Precision 24.5	5
Text-to-Pose	PoseScript (test)	RT2P Top-5 73.7	5
Vision-to-Text	H3.6M	BLEU-4 17.3	4
Image-Diff	ImageDiff (test)	Top-1 R-Precision 13.5	3
Pose Editing	PoseFix	MPJPE 270.3	3
Pose-Diff	PoseFix (test)	R-Precision (Top-1) 67.9	3
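
Several rows above report MPJPE (mean per-joint position error), which is the average Euclidean distance between predicted and ground-truth joint positions and is commonly given in millimetres. A minimal sketch of the computation follows. The joint coordinates are made up, the function name is illustrative, and the Procrustes alignment step that PA-MPJPE applies before measuring the error is not shown.

```python
import math


def mpjpe(pred, target):
    """Mean per-joint position error: average Euclidean distance over joints."""
    if not pred or len(pred) != len(target):
        raise ValueError("pred and target must be non-empty and equally long")
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)


# Toy 3-joint pose with made-up coordinates (not data from the paper).
pred = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 10.0)]
target = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 4.0)]

# Per-joint distances are 0, 5 and 6, so the mean is 11/3.
print(mpjpe(pred, target))
```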

