UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing
About
Human pose plays a crucial role in the digital age. While recent works have achieved impressive progress in understanding and generating human poses, they often support only a single modality of control signals and operate in isolation, limiting their application in real-world scenarios. This paper presents UniPose, a framework employing Large Language Models (LLMs) to comprehend, generate, and edit human poses across various modalities, including images, text, and 3D SMPL poses. Specifically, we apply a pose tokenizer to convert 3D poses into discrete pose tokens, enabling seamless integration into the LLM within a unified vocabulary. To further enhance its fine-grained pose perception, we equip UniPose with a mixture of visual encoders, including a pose-specific visual encoder. Benefiting from a unified learning strategy, UniPose effectively transfers knowledge across different pose-relevant tasks, adapts to unseen tasks, and exhibits extended capabilities. This work serves as the first attempt at building a general-purpose framework for pose comprehension, generation, and editing. Extensive experiments highlight UniPose's competitive and even superior performance across various pose-relevant tasks.
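The idea of turning a pose into discrete tokens that share a vocabulary with text can be illustrated with a small sketch. It is not the UniPose implementation: the abstract does not describe how the tokenizer works internally, so nearest-neighbour lookup in a codebook is assumed here as one common way to discretize continuous features. The codebook is random, the pose encoder that would produce the latent vectors is omitted, and `TEXT_VOCAB_SIZE` and all function names are hypothetical.

```python
import random

# Hypothetical size of the LLM's text vocabulary (an assumption, not from the paper).
TEXT_VOCAB_SIZE = 32000


def build_codebook(num_codes, dim, seed=0):
    """Create a random codebook; a trained tokenizer would learn these vectors."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_codes)]


def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    indices = []
    for z in latents:
        nearest = min(
            range(len(codebook)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(z, codebook[k])),
        )
        indices.append(nearest)
    return indices


def to_vocab_ids(indices, text_vocab_size=TEXT_VOCAB_SIZE):
    """Place pose tokens after the text tokens so both share one id space."""
    return [text_vocab_size + i for i in indices]


def from_vocab_ids(ids, text_vocab_size=TEXT_VOCAB_SIZE):
    """Recover codebook indices from unified-vocabulary ids."""
    return [i - text_vocab_size for i in ids]


if __name__ == "__main__":
    codebook = build_codebook(num_codes=8, dim=4)
    # Stand-in for pose-encoder outputs: slightly perturbed codebook entries.
    latents = [[v + 0.001 for v in codebook[3]], [v - 0.001 for v in codebook[5]]]
    pose_ids = to_vocab_ids(quantize(latents, codebook))
    print(pose_ids)
```

Offsetting the pose indices by the text vocabulary size is one simple way to keep the two token sets from colliding, so a language model could read and emit pose tokens like any other token.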
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| 3D Human Pose Estimation | Human3.6M (test) | -- | -- | 547 |
| 3D Human Pose Estimation | 3DPW (test) | PA-MPJPE | 59.1 | 505 |
| Image-to-text | ImageScript (test) | Top-1 R-Precision | 24.5 | 5 |
| Text-to-Pose | PoseScript (test) | RT2P Top-5 | 73.7 | 5 |
| Image-Diff | ImageDiff (test) | Top-1 R-Precision | 13.5 | 3 |
| Pose Editing | PoseFix | MPJPE | 270.3 | 3 |
| Pose-Diff | PoseFix (test) | R-Precision (Top-1) | 67.9 | 3 |
| Pose-to-Text | PoseScript (test) | R-Precision (Top-1) | 85.6 | 3 |