Hulk: A Universal Knowledge Translator for Human-Centric Tasks
About
Human-centric perception tasks, e.g., pedestrian detection, skeleton-based action recognition, and pose estimation, have wide industrial applications, such as the metaverse and sports analysis. There has been a recent surge of work on human-centric foundation models that can benefit a broad range of human-centric perception tasks. While many human-centric foundation models have achieved success, they did not explore human-centric 3D and vision-language tasks, and they required task-specific finetuning. These limitations restrict their application to more downstream tasks and situations. To tackle these problems, we present Hulk, the first multimodal human-centric generalist model, capable of addressing 2D vision, 3D vision, skeleton-based, and vision-language tasks without task-specific finetuning. The key to achieving this is condensing various task-specific heads into two general heads, one for discrete representations, e.g., languages, and the other for continuous representations, e.g., location coordinates. The outputs of the two heads can be further stacked into four distinct input and output modalities. This uniform representation enables Hulk to treat diverse human-centric tasks as modality translation, integrating knowledge across a wide range of tasks. Comprehensive evaluations of Hulk on 12 benchmarks covering 8 human-centric tasks demonstrate the superiority of our proposed method, which achieves state-of-the-art performance on 11 benchmarks. The code will be available at https://github.com/OpenGVLab/Hulk.
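To make the two-head idea concrete, the following is a minimal sketch of how task outputs might be routed to either a discrete or a continuous head. It is not taken from the Hulk codebase: the names `Head`, `TaskSpec`, `KIND_TO_HEAD`, and `heads_for`, the output kinds, and the example tasks are all hypothetical and chosen only to illustrate the routing described in the abstract.

```python
from dataclasses import dataclass
from enum import Enum


class Head(Enum):
    """The two general heads named in the abstract."""
    DISCRETE = "discrete"      # e.g., language tokens
    CONTINUOUS = "continuous"  # e.g., location coordinates


# Hypothetical mapping from an output kind to the head that predicts it.
# Only the two examples given in the abstract are covered here.
KIND_TO_HEAD = {
    "text": Head.DISCRETE,
    "coordinates": Head.CONTINUOUS,
}


@dataclass(frozen=True)
class TaskSpec:
    """Illustrative task description: a name plus the kinds of values it outputs."""
    name: str
    output_kinds: tuple


def heads_for(task: TaskSpec) -> set:
    """Return the set of heads a task needs, based on its output kinds."""
    return {KIND_TO_HEAD[kind] for kind in task.output_kinds}


# Example tasks (assumed output kinds, for illustration only).
pose = TaskSpec("pose estimation", ("coordinates",))
caption = TaskSpec("image captioning", ("text",))
both = TaskSpec("hypothetical task with text and coordinates", ("text", "coordinates"))

for task in (pose, caption, both):
    names = sorted(head.value for head in heads_for(task))
    print(f"{task.name}: {names}")
```

Under these assumptions, a task is described only by what it outputs, so a new task can reuse the same two heads rather than requiring a dedicated one.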
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Pose Estimation | COCO (val) | AP | 78.7 | 319 |
| Skeleton-based Action Recognition | NTU 60 (X-sub) | Accuracy | 94.3 | 220 |
| Human Mesh Recovery | 3DPW | PA-MPJPE | 38.5 | 123 |
| Pedestrian Attribute Recognition | PA-100K | mA | 88.97 | 79 |
| Whole-body Pose Estimation | COCO-Wholebody 1.0 (val) | Body AP | 70.2 | 64 |
| Pedestrian Detection | CrowdHuman (val) | MR^-2 | 36.5 | 61 |
| 3D Human Pose and Mesh Recovery | Human3.6M | PA-MPJPE | 28.8 | 40 |
| Human Parsing | LIP | mIoU | 66.02 | 39 |
| Pedestrian Detection | CrowdHuman | mAP | 93 | 38 |
| Monocular 3D Human Pose and Mesh Recovery | Human3.6M (test) | PA-MPJPE (mm) | 28.8 | 36 |