Poseur: Direct Human Pose Regression with Transformers
About
We propose a direct, regression-based approach to 2D human pose estimation from single images. We formulate the problem as a sequence prediction task, which we solve using a Transformer network. This network directly learns a regression mapping from images to the keypoint coordinates, without resorting to intermediate representations such as heatmaps. This approach avoids much of the complexity associated with heatmap-based approaches. To overcome the feature misalignment issues of previous regression-based methods, we propose an attention mechanism that adaptively attends to the features that are most relevant to the target keypoints, considerably improving the accuracy. Importantly, our framework is end-to-end differentiable, and naturally learns to exploit the dependencies between keypoints. Experiments on MS-COCO and MPII, two predominant pose-estimation datasets, demonstrate that our method significantly improves upon the state-of-the-art in regression-based pose estimation. More notably, ours is the first regression-based approach to perform favorably compared to the best heatmap-based pose estimation methods.
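The paragraph above describes regressing keypoint coordinates directly from image features via an attention mechanism, rather than decoding heatmaps. The following is a minimal NumPy sketch of that general idea, not the authors' implementation: a set of learnable keypoint queries attends over flattened image features, and a linear head maps each attended vector to a normalized (x, y) coordinate. All names, shapes, and the single-head attention form are illustrative assumptions.

```python
# Hypothetical sketch (NOT the Poseur code): K keypoint queries attend over
# N flattened image features of dimension D, then a linear regression head
# predicts one (x, y) coordinate per keypoint, squashed to [0, 1] by a sigmoid.
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def regress_keypoints(feats, queries, w_out, b_out):
    """feats: (N, D) flattened image features; queries: (K, D) keypoint queries;
    w_out: (D, 2), b_out: (2,) regression head.
    Returns (K, 2) normalized keypoint coordinates in (0, 1)."""
    # Scaled dot-product attention: each query adaptively weights the features.
    attn = softmax(queries @ feats.T / np.sqrt(feats.shape[1]))  # (K, N)
    attended = attn @ feats                                       # (K, D)
    # Direct regression to coordinates -- no intermediate heatmap.
    return 1.0 / (1.0 + np.exp(-(attended @ w_out + b_out)))     # sigmoid

rng = np.random.default_rng(0)
N, D, K = 64, 32, 17  # 17 keypoints, as in the COCO skeleton
coords = regress_keypoints(rng.normal(size=(N, D)),
                           rng.normal(size=(K, D)),
                           rng.normal(size=(D, 2)),
                           np.zeros(2))
print(coords.shape)  # one (x, y) pair per keypoint
```

Because the whole pipeline is a few differentiable matrix operations, gradients flow from the coordinate loss back to both the queries and the features, which is what makes the framework end-to-end trainable.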
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Human Pose Estimation | COCO (test-dev) | 78.3 AP | 408 |
| 2D Human Pose Estimation | COCO 2017 (val) | 76.8 AP | 386 |
| Pose Estimation | COCO (val) | 79.6 AP | 319 |
| Whole-body Pose Estimation | COCO-Wholebody 1.0 (val) | 68.5 Body AP | 64 |
| 2D Human Pose Estimation | MPII (val) | -- | 61 |
| Human Pose Estimation | PoseTrack 2017 (val) | -- | 54 |
| 2D Occluded Pose Estimation | SyncOCC 1.0 (test) | 93.1 AP^OC | 10 |
| 2D Occluded Pose Estimation | SyncOCC-H 1.0 | 78.5 AP^OC | 10 |
| 2D Occluded Pose Estimation | OCHuman 1.0 (test) | 45.6 AP^OC | 10 |
| 2D Occluded Pose Estimation | OCHuman 1.0 (val) | 44.4 AP^OC | 10 |