The Center of Attention: Center-Keypoint Grouping via Attention for Multi-Person Pose Estimation
About
We introduce CenterGroup, an attention-based framework to estimate human poses from a set of identity-agnostic keypoints and person center predictions in an image. Our approach uses a transformer to obtain context-aware embeddings for all detected keypoints and centers and then applies multi-head attention to directly group joints into their corresponding person centers. While most bottom-up methods rely on non-learnable clustering at inference, CenterGroup uses a fully differentiable attention mechanism that we train end-to-end together with our keypoint detector. As a result, our method obtains state-of-the-art performance with up to 2.5x faster inference than competing bottom-up methods. Our code is available at https://github.com/dvl-tum/center-group.
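The grouping step described above can be illustrated with a minimal NumPy sketch. It is not the released CenterGroup implementation: it assumes the context-aware embeddings have already been computed, uses a single attention head per joint type instead of learned multi-head attention, and reads out each joint location as an attention-weighted average of the detected keypoint coordinates. The names `group_keypoints`, `center_emb`, `kp_emb`, `kp_xy`, and `kp_type` are hypothetical and chosen only for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def group_keypoints(center_emb, kp_emb, kp_xy, kp_type, num_types):
    """Assign keypoints to person centers with scaled dot-product attention.

    center_emb: (C, D) embeddings of person centers (queries).
    kp_emb:     (K, D) embeddings of detected keypoints (keys).
    kp_xy:      (K, 2) image coordinates of the keypoints (values).
    kp_type:    (K,)   integer joint type of each keypoint.
    Returns an array of shape (C, num_types, 2) with one location per
    center and joint type; NaN where no keypoint of that type exists.
    """
    num_centers, dim = center_emb.shape
    poses = np.full((num_centers, num_types, 2), np.nan)
    for t in range(num_types):
        mask = kp_type == t
        if not mask.any():
            continue  # no candidate of this joint type was detected
        # Similarity between every center and every keypoint of type t.
        scores = center_emb @ kp_emb[mask].T / np.sqrt(dim)
        # Each center distributes its attention over the candidates.
        attn = softmax(scores, axis=-1)
        # Soft assignment: attention-weighted average of coordinates.
        poses[:, t] = attn @ kp_xy[mask]
    return poses


# Toy input: two people, two joint types, hand-made embeddings.
center_emb = np.array([[10.0, 0.0], [0.0, 10.0]])
kp_emb = np.array([[10.0, 0.0], [0.0, 10.0], [10.0, 0.0], [0.0, 10.0]])
kp_xy = np.array([[1.0, 1.0], [5.0, 5.0], [1.0, 2.0], [5.0, 6.0]])
kp_type = np.array([0, 0, 1, 1])
poses = group_keypoints(center_emb, kp_emb, kp_xy, kp_type, num_types=2)
```

Because every operation here is differentiable, a grouping step of this form can in principle be trained jointly with the keypoint detector, which is the property the abstract contrasts with non-learnable clustering at inference.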
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Human Pose Estimation | COCO (test-dev) | AP 71.4 | 408 |
| 2D Human Pose Estimation | COCO 2017 (val) | AP 73.3 | 386 |
| Human Pose Estimation | COCO 2017 (test-dev) | AP 71.4 | 180 |
| Multi-person Pose Estimation | CrowdPose (test) | AP 70 | 177 |
| Multi-person Pose Estimation | COCO (test-dev) | AP 71.1 | 101 |
| Multi-person Pose Estimation | COCO 2017 (mini-val) | AP 69.1 | 17 |