Versatile Multi-Modal Pre-Training for Human-Centric Perception
About
Human-centric perception plays a vital role in vision and graphics, but the data annotations it requires are prohibitively expensive. It is therefore desirable to have a versatile pre-trained model that serves as a foundation for data-efficient transfer to downstream tasks. To this end, we propose the Human-Centric Multi-Modal Contrastive Learning framework, HCMoCo, which leverages the multi-modal nature of human data (e.g., RGB, depth, 2D keypoints) for effective representation learning. This objective comes with two main challenges: dense pre-training for multi-modal data and efficient use of sparse human priors. To tackle these challenges, we design the novel Dense Intra-sample Contrastive Learning and Sparse Structure-aware Contrastive Learning targets, which hierarchically learn a modal-invariant latent space featuring a continuous and ordinal feature distribution and structure-aware semantic consistency. HCMoCo provides pre-training for different modalities by combining heterogeneous datasets, which allows efficient use of existing task-specific human data. Extensive experiments on four downstream tasks of different modalities demonstrate the effectiveness of HCMoCo, especially under data-efficient settings (7.16% and 12% improvement on DensePose Estimation and Human Parsing, respectively). Moreover, we demonstrate the versatility of HCMoCo by exploring cross-modality supervision and missing-modality inference, validating its strong ability in cross-modal association and reasoning.
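To make the idea of a modal-invariant latent space concrete, the following is a minimal sketch of a generic cross-modal InfoNCE contrastive loss, assuming paired per-sample embeddings from two modalities (e.g., RGB and depth). It is not the HCMoCo objective itself: the dense intra-sample and sparse structure-aware targets are not reproduced here, and the function name `info_nce`, the temperature value, and the embedding shapes are illustrative assumptions.

```python
import numpy as np


def info_nce(z_a, z_b, temperature=0.07):
    """Generic one-directional InfoNCE loss between two modalities.

    z_a, z_b: arrays of shape (N, D); row i of z_a and row i of z_b are
    assumed to come from the same sample (the positive pair), and all
    other rows act as negatives. Hypothetical helper, not the paper's loss.
    """
    # L2-normalize so the dot product is a cosine similarity.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)

    # Pairwise similarities scaled by the temperature, shape (N, N).
    logits = z_a @ z_b.T / temperature

    # Numerically stable row-wise log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The positives sit on the diagonal; average their negative log-likelihood.
    return -np.mean(np.diag(log_prob))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rgb = rng.normal(size=(8, 16))                  # stand-in RGB embeddings
    depth = rgb + 0.01 * rng.normal(size=(8, 16))   # nearly aligned depth embeddings
    print("aligned pairs:   ", info_nce(rgb, depth))
    print("mismatched pairs:", info_nce(rgb, np.roll(depth, 1, axis=0)))
```

Under these assumptions, the loss is small when embeddings of the same sample agree across modalities and grows when the pairing is broken, which is the behaviour a contrastive pre-training objective rewards.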
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Detection | COCO (val) | mAP | 63.11 | 613 |
| Pose Estimation | COCO (val) | AP | 76.9 | 319 |
| DensePose Estimation | COCO (val) | GPS AP | 67.77 | 20 |
| Human Parsing | Human3.6M | mIoU | 66.01 | 19 |
| Human Parsing | Human3.6M 96 | mIoU | 66.01 | 10 |
| Human Parsing | Human3.6M 96 (test) | mIoU | 66.01 | 10 |
| Human Parsing | Human3.6M (Full Data) | mIoU | 62.5 | 8 |
| Human Parsing | Human3.6M (20% Data) | mIoU | 60.85 | 8 |
| Human Parsing | Human3.6M (10% Data) | mIoU | 58.28 | 8 |
| Human Parsing | Human3.6M 1% Data | mIoU | 20.78 | 8 |