Versatile Multi-Modal Pre-Training for Human-Centric Perception

About

Human-centric perception plays a vital role in vision and graphics, but its data annotations are prohibitively expensive. It is therefore desirable to have a versatile pre-trained model that serves as a foundation for data-efficient transfer to downstream tasks. To this end, we propose HCMoCo, a Human-Centric Multi-Modal Contrastive Learning framework that leverages the multi-modal nature of human data (e.g., RGB, depth, 2D keypoints) for effective representation learning. This objective comes with two main challenges: dense pre-training on multi-modal data, and efficient use of sparse human priors. To tackle these challenges, we design two novel learning targets, Dense Intra-sample Contrastive Learning and Sparse Structure-aware Contrastive Learning, which hierarchically learn a modal-invariant latent space featuring a continuous and ordinal feature distribution and structure-aware semantic consistency. HCMoCo provides pre-training for different modalities by combining heterogeneous datasets, which allows efficient use of existing task-specific human data. Extensive experiments on four downstream tasks of different modalities demonstrate the effectiveness of HCMoCo, especially under data-efficient settings (7.16% and 12% improvement on DensePose Estimation and Human Parsing, respectively). Moreover, we demonstrate the versatility of HCMoCo by exploring cross-modality supervision and missing-modality inference, validating its strong ability in cross-modal association and reasoning.

Fangzhou Hong, Liang Pan, Zhongang Cai, Ziwei Liu • 2022
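
To make the idea of a modal-invariant latent space more concrete, here is a minimal sketch of a generic symmetric cross-modal InfoNCE loss. It is not the paper's Dense Intra-sample or Sparse Structure-aware objective; it only illustrates the basic contrastive principle of pulling together embeddings of the same sample from two modalities (for example RGB and depth) and pushing apart embeddings of different samples. The function names, the temperature value, and the use of NumPy are all assumptions made for illustration.

```python
import numpy as np


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def cross_modal_info_nce(feat_a, feat_b, temperature=0.07):
    """Symmetric InfoNCE loss between two modalities (hypothetical helper).

    feat_a, feat_b: arrays of shape (N, D). Row i of each array is assumed
    to be the embedding of the same sample in two different modalities.
    """
    # L2-normalise so the dot product is a cosine similarity.
    a = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)

    # (N, N) similarity matrix; the diagonal holds the positive pairs.
    logits = a @ b.T / temperature
    idx = np.arange(len(a))

    # Modality A -> B (softmax over rows) and B -> A (softmax over columns).
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_ab + loss_ba)
```

With correctly paired embeddings the loss is small, and it grows when the pairing between the two modalities is broken, which is the signal a contrastive pre-training stage of this kind would optimise.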

Related benchmarks

Task                  Dataset                   Result         Rank
Object Detection      COCO (val)                mAP 63.11      613
Pose Estimation       COCO (val)                AP 76.9        319
DensePose Estimation  COCO (val)                GPS AP 67.77   20
Human Parsing         Human3.6M                 mIoU 66.01     19
Human Parsing         Human3.6M 96              mIoU 66.01     10
Human Parsing         Human3.6M 96 (test)       mIoU 66.01     10
Human Parsing         Human3.6M (Full Data)     mIoU 62.5      8
Human Parsing         Human3.6M (20% Data)      mIoU 60.85     8
Human Parsing         Human3.6M (10% Data)      mIoU 58.28     8
Human Parsing         Human3.6M 1% Data         mIoU 20.78     8
Showing 10 of 19 rows
