Versatile Multi-Modal Pre-Training for Human-Centric Perception

About

Human-centric perception plays a vital role in vision and graphics, but annotating human data is prohibitively expensive. It is therefore desirable to have a versatile pre-trained model that serves as a foundation for data-efficient transfer to downstream tasks. To this end, we propose the Human-Centric Multi-Modal Contrastive Learning framework, HCMoCo, which leverages the multi-modal nature of human data (e.g., RGB, depth, 2D keypoints) for effective representation learning. This objective comes with two main challenges: dense pre-training on multi-modal data and efficient use of sparse human priors. To tackle them, we design two novel learning targets, Dense Intra-sample Contrastive Learning and Sparse Structure-aware Contrastive Learning, which hierarchically learn a modal-invariant latent space featuring a continuous and ordinal feature distribution and structure-aware semantic consistency. HCMoCo provides pre-training for different modalities by combining heterogeneous datasets, which allows efficient use of existing task-specific human data. Extensive experiments on four downstream tasks of different modalities demonstrate the effectiveness of HCMoCo, especially under data-efficient settings (7.16% and 12% improvement on DensePose Estimation and Human Parsing, respectively). Moreover, we demonstrate the versatility of HCMoCo by exploring cross-modality supervision and missing-modality inference, validating its strong ability in cross-modal association and reasoning.
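To make the idea of a modal-invariant latent space more concrete, the sketch below shows a generic symmetric InfoNCE loss that pulls together features of the same sample from two modalities (say, RGB and depth) and pushes apart features from different samples. This is only an illustration of cross-modal contrastive learning in general; it is not the Dense Intra-sample or Sparse Structure-aware objective from the paper, whose exact form is not given in this abstract. The function names, the toy feature vectors, and the temperature value of 0.07 are all assumptions made for the example.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_modal_info_nce(feats_a, feats_b, temperature=0.07):
    """Generic symmetric InfoNCE loss between two modalities (hypothetical helper).

    feats_a[i] and feats_b[i] are assumed to come from the same sample;
    every other pairing in the batch is treated as a negative.
    """
    a = [l2_normalize(v) for v in feats_a]
    b = [l2_normalize(v) for v in feats_b]
    n = len(a)

    # sim[i][j]: temperature-scaled cosine similarity between a[i] and b[j].
    sim = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy(logits, target):
        # Numerically stable log-sum-exp minus the positive-pair logit.
        peak = max(logits)
        log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
        return log_sum - logits[target]

    # Modality A retrieves its partner in B, and vice versa.
    loss_ab = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    loss_ba = sum(
        cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (loss_ab + loss_ba)


# Toy features (assumed values): two samples, two modalities.
rgb_feats = [[1.0, 0.0], [0.0, 1.0]]
depth_feats = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = cross_modal_info_nce(rgb_feats, depth_feats)
mismatched_loss = cross_modal_info_nce(rgb_feats, depth_feats[::-1])
print(aligned_loss < mismatched_loss)
```

When paired features agree across modalities the loss is close to zero, and it grows when the pairing is shuffled, which is the behavior a contrastive pre-training objective relies on.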

Fangzhou Hong, Liang Pan, Zhongang Cai, Ziwei Liu • 2022

Related benchmarks

Task	Dataset	Result
Object Detection	COCO (val)	mAP 63.11	637
Pose Estimation	COCO (val)	AP 76.9	319
DensePose Estimation	COCO (val)	GPS AP 67.77	20
Human Parsing	Human3.6M	mIoU 66.01	19
Human Parsing	Human3.6M 96	mIoU 66.01	10
Human Parsing	Human3.6M 96 (test)	mIoU 66.01	10
Human Parsing	Human3.6M (Full Data)	mIoU 62.5	8
Human Parsing	Human3.6M (20% Data)	mIoU 60.85	8
Human Parsing	Human3.6M (10% Data)	mIoU 58.28	8
Human Parsing	Human3.6M (1% Data)	mIoU 20.78	8

Showing 10 of 19 rows
