CLIP-Guided Adaptable Self-Supervised Learning for Human-Centric Visual Tasks

About

Human-centric visual analysis plays a pivotal role in diverse applications, including surveillance, healthcare, and human-computer interaction. With the emergence of large-scale unlabeled human image datasets, there is an increasing need for a general unsupervised pre-training model capable of supporting diverse human-centric downstream tasks. To achieve this goal, we propose CLASP (CLIP-guided Adaptable Self-suPervised learning), a novel framework designed for unsupervised pre-training in human-centric visual tasks. CLASP leverages the powerful vision-language model CLIP to generate both low-level (e.g., body parts) and high-level (e.g., attributes) semantic pseudo-labels. These multi-level semantic cues are then integrated into the learned visual representations, enriching their expressiveness and generalizability. Recognizing that different downstream tasks demand varying levels of semantic granularity, CLASP incorporates a Prompt-Controlled Mixture-of-Experts (MoE) module. The MoE module dynamically adapts feature extraction based on task-specific prompts, mitigating potential feature conflicts and enhancing transferability. Furthermore, CLASP employs a multi-task pre-training strategy, where part- and attribute-level pseudo-labels derived from CLIP guide the representation learning process. Extensive experiments across multiple benchmarks demonstrate that CLASP consistently outperforms existing unsupervised pre-training methods, advancing the field of human-centric visual analysis.
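
The two mechanisms named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how pseudo-labels are assigned or how the prompt controls the experts, so the code assumes cosine-similarity matching against text-prompt embeddings (the usual CLIP zero-shot recipe) and a softmax gate computed from a task-prompt embedding. All names (`pseudo_label`, `prompt_gated_moe`, `gate_weights`) and all numeric values are hypothetical stand-ins.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pseudo_label(image_feat, text_feats):
    """Assumed rule: pick the text prompt most similar to the image feature."""
    sims = [cosine(image_feat, t) for t in text_feats]
    return max(range(len(sims)), key=sims.__getitem__)

def prompt_gated_moe(feature, prompt_emb, gate_weights, experts):
    """Assumed design: gate weights depend only on the task prompt embedding."""
    gate = softmax([dot(w, prompt_emb) for w in gate_weights])
    outputs = [expert(feature) for expert in experts]
    mixed = [sum(g * out[i] for g, out in zip(gate, outputs))
             for i in range(len(feature))]
    return mixed, gate

# Toy stand-ins for CLIP embeddings (hypothetical values, not real CLIP output).
part_prompts = [[1.0, 0.0], [0.0, 1.0]]   # e.g. embeddings of two body-part prompts
patch_feat = [0.9, 0.1]
label = pseudo_label(patch_feat, part_prompts)

# Two toy experts; each maps a feature vector to a vector of the same length.
experts = [lambda x: [2.0 * v for v in x], lambda x: [-v for v in x]]
gate_weights = [[5.0, 0.0], [0.0, 5.0]]
task_prompt = [1.0, 0.0]                  # hypothetical task-prompt embedding
mixed, gate = prompt_gated_moe([1.0, 2.0], task_prompt, gate_weights, experts)
```

In this sketch the task prompt alone decides how much each expert contributes, which is one simple way to realize the idea that different downstream tasks draw on different experts; the paper's actual routing may differ.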

Mingshuang Luo, Ruibing Hou, Bo Chao, Hong Chang, Zimo Liu, Yaowei Wang, Shiguang Shan • 2026

Related benchmarks

Task	Dataset	Result
Person Re-Identification	MSMT17	mAP 0.788	546
Person Re-Identification	Market1501	mAP 0.946	143
Pedestrian Attribute Recognition	PA-100K	mA 87.91	92
Pose Estimation	COCO	mAP 77.8	30
Pedestrian Attribute Recognition	PETA existing vs. zero-shot (multiple)	mA 77.86	26
Attribute Recognition	RAP zero-shot	mA 77.82	18
Pedestrian Detection	CityPersons highly occluded (HO)	Miss Rate 37.4	16
Person Search	CUHK-SYSU	mAP 96.1	15
Person Search	PRW	mAP 60.8	15
Pedestrian Detection	CityPersons Reasonable	--	9

