
CLIP-Guided Adaptable Self-Supervised Learning for Human-Centric Visual Tasks

About

Human-centric visual analysis plays a pivotal role in diverse applications, including surveillance, healthcare, and human-computer interaction. With the emergence of large-scale unlabeled human image datasets, there is an increasing need for a general unsupervised pre-training model capable of supporting diverse human-centric downstream tasks. To this end, we propose CLASP (CLIP-guided Adaptable Self-suPervised learning), a novel framework for unsupervised pre-training in human-centric visual tasks. CLASP leverages the powerful vision-language model CLIP to generate both low-level (e.g., body parts) and high-level (e.g., attributes) semantic pseudo-labels. These multi-level semantic cues are then integrated into the learned visual representations, enriching their expressiveness and generalizability. Recognizing that different downstream tasks demand varying levels of semantic granularity, CLASP incorporates a Prompt-Controlled Mixture-of-Experts (MoE) module, which dynamically adapts feature extraction based on task-specific prompts, mitigating potential feature conflicts and enhancing transferability. Furthermore, CLASP employs a multi-task pre-training strategy in which part- and attribute-level pseudo-labels derived from CLIP guide the representation learning process. Extensive experiments across multiple benchmarks demonstrate that CLASP consistently outperforms existing unsupervised pre-training methods, advancing the field of human-centric visual analysis.
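The prompt-controlled routing idea described above can be illustrated with a minimal sketch: a gating network maps a task prompt embedding to softmax weights over a set of experts, and the visual feature is processed as a prompt-dependent mixture of expert outputs. This is an illustrative NumPy toy, not the authors' implementation; all dimensions, weight matrices, and variable names are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from the paper)
d_feat, d_prompt, n_experts = 8, 4, 3

# Each expert is a simple linear map over the visual feature
expert_weights = rng.standard_normal((n_experts, d_feat, d_feat)) * 0.1

# Gating network: maps the task prompt embedding to logits over experts
gate_w = rng.standard_normal((d_prompt, n_experts)) * 0.1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def prompt_moe(feature, prompt):
    """Route a visual feature through experts weighted by the task prompt."""
    gates = softmax(prompt @ gate_w)                              # (n_experts,)
    expert_out = np.einsum('eij,j->ei', expert_weights, feature)  # (n_experts, d_feat)
    return gates @ expert_out                                     # (d_feat,)

feature = rng.standard_normal(d_feat)
reid_prompt = rng.standard_normal(d_prompt)  # stands in for a "re-identification" prompt
pose_prompt = rng.standard_normal(d_prompt)  # stands in for a "pose estimation" prompt

out_reid = prompt_moe(feature, reid_prompt)
out_pose = prompt_moe(feature, pose_prompt)
# Different task prompts produce different mixtures of the same shared experts
```

In the real framework the gating would be learned jointly with the experts during multi-task pre-training, so that tasks needing fine-grained part cues and tasks needing holistic attribute cues select different expert combinations.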

Mingshuang Luo, Ruibing Hou, Bo Chao, Hong Chang, Zimo Liu, Yaowei Wang, Shiguang Shan • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Person Re-Identification | MSMT17 | mAP | 0.788 | 404 |
| Pedestrian Attribute Recognition | PA-100K | mA | 87.91 | 79 |
| Person Re-Identification | Market1501 | mAP | 0.946 | 57 |
| Pose Estimation | COCO | mAP | 77.8 | 30 |
| Pedestrian Attribute Recognition | PETA existing vs. zero-shot (multiple) | mA | 77.86 | 23 |
| Pedestrian Detection | CityPersons highly occluded (HO) | Miss Rate | 37.4 | 16 |
| Attribute Recognition | RAP zero-shot | mA | 77.82 | 15 |
| Person Search | CUHK-SYSU | mAP | 96.1 | 15 |
| Person Search | PRW | mAP | 60.8 | 15 |
| Pedestrian Detection | CityPersons Reasonable | -- | -- | 9 |
Showing 10 of 11 rows
