CLIP-Guided Adaptable Self-Supervised Learning for Human-Centric Visual Tasks
About
Human-centric visual analysis plays a pivotal role in diverse applications, including surveillance, healthcare, and human-computer interaction. With the emergence of large-scale unlabeled human image datasets, there is an increasing need for a general unsupervised pre-training model capable of supporting diverse human-centric downstream tasks. To achieve this goal, we propose CLASP (CLIP-guided Adaptable Self-suPervised learning), a novel framework designed for unsupervised pre-training in human-centric visual tasks. CLASP leverages the powerful vision-language model CLIP to generate both low-level (e.g., body parts) and high-level (e.g., attributes) semantic pseudo-labels. These multi-level semantic cues are then integrated into the learned visual representations, enriching their expressiveness and generalizability. Recognizing that different downstream tasks demand varying levels of semantic granularity, CLASP incorporates a Prompt-Controlled Mixture-of-Experts (MoE) module. The MoE module dynamically adapts feature extraction based on task-specific prompts, mitigating potential feature conflicts and enhancing transferability. Furthermore, CLASP employs a multi-task pre-training strategy, where part- and attribute-level pseudo-labels derived from CLIP guide the representation learning process. Extensive experiments across multiple benchmarks demonstrate that CLASP consistently outperforms existing unsupervised pre-training methods.
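The two mechanisms named in the abstract, similarity-based pseudo-labeling and prompt-controlled expert mixing, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the abstract does not specify how pseudo-labels are selected or how gates are computed, so the nearest-text-embedding rule, the dense softmax gate, the linear experts, and every name below (`pseudo_label`, `prompt_controlled_moe`, `gate_weights`) are assumptions made for illustration only.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, vec) for row in matrix]


def pseudo_label(feature, text_embeddings):
    """Assumed rule: label = index of the most similar text embedding."""
    sims = [cosine(feature, t) for t in text_embeddings]
    return max(range(len(sims)), key=sims.__getitem__)


def prompt_controlled_moe(feature, prompt, gate_weights, experts):
    """Assumed design: gates depend only on the task prompt, and each
    expert is a linear map applied to the visual feature."""
    gates = softmax(matvec(gate_weights, prompt))
    outputs = [matvec(w, feature) for w in experts]
    dim = len(outputs[0])
    mixed = [sum(g * out[i] for g, out in zip(gates, outputs)) for i in range(dim)]
    return mixed, gates


# Toy data, purely hypothetical.
feature = [1.0, 0.0]
text_embeddings = [[0.0, 1.0], [0.9, 0.1]]  # e.g. two candidate part or attribute prompts
label = pseudo_label(feature, text_embeddings)

identity = [[1.0, 0.0], [0.0, 1.0]]
doubling = [[2.0, 0.0], [0.0, 2.0]]
gate_weights = [[1.0, 0.0], [0.0, 1.0]]
mixed, gates = prompt_controlled_moe(feature, [0.0, 0.0], gate_weights, [identity, doubling])
```

With an all-zero prompt the gate is uniform, so the output is the plain average of the two experts; a different prompt shifts weight toward one expert without changing the visual feature, which is the sense in which the prompt "controls" the mixture in this sketch.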
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Person Re-Identification | MSMT17 | mAP | 0.788 | 404 |
| Pedestrian Attribute Recognition | PA-100K | mA | 87.91 | 79 |
| Person Re-Identification | Market1501 | mAP | 0.946 | 57 |
| Pose Estimation | COCO | mAP | 77.8 | 30 |
| Pedestrian Attribute Recognition | PETA existing vs. zero-shot (multiple) | mA | 77.86 | 23 |
| Pedestrian Detection | CityPersons highly occluded (HO) | Miss Rate | 37.4 | 16 |
| Attribute Recognition | RAP zero-shot | mA | 77.82 | 15 |
| Person Search | CUHK-SYSU | mAP | 96.1 | 15 |
| Person Search | PRW | mAP | 60.8 | 15 |
| Pedestrian Detection | CityPersons Reasonable | -- | -- | 9 |