VLM-PAR: A Vision Language Model for Pedestrian Attribute Recognition
About
Pedestrian Attribute Recognition (PAR) involves predicting fine-grained attributes such as clothing color, gender, and accessories from pedestrian imagery, yet it is hindered by severe class imbalance, intricate attribute co-dependencies, and domain shifts. We introduce VLM-PAR, a modular vision-language framework built on frozen SigLIP 2 multilingual encoders. VLM-PAR aligns image and attribute-prompt embeddings and refines the visual features through a compact cross-attention fusion module. This yields a significant accuracy improvement on the highly imbalanced PA-100K benchmark, establishing new state-of-the-art performance, along with substantial gains in mean accuracy (mA) on the PETA and Market-1501 benchmarks. These results underscore the efficacy of combining large-scale vision-language pretraining with targeted cross-modal refinement to overcome the imbalance and generalization challenges in PAR.
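
The sketch below illustrates the kind of compact cross-attention fusion head described above: per-attribute prompt embeddings act as queries over frozen visual token embeddings, and each refined query produces one attribute logit. It is a minimal, hypothetical illustration, not the released implementation; the module name, embedding dimension, head count, and attribute count are assumptions, and random tensors stand in for the frozen SigLIP 2 encoder outputs.

```python
import torch
import torch.nn as nn


class CrossAttentionFusionHead(nn.Module):
    """Hypothetical compact fusion head: attribute prompt embeddings attend to
    frozen visual tokens, then each refined query yields one attribute logit."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_attn = nn.LayerNorm(dim)
        self.norm_ffn = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.classifier = nn.Linear(dim, 1)  # one logit per attribute query

    def forward(self, prompt_emb: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # prompt_emb: (B, num_attributes, dim) text embeddings of attribute prompts
        # visual_tokens: (B, num_patches, dim) patch embeddings from the frozen image encoder
        attended, _ = self.cross_attn(query=prompt_emb, key=visual_tokens, value=visual_tokens)
        x = self.norm_attn(prompt_emb + attended)
        x = self.norm_ffn(x + self.ffn(x))
        return self.classifier(x).squeeze(-1)  # (B, num_attributes) attribute logits


# Stand-ins for frozen SigLIP 2 outputs (a real pipeline would encode the image
# and the per-attribute text prompts with the frozen multilingual encoders).
batch, num_attrs, num_patches, dim = 4, 26, 196, 768
visual_tokens = torch.randn(batch, num_patches, dim)
prompt_emb = torch.randn(batch, num_attrs, dim)

head = CrossAttentionFusionHead(dim)
logits = head(prompt_emb, visual_tokens)            # (4, 26)
probs = torch.sigmoid(logits)                       # per-attribute probabilities
targets = torch.randint(0, 2, (batch, num_attrs)).float()
loss = nn.functional.binary_cross_entropy_with_logits(logits, targets)
```

Training such a head while keeping both encoders frozen is one plausible way to read the "modular" framing: only the fusion and classification parameters would be updated, which keeps the adaptation lightweight.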
Related benchmarks
| Task | Dataset | Result (mA) | Rank |
|---|---|---|---|
| Pedestrian Attribute Recognition | PA-100K | 92.88 | 79 |
| Pedestrian Attribute Recognition | PETA | 93.52 | 39 |
| Pedestrian Attribute Recognition | Market-1501 | 85.38 | 3 |