InterPartAbility: Phrase-Region Grounding for Interpretable Text-to-Image Person Re-Identification

About

Text-to-image person re-identification (TI-ReID) relies on natural-language text descriptions to retrieve top matching individuals from a gallery of reference images. While recent large vision-language models (VLMs) achieve strong retrieval performance, their decisions remain largely uninterpretable. Existing interpretability approaches in TI-ReID rely solely on slot-attention to highlight attended regions, but fail to reliably bind visual regions to semantically meaningful concepts, limiting interpretation to qualitative visualizations over a restricted vocabulary. This paper introduces InterPartAbility, an interpretable TI-ReID method that performs explicit part-wise matching and enables phrase-region grounding. Unlike parameter-heavy slot-attention methods that yield only qualitative interpretability, our open-vocabulary patch-phrase interaction module (PPIM) guides a standard TI-ReID model with concept-level phrases. Concept-based part phrases provide evidence that encourages the model to attend to the corresponding local image regions. InterPartAbility further leverages CLIP ViT self-attention to produce spatially concentrated patch activations aligned with each part-level phrase, yielding grounded explanation maps. Finally, a quantitative interpretability protocol for TI-ReID is introduced that extends current perturbation-based evaluation metrics to the TI-ReID domain. This includes a counterfactual region-removal test that measures retrieval degradation when top-ranked explanatory regions are removed. Empirical results on three challenging benchmarks show that InterPartAbility can achieve state-of-the-art (SOTA) interpretability performance under these metrics, while sustaining competitive retrieval accuracy.
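The counterfactual region-removal idea can be pictured with a small NumPy sketch. This is not the paper's protocol or model: it assumes each image is a set of patch embeddings, scores an image against the text by cosine similarity of the mean-pooled patches, and uses per-patch cosine similarity to a phrase as a stand-in explanation map. All function names here are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def phrase_patch_map(patch_emb, phrase_emb):
    """Toy explanation map: cosine similarity of each patch to one phrase, shape (P,)."""
    return l2_normalize(patch_emb) @ l2_normalize(phrase_emb)

def image_score(patch_emb, text_emb, keep=None):
    """Assumed scoring rule: cosine between the mean of the kept patches and the text."""
    if keep is None:
        keep = np.ones(len(patch_emb), dtype=bool)
    pooled = patch_emb[keep].mean(axis=0)
    return float(l2_normalize(pooled) @ l2_normalize(text_emb))

def rank_of_target(gallery, text_emb, target_idx, keep_target=None):
    """1-based rank of the target image; the mask applies to the target only."""
    scores = np.array([
        image_score(g, text_emb, keep_target if i == target_idx else None)
        for i, g in enumerate(gallery)
    ])
    return int((scores > scores[target_idx]).sum()) + 1

def region_removal_effect(gallery, text_emb, target_idx, saliency, k):
    """Drop the k most salient target patches; report score drop and rank change."""
    top = np.argsort(saliency)[::-1][:k]
    keep = np.ones(len(saliency), dtype=bool)
    keep[top] = False
    before = image_score(gallery[target_idx], text_emb)
    after = image_score(gallery[target_idx], text_emb, keep)
    rank_before = rank_of_target(gallery, text_emb, target_idx)
    rank_after = rank_of_target(gallery, text_emb, target_idx, keep)
    return before - after, rank_before, rank_after
```

Under these assumptions, a faithful explanation map should produce a large score drop and a worse rank when its top patches are removed, while removing low-saliency patches should change little.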

Shakeeb Murtaza, Aryan Shukla, Rajarshi Bhattacharya, Maguelonne Heritier, Eric Granger • 2026

Related benchmarks

Task	Dataset	Result
Text-to-Image Retrieval	CUHK-PEDES (test)	Recall@1 78.17	114
Text-to-image person retrieval	RSTPReid	Rank-1 Accuracy 70.9	80
Text-based Person Re-identification	RSTPReid	Rank-1 Accuracy 70.9	57
Text-to-image Person Re-identification	CUHK-PEDES	Rank-1 78.17	51
Text-based Person Re-identification	ICFG-PEDES	R@1 69.92	36
Text-to-Image Retrieval	ICFG-PEDES	R@1 69.92	8
Text-to-Image Retrieval	RSTPReid (test)	Delta R@1% 10.87	3
