
Human-centered Interactive Learning via MLLMs for Text-to-Image Person Re-identification

About

Despite remarkable advancements in text-to-image person re-identification (TIReID) facilitated by the breakthrough of cross-modal embedding models, existing methods often struggle to distinguish challenging candidate images due to intrinsic limitations, such as network architecture and data quality. To address these issues, we propose an Interactive Cross-modal Learning framework (ICL), which leverages human-centered interaction to enhance the discriminability of text queries through external multimodal knowledge. To achieve this, we propose a plug-and-play Test-time Human-centered Interaction (THI) module, which performs visual question answering focused on human characteristics, facilitating multi-round interactions with a multimodal large language model (MLLM) to align query intent with latent target images. Specifically, THI refines user queries based on the MLLM responses to reduce the gap to the best-matching images, thereby boosting ranking accuracy. Additionally, to address the limitation of low-quality training texts, we introduce a novel Reorganization Data Augmentation (RDA) strategy based on information enrichment and diversity enhancement, which improves query discriminability by enriching, decomposing, and reorganizing person descriptions. Extensive experiments on four TIReID benchmarks, i.e., CUHK-PEDES, ICFG-PEDES, RSTPReid, and UFine6926, demonstrate that our method achieves remarkable performance with substantial improvements.

Yang Qin, Chao Chen, Zhihang Fu, Dezhong Peng, Xi Peng, Peng Hu • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-image Person Re-identification | CUHK-PEDES (test) | Rank-1 Accuracy (R-1) | 77.91 | 150 |
| Text-based Person Search | CUHK-PEDES (test) | Rank-1 | 76.41 | 142 |
| Text-based Person Search | ICFG-PEDES (test) | R@1 | 68.11 | 104 |
| Text-based Person Search | RSTPReid (test) | R@1 | 67.7 | 85 |
| Text-to-image Person Re-identification | ICFG-PEDES (test) | Rank-1 | 0.6902 | 81 |
| Text-based Person Re-identification | RSTPReid (test) | Rank-1 Acc | 70.55 | 52 |
| Text-to-image Person Re-identification | UFine6926 1.0 (test) | Rank-1 | 91.78 | 18 |
| Text-based Person Re-identification | CUHK-PEDES coarse-grained (test) | Rank-1 Accuracy | 79.06 | 15 |
| Text-based Person Re-identification | ICFG-PEDES coarse-grained (test) | Rank-1 Accuracy | 0.7005 | 15 |
| Text-based Person Re-identification | RSTPReid coarse-grained (test) | Rank-1 Accuracy | 72.55 | 15 |
Showing 10 of 16 rows
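
For readers unfamiliar with the metric, the results above are all Rank-1 (R@1) accuracy: the share of text queries whose single highest-scoring gallery image shows the correct person. The sketch below is a minimal, generic illustration of Rank-k accuracy, not the authors' evaluation code; the function name `rank_k_accuracy` and the toy similarity scores and identity labels are assumptions made up for this example. It also prints the score both as a fraction and as a percentage, since most results above are on a 0-100 scale while the two values 0.6902 and 0.7005 appear to be reported as fractions.

```python
def rank_k_accuracy(similarity, query_ids, gallery_ids, k=1):
    """Fraction of queries whose top-k gallery images contain the query identity.

    similarity[i][j] is the score between text query i and gallery image j
    (higher means more similar). All names here are hypothetical.
    """
    hits = 0
    for scores, qid in zip(similarity, query_ids):
        # Gallery indices sorted from most to least similar for this query.
        order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        # Count a hit if any of the top-k images carries the query's identity.
        if any(gallery_ids[j] == qid for j in order[:k]):
            hits += 1
    return hits / len(query_ids)


# Toy data (made up): 3 text queries scored against 4 gallery images.
similarity = [
    [0.9, 0.1, 0.3, 0.2],  # query 0 (id 7): top image is index 0 (id 7)
    [0.2, 0.4, 0.8, 0.1],  # query 1 (id 8): top image is index 2 (id 9)
    [0.1, 0.2, 0.3, 0.7],  # query 2 (id 9): top image is index 3 (id 9)
]
query_ids = [7, 8, 9]
gallery_ids = [7, 8, 9, 9]

r1 = rank_k_accuracy(similarity, query_ids, gallery_ids, k=1)
r2 = rank_k_accuracy(similarity, query_ids, gallery_ids, k=2)
print(f"Rank-1 as fraction: {r1:.4f}, as percentage: {100 * r1:.2f}")
print(f"Rank-2 as fraction: {r2:.4f}, as percentage: {100 * r2:.2f}")
```

In this toy case two of the three queries retrieve the right identity first, and the remaining query finds it at position two, so Rank-2 is higher than Rank-1. Published benchmarks may apply additional conventions (for example, how ties are handled), which this sketch does not attempt to reproduce.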
