See Finer, See More: Implicit Modality Alignment for Text-based Person Retrieval
About
Text-based person retrieval aims to find the query person based on a textual description. The key is to learn a common latent space mapping between visual-textual modalities. To achieve this goal, existing works employ segmentation to obtain explicitly cross-modal alignments or utilize attention to explore salient alignments. These methods have two shortcomings: 1) Labeling cross-modal alignments are time-consuming. 2) Attention methods can explore salient cross-modal alignments but may ignore some subtle and valuable pairs. To relieve these issues, we introduce an Implicit Visual-Textual (IVT) framework for text-based person retrieval. Different from previous models, IVT utilizes a single network to learn representation for both modalities, which contributes to the visual-textual interaction. To explore the fine-grained alignment, we further propose two implicit semantic alignment paradigms: multi-level alignment (MLA) and bidirectional mask modeling (BMM). The MLA module explores finer matching at sentence, phrase, and word levels, while the BMM module aims to mine \textbf{more} semantic alignments between visual and textual modalities. Extensive experiments are carried out to evaluate the proposed IVT on public datasets, i.e., CUHK-PEDES, RSTPReID, and ICFG-PEDES. Even without explicit body part alignment, our approach still achieves state-of-the-art performance. Code is available at: https://github.com/TencentYoutuResearch/PersonRetrieval-IVT.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Text-to-image Person Re-identification | CUHK-PEDES (test) | Rank-1 Accuracy (R-1)65.59 | 150 | |
| Text-based Person Search | CUHK-PEDES (test) | Rank-165.59 | 142 | |
| Text-based Person Search | ICFG-PEDES (test) | R@156.04 | 104 | |
| Text-to-Image Retrieval | CUHK-PEDES (test) | Recall@165.59 | 96 | |
| Text-based Person Search | RSTPReid (test) | R@146.7 | 85 | |
| Text-to-image Person Re-identification | ICFG-PEDES (test) | Rank-10.5604 | 81 | |
| Text-based Person Search | CUHK-PEDES | Recall@166.1 | 61 | |
| Text-based Person Re-identification | RSTPReid (test) | Rank-1 Acc46.7 | 52 | |
| Text-to-image Person Re-identification | CUHK-PEDES | Rank-165.59 | 34 | |
| Text-based Person Retrieval | ICFG-PEDES | R@156.04 | 32 |