# Learning Semantic-Aligned Feature Representation for Text-based Person Search

## About
Text-based person search aims to retrieve images of a target pedestrian given a textual description. The key challenge of this task is to eliminate the inter-modality gap and achieve feature alignment across modalities. In this paper, we propose a semantic-aligned embedding method for text-based person search, in which cross-modal feature alignment is achieved by automatically learning semantic-aligned visual and textual features. First, we introduce two Transformer-based backbones to encode robust feature representations of the images and texts. Second, we design a semantic-aligned feature aggregation network that adaptively selects and aggregates features with the same semantics into part-aware features, implemented as a multi-head attention module constrained by a cross-modality part alignment loss and a diversity loss. Experimental results on the CUHK-PEDES and Flickr30K datasets show that our method achieves state-of-the-art performance.
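To make the aggregation idea concrete, below is a minimal PyTorch sketch of how part-aware attention aggregation and the two constraints could look. The module and function names (`PartAwareAggregation`, `diversity_loss`, `part_alignment_loss`) and the exact loss formulations are assumptions for illustration, not the paper's published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartAwareAggregation(nn.Module):
    """Aggregate token-level features into K part-aware features via
    learned attention (one attention map per semantic part).
    Hypothetical design: one learnable query vector per part."""
    def __init__(self, dim: int, num_parts: int):
        super().__init__()
        self.part_queries = nn.Parameter(torch.randn(num_parts, dim) * 0.02)
        self.scale = dim ** -0.5

    def forward(self, tokens: torch.Tensor):
        # tokens: (B, N, D) -- patch embeddings (image) or word embeddings (text)
        attn = torch.einsum('kd,bnd->bkn', self.part_queries, tokens) * self.scale
        attn = attn.softmax(dim=-1)                         # (B, K, N)
        parts = torch.einsum('bkn,bnd->bkd', attn, tokens)  # (B, K, D)
        return parts, attn

def diversity_loss(attn: torch.Tensor) -> torch.Tensor:
    """Penalise overlap between the K attention maps so different parts
    attend to different tokens (one plausible formulation)."""
    a = F.normalize(attn, p=2, dim=-1)             # (B, K, N)
    gram = torch.einsum('bkn,bjn->bkj', a, a)      # (B, K, K)
    eye = torch.eye(gram.size(-1), device=gram.device)
    return ((gram - eye) ** 2).mean()

def part_alignment_loss(img_parts, txt_parts, temperature=0.07):
    """Contrastive alignment of corresponding part features across
    modalities for matched image-text pairs (a sketch, not necessarily
    the paper's exact objective)."""
    v = F.normalize(img_parts, dim=-1)             # (B, K, D)
    t = F.normalize(txt_parts, dim=-1)             # (B, K, D)
    B, K, _ = v.shape
    loss = 0.0
    for k in range(K):
        logits = v[:, k] @ t[:, k].T / temperature  # (B, B) similarity
        labels = torch.arange(B, device=logits.device)
        loss = loss + F.cross_entropy(logits, labels)
    return loss / K
```

In this sketch, the same aggregation module shape is applied to both modalities, so the k-th visual part and the k-th textual part can be aligned directly, while the diversity term keeps the K heads from collapsing onto the same tokens.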
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Image Retrieval | Flickr30k (test) | Recall@1 | 50.74 | 423 |
| Text-to-image Person Re-identification | CUHK-PEDES (test) | Rank-1 Accuracy (R-1) | 64.13 | 150 |
| Text-based Person Search | CUHK-PEDES (test) | Rank-1 | 64.13 | 142 |
| Text-based Person Search | ICFG-PEDES (test) | R@1 | 54.86 | 104 |
| Text-to-Image Retrieval | CUHK-PEDES (test) | Recall@1 | 64.13 | 96 |
| Text-based Person Search | RSTPReid (test) | R@1 | 44.05 | 85 |
| Text-to-image Person Re-identification | ICFG-PEDES (test) | Rank-1 | 54.86 | 81 |
| Text-based Person Re-identification | RSTPReid (test) | Rank-1 Acc | 44.05 | 52 |
| Person Search | CUHK-PEDES (test) | Recall@1 | 64.13 | 47 |