
Learning Semantic-Aligned Feature Representation for Text-based Person Search

About

Text-based person search aims to retrieve images of a target pedestrian from a textual description. The key challenge of this task is to bridge the inter-modality gap and achieve feature alignment across modalities. In this paper, we propose a semantic-aligned embedding method for text-based person search, in which cross-modal feature alignment is achieved by automatically learning semantic-aligned visual and textual features. First, we introduce two Transformer-based backbones to encode robust feature representations of the images and texts. Second, we design a semantic-aligned feature aggregation network that adaptively selects and aggregates features with the same semantics into part-aware features; this is realized by a multi-head attention module constrained by a cross-modality part alignment loss and a diversity loss. Experimental results on the CUHK-PEDES and Flickr30K datasets show that our method achieves state-of-the-art performance.
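The aggregation step described above can be sketched in code. The following is a minimal numpy illustration, not the authors' implementation: a set of learnable part queries attends over backbone token features to produce part-aware features, a diversity term penalizes overlap between the attention maps of different parts, and an alignment term pulls matched visual and textual part features together. All function names, shapes, and loss forms here are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_parts(tokens, part_queries):
    """Attention-based aggregation of token features into part-aware features.

    tokens:       (N, D) token features from a Transformer backbone
    part_queries: (K, D) learnable part queries (hypothetical parameterization)
    returns:      (K, D) part-aware features and the (K, N) attention maps
    """
    d = tokens.shape[1]
    attn = softmax(part_queries @ tokens.T / np.sqrt(d))  # (K, N), rows sum to 1
    return attn @ tokens, attn

def diversity_loss(attn):
    # Penalize overlap between attention maps of different parts:
    # mean squared off-diagonal cosine similarity of the rows of attn.
    a = attn / (np.linalg.norm(attn, axis=1, keepdims=True) + 1e-8)
    g = a @ a.T                       # (K, K) pairwise similarities
    off = g - np.diag(np.diag(g))     # zero out the diagonal
    k = g.shape[0]
    return float((off ** 2).sum() / (k * (k - 1)))

def part_alignment_loss(vis_parts, txt_parts):
    # Cross-modality part alignment (sketch): mean cosine distance between
    # the k-th visual part feature and the k-th textual part feature.
    v = vis_parts / (np.linalg.norm(vis_parts, axis=1, keepdims=True) + 1e-8)
    t = txt_parts / (np.linalg.norm(txt_parts, axis=1, keepdims=True) + 1e-8)
    return float(np.mean(1.0 - np.sum(v * t, axis=1)))
```

In this sketch the same aggregation module is applied to both modalities, so the k-th visual and k-th textual part feature are encouraged to carry the same semantics; the diversity loss keeps different parts from collapsing onto the same tokens.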

Shiping Li, Min Cao, Min Zhang · 2021

Related benchmarks

Task | Dataset | Metric | Result | Rank
Text-to-Image Retrieval | Flickr30k (test) | Recall@1 | 50.74 | 423
Text-to-image Person Re-identification | CUHK-PEDES (test) | Rank-1 Accuracy (R-1) | 64.13 | 150
Text-based Person Search | CUHK-PEDES (test) | Rank-1 | 64.13 | 142
Text-based Person Search | ICFG-PEDES (test) | R@1 | 54.86 | 104
Text-to-Image Retrieval | CUHK-PEDES (test) | Recall@1 | 64.13 | 96
Text-based Person Search | RSTPReid (test) | R@1 | 44.05 | 85
Text-to-image Person Re-identification | ICFG-PEDES (test) | Rank-1 | 54.86 | 81
Text-based Person Re-identification | RSTPReid (test) | Rank-1 Acc | 44.05 | 52
Person Search | CUHK-PEDES (test) | Recall@1 | 64.13 | 47
