# Learning Semantic-Aligned Feature Representation for Text-based Person Search

## About
Text-based person search aims to retrieve images of a target pedestrian given a textual description. The key challenge of this task is to eliminate the inter-modality gap and achieve feature alignment across modalities. In this paper, we propose a semantic-aligned embedding method for text-based person search, in which cross-modal feature alignment is achieved by automatically learning semantic-aligned visual and textual features. First, we introduce two Transformer-based backbones to encode robust feature representations of the images and texts. Second, we design a semantic-aligned feature aggregation network that adaptively selects and aggregates features with the same semantics into part-aware features, implemented as a multi-head attention module constrained by a cross-modality part alignment loss and a diversity loss. Experimental results on the CUHK-PEDES and Flickr30K datasets show that our method achieves state-of-the-art performance.
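To make the aggregation idea concrete, below is a minimal PyTorch sketch of how part-aware attention aggregation and the two constraints could look. The module and function names (`PartAwareAggregation`, `diversity_loss`, `part_alignment_loss`) and the exact loss formulations are assumptions for illustration, not the paper's published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartAwareAggregation(nn.Module):
    """Aggregate token-level features into K part-aware features via
    learned attention (one attention map per semantic part).
    Hypothetical design: one learnable query vector per part."""
    def __init__(self, dim: int, num_parts: int):
        super().__init__()
        self.part_queries = nn.Parameter(torch.randn(num_parts, dim) * 0.02)
        self.scale = dim ** -0.5

    def forward(self, tokens: torch.Tensor):
        # tokens: (B, N, D) -- patch embeddings (image) or word embeddings (text)
        attn = torch.einsum('kd,bnd->bkn', self.part_queries, tokens) * self.scale
        attn = attn.softmax(dim=-1)                         # (B, K, N)
        parts = torch.einsum('bkn,bnd->bkd', attn, tokens)  # (B, K, D)
        return parts, attn

def diversity_loss(attn: torch.Tensor) -> torch.Tensor:
    """Penalise overlap between the K attention maps so different parts
    attend to different tokens (one plausible formulation)."""
    a = F.normalize(attn, p=2, dim=-1)             # (B, K, N)
    gram = torch.einsum('bkn,bjn->bkj', a, a)      # (B, K, K)
    eye = torch.eye(gram.size(-1), device=gram.device)
    return ((gram - eye) ** 2).mean()

def part_alignment_loss(img_parts, txt_parts, temperature=0.07):
    """Contrastive alignment of corresponding part features across
    modalities for matched image-text pairs (a sketch, not necessarily
    the paper's exact objective)."""
    v = F.normalize(img_parts, dim=-1)             # (B, K, D)
    t = F.normalize(txt_parts, dim=-1)             # (B, K, D)
    B, K, _ = v.shape
    loss = 0.0
    for k in range(K):
        logits = v[:, k] @ t[:, k].T / temperature  # (B, B) similarity
        labels = torch.arange(B, device=logits.device)
        loss = loss + F.cross_entropy(logits, labels)
    return loss / K
```

In this sketch, the same aggregation module shape is applied to both modalities, so the k-th visual part and the k-th textual part can be aligned directly, while the diversity term keeps the K heads from collapsing onto the same tokens.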
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Image Retrieval | Flickr30k (test) | Recall@1 | 50.74 | 423 |
| Text-to-image Person Re-identification | CUHK-PEDES (test) | Rank-1 Accuracy (R-1) | 64.13 | 150 |
| Text-based Person Search | CUHK-PEDES (test) | Rank-1 | 64.13 | 142 |
| Text-based Person Search | ICFG-PEDES (test) | R@1 | 54.86 | 104 |
| Text-to-Image Retrieval | CUHK-PEDES (test) | Recall@1 | 64.13 | 96 |
| Text-based Person Search | RSTPReid (test) | R@1 | 44.05 | 85 |
| Text-to-image Person Re-identification | ICFG-PEDES (test) | Rank-1 | 54.86 | 81 |
| Text-based Person Re-identification | RSTPReid (test) | Rank-1 Acc | 44.05 | 52 |
| Person Search | CUHK-PEDES (test) | Recall@1 | 64.13 | 47 |