Asymmetric Cross-Scale Alignment for Text-Based Person Search
About
Text-based person search (TBPS) is of significant importance in intelligent surveillance; it aims to retrieve pedestrian images with high semantic relevance to a given text description. This retrieval task is characterized by both modal heterogeneity and fine-grained matching. To implement this task, one needs to extract multi-scale features from both image and text domains, and then perform cross-modal alignment. However, most existing approaches only consider alignment confined to individual scales, e.g., an image-sentence or a region-phrase scale. Such a strategy presumes alignment during feature extraction while overlooking cross-scale alignment, e.g., image-phrase. In this paper, we present a transformer-based model to extract multi-scale representations, and perform Asymmetric Cross-Scale Alignment (ACSA) to precisely align the two modalities. Specifically, ACSA consists of a global-level alignment module and an asymmetric cross-attention module: the former aligns an image and texts on a global scale, and the latter applies the cross-attention mechanism to dynamically align cross-modal entities at region/image-phrase scales. Extensive experiments on two benchmark datasets, CUHK-PEDES and RSTPReid, demonstrate the effectiveness of our approach. Code is available at https://github.com/mul-hjh/ACSA.
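For intuition, the cross-attention step described above can be sketched as standard scaled dot-product attention, where entities from one modality (e.g., text phrases) act as queries over entities from the other (e.g., image regions). This is a minimal numpy illustration, not the paper's actual implementation; all shapes and names are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Scaled dot-product cross-attention.

    queries:     (n_q, d)  e.g. phrase embeddings from the text branch
    keys_values: (n_kv, d) e.g. region embeddings from the image branch
    Returns (n_q, d): each query's attention-weighted mix of the
    other modality's entities, i.e. a dynamically aligned feature.
    """
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # (n_q, n_kv) similarities
    weights = softmax(scores, axis=-1)              # rows sum to 1
    return weights @ keys_values                    # aligned features

# Toy usage: 2 phrases attending over 3 image regions (d = 4).
phrases = np.random.randn(2, 4)
regions = np.random.randn(3, 4)
aligned = cross_attention(phrases, regions)         # shape (2, 4)
```

Because the queries come from one modality and the keys/values from the other (and the two directions need not be symmetric), this matches the "asymmetric" flavor of alignment the abstract refers to.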
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-based Person Search | CUHK-PEDES (test) | Rank-1 | 63.56 | 142 |
| Text-to-Image Retrieval | CUHK-PEDES (test) | Recall@1 | 68.67 | 96 |
| Text-based Person Search | RSTPReid (test) | R@1 | 48.4 | 85 |
| Text to Image | CUHK-PEDES | Rank-1 | 63.56 | 28 |
| Text-to-image person retrieval | RSTPReid (test) | Rank-1 Accuracy | 48.4 | 17 |