
Text-based Aerial-Ground Person Retrieval

About

This work introduces Text-based Aerial-Ground Person Retrieval (TAG-PR), which aims to retrieve person images from heterogeneous aerial and ground views using textual descriptions. Unlike traditional Text-based Person Retrieval (T-PR), which focuses solely on ground-view images, TAG-PR has greater practical significance and presents unique challenges due to the large viewpoint discrepancy across images. To support this task, we contribute: (1) the TAG-PEDES dataset, constructed from public benchmarks with automatically generated textual descriptions and enhanced by a diversified text generation paradigm to ensure robustness under view heterogeneity; and (2) TAG-CLIP, a novel retrieval framework that addresses view heterogeneity through a hierarchically routed mixture-of-experts module that learns view-specific and view-agnostic features, together with a viewpoint decoupling strategy that decouples view-specific features for better cross-modal alignment. We evaluate the effectiveness of TAG-CLIP on both the proposed TAG-PEDES dataset and existing T-PR benchmarks. The dataset and code are available at https://github.com/Flame-Chasers/TAG-PR.

Xinyu Zhou, Yu Wu, Jiayao Ma, Wenhao Wang, Min Cao, Mang Ye • 2025

Related benchmarks

Task                          Dataset                                  Result          Rank
Cross-modal Geo-localization  CVG-Text (New York)                      R@1 31.17       29
Cross-modal Geo-localization  CVG-Text Tokyo                           Recall@1 24.33  15
Cross-modal Geo-localization  CVG-Text (Brisbane)                      Recall@1 29.83  15
Cross-modal Geo-localization  CORE World-level 1.0 (All)               R@1 38.46       15
Cross-modal Geo-localization  CORE Intercontinental-level Subset1 1.0  R@1 38.15       15
Cross-modal Geo-localization  CORE Intercontinental-level Subset2 1.0  R@1 42.35       15
Cross-modal Geo-localization  CORE Intercontinental-level Subset3 1.0  R@1 37.32       15
Cross-modal Geo-localization  CORE Intercontinental-level Subset4 1.0  R@1 36.46       15
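
The benchmark results above are reported as R@1 (Recall@1), which this page does not define. As a rough illustration only, the sketch below assumes the common retrieval definition: the percentage of text queries for which at least one correctly labelled gallery image appears among the top-k ranked results. The function name `recall_at_k`, its parameters, and the toy similarity scores are all hypothetical; this is not the evaluation script used by TAG-PR or by the listed benchmarks.

```python
from typing import Sequence


def recall_at_k(
    similarity: Sequence[Sequence[float]],
    query_ids: Sequence[int],
    gallery_ids: Sequence[int],
    k: int = 1,
) -> float:
    """Percentage of queries with at least one correct match in the top-k gallery items.

    similarity[i][j] is the (assumed) text-to-image score between query i and
    gallery item j; higher means more similar.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if len(query_ids) == 0:
        raise ValueError("at least one query is required")
    hits = 0
    for scores, query_id in zip(similarity, query_ids):
        # Rank gallery indices by descending similarity to this query.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        # The query counts as a hit if any top-k item shares its label.
        if any(gallery_ids[j] == query_id for j in ranked[:k]):
            hits += 1
    return 100.0 * hits / len(query_ids)


# Toy data (made up): 3 text queries scored against 4 gallery images.
similarity = [
    [0.9, 0.1, 0.3, 0.2],  # top-1 is gallery 0 (label 7): hit
    [0.2, 0.4, 0.8, 0.1],  # top-1 is gallery 2 (label 9), query label is 8: miss at k=1
    [0.1, 0.2, 0.3, 0.7],  # top-1 is gallery 3 (label 9): hit
]
query_ids = [7, 8, 9]
gallery_ids = [7, 8, 9, 9]

r1 = recall_at_k(similarity, query_ids, gallery_ids, k=1)
r2 = recall_at_k(similarity, query_ids, gallery_ids, k=2)
print(f"R@1 = {r1:.2f}, R@2 = {r2:.2f}")
```

In this toy case two of the three queries retrieve a correct image at rank 1, and the remaining query finds its match at rank 2, so widening k from 1 to 2 raises the score. Individual benchmarks may define a correct match differently (for example by identity label or by geographic location), so the labels here are only a stand-in.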