Robust Test-time Video-Text Retrieval: Benchmarking and Adapting for Query Shifts

About

Modern video-text retrieval (VTR) models excel on in-distribution benchmarks but are highly vulnerable to real-world query shifts, where the distribution of query data deviates from the training domain, leading to a sharp performance drop. Existing image-focused robustness solutions are inadequate to handle this vulnerability in video, as they fail to address the complex spatio-temporal dynamics inherent in these shifts. To systematically evaluate this vulnerability, we first introduce a comprehensive benchmark featuring 12 distinct types of video perturbations across five severity levels. Analysis on this benchmark reveals that query shifts amplify the hubness phenomenon, where a few gallery items become dominant "hubs" that attract a disproportionate number of queries. To mitigate this, we then propose HAT-VTR (Hubness Alleviation for Test-time Video-Text Retrieval), a baseline test-time adaptation framework designed to directly counteract hubness in VTR. It leverages two key components: a Hubness Suppression Memory to refine similarity scores, and multi-granular losses to enforce temporal feature consistency. Extensive experiments demonstrate that HAT-VTR substantially improves robustness, consistently outperforming prior methods across diverse query shift scenarios and enhancing model reliability for real-world applications.
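To make the hubness phenomenon concrete, the sketch below counts how often each gallery item lands in the top-k results of any query (its k-occurrence). This is a generic diagnostic and not the HAT-VTR method itself; the function name `k_occurrence` and the toy similarity scores are assumptions made purely for illustration.

```python
from collections import Counter


def k_occurrence(sim, k=1):
    """Count how often each gallery item appears in the top-k of any query.

    sim[i][j] is the similarity between query i and gallery item j.
    """
    n_gallery = len(sim[0])
    counts = Counter()
    for row in sim:
        # Indices of the k most similar gallery items for this query.
        top_k = sorted(range(n_gallery), key=lambda j: row[j], reverse=True)[:k]
        counts.update(top_k)
    return [counts.get(j, 0) for j in range(n_gallery)]


# Hypothetical scores: gallery item 0 scores highly for almost every query.
sim = [
    [0.9, 0.8, 0.1],
    [0.7, 0.2, 0.6],
    [0.8, 0.3, 0.5],
    [0.6, 0.1, 0.7],
]
hub_counts = k_occurrence(sim, k=1)
print(hub_counts)
```

A strongly skewed count distribution, with one or two items collecting most of the top-k slots, is the signature of a hub.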

Bingqing Zhang, Zhuo Cao, Heming Du, Yang Li, Xue Li, Jiajun Liu, Sen Wang • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| Text-to-Video Retrieval | MSVD | R@1 57.46 | 290 |
| Text-to-Video Retrieval | ActivityNet | R@1 29.18 | 245 |
| Text-to-Video Retrieval | LSMDC (test) | R@5 29.33 | 245 |
| Text-to-Video Retrieval | MSVD (test) | R@1 52.09 | 211 |
| Text-to-Video Retrieval | LSMDC | R@1 17.32 | 181 |
| Text-to-Video Retrieval | MSRVTT | R@1 38.7 | 144 |
| Video-to-Text Retrieval | ActivityNet | R@1 0.3169 | 136 |
| Video-to-Text Retrieval | MSVD | R@1 57.91 | 119 |
| Video-to-Text Retrieval | LSMDC | R@1 18.02 | 92 |
| Video-to-Text Retrieval | MSVD (test) | R@1 53.28 | 68 |
Showing 10 of 30 rows
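
The results above are Recall@K (R@K) scores. As a reminder of what the metric measures, here is a minimal sketch that assumes exactly one ground-truth gallery item per query and reports a percentage; the helper name `recall_at_k` and the toy data are hypothetical, and individual benchmarks may follow a different protocol (for example, several captions per video).

```python
def recall_at_k(sim, gt, k):
    """Percentage of queries whose ground-truth item is ranked in the top k.

    sim[i][j] is the similarity between query i and gallery item j;
    gt[i] is the index of the single correct gallery item for query i.
    """
    hits = 0
    for row, target in zip(sim, gt):
        # Zero-based rank: number of gallery items scoring strictly higher.
        rank = sum(1 for score in row if score > row[target])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(gt)


# Hypothetical scores for four queries over a three-item gallery.
sim = [
    [0.9, 0.8, 0.1],
    [0.7, 0.2, 0.6],
    [0.8, 0.3, 0.5],
    [0.6, 0.1, 0.7],
]
gt = [0, 2, 2, 2]
r1 = recall_at_k(sim, gt, 1)
r2 = recall_at_k(sim, gt, 2)
print(r1, r2)
```

In this toy case two queries miss at K=1 because gallery item 0 outranks their correct match, which is the kind of error a hub produces.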
