UATVR: Uncertainty-Adaptive Text-Video Retrieval
About
With the explosive growth of web videos and the emergence of large-scale vision-language pre-training models, e.g., CLIP, retrieving videos of interest with text instructions has attracted increasing attention. A common practice is to map text-video pairs into the same embedding space and craft cross-modal interactions with certain entities at specific granularities for semantic correspondence. Unfortunately, the intrinsic uncertainty of which entity combinations at which granularities best serve a cross-modal query is understudied, which is especially critical for modalities with hierarchical semantics, e.g., video and text. In this paper, we propose an Uncertainty-Adaptive Text-Video Retrieval approach, termed UATVR, which models each look-up as a distribution matching procedure. Concretely, we add additional learnable tokens to the encoders to adaptively aggregate multi-grained semantics for flexible high-level reasoning. In the refined embedding space, we represent text-video pairs as probabilistic distributions from which prototypes are sampled for matching evaluation. Comprehensive experiments on four benchmarks justify the superiority of our UATVR, which achieves new state-of-the-art results on MSR-VTT (50.8%), VATEX (64.5%), MSVD (49.7%), and DiDeMo (45.8%). The code is available at https://github.com/bofang98/UATVR.
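The distribution-matching idea can be illustrated with a minimal sketch: each modality's embedding is treated as a diagonal Gaussian, stochastic prototypes are sampled from each distribution, and a pair is scored by aggregating cross-modal prototype similarities. This is not the paper's implementation (UATVR operates on learned CLIP-based encoders with learnable tokens); all function names, the prototype count `k`, and the max-then-mean aggregation here are illustrative assumptions.

```python
import math
import random


def sample_prototypes(mean, std, k, rng):
    # Draw k prototypes from a diagonal Gaussian N(mean, diag(std^2)).
    # mean/std are plain lists of floats (one entry per embedding dim).
    return [[m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]
            for _ in range(k)]


def cosine(u, v):
    # Standard cosine similarity between two vectors.
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den


def distribution_match(text_mean, text_std, video_mean, video_std, k=8, seed=0):
    # Score a text-video pair as a distribution match: sample k prototypes
    # per modality, take the best video match for each text prototype,
    # and average (an illustrative aggregation, not the paper's exact one).
    rng = random.Random(seed)
    t_protos = sample_prototypes(text_mean, text_std, k, rng)
    v_protos = sample_prototypes(video_mean, video_std, k, rng)
    sims = [max(cosine(t, v) for v in v_protos) for t in t_protos]
    return sum(sims) / len(sims)
```

With small standard deviations the score reduces to an ordinary point-embedding similarity; larger standard deviations let ambiguous queries match a broader region of the embedding space.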
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Video Retrieval | DiDeMo | R@1 | 45.8 | 360 |
| Text-to-Video Retrieval | MSR-VTT (test) | R@1 | 50.8 | 234 |
| Text-to-Video Retrieval | MSVD | R@1 | 49.7 | 218 |
| Text-to-Video Retrieval | MSVD (test) | R@1 | 46.0 | 204 |
| Text-to-Video Retrieval | MSRVTT | R@1 | 47.5 | 98 |
| Text-to-Video Retrieval | VATEX | R@1 | 64.5 | 95 |
| Video-to-Text Retrieval | MSRVTT (test) | R@1 | 48.1 | 15 |
| Text-to-Video Retrieval | MSRVTT Retrieval | R@1 | 50.8 | 10 |