
Narrating the Video: Boosting Text-Video Retrieval via Comprehensive Utilization of Frame-Level Captions

About

In recent text-video retrieval, using additional captions generated by vision-language models has shown promising gains in retrieval performance. However, existing models that use additional captions often struggle to capture the rich semantics inherent in the video, including temporal changes. In addition, incorrect information produced by generative models can lead to inaccurate retrieval. To address these issues, we propose a new framework, Narrating the Video (NarVid), which strategically leverages the comprehensive information available from frame-level captions, referred to as the narration. NarVid exploits the narration in multiple ways: 1) feature enhancement through cross-modal interactions between narration and video, 2) query-aware adaptive filtering to suppress irrelevant or incorrect information, 3) a dual-modal matching score obtained by adding query-video similarity and query-narration similarity, and 4) a hard-negative loss that learns discriminative features from multiple perspectives using the two similarities from different views. Experimental results demonstrate that NarVid achieves state-of-the-art performance on various benchmark datasets.
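As a rough illustration of item 3, the sketch below adds a query-video similarity and a query-narration similarity into one matching score and ranks candidates by it. The abstract only states that the two similarities are added; the use of cosine similarity, a single pooled embedding per modality, equal weighting, and every name in the code (cosine, dual_modal_score, rank_candidates) are assumptions made for illustration, not the authors' implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def dual_modal_score(query, video, narration):
    """Sum of query-video and query-narration similarity.

    Assumption: both terms are cosine similarities with equal weight.
    """
    return cosine(query, video) + cosine(query, narration)

def rank_candidates(query, candidates):
    """Return candidate indices sorted by descending dual-modal score.

    candidates is a list of (video_embedding, narration_embedding) pairs.
    """
    scores = [dual_modal_score(query, v, n) for v, n in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)

# Toy 2-D embeddings with hypothetical values.
query = [1.0, 0.0]
candidates = [
    ([0.0, 1.0], [0.0, 1.0]),  # video and narration both unrelated to the query
    ([1.0, 0.0], [1.0, 1.0]),  # video matches, narration partly matches
]
ranking = rank_candidates(query, candidates)
```

In this toy case the second candidate is ranked first because both of its similarities contribute to the score, which is the intuition behind combining the two views.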

Chan Hur, Jeong-hun Hong, Dong-hun Lee, Dabin Kang, Semin Myeong, Sang-hyo Park, Hyeyoung Park • 2025

Related benchmarks

Task                     Dataset         Result     Rank
Text-to-Video Retrieval  DiDeMo          R@1 0.534  459
Text-to-Video Retrieval  MSVD            R@1 53.1   264
Text-to-Video Retrieval  MSR-VTT (test)  R@1 52.7   255
Text-to-Video Retrieval  VATEX           R@1 68.4   130
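
The results above are reported as R@1 (Recall@1). This page does not define the metric, so the sketch below follows the usual retrieval convention: the share of text queries whose ground-truth video appears among the top K ranked candidates. It assumes one ground-truth video per query, stored at the same index as the query; the function name recall_at_k and the toy similarity values are hypothetical.

```python
def recall_at_k(similarity, k=1):
    """Percentage of queries whose ground-truth video is in the top k.

    similarity[i][j] is the score of query i against video j.
    Assumption: the ground-truth video for query i is video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Indices of the k highest-scoring videos for this query.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        if i in top_k:
            hits += 1
    return 100.0 * hits / len(similarity)

# Toy 3x3 similarity matrix with hypothetical values.
sim = [
    [0.9, 0.1, 0.3],  # query 0: correct video ranked first
    [0.2, 0.4, 0.8],  # query 1: correct video ranked second
    [0.1, 0.2, 0.7],  # query 2: correct video ranked first
]
r1 = recall_at_k(sim, k=1)
r2 = recall_at_k(sim, k=2)
```

Here two of the three queries retrieve their video at rank 1, and all three do so within the top 2.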