Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment

About

Text-to-video retrieval systems have recently made significant progress by utilizing pre-trained models trained on large-scale image-text pairs. However, most recent methods focus primarily on the video modality and disregard the audio signal. ECLIPSE, a recent exception, improved long-range text-to-video retrieval by developing an audiovisual video representation. The objective of text-to-video retrieval, however, is to capture the complementary audio and video information that is pertinent to the text query, rather than simply to achieve better audio and video alignment. To address this issue, we introduce TEFAL, a TExt-conditioned Feature ALignment method that produces both audio and video representations conditioned on the text query. Instead of using only an audiovisual attention block, which could suppress the audio information relevant to the text query, our approach employs two independent cross-modal attention blocks that enable the text to attend to the audio and video representations separately. We demonstrate the method's efficacy on four benchmark datasets that include audio: MSR-VTT, LSMDC, VATEX, and Charades, where it consistently outperforms the state of the art. We attribute this to the additional text-query-conditioned audio representation and the complementary information it adds to the text-query-conditioned video representation.
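The idea of letting the text query attend to audio and video separately can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes single-head dot-product attention with no learned projections, and it fuses the two text-conditioned vectors by simple addition, which is an assumption made only for illustration. All function names here are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def text_conditioned_pool(text, features):
    """Pool a sequence of feature vectors using the text embedding as the query.

    Returns the pooled vector and the attention weights.
    """
    scale = math.sqrt(len(text))
    weights = softmax([dot(text, f) / scale for f in features])
    pooled = [
        sum(w * f[i] for w, f in zip(weights, features))
        for i in range(len(text))
    ]
    return pooled, weights


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def score(text, video_frames, audio_segments):
    """Similarity between a text query and one video with its audio track."""
    # Two independent attention passes: the text attends to each modality alone.
    video_vec, _ = text_conditioned_pool(text, video_frames)
    audio_vec, _ = text_conditioned_pool(text, audio_segments)
    # Assumption: fuse by element-wise addition (chosen for illustration only).
    fused = [v + a for v, a in zip(video_vec, audio_vec)]
    return cosine(text, fused)


# Toy 2-dimensional example with made-up values.
text = [1.0, 0.0]
video_frames = [[1.0, 0.0], [0.0, 1.0]]
audio_segments = [[0.0, 1.0], [1.0, 0.0]]
print(score(text, video_frames, audio_segments))
```

Because each modality is pooled in its own pass, a strong video match cannot down-weight an audio segment that is relevant to the query, which is the concern the abstract raises about a single audiovisual attention block.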

Sarah Ibrahimi, Xiaohang Sun, Pichao Wang, Amanmeet Garg, Ashutosh Sanan, Mohamed Omar • 2023

Related benchmarks

Task                       Dataset           Result            Rank
Text-to-Video Retrieval    MSR-VTT           Recall@1: 52      313
Text-to-Video Retrieval    MSR-VTT (test)    R@1: 49.9         234
Text-to-Video Retrieval    LSMDC (test)      R@1: 26.8         225
Video-to-Text retrieval    MSR-VTT           Recall@1: 61      157
Text-to-Video Retrieval    LSMDC             R@1: 26.8         154
Text-to-Video Retrieval    MSRVTT            R@1: 49.4         98
Text-to-Video Retrieval    VATEX             R@1: 61           95
Text-to-Video Retrieval    VATEX (test)      R@1: 61           62
Text-to-Video Retrieval    MSR-VTT 9K        R@1: 52           55
Video-to-Text retrieval    MSR-VTT 9K        R@1: 47.1         43
Showing 10 of 13 rows
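
For readers unfamiliar with the R@1 (Recall@1) metric reported above, the following is a minimal sketch of how Recall@K is commonly computed from a query-by-video similarity matrix. It assumes each query has exactly one matching video and that query i matches video i; the function name and the toy scores are hypothetical and not taken from the paper.

```python
def recall_at_k(similarity, k=1):
    """Percentage of queries whose ground-truth item is ranked in the top k.

    similarity[i][j] is the score of query i against candidate j.
    Assumption: the ground-truth candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy scores for three text queries against three videos.
similarity = [
    [0.9, 0.1, 0.2],  # query 0: correct video ranked first
    [0.1, 0.5, 0.8],  # query 1: correct video ranked second
    [0.1, 0.4, 0.7],  # query 2: correct video ranked first
]
print(recall_at_k(similarity, k=1))
print(recall_at_k(similarity, k=2))
```

For video-to-text retrieval the same function can be applied to the transposed matrix, again under the one-to-one matching assumption.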
