Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval

About

In this paper we revisit feature fusion, an old-fashioned topic, in the new context of text-to-video retrieval. Unlike previous research, which considers feature fusion at only one end, be it video or text, we aim for feature fusion at both ends within a unified framework. We hypothesize that optimizing a convex combination of the features is preferable to modeling their correlations with computationally heavy multi-head self-attention. We propose Lightweight Attentional Feature Fusion (LAFF). LAFF performs feature fusion at both early and late stages and at both the video and text ends, making it a powerful method for exploiting diverse (off-the-shelf) features. The interpretability of LAFF can be used for feature selection. Extensive experiments on five public benchmark sets (MSR-VTT, MSVD, TGIF, VATEX and TRECVID AVS 2016-2020) justify LAFF as a new baseline for text-to-video retrieval.
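
The convex-combination idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the features have already been projected to a common dimension, and the names `softmax`, `convex_fuse`, and the scoring vector `w` are hypothetical, introduced only for illustration. Each feature receives a scalar score, a softmax turns the scores into non-negative weights that sum to one, and the fused feature is the weighted sum.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def convex_fuse(features, w):
    """Fuse k same-length feature vectors by a convex combination.

    features: list of k vectors, each of length d (assumed already
              projected to a common dimension d).
    w:        hypothetical scoring vector of length d; in a trained
              model this would be learned, here it is supplied by hand.
    Returns (fused_vector, weights).
    """
    # One scalar attention score per feature: dot product with w.
    scores = [sum(wi * fi for wi, fi in zip(w, f)) for f in features]
    # Softmax makes the weights non-negative and sum to 1.
    weights = softmax(scores)
    d = len(w)
    # Weighted sum of the features, computed per dimension.
    fused = [sum(a * f[j] for a, f in zip(weights, features)) for j in range(d)]
    return fused, weights


# Two toy 2-d features; a zero scoring vector gives equal weights.
fused, weights = convex_fuse([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
print(weights)  # [0.5, 0.5]
print(fused)    # [0.5, 0.5]
```

Because the weights sum to one, each can be read directly as that feature's relative contribution, which is the kind of interpretability the abstract mentions for feature selection.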

Fan Hu, Aozhu Chen, Ziyue Wang, Fangming Zhou, Jianfeng Dong, Xirong Li • 2021

Related benchmarks

Task                     Dataset                      Result       Rank
Text-to-Video Retrieval  MSVD (test)                  R@1 45.4     204
Text-to-Video Retrieval  MSR-VTT 1k-A (test)          R@1 45.8     57
Text-to-Video Retrieval  MSR-VTT (Full)               R@1 29.1     55
Ad-hoc Video Search      TRECVID (TV16) 2016 (test)   infAP 0.222  29
Ad-hoc Video Search      TRECVID TV17 2017 (test)     infAP 29     28
Ad-hoc Video Search      TRECVID (TV18) 2018 (test)   infAP 14.7   26
Ad-hoc Video Search      TRECVID TV19 2019 (test)     infAP 19.2   17
Text-to-Video Retrieval  TRECVid IACC.3 2017 (tv17)   xinfAP 26.1  16
Text-to-Video Retrieval  TRECVid V3C1 2019 (tv19)     xinfAP 21.5  16
Text-to-Video Retrieval  TRECVid IACC.3 2016          xinfAP 18.8  16