Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval
About
In this paper we revisit feature fusion, an old-fashioned topic, in the new context of text-to-video retrieval. Unlike previous research that considers feature fusion at only one end, be it video or text, we aim for feature fusion at both ends within a unified framework. We hypothesize that optimizing the convex combination of the features is preferable to modeling their correlations with computationally heavy multi-head self-attention. We propose Lightweight Attentional Feature Fusion (LAFF). LAFF performs feature fusion at both early and late stages and at both the video and text ends, making it a powerful method for exploiting diverse (off-the-shelf) features. The interpretability of LAFF can be used for feature selection. Extensive experiments on five public benchmark sets (MSR-VTT, MSVD, TGIF, VATEX and TRECVID AVS 2016-2020) support LAFF as a new baseline for text-to-video retrieval.
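The idea of fusing features by a convex combination can be illustrated with a minimal sketch. It assumes the features have already been projected to a common dimension, and it scores each feature with a dot product against a single vector. The names `convex_fusion` and `scoring_vector` are hypothetical and chosen for illustration only; this is not the authors' implementation, and in a trained model the scoring parameters would be learned.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def convex_fusion(features, scoring_vector):
    """Fuse k feature vectors of equal length d into one d-dim vector.

    Each feature gets a scalar score (dot product with scoring_vector,
    an illustrative stand-in for a learned attention layer). Softmax
    turns the scores into non-negative weights that sum to 1, so the
    output is a convex combination of the inputs.
    """
    scores = [sum(w * x for w, x in zip(scoring_vector, f)) for f in features]
    weights = softmax(scores)
    dim = len(features[0])
    fused = [
        sum(weights[i] * features[i][j] for i in range(len(features)))
        for j in range(dim)
    ]
    return fused, weights


# Three toy 2-dim features, standing in for different off-the-shelf encoders.
features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
fused, weights = convex_fusion(features, [0.2, -0.1])
```

Because the weights are non-negative and sum to 1, they can be read directly as per-feature contributions, which is the property the abstract refers to when it mentions using interpretability for feature selection. The cost is linear in the number of features, whereas self-attention compares every pair of features.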
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Text-to-Video Retrieval | MSVD (test) | R@1 | 45.4 | 204 |
| Text-to-Video Retrieval | MSR-VTT 1k-A (test) | R@1 | 45.8 | 57 |
| Text-to-Video Retrieval | MSR-VTT (Full) | R@1 | 29.1 | 55 |
| Ad-hoc Video Search | TRECVID (TV16) 2016 (test) | infAP | 0.222 | 29 |
| Ad-hoc Video Search | TRECVID TV17 2017 (test) | infAP | 29 | 28 |
| Ad-hoc Video Search | TRECVID (TV18) 2018 (test) | infAP | 14.7 | 26 |
| Ad-hoc Video Search | TRECVID TV19 2019 (test) | infAP | 19.2 | 17 |
| Text-to-Video Retrieval | TRECVid IACC.3 2017 (tv17) | xinfAP | 26.1 | 16 |
| Text-to-Video Retrieval | TRECVid V3C1 2019 (tv19) | xinfAP | 21.5 | 16 |
| Text-to-Video Retrieval | TRECVid IACC.3 2016 | xinfAP | 18.8 | 16 |