Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval

About

In this paper we revisit feature fusion, an old-fashioned topic, in the new context of text-to-video retrieval. Unlike previous research that considers feature fusion at only one end, be it video or text, we aim for feature fusion at both ends within a unified framework. We hypothesize that optimizing the convex combination of the features is preferable to modeling their correlations with computationally heavy multi-head self-attention. We propose Lightweight Attentional Feature Fusion (LAFF). LAFF performs feature fusion at both early and late stages and at both the video and text ends, making it a powerful method for exploiting diverse (off-the-shelf) features. The interpretability of LAFF can be used for feature selection. Extensive experiments on five public benchmark sets (MSR-VTT, MSVD, TGIF, VATEX and TRECVID AVS 2016-2020) justify LAFF as a new baseline for text-to-video retrieval.
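The idea of fusing features by a convex combination can be illustrated with a small sketch. This is not the authors' implementation. It assumes the features have already been projected to a common dimension, and it uses a hypothetical scoring vector `w` (in a trained model this would be learned) to produce one score per feature. A softmax turns the scores into non-negative weights that sum to one, so the fused vector is a convex combination of the inputs.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(features, w):
    """Fuse k same-dimension feature vectors by a convex combination.

    features: list of k vectors of length d (assumed already projected
              to a common space; the projection is omitted here)
    w:        hypothetical scoring vector of length d, standing in for
              a learned attention layer
    Returns the fused vector and the attention weights.
    """
    # One scalar score per feature: dot product with the scoring vector.
    scores = [sum(wi * xi for wi, xi in zip(w, f)) for f in features]
    # Softmax yields non-negative weights that sum to 1.
    weights = softmax(scores)
    d = len(features[0])
    fused = [sum(a * f[j] for a, f in zip(weights, features)) for j in range(d)]
    return fused, weights


# Three toy 2-d features and an arbitrary scoring vector (illustrative values).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [0.5, -0.5]
fused, weights = fuse(features, w)
```

Because each feature receives a single scalar weight, the weights can be read directly as a measure of how much each feature contributes, which is the kind of interpretability the abstract suggests using for feature selection.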

Fan Hu, Aozhu Chen, Ziyue Wang, Fangming Zhou, Jianfeng Dong, Xirong Li • 2021

Related benchmarks

Task	Dataset	Result
Text-to-Video Retrieval	MSVD (test)	R@1 45.4	229
Text-to-Video Retrieval	MSR-VTT 1k-A (test)	R@1 45.8	57
Text-to-Video Retrieval	MSR-VTT (Full)	R@1 29.1	55
Ad-hoc Video Search	TRECVID (TV16) 2016 (test)	infAP 0.222	29
Ad-hoc Video Search	TRECVID TV17 2017 (test)	infAP 29	28
Ad-hoc Video Search	TRECVID (TV18) 2018 (test)	infAP 14.7	26
Ad-hoc Video Search	TRECVID TV19 2019 (test)	infAP 19.2	17
Text-to-Video Retrieval	TRECVid IACC.3 2017 (tv17)	xinfAP 26.1	16
Text-to-Video Retrieval	TRECVid V3C1 2019 (tv19)	xinfAP 21.5	16
Text-to-Video Retrieval	TRECVid IACC.3 2016	xinfAP 18.8	16

