Support-Set Based Cross-Supervision for Video Grounding

About

Current approaches to video grounding propose a variety of complex architectures to capture video-text relations, and have achieved impressive improvements. However, it is hard to learn complicated multi-modal relations through architecture design alone. In this paper, we introduce a novel Support-set Based Cross-Supervision (Sscs) module which can improve existing methods during the training phase without extra inference cost. The proposed Sscs module contains two main components: a discriminative contrastive objective and a generative caption objective. The contrastive objective aims to learn effective representations by contrastive learning, while the caption objective trains a powerful video encoder supervised by texts. Because some visual entities co-exist in both ground-truth and background intervals (i.e., mutual exclusion), naive contrastive learning is unsuitable for video grounding. We address the problem by boosting the cross-supervision with the support-set concept, which collects visual information from the whole video and eliminates the mutual exclusion of entities. Combined with the original objectives, Sscs can enhance the multi-modal relation modeling abilities of existing approaches. We extensively evaluate Sscs on three challenging datasets, and show that our method can improve current state-of-the-art methods by large margins, notably by 6.35% in terms of R1@0.5 on Charades-STA.
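To make the support-set idea more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the similarity-weighted pooling, the `temperature` value, the InfoNCE-style loss, and all function names are assumptions chosen for illustration, and the caption objective is not covered. It pools every clip of a video (rather than only the ground-truth interval) into one feature weighted by similarity to the text, then contrasts that feature against the other videos in the batch.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    # L2-normalize; fall back to 1.0 to avoid dividing by zero.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def support_set_feature(clips, text, temperature=0.1):
    """Pool all clips of one video, weighted by similarity to the text.

    Assumption: the support set is every clip in the video, and the
    weights are a softmax over clip-text cosine similarities.
    """
    text = normalize(text)
    clips = [normalize(c) for c in clips]
    weights = softmax([dot(c, text) / temperature for c in clips])
    pooled = [sum(w * c[d] for w, c in zip(weights, clips))
              for d in range(len(text))]
    return normalize(pooled)

def contrastive_loss(videos, texts, temperature=0.1):
    """InfoNCE-style loss: text i should match the pooled feature of video i."""
    losses = []
    for i, text in enumerate(texts):
        query = normalize(text)
        logits = [dot(support_set_feature(clips, text, temperature), query) / temperature
                  for clips in videos]
        losses.append(-math.log(softmax(logits)[i]))
    return sum(losses) / len(losses)

# Toy batch: two videos with two 2-d clip features each, one query per video.
videos = [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [0.1, 0.9]]]
texts = [[1.0, 0.0], [0.0, 1.0]]
loss = contrastive_loss(videos, texts)
```

Because the pooled feature draws on the whole video, a clip outside the ground-truth interval that shares entities with the query contributes to the positive rather than being pushed away as a negative, which is the intuition the abstract describes.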

Xinpeng Ding, Nannan Wang, Shiwei Zhang, De Cheng, Xiaomeng Li, Ziyuan Huang, Mingqian Tang, Xinbo Gao • 2021

Related benchmarks

Task	Dataset	Metric	Result	
Video Moment Retrieval	Charades-STA (test)	Recall@1 (IoU=0.5)	43.15	108
Temporal Grounding	ActivityNet Captions	Recall@1 (IoU=0.5)	46.67	85
Temporal Grounding	Charades-STA (test)	Recall@1 (IoU=0.5)	43.15	68
Temporal Sentence Grounding	TACOS (test)	R@1 (IoU=0.5)	29.56	37
Video Moment Retrieval	ActivityNet Captions	R1@0.5 Recall	46.67	16
Temporal Video Grounding	TACoS C3D features (val)	Recall@1 (IoU=0.5)	29.56	12
Video Moment Retrieval	TACOS	Recall@1 (IoU=0.5)	29.56	11
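
The Recall@1 (IoU=0.5) figures in the table follow the usual convention for temporal grounding: a query counts as a hit when its top-ranked predicted interval overlaps the ground-truth interval with temporal IoU of at least 0.5. A minimal sketch of that computation is shown here; the function names and the (start, end) tuple layout are illustrative assumptions, not taken from any benchmark's official evaluation code.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) intervals in seconds."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0

def recall_at_1(top1_predictions, ground_truths, iou_threshold=0.5):
    """Percentage of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(
        temporal_iou(pred, gt) >= iou_threshold
        for pred, gt in zip(top1_predictions, ground_truths)
    )
    return 100.0 * hits / len(ground_truths)

# Two queries: the first prediction overlaps well, the second misses entirely.
predictions = [(0.0, 10.0), (2.0, 8.0)]
ground_truths = [(0.0, 8.0), (20.0, 30.0)]
score = recall_at_1(predictions, ground_truths)  # 50.0
```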

