CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding

About

This paper tackles an emerging and challenging problem of long video temporal grounding~(VTG) that localizes video moments related to a natural language (NL) query. Compared with short videos, long videos are also highly demanded but less explored, which brings new challenges in higher inference computation cost and weaker multi-modal alignment. To address these challenges, we propose CONE, an efficient COarse-to-fiNE alignment framework. CONE is a plug-and-play framework on top of existing VTG models to handle long videos through a sliding window mechanism. Specifically, CONE (1) introduces a query-guided window selection strategy to speed up inference, and (2) proposes a coarse-to-fine mechanism via a novel incorporation of contrastive learning to enhance multi-modal alignment for long videos. Extensive experiments on two large-scale long VTG benchmarks consistently show both substantial performance gains (e.g., from 3.13% to 6.87% on MAD) and state-of-the-art results. Analyses also reveal higher efficiency as the query-guided window selection mechanism accelerates inference time by 2x on Ego4D-NLQ and 15x on MAD while keeping SOTA results. Codes have been released at https://github.com/houzhijian/CONE.

Zhijian Hou, Wanjun Zhong, Lei Ji, Difei Gao, Kun Yan, Wing-Kwong Chan, Chong-Wah Ngo, Zheng Shou, Nan Duan• 2022

Related benchmarks

Task	Dataset	Result
Video Grounding	MAD (test)	Recall@1 (IoU=0.1)8.9	35
Video Grounding	Ego4D-NLQ v1 (test)	Recall@1 (Avg)17.67	27
Video Grounding	MAD 1.0 (test)	R@1 (IoU=0.3)6.87	26
Temporal Grounding	Ego4D-NLQ	R@1 (IoU=0.3)14.15	25
Natural Language Queries	Ego4D NLQ (val)	Recall@1 (IoU=0.3)14.15	23
Temporal Grounding	Ego4D NLQ (test)	R@1 (IoU=0.3)14.15	20
Video Temporal Grounding	Ego4D NLQ v1 (val)	R@1 (IoU=0.3)14.15	12
Long Video Moment Retrieval	MAD (test)	Recall@1 (Tol 0.1)8.9	10
Temporal Video Grounding	Ego4D-NLQ	Recall@1 (tIoU=0.3)14.15	9
Long Video Temporal Grounding	Ego4D	Average Recall17.67	9

Showing 10 of 15 rows

Other info

Follow for update

@wizwand_team Discord