ChatVTG: Video Temporal Grounding via Chat with Video Dialogue Large Language Models

About

Video Temporal Grounding (VTG) aims to ground specific segments within an untrimmed video corresponding to the given natural language query. Existing VTG methods largely depend on supervised learning and extensive annotated data, which is labor-intensive and prone to human biases. To address these challenges, we present ChatVTG, a novel approach that utilizes Video Dialogue Large Language Models (LLMs) for zero-shot video temporal grounding. Our ChatVTG leverages Video Dialogue LLMs to generate multi-granularity segment captions and matches these captions with the given query for coarse temporal grounding, circumventing the need for paired annotation data. Furthermore, to obtain more precise temporal grounding results, we employ moment refinement for fine-grained caption proposals. Extensive experiments on three mainstream VTG datasets, including Charades-STA, ActivityNet-Captions, and TACoS, demonstrate the effectiveness of ChatVTG. Our ChatVTG surpasses the performance of current zero-shot methods.

Mengxue Qu, Xiaodong Chen, Wu Liu, Alicia Li, Yao Zhao• 2024

Related benchmarks

Task	Dataset	Result
Moment Retrieval	Charades-STA (test)	R@0.533	186
Temporal Video Grounding	Charades-STA (test)	Recall@IoU=0.533	124
Video Grounding	Charades-STA	R@1 IoU=0.533	113
Temporal Grounding	Charades-STA	mIoU34.9	107
Temporal Grounding	ActivityNet Captions	Recall@1 (IoU=0.5)22.5	85
Natural Language Video Localization	Charades-STA (test)	R@1 (IoU=0.5)33	61
Video Temporal Grounding	ActivityNet Captions	Recall @ IoU=0.522.5	43
Temporal Grounding	Charades-STA	mIoU34.8	21
Temporal Grounding	ActivityNet-MR	R@0.79.4	21
Video Moment Retrieval	ActivityNet-Captions (test)	R1@0.522.5	19

Showing 10 of 13 rows

Other info

Follow for update

@wizwand_team Discord