VTG-GPT: Tuning-Free Zero-Shot Video Temporal Grounding with GPT

About

Video temporal grounding (VTG) aims to locate specific temporal segments in an untrimmed video based on a linguistic query. Most existing VTG models are trained on extensive annotated video-text pairs, a process that not only introduces human biases from the queries but also incurs significant computational costs. To tackle these challenges, we propose VTG-GPT, a GPT-based method for zero-shot VTG without training or fine-tuning. To reduce bias in the original query, we employ Baichuan2 to generate debiased queries. To reduce redundant information in videos, we apply MiniGPT-v2 to transform visual content into more precise captions. Finally, we devise a proposal generator and post-processing to produce accurate segments from the debiased queries and image captions. Extensive experiments demonstrate that VTG-GPT significantly outperforms SOTA methods in zero-shot settings and surpasses unsupervised approaches. More notably, it achieves performance competitive with supervised methods. The code is available at https://github.com/YoucanBaby/VTG-GPT
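
To make the final step more concrete, here is a minimal sketch of how per-frame query-caption similarity scores could be grouped into temporal segments. This is not the VTG-GPT implementation: the function name `scores_to_segments`, the fixed threshold, and the frame rate are illustrative assumptions, and the similarity scores are taken as given rather than computed from Baichuan2 or MiniGPT-v2 outputs.

```python
from typing import List, Tuple


def scores_to_segments(
    scores: List[float],
    threshold: float = 0.5,  # assumed cut-off, not a value from the paper
    fps: float = 1.0,        # assumed sampling rate of the scored frames
) -> List[Tuple[float, float]]:
    """Group consecutive frames scoring >= threshold into (start, end) segments in seconds."""
    segments: List[Tuple[float, float]] = []
    start = None  # index of the first frame of the currently open segment
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i  # open a new segment
        elif score < threshold and start is not None:
            segments.append((start / fps, i / fps))  # close the open segment
            start = None
    if start is not None:
        # The video ended while a segment was still open.
        segments.append((start / fps, len(scores) / fps))
    return segments


# Hypothetical similarity scores for eight frames sampled at 1 fps.
scores = [0.1, 0.2, 0.8, 0.9, 0.7, 0.3, 0.6, 0.1]
print(scores_to_segments(scores))  # [(2.0, 5.0), (6.0, 7.0)]
```

A fixed threshold is only the simplest choice; a real proposal generator would likely adapt it per video and merge or rank the resulting segments in post-processing.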

Yifang Xu, Yunzhuo Sun, Zien Xie, Benxiang Zhai, Sidan Du • 2024

Related benchmarks

Task	Dataset	Metric	Result
Temporal Video Grounding	Charades-STA (test)	Recall@IoU=0.5	43.68	124
Video Grounding	QVHighlights (test)	mAP (IoU=0.5)	54.13	64
Video Temporal Grounding	QVHighlights (val)	mAP (Avg)	30.91	37
Temporal Video Grounding	ActivityNet-Captions (test)	Recall@IoU>0.5	28.25	32
Video Question Answering	EMCompress (Stage-1)	Temporal Relational mIoU	17	3
Video Temporal Grounding	EMCompress 1.0 (test)	Temporal Relational mIoU	17	3
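
The recall and mIoU metrics in the table are built on the temporal IoU between a predicted segment and a ground-truth segment. The sketch below shows one common way to compute it, assuming segments are (start, end) pairs in seconds and one top-1 prediction per query. The function names are illustrative and not taken from the VTG-GPT code, and whether a benchmark counts a hit at IoU exactly equal to the threshold varies; `>=` is used here as an assumption.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two 1-D time intervals."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_iou(preds: List[Segment], gts: List[Segment], threshold: float = 0.5) -> float:
    """Percentage of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)


def mean_iou(preds: List[Segment], gts: List[Segment]) -> float:
    """Average temporal IoU over all queries, as a percentage."""
    return 100.0 * sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(gts)


# Hypothetical predictions and ground truths for two queries.
preds = [(0.0, 10.0), (2.0, 8.0)]
gts = [(0.0, 10.0), (20.0, 30.0)]
print(recall_at_iou(preds, gts))  # 50.0
print(mean_iou(preds, gts))       # 50.0
```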
