
EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

About

Video temporal grounding (VTG) takes an untrimmed video and a natural-language query as input and localizes the temporal moment that best matches the query. Existing methods rely on large, task-specific datasets that require costly manual annotation. We introduce EvoGround, a framework of two coupled self-evolving agents, a proposer and a solver, that learn temporal grounding from raw videos without any human-labeled data. The proposer generates query-moment pairs from raw videos, while the solver learns to ground them and feeds back signals that improve the proposer in return. The two agents are initialized from the same backbone and, through this self-reinforcing reinforcement-learning loop, improve each other across iterations. Trained on 2.5K unlabeled videos, EvoGround matches or surpasses fully supervised models across multiple VTG benchmarks, and it also emerges as a state-of-the-art fine-grained video captioner without manual labels.

Minjoon Jung, Byoung-Tak Zhang, Lorenzo Torresani • 2026

Related benchmarks

Task | Dataset | Result | Rank
Dense Video Captioning | ActivityNet Captions | METEOR: 7 | 48
Video Question Answering | Video-MME without subtitles | Accuracy (Overall): 62.3 | 46
Video Temporal Grounding | ActivityNet Captions | -- | 43
Video Temporal Grounding | Charades-STA | R1@0.5 Recall: 60.5 | 20
Temporal Grounding | ReXTime | R@0.3: 33.5 | 15
Temporal Video Grounding | ET-Bench | F1-score: 69 | 11
Video Captioning | TemporalBench | Similarity Score: 53.8 | 10
Multiple-choice Video Question Answering | TempCompass multi-choice QA | Accuracy: 71.5 | 3
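Recall figures for temporal grounding, such as the Charades-STA entry above, are conventionally computed from the temporal intersection-over-union (IoU) between a predicted moment and the annotated one: a prediction counts as a hit when its IoU meets the threshold. The sketch below illustrates that standard definition only. The exact evaluation protocol behind each benchmark row is not stated on this page, and the names `temporal_iou` and `recall_at_1` are hypothetical helpers chosen for illustration.

```python
from typing import List, Tuple

# A moment is a (start_seconds, end_seconds) pair. This type alias is an
# assumption made for the example, not something defined by the source.
Span = Tuple[float, float]


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection-over-union of two time intervals on the same video."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_1(preds: List[Span], gts: List[Span], threshold: float) -> float:
    """Percentage of queries whose top-1 prediction reaches the IoU threshold."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(preds)


if __name__ == "__main__":
    # Two toy queries: the first prediction overlaps its ground truth only
    # partially (IoU = 5 / 15), the second matches it exactly (IoU = 1).
    predictions = [(0.0, 10.0), (2.0, 8.0)]
    ground_truth = [(5.0, 15.0), (2.0, 8.0)]
    print(recall_at_1(predictions, ground_truth, threshold=0.5))
    print(recall_at_1(predictions, ground_truth, threshold=0.3))
```

With these toy inputs, only the exact match clears a 0.5 threshold, while both predictions clear 0.3, which is why scores reported at a lower threshold are not comparable to scores at a higher one.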
