Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding
About
Temporal Video Grounding (TVG), the task of locating specific video segments based on language queries, is a core challenge in long-form video understanding. While recent Large Vision-Language Models (LVLMs) have shown early promise in tackling TVG through supervised fine-tuning (SFT), their ability to generalize remains limited. To address this, we propose a novel post-training framework that enhances the generalization capabilities of LVLMs via reinforcement learning (RL). Specifically, our contributions span three key directions: (1) Time-R1: we introduce a reasoning-guided post-training framework via RL with verifiable reward to enhance the capabilities of LVLMs on the TVG task. (2) TimeRFT: we explore data-efficient post-training strategies on our curated RL-friendly dataset, which trains the model to progressively comprehend difficult samples, leading to better generalization. (3) TVGBench: we carefully construct a small yet comprehensive benchmark for LVLM evaluation, assessing 11 types of queries and featuring balanced distributions across both videos and queries. Extensive experiments demonstrate that Time-R1 achieves state-of-the-art performance across multiple downstream datasets using only 2.5K training samples, while also improving its general video understanding capabilities.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Temporal Video Grounding | Charades-STA (test) | Recall@IoU=0.5 | 72.2 | 117 |
| Video Grounding | Charades-STA | R@1 IoU=0.5 | 59 | 113 |
| Temporal Action Localization | ActivityNet v1.3 (test) | -- | -- | 47 |
| Temporal Grounding | ActivityNet Captions | Recall@1 (IoU=0.5) | 55.6 | 45 |
| Temporal Grounding | Charades-STA | mIoU | 58.8 | 33 |
| Temporal Video Grounding | ActivityNet (test) | Recall @ 0.5 | 39 | 27 |
| Grounded Video Question Answering | NExT-GQA (test) | mIoU | 28.3 | 24 |
| Video Event Grounding | ActivityNet | Recall@0.5 | 36.4 | 17 |
| Temporal Video Grounding | QVHighlights TimeLens (test) | Recall @ IoU=0.3 | 65.8 | 17 |
| Temporal Video Grounding | Charades-TimeLens (test) | R@0.3 (IoU=0.3) | 57.9 | 17 |
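
The metrics in the table (Recall at an IoU threshold, and mIoU) are all built on the temporal intersection-over-union between a predicted segment and a ground-truth segment. The sketch below shows one common way these quantities are computed. It is an illustration only, not the evaluation code used for Time-R1 or for any benchmark listed here; the function names and the single-prediction-per-query setup (R@1) are assumptions made for the example.

```python
from typing import List, Tuple

# A segment is (start_seconds, end_seconds). Hypothetical type alias for this sketch.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two time intervals, in [0, 1]."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_iou(preds: List[Segment], gts: List[Segment], threshold: float) -> float:
    """Percentage of queries whose top-1 prediction reaches IoU >= threshold."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    hits = sum(1 for iou in ious if iou >= threshold)
    return 100.0 * hits / len(ious)


def mean_iou(preds: List[Segment], gts: List[Segment]) -> float:
    """Average IoU over all queries, reported as a percentage."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    return 100.0 * sum(ious) / len(ious)


if __name__ == "__main__":
    # Toy data: one exact match, one partial overlap, one complete miss.
    predictions = [(0.0, 10.0), (5.0, 15.0), (20.0, 30.0)]
    ground_truth = [(0.0, 10.0), (10.0, 20.0), (40.0, 50.0)]
    print("R@1, IoU=0.5:", recall_at_iou(predictions, ground_truth, 0.5))
    print("R@1, IoU=0.3:", recall_at_iou(predictions, ground_truth, 0.3))
    print("mIoU:", mean_iou(predictions, ground_truth))
```

Individual benchmarks may differ in details such as how many predictions per query are scored or how timestamps are rounded, so numbers from this sketch are not directly comparable to the table.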