Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding

About

Temporal Video Grounding (TVG), the task of locating specific video segments based on language queries, is a core challenge in long-form video understanding. While recent Large Vision-Language Models (LVLMs) have shown early promise in tackling TVG through supervised fine-tuning (SFT), their ability to generalize remains limited. To address this, we propose a novel post-training framework that enhances the generalization capabilities of LVLMs via reinforcement learning (RL). Specifically, our contributions span three key directions: (1) Time-R1: we introduce a reasoning-guided post-training framework via RL with verifiable reward to enhance the capabilities of LVLMs on the TVG task. (2) TimeRFT: we explore data-efficient post-training strategies on our curated RL-friendly dataset, which trains the model to progressively comprehend difficult samples, leading to better generalization. (3) TVGBench: we carefully construct a small yet comprehensive benchmark for LVLM evaluation, assessing 11 types of queries and featuring balanced distributions across both videos and queries. Extensive experiments demonstrate that Time-R1 achieves state-of-the-art performance across multiple downstream datasets using only 2.5K training samples, while also improving its general video understanding capabilities.
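To make "verifiable reward" concrete for TVG, the following is a minimal sketch of one reward that can be checked automatically against ground truth: the temporal IoU between a predicted segment and the annotated segment. This is an illustration under stated assumptions, not the reward actually used by Time-R1; the function names `temporal_iou` and `iou_reward` and the (start, end) in seconds representation are hypothetical.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) segments, in seconds."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap is clamped at zero when the segments are disjoint.
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - inter
    return inter / union if union > 0 else 0.0


def iou_reward(pred, gt):
    """Hypothetical verifiable reward: tIoU in [0, 1], zero for malformed spans."""
    # Assumption: a prediction whose end does not come after its start is invalid.
    if pred[1] <= pred[0]:
        return 0.0
    return temporal_iou(pred, gt)


if __name__ == "__main__":
    print(iou_reward((0.0, 10.0), (5.0, 15.0)))
```

Because the reward depends only on the predicted timestamps and the annotation, it can be computed without a learned reward model, which is what makes it verifiable in this sketch.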

Ye Wang, Ziheng Wang, Boshen Xu, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, Xiangnan Fang, Zewen He, Zhenbo Luo, Wenxuan Wang, Junqi Lin, Jian Luan, Qin Jin • 2025

Related benchmarks

Task	Dataset	Result
Video Question Answering	VideoMME	Accuracy 54.2	254
Video Understanding	MVBench (test)	Accuracy 63.1	201
Video Understanding	MLVU	Accuracy 60.5	147
Temporal Video Grounding	Charades-STA (test)	Recall@IoU=0.5 72.2	139
Video Understanding	LongVideoBench	Accuracy 56	128
Temporal Grounding	Charades-STA	mIoU 58.8	120
Video Grounding	Charades-STA	R@1 IoU=0.5 59	113
Temporal Grounding	ActivityNet	Recall@0.3 58.6	111
Video Understanding	LVBench	Overall Accuracy 38.2	95
Video Understanding	MMVU	Accuracy 63.4	91

Showing 10 of 91 rows
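The grounding rows above report Recall at an IoU threshold and mIoU. As a rough sketch of how such numbers are commonly computed from top-1 predictions, assuming segments are (start, end) pairs in seconds, results are reported in percent, and a prediction counts as a hit when its IoU is at least the threshold (benchmarks may differ on these conventions); the function names are hypothetical:

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) segments, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def grounding_metrics(preds, gts, thresholds=(0.3, 0.5, 0.7)):
    """Return ({threshold: Recall@1 in %}, mIoU in %) over paired segments."""
    if not preds or len(preds) != len(gts):
        raise ValueError("preds and gts must be non-empty and of equal length")
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    n = len(ious)
    # Recall@1 at threshold t: share of queries whose top prediction reaches IoU >= t.
    recall = {t: 100.0 * sum(iou >= t for iou in ious) / n for t in thresholds}
    # mIoU: mean IoU over all queries.
    miou = 100.0 * sum(ious) / n
    return recall, miou


if __name__ == "__main__":
    preds = [(0.0, 10.0), (0.0, 10.0)]
    gts = [(0.0, 10.0), (5.0, 15.0)]
    print(grounding_metrics(preds, gts))
```

In this toy example the first prediction matches exactly and the second overlaps by one third, so the stricter thresholds count only one of the two queries as a hit.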
