
TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs

About

This paper does not introduce a novel method but instead establishes a straightforward, incremental, yet essential baseline for video temporal grounding (VTG), a core capability in video understanding. While multimodal large language models (MLLMs) excel at various video understanding tasks, the recipes for optimizing them for VTG remain under-explored. In this paper, we present TimeLens, a systematic investigation into building MLLMs with strong VTG ability, along two primary dimensions: data quality and algorithmic design. We first expose critical quality issues in existing VTG benchmarks and introduce TimeLens-Bench, comprising meticulously re-annotated versions of three popular benchmarks with strict quality criteria. Our analysis reveals dramatic model re-rankings compared to legacy benchmarks, confirming the unreliability of prior evaluation standards. We also address noisy training data through an automated re-annotation pipeline, yielding TimeLens-100K, a large-scale, high-quality training dataset. Building on our data foundation, we conduct in-depth explorations of algorithmic design principles, yielding a series of meaningful insights and effective yet efficient practices. These include interleaved textual encoding for time representation, a thinking-free reinforcement learning with verifiable rewards (RLVR) approach as the training paradigm, and carefully designed recipes for RLVR training. These efforts culminate in the TimeLens models, a family of MLLMs that achieve state-of-the-art VTG performance among open-source models and even surpass proprietary models such as GPT-5 and Gemini-2.5-Flash. All code, data, and models will be released to facilitate future research.
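Both the R1@IoU metrics reported below and an IoU-based verifiable reward for RLVR-style grounding training reduce to the temporal IoU between a predicted time span and the ground-truth span. A minimal sketch, assuming spans are `(start, end)` pairs in seconds; the function names, and using IoU directly as the reward, are illustrative assumptions rather than the paper's exact recipe:

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) spans in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0


def grounding_reward(predicted_span, gt_span):
    # Hypothetical verifiable reward: the overlap itself is the
    # reward signal, so no learned reward model is needed.
    return temporal_iou(predicted_span, gt_span)
```

Because the reward is computed directly from the annotation, it is verifiable in the RLVR sense: any predicted span can be scored deterministically without a judge model.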

Jun Zhang, Teng Wang, Yuying Ge, Yixiao Ge, Xinhao Li, Ying Shan, Limin Wang • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Understanding | VideoMME | Overall Score | 65.7 | 192 |
| Video Temporal Grounding | Charades-TimeLens | R1@0.3 | 76.6 | 13 |
| Video Temporal Grounding | ActivityNet TimeLens | R@0.3 | 68.9 | 13 |
| Video Temporal Grounding | QVHighlights TimeLens | R1@0.3 | 80.2 | 13 |
| Video Temporal Grounding | Charades-TimeLens refined (test) | R@1 (IoU=0.3) | 0.705 | 11 |
| Video Temporal Grounding | Charades-STA original (test) | mIoU | 42.3 | 11 |
| Video Temporal Grounding | ActivityNet-Captions (original) | Recall@0.5 | 35.2 | 7 |
| Video Temporal Grounding | ActivityNet TimeLens (refined) | R@0.3 | 62.8 | 7 |
| Video Temporal Grounding | QVHighlights original (test) | R@0.3 | 78.4 | 5 |
| Video Temporal Grounding | QVHighlights TimeLens refined (test) | R1@0.3 | 74.1 | 5 |
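The R1@m entries in the table (e.g. R1@0.3) count a top-1 prediction as correct when its temporal IoU with the ground-truth span is at least the threshold m, then average over the dataset. A hedged sketch of that aggregation; the helper names are hypothetical:

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) spans in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0


def recall_at_1(predictions, ground_truths, threshold=0.3):
    """Fraction of queries whose top-1 span has IoU >= threshold."""
    hits = sum(
        temporal_iou(p, g) >= threshold
        for p, g in zip(predictions, ground_truths)
    )
    return hits / len(predictions)
```

mIoU, by contrast, averages the raw IoU values themselves rather than thresholding them, which is why the two columns are not directly comparable.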

Other info

GitHub
