TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs
About
This paper does not introduce a novel method but instead establishes a straightforward, incremental, yet essential baseline for video temporal grounding (VTG), a core capability in video understanding. While multimodal large language models (MLLMs) excel at various video understanding tasks, the recipes for optimizing them for VTG remain under-explored. In this paper, we present TimeLens, a systematic investigation into building MLLMs with strong VTG ability along two primary dimensions: data quality and algorithmic design. We first expose critical quality issues in existing VTG benchmarks and introduce TimeLens-Bench, which comprises meticulously re-annotated versions of three popular benchmarks produced under strict quality criteria. Our analysis reveals dramatic model re-rankings compared to legacy benchmarks, confirming the unreliability of prior evaluation standards. We also address noisy training data through an automated re-annotation pipeline, yielding TimeLens-100K, a large-scale, high-quality training dataset. Building on this data foundation, we conduct in-depth explorations of algorithmic design principles, yielding a series of meaningful insights and effective yet efficient practices. These include interleaved textual encoding for time representation, a thinking-free reinforcement learning with verifiable rewards (RLVR) approach as the training paradigm, and carefully designed recipes for RLVR training. These efforts culminate in the TimeLens models, a family of MLLMs that achieve state-of-the-art VTG performance among open-source models and even surpass proprietary models such as GPT-5 and Gemini-2.5-Flash. All code, data, and models will be released to facilitate future research.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Understanding | VideoMME | Overall Score | 65.7 | 192 |
| Video Temporal Grounding | Charades-TimeLens | R1@0.3 | 76.6 | 13 |
| Video Temporal Grounding | ActivityNet TimeLens | R@0.3 | 68.9 | 13 |
| Video Temporal Grounding | QVHighlights TimeLens | R1@0.3 | 80.2 | 13 |
| Video Temporal Grounding | Charades-TimeLens refined (test) | R@1 (IoU=0.3) | 0.705 | 11 |
| Video Temporal Grounding | Charades-STA original (test) | mIoU | 42.3 | 11 |
| Video Temporal Grounding | ActivityNet-Captions (original) | Recall@0.5 | 35.2 | 7 |
| Video Temporal Grounding | ActivityNet TimeLens (refined) | R@0.3 | 62.8 | 7 |
| Video Temporal Grounding | QVHighlights original (test) | R@0.3 | 78.4 | 5 |
| Video Temporal Grounding | QVHighlights TimeLens refined (test) | R1@0.3 | 74.1 | 5 |
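The grounding metrics in the table (R1@0.3, Recall@0.5, mIoU) are conventionally built on the temporal intersection-over-union between a predicted segment and a ground-truth segment. The sketch below illustrates that convention only; it is not the evaluation code behind these results. It assumes one top-ranked prediction and one ground-truth span per query, both in seconds, and reports scores as percentages. The function names are hypothetical, and individual benchmarks may differ in detail (multiple ground-truth spans, fractional rather than percentage reporting).

```python
from typing import Sequence, Tuple

# A temporal span as (start_seconds, end_seconds). Illustrative type alias.
Span = Tuple[float, float]


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection-over-union of two time intervals."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap is zero when the intervals are disjoint.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_1(preds: Sequence[Span], gts: Sequence[Span], threshold: float) -> float:
    """Percentage of queries whose top prediction reaches IoU >= threshold."""
    if len(preds) != len(gts) or not preds:
        raise ValueError("preds and gts must be non-empty and equally long")
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(preds)


def mean_iou(preds: Sequence[Span], gts: Sequence[Span]) -> float:
    """Average IoU over all queries, as a percentage."""
    if len(preds) != len(gts) or not preds:
        raise ValueError("preds and gts must be non-empty and equally long")
    total = sum(temporal_iou(p, g) for p, g in zip(preds, gts))
    return 100.0 * total / len(preds)


if __name__ == "__main__":
    # Toy data: partial overlap, exact match, and a complete miss.
    predictions = [(0.0, 10.0), (0.0, 10.0), (0.0, 5.0)]
    ground_truth = [(5.0, 15.0), (0.0, 10.0), (10.0, 15.0)]
    print(recall_at_1(predictions, ground_truth, threshold=0.3))
    print(mean_iou(predictions, ground_truth))
```

Under this reading, a lower threshold such as 0.3 credits loosely localized predictions, while 0.5 demands tighter overlap, so scores at different thresholds are not directly comparable across rows.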