VideoExplorer: Think With Videos For Agentic Long-Video Understanding

About

Long-video understanding (LVU) is a challenging problem in computer vision. Existing methods either downsample frames for single-pass reasoning, sacrificing fine-grained details, or depend on textual reasoning over task-agnostic representations, hindering task-specific perception and exploration. In this paper, we propose VideoExplorer, a framework grounded in the principle of "thinking with video", which naturally intertwines planning, temporal grounding, and scalable perception into a coherent reasoning process. Rather than reasoning over a static context, VideoExplorer iteratively formulates sub-questions, locates relevant moments, and performs task-oriented, temporally scalable video understanding until reaching the final answer, enabling faithful, efficient, and interpretable reasoning. To address the lack of LVU training resources, we construct a long-video reasoning dataset using difficulty-adaptive sampling to ensure high-quality trajectories on complex tasks. Building on this dataset, we design a two-stage training pipeline: supervised trajectory initialization followed by trajectory-level preference optimization, encouraging adaptive temporal grounding and iterative information integration guided by downstream rewards. Extensive evaluations on popular long-video understanding and reasoning benchmarks demonstrate VideoExplorer's significant advantage over existing baselines, highlighting its robustness, adaptability, and efficiency. Our code is publicly available in this repository (https://github.com/yhy-2000/VideoDeepResearch).

Huaying Yuan, Zheng Liu, Junjie Zhou, Hongjin Qian, Yan Shu, Nicu Sebe, Ji-Rong Wen, Zhicheng Dou • 2025
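
The iterative process described in the abstract (formulate a sub-question, locate the relevant moment, perceive that segment, repeat until an answer is reached) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `explore`, `PlanOutput`, `Step`, and the `plan` / `ground` / `perceive` callables are hypothetical, and the toy stubs stand in for whatever models the real system uses.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec); hypothetical representation


@dataclass
class Step:
    """One exploration step: what was asked, where we looked, what we saw."""
    sub_question: str
    span: Span
    observation: str


@dataclass
class PlanOutput:
    """Planner result: either a final answer or the next sub-question."""
    answer: Optional[str] = None
    sub_question: Optional[str] = None


def explore(
    question: str,
    plan: Callable[[str, List[Step]], PlanOutput],
    ground: Callable[[str], Span],
    perceive: Callable[[Span, str], str],
    max_steps: int = 8,  # assumed step budget, not taken from the paper
) -> Tuple[Optional[str], List[Step]]:
    """Alternate planning, temporal grounding, and perception until answered."""
    history: List[Step] = []
    for _ in range(max_steps):
        out = plan(question, history)
        if out.answer is not None:
            return out.answer, history
        span = ground(out.sub_question)          # locate the relevant moment
        obs = perceive(span, out.sub_question)   # inspect only that segment
        history.append(Step(out.sub_question, span, obs))
    return None, history  # budget exhausted without a final answer


# Toy stand-ins so the sketch runs end to end (purely illustrative).
def toy_plan(question: str, history: List[Step]) -> PlanOutput:
    if history:
        return PlanOutput(answer=history[-1].observation)
    return PlanOutput(sub_question="When does the red car appear?")


def toy_ground(sub_question: str) -> Span:
    return (120.0, 135.0)


def toy_perceive(span: Span, sub_question: str) -> str:
    return f"red car visible between {span[0]:.0f}s and {span[1]:.0f}s"


answer, trace = explore("Which car appears?", toy_plan, toy_ground, toy_perceive)
print(answer)
```

The returned `trace` keeps every sub-question, time span, and observation, which is one simple way to make the reasoning path inspectable.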

Related benchmarks

Task                      Dataset                Result              Rank
Long Video Understanding  LongVideoBench (val)   --                  210
Long Video Understanding  MLVU                   --                  154
Long Video Understanding  MLVU (test)            Average Score 64.5  60
Long Video Understanding  LVBench (test)         LVBench Score 55.5  43
Video Question Answering  Video-MME Long         Accuracy 72.4       36
Long Video Understanding  Video-MME Long (test)  Overall Score 76.3  16