
LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling

About

Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucinations, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos, first skimming globally and then examining relevant clips for details, we introduce LongVT, an end-to-end agentic framework that enables "Thinking with Long Videos" via an interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames. This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering (QA) data for long video reasoning, we curate and will release a data suite named VideoSIAH to facilitate both training and evaluation. Our training data consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning. Our evaluation benchmark consists of 1,280 QA pairs carefully curated through a semi-automatic data pipeline with human-in-the-loop validation. With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms strong existing baselines across four challenging long-video understanding and reasoning benchmarks. Our code, data, and model checkpoints are publicly available at https://github.com/EvolvingLMMs-Lab/LongVT .
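The global-to-local loop described above can be sketched roughly as follows. This is a minimal illustration, not LongVT's actual implementation: all function names (`crop_video`, `sample_frames`, `answer_with_tools`) and the confidence-based stopping criterion are hypothetical stand-ins for the model's native tool calls.

```python
# Hypothetical sketch of a "global-to-local" reasoning loop:
# skim the whole video coarsely, then use temporal grounding as a
# cropping tool to zoom into a clip and resample finer frames.

def crop_video(video, start, end):
    """Return the sub-clip between start and end seconds (illustrative)."""
    return {"source": video["source"], "start": start, "end": end}

def sample_frames(clip, fps):
    """Uniformly sample frame timestamps from a clip at the given fps."""
    t, frames = clip["start"], []
    while t < clip["end"]:
        frames.append(round(t, 3))
        t += 1.0 / fps
    return frames

def answer_with_tools(video, question, grounder, answerer, max_rounds=3):
    """Loop: answer from current frames; if not confident, ground a
    relevant clip, crop to it, and resample at a finer frame rate."""
    clip = {"source": video["source"], "start": 0.0, "end": video["duration"]}
    fps = 0.5  # coarse global skim
    for _ in range(max_rounds):
        frames = sample_frames(clip, fps)
        result = answerer(question, frames)
        if result["confident"]:
            return result["answer"]
        start, end = grounder(question, frames)  # temporal grounding as a tool
        clip = crop_video(video, start, end)
        fps *= 4  # finer-grained resampling on the zoomed-in clip
    return answerer(question, sample_frames(clip, fps))["answer"]
```

In the real framework both the grounder and the answerer are the same LMM emitting interleaved tool calls and reasoning text; the sketch separates them only for clarity.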

Zuhao Yang, Sudong Wang, Kaichen Zhang, Keming Wu, Sicong Leng, Yifan Zhang, Bo Li, Chengwei Qin, Shijian Lu, Xingxuan Li, Lidong Bing• 2025

Related benchmarks

| Task                     | Dataset                        | Result             | Rank |
|--------------------------|--------------------------------|--------------------|------|
| Video Question Answering | VideoMME                       | --                 | 99   |
| Video Question Answering | VideoMMMU                      | Accuracy 45.4      | 52   |
| Video Question Answering | LVBench                        | Overall Score 41.3 | 32   |
| Video Reasoning          | SAGE-Bench 1.0 (test)          | Overall Score 46.7 | 29   |
| Video Understanding      | LVBench                        | --                 | 23   |
| Grounded VQA             | OphVL In-domain (test)         | mIoU 42.17         | 13   |
| Temporal Grounding       | SurgVidLM Out-of-domain (test) | R@0.3 51.13        | 13   |
| Temporal Grounding       | OphVL In-domain (test)         | R@0.3 48.73        | 13   |
| Grounded VQA             | MedVideoCap In-domain (test)   | mIoU 55.34         | 13   |
| Grounded VQA             | SurgVidLM Out-of-domain (test) | mIoU 25.72         | 13   |

Showing 10 of 15 rows.

Other info

GitHub
