
LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling

About

Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucinations, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos, first skimming globally and then examining relevant clips for details, we introduce LongVT, an end-to-end agentic framework that enables "Thinking with Long Videos" via an interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs' inherent temporal grounding ability as a native video cropping tool that zooms in on a specific video clip and resamples finer-grained video frames. This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering (QA) data for long video reasoning, we curate and will release a data suite named VideoSIAH to facilitate both training and evaluation. Our training data comprise 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning. Our evaluation benchmark consists of 1,280 QA pairs carefully curated through a semi-automatic data pipeline with human-in-the-loop validation. With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks. Our code, data, and model checkpoints are publicly available at https://github.com/EvolvingLMMs-Lab/LongVT.
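The global-to-local loop described above can be sketched as follows. This is a minimal illustrative sketch, not the released implementation: the function names (`video_crop`, `think_with_long_video`), the `Step` action schema, and the frame-sampling details are all assumptions made for exposition; the abstract only specifies that the LMM interleaves reasoning with a native cropping tool until the answer is grounded in a retrieved clip.

```python
# Hypothetical sketch of LongVT's global-to-local reasoning loop.
# All names and signatures here are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Step:
    action: str          # "crop" to zoom into a clip, or "answer" to stop
    start: float = 0.0   # clip start in seconds (used when action == "crop")
    end: float = 0.0     # clip end in seconds (used when action == "crop")
    text: str = ""       # final answer (used when action == "answer")

def video_crop(frames, fps, start, end, target_n=8):
    """Native cropping tool: keep frames in [start, end) and resample
    up to target_n finer-grained frames from that window."""
    clip = frames[int(start * fps):int(end * fps)]
    if len(clip) <= target_n:
        return clip
    stride = len(clip) / target_n
    return [clip[int(i * stride)] for i in range(target_n)]

def think_with_long_video(lmm_step, frames, fps, question, max_turns=4):
    """Interleaved chain-of-tool-thought: skim the video globally, then
    repeatedly let the LMM call the crop tool on a suspected clip until
    it grounds an answer, or the turn budget runs out."""
    # Coarse global skim: a sparse, uniform sample over the whole video.
    context = frames[:: max(1, len(frames) // 8)]
    for _ in range(max_turns):
        step = lmm_step(context, question)   # LMM decides the next action
        if step.action == "answer":
            return step.text                 # answer grounded in evidence
        # Zoom in: replace the context with finer frames from the clip.
        context = video_crop(frames, fps, step.start, step.end)
    return None  # no grounded answer within the turn budget
```

In practice `lmm_step` would be a call to the multimodal model that emits either a tool invocation or a final answer; the key design point is that cropping and resampling happen inside the reasoning loop, so later turns condition on progressively finer visual evidence.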

Zuhao Yang, Sudong Wang, Kaichen Zhang, Keming Wu, Sicong Leng, Yifan Zhang, Bo Li, Chengwei Qin, Shijian Lu, Xingxuan Li, Lidong Bing • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Understanding | VideoMME | -- | -- | 248 |
| Video Question Answering | VideoMME | -- | -- | 210 |
| Video Question Answering | VideoMMMU | Accuracy | 45.4 | 124 |
| Video Understanding | LVBench | Average Score | 41.3 | 67 |
| Video Question Answering | LVBench | Overall Score | 41.3 | 32 |
| Video Understanding | VideoMMMU | -- | -- | 32 |
| Video Reasoning | SAGE-Bench 1.0 (test) | Overall Score | 46.7 | 29 |
| Video Understanding | WorldSense | Score | 22.1 | 25 |
| Multimodal Question Answering | MM-Lifelong week (test) | Accuracy | 9.75 | 14 |
| Multimodal Question Answering | MM-Lifelong (val@month) | Accuracy | 7.54 | 14 |

Showing 10 of 21 rows.
