LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
About
Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought, but they remain vulnerable to hallucination, especially on long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos, first skimming globally and then examining relevant clips for details, we introduce LongVT, an end-to-end agentic framework that enables "Thinking with Long Videos" via an interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs' inherent temporal grounding ability as a native video cropping tool: the model zooms in on a specific video clip and resamples finer-grained frames from it. This global-to-local reasoning loop continues until the answer is grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering (QA) data for long video reasoning, we curate and will release a data suite named VideoSIAH to facilitate both training and evaluation. Our training data comprise 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning. Our evaluation benchmark consists of 1,280 QA pairs carefully curated through a semi-automatic data pipeline with human-in-the-loop validation. With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks. Our code, data, and model checkpoints are publicly available at https://github.com/EvolvingLMMs-Lab/LongVT.
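The global-to-local reasoning loop described above can be sketched as follows. This is a minimal toy illustration, not the released implementation: `sample_frames`, `MockLMM`, and the crop heuristic are all hypothetical stand-ins for the actual LMM with native temporal grounding, and the real system interleaves text reasoning with tool calls rather than using a scripted policy.

```python
# Hedged sketch of "Thinking with Long Videos": skim the whole video coarsely,
# then repeatedly crop to a promising clip and resample finer-grained frames
# until the answer can be grounded in the retrieved window.
# All names and the zoom heuristic below are illustrative assumptions.

def sample_frames(start: float, end: float, num_frames: int):
    """Uniformly sample frame timestamps in [start, end] (stand-in for decoding)."""
    step = (end - start) / max(num_frames - 1, 1)
    return [start + i * step for i in range(num_frames)]

class MockLMM:
    """Toy stand-in for an LMM used as its own video cropping tool.

    It emits a `crop` tool call while the sampled window is still too coarse,
    then answers once the evidence is densely sampled.
    """
    def __init__(self, evidence_time: float):
        self.evidence_time = evidence_time  # ground-truth moment (for the mock only)

    def step(self, frames):
        window = frames[-1] - frames[0]
        if window > 10.0:  # frames too sparse: zoom in around the best frame
            center = min(frames, key=lambda t: abs(t - self.evidence_time))
            half = window / 4  # halve the window each round
            return {"tool": "crop",
                    "start": max(frames[0], center - half),
                    "end": min(frames[-1], center + half)}
        return {"answer": f"evidence near t={self.evidence_time:.0f}s"}

def think_with_long_video(model, duration: float, num_frames: int = 8, max_rounds: int = 10):
    """Global-to-local loop: sample, reason, crop, resample, until grounded."""
    start, end = 0.0, duration
    for _ in range(max_rounds):
        frames = sample_frames(start, end, num_frames)
        out = model.step(frames)
        if "answer" in out:
            return out["answer"], (start, end)
        start, end = out["start"], out["end"]  # resample finer frames in the clip
    return None, (start, end)

# A one-hour video whose decisive clip sits near t = 123 s.
answer, window = think_with_long_video(MockLMM(evidence_time=123.0), duration=3600.0)
```

Because each crop at most halves the window while keeping the evidence inside it, the loop narrows a 3600 s video to a clip of at most 10 s within ten rounds in this sketch.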
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video Question Answering | VideoMME | -- | 99 |
| Video Question Answering | VideoMMMU | Accuracy: 45.4 | 52 |
| Video Question Answering | LVBench | Overall Score: 41.3 | 32 |
| Video Reasoning | SAGE-Bench 1.0 (test) | Overall Score: 46.7 | 29 |
| Video Understanding | LVBench | -- | 23 |
| Grounded VQA | OphVL In-domain (test) | mIoU: 42.17 | 13 |
| Temporal Grounding | SurgVidLM Out-of-domain (test) | R@0.3: 51.13 | 13 |
| Temporal Grounding | OphVL In-domain (test) | R@0.3: 48.73 | 13 |
| Grounded VQA | MedVideoCap In-domain (test) | mIoU: 55.34 | 13 |
| Grounded VQA | SurgVidLM Out-of-domain (test) | mIoU: 25.72 | 13 |