
VideoChat-A1: Thinking with Long Videos by Chain-of-Shot Reasoning

About

Recent advances in video understanding have been driven by multimodal large language models (MLLMs). However, these MLLMs are good at analyzing short videos but struggle to understand videos with a longer context. To address this difficulty, several agent methods have been proposed that use MLLMs as agents for retrieving extra contextual knowledge in a long video. However, most existing agents ignore the key fact that a long video is composed of multiple shots: to answer a user question about a long video, it is critical to deeply understand its relevant shots, as a human would. Without such insight, these agents often mistakenly retrieve redundant or even noisy temporal context, restricting their capacity for long video understanding. To fill this gap, we propose VideoChat-A1, a novel long video agent paradigm. Unlike previous works, VideoChat-A1 can think deeply with long videos via a distinct chain-of-shot reasoning paradigm. More specifically, it progressively selects the shots relevant to the user question and looks into these shots in a coarse-to-fine partition. By multi-modal reasoning along the shot chain, VideoChat-A1 can effectively mimic the step-by-step human thinking process, allowing the interactive discovery of preferable temporal context for thoughtful understanding of long videos. Extensive experiments show that VideoChat-A1 achieves state-of-the-art performance on mainstream long video QA benchmarks: it achieves 77.0 on VideoMME (w/ subs) and 70.1 on EgoSchema, outperforming its strong baselines (e.g., InternVL2.5-8B and InternVideo2.5-8B) by up to 10.1% and 6.2%. Compared to the leading closed-source GPT-4o and Gemini 1.5 Pro, VideoChat-A1 offers competitive accuracy with only 7% of the input frames and 12% of the inference time on average. The code is available at https://github.com/SpXace/VideoChat-A1.
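The coarse-to-fine shot selection described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `Shot` class, the `chain_of_shot` function, the halving partition, and the `relevance` callback are all hypothetical names and simplifications introduced here. In the real system an MLLM would presumably judge relevance from frames and subtitles; the sketch substitutes a plain numeric scoring function so that the loop is runnable.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Shot:
    """A video segment given by start and end times in seconds (hypothetical type)."""
    start: float
    end: float

    def split(self, parts: int = 2) -> List["Shot"]:
        """Partition this shot into equally long sub-shots."""
        step = (self.end - self.start) / parts
        return [
            Shot(self.start + i * step, self.start + (i + 1) * step)
            for i in range(parts)
        ]


def chain_of_shot(
    shots: List[Shot],
    relevance: Callable[[Shot], float],
    rounds: int = 3,
    keep: int = 1,
    min_len: float = 1.0,
) -> List[List[Shot]]:
    """Build a shot chain: select the most relevant shots, then refine them.

    `relevance` is a stand-in for a model's judgment of how well a shot
    matches the user question (an assumption of this sketch).
    """
    chain: List[List[Shot]] = []
    candidates = list(shots)
    for _ in range(rounds):
        # Select step: keep the `keep` highest-scoring candidates.
        selected = sorted(candidates, key=relevance, reverse=True)[:keep]
        chain.append(selected)
        # Partition step: look into the selected shots at a finer granularity.
        candidates = [
            sub
            for shot in selected
            for sub in shot.split()
            if sub.end - sub.start >= min_len
        ]
        if not candidates:
            break
    return chain


# Toy usage: pretend the answer lies near t = 100 s in a 3-minute video.
def toy_relevance(shot: Shot) -> float:
    midpoint = (shot.start + shot.end) / 2
    return -abs(midpoint - 100.0)


demo_chain = chain_of_shot(
    [Shot(0, 60), Shot(60, 120), Shot(120, 180)], toy_relevance
)
```

Each entry of `demo_chain` is the set of shots kept at one round, so the chain narrows from a one-minute shot toward a short window around the target time while never scoring the shots discarded earlier. That is the intuition behind using fewer input frames than uniform sampling would need.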

Zikang Wang, Boyu Chen, Zhengrong Yue, Yi Wang, Yu Qiao, Limin Wang, Yali Wang • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Long Video Understanding | LongVideoBench (val) | Accuracy 77.2 | 210 |
| Video Question Answering | LongVideoBench | Accuracy 65.4 | 180 |
| Video Question Answering | EgoSchema | Accuracy 72.1 | 161 |
| Long Video Understanding | MLVU | -- | 154 |
| Video Question Answering | MLVU | Accuracy 76.2 | 143 |
| Long-form Video Understanding | LongVideoBench | Accuracy 65.4 | 115 |
| Video Understanding | Video-MME | Overall Score 69.7 | 96 |
| Long Video Understanding | MLVU (test) | Average Score 76.2 | 60 |
| Long Video QA | Video-MME | Average Score 72.9 | 41 |
| Long Video Question Answering | MLVU | M-Avg 76.2 | 39 |
Showing 10 of 15 rows
