Video-QTR: Query-Driven Temporal Reasoning Framework for Lightweight Video Understanding
About
The rapid development of multimodal large language models (MLLMs) has significantly expanded the scope of visual-language reasoning, enabling unified systems to interpret and describe complex visual content. However, applying these models to long-video understanding remains computationally intensive: dense frame encoding generates excessive visual tokens, leading to high memory consumption, redundant computation, and limited scalability in real-world applications. This inefficiency highlights a key limitation of the traditional process-then-reason paradigm, which analyzes visual streams exhaustively before semantic reasoning. To address this challenge, we introduce Video-QTR (Query-Driven Temporal Reasoning), a lightweight framework that redefines video comprehension as a query-guided reasoning process. Instead of encoding every frame, Video-QTR dynamically allocates perceptual resources based on the semantic intent of the query, creating an adaptive feedback loop between reasoning and perception. Extensive experiments across multiple benchmarks, including MSVD-QA, ActivityNet-QA, MovieChat, and Video-MME, demonstrate that Video-QTR achieves state-of-the-art performance while reducing input frame consumption by up to 73%. These results confirm that query-driven temporal reasoning provides an efficient and scalable solution for video understanding.
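To illustrate the query-guided idea at a high level, here is a minimal sketch of query-driven frame selection: frames are scored against the query embedding and only the most relevant ones are kept, instead of encoding the full stream. All function names, the cosine-similarity scorer, and the stopping rule below are illustrative assumptions, not the paper's actual API or algorithm.

```python
# Illustrative sketch of query-driven frame selection (not the official
# Video-QTR implementation). Frame and query embeddings are plain lists
# of floats; a real system would use a visual encoder and text encoder.

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def select_frames(frame_embeddings, query_embedding, budget):
    """Greedily pick up to `budget` frames most relevant to the query,
    returning their indices in temporal order. This stands in for the
    perception side of the reasoning-perception loop."""
    scored = sorted(
        enumerate(frame_embeddings),
        key=lambda kv: cosine(kv[1], query_embedding),
        reverse=True,
    )
    selected = [idx for idx, _ in scored[:budget]]
    return sorted(selected)

# Toy example: 4 frames, the query "points at" the first direction,
# and a budget of 2 frames out of 4 (a 50% reduction).
frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
query = [1.0, 0.0]
print(select_frames(frames, query, budget=2))  # -> [0, 2]
```

In a full adaptive loop, the reasoner would inspect the selected frames, decide whether the query is answerable, and request more frames only if needed, which is how a frame-consumption reduction like the reported 73% becomes possible.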
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Question Answering | ActivityNet-QA (test) | Accuracy | 82.32 | 275 |
| Video Question Answering | MSVD-QA (test) | Accuracy | 87.8 | 274 |
| Video Question Answering | MovieChat-1k Breakpoint | Accuracy | 74.72 | 23 |
| Video Question Answering | MovieChat Global Breakpoint | Breakpoint Accuracy | 74.72 | 14 |
| Video Question Answering | Video-MME long overall durations | Acc (Long, w/o subs) | 66.46 | 13 |
| Video Question Answering | MovieChat Global Mode | Accuracy | 88.72 | 8 |