A.I.R.: Enabling Adaptive, Iterative, and Reasoning-based Frame Selection For Video Question Answering

About

Effectively applying Vision-Language Models (VLMs) to Video Question Answering (VideoQA) hinges on selecting a concise yet comprehensive set of frames, as processing entire videos is computationally infeasible. However, current frame selection methods face a critical trade-off. Approaches that rely on lightweight similarity models, such as CLIP, often fail to capture the nuances of complex queries; the resulting similarity scores do not reflect true query-frame relevance, which undermines frame selection. Methods that leverage a VLM for deeper analysis achieve higher accuracy but incur prohibitive computational costs. To address these limitations, we propose A.I.R., a training-free approach for Adaptive, Iterative, and Reasoning-based frame selection. A.I.R. uses a powerful VLM to perform deep semantic analysis of complex queries, and deploys this analysis within a cost-effective iterative loop that processes only a small batch of the highest-potential frames at a time. Extensive experiments on various VideoQA benchmarks demonstrate that our approach outperforms existing frame selection methods, significantly boosts the performance of the foundation VLM, and achieves substantial gains in computational efficiency over other VLM-based techniques.
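The abstract describes the iterative loop only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes that a cheap per-frame score (for example, from a CLIP-style model) is used to rank candidates, and that a VLM call can be abstracted as a yes/no relevance check. The names `iterative_frame_selection`, `cheap_scores`, `vlm_is_relevant`, `budget`, `batch_size`, and `max_iters` are all hypothetical.

```python
from typing import Callable, List, Sequence


def iterative_frame_selection(
    cheap_scores: Sequence[float],
    vlm_is_relevant: Callable[[int], bool],
    budget: int = 8,
    batch_size: int = 4,
    max_iters: int = 5,
) -> List[int]:
    """Pick up to `budget` frame indices, querying the VLM on small batches.

    Assumptions (not taken from the paper): `cheap_scores[i]` is an
    inexpensive relevance estimate for frame i, and `vlm_is_relevant(i)`
    stands in for an expensive VLM judgment on frame i.
    """
    # Rank frame indices by the cheap score, highest first.
    ranked = sorted(
        range(len(cheap_scores)), key=lambda i: cheap_scores[i], reverse=True
    )
    selected: List[int] = []
    cursor = 0
    for _ in range(max_iters):
        # Stop early once enough frames are kept or candidates run out.
        if len(selected) >= budget or cursor >= len(ranked):
            break
        # Only the next small batch of top-ranked frames reaches the VLM.
        batch = ranked[cursor:cursor + batch_size]
        cursor += batch_size
        for idx in batch:
            if len(selected) < budget and vlm_is_relevant(idx):
                selected.append(idx)
    # Return the kept frames in temporal order.
    return sorted(selected)


# Usage with a stub in place of a real VLM (odd-numbered frames are "relevant").
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7]
frames = iterative_frame_selection(
    scores, lambda i: i % 2 == 1, budget=2, batch_size=2
)
```

In this sketch the expensive check runs only on the top-ranked batch in each round and stops as soon as the frame budget is met, which is the cost-saving intuition the abstract points to.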

Yuanhao Zou, Shengji Jin, Andong Deng, Youpeng Zhao, Jun Wang, Chen Chen • 2025

Related benchmarks

Task	Dataset	Result
Video Question Answering	EgoSchema (Full)	Accuracy 63.3	256
Video Question Answering	MLVU	Accuracy 74.5	213
Video Question Answering	EgoSchema subset	Accuracy 72.2	124
Video Question Answering	LongVideoBench (val)	Accuracy 62.8	113
Video Question Answering	NextQA	Accuracy 82.6	92
Video Understanding	LongVideoBench	--	59
Video Question Answering	Video-MME	Accuracy (Average, w/o Subtitle) 68.2	48
Video Question Answering	MLVU (dev)	Accuracy 74.5	34
Video Question Answering	LVB	Accuracy 62.8	25
Video Understanding	VideoMME	Accuracy (Base) 65.6	22

Showing 10 of 13 rows
